Toward an End-to-End Auto-tuning Framework in HPC PowerStack
Xingfu Wu, Aniruddha Marathe, Siddhartha Jana, Ondrej Vysocky, Jophin John, Andrea Bartolini, Lubomir Riha, Michael Gerndt, Valerie Taylor, Sridutt Bhalachandra
Xingfu Wu
Argonne National Laboratory / The University of Chicago, USA. Email: [email protected]
Aniruddha Marathe
Lawrence Livermore National Laboratory, Livermore, CA, USA. Email: [email protected]
Siddhartha Jana
Intel Corp., USA. Email: [email protected]
Ondrej Vysocky
IT4Innovations National Supercomputing Center, Czech Republic. Email: [email protected]
Jophin John
Technical University of Munich, Munich, Germany. Email: [email protected]
Andrea Bartolini
University of Bologna, Bologna, Italy. Email: [email protected]
Lubomir Riha
IT4Innovations National Supercomputing Center, Czech Republic. Email: [email protected]
Michael Gerndt
Technical University of Munich, Munich, Germany. Email: [email protected]
Valerie Taylor
Argonne National Laboratory / The University of Chicago, USA. Email: [email protected]
Sridutt Bhalachandra
Lawrence Berkeley National Laboratory, USA. Email: [email protected]
Abstract—Efficiently utilizing procured power and optimizing the performance of scientific applications under power and energy constraints are challenging. The HPC PowerStack defines a software stack to manage the power and energy of high-performance computing systems and standardizes the interfaces between different components of the stack. This survey paper presents the findings of a working group focused on the end-to-end tuning of the PowerStack. First, we provide a background on the PowerStack layer-specific tuning efforts in terms of their high-level objectives, constraints and optimization goals, layer-specific telemetry, and control parameters, and we list the existing software solutions that address those challenges. Second, we propose the PowerStack end-to-end auto-tuning framework, identify the opportunities in co-tuning different layers in the PowerStack, and present specific use cases and solutions. Third, we discuss the research opportunities and challenges for collective auto-tuning of two or more management layers (or domains) in the PowerStack. This paper takes the first steps in identifying and aggregating the important R&D challenges in streamlining the optimization efforts across the layers of the PowerStack.
1. Introduction
As we enter the exascale computing era, power and energy management are key design points and constraints for any next generation of supercomputers [2]. Efficiently utilizing procured power and optimizing the performance of scientific applications under power and energy constraints are challenging for several reasons, including dynamic phase behavior, manufacturing variation, and increasing system-level heterogeneity. While several individual techniques have been proposed for the automatic and efficient management of power and energy, the majority of these techniques have been devised to meet the needs of a specific high-performance computing (HPC) center or specific optimization goals. Specifications such as PowerAPI [14], [15], IPMI [17], and Redfish [8] provide high-level power management interfaces for accessing power knobs. A recent survey [22] conducted by the EEHPC WG [9] concluded that the majority of such techniques lack the application awareness required to achieve the best system performance and throughput. Furthermore, each technique tends to improve the management of power and energy for a different subset of the site or system hardware and at different (and often conflicting) granularities. Unfortunately, the existing techniques have not been designed to coexist simultaneously on one site and cooperate on management in a streamlined fashion.

To address these gaps, the HPC community needs a holistic stack for power and energy management. The HPC PowerStack [2], [16] started in May 2018 as a working group to gather the experience of active developers in industry, computing centers, and academia in building software interfaces and solutions for handling and optimizing the power and energy consumption of HPC systems in production. PowerStack defines a software stack that manages the power and energy of HPC systems and standardizes the interfaces between different levels of software components in the stack. One key aspect of PowerStack is to define the vision of a holistic power and energy management stack that is extensible by design and capable of optimizing the target power- or energy-efficiency application-aware metric, so that it can trade off power, energy, and time to solution in order to optimize the efficiency of an HPC application. Its second aspect is to define a standard interface for interacting with optimization software and hardware knobs across different vendor HPC systems. Based on the state of the art of the components available in the community for power and energy management [8], [9], [14], [15], [17], [22], a hierarchical straw-man PowerStack design [1], [2] was proposed to manage power and energy at three levels of granularity: the system level, the job level, and the node level. This implies the need to put the following in place incrementally:

• Define a site-level requirement, a power-aware system Resource Manager (RM) / job scheduler, a power-aware job-level manager, and a power-aware node manager.
• Define the interfaces between these layers to translate objectives at each layer into actionable items at the adjacent lower layer.
• Drive end-to-end optimization across the different layers of the PowerStack.

To address this need, we formed a PowerStack End-to-End Auto-tuning Working Group in 2019. A plethora of literature on power-aware tuning exists, including notable works by the members of this working group.
These efforts include system and hardware tuning, tuning with runtime systems, application-level tuning, compiler-level parameter tuning, loop-level parameter tuning, and deep learning-based hyperparameter tuning.

A primary limitation of most, if not all, of these efforts is that the tuning research has been limited solely to individual layers of the PowerStack. The opportunities for further gains in power efficiency from collectively tuning two or more layers of the PowerStack have largely remained untapped. The goal of this paper is to explore those untapped opportunities by addressing the following specific questions.

• Opportunity analysis: How do we quantify the potential benefits of end-to-end auto-tuning across the different layers of the PowerStack? What experimentation is required to achieve a baseline quantification of the benefits of end-to-end auto-tuning?
• Identification of high-level challenges: What are the high-level research questions to be explored in the end-to-end auto-tuning of the PowerStack? What engineering solutions and research approaches are needed to address these questions?
• Interaction of existing layer-specific tuning: Based on the conceptual diagram of the PowerStack, what interactions are required across the layers of the PowerStack with existing layer-specific tuning approaches as a precursor to end-to-end auto-tuning?
• Extension toward end-to-end auto-tuning: How do we combine the existing auto-tuning approaches to develop comprehensive end-to-end auto-tuning solutions for the high-level power and energy goals?

To our knowledge, this paper is the first attempt at identifying and aggregating the important R&D challenges in streamlining the interoperation across the layers of the PowerStack. We present the important high-level questions, concrete ideas, and ongoing efforts discussed by the members of the PowerStack working group. The remainder of this paper is organized as follows. Section 2 surveys the layer-specific tuning efforts in terms of their high-level objectives; discusses the target metrics, layer-specific control parameters, and specific methods; and lists the existing software components. Section 3 proposes the PowerStack end-to-end auto-tuning framework, identifies the opportunities in collective tuning (henceforth co-tuning) of different layers in the PowerStack, and presents specific use cases. Section 4 identifies further opportunities and open challenges for co-tuning of two or more management layers (or domains) in the PowerStack. Section 5 summarizes our conclusions.
2. Survey of PowerStack Layer-Specific Tuning
In this section, we first describe the high-level objectives of the existing layer-specific tuning approaches at the different layers of the PowerStack: system (i.e., cluster), job/application, and node. Next, we outline the target metrics for the existing tuning approaches. We then present the layer-specific control parameters, telemetry, and specific methods used to accomplish the objectives at the individual layers.
The common objective of each layer of the PowerStack is to operate within the power constraints or energy goals assigned by the upper layer. A power constraint is applied and measured over a time window. An energy goal is assigned and measured over the job execution or system uptime. The smallest supported time window is defined by what can be supported at each layer and what is acceptable to the upper layer. Along with the primary objective of adhering to a power constraint, the following secondary metrics are targeted:

• Power-constrained performance optimization
• Performance-constrained energy optimization, RM-brokered SLA-compliant performance
• Guaranteed rate of change, or lower and upper bounds on power usage in the system in a specified time window
• Thermal-constrained performance optimization

Some of the secondary metrics monitored and affected in this process are as follows:

• System utilization, resource utilization
• Thermal metrics: ambient temperature, water temperature
• Job turnaround time, queuing delay, throughput
• Reduced memory footprint, reduced data movement and I/O footprint

TABLE 1. Survey of parameters and methods used by the layers of the PowerStack
The objectives at the different layers of the PowerStack can be realized by using measured or derived metrics at those layers as follows:

• Job-level power (watts) or energy (watt-hours or joules) usage
• Execution time (seconds/minutes/hours)
• Operating frequency (Hz)
• Performance (FLOPS, IPC, IPS)
• Power efficiency (FLOPS/watt, IPC/watt)
• Energy efficiency (ED, ED^2, FLOPS/joule, IPC/joule; the energy-delay products are defined after this list)
• Node utilization (% of time in use, % of resource in use)
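For clarity, the energy-delay metrics in the list above can be written in terms of the measured energy to solution E (joules) and time to solution T (seconds). The formulation below follows the common definition of these products; the work term W_flop is simply the total floating-point operation count.

```latex
\begin{align*}
\mathrm{ED}   &= E \cdot T,   && \text{energy-delay product (lower is better)} \\
\mathrm{ED^2} &= E \cdot T^2, && \text{weights time to solution more heavily} \\
\mathrm{FLOPS/joule} &= W_{\mathrm{flop}} / E, && \text{useful work per unit energy (higher is better)}
\end{align*}
```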
The objectives described above are realized by each layer by managing a set of available controls provided by the adjacent lower layer. The parameters are tuned through the available methods provided by the hardware or are indirectly managed by the lower layers (runtime, node-level manager, system software, etc.). We describe the parameters and methods used by the individual layers of the PowerStack at the system, job/runtime, application, and node levels, as shown in Table 1.
TABLE 2. Existing tools/solutions at each layer of the PowerStack
In Table 2 we list several existing software components at the four layers: resource manager / job scheduler, job-level runtime system, node-level management, and application-level tuning. By integrating appropriate software components from each layer, we can perform the proposed end-to-end auto-tuning in the PowerStack.
3. End-to-End Auto-Tuning Framework
Before we delve into end-to-end tuning of the PowerStack, we define the term tuning. At a high level, tuning is the process of improving the target metric through better handling of available control parameters and configuration options without violating operating constraints (if any). The process of tuning in the layers of the PowerStack (a) typically targets performance or power efficiency as the primary metric, (b) complies with the operating power constraint imposed on the layer, and (c) attempts to improve the management and orchestration of the available control parameters that affect the application and/or hardware performance. In this process, the other layers are either treated as black boxes or are ignored altogether in order to keep the research problem tractable. Extending this definition, we define co-tuning as the process of improving the target metrics of two or more layers of the PowerStack by incorporating cross-layer characteristics in the orchestration process. End-to-end auto-tuning aims to perform holistic co-tuning of all layers of the PowerStack.

In 2019, the PowerStack community sketched a schematic diagram outlining the different components of a power management stack, shown in Figure 1. A site has one or more HPC systems, site policies, and a power budget. Each system is constrained under a derived system-level power budget. For end-to-end auto-tuning in the PowerStack, we focus on tuning at the system, job/application, and node levels. We propose a high-level overview of the end-to-end auto-tuning framework (orange boxed portion) in Figure 1. We describe the knobs at each layer, which control knobs can be modified in the temporal and spatial dimensions, who controls the knobs (actors), and which metrics can be measured. We define tunable parameters at each layer and then discuss how to auto-tune the combination of different parameters at the distinct layers (the parameter space) for an optimal solution (the smallest runtime, the lowest power, or the lowest energy) under a system power cap.

Figure 1. High-level overview of the end-to-end auto-tuning framework (orange boxed portion)

The traditional PowerStack design has focused largely on the engineering challenges in standardization and deployment. In contrast, this paper focuses on extending the current design of the PowerStack to address novel research challenges in end-to-end auto-tuning of the PowerStack. Specifically, we extend the traditional PowerStack model by considering two additional, largely static layers: 1) Application: we consider the application as its own auto-tuning layer; and 2) System software: we consider system software such as the compiler toolchain, system-level dependencies such as MPI, OpenMP, and thread-management libraries, and other external entities that play an important role in realizing the PowerStack but have no direct interfaces in the traditional design.
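To make the framework concrete, the following minimal sketch shows the shape of such an end-to-end search: a cross-layer configuration space spanning resource-manager, job-runtime, and node knobs is explored for the lowest-energy configuration that respects a system power cap. All names, values, and the toy performance model inside run_job() are illustrative assumptions, not part of any existing framework.

```python
import itertools
import random

SPACE = {
    "nodes":        [16, 32, 64],        # resource-manager knob (spatial)
    "node_power_w": [200, 250, 300],     # job-level runtime knob (per-node cap)
    "cpu_freq_ghz": [1.8, 2.1, 2.4],     # node-level knob
}
SYSTEM_POWER_CAP_W = 12_000              # system-level constraint

def run_job(cfg):
    """Stand-in for launching the job and collecting telemetry."""
    work = 1.0e6                                           # abstract work units
    rate = cfg["nodes"] * cfg["cpu_freq_ghz"]              # toy performance model
    runtime_s = work / rate * random.uniform(0.95, 1.05)   # measurement noise
    avg_power_w = 0.9 * cfg["nodes"] * cfg["node_power_w"]
    return runtime_s, avg_power_w

best = None
for values in itertools.product(*SPACE.values()):
    cfg = dict(zip(SPACE, values))
    if cfg["nodes"] * cfg["node_power_w"] > SYSTEM_POWER_CAP_W:
        continue                          # infeasible under the system cap
    runtime_s, avg_power_w = run_job(cfg)
    energy_j = runtime_s * avg_power_w    # objective: lowest energy
    if best is None or energy_j < best[0]:
        best = (energy_j, cfg)

print("best energy (J):", round(best[0]), "config:", best[1])
```

In practice the exhaustive product would be replaced by a model-based or online search, and run_job() by the RM, runtime, and node interfaces surveyed in Section 2.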
In this section, we survey the opportunities in collective auto-tuning of two or more management layers in the PowerStack. The goal of this survey is to find potential areas for research and to prepare a list of broad research questions that the PowerStack community is collectively interested in tackling. An outcome of this survey would be a set of research activities that the PowerStack community can collaboratively participate in, depending on the area of expertise.

Before we discuss the co-tuning opportunities for individual layers, we define the important terms used in the rest of this paper. These terms are listed in Table 3. Figure 2 shows a high-level overview of the placement of the Resource Manager, Job, Runtime System, and Application in the PowerStack, and the interaction between the layers (orange and green lines).
The objective of this co-tuning is to explore and leverage opportunities in the simultaneous tuning of the resource manager with empirical or online knowledge of the dynamic behavior of a power-aware runtime system, subject to a target power constraint or energy efficiency metric. We consider two directions of interaction in this section: from the resource manager (RM) to the runtime system and vice versa. Several types of interactions between the RM and the runtime system can occur, as outlined below.

TABLE 3. Definitions of terms

Figure 2. Placement of Resource Manager, Job, Runtime System, and Application (orange lines)
Static interaction: These interactions define the management decisions made by the resource manager at job launch.

• How many nodes to use (for moldable jobs). The user provides a minimum and a maximum number of nodes the job can use.
• Which nodes (or compute resources) to select for job launch, for managing inefficiencies in the system such as thermal hot spots and processor manufacturing variation.
• Which job to run (or backfill) from the job queue.
• Which binary dependencies to pick given the situation on the cluster.
Dynamic interaction: These interactions define the management decisions made by the resource manager during job runtime.

• How much power to reassign to a running job (reduce or increase).
• Which job to pause (run queue) or restart (wait queue), if supported by the job.
• Whether the resource manager or the runtime system can leverage job malleability. Some resource managers and runtime systems already leverage malleability at the thread level through concurrency throttling.
• Job relaunch. Some resource managers may explore just-in-time (JIT) compilation of the application to relaunch the job, either through checkpoint-restart in the same job or through pause/resume over different allocations.

Another way to categorize the interactions between the resource manager and the runtime system is based on job awareness.

• Job-aware interactions: These are interactions between the resource manager and runtime system that take job behavior into account when applying power management decisions. The job awareness is based either on an empirical profile of the application or on runtime telemetry collected from the application.
• Job-agnostic interactions: These interactions between the resource manager and runtime system do not take job behavior into account. They are primarily interactions that are agnostic (or transparent) to the application itself.

The interaction from the resource manager to the runtime system may occur through reassignment of power controls and reporting of degradation in performance or efficiency observed at the system level with a heartbeat signal. The interaction from the runtime system to the resource manager may occur through reporting of job-level power usage, requests for additional power or returning unused power, and non-power-related controls that indirectly affect job and system power efficiency. The potential metrics by which the opportunity can be measured are system-level energy efficiency and system-level job throughput.
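As a concrete illustration of the dynamic, job-aware direction, the sketch below mimics one RM decision step: job runtimes report power telemetry, the RM reclaims slack from jobs that are not using their budget, and pending requests for additional power are granted from the reclaimed pool. The Job fields and the rebalance() heuristic are illustrative assumptions, not an existing scheduler interface.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    power_cap_w: float        # current job-level power budget
    measured_w: float         # telemetry reported by the job runtime
    requested_extra_w: float  # runtime's pending request for more power

def rebalance(jobs, slack_threshold_w=50.0):
    """One RM decision step: reclaim slack, then grant pending requests."""
    pool_w = 0.0
    for job in jobs:                      # runtime -> RM: unused power
        slack = job.power_cap_w - job.measured_w
        if slack > slack_threshold_w:
            job.power_cap_w -= slack      # lower the job's cap to its usage
            pool_w += slack
    for job in sorted(jobs, key=lambda j: -j.requested_extra_w):
        grant = min(job.requested_extra_w, pool_w)
        job.power_cap_w += grant          # RM -> runtime: new power cap
        pool_w -= grant
    return pool_w                         # power returned to the system pool

jobs = [Job("lulesh", 2000.0, 1600.0, 0.0), Job("hypre", 1500.0, 1490.0, 300.0)]
leftover_w = rebalance(jobs)
for j in jobs:
    print(f"{j.name}: cap {j.power_cap_w:.0f} W")
print(f"returned to system: {leftover_w:.0f} W")
```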
The objectives of this co-tuning are to assign resources to the job, control their consumption, select heuristics to maximize power efficiency at the system level, and select optimal application control parameters at application launch. Note that the job-level runtime is either absent or agnostic to the co-tuning of the resource manager, the application, and the system software.

The goal is to understand what opportunities exist for tuning two interactions in this co-tuning space: (a) tuning the control loop in the RM by making it aware of application characteristics through empirical data or online monitoring and (b) requesting changes to resource assignments by the application for further tuning of control parameters within the application. The objective is to maximize the performance of all applications on the system under a system-wide power limit by maximizing per-job power efficiency (minimum performance impact).

We assume that in this interaction, the control and telemetry information to and from the application is limited to application-centric data. This does not include power-related controls and telemetry as noted in the definitions, since the application itself is not involved in power management decisions. The interaction from the resource manager to the application involves the following: 1) the power and/or energy budget, if the application understands the metric; 2) power efficiency translated into an application-level metric such as watts per timestep (although this will be application-specific and will not scale to all applications); 3) the number of resources; 4) other operating constraints, from the job to the managed resource; 5) power/frequency/concurrency control; 6) application control parameters; and 7) application launch parameters. Interaction from the application to the resource manager may occur in terms of reported metrics describing application progress. The required interfaces between the resource manager and the application include power consumption monitoring, power limit specification, the expected efficiency metric to monitor from the job-level runtime to the RM, and the expected efficiency metric to monitor from the application to the job-level runtime.
The objective is to fine-tune system parameters, software, and application parameters at job launch as well as at runtime in order to maximize job power efficiency under the power budget. The runtime system may leverage JIT compilation of the application with static actions such as the amount of required computational resources.
The objective of this co-tuning space is to explore opportunities for how the resource manager, runtime system, and application (along with system software) can be co-tuned to maximize application performance under a power constraint. The interactions are the required interfaces for interaction across the three layers. The interfaces must be defined to answer the following questions: 1) What interfaces are needed to translate high-level targets at the resource manager into targets for the job-level runtime system and the application? and 2) What telemetry interfaces need to be provided from the application to the job runtime and from the runtime system to the resource manager? The discussions of the pairwise co-tuning processes described previously cover the interfaces required in this co-tuning process.

For example, a target metric of throughput under a system-level power constraint at the resource manager level needs to be translated into power efficiency targets or total runtimes of individual jobs managed by the job-level runtime system subject to a job-level power constraint. This in turn must be translated into improvements in the calculations per simulation step per watt at the application level.
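A minimal sketch of this top-down translation, under the assumption that each job's runtime reports a scalar efficiency metric (e.g., simulation steps per second per watt): the system-level budget is divided among jobs in proportion to reported efficiency, yielding the job-level power constraints that the job runtimes then enforce. The reported numbers are purely illustrative.

```python
def translate_budget(system_budget_w, job_efficiency):
    """Split a system power budget into job-level budgets in proportion
    to each job's reported efficiency (more efficient jobs get more power)."""
    total = sum(job_efficiency.values())
    return {name: system_budget_w * eff / total
            for name, eff in job_efficiency.items()}

# Hypothetical efficiency reports (steps/s/W) from the job-level runtimes.
reported = {"qmcpack": 0.8, "nekbone": 0.5, "hypre": 0.7}
print(translate_budget(20_000, reported))   # job-level caps summing to 20 kW
```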
In this section, we discuss seven specific use cases and solutions for collective tuning of the resource manager, runtime, application, and node layers.
This use case describes the co-tuning of the resource manager, runtime system, and application. The target application is a 27-point Laplacian problem implemented as part of the test program shipped with the Hypre linear solver library [11]. The control parameters exposed by Hypre are primarily at the algorithmic and subalgorithmic level. Specifically, Hypre enables the user (or the runtime system) to select the input preconditioner, linear solver, subsolver options, and postconditioner. In our experience, several thousand combinations of these options can be selected from at job launch. While tuning Hypre parameters has been an extensively researched topic, our empirical analysis showed that imposing system-level power constraints severely affects the applicability of previous tuning efforts [23]. We observed that the best-case combination of the tuning knobs for Hypre is often inefficient when subject to a hardware power constraint. Consequently, tuning the resource manager and runtime system without taking into account how the power-aware runtime system and Hypre collectively respond to the decision layers leaves performance on the table. We use the Conductor runtime system [24] to transparently optimize the job-level power budget on the allocated nodes. Conductor exposes control parameters that impact the granularity and efficiency of its power-balancing algorithm under the assigned job-level power limit.

Our target metrics are twofold: 1) improve power efficiency in terms of instructions per cycle (IPC) per watt at the runtime system level, and 2) improve job throughput at the resource manager level.
Figure 3. High-level overview of multijob GEOPM policy assignment
At the application level, we will explore several parameters, including the choice of solver algorithm, subsolver options, and data preconditioner (smoother and coarsening). At the runtime system level, we explore traditional platform knobs such as power limiting and frequency scaling. At the resource manager level, we explore the number of nodes and MPI tasks as the control parameters. The approach relies on telemetry at the different layers of the PowerStack. The primary metrics for telemetry include power usage (watts), operating frequency (GHz), and IPC. The primary challenges include managing static interactions between the layers at application launch and estimating the impact of several control parameters on the application behavior.
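The sketch below illustrates the shape of this search at the application/runtime boundary: candidate Hypre solver/preconditioner pairs are ranked by measured IPC per watt under a fixed job-level power cap. The option names are genuine Hypre components, but measure() is a stand-in for running the Laplacian test problem under Conductor and reading the hardware counters and power telemetry.

```python
import itertools
import random

SOLVERS = ["PCG", "GMRES", "BiCGSTAB"]
PRECONDITIONERS = ["BoomerAMG", "ParaSails", "Euclid"]
JOB_POWER_CAP_W = 1800

def measure(solver, precond, power_cap_w):
    """Stand-in: run the test problem with the given knobs and return
    (IPC, average watts) from counter and power telemetry."""
    ipc = random.uniform(0.8, 2.0)
    avg_power_w = random.uniform(0.7, 1.0) * power_cap_w
    return ipc, avg_power_w

results = []
for solver, precond in itertools.product(SOLVERS, PRECONDITIONERS):
    ipc, watts = measure(solver, precond, JOB_POWER_CAP_W)
    results.append((ipc / watts, solver, precond))

for eff, solver, precond in sorted(results, reverse=True)[:3]:
    print(f"{solver}+{precond}: {eff:.5f} IPC/W")   # top three combinations
```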
One of the critical challenges at HPC facilities is to limit the energy/power consumption of their in-house systems based on their contractual obligations with utility service providers [9]. In order to enhance efficiency, a system-level power-aware software agent (such as the resource manager) is needed to frequently monitor and control consumption in tandem with an application-level power management software agent (such as GEOPM [10], [12]). Figure 3 illustrates how facility-level power policies filter down to job-level granularity.

The community is working toward a prototype that facilitates a bidirectional communication channel for interoperation between the resource manager and GEOPM. The objectives are systemwide characterization of frequency, power, and thermal variation across the system plus node outlier detection; node energy savings or FLOPS improvements via adapting CPU/GPU power management controls according to application phases or characteristics; overall application energy savings or runtime reductions via steering power between nodes according to load imbalance patterns; and enabling other tools or resource managers to perform system-wide power/performance management by leveraging GEOPM.

GEOPM has three main modes of community site-level policies.

1) Enforcing a static preconfigured sitewide policy: This method relies on preconfigured policy configuration files passed to GEOPM during job launch. These files can either be baked directly into the node boot image on a pseudo-filesystem or passed along to the job launcher. Adding extra awareness of these files within the RM enables it to enforce power management control with or without GEOPM.

2) Enforcing job-specific policies: This method is suitable for HPC sites that frequently run a finite set of applications with historic profile information about their power/energy/thermal characteristics. Such sites typically maintain a database that maps applications to specific policy parameters (e.g., fixed frequency or power cap).

3) Fully dynamic policy: This ongoing R&D work relies on the entire power management stack participating in dynamic cooperation between the electric grid supply, the resource manager, and the job-level runtime (GEOPM). This mode enables job-specific hardware resources to react in accordance with the instantaneous system-level power/energy requirements realized at the site level.

Interfaces to system-level agents: The communication channel/interface can take the form of environment variables, preconfiguration files, job-launcher command line options, or shared memory. Shared memory facilitates efficient interaction and is referred to as an endpoint, which acts as a gateway between a persistent compute node daemon (such as SLURM) and an application power-management daemon (such as the GEOPM root controller). GEOPM provides a plugin-based interface that enables users to plug and play their own algorithms of choice. By default, a typical GEOPM installation comes prepackaged with five different algorithms that correspond to the most common policies among HPC sites: energy efficiency under a performance degradation threshold, power load balancing based on the average node power cap, static frequency assignment for the entire lifetime of the application, static power cap assignment for the entire lifetime of the application, and monitoring of application energy/power metrics.
The open design challenges are fault tolerance, security, multitenancy, interaction with other OS/user-level/RAS daemons, and handling of conflicting power management mechanisms on the same system.
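As an illustration of the second (job-specific) policy mode, the sketch below shows a site policy database that the RM could consult at job launch to emit a policy file for GEOPM. The table contents, key names, and JSON encoding are assumptions for illustration only; a real deployment would follow GEOPM's own policy file format and agent parameters.

```python
import json

# Hypothetical site database mapping known applications to policy parameters.
SITE_POLICY_DB = {
    "lammps":  {"CPU_FREQ_MAX": 2.1e9},        # fixed-frequency policy (Hz)
    "qmcpack": {"CPU_POWER_LIMIT": 180.0},     # per-node power cap (W)
}
DEFAULT_POLICY = {"CPU_POWER_LIMIT": 200.0}    # fallback for unknown codes

def write_policy_file(app_name, path):
    """Look up the job's policy and write it where the launcher can find it."""
    policy = SITE_POLICY_DB.get(app_name, DEFAULT_POLICY)
    with open(path, "w") as f:
        json.dump(policy, f)
    return policy

print(write_policy_file("qmcpack", "geopm_policy.json"))
```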
In the auto-tuning ytopt project [31], we use the new Clang loop tiling, interchange, packing, and/or jam pragmas as examples to illustrate the process of auto-tuning the pragma parameters to achieve optimal performance. Figure 4 shows our auto-tuning framework.

Figure 4. Auto-tuning application and runtime parameter selection

We analyze an application code to identify the important parameters to focus on, then replace these parameters (application parameters, loop transformation parameters) with symbols that the auto-tuner substitutes with candidate values.
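The sketch below illustrates the substitution step under stated assumptions: pragma parameters in the source are replaced with symbolic placeholders (#P0, #P1, ...), each sampled configuration is substituted into a copy of the code, and the resulting variant is compiled and timed. The pragma text, the symbol naming, the random search (ytopt itself supports model-based search), and compile_and_time() are all illustrative.

```python
import random

TEMPLATE = """
#pragma clang loop tile sizes(#P0,#P1)
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    C[i][j] = A[i][j] + B[i][j];
"""

SPACE = {"#P0": [4, 8, 16, 32], "#P1": [4, 8, 16, 32]}

def compile_and_time(source):
    """Stand-in for compiling the substituted variant and timing one run."""
    return random.uniform(1.0, 5.0)

best_time, best_cfg = float("inf"), None
for _ in range(20):                              # random search over the space
    cfg = {sym: random.choice(vals) for sym, vals in SPACE.items()}
    source = TEMPLATE
    for sym, val in cfg.items():
        source = source.replace(sym, str(val))   # substitute symbols
    t = compile_and_time(source)
    if t < best_time:
        best_time, best_cfg = t, cfg

print(f"best variant {best_cfg} in {best_time:.2f} s")
```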
The Horizon 2020 project Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing (READEX) [28] provided an idea and a tool suite for splitting a parallel application into regions with different resource requirements and dynamically tuning the hardware parameters to suit the needs of each region. Besides hardware and system parameter tuning, the READEX tool suite supports static and dynamic tuning of application parameters. For this purpose, a static application configuration tuning (ACP) plugin and a dynamic application tuning parameters (ATP) plugin [21] for the Periscope Tuning Framework [13] were developed. Static tuning requires the identification of parameters in the application's launch configuration file, which changes at the beginning of each run. Dynamic tuning requires instrumenting the application with the ATP API to identify the tuned parameters. Each parameter's value is then changed at every x-th iteration of the loop in which the instrumentation resides, which requires support on the application side to handle such modification during runtime. The key input information for both static and dynamic application parameter tuning is not only a list of parameter values to set but also dependency conditions that express which combinations of parameters are not allowed.

The ESPRESO FETI solver [27] was tuned by using the READEX tools for an optimal hardware configuration, as well as by using the ATP to find an optimal solver, preconditioner, and domain size, all of which have a major impact on the time to solution of the given problem. The application was instrumented with a set of regions, as presented in Figure 5. This optimization led to major runtime and energy savings as well as to improved scalability [21], [27].

Figure 5. Graph of the ESPRESO FETI solver regions instrumented in the source code

The complete READEX tool suite searches for the optimal configuration automatically using one of many supported algorithms for the state-space search, which can potentially be very large. However, the challenge of this approach lies in instrumenting the application and its configuration file with valid and complete dependency conditions for the application parameters.

The site-level power budget (power corridor) is usually enforced by using job cancellation, idle node shutdown, power capping, and dynamic voltage and frequency scaling (DVFS). A new proactive power corridor management strategy was developed with our dynamic resource management infrastructure [7], [18], comprising an Invasive Resource Manager (IRM) and the Invasive Message Passing Interface (IMPI). The core research challenge is power management using dynamic resource redistribution among applications.

The power usage of running applications is predicted and analyzed for power budget violations. If a violation is predicted, steps are taken to formulate a new resource redistribution heuristic to satisfy the site-level power constraints. As shown in Figure 6, the node distribution is dynamically changed by the IRM to maintain the power budget. This dynamic strategy can also cater to dynamic power budgets arising from renewable energy sources.

Figure 6. Dynamic resource redistribution to enforce the power corridor

We also constructed a programming paradigm called Elastic Phase Oriented Programming (EPOP) [19] on top of IMPI for developing dynamic applications and for co-tuning.
EPOP measures the power as well as performance characteristics of the application and communicates with the IRM upon request. Using EPOP, the programmer can explicitly inform the IRM about the application phases where resource redistribution is or is not needed. A dynamic resource manager also requires knowledge of application constraints (for example, the requirement of a cubic number of processes in LULESH) for efficient resource redistribution. Hence, co-tuning among resource managers, runtime systems, and applications is necessary in order to exploit this dynamism. A lack of running dynamic applications and of application-specific constraints can hinder the calculation of a successful resource redistribution. Power-aware scheduling with other strategies such as power capping and DVFS can address this shortcoming.
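A minimal sketch of the prediction-and-redistribution loop described above, with illustrative numbers: if the predicted cluster power for the next interval exceeds the corridor's upper bound, the manager shrinks malleable jobs (as the IRM would via IMPI) until the prediction fits, falling back to other mechanisms when no malleable job remains.

```python
CORRIDOR_W = (8_000, 10_000)   # (lower, upper) site power bounds

jobs = [  # name, current nodes, predicted watts per node, malleable?
    {"name": "lulesh",  "nodes": 40, "w_per_node": 220, "malleable": True},
    {"name": "nekbone", "nodes": 16, "w_per_node": 180, "malleable": False},
]

def total_power(jobs):
    return sum(j["nodes"] * j["w_per_node"] for j in jobs)

def enforce_corridor(jobs):
    """Shrink malleable jobs until the predicted power fits the corridor.
    (Raising power toward the lower bound is omitted for brevity.)"""
    _, upper_w = CORRIDOR_W
    while total_power(jobs) > upper_w:             # predicted violation
        candidates = [j for j in jobs if j["malleable"] and j["nodes"] > 1]
        if not candidates:
            break                                   # fall back to power capping
        victim = max(candidates, key=lambda j: j["w_per_node"])
        victim["nodes"] -= 1                        # IRM resizes the job via IMPI
    return total_power(jobs)

print("predicted power after redistribution:", enforce_corridor(jobs), "W")
```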
COUNTDOWN is a runtime library for performance-neutral energy saving in MPI applications developed by the University of Bologna and CINECA [5], [6]. The energy saving is obtained transparently to the user, without requiring application code modifications or recompilation of the application. The target metric is energy efficiency with no performance degradation, achieved seamlessly by leveraging MPI communication phases. We consider the following metrics: runtime, power consumption, and energy consumption. The control parameters/units are the types and sections of the MPI calls in which to reduce the power consumption and CPU performance. The COUNTDOWN runtime intercepts those calls and separates waiting time from copy time and computing time. The COUNTDOWN configuration can be set at the beginning of a job run to (i) profile only the MPI communication regions, (ii) reduce the power consumption during MPI wait and copy time, or (iii) reduce the power consumption during MPI wait time only. The power reduction is achieved by interacting with the node level, while at the system level the resource manager interacts with the COUNTDOWN configuration to select the level of aggressiveness (i.e., energy saving). This enables co-tuning of the application and of communication-phase power management.

The challenges and open issues are the dynamic adaptation of the COUNTDOWN configuration during job execution, its integration with the job scheduler, and the estimation during job execution of the accumulated time-to-solution overhead.
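The essence of the mechanism can be sketched as a wrapper around blocking MPI calls: drop the core frequency on entry to a communication phase and restore it on exit. This is only a schematic in Python with a stand-in knob; the real COUNTDOWN library intercepts MPI at link/load time in C, distinguishes wait time from copy time, and requires no source changes.

```python
from functools import wraps

MIN_FREQ_KHZ, MAX_FREQ_KHZ = 1_200_000, 2_400_000

def set_cpu_freq(khz):
    """Stand-in for the actual frequency knob (MSR/sysfs/cpufreq write)."""
    print(f"[knob] cpu frequency -> {khz} kHz")

def countdown(mpi_call):
    """Wrap a blocking MPI call in a frequency-reduction window."""
    @wraps(mpi_call)
    def wrapper(*args, **kwargs):
        set_cpu_freq(MIN_FREQ_KHZ)       # entering a wait phase: slow down
        try:
            return mpi_call(*args, **kwargs)
        finally:
            set_cpu_freq(MAX_FREQ_KHZ)   # back to computation: full speed
    return wrapper

@countdown
def fake_barrier():                      # stand-in for a blocking MPI_Barrier
    print("[mpi] barrier")

fake_barrier()
```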
This use case uses two runtime systems at the same time to optimize the hardware parameters for the same metric. COUNTDOWN is a powerful tool for optimizing the communication phases of MPI applications. However, it does not take into account other application characteristics, such as I/O-bound, memory-bound, and compute-bound regions of the code, that may provide major room for exploiting savings [20]. The MERIC library [29] implements the READEX approach: it tunes the application based on its instrumentation, which should cover the whole runtime of the application, and it provides a specific tuned-parameter configuration for each instrumented region. The instrumentation can be inserted manually into the source code or automatically into the binary. Since we usually instrument only regions for which we can obtain at least 100 power samples for a reliable energy consumption measurement (e.g., for RAPL this means a minimum region size of 100 ms), neither manual nor automatic instrumentation is likely to provide annotation as fine-grained as the application's communication phases.

Both tools are under active stand-alone development. The challenge is to implement a communication layer that allows these tools to work in synergy, guaranteeing that both tools retain knowledge of which tool is in charge and what the current and future hardware settings are, without creating conflicts. This is a work-in-progress idea.
4. Research Opportunities and Challenges
In this section, we identify the important unsolved challenges in collective auto-tuning of two or more management layers (or domains) in the PowerStack. Identifying such challenges will provide a platform for collaborative research on co-tuning of different layers in the PowerStack. For co-tuning the resource manager and runtime system, the objective is to define interactions (static and dynamic) based on the type of job (moldable/malleable vs. non-moldable/malleable) and job-agnostic interactions that are common to these two types of jobs. For co-tuning the resource manager and application, the objective is to target how the application can fully utilize the allocated resources for efficient execution while complying with the job-level power budget. For co-tuning the runtime system and application, the objective is to fine-tune system parameters, software, and application parameters at runtime under the job-level power budget. For co-tuning the resource manager, runtime system, and application, the objective is to explore ways in which the application can fully utilize the dynamically allocated resources for efficient execution while also maximizing job throughput under the system power limit. When we consider co-tuning all four layers, we have to carefully analyze the tunable parameters from each layer and identify how they impact each other across layers, in order to find the best combination for optimal performance under a power budget.

We identify four areas where further research effort is required to extend end-to-end auto-tuning for the PowerStack:
Collective tuning of PowerStack layers has remained a challenging task because of the lack of interfaces for (1) translating high-level goals into subsequent lower-level goals, (2) translating monitored metrics at lower layers into derived metrics at higher layers, and (3) enabling a custom configuration (e.g., resource reallocation and remapping) at application launch and during runtime. Developing such interfaces will enable the platform to be used to explore several research directions for the reallocation of resources to job and sub-job components. In this research direction, the following research questions should be explored.

• What target metrics are important to the individual layers of the PowerStack? How can the target metrics be translated into metrics understood by the lower layers of the stack, assuming top-down control transfer?
• What are the different approaches to quantify the potential for performance improvement while tuning resource allocation and mapping across the stack? Potential approaches include exhaustive empirical exploration, model-based estimation, and emulation.
• What customization is required in the software and hardware components participating in resource reallocation? For example, for the reallocation of resources to a running job, what features must be supported by the layers of the PowerStack to support this flexibility?
• What operating challenges would be incurred when performing resource reallocation? For example, performing job reallocation while maintaining a minimum power draw across the system may be critical in order to comply with site-specific policies and will require additional control logic to be in place at several layers of the PowerStack.
• Can we leverage technologies such as containers and virtual machines to address some of the challenges in resource reallocation?
• Which layers of the PowerStack can leverage job moldability and malleability? At what point in the application run should this be done? What interfaces are provided to utilize moldability and/or malleability?

4.2. Offline/Static Co-tuning
The PowerStack typically operates with the software components and target binaries provided to it to solve a target problem. Several software components not directly included in the formal definition of the PowerStack play an important role in the eventual outcome of the PowerStack. Such components include compiler toolchains and the optimization features they provide, variants of commonly used libraries (e.g., implementations of MPI, OpenMP, and thread-management libraries), and input decks of the target application. These indirectly affect the efficiency of the individual layers of the PowerStack as well as the efficiency of the science being performed on the system. We identify the following open research questions in this topic.

• Can we quantify the impact of different compiler optimization flags on one or more target metrics of the layers of the PowerStack and the application? (A small sketch follows this list.)
• Can we inform the compiler toolchain about the runtime (or online) situation on the system, including resource constraints, the choice of runtime algorithms, and hardware characteristics, while applying optimization techniques?
• Can we quantify the impact of using several variants of the application dependencies (outside of the PowerStack) on the efficiency of the PowerStack?
• Can we identify correlations between black-box and/or white-box characteristics of these dependencies and the efficiency metrics relevant to the PowerStack?
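For the first question above, a baseline quantification can be as simple as sweeping flag sets and recording time and energy per build variant, as sketched below. build_and_run() is a stand-in for recompiling the application and reading an energy counter (e.g., RAPL) around a run; the flag sets and numbers are illustrative.

```python
import random

FLAG_SETS = ["-O2", "-O3", "-O3 -march=native", "-O3 -flto"]

def build_and_run(flags):
    """Stand-in: rebuild with `flags`, run once, return (seconds, joules)."""
    runtime_s = random.uniform(80, 120)
    energy_j = runtime_s * random.uniform(150, 250)  # ~average node power
    return runtime_s, energy_j

for flags in FLAG_SETS:
    t, e = build_and_run(flags)
    print(f"{flags:20s} time {t:6.1f} s   energy {e/1000:6.1f} kJ")
```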
Hardware overprovisioning has been suggested as a viable approach to address the challenges associated with site-wide or cluster-level power constraints [25]. Since an overprovisioned cluster has more compute and storage devices than can be powered up at any given time, the problem of selecting which components to power up and how to operate them becomes challenging. With heterogeneous computing platforms on the rise to meet the demands of exascale performance, this problem becomes even more challenging. Developing solutions to tackle this problem, which are often site-specific, will require exploring the following research questions.

• How can one quantify the trade-off between the number of compute devices on the system and system-level efficiency? (A sketch follows this list.)
• How can one quantify the trade-off between designing software approaches that efficiently use fewer or more compute resources and system efficiency?
• How can one correlate software-level features with efficient usage of compute resources on the system under a power constraint?
• What are the existing discrepancies between the interfaces required to drive runtime management of online or offline compute devices and the existing interfaces to manage those devices?
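The following sketch frames the first trade-off under stated assumptions (an Amdahl-style scaling model and a square-root frequency/power relation, both purely illustrative): given a fixed cluster power cap on an overprovisioned system, sweep the number of powered-up nodes and pick the count that maximizes modeled performance.

```python
CLUSTER_CAP_W = 50_000
NODE_MIN_W, NODE_MAX_W = 150, 300        # valid per-node operating range
PARALLEL_FRACTION = 0.99                 # assumed Amdahl parallel fraction

def relative_perf(nodes):
    node_w = min(NODE_MAX_W, CLUSTER_CAP_W / nodes)
    if node_w < NODE_MIN_W:
        return 0.0                       # cannot power up that many nodes
    speedup = 1.0 / ((1 - PARALLEL_FRACTION) + PARALLEL_FRACTION / nodes)
    freq_factor = (node_w / NODE_MAX_W) ** 0.5   # assumed freq ~ sqrt(power)
    return speedup * freq_factor

best_n = max(range(1, 400), key=relative_perf)
print(best_n, "nodes, relative speedup", round(relative_perf(best_n), 1))
```

Under these assumptions the sweet spot lands well below the maximum node count the cap could power, because the per-node slowdown eventually outweighs the scaling gain; real answers are site- and workload-specific.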
Several research challenges exist in fine-tuning the control parameters exposed by the different software components in the PowerStack. The key motivation behind such tuning is that individual layers in the stack are tuned based on a local view of those layers, without taking into account the impact of the tuning decisions outside of those layers. We identify the following research questions in this space.

• How do the existing control parameters exposed by the different layers of the PowerStack and the application interact with each other on the scale of power efficiency? What are the approaches to quantify that interaction and describe the potential benefits of such exploration?
• How can one find the correlation between the characteristics of the allocation and management algorithms used by the layers of the PowerStack and how those algorithms interact with the application? Can we augment this analysis with application-level algorithms and sub-algorithmic controls exposed by the developer?
• Can we extend the algorithms at the different layers of the PowerStack to incorporate semantic information from the application (e.g., the state of the molecular dynamics simulation at each time step)?
5. Conclusions
The PowerStack community effort has demonstrated the need for a holistic software stack for power and energy management and for standardized interfaces between the layers of the PowerStack [2]. In this paper, we presented a survey of existing efforts and the important open challenges to enable end-to-end auto-tuning in the PowerStack. First, we surveyed the layer-specific tuning efforts, describing the high-level objectives, target metrics, layer-specific control parameters, and methods, and we listed the existing software components. Then, we proposed the PowerStack end-to-end auto-tuning framework, identified the opportunities in co-tuning different layers in the PowerStack, and presented seven use cases. We also presented our vision of the research opportunities and challenges for collective auto-tuning of two or more management layers (or domains) in the PowerStack. This paper takes the first step in identifying and aggregating all the R&D challenges related to interoperation among multiple layers of the PowerStack. As part of future work, we invite participation in the collaborative effort to develop the end-to-end auto-tuning framework and to integrate the individual research activities, in order to realize streamlined auto-tuning of all layers of the PowerStack.

Acknowledgments
This work was supported in part by LDRD funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under contract DE-AC02-06CH11357, and in part by NSF grant CCF-1801856.
References