[PDF] MicroGrad: A Centralized Framework for Workload Cloning and Stress Testing

Abstract

We present MicroGrad, a centralized automated framework that is able to efficiently analyze the capabilities, limits and sensitivities of complex modern processors in the face of constantly evolving application domains. MicroGrad uses Microprobe, a flexible code generation framework as its back-end and a Gradient Descent based tuning mechanism to efficiently enable the evolution of the test cases to suit tasks such as Workload Cloning and Stress Testing. MicroGrad can interface with a variety of execution infrastructure such as performance and power simulators as well as native hardware. Further, the modular 'abstract workload model' approach to building MicroGrad allows it to be easily extended for further use. In this paper, we evaluate MicroGrad over different use cases and architectures and showcase that MicroGrad can achieve greater than 99\% accuracy across different tasks within few tuning epochs and low resource requirements. We also observe that MicroGrad's accuracy is 25 to 30\% higher than competing techniques. At the same time, it is 1.5x to 2.5x faster or would consume 35 to 60\% less compute resources (depending on implementation) over alternate mechanisms. Overall, MicroGrad's fast, resource efficient and accurate test case generation capability allow it to perform rapid evaluation of complex processors.

Full PDF

MMicroGrad: A Centralized Framework for WorkloadCloning and Stress Testing

Gokul Subramanian Ravi

ECE, UW-Madison [email protected]

Ramon Bertran

IBM Research [email protected]

Pradip Bose

IBM Research [email protected]

Mikko Lipasti

ECE, UW-Madison [email protected]

Abstract —We present

MicroGrad , a centralized automatedframework that is able to efﬁciently analyze the capabilities,limits and sensitivities of complex modern processors in theface of constantly evolving application domains.

MicroGrad usesMicroprobe, a ﬂexible code generation framework as its back-endand a Gradient Descent based tuning mechanism to efﬁcientlyenable the evolution of the test cases to suit tasks such as Work-load Cloning and Stress Testing.

MicroGrad can interface with avariety of execution infrastructure such as performance/powersimulators as well as native hardware. Further, the modularabstract workload model approach to building

MicroGrad allowsit to be easily extended for further use.In this paper, we evaluate

MicroGrad over different use casesand architectures and showcase that

MicroGrad can achievegreater than 99% accuracy across different tasks within fewtuning epochs and low resource requirements. We also observethat

MicroGrad’s accuracy is 25-30% higher than competing tech-niques. At the same time, it is 1.5-2.5x faster or would consume35-60% less compute resources (depending on implementation)over alternate mechanisms. Overall,

MicroGrad ’s fast, resourceefﬁcient and accurate test case generation capability allow it toperform rapid evaluation of complex processors.

I. I

NTRODUCTION

Analyzing the capabilities, limits and sensitivities of com-plex modern processors in the face of constantly evolvingapplication domains is arduous and time consuming. Intelli-gently generating test cases which can efﬁciently perform theabove analyses will enable quick turnaround times, therebyaccelerating the ﬁnal third of the Innovate-Build-Analyzecycle.We are particularly interested in two challenging tasks underthis umbrella of test case generation:1)

Workload Cloning: which extracts key execution char-acteristics of a real world application and models theminto a synthetic workload.2)

Stress Testing: that maximizes micro-architectural activ-ity of a given processor, speciﬁcally to achieve worst-case estimates of execution metrics like performance andpower.We present

MicroGrad , an open-source centralized auto-mated framework that is able to efﬁciently analyze processorsbased on the scenarios described above.

MicroGrad derivesits name from 1

Microprobe [5]: a ﬂexible code genera-tion framework which forms

MicroGrad ’s back-end and 2

MicroGrad Tool: https://github.com/rgokulsm/MicroGrad

Gradient Descent: which is the tuning mechanism used by

MicroGrad to efﬁciently enable the evolution of the test casesto suit the tasks described earlier.In the past, there has been considerable work in the domainsof workload cloning [1]–[3], [13], [15], [16] and stress test-ing [5], [8]–[10], [14]. Despite this, open source frameworksfor these goals have been scarce. Meanwhile, the need forthese tools is rapidly increasing with the momentum for opensource hardware. As the open source space grows, we willrequire systematic tools to characterize and stress-test theabundant varied designs and implementations.To our knowledge, two open-source frameworks available inthis space are: Microprobe [5] which can generate user-deﬁnedtest cases, and GeST [10] which uses Genetic Algorithm(GA) based evolution on an instruction-level model to generatestress tests.

MicroGrad goes above and beyond the capabilitiesand use cases of the above, by providing a fast automatedframework for a variety of purposes, all generated with acommon centralized tuning mechanism and code generationback-end. Further,

MicroGrad can interface with a variety ofexecution platforms such as performance/power simulators aswell as native hardware, in order to evaluate the processorarchitecture’s execution efﬁciency. Importantly, the modular”abstract workload model” approach to building

MicroGrad allows it to be easily developed upon - allowing for new usecases, improved tuning algorithms, as well as easy interfacingwith new execution hardware and simulators. An overview of

MicroGrad is shown in Fig.1.To the best of our knowledge, all prior approaches to cloningand stress testing have been either GA-based or expert driven.Thus, the

Gradient Descent based tuning mechanism is a keynovelty and highlight in the

MicroGrad framework. The tuningmechanism is implemented over a gradient descent algorithm,which iterates through a sequence of ”workload generationknobs” conﬁgurations (i.e. inputs to the

Microprobe frame-work) and evaluates a speciﬁed processor execution metric forthose conﬁgurations. It gradually moves the code generationconﬁguration in the direction of the steepest execution metricgradient, i.e. one which achieves the best metric improvementfor every step change in the conﬁguration, until the optimumconﬁguration / convergence is reached. Note that the executionmetric is dependent on the use case - it could be a single high-level statistic like IPC or power consumption in the case ofStress Testing or a combination of both high-level and low- a r X i v : . [ c s . A R ] S e p evel statistics such as branch mispredictions, cache miss ratesand IPC for Workload Cloning. Overall, the tuning mechanismallows for fast and efﬁcient convergence to the prescribed goaland is observed to considerably outperform competing tuningapproaches. Moreover, with its abstracted model, it is easierto deploy compared to expert-driven approaches. MicroGrad

Outputs • Clone / Stress-test binary • Workload characteristics • Metrics (eg. worst case) • Epoch progression

Inputs • Application / Simpoint • Metrics of Interest • Target evaluation platform • Accuracy requirements

Evaluation • Performance Simulator (eg. Gem5) • Power Estimator (eg. McPAT) • Native Hardware

Tuning Mechanism • Grad. Descent (or other) • Use-case loss function

MicroProbe • Test Construction • Code Generation

Knobs

Fig. 1: MicroGrad Overview

Summary of contributions:

1) We present

MicroGrad , an open-source automatedframework for workload cloning and stress testing. Toour knowledge,

MicroGrad is the ﬁrst open tool forautomated cloning and further, the only open tool forfast-exploratory stress testing with an abstract workloadmodel.2)

MicroGrad is the ﬁrst proposal to perform intelligenttest generation via a Gradient Descent based tuningmechanism, which is shown to outperform other tuningmechanisms and is easier to deploy than expert-drivenapproaches.3)

MicroGrad extends the potential for the Microprobeframework which has a wealth of features for codegeneration.4) The modular and abstracted approach to building

Micro-Grad allows the seamless integration of new use cases,execution platforms and tuning algorithms.II. B

ACKGROUND

A. Workload Cloning

There are multiple challenges with using real-world ap-plications for architecture benchmarking, such as intellectualproperty hurdles, effort involved in porting the application tosuit the execution framework, as well as long run times. Whilethe advent of standardized benchmark suites have improved thetesting ecosystem, there are still several time/resource chal-lenges especially posed to architecture simulation in academicresearch as well as industry product development. Simulationtimes are often intractable, even on today’s most efﬁcientsimulators running on the fastest processing systems.Workload Cloning is a general technique to mimic real-world applications or benchmarks via miniature synthetic workloads and has been pursued in multiple prior works [1]–[3], [13], [15], [16]. The technique distills key behavioralcharacteristics of the original application/benchmark and mod-els them into a synthetic workload. The resultant workloadabstracts away any proprietary application characteristics, itis usually signiﬁcantly shorter in execution time and it can besuitable compiled to make it amenable to both native hardwareas well as simulation frameworks. The integral components ofthe Cloning workﬂow are discussed below.

1) Application Characteristics:

Speciﬁc characteristics ofan application are captured and used to generate the syn-thetic workload. These are characteristics which inﬂuencethe instruction distribution, control ﬂow as well as memorypatterns of the application. These characteristics can be dividedinto microarchitecture-independent and microarchitecture-dependent. The former includes instruction distributions, reg-ister dependency distance etc. and memory footprints whilethe latter includes branch misprediction and cache miss ratesand others. While some prior works [3] have used bothmicroarchitecture dependent and independent characteristics inconjunction, others have used a wider range of solely microar-chitecture independent characteristics [16], but which are thensigniﬁcantly impacted by compiler optimizations. In this workwe use the former i.e. a combination of both microarchitecturedependent as well as independent characteristics which allowoptimal capturing of both static and dynamic characteristicsof an application on a speciﬁc processor architecture.

2) Target Metrics:

The generated clones are expected toaccurately meet speciﬁc target metrics. A full system designermight require the clone to mimic the real application interms of low-level target metrics such as L1/L2/TLB missrates, branch misprediction rates, register usage, instructiondistribution as well as high-level target metrics like IPC,power, energy or thermal characteristics. In this paper our toolevaluation focuses on cache miss rates, branch mispredictions,instruction distributions, IPC and Power.

3) Generation Mechanism:

Prior clone generation mecha-nisms have comprised of a number of steps, each attemptingto feed speciﬁc application/benchmark statistics into a model,so as to attempt to generate the required characteristics inthe application. These steps include: generating the syntheticworkload spine using instruction distribution, memory ac-cess pattern modeling, branch predictability modeling, registerassignment, and ﬁnally code generation [3]. While thesesteps might individually achieve satisfactory accuracy for theirlow-level target metric (such as branch misprediction rate),performing them in a sequential manner (a sort of greedyapproach) means that there is limited control over other high-level target metrics of interest like IPC.In our work, we take a more synergic approach to clonegeneration. By estimating gradients and following the steepestcurves, our tuning mechanism is able to inherently sacriﬁcethe accuracy on some speciﬁc low-level target metric (forexample, L2 cache miss rate) if required, if it aids in optimalachievement of other low-level and high-level target metrics,thereby creating a clone with higher ﬁdelity. Further, our2pproach allows a ﬂexible generation time vs. cloning accuracytradeoff. For instance, a 95% accuracy 1-metric target wouldtake considerably less generation time in comparison to a 99%accuracy N-metric target.

B. Stress testing

Benchmark suites are usually built to represent the nominalbehavior of real world applications and not to mimic worst-case scenarios. However, worst-case scenarios in terms of mi-croarchitectural activity, heat dissipation, power consumptionand voltage noise [4], [8], [9], [14], [17], [18] are critical tounderstand the limits and sensitivities of current generationprocessors, so that future systems can migrate to the mostpromising regions of the microarchitectural design space.These worst case scenarios are closely tied to the microar-chitecture, and must be created in accordance. Thus, stresstests are used to create these worst-case scenarios for a giventarget execution metric and a speciﬁc processor microarchitec-ture. Considering the complexities and non-linear relationshipswithin the modern processor, manually crafting such stresstests is usually time consuming and tedious. Consequently,automating their generation is of critical importance for a rapiddesign cycle.

1) Generation Model:

As highlighted in prior work [10],there are two prominent design models for stress-test gener-ation: a) based on an abstract-workload model and b) basedon instruction-level primitives. In the abstract-model [8], [9],[14] the stress test generation process involves tuning a vectorof workload generation parameters/knobs such as instructionmix, register dependency distance, memory footprint / stridepatterns and branch transition patterns. The vector is thenused to generate the assembly (or high level language) codeOn the other hand, for the instruction-level frameworks [10],[17], [18], the tuning is performed directly on the instructionassembly, with per-instruction control.The key advantage with the abstract workload model isthat knobs are well deﬁned, can be selected to be only afew in number, and can potentially have exclusive mapping toparticular execution characteristics, signiﬁcantly reducing thecomplexity of the tuning required to achieve the maximumstress. The advantage of the instruction-level model is thatit provides deterministic and ﬁner granularity of control i.e.on a per-instruction basis. In this work, we adopt the abstractworkload model, which provides suitable abstractions to allowfor a more modular framework suited to multiple use cases,evaluation frameworks and tuning algorithms.

2) Tuning Mechanism:

The role of the tuning mechanismis to nudge the generated test case towards maximum stress(as per the speciﬁed stress metric). Prior works built onboth the generation models described above have predom-inantly utilized genetic algorithm (GA) based tuning. GAstune towards a target metric by applying operators inspired bynatural evolution. These operators include: selection of ﬁttestindividuals, crossover of features, mutation and guaranteed andelitism prioritization [8]–[10], [14]. The GA parameters usedby prior work [10] are shown in Table I. To our knowledge,

Parameter Value

Population Size 50Individual Size (

TABLE I: GA parameters

MicroGrad is the only stress test generation scheme to strayaway from GA based tuning. We ﬁnd that a gradient descentbased tuning approach, with stochastic randomness to jumpout of local minimas, as well as adaptive step sizes (larger tosmaller over time), enable considerably faster (i.e. less numberof tuning epochs) and more accurate convergence comparedto the GA based approach. For the abstract workload modelspeciﬁcally, our insight is that important GA operators likecrossover are rather ineffective, while they are much morevaluable in an instruction-level model. On the other hand, thegradient descent approach of following the steepest path tomaximizing the metric of interest is very effective when localminimas can be avoided.It is also interesting to note that the compute cost for a GDbased tuning epoch is proportional to the number of knobsof interest, which could be low in the context of many usecases. On the other hand, the compute cost in a GA epochis proportional to the population size, which is often ﬁxedthroughout and therefore usually conservative. Thus everyGD epoch oj,is often faster and/or consumes less computeresources in comparison to the GA approach. Our resultsdemonstrate up to a 2.5x beneﬁt for the GD approach.III. T HE M ICRO G RAD FRAMEWORK

An overview of the MicroGrad framework was shown inFig.1. MicroGrad is built in a modular manner, allowing easeof use as well as ﬂexibility for further development. Additionaluse cases and metrics of interest, custom evaluation platforms,as well as improved tuning algorithms, can be developed andintegrated conveniently into the framework.

A. Framework Inputs

The inputs to MicroGrad are provided in the form of aconﬁguration ﬁle. These inputs are use case dependent andthose for our target use cases are described below.

1) Workload Cloning:

The numerical values of the application’s metrics ofinterest (which the clone is expected to match) can bedirectly provided as input. MicroGrad would then tunethe clone to match these values. • The application binary and its input data can be providedalong with speciﬁcation of the metrics of interest. Bydefault, MicroGrad uses instruction distributions, cachemiss rates, branch misprediction rates and IPC as themetrics of interest. • Application Simpoints [21] can be provided, so as togenerate a clone for each simpoint individually. Thecombination of simpoints and clones can expand the eval-uation space of the original application, with potentiallyone clone for each interesting phase of the application.

2) Stress Testing:

Listing 1: Example Knobs and range of values

B. Knob Interface

MicroGrad uses a set of knobs to interface between the Tun-ing mechanism and the Microprobe framework. The Tuning mechanism nudges the knobs in the directions appropriate forthe use case, and these knobs are conveyed to Microprobewhich generates the test-case based on these knob values.Further, the generated test case is executed on the evaluationframework, whose outputs metrics are fed back to the tuningmechanism to re-tune the knob values. An example subsetof the knobs used by MicroGrad and their range of valuesare shown in Listing 1. In this example subset, the instructionknobs act as fractions of the overall distribution, another knobsallows control of the register dependency distance, the memoryknobs specify footprint, stride and temporal locality, and thebranch pattern knob speciﬁes the fraction of randomness inthe branch pattern. Other tuning knobs are not shown in theinterest of space. passes = [

Listing 2: Microprobe passes

C. Code Generation

The tuning mechanism presents knob values to Micro-probe [5], [12] to generate the corresponding test case. Micro-probe is a ﬂexible code generation framework that provides ahigh level Python scripting interface to access to a rich setof mechanisms and features to control the code generationprocess. This enables the users to adapt the code generationprocess to different use cases without having to deal with alllow level details. For instance, it has been used in the past4or power model generation [5], maximum power and dI/dtstressmark generation [4], complete architecture characteriza-tions [7], [23] and also reliability analysis [22].MicroGrad directly uses Microprobe scripting interface todeﬁne the code generation process according the knobs speci-ﬁed. The test case is then generated by a sequence of codesynthesis passes which are applied in accordance with theMicroGrad deﬁned ordering rules. A code snippet highlightingsome of the standard Microprobe passes used by MicroGradare shown in Listing 2. More details on these passes and otherscan be found on the open source Microprobe tool website [12]. def eval(Target):"""Tuning to reach Target""" µ (KC_base)) (cid:15) || (KC - Target) < (cid:15) :breakreturn KCdef epoch(kc, Met_base, Target, step_size):"""GD to create new knob configuration""" δ ) µ (kc_i)) Listing 3: Gradient descent tuning

D. Gradient-based Tuning

Each tuning epoch involves tuning the knob conﬁgurationby evaluating the execution metrics in the vicinity of thecurrent conﬁguration and making changes to the knobs ac-cordingly. Pseudo-code for the tuning mechanism is shown inListing 3 and features of the mechanism are discussed below.1 A new tuning epoch starts with capturing the executionmetrics (eg. IPC, energy, cache miss rates) at the previousepoch’s output knob conﬁguration (random conﬁguration, ifﬁrst epoch). This is the ’base’ conﬁguration for this epoch.This involves generating the test case with Microprobe at thebase conﬁguration, running the test case on the evaluationplatform and measuring the base metrics.2 The goal at the end of the epoch is to ﬁnd the newknob conﬁguration which is the steepest move (in termsof the matching the use case requirements) from the baseconﬁguration.3 In order to achieve this, the base knob conﬁgurationis independently perturbed by +/- δ in each dimension (i.e.each knob). Each resulting conﬁguration is a ’gradient-check’conﬁguration. This results in 2*knobs number of ’gradient-checks’ per epoch.4 The execution metrics are then captured at each of these’gradient-check’ conﬁgurations, again by generating test caseswith Microprobe and running the test cases on the evaluationplatform.5 For each case, the ’gradient-check’ execution metrics arecompared to the base and the target metrics to obtain a Loss,which is tied to the use case goal.6 The gradient of the Loss along each knob dimension iscalculated by evaluating how much the loss function changedalong each dimension’s δ perturbation.7 This information is used to obtain the new knob con-ﬁguration - the knobs with the steepest gradients move by’one’ step-size, while the other knobs proportionally moveby a fractional of the step size. This becomes the startingconﬁguration for the next epoch.8 Inspired by adaptive learning rate based gradient meth-ods [19], the tuning mechanism’s step-sizes are larger onearlier epochs and gradually become smaller, allowing forrapid convergence earlier but slower but surer convergencelater on.9 To add robustness to the convergence to help avoid localminima, a random set of knobs are skipped in tuning eachiteration, with decreasing skipping probability over epochs.10 Tuning continues until either convergence, the targetaccuracy or the maximum number of epochs is reached.5 arameter Small LargeFrequency Front-End Width

ROB/LSQ/RSE

ALU/SIMD/FP

L1/L2 Cache

Memory

TABLE II: Core Conﬁguration

E. Metric Evaluation

Once the test case is generated and compiled to meet therequirements of the evaluation architecture, the test case isexecuted on the platform. MicroGrad is able to interface witha number of platforms such as native hardware, performancesimulators (e.g. Gem5 [6]) and power estimation frameworks(eg. McPAT [20]). In the case of simulators, the architectureconﬁguration can be passed as input to MicroGrad and usedin the simulator to express the desired architecture.In terms of capturing metrics, the requisite metrics aredependent on the use case. A stress test use case might requireonly IPC / Power, whereas a cloning use case might requirelow-level metrics like mispredictions and miss rates. Whenusing simulators, the MicroGrad interface enables the requiredmetrics to be read from the output dumps of the simulators. Inthe case of native hardware evaluation, appropriate hardwarecounters and their required interfacing can be used in similarfashion.

F. Framework Outputs

MicroGrad completes execution when either the target ismet or some execution time/resource constraint is reached.The output at the end of execution is dependent on the usecase. In the case of Workload Cloning, MicroGrad outputsthe clone binary, details of the corresponding knobs and themetrics based on the evaluation of the clone. With stresstesting, MicroGrad outputs the stress test binary, the knobs andthe stress metrics. In both scenarios, intermediate data can bestored, so as to understand the tuning/execution progress overthe epochs (for example, to improve the tuning algorithm).IV. E

VALUATION

A. Experimental Setup1) Workloads:

To evaluate the cloning use case, we choose8 benchmarks from the SPEC INT CPU2006 [11] suite andgenerate clones on simpoints [21] of 100 million instructions.The generated test cases (for both use cases) are made up ofroughly 500 static instructions in an endless loop and run fora total of 10 million dynamic instructions.

2) Evaluation Framework:

We target the Gem5 [6] archi-tectural performance simulator and the McPAT [20] powerestimation framework. While performance numbers and mod-ule level statistics can be evaluated from Gem5 alone, powerestimation requires the transfer of execution statistics fromGem5 to McPAT, based on which dynamic power is estimated.

3) Target Microarchitectures:

We target the RISC-V ISA.We model two cores –

Large and

Small – to evaluate the perfor-mance of MicroGrad on different corners of the architecturedesign space. The details of each core are listed in Table II. Forthe power template, we use the default McPAT conﬁgurationscommensurate with these core sizes.

4) Metrics / Accuracy:

For Workload Cloning, we focuson: i) Integer, Branch, Load, Store instructions, ii) L1D, L1I,L2 cache hit rates, iii) Branch misprediction rate and iv) IPC.For Stress Testing, we focus separately on IPC and DynamicPower. The Loss function utilized by the tuning algorithmcalculates log loss over the metrics of interest speciﬁed above.Where applicable, we target an accuracy of 99% across themetrics.

B. Workload Cloning

In Fig.2 and Fig.3 we showcase the efﬁciency of MicroGradtowards Workload Cloning. Fig.2 shows the workload clonesgenerated across the 8 benchmarks on a

Large core whileFig.3 shows the same on a

Small core. In the ﬁgures, thecircumferential axis represents different metrics - instructionsdistributions, mispredictions, cache miss rates and IPC. Theradial axis represents the accuracy of the clone’s metriccompared to the original benchmark (1 indicates completeaccuracy).For the

Large core, over the eight benchmarks, the accuracyacross all metrics is close to 1 (average error is less than 1%).Worse case scenario is seen in libquantum wherein there isclose to a 5% error in the branch misprediction rate and thedata cache (DC) hit rate.In the case of the

Small core, results are similar (averageerror is less than 2%). The accuracy is marginally less com-pared to the

Large core due to the higher metric sensitivity ina core of smaller size. This is due to program characteristicshaving a larger impact on the execution ﬂow, since the core isnot over provisioned with resources. The worse case error isclose to 10% in the case of xalancbmk’s IC hit rate. We notethat there is potential for more knobs to be implemented inMicroGrad that can control IC Hit Rates with higher accuracy,which we seek to implement in the future.The captions of both ﬁgures indicate the number of epochsrequired to create the workload clones. Epochs vary from only5, to a maximum of 52, clearly highlighting that MicroGrad’shigh accuracy is achievable in very few tuning epochs.The accuracy and fast tuning capability of MicroGrad isheavily inﬂuenced by the Gradient Descent tuning algorithm.To showcase this, we compare against a Genetic Algorithmbased approach in Fig.4 for the Big core. The GA parametersare taken from prior work and were shown in Table I. Forthis analysis, we allow the GA based approach to run for thesame number of tuning epochs as the GD based approach. Theﬁgure shows that the accuracy achieved by GA is considerablylower than the GD approach (note that the ratios on the radialaxes are far greater). The average error in comparison to theoriginal benchmarks is roughly 30%, with worst case errorsof more than 50%6 .9 0.95 1 1.05 1.1