MicroGrad: A Centralized Framework for Workload Cloning and Stress Testing
Gokul Subramanian Ravi, Ramon Bertran, Pradip Bose, Mikko Lipasti
Gokul Subramanian Ravi
ECE, UW-Madison [email protected]
Ramon Bertran
IBM Research [email protected]
Pradip Bose
IBM Research [email protected]
Mikko Lipasti
ECE, UW-Madison [email protected]
Abstract—We present MicroGrad, a centralized automated framework that is able to efficiently analyze the capabilities, limits and sensitivities of complex modern processors in the face of constantly evolving application domains. MicroGrad uses Microprobe, a flexible code generation framework, as its back-end and a Gradient Descent based tuning mechanism to efficiently enable the evolution of test cases to suit tasks such as Workload Cloning and Stress Testing. MicroGrad can interface with a variety of execution infrastructure such as performance/power simulators as well as native hardware. Further, the modular abstract workload model approach to building MicroGrad allows it to be easily extended for further use.
In this paper, we evaluate MicroGrad over different use cases and architectures and showcase that MicroGrad can achieve greater than 99% accuracy across different tasks within few tuning epochs and low resource requirements. We also observe that MicroGrad's accuracy is 25-30% higher than competing techniques. At the same time, it is 1.5-2.5x faster or would consume 35-60% less compute resources (depending on implementation) than alternate mechanisms. Overall, MicroGrad's fast, resource-efficient and accurate test case generation capability allows it to perform rapid evaluation of complex processors.
I. INTRODUCTION
Analyzing the capabilities, limits and sensitivities of complex modern processors in the face of constantly evolving application domains is arduous and time consuming. Intelligently generating test cases which can efficiently perform the above analyses will enable quick turnaround times, thereby accelerating the final third of the Innovate-Build-Analyze cycle. We are particularly interested in two challenging tasks under this umbrella of test case generation:
1) Workload Cloning: which extracts key execution characteristics of a real-world application and models them into a synthetic workload.
2) Stress Testing: which maximizes the micro-architectural activity of a given processor, specifically to achieve worst-case estimates of execution metrics like performance and power.
We present MicroGrad, an open-source centralized automated framework that is able to efficiently analyze processors based on the scenarios described above. (The MicroGrad tool is available at https://github.com/rgokulsm/MicroGrad.) MicroGrad derives its name from (1) Microprobe [5], a flexible code generation framework which forms MicroGrad's back-end, and (2) Gradient Descent, which is the tuning mechanism used by MicroGrad to efficiently enable the evolution of the test cases to suit the tasks described earlier.
In the past, there has been considerable work in the domains of workload cloning [1]-[3], [13], [15], [16] and stress testing [5], [8]-[10], [14]. Despite this, open-source frameworks for these goals have been scarce. Meanwhile, the need for these tools is rapidly increasing with the momentum for open-source hardware. As the open-source space grows, we will require systematic tools to characterize and stress-test the abundant varied designs and implementations. To our knowledge, two open-source frameworks available in this space are: Microprobe [5], which can generate user-defined test cases, and GeST [10], which uses Genetic Algorithm (GA) based evolution on an instruction-level model to generate stress tests.
MicroGrad goes above and beyond the capabilities and use cases of the above, by providing a fast automated framework for a variety of purposes, all generated with a common centralized tuning mechanism and code generation back-end. Further, MicroGrad can interface with a variety of execution platforms such as performance/power simulators as well as native hardware, in order to evaluate the processor architecture's execution efficiency. Importantly, the modular "abstract workload model" approach to building MicroGrad allows it to be easily developed upon, allowing for new use cases, improved tuning algorithms, as well as easy interfacing with new execution hardware and simulators. An overview of MicroGrad is shown in Fig. 1.
To the best of our knowledge, all prior approaches to cloning and stress testing have been either GA-based or expert driven. Thus, the Gradient Descent based tuning mechanism is a key novelty and highlight of the MicroGrad framework. The tuning mechanism is implemented over a gradient descent algorithm, which iterates through a sequence of "workload generation knob" configurations (i.e. inputs to the Microprobe framework) and evaluates a specified processor execution metric for those configurations. It gradually moves the code generation configuration in the direction of the steepest execution metric gradient, i.e. the one which achieves the best metric improvement for every step change in the configuration, until the optimum configuration / convergence is reached. Note that the execution metric is dependent on the use case: it could be a single high-level statistic like IPC or power consumption in the case of Stress Testing, or a combination of both high-level and low-level statistics such as branch mispredictions, cache miss rates and IPC for Workload Cloning. Overall, the tuning mechanism allows for fast and efficient convergence to the prescribed goal and is observed to considerably outperform competing tuning approaches. Moreover, with its abstracted model, it is easier to deploy compared to expert-driven approaches.
[Fig. 1 schematic: Inputs (application / simpoint, metrics of interest, target evaluation platform, accuracy requirements) feed the Tuning Mechanism (Gradient Descent or other, with a use-case loss function), which drives Microprobe (test construction, code generation) through a set of Knobs; the generated test cases run on the Evaluation platform (performance simulator, e.g. Gem5; power estimator, e.g. McPAT; or native hardware), whose metrics feed back to the tuner. Outputs: clone / stress-test binary, workload characteristics, metrics (e.g. worst case), and epoch progression.]

Fig. 1: MicroGrad Overview
Summary of contributions:
1) We present MicroGrad, an open-source automated framework for workload cloning and stress testing. To our knowledge, MicroGrad is the first open tool for automated cloning and, further, the only open tool for fast exploratory stress testing with an abstract workload model.
2) MicroGrad is the first proposal to perform intelligent test generation via a Gradient Descent based tuning mechanism, which is shown to outperform other tuning mechanisms and is easier to deploy than expert-driven approaches.
3) MicroGrad extends the potential of the Microprobe framework, which has a wealth of features for code generation.
4) The modular and abstracted approach to building MicroGrad allows the seamless integration of new use cases, execution platforms and tuning algorithms.

II. BACKGROUND
A. Workload Cloning
There are multiple challenges with using real-world applications for architecture benchmarking, such as intellectual property hurdles, the effort involved in porting the application to suit the execution framework, as well as long run times. While the advent of standardized benchmark suites has improved the testing ecosystem, there are still several time/resource challenges, especially for architecture simulation in academic research as well as industry product development. Simulation times are often intractable, even on today's most efficient simulators running on the fastest processing systems.
Workload Cloning is a general technique to mimic real-world applications or benchmarks via miniature synthetic workloads and has been pursued in multiple prior works [1]-[3], [13], [15], [16]. The technique distills key behavioral characteristics of the original application/benchmark and models them into a synthetic workload. The resultant workload abstracts away any proprietary application characteristics, is usually significantly shorter in execution time, and can be suitably compiled to make it amenable to both native hardware and simulation frameworks. The integral components of the Cloning workflow are discussed below.
1) Application Characteristics:
Specific characteristics of an application are captured and used to generate the synthetic workload. These are characteristics which influence the instruction distribution, control flow, as well as memory patterns of the application. They can be divided into microarchitecture-independent and microarchitecture-dependent characteristics. The former includes instruction distributions, register dependency distances and memory footprints, while the latter includes branch misprediction and cache miss rates, among others. While some prior works [3] have used both microarchitecture-dependent and -independent characteristics in conjunction, others have used a wider range of solely microarchitecture-independent characteristics [16], which are then significantly impacted by compiler optimizations. In this work we use the former, i.e. a combination of both microarchitecture-dependent and -independent characteristics, which allows optimal capturing of both the static and dynamic characteristics of an application on a specific processor architecture.
2) Target Metrics:
The generated clones are expected to accurately meet specific target metrics. A full-system designer might require the clone to mimic the real application in terms of low-level target metrics such as L1/L2/TLB miss rates, branch misprediction rates, register usage and instruction distribution, as well as high-level target metrics like IPC, power, energy or thermal characteristics. In this paper our tool evaluation focuses on cache miss rates, branch mispredictions, instruction distributions, IPC and power.
3) Generation Mechanism:
Prior clone generation mechanisms have comprised a number of steps, each attempting to feed specific application/benchmark statistics into a model, so as to generate the required characteristics in the application. These steps include: generating the synthetic workload spine using the instruction distribution, memory access pattern modeling, branch predictability modeling, register assignment, and finally code generation [3]. While these steps might individually achieve satisfactory accuracy for their low-level target metric (such as branch misprediction rate), performing them in a sequential manner (a sort of greedy approach) means that there is limited control over other high-level target metrics of interest like IPC.
In our work, we take a more synergistic approach to clone generation. By estimating gradients and following the steepest curves, our tuning mechanism is able to inherently sacrifice accuracy on some specific low-level target metric (for example, L2 cache miss rate), if required, when doing so aids the optimal achievement of other low-level and high-level target metrics, thereby creating a clone with higher fidelity. Further, our approach allows a flexible generation time vs. cloning accuracy tradeoff. For instance, a 95% accuracy 1-metric target would take considerably less generation time than a 99% accuracy N-metric target.
B. Stress Testing
Benchmark suites are usually built to represent the nominal behavior of real-world applications, not to mimic worst-case scenarios. However, worst-case scenarios in terms of microarchitectural activity, heat dissipation, power consumption and voltage noise [4], [8], [9], [14], [17], [18] are critical to understanding the limits and sensitivities of current generation processors, so that future systems can migrate to the most promising regions of the microarchitectural design space. These worst-case scenarios are closely tied to the microarchitecture, and must be created in accordance with it. Thus, stress tests are used to create these worst-case scenarios for a given target execution metric and a specific processor microarchitecture. Considering the complexities and non-linear relationships within a modern processor, manually crafting such stress tests is usually time consuming and tedious. Consequently, automating their generation is of critical importance for a rapid design cycle.
1) Generation Model:
As highlighted in prior work [10], there are two prominent design models for stress-test generation: a) based on an abstract workload model and b) based on instruction-level primitives. In the abstract model [8], [9], [14], the stress test generation process involves tuning a vector of workload generation parameters/knobs such as instruction mix, register dependency distance, memory footprint / stride patterns and branch transition patterns. The vector is then used to generate the assembly (or high-level language) code. On the other hand, for the instruction-level frameworks [10], [17], [18], the tuning is performed directly on the instruction assembly, with per-instruction control.
The key advantage of the abstract workload model is that the knobs are well defined, can be selected to be only a few in number, and can potentially have exclusive mappings to particular execution characteristics, significantly reducing the complexity of the tuning required to achieve maximum stress. The advantage of the instruction-level model is that it provides deterministic and finer-grained control, i.e. on a per-instruction basis. In this work, we adopt the abstract workload model, which provides suitable abstractions to allow for a more modular framework suited to multiple use cases, evaluation frameworks and tuning algorithms.
2) Tuning Mechanism:
The role of the tuning mechanism is to nudge the generated test case towards maximum stress (as per the specified stress metric). Prior works built on both of the generation models described above have predominantly utilized genetic algorithm (GA) based tuning. GAs tune towards a target metric by applying operators inspired by natural evolution, including selection of the fittest individuals, crossover of features, mutation, and elitism prioritization [8]-[10], [14]. The GA parameters used by prior work [10] are shown in Table I.

TABLE I: GA parameters
Parameter         Value
Population Size   50
Individual Size   (

To our knowledge, MicroGrad is the only stress test generation scheme to stray away from GA based tuning. We find that a gradient descent based tuning approach, with stochastic randomness to jump out of local minima, as well as adaptive step sizes (larger to smaller over time), enables considerably faster (i.e. fewer tuning epochs) and more accurate convergence compared to the GA based approach. For the abstract workload model specifically, our insight is that important GA operators like crossover are rather ineffective, while they are much more valuable in an instruction-level model. On the other hand, the gradient descent approach of following the steepest path to maximizing the metric of interest is very effective when local minima can be avoided.
It is also interesting to note that the compute cost of a GD based tuning epoch is proportional to the number of knobs of interest, which can be low in the context of many use cases. On the other hand, the compute cost of a GA epoch is proportional to the population size, which is often fixed throughout and therefore usually conservative. Thus every GD epoch is often faster and/or consumes less compute resources in comparison to the GA approach. Our results demonstrate up to a 2.5x benefit for the GD approach.

III. THE MICROGRAD FRAMEWORK
An overview of the MicroGrad framework was shown in Fig. 1. MicroGrad is built in a modular manner, allowing ease of use as well as flexibility for further development. Additional use cases and metrics of interest, custom evaluation platforms, as well as improved tuning algorithms, can be developed and integrated conveniently into the framework.
A. Framework Inputs
The inputs to MicroGrad are provided in the form of a configuration file. These inputs are use case dependent; those for our target use cases are described below.
1) Workload Cloning:
• The numerical values of the application's metrics of interest (which the clone is expected to match) can be directly provided as input. MicroGrad then tunes the clone to match these values.
• The application binary and its input data can be provided along with a specification of the metrics of interest. By default, MicroGrad uses instruction distributions, cache miss rates, branch misprediction rates and IPC as the metrics of interest.
• Application Simpoints [21] can be provided, so as to generate a clone for each simpoint individually. The combination of simpoints and clones can expand the evaluation space of the original application, with potentially one clone for each interesting phase of the application.
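As an illustration, a cloning-run configuration of the kind described above might look as follows. The key names and layout are hypothetical, not MicroGrad's actual configuration schema; only the default metrics list and the 99% accuracy target come from the text.

```python
# Hypothetical MicroGrad input configuration for a cloning run; key names
# are illustrative assumptions, not the tool's real schema.
cloning_config = {
    "use_case": "workload_cloning",
    "application": "bzip2",             # or a Simpoint of the application
    "metrics_of_interest": [            # defaults listed in Section III-A
        "instruction_distribution",
        "cache_miss_rates",
        "branch_misprediction_rate",
        "ipc",
    ],
    "evaluation_platform": "gem5",      # performance simulator back-end
    "target_accuracy": 0.99,            # stop once 99% accuracy is reached
    "max_epochs": 60,                   # resource cap (illustrative value)
}
```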
2) Stress Testing:
Listing 1: Example Knobs and range of values
B. Knob Interface
MicroGrad uses a set of knobs to interface between the tuning mechanism and the Microprobe framework. The tuning mechanism nudges the knobs in the directions appropriate for the use case, and these knobs are conveyed to Microprobe, which generates the test case based on their values. Further, the generated test case is executed on the evaluation framework, whose output metrics are fed back to the tuning mechanism to re-tune the knob values. An example subset of the knobs used by MicroGrad and their range of values is shown in Listing 1. In this example subset, the instruction knobs act as fractions of the overall distribution, another knob allows control of the register dependency distance, the memory knobs specify footprint, stride and temporal locality, and the branch pattern knob specifies the fraction of randomness in the branch pattern. Other tuning knobs are not shown in the interest of space.
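As a concrete illustration of the knob subset just described, one possible knob vector is sketched below. The knob names, values and ranges are illustrative assumptions, not Microprobe's actual parameter names.

```python
# Illustrative knob vector of the kind Listing 1 describes; names and
# ranges are assumptions made for this sketch.
knobs = {
    # instruction-mix knobs: fractions of the overall distribution
    "frac_integer": 0.25,
    "frac_load":    0.30,
    "frac_store":   0.20,
    "frac_branch":  0.25,
    # distance (in instructions) between a producer and its consumer
    "register_dependency_distance": 4,
    # memory behaviour: footprint in bytes, stride in bytes, locality knob
    "memory_footprint": 64 * 1024,
    "memory_stride": 64,
    "temporal_locality": 0.5,
    # fraction of randomness in the branch outcome pattern
    "branch_randomness": 0.1,
}

# The instruction-mix fractions must sum to 1.0 before being handed to
# the code generator.
assert abs(sum(v for k, v in knobs.items() if k.startswith("frac_")) - 1.0) < 1e-9
```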
Listing 2: Microprobe passes
C. Code Generation
The tuning mechanism presents knob values to Microprobe [5], [12] to generate the corresponding test case. Microprobe is a flexible code generation framework that provides a high-level Python scripting interface to access a rich set of mechanisms and features controlling the code generation process. This enables users to adapt the code generation process to different use cases without having to deal with all the low-level details. For instance, it has been used in the past for power model generation [5], maximum power and dI/dt stressmark generation [4], complete architecture characterizations [7], [23] and also reliability analysis [22].
MicroGrad directly uses the Microprobe scripting interface to define the code generation process according to the knobs specified. The test case is then generated by a sequence of code synthesis passes which are applied in accordance with MicroGrad-defined ordering rules. A code snippet highlighting some of the standard Microprobe passes used by MicroGrad is shown in Listing 2. More details on these and other passes can be found on the open-source Microprobe tool website [12].

def eval(Target):
    """Tuning to reach Target"""
    ...
    if ||KC - Target|| < ε: break
    return KC

def epoch(kc, Met_base, Target, step_size):
    """GD to create new knob configuration"""
    ...

Listing 3: Gradient descent tuning
D. Gradient-based Tuning
Each tuning epoch tunes the knob configuration by evaluating the execution metrics in the vicinity of the current configuration and changing the knobs accordingly. Pseudo-code for the tuning mechanism is shown in Listing 3 and its features are discussed below.
1) A new tuning epoch starts by capturing the execution metrics (e.g. IPC, energy, cache miss rates) at the previous epoch's output knob configuration (a random configuration, if it is the first epoch). This is the 'base' configuration for the epoch. This involves generating the test case with Microprobe at the base configuration, running the test case on the evaluation platform and measuring the base metrics.
2) The goal at the end of the epoch is to find the new knob configuration which is the steepest move (in terms of matching the use case requirements) from the base configuration.
3) To achieve this, the base knob configuration is independently perturbed by +/- δ in each dimension (i.e. each knob). Each resulting configuration is a 'gradient-check' configuration. This results in 2 x knobs 'gradient-checks' per epoch.
4) The execution metrics are then captured at each of these 'gradient-check' configurations, again by generating test cases with Microprobe and running them on the evaluation platform.
5) For each case, the 'gradient-check' execution metrics are compared to the base and target metrics to obtain a Loss, which is tied to the use case goal.
6) The gradient of the Loss along each knob dimension is calculated by evaluating how much the loss function changed along that dimension's δ perturbation.
7) This information is used to obtain the new knob configuration: the knobs with the steepest gradients move by 'one' step size, while the other knobs proportionally move by a fraction of the step size. This becomes the starting configuration for the next epoch.
8) Inspired by adaptive learning rate based gradient methods [19], the tuning mechanism's step sizes are larger in earlier epochs and gradually become smaller, allowing for rapid convergence early on, and slower but surer convergence later.
9) To add robustness and help avoid local minima, a random set of knobs is skipped in each tuning iteration, with the skipping probability decreasing over epochs.
10) Tuning continues until convergence, the target accuracy, or the maximum number of epochs is reached.
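The steps above can be sketched as a compact, runnable loop. This is an illustrative re-implementation, not MicroGrad's actual code: the measure() callback stands in for generating a test case with Microprobe and running it on the evaluation platform, and the loss formulation and hyper-parameters are placeholder choices.

```python
import random

def loss(metrics, target):
    # use-case loss; here: squared distance of each metric from its target
    return sum((metrics[k] - target[k]) ** 2 for k in target)

def tune(measure, target, knobs, epochs=50, step0=0.1, delta=0.01, seed=0):
    """measure(knobs) -> dict of execution metrics for that configuration."""
    rng = random.Random(seed)
    for epoch in range(epochs):
        base_loss = loss(measure(knobs), target)           # step 1: base config
        if base_loss < 1e-6:                               # step 10: converged
            break
        grads = {}
        for k in knobs:
            # step 9: randomly skip knobs, less often as epochs progress
            if rng.random() < 0.5 * (1 - epoch / epochs):
                continue
            # steps 3-6: +/- delta 'gradient checks' around the base config
            for sign in (+1, -1):
                probe = dict(knobs)
                probe[k] += sign * delta
                g = (loss(measure(probe), target) - base_loss) / (sign * delta)
                grads[k] = grads.get(k, 0.0) + g / 2       # average both sides
        if not grads:
            continue
        step = step0 * (1 - epoch / epochs)                # step 8: shrinking steps
        steepest = max(abs(g) for g in grads.values())
        for k, g in grads.items():                         # step 7: steepest knob
            if steepest:                                   # moves one full step,
                knobs[k] -= step * (g / steepest)          # others a fraction
    return knobs
```

With a toy measure() that simply reports the knob value as the metric, the loop walks a knob from 0 toward an IPC target of 0.8, oscillating ever more tightly around it as the step size decays.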
TABLE II: Core Configuration. Parameters specified for the Small and Large cores: Frequency, Front-End Width, ROB/LSQ/RSE sizes, ALU/SIMD/FP units, L1/L2 Cache, and Memory.
E. Metric Evaluation
Once the test case is generated and compiled to meet the requirements of the evaluation architecture, it is executed on the platform. MicroGrad is able to interface with a number of platforms such as native hardware, performance simulators (e.g. Gem5 [6]) and power estimation frameworks (e.g. McPAT [20]). In the case of simulators, the architecture configuration can be passed as an input to MicroGrad and used in the simulator to express the desired architecture.
The requisite metrics depend on the use case. A stress test use case might require only IPC / power, whereas a cloning use case might require low-level metrics like mispredictions and miss rates. When using simulators, the MicroGrad interface enables the required metrics to be read from the output dumps of the simulators. In the case of native hardware evaluation, appropriate hardware counters and the required interfacing can be used in similar fashion.
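As a sketch of the simulator path, the snippet below pulls named statistics out of a text dump in the "name value # description" line format that Gem5's stats.txt uses; the stat names shown are illustrative and vary with the simulated configuration.

```python
# Minimal sketch of reading metrics of interest from a simulator's text
# output dump; not MicroGrad's actual parsing code.
def parse_stats(dump: str, wanted: set) -> dict:
    metrics = {}
    for line in dump.splitlines():
        parts = line.split()
        # keep only the requested stat names; second field is the value
        if len(parts) >= 2 and parts[0] in wanted:
            metrics[parts[0]] = float(parts[1])
    return metrics

dump = """\
system.cpu.ipc                0.8421   # instructions per cycle
system.cpu.branchPred.mispred 1204     # mispredicted branches
"""
stats = parse_stats(dump, {"system.cpu.ipc"})
```

The resulting dictionary is what the tuning mechanism would compare against the base and target metrics.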
F. Framework Outputs
MicroGrad completes execution when either the target is met or some execution time/resource constraint is reached. The output at the end of execution depends on the use case. In the case of Workload Cloning, MicroGrad outputs the clone binary, details of the corresponding knobs and the metrics from the evaluation of the clone. With stress testing, MicroGrad outputs the stress test binary, the knobs and the stress metrics. In both scenarios, intermediate data can be stored, so as to understand the tuning/execution progress over the epochs (for example, to improve the tuning algorithm).

IV. EVALUATION
A. Experimental Setup

1) Workloads: To evaluate the cloning use case, we choose 8 benchmarks from the SPEC INT CPU2006 [11] suite and generate clones on simpoints [21] of 100 million instructions. The generated test cases (for both use cases) are made up of roughly 500 static instructions in an endless loop and run for a total of 10 million dynamic instructions.
2) Evaluation Framework:
We target the Gem5 [6] architectural performance simulator and the McPAT [20] power estimation framework. While performance numbers and module-level statistics can be evaluated from Gem5 alone, power estimation requires the transfer of execution statistics from Gem5 to McPAT, based on which dynamic power is estimated.
3) Target Microarchitectures:
We target the RISC-V ISA. We model two cores, Large and Small, to evaluate the performance of MicroGrad at different corners of the architecture design space. The details of each core are listed in Table II. For the power template, we use the default McPAT configurations commensurate with these core sizes.
4) Metrics / Accuracy:
For Workload Cloning, we focus on: i) Integer, Branch, Load and Store instructions, ii) L1D, L1I and L2 cache hit rates, iii) branch misprediction rate and iv) IPC. For Stress Testing, we focus separately on IPC and Dynamic Power. The Loss function utilized by the tuning algorithm calculates log loss over the metrics of interest specified above. Where applicable, we target an accuracy of 99% across the metrics.
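One plausible realization of such a log loss over the metrics of interest is sketched below; the exact formulation used by MicroGrad may differ. Each metric's ratio to its target is pushed toward 1, so its log-ratio is pushed toward 0.

```python
import math

def metric_log_loss(measured: dict, target: dict) -> float:
    # Sum of squared log-ratios across the metrics of interest; zero only
    # when every measured metric exactly matches its target. Illustrative,
    # not MicroGrad's exact loss.
    return sum(math.log(measured[k] / target[k]) ** 2 for k in target)

perfect = metric_log_loss({"ipc": 0.8}, {"ipc": 0.8})   # exact match -> 0.0
```

A log-ratio loss has the convenient property that a 10% overshoot and a 10% undershoot of a target are penalized almost symmetrically, regardless of the metric's scale.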
B. Workload Cloning
In Fig. 2 and Fig. 3 we showcase the efficiency of MicroGrad for Workload Cloning. Fig. 2 shows the workload clones generated across the 8 benchmarks on the Large core, while Fig. 3 shows the same on the Small core. In the figures, the circumferential axis represents the different metrics: instruction distributions, mispredictions, cache miss rates and IPC. The radial axis represents the accuracy of the clone's metric compared to the original benchmark (1 indicates complete accuracy).
For the Large core, over the eight benchmarks, the accuracy across all metrics is close to 1 (the average error is less than 1%). The worst case is seen in libquantum, wherein there is close to a 5% error in the branch misprediction rate and the data cache (DC) hit rate.
In the case of the Small core, results are similar (the average error is less than 2%). The accuracy is marginally lower than on the Large core due to the higher metric sensitivity of a smaller core. This is due to program characteristics having a larger impact on the execution flow, since the core is not over-provisioned with resources. The worst-case error is close to 10%, in the case of xalancbmk's IC hit rate. We note that there is potential for more knobs to be implemented in MicroGrad that can control IC hit rates with higher accuracy, which we seek to implement in the future.
The captions of both figures indicate the number of epochs required to create the workload clones. Epochs vary from only 5 to a maximum of 52, clearly highlighting that MicroGrad's high accuracy is achievable in very few tuning epochs.
The accuracy and fast tuning capability of MicroGrad is heavily influenced by the Gradient Descent tuning algorithm. To showcase this, we compare against a Genetic Algorithm based approach in Fig. 4 for the Large core. The GA parameters are taken from prior work and were shown in Table I. For this analysis, we allow the GA based approach to run for the same number of tuning epochs as the GD based approach. The figure shows that the accuracy achieved by GA is considerably lower than with the GD approach (note that the ratios on the radial axes are far greater). The average error in comparison to the original benchmarks is roughly 30%, with worst-case errors of more than 50%.
Fig. 2: Workload Cloning targeting a "large" core, with Gradient Descent. Top left to bottom right: (a) astar [10 epochs], (b) bzip2 [5], (c) gcc [19], (d) hmmer [52], (e) libquantum [45], (f) mcf [21], (g) sjeng [15], (h) xalancbmk [26]
Fig. 3: Workload Cloning targeting a "small" core, with Gradient Descent. Top left to bottom right: (a) astar [21 epochs], (b) bzip2 [5], (c) gcc [36], (d) hmmer [40], (e) libquantum [50], (f) mcf [30], (g) sjeng [6], (h) xalancbmk [37]

It should also be noted that allowing the same number of epochs is favorable to GA. As discussed earlier, a GA tuning epoch (with the Table I parameters) performs roughly 2.5 times the work of the GD based approach: 50 evaluations per epoch (the population size) in GA vs. 20 evaluations per epoch (2 x knobs) in GD. Depending on the implementation, this can manifest as higher execution time, more compute resources needed, or both.
Also significant to note is that the GA based tuning algorithm fits seamlessly into the MicroGrad framework. This is thanks to the modular implementation of MicroGrad, which allows for flexible development on multiple fronts, including research on use case specific tuning algorithms.
C. Stress Testing
Next, we discuss MicroGrad's proficiency in stress testing. Fig. 5 shows a compute-focused performance stress test scenario which seeks to achieve the worst-case performance on the Large core. This scenario tunes only the instruction fractions, not other metrics like miss rates and mispredictions. The green line shows the optimal worst-case performance as estimated by a brute-force search exploring the entire workload space. The Gradient Descent mechanism (shown in orange) is able to converge to the worst case in under 30 epochs. In comparison, a GA based tuning approach remains about 25% off from the optimal worst-case performance after 1.5 times the number of epochs.
Next, in Fig. 6 we show a compute-focused stress test scenario targeting worst-case dynamic power.
Fig. 4: Workload Cloning targeting a "large" core, with Genetic Algorithm. Top left to bottom right: (a) astar [10 epochs], (b) bzip2 [5], (c) gcc [19], (d) hmmer [52], (e) libquantum [45], (f) mcf [21], (g) sjeng [15], (h) xalancbmk [26]
Fig. 5: Performance virus: GD vs. GA (relative performance over tuning epochs; GA, GD and the brute-force minimum shown)

Again, the green line shows the highest dynamic power achieved through a brute-force search across the workload space, roughly 2.1 W. The GD approach is able to achieve 2.01 W (95% accuracy) in only 25 tuning epochs. In comparison, the GA approach is able to achieve power similar to GD, but requires roughly 2x the number of epochs.
Further, in Table III we show the distribution of instructions in the GD-generated power virus, which is similar to the result of the brute-force search. More than 50% of the instructions are memory focused and over 20% are floating point operations. On the other hand, integer operations are only 6% of the total. The high fractions of memory and FP ops are intuitive, considering that these operations trigger more complex microarchitectural activity than integer operations. Further (not shown), the register dependency distance chosen by this stress test was at its maximum limit, meaning that ILP was pushed to the maximum extent allowed. This is also intuitive: the higher the microarchitectural activity, the higher the power consumed.
Overall, these results indicate that the gradient based tuning approach, in combination with an abstract workload model, can generate highly accurate stress tests for different use cases. In addition, the gradient descent tuning outperforms existing GA-based solutions in terms of time to solution (epochs) and efficiency of resource utilization.
Fig. 6: Power virus: GD vs GA
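The gradient-descent tuning loop evaluated above can be illustrated with a minimal sketch. The knob names (`mem_frac`, `dep_dist`) and the `measure` objective are hypothetical stand-ins for MicroGrad's abstract workload model and its execution back-end; this is a generic finite-difference gradient ascent, not the framework's actual implementation.

```python
# Minimal finite-difference gradient ascent over abstract workload knobs.
# The knob names and the objective are hypothetical; in MicroGrad the
# objective would be a measured metric (e.g., dynamic power) returned by
# the execution back-end rather than this toy function.

def measure(knobs):
    # Toy objective that peaks at mem_frac = 0.5, dep_dist = 1.0.
    mem, dep = knobs["mem_frac"], knobs["dep_dist"]
    return -(mem - 0.5) ** 2 - (dep - 1.0) ** 2

def tune(knobs, lr=0.1, eps=1e-3, epochs=50):
    for _ in range(epochs):
        base = measure(knobs)
        grads = {}
        for k in knobs:
            bumped = dict(knobs)
            bumped[k] += eps
            # One-sided finite-difference estimate of the gradient.
            grads[k] = (measure(bumped) - base) / eps
        for k in knobs:
            knobs[k] += lr * grads[k]  # step toward a higher objective
    return knobs

best = tune({"mem_frac": 0.1, "dep_dist": 0.2})
```

Each epoch here costs one measurement per knob plus a baseline, mirroring the tuning epochs reported in the results; a GA instead evaluates an entire population per generation, which is consistent with the roughly 2x epoch gap observed above.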
TABLE III: Power virus: Instruction Distribution (columns: Integer, Float, Branch, Load, Store)

V. CONCLUSION
In summary, we presented MicroGrad, an open-source centralized framework for workload cloning and stress testing. Key novel features of MicroGrad are its gradient-based tuning approach and its Microprobe back-end. The framework is able to produce fast and accurate workload clones and stress tests, and these results are especially evident in comparison to prior techniques.

Beyond the specific quantitative benefits shown in this paper, MicroGrad is built in a modular manner with clear interface boundaries, both internal and external. This allows it to serve as a promising springboard for future development, whether in terms of the use cases it can support, the evaluation platforms it can execute on, or the tuning algorithms it can run.

For example, MicroGrad can seamlessly support other use cases such as bottleneck analysis, i.e., sweeping over a specified range of finer execution characteristics (such as cache miss rate) and analyzing their bottlenecking impact on overall processor execution. The framework also allows for experiments on native hardware and other forms of stress testing, such as voltage droop analysis. Thus, we envision that with future development, MicroGrad can accelerate the entire Innovate-Build-Analyze cycle, which is especially critical in the coming open-source hardware era.
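The bottleneck-analysis use case described in the conclusion, sweeping a single execution characteristic and observing its impact on overall execution, can be illustrated with a small sketch. Here `run_clone` is a hypothetical stand-in for launching a MicroGrad-generated test case on a simulator or on native hardware, and its IPC response is a toy model, not measured data.

```python
# Hypothetical bottleneck-analysis sweep: vary a target L1D miss rate and
# locate the point where overall IPC starts to degrade. run_clone() is a
# stand-in for running a MicroGrad-generated test case; its IPC model is
# a toy chosen only to make the sweep logic concrete.

def run_clone(target_miss_rate):
    # Toy model: IPC is flat until the miss rate exceeds 10%, then drops.
    if target_miss_rate <= 0.10:
        return 2.0
    return 2.0 / (1.0 + 20.0 * (target_miss_rate - 0.10))

def sweep(points):
    results = [(m, run_clone(m)) for m in points]
    peak = max(ipc for _, ipc in results)
    # First sweep point where IPC falls more than 10% below its peak.
    knee = next((m for m, ipc in results if ipc < 0.9 * peak), None)
    return results, knee

points = [i / 100 for i in range(0, 31, 2)]  # 0% to 30% miss rate
results, knee = sweep(points)
```

In a real deployment, each sweep point would be a generated test case pinned to the target characteristic, so the knee directly identifies where that characteristic becomes the bottleneck.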