Automatic Microprocessor Performance Bug Detection
Erick Carvajal Barboza∗, Sara Jacob∗, Mahesh Ketkar†, Michael Kishinevsky†, Paul Gratz∗, Jiang Hu∗
∗ Texas A&M University, {ecarvajal, sarajacob96, jianghu}@tamu.edu, [email protected]
† Intel Corporation, {mahesh.c.ketkar, michael.kishinevsky}@intel.com

Abstract—Processor design validation and debug is a difficult and complex task, which consumes the lion's share of the design process. Design bugs that affect processor performance rather than its functionality are especially difficult to catch, particularly in new microarchitectures. This is because, unlike functional bugs, the correct processor performance of new microarchitectures on complex, long-running benchmarks is typically not deterministically known. Thus, when performance benchmarking new microarchitectures, performance teams may assume that the design is correct when the performance of the new microarchitecture exceeds that of the previous generation, despite significant performance regressions existing in the design. In this work we present a two-stage, machine learning-based methodology that is able to detect the existence of performance bugs in microprocessors. Our results show that our best technique detects 91.5% of microprocessor core performance bugs whose average IPC impact across the studied applications is greater than 1% versus a bug-free design, with zero false positives. When evaluated on memory system bugs, our technique achieves 100% detection with zero false positives. Moreover, the detection is automatic, requiring very little performance engineer time.
I. INTRODUCTION
Verification and validation are typically the largest component of the design effort of a new processor. The effort can be broadly divided into two distinct disciplines, namely, functional and performance verification. The former is concerned with design correctness and has rightfully received significant attention in the literature. Even though challenging, it benefits from the availability of known correct output to compare against. Alternately, performance verification, which is typically concerned with generation-over-generation workload performance improvement, suffers from the lack of a known correct output to check against. Given the complexity of computer systems, it is often extremely difficult to accurately predict the expected performance of a given design on a given workload. Performance bugs, nevertheless, are a critical concern as the cadence of process technology scaling slows, as they rob processor designs of the primary performance and efficiency gains to be had through improved microarchitecture.

Further, while simulation studies may provide an approximation of how long a given application might run on a real system, prior work has shown that these performance estimates will typically vary greatly from real system performance [8], [18], [44]. Thus, the process to detect performance bugs, in practice, is via tedious and complex manual analysis, often taking large periods of time to isolate and eliminate a single bug. For example, consider the case of the Xeon Platinum 8160 processor snoop filter eviction bug [42]. Here the performance regression was significant on several benchmarks, and the bug took several months of debugging effort before being fully characterized. In a competitive environment where time to market and performance are essential, there is a critical need for new, more automated mechanisms for performance debugging.

To date, little prior research in performance debugging exists, despite its importance and difficulty. The few existing works [9], [51], [53] take an approach of manually constructing a reference model for comparison. However, as described by Ho et al. [26], the same bug can be repeated in the model and thus cannot be detected. Moreover, building an effective model is very time consuming and labor intensive. Thus, as a first step we focus on automating bug detection. We note that simply detecting a bug in the pre-silicon phase or early in the post-silicon phase is highly valuable to shorten the validation process or to avoid costly steppings. Further, bug detection is a challenge by itself. Performance bugs, while potentially significant versus a bug-free version of the given microarchitecture, may be less than the difference between microarchitectures. Consider the data shown in Figure 1. The figure shows the performance of several SPEC CPU2006 benchmarks [2] on gem5 [6], configured to emulate two recent Intel microarchitectures: Ivybridge (3rd gen Intel "Core" [23]) and Skylake (6th gen [19]), using the methodology discussed in Sections III and IV. The figure shows that the baseline, bug-free Skylake microarchitecture provides nearly 1.7X the performance of the older Ivybridge, as expected given the greater core execution resources provided.
Fig. 1: Speedup of Skylake simulation with and without performance bugs, normalized against Ivybridge simulation.

In addition to a bug-free case, we collected performance data for two arbitrarily selected bug-cases for Skylake.
Bug 1 is an instruction scheduling bug wherein xor instructions are the only instructions selected to be issued when they are the oldest instruction in the instruction queue. As the figure shows, the overall impact of this bug is very low (less than 1% on average across the given workloads). Bug 2 is another instruction scheduling bug which causes sub instructions to be incorrectly marked as synchronizing; thus all younger instructions must wait to be issued until after any sub instruction has been issued, and all older instructions must issue before any sub instruction, regardless of dependencies and hardware resources. The impact of this bug is larger across the given workloads. Nevertheless, for both bugs, the impact is less than the difference between Skylake and Ivybridge. Thus, from the designer's perspective, working on the newer Skylake design with Ivybridge in hand, if early performance models showed the results shown for Bug 2 (even more so for
Bug 1), designers might incorrectly assume that the buggy Skylake design did not have performance bugs since it was much better than previous generations. Even as the performance difference between microarchitecture generations decreases, differentiating between performance bugs and expected behavior remains a challenge given the variability between simulated performance and real systems.

Examining the figure, we note that each bug's impact varies from workload to workload. That said, in the case of
Bug 1, no single benchmark shows a performance loss of more than 1%. This makes sense since the bug only impacts the performance of applications which have relatively rare xor instructions. From this, one might conclude that
Bug 1 is trivial and perhaps need not be fixed at all. This assumption, however, would be wrong, as many workloads do in fact rely heavily upon the xor instruction, enough so that a moderate impact on them translates into significant performance impact overall.

In this work, we explore several approaches to this problem and ultimately develop an automated, two-stage, machine learning-based methodology that will allow designers to automatically detect the existence of performance bugs in microprocessor designs. Even though the focus of this work is the processor core, we show that the methodology can be used for other system modules, such as the memory. Our approach extracts and exploits knowledge from legacy designs instead of relying on a reference model. Legacy designs have already undergone major debugging; therefore we consider them to be bug-free, or to have only minor performance bugs. The automatic detection takes several minutes of machine learning inference time and several hours of architecture simulation time. The major contributions of this work include the following:
• This is the first study on using machine learning for performance bug detection, to the best of our knowledge. Thus, while we provide a solution to this particular problem, we also hope to draw the attention of the research community to the broader performance validation domain.
• Investigation of different strategies and machine learning techniques to approach the problem. Ultimately, an accurate and straightforward two-phase approach is developed.
• A novel automated method to leverage SimPoints [25] to extract short, performance-orthogonal microbenchmark "probes" from long-running workloads.
• A set of processor performance bugs which may be configured for an arbitrary level of impact, for use in testing performance debugging techniques, is developed.
• Our methodology detects 91.5% of these processor core performance bugs leading to ≥1% IPC degradation with 0% false positives. It also achieves 100% accuracy for the evaluated performance bugs in memory systems.

As an early work in this space, we limit our scope somewhat to maintain tractability. This work is focused mainly on the processor core (the most complex component in the system); however, we show that the methodology can be generalized to other units by evaluating its usage on memory systems. Further, here we focus on performance bugs that affect cycle count and consider timing issues to be out of scope.

The goal of this work is to find bugs in new architectures that are incrementally evolved from those used in the training data; it may not work as well when there have been large shifts in microarchitectural details, such as going from in-order to out-of-order. However, major generalized microarchitectural changes are increasingly rare. For example, Intel processors' last major microarchitecture refactoring, from P4 to "Core", occurred well over a decade ago. Over this period, there have been several microarchitectures where our approach could provide value. In the event of larger changes, our method can be partially reused by augmenting it with additional features.

II. OBJECTIVE AND A BASELINE APPROACH
The objective of this work is the detection of performance bugs in new microprocessor designs. We explicitly look at microarchitectural-level bugs, as opposed to logic or architectural-level bugs. As grounds for the detection, one or multiple legacy microarchitecture designs are available and assumed to be bug-free or to have only minor performance bugs remaining. Given the extensive pre-silicon and post-silicon debug, this assumption generally holds on legacy designs. There are no assumptions about what a bug may look like. As it is very difficult to define a theoretical coverage model for performance debug, our methodology is a heuristic approach.

We introduce a naïve baseline approach to solving the bug detection problem. This serves as a prelude for describing our proposed methodology. It also provides a reference for comparison in the experiments. Performance bug detection bears a certain similarity to standard chip functional bug detection. Thus, a testbench strategy similar to functional verification is sensible. That is, a set of microbenchmarks are simulated on the new microarchitecture, and certain performance characteristics are monitored and analyzed. The key difference is that functional verification has correct responses to compare against, while bug-free performance for a new microarchitecture is not well defined.

The baseline approach uses supervised machine learning to classify whether a microarchitecture has performance bugs or not. The input features for a machine learning model consist of performance counter data, IPC (committed Instructions Per Cycle) and microarchitecture design parameters, such as cache sizes. One classification result on bug existence is produced for each application. The overall detection result is obtained by a voting-based ensemble of the classification results among multiple applications. Let ρ be the ratio of the number of applications with positive classification (indicating a bug) versus the total number of applications. A microarchitecture design is judged to have a bug if ρ ≥ θ, where θ is a threshold determining the tradeoff between true positive and false positive rates. The models are trained on legacy microarchitecture designs, into which bugs are inserted to obtain labeled data.

The details of this approach, such as applications, machine learning engines, performance counter selection, microarchitecture parameters and bug development, overlap with our proposed method and will be described in later sections.
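A minimal sketch of how the baseline's voting rule could be realized is shown below; the classifier, feature dictionary and names (classify_application, theta) are illustrative assumptions rather than the authors' code.

```python
from typing import Callable, Dict, List

def baseline_detect(app_features: Dict[str, list],
                    classify_application: Callable[[list], int],
                    theta: float = 0.5) -> bool:
    """Voting-based baseline: one bug/no-bug classification per application.

    app_features maps an application name to its aggregated feature vector
    (performance counters, IPC, and microarchitecture design parameters).
    classify_application is a trained binary classifier returning 1 for "bug".
    """
    votes: List[int] = [classify_application(x) for x in app_features.values()]
    rho = sum(votes) / len(votes)   # fraction of applications voting "bug"
    return rho >= theta             # design judged buggy if rho reaches the threshold
```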
III. PROPOSED METHODOLOGY

A. Key Idea and Methodology Overview
The proposed methodology is composed of two stages. The first stage consists of a machine learning model trained with performance counter data from bug-free microarchitectures to infer microprocessor performance in terms of IPC. If such a model is well trained and achieves high accuracy, it will capture the correlation between the performance counters and IPC in bug-free microarchitectures. When this model is then applied on designs with bugs, a significant increase in inference errors is expected, as the bugs ruin the correlation between the counters and IPC. The second stage determines bug existence from the IPC inference errors in the first stage. An overview of the two-stage methodology is depicted in Figure 2. Although in this work we evaluate the methodology using only microarchitecture simulations, its key idea is generic and can be extended to FPGA prototyping or post-silicon testing. The applicability in both pre- and post-silicon validation is a strong point of our methodology, especially as some bugs might be triggered by traces that are too long to be simulated.
Fig. 2: Overview of the proposed methodology.

In this methodology, a microbenchmark and a set of selected performance counters form a performance probe. Probe design is introduced in Section III-B. The machine learning model is applied on individual probes for IPC inference, which is described in Section III-C. The classification in stage two is based on the errors from multiple probes, and is elaborated in Section III-D.
B. Performance Probe Design

1) SimPoint-Based Microbenchmark Extraction
Finding the right applications to use in "probing" the microarchitecture for performance bugs is critical to the efficiency and efficacy of the proposed methodology. One possible approach could be to meticulously hand generate an exhaustive set of microbenchmarks to test all the interacting components of the microarchitecture under all possible instruction orderings, branch paths, dependencies, etc. This approach is similar to that taken in some prior works [16], [17], [38]. While this approach would likely be possible, automating the process as much as possible is highly desirable, given the overheads already required for verification and validation. Thus, another key goal of our work is to automatically find and select short, orthogonal and relevant performance microbenchmarks to use in the microarchitecture probing.

Typically in computer architecture research, large, orthogonal suites of workloads, such as the SPEC CPU suites [2], [3], are used for performance benchmarking. The applications in these suites, however, typically are too large to simulate to completion. Thus, development of statistically accurate means to shorten benchmark runtimes has received significant attention [12], [48], [56]. One notable work was SimPoints [24], [25], [48], wherein the authors propose to identify short, performance-orthogonal segments of long running applications. These segments are simulated separately and the whole application performance is estimated as a weighted average of the performance of those points. In this work, we propose a novel use of the SimPoints methodology. Instead of using them as intended, for performance estimation of large applications, we propose to use them to automatically identify and extract short, orthogonal, performance-relevant microbenchmarks from long running applications. Here we are leveraging SimPoints' identification of orthogonal basic-block vectors in the given program to be the source of our orthogonal microbenchmarks.

As an example of how this provides greater visibility into performance bugs, consider Figure 3. In the figure, we compare the performance per SimPoint of the bug-free and
Bug 1 versions of Skylake (described in Section I) for the benchmark 403.gcc. We see that, although the overall difference in the whole application is less than 1%, when we split the performance by SimPoint, one of the SimPoints shows a much larger difference, making
Bug 1's behavior much easier to identify as incorrect. The reason is that this SimPoint has more xor operations than the others, accounting for 2.3% of all instructions executed here. We argue that, even though the overall impact on the performance of 403.gcc is very low, this particular SimPoint represents well a class of applications with similar behavior that would not be represented by looking at any single benchmark in the SPEC CPU suite. Thus, by utilizing SimPoints individually, more performance bug coverage can be gained than by looking at full application performance.

Fig. 3: IPC by SimPoints in 403.gcc for the Skylake architecture (Bug-Free and Bug 1).
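SimPoint's interval selection is, at its core, k-means clustering over basic-block vectors (BBVs). The sketch below, assuming a precomputed BBV matrix and scikit-learn, illustrates that idea; it omits the dimensionality reduction and BIC-based choice of k used by the actual SimPoint tool, and the function name pick_simpoints is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_simpoints(bbv: np.ndarray, k: int = 20, seed: int = 0):
    """Pick one representative interval per cluster of basic-block vectors.

    bbv: (num_intervals, num_basic_blocks) matrix; each row is the execution
    frequency profile of one fixed-length instruction interval.
    Returns (representative_interval_ids, weights), where each weight is the
    fraction of intervals assigned to that representative's cluster.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(bbv)
    reps, weights = [], []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        # representative = interval closest to the cluster centroid
        d = np.linalg.norm(bbv[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(d)]))
        weights.append(len(members) / len(bbv))
    return reps, weights
```

In the proposed methodology, each representative interval is then extracted as an independent microbenchmark probe rather than being recombined through the weights.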
2) Performance Counter Selection
Performance counter data from microbenchmark simulation constitute the main features for our machine learning models to infer overall performance. Hence, they are essential components of the performance probes. We note that, as part of the design cycle, performance counters often undergo explicit validation. The assumption of their sanity is central to performance debug in general, not only to this methodology.

Microprocessors typically contain hundreds to thousands of performance counters. Using all of them as features would unnecessarily extend training and inference time for the machine learning models, and could even degrade them as some useless counters may blur the picture. Hence, we select a small subset of them for each microbenchmark that is sufficient for effective machine learning inference on that workload. We note that the abundance of counters, along with our selection method, makes our methodology resilient to small changes in the counters available across different generations of architecture designs. The selection is carried out in two steps.
• Step 1: The Pearson correlation coefficient is evaluated for each counter with respect to IPC. Only those counters with high correlation with IPC (magnitude greater than 0.7) are retained and the others are pruned out.
• Step 2: Among the remaining counters, the correlation between every pair of them is evaluated. If two counters have a Pearson correlation coefficient whose magnitude is greater than 0.95, we consider them redundant and one of them is pruned out.

Counter selection is conducted independently for each probe and thus different probes may use different counters. Examples of the most commonly selected counters are: the number of fetched instructions, percentage of branch instructions, number of writes to registers, percentage of correctly predicted indirect branches, and number of cycles on which the maximum possible number of instructions was committed, among others.
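A minimal pandas sketch of the two-step selection above, assuming the counter traces for one probe are held in a data frame with one column per counter; the thresholds follow the text (0.7 against IPC, 0.95 between counters), while the data layout and function name are assumptions.

```python
import pandas as pd

def select_counters(df: pd.DataFrame, ipc: pd.Series,
                    ipc_thresh: float = 0.7, redund_thresh: float = 0.95):
    """df: one row per time step, one column per performance counter."""
    # Step 1: keep counters whose |Pearson r| with IPC exceeds ipc_thresh.
    corr_ipc = df.corrwith(ipc).abs()
    kept = [c for c in df.columns if corr_ipc[c] > ipc_thresh]

    # Step 2: among the survivors, drop one counter of every pair whose
    # mutual |Pearson r| exceeds redund_thresh (redundant counters).
    selected = []
    for c in kept:
        if all(abs(df[c].corr(df[s])) <= redund_thresh for s in selected):
            selected.append(c)
    return selected
```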
C. Stage 1: Performance Modelling
Stage 1 of our methodology models overall processor performance (IPC) via machine learning. Among the innumerable possible ways to perform such a task, we elaborately develop one with the following key ingredients:
1) The training data are taken from (presumably) bug-free designs. Otherwise, the inference errors between bug-free designs and bug cases cannot be differentiated.
2) One model is trained for each probe. As the workload characteristics of different probes can be drastically different, it is very difficult (and also unnecessary) for a single model to capture them all. A separate model for each probe is necessary to achieve high accuracy for bug-free designs. This is a key difference from previous methods of performance counter-based prediction, where a single model is derived to work across many different workloads [7], [13], [32], [57]. By tuning our model to the particular counters relevant to a particular probe workload, much higher accuracy can be achieved.
3) Models are trained with a variety of microarchitectures. This ensures that inference error differences are due to the existence of bugs instead of microarchitecture differences. In practice, the training is performed on legacy designs and the model inference would be on a new microarchitecture.
4) The input features are taken as a time series. The simulation (or running system) proceeds over a series of time steps t_1, t_2, .... Let the current time step be t_i; then the model takes the feature data of t_{i−w+1}, t_{i−w+2}, ..., t_i as input to estimate IPC at t_i, where w is the window size. Usually, a time step consists of multiple clock cycles, e.g., a half million cycles. If one intends to take the aggregated data of an entire SimPoint as input, that is a special case of our framework, where the step size is the entire SimPoint and the window size is 1.

Besides the selected counters discussed in Section III-B2, some microarchitecture design parameters are used as features for our machine learning model. They include clock cycle, pipeline width, re-order buffer size and some cache characteristics such as latency, size and associativity. Please note that the parameter features are constant in the time series input for a specific microarchitecture.

The following machine learning engines are investigated for per-probe IPC inference in this work.

Linear Regression (Lasso): This is a simple linear regression algorithm [49], of the form y = x^T w, where w is the vector of parameters. It uses L1 regularization on the values of w. The main advantage of this model is its simplicity.

Multi-Layer Perceptron (MLP): This classical neural network consists of multiple layers of perceptrons. Each perceptron takes a linear combination of signals from the previous layer and generates an output via a non-linear activation function [28].

Convolutional Neural Network (CNN): This neural network architecture contains convolution operations and the parameters of its convolution kernel are trained with data [36]. In this work, the features are taken as a 1D vector as described by Eren et al. [20] and Lee et al. [37], as opposed to 2D images.

Long Short-Term Memory Network (LSTM): This is a form of recurrent neural network, where the output depends not only on the input but also on internal states [27]. The states are useful in capturing history effects and make LSTM effective for handling time series signals.

Gradient Boosted Trees (GBT): This is a decision tree-based approach, where a branch node in a tree partitions the feature space and the leaf nodes give the inference results. The overall result is an ensemble of results from multiple trees [21]. In this work, we adopt XGBoost [10], which is the latest generation of the GBT approach.

Consider a set of probes P = {p_1, p_2, ...}. A trained model is applied to infer the IPC ŷ_{i,j} for time step t_j of probe p_i for a particular design, while the corresponding IPC by simulation is y_{i,j}. For each probe p_i ∈ P, the inference error is

∆_i = (1/2) Σ_{j=2}^{T_i} ( |y_{i,j} − ŷ_{i,j}| + |y_{i,j−1} − ŷ_{i,j−1}| )   (1)

where T_i is the number of time steps in p_i. This error can be interpreted as the area of the difference between the inferred IPC over time and the actual (simulated) IPC. It is also approximately equal to the total absolute error. The set of errors {∆_1, ∆_2, ..., ∆_|P|} is fed to stage 2 of our methodology. Empirically, this error metric outperforms others, such as Mean Squared Error (MSE), in the bug classification of stage 2. An advantage of error metric (1) is that a large error in a single (or a few) time step(s) is not averaged out, as it is with MSE.

Our machine learning models are not intended to be golden performance models. Instead, each model captures the complex relationship between counters and performance for the workload in exactly one probe. Their comparison against the simulated performance for that particular probe is not to determine if performance is satisfactory; rather, it is to attempt to detect if the model accuracy is ruined, as this might signal that the relations between counters and performance are broken, and therefore a performance bug is likely. The usage of machine learning is essential for bug detection, as it overcomes the difficulty of obtaining a single, accurate, universal bug-free performance model.
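The per-probe error of Equation (1) can be computed directly from the simulated and inferred IPC series; a short numpy sketch is given below, with array and function names assumed for illustration.

```python
import numpy as np

def probe_error(y_sim: np.ndarray, y_inf: np.ndarray) -> float:
    """Equation (1): trapezoid-style area between simulated and inferred IPC.

    y_sim, y_inf: per-time-step IPC for one probe (length T_i).
    """
    abs_err = np.abs(y_sim - y_inf)                  # |y_ij - y_hat_ij| per step
    # sum over j = 2..T_i of (|e_j| + |e_{j-1}|) / 2
    return float(0.5 * (abs_err[1:] + abs_err[:-1]).sum())
```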
D. Stage 2: Bug Detection

Given a vector [∆_1, ∆_2, ..., ∆_|P|] of IPC inference errors from stage 1, the purpose of stage 2 is to identify if a bug exists in the corresponding microarchitecture. In general, a small error ∆_i indicates that the machine learning model in stage 1, which is trained with bug-free designs, matches the design under test well and hence there are likely no bugs (with respect to that probe). On the other hand, a large error ∆_i indicates a large likelihood of bugs. Bugs can reside at different locations and manifest in a variety of ways. Therefore, multiple orthogonal probes are necessary to improve the breadth and robustness of the detection, as highlighted in the case of Bug 1 in Figure 3.

Although we considered using a machine learning classifier for stage 2, we found the available training data for this stage to be much more scarce than that of stage 1. For stage 1, every time step of the collected data represents a data sample, making thousands of samples available. In contrast, only one data sample per microarchitecture is available for stage 2, making only a few dozen samples available. To overcome the lack of data available for training we developed a custom rule-based classifier.

Suppose there are m positive samples (with bugs) and n negative samples (bug-free). For each probe p_i ∈ P, we compute the mean µ+_i and standard deviation σ+_i of the IPC inference errors among the m positive samples. Similarly, we obtain µ−_i and σ−_i among the n negative samples for each p_i ∈ P. These statistics form a reference to evaluate the IPC inference errors for a new architecture. For the error ∆'_i of applying probe p_i on the new architecture, we evaluate the following ratios based on the statistics of the labeled data:

γ+_i = ∆'_i / (µ+_i + α σ+_i),   γ−_i = ∆'_i / (µ−_i + α σ−_i)   (2)

where α is a parameter trained with the labeled data. In general, a large value of ∆'_i signals a high probability of bugs in the new microarchitecture. The ratios γ+_i and γ−_i are relative to the labeled data, and make the errors from different probes comparable. Given the vectors [γ+_1, γ+_2, ..., γ+_|P|] and [γ−_1, γ−_2, ..., γ−_|P|], our classifier is a rule-based recipe as follows.
1) If max(γ+_1, γ+_2, ..., γ+_|P|) > η, where η is a parameter, the architecture is classified as having bugs. This is for the case where a large error appears in at least one probe, regardless of the errors from the other probes.
2) If (γ−_1 + γ−_2 + ... + γ−_|P|) / |P| > λ, where λ < η is a parameter, the architecture is classified as having bugs. This is for the case where small errors appear on many probes.
3) For other cases, the architecture is classified as bug-free.

While the values of η and λ are empirically chosen as 15 and 5, respectively, the value of α is trained according to the true positive rate and false positive rate on the labeled data. In the training, a set of uniformly spaced values in a range is tested for α, and the one with the maximum true positive rate and a false positive rate no greater than 0.25 is chosen.
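For concreteness, a sketch of the rule-based classifier follows; the per-probe statistics and the parameters alpha, eta and lambda mirror the description above, with alpha assumed to be already fitted on the labeled data and all names chosen for illustration.

```python
import numpy as np

def detect_bug(errors, mu_pos, sd_pos, mu_neg, sd_neg,
               alpha: float, eta: float = 15.0, lam: float = 5.0) -> bool:
    """Rule-based stage-2 classifier over per-probe IPC inference errors.

    errors, mu_pos, sd_pos, mu_neg, sd_neg: arrays of length |P| (one per probe),
    where mu/sd are statistics of the labeled buggy (+) and bug-free (-) designs.
    """
    errors = np.asarray(errors, dtype=float)
    gamma_pos = errors / (np.asarray(mu_pos) + alpha * np.asarray(sd_pos))  # Eq. (2)
    gamma_neg = errors / (np.asarray(mu_neg) + alpha * np.asarray(sd_neg))

    if gamma_pos.max() > eta:      # Rule 1: a very large error on at least one probe
        return True
    if gamma_neg.mean() > lam:     # Rule 2: moderately large errors on many probes
        return True
    return False                   # Rule 3: otherwise classified as bug-free
```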
IV. EXPERIMENTAL SETUP

In this section we elaborate on the details of the implementation of our methodology. Sections IV-A to IV-C cover the experimental setup for processor core performance bugs; since this is the main focus of our work, we provide detailed explanations. Section IV-D lists the changes implemented for memory system performance bug detection.
A. Probe Setup
The performance probes are extracted from 10 applications from the SPEC CPU2006 benchmark suite [2] using the SimPoints methodology [25], as described in Section III-B1. We chose to use SPEC CPU2006 instead of the more recent SPEC CPU2017 suite because of their relatively smaller memory footprints, which generally reduce the running times and computational resources needed to detect performance bugs, though there is no reason our techniques would not work with SPEC CPU2017 applications. Unlike developing performance improvement techniques, where the benchmarks serve as testcases and should be exhaustively used, the selected applications here are part of our methodology implementation. In fact, the selection affects the tradeoff between the efficacy of detection and the runtime overhead of the methodology, as we will show.

There are 190 SimPoints in total for the ten SPEC CPU2006 applications we use here, and each SimPoint contains around 10M instructions. The applications we selected in particular were an arbitrary set of the first 10 we were able to compile and run in gem5 across all microarchitectures. Adding more benchmarks would only improve the results we achieve. A list of the applications is provided in Table I.

For each probe, between 4 and 64 performance counters are selected using the methodology detailed in Section III-B2. The time step size is 500k clock cycles. That is, for each counter, a value is recorded every 500k clock cycles. The values of all time steps form the time series used as input features to the machine learning models. As the time step size is sufficiently large, the default window size (see Section III-C) is one time step.

TABLE I: Selected SPEC CPU2006 benchmarks.
Benchmark Operand Type Application SimPoints
B. Simulated Architectures
All experiments are performed via gem5 [6] simulations in System Emulation mode. Note, the key ideas of our methodology can be implemented in other simulators and in post-silicon debug. Here we are focused on core performance bugs, thus we use the Out-Of-Order core (O3CPU) model with the x86 ISA. Gem5 is configured to model different architectures by varying several configuration variables such as clock period, re-order buffer size, cache size, associativity, latency, and number of levels, as well as functional unit count, latency, and port organization. Besides these common knobs, other microarchitectural differences between designs are not considered. The configurations model eight different existing microarchitectures: Intel Broadwell, Cedarview, Ivybridge, Skylake and Silvermont, AMD Jaguar, K8, and K10, as well as 12 artificial microarchitectures with realistic settings. With these, we achieve a wide variety of architectures, from low-power systems such as Silvermont to high-performance server systems such as Skylake. Detailed information on the specifications used for each architecture can be found in Table II. Table III describes the port organization for each of the architectures. These microarchitectures serve as training data for the machine learning models and unseen data for testing the models. They are partitioned into four disjoint sets as follows.
• Set I: Contains two real and seven artificial microarchitectures. They are used to train our IPC models.
• Set II: Three microarchitectures in this set, a real one and two artificial ones, play the role of validation in training the IPC models. That is, a training process terminates after 100 epochs of training without improvement on this set. In addition, this set serves as labeled data for training the classifier in stage 2.
• Set III: Four microarchitectures, one real and three artificial, are used as additional training data for the stage 2 classifier. As such, the training data for stage 2 is composed of sets II and III. This is to decouple the training data between stage 1 and stage 2.
• Set IV: Four microarchitectures in this set are for the testing of the stage 2 classifier. As stage 2 is the eventual bug detection, all four microarchitectures here are real ones so as to ensure the overall testing is realistic.
C. Bug Development
To the best of our knowledge, gem5 is a performance bug-free simulator. To achieve wide bug representation we reviewed errata of commercial processors, consulted with industry experts and tried to cover as many units as possible. Ultimately, 14 basic types of bugs are developed, summarized as follows.
1) Serialize X: Every instruction with opcode X is marked as a serialized instruction. This causes all future instructions (according to program order) to be stalled until the instruction with the bug has been issued.
2) Issue X only if oldest: An instruction whose opcode is X will only be retired from the instruction queue when it has become the oldest instruction there. This bug is similar to "POPCNT Instruction may take longer to execute than expected" found on Intel Xeon Processors [31].
3) If X is oldest, issue only X: When an instruction whose opcode is X becomes the oldest in the instruction queue, only that instruction will be issued, even though other instructions might also be ready to be issued and the computation resources would allow it.
4) If X depends on another instruction Y, delay T cycles.
5) If less than N slots are available in the instruction queue, delay T cycles.
6) If less than N slots are available in the re-order buffer, delay T cycles.
7) If a branch is mispredicted, delay T cycles.
8) If N stores to a cache line, delay T cycles: After N stores have been executed to the same cache line, upcoming stores to the same line will be delayed by T cycles. This is a variation of the "Store gathering/merging performance issue" found on the NXP MPC7448 RISC processor [45].
9) After N stores to the same register, delay T cycles: This bug is inspired by "GPMC may Stall After 256 Write Accesses in NAND DATA, NAND COMMAND, or NAND ADDRESS Registers" found on the TI AM3517, AM3505 ARM processor [54]; we generalized it for any physical register, and in our case the instruction is only stalled for a few cycles, as opposed to the actual bug, where the processor hanged. Another variation of this bug is implemented where the delay is applied once every N stores, instead of on every store after the N-th.
10) L2 cache latency is increased by T cycles: This bug is inspired by the "L2 latency performance issue" on the NXP MPC7448 RISC processor [45].
11) Available registers reduced by N.
12) If branch longer than N bytes, delay T cycles.
13) If X uses register R, delay T cycles: This bug is a variation of the "POPA/POPAD Instruction Malfunction" found on the Intel 386 DX [30].
14) Branch predictor's table index function issue, reducing the effective table size by N entries.

For each of these bug types, multiple variants are implemented by changing
X, Y, N, R and T, respectively. The bugs are grouped into four categories according to their impact on IPC: High means an average IPC degradation (across the used SPEC CPU2006 applications) of 10% or greater. A Medium impact means the average IPC is degraded between 5% and 10%, a Low impact is between 1% and 5%, and a Very-Low impact is less than 1%. Figure 4 shows the distribution of the severity of average IPC impact for the implemented bugs.

Fig. 4: Distribution of bug severity.

TABLE II: Architectural knobs for implemented architectures.

Set | Architecture | Clock Cycle | CPU Width | ROB Size | L1 Cache (Size / Assoc. / Latency) | L2 Cache (Size / Assoc. / Latency) | L3 Cache (Size / Assoc. / Latency) | Func. Unit Latency (FP / Multiplier / Divider)
I | Broadwell | 4.0GHz | 4 | 192 | 32kB / 8-way / 4 cycles | 256kB / 8-way / 12 cycles | 64MB / 16-way / 59 cycles | 5 cycles / 3 cycles / 20 cycles
I | Cedarview* | 1.8GHz | 2 | 32 | 32kB / 8-way / 3 cycles | 512kB / 8-way / 15 cycles | No L3 | 5 cycles / 4 cycles / 30 cycles
I | Jaguar | 1.8GHz | 2 | 56 | 32kB / 8-way / 3 cycles | 2MB / 16-way / 26 cycles | No L3 | 4 cycles / 3 cycles / 20 cycles
I | Artificial 2 | 4.0GHz | 8 | 168 | 32kB / 2-way / 5 cycles | 256kB / 8-way / 16 cycles | No L3 | 4 cycles / 4 cycles / 20 cycles
I | Artificial 3 | 3.0GHz | 8 | 32 | 32kB / 2-way / 3 cycles | 512kB / 16-way / 24 cycles | 8MB / 32-way / 52 cycles | 4 cycles / 4 cycles / 20 cycles
I | Artificial 4 | 4.0GHz | 2 | 192 | 64kB / 8-way / 3 cycles | 1MB / 8-way / 20 cycles | 32MB / 16-way / 28 cycles | 5 cycles / 3 cycles / 20 cycles
I | Artificial 6 | 3.5GHz | 4 | 192 | 64kB / 8-way / 4 cycles | 1MB / 8-way / 16 cycles | 8MB / 32-way / 36 cycles | 4 cycles / 4 cycles / 20 cycles
I | Artificial 7 | 3.0GHz | 4 | 32 | 16kB / 8-way / 3 cycles | 512kB / 16-way / 12 cycles | 32MB / 32-way / 28 cycles | 2 cycles / 7 cycles / 69 cycles
I | Artificial 10 | 1.5GHz | 8 | 32 | 32kB / 2-way / 2 cycles | 256kB / 16-way / 24 cycles | 64MB / 32-way / 36 cycles | 5 cycles / 4 cycles / 30 cycles
I | Artificial 11 | 3.5GHz | 4 | 32 | 64kB / 4-way / 5 cycles | 256kB / 4-way / 24 cycles | No L3 | 5 cycles / 4 cycles / 30 cycles
II | Ivybridge | 3.4GHz | 4 | 168 | 32kB / 8-way / 4 cycles | 256kB / 8-way / 11 cycles | 16MB / 16-way / 28 cycles | 5 cycles / 3 cycles / 20 cycles
II | Artificial 0 | 2.5GHz | 4 | 192 | 64kB / 2-way / 4 cycles | 512kB / 4-way / 12 cycles | No L3 | 5 cycles / 3 cycles / 20 cycles
II | Artificial 9 | 3.5GHz | 8 | 192 | 16kB / 4-way / 5 cycles | 1MB / 4-way / 20 cycles | 64MB / 16-way / 44 cycles | 4 cycles / 3 cycles / 11 cycles
III | Artificial 1 | 1.5GHz | 4 | 192 | 64kB / 8-way / 5 cycles | 2MB / 8-way / 16 cycles | No L3 | 4 cycles / 3 cycles / 11 cycles
III | Artificial 5 | 3.5GHz | 2 | 32 | 32kB / 4-way / 5 cycles | 256kB / 4-way / 16 cycles | 8MB / 32-way / 44 cycles | 4 cycles / 3 cycles / 11 cycles
III | Artificial 8 | 3.0GHz | 2 | 192 | 32kB / 2-way / 2 cycles | 1MB / 16-way / 16 cycles | 32MB / 32-way / 52 cycles | 4 cycles / 3 cycles / 11 cycles
IV | K8 | 2.0GHz | 3 | 24 | 64kB / 2-way / 4 cycles | 512kB / 16-way / 12 cycles | No L3 | 4 cycles / 3 cycles / 11 cycles
IV | K10 | 2.8GHz | 3 | 24 | 64kB / 2-way / 4 cycles | 512kB / 16-way / 12 cycles | 6MB / 16-way / 40 cycles | 4 cycles / 3 cycles / 11 cycles
IV | Silvermont | 2.2GHz | 2 | 32 | 32kB / 8-way / 3 cycles | 1MB / 16-way / 14 cycles | No L3 | 2 cycles / 7 cycles / 69 cycles
IV | Skylake | 4.0GHz | 4 | 256 | 32kB / 8-way / 4 cycles | 256kB / 4-way / 12 cycles | 8MB / 16-way / 34 cycles | 4 cycles / 4 cycles / 20 cycles
* Implements a Cedarview-like superscalar architecture but assumes out-of-order completion.

TABLE III: Port organization of implemented architectures.

Set | Architecture | Port 0 | Port 1 | Port 2 | Port 3 | Port 4 | Port 5 | Port 6
I | Broadwell | 1 ALU, 1 FP Mult, 1 FP Unit, 1 Int Vector, 1 Int Mult, 1 Divider, 1 Branch Unit | 1 ALU, 1 Vector Unit, 1 FP Mult, 1 Int Mult | 1 Load Unit | 1 Load Unit | 1 Store Unit | 1 ALU, 1 Vector Unit | 1 ALU, 1 Branch Unit
I | Cedarview | 1 ALU, 1 Load Unit, 1 Store Unit, 1 Vector Unit, 1 Int Mult, 1 Divider | 1 ALU, 1 Vector Unit, 1 FP Unit, 1 Branch Unit | 1 Load Unit | 1 Store Unit | - | - | -
I | Jaguar | 1 ALU, 1 Vector Unit | 1 ALU, 1 Vector Unit | 1 FP Unit, 1 Int Mult | 1 FP Mult, 1 Divider | 1 Load Unit | 1 Store Unit | -
I | Artificial 2 | 1 ALU, 1 Vector Unit, 1 FP Unit, 1 Int Mult, 1 Divider, 1 Branch Unit | 1 ALU, 1 Vector Unit, 1 FP Mult, 1 FP Unit, 1 Int Mult | 1 Load Unit | 1 Load Unit | 1 Store Unit | 1 ALU, 1 Vector Unit | 1 ALU, 1 Branch Unit
I | Artificial 3 | 1 ALU, 1 Vector Unit, 1 FP Unit, 1 Int Mult, 1 Divider, 1 Branch Unit | 1 ALU, 1 Vector Unit, 1 FP Mult, 1 FP Unit, 1 Int Mult | 1 Load Unit | 1 Load Unit | 1 Store Unit | 1 ALU, 1 Vector Unit | 1 ALU, 1 Branch Unit
I | Artificial 4 | 1 ALU, 1 FP Mult, 1 FP Unit, 1 Int Vector, 1 Int Mult, 1 Divider, 1 Branch Unit | 1 ALU, 1 Vector Unit, 1 FP Mult, 1 Int Mult | 1 Load Unit | 1 Load Unit | 1 Store Unit | 1 ALU, 1 Vector Unit | 1 ALU, 1 Branch Unit
I | Artificial 6 | 1 ALU, 1 Vector Unit, 1 FP Unit, 1 Int Mult, 1 Divider, 1 Branch Unit | 1 ALU, 1 Vector Unit, 1 FP Mult, 1 FP Unit, 1 Int Mult | 1 Load Unit | 1 Load Unit | 1 Store Unit | 1 ALU, 1 Vector Unit | 1 ALU, 1 Branch Unit
I | Artificial 7 | 1 Load Unit, 1 Store Unit | 1 ALU, 1 Integer Mult | 1 ALU, 1 Branch Unit | 1 FP Mult, 1 Divider | 1 FP Unit | - | -
I | Artificial 10 | 1 ALU, 1 Load Unit, 1 Store Unit, 1 Vector Unit, 1 Int Mult, 1 Divider | 1 ALU, 1 Vector Unit, 1 FP Unit, 1 Branch Unit | 1 Load Unit | 1 Store Unit | - | - | -
I | Artificial 11 | 1 ALU, 1 Load Unit, 1 Store Unit, 1 Vector Unit, 1 Int Mult, 1 Divider | 1 ALU, 1 Vector Unit, 1 FP Unit, 1 Branch Unit | 1 Load Unit | 1 Store Unit | - | - | -
II | Ivybridge | 1 ALU, 1 Vector Unit, 1 FP Mult, 1 Divider | 1 ALU, 1 Vector Unit, 1 Int Mult, 1 FP Unit | 1 Load Unit | 1 Load Unit | 1 Store Unit | 1 ALU, 1 Vector Unit, 1 Branch Unit, 1 FP Unit | -
II | Artificial 0 | 1 ALU, 1 FP Mult, 1 FP Unit, 1 Int Vector, 1 Int Mult, 1 Divider, 1 Branch Unit | 1 ALU, 1 Vector Unit, 1 FP Mult, 1 Int Mult | 1 Load Unit | 1 Load Unit | 1 Store Unit | 1 ALU, 1 Vector Unit | 1 ALU, 1 Branch Unit
II | Artificial 9 | 1 ALU, 1 Vector Unit, 1 Int Mult | 1 ALU, 1 Vector Unit | 1 ALU, 1 Vector Unit | 1 Load Unit | 1 Store Unit | 1 FP Unit | 1 FP Unit
III | Artificial 1 | 1 ALU, 1 Vector Unit, 1 Int Mult | 1 ALU, 1 Vector Unit | 1 ALU, 1 Vector Unit | 1 Load Unit | 1 Store Unit | 1 FP Unit | 1 FP Unit
III | Artificial 5 | 1 ALU, 1 Vector Unit, 1 Int Mult | 1 ALU, 1 Vector Unit | 1 ALU, 1 Vector Unit | 1 Load Unit | 1 Store Unit | 1 FP Unit | 1 FP Unit
III | Artificial 8 | 1 ALU, 1 Vector Unit, 1 Int Mult | 1 ALU, 1 Vector Unit | 1 ALU, 1 Vector Unit | 1 Load Unit | 1 Store Unit | 1 FP Unit | 1 FP Unit
IV | K8 | 1 ALU, 1 Vector Unit, 1 Int Mult | 1 ALU, 1 Vector Unit | 1 ALU, 1 Vector Unit | 1 Load Unit | 1 Store Unit | 1 FP Unit | 1 FP Unit
IV | K10 | 1 ALU, 1 Vector Unit, 1 Int Mult | 1 ALU, 1 Vector Unit | 1 ALU, 1 Vector Unit | 1 Load Unit | 1 Store Unit | 1 FP Unit | 1 FP Unit
IV | Silvermont | 1 Load Unit, 1 Store Unit | 1 ALU, 1 Integer Mult | 1 ALU, 1 Branch Unit | 1 FP Mult, 1 Divider | 1 FP Unit | - | -
IV | Skylake | 1 ALU, 1 Vector Unit, 1 FP Unit, 1 Int Mult, 1 Divider, 1 Branch Unit | 1 ALU, 1 Vector Unit, 1 FP Mult, 1 FP Unit, 1 Int Mult | 1 Load Unit | 1 Load Unit | 1 Store Unit | 1 ALU, 1 Vector Unit | 1 ALU, 1 Branch Unit
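To illustrate how the configurable bug types listed above can be modeled, the sketch below injects bug type 3 ("if X is oldest, issue only X") into a highly simplified issue-stage model; the queue representation, opcode name and function are illustrative assumptions, not gem5 code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Inst:
    opcode: str
    ready: bool   # operands available and a functional unit is free

def select_to_issue(iq: List[Inst], issue_width: int,
                    bug_opcode: str = "xor", bug_enabled: bool = True) -> List[Inst]:
    """Pick instructions to issue from an oldest-first instruction queue.

    With the bug enabled, whenever the oldest entry has bug_opcode, only that
    instruction may issue this cycle, even if other ready instructions exist.
    """
    ready = [i for i in iq if i.ready]
    if bug_enabled and iq and iq[0].opcode == bug_opcode:
        return [iq[0]] if iq[0].ready else []   # buggy behavior: starve the rest
    return ready[:issue_width]                   # bug-free: issue up to the width
```

The parameters X, N, R and T in the other bug types play the same role as bug_opcode here: knobs that let each bug be tuned to an arbitrary level of IPC impact.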
D. Bug Detection in Memory Systems
We also examine the performance of the proposed methodology on areas outside of the processor core, specifically on the processor cache memory system. Overall the methodology remains unchanged, but there are some minor differences vs. the evaluation of our technique on processor cores. Here we use the ChampSim [1] simulator instead of gem5 [6], because it provides a more detailed memory system simulation with relatively short simulation time. ChampSim was developed for rapid evaluation of microarchitectural techniques in the processor cache hierarchy.

The probes we use for this evaluation correspond to 22 SimPoints from seven applications of the SPEC CPU2006 [2] benchmark suite. The emulated architectures are Intel's Broadwell, Haswell, Skylake, Sandybridge, Ivybridge, Nehalem and AMD's K10 and Ryzen 7, as well as four artificial architectures. The developed bugs are summarized as follows.
1) When a cache block is accessed, the age counter for the replacement policy is not updated.
2) During a cache eviction, the policy evicts the most recently used block, instead of the least recently used.
3) After N load misses in the L1D cache, the read operation is delayed by T clock cycles. Another variant of this bug was implemented for the L2 cache.
4) Signature Path Prefetcher (SPP) [34] signatures are reset, making the prefetcher use the wrong address.
5) On lookahead prefetching, the path with the least confidence is selected.
6) Some prefetches are incorrectly marked as executed. This bug was found in the SPP [34] prefetching method.

Since IPC might be affected by many other factors outside the memory system, we evaluate our methodology by using Average Memory Access Time (AMAT) as the target metric for the performance models. Given the differences between simulators, architectures and bugs, the rules for stage 2 were slightly modified for this evaluation; however, the overall methodology remains unchanged.

V. EVALUATION
Since the main focus of our work is on the processor core, Sections V-A to V-H examine performance bugs in that unit. Section V-I expands the examination to memory system bugs.
A. Machine Learning-Based IPC Modeling
In this section, we show the results for the first stage of our methodology, IPC modelling. We evaluated the performance of several machine learning methods, as discussed in Section III-C, and different variations of each of them. The MLP, CNN and LSTM networks are implemented using the Keras library [11]. In training all these models, Mean Squared Error (MSE) is used as the loss function and Adam [35] is employed as the optimizer. Gradient clipping is enforced to avoid the gradient explosion issue commonly seen in training recurrent networks. The gradient boosted trees are implemented via the XGBoost library [10]. Lasso is implemented using Scikit-Learn [47].

The total (all probes) training and inference runtime, as well as the inference error (Eq. 1), for each IPC modelling technique are shown in Table IV. The name of each neural network-based method (LSTM, CNN and MLP) is prefixed with a number indicating the number of hidden layers, and postfixed with the number of neurons in each hidden layer. The postfix number for a GBT method is the number of trees. Runtime results are measured on an Intel Xeon E5-2697A V4 processor with a frequency of 2.6GHz and 512GB memory (this machine has no GPU, and GPU acceleration could significantly reduce these runtimes). The total simulation time for detecting bugs in one new microarchitecture is about 6 hours if the simulations are not executed in parallel.
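As a concrete illustration of this setup, the following sketch trains one per-probe GBT regressor with XGBoost; the hyperparameters and array names are assumptions consistent with the GBT-250 configuration described in the text, not the exact code used in the experiments.

```python
import numpy as np
import xgboost as xgb

def train_probe_model(X_train: np.ndarray, y_train: np.ndarray,
                      n_trees: int = 250) -> xgb.XGBRegressor:
    """Train the stage-1 IPC model for a single probe.

    X_train: (time_steps, num_features) selected counters plus design
    parameters, gathered from the bug-free training microarchitectures.
    y_train: simulated IPC per time step (the regression target).
    """
    model = xgb.XGBRegressor(n_estimators=n_trees, objective="reg:squarederror")
    model.fit(X_train, y_train)
    return model

# At test time, the model's inferred IPC is compared against the simulated IPC
# of the design under test to obtain the per-probe error of Equation (1).
```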
TABLE IV: IPC modelling runtime and error statistics.

ML Model | Training Runtime | Inference Runtime | Inference Error (Eq. (1)): Average / Std. Dev. / Median / 90th Perc.
Lasso | 0h 8m | 3m 06s | 10.2749 / 2.1598 / 10.3995 / 13.1740
 | 3h 22m | 16m 21s |
 | 4h 17m | 17m 28s |
 | 3h 39m | 24m 18s |
 | 4h 39m | 54m 33s |
 | 2h 54m | 71m 20s |
 | 0h 19m | 08m 16s | 5.3959 / 5.3458 / 3.4171 / 17.1287
 | 0h 33m | 15m 53s | 5.7568 / 5.4779 / 3.5433 / 16.2600
 | 1h 38m | 08m 39s | 2.2506 / 1.5636 / 1.8200 / 4.2648
 | 1h 51m | 07m 01s | 2.1298 / 1.6228 / 1.6589 / 4.4684
 | 3h 19m | 10m 05s | 9.8390 / 66.6996 / 4.1207 / 6.6891
GBT-150 | 0h 30m | 5m 01s | 3.7928 / 2.1445 / 3.3095 / 6.3616
GBT-250 | 0h 38m | 4m 34s | 3.6181 / 2.0607 / 3.1275 / 6.1395
The rightmost four columns of Table IV summarize the inference errors defined in Equation (1). Please note that IPC inference is a regression task, not classification; thus, classification metrics, such as true positive and false positive rates, do not apply. The results are averaged across all probes for bug-free designs of the microarchitectures in Set IV. Median errors of the different ML engines are close to each other. LSTM has huge variance and average errors due to a few non-convergent outliers. Although the input features are a time series, where LSTM should excel, the actual LSTM errors here are no better than non-recurrent networks and sometimes much worse. There are two reasons for this. First, the time step size we chose is large enough that the history effect is already well accounted for within one time step, and thus the recurrence in LSTM becomes unnecessary. Second, LSTM is well known to be difficult to train, and the difficulty is confirmed by those outlier cases. In stage 2 of our methodology, LSTM results with huge errors are not used.

Figure 5 displays the IPC time series for three SimPoints (chosen to represent varied behavior) on the Skylake microarchitecture, where the red solid lines indicate the measured (simulated) IPCs, while the dotted/dashed lines are the IPC inferences of the different ML engines. In general, the ML models trace the IPCs very well. LSTM has relatively large errors in Figure 5a, yet it still shows a strong correlation with the simulated IPCs. In all three cases the results confirm the effectiveness of the machine learning models across various scenarios.

Fig. 5: Examples of ML-based IPC inference and simulated IPC on bug-free microarchitectures.

Although IPC inference accuracy for bug-free designs is important, the inference error difference between cases with and without bugs matters even more. Such a difference is illustrated for two SimPoints in Figure 6. In Figure 6a, where the microarchitecture is bug-free, GBT-250 estimates IPCs very accurately. When there is a bug, however, as shown in Figure 6b, the inference errors drastically increase. The same discrepancy is also exhibited between another bug-free design, Figure 6c, and a buggy design, Figure 6d. Both examples demonstrate that a significant loss of accuracy implies bug existence.
B. Bug Detection
In this section, we show the evaluation results for our stage 2 classification given the IPC inference errors from stage 1. The evaluation includes comparisons with the naïve single-stage baseline approach described in Section II. The probe designs of the baseline are the same as in our proposed methodology. Here we only use the GBT-250 engine, which has the best single-stage results. Its model features include simulated IPCs in addition to data from the selected counters and microarchitectural design parameters.
Fig. 6: IPC estimations by GBT-250: comparisons between microarchitectures with and without bugs.

However, the baseline uses a single value for each feature, aggregated over an entire SimPoint, instead of the time series. The bug detection efficacy is evaluated by the metrics below:

FPR = FP / N,   Precision = TP / (TP + FP),   TPR = TP / P   (3)

where N and P are the number of real negative (no bug) and positive (bug) cases, respectively. FP (False Positive) indicates the number of cases that are incorrectly classified as having bugs. TP (True Positive) is the number of cases that are correctly classified as having bugs. Additionally, ROC (Receiver Operating Characteristic) AUC (Area Under Curve) is evaluated. ROC shows the trade-off between TPR and FPR. The ROC AUC value varies between 0 and 1. A random guess would result in a ROC AUC of 0.5 and a perfect classifier can achieve 1. Accuracy, another common metric, is not employed here as the number of negative cases is too small, i.e., the testcases are imbalanced.
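The metrics of Equation (3), together with ROC AUC, can be computed from per-design labels and detection outcomes as sketched below; scikit-learn is assumed for the AUC, and the helper name detection_metrics is illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detection_metrics(y_true, y_pred, scores=None):
    """y_true/y_pred: 1 = bug, 0 = bug-free (assumes both classes are present);
    scores: optional continuous detector outputs used for ROC AUC."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    p, n = int((y_true == 1).sum()), int((y_true == 0).sum())
    out = {"TPR": tp / p, "FPR": fp / n,
           "Precision": tp / (tp + fp) if (tp + fp) else 1.0}   # Eq. (3)
    if scores is not None:
        out["ROC AUC"] = roc_auc_score(y_true, scores)
    return out
```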
TABLE V: Bug detection results. Here, Bug 1 is "If XOR is oldest in IQ, issue only XOR" and Bug 2 is "If ADD uses register 0, delay 10 cycles".

Bugs in presumed bug-free training / Stage 1 ML Model | FPR | TPR | ROC AUC | Precision | TPR for High / Medium / Low / Very-Low bug categories
No Bug / Single-stage baseline
No Bug / Lasso
No Bug / GBT-150
No Bug / GBT-250
Bug 1 / GBT-250
Bug 2 / GBT-250
In our testing scheme, we attempt to detect a bug in a new microarchitecture that is not included in the training data. Moreover, the training data does not include any bug of the same type as the one in the testing data, i.e., the bug in the new microarchitecture is completely unseen in the training. The data organization is as follows.
• Training data:
1) Data with positive labels: IPC inference errors of all probes for microarchitectures in sets II and III with bug insertion. For each microarchitecture here, all types of bugs, except the one to be used in testing, are separately inserted. In each case, only one bug is inserted.
2) Data with negative labels: IPC inference errors of all probes on bug-free versions of the microarchitectures in sets II and III.
• Testing data:
1) Designs with bugs: Microarchitectures in set IV with all variants of a bug type inserted, where this bug type is not included in the training data. In each case, only one bug is inserted.
2) Bug-free versions of the microarchitectures in set IV.

An example, which has 3 bug types (1, 2, 3) and 2 variants of each type (e.g., Bug 1.1 and Bug 1.2 are two variants of the same bug type), is shown in Figure 7.
Fig. 7: Example of training and testing data split.

Since previous architectures that are considered "bug-free" may actually have performance bugs, we also present results for the case when designs with a bug are presumed "bug-free" and are used for training.

The results are shown in Table V. Although the same stage 2 classifier of our methodology is used, several different ML engines are used in stage 1, as listed in the "Stage 1 ML Model" column. The leftmost column indicates whether bug-free or "buggy" designs were used for training.

When only bug-free designs are used, using GBT-250 in stage 1 produces the best result. It can detect medium and high impact bugs (>5% IPC impact) with a 98.5% true positive rate. When low impact bugs (>1% IPC impact) are additionally included, the true positive rate is still as high as 91.5%. The TPRs of GBT-250 beat the single-stage baseline in almost every bug category. It is also superior on ROC AUC. Meanwhile, it achieves a 0 false-positive rate and 1.0 precision.

The table also shows two cases where the models were trained using designs with a bug. These two cases correspond to bugs with a low average IPC impact across the studied workloads. We included these as representative cases, and argue that bugs with higher IPC impact will most likely be caught during post-silicon validation of previous designs and will be fixed by the time a new microarchitecture is being developed. To evaluate these, we use the GBT-250 model for stage 1 of the methodology as it was the best performing. As expected, these results show some degradation in detection. However, GBT-250 is still able to detect around 70% of the bugs, while incurring a few false-positives.

In Figure 8 we show several examples of how the ROC curve looks for different bug types when stage 1 of our method uses the GBT-250 model. Difficult-to-catch bugs usually have a lower ROC AUC, while other bugs with higher IPC impact can be detected without false-positives.

Fig. 8: ROC curves for GBT-250 on different bug types.
C. Number of Probes
There is a trade-off associated with the number of probes. More probes can potentially detect more bugs or reduce false positives, at the cost of higher runtime. To see how accuracy varies with probe count we examine the impact of reducing the number of probes. We perform this experiment in an iterative approach: every iteration we remove five probes, re-train the model with the reduced set and collect its detection metrics. We evaluated these results using the GBT-250 model. We perform the probe reduction experiment with two different orders:
1) Remove the probes with the highest error in IPC inference first. The insight behind this method is that, if a probe has a high IPC inference error, it is likely that the model did not learn from it properly.
2) Remove probes in random order. This case is equivalent to having fewer probes with which to test the design.

The results for both orderings are shown in Figure 9.

Fig. 9: Effect of removing probes.

The results for both orderings in Figure 9 show that, when the number of probes is reduced, the quality of results is degraded, either by an increase in FPR or by a decrease in TPR. It is also important to note, however, that the accuracy change is very slow versus probe reduction. Thus, our results are quite robust, arguing that even fewer benchmarks could be used as an input to the process with little impact on the outcome.
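The probe-reduction experiment can be expressed as a simple loop; the sketch below assumes hypothetical helpers retrain_stage2 and evaluate (returning TPR and FPR) and removes five probes per iteration, either highest-error-first or at random.

```python
import random

def probe_reduction(probes, probe_errors, retrain_stage2, evaluate,
                    step: int = 5, order: str = "highest_error"):
    """probes: list of probe ids; probe_errors: probe id -> mean IPC inference error."""
    remaining = list(probes)
    results = []
    while len(remaining) > step:
        if order == "highest_error":
            remaining.sort(key=lambda p: probe_errors[p])          # worst-modelled last
            remaining = remaining[:-step]                          # drop the five worst
        else:
            remaining = random.sample(remaining, len(remaining) - step)
        model = retrain_stage2(remaining)                          # re-fit stage 2 on the reduced set
        results.append((len(remaining), evaluate(model, remaining)))  # (probe count, (TPR, FPR))
    return results
```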
D. Counter Selection
We evaluated our counter selection methodology by comparing it with a set of 22 manually, but not arbitrarily, selected counters, which include miss rates for different cache levels, branch statistics and other counters related to the core pipeline and how many instructions each stage has processed. Unlike our method, the same 22 counters are used for all probes. The results are evaluated for the models 1-LSTM-500 and GBT-250 and can be seen in Figure 10.

Fig. 10: Effect of counter selection method.

Our counter selection methodology achieves better results with both machine learning models when compared to the results obtained by manually selecting the counters. Despite being heuristic, our counter selection methodology generally facilitates better detection results.
E. Time Step Size
In stage 1 of IPC modelling, input features are taken astime series with each time step being 500K clock cycles.Experiments were performed to observe the effect of differenttime step sizes. The results are plotted in Figure 11. Whenthe time step size increases, the inference errors for model1-LSTM-500 decrease as shown in Figure 11a. This is becausecoarser grained inference is generally easier than fine-grained.MSE is used here instead of the error defined by Eq. (1) aserror area among different step sizes series are not comparable.
Fig. 11: Effect of different time step sizes. (a) Average MSE across all probes. (b) TPR and FPR on bug detection.

The reduced IPC inference errors do not necessarily lead to improved bug detection results. In fact, Figure 11b shows that both TPR and FPR degrade as the time step size increases. The rationale is that whether or not the IPC inference is sensitive to bugs matters more than its accuracy. The results in Figure 11 confirm the choice of 500K cycles as the time step size. Besides the efficacy of bug detection, the time step size also considerably affects computing runtime and data storage. In our experience, a step size of 500K cycles reaches a good compromise between bug detection and computing load in our experimental setting.
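The step-size experiment essentially re-bins the same simulation statistics at a coarser granularity before training and scoring stage 1. A minimal sketch of the re-binning and MSE computation, on made-up per-cycle data, is shown below.

```python
# Re-bin committed-instruction counts into coarser time steps and score a
# stand-in stage-1 inference with MSE (toy data).
import numpy as np

rng = np.random.default_rng(3)
committed = rng.integers(500, 4000, size=10_000)   # instructions committed per 1K-cycle chunk (toy)

def ipc_series(step_cycles):
    """Aggregate 1K-cycle chunks into IPC per `step_cycles`-sized time step."""
    chunks = step_cycles // 1000
    usable = (len(committed) // chunks) * chunks
    per_step = committed[:usable].reshape(-1, chunks).sum(axis=1)
    return per_step / step_cycles

for step in (100_000, 500_000, 2_000_000):
    actual = ipc_series(step)
    inferred = actual + rng.normal(0, 0.02, size=actual.shape)   # stand-in for stage-1 output
    mse = float(np.mean((inferred - actual) ** 2))
    print(f"{step:>9} cycles/step: {len(actual):3d} steps, MSE = {mse:.5f}")
```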
F. Window Size
The IPC inference in stage 1 can take feature data from a series of time steps. So far, the window size used in our experiments has been one, because the time step size is sufficiently large. Experiments were conducted to observe the impact of increasing the window size. Table VI shows the TPR and FPR obtained when the window size is increased for the GBT-250 model.

TABLE VI: Window size effect.
The results confirm the choice of a window size of one throughout our experiments. Given our time step size, the results suggest that adding information from previous time steps does not help to increase sensitivity to performance bugs; in fact, it degrades it.
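For reference, the window-size knob simply controls how many consecutive time steps of counter data are concatenated into one stage-1 feature row; a window of one reduces to the per-step features used elsewhere in this paper. A small, self-contained sketch:

```python
# Build stage-1 feature rows from a sliding window of previous time steps.
import numpy as np

def windowed_features(counter_steps, window):
    """Stack `window` consecutive time steps of counter data into each feature row."""
    rows = [counter_steps[t - window + 1 : t + 1].ravel()
            for t in range(window - 1, len(counter_steps))]
    return np.array(rows)

steps = np.arange(24, dtype=float).reshape(8, 3)   # 8 time steps x 3 counters (toy)
print(windowed_features(steps, window=1).shape)    # (8, 3)  -- the default used here
print(windowed_features(steps, window=3).shape)    # (6, 9)
```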
G. Microarchitecture Design Parameter Features
In our baseline methodology, we propose the use of microarchitectural design parameters/specifications as static features (e.g., ROB size, issue width, etc.). Here, we examine the impact of removing these static features on the accuracy of our bug detection methodology. Figure 12 shows the obtained results.
Fig. 12: Effect of microarchitecture design parameter features (TPR and FPR for GBT-250 and 1-LSTM-500, with and without the static features).

The results show that removing the design parameters has no impact on the accuracy of the GBT-250 model and causes a small reduction in the number of detected bugs for the 1-LSTM-500 model, although the number of false alarms is also reduced. This impact is contained within the bugs of Low or Very Low IPC impact. These results indicate that performance impact information is, in many cases, sufficiently contained within the performance counters (i.e., performance counter data inherently conveys enough information for the model to infer the IPC of different microarchitectures on the given workloads), and that the change in microarchitectural specifications has a very small impact on the quality of results for our methodology.
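Concretely, this ablation amounts to either appending the (per-design constant) microarchitectural parameters to every time step's counter features or leaving them out. The sketch below illustrates the feature construction; the parameter names and values are made up.

```python
# Append static microarchitectural parameters to per-step counter features
# (the ablation simply omits the static part). Names/values are illustrative.
import numpy as np

arch_params = {"rob_entries": 224, "issue_width": 8, "l2_kb": 1024}      # hypothetical values
counter_steps = np.random.default_rng(4).normal(size=(20, 22))           # 20 steps x 22 counters (toy)

static = np.array(list(arch_params.values()), dtype=float)
with_arch = np.hstack([counter_steps, np.tile(static, (len(counter_steps), 1))])
without_arch = counter_steps
print(with_arch.shape, without_arch.shape)   # (20, 25) (20, 22)
```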
H. Number of Training Microarchitectures
We also evaluated the effect of reducing the number of architectures available to train our method. Here, we use 5 microarchitectures to train our IPC model (Set I), instead of 9. Sets II and III were reduced from 3 and 4 to 2 and 3 microarchitectures, respectively. In each case we dropped the "artificial" microarchitectures, keeping only the real ones. We keep the number of testing microarchitectures constant and show the results using GBT-250 in Figure 13. The obtained results confirm that creating artificial architectures to augment our dataset is necessary; this helps the model learn the difference between performance variation due to microarchitectural specifications and performance bugs.

Fig. 13: Effect of number of training microarchitectures.
I. Bug Detection in Cache Memory Systems
In this section, we evaluate performance bug detection in the cache memory system, as described in Section IV-D. The obtained results are shown in Table VII.

TABLE VII: Bug detection in memory systems results.
(Table columns: Stage 1 metric, Stage 1 ML model, FPR, TPR, precision, and TPR for the High, Medium, Low, and Very Low bug categories; one row each for LSTM- and GBT-based stage 1 on the IPC and AMAT metrics.)
These results show that our methodology is able to detect all the bugs when GBT models are used for both the IPC and AMAT inferences, while LSTM only misses bugs with Very Low AMAT impact. The high accuracy of these results shows that this methodology is robust for use on different system components beyond the core, as well as with different simulators.
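For the memory-system experiments, stage 1 infers AMAT rather than IPC. As a reminder of the target metric, the short sketch below evaluates the textbook AMAT recurrence from per-level hit times and miss rates; the latency and miss-rate numbers are made up.

```python
# Textbook AMAT recurrence: hit time + miss rate * miss penalty, applied per level.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

dram_latency = 200.0                                        # cycles (illustrative)
l2 = amat(hit_time=12, miss_rate=0.30, miss_penalty=dram_latency)
l1 = amat(hit_time=4,  miss_rate=0.05, miss_penalty=l2)
print(f"L1 AMAT = {l1:.1f} cycles")
```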
VI. RELATED WORK
A. Microprocessor Performance Validation
The importance as well as the challenge of processor performance debug was recognized 26 years ago [9], [53]. However, follow-up study has been scarce, perhaps due to the difficulty. The few known works [9], [46], [51], [53], [55] generally follow the same strategy, although with different emphases: an architecture model or an architecture performance model is constructed and compared with the new design on a set of applications, and performance bugs can then be detected if a performance discrepancy is observed in the comparison. In prior work by Bose [9], functional-unit and instruction-level models are developed as golden references. However, a performance bug often manifests in the interactions among different components or different instructions. Surya et al. [53] built an architecture timing model for a PowerPC processor. It is focused on enforcing certain invariants in executing loops; for example, the IPC for executing a loop should not decrease when the buffer queue size increases. Although this technique is useful, its effectiveness is restricted to a few types of performance bugs. In work by Utamaphethai et al. [55], a microarchitecture design is partitioned into buffers, each of which is modeled by a finite state machine (FSM). Performance verification is carried out along with functional verification using the FSM models. This method is effective only for state-dependent performance bugs. The model comparison-based approach is also applied to the Intel Pentium 4 processor [51]. A similar approach is applied to identifying I/O system performance anomalies [50]. A parametric machine learning model is developed for architecture performance evaluation [46]; this technique is mostly for guiding architecture design parameter (e.g., cache size) tuning.

Overall, the model comparison-based approach has two main drawbacks. First, the same performance bug can appear in both the model and the design, as described by Ho [26], and consequently cannot be detected. Although some works strive to find an accurate golden reference [9], such effort is restricted to special cases and very difficult to generalize. In particular, in the presence of intrinsic performance variability [14], [52], finding a golden reference for general cases becomes almost impossible. Second, constructing a reference architecture model is labor intensive and time consuming. On one hand, it is very difficult to build a general model applicable across different architectures. On the other hand, building one model for every new architecture is not cost-effective.
B. Performance Bugs in Other Domains
A related performance issue in datacenter computing is performance anomaly detection [29]. The main techniques here include statistics-based methods, such as ANOVA tests and regression analysis, and machine learning-based classification. Although there are many computers in a datacenter, the subject is the overall system performance on workloads, measured with very coarse-grained metrics. As such, the normal system performance is much better defined than for individual processors. Performance bug detection has also been addressed for distributed computing in clouds [33]. In this context, the overall computing runtime of a task is greater than the sum of the runtimes of its individual components, as extra time is needed for data communication; however, the difference should be limited, and otherwise an anomaly is detected. As such, performance debug in distributed computing is focused on the communication and assumes that all processors are bug-free. Evidently, such an assumption does not hold for processor microarchitecture designs. Gan et al. [22] developed an online QoS violation prediction technique for cloud computing. This technique applies runtime trace data to a machine learning model for the prediction and is similar to our baseline approach to a certain extent. In another work, cloud service anomaly prediction is realized through a self-organizing map, an unsupervised learning technique [15].

Performance bugs are also studied in software systems, where bugs are detected by users or through code reasoning [43]. A machine learning approach is developed for evaluating software performance degradation due to code changes [4]. Software code analysis [39] is used to identify performance-critical regions when executing in cloud computing. Parallel program performance diagnosis is studied by Atachiants et al. [5]. Performance bugs in smartphone applications are categorized into a set of patterns [40] for future identification. As the degree of concurrency in microprocessor architectures is usually higher than in a software program, performance debugging for microprocessors is generally much more complicated.
C. Performance Counters for Power Prediction
Prior work has used performance counters to predict power consumption [7], [13], [32], [57]. Joseph et al. [32] aim to predict average power consumption on complete workloads, as opposed to the time-series based strategy we use for accurate detection of bugs; their counters are selected based on heuristics, without automation. Contreras et al. [13] improve on this work and present an automated performance counter selection technique able to do time-series prediction of power. However, this technique was evaluated on an Intel PXA255 processor, a single-issue machine, making the problem much simpler than targeting superscalar processors. Bircher et al. [7] further improve this line of work by creating models for other units, such as memory, disk, and interconnect. The main drawback of this work is that it requires a thorough study of the design characteristics in order to create the performance counter list to be used. Recent work by Yang et al. [57] further expanded Bircher's work by aiming to develop a full-system power model, whereas prior work had focused only on component-based modeling.

Although this line of work has similarities with our IPC estimation methodology, our work is the first whose goal is the identification of performance bugs in a design. Further, because our goal is performance bug detection via orthogonal probes, we can make our models orthogonal and specific to each probe; thus we achieve very high IPC estimation accuracy, higher than is possible with the generalized models needed for general power prediction/management. Another significant difference is that our methodology is able to generalize to multiple microarchitectures, whereas the methodologies discussed in this section are trained and tested on the same processor.
VII. CONCLUSION AND FUTURE WORK
In this work, a machine learning-based approach to automatic performance bug detection is developed for processor core microarchitecture designs. The machine learning models extract knowledge from legacy designs and avoid the reference performance models of previous methods, which are error prone and time consuming to construct. Simulation results show that our methodology can detect 91.5% of the bugs with IPC impact greater than 1% when completely new bugs exist in a new microarchitecture. With this study we also hope to draw the attention of the research community to the broader performance validation domain.

In future research, we will extend the methodology to the debugging of multi-core memory systems and on-chip communication fabrics. We will also further study how to automatically narrow down bug locations once bugs are detected. As in functional debug [41], the results from our method can potentially serve as symptoms for bug localization. By analyzing characteristics common across the probes that trigger the bug detection (e.g., most common instruction types, memory or computational boundedness, etc.), designers could reduce the list of potential bug locations. Another debugging path could be the analysis of the counters selected for the IPC inference models of those traces; factors such as a loss of correlation between the counters and IPC, when compared to legacy designs, could also help pinpoint possible sources of the bug.
ACKNOWLEDGEMENTS
This work is partially supported by Semiconductor Research Corporation Task 2902.001.

REFERENCES
[4] …, Advances in Neural Information Processing Systems, 2019, pp. 11 623–11 635.
[5] R. Atachiants, G. Doherty, and D. Gregg, "Parallel performance problems on shared-memory multicore systems: taxonomy and observation," in IEEE Transactions on Software Engineering, vol. 42, no. 8, 2016, pp. 764–785.
[6] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The Gem5 simulator," in SIGARCH Computer Architecture News, vol. 39, no. 2, 2011, pp. 1–7.
[7] W. L. Bircher and L. K. John, "Complete system power estimation: A trickle-down approach based on performance events," in IEEE International Symposium on Performance Analysis of Systems & Software, 2007, pp. 158–168.
[8] B. Black, A. S. Huang, M. H. Lipasti, and J. P. Shen, "Can trace-driven simulators accurately predict superscalar performance?" in International Conference on Computer Design: VLSI in Computers and Processors, 1996, pp. 478–485.
[9] P. Bose, "Architectural timing verification and test for super scalar processors," in IEEE International Symposium on Fault-Tolerant Computing, 1994, pp. 256–265.
[10] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[11] F. Chollet, "Keras," https://keras.io, 2015.
[12] T. M. Conte, M. A. Hirsch, and K. N. Menezes, "Reducing state loss for effective trace sampling of superscalar processors," in International Conference on Computer Design: VLSI in Computers and Processors, 1996, pp. 468–477.
[13] G. Contreras and M. Martonosi, "Power prediction for Intel XScale processors using performance monitoring unit events," in International Symposium on Low Power Electronics and Design, 2005, pp. 221–226.
[14] B. Cook, T. Kurth, B. Austin, S. Williams, and J. Deslippe, "Performance variability on Xeon Phi," in International Conference on High Performance Computing, 2017, pp. 419–429.
[15] D. J. Dean, H. Nguyen, and X. Gu, "UBL: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems," in International Conference on Autonomic Computing, 2012, pp. 191–200.
[16] C. Delimitrou and C. Kozyrakis, "iBench: Quantifying interference for datacenter applications," in IEEE International Symposium on Workload Characterization, 2013, pp. 23–33.
[17] C. Delimitrou, D. Sanchez, and C. Kozyrakis, "Tarcil: Reconciling scheduling speed and quality in large shared clusters," in ACM Symposium on Cloud Computing, 2015, pp. 97–110.
[18] R. Desikan, D. Burger, and S. W. Keckler, "Measuring experimental error in microprocessor simulation," in International Symposium on Computer Architecture, 2001, pp. 266–277.
[19] J. Doweck, W.-F. Kao, A. K.-y. Lu, J. Mandelblat, A. Rahatekar, L. Rappoport, E. Rotem, A. Yasin, and A. Yoaz, "Inside 6th-generation Intel Core: New microarchitecture code-named Skylake," in IEEE Micro, vol. 37, no. 2, 2017, pp. 52–62.
[20] L. Eren, T. Ince, and S. Kiranyaz, "A generic intelligent bearing fault diagnosis system using compact adaptive 1D CNN classifier," in Journal of Signal Processing Systems, vol. 91, no. 2, 2019, pp. 179–189.
[21] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," in Annals of Statistics, 2001, pp. 1189–1232.
[22] Y. Gan, Y. Zhang, K. Hu, D. Cheng, Y. He, M. Pancholi, and C. Delimitrou, "Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices," in International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 19–33.
[23] P. Gepner, D. L. Fraser, and V. Gamayunov, "Evaluation of the 3rd generation Intel Core processor focusing on HPC applications," in International Conference on Parallel and Distributed Processing Techniques and Applications, 2012, pp. 1–6.
[24] G. Hamerly, E. Perelman, and B. Calder, "How to use SimPoint to pick simulation points," in ACM SIGMETRICS Performance Evaluation Review, vol. 31, no. 4, 2004, pp. 25–30.
[25] G. Hamerly, E. Perelman, J. Lau, and B. Calder, "SimPoint 3.0: Faster and more flexible program phase analysis," in Journal of Instruction Level Parallelism, vol. 7, no. 4, 2005, pp. 1–28.
[26] R. C. Ho, C. H. Yang, M. A. Horowitz, and D. L. Dill, "Architecture validation for processors," in International Symposium on Computer Architecture, 1995, pp. 404–413.
[27] S. Hochreiter and J. Schmidhuber, "Long short-term memory," in Neural Computation, vol. 9, no. 8, 1997, pp. 1735–1780.
[28] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," in Neural Networks, vol. 2, no. 5, 1989, pp. 359–366.
[29] O. Ibidunmoye, F. Hernández-Rodriguez, and E. Elmroth, "Performance anomaly detection and bottleneck identification," in ACM Computing Surveys, vol. 48, no. 1, 2015, pp. 1–35.
[30] Intel Corporation, "Intel386™ DX processor: Specification update," 2004.
[31] Intel Corporation, "Intel Xeon processor scalable family: Specification update," 2019.
[32] R. Joseph and M. Martonosi, "Run-time power estimation in high performance microprocessors," in International Symposium on Low Power Electronics and Design, 2001, pp. 135–140.
[33] C. Killian, K. Nagaraj, S. Pervez, R. Braud, J. W. Anderson, and R. Jhala, "Finding latent performance bugs in systems implementations," in ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2010, pp. 17–26.
[34] J. Kim, S. H. Pugsley, P. V. Gratz, A. N. Reddy, C. Wilkerson, and Z. Chishti, "Path confidence based lookahead prefetching," in International Symposium on Microarchitecture, 2016, pp. 1–12.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference for Learning Representations, 2015.
[36] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time series," in The Handbook of Brain Theory and Neural Networks, 1995.
[37] S.-M. Lee, S. M. Yoon, and H. Cho, "Human activity recognition from accelerometer data using convolutional neural network," in IEEE International Conference on Big Data and Smart Computing, 2017, pp. 131–134.
[38] J. Leverich and C. Kozyrakis, "Reconciling high server utilization and sub-millisecond quality-of-service," in European Conference on Computer Systems, 2014, pp. 1–14.
[39] J. Li, Y. Chen, H. Liu, S. Lu, Y. Zhang, H. S. Gunawi, X. Gu, X. Lu, and D. Li, "PCatch: Automatically detecting performance cascading bugs in cloud systems," in EuroSys Conference, 2018, pp. 1–14.
[40] Y. Liu, C. Xu, and S.-C. Cheung, "Characterizing and detecting performance bugs for smartphone applications," in International Conference on Software Engineering, 2014, pp. 1013–1024.
[41] B. Mammo, M. Furia, V. Bertacco, S. Mahlke, and D. S. Khudia, "BugMD: Automatic mismatch diagnosis for bug triaging," in International Conference on Computer-Aided Design, 2016, pp. 1–7.
[42] J. D. McCalpin, "HPL and DGEMM performance variability on the Xeon Platinum 8160 processor," in International Conference for High Performance Computing, Networking, Storage and Analysis, 2018, pp. 225–237.
[43] A. Nistor, T. Jiang, and L. Tan, "Discovering, reporting, and fixing performance bugs," in Working Conference on Mining Software Repositories, 2013, pp. 237–246.
[44] T. Nowatzki, J. Menon, C. Ho, and K. Sankaralingam, "Architectural simulators considered harmful," in IEEE Micro, vol. 35, no. 6, 2015, pp. 4–12.
[45] NXP Semiconductors, "Chip errata for the MPC7448," 2008.
[46] E. Ould-Ahmed-Vall, J. Woodlee, C. Yount, K. A. Doshi, and S. Abraham, "Using model trees for computer architecture performance analysis of software applications," in IEEE International Symposium on Performance Analysis of Systems & Software, 2007, pp. 116–125.
[47] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[48] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder, "Using SimPoint for accurate and efficient simulation," in ACM SIGMETRICS Performance Evaluation Review, vol. 31, no. 1, 2003, pp. 318–319.
[49] F. Santosa and W. Symes, "Linear inversion of band-limited reflection seismograms," in SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 4, 1986, pp. 1307–1330.
[50] K. Shen, M. Zhong, and C. Li, "I/O system performance debugging using model-driven anomaly characterization," in USENIX Conference on File and Storage Technologies, vol. 5, 2005, pp. 309–322.
[51] R. Singhal, K. Venkatraman, E. R. Cohn, J. G. Holm, D. A. Koufaty, M.-J. Lin, M. J. Madhav, M. Mattwandel, N. Nidhi, J. D. Pearce et al., "Performance analysis and validation of the Intel® Pentium® 4 processor on 90nm technology," in Intel Technology Journal, vol. 8, no. 1, 2004, pp. 39–48.
[52] D. Skinner and W. Kramer, "Understanding the causes of performance variability in HPC workloads," in IEEE Workload Characterization Symposium, 2005, pp. 137–149.
[53] S. Surya, P. Bose, and J. Abraham, "Architectural performance verification: PowerPC processors," in IEEE International Conference on Computer Design: VLSI in Computers and Processors, 1994, pp. 344–347.
[54] Texas Instruments, "AM3517, AM3505 Sitara processors silicon revisions 1.1, 1.0: Silicon errata," 2016.
[55] N. Utamaphethai, R. S. Blanton, and J. P. Shen, "A buffer-oriented methodology for microarchitecture validation," in Journal of Electronic Testing, vol. 16, no. 1-2, 2000, pp. 49–65.
[56] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe, "SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling," in International Symposium on Computer Architecture, 2003, pp. 84–97.
[57] S. Yang, Z. Luan, B. Li, G. Zhang, T. Huang, and D. Qian, "Performance events based full system estimation on application power consumption," in International Conference on High Performance Computing and Communications, 2016, pp. 749–756.