A Unified Learning Platform for Dynamic Frequency Scaling in Pipelined Processors
Arash Fouman Ajirlou, [email protected], University of Illinois at Chicago, Chicago, IL, USA
Inna Partin-Vaisband, [email protected], University of Illinois at Chicago, Chicago, IL, USA
ABSTRACT
A machine learning (ML) design framework is proposed for dynamically adjusting clock frequency based on the propagation delay of individual instructions. A Random Forest model is trained to classify propagation delays in real-time, utilizing current operation type, current operands, and computation history as ML features. The trained model is implemented in Verilog as an additional pipeline stage within a baseline processor. The modified system is simulated at the gate level in 45 nm CMOS technology, exhibiting a speed-up of 68% and an energy reduction of 37% with coarse-grained ML classification. A speed-up of 95% is demonstrated with finer granularities at additional energy cost.
CCS CONCEPTS
• Computer systems organization → Pipeline processors; Machine learning; Dynamic frequency scaling.
ACM Reference Format:
Arash Fouman Ajirlou and Inna Partin-Vaisband. 2020. A Unified Learning Platform for Dynamic Frequency Scaling in Pipelined Processors. In DAC '20: ACM Design Automation Conference, July 19-23, 2020, San Francisco, CA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
The primary design goal in computer architecture is to maximize the performance of a system under power, area, temperature, and other application-specific constraints. The heterogeneous nature of VLSI systems and the adverse effect of process, voltage, and temperature (PVT) variations have raised challenges in meeting timing constraints in modern integrated circuits (ICs). To address these challenges, timing guardbands have steadily been increased, limiting the operational frequency of synchronous digital circuits. On the other hand, the increased variety of functions in modern processors increases the delay imbalance among different signal propagation paths. Bounded by critical-path delay, these systems are traditionally designed with a pessimistically slow clock period, yielding underutilized IC performance. Moreover, the power efficiency of these underutilized systems also degrades due to the increasing leakage power component.
Alternatively, when designed with relaxed timing constraints, integrated systems are prone to functional failures. To simultaneously maintain correct functionality and increase system performance, constraint optimization techniques as well as offline and online models have recently been proposed. Typical approaches include, but are not limited to, pipelining, multicore computing, dynamic voltage and frequency scaling (DVFS), and ML driven models [1-9].
Propagation delay in a processor is a strong function of the type, input, and output of the current operation, and of the computation history [4]. Intuitively, the majority of operations are completed within a small portion of the clock period, which is determined by the slowest path in the circuit. Based on the path delay distribution reported in [5], the operational frequency can be doubled for the majority of instructions in a typical program.
While multicore approaches have been demonstrated to partially enhance system performance, the scalability of modern multicore systems is limited by the design complexity of instruction level parallelism and by thermal design power constraints. Speeding up single thread execution is, therefore, an important cornerstone for enhancing single core performance in modern ICs [10], and is the primary focus of this paper. The main contributions of this work are as follows:
(1) A systematic flow is proposed and implemented as a unified platform for extracting and processing input features for ML classification of instruction delays.
(2) A Random Forest (RF) classifier is trained to classify individual instructions into delay classes based on their type, input operands, and the computation history of the system.
(3) A new pipeline stage is integrated within a pipelined MIPS processor.
(4) The proposed method is synthesized and verified on the LegUp [11] benchmark suite of programs with Synopsys Design Compiler in a 45 nm CMOS technology node.
Predicting timing violations in a constraint-relaxed system is impractical with deterministic approaches, due to the wide dynamic range of input and output signals (typically 32 or 64 bits) and the variety of instructions in a modern processor. ML based approaches for predicting timing violations of individual instructions have recently been proposed, which consider the impact of the input operands and computation history on timing violations [4, 12]. While significant for the design process of next generation scalable high performance systems, these approaches have several limitations:
1) The output of individual instructions has been considered as an ML feature and exploited in these systems for predicting the timing characteristics of the individual instructions. These predictions are, however, carried out in advance of the instruction execution, when the instruction output is not yet available, limiting the effectiveness of these methods in practical systems.
2) The modules under test are studied separately, isolated from other computational and non-computational components (e.g., buffers or multiplexers). Despite the reported high prediction accuracy, the same accuracy is not expected if the methods are applied to a practical execution unit, owing to the isolated test environment.
3) Power and timing overheads due to the additional hardware are not considered in these papers.
A bit-level ML based method has been proposed in [13] for predicting timing violations with reduced timing guardbands. While up to 95% prediction accuracy has been reported with this method, the excessively high, per-bit granularity of the ML predictions is expected to exhibit substantial power, area, and timing overheads. These overheads are, however, not evaluated in [13]. Furthermore, a procedure for recovery upon a timing error is not provided, and the recovery overheads are also not considered in that work.
As an alternative to fine-grain, high-overhead ML implementations, multiple coarse-grain schemes for timing error detection and recovery have been proposed to mitigate the adverse effect of pessimistic design constraints. A better-than-worst-case (BTWC) design approach has been introduced in [5]. With this approach, the clock period is set to a statistically nominal value (rather than the worst-case propagation delay), and a history of timing-erroneous program counters (PCs) is kept in a ternary content-addressable memory (TCAM). The TCAM is exploited for predicting timing violations of subsequent instructions based on previous observations. Owing to the apparent simplicity of this approach, only bi-state operating conditions (i.e., nominal and worst-case clock frequencies) can be efficiently utilized with this method; the design complexity and system overheads are expected to increase significantly with an increasing number of frequency domains.
A thermal-aware voltage scaling approach has been proposed in [14]. A voltage selection algorithm is developed and integrated into the FPGA synthesis process to dynamically scale the core and block RAM voltages. However, driven by workload and thermal power dissipation, this method supports only coarse-grained voltage scaling.
Predicting program error rate in timing-speculative processors has been proposed in [15]. A statistical model is developed for predicting dynamic timing slack (DTS) at various pipeline stages. The predicted DTS values are exploited to estimate the timing error rate of a program. The implementation overheads, as well as the potential performance or power consumption gains, are, however, not reported with this approach.
ML based methods for modeling system behavior have also been proposed. For example, in [6], linear regression (LR) has been leveraged for modeling the aging behavior of an embedded processor based on the current instruction and its operands, as well as the computation history and overall circuit switching activity. As a result, the timing guardband designed to compensate for aging in digital circuits is effectively reduced, assuming graceful degradation. Reallocation of the delay budget is, however, not considered with this method.
ML ICs can exhibit prohibitively high power consumption and physical size.
Also, given specific applications, they may introduce additional delay and increase design complexity. To efficiently exploit ML methods for managing frequency in modern processors, the delay, power, and area of ML ICs should be considered.
In this paper, an ML driven design methodology is proposed for pipelined processors. With the proposed method, individual instructions are classified into corresponding propagation delay classes in real-time, and the clock frequency is adjusted accordingly to narrow the gap between the actual propagation delay and the clock period. The classes are defined by segmenting the worst-case clock period into shorter delay fragments. Each class is characterized by an operating condition, such as a specific supply voltage and clock frequency. The primary design objective is to maximize system performance within the allocated energy budget. The frequency of the system is dynamically adjusted in real-time based on the result of the instruction delay classification. The overall delay and energy consumption are evaluated with the additional ML components, accounting for both correct and incorrect predictions. Other control configurations can be defined in a similar manner for different design objectives.
To evaluate this method, TigerMIPS [16] is utilized as a baseline processor. The ML classifier is designed as an additional pipeline stage within the pipelined MIPS processor, as shown in Fig. 1. The inputs to the additional ML pipeline stage are the current instruction and its operands, as well as the computation history, defined by the bit-toggled inputs (i.e., the current inputs XORed with the previous inputs) and the output of the previous operation. These inputs are utilized as ML features for predicting the delay class of the current instruction based on the trained ML model. It is important to note that any desired ML model can be trained with this methodology, regardless of its delay, as long as the design complexity and hardware costs of the final system meet the specified constraints. The trained models can be implemented as multiple pipeline stages to meet the timing constraints and maintain the overall system throughput, despite the additional latency introduced by the ML functions.
Figure 1: The proposed pipeline with the additional ML stage. In this configuration, six ML features and three delay classes are used.
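As a minimal illustration of this feature set, the following Python sketch assembles the six features of Fig. 1 from the current and previous execution records; the record fields and the 32-bit operand width are assumptions for illustration, not the exact profiler interface.

```python
# Sketch of the ML feature construction: current operation type, current
# operands, bit-toggled inputs (current XORed with previous), and the
# previous output. Field names and widths are illustrative assumptions.
from dataclasses import dataclass

WIDTH = 32  # assumed operand width of the baseline MIPS datapath

@dataclass
class ExecRecord:
    opcode: int   # current operation type
    op_a: int     # first operand
    op_b: int     # second operand

def features(curr: ExecRecord, prev: ExecRecord, prev_out: int) -> list[int]:
    """Six features, matching the configuration of Fig. 1."""
    mask = (1 << WIDTH) - 1
    return [
        curr.opcode,
        curr.op_a,
        curr.op_b,
        (curr.op_a ^ prev.op_a) & mask,  # bit-toggled input A (history)
        (curr.op_b ^ prev.op_b) & mask,  # bit-toggled input B (history)
        prev_out,                        # output of the previous operation
    ]
```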
Figure 2: Systematic flow for designing the ML predictor within a typical pipelined processor.
Note that the granularity of the output delay classes (e.g., three classes are illustrated in Fig. 1) can be varied as needed.
A systematic flow has been developed, implemented, and verified on TigerMIPS with the LegUp benchmark suite. The flow comprises three primary phases, as shown in Fig. 2. The individual phases are described in the following subsections.
Phase 1. First, the high-level HDL model of the baseline processor is synthesized into a gate-level description. During this phase, timing information is generated in the IEEE standard delay format (SDF). Based on this information, gate-level simulation (GLS) is performed and instruction-level execution profiles are collected. A profile comprises a list of instructions, the fetched or forwarded operands, the outputs of the operations, and the propagation delays. In addition to the execution profile, post place-and-route (PAR) reports, including timing and power information, are collected in this phase.
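For illustration, an execution profile of this form could be parsed as sketched below; the whitespace-delimited line format (opcode, operands, output, delay) is an assumed placeholder, not the actual profiler output format.

```python
# Hedged sketch of parsing one GLS execution profile into records.
# The line format is an assumption for illustration only.
def parse_profile(path: str) -> list[dict]:
    records = []
    with open(path) as f:
        for line in f:
            opcode, op_a, op_b, out, delay = line.split()
            records.append({
                "opcode": opcode,
                "op_a": int(op_a, 16),
                "op_b": int(op_b, 16),
                "out": int(out, 16),
                "delay_ns": float(delay),  # measured propagation delay
            })
    return records
```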
Phase 2. In this phase, the gate-level profiles from Phase 1 are parsed and utilized as ML features. The parser also detects and eliminates outliers. The model is trained in Python with the Scikit-learn ML library [17]. HDL code (e.g., Verilog in this paper) of the trained model is generated and integrated within the baseline processor as a single (or multiple) pipeline stage(s) between the Decode and Execute stages (see Fig. 1).
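The model-to-HDL step can be sketched as follows for one tree of the RF ensemble, flattening the learned decision rules into a nested Verilog conditional. The signal names (f0, f1, ..., class_t0) and the dummy training data are illustrative assumptions; the paper's actual code generator is not specified at this level of detail.

```python
# Hedged sketch: export one trained scikit-learn decision tree as a
# nested Verilog ternary expression over feature signals f0..f5.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2**16, size=(3000, 6))  # dummy features (6, cf. Fig. 1)
y = rng.integers(0, 3, size=3000)           # dummy 3-class delay labels

rf = RandomForestClassifier(n_estimators=10, max_depth=4).fit(X, y)

def tree_to_verilog_expr(tree, node=0):
    """Recursively flatten one decision tree into a nested Verilog ternary."""
    t = tree.tree_
    if t.children_left[node] == -1:              # leaf: emit its class index
        return str(int(t.value[node].argmax()))
    # integer features: truncating the x.5 threshold preserves semantics
    cond = f"(f{t.feature[node]} <= {int(t.threshold[node])})"
    left = tree_to_verilog_expr(tree, t.children_left[node])
    right = tree_to_verilog_expr(tree, t.children_right[node])
    return f"({cond} ? {left} : {right})"

print("assign class_t0 =", tree_to_verilog_expr(rf.estimators_[0]), ";")
```

A complete generator would emit one such expression per tree, followed by a majority vote over the per-tree class outputs.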
Phase 3. Within this phase, the modified high-level HDL model of the system, including the ML stage, undergoes the synthesis and profiling procedure described in Phase 1. To guarantee functional correctness, the output signal is double-sampled to detect timing violations, and a timing-erroneous instruction is replayed at the worst-case clock frequency. Similar to the baseline iteration, post PAR reports are extracted for evaluating the timing and energy characteristics of the system. Finally, the profiling of the modified system is executed during this phase to evaluate the overall speed-up of the system.
To optimize the final solution in terms of operational frequency and energy consumption, the proposed flow is executed iteratively, as shown by the feedback path in Fig. 2. The clock signal of the pipeline registers is assumed to be switched near-instantly based on the individual classification results, as has been experimentally demonstrated in [18].
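The speed-up accounting of this phase can be sketched as below, assuming (as in the results section) that a misclassification into a faster class is detected by double-sampling and replayed at the worst-case clock with a four-cycle penalty; the function and its example inputs are illustrative.

```python
# Hedged sketch of the Phase 3 speed-up evaluation: each instruction runs
# at the clock period of its predicted class; an optimistic misprediction
# incurs a replay at the worst-case period.
def estimate_speedup(true_cls, pred_cls, periods_ns, replay_cycles=4):
    """true_cls/pred_cls: per-instruction delay class indices (0 = fastest);
    periods_ns: clock period assigned to each class, worst case last."""
    worst = periods_ns[-1]
    baseline = worst * len(true_cls)  # every instruction at worst-case clock
    actual = 0.0
    for t, p in zip(true_cls, pred_cls):
        if p >= t:                    # safe (possibly pessimistic) prediction
            actual += periods_ns[p]
        else:                         # timing violation: replay penalty
            actual += periods_ns[p] + replay_cycles * worst
    return baseline / actual

# Example with the 3-class boundaries of the results section (upper bounds
# 1.8, 2.6, and 4.0 ns used as the per-class periods):
print(estimate_speedup([0, 1, 2, 0], [0, 0, 2, 0], [1.8, 2.6, 4.0]))
```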
To evaluate the efficiency and efficacy of the proposed method, propagation delay classification is investigated with three common ML algorithms: Neural Networks (NNs), Support Vector Machines (SVMs), and Random Forest (RF). In the following subsections, the primary characteristics of each algorithm are discussed.
NNs excel in learning complex hidden patterns in large datasets and have shown particular superiority in vision and text applications as compared with classical ML algorithms. Following this success, promising results have been shown with NNs in various hardware related applications [19-21]. A multi-class NN classifier is designed in this work with a single hidden layer of 20 neurons and a ReLU activation function in the Scikit-learn ML framework. The network is trained using the backpropagation algorithm for 200 epochs until convergence with a quasi-Newton optimizer. As a general rule, the learning capacity of an NN increases with the network complexity (i.e., the number of neurons and layers). For an NN to be competitive with or outperform classical ML algorithms, a large number of neurons and layers is required, significantly increasing the system complexity and hardware overhead of NN based solutions.
Table 1: RF, NN, and SVM Configuration and Validation Accuracy.
An SVM classifier learns an optimal hyperplane that separates data samples in feature space, with the objective of minimizing the classification error. A linear SVM can only learn a linearly-separable decision boundary. Alternatively, to learn complex nonlinear patterns in the data, an SVM can be combined with the kernel trick, which transforms the sample features into a linearly separable space. In this work, a kernel SVM classifier is designed with the Gaussian kernel. SVMs often exhibit excellent performance as compared with other algorithms but suffer from high computational and design complexity, and accordingly high power and area overheads [22].
An RF classifier is an ensemble of decision tree classifiers. The input samples are split into multiple sample sets, and each decision tree is trained on one training set. The final classification decision for each sample is determined by averaging over the decisions of the trees (i.e., ensembling). RF often benefits from the accuracy, training speed, and interpretability of decision trees, while the ensembling mitigates overfitting. RF is a favorable algorithm in scientific and practical applications [4, 23]. The computational and hardware complexity of an RF is a strong function of the number and depth of its decision trees. The depth of the individual trees is determined by the number of features and their correlation. In this work, an RF is trained with a low number of shallow trees (i.e., 10 to 100 trees), exhibiting low design complexity and hardware overheads, as demonstrated in Section V.
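In Scikit-learn terms, the three configurations described above can be sketched as follows; the lbfgs solver stands in for the quasi-Newton optimizer, and the specific RF tree count and depth are assumptions within the stated 10-to-100-tree range.

```python
# Hedged sketch of the three classifier configurations described in the
# text: a 20-neuron single-hidden-layer NN with ReLU, a Gaussian-kernel
# SVM, and a shallow Random Forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

nn = MLPClassifier(hidden_layer_sizes=(20,), activation="relu",
                   solver="lbfgs", max_iter=200)   # quasi-Newton optimizer
svm = SVC(kernel="rbf")                            # Gaussian (RBF) kernel
rf = RandomForestClassifier(n_estimators=10, max_depth=6)  # assumed depth

# Each model is fit on the parsed gate-level profiles (features X, delay
# class labels y) and compared on validation accuracy, e.g.:
#   nn.fit(X_train, y_train); nn.score(X_val, y_val)
```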
The proposed framework is implemented within TigerMIPS and evaluated based on the LegUp benchmarks. The details of the implementation are described in this section.
A holistic platform is developed to realize the proposed system design methodology, as illustrated in Fig. 2. The framework is unified within a shell programming platform, supported by several peripheral programs developed in the C++ and Python programming languages. The synthesis steps described in Fig. 2 are sequentially executed from Start to Finish. During the first phase, Synopsys Design Compiler is called with the high-level HDL model of the baseline processor. The profiler triggers are then added to the design, and GLS is performed in Modelsim. Phase 2 is triggered upon the completion of the instruction profiling. An external parser program is called to transform the instruction profiles into the ML feature data structure and to eliminate outliers. The model is trained to classify propagation delays into a user-defined number of classes based on a user-specified learning algorithm and delay boundaries. The ML accuracy and estimated speed-up are evaluated upon training completion. If the design requirements are met, the ML software model is transformed into high-level HDL code; otherwise, the ML model is retrained with a new algorithm or hyperparameters. Eventually, the HDL code of the ML model is instantiated within the original HDL model of the baseline processor. Finally, the procedure of Phase 1 is repeated in Phase 3 with the modified processor model, and the overall system performance and overheads are evaluated.
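A minimal sketch of such a driver is given below, sequencing the external tools with Python's subprocess module; all command lines, script names, and file names are placeholders rather than the platform's actual interfaces.

```python
# Hedged sketch of the shell-driven flow: each phase is an external tool
# invocation sequenced by a driver script. Tool arguments are placeholders.
import subprocess

def run(cmd):
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)  # abort the flow on any tool failure

def flow_iteration():
    run(["dc_shell", "-f", "synth_baseline.tcl"])     # Phase 1: synthesis
    run(["vsim", "-c", "-do", "gls_profile.do"])      # Phase 1: GLS profiling
    run(["./parser", "profile.log", "features.csv"])  # Phase 2: parse/clean
    run(["python", "train.py", "features.csv"])       # Phase 2: train + HDL
    run(["dc_shell", "-f", "synth_modified.tcl"])     # Phase 3: re-synthesize
    run(["vsim", "-c", "-do", "gls_evaluate.do"])     # Phase 3: evaluate

if __name__ == "__main__":
    flow_iteration()  # repeated until the design requirements are met
```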
The proposed framework is demonstrated on a pipelined MIPS processor (i.e., TigerMIPS). In addition to the basic MIPS units, such as Instruction Fetch (IF), Instruction Decode (ID), Execute (Exe), Memory access (Mem), and Write-back (WB), TigerMIPS comprises advanced units, such as a forwarding unit, a branch handling unit, stall logic, and instruction and data caches, which are common in modern pipelined processors.
The baseline model is synthesized in the 45 nm NanGate CMOS technology node with Synopsys Design Compiler. Upon completion of the synthesis, triggers are implemented in Verilog HDL, enabling data and timestamp sampling at the input and output of the execution unit within the MIPS pipeline. The profiling is performed based on GLS with the Modelsim simulator.
The extracted ML features are transformed into a defined data structure, and the model is trained with the different algorithms (i.e., SVM, NN, and RF). The hyperparameters of the ML algorithms are listed in Table 1.
The trained ML model is first validated in Python. The HDL code of the validated ML model is integrated into the baseline processor. The modified processor is then synthesized, and its functionality is verified through GLS. The post PAR reports are utilized to evaluate the modified system with respect to the specified design constraints.
Figure 3: Speed-up with the proposed ML framework with two, three, and four delay classes.
Table 2: ML and System Level Performance with the Proposed Pipelined Classifier.
                          Four classes                  Three classes                  Two classes
Benchmark        Acc.   F1     Ach.   Ideal     Acc.   F1     Ach.   Ideal     Acc.   F1     Ach.   Ideal
rand1M           0.837  0.849  1.842  2.633     0.925  0.930  1.732  1.933     0.936  0.940  1.528  1.631
adpcm            0.771  0.773  1.936  2.819     0.886  0.884  1.253  2.079     0.945  0.943  1.393  1.715
aes              0.947  0.932  2.635  3.665     0.986  0.985  1.936  2.186     1.000  1.000  1.818  1.818
blowfish         0.844  0.832  1.783  3.435     0.973  0.970  1.676  2.147     1.000  1.000  1.818  1.818
fft              0.750  0.746  1.085  3.256     0.970  0.971  1.833  2.135     0.988  0.986  1.674  1.776
fir              0.950  0.947  2.324  3.708     1.000  1.000  2.219  2.222     1.000  1.000  1.818  1.818
gsm              0.811  0.799  1.288  2.562     0.844  0.836  1.203  1.911     0.989  0.990  1.594  1.673
jpeg             0.868  0.854  2.421  3.078     0.929  0.935  1.742  2.095     0.982  0.981  1.674  1.766
sha              0.698  0.771  1.628  3.585     0.953  0.931  1.469  2.178     1.000  1.000  1.818  1.818
sra              0.977  0.972  2.735  3.793     0.965  0.966  1.883  2.202     1.000  1.000  1.818  1.818
rand100k         0.840  0.851  1.805  2.633     0.922  0.927  1.736  1.941     0.931  0.936  1.528  1.639
Average energy overhead          13%                            2%                            -37%
(Acc. = accuracy; F1 = F1-score; Ach./Ideal = achieved/ideal speed-up.)
Figure 4: RF classification accuracy in inference on the LegUp benchmark suite with two, three, and four classes.
Owing to the unique learning characteristics and hardware trade-offs of the NN, SVM, and RF models, all three ML models are considered in this paper. Each model is trained on instruction profiles extracted from a synthetically generated dataset of 3,000 random instructions per class. The boundaries of the individual classes are experimentally determined with respect to the worst-case delay of 4 ns as follows: {[0.0, 2.2], (2.2, 4.0]} for the 2-class configurations, {[0.0, 1.8], (1.8, 2.6], (2.6, 4.0]} for the 3-class configurations, and {[0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]} for the 4-class configurations.
To validate the models, the LegUp high-level synthesis benchmark suite coupled with the LLVM compiler toolchain [24] is utilized for profiling and verification during GLS. The trained NN, SVM, and RF models with various hyperparameters are validated on the gate-level profiles of the LegUp programs. The average accuracy, F1-score (a typical accuracy measure which considers the precision and recall metrics), and estimated speed-up results are reported in Table 1. The RF model is preferred in this paper due to its high classification accuracy, higher speed-up, and lower design and hardware complexity as compared with the NN and SVM models.
The trained RF model is tested with nine standard benchmark programs available within the LegUp benchmark suite and two additional synthetically generated benchmarks with one million and 100,000 random instructions. The RF classification accuracy on the test datasets is shown in Fig. 4 for two, three, and four ML delay classes, yielding above 98% accuracy for the majority of the programs with two delay classes. The resultant speed-up for the individual benchmarks is shown in Fig. 3. Detailed performance and average energy characteristics of the RF model and the modified pipelined processor are listed in Table 2, including the average ML accuracy, the practically reported speed-up (including the misclassification penalty), the ideal speed-up (with 100% classification accuracy), and the energy overhead due to the additional ML hardware and classification errors. To account for delay overheads due to the misclassification of a slow instruction into a higher performance class, a replay penalty of four clock cycles is included within the performance results reported in Table 2. Note that the overall speed-up with the four-class configuration is higher than the speed-up with the two-class configuration, despite the higher classification accuracy with two delay classes. Alternatively, the higher misclassification rate with four delay classes yields higher replay energy consumption, as listed in Table 2.
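For reference, labeling measured delays with the class boundaries above is a one-line binning operation; a minimal sketch using numpy.digitize (an assumption for illustration, not necessarily the authors' implementation) follows.

```python
# Hedged sketch: map measured propagation delays (ns) to delay-class
# labels using the experimentally determined boundaries (worst case 4 ns).
import numpy as np

BOUNDS = {
    2: [2.2],            # {[0.0, 2.2], (2.2, 4.0]}
    3: [1.8, 2.6],       # {[0.0, 1.8], (1.8, 2.6], (2.6, 4.0]}
    4: [1.0, 2.0, 3.0],  # {[0.0, 1.0], ..., (3.0, 4.0]}
}

def delay_class(delays_ns, n_classes):
    """Half-open (a, b] binning: right=True matches the stated intervals."""
    return np.digitize(delays_ns, BOUNDS[n_classes], right=True)

print(delay_class([0.7, 1.9, 3.4], 3))  # -> [0 1 2]
```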
In this work, an additional ML pipeline stage is proposed for increasing the overall system performance and temporal resource utilization. This additional stage is designed to classify instructions into propagation delay classes. The system clock frequency is dynamically adjusted based on the individual delay predictions. Pipelining is exploited to mitigate the effect of the ML stage latency on the overall system performance. Practical ML features are extracted based on the current instruction and the computation history. The ML hardware and misclassification power and delay overheads are considered within the reported results. Based on the experimental results, up to 95% performance gain can be achieved with four delay classes at a low energy overhead, and a 37% reduction in energy consumption with a 68% gain in performance is practical with two delay classes. A unified shell programming platform with peripheral programs is introduced, yielding a systematic design flow for ML driven pipelined processors.