A Machine Learning Pipeline Stage for Adaptive Frequency Adjustment
Arash Fouman Ajirlou, Student Member, IEEE, Inna Partin-Vaisband, Member, IEEE
Abstract—A machine learning (ML) design framework is proposed for adaptively adjusting clock frequency based on propagation delay of individual instructions. A random forest model is trained to classify propagation delays in real time, utilizing current operation type, current operands, and computation history as ML features. The trained model is implemented in Verilog as an additional pipeline stage within a baseline processor. The modified system is experimentally tested at the gate level in 45 nm CMOS technology, exhibiting a speedup of 70% and energy reduction of 30% with coarse-grained ML classification. A speedup of 89% is demonstrated with finer granularities with 15.5% reduction in energy consumption.
Index Terms—Computer Systems Organization, Microprocessors and microcomputers, Hardware, Pipeline, Processor Architectures, Pipeline processors, Pipeline implementation, VLSI Systems, Impact of VLSI on system design, VLSI, System architectures, integration and modeling, Design Methodology, Cost/performance, Machine learning, Classifier design and evaluation
1 INTRODUCTION

The primary design goal in computer architecture is to maximize the performance of a system under power, area, temperature, and other application-specific constraints. The heterogeneous nature of VLSI systems and the adverse effect of process, voltage, and temperature (PVT) variations have raised challenges in meeting timing constraints in modern integrated circuits (ICs). To address these challenges, timing guardbands have constantly been increased, limiting the operational frequency of synchronous digital circuits. On the other hand, the increasing variety of functions in modern processors increases delay imbalance among different signal propagation paths. Bounded by critical path delay, these systems are traditionally designed with a pessimistically slow clock period, yielding underutilized IC performance. Moreover, power efficiency of these underutilized systems also degrades due to increasing leakage power. Alternatively, when designed with relaxed timing constraints, integrated systems are prone to functional failures. To simultaneously maintain correct functionality and increase system performance, numerous optimization techniques as well as offline and online models have recently been proposed, including pipelining, multicore computing, dynamic voltage and frequency scaling (DVFS), and ML-driven models [1], [2], [3], [4], [5], [6], [7], [8], [9].

Propagation delay in a processor is a strong function of the type, input operands, and output of the current operation, and of the computation history [4]. Computation history accounts for data overwrite and crosstalk noises. Intuitively, the majority of operations complete within a small portion of the clock period, as determined by the slowest path in the circuit. Based on the path delay distribution reported in [5],

• A. Fouman was with the Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL, 60607. E-mail: [email protected]
• I.
Partin-Vaisband was with the Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL, 60607. E-mail: [email protected]

the operational frequency can be doubled for the majority of instructions. In this work, the clock frequency is adjusted per instruction in real time, yielding a fundamentally different approach as compared with the traditional, task-based dynamic frequency scaling. The main contributions of this work are as follows:

1) A systematic flow is proposed and implemented as a unified platform for extracting ML input features from an instruction and classifying the instruction execution delay in real time.
2) A random forest (RF) model is trained to classify individual instructions into delay classes based on their type, input operands, and the computation history of the system.
3) A new pipeline stage is integrated within a pipelined MIPS processor.
4) The proposed method is synthesized and verified on the LegUp [13] benchmark suite of programs with Synopsys Design Compiler in a 45 nm CMOS technology node.

The rest of the paper is organized as follows. Section 2 describes prior and related work. Section 3 explains the proposed unified platform and the design methodology. ML algorithms for classification of instruction delay are described in Section 4. In Section 5 the implementation details of the system are introduced. Experimental results are presented in Section 6. Conclusions and future work are discussed in Section 7, and the paper is summarized in Section 8.

2 PRIOR AND RELATED WORK
Multiple approaches have been proposed for efficiently tuning the operating point (i.e., voltage supply and clock frequency) of a system at various levels of a computing system, including application- and task-based methods and instruction-level speculations.

Predicting timing violations in a constraint-relaxed system is impractical with deterministic approaches, due to the wide dynamic range of input and output signals (typically 32 or 64 bits), the variety of operations in a modern processor, and delay dependence on the runtime and physical characteristics of the system (e.g., crosstalk noise). ML-based approaches for predicting timing violations of individual instructions have recently been proposed, which consider the impact of input operands and computation history on timing violations [4], [14], [15]. While significant for the design process of next-generation scalable high-performance systems, these approaches have several limitations:

1) Instruction output is considered as an ML feature and exploited in these systems for predicting the timing characteristics of the individual instructions. These predictions are, however, carried out before the instruction execution, when the instruction output is not yet available, limiting the effectiveness of these methods in practical systems.
2) The modules under test are studied separately and evaluated in an isolated test environment without the effects of other processing elements (e.g., arithmetic modules, buffers, or multiplexers). The high reported accuracy is, therefore, expected to degrade if the methods are applied to a complex system (e.g., a practical execution unit).
3) Power and timing overheads due to additional hardware are not considered in these papers.

Granularity of prediction is another primary concern. A bit-level ML-based method has been proposed in [16] for predicting timing violations with reduced timing guardbands.
While up to 95% prediction accuracy has been reported with this method, the excessively high, per-bit granularity of the ML predictions is expected to exhibit substantial power, area, and timing overheads. These overheads are, however, not evaluated in [16]. Furthermore, a procedure for recovery upon a timing error is not provided and the recovery overheads are also not considered.

As an alternative to fine-grain, high-overhead ML methods, multiple coarse-grain schemes for timing error detection and recovery have been proposed to mitigate the adverse effect of pessimistic design constraints. A better-than-worst-case design approach has been introduced in [5]. With this approach, the clock period is set to a statistically nominal value (rather than the worst-case propagation delay) and the history of timing-erroneous program counters is kept in a ternary content-addressable memory (TCAM). The TCAM is exploited for predicting timing violations of the instructions based on previous observations. Note that the system only warns against those timing violations that have been previously recorded; unseen violations are not predicted with this approach. Owing to the apparent simplicity of this approach, only bi-state operating conditions (i.e., nominal and worst-case clock frequencies) can be efficiently utilized with this method. Alternatively, the design complexity and system overheads are expected to significantly increase with the increasing number of frequency domains.

In BandiTS [17], a reinforcement learning approach has been proposed to estimate the timing error probability (TEP) within a program time interval, given timing speculation (TS) ratios,
TSR = t_clk / t_nom, for various values of the reduced clock period t_clk and the worst-case clock period t_nom. The TS-based TEP problem is modeled in [17] as the classical multi-armed bandit problem [18], where the TS ratios and TEPs correspond to, respectively, the arms and stochastic rewards. The primary limitation of that work is the lack of details about the hardware implementation and overheads. In addition, a maximum achievable performance gain of only 25% has been reported. Furthermore, the BandiTS approach exhibits per-task clock granularity and scales the clock frequency for a batch of instructions. Higher performance gain is possible with fine-grain, per-instruction clock frequency adjustment, as shown in this paper.

Thermal-aware voltage scaling has been proposed in [19]. A voltage selection algorithm has been developed and integrated within the FPGA synthesis process to aggressively scale the core and block RAM voltages, utilizing the available thermal headroom of the FPGA-mapped design. As a result, a 36% reduction in power consumption has been demonstrated. Driven by workload and thermal power dissipation, this method, however, supports only coarse-grain voltage and frequency scaling.

Predicting program error rate in timing-speculative processors has been proposed in [20]. A statistical model is developed for predicting dynamic timing slack (DTS) at various pipeline stages. The predicted DTS values are exploited to estimate the timing error rate in a program. The implementation overheads, and the potential performance or power consumption gains, are, however, not reported with this approach.

An offline model for TS processors has been introduced in [21]. This probabilistic model is trained to optimally select a better-than-worst-case, nominal clock frequency. The provided hardware-based speculation, however, does not consider the overall workload or specific finer units, limiting the fidelity of the method.
Alternatively, the adverse effect of process variations on the propagation delay is considered, strengthening the approach in [21]. Note that PVT variations are also considered with the proposed approach of classifying instructions into delay intervals in real time, as described in the following sections.

Finally, ML-based methods for modeling system behavior have also been proposed. For example, in [6], linear regression has been leveraged for modeling the aging behavior of an embedded processor based on the current instruction and its operands, as well as the computation history and overall circuit switching activity. As a result, the timing guardband designed to compensate for aging in digital circuits can be effectively reduced, in the presence of graceful degradation [6]. Reallocation of delay budget has, however, not been considered with this method.

ML ICs can exhibit prohibitively high power consumption and physical size. Furthermore, ML ICs can introduce additional delay and increase design complexity, depending upon the application characteristics. To efficiently exploit ML methods for managing frequency in modern processors, the delay, power, and area of ML ICs should be considered.
3 THE PROPOSED ML-BASED FREQUENCY ADJUSTMENT
In this paper, a design methodology is proposed for ML-driven adjustment of operational frequency in pipelined processors. With the proposed method, individual instructions are classified into the corresponding propagation delay classes in real time, and the clock frequency is accordingly adjusted to reduce the gap between the actual propagation delay and the clock period. The classes are defined by segmenting the worst-case clock period into shorter delay fragments. Each class is characterized by a specific supply voltage and clock frequency. The primary design objective is to maximize system performance within an allocated energy budget. The overall delay and energy consumption are evaluated with the additional ML components, and with both the correct and incorrect predictions. The proposed scalable framework allows for other control configurations to be defined in a similar manner for different design objectives. The real-time clock adjustment is enabled by the recent advancement in clock management circuits [24].

In order to evaluate this method, a pipelined, 32-bit MIPS processor (TigerMIPS [22]) is utilized as the baseline processor. The ML classifier is designed as an additional pipeline stage within the pipelined MIPS processor, as shown in Fig. 1. The inputs to the additional ML pipeline stage are the current instruction and its operands, as well as the computation history, as defined by the toggled input bits (i.e., current inputs are XORed with the previous inputs) and the output of the previous operation. The choice of these parameters is in accordance with the results in [4] and [6]. These inputs are utilized as ML features for predicting the delay class of the current instruction based on the trained ML model. It is important to note that more complex, slower ML models can also be trained with this methodology, as long as the design complexity and hardware costs of the final system meet the specified constraints.
To meet the overall system throughput constraints, the trained models can be implemented as multiple pipeline stages, mitigating the additional latency introduced by the ML functions.

Fig. 1: The proposed pipeline with the additional ML stage. In this configuration, six ML features and three delay classes are illustrated.

Finally, the granularity of the output delay (e.g., three delay classes are illustrated in Fig. 1) can be varied to meet the timing constraints within the energy budget.

A systematic flow has been developed, implemented, and verified on TigerMIPS with the LegUp benchmark suite. The flow comprises three primary phases, as shown in Fig. 2. The individual phases are described in the following subsections.
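To make the class-to-clock mapping concrete, the following Python sketch maps a predicted delay class to its clock period and a measured delay to its ground-truth class. It assumes the three-class boundaries listed in Section 4; the helper names are illustrative, not part of the implementation.

```python
# A minimal sketch, assuming the three-class delay segmentation
# {[0.0, 1.8], (1.8, 2.6], (2.6, 4.0]} ns of the 4 ns worst-case period.
# Function names are ours, for illustration only.

CLASS_PERIODS_NS = [1.8, 2.6, 4.0]  # upper delay bound of each class


def clock_period_for(predicted_class: int) -> float:
    """Clock period (ns) applied for a predicted delay class."""
    return CLASS_PERIODS_NS[predicted_class]


def classify_delay(delay_ns: float) -> int:
    """Ground-truth delay class of a measured propagation delay."""
    for cls, bound in enumerate(CLASS_PERIODS_NS):
        if delay_ns <= bound:
            return cls
    raise ValueError("delay exceeds worst-case clock period")
```

An instruction predicted into a faster class than its true delay would violate timing at that shortened period, which is exactly the misclassification case handled by the recovery mechanism of Phase 3.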
3.1 Phase 1

First, the high-level hardware description language (HDL) model of the baseline processor is synthesized into a gate-level description model. During this phase, timing information is generated in the IEEE standard delay format (SDF). Based on this information, gate-level simulation (GLS) is performed and the instruction-level execution profile is generated. A profile comprises a list of instructions, the fetched or forwarded operands, the output of the operations, and the propagation delays. In addition to the execution profile, post place-and-route (PAR) reports, including timing and power information, are collected in this phase.
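As an illustration of the kind of record such a profile contains, the sketch below parses one profile line into the fields named above. The comma-separated layout, hexadecimal operand encoding, and field names are assumptions for illustration; the actual GLS profile format is tool-dependent.

```python
# Hypothetical profile-record parser; the line format is an assumption,
# not the flow's actual output format.
from dataclasses import dataclass


@dataclass
class ProfileRecord:
    instr: str       # mnemonic of the executed instruction
    op1: int         # first operand (32-bit, parsed from hex)
    op2: int         # second operand (32-bit, parsed from hex)
    output: int      # result of the operation
    delay_ns: float  # measured propagation delay in nanoseconds


def parse_profile_line(line: str) -> ProfileRecord:
    instr, op1, op2, out, delay = line.strip().split(",")
    return ProfileRecord(instr, int(op1, 16), int(op2, 16),
                         int(out, 16), float(delay))
```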
3.2 Phase 2

In this phase, the gate-level profiles from Phase 1 are parsed and utilized as ML features. Based on the extracted features, a preferred ML model is trained in Python with the Scikit-learn ML library [23]. An HDL code (e.g.,
Verilog in this paper) of the trained model is generated and integrated within the baseline processor as a single (or multiple) pipeline stage(s) between the decode and execute stages (see Fig. 1).
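The model-to-HDL step can be pictured with a minimal sketch: a trained decision tree becomes a nest of threshold comparisons. The toy tree below is hand-made (not a model trained by this flow), and the emitted text is only a Verilog-flavored rendering of the conditional structure such a generator could produce.

```python
# Sketch of lowering one decision tree to a nested HDL conditional.
# The tree is represented as dicts {feat, thr, lo, hi}; leaves are
# integer delay-class labels. All names here are illustrative.

def tree_to_verilog(node, indent=0):
    """Emit a Verilog-style nested if/else for a toy decision tree."""
    pad = "  " * indent
    if isinstance(node, int):                      # leaf: assign the class
        return f"{pad}delay_class = {node};"
    return (f"{pad}if ({node['feat']} <= {node['thr']}) begin\n"
            f"{tree_to_verilog(node['lo'], indent + 1)}\n"
            f"{pad}end else begin\n"
            f"{tree_to_verilog(node['hi'], indent + 1)}\n"
            f"{pad}end")

# Hand-made stand-in tree: small first operand -> fastest class.
toy_tree = {"feat": "op1", "thr": 65535,
            "lo": 0,
            "hi": {"feat": "xop2", "thr": 255, "lo": 1, "hi": 2}}

print(tree_to_verilog(toy_tree))
```

In hardware, each such comparison maps to a comparator, which is why the RF hardware cost in Section 4 is counted in comparators.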
3.3 Phase 3

During this phase, the modified high-level HDL model of the system with the ML pipeline stage is synthesized and profiled, as described in Phase 1. To guarantee functional correctness, the output signal is double-sampled to detect timing violations, and timing-erroneous instructions are re-executed with the worst-case clock frequency. Similar to the baseline iteration, the post-PAR reports are extracted for evaluating the timing and energy characteristics of the system. Finally, the profiling of the modified system is executed during this phase to evaluate the overall speedup of the system.

To optimize the final solution in terms of the operational frequency and energy consumption, the proposed flow is executed iteratively with various ML algorithms and clock fragments, as shown with the feedback in Fig. 2. The clock signal of the pipeline registers is assumed to be near-instantly switched based on the individual classification results, as has been experimentally demonstrated in [24].
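A back-of-the-envelope model of this speedup evaluation can be sketched as follows, assuming the 4 ns worst-case period and the four-cycle re-execution penalty discussed in Section 4; the function names are ours, and the model ignores second-order effects (e.g., pipeline stalls).

```python
# Rough model: a correctly (or safely) predicted instruction completes at
# its class period; an instruction misclassified as too fast is caught by
# double sampling and re-executed, costing four worst-case cycles
# (the repeated IF, ID, ML, and EX stages). Illustrative only.

WORST_CASE_NS = 4.0
REEXEC_CYCLES = 4


def effective_time_ns(true_delay: float, predicted_period: float) -> float:
    if true_delay <= predicted_period:
        return predicted_period                      # safe prediction
    return predicted_period + REEXEC_CYCLES * WORST_CASE_NS  # violation


def speedup(delays, periods):
    """Speedup vs. running every instruction at the worst-case period."""
    baseline = WORST_CASE_NS * len(delays)
    actual = sum(effective_time_ns(d, p) for d, p in zip(delays, periods))
    return baseline / actual
```

This simple model already shows why the type of misclassification matters: a too-slow prediction only forfeits some gain, while a too-fast prediction pays the full re-execution penalty.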
4 MACHINE LEARNING MODELS
Owing to the unique learning characteristics and hardware trade-offs of neural networks (NNs), support vector machines (SVMs), and random forest (RF) models, all these ML models are considered in this paper. Each model is trained based on the instruction profiles extracted from a
Fig. 2: Systematic flow for designing the ML predictor within a typical pipelined processor.

synthetically generated dataset of 3,000 random instructions per class. The delay boundaries of the individual classes are experimentally determined with respect to the worst-case delay of 4 ns as follows: {[0.0, 2.2], (2.2, 4.0]} for the two-class configuration, {[0.0, 1.8], (1.8, 2.6], (2.6, 4.0]} for the three-class configuration, and {[0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]} for the four-class configuration.

The feature vector of the i-th instruction comprises six elements, x_i = (instr, op1, op2, Xop1, Xop2, output). The first feature, instr, comprises four subfeatures, representing the type of the operation in one-hot format:

instr = 1000, if arithmetic; 0100, if arithmetic with immediate operand; 0010, if logical; 0001, if multiplication or division.

The subsequent four elements are defined by the operands. The features op1 and op2 are the first and second operands of the instruction, and the features Xop1 and Xop2 are the XORed values of the first and second operands with their respective previous values. The last feature, output, is the output of the preceding instruction. The last three elements of the feature vector are exploited to capture the effect of computation history on the instruction delay. Note that the operands and output of the preceding instruction are 32 bits long, as determined by the 32-bit baseline processor utilized in this work. Thus, the distribution of these features significantly differs from the distribution of the operation-type subfeatures. To balance the overall distribution of the individual features, the input features are preprocessed and scaled to follow a normal distribution using the quantile transformer in the Python scikit-learn library. An example of operand and output features with and without the transformation is shown in Fig. 3 for arithmetic and logical instructions. Note that the type subfeatures remain unchanged.
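Assembling this feature vector can be sketched in a few lines of plain Python; the type labels and argument names are illustrative, and the quantile-transformer scaling step (scikit-learn's QuantileTransformer in the paper's flow) is omitted here.

```python
# Sketch of building x_i = (instr, op1, op2, Xop1, Xop2, output).
# Type labels and the encoding order (arithmetic, arithmetic-immediate,
# logical, mul/div) follow the text; names are ours, for illustration.

TYPES = ["arith", "arith_imm", "logical", "muldiv"]


def feature_vector(instr_type, op1, op2, prev_op1, prev_op2, prev_output):
    one_hot = [1 if instr_type == t else 0 for t in TYPES]
    xop1 = op1 ^ prev_op1   # toggled input bits capture computation history
    xop2 = op2 ^ prev_op2
    # Nine scalar inputs total: four type subfeatures plus five 32-bit fields.
    return one_hot + [op1, op2, xop1, xop2, prev_output]
```

The nine scalar entries returned here match the nine input-layer neurons of the NN models described below.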
To evaluate the efficiency and efficacy of the proposed method, propagation delay classification is investigated with three common ML algorithms: NN, SVM, and RF. The configuration of each of the three ML models is described in the following subsections, including the hyperparameters, performance, and hardware costs of the individual ML algorithms. All the algorithms are five-fold cross-validated based on three thousand randomly generated instructions per class. While finding an effective metric for stability of the evaluation is still an open question, k-fold cross-validation with 5 ≤ K ≤ 10 is typically used, as these K values have been demonstrated to simultaneously minimize the bias and variance across many studied test sets [25], [26], [27], [28]. Thus, K = 5 is used in this work. ML accuracy is reported as the F1-score of delay classification, and the resultant speedup for each benchmark program has been considered in determining the performance of each ML algorithm. Hardware cost is evaluated as the number of additional transistors required for implementing the individual ML algorithms and has also been considered in determining the performance of the ML algorithms. Among the evaluated ML algorithms, the RF classifier is preferred in this work due to the favorable tradeoff between the performance gain and hardware costs, as well as the relative simplicity of the RF algorithm, as explained in the following subsections.

NNs excel in learning complex hidden patterns in large datasets and have exhibited particular supremacy in vision and text applications as compared with classical ML algorithms. Following this success, promising results have been shown with NNs in various hardware-related applications [29], [30], [31].

To determine the preferred set of hyperparameters for the two-, three-, and four-class NN models, a grid search is executed for each multiclass NN over the following ranges:
Fig. 3: A typical feature vector with and without the ML preprocessing, (a) for an arithmetic operation with an immediate operand, and (b) for a logical operation. Note that the values without preprocessing are shown on a logarithmic scale, while the values with preprocessing are shown on a linear scale.

1) Identity, tanh, logistic, and ReLU activation functions,
2) Stochastic gradient descent [32], lbfgs (a limited-memory BFGS quasi-Newton optimization algorithm [33]), and Adam (an adaptive learning rate optimization algorithm [34]) solvers, and
3) A single m-neuron hidden layer (m ∈ { , , , }) and two hidden NN layers with m1 and m2 neurons in, respectively, the first and second layers (m1 × m2 ∈ { × , × , × }).

The networks are trained using the backpropagation algorithm for 200 epochs until convergence with a quasi-Newton optimizer. Note that the number of neurons in the input and output layers is determined by, respectively, the number of ML features (nine, including the four instruction-type subfeatures) and the number of ML classes (two, three, and four). The top ten grid search results (within 1% of the highest F1-score) are listed in Table 1 for each of the multiclass NNs, in descending order of the F1-scores. The hardware cost is determined based on the number of transistors comprising the NN adders and multipliers. The transistor count for the individual NN adders and multipliers is determined based on [35]. The number of multipliers, N_MULT, and adders, N_ADD, in a NN with L

TABLE 1: Top (within 1% of the highest F1-score) NN configurations and their respective performance metrics (i.e., speedup, hardware cost (in million transistors), and speedup per hardware metric (SPH)).
# Activation Solver Neurons F1-score Speedup HW cost SPH
1 tanh lbfgs 10 0.859 1.915 2.834 0.676
2 relu adam …
Positive standard deviation (σ+) 0.002 0.012 1.634 0.095
Negative standard deviation (σ−) 0.002 0.018 0.961 0.040

# Activation Solver Neurons F1-score Speedup HW cost SPH
1 logistic lbfgs …
Positive standard deviation (σ+) 0.001 3.7E-4 1.694 0.460
Negative standard deviation (σ−) 0.001 4.0E-4 1.148 0.049

# Activation Solver Neurons F1-score Speedup HW cost SPH
1 logistic adam …
Positive standard deviation (σ+) 1.0E-4 1.8E-4 2.252 0.294
Negative standard deviation (σ−) 0.003 0.002 1.217 0.137

layers is determined, respectively, as

N_MULT = \sum_{i=1}^{L} m_i \cdot v_i, (1)

and

N_ADD = \sum_{i=1}^{L} m_i \cdot (v_i - 1), (2)

where m_i is the number of neurons in each layer, and v_i is the size of the input vector to each layer (or the feature vector size in the input layer).

The speedup per hardware cost (SPH) is also listed in Table 1 for each of the NN configurations. These top NN results are compared with the SVM and RF top results, as described at the end of this section. As a general rule, the learning capacity of a NN increases with the network complexity (i.e., number of neurons and number of layers). For a NN to be competitive with or outperform a classical ML algorithm, a large number of neurons and layers is required, significantly increasing the system complexity and hardware overhead of NN-based solutions.

An SVM classifier generates an optimal hyperplane which separates data samples in feature space with the objective of minimizing the classification error. A linear SVM can only classify linearly separable data. Alternatively, to learn complex nonlinear data patterns, SVM can be combined with a kernel trick, enabling the feature transformation into a linearly separable space [36].
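The NN hardware-cost expressions of Eqs. (1) and (2) above can be checked numerically with a short helper; the function name and arguments are ours, for illustration.

```python
# Multiplier and adder counts per Eqs. (1) and (2) for a fully connected
# NN: each layer of m_i neurons with input size v_i needs m_i * v_i
# multipliers and m_i * (v_i - 1) adders. Illustrative helper, not the
# paper's tooling.

def nn_mult_add(layer_widths, input_size):
    n_mult = n_add = 0
    v = input_size
    for m in layer_widths:
        n_mult += m * v         # Eq. (1): m_i * v_i products
        n_add += m * (v - 1)    # Eq. (2): m_i * (v_i - 1) additions
        v = m                   # this layer's output feeds the next layer
    return n_mult, n_add
```

For example, a single ten-neuron hidden layer over the nine input features requires 90 multipliers and 80 adders before the output layer is counted.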
In this work, a grid search is performed over the following kernel SVM hyperparameters:

1) Linear, polynomial, and radial basis function (rbf) kernels,
2) Integer degree of flexibility of the polynomial decision boundary, d ∈ [2, 11], and
3) The influence on the model of a single sample in a training set with N features and variance Var, by scaling (i.e., gamma = 1/(N · Var)) or not scaling (i.e., gamma = 1/N) the kernel coefficient, gamma.

The sets of hyperparameters with the highest F1-scores are listed in Table 2. The speedup, hardware cost, and SPH metric are also listed in the table for all the SVM configurations. SVM hardware cost is determined as the number of transistors, based on the method presented in [37]. SVM often exhibits excellent performance as compared with other learning algorithms, at the expense of higher computational and design complexity, and accordingly higher power and area overheads [38]. These tradeoffs are discussed at the end of this section.

An RF classifier is an ensemble of decision tree classifiers. The input samples are split into multiple sample subsets and each decision tree is trained on one training subset. The final classification decision for each sample is made based on the result of averaging the individual tree decisions (i.e., ensembling). RF models benefit from the accuracy, training speed, and interpretability of the decision tree model, while the ensembling mitigates the overfitting otherwise common to decision tree classifiers. RF is often preferred in scientific and practical applications [4], [39]. The computational and hardware complexity of RF is a strong function of the number and depth of the decision trees. The depth of the individual trees is dependent on the number of features and their correlation.
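The exhaustive sweep over these two RF hyperparameters can be skeletonized as below; `train_and_score` is a stand-in for fitting a classifier (a scikit-learn `RandomForestClassifier` in the paper's flow) and returning its cross-validated F1-score.

```python
# Generic hyperparameter grid-search skeleton; the scoring callback is a
# hypothetical stand-in for the actual train/cross-validate step.
from itertools import product


def grid_search(train_and_score, n_estimators_grid, max_depth_grid):
    """Return the (n_estimators, max_depth) pair with the best score."""
    best, best_score = None, -1.0
    for n_est, depth in product(n_estimators_grid, max_depth_grid):
        score = train_and_score(n_estimators=n_est, max_depth=depth)
        if score > best_score:
            best, best_score = (n_est, depth), score
    return best, best_score
```

In practice the same loop is what `sklearn.model_selection.GridSearchCV` automates, with the cross-validation folds of Section 4 supplying the score.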
In this work, an RF grid search is performed over the following ranges of hyperparameters:

1) Number of trees in the forest, n_estimators ∈ { , , , , },
2) Maximum number of levels in each tree, max_depth ∈ { , , , , }.

The results of the top estimators (within 1% of the highest F1-score) are listed in Table 3. The hardware cost of an

TABLE 2: Top (within 1% of the highest F1-score) SVM configurations and their respective performance metrics (i.e., speedup, hardware cost (in million transistors), and speedup per hardware metric (SPH)).

# kernel degree gamma F1-score Speedup HW cost SPH
1 poly 5 scale 0.837 1.873 1323.343 0.001
2 poly 4 scale 0.834 1.899 1307.230 0.001
3 poly 6 scale 0.833 1.889 1352.670 0.001
4 poly 3 scale 0.828 1.910 1291.739 0.001
5 poly 7 scale 0.827 1.923 1412.040 0.001
6 poly 8 scale 0.826 1.915 1476.320 0.001
7 poly 9 scale 0.823 1.908 1534.991 0.001
8 rbf N/A scale 0.822 1.916 228.524 0.008
9 poly 10 scale 0.819 1.814 1596.913 0.001
10 poly 11 scale 0.813 1.793 1654.040 0.001
Average 0.826 1.884 1317.781 0.002
Positive standard deviation (σ+) 0.003 0.010 74.701 0.006
Negative standard deviation (σ−) 0.003 0.038 363.206 2.3E-4

# kernel degree gamma F1-score Speedup HW cost SPH
1 poly 5 scale 0.876 1.604 755.765 0.002
2 poly 4 scale 0.875 1.617 715.028 0.002
3 poly 3 scale 0.875 1.617 675.647 0.002
4 rbf N/A scale 0.875 1.617 119.954 0.013
5 poly 7 scale 0.873 1.627 847.086 0.002
6 poly 2 scale 0.872 1.618 677.269 0.002
7 poly 6 scale 0.872 1.601 800.553 0.002
8 poly 8 scale 0.870 1.613 900.614 0.002
9 rbf N/A auto 0.869 1.617 135.990 0.012
10 poly 9 scale 0.869 1.660 948.560 0.002
Average 0.873 1.619 657.647 0.004
Positive standard deviation (σ+) 0.001 0.021 57.771 0.006
Negative standard deviation (σ−) 0.001 0.003 374.579 7.4E-4

# kernel degree gamma F1-score Speedup HW cost SPH
1 rbf N/A scale 0.957 1.680 28.616 0.059
2 poly 4 scale 0.956 1.679 169.482 0.010
3 poly 5 scale 0.955 1.681 181.843 0.009
4 poly 6 scale 0.954 1.683 198.685 0.008
5 poly 3 auto 0.954 1.674 323.000
0.005
6 poly 3 scale 0.953 1.677 158.120 0.011
7 poly 8 scale 0.952 1.683 225.245 0.007
8 poly 7 scale 0.952 1.678 212.632 0.008
9 poly 2 scale 0.951 1.666 145.830 0.011
10 rbf N/A auto 0.951 1.629 27.952 0.058
Average 0.954 1.673 167.141 0.019
Positive standard deviation (σ+) 9.2E-4 0.003 29.323 0.028
Negative standard deviation (σ−) 8.3E-4 0.022 49.433 0.004

The hardware cost of an RF classifier is evaluated based on the number of required comparators, O(n_estimators × log(max_depth)), and is reported in terms of the total number of RF transistors. The transistor count for a single comparator is determined based on [40].

The tradeoffs between the speedup and F1-score are summarized in Fig. 4 for all the classifiers. Note that speedup does not increase with F1-score in all cases. This is due to the effect of the type of misclassification on the overall speedup. For example, if a slow instruction is classified into a faster class, the result at the output of the execution unit at the end of the fast clock period is incorrect. Thus, a four-clock-cycle penalty is incurred to re-execute the slow instruction, compensating for the combined latency of the re-executed IF, ID, ML, and EX stages. Alternatively, if a fast instruction is misclassified into a slow class, the execution still results
TABLE 3: Top (within 1% of the highest F1-score) RF configurations and their respective performance metrics (i.e., speedup, hardware cost (in million transistors), and speedup per hardware metric (SPH)).

# max_depth n_estimators F1-score Speedup HW cost SPH
1 30 50 0.852 1.835 0.177 10.357
2 10 200 0.850 1.925 0.480 4.012
3 30 200 0.850 1.889 0.709 2.666
4 10 100 0.849 1.913 0.240 7.978
5 50 200 0.849 1.833 0.815 2.249
6 50 100 0.846 1.856 0.407 4.554
7 20 200 0.845 1.874 0.624 3.004
8 50 50 0.843 1.836 0.204 9.011
9 30 100 0.842 1.838 0.354 5.187
10 10 50 0.840 1.902 0.120 15.859
Average 0.847 1.870 0.413 6.488
Positive standard deviation (σ+) 0.002 0.016 0.137 2.638
Negative standard deviation (σ−) 0.002 0.014 0.078 1.250

# max_depth n_estimators F1-score Speedup HW cost SPH
1 20 50 0.949 1.879 0.156 12.040
2 30 100 0.947 1.856 0.354 5.238
3 20 200 0.946 1.851 0.624 2.966
4 40 200 0.945 1.847 0.768 2.403
5 40 50 0.944 1.843 0.192 9.591
6 50 200 0.944 1.839 0.815 2.257
7 10 200 0.944 1.848 0.480 3.853
8 30 200 0.942 1.814 0.709 2.561
9 10 100 0.940 1.828 0.240 7.621
10 10 50 0.939 1.826 0.120 15.224
Average 0.944 1.843 0.446 6.375
Positive standard deviation (σ+) 0.002 0.008 0.117 2.764
Negative standard deviation (σ−) 0.002 0.007 0.111 1.360

# max_depth n_estimators F1-score Speedup HW cost SPH
1 40 50 0.981 1.688 0.192 8.785
2 30 200 0.981 1.686 0.709 2.380
3 30 100 0.981 1.683 0.354 4.750
4 40 200 0.981 1.686 0.768 2.193
5 10 50 0.981 1.686 0.120 14.058
6 20 200 0.981 1.683 0.624 2.696
7 10 200 0.980 1.686 0.480 3.515
8 50 200 0.980 1.687 0.815 2.070
9 10 100 0.980 1.685 0.240 7.024
10 20 100 0.979 1.683 0.312 5.393
Average 0.980 1.685 0.461 5.286
Positive standard deviation (σ+) 2.0E-04 0.001 0.111 2.401
Negative standard deviation (σ−) 4.3E-04 0.001 0.104 1.034

in a correct answer, albeit with a potential loss in performance gain.
In addition, if a fast instruction is classified into a nominal-delay class (for example, in the case with three delay classes), the overall performance of the system is still increased (but not maximized) as compared with the execution in the slowest delay class (as designed for the worst-case clock period). To understand the significance of speed and overhead in the overall performance of individual ML classifiers, the SPH metric is considered. The SPH results (as determined based on Tables 1-3) are shown in Fig. 5 for the NN, SVM, and RF classifiers in two-, three-, and four-class configurations. Based on these results, RF exhibits the best tradeoff between the hardware cost and speedup, as well as the lowest design complexity and hardware overheads. The RF classifier is, therefore, preferred in this work as a demonstration vehicle of the proposed framework.

Fig. 4: Speedup vs. F1-score for two-, three-, and four-class configurations based on Tables 1-3.

5 IMPLEMENTATION
The proposed framework is implemented with the RF model within TigerMIPS and evaluated based on the LegUp benchmarks. The details of the implementation are described in this section.
A holistic platform is developed based on the proposed system design methodology, as illustrated in Fig. 2. The framework is unified within a shell programming platform supported by several peripheral programs developed in C++ and Python. The synthesis steps, as described in Fig. 2, are sequentially executed from
Start to Finish .During the first phase, Synopsys Design Compiler iscalled with the high-level HDL model of the baseline pro-cessor. The profiler triggers are added to the system andGLS is performed in Modelsim.The second phase is triggered upon the completion of theinstruction profiling. An external parser program is calledto transform the instruction profiles into the ML featuredata structure and eliminate outliers. The model is trainedto classify propagation delays into user-defined number ofclasses based on a user-specified learning algorithm anddelay boundaries. The ML accuracy and estimated speedupare evaluated upon the training completion. If the design re-quirements are met, the ML software model is transformedinto the high-level HDL code. Otherwise, ML model isretrained with new parameters.Upon training completion, the HDL code of the MLmodel is instantiated within the original HDL model ofthe baseline processor. Finally, the procedure in Phase 1 isrepeated in Phase 3 with the modified processor model,and the overall system performance and overheads areevaluated.
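The Phase-2 training step can be sketched with scikit-learn, which the authors cite for model training [23]. The feature encoding, delay values, and class boundary below are illustrative placeholders, not the paper's exact feature set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative features: operation-type bits, operand bits, history bits.
X = rng.integers(0, 2, size=(2000, 16))
# Illustrative per-instruction delays; the first feature is made informative.
delay_ns = rng.uniform(0.2, 2.0, size=2000) + 0.6 * X[:, 0]
# Two delay classes split at a user-defined boundary (here 1.2 ns).
y = (delay_ns > 1.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, max_depth=30, random_state=0)
clf.fit(X_tr, y_tr)
score = f1_score(y_te, clf.predict(X_te), average="weighted")
print(round(score, 3))
```

If the score meets the design target, the fitted trees would then be translated into HDL; otherwise training repeats with new parameters, mirroring the retraining loop of Fig. 2.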
The proposed framework is demonstrated on TigerMIPS. In addition to the basic MIPS units, such as Instruction Fetch (IF), Instruction Decode (ID), Execute (Exe), Memory access (Mem), and Write-back (WB), TigerMIPS comprises advanced units, such as a forwarding unit, a branch handling unit, stall logic, and instruction and data caches, which are common in modern pipeline processors.
Fig. 5: Speedup per hardware cost (SPH) for two-, three-, and four-class configurations. The hardware cost is evaluated based on the number of transistors needed to realize each classifier. The SPH performance is highest with the RF classifier as compared with the SVM- and NN-based classifiers for each of the classifier configurations. Numbers correspond to data listed in Tables 1, 2, and 3.
The baseline model is synthesized in the 45 nm NanGate CMOS technology node with Synopsys Design Compiler. Upon completion of the synthesis, triggers are implemented in Verilog HDL, enabling data and timestamp sampling at the input and output of the execution unit within the MIPS pipeline. The profiling is performed based on GLS with the ModelSim simulator.
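As an illustration of the profiling step, the sampled input/output timestamps can be reduced to per-instruction propagation delays. The log format and field names below are assumptions for the sketch, not the actual trigger implementation:

```python
# Hypothetical profiler records: (instruction id, EX input time, EX output
# time), with timestamps in picoseconds as sampled by the HDL triggers.
log = [(0, 1000, 1830), (1, 3000, 3410), (2, 5000, 6120)]

def propagation_delays(entries):
    """Map each instruction id to its execution-stage delay in picoseconds."""
    return {i: t_out - t_in for i, t_in, t_out in entries}

print(propagation_delays(log))  # {0: 830, 1: 410, 2: 1120}
```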
The trained ML model is first validated in Python. The HDL code of the validated ML model is integrated into the baseline processor. Finally, the modified processor is synthesized, and its functionality is verified through GLS. The post-place-and-route (PAR) reports are utilized to evaluate the modified system with respect to the specified design constraints.
EXPERIMENTAL RESULTS
To demonstrate the framework, the LegUp high-level synthesis benchmark suite, coupled with the LLVM compiler toolchain [41], is utilized for profiling and verification during GLS. The trained RF model is tested with nine standard benchmark programs available within the LegUp benchmark suite and an additional synthetically generated benchmark with one million random instructions. The F1-score is shown in Fig. 6 for two, three, and four ML delay classes, yielding an F1-score above 95% for the majority of the programs with two delay classes. The resultant speedup for the individual benchmarks is shown in Fig. 7, including the practical speedup (with the misclassification penalty), the no-penalty speedup (without the misclassification penalty), and the ideal speedup (with 100% classification accuracy). The energy overhead due to the additional ML hardware and classification errors is listed in Table 4. To account for delay overheads due to the misclassification of a slow instruction into a higher-performance class, a re-execution penalty of four clock cycles (compensating for the IF, ID, ML, and EX stages) is considered within the performance results, as reported in Fig. 7. The no-penalty speedup is also presented in Fig. 7, visualizing the penalty due to the misclassification of a fast instruction into a slow class. Note that the overall speedup with the four-class configuration is higher than the speedup with the two-class configuration, despite the higher classification accuracy with two delay classes. Alternatively, the higher misclassification rate with four delay classes yields higher re-execution energy consumption, as listed in Table 4. Also, note that a negative energy overhead indicates a reduction in the overall energy consumption (i.e., power-delay product).

A performance comparison between the proposed method and state-of-the-art (ML and non-ML) DVFS approaches is listed in Table 5. For example, both the proposed framework and the approach in [5] consider binary classification with two execution delay classes.
The proposed method exhibits 3.5 times higher speedup gain and 33% energy savings, as compared with the 3% energy overhead reported in [5]. As compared with the adaptive approach in [24], the proposed method exhibits up to a 4.9 times increase in performance gain with 50% less energy savings. Alternatively, a 3.85 times higher performance gain is demonstrated as compared to [24] with similar energy savings.

The power overhead per instruction for the two-, three-, and four-delay-class configurations is also determined for the programs in the LegUp benchmark suite. The average power overhead (due to the additional ML stage and re-execution of misclassified instructions) is shown in Fig. 8. The average power is linearly reduced with the increasing number of program instructions, exhibiting an overhead of less than 0.02 microwatts in practical applications with more than one million instructions. Furthermore, the additional average power consumption rapidly converges for various numbers of classes, as shown in Fig. 8. Thus, when optimizing the number of delay classes in processors with large workloads, the power overhead is a secondary factor. Finally, the steeper decrease in the power overhead with the four-class configuration supports the previous assertion regarding the gain-overhead tradeoff with finer granularity of delay classes: as the number of instructions increases, the higher accuracy with the four-class configuration mitigates the adverse effects of misclassifications on the overall system frequency.

Fig. 6: Inference RF classification based on the LegUp benchmark suite with two, three, and four classes.

Fig. 7: Experimental speedup with the proposed ML framework with two, three, and four delay classes. Practical, no-penalty, and ideal speedups are presented for each benchmark and class. The practical speedup considers the experimental classification accuracy and delay overheads due to misclassification of a slow instruction into a fast class. The no-penalty speedup considers the experimental accuracy but disregards the idle time due to misclassification of a fast instruction into a slow class. Finally, the ideal speedup is the theoretical maximum with 100% classification accuracy.
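The relation between the practical and ideal speedups can be illustrated from first principles. The sketch below uses assumed instruction mixes and clock periods (not the measured benchmark data) and applies the four-cycle re-execution penalty for a slow instruction misclassified as fast:

```python
def practical_speedup(n_fast, n_slow, p_miss_slow, fast_period, slow_period,
                      penalty_cycles=4):
    """Speedup over running every instruction at the worst-case (slow) clock.
    A slow instruction misclassified as fast costs penalty_cycles at the fast
    period (refilling IF, ID, ML, and EX) plus one slow-period re-execution."""
    baseline = (n_fast + n_slow) * slow_period
    missed = n_slow * p_miss_slow
    adaptive = (n_fast * fast_period
                + (n_slow - missed) * slow_period
                + missed * (penalty_cycles * fast_period + slow_period))
    return baseline / adaptive

# 80% fast instructions, 5% slow-as-fast misclassification, fast clock
# running at twice the worst-case rate.
print(round(practical_speedup(800, 200, 0.05, 1.0, 2.0), 3))  # 1.613
```

Setting p_miss_slow to zero recovers the ideal speedup for this mix (2000/1200, about 1.67), illustrating how misclassification erodes the theoretical maximum.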
CONCLUSIONS AND FUTURE WORK
The proposed unified framework facilitates efficient utilization of the time and hardware resources in the system. In addition, this approach enables the design of ML pipeline stages while satisfying design constraints, as shown in Fig. 2. Finally, classification of instructions into delay intervals in real time alleviates the path propagation variances imposed by PVT variations and system aging. To enhance the performance gain, the proposed approach should be preferred for those applications and systems characterized by considerable variations in the propagation delay of the individual instructions.

This method is practical with pipelined, MIPS-like processors, in which the overall delay is dominated by the delay of the execution stage.

TABLE 4: Experimental power and energy overhead of the proposed ML method.

Four delay classes:
Benchmark   Practical speedup   Power overhead   Energy overhead   Instruction count
rand1M            1.923             38.5%            -27.99%            1000000
adpcm             1.497             56.49%             4.55%              30197
aes               2.087             45.14%           -30.46%              11223
blowfish          1.633             65.01%             1.02%             199759
fft               1.165             42.45%            22.28%              11001
fir               2.908             25.94%           -56.7%                7024
gsm               1.382             45.77%             5.45%               7671
jpeg              1.792             55.04%           -13.49%            1133161
sha               1.657             62.07%            -2.22%             345576
sra               2.840             20.62%           -57.53%               1775
Average           1.889             45.7%            -15.51%           274738.7
σ+                0.35               5.81%             8.70%           374831.68
σ−                0.17               6.58%            15.50%            92682.62

Three delay classes:
Benchmark   Practical speedup   Power overhead   Energy overhead   Instruction count
rand1M            1.765             23.3%            -30.151%           1000000
adpcm             1.389             33.69%            -3.768%             30197
aes               1.987             27.06%           -36.051%             11223
blowfish          2.125             39.27%           -34.456%            199759
fft               1.957             25.47%           -35.872%             11001
fir               2.222             16.14%           -47.727%              7024
gsm               1.465             27.87%           -12.729%              7671
jpeg              1.786             32.62%           -25.74%            1133161
sha               1.578             37.93%           -12.566%            345576
sra               2.013              7.18%           -46.745%              1775
Average           1.829             27.05%           -28.58%           274738.7
σ+                0.11               3.09%             8.41%           374831.68
σ−                0.13               5.76%             4.84%            92682.62

Two delay classes:
Benchmark   Practical speedup   Power overhead   Energy overhead   Instruction count
rand1M            1.530             14.29%           -25.323%           1000000
adpcm             1.418             22.4%            -13.654%             30197
aes               1.818             17.1%            -35.595%             11223
blowfish          1.818             27.23%           -30.024%            199759
fft               1.646             16.14%           -29.45%              11001
fir               1.818             10.5%            -39.225%              7024
gsm               1.635             18.34%           -27.642%              7671
jpeg              1.665             22%              -26.727%           1133161
sha               1.818             27.23%           -30.024%            345576
sra               1.818              5.5%            -41.975%              1775
Average           1.699             18.073%          -29.964%          274738.7
σ+                0.05               2.84%             3.49%           374831.68
σ−                0.07               3.06%             3.24%            92682.62
Although the proposed method is explored in this work with a single-core system, further increases in energy efficiency and overall system performance are expected if the approach is adjusted for modern processor architectures with out-of-order execution and for multicore processors with multiple frequency domains. To exploit the positive impact of out-of-order execution and multicore systems on the performance and energy efficiency of commercial-class processors, the following methodologies should be considered.

Fig. 8: Power overhead per instruction for the two-, three-, and four-delay-class configurations based on the benchmarks in Table 4.

TABLE 5: Comparison between the proposed method and existing state-of-the-art methods.

Algorithm                            Performance gain   Energy overhead   ML based
SLoT [4]                                  23%                N/A            Yes
Early Prediction [5]                      20%                 3%            No
Clim [14]                                 24%                N/A            Yes
SLBM [16]                                 15%                N/A            Yes
Adaptive Clock Management [24]            18.2%             -30.4%          No
This work (2 classes)                     70%               -30%            Yes
This work (3 classes)                     83%               -28.6%          Yes
This work (4 classes)                     89%               -15.5%          Yes
To support out-of-order execution, instructions within a delay class should be bundled into a delay-class-specific reservation station (RS). Instructions stored in an RS are individually executed at a constant frequency until the RS is emptied or a dependency is determined, preventing further execution of the instructions in the RS. Such bundling of instructions reduces the number of clock signal transitions among various frequencies, increasing the performance and power efficiency of the system.
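The bundling idea above can be sketched as grouping a predicted-delay-class stream into per-class reservation stations; the data structures and switch counting are illustrative, not the proposed hardware:

```python
from collections import defaultdict

def bundle_by_class(instr_classes):
    """Group instruction indices by predicted delay class and count the
    frequency switches an in-order execution of the stream would incur."""
    stations = defaultdict(list)  # delay class -> reservation station
    switches = 0
    prev = None
    for i, c in enumerate(instr_classes):
        stations[c].append(i)
        if prev is not None and c != prev:
            switches += 1
        prev = c
    return dict(stations), switches

stations, inorder_switches = bundle_by_class([0, 0, 1, 1, 0, 2, 2])
# Draining each station to completion needs at most len(stations) - 1
# frequency switches, versus inorder_switches for the raw stream.
print(inorder_switches, len(stations) - 1)  # 3 2
```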
As previously, to support out-of-order execution, instructions should be bundled based on the delay classes and stored within the matching RSs. To support multi-clock execution, the ALUs and FPUs within the execution unit should be operated at different clock frequencies, as determined by the granularity of the delay classes. Intuitively, the parallelization of execution from different delay classes with this approach decreases the number of clock adjustments, increasing the system performance and energy efficiency.

To leverage the advantages provided by processing with multiple clock domains in multicore systems, bundled instructions within the individual clock domains (as defined in subsection 7.1) should be shared among all the system clock domains, mitigating the additional cost of multiple clocking (as described in subsection 7.2). To enable the sharing of bundles, efficient bundle scheduling and low-overhead communication channels are required. While the number of clock adjustments is expected to be further reduced with this approach, additional overheads due to intelligent communication of bundles among the cores should be considered. Alternatively, by partially or fully replacing the traditional DFS, DVFS, and thread scheduling mechanisms, additional savings are expected with the proposed approach. Finally, the proposed method can be adjusted in a similar manner to classify the instruction propagation delay of various pipeline stages.

Existing approaches are focused on offline speculations, statistical models, per-task (workload-based) frequency scaling, and prediction of timing errors at an operating point of a system. Alternatively, the proposed method demonstrates the benefits of fine-grain, instruction-level frequency adjustment, simultaneously utilizing most of the clock period slack and mitigating the adverse effects of PVT variations and aging.
SUMMARY
In this work, an additional ML pipeline stage is proposed for increasing the overall system performance by enhancing the temporal resource utilization. This additional stage is designed to classify instructions into propagation delay classes. The system clock frequency is adaptively adjusted based on the individual delay class predictions. Pipelining is exploited to mitigate the effect of the ML stage latency on the overall system performance. Practical ML features are extracted based on the current instruction and computation history. The ML hardware and misclassification power and delay overheads are considered within the reported results. TigerMIPS is utilized as the baseline processor. The processor is enhanced with the ML predictor and simulated with the LegUp benchmark suite. Based on the experimental results, up to 89% performance gain is achieved with four delay classes with 15.5% energy savings. Alternatively, a reduction of 30% in energy consumption with 70% performance gain is demonstrated with two delay classes. A unified shell programming platform with peripheral programs is designed to provide a systematic design flow for ML-driven pipelined processors.

REFERENCES

[1] Fields B, Bodik R, Hill MD. Slack: Maximizing performance under technological constraints. In Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002, pp. 47-58. IEEE.
[2] Zyuban V, Brooks D, Srinivasan V, Gschwind M, Bose P, Strenski PN, Emma PG. Integrated analysis of power and performance for pipelined microprocessors. IEEE Transactions on Computers. 2004;53(8):1004-16.
[3] Kumar R, Farkas KI, Jouppi NP, Ranganathan P, Tullsen DM. Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003, p. 81. IEEE Computer Society.
[4] Jiao X, Jiang Y, Rahimi A, Gupta RK. SLoT: A supervised learning model to predict dynamic timing errors of functional units. In Proceedings of the Conference on Design, Automation & Test in Europe, 2017, pp. 1183-1188. European Design and Automation Association.
[5] Hashemi SH, Ajirlou AF, Soltani M, Navabi Z. Early prediction of timing critical instructions in pipeline processor. In 2016 15th Biennial Baltic Electronics Conference (BEC), 2016, pp. 95-98. IEEE.
[6] Moghaddasi I, Fouman A, Salehi ME, Kargahi M. Instruction-level NBTI stress estimation and its application in runtime aging prediction for embedded processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2018.
[7] Gepner P, Kowalik MF. Multi-core processors: New way to achieve high system performance. In International Symposium on Parallel Computing in Electrical Engineering (PARELEC'06), 2006, pp. 9-13. IEEE.
[8] Hu Z, Buyuktosunoglu A, Srinivasan V, Zyuban V, Jacobson H, Bose P. Microarchitectural techniques for power gating of execution units. In Proceedings of the 2004 International Symposium on Low Power Electronics and Design, 2004, pp. 32-37. ACM.
[9] Wu Q, Pedram M, Wu X. Clock-gating and its application to low power design of sequential circuits. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications. 2000;47(3):415-20.
[10] Wang S, Ananthanarayanan G, Zeng Y, Goel N, Pathania A, Mitra T. High-throughput CNN inference on embedded ARM big.LITTLE multi-core processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2019.
[11] Rapp M, Sagi M, Pathania A, Herkersdorf A, Henkel J. Power- and cache-aware task mapping with dynamic power budgeting for many-cores. IEEE Transactions on Computers. 2019;69(1):1-3.
[12] Isci C, Buyuktosunoglu A, Cher CY, Bose P, Martonosi M. An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, pp. 347-358. IEEE Computer Society.
[13] Canis A, Choi J, Aldham M, Zhang V, Kammoona A, Anderson JH, Brown S, Czajkowski T. LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2011, pp. 33-36. ACM.
[14] Jiao X, Rahimi A, Jiang Y, Wang J, Fatemi H, De Gyvez JP, Gupta RK. Clim: A cross-level workload-aware timing error prediction model for functional units. IEEE Transactions on Computers. 2017;67(6):771-83.
[15] Zhang JJ, Garg S. FATE: Fast and accurate timing error prediction framework for low power DNN accelerator design. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2018, pp. 1-8. IEEE.
[16] Jiao X, Rahimi A, Narayanaswamy B, Fatemi H, de Gyvez JP, Gupta RK. Supervised learning based model for predicting variability-induced timing errors. In 2015 IEEE 13th International New Circuits and Systems Conference (NEWCAS), 2015, pp. 1-4. IEEE.
[17] Zhang JJ, Garg S. BandiTS: Dynamic timing speculation using multi-armed bandit based optimization. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 922-925. IEEE.
[18] Whittle P. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society: Series B (Methodological). 1980;42(2):143-9.
[19] Khaleghi B, Salamat S, Imani M, Rosing T. FPGA energy efficiency by leveraging thermal margin. arXiv preprint arXiv:1911.07187. 2019.
[20] Assare O, Gupta R. Accurate estimation of program error rate for timing-speculative processors. In Proceedings of the 56th Annual Design Automation Conference, 2019, p. 180. ACM.
[21] De Kruijf M, Nomura S, Sankaralingam K. A unified model for timing speculation: Evaluating the impact of technology scaling, CMOS design style, and fault recovery mechanism. In 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN), 2010, pp. 487-496. IEEE.
[22] Moore S, Chadwick G. The Tiger "MIPS" processor. 2011.
[23] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12:2825-30.
[24] Jia T, Joseph R, Gu J. 19.4 An adaptive clock management scheme exploiting instruction-based dynamic timing slack for a general-purpose graphics processor unit with deep pipeline and out-of-order execution. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 318-320. IEEE.
[25] James G, et al. An Introduction to Statistical Learning. New York: Springer, Vol. 112, 2013.
[26] Kuhn M, Johnson K. Applied Predictive Modeling. New York: Springer, Vol. 26, 2013.
[27] Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence, Vol. 14, No. 2, pp. 1137-114, 1995.
[28] Forman G, Scholtz S. Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter, Vol. 12, No. 1, pp. 49-57, 2010.
[29] Yue J, Liu R, Sun W, Yuan Z, Wang Z, Tu YN, Chen YJ, Ren A, Wang Y, Chang MF, Li X. 7.5 A 65nm 0.39-to-140.3 TOPS/W 1-to-12b unified neural network processor using block-circulant-enabled transpose-domain acceleration with 8.1x higher TOPS/mm2 and 6T HBST-TRAM-based 2D data-reuse architecture. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 138-140. IEEE.
[30] Lee J, Lee J, Han D, Lee J, Park G, Yoo HJ. 7.7 LNPU: A 25.3 TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 142-144. IEEE.
[31] Lee J, Lee J, Han D, Lee J, Park G, Yoo HJ. 7.7 LNPU: A 25.3 TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 142-144. IEEE.
[32] Ruder S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. 2016.
[33] Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Mathematical Programming. 1989;45(1-3):503-28.
[34] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
[35] Asadi P, Navi K. A new low power 32x32-bit multiplier. World Applied Sciences Journal. 2007;2(4):341-7.
[36] Hofmann M. Support vector machines: Kernels and the kernel trick. Notes. 2006;26(3).
[37] Mitran J, Bouillant S, Bourennane E. Classification boundary approximation by using combination of training steps for real-time image segmentation. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, 2003, pp. 141-155. Springer, Berlin, Heidelberg.
[38] Kulkarni A, Pino Y, Mohsenin T. SVM-based real-time hardware Trojan detection for many-core platform. In 2016 17th International Symposium on Quality Electronic Design (ISQED), 2016, pp. 362-367. IEEE.
[39] Zhang X, Wang W, Zheng X, Ma Y, Wei Y, Li M, Zhang Y. A clutter suppression method based on SOM-SMOTE random forest. In 2019 IEEE Radar Conference (RadarConf), 2019, pp. 1-4. IEEE.
[40] Cheng SW. A high-speed magnitude comparator with small transistor count. In 10th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2003, Vol. 3, pp. 1168-1171. IEEE.
[41] Lattner C, Adve V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization, 2004, p. 75. IEEE Computer Society.
[42] Agarwal K, Sylvester D, Blaauw D. Modeling and analysis of crosstalk noise in coupled RLC interconnects. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2006;25(5):892-901.

Arash Fouman Ajirlou (S'17) received the Bachelor of Science degree in computer engineering from the University of Tehran, Tehran, Iran, in 2017. He started the PhD program with the Department of Electrical and Computer Engineering at the University of Illinois at Chicago in 2018. He was a research assistant in the School of Electrical and Computer Engineering at the University of Tehran between 2015 and late 2017. From 2017 to late 2018, he served as the secretary of the Electrical and Computer Engineering committee in the Alumni Association of the Faculty of Engineering, University of Tehran. In 2018, prior to starting his PhD in computer engineering at the University of Illinois at Chicago, he was a digital designer in the engineering department of the Ofogh Tajrobe Moj company, Tehran, Iran. His primary interests are embedded systems and high-performance/low-power computing systems, with an emphasis on machine learning and self-governing systems. His current focus is on utilizing machine learning methodologies to enhance processor performance and energy consumption.