A Machine Learning Pipeline Stage for Adaptive Frequency Adjustment
Arash Fouman Ajirlou, Student Member, IEEE, Inna Partin-Vaisband, Member, IEEE
Abstract—A machine learning (ML) design framework is proposed for adaptively adjusting clock frequency based on propagation delay of individual instructions. A random forest model is trained to classify propagation delays in real time, utilizing current operation type, current operands, and computation history as ML features. The trained model is implemented in Verilog as an additional pipeline stage within a baseline processor. The modified system is experimentally tested at the gate level in 45 nm CMOS technology, exhibiting a speedup of 70% and energy reduction of 30% with coarse-grained ML classification. A speedup of 89% is demonstrated with finer granularities with 15.5% reduction in energy consumption.
Index Terms—Computer Systems Organization, Microprocessors and microcomputers, Hardware, Pipeline, Processor Architectures, Pipeline processors, Pipeline implementation, VLSI Systems, Impact of VLSI on system design, VLSI, System architectures, integration and modeling, Design Methodology, Cost/performance, Machine learning, Classifier design and evaluation
1 INTRODUCTION

The primary design goal in computer architecture is to maximize the performance of a system under power, area, temperature, and other application-specific constraints. The heterogeneous nature of VLSI systems and the adverse effect of process, voltage, and temperature (PVT) variations have raised challenges in meeting timing constraints in modern integrated circuits (ICs). To address these challenges, timing guardbands have constantly been increased, limiting the operational frequency of synchronous digital circuits. On the other hand, the increasing variety of functions in modern processors increases delay imbalance among different signal propagation paths. Bounded by critical path delay, these systems are traditionally designed with a pessimistically slow clock period, yielding underutilized IC performance. Moreover, power efficiency of these underutilized systems also degrades due to increasing leakage power. Alternatively, when designed with relaxed timing constraints, integrated systems are prone to functional failures. To simultaneously maintain correct functionality and increase system performance, numerous optimization techniques as well as offline and online models have recently been proposed, including pipelining, multicore computing, dynamic voltage and frequency scaling (DVFS), and ML-driven models [1], [2], [3], [4], [5], [6], [7], [8], [9].

Propagation delay in a processor is a strong function of the type, input operands, and output of the current operation, and of the computation history [4]. Computation history accounts for data overwrite and crosstalk noises. Intuitively, the majority of operations complete within a small portion of the clock period, as determined by the slowest path in the circuit. Based on the path delay distribution reported in [5],

• A. Fouman was with the Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL, 60607. E-mail: [email protected]
• I.
Partin-Vaisband was with the Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL, 60607. E-mail: [email protected]

the operational frequency can be doubled for the majority of instructions. In this work, the clock frequency is adjusted per instruction in real time, yielding a fundamentally different approach as compared with the traditional, task-based dynamic frequency scaling. The main contributions of this work are as follows:

1) A systematic flow is proposed and implemented as a unified platform for extracting ML input features from an instruction and classifying the instruction execution delay in real time.
2) A random forest (RF) model is trained to classify individual instructions into delay classes based on their type, input operands, and the computation history of the system.
3) A new pipeline stage is integrated within a pipelined MIPS processor.
4) The proposed method is synthesized and verified on the LegUp [13] benchmark suite of programs with Synopsys Design Compiler in a 45 nm CMOS technology node.

The rest of the paper is organized as follows. Section 2 describes prior and related work. Section 3 explains the proposed unified platform and the design methodology. ML algorithms for classification of instruction delay are described in Section 4. In Section 5 the implementation details of the system are introduced. Experimental results are presented in Section 6. Conclusions and future work are discussed in Section 7, and the paper is summarized in Section 8.

2 PRIOR AND RELATED WORK
Multiple approaches have been proposed for efficiently tuning the operating point (i.e., voltage supply and clock frequency) of a system at various levels of a computing system, including application- and task-based methods and instruction-level speculations.

Predicting timing violations in a constraint-relaxed system is impractical with deterministic approaches, due to the wide dynamic range of input and output signals (typically 32 or 64 bits), the variety of operations in a modern processor, and delay dependence on the runtime and physical characteristics of the system (e.g., crosstalk noise). ML-based approaches for predicting timing violations of individual instructions have recently been proposed, which consider the impact of input operands and computation history on timing violations [4], [14], [15]. While significant for the design process of next-generation scalable high-performance systems, these approaches have several limitations:

1) Instruction output is considered as an ML feature and exploited in these systems for predicting the timing characteristics of the individual instructions. These predictions are, however, carried out before the instruction execution, when the instruction output is not yet available, limiting the effectiveness of these methods in practical systems.
2) The modules under test are studied separately and evaluated in an isolated test environment without the effects of other processing elements (e.g., arithmetic modules, buffers, or multiplexers). The high reported accuracy is, therefore, expected to degrade if the methods are applied to a complex system (e.g., a practical execution unit).
3) Power and timing overheads due to additional hardware are not considered in these papers.

Granularity of prediction is another primary concern. A bit-level ML-based method has been proposed in [16] for predicting timing violations with reduced timing guardbands.
While up to 95% prediction accuracy has been reported with this method, the excessively high, per-bit granularity of the ML predictions is expected to exhibit substantial power, area, and timing overheads. These overheads are, however, not evaluated in [16]. Furthermore, a procedure for recovery upon a timing error is not provided and the recovery overheads are also not considered.

As an alternative to fine-grain, high-overhead ML methods, multiple coarse-grain schemes for timing error detection and recovery have been proposed to mitigate the adverse effect of pessimistic design constraints. A better-than-worst-case design approach has been introduced in [5]. With this approach, the clock period is set to a statistically nominal value (rather than the worst-case propagation delay) and the history of timing-erroneous program counters is kept in a ternary content-addressable memory (TCAM). The TCAM is exploited for predicting timing violations of the instructions based on previous observations. Note that the system only warns against those timing violations that have been previously recorded; unseen violations are not predicted with this approach. Owing to the apparent simplicity of this approach, only bi-state operating conditions (i.e., nominal and worst-case clock frequencies) can be efficiently utilized with this method. Alternatively, the design complexity and system overheads are expected to significantly increase with the increasing number of frequency domains.

In BandiTS [17], a reinforcement learning approach has been proposed to estimate the timing error probability (TEP) within a program time interval, given timing speculation (TS) ratios,
TSR = t_clk / t_nom, for various values of the reduced clock period t_clk and the worst-case clock period t_nom. The TS-based TEP problem is modeled in [17] as the classical multi-armed bandit problem [18], where the TS ratios and TEPs correspond to, respectively, the arms and stochastic rewards. The primary limitation of that work is the lack of details about the hardware implementation and overheads. In addition, a maximum achievable performance gain of only 25% has been reported. Furthermore, the BandiTS approach exhibits per-task clock granularity and scales the clock frequency for a batch of instructions. Higher performance gain is possible with fine-grain, per-instruction clock frequency adjustment, as shown in this paper.

Thermal-aware voltage scaling has been proposed in [19]. A voltage selection algorithm has been developed and integrated within the FPGA synthesis process to aggressively scale the core and block RAM voltages, utilizing the available thermal headroom of the FPGA-mapped design. As a result, a 36% reduction in power consumption has been demonstrated. Driven by workload and thermal power dissipation, this method, however, supports only coarse-grain voltage and frequency scaling.

Predicting program error rate in timing-speculative processors has been proposed in [20]. A statistical model is developed for predicting dynamic timing slack (DTS) at various pipeline stages. The predicted DTS values are exploited to estimate the timing error rate in a program. The implementation overheads, and the potential performance or power consumption gains, are, however, not reported with this approach.

An offline model for TS processors has been introduced in [21]. This probabilistic model is trained to optimally select a better-than-worst-case, nominal clock frequency. The provided hardware-based speculation, however, does not consider the overall workload or specific finer units, limiting the fidelity of the method.
Alternatively, the adverse effect of process variations on the propagation delay is considered, strengthening the approach in [21]. Note that PVT variations are also considered with the proposed approach of classifying instructions into delay intervals in real time, as described in the following sections.

Finally, ML-based methods for modeling system behavior have also been proposed. For example, in [6], linear regression has been leveraged for modeling the aging behavior of an embedded processor based on the current instruction and its operands, as well as the computation history and overall circuit switching activity. As a result, the timing guardband designed to compensate for aging in digital circuits can be effectively reduced, in the presence of graceful degradation [6]. Reallocation of delay budget has, however, not been considered with this method.

ML ICs can exhibit prohibitively high power consumption and physical size. Furthermore, ML ICs can introduce additional delay and increase design complexity, depending upon the application characteristics. To efficiently exploit ML methods for managing frequency in modern processors, the delay, power, and area of ML ICs should be considered.
3 THE PROPOSED ML-BASED FREQUENCY ADJUSTMENT
In this paper, a design methodology is proposed for ML-driven adjustment of operational frequency in pipelined processors. With the proposed method, individual instructions are classified into the corresponding propagation delay classes in real time, and the clock frequency is accordingly adjusted to reduce the gap between the actual propagation delay and the clock period. The classes are defined by segmenting the worst-case clock period into shorter delay fragments. Each class is characterized by a specific supply voltage and clock frequency. The primary design objective is to maximize system performance within an allocated energy budget. The overall delay and energy consumption are evaluated with the additional ML components, and with both the correct and incorrect predictions. The proposed scalable framework allows for other control configurations to be defined in a similar manner for different design objectives. The real-time clock adjustment is enabled by the recent advancement in clock management circuits [24].

In order to evaluate this method, a pipelined, 32-bit MIPS processor (TigerMIPS [22]) is utilized as the baseline processor. The ML classifier is designed as an additional pipeline stage within the pipelined MIPS processor, as shown in Fig. 1. The inputs to the additional ML pipeline stage are the current instruction and its operands, as well as the computation history, as defined by the toggled input bits (i.e., current inputs are XORed with the previous inputs) and the output of the previous operation. The choice of these parameters is in accordance with the results in [4] and [6]. These inputs are utilized as ML features for predicting the delay class of the current instruction based on the trained ML model. It is important to note that more complex, slower ML models can also be trained with this methodology, as long as the design complexity and hardware costs of the final system meet the specified constraints.
To meet the overall system throughput constraints, the trained models can be implemented as multiple pipeline stages, mitigating the additional latency introduced by the ML functions.

Fig. 1: The proposed pipeline with the additional ML stage. In this configuration, six ML features and three delay classes are illustrated.

Finally, the granularity of the output delay (e.g., three delay classes are illustrated in Fig. 1) can be varied to meet the timing constraints within the energy budget.

A systematic flow has been developed, implemented, and verified on TigerMIPS with the LegUp benchmark suite. The flow comprises three primary phases, as shown in Fig. 2. The individual phases are described in the following subsections.
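To make the class-to-clock mapping concrete, the following Python sketch maps a predicted delay class to its clock period and a measured delay to its ground-truth class. It assumes the three-class boundaries listed in Section 4; the helper names are illustrative, not part of the implementation.

```python
# A minimal sketch, assuming the three-class delay segmentation
# {[0.0, 1.8], (1.8, 2.6], (2.6, 4.0]} ns of the 4 ns worst-case period.
# Function names are ours, for illustration only.

CLASS_PERIODS_NS = [1.8, 2.6, 4.0]  # upper delay bound of each class


def clock_period_for(predicted_class: int) -> float:
    """Clock period (ns) applied for a predicted delay class."""
    return CLASS_PERIODS_NS[predicted_class]


def classify_delay(delay_ns: float) -> int:
    """Ground-truth delay class of a measured propagation delay."""
    for cls, bound in enumerate(CLASS_PERIODS_NS):
        if delay_ns <= bound:
            return cls
    raise ValueError("delay exceeds worst-case clock period")
```

An instruction predicted into a faster class than its true delay would violate timing at that shortened period, which is exactly the misclassification case handled by the recovery mechanism of Phase 3.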
3.1 Phase 1

First, the high-level hardware description language (HDL) model of the baseline processor is synthesized into a gate-level description model. During this phase, timing information is generated in the IEEE standard delay format (SDF). Based on this information, gate-level simulation (GLS) is performed and the instruction-level execution profile is generated. A profile comprises a list of instructions, the fetched or forwarded operands, the output of the operations, and the propagation delays. In addition to the execution profile, post place-and-route (PAR) reports, including timing and power information, are collected in this phase.
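As an illustration of the kind of record such a profile contains, the sketch below parses one profile line into the fields named above. The comma-separated layout, hexadecimal operand encoding, and field names are assumptions for illustration; the actual GLS profile format is tool-dependent.

```python
# Hypothetical profile-record parser; the line format is an assumption,
# not the flow's actual output format.
from dataclasses import dataclass


@dataclass
class ProfileRecord:
    instr: str       # mnemonic of the executed instruction
    op1: int         # first operand (32-bit, parsed from hex)
    op2: int         # second operand (32-bit, parsed from hex)
    output: int      # result of the operation
    delay_ns: float  # measured propagation delay in nanoseconds


def parse_profile_line(line: str) -> ProfileRecord:
    instr, op1, op2, out, delay = line.strip().split(",")
    return ProfileRecord(instr, int(op1, 16), int(op2, 16),
                         int(out, 16), float(delay))
```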
3.2 Phase 2

In this phase, the gate-level profiles from Phase 1 are parsed and utilized as ML features. Based on the extracted features, a preferred ML model is trained in Python with the Scikit-learn ML library [23]. An HDL code (e.g.,
Verilog in this paper) of the trained model is generated and integrated within the baseline processor as a single (or multiple) pipeline stage(s) between the decode and execute stages (see Fig. 1).
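The model-to-HDL step can be pictured with a minimal sketch: a trained decision tree becomes a nest of threshold comparisons. The toy tree below is hand-made (not a model trained by this flow), and the emitted text is only a Verilog-flavored rendering of the conditional structure such a generator could produce.

```python
# Sketch of lowering one decision tree to a nested HDL conditional.
# The tree is represented as dicts {feat, thr, lo, hi}; leaves are
# integer delay-class labels. All names here are illustrative.

def tree_to_verilog(node, indent=0):
    """Emit a Verilog-style nested if/else for a toy decision tree."""
    pad = "  " * indent
    if isinstance(node, int):                      # leaf: assign the class
        return f"{pad}delay_class = {node};"
    return (f"{pad}if ({node['feat']} <= {node['thr']}) begin\n"
            f"{tree_to_verilog(node['lo'], indent + 1)}\n"
            f"{pad}end else begin\n"
            f"{tree_to_verilog(node['hi'], indent + 1)}\n"
            f"{pad}end")

# Hand-made stand-in tree: small first operand -> fastest class.
toy_tree = {"feat": "op1", "thr": 65535,
            "lo": 0,
            "hi": {"feat": "xop2", "thr": 255, "lo": 1, "hi": 2}}

print(tree_to_verilog(toy_tree))
```

In hardware, each such comparison maps to a comparator, which is why the RF hardware cost in Section 4 is counted in comparators.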
3.3 Phase 3

During this phase, the modified high-level HDL model of the system with the ML pipeline stage is synthesized and profiled, as described in Phase 1. To guarantee functional correctness, the output signal is double-sampled to detect timing violations, and timing-erroneous instructions are re-executed with the worst-case clock frequency. Similar to the baseline iteration, the post-PAR reports are extracted for evaluating the timing and energy characteristics of the system. Finally, the profiling of the modified system is executed during this phase to evaluate the overall speedup of the system.

To optimize the final solution in terms of the operational frequency and energy consumption, the proposed flow is executed iteratively with various ML algorithms and clock fragments, as shown with the feedback in Fig. 2. The clock signal of the pipeline registers is assumed to be near-instantly switched based on the individual classification results, as has been experimentally demonstrated in [24].
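A back-of-the-envelope model of this speedup evaluation can be sketched as follows, assuming the 4 ns worst-case period and the four-cycle re-execution penalty discussed in Section 4; the function names are ours, and the model ignores second-order effects (e.g., pipeline stalls).

```python
# Rough model: a correctly (or safely) predicted instruction completes at
# its class period; an instruction misclassified as too fast is caught by
# double sampling and re-executed, costing four worst-case cycles
# (the repeated IF, ID, ML, and EX stages). Illustrative only.

WORST_CASE_NS = 4.0
REEXEC_CYCLES = 4


def effective_time_ns(true_delay: float, predicted_period: float) -> float:
    if true_delay <= predicted_period:
        return predicted_period                      # safe prediction
    return predicted_period + REEXEC_CYCLES * WORST_CASE_NS  # violation


def speedup(delays, periods):
    """Speedup vs. running every instruction at the worst-case period."""
    baseline = WORST_CASE_NS * len(delays)
    actual = sum(effective_time_ns(d, p) for d, p in zip(delays, periods))
    return baseline / actual
```

This simple model already shows why the type of misclassification matters: a too-slow prediction only forfeits some gain, while a too-fast prediction pays the full re-execution penalty.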
4 MACHINE LEARNING MODELS
Owing to the unique learning characteristics and hardware trade-offs of neural networks (NNs), support vector machines (SVMs), and random forest (RF) models, all these ML models are considered in this paper. Each model is trained based on the instruction profiles extracted from a
Fig. 2: Systematic flow for designing the ML predictor within a typical pipelined processor.

synthetically generated dataset of 3,000 random instructions per class. The delay boundaries of the individual classes are experimentally determined with respect to the worst-case delay of 4 ns as follows: {[0.0, 2.2], (2.2, 4.0]} for the two-class configuration, {[0.0, 1.8], (1.8, 2.6], (2.6, 4.0]} for the three-class configuration, and {[0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]} for the four-class configuration.

The feature vector of the i-th instruction comprises six elements, x_i = (instr, op1, op2, Xop1, Xop2, output). The first feature, instr, comprises four subfeatures, representing the type of the operation in one-hot format:

instr = 1000, if arithmetic; 0100, if arithmetic with immediate operand; 0010, if logical; 0001, if multiplication or division.

The subsequent four elements are defined by the operands. The features op1 and op2 are the first and second operands of the instruction, and the features Xop1 and Xop2 are the XORed values of the first and second operands with their respective previous values. The last feature, output, is the output of the preceding instruction. The last three elements of the feature vector are exploited to capture the effect of computation history on the instruction delay. Note that the operands and output of the preceding instruction are 32 bits long, as determined by the 32-bit baseline processor utilized in this work. Thus, the distribution of these features significantly differs from the distribution of the operation-type subfeatures. To balance the overall distribution of the individual features, the input features are preprocessed and scaled to follow a normal distribution using the quantile transformer in the Python scikit-learn library. An example of operand and output features with and without the transformation is shown in Fig. 3 for arithmetic and logical instructions. Note that the type subfeatures remain unchanged.
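Assembling this feature vector can be sketched in a few lines of plain Python; the type labels and argument names are illustrative, and the quantile-transformer scaling step (scikit-learn's QuantileTransformer in the paper's flow) is omitted here.

```python
# Sketch of building x_i = (instr, op1, op2, Xop1, Xop2, output).
# Type labels and the encoding order (arithmetic, arithmetic-immediate,
# logical, mul/div) follow the text; names are ours, for illustration.

TYPES = ["arith", "arith_imm", "logical", "muldiv"]


def feature_vector(instr_type, op1, op2, prev_op1, prev_op2, prev_output):
    one_hot = [1 if instr_type == t else 0 for t in TYPES]
    xop1 = op1 ^ prev_op1   # toggled input bits capture computation history
    xop2 = op2 ^ prev_op2
    # Nine scalar inputs total: four type subfeatures plus five 32-bit fields.
    return one_hot + [op1, op2, xop1, xop2, prev_output]
```

The nine scalar entries returned here match the nine input-layer neurons of the NN models described below.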
To evaluate the efficiency and efficacy of the proposed method, propagation delay classification is investigated with three common ML algorithms: NN, SVM, and RF. The configuration of each of the three ML models is described in the following subsections, including the hyperparameters, performance, and hardware costs of the individual ML algorithms. All the algorithms are five-fold cross-validated based on three thousand randomly generated instructions per class. While finding an effective metric for stability of the evaluation is still an open question, k-fold cross-validation with 5 ≤ K ≤ 10 is typically used, as these K values have been demonstrated to simultaneously minimize the bias and variance across many studied test sets [25], [26], [27], [28]. Thus, K = 5 is used in this work. ML accuracy is reported as the F1-score of delay classification, and the resultant speedup for each benchmark program has been considered in determining the performance of each ML algorithm. Hardware cost is evaluated as the number of additional transistors required for implementing the individual ML algorithms and has also been considered in determining the performance of the ML algorithms. Among the evaluated ML algorithms, the RF classifier is preferred in this work due to the favorable tradeoff between the performance gain and hardware costs, as well as the relative simplicity of the RF algorithm, as explained in the following subsections.

NNs excel in learning complex hidden patterns in large datasets and have exhibited particular supremacy in vision and text applications as compared with classical ML algorithms. Following this success, promising results have been shown with NNs in various hardware-related applications [29], [30], [31].

To determine the preferred set of hyperparameters for the two-, three-, and four-class NN models, a grid search is executed for each multiclass NN over the following ranges:
Fig. 3: A typical feature vector with and without the ML preprocessing, (a) for an arithmetic operation with an immediate operand, and (b) for a logical operation. Note that the values without preprocessing are shown on a logarithmic scale, while the values with preprocessing are shown on a linear scale.

1) Identity, tanh, logistic, and ReLU activation functions,
2) Stochastic gradient descent [32], lbfgs (a limited-memory BFGS quasi-Newton optimization algorithm [33]), and Adam (an adaptive learning rate optimization algorithm [34]) solvers, and
3) A single m-neuron hidden layer (m ∈ { , , , }) and two hidden NN layers with m1 and m2 neurons in, respectively, the first and second layers (m1 × m2 ∈ { × , × , × }).

The networks are trained using the backpropagation algorithm for 200 epochs until convergence with a quasi-Newton optimizer. Note that the number of neurons in the input and output layers is determined by, respectively, the number of ML features (nine, including the four instruction-type subfeatures) and the number of ML classes (two, three, and four). The top ten grid search results (within 1% of the highest F1-score) are listed in Table 1 for each of the multiclass NNs, in descending order of the F1-scores. The hardware cost is determined based on the number of transistors comprising the NN adders and multipliers. The transistor count for the individual NN adders and multipliers is determined based on [35]. The number of multipliers, N_MULT, and adders, N_ADD, in a NN with L

TABLE 1: Top (within 1% of the highest F1-score) NN configurations and their respective performance metrics (i.e., speedup, hardware cost (in million transistors), and speedup per hardware metric (SPH)).
# Activation Solver Neurons F1-score Speedup HW cost SPH
1 tanh lbfgs 10 0.859 1.915 2.834 0.676
2 relu adam …
Positive standard deviation (σ+) 0.002 0.012 1.634 0.095
Negative standard deviation (σ−) 0.002 0.018 0.961 0.040

# Activation Solver Neurons F1-score Speedup HW cost SPH
1 logistic lbfgs …
Positive standard deviation (σ+) 0.001 3.7E-4 1.694 0.460
Negative standard deviation (σ−) 0.001 4.0E-4 1.148 0.049

# Activation Solver Neurons F1-score Speedup HW cost SPH
1 logistic adam …
Positive standard deviation (σ+) 1.0E-4 1.8E-4 2.252 0.294
Negative standard deviation (σ−) 0.003 0.002 1.217 0.137

layers is determined, respectively, as

N_MULT = \sum_{i=1}^{L} m_i \cdot v_i, (1)

and

N_ADD = \sum_{i=1}^{L} m_i \cdot (v_i - 1), (2)

where m_i is the number of neurons in each layer, and v_i is the size of the input vector to each layer (or the feature vector size in the input layer).

The speedup per hardware cost (SPH) is also listed in Table 1 for each of the NN configurations. These top NN results are compared with the SVM and RF top results, as described at the end of this section. As a general rule, the learning capacity of a NN increases with the network complexity (i.e., number of neurons and number of layers). For a NN to be competitive with or outperform a classical ML algorithm, a large number of neurons and layers is required, significantly increasing the system complexity and hardware overhead of NN-based solutions.

An SVM classifier generates an optimal hyperplane which separates data samples in feature space with the objective of minimizing the classification error. A linear SVM can only classify linearly separable data. Alternatively, to learn complex nonlinear data patterns, SVM can be combined with a kernel trick, enabling the feature transformation into a linearly separable space [36].
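The NN hardware-cost expressions of Eqs. (1) and (2) above can be checked numerically with a short helper; the function name and arguments are ours, for illustration.

```python
# Multiplier and adder counts per Eqs. (1) and (2) for a fully connected
# NN: each layer of m_i neurons with input size v_i needs m_i * v_i
# multipliers and m_i * (v_i - 1) adders. Illustrative helper, not the
# paper's tooling.

def nn_mult_add(layer_widths, input_size):
    n_mult = n_add = 0
    v = input_size
    for m in layer_widths:
        n_mult += m * v         # Eq. (1): m_i * v_i products
        n_add += m * (v - 1)    # Eq. (2): m_i * (v_i - 1) additions
        v = m                   # this layer's output feeds the next layer
    return n_mult, n_add
```

For example, a single ten-neuron hidden layer over the nine input features requires 90 multipliers and 80 adders before the output layer is counted.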
In this work, a grid search is performed over the following kernel SVM hyperparameters:

1) Linear, polynomial, and radial basis function (rbf) kernels,
2) Integer degree of flexibility of the polynomial decision boundary, d ∈ [2, 11], and
3) The influence on the model of a single sample in a training set with N features and variance Var, by scaling (i.e., gamma = 1/(N · Var)) or not scaling (i.e., gamma = 1/N) the kernel coefficient, gamma.

The sets of hyperparameters with the highest F1-scores are listed in Table 2. The speedup, hardware cost, and SPH metric are also listed in the table for all the SVM configurations. SVM hardware cost is determined as the number of transistors, based on the method presented in [37]. SVM often exhibits excellent performance as compared with other learning algorithms, at the expense of higher computational and design complexity, and accordingly higher power and area overheads [38]. These tradeoffs are discussed at the end of this section.

An RF classifier is an ensemble of decision tree classifiers. The input samples are split into multiple sample subsets and each decision tree is trained on one training subset. The final classification decision for each sample is made based on the result of averaging the individual tree decisions (i.e., ensembling). RF models benefit from the accuracy, training speed, and interpretability of the decision tree model, while the ensembling mitigates the overfitting otherwise common to decision tree classifiers. RF is often preferred in scientific and practical applications [4], [39]. The computational and hardware complexity of RF is a strong function of the number and depth of the decision trees. The depth of the individual trees is dependent on the number of features and their correlation.
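The exhaustive sweep over these two RF hyperparameters can be skeletonized as below; `train_and_score` is a stand-in for fitting a classifier (a scikit-learn `RandomForestClassifier` in the paper's flow) and returning its cross-validated F1-score.

```python
# Generic hyperparameter grid-search skeleton; the scoring callback is a
# hypothetical stand-in for the actual train/cross-validate step.
from itertools import product


def grid_search(train_and_score, n_estimators_grid, max_depth_grid):
    """Return the (n_estimators, max_depth) pair with the best score."""
    best, best_score = None, -1.0
    for n_est, depth in product(n_estimators_grid, max_depth_grid):
        score = train_and_score(n_estimators=n_est, max_depth=depth)
        if score > best_score:
            best, best_score = (n_est, depth), score
    return best, best_score
```

In practice the same loop is what `sklearn.model_selection.GridSearchCV` automates, with the cross-validation folds of Section 4 supplying the score.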
In this work, an RF grid search is performed over the following ranges of hyperparameters:

1) Number of trees in the forest, n_estimators ∈ { , , , , },
2) Maximum number of levels in each tree, max_depth ∈ { , , , , }.

The results of the top estimators (within 1% of the highest F1-score) are listed in Table 3. The hardware cost of an

TABLE 2: Top (within 1% of the highest F1-score) SVM configurations and their respective performance metrics (i.e., speedup, hardware cost (in million transistors), and speedup per hardware metric (SPH)).

# kernel degree gamma F1-score Speedup HW cost SPH
1 poly 5 scale 0.837 1.873 1323.343 0.001
2 poly 4 scale 0.834 1.899 1307.230 0.001
3 poly 6 scale 0.833 1.889 1352.670 0.001
4 poly 3 scale 0.828 1.910 1291.739 0.001
5 poly 7 scale 0.827 1.923 1412.040 0.001
6 poly 8 scale 0.826 1.915 1476.320 0.001
7 poly 9 scale 0.823 1.908 1534.991 0.001
8 rbf N/A scale 0.822 1.916 228.524 0.008
9 poly 10 scale 0.819 1.814 1596.913 0.001
10 poly 11 scale 0.813 1.793 1654.040 0.001
Average 0.826 1.884 1317.781 0.002
Positive standard deviation (σ+) 0.003 0.010 74.701 0.006
Negative standard deviation (σ−) 0.003 0.038 363.206 2.3E-4

# kernel degree gamma F1-score Speedup HW cost SPH
1 poly 5 scale 0.876 1.604 755.765 0.002
2 poly 4 scale 0.875 1.617 715.028 0.002
3 poly 3 scale 0.875 1.617 675.647 0.002
4 rbf N/A scale 0.875 1.617 119.954 0.013
5 poly 7 scale 0.873 1.627 847.086 0.002
6 poly 2 scale 0.872 1.618 677.269 0.002
7 poly 6 scale 0.872 1.601 800.553 0.002
8 poly 8 scale 0.870 1.613 900.614 0.002
9 rbf N/A auto 0.869 1.617 135.990 0.012
10 poly 9 scale 0.869 1.660 948.560 0.002
Average 0.873 1.619 657.647 0.004
Positive standard deviation (σ+) 0.001 0.021 57.771 0.006
Negative standard deviation (σ−) 0.001 0.003 374.579 7.4E-4

# kernel degree gamma F1-score Speedup HW cost SPH
1 rbf N/A scale 0.957 1.680 28.616 0.059
2 poly 4 scale 0.956 1.679 169.482 0.010
3 poly 5 scale 0.955 1.681 181.843 0.009
4 poly 6 scale 0.954 1.683 198.685 0.008
5 poly 3 auto 0.954 1.674 323.000
0.005
6 poly 3 scale 0.953 1.677 158.120 0.011
7 poly 8 scale 0.952 1.683 225.245 0.007
8 poly 7 scale 0.952 1.678 212.632 0.008
9 poly 2 scale 0.951 1.666 145.830 0.011
10 rbf N/A auto 0.951 1.629 27.952 0.058
Average 0.954 1.673 167.141 0.019
Positive standard deviation (σ+) 9.2E-4 0.003 29.323 0.028
Negative standard deviation (σ−) 8.3E-4 0.022 49.433 0.004

The hardware cost of an RF classifier is evaluated based on the number of required comparators, O(n_estimators × log(max_depth)), and is reported in terms of the total number of RF transistors. The transistor count for a single comparator is determined based on [40].

The tradeoffs between the speedup and F1-score are summarized in Fig. 4 for all the classifiers. Note that speedup does not increase with F1-score in all cases. This is due to the effect of the type of misclassification on the overall speedup. For example, if a slow instruction is classified into a faster class, the result at the output of the execution unit at the end of the fast clock period is incorrect. Thus, a four-clock-cycle penalty is incurred to re-execute the slow instruction, compensating for the combined latency of the re-executed IF, ID, ML, and EX stages. Alternatively, if a fast instruction is misclassified into a slow class, the execution still results
TABLE 3: Top (within 1% of the highest F1-score) RF configurations and their respective performance metrics (i.e., speedup, hardware cost (in million transistors), and speedup per hardware metric (SPH)).

# max_depth n_estimators F1-score Speedup HW cost SPH
1 30 50 0.852 1.835 0.177 10.357
2 10 200 0.850 1.925 0.480 4.012
3 30 200 0.850 1.889 0.709 2.666
4 10 100 0.849 1.913 0.240 7.978
5 50 200 0.849 1.833 0.815 2.249
6 50 100 0.846 1.856 0.407 4.554
7 20 200 0.845 1.874 0.624 3.004
8 50 50 0.843 1.836 0.204 9.011
9 30 100 0.842 1.838 0.354 5.187
10 10 50 0.840 1.902 0.120 15.859
Average 0.847 1.870 0.413 6.488
Positive standard deviation (σ+) 0.002 0.016 0.137 2.638
Negative standard deviation (σ−) 0.002 0.014 0.078 1.250

# max_depth n_estimators F1-score Speedup HW cost SPH
1 20 50 0.949 1.879 0.156 12.040
2 30 100 0.947 1.856 0.354 5.238
3 20 200 0.946 1.851 0.624 2.966
4 40 200 0.945 1.847 0.768 2.403
5 40 50 0.944 1.843 0.192 9.591
6 50 200 0.944 1.839 0.815 2.257
7 10 200 0.944 1.848 0.480 3.853
8 30 200 0.942 1.814 0.709 2.561
9 10 100 0.940 1.828 0.240 7.621
10 10 50 0.939 1.826 0.120 15.224
Average 0.944 1.843 0.446 6.375
Positive standard deviation (σ+) 0.002 0.008 0.117 2.764
Negative standard deviation (σ−) 0.002 0.007 0.111 1.360

# max_depth n_estimators F1-score Speedup HW cost SPH
1 40 50 0.981 1.688 0.192 8.785
2 30 200 0.981 1.686 0.709 2.380
3 30 100 0.981 1.683 0.354 4.750
4 40 200 0.981 1.686 0.768 2.193
5 10 50 0.981 1.686 0.120 14.058
6 20 200 0.981 1.683 0.624 2.696
7 10 200 0.980 1.686 0.480 3.515
8 50 200 0.980 1.687 0.815 2.070
9 10 100 0.980 1.685 0.240 7.024
10 20 100 0.979 1.683 0.312 5.393
Average 0.980 1.685 0.461 5.286
Positive standard deviation (σ+) 2.0E-04 0.001 0.111 2.401
Negative standard deviation (σ−) 4.3E-04 0.001 0.104 1.034

in a correct answer, albeit with a potential loss in performance gain.
In addition, if a fast instruction is classified into a nominal-delay class (for example, in the case with three delay classes), the overall performance of the system is still increased (but not maximized) as compared with the execution in the slowest delay class (as designed for the worst-case clock period). To understand the significance of speed and overhead in the overall performance of individual ML classifiers, the SPH metric is considered. The SPH results (as determined based on Tables 1-3) are shown in Fig. 5 for the NN, SVM, and RF classifiers in two-, three-, and four-class configurations. Based on these results, RF exhibits the best tradeoff between the hardware cost and speedup, as well as the lowest design complexity and hardware overheads. The RF classifier is, therefore, preferred in this work as a demonstration vehicle of the proposed framework.

Fig. 4: Speedup vs. F1-score for two-, three-, and four-class configurations based on Tables 1-3.

5 IMPLEMENTATION
The proposed framework is implemented with the RF model within TigerMIPS and evaluated based on the LegUp benchmarks. The details of the implementation are described in this section.
A holistic platform is developed based on the proposed system design methodology, as illustrated in Fig. 2. The framework is unified within a shell programming platform supported by several peripheral programs developed in C++ and Python. The synthesis steps, as described in Fig. 2, are sequentially executed from
Start to Finish .During the first phase, Synopsys Design Compiler iscalled with the high-level HDL model of the baseline pro-cessor. The profiler triggers are added to the system andGLS is performed in Modelsim.The second phase is triggered upon the completion of theinstruction profiling. An external parser program is calledto transform the instruction profiles into the ML featuredata structure and eliminate outliers. The model is trainedto classify propagation delays into user-defined number ofclasses based on a user-specified learning algorithm anddelay boundaries. The ML accuracy and estimated speedupare evaluated upon the training completion. If the design re-quirements are met, the ML software model is transformedinto the high-level HDL code. Otherwise, ML model isretrained with new parameters.Upon training completion, the HDL code of the MLmodel is instantiated within the original HDL model ofthe baseline processor. Finally, the procedure in Phase 1 isrepeated in Phase 3 with the modified processor model,and the overall system performance and overheads areevaluated.
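The Phase-2 training step can be sketched with scikit-learn, which the authors cite for model training [23]. The feature encoding, delay values, and class boundary below are illustrative placeholders, not the paper's exact feature set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative features: operation-type bits, operand bits, history bits.
X = rng.integers(0, 2, size=(2000, 16))
# Illustrative per-instruction delays; the first feature is made informative.
delay_ns = rng.uniform(0.2, 2.0, size=2000) + 0.6 * X[:, 0]
# Two delay classes split at a user-defined boundary (here 1.2 ns).
y = (delay_ns > 1.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, max_depth=30, random_state=0)
clf.fit(X_tr, y_tr)
score = f1_score(y_te, clf.predict(X_te), average="weighted")
print(round(score, 3))
```

If the score meets the design target, the fitted trees would then be translated into HDL; otherwise training repeats with new parameters, mirroring the retraining loop of Fig. 2.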
The proposed framework is demonstrated on TigerMIPS. In addition to the basic MIPS units, such as Instruction Fetch (IF), Instruction Decode (ID), Execute (Exe), Memory access (Mem), and Write-back (WB), TigerMIPS comprises advanced units, such as a forwarding unit, a branch handling unit, stall logic, and instruction and data caches, which are common in modern pipeline processors.
Fig. 5: Speedup per hardware cost (SPH) for two-, three-, and four-class configurations. The hardware cost is evaluated based on the number of transistors needed to realize each classifier. The SPH performance is highest with the RF classifier as compared with the SVM- and NN-based classifiers for each of the classifier configurations. Numbers correspond to data listed in Tables 1, 2, and 3.
The baseline model is synthesized in the 45 nm NanGate CMOS technology node with Synopsys Design Compiler. Upon completion of the synthesis, triggers are implemented in Verilog HDL, enabling data and timestamp sampling at the input and output of the execution unit within the MIPS pipeline. The profiling is performed based on GLS with the ModelSim simulator.
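As an illustration of the profiling step, the sampled input/output timestamps can be reduced to per-instruction propagation delays. The log format and field names below are assumptions for the sketch, not the actual trigger implementation:

```python
# Hypothetical profiler records: (instruction id, EX input time, EX output
# time), with timestamps in picoseconds as sampled by the HDL triggers.
log = [(0, 1000, 1830), (1, 3000, 3410), (2, 5000, 6120)]

def propagation_delays(entries):
    """Map each instruction id to its execution-stage delay in picoseconds."""
    return {i: t_out - t_in for i, t_in, t_out in entries}

print(propagation_delays(log))  # {0: 830, 1: 410, 2: 1120}
```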
The trained ML model is first validated in Python. The HDL code of the validated ML model is integrated into the baseline processor. Finally, the modified processor is synthesized, and its functionality is verified through GLS. The post-place-and-route (PAR) reports are utilized to evaluate the modified system with respect to the specified design constraints.
EXPERIMENTAL RESULTS
To demonstrate the framework, the LegUp high-level synthesis benchmark suite, coupled with the LLVM compiler toolchain [41], is utilized for profiling and verification during GLS. The trained RF model is tested with nine standard benchmark programs available within the LegUp benchmark suite and an additional synthetically generated benchmark with one million random instructions. The F1-score is shown in Fig. 6 for two, three, and four ML delay classes, yielding an F1-score above 95% for the majority of the programs with two delay classes. The resultant speedup for the individual benchmarks is shown in Fig. 7, including the practical speedup (with the misclassification penalty), the no-penalty speedup (without the misclassification penalty), and the ideal speedup (with 100% classification accuracy). The energy overhead due to the additional ML hardware and classification errors is listed in Table 4. To account for delay overheads due to the misclassification of a slow instruction into a higher-performance class, a re-execution penalty of four clock cycles (compensating for the IF, ID, ML, and EX stages) is considered within the performance results, as reported in Fig. 7. The no-penalty speedup is also presented in Fig. 7, visualizing the penalty due to the misclassification of a fast instruction into a slow class. Note that the overall speedup with the four-class configuration is higher than the speedup with the two-class configuration, despite the higher classification accuracy with two delay classes. Alternatively, the higher misclassification rate with four delay classes yields higher re-execution energy consumption, as listed in Table 4. Also, note that a negative energy overhead indicates a reduction in the overall energy consumption (i.e., power-delay product).

A performance comparison between the proposed method and state-of-the-art (ML and non-ML) DVFS approaches is listed in Table 5. For example, both the proposed framework and the approach in [5] consider binary classification with two execution delay classes.
The proposed method exhibits 3.5 times higher speedup gain and 33% energy savings, as compared with the 3% energy overhead reported in [5]. As compared with the adaptive approach in [24], the proposed method exhibits up to a 4.9 times increase in performance gain with 50% less energy savings. Alternatively, a 3.85 times higher performance gain is demonstrated as compared to [24] with similar energy savings.

The power overhead per instruction for the two-, three-, and four-delay-class configurations is also determined for the programs in the LegUp benchmark suite. The average power overhead (due to the additional ML stage and re-execution of misclassified instructions) is shown in Fig. 8. The average power is linearly reduced with the increasing number of program instructions, exhibiting an overhead of less than 0.02 microwatts in practical applications with more than one million instructions. Furthermore, the additional average power consumption rapidly converges for various numbers of classes, as shown in Fig. 8. Thus, when optimizing the number of delay classes in processors with large workloads, the power overhead is a secondary factor. Finally, the steeper decrease in the power overhead with the four-class configuration supports the previous assertion regarding the gain-overhead tradeoff with finer granularity of delay classes: as the number of instructions increases, the higher accuracy with the four-class configuration mitigates the adverse effects of misclassifications on the overall system frequency.

Fig. 6: Inference RF classification based on the LegUp benchmark suite with two, three, and four classes.

Fig. 7: Experimental speedup with the proposed ML framework with two, three, and four delay classes. Practical, no-penalty, and ideal speedups are presented for each benchmark and class. The practical speedup considers the experimental classification accuracy and delay overheads due to misclassification of a slow instruction into a fast class. The no-penalty speedup considers the experimental accuracy but disregards the idle time due to misclassification of a fast instruction into a slow class. Finally, the ideal speedup is the theoretical maximum with 100% classification accuracy.
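The relation between the practical and ideal speedups can be illustrated from first principles. The sketch below uses assumed instruction mixes and clock periods (not the measured benchmark data) and applies the four-cycle re-execution penalty for a slow instruction misclassified as fast:

```python
def practical_speedup(n_fast, n_slow, p_miss_slow, fast_period, slow_period,
                      penalty_cycles=4):
    """Speedup over running every instruction at the worst-case (slow) clock.
    A slow instruction misclassified as fast costs penalty_cycles at the fast
    period (refilling IF, ID, ML, and EX) plus one slow-period re-execution."""
    baseline = (n_fast + n_slow) * slow_period
    missed = n_slow * p_miss_slow
    adaptive = (n_fast * fast_period
                + (n_slow - missed) * slow_period
                + missed * (penalty_cycles * fast_period + slow_period))
    return baseline / adaptive

# 80% fast instructions, 5% slow-as-fast misclassification, fast clock
# running at twice the worst-case rate.
print(round(practical_speedup(800, 200, 0.05, 1.0, 2.0), 3))  # 1.613
```

Setting p_miss_slow to zero recovers the ideal speedup for this mix (2000/1200, about 1.67), illustrating how misclassification erodes the theoretical maximum.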
CONCLUSIONS AND FUTURE WORK
The proposed unified framework facilitates efficient utilization of the time and hardware resources in the system. In addition, this approach enables the design of ML pipeline stages while satisfying design constraints, as shown in Fig. 2. Finally, classification of instructions into delay intervals in real time alleviates the path propagation variances imposed by PVT variations and system aging. To enhance the performance gain, the proposed approach should be preferred for those applications and systems characterized by considerable variations in the propagation delay of the individual instructions.

This method is practical with pipelined, MIPS-like processors, in which the overall delay is dominated by the delay of the execution stage.

TABLE 4: Experimental power and energy overhead of the proposed ML method.

Four delay classes:
Benchmark   Practical speedup   Power overhead   Energy overhead   Instruction count
rand1M            1.923             38.5%            -27.99%            1000000
adpcm             1.497             56.49%             4.55%              30197
aes               2.087             45.14%           -30.46%              11223
blowfish          1.633             65.01%             1.02%             199759
fft               1.165             42.45%            22.28%              11001
fir               2.908             25.94%           -56.7%                7024
gsm               1.382             45.77%             5.45%               7671
jpeg              1.792             55.04%           -13.49%            1133161
sha               1.657             62.07%            -2.22%             345576
sra               2.840             20.62%           -57.53%               1775
Average           1.889             45.7%            -15.51%           274738.7
σ+                0.35               5.81%             8.70%           374831.68
σ−                0.17               6.58%            15.50%            92682.62

Three delay classes:
Benchmark   Practical speedup   Power overhead   Energy overhead   Instruction count
rand1M            1.765             23.3%            -30.151%           1000000
adpcm             1.389             33.69%            -3.768%             30197
aes               1.987             27.06%           -36.051%             11223
blowfish          2.125             39.27%           -34.456%            199759
fft               1.957             25.47%           -35.872%             11001
fir               2.222             16.14%           -47.727%              7024
gsm               1.465             27.87%           -12.729%              7671
jpeg              1.786             32.62%           -25.74%            1133161
sha               1.578             37.93%           -12.566%            345576
sra               2.013              7.18%           -46.745%              1775
Average           1.829             27.05%           -28.58%           274738.7
σ+                0.11               3.09%             8.41%           374831.68
σ−                0.13               5.76%             4.84%            92682.62

Two delay classes:
Benchmark   Practical speedup   Power overhead   Energy overhead   Instruction count
rand1M            1.530             14.29%           -25.323%           1000000
adpcm             1.418             22.4%            -13.654%             30197
aes               1.818             17.1%            -35.595%             11223
blowfish          1.818             27.23%           -30.024%            199759
fft               1.646             16.14%           -29.45%              11001
fir               1.818             10.5%            -39.225%              7024
gsm               1.635             18.34%           -27.642%              7671
jpeg              1.665             22%              -26.727%           1133161
sha               1.818             27.23%           -30.024%            345576
sra               1.818              5.5%            -41.975%              1775
Average           1.699             18.073%          -29.964%          274738.7
σ+                0.05               2.84%             3.49%           374831.68
σ−                0.07               3.06%             3.24%            92682.62
Although the proposed method is explored in this work with a single-core system, further increases in energy efficiency and overall system performance are expected if the approach is adjusted for modern processor architectures with out-of-order execution and for multicore processors with multiple frequency domains. To exploit the positive impact of out-of-order execution and multicore systems on the performance and energy efficiency of commercial-class processors, the following methodologies should be considered.

Fig. 8: Power overhead per instruction for the two-, three-, and four-delay-class configurations based on the benchmarks in Table 4.

TABLE 5: Comparison between the proposed method and existing state-of-the-art methods.

Algorithm                            Performance gain   Energy overhead   ML based
SLoT [4]                                  23%                N/A            Yes
Early Prediction [5]                      20%                 3%            No
Clim [14]                                 24%                N/A            Yes
SLBM [16]                                 15%                N/A            Yes
Adaptive Clock Management [24]            18.2%             -30.4%          No
This work (2 classes)                     70%               -30%            Yes
This work (3 classes)                     83%               -28.6%          Yes
This work (4 classes)                     89%               -15.5%          Yes
To support out-of-order execution, instructions within a delay class should be bundled into a delay-class-specific reservation station (RS). Instructions stored in an RS are individually executed at a constant frequency until the RS is emptied or a dependency is determined, preventing further execution of the instructions in the RS. Such bundling of instructions reduces the number of clock signal transitions among various frequencies, increasing the performance and power efficiency of the system.
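The bundling idea above can be sketched as grouping a predicted-delay-class stream into per-class reservation stations; the data structures and switch counting are illustrative, not the proposed hardware:

```python
from collections import defaultdict

def bundle_by_class(instr_classes):
    """Group instruction indices by predicted delay class and count the
    frequency switches an in-order execution of the stream would incur."""
    stations = defaultdict(list)  # delay class -> reservation station
    switches = 0
    prev = None
    for i, c in enumerate(instr_classes):
        stations[c].append(i)
        if prev is not None and c != prev:
            switches += 1
        prev = c
    return dict(stations), switches

stations, inorder_switches = bundle_by_class([0, 0, 1, 1, 0, 2, 2])
# Draining each station to completion needs at most len(stations) - 1
# frequency switches, versus inorder_switches for the raw stream.
print(inorder_switches, len(stations) - 1)  # 3 2
```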
As previously, to support out-of-order execution, instructions should be bundled based on the delay classes and stored within the matching RSs. To support multi-clock execution, the ALUs and FPUs within the execution unit should be operated at different clock frequencies, as determined by the granularity of the delay classes. Intuitively, the parallelization of execution from different delay classes with this approach decreases the number of clock adjustments, increasing the system performance and energy efficiency.

To leverage the advantages provided by processing with multiple clock domains in multicore systems, bundled instructions within the individual clock domains (as defined in subsection 7.1) should be shared among all the system clock domains, mitigating the additional cost of multiple clocking (as described in subsection 7.2). To enable the sharing of bundles, efficient bundle scheduling and low-overhead communication channels are required. While the number of clock adjustments is expected to be further reduced with this approach, additional overheads due to intelligent communication of bundles among the cores should be considered. Alternatively, by partially or fully replacing the traditional DFS, DVFS, and thread scheduling mechanisms, additional savings are expected with the proposed approach. Finally, the proposed method can be adjusted in a similar manner to classify the instruction propagation delay of various pipeline stages.

Existing approaches are focused on offline speculations, statistical models, per-task (workload-based) frequency scaling, and prediction of timing errors at an operating point of a system. Alternatively, the proposed method demonstrates the benefits of fine-grain, instruction-level frequency adjustment, simultaneously utilizing most of the clock period slack and mitigating the adverse effects of PVT variations and aging.
SUMMARY
In this work, an additional ML pipeline stage is proposed for increasing the overall system performance by enhancing the temporal resource utilization. This additional stage is designed to classify instructions into propagation delay classes. The system clock frequency is adaptively adjusted based on the individual delay class predictions. Pipelining is exploited to mitigate the effect of the ML stage latency on the overall system performance. Practical ML features are extracted based on the current instruction and computation history. The ML hardware and misclassification power and delay overheads are considered within the reported results. TigerMIPS is utilized as the baseline processor. The processor is enhanced with the ML predictor and simulated with the LegUp benchmark suite. Based on the experimental results, up to 89% performance gain is achieved with four delay classes with 15.5% energy savings. Alternatively, a reduction of 30% in energy consumption with 70% performance gain is demonstrated with two delay classes. A unified shell programming platform with peripheral programs is designed to provide a systematic design flow for ML-driven pipelined processors.

REFERENCES

[1] Fields B, Bodik R, Hill MD. Slack: Maximizing performance under technological constraints. In Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002, pp. 47-58. IEEE.
[2] Zyuban V, Brooks D, Srinivasan V, Gschwind M, Bose P, Strenski PN, Emma PG. Integrated analysis of power and performance for pipelined microprocessors. IEEE Transactions on Computers. 2004;53(8):1004-16.
[3] Kumar R, Farkas KI, Jouppi NP, Ranganathan P, Tullsen DM. Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003, p. 81. IEEE Computer Society.
[4] Jiao X, Jiang Y, Rahimi A, Gupta RK. SLoT: A supervised learning model to predict dynamic timing errors of functional units. In Proceedings of the Conference on Design, Automation & Test in Europe, 2017, pp. 1183-1188. European Design and Automation Association.
[5] Hashemi SH, Ajirlou AF, Soltani M, Navabi Z. Early prediction of timing critical instructions in pipeline processor. In 2016 15th Biennial Baltic Electronics Conference (BEC), 2016, pp. 95-98. IEEE.
[6] Moghaddasi I, Fouman A, Salehi ME, Kargahi M. Instruction-level NBTI stress estimation and its application in runtime aging prediction for embedded processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2018.
[7] Gepner P, Kowalik MF. Multi-core processors: New way to achieve high system performance. In International Symposium on Parallel Computing in Electrical Engineering (PARELEC'06), 2006, pp. 9-13. IEEE.
[8] Hu Z, Buyuktosunoglu A, Srinivasan V, Zyuban V, Jacobson H, Bose P. Microarchitectural techniques for power gating of execution units. In Proceedings of the 2004 International Symposium on Low Power Electronics and Design, 2004, pp. 32-37. ACM.
[9] Wu Q, Pedram M, Wu X. Clock-gating and its application to low power design of sequential circuits. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications. 2000;47(3):415-20.
[10] Wang S, Ananthanarayanan G, Zeng Y, Goel N, Pathania A, Mitra T. High-throughput CNN inference on embedded ARM big.LITTLE multi-core processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2019.
[11] Rapp M, Sagi M, Pathania A, Herkersdorf A, Henkel J. Power- and cache-aware task mapping with dynamic power budgeting for many-cores. IEEE Transactions on Computers. 2019;69(1):1-3.
[12] Isci C, Buyuktosunoglu A, Cher CY, Bose P, Martonosi M. An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, pp. 347-358. IEEE Computer Society.
[13] Canis A, Choi J, Aldham M, Zhang V, Kammoona A, Anderson JH, Brown S, Czajkowski T. LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2011, pp. 33-36. ACM.
[14] Jiao X, Rahimi A, Jiang Y, Wang J, Fatemi H, De Gyvez JP, Gupta RK. Clim: A cross-level workload-aware timing error prediction model for functional units. IEEE Transactions on Computers. 2017;67(6):771-83.
[15] Zhang JJ, Garg S. FATE: Fast and accurate timing error prediction framework for low power DNN accelerator design. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2018, pp. 1-8. IEEE.
[16] Jiao X, Rahimi A, Narayanaswamy B, Fatemi H, de Gyvez JP, Gupta RK. Supervised learning based model for predicting variability-induced timing errors. In 2015 IEEE 13th International New Circuits and Systems Conference (NEWCAS), 2015, pp. 1-4. IEEE.
[17] Zhang JJ, Garg S. BandiTS: Dynamic timing speculation using multi-armed bandit based optimization. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 922-925. IEEE.
[18] Whittle P. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society: Series B (Methodological). 1980;42(2):143-9.
[19] Khaleghi B, Salamat S, Imani M, Rosing T. FPGA energy efficiency by leveraging thermal margin. arXiv preprint arXiv:1911.07187. 2019.
[20] Assare O, Gupta R. Accurate estimation of program error rate for timing-speculative processors. In Proceedings of the 56th Annual Design Automation Conference, 2019, p. 180. ACM.
[21] De Kruijf M, Nomura S, Sankaralingam K. A unified model for timing speculation: Evaluating the impact of technology scaling, CMOS design style, and fault recovery mechanism. In 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN), 2010, pp. 487-496. IEEE.
[22] Moore S, Chadwick G. The Tiger "MIPS" processor. 2011.
[23] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12:2825-30.
[24] Jia T, Joseph R, Gu J. 19.4 An adaptive clock management scheme exploiting instruction-based dynamic timing slack for a general-purpose graphics processor unit with deep pipeline and out-of-order execution. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 318-320. IEEE.
[25] James G, et al. An Introduction to Statistical Learning. New York: Springer, Vol. 112, 2013.
[26] Kuhn M, Johnson K. Applied Predictive Modeling. New York: Springer, Vol. 26, 2013.
[27] Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence, Vol. 14, No. 2, pp. 1137-114, 1995.
[28] Forman G, Scholtz S. Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter, Vol. 12, No. 1, pp. 49-57, 2010.
[29] Yue J, Liu R, Sun W, Yuan Z, Wang Z, Tu YN, Chen YJ, Ren A, Wang Y, Chang MF, Li X. 7.5 A 65nm 0.39-to-140.3 TOPS/W 1-to-12b unified neural network processor using block-circulant-enabled transpose-domain acceleration with 8.1x higher TOPS/mm2 and 6T HBST-TRAM-based 2D data-reuse architecture. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 138-140. IEEE.
[30] Lee J, Lee J, Han D, Lee J, Park G, Yoo HJ. 7.7 LNPU: A 25.3 TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 142-144. IEEE.
[31] Lee J, Lee J, Han D, Lee J, Park G, Yoo HJ. 7.7 LNPU: A 25.3 TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 142-144. IEEE.
[32] Ruder S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. 2016.
[33] Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Mathematical Programming. 1989;45(1-3):503-28.
[34] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
[35] Asadi P, Navi K. A new low power 32x32-bit multiplier. World Applied Sciences Journal. 2007;2(4):341-7.
[36] Hofmann M. Support vector machines: Kernels and the kernel trick. Notes. 2006;26(3).
[37] Mitran J, Bouillant S, Bourennane E. Classification boundary approximation by using combination of training steps for real-time image segmentation. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, 2003, pp. 141-155. Springer, Berlin, Heidelberg.
[38] Kulkarni A, Pino Y, Mohsenin T. SVM-based real-time hardware Trojan detection for many-core platform. In 2016 17th International Symposium on Quality Electronic Design (ISQED), 2016, pp. 362-367. IEEE.
[39] Zhang X, Wang W, Zheng X, Ma Y, Wei Y, Li M, Zhang Y. A clutter suppression method based on SOM-SMOTE random forest. In 2019 IEEE Radar Conference (RadarConf), 2019, pp. 1-4. IEEE.
[40] Cheng SW. A high-speed magnitude comparator with small transistor count. In 10th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2003, Vol. 3, pp. 1168-1171. IEEE.
[41] Lattner C, Adve V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization, 2004, p. 75. IEEE Computer Society.
[42] Agarwal K, Sylvester D, Blaauw D. Modeling and analysis of crosstalk noise in coupled RLC interconnects. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2006;25(5):892-901.

Arash Fouman Ajirlou (S'17) received the Bachelor of Science degree in computer engineering from the University of Tehran, Tehran, Iran, in 2017. He started the PhD program with the Department of Electrical and Computer Engineering at the University of Illinois at Chicago in 2018. He was a research assistant in the School of Electrical and Computer Engineering at the University of Tehran between 2015 and late 2017. From 2017 to late 2018, he served as the secretary of the Electrical and Computer Engineering committee in the Alumni Association of the Faculty of Engineering, University of Tehran. In 2018, prior to starting his PhD in computer engineering at the University of Illinois at Chicago, he was a digital designer in the engineering department of the Ofogh Tajrobe Moj company, Tehran, Iran. His primary interests are embedded systems and high-performance/low-power computing systems, with an emphasis on machine learning and self-governing systems. His current focus is on utilizing machine learning methodologies to enhance processor performance and energy consumption.