Decision Tree Based Hardware Power Monitoring for Run Time Dynamic Power Management in FPGA
Zhe Lin∗, Wei Zhang∗ and Sharad Sinha†
∗Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong
†School of Computer Engineering, Nanyang Technological University, Singapore
{zlinaf, wei.zhang}@ust.hk, sharad [email protected]

Abstract—Fine-grained runtime power management techniques could be promising solutions for power reduction. Therefore, it is essential to establish accurate power monitoring schemes to obtain dynamic power variation in a short period (i.e., tens or hundreds of clock cycles). In this paper, we leverage a decision-tree-based power modeling approach to establish fine-grained hardware power monitoring on FPGA platforms. A generic and complete design flow is developed to implement the decision tree power model, which is capable of precisely estimating dynamic power in a fine-grained manner. A flexible architecture for the hardware power monitor is proposed, which can be instrumented in any RTL design for runtime power estimation, dispensing with the need for extra power measurement devices. Experimental results of applying the proposed model to benchmarks with different resource types reveal an average error of up to 4% for dynamic power estimation. Moreover, the overheads of area, power and performance incurred by the power monitoring circuitry are extremely low. Finally, as a proof of concept, we apply our power monitoring technique to power management using phase shedding with an on-chip multi-phase regulator, and the results demonstrate a 14% efficiency enhancement for the power supply of the FPGA internal logic.
I. INTRODUCTION
With the growth of capacity and complexity of field programmable gate arrays (FPGAs), and the increasing usage of FPGAs in data centers, the reduction of FPGA power consumption is becoming an important issue. To alleviate this problem, FPGA vendors provide gate-level power analysis tools for power estimation, e.g., XPower Analyzer (XPA) from Xilinx and PowerPlay from Altera. These tools are fully aware of the internal hardware implementation and can provide accurate power estimation at design time, thus helping designers tune their circuits with power optimization techniques during development. Nevertheless, there is a burgeoning interest in applying runtime power management techniques, e.g., dynamic voltage and frequency scaling (DVFS) [1] and task scheduling techniques [2]. Such runtime strategies make it a necessity to be aware of the runtime dynamic power consumption of the applications running on the FPGA. One common approach to obtain the runtime power of FPGAs is to use power measurement circuits. This method, however, suffers from two major disadvantages: (1) it requires additional board area to integrate the power measurement devices; and (2) the power detection period is long, at the granularity of milliseconds [3]. As dynamic power management approaches, such as DVFS and current control using on-chip regulators, become feasible and promising [4], dynamic power monitoring at the granularity of tens or hundreds of cycles is required to support fine-grained power reduction.

Within this context, our goal is to establish a fine-grained, accurate yet lightweight dynamic power monitoring scheme for FPGA-based designs, leveraging state-of-the-art machine learning theory. We first propose a generic and complete design flow capable of capturing the key features necessary for power modeling on FPGA and generating samples for power modeling.
Based on that, we exploit a decision-tree-based power modeling method and develop an in-situ supporting architecture which can be efficiently integrated into RTL designs with extremely low overheads in performance, power consumption and resource utilization. The power estimation period can be as fine-grained as tens of clock cycles, facilitating more possibilities for power management strategies. The experiments reveal that the proposed decision-tree-based power model exhibits salient improvements in both accuracy and area overhead in comparison to traditional linear regression methods. Furthermore, we propose model ensemble and runtime phase shedding with an on-chip multi-phase regulator to tackle practical problems using our hardware power monitoring scheme.

In general, our contributions can be summarized as follows:
• A platform-independent and complete synthesis flow to extract features and generate activity traces and power traces as samples for power model establishment.
• A runtime dynamic power monitoring scheme based on decision tree learning theory, with complete analysis from feature selection to model optimization.
• A lightweight and in-situ hardware realization of the power monitoring scheme, including activity counters and an area-efficient memory-based decision tree with small overheads of power, performance and resource.
• A model ensemble strategy: we experimentally quantify the error of aggregating the pre-trained power models as an ensemble model and shed light on its viability for library-based and IP-based designs.
• A proof of concept for fine-grained phase shedding using an on-chip multi-phase voltage regulator for FPGA internal logic.

The rest of this paper is organized as follows. Section II gives a general discussion of modern power modeling approaches.
Section III and Section IV demonstrate the complete design flow and Section V elaborates on the monitoring hardware. Experimental results are discussed in Section VI and finally, we conclude the paper in Section VII.

II. RELATED WORK
In the literature, researchers have shown great interest in dynamic power modeling for FPGAs, as it is one of the main challenges for FPGA-based designs. Studies of FPGA power modeling have been conducted at different abstraction levels. In [5], a low-level abstraction model was developed by capturing average switching power based on switch-level macromodels for LUTs and registers. Likewise, [6] derived power models from basic operators, e.g., adders and multipliers. These low-level models are specialized for their targeted FPGA devices, which makes them difficult to migrate to other FPGA families. In contrast, power models at a high abstraction level are more amenable to generalization in a shorter time, using a set of key signals without the necessity of looking into lower-level details.

In recent work [7]–[9], linear regression has been employed for high-level power models. The work in [7] established a regression model using IO toggle rates and resource utilization in a logarithmic format. It attempted to formulate a generic model suitable for all applications running on an FPGA. However, conditions different from the trained model, such as variation in IO port size, significantly degrade the estimation accuracy. The works [8], [9] extracted the toggle rates of a small set of internal signals and built the model in embedded processors, resulting in LUT resource overheads of 7% in [8] and 9% in [9] for their tested applications, as well as around 5% CPU time for both. Furthermore, as studied in [10], [11], the power behaviors of complex arithmetic units are generally non-linear.
Hence, the linear model exhibits an intrinsic restriction on accuracy enhancement when non-linear power patterns increase with growing sample size, which is known as the underfitting problem in machine learning theory. In light of this problem, the objective of our work is to leverage a decision tree learning model with the ability to adaptively learn different power patterns from samples captured under various situations. To the best of our knowledge, our work proposes the first approach to establish non-linear dynamic power models for FPGAs using state-of-the-art machine learning theory. Furthermore, the overheads of area, power and performance incurred by our proposed power monitoring are trivial. The resulting area overhead of the power monitor is negligible in comparison with prior work [8], [9].

III. AUTOMATIC SYNTHESIS FLOW
Given the original register transfer level (RTL) designs, we propose a complete automatic synthesis flow from power and activity trace generation to decision tree model establishment, as shown in Fig. 1. The flow can be decomposed into three sub-flows: (1) the activity trace flow (ATF); (2) the power trace flow (PTF); and (3) the model synthesis flow (MSF). An activity trace is defined as the switching activities of a set of signals in the design over an estimation period, whereas a power trace is the power value of the target design over an estimation period. Note that our implementation targets the Xilinx tool chain and Modelsim, but the generic methodology is applicable to other vendor tools.
Fig. 1. Automatic synthesis flow.
A. Activity trace flow (ATF)
Dynamic power is introduced by signal transitions, which dissipate power by charging and discharging the load capacitors. With the improvement in leakage power control in recent devices, the main consideration for power management is increasingly dynamic power, which is the main focus of our work. Dynamic power dissipation is summarized in Equation (1), which demonstrates the relationship between the switching activity α_i and capacitance C_i of net i, the supply voltage V_dd and the operating frequency f. Here voltage and frequency are pre-set constants, capacitance is determined by the device, and switching activities depend entirely on the runtime execution of the application.

P_dyn = Σ_{i ∈ N} α_i · C_i · V_dd² · f    (1)

In order to generalize the dynamic power models at a high level, a set of key signals is captured and their switching activities are monitored as formulated in Equation (2), where a(·) is the activity function which returns the difference of the signal transition counts s(·) on the signal set sig over the estimation period from t_start to t_end. For a large design containing millions of nets, it is vital to identify a subset of discriminative and informative nets that are strongly indicative of the power. The number of transitions of an identified signal in an estimation period serves as a feature for the power model synthesis.

a(sig) = s(sig, t_end) − s(sig, t_start)    (2)

In order to identify the most indicative signals, we first run vector-based timing simulation with randomly generated input vectors and then use XPower to export the ranking of the signal activities. The simulation is supposed to run for a sufficiently long time (i.e., orders of magnitude longer than an estimation period) to cover as many situations as possible. We select a number of candidate signals with the highest activities, and then perform feature selection in the MSF.
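As a concrete illustration, the two equations can be sketched in Python. The net names, counter values and the per-net capacitances below are illustrative assumptions, not data from the paper:

```python
# Activity feature per Equation (2): difference of cumulative transition
# counts s(sig, t) sampled at the period boundaries t_start and t_end.
def activity(transition_counts, t_start, t_end):
    return {sig: s[t_end] - s[t_start] for sig, s in transition_counts.items()}

# Dynamic power per Equation (1); alpha is a per-cycle switching activity
# and cap a per-net capacitance (both hypothetical values here).
def dynamic_power(alpha, cap, v_dd, freq):
    return sum(alpha[net] * cap[net] for net in alpha) * v_dd ** 2 * freq

counts = {"net_a": [0, 120, 260], "net_b": [0, 40, 90]}  # cumulative edge counts
print(activity(counts, t_start=0, t_end=2))  # {'net_a': 260, 'net_b': 90}
```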
The feature selection is performed to filter out redundant signals showing repetitive behaviors and leave a smaller subset of signals with discriminative features across different input vectors, as discussed in Section IV-B. Note that the identified signals are the lower-level nets in the HDL netlist exported by the command write_verilog after placement and routing. The HDL netlist is basically composed of primitives (e.g., instances of LUTs or DSPs) and connections. The main advantages of using the HDL netlist are two-fold: it enables us to extract the identified signals, and it preserves the original mapping when the activity counters are added afterwards for implementing the model in hardware for runtime monitoring. Finally, timing simulations are conducted using Modelsim to generate .saif files for power synthesis in the PTF, with the identified features recorded in activity traces simultaneously.

B. Power trace flow (PTF)
We go through synthesis and implementation again using the netlist in the PTF. After that, we apply the .saif files derived from the ATF to the XPower analyzer to derive the corresponding power traces for different estimation periods. To attain high-confidence power values, we need to set proper constraints for IO toggling and all the clock specifications. In addition, more than 25% of the signal toggling should be covered in the .saif files [12]. The static power depends on the ambient temperature. We target Virtex-7 series FPGAs with Xilinx high performance low power (HPL) technology, for which the static power is small in magnitude and shows little variation over a wide range of temperatures, i.e., within 1 W for the temperature range from -40 to 60 degrees centigrade [13], which also conforms with our experimental results. Hence, we conduct our experiments at an ambient temperature of 25 degrees centigrade to approximate the static power, whereas we mainly investigate the dynamic power.
C. Model synthesis flow (MSF)
After we collect a sufficient number of activity traces and the corresponding power traces, our aim is to develop a runtime dynamic power monitoring scheme based on a decision tree learning algorithm. As introduced later in Section IV-A, the decision tree can eventually be decomposed into a series of if-then-else rules, which is intrinsically suitable for an area-efficient hardware implementation with a series of comparators. Moreover, the decision tree caters for models with different complexities by tuning its inherent parameters (e.g., maximum depth, minimum number of samples to split a node). It also renders high adaptability to improve itself when more samples are provided, as indicated by the learning curves in Section VI-A. In comparison, the traditionally employed linear regression model shows an insufficient ability to acclimate itself to further increase accuracy when more training samples are available. Nevertheless, the decision tree is prone to the overfitting problem. In order to build an accurate decision tree regression model, we systematically present the essential steps of decision tree model establishment in Section IV.

IV. DECISION TREE MODEL ESTABLISHMENT
A. Background of decision tree
The decision tree is a nonparametric hierarchical model for supervised learning which learns the samples in the form of a tree structure. At each step of the tree building process, a decision rule is generated, in which a coefficient is compared with the target feature. According to the result of the comparison, the sample set is split into two homogeneous subsets, as shown in Fig. 2. Basically, a tree node can be categorized as: (1) a decision/internal node: a node whose subset of samples can be further split by a feature; or (2) a leaf/terminal node: a node associated with an output, which cannot be further split. The decision nodes represent the decision rules whereas the leaf nodes provide the classification or regression results. We use the CART [14] algorithm to develop decision tree models.
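For illustration, a CART regressor can be trained and flattened into the textual if-then-else form with scikit-learn, the toolbox the model synthesis flow is built on. The data and feature names below are synthetic stand-ins:

```python
# Train a small CART regressor on synthetic activity/power samples and
# print its equivalent if-then-else rule set, as depicted in Fig. 2.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(3)
X = rng.integers(0, 200, size=(80, 2)).astype(float)  # two activity features
y = 0.01 * X[:, 0] + 0.002 * X[:, 1]                  # synthetic power values

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["act_0", "act_1"]))
```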
if feature ≤ coeff: Out = value
else if feature ≤ coeff: Out = value
else: Out = value

Fig. 2. Graphical and textual representation of decision tree regression.
B. Feature selection
In the ATF, we have already screened out a set of candidate signals with high activities. Before we develop the power models, we apply feature selection to preserve only the signals containing informative features across different input vectors in different estimation periods. Feature selection can circumvent the overfitting problem and reduce the resource overhead induced by the activity counters used to monitor signals. We leverage the recursive feature elimination algorithm to identify the required features, which has been successfully applied in other domains [15]. In each iteration, the decision tree model is trained and the feature importance values are computed according to the node impurity at different splits, which is also known as the Gini importance [14] in the decision tree. The features with the least importance are eliminated and the remaining features are used to re-train the decision tree model and update the feature importance. The insignificant features are recursively pruned away and a subset of 20% key features remains for the final model establishment. This maximum subset size is empirically determined based on the observation that the final decision tree uses 10% to 20% of the features, as we extract one hundred features for each application.
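The recursive feature elimination step can be sketched with scikit-learn's RFE wrapper around a decision tree; one feature is dropped per iteration according to the tree's impurity-based importance. The sample data and tree settings below are assumptions for illustration:

```python
# Recursive feature elimination over 100 candidate activity counts,
# keeping the top 20% as described in the text (synthetic data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.integers(0, 1000, size=(200, 100)).astype(float)  # 100 candidate features
y = 0.02 * X[:, 0] + 0.01 * X[:, 3] + rng.normal(0, 0.5, 200)  # power trace

# Drop the least-important feature each iteration until 20 remain.
selector = RFE(DecisionTreeRegressor(max_depth=6, random_state=0),
               n_features_to_select=20, step=1)
selector.fit(X, y)
print(int(selector.support_.sum()))  # 20 surviving features
```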
C. Model tuning and training
The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality. To boost the performance of our target model, we seek to tune a set of essential hyper-parameters which largely determine the accuracy of training, validation and testing. We focus on the parameters defined in Table I. The most influential parameter for the decision tree is the maximum depth, which controls how deep a tree can grow. A deep tree means a high degree of model complexity but also makes the tree prone to overfitting the data. On the contrary, a shallow tree often fails to learn from the samples well. The tree depth should be tuned for different applications to strike a good balance between training and testing accuracy. In addition, the other three parameters are supposed to be coordinated with the tree depth and tuned to circumvent the overfitting problem [14].

We experimentally determine the sets of parameters suitable for our benchmarks: {3, 4, 5, 6, 7, 8} for maximum depth, {5, 10, 15, 20} for minimum split sample and minimum leaf sample, and { } for minimum leaf impurity. To determine the best set of hyper-parameters that fits the data well, the k-fold cross validation approach [16] is employed. We adopt the ten-fold cross validation method: the training set is further divided into ten subsets. A decision tree model is trained using nine of them and the model is evaluated with a score using the remaining subset as the validation set. This procedure is repeated ten times with a different validation set each time and we take the average to get the overall score of the set of parameters, as a quantitative indicator of performance. We try every combination of the hyper-parameters and conduct cross validation to quantify the performance of each model. The set of hyper-parameters with the highest score is deployed to develop the model. After that, we use both the training set and the validation set to train the model in the model developing process. Finally, we assess the model with the test set.

D. Model ensemble strategy
A large-scale hardware design traditionally requires the collaborative participation of a group of designers, each of whom focuses on an independent part of the complete design. In light of this, we exploit the practical usage of our model when different power models are developed independently for different parts of the design and eventually aggregated to behave as an ensemble power model. The model ensemble flow is shown in Fig. 3. Since the dynamic power values for the different components are derived from different decision tree models, we directly add up the dynamic power estimated by the corresponding decision tree models to obtain the overall power consumption. The experiment in Section VI-D shows that the additional error is as small as 1.2% compared to the re-trained decision tree model.

The model ensemble facilitates model reuse: a power model trained for a specific function can be saved in a library and be integrated into an aggregated model directly when needed. It is applicable to library-based and IP-based designs with encapsulated power models. With the increasing usage of FPGAs in data centers to accelerate multiple functions loaded at runtime, such a fast and accurate ensemble method will be particularly useful and necessary.

TABLE I
HYPER-PARAMETERS FOR DECISION TREE TUNING

Name                    Description
Maximum depth           The maximum depth that a tree can grow to.
Minimum split sample    The minimum number of samples used to split a decision node.
Minimum leaf sample     The minimum number of samples necessary to determine a leaf node.
Minimum leaf impurity   The minimum percentage of samples giving a different output at a leaf node.
Fig. 3. Model ensemble flow.
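The summation step of the ensemble can be sketched as follows; the component models, sample data and shapes are stand-ins, not the paper's benchmarks:

```python
# Three independently trained component power models; the ensemble
# estimate is simply the sum of their per-component predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
models, features = [], []
for _ in range(3):  # one model per design component
    X = rng.random((100, 10))
    y = X.sum(axis=1)  # synthetic per-component power
    models.append(DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y))
    features.append(rng.random((1, 10)))  # activity counts for one period

total_power = sum(float(m.predict(f)[0]) for m, f in zip(models, features))
print(total_power)
```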
V. HARDWARE WRAPPER ARCHITECTURE
We propose an area-efficient hardware wrapper to realize the proposed power monitoring in FPGA. The hardware wrapper can be decomposed into two parts: the activity counters and the decision tree regression engine.
A. Activity counter
The activity counter design is shown in Fig. 4. A positive edge detector first identifies the positive edges of the input signal. Its output is valid for a single cycle, acting as the enable signal to the counter. We propose two realizations of the counter design: the LUT-based counter and the DSP-based counter. The LUT-based counter only utilizes LUT resources, whereas the DSP-based counter is implemented using the primitive COUNTER_LOAD_MACRO, an instantiation of the dynamic loading up-counter occupying one DSP48 unit with a maximum data width of 48 bits. These two counter templates provide flexibility for developers who are aware of the application's resource utilization. Since we count the positive edges of selected signals, the number of clock cycles in an estimation period implies the upper bound of the signal activity, which can be used to set the maximum bit width of the counters. The counters are reset at the beginning of every detection cycle by the decision tree regression engine.
Fig. 4. Activity counter architecture.
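A software emulation of the counter in Fig. 4 (an illustrative sketch, not the RTL itself) counts the 0→1 transitions of a sampled signal trace:

```python
# Emulation of the positive-edge detector feeding an enable-gated counter.
def count_positive_edges(samples):
    """Count 0->1 transitions in a sampled binary signal trace."""
    count, prev = 0, 0  # detector register assumed reset to 0
    for s in samples:
        if prev == 0 and s == 1:  # single-cycle edge pulse enables the counter
            count += 1
        prev = s
    return count

trace = [0, 1, 1, 0, 1, 0, 0, 1]
print(count_positive_edges(trace))  # 3
```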
B. Decision tree regression engine
We propose a memory-based decision tree regressor as shown in Fig. 5. There are studies on decision tree implementation in hardware [17] that aim to maximize the throughput of decision tree computation. Nevertheless, our objective differs from prior work in that the prime consideration of our solution is not throughput. Instead, our decision-tree-based monitoring scheme only operates once per estimation period, and thereby we target reducing power and resource overheads in the first place. In our solution, the decision tree structure is completely preserved in a memory element. Additional peripheral control units are incorporated to orchestrate the activity counter control, tree node decoding and branch decisions. To summarize, the complete hardware structure of our proposed solution can be divided into three subsystems: (1) a feature controller; (2) a decision tree finite state machine (FSM); and (3) a decision tree structure memory.

To achieve periodic power estimates, the feature controller buffers feature values from the activity counters, invokes the FSM into its operating states and periodically resets the activity counters. We use a user-specified parameter to set the period of the power estimation. The decision tree FSM has four states: idle (I), node reading (N), stalling (S) and result outputting (R). The FSM starts execution by transferring the state from idle to node reading. In the node reading state, the FSM fetches the node information and tree structure from the memory and completes the if-then-else branch decision by comparing the addressed feature with the decoded tree node coefficient. The stalling state operates together with the node reading state to ensure the correctness of memory reading. The execution finally terminates by transferring to the result outputting state, which gives the final result and sets the indication signal when a tree leaf has been reached.

The decision tree structure memory is the fundamental component which preserves the complete tree structure and the feature addresses for rule decisions, as shown in Fig. 6. The memory uses the block memory on the FPGA. The maximum execution time for a single invocation is n + 1 cycles, where n is the maximum depth of the tree. Note that the decision tree structure is completely preserved in the structure memory, meaning that the proposed hardware wrapper is generally applicable to all decision tree types varying in depth or pruning structure. In addition, the features are the numbers of signal transitions in an estimation period and are therefore unsigned integers. Correspondingly, the coefficients can also be revised into an unsigned integer format for the rule decisions without loss of precision. Following this observation, no floating point operations are required for the hardware wrapper design, which contributes to the high area efficiency of its implementation. Furthermore, our proposed hardware wrappers are applied to HDL netlists, which are lower-level designs comprising primitives and connections that describe the mapping of the RTL designs after placement and routing. As a result, the integration of the monitoring circuits preserves the RTL mapping.
Fig. 5. Decision tree regression engine.
(Fig. 6 shows the memory word layout: a non-leaf entry stores Is_current_leaf, Current_coeff_val, Current_act_addr, Next_left_addr and Next_right_addr; a leaf entry stores Is_current_leaf and the Result.)
Fig. 6. Decision tree memory structure.
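The node-reading loop that the FSM performs over the structure memory can be emulated as follows. The word layout follows Fig. 6, but the tuple encoding and the example node values are assumptions for illustration:

```python
# Walk a tree stored as a flat node table, as the FSM does over block memory.
# Non-leaf word: (is_leaf=0, coeff, act_addr, left_addr, right_addr);
# leaf word:     (is_leaf=1, result, unused, unused, unused).
def evaluate(tree_mem, features):
    addr = 0  # start at the root
    while True:
        is_leaf, value, act_addr, left, right = tree_mem[addr]
        if is_leaf:
            return value  # result-outputting state
        # rule decision: feature[act_addr] <= coeff takes the left branch
        addr = left if features[act_addr] <= value else right

# Tiny example tree: root splits on feature 0 at coefficient 100.
tree_mem = [
    (0, 100, 0, 1, 2),   # node 0: if feature[0] <= 100 -> node 1 else node 2
    (1, 0.35, 0, 0, 0),  # node 1: leaf, power estimate 0.35 W
    (1, 0.80, 0, 0, 0),  # node 2: leaf, power estimate 0.80 W
]
print(evaluate(tree_mem, [42]))   # 0.35
print(evaluate(tree_mem, [250]))  # 0.8
```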
VI. EXPERIMENTAL RESULTS
Our proposed activity trace flow and power trace flow are implemented in Vivado 2016.4 and we utilize Modelsim SE 10.3 for the generation of the .saif files. The targeted FPGA platform is the Virtex-7 XC7V2000tflg1925-1. The model synthesis flow is developed based on the Scikit-learn 0.18.1 [18] machine learning toolbox. We applied our methodology to develop the decision-tree-based power models and build hardware wrappers for several benchmarks from CHStone [19], Polybench [20] and MachSuite [21]. These benchmark suites are C-based and we derive synthesizable Verilog versions using Vivado HLS 2016.4.
A. Model assessment
We systematically evaluate the model accuracy using different sets of benchmarks which can be categorized by the
Fig. 7. Learning curves: (a) Atax; (b) Bicg; (c) GemmNcubed; (d) Matrixmult; (e) Hybrid 1; (f) Hybrid 2.

utilized resources. We define the following three types of benchmarks: LUT-based, DSP-based and hybrid. The LUT-based benchmarks (i.e., Atax and Bicg) mainly use LUT resources, whereas the DSP-based benchmarks (i.e., GemmNcubed and Matrixmult) utilize DSPs as the major resource type. The hybrid benchmarks are combinations of benchmarks showing a large proportion of resource utilization of both LUTs and DSPs. The LUT-based and DSP-based benchmarks are tested with a clock period of 10 ns and the hybrid benchmarks are evaluated with a clock period of 15 ns. The estimation period is 3 µs. We collect 2000 samples for each benchmark by randomly invoking the applications with random input vectors. We select 20% of the samples as the test set, whereas the other 80% is used to train the models as well as perform cross validation. We compare our decision-tree-based models with the linear regression models employed in prior research work [7]–[9]. In the feature selection step for the linear models, we use the feature weights as the feature importance described in Section IV-B for a fair comparison. The resource utilization and the mean absolute error (MAE) in percentage for dynamic power consumption are shown in Table II. The average MAE is 4.36% for our proposed decision tree model, whereas linear regression shows an average error of 17.31%.
From Table II, we can see that the advantage of the decision tree over the linear model is larger for DSP-based and hybrid designs, because the LUTs are intrinsically better fitted by the linear regression model while the DSPs are inclined to have non-linear power patterns, as reported in [10], [11], where complex arithmetic units are shown to generally exhibit non-linear power behaviors.

The learning curves for training and cross validation further reveal the difference between the decision tree models and the linear regression models regarding the capability to learn from samples, as shown in Fig. 7. Note that the power consumption differs a lot between benchmarks and thus the MAE varies largely across benchmarks even with similar accuracy.

TABLE II
BENCHMARK AND MODEL ASSESSMENT

Benchmark    | Resource (%)               | MAE (%)
             | LUT    DSP    FF    BRAM   | Dtree  Linear
Atax         | 4.82   0      0.21  0      | 4.14   12.46
Bicg         | 2.72   0      0.14  0      | 2.58   15.67
GemmNcubed   | 1.16   53.33  1.46  0      | 4.51   17.80
Matrixmult   | 6.24   100    3.65  0      | 3.54   18.81
Hybrid 1     | 53.82  97.78  20.04 3.99   | 5.78   20.78
Hybrid 2     | 56.91  100    8.21  5.69   | 5.61   18.34
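One plausible reading of the MAE-in-percentage metric can be sketched as follows; the paper does not spell out the normalization, so dividing by the mean measured power is an assumption made here for illustration:

```python
# MAE normalized by the mean measured power, expressed in percent
# (one assumed interpretation of "MAE (%)" in Table II).
def mae_percent(predicted, measured):
    n = len(measured)
    mae = sum(abs(p - m) for p, m in zip(predicted, measured)) / n
    return 100.0 * mae / (sum(measured) / n)

print(mae_percent([1.0, 2.0], [1.1, 1.9]))  # mean error 0.1 W over mean 1.5 W
```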
Regarding the linear regression, the learning curves imply a high-bias scenario: the error is high and the model ceases to improve in accuracy given more training samples. In general, the high-bias situation means that underfitting of the training data has occurred. This also accounts for the deterioration in training accuracy as non-linear patterns increase with more samples. Comparatively speaking, the decision tree model exhibits a superior ability to learn from the training samples as the sample size gets larger. As a result, the decision trees exhibit notably lower errors compared with the linear regression models.
B. Frequency variation
In this experiment, we evaluate the effect of operating frequency variation on the estimation accuracy of our proposed model. We use the pre-trained power models for Atax and GemmNcubed from Section VI-A to predict power consumption and verify the models' scalability when the frequency changes at runtime. These models are trained with a clock period of 10 ns with the accuracy shown in Table II. We employ the same estimation cycles (i.e., 300 cycles) and run our design flow to collect new activity and power traces for testing purposes under different frequencies. Note that the power estimation works for the original period of 10 ns and we compute the predicted power by scaling the power estimate according to the ratio of the frequencies (i.e., f_current / f_model), because the power consumption is proportional to the frequency, as shown in Equation (1). The results are shown in Fig. 8. Compared with the baseline MAE for the clock period of 10 ns, the degradation in error is within 0.2% under different frequencies. Thereby, our proposed model is applicable under different operating frequencies with low generalization error.

Fig. 8. Accuracy for frequency variation: (a) Atax; (b) GemmNcubed.
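The frequency-scaling step reduces to a single multiplication; the function name and units below are illustrative:

```python
# Reuse a model trained at f_model by scaling its output to f_current,
# since P_dyn is proportional to f per Equation (1).
def scaled_power(model_estimate_w, f_model_mhz, f_current_mhz):
    return model_estimate_w * (f_current_mhz / f_model_mhz)

# A model trained at 100 MHz (10 ns period) predicting 1.2 W, reused at 50 MHz.
print(scaled_power(1.2, 100.0, 50.0))  # 0.6
```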
C. Model overhead
The decision tree hyper-parameter settings for the tested benchmarks are presented in Table III. We analyze the extra overhead of integrating our proposed power models into the benchmarks from three aspects: resource utilization, operating frequency and power dissipation. The first three benchmarks utilize the DSP-based activity counters while the others use the LUT-based counterparts. The activity counter width can be tuned according to the estimation period as stated in Section V-B. Here, we uniformly set the width to 20 bits, which is sufficient to cover the highest activity for a wide range of estimation periods. For each single application, ten to twenty selected signals are monitored, each of which is equipped with an activity counter. As shown in Table IV, the monitoring circuits consume less than 0.01% of LUTs and FFs, 0.2% of BRAMs and 0.4% of DSPs. Note that by using LUT-based activity counters instead of DSP-based ones, we can further dispense with DSP resources. In contrast, the prior works using linear models [8], [9] report LUT overheads of 7% and 9% respectively, and a software overhead occupying 5% of CPU time. The decision tree exhibits much higher area efficiency because it mainly leverages integer comparisons, whereas the linear model requires a large number of floating-point additions and multiplications, which accounts for its high area overhead. The power dissipation of our decision tree model is extremely low. Besides this, the operating frequency fluctuates only slightly, showing a maximum degradation of 0.70 MHz (1%). In conclusion, our proposed decision-tree-based monitoring hardware demonstrates low overheads in resource utilization, operating frequency and power dissipation. It can be efficiently integrated into RTL designs for on-board power monitoring.
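To illustrate why inference is so cheap in hardware, the sketch below evaluates a tiny decision tree as a short chain of integer comparisons over activity-counter values. The node table, thresholds and leaf values are hypothetical stand-ins for a trained model, not the paper's implementation.

```python
# Each node is (feature_index, threshold_or_leaf_value, left_child, right_child).
# Leaves use feature_index == -1 and store the predicted power in mW, so the
# whole model fits in a small ROM and needs no floating-point arithmetic.
NODES = [
    (0, 512, 1, 2),       # root: compare activity counter 0 against 512
    (1, 128, 3, 4),       # internal node on activity counter 1
    (-1, 950, -1, -1),    # leaf: 950 mW
    (-1, 310, -1, -1),    # leaf: 310 mW
    (-1, 540, -1, -1),    # leaf: 540 mW
]

def predict_mw(activity: list[int]) -> int:
    """Walk the node table with integer comparisons until a leaf is reached."""
    i = 0
    while NODES[i][0] != -1:
        feat, thr, left, right = NODES[i]
        i = left if activity[feat] <= thr else right
    return NODES[i][1]

print(predict_mw([300, 100]))   # counter0 <= 512, counter1 <= 128 -> 310
```

In hardware, each level of this walk is one comparator plus a multiplexer, which is why the area cost stays so far below a floating-point multiply-accumulate datapath.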
D. Model ensemble
TABLE III
DECISION TREE HYPER-PARAMETER SETTINGS.

Benchmark     Max depth   Min split sample   Min leaf sample   Min leaf impurity
Atax          5           20                 20                0.01
Bicg          5           5                  5                 0.001
GemmNcubed    5           20                 20                0.03
Matrixmult    4           5                  5                 0.03
Hybrid 1      6           5                  5                 0.02
Hybrid 2      6           20                 10                0.05

TABLE IV
MODEL OVERHEAD ANALYSIS.

              Resource (in number)           Freq     Power
Benchmark     LUT   DSP   FF    BRAM         (MHz)    (mW)
Atax          127   7     198   2.5          +1.14    3
Bicg          125   6     176   0.5          +2.32    2
GemmNcubed    149   9     242   2.5          +1.35    4
Matrixmult    108   0     325   0.5          0        4
Hybrid 1      156   0     415   2.5          -0.70    5
Hybrid 2      162   0     508   2.5          +1.62    6

In order to verify the accuracy of the ensemble strategy, we combine Atax, Bicg and GemmNcubed with the pre-trained power models and run the design flow to generate testing samples. We then quantify the accuracy of aggregating the separately pre-trained models into an ensemble model by adding up the predicted power of the different components. As a comparison, we re-train a monolithic power model using new samples. The MAE percentages of the ensemble model and the re-trained model are 5.52% and 4.32%, respectively. The accuracy deterioration of the ensemble model is therefore 1.2%, which is mainly owing to the changes in placement and routing. In all, the model ensemble strategy dispenses with the need to re-train new models, at the cost of a moderate accuracy degradation.
E. Fine-grained phase shedding for on-chip multi-phase voltage regulator
Noticing that phase shedding for an on-chip regulator shows prominent efficiency improvement for the processor in [3] by reducing the number of phases in light-load situations, we similarly investigate the viability of phase shedding in FPGA for the internal logic supplied by Vccint. However, different from processors, which internally incorporate multiple power states, applications running in FPGA are fully customized by designers, and intrinsically there is a lack of indicative states about runtime power to guide the phase shedding decision. Following this observation, we leverage our proposed power estimator to supervise a fine-grained phase shedding strategy for an on-chip regulator.

We synthesize a two-stage power delivery system consisting of both on-chip and off-chip multi-phase voltage regulators for Vccint using the tool PowerSoC [22]. Both the on-chip and off-chip regulators are buck converters with the parameters specified in [22]. To approximate the power loss induced by off-chip to on-chip and on-chip to die parasitic resistances, we estimate the fabrication details, including the resistance of package vias and wires, interposer vias and wires, micro-bumps and TSVs, from [23]-[25]. The numbers of the above network components for the targeted device are estimated from [25], [26]. Finally, the power delivery system for internal logic is constructed under the nominal power of 20 W. The on-chip regulator has five phases with a voltage scaling time between Vccint and ground within 20 ns, which is safe for fine-grained control with our 3 us power estimation period.

We first experimentally determine the optimal number of phases that maximizes the power efficiency of the power delivery network under different power values, as shown in Fig. 9. Then, we deploy a look-up-table-based phase shedding approach [3] and determine the optimal number of phases to use at runtime based on the power of the internal logic, including static power and monitored dynamic power.
The power efficiency improvement is derived according to Equation (3), where i denotes the index of a specific estimation period, and P_nopt and P_nmax are the power under the optimal number of phases and the maximum number of phases, respectively. The phase-shedding-induced power overhead P_loss is calculated according to [22].

\[ Eff_{impv} = 1 - \frac{\sum_{i}^{N} \left( P_{n_{opt}}(i) + P_{loss}(i) \right)}{\sum_{i}^{N} P_{n_{max}}(i)} \tag{3} \]

Experiments show that the efficiency improvements for Hybrid 1 and Hybrid 2, using the last 400 sample sequences from Section VI-A, are 13.6% and 14.4% respectively, with the internal logic power ranging from 1 W to 11 W. As a result, the proposed runtime power monitoring hardware provides fine-grained power information for runtime phase shedding of an on-chip regulator, boosting the average efficiency of the power delivery network by 14%. It is promising for use in future FPGAs with integrated voltage regulators, as well as in CPU-FPGA systems-on-chip, for further power saving.
Fig. 9. Efficiency of power delivery network for FPGA internal logic.
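The look-up-table policy and the Equation (3) metric can be sketched as follows. The per-phase loss model below is a toy stand-in for the PowerSoC-derived efficiency curves, and the shedding overhead P_loss is folded into that loss model rather than computed per [22]; all coefficients are illustrative.

```python
def delivered_loss(power_w: float, phases: int) -> float:
    """Toy regulator loss: fixed per-phase overhead plus conduction loss
    that shrinks as the load current spreads over more phases."""
    return 0.4 * phases + 0.1 * power_w**2 / phases

def optimal_phases(power_w: float, max_phases: int = 5) -> int:
    """The LUT entry: phase count minimizing loss at this power level."""
    return min(range(1, max_phases + 1),
               key=lambda n: delivered_loss(power_w, n))

def efficiency_improvement(power_trace, max_phases: int = 5) -> float:
    """Equation (3) over a trace of per-period power estimates, comparing
    the shed configuration against always running all phases."""
    shed = sum(p + delivered_loss(p, optimal_phases(p, max_phases))
               for p in power_trace)
    full = sum(p + delivered_loss(p, max_phases) for p in power_trace)
    return 1.0 - shed / full

trace = [1.0, 2.5, 4.0, 8.0, 11.0]    # W, spanning light to heavy load
print(optimal_phases(1.0), optimal_phases(11.0))   # -> 1 5
```

At light load the fixed per-phase overhead dominates and fewer phases win; near the rated load all phases are needed, which mirrors the shape of Fig. 9.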
VII. CONCLUSION
In this work, we leverage state-of-the-art machine learning techniques to establish a novel decision-tree-based dynamic power monitoring approach for applications running on FPGA. The proposed design flow and the decision tree regression model can be generally applied for fine-grained power prediction in FPGA. We also develop light-weight, in-situ monitoring hardware for the developed power model, which can be efficiently integrated into RTL designs with extremely low overheads in area, power and performance. We investigate our proposed methodology on three different types of benchmarks: LUT-based, DSP-based and hybrid benchmarks. Experimental results reveal that the decision tree models outperform the traditional linear models with more than 10% reduction in mean absolute error. Besides this, the decision-tree-based power monitoring exhibits a high capability to learn from samples as the training sample size increases, whereas the linear regression is prone to underfitting. Moreover, we exploit a model ensemble method for library-based and IP-based designs, with the results exhibiting an additional 1.2% error compared with a completely re-trained model. Furthermore, we utilize our proposed power monitoring scheme to guide the phase shedding of an on-chip multi-phase regulator as a case study, and the results demonstrate a 14% average improvement in the efficiency of the power supply for the FPGA internal logic. In future work, we plan to enhance the power monitoring scheme with static power compensation and leverage fine-grained runtime power monitoring for advanced power management schemes.

REFERENCES
[1] P. Mantovani et al., "An FPGA-based infrastructure for fine-grained DVFS analysis in high-performance embedded systems," in DAC, 2016.
[2] A. Lösch et al., "Performance-centric Scheduling with Task Migration for a Heterogeneous Compute Node in the Data Center," in DATE, 2016.
[3] H. Asghari-Moghaddam et al., "VR-scale: runtime dynamic phase scaling of processor voltage regulators for improving power efficiency," in DAC, 2015.
[4] B. Keller, "Opportunities for Fine-Grained Adaptive Voltage Scaling to Improve System-Level Energy Efficiency," Master's thesis, EECS Department, UCB, 2015.
[5] F. Li et al., "Architecture evaluation for power-efficient FPGAs," in FPGA, 2003.
[6] C. Najoua et al., "Accurate dynamic power model for FPGA based implementations," IJCSNS, 2012.
[7] A. Lakshminarayana et al., "High level power estimation models for FPGAs," in ISVLSI, 2011.
[8] M. Najem et al., "Method for dynamic power monitoring on FPGAs," in FPL, 2014.
[9] E. Hung et al., "KAPow: A System Identification Approach to Online Per-Module Power Estimation in FPGA Designs," in FCCM, 2016.
[10] D. Lee et al., "Learning-based power modeling of system-level black-box IPs," in ICCAD, 2015.
[11] A. Bogliolo et al., "Regression-based RTL power modeling," TODAES, 2000.
[12] Xilinx, "Vivado Design Suite User Guide," 2017.
[13] M. Klein and S. Kolluri, "Leveraging power leadership at 28 nm with Xilinx 7 series FPGAs," Xilinx whitepaper, 2013.
[14] L. Rutkowski et al., "The CART decision tree for mining data streams," Information Sciences, 2014.
[15] H. Ravishankar et al., "Recursive feature elimination for biomarker discovery in resting-state functional connectivity," in EMBC, 2016.
[16] J. D. Rodriguez et al., "Sensitivity analysis of k-fold cross validation in prediction error estimation," TPAMI, 2010.
[17] F. Saqib et al., "Pipelined decision tree classification accelerator implementation in FPGA (DT-CAIF)," TC, 2015.
[18] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," JMLR, 2011.
[19] Y. Hara et al., "CHStone: A benchmark program suite for practical C-based high-level synthesis," in ISCAS, 2008.
[20] L.-N. Pouchet, "Polybench: The polyhedral benchmark suite," 2012.
[21] B. Reagen et al., "MachSuite: Benchmarks for accelerator design and customized architectures," in IISWC, 2014.
[22] X. Wang et al., "An Analytical Study of Power Delivery Systems for Many-Core Processors Using On-Chip and Off-Chip Voltage Regulators," TCAD, 2015.
[23] Z. Xu et al., "Modeling of power delivery into 3D chips on silicon interposer," in ECTC, 2012.
[24] R. Wrona et al., "Resistance Measurements of BGA Contacts During Reliability Tests," in ISSE, 2006.
[25] R. Chaware et al., "Assembly and reliability challenges in 3D integration of 28nm FPGA die on a large high density 65nm passive interposer," in