HL-Pow: A Learning-Based Power Modeling Framework for High-Level Synthesis
Zhe Lin∗, Jieru Zhao∗, Sharad Sinha† and Wei Zhang∗
∗Hong Kong University of Science and Technology, Hong Kong
†Indian Institute of Technology (IIT) Goa, India
{zlinaf, jzhaoao}@connect.ust.hk, sharad [email protected], [email protected]

Abstract—High-level synthesis (HLS) enables designers to customize hardware designs efficiently. However, it is still challenging to foresee the correlation between power consumption and HLS-based applications at an early design stage. To overcome this problem, we introduce HL-Pow, a power modeling framework for FPGA HLS based on state-of-the-art machine learning techniques. HL-Pow incorporates an automated feature construction flow to efficiently identify and extract features that exert a major influence on power consumption, simply based upon HLS results, and a modeling flow that can build an accurate and generic power model applicable to a variety of designs with HLS. By using HL-Pow, the power evaluation process for FPGA designs can be significantly expedited because the power inference of HL-Pow is established on HLS instead of the time-consuming register-transfer level (RTL) implementation flow. Experimental results demonstrate that HL-Pow can achieve accurate power modeling that is only 4.67% (24.02 mW) away from onboard power measurement. To further facilitate power-oriented optimizations, we describe a novel design space exploration (DSE) algorithm built on top of HL-Pow to trade off between latency and power consumption. This algorithm can reach a close approximation of the real Pareto frontier while only requiring running the HLS flow for 20% of design points in the entire design space.
I. INTRODUCTION
High-level synthesis (HLS) [1] automates the process of translating applications described by high-level languages (e.g., C++ and Python) into register-transfer level (RTL) designs. With the aid of HLS tools, designers targeting hardware implementation for field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) are no longer required to dig deep into low-level hardware details, such as the micro-architectures of individual components and the interconnection between them. Besides this, modern HLS tools have the capability to give relatively good estimates of performance and resource utilization for the created hardware, and also deliver a series of design knobs, or so-called directives, to help designers tune the two aforementioned design metrics. As a result, the productivity and flexibility brought by HLS notably speed up the development process of hardware designs, and also open up an opportunity for efficient design space exploration (DSE) [2]–[7]. However, off-the-shelf HLS tools [8] still lack mature power analysis techniques, making it difficult to clearly observe the influence of different HLS optimization strategies on power consumption.

Power consumption is a primary concern for many hardware designs, especially for portable electronic devices and embedded systems. The common practice to obtain power consumption is through power measurement or estimation, both of which require designers to spend substantial effort. First, RTL designs are created by designers either manually or through HLS. Afterwards, the RTL implementation flow, including logic synthesis, placement and routing, is applied to the provided RTL designs to generate gate-level details. For power measurement, the designs are implemented on real systems, and power consumption can be measured onboard by monitoring devices.
For power estimation, gate-level simulation is performed with real input vectors to capture switching activities of the IO and internal signals. Thereafter, a prebuilt analytical power model [9] provided by the design tool is applied to compute power consumption given the gate-level details and signal activities. After obtaining power values, designers can accordingly refine the hardware architectures in pursuit of higher performance or power efficiency, and run the above design flow again for verification. In general, the creation of power-efficient designs usually necessitates multiple iterations of power evaluation and design refinement, which results in a long design time and low productivity.

Some state-of-the-art works [10]–[13] have presented power modeling techniques to accelerate the power analysis process for hardware designs; however, each of these methods exhibits some of the following drawbacks: 1) each of the power models generated by these methods is customized for an individual design and not applicable to others, 2) their modeling process for each target design requires multiple rounds of power characterization following the slow RTL implementation flow, and 3) it is difficult to migrate their techniques to new platforms due to their dependence on specific hardware modeling expertise. Putting it all together, designers must familiarize themselves with the modeling steps and make great effort to build a specialized power model for every target design, thus incurring high labor intensity.

In light of the above considerations, in this work, we investigate advanced modeling techniques to provide power prediction for FPGA designs at an early design stage, and also strive to speed up power-oriented exploration of hardware designs. Specifically, we propose HL-Pow, a learning-based power modeling framework for HLS designs. Our modeling framework features wide applicability and high efficiency compared with state-of-the-art works [10]–[13].
First of all, HL-Pow offers a modeling strategy with high generalization ability so that various designs can use one well-developed model for power prediction without the need of model reconstruction when targeting the same FPGA platform. Second, our methodology can be easily migrated to new platforms without knowing low-level hardware details such as the technology, hardware primitives or macros. Third, the power prediction of HL-Pow for new designs is fast in runtime, as it dispenses with the need to perform the time-consuming RTL-based power estimation or measurement flow. With HL-Pow, DSE can be quickly conducted to investigate the design tradeoff between power and other design metrics provided by HLS. In summary, we demonstrate the following contributions in this work:
• We introduce an automated feature construction flow for rapid identification and extraction of features closely related to power consumption, simply using results generated by the HLS design flow.
• We propose HL-Pow, a learning-based power modeling methodology with the ability to achieve accurate, fast and early-stage power estimation for HLS designs, by building the power model only once.
• We describe a novel DSE algorithm established on HL-Pow to demonstrate how the tradeoff between latency and power consumption can be effectively and efficiently evaluated by design space sampling.

II. RELATED WORK
A. Hardware Power Modeling
Studies about hardware power modeling have been conducted at two abstraction levels: low abstraction and high abstraction. Low-level abstraction methods [14]–[17] look into the power consumption of primitive components, and derive overall power consumption by aggregating the power of all used primitive components. For this purpose, a library is built in advance for real-time power reference of primitive components. A power characterization process should be conducted to construct a power look-up table, or a so-called macro-model, for each basic component, such as the adder and multiplier. Not only must a rich body of basic components be characterized individually; this power characterization stage must also take into account various use cases, such as signal activity levels, bitwidths and even cell selection variances, thus leading to a large evaluation space to walk through all different situations per component. The large characterization space for all components requires a tremendous amount of development time. What's more, different technologies or standard cell libraries have their own specific design methodologies that are not shared with the others. Because of this, creating this library also depends on developers having a good understanding of all primitive components.

In contrast, high-level abstraction methods [10]–[13] view a design as a whole and build an analytical or learning-based model specific to it, which avoids going deep into most low-level hardware details. The works [10] and [11] target post-RTL power modeling, while the works [12] and [13] focus on pre-RTL power modeling and are close to our work. The work [12] specifically looks into affine functions, identifies the basic code segment as a tile from the programs, and deduces overall power consumption by summing up the power consumption of all tiles. For each application, the tile structure is unique.
As a result, given a new application, a tile-based power characterization stage still needs to be carried out through gate-level power simulation. Nevertheless, the power characterization can be significantly expedited compared with low-level abstraction methods, because only the tile structures, instead of a pool of primitive components, need to be characterized. Another work, FlexCL [13], targets the OpenCL-to-FPGA design flow. Based on the fact that OpenCL applications tend to show regular behaviors in phases, FlexCL decomposes the execution timeline of a kernel into work-groups, and then further divides work-groups into work-items. The dynamic power model is generated according to these two phase levels. Similar to the work [12], FlexCL also involves fine-grained power characterization for different phases in work-groups and work-items, but the overall characterization overhead is also remarkably reduced compared with low-level modeling techniques.

The high-level abstraction modeling methods show significant speedup in model creation compared with low-level abstraction methods. However, existing high-level abstraction methods still entail model regeneration for new designs, rely on slow and repetitive power estimation/measurement for power characterization, and cannot be easily migrated to new platforms because some critical steps, such as power profiling for particular components or code structures, involve hardware design expertise. To the best of our knowledge, HL-Pow is the first work that overcomes all these aforementioned limitations, and presents an HLS power modeling framework that delivers high accuracy, efficiency and generalization ability.
B. Design Space Exploration
A rich body of research studies DSE for HLS. One direction of automatic DSE is to establish predictive models offline and use brute-force search to retrieve an approximate Pareto frontier between two or more target metrics. The works [12] and [13] elaborated in Section II-A also provide exhaustive DSE after the power model is developed for an application. Another instance is MPSeeker [4], which evaluates the tradeoff between performance and area by producing a predictive model for early estimation of HLS results and then traversing the complete design space to find optimal points. An alternative to these methods is to select a subset of design points to feed into HLS and search new design points for exploration according to present HLS results. Due to the difficulty of getting information on all design points in advance, methods developed in this way first select a small subset of samples as promising candidates to put into HLS execution. After obtaining the results from current sample points, knowledge can be learned and used to navigate the search space for evaluating new candidate points. The knowledge generalization techniques include heuristic methods [3], [7] that are specific to their target problems, learning-based methods [2], [5] that generate predictive models for HLS results, and a combination of them [6] which applies heuristic algorithms and machine learning methods in different stages. In our work, we first develop a generic model for rapid power inference of HLS designs, and based on that we present a novel heuristic algorithm to further speed up the DSE to evaluate the latency-power tradeoff by online design space sampling. These two stages are complementary to each other for fast design-time hardware power optimization.

III. POWER MODELING FRAMEWORK
Starting with a new platform, the HL-Pow design flow has two phases: 1) power model training with a collection of applications and 2) power inference for new applications. The complete design flow of the HL-Pow framework is depicted in Fig. 1. In the training phase, a number of representative applications described in C or C++ are used to generate training samples for power modeling. Each application is associated with a set of optimization strategies (i.e., directives)
Fig. 1. Overview of HL-Pow design flow.

to produce a number of design points varying in performance, resource utilization and power consumption. The directives used in this paper are array partitioning, loop unrolling and loop pipelining. The collected design points first pass through the traditional HLS design flow to be converted into synthesizable RTL designs. After that, two major steps are conducted for training sample generation: feature construction and power collection. For feature construction, we make use of input stimuli, the generated reports and intermediate results from HLS runs to construct features that are of great importance to power consumption. For power collection, the power consumption obtained from estimation or from onboard measurement can be used as ground truth power values, both of which require the design points from HLS to go through the RTL implementation flow. Putting it all together, the feature set and the corresponding true power consumption of each design point constitute a training sample. A training set with multiple samples from different applications is used to build a learning model that maps from features to power consumption.

In the power inference phase, HL-Pow can achieve fast and accurate power prediction for new applications using the well-trained power model. Firstly, the new applications, together with the directive configurations to evaluate, are required to go through the HLS design flow. Note that in this stage, RTL code generation can be skipped to save time if the target HLS tool supports the separate execution of different steps in the back-end process. Secondly, the same feature construction step as in the training phase is executed to capture features for new design points. Finally, the created feature set is fed into the prebuilt model for power inference.
In this stage, all the steps are solely based on HLS, and thus there is no need to invoke the tedious RTL implementation flow along with power estimation or measurement for any design point.

There are two main types of features to acquire: architecture features and activity features. Architecture features describe the overall design information estimated by HLS tools, while activity features correspond to the switching activities of different hardware components in the target designs.

A. Data Collection
Starting from the HLS front-end execution, the C/C++ source code is first translated into intermediate representation (IR). Some optimizations are also performed by vendor tools
TABLE I
OPERATOR TYPES AND IR OPCODES FOR ACTIVITY TRACKING.

Operator type | IR opcode
Arithmetic    | add, sub, mul, div, sqrt, fadd, fsub, fmul, fdiv, fsqrt
Logic         | and, or, xor, icmp, fcmp
Memory        | store, load, read, write
Arbitration   | mux, select

at this IR level, such as bitwidth reduction and loop unrolling. With the IR code, the HLS back-end process then conducts control and data flow graph (CDFG) generation, followed by resource allocation, scheduling and functional unit binding. At this stage, the hardware architecture is determined and described by a finite state machine with datapath (FSMD) model. Finally, code generation is executed to convert the generated FSMD model into synthesizable RTL code.

Using Vivado HLS [8] as the design tool for demonstration, some of the data and intermediate results from HLS runs are collected for feature construction: 1) the HLS report (app_name.verbose.rpt.xml) containing details of the overall design, as described in Section III-B, 2) the IR code (a.o.3.bc) and IR operator information (app_name.adb), including each IR operator's ID, opcode, type and netlist name corresponding to a hardware component (denoted as RTL operator), and 3) the FSMD model (app_name.adb.xml) that describes the FSM stages, dataflow, and RTL operator information, including each RTL operator's ID, operand bitwidths and related IR instructions. We identify four types of IR operators that contribute the most to power consumption and can be mapped to RTL operators through ID matching: arithmetic, logic, memory and arbitration operators, as shown in Table I. The activity features introduced in Section III-C only account for these operators. Besides the hardware micro-architectures, the operators' switching activities also depend on the input stimuli, which can be collected from real scenarios or generated at random.
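As a minimal illustration of this classification, the dictionary below simply encodes Table I; the helper name and encoding are ours, not part of the HL-Pow tooling.

```python
# Operator types of Table I, keyed by the IR opcodes tracked for
# activity features. This lookup is illustrative, not the paper's code.
OPCODE_TYPES = {
    "arithmetic": {"add", "sub", "mul", "div", "sqrt",
                   "fadd", "fsub", "fmul", "fdiv", "fsqrt"},
    "logic": {"and", "or", "xor", "icmp", "fcmp"},
    "memory": {"store", "load", "read", "write"},
    "arbitration": {"mux", "select"},
}

def operator_type(opcode):
    """Return the Table I operator type for an IR opcode, or None
    if the opcode is not tracked for activity features."""
    for op_type, opcodes in OPCODE_TYPES.items():
        if opcode in opcodes:
            return op_type
    return None
```

Opcodes outside the four categories (e.g., branch instructions) are simply ignored when activity features are built.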
B. Architecture Features
The power consumption is associated with the scale and complexity of the hardware design and the operating frequency. Therefore, we construct the following architecture features for each design point from the HLS report: 1) FPGA resource utilization estimated by HLS, including look-up table (LUT), flip-flop (FF), digital signal processing unit (DSP) and block random access memory (BRAM); 2) performance, including achieved clock period in nanoseconds and latency in cycles; and 3) the scaling factors (SFs) of the above metrics for the current design relative to those of the baseline design, respectively, which can be computed as

SF_M = M_{current} / M_{base},   (1)

where M represents one of the metrics (i.e., different types of resources, clock period or latency) of the current design, current, or the baseline design, base, in which no directives are used. In general, the SF is an important type of reference that helps to normalize the resource utilization and performance across different applications. We construct 11 architecture features in total.

C. Activity Features
Dynamic power is introduced by signal transitions which dissipate power by repeatedly charging and discharging the
Fig. 2. The IR annotator with RTL-to-IR back tracing and activity tracking.

load capacitors. Eq. 2 formulates dynamic power P_{dyn} as

P_{dyn} = \sum_{i \in I} \alpha_i C_i V_{dd} f,   (2)

which is a function of signal switching activity \alpha_i, capacitance C_i on the net i, supply voltage V_{dd} and operating frequency f. It is conceivable that switching activities of different RTL operators are critical indicators for dynamic power consumption. In HL-Pow, an automatic design flow is introduced to capture the switching activities of different components, and construct activity features using them. To reduce runtime overhead, the design flow targets IR-level activity extraction, instead of the time-consuming RTL-based simulation. The HLS intermediate results elaborated in Section III-A (a.o.3.bc, app_name.adb, and app_name.adb.xml) are used during the construction of activity features. Finally, an IR annotator, an activity generator and a histogram constructor are incorporated in this design flow.

IR Annotator.
The IR annotator instruments RTL operators with functions to keep track of their switching activities. The two main steps in the IR annotator are RTL-to-IR back tracing and activity tracking, as shown in Fig. 2. The RTL-to-IR back tracing is based on the observation that multiple IR operators can be mapped to the same RTL operator due to scheduling and resource sharing in the HLS back-end process, as depicted in the right-hand side of Fig. 2. Therefore, multiple IR operations may contribute to the activities of one RTL operator in different time steps. In the IR code, we trace back the RTL operators to their corresponding IR operators with the opcodes shown in Table I. This is done by matching the netlist name between IR operators and RTL operators in the FSMD model. Following the RTL-to-IR back tracing process, we instrument the IR code with an activity tracking function after each IR operator to record the values of input and output signals and the associated RTL operator ID of this IR operator. After all the above steps, an annotated IR is generated. This IR annotator is developed within the LLVM compiler toolchain [18].
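A rough sketch of the back-tracing step, assuming the IR operator records have already been parsed from the HLS intermediate files; the record layout and function name are our assumptions, not the paper's artifacts.

```python
from collections import defaultdict

def back_trace(ir_ops):
    """Group IR operator IDs by the RTL operator (netlist name) they
    were bound to. Several IR operators may share one RTL operator
    because of scheduling and resource sharing in the HLS back end.

    `ir_ops` is a list of dicts with at least 'id' and 'netlist_name'
    keys (an assumed parse of the FSMD/operator files).
    """
    rtl_to_ir = defaultdict(list)
    for op in ir_ops:
        rtl_to_ir[op["netlist_name"]].append(op["id"])
    return dict(rtl_to_ir)
```

Each resulting group tells the annotator which IR operations must feed the activity-tracking call for one shared RTL operator.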
Fig. 3. The activity generator.
Activity Generator.
Before conducting HLS for an application, the users are required to provide a C-based testbench and a set of input stimuli to verify the correctness of the design output. These files are leveraged in the activity generator. As depicted in Fig. 3, the activity generator first compiles the given testbench and a library of activity tracking functions written in C++ into object files with the g++ compiler, respectively. In addition, the annotated IR is also converted into an object file by the clang++ compiler. All these object files are further linked together into a single executable file. Through running the executable file with the input vectors, we are able to invoke the target kernel function in the IR, and extract the cycle-level input and output values for each RTL operator into a list. Thereafter, we compute the average switching activity per RTL operator by

SA_{op} = ( \sum_{i=1}^{M_{op}} \sum_{j=1}^{N_{op}} HD(s(i,j), s(i,j-1)) ) / ( M_{op} \cdot N_{op} ),   (3)

where s(i,j) is the bit vector for an operand or result i at time step j for the evaluated RTL operator op, M_{op} is the total number of operands and results, N_{op} is the length of the list of activity vectors for op, and HD(·) is the Hamming distance computation function which counts the differences between two vectors bit by bit.

We further scale the average switching activity for each RTL operator as follows:

SA_{scaled} = ( N_{op} / L ) \cdot SA_{op},   (4)

where L is the latency of the target design point estimated by HLS. In this equation, N_{op}/L can be regarded as an activation rate that amortizes an operator's average switching activity over the total execution cycles.

Histogram Constructor.
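A direct transcription of Eqs. 3 and 4 may look as follows; this is a sketch under the assumption that each operand or result is recorded as a Python integer bit vector, and all helper names are ours. Transitions are counted between consecutive time steps.

```python
def hamming_distance(a, b):
    """HD(.): number of differing bits between two bit vectors."""
    return bin(a ^ b).count("1")

def average_switching_activity(trace):
    """Eq. 3 (sketch): average switching activity of one RTL operator.

    `trace` is a list of N_op activity records; each record is a list
    of M_op integers (bit vectors of all operands and results at one
    time step). The first record has no predecessor to compare with.
    """
    m_op = len(trace[0])   # operands + results per record
    n_op = len(trace)      # length of the activity list
    total = sum(
        hamming_distance(trace[j][i], trace[j - 1][i])
        for j in range(1, n_op)
        for i in range(m_op)
    )
    return total / (m_op * n_op)

def scaled_activity(sa_op, n_op, latency):
    """Eq. 4: amortize activity over the design's latency L."""
    return n_op / latency * sa_op
```

Running the instrumented executable fills `trace` for every RTL operator; these two functions then reduce each trace to a single scaled activity value.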
As the directive configurations for different design points lead to different numbers of RTL operators, the size of the currently extracted activity set also varies from design point to design point, even for the same application. Noticing that a trained machine learning model is not able to deal with a varying feature size, we need to devise a way to convert the set of extracted activities into features so that the feature size is fixed for various design points, and a well-developed model is applicable to different applications. To this end, we adopt a histogram representation of operator activities. For each opcode, we create a histogram with a pre-defined number of bins, each of which covers a specific activity range. Each RTL operator is first sorted into a particular histogram according to its opcode, and then it is distributed to the bin covering its scaled switching activity, as computed by Eq. 4. Within each bin, the data statistics to be collected are the number, the percentage and the average switching activity of all the RTL operators in this bin. The fixed-sized statistics for every opcode are used as features and are assembled into an activity feature set for model training and inference. In addition, we adopt the total number of RTL operators for each opcode as a feature.

D. Power Model Generation
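Before turning to model generation, the histogram featurization of Section III-C can be sketched as follows; the bin count, activity range and function name are illustrative assumptions, not the paper's exact configuration.

```python
def histogram_features(scaled_activities, num_bins=8, max_act=1.0):
    """Turn a variable-length list of scaled activities for one opcode
    into a fixed-size feature vector: per bin, the count, percentage,
    and average activity, plus the total operator count at the end."""
    bins = [[] for _ in range(num_bins)]
    for act in scaled_activities:
        # Clamp into the last bin if the activity hits max_act.
        idx = min(int(act / max_act * num_bins), num_bins - 1)
        bins[idx].append(act)
    n = len(scaled_activities)
    feats = []
    for b in bins:
        count = len(b)
        feats.append(count)
        feats.append(count / n if n else 0.0)
        feats.append(sum(b) / count if count else 0.0)
    feats.append(n)  # total number of RTL operators for this opcode
    return feats
```

Regardless of how many operators a design point contains, every opcode contributes exactly 3 * num_bins + 1 feature values, which is what makes one trained model reusable across design points.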
HL-Pow constructs a total of 256 features, consisting of 11 architecture features mainly accounting for static power and 245 activity features contributing to dynamic power. To obtain ground truth power values for each design point, we conduct the RTL implementation flow after the HLS flow, and collect real power measurements during onboard implementation. Besides onboard measurement, gate-level power estimation is another option to get ground truth power values.

We build regression models for power prediction using a variety of supervised learning methods. These models are 1) linear regression: classic linear regression and Lasso regression with an l1-norm regularization term; 2) support vector machine (SVM): support vector regression with a radial basis function (RBF) kernel; 3) tree-based models: decision tree and ensemble models, including bagging trees, adaboost trees, random forests and gradient boosting decision trees (GBDT); and 4) neural networks: multi-layer perceptron (MLP), convolutional neural network (CNN) and residual neural network (ResNet). For CNN and ResNet, we construct a 16-by-16 input map from the 256 features, by filling it row by row with architecture features and the total number of RTL operators for each opcode, followed by the other activity features. As for data preprocessing, we perform data normalization when necessary. For the first three categories of models, we conduct feature selection and K-fold cross-validation to determine the models' hyperparameters before model generation. For neural networks, we deploy several widely used model instances, and fine-tune the model hyperparameters.

IV. ALGORITHM FOR DESIGN SPACE EXPLORATION
With our power modeling framework, power prediction for a design point can be greatly expedited without running the tedious RTL implementation flow. However, when the goal is to find the Pareto-optimal points from a large design space, there is still a large HLS runtime overhead to exhaustively assess power for every design point through HL-Pow. To tackle this issue, we propose a novel algorithm to approximate the Pareto frontier between latency and power consumption by only sampling a small subset of the design points. Specifically, we apply a priori knowledge generalized from training applications to navigate the search for Pareto-optimal points.

The overview of the algorithm is depicted in Fig. 4. We first prune away the design points that produce repetitive RTL designs from the design space, and divide the design space into several regions to explore. The pruning is based on the fact that when an outer loop is pipelined, all the inner loops are automatically unrolled [19]. In such a situation, no matter what unrolling factors are set for the inner loops, the resulting architectures are the same as that without unrolling the inner loops. Therefore, we reserve one design point and remove
Fig. 4. Overview of the design space exploration algorithm.

the redundant ones when this situation happens. Afterwards, we split the design space into multiple regions by the array-related directive, namely, array partitioning, and use loop-related directives, including loop pipelining and loop unrolling, for the search of promising points in each region.

Starting with the trimmed and divided design space, an initial sampling step is conducted to collect the first set of design points to assess. The heuristic is to select representative points in each region that spread out over the range of both latency and power consumption. Through analysis of the training set, we discover a trend that pipelining the outer loops, compared with pipelining inner loops or no pipelining, generally leads to higher power consumption along with lower latency. Moreover, unrolling the loops with a larger unrolling factor also brings a similar effect. Following these observations, we can provide a coarse-grained but a priori estimation of latency and power consumption for different directive configurations, and accordingly, we transform each region into a grid-like representation as shown in step 2 of Fig. 4. On top of that, the design points in the corner and in the middle of each grid are selected to add to the initial sampling set, in that they are most likely to demonstrate extreme and median values for both latency and power consumption.

The initial sampling set is fed into HL-Pow to assess latency (by HLS) and power consumption. After obtaining both latency and power values, an approximate Pareto frontier is derived from the current sampling set, and the existing Pareto-optimal points are used as references for identifying promising design points to evaluate. We propose to use the standard deviation reduction (SDR) [20] as the metric for candidate point selection. SDR measures the ability of an attribute to split a dataset into subsets: the higher the SDR, the better the dataset is split by similarity.
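SDR itself is straightforward to compute. A sketch, using population standard deviation (the paper does not specify which variant) and helper names of our own:

```python
import statistics

def sdr(values, groups):
    """Standard deviation reduction: how much partitioning `values`
    into `groups` (a split by one directive attribute) reduces the
    size-weighted standard deviation. Higher means a better split."""
    weighted = sum(
        len(g) / len(values) * statistics.pstdev(g) for g in groups
    )
    return statistics.pstdev(values) - weighted
```

A split that perfectly separates similar values (e.g., all low-latency points in one group, all high-latency points in the other) drives the weighted term to zero and maximizes SDR.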
Specific to our case, the dataset is the set of latency or power consumption for all design points in an application, and the attributes are unrolling and pipelining. The SDR in our case can be deduced as
SDR = sd(T) - \sum_i ( |T_i| / |T| ) \times sd(T_i),   (5)

where sd(·) is the standard deviation computation function, T is the set of latency/power consumption and T_i is the i-th subset of T split by unrolling/pipelining. We evaluate all the training applications and find that, for both latency and power consumption, loop pipelining has a higher SDR compared to loop unrolling. This means that loop pipelining tends to show a larger effect than loop unrolling on both latency and power consumption, and can better split the design space to indicate differences in both of these metrics. According to this finding, we further transform each region of the design space into an ordered sequence, in which the directive configurations are first sorted by loop pipelining in a coarse-grained manner and then by loop unrolling in a fine-grained manner, as shown in step 5 of Fig. 4. In this way, the latency/power consumption can be roughly estimated as monotonically decreasing/increasing following the direction from right to left in this representation.

We identify each pair of neighboring points in the approximate Pareto set that are from the same region, and annotate them in the corresponding ordered sequence. For each pair of annotated points, we locate the middle point between them in the sequence and add it to the sampling set. If this middle point has already been added to the sampling set, we remove it from the sequence, and instead search for the updated middle point to add. The above steps, namely, design evaluation, Pareto frontier search and candidate selection, are iterated to search for promising design points until a user-defined budget of HLS runs is reached or no more candidates exist.

TABLE II
DIRECTIVE OPTIONS SUITABLE FOR THE TARGET PLATFORM.

Directive          | Option
Array partitioning | type: cyclic; factor: [1, 2, 4, 8]
Loop pipelining    | different levels of nested loops
Loop unrolling     | factor: [1, 2, 4, 8]
Finally, to ensure that the real Pareto-optimal points are not pruned away due to the error induced by power estimation, we allow design points whose power consumption is within a pre-defined deviation (e.g., 5%) of the nearest Pareto-optimal points to be incorporated into the Pareto set.

V. EXPERIMENTAL RESULTS
A. Experimental Setup
The HL-Pow design flow is fully automated and implemented with Python and C++ for feature construction, model establishment and power inference. The different types of learning models are realized in Scikit-learn [21], XGBoost [22] and Keras [23], respectively. We apply our design flow to evaluate 22 applications from different categories in Polybench [24], resulting in up to 11326 valid design points and 256 features per design point. The design points are synthesized using floating-point arithmetic and implemented under a timing constraint of 10 ns. The FPGA development toolkit we use is Xilinx Vivado Design Suite 2018.2. We implement all the design points on a Xilinx Ultrascale+ ZCU102 FPGA board and collect real power consumption through onboard measurement with the Power Advantage Tool [25]. We customize the HLS optimization strategies that fit the applications to the target platform, as shown in Table II.
TABLE III
ACCURACY OF POWER MODELING.

| Application | Power Range (W) | Lasso | SVM   | GBDT | CNN  |
|-------------|-----------------|-------|-------|------|------|
| Atax        | 0.30 – 1.00     | 7.46  | 15.07 | 2.80 | 5.14 |
| Bicg        | 0.30 – 1.15     | 6.21  | 20.62 | 4.63 | 7.80 |
| Fdtd 2d     | 0.29 – 1.36     | 9.46  | 10.81 | 4.79 | 3.98 |
| Gemm        | 0.30 – 0.86     | 6.92  | 17.51 | 3.69 | 5.15 |
| Gramschmidt | 0.29 – 0.65     | 9.07  | 12.31 | 6.26 | 5.69 |
| Jacobi 2d   | 0.30 – 1.31     | 10.67 | 14.16 | 6.32 | 4.36 |
| Mvt         | 0.30 – 1.09     | 9.58  | 14.03 | 4.11 | 4.40 |
| Overall     | 0.29 – 1.36     | 9.08  | 13.00 | 4.78 | 4.67 |

Note: model columns report the MAE (%) of each learning model.
B. Performance of Power Modeling
We use 8784 design points from 15 applications for training and validation, and 2542 design points ( >

C. Quality of Design Space Exploration
We investigate the quality of our DSE algorithm, as proposed in Section IV, with the three applications from the test set (Fdtd 2d, Mvt and Gramschmidt) that have the largest number of design points. To assess the performance of our DSE algorithm in real cases, we calibrate the Pareto-optimal points in the approximate Pareto set using the corresponding real power values from measurement. Average distance from reference set (ADRS) is used as the metric to quantify the difference between the approximate and the exact Pareto sets. ADRS is defined as
\[
ADRS(\bar{P}, P) = \frac{1}{|P|} \sum_{p \in P} \min_{\bar{p} \in \bar{P}} \delta(\bar{p}, p),
\qquad
\delta(\bar{p}, p) = \max \left\{ 0, \frac{Lat_{\bar{p}} - Lat_p}{Lat_p}, \frac{Pwr_{\bar{p}} - Pwr_p}{Pwr_p} \right\}, \qquad (6)
\]

where \(\bar{P}\) is the approximate Pareto set, \(P\) is the exact Pareto set, and \(Lat\) and \(Pwr\) denote latency and power, respectively. The lower the ADRS, the smaller the difference between the approximate Pareto set and the exact Pareto set.
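For concreteness, Eq. (6) can be evaluated as in the sketch below; the function is our own rendering of the metric and the (latency, power) values are illustrative only:

```python
def adrs(approx, exact):
    """ADRS (Eq. 6): average, over exact Pareto points p, of the smallest
    pairwise distance delta(p_bar, p) to any approximate Pareto point."""
    def delta(p_bar, p):
        (lat_b, pwr_b), (lat, pwr) = p_bar, p
        # Relative shortfall in latency and power, floored at zero.
        return max(0.0, (lat_b - lat) / lat, (pwr_b - pwr) / pwr)
    return sum(min(delta(p_bar, p) for p_bar in approx) for p in exact) / len(exact)

# Hypothetical (latency in cycles, power in W) Pareto sets:
exact = [(100, 0.50), (200, 0.30)]
approx = [(110, 0.50), (200, 0.33)]
print(adrs(approx, exact))  # each exact point is missed by 10% in one metric
```

An ADRS of zero means the approximate set covers the exact frontier; larger values mean the nearest approximate point is proportionally worse in latency or power.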
Fig. 5. Results of Pareto frontier approximation: (a) ADRS of the Fdtd 2d application with different initial sampling rates; (b) ADRS of different sampling budgets under a 2% initial sampling rate; (c) real Pareto frontier of Fdtd 2d with the complete sample set; and (d) approximate Pareto frontier of Fdtd 2d with a 2% initial sampling rate and 20% sampling budget.
We investigate how different initial sampling rates and total sampling budgets (i.e., the proportion of design points for sampling) affect the quality of the approximation results. We first evaluate initial sampling rates from 2% to 10%. Fig. 5 (a) depicts the results for the application with the largest number of design points, Fdtd 2d; the other applications show a similar trend. ADRS decreases rapidly as the total sampling budget increases from a small starting point, which showcases the efficacy of our DSE algorithm. Moreover, we can observe that applying different initial sampling rates leads to a converged ADRS as the sampling budget increases. Nevertheless, using a small initial sampling rate benefits the approximation quality given a limited sampling budget, because it effectively balances the sampling proportion between initial sampling and iterative searching. As a result, we adopt a 2% initial sampling rate in the following experiments.

The ADRS for different applications is shown in Fig. 5 (b). Our algorithm demonstrates good results with a sampling budget of 20% and converges at a sampling budget of 40%, resulting in an average ADRS of 2.35% and 1.84%, respectively. Fig. 5 (c) and (d) show the real and approximate Pareto frontiers for Fdtd 2d, respectively. From them, we can observe a clear tradeoff between latency and power consumption. Fig. 5 (d) also indicates good approximation quality. In brief, our DSE algorithm can reach a close approximation of the real Pareto frontier with a small sampling budget.

VI. CONCLUSION
Power consumption is a key consideration for hardware designs. However, existing methodologies for power estimation or measurement incur high development cost and also exhibit many restrictions. In light of these problems, we target efficient and accurate power estimation for FPGA designs at an early design stage. We introduce HL-Pow, a learning-based power modeling framework for HLS. We first propose an automated and fast feature construction flow to capture informative features for power indication, simply based upon HLS results, and then present a modeling framework which can build a generic power model that works for diverse designs without the necessity of model regeneration. HL-Pow can significantly accelerate the power prediction process for FPGA designs, as the execution of the time-consuming RTL implementation flow can be skipped. Experimental results verify that HL-Pow achieves an average prediction error within 4.67% of onboard power measurement. Based on HL-Pow, we describe a novel and efficient algorithm to explore the tradeoff between latency and power consumption of HLS designs. The proposed algorithm retrieves a close approximation of the real Pareto frontier with an average ADRS of 2.35% and 1.84% while only sampling 20% and 40% of design points, respectively, in the complete design space.

ACKNOWLEDGMENT
This work is funded by Hong Kong RGC GRF under grant 16245116.

REFERENCES

[1] P. Coussy et al., "An introduction to high-level synthesis," IEEE Design & Test of Computers, pp. 8–17, 2009.
[2] H.-Y. Liu and L. P. Carloni, "On learning-based methods for design-space exploration with high-level synthesis," in Proc. of DAC, 2013.
[3] L. Ferretti et al., "Lattice-traversing design space exploration for high level synthesis," in Proc. of ICCD, 2018.
[4] G. Zhong et al., "Design space exploration of FPGA-based accelerators with multi-level parallelism," in Proc. of DATE, 2017.
[5] P. Meng et al., "Adaptive threshold non-pareto elimination: Re-thinking machine learning for system level design space exploration on FPGAs," in Proc. of DATE, 2016.
[6] D. Liu and B. C. Schafer, "Efficient and reliable high-level synthesis design space explorer for FPGAs," in Proc. of FPL, 2016.
[7] L. Ferretti et al., "Cluster-based heuristic for high level synthesis design space exploration," IEEE Transactions on Emerging Topics in Computing, 2018.
[8] Xilinx Ltd, "Vivado design suite user guide: High level synthesis," Xilinx White Paper, April 2017.
[9] D. Liu and C. Svensson, "Power consumption estimation in CMOS VLSI chips," IEEE Journal of Solid-State Circuits, 1994.
[10] D. Lee et al., "Learning-based power modeling of system-level black-box IPs," in Proc. of ICCAD, 2015, pp. 847–853.
[11] Z. Lin et al., "An ensemble learning approach for in-situ monitoring of FPGA dynamic power," TCAD, 2018.
[12] W. Zuo et al., "A polyhedral-based SystemC modeling and generation framework for effective low-power design space exploration," in Proc. of ICCAD, 2015.
[13] Y. Liang et al., "FlexCL: A model of performance and power for OpenCL workloads on FPGAs," TC, 2018.
[14] A. Bogliolo et al., "Regression-based RTL power modeling," TODAES, 2000.
[15] Y. S. Shao et al., "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures," in Proc. of ISCA, 2014.
[16] D. Chen et al., "High-level power estimation and low-power design space exploration for FPGAs," in Proc. of ASP-DAC, 2007.
[17] H. Liang et al., "Hierarchical library based power estimator for versatile FPGAs," in Proc. of IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2015.
[18] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proc. of CGO, 2004.
[19] J. Zhao et al., "COMBA: A comprehensive model-based analysis framework for high level synthesis of real applications," in Proc. of ICCAD, 2017.
[20] J. R. Quinlan et al., "Learning with continuous classes," in Proc. of Australian Joint Conference on Artificial Intelligence, 1992.
[21] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," J. Mach. Learning Research, 2011.
[22] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proc. of KDD, 2016.
[23] F. Chollet et al.