Architectural Analysis of FPGA Technology Impact
Oriol Arcas-Abella, PhD, and Abhinav Agarwal, PhD
Abstract — The use of high-level languages for designing hardware is gaining popularity, since they increase design productivity by providing higher abstractions. However, one drawback of such an abstraction level has been the difficulty of relating low-level implementation problems back to the original high-level design, which is paramount for architectural optimization. In this work, we propose a methodology to analyze the effects of technology on the architecture, and to generate architectural-level area, delay and power metrics. Such feedback allows the designer to quickly gauge the impact of architectural decisions on the quality of the generated hardware and opens the door to automatic architectural analysis. We demonstrate the use of our technique on three FPGA platforms using two designs: a Reed-Solomon error correction decoder and a 32-bit pipelined processor implementation.

Keywords — high-level hardware design; technology impact

I. INTRODUCTION
In recent years, the automation of digital hardware design has advanced greatly. The industry has the concurrent goals of increased designer productivity, shorter design cycles and ambitious area-performance-power constraints. As a consequence, designers are increasingly adopting high-level Hardware Description Languages (HDLs) [1], [2], [3], [4], [5] to address the challenging design requirements of large systems without losing productivity. These languages provide a higher level of abstraction than traditional languages such as VHDL or Verilog, implicitly hiding some of the low-level details in favor of an explicit system-wide view.

Existing commercial and academic design tools are good at synthesis, placement and routing of hardware designs for different implementation platforms. Usually such tools provide very detailed reports, but only on gross, macro-level resource consumption and performance metrics. Even if the detailed reports are analyzed, the resources are closely tied to the implemented circuit. Therefore, it is difficult to obtain a breakdown of area, delay and power metrics for the modules and sub-modules in the high-level design source.

Architecture designers require tools to automatically generate such a breakdown, as typical iterations in the design cycle are always limited to specific blocks and modules. FPGAs are an important part of the high-performance ecosystem, as accelerators in heterogeneous architectures or as high-throughput custom machines. In both cases they require deep design-space exploration [6] and architectural refinements [7] to appropriately map onto the target technology with the required constraints. There is an urgent need to improve and enhance the feedback from downstream synthesis tools to inform high-level design decisions. (This project was started in April 2013.)
This work tries to reduce the gap between the target circuits and their high-level architectural descriptions. In addition, bridging these two levels of the design process is the first and necessary step towards automatic architectural optimization. Our contributions are the following:

• We present a novel methodology to analyze high-level designs and their synthesis reports in order to generate module- and sub-module-level metrics for area, delay and power. This methodology can be used with existing languages and hardware synthesis tools, and some of its metrics cannot be obtained with existing design tools. The novelty of this work lies in the automatic generation of metrics corresponding to high-level architectural units.

• To demonstrate the applicability of this methodology, we implement it on a rule-based HDL. We use our prototype implementation to extract results for two micro-architectural test cases: a Reed-Solomon decoder and a RISC microprocessor. We demonstrate the use of our tool on three different FPGA platforms.

• The results presented are enriched with architectural information. We propose how they could be used by the designer to make informed architectural decisions in order to optimize the design.

• We also discuss the possibility of automatic architectural optimization using our framework.
Paper Organization:
Section II introduces our methodology. In Section III, we apply the methodology to the chosen high-level hardware designs implemented on three FPGA platforms and show three different metrics for each implementation. We consider the possibility of automatic architectural optimization in Section IV. In Section V, we discuss the applicability of our technique to other HLS frameworks, and present recently developed techniques and tools related to this work. Finally, Section VI describes future research directions and concludes the paper.

II. METHODOLOGY
We divided our methodology into two steps. First, the design is analyzed and annotated. Then, after passing through the synthesis flow, the resulting circuit is analyzed and compared to the original, abstract design. With this feedback the user can apply architectural solutions to implementation problems; moreover, since the tool knows both the architectural and the low-level information, automatic optimizations become possible.

The analysis starts with the user's description of the architecture. As the needs for large-scale system design increase, several HDLs have been proposed. To implement our methodology and demonstrate its applicability we have chosen Bluespec SystemVerilog [2], a well-known rule-based HDL. In Section V we describe how our methodology can be applied to HLS tools as well.

Fig. 1: Example of our methodology applied to a GCD module, from the architecture to annotated circuits. (a) Bluespec architectural model. (b) Annotated RTL model. (c) Graph of the FPGA circuit.

As an example, consider the traditional swap/subtract greatest common divisor (GCD) Euclidean algorithm. In Figure 1a we show a 32-bit GCD module implemented with two Bluespec rules and two 32-bit registers. Both rule guards are mutually exclusive, so at most one rule can execute every cycle.
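As a sketch of these rule semantics (not the actual Bluespec source, which appears in Figure 1a; the exact guard conditions are assumed here to be those of the textbook swap/subtract GCD), the cycle-by-cycle behavior of the two guarded rules can be simulated in Python:

```python
def gcd_module(x, y):
    """Simulate the two-rule GCD module cycle by cycle.

    The 'swap' rule is assumed to fire when x > y, and the
    'subtract' rule when x <= y; both also require y != 0, so the
    guards are mutually exclusive and at most one rule fires per
    clock cycle, as described in the text.
    """
    cycles = 0
    while y != 0:
        if x > y:            # guard of the 'swap' rule
            x, y = y, x
        else:                # guard of the 'subtract' rule
            y = y - x
        cycles += 1
    return x, cycles         # the result is left in register x

print(gcd_module(12, 8))     # → (4, 5): GCD is 4, after 5 cycles
```

The cycle count returned by the model is exactly the number of rule firings, which is the kind of architectural-level activity information the methodology later relates to the synthesized circuit.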
A. Analysis of the Architecture
The initial step of our methodology starts with a user description of a hardware architecture; the user description of the GCD design is shown in Figure 1a. Instead of directly analyzing the high-level source code, our tool analyzes the intermediate representation to simplify the problem.

Our tool generates an abstract representation of the architecture from the Bluespec description. During the rest of the paper, we term the high-level objects (submodules and rules) "blocks". In Figure 1a there are four blocks: the rules swap and subtract, and the registers (submodules) x and y. The intermediate representation contains a very simplified model of the operations of the rules. All the algorithmic descriptions are converted into a flat RTL representation.

The intermediate representation is converted into the definitive Verilog description of the hardware, which is functionally equivalent to the architectural description in Bluespec. The main goal of our methodology is to establish relations between the final circuit and the original architecture. The core technique to accomplish this is the annotation method, where innocuous annotations are added to the Verilog files generated by the Bluespec compiler. This annotation process is especially necessary for high-level synthesis, where the generated RTL code does not clearly reflect the original architectural units and the algorithmic semantics.

This extra information remains during FPGA circuit synthesis, and consists of unique identifiers assigned to each Bluespec block. Each identifier is added as a prefix to the name of every Verilog element present in the intermediate model, identifying the block that owns it. An RTL circuit of the GCD design is shown in Figure 1b, with the annotations represented as colors. For example, the subtract arithmetic unit, in blue, corresponds to the subtract rule. Similar units may be merged by the Bluespec compiler and not be identified properly (shown with a white background in the example).
Such unidentified units represent less than 1% of the hardware.

B. Analysis of the Circuit
After the Verilog model is annotated, it is synthesized into a circuit (i.e., mapped, placed and routed) using the FPGA vendor's tools. In this process the architecture is implemented in a particular technology, adapting to its characteristics. The next step is analyzing this circuit and identifying the annotations that remain after synthesis. This enriches the available information on the area, timing and power characteristics of the circuit, establishing a direct link with the original architecture. In the following sections we will also refer to the FPGA electronic elements (LUTs, registers, etc.) as cells or gates.

We show a hypothetical synthesis of our GCD model in Figure 1c. The two original registers have been synthesized into flip-flops. We show the input (D) and output (Q) ports of the flip-flops as separate nodes because there is no asynchronous connection between them. The other cells are LUTs with different configurations and numbers of inputs. Again, the annotations, shown in the example as colors, remain.

The resource and timing information extracted from the final circuit is used to identify the annotated gates and map them to the original architectural elements. In this work, some specialized elements like Block RAMs and DSP slices have been kept outside the scope of our analysis and will need further research.
1) Area estimation:
The area analysis refers to the amount of FPGA resources used. Our tool analyzes the final circuit and represents it as a Directed Acyclic Graph (DAG) for computation purposes. The nodes of the graph are FPGA cells, and the edges are FPGA nets. As explained previously, after synthesizing the circuit, the FPGA cells are named after the Verilog elements that generated them. Our tool looks for prefixes in the names and extracts the unique label that identifies the original Bluespec block.
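As an illustration of this lookup, the following Python sketch tallies cells per block from annotated cell names. The `__` separator and the example cell names are assumptions made for this sketch, not the tool's actual naming convention:

```python
from collections import Counter

def area_by_block(cell_names, sep="__"):
    """Count FPGA cells per architectural block, using the block
    identifier that survives as a prefix of each post-synthesis
    cell name.

    Cells without a recognizable prefix (e.g. units merged by the
    compiler, the white nodes of Figure 1b) are collected under
    the '<unidentified>' bucket.
    """
    areas = Counter()
    for name in cell_names:
        block, found, _ = name.partition(sep)
        areas[block if found else "<unidentified>"] += 1
    return areas

# Hypothetical annotated cell names from the GCD example.
cells = ["subtract__sub32", "subtract__mux0", "swap__mux1",
         "x__reg", "y__reg", "shared_lut7"]
print(area_by_block(cells))
```

On this toy input the tally attributes two cells to the subtract rule, one each to swap, x and y, and one merged cell to the unidentified bucket.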
2) Delay estimation:
The delays of the circuit can be represented as weights in the DAG that the tool extracted.

Fig. 2: Algorithm used to calculate the block delay. Nodes are FPGA cells, arrows are combinational paths. (a) FPGA circuit. (b) Annotated nodes. (c) Expanded paths. (d) Connected sets. (e) System delays. (f) Block delays.

Figure 1c is an example of such a DAG, where every node corresponds to an FPGA cell and every edge is a connection between cells. Some of these graphs can be too big to analyze, even for a small design. Two weighting functions are applied to the DAG:

• System delays: all the edges are weighted with their corresponding delay. This weighting function makes the algorithm choose the longest paths that contain one or more nodes of the current block.

• Block delays: only the nodes that belong to the current block have a positive weight, while the rest have a null weight. With this weighting function, the algorithm returns the paths with the highest delay due to the current block.

The first weighting function allows us to find whether a block contributes to the longest (critical) paths of the system, which are the performance bottlenecks of the whole design. This information can be returned by any typical timing analysis tool. The second weighting function allows us to find what we consider the delay of a design block: among all the combinational paths that cross the block, the longest intra-block segment. This second metric helps the designer understand the isolated delay of an architectural unit.

We want to note that the results of the two metrics can be completely different, because a block's internal maximum path can be longer than the block's contribution to a system critical path.
Therefore, non-critical optimization opportunities of architectural units (i.e., block delays) are difficult to detect with typical timing analysis, and their effect can be higher than would be expected [8].
For instance, in Figure 2e the critical paths are shown in red. Using this weighting function, the designer can observe that the block in blue contributes to two system paths that have 5 and 4 nodes. Assuming that in this example all the delays are equal, the path with 5 nodes is the longest path contributed by this block. Figure 2f shows the result of applying the block contribution delay: only the delays of the annotated nodes are considered, and the paths are 1 and 3 nodes long. In this example the block (in blue) contributes to a critical path of the system that is 5 nodes long, but the critical delay of the block is 3 nodes long. Thus, the designer could choose (a) to optimize the critical path of the system in order to reduce the maximum period and increase the frequency, (b) to optimize the delay of the block if its functionality is used often, or even (c) to use a different, higher-speed clock for it.
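The two weighting functions can be sketched as follows. This is a simplified Python model over a node-weighted DAG; for brevity, the system-delay variant computes the global longest path rather than restricting it to paths that cross the block, and the graph and delay values are invented for illustration:

```python
from functools import lru_cache

def longest_path(graph, node_weight):
    """Longest path in a DAG, where `graph` maps each FPGA cell to
    its combinational successors and `node_weight` gives the delay
    counted for each cell (the weighting function)."""
    @lru_cache(maxsize=None)
    def dist(node):
        succ = graph.get(node, ())
        return node_weight(node) + max((dist(s) for s in succ), default=0)
    return max(dist(n) for n in graph)

def system_delay(graph, delays):
    # Every cell keeps its full delay: the critical-path metric.
    return longest_path(graph, lambda n: delays[n])

def block_delay(graph, delays, block):
    # Only cells annotated as belonging to `block` contribute delay.
    return longest_path(graph, lambda n: delays[n] if n in block else 0)

# A 5-cell combinational chain with unit delays, where the block
# under analysis owns the last three cells -- mirroring the Figure 2
# example: the block lies on a 5-node system path, but its own
# block delay is only 3 nodes long.
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": []}
delays = {n: 1 for n in chain}
print(system_delay(chain, delays))                 # → 5
print(block_delay(chain, delays, {"c", "d", "e"})) # → 3
```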
3) Power score:
As part of the feedback generated by the tool, the component blocks of a design are ranked in order of their average power consumption. The power score characterization considers the following aspects individually.

The static power consumption is directly proportional to the area of the component blocks, weighted by values for varying sizes of LUTs and flip-flops. The dynamic power consumption of a block is directly proportional to the product of the capacitance of the switching elements and the frequency of transitions of block elements. Dynamic activity within a block can occur either when the rule corresponding to the block fires or when the input state for the rule changes as a result of another rule firing. We profile the design to obtain the number of event transitions as measured by rule-based activity, as well as the relationship between the output state affected by a rule and the input state dependence of each rule.

A state change in a given cycle leads to dynamic activity in the next cycle in each block that depends on the changed state. In addition, rules that fire in a given cycle write their state changes in the same cycle, again consuming power. We assume that whenever a state is updated its value changes, leading to dynamic transitions in dependent blocks. The dynamic power computed for the design takes both of these components into account.

The static and dynamic power of each block (P_S and P_D, respectively) are determined using the cell characterization obtained in the area analysis and the FPGA cell power model described in [9]. We then use dependency analysis and profiled rule-firing statistics to compute the average switching factor (α) for each block of the design. This is then used to compute the average power consumption of each block, given by Equation 1, where f is the frequency of operation:

P_avg = P_S + P_D · α · f    (1)

This metric is termed the "power score".
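Equation 1 and the resulting ranking can be read as the following sketch. The helper names and the example numbers are hypothetical; the actual P_S, P_D and α values come from the cell characterization and profiling steps described above:

```python
def power_score(p_static, p_dynamic, alpha, freq):
    """Average power score of a block, Eq. (1):
    P_avg = P_S + P_D * alpha * f.

    p_static  -- static power, proportional to the block's area
    p_dynamic -- dynamic power of the block's switching cells
    alpha     -- average switching factor from rule-firing profiles
    freq      -- operating frequency f
    """
    return p_static + p_dynamic * alpha * freq

def rank_blocks(blocks, freq):
    """Rank blocks by power score, highest first; `blocks` maps a
    block name to its (P_S, P_D, alpha) triple."""
    return sorted(blocks,
                  key=lambda b: power_score(*blocks[b], freq),
                  reverse=True)

# Illustrative, made-up per-block characterizations at 100 MHz.
blocks = {"Berlekamp": (1.0, 0.04, 0.6),
          "Chien":     (0.8, 0.01, 0.3)}
print(rank_blocks(blocks, 100))   # Berlekamp outranks Chien
```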
Though it is not a direct estimate of the exact power consumption, due to the assumption of upper-bound block-level activity, this metric can still help the designer compare the relative amounts of dynamic and static power of the various blocks of the hardware. In addition, the probable hot spots of the architecture can be identified. The value of the metric cannot be used to directly compare the static and dynamic power of a block implemented in different FPGA technologies, as we use the same power model in each case. The main value of the score is that the relative power distribution of the design on each FPGA platform will follow the same pattern as the power score for that platform, and this information can be used to analyze and reduce the power consumption in a customized manner.

III. RESULTS
In this section we describe the architectures chosen to demonstrate our methodology and the results obtained using our tools. The two test cases are:
Reed-Solomon:
This design is a parameterized Reed-Solomon error correction decoder which meets the throughput requirement for use in an 802.16 wireless receiver. The decoding algorithm is composed of several steps, each of which is implemented as a separate module, shown in Figure 3a. Dynamic activity, used for determining the power score metrics of the design, was generated using a testbench that feeds input data with errors at 50% of the maximal correctable rate.
SMIPS:
This design is a 32-bit RISC microarchitecture that implements the MIPS I ISA. Figure 3b shows the main components, consisting of a multiply unit, coprocessor 0 (implementing data and instruction TLBs), independent instruction/data L1 caches, and a unified, N-way L2 cache. This 5-stage processor can boot the GNU/Linux kernel. We used the dynamic activity generated during the booting of Linux to generate the power score metrics of the SMIPS design.

We implemented these designs on three different FPGA devices: Spartan 6 XC6SLX45T-3FGG484, Virtex 5 XC5VLX155T-2FF1136 and Virtex 7 XC7VX485T-2FFG1761. We used the Xilinx ISE and Vivado 14.4 tools to synthesize, place and route the Verilog hardware model. We also used these tools to export the EDIF and SDF models. All the designs were targeted at 100 MHz on all the FPGAs.
Fig. 3: The two example models analyzed. (a) Reed-Solomon decoder: Syndrome Computation, Berlekamp-Massey Algorithm, Chien Search, Error Magnitude, Error Correction. (b) SMIPS RISC architecture: CPU with CP0, instruction/data TLBs, L1 instruction (I$) and data (D$) caches, L2 cache (L2$), register file (RF), multiply unit (Mult), and pipeline stages PC, F, D, E, WB.
A. Discussion of the results
The results for both architectures are shown in Figure 4. The bar colors indicate the FPGA device: Spartan 6 in blue, Virtex 5 in red, and Virtex 7 in green. Three metrics are displayed:

• The area results show the number of cells per architectural module.

• The delay charts show the longest block delay, as explained in Section II-B2. This metric measures the maximum contribution of each block to any path that crosses it. The darker components are the network part of the delay, and the lighter ones the logic part. The diamond-shaped mark over the columns indicates the maximum delay of the system, and which block or blocks contribute to it. For both designs we can observe the significance of the network in the total delay.

• The power scores are also decomposed in two: the static power score (darker shade) and the dynamic power score (lighter shade).

Our framework produces results for every module and sub-module, but we grouped some results to simplify the charts.
1) Reed-Solomon decoder:
Figure 4a shows the breakdown of metrics for the modules of the algorithm, as well as for the decoder module (RS) itself, which acts as a wrapper around these component modules, and for the top module, which deals with input-output to memory. From an algorithmic perspective, the Berlekamp step is the most computationally intensive part of the decoding process, and accordingly we see that this module has the maximum area on all three FPGA platforms. The error correction step involves the minimum computation, as it simply removes the computed error values from the received data, thus contributing minimal area. The other three component modules have similar, moderate area usage.

Fig. 4: Area, delay and power metrics on three Xilinx FPGAs: Spartan 6 (blue), Virtex 5 (red) and Virtex 7 (green). (a) Reed-Solomon area, delay and power metrics. (b) SMIPS area, delay and power metrics.

For the delay metrics, implementations on the Virtex 5 and Virtex 7 platforms easily meet the required 100 MHz clock frequency, with critical paths located mostly in the Chien and Berlekamp modules respectively. However, the Spartan 6 implementation is unable to achieve this due to long computational operations in Berlekamp, Chien and Syndrome, with Syndrome contributing the critical path. The power metrics roughly track the area metrics in similar ratios. One important point to notice is that most of the dynamic power consumption in the decoder is contributed by the Berlekamp module. Dynamic power comprises up to 50% of Berlekamp's power consumption (in the case of Virtex 5), while the power consumption of the other blocks is mainly the static power of the FPGA resources used. This highlights the importance of this module for the decoder design, and suggests design refinement for reduced area as well as the use of power reduction techniques for reducing unnecessary dynamic activity (e.g., clock gating).
2) SMIPS:
In SMIPS there are 12 modules. These modules correspond to some of the architectural units shown in Figure 3b. The results of the missing architectural sub-units, such as the pipeline stages, are included in their parent modules. As in the previous case, the area and power metrics are very similar across all tested FPGA platforms, as seen in Figure 4b. In general, the area metrics follow a descending trend from Spartan 6 to Virtex 7. This is a result of the FPGA architectures being different in these devices. For instance, Spartan 6 slices (a group of two LUTs) have one carry chain output [10], which can be used to implement fast carry chain arithmetic operations. The Virtex 5 and Virtex 7 slices have two independent carry chains [11], which allow implementing more arithmetic operations with fewer LUTs. This is especially clear in the Virtex 7 area results.

The data cache requires more resources on Virtex 5 than on the other devices. The area report showed that the data cache module required 500 more registers on Virtex 5. We observed that, while we specified the same architecture, the synthesis tool did not infer the data cache RAM unit correctly for Virtex 5, and implemented the cache memory using registers instead of efficient on-chip Block RAMs. We argue that such portability problems make necessary not only the development of platform-neutral synthesis tools and languages, but also cross-platform analysis tools like ours.

The delay results differ for Spartan 6, which was unable to achieve timing closure for a target clock period of 10 ns. The area of the design has an important impact on its performance. High resource usage congests the network and makes it difficult for the router to achieve the timing goals. SMIPS occupies about 30% of the Spartan 6 device, a much higher share than on the other two devices. In a congested device the network delays are high, even if the design fits. In addition, the logic delays of Spartan 6 cells are higher than those of the high-performance Virtex 5 and 7 LUTs. For instance, the delay of a 6-input LUT in a Spartan 6 device can be ∼200 ps, whereas in Virtex 5 it is ∼80 ps and in Virtex 7 ∼40 ps. We can observe that the critical path of the Spartan 6 implementation is caused by the instruction TLB. For Virtex 5, we show two maximum delay marks, one over the CPU and another over the multiply unit. This means that the critical delay starts at the execution stage of the CPU and ends at the multiply unit. The delay report, along with the delay value, also includes the path that caused it and which architectural elements contribute to it. On Virtex 7, the maximum delay is caused by the execution stage of the CPU.

For power consumption, it is seen that similar blocks dominate on all three platforms: the L2 cache, the Execute block and the multiplier. The dominance of the L2 cache comes from it being by far the largest block, thus having the largest static power dissipation. The computationally intensive Execute and Multiply blocks have a lot of logic and see a lot of dynamic activity. Beyond these three blocks we start seeing differences between the platforms. Virtex 5 has the Decode unit at relatively higher power consumption than even the Multiply unit. These differences arise due to the different availability of DSP arithmetic resources in the three FPGAs, the different number of multiplexers generated for large data storage, and the different levels of power and area optimizations implemented on the platforms.

IV. AUTOMATIC ARCHITECTURAL OPTIMIZATION
The architecture defined by the user determines the performance, area and power consumption of the final FPGA circuit. The way the architecture is synthesized, placed and routed can optimize these metrics, but they are always constrained by the architectural decisions. Thus, we believe that significant changes to these results can only be achieved through high-level, architectural decisions. For instance, the results in Figure 4b suggest several modifications at the architectural level: reducing the number of entries of the L2 cache could improve the area and power metrics, and splitting the Execute stage of the pipeline in two would break the combinational path crossing the data TLB and the data cache.

Currently these architectural optimizations are performed by the designer, under the guidance of the reports produced by the synthesis tools. Like the designer, our framework has knowledge of both the architectural design and the synthesis reports. This knowledge enables the tool to implement technology-guided architectural changes.

The quality and impact evaluation methodology that we present in this work relies on two fundamental components. One is describing the hardware architecture using a high-level design language, such as Bluespec or another HLS language. The other is the methodology to project the technology problems onto the architecture, as described in the previous sections. But automatic architectural optimization requires additional components. The framework must distinguish the characteristics of each architectural unit, so that these parameters can be modified to meet the constraints imposed by the technology. For instance, the optimization tool should be able to modify the cache policies or the size of some units. The user should be able to put constraints on those variations, specifying what quality minimums must be preserved when modifying the architecture.
Information about the target technology can complement these optimization inputs, allowing the tool to apply different strategies.

V. GENERALIZATION OF THE METHODOLOGY AND RELATED WORK
We have shown how to implement our methodology for rule-based languages. In this section we discuss the generality of this method, and how it could be applied to HLS languages and tools. Popular HLS tools such as xPilot [4] (later AutoPilot and currently Xilinx Vivado [12]), LegUp [3] and ROCCC [5] convert C programs into synthesizable hardware. These three examples use LLVM as their C front-end. The intermediate representation is then converted into Verilog or VHDL descriptions. The hardware architecture produced by C-based HLS converts blocks of C code, i.e., functions, into Finite State Machines (FSMs). Local and global variables are stored in local (Block RAM) or external (DDR) memories.

The first step of our methodology, the architectural analysis and annotation, requires augmenting the intermediate representation with architectural information. This can be applied to C programs by reusing the popular debugging information that most compilers, including LLVM, support (for instance, DWARF for ELF files).

In contrast to RTL or rule-based models, the FSMs generated by HLS require a variable number of cycles to finish, not directly known at design time. But the FSMs have start and finish signals. In this sense, performance analysis is similar to the implementation previously shown. In both cases the product is an RTL hardware description, where timing delays can be analyzed and tracked back to the original functional blocks. The power analysis is essentially the same, C-based HLS also requiring functional simulation to obtain the execution rates of the architectural blocks.

Regarding related work, Yan et al. [13] presented an estimation model that provides an area-delay tradeoff for chosen applications and FPGA platforms. However, it is aimed primarily at design partitioning of VLIW and coarse-grained reconfigurable architectures, while our work aims at modeling any custom hardware design.
Modeling frameworks like McPAT [14] are able to estimate design metrics for a wide variety of processor configurations and implementation technologies, but are limited to pre-defined architectural parameters and cannot be used on arbitrary designs. Amouri et al. [15] proposed a method to accurately measure and validate the leakage power distribution in FPGA chips using a thermal camera. These extremely accurate results could be used within our methodology for modeling architectural power consumption. Li et al. [9] proposed a fine-grained power model for interconnects and LUTs in an FPGA implementation targeting sub-100 nm technology. However, correlating high-level design blocks to the FPGA power estimates requires additional analysis to keep track of how resources are allocated in each synthesis, placement and routing process, as well as individual activity and trace generation for the various component blocks. Our technique provides this analysis.

VI. CONCLUSION AND FUTURE WORK
The increasing use of high-level hardware design languages is enlarging the gap between the target technology and architectural specifications. Hardware designers require relevant feedback from post-synthesis tools to inform design decisions in an iterative process.
In this paper, we describe a methodology to relate post-synthesis area, delay and power data back to the initial HLS design.
This methodology is a novel approach to architecture characterization. Unlike other techniques, it does not need additional user input to analyze the architecture. Instead, it uses the same hardware description used to synthesize the final circuit. The quality of the characteristics extracted from this circuit is backed by the quality of the FPGA vendor's tools. Such feedback allows the designer to quickly gauge the impact of architectural decisions on the quality of the generated hardware.

At present, the design changes necessitated by the design constraints and the feedback generated using our technique have to be made manually by the user. Automation of design changes requires appropriate granularity in quantifying the impact of changes on area, performance and power metrics.
By satisfying this need, the work presented in this paper can serve as a foundation for automatic architectural optimization.
We will investigate this possibility in the future. To summarize, we have implemented a tool that automates design characterization analysis and shown how it can help to improve the quality of the final hardware and meet the required goals.
For that purpose, we use two designs: a Reed-Solomon error correction decoder and a 32-bit pipelined processor implementation. We implement and characterize these designs on three FPGA platforms: Spartan 6, Virtex 5 and Virtex 7.
We discuss the limitations of the analysis and the impact of the final technology on the design, and we show examples of how the information reported by the tool can help to spot architectural problems.
Finally, this work has a high potential for use in automatic architectural optimization and cross-platform characterization, and could be applied to other HLS design languages and tools.

REFERENCES

[1] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avizienis, J. Wawrzynek, and K. Asanovic, "Chisel: constructing hardware in a Scala embedded language," in DAC.
Proceedings of the 19th International Symposium on Field Programmable Gate Arrays (FPGA). ACM, 2011.
[4] D. Chen, J. Cong, Y. Fan, G. Han, W. Jiang, and Z. Zhang, "xPilot: A Platform-Based Behavioral Synthesis System," SRC TechCon, 2005.
[5] J. Villarreal, A. Park, W. Najjar, and R. Halstead, "Designing modular hardware accelerators in C with ROCCC 2.0." IEEE, 2010.
[6] J. Benson, R. Cofell, C. Frericks, C.-H. Ho, V. Govindaraju, T. Nowatzki, and K. Sankaralingam, "Design, integration and implementation of the DySER hardware accelerator into OpenSPARC," in IEEE 18th International Symposium on High Performance Computer Architecture (HPCA), 2012.
[7] R. Moussalli, W. Najjar, X. Luo, and A. Khan, "A High Throughput No-Stall Golomb-Rice Hardware Decoder," in IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2013.
[8] A. Cornu, S. Derrien, and D. Lavenier, "HLS tools for FPGA: Faster development with better performance," in Reconfigurable Computing: Architectures, Tools and Applications. Springer, 2011, pp. 67–78.
[9] F. Li, Y. Lin, L. He, D. Chen, and J. Cong, "Power Modeling and Characteristics of Field Programmable Gate Arrays," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems.
Proceedings of the 2006 ACM SIGPLAN/SIGBED conference on Language, compilers, and tool support for embedded systems, ser. LCTES '06. New York, NY, USA: ACM, 2006, pp. 182–188.
[14] S. Li, J.-H. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," 2009, pp. 469–480.
[15] A. Amouri, H. Amrouch, T. Ebi, J. Henkel, and M. Tahoori, "Accurate Thermal-Profile Estimation and Validation for FPGA-Mapped Circuits."