AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators
Atefeh Sohrabizadeh*, Cody Hao Yu*, Min Gao, and Jason Cong
* indicates co-first authors for this work
Computer Science Department, University of California, Los Angeles, USA
Falcon Computing Inc., USA
{atefehsz,hyu,cong}@cs.ucla.edu, [email protected]
ABSTRACT
Adopting FPGAs as accelerators in datacenters is becoming mainstream for customized computing, but the fact that FPGAs are hard to program creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator designers still have to manually perform code reconstruction and cumbersome parameter tuning to achieve optimal performance. While many learning models have been leveraged by existing work to automate the design of efficient accelerators, the unpredictability of modern HLS tools becomes a major obstacle for them to maintain high accuracy. In this paper, we address this problem by incorporating an automated DSE framework—
AutoDSE—that leverages a bottleneck-guided gradient optimizer to systematically find a better design point.
AutoDSE finds the bottleneck of the design in each step and focuses on high-impact parameters to overcome it, which is similar to the approach an expert would take. The experimental results show that
AutoDSE is able to find design points that achieve, on the geometric mean, a 19.9× speedup over one CPU core for the MachSuite and Rodinia benchmarks and 1.04× over the manually designed HLS-accelerated vision kernels in the Xilinx Vitis libraries, yet with a 26× reduction in their optimization pragmas. With less than one optimization pragma per design on average, we are making progress towards democratizing customizable computing by enabling software programmers to design efficient FPGA accelerators.

1 INTRODUCTION

Due to the rapid growth of datasets in recent years, the demand for scalable, high-performance computing continues to increase. However, the breakdown of Dennard scaling [13] has made energy efficiency an important concern in datacenters and has spawned exploration into using accelerators such as field-programmable gate arrays (FPGAs) to alleviate power consumption. For example, Microsoft has adopted CPU-FPGA systems in its datacenter to help accelerate the Bing search engine [26]; Amazon introduced the F1 instance [1], a compute instance equipped with FPGA boards, in its commercial Elastic Compute Cloud (EC2). On the other hand, an FPGA is difficult to program compared to a CPU or GPU, since the traditional register-transfer level (RTL) programming model resembles circuit design more than software implementation. To improve programmability, high-level synthesis (HLS) [10, 43] has attracted a large amount of attention over the past decades. Currently, both FPGA vendors have commercial HLS products—Xilinx SDx [36] and the Intel FPGA SDK for OpenCL [18]. In this paper, we target Xilinx FPGAs as an example, but our approach is extendable to Intel FPGAs as they are also supported by the Merlin Compiler [8, 9, 14]. Code 1 shows an intuitive HLS C implementation of one forward path of a Convolutional Neural Network (CNN) on Xilinx FPGAs. Xilinx SDx can generate about 5,800 lines of RTL kernel from the ~
70 lines of code in Code 1 with the same functionality. As a result, it is much more efficient
Code 1: CNN HLS C Code Snippet

// Skip const variable initialization due to page limit
void CnnKernel(const float* input, const float* weight,
               const float* bias, float* output) {
  float C[ParallelOut][ImSize][ImSize];
  for (int i = 0; i < NumOut / ParallelOut; ++i) {
    // Initialization
    for (int h = 0; h < ImSize; ++h) {
      for (int w = 0; w < ImSize; ++w) {
        for (int po = 0; po < ParallelOut; po++)
          C[po][h][w] = bias[(i << shift) + po];
      }
    }
    // Convolution
    for (int j = 0; j < NumIn; ++j) {
      for (int h = 0; h < ImSize; ++h) {
        for (int w = 0; w < ImSize; ++w) {
          for (int po = 0; po < ParallelOut; po++) {
            for (int p = 0; p < kKernel; ++p) {
              for (int q = 0; q < kKernel; ++q)
                C[po][h][w] += weight(i, po, j, p, q) * input(j, h + p, w + q);
            }
          }
        }
      }
    }
    // ReLU + Max pooling
    for (int h = 0; h < OutImSize; ++h) {
      for (int w = 0; w < OutImSize; ++w) {
        for (int po = 0; po < ParallelOut; po++) {
          output(i, h, w) = max(max(0.f,
              max(C[po][h * 2][w * 2], C[po][h * 2 + 1][w * 2])),
              max(C[po][h * 2][w * 2 + 1], C[po][h * 2 + 1][w * 2 + 1]));
        }
      }
    }
  }
}

for designers to evaluate and improve their architectures in HLS C/C++.
Although HLS is suitable for hardware experts to quickly implement a design, it is not friendly to software designers who have limited FPGA domain knowledge. Since the hardware architecture inferred from a syntactic C implementation could be ambiguous, current commercial HLS tools usually generate architecture structures according to specific HLS C/C++ code patterns. As a result, even though Cong et al. [10] illustrated that HLS tools are capable of generating FPGA designs with performance as competitive as RTL, not every C program yields good performance, and designers must manually reconstruct the HLS C/C++ kernel with specific code patterns to achieve high performance.
In fact, the FPGA accelerator generated from Code 1 is 80× slower than a single-thread CPU, even though the optimized code shown in Code 2 is able to achieve around a 7,041× speedup with 28 pragmas after we analyze and resolve several performance bottlenecks listed in Table 2. As a matter of fact, the authors used Code 1 as a lab assignment for an upper-division undergraduate course at their institution. The students were asked to accelerate this application as much as they could using OpenCL, targeting different platforms: CPU, GPU, and FPGA. Note that OpenCL is easier to use for beginner FPGA developers compared to HLS C/C++, since it applies some of the optimizations, such as memory coalescing, by default. Table 1 summarizes the number of students' submissions in six different ranges. The performance numbers are normalized with respect to 75% of the expert design's performance, which was required for the students to get the full grade. As can be seen, although the students could perform well when targeting the CPU and GPU, programming the FPGA was challenging for them. The results suggest that the required code transformation, pragma insertion, and pragma tuning present a significant barrier to a software programmer targeting an FPGA.

Conference'20, Sep 2020, Los Angeles, CA, USA — Sohrabizadeh and Yu, et al.

Code 2: Optimized CNN HLS C Code Snippet

// Skip const variable initialization due to page limit
void CnnKernel(const ap_uint<128>* input, const float* weight,
               const ap_uint<512>* bias, ap_uint<512>* output) {
  // ... (body omitted; see Table 2 for the applied optimizations)

It turns out that the bottlenecks presented in Table 2 occur for most C/C++ programs developed by software programmers, and similar optimizations have to be repeated for each new application, which makes HLS C/C++ design not scalable. A possible solution is to apply automated micro-architecture optimization. Thus, everyone with decent knowledge of programming is able to try
Table 1: Number of Students' Submissions for Acceleration of Code 1 in Each Range. The Performances Are Normalized with Respect to 75% of the Expert Design's Performance (The Required Performance).
Platform   [0-0.2]   (0.2-0.4]   (0.4-0.6]   (0.6-0.8]   (0.8-1]   (1-∞]
CPU        15        7           9           6           14        23
GPU        9         25          14          10          4         11
FPGA       69        3           0           0           0         0

Table 2: Analysis of Poor Performance in Code 1
Reason                                   Required Code Changes for Higher Performance
1. Low bandwidth util.                   Manually apply memory coalescing using the HLS built-in type ap_int.
2. Low bandwidth util.                   Manually allocate a local buffer and use memcpy to enable memory burst.
3. Does not hide communication latency   Manually create load/compute/store functions and double buffering.
4. Lack of parallelism                   Manually create a function to wrap the loop and set proper array partition factors.
5. Sequential execution                  Apply pipelining and unrolling with proper array partition factors, or fuse loops.

customized computing with minimum effort. In order to free accelerator designers from the iterations of HLS design improvement, automated design space exploration (DSE) for HLS has attracted more and more attention. However, recent advances in HLS tools have brought new challenges in designing DSE methods.
Challenge 1: The large solution space:
The various combinations of the pragmas that can be applied to a code make exploring the whole design space an impossible task. In the simplest case, one can either insert a pipeline or unroll pragma, or choose not to insert any pragma at all. If unrolling is applied to a loop, the partition factors of the buffers used inside that loop will determine the parallelization factor of the loop; thus, different types of array partitioning (complete, cyclic, block) and their factors need to be explored. These three pragmas alone can generate a huge design space; in fact, they produce an enormous number of design points for Code 1.

Challenge 2: Non-monotonic effect of design parameters on performance/area:
With the latest HLS tools as well as the larger design spaces enumerated in this paper (see Section 3 for details), we cannot assume that an individual design parameter will affect the performance/area in a smooth and/or monotonic way. For instance, Fig. 1 depicts the execution cycles of the N-W algorithm [24] with different parallel factors for its 5 loops, synthesized by Xilinx SDx [36]. Although the performance trends of 3 of the loops are ideal, the remaining 2 loops (
CG-loop-2 and
FG-loop-1) are not.

Challenge 3: Correlation of different characteristics of a design:
When different pragmas are employed together in a design, they do not affect only one characteristic of the design. One has to take the interaction between them into account when estimating the latency and area consumption of the design. Take the convolution part of Code 1 as an example. Applying fine-grained pipelining to the w loop and parallelizing it with a factor of 2 results in a loop with an initiation interval (II) of 2 when synthesized by Vivado HLS [34]. However, when the parallel factor is changed to 4, the HLS tool increases the II to 3 instead of doubling the resource utilization, in order to optimize resource consumption by reusing some of the logic units. Note that these behaviors may differ from version to version; therefore, it is impractical to maintain an analytical model for DSE. Furthermore, pipelining the j loop is part of the best design configuration. However, it does not improve the performance until after fine-grained pipelining is applied to the w loop. This suggests that the order of applying the pragmas is crucial in designing the exploration technique. (CG and FG mean coarse-grained and fine-grained, respectively.)

Figure 1: HLS Cycles of N-W with Different Factors on Loops (y-axis: execution cycles)

Challenge 4: Implementation disparity of HLS tools:
The implementation of HLS tools is not fixed across different vendors. Furthermore, the HLS tool implementation of the same vendor keeps changing from version to version. For example, past Xilinx SDx versions consistently utilized registers to implement array partitions with small sizes to save BRAMs, but if an array partition is required to support two reads in one cycle to achieve a full pipeline (II = 1), the latest Xilinx SDx marks the partition as true-dual-port and uses dual-port BRAMs to implement it even if the array size is small. Such implementation details are hard to capture and maintain in analytical models. This makes it difficult to port an analytical model built on a specific HLS tool to another.
Challenge 5: Long synthesis time of HLS tools:
A major challenge of using vendor HLS tools directly for DSE is the long evaluation time, since vendor HLS tools usually take 5-30 minutes to generate RTL and estimate the performance—and even longer if the design has high performance. This emphasizes the need for a DSE that can find the Pareto-optimal design points in fewer iterations.
To solve the challenges mentioned above, in this paper, we treat the HLS tool as a black box and, first, apply gradient descent with the finite difference method to guide our explorer. Then, we discuss the deficiency of problem-independent heuristics and the gradient-based approach for the HLS DSE problem and present the
AutoDSE framework, which adapts a bottleneck-guided gradient optimizer to systematically search for better configurations. We show that our bottleneck-based optimizer can outperform the general hyper-heuristics used in the literature. Furthermore, it outperforms the naive gradient-based approach we adapted. It also accelerates the exploration, as it follows the behavior of an expert and focuses on high-impact design parameters first. To represent a grid design space with all invalid points marked, we incorporate a flexible list-comprehension syntax to represent a design space as well as checking rules. In addition, we also partition the design space systematically to address the local-optimum problem caused by non-smooth/non-monotonic performance/area trends. In summary, this paper makes the following contributions:
• We propose two strategies to guide DSE. One adapts naive gradient descent with the finite difference method, and the other exploits a bottleneck-guided gradient optimizer.
• We incorporate list comprehension to represent a smooth, grid design space with all invalid points marked.
• We develop the
AutoDSE framework on top of the Merlin Compiler to automatically perform DSE using the bottleneck optimizer, which follows how an expert optimizes the code to systematically close in on high-QoR design points.
• Evaluation results indicate that
AutoDSE is able to achieve a speedup of 1.04× in 0.3 hours, on the geometric mean, with respect to 33 kernels from the Xilinx optimized vision library [37], yet with a 26× reduction of their optimization pragmas, resulting in less than one required optimization pragma per kernel. (Codes will be open-sourced when the paper is accepted.)
• We evaluate
AutoDSE on 11 computational kernels from the MachSuite [27] and Rodinia [5] benchmarks and one convolution layer of AlexNet [20] on the Amazon EC2 F1 instance [1], showing that we are able to achieve, on the geometric mean, a 19.9× speedup over a single-thread CPU—only a 7% performance gap compared to manual designs.

2 RELATED WORK

There are a number of previous works that propose automated frameworks to explore the HLS design space, and they can be summarized in two categories: model-based and model-free techniques.

2.1 Model-Based Methods
The studies in this category build an analytical model for evaluating the quality of each explored design point by estimating its performance and resource utilization. The authors in [35, 44, 46] build the dependence graph of the target application and utilize graph analysis techniques along with predictive models to search for the best design. Although this approach can quickly search through the whole design space, it is inaccurate, and it is difficult to maintain the model and port it to other HLS vendors or versions, as explained in Challenge 4 of Section 1. Zhong et al. [48] develop a simple analytical model for performance and area estimation. However, their model is based on the assumption that the performance/area changes monotonically when modifying an individual design parameter, which is not a valid assumption, as we explained in Challenge 2 of Section 1. To increase the accuracy of the estimation model, a number of other studies restrict the target applications to those that have a well-defined accelerator micro-architecture template [6, 11, 12, 28, 32, 42], a specific application [39, 45], or a particular computation pattern [7, 19, 25]; hence, they lose generality.
To the same end, there are other studies that build a predictive model by synthesizing a set of sample designs and iteratively updating it until the model reaches the desired accuracy. Then they use the trained model for estimating the quality of a design instead of invoking the HLS tool. To learn the behavior of the HLS tool, these works adopt supervised learning algorithms to better capture the uncertainty of HLS tools [19, 21, 22, 29, 40, 47]. While this technique increases the accuracy of the model, it is still hard to port the model to another HLS tool from a different vendor or version. As a result, for each of them, a new model must be trained.
2.2 Model-Free Methods

To avoid dealing with the uncertainty of HLS tools, the studies in this category treat the HLS tool as a black box. Instead of learning a predictive model, they invoke HLS every time to evaluate the quality of the design. To guide the search, they either exploit general problem-independent heuristics (e.g., simulated annealing [23] and genetic algorithms [30]) or develop their own heuristics [15, 16, 31]. S2FA [41] uses a hyper-heuristic approach with several optimization strategies to reduce the DSE iterations. The authors employ a multi-armed bandit [17] to combine a set of heuristic algorithms including uniform greedy mutation, differential evolution genetic algorithm, particle swarm optimization, and simulated annealing. However, as we will present in Section 5.1.1, general hyper-heuristic approaches are unstable for finding a high quality-of-result (QoR) design configuration. Moreover, the authors in [15, 16] claim that Pareto-optimal design points cluster together. They exploit an initial sampling to build the first approximation of the Pareto frontier and require local searches to explore other candidates. However, the cost of the initial sampling is not scalable when the design space is tremendously large, as the ones we have enumerated in this paper are; even sampling only 1% of such a design space (the lowest sampling rate they use) still means an impractically large number of design points.

3 PROBLEM FORMULATION

Our goal is to expedite hardware design by automating its exploration process. Even though high-level synthesis (HLS) is now widely used to facilitate the FPGA accelerator development cycle, as illustrated in Section 1, specific code patterns are still necessary to let the HLS tools apply certain architecture structures. For instance, although the optimized HLS code in Code 2 can achieve a thousandfold speedup, it has about 3× more lines of code compared to the original code it is modified from.
This implies not only time-consuming code reconstruction efforts but also an impediment to automated design space exploration.
In general, there are two types of pragmas (using Vivado HLS as an example) that are applied to a program. One type is the non-optimization pragmas, such as those shown in Lines 5-6 of Code 2, which specify the sizes of the interface variables and the values of loop bounds (not shown here). These pragmas are relatively easy for software programmers to learn and apply. The other type is optimization pragmas, including the PIPELINE and
UNROLL pragmas. These pragmas require knowledge of FPGA devices and micro-architecture optimization experience, which are usually much more challenging for a software programmer to learn and master. The experiment at the authors' institution supports this claim. As explained in Section 1, in this experiment, students were asked to optimize the performance of Code 1 targeting different platforms. The results are summarized in Table 1. Due to the difficulty, for non-expert FPGA programmers, of choosing the best locations and types of the pragmas and tuning them, the best FPGA submission only achieved a small fraction of the performance of the best design. The goal of this research is to minimize or eliminate the need for optimization pragmas and let AutoDSE insert them automatically. More formally, we formulate the HLS DSE problem as follows:
Problem 1: Identify Design Space.
Given a C program P as the FPGA accelerator kernel, construct a design space R^K_P with K parameters that contains possible combinations of HLS pragmas for P as design configurations.

Problem 2: Find the Optimal Configuration.
Given a C program P, one would like to insert a minimal number of optimization pragmas to get a new program P′ as the FPGA accelerator kernel, along with its design space R^K_{P′} identified as in Problem 1, and a vendor HLS tool H that estimates the execution cycles Cycle(H, P′) and the resource utilization Util(H, P′) of the given P′ as a black-box evaluation function. Find a configuration θ ∈ R^K_{P′} within a given search time limit so that the generated design P′(θ) fits in the FPGA and its execution cycle count is minimized. Formally, we define the problem as:

    min_θ  Cycle(H, P′(θ))                      (1)
    subject to  θ ∈ R^K_{P′}                    (2)
    ∀u ∈ Util(H, P′(θ)),  u < T_u               (3)

where u is the utilization of one of the FPGA on-chip resources and T_u is a user-available resource threshold on the FPGA. We set all T_u to 0.8, an empirical threshold, in our experiments. Beyond 0.8, the design suffers from high clock frequency degradation due to the difficulty in placement and routing. In addition, the rest of the resources are left for the interface logic of the vendor HLS tool.

Code 3: CNN Code Snippet in Merlin C

void CnnKernel(const float input[NumIn][InImSize][InImSize],
               const float weight[NumOut][NumIn][kKernel][kKernel],
               const float bias[NumOut],
               float output[NumOut][OutImSize][OutImSize]) {
  float C[ParallelOut][ImSize][ImSize];
  for (int i = 0; i < NumOut / ParallelOut; i++) {
    // Initialization
    for (int h = 0; h < ImSize; ++h) {
      // ... (remainder of the kernel omitted)

Note that we introduce two optimization objectives: one is to minimize the optimization pragmas inserted to obtain P′, and the other is to maximize the performance of P′ using AutoDSE. Obviously, there is a trade-off between the two. An expert designer can always get an optimized micro-architecture that achieves the best performance by inserting enough HLS optimization pragmas. However, this is time-consuming and not feasible for software programmers with little or no FPGA design experience.
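Under these definitions, evaluating a configuration reduces to a simple comparison: a design is feasible if every resource utilization stays below T_u = 0.8, and among feasible designs, fewer cycles is strictly better. The sketch below illustrates this; the struct layout and function names are our own, not AutoDSE's actual data model:

```cpp
#include <cstdint>
#include <map>
#include <string>

struct EvalResult {
    uint64_t cycles;                     // Cycle(H, P'(theta))
    std::map<std::string, double> util;  // Util(H, P'(theta)) per resource type
};

const double kUtilThreshold = 0.8;  // the paper's empirical T_u

// Eq. (3): every on-chip resource (LUT, FF, BRAM, DSP) must stay under T_u.
bool is_feasible(const EvalResult& r) {
    for (const auto& kv : r.util)
        if (kv.second >= kUtilThreshold) return false;
    return true;
}

// Eq. (1): a feasible point beats an infeasible one; otherwise fewer cycles wins.
bool better_than(const EvalResult& a, const EvalResult& b) {
    if (is_feasible(a) != is_feasible(b)) return is_feasible(a);
    return a.cycles < b.cycles;
}
```

A point such as (90 cycles, 0.9 BRAM) loses to (100 cycles, 0.5 BRAM) under this ordering, since the former violates constraint (3).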
In our evaluation, our goal is to match the performance of well-designed HLS library code (typically written by experts) yet use much fewer optimization pragmas. Indeed, our experimental results in Section 6 show that we can achieve our goal with a 26× pragma reduction on the geometric mean, requiring less than 1 pragma per kernel.

4 THE AUTODSE FRAMEWORK

To reduce the size of the design space, we build our DSE on top of the Merlin Compiler [8, 9]. Section 4.1 reviews the Merlin Compiler and justifies our choice. Then, we present an overview of
AutoDSE in Section 4.2.
4.1 The Merlin Compiler

In order to reduce the design space, we chose to utilize the Merlin Compiler [8, 9], developed by Falcon Computing Solutions [14], as the backend of our tool, as it provides a small set of pragmas to represent optimization strategies from the perspective of architecture design. Table 3 lists the Merlin pragmas with their architecture structures. Note that the fg option in the fine-grained loop pipeline mode refers to the code transformation that tries to apply fine-grained pipelining to a loop nest by fully unrolling all its sub-loops. Based on these user-specified pragmas, the Merlin Compiler performs source-to-source code transformation to apply the corresponding architecture optimization by automatically generating the related HLS pragmas, such as PIPELINE, UNROLL, and
ARRAY_PARTITION, and applying them to the program. Since the number of pragmas required by the Merlin Compiler is much smaller (as it performs source-level code reconstruction and generates most of the required HLS pragmas, such as array partitioning), it defines a more compact design space, so we use it as the compilation tool for DSE [12, 41]. For instance, Code 3 shows the CNN kernel with Merlin pragmas. With only a few lines of pragmas, the Merlin Compiler is able to transform Code 3 into a high-performance HLS kernel with the same performance as Code 2. There are two kinds of code transformations that one needs to employ to get to a high-performance design. The first kind increases data reuse through loop transformations, which is common to CPU performance optimization as well (e.g., for cache locality); therefore, it is well understood by software programmers and we expect them to apply such transformations manually without any problems. The second kind is required to enable architectural optimizations such as memory burst, memory coalescing, and double buffering, as mentioned by reasons 1-3 in Table 2. These transformations are much more difficult for software programmers to learn and apply effectively. Fortunately, the Merlin Compiler takes care of this kind of code transformation. For example, instead of rewriting Code 1 to test whether double buffering would help the performance, as denoted by reason 3 in Table 2, we just need to use the PIPELINE pragma with the cg option.
Furthermore, instead of manually applying code transformations for memory coalescing and memory burst, as denoted by bottlenecks 1 and 2 in Table 2, we can tune the interface and tiling pragmas and the Merlin Compiler will rewrite the code to satisfy these constraints. As a result, our focus in this work is on finding the best location for each of the pragmas and tuning them to enable those architectural optimizations, along with the best pipelining and parallelization attributes, to address reasons 4-5 in Table 2 as well.

Table 3: Merlin Pragmas with Architecture Structures
Keyword    Available Options   Architecture Structure
parallel   factor=<int>        coarse-/fine-grained parallelism (unrolling with array partitioning)
pipeline   mode=cg|fg          coarse-/fine-grained loop pipelining
tile       factor=<int>        loop tiling
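To make the pragma interface concrete, the sketch below annotates a trivial kernel in the Merlin style. The pragma spelling (#pragma ACCEL ...) follows the examples published for the Merlin Compiler, but treat the exact syntax here as illustrative rather than authoritative. A standard C++ compiler ignores unknown pragmas, so the function still runs on a CPU:

```cpp
// Merlin-style annotations on a simple kernel. The pragmas request
// coarse-grained pipelining and a 4-way parallel (unrolled) loop body;
// the Merlin Compiler would expand them into the corresponding HLS
// PIPELINE / UNROLL / ARRAY_PARTITION pragmas.
void vec_scale(const float* in, float* out, int n, float s) {
#pragma ACCEL pipeline
#pragma ACCEL parallel factor=4
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * s;
}
```

Since the annotations do not change the sequential semantics, the same source can be functionally tested on a CPU before being handed to the Merlin Compiler.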
Our solution to Problem 1 is shown in Table 4. We identify the design space for each kernel by analyzing the kernel AST to determine loop trip-counts, available bit-widths, and so on. In addition, since vendor HLS tools usually schedule fine-grained loops well, we only explore the parallel factor of a fine-grained loop when its trip-count is larger than 16; otherwise, we simply apply full unrolling and pipelining to small fine-grained loops to reduce the design space. Moreover, the parallelization factors and tile sizes considered are integer divisors of their respective loop trip-counts. We do not include the interface pragma in our search space, since the best bit-width can be determined by the size of the input and its data type.
Table 4: Design Space Building on Merlin Pragmas
Factor             Design Space (Values)
CG-loop parallel   { u | 1 < u <= TC(L), u · c = TC(L), c ∈ Z }
FG-loop parallel   { u | 1 < u < TC(L), u · c = TC(L), c ∈ Z } if TC(L) > 16; { u | u = TC(L) } otherwise
CG-loop pipeline   { p | p ∈ {off, cg, fg} }
FG-loop pipeline   { p | p = fg }
loop tiling        { t | 1 < t < TC(L), t · c = TC(L), c ∈ Z }

CG: Coarse-grained; FG: Fine-grained; TC: Loop trip-count
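The rules of Table 4 map directly onto candidate-set generation. The sketch below (function names are our own choosing, not AutoDSE's API) enumerates the legal parallel factors for one loop: integer divisors of the trip-count, with the trip-count-16 cutoff for fine-grained loops:

```cpp
#include <vector>

// Parallel-factor candidates for a loop with trip-count tc, per Table 4.
// Coarse-grained loops: any divisor u with 1 < u <= TC(L).
// Fine-grained loops: divisors 1 < u < TC(L) if TC(L) > 16,
//                     otherwise only full unrolling (u = TC(L)).
std::vector<int> parallel_candidates(int tc, bool fine_grained) {
    if (fine_grained && tc <= 16) return {tc};  // fully unroll small FG loops
    std::vector<int> c;
    for (int u = 2; u <= tc; ++u) {
        if (tc % u != 0) continue;              // factors must divide TC(L)
        if (fine_grained && u == tc) continue;  // FG constraint: u < TC(L)
        c.push_back(u);
    }
    return c;
}
```

Each loop then multiplies its parallel-factor set by the pipeline modes and tile sizes of Table 4, which is why a nest of many loops quickly induces an enormous space.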
Now that we have defined the design space in Table 4 for
Problem 1, we focus on
Problem 2 in the remainder of this paper. Although Merlin pragmas alleviate the manual code reconstruction overhead to some extent, a designer still has to manually search for the best option for each pragma, including position, type, and factors. In fact, the choices for the CNN design in Code 3 involve four DRAM buffers and thirteen loops, which result in an enormous number of design configurations. The large design space motivates us to develop an efficient approach to find the best configuration.

4.2 Framework Overview

We develop and implement
AutoDSE, a push-button framework, shown in Fig. 2, based on the strategies explained in Section 5. The framework first automatically builds a design space according to Table 4 by analyzing the kernel AST, using the syntax described in Section 5.2. Then, it profiles and selects representative partitions using K-Means, as mentioned in Section 5.3. For each partition, the
AutoDSE explorer performs DSE using the proposed bottleneck-based gradient strategy of Section 5.1.3. The explorer can be tuned to evaluate the quality of design points based on different targets, such as performance, resource, or the finite difference introduced in Section 5.1.2. When the explorer finishes exploring a partition, it stores the best configuration found in that partition and reallocates the working threads to other partitions to keep the resource utilization high. Finally, when all partitions are finished,
AutoDSE outputs the design configuration with the best QoR among all partitions.
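The final step, picking the best QoR across all finished partitions, is a simple reduction; the sketch below uses an illustrative data layout, not AutoDSE's actual internals:

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Cycle counts of the valid design points one partition's explorer found.
struct PartitionResult {
    std::vector<uint64_t> cycles;
};

// Reduce over all partitions: report the globally best (minimum) cycle count.
uint64_t best_overall(const std::vector<PartitionResult>& parts) {
    uint64_t best = std::numeric_limits<uint64_t>::max();
    for (const auto& p : parts)
        for (uint64_t c : p.cycles)
            if (c < best) best = c;
    return best;
}
```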
Figure 2: The AutoDSE Framework Overview. (The design space generator and partitioner and the profiler with seed generation feed representative design space partitions, via a design-configuration waiting queue, to the explorer running the bottleneck optimization algorithm; the evaluator performs cache-hit checking, code transformation, HLS with vendor tools, bottleneck analysis, and result committing against a result database, producing the C kernel with the optimized design configuration.)
5 DESIGN SPACE EXPLORATION

The general search techniques perform poorly on the HLS DSE problem due to the non-monotonic effect of the pragmas and their correlation with each other, as explained in Challenges 2 and 3 of Section 1. S2FA [41] uses a set of these techniques, such as uniform greedy mutation, differential evolution genetic algorithm, particle swarm optimization, and simulated annealing. To show the deficiency of the common search techniques, we further test the performance of gradient descent. The experimental results in Section 6 demonstrate that our proposed bottleneck-guided gradient optimizer outperforms all of these techniques.
The organization of this section is as follows: we elaborate on our search algorithm in Section 5.1. In Section 5.2, we present an efficient way to represent the design space that helps us rule out infeasible design points. To prevent the search engine from being trapped in local optima, we partition the design space as explained in Section 5.3.
5.1 The Search Algorithm

In this section, we analyze different approaches for exploring the design space and gradually arrive at an effective algorithm to solve the problems mentioned in Section 3. We first examine the efficiency of problem-independent heuristics in Section 5.1.1. Then, in Section 5.1.2, we introduce a new search technique based on gradient descent, a common iterative optimization algorithm, with a finite difference method that systematically finds a better design point in the design space. As we will explain, the problem-independent heuristics and the naive gradient-based approach fail to identify the killer parameters in few iterations. As a result, in Section 5.1.3, we present a bottleneck-guided gradient optimizer that mimics an expert's optimization method and outperforms the aforementioned approaches.

5.1.1 Problem-Independent Heuristics. We illustrate a representative prior work on DSE, which utilized a popular search engine called OpenTuner [2]. OpenTuner leverages the multi-armed bandit (MAB) approach [17] to assemble multiple meta-heuristic algorithms for high generalization. At each iteration, the MAB selects the meta-heuristic with the highest credit and updates the credit of the selected meta-heuristic based on the QoR, which means the meta-heuristics that can efficiently find high-quality design points will be rewarded and activated more frequently by the MAB, and vice versa. Due to its extensibility, OpenTuner has been adapted to perform DSE for design optimization. DATuner [38] introduces an entropy-based partitioning to search for the best parameters for physical design tools with multiple threads. S2FA [41] further applies more strategies to improve OpenTuner's efficiency when performing DSE for HLS. Since the S2FA backend also employs the Merlin Compiler for code transformation, we use its DSE engine to justify the advantage of developing a bottleneck-guided gradient optimizer.
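The MAB layer just described can be sketched as a credit table: pick the meta-heuristic with the highest credit, then move its credit toward the QoR gain it produced. The moving-average update below is a simplification of our own, not OpenTuner's exact credit formula:

```cpp
#include <algorithm>
#include <vector>

struct Bandit {
    std::vector<double> credit;  // one credit score per meta-heuristic ("arm")
    explicit Bandit(int arms) : credit(arms, 1.0) {}

    // Select the meta-heuristic with the highest current credit.
    int select() const {
        return static_cast<int>(
            std::max_element(credit.begin(), credit.end()) - credit.begin());
    }

    // Pull the chosen arm's credit toward the observed QoR gain, so
    // effective heuristics are activated more often and poor ones decay.
    void reward(int arm, double qor_gain, double alpha = 0.5) {
        credit[arm] = (1 - alpha) * credit[arm] + alpha * qor_gain;
    }
};
```

An arm that repeatedly reports low gains sees its credit shrink toward those gains and stops being selected, which is the "vice versa" behavior described above.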
Figure 3: Speedup Over the Manual Design Using S2FA [41] (y-axis: normalized speedup; benchmarks: fft-strided, bfs-queue, stencil, kmp, spmv, aes, bfs-bulk, nw, fft-trans, gemm)
We use S2FA to perform the DSE for 24 hours by turning off its early-stopping criteria and depict the hourly speedup of our benchmark cases over the corresponding manual designs in Fig. 3. The black dot indicates the time at which S2FA finds the overall best design point. We can see that S2FA requires on average 16.8 hours to find the best solution. We further analyze the exploration process and find that most designs have an obvious performance bottleneck (e.g., effective external memory bandwidth, insufficient parallel factors, etc.), which usually dominates more than half of the overall execution cycles and is controlled by only one or two design parameters. In this situation, the performance gain of tuning other parameters is often very limited and hard for the general learning algorithms to attribute. The learning algorithm needs many iterations to identify the key parameter and tune it to resolve the performance bottleneck. After that, it has to spend a large number of iterations again to find the next key parameter. This phenomenon motivates us to develop a new search algorithm that is guaranteed to optimize the killer parameter prior to others.
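This observation, that one or two parameters control the dominant cycle contributor, suggests ordering the search by each code region's share of the total cycle count. A minimal sketch of that ordering (region and parameter names are hypothetical):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct Region {
    std::string name;
    uint64_t cycles;                  // cycles attributed to this code region
    std::vector<std::string> params;  // design parameters affecting it
};

// Return the parameters of the region dominating the runtime: tune these
// first, since gains elsewhere are bounded by the bottleneck's cycle share.
// Assumes a non-empty region list.
std::vector<std::string> bottleneck_params(const std::vector<Region>& regions) {
    auto it = std::max_element(regions.begin(), regions.end(),
        [](const Region& a, const Region& b) { return a.cycles < b.cycles; });
    return it->params;
}
```

After the dominant region is resolved, its cycle share drops, another region becomes the maximum, and the same selection yields the next key parameter, mirroring how an expert iterates.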
Gradient descent is a well-known iterative optimization algorithm for finding a local minimum of a differentiable objective function. It has also been successfully applied to large-scale non-linear physical design problems with a smooth analytical approximation, such as multi-level circuit placement [4]. Formally, gradient descent is employed to find a configuration θ with the minimal objective value J(θ) in a solution space R^K:

argmin_{θ_i ∈ R^K} J(θ_i)    (4)

To achieve this goal, we start from an initial configuration θ_0 and iteratively update the configuration by following the steepest descent, the negative gradient −∇J:

θ_{i+1} = θ_i − α ∇J(θ_i)    (5)

where α is the step size. The gradient descent approach requires the objective function to be differentiable in order to find the next steepest descent. This limitation makes it impractical in many real-world applications, as the system may be too complicated to be modeled as a partially observable Markov decision problem. To avoid the potential problems of modeling HLS tools, we leverage the finite difference method to approximate the gradient value by treating the HLS tool as a black box. That is, given a candidate configuration θ_j deviated from the current configuration θ_i, we use the finite difference method to approximate the gradient as follows:

g(θ_j, θ_i) ≈ [Cycle(H, P(θ_j)) − Cycle(H, P(θ_i))] / [Util(H, P(θ_j)) − Util(H, P(θ_i))]    (6)

Note that Eq. 6 considers not only the performance gain but also resource efficiency, so it reduces the possibility of being trapped in a local optimum. For example, we may reduce the execution cycles by 10% at the cost of 30% more area by increasing the parallel factor of a loop (configuration θ_1); we may also reduce the execution cycles by 5% at the cost of 10% more area by enlarging the bit-width of a certain buffer (configuration θ_2).
Although θ_1 seems better in terms of execution cycles, it may more easily be trapped at a local optimal point because relatively limited resources are left for further improvement. The finite difference values for the two configurations are

g(θ_1, θ_0) = −0.10 / 0.30 ≈ −0.33,    g(θ_2, θ_0) = −0.05 / 0.10 = −0.5,

so the system prioritizes the second configuration for better long-term performance.

Since the finite difference method selects the best candidate as the next configuration, we need to generate a set of candidates, Θ_cand, at each iteration. Specifically, we generate candidates by advancing the value of each parameter in the current configuration by one step. Formally, the c-th candidate generated from θ_i is:

θ_c = [p_1, p_2, ..., p_c + 1, ..., p_K]    (7)

where p_c is the value of the c-th parameter in θ_i and p_c + 1 denotes advancing it by one step. Accordingly, we generate K candidates at each iteration, which means we run HLS K times to determine the next configuration:

θ_{i+1} = argmin_{θ_j ∈ Θ_cand} g(θ_j, θ_i)    (8)

By leveraging gradient descent with a finite difference method, we expect to find a better design point every K HLS runs. Unfortunately, as we have illustrated in Fig. 1, the performance trend is not always smooth, so the gradient process can easily be trapped by a low-quality local optimal design point. Taking Fig. 1 again as an example, the gradient approach will stop at factor 2 for
FG-loop-1 because factor 3 has worse performance but consumes more resources. In fact, the gradient approach proposed in this section achieves only a 2.8× speedup on the geometric mean of our MachSuite [27] and Rodinia [5] benchmarks, which is even worse than the results from the problem-independent heuristics reviewed in Section 5.1.1.

Moreover, the efficiency of using the gradient-based approach for DSE is limited by the process of approximating the gradient value. To approximate the gradient value at each iteration, we need to evaluate K design points, where K is the total number of tuning parameters, to determine the next step. On the other hand, in most cases only a few of the K tuning parameters have a high impact on the performance, so we should evaluate only the K′ impactful parameters at each iteration, where K′ < K. For instance, the design space generator will instrument Code 1 with 27 pragmas based on the rules explained in Section 5.2, and the gradient-based approach proposed in this section needs to assess the quality of 27 new designs in each iteration. However, in the early iterations the convolution part takes more than 90% of the total cycle count of the kernel. As a result, changing the pragmas outside of this part has an insignificant effect on the performance, and it is wasteful to explore them at this stage.

Identifying the K′ parameters is not straightforward. Although the HLS report may provide the cycle breakdown for the loop and function statements, it is hard to map them to tuning parameters due to the application of several code transformations. Fortunately, the Merlin Compiler [8] includes a feature that performs back propagation.
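Before moving on, the naive finite-difference procedure of Eqs. 6–8 can be sketched as follows. This is a toy illustration, not AutoDSE's implementation: `evaluate` is a fabricated analytical cycle/utilization model standing in for an actual HLS run, and the parameter names are assumptions.

```python
def evaluate(theta):
    """Toy stand-in for running HLS on a configuration `theta` and
    reading back cycles and resource utilization (%) from the report."""
    cycles = 1000.0 / (theta["PARALLEL"] * theta["TILING"])
    util = 5 * theta["PARALLEL"] + 2 * theta["TILING"]
    return cycles, util

def finite_difference(theta_j, theta_i):
    """g(theta_j, theta_i) per Eq. 6: delta-cycles over delta-utilization."""
    cyc_j, util_j = evaluate(theta_j)
    cyc_i, util_i = evaluate(theta_i)
    if util_j == util_i:  # guard against division by zero
        return float("inf")
    return (cyc_j - cyc_i) / (util_j - util_i)

def step(theta, options):
    """One iteration: advance each parameter by one step (Eq. 7),
    evaluate all K candidates, and take the argmin of g (Eq. 8)."""
    candidates = []
    for param, opts in options.items():
        idx = opts.index(theta[param])
        if idx + 1 < len(opts):
            cand = dict(theta)
            cand[param] = opts[idx + 1]
            candidates.append(cand)
    if not candidates:
        return theta
    return min(candidates, key=lambda c: finite_difference(c, theta))

options = {"PARALLEL": [1, 2, 4, 8], "TILING": [1, 2, 4]}
theta = {"PARALLEL": 1, "TILING": 1}
theta = step(theta, options)
print(theta)  # the candidate with the steepest cycle gain per unit of area
```

Note that every call to `step` costs K evaluations, which is exactly the inefficiency discussed above when only a few of the K parameters matter.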
The back-propagation feature transmits the performance breakdown reported by the HLS tool to the user input code, allowing us to identify the performance bottleneck by traversing the Merlin report and mapping the bottleneck statement to one or a few tuning parameters, as presented next.

From our experience with the naive gradient-based DSE approach proposed in Section 5.1.2 and the general learning algorithms or heuristics discussed in Section 5.1.1, we see the following inefficiencies when comparing their behavior with that of human design experts:
(1) These approaches have to evaluate many design points to identify the performance bottleneck. An expert can directly acquire this information by analyzing the cycle breakdown.
(2) These approaches have no knowledge of the parameters, so they have no way to prioritize important ones. An expert, on the other hand, may know which parameter has a high potential of being the killer parameter.
(3) These approaches may stop exploring the options of a parameter due to a local optimum. An expert may know whether the other options are worthwhile to explore.
The first two inefficiencies can be resolved by leveraging bottleneck analysis. We first build a map from the loop or function statements in the user input code to design parameters so that we know which parameters to focus on for a particular statement. To identify the critical path and its type, we start with the kernel top function statement. We first check whether the current statement has child loop statements. For function call statements, we dive into the function implementation to further check its child statements. Then, we traverse each of them and create hierarchy paths. Note that since we sort all loop statements according to their latency by checking the Merlin report, the hierarchy paths we create will also be sorted by their latency.
Subsequently, we check the Merlin report again to determine whether the performance bottleneck of the current statement is memory transfer or computation. The Merlin Compiler obtains this information by analyzing the transformed kernel code along with the HLS report. A cycle is considered to be a memory transfer cycle if it is consumed by communicating with off-chip memory. Finally, we append the current statement to the end of each path and return an ordered list of paths. As a result, we can not only figure out the performance bottleneck of each design point but also identify a small set of effective design parameters to focus on, which significantly improves the efficiency of our search algorithm.

Once we obtain an ordered list of critical hierarchy paths from the bottleneck analysis, we start from the most critical innermost loop statement and identify its corresponding parameters. For instance, the convolution part of Code 1 takes 98% of the cycle count; hence, the parameters applied to this section of code make the top of the list, from the innermost loop to the outermost one. Note that since the bottleneck analysis also provides the bottleneck type information (i.e., memory transfer or computation), we may identify a subset of the parameters mapped to that statement. For example, we may have design parameters of
PARALLEL and
TILING at the same loop level. When the bottleneck type of the loop is memory transfer, we focus on the
TILING parameter for the loop; otherwise, we focus on the
PARALLEL parameter. In other words, we reduce the number of candidate design parameters not only by the bottleneck statement but also by the bottleneck type.
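As a hedged sketch (not AutoDSE's actual API), the type-based filtering might look like the following; parameter names such as `PARALLEL_L2` are hypothetical labels for pragmas attached to one loop level:

```python
def focused_params(loop_params, bottleneck_type):
    """Keep only the parameters at this loop level that are relevant
    to the bottleneck type reported by the analysis."""
    wanted = "TILING" if bottleneck_type == "memory_transfer" else "PARALLEL"
    return [p for p in loop_params if p.startswith(wanted)]

print(focused_params(["PARALLEL_L2", "TILING_L2"], "memory_transfer"))
print(focused_params(["PARALLEL_L2", "TILING_L2"], "computation"))
```

The filter runs before any HLS invocation, so ruling out half the parameters at a loop level directly halves the candidates evaluated per iteration.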
Table 5: Performance and Resource Utilization Compared to the Base Design When Parameters of Line 18 in Code 1 Change
Optimization     Status         Perf   BRAM   LUT    DSP    FF
Pi-fg            PASS (24 min)  175x   +7%    +23%   +24%   +15%
PF=4             TIMEOUT        -      -      -      -      -
Pi-fg + PF=4     PASS (28 min)  218x   +17%   +44%   +33%   +25%
(Pi: Pipeline, PF: Parallel Factor, fg: fine-grained)

It often happens that more than one design parameter can be applied for each bottleneck type. When the bottleneck of a loop statement is determined to be its computation, one can generally apply fine-grained or coarse-grained pipelining/parallelization. When this case happens, we utilize a predefined priority for testing the parameters. We choose the order of applying the pragmas for compute-bound loops to be
PIPELINE mode fg, PARALLEL, and PIPELINE mode cg, which is a greedy approach to improving the performance by utilizing more parallelization units. Measuring the quality of design points with the finite difference (gradient) value helps AutoDSE avoid over-utilizing the FPGA: when, for a configuration, the achieved speedup is not commensurate with the loss of available resources, the quality of the design decreases; hence,
AutoDSE will turn that pragma off. Instead, the resources are left for applying a design parameter with higher impact. Moreover, as mentioned in Challenge 3 of Section 1, the order of applying the pragmas is crucial for reaching the best design. Since HLS tools schedule fine-grained pipelining/parallelization better than coarse-grained, our experiments show that evaluating the fine-grained options first helps
AutoDSE reach the best design point in fewer iterations. Table 5 shows how the performance and resource utilization change compared to the base design, in which all the pragmas are off, when the
PIPELINE mode fg and PARALLEL pragmas are applied on line 18 in Code 1. The time limit for running the HLS tool is set to 60 minutes. The results suggest that to reach the optimal configuration for this loop, we must first apply the fine-grained pipelining. This way, the HLS tool can better schedule the loop when parallelization is further applied, and its synthesis finishes in 28 minutes. Note that we do not prune the other design parameters. We only change the order in which the parameters are explored, as these rules cannot be generalized to all cases due to the unpredictability of the HLS tools. If the bottleneck of a design point is memory transfer, AutoDSE prioritizes the
PIPELINE mode cg pragma over TILING. The Merlin Compiler, by default, caches the data; the former further overlaps the communication time with computation by applying double buffering, while the latter can be used to change the size of the cached data.

We define level n as a design where we have fixed the values of n parameters, so the maximum level in our algorithm is equal to the total number of parameters. Each design point is represented using a data structure that includes the quality of the design measured by the finite difference value introduced in Eq. 6, the focused parameters, the fixed parameters, and the configurations of the parameters, along with a stack containing its unexplored children. Each level has a heap of the pending design points that can be further explored. Since new design points are sorted by their quality values when they are pushed into the heap, a design point with a better quality value will be explored prior to other points. Moreover, when a new design point is passed through the bottleneck analyzer, it generates new focused parameters in order of importance, stored in a stack; hence, by popping the stack, we get to work with the design point with the most promising impact. At each iteration of the algorithm, AutoDSE gets the heap with the highest level. Then, it peeks at the first node of the heap and pulls its stack of unexplored children. The new candidate is picked by popping the stack and is passed to the bottleneck analyzer to generate a new set of focused parameters as its children. Then, it is pushed to the heap of the next level. If a design point does not generate any focused parameters or its stack of unexplored children is empty, it is popped out of the heap.
The algorithm continues either until all the heaps are empty or until the DSE reaches a runtime threshold.

For the third inefficiency mentioned at the beginning of this section, we cannot identify whether the current option of a parameter is a local or global optimum, so the most promising solution is to break the dependency between options and search a set of them in parallel, as explained in Section 5.3. In this way, although we still need to evaluate multiple design points at every iteration, we guarantee that each design point provides the maximum information for improving the performance, because we always evaluate the options of the parameter that has the largest impact on the performance bottleneck.
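The level-by-level exploration described above can be sketched as follows. The `Point` structure, quality values, and child stacks are toy stand-ins for AutoDSE's design-point records, so this is an illustration of the control flow, not the actual implementation:

```python
import heapq
from itertools import count

_tiebreak = count()  # stable ordering for heap entries with equal quality

class Point:
    def __init__(self, quality, children):
        self.quality = quality    # finite-difference quality (lower = better)
        self.children = children  # stack: most promising child last

def explore(root, max_steps=10):
    # One heap of pending design points per level (level n = n fixed params).
    heaps = {0: [(root.quality, next(_tiebreak), root)]}
    visited = []
    for _ in range(max_steps):
        if not any(heaps.values()):
            break                                       # all heaps empty: done
        level = max(l for l, h in heaps.items() if h)   # highest non-empty level
        quality, tb, point = heaps[level][0]            # peek the best point
        if not point.children:
            heapq.heappop(heaps[level])                 # fully expanded: drop it
            continue
        child = point.children.pop()                    # most promising child
        visited.append(child.quality)
        heaps.setdefault(level + 1, [])
        heapq.heappush(heaps[level + 1],
                       (child.quality, next(_tiebreak), child))
    return visited

leaf_a = Point(-0.5, [])
leaf_b = Point(-0.33, [])
root = Point(0.0, [leaf_b, leaf_a])  # leaf_a (better quality) on top of the stack
print(explore(root))                 # prints [-0.5, -0.33]
```

The heap gives priority across sibling design points at a level, while each point's stack preserves the importance order produced by the bottleneck analyzer, matching the two orderings described in the text.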
One solution to facilitating the bottleneck-based optimizer is to reduce ineffective parameters. Intuitively, we can build a grid design space from Merlin pragmas by treating each Merlin pragma as a tuning parameter and searching for the best combination. However, many points in this grid space may be infeasible. For example, if we have decided to perform coarse-grained pipelining at the outermost loop of a loop nest, the Merlin Compiler will apply double-buffering to the loop. In this case, the physical meaning of double-buffering at the outermost loop is to transfer a batch of data from DRAM to BRAM, which cannot be further parallelized. As a result, pipeline and parallel pragmas are mutually exclusive in a loop nest. In this section, we propose an efficient approach to create a design space that preserves the grid design space but invalidates infeasible combinations.
Figure 4: Proposed Design Space Representation and Its Impact on DSE
Fig. 4 illustrates the goal of an efficient design space representation. In this example, we attempt to explore the best parameters for loop j of Code 1, with pragmas P1 and P2 denoting the PIPELINE and PARALLEL pragmas, respectively. When the current configuration is (P1, P2) = (cg, 1), we only have two candidates to be explored in the next step, because any configuration (P1, P2) = (cg, x) with x > 1 is invalid. This representation is exploration-friendly and makes it easy to enforce rules on the infeasible points.

To represent a grid design space with invalid points, we introduce a Python list comprehension syntax to AutoDSE. The
Python list comprehension is a concise approach for creating lists with conditions. It has the following syntax:

list_name = [expression for item in list if condition]
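As an illustration of how such a conditional design space could be encoded and evaluated, consider the sketch below. The schema, the `TC` value, and the option lists are assumptions for this example, not AutoDSE's exact format; the key idea is that one parameter's option expression may refer to another parameter, so infeasible combinations disappear when the expression is evaluated:

```python
TC = 32  # loop trip count, assumed for illustration

design_space = {
    "P1": "['off', 'cg', 'flatten']",
    "P2": "[pf for pf in [1, 2, 4, 8, 16, 32]"
          " if pf <= TC and (P1 != 'cg' or pf == 1)]",
}

def options_of(param, assigned):
    """Evaluate one parameter's option list under a partial assignment,
    leveraging the Python interpreter itself."""
    return eval(design_space[param], {"TC": TC, **assigned})

print(options_of("P2", {"P1": "off"}))  # every parallel factor is legal
print(options_of("P2", {"P1": "cg"}))   # cg pipeline rules out parallelization
```

Because the conditions are evaluated by the interpreter, adding a new mutual-exclusion rule is just editing one expression rather than changing the search engine.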
Formally, we define the design space representation for Merlin pragmas with list comprehensions as follows:
For our example, the design space can be represented using list comprehensions as follows:

// Skip the rest due to page limit

where line 5 indicates that the two pragmas are exclusive. In other words, when we set P1 = cg, the available option for P2 is only 1. Moreover, the Python list comprehension is general: it provides a friendly and comprehensive interface for higher layers, such as polyhedral analysis [49] and domain-specific languages, to generate an effective design space in the future. Finally, the syntax of this representation is
Python compatible. This means we can leverage the
Python interpreter to evaluate the design space and improve the overall stability of the DSE framework. The Design Space Generator, depicted in Fig. 2, analyzes the kernel AST and extracts the information required for starting the DSE, such as the loops in the design, their trip counts, and the available bit-widths. Artisan [33] adopts a similar approach for analyzing the code; however, it only considers the unroll pragma in code instrumentation. Our approach, on the other hand, considers a wider set of pragmas, as mentioned in Table 3, and employs the following rules to prune the design space:
• Ignore fine-grained loops with a trip count (TC) of less than or equal to 16, as the HLS tool can schedule these loops well.
• The allowed parallel factors for a loop are all sub-divisors of the loop TC up to min(128, TC), plus the TC itself. A parallel factor larger than 128 causes the HLS tool to run for a long time and usually does not result in good performance.
• For each loop, we should have
TF × PF < TC, where TF and PF are the tiling factor and parallel factor, respectively.
• When pipeline mode fg is applied on a loop (PIPELINE flatten), no other pragma is allowed for the inner loops.
• A parallel pragma is invalid for a loop nest when pipeline mode cg is applied on that loop.
• A tiling pragma is added only to loops with an inner loop.
According to the evaluation results, our pruning rules are able to reduce the design space by 24.65× on average while still achieving a 1.3× speedup on the geometric mean for our MachSuite and Rodinia benchmarks.

One solution to the local optimum issue caused by the non-smooth performance gain is partitioning the design space based on the likely distribution of local optimal points and exploring each partition independently. Intuitively, we could partition the design space according to a range of values of every parameter in a design, but this may generate thousands of partitions and result in a long exploration time. Thus, we only partition the design space based on the pipeline mode, as pipeline mode fg unrolls all sub-loops to achieve fine-grained pipelining while mode cg exploits double buffers to implement coarse-grained pipelining. These two modes have the most significantly different influence on the generated architecture and are expected to have unrelated performance and resource utilization. According to the pipeline modes in each loop, we use tree partitioning and generate 2^N partitions from a design space with N non-innermost loops. Supposing we use t working threads to perform, at most, h hours of DSE for the 2^N design space partitions, we need (2^N / t) × h hours to finish the entire process. On the other hand, some partitions that differ only in an insignificant pipeline pragma may have similar performance, so it is more efficient to explore only one of them.
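A toy sketch of that selection step follows. The profile numbers are fabricated for illustration, and a hand-rolled Lloyd's K-means on (performance, area) points stands in for whatever clustering implementation is actually used:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's K-means over 2-D (performance, area) points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                            + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[i]                 # keep old center if empty
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Fabricated (performance, area) profiles of four partitions, in two groups.
profiles = [(10.0, 0.2), (11.0, 0.21), (50.0, 0.8), (52.0, 0.82)]
t = 2  # number of worker threads = number of representatives to keep
centers, clusters = kmeans(profiles, t)
representatives = [cl[0] for cl in clusters if cl]  # one partition per cluster
print(representatives)
```

Each representative then gets one of the t working threads, so the exploration cost stays bounded by t partitions rather than all 2^N of them.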
As a result, we profile each partition by running HLS with minimized parameter values to obtain the minimum area and performance, and use K-means clustering with performance and area as features to identify t representative partitions among all 2^N partitions. With this strategy, we are able to achieve a further 2× speedup on the geometric mean by exploring, at most, t partitions in h hours.

Our evaluation is performed on Amazon Elastic Compute Cloud (EC2) [1]. We use an r4.4xlarge instance with 16 cores and 122 GiB of memory to perform the DSE and generate accelerator bit-streams. The generated FPGA accelerators are evaluated on an F1 instance (f1.2xlarge) with a Xilinx Virtex UltraScale+ VU9P FPGA. In addition, our benchmarks are selected from the MachSuite [27] benchmark suite, the FPGA-friendly Rodinia [5] benchmark, and one convolution layer of AlexNet [20]. For several common kernels, MachSuite provides C implementations that are programmed without consideration of FPGA acceleration, which makes them a natural fit for demonstrating our approach. Furthermore, we evaluate the performance of
AutoDSE on vision kernels of the Xilinx Vitis libraries [37] that are optimized for Xilinx FPGAs, based on the OpenCV library [3].
MachSuite [27] and Rodinia [5] Benchmarks. We first evaluate the gradient descent with a finite difference method and the proposed optimization strategies. The 1st to 3rd bars of each case in Fig. 5 show the speedup they gain with respect to the CPU. Note that the chart is in logarithmic scale. The list-based design space representation keeps the search space smooth by invalidating infeasible combinations. As a result, we can investigate more design points in a fixed amount of time. This helps AES, NW, KMP, PATHFINDER, KMEANS, and
KNN. Design space partitioning benefits designs with many loop nests, in which the gradient process is easily trapped by a local optimum when changing pipeline modes, such as
AES, GEMM, NW, STENCIL-2D, and
STENCIL-3D. The 4th bar shows the speedup of AutoDSE when the bottleneck-guided gradient optimizer described in Section 5.1.3 is adopted along with the design space representation introduced in Section 5.2 and the design space partitioning explained in Section 5.3. With this setup,
AutoDSE further improves the result by 5.5× on the geometric mean. As a result, AutoDSE is able to achieve a speedup of 19.9× over the CPU and reach 0.93× the performance of the manual designs in only 1.1 hours on the geometric mean.

Table 6: Speedup of Our Approach, S2FA [41], Lattice-Traversing DSE [16], and Manual Design Over an Intel Xeon CPU Core
Approach       AES      NW       GEMM   KMP   SPMV   STENCIL-3D
Lattice [16]   2319.9   536.4    2.2    -     -      -
S2FA [41]      7.4      3387.4   10.7   2.9
Ours           3774.7
Manual
We further evaluate the overall performance of the accelerator designs generated by our bottleneck-guided gradient optimizer, S2FA [41], lattice-traversing DSE [16], and manual design over the CPU in Table 6. Note that the performance of S2FA and lattice-traversing DSE is not reported by the authors for all of the kernels we are testing. The manual designs are optimized with the Merlin Compiler pragmas without changing the source programs, to illustrate the optimality of our DSE process. According to Table 6, using the bottleneck approach, we outperform S2FA and lattice-traversing DSE by 3.6× and 4.3×, respectively, on the geometric mean. As we discussed in Section 5.1.1, the deficiency of S2FA stems from the fact that it is hard for a problem-independent learning algorithm to find the killer parameters. Lattice-traversing DSE needs an initial sampling step to learn the design space, which takes a long time for our benchmarks due to the size of the design space, even though the authors only consider loop unrolling and function inlining. This constraint makes it hard for the tool to start the exploration process before the DSE time limit is met. AutoDSE, however, is able to find a high-performance design in a few iterations. Fig. 7 depicts the
AutoDSE process for four cases in which it significantly outperforms the other methods. This shows that the bottleneck-guided gradient optimizer can rapidly achieve high performance. The reason that
AutoDSE does not exactly match the performance of the manual designs in all cases is that when the kernels contain many unbounded loops or while-loops, the HLS report may not reflect the accurate computation cycles. This affects the bottleneck type analysis of the Merlin report; consequently, our search algorithm will focus on unimportant design points. In the future, we will study the Merlin report analysis to avoid situations where HLS may produce an inaccurate report. Another advantage of
AutoDSE compared to other DSE tools, such as lattice-traversing DSE, is that its backend is based on the Merlin Compiler. This way, the tool can exploit the automatic code transformations for applying common optimization techniques such as memory burst, memory coalescing, and double buffering, and focus only on high-level hardware changes.
Xilinx Vision Library [37]. To further evaluate the performance of
AutoDSE, we use 33 vision kernels from the Xilinx Vitis Library. These kernels utilize 14 optimization pragmas on the geometric mean, which include
UNROLL, PIPELINE, ARRAY_PARTITION, DEPENDENCE, LOOP_FLATTEN, INLINE, DATAFLOW, and
STREAM. For each kernel, we remove the pragmas we search for, along with the one that the Merlin Compiler can infer (
INLINE), and pass it to
AutoDSE. The pragmas that we remove for these kernels include
UNROLL, PIPELINE, ARRAY_PARTITION, DEPENDENCE, LOOP_FLATTEN, and
INLINE, which are used 13.5 times on the geometric mean.

Figure 5: Speedup of the Proposed Approach Over an Intel Xeon CPU Core
Figure 6: Speedup and Number of Reduced Pragmas Using AutoDSE Compared to Vision Kernels of Xilinx Vitis Libraries [37]
Figure 7: Performance of Generated Designs by AutoDSE Over Time

The pragmas we keep include LOOP_TRIPCOUNT, INTERFACE, DATAFLOW, and
STREAM. Note that the Merlin Compiler can also work directly with the pragmas from the HLS vendor tool; however, it does not apply any code transformation when those pragmas are utilized. The
INTERFACE and
LOOP_TRIPCOUNT pragmas are there to specify the connection to the AXI bus and the range of the loop trip count, respectively; therefore, they cannot be removed. In addition, since our search space is built on top of the Merlin Compiler, we do not search for the
DATAFLOW and
STREAM pragmas, as they are not among the Merlin-specified pragmas. Nevertheless, we require less than one optimization pragma per kernel on the geometric mean. In the future, we will expand our search engine to HLS pragmas that are not included in Merlin. Fig. 6 depicts the performance comparison of the design points
AutoDSE generated with respect to the Xilinx results, along with the number of pragmas that we removed. Fig. 6 suggests that
AutoDSE is able to achieve a speedup of 1.04× with a 26× reduction of their optimization pragmas in 0.3 hours, on the geometric mean, with respect to the Xilinx optimized kernels, proving the effectiveness of our bottleneck-based approach and the fact that it can mimic the method an expert would take. For the cases where AutoDSE does not exactly match the performance of Vitis,
AutoDSE still finds the best combination of the pragmas. The gap lies in the different II that Merlin achieves. For example, the histEqualize, histogram, and otsuthreshold kernels all have a loop for which Vivado HLS achieves an II=3 when used with the PIPELINE pragma. However, if II=2 is added to the
PIPELINE pragma, Vivado HLS can achieve II=2; but it is not possible to change the II using the Merlin Compiler. On the other hand,
AutoDSE is able to significantly outperform the customConv and reduce kernels by better detecting the choices and locations for pipelining and parallelization.
In this paper, we analyze the difficulty of exploring the HLS design space and demonstrate the inefficiency of the hyper-heuristics approach. Based on our analysis and observations, we propose a bottleneck-guided gradient optimizer to systematically approach a better solution. To eliminate meaningless design points, we propose a list comprehension-based design space representation that prunes 24.65× ineffective configurations on average while keeping the design space smooth. We further employ a partitioning strategy to address the local optimum problem. We finally implement a push-button framework, AutoDSE, based on the bottleneck optimizer. The evaluation results show that the performance of the designs generated by the
AutoDSE framework matches the corresponding manual designs, achieving on the geometric mean a 19.9× speedup over one CPU core for the MachSuite and Rodinia benchmarks and 1.04× over the accelerated vision kernels of the Xilinx Vitis libraries with a 26× reduction of their optimization pragmas. The experimental results suggest that AutoDSE lets anyone with a decent knowledge of programming try customized computing with minimum effort, which is our goal in democratizing customizable computing. In the future, we plan to include more transformations (design space parameters) for optimizing data access and reuse patterns.
ACKNOWLEDGEMENTS
The authors would like to thank Dr. Peichen Pan for his invaluable support with the Merlin Compiler and Dr. Lorenzo Ferretti for helping with the comparison to his work. This work is supported by the ICN-WEN award jointly funded by the NSF (CNS-1719403) and Intel (34627365), and by the CDSC industrial partners (https://cdsc.ucla.edu/partners/).

REFERENCES
[1] Amazon EC2 F1 Instance. https://aws.amazon.com/ec2/instance-types/f1/, 2020.
[2] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman Amarasinghe. OpenTuner: An extensible framework for program autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, pages 303–316, 2014.
[3] Gary Bradski. The OpenCV library.
Dr. Dobb's Journal of Software Tools, 25:120–125, 2000.
[4] Tony Chan, Jason Cong, and Kenton Sze. Multilevel generalized force-directed method for circuit placement. In
Proceedings of the 2005 International Symposium on Physical Design, pages 185–192, 2005.
[5] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In , pages 44–54. IEEE, 2009.
[6] Yuze Chi, Jason Cong, Peng Wei, and Peipei Zhou. SODA: Stencil with optimized dataflow architecture. In , pages 1–8. IEEE, 2018.
[7] Young-kyu Choi and Jason Cong. HLS-based optimization and design space exploration for applications with variable loop bounds. In , pages 1–8. IEEE, 2018.
[8] Jason Cong, Muhuan Huang, Peichen Pan, Yuxin Wang, and Peng Zhang. Source-to-source optimization for HLS. In
FPGAs for Software Programmers, pages 137–163. Springer, 2016.
[9] Jason Cong, Muhuan Huang, Peichen Pan, Di Wu, and Peng Zhang. Software infrastructure for enabling FPGA-based accelerations in data centers. In
Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pages 154–155, 2016.
[10] Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru Zhang. High-level synthesis for FPGAs: From prototyping to deployment. Volume 30, pages 473–491. IEEE, 2011.
[11] Jason Cong and Jie Wang. PolySA: Polyhedral-based systolic array auto-compilation. In , pages 1–8. IEEE, 2018.
[12] Jason Cong, Peng Wei, Cody Hao Yu, and Peng Zhang. Automated accelerator generation and optimization with composable, parallel and pipeline architecture. In
DAC, 2018.
[13] Robert H Dennard, Fritz H Gaensslen, Hwa-Nien Yu, V Leo Rideout, Ernest Bassous, and Andre R LeBlanc. Design of ion-implanted MOSFET's with very small physical dimensions.
IEEE Journal of Solid-State Circuits
IEEE Transactions on Emerging Topics in Computing, 2018.
[16] Lorenzo Ferretti, Giovanni Ansaloni, and Laura Pozzi. Lattice-traversing design space exploration for high level synthesis. In , pages 210–217. IEEE, 2018.
[17] Álvaro Fialho, Luis Da Costa, Marc Schoenauer, and Michèle Sebag. Analyzing bandit-based adaptive operator selection mechanisms. Volume 60, pages 25–64. Springer, 2010.
[18] Intel SDK for OpenCL Applications. https://software.intel.com/en-us/intel-opencl, 2020.
[19] David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, and Kunle Olukotun. Automatic generation of efficient accelerators for reconfigurable hardware. In , pages 115–127. IEEE, 2016.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In
Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[21] Hung-Yi Liu and Luca P Carloni. On learning-based methods for design-space exploration with high-level synthesis. In
Proceedings of the 50th Annual Design Automation Conference, pages 1–7, 2013.
[22] Shuangnan Liu, Francis CM Lau, and Benjamin Carrion Schafer. Accelerating FPGA prototyping through predictive model-based HLS design space exploration. In , pages 1–6. IEEE, 2019.
[23] Anushree Mahapatra and Benjamin Carrion Schafer. Machine-learning based simulated annealer method for high level synthesis design space exploration. In
Proceedings of the 2014 Electronic System Level Synthesis Conference (ESLsyn) ,pages 1–6. IEEE, 2014.[24] Saul B Needleman and Christian D Wunsch. A general method applicable tothe search for similarities in the amino acid sequence of two proteins.
Mol. Biol ,48:443–153, 1970.[25] Raghu Prabhakar, David Koeplinger, Kevin J Brown, HyoukJoong Lee, Christo-pher De Sa, Christos Kozyrakis, and Kunle Olukotun. Generating configurablehardware from parallel patterns.
ACM Sigplan Notices (ASPLOS) , 51(4):651–665,2016.[26] Andrew Putnam, Adrian M Caulfield, Eric S Chung, Derek Chiou, Kypros Con-stantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi PrashanthGopal, Jan Gray, et al. A reconfigurable fabric for accelerating large-scale data-center services. In , pages 13–24. IEEE, 2014. [27] Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and DavidBrooks. Machsuite: Benchmarks for accelerator design and customized archi-tectures. In , pages 110–119. IEEE, 2014.[28] Enrico Reggiani, Marco Rabozzi, Anna Maria Nestorov, Alberto Scolari, LucaStornaiuolo, and Marco Santambrogio. Pareto optimal design space explorationfor accelerated cnn on fpga. In , pages 107–114. IEEE, 2019.[29] B Carrion Schafer and Kazutoshi Wakabayashi. Machine learning predictivemodelling high-level synthesis design space exploration.
IET computers & digitaltechniques , 6(3):153–159, 2012.[30] Benjamin Carrion Schafer. Parallel high-level synthesis design space explorationfor behavioral ips of exact latencies.
ACM Transactions on Design Automation ofElectronic Systems (TODAES) , 22(4):1–20, 2017.[31] Benjamin Carrion Schafer and Kazutoshi Wakabayashi. Divide and conquer high-level synthesis design space exploration.
ACM Transactions on Design Automationof Electronic Systems (TODAES) , 17(3):1–19, 2012.[32] Atefeh Sohrabizadeh, Jie Wang, and Jason Cong. End-to-end optimization ofdeep learning applications. In
The 2020 ACM/SIGDA International Symposium onField-Programmable Gate Arrays , pages 133–139, 2020.[33] Jessica Vandebon, Jose GF Coutinho, Wayne Luk, Eriko Nurvitadhi, and Tim Tod-man. Artisan: a meta-programming approach for codifying optimisation strate-gies. In
Proceedingsof the 2017 ACM/SIGDA International Symposium on Field-Programmable GateArrays , pages 157–166, 2017.[39] Pengfei Xu, Xiaofan Zhang, Cong Hao, Yang Zhao, Yongan Zhang, Yue Wang,Chaojian Li, Zetong Guan, Deming Chen, and Yingyan Lin. Autodnnchip: Anautomated dnn chip predictor and builder for both fpgas and asics. In
The 2020ACM/SIGDA International Symposium on Field-Programmable Gate Arrays , pages40–50, 2020.[40] Sotirios Xydis, Gianluca Palermo, Vittorio Zaccaria, and Cristina Silvano. Spirit:Spectral-aware pareto iterative refinement optimization for supervised high-levelsynthesis.
IEEE Transactions on Computer-Aided Design of Integrated Circuits andSystems , 34(1):155–159, 2014.[41] Cody Hao Yu, Peng Wei, Max Grossman, Peng Zhang, Vivek Sarker, and JasonCong. S2fa: an accelerator automation framework for heterogeneous computingin datacenters. In , pages 1–6. IEEE, 2018.[42] Georgios Zacharopoulos, Lorenzo Ferretti, Giovanni Ansaloni, GiuseppeDi Guglielmo, Luca Carloni, and Laura Pozzi. Compiler-assisted selection ofhardware acceleration candidates from application source code. In , pages 129–137. IEEE,2019.[43] Zhiru Zhang, Yiping Fan, Wei Jiang, Guoling Han, Changqi Yang, and Jason Cong.Autopilot: A platform-based esl synthesis system. In
High-Level Synthesis , pages99–112. Springer, 2008.[44] Jieru Zhao, Liang Feng, Sharad Sinha, Wei Zhang, Yun Liang, and BingshengHe. Comba: A comprehensive model-based analysis framework for high levelsynthesis of real applications. In , pages 430–437. IEEE, 2017.[45] Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. Flexten-sor: An automatic schedule exploration and optimization framework for tensorcomputation on heterogeneous system. In
Proceedings of the Twenty-Fifth In-ternational Conference on Architectural Support for Programming Languages andOperating Systems , pages 859–873, 2020.[46] Guanwen Zhong, Alok Prakash, Yun Liang, Tulika Mitra, and Smail Niar. Lin-analyzer: a high-level performance analysis tool for fpga-based accelerators. In , pages 1–6.IEEE, 2016.[47] Guanwen Zhong, Alok Prakash, Siqi Wang, Yun Liang, Tulika Mitra, and SmailNiar. Design space exploration of fpga-based accelerators with multi-level paral-lelism. In
Design, Automation & Test in Europe Conference & Exhibition (DATE),2017 , pages 1141–1146. IEEE, 2017.[48] Guanwen Zhong, Vanchinathan Venkataramani, Yun Liang, Tulika Mitra, andSmail Niar. Design space exploration of multiple loops on fpgas using high levelsynthesis. In ,pages 456–463. IEEE, 2014.[49] Wei Zuo, Peng Li, Deming Chen, Louis-Noël Pouchet, Shunan Zhong, and JasonCong. Improving polyhedral code generation for high-level synthesis. In2013International Conference on Hardware/Software Codesign and System Synthesis(CODES+ ISSS)