AnyHLS: High-Level Synthesis with Partial Evaluation
M. Akif Özkan, Arsène Pérard-Gayot, Richard Membarth, Philipp Slusallek, Roland Leißa, Sebastian Hack, Jürgen Teich, Frank Hannig
‡ Friedrich-Alexander University Erlangen-Nürnberg (FAU), Germany
∗ Saarland University (UdS), Germany
† German Research Center for Artificial Intelligence (DFKI), Germany

Abstract—FPGAs excel in low power and high throughput computations, but they are challenging to program. Traditionally, developers rely on hardware description languages like Verilog or VHDL to specify the hardware behavior at the register-transfer level. High-Level Synthesis (HLS) raises the level of abstraction, but still requires FPGA design knowledge. Programmers usually write pragma-annotated C/C++ programs to define the hardware architecture of an application. However, each hardware vendor extends its own C dialect using its own vendor-specific set of pragmas. This prevents portability across different vendors. Furthermore, pragmas are not first-class citizens in the language. This makes it hard to use them in a modular way or to design proper abstractions.

In this paper, we present AnyHLS, an approach to synthesize FPGA designs in a modular and abstract way. AnyHLS is able to raise the abstraction level of existing HLS tools by resorting to programming language features such as types and higher-order functions as follows: it relies on partial evaluation to specialize and to optimize the user application based on a library of abstractions. Then, vendor-specific HLS code is generated for Intel and Xilinx FPGAs. Portability is obtained by avoiding any vendor-specific pragmas in the source code. In order to validate achievable gains in productivity, a library for the domain of image processing is introduced as a case study, and its synthesis results are compared with several state-of-the-art Domain-Specific Language (DSL) approaches for this domain.
I. INTRODUCTION
Field Programmable Gate Arrays (FPGAs) consist of a network of reconfigurable digital logic cells that can be configured to implement any combinatorial logic or sequential circuit. This allows the design of custom, application-tailored hardware. In particular, memory-intensive applications benefit from FPGA implementations by exploiting fast on-chip memory for high throughput. These features make FPGA implementations orders of magnitude faster or more energy-efficient than CPU implementations in these areas. However, FPGA programming poses challenges to programmers unacquainted with hardware design.

FPGAs are traditionally programmed at the Register-Transfer Level (RTL). This requires modeling digital signals, their timing, the flow between registers, as well as the operations performed on them. Hardware Description Languages (HDLs) such as Verilog or VHDL allow for the explicit description of arbitrary circuits but require significant coding effort and verification time. This makes design iterations time-consuming and error-prone, even for experts: the code needs to be rewritten for different performance or area objectives. In recent languages such as Chisel [1], VeriScala [2], and MyHDL [3], programmers can create a functional description of their design but still remain at the RTL.

High-Level Synthesis (HLS) increases the abstraction level to an untimed high-level specification similar to imperative programming languages and automatically solves low-level design issues such as clock-level timing, register allocation, and structural pipelining [4]. However, HLS code that is optimized for the synthesis of high-performance circuits is fundamentally different from a software program delivering high performance on a CPU. This is due to the significant gap between the programming paradigms. An HLS compiler has to optimize the memory hierarchy of a hardware implementation and parallelize its data paths [5].

In order to achieve good Quality of Results (QoR), HLS languages demand that programmers also specify the hardware architecture of an application instead of just its algorithm. For this reason, HLS languages offer hardware-specific pragmas. This ad-hoc mix of software and hardware features makes it difficult for programmers to optimize an application. In addition, most HLS tools rely on their own C dialect, which prevents code portability. For example, Xilinx Vivado HLS [6] uses C++ as base language while the Intel SDK [7] (formerly Altera) uses OpenCL C. These severe restrictions make it hard to use existing HLS languages in a portable and modular way.

In this paper, we advocate describing FPGA designs using functional abstractions and partial evaluation to generate optimized HLS code. Consider Figure 1 for an example from image processing: with a functional language, we separate the description of the sobel_x operator from its realization in hardware. The hardware realization make_local_op is a function that specifies the data path, the parallelization, and the memory architecture. Thus, the algorithm and the hardware architecture are described by a set of higher-order functions. A partial evaluator ultimately combines these functions to generate HLS code that delivers high-performance circuit designs when compiled with HLS tools. Since the initial descriptions are high-level, compact, and functional, they are reusable and distributable as a library. We leverage the AnyDSL compiler framework [8] to perform partial evaluation and extend it to generate input code for HLS tools targeting Intel and Xilinx FPGA devices.
We claim that this approach leads to modular and portable code, unlike existing HLS approaches, and is able to produce highly efficient hardware implementations.

let sobel_x = @|img, x, y|
    -1 * img.read(x-1, y-1) + 1 * img.read(x+1, y-1) +
    -2 * img.read(x-1, y  ) + 2 * img.read(x+1, y  ) +
    -1 * img.read(x-1, y+1) + 1 * img.read(x+1, y+1);

let input    = make_img_mem1d("sandiego.jpg");
let output   = make_img_mem1d("output.jpg");
let operator = make_local_op(sobel_x);

with generate(vhls) { operator(input, output) }

Figure 1. AnyHLS example: The algorithm description sobel_x is decoupled from its realization in hardware, make_local_op. The hardware realization is a function that specifies important transformations for the exploitation of parallelism and the memory architecture. The function generate(vhls) selects the backend for code generation, which is Vivado HLS in this case. Ultimately, an optimized input code for HLS is generated by partially evaluating the algorithm and realization functions. [The figure also depicts the generated architecture: line buffers and a sliding window with row and column selection feed v parallel instances of the operator between Mem1D(W × H, v) input and output streams. Image: © Blender Foundation (CC BY 3.0).]

In summary, this paper makes the following contributions:
• We present AnyHLS, raising the abstraction level in HLS by using partial evaluation of higher-order functions as a core compiler technology. It guarantees the well-typedness of the residual program and offers considerably higher productivity than existing DSL design techniques and C/C++-based approaches (see Section II).
• AnyHLS offers unprecedented target independence, and thus portability, across different HLS tools by avoiding tool-specific pragma extensions and generating target-specific OpenCL or C++ code as input to existing HLS tools (see Section III).
• Productivity, modularity, and portability gains are demonstrated by presenting an image processing library as a case study in Section IV. For this domain, we show that competitive performance in terms of throughput and resource usage can be achieved in comparison with existing state-of-the-art DSLs (see Section V).

AnyHLS is available at https://github.com/AnyDSL/anyhls.

II. OVERVIEW, BACKGROUND, AND RELATED WORK
In the following, we briefly discuss prior work (Sections II-A and II-B) and fundamental concepts of AnyDSL (Section II-C).
A. QoR and Portability of Code in C-based HLS
HLS raises the abstraction level from fully timed RTL to an untimed high-level specification such as C/C++ or OpenCL. This eases the hardware design problem by eliminating low-level issues such as clock-level timing, register allocation, and gate-level pipelining [4], [9], [10]. Modern HLS tools are able to generate high-quality results for DSP and datapath-oriented applications. Several authors (e.g., [4], [11], [12]) have argued the following points as key to this success: (i) advancements in RTL design tools, (ii) device-specific code generation, (iii) domain-specific focus on the target applications, and (iv) generating both software and hardware from the same code. Modern HLS tools such as the Intel FPGA SDK for OpenCL (AOCL) and Xilinx SDx offer system synthesis to map program parts to either software or hardware. This enables software-like development for library design and verification.

There is an ongoing discussion whether C-based languages are good candidates for HLS [4], [12]–[15]. Yet, the most commonly used HLS compilers (e.g., Vivado HLS, AOCL, Catapult, LegUp) are based on C-like languages [4], [6], [7], [10]. The modularity and readability of C/C++ or OpenCL descriptions often conflict with the best coding practices of HLS compilers [16], [17]. In the hardware design context, QoR refers to the ratio between the performance of the circuit (latency, throughput) and its design cost (circuit area, energy consumption). A C-based HLS code optimized for satisfactory QoR is entirely different from a typical software program [16]–[20]. The developer has to express the FPGA implementation of an application using the language abstractions of software (i.e., arrays and loops) to specify the memory hierarchy and hardware pipelining. Language extensions like pragmas fill the gap for the lacking FPGA-centric features. However, pragmas are specific to HLS tools, and they cannot be used in a modular way because the preprocessor already resolves them (e.g., pragmas cannot be passed as function parameters). This ad-hoc mix of software and hardware abstractions in HLS programming languages makes optimizations hard [15], [17], [19]. Furthermore, the lack of standardization in HLS languages and compilers hinders the portability of code across them. Often, the code optimized for one HLS tool must be changed significantly to target another HLS tool, even when the same FPGA design is described. For these reasons, we believe that the next step for HLS requires an increased level of abstraction on the language side, which can reduce the need for expert knowledge.
B. Raising the Abstraction Level in HLS
Recent work suggests raising the abstraction level in HLS by designing libraries, DSLs, or source-to-source compilers to hide low-level implementation details. This improves modularity and reduces code duplication, but is hard to develop and maintain if the well-typedness of programs is to be preserved. [16]–[19] make extensive use of C++ template metaprogramming to provide libraries that are optimized for Vivado HLS. Generic programs can be optimized for compile-time known values using metaprogramming techniques, but this has the following drawbacks: (i) The well-typedness of the generated program cannot be guaranteed in metaprogramming. This makes it difficult to understand error messages. (ii) Metaprograms are hard to develop, maintain, and understand since the meta language is different from the core language (C++ core vs. C++ template language). For this reason, code cannot be easily moved between the core and the meta language. (iii) Lambda expressions are not allowed to be used as template arguments in C++. We refer to [8] for more details. In particular, [16], [18] explain the challenges of implementing higher-order algorithms in C++ for Vivado HLS. OpenCL C does not support template metaprogramming and thus forces users to use preprocessor macros for generic library design. Therefore, libraries developed using C++ template metaprogramming have to be rewritten completely for OpenCL C, that is, for AOCL.

DSLs use domain-specific knowledge to parallelize algorithms and generate low-level, optimized code [21]. Programming accelerators using DSLs is thus easier, in particular for FPGAs, because the compiler performs scheduling. A prominent example of that is the FPGA version of Spiral [22]. It generates HDL for digital signal processing applications. In the domain of image processing, recent projects include Darkroom [23], Rigel [24], and the work of Pu et al. [25] based on Halide [26]. Hipacc [27], PolyMage [28], SODA [29], and RIPL [30] create image processing pipelines from a DSL. Rigel/Halide, PolyMage, and RIPL are declarative DSLs, whereas Hipacc is embedded into C++. All of these compilers, except Rigel, generate HLS code in order to simplify their backends. Other examples include Lift, which targets FPGAs via algorithmic patterns [31], and Tiramisu [32] for data-parallel algorithms on dense arrays. Tiramisu takes as input a set of scheduling commands from the user and feeds it to the polyhedral analysis of the compiler. However, a considerable portion of these scheduling primitives remains platform-specific [33]. Spatial [15] is a language for programming Coarse-Grained Reconfigurable Architectures (CGRAs) and FPGAs. Spatial provides language constructs to express control, memory, and interfaces of a hardware implementation.

In this paper, we show that the described need to raise the abstraction level in HLS can be met by using recent compiler technology, in particular by exploiting the concepts of partial evaluation and higher-order functions. Unlike the aforementioned DSL compilers, AnyHLS allows programmers to build the basic blocks and abstractions necessary for their application domain by themselves (see Section III). AnyHLS is thereby built on top of AnyDSL [8] (see Section II-C). AnyDSL offers partial evaluation to enable shallow embedding [34] without the need for modifying a compiler. This means that there is no need to change the compiler when adding support for a new application domain, since programmers can design custom control structures.
Partial evaluation specializes algorithmic variants of a program at compile time. Compared to metaprogramming, partial evaluation operates in a single language and preserves the well-typedness of programs [8]. Furthermore, different combinations of static/dynamic parameters can be instantiated from the same code. Previously, we have shown how to abstract image border handling implementations for Intel FPGAs using AnyDSL [35]. In this paper, we present AnyHLS and an image processing library to synthesize FPGA designs in a modular and abstract way for both Intel and Xilinx FPGAs.
C. AnyDSL Compiler Framework
AnyDSL [8], [34] is a compiler framework for designing high-performance, domain-specific libraries. It provides the imperative and functional language Impala. Impala's syntax is inspired by Rust. We will now briefly discuss Impala's most important features that we rely on in AnyHLS.
1) Partial Evaluation:
Partial evaluation is a technique for program optimization by specialization with respect to compile-time known values. Assume that each input of a program F is classified as either static (s) or dynamic (d), and values for all of the static inputs are given. Then, partial evaluation produces an optimized (residual) program F_s such that

    [[F_s]](d) = [[F]](s, d)    (1)

and running F_s on the dynamic inputs produces the same result as running the original program F on all of the inputs [36]. Compiler techniques such as constant propagation, loop unrolling, or inlining are examples of partial evaluation. Typically, the user has no control over when a compiler applies these optimizations.

Impala allows programmers to partially evaluate [37] their program at compile time. Programmers control the partial evaluator via filters [38]. These are Boolean expressions of the form @(expr) that annotate function signatures. Each call site instantiates the callee's filter with the corresponding argument list. The call is specialized when the expression evaluates to true. The expression ?expr yields true if expr is known at compile time; the expression $expr is never considered constant by the evaluator. For example, the following @(?n) filter will only specialize calls to pow if n is statically known at compile time:

fn @(?n) pow(x: int, n: int) -> int {
    if n == 0 {
        1
    } else {
        if n % 2 == 0 {
            let y = pow(x, n / 2);
            y * y
        } else {
            x * pow(x, n - 1)
        }
    }
}

Thus, the calls

let z = pow(x, 5);
let z = pow(3, 5);

will result in the following equivalent sequences of instructions after specialization:

let y = x * x;
let z = x * y * y;

let z = 243;
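As a small illustration of the annotations (a sketch of ours; the behavior follows directly from the description of ? and $ above):

let a = pow(x, 5);    // the filter ?n sees a constant: the call is specialized
let b = pow(x, $5);   // $ hides the constant from the evaluator: no specialization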
As syntactic sugar, @ is available as shorthand for @(true). This causes the partial evaluator to always specialize the annotated function.

FPGA implementations must be statically defined for QoR: types, loops, functions, and interfaces must be resolved at compile time [16], [18], [19]. Partial evaluation has many advantages compared to metaprogramming, as discussed in Section II-B. Hence, Impala's partial evaluation is particularly useful to optimize HLS descriptions. (Impala and AnyDSL are available at https://anydsl.github.io.)

2) Generators: Because iteration on various domains is a common pattern, Impala provides syntactic sugar for invoking certain higher-order functions. The loop

for var1, ..., varn in iter(arg1, ..., argn) { /* ... */ }

translates to

iter(arg1, ..., argn, |var1, ..., varn| { /* ... */ });

The body of the for loop and the iteration variables constitute an anonymous function

|var1, ..., varn| { /* ... */ }

that is passed to iter as the last argument. We call functions that are invokable like this generators. Domain-specific libraries implemented in Impala make extensive use of these features, as they allow programmers to write custom generators that take advantage of both domain knowledge and certain hardware features, as we will see in the next section.

Generators are particularly powerful in combination with partial evaluation. Consider the following functions:

type Body = fn(int) -> ();

fn @(?a & ?b) unroll(a: int, b: int, body: Body) -> () {
    if a < b {
        body(a);
        unroll(a+1, b, body)
    }
}

fn @ range(a: int, b: int, body: Body) -> () {
    unroll($a, b, body)
}

Both generators iterate from a (inclusive) to b (exclusive) while invoking body each time. The filter of unroll tells the partial evaluator to completely unroll the recursion if both loop bounds are statically known at a particular call site.
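For illustration, the following sketch of ours (using only the generators defined above) contrasts the two: with static bounds, unroll unfolds the loop body at compile time, while range keeps the loop dynamic:

// Sketch: both functions sum array elements, but the first is fully
// unrolled by the partial evaluator into four additions, while the
// second remains a loop in the residual program.
fn @ sum4(buf: &[int]) -> int {
    let mut acc = 0;
    for i in unroll(0, 4) { acc += buf(i) }
    acc
}

fn sum_n(buf: &[int], n: int) -> int {
    let mut acc = 0;
    for i in range(0, n) { acc += buf(i) }
    acc
}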
III. THE ANYHLS LIBRARY

Efficient and resource-friendly FPGA designs require application-specific optimizations. These optimizations and transformations are well known in the community. For example, de Fine Licht et al. [20] discuss the key transformations of HLS codes, such as loop unrolling and pipelining. They describe the whole hardware design from the low-level memory layout to the operator implementations, with support for low-level loop transformations throughout the design. In our setting, the programmer defines and provides these abstractions using AnyDSL for a given domain in the form of a library. We rely on partial evaluation to combine those abstractions and to remove the overhead associated with them. Ultimately, the AnyDSL compiler synthesizes optimized HLS code (C++ or OpenCL C) from a given functional description of an algorithm, as shown in Figure 2. The generated code goes to the selected HLS tool. This is in contrast to other domain-specific approaches like Halide-HLS [25] or Hipacc [27], which rely on domain-specific compilers to instantiate predefined templates or macros. Hipacc makes use of two distinct libraries to synthesize algorithmic abstractions for Vivado HLS and Intel AOCL, while AnyHLS uses the same image processing library, described in Impala, for both.

A. HLS Code Generation
For HLS code generation, we implemented an intrinsic named vhls in AnyHLS to emit Vivado HLS code and an intrinsic named opencl to emit AOCL code:

with vhls()   { body() }
with opencl() { body() }

[Figure 2: code generation flows — a halide-app.cpp goes through the Halide compiler (template library, Vivado backend) to VHLS code for Xilinx FPGAs; a hipacc-app.cpp goes through the Hipacc compiler (two template libraries, Vivado and AOCL backends) to VHLS or AOCL code for Xilinx and Intel FPGAs; an anyhls-app.impala is compiled together with ImageProcessingLib.impala by the AnyDSL compiler and its partial evaluator to VHLS or AOCL code for Xilinx and Intel FPGAs.]

Figure 2. FPGA code generation flows for Halide, Hipacc, and AnyHLS (from left to right). VHLS and AOCL are used as acronyms for Vivado HLS and Intel FPGA SDK for OpenCL, respectively. Halide and Hipacc rely on domain-specific compilers for image processing that instantiate template libraries. AnyHLS allows defining all abstractions for a domain in a language called Impala and relies on partial evaluation for code specialization. This ensures maintainability and extensibility of the provided domain-specific library—for image processing in this example.
With opencl, we use a grid and block size of (1, 1, 1) to generate a single work-item kernel, as the official AOCL documentation recommends [7]. We extended AnyDSL's OpenCL runtime by the extensions of the Intel OpenCL SDK. To provide an abstraction over both HLS backends, we create a wrapper generate that expects a code generation function:

type Backend = fn(fn() -> ()) -> ();

fn @ generate(be: Backend, body: fn() -> ()) -> () {
    with be() { body() }
}

Switching backends is now just a matter of passing an appropriate function to generate:

let backend = vhls; // or opencl
with generate(backend) { body() }
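Because generate is an ordinary function value, backend selection can itself be abstracted. The following is a hypothetical sketch (emit_for is our own helper, not part of AnyHLS):

// Sketch: instantiate the same body once per backend. The loop is
// unrolled at compile time, so each backend receives specialized code.
fn @ emit_for(backends: &[Backend], n: int, body: fn() -> ()) -> () {
    for i in unroll(0, n) {
        with generate(backends(i)) { body() }
    }
}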
B. Building Abstractions for FPGA Designs

In the following, we present abstractions for the key transformations and design patterns that are common in FPGA design. These include (a) important loop transformations, (b) control flow and data flow descriptions such as reductions and Finite State Machines (FSMs), and (c) the explicit utilization of different memory types. Approaches like Spatial [15] expose these patterns within the language—new patterns require dedicated support from the compiler. Hence, these languages and compilers are restricted to the specialized application domain they have been designed for. In AnyHLS, Impala's functional language and partial evaluation allow us to design the abstractions needed for FPGA synthesis in the form of a library. New patterns can be added to the library without dedicated support from the compiler. This makes AnyHLS easier to extend than the approaches mentioned above.
1) Loop Transformations: C++ compilers usually provide certain preprocessor directives that perform particular code transformations. A common feature is to unroll loops:

#pragma unroll
for (int i = 0; i < N; ++i)
    body(i);

In Impala, the same transformation is expressed with the unroll generator from Section II-C:

for i in unroll(0, N) {
    body(i)
}

[Figure 3. Parallel processing variants: no unrolling, unrolling of the inner loop, unrolling of the outer loop, and unrolling of both the inner and outer loop.]

Loop pipelining with a given Initiation Interval (II) is handled similarly:

#pragma HLS pipeline II=3

let II = 3;
for i in pipeline(II, 0, N) {
    body(i)
}

Instead of a pragma (the former), AnyHLS uses the intrinsic generator pipeline (the latter). Unlike the above loop abstractions (e.g., unroll), Impala emits a tool-specific pragma for the pipeline abstraction. This provides portability across different HLS tools. Furthermore, it allows the programmer to invoke and pass around pipeline—just like any other generator.

2) Reductions: Reductions are useful in many contexts. The following function takes an array of values, a range within it, and an operator:

type T = int;

fn @(?beg & ?end) reduce(beg: int, end: int, input: &[T],
                         op: fn(T, T) -> T) -> T {
    let n = end - beg;
    if n == 1 {
        input(beg)
    } else {
        let m = (end + beg) / 2;
        let a = reduce(beg, m, input, op);
        let b = reduce(m, end, input, op);
        op(a, b)
    }
}

Due to the filter, the recursion will be completely unfolded if the range is statically known. Thus,

reduce(0, 4, [a, b, c, d], |x, y| x + y)

yields: (a + b) + (c + d).

3) Finite State Machines: AnyHLS models computations that depend not only on the inputs but also on an internal state with an FSM. To define an FSM, programmers need to specify states and a transition function that determines when to change the current state based on the machine's input. This is especially beneficial for modeling control flow. To describe an FSM in Impala, we start by introducing types to represent the states and the machine itself:

type State = int;

struct FSM {
    add: fn(State, fn() -> (), fn() -> State) -> (),
    run: fn(State) -> ()
}

An object of type FSM provides two operations: adding one state with add, or running the computation with run. The add method takes the name of the state, an action to be performed for this state, and a transition function associated with this state. Once all states are added, the programmer runs the machine by passing the initial state as an input parameter. The following example adds 1 to every element of an array:

let buf = /* ... */;
let mut (idx, pixel) = (0, 0);
let fsm = make_fsm();
fsm.add(Read,    || pixel = buf(idx),
                 || if idx >= len { Exit } else { Compute });
fsm.add(Compute, || pixel += 1,         || Write);
fsm.add(Write,   || buf(idx++) = pixel, || Read);
fsm.run(Read);

Similar to the other abstractions introduced in this section, the constructor for an FSM is not a built-in function of the compiler but a regular Impala function. In some cases, we want to execute the FSM in a pipelined way. For this scenario, we add a second method run_pipelined. As all the methods (e.g., make_fsm, add, run) are annotated for partial evaluation (by @), input functions to these methods will be optimized according to their static inputs. Ultimately, AnyHLS will emit the states of an FSM as part of a loop according to the selected run method.
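For illustration, a pipelined run of the increment machine above might look as follows; the argument order (initial state, initiation interval, loop bounds) is an assumption mirroring the use of run_pipelined in Listing 5:

// Hypothetical sketch: execute the FSM as a pipelined loop with an
// initiation interval of 1 over all len elements.
fsm.run_pipelined(Read, 1, 0, len);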
4) Memory Types and Memory Abstractions: FPGAs have different memory types of varying sizes and access properties. Impala supports four memory types specific to hardware design (see Figure 4): global memory, on-chip memory, registers, and streams. Global memory (typically DRAM) is allocated on the host using our runtime and accessed through regular pointers. On-chip memory (e.g., BRAM or M10K/M20K) for the FPGA is allocated using the reserve_onchip compiler intrinsic. Memory accesses using the pointer returned by this intrinsic will map to on-chip memory. Standard variables are mapped to registers, and a specific stream type is available to allow for the communication between FPGA kernels. Memory-wise, a stream is mapped to registers or on-chip memory by the HLS tools. These FPGA-specific memory types in Impala will be mapped to their corresponding tool-specific declarations in the residual program (on-chip memory will be defined as local memory for AOCL, whereas it will be defined as an array in Vivado HLS).

[Figure 4. Memory types provided for FPGA design: global memory, on-chip memory, register, stream.]

[Figure 5. Memory abstractions: Regs1D (1D register array), Regs2D (2D register array), OnChipArray (on-chip array), StreamArray (stream array).]

a) Memory partitioning: In existing HLS tools, an array partitioning pragma must be defined as follows to implement a C array with hardware registers (here using Vivado HLS [6]):

typedef int T;
T Regs1D[size];
#pragma HLS array_partition variable=Regs1D dim=0

Listing 1. A typical way of partitioning an array by using pragmas in existing HLS tools. Other HLS tools offer similar pragmas for the same task.

Instead, AnyHLS provides a more concise description of a register array without using any tool-specific pragma, namely the recursive declaration of registers shown in Listing 2:

type T = int;

struct Regs1D {
    read:  fn(int) -> T,
    write: fn(int, T) -> (),
    size:  int
}

fn @ make_regs1d(size: int) -> Regs1D {
    if size == 0 {
        Regs1D {
            read:  @|_| 0,
            write: @|_, _| (),
            size:  size
        }
    } else {
        let mut reg: T;
        let others = make_regs1d(size - 1);
        Regs1D {
            read:  @|i| if i+1 == size { reg } else { others.read(i) },
            write: @|i, v| if i+1 == size { reg = v } else { others.write(i, v) },
            size:  size
        }
    }
}

Listing 2. Recursive description of a register array using partial evaluation instead of declaring an array and partitioning it by HLS pragmas.

When the size is not zero, each recursive call to this function allocates a register variable named reg and creates a smaller register array with one element less, named others. The read and write functions test if the index i is equal to the index of the current register. In the case of a match, the current register is used. Otherwise, the search continues in the smaller array. The generator make_regs1d returns an Impala value that can be read and written by index (regs in the following code), similar to a C array:

let regs = make_regs1d(size);

However, it defines size individual registers in the residual program instead of declaring an array and partitioning it by tool-specific pragmas as in Listing 1. The generated code does not contain any compiler directives; hence it can be used with different HLS tools (e.g., Vivado HLS, AOCL). Since we annotated make_regs1d, read, and write for partial evaluation, any call to these functions will be inlined recursively. This means that the search for the register to read from or write to is performed at compile time. These registers will be optimized by the AnyDSL compiler just like any other variables: unnecessary assignments will be avoided, and clean HLS code will be generated.
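As a usage sketch (our own example, not from the library), the register array can serve as a shift register, e.g., for the sliding windows of Section IV; input_pixel is a hypothetical new value:

// Sketch: a 4-stage shift chain built from make_regs1d. The index
// comparisons in read/write are resolved at compile time, leaving
// four registers wired back-to-back.
let regs = make_regs1d(4);
for i in unroll(1, 4) {
    regs.write(4 - i, regs.read(3 - i))
}
regs.write(0, input_pixel);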
Correspondingly, AnyHLS provides generators (similar to Listing 2) for one- and two-dimensional arrays of on-chip memory (e.g., the line buffers in Section IV), global memory, and streams (as illustrated in Figure 5), instead of relying on the memory partitioning pragmas encouraged by existing HLS tools (as in Listing 1).

IV. A LIBRARY FOR IMAGE PROCESSING ON FPGA

AnyHLS allows for defining domain-specific abstractions and optimizations that are used and applied prior to generating customized input to existing HLS tools. In this section, we introduce a library that is developed to support HLS for the domain of image processing applications. It is based on the fundamental abstractions introduced in Section III-B. Our low-level implementation is similar to existing domain-specific languages targeting FPGAs [24], [27]. For this reason, we focus on the interface of our abstractions as seen by the programmer.

We design applications by decoupling their algorithmic description from their schedule and memory operations. For instance, typical image operators, such as the following Sobel filter, just resort to the make_local_op generator. Similarly, we implement a point operator for RGB-to-gray color conversion (see Listing 3):

fn sobel_edge(output: &mut [T], input: &[T]) -> () {
    let img = make_raw_mem2d(width, height, input);
    let dx  = make_raw_mem2d(width, height, output);
    let sobel_extents = extents(1, 1); // for a 3x3 filter
    let operator = make_local_op(4,    // vector factor
                                 sobel_operator_x, sobel_extents,
                                 mirror, mirror);
    with generate(hls) { operator(img, dx); }
}

fn rgb2gray(output: &mut [T], input: &[T]) -> () {
    let img  = make_raw_img(width, height, input);
    let gray = make_raw_img(width, height, output);
    let operator = make_point_op(@|pix| {
        let r =  pix        & 0xFF;
        let g = (pix >>  8) & 0xFF;
        let b = (pix >> 16) & 0xFF;
        (r + g + b) / 3
    });
    with generate(hls) { operator(img, gray); }
}

Listing 3. Sobel filter and RGB-to-gray color conversion as example applications described by using our library.

The image data structure is opaque; the target platform mapping determines its layout. AnyHLS provides common border handling functions as well as point and global operators such as reductions (see Section III-B2). These operators are composable to allow for more sophisticated ones.

A. Vectorization

Image processing applications consist of loops that possess a very high degree of spatial parallelism. This should be exploited to reach the bandwidth of the memory technology. A resource-efficient approach, so-called vectorization or loop coarsening, is to aggregate the input pixels into vectors and process multiple input data at the same time to calculate multiple output pixels in parallel [39]–[41]. This replicates only the arithmetic operations applied to the data (the so-called datapath) instead of the whole accelerator, similar to Single Instruction Multiple Data (SIMD) architectures. Vectorization requires a control structure specialized to the considered hardware design. We support the automatic vectorization of an application by a given factor v in our image processing library. In particular, our library uses the vectorization techniques proposed in [40]. For example, the make_local_op function has an additional parameter to specify the desired vectorization and will propagate this information to the functions it uses internally: make_local_op(op, v).
For brevity, we omit the parameter for the vectorization factor for the remaining abstractions in this section.

B. Memory Abstractions for Image Processing

1) Memory Accessor: In order to optimize memory accesses and encapsulate the contained memory type (on-chip memory, etc.) into a data structure, we decouple the data transfer from the data use via the following memory abstractions:

struct Mem1D {
    read:   fn(int) -> T,
    write:  fn(int, T) -> (),
    update: fn(int) -> (),
    size:   int
}

struct Mem2D {
    read:   fn(int, int) -> T,
    write:  fn(int, int, T) -> (),
    update: fn(int, int) -> (),
    width:  int,
    height: int
}

Similar to hardware design practices, these memory abstractions require the memory address to be updated before the read/write operations. The update function transfers data between the encapsulated memory and staging registers using vector data types. Then, the read/write functions access an element of the vector. This increases data reuse and the DRAM-to-on-chip-memory bandwidth [42].

2) Stream Processing: Inter-kernel dependencies of an algorithm should be accessed on-the-fly, in combination with fine-granular communication, in order to pipeline the full implementation with a fixed throughput. That is, as soon as one block produces a datum, the next block consumes it. In the best case, this requires only a single register or a small buffer instead of reading/writing temporary images:

[Figure: Kernel1, Kernel2, and Kernel3 connected by Mem1D interfaces instead of intermediate images.]

We define a stream between two kernels as follows:

fn make_mem_from_stream(size: int, data: stream) -> Mem1D;

3) Line Buffers: Storing an entire image in on-chip memory before execution is not feasible, since on-chip memory blocks are limited in FPGAs. On the other hand, feeding the data on demand from main memory is extremely slow. Still, it is possible to leverage fast on-chip memory by using it as FIFO buffers containing only the necessary lines of the input images (W pixels per line).

[Figure: line buffers (W, h, v) turn a Mem1D(W, v) input into a Mem2D(1, h, v) column of pixels.]

This enables parallel reads at the output for every pixel read at the input. We model a line buffer as follows:

type LineBuf1D = fn(Mem1D) -> Mem1D;

fn make_linebuf1d(width: int) -> LineBuf1D;
// similar for LineBuf2D

Akin to Regs1D (see Section III-B4), a recursive call builds an array of line buffers (each line buffer will be declared as a separate memory component in the residual program, similar to the on-chip array in Figure 5).

4) Sliding Window: Registers are the most amenable resources to hold data for highly parallelized access. A sliding window of size w × h updates the constituting shift registers with a new column of h pixels and enables parallel access to w · h pixels.

[Figure: a Mem2D(1, h, v) column shifts into a Mem2D(w, h, 1) sliding window.]

This provides high data reuse for temporal locality and avoids wasting on-chip memory blocks that might otherwise be utilized for a similar data bandwidth. Our implementation uses make_regs2d for an explicit declaration of registers and supports pixel-based indexing at the output. This will instantiate w · h registers in the residual program, as explained in Section III-B4.

type Swin2D = fn(Mem2D) -> Mem2D;

fn @ make_sliding_window(w: int, h: int) -> Swin2D {
    let win = make_regs2d(w, h);
    // ...
}
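To sketch how these pieces compose (hypothetical wiring of our own; make_linebuf2d and its exact type are assumptions extrapolated from the 1D variant above):

// Sketch: line buffers widen a pixel stream into a column of h pixels;
// the sliding window widens that column to a full w x h neighborhood.
let lines  = make_linebuf2d(width, 3);  // assumed variant: fn(Mem1D) -> Mem2D
let window = make_sliding_window(3, 3);
let win    = window(lines(input));      // 3x3 window per input pixel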
C. Loop Abstractions for Image Processing

1) Point Operators: Algorithms such as image scaling and color transformation calculate an output pixel for every input pixel. The point operator abstraction (see Listing 4) in AnyHLS yields a vectorized pipeline over the input and output image. This abstraction is parametric in its vector factor v and the desired operator function op.

type PointOp = fn(Mem1D) -> Mem1D;

fn @ make_point_op(v: int, op: Op) -> PointOp {
    @|img, out| {
        for idx in pipeline(1, 0, img.size) {
            img.update(idx);
            for i in unroll(0, v) {
                out.write(i, op(img.read(i)));
            }
            out.update(idx);
        }
    }
}

Listing 4. Implementation of the point operator abstraction.

The total latency is

    L = L_arith + ⌈W/v⌉ · H cycles    (2)

where W and H are the width and height of the input image, and L_arith is the latency of the data path.

2) Local Operators: Algorithms such as Gaussian blur and Sobel edge detection calculate an output pixel by considering the corresponding input pixel and a certain neighborhood of it in a local window. Thus, a local operator with a w × h window requires w · h pixel reads for every output. The same (w − 1) · h pixels are used to calculate the results at the image coordinates (x, y) and (x + 1, y). This spatial locality is transformed into temporal locality when input images are read in raster order for burst mode, and subsequent pixels are processed sequentially with a streaming pipeline implementation. The local operator implementation in AnyHLS (shown in Listing 5) consists of line buffers and a sliding window that hold the dependency pixels in on-chip memory; it calculates a new result for every new pixel read.

[Figure: the local operator — line buffers (Mem2D(1, h, v)) and a sliding window (Mem2D(w + v − 1, h, 1)) with row and column selection feed v parallel instances of op between Mem1D(W × H, v) input and output streams.]

This provides a throughput of v pixels per clock cycle (v is the vectorization factor) at the cost of an initial latency

    L_initial = L_arith + (⌊h/2⌋ · ⌈W/v⌉ + ⌊⌈w/v⌉/2⌋)    (3)

that is spent for caching the neighboring pixels of the first calculation. The final latency is thus:

    L = L_initial + ⌈W/v⌉ · H    (4)

type LocalOp = fn(Mem1D) -> Mem1D;

fn @ make_local_op(v: int, op: Op, ext: Extents,
                   bh_lower: FnBorder,
                   bh_upper: FnBorder) -> LocalOp {
    @|img, out| {
        let mut (col, row, idx) = (0, 0, 0);
        let wait = /* initial latency */;
        let fsm = make_fsm();
        fsm.add(Read,    || img.update(idx), || Compute);
        fsm.add(Compute, || {
            line_buffer.update(col);
            sliding_window.update(row);
            col_sel.update(col);
            for i in unroll(0, v) {
                out.write(i, op(col_sel.read(i)));
            }
        }, || if idx > wait { Write } else { Index });
        fsm.add(Write,   || out.update(idx - wait - 1), || Index);
        fsm.add(Index,   || {
            idx++; col++;
            if col == img_width { col = 0; row++; }
        }, || if idx < img.size { Read } else { Exit });
        fsm.run_pipelined(Read, 1, 0, img.size);
    }
}

Listing 5. Implementation of the local operator abstraction.

Compared to the local operator in Figure 1, we additionally support boundary handling. We specify the extent of the local operator (half the filter size) as well as functions specifying the boundary handling for the lower and upper bounds. Then, row and column selection functions apply border handling correspondingly in the x- and y-directions by using one-dimensional multiplexer arrays, similar to Özkan et al. [40].
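For illustration, instantiating Equations (3) and (4) for a 3 × 3 window on a 1024 × 1024 image with v = 4 (numbers chosen only as an example):

    L_initial = L_arith + (⌊3/2⌋ · ⌈1024/4⌉ + ⌊⌈3/4⌉/2⌋) = L_arith + 256
    L = L_arith + 256 + 256 · 1024 = L_arith + 262400 cycles

so the pipeline adds only 256 warm-up cycles on top of the ⌈W/v⌉ · H cycles needed to stream the image.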
V. EVALUATION AND RESULTS

In the following, we compare the Post Place and Route (PPnR) results of AnyHLS and other state-of-the-art domain-specific approaches, including Halide-HLS [25] and Hipacc [27]. The generated HLS codes are compiled using Intel FPGA SDK for OpenCL 18.1 and Xilinx Vivado HLS 2017.2, targeting a Cyclone V GT 5CGTD9 FPGA and a Zynq XC7Z020 FPGA, respectively.

The generated hardware designs are evaluated for their throughput, latency, and resource utilization. FPGAs possess two types of resources: (i) computational: LUTs and DSP blocks; (ii) memory: flip-flops (FFs) and on-chip memory (BRAM/M20K). A SLICE/ALM is comprised of look-up tables (LUTs) and flip-flops, and thus indicates the resource usage when considered together with the DSP blocks and on-chip memory blocks. The implementation results presented for Vivado HLS feature only the kernel logic, while those for Intel OpenCL include PCIe interfaces. The execution time of an FPGA circuit (Vivado HLS implementation) equals T_clk · latency, where T_clk is the clock period of the maximum achievable clock frequency (lower is better). We measured the timing results for Intel OpenCL by executing the applications on a Cyclone V GT 5CGTD9 FPGA. This is the case for all analyzed applications. We have no intention nor license rights [43, §4], [44, §2] to benchmark and compare the considered FPGA technologies or HLS tools.

A. Applications

In our experimental evaluation, we consider the following applications:
• Gaussian (Gauss), blurring an image with a 5 × 5 integer kernel
• Harris corner detector (Harris), consisting of 9 kernels that resort to integer arithmetic and horizontal/vertical derivatives
• Jacobi, smoothing an image with a 3 × 3 integer kernel
• filter chain (FChain), consisting of 3 convolution kernels as a pre-processing algorithm
• bilateral filter (Bilateral), a 5 × 5 floating-point kernel as an edge-preserving and noise-reducing function based on exponential functions
• mean filter (MF), a 5 × 5 filter that determines the average within a local window via 8-bit arithmetic
• SobelLuma, an edge detection algorithm provided as a design example by Intel, consisting of RGB-to-Luma color conversion, Sobel filters, and thresholding

B. Library Optimizations

AnyHLS exploits stream processing and performs implicit parallelization. The following subsections show the impact of these optimizations.

1) Stream Processing: Memory transfers between the FPGA's programmable logic and external memory are one of the most time-consuming parts of many image processing applications. AnyHLS' streaming pipeline optimization passes dependency pixels directly from the producer to the consumer kernel, as explained in Section IV-B2. This allows pipelined kernel execution and makes intermediate images between kernels superfluous. The more intermediate images are eliminated, the better the performance of the resulting designs. For example, this optimization eliminates 8 intermediate images in the Harris corner detector and 2 in the filter chain; see Figure 6 for the performance impact.

[Figure 6. Execution time for naïve and streaming pipeline implementations of Harris and FChain for an Intel Cyclone V for images of 1024 × 1024.]

The throughput of both streaming pipeline implementations is indeed determined by their slowest individual kernel, which is a local operator. Consider Table I, which displays the Vivado HLS reports. The latency results correspond to Equation (4).
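As a sanity check (our own arithmetic, not taken from the original tables): for FChain's largest mask of 5 × 5 at v = 1, Equations (3) and (4) give L = L_arith + (2 · 1024 + 2) + 1024 · 1024 = L_arith + 1050626 cycles, which matches the 1050649 cycles in Table I for a datapath latency of L_arith = 23.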
Table I
STREAMING PIPELINE IMPLEMENTATIONS OF HARRIS AND FCHAIN ON A XILINX ZYNQ. DATA IS TRANSFERRED TO THE FPGA ONLY ONCE, THUS SIMILAR THROUGHPUTS ARE ACHIEVED. IMAGE SIZES ARE 1024 × 1024, v = 1, f_target = 200 MHZ.

App.     Largest mask    Sequential dependency     Latency [cyc.]    Throughput [MB/s]
FChain   5 × 5           local + local + local     1050649           821
Harris   3 × 3           local + local + point     1049634           825

2) Vectorization: Many FPGA implementations benefit from parallel processing in order to increase memory bandwidth. AnyHLS implicitly parallelizes a given image pipeline by a vectorization factor v. As an example, Figure 7 shows the PPnR results, along with the achieved memory throughput, for different vectorization factors of the mean filter on a Cyclone V. The memory bound of the Cyclone V is reported by Intel's diagnosis tool.

[Figure 7. PPnR results of AnyHLS' mean filter implementation on an Intel Cyclone V: achieved throughput over the vectorization factor v, compared to the memory bound of the device (1344.80 MB/s for our setup), together with the usage of on-chip memory blocks and logic resources in %.]

The speedup is almost linear, whereas the resource utilization grows sub-linearly with the vectorization factor, as Figure 7 depicts. AnyHLS exploits the data reuse between consecutive iterations of the local operators. Data is read and written with vectorized data types. The line buffers and the sliding window are extended to hold the dependency pixels for vectorized processing. Thus, only the datapath is replicated instead of the whole accelerator implementation (see Section IV-A). All the considered applications except Bilateral in Figure 9 reach the memory bound. Bilateral is compute-bound due to its large number of floating-point operations.

C. Hardware Design Evaluation

We evaluate the generated hardware designs based on their throughput, latency, and resource utilization. As a reference, we use the designs generated by Halide-HLS [25] and Hipacc [27], two state-of-the-art image processing DSLs that generate better results than previous approaches (e.g., Xilinx OpenCV). In contrast to these, which implement dedicated HLS code generators, AnyHLS is essentially implemented as a library within the AnyDSL framework, as illustrated in Figure 2. Our focus is to show that higher-order abstractions, together with partial evaluation, are powerful enough to design a library targeting different HLS compilers.

1) Experiments using Xilinx Vivado HLS: We evaluate the results of circuits generated using AnyHLS in comparison with the domain-specific language approaches Hipacc and Halide-HLS. We consider two representative applications from the Halide-HLS repository with different configurations (border handling mode and vectorization factor): Gauss and Harris. These DSLs have been developed by FPGA experts and perform better than many other existing libraries. The applications were rewritten for Hipacc and AnyHLS by respecting their original descriptions. This ensures that the Halide-HLS applications have been implemented with adequate scheduling primitives. The Hipacc and AnyHLS implementations require only the algorithm descriptions as input.

For almost all applications in Tables II and III, the AnyHLS implementations demand fewer resources and deliver higher performance. Of course, this improvement mainly stems from our library implementation. AnyHLS achieves a lower latency mainly for the following reasons:

i) The latency of a local operator generated from AnyHLS' image processing library corresponds to the theoretical latency given in Equation (4), which is L = L_arith + 1050626 clock cycles for Gauss when v = 1.
L_arith = 14 for AnyHLS' Gauss implementation, as shown in Table II.

ii) Halide-HLS pads input images according to the selected border handling mode (even when no border handling is defined). This increases the input image size from (W, H) to (W + w − 1, H + h − 1), and thus the latency.

iii) Hipacc does not pad input images, but runs (H + ⌊h/2⌋) · (W + ⌊w/2⌋) loop iterations for a (W × H) image and a (w × h) window. This is similar to the convolution example in the Vivado Design Suite User Guide [6], but not optimal.

The execution time of an implementation equals T_clk · latency, where T_clk is the clock period of the maximum achievable clock frequency (lower is better). Overall, AnyHLS processes a given image faster than the other DSL implementations.

Halide-HLS uses more on-chip memory for line buffers (see Section IV-C2) than Hipacc and AnyHLS because of its image padding for border handling. Let us consider the number of BRAMs utilized for the Gaussian blur: the line buffers need to hold 4 image lines for the 5 × 5 kernel at an image width of 1024 pixels. AnyHLS and Hipacc use eight 18K BRAMs for this, as shown in Table II. Halide-HLS, however, stores integer pixels for the padded image, which requires 16 18K BRAMs to buffer four image lines. This doubles the BRAM usage (see Table III).

AnyHLS uses the vectorization architecture proposed in [40]. This improves the use of registers compared to Hipacc and Halide.

The performance metrics and resource usage reported by Vivado HLS correlate with our Impala descriptions; hence we claim that the HLS code generated from AnyHLS' image processing library does not entail severe side effects for the synthesis with Vivado HLS. Hipacc and Halide-HLS have dedicated compiler backends for HLS code generation. These could be improved to achieve performance similar to AnyHLS. However, this is not a trivial task and is prone to errors. The advantage of AnyDSL's partial evaluation is that the user has control over code generation. Extending AnyHLS' image processing library only requires adding new functions in Impala (see Figure 2). Our intention in comparing AnyHLS with these DSLs is to show that we can generate equally good designs without creating an entire compiler backend.

[Table II. PPnR results for the Xilinx Zynq board for images of size 1024 × 1024 and T_target = 5 ns (corresponds to f_target = 200 MHz). Border handling is undefined. Results are listed per application, vectorization factor v, and framework.]

[Table III. PPnR results for the Gaussian blur with clamping at the borders. Image sizes are 1024 × 1024, v = 1, f_target = 200 MHz. Results are listed per framework.]

2) Experiments using Intel FPGA SDK for OpenCL (AOCL): Table IV presents the implementation results for an edge detection algorithm provided as a design example by Intel. The algorithm consists of RGB-to-Luma color conversion, Sobel filters, and thresholding. Intel's implementation consists of a single work-item kernel that utilizes shift registers according to the FPGA design paradigm. Such techniques are recommended by Intel's optimization guide [7], even though the same OpenCL code performs drastically worse on other computing platforms.

[Table IV. PPnR results of an edge detection application for the Intel Cyclone V. Image sizes are 1024 × 1024. None of the implementations use DSPs. Results are listed per vectorization factor v and framework.]

We described Intel's handwritten SobelLuma example in Hipacc and AnyHLS. Both Hipacc and AnyHLS provide a higher throughput even without vectorization. In order to reach the memory bound, we would have to rewrite Intel's hand-tuned design example to exploit further parallelism.
AnyHLS uses slightly fewer resources, whereas Hipacc provides a slightly higher throughput for all vectorization factors. Similar to Figure 7, both frameworks yield throughputs very close to the memory bound of the Intel Cyclone V.

[Figure 8. Design space for a 5 × 5 mean filter using an NDRange kernel (via the num_compute_units / num_simd_work_items attributes) and AnyHLS (via the vectorization factor v) for an Intel Cyclone V: throughput in MPixel/s over hardware resources (logic utilization in %).]

[Figure 9. Throughput in MPixel/s on an Intel Cyclone V for the implementations of MF, Gauss, Jacobi, Bilateral, FChain, and Harris generated from AnyHLS and Hipacc. Resource utilization for the same implementations is shown in Table V.]

The OpenCL NDRange kernel paradigm conveys multiple concurrent threads for data-level parallelism. OpenCL-based HLS tools exploit this paradigm to synthesize hardware. AOCL provides attributes for NDRange kernels to transform their iteration space: the num_compute_units attribute replicates the kernel logic, whereas num_simd_work_items vectorizes the kernel implementation. (These parallelization attributes are suggested in [7] for NDRange kernels, not for single work-item kernels using shift registers such as the edge detection application shown in Table IV.) Combinations of those span a vast design space for the same NDRange kernel. However, as Figure 8 demonstrates, AnyHLS achieves implementations that are orders of magnitude faster than those obtained using attributes in AOCL.

Finally, Table V and Figure 9 present a comparison between AnyHLS and the AOCL backend of Hipacc [45]. As shown in Figure 2, Hipacc has an individual backend and a template library written with preprocessor directives to generate high-performance OpenCL code for FPGAs. In contrast, the application and library code in AnyHLS stays the same. The generated AOCL code consists of a loop that iterates over the input image. Compared to Hipacc, AnyHLS achieves similar performance but outperforms Hipacc for multi-kernel applications such as the Harris corner detector. This shows that AnyHLS optimizes the inter-kernel dependencies better than Hipacc (see Section IV-B2).

[Table V. PPnR results for the Intel Cyclone V. Missing numbers (-) indicate that the generated implementations do not fit the board. Results are listed per application, vectorization factor v, and framework.]

VI. CONCLUSIONS

In this paper, we advocate the use of modern compiler technologies for high-level synthesis. We combine functional abstractions with the power of partial evaluation to decouple a high-level algorithm description from the hardware design that implements it. This process is entirely driven by code refinement, generating input code to HLS tools, such as Vivado HLS and AOCL, from the same code base. To specify important abstractions for hardware design, we have introduced a set of basic primitives. Library developers can rely on these primitives to create domain-specific libraries. As an example, we have implemented an image processing library for synthesis to both Intel and Xilinx FPGAs. Finally, we have shown that our results are on par with or even better in performance than state-of-the-art approaches.

ACKNOWLEDGMENTS

This work is supported by the Federal Ministry of Education and Research (BMBF) as part of the Metacca, MetaDL, ProThOS, and REACT projects as well as the Intel Visual Computing Institute (IVCI) at Saarland University.
It was also partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 146371743 – TRR 89 "Invasive Computing". Many thanks to our colleague Puya Amiri for his work on the pipeline support.

REFERENCES

[1] J. Bachrach et al., "Chisel: Constructing hardware in a Scala embedded language", in Proc. of the 49th Annual Design Automation Conf. (DAC), IEEE, Jun. 3–7, 2012.
[2] Y. Liu et al., "A Scala based framework for developing acceleration systems with FPGAs", Journal of Systems Architecture, vol. 98, 2019.
[3] J. Decaluwe, "MyHDL: A Python-based hardware description language", Linux Journal, no. 127, 2004.
[4] J. Cong et al., "High-level synthesis for FPGAs: From prototyping to deployment", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 30, no. 4, 2011.
[5] J. Cong et al., "Automated accelerator generation and optimization with composable, parallel and pipeline architecture", in Proc. of the 55th Annual Design Automation Conf. (DAC), ACM, Jun. 24–29, 2018.
[6] Xilinx, Vivado Design Suite User Guide: High-Level Synthesis, UG902, 2017.
[7] Intel, Intel FPGA SDK for OpenCL: Best Practices Guide, 2017.
[8] R. Leißa et al., "AnyDSL: A partial evaluation framework for programming high-performance libraries", Proc. of the ACM on Programming Languages (PACMPL), vol. 2, no. OOPSLA, Nov. 4–9, 2018.
[9] L.-N. Pouchet et al., "Polyhedral-based data reuse optimization for configurable computing", in Proc. of the ACM/SIGDA Int'l Symp. on Field Programmable Gate Arrays, ACM, 2013.
[10] R. Nane et al., "A survey and evaluation of FPGA high-level synthesis tools", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, 2015.
[11] G. Martin and G. Smith, "High-level synthesis: Past, present, and future", IEEE Design & Test of Computers, vol. 26, no. 4, 2009.
[12] D. F. Bacon et al., "FPGA programming for the masses", Communications of the ACM, vol. 56, no. 4, 2013.
[13] S. A. Edwards, "The challenges of synthesizing hardware from C-like languages", IEEE Design & Test of Computers, vol. 23, no. 5, 2006.
[14] J. Sanguinetti, "A different view: Hardware synthesis from SystemC is a maturing technology", IEEE Design & Test of Computers, vol. 23, no. 5, 2006.
[15] D. Koeplinger et al., "Spatial: A language and compiler for application accelerators", in Proc. of the 39th ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), ACM, Jun. 18–22, 2018.
[16] H. Eran et al., "Design patterns for code reuse in HLS packet processing pipelines", IEEE, 2019.
[17] J. S. da Silva et al., "Module-per-object: A human-driven methodology for C++-based high-level synthesis design", IEEE, 2019.
[18] D. Richmond et al., "Synthesizable higher-order functions for C++", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, 2018.
[19] M. A. Özkan et al., "A highly efficient and comprehensive image processing library for C++-based high-level synthesis", in Proc. of the 4th Int'l Workshop on FPGAs for Software Programmers (FSP), VDE, 2017.
[20] J. de Fine Licht et al., "Transformations of high-level synthesis codes for high-performance computing", The Computing Research Repository (CoRR), 2018. arXiv: 1805.08288 [cs.DC].
[21] G. Ofenbeck et al., "Spiral in Scala: Towards the systematic construction of generators for performance libraries", in Proc. of the Int'l Conf. on Generative Programming: Concepts & Experiences (GPCE), ACM, Oct. 27–28, 2013.
[22] P. Milder et al., "Computer generation of hardware for linear digital signal processing transforms", ACM Trans. on Design Automation of Electronic Systems (TODAES), vol. 17, no. 2, 2012.
[23] J. Hegarty et al., "Darkroom: Compiling high-level image processing code into hardware pipelines", ACM Trans. on Graphics (TOG), vol. 33, no. 4, 2014.
[24] J. Hegarty et al., "Rigel: Flexible multi-rate image processing hardware", ACM Trans. on Graphics (TOG), vol. 35, no. 4, 2016.
[25] J. Pu et al., "Programming heterogeneous systems from an image processing DSL", ACM Trans. on Architecture and Code Optimization (TACO), vol. 14, no. 3, 2017.
[26] J. Ragan-Kelley et al., "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines", in Proc. of the Conf. on Programming Language Design and Implementation (PLDI), ACM, Jun. 16–19, 2013.
[27] O. Reiche et al., "Generating FPGA-based image processing accelerators with Hipacc", in Proc. of the Int'l Conf. on Computer Aided Design (ICCAD), IEEE, Nov. 13–16, 2017.
[28] N. Chugh et al., "A DSL compiler for accelerating image processing pipelines on FPGAs", in Proc. of the Int'l Conf. on Parallel Architecture and Compilation Techniques (PACT), ACM, Sep. 11–15, 2016.
[29] Y. Chi et al., "SODA: Stencil with optimized dataflow architecture", IEEE, 2018.
[30] R. Stewart et al., "A dataflow IR for memory efficient RIPL compilation to FPGAs", in Proc. of the Int'l Conf. on Algorithms and Architectures for Parallel Processing (ICA3PP), Springer, Dec. 14–16, 2016.
[31] M. Kristien et al., "High-level synthesis of functional patterns with Lift", in Proc. of the 6th ACM SIGPLAN Int'l Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY@PLDI), Phoenix, AZ, USA, Jun. 22, 2019.
[32] R. Baghdadi et al., "Tiramisu: A polyhedral compiler for expressing fast and portable code", in Proc. of the IEEE/ACM Int'l Symp. on Code Generation and Optimization (CGO), IEEE, Feb. 16–20, 2019.
[33] E. Del Sozzo et al., "A unified backend for targeting FPGAs from DSLs", in Proc. of the 29th Annual IEEE Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, Jul. 10–12, 2018.
[34] R. Leißa et al., "Shallow embedding of DSLs via online partial evaluation", in Proc. of the Int'l Conf. on Generative Programming: Concepts & Experiences (GPCE), ACM, Oct. 26–27, 2015.
[35] M. A. Özkan et al., "A journey into DSL design using generative programming: FPGA mapping of image border handling through refinement", in Proc. of the 5th Int'l Workshop on FPGAs for Software Programmers (FSP), VDE, 2018.
[36] N. D. Jones et al., Partial Evaluation and Automatic Program Generation. Prentice Hall, 1993.
[37] Y. Futamura, "Partial computation of programs", in Proc. of the RIMS Symposia on Software Science and Engineering, 1982.
[38] C. Consel, "New insights into partial evaluation: The SCHISM experiment", in Proc. of the 2nd European Symp. on Programming (ESOP), Springer, Mar. 21–24, 1988.
[39] M. Schmid et al., "Loop coarsening in C-based high-level synthesis", in Proc. of the 26th Annual IEEE Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, 2015.
[40] M. A. Özkan et al., "Hardware design and analysis of efficient loop coarsening and border handling for image processing", in Proc. of the Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, Jul. 10–12, 2017.
[41] G. Stitt et al., "Scalable window generation for the Intel Broadwell+Arria 10 and high-bandwidth FPGA systems", in Proc. of the ACM/SIGDA Int'l Symp. on Field-Programmable Gate Arrays (FPGA), ACM, Feb. 25–27, 2018.
[42] Y.-k. Choi et al., "A quantitative analysis on microarchitectures of modern CPU-FPGA platforms", in Proc. of the 53rd Annual Design Automation Conf. (DAC), ACM, Jun. 5–9, 2016.
[43] Core evaluation license agreement.
[44] Intel program license subscription agreement.
[45] M. A. Özkan et al., "FPGA-based accelerator design from a domain-specific language", in