A Compiler Infrastructure for Accelerator Generators
Rachit Nigam*
Cornell University, USA
Samuel Thomas*
Cornell University, USA
Zhijing Li
Cornell University, USA
Adrian Sampson
Cornell University, USA
ABSTRACT
We present Calyx, a new intermediate language (IL) for compiling high-level programs into hardware designs. Calyx combines a hardware-like structural language with a software-like control flow representation with loops and conditionals. This split representation enables a new class of hardware-focused optimizations that require both structural and control flow information, which are crucial for high-level programming models for hardware design. The Calyx compiler lowers control flow constructs using finite-state machines and generates synthesizable hardware descriptions.

We have implemented Calyx in an optimizing compiler that translates high-level programs to hardware. We demonstrate Calyx using two DSL-to-RTL compilers, a systolic array generator and one for a recent imperative accelerator language, and compare them to equivalent designs generated using high-level synthesis (HLS). The systolic arrays are 4.× faster and 1.× larger on average than HLS implementations, and the HLS-like imperative language compiler is within a few factors of a highly optimized commercial HLS toolchain. We also describe three optimizations implemented in the Calyx compiler.

1 INTRODUCTION

Hardware design is a language problem. While custom hardware accelerators are economically justified in a post-Moore's-law era, we have yet to see widespread adoption. Even though reconfigurable architectures, such as field-programmable gate arrays (FPGAs), make it easy to deploy accelerators, the tooling and languages inhibit ubiquitous use. Hardware description languages (HDLs) operate at the level of gates, wires, and clock cycles; while this level of abstraction is useful for designing high-end processors, it is inappropriate for the rapid design of computational accelerators. To liberate hardware design from these low-level abstractions, researchers have proposed several compilers for high-level specification languages.
The traditional approach is high-level synthesis (HLS): to compile legacy software languages such as C, C++, or OpenCL to HDLs [3, 14, 24, 44, 45]. However, such languages are a poor fit for generating hardware—they reflect pointer-based, sequential, von Neumann models of computation. The hardware they seek to generate is pervasively parallel, without a unified address space, and free from program counters.

The cavernous semantic gap between C++ and HDLs motivates a more domain-specific approach. A new wave of hardware languages and compilers focus on a specific application category [30, 40], on a specific architecture style [8], or on lifting hardware-level concerns into a restricted imperative language [18, 25]. These narrower languages sacrifice the familiarity and backwards compatibility of traditional HLS to simplify compilation, generate better hardware, and avoid the uncanny valley of inconsistent software-like semantics.

* Equally contributing authors.
They can focus on providing high-level abstractions that concisely capture the parallelism of the application domain.

DSL-to-hardware compilers, however, remain substantial feats of engineering. The compiler developer needs not only to conceive of a high-level architecture; they must also design a data path and a control path to implement the execution strategy and perform architectural optimizations [8, 18]. Each such compiler re-engineers a new intermediate language (IL) to encode the high-level semantics of the input language while exposing architectural information to perform optimizations. A shared IL, along with a compiler infrastructure that implements useful optimizations and analyses, will let compiler engineers design new hardware DSLs and quickly get competitive hardware designs.

We propose Calyx, a new intermediate language for compiling DSLs to hardware. Calyx combines a software-like imperative sub-language, which explicitly represents the control flow of a design, with a structural language, which instantiates hardware modules and describes connections between them. Frontend compilers can specify architectural details using the structural sub-language and rely on the high-level control language to encode a DSL's semantics. The Calyx compiler optimizes these programs, generates control logic, and emits synthesizable RTL.

The contributions of this paper are:

• Calyx, an intermediate language for compiling DSLs to hardware that uses a split representation combining a high-level control flow language with a hardware-like structural language.

• An open-source pass-based compiler for analyzing, optimizing, and lowering Calyx programs to synthesizable RTL.
• The implementation of two compilers that target Calyx: (1) a PE-parametric systolic array generator that encodes the data movement and computation schedule using Calyx's control language, and (2) Dahlia [25], a general-purpose programming language for accelerator design which has a preexisting backend targeting HLS toolchains.

• Three optimizations implemented within the Calyx compiler: resource sharing, live-range-based register sharing, and a pass to infer cycle latencies.
This section introduces Calyx by using it to implement a parallel reduction tree. A reduction tree applies an operator to many inputs to produce a single output. Figure 1b shows a small summation tree on four inputs. The operators within a tree level run in parallel to produce the inputs to the next level. Unlike hardware description languages (HDLs) or high-level synthesis (HLS), Calyx programs are meant to be generated by compiler frontends. We show that with Calyx's control language, compilers can encode the semantics of high-level languages while producing programs amenable to hardware optimization.

  // (1) Data path specification
  group add0 {
    // m0[i], m1[i]
    m1.addr = i.out;
    m2.addr = i.out;
    // m0[i] + m1[i]
    a0.l = m1.out;
    a0.r = m2.out;
    // r = m0[i] + m1[i]
    r0.in = a0.out;
  }
  group add1 { ... }
  group add2 {
    a2.l = r0.out;
    a2.r = r1.out;
    r2.in = a2.out;
  }

  // (2) Execution schedule
  while cmp.out with cond {
    seq {
      // layer 1
      par { add0; add1 }
      // layer 2
      add2;
      incr_idx;
    }
  }

  // (3) Optimization change
  group add2 {
    a0.l = r0.out;
    a0.r = r1.out;
    r2.in = a0.out;
  }

Figure 1: Calyx describes the reduction tree using its split representation. The execution schedule makes the control flow explicit while groups encapsulate connections between hardware modules. Done signals (Section 3.3) elided from group definitions. (a) Calyx program; groups incr_idx and cond elided. (b) Initial architecture (groups marked). (c) Optimized architecture.
Figure 1a shows a Calyx program fragment that implements a parallel reduction tree that computes (m0 + m1) + (m2 + m3). The program uses groups to specify the data path. Groups encapsulate hardware connections that implement an action. For example, the group add0 uses the hardware adder a0 to compute the sum of the first two inputs and save the result in a register r0. The assignments used inside groups correspond to non-blocking assignments in RTL languages—updates to the right-hand side of an assignment are immediately propagated to the left-hand side. In this way, each group encapsulates a data flow graph.

To compute the reduction, we need to schedule the execution of the layers. We want to execute the layers sequentially and to run the adders inside a tree layer in parallel. The Calyx program specifies the reduction tree's schedule using a separate control language. The control language uses group names to activate hardware connections. Unlike groups, control statements have no direct hardware analog—instead, they resemble a small imperative program with explicit parallelism. The schedule iterates over the memories using a while statement and sequences the execution of the layers using the seq operator. The par operator specifies that the adders in the first layer will be executed in parallel. Finally, the loop body uses the group incr_idx to increment the index into the memories.

Figure 1b shows the high-level architecture generated from the Calyx program and marks the connections that correspond to the groups. The figure elides the control circuitry generated to implement the schedule.

High-level specifications of accelerators encode a treasure trove of control flow information that is lost when lowering to a register-transfer level (RTL) language. Compilers for such programming models need a stable intermediate language (IL) to capture and use such information.
However, RTL is ill-suited for this task. RTL languages do not distinguish between control flow and data flow because they implement both using the same structural constructs. For example, in order to sequence two operations, an RTL program must implement a state machine to track the current state. Such a state machine is implemented using registers and adders which are indistinguishable from the registers and adders used to implement the program's data flow. This conflation means that a compiler cannot automatically extract and transform the control flow of an arbitrary RTL program.

Consider an optimization that reuses existing circuitry to perform temporally disjoint computations. For example, our reduction tree uses adders a0 and a2 in two different stages and never overlaps their execution. Therefore, it would be safe to transform the program to share a single adder for both stages. Implementing this optimization in RTL, however, is difficult because the structural implementation of a state machine obscures the program's control flow. To determine that the two adders run at different times, an analysis would need to reverse-engineer the execution schedule from the state machine implementation. Furthermore, transforming an RTL program would require pervasive changes. Figure 1c shows the optimized architecture. The transformation requires carefully rewiring the input and output signals for a0 through multiplexers.

In contrast, a Calyx program makes the control flow explicit and enables straightforward transformation. Given the execution schedule of our Calyx program, it is clear that the groups add0 and add2 do not execute simultaneously since they are scheduled using the seq operator. Figure 1a shows the only change required to implement this optimization. The Calyx program simply renames the uses of a2 in group add2 to a0, and the compiler correctly generates the additional multiplexers and control signals to share the adder.
Calyx is neither a software IL nor a hardware IL. Software ILs, such as LLVM [22], focus on providing a uniform representation of the control flow and data flow of a program. They do not explicitly represent structural facts, such as the mapping of logical adds onto physical adders. On the other hand, hardware ILs focus on a purely structural representation with explicit use of gates, wires, and clocks while conflating data flow with control signals. By marrying structure and control, Calyx provides access to both structural and control flow facts to enable a new class of optimizations that cannot be captured by either style of IL.
The Calyx infrastructure's focal point is its program representation. The Calyx IL aims to represent domain-specific accelerator designs throughout the entire lifetime of a hardware generation pipeline: generation from a language frontend, optimization and lowering, and implementation in a hardware description language. This section describes the Calyx IL; the following sections show how to generate, lower, and optimize the IL.
Calyx programs consist of components, which encapsulate hardware structures and define an execution schedule to orchestrate their behavior:

  component name(inputs) -> (outputs) {
    cells { ... }
    wires { ... }
    control { ... }
  }

The body includes hardware-like structural listings of cells and wires (Section 3.2) and software-like control code (Section 3.3). The input and output ports form the interface to the component and define their size in bits. For example, a component defining a 32-bit integer adder uses these ports:

  component adder(lhs: 32, rhs: 32) -> (sum: 32)

Ports in Calyx are untyped—they can hold any value of a given width. Calyx leaves type-based reasoning to the language frontend.
Calyx programs explicitly instantiate components and define the connections between them in a way that closely resembles RTL languages. This low level of detail gives frontends precise control over fine-grained architectural choices when needed and lets Calyx lower programs to synthesizable RTL. The cells section instantiates components:

  cells {
    a_reg = std_reg(32); // 32-bit register
    add = std_add(32);   // 32-bit adder
  }
This example instantiates a register and an adder that operate on 32-bit values using the std_reg and std_add components. The wires section defines assignments between the ports of components:

  wires {
    add.left = a_reg.out;
    add.right = a_reg.out;
  }
These assignments connect the out port of the register to the two input ports of the adder. The connections are non-blocking: updates to a_reg.out are immediately visible to add.left. This closely resembles non-blocking assignments in RTL languages.

Wire assignments can specify more complex dataflow policies by using guarded assignments:

  add.left = cmp.out ? a_reg.out;
  add.left = !cmp.out ? b_reg.out;

The guarded assignments to the left port of the add component use the value of cmp.out to determine which assignment to activate. Guards are built with ports and a simple language of boolean connectives.

Like its RTL counterparts, Calyx requires that each port have a unique driver—activating multiple assignments in the same cycle results in undefined behavior. This requirement also differentiates Calyx's guarded assignments from Bluespec's atomic rules [26]. While Bluespec resolves conflicting assignments by generating scheduling logic to dynamically abort them, Calyx does not. Being an intermediate language, Calyx trades off the convenient programming abstraction for predictable compilation.

Guarded assignments in Calyx correspond exactly to assignments in RTL languages. By themselves, they can encode arbitrary hardware designs, but they are less amenable to analysis and transformation. The next section describes Calyx's two novel constructs that simplify the specification of a program's structural connections and its execution schedule.
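The guarded-assignment semantics described above can be modeled in a few lines. The sketch below is our own illustrative model, not part of Calyx: ports live in a dictionary, each assignment carries a guard predicate, and the unique-driver rule (at most one active assignment per port per cycle, more is undefined behavior) is modeled as an error.

```python
# Illustrative model of Calyx guarded assignments with the unique-driver
# rule. An assignment is (destination port, guard predicate, source port).

def eval_ports(assignments, env):
    """Evaluate one cycle's worth of guarded assignments over the port
    environment `env`, returning the updated environment."""
    out = dict(env)
    drivers = {}
    for dst, guard, src in assignments:
        if guard(env):
            if dst in drivers:  # two active drivers: undefined behavior
                raise ValueError(f"multiple drivers for {dst}")
            drivers[dst] = env[src]
    out.update(drivers)
    return out

# The two guarded assignments to add.left from the text:
#   add.left = cmp.out ? a_reg.out;
#   add.left = !cmp.out ? b_reg.out;
assigns = [
    ("add.left", lambda e: e["cmp.out"] == 1, "a_reg.out"),
    ("add.left", lambda e: e["cmp.out"] == 0, "b_reg.out"),
]

env = {"cmp.out": 1, "a_reg.out": 10, "b_reg.out": 20}
assert eval_ports(assigns, env)["add.left"] == 10
```

The mutually exclusive guards make the two assignments behave like a multiplexer, which is exactly how such guards lower to RTL.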
Calyx uses groups to encapsulate assignments. Inside a group, assignments must obey the same constraints as RTL—unique drivers for ports, no combinational loops, etc. However, multiple groups can use the same port:

  group assign_one { x_reg.in = 1; }
  group assign_two { x_reg.in = 2; }
Both groups unconditionally assign to the same port. However, since the groups encapsulate the assignments, they are not active by default and do not violate the unique driver requirement. In contrast, RTL languages require programmers to reason about all assignments to a port and weave in control signals to define a unique driver.

The control program determines when groups run:

  control { seq { assign_one; assign_two } }

The control block uses the seq (sequence) statement to specify that assign_one executes first, followed by assign_two. Since the two groups execute at different times, the two assignments to the port x_reg.in do not conflict and Calyx can generate valid RTL.

While control statements like seq can pass the control flow of a program to a group, they have no way to know when to return—groups can encode arbitrary computations that don't have an obvious done condition. To signal when it has finished executing, a group uses a done signal:

  group assign_one {
    x_reg.in = 1;
    assign_one[done] = x_reg.done;
  }

In the above group, we are writing a value to a stateful element x_reg and must wait for the element to signal that the write was committed. The group uses the x_reg.done port to signal that the group's computation has finished.

Interface signals, such as a group's done signal, are used by Calyx to define a calling convention (Section 4.1). A control program passes control flow to a group by setting the group's go signal high, and the group returns control by setting its done signal high. Similarly, components use go and done interface signals to define a consistent calling convention. Calyx's interface is latency-insensitive; it does not reason about the number of cycles needed to execute a computation. Section 4.4 shows how enriching Calyx programs with latency information enables efficient compilation.

Calyx provides several primitives to encode the schedule of components.
We designed these primitives to capture high-level properties such as branching and looping, freeing frontends from the need to realize them in control circuitry.

enable. Naming a group inside a control statement passes control to the group.

par. List of control statements that execute once in parallel.

  par { group_a; seq { group_b; group_c; }; group_d; }

seq. List of control statements executed in order.

  seq { group_a; par { group_b; group_c; }; group_d; }

if. Conditionally executes one of the branches. Specifies a port to use as the 1-bit conditional value (port_name) and a group (cond_group) to compute the value on the port.

  if port_name with cond_group { true_stmt } else { false_stmt }

while. The loop statement is similar to the conditional. It enables cond_group and uses port_name as the conditional value. When the value is high, it executes body_stmt and recomputes the conditional using cond_group.

  while port_name with cond_group { body_stmt }

Calyx programs can use attributes to encode frontend- and pass-specific information, such as the latency of a group. Attributes are key-value pairs. For example, the following group defines an attribute "latency" and associates the value 1 with it:

  group foo<"latency"=1> { ... }

Components are the building blocks of Calyx programs. Each component instantiates subcomponents (cells) and defines the connections between them (wires). The control program defines the execution schedule by enabling groups.

The design principle behind Calyx is thus: in general, frontends generate small groups to perform simple actions, such as incrementing a register or comparing values, and use the control flow program to schedule them. However, when frontends have domain-specific knowledge, they can instantiate complex architectures and encapsulate them using groups.
The Calyx compiler optimizes (Section 5) and lowers Calyx programs into synthesizable RTL. Compilation passes use interface signals, which define a calling convention, to realize a component's execution schedule. The result is a Calyx program with a flat list of guarded assignments and no control statements or groups. The compiler can then directly translate this flattened form into RTL. The primary compilation passes are:

• GoInsertion: Guards all assignments in a group with the group's go interface signal.

• CompileControl: Generates latency-insensitive finite state machines to structurally realize control operators.

• RemoveGroups: Inlines uses of interface signals and eliminates all groups.

• Lower: Translates control-free Calyx to RTL.

• Sensitive: Opportunistically compiles control statements into groups using latency-sensitive FSMs. Only affects groups with the "static" attribute.
Figure 2 illustrates the main steps. This section describes the complete compilation process.
To realize a Calyx program's execution schedule, the compiler needs a mechanism to pass control flow in purely structural programs. We use a pair of interface signals to define this interface: when a group sets another group's go signal high, control is passed to that group and it can enable assignments within it; when a group sets its own done signal high, it passes control back. This interface resembles traditional latency-insensitive hardware design [4].

Most passes treat interface signals like any other 1-bit port. The main compilation passes treat them specially—using them to wire up the control signals. The final compilation pass eliminates interface signals by inlining them.

  // (a) Original program
  group one { x.in = 1; one[done] = x.done; }
  group two { x.in = 2; two[done] = x.done; }
  control { seq { one; two } }

  // (b) GoInsertion
  group one { x.in = one[go] ? 1; one[done] = x.done; }
  group two { x.in = two[go] ? 2; two[done] = x.done; }
  control { seq { one; two } }

  // (c) CompileControl
  group one { ... } // Unchanged
  group two { ... }
  group seq0 {
    // enable contained groups
    one[go] = fsm.out == 0 ? 1;
    two[go] = fsm.out == 1 ? 1;
    // FSM state updates
    fsm.in = fsm.out == 0 & one[done] ? 1;
    fsm.in = fsm.out == 1 & two[done] ? 2;
    seq0[done] = fsm.out == 2 ? 1;
  }
  control { seq0 }

  // (d) RemoveGroups
  wires {
    x.in = fsm.out == 0 ? 1;
    x.in = fsm.out == 1 ? 2;
    fsm.in = fsm.out == 0 & x.done ? 1;
    fsm.in = fsm.out == 1 & x.done ? 2;
    // done condition for the component
    done = fsm.out == 2 ? 1;
  }
  control { /* empty */ }

Figure 2: Calyx realizes the execution schedule by encoding it with structural components. After the CompileControl pass (c), the fsm register encodes the current state for the seq statement.
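The seq encoding shown in Figure 2c is mechanical: one FSM state per child, a go assignment per state, and a state transition on each child's done. As an illustration, the Python sketch below generates that encoding as text; it is our own sketch, and the real pass manipulates an IR rather than strings.

```python
# Illustrative generator for the latency-insensitive seq encoding of
# Figure 2c: state i enables group i, and seeing that group's done signal
# advances the FSM to state i+1. State n (past the last group) means done.

def compile_seq(name, groups):
    lines = [f"group {name} {{"]
    for i, g in enumerate(groups):
        lines.append(f"  {g}[go] = fsm.out == {i} ? 1;")
    for i, g in enumerate(groups):
        lines.append(f"  fsm.in = fsm.out == {i} & {g}[done] ? {i + 1};")
    lines.append(f"  {name}[done] = fsm.out == {len(groups)} ? 1;")
    lines.append("}")
    return "\n".join(lines)

print(compile_seq("seq0", ["one", "two"]))
```

Running it on the two-group example reproduces the assignments of Figure 2c (modulo comments).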
We describe the compilation pipeline by compiling the example Calyx program in Figure 2a.
Inserting go interface signals. Calyx's semantics requires that assignments within a group are only enabled when the group executes. To enforce this requirement, the GoInsertion pass inserts the group's go signal into the guards of the contained assignments. Figure 2b shows the resulting program: one[go] guards assignments in group one while two[go] guards assignments in group two. When all groups are eventually removed, these guards will ensure that the correct set of assignments is active at a given time.
Compiling control using interface signals. The next step in the compilation process is realizing the control program using a structural implementation. Compilation relies on two important properties of Calyx: (1) groups can encode arbitrary computations, and (2) all groups are treated uniformly, regardless of the computation they perform—a group that increments a register is compiled the same way as a group that runs a systolic array.

The CompileControl pass performs a bottom-up traversal of the control program and does the following: (1) for each control statement, such as seq or while, instantiate a new group, called the compilation group, to contain all the structure needed to realize the control statement, (2) implement the schedule by setting the constituent groups' go and done signals, and (3) replace the statement in the control program with the corresponding compilation group. After this pass, every component's control program is reduced to a single group enable.

Figure 2c shows these transformations. The pass defines a new group seq0 to encapsulate the structure required to realize the seq statement, as well as a new register fsm to track the current state. Next, the pass enables the groups contained in the seq by writing to their go interface signals and updates the FSM state when the groups set their done signals high. The done condition for seq0 is when the FSM reaches its final state. Finally, the pass replaces the seq control statement with the group seq0.

Inlining interface signals.
The RemoveGroups pass inlines all uses of interface signals and removes all groups. It performs three transformations:

(1) Add new go and done ports to each component definition and wire them up to the single group enable in the control program.

(2) Collect all writes to a group's go and done signals and inline them into all uses of the signals. If there are multiple writes to a signal, replace the corresponding reads with a disjunction of the written expressions. This step eliminates all interface signals from the component.

(3) Remove all groups. Since all assignments are guarded by expressions that encode the schedule, it is safe to remove the groups and place them in the top-level wires section.

Figure 2d shows the resulting program that contains no groups, interface signals, or control statements.

Code generation. Each component now contains a flat list of guarded assignments. The Lower pass generates SystemVerilog programs by mapping each component to a module, generating wires for all the ports, and threading a clock signal through the design.
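Step (2) of RemoveGroups, replacing each read of an interface signal with the disjunction of the expressions written to it, can be sketched as follows. We model guard expressions as plain strings for illustration; the actual pass rewrites an IR, so this encoding is ours.

```python
# Illustrative sketch of interface-signal inlining: every read of a signal
# such as one[done] is replaced by the disjunction of what was written to it.

def inline_signal(writes, expr):
    """writes: dict mapping an interface signal to the list of expressions
    written to it. expr: a guard that may mention interface signals."""
    for sig, exprs in writes.items():
        replacement = " | ".join(f"({e})" for e in exprs)
        expr = expr.replace(sig, f"({replacement})")
    return expr

# In Figure 2c, the guard on fsm.in reads one[done], which group one
# writes as x.done; inlining substitutes the written expression:
writes = {"one[done]": ["x.done"]}
guard = "fsm.out == 0 & one[done]"
assert inline_signal(writes, guard) == "fsm.out == 0 & ((x.done))"
```

After this substitution (and simplification of the redundant parentheses), the guard matches the one in Figure 2d.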
The CompileControl pass performs a bottom-up traversal of the control program, encodes the control flow of each control statement using structural components, and replaces its use with the corresponding compilation group. This example illustrates the timeline of bottom-up elimination of control statements:

  control {
    par {
      seq { one; two; }
      seq { foo; bar; }
    }
  }
  =>
  control { par { seq0; seq1; } }
  =>
  control { par0; }

We sketch the CompileControl pass's strategies for implementing each control statement in Calyx.

par. A par control block enables all groups inside it and finishes executing when all groups have signaled done once. Since groups may finish executing at different times, the pass generates a 1-bit register to save each child group's done signal. The go signal for each child group is set to high when the value in this register is 0. The done signal for the compilation group is 1 when all the 1-bit registers output 1.

if. Calyx's semantics dictate that an if statement executes a group cond before reading the value from a port and deciding which branch to execute; cond is supposed to update the value on the port. The pass generates two 1-bit registers: cc, which tracks whether cond has been executed, and cs, which stores the value of the port generated after executing cond to ensure that the value of the port is available throughout the execution of the branches. The compilation group enables either branch using the value in cs and finishes executing when the branch's done signal is high.

while. The loop compilation strategy resembles the one for if. The group runs the condition group, saves the value from the condition port to a register, and uses it to enable the group in the body. The compilation group finishes executing when the value of the conditional port is 0.

Resetting compilation groups. Compilation groups reset their internal state to operate correctly within loops. The pass generates assignments that reset the value of internal state elements when a compilation group sets its done signal high.
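The par strategy can be simulated in miniature. The sketch below is our own behavioral model, with per-child latencies standing in for arbitrary group computations: each child has a 1-bit done register, its go stays high while that register is 0, and the block is done when all registers read 1.

```python
# Behavioral model of the par compilation strategy: 1-bit registers record
# which children have signaled done; the block finishes when all have.

def run_par(latencies):
    """latencies: cycles each child group takes.
    Returns the cycle count until the par block signals done."""
    done_reg = [0] * len(latencies)
    elapsed = [0] * len(latencies)
    cycles = 0
    while not all(done_reg):
        for i, lat in enumerate(latencies):
            if not done_reg[i]:        # go is high: child keeps executing
                elapsed[i] += 1
                if elapsed[i] == lat:  # child raises done; register saves it
                    done_reg[i] = 1
        cycles += 1
    return cycles

assert run_par([1, 3, 2]) == 3  # a par block finishes with its slowest child
```

The saved done bits are exactly the internal state that must be reset when the compilation group finishes, so the block behaves correctly on the next loop iteration.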
The default compilation pass, CompileControl, generates latency-insensitive finite-state machines (FSMs) when realizing a component's schedule. Such latency-insensitive designs allow the execution schedule to uniformly reason about multi-cycle components and groups. The cost of this approach, however, is the additional hardware and additional execution cycles required to coordinate with the interface signals. Frontend compilers can often provide latency information that the compiler can exploit to build smaller and faster hardware.

We implemented a pass that can opportunistically generate latency-sensitive FSMs when latency information is available. This pass is best-effort: it only attempts to generate such FSMs when latency information is available and gracefully falls back to CompileControl. The encapsulation property of groups enables these kinds of best-effort passes—the compilation pipeline does not have to reason about what is inside a group to compile it.

The key benefits of this approach are: (1) frontends can quickly build a functioning end-to-end flow and incrementally add latency information to generated programs, and (2) latency-sensitive compilation is just an optimization—it can be disabled, debugged, and interacted with separately from the compilation pipeline. To the best of our knowledge, Calyx's ability to fluidly mix latency-sensitive and latency-insensitive compilation is unique. Prior systems intertwine latency information through the compilation process, so either everything is statically timed [44] or nothing is [16]. Section 6.2 shows how a frontend can generate latency information, Section 7.2 demonstrates that the pass speeds up designs by 1.× without an area penalty, and Section 5.3 demonstrates how latency information can be automatically inferred in certain cases.

  // Figure 3a: defined groups. r0 and r1 are registers; a0 and a1 are adders.
  group let_r0 { r0.in = 0; }
  group let_r1 { r1.in = 0; }
  group incr_r0 {
    a0.l = r0.out; a0.r = 1;
    r0.in = a0.out;
  }
  group incr_r1 {
    a1.l = r1.out; a1.r = 1;
    r1.in = a1.out;
  }

  // Figure 3b: schedule with resource sharing opportunities.
  seq {
    par { let_r0; let_r1; }
    incr_r0;
    incr_r1;
  }

Figure 3: Resource sharing example. Since incr_r0 and incr_r1 do not run in parallel, they can share their adders.

Compiling seq. The latency-sensitive compilation pass, Sensitive, traverses the control program bottom-up and opportunistically compiles control statements when all of the nested groups specify their latency using the static attribute (Section 3.5):

  group one<"static"=1> { ... }
  group two<"static"=2> { ... }
  control { seq { one; two } }

It generates an FSM with a self-incrementing counter, enables each group for the specified number of cycles, and ignores the done signal from the groups:

  group static_seq0<"static"=3> {
    one[go] = fsm.out >= 0 && fsm.out < 1 ? 1;
    two[go] = fsm.out >= 1 && fsm.out < 3 ? 1;
    // Increment the FSM.
    fsm.in = fsm.out < 3 ? fsm.out + 1;
    static_seq0[done] = fsm.out == 3 ? 1;
  }

When compiling seq, par, or if statements, the pass uses the latency information of the contained groups to generate a static attribute for the generated compilation group.

The pass demonstrates how Calyx enables the development of small, modular passes that interact with the broader infrastructure. It is feasible because the IL has a well-defined semantics that lets passes reason independently about the preservation of program semantics.

We describe the design and implementation of three optimizations that demonstrate Calyx's ability to support control-flow-sensitive optimizations.
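The static-latency bookkeeping that the Sensitive pass relies on is a simple bottom-up fold: a seq's latency is the sum of its children's and a par's is the maximum, with a fallback to latency-insensitive compilation when any child lacks a latency. A sketch over a nested-tuple encoding of our own (not the compiler's IR):

```python
# Bottom-up static latency computation, as used when deciding whether a
# control statement can be compiled with a latency-sensitive FSM.

def static_latency(ctrl, group_latency):
    """ctrl: a group name, or ("seq", [children]) / ("par", [children]).
    group_latency: dict from group name to its static latency, if known.
    Returns the total latency, or None if any leaf lacks one (in which
    case compilation must fall back to the latency-insensitive scheme)."""
    if isinstance(ctrl, str):
        return group_latency.get(ctrl)
    op, children = ctrl
    lats = [static_latency(c, group_latency) for c in children]
    if any(l is None for l in lats):
        return None
    return sum(lats) if op == "seq" else max(lats)

lat = {"one": 1, "two": 2, "foo": 4}
assert static_latency(("seq", ["one", "two"]), lat) == 3   # cf. static_seq0
assert static_latency(("par", [("seq", ["one", "two"]), "foo"]), lat) == 4
assert static_latency(("seq", ["one", "bar"]), lat) is None
```

The first assertion matches the `"static"=3` attribute on `static_seq0` above.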
Resource sharing is an optimization that reuses existing circuits to perform temporally disjoint computations. For example, if an accelerator needs to perform two add operations that are never executed in parallel, it can map them to the same physical adder. Calyx is uniquely suited to implementing such optimizations, which require both control flow facts (do two computations run in parallel?) and structural facts (which physical adder performs an add?).

Calyx implements a group-level resource sharing optimization: if two groups are guaranteed to never execute in parallel, they can share components. This pass does not attempt to share stateful components because state is visible across groups. Frontends use the "share" attribute (Section 3.5) to denote that a component is safe to share:

  component adder<"share"=1> { ... }
The pass uses the execution schedule of a component to calculate which groups may run in parallel and uses the encapsulation property of groups to implement sharing. It proceeds in three steps:
Building a conflict graph. A conflict graph summarizes potential conflicts—nodes denote groups and edges denote that the groups may run in parallel. The pass traverses the control program and adds edges between all children of a par block. For example, in Figure 3b, the groups let_r0 and let_r1 conflict with each other while incr_r0 and incr_r1 do not. If the children of the par block are themselves control programs, the pass adds edges between the groups contained within each child.
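This traversal can be sketched directly. The nested-tuple encoding of control programs below is our own, not the compiler's IR: every group reachable in one child of a par conflicts with every group reachable in its siblings.

```python
# Illustrative conflict-graph construction: an edge means two groups may
# run in parallel, i.e. they sit under different children of some par.
from itertools import combinations

def groups_in(ctrl):
    """All group names reachable in a control program."""
    if isinstance(ctrl, str):
        return {ctrl}
    _, children = ctrl
    return set().union(*(groups_in(c) for c in children))

def conflicts(ctrl, edges=None):
    edges = set() if edges is None else edges
    if isinstance(ctrl, str):
        return edges
    op, children = ctrl
    if op == "par":
        for a, b in combinations(children, 2):
            for g in groups_in(a):
                for h in groups_in(b):
                    edges.add(frozenset((g, h)))
    for c in children:
        conflicts(c, edges)
    return edges

# Figure 3b: let_r0 and let_r1 conflict; incr_r0 and incr_r1 do not.
sched = ("seq", [("par", ["let_r0", "let_r1"]), "incr_r0", "incr_r1"])
assert conflicts(sched) == {frozenset(("let_r0", "let_r1"))}
```

The recursion into par children implements the rule that groups nested inside different children also conflict with one another.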
Greedy coloring.
The pass performs a greedy coloring over theconflict graph to allocate shareable components to each group. Iftwo groups have an edge between them, they cannot have the samecomponents. The result of this step is a mapping from the namesof old components to new components. For example, in Figure 3a, incr_r1 gets the mapping: 𝑎 ↦→ 𝑎 Group rewriting.
In the final step, the pass applies local rewrites to groups based on the mapping. The simplicity of this step comes from the encapsulation property of groups: a rewriter does not have to reason about uses of a component outside the group. Resource sharing demonstrates Calyx's flexibility in analysis and transformation: passes can recover control flow information from the schedule and use groups to perform local reasoning.
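The three steps above can be sketched in a few lines. This is an illustrative model, not the Calyx compiler's implementation: the control tree, `conflict_graph`, and `share` are our own representations, and the palette-based fallback is a simplifying assumption.

```python
# Sketch of group-level resource sharing: build a conflict graph from the
# schedule, then greedily color shareable components.
from itertools import combinations

def conflict_graph(control):
    """`control` is a nested tree: ('seq', ...), ('par', ...), or a group
    name. Returns edges between groups that may run in parallel."""
    edges = set()
    def collect(node):
        if isinstance(node, str):
            return {node}
        op, *children = node
        sets = [collect(c) for c in children]
        if op == 'par':
            # Groups in distinct children of a `par` may run in parallel.
            for a, b in combinations(sets, 2):
                edges.update((x, y) for x in a for y in b)
        return set().union(*sets)
    collect(control)
    return edges

def share(control, uses):
    """Greedily remap each group's shareable component (`uses[g]`),
    reusing components across groups that never conflict."""
    edges = conflict_graph(control)
    neighbors = {g: set() for g in uses}
    for a, b in edges:
        if a in uses and b in uses:
            neighbors[a].add(b)
            neighbors[b].add(a)
    palette = sorted(set(uses.values()))
    assignment = {}
    for g in sorted(uses):
        taken = {assignment[n] for n in neighbors[g] if n in assignment}
        # Reuse the first free component; fall back to the group's own.
        assignment[g] = next((c for c in palette if c not in taken), uses[g])
    return assignment

# Figure 3's schedule: incr_r0 and incr_r1 never run in parallel,
# so both can use adder a0.
schedule = ('seq', ('par', 'let_r0', 'let_r1'), 'incr_r0', 'incr_r1')
print(share(schedule, {'incr_r0': 'a0', 'incr_r1': 'a1'}))
# {'incr_r0': 'a0', 'incr_r1': 'a0'}
```

The resulting mapping (a1 ↦ a0) is exactly what the group-rewriting step then applies inside incr_r1.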
Group-local reasoning is insufficient for sharing stateful elements such as registers; writes to a register in one group are visible in other groups. To enable register sharing, we implement a live-range analysis that, for each register, determines the last group in the execution schedule to read from it. Since the register is guaranteed to never be used afterwards, subsequent groups can reuse the register. Live-range analysis is common in software compilers but is infeasible in RTL languages since the control flow of the program is not explicit. The live-range analysis has to contend with two problems: (1) coping with the par blocks in the control program, and (2) inferring which groups read and write to registers.
Parallel control flow graphs.
We handle par blocks using parallel control flow graphs (pCFGs) based on the work of Srinivasan and Wolfe [38]. Most control operators in Calyx map directly to a traditional CFG. However, par statements need special handling since, unlike an if statement, which executes one of its two branches, a par statement executes all its children. While writes to a register in a conditional branch may be visible after the if statement, writes within children of par blocks are always visible after the par block. Parallel CFGs introduce a new kind of node—called a p-node—to handle par blocks: a p-node represents a par block and recursively contains a set of pCFGs representing its children. In Figure 4b, the p-node has two children. Calculating read and write sets.
Calyx implements a conservative analysis pass to determine the registers that groups and p-nodes read from and write to. Both groups and p-nodes can, in general, contain complex logic, so the pass must conservatively over-approximate these sets. The read set is the set of registers a group or p-node may read from and the write set is the set of registers they must write to. The data-flow analysis uses this information to determine the range over which each register is live.

Figure 4: A Calyx program along with the corresponding parallel control flow graph (pCFG). (a) The Calyx program:
  seq {
    A;
    if cond.out with G { B; }
    else { par { seq { x0; x1; } seq { y0; y1; } } }
    C;
  }
(b) A visual representation of the pCFG: the p-node p0 contains two child pCFGs (x0; x1 and y0; y1).
Computing liveness.
The pass uses a standard data-flow formulation to compute the live ranges. The only aspect that needs special handling is the children of p-nodes. For these, we set the live sets at the end of each child to be the set of live registers coming out of the p-node.
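This backward formulation, with each par child's exit set seeded from the p-node's out-set, can be sketched as follows. The node shapes and names are illustrative, not Calyx's IR, and the p-node merge (a plain union of child live-in sets) is deliberately conservative.

```python
# Sketch of backward liveness over a simplified pCFG. A node is either
# ('grp', name, reads, writes) or ('par', [child_node_lists]).

def liveness(nodes, live_out):
    """Return (live-in set, {group: registers live after it})."""
    live, after = set(live_out), {}
    for node in reversed(nodes):
        if node[0] == 'grp':
            _, name, reads, writes = node
            after[name] = set(live)
            # Standard backward equation: kill writes, then add reads.
            live = (live - set(writes)) | set(reads)
        else:
            # p-node: each child's exit live set is seeded with the set
            # flowing out of the whole par block; merge by union.
            _, children = node
            ins = set()
            for child in children:
                child_in, child_after = liveness(child, live)
                ins |= child_in
                after.update(child_after)
            live = ins
    return live, after

# Registers live after each group in: seq { w_x; par { r_x || w_y }; r_y }
prog = [('grp', 'w_x', [], ['x']),
        ('par', [[('grp', 'r_x', ['x'], [])],
                 [('grp', 'w_y', [], ['y'])]]),
        ('grp', 'r_y', ['y'], [])]
live_in, after = liveness(prog, live_out=set())
# `x` stays live after w_x because a par child reads it; `y` is live
# after w_y because the child's exit set was seeded from the p-node's.
print(sorted(after['w_x']), sorted(after['w_y']))  # ['x', 'y'] ['y']
```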
Sharing registers.
The pass uses the liveness information to build a conflict graph where nodes are registers and edges denote overlapping live ranges. The pass performs greedy coloring over this graph using registers as colors and rewrites groups in a similar manner to resource sharing.
The final optimization pass in the Calyx compiler attempts to conservatively infer the latencies of groups and components. This enables the downstream Sensitive pass (Section 4.4) to lower Calyx programs using more efficient, latency-sensitive finite state machines. Consider the following group:

component foo<"static"=1> { ... }
group incr {
  f.in = add.out; // f is an instance of foo.
  f.go = 1'd1;
  incr[done] = f.done;
}

The Calyx program specifies that the latency of the foo component is 1 using the "static" attribute. Given this information, this pass infers the latency of incr to be 1 as well. It follows a simple rule: if a group's done signal is equal to a component's done signal, and if the component's go signal is set to 1 within the group, the latency of the group is inferred to be the same as the component's. Such uses of components occur in groups that simply activate one component and end their execution. This pass is conservative and only works for simple groups. Given Calyx's design principle—that most of the time frontends generate simple groups—such passes can be extremely powerful. Furthermore, such passes can be incrementally improved by adding new rules that enable the pass to infer latencies for more groups and transparently speed up programs. Section 7.1 shows that this pass transparently improves the performance of frontend code.

Figure 5: Architecture for a 2 × 2 systolic array.

We built two compilers that target Calyx for our case studies. The first generates systolic arrays [19] for linear algebra computations. The second compiles Dahlia [25], an imperative programming language that uses a substructural type system to enable predictable hardware design. Our goal in both case studies is to demonstrate how Calyx makes it possible to quickly bring up good compiler implementations for specialized languages.
We do not aim to beat existing commercial HLS compilers, which represent decades of engineering effort.
Systolic arrays [19] are a class of architectures that exploit data reuse. They power the recent wave of state-of-the-art linear algebra accelerators for machine learning [10, 17]. Figure 5 shows an example systolic array. In every time step, data moves from left to right and top to bottom, while the processing elements (PEs) in the grid perform computations on the data streams. Systolic arrays can maintain a high throughput because data is reused between PEs. However, generating a custom systolic array implementation is challenging: producing RTL directly requires generating complex custom control hardware, and systolic arrays' unique parallelism pattern can be challenging to express in HLS C++ [5, 21]. We implement a systolic array generator using Calyx in only 239 LOC of Python and approximately 40 person-hours of effort. The generator can produce arrays with arbitrary dimensions and arbitrary PEs, which are implemented as Calyx components themselves.

Figure 6: Control generated for a 2 × 2 systolic array:
seq {
  par { t0; l0; }  // Move data from memories
  par { pe_00; }   // Run the first PE
  // Move data from memories and from registers
  par { t0; t1; l0; l1; pe_00_down; pe_00_right; }
  // Execute first PE and PEs on diagonals
  par { pe_00; pe_01; pe_10; }
  // Next step...
  par { t1; l1; down_00; down_01; right_00; right_10; }
  par { pe_01; pe_10; pe_11; }
  par { down_01; right_10; }
  par { pe_11; }
}

Input.
The systolic array generator takes the dimensions of the matrix block and a Calyx component that implements the PE. For a matrix multiply accelerator, for example, the PE consists of a multiply–accumulate (MAC) unit. It generates a systolic array that matches the dimensions of the matrix block.
Architecture.
Figure 5 shows the desired architecture for a 2 × 2 systolic array. The generated design uses groups for data movement: the groups on the edges move the data from the input memories to registers, and the ones in the middle move the data along the fabric. Finally, the compute groups perform the computation in the PE and write their results to an internal register. Generating Calyx.
To target Calyx, the systolic array generator needs to (1) instantiate PEs, (2) create the relevant groups, and (3) define the control for the systolic array. The compiler performs (1) and (2) using templates. For each PE, the compiler also instantiates the surrounding input registers and connects them to registers in the previous PE. Finally, it defines groups to move the data and perform the computation. The next step is generating the control. Figure 6 shows the control statements generated for a 2 × 2 systolic array. Inferring latencies.
The systolic array generator does not generate any "static" annotations. However, the Calyx compiler is able to completely infer the latency (Section 5.3) of a generated systolic array when the processing element provides its latency. This means that the generator, by virtue of using the Calyx compiler, automatically supports both latency-sensitive and latency-insensitive systolic arrays.
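The diagonal wavefront of PE activations in Figure 6 can be produced by a short nested loop. A sketch under our own assumptions (group names follow the figure; the real generator emits Calyx seq/par blocks rather than strings):

```python
# Sketch: generating the wavefront PE schedule for an n x n systolic
# array. PE (r, c) first fires at step r + c and consumes one stream
# element per step, so it fires at steps r + c, ..., r + c + n - 1.

def wavefront_schedule(n):
    """Return a list of ('par', [active PE groups]) steps."""
    steps = []
    for t in range(3 * n - 2):  # last PE finishes at step 3n - 3
        active = [f'pe_{r}{c}' for r in range(n) for c in range(n)
                  if r + c <= t <= r + c + n - 1]
        if active:
            steps.append(('par', active))
    return steps

sched = wavefront_schedule(2)
print(sched[1])  # ('par', ['pe_00', 'pe_01', 'pe_10'])
```

For n = 2 this reproduces the PE steps of Figure 6: pe_00 alone, then pe_00 with the diagonal pe_01 and pe_10, then pe_01, pe_10, pe_11, then pe_11 alone. The data-movement groups interleave between these steps.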
Debugging with Calyx.
In an initial version, the generator prematurely enabled data movement groups, causing the systolic array to compute the wrong result. While debugging the kernel, it was easy to spot this mistake in the control program. This demonstrates a key quality-of-life improvement when using the Calyx infrastructure
to build accelerator generators—control logic bugs can be caught by investigating the execution schedule.
Dahlia [25] is a recently proposed general-purpose language for designing accelerators that resembles traditional C-based HLS. It differs from traditional HLS by adding a substructural type system that constrains the language to rule out programs that lead to inefficient hardware. The original Dahlia compiler generates C++ with annotations for the commercial Vivado HLS [44] toolchain. In this case study, we build a new compiler for the Dahlia language that generates hardware using Calyx, eliminating the dependence on a monolithic, closed-source HLS backend and allowing greater control over the generated architecture. The goal is not to outperform the Vivado HLS backend; instead, we aim to show that Calyx makes it possible to exploit Dahlia's unique semantics to build a compiler that is far simpler than a full-fledged C-to-RTL toolchain.
Lowered Dahlia.
Dahlia is a simple imperative language extended with high-level convenience features such as memory partitioning, loop unrolling, and logical array indexing. We elide the details of the first step of compilation that unrolls loops and compiles accesses to partitioned memories. We refer interested readers to our implementation. Our explanation focuses on compiling Dahlia programs that use a small set of constructs: variables, unpartitioned memories, while loops, conditionals, and Dahlia's two novel composition operators: unordered composition (;) and ordered composition (---). In Dahlia, memories and variables have an associated type and can be updated with assignment syntax:

let x: ubit<32> = 1; x := 2;
let arr: ubit<32>[10]; arr[1] := 3;

Dahlia's unordered composition operator allows backends to parallelize computations while preserving data flow:

x = 1; y = 2 // can occur in parallel
In contrast, Dahlia's ordered composition operator requires the backend to execute statements in a sequence:

x = 1 --- x = 2
Ordered composition does not reason about explicit clock cycles. Instead, it imposes a partial order over the execution of program statements by reasoning about logical timesteps. Lowered Dahlia also supports standard imperative while loops and if conditionals. Generating Calyx.
The Calyx backend for Dahlia is a bottom-up pass that compiles each expression by instantiating groups and scheduling them using the control language. For example, for this Dahlia program:

let x = 0
--- if (x > 10) { x = 1 } else { x = 2 }

The Calyx backend generates a group for each statement:

group init_x { x.in = 0; init_x[done] = x.done; }
group one { x.in = 1; one[done] = x.done; }
group two { x.in = 2; two[done] = x.done; }
group cond { gt.left = x.out; gt.right = 10; cond[done] = 1; }
And schedules them using the following control program:

seq {
  init_x;
  if gt.out with cond { one } else { two }
}

The Calyx backend has a one-to-one mapping between the language constructs in lowered Dahlia and the Calyx control language: memory and variable assignments generate groups representing the update, ordered composition becomes seq, unordered composition becomes par, and loops and conditionals map to while and if. Interfacing with black-box RTL.
Dahlia’s HLS backend uses a vendor-provided header to implement custom math functions such assquare root. The HLS compiler connects definitions within suchheaders to black-box RTL code. In order to interact with black-boxRTL components, Calyx programs can provide external definitions: extern "sqrt.sv" { component sqrt(left: 32, right: 32, go : 1) -> (out: 32, done : 1);} External definitions do not provide an implementation; instead theCalyx compiler links in the corresponding RTL program, in thiscase sqrt.sv , during code generation. External components canbe used like any other component: group foo {sqrt0.left = 10; sqrt0.right = 20;sqrt0. go = !sqrt0. done ? 1;foo[ done ] = sqrt0. done } Latency annotations.
Most operations in a Dahlia program have a precise latency: register writes take one cycle, multiplies take four cycles, etc. The Calyx backend uses this information to annotate the latency of each group with the "static" attribute. Some operations, such as the RTL primitive to calculate the square root, take a data-dependent number of cycles, so groups using them omit latency information. Since the Calyx compiler gracefully handles mixed latency-sensitive and latency-insensitive groups, we do not need to change anything else.
In our experience, a Calyx-based compiler requires three ingredients: (1) the abstract architecture for the domain, (2) a mapping from source constructs to Calyx constructs, and (3) a strategy to generate groups and control. For Dahlia, the architecture corresponded directly with the control language; for systolic arrays, we used a templated design with a latency-insensitive interface. In both compilers, we used groups and control to modularize and compose data flow graphs, which is not possible when generating RTL directly.
We evaluate Calyx by generating accelerators using the frontends in the previous section and answering three questions:
• Can we build a simple compiler that generates performant specialized architectures?
• Can we use Calyx to generate reasonable hardware in a general-purpose, HLS-like domain?
• What is the effect of control-flow-sensitive optimizations implemented in the Calyx compiler?
We compare Calyx-generated accelerators to Vivado HLS, a commercial HLS tool that represents decades of engineering effort. Our aim is not to beat HLS at its own game but instead to achieve the same performance regime with much lower effort.

Figure 7: Resource and cycle count comparison for matrix multiply implemented in HLS and as systolic arrays. (a) Absolute cycle counts. (b) Absolute LUT usage.
To the best of our knowledge, Vivado HLS does not automatically infer systolic arrays from loop nests. Instead, programmers need to rewrite their program to coerce the compiler into generating precisely the hardware they want. Calyx advocates for a more domain-specific approach: instead of relying on black-box compilers to infer hardware structures, design new DSLs that automatically synthesize them. We study the performance characteristics of the Calyx-based systolic array generator (Section 6.1).
Evaluation setup.
We generate hardware designs for matrix multiplication kernels ranging from 2 × 2 to 8 × 8. For each configuration, we generate a systolic array using the Calyx-based generator and implement a straightforward matrix-multiply kernel in Vivado HLS that fully unrolls the outer two loops. For the Calyx designs, we collect the number of cycles by simulating the design in Verilator [39] (v4.108) and get resource estimates by synthesizing designs with Vivado [44], targeting a Zynq UltraScale+ XCZU3EG FPGA at a 7 ns clock period. For the HLS designs, we report the latency and resource estimates from the HLS report. We compare the cycle counts (Figure 7a) and the LUT usage (Figure 7b) of the designs. We report the characteristics of systolic arrays compiled with the Sensitive pass (Latency-sensitive) and those without (Latency-insensitive).

Figure 8: Resource and cycle count comparison for Dahlia-generated Calyx designs and HLS designs for PolyBench benchmarks. (a) Cycle slowdown of Calyx designs compared to Vivado HLS; designs below the y-axis are slower. (b) LUT increase of Calyx designs over Vivado HLS; designs below the y-axis are larger. Missing unrolled bars indicate that the benchmark was not unrollable in Dahlia.
Comparison against HLS.
Compared to HLS-based designs, Calyx-based systolic arrays are faster by a geometric mean of 4.× and take 1.× more LUTs. For the largest input size, the systolic array is 10.× faster than the HLS implementation while using 1.× more LUTs. Latency-sensitive compilation.
The systolic array generator does not generate any "static" annotations used by the Sensitive pass. It instead relies on the Calyx compiler to infer these attributes (Section 5.3). On average, Sensitive makes designs 1.× faster and 1.× smaller. Discussion.
Our systolic array case study demonstrates how a language designer can quickly experiment with architectural designs that are harder to express in traditional HLS tools. Without extensive engineering effort, the specialized approach can outperform a general-purpose HLS compiler.
We built the Dahlia-to-Calyx compiler in 2011 LOC of Scala. This includes extensions to the Dahlia compiler that add passes to lower Dahlia-specific constructs as well as the backend to generate Calyx from lowered Dahlia.
Evaluation setup.
We compare the Calyx-generated RTL against the original Dahlia compiler [25], which emits annotated C++ and relies on Vivado HLS to generate hardware designs. We implement all 19 kernels from the linear algebra category of the PolyBench [23] benchmark suite and, for the 11 benchmarks for which Dahlia's type system allows it, unroll the loops to unlock parallelism. We use the same setup as in Section 7.1 to gather numbers. We also evaluate the effects of latency-sensitive compilation (Section 4.4). We run each benchmark with the Sensitive pass enabled and disabled, following the same synthesis and measurement workflow above.
Comparison against HLS.
We collected cycle counts (Figure 8a) and LUT usage (Figure 8b) for each benchmark with all optimizations turned on and normalized them to the corresponding Vivado HLS implementation. For the unrolled designs, we normalize against the corresponding unrolled HLS designs. Since DSP and BRAM usage is almost identical for all benchmarks, we elide them. On average, the Calyx-generated designs are 3.× slower than the designs generated by Vivado HLS and use 1.× more LUTs. For the unrolled designs, Calyx comes closer to HLS execution time, being 2.× slower while taking 2.× more LUTs. Vivado HLS is a heavily optimized toolchain that incorporates state-of-the-art optimizations and is designed to perform well on the kinds of loop nests we evaluated. Latency-sensitive compilation.
Figure 9c shows the effect of the Sensitive pass (Section 4.4) on the Dahlia-to-Calyx compiler. Enabling the optimization reduces execution time on average by 1.× without significantly changing resource usage. Discussion.
Despite its simplicity, the Dahlia frontend for Calyx can already generate designs that are within a few factors of the performance of a heavily optimized, commercial HLS toolchain. Part of the reason is that Dahlia is a far simpler language than C++, which makes a narrowly focused compiler tractable to build. This is the use case for Calyx: rapidly designing compilers for specialized languages and achieving good performance quickly. We see adding traditional HLS-focused optimizations to Calyx, such as SDC scheduling [6], as the main avenue to close the gap with Vivado HLS.
To demonstrate Calyx's ability to express control-flow-based optimizations, we wrote a resource sharing pass (Section 5.1) and a register sharing pass (Section 5.2). We perform an ablation study to characterize their effects on the final designs. Figure 9a reports the resource utilization of PolyBench benchmarks in three configurations: (1) resource sharing enabled, (2) register sharing enabled, and (3) both resource sharing and register sharing turned on. We normalize the resource counts against baselines with both passes disabled.

Figure 9: Effects of optimization passes. All graphs use logarithmic scales. (a) LUT increase from resource sharing and register sharing. (b) Register decrease from the register sharing optimization. (c) Speedup from using latency-sensitive compilation.
While both optimization passes find opportunities to share hardware components, there is not a uniform drop in LUT usage. On average, the resource sharing pass increases LUT usage by 3% and the register sharing pass increases LUT usage by 11%. Sharing hardware components causes additional multiplexers to be instantiated, which makes the resource usage worse in some cases. We plan to implement a heuristic cost model to decide which components are worth sharing (Section 9). Figure 9b shows the effects of the register sharing pass on the number of registers used in the designs. On average, the pass reduces register usage by 12% and finds register sharing opportunities in every benchmark. Registers, compared to multiplexers, are more expensive in ASIC processes, which represents another opportunity for heuristics in a future version of the Calyx compiler.
For the largest PolyBench design (gemver), Calyx takes 0.06 seconds to generate RTL, compared to 26.× for Vivado HLS.

Intermediate representations (IRs) for hardware generation have been a topic of detailed study. Calyx differs from past work because it is not tied to a specific hardware generation methodology as in IRs for HLS compilers [3, 45], it represents programs at a higher level of abstraction than IRs for RTL design [7, 15], and it provides precise control over scheduling logic generation, unlike Bluespec [26].
Bluespec.
Bluespec [26] is an HDL that uses guarded atomic actions to enable compositional hardware design. The Bluespec compiler detects conflicts between such actions, generates a parallel execution schedule, and dynamically aborts rules on conflicts. Calyx requires no implicit dynamic scheduling; it provides explicit control over the execution schedule using its control language.
Halide.
Halide [31] is an image processing DSL that pioneered the separation of algorithmic specifications from the implementation schedule to facilitate performance tuning, and follow-on work has shown how to compile Halide-like languages to hardware [13, 20, 30]. Halide schedules represent optimization strategies, such as loop tiling, that do not affect the algorithm's semantics. Calyx's concept of a schedule is different: it orchestrates and orders the invocation of hardware components and as such determines the program's semantics. Calyx's schedules are appropriate for expressing implementations of optimizations like loop tiling performed by high-level DSL compilers.
Software IRs.
Some hardware generators repurpose software IRs such as LLVM [3, 22, 34, 45], GCC's internal IR [28], and SUIF [2]. Calyx is different from these approaches since it does not limit frontend compilers to sequential, C-like semantics. It can represent both hardware resources and fine-grained parallelism that these representations lack.
IRs for HLS.
Several HLS compilers include IRs that extend their sequential input languages with representations of parallelism. 𝜇IR [35] uses a task-parallel representation, SPARK [12] targets speculation and parallelism optimizations, CIRRF [11] provides primitives for pipelining, and Wu et al. [43] propose a hierarchical CDFG representation. Calyx differs from these IRs by providing lower-level control primitives to explicitly represent hardware resources and avoiding ties to a traditional HLS setting. Another category of HLS IRs uses finite state machines (FSMs) to model programs' execution schedules at the cycle level [9, 32, 37]. While such FSM representations are reminiscent of Calyx's control language, these IRs impose restrictions on the timing behavior of the operations inside the FSMs. Calyx imposes no such restrictions and can compose arbitrary RTL programs while providing an interface to generate optimized latency-sensitive designs when possible.
Languages with hardware parallelism.
Language extensions and DSLs aim to combat the expressivity problems of HLS. They extend C with CSP-like parallelism [1], exploit software-oriented parallel interfaces in C, or support parallel patterns [29]. HeteroCL [20] is a Python-based DSL for optimizing programs above the HLS level of abstraction. These languages are higher level than Calyx and are not appropriate as general IRs because they are tied to specific models of parallelism. Calyx can serve as a backend for them.
IRs for HDLs.
Modern HDL toolchains have IRs for transforming hardware designs [7, 15, 33, 41, 42]. These IRs work at the RTL level of abstraction and are appropriate for representing a finished hardware implementation. For generating and optimizing accelerators from DSLs, however, they have the same abstraction gap problem as any other RTL language. These IRs are potential compilation targets for Calyx.
Calyx provides a useful foundation for exploring the design of higher-level DSLs, compiler optimizations, and target-specific hardware design. We plan to build upon it to explore these ideas.
First-class pipelining.
Pipelines are a crucial building block for high-performance hardware designs. Calyx programs encode pipelines using while loops and par blocks. However, in keeping with Calyx's philosophy of explicit control flow, we plan to design a first-class operator that will enable frontends to explicitly instantiate pipelines. An explicit representation will enable the compiler to implement pipeline-specific optimizations such as automatic buffer insertion. Higher-level control operators, such as pipelining, can be compiled into more primitive control operators, which lets the Calyx IL and compiler incrementally add support for new operators.
Target-specific optimization.
Calyx’s optimization passes do notcurrently use cost models and other heuristics. We plan to extendthe Calyx compiler to support target-specific heuristics that en-able users to make different trade-offs for different targets. Forexample, multiplexers are cheap in ASICs but expensive in FPGAswhile registers are the opposite. Such differences should affect howaggressively optimization passes that share registers are applied.
Burden of synthesizability.
Several factors affect the ability of a design to meet a specific clock period: the fan-out and fan-in factors of modules, the size of the control FSM, and the placement of registers in long combinational paths. Currently, Calyx requires frontends to account for these problems and generate programs that, for example, break up long combinational paths. In the future, we plan to implement passes that can analyze programs for such problems and transform them to make them synthesizable. Compiler developers can then use these passes and shift the burden of synthesizability onto the Calyx compiler.
10 CONCLUSION
The world of specialized hardware accelerator generators needs more shared infrastructure. A common representation of control and structure can enable interoperability between languages while amplifying the impact of cross-cutting optimizations, analyses, transformations, and tools.
ACKNOWLEDGMENTS
We thank Theodore Bauer and Kenneth Fang for their contributions to the implementation of the Calyx compiler. Drew Zagieboylo and Zhiru Zhang provided feedback on the design of Calyx and early drafts of the paper. Luis Vega provided invaluable help in understanding synthesis toolchains and debugging RTL code generation. We also thank the anonymous reviewers and our shepherd, Sophia Shao, for their detailed feedback. This work was supported in part by the Center for Applications Driving Architectures (ADA), one of six centers of JUMP, a Semiconductor Research Corporation program co-sponsored by DARPA. This work is also partially supported by the Intel and NSF joint research center for Computer Assisted Programming for Heterogeneous Architectures (CAPA). We also gratefully acknowledge support from SambaNova Systems and software donations from Xilinx. Support included NSF awards
A THE CALYX ARTIFACT
A.1 Abstract
Our artifact packages an environment that can be used to reproduce the figures in the paper and perform similar evaluations. It is available at the following link: https://zenodo.org/record/4432747
It includes the following:
• futil: The Calyx compiler.
• fud: Driver for the futil compiler and hardware tools.
• PolyBench linear algebra benchmarks written in Dahlia.
Note on proprietary tools.
We use Xilinx’s Vivado and Vivado HLStools to synthesize hardware designs and to generate HLS estimates.While trail version of these tools can be installed using Xilinx’s HLWebPACK installer, their licenses for these tool disallow redistribu-tion. Our
README.md details installation steps for these tools.
A.2 Artifact check-list (meta-information) • Program:
PolyBench Benchmark Suite [23]. (All benchmarks used in the evaluation are included with the artifact.) • Binary:
All binaries included except Vivado and Vivado HLS. • Run-time environment:
Rust source code can be compiled anywhere: macOS, Windows, and Linux will all work. Our evaluation scripts assume a Unix environment with the following installed:
– GNU Parallel 20161222
– verilator v4.038
– python3, pip3, and the python packages: numpy, pandas, seaborn, matplotlib, jupyterlab
– jq 1.5.1
– vcdump 0.1.2
– vivado v2019.2, vivado_hls v2019.2
– futil, fud from commit dccd6f.
– dahlia from commit .
Our packaged virtual machine has these tools installed. • Metrics:
LUT usage and simulated cycle counts. • Output:
The figures reported in the paper. • Experiments:
We provide scripts for running the experiments and use Jupyter notebooks for making the figures. • How much disk space required (approximately)?:
65 GB. • Time needed to prepare workflow?: • Time needed to complete experiments?:
A.3 Description and Installation
A.3.1 How to Access.
The artifact is provided in two forms:
• A virtual image with all dependencies installed.
• Code repositories hosted on GitHub.
The instructions to download both the virtual image and the code repositories can be accessed here: https://github.com/cucapra/calyx-evaluation
To install the proprietary tools and run the scripts, please follow the instructions in the
README.md file at the root of the code repository.
A.4 Evaluation and Expected Results
The evaluation process aims to accomplish two goals:
• Reproduce the graphs in the paper (Figures 5 and 6).
• Demonstrate the robustness of our tooling and infrastructure.
The
README.md file at the root of the code repository walks through the steps to reproduce the graphs from the paper, use the compiler to generate RTL code, and build on the infrastructure as a library.
Note on Figure 7a.
Our original submission contained a bug in one of the plotting scripts that was caught and fixed during the artifact evaluation process. Complete details are in the
README.md instructions.
A.5 Methodology
Submission, reviewing, and badging methodology.
REFERENCES
[1] Ali E. Abdallah and John Hawkins. 2003. Formal Behavioural Synthesis of Handel-C Parallel Hardware Implementations from Functional Specifications. In Hawaii International Conference on System Sciences (HICSS).
[2] C. Scott Ananian. 1998. Silicon C: A Hardware Backend for SUIF. https://flex.cscott.net/SiliconC/.
[3] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. 2011. LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In International Symposium on Field-Programmable Gate Arrays (FPGA).
[4] Luca P. Carloni, Kenneth L. McMillan, and Alberto L. Sangiovanni-Vincentelli. 2001. Theory of latency-insensitive design. IEEE/ACM International Conference on Computer-Aided Design (ICCAD) (2001).
[5] J. Cong and J. Wang. 2018. PolySA: Polyhedral-Based Systolic Array Auto-Compilation. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[6] J. Cong and Zhiru Zhang. 2006. An efficient and versatile scheduling algorithm based on SDC formulation. In Design Automation Conference (DAC).
[7] Ross Daly, Lenny Truong, and Pat Hanrahan. 2018. Invoking and Linking Generators from Multiple Hardware Languages using CoreIR. In Second Workshop on Open-Source EDA Technology (WOSET).
[8] David Durst, Matthew Feldman, Dillon Huff, David Akeley, Ross Daly, Gilbert Louis Bernstein, Marco Patrignani, Kayvon Fatahalian, and Pat Hanrahan. 2020. Type-Directed Scheduling of Streaming Accelerators. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
[9] Nikil D. Dutt, Tedd Hadley, and Daniel D. Gajski. 1991. An intermediate representation for behavioral synthesis. In Design Automation Conference (DAC).
[10] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel, Adam Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steven K. Reinhardt, Adrian M. Caulfield, Eric S. Chung, and Doug Burger. 2018. A Configurable Cloud-scale DNN Processor for Real-time AI. In International Symposium on Computer Architecture (ISCA).
[11] Zhi Guo, Betul Buyukkurt, John Cortes, Abhishek Mitra, and Walid Najjar. 2008. A compiler intermediate representation for reconfigurable fabrics. International Journal of Parallel Programming (2008).
[12] S. Gupta, Renu Gupta, Nikil Dutt, and Alex Nicolau. 2004.
SPARK: A ParallelizingApproach to the High-Level Synthesis of Digital Circuits .[13] James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-Kelley, NoyCohen, Steven Bell, Artem Vasilyev, Mark Horowitz, and Pat Hanrahan. 2014.Darkroom: Compiling high-level image processing code into hardware pipelines.
ACM Transactions on Graphics .[14] Intel. 2021.
Intel High Level Synthesis Compiler
IEEE/ACM International Conferenceon Computer-Aided Design (ICCAD) .[16] Lana Josipoviundefined, Radhika Ghosal, and Paolo Ienne. 2018. Dynami-cally Scheduled High-Level Synthesis. In
International Symposium on Field-Programmable Gate Arrays (FPGA) .[17] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal,Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, RickBoyle, Pierre luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Da-ley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, RajendraGottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg,John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski,Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy,James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke,Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller,Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, MarkOmernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, AmirSalek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, JedSouter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian,Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, EricWilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of aTensor Processing Unit. In
International Symposium on Computer Architecture(ISCA) .[18] David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis,Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis,and Kunle Olukotun. 2018. Spatial: A language and compiler for applicationaccelerators. In
ACM SIGPLAN Conference on Programming Language Design andImplementation (PLDI) .[19] Hsiang-Tsung Kung. 1982. Why systolic architectures?
IEEE computer (1982).[20] Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou, JasonCong, and Zhiru Zhang. 2019. HeteroCL: A Multi-Paradigm Programming In-frastructure for Software-Defined Reconfigurable Computing. In
InternationalSymposium on Field-Programmable Gate Arrays (FPGA) .[21] Y.-H. Lai, H. Rong, S. Zheng, W. Zhang, X. Cui, Y. Jia, J. Wang, B. Sullivan, Z.Zhang, Y. Liang, Y. Zhang, J. Cong, N. George, J. Alvarez, C. Hughes, and P.Dubey. 2020. SuSy: A Programming Model for Productive Construction of High-Performance Systolic Arrays on FPGAs. In
IEEE/ACM International Conferenceon Computer-Aided Design (ICCAD) .[22] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework forLifelong Program Analysis & Transformation. In
International Symposium onCode Generation and Optimization (CGO) .[23] Louis-Noel Pouchet. 2021.
PolyBench/C: The Polyhedral Benchmark Suite.
Re-trieved January 16, 2021 from http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/[24] Mentor Graphics. 2021.
Catapult High-Level Synthesis
ACM SIGPLAN Conferenceon Programming Language Design and Implementation (PLDI) .[26] Rishiyur Nikhil. 2004. Bluespec System Verilog: Efficient, correct RTL from highlevel specifications. In
Conference on Formal Methods and Models for Co-Design(MEMOCODE) .[27] Preeti Ranjan Panda. 2001. SystemC: A modeling platform supporting multipledesign abstractions. In
International Symposium on Systems Synthesis .[28] Christian Pilato and Fabrizio Ferrandi. 2013. Bambu: A modular frameworkfor the high level synthesis of memory-intensive applications. In
InternationalConference on Field-Programmable Logic and Applications (FPL) .[29] Raghu Prabhakar, David Koeplinger, Kevin J Brown, HyoukJoong Lee, Christo-pher De Sa, Christos Kozyrakis, and Kunle Olukotun. 2016. Generating con-figurable hardware from parallel patterns. In
ACM International Conference onArchitectural Support for Programming Languages and Operating Systems (ASP-LOS) . [30] Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan-Kelley, and Mark Horowitz. 2017. Programming heterogeneous systems from animage processing DSL.
ACM Transactions on Architecture and Code Optimization(TACO) .[31] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, FrédoDurand, and Saman P. Amarasinghe. 2013. Halide: A language and compiler foroptimizing parallelism, locality, and recomputation in image processing pipelines.In
ACM SIGPLAN Conference on Programming Language Design and Implementa-tion (PLDI) .[32] Sameer D Sahasrabuddhe, Hakim Raja, Kavi Arya, and Madhav P Desai. 2007.AHIR: A hardware intermediate representation for hardware generation fromhigh-level programs. In
International Conference on VLSI Design (VLSID) .[33] Fabian Schuiki, Andreas Kurth, Tobias Grosser, and Luca Benini. 2020. LLHD: AMulti-Level Intermediate Representation for Hardware Description Languages. In
ACM SIGPLAN Conference on Programming Language Design and Implementation(PLDI) .[34] Shang HLS Authors. 2021.
The Shang High-Level Synthesis Framework . Re-trieved January 16, 2021 from https://web.archive.org/web/20180610233052/https://github.com/etherzhhb/Shang[35] Amirali Sharifian, Reza Hojabr, Navid Rahimi, Sihao Liu, Apala Guha, TonyNowatzki, and Arrvindh Shriraman. 2019. 𝜇 IR: An Intermediate Representationfor Transforming and Optimizing the Microarchitecture of Application Accelera-tors. In
IEEE/ACM International Symposium on Microarchitecture (MICRO) .[36] Satnam Singh and David J. Greaves. 2008. Kiwi: Synthesis of FPGA Circuits fromParallel Programs. In
Field-Programmable Custom Computing Machines (FCCM) .[37] Rohit Sinha and Hiren D Patel. 2012. synASM: A high-level synthesis frameworkwith support for parallel and timed constructs.
IEEE/ACM International Conferenceon Computer-Aided Design (ICCAD) .[38] H. Srinivasan and M. Wolfe. 1992. Analyzing programs with explicit parallelism.In
Languages and Compilers for Parallel Computing
Symposium on SDN Research (SOSR) .[41] Sheng-Hong Wang, Akash Sridhar, and Jose Renau. 2019. LNAST: A languageneutral intermediate representation for hardware description languages. In
SecondWorkshop on Open-Source EDA Technology (WOSET) .[42] Claire Wolf. 2021.
Yosys Manual
International Conference on Communications, Circuits and Systems (ICCCAS) .[44] Xilinx Inc. 2021.
Vivado Design Suite User Guide: High-Level Syn-thesis. UG902 (v2017.2) June 7, 2017.