Extending High-Level Synthesis for Task-Parallel Programs
Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong
{chiyuze, lcguo, ykchoi, jiewang, cong}@cs.ucla.edu
University of California, Los Angeles
ABSTRACT
C/C++/OpenCL-based high-level synthesis (HLS) has become more and more popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of result (QoR) and short development cycle compared with the traditional register-transfer level (RTL) design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive high-level programming approach in many other application domains, where coarse-grained tasks run in parallel and communicate with each other at a fine-grained level. While current HLS tools support task-parallel programs, the productivity is greatly limited in the code development, correctness verification, and QoR tuning cycles, due to the poor programmability, restricted software simulation, and slow code generation, respectively. Such limited productivity often defeats the purpose of HLS and hinders programmers from adopting HLS for task-parallel FPGA accelerators. In this paper, we extend the HLS C++ language and present a fully automated framework with programmer-friendly interfaces, universal software simulation, and fast code generation to overcome these limitations. Experimental results based on a wide range of real-world task-parallel programs show that, on average, the lines of kernel and host code are reduced by 22% and 51%, respectively, which considerably improves the programmability. The correctness verification and the iterative QoR tuning cycles are greatly accelerated, by 3.2× and 6.8×, respectively.
1 INTRODUCTION

C/C++/OpenCL-based high-level synthesis (HLS) [15] has been adopted rapidly by both academia and industry for programming field-programmable gate array (FPGA) accelerators in many application domains, e.g., machine learning [16, 62], scientific computing [40, 68], and image processing [10, 52]. Compared with the traditional register-transfer level (RTL) paradigm (Figure 1), where programmers often spend tens of minutes just to verify the correctness of a code modification, with HLS, programmers can follow a rapid development cycle (Figure 2). Programmers can write code in C and leverage fast software simulation to verify functional correctness. Such a correctness verification cycle can take as little as 1 second, allowing functionalities to be iterated at a fast pace. Once the HLS code is functionally correct, programmers can then generate RTL code, evaluate the quality of result (QoR) based on the generated performance and resource reports, and modify the HLS code accordingly. Such a QoR tuning cycle typically takes only a few minutes. Thanks to the advances in HLS scheduling algorithms [6, 7, 19, 28, 30] and timing optimizations [5, 27, 35], HLS can not only shorten the development cycle, but also generate programs that are often competitive in cycle count [17], and more recently in clock frequency as well [27]. Moreover, FPGA vendors provide host drivers and communication interfaces for kernels designed in HLS [31, 63], further reducing programmers' burden to integrate and offload workloads to FPGA accelerators.
Figure 1: FPGA accelerator development flow without HLS. Programmers often spend tens of minutes after a code modification to evaluate the correctness and quality of result.
Figure 2: FPGA accelerator development flow with HLS. Programmers spend seconds after a code modification to verify the correctness. Quality of result can usually be obtained in less than 10 minutes from the HLS report.
However, not all programs are created equal for HLS. Data-parallel programs can be easily programmed following the sequential C semantics with HLS-specific compiler directives (i.e., "#pragma"). The HLS compiler can then leverage the directives to extract the parallelism automatically via static dependency analysis. This enables such applications to be quickly designed and iterated in the fast correctness verification cycle and QoR tuning cycle, as shown in Figure 2. However, task-parallel programs are not supported by the native C semantics, and the productivity provided by current HLS tools is greatly limited for the following reasons:

• Poor programmability. Due to the lack of convenient application programming interfaces (API), programmers are often forced to write more code than they have to. For example, a network switch needs to forward packets based on their content and the availability of output ports. Without an API to read packets without consuming them (a.k.a. "peek") from the ports, programmers have to manually and carefully create a buffer and maintain a small state machine to keep track of incoming packets. This not only elongates the development cycle, but also makes the code error-prone.
• Restricted software simulation. As the key to fast correctness verification, software simulation is not always available to task-parallel programs. For example, Vivado HLS does not support debugging Cannon's algorithm [44] via software simulation because of the existence of feedback loops in data paths, while Intel OpenCL does not support more than 256 concurrent kernels [31] in software simulation. Lack of fast software simulation forces programmers to resort to RTL simulation for correctness verification, significantly elongating the development cycle.
• Slow code generation. We found that current HLS compilers view task-parallel code as a monolithic design and process each instance of the same task as if they were different. For designs that instantiate the same task multiple times (e.g., in a systolic array), this leads to repetitive compilation of each task and unnecessarily slows down code generation. One may argue that programmers can manually synthesize tasks separately and instantiate them in RTL, but doing so requires debugging RTL code, which is time-consuming and error-prone. We think such processes should be automated by the compiler.

Limited productivity for task-parallel programs significantly elongates the development cycles and undermines the benefits brought by HLS. One may argue that programmers should always go for data-parallel implementations when designing FPGA accelerators using HLS, but data-parallelism may be inherently limited, for example, in applications involving graphs. Moreover, research shows that even for data-parallel applications like neural networks [16] and stencil computation [10], task-parallel implementations show better scalability and higher frequency than their data-parallel counterparts due to the localized communication pattern [18]. In fact, at least 6 papers [24, 34, 46, 55, 59, 64] among the 28 research papers published in the ACM FPGA 2020 conference use task-parallel implementation with HLS, and another 3 papers [4, 50, 65] use RTL implementation that would have required task-parallel implementation if written in HLS.

In this paper, we extend the HLS C++ language and present our framework, TAPA (task-parallel), as a solution to the aforementioned limitations of HLS productivity.
Our contributions include:

• Convenient programming interfaces: We show that, with peeking and transactions added to the programming interfaces, TAPA can be used to program task-parallel kernels with a 22% reduction in lines of code (LoC) on average. By unifying the interface used for the kernel and host, TAPA further reduces the LoC on the host side by 51% on average.
• Universal software simulation: We demonstrate that our proposed simulator can correctly simulate task-parallel programs that existing simulators fail to simulate. Moreover, the correctness verification cycle can be accelerated by a factor of 3.2× on average.
• Hierarchical code generation: We show that by modularizing a task-parallel program and using a hierarchical approach, RTL code generation can be accelerated by a factor of 6.8× on our server with 32 hyper-threads.

Other related HLS tools [3, 14, 32, 63], streaming works [54, 60], and alternative APIs will be discussed in Section 5. Table 1 shows the summary of related work. To the best of our knowledge, TAPA is the only work that provides convenient programming interfaces, universal software simulation, and hierarchical code generation for general task-parallel programs on FPGAs using HLS. TAPA is open-source at https://github.com/ucla-vast/tapa.

Table 1: Summary of related work.

| Related Work | Peeking | Transaction | Host Iface. | Software Simulation | RTL Code Generation |
|---|---|---|---|---|---|
| Fleet [60] | No | No | N/A | Sequential | N/A |
| Intel HLS (ihc::pipe) | No | No | N/A | Multi-thread | Monolithic |
| Intel HLS (ihc::stream) | No | Yes | N/A | Multi-thread | Monolithic |
| Intel OpenCL | No | No | OpenCL | Multi-thread | Monolithic |
| LegUp [3] | No | No | N/A | Multi-thread | Monolithic |
| Merlin [14] | No | No | C++ | Sequential | Monolithic |
| ST-Accel [54] | No | No | VFS | Sequential | Hierarchical |
| Vivado HLS (ap_fifo) | No | No | OpenCL | Sequential | Monolithic |
| Vivado HLS (axis) | No | Yes | OpenCL | Multi-thread | Manual |
| Xilinx OpenCL | No | No | OpenCL | Multi-thread | Monolithic |
| TAPA | Yes | Yes | C++ | Coroutine | Hierarchical |

2 BACKGROUND

2.1 Task-Level Parallelism

Task-level parallelism is a form of parallelization of computer programs across multiple processors. In contrast to data parallelism, where the workload is partitioned on data and each processor executes the same program (e.g., OpenMP [21]), different processors in a task-parallel program often behave differently, while data are passed between processors. Examples of task-parallel programs include image processing pipelines [10, 52], graph processing [61, 67], and network switching [50]. Software programs usually implement tasks as threads and/or processes and rely on the operating system to schedule execution and handle communication. This often leads to poor performance caused by inefficient inter-task communication and frequent context switches [2]. Hardware programs, on the other hand, can be much more efficient due to the massive amount of inherently parallel logic units. In this paper, we focus on the problem of statically mapping tasks to hardware. That is, instances of tasks are synthesized to different areas in an FPGA accelerator. We plan to address dynamic scheduling in our future work.

2.2 Task-Parallel Programming Models

Task-parallel programs are often described as communicating sequential processes [29] or using dataflow models [36, 43, 51]. Kahn process network (KPN) [36] is one of the most popular models used. Under the KPN model, tasks are called processes. Processes communicate only through unidirectional channels. Data exchanged through channels are called tokens. KPN requires that ① each process is deterministic, i.e., the same input sequence must produce the same output sequence; ② channels are unbounded, read blocks if and only if the channel is empty, and write always succeeds immediately; ③ a process cannot test an input channel for the existence of tokens without consuming them.
While KPN and models derived from KPN (e.g., synchronous dataflow [43]) have been successful in scheduling tasks on parallel processors, we show in the next section that, when applied to model task-parallel HLS programs, such models lack good programmability support. In this paper, we borrow the terms process, channel, and token used in the KPN formulation, but are not limited to KPN or any dataflow model. In fact, we will describe our programming model as a hierarchical finite state machine in Section 3.1.1.

2.3 Motivating Example

Graphs are an important data structure critical to many data mining and machine-learning algorithms [26, 38, 45, 47, 49]. While there are many existing FPGA accelerators designed for graph algorithms [22, 23, 37, 48, 65–67], none of them are programmed in HLS. HLS's lack of good productivity for task-parallel programs is one of the reasons it is not adopted for graph algorithms. In this section, we use a real-world design to illustrate the productivity issues of implementing graph accelerators in HLS, which serves as a motivating example for our work.

Our example accelerator implements PageRank [49] on the Alveo U280 board and leverages the high-bandwidth memories (HBM). The input graph is pre-processed and loaded into the HBM on the FPGA. The accelerator adopts an edge-centric graph programming model [53] and decouples the computation into two phases, i.e., the scatter phase and the gather phase [11, 67]. In the scatter phase, edges are streamed from the HBM to the processing elements (PE) on the FPGA. For each edge, an update message is generated to propagate the weighted ranking of the source vertex to the destination vertex. The updates are collected and stored off-chip in the HBM. In the gather phase, the updates are loaded from the HBM and the rankings are accumulated over each vertex. Our PageRank accelerator instantiates multiple PEs. The PEs are connected to a vertex handler and a control module. The control module coordinates accesses to the vertex attributes and iterative execution between the two phases. Figure 3 shows the block diagram of the example accelerator.

We measured 4.4 GTEPS (giga traversed edges per second) on-board execution throughput using the accelerator with 19 HBM channels in use. As a comparison, multi-thread CPU performance is around 0.7 GTEPS with 4 DDR4 memory channels [11]. Even if we assume a similar memory bandwidth as the FPGA accelerator and project the CPU performance to 3.5 GTEPS, it would still be more than 20% slower, due to the lack of fine-grained control over communication.
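To make the dataflow concrete, the following is a minimal sketch of the scatter phase as it might be written in plain Vivado HLS. The Edge and Update types, the array arguments, and the termination handling are our own illustrative assumptions, not the accelerator's actual code.

```cpp
#include <cstdint>
#include <hls_stream.h>

struct Edge { uint32_t src, dst; };            // Illustrative edge record.
struct Update { uint32_t dst; float delta; };  // Illustrative update token.

// Scatter phase: stream edges in, emit one update per edge that propagates
// the weighted ranking of the source vertex to the destination vertex.
void Scatter(hls::stream<Edge>& edges, hls::stream<Update>& updates,
             const float* rank, const uint32_t* out_degree,
             uint64_t num_edges) {
scatter_loop:
  for (uint64_t i = 0; i < num_edges; ++i) {
#pragma HLS pipeline II = 1
    Edge e = edges.read();
    Update u;
    u.dst = e.dst;
    u.delta = rank[e.src] / out_degree[e.src];  // Weighted source ranking.
    updates.write(u);
  }
}
```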
Figure 3: Example PageRank accelerator design.

While developing this accelerator, we found that the following features are missing or hard-to-use in the HLS tools and significantly impact productivity.

1) Peeking. Peeking is defined as reading a token from a channel without consuming it. As mentioned in Section 2.2, KPN explicitly prohibits such behavior. Yet, such a pattern is common in many applications. For example, in the PageRank accelerator, the UpdateHandler module needs to keep track of the number of updates destined to each vertex partition. Due to the large number of partitions, block RAMs (BRAM) are used for storing the update counts. However, incrementing a value in BRAM cannot be done in a single clock cycle on FPGAs due to the addressing latency, which prevents the loop from being fully pipelined. A workaround is to accumulate the update count in a register for updates with the same partition id (pid) and only write changes to BRAM when the pid changes. This requires us to detect conflicts on the addresses and stop reading the input channel when a conflict occurs, as shown in the green lines marked with "+" in Listing 1. Without a peek API, one has to write it as the red lines marked with "-" in Listing 1 and manually maintain a buffer for the incoming values. This not only increases the programming burden, but also makes the design prone to errors in state transitions of the buffer.

Listing 1: Code snippets with (green lines marked with "+") and without (red lines marked with "-") a peek API. Without the peek API, the code snippet is 33% longer and error-prone.
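Since Listing 1 itself is not reproduced here, the following is a minimal sketch of the peek-based version of this pattern, written against the tapa::istream API introduced in Section 3.1. The Update struct, the partition count, and the loop structure are our own illustrative assumptions.

```cpp
#include <cstdint>
#include <tapa.h>

struct Update { uint32_t pid; float delta; };  // Illustrative token type.
constexpr int kNumPartitions = 1024;           // Illustrative size.

void UpdateHandler(tapa::istream<Update>& updates,
                   uint32_t count[kNumPartitions]) {
  uint32_t acc = 0;      // Update count accumulated in a register.
  uint32_t pid_reg = 0;  // Partition id currently held in the register.
count_loop:
  while (!updates.eot()) {      // Blocks until a token arrives; stop at EoT.
#pragma HLS pipeline II = 1
    Update u = updates.peek();  // Inspect the head token without consuming.
    if (acc == 0 || u.pid == pid_reg) {
      updates.read();           // No BRAM conflict: consume and count it.
      pid_reg = u.pid;
      ++acc;
    } else {
      count[pid_reg] += acc;    // pid changed: flush the register to BRAM;
      acc = 0;                  // the token stays in the channel this cycle.
    }
  }
  updates.open();               // Consume the EoT token.
  if (acc > 0) count[pid_reg] += acc;  // Flush the last partition.
}
```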
2) Transactions. A sequence of tokens may constitute a single logical communication transaction. Using the same PageRank accelerator example, in the gather phase when the updates are read from HBM, the updates transmitted from UpdateHandler to ComputeUnit for each vertex partition can be considered a single transaction. Since only UpdateHandler knows the number of updates transmitted in each transaction, ComputeUnit needs to test for a special token to detect the end of a transaction (green lines marked with "+" in Listing 2). Without an eot API, one has to manually add a special bit to the data structure representing the tokens (red lines marked with "-" in Listing 2). Note that the Update struct is used elsewhere, and it is infeasible to add the eot bit directly to the Update struct. An alternative solution, i.e., sending the length of the transaction to the token consumer beforehand, is not only more complicated, but also impractical in cases where the tokens are generated dynamically and the length of the transaction cannot be determined beforehand.

Listing 2: Code snippets with (green lines marked with "+") and without (red lines marked with "-") an end-of-transaction (eot) API. Without the eot API, the code snippet is 2× longer.
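As with Listing 1, Listing 2 is not reproduced here; the following minimal sketch shows how a consumer can detect the end of a transaction with the eot/open API, reusing the illustrative Update type from the sketch above. The processing body is an assumption.

```cpp
void ComputeUnit(tapa::istream<Update>& updates, float* rank) {
  for (;;) {                   // One iteration per transaction (partition).
    while (!updates.eot()) {   // The loop exits when the next token is EoT.
#pragma HLS pipeline II = 1
      Update u = updates.read();
      rank[u.pid] += u.delta;  // Apply the update to the destination.
    }
    updates.open();            // Consume the EoT token: transaction done.
    // ... emit the accumulated rankings for this partition ...
  }
}
```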
3) System integration. To offload a computation kernel from the host CPU to FPGA accelerators, programmers need to write host-side code to interface the accelerator kernel with the host. FPGA vendors adopt the OpenCL standard to provide such functionality. While the standard OpenCL host-kernel interface infrastructure relieves programmers from writing their own operating system drivers and low-level libraries, it is still inconvenient and hard-to-use. Programmers often have to write and debug tens of lines of code just to set up the host-kernel interface. Task-parallel accelerators often make the situation worse because the parallel tasks are often described as distinct OpenCL kernels [31], which significantly increases the programmers' burden of managing these kernels in the host-kernel interface. For our PageRank accelerator, more than 60 lines of host code are created just for the host-kernel integration, which constitute more than 20 percent of the whole source code. Yet, what we actually need is just a single function invocation of the synthesized FPGA bitstream given proper arguments.
4) Software simulation. C does not have explicit parallel semantics by itself. Vivado HLS uses the dataflow model and allows programmers to instantiate tasks by invoking each of them sequentially [63]. While this is very concise to write (red lines marked with "-" in Listing 3), it leads to incorrect simulation results because the communication between ComputeUnit and UpdateHandler is bidirectional, yet sequential execution can only send tokens from ComputeUnit to UpdateHandler because of their invocation order. This problem was also pointed out in [8]. In order to run software simulation correctly, the programmer can change the source code to run tasks in multiple threads for software simulation, but doing so requires the same piece of task instantiation code to be written twice for synthesis and simulation, reducing productivity. While other tools that run tasks in parallel threads do not have the same correctness problem, we will show in Section 4.4 that such simulators do not scale well when the number of tasks increases.
Listing 3: Code snippets that instantiate tasks in Vivado HLS (red lines marked with "-") and TAPA (green lines marked with "+"). The instantiation interface in Vivado HLS is not verbose, but software simulation does not work correctly.

5) RTL code generation. In our PageRank design, the same processing element is instantiated 8 times. This makes the HLS compiler synthesize the same PE module 8 times, taking 7 minutes per compilation. We can reduce the code generation time to less than 1 minute by manually synthesizing each module separately and connecting the generated RTL code, but doing so forces us to debug RTL code and spend tens of minutes verifying the correctness of each code modification, thus defeating the purpose of adopting HLS.

In this paper, we present the TAPA framework that addresses these challenges by providing convenient programming interfaces, universal software simulation, and hierarchical code generation.

3 TAPA FRAMEWORK

3.1 Programming Model and Interface
3.1.1 Programming Model. Similar to the KPN model described in Section 2.2, tasks in TAPA communicate via channels. Unlike KPN, tasks are modeled as hierarchical finite-state machines (FSM). Each task is either a leaf that does not instantiate any channels or tasks, or a collection of tasks and channels with which the tasks communicate. A task that instantiates a set of tasks and channels is called the parent task of that set. Each channel must be connected to exactly two tasks that are instantiated in the same parent task. One of the tasks must act as a producer and the other must act as a consumer. The producer streams tokens to the consumer via the channel in first-in-first-out (FIFO) order. Each task is an FSM, where the tokens streamed to and from the task are inputs and outputs of the FSM. In the case of a parent task, the states of all instantiated channels and tasks constitute its state. The producer of a channel can test the fullness of the channel and append tokens to the channel (write) if the channel is not full. The consumer of a channel can test the emptiness of the channel and remove tokens from the channel (read), or duplicate the head token without removing it (peek), if the channel is not empty. Read, peek, and write operations can be blocking or non-blocking. A blocking operation on an input (output) channel keeps the task FSM in its current state until the channel becomes non-empty (non-full). A non-blocking operation tries to perform the operation and returns whether it is successful as one of the inputs to the task FSM. Each task is implemented as a C++ function; tasks can communicate with each other via the communication interface. A parent task instantiates channels and tasks using the instantiation interface. One of the tasks is designated as the top-level task, which defines the communication interfaces external to the FPGA accelerator.
3.1.2 Communication Interface. Tasks communicate with each other through the communication interface. TAPA provides separate communication APIs for the producer side and the consumer side. The producer and consumer tasks of a channel use ostream and istream as the interfaces, respectively. The interfaces are templated and can be used for any copyable class. On the consumer side, istream provides peek, which allows the programmer to read a token without removing it from the channel, i.e., the state of the channel is not changed. A special token denoting end-of-transaction (EoT) is available to all channels. A process can "close" a channel by writing an EoT to it, and a process can "open" a channel by reading an EoT from it. An EoT token does not contain any useful data. This is designed deliberately to make it possible to break from a pipelined loop when an EoT is present (Listing 2). Table 2 summarizes the communication interfaces provided by TAPA. Listing 4 shows an example of how the communication interfaces are used in TAPA.

Table 2: TAPA communication interface.

| tapa::ostream API | Producer-side functionality |
|---|---|
| bool full(); | fullness test |
| void write(T); | blocking write a data token |
| bool try_write(T); | non-blocking write a data token |
| void close(); | blocking write an EoT token |
| bool try_close(); | non-blocking write an EoT token |

| tapa::istream API | Consumer-side functionality |
|---|---|
| bool empty(); | emptiness test |
| T peek(); | blocking peek a data token |
| bool try_peek(T&); | non-blocking peek a data token |
| T read(); | blocking read a data token |
| bool try_read(T&); | non-blocking read a data token |
| bool eot(); | return if next token is EoT |
| bool try_eot(bool&); | return if next token exists and if it is EoT |
| void open(); | blocking read an EoT token |
| bool try_open(); | non-blocking read an EoT token |
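To give a flavor of what Listing 4 contains (the full listing is not reproduced here), the following is a minimal sketch of a leaf task written against the interfaces in Table 2. The VertexReq struct and the memory-access details are our own illustrative assumptions.

```cpp
#include <cstdint>
#include <tapa.h>

struct VertexReq {
  uint32_t offset;  // First vertex requested.
  uint32_t length;  // Number of vertices requested.
};

// Leaf task: serves ranges of vertex attributes from an infinite loop.
void VertexHandler(tapa::istream<VertexReq>& req_q,
                   tapa::ostream<float>& resp_q, const float* vertices) {
  for (;;) {
    VertexReq req = req_q.read();  // Blocking read of the next request.
    for (uint32_t i = 0; i < req.length; ++i) {
      resp_q.write(vertices[req.offset + i]);  // Blocking response writes.
    }
  }
}
```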
3.1.3 Instantiation Interface. A parent task can instantiate channels and tasks using the instantiation interface. Channels are instantiated using tapa::channel, e.g., tapa::channel<VertexReq> for a channel of VertexReq tokens. Tasks are instantiated using tapa::task::invoke. By default, a parent task waits for all the tasks it instantiates to finish; a task can instead be detached upon instantiation, which is useful for tasks that run an infinite loop, e.g., VertexHandler with an infinite loop (Listing 4). Listing 5 shows an example of how channels and tasks are instantiated in TAPA.
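Since Listing 5 itself is not reproduced here, the following minimal sketch illustrates the instantiation interface, reusing the illustrative VertexHandler task from the sketch above. The channel capacity, the detach syntax, and the Ctrl task (assumed declared elsewhere) are our own assumptions.

```cpp
// Parent task: instantiates the channels and the tasks that use them.
void PageRank(const float* vertices) {
  tapa::channel<VertexReq> vertex_req;  // Requests: Ctrl -> VertexHandler.
  tapa::channel<float> vertex_resp;     // Responses: VertexHandler -> Ctrl.

  tapa::task()
      // VertexHandler runs an infinite loop, so it is detached.
      .invoke<tapa::detach>(VertexHandler, vertex_req, vertex_resp, vertices)
      // The parent task finishes when Ctrl finishes.
      .invoke(Ctrl, vertex_req, vertex_resp);
}
```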
3.1.4 System Integration Interface. To offload a kernel to an FPGA accelerator, programmers need to integrate the FPGA into the host CPU system. Thanks to the vendor-provided system drivers and the standard OpenCL accelerator APIs, most programmers only need to follow the OpenCL host-kernel communication specification and invoke the proper APIs. However, those OpenCL APIs are still verbose and take a long time to learn and develop with. For example, programmers need to learn the concepts of "platform", "context", "queue", and "kernel" in OpenCL and manage them for each accelerator, yet the only thing necessary is usually just to find a proper FPGA accelerator or simulation environment and use it to run the program. This overhead is exacerbated by task-parallel accelerators, where parallel tasks are often synthesized as concurrent OpenCL kernels that need to be managed separately by the host.

TAPA uses a unified system integration interface to further reduce programmers' burden. To offload a kernel to an FPGA accelerator, programmers only need to call the top-level task as a C++ function in the host code. Since TAPA can extract metadata information, e.g., argument types, from the kernel code, TAPA automatically synthesizes proper OpenCL host API calls and emits an implementation of the top-level task C++ function that sets up the runtime environment properly. As a user of TAPA, the programmer can use a single function invocation in the same source code to run software simulation, hardware simulation, and on-board execution, with the only difference being the specification of proper bitstreams.
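A minimal sketch of what the host side then looks like, assuming the PageRank top-level task from the sketches above and a hypothetical LoadVertices() helper:

```cpp
#include <vector>

std::vector<float> LoadVertices();  // Hypothetical helper, defined elsewhere.

int main() {
  std::vector<float> vertices = LoadVertices();

  // One call site covers software simulation, hardware simulation, and
  // on-board execution; which one runs is selected by the bitstream supplied
  // to the generated implementation of the top-level task.
  PageRank(vertices.data());

  // ... check the computed rankings against a reference implementation ...
  return 0;
}
```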
3.2 Software Simulation

State-of-the-Art Approaches. There are mainly two state-of-the-art approaches to fast software simulation of task-parallel applications: the sequential approach and the multi-thread approach. A sequential simulator invokes tasks sequentially in the invocation order [63]. Sequential simulators are fast, but cannot correctly simulate the capacity of channels or applications whose tasks communicate bidirectionally, as discussed in Section 2.3. A multi-thread simulator invokes tasks in parallel by launching a thread for each task. This enables the capacity of channels and bidirectional communication to be simulated correctly. However, multi-thread simulators may perform poorly due to the inefficiency of inter-thread communication and context switches handled by the operating system. The FLASH simulator [8, 12] proposes an alternative to the above, which relies on the HLS scheduling information to mimic the RTL FSM. While this simulation approach itself is faster than multi-thread simulators, generating the simulation executable becomes slower due to the need for the HLS scheduler output for cycle-accuracy, which is not needed for correctness verification.

In this section, we present an alternative approach to running software simulation of task-parallel applications. Given that the inefficiency of multi-thread execution is mainly caused by the preemptive nature of operating system threads, and inspired by the widespread adoption of coroutines in modern software languages [25, 41], we propose an approach that uses collaborative coroutines instead of preemptive threads. Note that fast and/or cycle-accurate debugging in general [33] is out of the scope of this paper; we focus on the correctness and scalability issues of task-parallel programs.
Coroutine-Based Approach. Routines in programming languages are units of execution context, e.g., functions in C [39]. Coroutines [20] are routines that execute collaboratively; more specifically, coroutines can be explicitly suspended and resumed. Coroutines can even maintain their own stacks. As a result, each coroutine can invoke subroutines itself and suspend from and resume to any subroutine [41]. Coroutines that have their own stacks are called stackful coroutines. A context switch between coroutines takes only 26 ns on modern CPUs [41]. As a comparison, an operating system thread context switch takes 1.2–2.2 µs [2], which is two orders of magnitude slower.

TAPA leverages stackful coroutines to perform software simulation. When channels are instantiated in the simulator, enough memory space is reserved to ensure the channel capacity can be simulated correctly. When tasks are instantiated, a coroutine is launched but suspended immediately for each task. Once all tasks are instantiated, the simulator starts to resume the suspended coroutines. A resumed task will be suspended again if any input channel is accessed when empty or any output channel is accessed when full, which means that no progress can be made from this task. A different task will then be selected and resumed by the simulator. For example, in the task instantiation code shown in Listing 5, both VertexHandler and Ctrl are launched as coroutines and suspended immediately by the invoke function calls. Once all tasks are instantiated, the simulator starts to pick tasks for execution. Ctrl is picked first and writes vertex requests to vertex_req. Once vertex_req becomes full, the simulator determines that no progress can be made from Ctrl, suspends it, and picks another task for execution. VertexHandler is then resumed, and tokens are read from vertex_req. Once vertex_req becomes empty, the simulator determines that no progress can be made from VertexHandler, suspends it, and picks the next task for execution.

To better utilize the available CPU cores, we use a thread pool to execute the coroutines. We will show in Section 4.4 that the coroutine-based simulator outperforms the existing simulators by 3.2× on average.
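To make the scheduling policy concrete, the following is a minimal, self-contained sketch of coroutine-based task simulation using Boost.Coroutine2 [41]. The Channel class, the round-robin scheduler, and the two tasks are our own illustrative assumptions rather than TAPA's internal implementation; unlike TAPA's simulator, this sketch starts each coroutine eagerly on construction and uses no thread pool.

```cpp
#include <boost/coroutine2/coroutine.hpp>
#include <cstddef>
#include <deque>
#include <iostream>
#include <vector>

using Coro = boost::coroutines2::coroutine<void>;

// A bounded FIFO channel; the capacity is simulated explicitly.
template <typename T>
class Channel {
 public:
  explicit Channel(size_t capacity) : capacity_(capacity) {}
  bool full() const { return buf_.size() >= capacity_; }
  bool empty() const { return buf_.empty(); }
  void write(const T& val) { buf_.push_back(val); }
  T read() { T val = buf_.front(); buf_.pop_front(); return val; }

 private:
  size_t capacity_;
  std::deque<T> buf_;
};

int main() {
  Channel<int> vertex_req(/*capacity=*/2);
  bool done = false;

  // Each task is a stackful coroutine that yields (suspends) whenever its
  // channel operation cannot proceed, instead of blocking an OS thread.
  std::vector<Coro::pull_type> tasks;

  tasks.emplace_back([&](Coro::push_type& yield) {  // Ctrl: producer.
    for (int i = 0; i < 8; ++i) {
      while (vertex_req.full()) yield();  // No progress; run another task.
      vertex_req.write(i);
    }
    done = true;
  });

  tasks.emplace_back([&](Coro::push_type& yield) {  // VertexHandler: consumer.
    for (;;) {
      while (vertex_req.empty()) {
        if (done) return;  // Producer finished and channel drained.
        yield();           // No progress; run another task.
      }
      std::cout << "handling vertex request " << vertex_req.read() << "\n";
    }
  });

  // Round-robin scheduler: resume each unfinished coroutine in turn.
  bool progress = true;
  while (progress) {
    progress = false;
    for (auto& task : tasks) {
      if (task) {  // Coroutine has not finished yet.
        task();    // Resume until it yields or finishes.
        progress = true;
      }
    }
  }
}
```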
3.3 RTL Code Generation

State-of-the-Art Approach. Current HLS tools treat the whole task-parallel program as a monolithic design, treat channels as global variables, and compile different instances of tasks as if they were completely unrelated. While this enables instance-specific optimizations, e.g., different constant arguments can be propagated to different instances, it can also lead to a significant amount of repeated work. For example, the dataflow architecture generated by the SODA compiler [9, 10] is highly modularized and many modules are functionally identical. However, both the Vivado HLS backend and the Intel FPGA OpenCL backend of SODA generate RTL code for each SODA module separately. When the design scales out to hundreds of modules, RTL code generation can easily run for hours, taking even longer than logic synthesis and implementation. While we recognize that a programmer can manually generate RTL code for each task and glue the results together at the RTL level to speed up RTL code generation, doing so defeats the purpose of using HLS for high productivity, because the glued RTL code can be error-prone yet cannot be verified using fast software simulation. We also recognize that fast RTL code generation in general is an interesting problem, but we focus on the inefficiency exacerbated by task-parallel programs in this paper.
Modularized Approach. Thanks to the hierarchical programming model, TAPA can keep the program hierarchy, recognize different instances of the same task, and compile each task only once. As such, the total amount of time spent on RTL code generation is reduced. Moreover, modularized compilation makes it possible to compile tasks in parallel, further reducing RTL code generation time on multi-core machines. TAPA implements this by performing a source-to-source transformation to generate the vendor HLS code for each task and invoking the vendor tools in parallel for each task. On average, TAPA reduces HLS compilation time by 4.9× (Section 4.5).
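The following is a minimal sketch of this idea: deduplicate task instances, then run the vendor HLS tool once per unique task, in parallel. The run_vendor_hls() helper and the Tcl script naming convention are hypothetical stand-ins for the actual tool invocation.

```cpp
#include <cstdlib>
#include <future>
#include <set>
#include <string>
#include <vector>

// Hypothetical wrapper: synthesize one task function into RTL.
int run_vendor_hls(const std::string& task_name) {
  const std::string cmd = "vitis_hls -f synth_" + task_name + ".tcl";
  return std::system(cmd.c_str());
}

int main() {
  // Task *instances* in the design; the PE is instantiated 8 times.
  std::vector<std::string> instances = {"PE", "PE", "PE", "PE",
                                        "PE", "PE", "PE", "PE",
                                        "VertexHandler", "Ctrl"};

  // 1) Deduplicate: each unique task is compiled only once.
  std::set<std::string> unique_tasks(instances.begin(), instances.end());

  // 2) Compile the unique tasks in parallel, one HLS run per task.
  std::vector<std::future<int>> jobs;
  for (const auto& task : unique_tasks)
    jobs.emplace_back(std::async(std::launch::async, run_vendor_hls, task));
  for (auto& job : jobs) job.get();  // Wait for all HLS runs to finish.

  // 3) The per-task RTL would then be instantiated and stitched together
  //    using the extracted metadata (topology, channel capacities).
}
```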
3.4 Automation Flow

The TAPA automation flow is shown in Figure 4. The TAPA C++ source code can be compiled directly for software simulation and correctness verification. Starting from the same TAPA C++ source code, TAPA extracts the HLS code for each task and the metadata information of the whole design, including the communication topology among tasks, the token types exchanged between tasks, and the channels' capacities. The vendor HLS tool is then leveraged to generate RTL code and a performance/resource report for each task. The extracted metadata is used to instantiate the task instances and connect them together systematically, producing the overall HLS report and kernel RTL code, which can be used for QoR tuning and for logic synthesis and implementation, respectively. The same metadata information is also used to create the host-kernel communication interface, which can be used for on-board execution or, optionally, RTL simulation.
Figure 4: TAPA automation flow overview.
4 EVALUATION

4.1 Experimental Setup

We prototype TAPA on Xilinx devices using Vivado HLS as the backend; support for Intel devices will be added later. The Clang compiler infrastructure is modified to extract information about tasks and to perform the source-to-source transformation that generates Vivado HLS kernel code and OpenCL host code. GCC is used to compile the host executables and the software simulators. We compare the productivity of TAPA with two vendor tools that provide an end-to-end high-level programming experience (including host-kernel communication): the Xilinx Vitis/Vivado HLS 2019.2 suite and the Intel FPGA SDK for OpenCL Pro Edition 19.4. The experimental results are obtained on an Ubuntu 18.04 server with two Xeon Gold 6244 processors.
4.2 Benchmarks

We used the following benchmarks for comparison. All implementations (Vivado HLS, Intel OpenCL, and TAPA) of each benchmark are written in such a way that the tasks in each implementation have a one-to-one correspondence, corresponding loops are scheduled with the same initiation interval (II), and each task performs the same computation. This guarantees that all tools generate consistent quality of results. Note that we aim to compare the productivity of each of the HLS tools, not the quality of result. In particular, we were unable to guarantee that the generated RTL codes have exactly the same behavior without having access to the HLS compilers' scheduling algorithms. For example, the network switch implemented in TAPA has a total latency of 3 cycles while the Vivado HLS implementation has a total latency of 6. This is inevitable because, using Vivado HLS, one has to manually buffer the incoming packets, forcing an additional latency of 1 cycle at each network stage. Table 3 summarizes the number of tasks and channels used in each benchmark.
Cannon's Algorithm. Cannon's algorithm [44] is a distributed algorithm for matrix multiplication that runs on a 2D mesh of processing elements (PE). This benchmark contains 8×8 PEs. Each PE is internally vectorized to perform 8 multiply-accumulate operations per cycle for two 128×128 matrices. Besides the 64 PEs, the accelerator also contains 9 data distributors/collectors for each matrix. The inputs to the whole accelerator are 1024×1024×1024.
Convolutional Neural Network. Convolutional neural networks are very popular for many machine learning applications, e.g., image classification [58]. This benchmark implements the third layer of VGG [58] based on a systolic array implementation generated by PolySA [16]. PolySA is a polyhedral-based systolic array auto-compilation framework that can generate optimal designs within one hour with performance comparable to state-of-the-art manual designs.

Gaussian Filter. The Gaussian filter is often employed for low-pass filtering of input signals or images, or used iteratively for solving linear systems of equations. This benchmark is based on a dataflow microarchitecture generated by SODA [10]. SODA is a stencil compiler that can generate optimal communication-reuse buffers with temporal and spatial parallelism. This benchmark performs 8 iterations of Gaussian filtering, each of which is capable of processing 16 input elements in parallel. The input size is 32768×32768.
Graph Convolutional Network. Graph convolutional networks [38] are an emerging type of neural network that processes sparse and irregular data, as opposed to dense and regular data like images. This benchmark implements a forward layer of GCN for the Cora dataset [57].

Table 3: Benchmarks used in this paper. Each task may be instantiated multiple times, so the number of task instances is greater than the number of tasks.

| Benchmark | #Tasks | #Task Instances | #Channels |
|---|---|---|---|
| cannon | | | |
| cnn | 14 | 209 | 366 |
| gaussian | 15 | 564 | 1602 |
| gcn | | | |
| gemm | 14 | 207 | 364 |
| network | | | |
| page_rank | | | |
General Matrix Multiplication. This benchmark is based on a systolic array implementation generated by PolySA [16]. Compared with Cannon's algorithm, PolySA avoids feedback data paths in the systolic array and can support non-square matrices. The inputs to the accelerator are 1040×1024×1024.
Network Switching. This benchmark implements an 8×8 Omega network switch [42] that can route packets from any input port to any output port. The packets are 64 bits wide, with the first 3 bits being the header, and are generated randomly with an even distribution among the 8 destination ports.
PageRank. This benchmark implements the PageRank [49] citation ranking algorithm for general large graphs, as described in Section 2.3. We use the Slashdot community graph [45] as the dataset for debugging, which contains 77360 vertices and 905468 edges. The accelerator design itself can scale to much larger graphs.

4.3 Lines of Code

TAPA simplifies the kernel code in two aspects. First, the TAPA communication interfaces simplify the code with the built-in support for peeking and transactions. This not only simplifies the body of each task definition, but also removes the necessity for many struct definitions. Second, the TAPA instantiation interfaces simplify the code by allowing tasks to be launched and detached concisely. Without this functionality, each task in Vivado HLS must be carefully given a termination condition, whereas Intel OpenCL requires verbose kernel instantiation attributes for each instance of a task. Figure 5 shows the lines of kernel code comparison for each benchmark. On average, TAPA reduces the lines of kernel code by 22%. Note that only synthesizable kernel code is counted; code added for multi-thread software simulation is not counted for Vivado HLS.
Figure 5: LoC comparison for kernel code. Lower is better.

The host code used in the benchmarks contains a minimal testbench to verify the correctness of the kernel code. The TAPA system-integration API automatically interfaces with the OpenCL host APIs and relieves the programmer from writing repetitive code just to connect the kernel to a host program. Figure 6 shows the lines of host code comparison. On average, the length of host code is reduced by 51%.

Figure 6: LoC comparison for host code. Lower is better.
4.4 Software Simulation

Figure 7 compares four simulators: the sequential Vivado HLS simulator, the multi-thread Vivado HLS simulator, the multi-thread Intel OpenCL simulator, and the coroutine-based TAPA simulator. Among them, the sequential simulator fails to correctly simulate benchmarks that require feedback data paths (cannon and page_rank). Due to the larger memory footprint required for storing the tokens transmitted between tasks and the lack of parallelism, the sequential simulator is outperformed by the coroutine-based simulator in all but one of the benchmarks (network). The two multi-thread simulators correctly simulate all benchmarks, except that Intel OpenCL cannot handle gaussian because its large number of task instances (564) exceeds the maximum allowed (256) by the simulator. However, the multi-thread simulators perform poorly on benchmarks that are communication-intensive (e.g., network) or have more tasks than the number of available threads (e.g., gaussian). The coroutine-based TAPA simulator can correctly simulate all benchmarks without significant performance loss for both communication-intensive and computation-intensive tasks, achieving a 3.2× average speedup.

Figure 7: Simulation time comparison. Lower is better. The sequential simulator fails to simulate cannon and page_rank correctly. The Intel OpenCL multi-thread simulator cannot simulate gaussian due to its large number of task instances.
4.5 RTL Code Generation

Figure 8 shows the RTL code generation time comparison. Thanks to the hierarchical programming model and the modularized code generator, TAPA accelerates HLS compilation time by 6.8× on average. This is because ① TAPA runs HLS for each task only once even if it is instantiated many times, while Vivado HLS and Intel OpenCL run HLS for each task instance, and ② TAPA runs HLS in parallel on multi-core machines.
Figure 8: RTL code generation time. Lower is better.
5 RELATED WORK

5.1 HLS Tools

The Intel HLS compiler supports two different inter-task communication interfaces, ihc::pipe and ihc::stream. ihc::pipe implements a light-weight hardware FIFO with data, valid, and ready signals, while ihc::stream implements an Avalon-ST interface that supports transactions. Tasks are instantiated using ihc::launch and ihc::collect. Software simulation is done by launching multiple threads. Instances of the same task are synthesized separately.

The Intel OpenCL compiler supports light-weight FIFOs via two sets of APIs, i.e., the standard OpenCL pipe and the Intel-specific channel. Tasks are instantiated by defining OpenCL kernels, which forces instances of the same task to be synthesized separately as different OpenCL kernels. The OpenCL runtime handles software simulation by launching multiple threads.
Vivado HLS provides two different streaming interfaces: ap_fifo and axis. The ap_fifo interface generates a light-weight FIFO interface. Tasks are instantiated by invoking the corresponding functions in a dataflow region, and instances of the same task are synthesized separately. Software simulation is done by sequentially executing the tasks. The axis interface generates an AXI-Stream interface with transaction support. It requires the programmer to instantiate channels and tasks in a separate configuration file when running logic synthesis and implementation. This allows different instances of the same task to be synthesized only once, but takes longer to learn and implement compared with ap_fifo. The OpenCL runtime handles software simulation for tasks instantiated with the axis interface by launching multiple threads.
The Xilinx OpenCL compiler supports the standard OpenCL pipe, which generates AXI-Stream interfaces similar to Vivado HLS axis, but pipe does not provide APIs to support transactions. Like Vivado HLS axis, software simulation of pipe is handled by the OpenCL runtime by launching multiple threads.
The LegUp compiler provides legup::FIFO, which implements light-weight FIFOs. Tasks are instantiated using the pthread API (Section 5.3). Software simulation is accomplished by launching multiple threads. Instances of the same task are synthesized separately.
The Merlin compiler [14] allows programmers to call the FPGA kernel as a C/C++ function and provides OpenMP-like simple pragmas with automated design-space exploration based on machine learning. To support task-parallel programs, Merlin leverages its backend vendor tools' programming interfaces. Software simulation is done by sequentially executing the tasks.

In summary, as pointed out in Table 1, none of the state-of-the-art HLS tools provide peeking support. Only Intel HLS ihc::stream and Vivado HLS axis support transactions. Only Merlin allows the accelerator kernel to be called as if it were a C/C++ function. Vivado HLS and Merlin execute tasks sequentially for simulation, while the others launch multiple threads. All HLS tools treat a task-parallel program as a monolithic design and generate RTL code for each instance of a task separately, except that Vivado HLS axis allows programmers to manually instantiate tasks using a configuration file when running logic synthesis and implementation.

5.2 Streaming Frameworks

Streaming applications are a special type of task-parallel application that do not require complex control over inter-task communication and often expose massive data parallelism in addition to task parallelism. There are previous works that focus specifically on such applications.
ST-Accel [54] is a high-level programming platform for streaming applications that features a highly efficient host-kernel communication interface exposed as a virtual file system (VFS). It uses Vivado HLS as its backend for hardware generation, and its software simulation is done by sequential execution.
Fleet [60] is a massively parallel streaming framework for FPGAs that features highly efficient memory interfaces for massive numbers of parallel processing elements. Programmers write Fleet programs in a domain-specific RTL language based on Chisel [1]. The programs can be simulated in Scala, the language in which Chisel is embedded.

In summary, while these frameworks are specialized for streaming patterns, neither of them provides a peeking or transaction interface in the kernel. Both run software simulation sequentially, which does not cause correctness problems for streaming applications but will be restrictive for general task-parallel programs.

5.3 Alternative APIs

SystemC is a set of C++ classes and macros that provide detailed hardware modeling and event-driven simulation. It supports both cycle-accurate and untimed simulation, and many simulator implementations are available [13, 56]. Some HLS tools support a subset of untimed SystemC as the input [63]. SystemC supports task-parallel programs natively via the sc_module constructs and tlm_fifo interfaces. Listing 6 shows an example using the accelerator discussed in Section 2.3. Compared with other C-like HLS languages, SystemC can model more hardware details but is more verbose and less productive due to its special language constructs: for the code snippets shown in Listing 4 and Listing 5, equivalent SystemC code would be 37% longer.
The pthread API is a set of widely used standard APIs that can be used to implement task-parallel programs using threads. Pthread requires programmers to explicitly create and join threads, and arguments need to be manually packed and passed. Listing 7 shows an example using the accelerator discussed in Section 2.3. Compared with the tapa::invoke API used by TAPA, the pthread APIs require more effort to program: for the code snippets shown in Listing 4 and Listing 5, equivalent pthread-based code would be 78% longer.

In summary, while the existing API alternatives are widely used in some domains, they are more verbose and thus less productive compared with TAPA.
6 CONCLUSION

In this paper, we present TAPA as an HLS C++ language extension to enhance the programming productivity of task-parallel programs on FPGAs. TAPA has multiple advantages over state-of-the-art HLS tools: 1) its enhanced programming interface helps to reduce the lines of kernel code by 22% on average, 2) its unified system integration interface reduces the lines of host code by 51% on average,
3) its coroutine-based software simulator shortens the correctness verification development cycle by 3.2× on average, and 4) its modularized code generation approach accelerates the QoR tuning development cycle by 6.8× on average. As a fully automated and open-source framework, TAPA aims to provide a highly productive development experience for task-parallel programs using HLS. For future work, we plan to extend TAPA to support dynamically generating and executing tasks on FPGAs.

ACKNOWLEDGMENT
This work is partially supported by a Google Faculty Award, the NSF RTML program, the Xilinx Adaptive Compute Cluster (XACC) Program, and the CDSC industrial partners.
REFERENCES

[1] Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: Constructing Hardware in a Scala Embedded Language. In DAC.
[2] Eli Bendersky. 2018. Measuring context switching and memory overheads for Linux threads. https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/
[3] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason Anderson, Stephen Brown, and Tomasz Czajkowski. 2011. LegUp: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems. In FPGA.
[4] Han Chen, Sergey Madaminov, Michael Ferdman, and Peter Milder. 2020. FPGA-Accelerated Samplesort for Large Data Sets. In FPGA.
[5] Yu Ting Chen, Jin Hee Kim, Kexin Li, Graham Hoyes, and Jason H. Anderson. 2019. High-Level Synthesis Techniques to Generate Deeply Pipelined Circuits for FPGAs with Registered Routing. In FPT.
[6] Jianyi Cheng, Shane T. Fleming, Yu Ting Chen, Jason H. Anderson, and George A. Constantinides. 2019. EASY: Efficient Arbiter SYnthesis from Multi-threaded Code. In FPGA.
[7] Jianyi Cheng, Lana Josipović, George A. Constantinides, Paolo Ienne, and John Wickerson. 2020. Combining Dynamic & Static Scheduling in High-level Synthesis. In FPGA.
[8] Yuze Chi, Young-kyu Choi, Jason Cong, and Jie Wang. 2019. Rapid Cycle-Accurate Simulator for High-Level Synthesis. In FPGA.
[9] Yuze Chi and Jason Cong. 2020. Exploiting Computation Reuse for Stencil Accelerators. In DAC.
[10] Yuze Chi, Jason Cong, Peng Wei, and Peipei Zhou. 2018. SODA: Stencil with Optimized Dataflow Architecture. In ICCAD.
[11] Yuze Chi, Guohao Dai, Yu Wang, Guangyu Sun, Guoliang Li, and Huazhong Yang. 2016. NXgraph: An Efficient Graph Processing System on a Single Machine. In ICDE.
[12] Young-kyu Choi, Yuze Chi, Jie Wang, and Jason Cong. 2020. FLASH: Fast, ParalleL, and Accurate Simulator for HLS. TCAD (2020).
[13] Moo Kyoung Chung, Jun Kyoung Kim, and Soojung Ryu. 2014. SimParallel: A High Performance Parallel SystemC Simulator Using Hierarchical Multi-threading. (2014).
[14] Jason Cong, Muhuan Huang, Peichen Pan, Di Wu, and Peng Zhang. 2016. Software Infrastructure for Enabling FPGA-Based Accelerations in Data Centers. In ISLPED.
[15] Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru Zhang. 2011. High-Level Synthesis for FPGAs: From Prototyping to Deployment. TCAD (2011).
[16] Jason Cong and Jie Wang. 2018. PolySA: Polyhedral-Based Systolic Array Auto-Compilation. In ICCAD.
[17] Jason Cong, Peng Wei, Cody Hao Yu, and Peng Zhang. 2018. Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture. In DAC.
[18] Jason Cong, Peng Wei, Cody Hao Yu, and Peipei Zhou. 2018. Latte: Locality Aware Transformation for High-Level Synthesis. In FCCM.
[19] Jason Cong and Zhiru Zhang. 2006. An Efficient and Versatile Scheduling Algorithm Based On SDC Formulation. In DAC.
[20] Melvin E. Conway. 1963. Design of a Separable Transition-Diagram Compiler. Commun. ACM 6, 7 (1963), 396–408.
[21] Leonardo Dagum and Ramesh Menon. 1998. OpenMP: An Industry Standard API for Shared-Memory Programming. IEEE Computational Science and Engineering 5, 1 (1998), 46–55.
[22] Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. 2016. FPGP: Graph Processing Framework on FPGA - A Case Study of Breadth-First Search. In FPGA.
[23] Guohao Dai, Tianhao Huang, Yuze Chi, Ningyi Xu, Yu Wang, and Huazhong Yang. 2017. ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture. In FPGA.
[24] Johannes De Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler. 2020. Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis. In FPGA.
[25] Ana Lúcia de Moura and Roberto Ierusalimschy. 2009. Revisiting Coroutines. TOPLAS 31, 2 (2009).
[26] Chenhui Deng, Zhiqiang Zhao, Yongyu Wang, Zhiru Zhang, and Zhuo Feng. 2020. GraphZoom: A Multi-level Spectral Approach for Accurate and Scalable Graph Embedding. In ICLR.
[27] Licheng Guo, Jason Lau, Yuze Chi, Jie Wang, Cody Hao Yu, Zhe Chen, Zhiru Zhang, and Jason Cong. 2020. Analysis and Optimization of the Implicit Broadcasts in FPGA HLS to Improve Maximum Frequency. In DAC.
[28] Ameer Haj-Ali, Qijing Huang, William Moses, John Xiang, Krste Asanovic, John Wawrzynek, and Ion Stoica. 2020. AutoPhase: Juggling HLS Phase Orderings in Random Forests with Deep Reinforcement Learning. In MLSys.
[29] C. A. R. Hoare. 1978. Communicating Sequential Processes. Commun. ACM (1978).
[30] [Entry lost in extraction.] In DAC.
[31] Intel. 2020. Intel FPGA SDK for OpenCL Pro Edition: Programming Guide.
[32] Intel. 2020. Intel High Level Synthesis Compiler Pro Edition: User Guide.
[33] Al Shahna Jamal, Eli Cahill, Jeffrey Goeders, and Steven J. E. Wilton. 2020. Fast Turnaround HLS Debugging using Dependency Analysis and Debug Overlays. TRETS 13, 1 (2020).
[34] Jiantong Jiang, Zeke Wang, Xue Liu, Juan Gómez-Luna, Nan Guan, Qingxu Deng, Wei Zhang, and Onur Mutlu. 2020. Boyi: A Systematic Framework for Automatically Deciding the Right Execution Model of OpenCL Applications on FPGAs. In FPGA.
[35] Lana Josipović, Shabnam Sheikhha, Andrea Guerrieri, Paolo Ienne, and Jordi Cortadella. 2020. Buffer Placement and Sizing for High-Performance Dataflow Circuits. In FPGA.
[36] Gilles Kahn. 1974. The Semantics of a Simple Language for Parallel Programming. In IFIP.
[37] Soroosh Khoram, Jialiang Zhang, Maxwell Strange, and Jing Li. 2018. Accelerating Graph Analytics by Co-Optimizing Storage and Access on an FPGA-HMC Platform. In FPGA.
[38] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
[39] Donald Ervin Knuth. 1997. Fundamental Algorithms. The Art of Computer Programming 1 (3rd ed.).
[40] Mostafa Koraei, Omid Fatemi, and Magnus Jahre. 2019. DCMI: A Scalable Strategy for Accelerating Iterative Stencil Loops on FPGAs. TACO 16, 4 (2019).
[41] Oliver Kowalke. 2014. Boost Library Documentation, Coroutine2. https://boost.org/doc/libs/1_65_0/libs/coroutine2/doc/html/coroutine2/intro.html
[42] Duncan H. Lawrie. 1975. Access and Alignment of Data in an Array Processor. ToC C-24, 12 (1975).
[43] Edward A. Lee and David G. Messerschmitt. 1987. Synchronous Data Flow. Proceedings of the IEEE 75, 9 (1987).
[44] Hyuk-Jae Lee, James P. Robertson, and José A. B. Fortes. 1997. Generalized Cannon's Algorithm for Parallel Matrix Multiplication. In ICS.
[45] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. 2009. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6, 1 (2009), 29–123.
[46] Jiajie Li, Yuze Chi, and Jason Cong. 2020. HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration. In FPGA.
[47] Julian McAuley. 2012. Learning to Discover Social Circles in Ego Networks. In NIPS.
[48] Eriko Nurvitadhi, Gabriel Weisz, Yu Wang, Skand Hurkat, Marie Nguyen, James C. Hoe, José F. Martínez, and Carlos Guestrin. 2014. GraphGen: An FPGA Framework for Vertex-Centric Graph Computation. In FCCM.
[49] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1998. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report.
[50] Philippos Papaphilippou, Jiuxi Meng, and Wayne Luk. 2020. High-Performance FPGA Network Switch Architecture. In FPGA.
[51] James L. Peterson. 1977. Petri Nets. Comput. Surveys 9, 3 (1977).
[52] Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan-Kelley, and Mark Horowitz. 2017. Programming Heterogeneous Systems from an Image Processing DSL. TACO 14, 3 (2017).
[53] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-Stream: Edge-centric Graph Processing using Streaming Partitions. In SOSP.
[54] Zhenyuan Ruan, Tong He, Bojie Li, Peipei Zhou, and Jason Cong. 2018. ST-Accel: A High-Level Programming Platform for Streaming Applications on FPGA. In FCCM.
[55] Vladimir Rybalkin and Norbert Wehn. 2020. When Massive GPU Parallelism Ain't Enough: A Novel Hardware Architecture of 2D-LSTM Neural Network. In FPGA.
[56] Tim Schmidt, Guantao Liu, and Rainer Dömer. 2017. Exploiting Thread and Data Level Parallelism for Ultimate Parallel SystemC Simulation. In DAC.
[57] Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. 2008. Collective Classification in Network Data. AI Magazine 29, 3 (2008), 93–106.
[58] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
[59] Atefeh Sohrabizadeh, Jie Wang, and Jason Cong. 2020. End-to-End Optimization of Deep Learning Applications. In FPGA.
[60] James Thomas, Pat Hanrahan, and Matei Zaharia. 2020. Fleet: A Framework for Massively Parallel Streaming on FPGAs. In ASPLOS.
[61] Yu Wang, James C. Hoe, and Eriko Nurvitadhi. 2019. Processor Assisted Worklist Scheduling for FPGA Accelerated Graph Processing on a Shared-Memory Platform. In FCCM.
[62] Xuechao Wei, Yun Liang, and Jason Cong. 2019. Overcoming Data Transfer Bottlenecks in FPGA-based DNN Accelerators via Layer Conscious Memory Management. In DAC.
[63] Xilinx. 2020. Vivado Design Suite User Guide: High-Level Synthesis (UG902).
[64] Tanner Young-Schultz, Lothar Lilge, Stephen Brown, and Vaughn Betz. 2020. Using OpenCL to Enable Software-like Development of an FPGA-Accelerated Biophotonic Cancer Treatment Simulator. In FPGA.
[65] Hanqing Zeng and Viktor Prasanna. 2020. GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms. In FPGA.
[66] Jialiang Zhang and Jing Li. 2018. Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform. In FPGA.
[67] Shijie Zhou, Rajgopal Kannan, Viktor K. Prasanna, Guna Seetharaman, and Qing Wu. 2019. HitGraph: High-throughput Graph Processing Framework on FPGA. TPDS (2019).
[68] Hamid Reza Zohouri, Artur Podobas, and Satoshi Matsuoka. 2018. Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL. In FPGA.