Agile SoC Development with Open ESP

Paolo Mantovani, Davide Giri, Giuseppe Di Guglielmo, Luca Piccolboni, Joseph Zuckerman, Emilio G. Cota, Michele Petracca, Christian Pilato, and Luca P. Carloni*
{paolo,davide_giri,giuseppe,piccolboni,jzuck,cota,petracca,pilato,luca}@cs.columbia.edu
Department of Computer Science, Columbia University in the City of New York, New York, NY 10027

*Emilio G. Cota is now with Google. Michele Petracca is now with Cadence Design Systems. Christian Pilato is now with Politecnico di Milano.

ABSTRACT
ESP is an open-source research platform for heterogeneous SoC design. The platform combines a modular tile-based architecture with a variety of application-oriented flows for the design and optimization of accelerators. The ESP architecture is highly scalable and strikes a balance between regularity and specialization. The companion methodology raises the level of abstraction to system-level design and enables an automated flow from software and hardware development to full-system prototyping on FPGA. For application developers, ESP offers domain-specific automated solutions to synthesize new accelerators for their software and to map complex workloads onto the SoC architecture. For hardware engineers, ESP offers automated solutions to integrate their accelerator designs into the complete SoC. Conceived as a heterogeneous integration platform and tested through years of teaching at Columbia University, ESP supports the open-source hardware community by providing a flexible platform for agile SoC development.
KEYWORDS
System-level design, SoC, accelerators, network-on-chip.
Why ESP?
ESP is an open-source research platform for heterogeneous system-on-chip (SoC) design and programming [18]. ESP is the result of nine years of research and teaching at Columbia University [11, 12]. Our research was and is motivated by the consideration that Information Technology has entered the age of heterogeneous computing. From embedded devices at the edge of the cloud to data center blades at the core of the cloud, specialized hardware accelerators are increasingly employed to achieve energy-efficient performance [9, 15, 31]. Across a variety of application domains, such as mobile electronics, automotive, natural-language processing, graph analytics, and more, computing systems rely on highly heterogeneous SoC architectures. These architectures combine general-purpose processors with a variety of accelerators specialized for tasks like image processing, speech recognition, radio communication, and graphics [20], as well as special-purpose processor cores with custom instruction sets, graphics processing units, and tensor manipulation units [32]. The shift of the silicon industry from homogeneous multicore processors to heterogeneous SoCs is particularly noticeable if one looks at the portion of chip area dedicated to accelerators in subsequent generations of state-of-the-art chips for smartphones [48], or at the variety of processing elements in chips for autonomous driving [16].
ESP Vision.
ESP is a platform, i.e., the combination of an architecture and a methodology [11]. The methodology embodies a set of agile SoC design and integration flows, as shown in Figure 1. The ESP vision is to allow application domain experts to design SoCs. Currently, ESP allows SoC architects to rapidly implement FPGA-based prototypes of complex SoCs. The ESP scalable architecture and its flexible methodology enable the seamless integration of third-party open-source hardware (OSH) components (e.g., the Ariane RISC-V core [1, 51] or the NVIDIA Deep-Learning Accelerator [3]). SoC architects can also instantiate accelerators that are developed with one of the many design flows and languages supported by ESP. The list, which continues to grow, currently includes: C/C++ with Xilinx Vivado HLS and Mentor Catapult HLS; SystemC with Cadence Stratus HLS; Keras TensorFlow, PyTorch, and ONNX with hls4ml; and Chisel, SystemVerilog, and VHDL for register-transfer level (RTL) design. Hence, accelerator designers can choose the abstraction level and specification language that are most suitable for their coding skills and the target computation kernels. These design flows enable the creation of a rich library of components ready to be instantiated into the ESP tile-based architecture with the help of the SoC integration flow.

Thanks to the automatic generation of device drivers from pre-designed templates, the ESP methodology simplifies the invocation of accelerators from user-level applications executing on top of Linux [25, 39]. Through the automatic generation of a network-on-chip (NoC) from a parameterized model, the ESP architecture can scale to accommodate many processors, tens of accelerators, and a distributed memory hierarchy [27]. A set of platform services provides pre-validated solutions to access or manage SoC resources, including accelerator configuration [41], memory management [39], and dynamic voltage-frequency scaling (DVFS) [40], among others. ESP comes with a GUI that guides the designers through the interactive choice and placement of the tiles in the SoC, and it has push-button capabilities for rapid prototyping of the SoC on FPGA.

Figure 1: Agile SoC design and integration flows in ESP.
Open-Source Hardware.
OSH holds the promise of boosting hardware development and creating new opportunities for academia and entrepreneurship [30]. In recent years, no other project has contributed to the growth of the OSH movement more than RISC-V [6]. To date, the majority of OSH efforts have focused on the development of processor cores that implement the RISC-V ISA and of small-scale SoCs that connect these cores with tightly-coupled functional units and coprocessors, typically through bus-based interconnects. Meanwhile, there have been fewer efforts in developing solutions for large-scale SoCs that combine RISC-V cores with many loosely-coupled components, such as coarse-grained accelerators [19], interconnected with a NoC. With this gap in mind, we have made an open-source release of ESP to provide the OSH community with a platform for heterogeneous SoC design and prototyping [18].

Figure 2: Example of a 3x3 instance of ESP with a high-level overview of the sockets for processors and accelerators.
The ESP Architecture

The ESP architecture is structured as a heterogeneous tile grid. For a given application domain, the architect decides the structure of the SoC by determining the number and mix of tiles. For example, Figure 2 shows a 9-tile SoC organized in a 3x3 grid and interconnected by a multiplane NoC [49]. The content of each tile is encapsulated into a modular socket (aka shell), which interfaces the tile to the NoC and implements the platform services. The socket-based approach, which decouples the design of a tile from the design of the rest of the system, is one of the key elements of the agile ESP SoC design flow. It greatly simplifies the design effort of each tile by taking care of all the system-integration aspects, and it facilitates the reuse of intellectual property (IP) blocks. For instance, the ESP accelerator socket implements services for DMA, cache coherence, performance monitors, and distributed interrupt requests.

At design time, it is possible to choose the set of services to instantiate in each tile. At runtime, the services can be enabled, and many of them offer reconfigurability options, e.g., dynamic reconfiguration of the cache-coherence model [28].

The ESP architecture implements a distributed system that is inherently scalable, modular, and heterogeneous, where processors and accelerators are given the same importance in the SoC. Differently from other OSH platforms, ESP proposes a system-centric view, as opposed to a processor-centric view. Processing elements act as transaction masters that access peripherals and slave devices distributed across remote tiles. All remote communication is supported by the NoC, which acts as a transparent communication layer. Processors and accelerators, in fact, operate as if all remote components were connected to their local bus controller in the ESP sockets. The sockets include standard bus ports, bridges, interface adapters, and proxy components that provide a complete decoupling from the network interface. Figure 3 shows in detail the modular architecture of the ESP interconnect for the case of a six-plane NoC. Every platform service is implemented by a pair of proxy components. One proxy translates requests from bus masters, such as processors and accelerators, into transactions for one of the NoC planes. The other proxy forwards requests from the NoC planes to the target slave device, such as a last-level cache (LLC) partition or the Ethernet interface. For each proxy, there is a corresponding buffer queue, located between the tile port of the NoC routers and the proxy itself. In Figure 3, the color of a queue depends on the assigned NoC plane. The number and direction of the arrows connected to the queues indicate whether packets can flow from the NoC to the tile, from the tile to the NoC, or in both directions. The arrows connect the queues to the proxies, which are labeled with the name of the service they implement and the number of the NoC plane used for receiving and sending packets, respectively.

The current implementation of the ESP NoC is a packet-switched 2D-mesh topology with look-ahead dimensional routing. Every hop takes a single clock cycle because arbitration and next-route computation are performed concurrently. Multiple physical planes allow protocol-deadlock prevention and provide sufficient bandwidth for the various message types. For example, since a distributed directory-based protocol for cache coherence requires three separate channels, planes 1, 2, and 3 in Figure 3 are assigned to request, forward, and response messages, respectively.
Concurrent DMA transactions, issued by multiple accelerators and handled by various remote memory tiles, require separate request and response planes. Instead of reusing the cache-coherence planes, the addition of two dedicated planes (4 and 6 in Figure 3) increases the overall NoC bandwidth. Finally, one last plane is reserved for short messages, including interrupts, I/O configuration, monitoring, and debug. Currently, customizing the NoC topology is not automated in the ESP SoC integration flow. System architects, however, may explore different topologies by modifying the router instances and updating the logic that generates the header flit of the NoC packets [50].
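To make the plane assignment described above concrete, the following minimal sketch models the mapping from message class to physical plane for the non-coherent DMA case. The names are illustrative placeholders, not identifiers from the ESP sources, and per the labels in Figure 3 the LLC-coherent DMA case swaps planes 4 and 6.

```cpp
// Illustrative model of the six-plane assignment described above.
// Enumerator names are hypothetical, not ESP identifiers.
enum class MsgClass {
  CohReq,       // cache-coherence request
  CohFwd,       // cache-coherence forward
  CohRsp,       // cache-coherence response
  DmaDevToMem,  // non-coherent DMA, device to memory
  DmaMemToDev,  // non-coherent DMA, memory to device
  Short         // interrupts, I/O configuration, monitoring, debug
};

// Each message class travels on its own physical plane: this prevents
// protocol deadlock and adds aggregate bandwidth across the planes.
constexpr int plane_of(MsgClass c) {
  switch (c) {
    case MsgClass::CohReq:      return 1;
    case MsgClass::CohFwd:      return 2;
    case MsgClass::CohRsp:      return 3;
    case MsgClass::DmaMemToDev: return 4;
    case MsgClass::Short:       return 5;
    case MsgClass::DmaDevToMem: return 6;
  }
  return -1; // unreachable
}
```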
Figure 3: Detailed architecture of the NoC interface for the four main ESP tiles.

Processor Tile.
Each processor tile contains a processor core that is chosen at design time among those available: the current choice is between the RISC-V 64-bit Ariane core from ETH Zurich [1, 51] and the SPARC 32-bit LEON3 core from Cobham Gaisler [17]. Both cores are capable of running Linux and they come with their private L1 caches. The processor integration into the distributed ESP system is transparent: no ESP-specific software patches are needed to boot Linux. Each processor communicates on a local bus and is agnostic of the rest of the system. The memory interface of the LEON3 core requires a 32-bit AHB bus, whereas Ariane comes with a 64-bit AXI interface. In addition to proxies and bus adapters, the processor socket provides a unified private L2 cache of configurable size, which implements a directory-based MESI cache-coherence protocol. Processor requests directed to memory-mapped I/O registers are forwarded by the socket to the IO/IRQ NoC plane through an APB adapter. The only processor-specific component of the socket is an interrupt-level proxy, which implements the custom communication protocol between the processor and the interrupt controller and system timer in the auxiliary tile.
Memory Tile.
Each memory tile contains a channel to external DRAM. The number of memory tiles can be configured at design time; typically, it varies from one to four depending on the size of the SoC. All the hardware logic necessary to support the partitioning of the addressable memory space is automatically generated, and the partitioning is completely transparent to software. Each memory tile also contains a configurably-sized partition of the LLC with the corresponding directory. The LLC in ESP implements an extended MESI protocol, in combination with the private L2 caches in the processor tiles, that supports Linux with symmetric multiprocessing, as well as runtime-reconfigurable coherence for accelerators [28].
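As a flavor of the kind of partitioning logic the flow generates, the sketch below maps a physical address to the memory tile that owns it, assuming equal contiguous partitions and hypothetical constants; the actual generated logic and address map may differ.

```cpp
#include <cstdint>

// Hypothetical parameters: DRAM window and the number of memory
// tiles chosen at design time (one to four in typical ESP SoCs).
constexpr uint64_t kDramBase = 0x80000000ULL;
constexpr uint64_t kDramSize = 0x40000000ULL; // 1 GiB
constexpr int      kMemTiles = 4;

// Owner tile of a physical address under an equal contiguous split.
// Software never sees this mapping: it is resolved by generated logic.
constexpr int mem_tile_of(uint64_t paddr) {
  return static_cast<int>((paddr - kDramBase) / (kDramSize / kMemTiles));
}
```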
Accelerator Tile.
This tile contains the specialized hardware of a loosely-coupled accelerator [19]. This type of accelerator executes a coarse-grained task independently from the processors while exchanging large datasets with the memory hierarchy. To be integrated in the ESP tile, illustrated in the top-left portion of Figure 3, an accelerator should comply with a simple interface that includes load/store ports for latency-insensitive channels [10, 13], signals to configure and start the accelerator, and an acc_done signal to notify the accelerator completion and generate an interrupt for the processors. ESP accelerators that are newly designed with one of the supported design flows automatically comply with this interface. For existing accelerators, ESP offers a third-party integration flow [24]. In this case, the accelerator tile has only a subset of the proxy components because the configuration registers, the DMA engine for memory access, and the TLB for virtual memory [39] are replaced by standard bus adapters.

The set of platform services provided by the socket relieves the designer from the burden of "reinventing the wheel" with respect to implementing accelerator configuration through memory-mapped registers, address translation, and coherence protocols. Furthermore, the socket enables point-to-point (P2P) communication among accelerator tiles so that they can exchange data directly instead of necessarily communicating through shared memory. Third-party accelerators can use the services to issue interrupt requests, receive configuration parameters, and initiate DMA transactions. They are responsible, however, for managing shared resources, such as reserved memory regions, and for implementing their own custom hardware-software synchronization protocol.

The runtime-reconfigurable coherence-protocol service is particularly relevant for accelerators. In fact, no static coherence protocol can serve equally well all invocations of a set of heterogeneous accelerators in a given SoC [26]. With the non-coherent DMA model, an accelerator bypasses the cache hierarchy to exchange data directly with main memory. With the fully-coherent model, the accelerator communicates with an optional private cache placed in the accelerator socket. The ESP cache hierarchy augments a directory-based MESI protocol with support for two models where accelerators send requests directly to the LLC, without owning a private cache: the LLC-coherent DMA and the coherent DMA models. The latter keeps the accelerator requests coherent with respect to all private caches in the system, whereas the former does not. While fully-coherent and coherent DMA are handled entirely in hardware by the ESP cache hierarchy, non-coherent DMA and LLC-coherent DMA demand that software acquire appropriate locks and flush the private caches before invoking accelerators. These synchronization mechanisms are implemented by the ESP device drivers, which are generated automatically when selecting any of the supported HLS flows discussed in the methodology section below.
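The division of labor between hardware and software can be summarized with a hedged sketch of what a generated driver might do around one invocation; every identifier below is hypothetical and stands in for internals of the actual ESP drivers.

```c
/* Hypothetical driver-side handling of the four coherence modes. */
enum coherence_mode {
    FULLY_COHERENT,   /* private cache in the socket; all in hardware */
    COHERENT_DMA,     /* LLC requests kept coherent with L1/L2 caches */
    LLC_COHERENT_DMA, /* LLC requests, software-managed coherence     */
    NON_COHERENT_DMA  /* bypasses the cache hierarchy entirely        */
};

static void invoke(struct acc_dev *dev, enum coherence_mode mode)
{
    acquire_buffer_lock(dev);          /* serialize shared buffers */

    /* Only the two software-managed modes need explicit flushes. */
    if (mode == LLC_COHERENT_DMA || mode == NON_COHERENT_DMA)
        flush_private_caches();

    select_coherence(dev, mode);       /* per-invocation register  */
    start_and_wait_for_irq(dev);       /* run until acc_done       */

    release_buffer_lock(dev);
}
```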
Auxiliary Tile.
The auxiliary tile hosts all the shared peripherals in the system other than memory: the Ethernet NIC, UART, a digital video interface, a debug link to control ESP prototypes on FPGA, and a monitor module that collects various performance counters and periodically forwards them through the Ethernet interface. As shown in Figure 3, the socket of the auxiliary tile is the most complex because most platform services must be available to serve the devices hosted by this tile. The interrupt-level proxy, for instance, manages the communication between the processors and the interrupt controller. Ethernet, which requires coherent DMA to operate as a slave peripheral, enables users to remotely log into an ESP instance via SSH. The frame-buffer memory, dedicated to the video output, is connected to one proxy for memory-mapped I/O and one for non-coherent DMA transactions. These enable both processor cores and accelerators to write directly into the video frame buffer. The Ethernet debug interface [23], instead, uses the memory-mapped I/O and register-access services to allow ESP users to monitor and debug the system through the ESP Link application. Symmetrically, the UART, timer, interrupt controller, and boot ROM are controlled by any master in the system through the counterpart proxies for memory-mapped I/O and register access. Hence, the auxiliary tile includes both pairs of proxies. These enable an additional communication path, labeled as local port shortcut in Figure 3, which connects the masters in the auxiliary tile (i.e., the Ethernet debug link) with slaves that do not share the same local bus. A similar shortcut in the processor tile allows a processor to flush its own private L2 cache and manage its local DVFS controller.
The ESP Software Stack

The ESP accelerator Application Programming Interface (API) library simplifies the invocation of accelerators from a user application by exposing only three functions to the programmer [25]. Underneath, the API invokes the accelerators through the automatically generated Linux device drivers. The API is lightweight and can be targeted from existing applications or by a compiler. For a given application, the software execution of a computationally intensive kernel can be replaced with a hardware accelerator by means of a single function call (esp_run()). Figure 4 shows the case of an application with four computation kernels, two executed in software and two implemented with accelerators. The configuration argument passed to esp_run() is a simple data structure that specifies which accelerator(s) to invoke, how to configure them, and their point-to-point dependencies, if any. By using the esp_alloc() and esp_free() functions for memory allocation, data can be truly shared between accelerators and processors, i.e., no data copies are necessary. Data are allocated in an efficient way to improve the accelerator's access to memory without compromising the software's performance [39]. The ESP software stack, combined with the generation of device drivers for new custom accelerators, makes the accelerator invocation as transparent as possible for the application programmers.

```c
// Example of an existing C application with ESP accelerators that
// replace software kernels 2 and 4. The cfg_k structures contain the
// buffer and the accelerator configuration.
{
    int *buffer = esp_alloc(size);
    for (...) {
        kernel_1(buffer, ...);  // existing software
        esp_run(cfg_k2);        // run accelerator(s)
        kernel_3(buffer, ...);  // existing software
        esp_run(cfg_k4);        // run accelerator(s)
    }
    esp_free(buffer);
}
```

Figure 4: ESP accelerator API for seamless shared memory.
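The paper does not detail the layout of the configuration argument; the following sketch only suggests the kind of information such a descriptor carries, with hypothetical field names (the actual types in the ESP release differ).

```c
/* Hypothetical descriptor for esp_run(); illustrative only. */
struct esp_acc_cfg {
    const char *devname;     /* which accelerator to invoke            */
    unsigned    batch;       /* application-specific values exposed    */
    unsigned    tokens;      /* ...as memory-mapped registers          */
    int         coherence;   /* coherence mode for this invocation     */
    const char *p2p_source;  /* producer to read from directly, if any */
};
```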
The ESP Methodology

The ESP design methodology is flexible because it embodies different design flows, for the most part automated and supported by commercial CAD tools. In particular, recalling Figure 1, the accelerator design flow (on the left in the figure) aids the creation of an IP library, whereas the SoC flow (on the right) automates the integration of heterogeneous components into a complete SoC.
Accelerator Design Flow.
The end goal of this flow is to add new elements to the library of accelerators that can be automatically instantiated with the SoC flow. Designers can work at different abstraction levels with various specification languages:
• Cycle-accurate RTL descriptions with languages like VHDL, Verilog, SystemVerilog, or Chisel.
• Loosely-timed or untimed behavioral descriptions with SystemC or C/C++ that get synthesized into RTL with high-level synthesis (HLS) tools. ESP currently supports the three main commercial HLS tools: Cadence Stratus HLS, Mentor Catapult, and Xilinx Vivado HLS.
• Domain-specific libraries for deep learning like Keras TensorFlow, PyTorch, and ONNX, for which ESP offers a flow combining HLS tools with hls4ml, an OSH project [2, 22].
HLS-Based Accelerator Design.
For the HLS-based flows, ESP facilitates the job of the accelerator designer by providing ESP-compatible accelerator templates, HLS-ready skeleton specifications, multiple examples, and step-by-step tutorials for each flow.

The push toward the adoption of HLS from C/C++ specifications has many reasons: (1) a large codebase of algorithms written in these languages, (2) simplified hardware/software co-design (since most embedded software is in C), and (3) a thousand-fold faster functional execution of the specification compared with the counterpart RTL simulation. On the other hand, HLS from C/C++ has also shown some limitations because of the limited ability to specify or accurately infer concurrency, timing, and communication properties of hardware systems. HLS flows based on the IEEE-standard language SystemC overcome these limitations, thus making SystemC the de-facto standard to model protocols and control-oriented applications at a level higher than RTL. In ESP, we support and encourage the use of both the C/C++ and SystemC flows for HLS, and we have defined a set of guidelines to support designers in porting an application to an HLS-ready format. The ideal starting point is a self-contained description of the computation kernel, written in a subset of the C/C++ language [42]: a limited use of pointers and the absence of dynamic memory allocation and recursion are important; also, aside from common mathematical functions, no external library functions should be used. This initial software transformation is oftentimes the most important step to obtain good-quality source code for HLS [41].

The designer of an ESP accelerator should aim at a well-structured description that partitions the specification into concurrent functional blocks. The goal is to obtain a synthesizable specification that enables the exploration of a vast design space, by evaluating many micro-architectural and optimization choices. Figure 5 shows the relationship between the C/C++/SystemC design space and the RTL design space. The HLS tools provide a rich set of configuration knobs to obtain a variety of RTL implementations, each corresponding to a different cost-performance tradeoff point [36, 37]. The knobs are push-button directives of the HLS tool, represented by the green arrows in the figure. Designers may also perform manual transformations of the specification (orange arrows) to explore the design space while preserving the functional behavior. For example, they can expose parallelism by removing false dependencies, or they can reduce resource utilization by encapsulating sections of code with similar behavior into callable functions [41].

Figure 5: HLS-based accelerator design in ESP.
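To illustrate the difference between an HLS knob (a green arrow in Figure 5) and a manual code transformation (an orange arrow), consider a simple reduction. The pragma follows Vivado HLS syntax, directives differ across the supported tools, and the code is a sketch rather than ESP source.

```cpp
// Knob: a push-button directive changes the microarchitecture while
// the source stays untouched; each setting yields a new RTL point.
int dot(const int a[1024], const int b[1024]) {
    int acc = 0;
    for (int i = 0; i < 1024; i++) {
#pragma HLS pipeline II=1
        acc += a[i] * b[i];
    }
    return acc;
}

// Manual transformation: splitting the accumulator relaxes the
// loop-carried dependence that serializes the version above, exposing
// parallelism the knobs alone cannot reach; the function is preserved.
int dot_transformed(const int a[1024], const int b[1024]) {
    int acc0 = 0, acc1 = 0;
    for (int i = 0; i < 1024; i += 2) {
#pragma HLS pipeline II=1
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    return acc0 + acc1;
}
```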
Accelerator Structure.
The ESP accelerators are based on the loosely-coupled model [19]. They are programmed like devices by applications that invoke device drivers with standard system calls, such as open and ioctl. They perform coarse-grained computations while exchanging large data sets with the memory hierarchy. Figure 6 shows the structure and interface common to all ESP accelerators. The interface channels allow the accelerator to (1) communicate with the CPU via memory-mapped registers (conf_info), (2) program the DMA controller or interact with other accelerators (load_ctrl and store_ctrl), (3) exchange data with the memory hierarchy or other accelerators (load_chnl and store_chnl), and (4) notify its completion back to the software application (acc_done).

Figure 6: Structure of ESP accelerators.
These channels are implemented with latency-insensitive communication primitives, which HLS tools commonly provide as libraries (e.g., Mentor MatchLib Connections [33], Cadence Flex Channels [38], Xilinx ap_fifo). These primitives preserve functional correctness in the presence of latency variations both in the computation within the accelerator and in the communication across the NoC [10]. This is obtained by adding valid and ready signals to the channels. The valid signal indicates that the value of the data bundle is valid in the current clock cycle, while the ready signal is de-asserted to apply backpressure. The latency-insensitive nature of ESP accelerators allows designers to fully exploit the ability of HLS to produce many alternative RTL implementations, which are not strictly equivalent from an RTL viewpoint (i.e., they do not produce the same timed sequence of outputs for every valid input sequence), but are members of a latency-equivalent design class [14]. Each member of this class can be seamlessly replaced with another one, depending on performance and cost targets [44].

The execution flow of an ESP accelerator consists of four phases: configure, load, compute, and store, as shown in Figure 6. A software application configures, checks, and starts the accelerator via memory-mapped registers. During the load and store phases, the accelerator interacts with the DMA controller, interleaving data exchanges between the system and the accelerator's private local memory (PLM) with computation. When the accelerator completes its task, an interrupt resumes the software for further processing. For better performance, the accelerator can have one or more parallel computation components that interact with the PLM. The organization of the PLM itself is typically customized for the given accelerator, with multiple banks and ports. For example, the designer can organize it as a circular or ping-pong buffer to support the pipelining of computation and transfers with the external memory or other accelerators. Designers can leverage PLM generators [45] to implement many different memory subsystems, each optimized for a specific combination of HLS knob settings.
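As a hedged sketch of how the load phase of such an accelerator might be written against these primitives, consider the fragment below. The chan type and its get()/put() methods stand in for the vendor-specific latency-insensitive channels named above, and all names and types are placeholders rather than ESP template code.

```cpp
// Toy stand-ins for the vendor channel primitives: get() blocks until
// 'valid' is asserted, put() blocks while 'ready' is de-asserted.
template <typename T> struct chan { T get(); void put(const T&); };

struct conf_t     { unsigned base, num_bursts; }; // hypothetical
struct dma_info_t { unsigned index, length; };    // hypothetical
using data_t = int;

constexpr unsigned BURST = 1024;
chan<conf_t> conf_info; chan<dma_info_t> load_ctrl;
chan<data_t> load_chnl; chan<bool> load_done;
data_t plm_ping[BURST], plm_pong[BURST]; // two PLM banks

void load() {
    conf_t cfg = conf_info.get();  // configure phase: registers arrive
    bool ping = true;
    for (unsigned b = 0; b < cfg.num_bursts; b++) {
        load_ctrl.put({cfg.base + b * BURST, BURST}); // one DMA burst
        for (unsigned i = 0; i < BURST; i++)          // fill one bank
            (ping ? plm_ping : plm_pong)[i] = load_chnl.get();
        load_done.put(ping); // hand the full bank to the compute phase
        ping = !ping;        // ping-pong: overlap DMA with computation
    }
}
```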
Accelerator Behavior.
The charts of Figure 7 show the behavior of two concurrent ESP accelerators (ACC0 and ACC1) in three possible scenarios. The two accelerators work in a producer-consumer fashion: ACC0 generates data that ACC1 takes as input. The accelerators execute twice and concurrently; the consumer starts as soon as its input data is ready; and both accelerators perform bursts of load and store DMA transactions, shown in red and brown, respectively. The completion of the configuration phase and the interrupt request (acc_done) are marked with CFG and IRQ, respectively.

Figure 7: Overlapping of computation and communication of ESP accelerators.

In the top scenario, the two accelerators communicate via external memory. First, the producer ACC0 runs and stores the resulting data in main memory. Upon completion of the producer, the consumer ACC1 starts and accesses the data in main memory; concurrently, the producer ACC0 can run a second time. The data exchange happens through memory at the granularity of the whole accelerator data set: this scenario is a virtual pipeline of ESP accelerators through memory. Ping-pong buffering on the PLM for both the load and store phases allows the overlap of computation and communication. In addition, the load and store phases are allowed to overlap; this is only possible when each accelerator has a dedicated memory channel (e.g., two ESP memory tiles). As long as the NoC and memory bandwidth are not saturated, the performance overhead is limited to the driver run time and the interrupt-handling procedures. We consider this scenario ideal.

In complex SoCs, it is reasonable to expect resource contention and delays at the main memory. These can limit the latency and throughput of the accelerators, as shown in the middle scenario of Figure 7, where some of the DMA transactions get delayed for both the producer and the consumer. The ESP library and API allow designers to replace this virtual pipeline with an actual pipeline of accelerators, based on point-to-point (P2P) communication over the NoC. The communication method does not need to be chosen at design time; instead, special configuration registers are used to override the default DMA behavior. Besides relieving memory contention, P2P communication can actually improve the latency and throughput of communicating accelerators, as shown in the bottom scenario of Figure 7. Here, each output transaction of the producer ACC0 is matched to an input transaction of the consumer ACC1 (in green). Differently from the previous scenarios, the data exchange via P2P happens at a smaller granularity: a single store transaction of the producer ACC0 is a valid input for the consumer ACC1. A designer should take this assumption into account when designing accelerators for a specific task.
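Reusing the hypothetical descriptor sketched in the software-stack section, the P2P pipeline above might be requested as follows; the actual ESP API expresses these dependencies differently, so this is only a sketch.

```c
/* Chain ACC0 -> ACC1 over the NoC instead of through main memory.
 * Device names and fields are hypothetical, and the invocation is
 * simplified: one configuration can name several accelerators
 * together with their dependencies. */
struct esp_acc_cfg cfg_pipe[2] = {
    { .devname = "acc0.0" },                         /* producer */
    { .devname = "acc1.0", .p2p_source = "acc0.0" }, /* consumer */
};

esp_run(cfg_pipe); /* registers select P2P routing: each store burst
                      of ACC0 directly feeds a load burst of ACC1 */
```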
Accelerator Templates and Code Generator.
ESP provides the designers with a set of accelerator templates for each of the HLS-based design flows. These templates leverage concepts of object-oriented programming and class inheritance to simplify the design of the accelerators in C/C++ or SystemC and to enforce the interface and structure previously described. They also implicitly address the differences existing among the various HLS tools and input specification languages. For example, the latency-insensitive primitives that come with the different vendors may have slightly different APIs, e.g., Put()/Get() vs. Read()/Write(), or different timing behavior. With some HLS tools, the designer has to specify extra wait() statements in SystemC to generate the correct RTL code. In the case of C/C++ designs, a combination of HLS directives and coding style must be followed to ensure that extra memories are not inadvertently inferred and that the phases are correctly synchronized.

Next to the templates, ESP provides a further aid for accelerator design: an interactive environment that generates a fully-working and HLS-ready accelerator skeleton from a set of parameters passed by the designer. The skeleton comes with a unit testbench, synthesis and simulation scripts, a bare-metal driver, a Linux driver, and a sample test application. This is the first step of the accelerator design flow, as shown on the top-right of Figure 8. The skeleton is a basic specification that uses the templates and contains placeholders for manual accelerator-specific customizations. The parameters passed by the designer include: a unique name and ID, the desired HLS tool flow, a list of application-specific configuration registers, the bit-width of the data tokens, and the size of the data set and the number of batches of data sets to be executed without interrupting the CPU. Next to this application-specific information, designers can choose architectural parameters that set the minimum required size of the PLM and the maximum memory footprint of the application that invokes the accelerator. These parameters affect the generated accelerator skeleton, the device driver, the test application, and the configuration parameters of the ESP socket that will host the accelerator.

Figure 8: Overview of the accelerator and SoC design flows with an example of SoC design configuration on the ESP GUI.

Starting from the automatically generated skeleton, designers must customize the accelerator computation phase, leveraging the software implementation of the target computation kernel as a reference. In addition, they are responsible for customizing the input-generation and output-validation functions in the unit testbench and in the bare-metal and Linux test applications. Finally, in the case of complex data access patterns, they may also need to extend the communication part of the accelerator and define a more complex structure for the PLM. The ESP release offers a set of online tutorials that describe these steps in detail with simple examples, which demonstrate how the first version of a new accelerator can be designed, integrated, and tested on FPGA in a few hours [18].

The domain-specific flow for embedded machine learning is fully automated [25]. The accelerator and the related software driver and application are generated in their entirety from the neural-network model. ESP also automatically generates the accelerator tile socket and a wrapper for the accelerator logic.
Third-Party Accelerator Integration.
For existing accelerators, ESP provides a third-party accelerator integration flow (TPF). The TPF skips all the steps necessary to design a new accelerator and goes directly to SoC integration. The designer must provide some information about the existing IP block and a simple wrapper that connects the wires of the accelerator's interface to the ESP socket. Specifically, the designer must fill in a short XML file with a unique accelerator name and ID, the list and polarity of the reset signals, the list of clock signals, an optional prefix for the AXI master interface in the wrapper, the user-defined width of the optional AXI control signals, and the type of interrupt request (i.e., level- or edge-sensitive). In addition, the TPF requires the list of RTL source files, including Verilog, SystemVerilog, VHDL and VHDL packages, a custom Makefile to compile the third-party software and device drivers, and the list of executable files, libraries, and other binary objects needed to control the accelerator.

Currently, ESP provides adapters for AXI master (32 and 64 bits), AHB master (32 bits), and AXI-Lite or APB slave (32 bits). As long as the target accelerator is compliant with these standard bus protocols, the Verilog top-level wrapper consists of simple wire assignments that expose the bus ports to the ESP socket and connect any non-standard input port of the third-party accelerator (e.g., a disable-test-mode pin), if present. After these simple manual steps, ESP takes care of the whole integration automatically. We used the TPF to integrate the NVDLA [3]; a minor patch was required to run multiple NVDLAs in a Linux environment. An online tutorial in the ESP release demonstrates the design of a complete SoC with multiple NVDLA tiles, multiple memory tiles, and the Ariane RISC-V processor. This system can run up to four concurrent machine-learning tasks using the original NVDLA software stack [24].

SoC Integration Flow.
The center and the left portion of Figure 8 illustrate the agile SoC development enabled by ESP. Both the ESP and the third-party accelerator flows contribute to the pool of IP components that can be selected to build an SoC instance. The ESP GUI guides the designers through an interactive SoC design flow that allows them to: choose the number, types, and positions of the tiles; select the desired Pareto-optimal design point from the HLS flows for each accelerator; select the desired processor core among those available; determine the cache-hierarchy configuration; select the clock domain of each tile; and enable the desired system monitors. The GUI writes a configuration file that the ESP build flow includes to generate the RTL sockets, the system memory mapping, the NoC routing tables, the device tree for the target processor architecture, software header files, and the configuration parameters for the proxy components.

A single make target is sufficient to generate the bitstream for one of the supported Xilinx evaluation boards (VCU128, VCU118, and VC707) and proFPGA prototyping FPGA modules (XCVU440 and XC7V2000T). Another single make target compiles Linux and creates a default root file system that includes the accelerators' drivers and test applications, together with all the necessary initialization scripts to load the ESP library and memory allocator. If properly listed during the TPF, the software stack for the third-party accelerators is loaded into the Linux image as well. When the FPGA implementation is ready, users can load the boot loader onto the ESP boot memory and the Linux image onto the external DRAM with the
ESP Link application and the companion module on the auxiliary tile. Next, ESP Link sends a soft reset to the processor cores, thus starting the execution from the boot loader. Users can monitor the boot process via UART, or log in with SSH after the Linux boot completes. The online tutorials explain how to properly wire the FPGA boards to a simple home router to ensure connectivity.

In addition to FPGA prototyping, designers can run full-system RTL simulations of a bare-metal program. If monitoring the FPGA with the UART serial interface, they can run bare-metal applications on FPGA as well. The development of bare-metal and Linux applications for an ESP SoC is facilitated by the ESP software stack presented earlier. The ESP release offers several examples.

The agile ESP flow has allowed us to rapidly prototype many complex SoCs on FPGA, including:
• An SoC with 12 computer-vision accelerators and as many dynamic frequency scaling (DFS) domains [40].
• A multi-core SoC booting Linux SMP with tens of accelerators, multiple DRAM controllers, and dynamically reconfigurable cache-coherence models [28].
• A RISC-V based SoC where deep-learning applications running on top of Linux invoke loosely-coupled accelerators designed with multiple ESP accelerator design flows [25].
• A RISC-V based SoC with multiple instances of the NVDLA controlled by the RISC-V Ariane processor [24].
Related Work

The OSH movement is supported by multiple SoC design platforms, many based on the RISC-V open-standard ISA [6, 29]. The Rocket Chip Generator is an OSH project that leverages the Chisel RTL language to construct SoCs with multiple RISC-V cores connected through a coherent TileLink bus [35]. The Chipyard framework inherits Rocket Chip's Chisel-based parameterized hardware-generator methodology and also allows the integration of IP blocks written in other RTL languages, via a Chisel wrapper, as well as domain-specific accelerators [5]. Celerity used the custom co-processor interface RoCC of the Rocket chip to integrate five Rocket cores with an array of 496 simpler RISC-V cores and a binarized neural network (BNN) accelerator, which was designed with HLS, into a 385-million-transistor SoC [21]. HERO is an FPGA-based research platform that allows the integration of a standard host multicore processor with programmable manycore accelerators composed of clusters of RISC-V cores based on the PULP platform [4, 34, 47]. OpenPiton was the first open-source SMP Linux-booting RISC-V multicore processor [8]. It supports research on heterogeneous ISAs and provides a coherence protocol that extends across multiple chips [7, 46]. BlackParrot is a multicore RISC-V architecture that offers some support for the integration of loosely-coupled accelerators [43]; currently, it provides two of the four cache-coherence options supported by ESP: fully-coherent and non-coherent.

While most of these platforms are built with a processor-centric perspective, ESP promotes a system-centric perspective, with a scalable NoC-based architecture and a strong focus on the integration of heterogeneous components, particularly loosely-coupled accelerators. Another feature distinguishing ESP from the other open-source SoC platforms is its flexible system-level design methodology, which embraces a variety of specification languages and synthesis flows, while promoting the use of HLS to facilitate the design and integration of accelerators.
Conclusions

In summary, with ESP we aim at contributing to the open-source movement by supporting the realization of more scalable architectures for SoCs that integrate more heterogeneous components, thanks to a more flexible design methodology that accommodates different specification languages and design flows. Conceived as a heterogeneous integration platform and tested through years of teaching at Columbia University, ESP is naturally suited to foster collaborative engineering of SoCs across the OSH community.
ACKNOWLEDGMENTS
Over the years, the ESP project has been supported in part by DARPA (C
REFERENCES
[1] Ariane. https://github.com/pulp-platform/ariane.
[2] HLS4ML. https://fastmachinelearning.org/hls4ml/.
[3] NVIDIA Deep Learning Accelerator (NVDLA).
[4] PULP. https://pulp-platform.org/.
[5] A. Amid, D. Biancolin, A. Gonzalez, D. Grubb, S. Karandikar, H. Liew, A. Magyar, H. Mao, A. Ou, N. Pemberton, P. Rigge, C. Schmidt, J. Wright, J. Zhao, Y. S. Shao, K. Asanovic, and B. Nikolic. 2020. Chipyard: Integrated Design, Simulation, and Implementation Framework for Custom SoCs. IEEE Micro 40, 4 (2020), 10–21.
[6] K. Asanovic and D. Patterson. 2014. The Case for Open Instruction Sets. Microprocessor Report (Aug. 2014).
[7] J. Balkind, T. Chang, P. J. Jackson, G. Tziantzioulis, A. Li, F. Gao, A. Lavrov, G. Chirkov, J. Tu, M. Shahrad, and D. Wentzlaff. 2020. OpenPiton at 5: A Nexus for Open and Agile Hardware Design. IEEE Micro 40, 4 (2020), 22–31.
[8] J. Balkind, K. Lim, F. Gao, J. Tu, D. Wentzlaff, M. Schaffner, F. Zaruba, and L. Benini. 2019. OpenPiton+Ariane: The First Open-Source, SMP Linux-booting RISC-V System Scaling From One to Many Cores. In Workshop on Computer Architecture Research with RISC-V (CARRV). 1–6.
[9] S. Borkar and A. Chen. 2011. The Future of Microprocessors. Communications of the ACM 54, 5 (May 2011), 67–77.
[10] L. P. Carloni. 2015. From Latency-Insensitive Design to Communication-Based System-Level Design. Proceedings of the IEEE 103, 11 (Nov. 2015), 2133–2151.
[11] L. P. Carloni. 2016. The Case for Embedded Scalable Platforms. In Proc. of the Design Automation Conference (DAC). 17:1–17:6.
[12] L. P. Carloni, E. G. Cota, G. Di Guglielmo, D. Giri, J. Kwon, P. Mantovani, L. Piccolboni, and M. Petracca. 2019. Teaching Heterogeneous Computing with System-Level Design Methods. In Workshop on Computer Architecture Education (WCAE). 1–8.
[13] L. P. Carloni, K. L. McMillan, A. Saldahna, and A. L. Sangiovanni-Vincentelli. 1999. A Methodology for "Correct-by-Construction" Latency Insensitive Design. In Proc. of the International Conference on Computer-Aided Design (ICCAD). 309–315.
[14] L. P. Carloni, K. L. McMillan, and A. L. Sangiovanni-Vincentelli. 2001. Theory of Latency-Insensitive Design. IEEE Transactions on CAD of Integrated Circuits and Systems 20, 9 (Sept. 2001), 1059–1076.
[15] A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, M. Humphrey, P. Kaur, J. Kim, D. Lo, T. Massengill, K. Ovtcharov, M. Papamichael, L. Woods, S. Lanka, D. Chiou, and D. Burger. 2016. A Cloud-Scale Acceleration Architecture. In Proc. of the IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–13.
[16] H. Chishiro, K. Suito, T. Ito, S. Maeda, T. Azumi, K. Funaoka, and S. Kato. 2019. Towards Heterogeneous Computing Platforms for Autonomous Driving. In Proc. of the International Conference on Embedded Software and Systems (ICESS).
[17] Cobham Gaisler. LEON3.
[18] Columbia SLD Group. ESP Release.
[19] E. G. Cota, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2015. An Analysis of Accelerator Coupling in Heterogeneous Architectures. In Proc. of the Design Automation Conference (DAC). 202:1–202:6.
[20] W. Dally, Y. Turakhia, and S. Han. 2020. Domain-Specific Hardware Accelerators. Communications of the ACM 63, 7 (June 2020), 48–57.
[21] S. Davidson, S. Xie, C. Torng, K. Al-Hawai, A. Rovinski, T. Ajayi, L. Vega, C. Zhao, R. Zhao, S. Dai, A. Amarnath, B. Veluri, P. Gao, A. Rao, G. Liu, R. K. Gupta, Z. Zhang, R. Dreslinski, C. Batten, and M. B. Taylor. 2018. The Celerity Open-Source 511-Core RISC-V Tiered Accelerator Fabric: Fast Architectures and Design Methodologies for Fast Chips. IEEE Micro 38, 2 (Feb. 2018), 30–41.
[22] J. Duarte, S. Han, P. Harris, S. Jindariani, E. Kreinar, B. Kreis, J. Ngadiuba, M. Pierini, R. Rivera, N. Tran, and Z. Wu. 2018. Fast Inference of Deep Neural Networks in FPGAs for Particle Physics. Journal of Instrumentation 13, 07 (July 2018), P07027.
[23] J. Gaisler. 2004. An Open-Source VHDL IP Library with Plug & Play Configuration. Building the Information Society (2004).
[24] D. Giri, K.-L. Chiu, G. Eichler, P. Mantovani, N. Chandramoorth, and L. P. Carloni. 2020. Ariane + NVDLA: Seamless Third-Party IP Integration with ESP. In Workshop on Computer Architecture Research with RISC-V (CARRV).
[25] D. Giri, K.-L. Chiu, G. Di Guglielmo, P. Mantovani, and L. P. Carloni. 2020. ESP4ML: Platform-Based Design of Systems-on-Chip for Embedded Machine Learning. In Proc. of the Conference on Design, Automation, and Test in Europe (DATE). 1049–1054.
[26] D. Giri, P. Mantovani, and L. P. Carloni. 2018. Accelerators & Coherence: An SoC Perspective. IEEE Micro 38, 6 (Nov. 2018), 36–45.
[27] D. Giri, P. Mantovani, and L. P. Carloni. 2018. NoC-Based Support of Heterogeneous Cache-Coherence Models for Accelerators. In Proc. of the International Symposium on Networks-on-Chip (NOCS). 1:1–1:8.
[28] D. Giri, P. Mantovani, and L. P. Carloni. 2019. Runtime Reconfigurable Memory Hierarchy in Embedded Scalable Platforms. In Proc. of the Asia and South Pacific Design Automation Conference (ASPDAC). 719–726.
[29] S. Greengard. 2020. Will RISC-V Revolutionize Computing? Commun. ACM 63 (2020).
[30] G. Gupta, T. Nowatzki, V. Gangadhar, and K. Sankaralingam. 2017. Kickstarting Semiconductor Innovation with Open Source Hardware. IEEE Computer 50, 6 (June 2017), 50–59.
[31] M. Horowitz. 2014. Computing's Energy Problem (and What We Can Do About It). In International Solid-State Circuits Conference (ISSCC). 10–14.
[32] N. P. Jouppi, C. Young, N. Patil, and D. Patterson. 2018. A Domain-Specific Architecture for Deep Neural Networks. Commun. ACM 61, 9 (Aug. 2018), 50–59.
[33] B. Khailany, E. Krimer, R. Venkatesan, J. Clemons, J. S. Emer, M. Fojtik, A. Klinefelter, M. Pellauer, N. Pinckney, Y. S. Shao, S. Srinath, C. Torng, S. L. Xi, Y. Zhang, and B. Zimmer. 2018. A Modular Digital VLSI Flow for High-Productivity SoC Design. In Proc. of the Design Automation Conference (DAC). 1–6.
[34] A. Kurth, P. Vogel, A. Capotondi, A. Marongiu, and L. Benini. 2017. HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA. In Workshop on Computer Architecture Research with RISC-V (CARRV). 1–7.
[35] Y. Lee, A. Waterman, H. Cook, B. Zimmer, B. Keller, A. Puggelli, J. Kwak, R. Jevtic, S. Bailey, M. Blagojevic, P. Chiu, R. Avizienis, B. Richards, J. Bachrach, D. Patterson, E. Alon, B. Nikolic, and K. Asanovic. 2016. An Agile Approach to Building RISC-V Microprocessors. IEEE Micro 36, 2 (Mar.–Apr. 2016), 8–20.
[36] H.-Y. Liu and L. P. Carloni. 2013. On Learning-Based Methods for Design-Space Exploration with High-Level Synthesis. In Proc. of the Design Automation Conference (DAC). 1–7.
[37] H.-Y. Liu, M. Petracca, and L. P. Carloni. 2012. Compositional System-Level Design Exploration with Planning of High-Level Synthesis. In Proc. of the Conference on Design, Automation, and Test in Europe (DATE). 641–646.
[38] M. Meredith. 2008. High-Level SystemC Synthesis with Forte's Cynthesizer. In High-Level Synthesis. Springer, 75–97.
[39] P. Mantovani, E. G. Cota, C. Pilato, G. Di Guglielmo, and L. P. Carloni. 2016. Handling Large Data Sets for High-Performance Embedded Applications in Heterogeneous Systems-on-Chip. In Proc. of the Intl. Conference on Compilers, Architectures, and Synthesis of Embedded Systems (CASES). 1–10.
[40] P. Mantovani, E. G. Cota, K. Tien, C. Pilato, G. Di Guglielmo, K. Shepard, and L. P. Carloni. 2016. An FPGA-Based Infrastructure for Fine-Grained DVFS Analysis in High-Performance Embedded Systems. In Proc. of the Design Automation Conference (DAC). 157:1–157:6.
[41] P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2016. High-Level Synthesis of Accelerators in Embedded Scalable Platforms. In Proc. of the Asia and South Pacific Design Automation Conference (ASPDAC). 204–211.
[42] R. Nane, V. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels. 2016. A Survey and Evaluation of FPGA High-Level Synthesis Tools. IEEE Transactions on CAD of Integrated Circuits and Systems 35, 10 (2016), 1591–1604.
[43] D. Petrisko, F. Gilani, M. Wyse, D. C. Jung, S. Davidson, P. Gao, C. Zhao, Z. Azad, S. Canakci, B. Veluri, T. Guarino, A. Joshi, M. Oskin, and M. B. Taylor. 2020. BlackParrot: An Agile Open-Source RISC-V Multicore for Accelerator SoCs. IEEE Micro 40, 4 (2020), 93–102.
[44] L. Piccolboni, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2017. COSMOS: Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators. ACM Transactions on Embedded Computing Systems 16, 5s (Sept. 2017), 150:1–150:22.
[45] C. Pilato, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2017. System-Level Optimization of Accelerator Local Memory for Heterogeneous Systems-on-Chip. IEEE Transactions on CAD of Integrated Circuits and Systems 36, 3 (March 2017), 435–448.
[46] Princeton Parallel Group. OpenPiton. https://parallel.princeton.edu/openpiton/.
[47] D. Rossi, I. Loi, F. Conti, G. Tagliavini, A. Pullini, and A. Marongiu. 2014. Energy Efficient Parallel Computing on the PULP Platform with Support for OpenMP. In Convention of Electrical & Electronics Engineers in Israel (IEEEI).
[48] Y. S. Shao, B. Reagen, G. Wei, and D. Brooks. 2015. The Aladdin Approach to Accelerator Design and Modeling. IEEE Micro 35, 3 (May–June 2015), 58–70.
[49] Y.-J. Yoon, N. Concer, and L. P. Carloni. 2013. Virtual Channels and Multiple Physical Networks: Two Alternatives to Improve NoC Performance. IEEE Transactions on CAD of Integrated Circuits and Systems 32, 12 (Dec. 2013), 1906–1919.
[50] Y.-J. Yoon, P. Mantovani, and L. P. Carloni. 2017. System-Level Design of Networks-on-Chip for Heterogeneous Systems-on-Chip. In Proc. of the International Symposium on Networks-on-Chip (NOCS). 1–6.
[51] F. Zaruba and L. Benini. 2019. The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 11 (Nov. 2019), 2629–2640.