A Survey on Coarse-Grained Reconfigurable Architectures from a Performance Perspective
1st Artur Podobas
Center for Computational Science, RIKEN
Kobe, Japan
[email protected]

2nd Kentaro Sano
Center for Computational Science, RIKEN
Kobe, Japan
[email protected]

3rd Satoshi Matsuoka
Center for Computational Science, RIKEN
Kobe, Japan
[email protected]
Abstract—With the end of both Dennard's scaling and Moore's law, computer users and researchers are aggressively exploring alternative forms of compute in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability.
In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with particular focus on the premises behind the different CGRA architectures and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover existing knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and the evaluation of more complex applications.
Index Terms—Coarse-Grained Reconfigurable Architectures, CGRA, FPGA, Computing Trends, Reconfigurable systems
I. INTRODUCTION
With the end of Dennard's scaling [1] and the looming threat that even Moore's law [2] is about to end [3], computing is perhaps facing its most challenging moments. Today, computer researchers and practitioners are aggressively pursuing and exploring alternative forms of computing in order to try to fill the void that an end of Moore's law would leave behind. There is a plethora of emerging technologies with the promise of overcoming the limits of technology scaling, such as quantum- or neuromorphic-computing [4], [5]. However, not all
Post-Moore architectures are intrusive, and some merely require us to step away from the comforts that the von Neumann architecture offers. Among the more salient of these technologies are reconfigurable architectures [6]. Reconfigurable architectures are systems that attempt to retain some of the silicon plasticity that an ASIC solution usually throws away. These systems – at least conceptually – allow the silicon to be malleable and its functionality dynamically configurable. A reconfigurable system can for example mimic a processor architecture for some time (e.g. a RISC-V core [7]), and then be changed to mimic an LTE baseband station [8]. This property of reconfigurability is highly sought after, since it can mitigate the end of Moore's law to some extent – we do not need more transistors, we just need to spatially configure the silicon to match the computation in time.
Recently, a particular branch of reconfigurable architecture – the Field-Programmable Gate Array (FPGA) [9] – has experienced a surge of renewed interest for use in High-Performance Computing (HPC), and recent research has shown performance- or power-benefits for multiple applications [10]–[14]. At the same time, many of the limitations that FPGAs have, such as slow configuration times, long compilation times, and (comparably) low clock frequencies, remain unsolved. These limitations have been recognized for decades (e.g. [15]–[17]), and have been used to drive forth a different branch of reconfigurable architecture: the Coarse-Grained Reconfigurable Architecture (CGRA).
CGRAs trade some of the flexibility that FPGAs have to solve their limitations. A CGRA can operate at higher frequencies, can provide higher theoretical compute performance, and can drastically reduce compilation times. While CGRAs have traditionally been used in embedded systems (particularly for media-processing), lately, they too are considered for HPC. Even traditional FPGA vendors such as Xilinx [18] and Intel [19] are creating and/or investigating ways to coarsen their existing reconfigurable architectures to complement other forms of compute.
In this paper, we survey the literature on CGRAs, summarizing the different architectures and systems that have been introduced over time. We complement surveys written by our peers by focusing on understanding the trends in performance that CGRAs have been experiencing, providing insights into where the community is moving and any eventual gaps in knowledge that can/should be filled. The contributions of our work are as follows:
• A survey over three decades of Coarse-Grained Reconfigurable Architectures, summarizing existing architecture types and properties,
• A quantitative analysis of performance metrics of CGRA architectures as reported in their respective seminal papers, and
• An analysis of trends and observations regarding CGRAs with discussion.
The remaining paper is organized in the following way. Section II introduces the motivation behind CGRAs, as well as their generic design for the unfamiliar reader. Section III positions this survey against existing surveys on the topic. Section IV quantitatively summarizes each architecture that we reviewed, describing key characteristics and the premise behind each respective architecture. Section V analyzes the reviewed architectures from different perspectives (Sections VII, VIII, and VI), which we finally discuss at the end of the paper in Section IX.
II. INTRODUCTION TO CGRAS
Before summarizing the now three decades of Coarse-Grained Reconfigurable Architecture (CGRA) research, we start by describing the main aspirations and motivations behind them. To do so, we need to look at the CGRA's predecessor: the Field-Programmable Gate Array (FPGA). FPGAs are devices that were developed to reduce the cost of simulating and developing Application-Specific Integrated Circuits (ASICs). Because any bug/fault that was left undiscovered post ASIC tape-out would incur a (potentially) great economic loss, FPGAs were (and still are) crucial to digital design. In order for FPGAs to mimic any digital design, they are made to have a large degree of fine-grained reconfigurability. This fine-grained reconfigurability was achieved by building FPGAs to contain a large number of on-chip SRAM cells called Look-Up Tables (LUTs) [20]. Each LUT was interfaced by a few input wires (usually 4-6) and produced an output (and its complement) as a function of the SRAM content and the inputs. Hence, depending on the sought-after functionality to be simulated, LUTs could be configured and – through a highly reconfigurable interconnect – connected to each other to finally yield the expected design. The design would naturally run one to three orders of magnitude slower than the final standard-cell ASIC, but would nevertheless be an invaluable prototyping tool.
By the early 1990s, FPGAs had already found other uses (aside from digital development) within the telecommunication, military, and automobile industries—the FPGA was seen as a compute device in its own right, and there was some aspiration to use it for general-purpose computing, and not only in the niche market of prototyping digital designs. Despite this, several limitations of FPGAs were quickly identified that prohibited coverage of a wide range of applications. For example, unlike software compilation tools that take minutes to compile applications, the FPGA Electronic Design Automation (EDA) flow took significantly longer, often requiring hours or even days of compilation time. Similarly, if the expected application could not fit on a single device, the long reconfiguration overhead (the time it takes to program the FPGA) demotivated time-sharing or context-switching of its resources. Another limitation was that some important arithmetic operators did not map well to the FPGA; for example, a single integer multiplication could often consume a large fraction of the FPGA resources. Finally, FPGAs were relatively slow, running at a low clock frequency. Many of these challenges and limitations of applying FPGAs to general-purpose computing hold to this day.
Many early reconfigurable computing pioneers looked at the limitations of FPGAs and considered what would happen if one were to increase the granularity at which the device is programmed. By increasing the granularity, larger and more specialized units could be built, which would increase the performance (clock frequency) of the device. Also, since larger units require less configuration state, reconfiguring the device would be significantly faster, allowing fine-grained time-sharing (multiple contexts) of the device. Finally, by coarsening the units of reconfiguration, one could include those units that map poorly on FPGAs into the fabric (e.g. multiplications), making better use of the silicon and increasing the generality of the device.
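The granularity argument can be made concrete by counting configuration state: a K-input LUT needs 2^K SRAM bits to describe a single Boolean function of its inputs, whereas a coarse ALU slot needs only a short opcode. The following Python sketch is our own simplification for illustration and does not model any specific FPGA or CGRA:

def lut(truth_table, *inputs):
    # A K-input LUT: the 2^K-bit truth table IS the configuration;
    # the inputs merely form an address into it.
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | bit
    return truth_table[addr]

xor2 = [0, 1, 1, 0]                      # 4 configuration bits: a 2-input XOR
assert lut(xor2, 1, 0) == 1

# A 32-bit add built from LUTs needs roughly one 4-input LUT per result bit
# (hundreds of configuration bits plus carry logic); a coarse RC encodes the
# same operation in a handful of opcode bits.
lut_bits_per_32bit_add = 32 * (2 ** 4)   # about 512 bits of LUT state
alu_opcode_bits = 5                      # e.g. 32 distinct ALU operations
print(lut_bits_per_32bit_add, "vs", alu_opcode_bits)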
These new devices would later be called Coarse-Grained Reconfigurable Architectures (CGRAs). An example of what a CGRA looks like from the architecture perspective is shown in Figure 1. In Figure 1:a we see a mesh of reconfigurable cells (RCs) or processing elements (PEs), which are the smallest units of reconfiguration that perform work, and it is through this mesh that a user (or compiler) decides how data flows through the system. There are multiple ways of bringing data in/out to/from the fabric. One common way is to map the device into the memory of a host processor (memory-mapped) and have the host processor orchestrate the execution. A different way is to include (generic) address generators (AGs) that can be configured to access external memory using some pattern (often corresponding to the nested loops of the application) and push the data through the array. A third option is to have the reconfigurable cells do both the computation and the address generation. Figure 1:b illustrates the internals of an RC element, which includes an ALU (integer and/or floating-point capable), two multiplexers (MUXes), and a local static RAM (SRAM) used for storage. The two multiplexers decide which of the external inputs to operate on. The inputs are usually the output of adjacent RCs, the local SRAM scratchpad, a constant, or a previous output (e.g. for accumulations). The output of the ALU is similarly connected to adjacent RCs, the local SRAM, or back to one of the MUXes. Operation of the RC is governed by a configuration register, briefly shown in Figure 1:c. For simplicity, we show a single register that holds the state – however, in many architectures, each RC can hold multiple configurations that are cycled through over the application's lifetime. Each of the configurations can for example hold the computation for a particular basic block (where live-out variables are stored in SRAM) or a discrete kernel. Figure 1 illustrates what a majority of today's CGRAs look like, but at the same time there are multiple variations. For example, early CGRAs often included fine-grained reconfigurable elements (Look-Up Tables, LUTs) inside the fabric. While the mesh topology is by far the most common, some works chose a ring or linear-array topology. Finally, the flow-control of data in the network can be of varying complexity (e.g. token or tagged-token). We describe many of these in our summary in the sections that follow.

vertical and horizontal axes, and the second layer are four quadrants composed into a mesh. Unlike previous CGRAs, MorphoSys had a dedicated multiplier inside the ALUs. A CGRA based on MorphoSys was also realized in silicon nearly seven years after its inception [39]. While most of the CGRAs described so far used a mesh topology of interconnection (with some connectivity), other topologies have been considered. RaPiD [40], [41] is a CGRA that arranges its reconfigurable processing elements in a single dimension. Here, each processing element is composed of a number of primitive blocks, such as ALUs, multipliers, scratchpads, or registers. These primitive blocks are connected to each other through a number of local, segmented, tri-stated bus lines that can be configured to form a data-path – a so-called linear array. These processing elements can themselves be chained together to form the final CGRA. Interestingly, RaPiD can be partially reconfigured during execution in what the authors called "virtual execution".
RaPiD itself does not access data; instead, a number of generic address-pattern generators interface external memory and stream the data through the compute fabric. The KressArray [42]–[44] was one of the earliest CGRA designs to be created, and the project spanned nearly a decade with multiple versions and variants of the architecture. It features a hierarchical topology, where the lowest tier was composed of a mesh of processing elements. The processing elements interfaced neighbours, and also included predication signals (to map if-then-else primitives). Generic address generators supported the CGRA fabric by continuously streaming data to the architecture. Chimaera [45] is a co-processor conceptually similar to GARP, where an array of reconfigurable processing elements operating at a quite fine granularity (similar to modern FPGAs) can be reconfigured to perform a particular operation. It is closely coupled to the host processor, to the point where the register file is (in part) shadowed and shared. Mapping applications to the architecture was assisted by a "simple" C compiler, and they demonstrated performance on Mediabench [46] and Honeywell [47]. PipeRench [48] applied a novel network topology that was a hybrid between a mesh and a linear array. Here, a large number of linear arrays were layered, where each layer sent data uni-directionally to the next layer. Several future CGRAs would adopt this kind of structure, including data-flow machines (e.g. TARTAN) and loop-accelerators (e.g. FPCA). The layers themselves in PipeRench were fairly fine-grained and comparable to GARP, as they had reconfigurable Look-Up Tables rather than fixed-function ALUs within. PipeRench introduced a virtualization technique that treated each separate layer as a discrete accelerator, where a partial reconfiguration traveled alongside its associated data, reconfiguring the next layer according to its functionality in a pipelined fashion, which was new at the time. PipeRench was also later implemented in silicon [49]. The DREAM [50] architecture was explicitly designed to target (then) next-generation 3G networks, and argued that CGRAs are well suited for the upcoming standard with respect to software-defined radio and the flexibility to hot-fix bugs (through patches) and firmware. The system has a hierarchy of configuration managers and a mesh of simple, ALU-based RPEs operating on 16-bit operands and with limited support for complex operations such as multiplications (since operations were realized through Look-Up Tables). So far, all architectures described have computed using integer arithmetic. Imagine [51] is among the early architectures that included hardware floating-point arithmetic units. The architecture itself is similar to RaPiD—it is a linear array, where each processing element has a number of resources (scratchpads, ALUs, etc.), all connected using a global bus. Similar to RaPiD, the processing elements are passive, and external drivers are responsible for streaming data along the connected processing elements. The Imagine architecture had a prototype realized six years after its seminal paper [52].
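The generic address generators that feed RaPiD, KressArray, and Imagine can be pictured as small programmable loop nests replayed in hardware, so that the passive compute fabric never computes addresses itself. A minimal sketch of such a pattern generator (illustrative only; the parameter names are ours, not taken from any of the cited designs):

def address_generator(base, strides, counts):
    # Yield the addresses of a loop nest; strides/counts are ordered
    # outer-to-inner, mirroring the nested loops of the application.
    def walk(level, addr):
        if level == len(counts):
            yield addr
            return
        for i in range(counts[level]):
            yield from walk(level + 1, addr + i * strides[level])
    yield from walk(0, base)

# Row-major walk over a 4x8 tile starting at address 0x1000, row pitch 64.
addrs = list(address_generator(0x1000, strides=[64, 4], counts=[4, 8]))
print(len(addrs), hex(addrs[0]), hex(addrs[-1]))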
1) Modern Coarse-Grained Reconfigurable Architectures:
Most modern CGRA architectures' lineage can be linked back to those described in the previous section, and a majority of these architectures follow the generic template described there. However, while the overall template remains similar, many recent architectures specialize towards a certain niche use (low-power, Deep-Learning, GPU-like programmability, etc.). The ADRES CGRA system [53], [54] has been a remarkably successful architecture template for embedded architectures, and is still widely used. ADRES – like many previous and future CGRAs – consists of a mesh of processing elements where each element has neighbor (or Manhattan distance-2) connectivity. Inside each element we find an ALU of varying capability and a register file, alongside the multiplexers configured to bring in data from neighbours. The first row of the mesh, however, is unique, as it only contains the ALU (and no scratchpad/RF to store state). Instead, an optional processor can extend its pipeline to interface that very first row in a Very Long Instruction Word (VLIW) [55] fashion. ADRES, by design, is thus heterogeneous. ADRES comes with a compiler called DRESC [56]. ADRES as an architecture has been (and still is) a popular platform for CGRA research, such as when exploring multi-threaded CGRA support [57], topologies [58], asynchronous further-than-neighbor communication (e.g. HyCube [59]), or CGRA design frameworks/generators (e.g. CGRA-ME [60], [61]). Furthermore, ADRES has been taped out in silicon, for example in the Samsung Reconfigurable Processor (SRP) and the follow-up UL-SRP [62] architecture. The Dynamically Reconfigurable ALU Array (DRAA) [63] is a generic CGRA template proposed in 2003 to encourage compilation research on CGRA architectures. Architecture-wise, DRAA allows changing many of the parameters that define a CGRA, such as the data-path width, the interconnections, the size of the register file, etc. Preceding both DySER and ADRES, DRAA as a template has been used to e.g. study the memory hierarchy of CGRAs [64]. The TRIPS/EDGE [65], [66] microarchitecture was a long-running influential project that attempted to move away from the traditional approach of exploiting instruction-level parallelism in modern processors. The premise behind TRIPS was that as technology reduced the size of transistors, wire-delays and paths would dominate latency, and that it would be hard to scale the (often long) communication wires of traditional super-scalar processors [67]. Instead, by tightly coupling functional units in (for example) a mesh, direct neighbor communication could easily be scaled. In effect, TRIPS/EDGE replaced the traditional super-scalar Out-of-Order pipeline with a large CGRA array: single instructions were no longer scheduled, but instead a new compiler [68], [69] was developed that scheduled entire blocks ("CGRA configurations") temporally on the processor, allowing up to 16 instructions to be executed at a single time (and many more in-flight). The TRIPS architecture was taped out in silicon [70], [71] and – despite being discontinued – represented a milestone of true high-performance computing with CGRAs. An interesting observation, albeit not necessarily related to CGRAs, is that the EDGE ISA has received renewed interest as an alternative to express large amounts of ILP in FPGA soft processors [72].
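A recurring compiler problem behind ADRES-style templates (DRESC) and TRIPS-style block mapping is placing a dataflow graph onto a grid of PEs so that communicating operations end up close together. The toy greedy placement below is only meant to convey the flavor of that problem; production mappers such as DRESC combine placement with modulo scheduling and far more elaborate search:

import itertools

def place(ops, deps, rows, cols):
    # Greedily place each operation on the free cell that minimizes the
    # total Manhattan distance to its already-placed operands.
    free = set(itertools.product(range(rows), range(cols)))
    placement = {}
    for op in ops:  # ops must be given in topological order
        anchors = [placement[d] for d in deps.get(op, []) if d in placement]
        def cost(cell):
            return sum(abs(cell[0] - a[0]) + abs(cell[1] - a[1]) for a in anchors)
        best = min(sorted(free), key=cost)
        placement[op] = best
        free.remove(best)
    return placement

# (a*b) + (c*d) on a 2x2 mesh: the add is placed as close as possible
# to both of its multiplies.
print(place(["mul1", "mul2", "add"], {"add": ["mul1", "mul2"]}, 2, 2))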
The DySER [73] architecture integrates a CGRA into the backend of a processor's pipeline to complement (unlike e.g. TRIPS, which replaces) the functionality of the traditional (super-)scalar pipeline, and has been integrated in the OpenSPARC [74] platform [75]. The key premise behind DySER is that there are many local hot regions in program code, and higher performance can be obtained by specializing in accelerating these inside the CPU. DySER was evaluated using both a simulator (m5 [76]) and an FPGA implementation on well-known benchmarks (PARSEC [77] and SPECint) and compared with both CPU and GPU approaches, showing between 1.5x-15x improvements over SSE and comparable flexibility and performance to GPUs. Recently (2016 onwards), DySER has been the focus of much of the FPGA-overlay scene (see Section IV-F). Other work similar to DySER, integrating CGRA-like structures into processing cores with various goals, includes CReAMS/HARTMP [78], [79] (applying dynamic binary translation) or CGRA-sharing [80] (conceptually similar to what the AMD Bulldozer architecture [81] and the UltraSPARC T1/T2 did with their floating-point units). AMIDAR [82] is another long-running exciting project that (amongst others) uses a CGRA to accelerate performance-critical sections. The AMIDAR CGRA extends the traditional CGRA PE architecture with a direct interface to memory (through DMA). There is support for multiple contexts and hardware support for branching (through dedicated condition-boxes operating on predication signals), which also allows speculation. The AMIDAR CGRA has been implemented and verified on an FPGA platform, and early results show that it can reach over 1 GHz of clock frequency when mapped to a 45 nm technology. The MORA [83] architecture is a platform for CGRA-related research. MORA targets media-processing, and hence provides an 8-bit architecture with processing elements covering the most commonly used operations. MORA itself is similar to the earlier MATRIX, with a simple 2D mesh structure with neighbour communication. Each processing element has a scratchpad (256 bytes large). MORA is programmable using a domain-specific language developed over C++ [84]. CGRA Express [85] is yet another architecture that follows the concept of a mesh of simple, ALU-like structures. The premise and motivation for the work is that most existing CGRA applications are optimized for maximal graph coverage rather than sequential speed. The hypothesis is that – depending on the operators each PE is configured to use – they can exploit the resulting positive clock slack of the operators and cascade (fuse) more operations per clock cycle rather than blindly registering the intermediate output. This, in turn, allows them to execute more instructions per cycle (or reduce the frequency) with little performance loss. In their architecture, they add an extra bypass network that can be configured not to be pipelined. They show both power and performance benefits on multimedia benchmarks with and without their approach. The work can conceptually be seen as the opposite of what modern FPGAs (e.g. Stratix 10) do with HyperFlex [86], but for CGRAs. The Polymorphic Pipeline Array (PPA) [87] performed an interesting pilot study that drove the parameters of their CGRA: they simulated a large number of benchmarks scheduled on a hypothetical (infinite) CGRA, with focus on modulo-scheduling and loop unrolling.
They revealed that even with infinitely large CGRAs, the performance will be bound as a function of the instruction-level parallelism in the loops and the limitations of modulo-scheduling, and they argue that there is a definitive need to include other forms of parallelism to scale on CGRAs. While the PEs themselves follow a standard layout, they propose an interesting technique that allows multiple (unique) kernels to be executed concurrently on the CGRA, where the kernels communicate with each other either through DMA or shared memory. Kernels can also be resized to fully exploit the CGRA array. The premise behind SIMD-RA [88] is similar to that of PPA: CGRAs rely too much on instruction-level parallelism, and opportunities from other forms of parallelism are lost. SIMD-RA focuses on embedding support to modularize the CGRA array into multiple discretely controllable regions that (may) operate in SIMD fashion. They found that using SIMD not only yielded better performance, but was also more area-efficient compared to only using software-pipelining. SmartCell [89] is a CGRA that aspires to be low-power with high performance, supporting both SIMD- and MIMD-type parallelism. The architecture is effectively a 2D mesh, but with the mesh divided into 2x2 quadrants of processing elements. These 2x2 islands share a reconfigurable router, and inter-quadrant communication is limited to the connectivity of these routers. The processing elements themselves are fairly standard and contain an instruction memory whose instruction (configuration) is set either per processing element (MIMD) or sequenced globally (SIMD). BilRC [90] is a heterogeneous mesh composed of three different blocks: generic ALU blocks, multiplication/shifter nodes, and memory blocks, following the (by now) traditional recipe of a CGRA. Unique to BilRC is that the architecture explicitly exposes the triggering of instructions, allowing the programmer and/or application fine-grained control over the amount of parallelism and when instructions are triggered. The lack of floating-point support in CGRAs has also been a research driving force. FloRA [91] is a 16-bit, IEEE-754 floating-point capable CGRA. The architecture itself is composed of 64 RCs, and each RC is fairly standard and does not include a dedicated floating-point core itself; instead, two RCs can be combined to enable single-precision (32-bit) floating-point support, where the mantissa- and exponent-computations are distributed among the pair. Feng et al. [92] introduce a floating-point capable architecture designed specifically for radar signal processing. Despite the familiar mesh-based interconnection, the design deviates from the traditional approach since its processing elements are fairly diverse and heterogeneous. The CGRA itself was taped out in silicon and could reach up to 70 GFLOP/s of floating-point performance. A PRET-driven (Precision Timed) CGRA aimed towards predictable real-time processing was developed by Siqueira et al. [93]. Interestingly, the CGRA has support for threads, which is a concept used more in High-Performance than Real-Time computing. The architecture resembles an SMT (Simultaneous Multi-Threading) architecture, where each processing element has a duplicated set of resources (primarily the register files) that are unique to each thread.
Aside from having deterministic timing inside the CGRA, the authors also implement predictable external memory access timing, required for real-time systems. The recent SPU [94] architecture aspires to provide a CGRA for general-purpose computing. The main novelty is that SPU extends existing CGRA designs with support for two types of computational patterns: what they call "stream-joins" (e.g. the inner product in sparse vector multiplication) and alias-free scatter/gather (regular loops with indirection). This is achieved by extending the typical CGRA with options to conditionally consume input tokens (re-use values), reset accumulators, or conditionally discard output tokens. Address-generation units (linear and non-linear) reside inside the on-chip SRAM controllers. The SPU targets general-purpose workloads with some favor towards deep-learning applications. The premise of Soft-Brain [95] is to combine both vector-level (for regular, efficient memory access) and data-flow (for parallelism) computation in CGRAs to reach high performance and power-efficiency. The architecture consists of a number of stream-engines (essentially the address-generators of prior work) and the CGRA substrate itself. The input to the CGRA substrate is a number of vector-ports, which carry the vector data itself (512-bit) as fetched from memory, the on-chip scratchpad, or fed back from the output of the CGRA; a stream-controller orchestrates the execution of the system. The Chameleon [96] CGRA is an early commercial CGRA that focused on competing with early DSPs and FPGAs. Here the CGRA is layered, where each layer is called a slice. Each slice has three tiles, where each tile has a number of scratchpad memories that interface with eight reconfigurable processing elements. The idea is to load the local scratchpad with data, configure the processing elements associated with the scratchpad with some functionality, and then pipe the data through and onto other slices. The Chameleon was implemented in a 0.25 um process running at a 125 MHz clock frequency. The architecture itself operates on a 32-bit data-path width, but can be configured to divide the data-stream into two 16-bit or four 8-bit streams as well. SiLAGO [97] is a methodology for creating CGRA-based platforms. The premise behind the method is to use reconfigurable CGRA processing elements (based on DRRA [98]) as building-blocks for future systems in order to reduce production cost with little impact on performance (compared to hand-made ASICs). Platforms based on SiLAGO and DRRA include, amongst others, specialized architectures for Deep-Learning [99], brain-simulation computing [100], and genome identification [101]. The Q100 [102] is similar in concept but specializes in database computing, providing tiles for computing on data-flow streams from which users can assemble larger systems.

B. Larger CGRAs
Most CGRA systems (e.g. ADRES, TRIPS, DySER) limit the size of the array to around (or less than) 64 processing elements, and only a few of the so-far mentioned CGRAs are larger than that (PipeRench has 256 PEs, GARP has 768), but those are relatively fine-grained. The likely reason for the limited size of CGRAs was their application domain, which mostly involved image-, audio-, or telecommunication applications. However, in recent years, even larger CGRA-based systems targeting e.g. High-Performance Computing applications have started to emerge. The eXtreme Processing Platform (XPP) [103] was a CGRA that focused on multiple levels of parallelism, including pipeline processing, data-flow computing, and task-level execution. XPP's interconnection was deep and complex, consisting of multiple levels of various functionality. At the lowest tier, small processing elements containing a scratchpad, an ALU, and an associated configuration manager reside in mesh-like connectivity called a cluster. These clusters are themselves connected through switch-boxes running along their vertical and horizontal axes. Each tier has a configuration manager that is responsible for all functionality of that layer (and below), allowing fine-grained control and partitioning of the functionality of the system. XPP was token-driven, and execution of an operation occurs only when data is present at its inputs. The High-Performance Reconfigurable Processor (HiPReP) [104] is an on-going CGRA research platform capable of floating-point computations. HiPReP has dedicated floating-point circuitry (unlike e.g. FloRA). Processing elements are organized in a mesh with a heterogeneous (in terms of

multiple contexts). A number of scratchpad memories sit at the fringes of each tile and are used to store streamed data. Controlling the operation of the processing elements is done through an instruction pointer that is governed by hierarchical sequencers (one per tile and one global). The sequencer – effectively a programmable FSM – dictates which context is being executed, and can (re)act on signals from the tiles themselves. DRP was commercially taped out in the DRP-1 prototype by NEC. The commercial DAPDNA-2 [120] processor produced by IPFlex contains up to 376 32-bit processing elements, organized as 8x8 PE quadrants in a mesh. The architecture is heterogeneous, with discrete tiles supporting ALU operations, scratchpads, programmable delay lines, simple address-generation (counters), or I/O buffers. The processing elements contain both multiplication and arithmetic units and also support optional pre-processing of inputs through rotation/masking units. The tiles are interconnected using horizontal and vertical busses that run in-between and through the mesh, and crossing the quadrants can only be done at border tile locations. Performance on selected applications (FIR, FFTs, image processing) shows two orders of magnitude better performance over the then state-of-the-art Pentium 4 processor. The 167-processor architecture [121] borders on both CGRAs and conventional multi-core processors, but we include it here since the processing elements are simple and communication between them is only performed using direct (yet dynamically configured) connectivity (and not through shared memory or cache coherence as done in multi-cores). The main focus behind this work is low power, and through a series of advanced low-level optimizations (DVFS, clock generation and distribution, GALS [122], etc.) they show performance of up to 196.7 GMACs/Watt when fabricated in 65 nm technology.
Other similar architectures, based on programmable cores with limited connectivity, are the IMAPCAR [123]/IMAP-CE [124] CGRAs from NEC, aimed towards image recognition in automobiles. The Rhyme-Redefine [125], [126] architecture is a CGRA targeting High-Performance Computing (HPC) kernels. It follows a fairly typical CGRA design, where processing elements are connected through a torus network. The premise of the work is that there is a need to exploit multiple levels of parallelism (instruction-, loop-, and task-level parallelism), albeit the current implementation focuses primarily on instruction-level parallelism through modulo-scheduling. Rhyme-Redefine supports floating-point computations. Plasticine [127] is a recent, large CGRA that focuses on parallel patterns. At the highest abstraction layer, it is built of a mesh of units. There are two types of units, compute and memory units, both of which are programmable with patterns. Inside the compute units we find the raw functional units (the ALUs) as well as programmable state for controlling them. The compute units are built with both SISD- and SIMD-type parallelism in mind, and vector operations map natively to these units. Similarly, inside the memory units, we find a small set of ALUs coupled with programmable logic to interface the SRAM local to the units. The mesh itself interfaces external memory through a set of address generators and coalescing units. More importantly, Plasticine targets floating-point intensive applications, which is also shown in their evaluation (only three out of 13 applications are integer-only). Plasticine is programmable using Spatial [128] – a custom language based on patterns for data-flow computing. Recently, the Cerebras Wafer Scale Engine [129] has been created explicitly for high-end deep-learning training. Little information is publicly available, but the architecture seems to be a hybrid solution between general-purpose processing cores and specialized tiles for tensor computations, which could make it the single largest CGRA architecture to date, with a size of over 46,225 mm².

C. Linear-Arrays and Loop-Accelerators
VEAL [130] is a linear array that explicitly targets accelerating small, compute-intensive loop-bodies. Similar to the before-mentioned PPA (Polymorphic Pipeline Array), the authors behind VEAL performed a rigid empirical evaluation of the benchmarks they target, and demonstrate that one of the main limitations to the performance of mapping said benchmarks to CGRA fabrics is not the number of resources, but actually the amount of instruction-level parallelism extracted by modulo-scheduling. VEAL is a linear array fed by a number of custom address-generators, which broadly correspond to the induction variables of the loops that are executed. An interesting observation is that VEAL is among the few CGRA works that use double-precision arithmetic. Another loop-accelerator similar to VEAL is FPCA [131]. The BERET [132] architecture is yet another linear array, designed to accelerate hot traces found by the hosting general-purpose processor. One of BERET's main contributions was the identification of a small set of subgraphs that the processing elements should cover (called SEBs); the set was empirically extracted from the benchmarks and has since then been used in other studies (e.g. SEED [133], which is similar but improved in concept).
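The modulo-scheduling ceiling that both PPA and VEAL identify is captured by the textbook initiation-interval bound II >= max(ResMII, RecMII): past a certain point, adding PEs lowers the resource bound but leaves the recurrence bound untouched. A minimal sketch of that bound (standard formulation, not any one paper's notation):

from math import ceil

def min_initiation_interval(op_count, num_pes, recurrences):
    # Classic lower bound for modulo scheduling: II >= max(ResMII, RecMII).
    # op_count:    operations in the loop body
    # num_pes:     PEs available (assume each issues one op per cycle)
    # recurrences: list of (latency_sum, distance) per dependence cycle
    res_mii = ceil(op_count / num_pes)                    # resource bound
    rec_mii = max((ceil(lat / dist) for lat, dist in recurrences),
                  default=1)                              # recurrence bound
    return max(res_mii, rec_mii)

# 20 ops on 64 PEs, but a 4-cycle accumulation recurrence (distance 1):
# II is pinned at 4 by the recurrence; adding more PEs would not help.
print(min_initiation_interval(20, 64, [(4, 1)]))  # -> 4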
D. Deep-Learning CGRAs
Deep-Learning [134], in particular the computationally regular Convolutional Neural Networks (CNNs), has lately become a target for specialized CGRAs. Here the focus is to limit the generality and reconfigurability of the traditional CGRA to fit the computational patterns of CNNs, and instead spend the gained logic on supporting specialized operations for the intended deep-learning workloads (such as compression, multicasting, etc.). Furthermore, these architectures often honor smaller (or mixed) number representations, since deep-learning is often amenable to lower-precision calculations [135]. The DT-CGRA [136], [137] architecture follows a CGRA design with fairly coarse processing elements that include up to three multiply-accumulate instructions. The processing elements also include programmable delay lines to more easily map temporally close data. Data inside the PEs is synchronized through tokens, using FIFO empty/full signals as a proxy. To support the different access patterns that modern deep-learning layers have (stride, type, etc.), the CGRA mesh is driven by a number of stream-buffer units that are programmable in a VLIW fashion and that control the address-generation to external memory to stream data. The Sparse CNN (SCNN) [138] is a deep-learning architecture that targets (primarily) CNNs and can exploit sparseness in both activations and kernel weights. The architecture itself is composed of a mesh of RCs, where each element also includes a 4x4 multiplier array and a bank of accumulation registers. These RCs are driven and orchestrated by a layer sequencer, which fetches and broadcasts (compressed) weights and activations. SCNN supports inter-PE parallelism in the form of spatial blocking/tiling, where each block is artificially enlarged with a halo region, which is exchanged between adjacent tiles at the end of the computation. They also implement a dense version (DCNN) of the architecture in order to measure the area overhead and the power- and performance-gains of including sparsity in the accelerator. Liang et al. [139] introduce a CGRA accelerator that targets reinforcement learning. The processing elements themselves are fairly static, with support for addition, multiplication, or a fusion of both. Additionally, a number of different activation functions (ReLU, sigmoid, and tanh) can be selected using the configuration register, and data can be temporarily stored in a local scratchpad. Unlike most existing CGRAs that place address-generators in discrete units outside the RCs, Liang et al.'s RCs include them inside. Global communication lines allow the user to control the reinforcement training of the system. The Eyeriss deep-learning inference engines [140], [141] follow a CGRA design methodology as well, albeit they specialize in re-configuring the network access-patterns rather than the compute (which mostly is based on multiply-accumulate operations). The CGRA itself is a mesh with a variety of options for point-to-point and broadcast operations, highly suitable for deep-learning convolution patterns. Additionally, the platform supports compression of data and exploits sparseness of intermediate activations to increase the observed bandwidth. The Eyeriss architecture – depending on the type of neural network used – can utilize nearly 100% of the CGRA resources when inferring AlexNet. One of the most recent (and perhaps radical) changes to FPGAs is coming in the form of support for Deep-Learning CGRAs.
The Xilinx Versal [142], [143] series devotes a large part of the silicon to a mesh-like CGRA structure of programmable, neighbor-communicating processing elements. The elements themselves are fairly general-purpose, but are marketed as targeting deep-learning and telecommunication applications. To remedy the eventuality of the AI engine missing crucial deep-learning functionality that is yet to come, the AI engine can directly interface the remaining parts of the reconfigurable (FPGA) silicon, which is in the form of the fine-grained reconfigurable cells that Xilinx is known for. The system itself is an attempt to combine the best of both the fine-grained and coarse-grained reconfigurable worlds.
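As a concrete illustration of the sparsity these deep-learning CGRAs exploit, the sketch below performs a dot-product over compressed (index, value) operands so that zeros are never stored, fetched, or multiplied; the same index-matching pattern is essentially the "stream-join" that SPU supports in hardware. This is our own simplification, not SCNN's actual Cartesian-product dataflow:

def sparse_dot(weights, activations):
    # weights/activations: lists of (index, value) pairs, sorted by index.
    # Work scales with the number of nonzeros, not with the tensor size.
    acc, i, j = 0, 0, 0
    while i < len(weights) and j < len(activations):
        wi, wv = weights[i]
        aj, av = activations[j]
        if wi == aj:            # indices match: a useful multiply
            acc += wv * av
            i, j = i + 1, j + 1
        elif wi < aj:
            i += 1
        else:
            j += 1
    return acc

w = [(0, 2.0), (5, -1.0), (9, 4.0)]   # 3 nonzeros of a length-10 vector
a = [(5, 3.0), (9, 0.5)]
print(sparse_dot(w, a))               # -> -1.0 (only 2 multiplies issued)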
E. Low-Power CGRAs
CGRAs have also been shown to be competitive in terms of power-consumption, particularly when compared to existing (low-power) processors and DSP engines. The CGRAs in this domain follow the same concepts as earlier CGRA designs, but focus on both technology and architecture improvements to reduce the static and/or dynamic power of the fabric. These CGRAs tend to reduce the frequency and voltage as much as possible. Since the dynamic power-consumption of a system is a function of both frequency and voltage (P_dynamic = C · V² · f_clk), reducing frequency, and with it the voltage, can have a dramatic effect on power-consumption. Several CGRAs in this area operate at near-MHz levels, and some even remove the clock altogether. The Cool Mega Array [144], [145] (CMA-1 and CMA-2) architecture builds on the following two premises: (i) the clock (clock-tree, flip-flops, state, etc.) is a culprit behind much of the consumed power on modern chips, and (ii) applications have adequate parallelism to freely trade silicon for performance where needed. The CMA-1 was a typical CGRA mesh architecture, but created without a single clock. The architecture focuses on stream-computing, where a processor presents inputs to the CGRA that – in due time – are computed using the clock-less fabric. The architecture (and its follow-up, CMA-2) is power-efficient, and experiments on taped-out versions showed that the leakage of the chip could be as low as 1 mW. The CMA architecture manages to reach up to 89.28 GOPS/Watt using a 24-bit data-path. The CMA architecture continues to be researched, and recent work has focused on improving performance (through variable-latency pipelines in VPCMA [146]) or further reducing power-consumption through body-biasing. The SYSCORE [147] architecture is another similar architecture that focuses on low power-consumption, but leverages dynamic scaling of both voltage and frequency (DVFS) for power-benefits, and uses a fixed-point (and not floating-point) number representation. As with CMA-1/2, it has a 24-bit datapath with a standard mesh-like arrangement of CGRA tiles. Lopes et al. [148] evaluated a standard mesh-like CGRA for use in real-time bio-signal processing. The CGRA they constructed had the additional benefit of being able to power-down sections of the CGRA when unused, to further extend battery-life. Another bio-medical CGRA, introduced by Duch et al. [149], uses a mesh-like composition and a MHz-range clock-frequency to accelerate electrocardiogram (ECG) analysis kernels. The Samsung UL-SRP was designed for bio-medical applications. The UL-SRP [62] is based on ADRES, and featured a hybrid high-power/high-performance mode and a low-power/low-performance mode covering different needs and use-cases. The PULP [150] cluster system features a 16-RC mesh to improve performance and energy-consumption for near-sensor computing.

F. CGRA Overlays on FPGAs

Not strictly a CGRA, ZUMA [159] is an early effort to virtualize the fine-grained resources of an FPGA using a "virtual FPGA", for reasons of portability, compatibility, and FPGA-like reconfigurability inside of FPGA designs. Similar to a real FPGA, ZUMA discretizes the FPGA into logic clusters that contain a crossbar and K-input Look-Up Tables with an optional flip-flop capturing the output. Each cluster is connected to a switch-box that can be programmed to route the data around. The area cost of using the virtual FPGA can be as low as 40% more than the barebone FPGA, demonstrating its feasibility.
Other (even earlier) work was FIRM-core [160], and some more recent efforts include the vFPGA [161]. Intermediate Fabrics (IF) [154] coarsen the FPGA logic by creating a mesh of computational elements of varying sizes, such as for example multipliers and square-root functions; small connectivity boxes (routers) govern the traffic throughout the data-path. IF was evaluated on image processing (stencil) kernels, and overall showed an on-average 17% drop in clock frequency against a gain of 700x in compilation time over using the FPGA alone. The MIN Overlay architecture [162] approaches the CGRA design differently; it uses a one-dimensional strip of processing elements whose outputs are connected to each other through an all-to-all interconnect. Hence, data-flow graphs are spliced and fit onto the linear array, and different parts of the graph are scheduled in time on the array and the interconnect. Different combinations and compositions of the processing elements were evaluated, and the clock frequency, for the most part, ran at 100 MHz, competitive with soft processor cores at the time. Other, arguably less configurable, overlays follow a similar one-dimensional strip design, such as the VectorBlox MXP Matrix Processor [163]. The FPCA loop-accelerator described earlier was also prototyped on FPGAs. The READY [164] architecture extends the linear-array concept further by also having multiple threads running on the overlay. An example of a layered CGRA overlay for FPGAs is the VDR architecture [165]. Here, computational resources are laid out in one-dimensional strips where each strip is fully-connected to downstream units. Links are unidirectional, and a synchronization protocol guides data throughout the data-path. The VDR architecture runs at a clock frequency of 172 MHz, and was shown to be between 3 and 8 times faster compared to the NiosII processor [166] (a well-known soft-core used in FPGA design). Another architecture similar to VDR is the RALU [167]. A flurry of innovative overlays was introduced from 2012 onwards, all centered around the modern FPGA's Digital Signal Processing (DSP) block. The DSP blocks were originally included to allow the use of expensive operations that do not necessarily map well to FPGAs (e.g. multipliers). DSP blocks have since then continuously evolved to include more diverse (various-size multiplication, accumulation, etc.) or more complex (e.g. single-precision arithmetic [153]) functionality. Some of the vendors (Xilinx) directly exposed the interface to control the different functionalities of the DSP blocks to the FPGA fabric, and it was not long before the idea arose to base CGRA architectures around said DSP blocks. ReMorph [168] was one of the early architectures to adopt this style of reasoning. Several different architectures have been explored around the concept of DSPs, including various topologies (e.g. trees [169]) or adaptations of existing architectures (e.g. DySER using DSP blocks [170]). The strengths of these architectures lie in their near-native performance, where small overlays built around DSPs can run at 390 MHz (or higher). Quickdough [171], [172] is a design framework for using CGRA overlays on FPGAs, specifically targeting them to assist a CPU in accelerating compute-intensive program code. The overlay itself follows the standard layout with a mesh of processing elements, each containing a small instruction memory that sequences the ALU within the processing element. The mesh can interface external memory by enqueuing requests to an address unit.
Unique to the architecture is that the two parts (the address unit and the PE mesh) run at two distinctly different frequencies. Most FPGA overlays presented so far focus uniquely on integer computation, likely because most FPGA overlay work targets Xilinx devices, whose DSP units do not contain hardened floating-point operations. The Mesh-of-ALUs [173] is an exception that targets both integer and floating-point computation. The architecture is similar to other mesh-based approaches, but the work demonstrates high (at the time) performance capabilities of FPGAs also in floating-point operations, reaching nearly 20 GFLOP/s on a Stratix IV [174] device. Using floating-point processing elements seems to incur a 33% area overhead, yielding a smaller CGRA mesh, and also an (arguably negligible) 13% reduction in clock frequency. A different overlay architecture that targets floating-point operations is the TILT array [175], [176]. The TILT array architecture is conceptually very similar to the MIN overlay. A linear array of processing elements is arranged to communicate with an all-to-all crossbar, which saves state into an on-chip RAM, relaying information to the computation in the next cycle. The authors illustrated the benefits of TILT over High-Level Synthesis (OpenCL) with both comparable performance and improved productivity, reaching operating frequencies of up to 387 MHz on a Stratix V [177]. The URUK [178] architecture takes a different approach on how the ALUs inside an overlay should be implemented. Rather than having a fixed function, URUK leverages partial reconfiguration [179], changing the RCs' functionality throughout time. Finally, tools for automatically creating CGRA overlays for FPGAs are emerging, such as the Rapid Overlay Builder [180] and CGRA-ME [60], which simplify the generation (and, in the case of CGRA-ME, also the compilation) of applications and overlays. An interesting observation is that out of the 14 unique FPGA architectures that we surveyed here, 9 chose Xilinx as the target platform while 5 focused on Intel (then Altera) FPGAs. There seems to be a favoring of Xilinx architectures, which we believe is due to the more dynamic control that Xilinx offers in their DSP blocks compared to Intel. On the other side, Intel DSPs have (starting from Arria 10 onwards) hardened support for IEEE-754 single-precision floating-point operations, encouraging floating-point-heavy architectures to use those systems.
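The time-multiplexed execution style shared by the MIN overlay, TILT, and VDR, where a dataflow graph is spliced across time steps on a fixed-width strip, can be pictured as a simple list scheduler. The sketch below is illustrative only and ignores interconnect and register-file constraints:

def schedule(deps, width):
    # deps: dict op -> set of operand ops. Each time step may issue at most
    # `width` ready operations; results reach any PE in the next step via
    # the all-to-all interconnect.
    remaining = set(deps)
    done, steps = set(), []
    while remaining:
        ready = [op for op in remaining if deps[op] <= done]
        step = sorted(ready)[:width]
        steps.append(step)
        done |= set(step)
        remaining -= set(step)
    return steps

# ((a*b) + (c*d)) * e on a width-2 strip takes three time steps.
g = {"m1": set(), "m2": set(), "add": {"m1", "m2"}, "m3": {"add"}}
print(schedule(g, 2))  # -> [['m1', 'm2'], ['add'], ['m3']]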
V. CGRA TRENDS AND CHARACTERISTICS
A. Method and Materials
For all previously surveyed and summarized work, we collected several metrics associated with each study. These were:
1) Year of publication,
2) Size of the CGRA array in terms of unique RCs,
3) Data-path width of the CGRA (e.g. MATRIX operates on 4-bit while RaPiD operates on 16-bit),
4) Clock frequency of operation (f_max) in MHz as reported in the study,
5) Power consumption in Watt. For studies that empirically measured this metric, we collected the (benchmark, power) tuple. Otherwise we used what is reported in the study (often the post place-and-route power estimation),
6) Technology (in nm) of the architecture, either when taped out in silicon or of the standard cell library used with the EDA tools,
7) Area (mm²) of the fully synthesized chip as reported in the study. In some cases we had to manually calculate it based on the individual RC size or based on the gates used (after verification with the authors). For FPGAs we used the chip (BGA) package size and assumed a chip-to-die ratio of 7:1, as has been reported in [181],
8) Peak performance, including peak operations-per-second (OP/s), peak multiply-accumulates-per-second (MAC/s), and peak Floating-Point Operations-per-second (FLOP/s), as reported in the paper. We differentiate between integer MAC/s and OP/s because some architectures (e.g. EGRA) do not balance them, leading to a large theoretical OP/s but not a proportionally large MAC/s,
9) Obtained performance out of the theoretical peak (%). We used what the authors reported. For those cases where the authors did not report obtained performance (e.g. only reported absolute time), we calculated this metric manually where applicable, such as when the authors report both the input dimensions and the execution time (in seconds or cycles) of known applications such as (non-Strassen) matrix-multiplication, FIR-filters, matrix-vector multiplication, etc. (a small worked sketch of this derivation follows below).
For items 8-9 we ignored studies that showed relative performance improvements, as it is hard to reason about the performance of a baseline unless explicitly stated. All metrics included have either been directly reported in the seminal publication, been verified by the authors, or we were confident enough in our understanding of the architecture to derive them ourselves. We position and relate our obtained CGRA characteristics against those of modern GPUs. We used NVIDIA GPUs as references, with data collected from [182] and integer performance calculated using methods described in [183].
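As an illustration of how item 9 was derived when a study reported only absolute runtimes, the sketch below converts a (problem size, runtime) pair for a non-Strassen matrix-multiplication into a fraction of theoretical peak. All numeric values are hypothetical placeholders, not data from any surveyed paper:

def matmul_flops(n):
    # Non-Strassen n x n matrix-multiply: n^3 multiplies + n^3 adds.
    return 2 * n ** 3

def obtained_fraction(n, runtime_s, peak_flops):
    achieved = matmul_flops(n) / runtime_s   # FLOP/s actually delivered
    return achieved / peak_flops             # fraction of theoretical peak

# Hypothetical CGRA: 512x512 matmul in 4.2 ms against a 128 GFLOP/s peak.
print(f"{obtained_fraction(512, 4.2e-3, 128e9):.1%}")  # -> 49.9%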
VI. OVERALL ARCHITECTURAL TRENDS
We start by analyzing data that is associated with time, and how the CGRAs have grown as a function of time. Figure 5 overviews how CGRAs have changed through time with respect to various metrics. The total number of RCs, as a function of the respective publication year, is shown in Figure 5:a. We see that a majority of CGRAs are quite small (median: 64 RCs), and FPGA-based CGRAs are even smaller (median: 25 RCs). This is in line with the reasoning that most CGRAs focus on small kernels in the embedded application domain, honoring ILP rather than other forms of parallelism (e.g. thread- or task-level). There are several exceptions to this, such as GARP, which was an early CGRA that used 1/2-bit reconfigurable data-paths and thus needed a large number of RCs to implement various functionality. The other exception is TARTAN, where the authors' largest evaluated version has up to 25,600 RCs, making it likely the largest CGRA ever simulated; this awe-inspiring size was reached by severely restricting the functionality of the RCs (e.g. there is no multiplication support). Thirdly, the Plasticine architecture can have up to 6,208 RCs of varying sorts. Figure 5:b shows the transistor scaling of CGRAs and NVIDIA GPUs. As expected, the transistor dimensions have continuously grown smaller, as predicted by Moore. Note however that both FPGAs and GPUs are (on average) one transistor generation ahead of CGRAs, likely due to most CGRAs being developed in academia and thus being restricted to the standard cell libraries available at the time (which usually are not the most recent). Figure 5:c shows the area of the CGRAs as reported either by the ASIC synthesis tools, estimated by the authors, or from the final taped-out chip. We also include the full die sizes of the FPGAs (that FPGA-based CGRAs have access to), and we position these against the die sizes of modern NVIDIA GPUs. We can see that the trend of CGRA research is – as with the size of CGRAs – to favor smaller CGRAs, and the average size of the CGRAs is . mm². Compared to GPUs, which have monotonically increased their size through time, CGRAs have almost done the inverse and decreased in size. There are two major exceptions: the first is the Imagine architecture, which reported an amazing size of 1000 mm² (later 144 mm² in the follow-up paper 6 years later) – larger than any CGRA or GPU reported to this day. The other large architecture is the CUDA-programmable SGMF at 800 mm². Figure 5:d shows how the reported power-consumption of ASIC-based CGRAs has evolved over time, compared to the Thermal Design Power (TDP) reported for NVIDIA GPUs. The CGRAs are experiencing an on-average exponential decrease in power-consumption, which is likely due to smaller standard cell libraries coupled with small CGRA sizes (Figure 5:a,c,d). On the other hand, NVIDIA GPUs continuously consume more and more power as time goes on (albeit even that is drawing to a halt due to Moore's law). The most power-consuming CGRA, out of those reporting, is the Plasticine architecture, consuming a

any limitation in the fabric itself; however, it is interesting to see that the operating frequencies of FPGA-based CGRAs rival most of the ASIC CGRAs. Figure 5:f shows the data-path widths that CGRA research tends to adopt. Most architectures adopt either a 16-bit (28%) or 32-bit (56%) data-path width; those targeting a 16-bit data-path are usually tailored more towards a specific application, such as telecommunication or deep-learning, while those that target 32-bit (or beyond) are more general-purpose.
A few (13.3%) target 8-bit architectures, but often have support for chaining 8-bit operations for 16-bit use. MATRIX and GARP target very fine-grained reconfigurability, with 4-bit and 1-/2-bit data-paths respectively. Despite this, we can expect future architectures to include more support for low- or hybrid-precision, since it is a reliable way of obtaining more performance while mitigating memory-boundness for applications that permit it. Figure 5:g shows the power-consumption of CGRAs and GPUs as a function of their respective die sizes. This graph complements the graph in Figure 5:d by showing that the low power-consumption of CGRAs is mainly because they are small, with (out of those CGRAs that report both power and area) only Plasticine coming close to the trend of GPUs. Finally, Figure 5:h shows how the individual RC area has evolved throughout time, and we see that the size of RCs has been following the technology scaling, continuously decreasing in size. However, when normalizing the CGRAs' manufacturing technology to that of 16 nm, we actually notice a different trend, where the area of individual RCs is increasing, due to the incorporation of more complex elements (such as wider data-paths, more complex arithmetic units, etc.).

VII. INTEGER AND FLOATING-POINT PERFORMANCE ANALYSIS
Figure 6 overviews data associated with the raw performance of the CGRAs, often positioned against that of NVIDIA GPUs. Figures 6:a-f show the obtained integer performance. Here we distinguish between operations and MAC-based operations in order to reveal architectures which are starved of multipliers. For example, the TARTAN CGRA can execute a large number of operations per unit time, but has no support for multiplications, leading to a very low comparable multiply-add performance. Figures 6:a-b show the GARP and MATRIX architectures as the sole candidates for low-precision arithmetic, and while both of these have comparably high performance (for their time), their multiplication (MAC) performance is lacking (in GARP, the overhead was 32x compared to an addition). Figure 6:c shows 8-bit integer performance, which has recently been of interest to the deep-learning inference community, and where the next-generation Xilinx Versal architecture will be capable of reaching thousands of GOP/s of 8-bit integer performance. Figure 6:d shows 16-bit integer performance, showing a continuous growth over the years. Note how the TARTAN architecture claims to reach performance levels similar to the upcoming Xilinx Versal CGRA, despite being more than a decade old. Figure 6:e is a special case, covering only a few CGRAs (e.g. the Cool Mega Array-1 and SYSCORE); despite their low visible performance, these devices are actually very power-efficient (see the next section for discussion). Finally, 32-bit integer performance is shown in Figure 6:f, where we also include NVIDIA GPU integer performance for comparison. We see that CGRAs have historically been comparable to NVIDIA GPUs, and even FPGAs are becoming a valid way of obtaining integer performance through CGRAs. Figure 6:g shows the peak floating-point performance that CGRAs have reported over the years. The low number of floating-point capable CGRAs prohibits us from drawing any reasonable trend-line, unlike the one for GPUs, which grows exponentially with the years (together with the die-area, see Figure 5:c). However, those CGRAs that do include floating-point units can compete with the performance of modern GPUs – sometimes even outperform them. For example, the Plasticine architecture is capable of delivering 24.6 TFLOP/s of performance, rivaling GPUs from that generation, and the earlier Redefine and SGMF architectures could deliver 500 and 840 GFLOP/s respectively. Even earlier, the WaveScalar architecture was capable of 500 GFLOP/s, which was well ahead of GPUs at that time. At lower performance, we find architectures such as FloRA (600 MFLOP/s) and the loop-accelerator VEAL (5.4 GFLOP/s). Figure 6:h shows the distribution of the number of CGRAs that support floating-point versus those that support integer computations. Floating-point support is clearly under-represented, with more than a factor of four more CGRAs supporting integer computations.
VIII. PERFORMANCE USAGE ANALYSIS
Figure 7:a shows the number of instructions-per-cycle (IPC) that applications experienced when executing on different CGRAs. We see that a majority of CGRAs operate in a fairly low performance domain, primarily due to their size, and most execute around 12 IPC (median). There are corner cases, such as the RHyMe-REDEFINE architecture, which aims to explore CGRAs in High-Performance Computing and reaches 300+ IPC on selected workloads, or the deep-learning SDT-CGRA architecture, which reaches 172 IPC on inference. Similarly, Eyeriss is capable of occupying 100% of its resources when inferring AlexNet, yielding an astounding 700+ IPC. Most FPGA-based CGRAs also execute less than 100 IPC; this is primarily because most FPGA CGRAs are rather small (see Figure 5:a).

Figure 7:b shows the performance applications experience when running on different CGRA architectures as a function of topology size, where we group CGRA architectures into three groups: small (<16), medium (16-64), and large (>64). As expected, the performance and obtained IPC grow as the architectures become larger, with applications running on small-sized CGRAs experiencing the lowest IPC on average. Extrapolating an existing CGRA to the die size of a modern NVIDIA V100 GPU, using the transistor-density scaling shown in Figure 8, such a scaled-up CGRA, albeit unusable by itself (only compute and no orchestration is available), would reach a compute capacity of nearly 1 Peta-OP/s. While this extrapolation is indeed limited, it aims to show that CGRAs have the architectural capability of competing with modern GPU designs, assuming we can fully utilize these (potentially over-provisioned) computing resources.
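The spirit of this extrapolation can be captured in a few lines of C: scale the RC count by the transistor-density ratio between the source node and a V100-class node (cf. Figure 8) and by the die-area ratio, then recompute the peak throughput. All constants below are illustrative placeholders, not the figures behind the 1 Peta-OP/s estimate.

```c
#include <stdio.h>

/* Scale an existing CGRA to a V100-class die: multiply its RC count
 * by the transistor-density ratio between target and source nodes
 * (taken from a density-vs-node curve such as Figure 8) and by the
 * die-area ratio, then recompute peak throughput. */
static double scaled_peak_gops(int rcs, double ops_per_cycle,
                               double freq_ghz,
                               double density_ratio, double area_ratio)
{
    double scaled_rcs = rcs * density_ratio * area_ratio;
    return scaled_rcs * ops_per_cycle * freq_ghz;
}

int main(void)
{
    /* Hypothetical: 128 RCs at an older node; 25x denser target node;
     * 8x larger (V100-class) die; 2 ops/cycle at 1 GHz. */
    printf("%.0f GOP/s\n", scaled_peak_gops(128, 2.0, 1.0, 25.0, 8.0));
    return 0;
}
```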
[Plot: NVIDIA GPU transistor density (million/mm^2) versus manufacturing process (nm).]
Fig. 8: Transistor-density scaling as experienced by NVIDIA GPU chips, which we used to scale existing CGRAs from various points in history.
IX. DISCUSSION AND CONCLUSION
As we saw in the previous sections, a vast majority of CGRAs are fairly small and run at a (comparably) low frequency. This, in turn, leads to very power-efficient designs, allowing CGRAs to be placed into embedded devices such as mobile phones or wearables and operate for hours. This power-efficiency, relative to the performance they provide, allows CGRAs to compete with (and possibly outperform) GPUs, which in turn could be a partial remedy for dark silicon [185]. Based on the analysis in this study, CGRAs should be considered a serious competitor to GPUs, particularly in a future post-Moore era when power-efficiency becomes more important.

However, in order to reap the better compute densities and power-efficiency that CGRAs offer, larger architectures must be more thoroughly researched. Larger CGRA architectures, in particular those aimed at aiding or accelerating general-purpose computing, will be challenging to keep occupied. As we saw in the final graph, a CGRA scaled to the level of a V100 would potentially have a peak performance of hundreds of teraflops, but the main question is: will we be able to map and fully utilize all those computing resources for anything but the most trivial kernels?

Several authors have already pointed out that in order to harness larger CGRAs, we need to complement current ways of extracting instruction parallelism (primarily modulo scheduling) with other forms of concurrency. While modern CGRAs (e.g. Plasticine, SIMD-RA) do exploit SIMD-level parallelism, further research will without doubt be needed, and support for programming models such as CUDA (SGMF moves in this direction), multi-threading, or even multi-tasking (e.g. OpenMP [186]) should be more aggressively pursued, both from an architectural and a programmability viewpoint, in order to leverage future, large-sized CGRAs. For example, the recently added task-dependency features in frameworks such as OpenMP and OmpSs [187] map very well onto clustered CGRAs that have islands of both compute and scratchpad, where the dependencies would dictate how data flows on these CGRAs (exploiting both inter- and intra-task level parallelism and data locality); a minimal sketch of such task dependencies is given at the end of this section.

Another limitation of existing architectures is the application domains they accelerate. A large majority of CGRAs target embedded applications such as filters, stencils, decoders, etc. Studies that integrate the CGRA into the backend of a processor (e.g. TRIPS, DySER) tend to have a more diverse set of benchmarks available, and those studies (e.g. TARTAN, SEED, SGMF) that rely only on simulation (without hardware being developed) have the richest application support. Despite this, CGRAs suffer from a problem similar to the one current FPGAs struggle with: we limit ourselves to small, simple kernels, rather than studying the impact of these architectures on more complex applications. To give a concrete example, to this day no reconfigurable architecture has seriously considered any of the proxy applications that drive HPC system procurement, such as the RIKEN Fiber [188] or ECP benchmark suites [189].
For FPGAs and High-Level Synthesis, this might make sense, since there is always the danger that these large kernels might not fit onto a single FPGA; CGRAs, however, can store multiple contexts and kernels with little overhead in switching between them, opening up possibilities for whole-application execution as well as opportunities to exploit inter-kernel temporal and spatial data locality.

A different challenge with the present (and similar future) surveys lies in the amount of reporting that the different studies do. For example, studies that apply a simulation methodology often have wider benchmark coverage, but fail to report hardware details (e.g. area or RTL information). At the same time, many CGRAs that were actually implemented in hardware (or RTL) do report area and power consumption, but limit the benchmark selection and information. This, in turn, leads to gaps in several graphs, where a clear high-performance candidate is represented in one graph but, due to limited information, is absent from another. This is most clearly seen in graphs that use a derived metric, such as performance per power (OP/s/W). Similarly, many papers prefer to report relative performance improvements rather than absolute numbers, making it difficult to reason about performance across a wide range of CGRAs. We include this in the discussion section mainly as an observation for future studies that may attempt a similar survey.

Overall, this survey has shown that there is plenty of room for CGRA research to grow and to continue to be an active research subject for use in future architectures, particularly when striving to design high-performance CGRAs aimed at niche or general-purpose computation at scale. As transistor dimensions stop shrinking and Moore's law no longer allows us the architectural freedom of carelessly spending silicon, it is here that reconfigurable architectures such as CGRAs might excel at providing performance in a post-Moore era.
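To make the task-dependency idea discussed above concrete, here is a minimal OpenMP sketch in C; the kernels and buffer names are hypothetical, and the CGRA mapping itself is not shown. The depend clauses declare each task's inputs and outputs, from which a dependence-aware runtime derives the task graph that, on a clustered CGRA, could dictate data movement between compute and scratchpad islands.

```c
#include <stddef.h>

static void stage1(float *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) a[i] = (float)i;      /* produce */
}

static void stage2(const float *a, float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) b[i] = 2.0f * a[i];   /* consume */
}

/* Task dependencies (OpenMP 4.x, similarly in OmpSs) declare the data
 * each task reads and writes; the runtime builds a task graph from
 * these in/out sets and schedules tasks accordingly. */
void pipeline(float *a, float *b, size_t n)
{
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a[0:n])
        stage1(a, n);

        #pragma omp task depend(in: a[0:n]) depend(out: b[0:n])
        stage2(a, b, n);   /* guaranteed to run after stage1 */
    }
}
```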
ACKNOWLEDGEMENTS
This article is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
REFERENCES
[1] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256–268, 1974.
[2] G. E. Moore et al., "Cramming more components onto integrated circuits," 1965.
[3] T. N. Theis and H.-S. P. Wong, "The end of Moore's law: A new beginning for information technology," Computing in Science & Engineering, vol. 19, no. 2, p. 41, 2017.
[4] J. S. Vetter, E. P. DeBenedictis, and T. M. Conte, "Architectures for the post-Moore era," IEEE Micro, vol. 37, no. 4, pp. 6–8, 2017.
[5] C. D. Schuman, J. D. Birdwell, M. Dean, J. Plank, and G. Rose, "Neuromorphic computing: A post-Moore's law complementary architecture," Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States), Tech. Rep., 2016.
[6] R. Tessier and W. Burleson, "Reconfigurable computing for digital signal processing: A survey," Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 28, no. 1-2, pp. 7–27, 2001.
[7] J. Gray, "GRVI Phalanx: A massively parallel RISC-V FPGA accelerator accelerator," IEEE, 2016, pp. 17–20.
[8] G. Wang, B. Yin, K. Amiri, Y. Sun, M. Wu, and J. R. Cavallaro, "FPGA prototyping of a high data rate LTE uplink baseband receiver," IEEE, 2009, pp. 248–252.
[9] I. Kuon, R. Tessier, J. Rose et al., "FPGA architecture: Survey and challenges," Foundations and Trends in Electronic Design Automation, vol. 2, no. 2, pp. 135–253, 2008.
[10] C. Yang, T. Geng, T. Wang, C. Lin, J. Sheng, V. Sachdeva, W. Sherman, and M. Herbordt, "Molecular dynamics range-limited force evaluation optimized for FPGAs," vol. 2160. IEEE, 2019, pp. 263–271.
[11] A. Podobas, H. R. Zohouri, N. Maruyama, and S. Matsuoka, "Evaluating high-level design strategies on FPGAs for high-performance computing," IEEE, 2017, pp. 1–4.
[12] H. R. Zohouri, N. Maruyama, A. Smith, M. Matsuda, and S. Matsuoka, "Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs," in SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2016, pp. 409–420.
[13] H. R. Zohouri, A. Podobas, and S. Matsuoka, "Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2018, pp. 153–162.
[14] A. Podobas and S. Matsuoka, "Designing and accelerating spiking neural networks using OpenCL for FPGAs," IEEE, 2017, pp. 255–258.
[15] T. Miyamori and K. Olukotun, "REMARC: Reconfigurable multimedia array coprocessor," IEICE Transactions on Information and Systems, vol. 82, no. 2, pp. 389–397, 1999.
[16] A. Agarwal, S. Amarasinghe, R. Barua, M. Frank, W. Lee, V. Sarkar, D. Srikrishna, and M. Taylor, "The RAW compiler project," in Proceedings of the Second SUIF Compiler Workshop, Stanford, CA, 1997.
[17] J. R. Hauser and J. Wawrzynek, "Garp: A MIPS processor with a reconfigurable coprocessor," in Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No. 97TB100186). IEEE, 1997, pp. 12–21.
[18] S. Ahmad, S. Subramanian, V. Boppana, S. Lakka, F.-H. Ho, T. Knopp, J. Noguera, G. Singh, and R. Wittig, "Xilinx first 7nm device: Versal AI Core (VC1902)," IEEE, 2019, pp. 1–28.
[19] K. E. Fleming, K. D. Glossop, and S. C. Steely, "Apparatus, methods, and systems with a configurable spatial accelerator," US Patent 10,445,250, Oct. 15, 2019.
[20] S. Brown, "FPGA architectural research: a survey," IEEE Design & Test of Computers, vol. 13, no. 4, pp. 9–15, 1996.
[21] H. Amano, "A survey on dynamically reconfigurable processors," IEICE Transactions on Communications, vol. 89, no. 12, pp. 3179–3187, 2006.
[22] V. Tehre and R. Kshirsagar, "Survey on coarse grained reconfigurable architectures," International Journal of Computer Applications, vol. 48, no. 16, pp. 1–7, 2012.
[23] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Proceedings of the Conference on Design, Automation and Test in Europe. IEEE Press, 2001, pp. 642–649.
[24] G. Theodoridis, D. Soudris, and S. Vassiliadis, "A survey of coarse-grain reconfigurable architectures and CAD tools," in Fine- and Coarse-Grain Reconfigurable Computing. Springer, 2007, pp. 89–149.
[25] M. Wijtvliet, L. Waeijen, and H. Corporaal, "Coarse grained reconfigurable architectures in the past 25 years: Overview and classification," IEEE, 2016, pp. 235–244.
[26] L. Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, and S. Wei, "A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications," ACM Computing Surveys (CSUR), vol. 52, no. 6, p. 118, 2019.
[27] J. Hennessy, N. Jouppi, S. Przybylski, C. Rowen, T. Gross, F. Baskett, and J. Gill, "MIPS: A microprocessor architecture," ACM SIGMICRO Newsletter, vol. 13, no. 4, pp. 17–22, 1982.
[28] T. J. Callahan, J. R. Hauser, and J. Wawrzynek, "The Garp architecture and C compiler," Computer, vol. 33, no. 4, pp. 62–69, 2000.
[29] A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin, and B. Hutchings, "A reconfigurable arithmetic array for multimedia applications," in FPGA, vol. 99. Citeseer, 1999, pp. 135–143.
[30] T. Stansfield, "Using multiplexers for control and data in D-Fabrix," in International Conference on Field Programmable Logic and Applications. Springer, 2003, pp. 416–425.
[31] E. Waingold, M. Taylor, V. Sarkar, V. Lee, W. Lee, J. Kim, M. Frank, P. Finch, S. Devabhaktumi, R. Barua et al., "Baring it all to software: The Raw machine," 1997.
[32] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee et al., "The Raw microprocessor: A computational fabric for software circuits and general-purpose programs," IEEE Micro, vol. 22, no. 2, pp. 25–35, 2002.
[33] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, J. Kim, J. Psota et al., "Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams," ACM SIGARCH Computer Architecture News, vol. 32, no. 2, p. 2, 2004.
[34] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown et al., "TILE64 processor: A 64-core SoC with mesh interconnect," IEEE, 2008, pp. 88–598.
[35] T. Miyamori and U. Olukotun, "A quantitative analysis of reconfigurable coprocessors for multimedia applications," in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (Cat. No. 98TB100251). IEEE, 1998, pp. 2–11.
[36] E. Mirsky, A. DeHon et al., "MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources," in FCCM, vol. 96, 1996, pp. 17–19.
[37] G. Lu, H. Singh, M.-H. Lee, N. Bagherzadeh, F. Kurdahi, and M. Eliseu Filho, "The MorphoSys parallel reconfigurable system," in European Conference on Parallel Processing. Springer, 1999, pp. 727–734.
[38] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho, "MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications," IEEE Transactions on Computers, vol. 49, no. 5, pp. 465–481, 2000.
[39] M. Jo, D. Lee, and K. Choi, "Chip implementation of a coarse-grained reconfigurable architecture supporting floating-point operations," vol. 3. IEEE, 2008, pp. III–29.
[40] C. Ebeling, D. C. Cronquist, and P. Franklin, "RaPiD: reconfigurable pipelined datapath," in International Workshop on Field Programmable Logic and Applications. Springer, 1996, pp. 126–135.
[41] C. Ebeling, D. C. Cronquist, P. Franklin, J. Secosky, and S. G. Berg, "Mapping applications to the RaPiD configurable architecture," in Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No. 97TB100186). IEEE, 1997, pp. 106–115.
[42] R. W. Hartenstein, R. Kress, and H. Reinig, "A new FPGA architecture for word-oriented datapaths," in International Workshop on Field Programmable Logic and Applications. Springer, 1994, pp. 144–155.
[43] ——, "A reconfigurable data-driven ALU for Xputers," in Proceedings of IEEE Workshop on FPGA's for Custom Computing Machines. IEEE, 1994, pp. 139–146.
[44] R. W. Hartenstein, M. Herz, T. Hoffmann, and U. Nageldinger, "Using the KressArray for reconfigurable computing," in Configurable Computing: Technology and Applications, vol. 3526. International Society for Optics and Photonics, 1998, pp. 150–161.
[45] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, "CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit," in ACM SIGARCH Computer Architecture News, vol. 28, no. 2. ACM, 2000, pp. 225–235.
[46] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, "MediaBench: A tool for evaluating and synthesizing multimedia and communications systems," in Proceedings of the 30th Annual International Symposium on Microarchitecture. IEEE, 1997, pp. 330–335.
[47] S. Kumar, L. Pires, S. Ponnuswamy, C. Nanavati, J. Golusky, M. Vojta, S. Wadi, D. Pandalai, and H. Spaanenberg, "A benchmark suite for evaluating configurable computing systems: status, reflections, and future directions," in Proceedings of the 2000 ACM/SIGDA Eighth International Symposium on Field Programmable Gate Arrays. ACM, 2000, pp. 126–134.
[48] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. R. Taylor, "PipeRench: A reconfigurable architecture and compiler," Computer, vol. 33, no. 4, pp. 70–77, 2000.
[49] H. Schmit, D. Whelihan, A. Tsai, M. Moe, B. Levine, and R. R. Taylor, "PipeRench: A virtualized programmable datapath in 0.18 micron technology," in Proceedings of the IEEE 2002 Custom Integrated Circuits Conference (Cat. No. 02CH37285). IEEE, 2002, pp. 63–66.
[50] J. Becker, M. Glesner, A. Alsolaim, and J. A. Starzyk, "Fast communication mechanisms in coarse-grained dynamically reconfigurable array architectures," in PDPTA. Citeseer, 2000.
[51] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. López-Lagunas, P. R. Mattson, and J. D. Owens, "A bandwidth-efficient architecture for media processing," in Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture. IEEE, 1998, pp. 3–13.
[52] J. H. Ahn, W. J. Dally, B. Khailany, U. J. Kapasi, and A. Das, "Evaluating the Imagine stream architecture," in Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004. IEEE, 2004, pp. 14–25.
[53] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix," in International Conference on Field Programmable Logic and Applications. Springer, 2003, pp. 61–70.
[54] B. Mei, A. Lambrechts, J.-Y. Mignolet, D. Verkest, and R. Lauwereins, "Architecture exploration for a reconfigurable architecture template," IEEE Design & Test of Computers, vol. 22, no. 2, pp. 90–101, 2005.
[55] J. A. Fisher, "Very long instruction word architectures and the ELI-512," in Proceedings of the 10th Annual International Symposium on Computer Architecture, 1983, pp. 140–150.
[56] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "DRESC: A retargetable compiler for coarse-grained reconfigurable architectures," IEEE, 2002, pp. 166–173.
[57] K. Wu, A. Kanstein, J. Madsen, and M. Berekovic, "MT-ADRES: multithreading on coarse-grained reconfigurable architecture," in International Workshop on Applied Reconfigurable Computing. Springer, 2007, pp. 26–38.
[58] F. Bouwens, M. Berekovic, A. Kanstein, and G. Gaydadjiev, "Architectural exploration of the ADRES coarse-grained reconfigurable array," in International Workshop on Applied Reconfigurable Computing. Springer, 2007, pp. 1–13.
[59] M. Karunaratne, A. K. Mohite, T. Mitra, and L.-S. Peh, "HyCUBE: A CGRA with reconfigurable single-cycle multi-hop interconnect," IEEE, 2017, pp. 1–6.
[60] S. A. Chin, N. Sakamoto, A. Rui, J. Zhao, J. H. Kim, Y. Hara-Azumi, and J. Anderson, "CGRA-ME: A unified framework for CGRA modelling and exploration," IEEE, 2017, pp. 184–189.
[61] M. J. Walker and J. H. Anderson, "Generic connectivity-based CGRA mapping via integer linear programming," IEEE, 2019, pp. 65–73.
[62] C. Kim, M. Chung, Y. Cho, M. Konijnenburg, S. Ryu, and J. Kim, "ULP-SRP: Ultra low power Samsung reconfigurable processor for biomedical applications," IEEE, 2012, pp. 329–334.
[63] J.-e. Lee, K. Choi, and N. D. Dutt, "Compilation approach for coarse-grained reconfigurable architectures," IEEE Design & Test of Computers, vol. 20, no. 1, pp. 26–33, 2003.
[64] ——, "Evaluating memory architectures for media applications on coarse-grained reconfigurable architectures," in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP 2003). IEEE, 2003, pp. 172–182.
[65] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, and W. Yoder, "Scaling to the end of silicon with EDGE architectures," Computer, vol. 37, no. 7, pp. 44–55, 2004.
[66] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, N. Ranganathan, D. Burger, S. W. Keckler, R. G. McDonald, and C. R. Moore, "TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP," ACM Transactions on Architecture and Code Optimization (TACO), vol. 1, no. 1, pp. 62–93, 2004.
[67] K. Sankaralingam, V. A. Singh, S. W. Keckler, and D. Burger, "Routed inter-ALU networks for ILP scalability and performance," in Proceedings of the 21st International Conference on Computer Design. IEEE, 2003, pp. 170–177.
[68] A. Smith, J. Burrill, J. Gibson, B. Maher, N. Nethercote, B. Yoder, D. Burger, and K. S. McKinley, "Compiling for EDGE architectures," in International Symposium on Code Generation and Optimization (CGO'06). IEEE, 2006, 11 pp.
[69] B. Yoder, J. Burrill, R. McDonald, K. Bush, K. Coons, M. Gebhart, S. Govindan, B. Maher, R. Nagarajan, B. Robatmili et al., "Software infrastructure and tools for the TRIPS prototype," in Workshop on Modeling, Benchmarking and Simulation. Citeseer, 2007.
[70] R. McDonald, D. Burger, and S. Keckler, "The design and implementation of the TRIPS prototype chip," IEEE, 2005, pp. 1–24.
[71] M. Gebhart, B. A. Maher, K. E. Coons, J. Diamond, P. Gratz, M. Marino, N. Ranganathan, B. Robatmili, A. Smith, J. Burrill et al., "An evaluation of the TRIPS computer system," in ACM SIGPLAN Notices, vol. 44, no. 3. ACM, 2009, pp. 1–12.
[72] J. Gray and A. Smith, "Towards an area-efficient implementation of a high ILP EDGE soft processor," arXiv preprint arXiv:1803.06617, 2018.
[73] V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim, "DySER: Unifying functionality and parallelism specialization for energy-efficient computing," IEEE Micro, vol. 32, no. 5, pp. 38–51, 2012.
[74] I. Parulkar, A. Wood, J. C. Hoe, B. Falsafi, S. V. Adve, J. Torrellas, and S. Mitra, "OpenSPARC: An open platform for hardware reliability experimentation," in Fourth Workshop on Silicon Errors in Logic-System Effects (SELSE). Citeseer, 2008, pp. 1–6.
[75] J. Benson, R. Cofell, C. Frericks, C.-H. Ho, V. Govindaraju, T. Nowatzki, and K. Sankaralingam, "Design, integration and implementation of the DySER hardware accelerator into OpenSPARC," in IEEE International Symposium on High-Performance Computer Architecture. IEEE, 2012, pp. 1–12.
[76] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt, "The M5 simulator: Modeling networked systems," IEEE Micro, no. 4, pp. 52–60, 2006.
[77] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2008, pp. 72–81.
[78] J. D. Souza, L. Carro, M. B. Rutzig, and A. C. S. Beck, "Towards a dynamic and reconfigurable multicore heterogeneous system," IEEE, 2014, pp. 73–78.
[79] ——, "A reconfigurable heterogeneous multicore with a homogeneous ISA," IEEE, 2016, pp. 1598–1603.
[80] F. C. Junior, I. Silva, and R. Jacobi, "A partially shared thin reconfigurable array for multicore processor," in Anais do IX Simpósio Brasileiro de Engenharia de Sistemas Computacionais. SBC, 2019, pp. 113–118.
[81] M. Butler, L. Barnes, D. D. Sarma, and B. Gelinas, "Bulldozer: An approach to multithreaded compute performance," IEEE Micro, vol. 31, no. 2, pp. 6–15, 2011.
[82] D. L. Wolf, L. J. Jung, T. Ruschke, C. Li, and C. Hochberger, "AMIDAR project: lessons learned in 15 years of researching adaptive processors," IEEE, 2018, pp. 1–8.
[83] S. R. Chalamalasetti, S. Purohit, M. Margala, and W. Vanderbauwhede, "MORA: an architecture and programming model for a resource efficient coarse grained reconfigurable processor," IEEE, 2009, pp. 389–396.
[84] W. Vanderbauwhede, M. Margala, S. R. Chalamalasetti, and S. Purohit, "A C++-embedded domain-specific language for programming the MORA soft processor array," in ASAP 2010: 21st IEEE International Conference on Application-specific Systems, Architectures and Processors. IEEE, 2010, pp. 141–148.
[85] Y. Park, H. Park, and S. Mahlke, "CGRA Express: accelerating execution using dynamic operation fusion," in Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems. ACM, 2009, pp. 271–280.
[86] D. Lewis, G. Chiu, J. Chromczak, D. Galloway, B. Gamsa, V. Manohararajah, I. Milton, T. Vanderhoek, and J. Van Dyken, "The Stratix 10 highly pipelined FPGA architecture," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 159–168.
[87] H. Park, Y. Park, and S. Mahlke, "Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2009, pp. 370–380.
[88] Y. Kim, J. Lee, J. Lee, T. X. Mai, I. Heo, and Y. Paek, "Exploiting both pipelining and data parallelism with SIMD reconfigurable architecture," in International Symposium on Applied Reconfigurable Computing. Springer, 2012, pp. 40–52.
[89] C. Liang and X. Huang, "SmartCell: An energy efficient coarse-grained reconfigurable architecture for stream-based applications," EURASIP Journal on Embedded Systems, vol. 2009, no. 1, p. 518659, 2009.
[90] O. Atak and A. Atalar, "BilRC: An execution triggered coarse grained reconfigurable architecture," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 7, pp. 1285–1298, 2012.
[91] D. Lee, M. Jo, K. Han, and K. Choi, "FloRA: Coarse-grained reconfigurable architecture with floating-point operation capability," IEEE, 2009, pp. 376–379.
[92] F. Feng, L. Li, K. Wang, F. Han, B. Zhang, and G. He, "Floating-point operation based reconfigurable architecture for radar processing," IEICE Electronics Express, vol. 13, no. 20, p. 20160893, 2016.
[93] H. Siqueira and M. Kreutz, "A coarse-grained reconfigurable architecture for a PRET machine," IEEE, 2018, pp. 237–242.
[94] V. Dadu, J. Weng, S. Liu, and T. Nowatzki, "Towards general purpose acceleration by exploiting common data-dependence forms," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 924–939.
[95] T. Nowatzki, V. Gangadhar, N. Ardalani, and K. Sankaralingam, "Stream-dataflow acceleration," IEEE, 2017, pp. 416–429.
[96] B. Salefski and L. Caglar, "Re-configurable computing in wireless," in Proceedings of the 38th Annual Design Automation Conference. ACM, 2001, pp. 178–183.
[97] A. Hemani, N. Farahini, S. M. Jafri, H. Sohofi, S. Li, and K. Paul, "The SiLago solution: Architecture and design methods for a heterogeneous dark silicon aware coarse grain reconfigurable fabric," in The Dark Side of Silicon. Springer, 2017, pp. 47–94.
[98] M. A. Shami, "Dynamically reconfigurable resource array," Ph.D. dissertation, KTH Royal Institute of Technology, 2012.
[99] M. Martina, A. Hemani, and G. Baccelli, "Design of a coarse grain reconfigurable array for neural networks."
[100] D. Stathis, C. Sudarshan, Y. Yang, M. Jung, S. A. M. H. Jafri, C. Weis, A. Hemani, A. Lansner, and N. Wehn, "eBrainII: A 3 kW realtime custom 3D DRAM integrated ASIC implementation of a biologically plausible model of a human scale cortex," arXiv preprint arXiv:1911.00889, 2019.
[101] Y. Yang, D. Stathis, P. Sharma, K. Paul, A. Hemani, M. Grabherr, and R. Ahmad, "RiBoSOM: rapid bacterial genome identification using self-organizing map implemented on the synchoros SiLago platform," in Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation. ACM, 2018, pp. 105–114.
[102] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross, "Q100: The architecture and design of a database processing unit," ACM SIGARCH Computer Architecture News, vol. 42, no. 1, pp. 255–268, 2014.
[103] V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and M. Weinhardt, "PACT XPP: a self-reconfigurable data processing architecture," The Journal of Supercomputing, vol. 26, no. 2, pp. 167–184, 2003.
[104] P. S. Käsgen, M. Weinhardt, and C. Hochberger, "A coarse-grained reconfigurable array for high-performance computing applications," IEEE, 2018, pp. 1–4.
[105] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin, "WaveScalar," in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2003, p. 291.
[106] A. Putnam, S. Swanson, M. Mercaldi, K. Michelson, A. Petersen, A. Schwerin, M. Oskin, and S. Eggers, "The microarchitecture of a pipelined WaveScalar processor: An RTL-based study," Tech. Rep. TR-2005-11-02, 2005.
[107] S. Swanson, A. Schwerin, M. Mercaldi, A. Petersen, A. Putnam, K. Michelson, M. Oskin, and S. J. Eggers, "The WaveScalar architecture," ACM Transactions on Computer Systems (TOCS), vol. 25, no. 2, p. 4, 2007.
[108] M. Mishra, T. J. Callahan, T. Chelcea, G. Venkataramani, S. C. Goldstein, and M. Budiu, "Tartan: evaluating spatial computation for whole program execution," ACM SIGOPS Operating Systems Review, vol. 40, no. 5, pp. 163–174, 2006.
[109] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.
[110] G. Ansaloni, P. Bonzini, and L. Pozzi, "EGRA: A coarse grained reconfigurable architectural template," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 6, pp. 1062–1074, 2010.
[111] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: An infrastructure for computer system modeling," Computer, no. 2, pp. 59–67, 2002.
[112] D. Voitsechov and Y. Etsion, "Single-graph multiple flows: Energy efficient design alternative for GPGPUs," in ACM SIGARCH Computer Architecture News, vol. 42, no. 3. IEEE Press, 2014, pp. 205–216.
[113] D. Voitsechov, O. Port, and Y. Etsion, "Inter-thread communication in multithreaded, reconfigurable coarse-grain arrays," IEEE, 2018, pp. 42–54.
[114] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A unified graphics and computing architecture," IEEE Micro, vol. 28, no. 2, pp. 39–55, 2008.
[115] D. Kirk et al., "NVIDIA CUDA software and GPU parallel computing architecture," in ISMM, vol. 7, 2007, pp. 103–104.
[116] A. Munshi, "The OpenCL specification," IEEE, 2009, pp. 1–314.
[117] C. M. Wittenbrink, E. Kilgariff, and A. Prabhu, "Fermi GF100 GPU architecture," IEEE Micro, vol. 31, no. 2, pp. 50–59, 2011.
[118] M. Zhu, L. Liu, S. Yin, Y. Wang, W. Wang, and S. Wei, "A reconfigurable multi-processor SoC for media applications," in Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE, 2010, pp. 2011–2014.
[119] M. Suzuki, Y. Hasegawa, Y. Yamada, N. Kaneko, K. Deguchi, H. Amano, K. Anjo, M. Motomura, K. Wakabayashi, T. Toi et al., "Stream applications on the dynamically reconfigurable processor," in Proceedings of the 2004 IEEE International Conference on Field-Programmable Technology (IEEE Cat. No. 04EX921). IEEE, 2004, pp. 137–144.
[120] T. Sato, "DAPDNA-2: a dynamically reconfigurable processor with 376 32-bit processing elements," IEEE, 2005, pp. 1–24.
[121] D. N. Truong, W. H. Cheng, T. Mohsenin, Z. Yu, A. T. Jacobson, G. Landge, M. J. Meeuwsen, C. Watnik, A. T. Tran, Z. Xiao et al., "A 167-processor computational platform in 65 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 44, no. 4, pp. 1130–1144, 2009.
[122] D. M. Chapiro, "Globally-asynchronous locally-synchronous systems," Stanford Univ CA Dept of Computer Science, Tech. Rep., 1984.
[123] S. Kyo and S. Okazaki, "IMAPCAR: A 100 GOPS in-vehicle vision processor based on 128 ring connected four-way VLIW processing elements," Journal of Signal Processing Systems, vol. 62, no. 1, pp. 5–16, 2011.
[124] S. Kyo, T. Koga, and S. Okazaki, "IMAP-CE: A 51.2 GOPS video rate image processor with 128 VLIW processing elements," in Proceedings 2001 International Conference on Image Processing (Cat. No. 01CH37205), vol. 3. IEEE, 2001, pp. 294–297.
[125] S. Das, N. Sivanandan, K. T. Madhu, S. K. Nandy, and R. Narayan, "RHyMe: REDEFINE hyper cell multicore for accelerating HPC kernels," IEEE, 2016, pp. 601–602.
[126] K. T. Madhu, S. Das, S. Nalesh, S. Nandy, and R. Narayan, "Compiling HPC kernels for the REDEFINE CGRA," IEEE, 2015, pp. 405–410.
[127] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun, "Plasticine: A reconfigurable architecture for parallel patterns," IEEE, 2017, pp. 389–402.
[128] D. Koeplinger, M. Feldman, R. Prabhakar, Y. Zhang, S. Hadjis, R. Fiszel, T. Zhao, L. Nardi, A. Pedram, C. Kozyrakis et al., "Spatial: A language and compiler for application accelerators," in ACM SIGPLAN Notices, vol. 53, no. 4. ACM, 2018, pp. 296–311.
[129] Cerebras Systems, "Wafer-scale deep learning," Hot Chips 31, 2019.
[130] ACM SIGARCH Computer Architecture News, vol. 36, no. 3. IEEE Computer Society, 2008, pp. 389–400.
[131] J. Cong, H. Huang, C. Ma, B. Xiao, and P. Zhou, "A fully pipelined and dynamically composable architecture of CGRA," IEEE, 2014, pp. 9–16.
[132] S. Gupta, S. Feng, A. Ansari, S. Mahlke, and D. August, "Bundled execution of recurring traces for energy-efficient general purpose processing," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2011, pp. 12–23.
[133] T. Nowatzki, V. Gangadhar, and K. Sankaralingam, "Exploring the potential of heterogeneous von Neumann/dataflow execution models," ACM SIGARCH Computer Architecture News, vol. 43, no. 3, pp. 298–310, 2016.
[134] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[135] M. Courbariaux, Y. Bengio, and J.-P. David, "Training deep neural networks with low precision multiplications," arXiv preprint arXiv:1412.7024, 2014.
[136] X. Fan, H. Li, W. Cao, and L. Wang, "DT-CGRA: Dual-track coarse-grained reconfigurable architecture for stream applications," IEEE, 2016, pp. 1–9.
[137] X. Fan, D. Wu, W. Cao, W. Luk, and L. Wang, "Stream processing dual-track CGRA for object inference," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 6, pp. 1098–1111, 2018.
[138] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "SCNN: An accelerator for compressed-sparse convolutional neural networks," ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 27–40, 2017.
[139] M. Liang, M. Chen, Z. Wang, and J. Sun, "A CGRA based neural network inference engine for deep reinforcement learning," IEEE, 2018, pp. 540–543.
[140] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.
[141] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.
[142] B. Gaide, D. Gaitonde, C. Ravishankar, and T. Bauer, "Xilinx adaptive compute acceleration platform: Versal architecture," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2019, pp. 84–93.
[143] K. Vissers, "Versal: The Xilinx adaptive compute acceleration platform (ACAP)," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2019, pp. 83–83.
[144] N. Ando, K. Masuyama, H. Okuhara, and H. Amano, "Variable pipeline structure for coarse grained reconfigurable array CMA," IEEE, 2016, pp. 217–220.
[145] N. Ozaki, Y. Yoshihiro, Y. Saito, D. Ikebuchi, M. Kimura, H. Amano, H. Nakamura, K. Usami, M. Namiki, and M. Kondo, "Cool Mega-Array: A highly energy efficient reconfigurable accelerator," IEEE, 2011, pp. 1–8.
[146] T. Kojima, N. Ando, Y. Matshushita, H. Okuhara, N. A. V. Doan, and H. Amano, "Real chip evaluation of a low power CGRA with optimized application mapping," in Proceedings of the 9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies. ACM, 2018, p. 13.
[147] K. Patel, S. McGettrick, and C. J. Bleakley, "SYSCORE: A coarse grained reconfigurable array architecture for low energy biosignal processing," IEEE, 2011, pp. 109–112.
[148] J. Lopes, D. Sousa, and J. C. Ferreira, "Evaluation of CGRA architecture for real-time processing of biological signals on wearable devices," IEEE, 2017, pp. 1–7.
[149] L. Duch, S. Basu, R. Braojos, D. Atienza, G. Ansaloni, and L. Pozzi, "A multi-core reconfigurable architecture for ultra-low power bio-signal analysis," IEEE, 2016, pp. 416–419.
[150] S. Das, K. J. Martin, P. Coussy, and D. Rossi, "A heterogeneous cluster with reconfigurable accelerator for energy efficient near-sensor data analytics," IEEE, 2018, pp. 1–5.
[151] S. Das, D. Rossi, K. J. Martin, P. Coussy, and L. Benini, "A 142 MOPS/mW integrated programmable array accelerator for smart visual processing," IEEE, 2017, pp. 1–4.
[152] O. Akbari, M. Kamal, A. Afzali-Kusha, M. Pedram, and M. Shafique, "X-CGRA: An energy-efficient approximate coarse-grained reconfigurable architecture," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019.
[153] M. Langhammer and B. Pasca, "Floating-point DSP block architecture for FPGAs," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 117–125.
[154] G. Stitt and J. Coole, "Intermediate fabrics: Virtual architectures for near-instant FPGA compilation," IEEE Embedded Systems Letters, vol. 3, no. 3, pp. 81–84, 2011.
[155] S. Shukla, N. W. Bergmann, and J. Becker, "QUKU: a two-level reconfigurable architecture," in IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures (ISVLSI'06). IEEE, 2006, 6 pp.
[156] N. W. Bergmann, S. K. Shukla, and J. Becker, "QUKU: a dual-layer reconfigurable architecture," ACM Transactions on Embedded Computing Systems (TECS), vol. 12, no. 1s, p. 63, 2013.
[157] S. Shukla, N. W. Bergmann, and J. Becker, "QUKU: A FPGA based flexible coarse grain architecture design paradigm using process networks," IEEE, 2012, pp. 93–96.
[160] R. L. Lysecky, K. Miller, F. Vahid, and K. A. Vissers, "Firm-core virtual FPGA for just-in-time FPGA compilation," in FPGA, 2005, p. 271.
[161] T. Myint, M. Amagasaki, Q. Zhao, and M. Iida, "A SLM-based overlay architecture for fine-grained virtual FPGA," IEICE Electronics Express, vol. 16, no. 20, p. 20190610, 2019.
[162] R. Ferreira, J. G. Vendramini, L. Mucida, M. M. Pereira, and L. Carro, "An FPGA-based heterogeneous coarse-grained dynamically reconfigurable architecture," in Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems. ACM, 2011, pp. 195–204.
[163] A. Severance and G. G. Lemieux, "Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor," in Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. IEEE Press, 2013, p. 6.
[164] L. B. D. Silva, R. Ferreira, M. Canesche, M. M. Menezes, M. D. Vieira, J. Penha, P. Jamieson, and J. A. M. Nacif, "READY: A fine-grained multithreading overlay framework for modern CPU-FPGA dataflow applications," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, p. 56, 2019.
[165] D. Capalija and T. S. Abdelrahman, "Towards synthesis-free JIT compilation to commodity FPGAs," IEEE, 2011, pp. 202–205.
[166] J. Ball, "The Nios II family of configurable soft-core processors," IEEE, 2005, pp. 1–40.
[167] C. Feng and L. Yang, "Design and evaluation of a novel reconfigurable ALU based on FPGA," in Proceedings 2013 International Conference on Mechatronic Sciences, Electric Engineering and Computer (MEC). IEEE, 2013, pp. 2286–2290.
[168] K. Paul, C. Dash, and M. S. Moghaddam, "reMORPH: a runtime reconfigurable architecture," IEEE, 2012, pp. 26–33.
[169] A. K. Jain, X. Li, P. Singhai, D. L. Maskell, and S. A. Fahmy, "DeCO: A DSP block based FPGA accelerator overlay with low overhead interconnect," IEEE, 2016, pp. 1–8.
[170] A. K. Jain, X. Li, S. A. Fahmy, and D. L. Maskell, "Adapting the DySER architecture with DSP blocks as an overlay for the Xilinx Zynq," ACM SIGARCH Computer Architecture News, vol. 43, no. 4, pp. 28–33, 2016.
[171] C. Liu, H.-C. Ng, and H. K.-H. So, "Automatic nested loop acceleration on FPGAs using soft CGRA overlay," arXiv preprint arXiv:1509.00042, 2015.
[172] ——, "QuickDough: a rapid FPGA loop accelerator design framework using soft CGRA overlay," IEEE, 2015, pp. 56–63.
[173] D. Capalija and T. S. Abdelrahman, "A high-performance overlay architecture for pipelined execution of data flow graphs," IEEE, 2013, pp. 1–8.
[174] D. Mansur, "Stratix IV FPGA and HardCopy IV ASIC @ 40 nm," IEEE, 2008, pp. 1–22.
[175] K. Ovtcharov, I. Tili, and J. G. Steffan, "TILT: a multithreaded VLIW soft processor family," IEEE, 2013, pp. 1–4.
[176] R. Rashid, J. G. Steffan, and V. Betz, "Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS," IEEE, 2014, pp. 20–27.
[177] D. Lewis, D. Cashman, M. Chan, J. Chromczak, G. Lai, A. Lee, T. Vanderhoek, and H. Yu, "Architectural enhancements in Stratix V," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 2013, pp. 147–156.
[178] Z. T. Aklah, "A hybrid partially reconfigurable overlay supporting just-in-time assembly of custom accelerators on FPGAs," 2017.
[179] D. Koch, Partial Reconfiguration on FPGAs: Architectures, Tools and Applications. Springer Science & Business Media, 2012, vol. 153.
[180] M. X. Yue, D. Koch, and G. G. Lemieux, "Rapid overlay builder for Xilinx FPGAs," IEEE, 2015, pp. 17–20.
[181] R. Kisiel and Z. Szczepański, "Trends in assembling of advanced IC packages," Journal of Telecommunications and Information Technology, 2007.
[185] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," IEEE, 2011, pp. 365–376.
[186] R. Chandra, L. Dagum, D. Kohr, R. Menon, D. Maydan, and J. McDonald, Parallel Programming in OpenMP. Morgan Kaufmann, 2001.
[187] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas, "OmpSs: a proposal for programming heterogeneous multi-core architectures," Parallel Processing Letters, vol. 21, no. 2, pp. 173–193, 2011.