A Survey on Coarse-Grained Reconfigurable Architectures from a Performance Perspective
1st Artur Podobas
Center for Computational Science, RIKEN
Kobe, Japan
[email protected]

2nd Kentaro Sano
Center for Computational Science, RIKEN
Kobe, Japan
[email protected]

3rd Satoshi Matsuoka
Center for Computational Science, RIKEN
Kobe, Japan
[email protected]
Abstract—With the end of both Dennard's scaling and Moore's law, computer users and researchers are aggressively exploring alternative forms of compute in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability.
In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with particular focus on the premises behind the different CGRA architectures and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover existing knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and the evaluation of more complex applications.
Index Terms—Coarse-Grained Reconfigurable Architectures, CGRA, FPGA, Computing Trends, Reconfigurable systems
I. INTRODUCTION
With the end of Dennard's scaling [1] and the looming threat that even Moore's law [2] is about to end [3], computing is perhaps facing its most challenging moments. Today, computer researchers and practitioners are aggressively pursuing and exploring alternative forms of computing in order to try to fill the void that an end of Moore's law would leave behind. There is a plethora of emerging technologies with the promise of overcoming the limits of technology scaling, such as quantum- or neuromorphic-computing [4], [5]. However, not all
Post-Moore architectures are intrusive, and some merely require us to step away from the comforts that the von Neumann architecture offers. Among the more salient of these technologies are reconfigurable architectures [6]. Reconfigurable architectures are systems that attempt to retain some of the silicon plasticity that an ASIC solution usually throws away. These systems – at least conceptually – allow the silicon to be malleable and its functionality dynamically configurable. A reconfigurable system can for example mimic a processor architecture for some time (e.g. a RISC-V core [7]), and then be changed to mimic an LTE baseband station [8]. This property of reconfigurability is highly sought after, since it can mitigate the end of Moore's law to some extent – we do not need more transistors, we just need to spatially configure the silicon to match the computation in time.
Recently, a particular branch of reconfigurable architecture – the Field-Programmable Gate Array (FPGA) [9] – has experienced a surge of renewed interest for use in High-Performance Computing (HPC), and recent research has shown performance- or power-benefits for multiple applications [10]–[14]. At the same time, many of the limitations that FPGAs have, such as slow configuration times, long compilation times, and (comparably) low clock frequencies, remain unsolved. These limitations have been recognized for decades (e.g. [15]–[17]), and have been used to drive forth a different branch of reconfigurable architecture: the Coarse-Grained Reconfigurable Architecture (CGRA).
CGRAs trade some of the flexibility that FPGAs have to solve their limitations. A CGRA can operate at higher frequencies, can provide higher theoretical compute performance, and can drastically reduce compilation times. While CGRAs have traditionally been used in embedded systems (particularly for media-processing), lately, they too are considered for HPC. Even traditional FPGA vendors such as Xilinx [18] and Intel [19] are creating and/or investigating ways to coarsen their existing reconfigurable architectures to complement other forms of compute.
In this paper, we survey the literature on CGRAs, summarizing the different architectures and systems that have been introduced over time. We complement surveys written by our peers by focusing on understanding the trends in performance that CGRAs have been experiencing, providing insights into where the community is moving and any eventual gaps in knowledge that can/should be filled. The contributions of our work are as follows:
• A survey over three decades of Coarse-Grained Reconfigurable Architectures, summarizing existing architecture types and properties,
• A quantitative analysis of performance metrics of CGRA architectures as reported in their respective seminal papers, and
• An analysis of trends and observations regarding CGRAs with discussion.
The remaining paper is organized in the following way. Section II introduces the motivation behind CGRAs, as well as their generic design for the unfamiliar reader. Section III positions this survey against existing surveys on the topic. Section IV quantitatively summarizes each architecture that we reviewed, describing key characteristics and the premise behind each respective architecture. Section V analyzes the reviewed architectures from different perspectives (Sections VII, VIII, and VI), which we finally discuss at the end of the paper in Section IX.
II. INTRODUCTION TO CGRAS
Before summarizing the now three decades of Coarse-Grained Reconfigurable Architecture (CGRA) research, we start by describing the main aspirations and motivations behind them. To do so, we need to look at the CGRA's predecessor: the Field-Programmable Gate Array (FPGA). FPGAs are devices that were developed to reduce the cost of simulating and developing Application-Specific Integrated Circuits (ASICs). Because any bug/fault that was left undiscovered post ASIC tape-out would incur a (potentially) great economic loss, FPGAs were (and still are) crucial to digital design. In order for FPGAs to mimic any digital design, they are made to have a large degree of fine-grained reconfigurability. This fine-grained reconfigurability was achieved by building FPGAs to contain a large number of on-chip SRAM cells called Look-Up Tables (LUTs) [20]. Each LUT was interfaced by a few input wires (usually 4-6) and produced an output (and its complement) as a function of the SRAM content and the inputs. Hence, depending on the sought-after functionality to be simulated, LUTs could be configured and – through a highly reconfigurable interconnect – connected to each other to finally yield the expected design. The design would naturally run one to three orders of magnitude slower than the final standard-cell ASIC, but would nevertheless be an invaluable prototyping tool.
By the early 1990s, FPGAs had already found other uses (aside from digital development) within the telecommunication, military, and automobile industries—the FPGA was seen as a compute device in its own right, and there was some aspiration to use it for general-purpose computing, and not only in the niche market of prototyping digital designs. Despite this, several limitations of FPGAs were quickly identified that prohibited coverage of a wide range of applications. For example, unlike software compilation tools that take minutes to compile applications, the FPGA Electronic Design Automation (EDA) flow took significantly longer, often requiring hours or even days of compilation time. Similarly, if the expected application could not fit on a single device, the long reconfiguration overhead (the time it takes to program the FPGA) demotivated time-sharing or context-switching of its resources. Another limitation was that some important arithmetic operators did not map well to the FPGA; for example, a single integer multiplication could often consume a large fraction of the FPGA resources. Finally, FPGAs were relatively slow, running at a low clock frequency. Many of these challenges and limitations of applying FPGAs to general-purpose computing hold to this day.
Many early reconfigurable computing pioneers looked at the limitations of FPGAs and considered what would happen if one were to increase the granularity at which the device is programmed. By increasing the granularity, larger and more specialized units could be built, which would increase the performance (clock frequency) of the device. Also, since larger units require less configuration state, reconfiguring the device would be significantly faster, allowing fine-grained time-sharing (multiple contexts) of the device. Finally, by coarsening the units of reconfiguration, one could include those units that map poorly on FPGAs into the fabric (e.g. multiplications), making better use of the silicon and increasing the generality of the device.
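The granularity argument can be made concrete by counting configuration state: a K-input LUT needs 2^K SRAM bits to describe a single Boolean function of its inputs, whereas a coarse ALU slot needs only a short opcode. The following Python sketch is our own simplification for illustration and does not model any specific FPGA or CGRA:

def lut(truth_table, *inputs):
    # A K-input LUT: the 2^K-bit truth table IS the configuration;
    # the inputs merely form an address into it.
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | bit
    return truth_table[addr]

xor2 = [0, 1, 1, 0]                      # 4 configuration bits: a 2-input XOR
assert lut(xor2, 1, 0) == 1

# A 32-bit add built from LUTs needs roughly one 4-input LUT per result bit
# (hundreds of configuration bits plus carry logic); a coarse RC encodes the
# same operation in a handful of opcode bits.
lut_bits_per_32bit_add = 32 * (2 ** 4)   # about 512 bits of LUT state
alu_opcode_bits = 5                      # e.g. 32 distinct ALU operations
print(lut_bits_per_32bit_add, "vs", alu_opcode_bits)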
These new devices would later be called Coarse-Grained Reconfigurable Architectures (CGRAs). An example of what a CGRA looks like from the architecture perspective is shown in Figure 1. In Figure 1:a we see a mesh of reconfigurable cells (RCs) or processing elements (PEs), which are the smallest units of reconfiguration that perform work, and it is through this mesh that a user (or compiler) decides how data flows through the system. There are multiple ways of bringing data in/out to/from the fabric. One common way is to map the device into the memory of a host processor (memory-mapped) and have the host processor orchestrate the execution. A different way is to include (generic) address generators (AGs) that can be configured to access external memory using some pattern (often corresponding to the nested loops of the application) and push the data through the array. A third option is to have the reconfigurable cells do both the computation and the address generation. Figure 1:b illustrates the internals of an RC element, which includes an ALU (integer and/or floating-point capable), two multiplexers (MUXes), and a local static RAM (SRAM) used for storage. The two multiplexers decide which of the external inputs to operate on. The inputs are usually the output of adjacent RCs, the local SRAM scratchpad, a constant, or a previous output (e.g. for accumulations). The output of the ALU is similarly connected to adjacent RCs, the local SRAM, or back to one of the MUXes. Operation of the RC is governed by a configuration register, briefly shown in Figure 1:c. For simplicity, we show a single register that holds the state – however, in many architectures, each RC can hold multiple configurations that are cycled through over the application's lifetime. Each of the configurations can for example hold the computation for a particular basic block (where live-out variables are stored in SRAM) or a discrete kernel. Figure 1 illustrates what a majority of today's CGRAs look like, but at the same time there are multiple variations. For example, early CGRAs often included fine-grained reconfigurable elements (Look-Up Tables, LUTs) inside the fabric. While the mesh topology is by far the most common, some works chose a ring or linear-array topology. Finally, the flow-control of data in the network can be of varying complexity (e.g. token or tagged-token). We describe many of these in our summary in the sections that follow.

vertical and horizontal axes, and the second layer are four quadrants composed into a mesh. Unlike previous CGRAs, MorphoSys had a dedicated multiplier inside the ALUs. A CGRA based on MorphoSys was also realized in silicon nearly seven years after its inception [39]. While most of the CGRAs described so far used a mesh topology of interconnection (with some connectivity), other topologies have been considered. RaPiD [40], [41] is a CGRA that arranges its reconfigurable processing elements in a single dimension. Here, each processing element is composed of a number of primitive blocks, such as ALUs, multipliers, scratchpads, or registers. These primitive blocks are connected to each other through a number of local, segmented, tri-stated bus lines that can be configured to form a data-path – a so-called linear array. These processing elements can themselves be chained together to form the final CGRA. Interestingly, RaPiD can be partially reconfigured during execution in what the authors called "virtual execution".
RaPiD itself does not access data; instead, a number of generic address-pattern generators interface external memory and stream the data through the compute fabric. The KressArray [42]–[44] was one of the earliest CGRA designs to be created, and the project spanned nearly a decade with multiple versions and variants of the architecture. It features a hierarchical topology, where the lowest tier was composed of a mesh of processing elements. The processing elements interfaced neighbours, and also included predication signals (to map if-then-else primitives). Generic address generators supported the CGRA fabric by continuously streaming data to the architecture. Chimaera [45] is a co-processor conceptually similar to GARP, where an array of reconfigurable processing elements operating at a quite fine granularity (similar to modern FPGAs) can be reconfigured to perform a particular operation. It is closely coupled to the host processor, to the point where the register file is (in part) shadowed and shared. Mapping applications to the architecture was assisted by a "simple" C compiler, and they demonstrated performance on Mediabench [46] and Honeywell [47]. PipeRench [48] applied a novel network topology that was a hybrid between a mesh and a linear array. Here, a large number of linear arrays were layered, where each layer sent data uni-directionally to the next layer. Several future CGRAs would adopt this kind of structure, including data-flow machines (e.g. TARTAN) and loop-accelerators (e.g. FPCA). The layers themselves in PipeRench were fairly fine-grained and comparable to GARP, as they had reconfigurable Look-Up Tables rather than fixed-function ALUs within. PipeRench introduced a virtualization technique that treated each separate layer as a discrete accelerator, where a partial reconfiguration traveled alongside its associated data, reconfiguring the next layer according to its functionality in a pipelined fashion, which was new at the time. PipeRench was also later implemented in silicon [49]. The DREAM [50] architecture was explicitly designed to target (then) next-generation 3G networks, and argued that CGRAs are well suited for the upcoming standard with respect to software-defined radio and the flexibility to hot-fix bugs (through patches) and firmware. The system has a hierarchy of configuration managers and a mesh of simple, ALU-based RPEs operating on 16-bit operands and with limited support for complex operations such as multiplications (since operations were realized through Look-Up Tables). So far, all architectures described have computed using integer arithmetic. Imagine [51] is among the early architectures that included hardware floating-point arithmetic units. The architecture itself is similar to RaPiD—it is a linear array, where each processing element has a number of resources (scratchpads, ALUs, etc.), all connected using a global bus. Similar to RaPiD, the processing elements are passive, and external drivers are responsible for streaming data along the connected processing elements. The Imagine architecture had a prototype realized six years after its seminal paper [52].
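The generic address generators that feed RaPiD, KressArray, and Imagine can be pictured as small programmable loop nests replayed in hardware, so that the passive compute fabric never computes addresses itself. A minimal sketch of such a pattern generator (illustrative only; the parameter names are ours, not taken from any of the cited designs):

def address_generator(base, strides, counts):
    # Yield the addresses of a loop nest; strides/counts are ordered
    # outer-to-inner, mirroring the nested loops of the application.
    def walk(level, addr):
        if level == len(counts):
            yield addr
            return
        for i in range(counts[level]):
            yield from walk(level + 1, addr + i * strides[level])
    yield from walk(0, base)

# Row-major walk over a 4x8 tile starting at address 0x1000, row pitch 64.
addrs = list(address_generator(0x1000, strides=[64, 4], counts=[4, 8]))
print(len(addrs), hex(addrs[0]), hex(addrs[-1]))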
1) Modern Coarse-Grained Reconfigurable Architectures:
Most modern CGRA architectures' lineage can be linked back to those described in the previous section, and a majority of these architectures follow the generic template described there. However, while the overall template remains similar, many recent architectures specialize towards a certain niche use (low-power, Deep-Learning, GPU-like programmability, etc.). The ADRES CGRA system [53], [54] has been a remarkably successful architecture template for embedded architectures, and is still widely used. ADRES – like many previous and future CGRAs – consists of a mesh of processing elements where each element has neighbor (or Manhattan distance-2) connectivity. Inside each element we find an ALU of varying capability and a register file, alongside the multiplexers configured to bring in data from neighbours. The first row of the mesh, however, is unique, as it only contains the ALU (and no scratchpad/RF to store state). Instead, an optional processor can extend its pipeline to interface that very first row in a Very Long Instruction Word (VLIW) [55] fashion. ADRES, by design, is thus heterogeneous. ADRES comes with a compiler called DRESC [56]. ADRES as an architecture has been (and still is) a popular platform for CGRA research, such as when exploring multi-threaded CGRA support [57], topologies [58], asynchronous further-than-neighbor communication (e.g. HyCube [59]), or CGRA design frameworks/generators (e.g. CGRA-ME [60], [61]). Furthermore, ADRES has been taped out in silicon, for example in the Samsung Reconfigurable Processor (SRP) and the follow-up UL-SRP [62] architecture. The Dynamically Reconfigurable ALU Array (DRAA) [63] is a generic CGRA template proposed in 2003 to encourage compilation research on CGRA architectures. Architecture-wise, DRAA allows changing many of the parameters that define a CGRA, such as the data-path width, the interconnections, the size of the register file, etc. Preceding both DySER and ADRES, DRAA as a template has been used to e.g. study the memory hierarchy of CGRAs [64]. The TRIPS/EDGE [65], [66] microarchitecture was a long-running influential project that attempted to move away from the traditional approach of exploiting instruction-level parallelism in modern processors. The premise behind TRIPS was that as technology reduced the size of transistors, wire-delays and paths would dominate latency, and that it would be hard to scale the (often long) communication wires of traditional super-scalar processors [67]. Instead, by tightly coupling functional units in (for example) a mesh, direct neighbor communication could easily be scaled. In effect, TRIPS/EDGE replaced the traditional super-scalar Out-of-Order pipeline with a large CGRA array: single instructions were no longer scheduled, but instead a new compiler [68], [69] was developed that scheduled entire blocks ("CGRA configurations") temporally on the processor, allowing up to 16 instructions to be executed at a single time (and many more in-flight). The TRIPS architecture was taped out in silicon [70], [71] and – despite being discontinued – represented a milestone of true high-performance computing with CGRAs. An interesting observation, albeit not necessarily related to CGRAs, is that the EDGE ISA has received renewed interest as an alternative to express large amounts of ILP in FPGA soft processors [72].
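A recurring compiler problem behind ADRES-style templates (DRESC) and TRIPS-style block mapping is placing a dataflow graph onto a grid of PEs so that communicating operations end up close together. The toy greedy placement below is only meant to convey the flavor of that problem; production mappers such as DRESC combine placement with modulo scheduling and far more elaborate search:

import itertools

def place(ops, deps, rows, cols):
    # Greedily place each operation on the free cell that minimizes the
    # total Manhattan distance to its already-placed operands.
    free = set(itertools.product(range(rows), range(cols)))
    placement = {}
    for op in ops:  # ops must be given in topological order
        anchors = [placement[d] for d in deps.get(op, []) if d in placement]
        def cost(cell):
            return sum(abs(cell[0] - a[0]) + abs(cell[1] - a[1]) for a in anchors)
        best = min(sorted(free), key=cost)
        placement[op] = best
        free.remove(best)
    return placement

# (a*b) + (c*d) on a 2x2 mesh: the add is placed as close as possible
# to both of its multiplies.
print(place(["mul1", "mul2", "add"], {"add": ["mul1", "mul2"]}, 2, 2))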
The DySER [73] architecture integrates a CGRA into the backend of a processor's pipeline to complement (unlike e.g. TRIPS, which replaces) the functionality of the traditional (super-)scalar pipeline, and has been integrated in the OpenSPARC [74] platform [75]. The key premise behind DySER is that there are many local hot regions in program code, and higher performance can be obtained by specializing in accelerating these inside the CPU. DySER was evaluated using both a simulator (m5 [76]) and an FPGA implementation on well-known benchmarks (PARSEC [77] and SPECint) and compared with both CPU and GPU approaches, showing between 1.5x-15x improvements over SSE and comparable flexibility and performance to GPUs. Recently (2016 onwards), DySER has been the focus of much of the FPGA-overlay scene (see Section IV-F). Other work similar to DySER, integrating CGRA-like structures into processing cores with various goals, includes CReAMS/HARTMP [78], [79] (applying dynamic binary translation) or CGRA-sharing [80] (conceptually similar to what the AMD Bulldozer architecture [81] and the UltraSPARC T1/T2 did with their floating-point units). AMIDAR [82] is another long-running exciting project that (amongst others) uses a CGRA to accelerate performance-critical sections. The AMIDAR CGRA extends the traditional CGRA PE architecture with a direct interface to memory (through DMA). There is support for multiple contexts and hardware support for branching (through dedicated condition-boxes operating on predication signals), which also allows speculation. The AMIDAR CGRA has been implemented and verified on an FPGA platform, and early results show that it can reach over 1 GHz of clock frequency when mapped to a 45 nm technology. The MORA [83] architecture is a platform for CGRA-related research. MORA targets media-processing, and hence provides an 8-bit architecture with processing elements covering the most commonly used operations. MORA itself is similar to the earlier MATRIX, with a simple 2D mesh structure with neighbour communication. Each processing element has a scratchpad (256 bytes large). MORA is programmable using a domain-specific language developed over C++ [84]. CGRA Express [85] is yet another architecture that follows the concept of a mesh of simple, ALU-like structures. The premise and motivation for the work is that most existing CGRA applications are optimized for maximal graph coverage rather than sequential speed. The hypothesis is that – depending on the operators each PE is configured to use – they can exploit the resulting positive clock slack of the operators and cascade (fuse) more operations per clock cycle rather than blindly registering the intermediate output. This, in turn, allows them to execute more instructions per cycle (or reduce the frequency) with little performance loss. In their architecture, they add an extra bypass network that can be configured not to be pipelined. They show both power and performance benefits on multimedia benchmarks with and without their approach. The work can conceptually be seen as the opposite of what modern FPGAs (e.g. Stratix 10) do with HyperFlex [86], but for CGRAs. The Polymorphic Pipeline Array (PPA) [87] performed an interesting pilot study that drove the parameters of their CGRA: they simulated a large number of benchmarks scheduled on a hypothetical (infinite) CGRA, with focus on modulo-scheduling and loop unrolling.
They revealed that even with infinitely large CGRAs, the performance will be bound as a function of the instruction-level parallelism in the loops and the limitations of modulo-scheduling, and they argue that there is a definitive need to include other forms of parallelism to scale on CGRAs. While the PEs themselves follow a standard layout, they propose an interesting technique that allows multiple (unique) kernels to be executed concurrently on the CGRA, where the kernels communicate with each other either through DMA or shared memory. Kernels can also be resized to fully exploit the CGRA array. The premise behind SIMD-RA [88] is similar to that of PPA: CGRAs rely too much on instruction-level parallelism, and opportunities from other forms of parallelism are lost. SIMD-RA focuses on embedding support to modularize the CGRA array into multiple discretely controllable regions that (may) operate in SIMD fashion. They found that using SIMD not only yielded better performance, but was also more area-efficient compared to only using software-pipelining. SmartCell [89] is a CGRA that aspires to be low-power with high performance, supporting both SIMD- and MIMD-type parallelism. The architecture is effectively a 2D mesh, but with the mesh divided into 2x2 quadrants of processing elements. These 2x2 islands share a reconfigurable router, and inter-quadrant communication is limited to the connectivity of these routers. The processing elements themselves are fairly standard and contain an instruction memory whose instruction (configuration) is set either per processing element (MIMD) or sequenced globally (SIMD). BilRC [90] is a heterogeneous mesh composed of three different blocks: generic ALU blocks, multiplication/shifter nodes, and memory blocks, following the (by now) traditional recipe of a CGRA. Unique to BilRC is that the architecture explicitly exposes the triggering of instructions, allowing the programmer and/or application fine-grained control over the amount of parallelism and when instructions are triggered. The lack of floating-point support in CGRAs has also been a research driving force. FloRA [91] is a 16-bit, IEEE-754 floating-point capable CGRA. The architecture itself is composed of 64 RCs, and each RC is fairly standard and does not include a dedicated floating-point core itself; instead, two RCs can be combined to enable single-precision (32-bit) floating-point support, where the mantissa- and exponent-computations are distributed among the pair. Feng et al. [92] introduce a floating-point capable architecture designed specifically for radar signal processing. Despite the familiar mesh-based interconnection, the design deviates from the traditional approach since its processing elements are fairly diverse and heterogeneous. The CGRA itself was taped out in silicon and could reach up to 70 GFLOP/s of floating-point performance. A PRET-driven (Precision Timed) CGRA aimed towards predictable real-time processing was developed by Siqueira et al. [93]. Interestingly, the CGRA has support for threads, which is a concept used more in High-Performance than Real-Time computing. The architecture resembles an SMT (Simultaneous Multi-Threading) architecture, where each processing element has a duplicated set of resources (primarily the register files) that are unique to each thread.
Aside from having deterministic timing inside the CGRA, the authors also implement predictable external memory access timing, required for real-time systems. The recent SPU [94] architecture aspires to provide a CGRA for general-purpose computing. The main novelty is that SPU extends existing CGRA designs with support for two types of computational patterns: what they call "stream-joins" (e.g. the inner product in sparse vector multiplication) and alias-free scatter/gather (regular loops with indirection). This is achieved by extending the typical CGRA with options to conditionally consume input tokens (re-use values), reset accumulators, or conditionally discard output tokens. Address-generation units (linear and non-linear) reside inside the on-chip SRAM controllers. The SPU targets general-purpose workloads with some favor towards deep-learning applications. The premise of Soft-Brain [95] is to combine both vector-level (for regular, efficient memory access) and data-flow (for parallelism) computation in CGRAs to reach high performance and power-efficiency. The architecture consists of a number of stream-engines (essentially the address-generators of prior work) and the CGRA substrate itself. The input to the CGRA substrate is a number of vector-ports, which carry the vector data itself (512-bit) as fetched from memory, the on-chip scratchpad, or fed back from the output of the CGRA; a stream-controller orchestrates the execution of the system. The Chameleon [96] CGRA is an early commercial CGRA that focused on competing with early DSPs and FPGAs. Here the CGRA is layered, where each layer is called a slice. Each slice has three tiles, where each tile has a number of scratchpad memories that interface with eight reconfigurable processing elements. The idea is to load the local scratchpad with data, configure the processing elements associated with the scratchpad with some functionality, and then pipe the data through and onto other slices. The Chameleon was implemented in a 0.25 um process running at a 125 MHz clock frequency. The architecture itself operates on a 32-bit data-path width, but can be configured to divide the data-stream into two 16-bit or four 8-bit streams as well. SiLAGO [97] is a methodology for creating CGRA-based platforms. The premise behind the method is to use reconfigurable CGRA processing elements (based on DRRA [98]) as building-blocks for future systems in order to reduce production cost with little impact on performance (compared to hand-made ASICs). Platforms based on SiLAGO and DRRA include, amongst others, specialized architectures for Deep-Learning [99], brain-simulation computing [100], and genome identification [101]. The Q100 [102] is similar in concept but specializes in database computing, providing tiles for computing on data-flow streams from which users can assemble larger systems.

B. Larger CGRAs
Most CGRA systems (e.g. ADRES, TRIPS, DySER) limit the size of the array to around (or less than) 64 processing elements, and only a few of the so-far mentioned CGRAs are larger than that (PipeRench has 256 PEs, GARP has 768), but those are relatively fine-grained. The likely reason for the limited size of CGRAs was their application domain, which mostly involved image-, audio-, or telecommunication applications. However, in recent years, even larger CGRA-based systems targeting e.g. High-Performance Computing applications have started to emerge. The eXtreme Processing Platform (XPP) [103] was a CGRA that focused on multiple levels of parallelism, including pipeline processing, data-flow computing, and task-level execution. XPP's interconnection was deep and complex, consisting of multiple levels of various functionality. At the lowest tier, small processing elements containing a scratchpad, an ALU, and an associated configuration manager reside in mesh-like connectivity called a cluster. These clusters are themselves connected through switch-boxes running along their vertical and horizontal axes. Each tier has a configuration manager that is responsible for all functionality of that layer (and below), allowing fine-grained control and partitioning of the functionality of the system. XPP was token-driven, and execution of an operation occurs only when data is present at its inputs. The High-Performance Reconfigurable Processor (HiPReP) [104] is an on-going CGRA research platform capable of floating-point computations. HiPReP has dedicated floating-point circuitry (unlike e.g. FloRA). Processing elements are organized in a mesh with a heterogeneous (in terms of

multiple contexts). A number of scratchpad memories sit at the fringes of each tile and are used to store streamed data. Controlling the operation of the processing elements is done through an instruction pointer that is governed by hierarchical sequencers (one per tile and one global). The sequencer – effectively a programmable FSM – dictates which context is being executed, and can (re)act on signals from the tiles themselves. DRP was commercially taped out in the DRP-1 prototype by NEC. The commercial DAPDNA-2 [120] processor produced by IPFlex contains up to 376 32-bit processing elements, organized as 8x8 PE quadrants in a mesh. The architecture is heterogeneous, with discrete tiles supporting ALU operations, scratchpads, programmable delay lines, simple address-generation (counters), or I/O buffers. The processing elements contain both multiplication and arithmetic units and also support optional pre-processing of inputs through rotation/masking units. The tiles are interconnected using horizontal and vertical busses that run in-between and through the mesh, and crossing the quadrants can only be done at border tile locations. Performance on selected applications (FIR, FFTs, image processing) shows two orders of magnitude better performance over the then state-of-the-art Pentium 4 processor. The 167-processor architecture [121] borders on both CGRAs and conventional multi-core processors, but we include it here since the processing elements are simple and communication between them is only performed using direct (yet dynamically configured) connectivity (and not through shared memory or cache coherence as done in multi-cores). The main focus behind this work is low power, and through a series of advanced low-level optimizations (DVFS, clock generation and distribution, GALS [122], etc.) they show performance of up to 196.7 GMACs/Watt when fabricated in 65 nm technology.
Other similar architectures, based on programmable cores with limited connectivity, are the IMAPCAR [123]/IMAP-CE [124] CGRAs from NEC, aimed towards image recognition in automobiles. The Rhyme-Redefine [125], [126] architecture is a CGRA targeting High-Performance Computing (HPC) kernels. It follows a fairly typical CGRA design, where processing elements are connected through a torus network. The premise of the work is that there is a need to exploit multiple levels of parallelism (instruction-, loop-, and task-level parallelism), albeit the current implementation focuses primarily on instruction-level parallelism through modulo-scheduling. Rhyme-Redefine supports floating-point computations. Plasticine [127] is a recent, large CGRA that focuses on parallel patterns. At the highest abstraction layer, it is built of a mesh of units. There are two types of units, compute and memory units, both of which are programmable with patterns. Inside the compute units we find the raw functional units (the ALUs) as well as programmable state for controlling them. The compute units are built with both SISD- and SIMD-type parallelism in mind, and vector operations map natively to these units. Similarly, inside the memory units, we find a small set of ALUs coupled with programmable logic to interface the SRAM local to the units. The mesh itself interfaces external memory through a set of address generators and coalescing units. More importantly, Plasticine targets floating-point intensive applications, which is also shown in their evaluation (only three out of 13 applications are integer-only). Plasticine is programmable using Spatial [128] – a custom language based on patterns for data-flow computing. Recently, the Cerebras Wafer Scale Engine [129] has been created explicitly for high-end deep-learning training. Little information is publicly available, but the architecture seems to be a hybrid solution between general-purpose processing cores and specialized tiles for tensor computations, which could make it the single largest CGRA architecture to date, with a size of over 46,225 mm².

C. Linear-Arrays and Loop-Accelerators
VEAL [130] is a linear array that explicitly targets accelerating small, compute-intensive loop-bodies. Similar to the before-mentioned PPA (Polymorphic Pipeline Array), the authors behind VEAL performed a rigid empirical evaluation of the benchmarks they target, and demonstrate that one of the main limitations to the performance of mapping said benchmarks to CGRA fabrics is not the number of resources, but actually the amount of instruction-level parallelism extracted by modulo-scheduling. VEAL is a linear array fed by a number of custom address-generators, which broadly correspond to the induction variables of the loops that are executed. An interesting observation is that VEAL is among the few CGRA works that use double-precision arithmetic. Another loop-accelerator similar to VEAL is FPCA [131]. The BERET [132] architecture is yet another linear array, designed to accelerate hot traces found by the hosting general-purpose processor. One of BERET's main contributions was the identification of a small set of subgraphs that the processing elements should cover (called SEBs); the set was empirically extracted from the benchmarks and has since then been used in other studies (e.g. SEED [133], which is similar but improved in concept).
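The modulo-scheduling ceiling that both PPA and VEAL identify is captured by the textbook initiation-interval bound II >= max(ResMII, RecMII): past a certain point, adding PEs lowers the resource bound but leaves the recurrence bound untouched. A minimal sketch of that bound (standard formulation, not any one paper's notation):

from math import ceil

def min_initiation_interval(op_count, num_pes, recurrences):
    # Classic lower bound for modulo scheduling: II >= max(ResMII, RecMII).
    # op_count:    operations in the loop body
    # num_pes:     PEs available (assume each issues one op per cycle)
    # recurrences: list of (latency_sum, distance) per dependence cycle
    res_mii = ceil(op_count / num_pes)                    # resource bound
    rec_mii = max((ceil(lat / dist) for lat, dist in recurrences),
                  default=1)                              # recurrence bound
    return max(res_mii, rec_mii)

# 20 ops on 64 PEs, but a 4-cycle accumulation recurrence (distance 1):
# II is pinned at 4 by the recurrence; adding more PEs would not help.
print(min_initiation_interval(20, 64, [(4, 1)]))  # -> 4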
D. Deep-Learning CGRAs
Deep-Learning [134], in particular the computationally regular Convolutional Neural Networks (CNNs), has lately become a target for specialized CGRAs. Here the focus is to limit the generality and reconfigurability of the traditional CGRA to fit the computational patterns of CNNs, and instead spend the gained logic on supporting specialized operations for the intended deep-learning workloads (such as compression, multicasting, etc.). Furthermore, these architectures often honor smaller (or mixed) number representations, since deep-learning is often amenable to lower-precision calculations [135]. The DT-CGRA [136], [137] architecture follows a CGRA design with fairly coarse processing elements that include up to three multiply-accumulate instructions. The processing elements also include programmable delay lines to more easily map temporally close data. Data inside the PEs is synchronized through tokens, using FIFO empty/full signals as a proxy. To support the different access patterns that modern deep-learning layers have (stride, type, etc.), the CGRA mesh is driven by a number of stream-buffer units that are programmable in a VLIW fashion and that control the address-generation to external memory to stream data. The Sparse CNN (SCNN) [138] is a deep-learning architecture that targets (primarily) CNNs and can exploit sparseness in both activations and kernel weights. The architecture itself is composed of a mesh of RCs, where each element also includes a 4x4 multiplier array and a bank of accumulation registers. These RCs are driven and orchestrated by a layer sequencer, which fetches and broadcasts (compressed) weights and activations. SCNN supports inter-PE parallelism in the form of spatial blocking/tiling, where each block is artificially enlarged with a halo region, which is exchanged between adjacent tiles at the end of the computation. They also implement a dense version (DCNN) of the architecture in order to measure the area overhead and the power- and performance-gains of including sparsity in the accelerator. Liang et al. [139] introduce a CGRA accelerator that targets reinforcement learning. The processing elements themselves are fairly static, with support for addition, multiplication, or a fusion of both. Additionally, a number of different activation functions (ReLU, sigmoid, and tanh) can be selected using the configuration register, and data can be temporarily stored in a local scratchpad. Unlike most existing CGRAs that place address-generators in discrete units outside the RCs, Liang et al.'s RCs include them inside. Global communication lines allow the user to control the reinforcement training of the system. The Eyeriss deep-learning inference engines [140], [141] follow a CGRA design methodology as well, albeit they specialize in re-configuring the network access-patterns rather than the compute (which mostly is based on multiply-accumulate operations). The CGRA itself is a mesh with a variety of options for point-to-point and broadcast operations, highly suitable for deep-learning convolution patterns. Additionally, the platform supports compression of data and exploits sparseness of intermediate activations to increase the observed bandwidth. The Eyeriss architecture – depending on the type of neural network used – can utilize nearly 100% of the CGRA resources when inferring AlexNet. One of the most recent (and perhaps radical) changes to FPGAs is coming in the form of support for Deep-Learning CGRAs.
The Xilinx Versal [142], [143] series devotes a large part of the silicon to a mesh-like CGRA structure of programmable, neighbor-communicating processing elements. The elements themselves are fairly general-purpose, but are marketed as targeting deep-learning and telecommunication applications. To remedy the eventuality of the AI engine missing crucial deep-learning functionality that is yet to come, the AI engine can directly interface the remaining parts of the reconfigurable (FPGA) silicon, which is in the form of the fine-grained reconfigurable cells that Xilinx is known for. The system itself is an attempt to combine the best of both the fine-grained and coarse-grained reconfigurable worlds.
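As a concrete illustration of the sparsity these deep-learning CGRAs exploit, the sketch below performs a dot-product over compressed (index, value) operands so that zeros are never stored, fetched, or multiplied; the same index-matching pattern is essentially the "stream-join" that SPU supports in hardware. This is our own simplification, not SCNN's actual Cartesian-product dataflow:

def sparse_dot(weights, activations):
    # weights/activations: lists of (index, value) pairs, sorted by index.
    # Work scales with the number of nonzeros, not with the tensor size.
    acc, i, j = 0, 0, 0
    while i < len(weights) and j < len(activations):
        wi, wv = weights[i]
        aj, av = activations[j]
        if wi == aj:            # indices match: a useful multiply
            acc += wv * av
            i, j = i + 1, j + 1
        elif wi < aj:
            i += 1
        else:
            j += 1
    return acc

w = [(0, 2.0), (5, -1.0), (9, 4.0)]   # 3 nonzeros of a length-10 vector
a = [(5, 3.0), (9, 0.5)]
print(sparse_dot(w, a))               # -> -1.0 (only 2 multiplies issued)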
E. Low-Power CGRAs
CGRAs have also been shown to be competitive in terms of power-consumption, particularly when compared to existing (low-power) processors and DSP engines. The CGRAs in this domain follow the same concepts as earlier CGRA designs, but focus on both technology and architecture improvements to reduce the static and/or dynamic power of the fabric. These CGRAs tend to reduce the frequency and voltage as much as possible. Since the dynamic power-consumption of a system is a function of both frequency and voltage (P_dynamic = C · V² · f_clk), reducing frequency, and with it the voltage, can have a dramatic effect on power-consumption. Several CGRAs in this area operate at near-MHz levels, and some even remove the clock altogether. The Cool Mega Array [144], [145] (CMA-1 and CMA-2) architecture builds on the following two premises: (i) the clock (clock-tree, flip-flops, state, etc.) is a culprit behind much of the consumed power on modern chips, and (ii) applications have adequate parallelism to freely trade silicon for performance where needed. The CMA-1 was a typical CGRA mesh architecture, but created without a single clock. The architecture focuses on stream-computing, where a processor presents inputs to the CGRA that – in due time – are computed using the clock-less fabric. The architecture (and its follow-up, CMA-2) is power-efficient, and experiments on taped-out versions showed that the leakage of the chip could be as low as 1 mW. The CMA architecture manages to reach up to 89.28 GOPS/Watt using a 24-bit data-path. The CMA architecture continues to be researched, and recent work has focused on improving performance (through variable-latency pipelines in VPCMA [146]) or further reducing power-consumption through body-biasing. The SYSCORE [147] architecture is another similar architecture that focuses on low power-consumption, but leverages dynamic scaling of both voltage and frequency (DVFS) for power-benefits, and uses a fixed-point (and not floating-point) number representation. As with CMA-1/2, it has a 24-bit datapath with a standard mesh-like arrangement of CGRA tiles. Lopes et al. [148] evaluated a standard mesh-like CGRA for use in real-time bio-signal processing. The CGRA they constructed had the additional benefit of being able to power-down sections of the CGRA when unused, to further extend battery-life. Another bio-medical CGRA, introduced by Duch et al. [149], uses a mesh-like composition and a MHz-range clock-frequency to accelerate electrocardiogram (ECG) analysis kernels. The Samsung UL-SRP was designed for bio-medical applications. The UL-SRP [62] is based on ADRES, and featured a hybrid high-power/high-performance mode and a low-power/low-performance mode covering different needs and use-cases. The PULP [150] cluster system features a 16-RC mesh to improve performance and energy-consumption for near-sensor computing.

F. CGRA Overlays on FPGAs

Not strictly a CGRA, ZUMA [159] is an early effort to virtualize the fine-grained resources of an FPGA using a "virtual FPGA", for reasons of portability, compatibility, and FPGA-like reconfigurability inside of FPGA designs. Similar to a real FPGA, ZUMA discretizes the FPGA into logic clusters that contain a crossbar and K-input Look-Up Tables with an optional flip-flop capturing the output. Each cluster is connected to a switch-box that can be programmed to route the data around. The area cost of using the virtual FPGA can be as low as 40% more than the barebone FPGA, demonstrating its feasibility.
Other (even earlier) work was FIRM-core [160], and some more recent efforts include the vFPGA [161]. Intermediate Fabrics (IF) [154] coarsen the FPGA logic by creating a mesh of computational elements of varying sizes, such as for example multipliers and square-root functions; small connectivity boxes (routers) govern the traffic throughout the data-path. IF was evaluated on image processing (stencil) kernels, and overall showed an on-average 17% drop in clock frequency against a gain of 700x in compilation time over using the FPGA alone. The MIN Overlay architecture [162] approaches the CGRA design differently; it uses a one-dimensional strip of processing elements whose outputs are connected to each other through an all-to-all interconnect. Hence, data-flow graphs are spliced and fit onto the linear array, and different parts of the graph are scheduled in time on the array and the interconnect. Different combinations and compositions of the processing elements were evaluated, and the clock frequency, for the most part, ran at 100 MHz, competitive with soft processor cores at the time. Other, arguably less configurable, overlays follow a similar one-dimensional strip design, such as the VectorBlox MXP Matrix Processor [163]. The FPCA loop-accelerator described earlier was also prototyped on FPGAs. The READY [164] architecture extends the linear-array concept further by also having multiple threads running on the overlay. An example of a layered CGRA overlay for FPGAs is the VDR architecture [165]. Here, computational resources are laid out in one-dimensional strips where each strip is fully-connected to downstream units. Links are unidirectional, and a synchronization protocol guides data throughout the data-path. The VDR architecture runs at a clock frequency of 172 MHz, and was shown to be between 3 and 8 times faster compared to the NiosII processor [166] (a well-known soft-core used in FPGA design). Another architecture similar to VDR is the RALU [167]. A flurry of innovative overlays was introduced from 2012 onwards, all centered around the modern FPGA's Digital Signal Processing (DSP) block. The DSP blocks were originally included to allow the use of expensive operations that do not necessarily map well to FPGAs (e.g. multipliers). DSP blocks have since then continuously evolved to include more diverse (various-size multiplication, accumulation, etc.) or more complex (e.g. single-precision arithmetic [153]) functionality. Some of the vendors (Xilinx) directly exposed the interface to control the different functionalities of the DSP blocks to the FPGA fabric, and it was not long before the idea arose to base CGRA architectures around said DSP blocks. ReMorph [168] was one of the early architectures to adopt this style of reasoning. Several different architectures have been explored around the concept of DSPs, including various topologies (e.g. trees [169]) or adaptations of existing architectures (e.g. DySER using DSP blocks [170]). The strengths of these architectures lie in their near-native performance, where small overlays built around DSPs can run at 390 MHz (or higher). Quickdough [171], [172] is a design framework for using CGRA overlays on FPGAs, specifically targeting them to assist a CPU in accelerating compute-intensive program code. The overlay itself follows the standard layout with a mesh of processing elements, each containing a small instruction memory that sequences the ALU within the processing element. The mesh can interface external memory by enqueuing requests to an address unit.
Unique to the architecture is that the two parts (the address unit and the PE mesh) run at two distinctly different frequencies. Most FPGA overlays presented so far focus uniquely on integer computation, likely because most FPGA overlay work targets Xilinx devices, whose DSP units do not contain hardened floating-point operations. The Mesh-of-ALUs [173] is an exception that targets both integer and floating-point computation. The architecture is similar to other mesh-based approaches, but the work demonstrates high (at the time) performance capabilities of FPGAs also in floating-point operations, reaching nearly 20 GFLOP/s on a Stratix IV [174] device. Using floating-point processing elements seems to incur a 33% area overhead, yielding a smaller CGRA mesh, and also an (arguably negligible) 13% reduction in clock frequency. A different overlay architecture that targets floating-point operations is the TILT array [175], [176]. The TILT array architecture is conceptually very similar to the MIN overlay. A linear array of processing elements is arranged to communicate with an all-to-all crossbar, which saves state into an on-chip RAM, relaying information to the computation in the next cycle. The authors illustrated the benefits of TILT over High-Level Synthesis (OpenCL) with both comparable performance and improved productivity, reaching operating frequencies of up to 387 MHz on a Stratix V [177]. The URUK [178] architecture takes a different approach on how the ALUs inside an overlay should be implemented. Rather than having a fixed function, URUK leverages partial reconfiguration [179], changing the RCs' functionality throughout time. Finally, tools for automatically creating CGRA overlays for FPGAs are emerging, such as the Rapid Overlay Builder [180] and CGRA-ME [60], which simplify the generation (and, in the case of CGRA-ME, also the compilation) of applications and overlays. An interesting observation is that out of the 14 unique FPGA architectures that we surveyed here, 9 chose Xilinx as the target platform while 5 focused on Intel (then Altera) FPGAs. There seems to be a favoring of Xilinx architectures, which we believe is due to the more dynamic control that Xilinx offers in their DSP blocks compared to Intel. On the other side, Intel DSPs have (starting from Arria 10 onwards) hardened support for IEEE-754 single-precision floating-point operations, encouraging floating-point-heavy architectures to use those systems.
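The time-multiplexed execution style shared by the MIN overlay, TILT, and VDR, where a dataflow graph is spliced across time steps on a fixed-width strip, can be pictured as a simple list scheduler. The sketch below is illustrative only and ignores interconnect and register-file constraints:

def schedule(deps, width):
    # deps: dict op -> set of operand ops. Each time step may issue at most
    # `width` ready operations; results reach any PE in the next step via
    # the all-to-all interconnect.
    remaining = set(deps)
    done, steps = set(), []
    while remaining:
        ready = [op for op in remaining if deps[op] <= done]
        step = sorted(ready)[:width]
        steps.append(step)
        done |= set(step)
        remaining -= set(step)
    return steps

# ((a*b) + (c*d)) * e on a width-2 strip takes three time steps.
g = {"m1": set(), "m2": set(), "add": {"m1", "m2"}, "m3": {"add"}}
print(schedule(g, 2))  # -> [['m1', 'm2'], ['add'], ['m3']]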
V. CGRA TRENDS AND CHARACTERISTICS
A. Method and Materials
For all previously surveyed and summarized work, we collected several metrics associated with each study. These were:
1) Year of publication,
2) Size of the CGRA array in terms of unique RCs,
3) Data-path width of the CGRA (e.g. MATRIX operates on 4-bit while RaPiD operates on 16-bit),
4) Clock frequency of operation (f_max) in MHz as reported in the study,
5) Power consumption in Watt. For studies that empirically measured this metric, we collected the (benchmark, power) tuple. Otherwise we used what is reported in the study (often the post place-and-route power estimation),
6) Technology (in nm) of the architecture, either when taped out in silicon or of the standard cell library used with the EDA tools,
7) Area (mm²) of the fully synthesized chip as reported in the study. In some cases we had to manually calculate it based on the individual RC size or based on the gates used (after verification with the authors). For FPGAs we used the chip (BGA) package size and assumed a chip-to-die ratio of 7:1, as has been reported in [181],
8) Peak performance, including peak operations-per-second (OP/s), peak multiply-accumulates-per-second (MAC/s), and peak Floating-Point Operations-per-second (FLOP/s), as reported in the paper. We differentiate between integer MAC/s and OP/s because some architectures (e.g. EGRA) do not balance them, leading to a large theoretical OP/s but not a proportionally large MAC/s,
9) Obtained performance out of the theoretical peak (%). We used what the authors reported. For those cases where the authors did not report obtained performance (e.g. only reported absolute time), we calculated this metric manually where applicable, such as when the authors report both the input dimensions and the execution time (in seconds or cycles) of known applications such as (non-Strassen) matrix-multiplication, FIR-filters, matrix-vector multiplication, etc. (a small worked sketch of this derivation follows below).
For items 8-9 we ignored studies that showed relative performance improvements, as it is hard to reason about the performance of a baseline unless explicitly stated. All metrics included have either been directly reported in the seminal publication, been verified by the authors, or we were confident enough in our understanding of the architecture to derive them ourselves. We position and relate our obtained CGRA characteristics against those of modern GPUs. We used NVIDIA GPUs as references, with data collected from [182] and integer performance calculated using methods described in [183].
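As an illustration of how item 9 was derived when a study reported only absolute runtimes, the sketch below converts a (problem size, runtime) pair for a non-Strassen matrix-multiplication into a fraction of theoretical peak. All numeric values are hypothetical placeholders, not data from any surveyed paper:

def matmul_flops(n):
    # Non-Strassen n x n matrix-multiply: n^3 multiplies + n^3 adds.
    return 2 * n ** 3

def obtained_fraction(n, runtime_s, peak_flops):
    achieved = matmul_flops(n) / runtime_s   # FLOP/s actually delivered
    return achieved / peak_flops             # fraction of theoretical peak

# Hypothetical CGRA: 512x512 matmul in 4.2 ms against a 128 GFLOP/s peak.
print(f"{obtained_fraction(512, 4.2e-3, 128e9):.1%}")  # -> 49.9%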
VI. OVERALL ARCHITECTURAL TRENDS
We start by analyzing data that is associated with time, and how the CGRAs have grown as a function of time. Figure 5 overviews how CGRAs have changed through time with respect to various metrics. The total number of RCs, as a function of the respective publication year, is shown in Figure 5:a. We see that a majority of CGRAs are quite small (median: 64 RCs), and FPGA-based CGRAs are even smaller (median: 25 RCs). This is in line with the reasoning that most CGRAs focus on small kernels in the embedded application domain, honoring ILP rather than other forms of parallelism (e.g. thread- or task-level). There are several exceptions to this, such as GARP, which was an early CGRA that used 1/2-bit reconfigurable data-paths and thus needed a large number of RCs to implement various functionality. The other exception is TARTAN, where the authors' largest evaluated version has up to 25,600 RCs, making it likely the largest CGRA ever simulated; this awe-inspiring size was reached by severely restricting the functionality of the RCs (e.g. there is no multiplication support). Thirdly, the Plasticine architecture can have up to 6,208 RCs of varying sorts. Figure 5:b shows the transistor scaling of CGRAs and NVIDIA GPUs. As expected, the transistor dimensions have continuously grown smaller, as predicted by Moore. Note however that both FPGAs and GPUs are (on average) one transistor generation ahead of CGRAs, likely due to most CGRAs being developed in academia and thus being restricted to the standard cell libraries available at the time (which usually are not the most recent). Figure 5:c shows the area of the CGRAs as reported either by the ASIC synthesis tools, estimated by the authors, or from the final taped-out chip. We also include the full die sizes of the FPGAs (that FPGA-based CGRAs have access to), and we position these against the die sizes of modern NVIDIA GPUs. We can see that the trend of CGRA research is – as with the size of CGRAs – to favor smaller CGRAs, and the average size of the CGRAs is . mm². Compared to GPUs, which have monotonically increased their size through time, CGRAs have almost done the inverse and decreased in size. There are two major exceptions: the first is the Imagine architecture, which reported an amazing size of 1000 mm² (later 144 mm² in the follow-up paper 6 years later) – larger than any CGRA or GPU reported to this day. The other large architecture is the CUDA-programmable SGMF at 800 mm². Figure 5:d shows how the reported power-consumption of ASIC-based CGRAs has evolved over time, compared to the Thermal Design Power (TDP) reported for NVIDIA GPUs. The CGRAs are experiencing an on-average exponential decrease in power-consumption, which is likely due to smaller standard cell libraries coupled with small CGRA sizes (Figure 5:a,c,d). On the other hand, NVIDIA GPUs continuously consume more and more power as time goes on (albeit even that is drawing to a halt due to Moore's law). The most power-consuming CGRA, out of those reporting, is the Plasticine architecture, consuming a

any limitation in the fabric itself; however, it is interesting to see that the operating frequencies of FPGA-based CGRAs rival most of the ASIC CGRAs. Figure 5:f shows the data-path widths that CGRA research tends to adopt. Most architectures adopt either a 16-bit (28%) or 32-bit (56%) data-path width; those targeting a 16-bit data-path are usually tailored more towards a specific application, such as telecommunication or deep-learning, while those that target 32-bit (or beyond) are more general-purpose.
A few (13.3%) target 8-bit architectures, but often have support for chaining 8-bit operations for 16-bit use. MATRIX and GARP target very fine-grained reconfigurability, with 4-bit and 1-/2-bit data-paths respectively. Despite this, we can expect future architectures to include more support for low- or hybrid-precision, since it is a reliable way of obtaining more performance while mitigating memory-boundness for applications that permit it. Figure 5:g shows the power-consumption of CGRAs and GPUs as a function of their respective die sizes. This graph complements the graph in Figure 5:d by showing that the low power-consumption of CGRAs is mainly because they are small, with (out of those CGRAs that report both power and area) only Plasticine coming close to the trend of GPUs. Finally, Figure 5:h shows how the individual RC area has evolved throughout time, and we see that the size of RCs has been following the technology scaling, continuously decreasing in size. However, when normalizing the CGRAs' manufacturing technology to that of 16 nm, we actually notice a different trend, where the area of individual RCs is increasing, due to the incorporation of more complex elements (such as wider data-paths, more complex arithmetic units, etc.).

VII. INTEGER AND FLOATING-POINT PERFORMANCE ANALYSIS
Figure 6 overviews data associated with the raw performance of the CGRAs, often positioned against that of NVIDIA GPUs. Figures 6:a-f show the obtained integer performance. Here we distinguish between operations and MAC-based operations in order to reveal architectures which are starved of multipliers. For example, the TARTAN CGRA can execute a large number of operations per unit time, but has no support for multiplications, leading to a very low comparable multiply-add performance. Figures 6:a-b show the GARP and MATRIX architectures as the sole candidates for low-precision arithmetic, and while both of these have comparably high performance (for their time), their multiplication (MAC) performance is lacking (in GARP, the overhead was 32x compared to an addition). Figure 6:c shows 8-bit integer performance, which has recently been of interest to the deep-learning inference community, and where the next-generation Xilinx Versal architecture will be capable of reaching thousands of GOP/s of 8-bit integer performance. Figure 6:d shows 16-bit integer performance, showing a continuous growth over the years. Note how the TARTAN architecture claims to reach performance levels similar to the upcoming Xilinx Versal CGRA, despite being more than a decade old. Figure 6:e is a special case, covering only a few CGRAs (e.g. the Cool Mega Array-1 and SYSCORE); despite their low visible performance, these devices are actually very power-efficient (see the next section for discussion). Finally, 32-bit integer performance is shown in Figure 6:f, where we also include NVIDIA GPU integer performance for comparison. We see that CGRAs have historically been comparable to NVIDIA GPUs, and even FPGAs are becoming a valid way of obtaining integer performance through CGRAs. Figure 6:g shows the peak floating-point performance that CGRAs have reported over the years. The low number of floating-point capable CGRAs prohibits us from drawing any reasonable trend-line, unlike the one for GPUs, which grows exponentially with the years (together with the die-area, see Figure 5:c). However, those CGRAs that do include floating-point units can compete with the performance of modern GPUs – sometimes even outperform them. For example, the Plasticine architecture is capable of delivering 24.6 TFLOP/s of performance, rivaling GPUs from that generation, and the earlier Redefine and SGMF architectures could deliver 500 and 840 GFLOP/s respectively. Even earlier, the WaveScalar architecture was capable of 500 GFLOP/s, which was well ahead of GPUs at that time. At lower performance, we find architectures such as FloRA (600 MFLOP/s) and the loop-accelerator VEAL (5.4 GFLOP/s). Figure 6:h shows the distribution of the number of CGRAs that support floating-point versus those that support integer computations. Floating-point support is clearly under-represented, with more than a factor of four more CGRAs supporting integer computations.
VIII. PERFORMANCE USAGE ANALYSIS
Figure 7:a shows the number of instructions-per-cycle (IPC) that applications experienced when executing on different CGRAs. We see that a majority of CGRAs operate in a fairly low performance domain, primarily due to their size, and most execute around 12 IPC (median). There are corner cases, such as the RHyMe-REDEFINE architecture, which aims to explore CGRAs in High-Performance Computing and reaches 300+ IPC on selected workloads, or the deep-learning SDT-CGRA architecture, which reaches 172 IPC on inference. Similarly, Eyeriss is capable of occupying 100% of its resources when inferring AlexNet, yielding an astounding 700+ IPC. Most FPGA-based CGRAs also execute less than 100 IPC; this is primarily because most FPGA CGRAs are rather small (see Figure 5:a).

Figure 7:b shows the performance applications experience when running on different CGRA architectures as a function of topology size, where we group CGRA architectures into three groups: small (<16), medium (16-64), and large (>64). As expected, the performance and obtained IPC grow as the architectures become larger, with applications running on small-sized CGRAs experiencing the lowest IPC on average. Extrapolating an existing CGRA to the die size of a modern NVIDIA V100 GPU, using the transistor-density scaling shown in Figure 8, such a scaled-up CGRA, albeit unusable by itself (only compute and no orchestration is available), would reach a compute capacity of nearly 1 Peta-OP/s. While this extrapolation is indeed limited, it aims to show that CGRAs have the architectural capability of competing with modern GPU designs, assuming we can fully utilize these (potentially over-provisioned) computing resources.
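The spirit of this extrapolation can be captured in a few lines of C: scale the RC count by the transistor-density ratio between the source node and a V100-class node (cf. Figure 8) and by the die-area ratio, then recompute the peak throughput. All constants below are illustrative placeholders, not the figures behind the 1 Peta-OP/s estimate.

```c
#include <stdio.h>

/* Scale an existing CGRA to a V100-class die: multiply its RC count
 * by the transistor-density ratio between target and source nodes
 * (taken from a density-vs-node curve such as Figure 8) and by the
 * die-area ratio, then recompute peak throughput. */
static double scaled_peak_gops(int rcs, double ops_per_cycle,
                               double freq_ghz,
                               double density_ratio, double area_ratio)
{
    double scaled_rcs = rcs * density_ratio * area_ratio;
    return scaled_rcs * ops_per_cycle * freq_ghz;
}

int main(void)
{
    /* Hypothetical: 128 RCs at an older node; 25x denser target node;
     * 8x larger (V100-class) die; 2 ops/cycle at 1 GHz. */
    printf("%.0f GOP/s\n", scaled_peak_gops(128, 2.0, 1.0, 25.0, 8.0));
    return 0;
}
```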
[Plot: NVIDIA GPU transistor density (million/mm^2) versus manufacturing process (nm).]
Fig. 8: Transistor-density scaling as experienced by NVIDIA GPU chips, which we used to scale existing CGRAs from various points in history.
IX. DISCUSSION AND CONCLUSION
As we saw in the previous sections, a vast majority of CGRAs are fairly small and run at a (comparably) low frequency. This, in turn, leads to very power-efficient designs, allowing CGRAs to be placed into embedded devices such as mobile phones or wearables and operate for hours. This power-efficiency, relative to the performance they provide, allows CGRAs to compete with (and possibly outperform) GPUs, which in turn could be a partial remedy for dark silicon [185]. Based on the analysis in this study, CGRAs should be considered a serious competitor to GPUs, particularly in a future post-Moore era when power-efficiency becomes more important.

However, in order to reap the better compute densities and power-efficiency that CGRAs offer, larger architectures must be more thoroughly researched. Larger CGRA architectures, in particular those aimed at aiding or accelerating general-purpose computing, will be challenging to keep occupied. As we saw in the final graph, a CGRA scaled to the level of a V100 would potentially have a peak performance of hundreds of teraflops, but the main question is: will we be able to map and fully utilize all those computing resources for anything but the most trivial kernels?

Several authors have already pointed out that in order to harness larger CGRAs, we need to complement current ways of extracting instruction parallelism (primarily modulo scheduling) with other forms of concurrency. While modern CGRAs (e.g. Plasticine, SIMD-RA) do exploit SIMD-level parallelism, further research will without doubt be needed, and support for programming models such as CUDA (SGMF moves in this direction), multi-threading, or even multi-tasking (e.g. OpenMP [186]) should be more aggressively pursued, both from an architectural and a programmability viewpoint, in order to leverage future, large-sized CGRAs. For example, the recently added task-dependency features in frameworks such as OpenMP and OmpSs [187] map very well onto clustered CGRAs that have islands of both compute and scratchpad, where the dependencies would dictate how data flows on these CGRAs (exploiting both inter- and intra-task level parallelism and data locality); a minimal sketch of such task dependencies is given at the end of this section.

Another limitation of existing architectures is the application domains they accelerate. A large majority of CGRAs target embedded applications such as filters, stencils, decoders, etc. Studies that integrate the CGRA into the backend of a processor (e.g. TRIPS, DySER) tend to have a more diverse set of benchmarks available, and those studies (e.g. TARTAN, SEED, SGMF) that rely only on simulation (without hardware being developed) have the richest application support. Despite this, CGRAs suffer from a problem similar to the one current FPGAs struggle with: we limit ourselves to small, simple kernels, rather than studying the impact of these architectures on more complex applications. To give a concrete example, to this day no reconfigurable architecture has seriously considered any of the proxy applications that drive HPC system procurement, such as the RIKEN Fiber [188] or ECP benchmark suites [189].
For FPGAs and High-Level Synthesis, this might make sense, since there is always the danger that these large kernels might not fit onto a single FPGA; CGRAs, however, can store multiple contexts and kernels with little overhead in switching between them, opening up possibilities for whole-application execution as well as opportunities to exploit inter-kernel temporal and spatial data locality.

A different challenge with the present (and similar future) surveys lies in the amount of reporting that the different studies do. For example, studies that apply a simulation methodology often have wider benchmark coverage, but fail to report hardware details (e.g. area or RTL information). At the same time, many CGRAs that were actually implemented in hardware (or RTL) do report area and power consumption, but limit the benchmark selection and information. This, in turn, leads to gaps in several graphs, where a clear high-performance candidate is represented in one graph but, due to limited information, is absent from another. This is most clearly seen in graphs that use a derived metric, such as performance per power (OP/s/W). Similarly, many papers prefer to report relative performance improvements rather than absolute numbers, making it difficult to reason about performance across a wide range of CGRAs. We include this in the discussion section mainly as an observation for future studies that may attempt a similar survey.

Overall, this survey has shown that there is plenty of room for CGRA research to grow and to continue to be an active research subject for use in future architectures, particularly when striving to design high-performance CGRAs aimed at niche or general-purpose computation at scale. As transistor dimensions stop shrinking and Moore's law no longer allows us the architectural freedom of carelessly spending silicon, it is here that reconfigurable architectures such as CGRAs might excel at providing performance in a post-Moore era.
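To make the task-dependency idea discussed above concrete, here is a minimal OpenMP sketch in C; the kernels and buffer names are hypothetical, and the CGRA mapping itself is not shown. The depend clauses declare each task's inputs and outputs, from which a dependence-aware runtime derives the task graph that, on a clustered CGRA, could dictate data movement between compute and scratchpad islands.

```c
#include <stddef.h>

static void stage1(float *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) a[i] = (float)i;      /* produce */
}

static void stage2(const float *a, float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) b[i] = 2.0f * a[i];   /* consume */
}

/* Task dependencies (OpenMP 4.x, similarly in OmpSs) declare the data
 * each task reads and writes; the runtime builds a task graph from
 * these in/out sets and schedules tasks accordingly. */
void pipeline(float *a, float *b, size_t n)
{
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a[0:n])
        stage1(a, n);

        #pragma omp task depend(in: a[0:n]) depend(out: b[0:n])
        stage2(a, b, n);   /* guaranteed to run after stage1 */
    }
}
```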
ACKNOWLEDGEMENTS
This article is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
REFERENCES
[1] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256–268, 1974.
[2] G. E. Moore et al., "Cramming more components onto integrated circuits," 1965.
[3] T. N. Theis and H.-S. P. Wong, "The end of Moore's law: A new beginning for information technology," Computing in Science & Engineering, vol. 19, no. 2, p. 41, 2017.
[4] J. S. Vetter, E. P. DeBenedictis, and T. M. Conte, "Architectures for the post-Moore era," IEEE Micro, vol. 37, no. 4, pp. 6–8, 2017.
[5] C. D. Schuman, J. D. Birdwell, M. Dean, J. Plank, and G. Rose, "Neuromorphic computing: A post-Moore's law complementary architecture," Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States), Tech. Rep., 2016.
[6] R. Tessier and W. Burleson, "Reconfigurable computing for digital signal processing: A survey," Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 28, no. 1-2, pp. 7–27, 2001.
[7] J. Gray, "GRVI Phalanx: A massively parallel RISC-V FPGA accelerator accelerator," IEEE, 2016, pp. 17–20.
[8] G. Wang, B. Yin, K. Amiri, Y. Sun, M. Wu, and J. R. Cavallaro, "FPGA prototyping of a high data rate LTE uplink baseband receiver," IEEE, 2009, pp. 248–252.
[9] I. Kuon, R. Tessier, J. Rose et al., "FPGA architecture: Survey and challenges," Foundations and Trends in Electronic Design Automation, vol. 2, no. 2, pp. 135–253, 2008.
[10] C. Yang, T. Geng, T. Wang, C. Lin, J. Sheng, V. Sachdeva, W. Sherman, and M. Herbordt, "Molecular dynamics range-limited force evaluation optimized for FPGAs," vol. 2160. IEEE, 2019, pp. 263–271.
[11] A. Podobas, H. R. Zohouri, N. Maruyama, and S. Matsuoka, "Evaluating high-level design strategies on FPGAs for high-performance computing," IEEE, 2017, pp. 1–4.
[12] H. R. Zohouri, N. Maruyama, A. Smith, M. Matsuda, and S. Matsuoka, "Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs," in SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2016, pp. 409–420.
[13] H. R. Zohouri, A. Podobas, and S. Matsuoka, "Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2018, pp. 153–162.
[14] A. Podobas and S. Matsuoka, "Designing and accelerating spiking neural networks using OpenCL for FPGAs," IEEE, 2017, pp. 255–258.
[15] T. Miyamori and K. Olukotun, "REMARC: Reconfigurable multimedia array coprocessor," IEICE Transactions on Information and Systems, vol. 82, no. 2, pp. 389–397, 1999.
[16] A. Agarwal, S. Amarasinghe, R. Barua, M. Frank, W. Lee, V. Sarkar, D. Srikrishna, and M. Taylor, "The RAW compiler project," in Proceedings of the Second SUIF Compiler Workshop, Stanford, CA, 1997.
[17] J. R. Hauser and J. Wawrzynek, "Garp: A MIPS processor with a reconfigurable coprocessor," in Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No. 97TB100186). IEEE, 1997, pp. 12–21.
[18] S. Ahmad, S. Subramanian, V. Boppana, S. Lakka, F.-H. Ho, T. Knopp, J. Noguera, G. Singh, and R. Wittig, "Xilinx first 7nm device: Versal AI Core (VC1902)," IEEE, 2019, pp. 1–28.
[19] K. E. Fleming, K. D. Glossop, and S. C. Steely, "Apparatus, methods, and systems with a configurable spatial accelerator," US Patent 10,445,250, Oct. 15, 2019.
[20] S. Brown, "FPGA architectural research: a survey," IEEE Design & Test of Computers, vol. 13, no. 4, pp. 9–15, 1996.
[21] H. Amano, "A survey on dynamically reconfigurable processors," IEICE Transactions on Communications, vol. 89, no. 12, pp. 3179–3187, 2006.
[22] V. Tehre and R. Kshirsagar, "Survey on coarse grained reconfigurable architectures," International Journal of Computer Applications, vol. 48, no. 16, pp. 1–7, 2012.
[23] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Proceedings of the Conference on Design, Automation and Test in Europe. IEEE Press, 2001, pp. 642–649.
[24] G. Theodoridis, D. Soudris, and S. Vassiliadis, "A survey of coarse-grain reconfigurable architectures and CAD tools," in Fine- and Coarse-Grain Reconfigurable Computing. Springer, 2007, pp. 89–149.
[25] M. Wijtvliet, L. Waeijen, and H. Corporaal, "Coarse grained reconfigurable architectures in the past 25 years: Overview and classification," IEEE, 2016, pp. 235–244.
[26] L. Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, and S. Wei, "A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications," ACM Computing Surveys (CSUR), vol. 52, no. 6, p. 118, 2019.
[27] J. Hennessy, N. Jouppi, S. Przybylski, C. Rowen, T. Gross, F. Baskett, and J. Gill, "MIPS: A microprocessor architecture," ACM SIGMICRO Newsletter, vol. 13, no. 4, pp. 17–22, 1982.
[28] T. J. Callahan, J. R. Hauser, and J. Wawrzynek, "The Garp architecture and C compiler," Computer, vol. 33, no. 4, pp. 62–69, 2000.
[29] A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin, and B. Hutchings, "A reconfigurable arithmetic array for multimedia applications," in FPGA, vol. 99. Citeseer, 1999, pp. 135–143.
[30] T. Stansfield, "Using multiplexers for control and data in D-Fabrix," in International Conference on Field Programmable Logic and Applications. Springer, 2003, pp. 416–425.
[31] E. Waingold, M. Taylor, V. Sarkar, V. Lee, W. Lee, J. Kim, M. Frank, P. Finch, S. Devabhaktumi, R. Barua et al., "Baring it all to software: The Raw machine," 1997.
[32] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee et al., "The Raw microprocessor: A computational fabric for software circuits and general-purpose programs," IEEE Micro, vol. 22, no. 2, pp. 25–35, 2002.
[33] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, J. Kim, J. Psota et al., "Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams," ACM SIGARCH Computer Architecture News, vol. 32, no. 2, p. 2, 2004.
[34] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown et al., "TILE64 processor: A 64-core SoC with mesh interconnect," IEEE, 2008, pp. 88–598.
[35] T. Miyamori and U. Olukotun, "A quantitative analysis of reconfigurable coprocessors for multimedia applications," in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (Cat. No. 98TB100251). IEEE, 1998, pp. 2–11.
[36] E. Mirsky, A. DeHon et al., "MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources," in FCCM, vol. 96, 1996, pp. 17–19.
[37] G. Lu, H. Singh, M.-H. Lee, N. Bagherzadeh, F. Kurdahi, and M. Eliseu Filho, "The MorphoSys parallel reconfigurable system," in European Conference on Parallel Processing. Springer, 1999, pp. 727–734.
[38] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho, "MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications," IEEE Transactions on Computers, vol. 49, no. 5, pp. 465–481, 2000.
[39] M. Jo, D. Lee, and K. Choi, "Chip implementation of a coarse-grained reconfigurable architecture supporting floating-point operations," vol. 3. IEEE, 2008, pp. III–29.
[40] C. Ebeling, D. C. Cronquist, and P. Franklin, "RaPiD: reconfigurable pipelined datapath," in International Workshop on Field Programmable Logic and Applications. Springer, 1996, pp. 126–135.
[41] C. Ebeling, D. C. Cronquist, P. Franklin, J. Secosky, and S. G. Berg, "Mapping applications to the RaPiD configurable architecture," in Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No. 97TB100186). IEEE, 1997, pp. 106–115.
[42] R. W. Hartenstein, R. Kress, and H. Reinig, "A new FPGA architecture for word-oriented datapaths," in International Workshop on Field Programmable Logic and Applications. Springer, 1994, pp. 144–155.
[43] ——, "A reconfigurable data-driven ALU for Xputers," in Proceedings of IEEE Workshop on FPGA's for Custom Computing Machines. IEEE, 1994, pp. 139–146.
[44] R. W. Hartenstein, M. Herz, T. Hoffmann, and U. Nageldinger, "Using the KressArray for reconfigurable computing," in Configurable Computing: Technology and Applications, vol. 3526. International Society for Optics and Photonics, 1998, pp. 150–161.
[45] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, "CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit," in ACM SIGARCH Computer Architecture News, vol. 28, no. 2. ACM, 2000, pp. 225–235.
[46] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, "MediaBench: A tool for evaluating and synthesizing multimedia and communications systems," in Proceedings of the 30th Annual International Symposium on Microarchitecture. IEEE, 1997, pp. 330–335.
[47] S. Kumar, L. Pires, S. Ponnuswamy, C. Nanavati, J. Golusky, M. Vojta, S. Wadi, D. Pandalai, and H. Spaanenberg, "A benchmark suite for evaluating configurable computing systems: status, reflections, and future directions," in Proceedings of the 2000 ACM/SIGDA Eighth International Symposium on Field Programmable Gate Arrays. ACM, 2000, pp. 126–134.
[48] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. R. Taylor, "PipeRench: A reconfigurable architecture and compiler," Computer, vol. 33, no. 4, pp. 70–77, 2000.
[49] H. Schmit, D. Whelihan, A. Tsai, M. Moe, B. Levine, and R. R. Taylor, "PipeRench: A virtualized programmable datapath in 0.18 micron technology," in Proceedings of the IEEE 2002 Custom Integrated Circuits Conference (Cat. No. 02CH37285). IEEE, 2002, pp. 63–66.
[50] J. Becker, M. Glesner, A. Alsolaim, and J. A. Starzyk, "Fast communication mechanisms in coarse-grained dynamically reconfigurable array architectures," in PDPTA. Citeseer, 2000.
[51] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. López-Lagunas, P. R. Mattson, and J. D. Owens, "A bandwidth-efficient architecture for media processing," in Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture. IEEE, 1998, pp. 3–13.
[52] J. H. Ahn, W. J. Dally, B. Khailany, U. J. Kapasi, and A. Das, "Evaluating the Imagine stream architecture," in Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004. IEEE, 2004, pp. 14–25.
[53] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix," in International Conference on Field Programmable Logic and Applications. Springer, 2003, pp. 61–70.
[54] B. Mei, A. Lambrechts, J.-Y. Mignolet, D. Verkest, and R. Lauwereins, "Architecture exploration for a reconfigurable architecture template," IEEE Design & Test of Computers, vol. 22, no. 2, pp. 90–101, 2005.
[55] J. A. Fisher, "Very long instruction word architectures and the ELI-512," in Proceedings of the 10th Annual International Symposium on Computer Architecture, 1983, pp. 140–150.
[56] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "DRESC: A retargetable compiler for coarse-grained reconfigurable architectures," IEEE, 2002, pp. 166–173.
[57] K. Wu, A. Kanstein, J. Madsen, and M. Berekovic, "MT-ADRES: multithreading on coarse-grained reconfigurable architecture," in International Workshop on Applied Reconfigurable Computing. Springer, 2007, pp. 26–38.
[58] F. Bouwens, M. Berekovic, A. Kanstein, and G. Gaydadjiev, "Architectural exploration of the ADRES coarse-grained reconfigurable array," in International Workshop on Applied Reconfigurable Computing. Springer, 2007, pp. 1–13.
[59] M. Karunaratne, A. K. Mohite, T. Mitra, and L.-S. Peh, "HyCUBE: A CGRA with reconfigurable single-cycle multi-hop interconnect," IEEE, 2017, pp. 1–6.
[60] S. A. Chin, N. Sakamoto, A. Rui, J. Zhao, J. H. Kim, Y. Hara-Azumi, and J. Anderson, "CGRA-ME: A unified framework for CGRA modelling and exploration," IEEE, 2017, pp. 184–189.
[61] M. J. Walker and J. H. Anderson, "Generic connectivity-based CGRA mapping via integer linear programming," IEEE, 2019, pp. 65–73.
[62] C. Kim, M. Chung, Y. Cho, M. Konijnenburg, S. Ryu, and J. Kim, "ULP-SRP: Ultra low power Samsung reconfigurable processor for biomedical applications," IEEE, 2012, pp. 329–334.
[63] J.-e. Lee, K. Choi, and N. D. Dutt, "Compilation approach for coarse-grained reconfigurable architectures," IEEE Design & Test of Computers, vol. 20, no. 1, pp. 26–33, 2003.
[64] ——, "Evaluating memory architectures for media applications on coarse-grained reconfigurable architectures," in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP 2003). IEEE, 2003, pp. 172–182.
[65] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, and W. Yoder, "Scaling to the end of silicon with EDGE architectures," Computer, vol. 37, no. 7, pp. 44–55, 2004.
[66] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, N. Ranganathan, D. Burger, S. W. Keckler, R. G. McDonald, and C. R. Moore, "TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP," ACM Transactions on Architecture and Code Optimization (TACO), vol. 1, no. 1, pp. 62–93, 2004.
[67] K. Sankaralingam, V. A. Singh, S. W. Keckler, and D. Burger, "Routed inter-ALU networks for ILP scalability and performance," in Proceedings of the 21st International Conference on Computer Design. IEEE, 2003, pp. 170–177.
[68] A. Smith, J. Burrill, J. Gibson, B. Maher, N. Nethercote, B. Yoder, D. Burger, and K. S. McKinley, "Compiling for EDGE architectures," in International Symposium on Code Generation and Optimization (CGO'06). IEEE, 2006, 11 pp.
[69] B. Yoder, J. Burrill, R. McDonald, K. Bush, K. Coons, M. Gebhart, S. Govindan, B. Maher, R. Nagarajan, B. Robatmili et al., "Software infrastructure and tools for the TRIPS prototype," in Workshop on Modeling, Benchmarking and Simulation. Citeseer, 2007.
[70] R. McDonald, D. Burger, and S. Keckler, "The design and implementation of the TRIPS prototype chip," IEEE, 2005, pp. 1–24.
[71] M. Gebhart, B. A. Maher, K. E. Coons, J. Diamond, P. Gratz, M. Marino, N. Ranganathan, B. Robatmili, A. Smith, J. Burrill et al., "An evaluation of the TRIPS computer system," in ACM SIGPLAN Notices, vol. 44, no. 3. ACM, 2009, pp. 1–12.
[72] J. Gray and A. Smith, "Towards an area-efficient implementation of a high ILP EDGE soft processor," arXiv preprint arXiv:1803.06617, 2018.
[73] V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim, "DySER: Unifying functionality and parallelism specialization for energy-efficient computing," IEEE Micro, vol. 32, no. 5, pp. 38–51, 2012.
[74] I. Parulkar, A. Wood, J. C. Hoe, B. Falsafi, S. V. Adve, J. Torrellas, and S. Mitra, "OpenSPARC: An open platform for hardware reliability experimentation," in Fourth Workshop on Silicon Errors in Logic-System Effects (SELSE). Citeseer, 2008, pp. 1–6.
[75] J. Benson, R. Cofell, C. Frericks, C.-H. Ho, V. Govindaraju, T. Nowatzki, and K. Sankaralingam, "Design, integration and implementation of the DySER hardware accelerator into OpenSPARC," in IEEE International Symposium on High-Performance Computer Architecture. IEEE, 2012, pp. 1–12.
[76] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt, "The M5 simulator: Modeling networked systems," IEEE Micro, no. 4, pp. 52–60, 2006.
[77] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2008, pp. 72–81.
[78] J. D. Souza, L. Carro, M. B. Rutzig, and A. C. S. Beck, "Towards a dynamic and reconfigurable multicore heterogeneous system," IEEE, 2014, pp. 73–78.
[79] ——, "A reconfigurable heterogeneous multicore with a homogeneous ISA," IEEE, 2016, pp. 1598–1603.
[80] F. C. Junior, I. Silva, and R. Jacobi, "A partially shared thin reconfigurable array for multicore processor," in Anais do IX Simpósio Brasileiro de Engenharia de Sistemas Computacionais. SBC, 2019, pp. 113–118.
[81] M. Butler, L. Barnes, D. D. Sarma, and B. Gelinas, "Bulldozer: An approach to multithreaded compute performance," IEEE Micro, vol. 31, no. 2, pp. 6–15, 2011.
[82] D. L. Wolf, L. J. Jung, T. Ruschke, C. Li, and C. Hochberger, "AMIDAR project: lessons learned in 15 years of researching adaptive processors," IEEE, 2018, pp. 1–8.
[83] S. R. Chalamalasetti, S. Purohit, M. Margala, and W. Vanderbauwhede, "MORA: an architecture and programming model for a resource efficient coarse grained reconfigurable processor," IEEE, 2009, pp. 389–396.
[84] W. Vanderbauwhede, M. Margala, S. R. Chalamalasetti, and S. Purohit, "A C++-embedded domain-specific language for programming the MORA soft processor array," in ASAP 2010: 21st IEEE International Conference on Application-specific Systems, Architectures and Processors. IEEE, 2010, pp. 141–148.
[85] Y. Park, H. Park, and S. Mahlke, "CGRA Express: accelerating execution using dynamic operation fusion," in Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems. ACM, 2009, pp. 271–280.
[86] D. Lewis, G. Chiu, J. Chromczak, D. Galloway, B. Gamsa, V. Manohararajah, I. Milton, T. Vanderhoek, and J. Van Dyken, "The Stratix 10 highly pipelined FPGA architecture," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 159–168.
[87] H. Park, Y. Park, and S. Mahlke, "Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2009, pp. 370–380.
[88] Y. Kim, J. Lee, J. Lee, T. X. Mai, I. Heo, and Y. Paek, "Exploiting both pipelining and data parallelism with SIMD reconfigurable architecture," in International Symposium on Applied Reconfigurable Computing. Springer, 2012, pp. 40–52.
[89] C. Liang and X. Huang, "SmartCell: An energy efficient coarse-grained reconfigurable architecture for stream-based applications," EURASIP Journal on Embedded Systems, vol. 2009, no. 1, p. 518659, 2009.
[90] O. Atak and A. Atalar, "BilRC: An execution triggered coarse grained reconfigurable architecture," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 7, pp. 1285–1298, 2012.
[91] D. Lee, M. Jo, K. Han, and K. Choi, "FloRA: Coarse-grained reconfigurable architecture with floating-point operation capability," IEEE, 2009, pp. 376–379.
[92] F. Feng, L. Li, K. Wang, F. Han, B. Zhang, and G. He, "Floating-point operation based reconfigurable architecture for radar processing," IEICE Electronics Express, vol. 13, no. 20, p. 20160893, 2016.
[93] H. Siqueira and M. Kreutz, "A coarse-grained reconfigurable architecture for a PRET machine," IEEE, 2018, pp. 237–242.
[94] V. Dadu, J. Weng, S. Liu, and T. Nowatzki, "Towards general purpose acceleration by exploiting common data-dependence forms," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 924–939.
[95] T. Nowatzki, V. Gangadhar, N. Ardalani, and K. Sankaralingam, "Stream-dataflow acceleration," IEEE, 2017, pp. 416–429.
[96] B. Salefski and L. Caglar, "Re-configurable computing in wireless," in Proceedings of the 38th Annual Design Automation Conference. ACM, 2001, pp. 178–183.
[97] A. Hemani, N. Farahini, S. M. Jafri, H. Sohofi, S. Li, and K. Paul, "The SiLago solution: Architecture and design methods for a heterogeneous dark silicon aware coarse grain reconfigurable fabric," in The Dark Side of Silicon. Springer, 2017, pp. 47–94.
[98] M. A. Shami, "Dynamically reconfigurable resource array," Ph.D. dissertation, KTH Royal Institute of Technology, 2012.
[99] M. Martina, A. Hemani, and G. Baccelli, "Design of a coarse grain reconfigurable array for neural networks."
[100] D. Stathis, C. Sudarshan, Y. Yang, M. Jung, S. A. M. H. Jafri, C. Weis, A. Hemani, A. Lansner, and N. Wehn, "eBrainII: A 3 kW realtime custom 3D DRAM integrated ASIC implementation of a biologically plausible model of a human scale cortex," arXiv preprint arXiv:1911.00889, 2019.
[101] Y. Yang, D. Stathis, P. Sharma, K. Paul, A. Hemani, M. Grabherr, and R. Ahmad, "RiBoSOM: rapid bacterial genome identification using self-organizing map implemented on the synchoros SiLago platform," in Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation. ACM, 2018, pp. 105–114.
[102] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross, "Q100: The architecture and design of a database processing unit," ACM SIGARCH Computer Architecture News, vol. 42, no. 1, pp. 255–268, 2014.
[103] V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and M. Weinhardt, "PACT XPP: a self-reconfigurable data processing architecture," The Journal of Supercomputing, vol. 26, no. 2, pp. 167–184, 2003.
[104] P. S. Käsgen, M. Weinhardt, and C. Hochberger, "A coarse-grained reconfigurable array for high-performance computing applications," IEEE, 2018, pp. 1–4.
[105] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin, "WaveScalar," in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2003, p. 291.
[106] A. Putnam, S. Swanson, M. Mercaldi, K. Michelson, A. Petersen, A. Schwerin, M. Oskin, and S. Eggers, "The microarchitecture of a pipelined WaveScalar processor: An RTL-based study," Tech. Rep. TR-2005-11-02, 2005.
[107] S. Swanson, A. Schwerin, M. Mercaldi, A. Petersen, A. Putnam, K. Michelson, M. Oskin, and S. J. Eggers, "The WaveScalar architecture," ACM Transactions on Computer Systems (TOCS), vol. 25, no. 2, p. 4, 2007.
[108] M. Mishra, T. J. Callahan, T. Chelcea, G. Venkataramani, S. C. Goldstein, and M. Budiu, "Tartan: evaluating spatial computation for whole program execution," ACM SIGOPS Operating Systems Review, vol. 40, no. 5, pp. 163–174, 2006.
[109] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.
[110] G. Ansaloni, P. Bonzini, and L. Pozzi, "EGRA: A coarse grained reconfigurable architectural template," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 6, pp. 1062–1074, 2010.
[111] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: An infrastructure for computer system modeling," Computer, no. 2, pp. 59–67, 2002.
[112] D. Voitsechov and Y. Etsion, "Single-graph multiple flows: Energy efficient design alternative for GPGPUs," in ACM SIGARCH Computer Architecture News, vol. 42, no. 3. IEEE Press, 2014, pp. 205–216.
[113] D. Voitsechov, O. Port, and Y. Etsion, "Inter-thread communication in multithreaded, reconfigurable coarse-grain arrays," IEEE, 2018, pp. 42–54.
[114] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A unified graphics and computing architecture," IEEE Micro, vol. 28, no. 2, pp. 39–55, 2008.
[115] D. Kirk et al., "NVIDIA CUDA software and GPU parallel computing architecture," in ISMM, vol. 7, 2007, pp. 103–104.
[116] A. Munshi, "The OpenCL specification," IEEE, 2009, pp. 1–314.
[117] C. M. Wittenbrink, E. Kilgariff, and A. Prabhu, "Fermi GF100 GPU architecture," IEEE Micro, vol. 31, no. 2, pp. 50–59, 2011.
[118] M. Zhu, L. Liu, S. Yin, Y. Wang, W. Wang, and S. Wei, "A reconfigurable multi-processor SoC for media applications," in Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE, 2010, pp. 2011–2014.
[119] M. Suzuki, Y. Hasegawa, Y. Yamada, N. Kaneko, K. Deguchi, H. Amano, K. Anjo, M. Motomura, K. Wakabayashi, T. Toi et al., "Stream applications on the dynamically reconfigurable processor," in Proceedings of the 2004 IEEE International Conference on Field-Programmable Technology (IEEE Cat. No. 04EX921). IEEE, 2004, pp. 137–144.
[120] T. Sato, "DAPDNA-2: a dynamically reconfigurable processor with 376 32-bit processing elements," IEEE, 2005, pp. 1–24.
[121] D. N. Truong, W. H. Cheng, T. Mohsenin, Z. Yu, A. T. Jacobson, G. Landge, M. J. Meeuwsen, C. Watnik, A. T. Tran, Z. Xiao et al., "A 167-processor computational platform in 65 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 44, no. 4, pp. 1130–1144, 2009.
[122] D. M. Chapiro, "Globally-asynchronous locally-synchronous systems," Stanford Univ CA Dept of Computer Science, Tech. Rep., 1984.
[123] S. Kyo and S. Okazaki, "IMAPCAR: A 100 GOPS in-vehicle vision processor based on 128 ring connected four-way VLIW processing elements," Journal of Signal Processing Systems, vol. 62, no. 1, pp. 5–16, 2011.
[124] S. Kyo, T. Koga, and S. Okazaki, "IMAP-CE: A 51.2 GOPS video rate image processor with 128 VLIW processing elements," in Proceedings 2001 International Conference on Image Processing (Cat. No. 01CH37205), vol. 3. IEEE, 2001, pp. 294–297.
[125] S. Das, N. Sivanandan, K. T. Madhu, S. K. Nandy, and R. Narayan, "RHyMe: REDEFINE hyper cell multicore for accelerating HPC kernels," IEEE, 2016, pp. 601–602.
[126] K. T. Madhu, S. Das, S. Nalesh, S. Nandy, and R. Narayan, "Compiling HPC kernels for the REDEFINE CGRA," IEEE, 2015, pp. 405–410.
[127] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun, "Plasticine: A reconfigurable architecture for parallel patterns," IEEE, 2017, pp. 389–402.
[128] D. Koeplinger, M. Feldman, R. Prabhakar, Y. Zhang, S. Hadjis, R. Fiszel, T. Zhao, L. Nardi, A. Pedram, C. Kozyrakis et al., "Spatial: A language and compiler for application accelerators," in ACM SIGPLAN Notices, vol. 53, no. 4. ACM, 2018, pp. 296–311.
[129] Cerebras Systems, "Wafer-scale deep learning," Hot Chips 31, 2019.
[130] ACM SIGARCH Computer Architecture News, vol. 36, no. 3. IEEE Computer Society, 2008, pp. 389–400.
[131] J. Cong, H. Huang, C. Ma, B. Xiao, and P. Zhou, "A fully pipelined and dynamically composable architecture of CGRA," IEEE, 2014, pp. 9–16.
[132] S. Gupta, S. Feng, A. Ansari, S. Mahlke, and D. August, "Bundled execution of recurring traces for energy-efficient general purpose processing," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2011, pp. 12–23.
[133] T. Nowatzki, V. Gangadhar, and K. Sankaralingam, "Exploring the potential of heterogeneous von Neumann/dataflow execution models," ACM SIGARCH Computer Architecture News, vol. 43, no. 3, pp. 298–310, 2016.
[134] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[135] M. Courbariaux, Y. Bengio, and J.-P. David, "Training deep neural networks with low precision multiplications," arXiv preprint arXiv:1412.7024, 2014.
[136] X. Fan, H. Li, W. Cao, and L. Wang, "DT-CGRA: Dual-track coarse-grained reconfigurable architecture for stream applications," IEEE, 2016, pp. 1–9.
[137] X. Fan, D. Wu, W. Cao, W. Luk, and L. Wang, "Stream processing dual-track CGRA for object inference," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 6, pp. 1098–1111, 2018.
[138] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "SCNN: An accelerator for compressed-sparse convolutional neural networks," ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 27–40, 2017.
[139] M. Liang, M. Chen, Z. Wang, and J. Sun, "A CGRA based neural network inference engine for deep reinforcement learning," IEEE, 2018, pp. 540–543.
[140] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.
[141] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.
[142] B. Gaide, D. Gaitonde, C. Ravishankar, and T. Bauer, "Xilinx adaptive compute acceleration platform: Versal architecture," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2019, pp. 84–93.
[143] K. Vissers, "Versal: The Xilinx adaptive compute acceleration platform (ACAP)," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2019, pp. 83–83.
[144] N. Ando, K. Masuyama, H. Okuhara, and H. Amano, "Variable pipeline structure for coarse grained reconfigurable array CMA," IEEE, 2016, pp. 217–220.
[145] N. Ozaki, Y. Yoshihiro, Y. Saito, D. Ikebuchi, M. Kimura, H. Amano, H. Nakamura, K. Usami, M. Namiki, and M. Kondo, "Cool Mega-Array: A highly energy efficient reconfigurable accelerator," IEEE, 2011, pp. 1–8.
[146] T. Kojima, N. Ando, Y. Matshushita, H. Okuhara, N. A. V. Doan, and H. Amano, "Real chip evaluation of a low power CGRA with optimized application mapping," in Proceedings of the 9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies. ACM, 2018, p. 13.
[147] K. Patel, S. McGettrick, and C. J. Bleakley, "SYSCORE: A coarse grained reconfigurable array architecture for low energy biosignal processing," IEEE, 2011, pp. 109–112.
[148] J. Lopes, D. Sousa, and J. C. Ferreira, "Evaluation of CGRA architecture for real-time processing of biological signals on wearable devices," IEEE, 2017, pp. 1–7.
[149] L. Duch, S. Basu, R. Braojos, D. Atienza, G. Ansaloni, and L. Pozzi, "A multi-core reconfigurable architecture for ultra-low power bio-signal analysis," IEEE, 2016, pp. 416–419.
[150] S. Das, K. J. Martin, P. Coussy, and D. Rossi, "A heterogeneous cluster with reconfigurable accelerator for energy efficient near-sensor data analytics," IEEE, 2018, pp. 1–5.
[151] S. Das, D. Rossi, K. J. Martin, P. Coussy, and L. Benini, "A 142 MOPS/mW integrated programmable array accelerator for smart visual processing," IEEE, 2017, pp. 1–4.
[152] O. Akbari, M. Kamal, A. Afzali-Kusha, M. Pedram, and M. Shafique, "X-CGRA: An energy-efficient approximate coarse-grained reconfigurable architecture," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019.
[153] M. Langhammer and B. Pasca, "Floating-point DSP block architecture for FPGAs," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 117–125.
[154] G. Stitt and J. Coole, "Intermediate fabrics: Virtual architectures for near-instant FPGA compilation," IEEE Embedded Systems Letters, vol. 3, no. 3, pp. 81–84, 2011.
[155] S. Shukla, N. W. Bergmann, and J. Becker, "QUKU: a two-level reconfigurable architecture," in IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures (ISVLSI'06). IEEE, 2006, 6 pp.
[156] N. W. Bergmann, S. K. Shukla, and J. Becker, "QUKU: a dual-layer reconfigurable architecture," ACM Transactions on Embedded Computing Systems (TECS), vol. 12, no. 1s, p. 63, 2013.
[157] S. Shukla, N. W. Bergmann, and J. Becker, "QUKU: A FPGA based flexible coarse grain architecture design paradigm using process networks," IEEE, 2012, pp. 93–96.
[160] R. L. Lysecky, K. Miller, F. Vahid, and K. A. Vissers, "Firm-core virtual FPGA for just-in-time FPGA compilation," in FPGA, 2005, p. 271.
[161] T. Myint, M. Amagasaki, Q. Zhao, and M. Iida, "A SLM-based overlay architecture for fine-grained virtual FPGA," IEICE Electronics Express, vol. 16, no. 20, p. 20190610, 2019.
[162] R. Ferreira, J. G. Vendramini, L. Mucida, M. M. Pereira, and L. Carro, "An FPGA-based heterogeneous coarse-grained dynamically reconfigurable architecture," in Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems. ACM, 2011, pp. 195–204.
[163] A. Severance and G. G. Lemieux, "Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor," in Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. IEEE Press, 2013, p. 6.
[164] L. B. D. Silva, R. Ferreira, M. Canesche, M. M. Menezes, M. D. Vieira, J. Penha, P. Jamieson, and J. A. M. Nacif, "READY: A fine-grained multithreading overlay framework for modern CPU-FPGA dataflow applications," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, p. 56, 2019.
[165] D. Capalija and T. S. Abdelrahman, "Towards synthesis-free JIT compilation to commodity FPGAs," IEEE, 2011, pp. 202–205.
[166] J. Ball, "The Nios II family of configurable soft-core processors," IEEE, 2005, pp. 1–40.
[167] C. Feng and L. Yang, "Design and evaluation of a novel reconfigurable ALU based on FPGA," in Proceedings 2013 International Conference on Mechatronic Sciences, Electric Engineering and Computer (MEC). IEEE, 2013, pp. 2286–2290.
[168] K. Paul, C. Dash, and M. S. Moghaddam, "reMORPH: a runtime reconfigurable architecture," IEEE, 2012, pp. 26–33.
[169] A. K. Jain, X. Li, P. Singhai, D. L. Maskell, and S. A. Fahmy, "DeCO: A DSP block based FPGA accelerator overlay with low overhead interconnect," IEEE, 2016, pp. 1–8.
[170] A. K. Jain, X. Li, S. A. Fahmy, and D. L. Maskell, "Adapting the DySER architecture with DSP blocks as an overlay for the Xilinx Zynq," ACM SIGARCH Computer Architecture News, vol. 43, no. 4, pp. 28–33, 2016.
[171] C. Liu, H.-C. Ng, and H. K.-H. So, "Automatic nested loop acceleration on FPGAs using soft CGRA overlay," arXiv preprint arXiv:1509.00042, 2015.
[172] ——, "QuickDough: a rapid FPGA loop accelerator design framework using soft CGRA overlay," IEEE, 2015, pp. 56–63.
[173] D. Capalija and T. S. Abdelrahman, "A high-performance overlay architecture for pipelined execution of data flow graphs," IEEE, 2013, pp. 1–8.
[174] D. Mansur, "Stratix IV FPGA and HardCopy IV ASIC @ 40 nm," IEEE, 2008, pp. 1–22.
[175] K. Ovtcharov, I. Tili, and J. G. Steffan, "TILT: a multithreaded VLIW soft processor family," IEEE, 2013, pp. 1–4.
[176] R. Rashid, J. G. Steffan, and V. Betz, "Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS," IEEE, 2014, pp. 20–27.
[177] D. Lewis, D. Cashman, M. Chan, J. Chromczak, G. Lai, A. Lee, T. Vanderhoek, and H. Yu, "Architectural enhancements in Stratix V," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 2013, pp. 147–156.
[178] Z. T. Aklah, "A hybrid partially reconfigurable overlay supporting just-in-time assembly of custom accelerators on FPGAs," 2017.
[179] D. Koch, Partial Reconfiguration on FPGAs: Architectures, Tools and Applications. Springer Science & Business Media, 2012, vol. 153.
[180] M. X. Yue, D. Koch, and G. G. Lemieux, "Rapid overlay builder for Xilinx FPGAs," IEEE, 2015, pp. 17–20.
[181] R. Kisiel and Z. Szczepański, "Trends in assembling of advanced IC packages," Journal of Telecommunications and Information Technology, 2007.
[185] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," IEEE, 2011, pp. 365–376.
[186] R. Chandra, L. Dagum, D. Kohr, R. Menon, D. Maydan, and J. McDonald, Parallel Programming in OpenMP. Morgan Kaufmann, 2001.
[187] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas, "OmpSs: a proposal for programming heterogeneous multi-core architectures," Parallel Processing Letters, vol. 21, no. 2, pp. 173–193, 2011.