ARENA: Asynchronous Reconfigurable Accelerator Ring to Enable Data-Centric Parallel Computing
Cheng Tan, Chenhao Xie, Tong Geng, Andres Marquez, Antonino Tumeo, Kevin Barker, Ang Li
Abstract – The next-generation HPC and data centers are likely to be reconfigurable and data-centric due to the trend of hardware specialization and the emergence of data-driven applications. In this paper, we propose ARENA, an asynchronous reconfigurable accelerator ring architecture, as a potential scenario for what the future HPC and data centers will look like. Despite using coarse-grained reconfigurable arrays (CGRAs) as the substrate platform, our key contribution is not only the CGRA-cluster design itself, but also the ensemble of a new architecture and programming model that enables asynchronous tasking across a cluster of reconfigurable nodes, so as to bring specialized computation to the data rather than the reverse. We presume distributed data storage without asserting any prior knowledge of the data distribution. Hardware specialization occurs at runtime when a task finds that the majority of the data it requires are available at the present node. In other words, we dynamically generate specialized CGRA accelerators where the data reside. The asynchronous tasking for bringing computation to data is achieved by circulating the task token, which describes the dataflow graphs to be executed for a task, among the CGRA cluster connected by a fast ring network. Evaluations on a set of HPC and data-driven applications across different domains show that ARENA can provide better parallel scalability with reduced data movement (53.9%). Compared with contemporary compute-centric parallel models, ARENA brings on average a 4.37× speedup. The synthesized CGRAs and their task dispatchers only occupy 2.93 mm² of chip area under a 45nm process technology and can run at 800MHz with on average 759.8mW power consumption. ARENA also supports the concurrent execution of multiple applications, offering ideal architectural support for future high-performance parallel computing and data analytics systems.

Index Terms – Compute-Flow-Architecture, Runtime Reconfiguration, Asynchronous Parallel Execution, Abstract Machine Model.

The authors are with Pacific Northwest National Laboratory, Richland, WA 99354. E-mail: [email protected]
1 INTRODUCTION
With the slowing down of Moore's Law [51], future computer systems will need to resort to domain-specific accelerators for continuous performance scaling [27], [63] under the same power envelope. This is especially the case for HPC and data centers, as we are quickly entering an era of extreme heterogeneity [71], characterized by cluster nodes integrating a multitude of cooperating accelerators [33], [38], [43], [61]. While integrating domain-specific accelerators (DSAs) into HPC and data centers provides considerable efficiency gains [33], [38], [55], [64], it leads to enormous complexities as well. First, DSAs can only be economically designed for ubiquitous computational patterns in applications, while the workloads currently running in HPC and data centers are converging towards mixed workflows that include scientific simulation, machine learning, data analytics, etc. Second, managing various (typically loosely coupled) accelerators across many nodes significantly complicates programming models and the software infrastructure. In both HPC and data centers, the accelerators typically need to be shared among multiple users, often using multiple nodes for applications with divergent characteristics. Since we cannot design hardware accelerators for all seen and unseen kernels, reconfigurable architecture, which allows specialization after system deployment and even during system execution, promises to be a wise solution. While Field Programmable Gate Arrays (FPGAs) have already been deployed at large scale in some data centers [61], they may suffer from limited frequency and energy efficiency compared with ASICs, as well as long reconfiguration time (e.g., in milliseconds) due to bit-level reconfigurability. Coarse-grained reconfigurable arrays (CGRAs), which integrate highly optimized functional units (rather than fundamental lookup tables, or LUTs) and offer reconfigurability at the word level, emerge as a promising alternative [60]. The rapid reconfiguration [24] makes the dynamic formation of hardware accelerators at runtime feasible, and even plausible.
Conventional large-scale HPC clusters, implicitly assuming a homogeneous node configuration and the bulk-synchronous-parallel (BSP) execution model, suffer from three low-utilization challenges regarding per-node data locality: (1) unbalanced data distribution among homogeneous nodes may lead to unbalanced workload and poor utilization; (2) if data is not locally available, the compute units can sit idle during long remote data fetching; (3) even worse, these idle units or idle nodes cannot be reclaimed for other tasks despite the fact that the node may hold their desired data. As emerging workloads become more dynamic and data-driven, decentralized asynchronous task management is highly desired, while data locality becomes a crucial factor for system design [18], [27], [63]. This is largely due to the observation that the energy cost of data movement significantly outweighs the energy cost of computation [34], [40], [50], which is particularly the case when migrating data through the interconnect network. Although active messages [16], [17] and remote procedure calls (RPC) mitigate the problem by pushing computation to the nodes where data reside, they suffer from considerable overhead due to the lack of architectural support. Existing applications, however, still widely adopt the BSP model [14], [69], alternating phases of (parallel) local computation with phases of global communication. Note that the BSP model implicitly assumes that the majority of the time can be spent in easily parallelizable computation phases, with limited data movement. However, the exponential growth in the availability of data and the emergence of new applications have radically changed this balance.
Fig. 1: Overall ARENA design stack. (Programming Model (Section 3.1) with User APIs such as ARENA_task_register, ARENA_task_spawn, and ARENA_data_acquire; Runtime (Section 3.2); and Architectural Support (Section 4) through Hardware Abstract Functions and base constructs (TaskToken, TaskQueue, Address) over a CPU and a dynamically reconfigurable CGRA.)

Bridging coarse-grained reconfigurable architecture with locality-oriented asynchronous parallel task execution, in this paper we propose ARENA, an Asynchronous REcoNfigurable Accelerator ring architecture and runtime, to enable a data-centric computation-flow paradigm aimed at significantly reducing unnecessary inter-node data movement. ARENA comprises multiple CGRAs interconnected in a ring to provide quick reconfiguration at runtime. The tasks of the application, tagged as task tokens, are injected into the ring when their dependencies are satisfied. Task-specialized CGRA accelerators can be constructed at runtime using the specification embedded in the task token. As task tokens stream around the ring, each node can verify whether a task should be executed locally (based on data locality and resource availability). In other words, ARENA dynamically shares hardware resources according to data locality. This is in contrast with conventional systems that share time slots according to hardware availability. In summary, this paper makes the following contributions:
• Architecture: we propose a multi-CGRA cluster architecture (Section 4) that can be dynamically reconfigured at runtime, enabling asynchronous parallel execution for multiple users with very little overhead.
• Runtime: we propose a flexible runtime (Section 3.2) enabling task-token delivery along the ring network based on data locality. Compared to compute-centric execution models, the proposed runtime can significantly mitigate data movement and improve performance.
• Programming Model: we propose a data-centric programming model (Section 3.1) with easy-to-use APIs to facilitate programming for the ARENA architecture and runtime. An LLVM-based compiler toolchain is constructed to support the programming model.
• Evaluation: RTL simulation results using practical HPC and data analytics applications (Section 5) show that ARENA can provide better parallel scalability with reduced data movement (53.9%) and bring 2.17× and 4.37× speedup over the traditional compute-centric approach with and without CGRA acceleration, respectively. The CGRA-based ARENA prototype only costs 2.93 mm² of chip area in 45nm and can work at 800MHz with 759.8mW power consumption per node, showing significant advantages over the present architectural designs in current HPC and data centers.

2 MOTIVATION
Scientific simulation, where linear solvers iterate on data organized in dense (structured sparse) matrices or tensors, typically easy to divide into equally sized tiles, represents the premier HPC workload [5], [6], [56], [68], [73]. However, emerging HPC applications, targeting areas such as power grid dynamics [4], seismic risk assessment [49], urban systems simulation, and microbiome analysis [3], will likely combine traditional scientific simulation with advanced data analytics and machine learning. The datasets for these applications are much less structured, and thus more difficult to organize in regular and partitionable data structures. Applications will alternate phases of scientific simulation with regular behaviors and phases where the computation happens on sparse data structures (e.g., sparse matrices, graph traversal) that induce unpredictable fine-grained data accesses and irregular behaviors. This scenario provides a clear opportunity for adapting to diverse behaviors with reconfigurable hardware, while at the same time it makes the current HPC programming models inadequate. In the following, we describe three major existing multi-node computing paradigms to motivate the ARENA design.
HPC systems typically rely on the classical Bulk Synchronous Parallel (BSP) programming model, in which a process is assigned to a processor or an entire node, and communication typically happens through message passing with libraries such as MPI. In the BSP model, the computation proceeds as a series of global supersteps: concurrent computation, where every participating node performs parallel computation on local data; communication, where nodes exchange data among themselves (with various, algorithm-dependent, patterns); and barrier synchronization, to align the execution of the nodes. The BSP model assumes that data are partitioned and distributed across nodes and rarely move, to facilitate the local-computation supersteps. Otherwise, the message-passing-based communication supersteps would dominate the execution time. While this model works well for applications with easily partitionable data, regular computation, and limited, structured communication, it starts to experience significant limitations when workloads exhibit irregular behaviors (skewed data distributions, high synchronization intensity, irregular communication patterns). For these reasons, we consider the BSP model as
Compute-Centric. Consider as an example an application with the (hierarchical) task graph in Figure 2(a), where the computation is split into 4 high-level tasks, each one executing task-partitioned computational kernels whose subtasks require data from other nodes. When employing a BSP model, both the data allocation and the distribution of the high-level tasks are fixed for the entire application execution. Hence, if a subtask needs data available in another node, it needs to initiate a communication phase, load the remote data, and synchronize to avoid hazards. Although the latest high-performance designs can exploit mechanisms such as remote direct memory access (one-sided communication, not requiring a blocking receive with implicit synchronization from the remote node), prefetching, and data migration, when these remote accesses are frequent, the bandwidth gap between local memory and remote memory (which needs to be accessed through the network) still remains the major performance (and consequently energy) concern. For example, if, as is typical in HPC applications, tasks are executed in a loop and they contend on the same data blocks (e.g., Tasks A, B, and C in Figure 2), data migration may trigger even more data movement and synchronization, as the actual data distribution and access patterns are unknown before runtime.

Fig. 2: Motivating example comparing the traditional compute-centric execution model and the ARENA data-centric execution model. ((a) Task graph with data requirement; (b) compute-centric with fixed acceleration; (c) compute-centric first invoke; (d) compute-centric second invoke; (e) data-centric looping in ring to map the task; (f) data-centric with reconfigured acceleration.)
Limitations – While compute-centric BSP has long remained the standard model in HPC, its adoption in emerging HPC applications may be limited by data movement and synchronization. HPC practitioners have introduced asynchronous multi-task runtimes to tackle the limitations of the BSP model. Besides migrating data blocks to where the computation occurs, these runtimes often allow tasks to migrate to where the data reside, via approaches such as active messages and remote procedure calls, following data-centric models.
Reconfigurable accelerators have been deployed at a massive scale in data centers to provide application-specific acceleration with improved power- and area-efficiency [30], [60], [61]. Some institutions have also started hosting clusters with FPGAs to perform research in HPC [2]. However, accelerators in HPC installations still generally adopt the BSP model, where a part of the local computation is offloaded to the accelerator itself. This also requires gathering the data desired by the tasks offloaded for execution. While FPGAs potentially allow acceleration of workloads more diverse than conventional accelerators such as GPUs, their current usage in HPC applications often entails a static configuration for accelerating a small number of kernels (e.g., FFT, GEMM). This approach matches an offload accelerator model: tasks and computational kernels do not move in the system. On the other hand, a reconfigurable architecture potentially allows the configuration to be dynamically adjusted at runtime, accelerating divergent tasks as the computation proceeds. Despite the potential, the excessive reconfiguration overheads (typically in milliseconds) make such an approach very costly.
Limitations – Reconfigurable accelerators make it possible to accelerate a more diverse workload, but current practice in heterogeneous HPC still leverages the offload model. The lack of low-latency runtime reconfiguration also limits the chances of large-scale task migration across the whole system. On the other hand, the data-centric execution model allows different tasks to work concurrently, despite using the same set of data in a node. CGRAs, as an alternative reconfigurable solution [60] to FPGAs [24], offer significantly reduced reconfiguration time with coarser-grained reconfiguration, making flexible runtime architecture adjustment possible.
Many programming models have been leveraged or designed to allow data-centric execution on multi-node systems. SSMP [58] can operate on shared-memory machines and supports dynamic detection of dependencies between tasks. The implicitly shared memory management and dependence detection improve programmability at the cost of increased synchronization overhead.
Remote procedure call (RPC) achieves near-data computation based on prior knowledge about the exact distribution of data. X10 [21] and Chapel [20] allow the programmers to control where to place the data and tasks. Similarly, Legion [12] enables explicit, programmer-controlled movement of data and placement of asynchronously spawned tasks, based on locality information. Legion employs a Cilk-like [15] algorithm for locality-aware task stealing. Towards data-centric programming, MapReduce [25] lets programmers think in a data-centric fashion: they focus more on handling the sets of data records, rather than managing fine-grained threads, processes, communication, and coordination [9]. However, MapReduce constrains its usage to batch-processing tasks, which falls within the BSP scope. To accommodate the emerging data-driven applications with irregular and unpredictable data access patterns, data-centric execution with asynchronous task spawning should be enabled with hardware support. Meanwhile, synchronization and task dependencies should be specified by the programmers to eliminate unnecessary performance and energy overhead, rather than forced by the programming model (e.g., a remote procedure needs to return to the local node in RPC, and all the spawned tasks of the same ancestor need to join eventually in Legion). Besides data locality, the runtime should also consider computing-resource utilization when reconfigurable accelerators are deployed and shared by multiple users in HPC/data-center environments.
Limitations – High-level software frameworks and runtimes facilitate the asynchronous execution of tasks and attempt to take advantage of data locality. Unfortunately, existing frameworks based on software solutions incur considerable overhead. The lack of hardware reconfiguration also limits the benefit from application-specific design and heterogeneity.
We propose ARENA to address the limitations of the three aforementioned baselines. ARENA includes a novel programming model targeting an asynchronous data-centric execution paradigm. As shown in Figure 2, all the configurable nodes in ARENA are connected by a ring network to bring the specialized computation to the data rather than the reverse, minimizing data movement. Each reconfigurable node mainly contains a CGRA (detailed in Section 4.3) that supports real-time reconfiguration and simultaneous execution of multiple tasks.
3 PROGRAMMING MODEL
Being the interface between software and hardware, the ARENA programming model defines a list of API functions in Table 1. On one hand, in order to program an ARENA abstract machine, a software programmer has to define their user logic as task functions and rely on the User APIs to operate the abstract machine. Please note that although in this work we use CGRAs as the hardware testbed, it is only one of the possible instantiations of the ARENA abstract machine model (AMM). On the other hand, in order to support ARENA software and run ARENA programs, an alternative architecture has to support the Hardware Abstract Functions.
To program an ARENA abstract machine, the programmers first partition their application into tasks and register the defined tasks with the ARENA runtime. Ideally, the partition separates the working set into a set of continuous data segments, where each task accounts for one segment.
User Defined Function
  void my_task(Address TASKstart, Address TASKend, float PARAM)
    A user can implement multiple different tasks to compose a single or multiple applications.

User APIs
  void ARENA_task_register(int TASKid, Address &my_kernel, bool isRoot)
    Registers a kernel (e.g., my_kernel) with TASKid. The root task is launched by a CPU or a microcontroller once the system starts to run.
  void ARENA_task_spawn(int TASKid, Address TASKstart, Address TASKend, float PARAM, Address REMOTEstart, Address REMOTEend)
    Dynamically spawns a new TaskToken (FROMnode is automatically applied) that will be issued to the CoalesceUnit. The fields of a task token are explained in detail in Section 4.1.

Hardware Abstract Functions
  void ARENA_init(Address* local_start, Address* local_end)
    Initializes local_start and local_end based on local data information.
  TaskToken ARENA_arrive()
    Receives an incoming task from a remote node.
  void ARENA_filter(TaskToken token, Address* local_start, Address* local_end, TaskQueue* SendQueue, TaskQueue* WaitQueue)
    Detaches, splits, or passes tasks based on the token's required data addresses (by comparing them with local_start and local_end).
  bool ARENA_ready(TaskToken token)
    Checks if the computing resources are available for executing the task token.
  void ARENA_launch(TaskToken token, TaskQueue* CoalesceUnit)
    Issues a task denoted by token either to a CPU or to a reconfigurable accelerator (e.g., CGRA). The spawned new tasks will be pushed into the CoalesceUnit.
  void ARENA_data_acquire(TaskToken token)
    Acquires additional data from a remote node via the NIC.
  TaskToken ARENA_coalesce(TaskQueue* CoalesceUnit)
    Coalesces spawned tasks with continuous data range, identical required remote data range, and identical PARAM.

Base Constructs
  TaskToken: encapsulates a task.
  Address: local address of data.
  TaskQueue: buffers for task tokens.
TABLE 1: ARENA programming and hardware APIs.

ARENA does not limit the granularity of a task, which can be extremely fine-grained or coarse-grained. While ARENA works perfectly when the data are locally available for a task, we understand this is not always feasible. When remote data access is inevitable, the application can either spawn a new task for the remote data, or explicitly initiate the data movement through the data-transfer network.

  // Users can define multiple different kernels,
  // each with a specific task token ID.
  int** local_M;
  ...
  void BFS_kernel(int TASKstart, int TASKend, int PARAM) {
    int level = PARAM;
    for(int i=TASKstart; i
Fig. 3: Example of programming SSSP in ARENA.

Figure 3 shows an example of how to solve the single-source shortest path (SSSP) problem using a breadth-first search (BFS) kernel. The design traverses associated vertices until the shortest path(s) from a source node to all the other nodes are found. Without losing generality, we assume the graph, represented as an adjacency matrix, is distributively stored on all nodes and each node holds SIZE/NODES vertices (rows) of the entire graph (the adjacency matrix is SIZE x SIZE; an initial value of ∞ indicates a connected edge while 0 implies no connection). ARENA enables asynchronous data-centric execution by dynamically spawning new task tokens among the nodes. Currently, all tasks need to be registered at the beginning. During runtime, task tokens circulate among all nodes in the ring. In case a node confirms that it has the data required by a task token (indicated by the starting and ending addresses TASKstart and TASKend) as well as sufficient hardware resources for the runtime hardware specialization implied by the task token, it takes the token out of the task-token stream and executes it. A task will be executed by the CPU if no hardware specialization is provided. New tasks for remote nodes can be generated or spawned at any node. We currently rely on the programmer to determine the granularity of a spawned task (in other words, the data range a spawned task designates). Fine-grained tasks facilitate asynchronous execution but increase scheduling overhead in the runtime. We discuss this tradeoff in detail in Section 3.2. This is in contrast with the conventional compute-centric approach demanding frequent data communication and synchronization [19]. As each node maintains the vertex status locally, and no prior knowledge about vertex distribution is asserted, repeated all-to-all communications are essentially required for broadcasting vertex-update information to associated nodes on the present frontier. Figure 9 shows the performance gain of ARENA for the SSSP application.
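As a concrete illustration of how the User APIs of Table 1 can express such a task, the sketch below shows one possible way to write the BFS-based SSSP kernel and register it; the variable names (local_M, levels, local_offset, SIZE), the unit-weight edge convention, and the frontier-expansion logic are illustrative assumptions rather than the exact listing of Figure 3.

  // One possible BFS-based SSSP task written against the ARENA User APIs.
  // Vertex indices are used directly as the task data range for simplicity.
  #define BFS_TASK 0
  int** local_M;       // this node's rows of the adjacency matrix
  int*  levels;        // per-vertex distance, initialized to a large value
  int   local_offset;  // global index of this node's first row
  int   SIZE;          // total number of vertices

  void BFS_kernel(int TASKstart, int TASKend, float PARAM) {
    int level = (int)PARAM;
    for (int i = TASKstart; i < TASKend; ++i) {
      int v = i - local_offset;            // row index within the local block
      if (levels[v] <= level) continue;    // vertex already settled earlier
      levels[v] = level;                   // settle it at the current level
      for (int j = 0; j < SIZE; ++j) {
        if (local_M[v][j] != 0) {          // nonzero entry assumed to denote an edge
          // Spawn a follow-up task for neighbor j. The token circulates the ring
          // and is picked up by whichever node owns vertex j; no remote data
          // range is requested here (REMOTEstart == REMOTEend == 0).
          ARENA_task_spawn(BFS_TASK, j, j + 1, level + 1, 0, 0);
        }
      }
    }
  }

  void setup(int source_vertex) {
    // Register the kernel before launching the runtime; the root task starts
    // at the source vertex with level 0.
    ARENA_task_register(BFS_TASK, (Address)&BFS_kernel, /*isRoot=*/true);
  }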
Figures 4 and 5 illustrate the workflow of the ARENA runtime executed per node and the pseudocode of the workflow, respectively. As can be seen, multiple tasks (marked with different colors) can be asynchronously executed in parallel. Note that the runtime can be supported by CPUs, GPUs, DSPs, or any other fixed or reconfigurable hardware substrate, provided the substrate realizes the Hardware Abstract Functions and supports the Base Constructs. We describe the runtime process below: primarily, the task queues (lines 3-4) and the local data range (line 7) are initialized. We then proceed with 6 steps:
Step-(1): All incoming task tokens from the preceding node are appended to the RecvQueue (lines 8-11). Step-(2): A token popped from the RecvQueue is processed by the Filter, where a task can be split into multiple tasks that are either buffered in the WaitQueue for local execution or forwarded to the SendQueue, waiting to be conveyed to the next node (lines 21-23). The logic is: (I) if the task data range is irrelevant to the node's local data range, the token is forwarded to the SendQueue as it is; (II) if the task data range is a subset of the node's local data range (local_start ≤ TASKstart ≤ TASKend ≤ local_end), implying that all the data needed by the task are locally available, the token is pushed into the WaitQueue for future execution; (III) if the data range indicated by the task is a superset of the node's local data range (TASKstart ≤ local_start ≤ local_end ≤ TASKend), the task might be too coarse-grained; we split the task into three portions and spawn three new tasks, where the one with data range local_start to local_end is buffered in the WaitQueue for local processing and the other two are redirected to the SendQueue; (IV) finally, if the task data range is only partially aligned with the node's local data range, two new tasks are spawned: the aligned part is buffered in the WaitQueue while the mismatched part is forwarded to the SendQueue. Step-(3): The runtime checks whether there are available resources for the token at the top of the WaitQueue to be executed (line 26). Step-(4): If so, the runtime verifies whether the task incurs any unavoidable remote data access. If yes, the data are acquired from the corresponding remote nodes (line 30) through the Data-Transfer-Network. Step-(5): When all required data are available, the task is issued to the computing resources for execution (line 33). Step-(6): As new tasks can be generated locally in a node during task execution, to avoid flooding the system with too many tasks, a CoalesceUnit (line 35) aggregates the newly generated tasks if the boundaries of their data ranges coincide and their other key parameters (e.g., task-token-carried partial-reduction variables) are identical. This avoids the scenario in which too many generated fine-grained tasks saturate the task-token network and the associated buffers. Finally, the runtime on a particular node terminates when the TERMINATE token has been received continuously and there are no pending tasks in the local WaitQueue (lines 12-20).

Fig. 4: ARENA runtime workflow. (An arriving task token enters the RecvQueue via ARENA_arrive, is split, offloaded, or conveyed by ARENA_filter into the WaitQueue or SendQueue, is checked by ARENA_ready, may trigger ARENA_data_acquire for remote data, and is issued by ARENA_launch to the computing units; spawned tokens are merged by ARENA_coalesce in the CoalesceUnit.)

   1  void ARENA_runtime() {
   2    bool terminate = false;
   3    TaskQueue WaitQueue, RecvQueue;
   4    TaskQueue SendQueue, CoalesceUnit;
   5    TaskToken token;
   6    Address local_start, local_end;
   7    ARENA_init(&local_start, &local_end);
   8    while(true) {
   9      // Enqueue an arriving task token.
  10      token = ARENA_arrive();
  11      RecvQueue.enqueue(token);
  12      // Check termination for all tasks.
  13      token = RecvQueue.dequeue();
  14      if(token.TASKid == TERMINATE && WaitQueue.empty()) {
  15        SendQueue.enqueue(token);
  16        if(terminate) break;
  17        terminate = true;
  18        continue;
  19      }
  20      terminate = false;
  21      // Offload, split, or convey a task.
  22      ARENA_filter(token, &local_start, &local_end,
  23                   &SendQueue, &WaitQueue);
  24      // Check availability for task execution.
  25      token = WaitQueue.peek();
  26      if(ARENA_ready(token)) {
  27        WaitQueue.dequeue();
  28        // Acquire data if necessary.
  29        if(token.REMOTEend > token.REMOTEstart)
  30          ARENA_data_acquire(token);
  31        // Issue a task for execution and return
  32        // without waiting for completion.
  33        ARENA_launch(token, &CoalesceUnit);
  34        // Task coalescing if possible.
  35        token = ARENA_coalesce(&CoalesceUnit);
  36        if(token) RecvQueue.enqueue(token);
  37      }
  38    }
  39  }

Fig. 5: ARENA runtime pseudocode.
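To make the four filtering cases concrete, the following is a minimal software sketch of ARENA_filter under the simplifying assumption that data ranges are inclusive integer index ranges; the actual hardware Filter Logic is not specified at this level of detail in the text, so this is an illustration rather than the implementation.

  // A possible software realization of the Filter logic (cases I-IV above).
  void ARENA_filter(TaskToken token, Address* local_start, Address* local_end,
                    TaskQueue* SendQueue, TaskQueue* WaitQueue) {
    Address ls = *local_start, le = *local_end;
    if (token.TASKend < ls || token.TASKstart > le) {
      // (I) No overlap with the local range: convey the token as it is.
      SendQueue->enqueue(token);
    } else if (ls <= token.TASKstart && token.TASKend <= le) {
      // (II) Fully contained in the local range: keep it for local execution.
      WaitQueue->enqueue(token);
    } else if (token.TASKstart <= ls && le <= token.TASKend) {
      // (III) Superset of the local range: split into three tasks.
      TaskToken left = token, mid = token, right = token;
      left.TASKend    = ls - 1;                // portion before the local range
      mid.TASKstart   = ls; mid.TASKend = le;  // locally executed portion
      right.TASKstart = le + 1;                // portion after the local range
      if (left.TASKstart <= left.TASKend)   SendQueue->enqueue(left);
      WaitQueue->enqueue(mid);
      if (right.TASKstart <= right.TASKend) SendQueue->enqueue(right);
    } else {
      // (IV) Partial overlap: keep the aligned part, forward the mismatched part.
      TaskToken local_part = token, remote_part = token;
      if (token.TASKstart < ls) { local_part.TASKstart = ls; remote_part.TASKend = ls - 1; }
      else                      { local_part.TASKend   = le; remote_part.TASKstart = le + 1; }
      WaitQueue->enqueue(local_part);
      SendQueue->enqueue(remote_part);
    }
  }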
4 CLUSTER ARCHITECTURE
Our ARENA prototype in this work is built upon a CGRA cluster interconnected by a fast ring network. Figure 6(a) shows the overall design: multiple reconfigurable nodes are connected in a ring topology. Each node incorporates a micro-controller (e.g., a simplified CPU), a task dispatcher, a CGRA, a network interface (NIC), a DMA unit, and local memory storage. At runtime, task tokens circulate along the ring. The dispatcher can split, offload, and forward a task token based on its data range, as already discussed. We adopt the ring network topology to simplify the routing strategy and provide sufficient bandwidth for delivering the small-sized task tokens (about 21 bytes) with up to 16 nodes evaluated in this paper. The ring network can also provide near-optimal bandwidth for most collective communication [54], and can be easily built upon various physical network topologies. We leave the exploration of alternative multi-node topologies as future work.
Fig. 6: ARENA cluster overview and task token format. ((a) Overview of the ARENA cluster: each node contains a simple CPU core/microcontroller, a task dispatcher, a NIC, a DMA unit, a dynamically reconfigurable CGRA with its controller, and local memory; task tokens circulate along the ring while a separate network handles remote data requests and transfers. (b) Task token format: TASKid, TASKstart, TASKend, PARAM, REMOTEstart, REMOTEend, FROMnode.)
A task is represented by a task token in ARENA, which can be dynamically spawned, executed, delivered, and split. Figure 6(b) shows the general format of the task token, which comprises 7 fields. TASKid indicates which task will be executed and is registered by the user (using ARENA_task_register()) before launching the ARENA runtime. During execution, the reconfigurable node will be dynamically configured (see Section 4.3) based on TASKid. TASKstart and TASKend together describe the data range for the task. PARAM refers to a token-carried return value that is typically initialized by its parent task; this field is useful when performing collective operations (e.g., reduction and accumulation). For unavoidable remote data access, we use REMOTEstart and REMOTEend to indicate the starting and ending addresses. FROMnode labels the node where its parent task is located. Each task token thus requires 21 bytes in our prototype architecture (i.e., 4 bits each for TASKid and FROMnode; 4 bytes each for the other fields).
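For reference, one possible C-style packing of this 21-byte token is sketched below; the field widths follow the text, while the exact bit layout and field order are assumptions.

  #include <stdint.h>

  // Hypothetical packed view of the task token in Fig. 6(b).
  struct TaskToken {
    uint8_t  TASKid   : 4;  // which registered task to execute
    uint8_t  FROMnode : 4;  // node hosting the parent task
    uint32_t TASKstart;     // start of the task's data range
    uint32_t TASKend;       // end of the task's data range
    float    PARAM;         // token-carried value (e.g., partial reduction)
    uint32_t REMOTEstart;   // start of unavoidable remote data, if any
    uint32_t REMOTEend;     // end of unavoidable remote data, if any
  };                        // 1 + 5 * 4 = 21 bytes of payload, ignoring padding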
ARENA's task dispatcher mainly includes task Filter Logic and three queues, as shown in Figure 7. The Filter Logic can offload, split, and convey a task token based on the data requirement (see Section 3.2). The NIC handles remote data requests from the task tokens in the WaitQueue. The WaitQueue is acknowledged when the required remote data arrive at the data memory. The CGRA controller will then pop the acknowledged task token from the head of the WaitQueue and reconfigure the CGRAs accordingly.
Fig. 7: Architecture of an ARENA node. (The task dispatcher with its Filter Logic, RecvQueue, WaitQueue, and SendQueue; the NIC and data memory handling data requests and acknowledgements; and the CGRA tile array with registers, its controller, and the coalescing unit that receives spawned task tokens.)

4.3 Reconfigurable CGRA Nodes and Toolchain
To achieve rapid dynamic reconfiguration, an ARENA node is prototyped with a CGRA. The on-chip configuration memory conserves the control signals for each task. The intra-node CGRA consists of 64 tiles connected in a mesh network. A scratchpad data memory conserves the data required for the computation. Note that both the control signals and the data are pre-loaded by the CPU/micro-controller through the DMA unit before launching the ARENA runtime. The CGRA communicates with the Task Dispatcher through the CGRA controller. The controller can offload tasks and coalesce spawned tokens through the Coalescing Unit.

CGRA Tile – As shown in Figure 7, each CGRA tile contains a functional unit, a scratchpad control memory, a crossbar switch, and three sets of registers. The functional unit supports all the basic operations (e.g., add, mul, shift, select, branch, load, store, etc.). Control divergence (i.e., the existence of multiple control-flow paths) inside the loop kernel is supported through partial predication [32]. The functional unit also supports the spawn operation (i.e., generating a new task token and issuing it to the CGRA controller), which is unique to ARENA. If sufficient information is available (TASKid, TASKstart, and TASKend), a new token can be spawned in a single cycle; otherwise, two cycles are required to encode the additional information (i.e., PARAM, REMOTEstart, and REMOTEend). The FROMnode field will be automatically filled by the CGRA Controller. In Figure 7, there are 4 tiles that can spawn new tasks (marked in green). The leftmost tiles are connected to two 4-port scratchpad data memory banks. The functional unit can be configured to perform different operations at each cycle based on the control signals from the control memory. The CGRA tiles are granular enough to support the simultaneous execution of multiple tasks. Specifically, all the tiles are partitioned into 4 groups, and a task can be executed by 1, 2, or 4 groups, dynamically managed by the CGRA controller.
Control Memory – The control signals of all the tasks are initially pre-loaded into the control memory. At runtime, tiles iterate over a subset of the control signals to execute specific tasks based on the TASKid in the task token. Each tile requires a 480-byte control memory in our prototype to support all application tasks evaluated in this paper (Section 5). Each task has three execution modes powered by different tile groups. It takes only 8 cycles for the CGRA controller to reconfigure specific tile groups by using the data network to forward the TASKid systolically through the array from right to left.

CGRA Controller –
The CGRA Controller can launch a task (using the task token at the head of the WaitQueue) to be executed by different groups of CGRA tiles. Based on the current CGRA utilization status and the data requirement of the target task, the CGRA controller allocates an appropriate number of groups for a waiting task. For example, if the data range required by the target task is less than a quarter of the local data range (i.e., TASKend - TASKstart < (LOCALend - LOCALstart)/4), only one available group (i.e., 2x8 CGRA tiles) will be allocated to the task. When the target task works on more than half of the local data range (i.e., TASKend - TASKstart > (LOCALend - LOCALstart)/2), the CGRA controller attempts to allocate all four groups (i.e., the entire 8x8 CGRA tile array) when available (otherwise, two groups are allocated). In addition, there are four queues in the controller to temporarily hold the spawned task tokens, which are coalesced by the Coalescing Unit if any two of them imply a continuous task data range and share the same TASKid and PARAM. When there are insufficient slots in the queues, the CGRA controller stops fetching tokens from the WaitQueue in the Task Dispatcher. Deadlock can be avoided by providing a memory attached to the CGRA controller for storing the over-spawned task tokens.
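The allocation and coalescing rules above can be summarized by the following illustrative pseudocode; the helper names (groups_for_task, can_coalesce) and the inclusive-range treatment are assumptions, and the controller implements these checks in hardware rather than software.

  // Number of 2x8 tile groups granted to a waiting task (1, 2, or 4).
  int groups_for_task(const TaskToken& t, Address local_start, Address local_end) {
    Address task_range  = t.TASKend - t.TASKstart;
    Address local_range = local_end - local_start;
    if (task_range < local_range / 4) return 1;  // one 2x8 group
    if (task_range > local_range / 2) return 4;  // all four groups (8x8), if free
    return 2;                                    // otherwise two groups (4x8)
  }

  // Two spawned tokens can be merged when their data ranges are contiguous and
  // they carry the same TASKid and PARAM.
  bool can_coalesce(const TaskToken& a, const TaskToken& b) {
    return a.TASKid == b.TASKid && a.PARAM == b.PARAM &&
           (a.TASKend + 1 == b.TASKstart || b.TASKend + 1 == a.TASKstart);
  }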
Fig. 8: ARENA data-centric execution model development procedure with compiler toolchain. (Application tasks and loop kernels are vectorized and flattened, DFGs are generated and mapped by the mapping algorithm, and the resulting control signals and data are pre-loaded onto the hardware task dispatcher and CGRA controller before the data-centric runtime executes on ARENA with CGRAs.)

Compiler Toolchain for CGRA – Figure 8 shows the development procedure for ARENA using the CGRA cluster as the backend. As already mentioned, the CGRA cluster is just one design choice; the ARENA execution model can be realized on alternative backends as long as the HAF APIs (Table 1) are realized. For example, on a CPU cluster backend, we can adopt MPI non-blocking-send primitives [29] to realize the HAF APIs; we use this as one of our baselines in the evaluation. Here, to support the CGRA-cluster backend, an LLVM [42] based design automation toolchain is developed. In particular, a kernel that typically includes a multi-level nested loop is described as a task. As shown in Figure 8, to synthesize an appropriate mapping for a task on the allocated CGRA tiles, the nested loop is first vectorized with a factor that can fully leverage the relatively larger CGRA tile configurations (e.g., 8x8 tiles) in the vectorization pass. Then, the remaining loops are flattened, generating the Control-Data Flow Graph (CDFG) representation, which is an extension of the DFG with control dependence edges. We implemented a heuristic method [39] to map the CDFG onto the various combinations of tiles (i.e., 2x8, 4x8, and 8x8 tiles) and produce their control signals.
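For the CPU-cluster baseline mentioned above, the Hardware Abstract Functions can be emulated in software; the sketch below shows one plausible mapping of token arrival and ring forwarding onto MPI non-blocking primitives, with RING_TAG and the single outstanding-send slot being illustrative choices rather than the paper's actual implementation.

  #include <mpi.h>

  static const int RING_TAG = 7;  // illustrative tag for ring task tokens

  // Non-blocking check for an incoming token; returns an all-zero token
  // (interpreted as "nothing arrived") when no message is pending.
  TaskToken ARENA_arrive() {
    TaskToken token = {};
    int pending = 0;
    MPI_Iprobe(MPI_ANY_SOURCE, RING_TAG, MPI_COMM_WORLD, &pending, MPI_STATUS_IGNORE);
    if (pending) {
      MPI_Recv(&token, sizeof(token), MPI_BYTE, MPI_ANY_SOURCE, RING_TAG,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    return token;
  }

  // Forward a token to the next node in the ring with a non-blocking send,
  // keeping one persistent slot so the buffer stays valid until completion.
  void ring_forward(const TaskToken& token) {
    static TaskToken slot;
    static MPI_Request req = MPI_REQUEST_NULL;
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (req != MPI_REQUEST_NULL) MPI_Wait(&req, MPI_STATUS_IGNORE);  // finish prior send
    slot = token;
    MPI_Isend(&slot, sizeof(slot), MPI_BYTE, (rank + 1) % size, RING_TAG,
              MPI_COMM_WORLD, &req);
  }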
5 EVALUATION
The ARENA runtime is evaluated on traditional CPU HPC clusters and on our proposed CGRA-based ARENA cluster. For the latter, we extend the Structural Simulation Toolkit (SST) [62] to model a multi-node cluster based on MPI. We model the network topology and packet-transmit switch in SST using the MACRELS analytic model [70]. The token-transmit network is modeled as a 1D torus ring. We implement the task dispatcher, CGRA, and CGRA controller in PyMTL [47], which can report cycle-accurate simulation results for a single node. To obtain the cycle-level simulation results, we implement the dispatcher interface in SST to handle the task tokens. Finally, we feed the single-node results to SST and generate synthesizable Verilog for power, area, and timing analysis. The detailed simulation parameters are listed in Table 2.
Technology           45nm
Network Interface    80 Gb/s
Network Topology     1D Torus Ring
Network Switch
Dispatcher           Filter logic, 8-entry receive queue, 8-entry wait queue, 8-entry send queue
CPU (baseline)       2.6GHz, 20MB 3-level cache, out-of-order, x86
CGRA                 8x8 tiles
CGRA Controller

TABLE 2: RTL and simulation parameters for ARENA.
Applications – We evaluate ARENA using representative HPC and data-analytics workloads. The single-source shortest paths (SSSP) problem (see Section 3.1) is a key subroutine in many data-intensive graph computations. General matrix multiply (GEMM) is the core function of linear algebra and deep learning workloads; we assume the matrices are distributed among nodes. Sparse matrix-vector multiplication (SPMV) is a fundamental kernel in many scientific and data applications; here, the distributed matrix is in the Compressed Sparse Row (CSR) format. DNA sequence alignment (DNA) [37] leverages the Needleman-Wunsch (NW) algorithm to search for the best-matched protein sequences with respect to the target pattern. A graph convolutional network (GCN) [28] inference application on the Cora dataset is also evaluated, representing emerging irregular machine learning workloads; we assume the adjacency and feature matrices are distributed among nodes. Finally, an N-body simulation application [31] for simulating a dynamical particle system is demonstrated, representing traditional scientific simulation workloads. Again, the particle information (e.g., accelerations, velocities, positions, collisions, etc.) is distributively stored and required to be updated per iteration at runtime. The conventional compute-centric parallel implementations of all the evaluated applications are developed based on state-of-the-art algorithms or derived from widely used benchmark suites [22], [59]. For example, the SSSP application is implemented based on [19]. The GCN model is extracted from PyTorch Geometric [28]. The DNA application leverages the NW algorithm from Rodinia [22]. Regarding the ARENA implementation, all the applications are programmed following the ARENA programming model for data-centric asynchronous execution with runtime hardware specialization. Note that GCN and NBody contain numerous distinct functional tasks.
We show the benefits of the data-centric programming model, CGRA hardware acceleration, and the entire ARENA system.
Programming Model – We first show the performance effectiveness of ARENA's data-centric programming model. Figure 9 illustrates the normalized speedups of the conventional compute-centric model and of ARENA's model (both implemented in software based on MPI) with respect to a serial implementation on a single CPU node (i.e., the baseline). As can be seen, ARENA's data-centric execution model shows higher speedups and better scalability in general. This is mainly due to the elimination of synchronization and the minimization of data communication. On average, ARENA's data-centric model outperforms the compute-centric counterpart by 1.61× (i.e., 7.82/4.87) in a 16-node cluster. Specifically, the kernels with higher data parallelism (e.g., SSSP, GEMM, and SPMV) gain better scalability under both models; for kernels with limited data parallelism such as DNA, the compute-centric model exhibits lower scalability due to massive data dependencies and costly remote communication. In this situation, ARENA achieves better scalability by streaming task tokens over the nodes to minimize data movement. Figure 10 illustrates the normalized data movement breakdown for ARENA's data-centric model with respect to the compute-centric model for all applications in a 4-node cluster. Compared with the compute-centric HPC cluster, ARENA can eliminate on average 53.9% of the data movement without any prior knowledge about the data distribution, leading to substantial improvement in energy efficiency. We also observe different data movement patterns across applications. For example, SSSP shows considerable task movement, as it spawns massive fine-grained tasks with discrete data ranges, which are hard to coalesce. Regarding DNA, the compute-centric implementation is based on OpenMP, where all threads share the same copy in global memory [22]; distributing the sub-block workload to threads in a zig-zag manner incurs frequent data movement. In ARENA, the data dependency only exists on the edge of the sub-block, which can be explicitly labeled by the parent tasks using the ARENA User APIs (i.e., REMOTEstart and REMOTEend in ARENA_task_spawn()), therefore minimizing data movement. Finally, GEMM and NBody comprise coarse-grained tasks, and their task flows require data streaming among the nodes, leading to little task movement or essential data movement, as shown in the figure.
Fig. 9: Normalized speedup for compute-centric and ARENA's data-centric execution models running on different multi-CPU clusters w.r.t. a serial implementation on a single node.

Fig. 10: Normalized data movement breakdown in the data-centric model w.r.t. the compute-centric model (data movement in compute-centric vs. task movement and data movement in data-centric).

CGRA Speedup – Figure 12 shows the normalized speedup of the evaluated applications running on different configurations or combinations of ARENA's CGRA tile groups (each group is a 2x8 CGRA) with respect to the single-node CPU baseline. In general, a larger CGRA tile configuration leads to higher speedups. For DNA, however, the loop-carried data dependency limits the data parallelism, as well as the obtainable speedups (1.7× speedup at most). On average, 1.3×, 2.4×, and 3.5× speedups are achieved by ARENA's 2x8, 4x8, and 8x8 CGRAs across all the applications and kernels. The 2x8 CGRA exhibits the optimal area efficiency, showcasing the advantages of runtime hardware specialization.

Fig. 12: Normalized CGRA speedup w.r.t. the single-node baseline CPU execution without any acceleration.

Overall System – The normalized speedup of ARENA is shown in Figure 11. As can be seen, ARENA maximizes the overall performance by leveraging the data-centric execution model and the CGRA in a synergistic and integrated fashion. Instead of fixing the CGRA configuration for each workload, ARENA dynamically allocates and configures the CGRA tiles specifically for a particular task based on the task-carried specification (obtained from the data requirement of the task), as well as the current CGRA resource availability. On average, the compute-centric execution model using the entire CGRA for each kernel obtains a 10.06× speedup on a 16-node cluster, whereas ARENA achieves a 21.29× speedup. In other words, ARENA is 2.17× better than the compute-centric execution with CGRA support. This implies ARENA can leverage the CGRAs in a more efficient way. Compared with Figure 9, we can see that ARENA with CGRAs also gains better scalability (from 1.61× to 2.17× on a 16-node cluster). The performance of DNA does not improve much due to the limited acceleration from the CGRA. Finally, the compute-centric execution of GEMM does not scale well because synchronization over a larger amount of data creates serious performance bottlenecks.

Fig. 11: Normalized speedup for compute-centric and ARENA's data-centric execution models running on different multi-CGRA clusters w.r.t. a serial implementation on a single node without any acceleration.

We evaluate the timing, area, and power consumption of ARENA's CGRA cluster (e.g., CGRAs, CGRA controllers, and task dispatchers) using the synthesized Verilog HDL code from PyMTL. We use Synopsys Design Compiler, Cadence Innovus, and Synopsys PrimeTime PX to synthesize, place, route, and estimate the power consumption of the designs. We use FreePDK45 with the Nangate standard cell library. Figure 13 illustrates the obtained chip layout. The area and power of the 32KB scratchpad data memory are estimated based on CACTI-6.5. The chip area is 2.19mm x 1.24mm and the operating frequency is 800MHz @ 45nm with on average 759.8mW power consumption.

Fig. 13: Chip layout of a single ARENA node.
6 RELATED WORK
We summarize related work regarding the ring network, clusters of reconfigurable architecture, and the task-based execution model in a ring network.
Ring Network. The ring network has been adopted in real multiprocessor designs [11] (e.g., Intel Xeon Phi [23]) and logically studied for efficient collective communication [8], [10], [44], [45], [46], [72]. On the one hand, the ring network provides a simple routing mechanism to better utilize the link bandwidth for fast communication [11]. On the other hand, the ring network has been criticized for easy saturation due to linearly increased latency with more nodes [7], [35], [46]. Previous works [8], [45], [46], [72] extend the ring concept to form local-ring networks among subsets of nodes, known as routerless networks. In ARENA, we avoid the saturation problem through (a) a routerless task execution model among nodes, and (b) the dynamic task allocation and dispatching mechanism.
Reconfigurable Hardware Cluster. Clusters incorporating reconfigurable devices such as FPGAs [26], [41], [61], [65] and CGRAs [36], [60], [66] have already been showcased by existing works [48], [67]. On one hand, Putnam et al. from Microsoft propose the reconfigurable Catapult fabric [61], where each instantiation consists of a 6x8 2D torus of Xilinx Stratix-V FPGAs. Every FPGA is connected to a CPU server via PCIe and directly links to other FPGAs through SAS cables. This 1632-node FPGA cluster has been adopted for document ranking in the Bing search engine; nearly a 95% performance increase has been demonstrated with only a 10% extra power budget. However, each FPGA in Catapult is specialized to a single application kernel during runtime. Reconfiguration takes several milliseconds, and even reloading models without changing the computation logic can take up to 250 microseconds. On the other hand, the data-centric execution model, which allows different tasks to work on the same set of data in an HPC node, requires rapid dynamic reconfiguration. Gazzano et al. propose R-Grid, a complete grid infrastructure for distributed high-performance computing using dynamically reconfigurable FPGAs [26]. Knodel et al. virtualize the FPGA resources and propose adapted service models in a cloud context [41]. Tarafdar et al. discuss how FPGAs of a cloud data center can be flexibly connected based on a logical kernel and a mapping file [65]. Zhang et al. adopt a cluster of six Xilinx VC709 FPGAs for cooperative convolutional neural network inference [75]. Regarding CGRAs, the SambaNova DataScale system incorporates 8 reconfigurable dataflow units (RDUs) derived from Plasticine [60], claiming higher performance than a thousand GPUs for training extremely large deep-learning models. PPA [57] exploits pipeline parallelism in streaming applications to create a CGRA-like pipeline to execute streaming media applications. Samsung proposes to adopt a CGRA cluster for medical volume image rendering [36]. HammerBlade [66] aims at designing a rack-scale cluster for ML and graphs; their ASIC is composed of general-purpose cores and specialized CGRAs (e.g., Chimera [74]).
Asynchronous Task Execution.
Asynchronous task execution has been studied in many graph processing frameworks [13], [52], [53]. They propose software-level data-centric approaches to implement irregular graph kernels on multi-node clusters [53] and GPU platforms [52]. The work most relevant to ARENA is Groute [13], an asynchronous runtime environment for processing irregular graphs. In Groute, the GPUs form a logical ring network and each GPU maintains a worklist; tasks are encapsulated as messages passed along the ring. ARENA is motivated by Groute. However, Groute is a pure software implementation based on general-purpose GPUs, and thus cannot benefit from hardware specialization. Additionally, in Groute, the routing policy, which specifies the action when input is received, as well as memory consistency and ownership, are all defined and maintained by the users, which can be complicated, tedious, and error-prone (e.g., if a data dependency occurs at runtime, users have to manually locate the data and fetch them in the corresponding tasks). In ARENA, these are designed and supported by hardware. Furthermore, global coordination and work counting in Groute are centralized and managed by the CPU, whereas in ARENA they are all processed in a distributed manner.
7 CONCLUSION
In this paper, we propose an asynchronous-reconfigurable-accelerator-ring architecture for next-generation data-driven high-performance computing. Through the co-design of architecture and programming model, ARENA is able to bring computation tasks, in the form of CGRA-specialized hardware accelerators, to the data, rather than the reverse as in contemporary compute-centric and dataflow architectures, significantly improving performance and reducing data movement.

ACKNOWLEDGEMENT

This work was mainly supported by the Compute-Flow-Architecture project under PNNL's DMC LDRD Initiative. It was also supported by the SO(DA) and FALLACY projects under DMC. The evaluation platforms were supported by the U.S. DOE Office of Science, Office of Advanced Scientific Computing Research, under award 66150: "CENATE - Center for Advanced Architecture Evaluation". The Pacific Northwest National Laboratory is operated by Battelle for the U.S. Department of Energy under contract DE-AC05-76RL01830.

REFERENCES
[7] First International Symposium on Networks-on-Chip (NOCS'07), pages 18–29. IEEE, 2007.
[8] F. Alazemi, A. Azizimazreah, B. Bose, and L. Chen. Routerless network-on-chip. pages 492–503. IEEE, 2018.
[9] P. Alvaro, T. Condie, N. Conway, K. Elmeleegy, J. M. Hellerstein, and R. C. Sears. BOOM: Data-centric programming in the datacenter. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-113, 2009.
[10] A. A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda. Efficient large message broadcast using NCCL and CUDA-aware MPI for deep learning. In Proceedings of the 23rd European MPI Users' Group Meeting, pages 15–22, 2016.
[11] L. A. Barroso and M. Dubois. The performance of cache-coherent ring-based multiprocessors. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 268–277, 1993.
[12] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. Legion: Expressing locality and independence with logical regions. In SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE, 2012.
[13] T. Ben-Nun, M. Sutton, S. Pai, and K. Pingali. Groute: An asynchronous multi-GPU programming model for irregular computations. ACM SIGPLAN Notices, 52(8):235–248, 2017.
[14] R. H. Bisseling and W. F. McColl. Scientific computing on bulk synchronous parallel architectures. 1993.
[15] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, 1996.
[16] D. Bonachea and P. Hargrove. GASNet Specification, v1.8.1. 2017.
[17] D. Bonachea and P. H. Hargrove. GASNet-EX: A high-performance, portable communication library for exascale. In International Workshop on Languages and Compilers for Parallel Computing, pages 138–158. Springer, 2018.
[18] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, and O. Mutlu. Google workloads for consumer devices: Mitigating data movement bottlenecks. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 316–331, 2018.
[19] A. Buluç and K. Madduri. Parallel breadth-first search on distributed memory systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12, 2011.
[20] B. L. Chamberlain, D. Callahan, and H. P. Zima. Parallel programmability and the Chapel language. The International Journal of High Performance Computing Applications, 21(3):291–312, 2007.
[21] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. Von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. ACM SIGPLAN Notices, 40(10):519–538, 2005.
[22] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. pages 44–54. IEEE, 2009.
[23] G. Chrysos. Intel Xeon Phi coprocessor (codename Knights Corner). pages 1–31. IEEE, 2012.
[24] B. De Sutter, P. Raghavan, and A. Lambrechts. Coarse-grained reconfigurable array architectures. In Handbook of Signal Processing Systems, pages 427–472. Springer, 2019.
[25] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[26] J. Dondo Gazzano, F. Sanchez Molina, F. Rincon, and J. C. López. Integrating reconfigurable hardware-based grid for high performance computing. The Scientific World Journal, 2015, 2015.
[27] J. Dongarra, L. Grigori, and N. J. Higham. Numerical algorithms for high-performance computational science. Philosophical Transactions of the Royal Society A, 378(2166):20190066, 2020.
[28] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428, 2019.
[29] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In D. Kranzlmüller, P. Kacsuk, and J. Dongarra, editors, European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 97–104. Springer, 2004.
[30] T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Xu, R. Patel, and M. Herbordt. FPDeep: Acceleration and load balancing of CNN training on FPGA clusters. pages 81–84. IEEE, 2018.
[31] S. Green. Particle simulation using CUDA. NVIDIA whitepaper, 6:121–128, 2010.
[32] M. Hamzeh, A. Shrivastava, and S. Vrudhula. Branch-aware loop mapping on CGRAs. In Proceedings of the 51st Annual Design Automation Conference, pages 1–6, 2014.
[33] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang. Applied machine learning at Facebook: A datacenter infrastructure perspective. pages 620–629. IEEE, 2018.
[34] M. Horowitz. 1.1 Computing's energy problem (and what we can do about it). pages 10–14. IEEE, 2014.
[35] N. E. Jerger and L.-S. Peh. On-chip networks. Synthesis Lectures on Computer Architecture, 4(1):1–141, 2009.
[36] S. Jin, S. Lee, M.-K. Chung, Y. Cho, and S. Ryu. Implementation of a volume rendering on coarse-grained reconfigurable multiprocessor. pages 243–246. IEEE, 2012.
[37] M. Johnson, I. Zaretskaya, Y. Raytselis, Y. Merezhuk, S. McGinnis, and T. L. Madden. NCBI BLAST: A better web interface. Volume 36, pages W5–W9, 2008.
[38] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon. In-datacenter performance analysis of a tensor processing unit. pages 1–12, 2017.
[39] M. Karunaratne, A. K. Mohite, T. Mitra, and L.-S. Peh. HyCUBE: A CGRA with reconfigurable single-cycle multi-hop interconnect. In Proceedings of the 54th Annual Design Automation Conference 2017, pages 1–6, 2017.
[40] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie. Quantifying the energy cost of data movement in scientific applications. pages 56–65. IEEE, 2013.
[41] O. Knodel, P. Lehmann, and R. G. Spallek. RC3E: Reconfigurable accelerators in data centres and their provision by adapted service models. pages 19–26. IEEE, 2016.
[42] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization (CGO 2004), pages 75–86. IEEE, 2004.
[43] K. Lee. Introducing Big Basin: Our next-generation AI hardware.
[44] IEEE Transactions on Parallel and Distributed Systems, 31(1):94–110, 2019.
[45] T.-R. Lin, D. Penney, M. Pedram, and L. Chen. A deep reinforcement learning framework for architectural exploration: A routerless NoC case study. IEEE, 2018.
[46] S. Liu, T. Chen, L. Li, X. Feng, Z. Xu, H. Chen, F. Chong, and Y. Chen. IMR: High-performance low-cost multi-ring NoCs. IEEE Transactions on Parallel and Distributed Systems, 27(6):1700–1712, 2015.
[47] D. Lockhart, G. Zibrat, and C. Batten. PyMTL: A unified framework for vertically integrated computer architecture research. pages 280–292. IEEE, 2014.
[48] J. C. Lyke, C. G. Christodoulou, G. A. Vera, and A. H. Edwards. An introduction to reconfigurable systems. Proceedings of the IEEE.
[50] D. Molka, D. Hackenberg, R. Schöne, and M. S. Müller. Characterizing the energy consumption of data transfers and arithmetic operations on x86-64 processors. In International Conference on Green Computing, pages 123–133. IEEE, 2010.
[51] G. E. Moore. Cramming more components onto integrated circuits, 1965.
[52] R. Nasre, M. Burtscher, and K. Pingali. Data-driven versus topology-driven irregular computations on GPUs. pages 463–474. IEEE, 2013.
[53] D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 456–471, 2013.
[54] NVIDIA. NVIDIA DGX-1 System Architecture White Paper, 2017.
[55] I. Ohmura, G. Morimoto, Y. Ohno, A. Hasegawa, and M. Taiji. MDGRAPE-4: A special-purpose computer system for molecular dynamics simulations.
Philosophical Transactions of the Royal Society A:Mathematical, Physical and Engineering Sciences , 372(2021):20130387,2014.[56] S. P´all, M. J. Abraham, C. Kutzner, B. Hess, and E. Lindahl. TacklingExascale Software Challenges in Molecular Dynamics Simulationswith GROMACS. In S. Markidis and E. Laure, editors,
SolvingSoftware Challenges for Exascale , pages 3–27, Cham, 2015. SpringerInternational Publishing.[57] H. Park, Y. Park, and S. Mahlke. Polymorphic pipeline array:a flexible multicore accelerator with virtualized execution formobile multimedia applications. In
Proceedings of the 42nd AnnualIEEE/ACM International Symposium on Microarchitecture , pages 370–380, 2009.[58] J. M. Perez, R. M. Badia, and J. Labarta. Handling task dependenciesunder strided and aliased references. In
Proceedings of the 24th ACMInternational Conference on Supercomputing , pages 263–274, 2010.[59] L.-N. Pouchet and S. Grauer-Gray. Polybench: The polyhedralbenchmark suite. , 2012.[60] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao,S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun. Plasticine: Areconfigurable architecture for parallel patterns. In ,pages 389–402. IEEE, 2017.[61] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constan-tinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray,M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka,J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, andD. Burger. A reconfigurable fabric for accelerating large-scaledatacenter services. In , pages 13–24. IEEE, 2014.[62] A. F. Rodrigues, G. R. Voskuilen, S. D. Hammond, and K. S.Hemmert. Structural Simulation Toolkit (SST). Technical report,[69] L. G. Valiant. A bridging model for parallel computation.
Commu-nications of the ACM , 33(8):103–111, 1990. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States),2016.[63] J. Shalf. The future of computing beyond Moore’s law.
PhilosophicalTransactions of the Royal Society A , 378(2166):20190061, 2020.[64] D. E. Shaw, J. Grossman, J. A. Bank, B. Batson, J. A. Butts, J. C.Chao, M. M. Deneroff, R. O. Dror, A. Even, C. H. Fenton, et al.Anton 2: raising the bar for performance and programmabilityin a special-purpose molecular dynamics supercomputer. In
SC’14: Proceedings of the International Conference for High PerformanceComputing, Networking, Storage and Analysis , pages 41–53. IEEE,2014.[65] N. Tarafdar, T. Lin, E. Fukuda, H. Bannazadeh, A. Leon-Garcia,and P. Chow. Enabling flexible network FPGA clusters in aheterogeneous cloud data center. In
Proceedings of the 2017ACM/SIGDA International Symposium on Field-Programmable GateArrays , pages 237–246, 2017.[66] M. B. Taylor, A. Sampson, C. Batten, Z. Zhang, L. Ceze, M. Oskin,and D. Richmond. The HammerBlade: An ML-Optimized Super-computer for ML and Graphs. https://sampl.cs.washington.edu/tvmconf/slides/Michael-Taylor-HammerBlade.pdf, 2020.[67] R. Tessier, K. Pocek, and A. DeHon. Reconfigurable computingarchitectures.
Proceedings of the IEEE , 103(3):332–354, 2015.[68] D. Trebotich, M. F. Adams, S. Molins, C. I. Steefel, and C. Shen.High-Resolution Simulation of Pore-Scale Reactive Transport Pro-cesses Associated with Carbon Sequestration.
Computing in Science
Engineering , 16(6):22–31, Nov 2014.[70] F. Versolatto and A. Tonello. An MTL Theory Approach for theSimulation of MIMO Power-Line Communication Channels.
PowerDelivery, IEEE Transactions on , 26:1710–1717, 07 2011.[71] J. S. Vetter, R. Brightwell, M. Gokhale, P. McCormick, R. Ross,J. Shalf, K. Antypas, D. Donofrio, T. Humble, C. Schuman, et al.Extreme heterogeneity 2018-productive computational science inthe era of extreme heterogeneity: Report for DOE ASCR workshopon extreme heterogeneity. Technical report, USDOE Office ofScience (SC), Washington, DC (United States), 2018.[72] L. Wang, L. Liu, J. Han, X. Wang, S. Yin, and S. Wei. AchievingFlexible Global Reconfiguration in NoCs using ReconfigurableRings.
IEEE Transactions on Parallel and Distributed Systems , 2019.[73] L.-W. Wang. Divide-and-conquer quantum mechanical materialsimulations with exascale supercomputers.
National Science Review ,1(4):604–617, 2014.[74] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee. CHIMAERA:a high-performance architecture with a tightly-coupled reconfig-urable functional unit.
ACM SIGARCH Computer Architecture News ,28(2):225–235, 2000.[75] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong. Energy-efficient CNN implementation on a deeply pipelined FPGA cluster.In
Proceedings of the 2016 International Symposium on Low PowerElectronics and Design , pages 326–331, 2016., pages 326–331, 2016.