ARENA: Asynchronous Reconfigurable Accelerator Ring to Enable Data-Centric Parallel Computing
Cheng Tan, Chenhao Xie, Tong Geng, Andres Marquez, Antonino Tumeo, Kevin Barker, Ang Li
Abstract – The next-generation HPC and data centers are likely to be reconfigurable and data-centric due to the trend of hardware specialization and the emergence of data-driven applications. In this paper, we propose ARENA, an asynchronous reconfigurable accelerator ring architecture, as a potential scenario for what the future HPC and data centers will look like. Despite using coarse-grained reconfigurable arrays (CGRAs) as the substrate platform, our key contribution is not only the CGRA-cluster design itself, but also the ensemble of a new architecture and programming model that enables asynchronous tasking across a cluster of reconfigurable nodes, so as to bring specialized computation to the data rather than the reverse. We presume distributed data storage without asserting any prior knowledge of the data distribution. Hardware specialization occurs at runtime when a task finds that the majority of the data it requires are available at the present node. In other words, we dynamically generate specialized CGRA accelerators where the data reside. The asynchronous tasking for bringing computation to data is achieved by circulating the task token, which describes the dataflow graphs to be executed for a task, among the CGRA cluster connected by a fast ring network. Evaluations on a set of HPC and data-driven applications across different domains show that ARENA can provide better parallel scalability with reduced data movement (53.9%). Compared with contemporary compute-centric parallel models, ARENA brings on average a 4.37× speedup. The synthesized CGRAs and their task dispatchers only occupy 2.93 mm² of chip area under a 45nm process technology and can run at 800MHz with on average 759.8mW power consumption. ARENA also supports the concurrent execution of multiple applications, offering ideal architectural support for future high-performance parallel computing and data analytics systems.

Index Terms – Compute-Flow-Architecture, Runtime Reconfiguration, Asynchronous Parallel Execution, Abstract Machine Model.

The authors are with Pacific Northwest National Laboratory, Richland, WA 99354. E-mail: [email protected]
1 INTRODUCTION
With the slowing down of Moore's Law [51], future computer systems will need to resort to domain-specific accelerators for continuous performance scaling [27], [63] under the same power envelope. This is especially the case for HPC and data centers, as we are quickly entering an era of extreme heterogeneity [71], characterized by cluster nodes integrating a multitude of cooperating accelerators [33], [38], [43], [61]. While integrating domain-specific accelerators (DSAs) into HPC and data centers provides considerable efficiency gains [33], [38], [55], [64], it leads to enormous complexities as well. First, DSAs can only be economically designed for ubiquitous computational patterns in applications, while the workloads currently running in HPC and data centers are converging towards mixed workflows that include scientific simulation, machine learning, data analytics, etc. Second, managing various (typically loosely coupled) accelerators across many nodes significantly complicates programming models and the software infrastructure. In both HPC and data centers, the accelerators typically need to be shared among multiple users, often using multiple nodes for applications with divergent characteristics. Since we cannot design hardware accelerators for all seen and unseen kernels, reconfigurable architecture, which allows specialization after system deployment and even during system execution, promises to be a wise solution. While Field Programmable Gate Arrays (FPGAs) have already been deployed at large scale in some data centers [61], they may suffer from limited frequency and energy efficiency compared with ASICs, as well as long reconfiguration time (e.g., in milliseconds) due to bit-level reconfigurability. Coarse-grained reconfigurable arrays (CGRAs), which integrate highly optimized functional units (rather than fundamental lookup tables, or LUTs) and offer reconfigurability at the word level, emerge as a promising alternative [60]. The rapid reconfiguration [24] makes the dynamic formation of hardware accelerators at runtime feasible, and even plausible.
Conventional large-scale HPC clusters, implicitly assuming a homogeneous node configuration and the bulk-synchronous-parallel (BSP) execution model, suffer from three low-utilization challenges regarding per-node data locality: (1) unbalanced data distribution among homogeneous nodes may lead to unbalanced workload and poor utilization; (2) if data is not locally available, the compute units can sit idle during long remote data fetching; (3) even worse, these idle units or idle nodes cannot be reclaimed for other tasks despite the fact that the node may hold their desired data. As emerging workloads become more dynamic and data-driven, decentralized asynchronous task management is highly desired, while data locality becomes a crucial factor for system design [18], [27], [63]. This is largely due to the observation that the energy cost of data movement significantly outweighs the energy cost of computation [34], [40], [50], which is particularly the case when migrating data through the interconnect network. Although active messages [16], [17] and remote procedure calls (RPC) mitigate the problem by pushing computation to the nodes where data reside, they suffer from considerable overhead due to the lack of architectural support. Existing applications, however, still widely adopt the BSP model [14], [69], alternating phases of (parallel) local computation with phases of global communication. Note that the BSP model implicitly assumes that the majority of the time can be spent in easily parallelizable computation phases, with limited data movement. However, the exponential growth in the availability of data and the emergence of new applications have radically changed this balance.
Fig. 1: Overall ARENA design stack. (Programming Model (Section 3.1) with User APIs such as ARENA_task_register, ARENA_task_spawn, and ARENA_data_acquire; Runtime (Section 3.2); and Architectural Support (Section 4) through Hardware Abstract Functions and base constructs (TaskToken, TaskQueue, Address) over a CPU and a dynamically reconfigurable CGRA.)

Bridging coarse-grained reconfigurable architecture with locality-oriented asynchronous parallel task execution, in this paper we propose ARENA, an Asynchronous REcoNfigurable Accelerator ring architecture and runtime, to enable a data-centric computation-flow paradigm aimed at significantly reducing unnecessary inter-node data movement. ARENA comprises multiple CGRAs interconnected in a ring to provide quick reconfiguration at runtime. The tasks of the application, tagged as task tokens, are injected into the ring when their dependencies are satisfied. Task-specialized CGRA accelerators can be constructed at runtime using the specification embedded in the task token. As task tokens stream around the ring, each node can verify whether a task should be executed locally (based on data locality and resource availability). In other words, ARENA dynamically shares hardware resources according to data locality. This is in contrast with conventional systems that share time slots according to hardware availability. In summary, this paper makes the following contributions:
• Architecture: we propose a multi-CGRA cluster architecture (Section 4) that can be dynamically reconfigured at runtime, enabling asynchronous parallel execution for multiple users with very little overhead.
• Runtime: we propose a flexible runtime (Section 3.2) enabling task-token delivery along the ring network based on data locality. Compared to compute-centric execution models, the proposed runtime can significantly mitigate data movement and improve performance.
• Programming Model: we propose a data-centric programming model (Section 3.1) with easy-to-use APIs to facilitate programming for the ARENA architecture and runtime. An LLVM-based compiler toolchain is constructed to support the programming model.
• Evaluation: RTL simulation results using practical HPC and data analytics applications (Section 5) show that ARENA can provide better parallel scalability with reduced data movement (53.9%) and bring 2.17× and 4.37× speedup over the traditional compute-centric approach with and without CGRA acceleration, respectively. The CGRA-based ARENA prototype only costs 2.93 mm² of chip area in 45nm and can work at 800MHz with 759.8mW power consumption per node, showing significant advantages over the present architectural designs in current HPC and data centers.

2 MOTIVATION
Scientific simulation, where linear solvers iterate on data organized in dense (structured sparse) matrices or tensors, typically easy to divide into equally sized tiles, represents the premier HPC workload [5], [6], [56], [68], [73]. However, emerging HPC applications, targeting areas such as power grid dynamics [4], seismic risk assessment [49], urban systems simulation, and microbiome analysis [3], will likely combine traditional scientific simulation with advanced data analytics and machine learning. The datasets for these applications are much less structured, and thus more difficult to organize in regular and partitionable data structures. Applications will alternate phases of scientific simulation with regular behaviors and phases where the computation happens on sparse data structures (e.g., sparse matrices, graph traversal) that induce unpredictable fine-grained data accesses and irregular behaviors. This scenario provides a clear opportunity for adapting to diverse behaviors with reconfigurable hardware, while at the same time it makes the current HPC programming models inadequate. In the following, we describe three major existing multi-node computing paradigms to motivate the ARENA design.
HPC systems typically rely on the classical Bulk Synchronous Parallel (BSP) programming model, in which a process is assigned to a processor or an entire node, and communication typically happens through message passing with libraries such as MPI. In the BSP model, the computation proceeds as a series of global supersteps: concurrent computation, where every participating node performs parallel computation on local data; communication, where nodes exchange data among themselves (with various, algorithm-dependent, patterns); and barrier synchronization, to align the execution of the nodes. The BSP model assumes that data are partitioned and distributed across nodes and rarely move, to facilitate the local-computation supersteps. Otherwise, the message-passing-based communication supersteps would dominate the execution time. While this model works well for applications with easily partitionable data, regular computation, and limited, structured communication, it starts to experience significant limitations when workloads exhibit irregular behaviors (skewed data distributions, high synchronization intensity, irregular communication patterns). For these reasons, we consider the BSP model as
Compute-Centric. Consider as an example an application with the (hierarchical) task graph in Figure 2(a), where the computation is split into 4 high-level tasks, each one executing task-partitioned computational kernels whose subtasks require data from other nodes. When employing a BSP model, both the data allocation and the distribution of the high-level tasks are fixed for the entire application execution. Hence, if a subtask needs data available in another node, it needs to initiate a communication phase, load the remote data, and synchronize to avoid hazards. Although the latest high-performance designs can exploit mechanisms such as remote direct memory access (one-sided communication, not requiring a blocking receive with implicit synchronization from the remote node), prefetching, and data migration, when these remote accesses are frequent, the bandwidth gap between local memory and remote memory (which needs to be accessed through the network) still remains the major performance (and consequently energy) concern. For example, if, as is typical in HPC applications, tasks are executed in a loop and they contend on the same data blocks (e.g., Tasks A, B, and C in Figure 2), data migration may trigger even more data movement and synchronization, as the actual data distribution and access patterns are unknown before runtime.

Fig. 2: Motivating example comparing the traditional compute-centric execution model and the ARENA data-centric execution model. ((a) Task graph with data requirement; (b) compute-centric with fixed acceleration; (c) compute-centric first invoke; (d) compute-centric second invoke; (e) data-centric looping in ring to map the task; (f) data-centric with reconfigured acceleration.)
Limitations – While compute-centric BSP has long remained the standard model in HPC, its adoption in emerging HPC applications may be limited by data movement and synchronization. HPC practitioners have introduced asynchronous multi-task runtimes to tackle the limitations of the BSP model. Besides migrating data blocks to where the computation occurs, these runtimes often allow tasks to migrate to where the data reside, via approaches such as active messages and remote procedure calls, following data-centric models.
Reconfigurable accelerators have been deployed at a massive scale in data centers to provide application-specific acceleration with improved power- and area-efficiency [30], [60], [61]. Some institutions have also started hosting clusters with FPGAs to perform research in HPC [2]. However, accelerators in HPC installations still generally adopt the BSP model, where a part of the local computation is offloaded to the accelerator itself. This also requires gathering the data desired by the tasks offloaded for execution. While FPGAs potentially allow acceleration of workloads more diverse than conventional accelerators such as GPUs, their current usage in HPC applications often entails a static configuration for accelerating a small number of kernels (e.g., FFT, GEMM). This approach matches an offload accelerator model: tasks and computational kernels do not move in the system. On the other hand, a reconfigurable architecture potentially allows the configuration to be dynamically adjusted at runtime, accelerating divergent tasks as the computation proceeds. Despite the potential, the excessive reconfiguration overheads (typically in milliseconds) make such an approach very costly.
Limitations – Reconfigurable accelerators make it possible to accelerate a more diverse workload, but current practice in heterogeneous HPC still leverages the offload model. The lack of low-latency runtime reconfiguration also limits the chances of large-scale task migration across the whole system. On the other hand, the data-centric execution model allows different tasks to work concurrently, despite using the same set of data in a node. CGRAs, as an alternative reconfigurable solution [60] to FPGAs [24], offer significantly reduced reconfiguration time with coarser-grained reconfiguration, making flexible runtime architecture adjustment possible.
Many programming models have been leveraged or designed to allow data-centric execution on multi-node systems. SSMP [58] can operate on shared-memory machines and supports dynamic detection of dependencies between tasks. The implicitly shared memory management and dependence detection improve programmability at the cost of increased synchronization overhead.
Remote procedure call (RPC) achieves near-data computation based on prior knowledge about the exact distribution of data. X10 [21] and Chapel [20] allow the programmers to control where to place the data and tasks. Similarly, Legion [12] enables explicit, programmer-controlled movement of data and placement of asynchronously spawned tasks, based on locality information. Legion employs a Cilk-like [15] algorithm for locality-aware task stealing. Towards data-centric programming, MapReduce [25] lets programmers think in a data-centric fashion: they focus more on handling the sets of data records, rather than managing fine-grained threads, processes, communication, and coordination [9]. However, MapReduce constrains its usage to batch-processing tasks, which falls within the BSP scope. To accommodate the emerging data-driven applications with irregular and unpredictable data access patterns, data-centric execution with asynchronous task spawning should be enabled with hardware support. Meanwhile, synchronization and task dependencies should be specified by the programmers to eliminate unnecessary performance and energy overhead, rather than forced by the programming model (e.g., a remote procedure needs to return to the local node in RPC, and all the spawned tasks of the same ancestor need to join eventually in Legion). Besides data locality, the runtime should also consider computing-resource utilization when reconfigurable accelerators are deployed and shared by multiple users in HPC/data-center environments.
Limitations – High-level software frameworks and runtimes facilitate the asynchronous execution of tasks and attempt to take advantage of data locality. Unfortunately, existing frameworks based on software solutions incur considerable overhead. The lack of hardware reconfiguration also limits the benefit from application-specific design and heterogeneity.
We propose ARENA to address the limitations of the three aforementioned baselines. ARENA includes a novel programming model targeting an asynchronous data-centric execution paradigm. As shown in Figure 2, all the configurable nodes in ARENA are connected by a ring network to bring the specialized computation to the data rather than the reverse, minimizing data movement. Each reconfigurable node mainly contains a CGRA (detailed in Section 4.3) that supports real-time reconfiguration and simultaneous execution of multiple tasks.
3 PROGRAMMING MODEL
Being the interface between software and hardware, the ARENA programming model defines a list of API functions in Table 1. On one hand, in order to program an ARENA abstract machine, a software programmer has to define their user logic as task functions and rely on the User APIs to operate the abstract machine. Please note that although in this work we use CGRAs as the hardware testbed, it is only one of the possible instantiations of the ARENA abstract machine model (AMM). On the other hand, in order to support ARENA software and run ARENA programs, an alternative architecture has to support the Hardware Abstract Functions.
To program an ARENA abstract machine, the programmers first partition their application into tasks and register the defined tasks with the ARENA runtime. Ideally, the partition separates the working set into a set of continuous data segments, where each task accounts for one segment.
User Defined Function
  void my_task(Address TASKstart, Address TASKend, float PARAM)
    A user can implement multiple different tasks to compose a single or multiple applications.

User APIs
  void ARENA_task_register(int TASKid, Address &my_kernel, bool isRoot)
    Registers a kernel (e.g., my_kernel) with TASKid. The root task is launched by a CPU or a microcontroller once the system starts to run.
  void ARENA_task_spawn(int TASKid, Address TASKstart, Address TASKend, float PARAM, Address REMOTEstart, Address REMOTEend)
    Dynamically spawns a new TaskToken (FROMnode is automatically applied) that will be issued to the CoalesceUnit. The fields of a task token are explained in detail in Section 4.1.

Hardware Abstract Functions
  void ARENA_init(Address* local_start, Address* local_end)
    Initializes local_start and local_end based on local data information.
  TaskToken ARENA_arrive()
    Receives an incoming task from a remote node.
  void ARENA_filter(TaskToken token, Address* local_start, Address* local_end, TaskQueue* SendQueue, TaskQueue* WaitQueue)
    Detaches, splits, or passes tasks based on the token's required data addresses (by comparing them with local_start and local_end).
  bool ARENA_ready(TaskToken token)
    Checks if the computing resources are available for executing the task token.
  void ARENA_launch(TaskToken token, TaskQueue* CoalesceUnit)
    Issues a task denoted by token either to a CPU or to a reconfigurable accelerator (e.g., CGRA). The spawned new tasks will be pushed into the CoalesceUnit.
  void ARENA_data_acquire(TaskToken token)
    Acquires additional data from a remote node via the NIC.
  TaskToken ARENA_coalesce(TaskQueue* CoalesceUnit)
    Coalesces spawned tasks with continuous data range, identical required remote data range, and identical PARAM.

Base Constructs
  TaskToken: encapsulates a task.
  Address: local address of data.
  TaskQueue: buffers for task tokens.
TABLE 1: ARENA programming and hardware APIs.

ARENA does not limit the granularity of a task, which can be extremely fine-grained or coarse-grained. While ARENA works perfectly when the data are locally available for a task, we understand this is not always feasible. When remote data access is inevitable, the application can either spawn a new task for the remote data, or explicitly initiate the data movement through the data-transfer network.

  // Users can define multiple different kernels,
  // each with a specific task token ID.
  int** local_M;
  ...
  void BFS_kernel(int TASKstart, int TASKend, int PARAM) {
    int level = PARAM;
    for(int i=TASKstart; i
Fig. 3: Example of programming SSSP in ARENA.

Figure 3 shows an example of how to solve the single-source shortest path (SSSP) problem using a breadth-first search (BFS) kernel. The design traverses associated vertices until the shortest path(s) from a source node to all the other nodes are found. Without losing generality, we assume the graph, represented as an adjacency matrix, is distributively stored on all nodes and each node holds SIZE/NODES vertices (rows) of the entire graph (the adjacency matrix is SIZE x SIZE; an initial value of ∞ indicates a connected edge while 0 implies no connection). ARENA enables asynchronous data-centric execution by dynamically spawning new task tokens among the nodes. Currently, all tasks need to be registered at the beginning. During runtime, task tokens circulate among all nodes in the ring. In case a node confirms that it has the data required by a task token (indicated by the starting and ending addresses TASKstart and TASKend) as well as sufficient hardware resources for the runtime hardware specialization implied by the task token, it takes the token out of the task-token stream and executes it. A task will be executed by the CPU if no hardware specialization is provided. New tasks for remote nodes can be generated or spawned at any node. We currently rely on the programmer to determine the granularity of a spawned task (in other words, the data range a spawned task designates). Fine-grained tasks facilitate asynchronous execution but increase scheduling overhead in the runtime. We discuss this tradeoff in detail in Section 3.2. This is in contrast with the conventional compute-centric approach demanding frequent data communication and synchronization [19]. As each node maintains the vertex status locally, and no prior knowledge about vertex distribution is asserted, repeated all-to-all communications are essentially required for broadcasting vertex-update information to associated nodes on the present frontier. Figure 9 shows the performance gain of ARENA for the SSSP application.
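As a concrete illustration of how the User APIs of Table 1 can express such a task, the sketch below shows one possible way to write the BFS-based SSSP kernel and register it; the variable names (local_M, levels, local_offset, SIZE), the unit-weight edge convention, and the frontier-expansion logic are illustrative assumptions rather than the exact listing of Figure 3.

  // One possible BFS-based SSSP task written against the ARENA User APIs.
  // Vertex indices are used directly as the task data range for simplicity.
  #define BFS_TASK 0
  int** local_M;       // this node's rows of the adjacency matrix
  int*  levels;        // per-vertex distance, initialized to a large value
  int   local_offset;  // global index of this node's first row
  int   SIZE;          // total number of vertices

  void BFS_kernel(int TASKstart, int TASKend, float PARAM) {
    int level = (int)PARAM;
    for (int i = TASKstart; i < TASKend; ++i) {
      int v = i - local_offset;            // row index within the local block
      if (levels[v] <= level) continue;    // vertex already settled earlier
      levels[v] = level;                   // settle it at the current level
      for (int j = 0; j < SIZE; ++j) {
        if (local_M[v][j] != 0) {          // nonzero entry assumed to denote an edge
          // Spawn a follow-up task for neighbor j. The token circulates the ring
          // and is picked up by whichever node owns vertex j; no remote data
          // range is requested here (REMOTEstart == REMOTEend == 0).
          ARENA_task_spawn(BFS_TASK, j, j + 1, level + 1, 0, 0);
        }
      }
    }
  }

  void setup(int source_vertex) {
    // Register the kernel before launching the runtime; the root task starts
    // at the source vertex with level 0.
    ARENA_task_register(BFS_TASK, (Address)&BFS_kernel, /*isRoot=*/true);
  }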
Figures 4 and 5 illustrate the workflow of the ARENA runtime executed per node and the pseudocode of the workflow, respectively. As can be seen, multiple tasks (marked with different colors) can be asynchronously executed in parallel. Note that the runtime can be supported by CPUs, GPUs, DSPs, or any other fixed or reconfigurable hardware substrate, provided the substrate realizes the Hardware Abstract Functions and supports the Base Constructs. We describe the runtime process below: primarily, the task queues (lines 3-4) and the local data range (line 7) are initialized. We then proceed with 6 steps:
Step-(1): All incoming task tokens from the preceding node are appended to the RecvQueue (lines 8-11). Step-(2): A token popped from the RecvQueue is processed by the Filter, where a task can be split into multiple tasks that are either buffered in the WaitQueue for local execution or forwarded to the SendQueue, waiting to be conveyed to the next node (lines 21-23). The logic is: (I) if the task data range is irrelevant to the node's local data range, the token is forwarded to the SendQueue as it is; (II) if the task data range is a subset of the node's local data range (local_start ≤ TASKstart ≤ TASKend ≤ local_end), implying that all the data needed by the task are locally available, the token is pushed into the WaitQueue for future execution; (III) if the data range indicated by the task is a superset of the node's local data range (TASKstart ≤ local_start ≤ local_end ≤ TASKend), the task might be too coarse-grained; we split the task into three portions and spawn three new tasks, where the one with data range local_start to local_end is buffered in the WaitQueue for local processing and the other two are redirected to the SendQueue; (IV) finally, if the task data range is only partially aligned with the node's local data range, two new tasks are spawned: the aligned part is buffered in the WaitQueue while the mismatched part is forwarded to the SendQueue. Step-(3): The runtime checks whether there are available resources for the token at the top of the WaitQueue to be executed (line 26). Step-(4): If so, the runtime verifies whether the task incurs any unavoidable remote data access. If yes, the data are acquired from the corresponding remote nodes (line 30) through the Data-Transfer-Network. Step-(5): When all required data are available, the task is issued to the computing resources for execution (line 33). Step-(6): As new tasks can be generated locally in a node during task execution, to avoid flooding the system with too many tasks, a CoalesceUnit (line 35) aggregates the newly generated tasks if the boundaries of their data ranges coincide and their other key parameters (e.g., task-token-carried partial-reduction variables) are identical. This avoids the scenario in which too many generated fine-grained tasks saturate the task-token network and the associated buffers. Finally, the runtime on a particular node terminates when the TERMINATE token has been received continuously and there are no pending tasks in the local WaitQueue (lines 12-20).

Fig. 4: ARENA runtime workflow. (An arriving task token enters the RecvQueue via ARENA_arrive, is split, offloaded, or conveyed by ARENA_filter into the WaitQueue or SendQueue, is checked by ARENA_ready, may trigger ARENA_data_acquire for remote data, and is issued by ARENA_launch to the computing units; spawned tokens are merged by ARENA_coalesce in the CoalesceUnit.)

   1  void ARENA_runtime() {
   2    bool terminate = false;
   3    TaskQueue WaitQueue, RecvQueue;
   4    TaskQueue SendQueue, CoalesceUnit;
   5    TaskToken token;
   6    Address local_start, local_end;
   7    ARENA_init(&local_start, &local_end);
   8    while(true) {
   9      // Enqueue an arriving task token.
  10      token = ARENA_arrive();
  11      RecvQueue.enqueue(token);
  12      // Check termination for all tasks.
  13      token = RecvQueue.dequeue();
  14      if(token.TASKid == TERMINATE && WaitQueue.empty()) {
  15        SendQueue.enqueue(token);
  16        if(terminate) break;
  17        terminate = true;
  18        continue;
  19      }
  20      terminate = false;
  21      // Offload, split, or convey a task.
  22      ARENA_filter(token, &local_start, &local_end,
  23                   &SendQueue, &WaitQueue);
  24      // Check availability for task execution.
  25      token = WaitQueue.peek();
  26      if(ARENA_ready(token)) {
  27        WaitQueue.dequeue();
  28        // Acquire data if necessary.
  29        if(token.REMOTEend > token.REMOTEstart)
  30          ARENA_data_acquire(token);
  31        // Issue a task for execution and return
  32        // without waiting for completion.
  33        ARENA_launch(token, &CoalesceUnit);
  34        // Task coalescing if possible.
  35        token = ARENA_coalesce(&CoalesceUnit);
  36        if(token) RecvQueue.enqueue(token);
  37      }
  38    }
  39  }

Fig. 5: ARENA runtime pseudocode.
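To make the four filtering cases concrete, the following is a minimal software sketch of ARENA_filter under the simplifying assumption that data ranges are inclusive integer index ranges; the actual hardware Filter Logic is not specified at this level of detail in the text, so this is an illustration rather than the implementation.

  // A possible software realization of the Filter logic (cases I-IV above).
  void ARENA_filter(TaskToken token, Address* local_start, Address* local_end,
                    TaskQueue* SendQueue, TaskQueue* WaitQueue) {
    Address ls = *local_start, le = *local_end;
    if (token.TASKend < ls || token.TASKstart > le) {
      // (I) No overlap with the local range: convey the token as it is.
      SendQueue->enqueue(token);
    } else if (ls <= token.TASKstart && token.TASKend <= le) {
      // (II) Fully contained in the local range: keep it for local execution.
      WaitQueue->enqueue(token);
    } else if (token.TASKstart <= ls && le <= token.TASKend) {
      // (III) Superset of the local range: split into three tasks.
      TaskToken left = token, mid = token, right = token;
      left.TASKend    = ls - 1;                // portion before the local range
      mid.TASKstart   = ls; mid.TASKend = le;  // locally executed portion
      right.TASKstart = le + 1;                // portion after the local range
      if (left.TASKstart <= left.TASKend)   SendQueue->enqueue(left);
      WaitQueue->enqueue(mid);
      if (right.TASKstart <= right.TASKend) SendQueue->enqueue(right);
    } else {
      // (IV) Partial overlap: keep the aligned part, forward the mismatched part.
      TaskToken local_part = token, remote_part = token;
      if (token.TASKstart < ls) { local_part.TASKstart = ls; remote_part.TASKend = ls - 1; }
      else                      { local_part.TASKend   = le; remote_part.TASKstart = le + 1; }
      WaitQueue->enqueue(local_part);
      SendQueue->enqueue(remote_part);
    }
  }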
4 CLUSTER ARCHITECTURE
Our ARENA prototype in this work is built upon a CGRA cluster interconnected by a fast ring network. Figure 6(a) shows the overall design: multiple reconfigurable nodes are connected in a ring topology. Each node incorporates a micro-controller (e.g., a simplified CPU), a task dispatcher, a CGRA, a network interface (NIC), a DMA unit, and local memory storage. At runtime, task tokens circulate along the ring. The dispatcher can split, offload, and forward a task token based on its data range, as already discussed. We adopt the ring network topology to simplify the routing strategy and provide sufficient bandwidth for delivering the small-sized task tokens (about 21 bytes) with up to 16 nodes evaluated in this paper. The ring network can also provide near-optimal bandwidth for most collective communication [54], and can be easily built upon various physical network topologies. We leave the exploration of alternative multi-node topologies as future work.
Fig. 6: ARENA cluster overview and task token format. ((a) Overview of the ARENA cluster: each node contains a simple CPU core/microcontroller, a task dispatcher, a NIC, a DMA unit, a dynamically reconfigurable CGRA with its controller, and local memory; task tokens circulate along the ring while a separate network handles remote data requests and transfers. (b) Task token format: TASKid, TASKstart, TASKend, PARAM, REMOTEstart, REMOTEend, FROMnode.)
A task is represented by a task token in ARENA, which can be dynamically spawned, executed, delivered, and split. Figure 6(b) shows the general format of the task token, which comprises 7 fields. TASKid indicates which task will be executed and is registered by the user (using ARENA_task_register()) before launching the ARENA runtime. During execution, the reconfigurable node will be dynamically configured (see Section 4.3) based on TASKid. TASKstart and TASKend together describe the data range for the task. PARAM refers to a token-carried return value that is typically initialized by its parent task; this field is useful when performing collective operations (e.g., reduction and accumulation). For unavoidable remote data access, we use REMOTEstart and REMOTEend to indicate the starting and ending addresses. FROMnode labels the node where its parent task is located. Each task token thus requires 21 bytes in our prototype architecture (i.e., 4 bits each for TASKid and FROMnode; 4 bytes each for the other fields).
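For reference, one possible C-style packing of this 21-byte token is sketched below; the field widths follow the text, while the exact bit layout and field order are assumptions.

  #include <stdint.h>

  // Hypothetical packed view of the task token in Fig. 6(b).
  struct TaskToken {
    uint8_t  TASKid   : 4;  // which registered task to execute
    uint8_t  FROMnode : 4;  // node hosting the parent task
    uint32_t TASKstart;     // start of the task's data range
    uint32_t TASKend;       // end of the task's data range
    float    PARAM;         // token-carried value (e.g., partial reduction)
    uint32_t REMOTEstart;   // start of unavoidable remote data, if any
    uint32_t REMOTEend;     // end of unavoidable remote data, if any
  };                        // 1 + 5 * 4 = 21 bytes of payload, ignoring padding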
ARENA's task dispatcher mainly includes task Filter Logic and three queues, as shown in Figure 7. The Filter Logic can offload, split, and convey a task token based on the data requirement (see Section 3.2). The NIC handles remote data requests from the task tokens in the WaitQueue. The WaitQueue is acknowledged when the required remote data arrive at the data memory. The CGRA controller will then pop the acknowledged task token from the head of the WaitQueue and reconfigure the CGRAs accordingly.
Fig. 7: Architecture of an ARENA node. (The task dispatcher with its Filter Logic, RecvQueue, WaitQueue, and SendQueue; the NIC and data memory handling data requests and acknowledgements; and the CGRA tile array with registers, its controller, and the coalescing unit that receives spawned task tokens.)

4.3 Reconfigurable CGRA Nodes and Toolchain
To achieve rapid dynamic reconfiguration, an ARENA node is prototyped with a CGRA. The on-chip configuration memory conserves the control signals for each task. The intra-node CGRA consists of 64 tiles connected in a mesh network. A scratchpad data memory conserves the data required for the computation. Note that both the control signals and the data are pre-loaded by the CPU/micro-controller through the DMA unit before launching the ARENA runtime. The CGRA communicates with the Task Dispatcher through the CGRA controller. The controller can offload tasks and coalesce spawned tokens through the Coalescing Unit.

CGRA Tile – As shown in Figure 7, each CGRA tile contains a functional unit, a scratchpad control memory, a crossbar switch, and three sets of registers. The functional unit supports all the basic operations (e.g., add, mul, shift, select, branch, load, store, etc.). Control divergence (i.e., the existence of multiple control-flow paths) inside the loop kernel is supported through partial predication [32]. The functional unit also supports the spawn operation (i.e., generating a new task token and issuing it to the CGRA controller), which is unique to ARENA. If sufficient information is available (TASKid, TASKstart, and TASKend), a new token can be spawned in a single cycle; otherwise, two cycles are required to encode the additional information (i.e., PARAM, REMOTEstart, and REMOTEend). The FROMnode field will be automatically filled by the CGRA Controller. In Figure 7, there are 4 tiles that can spawn new tasks (marked in green). The leftmost tiles are connected to two 4-port scratchpad data memory banks. The functional unit can be configured to perform different operations at each cycle based on the control signals from the control memory. The CGRA tiles are granular enough to support the simultaneous execution of multiple tasks. Specifically, all the tiles are partitioned into 4 groups, and a task can be executed by 1, 2, or 4 groups, dynamically managed by the CGRA controller.
Control Memory – The control signals of all the tasks are initially pre-loaded into the control memory. At runtime, tiles iterate over a subset of the control signals to execute specific tasks based on the TASKid in the task token. Each tile requires a 480-byte control memory in our prototype to support all application tasks evaluated in this paper (Section 5). Each task has three execution modes powered by different tile groups. It takes only 8 cycles for the CGRA controller to reconfigure specific tile groups by using the data network to forward the TASKid systolically through the array from right to left.

CGRA Controller –
The CGRA Controller can launch a task (using the task token at the head of the WaitQueue) to be executed by different groups of CGRA tiles. Based on the current CGRA utilization status and the data requirement of the target task, the CGRA controller allocates an appropriate number of groups for a waiting task. For example, if the data range required by the target task is less than a quarter of the local data range (i.e., TASKend - TASKstart < (LOCALend - LOCALstart)/4), only one available group (i.e., 2x8 CGRA tiles) will be allocated to the task. When the target task works on more than half of the local data range (i.e., TASKend - TASKstart > (LOCALend - LOCALstart)/2), the CGRA controller attempts to allocate all four groups (i.e., the entire 8x8 CGRA tile array) when available (otherwise, two groups are allocated). In addition, there are four queues in the controller to temporarily hold the spawned task tokens, which are coalesced by the Coalescing Unit if any two of them imply a continuous task data range and share the same TASKid and PARAM. When there are insufficient slots in the queues, the CGRA controller stops fetching tokens from the WaitQueue in the Task Dispatcher. Deadlock can be avoided by providing a memory attached to the CGRA controller for storing the over-spawned task tokens.
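The allocation and coalescing rules above can be summarized by the following illustrative pseudocode; the helper names (groups_for_task, can_coalesce) and the inclusive-range treatment are assumptions, and the controller implements these checks in hardware rather than software.

  // Number of 2x8 tile groups granted to a waiting task (1, 2, or 4).
  int groups_for_task(const TaskToken& t, Address local_start, Address local_end) {
    Address task_range  = t.TASKend - t.TASKstart;
    Address local_range = local_end - local_start;
    if (task_range < local_range / 4) return 1;  // one 2x8 group
    if (task_range > local_range / 2) return 4;  // all four groups (8x8), if free
    return 2;                                    // otherwise two groups (4x8)
  }

  // Two spawned tokens can be merged when their data ranges are contiguous and
  // they carry the same TASKid and PARAM.
  bool can_coalesce(const TaskToken& a, const TaskToken& b) {
    return a.TASKid == b.TASKid && a.PARAM == b.PARAM &&
           (a.TASKend + 1 == b.TASKstart || b.TASKend + 1 == a.TASKstart);
  }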
Fig. 8: ARENA data-centric execution model development procedure with compiler toolchain. (Application tasks and loop kernels are vectorized and flattened, DFGs are generated and mapped by the mapping algorithm, and the resulting control signals and data are pre-loaded onto the hardware task dispatcher and CGRA controller before the data-centric runtime executes on ARENA with CGRAs.)

Compiler Toolchain for CGRA – Figure 8 shows the development procedure for ARENA using the CGRA cluster as the backend. As already mentioned, the CGRA cluster is just one design choice; the ARENA execution model can be realized on alternative backends as long as the HAF APIs (Table 1) are realized. For example, on a CPU cluster backend, we can adopt MPI non-blocking-send primitives [29] to realize the HAF APIs; we use this as one of our baselines in the evaluation. Here, to support the CGRA-cluster backend, an LLVM [42] based design automation toolchain is developed. In particular, a kernel that typically includes a multi-level nested loop is described as a task. As shown in Figure 8, to synthesize an appropriate mapping for a task on the allocated CGRA tiles, the nested loop is first vectorized with a factor that can fully leverage the relatively larger CGRA tile configurations (e.g., 8x8 tiles) in the vectorization pass. Then, the remaining loops are flattened, generating the Control-Data Flow Graph (CDFG) representation, which is an extension of the DFG with control dependence edges. We implemented a heuristic method [39] to map the CDFG onto the various combinations of tiles (i.e., 2x8, 4x8, and 8x8 tiles) and produce their control signals.
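For the CPU-cluster baseline mentioned above, the Hardware Abstract Functions can be emulated in software; the sketch below shows one plausible mapping of token arrival and ring forwarding onto MPI non-blocking primitives, with RING_TAG and the single outstanding-send slot being illustrative choices rather than the paper's actual implementation.

  #include <mpi.h>

  static const int RING_TAG = 7;  // illustrative tag for ring task tokens

  // Non-blocking check for an incoming token; returns an all-zero token
  // (interpreted as "nothing arrived") when no message is pending.
  TaskToken ARENA_arrive() {
    TaskToken token = {};
    int pending = 0;
    MPI_Iprobe(MPI_ANY_SOURCE, RING_TAG, MPI_COMM_WORLD, &pending, MPI_STATUS_IGNORE);
    if (pending) {
      MPI_Recv(&token, sizeof(token), MPI_BYTE, MPI_ANY_SOURCE, RING_TAG,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    return token;
  }

  // Forward a token to the next node in the ring with a non-blocking send,
  // keeping one persistent slot so the buffer stays valid until completion.
  void ring_forward(const TaskToken& token) {
    static TaskToken slot;
    static MPI_Request req = MPI_REQUEST_NULL;
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (req != MPI_REQUEST_NULL) MPI_Wait(&req, MPI_STATUS_IGNORE);  // finish prior send
    slot = token;
    MPI_Isend(&slot, sizeof(slot), MPI_BYTE, (rank + 1) % size, RING_TAG,
              MPI_COMM_WORLD, &req);
  }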
5 EVALUATION
The ARENA runtime is evaluated on traditional CPU HPC clusters and on our proposed CGRA-based ARENA cluster. For the latter, we extend the Structural Simulation Toolkit (SST) [62] to model a multi-node cluster based on MPI. We model the network topology and packet-transmit switch in SST using the MACRELS analytic model [70]. The token-transmit network is modeled as a 1D torus ring. We implement the task dispatcher, CGRA, and CGRA controller in PyMTL [47], which can report cycle-accurate simulation results for a single node. To obtain the cycle-level simulation results, we implement the dispatcher interface in SST to handle the task tokens. Finally, we feed the single-node results to SST and generate synthesizable Verilog for power, area, and timing analysis. The detailed simulation parameters are listed in Table 2.
Technology           45nm
Network Interface    80 Gb/s
Network Topology     1D Torus Ring
Network Switch
Dispatcher           Filter logic, 8-entry receive queue, 8-entry wait queue, 8-entry send queue
CPU (baseline)       2.6GHz, 20MB 3-level cache, out-of-order, x86
CGRA                 8x8 tiles
CGRA Controller

TABLE 2: RTL and simulation parameters for ARENA.
Applications – We evaluate ARENA using representative HPC and data-analytics workloads. The single-source shortest paths (SSSP) problem (see Section 3.1) is a key subroutine in many data-intensive graph computations. General matrix multiply (GEMM) is the core function of linear algebra and deep learning workloads; we assume the matrices are distributed among nodes. Sparse matrix-vector multiplication (SPMV) is a fundamental kernel in many scientific and data applications; here, the distributed matrix is in the Compressed Sparse Row (CSR) format. DNA sequence alignment (DNA) [37] leverages the Needleman-Wunsch (NW) algorithm to search for the best-matched protein sequences with respect to the target pattern. A graph convolutional network (GCN) [28] inference application on the Cora dataset is also evaluated, representing emerging irregular machine learning workloads; we assume the adjacency and feature matrices are distributed among nodes. Finally, an N-body simulation application [31] for simulating a dynamical particle system is demonstrated, representing traditional scientific simulation workloads. Again, the particle information (e.g., accelerations, velocities, positions, collisions, etc.) is distributively stored and required to be updated per iteration at runtime. The conventional compute-centric parallel implementations of all the evaluated applications are developed based on state-of-the-art algorithms or derived from widely used benchmark suites [22], [59]. For example, the SSSP application is implemented based on [19]. The GCN model is extracted from PyTorch Geometric [28]. The DNA application leverages the NW algorithm from Rodinia [22]. Regarding the ARENA implementation, all the applications are programmed following the ARENA programming model for data-centric asynchronous execution with runtime hardware specialization. Note that GCN and NBody contain numerous distinct functional tasks.
We show the benefits of the data-centric programming model, CGRA hardware acceleration, and the entire ARENA system.
Programming Model – We first show the performance effectiveness of ARENA's data-centric programming model. Figure 9 illustrates the normalized speedups of the conventional compute-centric model and of ARENA's model (both implemented in software based on MPI) with respect to a serial implementation on a single CPU node (i.e., the baseline). As can be seen, ARENA's data-centric execution model shows higher speedups and better scalability in general. This is mainly due to the elimination of synchronization and the minimization of data communication. On average, ARENA's data-centric model outperforms the compute-centric counterpart by 1.61× (i.e., 7.82/4.87) in a 16-node cluster. Specifically, the kernels with higher data parallelism (e.g., SSSP, GEMM, and SPMV) gain better scalability under both models; for kernels with limited data parallelism such as DNA, the compute-centric model exhibits lower scalability due to massive data dependencies and costly remote communication. In this situation, ARENA achieves better scalability by streaming task tokens over the nodes to minimize data movement. Figure 10 illustrates the normalized data movement breakdown for ARENA's data-centric model with respect to the compute-centric model for all applications in a 4-node cluster. Compared with the compute-centric HPC cluster, ARENA can eliminate on average 53.9% of the data movement without any prior knowledge about the data distribution, leading to substantial improvement in energy efficiency. We also observe different data movement patterns across applications. For example, SSSP shows considerable task movement, as it spawns massive fine-grained tasks with discrete data ranges, which are hard to coalesce. Regarding DNA, the compute-centric implementation is based on OpenMP, where all threads share the same copy in global memory [22]; distributing the sub-block workload to threads in a zig-zag manner incurs frequent data movement. In ARENA, the data dependency only exists on the edge of the sub-block, which can be explicitly labeled by the parent tasks using the ARENA User APIs (i.e., REMOTEstart and REMOTEend in ARENA_task_spawn()), therefore minimizing data movement. Finally, GEMM and NBody comprise coarse-grained tasks, and their task flows require data streaming among the nodes, leading to little task movement or essential data movement, as shown in the figure.
Fig. 9: Normalized speedup for compute-centric and ARENA's data-centric execution models running on different multi-CPU clusters w.r.t. a serial implementation on a single node.

Fig. 10: Normalized data movement breakdown in the data-centric model w.r.t. the compute-centric model (data movement in compute-centric vs. task movement and data movement in data-centric).

CGRA Speedup – Figure 12 shows the normalized speedup of the evaluated applications running on different configurations or combinations of ARENA's CGRA tile groups (each group is a 2x8 CGRA) with respect to the single-node CPU baseline. In general, a larger CGRA tile configuration leads to higher speedups. For DNA, however, the loop-carried data dependency limits the data parallelism, as well as the obtainable speedups (1.7× speedup at most). On average, 1.3×, 2.4×, and 3.5× speedups are achieved by ARENA's 2x8, 4x8, and 8x8 CGRAs across all the applications and kernels. The 2x8 CGRA exhibits the optimal area efficiency, showcasing the advantages of runtime hardware specialization.

Fig. 12: Normalized CGRA speedup w.r.t. the single-node baseline CPU execution without any acceleration.

Overall System – The normalized speedup of ARENA is shown in Figure 11. As can be seen, ARENA maximizes the overall performance by leveraging the data-centric execution model and the CGRA in a synergistic and integrated fashion. Instead of fixing the CGRA configuration for each workload, ARENA dynamically allocates and configures the CGRA tiles specifically for a particular task based on the task-carried specification (obtained from the data requirement of the task), as well as the current CGRA resource availability. On average, the compute-centric execution model using the entire CGRA for each kernel obtains a 10.06× speedup on a 16-node cluster, whereas ARENA achieves a 21.29× speedup. In other words, ARENA is 2.17× better than the compute-centric execution with CGRA support. This implies ARENA can leverage the CGRAs in a more efficient way. Compared with Figure 9, we can see that ARENA with CGRAs also gains better scalability (from 1.61× to 2.17× on a 16-node cluster). The performance of DNA does not improve much due to the limited acceleration from the CGRA. Finally, the compute-centric execution of GEMM does not scale well because synchronization over a larger amount of data creates serious performance bottlenecks.

Fig. 11: Normalized speedup for compute-centric and ARENA's data-centric execution models running on different multi-CGRA clusters w.r.t. a serial implementation on a single node without any acceleration.

We evaluate the timing, area, and power consumption of ARENA's CGRA cluster (e.g., CGRAs, CGRA controllers, and task dispatchers) using the synthesized Verilog HDL code from PyMTL. We use Synopsys Design Compiler, Cadence Innovus, and Synopsys PrimeTime PX to synthesize, place, route, and estimate the power consumption of the designs. We use FreePDK45 with the Nangate standard cell library. Figure 13 illustrates the obtained chip layout. The area and power of the 32KB scratchpad data memory are estimated based on CACTI-6.5. The chip area is 2.19mm x 1.24mm and the operating frequency is 800MHz @ 45nm with on average 759.8mW power consumption.

Fig. 13: Chip layout of a single ARENA node.
6 RELATED WORK
We summarize related work regarding the ring network, clusters of reconfigurable architecture, and the task-based execution model in a ring network.
Ring Network. The ring network has been adopted in real multiprocessor designs [11] (e.g., Intel Xeon Phi [23]) and logically studied for efficient collective communication [8], [10], [44], [45], [46], [72]. On the one hand, the ring network provides a simple routing mechanism to better utilize the link bandwidth for fast communication [11]. On the other hand, the ring network has been criticized for easy saturation due to linearly increased latency with more nodes [7], [35], [46]. Previous works [8], [45], [46], [72] extend the ring concept to form local-ring networks among subsets of nodes, known as routerless networks. In ARENA, we avoid the saturation problem through (a) a routerless task execution model among nodes, and (b) the dynamic task allocation and dispatching mechanism.
Reconfigurable Hardware Cluster. Clusters incorporating reconfigurable devices such as FPGAs [26], [41], [61], [65] and CGRAs [36], [60], [66] have already been showcased by existing works [48], [67]. On one hand, Putnam et al. from Microsoft propose the reconfigurable Catapult fabric [61], where each instantiation consists of a 6x8 2D torus of Xilinx Stratix-V FPGAs. Every FPGA is connected to a CPU server via PCIe and directly links to other FPGAs through SAS cables. This 1632-node FPGA cluster has been adopted for document ranking in the Bing search engine; nearly a 95% performance increase has been demonstrated with only a 10% extra power budget. However, each FPGA in Catapult is specialized to a single application kernel during runtime. Reconfiguration takes several milliseconds, and even reloading models without changing the computation logic can take up to 250 microseconds. On the other hand, the data-centric execution model, which allows different tasks to work on the same set of data in an HPC node, requires rapid dynamic reconfiguration. Gazzano et al. propose R-Grid, a complete grid infrastructure for distributed high-performance computing using dynamically reconfigurable FPGAs [26]. Knodel et al. virtualize the FPGA resources and propose adapted service models in a cloud context [41]. Tarafdar et al. discuss how FPGAs of a cloud data center can be flexibly connected based on a logical kernel and a mapping file [65]. Zhang et al. adopt a cluster of six Xilinx VC709 FPGAs for cooperative convolutional neural network inference [75]. Regarding CGRAs, the SambaNova DataScale system incorporates 8 reconfigurable dataflow units (RDUs) derived from Plasticine [60], claiming higher performance than a thousand GPUs for training extremely large deep-learning models. PPA [57] exploits pipeline parallelism in streaming applications to create a CGRA-like pipeline to execute streaming media applications. Samsung proposes to adopt a CGRA cluster for medical volume image rendering [36]. HammerBlade [66] aims at designing a rack-scale cluster for ML and graphs; their ASIC is composed of general-purpose cores and specialized CGRAs (e.g., Chimera [74]).
Asynchronous Task Execution.
Asynchronous task execution has been studied in many graph processing frameworks [13], [52], [53]. They propose software-level data-centric approaches to implement irregular graph kernels on multi-node clusters [53] and GPU platforms [52]. The work most relevant to ARENA is Groute [13], an asynchronous runtime environment for processing irregular graphs. In Groute, the GPUs form a logical ring network and each GPU maintains a worklist; tasks are encapsulated as messages passed along the ring. ARENA is motivated by Groute. However, Groute is a pure software implementation based on general-purpose GPUs, and thus cannot benefit from hardware specialization. Additionally, in Groute, the routing policy, which specifies the action when input is received, as well as memory consistency and ownership, are all defined and maintained by the users, which can be complicated, tedious, and error-prone (e.g., if a data dependency occurs at runtime, users have to manually locate the data and fetch them in the corresponding tasks). In ARENA, these are designed and supported by hardware. Furthermore, global coordination and work counting in Groute are centralized and managed by the CPU, whereas in ARENA they are all processed in a distributed manner.
7 CONCLUSION
In this paper, we propose an asynchronous-reconfigurable-accelerator-ring architecture for next-generation data-driven high-performance computing. Through the co-design of architecture and programming model, ARENA is able to bring computation tasks, in the form of CGRA-specialized hardware accelerators, to the data, rather than the reverse as in contemporary compute-centric and dataflow architectures, significantly improving performance and reducing data movement.

ACKNOWLEDGEMENT

This work was mainly supported by the Compute-Flow-Architecture project under PNNL's DMC LDRD Initiative. It was also supported by the SO(DA) and FALLACY projects under DMC. The evaluation platforms were supported by the U.S. DOE Office of Science, Office of Advanced Scientific Computing Research, under award 66150: "CENATE - Center for Advanced Architecture Evaluation". The Pacific Northwest National Laboratory is operated by Battelle for the U.S. Department of Energy under contract DE-AC05-76RL01830.

REFERENCES
[7] First International Symposium on Networks-on-Chip (NOCS'07), pages 18–29. IEEE, 2007.
[8] F. Alazemi, A. Azizimazreah, B. Bose, and L. Chen. Routerless network-on-chip. pages 492–503. IEEE, 2018.
[9] P. Alvaro, T. Condie, N. Conway, K. Elmeleegy, J. M. Hellerstein, and R. C. Sears. BOOM: Data-centric programming in the datacenter. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-113, 2009.
[10] A. A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda. Efficient large message broadcast using NCCL and CUDA-aware MPI for deep learning. In Proceedings of the 23rd European MPI Users' Group Meeting, pages 15–22, 2016.
[11] L. A. Barroso and M. Dubois. The performance of cache-coherent ring-based multiprocessors. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 268–277, 1993.
[12] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. Legion: Expressing locality and independence with logical regions. In SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE, 2012.
[13] T. Ben-Nun, M. Sutton, S. Pai, and K. Pingali. Groute: An asynchronous multi-GPU programming model for irregular computations. ACM SIGPLAN Notices, 52(8):235–248, 2017.
[14] R. H. Bisseling and W. F. McColl. Scientific computing on bulk synchronous parallel architectures. 1993.
[15] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, 1996.
[16] D. Bonachea and P. Hargrove. GASNet Specification, v1.8.1. 2017.
[17] D. Bonachea and P. H. Hargrove. GASNet-EX: A high-performance, portable communication library for exascale. In International Workshop on Languages and Compilers for Parallel Computing, pages 138–158. Springer, 2018.
[18] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, and O. Mutlu. Google workloads for consumer devices: Mitigating data movement bottlenecks. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 316–331, 2018.
[19] A. Buluç and K. Madduri. Parallel breadth-first search on distributed memory systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12, 2011.
[20] B. L. Chamberlain, D. Callahan, and H. P. Zima. Parallel programmability and the Chapel language. The International Journal of High Performance Computing Applications, 21(3):291–312, 2007.
[21] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. Von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. ACM SIGPLAN Notices, 40(10):519–538, 2005.
[22] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. pages 44–54. IEEE, 2009.
[23] G. Chrysos. Intel Xeon Phi coprocessor (codename Knights Corner). pages 1–31. IEEE, 2012.
[24] B. De Sutter, P. Raghavan, and A. Lambrechts. Coarse-grained reconfigurable array architectures. In Handbook of Signal Processing Systems, pages 427–472. Springer, 2019.
[25] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[26] J. Dondo Gazzano, F. Sanchez Molina, F. Rincon, and J. C. López. Integrating reconfigurable hardware-based grid for high performance computing. The Scientific World Journal, 2015, 2015.
[27] J. Dongarra, L. Grigori, and N. J. Higham. Numerical algorithms for high-performance computational science. Philosophical Transactions of the Royal Society A, 378(2166):20190066, 2020.
[28] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428, 2019.
[29] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In D. Kranzlmüller, P. Kacsuk, and J. Dongarra, editors, European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 97–104. Springer, 2004.
[30] T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Xu, R. Patel, and M. Herbordt. FPDeep: Acceleration and load balancing of CNN training on FPGA clusters. pages 81–84. IEEE, 2018.
[31] S. Green. Particle simulation using CUDA. NVIDIA whitepaper, 6:121–128, 2010.
[32] M. Hamzeh, A. Shrivastava, and S. Vrudhula. Branch-aware loop mapping on CGRAs. In Proceedings of the 51st Annual Design Automation Conference, pages 1–6, 2014.
[33] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang. Applied machine learning at Facebook: A datacenter infrastructure perspective. pages 620–629. IEEE, 2018.
[34] M. Horowitz. 1.1 Computing's energy problem (and what we can do about it). pages 10–14. IEEE, 2014.
[35] N. E. Jerger and L.-S. Peh. On-chip networks. Synthesis Lectures on Computer Architecture, 4(1):1–141, 2009.
[36] S. Jin, S. Lee, M.-K. Chung, Y. Cho, and S. Ryu. Implementation of a volume rendering on coarse-grained reconfigurable multiprocessor. pages 243–246. IEEE, 2012.
[37] M. Johnson, I. Zaretskaya, Y. Raytselis, Y. Merezhuk, S. McGinnis, and T. L. Madden. NCBI BLAST: A better web interface. Volume 36, pages W5–W9, 2008.
[38] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon. In-datacenter performance analysis of a tensor processing unit. pages 1–12, 2017.
[39] M. Karunaratne, A. K. Mohite, T. Mitra, and L.-S. Peh. HyCUBE: A CGRA with reconfigurable single-cycle multi-hop interconnect. In Proceedings of the 54th Annual Design Automation Conference 2017, pages 1–6, 2017.
[40] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie. Quantifying the energy cost of data movement in scientific applications. pages 56–65. IEEE, 2013.
[41] O. Knodel, P. Lehmann, and R. G. Spallek. RC3E: Reconfigurable accelerators in data centres and their provision by adapted service models. pages 19–26. IEEE, 2016.
[42] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization (CGO 2004), pages 75–86. IEEE, 2004.
[43] K. Lee. Introducing Big Basin: Our next-generation AI hardware.
[44] IEEE Transactions on Parallel and Distributed Systems, 31(1):94–110, 2019.
[45] T.-R. Lin, D. Penney, M. Pedram, and L. Chen. A deep reinforcement learning framework for architectural exploration: A routerless NoC case study. IEEE, 2018.
[46] S. Liu, T. Chen, L. Li, X. Feng, Z. Xu, H. Chen, F. Chong, and Y. Chen. IMR: High-performance low-cost multi-ring NoCs. IEEE Transactions on Parallel and Distributed Systems, 27(6):1700–1712, 2015.
[47] D. Lockhart, G. Zibrat, and C. Batten. PyMTL: A unified framework for vertically integrated computer architecture research. pages 280–292. IEEE, 2014.
[48] J. C. Lyke, C. G. Christodoulou, G. A. Vera, and A. H. Edwards. An introduction to reconfigurable systems. Proceedings of the IEEE.
[50] D. Molka, D. Hackenberg, R. Schöne, and M. S. Müller. Characterizing the energy consumption of data transfers and arithmetic operations on x86-64 processors. In International Conference on Green Computing, pages 123–133. IEEE, 2010.
[51] G. E. Moore. Cramming more components onto integrated circuits, 1965.
[52] R. Nasre, M. Burtscher, and K. Pingali. Data-driven versus topology-driven irregular computations on GPUs. pages 463–474. IEEE, 2013.
[53] D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 456–471, 2013.
[54] NVIDIA. NVIDIA DGX-1 System Architecture White Paper, 2017.
[55] I. Ohmura, G. Morimoto, Y. Ohno, A. Hasegawa, and M. Taiji. MDGRAPE-4: A special-purpose computer system for molecular dynamics simulations.
Philosophical Transactions of the Royal Society A:Mathematical, Physical and Engineering Sciences , 372(2021):20130387,2014.[56] S. P´all, M. J. Abraham, C. Kutzner, B. Hess, and E. Lindahl. TacklingExascale Software Challenges in Molecular Dynamics Simulationswith GROMACS. In S. Markidis and E. Laure, editors,
SolvingSoftware Challenges for Exascale , pages 3–27, Cham, 2015. SpringerInternational Publishing.[57] H. Park, Y. Park, and S. Mahlke. Polymorphic pipeline array:a flexible multicore accelerator with virtualized execution formobile multimedia applications. In
Proceedings of the 42nd AnnualIEEE/ACM International Symposium on Microarchitecture , pages 370–380, 2009.[58] J. M. Perez, R. M. Badia, and J. Labarta. Handling task dependenciesunder strided and aliased references. In
Proceedings of the 24th ACMInternational Conference on Supercomputing , pages 263–274, 2010.[59] L.-N. Pouchet and S. Grauer-Gray. Polybench: The polyhedralbenchmark suite. , 2012.[60] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao,S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun. Plasticine: Areconfigurable architecture for parallel patterns. In ,pages 389–402. IEEE, 2017.[61] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constan-tinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray,M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka,J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, andD. Burger. A reconfigurable fabric for accelerating large-scaledatacenter services. In , pages 13–24. IEEE, 2014.[62] A. F. Rodrigues, G. R. Voskuilen, S. D. Hammond, and K. S.Hemmert. Structural Simulation Toolkit (SST). Technical report,[69] L. G. Valiant. A bridging model for parallel computation.
Commu-nications of the ACM , 33(8):103–111, 1990. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States),2016.[63] J. Shalf. The future of computing beyond Moore’s law.
PhilosophicalTransactions of the Royal Society A , 378(2166):20190061, 2020.[64] D. E. Shaw, J. Grossman, J. A. Bank, B. Batson, J. A. Butts, J. C.Chao, M. M. Deneroff, R. O. Dror, A. Even, C. H. Fenton, et al.Anton 2: raising the bar for performance and programmabilityin a special-purpose molecular dynamics supercomputer. In
SC’14: Proceedings of the International Conference for High PerformanceComputing, Networking, Storage and Analysis , pages 41–53. IEEE,2014.[65] N. Tarafdar, T. Lin, E. Fukuda, H. Bannazadeh, A. Leon-Garcia,and P. Chow. Enabling flexible network FPGA clusters in aheterogeneous cloud data center. In
Proceedings of the 2017ACM/SIGDA International Symposium on Field-Programmable GateArrays , pages 237–246, 2017.[66] M. B. Taylor, A. Sampson, C. Batten, Z. Zhang, L. Ceze, M. Oskin,and D. Richmond. The HammerBlade: An ML-Optimized Super-computer for ML and Graphs. https://sampl.cs.washington.edu/tvmconf/slides/Michael-Taylor-HammerBlade.pdf, 2020.[67] R. Tessier, K. Pocek, and A. DeHon. Reconfigurable computingarchitectures.
Proceedings of the IEEE , 103(3):332–354, 2015.[68] D. Trebotich, M. F. Adams, S. Molins, C. I. Steefel, and C. Shen.High-Resolution Simulation of Pore-Scale Reactive Transport Pro-cesses Associated with Carbon Sequestration.
Computing in Science
Engineering , 16(6):22–31, Nov 2014.[70] F. Versolatto and A. Tonello. An MTL Theory Approach for theSimulation of MIMO Power-Line Communication Channels.
PowerDelivery, IEEE Transactions on , 26:1710–1717, 07 2011.[71] J. S. Vetter, R. Brightwell, M. Gokhale, P. McCormick, R. Ross,J. Shalf, K. Antypas, D. Donofrio, T. Humble, C. Schuman, et al.Extreme heterogeneity 2018-productive computational science inthe era of extreme heterogeneity: Report for DOE ASCR workshopon extreme heterogeneity. Technical report, USDOE Office ofScience (SC), Washington, DC (United States), 2018.[72] L. Wang, L. Liu, J. Han, X. Wang, S. Yin, and S. Wei. AchievingFlexible Global Reconfiguration in NoCs using ReconfigurableRings.
IEEE Transactions on Parallel and Distributed Systems , 2019.[73] L.-W. Wang. Divide-and-conquer quantum mechanical materialsimulations with exascale supercomputers.
National Science Review ,1(4):604–617, 2014.[74] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee. CHIMAERA:a high-performance architecture with a tightly-coupled reconfig-urable functional unit.
ACM SIGARCH Computer Architecture News ,28(2):225–235, 2000.[75] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong. Energy-efficient CNN implementation on a deeply pipelined FPGA cluster.In
Proceedings of the 2016 International Symposium on Low PowerElectronics and Design , pages 326–331, 2016., pages 326–331, 2016.