Programming Strategies for Irregular Algorithms on the Emu Chick
Eric Hein, Srinivas Eswar, Abdurrahman Yaşar, Jiajia Li, Jeffrey S. Young, Thomas M. Conte, Ümit V. Çatalyürek, Rich Vuduc, Jason Riedy, Bora Uçar
ERIC HEIN, Emu Technology
SRINIVAS ESWAR, ABDURRAHMAN YAŞAR, Georgia Institute of Technology
JIAJIA LI, Pacific Northwest National Laboratory
JEFFREY S. YOUNG, THOMAS M. CONTE, ÜMIT V. ÇATALYÜREK, RICH VUDUC, and JASON RIEDY, Georgia Institute of Technology
BORA UÇAR, CNRS and LIP, École Normale Supérieure de Lyon

CCS Concepts: • General and reference → Evaluation; • Theory of computation → Graph algorithms analysis; Parallel algorithms; Data structures design and analysis; • Computer systems organization → Multicore architectures; • Hardware → Emerging architectures.

ACM Reference Format:
Eric Hein, Srinivas Eswar, Abdurrahman Yaşar, Jiajia Li, Jeffrey S. Young, Thomas M. Conte, Ümit V. Çatalyürek, Rich Vuduc, Jason Riedy, and Bora Uçar. 2019. Programming Strategies for Irregular Algorithms on the Emu Chick. ACM Trans. Parallel Comput. 1, 1 (January 2019), 24 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
Abstract
The Emu Chick prototype implements migratory memory-side processing in a novel hardware system. Rather than transferring large amounts of data across the system interconnect, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each remote memory read. Previous work has characterized the performance of the Chick prototype in terms of memory bandwidth and programming differences from more typical, non-migratory platforms, but there has not yet been an analysis of algorithms on this system. This work evaluates irregular algorithms that could benefit from the lightweight, memory-side processing of the Chick and demonstrates techniques and optimization strategies for achieving performance in the sparse matrix-vector multiply operation (SpMV), breadth-first search (BFS), and graph alignment across up to eight distributed nodes encompassing 64 nodelets in the Chick system. We also define and justify relative metrics to compare prototype FPGA-based hardware with established ASIC architectures. The Chick currently supports up to 68x scaling for graph alignment, 80 MTEPS for BFS on balanced graphs, and 50% of measured STREAM bandwidth for SpMV.
Fig. 1. Emu architecture: The system consists of stationary processors for running the operating system and up to four Gossamer processors per nodelet tightly coupled to memory. The cache-less Gossamer processing cores are multi-threaded to both source sufficient memory references and also provide sufficient work with many outstanding references. The coupled memory’s narrow interface ensures high utilization for accesses smaller than typical cache lines.
Introduction

Analyzing data stored in irregular data structures such as graphs and sparse matrices is challenging for traditional architectures due to limited data locality in associated algorithms and performance costs related to data movement. The Emu architecture [23] is designed specifically to address these data movement costs in a power-efficient hardware environment by using a cache-less system built around “nodelets” (see Figure 1) that execute lightweight threads. These threads migrate on remote data reads rather than pulling data through a traditional cache hierarchy. The key differentiators for the Emu architecture are the use of cache-less processing cores, a high-radix network connecting distributed memory, and PGAS-based data placement and accesses. In short, the Emu architecture is designed to scale applications with poor data locality to supercomputing scale by more effectively utilizing available memory bandwidth and by dedicating limited power resources to networks and data movement rather than caches.

Previous work has investigated the initial Emu architecture design [23], algorithmic designs for merge and radix sorts on the Emu hardware [51], and baseline performance characteristics of the Emu Chick hardware [11, 36]. This investigation is focused on determining how irregular algorithms perform on the prototype Chick hardware and how we implement specific algorithms so that they can scale to a rack-scale Emu and beyond.

This study’s specific demonstrations include:
• The first characterization of the Emu Chick hardware using irregular algorithms including sparse matrix vector multiply (SpMV), graph analytics (BFS), and graph alignment. We also discuss programming strategies for the Emu such as replication (SpMV), remote writes to reduce migration (BFS), and data layout to reduce workload imbalance (graph alignment) that can be used to increase parallel performance on the Emu.
• Multi-node Emu results for BFS scaling up to 80 MTEPS and 1.28 GB/s on a balanced graph as well as an initial comparison of Emu-optimized code versus a naive Cilk implementation on x86.
• Multi-node results for SpMV scaling up to 50% of measured peak bandwidth on the Emu.
• Graph alignment results showing a 68x speedup when scaling from 1 to 256 threads on 8 nodelets with optimized data layout and comparison strategies.
The Emu Architecture

The Emu architecture focuses on improved random-access bandwidth scalability by migrating lightweight Gossamer threads, or “threadlets”, to data and emphasizing fine-grained memory access. A general Emu system consists of the following processing elements, as illustrated in Figure 1:
• A common stationary processor runs the OS (Linux) and manages storage and network devices.
• Nodelets combine narrowly banked memory with highly multi-threaded, cache-less Gossamer cores to provide a memory-centric environment for migrating threads.
These elements are combined into nodes that are connected by a RapidIO fabric. The current generation of Emu systems includes one stationary processor for each of the eight nodelets contained within a node. System-level storage is provided by SSDs. We talk more specifically about some of the prototype limitations of our Emu Chick in Section 4. More detailed descriptions of the Emu architecture are available [23], but this is a point in time description of the current implementation and its trade-offs.

For programmers, the Gossamer cores are transparent accelerators. The compiler infrastructure compiles the parallelized code for the Gossamer ISA, and the runtime infrastructure launches threads on the nodelets. Currently, one programs the Emu platform using Cilk [44], providing a path to running on the Emu for OpenMP programs whose translations to Cilk are straightforward. The current compiler supports the expression of task or fork-join parallelism through Cilk’s cilk_spawn and cilk_sync constructs, with a future Cilk Plus (Cilk+) software release in progress that would include cilk_for (the nearly direct analogue of OpenMP’s parallel for) as well as Cilk+ reducer objects. Many existing C and C++ OpenMP codes can translate almost directly to Cilk+.

A launched Gossamer thread only performs local reads. Any remote read triggers a migration, which will transfer the context of the reading thread to a processor local to the memory channel containing the data. Experience on high-latency thread migration systems like Charm++ identifies migration overhead as a critical factor even in highly regular scientific codes [1]. The Emu system minimizes thread migration overhead by limiting the size of a thread context, implementing the transfer efficiently in hardware, and integrating migration throughout the architecture. In particular, a Gossamer thread consists of 16 general-purpose registers, a program counter, a stack counter, and status information, for a total size of less than 200 bytes. The compiled executable is replicated across the cores to ensure that instruction access is always local. Limiting thread context size also reduces the cost of spawning new threads for dynamic data analysis workloads. Operating system requests are forwarded to the stationary control processors through the service queue.

The highly multi-threaded Gossamer cores read only local memory and do not have caches, avoiding cache coherency traffic. Additionally, “memory-side processors” provide atomic read or write operations that can be used to access small amounts of data without triggering unnecessary thread migrations. A node’s memory size is relatively large with standard DDR4 chips (64 GiB) but with multiple, Narrow-Channel DRAM (NCDRAM) memory channels (8 channels with 8-bit interfaces to the host using FIFO ordering). Each DIMM has a page size of 512B and a row size of 1024. The smaller bus means that each channel of NCDRAM has only 2 GB/s of bandwidth, but the system makes up for this by having many more independent channels. Because of this, it can sustain more simultaneous fine-grained accesses than a traditional system with fewer channels and the same peak memory bandwidth.
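Because cilk_for is not yet available, parallel loops in the benchmarks below are expressed with cilk_spawn and cilk_sync directly. The following C sketch illustrates the recursive spawn-tree idiom that stands in for a parallel for loop; it is our own illustration, not taken from the Emu toolchain, and the GRAIN constant and function names are illustrative choices.

    /* Minimal sketch: a parallel loop written with only cilk_spawn/cilk_sync. */
    #include <cilk/cilk.h>

    #define GRAIN 64                 /* elements handled per leaf thread (illustrative) */

    void process_range(long begin, long end, double *data)
    {
        for (long i = begin; i < end; ++i)
            data[i] *= 2.0;          /* placeholder leaf work */
    }

    void parallel_apply(long begin, long end, double *data)
    {
        if (end - begin <= GRAIN) {            /* small enough: do the work here */
            process_range(begin, end, data);
            return;
        }
        long mid = begin + (end - begin) / 2;  /* otherwise split and recurse */
        cilk_spawn parallel_apply(begin, mid, data);
        parallel_apply(mid, end, data);
        cilk_sync;
    }

The choice of grain size controls how many threadlets are in flight at once; the SpMV results below revisit this trade-off.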
Algorithms

We investigate programming strategies for three algorithms: 1) the standard (CSR) sparse matrix vector multiplication operation, 2) Graph500’s breadth-first search (BFS) benchmark, and 3) graph alignment, computing a potential partial mapping of the vertices of two graphs. These three algorithms cover a variety of sparse, irregular computations: the ubiquitous sparse matrix vector multiplication, filtered sparse matrix sparse vector multiplication (in BFS), and a variant of the sparse matrix - sparse matrix multiplication (in computing the similarities of vertices). In the following subsections we discuss how we implement these algorithms on the Emu platform.
SpMV

The matrix A is stored as a distributed CSR structure consisting of three arrays: row offsets, column indices, and values. The row offset array is striped across all nodelets and encodes the length of each row. Every row’s non-zero entries and column indices are allocated together and are present in the same nodelet, giving rise to the jagged arrays col and V shown in Figure 2. X is replicated across each nodelet and the output Y is striped across all nodelets.

Fig. 2. Emu-specific layout for CSR SpMV.

In the 2D allocation case, no thread migrations occur when accessing elements in the same row. A 1D striped layout incurs a migration for every element within a row to fetch the vector entry.
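For concreteness, a minimal C sketch of the per-row kernel under this layout follows. The identifier names (row_ptr, col, val, x_repl, y) are ours, not the implementation’s.

    /* Sketch of the per-row SpMV kernel: row i's indices and values live on one
     * nodelet, x_repl is that nodelet's replica of x, and y is striped so y[i]
     * is local to row i. */
    void spmv_row(long i, const long *row_ptr, const long *col,
                  const double *val, const double *x_repl, double *y)
    {
        double sum = 0.0;
        for (long k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x_repl[col[k]];   /* all reads are nodelet-local */
        y[i] = sum;
    }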
Synthetic Laplacian matrix inputs are created corresponding to a d-dimensional k-point stencil on a grid of length n in each dimension. For the tested synthetic matrices, d = 2 and k = 5, resulting in an n² × n² Laplacian with five diagonals. The tested real world matrices are listed in Table 3.
Breadth-First Search

Our in-memory graph layout is inspired by STINGER [24] so that computation can adapt to a changing environment [71]. Each vertex contains a pointer to a linked-list of edge blocks, each of which stores a fixed number of adjacent vertex IDs and a pointer to the next edge block. We use a striped array of pointers to distribute the vertex array across all nodelets in the system, such that vertex 0 is on nodelet 0, vertex 1 is on nodelet 1, and so on. We use STINGER rather than CSR to enable future work with streaming data and incremental algorithms [35], one of the primary targets of the Emu architecture. Note that breadth-first search is nearly equivalent to computing a filtered sparse matrix times sparse vector product [37].

To avoid the overhead of generic run-time memory allocation via malloc, each nodelet pre-allocates a local pool of edge blocks. A vertex can claim edge blocks from any pool, but it is desirable to string together edge blocks from the same pool to avoid thread migrations during edge list traversal. When the local pool is exhausted, the edge block allocator automatically moves to the pool on the next nodelet.

Table 1. Notations used in BFS.
    Symbol    Description
    V         Vertex set
    Q         Queue of vertices
    P         Parent array
    nP        New parent array
    Neig(v)   Neighbor vertices of v

Kernel 1 of the Graph500 benchmark involves constructing a graph data structure from a list of edges. In our implementation the list of edges is loaded from disk into memory on nodelet 0. Currently I/O is limited on the prototype Emu Chick, and loading indirectly assists in evaluating the rest of the architecture. We sort the list by the low bits of the source vertex ID to group together edges that will be on the same nodelet, then spawn threads to scatter the list across all the nodelets. Once the list has been scattered, each nodelet spawns more threads locally to insert each edge into the graph, allocating edge blocks from the local pool.

Algorithm 1: BFS algorithm using migrating threads
    P[v] ← −1, for ∀v ∈ V
    Q.push(root)
    while Q is not empty do
        for s ∈ Q do in parallel
            for d ∈ Neig(s) do in parallel
                ▷ Thread migrates reading P[d]
                if P[d] = −1 then
                    if compare_and_swap(P[d], −1, s) then
                        Q.push(d)
        Q.slide_window()

Our initial implementation of BFS (Algorithm 1) was a direct port of the STINGER code. Each vertex iterates through each of its neighbors and tries to set itself as the parent of that vertex using an atomic compare-and-swap operation. If the operation is successful, the neighbor vertex is added to the queue to be explored along with the next frontier.

On Emu, the parent array is striped across nodelets in the same way as the vertex array. Each nodelet contains a local queue so that threads can push vertices into the queue without migrating. At the beginning of each frontier, threads are spawned at each nodelet to explore the local queues. Thread migrations do occur whenever a thread attempts to claim a vertex that is located on a remote nodelet. In the common case, a thread reads an edge, migrates to the nodelet that owns the destination vertex, executes a compare-and-swap on the parent array, pushes into the local queue, and then migrates back to read the next edge. If the destination vertex happens to be local, no migration will occur when processing that edge.

An alternative BFS implementation (Algorithm 2) exploits the capability of the Emu system to efficiently perform remote writes. A copy of the parent array (nP) holds intermediate state during each frontier. Rather than migrating to the nodelet that contains the destination vertex, we perform a remote write on the nP array. The remote write packet can travel through the network and complete asynchronously while the thread that created it continues to traverse the edge list. Remote writes attempting to claim the same vertex are serialized in the memory front end of the remote nodelet. Rather than attempting to synchronize these writes, we simply allow later writes to overwrite earlier ones. After all the remote writes have completed, we scan through the nP array looking for vertices that did not have a parent at the beginning of this frontier (P[v] = −1) but were assigned a parent in this iteration (nP[v] ≠ −1); for each such vertex, nP[v] is copied into the parent array at P[v]. This is similar to direction-optimizing BFS [9] and may be able to adopt its early termination optimizations.

Algorithm 2: BFS algorithm using remote writes
    P[v] ← −1, for ∀v ∈ V
    nP[v] ← −1, for ∀v ∈ V
    Q.push(root)
    while Q is not empty do
        for s ∈ Q do in parallel
            for d ∈ Neig(s) do in parallel
                ▷ Thread issues remote write to nP[d]
                nP[d] ← s
        cilk_sync
        for v ∈ V do in parallel
            if P[v] = −1 then
                if nP[v] ≠ −1 then
                    P[v] ← nP[v]
                    Q.push(v)
        Q.slide_window()
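A C sketch of the two phases of Algorithm 2 follows. It is a serial illustration under our own naming (queue, neighbors, new_parent, etc.); the actual implementation spawns threads per nodelet around these loops. On the Emu, the plain store to new_parent[d] becomes an asynchronous remote write when d lives on another nodelet; on x86 it is simply a benign racy store.

    /* Phase 1: mark candidate parents with fire-and-forget remote writes. */
    void expand_frontier(const long *queue, long queue_len,
                         const long *const *neighbors, const long *degree,
                         long *new_parent)
    {
        for (long qi = 0; qi < queue_len; ++qi) {
            long s = queue[qi];
            for (long j = 0; j < degree[s]; ++j) {
                long d = neighbors[s][j];
                new_parent[d] = s;          /* remote write; later writes may overwrite */
            }
        }
    }

    /* Phase 2: claim vertices that received a parent this iteration. */
    void claim_vertices(long num_vertices, long *parent,
                        const long *new_parent, long *queue, long *queue_len)
    {
        for (long v = 0; v < num_vertices; ++v)
            if (parent[v] == -1 && new_parent[v] != -1) {
                parent[v] = new_parent[v];
                queue[(*queue_len)++] = v;  /* add to the next frontier */
            }
    }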
1) butwere assigned a parent in this iteration ( nP [ v ] (cid:44) − nP [ v ] is copied into the parent array at P [ v ] . This issimilar to direction-optimizing BFS [9] and may be able to adopt its early termination optimizations. gsaNA : Parallel Similarity Computation Integrating data from heterogeneous sources is often modeled as merging graphs. Given twoor more compatible, but not necessarily isomorphic graphs, the first step is to identify a graphalignment , where a potentially partial mapping of the vertices of the two graphs is computed. Inthis work, we investigate the parallelization of gsaNA [68], which is a recent graph aligner thatuses the global structure of the graphs to significantly reduce the problem space and align largegraphs with a minimal loss of information. The proposed techniques are highly flexible, and theycan be used to achieve higher recall while being order of magnitudes faster than the current stateof the art [68].Briefly, gsaNA first reduces the problem space, then runs pairwise similarity computation be-tween two graphs. Although the problem space can be reduced significantly, the pairwise similaritycomputation step remains to be the most expensive part (more than 90% of the total executiontime). While gsaNA has an embarrassingly parallelizable nature for similarity computations, itsparallelization is not straightforward. This is because of the fact that gsaNA’s similarity functionis composed of multiple components, with some only depending on graph structure and othersdepending also on the additional metadata (types and attributes). All of these components comparevertices from two graphs and/or their neighborhood. Hence, the similarity computation step has ahighly irregular data access pattern. To reduce this irregularity, we store the metadata of a vertex’sneighborhood in sorted arrays. While arranging metadata helps to decrease irregularity, data accessremains a problem because of the skewed nature of real-world graphs. Similarity computations a) All (b) Pair (c) LayoutFig. 3. gsaNA : Task definition & bucket and vertex partition among the nodelets respecting the Hilbert-curveorder. Table 2. Notations used in gsaNA . Symbol Description V , V Vertex sets E , E Edge sets QT , QT Quad-trees of the graphs QT i . N eiд ( B ) Neighboring buckets of B in QT i σ ( u , v ) Similarity score for u ∈ V and v ∈ V N ( u ) Adjacency list of u ∈ V i A ( u ) Vertex attribute of u ∈ V i RW ( f (·)) Number of required memory Reads & Writes to execute given function, f (·) require accessing different portions of the graph simultaneously. In [69] authors provide paralleliza-tion strategies for different stages of gsaNA. However, because of the differences in the architectureand the parallelization framework, the earlier techniques cannot be applied to EMU Chick in astraightforward manner. Hence, in this work, we investigate two parallelization strategies forsimilarity computations and also two graph layout strategies on Emu Chick.gsaNA places vertices into a 2 D plane using a graph’s global structure information. The intuitionis that similar vertices should also have similar structural properties, and they should be placedclosely on the 2 D plane. When all vertices are placed, gsaNA partitions space into buckets in aquad-tree like fashion. Then, a task for similarity computation becomes the pairwise comparisonof the vertices in a bucket with vertices in the neighboring buckets. 
For example, in Figures 3a-3b the vertices in the yellow colored bucket are compared with vertices in the yellow and red colored buckets. We investigate two parallel similarity computation schemes and two vertex layout schemes.

Algorithm 3: parallelSim(QT₁, QT₂, k, σ)
    P[v] ← ∅, for ∀v ∈ V₁
    for each non-empty B ∈ QT₁ do
        cilk_spawn compSim(B, QT₂.Neig(B), P, σ)
    cilk_sync
    return P

In the All Comparison scheme, Alg. 3 first spawns a thread for each non-empty bucket B ∈ QT₁, where compSim is instantiated with compSimAll shown in Alg. 4. This function computes the similarity scores for each vertex v ∈ B with each vertex u ∈ B′, where B′ ∈ QT₂.Neig(B). Afterwards, the top k similar vertices are identified and stored in P[v]. This approach has two main disadvantages. First, the number of parallel tasks is limited by the number of buckets. Second, since the space is partitioned in a quad-tree-like manner, this scheme may lead to load imbalance. This technique is illustrated in Figure 3a.

Algorithm 4: compSimAll(B, NB, P, σ)
    ▷ For each vertex keep a priority list with top k elements.
    for each v ∈ B do
        for each B′ ∈ NB do
            for each u ∈ B′ do
                P[v].insert(u)   ▷ Only keeps top k
    return P
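The bounded “insert, keep top k” step used in Alg. 4 (and later in Alg. 5) can be realized as a small sorted array per vertex. The C sketch below is our own illustration; the struct layout, the value of K, and the names are assumptions, not gsaNA’s implementation.

    #define K 5                       /* illustrative top-k size */

    typedef struct {
        long   id[K];                 /* candidate vertex ids, best first      */
        double score[K];              /* corresponding similarity scores       */
        int    len;                   /* how many slots are currently occupied */
    } topk_t;

    void topk_insert(topk_t *t, long cand, double s)
    {
        if (t->len == K && s <= t->score[K - 1])
            return;                                  /* worse than current worst */
        int i = (t->len < K) ? t->len++ : K - 1;     /* slot to fill or evict     */
        while (i > 0 && t->score[i - 1] < s) {       /* shift worse entries down  */
            t->id[i] = t->id[i - 1];
            t->score[i] = t->score[i - 1];
            --i;
        }
        t->id[i] = cand;
        t->score[i] = s;
    }

In the PAIR scheme described next, several tasks may insert into the same vertex’s list concurrently, which is the source of the additional synchronization cost discussed below.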
In the Pair Comparison scheme, Alg. 3 first spawns a thread for each non-empty bucket B ∈ QT₁, where compSim is instantiated with compSimPair shown in Alg. 5. Then, for each ⟨B, B′⟩ pair where B ∈ QT₁ and B′ ∈ QT₂.Neig(B), compSimPairAux is spawned. Next we compute pairwise similarity scores of vertices between these bucket pairs and return intermediate similarity scores (see Alg. 5). Finally, we merge these intermediate results in Alg. 5. This scheme spawns many more threads than the previous one. This technique is illustrated in Figure 3b.

Algorithm 5: compSimPair(B, NB, P, σ)
    pB′ ← ∅, for ∀B′ ∈ NB
    for each B′ ∈ NB do
        cilk_spawn compSimPairAux(B, B′, pB′, σ)
    cilk_sync
    P ← merge(pB′, ∀B′ ∈ NB)
    return P

    def compSimPairAux(B, B′, P, σ):
        ▷ For each vertex keep a priority list with top k elements.
        for each v ∈ B do
            for each u ∈ B′ do
                P[v].insert(u)   ▷ Only keeps top k
        return P

In the ALL comparison scheme, the number of threads is limited by the number of buckets; therefore the achievable scalability is limited. Moreover, because of the coarse grain composition of the tasks this scheme may lead to high load imbalance. Sorting tasks based on their loads in a non-increasing order can be a possible optimization/heuristic for reducing imbalance.

The PAIR comparison scheme reduces the load imbalance at the cost of the additional synchronization that arises during the insertion in Alg. 4. The task list is randomly shuffled to decrease the possibility of concurrent update requests to a vertex’s queue.

Note that while ALL is akin to vertex-centric partitioning, PAIR is akin to edge-based partitioning. The vertices and edges here refer to the task graph.
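The C/Cilk sketch below illustrates the per-pair spawning of the PAIR scheme. It reuses the topk_t helper from the earlier sketch; the bucket layout, sigma stub, and merging detail are our own assumptions rather than gsaNA’s code.

    #include <cilk/cilk.h>

    double sigma(long u, long v);            /* similarity score; stub, defined elsewhere */

    typedef struct { long *verts; long n; } bucket_t;

    /* Compare every vertex of B against every vertex of Bprime, keeping the
     * top-k candidates per vertex of B in 'partial'. */
    static void compare_buckets(const bucket_t *B, const bucket_t *Bprime,
                                topk_t *partial)
    {
        for (long i = 0; i < B->n; ++i)
            for (long j = 0; j < Bprime->n; ++j)
                topk_insert(&partial[i], Bprime->verts[j],
                            sigma(B->verts[i], Bprime->verts[j]));
    }

    /* PAIR scheme: one spawned task per (B, B') pair, each writing its own
     * partial result array; the merge after the sync is omitted here. */
    void pair_similarity(const bucket_t *B, bucket_t *const *neigh,
                         int num_neigh, topk_t **partial)
    {
        for (int i = 0; i < num_neigh; ++i)
            cilk_spawn compare_buckets(B, neigh[i], partial[i]);
        cilk_sync;
    }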
In the Block partitioned (BLK) layout, the vertices are partitioned among the nodelets based on their IDs, independent of their placement in the 2D plane. The buckets are also partitioned among the nodelets independently. That is, each nodelet stores an equal number of vertices and buckets. A vertex’s metadata is also stored on the same nodelet as the corresponding vertex. With the two computational schemes, vertices in the same bucket may be in different nodelets, leading to many thread migrations. In the Hilbert-curve based (HCB) layout (shown in Fig. 3c), the vertices and buckets are partitioned among nodelets based on their Hilbert orders. To achieve this, after all vertices are inserted in the quad-tree, we sort buckets based on their Hilbert orders. Then, we rename every vertex in a bucket according to the bucket’s rank (i.e., vertices in the first bucket, B₁, have labels starting from 0 to |B₁| − 1). As in BLK, a vertex’s metadata is also stored on the same nodelet as the corresponding vertex. Here, all vertices in the same bucket are in the same nodelet, and hence there is in general less migration. While BLK may lead to a better workload balance (equal number of similarity computations per nodelet), HCB may lead to a workload imbalance if two buckets with a high number of neighbors are placed into the same nodelet.
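The relabeling step of the HCB layout can be sketched in C as follows: buckets are sorted by their Hilbert index and vertices receive new consecutive IDs bucket by bucket, so each bucket occupies a contiguous ID range that can be mapped to a single nodelet. The struct fields, the hilbert_key value, and the function names are illustrative assumptions.

    #include <stdlib.h>

    typedef struct { long *verts; long n; unsigned long hilbert_key; } hbucket_t;

    static int cmp_hilbert(const void *a, const void *b)
    {
        unsigned long ka = ((const hbucket_t *)a)->hilbert_key;
        unsigned long kb = ((const hbucket_t *)b)->hilbert_key;
        return (ka > kb) - (ka < kb);
    }

    /* new_id must be sized to hold one entry per original vertex id. */
    void relabel_by_hilbert(hbucket_t *buckets, long num_buckets, long *new_id)
    {
        qsort(buckets, num_buckets, sizeof(hbucket_t), cmp_hilbert);
        long next = 0;
        for (long b = 0; b < num_buckets; ++b)          /* bucket rank order     */
            for (long i = 0; i < buckets[b].n; ++i)
                new_id[buckets[b].verts[i]] = next++;   /* contiguous per bucket */
    }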
Experimental Setup

The Emu Chick prototype is still in active development. The current hardware iteration uses an Arria 10 FPGA on each node card to implement the Gossamer cores, the migration engine, and the stationary cores. Several aspects of the system are scaled down in the current prototype with respect to the next-generation system, which will use larger and faster FPGAs to implement computation and thread migration. The current Emu Chick prototype has the following features and limitations:
• Our system has one Gossamer Core (GC) per nodelet with a concurrent max of 64 threadlets. The next-generation system will have four GCs per nodelet, supporting 256 threadlets per nodelet.
• Our GCs are clocked at 175 MHz rather than the planned 300 MHz in the next-generation Emu system.
• The Emu’s DDR4 DRAM modules are clocked at 1600 MHz rather than the full 2133 MHz. Each node has a peak theoretical bandwidth of 12.8 GB/s.
• CPU comparisons are made on a four-socket, 2.2 GHz Xeon E7-4850 v3 (Haswell) machine with 2 TiB of DDR4 with memory clocked at 1333 MHz (although it is rated for 2133 MHz). Each socket has a peak theoretical bandwidth of 42.7 GB/s.
• The current Emu software version provides support for C++ but does not yet include functionality to translate Cilk Plus features like cilk_for or Cilk reducers. All benchmarks currently use cilk_spawn directly, which also allows more control over spawning strategies.
All experiments are run using Emu’s 18.09 compiler and simulator toolchain, and the Emu Chick system is running NCDIMM firmware version 2.5.1, system controller software version 3.1.0, and each stationary core is running the 2.2.3 version of software. We present results for several configurations of the Emu system:
• Emu Chick single-node (SN): one node; 8 nodelets
• Emu Chick multi-node (MN): 8 nodes; 64 nodelets
• Simulator results are excluded from this study as previous work [36] has shown simulated scaling numbers for SpMV and STREAM on future Emu systems. We prioritize multi-node results on hardware.

Application inputs are selected from the following sources:
• The SpMV experiments use synthetic Laplacian matrices, and real-world inputs are selected from the SuiteSparse sparse matrix collection [21]. Each Laplacian consists of a five-point stencil which is a pentadiagonal matrix.
• BFS uses RMAT graphs as specified by Graph500 [6] and uniform random (Erdös-Rényi) graphs [72], scale 15 through 21, from the generator in the STINGER codebase (https://github.com/stingergraph/stinger/commit/149d5b562cb8685036517741bd6a91d62cb89631).
• gsaNA uses DBLP [55] graphs from years 2015 and 2017 that have been created previously [68]. Detailed description of these graphs is provided in Section 5.3.

The Emu Chick is essentially a memory-to-memory architecture, so we primarily present results in terms of memory bandwidth and effective bandwidth utilization. But comparing a new and novel processor architecture (Emu) built on FPGAs to a well-established and optimized architecture built on ASICs (Haswell) is difficult. Measuring bandwidth on the Haswell with the STREAM benchmark [48] achieves much more of the theoretical peak memory bandwidth. The Emu Chick, however, implements a full processor on an FPGA and cannot take advantage of deeply pipelined logic that gives boosts to pure-FPGA accelerators, and thus cannot achieve much of the theoretical hardware peak. If we compare bandwidths against the DRAM peaks, prototype novel architectures like the Chick almost never appear competitive. Comparing against measured peak bandwidth may provide an overly optimistic view of the prototype hardware.

We have chosen to primarily consider percentage of measured peak bandwidth given an idealized problem model, but also report the raw bandwidth results. For integer SpMV and BFS, the natural measures of IOPS (integer operations per second) and TEPS (traversed edges per second) are proportional to the idealized effective bandwidth.

Our more recent tests have shown that the Emu hardware can achieve up to 1.6 GB/s per node and 12.8 GB/s on 8 nodes for the STREAM benchmark, which is used as the measured peak memory bandwidth number. This increase in STREAM bandwidth from previous work [36] is primarily due to clock rate increases and bug fixes to improve system stability. Meanwhile, our four-socket, 2.2 GHz Haswell with 1333 MHz memory achieves 100 GB/s, or 25 GB/s per NUMA domain. So the Emu FPGA-emulated processors achieve 11.7% of the theoretical peak, while the ASIC Haswell processors achieve 58.6%. Note that we run with NUMA interleaving enabled, so many accesses cross the slower QPI links. This provides the best Haswell performance for our pointer chasing benchmark [36]. Disabling NUMA page interleaving brings the Haswell STREAM performance to 160 GB/s, which is 94% of the theoretical peak.
SpMV Results

We first look at the effects of replication on the Emu, that is, whether replicating the vector x in Fig. 2 provides a substantial benefit when compared to striping x across nodelets in the “no replication” case.

Effective bandwidth is the primary metric measured in our experiments. It is calculated from the minimum number of bytes needed to complete the computation. On cache-based architectures this is equivalent to the compulsory misses. For SpMV it is approximated by

    BW = (sizeof(A) + sizeof(x) + sizeof(y)) / time.

The numerator is a trivial lower bound on data moved, since it counts only one load of A (which enjoys no reuse) and one load each of the two vectors, x and y (assuming maximal reuse). The motivation for ignoring multiple loads of x or y is that ideally on a cache-based architecture with a “well-ordered” matrix, the vectors are cached and the computation is bandwidth-limited by the time to read A.
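A small C helper makes the metric concrete. The breakdown of sizeof(A) into row offsets, column indices, and values, and the assumption of a square matrix, are our own accounting choices; the paper only specifies the sum of the three array sizes divided by time.

    #include <stddef.h>

    /* Effective-bandwidth lower bound for CSR SpMV, in bytes per second. */
    double spmv_effective_bw(long nnz, long nrows,
                             size_t idx_bytes, size_t val_bytes, double seconds)
    {
        double bytes_A = (double)nnz * (idx_bytes + val_bytes)    /* col + val   */
                       + (double)(nrows + 1) * idx_bytes;         /* row offsets */
        double bytes_x = (double)nrows * val_bytes;               /* one load of x */
        double bytes_y = (double)nrows * val_bytes;               /* one store of y */
        return (bytes_A + bytes_x + bytes_y) / seconds;
    }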
Fig. 4. SpMV Laplacian Stencil Bandwidth, No Replication (8 nodelets). Bandwidth (MB/s) vs. matrix size (MB) for varying numbers of threads.

Fig. 5. SpMV Laplacian Stencil Bandwidth, Replication (8 nodelets). Bandwidth (MB/s) vs. matrix size (MB) for varying numbers of threads.
Figure 4 shows that the choice of grain size, or work assigned to a thread, can dramatically affect performance for the non-replicated test case. The unit of work here is the number of rows assigned to each thread. The default fixed grain size of 16, while competitive for smaller graphs, does not scale well to the entire node. For small grain sizes, too many threads are created with little work per thread, resulting in slowdown due to thread creation overhead. A dynamic grain size calculation is preferred to keep the maximum number of threads in flight, as can be seen with the peak bandwidth achieved with 256 and 512 threads for a single node.

Figure 5 shows the effects of replication in SpMV. Interestingly, for the largest matrix size both Figures 4 and 5 have similar bandwidths, which indicates good scaling for larger data sizes without replication at the potential cost of thread migration hotspots on certain nodelets. However, we note that using replication leads to much more regular scaling curves across different numbers of threads and grain sizes.

Figure 6 shows scaling of multi-node (64 nodelets) using replication and different numbers of threads. The best run of SpMV achieves 6.11 GB/s with 4096 threads, which is 50.8% of our updated STREAM bandwidth number. However, it should also be noted from this figure that the best scalability for all sizes (including smaller inputs) is achieved using 2048 threads.

Fig. 6. SpMV Laplacian Replicated - multinode (64 nodelets).

Table 3. SpMV multinode bandwidths (in MB/s) for real world graphs [21] along with matrix dimension, number of non-zeros (NNZ), and the average and maximum row degrees. Run with 4K threads.
    Matrix     Rows    NNZ     Avg Deg  Max Deg  BW
    mc2depi    526K    2.1M     3.99         4   3870.31
    ecology1   1.0M    5.0M     5.00         5   4425.61
    amazon03   401K    3.2M     7.99        10   4494.79
    Delor295   296K    2.4M     8.12        11   4492.47
    roadNet-   1.39M   3.84M    2.76        12   3811.57
    mac_econ   206K    1.27M    6.17        44   3735.54
    cop20k_A   121K    2.62M   21.65        81   4520.05
    watson_2   352K    1.85M    5.25        93   3486.30
    ca2010     710K    3.49M    4.91       141   4075.97
    poisson3   86K     2.37M   27.74       145   4031.20
    gyro_k     17K     1.02M   58.82       360   2446.36
    vsp_fina   140K    1.1M     7.90       669   1335.59
    Stanford   282K    2.31M    8.20     38606    287.82
    ins2       309K    2.75M    8.89    309412     43.91

Table 3 shows the multi-node (run with 2048 threads) bandwidth in MB/s for real-world graphs along with their average and maximum degree (non-zeros per row) values. The rows are sorted by maximum degree, and if we exclude the graphs with large maximum degree we see similar bandwidths. Most graphs showed bandwidths in excess of 600 MB/s and many were comparable to that of the synthetic Laplacians, which are very well structured. This behavior is in contrast to a cache-based system where we expect performance to increase with increasing degree. The Emu hardware demonstrates good performance independent of the structure of the graph, even ones with high-degree vertices.
For the high maximum degree graphs (Stanford, ins2) we attribute the poor performance to load imbalance. Some of the rows in these graphs have a very high number of non-zeros. Since we only partition work at the row level, a single thread needs to process these large rows, and this load imbalance results in slow running times. Current hardware limitations prevent exploring mixed parallelism across and within matrix rows [58], leaving that level of performance benefit to future work.

BFS Results

Figure 7 compares the migrating threads and remote write BFS implementations for a “streaming” or unordered BFS implementation. With the migrating threads algorithm, each thread will generally incur one migration per edge traversed, with a low amount of work between migrations. The threads are blocked while migrating, and do not make forward progress until they can resume execution on the remote nodelet. In contrast, the remote writes algorithm allows each thread to fire off many remote, non-blocking writes, which improves the throughput of the system due to the smaller size of remote write packets.

Fig. 7. Comparison of remote writes versus migrating BFS on a multi-node Chick system for balanced (Erdös-Rényi) graphs (MTEPS and edge bandwidth in MB/s vs. graph scale). Marking members of the frontier with remote writes is more efficient than moving entire thread contexts back and forth between the edge list and the parent array.

Fig. 8. Compares the performance of BFS between balanced (Erdös-Rényi) graphs and unbalanced (RMAT) graphs, scales 15 through 21, running on a single node of the Emu Chick. Unbalanced graphs lead to an uneven work distribution and low performance.

The effective bandwidth for BFS on a graph with a given scale and an edge factor of 16 is as follows:

    BW = (2^scale × 16 × 16 bytes) / time = TEPS × 16 bytes.

This does not include bandwidth for flags or other state data structures and so is a lower bound, as discussed in Section 5.1.

Fig. 9. Comparison of BFS performance between the Emu Chick and the Haswell server described in Section 4 (MTEPS and edge bandwidth vs. number of vertices, edge factor 16, for Emu single-node Cilk, Emu multi-node Cilk, x86 Haswell STINGER, and x86 Haswell Cilk). Two implementations were tested on the Haswell system, one from STINGER and the other from MEATBEE.
Our initial graph engine implementation does not attempt to evenly partition the graph across the nodelets in the system. The neighbor list of each vertex is co-located with the vertex on a single nodelet. RMAT graphs specified by Graph500 have highly skewed degree distributions, leading to uneven work distribution on the Emu. Figure 8 shows that BFS with balanced Erdös-Rényi graphs instead reaches a higher performance of 18 MTEPS (288 MB/s) versus 4 MTEPS (64 MB/s) for the RMAT graph. We were unable to collect BFS results for RMAT graphs on the multi-node Emu system due to a hardware bug that currently causes the algorithm to deadlock. Future work will enhance the graph construction algorithm to create a better partition for power-law graphs.

Figure 9 plots results for four configurations of BFS running with balanced graphs: Emu single- and multi-node and two BFS results from the Haswell system. The performance of a single node of the Emu Chick saturates at 18 MTEPS while the full multi-node configuration reaches 80 MTEPS on a scale 21 graph, with an equivalent bandwidth utilization of 1280 MB/s. On the Haswell platform, the MEATBEE (backported Emu Cilk) implementation reaches a peak of 105 MTEPS, outperforming the STINGER (naive Cilk) implementation of BFS at 88 MTEPS, likely due to the reduction of atomic operations as discussed in Section 3.2.

gsaNA Graph Alignment - Data Layout
For our tests, we use DBLP [55] graphs from years 2015 and 2017 that have been created previously [68]. This pair of graphs is called DBLP (0), and they have nearly 48K, 59K vertices and 453K, 656K edges, respectively. These graphs are used in the experiments shown in Fig. 10. For the experiments shown in Fig. 11, we filter some vertices and their edges from the two graphs in DBLP (0), resulting in seven different graph pairs for alignment. The properties of these seven pairs are shown in Table 4.

We present similarity computation results for the Emu hardware on different sized graphs and execution schemes which are defined/named by combining the layout with the similarity computation. For instance, BLK-ALL refers to the case where we use the block partitioned vertex layout and run the ALL parallel similarity computation.
Table 4. Generated graphs for alignment. |T|: number of tasks; |B|: bucket size.
    Graphs:   512   1024   2048   4096   8192   16384   32768
    |T|        44     85     77    163    187     267     276
    |B|        32     32     64     64    128     128     256

Bandwidth is measured for gsaNA by the formula:

    BW = [ Σ_{B ∈ QT₁} Σ_{B′ ∈ QT₂.Neig(B)} ( |B| + |B|·|B′| + Σ_{u ∈ B} Σ_{v ∈ B′} RW(σ(u, v)) ) × sizeof(u) ] / time

In a task, pairwise vertex similarities are computed between the vertices in a bucket B ∈ QT₁ and the vertices in a bucket B′ ∈ QT₂.Neig(B). Therefore in each task, every vertex u ∈ B is read once and every vertex v ∈ B′ is read |B| times. Additional read and write cost comes from the similarity function σ(u, v) that is called for every vertex pair u, v with u ∈ B and v ∈ B′. Hence, the total data movement can be gathered by aggregating the size of the bucket reads and the number of reads and writes required by the similarity function. Bandwidth is the ratio between the total data movement and the execution time. We adopted the following similarity metrics from gsaNA [68]: degree (Δ), vertex type (τ), adjacent vertex type (τ_V), adjacent edge type (τ_E), and vertex attribute (C_V). Since the similarity function consists of these similarity metrics, we can define the required number of reads and writes of the similarity function as RW(σ(u, v)) = RW(τ(u, v)) + RW(Δ(u, v)) + RW(τ_V(u, v)) + RW(τ_E(u, v)) + RW(C_V(u, v)). In this equation, the degree (Δ) and the type (τ) similarity functions require one memory read for each vertex and then one read and update for the similarity value. Therefore, RW(τ(u, v)) = RW(Δ(u, v)) = 4. The adjacent vertex (τ_V) and the edge (τ_E) type similarity functions require reading all adjacent edges of the two vertices and one read and update for the similarity value. Therefore, RW(τ_V(u, v)) = RW(τ_E(u, v)) = |N(u)| + |N(v)| + 2. The vertex attribute similarity function (C_V) requires reading attributes of the two vertices and one read and update for the similarity value. Therefore, RW(C_V(u, v)) = |A(u)| + |A(v)| + 2.
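The accounting above can be written as a small C helper; the parameter names (degree and attribute counts per vertex) are our own shorthand for |N(·)| and |A(·)|.

    /* Reads and writes charged to one σ(u, v) evaluation, following the
     * per-component accounting described above. */
    long rw_sigma(long deg_u, long deg_v, long attr_u, long attr_v)
    {
        long rw_type   = 4;                      /* RW(τ): one read per vertex + read/update */
        long rw_degree = 4;                      /* RW(Δ): same accounting                   */
        long rw_adj_v  = deg_u + deg_v + 2;      /* RW(τ_V)                                  */
        long rw_adj_e  = deg_u + deg_v + 2;      /* RW(τ_E)                                  */
        long rw_attr   = attr_u + attr_v + 2;    /* RW(C_V)                                  */
        return rw_type + rw_degree + rw_adj_v + rw_adj_e + rw_attr;
    }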
Figure 10 reports the achieved bandwidth for the different execution schemes; the best bandwidth is achieved by the PAIR computation scheme with the largest number of threads. Since the PAIR scheme does many unpredictable recursive spawns, controlling the number of threads for this scheme is very hard and not accurate. Therefore, for increasing numbers of threads, we only consider ALL with the BLK and HCB vertex layouts. We observe that in the BLK layout, our final speedup is 43× using ALL and 52× using PAIR. In the HCB layout, our final speedup is 49× using ALL and 68× using PAIR. As can be seen in Fig. 10, when we increase the number of threads from 128 to 256, the bandwidth decreases by 4% in the BLK-ALL scheme, because the coarse grained nature of ALL cannot give better workload balance and thread migrations hurt the performance.
Fig. 10. gsaNA, Bandwidth vs. Threads for ALL (rightmost bars represent PAIR results), run on HW (8 nodelets).

Figure 11 presents results for all graphs in different execution schemes. We observe that the HCB vertex layout improves the execution time by 10% to 36% in all datasets by decreasing the number of thread migrations. As expected, this improvement increases with the graph size. This improvement on an x86 architecture is reported as 10% in [69]. Second, we see that the PAIR computation scheme enjoys improvements with both vertex layouts, because it has a finer grained task parallelism and hence better workload distribution.

Fig. 11. gsaNA, Experiments on DBLP graphs on HW (8 nodelets). Bandwidth (MB/s) for the BLK-ALL, HCB-ALL, BLK-PAIR, and HCB-PAIR schemes across graph sizes (512 through 32768 and DBLP (0)).
Figure 12 displays strong scaling results for the BLK and HCB vertex layouts with the ALL scheme on single-node and multi-node setups for the DBLP graph with 2048 vertices. Here, the strong scaling is given with respect to the single thread execution time of the BLK layout on the multi-node setup.

Fig. 12. gsaNA, strong scaling experiments on DBLP graph (2048 vertices) on HW (multi-node and single-node).
On the multi-node setup, the hardware crashed for gsaNA when 128 threads were used. We observe from this figure that the multi-node setup is slower than the single-node setup; multi-node execution times are about 25% to 30% slower than the single-node execution times. This is because inter-node migrations are much more expensive. The proposed layout and computational schemes help to improve the efficiency of the algorithms in both multi-node and single-node experiments. The HCB layout improves the ALL scheme by about 12% to 3%.
Final observations:
We observe that the finer granularity of tasks in PAIR and the locality-aware vertex layout of HCB give an important improvement in terms of bandwidth (i.e., execution time). However, because of recursive spawns, PAIR may cause too many unpredictable thread migrations if the data layout is random. Additionally, although HCB helps to reduce the number of thread migrations significantly, this layout may create hotspots if it puts many neighboring buckets into the same nodelet. Our approach of balancing the number of edges per nodelet tries to alleviate these issues.
In addition to Figure 9, we present the following initial comparisons for our applications with runs using the same Cilk code on a Haswell CPU system. SpMV real-world graphs (Table 3) run in 5.33 ms to 1017.68 ms on the Emu versus 1.89 ms to 15.36 ms on the Haswell box, but the Emu exhibits memory bandwidth utilization of 0.3%–38.53% (where ins2 is slowest and amazon03 is fastest) versus 3.5–24% on the CPU system. Emu results are skewed by the three most “unbalanced” graphs which run much faster on the CPU system. gsaNA takes 158 seconds to run the DBLP (0) graph on the Emu hardware with 8 nodelets and it takes less than 2 seconds on a Haswell processor with 112 threads. However, this is not a fair comparison as the presented DBLP graphs (∼42 MB) fit in an L3 cache.

Related Work
Advances in memory and integration technologies provide opportunities for profitably moving computation closer to data [62]. Some proposed architectures return to the older processor-in-memory (PIM) and “intelligent RAM” [57] ideas. Simulations of architectures focusing on near-data processing [33], including in-memory [32] and near-memory [31], show great promise for increasing performance while also drastically reducing energy usage.

Other hardware architectures have tackled massive-scale data analysis to differing degrees of success. The Tera MTA / Cray XMT [25, 52] provide high bandwidth utilization by tolerating long memory latencies in applications that can produce enough threads to source enough memory operations. In the XMT all memory interactions are remote, incurring the full network latency on each access. The Chick instead moves threads to memory on reads, assuming there will be a cluster of reads for nearby data. The Chick processor needs to tolerate less latency and need not keep as many threads in flight. Also, unlike the XMT, the Chick runs the operating system on the stationary processors, currently PowerPC, so the Chick processors need not deal with I/O interrupts and highly sequential OS code. Similarly to the XMT, programming the Chick requires language and library extensions. Future work with performance portability frameworks like Kokkos [26] will explore how much must be exposed to programmers. Another approach is to push memory-centric aspects to an accelerator like Sparc M7’s data analytics accelerator [2] for database operations or Graphicionado [34] for graph analysis.

Moving computation to data via software has a successful history in supercomputing via Charm++ [1], which manages dynamic load balancing on distributed memory systems by migrating the computational objects. Similarly, data analysis systems like Hadoop moved computation to data when the network was the primary data bottleneck [4]. The Emu Chick also is strongly related to other PGAS approaches and is a continuation of the mNUMA architecture [65]. Other approaches to hardware PGAS acceleration include advanced RDMA networks with embedded address translation and atomic operations [22, 30, 59, 61]. The Emu architecture supports remote memory operations to varying degrees and side-steps many other issues through thread migration. Remote operations pin a thread so that the acknowledgment can be delivered. How to trade between migration and remote operations, as well as exposing that trade-off, is an open question.
SpMV:
There has been a large body of work on SpMV, including on emerging architectures [12, 67], but somewhat limited recent work that is directly related to PGAS systems. However, future work with SpMV on Emu will investigate new state-of-the-art formats and algorithms such as SparseX, which uses the Compressed Sparse eXtended (CSX) as an alternative data layout for storing matrices [28].
BFS:
The implemented version of BFS builds on the standard Graph500 code with optimizations for Cilk and Emu. The two-phase implementation used in this work has some similarities to direction-optimizing BFS [10], in that the remote “put” phase mirrors the bottom-up algorithm. Other notable current implementations include optimized, distributed versions [64] and a recent PGAS version [18]. The implementation provided in this paper contrasts with previous PGAS work due to asymmetric costs for remote get operations as discussed in Section 7. NUMA optimizations [70] similarly are read-oriented but lack thread migration.
Graph Alignment:
Graph alignment methods are traditionally [20, 29] classified into four basic categories: spectral methods [45, 53, 56, 63, 73]; graph structure similarity methods [3, 43, 47, 49, 50]; tree search or table search methods [17, 41, 46, 60]; and integer linear programming methods [8, 27, 38, 40]. Final [73] is a recent work which targets the labeled network alignment problem by extending the concept of IsoRank [63] to use attribute information of the vertices and edges. All of these methods have scalability issues. gsaNA [68, 69] leverages global graph structure to reduce the problem space and exploits the semantic information to alleviate most of the scalability issues. In addition to these sequential algorithms, we are aware of two parallel approaches for global graph alignment. The first one [39] decomposes the ranking calculations of IsoRank’s similarity matrix using the singular value decomposition. The second one is a shared memory parallel algorithm [54] that is based on the belief propagation (BP) solution for integer program relaxation [8]. It uses shared memory parallel matrix operations for BP iterations and also implements an approximate weighted bipartite matching algorithm. While these parallel algorithms show an important improvement over the state of the art sequential algorithms, the graphs used in the experiments are small in size and have high structural similarity. To the best of our knowledge, the use of gsaNA in [69] and in this paper presents the first method for parallel alignment of labeled graphs.

Other recent work has also looked to extend from low-level characterizations like those presented here by providing initial Emu-focused implementations of Breadth-First Search [11], Jaccard index computation [42], bitonic sort [66], and compiler optimizations like loop fusion, edge flipping, and remote updates to reduce migrations [16].
Discussion

The Emu architecture inverts the traditional scheme of hauling data to and from a grid of processing elements. In this architecture, the data is static, and small logical units of computation move throughout the system; load balancing is closely related to data layout and distribution, since threads can only run on local processing elements. Our work mapping irregular algorithms to the Emu architecture exposes the following issues that must be addressed to achieve relatively high performance:
1) Thread stack placement and remote thread migrations back to a “home” nodelet that contains the thread stack.
2) Balancing workload is difficult when using irregular data structures like graphs.
3) Following from 2, input sizes are limited by the need to create distributed data structures from an initial chunk of data on the “home” node.
4) The Emu is a non-uniform PGAS system with variable costs for remote “put” and “get” operations.
5) The tension between top-down task programming on bottom-up data allocation has proven difficult to capture in current programming systems.
Thread Stack Placement:
A stack frame is allocated on a nodelet when a new thread is spawned. Threads carry their registers with them when they migrate, but stack accesses require a migration back to the originating nodelet. If a thread needs to access its stack while manipulating remote data, it will migrate back and forth (ping-pong). We can limit the usage of thread stacks and ping-pong migration by obeying the following rules when writing a function that is expected to migrate:
1) Maximize the use of inlined function calls. Normal function calls require a migration back to the home nodelet to save the register set.
2) Write lightweight worker functions using fewer than 16 registers to prevent stack spills.
3) Don’t pass arguments by reference to the worker function. Dereferencing a pointer to a variable inside the caller’s stack frame forces a migration back to the home nodelet (see the sketch after this list). Pointers to replicated data circumvent this migration.
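The following C sketch contrasts the pattern in rule 3 with the preferred style; it is our own illustration, and the behavior annotations describe the Emu migration rules above rather than anything enforced by the compiler.

    /* Anti-pattern: 'sum' lives in the spawning thread's stack frame, so each
     * update may ping-pong the worker back to its home nodelet. */
    static void sum_row_byref(const double *val, long begin, long end, double *sum)
    {
        for (long k = begin; k < end; ++k)
            *sum += val[k];
    }

    /* Preferred: accumulate in registers carried with the thread and return by value. */
    static inline double sum_row_byval(const double *val, long begin, long end)
    {
        double sum = 0.0;
        for (long k = begin; k < end; ++k)
            sum += val[k];
        return sum;
    }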
Workload balance and distributed data structures:
One of the main challenges in obtaining good performance on the Emu Chick prototype is the initial placement of data and distribution to remote nodelets. While the current Emu hardware contains a hardware-supported credit system to control the overall amount of dynamic parallelism, the choice of placement is still critical to avoid thread migration hotspots for SpMV and BFS. In the case of SpMV, replication reduces thread migration in each iteration, but replication is also not scalable to more complex, related algorithms like MTTKRP or SpGEMM. The implementations of graph alignment using gsaNA use data placement techniques like HCB and PAIR-wise comparisons to group threads on the same nodelets with related data and limit thread migration, which dramatically improves their performance.

Non-uniform PGAS operations:
Emu’s implementation of PGAS utilizes “put”-style remote operations (add, min, max, etc.) and “get” operations where a thread is migrated to read a local piece of data. Thread migration is efficient when many get operations need to access the same nodelet-local memory channel. The performance difference observed between put and get operations is due to how these two operations interact differently with load balancing. A put can be done without changing the location of the thread, while a get means that multiple threads may have to share significant resources on the same nodelet for a while. Additionally, a stream of gets with spatial locality can be faster than multiple put operations. This non-uniformity means that kernels that need to access finely grained data in random order should be implemented as put operations wherever possible, while get operations should only be used when larger chunks of data are read together. A major outstanding question is how this scheme compares with explicitly remote references plus task migrations via remote calls as in UPC++ [5]. The trade-off between hardware simplicity and software flexibility is difficult to measure without implementations of both. Tractable abstract models miss implementation details like switch fabric traffic contention or task-switching cache overhead.
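The guideline can be illustrated with two generic kernels. In this C sketch (our own, with illustrative names), the scatter is naturally put-only while the gather forces a migration per remote element on the Emu; on x86 both compile to ordinary loads and stores.

    /* Random scatter: each store becomes an asynchronous remote write (put)
     * and the issuing thread never leaves its nodelet. */
    void scatter_put(long n, const long *idx, const double *src, double *dst)
    {
        for (long i = 0; i < n; ++i)
            dst[idx[i]] = src[i];
    }

    /* Random gather: each read of a remote element migrates the thread (get),
     * so this pattern should be reserved for larger, clustered reads. */
    double gather_get(long n, const long *idx, const double *src)
    {
        double sum = 0.0;
        for (long i = 0; i < n; ++i)
            sum += src[idx[i]];
        return sum;
    }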
Top-down task programming on bottom-up data allocation:
The Cilk-based fork/join model emphasizes a top-down approach to maximize parallelism without regard to data or thread location. Memory allocation on the Emu system, however, follows the bottom-up approach of UPC [19] or SHMEM [14]. The Cilk model allows quickly writing highly parallel codes, but achieving high performance (bandwidth utilization) requires controlling thread locations. We do not yet have a good way to relieve these tensions. Languages like Chapel [13] and X10 [15] provide a high-level view of data distribution but lack implicit thread migration. The gsaNA results on the highly dynamic variant in Algorithm 5 demonstrate how migrations on a locality-emphasizing data distribution can achieve relatively high performance. To our knowledge there is little work on programming systems that incorporate implicit and lightweight thread migration, but Charm++ [1] and Legion [7] provide experience in programming heavier-weight task migration and locality in different environments.

Note that the Emu compiler is rapidly evolving to include intra-node cilk_for and Cilk+ reducers. Experimental support became available at the time of writing and is still being evaluated. Balancing remote memory operations and thread migrations in reducer and parallel scan implementations for the Emu architecture is ongoing work.
Conclusions

In this study, we focus on optimizing several irregular algorithms using programming strategies for the Emu system including replication, remote writes, and data layout and placement. We argue that these three types of programming optimizations are key for achieving good workload balance on the Emu system and that they may even be useful to optimize Cilk-oriented codes for x86 systems (as with BFS).

By analogy, back-porting GPU-centric optimizations to processors often provides improved performance. That is, in the same way that GPU architecture and programming encourages (or “forces”) programmers to parallelize and vectorize explicitly, the Emu design requires upfront decisions about data placement and one-sided communication that can lead to more scalable code. Future work would aim to evaluate whether these programming strategies can be generalized in this fashion.

By adopting a “put-only” strategy, our BFS implementation achieves 80 MTEPS on balanced graphs. Our SpMV implementation makes use of replicated arrays to reach 50% of measured STREAM bandwidth while processing sparse data. We present two parallelization schemes and two vertex layouts for parallel similarity computation with the gsaNA graph aligner, achieving strong scaling up to 68× on the Emu system. Using the HCB vertex layout further improves the execution time by up to 36%.
Acknowledgments
This work was partially supported by NSF Grant ACI-1339745 (XScala), an IARPA contract, and the Defense Advanced Research Projects Agency (DARPA) under agreement
References
[1] B. Acun, A. Gupta, N. Jain, A. Langer, H. Menon, E. Mikida, X. Ni, M. Robson, Y. Sun, E. Totoni, L. Wesolowski, and L. Kale. 2014. Parallel Programming with Migratable Objects: Charm++ in Practice. In SC14: International Conference for High Performance Computing, Networking, Storage and Analysis. 647–658. https://doi.org/10.1109/SC.2014.58
[2] K. Aingaran, S. Jairath, G. Konstadinidis, S. Leung, P. Loewenstein, C. McAllister, S. Phillips, Z. Radovic, R. Sivaramakrishnan, D. Smentek, and T. Wicki. 2015. M7: Oracle's Next-Generation SPARC Processor. IEEE Micro 35, 2 (March 2015), 36–45. https://doi.org/10.1109/MM.2015.35
[3] Ahmet E. Aladağ and Cesim Erten. 2013. SPINAL: scalable protein interaction network alignment. Bioinformatics 29, 7 (2013), 917–924.
[4] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. 2011. Disk-locality in Datacenter Computing Considered Irrelevant. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems (HotOS'13). USENIX Association, Berkeley, CA, USA, 12–12.
[5] J. Bachan, S. Baden, D. Bonachea, P. Hargrove, S. Hofmeyr, M. Jacquelin, A. Kamil, and B. van Straalen. 2018. UPC++ Specification v1.0, Draft 8. (2018). https://doi.org/10.25344/s45p4x
[6] David A. Bader, Jonathan Berry, Simon Kahan, Richard Murphy, E. Jason Riedy, and Jeremiah Willcock. 2011. Graph500 Benchmark 1 (search) Version 1.2. Technical Report. Graph500 Steering Committee.
[7] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: Expressing Locality and Independence with Logical Regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 66, 11 pages. http://dl.acm.org/citation.cfm?id=2388996.2389086
[8] Mohsen Bayati, Margot Gerritsen, David F. Gleich, Amin Saberi, and Ying Wang. 2009. Algorithms for large, sparse network alignment problems. In IEEE International Conference on Data Mining (ICDM). 705–710.
[9] Scott Beamer, Krste Asanović, and David Patterson. 2012. Direction-optimizing Breadth-first Search. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 12, 10 pages. http://dl.acm.org/citation.cfm?id=2388996.2389013
[10] S. Beamer, A. Buluç, K. Asanovic, and D. Patterson. 2013. Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search. In . 1618–1627. https://doi.org/10.1109/IPDPSW.2013.159
[11] Mehmet Belviranli, Seyong Lee, and Jeffrey S. Vetter. 2018. Designing Algorithms for the EMU Migrating-threads-based Architecture. High Performance Extreme Computing Conference 2018 (2018).
[12] Dan Bonachea, Rajesh Nishtala, Paul Hargrove, and Katherine Yelick. 2006. Efficient point-to-point synchronization in UPC. In .
[13] B. L. Chamberlain, D. Callahan, and H. P. Zima. 2007. Parallel Programmability and the Chapel Language. Int. J. High Perform. Comput. Appl. 21, 3 (Aug. 2007), 291–312. https://doi.org/10.1177/1094342007078442
[14] Barbara Chapman, Tony Curtis, Swaroop Pophale, Stephen Poole, Jeff Kuehn, Chuck Koelbel, and Lauren Smith. 2010. Introducing OpenSHMEM: SHMEM for the PGAS Community. In Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10). ACM, New York, NY, USA, Article 2, 3 pages. https://doi.org/10.1145/2020373.2020375
[15] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. 2005. X10: An Object-oriented Approach to Non-uniform Cluster Computing. SIGPLAN Not. 40, 10 (Oct. 2005), 519–538. https://doi.org/10.1145/1103845.1094852
[16] Prasanth Chatarasi and Vivek Sarkar. 2018. A Preliminary Study of Compiler Transformations for Graph Applications on the Emu System. In Proceedings of the Workshop on Memory Centric High Performance Computing (MCHPC'18). ACM, New York, NY, USA, 37–44. https://doi.org/10.1145/3286475.3286481
[17] Leonid Chindelevitch, Cheng-Yu Ma, Chung-Shou Liao, and Bonnie Berger. 2013. Optimizing a global alignment of protein interaction networks.
Bioinformatics (2013), btt486.[18] Guojing Cong, George Almasi, and Vijay Saraswat. 2010. Fast PGAS Implementation of Distributed Graph Algorithms.In
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage andAnalysis (SC ’10) . IEEE Computer Society, Washington, DC, USA, 1–11. https://doi.org/10.1109/SC.2010.26[19] UPC Consortium, Dan Bonachea, and Gary Funck. 2013. UPC Language and Library Specifications, Version 1.3. (112013). https://doi.org/10.2172/1134233[20] Donatello Conte, Pasquale Foggia, Carlo Sansone, and Mario Vento. 2004. Thirty years of graph matching in patternrecognition.
International journal of pattern recognition and artificial intelligence
18, 03 (2004), 265–298.[21] Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection.
ACM Transactions onMathematical Software (TOMS)
38, 1 (2011), 1.[22] Michael Dungworth, James Harrell, Michael Levine, Stephen Nelson, Steven Oberlin, and Steven P. Reinhardt. 2011.
CRAY T3E . Springer US, Boston, MA, 419–441. https://doi.org/10.1007/978-0-387-09766-4_306[23] Timothy Dysart, Peter Kogge, Martin Deneroff, Eric Bovell, Preston Briggs, Jay Brockman, Kenneth Jacobsen, YujenJuan, Shannon Kuntz, and Richard Lethin. 2016. Highly scalable near memory processing with migrating threads onthe Emu system architecture. In
Irregular Applications: Architecture and Algorithms (IA3), Workshop on . IEEE, 2–9.[24] David Ediger, Robert McColl, Jason Riedy, and David A. Bader. 2012. STINGER: High Performance Data Structurefor Streaming Graphs. In
The IEEE High Performance Extreme Computing Conference (HPEC) . Waltham, MA. https://doi.org/10.1109/HPEC.2012.6408680[25] David Ediger, Jason Riedy, David A. Bader, and Henning Meyerhenke. 2013. Computational Graph Analytics forMassive Streaming Data. In
Large Scale Network-Centric Computing Systems , Hamid Sarbazi-azad and Albert Zomaya(Eds.). Wiley, Chapter 25. https://doi.org/10.1002/9781118640708.ch25[26] H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. 2014. Kokkos: Enabling manycore performanceportability through polymorphic memory access patterns.
J. Parallel and Distrib. Comput.
74, 12 (2014), 3202 –3216. https://doi.org/10.1016/j.jpdc.2014.07.003 Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.[27] Mohammed El-Kebir, Jaap Heringa, and Gunnar W Klau. 2011. Lagrangian relaxation applied to sparse global networkalignment. In
IAPR International Conference on Pattern Recognition in Bioinformatics . Springer, 225–236.[28] Athena Elafrou, Vasileios Karakasis, Theodoros Gkountouvas, Kornilios Kourtis, Georgios Goumas, and NectariosKoziris. 2018. SparseX: A Library for High-Performance Sparse Matrix-Vector Multiplication on Multicore Platforms.
ACM Trans. Math. Softw.
44, 3, Article 26 (Jan. 2018), 32 pages. https://doi.org/10.1145/3134442[29] Ahed Elmsallati, Connor Clark, and Jugal Kalita. 2016. Global Alignment of Protein-Protein Interaction Networks: ASurvey.
IEEE Transactions on Computational Biology and Bioinformatics
13, 4 (2016), 689–705.[30] G. Faanes, A. Bataineh, D. Roweth, T. Court, E. Froese, B. Alverson, T. Johnson, J. Kopnick, M. Higgins, and J. Reinhard.2012. Cray Cascade: A scalable HPC system based on a Dragonfly network. In
High Performance Computing, Networking,Storage and Analysis (SC), 2012 International Conference for . 1–9. https://doi.org/10.1109/SC.2012.39[31] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim. 2015. NDA: Near-DRAM acceleration architectureleveraging commodity DRAM devices and standard memory modules. In . 283–295. https://doi.org/10.1109/HPCA.2015.7056040[32] T. Finkbeiner, G. Hush, T. Larsen, P. Lea, J. Leidel, and T. Manning. 2017. In-Memory Intelligence.
IEEE Micro
37, 4(Aug. 2017), 30–38. https://doi.org/10.1109/MM.2017.3211117[33] M. Gao, G. Ayers, and C. Kozyrakis. 2015. Practical Near-Data Processing for In-Memory Analytics Frameworks. In . 113–124. https://doi.org/10.1109/PACT.2015.22[34] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi. 2016. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. In . 1–13. https://doi.org/10.1109/MICRO.2016.7783759[35] Eric Hein and Tom Conte. 2016. DynoGraph: Benchmarking Dynamic Graph Analytics. In
SC16: International Conferencefor High Performance Computing, Networking, Storage and Analysis . http://sc16.supercomputing.org/sc-archive/tech_poster/tech_poster_pages/post214.html Poster.[36] Eric Hein, Tom Conte, Jeffrey Young, Srinivas Eswar, Jiajia Li, Patrick Lavin, Richard Vuduc, and Jason Riedy. 2018.An Initial Characterization of the Emu Chick. In
In the The Eighth International Workshop on Accelerators and HybridExascale Systems (ASHES), Vancouver, Canada .[37] Jeremy Kepner, Peter Aaltonen, David Bader, Aydın Buluc, Franz Franchetti, John Gilbert, Dylan Hutchison, ManojKumar, Andrew Lumsdaine, Henning Meyerhenke, Scott McMillan, Jose Moreira, John D. Owens, Carl Yang, MarcinZalewski, and Timothy Mattson. 2016. Mathematical Foundations of the GraphBLAS. http://arxiv.org/abs/1606.05790[38] Gunnar W. Klau. 2009. A new graph-based method for pairwise global network alignment.
BMC Bioinformatics
10, 1(2009), S59.[39] Giorgos Kollias, Shahin Mohammadi, and Ananth Grama. 2012. Network similarity decomposition (nsd): A fast andscalable approach to network alignment.
IEEE Transactions on Knowledge and Data Engineering (TKDE)
24, 12 (2012), IEEE Interna-tional Conference on Data Mining (ICDM) . 389–398.[41] Segla Kpodjedo, Philippe Galinier, and Giulio Antoniol. 2014. Using local similarity measures to efficiently addressapproximate graph matching.
Discrete Applied Mathematics
164 (2014), 161–177.[42] ´Geraud P. Krawezik, Peter M. Kogge, Timothy J. Dysart, Shannon K. Kuntz, and Janice O. McMahon. 2018. Implementingthe Jaccard Index on the Migratory Memory-Side Processing Emu Architecture.
High Performance Extreme ComputingConference 2018 (2018).[43] Oleksii Kuchaiev, Tijana Milenković, Vesna Memišević, Wayne Hayes, and Nataša Pržulj. 2010. Topological networkalignment uncovers biological function and phylogeny.
Journal of the Royal Society Interface
7, 50 (2010), 1341–1354.[44] Charles E Leiserson. 1997. Programming irregular parallel applications in Cilk. In
International Symposium on SolvingIrregularly Structured Problems in Parallel . Springer, 61–71.[45] Chung-Shou Liao, Kanghao Lu, Michael Baym, Rohit Singh, and Bonnie Berger. 2009. IsoRankN: spectral methods forglobal alignment of multiple protein networks.
Bioinformatics
25, 12 (2009), i253–i258.[46] Dasheng Liu, Kay Chen Tan, Chi Keong Goh, and Weng Khuen Ho. 2007. A multiobjective memetic algorithm basedon particle swarm optimization.
IEEE Transactions on Systems, Man, and Cybernetics, Part B
37 (2007), 42–50.[47] Noël Malod-Dognin and Nataša Pržulj. 2015. L-GRAAL: Lagrangian graphlet-based network aligner.
Bioinformatics (2015), btv130.[48] John D. McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Computers.
IEEEComputer Society Technical Committee on Computer Architecture (TCCA) Newsletter (Dec. 1995), 19–25.[49] Vesna Memišević and Nataša Pržulj. 2012. C-GRAAL: Common-neighbors-based global GRA ph AL ignment ofbiological networks.
Integrative Biology
4, 7 (2012), 734–743.[50] Tijana Milenkovic, Weng Leong Ng, Wayne Hayes, and Natasa Przulj. 2010. Optimal network alignment with graphletdegree vectors.
Cancer informatics .[52] D. Mizell and K. Maschhoff. 2009. Early experiences with large-scale Cray XMT systems. In . 1–9. https://doi.org/10.1109/IPDPS.2009.5161108[53] Behnam Neyshabur, Ahmadreza Khadem, Somaye Hashemifar, and Seyed Shahriar Arab. 2013. NETAL: a newgraph-based method for global alignment of protein–protein interaction networks.
Bioinformatics
29, 13 (2013),1654–1662.[54] Behnam Neyshabur, Ahmadreza Khadem, Somaye Hashemifar, and Seyed Shahriar Arab. 2013. NETAL: a newgraph-based method for global alignment of protein–protein interaction networks.
Bioinformatics
29, 13 (2013),1654–1662.[55] University of Trier. 2017. DBLP: Computer Science Bibliography. http://dblp.dagstuhl.de/xml/release/.[56] Rob Patro and Carl Kingsford. 2012. Global network alignment using multiscale spectral signatures.
Bioinformatics
IEEE Micro
17, 2 (March 1997), 34–44. https://doi.org/10.1109/40.592312[58] R. Pearce, M. Gokhale, and N. M. Amato. 2014. Faster Parallel Traversal of Scale Free Graphs at Extreme Scale withVertex Delegates. In
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis .549–559. https://doi.org/10.1109/SC.2014.50[59] R. Rajamony, L. B. Arimilli, and K. Gildea. 2011. PERCS: The IBM POWER7-IH high-performance computing system.
IBM Journal of Research and Development
55, 3 (May 2011), 3:1–3:12. https://doi.org/10.1147/JRD.2011.2109230[60] Vikram Saraph and Tijana Milenković. 2014. MAGNA: maximizing accuracy in global network alignment.
Bioinformatics
30, 20 (2014), 2931–2940.[61] P. Shamis, M. G. Venkata, M. G. Lopez, M. B. Baker, O. Hernandez, Y. Itigin, M. Dubman, G. Shainer, R. L. Graham, L.Liss, Y. Shahar, S. Potluri, D. Rossetti, D. Becker, D. Poole, C. Lamb, S. Kumar, C. Stunkel, G. Bosilca, and A. Bouteiller.2015. UCX: An Open Source Framework for HPC Network APIs and Beyond. In . 40–43. https://doi.org/10.1109/HOTI.2015.13[62] Patrick Siegl, Rainer Buchty, and Mladen Berekovic. 2016. Data-Centric Computing Frontiers: A Survey On Processing-In-Memory. In
Proceedings of the Second International Symposium on Memory Systems (MEMSYS ’16) . ACM, New York,NY, USA, 295–308. https://doi.org/10.1145/2989081.2989087[63] Rohit Singh, Jinbo Xu, and Bonnie Berger. 2007. Pairwise global alignment of protein interaction networks by matchingneighborhood topology. In
Annual International Conference on Research in Computational Molecular Biology . 16–31.[64] Koji Ueno, Toyotaro Suzumura, Naoya Maruyama, Katsuki Fujisawa, and Satoshi Matsuoka. 2017. Efficient Breadth-First Search on Massively Parallel and Distributed-Memory Machines.
Data Science and Engineering
2, 1 (01 March2017), 22–35. https://doi.org/10.1007/s41019-016-0024-y[65] Megan Vance and Peter M. Kogge. 2010. Introducing mNUMA: An Extended PGAS Architecture. In
Proceedings ofthe Fourth Conference on Partitioned Global Address Space Programming Model (PGAS ’10) . ACM, New York, NY, USA, rticle 6, 10 pages. https://doi.org/10.1145/2020373.2020379[66] Kaushik Velusamy, Thomas B. Rolinger, Janice McMahon, and Tyler A. Simon. 2018. Exploring Parallel Bitonic Sort ona Migratory Thread Architecture. High Performance Extreme Computing Conference 2018 (2018).[67] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2009. Optimizationof sparse matrix–vector multiplication on emerging multicore platforms.
Parallel Comput.
35, 3 (2009), 178–194.[68] Abdurrahman Yaşar and Ümit V. Çatalyürek. 2018. An Iterative Global Structure-Assisted Labeled Network Aligner. In
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining .[69] Abdurrahman Yaşar, Bora Uçar, and Ümit V. Çatalyürek. 2018. SINA: A Scalable Iterative Network Aligner. In .[70] Yuichiro Yasui, Katsuki Fujisawa, and Yukinori Sato. 2014. Fast and Energy-efficient Breadth-First Search on a SingleNUMA System. In
Supercomputing , Julian Martin Kunkel, Thomas Ludwig, and Hans Werner Meuer (Eds.). SpringerInternational Publishing, Cham, 365–381.[71] Chunxing Yin, Jason Riedy, and David A. Bader. 2018. A New Algorithmic Model for Graph Analysis of StreamingData. In
Proceedings of the 14th International Workshop on Mining and Learning with Graphs (MLG)
Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005Conference . 25–25. https://doi.org/10.1109/SC.2005.4[73] Si Zhang and Hanghang Tong. 2016. FINAL: Fast Attributed Network Alignment. In
ACM International Conference onKnowledge Discovery and Data mining . 1345–1354.. 1345–1354.