SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures
Christina Giannoula, Nandita Vijaykumar, Nikela Papadopoulou, Vasileios Karakostas, Ivan Fernandez, Juan Gómez-Luna, Lois Orosa, Nectarios Koziris, Georgios Goumas, Onur Mutlu
†National Technical University of Athens  ‡ETH Zürich  *University of Toronto  §University of Malaga
Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units, each including multiple simple cores placed close to memory. To fully leverage the benefits of NDP and achieve high performance for parallel workloads, efficient synchronization among the NDP cores of a system is necessary. However, supporting synchronization in many NDP systems is challenging because they lack shared caches and hardware cache coherence support, which are commonly used for synchronization in multicore systems, and communication across different NDP units can be expensive.

This paper comprehensively examines the synchronization problem in NDP systems, and proposes SynCron, an end-to-end synchronization solution for NDP systems. SynCron adds low-cost hardware support near memory for synchronization acceleration, and avoids the need for hardware cache coherence support. SynCron has three components: 1) a specialized cache memory structure to avoid memory accesses for synchronization and minimize latency overheads, 2) a hierarchical message-passing communication protocol to minimize expensive communication across NDP units of the system, and 3) a hardware-only overflow management scheme to avoid performance degradation when hardware resources for synchronization tracking are exceeded.

We evaluate SynCron using a variety of parallel workloads, covering various contention scenarios. SynCron improves performance by 1.27× on average (up to 1.78×) under high-contention scenarios, and by 1.35× on average (up to 2.29×) under low-contention real applications, compared to state-of-the-art approaches. SynCron reduces system energy consumption by 2.08× on average (up to 4.25×).
1. Introduction
Recent advances in 3D-stacked memories [59, 72, 85, 92, 93, 145] have renewed interest in Near-Data Processing (NDP) [8, 9, 17, 110]. NDP involves performing computation close to where the application data resides. This alleviates the expensive data movement between processors and memory, yielding significant performance improvements and energy savings in parallel applications. Placing low-power cores or special-purpose accelerators (hereafter called NDP cores) close to the memory dies of high-bandwidth 3D-stacked memories is a commonly-proposed design for NDP systems [8, 9, 19–21, 23, 38, 42–46, 49, 66, 67, 82–84, 98, 105, 110–113, 117, 119, 131, 132, 143, 155, 158]. Typical NDP architectures support several NDP units connected to each other, with each unit comprising multiple NDP cores close to memory [8, 19, 66, 83, 143, 155, 158]. Therefore, NDP architectures provide high levels of parallelism, low memory access latency, and large aggregate memory bandwidth.

Recent research demonstrates the benefits of NDP for parallel applications, e.g., for genome analysis [23, 84], graph processing [8, 9, 20, 21, 112, 155, 158], databases [20, 38], security [54], pointer-chasing workloads [25, 60, 67, 99], and neural networks [19, 45, 82, 98]. In general, these applications exhibit high parallelism, low operational intensity, and relatively low cache locality [15, 16, 33, 50, 133], which make them suitable for NDP.

Prior works discuss the need for efficient synchronization primitives in NDP systems, such as locks [25, 99] and barriers [8, 43, 155, 158]. Synchronization primitives are widely used by multithreaded applications [39, 40, 48, 69, 70, 90, 136–138, 140], and must be carefully designed to fit the underlying hardware requirements to achieve high performance. Therefore, to fully leverage the benefits of NDP for parallel applications, an effective synchronization solution for NDP systems is necessary.

Approaches to support synchronization are typically of two types [63, 64].
First, synchronization primitives can be built through shared memory, most commonly using the atomic read-modify-write (rmw) operations provided by hardware. In CPU systems, atomic rmw operations are typically implemented upon the underlying hardware cache coherence protocols, but many NDP systems do not support hardware cache coherence (e.g., [8, 46, 143, 155, 158]). In GPUs and Massively Parallel Processing systems (MPPs), atomic rmw operations can be implemented in dedicated hardware atomic units, known as remote atomics. However, synchronization using remote atomics has been shown to be inefficient, since sending every update to a fixed location creates high global traffic and hotspots [41, 96, 108, 147, 153]. Second, synchronization can be implemented via a message-passing scheme, where cores exchange messages to reach an agreement. Some recent NDP works (e.g., [8, 43, 55, 158]) propose message-passing barrier primitives among NDP cores of the system. However, these synchronization schemes are still inefficient, as we demonstrate in Section 6, and also lack support for lock, semaphore and condition variable synchronization primitives.

Hardware synchronization techniques that do not rely on hardware coherence protocols and atomic rmw operations have been proposed for multicore systems [1–3, 94, 97, 116, 146, 157]. However, such synchronization schemes are tailored for the specific architecture of each system, and are not efficient or suitable for NDP systems (Section 7). For instance, CM5 [94] provides a barrier primitive via a dedicated physical network, which would incur high hardware cost to be supported in large-scale NDP systems. LCU [146] adds a control unit to each CPU core and a buffer to each memory controller, which would also incur high cost to implement in area-constrained NDP cores and controllers. SSB [157] includes a small buffer attached to each controller of the last level cache (LLC) and MiSAR [97] introduces an accelerator distributed at the LLC. Both schemes are built on the shared cache level in CPU systems, which most NDP systems do not have. Moreover, in NDP systems with non-uniform memory access times, most of these prior schemes would incur significant performance overheads under high-contention scenarios. This is because they are oblivious to the non-uniformity of NDP, and thus would cause excessive traffic across NDP units of the system upon contention (Section 6.7.1).

Overall, NDP architectures have several important characteristics that necessitate a new approach to support efficient synchronization. First, most NDP architectures [8, 19, 25, 38, 42–46, 49, 55, 67, 98, 110, 111, 113, 119, 155, 158] lack shared caches that can enable low-cost communication and synchronization among NDP cores of the system. Second, hardware cache coherence protocols are typically not supported in NDP systems [8, 19, 25, 38, 42–45, 49, 55, 67, 82, 98, 111, 119, 155, 158], due to high area and traffic overheads associated with such protocols [46, 143]. Third, NDP systems are non-uniform, distributed architectures, in which inter-unit communication is more expensive (both in performance and energy) than intra-unit communication [8, 20, 21, 38, 43, 83, 155, 158].

In this work, we present SynCron, an efficient synchronization mechanism for NDP architectures.
SynCron is designed to achieve the goals of performance, cost, programming ease, and generality to cover a wide range of synchronization primitives through four key techniques. First, we offload synchronization among NDP cores to dedicated low-cost hardware units, called Synchronization Engines (SEs). This approach avoids the need for complex coherence protocols and expensive rmw operations, at low hardware cost. Second, we directly buffer the synchronization variables in a specialized cache memory structure to avoid costly memory accesses for synchronization. Third, SynCron coordinates synchronization with a hierarchical message-passing scheme: NDP cores only communicate with their local SE that is located in the same NDP unit. At the next level of communication, all local SEs of the system's NDP units communicate with each other to coordinate synchronization at a global level. Via its hierarchical communication protocol, SynCron significantly reduces synchronization traffic across NDP units under high-contention scenarios. Fourth, when applications with frequent synchronization oversubscribe the hardware synchronization resources, SynCron uses an efficient and programmer-transparent overflow management scheme that avoids costly fallback solutions and minimizes overheads.

We evaluate SynCron using a wide range of parallel workloads including pointer chasing, graph applications, and time series analysis. Over prior approaches (similar to [8, 43]), SynCron improves performance by 1.27× on average (up to 1.78×) under high-contention scenarios, and by 1.35× on average (up to 2.29×) under low-contention scenarios. In real applications with fine-grained synchronization, SynCron comes within 9.5% of the performance and 6.2% of the energy of an ideal zero-overhead synchronization mechanism. Our proposed hardware unit incurs very modest area and power overheads (Section 6.8) when integrated into the compute die of an NDP unit.

This paper makes the following contributions:
• We investigate the challenges of providing efficient synchronization in Near-Data-Processing architectures, and propose an end-to-end mechanism, SynCron, for such systems.
• We design low-cost synchronization units that coordinate synchronization across NDP cores, and directly buffer synchronization variables to avoid costly memory accesses to them. We propose an efficient message-passing synchronization approach that organizes the process hierarchically, and provide a hardware-only programmer-transparent overflow management scheme to alleviate performance overheads when hardware synchronization resources are exceeded.
• We evaluate SynCron using a wide range of parallel workloads and demonstrate that it significantly outperforms prior approaches both in performance and energy consumption. SynCron also has low hardware area and power overheads.
2. Background and Motivation
Numerous works [8, 9, 19–21, 25, 38, 43, 45, 54, 55, 67, 73, 82, 99, 112, 128, 143, 155, 158] show the potential benefit of NDP for parallel, irregular applications. These proposals focus on the design of the compute logic that is placed close to or within memory, and in many cases provide special-purpose near-data accelerators for specific applications. Figure 1 shows the baseline organization of the NDP architecture we assume in this work, which includes several NDP units connected with each other via serial interconnection links to share the same physical address space. Each NDP unit includes the memory arrays and a compute die with multiple low-power programmable cores or fixed-function accelerators, which we henceforth refer to as NDP cores. NDP cores execute the offloaded NDP kernel and access the various memory locations across NDP units with non-uniform access times [8, 20, 21, 38, 143, 155, 158]. We assume that there is no OS running in the NDP system.

In our evaluation, we use programmable in-order NDP cores, each including small private L1 I/D caches. However, SynCron can be used with any programmable, fixed-function or reconfigurable NDP accelerator. We assume software-assisted cache-coherence (provided by the operating system or the programmer), similar to [43, 143]: data can be either thread-private, shared read-only, or shared read-write. Thread-private and shared read-only data can be cached by NDP cores, while shared read-write data is uncacheable.
Figure 1: High-level organization of an NDP architecture.
We focus on three characteristics of NDP architectures that are of particular importance in the synchronization context. First, NDP architectures typically do not have a shared level of cache memory [8, 19, 25, 38, 42–46, 49, 55, 67, 98, 110, 111, 113, 119, 155, 158], since the NDP-suited workloads usually do not benefit from deep cache hierarchies due to their poor locality [33, 43, 133, 143]. Second, NDP architectures do not typically support conventional hardware cache coherence protocols [8, 19, 25, 38, 42–45, 49, 55, 67, 82, 98, 111, 119, 155, 158], because they would add area and traffic overheads [46, 143], and would incur high complexity and latency [4], limiting the benefits of NDP. Third, communication across NDP units is expensive, because NDP systems are non-uniform distributed architectures. The energy and performance costs of inter-unit communication are typically orders of magnitude greater than the costs of intra-unit communication [8, 20, 21, 38, 43, 83, 155, 158], and thus inter-unit communication may slow down the execution of NDP cores [155].
Approaches to support synchronization are typically either via shared memory or via message-passing schemes.
In the shared-memory approach, cores coordinate via a consistent view of shared memory locations, using atomic read/write operations or atomic read-modify-write (rmw) operations. If rmw operations are not supported by hardware, Lamport's bakery algorithm [87] can provide synchronization to N participating cores, assuming sequential consistency [86]. However, this scheme scales poorly, as a core accesses O(N) memory locations at each synchronization retry. In contrast, commodity systems (CPUs, GPUs, MPPs) typically support rmw operations in hardware.

GPUs and MPPs support rmw operations in specialized hardware units (known as remote atomics), located in each bank of the shared cache [58, 148], or in the memory controllers [81, 88]. Remote atomics are also supported by an NDP work [43] at the vault controllers of the Hybrid Memory Cube (HMC) [59, 145]. Implementing synchronization primitives using remote atomics requires a spin-wait scheme, i.e., executing consecutive rmw retries. However, performing and sending every rmw operation to a shared, fixed location can cause high global traffic and create hotspots [41, 96, 108, 147, 153]. In NDP systems, consecutive rmw operations to a remote NDP unit would incur high traffic across NDP units, with high performance and energy overheads.

Commodity CPU architectures support rmw operations either by locking the bus (or equivalent link), or by relying on the hardware cache coherence protocol [68, 135], which many NDP architectures do not support. Therefore, coherence-based synchronization [13, 24, 27, 35, 36, 57, 100, 101, 103, 122, 126, 156] cannot be directly implemented in NDP architectures. Moreover, based on prior works on synchronization [22, 30, 76, 102, 107, 140], coherence-based synchronization would exhibit low scalability on NDP systems for two reasons. First, it performs poorly with a large number of cores, due to low scalability of conventional hardware coherence protocols [61, 79, 80, 135]. Most NDP systems include several NDP units [8, 83, 155, 158], each typically supporting hundreds of small, area-constrained cores [8, 19, 155, 158]. Second, the non-uniformity in memory accesses significantly affects the scalability of coherence-based synchronization [22, 30, 107, 156]. Prior work on coherence-based synchronization [30] observes that the latency of a lock acquisition that needs to transfer the lock across NUMA sockets can be up to 12.5× higher than that within a socket. We expect such effects to be aggravated in NDP systems, since they are by nature non-uniform and distributed [8, 20, 21, 38, 43, 83, 155, 158] with very low memory access latency within an NDP unit.

We validate these observations on both a real CPU and our simulated NDP system. On an Intel Xeon Gold server, we evaluate the operation throughput achieved by two coherence-based lock algorithms (Table 1), i.e., TTAS [122] and Hierarchical Ticket Lock (HTL) [103], using a microbenchmark taken from the libslock library [30]. When increasing the number of threads from 1 to 14 within a single socket, throughput drops by 3.91× and 2.77× for TTAS and HTL, respectively. Moreover, when pinning two threads on different NUMA sockets, throughput drops by up to 2.29× over when pinning them on the same socket, due to non-uniform memory access times of lock variables.

Million Operations per Second  | 1 thread (single-socket) | 14 threads (single-socket) | 2 threads (same-socket) | 2 threads (different-socket)
TTAS lock [122]                | 8.92 | 2.28 | 9.91 | 4.32
Hierarchical Ticket lock [103] | 8.06 | 2.91 | 9.01 | 6.79

Table 1: Throughput of two coherence-based lock algorithms on an Intel Xeon Gold server using the libslock library [30].
In our simulated NDP system, we evaluate the performance achieved by a stack data structure protected with a coarse-grained lock. Figure 2 shows the slowdown of the stack when using a coherence-based lock [63] (mesi-lock), implemented upon a MESI directory coherence protocol, over using an ideal lock with zero cost for synchronization (ideal-lock). First, we observe that the high contention for the cache line containing the mesi-lock and the resulting coherence traffic inside the network significantly limit scalability of the stack as the number of cores increases. With 60 NDP cores within a single NDP unit (Figure 2a), the stack with mesi-lock incurs 2.03× slowdown over ideal-lock. Second, we notice that the non-uniform memory accesses to the cache line containing the mesi-lock also impact the scalability of the stack. When increasing the number of NDP units while keeping total core count constant at 60 (Figure 2b), the slowdown of the stack with mesi-lock increases to 2.66× (using 4 NDP units) over ideal-lock. In non-uniform NDP systems, the scalability of coherence-based synchronization is severely limited by the long transfer latency and low bandwidth of the interconnect used between the NDP units.
Figure 2: Slowdown of a stack data structure using a coherence-based lock over using an ideal zero-cost lock, when varying (a) the NDP cores within a single NDP unit and (b) the number of NDP units while keeping core count constant at 60.
In the message-passing approach, cores coordinate with each other by exchanging messages (either in software or hardware) in order to reach an agreement. For instance, a recent NDP work [8] implements a barrier primitive via hardware message-passing communication among NDP cores, i.e., one core of the system works as a master core to collect the synchronization status of the rest. To improve system performance in non-uniform HMC-based NDP systems, Gao et al. [43] propose a tree-style barrier primitive, where cores exchange messages to first synchronize within a vault, then across the vaults of an HMC cube, and finally across HMC cubes. In general, optimized message-passing synchronization schemes proposed in the literature [2, 43, 53, 62, 64, 141] aim to minimize (i) the number of messages sent among cores, and (ii) expensive network traffic. To avoid the major issues of synchronization via shared memory described above, we design our approach building on the message-passing synchronization concept.

3. SynCron: Overview
SynCron is an end-to-end solution for synchronization in NDP architectures that improves performance, has low cost, eases programmability, and supports multiple synchronization primitives. SynCron relies on the following key techniques:
1. Hardware support for synchronization acceleration: We design low-cost hardware units, called Synchronization Engines (SEs), to coordinate the synchronization among NDP cores of the system. SEs eliminate the need for complex cache coherence protocols and expensive rmw operations, and incur modest hardware cost.

2. Direct buffering of synchronization variables: We add a specialized cache structure, the Synchronization Table (ST), inside an SE to keep synchronization information. Such direct buffering avoids costly memory accesses for synchronization, and enables high performance under low-contention scenarios.

3. Hierarchical message-passing communication: We organize the communication hierarchically, with each NDP unit including an SE. NDP cores communicate with their local SE that is located in the same NDP unit. SEs communicate with each other to coordinate synchronization at a global level. Hierarchical communication minimizes expensive communication across NDP units, and achieves high performance under high-contention scenarios.

4. Integrated hardware-only overflow management: We incorporate a hardware-only overflow management scheme to efficiently handle scenarios when the ST is fully occupied. This programmer-transparent technique effectively limits performance degradation under overflow scenarios.
Figure 3 provides an overview of our approach. SynCron exposes a simple programming interface such that programmers can easily use a variety of synchronization primitives in their multithreaded applications when writing them for NDP systems. The interface is implemented using two new instructions that are used by NDP cores to communicate synchronization requests to SEs. These are general enough to cover all semantics for the most widely-used synchronization primitives.
Figure 3: High-level overview of SynCron.

We add one SE in the compute die of each NDP unit. For a particular synchronization variable allocated in an NDP unit, the SE that is physically located in the same NDP unit is considered the Master SE. In other words, the Master SE is defined by the address of the synchronization variable. It is responsible for the global coordination of synchronization on that variable, i.e., among all SEs of the system. All other SEs are responsible only for the local coordination of synchronization among the cores in the same NDP unit with them.

NDP cores act as clients that send requests to SEs via hardware message-passing. SEs act as servers that process synchronization requests. In the proposed hierarchical communication, NDP cores send requests to their local SEs, while SEs of different NDP units communicate with the Master SE of the specific variable, to coordinate the process at a global level, i.e., among all NDP units.

When an SE receives a request from an NDP core for a synchronization variable, it directly buffers the variable in its ST, keeping all the information needed for synchronization in the ST. If the ST is full, we use the main memory as a fallback solution. To hierarchically coordinate synchronization via main memory in ST overflow cases, we design (i) a generic structure, called syncronVar, to keep track of required synchronization information, and (ii) specialized overflow messages to be sent among SEs. The hierarchical communication among SEs is implemented via corresponding support in the message encoding, the ST, and the syncronVar structure.
SynCron's Operation
SynCron supports locks, barriers, semaphores, and condition variables. Here, we present SynCron's operation for locks. SynCron has similar behavior for the other three primitives.
Lock Synchronization Primitive:
Figure 4 shows a system composed of two NDP units with two NDP cores each. In this example, all cores request and compete for the same lock. First, all NDP cores send local lock acquire messages to their SEs. After receiving these messages, each SE keeps track of its requesting cores by reserving one new entry in its ST, i.e., directly buffering the lock variable in the ST. Each ST entry includes a local waiting list (i.e., a hardware bit queue with one bit for each local NDP core), and a global waiting list (i.e., a bit queue with one bit for each SE of the system). To keep track of the requesting cores, each SE sets the bits corresponding to the requesting cores in the local waiting list of the ST entry. When the local SE receives a request for a synchronization variable for the first time, it sends a global lock acquire message to the Master SE, which in turn sets the corresponding bit in the global waiting list in its ST. This way, the Master SE keeps track of all requests to a particular variable coming from an SE, and can arbitrate between different SEs. The local SE can then serve successive local requests to the same variable until there are no other local requests. By using the proposed hierarchical communication protocol, the cores send local messages to their local SE, and the SE needs to send only one aggregated message, on behalf of all its local waiting cores, to the Master SE. As a result, we reduce the need for communication through the narrow, expensive links that connect different NDP units.
Figure 4: An example execution scenario for a lock requested by all NDP cores.
The Master SE first prioritizes the local waiting list, granting the lock to its own local NDP cores in sequence (e.g., to NDP Core 0 first, and to NDP Core 1 next in Figure 4). At the end of the critical section, each local lock owner sends a lock release message to its SE in order to release the lock. When there are no other local requests, the Master SE transfers the control of the lock to the SE of another NDP unit based on its global waiting list. Then, the local SE grants the lock to its local NDP cores in sequence. After all local cores release the lock, the SE sends an aggregated global lock release message to the Master SE and releases its ST entry. When the message arrives at the Master SE, if there are no other pending requests to the same variable, the Master SE releases its ST entry. In this example, SEs directly buffer the lock variable in their STs. If an ST is full, the Master SE globally coordinates synchronization by keeping track of all required information in main memory, via our proposed overflow management scheme (Section 4.3).

4. SynCron: Detailed Design
SynCron leverages the key observation that all synchronization primitives fundamentally communicate the same information, i.e., a waiting list of cores that participate in the synchronization process, and a condition to be met to notify one or more cores. Based on this observation, we design SynCron to cover the four most widely used synchronization primitives. Without loss of generality, we assume that each NDP core represents a hardware thread context with a unique ID. To support multiple hardware thread contexts per NDP core, the corresponding hardware structures of SynCron need to be augmented to include 1 bit per hardware thread context.

SynCron provides lock, barrier, semaphore and condition variable synchronization primitives, supporting two types of barriers: among cores of the same NDP unit and among cores across different NDP units of the system. SynCron's programming interface (Table 2) implements the synchronization semantics with two new ISA instructions, which are rich and general enough to express all supported primitives. NDP cores use these instructions to assemble messages for synchronization requests, which are issued through the network to SEs.
syncronVar *create_syncvar();
void destroy_syncvar(syncronVar *svar);
void lock_acquire(syncronVar *lock);
void lock_release(syncronVar *lock);
void barrier_wait_within_unit(syncronVar *bar, int initialCores);
void barrier_wait_across_units(syncronVar *bar, int initialCores);
void sem_wait(syncronVar *sem, int initialResources);
void sem_post(syncronVar *sem);
void cond_wait(syncronVar *cond, syncronVar *lock);
void cond_signal(syncronVar *cond);
void cond_broadcast(syncronVar *cond);

Table 2: SynCron's Programming Interface (i.e., API).

req_sync addr, opcode, info: This instruction creates a message and commits when a response message is received back. The addr register has the address of a synchronization variable, the opcode register has the message opcode of a particular semantic of a synchronization primitive (Table 3), and the info register has specific information needed for the primitive (MessageInfo in the message encoding of Fig. 5).

req_async addr, opcode: This instruction creates a message, and after the message is issued to the network, the instruction commits. The addr and opcode registers have the same semantics as in the req_sync instruction.

We design SynCron assuming a relaxed memory consistency model. The proposed ISA extensions act as memory fences. First, req_sync commits once a message (ACK) is received (from the local SE to the core), which ensures that all following instructions will be issued after req_sync has completed. Its semantics is similar to those of the SYNC and ACQUIRE operations of the Weak Ordering (WO) [28] and Release Consistency (RC) [28] models, respectively. Second, req_async does not require a return message (ACK). It is issued once all previous instructions are completed. Its semantics is similar to that of the RELEASE operation of RC [28]. In the case of WO, req_sync is sufficient. In the case of RC, the req_sync instruction is used for acquire-type semantics, i.e., lock_acquire, barrier_wait, semaphore_wait and condition_variable_wait, while the req_async instruction is used for release-type semantics, i.e., lock_release, semaphore_post, condition_variable_signal, and condition_variable_broadcast.
Figure 5 describes the encoding of the message used for communication between NDP cores and the SE. Each message includes: (i) the 64-bit address of the synchronization variable, (ii) the message opcode that implements the semantics of the different synchronization primitives (6 bits cover all message opcodes), (iii) the unique ID number of the NDP core (6 bits are sufficient for our simulated NDP system in Section 5), and (iv) a 64-bit field (MessageInfo) that communicates specific information needed for each different synchronization primitive, i.e., the number of the cores that participate in a barrier, the initial value of a semaphore, or the address of the lock associated with a condition variable.
Address (64 bits) | Opcode (6 bits) | CoreID (6 bits) | MessageInfo (64 bits)

Figure 5: Message encoding of SynCron.

Hierarchical Message Opcodes.
SynCron enables a hierarchical scheme, where the SEs of NDP units communicate with each other to coordinate synchronization at a global level. Therefore, we support two types of messages (Table 3): (i) local, which are used by NDP cores to communicate with their local SE, and (ii) global, which are used by SEs to communicate with the Master SE, and vice versa. Since we support two types of barriers (Table 2), we design two message opcodes for a local barrier_wait message sent by an NDP core to its local SE: (i) barrier_wait_local_within_unit is used when cores of a single NDP unit participate in the barrier, and (ii) barrier_wait_local_across_units is used when cores from different NDP units participate in the barrier. In the latter case, if a smaller number of cores than the total available cores of the NDP system participate in the barrier, SynCron supports one-level communication: local SEs re-direct all messages (received from their local NDP cores) to the Master SE, which globally coordinates the barrier among all participating cores. This design choice is a trade-off between performance (more remote messages) and hardware/ISA complexity, since the number of participating cores of each NDP unit would need to be communicated to the hardware through additional registers in the ISA, and through more message opcodes (higher complexity).
Primitives | Message Opcodes
Locks | lock_acquire_global, lock_acquire_local, lock_release_global, lock_release_local, lock_grant_global, lock_grant_local, lock_acquire_overflow, lock_release_overflow, lock_grant_overflow
Barriers | barrier_wait_global, barrier_wait_local_within_unit, barrier_wait_local_across_units, barrier_depart_global, barrier_depart_local, barrier_wait_overflow, barrier_departure_overflow
Semaphores | sem_wait_global, sem_wait_local, sem_grant_global, sem_grant_local, sem_post_global, sem_post_local, sem_wait_overflow, sem_grant_overflow, sem_post_overflow
Condition Variables | cond_wait_global, cond_wait_local, cond_signal_global, cond_signal_local, cond_broad_global, cond_broad_local, cond_grant_global, cond_grant_local, cond_wait_overflow, cond_signal_overflow, cond_broad_overflow, cond_grant_overflow
Other | decrease_indexing_counter

Table 3: Message opcodes of SynCron.

Each SE module (Figure 6) is integrated into the compute die of each NDP unit. An SE consists of three components: the Synchronization Processing Unit (SPU), the Synchronization Table (ST), and the indexing counters.
Synchronization Processing Unit (SPU). The SPU is the logic that handles the messages, updates the ST, and issues requests to memory as needed. The SPU includes the control unit, a buffer, and a few registers. The buffer is a small SRAM queue for temporarily storing messages that arrive at the SE. The control unit implements custom logic with simple bitwise operators (and, or, xor, zero) and multiplexers.
Figure 6: The Synchronization Engine (SE).
Synchronization Table (ST). The ST keeps track of all the information needed to coordinate synchronization. Each ST has 64 entries. Figure 7 shows an ST entry, which includes: (i) the 64-bit address of a synchronization variable, (ii) the global waiting list used by the Master SE for global synchronization among SEs, i.e., a hardware bit queue including one bit for each SE of the system, (iii) the local waiting list used by all SEs for synchronization among the NDP cores of an NDP unit, i.e., a hardware bit queue including one bit for each NDP core within the unit, (iv) the state of the ST entry, which can be either free or occupied, and (v) a 64-bit field (TableInfo) to track specific information needed for each synchronization primitive. For the lock primitive, the TableInfo field indicates the lock owner, which is either an SE of an NDP unit (Global ID, represented by the most significant bits) or a local NDP core (Local ID, represented by the least significant bits). We assume that all NDP cores of an NDP unit have a unique local ID within the NDP unit, while all SEs of the system have a unique global ID within the system. The number of bits in the global and local waiting lists of Figure 7 is specific to the configuration of our evaluated system (Section 5), which includes 16 NDP cores per NDP unit and 4 SEs (one per NDP unit), and has to be extended accordingly if the system supports more NDP cores or SEs.
Address (64 bits) | Global Waitlist (4 bits) | Local Waitlist (16 bits) | State (1 bit) | TableInfo (64 bits)

TableInfo contents per primitive: Lock → Global ID | Local ID; Barrier, Semaphore → Current; Condition Variable → Lock Address.
Figure 7: Synchronization Table (ST) entry.
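The ST entry can be rendered as a plain C struct for illustration; the field widths follow Figure 7, while the exact bit boundary of the owner packing (global ID in the upper bits, local ID in the lower bits, with an assumed 8-bit split) is our own choice, since the text only fixes the ordering:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative C model of one ST entry (Figure 7); real widths in comments. */
typedef struct {
    uint64_t addr;            /* 64 bits: synchronization variable address */
    uint8_t  global_waitlist; /* 4 bits used: one per SE in the system     */
    uint16_t local_waitlist;  /* 16 bits: one per NDP core within the unit */
    uint8_t  occupied;        /* 1 bit: free (0) or occupied (1)           */
    uint64_t table_info;      /* 64 bits: primitive-specific data          */
} st_entry_t;

/* For locks, TableInfo holds the owner: Global ID in the most significant
 * bits, Local ID in the least significant bits (8-bit split assumed here). */
static uint64_t pack_lock_owner(uint64_t global_id, uint64_t local_id) {
    return (global_id << 8) | (local_id & 0xFF);
}
static uint64_t owner_global_id(uint64_t table_info) { return table_info >> 8; }
static uint64_t owner_local_id(uint64_t table_info)  { return table_info & 0xFF; }
```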
Indexing Counters. If an ST is full, i.e., all its entries are in occupied state, SynCron cannot keep track of information for a new synchronization variable in the ST. We use main memory as a fallback solution for such ST overflow (Section 4.3). The SE keeps track of which synchronization variables are currently serviced via main memory: similar to MiSAR [97], we include a small set of counters (indexing counters), 256 in our current implementation, indexed by the least significant bits of the address of a synchronization variable, as extracted from the message that arrives at an SE. When an SE receives a message with acquire-type semantics for a synchronization variable and there is no corresponding entry in the fully-occupied ST, the indexing counter for that synchronization variable increases. When an SE receives a message with release-type semantics for a synchronization variable that is currently serviced using main memory, the corresponding indexing counter decreases. A synchronization variable is currently serviced via main memory when the corresponding indexing counter is larger than zero. Note that different variables may alias to the same indexing counter. This aliasing does not affect correctness, but it does affect performance, since a variable may unnecessarily be serviced via main memory while the ST is not full.
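A minimal software sketch of this counter bookkeeping (SynCron implements it in SE hardware; the array and helper names here are ours):

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Sketch of the 256 indexing counters: a variable is serviced via main
 * memory while its counter is nonzero. Two addresses whose low 8 bits
 * match alias to the same counter, which is safe but conservative. */
#define NUM_COUNTERS 256
static uint32_t idx_counter[NUM_COUNTERS];

static uint32_t counter_index(uint64_t addr) {
    return (uint32_t)(addr & (NUM_COUNTERS - 1));
}

/* Acquire-type message misses in a fully-occupied ST. */
static void on_overflow_acquire(uint64_t addr) {
    idx_counter[counter_index(addr)]++;
}

/* Release-type message for a variable serviced via main memory. */
static void on_overflow_release(uint64_t addr) {
    if (idx_counter[counter_index(addr)] > 0)
        idx_counter[counter_index(addr)]--;
}

static bool serviced_via_memory(uint64_t addr) {
    return idx_counter[counter_index(addr)] > 0;
}
```

Note how two distinct variables can alias: any address ending in the same low byte maps to the same counter, so one overflowed variable can force another to take the memory path, exactly the performance (not correctness) effect the text describes.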
Figure 8 describes the control flow in the SE. When an SE receives a message, it decodes the message and accesses the ST. If there is an ST entry for the specific variable (depending on its address), the SE processes the waiting lists, updates the ST, and encodes return message(s), if needed. If there is no ST entry for the specific variable, the SE checks the value of the corresponding indexing counter: (i) if the indexing counter is zero and the ST is not full, the SE reserves a new ST entry and continues with processing the waiting lists; otherwise, (ii) if the indexing counter is larger than zero or the ST is full, there is an overflow. In that case, if the SE is the Master SE for the specific variable, it reads the synchronization variable from its local memory arrays, processes the waiting lists, updates the variable in main memory, and encodes return message(s), if needed. If the SE is not the Master SE for the specific variable, it encodes an overflow message to the Master SE to handle the overflow.

Figure 8: Control flow in SE.
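The decision flow of Figure 8 can be sketched in software as follows; the tiny four-entry ST, the helper names, and the return codes are our own simplifications, not SynCron's hardware interface:

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/* Simplified model of the SE control flow (Figure 8). The ST here has
 * only 4 entries (the real ST has 64) and tracks just the address. */
#define ST_ENTRIES   4
#define NUM_COUNTERS 256

typedef struct { uint64_t addr; bool occupied; } st_entry_t;
static st_entry_t st[ST_ENTRIES];
static uint32_t counters[NUM_COUNTERS];

typedef enum { SERVED_FROM_ST, SERVED_FROM_MEMORY, SENT_TO_MASTER } action_t;

static int st_find(uint64_t addr) {
    for (int i = 0; i < ST_ENTRIES; i++)
        if (st[i].occupied && st[i].addr == addr) return i;
    return -1;
}

static bool st_reserve(uint64_t addr) {
    for (int i = 0; i < ST_ENTRIES; i++)
        if (!st[i].occupied) { st[i].addr = addr; st[i].occupied = true; return true; }
    return false;  /* ST full */
}

static action_t se_handle(uint64_t addr, bool acquire, bool is_master) {
    if (st_find(addr) >= 0)
        return SERVED_FROM_ST;            /* hit: process waitlists, update ST */
    uint32_t *ctr = &counters[addr & (NUM_COUNTERS - 1)];
    if (*ctr == 0 && st_reserve(addr))
        return SERVED_FROM_ST;            /* new entry, continue as on a hit */
    if (acquire) (*ctr)++;                /* overflow: track via the counter */
    else if (*ctr > 0) (*ctr)--;
    return is_master ? SERVED_FROM_MEMORY /* Master SE: use local memory */
                     : SENT_TO_MASTER;    /* else: defer to the Master SE */
}
```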
SynCron integrates a hardware-only overflow management scheme that incurs very modest performance degradation (Section 6.7.3) and is programmer-transparent. To handle ST overflow cases, we need to address two issues: (i) where to keep track of the information required to coordinate synchronization, and (ii) how to coordinate ST overflow cases between SEs. For the former, we design a generic structure allocated in main memory. For the latter, we propose a hierarchical overflow communication protocol between SEs.
SynCron's Synchronization Variable. We design a generic structure (Figure 9), called syncronVar, which is used to coordinate synchronization for all supported primitives in ST overflow cases. syncronVar is defined in the driver of the NDP system, which handles the allocation of the synchronization variables: programmers use create_syncvar() (Table 2) to create a new synchronization variable, the driver allocates the bytes needed for syncronVar in main memory, and returns an opaque pointer that points to the address of the variable. Programmers should not de-reference the opaque pointer, and its content can only be accessed via SynCron's API (Table 2).

    struct syncronVar_t {
        uint16_t Waitlist[4];
        uint64_t VarInfo;
        uint8_t  OverflowInfo;
    };
    typedef struct syncronVar_t syncronVar;

Figure 9: Synchronization variable of SynCron (syncronVar).

The syncronVar structure includes one waiting list for each SE of the system, which has one bit for each NDP core within the NDP unit, and two additional fields (VarInfo, OverflowInfo) needed to hierarchically handle ST overflows for all primitives.
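For illustration, a driver-side allocation of this structure might look as follows; create_syncvar() is SynCron's API name (Table 2), but this host-side body, the opaque-handle type, and the use of calloc in place of the driver's NDP-memory allocator are our assumptions:

```c
#include <stdint.h>
#include <stdlib.h>
#include <assert.h>

/* syncronVar as defined in Figure 9. */
struct syncronVar_t {
    uint16_t Waitlist[4];   /* one waiting list per SE of the system */
    uint64_t VarInfo;       /* primitive-specific data               */
    uint8_t  OverflowInfo;  /* which SEs have overflowed             */
};
typedef struct syncronVar_t syncronVar;

/* Opaque handle returned to programmers; they must not dereference it. */
typedef void *syncvar_t;

/* Hypothetical sketch of create_syncvar(): the real driver allocates in
 * the NDP main memory; calloc stands in for that allocator here. */
static syncvar_t create_syncvar(void) {
    return calloc(1, sizeof(syncronVar));   /* zero-initialized variable */
}
```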
To ensure correctness, only the Master SE updates the syncronVar variable: in ST overflow, the SPU of the Master SE issues read or write requests to its local memory to globally coordinate synchronization via the syncronVar variable. In our proposed hierarchical design, there are two overflow scenarios: (i) the ST of the Master SE overflows, and (ii) the ST of a local SE overflows, or the STs of multiple local SEs overflow.
The ST of the Master SE overflows. The other SEs of the system have not overflowed for a specific synchronization variable. Thus, they can still directly buffer this variable in their local STs and serve their local cores themselves, implementing a hierarchical (two-level) communication with the Master SE. The Master SE receives global messages from SEs, and serves a local SE of an NDP unit using all bits in the waiting list of the syncronVar variable associated with that local SE. Specifically, when it receives a global acquire-type message from a local SE, it sets all bits in the corresponding waiting list of the syncronVar variable. When it receives a global release-type message from a local SE, it resets all bits in the corresponding waiting list of the syncronVar variable.
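A sketch of this all-bits set/reset convention (helper names are ours; in SynCron the Master SE's SPU performs these updates on its local memory):

```c
#include <stdint.h>
#include <assert.h>

/* While the Master SE's ST has overflowed, a non-overflowed local SE is
 * represented by all 16 bits of its waiting list in syncronVar. */
typedef struct { uint16_t Waitlist[4]; } syncronVar;

static void on_global_acquire(syncronVar *v, int se_id) {
    v->Waitlist[se_id] = 0xFFFF;   /* set all bits: this SE has waiters */
}

static void on_global_release(syncronVar *v, int se_id) {
    v->Waitlist[se_id] = 0x0000;   /* reset all bits: no waiters left */
}
```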
The ST of a local SE overflows. In this scenario, there are local SEs that have overflowed for a specific variable, and local SEs that have not overflowed. Without loss of generality, we assume that only one SE of the system has overflowed.

The local SEs that have not overflowed serve their local cores themselves via their STs, implementing a hierarchical (two-level) communication with the Master SE. When the Master SE receives a global message from a local SE (that has not overflowed), it (i) sets (or resets) all bits in the waiting list of the syncronVar variable associated with that SE, and (ii) responds with a global message to the local SE, if needed.

The overflowed SE needs to notify the Master SE to handle local synchronization requests of NDP cores located at another NDP unit via main memory. We design overflow message opcodes (Table 3) to be sent from the local overflowed SE to the Master SE and back. The overflowed SE re-directs all messages (sent from its local NDP cores) for a specific variable to the Master SE using the overflow message opcodes, and both the overflowed SE and the Master SE increase their corresponding indexing counters to indicate that this variable is currently serviced via memory. When the Master SE receives an overflow message, it (i) sets (or resets), in the waiting list (associated with the overflowed SE) of the syncronVar variable, the bit that corresponds to the local ID of the NDP core within the NDP unit, (ii) sets (or resets), in the OverflowInfo field of the syncronVar variable, the bit that corresponds to the global ID of the overflowed SE, to keep track of which SE (or SEs) of the system has overflowed, and (iii) responds with an overflow message to that SE, if needed. The local ID of the NDP core and the global ID of the overflowed SE are encoded in the CoreID field of the message (Figure 5). When all bits in the waiting lists of the syncronVar variable become zero (upon receiving a release-type message), the Master SE decrements the corresponding indexing counter. Then, it sends a decrease_indexing_counter message (Table 3) to the overflowed SE (based on the set bit that is tracked in the OverflowInfo field), which decrements its corresponding indexing counter.

SynCron Enhancements
RMW Operations. It is straightforward to extend SynCron to support simple atomic rmw operations inside the SE (by adding a lightweight ALU). The Master SE could be responsible for executing atomic rmw operations on a variable depending on its address. We leave that for future work.
Lock Fairness. When local cores of an NDP unit repeatedly request a lock from their local SE, the SE repeatedly grants the lock within its unit, potentially causing unfairness and delay to other NDP units. To prevent this, an extra field of a local grant counter could be added to the ST entry. The counter increases every time the SE grants the lock to a local core. If the counter exceeds a predefined threshold, then when the SE receives a lock release, it transfers the lock to another SE (assuming other SEs request the lock). The host OS or the user could dynamically set this threshold via a dedicated register. We leave the exploration of such fairness mechanisms to future work.
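The grant-counter idea above can be sketched as follows; the threshold value and function names are hypothetical, since the text leaves this mechanism to future work:

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/* Sketch of the proposed local-grant counter: after too many consecutive
 * local grants, a release hands the lock to a waiting remote SE. The
 * threshold would be set by the host OS or user via a dedicated register;
 * 8 is a made-up value for illustration. */
#define GRANT_THRESHOLD 8

typedef struct { uint32_t local_grants; } lock_entry_t;

static void on_local_grant(lock_entry_t *e) {
    e->local_grants++;
}

/* Called on lock release; returns true if the lock should move to a
 * remote SE that is waiting for it. */
static bool should_transfer(lock_entry_t *e, bool remote_waiting) {
    if (remote_waiting && e->local_grants >= GRANT_THRESHOLD) {
        e->local_grants = 0;    /* reset after transferring the lock */
        return true;
    }
    return false;
}
```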
SynCron's design shares some of its design concepts with SSB [157], LCU [146], and MiSAR [97]. However, SynCron is more general, supporting the four most widely-used synchronization primitives, and easy-to-use thanks to its high-level programming interface.

Table 4 qualitatively compares SynCron with these schemes. SSB and LCU support only lock semantics, thus they introduce two ISA extensions for a simple lock. MiSAR introduces seven ISA extensions to support three primitives and handle overflow scenarios. SynCron includes two ISA extensions for four supported primitives. A spin-wait approach performs consecutive synchronization retries, typically incurring high energy consumption. A direct notification scheme sends a direct message to only one waiting core when the synchronization variable becomes available, minimizing the traffic involved upon a release operation. SSB, LCU, and MiSAR are tailored to uniform memory systems. In contrast, SynCron is the only hardware synchronization mechanism that targets NDP systems as well as non-uniform memory systems.

SSB and LCU handle overflow in hardware synchronization resources using a pre-allocated table in main memory, and if it overflows, they switch to software exception handlers (handled by the programmer), which typically incur large overheads (due to OS intervention) when overflows happen at a non-negligible frequency. To avoid falling back to main memory, which has high latency, and using expensive software exception handlers, MiSAR requires the programmer to handle overflow scenarios using alternative software synchronization libraries (e.g., the pthread library provided by the OS). This approach can provide performance benefits in CPU systems, since alternative synchronization solutions can exploit low-cost accesses to caches and hardware cache coherence. However, in NDP systems alternative solutions would by default use main memory, due to the absence of shared caches and hardware cache coherence support. Moreover, when overflow occurs, MiSAR's accelerator sends abort messages to all participating CPU cores notifying them to use the alternative solution, and when the cores finish synchronizing via the alternative solution, they notify MiSAR's accelerator to switch back to hardware synchronization. This scheme introduces additional hardware/ISA complexity and communication between the cores and the accelerator, thus incurring high network traffic and communication costs, as we show in Section 6.7.3. In contrast, SynCron directly falls back to memory via a fully-integrated hardware-only overflow scheme, which provides graceful performance degradation (Section 6.7.3), and is completely transparent to the programmer: programmers only use SynCron's high-level API, similarly to how software libraries are in charge of synchronization.
 | SSB [157] | LCU [146] | MiSAR [97] | SynCron
Supported Primitives | 1 | 1 | 3 | 4
ISA Extensions | 2 | 2 | 7 | 2
Spin-Wait Approach | yes | yes | no | no
Direct Notification | no | yes | yes | yes
Target System | uniform | uniform | uniform | non-uniform
Overflow Management | partially integrated | partially integrated | handled by programmer | fully integrated

Table 4: Comparison of SynCron with prior mechanisms.
SynCron in Conventional Systems
The baseline NDP architecture [8, 43, 143, 155, 158] we assume in this work shares key design principles with conventional NUMA systems. However, unlike NDP systems, NUMA CPU systems (i) have a shared level of cache (within a NUMA socket and/or across NUMA sockets), (ii) run multiple multi-threaded applications, i.e., a high number of software threads executed in hardware thread contexts, and (iii) have an OS that migrates software threads between hardware thread contexts to improve system performance. Therefore, although SynCron could be implemented in such commodity systems, our proposed hardware design would need extensions. First, SynCron could exploit the low-cost accesses to shared caches in conventional CPUs, e.g., by including an additional level in SynCron's hierarchical design to use the shared cache for efficient synchronization within a NUMA socket, and/or by handling overflow scenarios by falling back to the low-latency cache instead of main memory. Second, SynCron would need to support characteristics (ii) and (iii) listed above, i.e., by including larger STs and waiting lists to satisfy the needs of multiple multi-threaded applications, handling OS thread migration across hardware thread contexts, and handling multiple synchronization requests sent to SEs from different software threads with the same hardware ID, when different software threads are executed on the same hardware thread context. We leave the optimization of SynCron's design for conventional systems to future work.
5. Methodology
Simulation Methodology.
We use an in-house simulator that integrates ZSim [125] and Ramulator [85]. We model 4 NDP units (Table 5), each with 16 in-order cores. The cores issue a memory operation after the previous one has completed, i.e., there are no overlapping operations issued by the same core. Any write operation is completed (and its latency is accounted for in our simulations) before executing the next instruction. To ensure memory consistency, compiler support [123] guarantees that there is no reordering around the sync instructions, and a read is inserted after a write inside a critical section.
NDP Cores | 16 in-order cores @2.5 GHz per NDP unit
L1 Data + Inst. Cache | private, 16KB, 2-way, 4-cycle; 64B line; 23/47 pJ per hit/miss [109]
NDP Unit Local Network | buffered crossbar network with packet flow control; 1-cycle arbiter
DRAM | HBM / HMC / DDR4
Interconnection Links Across NDP Units |
Synchronization Engine | SPU @1GHz clock frequency [129]; buffer: 280B; ST: 1192B, 64 entries, 1-cycle [109]; indexing counters: 2304B, 256 entries (8 LSB of the address), 2-cycle [109]

Table 5: Configuration of our simulated system.
We evaluate three NDP configurations for different memory technologies, namely 2D, 2.5D, and 3D NDP. The 2D NDP configuration uses a DDR4 memory model and resembles recent 2D NDP systems [34, 50, 89, 144]. In the 2.5D NDP configuration, each compute die of an NDP unit (16 NDP cores) is connected to an HBM stack via an interposer, similar to current GPUs [106, 115] and FPGAs [131, 150]. For the 3D NDP configuration, we use the HMC memory model, where the compute die of the NDP unit is located in the logic layer of the memory stack, as in prior works [8, 19, 155, 158]. Due to space limitations, we present detailed evaluation results for the 2.5D NDP configuration, and provide a sensitivity study for the different NDP configurations in Section 6.5. We model a crossbar network within each NDP unit, simulating queuing latency using the M/D/1 model [18]. We count in ZSim-Ramulator all events for caches (i.e., number of hits/misses), the network (i.e., number of bits transferred inside/across NDP units), and memory (i.e., number of total memory accesses), and use CACTI [109] and parameters reported in prior works [143, 149, 151] to calculate energy. To estimate the latency in the SE, we use CACTI for the ST and indexing counters, and Aladdin [129] for the SPU at 1GHz and 40nm. Each message is served in 12 cycles, corresponding to the message (barrier_depart_global) that takes the longest time.
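As a reminder of the queuing model used above, the mean waiting time of an M/D/1 queue (Poisson arrivals at rate λ, deterministic service at rate μ, utilization ρ = λ/μ < 1) is the standard result

```latex
W_q = \frac{\rho}{2\mu(1-\rho)}
```

which follows from the Pollaczek-Khinchine formula with zero service-time variance; the simulator applies such a model per crossbar port.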
Workloads.
We evaluate workloads with both (i) coarse-grained synchronization, i.e., including only a few synchronization variables to protect shared data, leading to cores highly contending for them (high-contention), and (ii) fine-grained synchronization, i.e., including a large number of synchronization variables, each of them protecting a small granularity of shared data, leading to cores not frequently contending for the same variables at the same time (low-contention). We use the term synchronization intensity to refer to the ratio of synchronization operations over other computation in the workload. As this ratio increases, synchronization latency affects the total execution time of the workload more.

We study three classes of applications (Table 6), all well suited for NDP. First, we evaluate pointer chasing workloads, i.e., lock-based concurrent data structures from the ASCYLIB library [31], used as key-value sets. In ASCYLIB's Binary Search Tree (BST) [37], the lock memory requests are only 0.1% of the total memory requests, so we also evaluate an external fine-grained locking BST from [130]. Data structures are initialized with a fixed size and statically partitioned across NDP units, except for BSTs, which are distributed randomly. In these benchmarks, each core performs a fixed number of operations. We use lookup operations for data structures that support them, deletion for the rest, and push and pop operations for stack and queue. Second, we evaluate graph applications with fine-grained synchronization from Crono [7, 65] (push version), where the output array has read-write data. All real-world graphs [32] used are undirected and statically partitioned across NDP units, where the vertex data is equally distributed across cores. Third, we evaluate time series analysis [142], using SCRIMP, and real data sets from Matrix Profile [152]. We replicate the input data in each NDP unit and partition the output array (read-write data) across NDP units.
Comparison Points.
We compare SynCron with three schemes: (i) Central: a message-passing scheme that supports all primitives by extending the barrier primitive of Tesseract [8], i.e., one dedicated NDP core in the entire NDP system acts as server and coordinates synchronization among all NDP cores of the system by issuing memory requests to synchronization variables via its memory hierarchy, while the remaining client cores communicate with it via hardware message-passing; (ii) Hier: a hierarchical message-passing scheme that supports all primitives, similar to the barrier primitive of [43] (or the hierarchical lock of [141]), i.e., one NDP core per NDP unit acts as server and coordinates synchronization by issuing memory requests to synchronization variables via its memory hierarchy (including caches), and communicates with other servers and local client cores (located at the same NDP unit with it) via hardware message-passing; (iii) Ideal: an ideal scheme with zero performance overhead for synchronization.

In our evaluation, each NDP core runs one thread. For fair comparison, we use the same number of client cores, i.e., 15 per NDP unit, that execute the main workload for all schemes. For synchronization, we add one server core for the entire system in Central, one server core per NDP unit for Hier, and one SE per NDP unit for SynCron. For SynCron, we disable one core per NDP unit to match the same number of client cores as the previous schemes. Maintaining the same thread-level parallelism for executing the main kernel is consistent with prior works on message-passing synchronization [97, 141].

Data Structure | Configuration
Stack [31] | 100K - 100% push
Queue [31, 104] | 100K - 100% pop
Array Map [31, 56] | 10 - 100% lookup
Priority Queue [11, 31, 118] | 20K - 100% deleteMin
Skip List [31, 118] | 5K - 100% deletion
Hash Table [31, 63] | 1K - 100% lookup
Linked List [31, 63] | 20K - 100% lookup
Binary Search Tree Fine-Grained (BST_FG) [130] | 20K - 100% lookup
Binary Search Tree Drachsler (BST_Drachsler) [31, 37] | 10K - 100% deletion

Real Application | Locks | Barriers
Breadth First Search (bfs) [7] | ✓ | ✓
Connected Components (cc) [7] | ✓ | ✓
Single Source Shortest Paths (sssp) [7] | ✓ | ✓
Pagerank (pr) [7] | ✓ | ✓
Teenage Followers (tf) [65] | ✓ | -
Triangle Counting (tc) [7] | ✓ | ✓
Time Series Analysis (ts) [152] | ✓ | ✓

Real Application | Input Data Set
bfs, cc, sssp, pr, tf, tc | wikipedia-20051105 (wk), soc-LiveJournal1 (sl), sx-stackoverflow (sx), com-Orkut (co)
ts | air quality (air), energy consumption (pow)

Table 6: Summary of all workloads used in our evaluation.
6. Evaluation
Figure 10 evaluates all supported primitives using 60 cores, varying the interval (in terms of instructions) between two synchronization points. We devise simple benchmarks, where cores repeatedly request a single synchronization variable. For lock, the critical section is empty, i.e., it does not include any instruction. For semaphore and condition variable, half of the cores execute sem_wait/cond_wait, while the rest execute sem_post/cond_signal, respectively. As the interval between synchronization points becomes smaller, SynCron's performance benefit increases. For an interval of 200 instructions, SynCron outperforms Central and Hier by 3.05× and 1.40×, respectively, averaged across all primitives. SynCron outperforms Hier due to directly buffering synchronization variables in low-latency STs, and achieves the highest benefits for the condition variable primitive (by 1.61×), since this benchmark has higher synchronization intensity compared to the rest: cores coordinate for both the condition variable and the lock associated with it. When the interval between synchronization operations becomes larger, synchronization requests become less dominant in the main workload, and thus all schemes perform similarly. Overall, SynCron outperforms prior schemes for all different synchronization primitives.
Figure 10: Speedup of different synchronization primitives.

Pointer Chasing Data Structures. Figure 11 shows the throughput for all schemes in pointer chasing, varying the NDP cores in steps of 15, each time adding one NDP unit.
Figure 11: Throughput of pointer chasing using data structures.
We observe four different patterns. First, stack, queue, array map, and priority queue incur high contention, as all cores heavily contend for a few variables. Array map has the lowest scalability due to a larger critical section. In high-contention scenarios, hierarchical schemes (Hier, SynCron) perform better by reducing the expensive traffic across NDP units. SynCron outperforms Hier, since the latency cost of using SEs that update small STs is lower than that of using NDP cores as servers that update larger caches. Second, skip list and hash table incur medium contention, as different cores may work on different parts of the data structure. For these data structures, hierarchical schemes perform better, as they minimize the expensive traffic, and multiple server cores concurrently serve requests to their local memory. SynCron retains most of the performance benefits of Ideal, incurring only 19.9% overhead with 60 cores, and outperforms Hier by 9.8%. Third, linked list and BST_FG exhibit low contention and high synchronization demand, as each core requests multiple locks concurrently. These data structures cause higher synchronization-related traffic inside the network compared to skip list and hash table, and thus SynCron further outperforms Hier by 1.19× due to directly buffering synchronization variables in STs. Fourth, in BST_Drachsler, lock requests constitute only 0.1% of the total requests, and all schemes perform similarly. Overall, we conclude that SynCron achieves higher throughput than prior mechanisms under different scenarios with diverse conditions.
Figure 12 shows the performance of all schemes with real applications using all NDP units, normalized to Central. Averaged across 26 application-input combinations, SynCron outperforms Central by 1.47× and Hier by 1.23×, and performs within 9.5% of Ideal. Our real applications exhibit low contention, as two cores rarely contend for the same synchronization variable, and high synchronization demand, as several synchronization variables are active during execution. We observe that Hier and SynCron increase parallelism, because the per-NDP-unit servers service different synchronization requests concurrently, and avoid remote synchronization messages across NDP units. Even though Hier performs 1.19× better than Central, on average, its performance is still 1.33× worse than Ideal. SynCron provides most of the performance benefits of Ideal (with only 9.5% overhead on average), and outperforms Hier due to directly buffering the synchronization variables in STs, thereby completely avoiding the memory accesses for synchronization requests. Specifically, we find that time series analysis has high synchronization intensity, since the ratio of synchronization over other computation of the workload is higher compared to graph workloads. For this application, Hier and SynCron outperform Central by 1.64× and 2.22×, respectively, as they serve multiple synchronization requests concurrently. SynCron further outperforms Hier by 1.35× due to directly buffering the synchronization variables in STs. We conclude that SynCron performs best across all real application-input combinations and approaches the Ideal scheme, which has no synchronization overhead.
Scalability.
Figure 13 shows the scalability of real applica-tions using
SynCron from 1 to 4 NDP units. Due to spacelimitations, we present a subset of our workloads, but we re-port average values for all 26 application-input combinations.This also applies for all figures presented henceforth. Acrossall workloads,
SynCron enables performance scaling by atleast 1.32 × , on average 2.03 × , and up to 3.03 × , when using4 NDP units (60 NDP cores) over 1 NDP unit (15 NDP cores). bfs.sl cc.sx sssp.co pr.wk tf.sl tc.sx ts.air ts.pow AVG S p ee d u p Figure 13: Scalability of real applications using
SynCron . Figure 14 shows the energy breakdown for cache, network,and memory in our real applications when using all cores.
SynCron reduces the network and memory energy thanks to its hierarchical design and direct buffering. On average, SynCron reduces energy consumption by 2.22× over Central and 1.94× over Hier, and incurs only 6.2% energy overhead over Ideal. We observe that 1) cache energy consumption constitutes a small portion of the total energy, since these applications have irregular access patterns. NDP cores that act as servers for Central and Hier increase the cache energy by only 5.1% and 4.8% over Ideal. 2) Central generates a larger amount of expensive traffic across NDP units compared to hierarchical schemes, resulting in 2.68× higher network energy over SynCron. SynCron also has lower network energy (by 1.21×) than Hier, because it avoids transferring synchronization variables from memory to SEs, due to directly buffering them. 3) Hier and Central have approximately the same memory energy consumption, because they issue a similar number of requests to memory. In contrast, SynCron's memory energy consumption is similar to that of Ideal. We note that SynCron provides higher energy reductions in applications with high synchronization intensity, such as time series analysis, since it avoids a higher number of memory accesses for synchronization due to its direct buffering capability.

Figure 12: Speedup in real applications normalized to Central.

Figure 14: Energy breakdown in real applications for C: Central, H: Hier, SC: SynCron, and I: Ideal.
Figure 15 shows normalized data movement, i.e., bytes transferred between NDP cores and memory, for all schemes using four NDP units. SynCron reduces data movement across all workloads by 2.08× and 2.04× over Central and Hier, respectively, on average, and incurs only 13.8% more data movement than Ideal. Central generates high data movement across NDP units, particularly when running time series analysis, which has high synchronization intensity. Hier reduces the traffic across NDP units; however, it may increase the traffic inside an NDP unit, occasionally leading to slightly higher total data movement (e.g., ts.air). This is because when an NDP core requests a synchronization variable that is physically located in another NDP unit, it first sends a message inside the NDP unit to its local server, which in turn sends a message to the global server. In contrast, SynCron reduces the traffic inside an NDP unit due to directly buffering synchronization variables, and across NDP units due to its hierarchical design.
Figure 15: Data movement in real applications for C: Central, H: Hier, SC: SynCron, and I: Ideal, broken down into movement inside NDP units and across NDP units.

Hierarchical schemes provide high benefit under high contention, as they prioritize local requests inside each NDP unit. We study their performance benefit on stack and priority queue (Figure 16) when varying the transfer latency of the interconnection links used across four NDP units.
Central is significantly affected by the interconnect latency across NDP units, as it is oblivious to the non-uniform nature of the NDP system. Observing Ideal, which reflects the actual behavior of the main workload, we notice that after a certain point (vertical line), the cost of remote memory accesses across NDP units becomes high enough to dominate performance. SynCron and Hier tend to follow the actual behavior of the workload, as local synchronization messages within NDP units are much less expensive than the remote messages of Central. SynCron outperforms Hier by 1.06× and 1.04× for stack and priority queue, respectively. We conclude that SynCron is the best at hiding the latency of slow links across NDP units.

Figure 16: Performance sensitivity to the transfer latency of the interconnection links used to connect the NDP units (panels: Stack and Priority Queue; y-axis: operations/s).
We also study the effect of the interconnection links used across the NDP units in a low-contention graph application (Figure 17). Observing Ideal, with a 500 ns transfer latency per cache line, we note that the workload experiences 2.46× slowdown over the default latency of 40 ns, as 24.1% of its memory accesses are to remote NDP units. As the transfer latency increases, Central incurs significant slowdown over Ideal, since all NDP cores of the system communicate with a single server, generating expensive traffic across NDP units. In contrast, the slowdown of hierarchical schemes over Ideal is smaller, as these schemes generate less remote traffic by distributing the synchronization requests across multiple local servers. SynCron outperforms Hier due to its direct buffering capabilities. Overall, SynCron outperforms prior high-performance schemes even when the network delay across NDP units is large.
Figure 17: Performance sensitivity to the transfer latency (in ns per cache line) of the interconnection links used to connect the NDP units, for pr.wk. All data is normalized to Ideal (lower is better). The slowdown over Ideal at 40/100/200/500 ns is 1.07/1.11/1.15/1.17 for SynCron, 1.29/1.33/1.36/1.37 for Hier, and 1.61/1.87/2.23/2.67 for Central.

We study three memory technologies, which provide different memory access latencies and bandwidth. We evaluate (i) 2.5D NDP using HBM, (ii) 3D NDP using HMC, and (iii) 2D NDP using DDR4. Figure 18 shows the performance of all schemes normalized to Central of each memory. The reported values show the speedup of SynCron over Central and Hier. SynCron's benefit is independent of the memory technology used: its performance versus Ideal varies only slightly across the three memories. SynCron's performance improvement over prior schemes increases as the memory access latency becomes higher, thanks to direct buffering, which avoids expensive memory accesses for synchronization. For example, in ts.pow, SynCron outperforms Hier by 1.41× and 2.49× with HBM and DDR4, respectively, as the latter incurs higher access latency. Overall, SynCron is orthogonal to the memory technology used.
Figure 18: Speedup with different memory technologies.

6.6. Effect of Data Placement

Figure 19 evaluates the effect of better data placement on SynCron's benefits. We use Metis [74] to obtain a 4-way graph partitioning that minimizes the crossing edges between the 4 NDP units. All data values are normalized to Central without Metis. For SynCron, we define ST occupancy as the average fraction of ST entries that are occupied in each cycle.

Figure 19: Performance sensitivity to a better graph partitioning and maximum ST occupancy of SynCron.

Max ST Occupancy (%):  pr.wk  pr.sl  pr.sx  pr.co
No Metis:                62     51     53     48
Metis:                   39     29     38     34

We make three observations. First, Ideal, which reflects the actual behavior of the main kernel (i.e., with zero synchronization overhead), improves performance by 1.47× across the four graphs. Second, with a better graph partitioning, SynCron still outperforms both Central and Hier. Third, we find that ST occupancy is lower with a better graph partitioning. When a local SE receives a request for a synchronization variable of another NDP unit, both the local SE and the Master SE reserve a new entry in their STs. With a better graph partitioning, NDP cores send requests to their local SE, which is also the Master SE for the requested variable. Thus, only one SE of the system reserves a new entry, resulting in lower ST occupancy. We conclude that, with better data placement, SynCron still performs the best while achieving even lower ST occupancy.
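The ST reservation rule described above can be expressed as a tiny software model (illustrative only; the SE and ST are hardware structures, and the function and parameter names below are ours, not the paper's interface):

```python
# Model of how many Synchronization Table (ST) entries one request occupies,
# following the rule in the text: the requester's local SE always reserves an
# entry, and the Master SE in the variable's home NDP unit reserves a second
# entry only when it is a different SE. (Names are ours, for illustration.)
def st_entries_reserved(requester_unit: int, home_unit: int) -> int:
    local_se = requester_unit   # cores always contact the SE of their own unit
    master_se = home_unit       # the Master SE resides in the variable's home unit
    return 1 if local_se == master_se else 2

# Poor partitioning: requests often target variables homed in other NDP units.
assert st_entries_reserved(requester_unit=0, home_unit=3) == 2
# Good partitioning (e.g., with Metis): requests stay local, halving occupancy.
assert st_entries_reserved(requester_unit=0, home_unit=0) == 1
```

This is why a better graph partitioning lowers the maximum ST occupancy in Figure 19: more requests find their local SE to also be the Master SE.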
6.7. SynCron’s Design Choices

6.7.1. Hierarchical Design. To demonstrate the effectiveness of SynCron’s hierarchical design in non-uniform NDP systems, we compare it with SynCron’s flat variant. Each core in flat directly sends all of its synchronization requests to the Master SE of each variable. In contrast, each core in SynCron sends all of its synchronization requests to its local SE. If the local SE is not the Master SE for the requested variable, the local SE sends a message across NDP units to the Master SE. We evaluate three synchronization scenarios: (i) low-contention and synchronization non-intensive (e.g., graph applications), (ii) low-contention and synchronization-intensive (e.g., time series analysis), and (iii) high-contention (e.g., a queue data structure).
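The routing rule that distinguishes SynCron from its flat variant can be summarized in a few lines (a sketch with our own function and parameter names; replies and queuing effects are not modeled):

```python
# Illustrative model of request routing in flat vs. SynCron (hierarchical).
def messages(scheme: str, requester_unit: int, home_unit: int):
    """Return (intra_unit_msgs, inter_unit_msgs) for one synchronization
    request, where `home_unit` is the NDP unit of the Master SE that owns
    the requested variable."""
    local = requester_unit == home_unit
    if scheme == "flat":
        # The core contacts the Master SE directly.
        return (1, 0) if local else (0, 1)
    if scheme == "syncron":
        # The core always contacts its local SE; the local SE forwards the
        # request across NDP units only when it is not the Master SE.
        return (1, 0) if local else (1, 1)
    raise ValueError(scheme)

# For a remote variable, SynCron pays one extra (cheap) intra-unit hop:
assert messages("flat", requester_unit=0, home_unit=2) == (0, 1)
assert messages("syncron", requester_unit=0, home_unit=2) == (1, 1)
# For a local variable, both schemes stay inside the NDP unit:
assert messages("flat", 1, 1) == messages("syncron", 1, 1) == (1, 0)
```

This per-request view explains why flat can be slightly faster when contention is low; under contention, the local SE serves many requests within its own unit, which is the effect captured in the high-contention results below.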
Low-contention and synchronization non-intensive. Figure 20 evaluates this scenario using several graph processing workloads with a 40 ns link latency between NDP units. SynCron is 1.1% worse than flat, on average. We conclude that SynCron performs only slightly worse than flat for low-contention and synchronization non-intensive scenarios.

Figure 20: Speedup of SynCron normalized to flat with 40 ns link latency between NDP units, under a low-contention and synchronization non-intensive scenario.
Low-contention and synchronization-intensive. Figure 21a evaluates this scenario using time series analysis with four different link latency values between NDP units. SynCron performs 7.3% worse than flat with a 40 ns inter-NDP-unit latency. With a 500 ns inter-NDP-unit latency, SynCron's slowdown over flat shrinks, since remote traffic has a larger impact on the total execution time. We conclude that SynCron performs modestly worse than flat, and SynCron's slowdown decreases as non-uniformity, i.e., the latency between NDP units, increases.

Figure 21: Speedup of SynCron normalized to flat, as we vary the transfer latency of the interconnection links used to connect NDP units, under (a) a low-contention and synchronization-intensive scenario using 4 NDP units, and (b) a high-contention scenario using 2 and 4 NDP units.
High-contention. Figure 21b evaluates this scenario using a queue data structure with four different link latency values between NDP units, for 30 and 60 NDP cores. SynCron with 30 NDP cores outperforms flat by 1.23× to 1.76×, as the inter-NDP-unit latency increases from 40 ns to 500 ns (i.e., with increasing non-uniformity in the system). In a scenario with high non-uniformity in the system and a large number of contended cores, e.g., using a 500 ns inter-NDP-unit latency and 60 NDP cores, SynCron's benefit increases to a 2.14× speedup over flat. We conclude that SynCron performs significantly better than flat under high contention.

Overall, we conclude that in non-uniform, distributed NDP systems, only a hierarchical hardware synchronization design can achieve high performance under all scenarios.

6.7.2. ST Size. We show the effectiveness of the proposed 64-entry ST (per NDP unit) using real applications. Table 7 shows the measured occupancy across all STs. Figure 22 shows the performance sensitivity to the ST size. In graph applications, the average ST occupancy is low (2.8%), and the 64-entry ST never overflows: the maximum occupancy is 63% (cc.wk). In contrast, time series analysis has higher ST occupancy (reaching up to 89% in ts.pow) due to its high synchronization intensity, but there are no ST overflows. Even a 48-entry ST overflows for only 0.01% of synchronization requests, and incurs only 2.1% slowdown over a 64-entry ST. We conclude that the proposed 64-entry ST meets the needs of applications that have high synchronization intensity.
Table 7: ST occupancy (Max % and Avg %) in real applications.

Figure 22: Slowdown with varying ST size (normalized to the 64-entry ST). Numbers on top of bars show the percentage of overflowed requests.

6.7.3. Overflow Management. The linked list and BST_FG data structures are the only cases where the proposed 64-entry ST overflows, when using 60 cores, for 3.1% and 30.5% of the requests, respectively. This is because each core requests at least two locks at the same time during the execution. Note that these synthetic benchmarks represent extreme scenarios, where all cores repeatedly perform key-value operations. Figure 23 compares BST_FG's performance with
SynCron's integrated overflow scheme versus a non-integrated scheme as in MiSAR. When overflow occurs, MiSAR's accelerator aborts all participating cores, notifying them to use an alternative synchronization library, and when the cores finish synchronizing via the alternative solution, they notify MiSAR's accelerator to switch back to hardware synchronization. We adapt this scheme to SynCron for comparison purposes: when an ST overflows, SEs send abort messages to NDP cores with a hierarchical protocol, notifying them to use an alternative synchronization solution, and after finishing synchronization, the cores notify SEs to decrease their indexing counters and switch back to hardware. We evaluate two alternative solutions: (i) SynCron_CentralOvrfl, where one dedicated NDP core handles all synchronization variables, and (ii) SynCron_DistribOvrfl, where one NDP core per NDP unit handles the variables located in the same NDP unit. With 30.5% overflowed requests (i.e., with a 64-entry ST), SynCron_CentralOvrfl and SynCron_DistribOvrfl incur 12.3% and 10.4% performance slowdown, respectively, compared to no ST overflow, due to high network traffic and communication costs between NDP cores and SEs. In contrast, SynCron affects performance by only 3.2% compared to no ST overflow. We conclude that SynCron's integrated hardware-only overflow scheme incurs very small performance overhead.
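The gap between the two overflow strategies can be illustrated with a toy message-count model (the counts below are our assumptions for intuition, not the protocols' measured costs):

```python
# Toy model: extra core<->SE round trips caused by one ST overflow episode.
def overflow_roundtrips(scheme: str, participating_cores: int) -> int:
    if scheme == "abort_based":
        # MiSAR-like: abort every participating core so it synchronizes via an
        # alternative software solution, then have each core notify the SE to
        # switch back to hardware (roughly two round trips per core, assumed).
        return 2 * participating_cores
    if scheme == "integrated":
        # SynCron-like: the hardware handles the overflow itself, so cores
        # are never aborted and pay no extra round trips (assumed).
        return 0
    raise ValueError(scheme)

assert overflow_roundtrips("abort_based", participating_cores=60) == 120
assert overflow_roundtrips("integrated", participating_cores=60) == 0
```

The model matches the trend reported above: the abort-based variants lose 10.4-12.3% performance under frequent overflow, while the integrated scheme loses only 3.2%.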
Figure 23: Throughput (operations/ms) achieved by BST_FG using different overflow schemes and varying the ST size (16 to 256 entries). The reported numbers show the percentage of overflowed requests.
6.8. SynCron’s Area and Power Overhead

Table 8 compares an SE with the ARM Cortex A7 core [14]. We estimate the SPU using Aladdin [129], and the ST and indexing counters using CACTI [109]. We conclude that our proposed hardware unit incurs very modest area and power costs to be integrated into the compute die of an NDP unit.

Table 8: Comparison of the SE (Synchronization Engine; SPU: 0.0141 mm², ST: 0.0112 mm²) with a simple general-purpose in-order core, ARM Cortex A7.
7. Related Work
To our knowledge, our work is the first to (i) comprehensively analyze and evaluate synchronization primitives in NDP systems, and (ii) propose an end-to-end hardware-based synchronization mechanism for efficient execution of such primitives. We briefly discuss prior work.
Synchronization on NDP. Ahn et al. [8] include a message-passing barrier similar to our Central baseline. Gao et al. [43] implement a hierarchical tree-based barrier for HMC [59], where cores first synchronize inside the vault, then across vaults, and finally across HMC stacks. Section 6.1 shows that SynCron outperforms such schemes. Gao et al. [43] also provide remote atomics at the vault controllers of HMC. However, synchronization using remote atomics creates high global traffic and hotspots [41, 96, 108, 147, 153].
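The two-level tree barrier described above can be approximated in software as follows (a sketch using one barrier level per group, e.g., per vault, and one global level; the scheme in [43] is implemented in hardware, and the class and method names are ours):

```python
import threading

# Software analogue of a two-level (hierarchical) tree barrier: cores
# synchronize inside their group first, then one representative per group
# synchronizes globally. Single-use, for brevity.
class HierBarrier:
    def __init__(self, groups: int, cores_per_group: int):
        self.local = [threading.Barrier(cores_per_group) for _ in range(groups)]
        self.top = threading.Barrier(groups)        # inter-group level
        self.released = [threading.Event() for _ in range(groups)]

    def wait(self, group: int) -> None:
        rank = self.local[group].wait()             # 1) intra-group barrier
        if rank == 0:                               # 2) one representative per
            self.top.wait()                         #    group crosses the network
            self.released[group].set()              # 3) release the local group
        self.released[group].wait()
```

With, say, 2 groups of 3 threads, no thread leaves wait() until all 6 have arrived, yet only one participant per group crosses the (expensive) inter-group level, which is the property that makes such barriers attractive across vaults or stacks.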
Synchronization on CPUs.
A range of hardware synchronization mechanisms have been proposed for commodity CPU systems [1-3, 10, 116, 124]. These are not suitable for NDP systems because they either (i) rely on the underlying cache coherence system [10, 124], (ii) are tailored to the 2D-mesh network topology that connects all cores [2, 3], or (iii) use transmission-line technology [116] or on-chip wireless technology [1]. Callbacks [120] includes a directory cache structure close to the LLC of a CPU system built on self-invalidation coherence protocols [26, 75, 77, 91, 121, 139]. Although it has low area cost, it would be oblivious to the non-uniformity of NDP, thereby incurring high performance overheads under high contention (Section 6.7.1). Callbacks improves the performance of spin-wait in hardware, on top of which high-level primitives (locks/barriers) are implemented in software. In contrast, SynCron directly supports high-level primitives in hardware, and is tailored to all salient characteristics of NDP systems.

The closest works to ours are SSB [157], LCU [146], and MiSAR [97]. SSB, a shared memory scheme, includes a small buffer attached to each controller of the LLC to provide lock semantics for a given data address. LCU, a message-passing scheme, incorporates a control unit into each core and a reservation table into each memory controller to provide reader-writer locks. MiSAR is a message-passing synchronization accelerator distributed at each LLC slice of tile-based manycore chips. These schemes provide efficient synchronization for CPU systems without relying on hardware coherence protocols. As shown in Table 4, compared to these works, SynCron is a more effective, general, and easy-to-use solution for NDP systems. These works have two major shortcomings. First, they are designed for uniform architectures, and would incur high performance overheads in non-uniform, distributed NDP systems under high-contention scenarios, similarly to flat in Figure 21b. Second, SSB and LCU handle overflow cases using software exception handlers that typically incur large performance overheads, while MiSAR's overflow scheme would incur high performance degradation due to high network traffic and communication costs between the cores and the synchronization accelerator (Section 6.7.3). In contrast, SynCron is a non-uniformity-aware, hardware-only, end-to-end solution designed to handle the key characteristics of NDP systems.
Synchronization on GPUs.
GPUs support remote atomic units at the shared cache and hardware barriers among threads of the same block [114], while inter-block barrier synchronization is inefficiently implemented via the host CPU [114]. The closest work to ours is HQL [153], which modifies the tag arrays of the L1 and L2 caches to support the lock primitive. This scheme incurs high area cost [41], and is tailored to the GPU architecture, which includes a shared L2 cache, while most NDP systems do not have shared caches.
Synchronization on MPPs.
The Cray T3D/T3E [81, 127], SGI Origin [88], and AMOs [154] include remote atomics at the memory controller, while the NYU Ultracomputer [52] provides fetch&add remote atomics in each network switch. As discussed in Section 2, synchronization via remote atomics incurs high performance overheads due to high global traffic [41, 108, 147, 153]. Cray T3E supports a barrier using physical wires, but it is designed specifically for the 3D torus interconnect. Tera MTA [12], HEP [71, 134], the J- and M-machines [29, 78], and Alewife [5] provide synchronization using hardware bits (full/empty bits) as tags in each memory word. This scheme can incur high area cost [146]. QOLB [51] associates one cache line with every lock to track a pointer to the next waiting core, and one cache line for local spinning using bits (syncbits). QOLB is built on the underlying cache coherence protocol. Similarly, DASH [95] keeps a queue of waiting cores for a lock in the directory used for coherence, to notify caches when the lock is released. CM5 [94] supports remote atomics and a barrier among cores via a dedicated physical control network (organized as a binary tree), which would incur high hardware cost to be supported in NDP systems.
8. Conclusion
SynCron is the first end-to-end synchronization solution for NDP systems. SynCron avoids the need for complex coherence protocols and expensive rmw operations, incurs very modest hardware cost, generally supports many synchronization primitives, and is easy to use. Our evaluations show that it outperforms prior designs under various conditions, providing high performance both under high-contention scenarios (due to the reduction of expensive traffic across NDP units) and low-contention scenarios (due to direct buffering of synchronization variables and high execution parallelism). We conclude that SynCron is an efficient synchronization mechanism for NDP systems, and hope that this work encourages further comprehensive studies of the synchronization problem in heterogeneous systems, including NDP systems.
Acknowledgments
We thank the anonymous reviewers of ISCA 2020, MICRO 2020, and HPCA 2021 for feedback. We thank Dionisios Pnevmatikatos, Konstantinos Nikas, Athena Elafrou, Foteini Strati, Dimitrios Siakavaras, Thomas Lagos, and Andreas Triantafyllos for helpful technical discussions. We acknowledge support from the SAFARI group's industrial partners, especially ASML, Google, Facebook, Huawei, Intel, Microsoft, VMware, and the Semiconductor Research Corporation. During part of this research, Christina Giannoula was funded by the General Secretariat for Research and Technology (GSRT) and the Hellenic Foundation for Research and Innovation (HFRI).
References

[1] S. Abadal, A. Cabellos-Aparicio, E. Alarcon, and J. Torrellas, “WiSync: An Architecture for Fast Synchronization through On-Chip Wireless Communication,” in
ASPLOS , 2016.[2] J. L. Abellán, J. Fernández, and M. E. Acacio, “A g-line-based Network for Fastand Efficient Barrier Synchronization in Many-Core CMPs,” in
ICPP , 2010.[3] J. L. Abellán, J. Fernández, M. E. Acacio et al. , “Glocks: Efficient Support forHighly-Contended Locks in Many-Core CMPs,” in
IPDPS , 2011.[4] M. Abeydeera and D. Sanchez, “Chronos: Efficient Speculative Parallelism forAccelerators,” in
ASPLOS , 2020.[5] A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson et al. , “The MIT AlewifeMachine: Architecture and Performance,” in
ISCA , 1998.[6] N. Agarwal, T. Krishna, L.-S. Peh, and N. K. Jha, “GARNET: A Detailed on ChipNetwork Model inside a Full-System Simulator,” in
ISPASS , 2009.[7] M. Ahmad, F. Hijaz, Q. Shi, and O. Khan, “CRONO: A Benchmark Suite forMultithreaded Graph Algorithms Executing on Futuristic Multicores,” in
IISWC ,2015.[8] J. Ahn, S. Hong, S. Yoo, and O. Mutlu, “A Scalable Processing-in-Memory Ac-celerator for Parallel Graph Processing,” in
ISCA , 2015.[9] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-Enabled Instructions: A Low-overhead, Locality-Aware Processing-in-Memory Architecture,” in
ISCA , 2015.[10] B. S. Akgul, J. Lee, and V. J. Mooney, “A System-on-a-Chip Lock Cache withTask Preemption Support,” in
CASES , 2001.[11] D. Alistarh, J. Kopinsky, J. Li, and N. Shavit, “The SprayList: A Scalable RelaxedPriority Queue,” in
PPoPP , 2015.[12] R. Alverson, D. Callahan, D. Cummings, B. Koblenz et al. , “The Tera ComputerSystem,”
ICS , 1990.[13] T. Anderson, “The Performance Implications of Spin-Waiting Alternatives forShared-Memory Multiprocessors,” in
ICPP , 1989.[14] ARM, “Cortex-A7 Technical Reference Manual,” 2009.[15] A. Awan, M. Brorsson, V. Vlassov, and E. Ayguade, “Performance Characteriza-tion of In-Memory Data Analytics on a Modern Cloud Server,” in
BDCC , 2015.[16] A. J. Awan, V. Vlassov, M. Brorsson, and E. Ayguade, “Node Architecture Impli-cations for In-Memory Data Analytics on Scale-in Clusters,” in
BDCAT , 2016.[17] R. Balasubramonian, J. Chang, T. Manning, J. H. Moreno et al. , “Near-Data Pro-cessing: Insights from a MICRO-46 Workshop,”
IEEE Micro , 2014.[18] U. N. Bhat,
An Introduction to Queueing Theory: Modeling and Analysis in Ap-plications , 2nd ed. Birkhäuser Basel, 2015.[19] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun et al. , “Google Workloadsfor Consumer Devices: Mitigating Data Movement Bottlenecks,” in
ASPLOS ,2018. [20] A. Boroumand, S. Ghose, M. Patel, H. Hassan et al. , “CoNDA: Efficient CacheCoherence Support for Near-data Accelerators,” in
ISCA , 2019.[21] A. Boroumand, S. Ghose, M. Patel, H. Hassan et al. , “LazyPIM: An EfficientCache Coherence Mechanism for Processing-in-Memory,”
CAL , 2017.[22] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev et al. , “An Analysis ofLinux Scalability to Many Cores,” in
OSDI , 2010.[23] D. S. Cali, G. S. Kalsi, Z. Bingöl, C. Firtina et al. , “GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Frame-work for Genome Sequence Analysis,” in
MICRO , 2020.[24] M. Chabbi, M. Fagan, and J. Mellor-Crummey, “High Performance Locks forMulti-Level NUMA Systems,”
PPoPP , 2015.[25] J. Choe, A. Huang, T. Moreshet, M. Herlihy et al. , “Concurrent Data Structureswith Near-Data-Processing: An Architecture-Aware Implementation,” in
SPAA ,2019.[26] B. Choi, R. Komuravelli, H. Sung, R. Smolinski et al. , “DeNovo: Rethinking theMemory Hierarchy for Disciplined Parallelism,” in
PACT , 2011.[27] T. Craig, “Building FIFO and Priority Queuing Spin Locks from Atomic Swap,”Tech. Rep., 1993.[28] D. Culler, J. Singh, and A. Gupta,
Parallel Computer Architecture: A Hardware-Software Approach , 1999.[29] W. Dally, J. S. Fiske, J. Keen, R. Lethin et al. , “The Message-Driven Processor: AMulticomputer Processing Node with Efficient Mechanisms,”
IEEE Micro , 1992.[30] T. David, R. Guerraoui, and . V. Trigonakis, “Everything You Always Wanted toKnow About Synchronization but Were Afraid to Ask,” in
SOSP , 2013.[31] T. David, R. Guerraoui, and V. Trigonakis, “Asynchronized Concurrency: TheSecret to Scaling Concurrent Search Data Structures,” in
ASPLOS , 2015.[32] T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix Collection,”
TOMS , 2011.[33] G. F. de Oliveira, J. Gómez-Luna, L. Orosa, S. Ghose et al. , “A New Methodologyand Open-Source Benchmark Suite for Evaluating Data Movement Bottlenecks:A Near-Data Processing Case Study,” in
SIGMETRICS , 2021.[34] F. Devaux, “The True Processing in Memory Accelerator,” in
Hot Chips , 2019.[35] D. Dice, V. J. Marathe, and N. Shavit, “Flat-Combining NUMA Locks,” in
SPAA ,2011.[36] D. Dice, V. J. Marathe, and N. Shavit, “Lock Cohorting: A General Techniquefor Designing NUMA Locks,”
TOPC , 2015.[37] D. Drachsler, M. Vechev, and E. Yahav, “Practical Concurrent Binary SearchTrees via Logical Ordering,”
PPoPP , 2014.[38] M. Drumond, A. Daglis, N. Mirzadeh, D. Ustiugov et al. , “The Mondrian DataEngine,” in
ISCA , 2017.[39] E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee et al. , “Parallel ApplicationMemory Scheduling,” in
MICRO , 2011.[40] A. Elafrou, G. Goumas, and N. Koziris, “Conflict-Free Symmetric Sparse Matrix-Vector Multiplication on Multicore Architectures,” in SC , 2019.[41] A. ElTantawy and T. M. Aamodt, “Warp Scheduling for Fine-Grained Synchro-nization,” in HPCA , 2018.[42] I. Fernandez, R. Quislant, C. Giannoula, M. Alser et al. , “NATSA: A Near-DataProcessing Accelerator for Time Series Analysis,”
ICCD , 2020.[43] M. Gao, G. Ayers, and C. Kozyrakis, “Practical Near-Data Processing for In-Memory Analytics Frameworks,” in
PACT , 2015.[44] M. Gao and C. Kozyrakis, “HRL: Efficient and Flexible Reconfigurable Logic forNear-Data Processing,” in
HPCA , 2016.[45] M. Gao, J. Pu, X. Yang, M. Horowitz et al. , “TETRIS: Scalable and EfficientNeural Network Acceleration with 3D Memory,” in
ASPLOS , 2017.[46] S. Ghose, A. Boroumand, J. Kim, J. Gómez-Luna et al. , “Processing-in-Memory:A Workload-Driven Perspective,”
IBM JRD , 2019.[47] S. Ghose, T. Li, N. Hajinazar, D. Senol Cali et al. , “Demystifying ComplexWorkload-DRAM Interactions: An Experimental Study,” in
SIGMETRICS , 2019.[48] C. Giannoula, G. Goumas, and N. Koziris, “Combining HTM with RCU to Speedup Graph Coloring on Multicore Platforms,” in
ISC HPC , 2018.[49] M. Gokhale, S. Lloyd, and C. Hajas, “Near Memory Data Structure Rearrange-ment,” in
MEMSYS , 2015.[50] J. Gomez-Luna, I. El Hajj, I. Fernandez, C. Giannoula et al. , “Benchmarking aNew Paradigm: Understanding a Modern Processing-in-Memory Architecture,”in
SIGMETRICS , 2021.[51] J. R. Goodman, M. K. Vernon, and P. J. Woest, “Efficient Synchronization Primi-tives for Large-Scale Cache-Coherent Multiprocessors,” in
ASPLOS , 1989.[52] A. Gottlieb, R. Grishman, C. Kruskal, K. McAuliffe et al. , “The NYU Ultracom-puter—Designing a MIMD, Shared-Memory Parallel Machine,” in
ISCA , 1982.[53] D. Grunwald and S. Vajracharya, “Efficient Barriers for Distributed Shared Mem-ory Computers,” in
IPDPS , 1994.[54] P. Gu, S. Li, D. Stow, R. Barnes et al. , “Leveraging 3D Technologies for Hard-ware Security: Opportunities and Challenges,” in
GLSVLSI , 2016.[55] P. Gu, X. Xie, Y. Ding, G. Chen et al. , “IPIM: Programmable in-Memory ImageProcessing Accelerator Using Near-Bank Architecture,”
ISCA , 2020.[56] R. Guerraoui and V. Trigonakis, “Optimistic Concurrency with OPTIK,” PPoPP2016.[57] H. Guiroux, R. Lachaize, and V. Quéma, “Multicore Locks: The Case Is NotClosed Yet,” in
USENIX ATC , 2016.[58] J. Gómez-Luna, J. M. González-Linares, J. I. Benavides Benítez, and N. GuilMata, “Performance Modeling of Atomic Additions on GPU Scratchpad Mem-ory,”
TPDS , 2013.[59] R. Hadidi, B. Asgari, B. A. Mudassar, S. Mukhopadhyay et al. , “Demystifyingthe Characteristics of 3D-stacked Memories: A case Study for Hybrid MemoryCube,” in
IISWC , 2017.[60] M. Hashemi, E. Ebrahimi, O. Mutlu, Y. N. Patt et al. , “Accelerating DependentCache Misses with an Enhanced Memory Controller,” in
ISCA , 2016.[61] M. Heinrich, V. Soundararajan, J. Hennessy, and A. Gupta, “A Quantitative Anal-ysis of the Performance and Scalability of Distributed Shared Memory CacheCoherence Protocols,” TC , 1999.[62] D. Hensgen, R. Finkel, and U. Manber, “Two Algorithms for Barrier Synchro-nization,” International Journal of Parallel Programming , 1988.[63] M. Herlihy and N. Shavit,
The Art of Multiprocessor Programming , 2008.[64] T. Hoefler, T. Mehlan, F. Mietke, and W. Rehm, “A Survey of Barrier Algorithmsfor Coarse Grained Supercomputers,”
Chemnitzer Informatik Berichte , 2004.[65] S. Hong, S. Salihoglu, J. Widom, and K. Olukotun, “Simplifying Scalable GraphProcessing with a Domain-Specific Language,” in
CGO, 2014.
[66] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee et al., "Transparent Offloading and Mapping: Enabling Programmer-Transparent Near-Data Processing in GPU Systems," in ISCA, 2016.
[67] K. Hsieh, S. Khan, N. Vijaykumar, K. Chang et al., "Accelerating Pointer Chasing in 3D-stacked Memory: Challenges, Mechanisms, Evaluation," in ICCD, 2016.
[68] Intel, Intel 64 and IA-32 Architectures Software Developer's Manual, 2009.
[69] J. Joao, M. A. Suleman, O. Mutlu, and Y. Patt, "Bottleneck Identification and Scheduling in Multithreaded Applications," in ASPLOS, 2012.
[70] J. Joao, M. A. Suleman, O. Mutlu, and Y. Patt, "Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs," in ISCA, 2013.
[71] H. F. Jordan, "Performance Measurements on HEP - a Pipelined MIMD Computer," in ISCA, 1983.
[72] H. Jun, J. Cho, K. Lee, H.-Y. Son et al., "HBM DRAM Technology and Architecture," in IMW, 2017.
[73] K. Kanellopoulos, N. Vijaykumar, C. Giannoula, R. Azizi et al., "SMASH: Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations," in MICRO, 2019.
[74] G. Karypis and V. Kumar, "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs," SIAM J. Sci. Comput., 1998.
[75] S. Kaxiras and G. Keramidas, "SARC Coherence: Scaling Directory Cache Coherence in Performance and Power," IEEE Micro, 2010.
[76] S. Kaxiras, D. Klaftenegger, M. Norgren, A. Ros et al., "Turning Centralized Coherence and Distributed Critical-Section Execution on Their Head: A New Approach for Scalable Distributed Shared Memory," in HPDC, 2015.
[77] S. Kaxiras and A. Ros, "A New Perspective for Efficient Virtual-Cache Coherence," in ISCA, 2013.
[78] S. W. Keckler, W. J. Dally, D. Maskit, N. P. Carter et al., "Exploiting Fine-Grain Thread Level Parallelism on the MIT Multi-ALU Processor," in ISCA, 1998.
[79] J. H. Kelm, D. R. Johnson, W. Tuohy, S. S. Lumetta et al., "Cohesion: A Hybrid Memory Model for Accelerators," in ISCA, 2010.
[80] J. H. Kelm, M. R. Johnson, S. S. Lumetta, and S. J. Patel, "WAYPOINT: Scaling Coherence to Thousand-Core Architectures," in PACT, 2010.
[81] R. E. Kessler and J. L. Schwarzmeier, "Cray T3D: A New Dimension for Cray Research," in Digest of Papers, Compcon Spring, 1993.
[82] D. Kim, J. Kung, S. Chai, S. Yalamanchili et al., "Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory," in ISCA, 2016.
[83] G. Kim, J. Kim, J. H. Ahn, and J. Kim, "Memory-Centric System Interconnect Design with Hybrid Memory Cubes," in PACT, 2013.
[84] J. Kim, D. Senol Cali, H. Xin, D. Lee et al., "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies," BMC Genomics, 2018.
[85] Y. Kim, W. Yang, and O. Mutlu, "Ramulator: A Fast and Extensible DRAM Simulator," CAL, 2015. https://github.com/CMU-SAFARI/ramulator
[86] L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," TC, 1979.
[87] L. Lamport, "A New Solution of Dijkstra's Concurrent Programming Problem," Commun. ACM, 1974.
[88] J. Laudon and D. Lenoski, "The SGI Origin: A ccNUMA Highly Scalable Server," in ISCA, 1997.
[89] D. Lavenier, J.-F. Roy, and D. Furodet, "DNA Mapping using Processor-in-Memory Architecture," in BIBM, 2016.
[90] M. LeBeane, S. Song, R. Panda, J. H. Ryoo et al., "Data Partitioning Strategies for Graph Workloads on Heterogeneous Clusters," in SC, 2015.
[91] A. R. Lebeck and D. A. Wood, "Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors," in ISCA, 1995.
[92] D. U. Lee, K. W. Kim, K. W. Kim, H. Kim et al., "25.2 A 1.2V 8Gb 8-channel 128GB/s High-Bandwidth Memory (HBM) Stacked DRAM with Effective Microbump I/O Test Methods Using 29nm Process and TSV," in ISSCC, 2014.
[93] D. Lee, S. Ghose, G. Pekhimenko, S. Khan et al., "Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost," TACO, 2016.
[94] C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman et al., "The Network Architecture of the Connection Machine CM-5 (Extended Abstract)," in SPAA, 1992.
[95] D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber et al., "The Stanford Dash Multiprocessor," Computer, 1992.
[96] A. Li, G.-J. van den Braak, H. Corporaal, and A. Kumar, "Fine-Grained Synchronizations and Dataflow Programming on GPUs," in ICS, 2015.
[97] C. Liang and M. Prvulovic, "MiSAR: Minimalistic Synchronization Accelerator with Resource Overflow Management," in ISCA, 2015.
[98] J. Liu, H. Zhao, M. A. Ogleari, D. Li et al., "Processing-in-Memory for Energy-Efficient Neural Network Training: A Heterogeneous Approach," in MICRO, 2018.
[99] Z. Liu, I. Calciu, M. Herlihy, and O. Mutlu, "Concurrent Data Structures for Near-Memory Computing," in SPAA, 2017.
[100] V. Luchangco, D. Nussbaum, and N. Shavit, "A Hierarchical CLH Queue Lock," in Euro-Par, 2006.
[101] P. Magnusson, A. Landin, and E. Hagersten, "Queue Locks on Cache Coherent Multiprocessors," in IPDPS, 1994.
[102] J. Mellor-Crummey and M. Scott, "Synchronization without Contention," in ASPLOS, 1991.
[103] J. M. Mellor-Crummey and M. L. Scott, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors," TOCS, 1991.
[104] M. M. Michael and M. L. Scott, "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms," in PODC, 1996.
[105] A. Mirhosseini and J. Torrellas, "Survive: Pointer-Based In-DRAM Incremental Check-Pointing for Low-Cost Data Persistence and Rollback-Recovery," CAL, 2016.
[106] S. A. Mojumder, M. S. Louis, Y. Sun, A. K. Ziabari et al., "Profiling DNN Workloads on a Volta-based DGX-1 System," in IISWC, 2018.
[107] D. Molka, D. Hackenberg, R. Schone, and M. S. Muller, "Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System," in PACT, 2009.
[108] A. Mukkara, N. Beckmann, and D. Sanchez, "PHI: Architectural Support for Synchronization- and Bandwidth-Efficient Commutative Scatter Updates," in MICRO, 2019.
[109] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0," in MICRO, 2007.
[110] O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun, "A Modern Primer on Processing in Memory," Emerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann, 2021.
[111] O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun, "Processing Data Where It Makes Sense: Enabling In-Memory Computation," MICPRO, 2019.
[112] L. Nai, R. Hadidi, J. Sim, H. Kim et al., "GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks," in HPCA, 2017.
[113] R. Nair, S. F. Antao, C. Bertolli, P. Bose et al., "Active Memory Cube: A Processing-in-Memory Architecture for Exascale Systems," IBM JRD, 2015.
[114] NVIDIA, "NVIDIA Tesla V100 GPU Architecture," White Paper, 2017.
[115] NVIDIA, "ONTAP AI–NVIDIA DGX-2 POD with NetApp AFF A800," White Paper, 2019.
[116] J. Oh, M. Prvulovic, and A. Zajic, "TLSync: Support for Multiple Fast Barriers Using On-Chip Transmission Lines," in ISCA, 2011.
[117] A. Pattnaik, X. Tang, A. Jog, O. Kayiran et al., "Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities," in PACT, 2016.
[118] W. Pugh, "Concurrent Maintenance of Skip Lists," Tech. Rep., 1990.
[119] S. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian et al., "NDC: Analyzing the Impact of 3D-stacked Memory+Logic Devices on MapReduce Workloads," in ISPASS, 2014.
[120] A. Ros and S. Kaxiras, "Callback: Efficient Synchronization without Invalidation with a Directory just for Spin-Waiting," in ISCA, 2015.
[121] A. Ros and S. Kaxiras, "Complexity-Effective Multicore Coherence," in PACT, 2012.
[122] L. Rudolph and Z. Segall, Dynamic Decentralized Cache Schemes for MIMD Parallel Processors, 1984.
[123] J. Rutgers, M. Bekooij, and G. Smit, "Portable Memory Consistency for Software Managed Distributed Memory in Many-Core SoC," in IPDPSW, 2013.
[124] J. Sampson, R. Gonzalez, J.-F. Collard, N. Jouppi et al., "Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers," in MICRO, 2006.
[125] D. Sanchez and C. Kozyrakis, "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems," in ISCA, 2013.
[126] M. L. Scott, "Non-Blocking Timeout in Scalable Queue-based Spin Locks," in PODC, 2002.
[127] S. L. Scott, "Synchronization and Communication in the T3E Multiprocessor," in ASPLOS, 1996.
[128] V. Seshadri, D. Lee, T. Mullins, H. Hassan et al., "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," in MICRO, 2017.
[129] Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei et al., "Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin," in MICRO, 2016.
[130] D. Siakavaras, K. Nikas, G. Goumas, and N. Koziris, "RCU-HTM: Combining RCU with HTM to Implement Highly Efficient Concurrent Binary Search Trees," in PACT, 2017. https://github.com/jimsiak/concurrent-maps
[131] G. Singh, D. Diamantopoulos, C. Hagleitner, J. Gomez-Luna et al., "NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling," in FPL, 2020.
[132] G. Singh, J. Gómez-Luna, G. Mariani, G. F. Oliveira et al., "NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning," in DAC, 2019.
[133] G. Singh, L. Chelini, S. Corda, A. J. Awan et al., "Near-Memory Computing: Past, Present, and Future," MICPRO, 2019.
[134] B. J. Smith, "A Pipelined, Shared Resource MIMD Computer," in ICPP, 1978.
[135] D. J. Sorin, M. D. Hill, and D. A. Wood, A Primer on Memory Consistency and Cache Coherence. Morgan & Claypool Publishers, 2011.
[136] F. Strati, C. Giannoula, D. Siakavaras, G. Goumas et al., "An Adaptive Concurrent Priority Queue for NUMA Architectures," in CF, 2019.
[137] M. A. Suleman, O. Mutlu, J. A. Joao, Khubaib et al., "Data Marshaling for Multi-Core Architectures," in ISCA, 2010.
[138] M. A. Suleman, O. Mutlu, M. Qureshi, and Y. Patt, "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," in ASPLOS, 2009.
[139] H. Sung, R. Komuravelli, and S. V. Adve, "DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism," in ASPLOS, 2013.
[140] N. R. Tallent, J. M. Mellor-Crummey, and A. Porterfield, "Analyzing Lock Contention in Multithreaded Applications," in PPoPP, 2010.
[141] X. Tang, J. Zhai, X. Qian, and W. Chen, "pLock: A Fast Lock for Architectures with Explicit Inter-core Message Passing," in ASPLOS, 2019.
[142] S. Torkamani and V. Lohweg, "Survey on Time Series Motif Discovery," Wiley Interdis. Rev.: Data Mining and Knowledge Discovery, 2017.
[143] P.-A. Tsai, C. Chen, and D. Sanchez, "Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies," in MICRO, 2018.
Hybrid Memory Cube Consortium, 2015.
[146] E. Vallejo, R. Beivide, A. Cristal, T. Harris et al., "Architectural Support for Fair Reader-Writer Locking," in MICRO, 2010.
[147] K. Wang, D. Fussell, and C. Lin, "Fast Fine-Grained Global Synchronization on GPUs," in ASPLOS, 2019.
[148] C. Wittenbrink, E. Kilgariff, and A. Prabhu, "Fermi GF100 GPU Architecture," IEEE Micro, 2011.
[149] P. T. Wolkotte, G. J. M. Smit, N. Kavaldjiev, J. E. Becker et al., "Energy Model of Networks-on-Chip and a Bus," in SOCC, 2005.
[150] Xilinx, "Virtex UltraScale+ HBM FPGA," 2019.
[151] M. Yan, X. Hu, S. Li, A. Basak et al., "Alleviating Irregularity in Graph Analytics Acceleration: A Hardware/Software Co-Design Approach," in MICRO, 2019.
[152] C.-C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum et al., "Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets," in ICDM, 2016.
[153] A. Yilmazer and D. Kaeli, "HQL: A Scalable Synchronization Mechanism for GPUs," in IPDPS, 2013.
[154] L. Zhang, Z. Fang, and J. B. Carter, "Highly Efficient Synchronization based on Active Memory Operations," in IPDPS, 2004.
[155] M. Zhang, Y. Zhuo, C. Wang, M. Gao et al., "GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition," in HPCA, 2018.
[156] M. Zhang, H. Chen, L. Cheng, F. C. M. Lau et al., "Scalable Adaptive NUMA-Aware Lock," TPDS, 2017.
[157] W. Zhu, V. C. Sreedhar, Z. Hu, and G. R. Gao, "Synchronization State Buffer: Supporting Efficient Fine-grain Synchronization on Many-Core Architectures," in ISCA, 2007.
[158] Y. Zhuo, C. Wang, M. Zhang, R. Wang et al., "GraphQ: Scalable PIM-Based Graph Processing," in MICRO, 2019.