SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures
Christina Giannoula, Nandita Vijaykumar, Nikela Papadopoulou, Vasileios Karakostas, Ivan Fernandez, Juan Gómez-Luna, Lois Orosa, Nectarios Koziris, Georgios Goumas, Onur Mutlu
†National Technical University of Athens  ‡ETH Zürich  *University of Toronto  §University of Malaga
Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units, each including multiple simple cores placed close to memory. To fully leverage the benefits of NDP and achieve high performance for parallel workloads, efficient synchronization among the NDP cores of a system is necessary. However, supporting synchronization in many NDP systems is challenging because they lack shared caches and hardware cache coherence support, which are commonly used for synchronization in multicore systems, and communication across different NDP units can be expensive.

This paper comprehensively examines the synchronization problem in NDP systems, and proposes SynCron, an end-to-end synchronization solution for NDP systems. SynCron adds low-cost hardware support near memory for synchronization acceleration, and avoids the need for hardware cache coherence support. SynCron has three components: 1) a specialized cache memory structure to avoid memory accesses for synchronization and minimize latency overheads, 2) a hierarchical message-passing communication protocol to minimize expensive communication across NDP units of the system, and 3) a hardware-only overflow management scheme to avoid performance degradation when hardware resources for synchronization tracking are exceeded.

We evaluate SynCron using a variety of parallel workloads, covering various contention scenarios. SynCron improves performance by 1.27× on average (up to 1.78×) under high-contention scenarios, and by 1.35× on average (up to 2.29×) under low-contention real applications, compared to state-of-the-art approaches. SynCron reduces system energy consumption by 2.08× on average (up to 4.25×).
1. Introduction
Recent advances in 3D-stacked memories [59, 72, 85, 92, 93, 145] have renewed interest in Near-Data Processing (NDP) [8, 9, 17, 110]. NDP involves performing computation close to where the application data resides. This alleviates the expensive data movement between processors and memory, yielding significant performance improvements and energy savings in parallel applications. Placing low-power cores or special-purpose accelerators (hereafter called NDP cores) close to the memory dies of high-bandwidth 3D-stacked memories is a commonly-proposed design for NDP systems [8, 9, 19–21, 23, 38, 42–46, 49, 66, 67, 82–84, 98, 105, 110–113, 117, 119, 131, 132, 143, 155, 158]. Typical NDP architectures support several NDP units connected to each other, with each unit comprising multiple NDP cores close to memory [8, 19, 66, 83, 143, 155, 158]. Therefore, NDP architectures provide high levels of parallelism, low memory access latency, and large aggregate memory bandwidth.

Recent research demonstrates the benefits of NDP for parallel applications, e.g., for genome analysis [23, 84], graph processing [8, 9, 20, 21, 112, 155, 158], databases [20, 38], security [54], pointer-chasing workloads [25, 60, 67, 99], and neural networks [19, 45, 82, 98]. In general, these applications exhibit high parallelism, low operational intensity, and relatively low cache locality [15, 16, 33, 50, 133], which make them suitable for NDP.

Prior works discuss the need for efficient synchronization primitives in NDP systems, such as locks [25, 99] and barriers [8, 43, 155, 158]. Synchronization primitives are widely used by multithreaded applications [39, 40, 48, 69, 70, 90, 136–138, 140], and must be carefully designed to fit the underlying hardware requirements to achieve high performance. Therefore, to fully leverage the benefits of NDP for parallel applications, an effective synchronization solution for NDP systems is necessary.

Approaches to support synchronization are typically of two types [63, 64].
First, synchronization primitives can be built through shared memory, most commonly using the atomic read-modify-write (rmw) operations provided by hardware. In CPU systems, atomic rmw operations are typically implemented upon the underlying hardware cache coherence protocols, but many NDP systems do not support hardware cache coherence (e.g., [8, 46, 143, 155, 158]). In GPUs and Massively Parallel Processing systems (MPPs), atomic rmw operations can be implemented in dedicated hardware atomic units, known as remote atomics. However, synchronization using remote atomics has been shown to be inefficient, since sending every update to a fixed location creates high global traffic and hotspots [41, 96, 108, 147, 153]. Second, synchronization can be implemented via a message-passing scheme, where cores exchange messages to reach an agreement. Some recent NDP works (e.g., [8, 43, 55, 158]) propose message-passing barrier primitives among NDP cores of the system. However, these synchronization schemes are still inefficient, as we demonstrate in Section 6, and also lack support for lock, semaphore and condition variable synchronization primitives.

Hardware synchronization techniques that do not rely on hardware coherence protocols and atomic rmw operations have been proposed for multicore systems [1–3, 94, 97, 116, 146, 157]. However, such synchronization schemes are tailored for the specific architecture of each system, and are not efficient or suitable for NDP systems (Section 7). For instance, CM5 [94] provides a barrier primitive via a dedicated physical network, which would incur high hardware cost to be supported in large-scale NDP systems. LCU [146] adds a control unit to each CPU core and a buffer to each memory controller, which would also incur high cost to implement in area-constrained NDP cores and controllers. SSB [157] includes a small buffer attached to each controller of the last level cache (LLC) and MiSAR [97] introduces an accelerator distributed at the LLC. Both schemes are built on the shared cache level in CPU systems, which most NDP systems do not have. Moreover, in NDP systems with non-uniform memory access times, most of these prior schemes would incur significant performance overheads under high-contention scenarios. This is because they are oblivious to the non-uniformity of NDP, and thus would cause excessive traffic across NDP units of the system upon contention (Section 6.7.1).

Overall, NDP architectures have several important characteristics that necessitate a new approach to support efficient synchronization. First, most NDP architectures [8, 19, 25, 38, 42–46, 49, 55, 67, 98, 110, 111, 113, 119, 155, 158] lack shared caches that can enable low-cost communication and synchronization among NDP cores of the system. Second, hardware cache coherence protocols are typically not supported in NDP systems [8, 19, 25, 38, 42–45, 49, 55, 67, 82, 98, 111, 119, 155, 158], due to high area and traffic overheads associated with such protocols [46, 143]. Third, NDP systems are non-uniform, distributed architectures, in which inter-unit communication is more expensive (both in performance and energy) than intra-unit communication [8, 20, 21, 38, 43, 83, 155, 158].

In this work, we present SynCron, an efficient synchronization mechanism for NDP architectures.
SynCron is designed to achieve the goals of performance, cost, programming ease, and generality to cover a wide range of synchronization primitives through four key techniques. First, we offload synchronization among NDP cores to dedicated low-cost hardware units, called Synchronization Engines (SEs). This approach avoids the need for complex coherence protocols and expensive rmw operations, at low hardware cost. Second, we directly buffer the synchronization variables in a specialized cache memory structure to avoid costly memory accesses for synchronization. Third, SynCron coordinates synchronization with a hierarchical message-passing scheme: NDP cores only communicate with their local SE that is located in the same NDP unit. At the next level of communication, all local SEs of the system's NDP units communicate with each other to coordinate synchronization at a global level. Via its hierarchical communication protocol, SynCron significantly reduces synchronization traffic across NDP units under high-contention scenarios. Fourth, when applications with frequent synchronization oversubscribe the hardware synchronization resources, SynCron uses an efficient and programmer-transparent overflow management scheme that avoids costly fallback solutions and minimizes overheads.

We evaluate SynCron using a wide range of parallel workloads including pointer chasing, graph applications, and time series analysis. Over prior approaches (similar to [8, 43]), SynCron improves performance by 1.27× on average (up to 1.78×) under high-contention scenarios, and by 1.35× on average (up to 2.29×) under low-contention scenarios. In real applications with fine-grained synchronization, SynCron comes within 9.5% of the performance and 6.2% of the energy of an ideal zero-overhead synchronization mechanism. Our proposed hardware unit incurs very modest area and power overheads (Section 6.8) when integrated into the compute die of an NDP unit.

This paper makes the following contributions:
• We investigate the challenges of providing efficient synchronization in Near-Data-Processing architectures, and propose an end-to-end mechanism, SynCron, for such systems.
• We design low-cost synchronization units that coordinate synchronization across NDP cores, and directly buffer synchronization variables to avoid costly memory accesses to them. We propose an efficient message-passing synchronization approach that organizes the process hierarchically, and provide a hardware-only programmer-transparent overflow management scheme to alleviate performance overheads when hardware synchronization resources are exceeded.
• We evaluate SynCron using a wide range of parallel workloads and demonstrate that it significantly outperforms prior approaches both in performance and energy consumption. SynCron also has low hardware area and power overheads.
2. Background and Motivation
Numerous works [8, 9, 19–21, 25, 38, 43, 45, 54, 55, 67, 73, 82, 99, 112, 128, 143, 155, 158] show the potential benefit of NDP for parallel, irregular applications. These proposals focus on the design of the compute logic that is placed close to or within memory, and in many cases provide special-purpose near-data accelerators for specific applications. Figure 1 shows the baseline organization of the NDP architecture we assume in this work, which includes several NDP units connected with each other via serial interconnection links to share the same physical address space. Each NDP unit includes the memory arrays and a compute die with multiple low-power programmable cores or fixed-function accelerators, which we henceforth refer to as NDP cores. NDP cores execute the offloaded NDP kernel and access the various memory locations across NDP units with non-uniform access times [8, 20, 21, 38, 143, 155, 158]. We assume that there is no OS running in the NDP system.

In our evaluation, we use programmable in-order NDP cores, each including small private L1 I/D caches. However, SynCron can be used with any programmable, fixed-function or reconfigurable NDP accelerator. We assume software-assisted cache-coherence (provided by the operating system or the programmer), similar to [43, 143]: data can be either thread-private, shared read-only, or shared read-write. Thread-private and shared read-only data can be cached by NDP cores, while shared read-write data is uncacheable.
Figure 1: High-level organization of an NDP architecture.
We focus on three characteristics of NDP architectures that are of particular importance in the synchronization context. First, NDP architectures typically do not have a shared level of cache memory [8, 19, 25, 38, 42–46, 49, 55, 67, 98, 110, 111, 113, 119, 155, 158], since the NDP-suited workloads usually do not benefit from deep cache hierarchies due to their poor locality [33, 43, 133, 143]. Second, NDP architectures do not typically support conventional hardware cache coherence protocols [8, 19, 25, 38, 42–45, 49, 55, 67, 82, 98, 111, 119, 155, 158], because they would add area and traffic overheads [46, 143], and would incur high complexity and latency [4], limiting the benefits of NDP. Third, communication across NDP units is expensive, because NDP systems are non-uniform distributed architectures. The energy and performance costs of inter-unit communication are typically orders of magnitude greater than the costs of intra-unit communication [8, 20, 21, 38, 43, 83, 155, 158], and thus inter-unit communication may slow down the execution of NDP cores [155].
Approaches to support synchronization are typically either via shared memory or via message-passing schemes.
In the shared-memory approach, cores coordinate via a consistent view of shared memory locations, using atomic read/write operations or atomic read-modify-write (rmw) operations. If rmw operations are not supported by hardware, Lamport's bakery algorithm [87] can provide synchronization to N participating cores, assuming sequential consistency [86]. However, this scheme scales poorly, as a core accesses O(N) memory locations at each synchronization retry. In contrast, commodity systems (CPUs, GPUs, MPPs) typically support rmw operations in hardware.

GPUs and MPPs support rmw operations in specialized hardware units (known as remote atomics), located in each bank of the shared cache [58, 148], or in the memory controllers [81, 88]. Remote atomics are also supported by an NDP work [43] at the vault controllers of the Hybrid Memory Cube (HMC) [59, 145]. Implementing synchronization primitives using remote atomics requires a spin-wait scheme, i.e., executing consecutive rmw retries. However, performing and sending every rmw operation to a shared, fixed location can cause high global traffic and create hotspots [41, 96, 108, 147, 153]. In NDP systems, consecutive rmw operations to a remote NDP unit would incur high traffic across NDP units, with high performance and energy overheads.

Commodity CPU architectures support rmw operations either by locking the bus (or equivalent link), or by relying on the hardware cache coherence protocol [68, 135], which many NDP architectures do not support. Therefore, coherence-based synchronization [13, 24, 27, 35, 36, 57, 100, 101, 103, 122, 126, 156] cannot be directly implemented in NDP architectures. Moreover, based on prior works on synchronization [22, 30, 76, 102, 107, 140], coherence-based synchronization would exhibit low scalability on NDP systems for two reasons. First, it performs poorly with a large number of cores, due to low scalability of conventional hardware coherence protocols [61, 79, 80, 135]. Most NDP systems include several NDP units [8, 83, 155, 158], each typically supporting hundreds of small, area-constrained cores [8, 19, 155, 158]. Second, the non-uniformity in memory accesses significantly affects the scalability of coherence-based synchronization [22, 30, 107, 156]. Prior work on coherence-based synchronization [30] observes that the latency of a lock acquisition that needs to transfer the lock across NUMA sockets can be up to 12.5× higher than that within a socket. We expect such effects to be aggravated in NDP systems, since they are by nature non-uniform and distributed [8, 20, 21, 38, 43, 83, 155, 158] with very low memory access latency within an NDP unit.

We validate these observations on both a real CPU and our simulated NDP system. On an Intel Xeon Gold server, we evaluate the operation throughput achieved by two coherence-based lock algorithms (Table 1), i.e., TTAS [122] and Hierarchical Ticket Lock (HTL) [103], using a microbenchmark taken from the libslock library [30]. When increasing the number of threads from 1 to 14 within a single socket, throughput drops by 3.91× and 2.77× for TTAS and HTL, respectively. Moreover, when pinning two threads on different NUMA sockets, throughput drops by up to 2.29× over when pinning them on the same socket, due to non-uniform memory access times of lock variables.

Million Operations per Second  | 1 thread (single-socket) | 14 threads (single-socket) | 2 threads (same-socket) | 2 threads (different-socket)
TTAS lock [122]                | 8.92 | 2.28 | 9.91 | 4.32
Hierarchical Ticket lock [103] | 8.06 | 2.91 | 9.01 | 6.79

Table 1: Throughput of two coherence-based lock algorithms on an Intel Xeon Gold server using the libslock library [30].
In our simulated NDP system, we evaluate the performance achieved by a stack data structure protected with a coarse-grained lock. Figure 2 shows the slowdown of the stack when using a coherence-based lock [63] (mesi-lock), implemented upon a MESI directory coherence protocol, over using an ideal lock with zero cost for synchronization (ideal-lock). First, we observe that the high contention for the cache line containing the mesi-lock and the resulting coherence traffic inside the network significantly limit scalability of the stack as the number of cores increases. With 60 NDP cores within a single NDP unit (Figure 2a), the stack with mesi-lock incurs 2.03× slowdown over ideal-lock. Second, we notice that the non-uniform memory accesses to the cache line containing the mesi-lock also impact the scalability of the stack. When increasing the number of NDP units while keeping total core count constant at 60 (Figure 2b), the slowdown of the stack with mesi-lock increases to 2.66× (using 4 NDP units) over ideal-lock. In non-uniform NDP systems, the scalability of coherence-based synchronization is severely limited by the long transfer latency and low bandwidth of the interconnect used between the NDP units.
Figure 2: Slowdown of a stack data structure using a coherence-based lock over using an ideal zero-cost lock, when varying (a) the NDP cores within a single NDP unit and (b) the number of NDP units while keeping core count constant at 60.
In the message-passing approach, cores coordinate with each other by exchanging messages (either in software or hardware) in order to reach an agreement. For instance, a recent NDP work [8] implements a barrier primitive via hardware message-passing communication among NDP cores, i.e., one core of the system works as a master core to collect the synchronization status of the rest. To improve system performance in non-uniform HMC-based NDP systems, Gao et al. [43] propose a tree-style barrier primitive, where cores exchange messages to first synchronize within a vault, then across the vaults of an HMC cube, and finally across HMC cubes. In general, optimized message-passing synchronization schemes proposed in the literature [2, 43, 53, 62, 64, 141] aim to minimize (i) the number of messages sent among cores, and (ii) expensive network traffic. To avoid the major issues of synchronization via shared memory described above, we design our approach building on the message-passing synchronization concept.

3. SynCron: Overview
SynCron is an end-to-end solution for synchronization in NDP architectures that improves performance, has low cost, eases programmability, and supports multiple synchronization primitives. SynCron relies on the following key techniques:
1. Hardware support for synchronization acceleration: We design low-cost hardware units, called Synchronization Engines (SEs), to coordinate the synchronization among NDP cores of the system. SEs eliminate the need for complex cache coherence protocols and expensive rmw operations, and incur modest hardware cost.

2. Direct buffering of synchronization variables: We add a specialized cache structure, the Synchronization Table (ST), inside an SE to keep synchronization information. Such direct buffering avoids costly memory accesses for synchronization, and enables high performance under low-contention scenarios.

3. Hierarchical message-passing communication: We organize the communication hierarchically, with each NDP unit including an SE. NDP cores communicate with their local SE that is located in the same NDP unit. SEs communicate with each other to coordinate synchronization at a global level. Hierarchical communication minimizes expensive communication across NDP units, and achieves high performance under high-contention scenarios.

4. Integrated hardware-only overflow management: We incorporate a hardware-only overflow management scheme to efficiently handle scenarios when the ST is fully occupied. This programmer-transparent technique effectively limits performance degradation under overflow scenarios.
Figure 3 provides an overview of our approach. SynCron exposes a simple programming interface such that programmers can easily use a variety of synchronization primitives in their multithreaded applications when writing them for NDP systems. The interface is implemented using two new instructions that are used by NDP cores to communicate synchronization requests to SEs. These are general enough to cover all semantics for the most widely-used synchronization primitives.
Figure 3: High-level overview of SynCron.

We add one SE in the compute die of each NDP unit. For a particular synchronization variable allocated in an NDP unit, the SE that is physically located in the same NDP unit is considered the Master SE. In other words, the Master SE is defined by the address of the synchronization variable. It is responsible for the global coordination of synchronization on that variable, i.e., among all SEs of the system. All other SEs are responsible only for the local coordination of synchronization among the cores in the same NDP unit with them.

NDP cores act as clients that send requests to SEs via hardware message-passing. SEs act as servers that process synchronization requests. In the proposed hierarchical communication, NDP cores send requests to their local SEs, while SEs of different NDP units communicate with the Master SE of the specific variable, to coordinate the process at a global level, i.e., among all NDP units.

When an SE receives a request from an NDP core for a synchronization variable, it directly buffers the variable in its ST, keeping all the information needed for synchronization in the ST. If the ST is full, we use the main memory as a fallback solution. To hierarchically coordinate synchronization via main memory in ST overflow cases, we design (i) a generic structure, called syncronVar, to keep track of required synchronization information, and (ii) specialized overflow messages to be sent among SEs. The hierarchical communication among SEs is implemented via corresponding support in the message encoding, the ST, and the syncronVar structure.
SynCron's Operation
SynCron supports locks, barriers, semaphores, and condition variables. Here, we present SynCron's operation for locks. SynCron has similar behavior for the other three primitives.
Lock Synchronization Primitive:
Figure 4 shows a system composed of two NDP units with two NDP cores each. In this example, all cores request and compete for the same lock. First, all NDP cores send local lock acquire messages to their SEs. After receiving these messages, each SE keeps track of its requesting cores by reserving one new entry in its ST, i.e., directly buffering the lock variable in the ST. Each ST entry includes a local waiting list (i.e., a hardware bit queue with one bit for each local NDP core), and a global waiting list (i.e., a bit queue with one bit for each SE of the system). To keep track of the requesting cores, each SE sets the bits corresponding to the requesting cores in the local waiting list of the ST entry. When the local SE receives a request for a synchronization variable for the first time, it sends a global lock acquire message to the Master SE, which in turn sets the corresponding bit in the global waiting list in its ST. This way, the Master SE keeps track of all requests to a particular variable coming from an SE, and can arbitrate between different SEs. The local SE can then serve successive local requests to the same variable until there are no other local requests. By using the proposed hierarchical communication protocol, the cores send local messages to their local SE, and the SE needs to send only one aggregated message, on behalf of all its local waiting cores, to the Master SE. As a result, we reduce the need for communication through the narrow, expensive links that connect different NDP units.
Figure 4: An example execution scenario for a lock requested by all NDP cores.
The Master SE first prioritizes the local waiting list, granting the lock to its own local NDP cores in sequence (e.g., to NDP Core 0 first, and to NDP Core 1 next in Figure 4). At the end of the critical section, each local lock owner sends a lock release message to its SE in order to release the lock. When there are no other local requests, the Master SE transfers the control of the lock to the SE of another NDP unit based on its global waiting list. Then, the local SE grants the lock to its local NDP cores in sequence. After all local cores release the lock, the SE sends an aggregated global lock release message to the Master SE and releases its ST entry. When the message arrives at the Master SE, if there are no other pending requests to the same variable, the Master SE releases its ST entry. In this example, SEs directly buffer the lock variable in their STs. If an ST is full, the Master SE globally coordinates synchronization by keeping track of all required information in main memory, via our proposed overflow management scheme (Section 4.3).

4. SynCron: Detailed Design
SynCron leverages the key observation that all synchronization primitives fundamentally communicate the same information, i.e., a waiting list of cores that participate in the synchronization process, and a condition to be met to notify one or more cores. Based on this observation, we design SynCron to cover the four most widely used synchronization primitives. Without loss of generality, we assume that each NDP core represents a hardware thread context with a unique ID. To support multiple hardware thread contexts per NDP core, the corresponding hardware structures of SynCron need to be augmented to include 1 bit per hardware thread context.

SynCron provides lock, barrier, semaphore and condition variable synchronization primitives, supporting two types of barriers: among cores of the same NDP unit and among cores across different NDP units of the system. SynCron's programming interface (Table 2) implements the synchronization semantics with two new ISA instructions, which are rich and general enough to express all supported primitives. NDP cores use these instructions to assemble messages for synchronization requests, which are issued through the network to SEs.
syncronVar *create_syncvar();
void destroy_syncvar(syncronVar *svar);
void lock_acquire(syncronVar *lock);
void lock_release(syncronVar *lock);
void barrier_wait_within_unit(syncronVar *bar, int initialCores);
void barrier_wait_across_units(syncronVar *bar, int initialCores);
void sem_wait(syncronVar *sem, int initialResources);
void sem_post(syncronVar *sem);
void cond_wait(syncronVar *cond, syncronVar *lock);
void cond_signal(syncronVar *cond);
void cond_broadcast(syncronVar *cond);

Table 2: SynCron's Programming Interface (i.e., API).

req_sync addr, opcode, info: This instruction creates a message and commits when a response message is received back. The addr register has the address of a synchronization variable, the opcode register has the message opcode of a particular semantic of a synchronization primitive (Table 3), and the info register has specific information needed for the primitive (MessageInfo in the message encoding of Fig. 5).

req_async addr, opcode: This instruction creates a message, and after the message is issued to the network, the instruction commits. The addr and opcode registers have the same semantics as in the req_sync instruction.

We design SynCron assuming a relaxed memory consistency model. The proposed ISA extensions act as memory fences. First, req_sync commits once a message (ACK) is received (from the local SE to the core), which ensures that all following instructions will be issued after req_sync has completed. Its semantics is similar to those of the SYNC and ACQUIRE operations of the Weak Ordering (WO) [28] and Release Consistency (RC) [28] models, respectively. Second, req_async does not require a return message (ACK). It is issued once all previous instructions are completed. Its semantics is similar to that of the RELEASE operation of RC [28]. In the case of WO, req_sync is sufficient. In the case of RC, the req_sync instruction is used for acquire-type semantics, i.e., lock_acquire, barrier_wait, semaphore_wait and condition_variable_wait, while the req_async instruction is used for release-type semantics, i.e., lock_release, semaphore_post, condition_variable_signal, and condition_variable_broadcast.
Figure 5 describes the encoding of the message used for communication between NDP cores and the SE. Each message includes: (i) the 64-bit address of the synchronization variable, (ii) the message opcode that implements the semantics of the different synchronization primitives (6 bits cover all message opcodes), (iii) the unique ID number of the NDP core (6 bits are sufficient for our simulated NDP system in Section 5), and (iv) a 64-bit field (MessageInfo) that communicates specific information needed for each different synchronization primitive, i.e., the number of the cores that participate in a barrier, the initial value of a semaphore, or the address of the lock associated with a condition variable.
Address (64 bits) | Opcode (6 bits) | CoreID (6 bits) | MessageInfo (64 bits)

Figure 5: Message encoding of SynCron.

Hierarchical Message Opcodes.
SynCron enables a hierarchical scheme, where the SEs of NDP units communicate with each other to coordinate synchronization at a global level. Therefore, we support two types of messages (Table 3): (i) local, which are used by NDP cores to communicate with their local SE, and (ii) global, which are used by SEs to communicate with the Master SE, and vice versa. Since we support two types of barriers (Table 2), we design two message opcodes for a local barrier_wait message sent by an NDP core to its local SE: (i) barrier_wait_local_within_unit is used when cores of a single NDP unit participate in the barrier, and (ii) barrier_wait_local_across_units is used when cores from different NDP units participate in the barrier. In the latter case, if a smaller number of cores than the total available cores of the NDP system participate in the barrier, SynCron supports one-level communication: local SEs re-direct all messages (received from their local NDP cores) to the Master SE, which globally coordinates the barrier among all participating cores. This design choice is a trade-off between performance (more remote messages) and hardware/ISA complexity, since the number of participating cores of each NDP unit would need to be communicated to the hardware through additional registers in the ISA, and through more message opcodes (higher complexity).
Primitives | Message Opcodes
Locks | lock_acquire_global, lock_acquire_local, lock_release_global, lock_release_local, lock_grant_global, lock_grant_local, lock_acquire_overflow, lock_release_overflow, lock_grant_overflow
Barriers | barrier_wait_global, barrier_wait_local_within_unit, barrier_wait_local_across_units, barrier_depart_global, barrier_depart_local, barrier_wait_overflow, barrier_departure_overflow
Semaphores | sem_wait_global, sem_wait_local, sem_grant_global, sem_grant_local, sem_post_global, sem_post_local, sem_wait_overflow, sem_grant_overflow, sem_post_overflow
Condition Variables | cond_wait_global, cond_wait_local, cond_signal_global, cond_signal_local, cond_broad_global, cond_broad_local, cond_grant_global, cond_grant_local, cond_wait_overflow, cond_signal_overflow, cond_broad_overflow, cond_grant_overflow
Other | decrease_indexing_counter

Table 3: Message opcodes of SynCron.

Each SE module (Figure 6) is integrated into the compute die of each NDP unit. An SE consists of three components: the Synchronization Processing Unit (SPU), the Synchronization Table (ST), and the indexing counters.
Synchronization Processing Unit (SPU). The SPU is the logic that handles the messages, updates the ST, and issues requests to memory as needed. The SPU includes the control unit, a buffer, and a few registers. The buffer is a small SRAM queue for temporarily storing messages that arrive at the SE. The control unit implements custom logic with simple bitwise operators (and, or, xor, zero) and multiplexers.
Figure 6: The Synchronization Engine (SE).
Synchronization Table (ST). The ST keeps track of all the information needed to coordinate synchronization. Each ST has 64 entries. Figure 7 shows an ST entry, which includes: (i) the 64-bit address of a synchronization variable, (ii) the global waiting list used by the Master SE for global synchronization among SEs, i.e., a hardware bit queue including one bit for each SE of the system, (iii) the local waiting list used by all SEs for synchronization among the NDP cores of an NDP unit, i.e., a hardware bit queue including one bit for each NDP core within the unit, (iv) the state of the ST entry, which can be either free or occupied, and (v) a 64-bit field (TableInfo) to track specific information needed for each synchronization primitive. For the lock primitive, the TableInfo field indicates the lock owner, which is either an SE of an NDP unit (Global ID, represented by the most significant bits) or a local NDP core (Local ID, represented by the least significant bits). We assume that all NDP cores of an NDP unit have a unique local ID within the NDP unit, while all SEs of the system have a unique global ID within the system. The number of bits in the global and local waiting lists of Figure 7 is specific to the configuration of our evaluated system (Section 5), which includes 16 NDP cores per NDP unit and 4 SEs (one per NDP unit), and has to be extended accordingly if the system supports more NDP cores or SEs.
Address (64 bits) | Global Waitlist (4 bits) | Local Waitlist (16 bits) | State (1 bit) | TableInfo (64 bits)

TableInfo contents per primitive: Lock → Global ID | Local ID; Barrier, Semaphore → Current; Condition Variable → Lock Address.
Figure 7: Synchronization Table (ST) entry.
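The ST entry can be rendered as a plain C struct for illustration; the field widths follow Figure 7, while the exact bit boundary of the owner packing (global ID in the upper bits, local ID in the lower bits, with an assumed 8-bit split) is our own choice, since the text only fixes the ordering:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative C model of one ST entry (Figure 7); real widths in comments. */
typedef struct {
    uint64_t addr;            /* 64 bits: synchronization variable address */
    uint8_t  global_waitlist; /* 4 bits used: one per SE in the system     */
    uint16_t local_waitlist;  /* 16 bits: one per NDP core within the unit */
    uint8_t  occupied;        /* 1 bit: free (0) or occupied (1)           */
    uint64_t table_info;      /* 64 bits: primitive-specific data          */
} st_entry_t;

/* For locks, TableInfo holds the owner: Global ID in the most significant
 * bits, Local ID in the least significant bits (8-bit split assumed here). */
static uint64_t pack_lock_owner(uint64_t global_id, uint64_t local_id) {
    return (global_id << 8) | (local_id & 0xFF);
}
static uint64_t owner_global_id(uint64_t table_info) { return table_info >> 8; }
static uint64_t owner_local_id(uint64_t table_info)  { return table_info & 0xFF; }
```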
Indexing Counters. If an ST is full, i.e., all its entries are in occupied state, SynCron cannot keep track of information for a new synchronization variable in the ST. We use main memory as a fallback solution for such ST overflow (Section 4.3). The SE keeps track of which synchronization variables are currently serviced via main memory: similar to MiSAR [97], we include a small set of counters (indexing counters), 256 in our current implementation, indexed by the least significant bits of the address of a synchronization variable, as extracted from the message that arrives at an SE. When an SE receives a message with acquire-type semantics for a synchronization variable and there is no corresponding entry in the fully-occupied ST, the indexing counter for that synchronization variable increases. When an SE receives a message with release-type semantics for a synchronization variable that is currently serviced using main memory, the corresponding indexing counter decreases. A synchronization variable is currently serviced via main memory when the corresponding indexing counter is larger than zero. Note that different variables may alias to the same indexing counter. This aliasing does not affect correctness, but it does affect performance, since a variable may unnecessarily be serviced via main memory while the ST is not full.
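A minimal software sketch of this counter bookkeeping (SynCron implements it in SE hardware; the array and helper names here are ours):

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Sketch of the 256 indexing counters: a variable is serviced via main
 * memory while its counter is nonzero. Two addresses whose low 8 bits
 * match alias to the same counter, which is safe but conservative. */
#define NUM_COUNTERS 256
static uint32_t idx_counter[NUM_COUNTERS];

static uint32_t counter_index(uint64_t addr) {
    return (uint32_t)(addr & (NUM_COUNTERS - 1));
}

/* Acquire-type message misses in a fully-occupied ST. */
static void on_overflow_acquire(uint64_t addr) {
    idx_counter[counter_index(addr)]++;
}

/* Release-type message for a variable serviced via main memory. */
static void on_overflow_release(uint64_t addr) {
    if (idx_counter[counter_index(addr)] > 0)
        idx_counter[counter_index(addr)]--;
}

static bool serviced_via_memory(uint64_t addr) {
    return idx_counter[counter_index(addr)] > 0;
}
```

Note how two distinct variables can alias: any address ending in the same low byte maps to the same counter, so one overflowed variable can force another to take the memory path, exactly the performance (not correctness) effect the text describes.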
Figure 8 describes the control flow in the SE. When an SE receives a message, it decodes the message and accesses the ST. If there is an ST entry for the specific variable (depending on its address), the SE processes the waiting lists, updates the ST, and encodes return message(s), if needed. If there is no ST entry for the specific variable, the SE checks the value of the corresponding indexing counter: (i) if the indexing counter is zero and the ST is not full, the SE reserves a new ST entry and continues with processing the waiting lists; otherwise, (ii) if the indexing counter is larger than zero or the ST is full, there is an overflow. In that case, if the SE is the Master SE for the specific variable, it reads the synchronization variable from its local memory arrays, processes the waiting lists, updates the variable in main memory, and encodes return message(s), if needed. If the SE is not the Master SE for the specific variable, it encodes an overflow message to the Master SE to handle the overflow.

Figure 8: Control flow in SE.
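The decision flow of Figure 8 can be sketched in software as follows; the tiny four-entry ST, the helper names, and the return codes are our own simplifications, not SynCron's hardware interface:

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/* Simplified model of the SE control flow (Figure 8). The ST here has
 * only 4 entries (the real ST has 64) and tracks just the address. */
#define ST_ENTRIES   4
#define NUM_COUNTERS 256

typedef struct { uint64_t addr; bool occupied; } st_entry_t;
static st_entry_t st[ST_ENTRIES];
static uint32_t counters[NUM_COUNTERS];

typedef enum { SERVED_FROM_ST, SERVED_FROM_MEMORY, SENT_TO_MASTER } action_t;

static int st_find(uint64_t addr) {
    for (int i = 0; i < ST_ENTRIES; i++)
        if (st[i].occupied && st[i].addr == addr) return i;
    return -1;
}

static bool st_reserve(uint64_t addr) {
    for (int i = 0; i < ST_ENTRIES; i++)
        if (!st[i].occupied) { st[i].addr = addr; st[i].occupied = true; return true; }
    return false;  /* ST full */
}

static action_t se_handle(uint64_t addr, bool acquire, bool is_master) {
    if (st_find(addr) >= 0)
        return SERVED_FROM_ST;            /* hit: process waitlists, update ST */
    uint32_t *ctr = &counters[addr & (NUM_COUNTERS - 1)];
    if (*ctr == 0 && st_reserve(addr))
        return SERVED_FROM_ST;            /* new entry, continue as on a hit */
    if (acquire) (*ctr)++;                /* overflow: track via the counter */
    else if (*ctr > 0) (*ctr)--;
    return is_master ? SERVED_FROM_MEMORY /* Master SE: use local memory */
                     : SENT_TO_MASTER;    /* else: defer to the Master SE */
}
```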
SynCron integrates a hardware-only overflow management scheme that incurs very modest performance degradation (Section 6.7.3) and is programmer-transparent. To handle ST overflow cases, we need to address two issues: (i) where to keep track of the information required to coordinate synchronization, and (ii) how to coordinate ST overflow cases between SEs. For the former, we design a generic structure allocated in main memory. For the latter, we propose a hierarchical overflow communication protocol between SEs.
SynCron's Synchronization Variable. We design a generic structure (Figure 9), called syncronVar, which is used to coordinate synchronization for all supported primitives in ST overflow cases. syncronVar is defined in the driver of the NDP system, which handles the allocation of the synchronization variables: programmers use create_syncvar() (Table 2) to create a new synchronization variable, the driver allocates the bytes needed for syncronVar in main memory, and returns an opaque pointer that points to the address of the variable. Programmers should not de-reference the opaque pointer, and its content can only be accessed via SynCron's API (Table 2).

    struct syncronVar_t {
        uint16_t Waitlist[4];
        uint64_t VarInfo;
        uint8_t  OverflowInfo;
    };
    typedef struct syncronVar_t syncronVar;

Figure 9: Synchronization variable of SynCron (syncronVar).

The syncronVar structure includes one waiting list for each SE of the system, which has one bit for each NDP core within the NDP unit, and two additional fields (VarInfo, OverflowInfo) needed to hierarchically handle ST overflows for all primitives.
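For illustration, a driver-side allocation of this structure might look as follows; create_syncvar() is SynCron's API name (Table 2), but this host-side body, the opaque-handle type, and the use of calloc in place of the driver's NDP-memory allocator are our assumptions:

```c
#include <stdint.h>
#include <stdlib.h>
#include <assert.h>

/* syncronVar as defined in Figure 9. */
struct syncronVar_t {
    uint16_t Waitlist[4];   /* one waiting list per SE of the system */
    uint64_t VarInfo;       /* primitive-specific data               */
    uint8_t  OverflowInfo;  /* which SEs have overflowed             */
};
typedef struct syncronVar_t syncronVar;

/* Opaque handle returned to programmers; they must not dereference it. */
typedef void *syncvar_t;

/* Hypothetical sketch of create_syncvar(): the real driver allocates in
 * the NDP main memory; calloc stands in for that allocator here. */
static syncvar_t create_syncvar(void) {
    return calloc(1, sizeof(syncronVar));   /* zero-initialized variable */
}
```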
To ensure correctness, only the Master SE updates the syncronVar variable: in ST overflow, the SPU of the Master SE issues read or write requests to its local memory to globally coordinate synchronization via the syncronVar variable. In our proposed hierarchical design, there are two overflow scenarios: (i) the ST of the Master SE overflows, and (ii) the ST of a local SE overflows, or the STs of multiple local SEs overflow.
The ST of the Master SE overflows. The other SEs of the system have not overflowed for a specific synchronization variable. Thus, they can still directly buffer this variable in their local STs and serve their local cores themselves, implementing a hierarchical (two-level) communication with the Master SE. The Master SE receives global messages from SEs, and serves a local SE of an NDP unit using all bits in the waiting list of the syncronVar variable associated with that local SE. Specifically, when it receives a global acquire-type message from a local SE, it sets all bits in the corresponding waiting list of the syncronVar variable. When it receives a global release-type message from a local SE, it resets all bits in the corresponding waiting list of the syncronVar variable.
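A sketch of this all-bits set/reset convention (helper names are ours; in SynCron the Master SE's SPU performs these updates on its local memory):

```c
#include <stdint.h>
#include <assert.h>

/* While the Master SE's ST has overflowed, a non-overflowed local SE is
 * represented by all 16 bits of its waiting list in syncronVar. */
typedef struct { uint16_t Waitlist[4]; } syncronVar;

static void on_global_acquire(syncronVar *v, int se_id) {
    v->Waitlist[se_id] = 0xFFFF;   /* set all bits: this SE has waiters */
}

static void on_global_release(syncronVar *v, int se_id) {
    v->Waitlist[se_id] = 0x0000;   /* reset all bits: no waiters left */
}
```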
The ST of a local SE overflows. In this scenario, there are local SEs that have overflowed for a specific variable, and local SEs that have not overflowed. Without loss of generality, we assume that only one SE of the system has overflowed.

The local SEs that have not overflowed serve their local cores themselves via their STs, implementing a hierarchical (two-level) communication with the Master SE. When the Master SE receives a global message from a local SE (that has not overflowed), it (i) sets (or resets) all bits in the waiting list of the syncronVar variable associated with that SE, and (ii) responds with a global message to the local SE, if needed.

The overflowed SE needs to notify the Master SE to handle local synchronization requests of NDP cores located at another NDP unit via main memory. We design overflow message opcodes (Table 3) to be sent from the local overflowed SE to the Master SE and back. The overflowed SE re-directs all messages (sent from its local NDP cores) for a specific variable to the Master SE using the overflow message opcodes, and both the overflowed SE and the Master SE increase their corresponding indexing counters to indicate that this variable is currently serviced via memory. When the Master SE receives an overflow message, it (i) sets (or resets), in the waiting list (associated with the overflowed SE) of the syncronVar variable, the bit that corresponds to the local ID of the NDP core within the NDP unit, (ii) sets (or resets), in the OverflowInfo field of the syncronVar variable, the bit that corresponds to the global ID of the overflowed SE, to keep track of which SE (or SEs) of the system has overflowed, and (iii) responds with an overflow message to that SE, if needed. The local ID of the NDP core and the global ID of the overflowed SE are encoded in the CoreID field of the message (Figure 5). When all bits in the waiting lists of the syncronVar variable become zero (upon receiving a release-type message), the Master SE decrements the corresponding indexing counter. Then, it sends a decrease_indexing_counter message (Table 3) to the overflowed SE (based on the set bit that is tracked in the OverflowInfo field), which decrements its corresponding indexing counter.

SynCron Enhancements
RMW Operations. It is straightforward to extend SynCron to support simple atomic rmw operations inside the SE (by adding a lightweight ALU). The Master SE could be responsible for executing atomic rmw operations on a variable depending on its address. We leave that for future work.
Lock Fairness. When local cores of an NDP unit repeatedly request a lock from their local SE, the SE repeatedly grants the lock within its unit, potentially causing unfairness and delay to other NDP units. To prevent this, an extra field of a local grant counter could be added to the ST entry. The counter increases every time the SE grants the lock to a local core. If the counter exceeds a predefined threshold, then when the SE receives a lock release, it transfers the lock to another SE (assuming other SEs request the lock). The host OS or the user could dynamically set this threshold via a dedicated register. We leave the exploration of such fairness mechanisms to future work.
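The grant-counter idea above can be sketched as follows; the threshold value and function names are hypothetical, since the text leaves this mechanism to future work:

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/* Sketch of the proposed local-grant counter: after too many consecutive
 * local grants, a release hands the lock to a waiting remote SE. The
 * threshold would be set by the host OS or user via a dedicated register;
 * 8 is a made-up value for illustration. */
#define GRANT_THRESHOLD 8

typedef struct { uint32_t local_grants; } lock_entry_t;

static void on_local_grant(lock_entry_t *e) {
    e->local_grants++;
}

/* Called on lock release; returns true if the lock should move to a
 * remote SE that is waiting for it. */
static bool should_transfer(lock_entry_t *e, bool remote_waiting) {
    if (remote_waiting && e->local_grants >= GRANT_THRESHOLD) {
        e->local_grants = 0;    /* reset after transferring the lock */
        return true;
    }
    return false;
}
```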
SynCron's design shares some of its design concepts with SSB [157], LCU [146], and MiSAR [97]. However, SynCron is more general, supporting the four most widely-used synchronization primitives, and easy-to-use thanks to its high-level programming interface.

Table 4 qualitatively compares SynCron with these schemes. SSB and LCU support only lock semantics, thus they introduce two ISA extensions for a simple lock. MiSAR introduces seven ISA extensions to support three primitives and handle overflow scenarios. SynCron includes two ISA extensions for four supported primitives. A spin-wait approach performs consecutive synchronization retries, typically incurring high energy consumption. A direct notification scheme sends a direct message to only one waiting core when the synchronization variable becomes available, minimizing the traffic involved upon a release operation. SSB, LCU, and MiSAR are tailored to uniform memory systems. In contrast, SynCron is the only hardware synchronization mechanism that targets NDP systems as well as non-uniform memory systems.

SSB and LCU handle overflow in hardware synchronization resources using a pre-allocated table in main memory, and if it overflows, they switch to software exception handlers (handled by the programmer), which typically incur large overheads (due to OS intervention) when overflows happen at a non-negligible frequency. To avoid falling back to main memory, which has high latency, and using expensive software exception handlers, MiSAR requires the programmer to handle overflow scenarios using alternative software synchronization libraries (e.g., the pthread library provided by the OS). This approach can provide performance benefits in CPU systems, since alternative synchronization solutions can exploit low-cost accesses to caches and hardware cache coherence. However, in NDP systems alternative solutions would by default use main memory, due to the absence of shared caches and hardware cache coherence support. Moreover, when overflow occurs, MiSAR's accelerator sends abort messages to all participating CPU cores notifying them to use the alternative solution, and when the cores finish synchronizing via the alternative solution, they notify MiSAR's accelerator to switch back to hardware synchronization. This scheme introduces additional hardware/ISA complexity and communication between the cores and the accelerator, thus incurring high network traffic and communication costs, as we show in Section 6.7.3. In contrast, SynCron directly falls back to memory via a fully-integrated hardware-only overflow scheme, which provides graceful performance degradation (Section 6.7.3), and is completely transparent to the programmer: programmers only use SynCron's high-level API, similarly to how software libraries are in charge of synchronization.
 | SSB [157] | LCU [146] | MiSAR [97] | SynCron
Supported Primitives | 1 | 1 | 3 | 4
ISA Extensions | 2 | 2 | 7 | 2
Spin-Wait Approach | yes | yes | no | no
Direct Notification | no | yes | yes | yes
Target System | uniform | uniform | uniform | non-uniform
Overflow Management | partially integrated | partially integrated | handled by programmer | fully integrated

Table 4: Comparison of SynCron with prior mechanisms.
SynCron in Conventional Systems
The baseline NDP architecture [8, 43, 143, 155, 158] we assume in this work shares key design principles with conventional NUMA systems. However, unlike NDP systems, NUMA CPU systems (i) have a shared level of cache (within a NUMA socket and/or across NUMA sockets), (ii) run multiple multi-threaded applications, i.e., a high number of software threads executed in hardware thread contexts, and (iii) have an OS that migrates software threads between hardware thread contexts to improve system performance. Therefore, although SynCron could be implemented in such commodity systems, our proposed hardware design would need extensions. First, SynCron could exploit the low-cost accesses to shared caches in conventional CPUs, e.g., by including an additional level in SynCron's hierarchical design to use the shared cache for efficient synchronization within a NUMA socket, and/or by handling overflow scenarios by falling back to the low-latency cache instead of main memory. Second, SynCron would need to support characteristics (ii) and (iii) listed above, i.e., by including larger STs and waiting lists to satisfy the needs of multiple multi-threaded applications, handling OS thread migration across hardware thread contexts, and handling multiple synchronization requests sent to SEs from different software threads with the same hardware ID, when different software threads are executed on the same hardware thread context. We leave the optimization of SynCron's design for conventional systems to future work.
5. Methodology
Simulation Methodology.
We use an in-house simulator that integrates ZSim [125] and Ramulator [85]. We model 4 NDP units (Table 5), each with 16 in-order cores. The cores issue a memory operation after the previous one has completed, i.e., there are no overlapping operations issued by the same core. Any write operation is completed (and its latency is accounted for in our simulations) before executing the next instruction. To ensure memory consistency, compiler support [123] guarantees that there is no reordering around the sync instructions, and a read is inserted after a write inside a critical section.
NDP Cores | 16 in-order cores @2.5 GHz per NDP unit
L1 Data + Inst. Cache | private, 16KB, 2-way, 4-cycle; 64B line; 23/47 pJ per hit/miss [109]
NDP Unit Local Network | buffered crossbar network with packet flow control; 1-cycle arbiter
DRAM | HBM / HMC / DDR4
Interconnection Links Across NDP Units |
Synchronization Engine | SPU @1GHz clock frequency [129]; buffer: 280B; ST: 1192B, 64 entries, 1-cycle [109]; indexing counters: 2304B, 256 entries (8 LSB of the address), 2-cycle [109]

Table 5: Configuration of our simulated system.
We evaluate three NDP configurations for different memory technologies, namely 2D, 2.5D, and 3D NDP. The 2D NDP configuration uses a DDR4 memory model and resembles recent 2D NDP systems [34, 50, 89, 144]. In the 2.5D NDP configuration, each compute die of an NDP unit (16 NDP cores) is connected to an HBM stack via an interposer, similar to current GPUs [106, 115] and FPGAs [131, 150]. For the 3D NDP configuration, we use the HMC memory model, where the compute die of the NDP unit is located in the logic layer of the memory stack, as in prior works [8, 19, 155, 158]. Due to space limitations, we present detailed evaluation results for the 2.5D NDP configuration, and provide a sensitivity study for the different NDP configurations in Section 6.5. We model a crossbar network within each NDP unit, simulating queuing latency using the M/D/1 model [18]. We count in ZSim-Ramulator all events for caches (i.e., number of hits/misses), the network (i.e., number of bits transferred inside/across NDP units), and memory (i.e., number of total memory accesses), and use CACTI [109] and parameters reported in prior works [143, 149, 151] to calculate energy. To estimate the latency in the SE, we use CACTI for the ST and indexing counters, and Aladdin [129] for the SPU at 1GHz and 40nm. Each message is served in 12 cycles, corresponding to the message (barrier_depart_global) that takes the longest time.
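As a reminder of the queuing model used above, the mean waiting time of an M/D/1 queue (Poisson arrivals at rate λ, deterministic service at rate μ, utilization ρ = λ/μ < 1) is the standard result

```latex
W_q = \frac{\rho}{2\mu(1-\rho)}
```

which follows from the Pollaczek-Khinchine formula with zero service-time variance; the simulator applies such a model per crossbar port.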
Workloads.
We evaluate workloads with both (i) coarse-grained synchronization, i.e., including only a few synchronization variables to protect shared data, leading to cores highly contending for them (high-contention), and (ii) fine-grained synchronization, i.e., including a large number of synchronization variables, each of them protecting a small granularity of shared data, leading to cores not frequently contending for the same variables at the same time (low-contention). We use the term synchronization intensity to refer to the ratio of synchronization operations over other computation in the workload. As this ratio increases, synchronization latency affects the total execution time of the workload more.

We study three classes of applications (Table 6), all well suited for NDP. First, we evaluate pointer chasing workloads, i.e., lock-based concurrent data structures from the ASCYLIB library [31], used as key-value sets. In ASCYLIB's Binary Search Tree (BST) [37], the lock memory requests are only 0.1% of the total memory requests, so we also evaluate an external fine-grained locking BST from [130]. Data structures are initialized with a fixed size and statically partitioned across NDP units, except for BSTs, which are distributed randomly. In these benchmarks, each core performs a fixed number of operations. We use lookup operations for data structures that support them, deletion for the rest, and push and pop operations for stack and queue. Second, we evaluate graph applications with fine-grained synchronization from Crono [7, 65] (push version), where the output array has read-write data. All real-world graphs [32] used are undirected and statically partitioned across NDP units, where the vertex data is equally distributed across cores. Third, we evaluate time series analysis [142], using SCRIMP, and real data sets from Matrix Profile [152]. We replicate the input data in each NDP unit and partition the output array (read-write data) across NDP units.
Comparison Points.
We compare SynCron with three schemes: (i) Central: a message-passing scheme that supports all primitives by extending the barrier primitive of Tesseract [8], i.e., one dedicated NDP core in the entire NDP system acts as server and coordinates synchronization among all NDP cores of the system by issuing memory requests to synchronization variables via its memory hierarchy, while the remaining client cores communicate with it via hardware message-passing; (ii) Hier: a hierarchical message-passing scheme that supports all primitives, similar to the barrier primitive of [43] (or the hierarchical lock of [141]), i.e., one NDP core per NDP unit acts as server and coordinates synchronization by issuing memory requests to synchronization variables via its memory hierarchy (including caches), and communicates with other servers and local client cores (located at the same NDP unit with it) via hardware message-passing; (iii) Ideal: an ideal scheme with zero performance overhead for synchronization.

In our evaluation, each NDP core runs one thread. For fair comparison, we use the same number of client cores, i.e., 15 per NDP unit, that execute the main workload for all schemes. For synchronization, we add one server core for the entire system in Central, one server core per NDP unit for Hier, and one SE per NDP unit for SynCron. For SynCron, we disable one core per NDP unit to match the same number of client cores as the previous schemes. Maintaining the same thread-level parallelism for executing the main kernel is consistent with prior works on message-passing synchronization [97, 141].

Data Structure | Configuration
Stack [31] | 100K - 100% push
Queue [31, 104] | 100K - 100% pop
Array Map [31, 56] | 10 - 100% lookup
Priority Queue [11, 31, 118] | 20K - 100% deleteMin
Skip List [31, 118] | 5K - 100% deletion
Hash Table [31, 63] | 1K - 100% lookup
Linked List [31, 63] | 20K - 100% lookup
Binary Search Tree Fine-Grained (BST_FG) [130] | 20K - 100% lookup
Binary Search Tree Drachsler (BST_Drachsler) [31, 37] | 10K - 100% deletion

Real Application | Locks | Barriers
Breadth First Search (bfs) [7] | ✓ | ✓
Connected Components (cc) [7] | ✓ | ✓
Single Source Shortest Paths (sssp) [7] | ✓ | ✓
Pagerank (pr) [7] | ✓ | ✓
Teenage Followers (tf) [65] | ✓ | -
Triangle Counting (tc) [7] | ✓ | ✓
Time Series Analysis (ts) [152] | ✓ | ✓

Real Application | Input Data Set
bfs, cc, sssp, pr, tf, tc | wikipedia-20051105 (wk), soc-LiveJournal1 (sl), sx-stackoverflow (sx), com-Orkut (co)
ts | air quality (air), energy consumption (pow)

Table 6: Summary of all workloads used in our evaluation.
6. Evaluation
Figure 10 evaluates all supported primitives using 60 cores, varying the interval (in terms of instructions) between two synchronization points. We devise simple benchmarks, where cores repeatedly request a single synchronization variable. For lock, the critical section is empty, i.e., it does not include any instruction. For semaphore and condition variable, half of the cores execute sem_wait/cond_wait, while the rest execute sem_post/cond_signal, respectively. As the interval between synchronization points becomes smaller, SynCron's performance benefit increases. For an interval of 200 instructions, SynCron outperforms Central and Hier by 3.05× and 1.40×, respectively, averaged across all primitives. SynCron outperforms Hier due to directly buffering synchronization variables in low-latency STs, and achieves the highest benefits for the condition variable primitive (by 1.61×), since this benchmark has higher synchronization intensity compared to the rest: cores coordinate for both the condition variable and the lock associated with it. When the interval between synchronization operations becomes larger, synchronization requests become less dominant in the main workload, and thus all schemes perform similarly. Overall, SynCron outperforms prior schemes for all different synchronization primitives.
Figure 10: Speedup of different synchronization primitives.

Pointer Chasing Data Structures. Figure 11 shows the throughput for all schemes in pointer chasing, varying the NDP cores in steps of 15, each time adding one NDP unit.
Figure 11: Throughput of pointer chasing using data structures.
We observe four different patterns. First, stack, queue, array map, and priority queue incur high contention, as all cores heavily contend for a few variables. Array map has the lowest scalability due to a larger critical section. In high-contention scenarios, hierarchical schemes (Hier, SynCron) perform better by reducing the expensive traffic across NDP units. SynCron outperforms Hier, since the latency cost of using SEs that update small STs is lower than that of using NDP cores as servers that update larger caches. Second, skip list and hash table incur medium contention, as different cores may work on different parts of the data structure. For these data structures, hierarchical schemes perform better, as they minimize the expensive traffic, and multiple server cores concurrently serve requests to their local memory. SynCron retains most of the performance benefits of Ideal, incurring only 19.9% overhead with 60 cores, and outperforms Hier by 9.8%. Third, linked list and BST_FG exhibit low contention and high synchronization demand, as each core requests multiple locks concurrently. These data structures cause higher synchronization-related traffic inside the network compared to skip list and hash table, and thus SynCron further outperforms Hier by 1.19× due to directly buffering synchronization variables in STs. Fourth, in BST_Drachsler, lock requests constitute only 0.1% of the total requests, and all schemes perform similarly. Overall, we conclude that SynCron achieves higher throughput than prior mechanisms under different scenarios with diverse conditions.
Figure 12 shows the performance of all schemes with real applications using all NDP units, normalized to Central. Averaged across 26 application-input combinations, SynCron outperforms Central by 1.47× and Hier by 1.23×, and performs within 9.5% of Ideal. Our real applications exhibit low contention, as two cores rarely contend for the same synchronization variable, and high synchronization demand, as several synchronization variables are active during execution. We observe that Hier and SynCron increase parallelism, because the per-NDP-unit servers service different synchronization requests concurrently, and avoid remote synchronization messages across NDP units. Even though Hier performs 1.19× better than Central, on average, its performance is still 1.33× worse than Ideal. SynCron provides most of the performance benefits of Ideal (with only 9.5% overhead on average), and outperforms Hier due to directly buffering the synchronization variables in STs, thereby completely avoiding the memory accesses for synchronization requests. Specifically, we find that time series analysis has high synchronization intensity, since the ratio of synchronization over other computation of the workload is higher compared to graph workloads. For this application, Hier and SynCron outperform Central by 1.64× and 2.22×, respectively, as they serve multiple synchronization requests concurrently. SynCron further outperforms Hier by 1.35× due to directly buffering the synchronization variables in STs. We conclude that SynCron performs best across all real application-input combinations and approaches the Ideal scheme, which has no synchronization overhead.
Scalability.
Figure 13 shows the scalability of real applica-tions using
SynCron from 1 to 4 NDP units. Due to spacelimitations, we present a subset of our workloads, but we re-port average values for all 26 application-input combinations.This also applies for all figures presented henceforth. Acrossall workloads,
SynCron enables performance scaling by atleast 1.32 × , on average 2.03 × , and up to 3.03 × , when using4 NDP units (60 NDP cores) over 1 NDP unit (15 NDP cores). bfs.sl cc.sx sssp.co pr.wk tf.sl tc.sx ts.air ts.pow AVG S p ee d u p Figure 13: Scalability of real applications using
SynCron . Figure 14 shows the energy breakdown for cache, network,and memory in our real applications when using all cores.
SynCron reduces the network and memory energy thanks to its hierarchical design and direct buffering. On average, SynCron reduces energy consumption by 2.22× over Central and 1.94× over Hier, and incurs only 6.2% energy overhead over Ideal. We observe that 1) cache energy consumption constitutes a small portion of the total energy, since these applications have irregular access patterns. NDP cores that act as servers for Central and Hier increase the cache energy by only 5.1% and 4.8% over Ideal. 2) Central generates a larger amount of expensive traffic across NDP units compared to hierarchical schemes, resulting in 2.68× higher network energy over SynCron. SynCron also has lower network energy (by 1.21×) than Hier, because it avoids transferring synchronization variables from memory to SEs, due to directly buffering them. 3) Hier and Central have approximately the same memory energy consumption, because they issue a similar number of requests to memory. In contrast, SynCron's memory energy consumption is similar to that of Ideal. We note that SynCron provides higher energy reductions in applications with high synchronization intensity, such as time series analysis, since it avoids a higher number of memory accesses for synchronization due to its direct buffering capability.

Figure 12: Speedup in real applications normalized to Central.

Figure 14: Energy breakdown in real applications for C: Central, H: Hier, SC: SynCron, and I: Ideal.
Figure 15 shows normalized data movement, i.e., bytes transferred between NDP cores and memory, for all schemes using four NDP units. SynCron reduces data movement across all workloads by 2.08× and 2.04× over Central and Hier, respectively, on average, and incurs only 13.8% more data movement than Ideal. Central generates high data movement across NDP units, particularly when running time series analysis, which has high synchronization intensity. Hier reduces the traffic across NDP units; however, it may increase the traffic inside an NDP unit, occasionally leading to slightly higher total data movement (e.g., ts.air). This is because when an NDP core requests a synchronization variable that is physically located in another NDP unit, it first sends a message inside the NDP unit to its local server, which in turn sends a message to the global server. In contrast, SynCron reduces the traffic inside an NDP unit due to directly buffering synchronization variables, and across NDP units due to its hierarchical design.
Figure 15: Data movement in real applications for C: Central, H: Hier, SC: SynCron, and I: Ideal, broken down into movement inside NDP units and across NDP units.

Hierarchical schemes provide high benefit under high contention, as they prioritize local requests inside each NDP unit. We study their performance benefit on stack and priority queue (Figure 16) when varying the transfer latency of the interconnection links used across four NDP units.
Central is significantly affected by the interconnect latency across NDP units, as it is oblivious to the non-uniform nature of the NDP system. Observing Ideal, which reflects the actual behavior of the main workload, we notice that after a certain point (vertical line), the cost of remote memory accesses across NDP units becomes high enough to dominate performance. SynCron and Hier tend to follow the actual behavior of the workload, as local synchronization messages within NDP units are much less expensive than the remote messages of Central. SynCron outperforms Hier by 1.06× and 1.04× for stack and priority queue, respectively. We conclude that SynCron is the best at hiding the latency of slow links across NDP units.

Figure 16: Performance sensitivity to the transfer latency of the interconnection links used to connect the NDP units (panels: Stack and Priority Queue; y-axis: operations/s).
We also study the effect of the interconnection links used across the NDP units in a low-contention graph application (Figure 17). Observing Ideal, with a 500 ns transfer latency per cache line, we note that the workload experiences 2.46× slowdown over the default latency of 40 ns, as 24.1% of its memory accesses are to remote NDP units. As the transfer latency increases, Central incurs significant slowdown over Ideal, since all NDP cores of the system communicate with a single server, generating expensive traffic across NDP units. In contrast, the slowdown of hierarchical schemes over Ideal is smaller, as these schemes generate less remote traffic by distributing the synchronization requests across multiple local servers. SynCron outperforms Hier due to its direct buffering capabilities. Overall, SynCron outperforms prior high-performance schemes even when the network delay across NDP units is large.
Figure 17: Performance sensitivity to the transfer latency (in ns per cache line) of the interconnection links used to connect the NDP units, for pr.wk. All data is normalized to Ideal (lower is better). The slowdown over Ideal at 40/100/200/500 ns is 1.07/1.11/1.15/1.17 for SynCron, 1.29/1.33/1.36/1.37 for Hier, and 1.61/1.87/2.23/2.67 for Central.

We study three memory technologies, which provide different memory access latencies and bandwidth. We evaluate (i) 2.5D NDP using HBM, (ii) 3D NDP using HMC, and (iii) 2D NDP using DDR4. Figure 18 shows the performance of all schemes normalized to Central of each memory. The reported values show the speedup of SynCron over Central and Hier. SynCron's benefit is independent of the memory technology used: its performance versus Ideal varies only slightly across the three memories. SynCron's performance improvement over prior schemes increases as the memory access latency becomes higher, thanks to direct buffering, which avoids expensive memory accesses for synchronization. For example, in ts.pow, SynCron outperforms Hier by 1.41× and 2.49× with HBM and DDR4, respectively, as the latter incurs higher access latency. Overall, SynCron is orthogonal to the memory technology used.
Figure 18: Speedup with different memory technologies.

6.6. Effect of Data Placement

Figure 19 evaluates the effect of better data placement on SynCron's benefits. We use Metis [74] to obtain a 4-way graph partitioning that minimizes the crossing edges between the 4 NDP units. All data values are normalized to Central without Metis. For SynCron, we define ST occupancy as the average fraction of ST entries that are occupied in each cycle.

Figure 19: Performance sensitivity to a better graph partitioning and maximum ST occupancy of SynCron.

Max ST Occupancy (%):  pr.wk  pr.sl  pr.sx  pr.co
No Metis:                62     51     53     48
Metis:                   39     29     38     34

We make three observations. First, Ideal, which reflects the actual behavior of the main kernel (i.e., with zero synchronization overhead), improves performance by 1.47× across the four graphs. Second, with a better graph partitioning, SynCron still outperforms both Central and Hier. Third, we find that ST occupancy is lower with a better graph partitioning. When a local SE receives a request for a synchronization variable of another NDP unit, both the local SE and the Master SE reserve a new entry in their STs. With a better graph partitioning, NDP cores send requests to their local SE, which is also the Master SE for the requested variable. Thus, only one SE of the system reserves a new entry, resulting in lower ST occupancy. We conclude that, with better data placement, SynCron still performs the best while achieving even lower ST occupancy.
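The ST reservation rule described above can be expressed as a tiny software model (illustrative only; the SE and ST are hardware structures, and the function and parameter names below are ours, not the paper's interface):

```python
# Model of how many Synchronization Table (ST) entries one request occupies,
# following the rule in the text: the requester's local SE always reserves an
# entry, and the Master SE in the variable's home NDP unit reserves a second
# entry only when it is a different SE. (Names are ours, for illustration.)
def st_entries_reserved(requester_unit: int, home_unit: int) -> int:
    local_se = requester_unit   # cores always contact the SE of their own unit
    master_se = home_unit       # the Master SE resides in the variable's home unit
    return 1 if local_se == master_se else 2

# Poor partitioning: requests often target variables homed in other NDP units.
assert st_entries_reserved(requester_unit=0, home_unit=3) == 2
# Good partitioning (e.g., with Metis): requests stay local, halving occupancy.
assert st_entries_reserved(requester_unit=0, home_unit=0) == 1
```

This is why a better graph partitioning lowers the maximum ST occupancy in Figure 19: more requests find their local SE to also be the Master SE.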
6.7. SynCron’s Design Choices

6.7.1. Hierarchical Design. To demonstrate the effectiveness of SynCron’s hierarchical design in non-uniform NDP systems, we compare it with SynCron’s flat variant. Each core in flat directly sends all of its synchronization requests to the Master SE of each variable. In contrast, each core in SynCron sends all of its synchronization requests to its local SE. If the local SE is not the Master SE for the requested variable, the local SE sends a message across NDP units to the Master SE. We evaluate three synchronization scenarios: (i) low-contention and synchronization non-intensive (e.g., graph applications), (ii) low-contention and synchronization-intensive (e.g., time series analysis), and (iii) high-contention (e.g., a queue data structure).
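The routing rule that distinguishes SynCron from its flat variant can be summarized in a few lines (a sketch with our own function and parameter names; replies and queuing effects are not modeled):

```python
# Illustrative model of request routing in flat vs. SynCron (hierarchical).
def messages(scheme: str, requester_unit: int, home_unit: int):
    """Return (intra_unit_msgs, inter_unit_msgs) for one synchronization
    request, where `home_unit` is the NDP unit of the Master SE that owns
    the requested variable."""
    local = requester_unit == home_unit
    if scheme == "flat":
        # The core contacts the Master SE directly.
        return (1, 0) if local else (0, 1)
    if scheme == "syncron":
        # The core always contacts its local SE; the local SE forwards the
        # request across NDP units only when it is not the Master SE.
        return (1, 0) if local else (1, 1)
    raise ValueError(scheme)

# For a remote variable, SynCron pays one extra (cheap) intra-unit hop:
assert messages("flat", requester_unit=0, home_unit=2) == (0, 1)
assert messages("syncron", requester_unit=0, home_unit=2) == (1, 1)
# For a local variable, both schemes stay inside the NDP unit:
assert messages("flat", 1, 1) == messages("syncron", 1, 1) == (1, 0)
```

This per-request view explains why flat can be slightly faster when contention is low; under contention, the local SE serves many requests within its own unit, which is the effect captured in the high-contention results below.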
Low-contention and synchronization non-intensive. Figure 20 evaluates this scenario using several graph processing workloads with a 40 ns link latency between NDP units. SynCron is 1.1% worse than flat, on average. We conclude that SynCron performs only slightly worse than flat for low-contention and synchronization non-intensive scenarios.

Figure 20: Speedup of SynCron normalized to flat with 40 ns link latency between NDP units, under a low-contention and synchronization non-intensive scenario.
Low-contention and synchronization-intensive. Figure 21a evaluates this scenario using time series analysis with four different link latency values between NDP units. SynCron performs 7.3% worse than flat with a 40 ns inter-NDP-unit latency. With a 500 ns inter-NDP-unit latency, SynCron's slowdown over flat shrinks, since remote traffic has a larger impact on the total execution time. We conclude that SynCron performs modestly worse than flat, and SynCron's slowdown decreases as non-uniformity, i.e., the latency between NDP units, increases.

Figure 21: Speedup of SynCron normalized to flat, as we vary the transfer latency of the interconnection links used to connect NDP units, under (a) a low-contention and synchronization-intensive scenario using 4 NDP units, and (b) a high-contention scenario using 2 and 4 NDP units.
High-contention. Figure 21b evaluates this scenario using a queue data structure with four different link latency values between NDP units, for 30 and 60 NDP cores. SynCron with 30 NDP cores outperforms flat by 1.23× to 1.76×, as the inter-NDP-unit latency increases from 40 ns to 500 ns (i.e., with increasing non-uniformity in the system). In a scenario with high non-uniformity in the system and a large number of contended cores, e.g., using a 500 ns inter-NDP-unit latency and 60 NDP cores, SynCron's benefit increases to a 2.14× speedup over flat. We conclude that SynCron performs significantly better than flat under high contention.

Overall, we conclude that in non-uniform, distributed NDP systems, only a hierarchical hardware synchronization design can achieve high performance under all scenarios.

6.7.2. ST Size. We show the effectiveness of the proposed 64-entry ST (per NDP unit) using real applications. Table 7 shows the measured occupancy across all STs. Figure 22 shows the performance sensitivity to the ST size. In graph applications, the average ST occupancy is low (2.8%), and the 64-entry ST never overflows: the maximum occupancy is 63% (cc.wk). In contrast, time series analysis has higher ST occupancy (reaching up to 89% in ts.pow) due to its high synchronization intensity, but there are no ST overflows. Even a 48-entry ST overflows for only 0.01% of synchronization requests, and incurs only 2.1% slowdown over a 64-entry ST. We conclude that the proposed 64-entry ST meets the needs of applications that have high synchronization intensity.
Table 7: ST occupancy (Max % and Avg %) in real applications.

Figure 22: Slowdown with varying ST size (normalized to the 64-entry ST). Numbers on top of bars show the percentage of overflowed requests.

6.7.3. Overflow Management. The linked list and BST_FG data structures are the only cases where the proposed 64-entry ST overflows, when using 60 cores, for 3.1% and 30.5% of the requests, respectively. This is because each core requests at least two locks at the same time during the execution. Note that these synthetic benchmarks represent extreme scenarios, where all cores repeatedly perform key-value operations. Figure 23 compares BST_FG's performance with
SynCron's integrated overflow scheme versus a non-integrated scheme as in MiSAR. When overflow occurs, MiSAR's accelerator aborts all participating cores, notifying them to use an alternative synchronization library, and when the cores finish synchronizing via the alternative solution, they notify MiSAR's accelerator to switch back to hardware synchronization. We adapt this scheme to SynCron for comparison purposes: when an ST overflows, SEs send abort messages to NDP cores with a hierarchical protocol, notifying them to use an alternative synchronization solution, and after finishing synchronization, the cores notify SEs to decrease their indexing counters and switch back to hardware. We evaluate two alternative solutions: (i) SynCron_CentralOvrfl, where one dedicated NDP core handles all synchronization variables, and (ii) SynCron_DistribOvrfl, where one NDP core per NDP unit handles the variables located in the same NDP unit. With 30.5% overflowed requests (i.e., with a 64-entry ST), SynCron_CentralOvrfl and SynCron_DistribOvrfl incur 12.3% and 10.4% performance slowdown, respectively, compared to no ST overflow, due to high network traffic and communication costs between NDP cores and SEs. In contrast, SynCron affects performance by only 3.2% compared to no ST overflow. We conclude that SynCron's integrated hardware-only overflow scheme incurs very small performance overhead.
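The gap between the two overflow strategies can be illustrated with a toy message-count model (the counts below are our assumptions for intuition, not the protocols' measured costs):

```python
# Toy model: extra core<->SE round trips caused by one ST overflow episode.
def overflow_roundtrips(scheme: str, participating_cores: int) -> int:
    if scheme == "abort_based":
        # MiSAR-like: abort every participating core so it synchronizes via an
        # alternative software solution, then have each core notify the SE to
        # switch back to hardware (roughly two round trips per core, assumed).
        return 2 * participating_cores
    if scheme == "integrated":
        # SynCron-like: the hardware handles the overflow itself, so cores
        # are never aborted and pay no extra round trips (assumed).
        return 0
    raise ValueError(scheme)

assert overflow_roundtrips("abort_based", participating_cores=60) == 120
assert overflow_roundtrips("integrated", participating_cores=60) == 0
```

The model matches the trend reported above: the abort-based variants lose 10.4-12.3% performance under frequent overflow, while the integrated scheme loses only 3.2%.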
Figure 23: Throughput (operations/ms) achieved by BST_FG using different overflow schemes and varying the ST size (16 to 256 entries). The reported numbers show the percentage of overflowed requests.
6.8. SynCron’s Area and Power Overhead

Table 8 compares an SE with the ARM Cortex A7 core [14]. We estimate the SPU using Aladdin [129], and the ST and indexing counters using CACTI [109]. We conclude that our proposed hardware unit incurs very modest area and power costs to be integrated into the compute die of an NDP unit.

Table 8: Comparison of the SE (Synchronization Engine; SPU: 0.0141 mm², ST: 0.0112 mm²) with a simple general-purpose in-order core, ARM Cortex A7.
7. Related Work
To our knowledge, our work is the first to (i) comprehensively analyze and evaluate synchronization primitives in NDP systems, and (ii) propose an end-to-end hardware-based synchronization mechanism for efficient execution of such primitives. We briefly discuss prior work.
Synchronization on NDP. Ahn et al. [8] include a message-passing barrier similar to our Central baseline. Gao et al. [43] implement a hierarchical tree-based barrier for HMC [59], where cores first synchronize inside the vault, then across vaults, and finally across HMC stacks. Section 6.1 shows that SynCron outperforms such schemes. Gao et al. [43] also provide remote atomics at the vault controllers of HMC. However, synchronization using remote atomics creates high global traffic and hotspots [41, 96, 108, 147, 153].
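The two-level tree barrier described above can be approximated in software as follows (a sketch using one barrier level per group, e.g., per vault, and one global level; the scheme in [43] is implemented in hardware, and the class and method names are ours):

```python
import threading

# Software analogue of a two-level (hierarchical) tree barrier: cores
# synchronize inside their group first, then one representative per group
# synchronizes globally. Single-use, for brevity.
class HierBarrier:
    def __init__(self, groups: int, cores_per_group: int):
        self.local = [threading.Barrier(cores_per_group) for _ in range(groups)]
        self.top = threading.Barrier(groups)        # inter-group level
        self.released = [threading.Event() for _ in range(groups)]

    def wait(self, group: int) -> None:
        rank = self.local[group].wait()             # 1) intra-group barrier
        if rank == 0:                               # 2) one representative per
            self.top.wait()                         #    group crosses the network
            self.released[group].set()              # 3) release the local group
        self.released[group].wait()
```

With, say, 2 groups of 3 threads, no thread leaves wait() until all 6 have arrived, yet only one participant per group crosses the (expensive) inter-group level, which is the property that makes such barriers attractive across vaults or stacks.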
Synchronization on CPUs.
A range of hardware synchronization mechanisms have been proposed for commodity CPU systems [1-3, 10, 116, 124]. These are not suitable for NDP systems because they either (i) rely on the underlying cache coherence system [10, 124], (ii) are tailored to the 2D-mesh network topology that connects all cores [2, 3], or (iii) use transmission-line technology [116] or on-chip wireless technology [1]. Callbacks [120] includes a directory cache structure close to the LLC of a CPU system built on self-invalidation coherence protocols [26, 75, 77, 91, 121, 139]. Although it has low area cost, it would be oblivious to the non-uniformity of NDP, thereby incurring high performance overheads under high contention (Section 6.7.1). Callbacks improves the performance of spin-wait in hardware, on top of which high-level primitives (locks/barriers) are implemented in software. In contrast, SynCron directly supports high-level primitives in hardware, and is tailored to all salient characteristics of NDP systems.

The closest works to ours are SSB [157], LCU [146], and MiSAR [97]. SSB, a shared memory scheme, includes a small buffer attached to each controller of the LLC to provide lock semantics for a given data address. LCU, a message-passing scheme, incorporates a control unit into each core and a reservation table into each memory controller to provide reader-writer locks. MiSAR is a message-passing synchronization accelerator distributed at each LLC slice of tile-based manycore chips. These schemes provide efficient synchronization for CPU systems without relying on hardware coherence protocols. As shown in Table 4, compared to these works, SynCron is a more effective, general, and easy-to-use solution for NDP systems. These works have two major shortcomings. First, they are designed for uniform architectures, and would incur high performance overheads in non-uniform, distributed NDP systems under high-contention scenarios, similarly to flat in Figure 21b. Second, SSB and LCU handle overflow cases using software exception handlers that typically incur large performance overheads, while MiSAR's overflow scheme would incur high performance degradation due to high network traffic and communication costs between the cores and the synchronization accelerator (Section 6.7.3). In contrast, SynCron is a non-uniformity-aware, hardware-only, end-to-end solution designed to handle the key characteristics of NDP systems.
Synchronization on GPUs.
GPUs support remote atomic units at the shared cache and hardware barriers among threads of the same block [114], while inter-block barrier synchronization is inefficiently implemented via the host CPU [114]. The closest work to ours is HQL [153], which modifies the tag arrays of the L1 and L2 caches to support the lock primitive. This scheme incurs high area cost [41], and is tailored to the GPU architecture, which includes a shared L2 cache, while most NDP systems do not have shared caches.
Synchronization on MPPs.
The Cray T3D/T3E [81, 127], SGI Origin [88], and AMOs [154] include remote atomics at the memory controller, while the NYU Ultracomputer [52] provides fetch&add remote atomics in each network switch. As discussed in Section 2, synchronization via remote atomics incurs high performance overheads due to high global traffic [41, 108, 147, 153]. Cray T3E supports a barrier using physical wires, but it is designed specifically for the 3D torus interconnect. Tera MTA [12], HEP [71, 134], the J- and M-machines [29, 78], and Alewife [5] provide synchronization using hardware bits (full/empty bits) as tags in each memory word. This scheme can incur high area cost [146]. QOLB [51] associates one cache line with every lock to track a pointer to the next waiting core, and one cache line for local spinning using bits (syncbits). QOLB is built on the underlying cache coherence protocol. Similarly, DASH [95] keeps a queue of waiting cores for a lock in the directory used for coherence, to notify caches when the lock is released. CM5 [94] supports remote atomics and a barrier among cores via a dedicated physical control network (organized as a binary tree), which would incur high hardware cost to be supported in NDP systems.
8. Conclusion
SynCron is the first end-to-end synchronization solution for NDP systems. SynCron avoids the need for complex coherence protocols and expensive rmw operations, incurs very modest hardware cost, generally supports many synchronization primitives, and is easy to use. Our evaluations show that it outperforms prior designs under various conditions, providing high performance both under high-contention scenarios (due to the reduction of expensive traffic across NDP units) and low-contention scenarios (due to direct buffering of synchronization variables and high execution parallelism). We conclude that SynCron is an efficient synchronization mechanism for NDP systems, and hope that this work encourages further comprehensive studies of the synchronization problem in heterogeneous systems, including NDP systems.
Acknowledgments
We thank the anonymous reviewers of ISCA 2020, MICRO 2020, and HPCA 2021 for feedback. We thank Dionisios Pnevmatikatos, Konstantinos Nikas, Athena Elafrou, Foteini Strati, Dimitrios Siakavaras, Thomas Lagos, and Andreas Triantafyllos for helpful technical discussions. We acknowledge support from the SAFARI group's industrial partners, especially ASML, Google, Facebook, Huawei, Intel, Microsoft, VMware, and the Semiconductor Research Corporation. During part of this research, Christina Giannoula was funded by the General Secretariat for Research and Technology (GSRT) and the Hellenic Foundation for Research and Innovation (HFRI).
References

[1] S. Abadal, A. Cabellos-Aparicio, E. Alarcon, and J. Torrellas, “WiSync: An Architecture for Fast Synchronization through On-Chip Wireless Communication,” in
ASPLOS , 2016.[2] J. L. Abellán, J. Fernández, and M. E. Acacio, “A g-line-based Network for Fastand Efficient Barrier Synchronization in Many-Core CMPs,” in
ICPP , 2010.[3] J. L. Abellán, J. Fernández, M. E. Acacio et al. , “Glocks: Efficient Support forHighly-Contended Locks in Many-Core CMPs,” in
IPDPS , 2011.[4] M. Abeydeera and D. Sanchez, “Chronos: Efficient Speculative Parallelism forAccelerators,” in
ASPLOS , 2020.[5] A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson et al. , “The MIT AlewifeMachine: Architecture and Performance,” in
ISCA , 1998.[6] N. Agarwal, T. Krishna, L.-S. Peh, and N. K. Jha, “GARNET: A Detailed on ChipNetwork Model inside a Full-System Simulator,” in
ISPASS , 2009.[7] M. Ahmad, F. Hijaz, Q. Shi, and O. Khan, “CRONO: A Benchmark Suite forMultithreaded Graph Algorithms Executing on Futuristic Multicores,” in
IISWC ,2015.[8] J. Ahn, S. Hong, S. Yoo, and O. Mutlu, “A Scalable Processing-in-Memory Ac-celerator for Parallel Graph Processing,” in
ISCA , 2015.[9] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-Enabled Instructions: A Low-overhead, Locality-Aware Processing-in-Memory Architecture,” in
ISCA , 2015.[10] B. S. Akgul, J. Lee, and V. J. Mooney, “A System-on-a-Chip Lock Cache withTask Preemption Support,” in
CASES , 2001.[11] D. Alistarh, J. Kopinsky, J. Li, and N. Shavit, “The SprayList: A Scalable RelaxedPriority Queue,” in
PPoPP , 2015.[12] R. Alverson, D. Callahan, D. Cummings, B. Koblenz et al. , “The Tera ComputerSystem,”
ICS , 1990.[13] T. Anderson, “The Performance Implications of Spin-Waiting Alternatives forShared-Memory Multiprocessors,” in
ICPP , 1989.[14] ARM, “Cortex-A7 Technical Reference Manual,” 2009.[15] A. Awan, M. Brorsson, V. Vlassov, and E. Ayguade, “Performance Characteriza-tion of In-Memory Data Analytics on a Modern Cloud Server,” in
BDCC , 2015.[16] A. J. Awan, V. Vlassov, M. Brorsson, and E. Ayguade, “Node Architecture Impli-cations for In-Memory Data Analytics on Scale-in Clusters,” in
BDCAT , 2016.[17] R. Balasubramonian, J. Chang, T. Manning, J. H. Moreno et al. , “Near-Data Pro-cessing: Insights from a MICRO-46 Workshop,”
IEEE Micro , 2014.[18] U. N. Bhat,
An Introduction to Queueing Theory: Modeling and Analysis in Ap-plications , 2nd ed. Birkhäuser Basel, 2015.[19] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun et al. , “Google Workloadsfor Consumer Devices: Mitigating Data Movement Bottlenecks,” in
ASPLOS ,2018. [20] A. Boroumand, S. Ghose, M. Patel, H. Hassan et al. , “CoNDA: Efficient CacheCoherence Support for Near-data Accelerators,” in
ISCA , 2019.[21] A. Boroumand, S. Ghose, M. Patel, H. Hassan et al. , “LazyPIM: An EfficientCache Coherence Mechanism for Processing-in-Memory,”
CAL , 2017.[22] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev et al. , “An Analysis ofLinux Scalability to Many Cores,” in
OSDI , 2010.[23] D. S. Cali, G. S. Kalsi, Z. Bingöl, C. Firtina et al. , “GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Frame-work for Genome Sequence Analysis,” in
MICRO , 2020.[24] M. Chabbi, M. Fagan, and J. Mellor-Crummey, “High Performance Locks forMulti-Level NUMA Systems,”
PPoPP , 2015.[25] J. Choe, A. Huang, T. Moreshet, M. Herlihy et al. , “Concurrent Data Structureswith Near-Data-Processing: An Architecture-Aware Implementation,” in
SPAA ,2019.[26] B. Choi, R. Komuravelli, H. Sung, R. Smolinski et al. , “DeNovo: Rethinking theMemory Hierarchy for Disciplined Parallelism,” in
PACT , 2011.[27] T. Craig, “Building FIFO and Priority Queuing Spin Locks from Atomic Swap,”Tech. Rep., 1993.[28] D. Culler, J. Singh, and A. Gupta,
Parallel Computer Architecture: A Hardware-Software Approach , 1999.[29] W. Dally, J. S. Fiske, J. Keen, R. Lethin et al. , “The Message-Driven Processor: AMulticomputer Processing Node with Efficient Mechanisms,”
IEEE Micro , 1992.[30] T. David, R. Guerraoui, and . V. Trigonakis, “Everything You Always Wanted toKnow About Synchronization but Were Afraid to Ask,” in
SOSP , 2013.[31] T. David, R. Guerraoui, and V. Trigonakis, “Asynchronized Concurrency: TheSecret to Scaling Concurrent Search Data Structures,” in
ASPLOS , 2015.[32] T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix Collection,”
TOMS , 2011.[33] G. F. de Oliveira, J. Gómez-Luna, L. Orosa, S. Ghose et al. , “A New Methodologyand Open-Source Benchmark Suite for Evaluating Data Movement Bottlenecks:A Near-Data Processing Case Study,” in
SIGMETRICS , 2021.[34] F. Devaux, “The True Processing in Memory Accelerator,” in
Hot Chips , 2019.[35] D. Dice, V. J. Marathe, and N. Shavit, “Flat-Combining NUMA Locks,” in
SPAA ,2011.[36] D. Dice, V. J. Marathe, and N. Shavit, “Lock Cohorting: A General Techniquefor Designing NUMA Locks,”
TOPC , 2015.[37] D. Drachsler, M. Vechev, and E. Yahav, “Practical Concurrent Binary SearchTrees via Logical Ordering,”
PPoPP , 2014.[38] M. Drumond, A. Daglis, N. Mirzadeh, D. Ustiugov et al. , “The Mondrian DataEngine,” in
ISCA , 2017.[39] E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee et al. , “Parallel ApplicationMemory Scheduling,” in
MICRO , 2011.[40] A. Elafrou, G. Goumas, and N. Koziris, “Conflict-Free Symmetric Sparse Matrix-Vector Multiplication on Multicore Architectures,” in SC , 2019.[41] A. ElTantawy and T. M. Aamodt, “Warp Scheduling for Fine-Grained Synchro-nization,” in HPCA , 2018.[42] I. Fernandez, R. Quislant, C. Giannoula, M. Alser et al. , “NATSA: A Near-DataProcessing Accelerator for Time Series Analysis,”
ICCD , 2020.[43] M. Gao, G. Ayers, and C. Kozyrakis, “Practical Near-Data Processing for In-Memory Analytics Frameworks,” in
PACT , 2015.[44] M. Gao and C. Kozyrakis, “HRL: Efficient and Flexible Reconfigurable Logic forNear-Data Processing,” in
HPCA , 2016.[45] M. Gao, J. Pu, X. Yang, M. Horowitz et al. , “TETRIS: Scalable and EfficientNeural Network Acceleration with 3D Memory,” in
ASPLOS , 2017.[46] S. Ghose, A. Boroumand, J. Kim, J. Gómez-Luna et al. , “Processing-in-Memory:A Workload-Driven Perspective,”
IBM JRD , 2019.[47] S. Ghose, T. Li, N. Hajinazar, D. Senol Cali et al. , “Demystifying ComplexWorkload-DRAM Interactions: An Experimental Study,” in
SIGMETRICS , 2019.[48] C. Giannoula, G. Goumas, and N. Koziris, “Combining HTM with RCU to Speedup Graph Coloring on Multicore Platforms,” in
ISC HPC , 2018.[49] M. Gokhale, S. Lloyd, and C. Hajas, “Near Memory Data Structure Rearrange-ment,” in
MEMSYS , 2015.[50] J. Gomez-Luna, I. El Hajj, I. Fernandez, C. Giannoula et al. , “Benchmarking aNew Paradigm: Understanding a Modern Processing-in-Memory Architecture,”in
SIGMETRICS , 2021.[51] J. R. Goodman, M. K. Vernon, and P. J. Woest, “Efficient Synchronization Primi-tives for Large-Scale Cache-Coherent Multiprocessors,” in
ASPLOS , 1989.[52] A. Gottlieb, R. Grishman, C. Kruskal, K. McAuliffe et al. , “The NYU Ultracom-puter—Designing a MIMD, Shared-Memory Parallel Machine,” in
ISCA , 1982.[53] D. Grunwald and S. Vajracharya, “Efficient Barriers for Distributed Shared Mem-ory Computers,” in
IPDPS , 1994.[54] P. Gu, S. Li, D. Stow, R. Barnes et al. , “Leveraging 3D Technologies for Hard-ware Security: Opportunities and Challenges,” in
GLSVLSI , 2016.[55] P. Gu, X. Xie, Y. Ding, G. Chen et al. , “IPIM: Programmable in-Memory ImageProcessing Accelerator Using Near-Bank Architecture,”
ISCA , 2020.[56] R. Guerraoui and V. Trigonakis, “Optimistic Concurrency with OPTIK,” PPoPP2016.[57] H. Guiroux, R. Lachaize, and V. Quéma, “Multicore Locks: The Case Is NotClosed Yet,” in
USENIX ATC , 2016.[58] J. Gómez-Luna, J. M. González-Linares, J. I. Benavides Benítez, and N. GuilMata, “Performance Modeling of Atomic Additions on GPU Scratchpad Mem-ory,”
TPDS , 2013.[59] R. Hadidi, B. Asgari, B. A. Mudassar, S. Mukhopadhyay et al. , “Demystifyingthe Characteristics of 3D-stacked Memories: A case Study for Hybrid MemoryCube,” in
IISWC , 2017.[60] M. Hashemi, E. Ebrahimi, O. Mutlu, Y. N. Patt et al. , “Accelerating DependentCache Misses with an Enhanced Memory Controller,” in
ISCA , 2016.[61] M. Heinrich, V. Soundararajan, J. Hennessy, and A. Gupta, “A Quantitative Anal-ysis of the Performance and Scalability of Distributed Shared Memory CacheCoherence Protocols,” TC , 1999.[62] D. Hensgen, R. Finkel, and U. Manber, “Two Algorithms for Barrier Synchro-nization,” International Journal of Parallel Programming , 1988.[63] M. Herlihy and N. Shavit,
The Art of Multiprocessor Programming , 2008.[64] T. Hoefler, T. Mehlan, F. Mietke, and W. Rehm, “A Survey of Barrier Algorithmsfor Coarse Grained Supercomputers,”
Chemnitzer Informatik Berichte , 2004.[65] S. Hong, S. Salihoglu, J. Widom, and K. Olukotun, “Simplifying Scalable GraphProcessing with a Domain-Specific Language,” in
CGO, 2014.
[66] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee et al., "Transparent Offloading and Mapping: Enabling Programmer-Transparent Near-Data Processing in GPU Systems," in ISCA, 2016.
[67] K. Hsieh, S. Khan, N. Vijaykumar, K. Chang et al., "Accelerating Pointer Chasing in 3D-stacked Memory: Challenges, Mechanisms, Evaluation," in ICCD, 2016.
[68] Intel, Intel 64 and IA-32 Architectures Software Developer's Manual, 2009.
[69] J. Joao, M. A. Suleman, O. Mutlu, and Y. Patt, "Bottleneck Identification and Scheduling in Multithreaded Applications," in ASPLOS, 2012.
[70] J. Joao, M. A. Suleman, O. Mutlu, and Y. Patt, "Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs," in ISCA, 2013.
[71] H. F. Jordan, "Performance Measurements on HEP - a Pipelined MIMD Computer," in ISCA, 1983.
[72] H. Jun, J. Cho, K. Lee, H.-Y. Son et al., "HBM DRAM Technology and Architecture," in IMW, 2017.
[73] K. Kanellopoulos, N. Vijaykumar, C. Giannoula, R. Azizi et al., "SMASH: Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations," in MICRO, 2019.
[74] G. Karypis and V. Kumar, "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs," SIAM J. Sci. Comput., 1998.
[75] S. Kaxiras and G. Keramidas, "SARC Coherence: Scaling Directory Cache Coherence in Performance and Power," IEEE Micro, 2010.
[76] S. Kaxiras, D. Klaftenegger, M. Norgren, A. Ros et al., "Turning Centralized Coherence and Distributed Critical-Section Execution on Their Head: A New Approach for Scalable Distributed Shared Memory," in HPDC, 2015.
[77] S. Kaxiras and A. Ros, "A New Perspective for Efficient Virtual-Cache Coherence," in ISCA, 2013.
[78] S. W. Keckler, W. J. Dally, D. Maskit, N. P. Carter et al., "Exploiting Fine-Grain Thread Level Parallelism on the MIT Multi-ALU Processor," in ISCA, 1998.
[79] J. H. Kelm, D. R. Johnson, W. Tuohy, S. S. Lumetta et al., "Cohesion: A Hybrid Memory Model for Accelerators," in ISCA, 2010.
[80] J. H. Kelm, M. R. Johnson, S. S. Lumetta, and S. J. Patel, "WAYPOINT: Scaling Coherence to Thousand-Core Architectures," in PACT, 2010.
[81] R. E. Kessler and J. L. Schwarzmeier, "Cray T3D: A New Dimension for Cray Research," in Digest of Papers, Compcon Spring, 1993.
[82] D. Kim, J. Kung, S. Chai, S. Yalamanchili et al., "Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory," in ISCA, 2016.
[83] G. Kim, J. Kim, J. H. Ahn, and J. Kim, "Memory-Centric System Interconnect Design with Hybrid Memory Cubes," in PACT, 2013.
[84] J. Kim, D. Senol Cali, H. Xin, D. Lee et al., "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies," BMC Genomics, 2018.
[85] Y. Kim, W. Yang, and O. Mutlu, "Ramulator: A Fast and Extensible DRAM Simulator," CAL, 2015. https://github.com/CMU-SAFARI/ramulator
[86] L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," TC, 1979.
[87] L. Lamport, "A New Solution of Dijkstra's Concurrent Programming Problem," Commun. ACM, 1974.
[88] J. Laudon and D. Lenoski, "The SGI Origin: A ccNUMA Highly Scalable Server," in ISCA, 1997.
[89] D. Lavenier, J.-F. Roy, and D. Furodet, "DNA Mapping using Processor-in-Memory Architecture," in BIBM, 2016.
[90] M. LeBeane, S. Song, R. Panda, J. H. Ryoo et al., "Data Partitioning Strategies for Graph Workloads on Heterogeneous Clusters," in SC, 2015.
[91] A. R. Lebeck and D. A. Wood, "Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors," in ISCA, 1995.
[92] D. U. Lee, K. W. Kim, K. W. Kim, H. Kim et al., "25.2 A 1.2V 8Gb 8-channel 128GB/s High-Bandwidth Memory (HBM) Stacked DRAM with Effective Microbump I/O Test Methods Using 29nm Process and TSV," in ISSCC, 2014.
[93] D. Lee, S. Ghose, G. Pekhimenko, S. Khan et al., "Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost," TACO, 2016.
[94] C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman et al., "The Network Architecture of the Connection Machine CM-5 (Extended Abstract)," in SPAA, 1992.
[95] D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber et al., "The Stanford Dash Multiprocessor," Computer, 1992.
[96] A. Li, G.-J. van den Braak, H. Corporaal, and A. Kumar, "Fine-Grained Synchronizations and Dataflow Programming on GPUs," in ICS, 2015.
[97] C. Liang and M. Prvulovic, "MiSAR: Minimalistic Synchronization Accelerator with Resource Overflow Management," in ISCA, 2015.
[98] J. Liu, H. Zhao, M. A. Ogleari, D. Li et al., "Processing-in-Memory for Energy-Efficient Neural Network Training: A Heterogeneous Approach," in MICRO, 2018.
[99] Z. Liu, I. Calciu, M. Herlihy, and O. Mutlu, "Concurrent Data Structures for Near-Memory Computing," in SPAA, 2017.
[100] V. Luchangco, D. Nussbaum, and N. Shavit, "A Hierarchical CLH Queue Lock," in Euro-Par, 2006.
[101] P. Magnusson, A. Landin, and E. Hagersten, "Queue Locks on Cache Coherent Multiprocessors," in IPDPS, 1994.
[102] J. Mellor-Crummey and M. Scott, "Synchronization without Contention," in ASPLOS, 1991.
[103] J. M. Mellor-Crummey and M. L. Scott, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors," TOCS, 1991.
[104] M. M. Michael and M. L. Scott, "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms," in PODC, 1996.
[105] A. Mirhosseini and J. Torrellas, "Survive: Pointer-Based In-DRAM Incremental Check-Pointing for Low-Cost Data Persistence and Rollback-Recovery," CAL, 2016.
[106] S. A. Mojumder, M. S. Louis, Y. Sun, A. K. Ziabari et al., "Profiling DNN Workloads on a Volta-based DGX-1 System," in IISWC, 2018.
[107] D. Molka, D. Hackenberg, R. Schone, and M. S. Muller, "Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System," in PACT, 2009.
[108] A. Mukkara, N. Beckmann, and D. Sanchez, "PHI: Architectural Support for Synchronization- and Bandwidth-Efficient Commutative Scatter Updates," in MICRO, 2019.
[109] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0," in MICRO, 2007.
[110] O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun, "A Modern Primer on Processing in Memory," Emerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann, 2021.
[111] O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun, "Processing Data Where It Makes Sense: Enabling In-Memory Computation," MICPRO, 2019.
[112] L. Nai, R. Hadidi, J. Sim, H. Kim et al., "GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks," in HPCA, 2017.
[113] R. Nair, S. F. Antao, C. Bertolli, P. Bose et al., "Active Memory Cube: A Processing-in-Memory Architecture for Exascale Systems," IBM JRD, 2015.
[114] NVIDIA, "NVIDIA Tesla V100 GPU Architecture," White Paper, 2017.
[115] NVIDIA, "ONTAP AI–NVIDIA DGX-2 POD with NetApp AFF A800," White Paper, 2019.
[116] J. Oh, M. Prvulovic, and A. Zajic, "TLSync: Support for Multiple Fast Barriers Using On-Chip Transmission Lines," in ISCA, 2011.
[117] A. Pattnaik, X. Tang, A. Jog, O. Kayiran et al., "Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities," in PACT, 2016.
[118] W. Pugh, "Concurrent Maintenance of Skip Lists," Tech. Rep., 1990.
[119] S. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian et al., "NDC: Analyzing the Impact of 3D-stacked Memory+Logic Devices on MapReduce Workloads," in ISPASS, 2014.
[120] A. Ros and S. Kaxiras, "Callback: Efficient Synchronization without Invalidation with a Directory just for Spin-Waiting," in ISCA, 2015.
[121] A. Ros and S. Kaxiras, "Complexity-Effective Multicore Coherence," in PACT, 2012.
[122] L. Rudolph and Z. Segall, Dynamic Decentralized Cache Schemes for MIMD Parallel Processors, 1984.
[123] J. Rutgers, M. Bekooij, and G. Smit, "Portable Memory Consistency for Software Managed Distributed Memory in Many-Core SoC," in IPDPSW, 2013.
[124] J. Sampson, R. Gonzalez, J.-F. Collard, N. Jouppi et al., "Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers," in MICRO, 2006.
[125] D. Sanchez and C. Kozyrakis, "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems," in ISCA, 2013.
[126] M. L. Scott, "Non-Blocking Timeout in Scalable Queue-based Spin Locks," in PODC, 2002.
[127] S. L. Scott, "Synchronization and Communication in the T3E Multiprocessor," in ASPLOS, 1996.
[128] V. Seshadri, D. Lee, T. Mullins, H. Hassan et al., "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," in MICRO, 2017.
[129] Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei et al., "Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin," in MICRO, 2016.
[130] D. Siakavaras, K. Nikas, G. Goumas, and N. Koziris, "RCU-HTM: Combining RCU with HTM to Implement Highly Efficient Concurrent Binary Search Trees," in PACT, 2017. https://github.com/jimsiak/concurrent-maps
[131] G. Singh, D. Diamantopoulos, C. Hagleitner, J. Gomez-Luna et al., "NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling," in FPL, 2020.
[132] G. Singh, J. Gómez-Luna, G. Mariani, G. F. Oliveira et al., "NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning," in DAC, 2019.
[133] G. Singh, L. Chelini, S. Corda, A. J. Awan et al., "Near-Memory Computing: Past, Present, and Future," MICPRO, 2019.
[134] B. J. Smith, "A Pipelined, Shared Resource MIMD Computer," in ICPP, 1978.
[135] D. J. Sorin, M. D. Hill, and D. A. Wood, A Primer on Memory Consistency and Cache Coherence. Morgan & Claypool Publishers, 2011.
[136] F. Strati, C. Giannoula, D. Siakavaras, G. Goumas et al., "An Adaptive Concurrent Priority Queue for NUMA Architectures," in CF, 2019.
[137] M. A. Suleman, O. Mutlu, J. A. Joao, Khubaib et al., "Data Marshaling for Multi-Core Architectures," in ISCA, 2010.
[138] M. A. Suleman, O. Mutlu, M. Qureshi, and Y. Patt, "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," in ASPLOS, 2009.
[139] H. Sung, R. Komuravelli, and S. V. Adve, "DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism," in ASPLOS, 2013.
[140] N. R. Tallent, J. M. Mellor-Crummey, and A. Porterfield, "Analyzing Lock Contention in Multithreaded Applications," in PPoPP, 2010.
[141] X. Tang, J. Zhai, X. Qian, and W. Chen, "pLock: A Fast Lock for Architectures with Explicit Inter-core Message Passing," in ASPLOS, 2019.
[142] S. Torkamani and V. Lohweg, "Survey on Time Series Motif Discovery," Wiley Interdis. Rev.: Data Mining and Knowledge Discovery, 2017.
[143] P.-A. Tsai, C. Chen, and D. Sanchez, "Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies," in MICRO, 2018.
Hybrid Memory Cube Consortium, 2015.
[146] E. Vallejo, R. Beivide, A. Cristal, T. Harris et al., "Architectural Support for Fair Reader-Writer Locking," in MICRO, 2010.
[147] K. Wang, D. Fussell, and C. Lin, "Fast Fine-Grained Global Synchronization on GPUs," in ASPLOS, 2019.
[148] C. Wittenbrink, E. Kilgariff, and A. Prabhu, "Fermi GF100 GPU Architecture," IEEE Micro, 2011.
[149] P. T. Wolkotte, G. J. M. Smit, N. Kavaldjiev, J. E. Becker et al., "Energy Model of Networks-on-Chip and a Bus," in SOCC, 2005.
[150] Xilinx, "Virtex UltraScale+ HBM FPGA," 2019.
[151] M. Yan, X. Hu, S. Li, A. Basak et al., "Alleviating Irregularity in Graph Analytics Acceleration: A Hardware/Software Co-Design Approach," in MICRO, 2019.
[152] C.-C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum et al., "Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets," in ICDM, 2016.
[153] A. Yilmazer and D. Kaeli, "HQL: A Scalable Synchronization Mechanism for GPUs," in IPDPS, 2013.
[154] L. Zhang, Z. Fang, and J. B. Carter, "Highly Efficient Synchronization based on Active Memory Operations," in IPDPS, 2004.
[155] M. Zhang, Y. Zhuo, C. Wang, M. Gao et al., "GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition," in HPCA, 2018.
[156] M. Zhang, H. Chen, L. Cheng, F. C. M. Lau et al., "Scalable Adaptive NUMA-Aware Lock," TPDS, 2017.
[157] W. Zhu, V. C. Sreedhar, Z. Hu, and G. R. Gao, "Synchronization State Buffer: Supporting Efficient Fine-grain Synchronization on Many-Core Architectures," in ISCA, 2007.
[158] Y. Zhuo, C. Wang, M. Zhang, R. Wang et al., "GraphQ: Scalable PIM-Based Graph Processing," in MICRO, 2019.