Avoiding Scalability Collapse by Restricting Concurrency
Dave Dice
Oracle Labs [email protected]
Alex Kogan
Oracle Labs [email protected]
Abstract
Saturated locks often degrade the performance of a multi-threaded application, leading to the so-called scalability collapse problem. This problem arises when a growing number of threads circulating through a saturated lock causes the overall application performance to fade or even drop abruptly. The problem is particularly (but not solely) acute on oversubscribed systems (systems with more threads than available hardware cores).

In this paper, we introduce GCR (generic concurrency restriction), a mechanism that aims to avoid scalability collapse. GCR, designed as a generic, lock-agnostic wrapper, intercepts lock acquisition calls and decides when threads are allowed to proceed with the acquisition of the underlying lock. Furthermore, we present GCR-NUMA, a non-uniform memory access (NUMA)-aware extension of GCR that strives to ensure that the threads allowed to acquire the lock run on the same socket.

An extensive evaluation that includes more than two dozen locks, three machines and three benchmarks shows that GCR brings substantial speedup (in many cases, up to three orders of magnitude) in the case of contention and growing thread counts, while introducing nearly negligible slowdown when the underlying lock is not contended. GCR-NUMA brings even larger performance gains, starting at even lighter lock contention.
Keywords locks, scalability, concurrency restriction, NUMA
Introduction

The performance of applications on multi-core systems is often harmed by saturated locks, where at least one thread is waiting for the lock. Prior work has observed that as the number of threads circulating through a saturated lock grows, the overall application performance often fades or even drops abruptly [2, 7, 15, 16], a behavior called scalability collapse [7]. This happens because threads compete over shared system resources, such as computing cores and the last-level cache (LLC). For instance, an increase in the number of distinct threads circulating through the lock typically leads to increased cache pressure, resulting in cache misses. At the same time, threads waiting for the lock consume valuable resources and might preempt the lock holder from making progress with its execution under the lock, exacerbating contention on the lock even further.
Figure 1. Microbenchmark performance with different locks on a 2-socket machine with 20 hyper-threads per socket.

An example of scalability collapse can be seen in Figure 1, which depicts the performance of a key-value map microbenchmark with three popular locks on a 2-socket x86 machine featuring 40 logical CPUs in total (full details of the microbenchmark and the machine are provided later). The shape and the exact point of the performance decline differ between the locks, yet all of them are unable to sustain peak throughput. With the Test-Test-Set lock, for instance, the performance drops abruptly when more than just a few threads are used, while with the MCS lock [20] the performance is relatively stable up to the capacity of the machine and collapses once the system gets oversubscribed (i.e., has more threads available than the number of cores). Note that one of the locks, MCS-TP, was designed specifically to handle oversubscription [13], yet its performance falls short of the peak.

It might be tempting to argue that one should never create a workload where the underlying machine is oversubscribed, pre-tuning the maximum number of threads and using a lock, such as MCS, to keep the performance stable. We note that in modern component-based software, the total number of threads is often out of the hands of the developer. A good example would be applications that use thread pools, or even have multiple mutually unaware thread pools. Furthermore, in multi-tenant and/or cloud-based deployments, where the resources of a physical machine (including cores and caches) are often shared between applications running inside virtual machines or containers, applications can run concurrently with one another without even being aware that they share the same machine. Thus, limiting the maximum number of threads by the number of cores does not help much. Finally, even when a saturated lock delivers seemingly stable performance, threads spinning and waiting for the lock consume energy and take resources (such as CPU time) from other, unrelated tasks. (We discuss other waiting policies and their limitations later in the paper.)

In this paper we introduce generic concurrency restriction (GCR) to deal with scalability collapse. GCR operates as a wrapper around any existing lock (including POSIX pthread mutexes, and specialized locks provided by an application). GCR intercepts calls for lock acquisition and decides which threads will proceed with the acquisition of the underlying lock (those threads are called active) and which threads will be blocked (those threads are called passive). Reducing the number of threads circulating through the lock improves cache performance, while blocking passive threads reduces competition over CPU time, leading to better system performance and energy efficiency. To avoid starvation and achieve long-term fairness, active and passive threads are shuffled periodically. We note that the admission policy remains fully work conserving with GCR. That is, when a lock holder exits, one of the waiting threads will be able to acquire the lock immediately and enter its critical section.

In this paper we also show how GCR can be extended to the non-uniform memory access (NUMA) setting of multi-socket machines. In those settings, accessing data residing in a local cache is far cheaper than accessing data in a cache located on a remote socket.
Previous research on locks tackled this issue by trying to keep the lock ownership on the same socket [3, 8, 9, 22], thus increasing the chance that the data accessed by a thread holding the lock (and the lock data as well) would be cached locally to that thread. The NUMA extension of GCR, called simply GCR-NUMA, takes advantage of that same idea by trying to keep the set of active threads composed of threads running on the same socket. As a by-product of this construction, GCR-NUMA can convert any lock into a NUMA-aware one.

We have implemented GCR (and GCR-NUMA) in the context of the LiTL library [12, 19], which provides the implementation of over two dozen locks. We have evaluated GCR with all those locks using a microbenchmark as well as two well-known database systems (namely, Kyoto Cabinet [11] and LevelDB [17]), on three different systems (two x86 machines and one SPARC). The results show that GCR avoids scalability collapse, which translates to substantial speedup (up to three orders of magnitude) in cases of high lock contention for virtually every evaluated lock, workload and machine. Furthermore, we show empirically that GCR does not harm the fairness of the underlying locks (in fact, in many cases GCR improves fairness). GCR-NUMA brings even larger performance gains starting at even lighter lock contention. We also prove that GCR preserves the progress guarantees of the underlying lock, i.e., it does not introduce starvation if the underlying lock is starvation-free.
Related Work

Prior work has explored adapting the number of active threads based on lock contention [7, 16]. However, that work customized certain types of locks, exploiting their specific features, such as the fact that waiting threads are organized in a queue [7], or that lock acquisition can be aborted [16]. Those requirements limit the ability to adapt those techniques to other locks and use them in practice. For instance, very few locks allow waiting threads to abandon an acquisition attempt, and many spin locks, such as a simple Test-Test-Set lock, do not maintain a queue of waiting threads. Furthermore, the lock implementation is often opaque to the application, e.g., when POSIX pthread mutexes are used. At the same time, prior research has shown that every lock has its own "15 minutes of fame", i.e., there is no lock that always outperforms others, and the choice of the optimal lock depends on the given application, platform and workload [6, 12]. Thus, in order to be practical, a mechanism to control the number of active threads has to be lock-agnostic, like the one provided by GCR.

Other work in different but related contexts has observed that controlling the number of threads used by an application is an effective approach for meeting certain performance goals. For instance, Raman et al. [23] demonstrate this with a run-time system that monitors application execution to dynamically adapt the number of worker threads executing parallel loop nests. In another example, Pusukuri et al. [21] propose a system that runs an application multiple times for short durations while varying the number of threads, and determines the optimal number of threads to create based on the observed performance. Chadha et al. [4] identified cache-level thrashing as a scalability impediment and proposed system-wide concurrency throttling. Heirman et al. [14] suggested intentional undersubscription of threads as a response to competition for shared caches. Hardware and software transactional memory systems use contention managers to throttle concurrency in order to optimize throughput [24]. The issue is particularly acute in the context of transactional memory, as failed optimistic transactions are wasteful of resources.

Trading off between throughput and short-term fairness has been extensively explored in the context of NUMA-aware locks [3, 8, 9, 22]. Those locks do not feature a concurrency restriction mechanism, and in particular, do not avoid contention at the intra-socket level and the issues resulting from it.
Background
Contending threads must wait for the lock when it is not available. There are several common waiting policies. The simplest one is unbounded spinning, also known as busy-waiting or polling. There, the waiting threads spin on a global or local memory location and wait until the value in that location changes. Spinning consumes resources and contributes to preemption when the system is oversubscribed, i.e., has more ready threads than the number of available logical CPUs. Yet, absent preemption, it is simple and provides fast lock handover times, and for those reasons it is used by many popular locks, e.g., Test-Test-Set.

An alternative waiting policy is parking, where a waiting thread voluntarily releases its CPU and passively waits (by blocking) for another thread to unpark it when the lock becomes available. Parking is attractive when the system is oversubscribed, as it releases CPU resources to threads ready to run, including the lock holder. However, the cost of the voluntary context switching imposed by parking is high, which translates to longer lock handover times when the next owner of the lock has to be unparked.

To mitigate the overhead of parking and unparking on the one hand, and limit the shortcomings of unlimited spinning on the other hand, lock designers proposed a hybrid spin-then-park policy. There, threads spin for a brief period, and park if the lock is still not available by the end of that time. While tuning the optimal time for spinning is challenging [15, 18], it is typically set to the length of a context-switch round trip [7].
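To make the spin-then-park trade-off concrete, the following is a minimal sketch of such a waiting loop in C. It is not taken from any lock discussed in this paper: the park primitive (e.g., futex-based on Linux) and the spin budget are illustrative assumptions.

    #include <stdatomic.h>

    /* Hypothetical blocking primitive, e.g., built on futexes on Linux:
       blocks the caller while *addr still equals expectedValue. */
    extern void park(atomic_int *addr, int expectedValue);

    #define SPIN_BUDGET 1000   /* roughly a context-switch round trip */

    /* Wait until *flag becomes non-zero: spin briefly, then park. */
    static void spin_then_park(atomic_int *flag) {
        for (int i = 0; i < SPIN_BUDGET; i++)
            if (atomic_load(flag)) return;   /* got the signal while spinning */
        while (!atomic_load(flag))
            park(flag, 0);                   /* block until unparked by another thread */
    }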
Generic Concurrency Restriction

GCR wraps a lock API; that is, calls to, e.g., Lock/Unlock methods go through the corresponding methods of GCR. In our implementation, we interpose on the standard POSIX pthread_mutex_lock and pthread_mutex_unlock methods. Thus, using the standard LD_PRELOAD mechanism on Linux and Unix, GCR can be made immediately available to any application that uses the standard POSIX API, even without recompiling the application or its locks.

In the following description, we distinguish between active threads, that is, threads allowed by GCR to invoke the API of the underlying lock, and passive threads, which are not allowed to do so. Note that this distinction is unrelated to the running state of the corresponding threads. That is, active threads may actually be blocked (parked) if the underlying lock decides so, while passive threads may be spinning, waiting for their turn to join the set of active threads. In addition, given that GCR by itself does not provide lock semantics (even though it implements the lock API), we will refer to the underlying lock simply as the lock.

GCR keeps track of the number of active threads. When a thread invokes the Lock method wrapped by GCR, GCR checks whether the number of active threads is larger than a preconfigured threshold. If not, the thread proceeds by calling the lock's Lock method. This constitutes the fast path of the lock acquisition. Otherwise, GCR detects that the lock is saturated, and places the (passive) thread into a (lock-specific) queue. This queue is based on a linked list; each node in the list is associated with a different thread. Every thread in the queue but the first can choose whether to spin on a local variable in its respective node, yield the CPU and park, or any combination thereof. (The thread at the head of the queue has to spin, as it monitors the number of active threads.) In practice, we choose the spin-then-park policy for all passive threads in the queue but the first, to limit the use of system resources by those threads. Once the thread at the head of the queue detects that there are no active threads, it leaves the queue, signals the next thread (if one exists) that the head of the queue has changed (unparking that thread if necessary), and proceeds by calling the lock's Lock method.

When a thread invokes GCR's Unlock method, it checks whether it is time to signal the (passive) thread at the head of the queue to join the set of active threads. This is done to achieve long-term fairness, preventing starvation of passive threads. To this end, GCR keeps a simple counter of the number of lock acquisitions. (Other alternatives, such as timer-based approaches, are possible.) Following that, GCR calls the lock's Unlock method.
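As a rough illustration of the interposition mechanism (a sketch under assumptions, not the authors' actual code), a preloaded library can resolve the original pthread functions with dlsym(RTLD_NEXT, ...) and export wrappers of the same name; the names real_lock/real_unlock are ours:

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <pthread.h>

    /* Pointers to the real (underlying) implementations. */
    static int (*real_lock)(pthread_mutex_t *);
    static int (*real_unlock)(pthread_mutex_t *);

    __attribute__((constructor))
    static void init_interposition(void) {
        /* RTLD_NEXT resolves the next definition of the symbol in the
           library search order, i.e., the original pthread implementation. */
        real_lock   = (int (*)(pthread_mutex_t *)) dlsym(RTLD_NEXT, "pthread_mutex_lock");
        real_unlock = (int (*)(pthread_mutex_t *)) dlsym(RTLD_NEXT, "pthread_mutex_unlock");
    }

    /* With LD_PRELOAD, the application's calls resolve to these wrappers. */
    int pthread_mutex_lock(pthread_mutex_t *m) {
        /* GCR admission logic would run here, before the real acquisition. */
        return real_lock(m);
    }

    int pthread_mutex_unlock(pthread_mutex_t *m) {
        /* GCR bookkeeping/fairness logic would run here. */
        return real_unlock(m);
    }

Building this into a shared object and setting LD_PRELOAD routes every pthread_mutex_lock/pthread_mutex_unlock call through the wrappers without recompiling the application.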
The auxiliary data structures used by GCR are given in Figure 2. The Node structure represents a node in the queue of passive threads. In addition to the successor and predecessor nodes in the list, the Node structure contains the event flag. This flag is used to signal a thread when its node moves to the head of the queue.

The LockType structure contains the internal lock metadata (passed to the Lock and Unlock functions of that lock) and a number of additional fields:
• top and tail are pointers to the first and the last nodes in the queue of passive threads, respectively.
• topApproved is a flag used to signal the passive thread at the top of the queue that it can join the set of active threads.
• numActive is the counter of the number of active threads.
• numAcqs is the counter of the number of lock acquisitions. It is used to move threads from the passive set to the active set, as explained below.

(We use the word signal throughout the paper in its abstract form, unrelated to the OS inter-process communication mechanism of signals. For clarity of exposition, we assume sequential consistency; our actual implementation uses memory fences as well as volatile keywords and padding, to avoid false sharing, where necessary.)

    typedef struct _Node {
        struct _Node *next;
        struct _Node *prev;
        int event;
    } Node;

    typedef struct {
        lock_t internalMutex;
        Node *top;
        Node *tail;
        int topApproved;
        int numActive;
        int numAcqs;
    } LockType;

    static int (*nextLock)(lock_t *);
    static int (*nextUnlock)(lock_t *);

Figure 2. Auxiliary structures.

     1  int Lock(LockType *m) {
     2      /* if there is at most one active thread */
     3      if (m->numActive <= 1) {
     4          /* go to the fast path */
     5          FAA(&m->numActive, 1);
     6          goto FastPath;
     7      }
     8  SlowPath:
     9      /* enter the MCS-like queue of passive threads */
    10      Node *myNode = pushSelfToQueue(m);
    11      /* wait (by spin-then-park) for my node to get to the top */
    12      if (!myNode->event) Wait(myNode->event);
    13      /* wait (by spinning) for a signal to join the set of active threads */
    14      while (!m->topApproved) {
    15          Pause();
    16          /* stop waiting if no active threads left */
    17          if (m->numActive == 0) break;
    18      }
    19      if (m->topApproved != 0) m->topApproved = 0;
    20      FAA(&m->numActive, 1);
    21      popSelfFromQueue(m, myNode);
    22  FastPath:
    23      return nextLock(&m->internalMutex);
    24  }
Figure 3. Lock procedure.

In addition to the LockType structure, GCR uses the nextLock (nextUnlock) function pointer, which is initialized to the Lock (Unlock, respectively) function of the underlying lock. The initialization code is straightforward (on Linux it can use the dlsym system call), and thus is not shown.

The implementation of GCR's Lock function is given in Figure 3. When executing this function, a thread first checks the current number of active threads by reading the numActive counter (Line 3). If this number is at most one, it atomically (using a fetch-and-add (FAA) instruction, if available) increments the counter (Line 5) and continues to the fast path (Line 22). Note that the comparison to 1 in Line 3 effectively controls when the concurrency restriction is enabled. That is, if we wanted to enable concurrency restriction only after detecting that, say, X threads are waiting for the lock, we could use X + 1 instead. Note also that the check in Line 3 and the increment in Line 5 are not performed atomically, so several threads may pass the check and update numActive concurrently. However, the lack of atomicity may only impact performance (as the underlying lock will become more contended), and not correctness. Besides, this should be rare when the system is in a steady state. Finally, note that the FAA operation in Line 5 is performed by active threads only (rather than all threads), limiting the overhead of this atomic operation on a shared memory location.

In the fast path, the thread simply invokes the Lock function of the underlying lock (Line 23).

The slow path is given in Lines 8–21. There, the thread joins the queue of passive threads (Line 10); the code of the pushSelfToQueue function is given in Figure 5 and described below. Next, the thread waits until it reaches the top of the queue (Line 12). This waiting is implemented with the spin-then-park policy in the Wait function, and we assume that when Wait returns, the value of event is non-zero. (The technical details of the parking/unparking mechanism are irrelevant for the presentation; briefly, we used futexes on Linux and a mutex with a condition variable on Solaris.) We note that we could use pure spinning (on the event field) in the Wait function as well.

Once the thread reaches the top of the queue, it starts monitoring the signal from active threads to join the active set. It does so by spinning on the topApproved flag (Line 14). In addition, this thread monitors the number of active threads by reading the numActive counter (Line 17). Note that unlike the topApproved flag, this counter changes on every lock acquisition and release. Thus, reading it on every iteration of the spinning loop would create unnecessary coherence traffic and slow down active threads when they attempt to modify this counter. In the optimizations section below, we describe a simple optimization that allows reading this counter less frequently while still monitoring the active set effectively.

Once the passive thread at the top of the queue breaks out of the spinning loop, it resets the topApproved flag if needed (Line 19) and atomically increments the numActive counter (Line 20). Then it removes itself from the queue of passive threads (Line 21) and continues with the code of the fast path. The pseudo-code of the popSelfFromQueue function is given in Figure 5 and described below.

    25  int Unlock(LockType *m) {
    26      /* check if it is time to bring someone from the passive to the active set */
    27      if (((m->numAcqs++ % THRESHOLD) == 0) && m->top != NULL) {
    28          /* signal the selected thread that it can go */
    29          m->topApproved = 1;
    30      }
    31      FAA(&m->numActive, -1);
    32      /* call underlying lock */
    33      return nextUnlock(&m->internalMutex);
    34  }

Figure 4. Unlock procedure.
The Unlock function is straightforward (see Figure 4). The thread increments the numAcqs counter and checks whether it is time to bring a passive thread into the set of active threads (Line 27). Notice that in our implementation we decide to do so based solely on the number of lock acquisitions, while other, more sophisticated approaches are possible. For our evaluation, THRESHOLD is set to 0x4000. Afterwards, the thread atomically decrements the numActive counter (Line 31). Finally, it calls the Unlock function of the underlying lock (Line 33).

The procedures for inserting and removing a thread to/from the queue of passive threads are fairly simple, yet not trivial. Thus, we opted to show them in Figure 5. (Readers familiar with the MCS lock [20] will recognize close similarity to the procedures used by that lock to manage waiting threads.) In order to insert itself into the queue, a thread allocates and initializes a new node (Lines 36–38). (In our implementation, we use a preallocated array of Node objects, one per thread per lock, as part of the infrastructure provided by the LiTL library [19]. It is also possible to allocate a Node object statically in Lock() and pass it to the push and pop functions, avoiding the dynamic allocation of Node objects altogether.) Then it atomically swaps the tail of the queue with the newly created node (Line 39) using an atomic swap instruction. If the result of the swap is non-NULL, then the thread's node is not the only node in the queue; thus, the thread updates the next pointer of its predecessor (Line 41). Otherwise, the thread sets the top pointer to its newly created node (Line 43) and sets the event flag (Line 44). The latter is done to avoid the call to Wait in Line 12.

The code for removing the thread from the queue is slightly more complicated. Specifically, the thread first checks whether its node is the last in the queue (Line 50). If so, it attempts to update the tail pointer to NULL using an atomic compare-and-swap (CAS) instruction (Line 52). If the CAS succeeds, the thread attempts to set the top pointer to NULL as well (Line 53). Note that we need a CAS (rather than a simple store) for that, as the top pointer may have been already updated concurrently in Line 43. This CAS, however, should not be retried if it fails, since a failure means that the queue is not empty anymore, and thus we should not try to set top to NULL again. The removal operation is completed by deallocating (or releasing for future reuse) the thread's node (Line 54).

If the CAS in Line 52 is unsuccessful, the thread realizes that its node is no longer the last in the queue, that is, the queue has been concurrently updated in Line 39. As a result, it waits (in the for-loop in Lines 57–61) until the next pointer of its node is updated in Line 41 by a new successor. Finally, after finding that its node is not the last in the queue (whether immediately in Line 50 or after the failed CAS in Line 52), the thread updates the top pointer to its successor in the queue (Line 63) and signals the successor (Line 65) to stop waiting in the Wait function (cf. Line 12).

    35  Node *pushSelfToQueue(LockType *m) {
    36      Node *n = (Node *)malloc(sizeof(Node));
    37      n->next = NULL;
    38      n->event = 0;
    39      Node *prv = SWAP(&m->tail, n);
    40      if (prv != NULL) {
    41          prv->next = n;
    42      } else {
    43          m->top = n;
    44          n->event = 1;
    45      }
    46      return n;
    47  }
    48  void popSelfFromQueue(LockType *m, Node *n) {
    49      Node *succ = n->next;
    50      if (succ == NULL) {
    51          /* my node is the last in the queue */
    52          if (CAS(&m->tail, n, NULL)) {
    53              CAS(&m->top, n, NULL);
    54              free(n);
    55              return;
    56          }
    57          for (;;) {
    58              succ = n->next;
    59              if (succ != NULL) break;
    60              Pause();
    61          }
    62      }
    63      m->top = succ;
    64      /* unpark successor if it is parked in Wait */
    65      succ->event = 1;
    66      free(n);
    67  }

Figure 5. Queue management procedures.
Correctness

A lock is starvation-free if every attempt to acquire it eventually succeeds. In this section, we argue that the GCR algorithm does not introduce starvation as long as the underlying lock is starvation-free, the OS scheduler does not starve any thread, and the underlying architecture supports starvation-free atomic increment and swap operations. On a high level, our argument is built on top of two observations, namely that once a thread enters the queue of waiting threads, it eventually reaches the top of the queue, and that a thread at the top of the queue eventually calls the Lock function of the underlying lock.
Lemma 1. The tail pointer always either holds a NULL value or points to a node whose next pointer is NULL.

Proof Sketch. The tail pointer initially holds NULL. From inspecting the code, the tail pointer may change only in Line 39 or Line 52. Consider the change in Line 39. The value of the tail pointer is set to a node whose next field was initialized to NULL (cf. Line 37). Thus, the lemma holds when the change in Line 39 takes place. The next pointer of a node gets modified only in Line 41. The node whose next pointer gets modified is the one pointed to by tail before the change of tail in Line 39 took place. As a result, when Line 41 is executed, the tail pointer no longer points to the node whose next pointer gets updated, and the lemma holds. Next, consider the change to tail in Line 52, done with a CAS instruction. If the CAS is successful, the tail pointer gets a NULL value. Otherwise, the tail pointer is not updated (and thus remains pointing to a node whose next pointer is NULL). Thus, the lemma holds in both cases. □

We define the state of the queue at time T to be all nodes that are reachable from top and tail at time T. We say that a passive thread enters the queue when it finishes executing Line 39, and leaves the queue when it finishes executing the CAS in Line 53 or the assignment in Line 63.

Lemma 2. The event field of any node in the queue, except for, perhaps, the node pointed to by top, is 0.
Proof Sketch. First, we observe that only a thread whose node has a non-zero event value can call popSelfFromQueue (cf. Line 12). Next, we show that at most one node in the queue has a non-zero event value. The event field is initialized to 0 (Line 38), and is set to 1 either in Line 44 or in Line 65. In the former case, this happens when the corresponding thread finds the queue empty (Line 39). Thus, when it sets the event field of its node to 1, the claim holds. In the latter case, a thread t sets the event field in the node of its successor in the queue. However, it does so after removing its node from the queue (by updating the top pointer to its successor in Line 63). Based on the observation above, t's node contains a non-zero event field. By removing its node from the queue and setting the event field in the successor node, t maintains the claim.

Finally, we argue that the node with a non-zero event value is the one pointed to by top. Consider, again, the two cases where the event field gets set. In the first case, it is set by a thread that just entered the queue and found the queue empty (and thus updated top to point to its node in Line 43). In the second case, it is set by a thread t that just updated top to point to the node of its successor in the queue. At this point (i.e., after executing Line 63 and before executing Line 65), no node in the queue contains a non-zero event value. Thus, based on the observation above, no thread can call popSelfFromQueue. At the same time, any new thread entering the queue will find at least t's successor there and thus will not change the top pointer. Hence, when t executes Line 65, it sets the event field of the node pointed to by top. □

We refer to a thread whose node is pointed to by top as a thread at the top of the queue.

Lemma 3. Only a thread at the top of the queue can call popSelfFromQueue.

Proof Sketch. After entering the queue, a thread may leave it (by calling popSelfFromQueue) only after it finds the event field in its node holding a non-zero value (cf. Line 12). According to Lemma 2, this can only be the thread at the top of the queue. □

We order the nodes in the queue according to their rank, which is the number of links (next pointers) to be traversed from top to reach the node. The rank of the node pointed to by top is 0, the rank of the next node is 1, and so on. The rank of a (passive) thread is simply the rank of its corresponding node in the queue.
Lemma 4. The queue preserves the FIFO order; that is, a thread t that enters the queue will leave the queue after all threads that entered the queue before t, and before all threads that enter the queue after t.

Proof Sketch. By inspecting the code, the only place where the next pointer of a node may change is Line 41 (apart from the initialization in Line 37). If a thread executes Line 41, then according to Lemma 1, it changes the next pointer of the node of its predecessor in the queue from NULL to the thread's node. Thus, once a next pointer is set to a non-NULL value, it never changes again (until the node is deleted, which happens only after the respective thread leaves the queue). Hence, threads that enter the queue after t will have a higher rank than t. Lemma 3 implies that only a thread with rank 0 can leave the queue. Thus, any thread that joins the queue after t will leave the queue after t, as t's rank will reach 0 first. Also, any thread that enters the queue before t will have a lower rank than t. Therefore, t cannot leave the queue before all those threads do. □

Lemma 5. A thread at the top of the queue eventually calls the Lock function of the underlying lock.

Proof Sketch. By inspecting the code, when a thread t reaches the top of the queue, its event field changes to 1 (Line 44 and Line 65). Thus, it reaches the while loop in Line 14. It stays in this loop until the topApproved field changes or all active threads have left. If the latter does not happen, then, assuming that the lock is starvation-free and no thread holds it indefinitely long, active threads will circulate through the lock, incrementing the numAcqs counter. Once this counter reaches the threshold, the topApproved field will be set, releasing t from the while loop. Following that, the only other place where t may spin before calling the Lock function of the underlying lock is the for loop in Line 57. This occurs in the rare case where t does not find a successor in Line 49, but the successor appears and updates tail (in Line 39) right before t does (in Line 52). However, assuming a scheduler that does not starve any thread, the successor will eventually update the next pointer in t's node, allowing t to break out of the for loop. □
Lemma 6. A thread that enters the queue eventually becomes a thread at the top of the queue.

Proof Sketch. Consider a thread t entering the queue by executing Line 39. If it finds no nodes in the queue (i.e., tail holds a NULL value), t sets top to point to its node (Line 43) and the lemma trivially holds. Otherwise, some other thread t′ is at the top of the queue. By Lemma 5, that thread will eventually call the Lock function of the underlying lock. However, before doing so, it will remove itself from the queue (Line 21). According to Lemma 4, the queue preserves the FIFO order. Thus, the next thread after t′ will become the thread at the top of the queue, and will eventually call the Lock function of the underlying lock. By applying the same argument recursively, we can deduce that t will reach the top of the queue (after all threads with a lower rank do so). □
Theorem 7. When GCR is applied to a starvation-free lock L, the resulting lock is starvation-free.

Proof Sketch. Given that the underlying lock L is starvation-free, we need to show that every thread calling GCR's Lock function eventually calls L's Lock function. In case a thread finds at most one active thread in Line 3, it proceeds by calling L's Lock function (after executing an atomic FAA instruction, which we assume is starvation-free). Otherwise, it proceeds on the slow path by entering the queue (Line 10). Assuming the atomic swap operation is starvation-free, the thread eventually executes Line 39 and enters the queue. By Lemma 6, it will eventually reach the top of the queue, and by Lemma 5, it will eventually call L's Lock function. □

Optimizations

Reducing overhead at low contention: When contention is low, GCR may introduce overhead by repeatedly sending threads to the slow path, where they would immediately discover that no more active threads exist and that they can join the set of active threads. This behavior can be mitigated by tuning the thresholds for joining the passive and active sets (i.e., the constants in Line 3 and Line 17, respectively). In our experiments, we found that setting the threshold for joining the passive set to 4 (Line 3) and the threshold for joining the active set to half that number (Line 17) was a reasonable compromise between reducing the overhead of GCR and letting threads spin idly at low contention (if the underlying lock allows that).
Reducing overhead on the fast path:
Even with the above optimizations, employing atomic instructions on the fast path to maintain the counter of active threads degrades performance when the underlying lock is not contended, or contended only lightly. To cope with that, one would like to be able to disable GCR (including counting the number of active threads) dynamically when the lock is not contended, and turn it on only when contention is back. However, the counter of active threads was introduced exactly for the purpose of identifying contention on the underlying lock. We solve this "chicken and egg" problem as follows: we introduce an auxiliary array, shared by all threads acquiring any lock. Each thread writes down in its dedicated slot the address of the underlying lock it is about to acquire, and clears the slot after releasing that lock. After releasing a lock, a thread periodically scans the array and counts the number of threads trying to acquire the lock it just released. If it finds that number at or above a certain threshold (e.g., 4), it enables GCR by setting a flag in the LockType structure. Scanning the array after every lock acquisition would be prohibitively expensive. Instead, each thread counts its number of lock acquisitions (in a thread-local variable) and scans the array after an exponentially increasing number of lock acquisitions. Disabling GCR is easier: when it is time to signal a passive thread (cf. Line 27) and the queue of passive threads is empty while the number of active threads is small (e.g., 2 or less), GCR is disabled until the next time contention is detected.
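The following sketch shows one possible shape of this detection mechanism. The array size, the slot assignment, the function names, and the gcrEnabled flag are illustrative assumptions, not the authors' exact code:

    #define MAX_THREADS 1024
    #define ENABLE_THRESHOLD 4

    /* One slot per thread: the lock this thread is currently acquiring (or NULL). */
    static void * volatile acquiring[MAX_THREADS];
    static __thread int mySlot;                        /* assigned once per thread */
    static __thread long numAcqsLocal = 0, nextScanAt = 1;

    void onLockEnter(void *lock) { acquiring[mySlot] = lock; }

    void onLockExit(void *lock, LockType *m) {
        acquiring[mySlot] = NULL;
        /* scan only after an exponentially increasing number of acquisitions */
        if (++numAcqsLocal < nextScanAt) return;
        nextScanAt *= 2;
        int waiters = 0;
        for (int i = 0; i < MAX_THREADS; i++)
            if (acquiring[i] == lock) waiters++;
        if (waiters >= ENABLE_THRESHOLD)
            m->gcrEnabled = 1;   /* hypothetical flag added to LockType */
    }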
Splitting the counter of active threads:
The counter of active threads, numActive, is modified twice by each thread for every critical section. The contention over this counter can be slightly reduced by employing the known technique of splitting the counter into two, ingress and egress. The former is incremented (using an atomic FAA instruction) on entry to the critical section (i.e., in Lock), while the latter is incremented on exit (in Unlock) with a simple store, as that increment is done under the lock. The difference between those two counters gives an estimate of the number of currently active threads. (It is an estimate because the two counters are not read atomically.)
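A minimal sketch of the split counter, assuming two new fields (ingress and egress) in LockType; the function names are ours:

    /* Entry to the critical section (in Lock): contended, so atomic. */
    void enterActive(LockType *m) { FAA(&m->ingress, 1); }

    /* Exit (in Unlock): executed while still holding the lock,
       so a plain store suffices. */
    void exitActive(LockType *m) { m->egress = m->egress + 1; }

    /* Estimate of the number of active threads; approximate, since the
       two counters are not read atomically. */
    int estimateActive(LockType *m) { return m->ingress - m->egress; }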
Spinning loop optimization:
As mentioned above, repeated reading of the numActive counter (or the ingress and egress counters, as described above) inside the spinning loop of the thread at the top of the queue (cf. Lines 14–18 in Figure 3) creates contention on the cache lines storing those variables. We note that monitoring the size of the active set is required to avoid a livelock that may be caused when the last active lock holder leaves without setting the topApproved flag. This is unlikely to happen, however, when the lock is heavily contended and acquired repeatedly by the same threads.

Based on this observation, we employ a simple deterministic back-off scheme that increases the interval between subsequent reads of the numActive counter by the spinning thread as long as it finds the active set non-empty (or, more precisely, as long as it finds the size of the active set larger than the threshold for joining that set, when the optimization for reducing overhead at low contention described above is used). To support this scheme, we add a nextCheckActive field to the LockType structure, initially set to 1, and also use a local counter initialized to 0 before the while loop in Line 14 of Figure 3. In every iteration of the loop, we increment the local counter and check whether its value modulo nextCheckActive equals 0. If so, we check the numActive counter. As before, we break out of the loop if this counter has a zero value, after resetting nextCheckActive to 1 so that the next thread at the top of the queue will start monitoring the active set closely. Otherwise, if numActive is non-zero, we double the value of nextCheckActive, up to a preset bound (1M in our case).

It should be noted that, like many synchronization mechanisms, GCR contains several knobs that can be used to tune its performance. In the above, we specify all the default values that we have used for our experiments. While evaluating the sensitivity of GCR to each configuration parameter is left for future work, we note that our extensive experiments across multiple platforms and applications provide empirical evidence that the default parameter values represent a reasonable choice.
GCR-NUMA

As GCR controls which threads join the active set, it may well do so in a NUMA-aware way. In practice, this means that it should strive to maintain an active set composed of threads running on the same socket (or, more precisely, on the same NUMA node). Note that this does not place any additional restrictions on the underlying lock, which may or may not be NUMA-aware itself. Naturally, if the underlying lock is NUMA-oblivious, the benefit of such an optimization will be higher.

Introducing NUMA-awareness into GCR requires relatively few changes. On a high level, instead of keeping just one queue of passive threads per lock, we keep a number of queues, one per socket. Thus, a passive thread joins the queue corresponding to the socket it is running on. In addition, we introduce the notion of a preferred socket, which is a socket that gets preference in decisions about which threads should join the active set. In our case, we set the preferred socket solely based on the number of lock acquisitions (i.e., the preferred socket is changed in a round-robin fashion every certain number of lock acquisitions), but other, refined (e.g., time-based) schemes are possible.

We say that a (passive) thread is eligible (to check whether it can join the active set) if it is running on the preferred socket or if the queue (of passive threads) of the preferred socket is empty. When a thread calls the Lock function, we check whether it is eligible and let it proceed with examining the size of the active set (i.e., read the numActive counter) only if it is. Otherwise, it immediately goes onto the slow path, joining the queue corresponding to its socket. This means that once the designation of the preferred socket changes (when threads running on that socket acquire and release the lock "enough" times), active threads from the now non-preferred socket become passive when they attempt to acquire the lock again.

Having only eligible threads monitor the size of the active set has two desired consequences. First, only the passive thread at the top of the queue corresponding to the preferred socket will be the next thread (out of all passive threads) to join the set of active threads. This keeps the set of active threads composed of threads running on the same (preferred) socket and ensures long-term fairness. Second, non-eligible threads (running on other, non-preferred sockets) do not access the counter of active threads (but rather wait until they become eligible), reducing contention on that counter.
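A hedged sketch of the eligibility test described above; the per-socket queue tops, the preferredSocket field, and the currentSocket() helper are assumptions for illustration:

    /* Returns non-zero iff the calling thread may examine the active set. */
    static int eligible(LockType *m) {
        int mySocket = currentSocket();   /* hypothetical helper, e.g., derived from sched_getcpu() */
        return mySocket == m->preferredSocket ||         /* on the preferred socket, or */
               m->queueTop[m->preferredSocket] == NULL;  /* preferred socket's queue is empty */
    }

Non-eligible threads skip the numActive check entirely and enqueue themselves on their own socket's passive queue.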
Evaluation

We implemented GCR as a stand-alone library conforming to the pthread mutex lock API defined by the POSIX standard. We integrated GCR into LiTL [19], an open-source project providing an implementation of dozens of locks, including well-known established locks, such as MCS [20] and CLH [5], as well as more recent ones, such as the NUMA-aware Cohort [9] and HMCS [3] locks. The LiTL library also includes the implementation of a related Malthusian lock [7], which introduces a concurrency restriction mechanism into the MCS lock. Furthermore, the LiTL library allows specifying various waiting policies (e.g., spin or spin-then-park) for locks that support them (such as MCS, CLH or Cohort locks). Overall, we experimented with 24 different lock+waiting-policy combinations in LiTL (for brevity, we will refer to each lock+waiting-policy combination simply as a lock). For our work, we enhanced the LiTL library to support the Solaris OS (mainly, by supplying functionality to park and unpark threads) as well as the SPARC architecture (mainly, by abstracting synchronization primitives). We note that for polite spinning, we use the MWAIT instruction made available on the SPARC M7 processors [7].

We ran experiments on three different platforms:
• an Oracle X6-2 server with 2 Intel Xeon E5-2630 v4 processors featuring 10 hyper-threaded cores each (40 logical CPUs in total) and running Fedora 25;
• an Oracle T7-2 server with 2 SPARC M7 processors featuring 32 cores each, each supporting 8 logical CPUs (512 logical CPUs in total) and running Solaris 11.3;
• an Oracle X5-4 server with 4 Intel Xeon E7-8895 v3 processors featuring 18 hyper-threaded cores each (144 logical CPUs in total) and running Oracle Linux 7.3.

As most high-level conclusions hold across all evaluated platforms, we focus our presentation on X6-2; unless said otherwise, the reported numbers are from this platform. In all experiments, we vary the number of threads up to twice the capacity of each machine. We do not pin threads to cores, relying on the OS to make its choices. In all experiments, we employ a scalable memory allocator [1]. We disable turbo mode on the Intel-based platforms to avoid the effect of that mode, which varies with the number of threads, on the results. Each reported experiment has been run 3 times in exactly the same configuration; the presented results are the average of the results reported by those 3 runs.
Microbenchmark

The microbenchmark uses a sequential AVL tree implementation protected by a single lock. The tree supports the API of a key-value map, including operations for inserting, removing and looking up keys (and associated values) stored in the tree. After an initial warmup, not included in the measurement interval, all threads are synchronized to start running at the same time, and apply tree operations chosen uniformly at random from the given distribution, with keys chosen uniformly at random from the given range. At the end of this time period (lasting 10 seconds), the total number of operations is calculated, and the throughput is reported. The reported results are for a key range of 4096, with threads performing 80% lookup operations, while the rest is split evenly between inserts and removes. The tree is pre-initialized to contain roughly half of the key range. Finally, the microbenchmark allows controlling the amount of external work, i.e., the duration of the non-critical section (simulated by a pseudo-random number calculation loop). In this experiment, we use a non-critical section duration that allows scalability up to a small number of threads.
Detailed performance of GCR on top of several locks:
The absolute performance of the AVL tree benchmark (in terms of total throughput) with several locks is shown in Figure 6. Figure 6 (a) and (b) show how the popular MCS lock [20] performs without GCR, with GCR and with GCR-NUMA, and how those locks compare to the recent Malthusian lock [7], which implements a concurrency restriction mechanism directly in the MCS lock. Locks in Figure 6 (a) employ the spinning waiting policy, while those in Figure 6 (b) employ the spin-then-park policy. In addition, Figure 6 (c) and (d) compare the performance achieved with the simple Test-Test-Set (TTAS) lock and the POSIX pthread mutex lock, respectively, when used without GCR, with GCR and with GCR-NUMA. The concurrency restriction mechanism of a Malthusian lock cannot be applied directly to the simple TTAS or POSIX pthread mutex locks, so we do not include a Malthusian variant in those two cases.

Figure 6. Throughput results for the MCS, Test-Test-Set and POSIX pthread mutex locks (AVL tree): (a) MCS spin, (b) MCS spin-then-park, (c) Test-Test-Set, (d) pthread mutex.

With the spinning policy (Figure 6 (a)), GCR has a small detrimental effect (2% slowdown for a single thread, and in general, at most 12% slowdown) on the performance of MCS as long as the machine is not oversubscribed. This is because all threads remain running on their logical CPUs and the lock handoff is fast, while GCR introduces a certain (albeit small) overhead. The Malthusian lock performs similarly to (but worse than) GCR. MCS with GCR-NUMA, however, tops the performance chart, as it limits the amount of cross-socket communication incurred by the other alternatives when the lock is handed off between threads running on different sockets. The performance of the MCS and Malthusian locks plummets once the number of running threads exceeds the capacity of the machine. At the same time, GCR (and GCR-NUMA) are not sensitive to that, as they park excessive threads, preserving the overall performance. In the case of GCR-NUMA, for instance, this performance is close to the peak achieved with 10 threads.

The MCS and Malthusian locks with the spin-then-park policy exhibit a different performance pattern (Figure 6 (b)). Specifically, the former shows poor performance at relatively low thread counts. This is because as the number of threads grows, the waiting threads start quitting spinning and park, adding the overhead of unparking to each lock handoff. The Malthusian lock with its concurrency restriction mechanism avoids that. Yet, its performance is slightly worse than that of MCS with GCR. Once again, MCS with GCR-NUMA easily beats all other contenders.

In summary, the results in Figure 6 (a) and (b) show that despite being generic, the concurrency restriction mechanism of GCR performs superiorly to that of the specialized Malthusian lock. Besides, unlike with the Malthusian lock, the choice of a waiting policy for the underlying lock becomes much less crucial when GCR (or GCR-NUMA) is used.

The TTAS and pthread mutex locks exhibit yet another performance pattern (Figure 6 (c) and (d)). Similarly to the MCS spin-then-park variant, their performance drops at low thread counts; however, they manage to maintain reasonable throughput even as the number of threads grows. Along with that, both GCR and GCR-NUMA variants mitigate the drop in performance.

We also ran experiments in which we measured the handoff time for each of the locks presented in Figure 6, that is, the interval between a timestamp taken right before the current lock holder calls Unlock() and right after the next lock holder returns from Lock(). Previous work has shown that the performance of a parallel system is dictated by the length of its critical sections [10], which is composed of the time required to acquire and release the lock (captured by the handoff data) and the time a lock holder spends in the critical section. Indeed, the data in Figure 7 shows a correlation between the throughput achieved and the handoff time. That is, in all cases where the throughput of a lock degraded in Figure 6, the handoff time increased. At the same time, GCR (and GCR-NUMA) manages to maintain a constant handoff time across virtually all thread counts.

Figure 7. Lock handoff time for the MCS, Test-Test-Set and POSIX pthread mutex locks (AVL tree): (a) MCS spin, (b) MCS spin-then-park, (c) Test-Test-Set, (d) pthread mutex.

In a different experiment, we ran multiple instances of the microbenchmark, each configured to use a number of threads equal to the number of logical CPUs (40). This illustrates the case in which an application with a configurable number of threads chooses to set that number based on the machine capacity (as typically happens by default, for instance, in OpenMP framework implementations). Figure 8 presents the results for the same set of locks as Figures 6–7. Both GCR and GCR-NUMA scale well up to 4 instances for all tested locks. Except for pthread mutex, all locks without GCR (or GCR-NUMA) perform much worse, especially when the number of instances is larger than one (which is when the machine is oversubscribed). Pthread mutex fares relatively well, although it should be noted that its single-instance performance is worse than, e.g., that achieved by MCS spin.

We note that as the number of instances grows, the Malthusian spin-then-park lock handles contention better than GCR (but not GCR-NUMA) on top of MCS spin-then-park (cf. Figure 8 (b)). We attribute that to the fact that the Malthusian lock employs fewer active threads than GCR (and GCR-NUMA), which in the case of a heavily oversubscribed machine plays an important role. Note that the Malthusian spin lock does not provide any relief from contention compared to the MCS spin lock (cf. Figure 8 (a)), while both GCR and GCR-NUMA scale linearly with the number of instances. This is because in this case, threads passivated by the Malthusian lock spin and take resources away from the active threads, while GCR and GCR-NUMA park those passive threads.

Figure 8. Total throughput measured with multiple instances of the microbenchmark, each run with 40 threads (AVL tree): (a) MCS spin, (b) MCS spin-then-park, (c) Test-Test-Set, (d) pthread mutex.
Figure 9. Speedup achieved by GCR and GCR-NUMA over base lock implementations (AVL tree): (a) GCR on X6-2, (b) GCR-NUMA on X6-2, (c) GCR on T7-2, (d) GCR-NUMA on T7-2, (e) GCR on X5-4, (f) GCR-NUMA on X5-4.

Figure 10. Speedup for the base, GCR and GCR-NUMA locks when normalized to the performance of the MCS (spin-then-park) lock (without GCR, with GCR and with GCR-NUMA, respectively) with a single thread (AVL tree): (a) base, (b) GCR, (c) GCR-NUMA.
GCR on top of 24 locks:
After presenting results for some concrete locks, we show in Figure 9 a heat map encoding the speedup achieved by GCR and GCR-NUMA when used with each of the 24 locks provided by LiTL, for all three machines. A cell in Figure 9 at row X and column Y represents the throughput achieved with Y threads when the GCR (GCR-NUMA) library is used with lock X, divided by the throughput achieved when lock X itself is used (i.e., without GCR or GCR-NUMA). The shades of red represent slowdown (speedup below 1, which in most cases falls in the range of [0.8..1), i.e., less than 20% slowdown), while the shades of green represent positive speedup; the intensity of the color represents how substantial the slowdown/speedup is. In other words, ideally we want to see heat maps of green colors, as dark as possible. We also provide raw speedup numbers for X6-2, but remove them for the other machines for better readability.

Until the machines become oversubscribed, the locks that do not gain from GCR are mostly NUMA-aware locks, such as cbomcs_spin, cbomcs_stp and ctcktck (which are variants of Cohort locks [9]), and hyshmcs and hmcs (which are NUMA-aware hierarchical MCS locks [3]). This means that, unsurprisingly, putting a (non-NUMA-aware) GCR mechanism in front of a NUMA-aware lock is not a good idea.

When machines are oversubscribed, however, GCR achieves gains for all locks, often resulting in more than 4000x throughput increase compared to the base lock. Those are the cases where the performance of the base lock (without GCR) plummets, while GCR manages to avoid the drop. In general, locks that use the spin-then-park policy tend to experience a relatively smaller drop in performance (and thus a relatively smaller benefit from GCR, up to 6x).

When considering the speedup achieved with GCR-NUMA, we see the benefit of the concurrency restriction increasing, in some cases dramatically, while in most cases the benefit shows up at much lower thread counts. In particular, it is worth noting that above half the capacity of the machines, the performance of virtually all locks is improved with GCR-NUMA. When machines are oversubscribed, GCR-NUMA achieves over 6000x performance improvement for some of the locks.

A different angle on the same data is given by Figure 10. Here we normalize the performance of all locks to that achieved with mcs_stp (MCS spin-then-park) with a single thread. The data in Figure 10 (a) for locks without GCR echoes the results of several studies comparing different locks and concluding that the best performing lock varies across thread counts [6, 12]. Notably, the normalized performance with GCR and GCR-NUMA is more homogeneous (Figure 10 (b) and (c)). With a few exceptions, all locks deliver similar performance with GCR (and GCR-NUMA) across virtually all thread counts. The conclusion from these data is that GCR and GCR-NUMA deliver much more stable performance, mostly independent of the type of the underlying lock or the waiting policy it uses.

Is better performance with GCR traded for fairness?
It is natural to ask how the fairness of each lock is affected once the GCR mechanism is used. There are many ways to assess fairness; we show one that we call the unfairness factor. To calculate this factor, we sort the number of operations reported by each thread at the end of the run, and calculate the portion of operations completed by the upper half of the threads. Note that the unfairness factor is a value between 0.5 (all threads complete the same number of operations) and 1 (the upper half of the threads complete all operations).

Figure 11. Unfairness factor for the base locks and when GCR is used (AVL tree): (a) base, (b) GCR, (c) GCR-NUMA.

Kyoto Cabinet

We report on our experiments with the Kyoto Cabinet [11] kccachetest benchmark run in the wicked mode, which exercises an in-memory database. Similarly to [7], we modified the benchmark to use the standard POSIX pthread mutex locks, which we interpose with locks from the LiTL library. We also modified the benchmark to run for a fixed time and report the aggregated work completed. Finally, we fixed the key range at a constant (10M) elements. (Originally, the benchmark set the key range dependent on the number of threads.) All those changes were also applied to Kyoto in [7], to allow a fair comparison of performance across different thread counts. The length of each run was 60 seconds.

Kyoto employs multiple locks, each protecting a slot comprising a number of buckets in a hash table; the latter is used to implement the database [11]. Given that the wicked mode exercises the database with random operations and random keys, one should expect a lower load on each of the multiple slot locks compared to the load on the central lock used to protect access to the AVL tree in the microbenchmark above. Yet, Kyoto provides a good example of how GCR behaves in a real application setting.

The results are presented in Figure 12, where we run GCR and GCR-NUMA on top of the 24 locks provided by LiTL. Similarly to Figure 9, each cell encodes the slowdown/speedup achieved by GCR or GCR-NUMA, respectively, compared to the base lock. As Figure 12 shows, both GCR and GCR-NUMA deliver robust gains (at times, over 1000x), and for virtually all locks those gains start even before the machine becomes oversubscribed.

Figure 12. Speedup achieved by GCR and GCR-NUMA over base lock implementations (Kyoto): (a) GCR, (b) GCR-NUMA.
LevelDB

LevelDB is an open-source key-value storage library [17]. We experimented with release 1.20 of the library, which includes the db_bench benchmark. We used db_bench to create a database with the default 1M key-value pairs. This database was used subsequently in the readrandom mode of db_bench, in which threads read random keys from the database. (The same database was used for all runs, since readrandom does not modify it.) Following the example of Kyoto, we modified the readrandom mode to run for a fixed time (rather than running a certain number of operations, which caused the runtime to grow disproportionately for some locks under contention). The reported numbers are the aggregated throughput in the readrandom mode. The length of each run was 10 seconds.

As its name suggests, the readrandom mode of db_bench is composed of Get operations on the database with random keys. Each Get operation acquires a global (per-database) lock in order to take a consistent snapshot of pointers to internal database structures (and increment reference counters to prevent deleting those structures while Get is running). The search operation itself, however, executes without holding the database lock, but acquires locks protecting the (sharded) LRU cache as it seeks to update the cache structure with keys it has accessed. Thus, as in the case of Kyoto, the contention is spread over multiple locks.

The results are presented in Figure 13 (a) and (b). As expected, GCR gains are relatively modest, yet when the machine is oversubscribed, it achieves positive speedups for all locks but two (the backoff and pthreadinterpose locks; the latter is simply a wrapper around the standard POSIX pthread mutex). GCR-NUMA helps to extract more performance, but only slightly, as the low lock contention limits its benefit as well.

In order to explore how increased contention on the database lock would affect the speedups achieved by GCR and GCR-NUMA, we ran the same experiment with an empty database. In this case, the work outside the critical sections (searching for a key) is minimal and does not involve acquiring any other lock. The results are presented in Figure 13 (c) and (d). Overall, the increased contention leads to increased speedups achieved by GCR and GCR-NUMA. In particular, when the machine is oversubscribed, all locks but one benefit from GCR (and all locks benefit from GCR-NUMA).

Figure 13. Speedup achieved by GCR and GCR-NUMA over base lock implementations (LevelDB): (a) GCR (10M keys), (b) GCR-NUMA (10M keys), (c) GCR (empty DB), (d) GCR-NUMA (empty DB).
Conclusion

We have presented GCR, a generic concurrency restriction mechanism, and GCR-NUMA, the extension of GCR to NUMA settings. GCR wraps any underlying lock and controls which threads are allowed to compete for its acquisition. The idea is to keep the lock saturated by as few threads as possible, while parking all other excessive threads that would otherwise compete for the lock, create contention and consume valuable system resources. Extensive evaluation with more than two dozen locks shows substantial speedup achieved by GCR on various systems and benchmarks; the speedup grows even larger when GCR-NUMA is used.
References

[1] Yehuda Afek, Dave Dice, and Adam Morrison. 2011. Cache Index-aware Memory Allocation. In Proceedings of ACM ISMM. 55–64.
[2] S. Boyd-Wickizer, M. Kaashoek, R. Morris, and N. Zeldovich. 2012. Non-scalable Locks are Dangerous. In Proceedings of the Linux Symposium.
[3] Milind Chabbi, Michael Fagan, and John Mellor-Crummey. 2015. High Performance Locks for Multi-level NUMA Systems. In Proceedings of ACM PPoPP.
[4] Gaurav Chadha, Scott Mahlke, and Satish Narayanasamy. 2012. When Less is More (LIMO): Controlled Parallelism for Improved Efficiency. In Proceedings of the Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES).
[5] Travis Craig. 1993. Building FIFO and Priority-queueing Spin Locks from Atomic Swap. Technical Report TR 93-02-02. University of Washington, Dept. of Computer Science.
[6] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. 2013. Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP). 33–48.
[7] Dave Dice. 2017. Malthusian Locks. In Proceedings of ACM EuroSys. 314–327.
[8] Dave Dice and Alex Kogan. 2019. Compact NUMA-aware Locks. In Proceedings of ACM EuroSys.
[9] David Dice, Virendra J. Marathe, and Nir Shavit. 2015. Lock Cohorting: A General Technique for Designing NUMA Locks. ACM TOPC 1, 2, Article 13 (Feb 2015).
[10] Stijn Eyerman and Lieven Eeckhout. 2010. Modeling Critical Sections in Amdahl's Law and Its Implications for Multicore Design. In Proceedings of ACM ISCA. 362–370.
[11] FAL Labs. [n. d.]. Kyoto Cabinet. http://fallabs.com/kyotocabinet.
[12] Hugo Guiroux, Renaud Lachaize, and Vivien Quéma. 2016. Multicore Locks: The Case Is Not Closed Yet. In Proceedings of USENIX ATC. 649–662.
[13] Bijun He, William N. Scherer, and Michael L. Scott. 2005. Preemption Adaptivity in Time-published Queue-based Spin Locks. In Proceedings of High Performance Computing (HiPC). 7–18.
[14] W. Heirman, T. E. Carlson, K. Van Craeynest, I. Hur, A. Jaleel, and L. Eeckhout. 2014. Undersubscribed Threading on Clustered Cache Architectures. In Proceedings of IEEE HPCA.
[15] Ryan Johnson, Manos Athanassoulis, Radu Stoica, and Anastasia Ailamaki. 2009. A New Look at the Roles of Spinning and Blocking. In Proceedings of the International Workshop on Data Management on New Hardware (DaMoN). ACM.
[16] Ryan Johnson, Radu Stoica, Anastasia Ailamaki, and Todd C. Mowry. 2010. Decoupling Contention Management from Scheduling. In Proceedings of ACM ASPLOS. 117–128.
[17] LevelDB. [n. d.]. LevelDB. https://github.com/google/leveldb.
[18] Beng-Hong Lim and Anant Agarwal. 1993. Waiting Algorithms for Synchronization in Large-scale Multiprocessors. ACM Transactions on Computer Systems (1993).
[19] LiTL. [n. d.]. LiTL: Library for Transparent Lock Interposition. https://github.com/multicore-locks/litl.
[20] John M. Mellor-Crummey and Michael L. Scott. 1991. Algorithms for Scalable Synchronization on Shared-memory Multiprocessors. ACM Trans. Comput. Syst. 9, 1 (1991), 21–65.
[21] Kishore Kumar Pusukuri, Rajiv Gupta, and Laxmi N. Bhuyan. 2011. Thread Reinforcer: Dynamically Determining Number of Threads via OS Level Monitoring. In Proceedings of IEEE IISWC.
[22] Zoran Radovic and Erik Hagersten. 2003. Hierarchical Backoff Locks for Nonuniform Communication Architectures. In Proceedings of IEEE HPCA. 241–252.
[23] Arun Raman, Hanjun Kim, Taewook Oh, Jae W. Lee, and David I. August. 2011. Parallelism Orchestration Using DoPE: The Degree of Parallelism Executive. In Proceedings of ACM PLDI.
[24] Richard M. Yoo and Hsien-Hsin S. Lee. 2008. Adaptive Transaction Scheduling for Transactional Memory Systems. In Proceedings of ACM SPAA.