Fissile Locks
Dave Dice
Oracle Labs [email protected]
Alex Kogan
Oracle Labs [email protected]
Abstract
Classic test-and-set (TS) mutual exclusion locks are simple, and enjoy high performance and low latency of ownership transfer under light or no contention. They do not, however, scale gracefully under high contention and do not provide any admission order guarantees. Such concerns led to the development of scalable queue-based locks, such as the recent
Compact NUMA-aware (CNA) lock, a variant of the popular queue-based
MCS lock. CNA scales well under load and provides certain admission guarantees, but has more complicated lock handover operations than TS and incurs higher latencies at low contention. We propose
Fissile locks, which capture the most desirable properties of both TS and CNA. A Fissile lock consists of two underlying locks: a TS lock, which serves as a fast path, and a CNA lock, which serves as a slow path. The key feature of Fissile locks is the ability of threads on the fast path to bypass threads enqueued on the slow path, and acquire the lock with less overhead than CNA. Bypass is bounded (by a tunable parameter) to avoid starvation and ensure long-term fairness. The result is a highly scalable NUMA-aware lock with progress guarantees that performs like TS at low contention and like CNA at high contention.
MCS provides FIFO admission and CNA provides bounded long-term fairness guarantees. Unlike classic CNA, our variant in Fissile reorganizes the chain of waiting threads early, immediately after acquiring the CNA lock. As such, reorganization runs outside the TS critical section and potentially allows overlap with execution of the critical section. The thread that has acquired the TS lock is the owner of the compound Fissile lock.
CCS Concepts • Software and its engineering → Multithreading; Mutual exclusion; Concurrency control; Process synchronization

Keywords Locks, Mutexes, Mutual Exclusion, Synchronization, Concurrency Control
TS:
Test-and-set locks (TS) [3] are compact – consisting of a single lock word – simple, and provide excellent latency under light or no contention. They fail to scale, however, as contention increases. Acquiring threads simply busy-wait, or spin, attempting to change the lock word state from unlocked to locked with an atomic read-modify-write instruction, such as compare-and-swap (CAS) or exchange (SWAP). If the atomic operation was successful, then the thread has acquired the lock and may enter the critical section. Releasing the lock requires only a simple store to set the state to unlocked. So-called “polite” test-and-test-and-set locks (TTS), a variation on TS, first fetch the lock value and only attempt the atomic instruction if the lock was observed to be not held. That is, acquiring threads busy-wait until the lock is clear, at which point they execute an atomic instruction to try to gain ownership. TTS acts to avoid unnecessary write invalidation arising from failed atomic operations. Simple “impolite” TS locks do not bother to first load the value, so each probe of the lock causes writing via the atomic instruction. TS and TTS locks are usually augmented with back-off – delays between probes – to moderate contention. In our descriptions we will assume a sequentially consistent memory model and not consider the need for memory fence or barrier instructions.
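As a concrete illustration, a minimal "polite" TTS lock might be sketched in C11 as follows. This is our own sketch, not the paper's implementation; the type and function names are ours, and sched_yield stands in for a real back-off or PAUSE policy.

```c
#include <stdatomic.h>
#include <sched.h>

typedef struct { atomic_int Word; } TTSLock;   /* 0 = unlocked, 1 = locked */

/* Polite acquire: load first, and only attempt the atomic exchange
 * once the lock is observed clear, avoiding futile write invalidation. */
static void tts_acquire(TTSLock *L) {
    for (;;) {
        while (atomic_load(&L->Word) != 0)   /* read-only spin */
            sched_yield();                    /* stand-in for back-off */
        if (atomic_exchange(&L->Word, 1) == 0)
            return;                           /* SWAP succeeded: lock owned */
    }
}

/* Release is just a simple store back to the unlocked state. */
static void tts_release(TTSLock *L) {
    atomic_store(&L->Word, 0);
}
```

An "impolite" TS lock would omit the inner read-only loop and simply retry the exchange on every probe.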
MCS:
The
MCS lock [34] is the usual alternative to simple test-and-set-based locks, performing better under high contention, but also having a more complex path and often lagging behind simple locks under no or light contention. In MCS, arriving threads use an atomic operation to append an element to the tail of a linked list of waiting threads, and then busy-wait on a field within that element, avoiding global spinning as found in TS. The list forms a queue of waiting threads. The lock’s tail variable is explicit and the head – the current owner – is implicit. When the owner releases the lock it reclaims the element it originally enqueued and sets the flag in the next element, passing ownership. To convey ownership, the MCS unlock operator must identify the successor, if any, and then store to the location where the successor busy-waits. The list forms a multiple-producer-single-consumer (MPSC) queue where any thread can enqueue but only the current owner can dequeue itself and pass ownership. The handover path is longer than that of TS locks and accesses more distinct shared locations.

MCS uses so-called local waiting, where at most one thread is waiting on a given location at any one time. As such, an unlock operation will normally need to invalidate just one cache line – the line underlying the flag where the successor busy-waits – in one remote cache. Under contention, the unlock operator must fetch the address of the successor element from its own element, and then store into the flag in the successor’s element, accessing two distinct cache lines, and incurring a dependent memory access to reach the successor. Absent contention, the unlock operator uses an atomic compare-and-swap (CAS) to try to detach the owner’s element and set the tail variable to null. MCS locks provide strict FIFO order.
They are also compact, with the lock body requiring just a pointer to the tail of the chain of queue elements. One MCS queue element instance is required for each lock a thread currently holds, and an additional queue element is required while a thread is waiting on a lock. Queue elements can not be shared concurrently and can appear on at most one queue – be associated with at most one lock – at a given time. The standard POSIX pthread_mutex_lock and pthread_mutex_unlock operators do not require scoped or lexically balanced locking. As such, queue elements can not be allocated on stack. Instead, MCS implementations that expose a standard POSIX interface will typically allocate elements from thread-local free lists, populated on demand. MCS requires the address of the queue element inserted by the owner to be passed to the corresponding unlock operator, where it will be used to identify a successor, if any.

2020-05-05 • Copyright Oracle and/or its affiliates
The standard POSIX interface does not provide any meansto pass information from a lock operation to the correspond-ing unlock operator. As such, the address of the MCS queueelement inserted by the owner thread is usually recorded inthe lock instance so it can be conveyed to the subsequentunlock operation to identify the successor, if any. That fieldis protected by the lock itself and accessed within the crit-ical section. Accesses to the field that records the owner’squeue element address may themselves generate additionalcoherence traffic, although some implementations may avoidsuch accesses to shared fields by storing the queue elementaddress in a thread-local associative structure that maps lockaddresses to the owner’s queue element address.
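The MCS enqueue and handover steps described above can be sketched in C11 as follows. This is an illustrative sketch under the paper's description, not its code; names are ours, and the busy-wait loops are bare spins.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct MCSNode {
    struct MCSNode *_Atomic Next;   /* successor in the queue, if any */
    atomic_int Locked;              /* local flag this thread spins on */
} MCSNode;

typedef struct { MCSNode *_Atomic Tail; } MCSLock;  /* head is implicit */

static void mcs_acquire(MCSLock *L, MCSNode *I) {
    atomic_store(&I->Next, NULL);
    atomic_store(&I->Locked, 1);
    MCSNode *pred = atomic_exchange(&L->Tail, I);   /* append at tail */
    if (pred != NULL) {                             /* queue non-empty: wait */
        atomic_store(&pred->Next, I);
        while (atomic_load(&I->Locked)) { }         /* local spin */
    }
}

static void mcs_release(MCSLock *L, MCSNode *I) {
    MCSNode *succ = atomic_load(&I->Next);
    if (succ == NULL) {
        /* No visible successor: try to swing Tail back to empty via CAS. */
        MCSNode *expected = I;
        if (atomic_compare_exchange_strong(&L->Tail, &expected, NULL))
            return;
        /* A successor is mid-enqueue: wait for it to link itself. */
        while ((succ = atomic_load(&I->Next)) == NULL) { }
    }
    atomic_store(&succ->Locked, 0);                 /* pass ownership */
}
```

Note the dependent memory accesses in mcs_release – fetching Next, then storing into the successor's Locked flag – which is exactly the longer handover path discussed above.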
CNA:
Compact NUMA-Aware locks (CNA) [19] are based on MCS, but add NUMA-awareness. At arrival time, threads annotate their queue element with their NUMA node number. At unlock-time, the owner scans forward into the primary MCS chain and culls remote elements, transferring them to a secondary chain of remote threads. That secondary chain is propagated from the unlock operator to the successor via the queue elements, so the lock structure remains compact. Reducing the NUMA diversity of the primary chain acts to reduce lock migration [21] and improve performance. To avoid indefinite starvation of threads on the secondary chain, the unlock operator periodically flushes the secondary chain back into the primary chain to shift the currently preferred NUMA node. At unlock-time, if the primary chain is found empty, the secondary is flushed back into the primary to reprovision the primary chain. CNA unlock prefers to dispatch to threads on the primary, but will revert to the secondary list if the primary is empty. The secondary chain is manipulated under the lock itself, in the unlock operation. While CNA is NUMA-aware, compared to MCS, a number of additional CNA-specific administrative steps – culling, reprovisioning, periodic flushing – execute while the lock is held and are subsumed into the critical section, potentially increasing the effective hold time of the lock. We observe that all NUMA-aware locks trade off short-term fairness for improved overall throughput.
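The two handover-time decisions described above – cull a remote waiter to the secondary chain, and occasionally flush the secondary back to rotate the preferred node – can be illustrated with a small sketch. The function names and the toy linear-congruential generator are our own illustrative assumptions, not CNA's actual code.

```c
#include <stdbool.h>

/* Cull decision: a waiter whose NUMA node differs from the owner's is
 * moved from the primary chain to the secondary (remote) chain. */
static bool cna_should_cull(int owner_node, int waiter_node) {
    return waiter_node != owner_node;
}

/* Periodic flush decision: with small probability, splice the secondary
 * chain back into the primary so remote waiters are not starved
 * indefinitely. A toy LCG stands in for the real randomization. */
static bool cna_should_flush(unsigned *prng_state) {
    *prng_state = *prng_state * 1103515245u + 12345u;
    return ((*prng_state >> 16) & 0xFFu) == 0;   /* roughly P = 1/256 */
}
```

The flush probability is the fairness knob: a larger probability rotates the preferred node more often, trading throughput for shorter waiting-time tails on remote nodes.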
Fissile augments CNA with a TS fast-path using the
LOITER lock construction (Locking: Outer-Inner Transformation) [14], where the outer lock is a TS lock and the inner lock is a CNA lock. Acquiring ownership of the outer TS lock confers ownership of the compound Fissile lock. Arriving threads first try the fast-path TS lock and, if successful, immediately enter the critical section. Otherwise control diverts into the slow path where the thread acquires the inner CNA lock. We refer to the owner of the inner CNA lock as the alpha thread. Once the CNA lock has been acquired, the alpha thread then busy-waits on the TS outer lock. At most one thread at any one time busy-waits on the outer TS lock, avoiding the scalability impact of global spinning, where multiple threads simultaneously busy-wait on a given location. As there is at most one thread busy-waiting on the outer lock, we use TS instead of TTS. Once the outer lock has been acquired, we release the inner lock and enter the critical section. To release a Fissile lock, we simply release the outer TS lock, regardless of whether the corresponding acquisition took the fast path or slow path. (We note that the MCS “K42” variant [32, 37] allows queue elements to be allocated on stack – they are required only while a thread waits – but at the cost of a longer path with more accesses to shared locations.)
A thread holds the inner CNA lock only within the Fissile lock acquisition operator. Specifically, Fissile releases the inner CNA lock within the Fissile acquire operation, but while still holding the outer TS lock, potentially extending the hold-time of the outer lock. This choice, however, allows us to allocate the MCS queue element on-stack, which is a distinct advantage, avoiding MCS queue element allocation and deallocation. (Classic MCS requires one allocated queue element for each lock concurrently held by a thread whereas our approach avoids that expense). Furthermore, the queue element of the alpha thread does not need to be communicated from the Fissile acquire operation to the unlock operation, as is the case for normal MCS and CNA. We employ a specialized CNA implementation, described below, which shifts much of the administrative overhead specific to CNA and normally found in the unlock operator to run before we acquire the outer TS lock, so the overhead of releasing the CNA inner lock while holding the outer TS lock is minimized.

In Listing-1 we provide a sketch of the Fissile algorithm. The Outer field is a TS lock word which can take on 3 values: 0 indicates unlocked; 1 indicates locked; and 2 encodes a special locked state used when the alpha thread is impatient and the previous owner is transferring ownership of the outer TS lock directly to the alpha thread. Inner is the CNA inner lock, and Impatient reflects the state of the alpha thread.
Absent remediation, simple TS allows indefinite bypass and starvation of waiting threads. To avoid this issue, the alpha thread busy-waits on the TS lock for a short grace period but will then become “impatient” and cue direct handover of ownership the next time the TS lock is released, bounding bypass. When the alpha thread becomes impatient, having failed to acquire the outer lock within the grace period, it sets the Impatient field from the normal state of 0 to 2. The unlock operator fetches from Impatient and stores that value into the TS lock word. In typical circumstances, when unlock runs after the alpha has become impatient, it will observe and fetch 2 from Impatient and store that value into the TS lock word. The alpha will then notice that the value 2 has propagated from Impatient into the lock word, and takes direct handoff of ownership from that previous owner, restoring the lock word from 2 back to 1. If the unlock operation happens to run concurrently with the alpha thread becoming impatient, the unlock may race and fetch 0 from Impatient instead of 2. In this case either the alpha manages to seize the TS lock and acquire it when it becomes 0, or some other thread manages to pounce on the TS lock, in which case the alpha thread must wait one more lock cycle to take ownership. At worst, impatient handover is delayed by one acquire-release cycle. Once the value of 2 is visible to threads in unlock, immediate handover to the alpha is assured. Threads arriving in the fast path that observe 2 will divert immediately into the CNA slow path.

The grace period serves as a tunable parameter reflecting the trade-off and tension between throughput and short-term fairness. A shorter grace period yields less bypass and fairer admission, while longer periods may allow better throughput but worse short-term fairness.

Fissile provides hybrid succession, employing competitive succession [14] when there is no contention, but switching to more conservative direct succession when the alpha thread becomes impatient. Under competitive succession, the owner releases the lock, allowing other waiting or arriving threads to acquire the lock. Unfettered competitive succession admits undesirable long-term unfairness and starvation but typically performs well under light load. In addition, competitive succession tends to provide more graceful throughput under preemption. In direct succession, as used by MCS, for instance, the lock holder directly transfers ownership to a waiting successor without any intermediate or intervening transition to an unlocked state. All strict FIFO locks employ direct succession. Direct succession suffers under preemption, however, as ownership may be conveyed to a preempted thread, and we have to wait for operating system time-slicing to dispatch the owner onto a processor.
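The impatient-handover protocol just described can be modeled as a small state machine over the Outer and Impatient words. The field names follow Listing-1; the helper functions are our own illustrative decomposition, not the paper's code.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int Outer;       /* 0 = unlocked, 1 = locked, 2 = direct handover */
    atomic_int Impatient;   /* alpha's request state: 0 or 2 */
} Handoff;

/* Alpha thread: give up on competitive acquisition and request that the
 * next unlock hand the lock over directly. */
static void alpha_become_impatient(Handoff *H) {
    atomic_store(&H->Impatient, 2);
}

/* Owner's unlock: store the current Impatient value into the lock word.
 * Storing 0 releases normally; storing 2 transfers ownership directly. */
static void owner_unlock(Handoff *H) {
    atomic_store(&H->Outer, atomic_load(&H->Impatient));
}

/* One step of the alpha's wait loop: swap 1 into Outer. Any prior value
 * other than 1 (i.e. 0 or 2) means the alpha now owns the lock, and the
 * swap itself restores the lock word to the normal locked state 1. */
static int alpha_try_take(Handoff *H) {
    return atomic_exchange(&H->Outer, 1) != 1;
}
```

A fast-path arrival in this scheme must treat both 1 and 2 as "held"; observing 2, it diverts straight into the slow path, which is what guarantees the alpha's handover once 2 is published.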
By restricting the number of threads competing for the outer TS lock, we improve the odds that an arriving thread will find the lock clear and manage to acquire it via the TS outer fast path. Under fixed load, the system will tend to reach a balanced steady state where many circulating threads tend to acquire the TS lock without waiting. As shown in [17], as more threads busy-wait on a given location, as is the case in TS, stores to that location take longer to propagate. (Concurrent reads of a given location scale, but concurrent writes or atomics do not [36]). Fissile addresses that concern by ensuring that only the alpha thread busy-waits on the outer TS lock at any given time, accelerating handover.

The TS fast path provides the following benefits. First, latency is reduced, relative to MCS and CNA, for the uncontended case. Acquisition requires an atomic instruction, and release just a simple store. Second, the slow-path CNA MCS nodes can be allocated on-stack, simplifying the CNA implementation and avoiding the need to communicate or convey the owner’s MCS node from the lock operation to the corresponding unlock. Third, TS with bounded bypass performs well under preemption, relative to MCS. Finally, and less obviously, the TS fast path provides benefit in the contended case. Fissile provides significant improvement over CNA when the critical section is small, and CNA has a hard time “keeping up” with the flow of arriving threads. That is, for very short critical sections, CNA itself – CNA overheads – becomes the bottleneck for throughput [22]. Under intense contention the TS lock allows more throughput, serving as an alternative bypass channel, giving contention “pressure” a way to get around CNA when CNA becomes the bottleneck. When critical sections are longer, Fissile performs like CNA.
Allowing some threads to pass through the CNA slow path and some fraction over the TS fast path would appear to dilute CNA’s NUMA benefits, but in practice, we find that CNA still quickly acts to filter out remote threads from a set of threads circulating over a contended lock.
The result is a highly scalable NUMA-aware lock that performs like TS at low contention and as well as or better than CNA at high contention. Fissile provides short-term concurrency restriction [14] which may improve overall throughput over a contended lock. Fissile locks are compact and also tolerate preemption, by virtue of the TS outer lock, more gracefully than do CNA or MCS.
In fact, the performance of Fissile often exceeds that of either of its underlying lock types.
Adapting CNA to Fissile:
Classic CNA performs reorganization of the MCS chain – to be more NUMA-friendly and reduce NUMA lock transitions – while holding the CNA lock itself, extending the effective critical section length and delaying handover to a successor. Handover time impacts scalability, as the lock is held throughout handover, increasing the effective length of the critical section. At extreme contention, the critical section length determines throughput [2, 22]. Fissile uses a specialized variant of CNA which reorganizes the chain of waiting threads early, immediately after acquiring the CNA lock. As such, reorganization runs outside and before the TS critical section, off the critical path, and potentially allows pipelining and overlap with the critical section execution. (Arguably, earlier reorganization may suffer as there are fewer threads enqueued from which to schedule, but we have not observed any performance penalty related to this concern).

The variant of CNA used by Fissile differs from the original [19] as follows. Classic CNA, at unlock-time, culls the entire remote suffix of the primary chain into the remote list. Our variant looks ahead only one thread into the primary MCS chain, and provides constant-time culling costs, yielding less potentially futile scanning of the chain, and more predictable overheads. In addition, our look-ahead-one policy generates less coherence traffic accessing the MCS chain elements, as the element examined for potential culling would also be accessed in the near future when we subsequently release the CNA lock. Finally, our version of CNA performs CNA administrative duties – flushing and culling – immediately after the owner acquires the CNA lock, whereas classic CNA defers those operations until unlock-time.
Specifically, we reorganize outside and before the outer TS critical section, allowing more overlap between CNA administrative duties and the execution of the critical section, and accelerating CNA lock handover. All the changes above are optional optimizations and are not required to use CNA within Fissile, but they serve to enhance performance.
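The look-ahead-one culling policy can be illustrated as a constant-time operation on a singly linked chain of waiters. This sketch is ours, under the paper's description: examine only the immediate successor in the primary chain and, if it is on a remote node, push it onto the secondary chain.

```c
#include <stddef.h>

typedef struct Waiter {
    struct Waiter *Next;   /* next waiter in the chain */
    int Node;              /* NUMA node annotation from arrival time */
} Waiter;

/* Look-ahead-one cull: inspect only the head of the primary chain.
 * If it is remote to owner_node, move it to the secondary chain.
 * Returns the (possibly new) head of the primary chain; O(1) per call,
 * unlike classic CNA's scan of the entire remote suffix. */
static Waiter *cull_one(Waiter *primary, int owner_node, Waiter **secondary) {
    if (primary != NULL && primary->Node != owner_node) {
        Waiter *culled = primary;
        primary = primary->Next;
        culled->Next = *secondary;   /* push onto secondary chain */
        *secondary = culled;
    }
    return primary;
}
```

Bounding the cull to one element is what yields the predictable overhead noted above: each handover does at most one detach-and-push, and the element it touches is one that would be accessed shortly anyway at CNA release.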
While mutual exclusion remains an active research topic [2, 4–6, 13, 18, 22–24, 26–28], we focus on locks closely related to our design. NUMA-aware locks attempt to restrict ownership of a lock to threads on a given NUMA node over the short term, reducing so-called lock migration, which can result in expensive inter-node coherence traffic. The first NUMA-aware lock was HBO (Hierarchical Back-Off) [35], a test-and-set lock where busy-waiting threads running on the same NUMA node as the current owner would use shorter back-off durations, favoring the odds of handover to such proximal threads relative to more distant threads. While simple, HBO suffers from the same issues as other TS locks.

Luchangco et al. [31] introduced HCLH, a NUMA-aware hierarchical version of the CLH queue-lock [12, 33]. The HCLH algorithm collects requests on each node into a local CLH-style queue, and then has the thread at the head of the queue integrate each node’s queue into a single global queue. This avoids the overhead of spinning on a shared location and eliminates fairness and starvation issues. HCLH intentionally inserts non-work-conserving combining delays to increase the size of groups of threads to be merged into the global queue. It was subsequently discovered that HCLH required threads to be bound to one processor for the thread’s lifetime. Failure to bind could result in exclusion and progress failures, and as such we will not consider HCLH further.

NUMA-aware Cohort locks [20, 21] spawned various derivatives [7, 8]. While cohort locks scale well, they have a large variable-sized footprint. The size of a cohort lock instance is a function of the number of NUMA nodes, and is thus not generally known until run-time, complicating static allocation of cohort locks. Being hierarchical in nature, they suffer increased latency under low or no contention, as acquisition requires acquiring both node-level locks and the top-level lock. CNA avoids all these concerns and is superior to cohort locks.
Listing 1. Fissile Pseudocode

class Fissile :
  atomic<int> Outer = 0 ;
  atomic<int> Impatient = 0 ;
  CNAMCSLock Inner {} ;

def Lock (Fissile * L) :
  # Fast path : try to acquire the outer TS lock directly
  if AtomicCAS (&L->Outer, 0, 1) == 0 :
    return
  # Slow path : acquire the inner CNA lock, becoming the alpha thread
  auto QueueElement I {} ;
  CNAAcquire (&L->Inner, &I) ;
  CNACullOrFlush (&L->Inner, &I) ;
  # Busy-wait on the outer TS lock for the grace period
  for :
    if AtomicSwap (&L->Outer, 1) == 0 :
      goto Exeunt
    if PatienceExhausted() : break
    Pause() ;
  # Impatient : request direct handover from the next unlock
  assert L->Impatient == 0 ;
  L->Impatient = 2 ;
  for :
    # prior value 0 or 2 means we now own the outer lock
    if AtomicSwap (&L->Outer, 1) != 1 :
      break ;
    Pause() ;
  L->Impatient = 0 ;
 Exeunt:
  assert L->Outer == 1 ;
  CNARelease (&L->Inner, &I) ;

def Unlock (Fissile * L) :
  assert L->Outer != 0 ;
  L->Outer = L->Impatient ;

The Linux kernel qspinlock low-level spin lock [10, 30] implementation is, at the time of writing, the subject of a submission converting it from an MCS-based design to CNA. Similarly, Fissile locks are readily portable into the kernel environment. Kashyap et al.’s [27] Shuffle Lock also performs NUMA-aware reorganization of MCS chains of waiters off the critical path, by waiting threads. They also use a LOITER-based design, but do not allow bypass. In the evaluation section, below, we compare Fissile against their user-mode implementation.

LOITER-based designs [14] first appeared, to our knowledge, in the HotSpot Java Virtual Machine implementation in 2007. The “Go” language runtime mutex [1] uses a LOITER-based scheme where the inner lock is implemented via a semaphore and time-bounded bypass is allowed. The Linux kernel QSpinLock [30] construct also has a dual-path TS and MCS lock, but does not allow bypass. The QSpinLock TS fast path avoids MCS latency overheads in the uncontended case.

Various authors [4, 25] have suggested switching adaptively between MCS and lower-latency locks depending on the contention level. While workable, this adds considerable algorithmic complexity, particularly for the changeover phase, and requires tuning. Lim and Agarwal [29] suggested a more general framework for switching locks at runtime.
Unless otherwise noted, all data was collected on an Oracle X5-2 system. The system has 2 sockets, each populated with an Intel Xeon E5-2699 v3 CPU running at 2.30GHz. Each socket has 18 cores, and each core is 2-way hyperthreaded, yielding 72 logical CPUs in total. The system was running Ubuntu 18.04 with a stock Linux version 4.15 kernel, and all software was compiled using the provided GCC version 7.3 toolchain at optimization level “-O3”. 64-bit C or C++ code was used for all experiments. Factory-provided system defaults were used in all cases, and Turbo mode [39] was left enabled. In all cases default free-range unbound threads were used.

We implemented all user-mode locks within LD_PRELOAD interposition libraries that expose the standard POSIX pthread_mutex_t programming interface using the framework from [21]. This allows us to change lock implementations by varying the LD_PRELOAD environment variable and without modifying the application code that uses locks. The C++ std::mutex construct maps directly to pthread_mutex primitives, so interposition works for both C and C++ code. All busy-wait loops used the Intel PAUSE instruction. We note that user-mode locks are not typically implemented as pure spin locks, instead often using a spin-then-park waiting policy which voluntarily surrenders the CPUs of waiting threads after a brief optimistic spinning period designed to reduce the context switching rate. In our case, we find that user-mode is a convenient venue for experiments, and note in passing that threads in the CNA slow path are easily made to park. We use a 128-byte sector size on Intel processors for alignment to avoid false sharing. The unit of coherence is 64 bytes throughout the cache hierarchy, but 128 bytes is required because of the adjacent cache line prefetch facility where pairs of lines are automatically fetched together.

https://lwn.net/Articles/805655/
https://github.com/openjdk-mirror/jdk7u-hotspot/blob/master/src/share/vm/runtime/mutex.cpp
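The 128-byte sectoring described above can be expressed directly in C11 with an alignment specifier. This sketch (our own layout, not the paper's) places the lock word in its own 128-byte sector so that neither its line nor the adjacent prefetched line is shared with unrelated data.

```c
#include <stdalign.h>
#include <stdatomic.h>

/* Pad the lock out to a full 128-byte sector: 64 bytes covers the
 * coherence unit, and the second 64 bytes covers the line pulled in
 * by the adjacent-cache-line prefetcher. */
typedef struct {
    alignas(128) atomic_int Word;          /* lock word */
    char Pad[128 - sizeof(atomic_int)];    /* fill the rest of the sector */
} SectoredLock;

_Static_assert(sizeof(SectoredLock) == 128, "one full 128-byte sector");
```

Arrays of SectoredLock therefore place each lock in a distinct sector, which is the property the false-sharing precaution relies on.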
The MutexBench benchmark spawns T concurrent threads. Each thread loops as follows: acquire a central lock L; execute a critical section; release L; execute a non-critical section. At the end of a 10 second measurement interval the benchmark reports the total number of aggregate iterations completed by all the threads. We report the median of 7 independent runs in Figure-1. The critical section advances a C++ std::mt19937 pseudo-random generator (PRNG) 2 steps. The non-critical section is empty. For clarity, and to convey the maximum amount of information to allow a comparison of the algorithms, the X-axis is offset to the minimum score and the Y-axis is logarithmic.
Immediately before acquiring the lock, each thread fetches the value of a shared clock. The critical section advances that value. Subtracting the clock value fetched in the critical section from the value fetched before acquiring the lock gives a useful approximation of the thread’s waiting time, given in units of lock acquisitions. Within the critical section, we record that waiting time value into a global log. After the measurement interval the benchmark harness post-processes the log to produce statistics describing the distribution of the waiting time values, which reflect short-term fairness of the lock algorithm. The critical section also tallies lock migrations. These activities increase the effective length of the critical section.

We ran the benchmark under the following lock algorithms: TTS is a simple test-and-test-and-set lock using classic truncated randomized binary exponential back-off [3, 34] with the back-off duration capped to 100000 iterations of a PAUSE loop; MCS is classic MCS; CNA is described in [19] with the probability of flushing the secondary chain into the primary configured as P = 1/256; Shuffle is Kashyap’s Shuffle Lock [27] aqswonode variant; Fissile is the Fissile algorithm described above with the grace period configured as 50 steps of the TS loop executed by the alpha thread and the CNA flush probability configured as P = 1/256. We picked P = 1/256 to match the default value used by the Shuffle Lock, allowing a fair comparison between that lock and CNA. (The CNA flush interval is set to 50000 iterations of the waiting loop, with the PAUSE instruction, executed by the head of the CNA secondary chain. The Shuffle Lock implementation was taken verbatim from https://github.com/sslab-gatech/shfllock/blob/master/ulocks/src/litl/src/aqswonode.c and integrated into our LD_PRELOAD framework.)
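The per-thread MutexBench loop described above can be sketched in C as follows. This is a single-threaded, fixed-iteration sketch of one worker's loop for the maximal-contention configuration; the real harness runs T such loops concurrently for a timed 10-second interval, and the function name and toy LCG (standing in for std::mt19937) are our own.

```c
#include <pthread.h>

/* One MutexBench worker: acquire the central lock, advance a PRNG
 * 2 steps as the critical section, release, then run the (here empty)
 * non-critical section. Returns the number of iterations completed. */
static long run_worker(pthread_mutex_t *L, unsigned *prng, long iterations) {
    long completed = 0;
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(L);                  /* acquire central lock */
        *prng = *prng * 1103515245u + 12345u;   /* critical section: */
        *prng = *prng * 1103515245u + 12345u;   /* advance PRNG 2 steps */
        pthread_mutex_unlock(L);
        /* non-critical section: empty in the maximal-contention case */
        completed++;
    }
    return completed;
}
```

Because the interposed LD_PRELOAD library supplies pthread_mutex_lock and pthread_mutex_unlock, this same loop body exercises whichever lock algorithm the environment selects.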
In Figure-1 we make the following observations regarding operation at maximal contention with an empty critical section:
• At 1 thread the benchmark measures the latency of uncontended acquire and release operations. MCS and CNA lag behind TTS, Shuffle and Fissile as they lack a fast path.
• At or above 2 threads, most algorithms fall behind TTS, as TTS starves all but one thread for long periods, effectively yielding performance near that found at just one thread.
• Broadly, Fissile outperforms CNA and CNA outperforms Shuffle.
• Above 72 threads we encounter preemption via time slicing. TTS and Fissile are tolerant of preemption, whereas the other forms, with direct handover, encounter a precipitous drop in performance.
In Table-1 we provide additional details for execution at 10 threads. Throughput is given in units of millions of acquires per second, aggregate over all threads; Spread reflects long-term fairness between threads, computed as the maximum number of iterations completed by any thread within the measurement interval divided by the minimum; Migration is the reciprocal of the NUMA lock migration rate. (A Migration value of N indicates that the lock migrated between NUMA nodes 1 out of every N lock acquisitions, on average). The remaining columns describe the distribution of the observed waiting times, which we use to measure short-term fairness. RSTDDEV is the relative standard deviation [40]; Theil-T is the normalized Theil-T index [38, 42] – used in the field of econometrics as a metric of income disparity and unfairness – where a value of 0 is ideally fair and 1 is maximally unfair.
We observe that TTS is deeply unfair over the long term and short term. TTS also exhibits a surprisingly low lock migration rate – on average 1 migration per 323 acquisitions – presumably arising from platform-specific cache line arbitration phenomena, sometimes called the capture effect. Somewhat perversely, this makes TTS implicitly NUMA-friendly, reducing migration rates. TTS is vulnerable to the Matthew Effect [41] – once a thread has entered deeper back-off, it is less likely to acquire the lock in unit time, amplifying subsequent unfairness. The remaining locks show reasonable long-term and short-term fairness.

In Figure-2 we configure the benchmark so the non-critical section generates a uniformly distributed random value and steps the thread-local random number generator that many steps, admitting potential positive scalability. In this moderate contention case we can see that Fissile and TTS locks tend to provide the best performance, although the TTS lock is again unfair. Shuffle, CNA, and Fissile show a positive inflection around 12 threads, as there are sufficient waiting threads to allow NUMA-friendly intra-node handover. Again, we see an abrupt drop in throughput above 72 threads when preemption is active, but note that Fissile and TTS more gracefully tolerate preemption.

(We note in passing that care must be taken when negative or retrograde scaling occurs and aggregate performance degrades as we increase threads. As a thought experiment, if a hypothetical lock implementation were to introduce additional synthetic delays outside the critical path, aggregate performance might increase as the delay throttles the arrival rate and concurrency over the contended lock. As such, evaluating just the maximal contention case in isolation is insufficient.)
Figure 1.
MutexBench: Maximum Contention
         Throughput  Spread  Migration  RSTDDEV  Theil-T
MCS      .297        1.00    1.83       0.01     0.00
CNA      .458        1.06    254        13.5     0.17
TTS      1.85        7.89    323        102      0.44
Shuffle  .344        1.86    234        11.3     0.15
Fissile  1.11        1.26    374        11.8     0.17
Table 1.
Detailed Execution Analysis
Figure 2.
MutexBench: Moderate Contention
2020-05-05 • Copyright Oracle and or its affiliates
Figure 3.
C++ std::atomic
In Figure-3 we use a benchmark harness similar to that of MutexBench but with the following differences. The non-critical section uses a thread-local std::mt19937 pseudo-random number generator (PRNG) to compute a value distributed uniformly in [ , ) and then advances the PRNG that many steps. Instead of an explicit critical section, each iteration executes A.load() where A is a shared instance of std::atomic
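The per-iteration work described above can be sketched as follows. This is our illustrative reconstruction, not the paper's harness code; the distribution bounds are parameters here because the concrete range is elided in this extraction.

```cpp
// Sketch of one iteration: the thread executes A.load() as its
// (trivial) shared-read step, then advances a thread-local
// std::mt19937 PRNG a uniformly distributed number of steps as the
// non-critical section.
#include <atomic>
#include <cstdint>
#include <random>

std::atomic<uint64_t> A{0};  // shared variable read each iteration

uint64_t iterate(std::mt19937& prng, uint32_t lo, uint32_t hi) {
    uint64_t v = A.load();  // the shared read replacing a critical section
    std::uniform_int_distribution<uint32_t> d(lo, hi - 1);
    prng.discard(d(prng));  // advance the PRNG "that many steps"
    return v;
}
```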
Figure 4.
C++ std::atomic on 4-node System

FIFO requests mark their CNA MCS queue element with a "FIFO" flag. CNA culling refrains from shifting such elements into the CNA secondary list. Critically, if element S is marked as FIFO, then no requests that arrive after S on the inner CNA lock will acquire that lock before S. We also suppress bypass over the outer lock while FIFO requests are waiting. To that end, instead of setting and clearing the Impatient field, we modify Fissile slightly to atomically fetch-and-add
Impatient by 2 or -2, respectively. (We also make a corresponding change to the comparison in the grace period loop from == 0 to != 1.) When a FIFO request diverts into the slow path, it increments Impatient by 2 before acquiring the CNA inner lock, and decrements it by 2 after acquiring the outer lock. The request will be serviced in FIFO order, without being bypassed by more recently arrived threads, once it has incremented Impatient – and that value has become visible to threads in the unlock path – and has executed the SWAP instruction that appends the request to the CNA MCS chain. To avoid fairness anomalies and make fairness analysis more tractable, we explicitly do not change the preferred NUMA node when servicing a FIFO request. To demonstrate the efficacy of FIFO-enabled Fissile, we extended the
MutexBench benchmark harness to allow a mixture of normal and FIFO-designated threads, both competing for a common lock. We used 25 normal threads and 2 FIFO threads. Normal threads advance the global PRNG 2 times in the critical section, as described above, and in the non-critical section compute a uniformly distributed random number in [ − ) and advance a thread-local PRNG instance that many steps. FIFO threads execute the same critical section, but use a non-critical section duration randomly selected from the range [ − ), reflecting intermittent low duty-cycle FIFO operations. The FIFO attribute is per-thread (but could also be specified for individual locking operations) and is ignored by all lock implementations except FIFO-enabled Fissile. All FIFO data was taken on the X5-2. Table-2 shows the results, comparing Fissile, FIFO-enabled Fissile, and MCS. We report throughput over a 10 second measurement interval broken out for the normal threads and the FIFO threads. We also report statistics describing the observed wait times, computed in logical lock clock units, for the FIFO threads in isolation. As we can see, Fissile+FIFO provides wait times very close to those afforded by MCS, and with greater throughput for both normal and FIFO threads.
              Throughput       Wait times for FIFO
              FIFO   Normal    RSTDDEV  Worst   Avg   Median
MCS           1.3M   23.0M     0.03     29      24.4  25
Fissile       1.5M   43.9M     52.3     531294  40.7  15
Fissile+FIFO  2.7M   38.8M     0.33     41      11.9  12
Table 2.
FIFO Performance
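The Impatient bookkeeping for FIFO requests described above can be sketched in isolation from the surrounding lock. This is an illustrative fragment; the function names are ours, and the full reference implementation embeds this counter in the complete Fissile protocol.

```cpp
// Sketch of the FIFO Impatient counter discipline: FIFO requests
// adjust the counter by +/-2 with atomic fetch-and-add, preserving the
// low bit for the alpha thread's impatience, and the grace-period
// comparison becomes "!= 1" rather than "== 0".
#include <atomic>

std::atomic<int> Impatient{0};

// Before acquiring the CNA inner lock, a FIFO request raises the counter.
void fifo_enter_slow_path() { Impatient.fetch_add(2); }

// After acquiring the outer lock, the FIFO request lowers it again.
void fifo_acquired_outer() { Impatient.fetch_add(-2); }

// The modified grace-period comparison from the text (was == 0).
bool grace_compare() { return Impatient.load() != 1; }
```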
Fissile locks are compact, NUMA-aware, preemption tolerant, and scalable, but also provide excellent latency at low or no contention. The algorithm is straightforward and easily integrated into existing locking infrastructures. Fissile locks are particularly helpful under contention with high arrival rates and short critical sections. Contended locking uses the CNA lock while uncontended operations use the TS lock. Fissile locks deflect contention away from the TS lock into the CNA lock. Bypass over the outer lock via the fast path is the key to Fissile. While the slow path provides a higher quality NUMA-friendly admission schedule, it also suffers higher latency arising from the more complex lock mechanism. The fast path allows for low latency in the uncontended case, but also improves scalability under contention by augmenting the slow path with an alternative if the slow path lock overheads prove a bottleneck.
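The overall fast-path/slow-path structure summarized above can be sketched as follows. This is a simplified illustration under stated assumptions: the inner CNA lock is replaced by a plain std::mutex placeholder purely so the sketch is self-contained, and the impatience handshake is reduced to a single flag without the grace period used by the real design. All names are ours.

```cpp
// High-level sketch of the Fissile protocol: try the TS outer lock
// once (fast path); otherwise acquire the inner lock, become the alpha
// thread, and busy-wait on the TS lock, asserting impatience to bound
// bypass. Unlock is just a store to the TS lock word.
#include <atomic>
#include <mutex>

struct Fissile {
    std::atomic<int> outer{0};      // TS lock: 0 = free, 1 = held
    std::atomic<int> impatient{0};  // set by the waiting alpha thread
    std::mutex inner;               // stand-in for the CNA slow-path lock

    bool try_fast() {
        // Fast path: a single atomic attempt, deferring to an impatient alpha.
        return impatient.load() == 0 && outer.exchange(1) == 0;
    }
    void lock() {
        if (try_fast()) return;  // acquired via the fast path
        inner.lock();            // slow path: become the alpha thread
        while (outer.exchange(1) != 0) {
            impatient.store(1);  // suppress fast-path bypass while waiting
        }
        impatient.store(0);
        inner.unlock();          // release the inner lock before the CS
    }
    void unlock() { outer.store(0); }  // just release the TS lock
};
```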
Fissile imposes no surprising or onerous requirements. In the Appendix we identify a number of variations on the basic Fissile algorithm that we plan to explore in the future.
An extended version of this paper is available at https://arxiv.org/abs/2003.05025

References
[1] 2020. Go Runtime: mutex implementation. https://github.com/golang/go/blob/master/src/sync/mutex.go
[2] Vitaly Aksenov, Dan Alistarh, and Petr Kuznetsov. 2018. Brief Announcement: Performance Prediction for Coarse-Grained Locking. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing (PODC '18). ACM. https://doi.org/10.1145/3212734.3212785
[3] T. E. Anderson. 1990. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems (1990). https://doi.org/10.1109/71.80120
[4] Jelena Antić, Georgios Chatzopoulos, Rachid Guerraoui, and Vasileios Trigonakis. 2016. Locking Made Easy. In Proceedings of the 17th International Middleware Conference (Middleware '16). ACM. https://doi.org/10.1145/2988336.2988357
[5] Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. 2012. Non-scalable locks are dangerous. Ottawa Linux Symposium (OLS) (2012).
[6] Davidlohr Bueso. 2014. Scalability Techniques for Practical Synchronization Primitives. Commun. ACM (2014). http://doi.acm.org/10.1145/2687882
[7] Milind Chabbi, Michael Fagan, and John Mellor-Crummey. 2015. High Performance Locks for Multi-Level NUMA Systems. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015). Association for Computing Machinery. https://doi.org/10.1145/2688500.2688503
[8] Milind Chabbi and John Mellor-Crummey. 2016. Contention-Conscious, Locality-Preserving Locks. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '16). Association for Computing Machinery. https://doi.org/10.1145/2851141.2851166
[9] Jonathan Corbet. Cramming more into struct page. https://lwn.net/Articles/565097, August 28, 2013. Accessed: 2018-10-01.
[10] Jonathan Corbet. MCS locks and qspinlocks. https://lwn.net/Articles/590243, March 11, 2014. Accessed: 2018-09-12.
[11] Jonathan Corbet. Ticket Spinlocks. https://lwn.net/Articles/267968, February 6, 2008. Accessed: 2018-09-12.
[12] Travis Craig. 1993. Building FIFO and priority-queueing spin locks from atomic swap.
[13] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. 2013. Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). http://doi.acm.org/10.1145/2517349.2522714
[14] Dave Dice. 2015. Malthusian Locks. CoRR abs/1511.06035 (2015). http://arxiv.org/abs/1511.06035
[15] Dave Dice. 2017. Malthusian Locks. In Proceedings of the Twelfth European Conference on Computer Systems (EuroSys '17). http://doi.acm.org/10.1145/3064176.3064203
[16] Dave Dice and Alex Kogan. 2018. Compact NUMA-Aware Locks. CoRR abs/1810.05600 (2018). http://arxiv.org/abs/1810.05600
[17] Dave Dice and Alex Kogan. 2018. TWA – Ticket Locks Augmented with a Waiting Array. CoRR abs/1810.01573 (2018). http://arxiv.org/abs/1810.01573
[18] Dave Dice and Alex Kogan. 2019. Avoiding Scalability Collapse by Restricting Concurrency. In Euro-Par 2019: Parallel Processing – 25th International Conference on Parallel and Distributed Computing, Göttingen, Germany, August 26-30, 2019, Proceedings (Lecture Notes in Computer Science). Springer. https://doi.org/10.1007/978-3-030-29400-7_26
[19] Dave Dice and Alex Kogan. 2019. Compact NUMA-Aware Locks. In Proceedings of the Fourteenth EuroSys Conference 2019 (EuroSys '19). Association for Computing Machinery. https://doi.org/10.1145/3302424.3303984
[20] David Dice, Virendra J. Marathe, and Nir Shavit. 2012. Lock Cohorting: A General Technique for Designing NUMA Locks. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '12). Association for Computing Machinery. https://doi.org/10.1145/2145816.2145848
[21] David Dice, Virendra J. Marathe, and Nir Shavit. 2015. Lock Cohorting: A General Technique for Designing NUMA Locks. ACM Trans. Parallel Comput. (2015). https://doi.org/10.1145/2686884
[22] Stijn Eyerman and Lieven Eeckhout. 2010. Modeling Critical Sections in Amdahl's Law and Its Implications for Multicore Design. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM. https://doi.org/10.1145/1815961.1816011
[23] Rachid Guerraoui, Hugo Guiroux, Renaud Lachaize, Vivien Quéma, and Vasileios Trigonakis. 2019. Lock–Unlock: Is That All? A Pragmatic Analysis of Locking in Software Systems. ACM Trans. Comput. Syst. https://doi.org/10.1145/3301501
[24] Hugo Guiroux, Renaud Lachaize, and Vivien Quéma. 2016. Multicore Locks: The Case Is Not Closed Yet. USENIX Association.
[25] P. H. Ha, M. Papatriantafilou, and P. Tsigas. 2005. Reactive spin-locks: a self-tuning approach. https://doi.org/10.1109/ISPAN.2005.73
[26] Prasad Jayanti, Siddhartha Jayanti, and Sucharita Jayanti. 2020. Towards an Ideal Queue Lock. In Proceedings of the 21st International Conference on Distributed Computing and Networking (ICDCN 2020). Association for Computing Machinery. https://doi.org/10.1145/3369740.3369784
[27] Sanidhya Kashyap, Irina Calciu, Xiaohe Cheng, Changwoo Min, and Taesoo Kim. 2019. Scalable and Practical Locking with Shuffling. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19). Association for Computing Machinery. https://doi.org/10.1145/3341301.3359629
[28] Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. 2017. Scalable NUMA-aware Blocking Synchronization Primitives. USENIX Association.
[29] Beng-Hong Lim and Anant Agarwal. 1994. Reactive Synchronization Algorithms for Multiprocessors. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI). ACM. https://doi.org/10.1145/195473.195490
[30] Waiman Long. 2013. qspinlock: Introducing a 4-byte queue spinlock implementation. https://lwn.net/Articles/561775, July 31, 2013. Accessed: 2018-09-19.
[31] Victor Luchangco, Dan Nussbaum, and Nir Shavit. 2006. Hierarchical CLH Queue Lock. In Euro-Par 2006 Parallel Processing. Springer Berlin Heidelberg. https://doi.org/10.1007/11823285_84
[32] M. Auslander, D. Edelsohn, O. Krieger, B. Rosenburg, and R. Wisniewski. 2003. Enhancement to the MCS lock for increased functionality and improved programmability – U.S. patent application number 20030200457. https://patents.google.com/patent/US20030200457
[33] P. Magnusson, A. Landin, and E. Hagersten. 1994. Queue locks on cache coherent multiprocessors. In Proceedings of 8th International Parallel Processing Symposium. https://doi.org/10.1109/IPPS.1994.288305
[34] John M. Mellor-Crummey and Michael L. Scott. 1991. Algorithms for Scalable Synchronization on Shared-memory Multiprocessors. ACM Trans. Comput. Syst. (1991). http://doi.acm.org/10.1145/103727.103729
[35] Zoran Radović and Erik Hagersten. 2003. Hierarchical Backoff Locks for Nonuniform Communication Architectures. In International Symposium on High Performance Computer Architecture (HPCA). IEEE Computer Society. http://dl.acm.org/citation.cfm?id=822080.822810
[36] H. Schweizer, M. Besta, and T. Hoefler. 2015. Evaluating the Cost of Atomic Operations on Modern Architectures. https://doi.org/10.1109/PACT.2015.24
[37] Michael L. Scott. 2013. Shared-Memory Synchronization. Morgan & Claypool Publishers.
[38] H. Theil. 1967. Economics and Information Theory. North-Holland.
[39] U. Verner, A. Mendelson, and A. Schuster. 2017. Extending Amdahl's Law for Multicores with Turbo Boost. IEEE Computer Architecture Letters (2017). https://doi.org/10.1109/LCA.2015.2512982
[40] Wikipedia Contributors. 2020. Coefficient of Variation. https://en.wikipedia.org/wiki/Coefficient_of_variation
[41] Wikipedia Contributors. 2020. Matthew Effect. https://en.wikipedia.org/wiki/Matthew_effect
[42] Wikipedia Contributors. 2020. Theil index. https://en.wikipedia.org/wiki/Theil_index

▶ CNA with support for expedited admission: CNA can trivially be extended to better support real-time or FIFO acquisition requests by marking the MCS queue element as expedited. Such expedited elements will not be culled into the secondary list of remote threads. Thus, if S is the MCS queue element associated with an expedited lock acquisition operation, then no threads that arrived after S will be admitted before S. In the context of Fissile locks, once an expedited thread acquires the CNA inner lock, it can then immediately assert impatience, inhibiting bypass over the outer TS lock, and providing the desired admission service. ▶ CNA – triggering flushes:
Classic CNA runs Bernoulli trials in the unlock path to decide whether to flush the remote chain into the primary chain and change the preferred NUMA node, in order to provide long-term fairness. We have experimented with variants where the head of the remote chain monitors how long it has waited and, if necessary, sets a flag in its MCS queue element to cue flushing the remote list into the primary MCS chain, in order to avoid starvation. The CNA unlock operator checks that flag, and if set, flushes the remote queue and changes the preferred NUMA node. This approach yields a time-based anti-starvation policy instead of the count-based Bernoulli trials found in the original CNA, and shifts the Bernoulli trial out of the unlock path, replacing it with a fetch of a location that is usually in cache. In addition, we can use polite constructs such as MONITOR-MWAIT for timekeeping. ▶ Probabilistic bounded bypass:
We can provide bounded bypass over the outer TS lock as follows, without requiring an explicit "Impatient" state to be encoded or stored in the lock structure. Briefly, arriving threads run a biased Bernoulli trial with probability of success P = / . With probabilistic bounded bypass the unlock path remains a simple store of 0, and impatience becomes count-based rather than time-based. This allows the CNA inner lock to act as a sieve, filtering threads circulating in the ACS over the outer TS lock. Claim and conjecture: a lazy probabilistic bounded-bypass filter suffices to eventually homogenize the ACS. ▶ Compact single-word form:
We can construct a single-word compact form of Fissile by collapsing the Outer, Inner, and Impatient fields into a single word. This condensed version is appropriate to replace the Linux kernel's qspinlock construct [10, 17, 30]. Briefly, the least significant byte serves as the outer lock, the next most significant bit encodes the impatient state, and the remaining higher order bits encode the tail of the CNA MCS queue. The Fissile unlock operator stores 0 into the separately addressable low-order byte to release the lock. We note that this encoding requires mixed-size atomic accesses to the same location, the safety of which is platform-dependent. The current qspinlock implementation also depends on mixed-size accesses. ▶ Deferred release of the CNA inner lock:
We have investigated deferring the release of the CNA inner lock until we unlock the Fissile lock proper, and specifically after dropping the outer TS lock. This may improve scalability by shifting CNA administrative work (culling and flushing activities) outside and after the TS critical section. While appealing, this change means that MCS queue elements cannot be allocated on-stack, necessitating more complex queue element lifecycle memory management. Each thread must have one allocated queue element for each lock that it holds, whereas with on-stack allocation, each thread has at most one active queue element. Furthermore, we must convey the address of the MCS queue element – the CNA owner's element – from the Fissile acquire operation to the corresponding unlock operation. As noted above, typical locking APIs do not have provisions to pass such information from acquire to unlock. A viable solution is to implement a thread-local cache that contains at most one element, a reference to the MCS queue element for the most recently acquired inner lock held by that thread. In the Fissile slow path, either before or after acquiring the CNA inner lock, a thread checks its cache. If a pending deferred queue element is present, it releases the associated CNA lock and clears the cache. It can then install the current queue element into its cache. In the Fissile unlock operation, after releasing the TS outer lock, the thread again checks and clears its cache, releasing any locks associated with a pending queue element found in the cache. The following benign scenario can arise. A thread acquires a lock L using a queue element E, and our approach defers the release of the CNA inner lock of L. If the thread then acquires another lock before unlocking L, the new queue element displaces E from the thread's cache, forcing the early release of L's inner lock. ▶ Deferred wakeup of the CNA successor:
Instead of deferring the release of the CNA inner lock until the Fissile unlock, and passing the queue element reference from the Fissile acquire operation to the corresponding unlock operation, we can perform a partial MCS release in the Fissile lock operation. The partial release operator clears the MCS Tail variable if there are no other elements on the chain, and otherwise identifies the successor – the next element in the MCS chain. If there is a successor in the chain, the partial release does not notify that thread. Instead, we pass a reference to the successor queue element to the corresponding Fissile unlock operator, which in turn passes ownership to the successor via setting the usual MCS flag in the successor's queue element. This again allows the queue elements to be allocated on-stack and simplifies memory management. As above, we can employ a thread-local cache of 1 element to help convey the successor reference to the unlock operator. The deferred notification can occur at any point between acquiring the TS outer lock and immediately after releasing the TS outer lock. A thread performing a partial release on a lock identifies its successor and caches the reference. While still holding that lock, it might acquire and contend for another lock, displacing the cached successor reference and notifying the successor before the lock is actually released. Such early wakeup is benign. ▶ Simplified encoding of the "Impatient" state:
We remove the explicit Impatient field from the Fissile lock structure. The outer lock word encoding changes as follows: 0 indicates unlocked; 1 indicates locked; 2 indicates locked but with an impatient alpha thread. When the alpha thread becomes impatient, it executes an atomic fetch-and-increment on the outer TS lock word. If the value advanced from 0 to 1, then the alpha thread acquired the outer TS lock. Otherwise the value advanced from 1 to 2, indicating the TS lock was held by some other thread. The alpha thread waits impatiently for the word to change back to 1, at which point it has gained ownership of the outer TS lock. The Fissile unlock operator simply executes an atomic decrement on the lock word, shifting the value from either 2 to 1 or 1 to 0. While more compact, this variant requires an atomic decrement in the unlock path, instead of a simple store. We note that the form that uses an explicit store depends on eventual consistency, where threads in unlock will eventually observe the Impatient = 2 value set by an impatiently waiting alpha thread. ▶ We note that we can use an extra-stage construction [17]. The embodiment described above has 2 stages of waiting for threads that encounter contention: the inner lock and the outer lock. Adding an extra
Gate stage can, paradoxically, act to reduce handover latency. Contended threads in the slow path proceed as follows: acquire the CNA inner lock; wait until the Gate becomes 0; set the Gate to 1; release the inner CNA lock; wait for and acquire the outer TS lock; clear the Gate; execute the critical section; and finally release the outer TS lock. At any given time there can be at most one thread waiting on the gate and at most one thread busy-waiting on the outer lock. Clearing the Gate entails low latency as there is at most one thread waiting on the gate, allowing faster handover and improved pipelining of lock acquisition operations. Note that acquiring the Gate occurs only while holding the inner lock, so atomics are not required. Instead of a simple gate, which allows at most one thread, we could also employ a semaphore to allow a very small number of threads to busy-wait on the outer TS lock.
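The slow-path steps above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the CNA inner lock is replaced by a std::mutex placeholder so the sketch is self-contained, and all names are ours.

```cpp
// Sketch of the extra Gate stage: acquire inner; wait for Gate == 0;
// set Gate = 1 (held inner lock => no race, no atomic RMW needed);
// release inner; acquire outer TS lock; clear Gate; run the CS.
#include <atomic>
#include <mutex>

struct GatedSlowPath {
    std::mutex inner;           // stand-in for the CNA inner lock
    std::atomic<int> gate{0};   // at most one thread waits here
    std::atomic<int> outer{0};  // TS outer lock

    void lock() {
        inner.lock();                             // stage 1: inner lock
        while (gate.load() != 0) { /* spin */ }   // stage 2: wait on the gate
        gate.store(1);                            // claim the gate
        inner.unlock();                           // release inner before outer wait
        while (outer.exchange(1) != 0) { }        // stage 3: acquire outer TS lock
        gate.store(0);                            // clear gate: at most one waiter wakes
    }
    void unlock() { outer.store(0); }
};
```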
Flow schematic – slow path flow: [Inner(N)] ↦ [Gate( )] ↦ *Inner ↦ [Outer( )] ↦ *Gate ↦ CS ↦ *Outer ↦ NCS; where [Inner(N)] indicates that at most N threads wait to acquire the inner lock, and *Inner reflects the corresponding release of the inner lock. ▶ Borrowing from
TWA [17] (TWA-Staged and 3-Stage variations) we can easily construct a 3-Stage Fissile lock where the outer lock is a ticket lock, with differentiated near and far waiting on the ticket lock. Threads in the slow path proceed as follows: acquire the CNA inner lock; fetch-and-increment the ticket variable to assign a unique ticket value to the locking request; busy-wait while the assigned ticket value differs from the ticket lock's grant field by 2 or more (far waiting); release the CNA inner lock; busy-wait until the assigned ticket equals the grant variable (near waiting); enter and execute the critical section; release the outer ticket lock by incrementing the grant variable. A non-atomic increment suffices in the unlock path. In this formulation no bypass is allowed, although a fast path is feasible. The CNA inner lock completely dictates the order of admission, and we use the ticket lock, which has efficient handover under light contention, when a thread nears the front of the conceptual queue of waiting threads. A very small number of threads wait on the ticket lock at any given time, leveraging its excellent behavior in that mode. ▶ Impatience policies:
Myriad policies are possible for setting and clearing the "impatient" state. Our reference implementation uses a simple form where the alpha thread waits, if necessary becomes impatient, and then cancels the impatient state once it obtains the outer TS lock. Other possibilities include leaving the impatient state set for all threads currently waiting on the inner lock, or until the inner lock drains to the empty state [1].
The "Go" Mutex uses a mostly LIFO semaphore in a manner similar to that of the Fissile inner lock, and similar to the LIFO-CR construct described in [14]. ▶ TS Tunables:
Our current implementation allows an arriving thread to try the TS lock just once. Experiments suggest, however, some benefit to allowing a brief bounded polite spinning phase, potentially with back-off, before diverting into the slow path, as long as the lock remains impatient. In addition, in our current implementation, the alpha thread uses a simple busy-wait loop with no back-off. Employing some moderate back-off policy may be useful. While these avenues show promise, our current implementation has no TTS tunable values and exhibits desirable parameter parsimony. ▶ Spin-then-Park waiting:
As described, stalled threads use simple busy-waiting. We note, however, that it is relatively simple to modify Fissile locks so that threads waiting on the CNA inner lock will park – descheduling themselves via the operating system so the CPU where they were running can dispatch other ready threads or become idle. ▶ CNA Administrative responsibilities:
NUMA-aware reorganization of the MCS chain – who, what, where, when, and how. In order of preference:

1. A waiter on the chain reorganizes outside the critical section – delegated helping.
2. A thread reorganizes the chain after having dropped the lock, outside and after the CS. This thread could still be performing useful work, however, so we are borrowing it to help.
3. Threads reorganize the chain in unlock while still holding the lock, on the critical path, potentially extending the effective critical section duration. This impacts lock handover response time – the time needed to convey ownership to a successor – and scalability. Classic CNA-MCS uses this approach.

Shuffle, for instance, delegates NUMA reorganization to other waiting threads, allowing parallelism between reorganization and execution of the critical section. While elegant, for short critical sections with intense contention the coherent communication costs can come to dominate performance and make this approach unprofitable. That is, the granularity of work being delegated does not overcome the communication costs to delegate and coordinate. And when the critical sections are longer, any additional CNA overheads in the critical path are less important, so delegated execution often has no appreciable benefit relative to CNA. ▶ TS: polite vs impolite
Anderson [3] observed that "polite" test-and-test-and-set locks may be a better choice for contended locks than simple "impolite" test-and-set locks. Test-and-test-and-set locks are polite in the sense that they first load and check the lock word before conditionally attempting the atomic operation to acquire the lock. Simple test-and-set locks are optimistic and forego that load and check, and simply try the atomic operation. But if the lock is already held, such futile atomic operations may generate unnecessary coherence and write invalidation. A test-and-test-and-set strategy acts to reduce those overheads and the rate of failed atomic operations. We have found, however, that a simple impolite test-and-set policy is appropriate for the outer Fissile TS lock, ostensibly as there is at most one thread busy-waiting on the outer lock at any given time. Relatedly, polite test-and-test-and-set TTS acquisition may incur more coherence bus transactions in the case where the lock is not held but the lock word is not in cache or is in dirty state in a remote cache, as would be the case if the previous owner ran on a CPU with a different cache. Absent coherence probe speculation, one bus operation is needed to load the lock and a second to upgrade the cache line to written state. This scenario is common if the lock instance is promiscuous – being locked by multiple different threads, but with little or no contention. Performant test-and-set locks will typically probe the lock directly with an atomic on arrival, optimistically assuming they can acquire the lock, but then shift to a polite TTS mode as they busy-wait. This strategy reduces latency in the uncontended case and acts to avoid unnecessary write invalidation and coherence traffic in the contended case.
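The hybrid arrival policy described above – an impolite optimistic probe on arrival, followed by polite test-and-test-and-set spinning while contended – can be sketched as follows. Names are ours; this is an illustrative sketch, not the paper's implementation.

```cpp
// Sketch of a test-and-set lock that probes impolitely on arrival,
// then shifts to polite (read-only) spinning while it busy-waits.
#include <atomic>

struct TSLock {
    std::atomic<int> word{0};  // 0 = free, 1 = held

    void lock() {
        if (word.exchange(1) == 0) return;   // impolite optimistic probe
        for (;;) {                           // polite TTS while contended
            while (word.load() != 0) { }     // read-only spin: no write invalidation
            if (word.exchange(1) == 0) return;
        }
    }
    void unlock() { word.store(0); }
};
```

The read-only spin keeps the lock line in shared state across waiters, deferring the atomic exchange until the word is observed free.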
Fissile requires two types of anti-starvation. The first is in CNA, to ensure that lock operations from remote NUMA nodes are eventually serviced, and the second is managed by the alpha thread to avoid indefinite bypass over the TS outer lock. ▶ Linux kernel qspinlock:
The Linux qspinlock construct [9, 10, 30] is a compact 32-bit lock, even on 64-bit architectures. The low-order bits of the lock word constitute a simple test-and-set lock while the upper bits encode the tail of an MCS chain. The result is a hybrid of MCS and test-and-set. In order to fit into a 32-bit word – a critical requirement – the chain is formed by logical CPU identifiers instead of traditional MCS queue node pointers. Arriving threads attempt to acquire the test-and-set lock embedded in the low order bits of the lock word. This attempt fails if the test-and-set lock is held or if the MCS chain is populated. If successful, they enter the critical section; otherwise they join the MCS chain embedded in the upper bits of the lock word. When a thread becomes the owner of the MCS lock, it can wait for the test-and-set lock to become clear, at which point it claims the test-and-set lock, releases the MCS lock, and then enters the critical section. The MCS aspect of qspinlock is used only when there is contention. The unlock operator simply clears the test-and-set lock. The MCS lock is never held over the critical section, but only during contended acquisition. Only the owner of the MCS lock spins on the test-and-set lock, reducing coherence traffic. Qspinlock is strictly FIFO. While the technique employs local spinning on the MCS chain, unlike traditional MCS, arriving and departing threads will both update the common lock word, increasing coherence traffic and degrading performance relative to classic MCS. Qspinlock incorporates an additional optimization where the first contending thread spins on the test-and-set lock instead of using the MCS path. Traditional MCS does not fit well in the Linux kernel as (a) the constraint that a low-level spin lock instance be only 32-bits is a firm requirement, and (b) the lock-unlock API does not provide a convenient way to pass the owner's MCS queue node address from lock to unlock.
We note that qspinlocks replaced classic ticket locks as the kernel's primary low-level spin lock mechanism in 2014, and ticket locks replaced test-and-set locks, which are unfair and allow unbounded bypass, in 2008 [11]. We note that the kernel provides a specialized qspinlock form for paravirtualized environments within virtual machines, the so-called "PV-friendly" qspinlock. We believe the same properties that make Fissile tolerant of preemption also make it a good fit for such environments. https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c
This provides a LOITER-style [15] lock, with the outer lock consisting of a test-and-set lock and the inner lock consisting of the MCS lock, with both locks embedded in the same 32-bit word. ▶ Original abstract
Classic test-and-test (TS) mutual exclusion locks [3] are simple, and enjoy high performance and low latency of ownership transfer under light or no contention, but do not scale gracefully under high contention. Furthermore, TS locks do not provide any admission order guarantees, and may allow sustained starvation of waiting threads and long-term unfairness. Such concerns led to the development of scalable queue-based locks such as MCS locks (Mellor-Crummey and Scott) [34] and NUMA-aware variants thereof such as Compact NUMA-aware Locks (CNA) [16, 19]. Both MCS and CNA scale under load, but have more complicated lock handover operations than TS and suffer higher latencies at low contention. We propose Fissile locks, which capture the most desirable properties of both TS and CNA. A Fissile lock consists of two underlying locks: a TS lock and a CNA lock. Acquiring ownership of the TS lock confers ownership of the compound Fissile lock. Arriving threads first use an atomic instruction to try to acquire the TS lock. If successful, they immediately enter the critical section, and we say the Fissile lock was acquired via the fast path. If the fast path attempt fails, the thread then acquires the CNA lock and busy-waits on the TS lock, releases the CNA lock, and enters the critical section. Releasing a Fissile lock entails just releasing the TS lock. Contended locking uses the CNA lock while uncontended operations use the TS lock. Fissile locks deflect contention away from the TS lock into the CNA lock. To avoid TS-based starvation, the thread holding the CNA lock and waiting on the TS lock can become "impatient" and cue direct handover of ownership the next time the TS lock is released, bounding bypass. The result is a highly scalable NUMA-aware lock that performs like TS at low contention, enjoying low latency, and like CNA at high contention. ▶ Comparison of lock properties:
               NUMA-Aware  Bypass   TS Fast-path  Uncontended Unlock
QSpinlock      No          No       Yes           Store
Go Mutex       No          Bounded  Yes           Atomic Decrement
MCS            No          No       No            CAS
CNA            Yes         No       No            CAS
QSpinlock+CNA  Yes         No       Yes           Store
Shuffle Locks  Yes         No       Yes           Store
Fissile Locks  Yes         Bounded  Yes           Store