Hemlock : Compact and Scalable Mutual Exclusion
Dave Dice ([email protected]), Oracle Labs, USA
Alex Kogan ([email protected]), Oracle Labs, USA
ABSTRACT
We present Hemlock, a novel mutual exclusion locking algorithm that is extremely compact, requiring just one word per thread plus one word per lock, but which still provides local spinning in most circumstances, high throughput under contention, and low latency in the uncontended case. Hemlock is context-free – not requiring any information to be passed from a lock operation to the corresponding unlock – and FIFO. The performance of Hemlock is competitive with and often better than the best scalable spin locks.
CCS CONCEPTS

• Software and its engineering → Multithreading; Mutual exclusion; Concurrency control; Process synchronization.

KEYWORDS
Synchronization, Locks, Mutual Exclusion, Scalability
INTRODUCTION

Locks often have a crucial impact on the performance of parallel software, hence they remain a focus of intensive research. Many locking algorithms have been proposed over the last several decades. Ticket Locks [24, 35, 42] are simple and compact, requiring just two words for each lock instance and no per-thread data. They perform well in the absence of contention, exhibiting low latency because of short code paths. Under contention, however, performance suffers [18] because all threads contending for a given lock will busy-wait on a central location, increasing coherence costs. For contended operation, so-called queue-based locks, such as CLH [12, 34] and MCS [36], provide relief via local spinning [21]. For both CLH and MCS, arriving threads enqueue an element (sometimes called a "node") onto the tail of a queue and then busy-wait on a flag in either their own element (MCS) or the predecessor's element (CLH). Critically, at most one thread busy-waits on a given location at any one time, increasing the rate at which ownership can be transferred from thread to thread relative to techniques that use global spinning, such as Ticket Locks.

Hemlock is inspired by CLH, and, like CLH, threads wait on a field associated with the predecessor. Hemlock, however, avoids the use of queue nodes, freeing the implementation from lifecycle concerns – allocating, releasing, caching – associated with that structure. The lock and unlock paths are extremely simple. An uncontended lock operation requires just an atomic SWAP (exchange) operation, and unlock just a compare-and-swap (CAS), which is the same as MCS. Like Ticket Locks, MCS and CLH locks, Hemlock provides FIFO admission.

Hemlock is compact, requiring just one word per extant lock plus one word per thread, regardless of the number of locks held or waited upon. Like MCS and CLH, the lock contains a pointer to the tail of the queue of threads waiting on that lock, or null if the lock is not held. The thread at the head of the queue is the owner. In MCS the queue is implemented as an explicit linked list running from the head (owner) to the tail. In CLH the queue is implicit and each thread waits on a field in its predecessor's element. CLH also requires that a lock in unlocked state be provisioned with an empty queue element. When that lock is destroyed, the element must be recovered. Hemlock avoids that requirement.
Instead of using queue nodes, Hemlock provisions each thread with a singular Grant field where its successor can busy-wait. Normally the Grant field – which acts as a mailbox between a thread and its predecessor on the queue – is null, indicating empty. During the unlock operation, a thread installs the address of the lock into its Grant field and then waits for that field to return to null. The successor thread observes the lock address appear in its predecessor's Grant field, which indicates that ownership has transferred. The successor then responds by clearing the Grant field, acknowledging receipt of ownership and allowing the Grant field of its predecessor to be reused in subsequent handover operations, and then finally enters the critical section.

Under simple contention, when a thread holds one lock at a time, Hemlock provides local spinning. But if we have one thread 𝑇 that holds multiple contended locks, multiple threads may busy-wait on 𝑇's Grant field. As multiple threads (via multiple locks) can be busy-waiting on 𝑇's Grant field, 𝑇 writes the address of the lock being released into its Grant field to disambiguate and allow the specific successor to determine that ownership has been conveyed. We note that simple contention is a common case for many applications. This is supported by the surveys of Cheng et al. [6] and O'Callahan et al. [39] – which found that it is rare for a thread to hold multiple locks at a given time – as well as by our profiling of LevelDB, described below. This suggests that Hemlock would enjoy local spinning in many practical settings.
The advantages of not using queue nodes, however, do not end in reduced space and a simplified implementation that avoids node lifecycle concerns. Both CLH and MCS need to convey the address of the owner (head) node from the lock operation to the unlock operation. The unlock operation needs that node to find the successor, and to reclaim nodes from the queue so that nodes may be recycled. While the locking API could be modified to accommodate this requirement, it is inconvenient for the classic POSIX pthread locking interface, in which case the usual approach to support MCS is to add a field to the lock body that points to the head, allowing that value to be passed from the lock operation to the corresponding unlock operation. This new field is protected by the lock itself, but accesses to the field execute within the effective critical section and may also induce additional coherence traffic. Hemlock requires no extra field in the lock body to convey the head: it is context-free, with no information passed from the lock operation to the corresponding unlock operation.

2021-02-19 • Copyright Oracle and/or its affiliates
We start by describing a simplified version of the Hemlock algorithm, with pseudo-code provided in Listing-1. In Section 2.1 we describe key performance optimizations, and the pseudo-code for the optimized Hemlock algorithm is given in Listing-2.
Self refers to a thread-local structure containing the thread's Grant field. Threads arrive in the lock operator at line 8 and atomically swap their own address into the lock's Tail field, obtaining the previous tail value, constructing the implicit FIFO queue. If the Tail field was null, then the caller acquired the lock without contention and may immediately enter the critical section. Otherwise the thread waits for the lock's address to appear in the predecessor's Grant field, signalling succession, at which point the thread restores the predecessor's Grant field to null (empty), indicating that the field can be reused for subsequent unlock operations by the predecessor. The thread has been granted ownership by its predecessor and may enter the critical section. Clearing the Grant field, above, is the only circumstance in which one thread may store into another thread's Grant field. Threads in the queue hold the address of their immediate predecessor, obtained as the return value from the SWAP operation, but do not know the identity of their successor, if any.

In the unlock operator, at line 16, threads initially use an atomic compare-and-swap (CAS) operation to try to swing the lock's Tail field from the address of their own thread, Self, back to null, which represents unlocked. If the CAS was successful then there were no waiting threads and the lock was released by the CAS. Otherwise waiters exist and the thread then writes the address of the lock 𝐿 into its own Grant field, alerting the waiting successor and passing ownership. Finally, the outgoing thread waits for that successor to acknowledge the transfer and restore the Grant field back to empty, indicating that the field may be reused for future locking operations. Waiting for the mailbox to return to null happens outside the critical section, after the thread has conveyed ownership.

In Hemlock, transfer of ownership in unlock is address-based, where the outgoing owner writes the lock address into its own Grant field, whereas under MCS and CLH ownership transfer is via a boolean written into a queue element monitored by the immediate successor.

Threads that attempt to release a lock that they do not hold will stall indefinitely at Line 21, waiting for an acknowledgement that will never arrive. This property makes it easy to identify and debug the offending thread and unlock operation.
MCS and Hemlock allow trivial implementations of the TryLock operation – using CAS instead of SWAP – whereas Ticket Locks and CLH do not. An uncontended lock acquisition requires an atomic SWAP for MCS, CLH and Hemlock and an atomic fetch-and-add for Ticket Locks. An uncontended unlock – no waiters – requires an atomic CAS for MCS and Hemlock, and simple stores for CLH and Ticket Locks, while a contended unlock, which passes ownership to a waiter, requires a store for MCS, CLH and Ticket Locks.

In Listing-1 line 21, threads in the unlock operator must wait for the successor to acknowledge receipt of ownership, indicating the unlocking thread's Grant mailbox is again available for communication in subsequent locking operations. That is, the recipient needs to take positive action and respond before the previous owner can return from the unlock operator. While this phase of waiting occurs outside and after the transfer of ownership – crucially, not within the effective critical section or on the critical path – such waiting may still impede the progress and latency of the thread that invoked unlock. Specifically, we have tightly coupled back-and-forth synchronous communication, where the thread executing unlock stores into its Grant field and then waits for a response from the successor, while the successor, running in the lock operator, waits for the transfer indication (line 11) and then responds to the unlocking thread and acknowledges by restoring Grant to null (line 12). The unlock operator must await a positive reply from the successor in order to safely reuse the Grant field for subsequent operations. That is, the algorithm must not start an unlock operation until the previous contended unlock has completed and the successor has emptied the mailbox. We note that MCS, in the unlock operator, must also wait for the successor executing in the lock operator to establish the back-link that allows the owner to reach the successor. That is, both MCS and Hemlock have wait loops in the contended unlock path where threads may need to wait for the arriving successor to become visible to the current owner. Compared to MCS and CLH, the only additional burden imposed by Hemlock that falls inside the critical path is the clearing of the predecessor's Grant field by the recipient (Line 12), which is implemented as a single store.

To mitigate the performance concern described above, we could optimize Hemlock to defer and shift the waiting-for-response phase (Listing-1 line 21) to the prologue of subsequent lock and unlock operations, allowing more useful overlap and concurrency between the successor, which clears the Grant field, and the thread which performed the unlock operation. The thread that called unlock may enter its critical section earlier, before the successor clears Grant. Ultimately, however, we opted to forgo this particular optimization in our implementation as it provided little observable performance benefit. While the Grant mailbox field might appear to be a source of contention and to potentially induce additional coherence traffic, a given thread can release only one lock at a time, mitigating that concern.
The synchronous back-and-forth communication pattern where a thread waits for ownership and then clears the Grant field (Listing-1 Lines 11-12) is inefficient on platforms that use MESI or MESIF "single writer" cache coherence protocols [26, 27]. Specifically, in unlock, when the owner stores the lock address into its Grant field (Line 20), it drives the cache line underlying Grant into M-state (modified) in its local cache. Subsequent polling by the successor (Line 11) results in a coherence miss that will pull the line back into the successor's cache in S-state (shared). The successor will then observe the waited-for lock address and proceed to clear Grant (Line 12), forcing an upgrade from S to M state in the successor's cache and invalidating the line from the cache of the previous owner, adding a delay in the critical path. We avoid the upgrade coherence transaction by polling with CAS (Listing-2 Line 9) instead of using simple loads, so, once the hand-over is accomplished and the successor observes the lock address, the line is already in M-state in the successor's local cache. We refer to this technique as the Coherence Traffic Reduction (CTR) optimization.

As an alternative to busy-waiting with CAS, we can achieve equivalent performance by using an atomic fetch-and-add of 0 – implemented via LOCK:XADD on x86 – on Grant as a read-with-intent-to-write primitive, and, after observing the waited-for lock address appear in Grant, issuing a normal store to clear Grant. That is, we simply replace the load instruction in the traditional busy-wait loop with a fetch-and-add of 0. Busy-waiting with an atomic read-modify-write operator, such as CAS, SWAP or fetch-and-add, is typically considered a performance anti-pattern. For instance, Anderson [5] observed that test-and-test-and-set locks are superior to crude test-and-set locks when there are multiple waiters. But in our case, with the 1-to-1 communication protocol used on the Grant field in Hemlock, busy-waiting via read-modify-write atomic operations provides a performance benefit. Because of the simple communication pattern, back-off in the busy-waiting loop is not useful. We also apply CTR in the unlock operator at Listing-2 Line 15, as we expect the Grant field will be written by that same thread in subsequent unlock operations.

Related approaches to coherence-optimized waiting have been described [22]. Using MONITOR-MWAIT [2] to wait for invalidation, instead of waiting for a value, has promise, but the facility is not yet available in user-mode on Intel processors. MWAIT may confer additional benefits, as it avoids a classic busy-wait loop and thus avoids branch mispredictions in the critical path to exit the loop when ownership has transferred. In addition, depending on the implementation, MWAIT may be more "polite" with respect to yielding pipeline resources, potentially allowing other threads, including the lock owner, to execute faster by reducing competition for shared resources. We might also busy-wait via hardware transactional memory, where invalidation will cause an abort, serving as a hint to the waiting thread. In addition, other techniques to hold the line in M-state are possible, such as issuing stores to a dummy variable that abuts the Grant field but which resides on the same cache line. Using the prefetchw prefetch-for-write advisory "hint" instruction would appear workable but yielded no performance improvement in our experiments.
The CTR optimization is specific to the shared memory communication pattern used in Hemlock, and is not directly applicable to other lock algorithms. All Hemlock performance data reported herein uses the CTR optimization unless otherwise noted. We used the Linux perf stat command to collect data from the hardware performance monitoring unit counters and found that CTR resulted in a reduction in the number of load operations that "hit" on a line in M-state in another core's cache – requiring write invalidation and transfer back to the requester's cache – and also a reduction in total offcore traffic, while providing an improvement in throughput. We can show similar benefits from CTR with a simple program where a set of concurrent threads are configured in a ring and circulate a single token. A thread waits for its mailbox to become non-zero, clears the mailbox, and deposits the token in its successor's mailbox. Using CAS, SWAP or Fetch-and-Add to busy-wait improves the circulation rate as compared to the naive form which uses loads. Future Intel processors may support user-mode umonitor and umwait instructions [10]. We hope to use those instructions in future Hemlock experiments.
Figure-1 shows an example configuration of a set of threads and locks in Hemlock. 𝐿1–𝐿7 represent locks and 𝐴–𝑁 represent threads. Solid arrows reflect the lock's explicit Tail pointer, which points to the most recently arrived thread – the tail of the lock's queue. Dashed arrows, which appear between threads, refer to a thread's immediate predecessor in the implicit queue associated with a lock. The address of the immediate predecessor is obtained via the atomic SWAP executed when threads arrive. The dashed edges can be thought of as the busy-waits-on relation and are not physical links in memory that could be traversed. In the example, 𝐴 holds one lock, 𝐵 holds two locks, 𝐸 holds three locks (among them 𝐿4), and 𝐾 holds two locks; 𝐴, 𝐵 and 𝐸 execute in their critical sections while all the other threads are stalled waiting for locks. The queue of waiting threads for one of 𝐵's locks consists of 𝐶 (the immediate successor) followed by 𝐷. 𝐷's predecessor is 𝐶, and, equivalently, 𝐶's successor is 𝐷. Thread 𝐷 busy-waits on 𝐶's Grant field and 𝐶 busy-waits on 𝐵's Grant field. Threads 𝐻 and 𝐽 both busy-wait on 𝐺's Grant field. In simple locking scenarios Hemlock provides local waiting, but when the dashed lines form junctions (elements with in-degree greater than one) in the waits-on directed graph, we find non-local spinning, or multi-waiting. Similarly, in our contrived example, 𝑁 and 𝐺 both wait on 𝐹. While our design admits inter-lock performance interference, arising from multiple threads spinning on one Grant variable, as is the case for the threads waiting on 𝐹, we believe this case to be rare and not of consequence for common applications. (For comparison, CLH and MCS do not allow the concurrent sharing of queue elements, and thus provide local spinning, whereas Hemlock has a shared singleton queue element – effectively the Grant field – that can be subject to being busy-waited upon by multiple threads). Crucially, if we have a set of coordinating threads where each thread acquires only one lock at a time, then they will enjoy local spinning. Non-local spinning can occur only when threads hold multiple locks. Specifically, the worst-case number of threads that could be busy-waiting on a given thread 𝑇's Grant field is 𝑀, where 𝑀 is the number of locks held simultaneously by 𝑇.

When 𝐸 ultimately unlocks 𝐿4, 𝐸 installs a pointer to 𝐿4 in its Grant field. Thread 𝐹 observes that store, assumes ownership, clears 𝐸's Grant field back to empty (null) and enters the critical section. When 𝐹 then unlocks 𝐿4, it deposits 𝐿4 in its Grant field. Threads 𝐺 and 𝑁 both monitor 𝐹's Grant field, with 𝐺 waiting for 𝐿4 and 𝑁 waiting for a different lock. Both observe the store, but 𝑁 ignores the change while 𝐺 notices the value now matches 𝐿4, the lock that 𝐺 is waiting on, which indicates that ownership of 𝐿4 has passed to 𝐺. 𝐺 clears 𝐹's Grant field, indicating that 𝐹 can reuse that field for subsequent operations, and enters the critical section.

Table-1 characterizes the space utilization of MCS, CLH, Ticket Locks, and Hemlock. The values in the
Lock column reflect the size of the lock body. For MCS and CLH we assume that the head of the chain is carried in the lock body, and thus the lock consists of head and tail fields, requiring 2 words in total. E represents the size of a queue element. CLH requires the lock to be preinitialized with a
Figure 1: Object Graph in Hemlock

so-called dummy element before use. When the lock is ultimately destroyed, the current dummy element must be recovered.

Table 1: Space Usage

              Lock  Held  Wait  Thread  Init
MCS              2     E     E       0
CLH              2     E     E       E     •
Ticket Locks     2     0     0       0
Hemlock          1     0     0       1

The
Held field indicates the space cost for each held lock and, similarly, the Wait field indicates the cost in space of waiting for a lock. The Thread column reflects per-thread state that must be reserved for locking. For Hemlock, this is the Grant field. A single word suffices, although to avoid false sharing we opted to sequester the Grant field as the sole occupant of a cache line. In our implementation we also elected to align and pad the MCS and CLH queue nodes to reduce false sharing and to provide a fair comparison, raising the size of 𝐸 to a cache line. Init indicates if the lock requires non-trivial constructors and destructors. CLH, for instance, requires that the current dummy node be released when a lock is destroyed.

Taking MCS as an example, let's say lock 𝐿 is owned by thread 𝑇1 while threads 𝑇2 and 𝑇3 wait to acquire 𝐿. The lock body for 𝐿 requires 2 words and the MCS chain consists of elements 𝐸1 ⇒ 𝐸2 ⇒ 𝐸3, where 𝐸1, 𝐸2 and 𝐸3 were contributed by 𝑇1, 𝑇2 and 𝑇3, respectively. 𝐿's head field points to 𝐸1, the owner's element, and the tail field points to 𝐸3. The space consumed in this configuration is 2 words for 𝐿 itself plus 3 ∗ 𝐸 for the queue elements. In comparison, Hemlock consumes one word for 𝐿 and 3 words of thread-local state for the Grant fields.

In MCS, when a thread acquires a lock, it contributes an element to the associated queue, and when that element reaches the head of the queue, the thread becomes the owner. In the subsequent unlock operation, the thread extracts and reclaims that same element from the queue. In CLH, a thread contributes an element and, once it has acquired the lock, recovers a different element from the queue – elements migrate between locks and threads. The MCS and CLH unlock operators require dependent loads and indirection to locate queue nodes, while Hemlock avoids these overheads. In MCS, if the unlock operation is known to execute in the same stack frame as the lock operation, the queue element may be allocated on stack. This is not the case for CLH. As previously noted, Hemlock avoids elements and their management.

The K42 [33, 43] variation of MCS can recover the queue element before returning from lock, whereas classic MCS recovers the queue element in unlock. That is, under K42, a queue element is needed only while waiting but not while the lock is held, and as such, queue elements can always be allocated on stack, if desired. While appealing, the paths are much more complex and touch more cache lines than the classic version, impacting performance.

If a lock site is well-balanced – with the lock and corresponding unlock operators lexically scoped and executing in the same stack frame – a Hemlock implementation can opt to use an on-stack Grant field instead of the thread-local Grant field accessed via Self. This optimization, which can be applied on an ad-hoc site-by-site basis, also acts to reduce multi-waiting on the thread-local Grant field.
In this section, we argue that the Hemlock algorithm is a correct implementation of a mutual exclusion lock with the FIFO admission and so-called fere-local spinning properties. We define those properties more formally below, but first we note that we consider the standard model of shared memory [28] with basic atomic read and write operations as well as more advanced atomic SWAP, CAS and FAA operations. We presume atomic operators with the usual semantics.
The SWAP operation receives two arguments, address and new value, and atomically reads the old value at the given address, writes the given new value, and returns the old value. The CAS (compare-and-swap) operation receives three arguments, address, old value and new value, and atomically reads the value at the given address, compares it to the given old value, and, if equal, writes the given new value. We say in this case that the CAS is successful. Otherwise, if the current value is different from the given old value, CAS does not make any change, and we say it is unsuccessful. We assume the CAS instruction returns the current value it has read. We note that given the return value, one can identify whether CAS has been successful by comparing that value to the old one used in the invocation of CAS. Finally, the FAA (fetch-and-add) instruction, which is required only for the optimized version of our algorithm, receives two arguments, address and delta, and atomically reads the old value at the given address, adds the delta (which can be any integer number), and writes the result back. The FAA operation returns the old value, before the increment. We note that while most existing hardware architectures support CAS, some do not support SWAP or FAA. Where such support is not available, those instructions can be easily emulated using CAS.
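As the note above suggests, SWAP and FAA are easily built from a CAS retry loop; a minimal sketch:

```cpp
#include <atomic>

// Emulating SWAP and FAA with a CAS retry loop (illustrative sketch).
int emulated_swap(std::atomic<int>& a, int newValue) {
  int old = a.load();
  // Retry until the CAS installs newValue over an unchanged current value;
  // on failure, compare_exchange refreshes 'old' with the current value.
  while (!a.compare_exchange_weak(old, newValue)) { /* retry */ }
  return old;  // the value read just before the successful write
}

int emulated_faa(std::atomic<int>& a, int delta) {
  int old = a.load();
  while (!a.compare_exchange_weak(old, old + delta)) { /* retry */ }
  return old;  // the value before the increment
}
```

Both loops terminate once the CAS observes an unchanged value, yielding exactly the read-modify-write semantics described above.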
Multiple threads perform execution steps, where at each step a thread may perform local computation or execute one of the atomic operations on the shared memory. We assume threads use the Hemlock algorithm to protect access to one or more critical sections, i.e., specially marked blocks of code that must be executed by at most one thread at a time. Our arguments are formulated for the simplified version of the algorithm given in Listing-1, and as such, all line references in this section are w.r.t. Listing-1. Yet, we note that the correctness arguments apply, albeit with minor modifications, to the optimized version in Listing-2.

We call Lines 5–13 the entry code and Lines 14–21 the exit code. Each thread cycles between the entry code (where it is trying to get into the critical section), critical section code, exit code (where it is cleaning up to allow other threads to execute their critical sections) and the so-called remainder section, where it executes code that does not belong to any of the other three sections [32]. We assume the order in which threads take their execution steps is unknown, yet no thread ceases execution in the entry, exit or critical sections. In other words, if a thread 𝑇 is in any of those code sections at time 𝑡, it is guaranteed that, eventually, at some time 𝑡′ > 𝑡, 𝑇 would perform its next execution step. We also assume that each thread executes a finite number of steps in the critical section.

We refer to Line 8 as the entry doorstep of the entry code and Line 20 as the exit doorstep of the exit code for lock 𝐿. We say that a thread is spinning on a word 𝑊 if its next execution step is reading from a shared memory location 𝑊 inside the while-loop (e.g., in Line 11 in Listing-1).

(Footnotes: balanced lock sites include, for example, Java synchronized blocks and methods, C++ std::lock_guard and std::scoped_lock, or locking constructs that allow the critical section to be expressed as a lambda. "Fere-local" means mostly or frequently local.)
We say a lock 𝐿 is associated with a thread 𝑇 if 𝑇 has executed the entry doorstep for 𝐿, but has not completed the exit code for 𝐿. We prove the following properties for the Hemlock algorithm, defined with respect to any instance of a lock 𝐿.

• Mutual exclusion: At any point in time, at most one thread is in the critical section.
• Lockout freedom: Any thread that starts executing the entry code eventually completes the exit code.
• FIFO: Threads enter the critical section in the order in which they execute the entry doorstep.
• Fere-local spinning: At any point in time, the number of spinning threads on the same word is bounded by the maximum number of locks associated with any thread at that time.

We note that lockout-freedom is a stronger property than the more common deadlock-freedom property [32]. Also, we note that if a thread never executes entry code of one lock inside the critical section of another (i.e., each thread has at most one associated lock), then fere-local spinning implies local spinning, i.e., each spinning thread reads a different word 𝑊. Furthermore, fere-local spinning is a dynamic property, e.g., the bound at time 𝑡 does not depend on the maximum number of locks associated with a thread prior to 𝑡.

We start with an auxiliary lemma. We denote the Self variable that (contains the
Grant field and) belongs to a thread 𝑇𝑖 as Self𝑖.

Lemma 1. For any lock 𝐿, if L→Tail is null, there is no thread that executed the entry doorstep but has not executed the exit doorstep. In particular, there is no thread in the critical section protected by 𝐿.

Proof. The claim trivially holds initially at the beginning of the execution, when L→Tail is null. Let 𝑇𝑗 be the first thread for which the claim does not hold. That is, 𝑇𝑗 is the first thread for which SWAP in Line 8 returns null, yet there is a thread 𝑇𝑘 that has executed that SWAP before 𝑇𝑗 but has not executed CAS in Line 16 yet. Let 𝑇𝑖 be the last thread that set L→Tail to null before 𝑇𝑗 (𝑇𝑖 might be the same thread as 𝑇𝑘 or a different one). From the inspection of the code, once L→Tail is set to a non-null value in the entry doorstep, it can revert to null only by a successful CAS in Line 16. For CAS in Line 16 to be successful, it has to be executed by the last thread that executed SWAP in Line 8. In other words, if 𝑇𝑖 executes a successful CAS in Line 16 at time 𝑡2 and it executed the corresponding SWAP at time 𝑡1 (𝑡1 < 𝑡2), then no other thread executed SWAP at any time 𝑡 with 𝑡1 < 𝑡 < 𝑡2.

Consider when 𝑇𝑘 executes the SWAP instruction in Line 8 w.r.t. [𝑡1, 𝑡2]. Case 1: 𝑇𝑘 executes SWAP at time 𝑡 < 𝑡1. Let 𝑇𝑙 be the next thread that executes SWAP after 𝑇𝑘 (𝑇𝑙 can be the same as 𝑇𝑖, or a different thread). Since L→Tail contains Self𝑘, 𝑇𝑙 will enter the while-loop in Line 11. It will exit the loop only when Self𝑘→Grant changes to 𝐿, which can happen, according to the code, only in Line 20 when 𝑇𝑘 executes the exit code. By induction on the number of threads that executed SWAP between 𝑡 and 𝑡1, 𝑇𝑖 can execute the successful CAS in Line 16 only after 𝑇𝑘 executes the exit code. This means that when 𝑇𝑗 executes SWAP in Line 8, 𝑇𝑘 has executed CAS in Line 16 – a contradiction.

Case 2: 𝑇𝑘 executes SWAP at time 𝑡 > 𝑡2. This means that when 𝑇𝑗 executes SWAP in Line 8, L→Tail contains either Self𝑘 or Self𝑙 for some other thread 𝑇𝑙 that executes SWAP after 𝑇𝑘 – a contradiction to the fact that 𝑇𝑗's SWAP returned null. □

With this lemma, we prove the correctness property for Hemlock.

Theorem 2.
The Hemlock algorithm provides mutual exclusion.
Proof. By way of contradiction, assume 𝑇𝑖 and 𝑇𝑗 are simultaneously in the critical section protected by the same lock 𝐿. Let 𝑡𝑖 and 𝑡𝑗 be the points in time when 𝑇𝑖 and 𝑇𝑗 executed Line 8 for the last time, respectively. Without loss of generality, assume 𝑡𝑖 > 𝑡𝑗. Consider the value returned by SWAP in Line 8 when executed by thread 𝑇𝑖. If the returned value is null, by Lemma 1 𝑇𝑗 must have executed its CAS instruction in Line 16 before 𝑡𝑖. Hence, 𝑇𝑖 will execute the critical section after 𝑇𝑗 has completed its own – a contradiction.

If the returned value is Self𝑘 ≠ null (for some thread 𝑇𝑘 that might be the same as 𝑇𝑗 or a different one), let 𝑇𝑙 be the thread that executes SWAP in Line 8 right after 𝑇𝑗 and before 𝑇𝑗 executes CAS in Line 16. 𝑇𝑙 might be the same as 𝑇𝑖 or a different thread. 𝑇𝑙 waits in Line 11 for Self𝑗→Grant to become 𝐿. Self𝑗→Grant can only change to 𝐿 (from null) in Line 20 by thread 𝑇𝑗. When this happens, 𝑇𝑗 is outside of the critical section. By induction on the number of threads that execute SWAP in (𝑡𝑗, 𝑡𝑖], when 𝑇𝑖 observes the transfer indication in Line 11 and subsequently enters the critical section, 𝑇𝑗 is outside of the critical section – a contradiction. □

Next, we prove the progress property for Hemlock. We do so by showing first that a thread cannot get stuck in the exit code, i.e., the exit code is lockout-free.

Lemma 3.
Every thread 𝑇 𝑖 exiting the critical section eventually completes the exit code.

Proof. The only place in the exit code where a thread 𝑇 𝑖 may iterate indefinitely is the while-loop in Line 21. In the following, we argue that 𝑇 𝑖 either completes the exit code without reaching Line 21, or eventually breaks out of the loop in Line 21. Let 𝑡 1 be the time 𝑇 𝑖 executes the last SWAP instruction in Line 8 before entering the critical section and 𝑡 2 be the time it executes the CAS instruction in Line 16 when it starts the exit code. Consider the following two cases. Case 1: no thread executes SWAP in [ 𝑡 1 , 𝑡 2 ]. In this case, L → Tail contains Self i and the CAS instruction in Line 16 is successful. Therefore, CAS returns Self i , and 𝑇 𝑖 can complete the exit code in a constant number of its steps by skipping Lines 18–21. Case 2: at least one thread executes SWAP in [ 𝑡 1 , 𝑡 2 ]. Let 𝑇 𝑘 be the first such thread. Thus, CAS in Line 16 is not successful, and it returns
Self j , for some 𝑗 ≠ 𝑖 (perhaps 𝑗 = 𝑘 ). (We note that, by Lemma 1, CAS cannot return null .) Therefore, 𝑇 𝑖 reaches the while-loop in Line 21, after storing 𝐿 into Self i → Grant in Line 20. Consider the execution steps of thread 𝑇 𝑘 after its SWAP instruction. The SWAP instruction returns Self i , and so 𝑇 𝑘 reaches the while-loop in Line 11. After 𝑇 𝑖 executes Line 20, eventually 𝑇 𝑘 reads 𝐿 from Self i → Grant and breaks out of the while-loop in Line 11. Next, it executes Line 12, storing null into Self i → Grant . Finally, 𝑇 𝑖 eventually reads null in Line 21 and breaks out of the while-loop. □

Next, we show that a thread cannot get stuck in the entry code either, but first we prove a simple auxiliary lemma. Lemma 4.
The SWAP instruction in Line 8 executed by thread 𝑇 𝑖 never returns Self i .

Proof. From code inspection, only thread 𝑇 𝑖 can write Self i into L → Tail . Thus, the claim holds until 𝑇 𝑖 executes SWAP at least for the second time. Let 𝑇 𝑖 execute SWAP in Line 8 for the 𝑘 -th time, 𝑘 ≥ 2, at time 𝑡 𝑘 . Consider the previous, ( 𝑘 −1)-th execution of SWAP by 𝑇 𝑖 , at time 𝑡 𝑘−1 . From code inspection, 𝑇 𝑖 has to execute CAS in Line 16 at time 𝑡 𝑘−1 < 𝑡 < 𝑡 𝑘 . If CAS is successful, 𝑇 𝑖 changes the value of L → Tail to null , and thus the 𝑘 -th SWAP will return null or Self j for 𝑗 ≠ 𝑖 . If CAS is unsuccessful, there has been (at least one) another thread 𝑇 𝑗 , 𝑗 ≠ 𝑖 , that performed SWAP in Line 8 at time 𝑡 𝑘−1 < 𝑡 ′ < 𝑡 . Thus, the 𝑘 -th SWAP at time 𝑡 𝑘 > 𝑡 ′ will return Self j , or Self l (for some 𝑙 ≠ 𝑖, 𝑗 ), or null , but not Self i . □ Lemma 5.
Every thread 𝑇 𝑖 starting the entry code eventually enters the critical section.

Proof. The only place in the entry code where a thread 𝑇 𝑖 may iterate indefinitely is the while-loop in Line 11. In the following, we argue that 𝑇 𝑖 either completes the entry code without reaching Line 11, or eventually breaks out of the loop in Line 11. Consider the following two cases w.r.t. the value returned by SWAP executed by thread 𝑇 𝑖 in the entry code at time 𝑡 𝑖 . Case 1: SWAP returns null . In this case, 𝑇 𝑖 can complete the entry code in a constant number of its steps by skipping Lines 9–12. Case 2: SWAP returns Self j . Thus, 𝑇 𝑖 reaches the while-loop in Line 11 and waits until Self j → Grant contains 𝐿 . From Lemma 4, we know that 𝑗 ≠ 𝑖 . Consider the state of thread 𝑇 𝑗 w.r.t. the value returned by SWAP executed by thread 𝑇 𝑗 in Line 8 at time 𝑡 𝑗 < 𝑡 𝑖 . If 𝑇 𝑗 ’s SWAP returned null , 𝑇 𝑗 will complete the entry code, and eventually reach Line 20 in the exit code. Otherwise, 𝑇 𝑗 ’s SWAP returned Self k . If Self k is equal to Self i , then 𝑇 𝑖 executed (another) SWAP at time 𝑡 ′ 𝑖 < 𝑡 𝑗 . This means that 𝑇 𝑖 has executed the exit code in the interval ( 𝑡 ′ 𝑖 , 𝑡 𝑖 ), and in particular, has executed Line 20 in the interval ( 𝑡 𝑗 , 𝑡 𝑖 ). Therefore, 𝑇 𝑗 will break out of the while-loop in Line 11, enter the critical section, and eventually execute Line 20, allowing 𝑇 𝑖 to complete its entry code. If Self k is not equal to Self i , consider whether at time 𝑡 𝑗 , 𝑇 𝑘 has completed the while-loop in Line 11 (including by skipping that while-loop entirely by evaluating the condition in Line 9 to false). If so, 𝑇 𝑘 will complete the entry code, and eventually reach Line 20 in the exit code, letting 𝑇 𝑗 and, eventually, 𝑇 𝑖 break out of the while-loop in Line 11. Otherwise, 𝑇 𝑘 is waiting in Line 11. (We note that there is a third possibility: that 𝑇 𝑘 has executed SWAP , but has not evaluated the condition in Line 9 yet, or has evaluated it to true, but has not started the while-loop in Line 11. We treat it as one of the first two possibilities, according to whether or not 𝑇 𝑘 eventually waits in the while-loop in Line 11.)

2021-02-19 • Copyright Oracle and or its affiliates

In the case 𝑇 𝑘 is waiting in Line 11, we consider recursively the state of 𝑇 𝑘 w.r.t. the value returned by its SWAP , and any thread 𝑇 𝑘 is waiting for in Line 11. Since the number of threads is bounded, there have to be two threads, 𝑇 𝑎 and 𝑇 𝑏 , s.t. 𝑇 𝑎 ’s SWAP returns Self b and 𝑇 𝑏 ’s SWAP returns either (a) null , or (b) Self c for 𝑇 𝑐 in { 𝑇 𝑖 , 𝑇 𝑗 , 𝑇 𝑘 , . . . , 𝑇 𝑎 }, or (c) Self c for 𝑇 𝑐 that has completed the while-loop in Line 11. Following similar reasoning as above, we conclude that 𝑇 𝑏 eventually executes Line 20 in its exit code, and allows 𝑇 𝑎 to break out of the waiting loop in Line 11. By induction on the number of threads in the set { 𝑇 𝑖 , 𝑇 𝑗 , 𝑇 𝑘 , . . . , 𝑇 𝑎 }, we conclude that, eventually, 𝑇 𝑖 completes the while-loop in Line 11 and enters the critical section. □ Theorem 6.
The Hemlock algorithm is lockout-free.
Proof. This follows directly from Lemmas 3 and 5, and the assumption that a thread completes its critical section in a finite number of its execution steps. □

Next, we prove that threads enter the critical section in FIFO order w.r.t. their execution of the entry doorstep. In the following lemma, we show that when two threads execute the entry doorstep one after the other, the latter thread cannot “skip” over the former and enter the critical section first. Lemma 7.
Let 𝑇 𝑖 be the next thread that executes the entry doorstep after 𝑇 𝑗 . Then 𝑇 𝑖 enters the critical section after 𝑇 𝑗 .

Proof. First, we note that the claim trivially holds if 𝑖 = 𝑗 . This is because 𝑇 𝑖 may execute another entry doorstep only after (entering and) exiting the critical section. Next, we consider two cases. Case 1: 𝑇 𝑖 ’s execution of the SWAP instruction in the entry doorstep returns null . This can only happen if 𝑇 𝑗 performs CAS in Line 16 before 𝑇 𝑖 executes the SWAP instruction. This means, however, that 𝑇 𝑗 has completed its critical section, and the claim holds. Case 2: 𝑇 𝑖 ’s execution of the SWAP instruction in the entry doorstep returns Self j . Then 𝑇 𝑖 will proceed to Line 11, and wait until Self j → Grant changes to 𝐿 . This can only happen when 𝑇 𝑗 reaches Line 20, which means that, once again, 𝑇 𝑗 has completed its critical section, and the claim holds. □ Theorem 8.
The Hemlock algorithm has the FIFO property.
Proof. By way of contradiction, assume there is a thread 𝑇 𝑖 that executes the entry doorstep after a thread 𝑇 𝑗 , but enters the critical section before 𝑇 𝑗 . Without loss of generality, let 𝑇 𝑖 be the first such thread in the execution of the algorithm. Let 𝑇 𝑘 be the thread that executes the entry doorstep right before 𝑇 𝑖 ( 𝑇 𝑘 might be the same thread as 𝑇 𝑗 or a different one). By the way we chose 𝑇 𝑖 , 𝑇 𝑘 has not entered the critical section when 𝑇 𝑖 does. This is a contradiction to Lemma 7. □

We are left to prove the last stated property of Hemlock, namely fere-local spinning. Again, we start with an auxiliary lemma. Lemma 9.
For every lock 𝐿 and thread 𝑇 𝑖 , there is at most one thread 𝑇 𝑗 waiting in Line 11 for Self i → Grant to become 𝐿 .

Proof. Consider thread 𝑇 𝑗 waiting in Line 11 for Self i → Grant to become 𝐿 . To reach Line 11, 𝑇 𝑗 executed Line 8, where SWAP returned Self i . This, in turn, means that 𝑇 𝑖 has also executed Line 8 (before 𝑇 𝑗 did). This is because Line 8 is the only place where Self k can be written into L → Tail , for any thread 𝑇 𝑘 . Assume by way of contradiction that another thread 𝑇 𝑘 is also waiting in Line 11 for Self i → Grant to become 𝐿 . Let 𝑡 𝑗 and 𝑡 𝑘 be the points in time when 𝑇 𝑗 and 𝑇 𝑘 executed Line 8 for the last time, respectively. From the atomicity of SWAP , 𝑡 𝑗 ≠ 𝑡 𝑘 . Assume without loss of generality that 𝑡 𝑗 < 𝑡 𝑘 . Let 𝑡 𝑖 be the time 𝑇 𝑖 executed the SWAP for the last time before 𝑡 𝑗 . From the above, 𝑡 𝑖 < 𝑡 𝑗 < 𝑡 𝑘 . From the inspection of the code, the only way for 𝑇 𝑘 to write Self i into pred in Line 8 is for 𝑇 𝑖 to execute Line 8 right before 𝑇 𝑘 does. That is, there has to be a point in time 𝑡 𝑗 < 𝑡 ′ 𝑖 < 𝑡 𝑘 in which 𝑇 𝑖 executed Line 8 again. This means that during ( 𝑡 𝑖 , 𝑡 ′ 𝑖 ), 𝑇 𝑖 has completed the entry code, its critical section, and the exit code (and started executing another entry code). When executing the exit code, 𝑇 𝑖 performed CAS in Line 16. If this CAS is successful, it takes place before 𝑡 𝑗 (since the value of L → Tail remains unchanged), and so 𝑇 𝑗 would not read Self i into pred in Line 8 at 𝑡 𝑗 . Thus, this CAS has to fail, i.e., return a value different from Self i . Thus, 𝑇 𝑖 has to execute Lines 20–21, and in particular, wait until its Grant field contains null . This happens before 𝑡 ′ 𝑖 and hence before 𝑡 𝑘 ; therefore 𝑇 𝑗 is the only thread at this point that waits in Line 11 for Self i → Grant to become 𝐿 . Since 𝑇 𝑖 completes its exit code (and executes SWAP at 𝑡 ′ 𝑖 ), it must have exited the while-loop in Line 21 before 𝑡 ′ 𝑖 . This can happen only if 𝑇 𝑗 has executed Line 12 after 𝑡 𝑗 and before 𝑡 ′ 𝑖 . Thus, 𝑇 𝑗 no longer waits in Line 11 for Self i → Grant to become 𝐿 when 𝑇 𝑘 starts to wait there – a contradiction. □

Note that, as explained in Section 2.2, there might be multiple threads spinning on the word
Self i → Grant in Line 11, each for a different lock 𝐿 . However, as we argue in the lemma above, there might be only one thread per any given lock that waits for the value of Self i → Grant to change. Theorem 10.

The Hemlock algorithm has the fere-local spinning property.

Proof. Assume thread 𝑇 𝑖 has 𝑘 associated locks at the given point in time. By inspecting the code, threads can spin on a word only in Lines 11 or 21. By Lemma 9, there might be at most 𝑘 threads spinning on Self i → Grant in Line 11, one for each of the 𝑘 locks associated with 𝑇 𝑖 . (We note that by the definition of the associated locks, a thread 𝑇 𝑗 cannot spin on Self i → Grant and wait until it contains a value of a lock that is not associated with 𝑇 𝑖 .) At the same time, only 𝑇 𝑖 can spin on Self i → Grant in Line 21, and it does so after writing 𝐿 into Self i → Grant in Line 20. This means that when 𝑇 𝑖 starts spinning on Self i → Grant , another thread 𝑇 𝑗 stops spinning on Self i → Grant in Line 11. We note that it can be easily shown that such a 𝑇 𝑗 exists. Thus, at any given point in time, the number of threads spinning on Self i → Grant is bounded by 𝑘 . □

While mutual exclusion remains an active research topic [41, 12, 36, 18, 29, 17, 16, 15, 19, 1, 21, 43], we focus on locks closely related to our design.
Simple test-and-set or polite test-and-test-and-set [43] locks are compact and exhibit excellent latency for uncontended operations, but fail to scale and may allow unfairness and even indefinite starvation. Ticket Locks are compact and FIFO and also have excellent latency for uncontended operations, but they too fail to scale because of global spinning, although some variations attempt to overcome this obstacle at the cost of increased space [18, 20, 40]. For instance, Anderson’s array-based queueing lock [4, 5] is based on Ticket Locks but provides local spinning. It employs a waiting array for each lock instance, sized to ensure there is at least one array element for each potentially waiting thread, yielding a potentially large footprint. The maximum number of participating threads must be known in advance when initializing the lock. Queue-based locks such as MCS or CLH are FIFO and provide local spinning and are thus more scalable. MCS is used in the Linux kernel for the low-level “qspinlock” construct [7, 9, 31]. Modern extensions of MCS edit the queue order to make the lock NUMA-aware [17].
Unless otherwise noted, all data was collected on an Oracle X5-2 system. The system has 2 sockets, each populated with an Intel Xeon E5-2699 v3 CPU running at 2.30GHz. Each socket has 18 cores, and each core is 2-way hyperthreaded, yielding 72 logical CPUs in total. The system was running Ubuntu 20.04 with a stock Linux version 5.4 kernel, and all software was compiled using the provided GCC version 9.3 toolchain at optimization level “-O3”. 64-bit C or C++ code was used for all experiments. Factory-provided system defaults were used in all cases, and Turbo mode [44] was left enabled. In all cases default free-range unbound threads were used. We implemented all user-mode locks within LD_PRELOAD interposition libraries that expose the standard POSIX pthread_mutex_t programming interface, using the framework from [21]. This allows us to change lock implementations by varying the LD_PRELOAD environment variable and without modifying the application code that uses locks. The C++ std::mutex construct maps directly to pthread_mutex primitives, so interposition works for both C and C++ code. All busy-wait loops used the Intel PAUSE instruction.
The MutexBench benchmark spawns 𝑇 concurrent threads. Each thread loops as follows: acquire a central lock L; execute a critical section; release L; execute a non-critical section. At the end of a 10 second measurement interval the benchmark reports the total number of aggregate iterations completed by all the threads. We report the median of 7 independent runs in Figure-2, where the critical section is empty as well as the non-critical section, subjecting the lock to extreme contention. (At just one thread, this configuration also constitutes a useful benchmark for uncontended latency). The 𝑋 -axis reflects the number of concurrently executing threads contending for the lock, and the 𝑌 -axis reports aggregate throughput: the tally of all loops executed by all the threads in the measurement interval. For clarity, and to convey the maximum amount of information to allow a comparison of the algorithms, the 𝑋 -axis is offset to the minimum score and the 𝑌 -axis is logarithmic. We ran the benchmark under the following lock algorithms: MCS is classic MCS; CLH is classic CLH; Ticket is a classic Ticket Lock; Hemlock is the Hemlock algorithm, with the CTR optimization, described above. For the MCS and CLH locks, our implementation stores the current head of the queue – the owner – in a field adjacent to the tail, so the lock body size was 2 words. The Ticket Lock also has a size of 2 words, while Hemlock requires a lock body of just 1 word. MCS and CLH additionally require one queue element for each lock held or waited upon. CLH also requires that each lock be initialized with a so-called dummy element. To avoid memory allocation during the measurement interval, the CLH and MCS implementations use a thread-local stack of free queue elements. In Figure-2 we make the following observations regarding operation at maximal contention with an empty critical section:
• At 1 thread the benchmark measures the latency of uncontended acquire and release operations. Ticket Locks are the fastest, followed by Hemlock, CLH and MCS.
• As we increase the number of threads, Ticket Locks initially do well but then fade, exhibiting a precipitous drop in performance.
• Broadly, Hemlock performs slightly better than or the same as CLH or MCS.
To gauge the contribution of the CTR optimization, we examine execution at 32 threads. The simplistic reference implementation, shown in Listing-1, yields throughput of 3.41 million operations/second. Applying CTR to the baseline yields 4.49 million operations/second.
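The loop structure just described can be sketched as a simplified harness. This is a stand-in, not the authors' benchmark: std::mutex substitutes for the interposed lock, the measurement interval is a parameter rather than a fixed 10 seconds, and the critical and non-critical sections are empty as in the maximum-contention configuration.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Simplified MutexBench-style harness: nthreads threads hammer one central
// lock and we report the aggregate iteration count for the interval.
long run_bench(int nthreads, std::chrono::milliseconds interval) {
    std::mutex lock;                 // stand-in for the interposed lock
    std::atomic<bool> stop{false};
    std::atomic<long> total{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&] {
            long iters = 0;
            while (!stop.load(std::memory_order_relaxed)) {
                lock.lock();         // empty critical section
                lock.unlock();
                ++iters;             // empty non-critical section
            }
            total.fetch_add(iters);
        });
    }
    std::this_thread::sleep_for(interval);
    stop.store(true);
    for (auto& w : workers) w.join();
    return total.load();
}
```

A real run would repeat this 7 times per thread count and report the median.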
Sensitivity analysis: breakdown and contribution of AH and CTR optimizations
In Figure-3 we configure the benchmark so the non-critical section generates a uniformly distributed random value in [ − ) and steps a thread-local C++ std::mt19937 random number generator (PRNG) that many steps, admitting potential positive scalability. The critical section advances a shared random number generator 5 steps. In this moderate contention case we observe that Ticket Locks again do well at low thread counts, and that Hemlock outperforms both MCS and CLH.

Maximum extreme contention – maximizes arrival rate
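The moderate-contention loop bodies can be sketched as follows. The upper bound of the uniform range did not survive extraction above, so the 200 used here is purely a stand-in value.

```cpp
#include <mutex>
#include <random>

std::mutex bench_lock;       // the central lock L
std::mt19937 shared_rng;     // shared PRNG advanced inside the critical section

// One iteration of the moderate-contention loop: the critical section
// advances the shared PRNG 5 steps; the non-critical section steps a
// thread-local PRNG a uniformly distributed number of times.
// NOTE: the 200 bound below is a stand-in; the paper's bound was lost above.
void iteration(std::mt19937& local_rng) {
    bench_lock.lock();
    shared_rng.discard(5);                       // critical section
    bench_lock.unlock();
    std::uniform_int_distribution<int> d(0, 199);
    local_rng.discard(d(local_rng));             // non-critical section
}
```

Stepping a thread-local PRNG a random amount randomizes the arrival rate, which is what admits potential positive scalability.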
To show that our approach is general and portable, we next report MutexBench results on a Sun/Oracle T7-2 [11] in Figures 4 and 5. The T7-2 has 2 sockets, each socket populated by an M7 SPARC CPU running at 4.13GHz with 32 cores. Each core has 8 logical CPUs sharing 2 pipelines. The system has 512 logical CPUs and was running Solaris 11. We used the GCC version 6.1 toolchain to compile the benchmark and the lock libraries. 64-bit SPARC does not directly support atomic fetch-and-add or SWAP operations – these are emulated by means of a 64-bit compare-and-swap operator ( CASX ). To implement CTR in the waiting phase, we used MONITOR-MWAIT on the predecessor’s Grant field followed by an immediate CASX to try to reset
Grant , avoiding the promotion from shared to modified state which would normally be found in naive busy-waiting. As needed, CASX(A,0,0) serves as the read-with-intent-to-write primitive. The system uses MOESI cache coherency instead of the MESIF [26] found in modern Intel-branded processors, allowing more graceful handling of write sharing. The abrupt performance drop experienced by all locks starting at 256 threads is caused by competition for pipeline resources.

We note in passing that care must be taken when negative or retrograde scaling occurs and aggregate performance degrades as we increase threads. As a thought experiment, if a hypothetical lock implementation were to introduce additional synthetic delays outside the critical path, aggregate performance might increase as the delay throttles the arrival rate and concurrency over the contended lock [25]. As such, evaluating just the maximal contention case in isolation is insufficient.

Figure 2: MutexBench: Maximum Contention
Figure 3: MutexBench: Moderate Contention
Figure 4: MutexBench: Maximum Contention – SPARC
Figure 5: MutexBench: Moderate Contention – SPARC
Figure 6: MutexBench: Maximum Contention – AMD
Figure 7: MutexBench: Moderate Contention – AMD
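The CASX-based emulation of SWAP described above can be rendered generically as a compare-and-swap retry loop; this is a portable C++ sketch of the idea, not the SPARC assembly sequence itself.

```cpp
#include <atomic>

// Emulate atomic SWAP (exchange) using only compare-and-swap, as on
// 64-bit SPARC where SWAP and fetch-and-add are synthesized from CASX.
template <typename T>
T cas_swap(std::atomic<T>& target, T desired) {
    T observed = target.load();
    // On failure, compare_exchange_weak reloads `observed`; retry until we
    // install `desired` over a value we actually observed.
    while (!target.compare_exchange_weak(observed, desired)) {
    }
    return observed;  // the value displaced by the swap
}
```

Under contention this loop may retry, which is one reason native SWAP is preferable where the ISA provides it.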
Figures 6 and 7 show performance on a 2-socket AMD NUMA system, where each socket contains an EPYC 7662 64-Core Processor and each core supports 2 logical CPUs, for 256 logical processors in total. The base clock speed is 2.0 GHz. The kernel was Linux version 5.4 and we used the same binaries built on the Intel X5-2 system. AMD uses a MOESI coherence protocol. The results on AMD concur with those observed on the Intel system.
Figure 8: LevelDB readrandom
In Figure-8 we used the “readrandom” benchmark in the LevelDB version 1.20 database, varying the number of threads and reporting throughput from the median of 5 runs of 50 seconds each. Each thread loops, generating random keys and then trying to read the associated value from the database. We used the Oracle X5-2 system to collect data. We first populated a database and then collected data. We made a slight modification to the db_bench benchmarking harness to allow runs with a fixed duration that reported aggregate throughput. Ticket Locks exhibit a slight advantage over MCS, CLH and Hemlock at low thread counts, after which Ticket Locks fade. LevelDB uses coarse-grained locking, protecting the database with a single central mutex: DBImpl::Mutex . Profiling indicates contention on that lock via leveldb::DBImpl::Get() . Using an instrumented version of Hemlock we characterized the application behavior of LevelDB, as it relates to Hemlock. At 64 threads, during a 50 second run, we found 24 instances of calls to lock where a thread already held at least one other lock. These all occurred during the first second after startup. The maximum number of locks held simultaneously by any thread was 2. The maximum number of threads waiting simultaneously on any Grant field was 1, thus the application enjoyed purely local spinning.

leveldb.org
db_bench --threads=1 --benchmarks=fillseq --db=/tmp/db/
db_bench --threads=<threads> --benchmarks=readrandom --use_existing_db=1 --db=/tmp/db/ --duration=50

An interesting variation we intend to explore in the future is to replace the simplistic spinning on the
Grant field with a per-thread condition variable and mutex pair that protect the Grant field, allowing threads to use the same waiting policy as the platform mutex and condition variable primitives. All long-term waiting for the Grant field to become a certain address or to return to 0 would be via the condition variable. Essentially, we treat Grant as a bounded buffer of capacity 1 protected in the usual fashion by a condition variable and mutex. This construction yields 2 interesting properties: (a) the new lock enjoys a fast path, for uncontended locking, that doesn’t require any underlying mutex or condition variable operations; (b) even if the underlying system mutex isn’t FIFO, our new lock provides strict FIFO admission. Again, the result is compact, requiring only a mutex, condition variable and Grant field per thread, and only one word per lock to hold the Tail . For systems where locks outnumber threads, such an approach would result in space savings.
Hemlock is exceptionally simple with short paths, and avoids the dependent loads and indirection required by CLH or MCS to locate queue nodes. The contended handover critical path is extremely short – the unlock operator conveys ownership to the successor in an expedited fashion. Despite being compact, it provides local spinning in common circumstances and scales better than Ticket Locks. Instead of traditional queue elements, as found in CLH and MCS, we use a per-thread shared singleton element. Finally, Hemlock is practical and readily usable in real-world lock implementations.
ACKNOWLEDGMENTS
We thank Peter Buhr and Trevor Brown at the University of Waterloo for access to their AMD system.
REFERENCES [1] Ole Agesen, David Detlefs, Alex Garthwaite, Ross Knippel, Y. S. Ramakrishna, and Derek White. An efficient meta-lock for implementing ubiquitous synchronization.
SIGPLAN Notices OOPSLA 1999 , 1999. doi:10.1145/320385.320402 .[2] H. Akkan, M. Lang, and L. Ionkov. HPC runtime support for fast and powerefficient locking and synchronization. In , 2013. URL: http://dx.doi.org/10.1109/CLUSTER.2013.6702659.[3] Vitaly Aksenov, Dan Alistarh, and Petr Kuznetsov. Brief announcement: Per-formance prediction for coarse-grained locking. In
Proceedings of the 2018 ACMSymposium on Principles of Distributed Computing , PODC ’18. ACM, 2018. URL:http://doi.acm.org/10.1145/3212734.3212785, doi:10.1145/3212734.3212785 .[4] J.H. Anderson, Y.J. Kim, and T. Herman. Shared-memory mutual exclusion:major research trends since 1986.
Distributed Computing , 2003. URL: https://doi.org/10.1007/s00446-003-0088-6.[5] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors.
IEEE Transactions on Parallel and Distributed Systems , 1990. doi:10.1109/71.80120 .[6] Guang-Ien Cheng, Mingdong Feng, Charles E. Leiserson, Keith H. Randall, andAndrew F. Stark. Detecting data races in cilk programs that use locks. In
Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms andArchitectures , SPAA ’98. Association for Computing Machinery, 1998. doi:10.1145/277651.277696 .[7] Jonathan Corbet. MCS locks and qspinlocks. https://lwn.net/Articles/590243,March 11, 2014. Accessed: 2018-09-12.[8] Jonathan Corbet. A surprise with mutexes and reference counts. https://lwn.net/Articles/575460, December 4, 2013.[9] Jonathan Corbet. Mcs locks and qspinlocks, 2014. URL: https://lwn.net/Articles/590243/. emlock : Compact and Scalable Mutual Exclusion
CoRR , abs/1511.06035, 2015. URL: http://arxiv.org/abs/1511.06035, arXiv:1511.06035 .[15] Dave Dice. Malthusian locks. In
Proceedings of the Twelfth European Conferenceon Computer Systems , EuroSys ’17, 2017. URL: http://doi.acm.org/10.1145/3064176.3064203.[16] Dave Dice and Alex Kogan. Compact numa-aware locks.
CoRR , abs/1810.05600,2018. URL: http://arxiv.org/abs/1810.05600.[17] Dave Dice and Alex Kogan. Compact numa-aware locks. In
Proceedings of theFourteenth EuroSys Conference 2019 , EuroSys ’19. Association for ComputingMachinery, 2019. doi:10.1145/3302424.3303984 .[18] Dave Dice and Alex Kogan. TWA - ticket locks augmented with a waiting array.In
Euro-Par 2019: Parallel Processing - 25th International Conference on Paralleland Distributed Computing, Göttingen, Germany, August 26-30, 2019, Proceedings .Springer, 2019. doi:10.1007/978-3-030-29400-7\_24 .[19] Dave Dice and Alex Kogan. Fissile locks, 2020. URL: https://arxiv.org/abs/2003.05025, arXiv:2003.05025 .[20] David Dice. Brief announcement: A partitioned ticket lock. In
Proceedingsof the Twenty-third Annual ACM Symposium on Parallelism in Algorithms andArchitectures , SPAA ’11, 2011. URL: http://doi.acm.org/10.1145/1989493.1989543.[21] David Dice, Virendra J. Marathe, and Nir Shavit. Lock cohorting: A generaltechnique for designing numa locks.
ACM Trans. Parallel Comput. , 2015. URL:http://doi.acm.org/10.1145/2686884, doi:10.1145/2686884 .[22] David Dice and Mark Moir. Inter-thread communication using processormessaging– us patent 10,776,154, 2008. URL: https://patents.google.com/patent/US10776154B2/en.[23] Stijn Eyerman and Lieven Eeckhout. Modeling critical sections in amdahl’slaw and its implications for multicore design. In
Proceedings of the 37th AnnualInternational Symposium on Computer Architecture , ISCA ’10. ACM, 2010. URL:http://doi.acm.org/10.1145/1815961.1816011, doi:10.1145/1815961.1816011 .[24] M. J. Fischer, N. A. Lynch, J. E. Burns, and A. Borodin. Resource allocation withimmunity to limited process failure. In , 1979. URL: http://dx.doi.org/10.1109/SFCS.1979.37.[25] Carlos Gershenson and Dirk Helbing. When Slower is Faster.
CoRR
Computer Architecture, Sixth Edition:A Quantitative Approach . Morgan Kaufmann Publishers Inc., San Francisco, CA,USA, 6th edition, 2017.[28] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctnesscondition for concurrent objects.
ACM Trans. Program. Lang. Syst. , 1990. doi:10.1145/78969.78972 .[29] Prasad Jayanti, Siddhartha Jayanti, and Sucharita Jayanti. Towards an ideal queuelock. In
Proceedings of the 21st International Conference on Distributed Computingand Networking , ICDCN 2020. Association for Computing Machinery, 2020. URL:https://doi.org/10.1145/3369740.3369784.[30] Doug Lea. The java.util.concurrent synchronizer framework.
Science of Com-puter Programming doi:https://doi.org/10.1016/j.scico.2005.03.007 .[31] Waiman Long. qspinlock: Introducing a 4-byte queue spinlock implementation.https://lwn.net/Articles/561775, July 31, 2013, 2013. Accessed: 2018-09-19.[32] Nancy A. Lynch.
Distributed Algorithms . Morgan Kaufmann Publishers Inc., 1996.[33] M. Auslander, D. Edelsohn, O. Krieger, B. Rosenburg, and R. Wisniewski. Enhancement to the MCS lock for increased functionality and improved programmability – U.S. patent application number 20030200457, 2003. URL: https://patents.google.com/patent/US20030200457.[34] P. Magnusson, A. Landin, and E. Hagersten. Queue locks on cache coherent multiprocessors. In
Proceedings of 8th International Parallel Processing Symposium ,1994. doi:10.1109/IPPS.1994.288305 .[35] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchro-nization on shared-memory multiprocessors.
ACM Trans. Comput. Syst. , 1991.URL: http://doi.acm.org/10.1145/103727.103729.[36] John M. Mellor-Crummey and Michael L. Scott. Scalable reader-writer synchro-nization for shared-memory multiprocessors. In
Proceedings of the Third ACMSIGPLAN Symposium on Principles and Practice of Parallel Programming , PPOPP’91. ACM, 1991. URL: http://doi.acm.org/10.1145/109625.109637. [37] Microsoft. Waitonaddress function, 2015. URL: https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitonaddress.[38] Atsushi Nemoto. Bug 13690 – pthread_mutex_unlock potentially cause invalidaccess. https://sourceware.org/bugzilla/show_bug.cgi?id=13690, February 14,2012.[39] Robert O’Callahan and Jong-Deok Choi. Hybrid dynamic data race detection. In
Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice ofParallel Programming , PPoPP ’03. Association for Computing Machinery, 2003. doi:10.1145/781498.781528 .[40] Pedro Ramalhete. Ticket lock - array of waiting nodes (awn), 2015. URL:http://concurrencyfreaks.blogspot.com/2015/01/ticket-lock-array-of-waiting-nodes-awn.html.[41] Pedro Ramalhete and Andreia Correia. Tidex: A mutual exclusion lock. In
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice ofParallel Programming , PPoPP ’16, 2016. URL: http://doi.acm.org/10.1145/2851141.2851171.[42] David P. Reed and Rajendra K. Kanodia. Synchronization with eventcountsand sequencers.
Commun. ACM , 1979. URL: http://doi.acm.org/10.1145/359060.359076.[43] Michael L. Scott.
Shared-Memory Synchronization . Morgan & Claypool Publishers,2013. URL: https://doi.org/10.2200/S00499ED1V01Y201304CAC023.[44] U. Verner, A. Mendelson, and A. Schuster. Extending amdahl’s law for multicoreswith turbo boost.
IEEE Computer Architecture Letters , 2017. URL: https://doi.org/10.1109/LCA.2015.2512982.[45] Tianzheng Wang, Milind Chabbi, and Hideaki Kimura. Be my guest: Mcs locknow welcomes guests.
SIGPLAN PPoPP , 2016. doi:10.1145/3016078.2851160 . ice and Kogan A OPTIMIZATION: OVERLAP
To reduce the impact of waiting for receipt of transfer in the unlock operator, at Listing-1 line 20, we can apply the Overlap optimization, which shifts and defers that waiting step until subsequent synchronization operations, allowing greater overlap between the successor and the outgoing owner. Threads arriving in the lock operator at Listing-3 line 6 wait to ensure their Grant mailbox field does not contain a residual address from a previous contended unlock operation on that same lock, in which case the thread must wait for that tardy successor to fetch and clear the Grant field. In practice, waiting on this condition is rare. (If thread 𝑇 1 were to proceed with a residual Grant value that happens to match that of the lock, then when a successor 𝑇 2 arrives behind 𝑇 1, it will incorrectly see that address in 𝑇 1’s Grant field and then incorrectly enter the critical section, resulting in exclusion and safety failure and a corrupt chain. The check at line 6 prevents that pathology.) In Listing-3 line 16, threads wait for their own Grant field to become empty. Grant could be non- null because of previous unlock operations that wrote an address into the field, but the corresponding successor has not yet cleared the field back to null . That is, Grant is still occupied. Once Grant becomes empty, the thread then writes the address of the lock into Grant , alerting the successor and passing ownership. When ultimately destroying a thread, it is necessary to wait for the thread’s Grant field to transition back to null before reclaiming the memory underlying Grant .
Listing 3: Hemlock with Overlap Optimization
B OPTIMIZATION: AGGRESSIVE HAND-OVER
The Aggressive Hand-Over (AH) optimization, shown in Listing-4, changes the code in unlock to first store the lock's address into the Grant field (Listing-4 line 12), optimistically anticipating the existence of waiters, and then execute the atomic CAS to try to swing the Tail field back from Self to null, handling the uncontended case. If the CAS succeeds, there are no waiters, and we reset Grant back to null and return; otherwise we wait for the successor to clear Grant. This reorganization accomplishes handover earlier. The stores into Grant are harmless to latency, as the thread is likely to have the underlying cache line in modified state in its local cache. Listing-4 also incorporates the CTR optimization. The contended handover critical path is extremely short: the very first statement in the unlock operator conveys ownership to the successor.

In unlock, after we store into the
Grant field and transfer ownership, the successor may enter the critical section and even release the lock in the interval before the original owner reaches the CAS in unlock. As such, it is possible that the CAS in unlock could fetch a Tail value of null. We therefore remove the corresponding assert found at line 17 in Listing-1.
In a sense, when we have waiters and contention, executing the CAS first in unlock adds useless latency and coherence traffic, and delays the handover to the successor.
While the aggressive hand-over optimization improves contended throughput, it can lead to surprising use-after-free memory lifecycle pathologies and is thus not safe for general use in a pthread_mutex implementation. Consider the following scenario where we have a structure instance 𝐼 that contains a lock 𝐿 and a reference count for 𝐼. The reference count, which is currently 2, is protected by 𝐿. Thread 𝑇1 holds 𝐿 while it accesses 𝐼. Thread 𝑇2 waits to acquire 𝐿 and access 𝐼. 𝑇1 finishes accessing 𝐼, decrements the reference count from 2 to 1 and then calls unlock(L), which stores 𝐿's address into 𝑇1's Grant field, passing ownership. 𝑇2 acquires 𝐿 and accesses 𝐼. When finished, 𝑇2 decrements the reference count from 1 to 0 and releases 𝐿, and, as the reference count transitioned to 0 and 𝐼 should no longer be accessible or reachable, 𝑇2 frees 𝐼, which includes 𝐿. 𝑇1 then resumes in unlock and executes the CAS on the freed 𝐿, resulting in a use-after-free error. Similar pathologies have been observed and fixed in the Linux kernel lock implementation and the user-mode pthread_mutex implementations [8, 38]. We thank Alexander Monakov and Travis Downs for reminding us of this concern.
Broadly, if the unlock operator has a fast path which might release or transfer the lock, and, in the same invocation of unlock, might then subsequently access the lock body, then the lock implementation is exposed to the use-after-free problem. Put another way, once transfer has been effected or potentially effected, the lock implementation must not access the lock body again. In our case, the speculative hand-over store at line 12 renders the AH algorithm vulnerable.

AH remains safe and immune from use-after-free errors, however, in any environment where the lock body 𝐿 cannot recycle while a thread remains in unlock(L). If 𝐿 is garbage collected or protected by safe memory reclamation techniques, such as read-copy update (RCU), then AH is permissible, as the thread calling unlock(L) holds a reference to 𝐿 which prevents 𝐿 from recycling. Furthermore, AH is safe if 𝐿 resides in type-stable memory or if 𝐿 is never deallocated, as would be the case for statically allocated locks. Even if a lock recycles, AH remains safe under type-stable memory, as there is no way the CAS (following the speculative store into L->Grant in unlock) could find the caller's thread in the Tail field and do any damage. The AH form (with CTR) provides the best overall performance of the Hemlock family.
Listing 5: Hemlock with Optimized Hand-Over – Variant 1
We now show additional variants that avoid use-after-free concerns, but which still provide the fast contended hand-over exhibited by AH.

In Listing-5 we augment the encoding of Grant to add a distinguished 𝐿|1 value. In the unlock operator, a thread first checks whether its own Grant field is 𝐿|1, indicating that a successor for 𝐿 has announced itself, in which case the thread overwrites 𝐿|1 with 𝐿 to pass ownership to that successor. This approach also avoids, for common modes of contention, any accesses to the lock's Tail field in the unlock operator, further reducing traffic on that coherence hotspot. By eliminating the speculative store into Grant found in AH, we avoid use-after-free concerns.

The form in Listing-6 checks for the existence of successors in the unlock operator by first fetching the lock's
Tail field. Successors exist if and only if the value is not equal to Self (Listing-6 line 12). This is tantamount to a "polite" CAS that first loads the value, avoiding the futile CAS and its write invalidation when there are successors.

Listing 6: Hemlock with Optimized Hand-Over – Variant 2

This form is also immune to use-after-free concerns. Under contention, when there are waiting threads, the naive form incurs a futile CAS and write invalidation on the Tail field (Listing-1 line 16) in the critical path, before effecting transfer at line 20, while this version avoids the futile
CAS.

C WAITING STRATEGIES
If desired, threads in the Hemlock slow-path (Listing-1 line 10) could optionally be made to wait politely, voluntarily surrendering their CPU and blocking in the operating system, via constructs such as WaitOnAddress [37], where a waiting thread could use WaitOnAddress to monitor its predecessor's Grant field.
We note that user-mode locks are not typically implemented as pure spin locks, instead often using a spin-then-park waiting policy which voluntarily surrenders the CPU of waiting threads after a brief optimistic spinning period designed to reduce the context-switching rate. In our case, we find that user-mode is a convenient venue for experiments.
Under Hemlock, a thread releasing a lock can determine with certainty, based on the Tail value, whether or not successors exist, but the identity of the successor is not known to the thread calling unlock. As such, Hemlock is not immediately amenable to waiting strategies such as park-unpark [13, 14, 30], where unpark wakes a specific thread.

To allow purely local spinning and enable the use of park-unpark waiting constructs, we can replace the per-thread Grant field with a per-thread pointer to a chain of waiting elements, each of which represents a waiting thread. The elements on 𝑇's chain are 𝑇's immediate successors for various locks. Waiting elements contain a next field, a flag, and a reference to the lock being waited on, and can be allocated on-stack. Instead of busy waiting on the predecessor's Grant field, waiting threads use