Hemlock : Compact and Scalable Mutual Exclusion
Dave Dice ([email protected]), Oracle Labs, USA
Alex Kogan ([email protected]), Oracle Labs, USA
ABSTRACT
We present Hemlock, a novel mutual exclusion locking algorithm that is extremely compact, requiring just one word per thread plus one word per lock, but which still provides local spinning in most circumstances, high throughput under contention, and low latency in the uncontended case. Hemlock is context-free – not requiring any information to be passed from a lock operation to the corresponding unlock – and FIFO. The performance of Hemlock is competitive with and often better than the best scalable spin locks.
CCS CONCEPTS

• Software and its engineering → Multithreading; Mutual exclusion; Concurrency control; Process synchronization.

KEYWORDS
Synchronization, Locks, Mutual Exclusion, Scalability
INTRODUCTION

Locks often have a crucial impact on the performance of parallel software, hence they remain a focus of intensive research. Many locking algorithms have been proposed over the last several decades. Ticket Locks [24, 35, 42] are simple and compact, requiring just two words for each lock instance and no per-thread data. They perform well in the absence of contention, exhibiting low latency because of short code paths. Under contention, however, performance suffers [18] because all threads contending for a given lock will busy-wait on a central location, increasing coherence costs. For contended operation, so-called queue-based locks, such as CLH [12, 34] and MCS [36], provide relief via local spinning [21]. For both CLH and MCS, arriving threads enqueue an element (sometimes called a "node") onto the tail of a queue and then busy-wait on a flag in either their own element (MCS) or the predecessor's element (CLH). Critically, at most one thread busy-waits on a given location at any one time, increasing the rate at which ownership can be transferred from thread to thread relative to techniques that use global spinning, such as Ticket Locks.

Hemlock is inspired by CLH, and, like CLH, threads wait on a field associated with the predecessor. Hemlock, however, avoids the use of queue nodes, freeing the implementation from lifecycle concerns – allocating, releasing, caching – associated with that structure. The lock and unlock paths are extremely simple. An uncontended lock operation requires just an atomic SWAP (exchange) operation, and unlock just a compare-and-swap (CAS), which is the same as MCS. Like Ticket Locks, MCS and CLH locks, Hemlock provides FIFO admission.

Hemlock is compact, requiring just one word per extant lock plus one word per thread, regardless of the number of locks held or waited upon. Like MCS and CLH, the lock contains a pointer to the tail of the queue of threads waiting on that lock, or null if the lock is not held. The thread at the head of the queue is the owner. In MCS the queue is implemented as an explicit linked list running from the head (owner) to the tail. In CLH the queue is implicit and each thread waits on a field in its predecessor's element. CLH also requires that a lock in unlocked state be provisioned with an empty queue element. When that lock is destroyed, the element must be recovered. Hemlock avoids that requirement.
Instead of using queue nodes, Hemlock provisions each thread with a singular Grant field where its successor can busy-wait. Normally the Grant field – which acts as a mailbox between a thread and its predecessor on the queue – is null, indicating empty. During the unlock operation, a thread installs the address of the lock into its Grant field and then waits for that field to return to null. The successor thread observes the lock address appear in its predecessor's Grant field, which indicates that ownership has transferred. The successor then responds by clearing the Grant field, acknowledging receipt of ownership and allowing the Grant field of its predecessor to be reused in subsequent handover operations, and then finally enters the critical section.

Under simple contention, when a thread holds one lock at a time, Hemlock provides local spinning. But if we have one thread 𝑇 that holds multiple contended locks, multiple threads may busy-wait on 𝑇's Grant field. As multiple threads (via multiple locks) can be busy-waiting on 𝑇's Grant field, 𝑇 writes the address of the lock being released into its Grant field to disambiguate and allow the specific successor to determine that ownership has been conveyed. We note that simple contention is a common case for many applications. This is supported by the surveys of Cheng et al. [6] and O'Callahan et al. [39] – which found that it is rare for a thread to hold multiple locks at a given time – as well as by our profiling of LevelDB, described below. This suggests that Hemlock would enjoy local spinning in many practical settings.
The advantages of not using queue nodes, however, do not end in reduced space and a simplified implementation that avoids node lifecycle concerns. Both CLH and MCS need to convey the address of the owner (head) node from the lock operation to the unlock operation. The unlock operation needs that node to find the successor, and to reclaim nodes from the queue so that nodes may be recycled. While the locking API could be modified to accommodate this requirement, it is inconvenient for the classic POSIX pthread locking interface, in which case the usual approach to support MCS is to add a field to the lock body that points to the head, allowing that value to be passed from the lock operation to the corresponding unlock operation. This new field is protected by the lock itself, but accesses to the field execute within the effective critical section and may also induce additional coherence traffic. Hemlock requires no extra field in the lock body to convey the head: it is context-free, with no information passed from the lock operation to the corresponding unlock operation.

2021-02-19 • Copyright Oracle and/or its affiliates
We start by describing a simplified version of the Hemlock algorithm, with pseudo-code provided in Listing-1. In Section 2.1 we describe key performance optimizations, and the pseudo-code for the optimized Hemlock algorithm is given in Listing-2.
Self refers to a thread-local structure containing the thread's Grant field. Threads arrive in the lock operator at line 8 and atomically swap their own address into the lock's Tail field, obtaining the previous tail value, constructing the implicit FIFO queue. If the Tail field was null, then the caller acquired the lock without contention and may immediately enter the critical section. Otherwise the thread waits for the lock's address to appear in the predecessor's Grant field, signalling succession, at which point the thread restores the predecessor's Grant field to null (empty), indicating that the field can be reused for subsequent unlock operations by the predecessor. The thread has been granted ownership by its predecessor and may enter the critical section. Clearing the Grant field, above, is the only circumstance in which one thread may store into another thread's Grant field. Threads in the queue hold the address of their immediate predecessor, obtained as the return value from the SWAP operation, but do not know the identity of their successor, if any.

In the unlock operator, at line 16, threads initially use an atomic compare-and-swap (CAS) operation to try to swing the lock's Tail field from the address of their own thread, Self, back to null, which represents unlocked. If the CAS was successful then there were no waiting threads and the lock was released by the CAS. Otherwise waiters exist and the thread then writes the address of the lock 𝐿 into its own Grant field, alerting the waiting successor and passing ownership. Finally, the outgoing thread waits for that successor to acknowledge the transfer and restore the Grant field back to empty, indicating that the field may be reused for future locking operations. Waiting for the mailbox to return to null happens outside the critical section, after the thread has conveyed ownership.

In Hemlock, transfer of ownership in unlock is address-based, where the outgoing owner writes the lock address into its own Grant field, whereas under MCS and CLH ownership transfer is via a boolean written into a queue element monitored by the immediate successor.

Threads that attempt to release a lock that they do not hold will stall indefinitely at Line 21, waiting for an acknowledgement that will never arrive. This property makes it easy to identify and debug the offending thread and unlock operation.
MCS and Hemlock allow trivial implementations of the TryLock operation – using CAS instead of SWAP – whereas Ticket Locks and CLH do not. An uncontended lock acquisition requires an atomic SWAP for MCS, CLH and Hemlock and an atomic fetch-and-add for Ticket Locks. An uncontended unlock – no waiters – requires an atomic CAS for MCS and Hemlock, and simple stores for CLH and Ticket Locks, while a contended unlock, which passes ownership to a waiter, requires a store for MCS, CLH and Ticket Locks.

In Listing-1 line 21, threads in the unlock operator must wait for the successor to acknowledge receipt of ownership, indicating the unlocking thread's Grant mailbox is again available for communication in subsequent locking operations. That is, the recipient needs to take positive action and respond before the previous owner can return from the unlock operator. While this phase of waiting occurs outside and after the transfer of ownership – crucially, not within the effective critical section or on the critical path – such waiting may still impede the progress and latency of the thread that invoked unlock. Specifically, we have tightly coupled back-and-forth synchronous communication, where the thread executing unlock stores into its Grant field and then waits for a response from the successor, while the successor, running in the lock operator, waits for the transfer indication (line 11) and then responds to the unlocking thread and acknowledges by restoring Grant to null (line 12). The unlock operator must await a positive reply from the successor in order to safely reuse the Grant field for subsequent operations. That is, the algorithm must not start an unlock operation until the previous contended unlock has completed and the successor has emptied the mailbox. We note that MCS, in the unlock operator, must also wait for the successor executing in the lock operator to establish the back-link that allows the owner to reach the successor. That is, both MCS and Hemlock have wait loops in the contended unlock path where threads may need to wait for the arriving successor to become visible to the current owner. Compared to MCS and CLH, the only additional burden imposed by Hemlock that falls inside the critical path is the clearing of the predecessor's Grant field by the recipient (Line 12), which is implemented as a single store.

To mitigate the performance concern described above, we could optimize Hemlock to defer and shift the waiting-for-response phase (Listing-1 line 21) to the prologue of subsequent lock and unlock operations, allowing more useful overlap and concurrency between the successor, which clears the Grant field, and the thread which performed the unlock operation. The thread that called unlock may enter its critical section earlier, before the successor clears Grant. Ultimately, however, we opted to forgo this particular optimization in our implementation as it provided little observable performance benefit. While the Grant mailbox field might appear to be a source of contention and to potentially induce additional coherence traffic, a given thread can release only one lock at a time, mitigating that concern.
The synchronous back-and-forth communication pattern where a thread waits for ownership and then clears the Grant field (Listing-1 Lines 11-12) is inefficient on platforms that use MESI or MESIF "single writer" cache coherence protocols [26, 27]. Specifically, in unlock, when the owner stores the lock address into its Grant field (Line 20), it drives the cache line underlying Grant into M-state (modified) in its local cache. Subsequent polling by the successor (Line 11) results in a coherence miss that will pull the line back into the successor's cache in S-state (shared). The successor will then observe the waited-for lock address and proceed to clear Grant (Line 12), forcing an upgrade from S to M state in the successor's cache and invalidating the line from the cache of the previous owner, adding a delay in the critical path. We avoid the upgrade coherence transaction by polling with CAS (Listing-2 Line 9) instead of using simple loads, so, once the hand-over is accomplished and the successor observes the lock address, the line is already in M-state in the successor's local cache. We refer to this technique as the Coherence Traffic Reduction (CTR) optimization.

As an alternative to busy-waiting with CAS, we can achieve equivalent performance by using an atomic fetch-and-add of 0 – implemented via LOCK:XADD on x86 – on Grant as a read-with-intent-to-write primitive, and, after observing the waited-for lock address appear in Grant, issuing a normal store to clear Grant. That is, we simply replace the load instruction in the traditional busy-wait loop with a fetch-and-add of 0. Busy-waiting with an atomic read-modify-write operator, such as CAS, SWAP or fetch-and-add, is typically considered a performance anti-pattern. For instance, Anderson [5] observed that test-and-test-and-set locks are superior to crude test-and-set locks when there are multiple waiters. But in our case, with the 1-to-1 communication protocol used on the Grant field in Hemlock, busy-waiting via read-modify-write atomic operations provides a performance benefit. Because of the simple communication pattern, back-off in the busy-waiting loop is not useful. We also apply CTR in the unlock operator at Listing-2 Line 15, as we expect the Grant field will be written by that same thread in subsequent unlock operations.

Related approaches to coherence-optimized waiting have been described [22]. Using MONITOR-MWAIT [2] to wait for invalidation, instead of waiting for a value, has promise, but the facility is not yet available in user-mode on Intel processors. MWAIT may confer additional benefits, as it avoids a classic busy-wait loop and thus avoids branch mispredictions in the critical path to exit the loop when ownership has transferred. In addition, depending on the implementation, MWAIT may be more "polite" with respect to yielding pipeline resources, potentially allowing other threads, including the lock owner, to execute faster by reducing competition for shared resources. We might also busy-wait via hardware transactional memory, where invalidation will cause an abort, serving as a hint to the waiting thread. In addition, other techniques to hold the line in M-state are possible, such as issuing stores to a dummy variable that abuts the Grant field but which resides on the same cache line. Using the prefetchw prefetch-for-write advisory "hint" instruction would appear workable but yielded no performance improvement in our experiments.
The CTR optimization is specific to the shared memory communication pattern used in Hemlock, and is not directly applicable to other lock algorithms. All Hemlock performance data reported herein uses the CTR optimization unless otherwise noted. We used the Linux perf stat command to collect data from the hardware performance monitoring unit counters and found that CTR resulted in a reduction in the number of load operations that "hit" on a line in M-state in another core's cache – requiring write invalidation and transfer back to the requester's cache – and also a reduction in total offcore traffic, while providing an improvement in throughput. We can show similar benefits from CTR with a simple program where a set of concurrent threads are configured in a ring and circulate a single token. A thread waits for its mailbox to become non-zero, clears the mailbox, and deposits the token in its successor's mailbox. Using CAS, SWAP or Fetch-and-Add to busy-wait improves the circulation rate as compared to the naive form which uses loads. Future Intel processors may support user-mode umonitor and umwait instructions [10]. We hope to use those instructions in future Hemlock experiments.
Figure-1 shows an example configuration of a set of threads and locks in Hemlock. 𝐿1–𝐿7 represent locks and 𝐴–𝑁 represent threads. Solid arrows reflect the lock's explicit Tail pointer, which points to the most recently arrived thread – the tail of the lock's queue. Dashed arrows, which appear between threads, refer to a thread's immediate predecessor in the implicit queue associated with a lock. The address of the immediate predecessor is obtained via the atomic SWAP executed when threads arrive. The dashed edges can be thought of as the busy-waits-on relation and are not physical links in memory that could be traversed. In the example, 𝐴 holds one lock, 𝐵 holds two locks, 𝐸 holds three locks (among them 𝐿4), and 𝐾 holds two locks; 𝐴, 𝐵 and 𝐸 execute in their critical sections while all the other threads are stalled waiting for locks. The queue of waiting threads for one of 𝐵's locks consists of 𝐶 (the immediate successor) followed by 𝐷. 𝐷's predecessor is 𝐶, and, equivalently, 𝐶's successor is 𝐷. Thread 𝐷 busy-waits on 𝐶's Grant field and 𝐶 busy-waits on 𝐵's Grant field. Threads 𝐻 and 𝐽 both busy-wait on 𝐺's Grant field. In simple locking scenarios Hemlock provides local waiting, but when the dashed lines form junctions (elements with in-degree greater than one) in the waits-on directed graph, we find non-local spinning, or multi-waiting. Similarly, in our contrived example, 𝑁 and 𝐺 both wait on 𝐹. While our design admits inter-lock performance interference, arising from multiple threads spinning on one Grant variable, as is the case for the threads waiting on 𝐹, we believe this case to be rare and not of consequence for common applications. (For comparison, CLH and MCS do not allow the concurrent sharing of queue elements, and thus provide local spinning, whereas Hemlock has a shared singleton queue element – effectively the Grant field – that can be subject to being busy-waited upon by multiple threads). Crucially, if we have a set of coordinating threads where each thread acquires only one lock at a time, then they will enjoy local spinning. Non-local spinning can occur only when threads hold multiple locks. Specifically, the worst-case number of threads that could be busy-waiting on a given thread 𝑇's Grant field is 𝑀, where 𝑀 is the number of locks held simultaneously by 𝑇.

When 𝐸 ultimately unlocks 𝐿4, 𝐸 installs a pointer to 𝐿4 in its Grant field. Thread 𝐹 observes that store, assumes ownership, clears 𝐸's Grant field back to empty (null) and enters the critical section. When 𝐹 then unlocks 𝐿4, it deposits 𝐿4 in its Grant field. Threads 𝐺 and 𝑁 both monitor 𝐹's Grant field, with 𝐺 waiting for 𝐿4 and 𝑁 waiting for a different lock. Both observe the store, but 𝑁 ignores the change while 𝐺 notices the value now matches 𝐿4, the lock that 𝐺 is waiting on, which indicates that ownership of 𝐿4 has passed to 𝐺. 𝐺 clears 𝐹's Grant field, indicating that 𝐹 can reuse that field for subsequent operations, and enters the critical section.

Table-1 characterizes the space utilization of MCS, CLH, Ticket Locks, and Hemlock. The values in the
Lock column reflect the size of the lock body. For MCS and CLH we assume that the head of the chain is carried in the lock body, and thus the lock consists of head and tail fields, requiring 2 words in total. E represents the size of a queue element. CLH requires the lock to be preinitialized with a
Figure 1: Object Graph in Hemlock

so-called dummy element before use. When the lock is ultimately destroyed, the current dummy element must be recovered.

Table 1: Space Usage

              Lock  Held  Wait  Thread  Init
MCS              2     E     E       0
CLH              2     E     E       E     •
Ticket Locks     2     0     0       0
Hemlock          1     0     0       1

The
Held field indicates the space cost for each held lock and, similarly, the Wait field indicates the cost in space of waiting for a lock. The Thread column reflects per-thread state that must be reserved for locking. For Hemlock, this is the Grant field. A single word suffices, although to avoid false sharing we opted to sequester the Grant field as the sole occupant of a cache line. In our implementation we also elected to align and pad the MCS and CLH queue nodes to reduce false sharing and to provide a fair comparison, raising the size of 𝐸 to a cache line. Init indicates if the lock requires non-trivial constructors and destructors. CLH, for instance, requires that the current dummy node be released when a lock is destroyed.

Taking MCS as an example, let's say lock 𝐿 is owned by thread 𝑇1 while threads 𝑇2 and 𝑇3 wait to acquire 𝐿. The lock body for 𝐿 requires 2 words and the MCS chain consists of elements 𝐸1 ⇒ 𝐸2 ⇒ 𝐸3, where 𝐸1, 𝐸2 and 𝐸3 were contributed by 𝑇1, 𝑇2 and 𝑇3, respectively. 𝐿's head field points to 𝐸1, the owner's element, and the tail field points to 𝐸3. The space consumed in this configuration is 2 words for 𝐿 itself plus 3 ∗ 𝐸 for the queue elements. In comparison, Hemlock consumes one word for 𝐿 and 3 words of thread-local state for the Grant fields.

In MCS, when a thread acquires a lock, it contributes an element to the associated queue, and when that element reaches the head of the queue, the thread becomes the owner. In the subsequent unlock operation, the thread extracts and reclaims that same element from the queue. In CLH, a thread contributes an element and, once it has acquired the lock, recovers a different element from the queue – elements migrate between locks and threads. The MCS and CLH unlock operators require dependent loads and indirection to locate queue nodes, while Hemlock avoids these overheads. In MCS, if the unlock operation is known to execute in the same stack frame as the lock operation, the queue element may be allocated on stack. This is not the case for CLH. As previously noted, Hemlock avoids elements and their management.

The K42 [33, 43] variation of MCS can recover the queue element before returning from lock, whereas classic MCS recovers the queue element in unlock. That is, under K42, a queue element is needed only while waiting but not while the lock is held, and as such, queue elements can always be allocated on stack, if desired. While appealing, the paths are much more complex and touch more cache lines than the classic version, impacting performance.

If a lock site is well-balanced – with the lock and corresponding unlock operators lexically scoped and executing in the same stack frame – a Hemlock implementation can opt to use an on-stack Grant field instead of the thread-local Grant field accessed via Self. This optimization, which can be applied on an ad-hoc site-by-site basis, also acts to reduce multi-waiting on the thread-local Grant field.
In this section, we argue that the Hemlock algorithm is a correct implementation of a mutual exclusion lock with the FIFO admission and so-called fere-local spinning properties. We define those properties more formally below, but first we note that we consider the standard model of shared memory [28] with basic atomic read and write operations as well as more advanced atomic SWAP, CAS and FAA operations. We presume atomic operators with the usual semantics.
The SWAP operation receives two arguments, address and new value, and atomically reads the old value at the given address, writes the given new value, and returns the old value. The CAS (compare-and-swap) operation receives three arguments, address, old value and new value, and atomically reads the value at the given address, compares it to the given old value, and, if equal, writes the given new value. We say in this case that the CAS is successful. Otherwise, if the current value is different from the given old value, CAS does not make any change, and we say it is unsuccessful. We assume the CAS instruction returns the current value it has read. We note that given the return value, one can identify whether CAS has been successful by comparing that value to the old one used in the invocation of CAS. Finally, the FAA (fetch-and-add) instruction, which is required only for the optimized version of our algorithm, receives two arguments, address and delta, and atomically reads the old value at the given address, adds the delta (which can be any integer number), and writes the result back. The FAA operation returns the old value, before the increment. We note that while most existing hardware architectures support CAS, some do not support SWAP or FAA. Where such support is not available, those instructions can be easily emulated using CAS.
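As the note above suggests, SWAP and FAA are easily built from a CAS retry loop; a minimal sketch:

```cpp
#include <atomic>

// Emulating SWAP and FAA with a CAS retry loop (illustrative sketch).
int emulated_swap(std::atomic<int>& a, int newValue) {
  int old = a.load();
  // Retry until the CAS installs newValue over an unchanged current value;
  // on failure, compare_exchange refreshes 'old' with the current value.
  while (!a.compare_exchange_weak(old, newValue)) { /* retry */ }
  return old;  // the value read just before the successful write
}

int emulated_faa(std::atomic<int>& a, int delta) {
  int old = a.load();
  while (!a.compare_exchange_weak(old, old + delta)) { /* retry */ }
  return old;  // the value before the increment
}
```

Both loops terminate once the CAS observes an unchanged value, yielding exactly the read-modify-write semantics described above.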
Multiple threads perform execution steps, where at each step a thread may perform local computation or execute one of the atomic operations on the shared memory. We assume threads use the Hemlock algorithm to protect access to one or more critical sections, i.e., specially marked blocks of code that must be executed by at most one thread at a time. Our arguments are formulated for the simplified version of the algorithm given in Listing-1, and as such, all line references in this section are w.r.t. Listing-1. Yet, we note that the correctness arguments apply, albeit with minor modifications, to the optimized version in Listing-2.

We call Lines 5–13 the entry code and Lines 14–21 the exit code. Each thread cycles between the entry code (where it is trying to get into the critical section), critical section code, exit code (where it is cleaning up to allow other threads to execute their critical sections) and the so-called remainder section, where it executes code that does not belong to any of the other three sections [32]. We assume the order in which threads take their execution steps is unknown, yet no thread ceases execution in the entry, exit or critical sections. In other words, if a thread 𝑇 is in any of those code sections at time 𝑡, it is guaranteed that, eventually, at some time 𝑡′ > 𝑡, 𝑇 would perform its next execution step. We also assume that each thread executes a finite number of steps in the critical section.

We refer to Line 8 as the entry doorstep of the entry code and Line 20 as the exit doorstep of the exit code for lock 𝐿. We say that a thread is spinning on a word 𝑊 if its next execution step is reading from a shared memory location 𝑊 inside the while-loop (e.g., in Line 11 in Listing-1).

(Footnotes: balanced lock sites include, for example, Java synchronized blocks and methods, C++ std::lock_guard and std::scoped_lock, or locking constructs that allow the critical section to be expressed as a lambda. "Fere-local" means mostly or frequently local.)
We say a lock 𝐿 is associated with a thread 𝑇 if 𝑇 has executed the entry doorstep for 𝐿, but has not completed the exit code for 𝐿. We prove the following properties for the Hemlock algorithm, defined with respect to any instance of a lock 𝐿.

• Mutual exclusion: At any point in time, at most one thread is in the critical section.
• Lockout freedom: Any thread that starts executing the entry code eventually completes the exit code.
• FIFO: Threads enter the critical section in the order in which they execute the entry doorstep.
• Fere-local spinning: At any point in time, the number of spinning threads on the same word is bounded by the maximum number of locks associated with any thread at that time.

We note that lockout-freedom is a stronger property than the more common deadlock-freedom property [32]. Also, we note that if a thread never executes entry code of one lock inside the critical section of another (i.e., each thread has at most one associated lock), then fere-local spinning implies local spinning, i.e., each spinning thread reads a different word 𝑊. Furthermore, fere-local spinning is a dynamic property, e.g., the bound at time 𝑡 does not depend on the maximum number of locks associated with a thread prior to 𝑡.

We start with an auxiliary lemma. We denote the Self variable that (contains the
Grant field and) belongs to a thread 𝑇𝑖 as Self𝑖.

Lemma 1. For any lock 𝐿, if L→Tail is null, there is no thread that executed the entry doorstep but has not executed the exit doorstep. In particular, there is no thread in the critical section protected by 𝐿.

Proof. The claim trivially holds initially at the beginning of the execution, when L→Tail is null. Let 𝑇𝑗 be the first thread for which the claim does not hold. That is, 𝑇𝑗 is the first thread for which SWAP in Line 8 returns null, yet there is a thread 𝑇𝑘 that has executed that SWAP before 𝑇𝑗 but has not executed CAS in Line 16 yet. Let 𝑇𝑖 be the last thread that set L→Tail to null before 𝑇𝑗 (𝑇𝑖 might be the same thread as 𝑇𝑘 or a different one). From the inspection of the code, once L→Tail is set to a non-null value in the entry doorstep, it can revert to null only by a successful CAS in Line 16. For CAS in Line 16 to be successful, it has to be executed by the last thread that executed SWAP in Line 8. In other words, if 𝑇𝑖 executes a successful CAS in Line 16 at time 𝑡2 and it executed the corresponding SWAP at time 𝑡1 (𝑡1 < 𝑡2), then no other thread executed SWAP at any time 𝑡 with 𝑡1 < 𝑡 < 𝑡2.

Consider when 𝑇𝑘 executes the SWAP instruction in Line 8 w.r.t. [𝑡1, 𝑡2]. Case 1: 𝑇𝑘 executes SWAP at time 𝑡 < 𝑡1. Let 𝑇𝑙 be the next thread that executes SWAP after 𝑇𝑘 (𝑇𝑙 can be the same as 𝑇𝑖, or a different thread). Since L→Tail contains Self𝑘, 𝑇𝑙 will enter the while-loop in Line 11. It will exit the loop only when Self𝑘→Grant changes to 𝐿, which can happen, according to the code, only in Line 20 when 𝑇𝑘 executes the exit code. By induction on the number of threads that executed SWAP between 𝑡 and 𝑡1, 𝑇𝑖 can execute the successful CAS in Line 16 only after 𝑇𝑘 executes the exit code. This means that when 𝑇𝑗 executes SWAP in Line 8, 𝑇𝑘 has executed CAS in Line 16 – a contradiction.

Case 2: 𝑇𝑘 executes SWAP at time 𝑡 > 𝑡2. This means that when 𝑇𝑗 executes SWAP in Line 8, L→Tail contains either Self𝑘 or Self𝑙 for some other thread 𝑇𝑙 that executes SWAP after 𝑇𝑘 – a contradiction to the fact that 𝑇𝑗's SWAP returned null. □

With this lemma, we prove the correctness property for Hemlock.

Theorem 2.
The Hemlock algorithm provides mutual exclusion.
Proof. By way of contradiction, assume 𝑇𝑖 and 𝑇𝑗 are simultaneously in the critical section protected by the same lock 𝐿. Let 𝑡𝑖 and 𝑡𝑗 be the points in time when 𝑇𝑖 and 𝑇𝑗 executed Line 8 for the last time, respectively. Without loss of generality, assume 𝑡𝑖 > 𝑡𝑗. Consider the value returned by SWAP in Line 8 when executed by thread 𝑇𝑖. If the returned value is null, by Lemma 1 𝑇𝑗 must have executed its CAS instruction in Line 16 before 𝑡𝑖. Hence, 𝑇𝑖 will execute the critical section after 𝑇𝑗 has completed its own – a contradiction.

If the returned value is Self𝑘 ≠ null (for some thread 𝑇𝑘 that might be the same as 𝑇𝑗 or a different one), let 𝑇𝑙 be the thread that executes SWAP in Line 8 right after 𝑇𝑗 and before 𝑇𝑗 executes CAS in Line 16. 𝑇𝑙 might be the same as 𝑇𝑖 or a different thread. 𝑇𝑙 waits in Line 11 for Self𝑗→Grant to become 𝐿. Self𝑗→Grant can only change to 𝐿 (from null) in Line 20 by thread 𝑇𝑗. When this happens, 𝑇𝑗 is outside of the critical section. By induction on the number of threads that execute SWAP in (𝑡𝑗, 𝑡𝑖], when 𝑇𝑖 observes the transfer indication in Line 11 and subsequently enters the critical section, 𝑇𝑗 is outside of the critical section – a contradiction. □

Next, we prove the progress property for Hemlock. We do so by showing first that a thread cannot get stuck in the exit code, i.e., the exit code is lockout-free.

Lemma 3.
Every thread 𝑇 𝑖 exiting the critical section eventually completes the exit code.

Proof. The only place in the exit code where a thread 𝑇 𝑖 may iterate indefinitely is the while-loop in Line 21. In the following, we argue that 𝑇 𝑖 either completes the exit code without reaching Line 21, or eventually breaks out of the loop in Line 21. Let 𝑡 1 be the time 𝑇 𝑖 executes the last SWAP instruction in Line 8 before entering the critical section and 𝑡 2 be the time it executes the CAS instruction in Line 16 when it starts the exit code. Consider the following two cases. Case 1: no thread executes SWAP in [ 𝑡 1 , 𝑡 2 ]. In this case, L → Tail contains Self i and the CAS instruction in Line 16 is successful. Therefore, CAS returns Self i , and 𝑇 𝑖 can complete the exit code in a constant number of its steps by skipping Lines 18–21. Case 2: at least one thread executes SWAP in [ 𝑡 1 , 𝑡 2 ]. Let 𝑇 𝑘 be the first such thread. Thus, CAS in Line 16 is not successful, and it returns
Self j , for some 𝑗 ≠ 𝑖 (perhaps 𝑗 = 𝑘 ). (We note that, by Lemma 1, CAS cannot return null .) Therefore, 𝑇 𝑖 reaches the while-loop in Line 21, after storing 𝐿 into Self i → Grant in Line 20. Consider the execution steps of thread 𝑇 𝑘 after its SWAP instruction. The SWAP instruction returns Self i , and so 𝑇 𝑘 reaches the while-loop in Line 11. After 𝑇 𝑖 executes Line 20, eventually 𝑇 𝑘 reads 𝐿 from Self i → Grant and breaks out of the while-loop in Line 11. Next, it executes Line 12, storing null into Self i → Grant . Finally, 𝑇 𝑖 eventually reads null in Line 21 and breaks out of the while-loop. □

Next, we show that a thread cannot get stuck in the entry code either, but first we prove a simple auxiliary lemma. Lemma 4.
The SWAP instruction in Line 8 executed by thread 𝑇 𝑖 never returns Self i .

Proof. From code inspection, only thread 𝑇 𝑖 can write Self i into L → Tail . Thus, the claim holds until 𝑇 𝑖 executes SWAP at least for the second time. Let 𝑇 𝑖 execute SWAP in Line 8 for the 𝑘 -th time, 𝑘 ≥ 2, at time 𝑡 𝑘 . Consider the previous, ( 𝑘 −1)-th execution of SWAP by 𝑇 𝑖 , at time 𝑡 𝑘−1 . From code inspection, 𝑇 𝑖 has to execute CAS in Line 16 at time 𝑡 𝑘−1 < 𝑡 < 𝑡 𝑘 . If CAS is successful, 𝑇 𝑖 changes the value of L → Tail to null , and thus the 𝑘 -th SWAP will return null or Self j for 𝑗 ≠ 𝑖 . If CAS is unsuccessful, there has been (at least one) another thread 𝑇 𝑗 , 𝑗 ≠ 𝑖 , that performed SWAP in Line 8 at time 𝑡 𝑘−1 < 𝑡 ′ < 𝑡 . Thus, the 𝑘 -th SWAP at time 𝑡 𝑘 > 𝑡 ′ will return Self j , or Self l (for some 𝑙 ≠ 𝑖, 𝑗 ), or null , but not Self i . □ Lemma 5.
Every thread 𝑇 𝑖 starting the entry code eventually enters the critical section.

Proof. The only place in the entry code where a thread 𝑇 𝑖 may iterate indefinitely is the while-loop in Line 11. In the following, we argue that 𝑇 𝑖 either completes the entry code without reaching Line 11, or eventually breaks out of the loop in Line 11. Consider the following two cases w.r.t. the value returned by SWAP executed by thread 𝑇 𝑖 in the entry code at time 𝑡 𝑖 . Case 1: SWAP returns null . In this case, 𝑇 𝑖 can complete the entry code in a constant number of its steps by skipping Lines 9–12. Case 2: SWAP returns Self j . Thus, 𝑇 𝑖 reaches the while-loop in Line 11 and waits until Self j → Grant contains 𝐿 . From Lemma 4, we know that 𝑗 ≠ 𝑖 . Consider the state of thread 𝑇 𝑗 w.r.t. the value returned by SWAP executed by thread 𝑇 𝑗 in Line 8 at time 𝑡 𝑗 < 𝑡 𝑖 . If 𝑇 𝑗 ’s SWAP returned null , 𝑇 𝑗 will complete the entry code, and eventually reach Line 20 in the exit code. Otherwise, 𝑇 𝑗 ’s SWAP returned Self k . If Self k is equal to Self i , then 𝑇 𝑖 executed (another) SWAP at time 𝑡 ′ 𝑖 < 𝑡 𝑗 . This means that 𝑇 𝑖 has executed the exit code in the interval ( 𝑡 ′ 𝑖 , 𝑡 𝑖 ), and in particular, has executed Line 20 in the interval ( 𝑡 𝑗 , 𝑡 𝑖 ). Therefore, 𝑇 𝑗 will break out of the while-loop in Line 11, enter the critical section, and eventually execute Line 20, allowing 𝑇 𝑖 to complete its entry code. If Self k is not equal to Self i , consider whether at time 𝑡 𝑗 , 𝑇 𝑘 has completed the while-loop in Line 11 (including by skipping that while-loop entirely by evaluating the condition in Line 9 to false). If so, 𝑇 𝑘 will complete the entry code, and eventually reach Line 20 in the exit code, letting 𝑇 𝑗 and, eventually, 𝑇 𝑖 break out of the while-loop in Line 11. Otherwise, 𝑇 𝑘 is waiting in Line 11. (We note that there is a third possibility: that 𝑇 𝑘 has executed SWAP , but has not evaluated the condition in Line 9 yet, or has evaluated it to true, but has not started the while-loop in Line 11. We treat it as one of the first two possibilities, according to whether or not 𝑇 𝑘 eventually waits in the while-loop in Line 11.)

2021-02-19 • Copyright Oracle and or its affiliates

In the case 𝑇 𝑘 is waiting in Line 11, we consider recursively the state of 𝑇 𝑘 w.r.t. the value returned by its SWAP , and any thread 𝑇 𝑘 is waiting for in Line 11. Since the number of threads is bounded, there have to be two threads, 𝑇 𝑎 and 𝑇 𝑏 , s.t. 𝑇 𝑎 ’s SWAP returns Self b and 𝑇 𝑏 ’s SWAP returns either (a) null , or (b) Self c for 𝑇 𝑐 in { 𝑇 𝑖 , 𝑇 𝑗 , 𝑇 𝑘 , . . . , 𝑇 𝑎 }, or (c) Self c for 𝑇 𝑐 that has completed the while-loop in Line 11. Following similar reasoning as above, we conclude that 𝑇 𝑏 eventually executes Line 20 in its exit code, and allows 𝑇 𝑎 to break out of the waiting loop in Line 11. By induction on the number of threads in the set { 𝑇 𝑖 , 𝑇 𝑗 , 𝑇 𝑘 , . . . , 𝑇 𝑎 }, we conclude that, eventually, 𝑇 𝑖 completes the while-loop in Line 11 and enters the critical section. □ Theorem 6.
The Hemlock algorithm is lockout-free.
Proof. This follows directly from Lemmas 3 and 5, and the assumption that a thread completes its critical section in a finite number of its execution steps. □

Next, we prove that threads enter the critical section in FIFO order w.r.t. their execution of the entry doorstep. In the following lemma, we show that when two threads execute the entry doorstep one after the other, the latter thread cannot “skip” over the former and enter the critical section first. Lemma 7.
Let 𝑇 𝑖 be the next thread that executes the entry doorstep after 𝑇 𝑗 . Then 𝑇 𝑖 enters the critical section after 𝑇 𝑗 .

Proof. First, we note that the claim trivially holds if 𝑖 = 𝑗 . This is because 𝑇 𝑖 may execute another entry doorstep only after (entering and) exiting the critical section. Next, we consider two cases. Case 1: 𝑇 𝑖 ’s execution of the SWAP instruction in the entry doorstep returns null . This can only happen if 𝑇 𝑗 performs CAS in Line 16 before 𝑇 𝑖 executes the SWAP instruction. This means, however, that 𝑇 𝑗 has completed its critical section, and the claim holds. Case 2: 𝑇 𝑖 ’s execution of the SWAP instruction in the entry doorstep returns Self j . Then 𝑇 𝑖 will proceed to Line 11, and wait until Self j → Grant changes to 𝐿 . This can only happen when 𝑇 𝑗 reaches Line 20, which means that, once again, 𝑇 𝑗 has completed its critical section, and the claim holds. □ Theorem 8.
The Hemlock algorithm has the FIFO property.
Proof. By way of contradiction, assume there is a thread 𝑇 𝑖 that executes the entry doorstep after a thread 𝑇 𝑗 , but enters the critical section before 𝑇 𝑗 . Without loss of generality, let 𝑇 𝑖 be the first such thread in the execution of the algorithm. Let 𝑇 𝑘 be the thread that executes the entry doorstep right before 𝑇 𝑖 ( 𝑇 𝑘 might be the same thread as 𝑇 𝑗 or a different one). By the way we chose 𝑇 𝑖 , 𝑇 𝑘 has not entered the critical section when 𝑇 𝑖 does. This is a contradiction to Lemma 7. □

We are left to prove the last stated property of Hemlock, namely fere-local spinning. Again, we start with an auxiliary lemma. Lemma 9.
For every lock 𝐿 and thread 𝑇 𝑖 , there is at most one thread 𝑇 𝑗 waiting in Line 11 for Self i → Grant to become 𝐿 .

Proof. Consider thread 𝑇 𝑗 waiting in Line 11 for Self i → Grant to become 𝐿 . To reach Line 11, 𝑇 𝑗 executed Line 8, where SWAP returned Self i . This, in turn, means that 𝑇 𝑖 has also executed Line 8 (before 𝑇 𝑗 did). This is because Line 8 is the only place where Self k can be written into L → Tail , for any thread 𝑇 𝑘 . Assume by way of contradiction that another thread 𝑇 𝑘 is also waiting in Line 11 for Self i → Grant to become 𝐿 . Let 𝑡 𝑗 and 𝑡 𝑘 be the points in time when 𝑇 𝑗 and 𝑇 𝑘 executed Line 8 for the last time, respectively. From the atomicity of SWAP , 𝑡 𝑗 ≠ 𝑡 𝑘 . Assume without loss of generality that 𝑡 𝑗 < 𝑡 𝑘 . Let 𝑡 𝑖 be the time 𝑇 𝑖 executed the SWAP for the last time before 𝑡 𝑗 . From the above, 𝑡 𝑖 < 𝑡 𝑗 < 𝑡 𝑘 . From the inspection of the code, the only way for 𝑇 𝑘 to write Self i into pred in Line 8 is for 𝑇 𝑖 to execute Line 8 right before 𝑇 𝑘 does. That is, there has to be a point in time 𝑡 𝑗 < 𝑡 ′ 𝑖 < 𝑡 𝑘 in which 𝑇 𝑖 executed Line 8 again. This means that during ( 𝑡 𝑖 , 𝑡 ′ 𝑖 ), 𝑇 𝑖 has completed the entry code, its critical section, and the exit code (and started executing another entry code). When executing the exit code, 𝑇 𝑖 performed CAS in Line 16. If this CAS is successful, it takes place before 𝑡 𝑗 (since the value of L → Tail remains unchanged), and so 𝑇 𝑗 would not read Self i into pred in Line 8 at 𝑡 𝑗 . Thus, this CAS has to fail, i.e., return a value different from Self i . Thus, 𝑇 𝑖 has to execute Lines 20–21, and in particular, wait until its Grant field contains null . This happens before 𝑡 ′ 𝑖 and hence before 𝑡 𝑘 ; therefore 𝑇 𝑗 is the only thread at this point that waits in Line 11 for Self i → Grant to become 𝐿 . Since 𝑇 𝑖 completes its exit code (and executes SWAP at 𝑡 ′ 𝑖 ), it must have exited the while-loop in Line 21 before 𝑡 ′ 𝑖 . This can happen only if 𝑇 𝑗 has executed Line 12 after 𝑡 𝑗 and before 𝑡 ′ 𝑖 . Thus, 𝑇 𝑗 no longer waits in Line 11 for Self i → Grant to become 𝐿 when 𝑇 𝑘 starts to wait there – a contradiction. □

Note that, as explained in Section 2.2, there might be multiple threads spinning on the word
Self i → Grant in Line 11, each for a different lock 𝐿 . However, as we argue in the lemma above, there might be only one thread per any given lock that waits for the value of Self i → Grant to change. Theorem 10.

The Hemlock algorithm has the fere-local spinning property.

Proof. Assume thread 𝑇 𝑖 has 𝑘 associated locks at the given point in time. By inspecting the code, threads can spin on a word only in Lines 11 or 21. By Lemma 9, there might be at most 𝑘 threads spinning on Self i → Grant in Line 11, one for each of the 𝑘 locks associated with 𝑇 𝑖 . (We note that by the definition of the associated locks, a thread 𝑇 𝑗 cannot spin on Self i → Grant and wait until it contains a value of a lock that is not associated with 𝑇 𝑖 .) At the same time, only 𝑇 𝑖 can spin on Self i → Grant in Line 21, and it does so after writing 𝐿 into Self i → Grant in Line 20. This means that when 𝑇 𝑖 starts spinning on Self i → Grant , another thread 𝑇 𝑗 stops spinning on Self i → Grant in Line 11. We note that it can be easily shown that such a 𝑇 𝑗 exists. Thus, at any given point in time, the number of threads spinning on Self i → Grant is bounded by 𝑘 . □

While mutual exclusion remains an active research topic [41, 12, 36, 18, 29, 17, 16, 15, 19, 1, 21, 43], we focus on locks closely related to our design.
Simple test-and-set or polite test-and-test-and-set [43] locks are compact and exhibit excellent latency for uncontended operations, but fail to scale and may allow unfairness and even indefinite starvation. Ticket Locks are compact and FIFO and also have excellent latency for uncontended operations, but they too fail to scale because of global spinning, although some variations attempt to overcome this obstacle at the cost of increased space [18, 20, 40]. For instance, Anderson’s array-based queueing lock [4, 5] is based on Ticket Locks but provides local spinning. It employs a waiting array for each lock instance, sized to ensure there is at least one array element for each potentially waiting thread, yielding a potentially large footprint. The maximum number of participating threads must be known in advance when initializing the lock. Queue-based locks such as MCS or CLH are FIFO and provide local spinning and are thus more scalable. MCS is used in the Linux kernel for the low-level “qspinlock” construct [7, 9, 31]. Modern extensions of MCS edit the queue order to make the lock NUMA-aware [17].
Unless otherwise noted, all data was collected on an Oracle X5-2 system. The system has 2 sockets, each populated with an Intel Xeon E5-2699 v3 CPU running at 2.30GHz. Each socket has 18 cores, and each core is 2-way hyperthreaded, yielding 72 logical CPUs in total. The system was running Ubuntu 20.04 with a stock Linux version 5.4 kernel, and all software was compiled using the provided GCC version 9.3 toolchain at optimization level “-O3”. 64-bit C or C++ code was used for all experiments. Factory-provided system defaults were used in all cases, and Turbo mode [44] was left enabled. In all cases default free-range unbound threads were used. We implemented all user-mode locks within LD_PRELOAD interposition libraries that expose the standard POSIX pthread_mutex_t programming interface, using the framework from [21]. This allows us to change lock implementations by varying the LD_PRELOAD environment variable and without modifying the application code that uses locks. The C++ std::mutex construct maps directly to pthread_mutex primitives, so interposition works for both C and C++ code. All busy-wait loops used the Intel PAUSE instruction.
The MutexBench benchmark spawns 𝑇 concurrent threads. Each thread loops as follows: acquire a central lock L; execute a critical section; release L; execute a non-critical section. At the end of a 10 second measurement interval the benchmark reports the total number of aggregate iterations completed by all the threads. We report the median of 7 independent runs in Figure-2, where the critical section is empty as well as the non-critical section, subjecting the lock to extreme contention. (At just one thread, this configuration also constitutes a useful benchmark for uncontended latency). The 𝑋 -axis reflects the number of concurrently executing threads contending for the lock, and the 𝑌 -axis reports aggregate throughput: the tally of all loops executed by all the threads in the measurement interval. For clarity, and to convey the maximum amount of information to allow a comparison of the algorithms, the 𝑋 -axis is offset to the minimum score and the 𝑌 -axis is logarithmic. We ran the benchmark under the following lock algorithms: MCS is classic MCS; CLH is classic CLH; Ticket is a classic Ticket Lock; Hemlock is the Hemlock algorithm, with the CTR optimization, described above. For the MCS and CLH locks, our implementation stores the current head of the queue – the owner – in a field adjacent to the tail, so the lock body size was 2 words. The Ticket Lock also has a size of 2 words, while Hemlock requires a lock body of just 1 word. MCS and CLH additionally require one queue element for each lock held or waited upon. CLH also requires that each lock be initialized with a so-called dummy element. To avoid memory allocation during the measurement interval, the CLH and MCS implementations use a thread-local stack of free queue elements. In Figure-2 we make the following observations regarding operation at maximal contention with an empty critical section:
• At 1 thread the benchmark measures the latency of uncontended acquire and release operations. Ticket Locks are the fastest, followed by Hemlock, CLH and MCS.
• As we increase the number of threads, Ticket Locks initially do well but then fade, exhibiting a precipitous drop in performance.
• Broadly, Hemlock performs slightly better than or the same as CLH or MCS.
To gauge the contribution of the CTR optimization, we examine execution at 32 threads. The simplistic reference implementation, shown in Listing-1, yields throughput of 3.41 million operations/second. Applying CTR to the baseline yields 4.49 million operations/second.
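The loop structure just described can be sketched as a simplified harness. This is a stand-in, not the authors' benchmark: std::mutex substitutes for the interposed lock, the measurement interval is a parameter rather than a fixed 10 seconds, and the critical and non-critical sections are empty as in the maximum-contention configuration.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Simplified MutexBench-style harness: nthreads threads hammer one central
// lock and we report the aggregate iteration count for the interval.
long run_bench(int nthreads, std::chrono::milliseconds interval) {
    std::mutex lock;                 // stand-in for the interposed lock
    std::atomic<bool> stop{false};
    std::atomic<long> total{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&] {
            long iters = 0;
            while (!stop.load(std::memory_order_relaxed)) {
                lock.lock();         // empty critical section
                lock.unlock();
                ++iters;             // empty non-critical section
            }
            total.fetch_add(iters);
        });
    }
    std::this_thread::sleep_for(interval);
    stop.store(true);
    for (auto& w : workers) w.join();
    return total.load();
}
```

A real run would repeat this 7 times per thread count and report the median.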
Sensitivity analysis: breakdown and contribution of AH and CTR optimizations
In Figure-3 we configure the benchmark so the non-critical section generates a uniformly distributed random value in [ − ) and steps a thread-local C++ std::mt19937 random number generator (PRNG) that many steps, admitting potential positive scalability. The critical section advances a shared random number generator 5 steps. In this moderate contention case we observe that Ticket Locks again do well at low thread counts, and that Hemlock outperforms both MCS and CLH.

Maximum extreme contention – maximizes arrival rate
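The moderate-contention loop bodies can be sketched as follows. The upper bound of the uniform range did not survive extraction above, so the 200 used here is purely a stand-in value.

```cpp
#include <mutex>
#include <random>

std::mutex bench_lock;       // the central lock L
std::mt19937 shared_rng;     // shared PRNG advanced inside the critical section

// One iteration of the moderate-contention loop: the critical section
// advances the shared PRNG 5 steps; the non-critical section steps a
// thread-local PRNG a uniformly distributed number of times.
// NOTE: the 200 bound below is a stand-in; the paper's bound was lost above.
void iteration(std::mt19937& local_rng) {
    bench_lock.lock();
    shared_rng.discard(5);                       // critical section
    bench_lock.unlock();
    std::uniform_int_distribution<int> d(0, 199);
    local_rng.discard(d(local_rng));             // non-critical section
}
```

Stepping a thread-local PRNG a random amount randomizes the arrival rate, which is what admits potential positive scalability.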
To show that our approach is general and portable, we next report MutexBench results on a Sun/Oracle T7-2 [11] in Figures 4 and 5. The T7-2 has 2 sockets, each socket populated by an M7 SPARC CPU running at 4.13GHz with 32 cores. Each core has 8 logical CPUs sharing 2 pipelines. The system has 512 logical CPUs and was running Solaris 11. We used the GCC version 6.1 toolchain to compile the benchmark and the lock libraries. 64-bit SPARC does not directly support atomic fetch-and-add or SWAP operations – these are emulated by means of a 64-bit compare-and-swap operator ( CASX ). To implement CTR in the waiting phase, we used MONITOR-MWAIT on the predecessor’s Grant field followed by an immediate CASX to try to reset
Grant , avoiding the promotion from shared to modified state which would normally be found in naive busy-waiting. As needed, CASX(A,0,0) serves as the read-with-intent-to-write primitive. The system uses MOESI cache coherency instead of the MESIF [26] found in modern Intel-branded processors, allowing more graceful handling of write sharing. The abrupt performance drop experienced by all locks starting at 256 threads is caused by competition for pipeline resources.

We note in passing that care must be taken when negative or retrograde scaling occurs and aggregate performance degrades as we increase threads. As a thought experiment, if a hypothetical lock implementation were to introduce additional synthetic delays outside the critical path, aggregate performance might increase as the delay throttles the arrival rate and concurrency over the contended lock [25]. As such, evaluating just the maximal contention case in isolation is insufficient.

Figure 2: MutexBench: Maximum Contention
Figure 3: MutexBench: Moderate Contention
Figure 4: MutexBench: Maximum Contention – SPARC
Figure 5: MutexBench: Moderate Contention – SPARC
Figure 6: MutexBench: Maximum Contention – AMD
Figure 7: MutexBench: Moderate Contention – AMD
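The CASX-based emulation of SWAP described above can be rendered generically as a compare-and-swap retry loop; this is a portable C++ sketch of the idea, not the SPARC assembly sequence itself.

```cpp
#include <atomic>

// Emulate atomic SWAP (exchange) using only compare-and-swap, as on
// 64-bit SPARC where SWAP and fetch-and-add are synthesized from CASX.
template <typename T>
T cas_swap(std::atomic<T>& target, T desired) {
    T observed = target.load();
    // On failure, compare_exchange_weak reloads `observed`; retry until we
    // install `desired` over a value we actually observed.
    while (!target.compare_exchange_weak(observed, desired)) {
    }
    return observed;  // the value displaced by the swap
}
```

Under contention this loop may retry, which is one reason native SWAP is preferable where the ISA provides it.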
Figures 6 and 7 show performance on a 2-socket AMD NUMA system, where each socket contains an EPYC 7662 64-Core Processor and each core supports 2 logical CPUs, for 256 logical processors in total. The base clock speed is 2.0 GHz. The kernel was Linux version 5.4 and we used the same binaries built on the Intel X5-2 system. AMD uses a MOESI coherence protocol. The results on AMD concur with those observed on the Intel system.
Figure 8: LevelDB readrandom
In Figure-8 we used the “readrandom” benchmark in the LevelDB version 1.20 database, varying the number of threads and reporting throughput from the median of 5 runs of 50 seconds each. Each thread loops, generating random keys and then trying to read the associated value from the database. We used the Oracle X5-2 system to collect data. We first populated a database and then collected data. We made a slight modification to the db_bench benchmarking harness to allow runs with a fixed duration that reported aggregate throughput. Ticket Locks exhibit a slight advantage over MCS, CLH and Hemlock at low thread counts, after which Ticket Locks fade. LevelDB uses coarse-grained locking, protecting the database with a single central mutex: DBImpl::Mutex . Profiling indicates contention on that lock via leveldb::DBImpl::Get() . Using an instrumented version of Hemlock we characterized the application behavior of LevelDB, as it relates to Hemlock. At 64 threads, during a 50 second run, we found 24 instances of calls to lock where a thread already held at least one other lock. These all occurred during the first second after startup. The maximum number of locks held simultaneously by any thread was 2. The maximum number of threads waiting simultaneously on any Grant field was 1, thus the application enjoyed purely local spinning.

leveldb.org
db_bench --threads=1 --benchmarks=fillseq --db=/tmp/db/
db_bench --threads=<threads> --benchmarks=readrandom --use_existing_db=1 --db=/tmp/db/ --duration=50

An interesting variation we intend to explore in the future is to replace the simplistic spinning on the
Grant field with a per-thread condition variable and mutex pair that protect the Grant field, allowing threads to use the same waiting policy as the platform mutex and condition variable primitives. All long-term waiting for the Grant field to become a certain address or to return to 0 would be via the condition variable. Essentially, we treat Grant as a bounded buffer of capacity 1 protected in the usual fashion by a condition variable and mutex. This construction yields 2 interesting properties: (a) the new lock enjoys a fast path, for uncontended locking, that doesn’t require any underlying mutex or condition variable operations; (b) even if the underlying system mutex isn’t FIFO, our new lock provides strict FIFO admission. Again, the result is compact, requiring only a mutex, condition variable and Grant field per thread, and only one word per lock to hold the Tail . For systems where locks outnumber threads, such an approach would result in space savings.
Hemlock is exceptionally simple with short paths, and avoids the dependent loads and indirection required by CLH or MCS to locate queue nodes. The contended handover critical path is extremely short – the unlock operator conveys ownership to the successor in an expedited fashion. Despite being compact, it provides local spinning in common circumstances and scales better than Ticket Locks. Instead of traditional queue elements, as found in CLH and MCS, we use a per-thread shared singleton element. Finally, Hemlock is practical and readily usable in real-world lock implementations.
ACKNOWLEDGMENTS
We thank Peter Buhr and Trevor Brown at the University of Waterloo for access to their AMD system.
REFERENCES [1] Ole Agesen, David Detlefs, Alex Garthwaite, Ross Knippel, Y. S. Ramakrishna, and Derek White. An efficient meta-lock for implementing ubiquitous synchronization.
SIGPLAN Notices OOPSLA 1999 , 1999. doi:10.1145/320385.320402 .[2] H. Akkan, M. Lang, and L. Ionkov. HPC runtime support for fast and powerefficient locking and synchronization. In , 2013. URL: http://dx.doi.org/10.1109/CLUSTER.2013.6702659.[3] Vitaly Aksenov, Dan Alistarh, and Petr Kuznetsov. Brief announcement: Per-formance prediction for coarse-grained locking. In
Proceedings of the 2018 ACMSymposium on Principles of Distributed Computing , PODC ’18. ACM, 2018. URL:http://doi.acm.org/10.1145/3212734.3212785, doi:10.1145/3212734.3212785 .[4] J.H. Anderson, Y.J. Kim, and T. Herman. Shared-memory mutual exclusion:major research trends since 1986.
Distributed Computing , 2003. URL: https://doi.org/10.1007/s00446-003-0088-6.[5] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors.
IEEE Transactions on Parallel and Distributed Systems , 1990. doi:10.1109/71.80120 .[6] Guang-Ien Cheng, Mingdong Feng, Charles E. Leiserson, Keith H. Randall, andAndrew F. Stark. Detecting data races in cilk programs that use locks. In
Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms andArchitectures , SPAA ’98. Association for Computing Machinery, 1998. doi:10.1145/277651.277696 .[7] Jonathan Corbet. MCS locks and qspinlocks. https://lwn.net/Articles/590243,March 11, 2014. Accessed: 2018-09-12.[8] Jonathan Corbet. A surprise with mutexes and reference counts. https://lwn.net/Articles/575460, December 4, 2013.[9] Jonathan Corbet. Mcs locks and qspinlocks, 2014. URL: https://lwn.net/Articles/590243/. emlock : Compact and Scalable Mutual Exclusion
CoRR , abs/1511.06035, 2015. URL: http://arxiv.org/abs/1511.06035, arXiv:1511.06035 .[15] Dave Dice. Malthusian locks. In
Proceedings of the Twelfth European Conferenceon Computer Systems , EuroSys ’17, 2017. URL: http://doi.acm.org/10.1145/3064176.3064203.[16] Dave Dice and Alex Kogan. Compact numa-aware locks.
CoRR , abs/1810.05600,2018. URL: http://arxiv.org/abs/1810.05600.[17] Dave Dice and Alex Kogan. Compact numa-aware locks. In
Proceedings of theFourteenth EuroSys Conference 2019 , EuroSys ’19. Association for ComputingMachinery, 2019. doi:10.1145/3302424.3303984 .[18] Dave Dice and Alex Kogan. TWA - ticket locks augmented with a waiting array.In
Euro-Par 2019: Parallel Processing - 25th International Conference on Paralleland Distributed Computing, Göttingen, Germany, August 26-30, 2019, Proceedings .Springer, 2019. doi:10.1007/978-3-030-29400-7\_24 .[19] Dave Dice and Alex Kogan. Fissile locks, 2020. URL: https://arxiv.org/abs/2003.05025, arXiv:2003.05025 .[20] David Dice. Brief announcement: A partitioned ticket lock. In
Proceedingsof the Twenty-third Annual ACM Symposium on Parallelism in Algorithms andArchitectures , SPAA ’11, 2011. URL: http://doi.acm.org/10.1145/1989493.1989543.[21] David Dice, Virendra J. Marathe, and Nir Shavit. Lock cohorting: A generaltechnique for designing numa locks.
ACM Trans. Parallel Comput. , 2015. URL:http://doi.acm.org/10.1145/2686884, doi:10.1145/2686884 .[22] David Dice and Mark Moir. Inter-thread communication using processormessaging– us patent 10,776,154, 2008. URL: https://patents.google.com/patent/US10776154B2/en.[23] Stijn Eyerman and Lieven Eeckhout. Modeling critical sections in amdahl’slaw and its implications for multicore design. In
Proceedings of the 37th AnnualInternational Symposium on Computer Architecture , ISCA ’10. ACM, 2010. URL:http://doi.acm.org/10.1145/1815961.1816011, doi:10.1145/1815961.1816011 .[24] M. J. Fischer, N. A. Lynch, J. E. Burns, and A. Borodin. Resource allocation withimmunity to limited process failure. In , 1979. URL: http://dx.doi.org/10.1109/SFCS.1979.37.[25] Carlos Gershenson and Dirk Helbing. When Slower is Faster.
CoRR
Computer Architecture, Sixth Edition:A Quantitative Approach . Morgan Kaufmann Publishers Inc., San Francisco, CA,USA, 6th edition, 2017.[28] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctnesscondition for concurrent objects.
ACM Trans. Program. Lang. Syst. , 1990. doi:10.1145/78969.78972 .[29] Prasad Jayanti, Siddhartha Jayanti, and Sucharita Jayanti. Towards an ideal queuelock. In
Proceedings of the 21st International Conference on Distributed Computingand Networking , ICDCN 2020. Association for Computing Machinery, 2020. URL:https://doi.org/10.1145/3369740.3369784.[30] Doug Lea. The java.util.concurrent synchronizer framework.
Science of Com-puter Programming doi:https://doi.org/10.1016/j.scico.2005.03.007 .[31] Waiman Long. qspinlock: Introducing a 4-byte queue spinlock implementation.https://lwn.net/Articles/561775, July 31, 2013, 2013. Accessed: 2018-09-19.[32] Nancy A. Lynch.
Distributed Algorithms . Morgan Kaufmann Publishers Inc., 1996.[33] M. Auslander, D. Edelsohn, O. Krieger, B. Rosenburg, and R. Wisniewski. Enhancement to the MCS lock for increased functionality and improved programmability – U.S. patent application number 20030200457, 2003. URL: https://patents.google.com/patent/US20030200457.[34] P. Magnusson, A. Landin, and E. Hagersten. Queue locks on cache coherent multiprocessors. In
Proceedings of 8th International Parallel Processing Symposium ,1994. doi:10.1109/IPPS.1994.288305 .[35] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchro-nization on shared-memory multiprocessors.
ACM Trans. Comput. Syst. , 1991.URL: http://doi.acm.org/10.1145/103727.103729.[36] John M. Mellor-Crummey and Michael L. Scott. Scalable reader-writer synchro-nization for shared-memory multiprocessors. In
Proceedings of the Third ACMSIGPLAN Symposium on Principles and Practice of Parallel Programming , PPOPP’91. ACM, 1991. URL: http://doi.acm.org/10.1145/109625.109637. [37] Microsoft. Waitonaddress function, 2015. URL: https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitonaddress.[38] Atsushi Nemoto. Bug 13690 – pthread_mutex_unlock potentially cause invalidaccess. https://sourceware.org/bugzilla/show_bug.cgi?id=13690, February 14,2012.[39] Robert O’Callahan and Jong-Deok Choi. Hybrid dynamic data race detection. In
Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice ofParallel Programming , PPoPP ’03. Association for Computing Machinery, 2003. doi:10.1145/781498.781528 .[40] Pedro Ramalhete. Ticket lock - array of waiting nodes (awn), 2015. URL:http://concurrencyfreaks.blogspot.com/2015/01/ticket-lock-array-of-waiting-nodes-awn.html.[41] Pedro Ramalhete and Andreia Correia. Tidex: A mutual exclusion lock. In
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice ofParallel Programming , PPoPP ’16, 2016. URL: http://doi.acm.org/10.1145/2851141.2851171.[42] David P. Reed and Rajendra K. Kanodia. Synchronization with eventcountsand sequencers.
Commun. ACM , 1979. URL: http://doi.acm.org/10.1145/359060.359076.[43] Michael L. Scott.
Shared-Memory Synchronization . Morgan & Claypool Publishers,2013. URL: https://doi.org/10.2200/S00499ED1V01Y201304CAC023.[44] U. Verner, A. Mendelson, and A. Schuster. Extending amdahl’s law for multicoreswith turbo boost.
IEEE Computer Architecture Letters , 2017. URL: https://doi.org/10.1109/LCA.2015.2512982.[45] Tianzheng Wang, Milind Chabbi, and Hideaki Kimura. Be my guest: Mcs locknow welcomes guests.
SIGPLAN PPoPP , 2016. doi:10.1145/3016078.2851160 . ice and Kogan A OPTIMIZATION: OVERLAP
To reduce the impact of waiting for receipt of transfer in the unlock operator, at Listing-1 line 20, we can apply the Overlap optimization, which shifts and defers that waiting step until subsequent synchronization operations, allowing greater overlap between the successor and the outgoing owner. Threads arriving in the lock operator at Listing-3 line 6 wait to ensure their Grant mailbox field does not contain a residual address from a previous contended unlock operation on that same lock, in which case the thread must wait for that tardy successor to fetch and clear the Grant field. In practice, waiting on this condition is rare. (If thread 𝑇 1 were to proceed with a residual Grant value that happens to match that of the lock, then when a successor 𝑇 2 arrives behind 𝑇 1, it will incorrectly see that address in 𝑇 1’s Grant field and then incorrectly enter the critical section, resulting in exclusion and safety failure and a corrupt chain. The check at line 6 prevents that pathology.) In Listing-3 line 16, threads wait for their own Grant field to become empty. Grant could be non- null because of previous unlock operations that wrote an address into the field, but the corresponding successor has not yet cleared the field back to null . That is, Grant is still occupied. Once Grant becomes empty, the thread then writes the address of the lock into Grant , alerting the successor and passing ownership. When ultimately destroying a thread, it is necessary to wait for the thread’s Grant field to transition back to null before reclaiming the memory underlying Grant .
Listing 3: Hemlock with Overlap Optimization
B OPTIMIZATION: AGGRESSIVE HAND-OVER
The Aggressive Hand-Over (AH) optimization, shown in Listing-4, changes the code in unlock to first store the lock's address into the Grant field (Listing-4 line 12), optimistically anticipating the existence of waiters, and then execute the atomic CAS to try to swing the Tail field back from Self to null, handling the uncontended case. If the CAS succeeds, there are no waiters, and we reset Grant back to null and return; otherwise we wait for the successor to clear Grant. This reorganization accomplishes handover earlier. The stores into Grant are harmless to latency, as the thread is likely to have the underlying cache line in modified state in its local cache. Listing-4 also incorporates the CTR optimization. The contended handover critical path is extremely short: the very first statement in the unlock operator conveys ownership to the successor.

In unlock, after we store into the
Grant field and transfer ownership, the successor may enter the critical section and even release the lock in the interval before the original owner reaches the CAS in unlock. As such, it is possible that the CAS in unlock could fetch a Tail value of null. We therefore remove the corresponding assert found at line 17 in Listing-1.
In a sense, when we have waiters and contention, executing the CAS first in unlock adds useless latency and coherence traffic, and delays the handover to the successor.
While the aggressive hand-over optimization improves contended throughput, it can lead to surprising use-after-free memory lifecycle pathologies and is thus not safe for general use in a pthread_mutex implementation. Consider the following scenario where we have a structure instance 𝐼 that contains a lock 𝐿 and a reference count for 𝐼. The reference count, which is currently 2, is protected by 𝐿. Thread 𝑇1 holds 𝐿 while it accesses 𝐼. Thread 𝑇2 waits to acquire 𝐿 and access 𝐼. 𝑇1 finishes accessing 𝐼, decrements the reference count from 2 to 1 and then calls unlock(L), which stores 𝐿's address into 𝑇1's Grant field, passing ownership. 𝑇2 acquires 𝐿 and accesses 𝐼. When finished, 𝑇2 decrements the reference count from 1 to 0 and releases 𝐿, and, as the reference count transitioned to 0 and 𝐼 should no longer be accessible or reachable, 𝑇2 frees 𝐼, which includes 𝐿. 𝑇1 then resumes in unlock and executes the CAS on the freed 𝐿, resulting in a use-after-free error. Similar pathologies have been observed and fixed in the Linux kernel lock implementation and the user-mode pthread_mutex implementations [8, 38]. We thank Alexander Monakov and Travis Downs for reminding us of this concern.
Broadly, if the unlock operator has a fast path which might release or transfer the lock, and, in the same invocation of unlock, might then subsequently access the lock body, then the lock implementation is exposed to the use-after-free problem. Put another way, once transfer has been effected or potentially effected, the lock implementation must not access the lock body again. In our case, the speculative hand-over store at line 12 renders the AH algorithm vulnerable.

AH remains safe and immune from use-after-free errors, however, in any environment where the lock body 𝐿 cannot recycle while a thread remains in unlock(L). If 𝐿 is garbage collected or protected by safe memory reclamation techniques, such as read-copy update (RCU), then AH is permissible, as the thread calling unlock(L) holds a reference to 𝐿 which prevents 𝐿 from recycling. Furthermore, AH is safe if 𝐿 resides in type-stable memory or if 𝐿 is never deallocated, as would be the case for statically allocated locks. Even if a lock recycles, AH remains safe under type-stable memory, as there is no way the CAS (following the speculative store into L->Grant in unlock) could find the caller's thread in the Tail field and do any damage. The AH form (with CTR) provides the best overall performance of the Hemlock family.
Listing 5: Hemlock with Optimized Hand-Over – Variant 1
We now show additional variants that avoid use-after-free concerns, but which still provide the fast contended hand-over exhibited by AH.

In Listing-5 we augment the encoding of Grant to add a distinguished 𝐿|1 value. In the unlock operator, a thread first checks whether its own Grant field is 𝐿|1, indicating that a successor for 𝐿 has announced itself, in which case the thread overwrites 𝐿|1 with 𝐿 to pass ownership to that successor. This approach also avoids, for common modes of contention, any accesses to the lock's Tail field in the unlock operator, further reducing traffic on that coherence hotspot. By eliminating the speculative store into Grant found in AH, we avoid use-after-free concerns.

The form in Listing-6 checks for the existence of successors in the unlock operator by first fetching the lock's
Tail field. Successors exist if and only if the value is not equal to Self (Listing-6 line 12). This is tantamount to a "polite" CAS that first loads the value, avoiding the futile CAS and its write invalidation when there are successors.

Listing 6: Hemlock with Optimized Hand-Over – Variant 2

This form is also immune to use-after-free concerns. Under contention, when there are waiting threads, the naive form incurs a futile CAS and write invalidation on the Tail field (Listing-1 line 16) in the critical path, before effecting transfer at line 20, while this version avoids the futile
CAS.

C WAITING STRATEGIES
If desired, threads in the Hemlock slow-path (Listing-1 line 10) could optionally be made to wait politely, voluntarily surrendering their CPU and blocking in the operating system, via constructs such as WaitOnAddress [37], where a waiting thread could use WaitOnAddress to monitor its predecessor's Grant field.
We note that user-mode locks are not typically implemented as pure spin locks, instead often using a spin-then-park waiting policy which voluntarily surrenders the CPU of waiting threads after a brief optimistic spinning period designed to reduce the context-switching rate. In our case, we find that user-mode is a convenient venue for experiments.
Under Hemlock, a thread releasing a lock can determine with certainty, based on the Tail value, whether or not successors exist, but the identity of the successor is not known to the thread calling unlock. As such, Hemlock is not immediately amenable to waiting strategies such as park-unpark [13, 14, 30], where unpark wakes a specific thread.

To allow purely local spinning and enable the use of park-unpark waiting constructs, we can replace the per-thread Grant field with a per-thread pointer to a chain of waiting elements, each of which represents a waiting thread. The elements on 𝑇's chain are 𝑇's immediate successors for various locks. Waiting elements contain a next field, a flag, and a reference to the lock being waited on, and can be allocated on-stack. Instead of busy waiting on the predecessor's Grant field, waiting threads use