Light-weight Locks

Abstract

In this paper, we propose a new approach to building synchronization primitives, dubbed "lwlocks" (short for light-weight locks). The primitives are optimized for small memory footprint while maintaining efficient performance in low contention scenarios. A read-write lwlock occupies 4 bytes, a mutex occupies 4 bytes (2 if deadlock detection is not required), and a condition variable occupies 4 bytes. The corresponding primitives of the popular pthread library occupy 56 bytes, 40 bytes and 48 bytes respectively on the x86-64 platform. The API for lwlocks is similar to that of the pthread library but covering only the most common use cases. Lwlocks allow explicit control of queuing and scheduling decisions in contention situations and support "asynchronous" or "deferred blocking" acquisition of locks. Asynchronous locking helps in working around the constraints of lock-ordering which otherwise limits concurrency. The small footprint of lwlocks enables the construction of data structures with very fine-grained locking, which in turn is crucial for lowering contention and supporting highly concurrent access to a data structure. Currently, the Data Domain File System uses lwlocks for its in-memory inode cache as well as in a generic doubly-linked concurrent list which forms the building block for more sophisticated structures.


Nitin Garg, Ed Zhu, and Fabiano C. Botelho
Data Domain, an EMC Company, Santa Clara, CA, USA
{nitin.garg, ed.zhu, fabiano.botelho}@emc.com


The advent of multi-core systems has forced a rethinking of basic data structures in order to support greater scalability and concurrency [11]. While there have been good strides in building lock-free versions of certain data structures [2, 5], and software transactional memory (STM) based techniques are becoming popular [9, 10], the use of traditional locking techniques remains the de facto standard for synchronization in shared-memory systems. The usual technique for increasing concurrency using traditional locking schemes, aside from using algorithms that reduce the concurrent sections [4, 6], is to use different locks for different parts of the data structures. The use of such fine-grained locking often runs afoul of the overhead involved, thereby limiting the maximum number of locks used. To minimize the space overhead, the algorithms usually try to minimize the number of locks, and in turn need to build a mapping to and from different parts of the structure to the corresponding lock. This adds to the complexity of the code that needs to be maintained.

In this paper we present a novel technique to create locking primitives that have a very small memory footprint. We call our locks "light-weight locks" or "lwlocks". Specifically, a read-write lock in our scheme takes 4 bytes, a mutex takes 4 bytes (only 2 if deadlock detection is not required), and a condition variable takes 4 bytes. The corresponding primitives of the popular pthread library occupy 56 bytes, 40 bytes and 48 bytes respectively on the x86-64 platform. The API for lwlocks is modeled after that of the pthread library. We however eschew some of the features provided by pthread locks for the sake of simplicity of our implementation.

We consider our contributions as being four-fold: (i) locking primitives with a small memory footprint, which makes them ideal for very fine-grained locking; (ii) the mechanism underlying the implementation of lwlocks, which allows creation of custom lock-like primitives; (iii) access to the waiting queue of threads so custom scheduling schemes can be implemented; and (iv) support for "asynchronous" or "deferred blocking" locking.

In this paper, we focus largely on lwlocks. The rest of the paper is organized as follows: Section 2 describes the idea that forms the basis of lwlocks. Section 3 describes the internal structure of the supported primitives and the algorithms for implementing their APIs. Section 4 briefly describes possible extensions to lwlocks and how asynchronous locking works. Section 5 compares the performance of lwlocks with the corresponding primitives in the pthread library. Finally, in Section 6 we present our conclusions.

The core idea behind lwlocks is the observation that while a thread could block on different locks or wait on many different condition variables in its lifetime, it can block on only one lock or condition variable at any given point. With lwlocks, whenever a thread has to block, it uses a "waiter" structure to do so. In this paper, we use the terms "waiter structure" and "waiter" interchangeably. Each thread has its own waiter structure and can access it by invoking the tls_get_waiter function (which returns the pointer to the waiter kept in thread-local storage).

Figure 1 presents the definition of a waiter structure. For compact representation, we limit the maximum number of waiter structures to be less than 2^16 so that each structure can also be uniquely referred to by a 16-bit number. We reserve the value 2^16 - 1 for the NULL waiter structure and denote it by NULLID. We expect the limit on the number of waiters, and hence on the number of threads, to be large enough for most applications. (Footnote: The limit can be increased for a small increase in the size of the locks, which presumably will be acceptable for an application that can support so many threads.)

    // The following definitions assume that each bool_t takes 4 bytes,
    // each pthread_mutex_t takes 40 bytes, each pthread_cond_t takes
    // 48 bytes, and each function pointer takes 8 bytes.

    interface event_t {
        void   signal();
        void   wait();
        bool_t poll();
    } // 24 bytes

    interface domain_t {
        waiter_t alloc_waiter();
        void     free_waiter(waiter_t waiter);
        waiter_t get_waiter();
        waiter_t id2waiter(uint16_t id);
    } // 32 bytes

    struct waiter_t {
        event_t         event;
        domain_t        domain;
        bool_t          signal_pending;
        bool_t          waiter_waiting;
        pthread_mutex_t mutex;
        pthread_cond_t  cond;
        uint64_t        app_data;
        uint16_t        id;
        uint16_t        next;
        uint16_t        prev;
    } // 166 bytes

Fig. 1: Definition of a waiter structure.

A waiter structure is assigned to a thread the first time the thread accesses it (via tls_get_waiter), and the structure is returned to the pool of free waiter structures when the thread exits, to be re-used by a later thread. The waiter structure is the key piece that enables the compact nature of lwlocks. It can also be used to create other custom compact lock-like data structures. The current non-optimized implementation of a waiter structure occupies 166 bytes. Since this cost is per thread and we expect the normal use case to have far fewer threads than the number of locks, the amortized cost is very low. For example, an application with 1,024 threads and around 4,[...]

Waiter's Event.

A generic event interface underlies the actual mechanics that are used by a thread when it blocks or unblocks on a lock. The two main operations defined for an event are: (i) wait, which is called to wait for the event to trigger; and (ii) signal, which informs a waiter of an event getting triggered. A waiter structure uses one pthread mutex and one condition variable to implement both operations. The operation wait blocks the thread on the condition variable until a signal arrives. The operation signal wakes up the blocked thread. Like semaphores, the implementation ensures that a signal on an event cannot be lost, i.e., signal can be invoked before the matching wait, and the wait will find the pending signal. Unlike a semaphore, however, the operations wait and signal are always called in pairs. There is also a third operation called poll. It can be used to check if a signal is already pending.
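The event behavior described above can be sketched in C as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names event_init, event_wait, event_signal and event_poll are hypothetical, and the roles given to the signal_pending and waiter_waiting flags are inferred from their names in Figure 1. The pending flag is what allows a signal issued before the matching wait to be found later.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the event part of a waiter structure. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool signal_pending;   /* a signal arrived and was not yet consumed */
    bool waiter_waiting;   /* the owning thread is blocked in event_wait */
} event_t;

void event_init(event_t *e) {
    pthread_mutex_init(&e->mutex, NULL);
    pthread_cond_init(&e->cond, NULL);
    e->signal_pending = false;
    e->waiter_waiting = false;
}

/* Block until a signal is pending, then consume it. */
void event_wait(event_t *e) {
    pthread_mutex_lock(&e->mutex);
    e->waiter_waiting = true;
    while (!e->signal_pending)              /* loop guards against spurious wake-ups */
        pthread_cond_wait(&e->cond, &e->mutex);
    e->waiter_waiting = false;
    e->signal_pending = false;
    pthread_mutex_unlock(&e->mutex);
}

/* Record the signal; wake the owner only if it is already blocked. */
void event_signal(event_t *e) {
    pthread_mutex_lock(&e->mutex);
    e->signal_pending = true;
    if (e->waiter_waiting)
        pthread_cond_signal(&e->cond);
    pthread_mutex_unlock(&e->mutex);
}

/* Report whether a signal is pending, without consuming it. */
bool event_poll(event_t *e) {
    pthread_mutex_lock(&e->mutex);
    bool pending = e->signal_pending;
    pthread_mutex_unlock(&e->mutex);
    return pending;
}
```

Because wait and signal are always paired, a single boolean is enough here; a counting semaphore would only be needed if several signals could accumulate.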

Waiter’s Domain.

Instead of a fixed implementation for mapping from an id to the waiter structure, we have abstracted out the notion of a waiter's domain. A waiter's domain defines four operations: (i) alloc_waiter, to allocate a waiter from the domain; (ii) free_waiter, to return a waiter back to the domain; (iii) get_waiter, which allows a thread to get to its own waiter; and (iv) id2waiter, to map from an id to the waiter structure.

Abstracting the notion of a domain has three benefits. First, it provides one more way of extending the system so that instead of an entire application being limited to a maximum of 2^16 threads, the limit only applies to individual domains. Second, it provides the flexibility to create domains that have a lower limit on maximum concurrency, thereby allowing for creation of locks with an even smaller footprint. For example, a system limiting itself to 15 threads (127 without deadlock detection) would need only 1 byte for a mutex. Third, combining it with a custom event allows for creation of libraries such as a user-space job scheduler: the wait call on a job blocks it and causes the scheduler to switch to another ready job, while the signal call marks the job as ready again. We mention waiter's domains only for completeness, as they are not necessary to understand the workings of lwlocks. Lwlocks use a default global domain whose waiter structures implement the behavior we describe here.

Forming Lists or Stacks of Waiters.

Each waiter structure records its own id. It also has space for previous and next id values which can be used to form stacks or lists of waiters. Such a list (or stack) of waiters can be identified purely by the id of the first element of the list, i.e., it can be represented by a 16-bit value. To go to the next (previous) waiter, we convert the current id to the corresponding waiter structure and look at the next (previous) id field in it.
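The id-based linking can be illustrated with a small single-threaded C sketch. It assumes a fixed array as the waiter pool; the names waiter_pool, MAX_WAITERS, waitq_push and waitq_oldest are hypothetical, and only the next field is modeled (the prev field is omitted for brevity).

```c
#include <stddef.h>
#include <stdint.h>

#define NULLID      0xFFFFu   /* 2^16 - 1, reserved for "no waiter" */
#define MAX_WAITERS 1024      /* illustrative pool size, must stay below 2^16 */

typedef struct {
    uint16_t id;     /* this waiter's own id */
    uint16_t next;   /* id of the next waiter in the list, or NULLID */
} waiter_t;

waiter_t waiter_pool[MAX_WAITERS];

/* Map a 16-bit id to its waiter structure (NULL for NULLID or bad ids). */
waiter_t *id2waiter(uint16_t id) {
    return (id == NULLID || id >= MAX_WAITERS) ? NULL : &waiter_pool[id];
}

/* Push waiter `id` onto the list identified by the head id `*head`. */
void waitq_push(uint16_t *head, uint16_t id) {
    waiter_t *w = id2waiter(id);
    w->id = id;
    w->next = *head;
    *head = id;
}

/* Count the waiters reachable from `head` by following next ids. */
uint16_t waitq_size(uint16_t head) {
    uint16_t count = 0;
    while (head != NULLID) {
        head = id2waiter(head)->next;
        count++;
    }
    return count;
}

/* Follow next ids until the waiter whose next field is NULLID. */
uint16_t waitq_oldest(uint16_t head) {
    uint16_t id = head;
    while (id != NULLID && id2waiter(id)->next != NULLID)
        id = id2waiter(id)->next;
    return id;
}
```

The whole list is named by one 16-bit head id, which is why a lock only has to reserve 2 bytes for its queue of blocked threads.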

Locking Data.

The final important piece of a waiter structure is the space it provides that can be used by the abstractions built on top for their own purpose. The waiter itself does not interpret it in any way. For instance, read-write lwlocks use this space to record the type of locking operation that the thread was performing when it blocked: whether it was taking a read or write lock. Currently, this space amounts to 8 bytes and is referred to as app_data.

Light-Weight Lock Primitives

We now look at the internals of each of the lwlock primitives, the supported operations and how they work. The lwlocks by default are "fair": a lock is acquired in FIFO order by the threads blocked on it, and wake-ups from a condition variable are done in the order in which the threads called wait on the condition variable. Pthread locks are not fair in this sense, and although it is possible to build lwlocks to mimic the same behavior, we have found fairness to be better suited to our needs in the Data Domain File System [12].

Each primitive uses 2 bytes to keep a queue of waiter structures of the threads that are blocked on that primitive. This queue is aptly called a waitq. The waitq is maintained as a "reverse list" as that allows insertion of a new waiter in a single hardware-supported compare-and-swap (CAS) instruction. The next field of a waiter structure holds the id of the waiter structure in front of it. The oldest waiter's next field holds NULLID. The oldest waiter is the waiter in front of the waitq.

To acquire a lock, a thread uses the CAS instruction to either take ownership of the lock or add its own waiter structure to the lock's waitq. If the lock is acquired, nothing more needs to be done. If it cannot be acquired, then the thread waits on its waiter structure (by calling the event's wait routine). When the thread's turn comes to own the lock (in FIFO order), the unlocking thread will transfer the lock to it and invoke the event's signal routine on the waiter to wake up the thread. Since the unlocking thread does the work of transferring the lock state and ownership, the waking thread can assume that it has the lock upon being signaled. The unlocking thread has to walk the waitq to find the waiter to signal. At any point there can be only one thread performing the transfer on a lock and hence the walk is safe to perform.

We now present each one of the lwlock primitives. Note that we only highlight the essence of the various operations in the included algorithms. The actual implementation, which we hope to release to the open source community in the near future, has additional logic for performance optimization.

Light-weight mutex.

The 4 bytes of a light-weight mutex (henceforth a lwmutex) are composed of 2 equal parts. The first part holds the id of the waiter structure of the owner thread and the second part is the waitq. The owner id is necessary to do self-deadlock detection. Figure 2 outlines the lock and unlock algorithms for the 4-byte version of a lwmutex.

If deadlock detection is not required, the lock only needs to be 16 bits in size to hold the waitq. To comply with POSIX semantics, we also need to be able to ascertain the owner of such a mutex. Fortunately, we can use the same waitq space. The locking thread swaps the NULLID of the waitq with the id of its own waiter to indicate that the lock is taken. As other threads block, their waiter structures get added to the waitq as in the case of the regular lwmutex. The difference is that the next field of the waiter structure in front of the waitq does not hold NULLID. Instead, it holds the id of the waiter of the lock owner thread. Hence, the unlock operation traverses the waitq until a waiter whose next field matches the id of the unlocking thread's waiter structure is reached. The next field is reset to NULLID and the waiter is signaled.
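The lock side of the 4-byte lwmutex can be sketched with C11 atomics. This is a hedged illustration rather than the paper's code: the packing of the owner id into the high 16 bits and the waitq into the low 16 bits, the waiter_next array standing in for the next field of each waiter, and the function name lwmutex_lock_or_enqueue are all assumptions made for the example. A single CAS either takes ownership or links the caller in as the newest entry of the waitq.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NULLID 0xFFFFu
#define LWMUTEX_INIT (((uint32_t)NULLID << 16) | NULLID)  /* no owner, empty waitq */

typedef _Atomic uint32_t lwmutex_t;   /* assumed layout: owner id << 16 | waitq id */

/* Hypothetical stand-in for the next field of each waiter (ids below 1024). */
uint16_t waiter_next[1024];

uint16_t owner_of(uint32_t v) { return (uint16_t)(v >> 16); }
uint16_t waitq_of(uint32_t v) { return (uint16_t)(v & 0xFFFFu); }
uint32_t pack(uint16_t owner, uint16_t waitq) {
    return ((uint32_t)owner << 16) | waitq;
}

/* Returns true if the caller now owns the mutex. Returns false if the caller
   was added to the waitq; it would then block on its event until the
   unlocking thread transfers ownership to it. */
bool lwmutex_lock_or_enqueue(lwmutex_t *m, uint16_t self) {
    uint32_t o = atomic_load(m);
    uint32_t n;
    do {
        if (owner_of(o) == NULLID) {
            n = pack(self, waitq_of(o));        /* take ownership */
        } else {
            waiter_next[self] = waitq_of(o);    /* link to the waiter in front */
            n = pack(owner_of(o), self);        /* become the newest waiter */
        }
        /* on failure, o is refreshed with the current value and we retry */
    } while (!atomic_compare_exchange_weak(m, &o, n));
    return owner_of(n) == self;
}
```

Returning instead of blocking when the caller is queued mirrors the point at which the paper's lock routine calls the event's wait operation.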

Light-weight condition variable.

The 4 bytes of a light-weight condition variable (henceforth a lwcondvar) are composed of 2 equal parts. The first part is a 2-byte version of lwmutex and the second part is the queue of waiter structures. There are three basic operations for a lwcondvar: (i) wait; (ii) signal; and (iii) broadcast. The internal 2-byte lwmutex is used to synchronize manipulation of the waiter's queue, which makes the algorithms for those three operations very easy to derive. The algorithms for the three operations are presented in Appendix A.

(Footnote on the lock transfer performed by the unlocking thread: For unfair locks, this part has to change and the waking thread would need to try again to take the lock.)

    struct lwmutex_t {
        uint16_t owner;   // owner ID
        uint16_t waitq;   // tail of queue
    }

    void lock(lwmutex_t m) {
        w = tls_get_waiter();
        do {
            n = o = m;
            if (o.owner == NULLID)
                n.owner = w.id;
            else {
                w.next = n.waitq;
                n.waitq = w.id;
            }
        } while (!CAS(m, o, n));
        if (n.owner == w.id)
            return;              // Got lock
        // Wait for lock transfer
        w.event.wait();
    }

    void unlock(lwmutex_t m) {
        w = tls_get_waiter();
        do {
            n = o = m;
            wtw = pw = NULL;     // nothing to wake unless a waiter is found
            if (o.waitq != NULLID) {
                wtw = id2waiter(o.waitq);
                while (wtw.next != NULLID) {
                    pw = wtw;
                    wtw = id2waiter(wtw.next);
                }
                if (pw == NULL)
                    n.waitq = NULLID;
                // Transfer lock to wtw
                n.owner = wtw.id;
            } else
                n.owner = NULLID;
        } while (!CAS(m, o, n));
        // remove wtw from q
        if (pw != NULL)
            pw.next = NULLID;
        if (wtw)
            wtw.event.signal();
    }

Fig. 2: Operations to lock and unlock a lwmutex. The old and new values passed in to CAS are denoted by o and n, respectively. The caller's thread-local waiter structure is denoted by w. We use wtw and pw to denote the waiter to wake up and the previous waiter in the queue, respectively.

Light-weight read-write lock.

The light-weight read-write lock (henceforth a lw_rwlock) also uses 2 of its 4 bytes for the waitq. Of the remaining 16 bits, 14 bits are used for the count of read locks granted, 1 bit is used to indicate a write lock, and the final 1 bit is used to indicate whether the lock is read-biased or not. A read-biased lock is unfair towards writers in the sense that a thread that needs a read lock will acquire it without any regard to waiting writers if the lock is already held by other readers. This behavior is similar to that of the pthread read-write lock and is essential for applications where a thread can recursively acquire the same lock as a reader. Without the read-biased behavior, a deadlock can result if a writer arrives in between two read lock acquisitions: the second read lock attempt will wait for the writer, which is waiting for the first read lock to be released. Applications that do not have recursive read locking do not need the read-biased behavior but may choose to use it for throughput reasons.

The 14-bit reader count limits the maximum number of readers per lock to 2^14, a limit that we have found to be sufficient in practice. The limit can be raised in one of three ways: by having the API explicitly flag read-bias behavior, so the bias bit does not have to be in the lw_rwlock; by restricting the maximum concurrency, thereby freeing bits from the waitq; or by slightly increasing the size of the lock.

Figure 3 outlines the algorithms for the two main operations: (i) lock, and (ii) unlock. The lock operation on lw_rwlock is similar to that of lwmutex with the added flag indicating if a read or write lock is requested. The unlock operation for a non-read-biased lw_rwlock has to pick the oldest set of waiters that it can signal: either a single writer or a set of contiguous readers. A read-biased lw_rwlock can follow the same logic as a non-biased lw_rwlock when the transfer is to a waiting writer.
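The 32-bit layout of a lw_rwlock can be made concrete with a short C sketch. The field widths (1 + 1 + 14 + 16 bits) come from the text; the bit positions chosen here, the helper names, and the explicit overflow check on the reader count are assumptions made for illustration only. The rw_can_read test follows the reader-admission rule stated for the lock operation: no writer holds the lock, and either nobody is queued or the lock is read-biased.

```c
#include <stdbool.h>
#include <stdint.h>

#define NULLID        0xFFFFu
#define RD_BIAS_BIT   (1u << 31)                  /* assumed position of the bias bit */
#define WLOCKED_BIT   (1u << 30)                  /* assumed position of the write bit */
#define READERS_SHIFT 16
#define READERS_MASK  (0x3FFFu << READERS_SHIFT)  /* 14-bit reader count */
#define MAX_READERS   0x3FFFu                     /* largest value 14 bits can hold */

uint32_t rw_readers(uint32_t v) { return (v & READERS_MASK) >> READERS_SHIFT; }
uint16_t rw_waitq(uint32_t v)   { return (uint16_t)(v & 0xFFFFu); }

/* Can a read lock be granted on state v without blocking? */
bool rw_can_read(uint32_t v) {
    if (v & WLOCKED_BIT)
        return false;                              /* a writer holds the lock */
    if (rw_readers(v) == MAX_READERS)
        return false;                              /* assumption: never overflow the count */
    return rw_waitq(v) == NULLID || (v & RD_BIAS_BIT) != 0;
}

/* State after granting one more read lock (caller checked rw_can_read). */
uint32_t rw_add_reader(uint32_t v) { return v + (1u << READERS_SHIFT); }
```

In a real lock these helpers would compute the new value inside a CAS retry loop, as in the lwmutex case; they are kept as pure functions here so the bit arithmetic is easy to check.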
For the transfer from a writer to reader(s), however, the writer has to signal all readers, not just the oldest contiguous set. The solution is to have the writer atomically remove the entire waitq and downgrade to a read lock. It then separates the waitq into two queues: one consisting of readers and one consisting of writers. The writers are added back to the front of the waitq while also updating the reader count to fully account for the readers found in the removed waitq. Finally, the readers can be signaled. Note that the re-insert of the waiting writers during unlock is safe. The re-insert is done at the front of the waitq and any new writers will add themselves to the back of the waitq. No other thread can be traversing the waitq for ownership transfer as the re-inserting thread holds a read lock. This case makes the implementation of lw_rwlocks the most complex of all the primitives, and the algorithm outline is only at a high level for the contention case, where the waitq has at least one waiter in it. The non-contended case is simple to derive.

    struct lw_rwlock_t {
        uint1_t  rd_bias;
        uint1_t  wlocked;
        uint14_t readers;
        uint16_t waitq;
    }

    void lock(lw_rwlock_t l, bool_t exclusive) {
        w = tls_get_waiter();
        do {
            o = n = l;
            if (!exclusive && !o.wlocked &&
                (o.waitq == NULLID || o.rd_bias))
                n.readers++;
            else if (exclusive &&
                     !(n.wlocked || n.readers > 0))
                n.wlocked = 1;
            else {
                // Need to block
                w.app_data = exclusive;
                w.next = o.waitq;
                n.waitq = w.id;
            }
        } while (!CAS(l, o, n));
        if (n.waitq == w.id)
            w.event.wait();
    }

    void unlock_fair(lw_rwlock_t l) {
        do {
            o = n = l;
            if (n.wlocked == 1)
                n.wlocked = 0;
            else
                n.readers--;
            if (!(n.wlocked || n.readers > 0)) {
                (pw, wtw) = find_oldest_set_of_waiters(n);
                if (pw == NULL)
                    n.waitq = NULLID;
                if (wtw.app_data != exclusive) {
                    n.readers = waitq_size(wtw.id);
                } else   // single writer picked
                    n.wlocked = 1;
            }
        } while (!CAS(l, o, n));
        if (pw != NULL)
            pw.next = NULLID;
        wake_up_waiters(wtw);
    }

    uint16_t waitq_size(uint16_t wid) {
        uint16_t count = 0;
        while (wid != NULLID) {
            wid = id2waiter(wid).next;
            count++;
        }
        return count;
    }

    void unlock(lw_rwlock_t l) {
        if (!l.rd_bias)
            return unlock_fair(l);
        if (l.wlocked == 0) {
            do {   // Only writers in waitq
                o = n = l;
                if (n.readers == 1) {
                    n.wlocked = 1;
                    n.readers = 0;
                } else
                    n.readers--;
            } while (!CAS(l, o, n));
            if (n.wlocked)
                unlock_fair(l);
            return;
        }
        // writer unlocking a biased lock
        ow = find_oldest_waiter(l);
        if (ow.app_data == exclusive) {
            // handing off to writer
            unlock_fair(l);
            return;
        }
        // Wake up all readers. Atomically downgrade to read lock and
        // clear & return waitq. After the downgrade only writers can
        // block on l.
        waitq = downgrade_to_read_lock(l);
        // split waitq in two subqueues: readers queue and writers queue
        (rd_q, wr_q) = splitq(waitq);
        do {
            o = n = l;
            // 1 reader added during downgrade.
            n.readers += waitq_size(rd_q) - 1;
            if (n.waitq == NULLID)
                n.waitq = wr_q;
        } while (!CAS(l, o, n));
        if (n.waitq != wr_q && wr_q != NULLID) {
            ow = find_oldest_waiter(n);
            ow.next = wr_q;
        }
        wake_up_waiters(id2waiter(rd_q));
    }

Fig. 3: Operations to lock and unlock a lw_rwlock. The lock operation takes a boolean as input to indicate whether an exclusive lock is requested. The old and new values passed in to CAS are denoted by o and n, respectively. The caller's thread-local waiter structure is denoted by w. We use wtw, pw and ow to denote the waiter to wake up, the previous waiter in the waitq, and the oldest waiter in the waitq, respectively.

Asynchronous Locking and Other Extensions

We take a moment here to highlight some aspects of the algorithms presented in Section 3 and how small changes would enable alternative behaviors. On the locking side, the key observation is that once a thread has put itself on the wait queue of a lock or condition variable, it is guaranteed to have the lock transferred to it or a signal delivered to it. The thread does not have to call wait right away. The thread could spin for a certain amount of time on poll before calling wait, effectively creating adaptive locks. It could also keep spinning, which would create starvation-free spin locks. Both of these are scalable and contention-free, similar to the approaches in [1, 3, 7].

Alternatively, the lock operation could simply return without calling wait at all. This would allow the calling thread to take some application-specific action before invoking wait. We call this mode of operation taking an "asynchronous" or "deferred blocking" lock. Asynchronous locking is the key enabler to work around the constraints that lock-ordering imposes. We use this functionality in building a generic highly concurrent doubly-linked list in the Data Domain File System [12]. The list allows concurrent appends, dequeues, inserts (before or after any member), deletes and iterators (in either direction). Some of these operations need to acquire locks in the opposite order of other operations. To avoid deadlocks, a canonical order is picked and operations that need to acquire locks in the opposite direction use asynchronous locking.

The following example, taken from the doubly-linked list implementation, illustrates how asynchronous locking is used and why it is essential. Suppose the canonical order for nodes A and B is A, then B. A thread holds a lock on B already and needs to lock A. It will make an asynchronous lock call for A. If the thread is unable to get the lock, it is on A's waitq, and it releases the lock on B.
It then waits for the lock on A to be granted and then reacquires the lock on B (which is in canonical order). In the above sequence, the thread always either holds a lock (on A or B) or is in the waitq of a lock (on A). Other guarantees in the data structure assure that in this case A and B will remain valid and hence there will be no illegal access. Achieving this without asynchronous locking is not possible. Using trylock on A and, upon failure, releasing B then locking A leaves a window open between the release of B and the locking of A where neither node is in any way aware of the thread. One or both nodes could go away in that window and the thread would end up performing an illegal access.

We are also working on building highly concurrent versions of other data structures (trees of various kinds) where we expect to use asynchronous locking frequently. Note that since there is only one waiter structure per thread per domain, a thread can only be performing one asynchronous lock operation per domain at any time. To keep the discussion focused on lwlocks, we cannot go into any more details of our list or other data structures here.

On the unlocking side of the operations, we note that since the waitq management is visible in user space, the unlocking thread has a lot of flexibility in picking which of the waiting threads to signal and whether to do lock hand-off or have the signaled thread retry. This can be exploited to create any custom scheduling policy. We could pick the thread with the highest priority or the longest waiting thread, or even have applications use the app_data to define their own preferences. Signaling waiters in LIFO instead of FIFO order would trade fairness for performance, as we illustrate in Section 5.

Finally, with most hardware supporting 64-bit CAS instructions, the generic building block of a 16-bit waitq leaves 48 bits available for building other primitives.
For example, we have built semaphore-like counters and a combined mutex+condition variable structure, and implemented upgrade and downgrade operations for lw_rwlocks (see Appendix B for lw_rwlock algorithms that allow these). Although our implementation has focused on process-private locks, we believe it is possible to extend the approach to include process-shared locks. For example, the Linux operating system limits the maximum number of processes to 2^[...], which would give a natural mapping from the process id to the waiter structure id for the process. The structures could be managed in user-space shared memory or the kernel could manage them. An actual semaphore would be more appropriate in this case to implement the event interface for the waiters.

We now examine some experimental data to show that the performance of lwlocks is acceptable. The experiments were performed on a 4-socket system with Xeon E7-4860 processors. Each socket has 10 physical (20 with hyper-threading enabled) cores, for a total of 40 (80 hyper-threaded) cores. The machine has 256GB of memory and each core operates at 2.26GHz.

We have carried out three sets of experiments. Each experiment was run 20 times, which was enough to get a confidence level of 99% on the presented average values. The first one compares the performance of unfair lwmutexes with unfair pthread mutexes. Unfair mutexes trade off fairness for performance by using the greedy approach: the unlocking thread can reacquire the lock right away again. This is done to avoid the convoy problem. We have implemented two versions of unfair lwmutexes: (i) LIFO wake-ups, which wakes up the most recent thread in the waitq; and (ii) FIFO wake-ups, which wakes up the longest-waiting thread in the waitq. The experiment consists of n threads carrying out the same number of operations on a global doubly-linked list protected by a single unfair mutex; each operation has the same cost.
Each thread acquires the global mutex, performs an operation and drops the mutex. There is no activity outside the locked code block except to increment the loop counter.

Figure 4 (a) shows how the latency per operation increases with the number of contending threads. As the number of threads increases, the per-operation cost goes up for all lock types. Note that, for relatively low contention (n ≤ [...]) lwmutexes perform as well as the unfair pthread mutex. We are satisfied that our implementation is reasonably efficient from the performance shown by the unfair lwmutex. The gap between the pthread mutex and the LIFO unfair lwmutex arises from the fact that the pthread mutex tries the CAS operation only once before making a system call to block. The lwmutex code (both lock and unlock) has to contend until the caller has performed a successful CAS operation. The performance gap between the LIFO and FIFO versions of lwmutex highlights the overhead of traversing the waitq. It is well known that a fair mutex is considerably slower than an unfair one under high contention due to frequent context switches (the convoy problem). For 32 contending threads we saw that the latency per operation can go as high as 13x the latency per operation seen for unfair mutexes. However, if there is no contention or just a few contending threads (< [...] lwmutexes and pthread locks can be reduced further, our primary concern is the memory overhead that prevents their use in extremely fine-grained locking. Fine-grained locking results in lower contention in general and hence improved performance overall, as we show in the next experiment.

(Footnote: Our code is written entirely in C and compiled with O4 optimization. Pthread code is part C and part fine-tuned assembly.)

[Figure 4, panel (a): latency (microseconds per operation) versus number of threads, on 40 physical cores of 2.26 GHz, for unfair pthread mutex, unfair lwmutex (LIFO wake-ups) and unfair lwmutex (FIFO wake-ups). Panel (b): latency (microseconds per operation) versus number of threads, on 40 physical cores of 2.26 GHz, for 1k unfair pthread mutexes and 1 fair lwmutex per bucket.]

(a) Latency per operation on a global list protected by either a single unfair lwmutex (with LIFO wake-ups or FIFO wake-ups), or a pthread mutex.
(b) Latency per operation on either a hash table using a fair lwmutex per bucket or 1,024 unfair pthread mutexes, each one protecting a range of 1,024 buckets.

Fig. 4: Latency per operation for: (a) coarse-grained locking; (b) fine-grained locking.

The second experiment illustrates how fine-grained locking can deliver better performance overall. The experiment consists of n threads performing lookups, followed by an update to the looked-up record, on a hash table. The hash table has 1 million buckets and is populated with 2 million elements (chaining is used as the collision resolution scheme). We evaluate the latency per operation (in microseconds) for two cases: (i) a fair lwmutex is embedded in each bucket's list head; and (ii) 1,024 unfair pthread mutexes are used, where each one protects a range of 1,024 buckets.

Figure 4 (b) shows how the latency per operation increases as the number of threads concurrently operating on the hash table increases. As can be seen, fine-grained locking is preferable to optimizing the performance of the lock itself. Also, placing the lock within the bucket itself improves memory locality and may incur fewer cache misses compared to accessing a pthread mutex located in a separate memory area. For the hash table case it is very easy to map from a bucket to a pthread mutex stored in a separate area. That is not true for other data structures like linked lists and trees. Additional logic to minimize the number of locks for those data structures introduces complexity which is more difficult to maintain than for the case where a lock can be cheaply added per node. Even a hash table that uses open-addressing schemes (probing, double hashing or cuckoo hashing [8]) for resolving conflicts presents challenges when using range locking.

Finally, the third experiment compares lw_rwlocks with read-write pthread locks. We use the same hash table as before, but now we fix the number of threads (readers + writers) to 34 and then vary the number of writers (or contending threads) from 0 to 34. Beyond 34 threads we start seeing contention across readers for pthread locks: the contention is on the update of the reader counter, which is protected by a mutex in the pthread library.
Because we only want toevaluate the contention due to writers, we in turn, picked 34.Figure 5 shows how the latency per operation increases as the number of writers concurrentlyoperating on the hash table increases. Once again the ﬁne-grained locking provided by the cheap lw rwlocks delivers better overall performance and also scales better than read-write pthreadlocks.
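The two lock layouts compared in the hash table experiments differ only in how a bucket finds its lock. The following is a minimal sketch in C of both mappings, not the implementation used in the experiments. It assumes that "1 million buckets" means 2^20 (which matches 1,024 locks covering 1,024 buckets each); the names bucket, bucket_index and range_lock_index are hypothetical, and the uint32_t field is only a stand-in for a 4-byte lwmutex.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    NUM_BUCKETS = 1 << 20,     /* assumption: "1 million" buckets taken as 2^20 */
    NUM_RANGE_LOCKS = 1024,    /* one lock per contiguous range of buckets */
    BUCKETS_PER_LOCK = NUM_BUCKETS / NUM_RANGE_LOCKS
};

/* Fine-grained layout: the lock word sits next to the chain head,
 * so locking a bucket touches the same memory as reading its list. */
struct bucket {
    uint32_t lock;   /* stand-in for a 4-byte lwmutex (hypothetical field) */
    void *head;      /* first element of the bucket's chain */
};

/* Reduce a hash value to a bucket index. */
static size_t bucket_index(uint64_t hash)
{
    return (size_t)(hash % NUM_BUCKETS);
}

/* Range locking: bucket b is covered by lock number b / BUCKETS_PER_LOCK,
 * which lives in a separate array of NUM_RANGE_LOCKS locks. */
static size_t range_lock_index(size_t bucket)
{
    assert(bucket < NUM_BUCKETS);
    return bucket / BUCKETS_PER_LOCK;
}
```

With the embedded layout no mapping function is needed at all, which is why the same approach carries over to list and tree nodes, where a bucket-to-lock arithmetic like range_lock_index has no obvious counterpart.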

Fig. 5: Latency per operation on either a hash table using a lw_rwlock per bucket or 1,024 read-write pthread locks, each one protecting a range of 1,024 buckets. The total number of threads was fixed to 34 and the number of writers goes from 0 to 34. The plot shows latency (microseconds per operation) against the number of writers on a machine with 40 physical cores of 2.26 GHz.

We have presented in this paper a new approach to building compact synchronization primitives. This is possible because each thread can only block in one lock or condition variable at a time. Besides the compact nature of light-weight locks, the queue management of blocked threads is also done entirely in user space. This allows the implementation of features that are impossible to implement with traditional pthread locks. For instance, asynchronous locking cannot be implemented with pthread locks as they stand. The cost for light-weight locks is a 166-byte waiter structure per thread, which amortizes very quickly for applications where there are many more locks than threads. We believe that this is a fairly common scenario.
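The amortization argument can be checked with a few lines of arithmetic. The sketch below, in C, uses the sizes reported in this paper (4 bytes per lwmutex, 40 bytes per pthread mutex on x86-64, 166 bytes per waiter structure); the function names are hypothetical and the model ignores padding and allocator overhead.

```c
#include <assert.h>
#include <stdint.h>

enum {
    LWMUTEX_BYTES = 4,          /* lwmutex with deadlock detection */
    PTHREAD_MUTEX_BYTES = 40,   /* pthread mutex on x86-64, as reported */
    WAITER_BYTES = 166          /* per-thread waiter structure */
};

static_assert(PTHREAD_MUTEX_BYTES > LWMUTEX_BYTES,
              "break-even computation assumes lwlocks are smaller");

/* Total bytes when every lock is an lwmutex and every thread owns a waiter. */
static uint64_t lwlock_footprint(uint64_t locks, uint64_t threads)
{
    return locks * LWMUTEX_BYTES + threads * WAITER_BYTES;
}

/* Total bytes when every lock is a pthread mutex. */
static uint64_t pthread_footprint(uint64_t locks)
{
    return locks * PTHREAD_MUTEX_BYTES;
}

/* Smallest lock count for which the lwlock layout is strictly smaller:
 * 4L + 166T < 40L  <=>  L > 166T / 36. */
static uint64_t break_even_locks(uint64_t threads)
{
    uint64_t saved_per_lock = PTHREAD_MUTEX_BYTES - LWMUTEX_BYTES;
    return threads * WAITER_BYTES / saved_per_lock + 1;
}
```

Under these numbers, 34 threads break even at 157 locks, far below the lock count of a structure with one lock per bucket or per node.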

References

1. T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst., 1:6–16, January 1990.
2. Keir Fraser. Practical lock freedom. PhD thesis, Cambridge University Computer Laboratory, 2003.
3. Gary Graunke and Shreekant Thakkar. Synchronization algorithms for shared-memory multiprocessors. Computer, 23:60–69, June 1990.
4. Marcel Kornacker and Douglas Banks. High-concurrency locking in R-trees. In The 21st International Conference on Very Large Data Bases, pages 134–145, 1995.
5. Edya Ladan-Mozes and Nir Shavit. An optimistic approach to lock-free FIFO queues. In The 18th Annual Conference on Distributed Computing (DISC'04), volume 3274 of Lecture Notes in Computer Science, pages 117–131. Springer, 2004.
6. Philip L. Lehman and S. Bing Yao. Efficient locking for concurrent operations on B-trees. ACM Transactions on Database Systems, 6(4):650–670, 1981.
7. John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9:21–65, February 1991.
8. Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing, 2001.
9. Hany E. Ramadan, Indrajit Roy, Maurice Herlihy, and Emmett Witchel. Committing conflicting transactions in an STM. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '09, pages 163–172, New York, NY, USA, 2009. ACM.
10. N. Shavit and D. Touitou. Software transactional memory. Distributed Computing, Special Issue, 10:99–116, 1997.
11. Nir Shavit. Data structures in the multicore age. Communications of the ACM, 54(3):76–84, 2011.
12. Benjamin Zhu, Kai Li, and Hugo Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST'08, pages 18:1–18:14, Berkeley, CA, USA, 2008. USENIX Association.

A Pseudo code for light-weight condition variables
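The queue discipline behind the condition variable of Figure 6 can be exercised in isolation, without any threads. The following C sketch is a single-threaded model under stated assumptions: NULLID is encoded as 0, waiters live in a small static table indexed by id, and an integer flag stands in for each waiter's event. The helper names enqueue, dequeue_oldest and wake_all are hypothetical; they mirror the list manipulation in wait, signal and broadcast, but omit the lwmutex that guards the queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULLID 0u        /* assumed encoding of "no waiter" */
#define MAX_WAITERS 8u   /* table size, arbitrary for this sketch */

struct waiter {
    uint16_t id;
    uint16_t next;       /* id of the next-older waiter, or NULLID */
    int signaled;        /* stands in for the waiter's event */
};

static struct waiter waiters[MAX_WAITERS + 1];  /* slot 0 unused: id 0 is NULLID */

static struct waiter *id2waiter(uint16_t id)
{
    return id == NULLID ? NULL : &waiters[id];
}

/* wait(): link the caller in front of the current chain. */
static void enqueue(uint16_t *waitq, uint16_t id)
{
    assert(id != NULLID && id <= MAX_WAITERS);
    waiters[id].id = id;
    waiters[id].signaled = 0;
    waiters[id].next = *waitq;
    *waitq = id;
}

/* signal(): unlink and wake the oldest waiter, found at the end of the chain. */
static uint16_t dequeue_oldest(uint16_t *waitq)
{
    if (*waitq == NULLID)
        return NULLID;                       /* missed signal */
    struct waiter *wtw = id2waiter(*waitq);  /* waiter to wake */
    struct waiter *pw = NULL;                /* previous waiter */
    while (wtw->next != NULLID) {
        pw = wtw;
        wtw = id2waiter(wtw->next);
    }
    if (pw == NULL)
        *waitq = NULLID;
    else
        pw->next = NULLID;
    wtw->signaled = 1;
    return wtw->id;
}

/* broadcast(): detach the whole chain, then wake every waiter on it. */
static unsigned wake_all(uint16_t *waitq)
{
    unsigned woken = 0;
    uint16_t id = *waitq;
    *waitq = NULLID;
    while (id != NULLID) {
        struct waiter *w = id2waiter(id);
        id = w->next;
        w->signaled = 1;
        woken++;
    }
    return woken;
}
```

Enqueuing waiters 1, 2 and 3 and then calling dequeue_oldest twice wakes 1 and then 2, which is the FIFO order that signal obtains by walking to the end of the chain.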

Figure 6 presents the structure of a lwcondvar as well as the operations it supports.

    struct lwcondvar_t {
        lwmutex_t m;       // 2-byte lwmutex
        uint16_t waitq;    // queue tail
    }

    void wait(lwmutex_t m, lwcondvar_t c) {
        w = tls_get_waiter();
        w.next = NULLID;
        lock(c.m);
        if (c.waitq == NULLID)
            c.waitq = w.id;
        else {
            w.next = c.waitq;
            c.waitq = w.id;
        }
        unlock(c.m);
        unlock(m);
        w.event.wait();
        lock(m);
    }

    void wake_up_waiters(waiter_t w) {
        while (w != NULL) {
            waitq = w.next;
            w.event.signal();
            w = id2waiter(waitq);
        }
    }

    void signal(lwcondvar_t c) {
        lock(c.m);
        if (c.waitq != NULLID) {
            wtw = id2waiter(c.waitq);
            pw = NULL;
            while (wtw.next != NULLID) {
                pw = wtw;
                wtw = id2waiter(wtw.next);
            }
            if (pw == NULL)
                c.waitq = NULLID;
            else
                pw.next = NULLID;
        } else
            wtw = NULL;
        unlock(c.m);
        if (wtw != NULL)
            wtw.event.signal();
        // Else missed signal
    }

    void broadcast(lwcondvar_t c) {
        lock(c.m);
        waitq = c.waitq;    // pointer to queue's head
        c.waitq = NULLID;
        unlock(c.m);
        wake_up_waiters(id2waiter(waitq));
    }

Fig. 6: Operations defined for a lwcondvar. The old and new values passed in to CAS are denoted by o and n, respectively. The caller's thread local waiter structure is denoted by w. We use wtw and pw to denote the waiter to wake up and the previous waiter in the queue, respectively.

B Upgrading and Downgrading light-weight read-write locks

As mentioned in Section 4, a lw_rwlock also supports upgrade and downgrade operations. Figure 7 shows the algorithms for the two operations. Note that even though multiple readers could be traversing the waitq during upgrade, the traversal is safe. The waitq changes only due to arrival of new waiters to the back of the queue or removal of the waiter at the front of the queue during lock transfer. The former is immaterial to the traversal as it does not care for what happens to waiters behind it. The latter cannot happen as the traversing thread still has a read lock. The only possible race happens on the next field of the oldest waiter in the waitq: a reader performing an upgrade wants to add its own waiter in front of it and a thread releasing the write lock on a reader-biased lock is re-inserting the list of existing waiters. This situation is handled by the upgrade logic and by the unlock routine.

The unlock operation presented in Figure 3 has to be slightly changed to support downgrade and upgrade of a lw_rwlock. For the case where a writer is releasing the write lock of a read-biased lw_rwlock, while re-inserting the wr_q at the front of the waitq of the lw_rwlock, we have to use a CAS instruction to co-ordinate with a possible upgrader. Also, if an upgrader is found to be already present at the front of the waitq, the re-inserted wr_q is added behind the upgrader's waiter.

    void downgrade(lw_rwlock_t ℓ) {
        do {
            o = n = ℓ;
            if (o.waitq != NULLID) {
                // Have existing waiters. Can't
                // do direct downgrade.
                break;
            }
            n.wlocked = 0;
            n.readers = 1;
        } while (!CAS(ℓ, o, n));
        if (n.readers != 1) {
            w = tls_get_waiter();
            // Indicate that w will now wait
            // for a read lock.
            w.app_data = SHARED;
            insert_waiter_at_front(ℓ.waitq, w);
            // Unlock will grant reader lock and
            // waiter will get a pending signal.
            unlock(ℓ);
            // Consuming the pending signal
            w.event.wait();
        }
    }

    bool_t upgrade(lw_rwlock_t ℓ) {
        do {
            o = n = ℓ;
            if (o.waitq != NULLID) {
                // There are more waiters that
                // could be upgrading themselves.
                break;
            } else if (o.readers == 1) {
                // Only reader, grab it right away.
                n.wlocked = 1;
                n.readers = 0;
            } else {
                w.app_data = UPGRADE;
                n.waitq = w.id;
                n.readers--;
            }
        } while (!CAS(ℓ, o, n));
        if (n.wlocked != 1) {
            w = tls_get_waiter();
            if (n.waitq != w.id) {
                if (!insert_for_upgrade(ℓ, w)) {
                    return FALSE;    // failure
                }
                // Unlock and wait for lock
                // to be granted.
                unlock(ℓ);
            }
            w.event.wait();
        }
        return TRUE;    // success
    }

    bool_t insert_for_upgrade(lw_rwlock_t ℓ, waiter_t w) {
        while (TRUE) {
            ow = find_oldest_waiter(ℓ);
            if (ow.app_data == UPGRADE) {
                // Someone else waiting for upgrade
                return FALSE;
            }
            // Try setting next pointer of last waiter.
            // Set the app_data first since it needs to
            // be visible to any competing thread that
            // is also trying the upgrade.
            w.app_data = UPGRADE;
            if (!CAS(ow.next, NULLID, w.id)) {
                // Competing upgrade or re-insert
                ow = id2waiter(ow.next);
                if (ow.app_data == UPGRADE) {
                    // Lost to competing upgrade
                    return FALSE;    // failure
                }
                // else lost to competing re-insert
            } else
                return TRUE;    // CAS success
        }
    }

Fig. 7: Operations to downgrade and upgrade a lw_rwlock. The old and new values passed in to CAS are denoted by o and n, respectively. The caller's thread local waiter structure is denoted by w. We use ow to denote the oldest waiter on the waitq.
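The uncontended paths of Figure 7 reduce to a single compare-and-swap on the 4-byte lock word. The C11 sketch below illustrates only those fast paths. The bit layout (16 bits of waitq, 15 bits of reader count, 1 write-locked bit) is a hypothetical packing chosen for this example and is not taken from the paper, as are the names try_upgrade_fast and try_downgrade_fast. A failed attempt corresponds to the cases where the pseudo code falls through to the queue-based slow path, which is not modeled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packing of the lock word (an assumption of this sketch):
 *   bits  0..15  waitq    id of the newest waiter, 0 meaning NULLID
 *   bits 16..30  readers  number of read-lock holders
 *   bit  31      wlocked  set while a writer holds the lock
 */
#define WAITQ_MASK  0x0000FFFFu
#define READER_ONE  0x00010000u
#define WLOCKED     0x80000000u

typedef _Atomic uint32_t lw_rwlock_word;

static_assert(sizeof(uint32_t) == 4, "the lock word is meant to stay 4 bytes");

/* Upgrade fast path: succeeds only when the caller is the sole reader and
 * no waiter is queued, i.e. the word is exactly READER_ONE. */
static bool try_upgrade_fast(lw_rwlock_word *l)
{
    uint32_t expected = READER_ONE;
    return atomic_compare_exchange_strong(l, &expected, WLOCKED);
}

/* Downgrade fast path: succeeds only when the caller holds the write lock
 * and no waiter is queued, i.e. the word is exactly WLOCKED. */
static bool try_downgrade_fast(lw_rwlock_word *l)
{
    uint32_t expected = WLOCKED;
    return atomic_compare_exchange_strong(l, &expected, READER_ONE);
}
```

Because each fast path expects one exact word value, any concurrent change (a new reader, a queued waiter) makes the CAS fail without modifying the lock, and the caller can then take the slower route through the waitq.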