Scalable Range Locks for Scalable Address Spaces and Beyond
This is an extended version of the article published in the proceedings of the 15th European Conference on Computer Systems (EuroSys ’20), which is available online at https://doi.org/10.1145/3342195.3387533.
Alex Kogan
Oracle Labs, Burlington, MA, USA
[email protected]

Dave Dice
Oracle Labs, Burlington, MA, USA
[email protected]

Shady Issa∗
U. Lisboa & INESC-ID, Lisbon, Portugal
[email protected]

∗ Work was done while the author was an intern at Oracle Labs.
Abstract
Range locks are a synchronization construct designed to provide multiple threads (or processes) with concurrent access to disjoint parts of a shared resource. Originally conceived in the file system context, range locks are gaining increasing interest in the Linux kernel community seeking to alleviate bottlenecks in the virtual memory management subsystem. The existing implementation of range locks in the kernel, however, uses an internal spin lock to protect the underlying tree structure that keeps track of acquired and requested ranges. This spin lock becomes a point of contention on its own when the range lock is frequently acquired. Furthermore, where and exactly how specific (refined) ranges can be locked remains an open question.

In this paper, we make two independent but related contributions. First, we propose an alternative approach for building range locks based on linked lists. The lists are easy to maintain in a lock-less fashion, and in fact, our range locks do not use any internal locks in the common case. Second, we show how the range of the lock can be refined in the mprotect operation through a speculative mechanism. This refinement, in turn, allows concurrent execution of mprotect operations on non-overlapping memory regions. We implement our new algorithms and demonstrate their effectiveness in user-space and kernel-space, achieving up to 9× speedup compared to the stock version of the Linux kernel. Beyond the virtual memory management subsystem, we discuss other applications of range locks in parallel software. As a concrete example, we show how range locks can be used to facilitate the design of scalable concurrent data structures, such as skip lists.

CCS Concepts • Theory of computation → Concurrency; • Computer systems organization → Multicore architectures; • Software and its engineering → Mutual exclusion; Concurrency control; Virtual memory.

Keywords reader-writer locks, semaphores, scalable synchronization, lock-less, Linux kernel, parallel file systems
1 Introduction

Range locks are a synchronization construct designed to provide multiple threads (or processes) with concurrent access to disjoint parts of a shared resource. Originally, range locks were conceived in the context of file systems [2], to address scenarios in which multiple writers would want to write into different parts of the same file. A conventional approach of using a single file lock to mediate the access among those writers creates a synchronization bottleneck. Range locks, however, allow each writer to specify (i.e., lock) the part of the file it is going to update, thus allowing serialization between writers accessing the same part of the file, but parallel access for writers working on different parts.

In recent years, there has been a surge of interest in range locks in a different context. Specifically, the Linux kernel community considers using range locks to address contention on mmap_sem [13], which is “one of the most intractable contention points in the memory-management subsystem” [9]. mmap_sem is a reader-writer semaphore protecting the access to the virtual memory area (VMA) structures. A VMA represents a distinct and contiguous region in the virtual address space of an application; all VMA structures are organized in a red-black tree (mm_rb) [6]. The mmap_sem semaphore is acquired by any virtual memory-related operation, such as mapping, unmapping and mprotecting memory regions, and handling page fault interrupts. As a result, for data intensive applications that operate on chunks of dynamically allocated memory, the contention on the semaphore becomes a significant bottleneck [6, 9, 11].

The existing implementation of range locks in the Linux kernel is relatively straightforward. It uses a range tree (based on red-black trees) protected by a spin lock [22]. Given that every acquisition and release of the range lock, for any range, results in the acquisition and release of that spin lock, the latter can easily become a bottleneck on its own under heavy use, regardless of the contention on actual ranges. Note that even non-overlapping ranges and/or ranges acquired for read have to synchronize using that same spin lock. We expand on the implementation of existing range locks in the kernel and its shortcomings in Section 3.

Even when putting the issues in the existing range lock implementation aside, exploiting the potential parallelism when using range locks to protect the access to VMA structures in the Linux kernel is far from trivial. The key challenge is that addresses presented to virtual memory (VM) operations (singular addresses arising from page fault handling, or ranges associated with APIs such as mprotect) do not necessarily fall on VMA boundaries. Thus, the enclosing range of the VM space that needs to be protected is not known in advance of walking the mm_rb tree. Therefore, simply applying a VM operation under the lock acquired for the range of that operation does not work. As an intuitive example, consider two mprotect operations on different (non-overlapping) memory ranges. If those operations acquire the range lock only on those (non-overlapping) ranges, they may race with each other on updates to the VMA metadata if they end up operating on the same VMA.
Furthermore, regardless of whether two mprotects operate on the same VMA, if one of them rotates the mm_rb tree, the other one may read an inconsistent state while traversing the tree in parallel. All these issues might be the reason that in the kernel patch that replaces mmap_sem with a range lock, the latter is always acquired for the full range [5]. (The range lock API includes calls to acquire the lock for a specific range, e.g., [10..25], as well as a special call to acquire the lock for the entire, full range, i.e., [0..2^64 − 1].) This exploits no potential parallelism that range locks can provide; the author of the patch notes that “while there is no improvement of concurrency per se, these changes aim at adding the machinery to permit this in the future”, and we are not aware of any follow-up work that does that.

This paper makes two related, but independent contributions. First, we propose an alternative design for efficient scalable range locks that addresses the shortcomings of the existing algorithm. Our idea is to organize ranges in a linked list instead of a range tree. Each node in the list represents an acquired range. Therefore, conceptually, once a thread manages to insert its node into the list, it holds the range lock for that particular range. While traversing a list to find the insertion point is less efficient than traversing a tree, the number of nodes in the list is expected to be relatively low, as it corresponds to the number of threads in the system accessing ranges. At the same time, lists are known to be more amenable to non-blocking updates, since unlike in a (balanced) tree, one needs to atomically modify just one pointer to update the list. As a result, our list-based design does not require any lock in the common case.

Our second contribution is the discussion of applications for range locks in parallel software. Our prime focus is on scaling the virtual memory management in the Linux kernel by introducing a speculative mechanism into the mprotect operation. As we observe, in certain cases handling mprotect calls results in modifying the metadata of the underlying VMA without changing the structure of mm_rb. For those cases, our mechanism acquires the range lock only for a relatively small (refined) range, thus enabling parallel execution of mprotect operations on non-overlapping regions of virtual memory. As it turns out, those are the common cases for applications that use the GLIBC memory allocator, which is the default user-mode malloc-free allocator. The latter employs per-thread memory arenas, which are initialized by mmaping a large chunk of memory and mprotecting the pages that are actually in use. Those mprotect calls expand or shrink the size of the VMA corresponding to the set of pages with currently allocated objects, which are exactly the cases that our speculative mechanism supports.

We note that the applicability of range locks extends beyond the virtual memory management subsystem. As Kim et al. demonstrated recently [24], range locks can be used to optimize shared file I/O operations in a file system; we believe that the range locks we present in this paper can be used as a drop-in replacement for the implementation used in [24].
More generally, drawing from the original motivation behind the concept of range locks, the ideas presented in this paper appear to be a natural fit for parallel file systems; we plan to experiment with such systems in future work. In addition, we argue that range locks can be highly useful in facilitating the design of scalable concurrent data structures. As a concrete example, we discuss the design of a new skip list in which a range lock is used for scalable synchronization between threads applying concurrent operations on the skip list. The new skip list is based on the well-known optimistic skip list by Herlihy et al. [21]. Instead of acquiring multiple locks during an update operation (potentially, as many as the number of levels in the skip list) [21], our design acquires one range only. Beyond the potential performance benefits of reducing lock contention and the number of required atomic operations, our design eliminates the need for associating a (spin) lock with every node in the list, thus reducing the memory footprint of the skip list.

We have evaluated our ideas both in user-space and kernel-space. For the former, we implemented our list-based range locks and compared them to the tree-based range lock implementation that we ported from the Linux kernel into user-space. Our experiments confirm that the new range locks scale better and outperform existing range locks in virtually all evaluated settings. Moreover, we show that range lock-based skip lists perform significantly better when using our implementation of range locks underneath, compared to the tree-based range lock implementation. We also implemented the new range locks in the kernel, and evaluated them with Metis, a suite of map-reduce benchmarks [27] used extensively in scalability research of the Linux kernel [3, 6, 11, 23]. When coupled with the speculative mechanism in mprotect, some Metis benchmarks run up to 9× faster on the modified kernel compared to stock, and up to 69× faster compared to the kernel that uses tree-based range locks.

2 Related Work

Range locks (or byte-range locks) were conceived in the context of file systems to support concurrent access to the same file [2]. Since files are a contiguous range of bytes, different processes can access disjoint regions within the same file if they acquire a (range) lock for the desired region, e.g., through the fcntl operation in Unix [2]. More recently, range locks gained attention as an important piece in the design of parallel and distributed file systems. In GPFS [34], for instance, when a process requests access to a region within a file, it is granted a token for the whole file. Only when another process requests access to another, disjoint region within the same file is a revoke request sent to the token holder, revoking its rights over the other process’s desired range. This design has low locking overhead when a file is accessed by a single process, at the cost of higher overhead when coordination between multiple processes is required. Thakur et al. [36] suggested the use of a data structure with a per-process entry. Each process acquires a range lock in two steps: first it accesses its slot within the data structure, updating it with the desired range, and then it reads a snapshot of the data structure.
If no other process has requested a conflicting range, the lock is acquired; otherwise, the steps are repeated after processes reset their slots within the data structure. To avoid liveness issues, Aarestad et al. [1] proposed using a red-black tree to store the ranges acquired by different processes. The same approach is taken by recent efforts within the Linux kernel development community to replace the reader-writer semaphore within the virtual memory subsystem with a red-black tree-based range lock implementation [4, 22]. However, as explained earlier, relying on a red-black tree protected by a spin lock can be a serious scalability bottleneck, as we confirm later in Section 7. Our approach, in contrast, does not use locks in the common case.

In a recent and highly relevant work [24], Kim et al. consider using range locks in the context of parallel file systems, and make a similar observation regarding the lack of scalability of the existing kernel range locks. They followed an alternative design for range locks, previously proposed by Quinson et al. [33], in which the entire range is divided into (a preset number of) segments, each associated with a reader-writer lock. To acquire a certain part of the range for read or write, one needs to acquire the reader-writer locks of the corresponding segments in the respective mode. In their proposal, the full range acquisition is particularly expensive, as it requires acquiring all underlying reader-writer locks. Moreover, choosing the right granularity, i.e., the number of segments, is critical — too few segments would create contention on the underlying reader-writer locks, while too many segments would make range acquisition more expensive — yet Kim et al. do not discuss how the granularity should be tuned. Therefore, we believe the applicability of Kim et al.’s approach is limited to cases where the size of the entire range and the granularity of the access are known and static, which is precisely the case considered in [24]. Nevertheless, we include Kim et al.’s range locks in our performance study in Section 7.

The database community developed a concept similar to range locks, known as key-range locks [31, 32]. They were introduced to guarantee serializability of database transactions operating on a range of records, avoiding the so-called phantom read phenomenon [26]. Besides locking all existing keys within a range, key-range locks also lock the neighboring key so that, e.g., no concurrent transaction can insert new keys — which did not exist a priori, and thus could not be locked — within the desired range [31]. To allow more concurrency, Lomet [26] introduced hierarchical locking to attribute different lock modes to ranges and keys (e.g., locking a range in exclusive mode and a key in shared mode). To overcome the high locking overhead incurred by locking all the keys within a range, Graefe [17] suggested dynamically switching between different locking granularities. In addition to the higher locking overhead that these solutions can incur, they also suffer from lower parallelism, since non-overlapping ranges within a region where no keys exist have to be unnecessarily serialized on an existing key. Lomet and Mokbel [25] tried to decouple locking from the existing data by statically partitioning tables into disjoint partitions.
Compared to our solution, such an approach suffers from lower parallelism due to false sharing when non-overlapping range lock requests fall within the same partition.

As mentioned earlier, one of the main motivations behind the renewed interest in range locks is to design a scalable locking mechanism for kernel address space operations. Song et al. attempted to address this problem in the context of parallelizing live VM migration [35]. To that end, they proposed a range lock implementation based on a skip list protected by a spin lock. Conceptually, their design is very similar to the one found in the Linux kernel [22]. In particular, both have the same bottleneck in the form of a spin lock protecting the corresponding underlying data structure for tracking acquired ranges.

Several works pursued the same goal of scaling kernel address space operations via a different route: replacing the red-black tree mm_rb with alternative data structures. Clements et al. [6] proposed using an RCU-balanced tree to allow concurrency between a single writer and multiple readers. In addition to not allowing parallel update operations, the proposed tree trades fewer rotations for tree imbalance, which can increase tree traversal times. In another work, the same authors proposed using a radix tree, where each mapped page is inserted in a separate node within the tree [7]. Such a design supports concurrent read and update accesses to non-overlapping nodes. However, this comes at two significant costs: (i) a large memory footprint due to the per-page nodes, and (ii) high locking overhead, since locking a range of pages entails locking several nodes within the tree. Unlike both proposals by Clements et al., our work does not require changing mm_rb and thus requires less intrusive changes to the kernel.
3 Range Locks in the Linux Kernel

The existing implementation of range locks in the Linux kernel uses a range tree (based on red-black trees) protected by a spin lock [22]. To acquire a range, a thread first acquires the spin lock and then traverses the tree to obtain a count of all the ranges that overlap with (and thus block) the given range. For a reader-writer range lock, this count does not include overlapping ranges belonging to other readers (if the given acquisition is also for read) [4]. Next, the thread inserts a node describing its range into the tree, and releases the spin lock. If at that point the count of blocking ranges is zero, the thread has the range lock and can start the critical section that the lock protects. Otherwise, it waits until the count drops to zero, which happens when threads that have acquired blocking (i.e., overlapping) ranges exit their respective critical sections. Specifically, when a thread is done with its range, it acquires the spin lock, removes its node from the tree, traverses the tree decrementing the count of blocking ranges for all relevant ranges, and finally releases the spin lock.

This range lock implementation has several shortcomings. The most severe one is the use of a spin lock to protect the range tree. This lock can easily become a bottleneck on its own even without logical contention on ranges. Note that every acquisition and release of the range lock results in the acquisition and release of that spin lock. Therefore, even non-overlapping ranges and/or ranges acquired for read have to synchronize using that same spin lock.

Furthermore, while placing all ranges in the range tree preserves the FIFO order, it limits concurrency. Assume that we have three exclusive acquisition requests for ranges coming in this order: A=[1..3], B=[2..7], C=[4..5]. While A holds the lock, B is blocked (it overlaps with A), and C is blocked behind B, although in practice it could proceed, as it does not overlap with A. Finally, the existing range locks have no fast path, that is, even when there is a single thread acquiring a range, it still goes through the same path of acquiring the spin lock, updating the range tree and so on.

The list-based range locks presented in this paper address all the aforementioned issues. First, they only use a lock when fairness is concerned, i.e., to avoid starvation of threads trying to acquire a range, but repeatedly failing to do so due to other threads that manage to acquire overlapping ranges. In our experiments, this is an unlikely scenario, meaning that our range locks do not use any locks in the common case. Second, list-based range locks can achieve a higher level of parallelism by allowing concurrent threads to acquire more (non-overlapping) ranges. Considering the example above, for instance, while A is in the list, B waits until A finishes, but C can go ahead and insert its node into the list after A. Finally, our design allows the introduction of a fast path, in which the range lock can be acquired in a small constant number of steps. This path is particularly efficient for single-thread applications or multi-thread applications in which the range lock is acquired by one thread at a time.
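To make the kernel scheme described above concrete, the following is a rough user-space C sketch, under stated assumptions: the kernel keeps ranges in a red-black tree and supports reader-writer modes, whereas this sketch uses a plain linked list and exclusive mode only (and elides pthread_spin_init), which is enough to show the essential structure — every acquire and release takes the same spin lock, and waiters spin on a per-request count of blocking ranges. All names here are ours, not the kernel's.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct range_node {
    uint64_t start, end;              /* half-open range [start..end) */
    atomic_int blocked;               /* # of older overlapping ranges */
    struct range_node *next;
};

struct tree_range_lock {
    pthread_spinlock_t lock;          /* protects the whole structure */
    struct range_node *ranges;        /* all currently inserted ranges */
};

static int overlap(struct range_node *a, struct range_node *b) {
    return a->start < b->end && b->start < a->end;
}

struct range_node *tree_range_acquire(struct tree_range_lock *rl,
                                      uint64_t start, uint64_t end) {
    struct range_node *n = malloc(sizeof *n);
    n->start = start; n->end = end;
    atomic_init(&n->blocked, 0);
    pthread_spin_lock(&rl->lock);
    /* count every already-inserted overlapping range as blocking */
    for (struct range_node *c = rl->ranges; c; c = c->next)
        if (overlap(c, n))
            atomic_fetch_add(&n->blocked, 1);
    n->next = rl->ranges;             /* insert before releasing the lock */
    rl->ranges = n;
    pthread_spin_unlock(&rl->lock);
    while (atomic_load(&n->blocked) != 0)
        ;                             /* wait until older overlaps release */
    return n;
}

void tree_range_release(struct tree_range_lock *rl, struct range_node *n) {
    pthread_spin_lock(&rl->lock);
    struct range_node **pp = &rl->ranges;
    while (*pp != n)                  /* unlink our node */
        pp = &(*pp)->next;
    *pp = n->next;
    /* wake up (decrement) every waiter our range was blocking */
    for (struct range_node *c = rl->ranges; c; c = c->next)
        if (overlap(c, n))
            atomic_fetch_sub(&c->blocked, 1);
    pthread_spin_unlock(&rl->lock);
    free(n);
}

Note how both the FIFO-over-concurrency behavior and the single point of contention fall directly out of this structure: request C above counts waiting request B as blocking, and every thread, contended or not, funnels through the one spin lock.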
4 Range Locks Based on Linked Lists

We opted to use a linked list as the underlying data structure for its relative simplicity and amenability to concurrent updates. We note that, in general, the linear-time search provided by a linked list is less efficient than the logarithmic-time search provided by a balanced search tree or a skip list. In practice, however, this should not present an issue, as in all applications that we consider the number of stored elements (ranges) in the list is relatively small, since it is proportional to the number of threads concurrently accessing the resource(s) protected by the range lock. For settings in which this assumption does not hold, we plan to investigate extending our design to employ a skip list for more efficient search operations in future work.
4.1 Supporting Exclusive Ranges

We start with a simpler version of our linked list-based range lock algorithm intended for mutual exclusion, i.e., it supports concurrent acquisition of disjoint ranges, but no overlapping ranges are allowed. In the next section, we describe an extension of the algorithm that supports reader-writer exclusion, where readers can acquire overlapping ranges, but a writer cannot overlap with another (reader or writer) thread.

The idea at the basis of the algorithm is to insert acquired ranges into a linked list sorted by the ranges' starting points. Accordingly, any overlapping ranges will compete to be inserted at the same position in the list. Therefore, by relying on an atomic compare-and-swap (CAS) primitive, it is possible to ensure that only one range from a group of overlapping ranges will succeed in entering the list while the others fail.

The pseudo-code for the exclusive access list-based range lock algorithm is shown in Listing 1. It presents the lock structures and the implementation of the MutexRangeAcquire and MutexRangeRelease functions, as well as the auxiliary functions called by those two. For clarity of exposition, we assume sequential consistency; our actual implementation uses volatile keywords and memory fences where necessary. CAS and FAA indicate opcodes for the compare-and-swap and fetch-and-add atomic instructions, respectively (it is easy to simulate FAA with CAS on architectures that do not natively support the former); Pause() is a no-op operation used for polite busy-waiting.

 1  class LNode:
 2      __u64 start; __u64 end
 3      LNode *next
 4
 5  class ListRL:
 6      LNode *head
 7
 8  class RangeLock:
 9      LNode *node
10  def MutexRangeAcquire(ListRL *listrl, __u64 start, __u64 end):
11      RangeLock *rl = new RangeLock()
12      rl->node = new LNode()
13      rl->node->start = start; rl->node->end = end; rl->node->next = NULL
14      InsertNode(listrl, rl->node)
15      return rl
16  def MutexRangeRelease(RangeLock *rl):
17      DeleteNode(rl->node)
18  def compare(LNode *lock1, LNode *lock2):
19      if !lock1: return 1
20      if lock1->start >= lock2->end:
21          return 1
22      if lock2->start >= lock1->end:
23          return -1
24      return 0
25  def marked(LNode *node): return is_odd((__u64)node)
26  def unmark(LNode *node): return (__u64)node - 1
27  def InsertNode(ListRL *listrl, LNode *lock):
28      while true:
29          LNode **prev = &listrl->head
30          LNode *cur = *prev
31          while true:
32              if marked(cur):
33                  break
34              elif cur and marked(cur->next):
35                  LNode *next = unmark(cur->next)
36                  CAS(prev, cur, next)
37                  cur = next
38              else:
39                  auto ret = compare(cur, lock)
40                  if ret == -1:
41                      prev = &cur->next
42                      cur = *prev
43                  elif ret == 0:
44                      while !marked(cur->next):
45                          Pause()
46                  elif ret == 1:
47                      lock->next = cur
48                      if CAS(prev, cur, lock):
49                          return  # the range is acquired now
50                      cur = *prev
51  def DeleteNode(LNode *lock):
52      FAA(&lock->next, 1)

Listing 1. Pseudo-code for the exclusive access range lock implementation.

For each shared resource protected by a range lock, a ListRL list must be defined. Each node, LNode, within the list contains the range it covers and a pointer to the next node in the list (cf. Listing 1). Initially, the head of the list points to null, indicating that the list is empty.

When a thread requests exclusive access to a given region within a resource, it first creates an instance of the RangeLock structure (cf. Line 11), which contains a pointer to an LNode structure. Note that for simplicity, we allocate a new RangeLock instance each time MutexRangeAcquire is called. It is possible, however, to maintain and reuse a pool of RangeLock instances; we discuss memory management of those instances in detail in Section 4.4. Next, the thread initializes the RangeLock structure (cf. Lines 12–13). Finally, in order to acquire a range, the thread must successfully insert the corresponding node into the given range lock list structure (cf. Line 14). To release the acquired range (in MutexRangeRelease), the node corresponding to the range is deleted from the list (cf. Line 17).
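For illustration, here is a hypothetical C usage harness, under the assumption that the pseudo-code of Listing 1 is realized as a C library with the same function names; NewListRL is an invented initializer. Two threads write to disjoint pages of a shared buffer and therefore hold the range lock concurrently.

#include <pthread.h>
#include <stdint.h>

/* assumed C realization of Listing 1 */
typedef struct ListRL ListRL;
typedef struct RangeLock RangeLock;
extern ListRL *NewListRL(void);   /* hypothetical constructor */
extern RangeLock *MutexRangeAcquire(ListRL *l, uint64_t start, uint64_t end);
extern void MutexRangeRelease(RangeLock *rl);

static ListRL *buf_lock;          /* one range lock for the whole buffer */
static char buf[1 << 20];

static void *worker(void *arg) {
    uint64_t i = (uint64_t)(uintptr_t)arg;
    uint64_t start = i * 4096, end = start + 4096;
    /* the two workers pick disjoint [start..end) ranges, so neither blocks */
    RangeLock *rl = MutexRangeAcquire(buf_lock, start, end);
    for (uint64_t p = start; p < end; p++)
        buf[p] = (char)i;
    MutexRangeRelease(rl);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    buf_lock = NewListRL();
    pthread_create(&t1, NULL, worker, (void *)(uintptr_t)1);
    pthread_create(&t2, NULL, worker, (void *)(uintptr_t)2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}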
The InsertNode function describes the logic of inserting a node (lock) into the list (cf. Listing 1). At a high level, this function traverses the list searching for the insertion point (in the increasing order of start addresses) for the node describing the given range. If the traversal comes by a node with an overlapping range, it waits until that node is removed from the list.

In more detail, InsertNode traverses the list from its head and checks each node, cur, it encounters, while maintaining a pointer, prev, that points to the address of the previous node's next pointer. A node in the list can either be marked, i.e., logically deleted with the least significant bit of its next pointer being set, or not. (We describe the deletion mechanism in detail later.) If prev is found to be logically deleted, the traversal has to restart, as the list might have changed in a way that would not allow the thread to insert its node safely (cf. Line 32). If cur is logically deleted, an attempt to remove it from the list is made by making prev point to cur's successor (cf. Lines 34–37). This is done by issuing a CAS to atomically replace the pointer to cur with a pointer to cur's successor. Regardless of the result of the CAS, which may fail due to a concurrent thread performing the same change, the traversal of the list continues (Line 37). We note, however, that in our actual implementation we check the result of the CAS, and if successful, we reclaim the node using the memory management mechanism described in Section 4.4.

When an unmarked cur is encountered, the ranges of both cur and lock are compared (cf. Line 39); see the compare function for details (Lines 18–24). If (the range in) lock succeeds (the range in) cur without overlapping with it, the list traversal continues (Lines 40–42). If they overlap, then lock must wait until cur is marked as deleted, which will happen when the thread that acquired the corresponding range exits its critical section. After the wait, the traversal resumes from the same point (and the marked cur will be subsequently removed as described above).

In case lock precedes cur (or cur is null), the insertion position for lock has been found to be between prev and cur. To execute the insertion, a CAS is issued trying to replace cur with lock in prev (see Line 48). If the CAS is successful, exclusive access over the range is now acquired and the function can return (Line 49). Otherwise, this means another thread has changed prev, either by inserting a node right after prev or by marking prev for deletion. In this case, the traversal is resumed from the same point with cur being updated to a new value from prev (Line 50).

An acquired range lock is unlocked by deleting the corresponding node from the list (cf. Line 17). In a linked list, this means updating the next pointer of the node's predecessor to point to the node's successor. However, one has to locate the predecessor first, which means traversing the list from the head. For performance considerations, when releasing a range lock, we only delete the corresponding node logically. This is achieved by a common technique in concurrent linked list implementations of marking the node [19], i.e., setting the least significant bit (LSB) of its next pointer. This setting avoids races with concurrent threads trying to change the value of next while inserting or removing a neighboring node. (Recall that CAS instructions in InsertNode are issued on pointers of nodes that are expected not to be logically deleted.) Since only one thread can mark any given node (the thread that acquired the corresponding range), setting the LSB can be done with an atomic increment instruction (cf. Line 52). This means that on architectures that support such an instruction, range lock release is wait-free. As described above, marked nodes are removed from the list during traversals in InsertNode.
Correctness Argument: We argue that the pseudo-code in Listing 1 is a correct and deadlock-free implementation of exclusive access range locks. For correctness, we argue that the implementation never allows two threads to acquire range locks with overlapping ranges. This claim is based on the following invariant:

Invariant 1. For any two consecutive ranges R1 and R2 in the list ListRL, R1.end ≤ R2.start.

To prove the progress property, we note that a thread T would remain infinitely long in the InsertNode function only if (a) it finds infinitely often its prev variable pointing to a deleted node (cf. Line 32), or (b) it traverses infinitely many logically deleted nodes (cf. Lines 34–37), or (c) it traverses infinitely many ranges that end before the thread's range starts (cf. Lines 40–42), or (d) it waits infinitely long for a thread with an overlapping range (cf. Line 45). Given that the list contains a finite number of nodes when T calls InsertNode, cases (a), (b), and (c) are possible only if some other thread (or threads) insert (and delete) infinitely many nodes, which in turn means that those threads acquire and release infinitely many ranges while T is executing InsertNode. Assuming that no thread fails while holding the range lock, case (d) is either impossible, as the thread holding the overlapping range would mark its node as logically deleted in a finite number of steps (if the hardware supports wait-free FAA), or possible only if infinitely many threads acquire and release a range lock (for the CAS-based implementation of FAA). Thus, T either returns from InsertNode (and thus acquires the range lock), or infinitely many threads acquire and release the range lock while T is executing InsertNode.

We note that the described implementation of the list-based range lock is not starvation-free, e.g., a thread trying to insert a node into the list may continuously fail to apply CAS (cf. Line 48) and/or be forced to restart the traversal if its prev pointer gets marked (cf. Lines 32–33). In Section 4.3 we describe a simple mechanism to introduce fairness and avoid starvation.

Figure 1. An example of a race condition between readers and writers solved by validation. (a): Three reader ranges are in the list. (b): A new reader with the range [15..45] arrives, and since it starts before the reader with the range [20..25], it inserts itself into the list after the reader with the range [1..10]. At the same time, a writer with the range [30..35] arrives, finds that it does not overlap with any reader, and inserts itself into the list after the reader with the range [20..25].
4.2 Supporting Readers and Writers

In the previous section, we presented a range lock algorithm that supports acquiring exclusive access to defined ranges. Now, we extend the algorithm to handle reader-writer synchronization. For the sake of brevity, in this section threads acquiring a range lock in shared mode will be referred to as readers, while threads acquiring a range lock in exclusive mode will be referred to as writers.

A natural way to extend the range lock algorithm from the previous section is to consider the access mode (read or write) in the compare function, and allow an overlap when both compared ranges belong to readers. In other words, we would traverse the list (in InsertNode) and insert the given node into the list even if that node (i.e., its range) overlaps with an existing node, as long as both nodes belong to readers. Unfortunately, this approach enables a race condition between readers and writers, exemplified in Figure 1. A reader may "miss" a writer with an overlapping range located down the list. At the same time, a writer may "miss" a reader with an overlapping range that entered the list at a point that the writer has already traversed. This race condition is possible because overlapping readers and writers may insert themselves into the list at different points (i.e., after different nodes), and therefore they do not compete to modify the same (next) pointer (see Figure 1).

We solve this problem with an extra validation step performed by readers and writers. Specifically, when a reader inserts its node into the list, it continues to scan the list until it finds a node with a range that does not overlap. If during this scan the reader comes across a writer, it waits until the writer's node is (logically) deleted. As for the writer, its validation step is slightly different (since a similar wait by writers for readers would lead to deadlock). Once the writer inserts itself into the list, it re-traverses the list from the head until it finds its own node. If during this re-traversal the writer finds a reader with an overlapping range, the writer leaves the list (by logically deleting its node) and restarts the acquisition attempt from the beginning. The race conditions can only happen between a reader and a writer that have both inserted themselves into the list, therefore re-traversing the list guarantees detecting such a race.

Note that this validation approach may cause starvation of writers, as they may be forced to restart repeatedly by incoming readers. We describe a way to avoid this issue in Section 4.3. Furthermore, note that our validation approach gives preference to readers, since in case of a conflict they stay in the list while writers restart. It is straightforward to reverse the scheme and give preference to writers instead, by letting them stay in the list (while waiting for conflicting readers to leave) and making the readers restart in case of a conflict.

 1  def RWRangeAcquire(ListRL *listrl, __u64 start,
 2                     __u64 end, int reader):
 3      do:
 4          RangeLock *rl = new RangeLock()
 5          rl->node = new LNode()
 6          rl->node->start = start
 7          rl->node->end = end
 8          rl->node->next = NULL
 9          rl->node->reader = reader
10      while (InsertNode(listrl, rl->node))
11      return rl
12
13  # compare returns:
14  #  -1: if lock1 comes before lock2, or
15  #      if both locks belong to readers and lock1 starts no later
16  #   1: if lock2 comes before lock1, or
17  #      if both locks belong to readers and lock2 starts no later
18  #   0: otherwise (the ranges conflict)
19  def compare(LNode *lock1, LNode *lock2):
20      if !lock1: return 1
21      int readers = lock1->reader + lock2->reader
22      if lock2->start >= lock1->end: return -1
23      if lock2->start >= lock1->start and readers == 2: return -1
24      if lock1->start >= lock2->end: return 1
25      if lock1->start >= lock2->start and readers == 2: return 1
26      return 0
27
28  def InsertNode(ListRL *listrl, LNode *lock):
        ...
29      if CAS(prev, cur, lock):
30          if lock->reader: return r_validate(lock)
31          else: return w_validate(listrl, lock)
        ...

Listing 2. Reader-writer range locks presented as diffs from the corresponding functions in Listing 1.
Listing 2 shows how to implement shared range locks, where overlapping acquisitions with shared (reader) accesses do not block each other. The pseudo-code in Listing 2 is presented in the form of diffs from Listing 1. The LNode structure now includes a flag (reader) indicating whether the corresponding range is acquired for read or for write. (This is a trivial change and thus not shown.) The RWRangeAcquire function is similar to MutexRangeAcquire (in Listing 1), except that the call to InsertNode is now wrapped in a do-while loop. This loop will be executed more than once by a writer only, and only in the case that the writer's validation fails. The RWRangeRelease function is identical to MutexRangeRelease (in Listing 1) and thus not shown. The compare function is adapted in a straightforward way to allow overlapping reader ranges (see Lines 19–26). Finally, the only change in the InsertNode function is the call to the validation functions according to the access mode for which the range lock is acquired (see Lines 29–31).

The details of the validation functions are given in Listing 3. A reader executes the r_validate function, where it continues to traverse the list from the point where it just inserted its node until it either reaches the end of the list or reaches a node that starts after the reader's node ends (Line 41). During the traversal, and as an optimization, the reader attempts to remove logically deleted nodes from the list (Lines 42–45). As mentioned in Section 4.1, in the actual implementation, successfully removed nodes are recycled using the memory management mechanism described in Section 4.4. Furthermore, if the reader encounters a writer's node, it waits until that node is logically deleted (Lines 49–51).

A writer, for its part, executes the w_validate function, where it traverses the list from the head until it reaches its own node (Line 58). Like a reader, during the traversal the writer attempts to remove logically deleted nodes from the list (Lines 59–62). If, however, the writer comes across an overlapping node, it deletes its own node and fails the validation (Lines 66–68). Note that this overlapping node has to belong to a reader, since a writer waits for any overlapping node (for which compare returns zero) before inserting itself into the list (cf. Lines 43–45 in Listing 1).

33  # Validation helpers called at Lines 30-31:
34  # r_validate returns 0 once a non-
35  # overlapping node is found; it waits
36  # out overlapping writers.
37  def r_validate(LNode *lock):
38      LNode **prev = &lock->next
39      LNode *cur = unmark(*prev)
40      while true:
41          if !cur or cur->start > lock->end: return 0
42          if marked(cur->next):
43              LNode *next = unmark(cur->next)
44              CAS(prev, cur, next)
45              cur = next
46          elif cur->reader:
47              prev = &cur->next
48              cur = unmark(*prev)
49          else:
50              while !marked(cur->next):
51                  Pause()
52
53  # w_validate returns 1 (failure) upon meeting an overlapping reader
54  def w_validate(ListRL *listrl, LNode *lock):
55      LNode **prev = &listrl->head
56      LNode *cur = unmark(*prev)
57      while true:
58          if cur == lock: return 0
59          if marked(cur->next):
60              LNode *next = unmark(cur->next)
61              CAS(prev, cur, next)
62              cur = next
63          elif cur->end <= lock->start:
64              prev = &cur->next
65              cur = unmark(*prev)
66          else:
67              DeleteNode(lock)
68              return 1

Listing 3. Validation functions called from InsertNode in Listing 2 (line numbers continue from Listing 2).
Correctness Argument: We argue that the pseudo-code in Listing 2 is a correct implementation of reader-writer range locks. To that end, we argue that the implementation never allows two threads to acquire conflicting ranges — ranges conflict when they overlap and at least one of them belongs to a writer. Our claim is based on the following invariant:

Invariant 2. For any two consecutive ranges R1 and R2 in ListRL, R1.start ≤ R2.start. Moreover, if R1 belongs to a writer, then R1.end ≤ R2.start.

Based on this invariant, if a reader or a writer G in ListRL overlaps with a writer W, then G.start ≤ W.start (if W.start ≤ G.start then, according to Invariant 2, W.end ≤ G.start, thus they cannot overlap). Assume there is a writer G in ListRL that overlaps with W and G.start ≤ W.start. Since G is a writer, then G.end ≤ W.start (otherwise Invariant 2 breaks), a contradiction. Now, we are left with the case of G being a reader. There are two possibilities: either (i) G entered ListRL before W, or (ii) after W. Note that a range enters ListRL after a successful CAS at Line 29 in Listing 2.

The intuition at the basis of our correctness argument is that if the conflicting range that enters ListRL last (among the two conflicting ranges) defers to the other conflicting range, we can guarantee reader-writer exclusion. Accordingly, to handle the first case, w_validate is executed after the CAS operation at Line 29, and since it starts traversing ranges in ListRL from the head node, any range G with G.start ≤ W.start ≤ G.end that entered ListRL before W (and has not left yet) is guaranteed to be visited during the traversal. For the second case, r_validate is executed after the CAS operation, and since it starts traversing ranges in ListRL from the node succeeding G, any range W with G.start ≤ W.start ≤ G.end that entered ListRL before G is guaranteed to be visited during the traversal. Consequently, by ensuring that neither w_validate nor r_validate returns successfully if a conflicting range lock is visited, reader-writer exclusion is guaranteed.

As for deadlock freedom, the same arguments used for the basic mutual exclusion variant apply also to the reader-writer pseudo-code. There are two additional cases, though, in which a thread T may wait infinitely long in InsertNode: (a) when w_validate infinitely often returns 1, and (b) when the r_validate function waits infinitely long for a thread with an overlapping writer range. Case (a) is possible only if another thread (or threads) insert (and delete) infinitely many overlapping nodes, which in turn means that those threads acquire and release infinitely many reader ranges while T is executing InsertNode. Assuming that no thread fails after executing the CAS operation at Line 29, case (b) is similar to case (d) in the exclusive access variant (see Section 4.1).

Similarly to what we mentioned earlier, we note that while the presented reader-writer range locks are deadlock-free, they are not starvation-free. We discuss next how our design can be augmented with an auxiliary lock to avoid starvation.

4.3 Fairness and Preventing Starvation

The range lock design presented so far does not use any locks. However, it allows starvation of a thread repeatedly failing to insert its node into the list due to other threads concurrently acquiring and releasing locks (and thus modifying the list). A simple way to avoid that is to introduce an auxiliary (fair) reader-writer lock coupled with an impatient counter. A thread acquiring the range lock checks the impatient counter, and if it is equal to zero (the common case), proceeds with the range acquisition. Otherwise, if the counter is non-zero, it acquires the RW-lock for read. When a thread fails to acquire the range lock in a few attempts, it bumps up the impatient counter (atomically) and acquires the RW-lock for write. The counter is decremented (atomically) upon the release of the RW-lock that was acquired for write. Note that any race between a thread reading zero from the counter and a thread incrementing the counter is benign, as the sole purpose of this counter is to introduce fairness rather than to ensure the correctness of the underlying range lock.
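A minimal user-space sketch of this mechanism follows, under stated assumptions: C11 atomics, a fair pthreads rwlock, and a hypothetical rl_try_insert() that makes one bounded insertion attempt and reports failure rather than retrying internally. The patience threshold is illustrative; this is not the authors' exact code.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* hypothetical: one bounded attempt to insert the range node */
extern bool rl_try_insert(void *range);

static pthread_rwlock_t aux = PTHREAD_RWLOCK_INITIALIZER; /* assumed fair */
static atomic_int impatient;                              /* zero-initialized */

void rl_acquire_fair(void *range) {
    for (int attempts = 1; ; attempts++) {
        bool deferred = false;
        if (atomic_load(&impatient) != 0) {
            /* a starving thread exists: line up behind its write acquisition */
            pthread_rwlock_rdlock(&aux);
            deferred = true;
        }
        bool ok = rl_try_insert(range);
        if (deferred)
            pthread_rwlock_unlock(&aux);
        if (ok)
            return;
        if (attempts == 3) {             /* illustrative patience threshold */
            atomic_fetch_add(&impatient, 1);
            pthread_rwlock_wrlock(&aux); /* stalls newcomers; old ones drain */
            while (!rl_try_insert(range))
                ;                        /* only in-flight holders remain */
            atomic_fetch_sub(&impatient, 1);
            pthread_rwlock_unlock(&aux);
            return;
        }
    }
}

The design choice to read here is that the RW-lock sits entirely off the common path: as long as nobody is starving, acquisitions touch only the impatient counter with a plain load.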
4.4 Memory Management

In the proposed design of range locks, threads traverse list nodes concurrently with threads modifying the list. While this approach avoids the bottleneck of an auxiliary lock protecting the underlying structure, as found in the existing implementation of range locks [4, 22], the lock-less traversal of a list poses a challenge with respect to the memory management of list nodes. This is because a list node may not be immediately reclaimed once it is removed from the list, since other threads traversing the list may have a reference to this node and may try to access its memory after it has been removed from the list. This is a well-known problem in the area of concurrent data structures [15, 30], and multiple solutions are available [20].

For our kernel-space implementation, we employ the read-copy-update (RCU) method [29], which is readily supported in the Linux kernel [28]. RCU is a synchronization mechanism that allows readers (threads that access shared data without modifying it) to execute concurrently with a writer (a thread modifying shared data) without acquiring locks. The idea at the basis of RCU is for readers to announce when they start and finish accessing shared data, while writers apply their changes to a copy of the data that is visible only to new readers (i.e., readers that started after the writer). The old data is then atomically replaced by the (modified) copy and recycled when there are no more active old readers. In the context of memory reclamation, threads traversing the list mark themselves as readers throughout the traversal, while a thread trying to reclaim memory performs that operation as a writer. To facilitate progress and efficiency, we employ the call_rcu() API, which does not require waiting for concurrent readers when retiring memory. That is, the memory will be retired (through the callback passed to call_rcu()) asynchronously, after those readers exit their corresponding critical sections.

For the user-space implementation, we chose an epoch-based reclamation scheme [16] for its simplicity and low overhead. We augment the epoch-based reclamation scheme with thread-local object (node) pools to amortize reclamation costs, as we detail next. Each thread maintains two (thread-local) pools of list nodes (where each pool is implemented as a sequential linked list). One pool contains list nodes ready to be allocated and used for a range lock acquisition (we call this pool active), while the other pool contains list nodes that this thread has removed from the list, but has not recycled yet (we call this pool reclaimed). Note that each thread has only two pools, regardless of the number of range locks it accesses. To amortize allocation costs, the active pools are initialized with N records (N = 128 in our case), while the reclaimed pools are initially empty. In addition, each thread is associated with an epoch number, which is a 64-bit counter initialized to zero and incremented before (and after) a thread makes the first (respectively, last) reference to a list node when traversing the list during a range lock acquisition.

When a thread removes a node from the list, it puts the node into the reclaimed pool. When a thread needs to allocate a new node for a range lock acquisition, it grabs a node from the active pool. If the active pool is empty, it calls a barrier function, which iterates over the epoch numbers of other threads and waits for each thread to finish its current operation (by incrementing its epoch), if such an operation is in progress (i.e., if the corresponding epoch number is odd). After the barrier, it is safe to recycle (or reclaim) all nodes in the reclaimed pool. Therefore, the thread switches between its pools: the (now empty) active pool becomes the reclaimed pool, and the (potentially non-empty) reclaimed pool becomes the active pool. After the switch, and in order to keep the memory footprint of the system steady, the thread checks whether the size of the active pool is too small (e.g., has fewer than N/2 nodes), and if so, replenishes it with newly allocated nodes. At the same time, if the active pool is too large (e.g., has more than 2N nodes), the active pool is trimmed by reclaiming (freeing) extra nodes (down to the total size of N). Note that when the workload is balanced, i.e., each thread removes roughly the same number of nodes that it inserts into the list underlying the range lock, the memory management does not involve the system memory allocator (except for the initial allocation of the active pools).
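A condensed sketch of these epoch-based pools follows, under stated assumptions: C11 atomics, a fixed thread count, and the pool refill/trim bookkeeping elided. This is illustrative, not the authors' code.

#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_THREADS 64

typedef struct PoolNode { struct PoolNode *next; /* payload elided */ } PoolNode;

static _Atomic uint64_t epoch[MAX_THREADS]; /* odd = inside a list traversal */
static _Thread_local int my_tid;            /* assigned at thread start (elided) */
static _Thread_local PoolNode *active;      /* nodes free for reuse */
static _Thread_local PoolNode *reclaimed;   /* removed, not yet safe to reuse */

void traversal_begin(void) { atomic_fetch_add(&epoch[my_tid], 1); } /* -> odd */
void traversal_end(void)   { atomic_fetch_add(&epoch[my_tid], 1); } /* -> even */

/* wait until every thread finishes the traversal it is currently in */
static void epoch_barrier(void) {
    for (int t = 0; t < MAX_THREADS; t++) {
        uint64_t e = atomic_load(&epoch[t]);
        while ((e & 1) && atomic_load(&epoch[t]) == e)
            ;  /* spin: thread t may still reference retired nodes */
    }
}

void pool_retire(PoolNode *n) {   /* called after removing a node from the list */
    n->next = reclaimed;
    reclaimed = n;
}

PoolNode *pool_alloc(void) {
    if (active == NULL) {
        epoch_barrier();          /* now the reclaimed nodes are safe to reuse */
        active = reclaimed;       /* swap the two pools */
        reclaimed = NULL;
        if (active == NULL)       /* still empty: fall back to the allocator */
            active = calloc(1, sizeof(PoolNode));
    }
    PoolNode *n = active;
    active = n->next;
    return n;
}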
4.5 Fast Path

The proposed range lock implementation is amenable to a fast path optimization, which allows the range lock to be acquired and released in a constant number of steps when the lock is not contended. This is particularly important for single-thread execution, but is also useful when the lock is accessed by multiple threads while only one of them accesses the lock at a time.

The fast path is implemented as follows. When a thread acquires the range lock, it checks whether the list is empty (i.e., whether head points to null). If so, it attempts to set (using CAS) the head of the list to the marked pointer to the node corresponding to the range lock acquisition request. If successful, the range lock acquisition is complete. In pseudo-code, the fast range lock acquisition path is implemented with the following two lines inserted right before the call to InsertNode in the range lock acquisition function (e.g., before Line 14 in Listing 1):

    if (listrl->head == NULL and
        CAS(&listrl->head, NULL, mark(rl->node))):
        return rl

The (not shown) mark macro simply sets the LSB of the given pointer. Note that the head pointer can be marked only if the lock has been acquired on the fast path. We exploit this fact in two places. First, during unlock, if a thread t finds that head is marked and points to t's node, t realizes that it has acquired the range lock through the fast path, and attempts to release it by setting head to null (using CAS). At the same time, if another thread t' attempts to acquire the range lock on the regular path and finds head marked, it first removes the mark (by changing head to point to the same node but without the mark, using CAS), and then proceeds with the acquisition. This ensures that a range lock l acquired on the fast path is properly released on the regular path if other threads acquired other ranges in the meantime between l's acquisition and release.

In summary, the main difference between the fast and regular paths is in the way nodes are removed from the list. While on the regular path the node is marked during lock release and removed during lock acquisition when (possibly) another thread traverses the list, on the fast path the removal is eager. This reduces the total number of atomic operations required to delete a node from the list, and keeps the number of steps performed during the lock operation constant, as there are no marked nodes that need to be removed from the list first.
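To make the interplay between the two release paths concrete, here is a hedged C sketch; the types mirror Listing 1, GCC atomic builtins stand in for CAS and FAA, and mark_ptr/range_release are our names for the helpers described above.

#include <stdint.h>
#include <stdbool.h>

typedef struct LNode { uint64_t start, end; struct LNode *next; } LNode;
typedef struct { LNode *head; } ListRL;

static LNode *mark_ptr(LNode *p) { return (LNode *)((uintptr_t)p | 1); }

/* logical deletion from Listing 1: atomically set the LSB of next */
static void delete_node(LNode *n) {
    __atomic_fetch_add((uintptr_t *)&n->next, 1, __ATOMIC_ACQ_REL);
}

void range_release(ListRL *l, LNode *self) {
    LNode *fast = mark_ptr(self);
    /* fast path: head still carries the marked pointer we installed */
    if (__atomic_load_n(&l->head, __ATOMIC_ACQUIRE) == fast) {
        LNode *expected = fast;
        if (__atomic_compare_exchange_n(&l->head, &expected, (LNode *)NULL,
                                        false, __ATOMIC_ACQ_REL,
                                        __ATOMIC_ACQUIRE))
            return;  /* eager removal: no marked node left behind */
    }
    /* regular path: another thread stripped the mark in the meantime */
    delete_node(self);
}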
5 Range Locks in the Linux Virtual Memory Subsystem

5.1 Background and Challenges

Operating systems provide processes with the virtual memory (VM) abstraction. It allows processes to assume they have access to all possible addressable memory, regardless of the actual underlying physical memory. To keep track of how regions within a process's virtual memory map to actual physical memory pages (whether located in main memory or swapped to disk), the Linux kernel uses the concept of Virtual Memory Area (VMA) structures [14]. In practice, a VMA is a data structure that defines a distinct contiguous region within the virtual memory address space using a start address and a variable length (a multiple of the page size). The VMA metadata also includes other attributes, such as the mapping to physical memory, access permissions, pointers to neighboring VMA structures, etc. For each process, the Linux kernel stores all its associated VMA structures in a red-black tree (mm_rb). A typical VM operation starts by querying mm_rb with an address (provided as an input from the API caller) to find the enclosing VMA (if it exists). According to the nature of the operation, it may read or change some metadata of a VMA, split a VMA, merge two VMA structures, insert a VMA, delete a VMA, etc. Note that a single VM operation may perform several of these operations on one or more VMA structures, according to the given input address range. Moreover, splitting, merging, inserting and deleting VMA structures incur structural changes to mm_rb. To that end, operations that might modify VMA structures and/or mm_rb (such as mprotect) acquire mmap_sem for write, while operations that only read a VMA's metadata (such as the page fault handler) acquire mmap_sem for read.

While the concept of range locks may appear, at first glance, a natural fit for synchronizing access to regions of the shared virtual memory address space, the task of applying those locks for this purpose in the Linux kernel is not straightforward, mainly for two reasons: (i) the APIs of VM operations are oblivious to the underlying VMA structures, and rely on querying mm_rb for this purpose; and (ii) a VM operation may end up performing structural changes to mm_rb (and thus interfere with other concurrent VM operations accessing mm_rb), and this is unknown a priori.

As a concrete example of the challenge of using range locks in VM operations, consider two calls: mprotect(0x100000, 65536, PROT_NONE) and mprotect(0x180000, 65536, PROT_READ). If we naively protect only the range on which each call operates (i.e., [0x100000 .. 0x110000] and [0x180000 .. 0x190000]), and those two ranges fall within the scope of the same VMA, the two operations may simultaneously acquire range locks for the corresponding ranges, and overwrite each other's updates to the metadata of that same VMA.
Moreover, if those calls result in a structural modification to mm_rb, they would perform those modifications without synchronizing with one another.

Figure 2. Example of an mprotect operation changing VMA metadata without modifying the mm_rb tree. (a) Two adjacent VMA structures with different protection flags. (b) The same VMA structures after mprotect(0x1800, 4096, PROT_READ | PROT_WRITE) returns.

To overcome these issues, one might always acquire the range lock for the full range whenever the lock is required in write mode. This would, however, preclude any parallelism when a writer acquires the range lock, and in fact, is expected to perform worse than mmap_sem (since the latter has a more efficient acquisition path).

5.2 Speculative mprotect
By inspecting the implementation of various VM operations [14], we notice that they do not always end up modifying mm_rb. For instance, consider the case when there are two neighboring VMA structures describing two contiguous memory regions with different protection flags, and mprotect is called on the area at the head of the second VMA (or the tail of the first VMA), with protection flags identical to the flags of the other VMA (see Figure 2). In that scenario, the boundaries (i.e., the metadata) of the involved VMA structures are changed, but the structure of mm_rb remains unchanged. As mentioned in the Introduction, this case is common in the GLIBC memory allocator. Consequently, for the cases where mm_rb does not change, we devise a speculative approach, in which the range lock is optimistically acquired only for the relevant part of the VM address space. We note that when a VM operation needs to modify mm_rb (e.g., when mprotect splits a VMA into two, and thus needs to create a node corresponding to the new VMA and insert it into mm_rb), acquiring the range lock for the entire range is the only available option to synchronize correctly with other operations traversing the mm_rb tree.

Listing 4 provides the pseudo-code for the mprotect operation with the integrated speculative mechanism. The intuition behind our speculative approach is that if we are able to decide whether the mprotect operation will end up modifying mm_rb before mprotect applies its changes, then it is safe to lock only the respective range; otherwise, if we discover that mm_rb needs to be modified, we restart the mprotect operation after acquiring the full range (for write). The latter action prevents other concurrent speculative operations from running and potentially reading an inconsistent mm_rb while it is being modified. To this end, we augment the major memory management structure in the Linux kernel (mm) with a sequence number. This number is incremented every time a range lock acquired for the full range in write mode is released. We use the sequence number to detect whether mm_rb has changed during the speculative operation, as described below.

The first step of the mprotect operation is to locate the relevant VMA given the input address and size. Therefore, we first acquire the range lock in read mode for the input range. This ensures that the structure of the underlying mm_rb will not change while find_vma() is running, since we make sure that mm_rb only changes under the range lock acquired in write mode for the entire range. (As its name suggests, find_vma() traverses mm_rb searching for the VMA that contains the given address, or more precisely, searching for the first VMA whose end address is larger than the given address.) Note that since the range lock is acquired in read mode, this step may run in parallel with other speculating operations (or any other operation that acquires the range lock in read mode). After locating the VMA, we unlock the range lock, and lock it again, this time in write mode and with the range adjusted to span the entire VMA (plus some small extra space, as we explain below). Note that during the time the range lock is not held, mm_rb may change and, in particular, the VMA returned by find_vma() might not be valid anymore. We use the sequence number mentioned above to detect this scenario.
Specifically, we read the sequence number right before dropping the read range lock and compare it to the number read right after acquiring the write range lock. If those numbers differ (or the boundaries of the found VMA have changed), the speculation fails, and we restart the mprotect operation from the beginning. We note that it is trivial to limit the number of retries, although we do not do that in our prototype implementation.
In case the speculation can proceed, we continue with the operation by going through the logic of identifying the required changes to the VMA(s) involved in the given mprotect operation. If this logic identifies that the changes require a structural modification to mm_rb, the speculation fails, the write range lock is dropped, and the mprotect operation is restarted by acquiring the write range lock for the full range. Otherwise, the mprotect operation completes while holding the write range lock for the relevant range only, thus allowing parallelism with other mprotect operations and/or operations that acquire the range lock for read (e.g., page faults, discussed in the next section).
We are left to describe one subtle detail of determining the size of the range for the write acquisition during speculation. We note that it is not enough to lock only the underlying VMA of the given mprotect operation. This is because, as discussed in Section 5.1, two mprotect operations on neighboring VMA structures can change the metadata of one another concurrently, thus creating a race condition. To avoid this situation, we set the range of the write range lock acquisition to the underlying VMA plus a page (4096 bytes) from each side of the VMA.

    mprotect(__u64 addr, size_t size, int prot_flags):
        __u64 start = addr
        __u64 end = addr + size
        bool speculate = true
        while true:
            if speculate:
                range_read_lock(range_lock, start, end)
            else:
                range_full_write_lock(range_lock)
            vm_area_struct *vma = find_vma(addr)
            if speculate:
                __u64 seq_number = mm->seqnumber
                __u64 aligned_start = vma->start - 4096
                __u64 aligned_end = vma->end + 4096
                range_read_unlock(range_lock)
                range_write_lock(range_lock, aligned_start, aligned_end)
                if seq_number != mm->seqnumber
                        or aligned_start != (vma->start - 4096)
                        or aligned_end != (vma->end + 4096):
                    range_write_unlock(range_lock)
                    continue
            ...   # identify the changes required by this mprotect
            if speculate and will perform structural modification:
                range_write_unlock(range_lock)
                speculate = false
                continue
            ...   # apply the changes to the VMA(s)
            range_write_unlock(range_lock)
            return

Listing 4. Simplified pseudo-code for the speculative mprotect implementation.
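The pseudo-code above relies on the sequence number being advanced whenever a full-range write acquisition is released. The following minimal sketch illustrates that release path; it reuses the helper names from Listing 4 and assumes the seqnumber field has been added to the mm structure as described above (the exact hook shown here is an illustration, not the actual kernel patch):

    /* Sketch: a full-range write release advances mm->seqnumber before
     * the range becomes available again, so a speculating mprotect that
     * re-reads the number after taking its narrower write range is
     * guaranteed to observe any structural change to mm_rb. */
    void range_full_write_unlock(struct mm_struct *mm,
                                 struct range_lock *range_lock)
    {
        mm->seqnumber++;                 /* mm_rb may have changed */
        range_write_unlock(range_lock);  /* release the full range */
    }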
While the speculative mechanism described in this section is presented in the context of mprotect, we note that a similar mechanism can be employed in other operations as well. For instance, mmap, munmap and brk all start by calling find_vma (or a similar function), during which the range lock can be held in read mode. Those operations, however, typically (but not always) end up modifying mm_rb, and thus would need to drop the read range lock and acquire the write range lock for the entire range. Thus, the speculative approach would shorten the time during which the write range lock is held, at the cost of an extra (read) range lock acquisition. Evaluating the effect of this speculation is left for future work.
Page fault interrupts access the VM subsystem to identify whether the address that triggered the fault is allowed to be accessed. They do so by locating the appropriate VMA (by calling the same find_vma() function) and then handling the fault based on that VMA's metadata (such as protection flags). Since the page fault routine only queries the metadata of VMA structures (but does not change them), it acquires the range lock in read mode. The original patch that introduced range locks into the Linux kernel, however, does all the acquisitions, including the one in the page fault routine, for the full range [5].
We observe that the page fault routine accesses only the metadata of the VMA returned by find_vma(). Therefore, it is straightforward to refine the range of the lock acquisition to contain only the given address (in our implementation, we lock a range of one page). We note that any modification to mm_rb is done while holding the write range lock for the full range, while any modification to VMA metadata is done while holding the write range lock (at least, according to Section 5.2) that covers the range being modified. Therefore, the refinement of the range of the lock acquired in page faults is safe. Furthermore, note that this refinement alone is not expected to improve the scalability of the VM subsystem, because the range lock is acquired in read mode, similarly to the original mmap_sem. However, when coupled with the speculation in mprotect, page fault interrupts can now lock and access VMA structures in parallel with some (or at least part) of the mprotect operations.
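To make the refinement concrete, a sketch of the refined page fault acquisition follows; the handler name and its body are placeholders, while the locking calls mirror those in Listing 4:

    /* Sketch: the fault handler locks only the page containing the
     * faulting address, in read mode, before consulting the VMA. */
    void handle_page_fault(__u64 addr)
    {
        __u64 start = addr & ~((__u64)4095);  /* align down to a page */
        __u64 end = start + 4096;

        range_read_lock(range_lock, start, end);
        vm_area_struct *vma = find_vma(addr);
        if (vma != NULL && addr >= vma->start) {
            /* query the VMA's metadata (e.g., protection flags) and
             * service the fault; writers that modify this VMA hold a
             * write range covering it, so the metadata cannot change
             * underneath us */
        }
        range_read_unlock(range_lock);
    }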
In this section, we show how range locks can be used to coordinate concurrent accesses to a skip list. We base our design on the optimistic skip list by Herlihy et al. [21]. In the original design, each node is associated with a spin lock. Search operations are wait-free and, in particular, do not acquire any locks. Update operations start by searching the list for the given key, locking all relevant nodes (we elaborate on that below), and validating that the list has not changed in a way that precludes completing the operation (e.g., the node we want to delete is still in the list); they then perform the required update (removing the node from the list, or inserting a new node), and finally unlock all the acquired locks. If the validation above fails, the operation releases all the locks it has acquired, and restarts.
When replacing the per-node spin lock with a single range lock, we maintain the same properties. In particular, the search operations are still wait-free, which is important for read-dominated workloads. The major change is in the locking protocol. The original optimistic skip list acquires node-level locks for all the predecessors of the node returned by the search (in case of a remove operation) or of the node with a key larger than the given key (in case of an insert operation). Note that each node has between 1 and N predecessors, where N is the number of levels in the skip list, and thus the locking protocol consists of between 1 and N lock acquisitions. In addition, remove operations acquire the lock of the target node to be deleted, adding one more lock acquisition to the locking protocol. With range locks, we always need to acquire one range only. For inserts, the range is the interval between the key of the predecessor at the highest level (at which the new node will be inserted) and the target key (to be inserted). For removes, the range is defined from the key of the predecessor at the highest level to the target key (to be removed) plus 1; the latter is to avoid races with inserts that may attempt to update pointers in the to-be-deleted node.
We note that beyond the conceptual simplicity and the potential performance benefits stemming from the fact that each operation acquires at most one (range) lock, the range lock-based skip list has a smaller memory footprint than its original lazy counterpart. This is due to the elimination of the spin locks associated with every node in the skip list. As the number of nodes in a skip list is typically (much) larger than the number of concurrent threads updating it, this may translate into significant memory savings.
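To illustrate the locking protocol, the sketch below shows how an update operation might compute its single range. Here preds[] holds the predecessors returned by the search (as in the original algorithm [21]), top is the highest level at which the operation links or unlinks a node, node_t is assumed to be a skip list node with a key field, and the rlock_* names are placeholders for a range lock API:

    /* Sketch: one range acquisition replaces the 1..N+1 per-node
     * lock acquisitions of the original optimistic skip list. */
    void lock_update_range(int inserting, long key,
                           node_t **preds, int top)
    {
        long lo = preds[top]->key;  /* predecessor at the highest level */
        long hi = inserting ? key   /* key being inserted */
                            : key + 1;  /* +1 blocks inserts that would
                                           update pointers inside the
                                           node being removed */
        rlock_acquire_write(range_lock, lo, hi);
        /* validate, apply the insert or remove, then release */
        rlock_release(range_lock, lo, hi);
    }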
In this section, we evaluate our linked list-based range locks using two user-space applications.
We start with ArrBench, a microbenchmark that we developed, in which threads access a range of slots of a shared array for either read or write. This benchmark allows us to assess the performance of our range locks in different contention scenarios. Array slots are padded to the size of a cache line. In read mode, a thread reads the values stored in each slot in the given range, while for write a thread increments the value stored in each slot by 1. Each operation acquires a range lock for the corresponding range, and in the corresponding access mode (read or write). Between operations on the array, each thread performs some (non-critical) work, emulated by a variable number of no-op operations. The number of no-op operations is chosen uniformly at random from the given range (2048 in our case). We set the size of the array (i.e., the number of slots) to 256.
To simulate various levels of contention and possible usage scenarios for range locks, we created three variants of ArrBench. In the first variant, each thread acquires the entire range of the array. In the second variant, each thread acquires a non-overlapping range calculated by dividing the size of the array by the number of threads. Note that in this variant, threads do not conflict on the ranges they acquire. Furthermore, in order to keep the amount of work (i.e., the number of slot accesses) performed under the range lock the same independent of the number of threads, in this variant only, threads traverse the corresponding portion of the array a number of times equal to the number of threads. In other words, when this variant is run with one thread, that thread traverses the entire array once for every acquisition of the range lock; when run with two threads, each of the threads traverses half of the array twice for every acquisition of the range lock, and so on. Finally, in the third variant, each thread picks random starting and ending points from the range defined by the size of the array (we select starting and ending points randomly modulo the size of the array, and swap them if the former is larger than the latter), acquires the range lock with that range, and performs one traversal of the corresponding slots.
We implemented the mutex and reader-writer variants of our range locks, denoted as list-ex and list-rw, respectively. We ported two implementations of range locks found in the kernel into user-space: one found in the Lustre file system (denoted as lustre-ex) and another recently proposed by Bueso [4] (denoted as kernel-rw). As mentioned earlier, the latter is a reader-writer version of the former. In the user-space experiments, we used a simple test-test-and-set lock to implement the spin lock protecting the range tree in lustre-ex and kernel-rw. We note that the Linux kernel uses a slightly more sophisticated spin lock implementation [8, 12]; however, this detail is insignificant in our context (to confirm that, we tried a different lock and observed similar relative performance results). In addition, we implemented the recent proposal for range locks by Kim et al. [24]. Those locks were proposed in the context of pNOVA, a variant of a non-volatile memory file system, hence we denote this version of range locks as pnova-rw. As described in Section 2, pnova-rw operates with a preset number of segments, each of a preset size [24]; in our experiments we set this number to 256 segments, spanning one array slot each. We also experimented with other numbers of segments, spanning multiple slots; although the results were quantitatively different, they led to similar conclusions.
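As an illustration, here is a minimal sketch of the per-thread loop for the third (random-range) variant of ArrBench; the range lock calls name a generic API in the style of Listing 4, and the 64-byte cache line size is an assumption of this sketch:

    #include <stdlib.h>

    #define SLOTS 256
    typedef struct { long val; char pad[56]; } slot_t; /* pad to 64B */
    slot_t array[SLOTS];

    void random_range_op(int is_read)
    {
        /* pick endpoints modulo the array size; swap if out of order */
        int from = rand() % SLOTS, to = rand() % SLOTS;
        if (from > to) { int tmp = from; from = to; to = tmp; }

        if (is_read) {
            range_read_lock(range_lock, from, to + 1);
            for (int i = from; i <= to; i++)
                (void)array[i].val;          /* read each slot */
            range_read_unlock(range_lock);
        } else {
            range_write_lock(range_lock, from, to + 1);
            for (int i = from; i <= to; i++)
                array[i].val++;              /* increment each slot */
            range_write_unlock(range_lock);
        }
        /* followed by non-critical work: a uniformly random number
         * of no-ops in [0, 2048) */
    }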
We ran the experiments on a system with two Intel Xeon E5-2630 v4 sockets featuring 10 hyperthreaded cores each (40 logical CPUs in total) and running Fedora 29. We did not pin threads to cores, relying on the OS to make its choices. We also disabled turbo mode to avoid the effects of that mode (which may vary with the number of threads) on the results. We vary the number of threads between 1 and 40, as well as the mix of operations performed by each thread (100% reads, 80% reads and 20% writes, and 60% reads and 40% writes). The results for the 80% reads workload were similar to those for the 60% reads workload and are thus omitted. Each reported experiment has been run 5 times in exactly the same configuration. Presented results are the mean of the throughput results reported by each of those 5 runs, where throughput is calculated based on the total number of operations performed by all the threads running for ten seconds. The standard deviation of nearly all results is less than 3% of the mean.
The results for the first variant of ArrBench, in which threads acquire and access the entire range, are shown in Figure 3 (a) and (b). The lustre-ex variant does not scale at all: it allows only one thread to traverse the array at a time, as it does not support reader-writer semantics. Moreover, all threads contend heavily on the spin lock protecting the range tree structure. This is not the case for list-ex, where the fact that threads perform non-critical work without holding a range lock helps it scale for low thread counts. In fact, in most cases list-ex performs better than kernel-rw, even though the latter allows readers to run concurrently.
Figure 3. Throughput for the ArrBench microbenchmark, where all threads acquire the entire range (first row, panels a and b), threads acquire non-overlapping ranges (second row, panels c and d), and threads acquire random ranges (third row, panels e and f); panels (b), (d), and (f) show the 60% reads workload.
Once again, the spin lock protecting the underlying range tree plays a detrimental role in the performance of kernel-rw. The pnova-rw variant also does not scale, due to its high lock acquisition latency (acquiring this lock for the entire range requires acquiring all the underlying segment reader-writer locks). At the same time, list-rw does not use locks in the common case, and shows scalability across most thread counts.
The results for the second variant of ArrBench, in which each thread acquires a non-overlapping part of the range, are shown in Figure 3 (c) and (d). Note that the maximum number of concurrent range accesses is equivalent to the number of threads depicted on the x-axis, which determines the size of the list (or the tree) in the corresponding range lock implementation. In theory, in this case the total throughput should scale with the number of threads for every range lock, as threads never compete for the same range (regardless of the access mode). In practice, however, all range locks scale almost linearly only up to a small number of threads (4–8). Beyond that, the contention on the spin lock in lustre-ex and kernel-rw degrades the performance of those variants. list-ex and list-rw lack a single point of contention, and manage to scale, albeit less than linearly, across all thread counts. pnova-rw tops the charts, as in this workload none of its underlying segment reader-writer locks is contended.
When considering the results for the third variant of ArrBench, in which each thread acquires a random part of the
range (see Figure 3 (e) and (f)), one can note a mix of the behaviors seen in the previous two variants. Overall, lustre-ex does not scale, kernel-rw scales up to a small number of threads, while list-ex is either slightly better than (in the read-only workload) or significantly outperforms (when workloads include writes) kernel-rw, despite providing only exclusive access to each range. pnova-rw performs poorly, as its underlying reader-writer locks are once again contended. At the same time, list-rw provides superior performance across all workloads, scaling better than any other variant.
Next, we used the Synchrobench benchmark [18] to evaluate the performance of the new skip lists that employ range locks to synchronize concurrent access, as discussed in Section 6. We compare three variants: the original optimistic skip list [21] (provided in Synchrobench, denoted as orig), and two variants of our new skip list that uses a range lock, one built on top of the Lustre range locks (denoted as range-lustre) and another on top of the exclusive list-based range lock presented earlier in this paper (denoted as range-list). As it is not clear how one should set the number and the size of segments in pNOVA range locks, we do not include that lock in the evaluation of skip lists.
Figure 4. Throughput for the skip list benchmark.
Figure 4 shows the results for the typical set workload composed of 80% find and 20% update operations (split evenly between inserts and removes); the key range is 8M, and 4M keys are randomly selected and inserted into the skip list before each experiment. We report the mean throughput after repeating each experiment 5 times (here as well, the standard deviation is less than 3% of the mean for nearly all data points). The results show that range-list performs similarly to orig, even though the former is simpler and consumes less memory, as it does not use a lock per skip list node. range-lustre tracks both versions at lower thread counts. Once thread counts grow, however, the contention on its internal spin lock increases and, as expected, its performance drops to less than half of the other two variants. This workload demonstrates that the increased concurrency allowed by range-list outweighs the linear complexity of the linked list, in contrast with the logarithmic complexity of range-lustre's range tree.
For the kernel-level experiments, we compared the stock version (4.16.0-rc2) with one that has mmap_sem replaced with a range lock. For the latter, we used the patch by Bueso [5]; we call this variant tree-full, as it always acquires the range lock for the full range. Based on this patch, we replaced the range lock implementation with the reader-writer linked list-based one described in this paper; we call this variant list-full. Furthermore, we refined the ranges of the acquired range locks as described in Section 5. We refer to the variants with refined ranges as tree-refined and list-refined, according to the range lock implementation used by each. All the variants were compiled in the default configuration.
We ran the experiments on a system with four Intel Xeon E7-8895 v3 sockets featuring 18 hyperthreaded cores each (144 logical CPUs in total). As for the user-space experiments, we do not pin threads to cores and we disable turbo mode. For our evaluation, we used Metis, an open source MapReduce library [27], known for stress-testing the VM subsystem through a mix of VM-related operations (such as page faults, mmap and mprotect) [23]. Each experiment was repeated 5 times, and we report the mean of the results.
The standard deviation of the majority of the results was below 5% of the mean.
Through the tracing facility in the kernel (ftrace), we identified that three benchmarks in the Metis suite use mprotect extensively. Those applications are wc (word count), wr (inverted index calculation) and wrmem, which is a variant of wr that allocates a chunk of memory and fills it with random "words" instead of reading its input from a file. We used the default input files for wc and wr, and a 2GB input size for wrmem. The tracing also revealed that the majority of the calls to mprotect (over 99%) succeed in the speculative path. We note that in all other Metis benchmarks, which did not call mprotect as extensively as the three benchmarks mentioned above, the impact of range locks was negligible.
Figure 5. Runtime for the Metis benchmarks: (a) wr, (b) wc, (c) wrmem.
Figure 5 shows the runtime results for wc, wr and wrmem (lower is better). Up to 8–16 threads, all variants perform similarly and scale linearly with the number of threads. However, once the thread counts increase, and with them the contention on the VM subsystem, the variants produce different results. Notably, the performance of the stock version worsens with the increased contention, while the list-based range lock variants remain mostly flat, or continue to scale, as in the case of wrmem and list-refined. In general, the tree-based range locks perform worse than the list-based ones, and mostly worse even when compared with the stock version. We believe this is, at least in part, because of the contention created on the spin lock protecting the access to the range tree. Refining the ranges of the range lock acquisitions helps both tree-based and list-based variants, i.e., tree-refined outperforms tree-full, while list-refined outperforms list-full. In fact, at 144 threads, list-refined has a 9× speedup over stock in wrmem. Again, similar to our observation in the user-space skip list experiment, the higher parallelism achieved by list-based range locks outweighs the linear complexity of list traversal, even with a large number of concurrent ranges (144 in this case).
It is interesting to note that list-full outperforms stock under high contention despite always acquiring the range lock for the full range. We conjecture that this is due to the different waiting policies employed by those two variants. Specifically, stock uses a read-write semaphore (mmap_sem), in which threads block (after spinning for a while, if optimistic spinning is enabled) when the semaphore is unavailable, until they are woken up by another thread. In list-full (and list-refined), threads block for a small period of time if the range is unavailable and then recheck the range, which turns out to be more efficient under contention. Exploring different waiting policies and their impact on lock performance is an active area of research [10, 23].
Figure 6. Breakdown of the impact of refining the range in list-based range lock variants: (a) wr, (b) wc, (c) wrmem.
Figure 6 drills down into the effect of refining ranges on the performance of the list-based range locks. Here list-pf (list-mprotect) denotes the variant where only the range in the page fault routine (mprotect operation, respectively) is refined. As expected, the refinement in the page fault routine alone does not have much effect, since the range lock is acquired there for read while in all other places it is acquired for the full range. At the same time, refining the range in mprotect has a small but positive effect, as now mprotect operations on non-overlapping ranges can be applied concurrently. As Figure 6 shows, however, it is the combination of the two optimizations that makes a difference: list-refined, which refines the range in both page faults and mprotect and thus allows their concurrent execution, substantially outperforms all other variants.
Through the lock_stat mechanism built into the kernel, we collected statistics on the time threads spent waiting for various locks in the kernel. (The lock_stat mechanism is known to introduce a probe effect [12]; therefore, it was enabled only for runs in which we collected statistics on lock wait times.) In Figure 7, we plot the average wait times for mmap_sem (in the stock variant) as well as for the range lock in all other variants, breaking down between read and write acquisitions. Not surprisingly, those results show a (rough) correlation between high wait times and poor scalability. They also reveal that with range refinement, the average wait times decrease.
Figure 8 shows the average wait time on the spin lock protecting the range tree in the tree-full and tree-refined variants. Notice that the waiting time grows with the number of threads, supporting our hypothesis that this lock represents a point of contention. The range refinement does not change the wait time for the spin lock much. This is not surprising, as this lock is acquired for every acquisition of the range lock, regardless of whether or not the range is available. However, while in tree-full the wait time for the spin lock is relatively small compared to the wait time for the range lock itself (which includes waiting for a range to become available), in tree-refined it takes the lion's share of the range lock wait time (cf. Figure 7 and Figure 8).
Figure 7. Average wait time for mmap_sem (in stock) and the range lock (in all other variants): (a) wr, (b) wc, (c) wrmem.
Figure 8. Average wait time for the spin lock protecting the range tree in tree-full and tree-refined: (a) wr, (b) wc, (c) wrmem.
This underscores the effectiveness of range refinement in allowing parallel processing of the VM operations. That is, when the ranges are refined, most of the wait time for a range lock can be attributed to the wait time on the auxiliary spin lock rather than to waiting for the range to become available. Unlike tree-based range locks, list-based range locks do not have a central point of contention and thus can take better advantage of this parallelism, as demonstrated by the results in Figure 5.
In this paper, we presented the design and implementation of new scalable range locks. Those locks employ a simple underlying structure (a concurrent linked list) to keep track of acquired ranges. This structure allows simple lock-less modifications with just one atomic instruction. Therefore, our design avoids the pitfall of existing range locks, and does not require an auxiliary lock in the common case.
Furthermore, we show how range locks can be employed effectively to mitigate the contention on the access to the VM subsystem and its data structures, in particular, the red-black tree holding VMA structures. We achieve that through a speculative mechanism introduced into the mprotect operation; this mechanism allows refining the range of the lock acquired in mprotect. We also refine the range of lock acquisitions in page fault routines. Together, those refinements allow parallel processing of page faults and mprotect operations on non-overlapping regions of the VM space, which is particularly beneficial, e.g., for the standard GLIBC memory allocator. In addition, we demonstrate the utility of range locks for the design of concurrent, scalable data structures through the example of a range lock-based skip list.
We evaluate the scalability of the new range locks in user-space through several microbenchmarks, and in kernel-space through several applications from the Metis suite. The results show that the new range locks provide superior performance compared to the existing range locks (in user-space and in the kernel), as well as to the current method of VM subsystem synchronization in the kernel (which uses a read-write semaphore). Future work includes evaluating range locks with additional benchmarks, and exploring the usage of range locks in other contexts, such as parallel file systems [24] and as building blocks for other concurrent data structures, such as hash tables and binary search trees.
References
[1] P. M. Aarestad, A. Ching, G. K. Thiruvathukal, and A. N. Choudhary. 2006. Scalable Approaches for Supporting MPI-IO Atomicity. In Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), Vol. 1. 35–42.
[2] AT&T. 1986. UNIX System V User's Manual Volume 1. http://bitsavers.trailing-edge.com/pdf/att/3b1/999-801-312IS_ATT_UNIX_PC_System_V_Users_Manual_Volume_1.pdf Accessed: 2019-04-15.
[3] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. 2010. An Analysis of Linux Scalability to Many Cores. In Proceedings of the USENIX Conference on Operating Systems Design and Implementation (OSDI). 1–16.
[4] Davidlohr Bueso. 2017. locking: Introduce range reader/writer lock. https://lwn.net/Articles/722741/, May 15, 2017. Accessed: 2018-10-29.
[5] Davidlohr Bueso. 2018. mm: towards parallel address space operations. https://lwn.net/Articles/746537/, Feb 5, 2018. Accessed: 2019-04-15.
[6] Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. 2012. Scalable Address Spaces Using RCU Balanced Trees. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 199–210.
[7] Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. 2013. RadixVM: Scalable Address Spaces for Multithreaded Applications. In Proceedings of the ACM European Conference on Computer Systems (EuroSys). 211–224.
[8] Jonathan Corbet. 2014. MCS locks and qspinlocks. https://lwn.net/Articles/590243, March 11, 2014. Accessed: 2018-10-29.
[9] Jonathan Corbet. 2017. Range reader/writer locks for the kernel. https://lwn.net/Articles/724502, June 5, 2017. Accessed: 2018-09-28.
[10] Dave Dice. 2017. Malthusian Locks. In Proceedings of the ACM European Conference on Computer Systems (EuroSys). 314–327.
[11] Dave Dice and Alex Kogan. 2019. BRAVO: Biased Locking for Reader-writer Locks. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC). 315–328.
[12] Dave Dice and Alex Kogan. 2019. Compact NUMA-aware Locks. In Proceedings of the ACM European Conference on Computer Systems (EuroSys). 12:1–12:15.
[13] Laurent Dufour. 2017. Replace mmap_sem by a range lock. https://lwn.net/Articles/723648/, May 24, 2017. Accessed: 2018-10-29.
[14] L. Torvalds et al. 2020. Linux source code. Accessed: 2020-03-10.
[15] Jose M. Faleiro and Daniel J. Abadi. 2017. Latch-free Synchronization in Database Systems: Silver Bullet or Fool's Gold? In Proceedings of the Conference on Innovative Data Systems Research (CIDR).
[16] K. Fraser. 2004. Practical lock-freedom. Ph.D. Dissertation. University of Cambridge.
[17] Goetz Graefe. 2007. Hierarchical locking in B-tree indexes. In BTW.
[18] Vincent Gramoli. 2015. More Than You Ever Wanted to Know About Synchronization: Synchrobench, Measuring the Impact of the Synchronization on Concurrent Algorithms. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).
[19] Timothy L. Harris. 2001. A Pragmatic Implementation of Non-blocking Linked-Lists. In Proceedings of the 15th International Conference on Distributed Computing (DISC). 300–314.
[20] Thomas E. Hart, Paul E. McKenney, Angela Demke Brown, and Jonathan Walpole. 2007. Performance of Memory Reclamation for Lockless Synchronization. J. Parallel Distrib. Comput. 67, 12 (2007), 1270–1285.
[21] Maurice Herlihy, Yossi Lev, Victor Luchangco, and Nir Shavit. 2007. A Simple Optimistic Skiplist Algorithm. In Proceedings of the 14th International Conference on Structural Information and Communication Complexity (SIROCCO).
[22] Jan Kara. 2013. lib: Implement range locks. https://lkml.org/lkml/2013/1/31/483, January 31, 2013. Accessed: 2018-09-28.
[23] Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. 2017. Scalable NUMA-aware Blocking Synchronization Primitives. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC).
[24] June-Hyung Kim, Jangwoong Kim, Hyeongu Kang, Chang-Gyu Lee, Sungyong Park, and Youngjae Kim. 2019. pNOVA: Optimizing Shared File I/O Operations of NVM File System on Manycore Servers. In Proceedings of the 10th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys). 1–7.
[25] David Lomet and Mohamed F. Mokbel. 2009. Locking Key Ranges with Unbundled Transaction Services. Proc. VLDB Endow. 2, 1 (Aug. 2009), 265–276.
[26] David B. Lomet. 1993. Key Range Locking Strategies for Improved Concurrency. In Proceedings of the 19th International Conference on Very Large Data Bases (VLDB '93). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 655–664.
[27] Yandong Mao, Robert Morris, and Frans Kaashoek. 2010. Optimizing MapReduce for Multicore Architectures. Technical Report. MIT.
[28] Paul E. McKenney, Silas Boyd-Wickizer, and Jonathan Walpole. 2012. RCU usage in the Linux kernel: One decade later. Technical Report.
[29] Paul E. McKenney and Jack Slingwine. 1998. Read-copy-update: Using Execution History to Solve Concurrency Problems. In Parallel and Distributed Computing and Systems. 509–518.
[30] Maged M. Michael. 2004. Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects. IEEE Trans. Parallel Distrib. Syst. 15, 6 (2004), 491–504.
[31] C. Mohan. 1990. ARIES/KVL: A Key-value Locking Method for Concurrency Control of Multiaction Transactions Operating on B-tree Indexes. In Proceedings of the Sixteenth International Conference on Very Large Databases. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 392–405.
[32] C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. ARIES: A Transaction Recovery Method Supporting Fine-granularity Locking and Partial Rollbacks Using Write-ahead Logging. ACM Trans. Database Syst. 17, 1 (March 1992), 94–162.
[33] M. Quinson and F. Vernier. 2009. Byte-Range Asynchronous Locking in Distributed Settings. 191–195.
[34] Frank Schmuck and Roger Haskin. 2002. GPFS: A Shared-Disk File System for Large Computing Clusters. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST).
[35] Xiang Song, Jicheng Shi, Ran Liu, Jian Yang, and Haibo Chen. 2013. Parallelizing Live Migration of Virtual Machines. In Proceedings of the 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE). 85–96.
[36] Rajeev Thakur, Robert Ross, and Robert Latham. 2005. Implementing Byte-Range Locks Using MPI One-Sided Communication. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, Beniamino Di Martino, Dieter Kranzlmüller, and Jack Dongarra (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 119–128.