Jiffy: A Lock-free Skip List with Batch Updates and Snapshots
Tadeusz Kobus
Poznan University of Technology, Poznań, Poland
[email protected]
Maciej Kokociński
Poznan University of Technology, Poznań, Poland
[email protected]
Paweł T. Wojciechowski
Poznan University of Technology, Poznań, Poland
[email protected]
ABSTRACT
In this paper we introduce Jiffy, the first lock-free, linearizable ordered key-value index that offers both (1) batch updates, which are put and remove operations that are executed atomically, and (2) consistent snapshots used by, e.g., range scan operations. Jiffy is built as a multiversioned lock-free skip list and relies on the CPU's Time Stamp Counter register to generate version numbers at minimal cost. For faster skip list traversals and better utilization of the CPU caches, key-value entries are grouped into immutable objects called revisions. Moreover, by changing the size of revisions and thus modifying the synchronization granularity, our index can adapt to varying contention levels (smaller revisions are more suited for write-heavy workloads whereas large revisions benefit read-dominated workloads, especially when they feature many range scan operations). Structure modifications to the index, which result in changing the size of revisions, happen through (lock-free) skip list node split and merge operations that are carefully coordinated with the update operations. Despite its rich semantics, Jiffy offers highly scalable performance, which is comparable to or exceeds the performance of the state-of-the-art lock-free ordered indices that feature linearizable range scan operations. Compared to its (lock-based) rivals that also support batch updates, Jiffy can execute large batch updates up to 7.4× more efficiently.

1 INTRODUCTION
Concurrent programming is inherently difficult. Hence, to develop applications and complex systems, such as database engines, which are optimized for modern multicore hardware, programmers often rely on concurrent data structures. These structures expose a well-defined interface and can be safely used in a multithreaded environment without additional synchronization (see, e.g., [19]).
Under the hood, concurrent data structures feature sophisticated, often non-blocking synchronization algorithms optimized for performance. With the proliferation of multicore hardware in recent years, many new concurrent data structures, such as concurrent lists [26, 50], sets [11, 22, 30, 38, 45, 47], (ordered) key-value indices (or maps, dictionaries) [8, 9, 12–14, 23, 24, 37, 40, 43, 44, 46, 48, 49, 51], etc., have been proposed, each time improving the performance over the existing solutions and introducing new features, such as the support for consistent range scan operations or snapshots that provide a read-only, static and consistent view over the state of the entire dataset.

In this paper, we introduce
Jiffy, the first linearizable [29], lock-free ordered index (sorted key-value map) that, besides offering consistent snapshots used, e.g., by range scans, provides support for batch updates, which are put and remove operations that are executed atomically. We propose several innovations to make our algorithm highly scalable, despite the rich semantics it offers.

The novel design of our index is based on a multiversioned [10] skip list [41]. However, unlike many existing multiversioned concurrent indices, which rely on a single atomic counter to generate version numbers, e.g., [9, 32, 33], Jiffy's concurrency control mechanism is specially designed to use version numbers obtained by reading the CPU's Time Stamp Counter (TSC) register [31, 42], a high-resolution clock available on the x86_64 platform. Reading the TSC register is an extremely fast operation as it does not involve a system call. In turn, Jiffy does not feature a single point of contention and offers scalable performance on modern 40+ core CPUs.

Key-value entries are grouped in Jiffy into immutable objects, called revisions, which are tagged with a version number. The use of revisions instead of maintaining each key-value pair as a separate object has several benefits. Firstly, the use of revisions allows the index to be smaller and thus quicker to traverse. Secondly, accesses to individual key-value entries can be performed more efficiently through the use of a lightweight hash index inside each revision, whereas range scans can benefit from keys and values being stored in sorted arrays within the revision. Crucially, however, by growing or shrinking the skip list and thus modifying the sizes of revisions, we can optimize the synchronization granularity in Jiffy, which allows it to adapt to changing workloads. Smaller revisions are more suited for write-heavy workloads whereas large revisions benefit read-dominated workloads, especially when they feature many range scan operations.
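To make the revision layout concrete, the following is a minimal sketch in Java (the paper's implementation language). All class, field and method names here are our own illustration, not the authors' code; the real revisions additionally carry a lightweight hash index for point lookups.

```java
import java.util.Arrays;

// Illustrative sketch: an immutable "revision" holding a node's
// key-value entries in sorted arrays, tagged with a version number.
final class Revision {
    final long version;     // final version (> 0) or optimistic (< 0)
    final String[] keys;    // sorted keys within the node's range
    final String[] values;  // values[i] belongs to keys[i]

    Revision(long version, String[] keys, String[] values) {
        this.version = version;
        this.keys = keys;
        this.values = values;
    }

    // Range scans benefit from the sorted layout; point lookups can
    // use binary search (the paper adds a per-revision hash index).
    String get(String key) {
        int i = Arrays.binarySearch(keys, key);
        return i >= 0 ? values[i] : null;
    }

    // Updates never mutate a revision: they produce a new one.
    Revision cloneAndPut(String key, String value, long newVersion) {
        int i = Arrays.binarySearch(keys, key);
        if (i >= 0) {                        // overwrite existing entry
            String[] vs = values.clone();
            vs[i] = value;
            return new Revision(newVersion, keys, vs);
        }
        int at = -i - 1;                     // insertion point
        String[] ks = new String[keys.length + 1];
        String[] vs = new String[values.length + 1];
        System.arraycopy(keys, 0, ks, 0, at);
        System.arraycopy(values, 0, vs, 0, at);
        ks[at] = key;
        vs[at] = value;
        System.arraycopy(keys, at, ks, at + 1, keys.length - at);
        System.arraycopy(values, at, vs, at + 1, values.length - at);
        return new Revision(newVersion, ks, vs);
    }
}
```

Immutability is what makes revisions safe to share between readers and writers without locks: a reader that obtained a reference to a revision can never observe it change.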
Automatic adaptation to the workload is accomplished on a per-revision basis through a simple, yet versatile policy based on monitoring the time concurrent threads spend executing update (i.e., put, remove and batch update) and read (i.e., lookup or range scan) operations, not by counting the number of operations performed or monitoring the contention on shared references, as in other existing approaches, e.g., [43, 44, 51].

The core contribution of our paper is, however, the novel lock-free algorithm that enables updates, reads, as well as index structure modifications, which facilitate varying the sizes of revisions. Structure modifications are streamlined with updates and happen through the skip list node split and merge operations based on the atomic compare-and-swap (CAS) operations. Our algorithm is based on a few simple rules all threads in Jiffy must abide by:
• always help to complete a structure modification when encountering one,
• a node split happens towards higher keys (a new node inherits the upper half of the key range of the node that undergoes a split operation),
• merges happen towards lower keys (the preceding node inherits the key range of the node that undergoes a merge operation),
• batch updates proceed from the largest keys included in a batch towards the lower keys.

We implemented Jiffy in Java and extensively tested it on various workloads against the state-of-the-art lock-free ordered indices that feature linearizable range scans [12–14, 44] and the (lock-based) ordered indices that also support batch updates [51]. Our tests show the highly scalable performance of Jiffy, which is comparable to or exceeds the performance of the other systems. Crucially, due to its lock-free architecture, Jiffy can execute large batch updates much more efficiently compared to its (lock-based) rivals, with speedup in throughput ranging from 1.1× to 7.4×, depending on the test scenario.
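The semantics promised above (atomic batches, consistent snapshot-based range scans) can be illustrated with a tiny, deliberately naive stand-in. The sketch below is not Jiffy and is not lock-free: it uses a coarse lock over a sorted map purely to pin down what "atomic batch update" and "consistent range scan" mean to clients; the class and method names are ours.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Semantics-only stand-in for a Jiffy-like index (NOT the paper's
// lock-free algorithm): a coarse lock makes the guarantees obvious.
final class BatchUpdateDemo {
    private final TreeMap<String, String> map = new TreeMap<>();

    // A batch update applies all puts and removes atomically:
    // readers observe either none or all of its effects.
    synchronized void batchUpdate(Map<String, String> puts, List<String> removes) {
        for (Map.Entry<String, String> e : puts.entrySet())
            map.put(e.getKey(), e.getValue());
        for (String k : removes)
            map.remove(k);
    }

    // A range scan runs on a consistent snapshot of the dataset;
    // here we simply copy the requested sub-range.
    synchronized Map<String, String> snapshotRange(String from, String to) {
        return new TreeMap<>(map.subMap(from, to));
    }
}
```

Jiffy's contribution is delivering exactly these guarantees without any locks, as the rest of the paper describes.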
2 RELATED WORK
A template for obtaining non-blocking algorithms for concurrent data structures based on CAS was originally proposed by Herlihy [27, 28]. In practice, however, implementations based on this approach suffer from low parallelism and high overhead due to excessive copying and reliance on a single global pointer accessed through CAS by all threads. Much better performing ordered index implementations can be achieved through purposefully designed (non-blocking) algorithms, which we discuss next. In particular, we focus on non-blocking skip lists and other high-performance ordered indices that support snapshots and batch updates.

Skip lists were first introduced by Pugh [41]. Valois [50] was the first to sketch a lock-free algorithm for a skip list, although the first complete algorithm was proposed by Sundell and Tsigas [49], as an extension of their prior work on concurrent priority queues [48]. Their implementation relied on the CAS- and FAA (fetch-and-add)-based lock-free memory management scheme originally proposed by Valois [50] and later revised by Michael and Scott [37]. Fraser [24] gave an alternative implementation of a lock-free skip list, which relies on Harris' CAS-based approach for implementing lock-free linked lists [26]. Fomitchev and Ruppert's implementation of a lock-free skip list [23] combines the techniques of Valois and Harris. The ubiquitous
ConcurrentSkipListMap [19], which is part of the standard Java java.util.concurrent library, draws from Fraser's, Fomitchev's and Sundell's work. All algorithms discussed above are linearizable [29] except for range scans. Moreover, unlike Jiffy, they do not support batch updates or snapshots.

LeapList [8] and KiWi [9] are skip list-based indices that provide linearizable range scans (but not fully linearizable snapshots, as Jiffy does). LeapList relies on fine-grained locks and Software Transactional Memory (STM) for concurrency control whereas KiWi features a multiversioned architecture and CAS-based operations to provide lock-freedom (range scans are wait-free). However, not every update operation in KiWi creates a new version: without concurrent range scans, an update operation simply overwrites the old value in the index. Version numbers are managed through an atomic counter, which is bound to become a bottleneck (in Jiffy we rely on the TSC register for this purpose, see below). Each of the base nodes in both LeapList and KiWi holds 𝑘 key-value entries for cache-friendliness, but 𝑘 is fixed (unlike in Jiffy).

Nitro [32] is a skip list-based index used in Couchbase. Nitro uses multiversioning to provide snapshots, but the creation of a new snapshot is not a thread-safe operation (it cannot be executed concurrently with put or remove operations).

Now we discuss tree-based ordered index data structures. SnapTree by Bronson et al. [12] is a lock-based relaxed-balance AVL tree. SnapTree uses a linearizable clone operation for atomic snapshots and range scans, which can severely slow down concurrent update operations (in Jiffy, creating a snapshot, which is also used for a range scan, is an O(1) operation that does not impact concurrent operations in any way). Brown et al. propose k-ary search trees [13, 14], which are a generalization of the lock-free binary search trees by Ellen et al. [22].
Range scans undergo a validation phase for ensuring linearizability and are restarted when a concurrent update is detected (in Jiffy, a range scan may help to complete some concurrent update operations, but is never restarted). CTrie [40] is a lock-free concurrent hash trie based on CAS. Atomic snapshots are provided through a lazy copy-on-write operation, which slows down concurrent update operations. In CTrie no partial snapshots can be obtained. Minuet [46] is a distributed, in-memory B-tree index with linearizable snapshots. To create snapshots, Minuet also relies on a relatively expensive copy-on-write method, but allows snapshots to be shared across multiple range scans.

Sagonas et al. proposed a number of contention-adapting (CA) tree-based data structures with linearizable range scans. The data structures feature a lock-based [43, 44] or a lock-free binary search tree [51] as the main part of the index, where each leaf node is a variable-size container, i.e., an AVL tree, a skip list or an immutable data structure that holds multiple key-value entries (which is similar to a revision in Jiffy). The size of the container is adjusted to the observed contention level (we discuss the differences with our index autoscaling policy in Section 3.3.6). Linearizable range scans are achieved either through locking, optimistic scan and validation, or replacing the leaf data structures using CAS with special objects that can be used by concurrent threads to help with completing the range scan (and to block update operations in the meantime). From all of the data structures we discussed so far, only the lock-based variants of the CA trees support batch update operations.

Besides the works of Sagonas et al. on CA trees, we are aware of several works on data structures that dynamically adapt to changing contention levels, e.g., [5, 17].
Unlike CA trees, none of the proposed algorithms support linearizable range scans or batch updates. Finally, several researchers have investigated general techniques for adding linearizable range scans (but not batch updates) to existing concurrent data structures, e.g., [6, 15, 35, 36, 39].

The concurrency control mechanism implemented in Jiffy shares some similarities with the multiversioned transactional engine in [33], which also relies on structures similar to our batch descriptors and CAS operations to ensure that all updates become visible to concurrent operations atomically. However, unlike this implementation, Jiffy is lock-free and no update operation, including batch updates, ever aborts. Crucially, instead of using a shared atomic counter to generate version numbers, Jiffy relies on the CPU's Time Stamp Counter (TSC) register [31], which greatly helps to reduce contention between concurrent threads on modern 40+ core CPUs. TSC has been used for a similar purpose also in the context of transactional memory [42, 25], a concurrent stack implementation [21], and a serializable (but not linearizable) database engine [34].

3 JIFFY
In this section, we discuss the architecture of Jiffy, the crucial details regarding its implementation, and also argue about its correctness.

Figure 1: The multiversioned architecture of Jiffy. Each node of the lowest-level list of the skip list manages a range of keys, e.g., (−∞, 𝑐), [𝑐, 𝑓), [𝑓, 𝑖), etc. Key-value entries are kept in immutable revisions (triangles), each in a concrete version, with the newest at the top. The skip list grows and shrinks by splitting or merging nodes and through split and merge revisions (colored green and red, respectively).

Jiffy is a multiversioned [10] skip list [41], where each node (an object on the lowest-level linked list of the skip list) manages a continuous range of keys (see Figure 1).
More precisely, each node stores (1) a node key, i.e., a key that represents the lower end of the managed key range (the exclusive upper end is defined by the node key of the successor node), and (2) a reference to the head of a revision list. The revision list consists of revisions, immutable objects that store key-value entries that fit the node's range (we discuss the layout of data in a revision in Section 3.3.5). Each revision is tagged with a version number, which thus serves as the version number for each key-value entry stored in the revision. Unlike in a classic skip list, the first node, called the base node, is not just a sentinel but also manages a range of entries (its key is ⊥, and thus its key range is (−∞, 𝑐) in our example). Update operations, such as put, remove or batch update, use the compare-and-swap (CAS) operation to add a new revision as the head of the revision list (we simply say that a new revision is added to the node) and cut the list short whenever the internal garbage collector indicates that certain revisions will not be needed any more.

In Jiffy, structure modifications, i.e., changes to the index, are more involved compared to a typical lock-free skip list, such as [19], where nodes are added or removed upon inserting new keys or removing the existing ones. In our approach, the index grows by splitting a node into two and shrinks by merging two nodes into one (see details in Section 3.3.1). The index starts with a single base node (with key ⊥). During a split of a node with key 𝑛 (referred to as node 𝑛), a new node 𝑛′, where 𝑛′ > 𝑛, is added directly after node 𝑛 (or node ⊥ if the base node undergoes a split). Node 𝑛′ inherits the upper half of the key range originally assigned to node 𝑛 (the key of node 𝑛 does not change).
On the other hand, during a merge operation of node 𝑛, it is merged with the node directly preceding node 𝑛 in the index (so with a node with a strictly lower key; the base node cannot undergo a merge operation and is never removed). As in a classic skip list, the index nodes (i.e., the nodes on all but the lowest-level linked lists, which facilitate fast traversals of the data structure) are inserted to the higher-level linked lists probabilistically (in our implementation, the probability of inserting index nodes up to a certain level is the same as in [19]). Operations on higher-level linked lists are also performed using CAS.

CAS(val, oldVal, newVal) atomically replaces val with newVal only if val = oldVal. The operation returns a boolean value that indicates if the operation was successful.

Figure 2: Regular update operation: (a) initial state, (b) create a new revision, (c) add the new revision to the node (CAS), (d) garbage collect obsolete revisions.

A node split or a merge can occur only upon some update operation, i.e., put, remove or batch update (which we discuss in detail in Sections 3.3.2-3.3.3). When an update operation of some key 𝑘 is performed and the appropriate node is found (i.e., node 𝑛, where 𝑘 ≥ 𝑛 and there does not exist a node 𝑛′ with 𝑘 ≥ 𝑛′ > 𝑛), an autoscaling policy decides how the update is to be performed (we discuss the details of our autoscaling policy in Section 3.3.6). In the majority of cases, a regular update is performed (see Figure 2). A regular update involves copying the head of the revision list at node 𝑛, applying the update on the copied revision, adding it to the revision list and, if necessary, garbage collecting obsolete revisions, i.e., revisions that will never be read again, including in any snapshot. Otherwise, a node split or a merge is performed.
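The regular-update path of Figure 2 boils down to a copy-then-CAS retry loop on the head of a node's revision list. The sketch below shows this loop for a single node in Java, omitting splits, merges, helping, versioning details and garbage collection; all names are our own, and for brevity the revision is modelled as a persistent chain of entries rather than the paper's sorted arrays.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the copy-then-CAS regular update (Figure 2), restricted
// to one node; NOT the paper's full algorithm.
final class NodeSketch {
    // An immutable revision, modelled as a persistent entry chain.
    static final class Rev {
        final long version;
        final String key, value;
        final Rev prev;
        Rev(long version, String key, String value, Rev prev) {
            this.version = version;
            this.key = key;
            this.value = value;
            this.prev = prev;
        }
        String get(String k) {
            for (Rev r = this; r != null; r = r.prev)
                if (r.key.equals(k)) return r.value;
            return null;
        }
    }

    private final AtomicReference<Rev> head = new AtomicReference<>();

    void put(String key, String value, long version) {
        while (true) {
            Rev h = head.get();                       // (a) read the current head
            Rev r = new Rev(version, key, value, h);  // (b) build a new revision
            if (head.compareAndSet(h, r)) return;     // (c) publish it with CAS
            // CAS failed: a concurrent update won; retry from the start.
        }
    }

    String get(String key) {
        Rev h = head.get();
        return h == null ? null : h.get(key);
    }
}
```

Because revisions are immutable, a failed CAS wastes only the privately built revision; no shared state needs to be undone.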
In case of a node split, the update operation is reflected in one of the two new split revisions (the left split revision, inserted as the head of the revision list on node 𝑛, and the right split revision, as the head of the revision list on the new node). On the other hand, in case of a merge, the new merge revision (on the node directly preceding node 𝑛 in the index) includes the update to 𝑘, as well as the entries for all other keys previously stored within the two nodes. Node splits and merges mean that now revision lists are not just simple linked lists: through split and merge revisions, revision lists branch and join.

Jiffy is a lock-free data structure, which means that it guarantees system-wide progress. To this end, threads occasionally help one another in completing other (update) operations (in case, e.g., some thread is preempted for a long time). Doing so may involve a number of steps, especially in case of batch updates or updates that result in node splits or merges. To ensure orderly execution of all update operations, we define the following rules:
(1) any operation (so also a lookup or a range scan) that encounters a node split or a merge, helps to complete the operation that invoked the split or merge,
(2) an operation can add a new revision 𝑟 to the revision list at some node 𝑛 only if there is no pending operation at node 𝑛 (the thread helps to complete the pending operations before adding 𝑟),
(3) the execution of a batch update, which comprises a set of put and remove operations, starts by updating the highest key in the batch and always continues towards lower keys.

Rule (1) means that our index returns to a stable state (i.e., without ongoing structure changes) as soon as possible, so that subsequent operations (including lookups and range scans) can be performed efficiently.
Rules (2) and (3) enforce a consistent order of performing updates (also across batch updates), thus allowing Jiffy to guarantee linearizability. Rules (2) and (3) also give precedence to operations that happen on nodes with lower keys, thus preventing livelocks (e.g., two threads operating on neighboring nodes, with one constantly attempting to perform a split, the other a merge).

The lock-free nature of Jiffy inevitably means that under some highly unfavorable workloads, helping other threads will have a convoying effect which results in all threads attempting to complete the same updates/splits/merges, thus wasting resources. This, however, is unavoidable if we are to guarantee system-wide progress.
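The helping discipline of rule (1) is an instance of the classic descriptor-based helping pattern: an in-progress operation is announced through a shared descriptor that carries everything needed to finish it, so any thread that encounters the descriptor completes it before proceeding. The generic sketch below (our code, not Jiffy's) shows the pattern on a trivial shared counter.

```java
import java.util.concurrent.atomic.AtomicReference;

// Generic descriptor-based helping sketch (not Jiffy's actual code).
final class HelpingSketch {
    static final class Pending {   // descriptor: everything needed
        final int base, delta;     // to complete the operation
        Pending(int base, int delta) { this.base = base; this.delta = delta; }
    }

    // Holds either an Integer (stable value) or a Pending descriptor.
    private final AtomicReference<Object> slot = new AtomicReference<>(0);

    int add(int delta) {
        while (true) {
            Object cur = slot.get();
            if (cur instanceof Pending) {        // rule (1): help first,
                help((Pending) cur);             // then retry our own step
                continue;
            }
            Pending mine = new Pending((Integer) cur, delta);
            if (slot.compareAndSet(cur, mine)) { // announce the operation
                help(mine);                      // complete it (or be helped)
                return mine.base + mine.delta;
            }
        }
    }

    // Completing is a single CAS; duplicated helping is harmless since
    // at most one thread's CAS can succeed for a given descriptor.
    private void help(Pending p) {
        slot.compareAndSet(p, p.base + p.delta);
    }

    int value() {
        Object cur = slot.get();
        if (cur instanceof Pending) {
            Pending p = (Pending) cur;
            return p.base + p.delta;
        }
        return (Integer) cur;
    }
}
```

In Jiffy, the role of the descriptor is played by revisions with negative (optimistic) version numbers, merge terminators and temporary split nodes; the same "help, then retry" shape applies.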
We already briefly stated that Jiffy is a multiversioned data structure. Now we discuss how version numbers are generated and used. To provide linearizable behavior [29] (intuitively, all operations appear as if they were executed sequentially on a single CPU), threads in a multiversion system typically synchronize on a shared (atomic) counter, which is used to generate version numbers (see, e.g., [9]). This, however, introduces a point of contention that quickly becomes a bottleneck. In Jiffy we avoid such a bottleneck by relying on a high-resolution clock supported by the CPU. More precisely, version numbers are obtained by reading the Time Stamp Counter (TSC), a 64-bit register (available on the x86_64 architecture since 2008), which functions as a CPU-cycle-level resolution wall-clock for the entire multi-CPU machine (see the constant_tsc and nonstop_tsc flags in Linux's /proc/cpuinfo) [18, 31, 42]. TSC is reset to 0 upon machine restart and then advances at a constant rate. Reading the TSC register (e.g., using the
RDTSCP instruction) is an extremely fast operation as it does not involve a system call (in our tests,
RDTSCP takes about 10 ns to complete). Since Jiffy is implemented in Java, we do not access the TSC register directly. Instead, we use the
System.nanoTime() method [20], which on the popular Java Virtual Machines (JVMs) for the x86_64 platform, e.g., [3, 4], internally relies on TSC. By specification,
System.nanoTime() is a thread-safe operation that for all invocations of this method in an instance of the JVM returns a monotonically increasing 8-byte integer. In our pseudocode, we will use the TSC.read() function to retrieve values from the TSC register.

We use the values generated by TSC in the following way. Each update operation (put, remove or batch update) and each revision created by such an operation is associated with two version numbers: in the beginning a temporary one, which we call an optimistic version number, and, eventually, the final version number, which never changes again. An optimistic version number is negative, which signals a concurrent thread that encounters a revision with such a version number about the pending update operation (which the thread might now have to complete). Moreover, there is a special relationship between the optimistic and the final version numbers, which allows us to better handle lookups and range scans that are performed on snapshots.

More precisely, an update operation commences with an optimistic version number 𝑣 = −(𝑡 + 1), where 𝑡 is obtained by reading the TSC register. The name optimistic version number comes from the fact that |𝑣| corresponds to the lowest possible final version number with which the update operation can complete. Hence we define an invariant 𝑣′ ≥ |𝑣|, where 𝑣′ is the final version number assigned to the revision. For correctness of our algorithm, revisions in each revision list must have unique version numbers. Since the values read by a thread from TSC are not guaranteed to be strictly monotonically increasing, we add 1 to 𝑡, and before we assign the final version number to the revision, we ensure that the current value of the TSC register is greater or equal to 𝑣′.

Lookup and range scan operations (see details in Section 3.3.4) use the version numbers stored in revisions to retrieve the correct revision and, from it, the value for the searched key. The read operations can also be performed on a snapshot acquired earlier by the thread. Snapshot creation consists of recording the current value of TSC as the snapshot version and storing it in a special (lock-free) linked list shared between the threads. A snapshot with snapshot version 𝑠 corresponds to the state of the dataset at time 𝑠.

Assume that we have already found the appropriate node and evaluate the revisions in its revision list. The most recent value for some key 𝑘 can be found in the most recently completed revision, i.e., the revision with the greatest positive version number.

Footnotes: (i) Recall that a merge operation on some node 𝑛 involves adding a merge revision to the existing node directly preceding node 𝑛 in the index, which is a much more complex operation than adding a new node in a split operation. (ii) Reading the atomic counter is also necessary to create snapshots of the dataset. The first version of Jiffy that relied on an atomic counter to generate version numbers did not scale past 4-8 threads. (iii) TSC registers across CPU sockets must be synchronized using a synchronous RESET signal, which is commonly the case on modern hardware [2, 18]. (iv) Assume for now that System.nanoTime() always returns a positive value.
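The optimistic/final version scheme described above can be sketched in Java, using System.nanoTime() as the stand-in for TSC.read() as the paper does (and assuming, as the paper does, that it returns positive values). The helper names below are ours.

```java
// Sketch of Jiffy-style version-number generation: an update starts
// with a negative optimistic version v = -(t + 1); its final version
// v' satisfies v' >= |v|, and we wait until the clock reaches v'.
final class Versions {
    // Stand-in for TSC.read(); on common x86_64 JVMs, System.nanoTime()
    // is itself backed by the TSC register. Assumed positive here.
    static long tscRead() { return System.nanoTime(); }

    // A negative value marks the revision's update as still pending.
    static long optimisticVersion() { return -(tscRead() + 1); }

    // Final version: at least |optVer| (invariant v' >= |v|); then spin
    // until the clock catches up, so that any snapshot taken at time
    // s >= v' is guaranteed to observe this update.
    static long finalVersion(long optVer) {
        long finVer = Math.max(tscRead(), -optVer);
        while (tscRead() < finVer) { /* spin; in practice never taken */ }
        return finVer;
    }
}
```

Because no shared counter is touched, two concurrent updates may draw equal raw clock values; the "+1" plus the per-revision-list uniqueness check mentioned above resolve such ties.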
For lookups and range scans performed on a snapshot (with snapshot version 𝑠), on the other hand, when evaluating a revision 𝑟 with version number 𝑣, we do the following:
• if |𝑣| > 𝑠, skip reading 𝑟,
• if 𝑣 > 0 ∧ 𝑣 ≤ 𝑠, and the revision list contains no revision with version number 𝑣′, s.t. 𝑣 < 𝑣′ ≤ 𝑠, then retrieve 𝑟, or
• if 𝑣 < 0 ∧ −𝑣 ≤ 𝑠, help to complete the update operation that created 𝑟, resolve the final version number for 𝑟, and act accordingly.

We start the description of structure modifications in Jiffy with a node split operation. For simplicity we abstract away from the fact that in Jiffy all structure modifications are streamlined with the update operations. Consider the example in Figure 3, in which we show how node 𝑘, which manages a range of keys [𝑘, 𝑟), is split so that a new node 𝑜 (whose range is [𝑜, 𝑟)) is to be inserted between node 𝑘 and node 𝑟. To this end we first create two special revisions, called the left (lsr) and right (rsr) split revisions. Each split revision contains half of the entries from the revision that was the head of the revision list at node 𝑘 in the beginning. We use a CAS operation to add the left split revision (lsr) to the revision list at node 𝑘 (Figure 3b). Next we create a temporary split node, whose next pointer is set to 𝑟 (Figure 3c). We use CAS to swing the next pointer from node 𝑘 to the temporary split node and thus add it into the index (Figure 3d).
Figure 3: Node split operation of node 𝑘: (a) initial state, (b) create split revisions (lsr, rsr), add the left split revision (lsr) to node 𝑘 (CAS), (c) create a temporary split node 𝑜, (d) add the temporary split node 𝑜 to the index (CAS), (e) create node 𝑜 with the right split revision, (f) add node 𝑜 to the index (CAS) and garbage collect the temporary split node.

Figure 4: Node merge operation of node 𝑜: (a) initial state, (b) add a merge terminator (mt) to node 𝑜 (CAS), (c) add a merge revision (mr) to node 𝑘 (CAS), (d) unlink node 𝑜 from the index (CAS), (e) garbage collect node 𝑜 and the merge terminator.

Note that the node has key 𝑜, so, e.g., concurrent lookups searching for keys in range [𝑜, 𝑟) will be able to find it and help to complete the split operation (information necessary to complete the split operation is accessible through split revisions and the temporary split node). Next, we create node 𝑜 with the right split revision as the sole revision on the node's revision list. The next pointer of node 𝑜 is set to node 𝑟 (Figure 3e). Finally, we use CAS to swing the next pointer of node 𝑘 from the temporary split node to node 𝑜, garbage collect the temporary split node and write the final version number to the split revisions (Figure 3f).

Why could we not simply insert node 𝑜 in-between nodes 𝑘 and 𝑟 using a single CAS operation, as in a simple lock-free linked list [26]? It is because the entire split operation involves adding a revision to node 𝑘 and creating node 𝑜. Without a temporary split node 𝑜, an ABA problem is possible. Imagine two threads, A and B. Thread A acquires the reference to node 𝑟, adds a left split revision to the revision list at node 𝑘, and is preempted.
Then, thread B, which tries to add a revision at node 𝑘, observes a pending split operation and adds node 𝑜 with the right split revision. Suppose that subsequently node 𝑜 is merged back to node 𝑘, so the next pointer at node 𝑘 again points to node 𝑟. When thread A continues its execution, it incorrectly adds node 𝑜 in-between nodes 𝑘 and 𝑟, which may corrupt lookup and range scan operations. In our scheme the ABA problem on the temporary split node is still possible, but we can recover from it without corrupting concurrent operations. If thread A observes that some other thread already set the final version number in the left split revision, it means that node 𝑜 must have already been created (and merged into node 𝑘; assigning the final version number is the last operation of a node split). In such case the temporary split node can be safely removed.

Now let us consider the node merge operation. In the example in Figure 4b a merge terminator (mt) is added to the revision list at node 𝑜, thus initiating the merge operation. No other revision can now be added to the revision list at node 𝑜, hence also a split operation cannot be invoked on node 𝑜. In the next step, we invoke the helpMergeTerminator function to find the node directly preceding node 𝑜 and, if necessary, complete all pending operations at this node (in some cases we need to perform the search for the preceding node again). Once we find node 𝑘, we create a merge revision (mr) that encompasses the entries from the merge terminator's successor revision in the revision list as well as the head of the revision list at node 𝑘 (Figure 4c). Note that the merge revision joins the revision lists at node 𝑘 and node 𝑜 (excluding the merge terminator), and so has two successors: left (default, same as in an ordinary revision) and right. Next we use CAS to swing the next pointer at node 𝑘 to node 𝑟, thus unlinking node 𝑜 from the index (Figure 4d).
Finally, we mark node 𝑜 as terminated, which means that now it can be garbage collected together with the merge terminator (Figure 4e).

In our implementation, structure changes to the index are driven by update operations. E.g., a put operation may cause a node split. In such case, one of the split revisions reflects also the put operation that caused the node split in the first place. This way no revisions are created unnecessarily. As we mentioned earlier, completing node splits and merges is performed also by lookup or range scan operations that happen to encounter a not yet completed structure modification operation. The rather complex logic of dealing with various stages of node splits and merges is hidden in the helpSplit, helpTempSplitNode, helpMergeTerminator and findAndHelpMergeRevision functions, which we use in the operations we discuss next.

Consider the pseudocode of the put and remove operations in Algorithm 1. The pseudocode requires small changes to accommodate the batch update operations. We will discuss these changes in the next section, which is devoted to batch updates. A thread that performs put(key, value) first finds the appropriate node (line 4) and acquires a reference to the neighboring node (the succeeding node in the index, line 5). This reference will be required later to ensure that we adequately handle all concurrent node splits and merges. Now we perform a series of checks. In case any condition is satisfied, we always start over by searching for key again. First we check if we found ourselves in a temporary split node. If so, we help with completing the split operation (and start over, line 7). Next, we retrieve the head of the revision list (headRev, line 9) and check if the node has been terminated (through a node merge operation, line 10).
Then, we check the version number of headRev and, if necessary, help to complete the update operation that added headRev to the node and start over (lines 12-14; helpPendingUpdate uses the same logic as put, remove or batch update to complete a pending update operation). Finally, we check that the neighboring node did not change in the meantime (line 15).

Algorithm 1 The put and remove operations in Jiffy

 1: procedure put(key, value)
 2:   var newRev = ⊥
 3:   while true do
 4:     var node = findNodeForKey(key)            // unlinks terminated nodes
 5:     var nextNode = node.next
 6:     if node is TempSplitNode then             // middle of node split
 7:       helpTempSplitNode(node)                 // corresponds to Figure 3e-f
 8:       continue
 9:     var headRev = node.head                   // first revision from the revision list
10:     if node.terminated then                   // ready to unlink
11:       continue
12:     if headRev.version < 0 then               // pending update operation
13:       helpPendingUpdate(headRev)              // complete the operation
14:       continue
15:     if node.next ≠ nextNode then              // a split or merge happened
16:       continue
17:     var optVer = −1 ∗ (TSC.read() + 1)
18:     var updateType = autoscaler.query(headRev, key)
19:     if updateType = REGULAR_UPDATE then
20:       newRev = headRev.cloneAndPut(key, value, optVer)
21:       if CAS(node.head, headRev, newRev) then // successful
22:         break
23:     else                                      // updateType = NODE_SPLIT
24:       var (lRev, rRev) = headRev.putAndSplit(key, value, optVer)
25:       if CAS(node.head, headRev, lRev) then   // successful
26:         helpSplit(lRev)                       // corresponds to Figure 3c-f
27:         newRev = lRev
28:         break
29:   var finVer = max(TSC.read(), −optVer)       // to ensure the invariant
30:   waitUntil(finVer)
31:   finVer = trySetVersion(newRev, finVer)
32:   if newRev is SplitRevision then             // set finVer on both revisions
33:     newRev.sibling.version = finVer
34:   performGC(newRev)                           // cut revision list short if necessary

35: procedure remove(key)
36:   var newRev = ⊥
37:   while true do
38:     ...                                       // same as lines 4-16 in put
39:     if headRev.get(key) = ⊥ then              // nothing to do
40:       return
41:     var optVer = −1 ∗ (TSC.read() + 1)
42:     var updateType = autoscaler.query(headRev, key)
43:     if updateType = REGULAR_UPDATE then
44:       newRev = headRev.cloneAndRemove(key, optVer)
45:       if CAS(node.head, headRev, newRev) then // successful
46:         break
47:     else                                      // updateType = NODE_MERGE
48:       var mTerm = MergeTerminator(headRev, key, optVer)
49:       if CAS(node.head, headRev, mTerm) then  // successful
50:         helpMergeTerminator(mTerm)            // corresponds to Figure 4c-e
51:         newRev = mTerm
52:         break
53:   if newRev is MergeTerminator then
54:     newRev = findAndHelpMergeRevision(newRev) // Figure 4d-e
55:   var finVer = max(TSC.read(), −optVer)       // to ensure the invariant
56:   waitUntil(finVer)
57:   finVer = trySetVersion(newRev, finVer)
58:   performGC(newRev)                           // cut revision list short if necessary

59: function trySetVersion(revision, version)     // set the final version, but
60:   var oldVer = revision.version               // only if not already set
61:   if oldVer > 0 then
62:     return oldVer
63:   if CAS(revision.version, oldVer, version) then // successful
64:     return version                            // linearization point
65:   return revision.version

66: procedure waitUntil(version)                  // wait until TSC advances enough
67:   while TSC.read() < version do               // in practice, false right away
68:     NOP

By reaching line 17 we know that we are in the correct node and thus we can safely try to add a new revision. To this end, we acquire the optimistic version number optVer from TSC and query the autoscaler to determine the type of update we need to perform: a regular update or a node split. In the former case (lines 19-22), we clone headRev and modify the value for key through the cloneAndPut function on headRev. We then try to add the newly created newRev to the revision list (using CAS). If we fail, we start over. On the other hand, if we were successful, we obtain the final version number finVer from TSC (line 29), wait until the current value of the TSC register is greater than or equal to |optVer| (to ensure our invariant, see Section 3.2), and set the final version number finVer on newRev (and on its sibling, if necessary, lines 31-33). Because of TSC's high resolution, in our tests we have never encountered a situation in which the active wait in line 67 was necessary.
Finally we go through the revision list to see if some revisions can be removed (line 34; we explain how the garbage collection mechanism works when we discuss snapshots in Section 3.3.4).

In case of a node split, we create a pair (lRev, rRev) of new split revisions through the putAndSplit function on headRev (line 24). The left (lRev) and right (rRev) split revisions reference each other through the sibling field. We attempt to add lRev to the revision list. Again, if we fail, we start over. On the other hand, if we are successful, we complete the update operation by calling the helpSplit function and, eventually, also assigning finVer to rRev.

There are a few details worth pointing out:
• The findNodeForKey function, which is also used by the remove, batch update, lookup and range scan operations, unlinks, while traversing the index, all terminated nodes (nodes whose merge operation completed), as well as the appropriate index nodes (nodes in the higher levels of the skip list that point to a terminated node).
• If CAS fails in either line 21 or 25, no other thread is aware of our attempt to perform put. Thus, we can safely start over.
• In certain situations we do not have to traverse the entire index again to find the appropriate node, but for brevity we skip this optimization in our pseudocode.
• If put successfully added lRev to the revision list, then put will try to complete the node split operation (other threads might help in this operation as well). By the time put reaches line 29, we can be certain that the node split has finished.
• CAS in line 63 can fail only if some other thread already assigned the final version number to newRev. Assigning the final version number to newRev is the linearization point for the entire put operation.

The remove(key) operation proceeds similarly to put, but with two differences. Firstly, remove returns early if headRev does not contain a value for key (line 39).
Secondly, remove might result in a node merge instead of a node split, as remove decreases the size of the revision at the head of the revision list (lines 47-52). In such a case, remove creates a merge terminator mTerm and tries to add it to the revision list. If CAS was successful, we complete the merge operation, as discussed in Section 3.3.1. If some other thread helped to complete the node merge, we find the merge revision that it created and unlink the terminated node, if necessary. In our pseudocode we always perform the full search for the merge revision and ensure that the node merge is completed (line 54).

So far, for simplicity, we assumed that TSC.read() returns only positive values. System.nanoTime(), which we use in our implementation to retrieve values from TSC, can return negative values. To adhere to our earlier assumption, from every value returned by System.nanoTime() in Jiffy, we subtract the value of System.nanoTime() obtained upon creation of our index.
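The version-number arithmetic described above can be sketched as follows. This is a minimal illustration, not Jiffy's actual code; the class and method names (VersionClock, optimisticVersion, finalVersion) are ours. It shows the nanoTime offset, the negative optimistic version −1 ∗ (read() + 1), and a final version chosen so that the invariant finVer ≥ |optVer| holds.

```java
// Sketch (assumed names): how Jiffy-style optimistic/final versions
// could be derived from a TSC-like clock based on System.nanoTime().
public final class VersionClock {
    // Subtracted from every reading so that read() is always non-negative,
    // mirroring the offset taken at index creation time.
    private final long base = System.nanoTime();

    public long read() { return System.nanoTime() - base; }

    // Optimistic versions are negative, which lets readers recognize
    // revisions whose update is still pending.
    public long optimisticVersion() { return -1 * (read() + 1); }

    // The final version is positive and at least |optVer|, preserving the
    // invariant that the final version is never smaller than the absolute
    // value of the optimistic one.
    public long finalVersion(long optVer) { return Math.max(read(), -optVer); }

    // Active wait until the clock reaches the chosen final version; with a
    // high-resolution clock the loop body is practically never entered.
    public void waitUntil(long version) { while (read() < version) { /* NOP */ } }
}
```

Under these assumptions, any snapshot version acquired after finalVersion and the subsequent waitUntil is at least as large as the final version just assigned, which is exactly what the active wait enforces.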
A batch update comprises a number of put and remove operations that are to be performed atomically. The batchUpdate function in Jiffy relies on the same logic as put and remove, except for a few differences:
(1) All put and remove operations to be executed by batchUpdate are stored within a batch descriptor, which also manages a version field that initially contains the optimistic version number and, eventually, the final version number. Thus, reading the version number of a revision created by a batchUpdate happens indirectly, through the batch descriptor (this is the difference we mentioned in the beginning of Section 3.3.2).
(2) Each revision created by batchUpdate reflects the changes to all keys managed by the node that are included in the batch.
(3) Execution of batchUpdate can result in both node splits and node merges, as determined by the autoscaler.
(4) In order to (help) complete a batchUpdate, a thread must add all necessary revisions to the appropriate nodes (in descending order of keys) and only then try to assign the final version number to the version field in the batch descriptor.
(5) Suppose that batchUpdate happened to find a headRev (in an appropriate node) in which the value for key 𝑘 is not present. If the batch includes the remove(𝑘) operation, we need to clone headRev and add it to the node, unlike in the case of a simple remove(𝑘) operation, where we could return early without modifying the revision list.

The order of updates performed by batchUpdate naturally follows from our design assumption that the node merge operation proceeds towards lower keys. Assume that batchUpdate proceeded in the opposite order (from lower to higher keys). Then it would be possible that batchUpdate adds a revision to some node 𝑛𝑖, and then proceeds to node 𝑛𝑗 that directly follows 𝑛𝑖 in the index and decides to perform a node merge operation on 𝑛𝑗. Consequently, a new (merge) revision would have to be created on 𝑛𝑖, which is suboptimal.

To explain why adding a revision in the situation described in (5) is necessary, suppose otherwise. A concurrent batchUpdate could add a new revision with an update of key 𝑘 at the same node and finish with a lower final version number than the batchUpdate from (5). Such a situation would constitute the lost update (lost remove) anomaly: a lookup on a snapshot that includes both batch updates would incorrectly return a value for 𝑘 instead of ⊥.

Before we discuss how the lookups and range scans are implemented, let us focus on the way snapshots are maintained. Jiffy is implemented in Java, which means that we do not have to manage memory manually. However, we still need to track snapshots acquired by threads to let the JVM's garbage collector reclaim revisions that are no longer useful (will never be read again). To this end, a thread that acquires a snapshot registers in the index by adding a special object to the snapshot list, which is a lock-free linked list. Each object on the snapshot list contains a publicly available snapshot version snapVersion acquired from TSC upon thread registration. A snapshot with snapVersion corresponds to the state of the dataset at time snapVersion. Jiffy's inner garbage collector periodically scans the list to obtain the lowest snapVersion, so it knows which entries can be safely disposed of (removing unnecessary revisions happens upon every update operation, see, e.g., line 34 in put in Algorithm 1). A thread can easily refresh its snapshot by querying the TSC register again and writing the new value into the thread's entry on the list (this operation does not even require a CAS operation, as 8 B values are written atomically on the x86_64 architecture). Note that this operation has to be performed immediately after registering, because Jiffy's inner garbage collector could have already freed some entries that would otherwise be visible to the reader thread.
A reader thread should regularly refresh its snapshot to allow the garbage collector to make progress, and unregister (remove its object from the list) when it will not use snapshots any more. Note that if a thread wants to use several snapshots at the same time, it suffices that the value snapVersion stored in the thread's entry on the snapshot list represents the smallest snapshot version of all the thread's snapshots.

A lookup operation (see Algorithm 2) comes in two variants: get(key), which is used to retrieve the newest entry for some key key (lines 1-2), and get(key, snapVersion), used when operating on a snapshot snapVersion (lines 3-24). In fact, the former function calls the latter with the special value NEWEST_VERSION.

Algorithm 2 The get operations in Jiffy

 1: function get(key)                             // get the most recent value for key
 2:   return get(key, NEWEST_VERSION)

 3: function get(key, snapVersion)                // get the value for key in a snapshot
 4:   while true do
 5:     var node = findNodeForKey(key)            // unlinks terminated nodes
 6:     var nextNode = node.next
 7:     if node is TempSplitNode then             // middle of node split
 8:       helpTempSplitNode(node)                 // corresponds to Figure 3e-f
 9:       continue
10:     var headRev = node.head                   // first revision from the revision list
11:     if headRev is MergeTerminator then
12:       helpMergeTerminator(headRev)            // corresponds to Figure 4c-e
13:       continue
14:     if node.next ≠ nextNode then              // a split or merge happened
15:       continue
16:     break
17:   var revision = ⊥
18:   if snapVersion = NEWEST_VERSION then
19:     revision = getNewestRevision(headRev, key)
20:   else
21:     revision = getRevision(headRev, key, snapVersion)
22:   if revision = ⊥ then
23:     return ⊥
24:   return revision.get(key)

25: function getNewestRevision(headRev, key)
26:   var revision = headRev
27:   while revision ≠ ⊥ do                       // iterate over the revision list
28:     if revision.version > 0 then              // first from a completed update
29:       break
30:     if revision is MergeRevision ∧ key ≥ revision.rightKey then
31:       revision = revision.rightNext           // choose the right successor
32:     else
33:       revision = revision.next                // choose the only (or left) successor
34:   return revision

35: function getRevision(headRev, key, snapVersion)
36:   var revision = headRev
37:   while revision ≠ ⊥ do                       // iterate over the revision list
38:     var version = revision.version
39:     if version > 0 ∧ version ≤ snapVersion then
40:       break
41:     if version < 0 ∧ −version ≤ snapVersion then
42:       helpPendingUpdate(revision)             // complete the operation
43:       version = revision.version
44:       if revision is MergeTerminator then
45:         revision = findMergeRevision(revision)
46:       if version ≤ snapVersion then
47:         break
48:     if revision is MergeRevision ∧ key ≥ revision.rightKey then
49:       revision = revision.rightNext           // choose the right successor
50:     else
51:       revision = revision.next                // choose the only (or left) successor
52:   return revision

The get(key, snapVersion) function starts similarly to put and remove by finding the appropriate node for key (lines 4-16). However, unlike those functions, it helps only in completing pending structure modifications (which are rare), not regular updates. Then, depending on the value of snapVersion, either the getNewestRevision or the getRevision function is invoked (lines 19 and 21, respectively). Finally, the value for key is retrieved from the revision, unless the revision is ⊥, in which case get returns ⊥ as well (lines 22-24).

The getNewestRevision function (lines 25-34) simply iterates over the revision list and returns the first revision with a positive version number (line 28).
Since revisions in the revision list are kept in descending order of the absolute values of their version numbers, the function returns a revision from the most recently completed update operation at this node. Note that when we reach a merge revision which the function does not return (line 30), we need to decide whether to proceed to the left or to the right successor of the merge revision. To this end, we compare key with the rightKey field of the merge revision, which stores the key of the node that underwent the merge operation that resulted in the merge revision.

The getRevision function (lines 35-52) performs more complex logic, which corresponds to the rules we already discussed in Section 3.2. Note that when getRevision encounters a merge terminator and helps to complete the merge operation, it needs to find the corresponding merge revision (on a node that precedes, in the index, the node with the merge terminator, line 45).

Range scans (which always operate on some snapshot) rely on the same logic as getRevision, except for one difference. Recall that in getRevision(key, snapVersion) in some cases we use key to decide whether to proceed to the left or the right successor of a merge revision. A range scan, however, intends to retrieve all key-value entries from the appropriate revision. Hence, if a range scan encounters a merge revision when evaluating a revision list at some node, it retrieves a bulk revision that is constructed by recursively traversing all successors of all the encountered merge revisions. In practice, bulk revisions are created extremely rarely. In our tests (see Section 4), revision lists contain at most 3-4 revisions at a time, and usually only 2. Moreover, node merges are rare, so there are few merge revisions that would necessitate creating bulk revisions. (Note that for simplicity we abstract away from the fact that version numbers of revisions created by batchUpdate operations have to be accessed indirectly, through the batch descriptor.)
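The reader-side selection rules above can be condensed into a small sketch. This is our own illustration with assumed names (Rev, newest, forSnapshot); it omits helping, merge terminators and the merge-revision branching on rightKey, and shows only how version numbers select a revision from the list.

```java
// Sketch (assumed names): selecting a revision from a revision list ordered
// by descending absolute version; negative versions mark pending updates.
final class Rev {
    final long version; // < 0: update pending; > 0: final version
    final Rev next;
    Rev(long version, Rev next) { this.version = version; this.next = next; }

    // getNewestRevision-style: first revision from a completed update.
    static Rev newest(Rev head) {
        for (Rev r = head; r != null; r = r.next)
            if (r.version > 0) return r;
        return null;
    }

    // getRevision-style for a snapshot: first revision with a final version
    // <= snapVersion. A pending revision with |version| <= snapVersion would
    // first be helped to completion; that step is omitted here.
    static Rev forSnapshot(Rev head, long snapVersion) {
        for (Rev r = head; r != null; r = r.next)
            if (r.version > 0 && r.version <= snapVersion) return r;
        return null;
    }
}
```

For instance, with a pending revision (version −25) at the head followed by a final revision (version 10), newest skips the pending one, and forSnapshot with snapshot 5 finds nothing, since the only final version exceeds the snapshot.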
So far we treated a revision as an immutable object that holds a range of key-value entries in a concrete version. Now we discuss how revisions are implemented.

A revision holds two arrays: keys and values. Data in both arrays is sorted according to the keys. This way we can perform lookup operations in a cache-friendly manner. Transforming one revision into a new one, as required by the update operations, involves copying the arrays and updating/removing the appropriate keys/values. Since all keys and values are kept in a contiguous range of memory, such copy operations are fast.

Our tests have shown that threads spend a significant amount of time performing binary search in revisions. Thus, we added a lightweight hash index for fast key lookup. More precisely, in each revision we maintain two additional arrays. The first array, indices, contains 2 B values and is twice the length of the keys array. Upon creation of a revision, the indices array is populated so that for each 𝑘 = keys[𝑖], 𝑖 is written to either indices[2𝑡] or indices[2𝑡 + 1], where 𝑡 = ℎ(𝑘) mod length(keys), for some hash function ℎ. A lookup operation for some key 𝑘 calculates ℎ(𝑘) and 𝑡, and then checks if 𝑘 is stored in keys[indices[2𝑡]]. If not, it looks for 𝑘 again in keys[indices[2𝑡 + 1]]. If either of these checks succeeded, we can return the value for 𝑘 by returning either values[indices[2𝑡]] or values[indices[2𝑡 + 1]]. On the other hand, if either indices[2𝑡] or indices[2𝑡 + 1] was empty, 𝑘 is not present in the revision. Finally, if 𝑘 was not found under either index (because at least two other keys in the revision had the same hash value), a binary search is performed on keys. To speed up populating the indices array, in the second array, hashes, we store 2 B hashes of keys calculated using the ℎ function.
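The two-slot lookup scheme just described can be sketched as follows. This is an illustration under assumptions, not Jiffy's code: the names are ours, we use an int[] instead of 2 B entries, and we omit the hashes array used to speed up index construction.

```java
// Sketch (assumed names) of a revision's two-slot hash index over sorted
// keys/values arrays. indices has 2 * keys.length slots (-1 = empty); key k
// is registered in slot 2t or 2t+1, where t = h(k) mod keys.length.
final class RevisionIndex<K extends Comparable<K>, V> {
    final K[] keys;      // sorted in ascending order
    final V[] values;
    final int[] indices;

    RevisionIndex(K[] keys, V[] values) {
        this.keys = keys;
        this.values = values;
        this.indices = new int[2 * keys.length];
        java.util.Arrays.fill(indices, -1);
        for (int i = 0; i < keys.length; i++) {
            int t = hash(keys[i]) % keys.length;
            if (indices[2 * t] < 0) indices[2 * t] = i;              // first slot
            else if (indices[2 * t + 1] < 0) indices[2 * t + 1] = i; // second slot
            // otherwise two keys already occupy this bucket; binary search
            // will still find this key on lookup
        }
    }

    static int hash(Object k) { return k.hashCode() & 0x7fffffff; }

    V get(K key) {
        int t = hash(key) % keys.length;
        for (int slot = 2 * t; slot <= 2 * t + 1; slot++) {
            int i = indices[slot];
            if (i < 0) return null;                   // empty slot: key absent
            if (keys[i].compareTo(key) == 0) return values[i];
        }
        // both slots taken by colliding keys: fall back to binary search
        int i = java.util.Arrays.binarySearch(keys, key);
        return i >= 0 ? values[i] : null;
    }
}
```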
Upon creation of a new revision, the hashes array can be efficiently copied, similarly to the keys and values arrays. Since hashes are 2 B, a revision can contain up to 65K key-value entries, which is more than enough (in the tests from Section 4 each revision stores 25-300 entries).

Determining the optimal size of a revision is problematic, because smaller revisions are better for updates, as less copying is needed, whereas larger revisions better suit reads, i.e., lookups and range scans, as the index is smaller and range scans can efficiently read large, sorted arrays of entries. Our experiments showed that the sizes of revisions should be between 25 and 300 entries, depending on the workload. Moreover, we noticed that adding the lightweight hash indices to revisions not only improved the overall performance, but also reduced the relative performance differences when we tested Jiffy with different predefined revision sizes. However, the size of revisions still impacts the relative performance of updates and reads. Hence, we need some way of monitoring the workload to automatically control the sizes of revisions.

We cannot simply monitor the number of updates (or reads) in a unit of time and adjust the sizes of revisions accordingly, because of a positive feedback loop: in a read-dominated workload revisions are larger, which negatively impacts the execution of updates. Hence, fewer updates are executed. In turn, the ratio of reads to updates increases, which leads to a further increase of the revision sizes. An analogous case can be made for write-dominated workloads.

Our autoscaling policy works as follows. Each revision maintains two exponential moving averages, pReads and pUpdates, that roughly correspond to the amount of time spent by threads performing reads and updates in the revisions of any given node. To this end, instead of using a constant, we weight both moving averages using the time that passed since the thread last performed any read or update, respectively.
We use the ratio of these values and a simple linear function to calculate the suitable revision size from the range [25, 300], with smaller revisions when the majority of operations are updates.

More precisely, when a thread adds a new revision 𝑟, its pUpdates = 𝑡 + (1 − 𝑡) ∗ 𝑢 and pReads = (1 − 𝑡) ∗ 𝑝, where 𝑢 and 𝑝 are the values of pUpdates and pReads, respectively, from 𝑟's successor in the revision list, and the weight 0 < 𝑡 ≤ 1 corresponds to the time that passed since the thread last performed an update (the relevant timestamps are obtained from TSC). In a batch update, the weight 𝑡 is divided between all created revisions. Upon a read, a thread modifies the moving averages in the first revision on the revision list in a similar way (pUpdates = (1 − 𝑡) ∗ 𝑢 and pReads = 𝑡 + (1 − 𝑡) ∗ 𝑝), but 𝑡 corresponds to the time that passed since the last read performed by the thread. Updating the moving averages by concurrent threads results in a race condition, which is harmless, as we are just gathering statistics. To reduce the load on the reader threads as well as the chances of a race condition happening, reader threads update the moving averages only every 100 read operations (with 𝑡 corresponding to the time it took the thread to perform 100 reads). Range scans update the moving averages only once per revision despite reading many entries, because retrieving the revision from the index requires much more effort compared to reading entries from a revision.

(Jiffy is a generic Java data structure, which means that arrays in the revisions store references to key/value objects, and not the keys/values themselves. Hence, the size of a revision does not depend on the types of keys/values, as could be the case if Jiffy were implemented in, e.g., C++.)

Our autoscaling policy is completely different from the one from [43, 44, 51], which relies on observing contention on shared references to containers (revisions). The CAS operations on these references are performed by updates and range scans. Interestingly, with a single thread, the mentioned approach leads to ever-increasing revision sizes, which is problematic for updates.
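The weighting and the size mapping described above can be sketched as follows. This is our own illustration with assumed names and an assumed linear mapping; Jiffy's actual constants and the exact normalization of the weight 𝑡 may differ.

```java
// Sketch (assumed names and constants) of the autoscaler statistics:
// time-weighted moving averages of update/read activity, mapped linearly
// to a target revision size in [25, 300].
final class Autoscaler {
    static final int MIN_SIZE = 25, MAX_SIZE = 300;

    // On an update that creates a revision: pUpdates = t + (1 - t) * u,
    // pReads = (1 - t) * p, where u and p come from the successor revision
    // and 0 < t <= 1 is derived from the time since the thread's last update.
    static double[] onUpdate(double t, double u, double p) {
        return new double[] { t + (1 - t) * u, (1 - t) * p };
    }

    // Upon a read, the roles of the two averages are symmetric.
    static double[] onRead(double t, double u, double p) {
        return new double[] { (1 - t) * u, t + (1 - t) * p };
    }

    // Map the share of read activity to a revision size: update-heavy nodes
    // get small revisions, read-heavy nodes get large ones.
    static int targetSize(double pUpdates, double pReads) {
        double total = pUpdates + pReads;
        if (total == 0) return MAX_SIZE; // no statistics gathered yet
        double readShare = pReads / total;
        return (int) Math.round(MIN_SIZE + readShare * (MAX_SIZE - MIN_SIZE));
    }
}
```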
Now we argue that Jiffy ensures linearizability [29]. For simplicity, we abstract from node splits and merges. It is easy to see that put, remove, and batchUpdate operations (on the same keys) are serialized, because (1) no revision can be added to the revision list if there are pending operations at this node, and (2) when encountering a pending operation at some node, a thread helps to complete that operation before proceeding with its own update. The final version numbers of revisions in the revision list of each node monotonically decrease when iterating from the head of the revision list (recall that an optimistic version number equals −𝑣, where 𝑣 = TSC.read() + 1, and the final version number 𝑣′, which is also acquired from TSC, is such that 𝑣′ ≥ 𝑣). All batchUpdate operations update keys in the descending order of keys, thus ensuring that no two batch updates with intersecting key sets update the revision lists of two nodes in a different order. The entries created by every update operation can be read by other threads once the final version number is established. The final (positive) version number is written to the version field of the entry or to the batch descriptor, using an atomic operation (CAS). Entries created by the same batchUpdate operation appear as added atomically because all entries share the same batch descriptor. The assignment of the final version number is the linearization point for updates.

Now we discuss the get operations.
Observe the following:
(1) Entries (within revisions) for any key 𝑘 are arranged in the revision list of the node responsible for a key range that includes 𝑘 according to their (final) version numbers, in descending order (as we argued above), and the get operation always evaluates the entries in that order.
(2) For each key 𝑘, at any given moment there can be only a single pending update operation that modifies 𝑘 (a revision without the final version number established), and it precedes in the revision list all other revisions that might include 𝑘.

According to the linearizable semantics, the get(𝑘) operation must return the newest value written for the key 𝑘, and it may or may not observe the effects of the concurrent operations (operations that have not completed before get(𝑘) started). The inclusion or exclusion of a concurrent update depends on whether its linearization point lies before or after the one for the get operation. Hence, get(𝑘) can safely skip reading an entry in a revision whose final version number is not yet determined, and thus return the value from the entry (for key 𝑘) from the first revision whose final version number is positive.

Now consider the get(𝑘, 𝑠) operation. This time, the linearization point of the snapshot creation or update determines which value should be returned by the get operation. The value 𝑠 is obtained from the TSC register upon registering or updating the snapshot. Entries written by update operations that finished prior to the acquisition of 𝑠 have a final version number 𝑣 ≤ 𝑠 (recall that the update operations wait until the TSC register indicates 𝑡 ≥ 𝑣). On the other hand, entries written by operations executed concurrently with the snapshot creation/update may (but not necessarily must) have final version numbers 𝑣′ > 𝑠. We choose the linearization point for the snapshot creation/update so that it precedes all such concurrent operations.

The get(𝑘, 𝑠) operation chooses the entry for key 𝑘 from the revision with the greatest final version number 𝑣 ≤ 𝑠. Recall our observation (1). For a revision 𝑟 with version number 𝑣 such that |𝑣| > 𝑠, we can skip reading 𝑟, because if 𝑣 < 0, due to our invariant (see Section 3.2), the final version number for this revision will be at least |𝑣|, so also greater than 𝑠. If 𝑣 < 0 ∧ −𝑣 ≤ 𝑠, get(𝑘, 𝑠) helps to complete the update operation, which means that get(𝑘, 𝑠) will be able to determine the final version number for this revision. For the first revision with a final (positive) version number 𝑣 ≤ 𝑠, get(𝑘, 𝑠) extracts the value for key 𝑘 and returns it.

We implemented Jiffy in Java and experimentally compared it with SnapTree [12], k-ary tree [13, 14], CA-imm (lock-based contention-adapting tree with immutable containers) [43], CA-AVL and CA-SL (lock-based CA trees with mutable containers based on AVL trees and skip lists, respectively) [44] and LFCA tree (lock-free CA tree with immutable containers) [51] (see also Section 2). All of these ordered indices feature linearizable range scans. CA-AVL and CA-SL also support linearizable batch updates. For reference, we also include the ubiquitous
ConcurrentSkipListMap (Java CSLM) [19], which supports neither consistent range scans nor atomic batch updates. In some tests we also include KiWi [9], whose available codebase [1] supports only 4 B integer keys. (Hence, comparing KiWi's performance with the performance of the other indices in our tests is difficult: all other indices are generic, so they work with keys and values of different types and store them as Java objects, not as values of primitive types.)

We conducted our tests on a server equipped with two Intel Xeon Gold 6252N CPUs, 192 GB of DRAM, running OpenSUSE Tumbleweed (version 20200815) with kernel 5.8. Each CPU has 24 cores (48 hyperthreads), is clocked at 2.3 GHz and features 36 MB of L3 cache. We ran our tests on OpenJDK 14.0.2.

To show that our novel system can achieve good multithreaded performance despite providing rich semantics, we use a custom microbenchmark to assess how Jiffy (and its competitors) perform under multithreaded workloads with varied levels of contention. Each microbenchmark thread issues only one type of operation, i.e., either updates (put/remove/batch update operations), lookups (get operations) or range scans, so that certain operations, such as long-running scans or batch updates, do not stifle the execution of operations of other types. We vary the percentage of threads that perform each kind of operation to uncover the characteristics of all tested indices.
In total we consider four test scenarios: an update-only scenario, an update-lookup scenario (25% of threads do updates, 75% of threads do lookups) and two mixed scenarios (25% of threads do updates, 50% do lookups and 25% do range scans; the range scans are either short or long, i.e., they cover 100 or 10000 subsequent key-value entries, starting from a randomly chosen key).

To assess the performance of batch updates in Jiffy, we test it in five variants. In the default variant, Jiffy performs all updates as single put or remove operations. The other variants correspond to results obtained when Jiffy executes all update operations in small, 10-operation batch updates or large, 100-operation batches. To demonstrate the performance of batch updates in the extreme cases, they are either sequential (update consecutive key-value entries) or random (update randomly chosen key-value entries). In a similar way we test CA-AVL and CA-SL, which also support batch updates.

The dataset has an average size of 10M entries (20M unique keys). Jiffy is multiversioned, so it typically maintains more entries at any given moment. The key/value sizes are set to 16/100 B and 4/4 B (typical for such tests, see, e.g., [7, 16]). We examine the systems when keys are randomly chosen with a uniform and a Zipfian distribution (the distribution skew is 0.99, the same as in the YCSB benchmark with its default settings [16]).

The results are reported in (millions of) basic operations per second, i.e., put, remove or get operations on a single key (a scan over 10 key-value entries counts as 10 get operations).

We start by discussing the results of tests in which the key/value sizes were set to 16/100 B and keys were chosen with the uniform distribution (see Figure 5). In all tested scenarios, Jiffy exhibits scalable behaviour.
Single put/remove operations in Jiffy are slightly more expensive than in some other systems, e.g., SnapTree, CA-imm, LFCA tree, CA-AVL (in the write-only scenario by about 30% in the worst case at 64 threads and by 15% at 96 threads, see the top row plot in Figure 5a). The increased cost of updates comes from the multiversioned architecture of Jiffy. Each update that adds a revision to some node requires at least two CAS operations: one to add the revision to the revision list at the node and one to set the final version number on the revision. In other lock-free indices only one CAS is necessary: when the update is performed in place (e.g., Java CSLM) or to replace an old key-value entry container with a new one (e.g., LFCA tree). Note that in Jiffy there is also an additional overhead resulting from managing the lightweight hash indices inside revisions. As the hash indices boost the performance of lookups, the performance differences between Jiffy and the mentioned systems are smaller when lookups are introduced to the workload (see Figure 5b). Our autoscaling policy set the revision sizes to around 35 entries in the write-only scenario vs 130 entries in the update-lookup scenarios. The revision size adjustment took about 10 seconds (and about a second on a 1M-entry dataset).

Jiffy executes range scans much more efficiently than its competitors (see Figure 5c-d). LFCA tree and CA-imm are about 10% slower than Jiffy, whereas the only two other indices that, similarly to Jiffy, support batch updates, i.e., CA-AVL and CA-SL, at best achieve only half of the total throughput of Jiffy.
Interestingly, range scans are especially problematic in lock-based SnapTree, which performed best in the first two scenarios (see also Section 2 for a discussion of range scans in SnapTree).

Figure 5: Throughput scalability results (16 B keys, 100 B values, keys chosen with uniform distribution). Columns: (a) 100% threads: put/remove; (b) 25% threads: put/remove, 75% threads: get; (c) 25% threads: put/remove, 50% threads: get, 25% threads: scan, short scans (100 ops); (d) the same mix, long scans (10000 ops). Rows: simple put/remove, 10-op. batch updates, 100-op. batch updates.

Now let us consider the performance of batch updates in Jiffy. When batch updates are small (each includes 10 put/remove operations; see the plots in the middle row of Figure 5), batch updates in Jiffy are slightly slower than CA-AVL's and CA-SL's, for the same reasons we discussed earlier when explaining the performance of put/remove operations in Jiffy. Notice that with random batch updates the performance of lock-based CA-AVL and CA-SL starts to diminish towards the higher numbers of concurrent threads, whereas Jiffy continues to scale thanks to its lock-free architecture. The differences between the lock-based and the lock-free approach become apparent when all updates are executed as large batch updates (each includes 100 put/remove operations; see the plots in the bottom row of Figure 5). When batch updates are sequential, the performance of Jiffy is about 15% better than the performance of either CA-AVL or CA-SL (in the write-only scenario). However, with random batch updates, Jiffy's maximal throughput is 4.9× and 6.1× the maximal throughput of CA-AVL and CA-SL, respectively.
Notice the somewhat surprising way in which small batch updates impact the performance of Jiffy in the mixed scenario with small range scans (Figure 5c, middle row). Using random batch updates results in slightly better overall performance than sequential batch updates, even though the latter are on average much cheaper to execute (each sequential batch update creates on average 1-2 revisions vs 𝑛 revisions for a random batch update with 𝑛 put/remove operations).
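The sequential-vs-random cost gap can be made concrete with a toy revision count. Assume, as a simplification (real node boundaries in Jiffy shift with splits and merges), a fixed partitioning of the key space into nodes of consecutive keys; a batch then creates roughly one new revision per distinct node it touches:

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: counts how many revisions a batch update would create
// under a simplified, fixed partitioning of the key space into nodes of
// NODE_SPAN consecutive keys (our assumption, not Jiffy's real layout).
final class BatchRevisionCount {
    static final int NODE_SPAN = 32; // assumed number of entries per node

    static int revisionsFor(int[] keys) {
        Set<Integer> nodes = new TreeSet<>();
        for (int k : keys) {
            nodes.add(k / NODE_SPAN); // the node that owns key k
        }
        return nodes.size();          // one new revision per touched node
    }
}
```

A batch of ten consecutive keys lands in one or two nodes (1-2 revisions), whereas ten keys drawn uniformly from a large key space almost always land in ten distinct nodes (close to 𝑛 revisions).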
This phenomenon can be explained by examining the throughput of update operations (see Figure 7 in the Appendix for the plot with update-only throughput): with small sequential batch updates, Jiffy executes four times as many updates as in the same test with random batch updates. In turn, in the former test, Jiffy has to manage many more revisions, which translates into slightly worse performance of lookups and scans.
Let us now consider similar tests but conducted with 4 B key/value sizes (see Figure 6). In the first two scenarios KiWi (whose implementation is optimized for 4 B integer keys and does not support other key/value types) beats the other indices, but not by much (in the write-only scenario the second-best performing SnapTree is 10% slower than KiWi, whereas Jiffy is 20% slower than KiWi). Overall, the relative differences between the performance of the tested indices stay largely the same, except for two small differences. Firstly, with smaller key/value sizes, the performance of lock-based CA-AVL and CA-SL starts to diminish earlier (with a smaller number of concurrent threads). Secondly, we can observe a much more substantial advantage of Jiffy in workloads that feature range scans. In the mixed scenario with long range scans, Jiffy beats the second-best performing indices CA-imm and LFCA tree by […]× to 1.5× when batch updates are sequential and from 4.9×/6.1× to 5.7×/7.4× when batch updates are random (for CA-AVL and CA-SL, respectively).

Figure 6: Throughput scalability results (4 B keys, 4 B values, keys chosen with uniform distribution).

We conducted similar tests but with keys chosen with a Zipfian distribution (see Figure 8 and Figure 10 in the Appendix). The performance differences between the tested indices were largely the same as in the results presented above, although KiWi was no longer the best performing index in the write-only scenario (in the update-lookup scenario KiWi's performance was matched by SnapTree, LFCA tree, and CA-imm, and was a few percent better than Jiffy's). The biggest difference in performance could be observed when updates were executed as random batch updates. A skewed workload results in much higher contention levels, which are further amplified when put/remove operations are performed as batch updates, each of which creates many new revisions (containers in CA-AVL and CA-SL). Such workloads were almost equally bad for Jiffy and its lock-based competitors. In the write-only scenario, the observed throughput for Jiffy, CA-AVL and CA-SL was about 1.5-2 Mops/s for small random batch updates and 0.3-0.5 Mops/s for large random batch updates.

In this paper, we presented Jiffy, the first lock-free, linearizable ordered key-value index with batch updates and snapshots. Despite its rich functionality, Jiffy offers scalable performance across various workloads, often exceeding the performance of state-of-the-art indices with less flexible semantics. Crucially, the novel lock-free, multiversioned algorithm that powers Jiffy allows it to execute batch updates more efficiently than its (lock-based) rivals, with speedups in throughput ranging from 1.1× to 7.4×, depending on the test scenario. Jiffy's codebase will soon be available in our GitHub repository.

ACKNOWLEDGMENTS
This work was supported by the Foundation for Polish Science, within the TEAM programme co-financed by the European Union under the European Regional Development Fund (grant No. POIR.04.04.00-00-5C5B/17-00). We thank Intel Poland for providing us with hardware resources.
APPENDIX
In Figure 7, Figure 8, Figure 9, and Figure 10 we present the additional results of our scalability tests. Besides the total throughput, we include plots for the throughput of update operations, so the data can be interpreted more easily.

Figure 7: Throughput scalability results (16 B keys, 100 B values, keys chosen with uniform distribution).

Figure 8: Throughput scalability results (16 B keys, 100 B values, keys chosen with Zipfian distribution).

Figure 9: Throughput scalability results (4 B keys, 4 B values, keys chosen with uniform distribution).

Figure 10: Throughput scalability results (4 B keys, 4 B values, keys chosen with Zipfian distribution).