Dash: Scalable Hashing on Persistent Memory
Baotong Lu* Xiangpeng Hao Tianzheng Wang Eric Lo
The Chinese University of Hong Kong    Simon Fraser University
{btlu, ericlo}@cse.cuhk.edu.hk    {xha62, tzwang}@sfu.ca
ABSTRACT
Byte-addressable persistent memory (PM) brings hash tables the potential of low latency, cheap persistence and instant recovery. The recent advent of Intel Optane DC Persistent Memory Modules (DCPMM) further accelerates this trend. Many new hash table designs have been proposed, but most of them were based on emulation and perform sub-optimally on real PM. They were also piece-wise and partial solutions that side-step many important properties, in particular good scalability, high load factor and instant recovery. We present Dash, a holistic approach to building dynamic and scalable hash tables on real PM hardware with all the aforementioned properties. Based on Dash, we adapted two popular dynamic hashing schemes (extendible hashing and linear hashing). On a 24-core machine with Intel Optane DCPMM, we show that compared to state-of-the-art, Dash-enabled hash tables can achieve up to ∼3.9× higher performance with up to over 90% load factor and an instant recovery time of 57ms regardless of data size.
PVLDB Reference Format:
Baotong Lu, Xiangpeng Hao, Tianzheng Wang, Eric Lo. Dash: Scalable Hashing on Persistent Memory.
PVLDB, 13(8): 1147-1161, 2020. DOI: https://doi.org/10.14778/3389133.3389134
1. INTRODUCTION
Dynamic hash tables that can grow and shrink as needed at run-time are a fundamental building block of many data-intensive sys-tems, such as database systems [11, 15, 26, 34, 41, 46] and key-valuestores [5, 13, 16, 24, 37, 48, 62]. Persistent memory (PM) such as3D XPoint [9] and memristor [53] promises byte-addressability,persistence, high capacity, low cost and high performance, all on thememory bus. These features make PM very attractive for buildingdynamic hash tables that persist and operate directly on PM, withhigh performance and instant recovery. The recent release of IntelOptane DC Persistent Memory Module (DCPMM) brings this visioncloser to reality. Since PM exhibits several distinct properties (e.g.,asymmetric read/write speeds and higher latency); blindly applyingprior disk or DRAM based approaches [12, 29, 36] would not reapits full benefits, necessitating a departure from conventional designs. ∗ Work partially performed while at Simon Fraser University.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copyof this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. Forany use beyond those covered by this license, obtain permission by [email protected]. Copyright is held by the owner/author(s). Publication rightslicensed to the VLDB Endowment.
Proceedings of the VLDB Endowment,
Vol. 13, No. 8, ISSN 2150-8097. DOI: https://doi.org/10.14778/3389133.3389134
Figure 1:
Throughput of state-of-the-art PM hashing (CCEH [38]and Level Hashing [68]) for insert (left) and search (right) opera-tions on Optane DCPMM. Neither matches the expected scalability.
There has been a new breed of hash tables specifically designed for PM [10, 38, 39, 51, 67, 68] based on DRAM emulation, before actual PM was available. Their main focus is to reduce cacheline flushes and PM writes for scalable performance. But when they are deployed on real Optane DCPMM, we find that (1) scalability is still a major issue, and (2) desirable properties are often traded off. Figure 1 shows the throughput of two state-of-the-art PM hash tables [38, 68] under insert (left) and search (right) operations, on a 24-core server with Optane DCPMM running workloads with a uniform key distribution (details in Section 6). As the core count increases, neither scheme scales for inserts, nor even for read-only search operations. Corroborating recent work [31, 56], we find the main culprit is Optane DCPMM's limited bandwidth, which is several times lower than DRAM's [22]. Although the server is fully populated to provide the maximum possible bandwidth, excessive PM accesses can still easily saturate the system and prevent it from scaling. We describe two major sources of excessive PM accesses that were not given enough attention before, followed by a discussion of important but missing functionality in prior work.
Excessive PM Reads.
Much prior work focused on reducing writes to PM; however, we note that it is also imperative to reduce PM reads, yet many existing solutions reduce PM writes by incurring more PM reads. In contrast to the device-level behavior (PM reads being faster than writes), end-to-end write latency (i.e., over the entire data path including CPU caches and write buffers in the memory controller) is often lower than read latency [56, 63]. The reason is that while PM writes can leverage write buffers, PM reads mostly touch the PM media, due to a hash table's inherently random access patterns. In particular, existence checks during record probing constitute a large proportion of such PM reads: to find out whether a key exists, one or multiple buckets (e.g., with linear probing) have to be searched, incurring many cache misses and PM reads when comparing keys.
Heavyweight Concurrency Control.
Most prior work side-stepped the impact of concurrency control. Bucket-level locking has been widely used [38, 68], but it incurs additional PM writes to acquire/release read locks, further pushing bandwidth consumption towards the limit. Lock-free designs [51] can avoid PM writes for read-only probing operations, but are notoriously hard to get right, more so on PM where safe persistence is also required [59]. Neither record probing nor concurrency control typically prevents a well-designed hash table from scaling on DRAM. On PM, however, they can easily exhaust PM's limited bandwidth. These issues call for new designs that reduce unnecessary PM reads during probing, and for lightweight concurrency control that further reduces PM writes.
Missing Functionality.
We observe in prior designs, necessaryfunctionality is often traded off for performance (though scalabilityis still an issue on real PM). (1) Indexes could occupy more than50% of memory capacity [66], so it is critical to improve loadfactor (records stored vs. hash table capacity). Yet high load factoris often sacrificed by organizing buckets using larger segments inexchange for smaller directories (fewer cache misses) [38]. As wedescribe later, this in turn can trigger more pre-mature splits andincur even more PM accesses, impacting performance. (2) Variable-length keys are widely used in reality, but prior approaches rarelydiscuss how to efficiently support it. (3) Instant recovery is a unique,desirable feature that could be provided by PM, but is often omittedin prior work which requires a linear scan of the metadata whosesize scales with data size. (4) Prior designs also often side-step thePM programming issues (e.g., PM allocation), which impact theproposed solution’s scalability and adoption in reality.
We present
Dash, a holistic approach to dynamic and scalable hashing on real PM without trading off desirable properties. Dash uses a combination of new and existing techniques that are carefully engineered to achieve this goal. We adopt fingerprinting [43], which was used in PM tree structures, to avoid unnecessary PM reads during record probing. The idea is to generate fingerprints (one-byte hashes) of keys and place them compactly to summarize the possible existence of keys. This allows a thread to tell whether a key possibly exists by scanning the fingerprints, which are much smaller than the actual keys. Instead of traditional bucket-level locking, Dash uses an optimistic, lightweight flavor of it that relies on verification to detect conflicts, rather than (expensive) shared locks. This allows Dash to avoid PM writes for search operations. With fingerprinting and optimistic concurrency, Dash avoids both unnecessary reads and writes, saving PM bandwidth and allowing Dash to scale well.

Dash retains desirable properties. We propose a new load balancing strategy to postpone segment splits with improved space utilization. To support instant recovery, we limit the amount of work to be done upon recovery to a small constant (reading and possibly writing a one-byte counter), and amortize the remaining recovery work over runtime. Compared to prior work that handles PM programming issues in ad hoc ways, Dash uses PM programming models (PMDK [20], one of the most popular PM libraries) to systematically handle crash consistency and PM allocation and to achieve instant recovery.

Although these techniques are not all new, Dash is the first to integrate them to build hash tables that scale without sacrificing features on real PM. The techniques in Dash can be applied to various static and dynamic hashing schemes. Compared to static hashing, dynamic hashing can adjust the hash table size on demand without full-table rehashing, which may block concurrent queries and significantly limit performance. In this paper, we focus on dynamic hashing and apply Dash to two classic approaches: extendible hashing [12, 38] and linear hashing [29, 36]. They are both widely used in database and storage systems, such as Oracle ZFS [40], IBM GPFS [49], Berkeley DB [3] and SQL Server Hekaton [32].

Evaluation using a 24-core Intel Xeon Scalable CPU and 1.5TB of Optane DCPMM shows that Dash can deliver high performance, good scalability, high space utilization and instant recovery with a constant recovery time of 57ms. Compared to the aforementioned state-of-the-art [38, 68], Dash achieves up to ∼3.9× better performance on realistic workloads, and up to over 90% load factor with high space utilization and the ability to instantly recover.

We make four contributions. First, we identify the mismatch between existing and desirable PM hashing approaches, and highlight the new challenges. Second, we propose Dash, a holistic approach to building scalable hash tables on real PM. Dash consists of a set of useful and general building blocks applicable to different hash table designs. Third, we present Dash-enabled extendible hashing and linear hashing, two classic and widely used dynamic hashing schemes. Finally, we provide a comprehensive empirical evaluation of Dash and existing PM hash tables to pinpoint and validate the important design decisions. Our implementation is open-source at: https://github.com/baotonglu/dash.

In the rest of the paper, we give necessary background in Section 2. Sections 3–5 present our design principles and Dash-enabled extendible hashing and linear hashing.
Section 6 evaluates Dash.We discuss related work in Section 7 and conclude in Section 8.
2. BACKGROUND
We first give necessary background on PM (Optane DCPMM)and dynamic hashing, then discuss issues in prior PM hash tables.
Hardware.
We target Optane DCPMMs (in DIMM form factor). In addition to byte-addressability and persistence, DCPMM offers high capacity (128/256/512GB per DIMM) at a price lower than DRAM's. It supports two modes:
Memory and
AppDirect . The formerpresents capacious but slower volatile memory. DRAM is used as acache to hide PM’s higher latency, with hardware-controlled cachingpolicy. The AppDirect mode allows software to explicitly accessDRAM and PM with persistence in PM, without implicit caching.Applications need to make judicious use of DRAM and PM. Similarto other work [38, 39, 51, 68], we leverage the AppDirect mode, asit provides more flexibility and persistence guarantees.
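For concreteness, the sketch below shows one common way an application obtains AppDirect-mode PM: the DCPMM namespace is exposed through a DAX file system and the application creates (or reopens) a PMDK object pool on it. The pool path, layout name and size are placeholders, not the configuration used in this paper.

    // Sketch: using PM exposed via a DAX file system with PMDK's libpmemobj.
    // "/mnt/pmem0/pool" and the 32GB size are hypothetical values.
    #include <libpmemobj.h>
    #include <cstdio>

    int main() {
        const char *path = "/mnt/pmem0/pool";   // file on a DAX-mounted PM namespace
        const size_t pool_size = 32ull << 30;   // 32 GB, placeholder
        PMEMobjpool *pop = pmemobj_create(path, "example_layout", pool_size, 0666);
        if (pop == nullptr) {
            // The pool may already exist from a previous run; reopen it instead.
            pop = pmemobj_open(path, "example_layout");
        }
        if (pop == nullptr) {
            perror("pmemobj_create/open");
            return 1;
        }
        // ... allocate and persist application data structures here ...
        pmemobj_close(pop);
        return 0;
    }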
System Architecture.
The current generation of DCPMM re-quires the system be equipped with DRAM to function properly. Wealso assume such setup with PM and DRAM, both of which are be-hind multiple levels of volatile
CPU caches. Data is not guaranteedto be persisted in PM until a cacheline flush instruction (
CLFLUSH , CLFLUSHOPT or CLWB ) [21] is executed or other events that implic-itly cause cacheline flush occur. Writes to PM may also be reordered,requiring fences to avoid undesirable reordering. The application(hash table in our case) must explicitly issue fences and cachelineflushes to ensure correctness.
CLFLUSH and
CLFLUSHOPT will evictthe cacheline that is being flushed, while
CLWB does not (and thus may give better performance). After a cacheline of data is flushed, it will reach the asynchronous DRAM refresh (ADR) domain, which includes a write buffer and a write pending queue with persistence guarantees upon power failure. Once data is in the ADR domain, it is considered persisted. Although DCPMM supports 8-byte atomic writes, internally it uses 256-byte blocks. Software should not be hardcoded to depend on this, as it is an internal parameter of the hardware that may change in future generations.
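As a minimal illustration of this requirement, the following sketch persists a freshly written memory range by issuing CLWB for each affected cacheline followed by a store fence; the helper names are ours, and a production implementation would fall back to CLFLUSHOPT or CLFLUSH on CPUs without CLWB support.

    // Sketch: flush a range of cachelines with CLWB and order with SFENCE.
    // Compile for a CLWB-capable target (e.g. -mclwb); helper names are ours.
    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    static constexpr size_t kCachelineSize = 64;

    inline void flush_range(const void *addr, size_t len) {
        uintptr_t p = reinterpret_cast<uintptr_t>(addr) & ~(kCachelineSize - 1);
        for (; p < reinterpret_cast<uintptr_t>(addr) + len; p += kCachelineSize) {
            _mm_clwb(reinterpret_cast<void *>(p));  // write back without evicting
        }
    }

    inline void persist(const void *addr, size_t len) {
        flush_range(addr, len);
        _mm_sfence();  // order the flushes before subsequent stores
    }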
Performance Characteristics.
At the device level, as many previous studies have shown, PM exhibits asymmetric read and write latency, with writes being slower, and its latency is several times longer than DRAM's. More recent studies [56, 63], however, revealed that on Optane DCPMM the read latency seen by software is often higher than the write latency. This is attributed to the fact that writes (store instructions) commit once the data reaches the ADR domain at the memory controller, rather than when it reaches the DCPMM media. On the contrary, a read operation often needs to touch the actual media unless the data being accessed is cache-resident (which is rare, especially in data structures with inherent randomness such as hash tables). Tests also showed that the bandwidth of DCPMM depends on many factors of the workload. In general, DCPMM's sequential and random read bandwidth is several times lower than DRAM's, and the gap for sequential and random writes is even larger. Notably, the performance of small random stores is severely limited and non-scalable [63], which, however, is the inherent access pattern of hash tables. These properties are in stark contrast to prior estimates about PM performance [45, 55, 58], and lead to significantly lower performance of many prior designs on DCPMM than originally reported. Thus, it is important to reduce both PM reads and writes for higher performance. More details on raw DCPMM device performance can be found elsewhere [22]; we focus on the end-to-end performance of hash tables on PM.
Now we give an overview of extendible hashing [12] and lin-ear hashing [29, 36]. We focus on their memory-friendly versionswhich PM-adapted hash tables were based upon. Dash can also beapplied to other approaches which we defer to future work.
Extendible Hashing.
The crux of extendible hashing is its use of a directory to index buckets so that they can be added and removed dynamically at runtime. When a bucket is full, it is split into two new buckets with keys redistributed. The directory may get expanded (doubled) if there is not enough space to store pointers to the new bucket. Figure 2(a) shows an example with four buckets, each of which is pointed to by a directory entry; a bucket can store up to four records (key-value pairs). In the figure, indices of directory entries are shown in binary. The two least significant bits (LSBs) of the hash value are used to select a bucket; we call the number of suffix bits being used here the global depth. The hash table can have at most 2^global depth directory entries (buckets). A search operation follows the pointer in the corresponding directory entry to probe the bucket. Each bucket also has a local depth. In Figure 2(a), the local depth of each bucket is 2, the same as the global depth. Suppose we want to insert key 30, which is hashed to bucket 01; the bucket is full and needs to be split to accommodate the new key. Splitting the bucket will require more directory entries. In extendible hashing, the directory always grows by doubling its current size. The result is shown in Figure 2(b). Here, bucket 01 in Figure 2(a) is split into two new buckets (001 and 101): one occupies the original directory entry, and the other occupies the second entry in the newly added half of the directory. Other new entries still point to their corresponding original buckets. Search operations now use three bits to determine the directory entry index (the global depth is now three). After a bucket is split, we increment its local depth by one and update the new bucket's local depth to be the same (3 in our example). The other, unsplit buckets' local depth remains 2. This allows us to determine whether a directory doubling is needed: if a bucket whose local depth equals the global depth is split (e.g., bucket 001 or 101), then the directory needs to be doubled to accommodate the new bucket. Otherwise (local depth < global depth), the directory already has 2^(global depth - local depth) directory entries pointing to that bucket, which can be used to accommodate the new bucket. Choosing a proper hash function that evenly distributes keys to all buckets is an important but orthogonal problem to our work.
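To make the directory arithmetic concrete, the sketch below implements the textbook (LSB-based) lookup and the directory-doubling test described above; Dash itself indexes the directory by MSBs, as discussed later, and all names here are ours.

    // Sketch of textbook extendible hashing (LSB-based), illustrating global
    // and local depth and the directory-doubling test. Names are ours.
    #include <cstdint>
    #include <vector>

    struct Bucket {
        uint32_t local_depth;
        // ... records ...
    };

    struct ExtendibleDirectory {
        uint32_t global_depth;
        std::vector<Bucket *> entries;   // size == 1 << global_depth

        Bucket *lookup(uint64_t hash) const {
            // Use the global_depth least significant bits as the directory index.
            uint64_t idx = hash & ((1ull << global_depth) - 1);
            return entries[idx];
        }

        // True if splitting this bucket requires doubling the directory.
        bool needs_doubling(const Bucket *b) const {
            return b->local_depth == global_depth;
        }

        void double_directory() {
            // New entries initially point to the same buckets as their
            // counterparts in the original half of the directory.
            std::vector<Bucket *> old = entries;
            entries.insert(entries.end(), old.begin(), old.end());
            ++global_depth;
        }
    };

Splitting a bucket whose local depth is below the global depth only redirects some of the directory entries that already point to it; only when the two depths are equal does double_directory need to run.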
Figure 2: An example of extendible hashing. (a) Before inserting 30; the global depth is 2 and the hash table is full. (b) After inserting 30; the global depth becomes 3 and the directory is doubled. The local depth of unsplit buckets is 2. Splitting buckets with local depth < global depth will not double the directory: for instance, if bucket 000 needs to be split, directory entry 100 (pointing to bucket 000) can be updated to point to the new bucket.
Linear Hashing.
In-memory linear hashing takes a similar approach to organizing buckets, using a directory with entries pointing to individual buckets [29]. The main difference compared to extendible hashing is that in linear hashing, the bucket to be split is chosen "linearly": the table keeps a pointer (page ID or address) to the bucket to be split next, only that bucket is split in each step, and the pointer advances to the next bucket when the split of the current bucket is finished. Therefore, the bucket being split is not necessarily the bucket that is full as a result of inserts; eventually the overflowed bucket will be split and have its keys redistributed. If a bucket is full and an insert is requested on it, more overflow buckets are created and chained together with the original, full bucket. For correct addressing and lookup, linear hashing uses a group of hash functions h_0 ... h_n, where h_n covers twice the range of h_{n-1}. For buckets that have already been split, h_n is used so we can address buckets in the new (doubled) capacity range; for the other, unsplit buckets we use h_{n-1} to find the desired bucket. After all buckets are split (a round of splitting has finished), the hash table's capacity has doubled, and the pointer to the next-to-be-split bucket is reset to the first bucket for the next round of splitting. Determining when to trigger a split is an important problem in linear hashing. A typical approach is to monitor and keep the load factor bounded [29]. The choice of a proper splitting strategy may also vary across workloads, and is orthogonal to the design of Dash.

Now we discuss how dynamic hashing is adapted to work on PM. We focus on extendible hashing and start with CCEH [38], a recent representative design; Section 5 covers linear hashing on PM. To reduce PM accesses, CCEH groups buckets into segments, similar to in-memory linear hashing [29]. Each directory entry then points to a segment which consists of a fixed number of buckets indexed by additional bits in the hash values. By combining multiple buckets into a larger segment, the directory becomes significantly smaller as fewer bits are needed to address segments, making it more likely to be cached entirely by the CPU, which helps reduce accesses to PM. Note that splits now happen at the segment (instead of bucket) level. A segment is split once any bucket in it is full, even if the other buckets in the segment still have free slots, which results in low load factor and more PM accesses. To reduce such premature splits, linear probing can be used to allow a record to be inserted into a neighbor bucket. However, this improves load factor at the cost of more cache misses and PM accesses. Thus, most approaches bound the probing distance to a fixed number; e.g., CCEH probes no more than four cachelines. However, our evaluation (Section 6) shows that linear probing alone is not enough to achieve high load factor.

Another important aspect of dynamic PM hashing is to ensure failure atomicity, particularly during segment split, which is a three-step process: (1) allocate a new segment in PM, (2) rehash records from the old segment to the new segment and (3) register the new segment in the directory and update the local depth. Existing approaches such as CCEH only focused on step 3, side-stepping PM management issues surrounding steps 1–2. If the system crashes during step 1 or 2, we need to guarantee the new segment is reclaimed upon restart to avoid permanent memory leaks.
In Sections 4 and 6.1, we describeDash’s solution and a solution for existing approaches.
3. DESIGN PRINCIPLES
The aforementioned issues and performance characteristics ofOptane DCPMM lead to the following design principles of Dash: • Avoid both Unnecessary PM Reads and Writes.
Probing per-formance impacts not only search operations, but also all the otheroperations. Therefore, in addition to reducing PM writes, Dashmust also remove unnecessary PM reads to conserve bandwidthand alleviate the impact of high end-to-end read latency. • Lightweight Concurrency.
Dash should scale well on multicoremachines with persistence guarantees. Given the limited band-width, concurrency control must be lightweight to not incur muchoverhead (i.e., avoid PM writes for search operations, such as readlocks). Ideally, it should also be relatively easy to implement. • Full Functionality.
Dash must not sacrifice or trade off importantfeatures that make a hash table useful in practice. In particular, itneeds to support near-instantaneous recovery and variable-lengthkeys and achieve high space utilization.
4. Dash FOR EXTENDIBLE HASHING
Based on the principles in Section 3, we describe Dash in thecontext of Dash-Extendible Hashing (Dash-EH). We discuss howDash applies to linear hashing in Section 5.
Similar to prior approaches [38, 65], Dash-EH uses segmentation.As shown in Figure 3, each directory entry points to a segment whichconsists of a fixed number of normal buckets and stash buckets foroverflow records from normal buckets which did not have enoughspace for the inserts. The lock, version number and clean markerare for concurrency control and recovery, which we describe later.Figure 4 shows the internals of a bucket. We place the metadataused for bucket probing on the first 32 bytes, followed by multiple16-byte record (key-value pair) slots. The first 8 bytes in each slotstore the key (or a pointer to it for keys longer than 8 bytes). Theremaining 8 bytes store the payload which is opaque to Dash; it canbe an inlined value or a pointer, depending on the application’s need.The size of a bucket is adjustable. In our current implementation itis set to 256-byte (block size of Optane DCPMM [22]) for betterlocality. This allows us to store 14 records (16-byte each) per bucket.The 32-byte metadata includes key data structures for Dash-EHto handle hash table operations and realize the design principles. Itstarts with a 4-byte version lock for optimistic concurrency control(Section 4.4). A 4-bit counter records the number of records storedin the bucket. The allocation bitmap reserves one bit per slot, toindicate whether the corresponding slot stores a valid record. The membership bitmap is reserved for bucket load balancing whichwe describe later (Section 4.3). What follows are structures suchas fingerprints and counters to accelerate probing and improve loadfactor. Most unnecessary probings are avoided by scanning thefingerprints area.
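The sketch below mirrors this 256-byte bucket layout (Figure 4) as a C++ struct; the field names and exact packing are ours and are only meant to make the metadata inventory concrete. The real implementation packs the counter and the two bitmaps into a single word so that they can be updated with one atomic store.

    // Conceptual sketch of a 256-byte Dash-EH bucket (Figure 4). Field names
    // and packing are illustrative, not Dash's exact layout.
    #include <cstdint>

    constexpr int kNumSlots = 14;        // in-bucket records
    constexpr int kNumOverflowFP = 4;    // fingerprints for records in stash buckets

    struct Record {
        uint64_t key;      // inline key, or pointer for variable-length keys
        uint64_t payload;  // opaque 8-byte value or pointer
    };

    struct alignas(256) Bucket {
        // ---- 32-byte metadata region ----
        uint32_t version_lock;                 // 1 lock bit + version for optimistic reads
        uint32_t bitmap;                       // alloc bitmap + membership bitmap + 4-bit counter
        uint8_t  fingerprints[kNumSlots];      // 1-byte hashes of in-bucket keys
        uint8_t  overflow_fp[kNumOverflowFP];  // fingerprints of overflowed keys
        uint8_t  overflow_bitmap;              // valid overflow_fp slots (+ overflow bit)
        uint8_t  overflow_membership;          // do the overflow fingerprints belong here?
        uint8_t  overflow_stash_index;         // 2 bits per overflow record: stash bucket id
        uint8_t  overflow_count;               // overflow records not covered by overflow_fp
        // ---- 224 bytes of records ----
        Record records[kNumSlots];
    };
    static_assert(sizeof(Bucket) == 256, "one bucket per 256-byte PM block");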
Figure 3: Overall architecture of Dash-EH. Each directory entry points to a segment of normal buckets plus stash buckets; the directory also holds the lock, version number and clean marker. Inserts try a balanced insert first, then displacement, then stashing.
Figure 4: Dash-EH bucket layout. The first 32 bytes are dedicated to metadata that optimizes probing and load factor (version lock, counter, membership and allocation bitmaps, 14 + 4 fingerprints, overflow fingerprint bitmap, overflow bit, stash bucket index, overflow membership and overflow count), followed by 224 bytes of records (14 16-byte pairs). Normal and stash buckets share the same layout.
Bucket probing (i.e., search in one bucket) is a basic operationneeded by all the operations supported by a hash table (search, insertand delete) to check for key existence. Searching a bucket typicallyrequires a linear scan of the slots. This can incur lots of cache missesand is a major source of PM reads, especially so for long keys storedas pointers. It is a major reason for hash tables on PM to exhibit lowperformance. Moreover, such scans for negative search operations(i.e., when the target key does not exist) are completely unnecessary.We employ fingerprinting [43] to reduce unnecessary scans. Itwas used by trees to reduce PM accesses with an amortized numberof key loads of one. We adopt it in hash tables to reduce cachemisses and accelerate probing. Fingerprints are one-byte hashesof keys for predicting whether a key possibly exists. We use theleast significant byte of the key’s hash value. To probe for a key,the probing thread first checks whether any fingerprint matches thesearch key’s fingerprint. It then only accesses slots with matchingfingerprints, skipping all the other slots. If there is no match, the keyis definitely not present in the bucket. This process can be furtheraccelerated with SIMD instructions [21].Fingerprinting particularly benefits negative search (where thesearch key does not exist) and uniqueness checks for inserts. It alsoallows Dash to use larger buckets to tolerate more collisions andimprove load factor, without incurring many cache misses: most un-necessary probes are avoided by fingerprints. This design contrastswith many prior designs that trade load factor for performance byhaving small buckets of 1–2 cachelines [10, 38, 68].As Figure 4 shows, each bucket contains 14 slots, but 18 finger-prints (bits 64–208); 14 are for slots in the bucket, and the other fourrepresent keys placed in a stash bucket but were originally hashedinto the current bucket. They can allow early avoidance of accessto stash buckets, saving PM bandwidth. We describe details next aspart of the bucket load balancing strategy that improves load factor.
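As an illustration of fingerprint probing, the sketch below compares a one-byte search fingerprint against a bucket's fingerprint array with a single SSE comparison and returns a bitmask of candidate slots; only those slots are then verified with full key comparisons. The function names are ours and the fingerprint placement follows the bucket sketch above rather than Dash's code.

    // Sketch: SIMD matching of a search key's one-byte fingerprint against a
    // bucket's fingerprint array. fps must point to at least 16 readable bytes
    // (Dash stores the 14 in-bucket and 4 overflow fingerprints contiguously).
    #include <immintrin.h>
    #include <cstdint>

    inline uint8_t fingerprint(uint64_t hash) {
        return static_cast<uint8_t>(hash);       // least significant byte of the hash
    }

    // Returns a bitmask; bit i is set if in-bucket fingerprint slot i matches.
    inline uint32_t match_fingerprints(const uint8_t *fps, uint8_t fp) {
        __m128i stored = _mm_loadu_si128(reinterpret_cast<const __m128i *>(fps));
        __m128i needle = _mm_set1_epi8(static_cast<char>(fp));
        __m128i eq     = _mm_cmpeq_epi8(stored, needle);
        return static_cast<uint32_t>(_mm_movemask_epi8(eq)) & 0x3FFF;  // 14 slots
    }

If the returned mask is zero, the key is definitely not in the bucket and no record is touched, which is what makes negative searches and uniqueness checks cheap; the overflow fingerprints can be screened the same way.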
Segmentation reduces cache misses on the directory (by reducing its size). However, as we describe in Sections 2.3 and 6, this comes at the cost of load factor: in a naive implementation the entire segment needs to be split if any bucket is full, yet other buckets in the segment may still have much free space. We observe that the key reason is load imbalance caused by the (inflexible) way buckets are selected for inserting new records, i.e., a key is only mapped to a single bucket. Dash uses a combination of techniques for new inserts to balance load among buckets while limiting the PM reads needed. Algorithm 1 shows how the insert operation works in Dash-EH at a high level, with three key techniques described below.
Balanced Insert.
To insert a record whose key is hashed into bucket b (hash(key) = b), Dash probes both bucket b and bucket b + 1, and inserts the record into the less-loaded of the two (ties go to the target bucket b, as Algorithm 1 shows). Probing more buckets (i.e., b + n where n > 1) could further improve load factor, but it would also require every probe to examine all of buckets b ... b + n − 1; restricting the choice to two buckets bounds this extra cost to one additional bucket.
Displacement.
If both the target bucket b and the probing bucket b + 1 are full, Dash tries to make room by displacing an existing record: we move a record from bucket b + 1 to bucket b + 2 if (1) bucket b + 2 has a free slot and (2) bucket b + 1 contains a record with hash(key) = b + 1 (i.e., b + 2 is that record's probing bucket); the new record is then inserted into b + 1. If such a record does not exist, we repeat for bucket b but move a record with hash(key) = b − 1 back to bucket b − 1. Dash uses the membership bitmap (Figure 4) to accelerate displacement. If a bit is set, then the corresponding key was not originally hashed into this bucket; it was placed here because of balanced insert or displacement. Therefore, when checking bucket b (respectively b + 1) for a displacement candidate, only records whose membership bit is set (unset) need to be examined.
Stashing.
If the record cannot be inserted into bucket b or b + 1 even after displacement, we insert it into a stash bucket as an overflow record. Stash buckets use the same layout as normal buckets; probing a stash bucket follows the same procedure as probing a normal bucket (see Section 4.2). While stashing can be effective in improving load factor, it could incur non-trivial overhead: the more stash buckets are used, the more CPU cycles and PM reads are needed to probe them. This is especially undesirable for negative search and the uniqueness check in insert operations, since both need to probe all stash buckets even when it is completely unnecessary. To solve this problem, we try to set up record metadata, including fingerprints, in a normal bucket and only refer actual record accesses to the stash bucket. As Figure 4 shows, four additional fingerprints per bucket are reserved for overflow records stored in stash buckets. A 4-bit overflow fingerprint bitmap records whether the corresponding fingerprint slot is occupied. Another overflow bit indicates whether the bucket has overflowed any record to a stash bucket. Algorithm 1 (lines 27–29) shows this process.
Algorithm 1
Dash-EH insert algorithm with bucket load balancing.
def dash_eh_insert(key, value):
    h = hash(key)
retry:
    [target_seg] = get_segment(h)
    [target_bucket, probing_bucket] = target_seg.bk(h)
    Lock target_bucket and probing_bucket
    [verify_seg] = get_segment(h)
    if verify_seg is not target_seg:        # segment may have split concurrently
        Unlock and goto retry
    if key exists in either bucket or the stash:   # uniqueness check
        Unlock and return Result::KeyExists
    if target_bucket or probing_bucket is not full:
        # Balanced insert: choose the less-loaded bucket
        if target_bucket.count <= probing_bucket.count:
            target_bucket.insert(key, value, h)
        else:
            probing_bucket.insert(key, value, h)
    else:
        bucket = displace(target_bucket, probing_bucket)
        if bucket is not NULL:
            bucket.insert(key, value, h)
        elif stash_bucket.insert(key, value, h):
            target_bucket.overflow = true
            Set overflow fingerprint bitmap and fingerprint
        else:
            split_segment(h)
            goto retry
    Unlock target_bucket and probing_bucket
    return Result::Inserted

Similar to inserting records into a normal bucket, the overflow record's fingerprint also allows one more bucket probing, using the overflow membership bitmap to indicate whether the overflow fingerprints originally belong to this bucket. The 2-bit stash bucket index per overflow record indicates which one of the four stash buckets the record was inserted into, for faster lookup. If the overflow fingerprint cannot be inserted into either the target or the probing bucket, we increment the overflow count in the target bucket. Once this counter becomes positive, a probing thread has to check the stash area to ensure that a key does or does not exist. Thus, it is desirable to reserve enough overflow fingerprint slots in each bucket so that the overflow counter is rarely positive. As Section 6 shows, using 2–4 stash buckets per segment can improve load factor to over 90% without imposing significant overhead.
Dash employs optimistic locking, an optimistic flavor of bucket-level locking inspired by optimistic concurrency control [27, 54]. Insert operations follow traditional bucket-level locking and lock the affected buckets. Search operations are allowed to proceed without holding any locks (thus avoiding writes to PM) but need to verify the records they read. For this to work, in Dash the lock consists of (1) a single bit that serves the role of "the lock" and (2) a version number for detecting conflicts (not to be confused with the version number in Figure 3 used for instant recovery). As line 7 in Algorithm 1 shows, the inserting thread acquires bucket-level locks for the target and probing buckets. This is done by atomically setting the lock bit in each bucket, retrying the compare-and-swap (CAS) instruction [21] until it succeeds. The thread then enters the critical section and continues its operation. After the insert is done, the thread releases the lock by (1) resetting the lock bit and (2) incrementing the version number by one, in a single atomic write.

To probe a bucket for a key, Dash first takes a snapshot of the lock word and checks whether the lock is being held by a concurrent writer (i.e., the lock bit is set). If so, it waits until the lock is released and repeats. It is then allowed to read the bucket without holding any lock. Upon finishing its operation, the reader thread reads the lock word again to verify that the version number did not change; if it did, the reader retries the entire operation, as a concurrent write might have modified the record and made it invalid. This lock-free read design requires that segment deallocation (due to a merge) happen only after no readers are (and will be) using the segment. We use epoch-based reclamation [14] to achieve this without incurring much overhead.

Dash does not use segment-level locks, saving PM accesses at the segment level. As a result, structural modification operations (SMOs, such as segment split) need to lock all the buckets in each affected segment. Directory doubling/halving is handled similarly: the directory lock is only held when the directory is being doubled or halved. For other operations on the directory (e.g., updating a directory entry to point to a new segment), no lock is taken. Instead, they are treated as search operations without taking the directory lock. This is safe because we guarantee isolation at the segment level: an inserting thread must first acquire locks to protect the affected buckets. "Real" probings (search/insert) proceed without reading the directory lock but again need to verify that they entered the right segment by re-reading the directory entry and testing whether the two reads match; if not, the thread aborts and retries the entire operation.
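The sketch below illustrates the version-lock protocol in isolation: writers set the lock bit with CAS and bump the version on release, while readers snapshot the word, read optimistically and validate the snapshot afterwards. The class is ours; Dash packs the same information into each bucket's 4-byte version lock.

    // Sketch of an optimistic bucket lock: bit 31 is the lock bit, bits 0-30
    // the version. Writers lock/unlock; readers snapshot and verify.
    #include <atomic>
    #include <cstdint>

    struct VersionLock {
        std::atomic<uint32_t> word{0};
        static constexpr uint32_t kLockBit = 1u << 31;

        void lock() {                               // writer side
            uint32_t v;
            do {
                v = word.load(std::memory_order_acquire) & ~kLockBit;
            } while (!word.compare_exchange_weak(v, v | kLockBit,
                                                 std::memory_order_acquire));
        }
        void unlock() {                             // clear bit, bump version, one store
            uint32_t v = word.load(std::memory_order_relaxed);
            word.store((v & ~kLockBit) + 1, std::memory_order_release);
        }
        uint32_t read_begin() const {               // reader: wait out writers, snapshot
            uint32_t v;
            do {
                v = word.load(std::memory_order_acquire);
            } while (v & kLockBit);
            return v;
        }
        bool read_validate(uint32_t snapshot) const {
            return word.load(std::memory_order_acquire) == snapshot;
        }
    };

A search brackets bucket probing between read_begin and read_validate and retries the whole operation when validation fails, so no PM write is ever issued on the read path.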
Dash stores pointers to variable-length keys, which is a commonapproach [2, 38, 43, 68]. A knob is provided to switch between theinline (fixed-length keys up to 8 bytes) and pointer modes. Thoughdereferencing pointers may incur extra overhead, fingerprintinglargely alleviates this problem. For negative search where the targetkey does not exist, no fingerprint will match and so key probingwill not happen at all. For positive search, as we have discussed inSection 4.2, the amortized number of key load (therefore the numbercache misses caused by following the key pointer) is one [43].
Now we present details on how Dash-EH performs insert, searchand delete operations on PM with persistence guarantees.
Insert.
Section 4.3 presented the high-level steps for insert; herewe focus on the bucket-level. As the bucket::insert function inAlgorithm 2 shows, we first write and persist the new record (lines3-4), and then set up the metadata (fingerprint, allocation bitmap,counter and membership, lines 6–12). Note that the allocationbitmap, membership bitmap and counter are in one word; they areupdated in one atomic write. The
CLWB and fence are then issued (line 16) to persist all the metadata. Note that the allocation bitmap, membership bitmap and counter are in one word; they are updated in one atomic write. Once the corresponding bit in the bitmap is set, the record is visible to other threads. If a crash happens before the bitmap is persisted, the new record is regarded as invalid; otherwise, the record was successfully inserted. This allows us to avoid expensive logging while maintaining consistency. Displacing a record needs two steps: (1) inserting it into the new bucket and (2) deleting it from the original bucket. As the displace function in Algorithm 2 shows, step 2 is done by resetting the corresponding bit in the allocation bitmap, without moving data. In case a crash happens before step 2 finishes, the record will appear in both buckets. This necessitates a duplicate detection mechanism upon recovery, which is amortized over runtime (see Section 4.8). If the insert has to happen in a stash bucket, we set the overflow metadata in the normal bucket. This cannot be done atomically with the insert into the stash bucket; for performance reasons we do not guarantee the crash consistency of the overflow metadata and instead rebuild it upon recovery (Section 4.8).
Algorithm 2
Bucket insert and displacement in Dash-EH.
def bucket::insert(key, value, h):
    slot = slots[slot_id = get_free_slot()]
    slot.assign(key, value)
    CLWB+FENCE(slot)                        # persist the record before its metadata
    fingerprints[slot_id] = LSB_byte(h)
    alloc_bitmap |= (1 << slot_id)          # bitmap, membership and counter share one word
    ++counter
    if bucket is probing bucket:
        membership.set(slot_id)
    CLWB+FENCE(alloc_bitmap, membership, counter, FP)

def displace(target = b, prob = b+1):
    # Try to free a slot in the probing bucket: a record whose membership bit
    # is unset has hash(key) = b+1 and can move to its own probing bucket b+2.
    slot_id = prob.get_unset_LSB(membership)
    if slot_id is not Invalid and (b+2) is not full:
        (b+2).insert(prob.slots[slot_id])
        prob.alloc_bitmap &= ~(1 << slot_id)
        prob.counter--
        CLWB+FENCE(prob.alloc_bitmap, counter)
        return prob
    # Otherwise free a slot in the target bucket: a record whose membership bit
    # is set has hash(key) = b-1 and can move back to bucket b-1.
    slot_id = target.get_set_LSB(membership)
    if slot_id is not Invalid and (b-1) is not full:
        (b-1).insert(target.slots[slot_id])
        target.alloc_bitmap &= ~(1 << slot_id)
        target.membership &= ~(1 << slot_id)
        target.counter--
        CLWB+FENCE(target.alloc_bitmap, counter)
        return target
    return NULL
Search.
With balanced insert and displacement, a record could beinserted into its target bucket b where b = hash ( key ) or its probingbucket b +
1. A search operation then has to check both if the recordis not found in b . As Algorithm 3 shows, the probing thread firstreads the directory to obtain a reference to the corresponding seg-ment and buckets (lines 4–5). It then takes a snapshot of the versionnumber of both buckets (lines 6–7) for verification later. We verifyat line 10 that the segment did not change (i.e., the directory entrystill points to it) and retry if needed. Once segment check passed, wecheck whether the target/probing buckets are being modified (i.e.,locked) at lines 14–15. If not, we continue to search the target andprobing buckets (lines 17–27) using the bucket::search function(not shown). Note that we need to verify the lock version did notchange after bucket::search returns (lines 18 and 24).If neither bucket contains the record, it might be stored in a stashbucket (lines 31–37). If overflow count >
0, then we search the stash buckets, as the overflow fingerprint area does not have enough space for all overflow records from the bucket. Otherwise, stash access is only needed if there is a matching overflow fingerprint (lines 31–35).
Algorithm 3
Dash-EH search algorithm.
def dash_eh_search(key):
    h = hash(key)
retry:
    [target_seg] = get_segment(h)
    [target_bucket, probing_bucket] = target_seg.bk(h)
    vt = target_bucket.version_lock          # snapshot versions for later validation
    vp = probing_bucket.version_lock
    [verify_seg] = get_segment(h)
    if verify_seg is not target_seg:         # directory entry changed; retry
        goto retry
    if is_lock_set(vt) or is_lock_set(vp):   # a writer holds a bucket lock
        goto retry
    result = target_bucket.search(key)
    if vt is not target_bucket.version_lock:
        goto retry
    if result is not NULL:
        return result
    result = probing_bucket.search(key)
    if vp is not probing_bucket.version_lock:
        goto retry
    if result is not NULL:
        return result
    if target_bucket.overflow_count == 0:
        if key matches overflow fingerprints:
            search corresponding stash buckets and return
        else:
            return NULL
    else:
        Search by scanning the stash buckets and return
    return NULL
Delete.
To delete a record from a normal bucket, we reset thecorresponding bit in the allocation bitmap, decrement the counterand persist these changes. Then the slot becomes available for futurereuse. To delete a record from a stash bucket, in addition to clearingthe bit in the allocation bitmap, we also clear the overflow fingerprintin the target bucket which this record overflowed from if it exists;otherwise we only decrement the target bucket’s overflow counter.
When a thread has exhausted all the options to insert a record into a bucket, it triggers a segment split and possibly an expansion of the directory. Conversely, when the load factor drops below a threshold, segments can be merged to save space. At a high level, three steps are needed to split a segment S: (1) allocate a new segment N, (2) rehash keys in S and redistribute records between S and N, and (3) attach N to the directory and set the local depths of N and S. These operations change the structure of the hash table and must be made crash consistent on PM while maintaining high performance. For crash consistency, Dash-EH chains all segments using side links to the right neighbor. Each segment has a state variable to indicate whether the segment is in an SMO and whether it is the one being split or the new segment. An initial value of zero indicates the segment is not part of an SMO. Figure 5 shows an example. Note that, as shown in the figure, Dash-EH uses the most significant bits (MSBs) of hash values to address and organize segments and buckets (i.e., the directory is indexed by the MSBs of hash values), similar to other recent work [38].
Figure 5: Segment split in Dash-EH; the global depth is 2. (a) Initial state. (b) Allocate a new segment and do the rehashing. (c) Update the directory entry and local depth.
This is different from traditional extendible hashing described in Section 2.2, which uses LSBs of hash values to address buckets. Using LSBs was the choice in the disk era to reduce I/O: the directory can be doubled by simply copying the original directory and appending it to the directory file. On PM, this advantage is marginal, since to double a directory one needs to allocate and persist a double-sized directory in PM anyway to keep the directory in a contiguous address space. Using MSBs also allows directory entries pointing to the same segment to be co-located, reducing cacheline flushes during splits [38].

To split a segment S, we first mark its state as SPLITTING and allocate a new segment N whose address is stored in the side link of S. N is then initialized to carry S's side link as its own, and its local depth is set to the local depth of S plus one. Then, we change N's state to NEW to indicate that this new segment is part of a split SMO, for recovery purposes (see Section 4.8). We rely on PM programming libraries (PMDK [20]) to atomically allocate and initialize the new segment; in case of a crash, the allocated PM block is guaranteed to be owned either by Dash or by the allocator, and will not be permanently leaked. After initialization, we finish step 2 by redistributing records between N and S. Records moved from S to N are deleted from S after they are inserted into N. Note that the rehashing/redistribution process does not need to be done atomically: if a crash happens in the middle of rehashing, upon (lazy) recovery we redo the rehashing process with a uniqueness check to avoid repeating work for records that were already inserted into N before the crash; we describe more details in Section 4.8. Figure 5(b) shows the state of the hash table after step 2. Then the directory entry for N and the local depth of S are updated, as shown in Figure 5(c). These updates are conducted using an atomic PMDK transaction, which may internally use any approach such as lightweight logging. Many other systems avoid the use of logging to maintain high performance, largely because of frequent premature splits. But splits are much rarer in Dash thanks to bucket load balancing, which gives high load factor (Section 4.3); this allows Dash-EH to employ logging-based PMDK transactions that abstract away many details and ease implementation.

Dash provides truly instant recovery by requiring a constant amount of work (reading and possibly writing a one-byte counter) before the system is ready to accept user requests. We add a global version number V and a clean marker (shown in Figure 3), and a per-segment version number. clean is a boolean that denotes whether the system was shut down cleanly; V tells whether recovery (during runtime) is needed. Upon a clean shutdown, clean is set to true and persisted. Upon restart, if clean is true, we set clean to false and start to handle requests. Otherwise, we increment V by one and start to handle requests. For both the clean shutdown and crash cases, "recovery" only involves reading clean and possibly bumping V. The actual recovery work is amortized over segment accesses. To access a segment, the accessing thread first checks whether the segment version matches V. If not, the thread (1) recovers the segment to a consistent state before doing its original operation (e.g., insert or search), and (2) sets the segment's version number to V so that future accesses can skip the recovery pass.
With such a lazy recovery approach, a segment is not recovered until it is accessed. Multiple threads may access a segment that needs to be recovered. We employ a segment-level lock that is used only for recovery purposes; a thread only tries to acquire this lock if it sees that the segment's version number does not match V. Our current implementation uses one-byte version numbers. In case the version number wraps around and recovery is needed, we reset V to zero and set the version number of each segment to one. Since crashes (and especially repeated crashes) are rare events, such wrap-around cases should be very rare.

Recovering a segment needs four steps: (1) clear bucket locks, (2) remove duplicate records, (3) rebuild overflow metadata, and (4) continue the ongoing SMO. Some locks might be in the locked state upon a crash, so every lock in each bucket needs to be reset. Duplicate records are detected by checking the fingerprints in neighboring buckets; this is lightweight since a real key comparison is only needed if the fingerprints match. Overflow metadata in normal buckets also needs to be cleared and rebuilt based on the records in stash buckets, as we do not guarantee their crash consistency for performance reasons. Finally, if a segment is in the SPLITTING state, the accessing thread will follow the segment's side link to test whether the neighbor segment is in the
NEW state. If so, we restartthe rehashing-redistribution process and finish the split. Otherwise,we reset the state variable which in effect rolls back the split.
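The per-access recovery check is small; the sketch below shows its shape: compare the segment's version against the global version V and, only if they differ, take a recovery-only lock and run the four repair steps before the original operation proceeds. All names are ours, and persistence of the version update is elided.

    // Sketch of Dash's lazy, per-segment recovery check (Section 4.8).
    // global_version plays the role of V; names are ours.
    #include <atomic>
    #include <cstdint>
    #include <mutex>

    std::atomic<uint8_t> global_version{1};   // bumped once after a crash

    struct Segment {
        std::atomic<uint8_t> version{1};      // persisted per-segment version
        std::mutex recovery_lock;             // taken only when recovery is needed

        void recover_if_needed() {
            uint8_t v = global_version.load(std::memory_order_acquire);
            if (version.load(std::memory_order_acquire) == v) return;   // fast path
            std::lock_guard<std::mutex> guard(recovery_lock);
            if (version.load(std::memory_order_relaxed) == v) return;   // recovered by another thread
            clear_bucket_locks();          // step 1: reset locks left locked by the crash
            remove_duplicate_records();    // step 2: fingerprint-guided duplicate detection
            rebuild_overflow_metadata();   // step 3: rebuild from the stash buckets
            resume_or_rollback_split();    // step 4: follow side link, finish or undo SMO
            version.store(v, std::memory_order_release);  // persisted with CLWB+fence in real code
        }

        void clear_bucket_locks()        { /* ... */ }
        void remove_duplicate_records()  { /* ... */ }
        void rebuild_overflow_metadata() { /* ... */ }
        void resume_or_rollback_split()  { /* ... */ }
    };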
5. Dash FOR LINEAR HASHING
We present Dash-LH, Dash-enabled linear hashing that uses thebuilding blocks discussed previously (balanced insert, displacement,fingerprinting and optimistic concurrency). We do not repeat themhere and focus on the design decisions specific to linear hashing.
Figure 6 (focus on segments 0-3 for now) shows the overallstructure of Dash-LH. Similar to Dash-EH, Dash-LH also usessegmentation and splits at the segment level. However, we followthe linear hashing approach to always split the segment pointed toby the
Next pointer, which is advanced after the segment is split. Since the segment to be split is not necessarily a full segment, a full segment needs to be able to accommodate overflow records, e.g., using linked lists. However, linked list traversal would incur many cache misses, which is a huge penalty for PM hash tables. Instead, we leverage the stashing design in Dash and use an adjustable number of stash buckets: in addition to a fixed number of stash buckets (e.g., two) in each segment, we store a linked list of additional stash buckets. A segment split is triggered whenever a stash bucket has to be allocated to accommodate overflow records. This contrasts with classic linear hashing, which splits one bucket at a time and is vulnerable to long overflow chains under high insertion rates. Dash-LH uses a larger split unit (a segment) and a larger chaining unit (a stash bucket rather than individual records), reducing chain length and therefore pointer chasing and cache misses. The overflow metadata and fingerprints further help alleviate the performance penalty brought by the need to search stash buckets. Overall, as we show in Section 6, Dash-LH also achieves near-linear scalability on realistic workloads.
Similar to Dash-EH, it is also important to reduce directory size for better cache locality. Some designs use double expansion [17], which increases segment size exponentially: allocating a new segment doubles the number of buckets in the hash table. For example, the second segment allocated would be 1× the size of the first segment, the third segment would be 2× larger than the first segment, and so on. The benefit is that the directory can become very small and often fit even in the L1 cache. However, it also makes the load factor drop by half whenever a new segment is allocated. To reduce space waste, we postpone double expansion and first expand the hash table by several fixed-size segments before triggering a double expansion. We call the number of such fixed expansions the stride. Figure 6 shows an example (stride = 4). A directory entry can point to an array of segments: the first four entries point to one-segment arrays, the next four entries point to two-segment arrays, and so on. With a larger stride, the allocation of larger segment arrays has less impact on load factor. The result is a very small directory that is typically L1-resident. Using 16KB segments, the first segment array includes 64 segments; with a stride of four, we can index TB-level data with a directory of less than 1KB.

Figure 6: Overview of Dash-LH. Segments are organized in arrays.

Since linear hashing expands in one direction, splits are essentially serialized by locking the
Next pointer. To shorten the lengthof the critical section, we adopt the expansion strategy proposed byLHlf [65] where the expansion only atomically advances
Next with-out actually splitting the segment. Then any thread that accesses asegment that should be split (denoted in the segment metadata area)will first conduct the actual split operation. As a result, multiplesegments splits can execute in parallel by multiple threads. Beforeadvancing the
Next pointer, the accessing thread first probes the di-rectory entry for the new segment to test whether the correspondingsegment array is allocated. If not, it allocates the segment array andstores it in the directory entry. The performance of PM allocatortherefore may impact overall performance, as we show in Section 6.Dash-LH uses a variable N to compute the number of bucketsof the base table. After each round of the split, Next is reset tozero and N is incremented to denote that the number of buckets isdoubled. For consistency guarantees, we store N (32-bit) and Next (32-bit) in a 64-bit word which can be updated atomically.
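The sketch below illustrates how such a packed 64-bit word can be maintained and consulted: segment addressing picks between the current and the doubled table size depending on whether the segment lies before Next, and advancing Next (including the reset at the end of a round) is a single atomic update. The modulo-based addressing shown is the textbook linear hashing formulation and the names are ours; Dash-LH's actual index computation differs in detail.

    // Sketch: N and Next packed into one atomically updated 64-bit word, plus
    // textbook linear-hashing addressing at segment granularity. Names are ours.
    #include <atomic>
    #include <cstdint>

    struct SplitState {
        std::atomic<uint64_t> word{0};      // high 32 bits: N, low 32 bits: Next

        static uint64_t pack(uint32_t n, uint32_t next) {
            return (static_cast<uint64_t>(n) << 32) | next;
        }

        // Segments below Next have already been split in this round, so they
        // are addressed with the doubled table size ("h_{n+1}" instead of "h_n").
        uint64_t segment_index(uint64_t hash, uint64_t base_segments) const {
            uint64_t w = word.load(std::memory_order_acquire);
            uint32_t n = static_cast<uint32_t>(w >> 32);
            uint32_t next = static_cast<uint32_t>(w);
            uint64_t size = base_segments << n;      // segments at the start of round n
            uint64_t idx = hash % size;
            if (idx < next) idx = hash % (size * 2);
            return idx;
        }

        // Advance Next; when a round completes, reset Next and increment N.
        void advance_next(uint64_t base_segments) {
            uint64_t w = word.load(std::memory_order_acquire);
            for (;;) {
                uint32_t n = static_cast<uint32_t>(w >> 32);
                uint32_t next = static_cast<uint32_t>(w);
                uint64_t size = base_segments << n;
                uint64_t desired = (next + 1 == size) ? pack(n + 1, 0) : pack(n, next + 1);
                if (word.compare_exchange_weak(w, desired, std::memory_order_acq_rel))
                    return;
            }
        }
    };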
6. EVALUATION
This section evaluates Dash and compares it with two other state-of-the-art PM hash tables, CCEH [38] and level hashing [68].Specifically, through experiments we confirm the following: • Dash-enabled hash tables (Dash-EH and Dash-LH) scale well onmulticore servers with real Optane DCPMM; • The bucket load balancing techniques allow Dash to achieve highload factor while maintaining high performance; • Dash provides instant recovery with a minimal, constant amountof work needed upon restart, reducing service downtime.
We implemented Dash-EH/LH using PMDK [20], which provides primitives for crash-safe PM management and synchronization. These primitives are essential for building PM data structures, but they also introduce overhead. For example, the PMDK allocator exhibits scalability problems and is much slower than DRAM allocators [31]. Such overheads were ignored in previous emulation-based work, but they are not negligible in reality, so we take them into account in our evaluation. The other hash tables under comparison (CCEH [38] and level hashing [68]) were both proposed based on DRAM emulation. We ported them to run on Optane DCPMM using their original code and PMDK. Like previous work [38], we optimize level hashing by co-locating all the locks in a small and contiguous memory region (lock striping [18]) to reduce cache misses. We now summarize the key implementation issues and our solutions.
Crash Consistency.
Dash uses PMDK transactions for segmentsplits. This frees Dash from handling low-level details while guaran-teeing safe and atomic allocations. We noticed a consistency issuein CCEH code where a power failure during segment split couldleak PM. We fixed this problem using PMDK transaction. We alsoadapted CCEH and level hashing to use PMDK reader-writer locksthat are automatically unlocked upon recovery.
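For reference, the sketch below shows the general shape of a libpmemobj transaction of the kind used for such updates: the modified PM range is added to the transaction's undo log before it is written, so a crash either rolls the update back or completes it. The directory layout and function here are simplified placeholders, not Dash's actual split code.

    // Sketch: updating a directory entry inside a PMDK (libpmemobj) transaction.
    // The data layout is a placeholder.
    #include <libpmemobj.h>
    #include <cstdint>

    struct Directory {
        uint64_t local_depth;
        PMEMoid  segment;      // persistent pointer to the segment
    };

    void install_new_segment(PMEMobjpool *pop, Directory *entry,
                             PMEMoid new_segment, uint64_t new_depth) {
        TX_BEGIN(pop) {
            // Snapshot the entry into the undo log before modifying it.
            pmemobj_tx_add_range_direct(entry, sizeof(Directory));
            entry->segment = new_segment;
            entry->local_depth = new_depth;
        } TX_ONABORT {
            // Crash or abort: the undo log restores the old entry contents.
        } TX_END
    }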
Persistent Pointers.
Both CCEH and level hashing assume stan-dard 8-byte pointers based on DRAM emulation, while some sys-tems use 16-byte pointers for PM [20, 25]. Long pointers break thememory layout and make atomic instructions hard to use. To use8-byte pointers on PM, we extended PMDK to ensure that PM ismapped onto the same virtual address range across different runs (us-ing
MAP_FIXED_NOREPLACE in mmap and setting mmap_min_addr in the kernel). All hash tables evaluated here use this approach.
Garbage Collection.
We implemented a general-purpose epoch-based PM reclamation mechanism for Dash. We also observed thatthe open-sourced implementation of CCEH allows threads to accessthe directory without acquiring any locks, which may allow accessto freed memory (due to directory doubling or halving). We fixedthis problem with the same epoch-based reclamation approach.
We run experiments on a server with a 24-core Intel Xeon Gold 6252 CPU clocked at 2.1GHz and 768GB of Optane DCPMM (6×128GB DIMMs).
Parameters.
For fair comparison, we set CCEH and level hash-ing to use the same parameters as in their original papers [38, 68].Our own tests showed these parameters gave the best performanceand load factor overall. Level hashing uses 128-byte (two cachelines)buckets. CCEH uses 16KB segments and 64-byte (one cacheline)buckets, with a probing length of four. Dash-EH and Dash-LH use256-byte (four cachelines) buckets and 16KB segments. Each seg-ment has two stash buckets, making it enough to have four overflowfingerprint slots per bucket so that the overflow counter is rarelypositive. Dash-LH uses hybrid expansion with a stride of eight andits first segment array includes 64 segments.
Benchmarks.
We stress test each hash table using microbenchmarks. For search operations, we run positive search and negative search; the latter probes specifically non-existent keys. Unless otherwise specified, for all runs we preload the hash table with 10 million records, then execute 190 million inserts (as an insert-only benchmark), followed by 190 million positive search, negative search and delete operations back-to-back on the 200-million-record hash table. For all hash indexes, we use GCC's std::_Hash_bytes (based on Murmur hash [1]) as the hash function, which is known to be fast and to provide high-quality hashes. Similar to other work [38, 68], we use uniformly distributed random keys in our workloads. We also tested skewed workloads under the Zipfian distribution (with varying skewness) and found that all operations achieved better performance, benefiting from higher cache hit ratios on hot keys; contention is rare because the hash values are largely uniformly distributed. Due to space limitations, we omit the detailed results for skewed workloads. For fixed-length key experiments, both keys and values are 8-byte integers; for variable-length key experiments, we use (pointers to) 16-byte keys and 8-byte values. The variable-length keys are pre-generated by the benchmark before testing.

Code downloaded from https://github.com/DICL/CCEH and https://github.com/Pfzuo/Level-Hashing. We also had to replace MAP_SHARED_VALIDATE with MAP_SHARED for MAP_FIXED_NOREPLACE to work, as detailed in our code repository.

Figure 7: Single-thread performance under fixed-length keys (left) and variable-length keys (right).
We begin with single-thread performance to understand the basic behaviors of each hash table. We first consider a read-only workload with fixed-length keys. Read-only results provide an upper bound on the performance of the hash tables, since no modification is done to the data structure; they directly reflect the underlying design's cache efficiency and concurrency control overhead. As Figure 7 shows, Dash-EH can outperform CCEH/level hashing by 1.9×/2.6× for positive search. Dash-LH and Dash-EH achieve similar performance because they use the same building blocks, with bounded PM accesses and lightweight concurrency control that reduces PM writes. For negative search, the Dash variants achieve an even more significant improvement, being 2.4×/4.4× faster than CCEH/level hashing. As Section 6.5 shows, this is attributed to fingerprints and the overflow metadata, which significantly reduce PM accesses.

For inserts, Dash and CCEH achieve similar performance, both clearly faster than level hashing. Although CCEH has one fewer cacheline flush per insert than Dash, Dash's bucket load balancing strategy reduces segment splits, improving both performance and load factor. Without an allocation bitmap, CCEH requires a reserved value (e.g., '0') to indicate an empty slot; this design imposes additional restrictions on the application, which Dash avoids by using metadata. Level hashing exhibited much lower performance due to more PM reads and frequent lock/unlock operations. It also requires full-table rehashing, which incurs many cacheline flushes. For deletes, Dash outperforms CCEH/level hashing by 1.2×/1.9× because of reduced cache misses.

The benefit of Dash is more prominent for variable-length keys. As Figure 7 shows, Dash-EH/LH are 2×/5× faster than CCEH/level hashing for positive search. The differences in negative search are even more dramatic (5×/15×). Again, these results show the effectiveness of fingerprinting. Since all the hash tables store pointers for longer (i.e., >8-byte) keys, the gaps remain large for the other operations as well: Dash is up to 3.7× faster than level hashing for insert, and 1.2×/2.9× faster than CCEH/level hashing for delete.
Figure 8: Throughput of Dash-EH, Dash-LH, CCEH and level hashing under different workloads with a varying number of threads and 8-byte keys and 8-byte values: (a) 100% insert, (b) 100% positive search, (c) 100% negative search, (d) 100% delete, (e) mixed.
Figure 9: Effect of fingerprinting in buckets under fixed-length keys (left) and variable-length keys (right).
We test both individual operations and a mixed workload that consists of 20% insert and 80% search operations. For the mixed workload, we preload the hash table with 60 million records to allow search operations to access actual data. Figure 8 plots how each hash table scales under a varying number of threads and fixed-length keys. For inserts, level hashing exhibits the worst scalability, mainly due to full-table rehashing, which is time-consuming on PM and blocks concurrent operations. With fingerprinting and bucket load balancing, Dash finishes uniqueness checks quickly and triggers fewer SMOs, with fewer PM accesses and interactions with the PM allocator. Though neither Dash-EH nor Dash-LH scales linearly, as inserts inherently exhibit many random PM writes, Dash is the most scalable solution, being up to 1.3×/8.9× faster than CCEH/level hashing for insert operations. For search operations, Figures 8(b–c) show near-linear scalability for Dash-EH/LH. CCEH falls behind mainly because of its use of pessimistic locking, which incurs a large amount of PM writes even for read-only workloads (to acquire/release read locks). Level hashing uses a similar design, but lock striping [18] makes all the locks likely to fit into the CPU cache. Therefore, although level hashing has lower single-thread performance than CCEH, it still achieves similar performance to CCEH under multiple threads. Delete operations in Dash-EH, Dash-LH, CCEH and level hashing on 24 threads scale and improve over their single-threaded versions by 8.4×, 9.8×, 6.1× and 14.7×, respectively. For the mixed workload on 24 threads, Dash outperforms CCEH/level hashing by 2.7×/9.0×. We observed similar trends (but with widening gaps between Dash-EH/LH and CCEH/level hashing) for workloads using variable-length keys (not shown due to space limits). In the following sections, we discuss how each design in Dash impacts its performance.

Fingerprinting is a major reason for Dash to perform and scale well on PM, as it greatly reduces PM accesses. We quantify its effect by comparing Dash-EH with and without fingerprinting; Figure 9 shows the result under 24 threads. With fixed-length keys, fingerprinting improves throughput by 1.04/1.19/1.72/1.02× for insert/positive search/negative search/delete. The numbers for variable-length keys are 1.88/3.13/7.04/1.52×, respectively.
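To make the mechanism concrete, the following is a minimal sketch (assumed field names and sizes, not Dash's actual layout) of fingerprint-filtered probing: each slot keeps a one-byte fingerprint of its key's hash, and a lookup only dereferences and compares the full key when the fingerprint matches, so non-matching slots cost no extra PM reads.

#include <cstdint>
#include <cstring>
#include <optional>

constexpr int kSlots = 14;            // slots per bucket (illustrative value)

struct Bucket {
  uint8_t  fingerprints[kSlots];      // one hash byte per stored key
  uint16_t alloc_bitmap;              // which slots hold valid records
  struct { const char* key; size_t len; uint64_t val; } records[kSlots];
};

std::optional<uint64_t> probe(const Bucket& b, const char* key, size_t len,
                              uint64_t hash) {
  const uint8_t fp = static_cast<uint8_t>(hash);   // fingerprint = low hash byte
  for (int i = 0; i < kSlots; ++i) {
    if (!(b.alloc_bitmap & (1u << i)) || b.fingerprints[i] != fp)
      continue;                                    // filtered out: key pointer untouched
    const auto& r = b.records[i];
    if (r.len == len && std::memcmp(r.key, key, len) == 0)  // fingerprint hit: verify key
      return r.val;
  }
  return std::nullopt;                             // not in this bucket
}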
Figure 10: Effect of overflow metadata (24 threads) with fixed-length keys and two (left) and four (right) stash buckets per segment.

Figure 11: Maximum load factor after adding different techniques (bucketized baseline, probing, balanced insert, displacement, and two/four stash buckets) as the segment size (KB) grows.

As introduced in Section 4.3, Dash uses overflow metadata to allow search operations to stop early and save PM accesses. Figure 10 shows its effectiveness under 24 threads with varying numbers of stash buckets. Dash-EH with two stash buckets outperforms the baseline (no metadata) by 1.07/1.29/1.70/1.16× for insert/positive search/negative search/delete. With more stash buckets added, search performance drops by about 25% without the overflow metadata. With the overflow metadata, however, performance remains stable, as negative search operations can stop early after checking the overflow metadata, without actually probing the stash buckets.

Now we study how linear probing, balanced insert, displacement and stashing improve load factor. We first measure the maximum load factor of a single segment of different sizes. Using larger segments could decrease load factor, but may improve performance by reducing directory size; real systems must balance this tradeoff. Figure 11 shows the result. In the figure, "Bucketized" represents the simplest segmentation design without the other techniques. The segment can be up to 80% full with 1KB segments, but gradually degrades to at most about 40% full as the segment size increases to 128KB. With just one bucket probing added ("+Probing"), the load factor under 128KB segments increases noticeably, and each additional technique (balanced insert, displacement, and stash buckets) raises it further.
Figure 12: Load factor of Dash-EH (2), Dash-EH (4), Dash-LH (2), level hashing and CCEH with respect to the number of items inserted into the hash table.

Figure 13: Scalability under different concurrency control strategies (reader-writer spinlocks vs. optimistic locking in Dash-EH) for positive and negative search.

Figure 12 shows the load factor of each whole hash table as records are inserted. CCEH's load factor stays between 35% and 43%, because CCEH only conducts four cacheline probings before triggering a split. As we noted in Section 4.3, long probing lengths increase load factor at the cost of performance, yet short probing lengths lead to premature splits. Compared to CCEH, Dash and level hashing achieve high load factors because of their effective load-factor improvement techniques. The "dipping" in the curves indicates that segment splits/table rehashing is happening. We also observe that with two stash buckets, denoted as Dash-EH/LH (2), we achieve up to 80% load factor, while the number with four stash buckets in Dash-EH (4) is 90%, matching that of level hashing.
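As a concrete illustration of the bucket load balancing idea discussed above, here is a minimal sketch (assumed structure and sizes, not Dash's actual code): a key may land in its target bucket or the immediately following probing bucket, and the insert picks whichever is less full, delaying overflows and therefore segment splits.

#include <cstdint>

constexpr int kSlotsPerBucket = 14;      // illustrative values
constexpr int kBucketsPerSegment = 64;

struct Bucket {
  int count = 0;                         // occupied slots
  uint64_t keys[kSlotsPerBucket];
  uint64_t vals[kSlotsPerBucket];
  bool insert(uint64_t k, uint64_t v) {
    if (count == kSlotsPerBucket) return false;
    keys[count] = k; vals[count] = v; ++count;
    return true;
  }
};

struct Segment {
  Bucket buckets[kBucketsPerSegment];
  // Balanced insert: choose the less-loaded of the target and probing bucket.
  bool balanced_insert(uint64_t key, uint64_t val, uint64_t hash) {
    Bucket& target  = buckets[hash % kBucketsPerSegment];
    Bucket& probing = buckets[(hash + 1) % kBucketsPerSegment];
    Bucket& pick = (target.count <= probing.count) ? target : probing;
    if (pick.insert(key, val)) return true;
    // Both candidates full: a real implementation would then try displacement
    // and the stash buckets before declaring the segment full (i.e., splitting).
    return false;
  }
};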
As Section 4.4 describes, PM data structures favor lightweight approaches, such as the optimistic locking in Dash, over traditional pessimistic locking. Figure 13 experimentally verifies this point by comparing pessimistic locking (reader-writer spinlocks) and optimistic locking in Dash-EH under (negative and positive) search workloads. The spinlock-based version does not scale well because of the extra PM writes needed to manipulate read locks. We repeated the same experiments on DRAM and found that both approaches scale well there. This captures yet another important finding that was omitted or not easy to discover in previous emulation-based studies.
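The following is a minimal sketch (assumed names, not Dash's exact code) of the version-based optimistic locking idea: readers never write the lock word, so a search issues no PM writes, while writers set a lock bit and bump the version so that concurrent readers can detect interference and retry.

#include <atomic>
#include <cstdint>

struct VersionLock {
  std::atomic<uint64_t> word{0};     // LSB = write-locked, upper bits = version

  // Reader side: snapshot an unlocked version before reading the bucket.
  uint64_t read_begin() const {
    uint64_t v;
    while ((v = word.load(std::memory_order_acquire)) & 1ull) { /* writer active, spin */ }
    return v;
  }
  // Reader side: the read is consistent only if the version is unchanged.
  bool read_validate(uint64_t v) const {
    return word.load(std::memory_order_acquire) == v;
  }

  // Writer side: acquire exclusively, release with an incremented version.
  void lock() {
    uint64_t expected;
    do {
      expected = word.load(std::memory_order_relaxed) & ~1ull;  // expect unlocked
    } while (!word.compare_exchange_weak(expected, expected | 1ull,
                                         std::memory_order_acquire));
  }
  void unlock() {
    word.fetch_add(1, std::memory_order_release);  // clears lock bit, bumps version
  }
};

// Usage: a search calls read_begin(), probes the bucket, then read_validate();
// on validation failure it simply retries, never writing the lock word in PM.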
It is desirable for persistent hash tables to recover instantly after a crash or clean shutdown to reduce service downtime. We test recovery time by first loading a certain number of records, then killing the process and measuring the time needed for the system to be able to handle incoming requests. Table 1 shows the time needed for each hash table to become ready to serve requests under different data sizes. The recovery time of Dash-EH/LH and level hashing is at the sub-second level and does not grow as data size increases, effectively achieving instant recovery. For Dash-EH/LH, the only needed work is to open the PM pool that is backing the hash table, and then read and possibly set the values of two variables. For 1280M records, level hashing requires an allocation size greater than the maximum allowed by the PMDK allocator (15.998GB). However, it also only needs a fixed amount of work to open the PM pool during recovery, so we expect its recovery time to remain the same (53ms) under larger data sizes. The recovery time of CCEH, in contrast, is linearly proportional to the size of the indexed data because it needs to scan the entire directory upon recovery; as the data size increases, so does the directory size, requiring more time to recover.
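The constant-time recovery path can be pictured with the following minimal sketch (hypothetical pool layout and variable names; only standard PMDK calls are assumed): recovery opens the pool and touches a couple of root-level variables, so the work is O(1) regardless of table size.

#include <libpmemobj.h>
#include <cstdint>
#include <cstdio>

struct TableRoot {          // hypothetical root object stored in the PM pool
  uint64_t clean_shutdown;  // set on clean close, cleared while running
  uint64_t global_version;  // e.g., an epoch used to lazily repair stale state
  // ... directory and segments live elsewhere in the pool ...
};

int main() {
  PMEMobjpool* pop = pmemobj_open("/mnt/pmem/dash_pool", "dash");  // path/layout are placeholders
  if (!pop) { perror("pmemobj_open"); return 1; }

  auto* root = static_cast<TableRoot*>(
      pmemobj_direct(pmemobj_root(pop, sizeof(TableRoot))));

  if (!root->clean_shutdown) {
    // Crash detected: bump the version so stale per-segment state is repaired
    // lazily by later accesses instead of scanning the whole table now.
    root->global_version++;
    pmemobj_persist(pop, &root->global_version, sizeof(root->global_version));
  }
  root->clean_shutdown = 0;  // will be set again on the next clean shutdown
  pmemobj_persist(pop, &root->clean_shutdown, sizeof(root->clean_shutdown));
  // The table can start serving requests immediately at this point.
  return 0;
}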
Table 1: Recovery time (ms) vs. data size. CCEH's recovery time scales with data size; for level hashing and Dash it remains constant because both need a fixed amount of work upon restart.

Hash table      Number of indexed records (million)
                  40    80   160   320   640   1280
Dash-EH           57    57    57    57    57     57
Dash-LH           57    57    57    57    57     57
CCEH             113   165   262   463   870   1673
Level hashing     53    53    53    53    53    (53)
Figure 14: Throughput of Dash-EH/LH at different time points upon restart, with one thread (left) and 24 threads (right).

Dash's lazy recovery may impact runtime performance. We measure this effect by recording the throughput over time of Dash-EH/LH after it is instantly recovered. The hash table is pre-loaded with 40 million records. We then kill the process while running a pure insert workload, restart the hash table, and start to issue positive search operations to observe how throughput changes over time. The result is shown in Figure 14; the red arrow indicates the time point when Dash is back online and able to serve new requests. Throughput is relatively low at the beginning: 0.1–0.3 Mops/s under one thread in Figure 14 (left), and 0.6 Mops/s under 24 threads in Figure 14 (right). Using more threads helps throughput return to normal earlier, as multiple threads can hit different segments and rebuild metadata or conclude SMOs in parallel. Throughput returns to normal within 0.2 seconds under 24 threads, while the number under one thread is 0.9 s.
It has been shown that the PM programming infrastructure can be a major overhead due to reasons such as page faults and cacheline flushes [22, 31]. We quantify its impact by running the same insert benchmark as in Section 6.4 under two allocators (PMDK vs. a customized allocator) and two Linux kernel versions (5.2.11 vs. 5.5.3). Our customized allocator pre-allocates and pre-faults PM to remove page faults at runtime. Though not practical, it allows us to quantify the impact of the PM allocator; it is not used in other experiments. As Figure 15 (left) shows, Dash-EH is not very sensitive to allocator performance under different kernel versions, as its allocation size (a single 16KB segment) is fixed and not huge. Dash-LH in Figure 15 (right), however, exhibits very low performance with the PMDK allocator on kernel 5.2.11 (∼25% of the number under 5.5.3). We found the reason to be a bug in kernel 5.2.11 that can cause large PM allocations to fall back to 4KB pages instead of 2MB huge pages (the PMDK default). This led to many more page faults and OS scheduler activity, which impacts Dash-LH the most, as linear hashing inherently requires multiple threads to compete for PM allocation during split operations, and slow PM allocation can block concurrent requests. The increased number of page faults also impacted recovery performance: for instance, under 160 million records, it took CCEH noticeably longer on kernel 5.2.11 than the number in Table 1 to recover. These results highlight the complexity of PM programming and call for careful design and testing involving both userspace (e.g., allocator) and OS support. We believe this is necessary as the PM programming stack is evolving rapidly, while practitioners and researchers have started to rely on it to build PM data structures.

(The kernel issue was caused by a patch discussed at lkml.org/lkml/2019/7/3/95 and fixed by the patch at lkml.org/lkml/2019/10/19/135, in 5.3.8 and newer. More details are available in our code repository.)

Figure 15: Impact of the PM allocator and OS support on Dash-EH (left) and Dash-LH (right): PMDK allocator vs. a custom pre-faulting allocator, on kernels 5.2.11 and 5.5.3.
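A minimal sketch of the pre-fault idea behind the customized allocator (our own illustration; the flags and fallback loop are assumptions, not the paper's code): the PM file is DAX-mapped and every page is touched once up front, so the allocation path sees no page faults while the benchmark runs.

#include <fcntl.h>
#include <sys/mman.h>   // MAP_SYNC/MAP_SHARED_VALIDATE may need <linux/mman.h> on older glibc
#include <unistd.h>
#include <cstddef>
#include <cstdint>

void* prefault_pm_pool(const char* path, size_t bytes) {
  int fd = open(path, O_CREAT | O_RDWR, 0644);
  if (fd < 0) return nullptr;
  if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) { close(fd); return nullptr; }

  // MAP_SHARED_VALIDATE | MAP_SYNC is the usual DAX mapping for PM;
  // MAP_POPULATE asks the kernel to fault pages in eagerly (assuming the
  // kernel and file system support this combination).
  void* base = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                    MAP_SHARED_VALIDATE | MAP_SYNC | MAP_POPULATE, fd, 0);
  close(fd);
  if (base == MAP_FAILED) return nullptr;

  // Touch every page once as a fallback so the mappings certainly exist
  // before the benchmark starts allocating from this pool.
  constexpr size_t kPage = 4096;
  auto* p = static_cast<volatile uint8_t*>(base);
  for (size_t off = 0; off < bytes; off += kPage) p[off] = p[off];
  return base;
}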
7. RELATED WORK
Dash builds upon many techniques from prior in-memory and PM-based hash tables, tree structures and PM programming tools.
In-Memory Hash Indexes.
Section 2.2 has covered extendible hashing [12] and linear hashing [29, 36], so we do not repeat them here. Cuckoo hashing [44] achieves high memory efficiency through displacement: a record can be inserted into one of two buckets computed using two independent hash functions; if both buckets are full, a randomly chosen record is evicted to its alternative bucket to make space for the new record. The evicted record is then inserted in the same way. MemC3 [13] proposes a single-writer, multi-reader optimistic concurrent cuckoo hashing scheme that uses version counters with a global lock. It also uses a tagging technique, similar to fingerprinting, to reduce the overhead of accessing pointers to variable-length keys. FASTER [5] further optimizes this by storing the tag in the unused high-order 16 bits of each pointer. libcuckoo [33] extends MemC3 to support multiple writers. Cuckoo hashing approaches may incur many memory writes due to consecutive cuckoo displacements. Dash limits the number of probes and uses optimistic locking to reduce PM writes.
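For readers unfamiliar with the displacement process, here is a minimal, illustrative sketch (not the code of any cited system) of cuckoo insertion with one slot per bucket; the kick limit and hash mixing constant are arbitrary choices.

#include <cstdint>
#include <functional>
#include <random>
#include <utility>
#include <vector>

class CuckooTable {
  struct Slot { bool used = false; uint64_t key = 0, val = 0; };
  std::vector<Slot> slots_;
  std::mt19937_64 rng_{42};

  size_t h1(uint64_t k) const { return std::hash<uint64_t>{}(k) % slots_.size(); }
  size_t h2(uint64_t k) const { return std::hash<uint64_t>{}(k ^ 0x9e3779b97f4a7c15ull) % slots_.size(); }

 public:
  explicit CuckooTable(size_t n) : slots_(n) {}

  bool insert(uint64_t key, uint64_t val, int max_kicks = 64) {
    for (int kick = 0; kick < max_kicks; ++kick) {
      for (size_t b : {h1(key), h2(key)})        // try both candidate buckets
        if (!slots_[b].used) { slots_[b] = {true, key, val}; return true; }
      // Both full: evict a random victim and continue inserting the victim,
      // which may trigger a chain of further displacements (and writes).
      size_t b = (rng_() & 1) ? h1(key) : h2(key);
      std::swap(key, slots_[b].key);
      std::swap(val, slots_[b].val);
    }
    return false;  // too many displacements; a real table would resize/rehash
  }
};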
Static Hashing on PM.
Most work aims at reducing PM writes, improving load factor and reducing the cost of full-table rehashing. Some proposals use multi-level designs that consist of a main table and additional levels of stashes to store records that cannot be inserted into the main table. PFHT [10] is a two-level scheme that allows only one displacement to reduce writes. It uses linked lists to resolve collisions in the stash, which may incur many cache misses during probing. Path hashing [67] further organizes the stash as an inverted complete binary tree to lower search cost. Level hashing [68] is a two-level scheme that bounds the search cost to at most four buckets. Upon resizing, the bottom level is rehashed to be 4× the size of the top-level table, and the previous top level becomes the new bottom level. Compared to cuckoo hashing, the number of buckets that need to be probed during a lookup is doubled. Dash also uses stashes to improve load factor, but most search operations only need to access two buckets thanks to the overflow metadata.

Dynamic Hashing on PM.
CCEH [38] is a crash-consistent extendible hashing scheme that avoids full-table rehashing [12]. To improve search efficiency, it bounds its probing length to four cachelines, but this can lead to low load factor and frequent segment splits. CCEH's recovery process requires scanning the directory upon restart, thus sacrificing instant recovery. Prior proposals often use pessimistic locking [38, 68], which can easily become a bottleneck due to the excessive PM writes incurred when manipulating locks; as a result, even conflict-free search operations cannot scale. NVC-hashmap [51] is a lock-free, persistent hash table based on split-ordered lists [52]. Although the lock-free design can reduce PM writes, it is hard to implement; the linked-list design may also incur many cache misses. Dash solves these problems with optimistic locking, which reduces PM writes and allows near-linear scalability for search operations.
Range Indexes.
Most range indexes for PM are B+-tree or trie variants and aim to reduce PM writes [2, 6, 7, 19, 30, 43, 59, 60, 64]. An effective technique is unsorted leaf nodes [6, 7, 43, 64], at the cost of linear scans, while hash indexes mainly reduce PM writes by avoiding consecutive displacements. FP-tree [43] proposes fingerprints in leaf nodes to reduce PM accesses; Dash adopts them to reduce unnecessary bucket probing and to efficiently support variable-length keys. Some work [43, 60] places part of the index in DRAM (e.g., inner nodes) to improve performance. This trades off instant recovery, as the DRAM part has to be rebuilt upon restart [31]. The same tradeoff can be seen in hash tables that place the directory in DRAM. With its bucket load balancing techniques, Dash can use larger segments and place the directory in PM, avoiding this tradeoff.
PM Programming.
PM data structures rely heavily on userspace libraries and OS support to easily handle such issues as PM allocation and space management. PMDK [20] is so far the most popular and comprehensive library. An important issue for these libraries is to avoid leaking PM permanently. A common solution [4, 20, 42, 50] is to use an allocate-activate approach so that the allocated PM is either owned by the application or the allocator upon a crash. At the OS level, PM file systems provide direct access (DAX) to bypass caches and allow pointer-based accesses [35]. Some traditional file systems (e.g., ext4 and XFS) have been adapted to support DAX. PM-specific file systems are also being proposed to further reduce overhead [8, 23, 28, 47, 57, 61]. We find that support for PM programming is still in its early stage and evolving quickly, with possible bugs and inefficiencies as Section 6.9 shows. This requires careful integration and testing when designing future PM data structures.
8. CONCLUSION
Persistent memory brings new challenges to persistent hash tables in both performance (scalability) and functionality. We identify that the key is to reduce both unnecessary PM reads and writes, whereas prior work focused solely on reducing PM writes, ignored many practical issues such as PM management and concurrency control, and traded off instant recovery capability. Our solution is Dash, a holistic approach to scalable PM hashing. Dash combines both new and existing techniques, including (1) fingerprinting to reduce PM accesses, (2) optimistic locking, and (3) a novel bucket load balancing technique. Using Dash, we adapted extendible hashing and linear hashing to work on PM. On real Intel Optane DCPMM, Dash scales well, with up to several-fold higher performance than the prior state of the art, while maintaining desirable properties including high load factor and sub-second instant recovery. Acknowledgements.
We thank Lucas Lersch (TU Dresden & SAP) and Eric Munson (University of Toronto) for helping us isolate bugs in the Linux kernel. We also thank the anonymous reviewers and shepherd for their constructive feedback, and the PC chair for coordinating the shepherding process. This work is partially supported by an NSERC Discovery Grant, Hong Kong General Research Fund (14200817, 15200715, 15204116), Hong Kong AoE/P-404/18, and Innovation and Technology Fund ITS/310/18.

REFERENCES
[1] A. Appleby. MurmurHash. https://sites.google.com/site/murmurhash/.
[2] J. Arulraj, J. J. Levandoski, U. F. Minhas, and P. Larson. BzTree: A high-performance latch-free range index for non-volatile memory.
PVLDB , 11(5):553–565, 2018.[3] BerkeleyDB. Access Method Configuration. https://docs.oracle.com/cd/E17076_02/html/programmer_reference/am_conf.html .[4] K. Bhandari, D. R. Chakrabarti, and H. Boehm. Makalu: fastrecoverable allocation of non-volatile memory. In
Proceedingsof the 2016 ACM SIGPLAN International Conference onObject-Oriented Programming, Systems, Languages, andApplications, OOPSLA 2016, part of SPLASH 2016,Amsterdam, The Netherlands, October 30 - November 4, 2016 ,pages 677–694, 2016.[5] B. Chandramouli, G. Prasaad, D. Kossmann, J. J. Levandoski,J. Hunter, and M. Barnett. FASTER: A concurrent key-valuestore with in-place updates. In
Proceedings of the 2018International Conference on Management of Data, SIGMODConference 2018, Houston, TX, USA, June 10-15, 2018 , pages275–290, 2018.[6] S. Chen, P. B. Gibbons, and S. Nath. Rethinking databasealgorithms for phase change memory. In
CIDR 2011, FifthBiennial Conference on Innovative Data Systems Research,Asilomar, CA, USA, January 9-12, 2011, Online Proceedings ,pages 21–31, 2011.[7] S. Chen and Q. Jin. Persistent b+-trees in non-volatile mainmemory.
PVLDB , 8(7):786–797, 2015.[8] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. C. Lee,D. Burger, and D. Coetzee. Better I/O throughbyte-addressable, persistent memory. In
Proceedings of the22nd ACM Symposium on Operating Systems Principles 2009,SOSP 2009, Big Sky, Montana, USA, October 11-14, 2009 ,pages 133–146, 2009.[9] R. Crooke and M. Durcan. A revolutionary breakthrough inmemory technology.
3D XPoint Launch Keynote , 2015.[10] B. Debnath, A. Haghdoost, A. Kadav, M. G. Khatib, andC. Ungureanu. Revisiting hash table design for phase changememory. In
Proceedings of the 3rd Workshop on Interactionsof NVM/FLASH with Operating Systems and Workloads,INFLOW 2015, Monterey, California, USA, October 4, 2015 ,pages 1:1–1:9, 2015.[11] D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. R.Stonebraker, and D. A. Wood. Implementation techniques formain memory database systems.
SIGMOD Rec. , 14(2):18,June 1984.[12] R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong.Extendible hashing–a fast access method for dynamic files.
ACM Trans. Database Syst. , 4(3):315–344, Sept. 1979.[13] B. Fan, D. G. Andersen, and M. Kaminsky. MemC3: Compactand concurrent MemCache with dumber caching and smarterhashing. In
Presented as part of the 10th USENIX Symposiumon Networked Systems Design and Implementation (NSDI 13) ,pages 371–384, 2013.[14] K. Fraser.
Practical lock-freedom . PhD thesis, University ofCambridge, UK, 2004.[15] H. Garcia-Molina and K. Salem. Main memory databasesystems: An overview.
IEEE Trans. on Knowl. and Data Eng. ,4(6):509–516, Dec. 1992.[16] S. Ghemawat and J. Dean. LevelDB. 2019. https://github.com/google/leveldb . [17] W. G. Griswold and G. M. Townsend. The design andimplementation of dynamic hashing for sets and tables in Icon.
Softw., Pract. Exper. , 23(4):351–367, 1993.[18] M. Herlihy and N. Shavit.
The art of multiprocessorprogramming . Morgan Kaufmann, 2008.[19] D. Hwang, W.-H. Kim, Y. Won, and B. Nam. Endurabletransient inconsistency in byte-addressable persistent B+-tree.In , pages 187–200, 2018.[20] Intel. Persistent Memory Development Kit. 2018. http://pmem.io/pmdk/libpmem/ .[21] Intel Corporation. Intel 64 and IA-32 architectures softwaredeveloper’s manual. 2015.[22] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu,A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, S. R. Dulloor,J. Zhao, and S. Swanson. Basic performance measurements ofthe Intel Optane DC Persistent Memory Module, 2019.[23] R. Kadekodi, S. K. Lee, S. Kashyap, T. Kim, A. Kolli, andV. Chidambaram. SplitFS: reducing software overhead in filesystems for persistent memory. In
Proceedings of the 27thACM Symposium on Operating Systems Principles, SOSP2019, Huntsville, ON, Canada, October 27-30, 2019 , pages494–508, 2019.[24] O. Kaiyrakhmet, S. Lee, B. Nam, S. H. Noh, and Y. Choi.SLM-DB: single-level key-value store with persistent memory.In ,pages 191–205, 2019.[25] H. Kimura. FOEDUS: OLTP engine for a thousand cores andNVRAM.
SIGMOD , pages 691–706, 2015.[26] O. Kocberber, B. Grot, J. Picorel, B. Falsafi, K. Lim, andP. Ranganathan. Meet the walkers: Accelerating indextraversals for in-memory databases. In
Proceedings of the46th Annual IEEE/ACM International Symposium onMicroarchitecture , MICRO-46, pages 468–479, 2013.[27] H. T. Kung and J. T. Robinson. On optimistic methods forconcurrency control.
ACM TODS , 6(2):213–226, June 1981.[28] Y. Kwon, H. Fingler, T. Hunt, S. Peter, E. Witchel, and T. E.Anderson. Strata: A cross media file system. In
Proceedingsof the 26th Symposium on Operating Systems Principles,Shanghai, China, October 28-31, 2017 , pages 460–477, 2017.[29] P.-A. Larson. Dynamic hash tables.
Commun. ACM ,31(4):446–457, Apr. 1988.[30] S. K. Lee, K. H. Lim, H. Song, B. Nam, and S. H. Noh.WORT: Write optimal radix tree for persistent memorystorage systems. In , pages 257–270, Santa Clara,CA, 2017. USENIX Association.[31] L. Lersch, X. Hao, I. Oukid, T. Wang, and T. Willhalm.Evaluating persistent memory range indexes.
PVLDB ,13(4):574–587, 2019.[32] J. J. Levandoski, D. B. Lomet, S. Sengupta, A. Birka, andC. Diaconu. Indexing on modern hardware: hekaton andbeyond. In
International Conference on Management of Data,SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014 , pages717–720, 2014.[33] X. Li, D. G. Andersen, M. Kaminsky, and M. J. Freedman.Algorithmic improvements for fast concurrent cuckoo hashing.In
Ninth Eurosys Conference 2014, EuroSys 2014, Amsterdam,The Netherlands, April 13-16, 2014 , pages 27:1–27:14, 2014.[34] H. Lim, M. Kaminsky, and D. G. Andersen. Cicada:Dependably fast multi-core in-memory transactions. In roceedings of the 2017 ACM International Conference onManagement of Data , SIGMOD ’17, pages 21–35, 2017.[35] Linux. Direct access for files. 2019. [36] W. Litwin. Linear hashing: A new tool for file and tableaddressing. In
Proceedings of the Sixth InternationalConference on Very Large Data Bases - Volume 6 , VLDB ’80,pages 212–223. VLDB Endowment, 1980.[37] L. Lu, T. S. Pillai, A. C. Arpaci-Dusseau, and R. H.Arpaci-Dusseau. WiscKey: Separating keys from values inSSD-conscious storage. In , pages 133–148, 2016.[38] M. Nam, H. Cha, Y. ri Choi, S. H. Noh, and B. Nam.Write-optimized dynamic hashing for persistent memory. In , pages 31–44, Feb. 2019.[39] F. Nawab, J. Izraelevitz, T. Kelly, C. B. M. III, D. R.Chakrabarti, and M. L. Scott. Dal´ı: A periodically persistenthash map. In , pages 37:1–37:16, 2017.[40] Oracle. Architectural overview of the Oracle ZFS storageappliance. 2015. .[41] Oracle. MySQL. 2019. .[42] I. Oukid, D. Booss, A. Lespinasse, W. Lehner, T. Willhalm,and G. Gomes. Memory management techniques forlarge-scale persistent-main-memory systems.
PVLDB ,10(11):1166–1177, 2017.[43] I. Oukid, J. Lasperas, A. Nica, T. Willhalm, and W. Lehner.FPTree: A hybrid SCM-DRAM persistent and concurrentB-tree for storage class memory. In
Proceedings of the 2016International Conference on Management of Data, SIGMOD ,pages 371–386, 2016.[44] R. Pagh and F. F. Rodler. Cuckoo hashing.
J. Algorithms ,51(2):122–144, May 2004.[45] S. Pelley, T. F. Wenisch, B. T. Gold, and B. Bridge. Storagemanagement in the NVRAM era.
PVLDB , 7(2):121–132,2013.[46] PostgreSQL Global Development Group. PostgreSQL. 2019. .[47] D. S. Rao, S. Kumar, A. S. Keshavamurthy, P. Lantz,D. Reddy, R. Sankaran, and J. Jackson. System software forpersistent memory. In
Ninth Eurosys Conference 2014,EuroSys 2014, Amsterdam, The Netherlands, April 13-16,2014 , pages 15:1–15:15, 2014.[48] Redis Labs. Redis. 2019. https://redis.io .[49] F. B. Schmuck and R. L. Haskin. GPFS: A shared-disk filesystem for large computing clusters. In
Proceedings of theFAST ’02 Conference on File and Storage Technologies,January 28-30, 2002, Monterey, California, USA , pages231–244, 2002.[50] D. Schwalb, T. Berning, M. Faust, M. Dreseler, andH. Plattner. nvm malloc: Memory allocation for NVRAM. In
International Workshop on Accelerating Data ManagementSystems Using Modern Processor and Storage Architectures -ADMS 2015, Kohala Coast, Hawaii, USA, August 31, 2015 ,pages 61–72, 2015. [51] D. Schwalb, M. Dreseler, M. Uflacker, and H. Plattner.NVC-Hashmap: A persistent and concurrent hashmap fornon-volatile memories. In
Proceedings of the 3rd VLDBWorkshop on In-Memory Data Mangement and Analytics ,IMDM ’15, pages 4:1–4:8, 2015.[52] O. Shalev and N. Shavit. Split-ordered lists: Lock-freeextensible hash tables.
J. ACM , 53(3):379–405, May 2006.[53] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams.The missing memristor found.
Nature , 453(7191):80–83,2008.[54] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden.Speedy transactions in multicore in-memory databases.
SOSP ,pages 18–32, 2013.[55] A. van Renen, V. Leis, A. Kemper, T. Neumann, T. Hashida,K. Oe, Y. Doi, L. Harada, and M. Sato. Managing non-volatilememory in database systems. In
Proceedings of the 2018International Conference on Management of Data , SIGMOD’18, pages 1541–1555, 2018.[56] A. van Renen, L. Vogel, V. Leis, T. Neumann, and A. Kemper.Persistent memory I/O primitives. In
Proceedings of the 15thInternational Workshop on Data Management on NewHardware, DaMoN 2019. , pages 12:1–12:7, 2019.[57] H. Volos, S. Nalli, S. Panneerselvam, V. Varadarajan,P. Saxena, and M. M. Swift. Aerie: flexible file-systeminterfaces to storage-class memory. In
Ninth EurosysConference 2014, EuroSys 2014, Amsterdam, TheNetherlands, April 13-16, 2014 , pages 14:1–14:14, 2014.[58] T. Wang and R. Johnson. Scalable logging through emergingnon-volatile memory.
PVLDB , 7(10):865–876, 2014.[59] T. Wang, J. Levandoski, and P.-A. Larson. Easy lock-freeindexing in non-volatile memory. In , pages461–472, April 2018.[60] F. Xia, D. Jiang, J. Xiong, and N. Sun. HiKV: A hybrid indexkey-value store for DRAM-NVM memory systems. In
Proceedings of the 2017 USENIX Annual TechnicalConference , USENIX ATC ’17, pages 349–362, 2017.[61] J. Xu and S. Swanson. NOVA: A log-structured file system forhybrid volatile/non-volatile main memories. In , pages 323–338,2016.[62] S. Xu, S. Lee, S. W. Jun, M. Liu, J. Hicks, and Arvind.Bluecache: A scalable distributed flash-based key-value store.
PVLDB , 10(4):301–312, 2016.[63] J. Yang, J. Kim, M. Hoseinzadeh, J. Izraelevitz, andS. Swanson. An empirical guide to the behavior and use ofscalable persistent memory. In , 2020.[64] J. Yang, Q. Wei, C. Chen, C. Wang, K. L. Yong, and B. He.NV-Tree: reducing consistency cost for NVM-based singlelevel systems. In , pages 167–181, 2015.[65] D. Zhang and P.-A. Larson. Lhlf: Lock-free linear hashing(poster paper). In
Proceedings of the 17th ACM SIGPLANSymposium on Principles and Practice of ParallelProgramming , PPoPP ’12, pages 307–308, New York, NY,USA, 2012. ACM.[66] H. Zhang, D. G. Andersen, A. Pavlo, M. Kaminsky, L. Ma,and R. Shen. Reducing the storage overhead of main-memoryoltp databases with hybrid indexes. In
Proceedings of the 2016nternational Conference on Management of Data , SIGMOD’16, pages 1567–1581, New York, NY, USA, 2016. ACM.[67] P. Zuo and Y. Hua. A write-friendly and cache-optimizedhashing scheme for non-volatile memory systems.
IEEETransactions on Parallel Distributed Systems (TPDS) ,29(5):985–998, 2018. [68] P. Zuo, Y. Hua, and J. Wu. Write-optimized andhigh-performance hashing index scheme for persistentmemory. In13th USENIX Symposium on Operating SystemsDesign and Implementation (OSDI 18)