Fine-Grain Checkpointing with In-Cache-Line Logging
Nachshon Cohen, Amazon, Haifa, Israel ([email protected])
David T. Aksun, EPFL, Lausanne, Switzerland ([email protected])
Hillel Avni, Huawei, Tel Aviv, Israel ([email protected])
James R. Larus, EPFL, Lausanne, Switzerland ([email protected])
Abstract
Non-Volatile Memory offers the possibility of implementing high-performance, durable data structures. However, achieving performance comparable to well-designed data structures in non-persistent (transient) memory is difficult, primarily because of the cost of ensuring the order in which memory writes reach NVM. Often, this requires flushing data to NVM and waiting a full memory round-trip time.

In this paper, we introduce two new techniques: Fine-Grained Checkpointing, which ensures a consistent, quickly recoverable data structure in NVM after a system failure, and In-Cache-Line Logging, an undo-logging technique that enables recovery of earlier state without requiring cache-line flushes in the normal case. We implemented these techniques in the Masstree data structure, making it persistent and demonstrating the ease of applying them to a highly optimized system and their low (5.9-15.4%) runtime overhead cost.
CCS Concepts: • Hardware → Non-volatile memory
Keywords: non-volatile memory, NVM, durable data structures, in-cache-line logging, InCLL, fine-grain checkpointing
ACM Reference Format:
Nachshon Cohen, David T. Aksun, Hillel Avni, and James R. Larus. 2019. Fine-Grain Checkpointing with In-Cache-Line Logging. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '19), April 13-17, 2019, Providence, RI, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3297858.3304046
Work done while the first author was a postdoc at EPFL.

ASPLOS '19, April 13-17, 2019, Providence, RI, USA
© 2019 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-6240-5/19/04. https://doi.org/10.1145/3297858.3304046
Introduction

Non-Volatile Memory (NVM) is fast, byte-addressable memory that retains its contents after a power failure or a system crash. New technologies, such as 3D-XPoint [15], PCM [17, 25], STT-RAM [12], and ReRAM [1, 29], promise NVM at low cost, thus blurring the line between durable storage and main memory. One important use of NVM is enabling the rapid restart of a failed system. Restarting an existing machine typically incurs a significant delay due to the need to read data from durable media such as a disk or SSD, parse it, and rebuild internal data structures. NVM can avoid most of these restart costs. Since NVM is byte-addressable, it is possible to store efficient, pointer-based structures, such as B+ trees or hash maps, directly in NVM. After a failure, these structures remain in NVM, enabling the system to resume immediately after rebooting and recovering data [8].

The main challenge in designing durable data structures for NVM is that processor caches are (and are likely to remain) transient. During a power failure, all memory writes that were not propagated from cache to NVM will be lost. The processor memory system also complicates the task of writing cache lines to NVM in a consistent manner. Cache lines are not written back to memory (NVM) in the order in which an application modifies them, but rather according to a memory system's low-level and frequently undocumented cache replacement policy. This creates a well-studied challenge: how to ensure that the durable copy of a data structure is well-formed (consistent) after a crash, even though NVM contains a mixture of stale and new cache lines?

Most NVM systems require programmer-specified transactions to ensure that a group of memory writes either all reach NVM or none of them do. Typically, a change to a structure is first logged (using either a redo or an undo log) in NVM and then applied to the structure. The log provides sufficient information for the recovery process to restore a structure to a consistent state, regardless of whether all structure modifications reached NVM. However, the system must ensure that the log is completely flushed to NVM before the structure itself is modified.
This requires the use of cache flush instructions that transfer dirty cache lines from the (transient) cache to the (durable) NVM. These write backs are only guaranteed to have completed after a fence instruction executes. These instructions are expensive, since they require a full round trip to NVM and reduce program performance by a significant amount. This overhead, moreover, can be incurred on an application's critical path, as a structure is being updated.

An alternative to transactions is checkpointing, another widely used technique for ensuring recoverability. At periodic intervals, an application's entire state is saved on durable media. After a failure, the last recorded checkpoint is restored to memory and the computation resumes from this point, requiring the re-execution of the work done between the checkpoint and the failure. Since copying the entire state of an application to slow durable media is expensive, most systems take checkpoints at infrequent intervals (minutes to hours) to reduce this cost. The interval between checkpoints is a tradeoff between the overhead of recording checkpoints and the cost of re-executing the lost computation.

In this work, we introduce Fine-Grained Checkpointing, which uses frequent checkpoints to NVM to ensure persistence at low cost.
Instead of ensuring that every memory write is logged or propagates to NVM, we partition an application's execution into epochs and ensure that, after a crash, data structures can be restored to their state at the end of the most recently completed epoch (Nawab calls a similar approach "periodic persistence" [23]). Our system flushes the processors' caches to NVM at the start of an epoch, thus ensuring that at this point NVM contains all modified data.

This approach has many advantages. The number of cache lines that must be flushed is bounded by the cache size, and modified cache lines may already have been written back during the epoch, so the cost of recording a checkpoint is low. The modified lines are flushed in a batch by hardware, further reducing the cost. Our approach's low cost allows short checkpoint intervals, e.g., tens of milliseconds (we use 64 ms), thereby reducing both the potential data loss and the recovery time. In addition, a software developer need not annotate an application to delimit fine-grained transactions; he or she only needs to ensure that the application's state is recoverable at the end of an epoch.

Our approach differs from a traditional checkpoint in that there is no distinct copy of a data structure or a memory image: the in-NVM data structure also serves as the durable checkpoint. (Unlike most checkpointing, which resumes execution at the point at which a checkpoint occurred, our goal is to restore the persistent data structures to their state at the checkpoint, so that a program can restart its execution.) The challenge is to keep this checkpoint consistent as the structure is modified. After a crash, NVM state will consist of a mixture of the state at the beginning of the epoch, which must be kept, and modifications from the failed epoch, which must be discarded. The system must be able to distinguish between these two intermixed states and recover the one from the beginning of the epoch. One solution to this problem is to log the old memory value before each write. The log can be applied, in reverse application order, to roll back writes and revert to the state at the beginning of the epoch. But logging itself requires care to ensure that the log reaches NVM before the structure is modified. Therefore, it again introduces the cost of cache-line flushes on the critical path.

To solve the problem of fine-grained modifications to NVM, we introduce the new concept of an In-Cache-Line Log (InCLL).
An InCLL serves the same role as an undo log, but instead of using an external log, the InCLL is placed in the same cache line as the data structure field being logged. InCLL relies on the Persistent Cache Store Order (PCSO) memory-ordering model of NVM (two writes to the same cache line reach NVM in program order) to ensure the ordering requirements of the log, without introducing cache flushes and delays.

The main limitation of an InCLL is its limited capacity. Since it resides in the same cache line as the data, an InCLL must be small and cannot handle all modifications. If an object is modified multiple times during an epoch, InCLLs may be insufficient to provide crash recoverability. In this case, our approach falls back on object-level logging. After the entire object is logged, subsequent modifications in the epoch do not require additional actions. Together, the combination of InCLL and object-level logging drastically reduces the number of synchronous writes to NVM.

To validate this approach, we applied Fine-Grained Checkpointing to Masstree [21], a cache-efficient data structure that combines a B+ tree and a trie. We also implemented a durable memory allocator based on this approach. Measurements show that the overhead of our scheme is low and restart time is dramatically reduced.

The main contributions of this paper are:
• Fine-Grained Checkpointing, a technique to ensure a consistent, quickly recoverable data structure in NVM after a system failure.
• In-Cache-Line Logging, an undo-logging technique that enables recovery of the state from the beginning of an epoch without requiring cache-line flushes in the normal case.
• An implementation of these techniques for the Masstree data structure, which makes it persistent and demonstrates their application in a highly optimized system and their low (5.9-15.4%) runtime overhead.
Memory Model

In this paper, we use the Persistent Cache Store Order (PCSO) memory-ordering model for NVM [9]. Cache lines are written back to NVM according to a computer's (unspecified) cache replacement policy, so we cannot assume any specific write-back behavior. An application may explicitly force specific cache lines to be written to NVM by using a cache-line write-back instruction, such as the x64's clflushopt or clwb. These instructions are asynchronous: they only initiate a memory transfer but do not wait until the data actually reaches NVM. To ensure that a write back completes, the application must issue a fence instruction, such as sfence, which delays CPU execution until the outstanding write-back instructions finish. Since this instruction waits until the data reaches NVM, it is far more expensive than a normal (cached) memory reference.

While ensuring the order in which writes to different cache lines reach NVM is expensive, ordering writes to the same cache line is essentially free. If two writes target the same cache line, the order in which they reach the cache corresponds to the order in which they reach NVM. Preserving the order of cache writes can be done with release memory ordering in C++11, which introduces a happens-before relation between the writes [4, 20]. On the x64 architecture, the release memory fence incurs no runtime overhead and only limits the ability of the compiler to reorder writes.

Formally, given two writes X and W, we say that X <p W if X is written to persistent memory no later than W. X <hb W is the standard happens-before relationship, and c(X) denotes the cache line that X writes to. The following holds [9]:
• W <hb writeback(c(W)) <hb fence <hb X ⟹ W <p X (explicit flush).
• W <hb X ∧ c(W) = c(X) ⟹ W <p X (granularity).
Our InCLL technique relies on the second ordering guarantee: if two writes target the same cache line, a happens-before relation is sufficient to ensure persistence ordering.
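To make these guarantees concrete, the sketch below (our illustration, not code from the paper's implementation) shows both idioms on x64: the cross-line case pays for a write back and a fence, while the same-line case needs only a compiler-level release fence.

#include <atomic>
#include <immintrin.h>  // _mm_clwb, _mm_sfence (x64; compile with -mclwb)

// Case 1 (explicit flush): writes to different cache lines must be ordered
// with a write back and a fence before the dependent write.
void persist_cross_line(long* log_entry, long* data, long v) {
    *log_entry = v;       // W: write the log entry
    _mm_clwb(log_entry);  // initiate a write back of W's cache line
    _mm_sfence();         // wait until the write back completes
    *data = v;            // X: PCSO now guarantees W reaches NVM before X
}

// Case 2 (granularity): two writes to the same cache line reach NVM in
// program order, so a release fence, which has no runtime cost on x64,
// suffices.
struct alignas(64) Line { long undo; long val; };  // fields share a cache line

void persist_same_line(Line* l, long v) {
    l->undo = l->val;  // W: save the old value in the same cache line
    std::atomic_thread_fence(std::memory_order_release);  // compiler ordering only
    l->val = v;        // X: W reaches NVM no later than X, with no flush
}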
Masstree

Masstree [21] is a production-quality ordered-set data structure that has been used to build in-memory databases such as Silo [27]. Masstree is a combination of a trie and a B+ tree, implemented to carefully exploit caching, prefetching, optimistic navigation, and fine-grained locking. Below, we sketch the details of Masstree that are necessary to understand the changes that make it durable.

Masstree uses two types of nodes: internal nodes and leaf nodes. There are roughly an order of magnitude more leaf nodes than internal nodes, and leaf nodes are modified much more frequently, so our checkpointing focuses on the leaf nodes. The number of items in a leaf node is a parameter of Masstree's implementation; the default implementation uses 15 keys and 15 pointers to values. The keys reside in the keys array and the value pointers reside in the vals array. Listing 1 illustrates some details of a leaf node.

Listing 1. Masstree's node structure

class basenode;   // lock, version, meta information

template <int width = 15>
class leafnode : public basenode {
    basenode *parent, *prev, *next;
    uint64_t permutation;   // which key/val entries are active
    keytype keys[width];
    valuetype *vals[width];

    void remove(keytype key) {
        int idx = find_idx(key);
        remove_idx(&permutation, idx);
    }
    void insert(keytype key, valuetype *val) {
        int idx = insert_idx(&permutation);
        keys[idx] = key;
        vals[idx] = val;
    }
    void update(int idx, valuetype *val) {
        vals[idx] = val;
    }
};

The permutation field records valid key-value pairs, in other words, which entries in the arrays are occupied. We can consider the permutation field to be a bitmap specifying whether an index is in use, although, in practice, it also orders the entries. Deleting a key-value pair from a node modifies only the permutation field. Inserting a new key-value pair modifies both the permutation field and an unused entry pair in the keys/vals arrays. Updating an existing key modifies only the vals entry.

Masstree supports splitting and merging of nodes, but these happen far less frequently than modifications of leaf nodes. The full details of the Masstree algorithm are quite involved [21] but are not necessary to understand this paper.

Overview

The main contributions of this paper are Fine-Grained Checkpointing and In-Cache-Line Logging, which we illustrate by showing how to make Masstree durable. The approach uses a combination of three techniques: fine-grained checkpoints, in-cache-line logging, and external logging. This section sketches these three mechanisms, and § 4 provides more detail.
Fine-Grained Checkpointing
Execution is broken into epochs; our implementation uses 64 ms, the Masstree memory-reclamation interval, though longer or shorter intervals are feasible. During an epoch, NVM contains a mixture of the memory state from the previous epoch and some, but not all, of the memory writes executed during the current epoch. At the start of an epoch, the entire cache is flushed to NVM, ensuring that all modifications from the previous epoch are safely stored in NVM. The cost of flushing the cache is low, as its size is bounded and some modifications may already have been written back to NVM during the previous epoch (§ 6.2). (Our code is online: https://github.com/epfl-vlsc/Incll)
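The checkpoint step at an epoch boundary can be summarized with a short sketch. The helper names below (flushEntireCache, discardExternalLog, persistEpochIndex) are hypothetical stand-ins for the mechanisms described in this paper, not functions from the actual implementation:

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> globalEpoch;   // monotonically increasing epoch index

void flushEntireCache();             // assumed: wbinvd, invoked via the kernel
void discardExternalLog();           // assumed: resets the NVM-resident log
void persistEpochIndex(uint64_t e);  // assumed: durably records the new index

void advanceEpoch() {
    flushEntireCache();    // every write of the finished epoch is now in NVM
    discardExternalLog();  // the logged state is durable; the log can be reused
    persistEpochIndex(globalEpoch + 1);
    globalEpoch.fetch_add(1);  // threads now tag InCLLs with the new epoch
}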
External Logging
The external log is a standard undo log. Under certain circumstances, when a Masstree node is modified during an epoch, the entire node is stored in the log so that subsequent modifications can be reversed. To ensure persistence ordering, the log is written to NVM and an sfence operation is issued before the node is modified.

An external log is the standard tool for ensuring the consistency of durable data structures in NVM [5, 9, 10, 13, 14, 16, 19, 28]. It is possible to log at different granularities: a word, an object, or a page. We choose object-level granularity, so that when a single word in a Masstree node is modified, the entire node is recorded in the external log. External logging's primary benefit is simplicity: it ensures durability without requiring pervasive changes to the Masstree algorithm. However, it may also have performance benefits, since a node is only logged once, even if it is modified many times, as during merges or splits.

In our approach, we always use the external log for these infrequent, complex modifications. In addition, changes to internal (non-leaf) nodes are infrequent, so they are also handled by the external log (§ 6.1). Leaf nodes are logged only when required by the InCLL algorithm described below. The external log is discarded after the cache is flushed at the start of an epoch, since all of the logged changes will have been stored in NVM.
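The following sketch illustrates object-granularity undo logging of this kind. The entry layout and allocLogEntry are assumptions for illustration; they are not the implementation's actual code:

#include <cstddef>
#include <cstring>
#include <immintrin.h>  // _mm_clwb, _mm_sfence

constexpr size_t kLineSize = 64;

struct LogEntry {                // one undo record in the NVM-resident log
    void*  addr;                 // the node's home address
    size_t size;                 // number of bytes saved
    unsigned char content[512];  // old node image (a leaf node fits easily)
};

LogEntry* allocLogEntry();       // assumed: bump-allocates from the log area

// Save a whole node before its first complex modification in an epoch.
bool logNode(void* node, size_t size) {
    LogEntry* e = allocLogEntry();
    e->addr = node;
    e->size = size;
    std::memcpy(e->content, node, size);
    for (size_t off = 0; off < sizeof(LogEntry); off += kLineSize)
        _mm_clwb(reinterpret_cast<unsigned char*>(e) + off);  // write back the record
    _mm_sfence();  // the undo record must be in NVM before the node changes
    return true;   // the node is now logged for the rest of the epoch
}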
InCLL
In-Cache-Line Logging (InCLL) is a technique for logging modifications to a node without waiting on NVM. InCLL embeds an undo log inside the same cache lines as a Masstree leaf node. When a node is modified for the first time in an epoch, the InCLL stores the old value of the modified field. Since the log resides in the same cache line as the data, no write backs or fences are required to ensure that the log reaches NVM with the modification.

The primary benefit of InCLL is the low cost of logging. But the capacity of each node's log is limited, since an InCLL resides in the same cache line as the data. Each log entry also reduces the number of Masstree array entries, which degrades the cache efficiency of the Masstree structure. Our durable Masstree algorithm logs, in the typical case, only one or two modifications per node per epoch. If a leaf node is modified repeatedly, external logging is likely to be used.

We find that the combination of a limited InCLL and an external log works extremely well. If Masstree updates during an epoch are random, most will access different nodes, and the external log will be used infrequently. If, on the other hand, modifications are ordered, there may be many writes to the same node, in which case the external log, which only records the node once, will perform well.
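The essence of the technique can be seen in isolation. The sketch below, with hypothetical names and stripped of all Masstree details, logs a field's old value into an undo slot in the same cache line; PCSO's granularity guarantee then orders the log entry before the update without any flush:

#include <atomic>
#include <cstdint>

// The undo slot shares a cache line with the field it protects, so the two
// plain stores reach NVM in program order (PCSO granularity guarantee).
struct alignas(64) InCLLField {
    uint64_t value;      // the live field
    uint64_t undo;       // old value, logged in the SAME cache line
    uint64_t undoEpoch;  // epoch in which the undo value was taken
};

void update(InCLLField* f, uint64_t newValue, uint64_t curEpoch) {
    if (f->undoEpoch != curEpoch) {  // first modification in this epoch
        f->undo = f->value;          // in-cache-line undo log
        f->undoEpoch = curEpoch;
        std::atomic_thread_fence(std::memory_order_release);  // keep the order
    }
    f->value = newValue;             // no clwb or sfence on the fast path
}

// After a crash, a field whose undoEpoch names the failed epoch is rolled
// back: if (isFailedEpoch(f->undoEpoch)) f->value = f->undo;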
Epochs

In our scheme, execution is partitioned into 64 ms epochs. We use the wbinvd instruction to flush the entire cache at the start of an epoch. Since Masstree uses epoch-based reclamation for allocating and de-allocating nodes, we reuse its mechanism and interval for our epochs as well. Shorter intervals would raise the overhead of cache flushing (currently about 2%, § 6.2) but reduce the number of updates that might be lost or need to be re-executed after a failure.

Epochs are assigned a monotonically increasing index, which is stored durably. With 64 ms epochs, a 32-bit index wraps after more than eight years. (If the data lasts longer, a background thread could run once every 8 years and reset all indices to zero. Since the duration of graduate studies is less than 8 years, this feature is not in our current implementation.) We also keep track of failed epochs. During recovery, the epoch in which a crash occurred is added to the set of failed epochs. Modifications made during a failed epoch will not be visible after recovery.
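A minimal sketch of this bookkeeping appears below; the helper names are assumptions (the real implementation stores the epoch index and the failed-epoch set durably):

#include <cstdint>
#include <set>

std::set<uint32_t> failedEpochs;  // rebuilt during recovery
uint32_t currExecEpoch;           // first epoch of the current execution

void onRecovery(uint32_t lastDurableEpoch) {
    failedEpochs.insert(lastDurableEpoch);  // the epoch the crash interrupted
    currExecEpoch = lastDurableEpoch + 1;   // execution resumes in a fresh epoch
}

bool isFailedEpoch(uint32_t e) {  // consulted before applying any log entry
    return failedEpochs.count(e) != 0;
}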
Leaf Node InCLLs

In our durable Masstree algorithm, each node consists of 14 keys, 14 pointers to values, and two InCLLs (one fewer key and pointer than the standard implementation). Each InCLL requires 8 bytes, like a pointer. The InCLL entries are carefully cache aligned: the first resides immediately before the 14-pointer array and the second immediately after it. Hence, the first InCLL resides in the same cache line as pointers 0-6 and the second in the same cache line as pointers 7-13 (Figure 1). Each of these InCLLs can record a single value modification per epoch.

We use an additional InCLL for the permutation field, which records whether a location in the pointer array is occupied and is modified when a key-value pair is inserted or removed. The InCLL used to log the permutation field is denoted InCLLp, and the two InCLLs used to log values are denoted InCLL1 and InCLL2 for the first and second cache lines, respectively. The operation of InCLLp differs from that of InCLL1,2, as described below.

To understand the operation of the InCLLp log, we describe how insert and delete operate. When a new key-value pair is inserted into a leaf node, an unoccupied location is found using the permutation field. The appropriate entry in the keys array is set to the key, and the corresponding entry in the vals array is set to the value. Furthermore, the permutation field is updated to record that the entry is now occupied. If there is no free entry, the node must be split, a case that is handled by external logging.

After a crash, a leaf node must be returned to the same state as at the start of the current epoch. There are a number of cases to consider. If an entry was unoccupied at the beginning of the current epoch, there is no need to restore its keys or vals fields. Hence, the only field that must be logged is the Masstree permutation field (in permutationInCLL). Even if multiple key-value pairs are inserted during an epoch, only the permutation field needs to be logged. Since this field is logged in InCLLp, there is no need to use the external log for multiple consecutive writes.

Deletion in Masstree is similar to insertion: only the permutation field is modified, to indicate that the entry for the key is now unoccupied, so no external logging is necessary in this case either. Moreover, if a node is modified by inserting new key-value pairs and subsequently removing key-value pairs, InCLLp logging still suffices, since restoring permutation leaves the node in its original state. But it is not possible to simply log the permutation field in a mixed sequence of deletions followed by insertions: an entry that is deleted might be overwritten by a subsequent insertion, which destroys the original key-value pair that should be restored after a crash. Thus, if a key-value pair is removed and, in the same epoch, a key-value pair is inserted into the same entry, the entire node must be externally logged. InCLLp contains a boolean indicator (insAllowed) recording that insertions do not require external logging; the indicator is initially true and is set to false during a delete.

Since a data structure might be large, it is impractical to clear all InCLL entries when an epoch advances.
The InCLLp also records the epoch number in which the InCLL was used. During recovery, the log is applied (i.e., the permutation field is restored to its old value) only if the epoch number corresponds to a failed epoch. The last field of InCLLp is a boolean (logged) indicating that the node was saved in the external log; if the node was logged in the current epoch, no further logging is required. Overall, InCLLp consists of four fields:
• nodeEpoch stores the epoch number for the InCLLp.
• permutationInCLL stores the value of the permutation field at the beginning of epoch nodeEpoch. If the system crashes during this epoch, permutation is recovered from permutationInCLL.
• insAllowed controls whether insertions are permitted to use the InCLL.
• logged records whether the node was logged in the external log during epoch nodeEpoch.
Code depicting the structure of a leaf node appears in Listing 2; the layout is illustrated in Figure 1.
Listing 2. Durable Masstree's node structure

class basenode;   // lock, version, meta information

struct ValInCLL {
    long idx : 4;            // logged entry index; INVALIDIDX if unused
    long ptr : 44;           // 48-bit pointer minus 4 least-significant bits
    long lowNodeEpoch : 16;  // lower 16 bits of the epoch
    static const long INVALIDIDX = -1;  // sentinel (exact value elided in the extracted text)
    ValInCLL(valuetype *p, long i);
    ValInCLL() : idx(INVALIDIDX), ptr(0) {}
};

template <int width = 14>
class leafnode : public basenode {
    basenode *parent, *prev, *next;
    uint64_t nodeEpoch : 62;          // InCLLp field
    bool logged : 1, insAllowed : 1;  // InCLLp fields (semantically transient)
    uint64_t permutationInCLL;        // InCLLp field
    uint64_t permutation;             // which key/val entries are active
    keytype keys[width];
    alignas(64) struct {} ALIGN;      // align to a cache-line boundary
    ValInCLL InCLL1;                  // same cache line as vals[0..6]
    valuetype *vals[width];
    ValInCLL InCLL2;                  // same cache line as vals[7..13]
};

Figure 1. Durable Masstree leaf node. The InCLLs are shown in orange. nodeEpoch and permutationInCLL reside in the same cache line as permutation. Values have two InCLL entries, InCLL1,2, located in their cache lines.

Next we consider inserting or deleting a key-value pair in a leaf node and show how InCLLp is used (Listing 3). When a key-value pair is inserted into or removed from a leaf node, the thread first checks whether nodeEpoch is equal to the current epoch (curEpoch). If they are not equal, the current modification is the first modification of the node in the current epoch. In that case, the old (pre-modification) permutation value is saved in permutationInCLL, effectively logging its old value. Afterward, nodeEpoch is set to the current epoch, insAllowed is set to true, and logged to false.

The persistence ordering of the writes to permutationInCLL and nodeEpoch is important. If the write to nodeEpoch reaches NVM before permutationInCLL does, recovery might fail: if a crash occurs after the new epoch reaches NVM but before permutationInCLL does, the recovery procedure (discussed below) will assume that the node was modified in the most recent epoch and must be recovered, and it would incorrectly recover the node using a stale value of permutationInCLL (belonging to a previous epoch).

We therefore use the following persistence ordering. First, set permutationInCLL to the old (pre-modification) permutation value. Second, set nodeEpoch to the current epoch. Third, modify permutation to reflect the insertion or deletion of the key-value pair. The fields insAllowed and logged are semantically transient and do not require persistence ordering. This ordering ensures that the node can always be recovered. If only the first write reaches NVM, the node is not recovered, as nodeEpoch does not record a failed epoch. If the first and second writes reach NVM, the node is recovered; in this case, both permutation and permutationInCLL hold the value from the beginning of the failed epoch, so recovery is unnecessary but harmless. If all three writes reach NVM, the node is recovered correctly using permutationInCLL. These three fields reside in the same cache line, so the ordering is ensured by a release memory fence, without writing back or waiting for a full round trip to NVM.
If nodeEpoch is equal to curEpoch, the node was already modified during the current epoch. As mentioned above, consecutive insertions or removals can use the InCLL. The only case in which further action is needed is when logged is false (that is, the node was not logged in the external log), the current operation is an insertion of a new key-value pair, and insAllowed is false. In this case, the node is logged in the external log before it is modified. Code for insertions and deletions appears in Listing 3.

Updating Values

Updating the value of a key already in a leaf is slightly more complex than insertion or deletion. The two InCLLs embedded in the value array of a node, denoted InCLL1,2, are used to log the old value when it is updated. These InCLLs, unlike InCLLp, require careful encoding to reduce their size. Each of InCLL1,2 can log one of seven possible fields, so it contains an additional field that records the entry that was modified. Using two 64-bit words for each of InCLL1,2 would reduce the number of values in the array to 12 in two cache lines, incurring a performance penalty. Therefore, it is desirable to use a single word for each of InCLL1,2.

To compact InCLL1,2, we observe that the values stored by Masstree are pointers to the actual values. In the current x64 architecture, pointers are represented in a canonical form in which only the lower 48 bits are used: in a valid memory address, the upper 16 bits must equal the value of the 47th bit. In addition, all memory allocations are aligned on 16-byte boundaries, so the least significant four bits must be zero. We pack InCLL1,2 as follows. Bits 0-3 represent the index of the pointer that is logged; each InCLL covers seven entries (0-6 for InCLL1 and 7-13 for InCLL2), so 4 bits suffice to indicate all array entries. Bits 4-47 hold the logged pointer. Bits 48-63 represent the lower 16 bits of the epoch in which the node was modified, denoted lowNodeEpoch. (With future 5-level paging, we can fall back on external logging if the stored address is higher than 2^47, or exploit stricter alignment restrictions.) The lower 16 bits of the current epoch are combined with the higher bits of InCLLp's nodeEpoch to produce the full epoch number in which the InCLL was used. During an update, we check the difference between the current epoch and InCLLp's nodeEpoch; if 16 bits are insufficient to correctly encode the epoch, we fall back on the external log. This happens approximately once an hour (2^16 epochs of 64 ms each).

When the value of an existing key is updated, the thread first checks whether InCLLp's nodeEpoch is equal to curEpoch. If it is not, this is the first time the node is modified in the current epoch. The thread computes which InCLL must be used, depending on whether the modified entry's index is 6 or lower. The old value, the index, and the lower 16 bits of the epoch are encoded into a single word and stored in the appropriate InCLL.

If InCLLp's nodeEpoch is equal to curEpoch, the node has been modified during the current epoch. But if only the InCLL of the other cache line was used, it is still possible to use the unused InCLL. In addition, if the pointer being modified was previously logged in the InCLL, there is no need to record it again.
The latter case is valuable when keys are drawn from a skewed distribution: if some keys are popular and modified multiple times during an epoch, there is no need to use the external log. The external log is likely to be necessary only if two (or more) popular keys reside in the same cache line of a leaf node. Code for updates appears in Listing 3.

We use the external log for modifications that are more complex or less common than these leaf updates. This has the benefit of requiring minimal changes to the Masstree code, while maintaining a relatively low overhead. Splitting and merging of leaf nodes are infrequent and are handled by logging the affected nodes. Also, all modifications to internal tree nodes are logged.
Listing 3. Durable Masstree operations

void leafnode::InCLL(bool InCLLallowed, uint64_t permInCLL,
                     ValInCLL v1, ValInCLL v2) {
    if (globalEpoch != nodeEpoch) {     // first modification in this epoch
        insAllowed = true;
        logged = false;
        if (higher(globalEpoch) != higher(nodeEpoch))
            logged = logNode();         // 16-bit epoch window exceeded
        if (!logged) {
            permutationInCLL = permInCLL;
            InCLL1 = v1;
            InCLL2 = v2;
            // order the InCLL writes before nodeEpoch
            atomic_thread_fence(memory_order_release);
        }
        nodeEpoch = globalEpoch;
        InCLL1.lowNodeEpoch = lower(nodeEpoch);
        InCLL2.lowNodeEpoch = lower(nodeEpoch);
    } else if (!logged && !InCLLallowed)
        logged = logNode();
    atomic_thread_fence(memory_order_release);
}

void leafnode::remove(keytype key) {
    int idx = find_idx(key);
    InCLL(true, permutation, ValInCLL(), ValInCLL());
    insAllowed = false;   // a later insert in this epoch needs the external log
    remove_idx(&permutation, idx);
}

void leafnode::insert(keytype key, valuetype *val) {
    int idx = insert_idx(&permutation);
    InCLL(insAllowed, permutation, ValInCLL(), ValInCLL());
    keys[idx] = key;
    vals[idx] = val;
}

void leafnode::update(int idx, valuetype *val) {
    ValInCLL &cur = (idx <= 6) ? InCLL1 : InCLL2;
    bool InCLLallowed = (cur.idx == idx || cur.idx == ValInCLL::INVALIDIDX);
    ValInCLL vc1 = ValInCLL(vals[idx], idx);
    ValInCLL vc2 = ValInCLL();   // invalid entry
    if (idx >= 7) swap(vc1, vc2);
    InCLL(InCLLallowed, permutation, vc1, vc2);
    vals[idx] = val;
}

The external log is also used whenever a modification cannot be handled by the InCLL. If two values in the same cache line are modified in the same epoch, the external log is used. Similarly, if a key-value pair is removed and another is inserted into the same entry in the same epoch, we fall back on the external log.

To reduce the cost of checking whether an internal node was logged, we add an epoch number to each internal node that records the epoch in which the node was logged. A simple comparison against the current epoch number prevents multiple logging. Our algorithm locks a node before logging it to avoid races. This provides an additional benefit during recovery: since a node appears at most once in the external log, there are no dependencies among log entries, and they can be restored in parallel. In contrast, standard undo logging must be applied in reverse application order, limiting concurrency in the recovery procedure.
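To make the ValInCLL encoding concrete, the sketch below packs and unpacks the index, pointer, and low epoch bits as described above. It is our illustration of the layout, not the implementation's code:

#include <cstdint>

// One 64-bit word: bits 0-3 index, bits 4-47 pointer (canonical, 16-byte
// aligned), bits 48-63 the low 16 bits of the epoch.
struct PackedValInCLL {
    uint64_t word;

    static PackedValInCLL pack(void* ptr, unsigned idx, uint64_t epoch) {
        uint64_t p = reinterpret_cast<uint64_t>(ptr);
        return { (idx & 0xFULL)                  // bits 0-3: logged entry index
               | (p & 0x0000FFFFFFFFFFF0ULL)     // bits 4-47: the old pointer
               | ((epoch & 0xFFFFULL) << 48) };  // bits 48-63: low epoch bits
    }
    unsigned idx()      const { return word & 0xF; }
    void*    ptr()      const { return reinterpret_cast<void*>(word & 0x0000FFFFFFFFFFF0ULL); }
    uint64_t lowEpoch() const { return word >> 48; }
};

// The full epoch is reconstructed from InCLLp's nodeEpoch, e.g.:
//   fullEpoch = (nodeEpoch & ~0xFFFFULL) | v.lowEpoch();
// falling back on the external log when 16 bits cannot encode the difference.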
Recovery

After an abrupt crash, the durable Masstree is recovered as follows. First, the external log is applied before execution resumes, by copying the contents of each node from the log to its corresponding node. As mentioned above, there are no dependencies among log entries, so the log can be applied concurrently with minimal or no synchronization. Pseudo-code illustrating recovery appears in Listing 4.

InCLLs must also be applied to recover nodes. But unlike the external log, the InCLLs are embedded inside the durable Masstree nodes; applying all of them before execution resumes would require a traversal of the entire tree, which would cause a long delay. To avoid this, the InCLL restores are applied lazily, during tree traversals.

When a thread attempts to access a node, it first checks whether the node's nodeEpoch is less than the epoch number of the current execution. If it is, recovery is applied to the node before continuing with the access. To avoid races when multiple threads attempt to recover a node simultaneously, we use locking. However, it is not possible to use the leaf's own lock: the lock's state is not preserved, so after a failure it might appear to be held, and attempting to acquire it could result in a deadlock, even if only a single thread is attempting to lock the node. Instead, the system defines an array of (transient) locks for recovery. When a thread attempts to recover a node, it hashes the leaf address to find the appropriate recovery lock. After acquiring the lock, the thread checks whether the node's epoch is still lower than the first epoch of the current execution. If it is, the thread attempts to recover the node from its InCLLs. First, it checks whether nodeEpoch is a failed epoch; if so, the permutation field is recovered by copying the permutationInCLL field into it. Second, the thread reconstructs the epoch of InCLL1,2 by combining the lower 16 bits with the higher bits from nodeEpoch. If the resulting epoch number is a failed epoch, the index and value pointer are retrieved from the InCLL1,2 field and applied to the appropriate location in the vals array. Lastly, the node's nodeEpoch is set to the first epoch of the current execution to indicate that the node needs no further recovery. Code illustrating InCLL lazy recovery appears in Listing 4.

At the point when a leaf node is externally logged, it may already have been modified and the changes logged in its InCLLs. Thus, the contents of the external log will not equal the state of the node at the beginning of the failed epoch, and simply copying the log to the node is insufficient for correct recovery. After recovery using the external log, the InCLLs in the nodes must also be applied. To reduce recovery time, …
Listing 4. Durable Masstree recovery

uint64_t currExecEpoch;  // first epoch in the current execution
lock recoveryLocks[K];   // transient locks, selected by hashing the leaf address

// run before the first access to the durable Masstree
void durableMasstree::recovery() {
    parallel for each node L in external log:
        memcpy(L->addr, L->content, L->size);
}

// run before the first access to a leaf node; the body after the epoch check
// is reconstructed from the prose description above (the original listing is
// truncated in the extracted text)
void leafnode::lazyNodeRecovery() {
    if (unlikely(nodeEpoch < currExecEpoch)) {
        lock &l = recoveryLocks[hash(this) % K];
        l.acquire();
        if (nodeEpoch < currExecEpoch) {          // re-check under the lock
            if (isFailedEpoch(nodeEpoch))
                permutation = permutationInCLL;   // undo inserts and deletes
            for (ValInCLL v : {InCLL1, InCLL2}) { // undo value updates
                uint64_t e = combine(higher(nodeEpoch), v.lowNodeEpoch);
                if (v.idx != ValInCLL::INVALIDIDX && isFailedEpoch(e))
                    vals[v.idx] = v.ptr;
            }
            nodeEpoch = currExecEpoch;            // no further recovery needed
        }
        l.release();
    }
}

Testing

We tested the modified system by intentionally crashing it at random points, launching a new process, and checking that the system's state matched the state at the beginning of the failed epoch. We also used many unit tests to ensure that a cache line was left in its correct state. We are currently developing a tool to help reason about the correctness of this type of system.

Evaluation

We implemented fine-grained checkpointing and InCLL as described above (INCLL) and measured it against the unmodified, transient Masstree (MT) and against an improved version of Masstree (MT+) that adopts two enhancements from INCLL: using a global barrier at each epoch and mmaping memory space for Masstree's pool allocator rather than obtaining it through jemalloc. Without these two enhancements, INCLL performed slightly better than MT (Figure 2).

All experiments were run on a server containing two Intel Xeon Gold 6132 (Skylake) processors, each with 14 cores and 28 hyperthreads, running at 2.6 GHz, with a 19.25 MB L3 cache. The system contained 1.5 TB of DDR4-2666 RAM. The operating system was Ubuntu Linux 16.04.5 LTS. The code was compiled with g++ version 5.4.0 at the baseline makefile optimization level, -O3. Each experiment was executed 10 times, and we report the average; the standard deviation of the experiments ranged from 0.03% to 0.08%. Since NVM is not yet available, we allocated a file in /dev/shm and mapped it into the application's address space (DRAM). By default, we do not introduce artificial latencies for the cache flush or fence operations. However, we also measure the effect of higher NVM latencies by introducing artificial latency after sfence instructions, since clflushopt instructions are asynchronous.

Figure 2. Throughput of baseline Masstree (MT), optimized Masstree (MT+), and our durable Masstree (INCLL).

Figure 2 reports the throughput of the three Masstree versions on different workloads. Unless otherwise noted, the tree was initialized with 20 million entries and we ran with 8 threads. Keys and values are 8 bytes long. The workload was generated by driver threads on the same machine to avoid network interference. For YCSB_A (write heavy), the operation distribution was 50% puts and 50% reads; for YCSB_B (read heavy), 5% puts and 95% reads; for YCSB_C (read only), 100% reads; and for YCSB_E, a read-only scan of 10 keys. We employ two key distributions. In uniform, the keys are generated uniformly at random in the range between zero and 20M. In zipfian, the keys are generated according to a Zipfian distribution with a skew parameter of 0.99. Keys are scrambled by computing a hash of their values, so that frequent keys do not (necessarily) appear in close proximity. We report the overall throughput (operations per second) of executing 1 million operations on each thread.

The optimized version (MT+) performed 2.4-68.5% better than unmodified Masstree (MT).
We use MT+ as the baseline for comparisons with the durable version (INCLL). Durable Masstree performed 5.9-15.4% slower than MT+, which reflects the cost of InCLL and periodic cache flushes. As expected, the write-intensive workload's (YCSB_A) performance is reduced by a larger amount (10.3-15.4%) than the read-heavy (5.9-13.9%) or read-only workloads (7.9-13.5%). The scan workload (YCSB_E) is least affected by InCLL. The Zipfian workloads performed better than the uniform workloads in both systems and are less affected by InCLL (5.9-10.3% vs. 7.8-15.4%), because the skewed distribution means fewer nodes are accessed; consequently, processor caching is more effective and more writes are logged in InCLLs (see the discussion of Figure 6). (Values are allocated in a 32-byte buffer containing additional Masstree fields.)

INCLL increased the number of instructions executed by 0.0-14.5% (uniform) and 0.1-7.6% (Zipfian). It also increased the number of L1 load and store references by a similar amount, but had less effect on the L1 cache miss rate. The number of LLC load references also increased, by 7.9-16.3%, but the change in the number of LLC store references was less consistent (-28.7-27.5%). For MT+, the LLC cache miss rate was high (25.8-39.9% load, 25.1-99.6% store), but the absolute number of misses is low (1.3-2.9M load, 31-666.8K store). INCLL had little effect on the L1 load miss rate (-5.1-14.2%), but it reduced the LLC load miss rate by 42.0-95.2%.

Figure 3 shows the effect of increasing the latency of flushing modified locations to NVM on the YCSB_A write-heavy workload. We introduced an additional delay of 100 ns-1000 ns after the sfence operations. The effect of NVM latency is small: even with an added latency of 1 µs, the performance of INCLL only decreased by 4.3% for the uniform workload and 6.0% for the Zipfian, compared to no emulated latency. This small difference demonstrates that InCLL avoids the full cost of flushing writes to NVM for most memory references.

Figure 3. Effect of emulated latencies for cache write back on INCLL (baseline is INCLL with zero emulated latency).

Figure 4 depicts the performance of MT+ and INCLL over 1 to 56 threads. The workload is again YCSB_A. The performance loss due to InCLL seemed unrelated to the number of threads, ranging from 14.6-21.3% for uniform and 3.0-19.3% for Zipfian. For the Zipfian workload, overall benchmark performance decreased at 44 threads in both systems. In a larger tree (100M entries), however, the performance of all benchmarks increased monotonically with the number of threads and the performance loss due to InCLL was similar (10.7-15.3% and 6.7-22.2%, respectively).

Figure 4. Throughput of INCLL and MT+ (YCSB_A benchmark) for different numbers of threads.

Figure 5 depicts the performance of MT+ and INCLL for an increasing number of entries in the tree. The workload is again YCSB_A. Both MT+ and INCLL were affected similarly by the increased tree size: performance on the uniform workload decreased 69% for both MT+ and INCLL as the tree grew from 10K to 100M entries, and Zipfian workload performance decreased by approximately 50%.

Figure 5. Throughput of INCLL and MT+ (YCSB_A benchmark) for varying tree size.
Figure 6. Overhead of INCLL over MT+ (YCSB_A benchmark) for varying tree size.

The overhead for the uniform workload forms a parabola (Figure 6), with lower overheads for small and large trees. To understand this phenomenon, we measured the effectiveness of the external log and InCLL. Figure 7 shows the number of nodes logged for the YCSB_A workload when InCLL logging is disabled (LOGGING) and in the normal operating mode (INCLL). As can be seen, for both uniform and zipfian distributions, the number of logged nodes increases sharply (logarithmic plots) until the tree reaches 1-3M nodes. After that point, the uniform distribution levels off without InCLL and declines rapidly with InCLL, and the zipfian distribution grows slowly without InCLL and declines slowly with InCLL.

Figure 7. Number of logged nodes when InCLL logging is disabled (LOGGING) and with InCLL (INCLL).

A tree of 10K keys contains approximately 1K nodes, many of which are modified frequently by the approximately 80K operations during a typical epoch. A node modified a second time is usually recorded in the external log (§ 4.1.1). However, no logging is needed for the rest of the epoch, so subsequent modifications incur no overhead. For a tree with 100M keys, overhead is low for a different reason: when the tree is very large and keys are chosen uniformly at random, a node has a low probability of being modified and a lower probability of being modified twice. Thus, most of its modifications are logged by InCLL, not external logging. The Zipfian distribution, which has a higher locality of reference and more nodes accessed twice or more, does not benefit as much from InCLL, and its number of logged nodes continues to increase as the tree grows. In the middle, when the tree contains 1M-3M entries, the probability that a node is modified twice or more is relatively large. This entails external logging, but since the number of operations on a given node is likely to be low, the overhead of this logging is unlikely to be amortized over a series of operations, as in smaller trees. Despite the relatively high overhead for trees in this range, the write-heavy benchmark ran at most 27% slower than MT+.

Figure 8 shows how performance changes for both InCLL (INCLL) and logging alone (LOGGING) as the latency of flushing modified locations to NVM increases on the YCSB_A workload. With InCLL, performance decreases only 4.1% (uniform) and 5.7% (zipfian) as the latency of this operation increases to 1 µs. With only logging, performance decreases 42.5% (uniform) and 28.5% (zipfian) over this range. InCLL greatly reduces the number of cache lines that must be flushed, which becomes increasingly beneficial as the latency of this operation increases.

Figure 8. Performance with emulated latencies for cache write back when InCLL logging is disabled (LOGGING) and with InCLL (INCLL).

An initial version of our system applied InCLL to internal nodes as well. This significantly reduced the number of internal nodes that were logged. However, it did not improve performance appreciably, since many more leaf nodes are logged. Furthermore, InCLL reduced the width of internal nodes, resulting in slower tree traversals. Overall, applying InCLL to all nodes resulted in lower performance.

Cache Flush Cost

We measured the cost of a global cache flush using the three workloads. The instruction that flushes the entire cache, wbinvd, is a privileged instruction that can only be executed by the kernel. We measured the user-space overhead: the time from the syscall until the operation returned to user space.
In all measured cases, the cost was 1.38-1.39 ms, with a variance of 6-12%. Since the flushes are executed once every 64 ms, the total cost of this operation is 2.2%.

Recovery Time

To measure recovery, we intentionally crashed the system immediately before starting a new epoch, a worst-case scenario for the number of nodes recorded in the external log. The workload was write heavy (50% writes) and the tree held 1M entries (the worst-case scenario for InCLL). We found that 84K nodes were recorded during the epoch, and applying these log entries required approximately 15 ms. As expected, recovery is fast, even in a worst-case scenario, primarily because of the short epoch duration.

Related Work

There have been many attempts to create efficient durable indexes in NVM. wB+-Trees [6] are designed to reduce the number of writes to NVM by using unoccupied leaf entries for insertions, in a manner somewhat similar to the permutation field in Masstree. However, wB+-Trees still require at least two write backs and fences per update. NV-Tree [30] uses an append-only strategy to reduce the overhead of writing to NVM; however, every split requires reconstructing the path of internal nodes, increasing overhead, and update operations still require two write backs and fences. FPTree [24] reduces the overhead of NVM writes by storing only leaf nodes in NVM. Internal nodes are stored in DRAM and rebuilt during recovery; however, rebuilding the tree increases restart time significantly, and FPTree requires three write backs and fences per update operation. WORT [18] is a radix tree designed for NVM that attempts to reduce the number of writes to NVM, but it still requires two write backs and fences per update operation. BzTree [2] uses a lock-free persistent multi-word CAS operation (PMwCAS) to implement a durable B+ tree. The authors do not report the number of write backs and fences (which might vary due to concurrency races), but at least two are needed for each PMwCAS, and insertions require at least two PMwCAS operations, so at least four persistent fences are necessary. Dalí [23] is a durable, nonblocking hash map based on globally flushing the cache. Each bucket follows an append-only strategy, so updating an existing value allocates a new node and appends it to the corresponding hash bucket. Therefore, Dalí makes less efficient use of the cache than Masstree. Its memory reclamation relies on a garbage-collection-like algorithm during recovery, which makes recovery very expensive.

Most programming interfaces to NVM are either transactional memory or locks. Mnemosyne [28] was the first system that used a software transactional memory system for NVM; it is based on a durable redo log whose implementation comes from TinySTM [11]. PMDK [10] is a library, provided by Intel, that uses an undo log to provide durable transactions as an NVM interface. Atlas [5] uses locks to delimit uninterruptible durable regions, with a durable undo log to roll back unfinished atomic regions after a failure. The InCLL programming model is more complex, as it tailors logging to the semantics of a data structure, but it achieves far better performance than these general-purpose approaches.

Numerous papers have tried to improve transaction performance without changing the programming model. None comes close to the low overhead of InCLL. Kolli et al.
[16] pipelined multiple stages to reduce the number of write backs and fences. NVThreads [13] improved lock-based durable sections with a redo log; different threads are spawned as different processes, giving each thread a fast, hardware-mediated view of its local modifications. LSNVMM [14] improved the performance of durable transactions by avoiding replication of data in both a log and the original location. Kamino-TX [22] avoids slow lookups in the redo log by applying modifications to a DRAM copy; these modifications are propagated lazily to NVM. DudeTM [19] decouples transactional concurrency control from the durability mechanism: concurrency control is performed on a DRAM copy and produces a redo log, which is applied to NVM in the background. The latter two systems avoid write backs and fences on the critical path but suffer from a long restart delay due to the cost of copying the NVM structure to DRAM.

The main challenge that must be solved by a persistent memory allocator [3, 7, 23, 26] is inconsistency between the allocator's metadata and the application's data structures. If a buffer was allocated but the system crashed before it was linked to the persistent structure, the result is a persistent memory leak. Schwalb et al. [26] broke allocation into two steps, reserve and activate, where each flushes data to persistent memory; a crash after the reserve step is rolled back by the system. The NV-heaps [7] system implements allocation, automatic garbage collection, reference counting, and pointer assignments as simple, fixed-size ACID transactions using a persistent redo log. Makalu [3] uses a conservative garbage collector to recover unreachable pointers, so no write backs are required during allocation; however, it has a slow restart time, due to its need to traverse a potentially unbounded amount of memory before an application restarts.

Discussion

Fine-Grained Checkpointing is not an exact replacement for transactions, and InCLLs are not a general-purpose substitute for logs. Using InCLLs requires a detailed understanding of data access patterns and cache-line boundaries. Checkpointing requires less program annotation than transactions, since a developer only needs to ensure data structure consistency at infrequent epoch boundaries, but this does not alleviate the complexity introduced by InCLL and does not guarantee immediate durability. Their combination provides a powerful tool for experts, such as library developers, for reducing the cost of durability, albeit at the cost of additional programming complexity.

Analogous to cache-efficient or concurrent data structures, we believe that efficient, recoverable structures cannot be created with one-size-fits-all techniques. Achieving high performance requires code that is data-structure and architecture specific. In this paper, we used Masstree as an example to demonstrate how InCLLs make a highly optimized structure durable. The effort was approximately two person-months, without prior knowledge of Masstree.

Other pointer-based structures could benefit from these techniques. Currently, achieving good results requires careful reasoning about the characteristics of the specific structure (e.g., § 4). We are aware that there are still many open questions about how to apply this technique in other contexts, e.g., values that span a cache line, objects with less sharply defined update patterns, or more complex data structure manipulations, and we are investigating more general solutions.
Conclusions

In this paper, we presented Fine-Grained Checkpointing and In-Cache-Line Logging. The former periodically flushes the cache to NVM, thereby persisting everything. The latter embeds an undo log inside the cache lines of a data structure, enabling fast logging to undo writes from partially executed epochs. We transformed the Masstree data structure to be durable using these techniques and an external, object-granularity log that handles complex and infrequent structure operations. The combination of these techniques guarantees durability while introducing only a moderate performance overhead.

References

[1] Hiroyuki Akinaga and Hisashi Shima. Resistive Random Access Memory (ReRAM) based on metal oxides. Proc. IEEE, 98(12):2237-2251, December 2010.
[2] Joy Arulraj, Justin Levandoski, Umar Farooq Minhas, and Per-Ake Larson. BzTree: A high-performance latch-free range index for non-volatile memory. Proc. VLDB Endow., 11(5):553-565, 2018.
[3] Kumud Bhandari, Dhruva R. Chakrabarti, and Hans-J. Boehm. Makalu: Fast recoverable allocation of non-volatile memory. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2016, pages 677-694, New York, NY, USA, 2016. ACM.
[4] Hans-J. Boehm and Sarita V. Adve. Foundations of the C++ concurrency memory model. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '08, pages 68-78, New York, NY, USA, 2008. ACM.
[5] Dhruva R. Chakrabarti, Hans-J. Boehm, and Kumud Bhandari. Atlas: Leveraging locks for non-volatile memory consistency. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA '14, pages 433-452, New York, NY, USA, 2014. ACM.
[6] Shimin Chen and Qin Jin. Persistent B+-trees in non-volatile main memory. Proc. VLDB Endow., 8(7):786-797, February 2015.
[7] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 105-118, New York, NY, USA, 2011. ACM.
[8] Nachshon Cohen, David T. Aksun, and James R. Larus. Object-oriented recovery for non-volatile memory. Proc. ACM Program. Lang., 2(OOPSLA):153:1-153:22, October 2018.
[9] Nachshon Cohen, Michal Friedman, and James R. Larus. Efficient logging in non-volatile memory by exploiting coherency protocols. Proc. ACM Program. Lang., 1(OOPSLA):67:1-67:24, October 2017.
[10] Krzysztof Czurylo and Andy Rudoff. NVML: NVM Library, 2014.
[11] Pascal Felber, Christof Fetzer, and Torvald Riegel. Dynamic performance tuning of word-based software transactional memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, pages 237-246, New York, NY, USA, 2008. ACM.
[12] M. Hosomi, H. Yamagishi, T. Yamamoto, K. Bessho, Y. Higo, K. Yamane, H. Yamada, M. Shoji, H. Hachino, C. Fukumoto, H. Nagao, and H. Kano. A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-RAM. In IEEE Int. Devices Meet. 2005, IEDM Tech. Dig., pages 459-462. IEEE, 2005.
[13] Terry Ching-Hsiang Hsu, Helge Brügner, Indrajit Roy, Kimberly Keeton, and Patrick Eugster. NVthreads: Practical persistence for multi-threaded applications.
In Proceedings of the Twelfth European Conference on Computer Systems, EuroSys '17, pages 468-482, New York, NY, USA, 2017. ACM.
[14] Qingda Hu, Jinglei Ren, Anirudh Badam, Jiwu Shu, and Thomas Moscibroda. Log-structured non-volatile main memory. In Proceedings of the 2017 USENIX Annual Technical Conference, USENIX ATC '17, pages 703-717. USENIX Association, 2017.
[15] Intel. Intel and Micron produce breakthrough memory technology, 2015.
[16] Aasheesh Kolli, Steven Pelley, Ali Saidi, Peter M. Chen, and Thomas F. Wenisch. High-performance transactions for persistent memories. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '16, pages 399-411, New York, NY, USA, 2016. ACM.
[17] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Architecting phase change memory as a scalable DRAM alternative. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, pages 2-13, New York, NY, USA, 2009. ACM.
[18] Se Kwon Lee, K. Hyun Lim, Hyunsub Song, Beomseok Nam, and Sam H. Noh. WORT: Write optimal radix tree for persistent memory storage systems. In Proceedings of the 15th USENIX Conference on File and Storage Technologies, FAST '17, pages 257-270. USENIX Association, 2017.
[19] Mengxing Liu, Mingxing Zhang, Kang Chen, Xuehai Qian, Yongwei Wu, Weimin Zheng, and Jinglei Ren. DudeTM: Building durable transactions with decoupling for persistent memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '17, pages 329-343, New York, NY, USA, 2017. ACM.
[20] Jeremy Manson, William Pugh, and Sarita V. Adve. The Java memory model. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '05, pages 378-391, New York, NY, USA, 2005. ACM.
[21] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. Cache craftiness for fast multicore key-value storage. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys '12, pages 183-196, New York, NY, USA, 2012. ACM.
[22] Amirsaman Memaripour, Anirudh Badam, Amar Phanishayee, Yanqi Zhou, Ramnatthan Alagappan, Karin Strauss, and Steven Swanson. Atomic in-place updates for non-volatile main memories with Kamino-Tx. In Proceedings of the Twelfth European Conference on Computer Systems, EuroSys '17, pages 499-512, New York, NY, USA, 2017. ACM.
[23] Faisal Nawab, Joseph Izraelevitz, Terence Kelly, Charles B. Morrey III, Dhruva R. Chakrabarti, and Michael L. Scott. Dalí: A periodically persistent hash map. In 31st International Symposium on Distributed Computing (DISC 2017), volume 91 of LIPIcs. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017.
[24] Ismail Oukid, Johan Lasperas, Anisoara Nica, Thomas Willhalm, and Wolfgang Lehner. FPTree: A hybrid SCM-DRAM persistent and concurrent B-tree for storage class memory. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 371-386, New York, NY, USA, 2016. ACM.
[25] Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. Scalable high performance main memory system using phase-change memory technology. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, pages 24-33, New York, NY, USA, 2009. ACM.
[26] David Schwalb, Tim Berning, Martin Faust, Markus Dreseler, and Hasso Plattner. nvm malloc: Memory allocation for NVRAM.
In International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS 2015), Kohala Coast, Hawaii, USA, August 31, 2015, pages 61-72, 2015.
[27] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. Speedy transactions in multicore in-memory databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 18-32, New York, NY, USA, 2013. ACM.
[28] Haris Volos, Andres Jaan Tack, and Michael M. Swift. Mnemosyne: Lightweight persistent memory. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 91-104, New York, NY, USA, 2011. ACM.
[29] H.-S. Philip Wong, Heng-Yuan Lee, Shimeng Yu, Yu-Sheng Chen, Yi Wu, Pang-Shiu Chen, Byoungil Lee, Frederick T. Chen, and Ming-Jinn Tsai. Metal-oxide RRAM. Proc. IEEE, 100(6):1951-1970, June 2012.
[30] Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong, and Bingsheng He. NV-Tree: Reducing consistency cost for NVM-based single level systems. In Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST '15, pages 167-181. USENIX Association, 2015.