Vilamb: Low Overhead Asynchronous Redundancy for Direct Access NVM
Rajat Kateja, Andy Pavlo, Gregory R. Ganger
Carnegie Mellon University
Abstract
Vilamb provides efficient asynchronous system-redundancy for direct access (DAX) non-volatile memory (NVM) storage. Production storage deployments often use system-redundancy in the form of page checksums and cross-page parity. State-of-the-art solutions for maintaining system-redundancy for DAX NVM either incur a high performance overhead or require specialized hardware. The Vilamb user-space library maintains system-redundancy with low overhead by delaying and amortizing the system-redundancy updates over multiple data writes. As a result, Vilamb provides 3–5× the throughput of the state-of-the-art software solution at high operation rates. For applications that need system-redundancy with high performance, and can tolerate some delaying of data redundancy, Vilamb provides a tunable knob between performance and quicker redundancy. Even with the delayed coverage, Vilamb increases the mean time to data loss due to firmware-induced corruptions by up to two orders of magnitude in comparison to maintaining no system-redundancy.

1 Introduction

Non-volatile memory (NVM) storage combines DRAM-like access latencies and granularities with disk-like durability [1, 10, 11, 39, 54]. Direct access (DAX) to NVM data exposes raw NVM performance to applications. Applications using DAX map NVM files into their address spaces and access data with load and store instructions, eliminating system software overheads associated with conventional storage interfaces.

Production storage demands fault tolerance in addition to non-volatility and performance. Whereas some fault tolerance mechanisms extend to DAX NVM storage trivially (e.g., background scrubbing), others do not. In particular, mechanisms for resilience against device-firmware-bug-induced data corruption fit poorly. FS-level page checksums enable detection of firmware-bug-induced data corruption, and cross-page redundancy enables recovery from such corruptions [6, 7, 29, 53, 60]. We use system-redundancy to refer to FS-level checksums and cross-page redundancy.

Maintaining system-redundancy for DAX NVM storage, without forfeiting its performance benefits, is challenging for two reasons. First, accesses via load and store instructions bypass system software, removing the straightforward ability to detect and act on data changes (e.g., to update system-redundancy). Second, NVM's cache-line granular writes increase the overhead of updating system-redundancy (e.g., checksums) that is usually computed over sizeable data regions (e.g., pages) for effectiveness and space efficiency.

The state-of-the-art solution for DAX NVM system-redundancy is the Pangolin library [69]. Pangolin addresses the challenge of system software bypass by requiring applications to use its transactional API. This enables Pangolin to mediate and act on data accesses. To address the incongruence in DAX write and system-redundancy granularities, Pangolin introduces micro-buffering and per-object checksums. Pangolin buffers application writes in DRAM and updates the NVM only on transaction commits. This buffering also enables Pangolin to use data diffs to make system-redundancy updates more efficient.

Even with Pangolin's well-optimized design, synchronous system-redundancy updates incur significant overhead. For example, Fig. 1 shows that Pangolin reduces key-value insert throughput by 10–20% at low insert rates, compared to a No-Redundancy baseline, and by up to 80% at high rates. Fundamentally, any software-based synchronous approach will struggle with high-throughput updates because it must update system-redundancy on every operation. A recently proposed specialized hardware controller offers low-overhead synchronous DAX NVM system-redundancy [33], but it is unlikely to be available in systems soon.

Figure 1: Throughput for a PMDK key-value store when using three system-redundancy options (No-Redundancy, Vilamb with a 1 sec period, and Pangolin), as a function of the number of threads performing PMDK's insert-only benchmark workload. (Details in § 4.3; RBTree results shown here.)
This paper describes Vilamb, a user-space library for efficient asynchronous DAX NVM system-redundancy. Vilamb moves system-redundancy updates out of the critical path and delays them to amortize the overhead over multiple data updates. Delaying the system-redundancy updates creates a configurable trade-off between performance and the delay before updated data is covered. Fig. 1 shows that updating system-redundancy every second with Vilamb reduces the No-Redundancy throughput by only 6%, even at the highest throughput level; this corresponds to 5× higher throughput than Pangolin. Although Vilamb leaves a fraction of data briefly uncovered, it increases the mean time to data loss (MTTDL) due to firmware-induced corruptions by 112× over No-Redundancy for this benchmark.

Unlike Pangolin, Vilamb does not require applications to adopt a particular access interface to identify data updates. Instead, Vilamb repurposes page table dirty bits to efficiently identify data updates. Vilamb marks pages with updated system-redundancy as clean and identifies pages with outdated system-redundancy by checking their dirty bit. We implement a kernel module that Vilamb uses for batched fetching and clearing of dirty bits. Vilamb ensures atomic and consistent system-redundancy updates for all dirty pages by using shadow copies of dirty bits and leveraging batteries that are common in production environments [18, 22, 31, 32, 37, 46, 63, 64].

Extensive evaluation with eight macro- and micro-benchmarks demonstrates Vilamb's efficacy. Vilamb with a 1 sec delay between system-redundancy updates reduces single-threaded Redis' YCSB throughput by only 1.6–17%, compared to 13–18% for Pangolin. Increasing the delay to 10 seconds further reduces Vilamb's overhead to 0.1–6%. Similar to Fig. 1, Vilamb offers 3–5× higher throughput than Pangolin at high insert rates for all five of Intel's PMDK key-value stores. By protecting the clean pages from firmware-bug-induced corruption, Vilamb increases the MTTDL over No-Redundancy. For example, Vilamb with a 1 sec system-redundancy update period increases Redis' MTTDL by 15× and 74× over No-Redundancy for a write-heavy and read-heavy YCSB workload, respectively. Detailed timing breakdowns with fio microbenchmarks and battery cost analysis confirm Vilamb's design decisions.

This paper makes three primary contributions. First, it identifies asynchronous system-redundancy as an important addition to the toolbox of DAX NVM system-redundancy solutions. Second, it describes Vilamb's efficient delayed system-redundancy design that improves performance for applications that can tolerate delayed coverage. Third, it quantifies Vilamb's efficacy, cost, and reliability via extensive evaluation with eight macro- and micro-benchmarks.

This section provides background on direct-access (DAX) NVM and system-redundancy, and the challenges that DAX poses for maintaining system-redundancy. It then describes the solution space and how Vilamb and related work fit into it.
NVM refers to a class of memory technologies that have access latencies comparable to DRAM and that retain their contents across power outages like disks. Various NVM technologies, such as 3D-XPoint [1, 27], Memristors [11], PCM [39, 54], and battery-backed DRAM [10, 15], are either already in use or expected to be available soon. In this paper, we focus on NVM that is accessible like DRAM DIMMs rather than like a disk [45]. That is, NVM that resides on the memory bus, with load/store accessible data that moves between CPU caches and NVM at a cache-line granularity. Although applications can continue to access NVM via the conventional FS interface, doing so incurs the overhead of system calls, and (potentially) data copying and inefficient general-purpose file system code [14, 19, 30, 61, 65, 66].

The DAX interface to NVM eliminates system software overheads, enabling applications to leverage raw NVM performance. With DAX, applications map NVM pages into their address spaces and access persistent data via load and store instructions. File systems that map an NVM file into the application address space (bypassing the page cache) on an mmap system call are referred to as DAX file systems and said to support DAX-mmap [19, 40, 67]. DAX is widely used for adding persistence to conventionally volatile in-memory DBMSs [41, 52, 56, 70] and is poised as the "killer use-case" for NVM.

DAX-mmap helps applications realize NVM performance benefits, but requires careful reasoning to ensure data consistency. Volatile processor caches can write back data in arbitrary order, forcing applications to use cache-line flushes and memory fences for durability and ordering. Transactional NVM access libraries ease this burden by exposing simple transactional APIs to applications and ensuring consistency on their behalf [8, 13, 24, 26, 62]. Alternatively, the system can be equipped with enough battery to allow flushing of cached writes to NVM before a power failure [44, 49, 72]; our work assumes this option.
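To make the DAX access model concrete, the following minimal C sketch maps an NVM-resident file and persists a store; the path and sizes are hypothetical, and the explicit clwb/sfence pair shown here is what an application would need without the battery backing that this work assumes.

```c
/* Minimal DAX-mmap sketch (assumptions: a DAX file system is mounted at
 * /mnt/pmem and the file is large enough; compile with -mclwb). */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <immintrin.h>

int main(void) {
    int fd = open("/mnt/pmem/data", O_RDWR);      /* file on a DAX FS */
    size_t len = 1UL << 20;                       /* 1 MB region      */

    /* DAX-mmap: NVM pages are mapped directly, bypassing the page cache. */
    char *base = (char *)mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);

    /* Persistent data is updated with ordinary store instructions. */
    strcpy(base, "hello, persistent world");

    /* Without a battery, durability requires writing the cache line back
     * and fencing to order the write-back. */
    _mm_clwb(base);
    _mm_sfence();

    munmap(base, len);
    close(fd);
    return 0;
}
```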
Many production storage systems implement system-redundancy, in the form of FS-level page checksums and cross-page redundancy, to protect against firmware-bug-induced data corruption [21, 53, 58, 71]. Device firmware is susceptible to bugs, like any software, because of its complex functionality, such as address translation and wear leveling. A class of these bugs, namely lost write bugs and misdirected read or write bugs, can cause data corruption [6, 7, 29, 53, 60]. Lost write bugs cause the firmware to incorrectly consider a write as completed without actually writing the data onto the device media. Misdirected read or write bugs cause the firmware to access (read or write) data at a wrong location on the device media.

Firmware bugs can corrupt data that an application is actively accessing as well as data at rest. An example of a firmware bug affecting actively accessed data would be a misdirected read bug that causes the firmware to return incorrect data for an application read. On the other hand, lost write or address mapping bugs that are triggered when the firmware is performing wear leveling could corrupt data at rest.

Storage systems can detect and recover from firmware-bug-induced corruption using system-redundancy [43, 53, 71]. For example, an FS can store and access page checksums separately from the data, making it unlikely for a firmware bug to affect both the data and its FS-level checksum in the same manner. An FS-level checksum mismatch can then flag firmware-bug-induced corruption, which the FS can recover from by using cross-page parity.

Many storage systems implement system-redundancy in addition to a variety of other fault-tolerance mechanisms [21, 23, 28, 34, 38, 48, 57, 67, 71]. In particular, storage systems implement system-redundancy even in the presence of device-level error correcting codes (ECCs) [9, 35, 68]. ECCs are designed for, and effective against, random bit flip induced corruption. However, they are ineffective against most firmware-bug-induced corruption, because they are computed, stored, and accessed as a single unit with the data at a very low level of the device's firmware or hardware.
Production NVM storage deployments will require similar levels of fault tolerance as conventional storage deployments, including system-redundancy. Unsurprisingly, recently proposed NVM storage system designs include system-redundancy [33, 50, 67, 69]. Among these proposals, file systems like Nova-Fortis [67] and Plexistore [50] implement system-redundancy only for data that is accessed via the FS interface.

Maintaining system-redundancy for DAX NVM is challenging for two reasons: (i) hardware controlled data movement, and (ii) cache-line granular writes.
Hardware Controlled Data Movement: Applications' data writes to DAX NVM bypass system software. This lack of software control makes it challenging for the storage software to identify updated NVM pages for which it needs to update system-redundancy.
Cache-line Granular Writes: Incongruence between the size of DAX writes and the size of pages over which system-redundancy is usually maintained increases the overhead of maintaining system-redundancy. Most storage systems maintain system-redundancy over sizeable blocks (e.g., 4 KB page checksums) for space efficiency. Cache-line granular writes require reading (at least) an entire page to update the system-redundancy. Whereas RAID systems solve a similar "small write" problem by reading the data before updating it [47], a DAX NVM storage system's software cannot use this solution. As discussed above, direct access to NVM bypasses system software, prohibiting the use of pre-write values for incremental system-redundancy updates.
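For contrast, the incremental parity update behind RAID's small-write optimization can be written as below (the notation is ours, not from the paper); it depends on the pre-write value D_old, which a DAX store that bypasses system software never exposes, hence the full-page and full-stripe reads described above.

```latex
% RAID-style incremental ("small write") parity update:
% the parity is patched without reading the other pages in the stripe,
% but only if the old data value can be read before it is overwritten.
P_{\mathrm{new}} = P_{\mathrm{old}} \oplus D_{\mathrm{old}} \oplus D_{\mathrm{new}}
```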
Table 1 summarizes the design space of DAX NVM system-redundancy solutions and the trade-offs among the three options (including Vilamb) in the toolbox.
Pangolin [69] is a user-space library that maintains DAX NVM system-redundancy synchronously by requiring applications to explicitly inform it about their data updates; applications piggyback these notifications on Pangolin's transactional interface. Pangolin offers strong coverage (immediate system-redundancy updates and verification) and does not require any specialized hardware resources (because it is a software-based solution).

Table 1: Solutions for DAX NVM system-redundancy and their trade-offs.

Solution      | Coverage Guarantees | Performance Overhead | Programming Model | Specialized Hardware Requirement
Pangolin [69] | Strong              | Medium-to-High       | Restrictive       | None
Tvarak [33]   | Strong              | Negligible           | Non-Restrictive   | Yes
Vilamb        | Configurable        | Configurable         | Non-Restrictive   | None
Pangolin addresses the mismatch of fine-grained DAX updates with large checksum ranges by requiring explicit object definitions and maintaining per-object checksums instead of per-page checksums.

Pangolin is well-tuned, including several overhead-reducing mechanisms, making it the state-of-the-art for an in-line software-only solution. Yet, Pangolin still incurs significant performance overhead (up to 80%) in many cases. Fundamentally, Pangolin's synchronous system-redundancy update design requires updating system-redundancy at the same rate at which an object is being modified; this becomes costly for the high update rates enabled by NVM. Pangolin's per-object checksums also incur higher space overhead for small data objects. Also, importantly, Pangolin only works for applications that can be and are modified to use its object-based transactional interface. Applications that manage NVM data themselves using other data models, such as NVM-optimized databases [3], may not easily fit Pangolin's interface.
Tvarak [33] is a hardware controller co-located with the last level cache (LLC) onto which the FS can offload system-redundancy maintenance work. Tvarak is able to identify data updates by virtue of being interposed in the data path. Tvarak offers synchronous system-redundancy updates and verification, does not restrict applications to any specific library/API, and is low-overhead. However, it requires specialized hardware resources, including a controller, on-controller cache, and shared LLC partitions. The need for dedicated (and newly proposed) hardware resources implies that Tvarak is not available for immediate use, and may not be part of commodity servers for many years. Further, Tvarak introduces cache-line granular checksums for DAX-mapped data, increasing the space overhead.

Prioritizing strong coverage at the expense of performance and a restrictive programming model (with Pangolin [69]), or cost and near-term availability (with Tvarak [33]), will not be the preferred choice for all applications. Many applications prioritize performance and use storage systems wherein some of the fault-tolerance mechanisms (e.g., remote replication or even persistence) are asynchronous: the fault tolerance is still desired, and the more coverage the better, but not at a high performance cost [16, 28, 34, 48].
Vilamb is a software library that embraces an asynchronous approach to updating system-redundancy for updated data. Like other asynchronous redundancy-update approaches, it identifies and completes required system-redundancy updates in the background. Indeed, it does both aspects (identifying and updating) outside the critical path of application accesses. As such, Vilamb can provide low-overhead DAX NVM system-redundancy. Also, Vilamb does not impose any programming model restrictions and does not require any specialized hardware resources. But, Vilamb reduces the data coverage guarantees by delaying system-redundancy updates. Specifically, recently modified pages may not be covered when a firmware bug affects them. So, Vilamb can be a good option when applications desire high performance and/or are not a good fit for a Pangolin-like API, and view partial system-redundancy coverage as better than none.
This section begins by describing Vilamb's design elements: delayed system-redundancy updates and repurposing of dirty bits. It then describes the effect of Vilamb's design on resilience against different failures and ends with Vilamb's implementation details.
Vilamb asynchronously maintains per-page checksums and cross-page parity for DAX NVM storage. A background thread periodically updates system-redundancy for pages which have been written to since Vilamb last updated their system-redundancy. By delaying system-redundancy updates, Vilamb amortizes the overhead over multiple cache-line writes to the same DAX NVM page.

Fig. 2 illustrates how Vilamb reduces work for per-page checksums (cross-page parity is not shown in the example, but is updated at the same time as the page checksum). The figure shows a DAX NVM page and its checksum; the checksum can either be up-to-date (✓) or outdated (✗). In the initial state, the checksum is up-to-date with the data. The first write to the page makes the checksum stale. Instead of updating the checksum immediately, Vilamb delays the update until after two more writes. By delaying the update, Vilamb performs a single checksum (and parity, not shown in the figure) computation, instead of three.

Figure 2: Delayed Checksum Computation Example – By computing per-page checksums asynchronously, Vilamb amortizes the computation overhead over multiple cache-line writes to the same NVM page.

Vilamb scrubs the data using a separate background thread to detect data corruption. Upon a mismatch between the page data and checksum for a clean page, Vilamb raises an error and halts the program. The OS can recover corrupted pages using the parity pages, with potential re-mapping to different physical pages [67, 69].
The conventional use-case of dirty bits is irrelevant for DAX NVM pages, making them available for repurposing. The dirty bit is conventionally used to identify updated, or "dirtied", in-memory pages that the storage system needs to write back to persistent storage. In the case of DAX NVM storage, the file system maps NVM-resident files into application address spaces using the virtual memory system [19, 40]. Consequently, even though each mapped page has a corresponding dirty bit, the conventional semantic of these dirty bits is irrelevant because the pages already reside in persistent NVM storage.

Vilamb repurposes dirty bits to identify pages that have been written to since Vilamb last updated their system-redundancy. When a file is first DAX mapped, its pages' dirty bits are clear and system-redundancy is up-to-date (potentially updated during initialization for newly created files). A page write, which causes its system-redundancy to become stale, sets the page's dirty bit. In each successive invocation, Vilamb's background thread updates the system-redundancy only for pages with their dirty bit set and then clears the corresponding dirty bits again.
Shadow Dirty Bits: Vilamb carefully orchestrates the non-atomic two-step process of updating a page's system-redundancy and clearing its dirty bit; performing these steps without any safeguard is incorrect. Clearing the dirty bit after updating the system-redundancy is incorrect because an interleaved application access can invalidate the system-redundancy. Reversing the order is not safe either. A checksum verification (e.g., in a scrubbing thread) after the dirty bit is cleared, but before the checksum is updated, would cause a spurious checksum mismatch. Vilamb makes a persistent shadow copy of the dirty bit before clearing it, and clears this shadow copy only after completing the redundancy update. If either the dirty bit or its shadow copy is set for a page, Vilamb knows that the page's redundancy is outdated.
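A minimal sketch of this ordering is shown below; the helper functions and the per-page shadow array are hypothetical stand-ins for Vilamb's internals, and fences stand in for persist barriers since the design assumes battery-backed servers (no cache-line flushes).

```c
#include <stdbool.h>
#include <stddef.h>
#include <immintrin.h>   /* _mm_sfence */

/* Hypothetical helpers; shadow_dirty lives in NVM so it survives crashes. */
extern bool page_is_dirty(size_t page);           /* hardware dirty bit (via kernel module) */
extern void clear_dirty_bit(size_t page);         /* via the Vilamb kernel module           */
extern bool shadow_dirty[];                       /* persistent shadow copies               */
extern void update_page_redundancy(size_t page);  /* checksum + stripe parity update        */

void refresh_redundancy(size_t page) {
    if (!page_is_dirty(page))
        return;

    shadow_dirty[page] = true;      /* 1. record the shadow copy ...              */
    _mm_sfence();                   /*    ... before the dirty bit can disappear  */

    clear_dirty_bit(page);          /* 2. later writes will set the dirty bit again */
    update_page_redundancy(page);   /* 3. recompute checksum and parity              */

    _mm_sfence();                   /* 4. redundancy is in place before ...          */
    shadow_dirty[page] = false;     /*    ... the shadow copy is cleared             */
}
```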
Vilamb's asynchronous approach to system-redundancy introduces a tunable window of vulnerability. Pages that an application writes to remain susceptible to corruption until Vilamb updates their system-redundancy. We describe the implications of this window of vulnerability for different kinds of failures below.
Page Corruption: System-redundancy's primary goal is to protect data from firmware-bug-induced corruption. Additionally, system-redundancy also protects from random bit flip induced corruptions, though on-device ECCs are already expected to address those. Vilamb's delayed checksums would detect corruption to all but recently written (dirty) pages. We illustrate this with an example lost write bug triggered in three different scenarios.

Consider a firmware that uses an on-device write-back cache and that suffers from a bug wherein the firmware (infrequently) "forgets" to destage some data from the cache to the device media. (1) For the first scenario, consider an application write that is evicted from the CPU caches to the NVM device, is stored in the on-device write-back cache, and then lost by the firmware before Vilamb updates the corresponding page's checksum. This would lead to a silent corruption because Vilamb would use the incorrect (old) data to compute the checksum. (2) For the second scenario, consider that Vilamb updates the page's checksum before the firmware bug is triggered (i.e., while the data is in the CPU caches or in the on-device cache). Vilamb would update the checksum correctly in this scenario and detect the subsequent corruption because of a data-checksum mismatch at a later point. (3) For the third scenario, imagine the bug affects a clean page while the firmware is performing wear leveling. Vilamb would be able to detect this data loss in its scrubbing thread.

Among the pages that Vilamb detects as corrupted, Vilamb can recover those that belong to stripes with all clean pages (and hence, an up-to-date parity). Any dirty page in a stripe invalidates the parity. Thus, even if the corrupted page is itself clean, Vilamb can recover it only if all other pages in its stripe are also clean.
Power Failures: Vilamb avoids any inconsistencies between data and its system-redundancy by ensuring that the system-redundancy is made up-to-date if there is a power failure. To that end, Vilamb leverages battery backups that are common in production environments [18, 22, 23, 31, 32, 37, 63]. Conventional storage systems use batteries to flush DRAM to a persistent medium upon a power failure [18, 23, 31, 32]. NVM does not need batteries to make its contents persistent, because they are already persistent. Vilamb instead leverages the battery backup to update system-redundancy upon a power failure, ensuring that no pages are left uncovered. Given that batteries are also used to address other issues, including brief power losses and spikes [46], we believe that Vilamb can exploit them for updating system-redundancy.

Figure 3: Vilamb's Implementation – The user space library performs the checksum and parity computations with a period that is set by the application. The kernel module checks and clears the dirty bits when requested by the user space library. (Components: application, e.g., Redis; Vilamb user-space library for per-page checksums and cross-page parity; Vilamb kernel module for reading/resetting dirty bits; DAX-mapped file data plus checksums, parity, and a meta-checksum in NVM.)
NVM DIMM Failures or Machine Failures: Vilamb's system-redundancy is not intended for protection against DIMM or machine failures; the storage system can protect against these using remote replication [59, 70]. Being a machine-local fault-tolerance mechanism, system-redundancy, independent of its implementation, is ineffective against machine failures. For DIMM failures, Vilamb's asynchronous system-redundancy design makes it unable to reconstruct the fraction of the pages in the failed DIMM that belonged to a stripe with outdated system-redundancy. Although the storage system could still recover a large fraction of the data (§ 4.8), it would need other redundancy to recover the remaining data.
We implement Vilamb as a user-space library. The library exposes an API that applications can use to configure the nature of system-redundancy (e.g., type of checksum and number of pages in a stripe) and its update frequency. The library uses a periodic background thread that checks and clears the dirty bits using new system calls that we implement, and performs the system-redundancy updates for the dirty pages. Our implementation uses a stripe size of five pages by default, with four consecutive data pages and one parity page. The stripes are statically determined at the time of initialization. Fig. 3 shows the components of our implementation.
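As an illustration of the configuration surface just described, the following sketch shows what registering a DAX-mapped region might look like; the vilamb_* names and types are our invention for illustration, not Vilamb's published API.

```c
#include <stddef.h>

/* Hypothetical configuration mirroring the options the text describes:
 * checksum type, stripe geometry, and the background update period. */
typedef enum { VILAMB_CRC32C } vilamb_checksum_t;

typedef struct {
    vilamb_checksum_t checksum;
    size_t pages_per_stripe;    /* default in the paper: 5 (4 data + 1 parity) */
    double update_period_sec;   /* delay between system-redundancy updates     */
} vilamb_config_t;

/* Hypothetical API: register a DAX-mapped region and start the background
 * system-redundancy and scrubbing threads. */
int vilamb_register(void *dax_base, size_t len, const vilamb_config_t *cfg);
int vilamb_unregister(void *dax_base);

void example(void *heap, size_t heap_len) {
    vilamb_config_t cfg = {
        .checksum = VILAMB_CRC32C,
        .pages_per_stripe = 5,
        .update_period_sec = 1.0,
    };
    vilamb_register(heap, heap_len, &cfg);
    /* ... application runs; Vilamb maintains redundancy asynchronously ... */
    vilamb_unregister(heap);
}
```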
New System Calls: We implement two new system calls, getDirtyBits and clearDirtyBits, to check and clear the dirty bits for pages in a memory range, respectively. getDirtyBits returns a bitvector that has the dirty bits for pages in the input memory range. clearDirtyBits accepts a dirty bitvector as a parameter in addition to a memory range. It clears the dirty bit for a page in the memory range only if the corresponding bit is set in the input dirty bitvector. Since Vilamb is unaware of pages dirtied in between the checking and clearing and will not update their system-redundancy, it uses this input dirty bitvector for clearDirtyBits to clear the dirty bits only for pages that were dirty when initially checked.
Batched Checking and Clearing: Vilamb checks and clears dirty bits for multiple NVM pages (e.g., 512 in our experiments) as a batch for efficiency. Both checking and clearing of dirty bits require a system call and traversing the hierarchical page table; clearing dirty bits further requires invalidating the corresponding TLB entries. Each of these is a costly operation, as evinced by prior research [2], and demonstrated by our experiments (§ 4.6). Batching allows pages to share the system call, fractions of the page table walk, and the TLB invalidation. We found that batching reduced the amount of time spent in checking/clearing dirty bits by up to two orders of magnitude.
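The sketch below shows how the two system calls described above might be driven over a 512-page batch; the exact prototypes (argument types, how the bitvector is passed) are our assumption, since the text only specifies their semantics.

```c
#include <stdint.h>
#include <string.h>

#define BATCH 512                      /* pages checked/cleared per system call */

/* Assumed wrappers around the new system calls; real argument types may differ. */
extern int getDirtyBits(void *start, size_t npages, uint64_t *bitvec);
extern int clearDirtyBits(void *start, size_t npages, const uint64_t *bitvec);

void process_batch(char *file_base, size_t first_page, size_t page_size,
                   uint64_t *shadow /* persistent copy, BATCH/64 words */) {
    uint64_t dirty[BATCH / 64];
    void *start = file_base + first_page * page_size;

    /* One system call fetches the dirty bits for the whole batch. */
    getDirtyBits(start, BATCH, dirty);

    /* Record the shadow copy before clearing (see Algorithm 1 below). */
    memcpy(shadow, dirty, sizeof(dirty));

    /* Clear only the bits that were set when checked: pages dirtied in between
       keep their dirty bit and are picked up on a later invocation. */
    clearDirtyBits(start, BATCH, dirty);

    /* ... update checksums and parity for pages whose bit is set in `dirty` ... */
}
```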
Algorithm: Algorithm 1 details the steps that Vilamb's background thread performs on each invocation. Vilamb loops over all the N pages in a given DAX NVM file in increments of B pages; B being the batch size for which Vilamb checks the dirty bits using a single system call (Line 2). Vilamb stores a persistent shadow copy of the dirty bits (Line 3) and then clears them (Line 6). Vilamb updates the checksum of each dirty page (Line 12), and the parity of a group of P pages if any of them is dirty (Line 16). Vilamb stores the checksums and parity separately from the data (Fig. 3) and then clears the shadow copy of the dirty bits (Line 20). Vilamb then updates a meta-checksum (checksum of the page checksums) after every iteration (Line 22 and Fig. 3).

Algorithm 1: System-Redundancy Update Thread
Parameter: Batch Size, B
Parameter: Number of Pages in File, N
Parameter: Number of Pages in a Parity Group, P
 1: for i ← 0 to N increment by B do
 2:     dirtyBitvector ← checkDirtyBits(i, i + B);
 3:     dirtyBitvectorCopy ← dirtyBitvector;
 4:     currentBatchStartingPage ← i;
 5:     memoryFence;
 6:     clearDirtyBits(i, i + B, dirtyBitvector);
 7:     for j ← i to i + B increment by P do
 8:         updateParity ← False;
 9:         for k ← j to j + P increment by 1 do
10:             if bitIsSet(dirtyBitvector, k − i) then
11:                 updateParity ← True;
12:                 computePageChecksum(k);
13:             end
14:         end
15:         if updateParity then
16:             computeParity(j, j + P);
17:         end
18:     end
19:     memoryFence;
20:     dirtyBitvectorCopy ← 0;
21: end
22: computeMetaChecksum();

As a performance optimization, instead of storing a shadow copy of the dirty bit for each page, we use a single dirty bitvector of size B along with the current batch's starting page number (Line 3 and Line 4). Together, the starting page number and the dirty bitvector copy suffice to store shadow copies of the dirty bits for pages in the current batch; pages not in the current batch do not need a shadow copy of their dirty bits because their dirty bits are not being cleared. Having a single dirty bitvector improves performance by reducing cache pollution.

Vilamb's redundancy verification thread (i.e., the scrubbing thread) computes and verifies the checksum only for pages that are clean, i.e., they have neither their dirty bit nor their shadow dirty bit set. If the checksum verification succeeds, the thread moves to the next page. In case of a checksum mismatch, the scrubbing thread re-checks whether the page is clean. This second check is to ensure that the page was not modified after the first check but before the checksum verification. If the second check also indicates that the page is clean, the scrubbing thread raises a signal to halt the application. The file system can then recover the page, if it belongs to a clean stripe (we have not implemented recovery).

Leveraging Hardware Support: Our implementation of Vilamb leverages hardware support whenever possible. We use CRC-32C checksums and employ the crc32q instruction when available. Similarly, we use SIMD instructions for computing the parity whenever possible (e.g., by operating on 256-byte words in our experiments). We never flush cache lines for persistence because we assume battery-backed servers. We do, however, use fences to ensure ordering between updates. For example, the fence at Line 5 ensures that the shadow copy of the dirty bits and the current batch's starting page number are written before the dirty bits are cleared. Similarly, the fence at Line 19 ensures that the system-redundancy is written before the dirty bits' shadow copy is cleared. We extend the same performance benefits (e.g., no cache line flushes and SIMD parity computations) to the alternatives that we compare Vilamb with in our evaluation.
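To illustrate the hardware-assisted computations mentioned above, here is a simplified sketch that computes a page's CRC-32C with the SSE4.2 crc32 instruction (via _mm_crc32_u64, the intrinsic form of crc32q) and XORs a stripe's parity with 256-bit AVX2 operations; it is a stand-in for Vilamb's actual routines, not their implementation, and operates on 32-byte chunks rather than the 256-byte units the text mentions.

```c
/* Compile with -msse4.2 -mavx2 on x86-64. */
#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>   /* _mm_crc32_u64 (SSE4.2)        */
#include <immintrin.h>   /* __m256i, AVX2 load/xor/store  */

#define PAGE_SIZE 4096

/* CRC-32C over one page, 8 bytes at a time, using crc32q. */
static uint32_t page_crc32c(const void *page) {
    const uint64_t *p = (const uint64_t *)page;
    uint64_t crc = ~0u;
    for (size_t i = 0; i < PAGE_SIZE / sizeof(uint64_t); i++)
        crc = _mm_crc32_u64(crc, p[i]);
    return (uint32_t)~crc;
}

/* XOR `ndata` data pages of a stripe into `parity`, 256 bits at a time. */
static void stripe_parity(uint8_t *parity, uint8_t *const data[], int ndata) {
    for (size_t off = 0; off < PAGE_SIZE; off += 32) {
        __m256i acc = _mm256_setzero_si256();
        for (int d = 0; d < ndata; d++) {
            __m256i v = _mm256_loadu_si256((const __m256i *)(data[d] + off));
            acc = _mm256_xor_si256(acc, v);
        }
        _mm256_storeu_si256((__m256i *)(parity + off), acc);
    }
}
```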
This section evaluates Vilamb and compares it to No-Redundancy and Pangolin, using eight macro- and micro-benchmarks. No-Redundancy serves as the baseline, providing the best performance but not implementing any system-redundancy. Pangolin is a state-of-the-art userspace library that updates system-redundancy when applications commit their data writes to NVM.

We obtained Pangolin's code from the authors and run it with checksum and parity updates enabled but checksum verification disabled (referred to as Pangolin-MLPC in the Pangolin paper [69]). We run Vilamb also with checksum and parity updates enabled and checksum verification disabled. As shown in the evaluation of Pangolin [69], and confirmed by our experiments, checksum verification via scrubbing at reasonable frequencies incurs negligible overhead. Pangolin can also verify checksums on object reads, which Vilamb cannot, but doing so reduces throughput by up to 50% for large objects [69].

Unless mentioned otherwise, Vilamb uses a 512-page batch size for checking/clearing dirty bits. To accurately quantify Vilamb's overheads, we pin it to the same core(s) as the application. For single-threaded applications such as Redis, this means that the application and Vilamb run on the same logical core (i.e., same hyper-thread). Each data point in our results is an average of three runs with root mean square error bars. We use a dual-socket Intel Xeon Silver 4114 machine with a Linux 4.4.0 kernel for our experiments. The system has 192 GB DRAM, from which we use 64 GB as emulated NVM [51].
Key takeaways from our evaluation include:
• Vilamb is low-overhead. For example, Vilamb with a 10 sec system-redundancy update period reduces Redis' YCSB throughput by only 0.1–6% in comparison to No-Redundancy.
• Vilamb significantly outperforms Pangolin. For example, Vilamb has 3–5× higher insert throughput than Pangolin for five PMDK key-value stores. Even for low throughput applications like single-threaded Redis serving YCSB, Vilamb has up to 18% higher throughput than Pangolin.
• Vilamb significantly increases the MTTDL. For example, Vilamb increases the MTTDL for PMDK key-value stores by up to two orders of magnitude.
• Vilamb offers a tradeoff between performance and time-to-coverage. For example, decreasing the delay between system-redundancy updates from 5 sec to 1 sec increases Redis' YCSB-A MTTDL by 3× but decreases the throughput by 10%.
• Vilamb's battery requirements are low. Across all of our workloads, the cost of batteries that Vilamb requires never exceeds $10.

Figure 4: YCSB with Redis – Throughput and read latency of YCSB workloads with Redis. (Panels: (a) throughput, (b) average latency, and (c) tail latency for YCSB-A, YCSB-B, and YCSB-C; compared: Pangolin, Vilamb with 1, 5, and 10 sec system-redundancy thread periods, and No-Redundancy.)

Redis [55] is a widely used open-source NoSQL DBMS. We modify it to use a DAX NVM file for its data heap. Our implementation uses the libpmemobj library [25] from the Intel persistent memory development kit (PMDK) [26] for No-Redundancy.
Modifying Redis to use Vilamb and Pangolin: For Vilamb, we added 10 lines of initialization and cleanup code in one file. The initialization code registers Redis' NVM heap with Vilamb and sets the system-redundancy update delay. To use Pangolin's transactional API (which is similar to, but different than, libpmemobj), we changed 346 lines of code across 10 files in Redis. Whereas most of these changes were to the transactional interface (e.g., using pgl_tx_begin), we also had to modify Redis to invoke Pangolin before reading data from an object (using pgl_get). Doing so enables Pangolin to determine whether the object is in NVM or in DRAM and provide Redis with the correct pointer.
Experimental Setup: We use three core YCSB workloads: YCSB-A (50:50 reads:updates), YCSB-B (95:5 reads:updates), and YCSB-C (read-only). We initialize the DBMS with 1M (1×) key-value pairs for an NVM footprint of 10 GB and run the workloads for five minutes. The YCSB workload generator uses 20 threads and runs on a different socket than Redis.

Results: Fig. 4 presents throughput and read latencies. Vilamb reduces the throughput, in comparison to No-Redundancy, by 0.1–6% for a system-redundancy update period of 10 sec and by 1.6–17% for a period of 1 sec. Increasing the delay for system-redundancy updates improves Vilamb's performance because it performs fewer system-redundancy updates and hogs less CPU. With aggressive system-redundancy updates every second, Vilamb increases the tail latency for YCSB-A because it stalls Redis while updating system-redundancy on the same core. This effect could be mitigated if Vilamb and Redis were to run on separate cores.

Pangolin's throughput is 13–18% lower than No-Redundancy, with a higher overhead for more read-heavy workloads. In addition to the overhead of updating system-redundancy, Pangolin incurs overhead because of two other factors, both related to its micro-buffering design. First, on every object read, Pangolin probes a cuckoo hash table to check whether the latest copy of the object is in a DRAM micro-buffer or in NVM. Second, when Redis adds an object to a transaction, Pangolin copies the entire object to DRAM for micro-buffering, rather than just the modified data ranges.

For the write-heavy workload YCSB-A, Pangolin outperforms Vilamb with a system-redundancy update period of 1 sec. This is because Pangolin's micro-buffering design enables it to perform checksum and parity updates using the diff of the updated data. Pangolin uses the new data in the DRAM micro-buffer and the old data in the NVM to compute the data diff. In contrast, Vilamb has to read the entire page to update the checksum, and also read other pages in the stripe to update the parity. With 5 and 10 sec system-redundancy update periods, Vilamb outperforms Pangolin by 5–7%.

For the read-heavy workloads YCSB-B and YCSB-C, Vilamb reduces the throughput marginally (e.g., less than 2% for YCSB-C) whereas Pangolin reduces the throughput by 18%. This is because even though the number of system-redundancy updates reduces, Pangolin continues to incur the additional overheads described above. For example, Pangolin has to check whether the data is in DRAM or NVM for object reads.

Pangolin's moderate overhead (up to 18%) compared to No-Redundancy and Vilamb is an artifact of Redis' inefficiencies. In particular, Redis' single-threaded design causes it to have low performance (tens of thousands of operations per sec) that does not fully expose the system-redundancy update overheads. In the next section, we show that multi-threaded key-value stores that perform millions of operations per second benefit significantly from Vilamb's asynchronous approach.

Figure 5: PMDK Key-Value Stores – Throughput for insert-only and remove-only benchmarks with different PMDK key-value stores. (Panels: (a) insert and (b) remove throughput with a single thread; (c)–(g) insert throughput for CTree, BTree, RBTree, RTree, and HashMap with increasing thread counts; compared: Pangolin, Vilamb, and No-Redundancy.)
Intel's persistent memory development kit (PMDK) [26] implements NVM-optimized key-value stores and includes performance benchmarks.
Experimental Setup: Similar to Pangolin [69], we use insert-only and remove-only benchmarks for five key-value stores: Crit-Bit Tree (CTree), BTree, Red-Black Tree (RBTree), Range Tree (RTree), and chaining hashmap (HashMap). We first re-create the experiment and results from Pangolin [69] with a single thread that performs 5 million operations. We then use multiple threads (1 to 32) with 100,000 operations per thread.

We modify the PMDK benchmark for multi-threaded benchmarking. In the original implementation, the threads synchronize using a coarse-grained lock; each thread holds a lock over the entire data structure for the entire duration of its transaction. Not surprisingly, the coarse-grained lock leads to poor scaling. We modified the implementation such that each thread maintains and operates on its own instance of the data structure. All the threads share the same NVM pool, but do not synchronize their changes because they operate on different data. Our modifications enabled close to linear scaling for the baseline case of No-Redundancy.
Results: Figs. 5(a) and 5(b) show the throughput for the insert-only and remove-only workloads when using a single thread for the key-value store. Pangolin's overheads are similar to those reported in their paper [69]. Vilamb's performance improves with increasing delay in system-redundancy updates. Of the five key-value stores, both Pangolin and Vilamb have the highest overhead in comparison to No-Redundancy for RTree because RTree's insertion touches the largest amount of data. For the remove-only workload, Pangolin outperforms Vilamb with a 1 sec system-redundancy update period because removing objects touches only a small amount of data and Pangolin can efficiently update system-redundancy using the diffs for small data.

Figs. 5(c) to 5(g) show the insert-only throughput for the five key-value stores with increasing number of threads. Increasing the number of threads updates NVM data more aggressively and generates more system-redundancy updates. This causes Pangolin to have up to 80% lower throughput than No-Redundancy. Across the five key-value stores, Vilamb has 3–5× higher throughput than Pangolin when using 32 threads.

Figure 6: NVM Transaction Latencies – Latencies for transactional allocation, overwriting, and deallocation. (Panels: (a) allocation, (b) overwrite, and (c) deallocation for data sizes of 64, 256, 1024, and 4096 bytes; compared: Pangolin, Vilamb with 1, 5, and 10 sec system-redundancy thread periods, and No-Redundancy.)
Pangolin [69] introduced micro-benchmarks to measure the latency of transactional operations (allocation, overwrite, and deallocation), and to measure the scalability of overwriting NVM regions with multiple threads.
Experimental Setup: We perform each transactional operation (allocation, overwrite, deallocation) 1 million times for different sized objects in a single thread and report the average latency. We use an NVM file of 10 GB for this. For scalability, we increase the number of threads, with each thread overwriting 64-byte and 4 KB regions 200,000 times.
Results: Fig. 6 shows the latency for performing the transactional operations using a single thread. For 64-byte objects, Pangolin incurs 23%, 44%, and 30% higher latency than No-Redundancy for allocation, overwrite, and deallocation, respectively. In contrast, Vilamb with a system-redundancy update period of 1 sec increases the corresponding latencies by only 9%, 5%, and 3%; increasing the system-redundancy update period further reduces Vilamb's latencies. Increasing the object sizes increases the latency for all configurations, because more data is touched (except for deallocation, in which only metadata is updated). However, even for 4 KB objects, Vilamb with a system-redundancy update period of 1 sec has 13%–31% lower latencies than Pangolin.

Fig. 7 shows the throughput for overwriting 64-byte and 4 KB regions with increasing number of threads. Vilamb scales close to No-Redundancy, with only up to 25% lower throughput. In contrast, Pangolin has up to 77% lower throughput. Pangolin's experiments with real NVM (in contrast to our DRAM-based emulation) showed that No-Redundancy performance does not scale well beyond 8 threads because of NVM's limited bandwidth [69]. However, even with 8 threads Vilamb's throughput is double that of Pangolin's. As NVM performance improves and gets closer to DRAM performance, the benefits of Vilamb's asynchronous redundancy maintenance will become more pronounced. We also evaluated overwriting with other intermediate data sizes (256 and 1024 bytes) and obtained similar trends.
This section evaluates Vilamb's performance using fio [5] microbenchmarks. We cannot evaluate Pangolin using fio because fio's NVM engine [20] does not use object-based transactions. Rather, fio treats the entire DAX-mapped file as a raw sequence of bytes. This illustrates Pangolin's programming model restriction. Applications that manage DAX-mapped data themselves, either as raw data as in fio microbenchmarks or in a more complex fashion like NVM databases [3], can benefit from Pangolin only if they can be and are modified to use its APIs.
Experimental Setup: Fio's libpmem engine reads/writes DAX NVM files at a cache-line granularity. We use write-only and read-only workloads with a 16 GB file and three access patterns: uniform random, sequential, and Zipf. The workloads perform reads/writes equal to the file size. The random and sequential workloads choose previously unread/unwritten cache lines, consequently reading/writing each cache line in the entire file exactly once. We use a single thread and pin it to a logical core along with Vilamb.
Results: Fig. 8 shows the throughput for the two workloads with three access patterns each. For write-only workloads, Vilamb reduces throughput by 0.5–56%, with higher overheads for more frequent system-redundancy updates. Vilamb's overheads are highest for the random workload and lowest for the sequential workload; sequential workloads offer the best opportunity to reduce computations, because successive cache-line writes belong to the same page. Even for random workloads, the overhead is only 10% with a system-redundancy update delay of 60 seconds. Vilamb reduces the throughput by only up to 3% for read-only workloads, demonstrating the efficacy of its checking of dirty bits. Vilamb's throughput is higher than No-Redundancy for the read-only sequential workload with an update period of more than 10 seconds; this is an artifact of the experimental setup. While checking for dirty bits, Vilamb populates the page table entries and reduces the number of soft page faults. The performance benefit of the reduced soft page faults outweighs the overhead of checking the dirty bits infrequently (i.e., with a period of more than 10 seconds). This anomalous inversion of performance could be resolved by pre-populating the page table entries for Vilamb as well.

Figure 7: NVM Overwrite Throughput – (a) 64-byte and (b) 4096-byte writes with increasing number of threads; compared: Pangolin, Vilamb with a 1 sec period, and No-Redundancy.
To better understand the cost of checking and clearing dirty bits, we break down the cost into its constituent components: (i) system call, (ii) page table walk to the desired page table entries, (iii) reading/resetting the dirty bits, and (iv) TLB invalidation after clearing dirty bits. We also demonstrate the benefits of batching multiple pages when checking and clearing the dirty bits.
Experimental Setup: We use the write-only fio workload with 64-byte writes and a uniform random access pattern. We configure Vilamb to check/clear the dirty bits every second. We measure the average amount of time spent in each of the components for a single invocation of Vilamb's background thread. We vary the batch size to demonstrate the impact of batching.
Results: Fig. 9(a) presents the time spent in various components of checking and clearing dirty bits. The batch size is set to 512 pages for this experiment. Doubling the file size, and consequently the total number of pages, roughly doubles the amount of time spent in each of the components. This is because the number of system calls, page walks, and reads of the dirty bits are all directly proportional to the total number of pages. The number of pages for which the dirty bit is cleared and the number of TLB invalidations depend on the workload's access pattern. For the uniform random access workload, these are also directly proportional to the total number of pages.

Figure 8: Fio Microbenchmarks – Throughputs for write-only and read-only workloads with different access patterns. (Panels: (a) write-only and (b) read-only; random, Zipf, and sequential access; compared: Vilamb with 1, 10, 30, and 60 sec system-redundancy thread periods, and No-Redundancy.)

Fig. 9(b) presents the impact of batch size for a 16 GB file. As the batch size increases, the time spent in checking/clearing dirty bits decreases with diminishing marginal returns. This decrease is because the number of system calls reduces and larger fractions of the page table walks are shared between the pages in the same batch. The benefits diminish with increasing batch size because of the fixed cost of reading all the dirty bits and resetting the ones that are found to be set.
Figure 9: Cost of Checking/Clearing Dirty Bits – 9(a) shows the time spent in each component of checking/clearing dirty bits for a batch size of 512 pages and increasing file sizes (components: iterating over the file; checking: system calls, page walks, reading bits; clearing: system calls, page walks, resetting bits, TLB invalidation). 9(b) shows that increasing the batch size reduces the time spent in checking/clearing dirty bits, with diminishing returns.

This section analyzes the cost of batteries required for Vilamb to update the system-redundancy after a power failure for various workloads. We consider two kinds of batteries: ultra-capacitors that cost $2.85/kJ [44, 64], and lithium-ion batteries that cost $0.02/kJ [46, 64]. Conventionally, datacenters use lithium-ion batteries; modern datacenters additionally use ultra-capacitors because of their higher energy efficiency and density [64]. We consider servers with 500 W power usage [64].

For Redis with the write-heavy workload YCSB-A, one iteration of Vilamb's system-redundancy updates takes 143 ms when performed every second and 562 ms when performed every 10 seconds. These correspond to less than 1 kJ of energy required, i.e., the cost would be less than $2.85 when using ultra-capacitors and less than $0.02 when using the conventional lithium-ion batteries. This is the case for all PMDK key-value stores except RTree as well. For RTree, because of its sparse and large writes, Vilamb can require up to 5 seconds to update the system-redundancy upon a power failure, requiring 2.5 kJ of energy. This corresponds to $7.2 in ultra-capacitor cost or $0.05 in lithium-ion battery cost. For fio, even with the adversarial random write workload with a system-redundancy update period of 60 seconds, Vilamb requires only 4.5 seconds after a power failure. This translates to 2.25 kJ of required energy and $6.4 in ultra-capacitor cost or $0.04 in lithium-ion battery cost. The battery requirement, and the associated cost, can be further reduced by limiting the number of pages that can be dirty (i.e., with outdated system-redundancy) using Viyojit's [32] design.
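The cost figures above follow from a straightforward energy calculation; as a worked instance for the worst case quoted (4.5 s of post-failure work on a 500 W server):

```latex
% Energy to finish outstanding system-redundancy updates after a power failure:
E = P_{server} \times t = 500\,\mathrm{W} \times 4.5\,\mathrm{s} = 2250\,\mathrm{J} = 2.25\,\mathrm{kJ}
% Corresponding battery cost:
\mathrm{Cost}_{ultracap} = 2.25\,\mathrm{kJ} \times \$2.85/\mathrm{kJ} \approx \$6.4
\qquad
\mathrm{Cost}_{li\text{-}ion} = 2.25\,\mathrm{kJ} \times \$0.02/\mathrm{kJ} \approx \$0.04
```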
We now evaluate the increase in mean time to data loss (MTTDL) over No-Redundancy when using Vilamb. For No-Redundancy, a single page corruption causes data loss.
MTTDL_{No-Redundancy} = MTTF_{PAGE} / P, where P is the number of pages in the system.

A page corruption affects data protected with Vilamb in different ways. If the corruption affects a page that is dirty, Vilamb would checksum the corruption, leading to a silent data corruption. If the corruption affects a page that is itself clean but belongs to a stripe with a dirty page (hence, an outdated parity), Vilamb cannot recover the page, causing a data loss. For a corruption that affects a page that is itself clean and belongs to a stripe with all clean pages, Vilamb can recover the page. In summary, if the corruption affects a page in a vulnerable stripe, i.e., a stripe with even one dirty page, it would lead to data loss. MTTDL_{Vilamb} = MTTF_{PAGE} / (V × N), where V is the number of vulnerable stripes, and N is the number of pages in a stripe. Vilamb thus increases the MTTDL by a factor of P / (V × N) in comparison to No-Redundancy.

We use the above to compute the increase in the MTTDL with Vilamb over No-Redundancy for the various applications and workloads described in § 4. Workload access patterns, i.e., the rate and locality of their data updates, determine the number of vulnerable stripes. We empirically measure the average number of vulnerable stripes for the various workloads and use that to compute the increase in MTTDL. For Redis, Vilamb with a system-redundancy update period of 1 sec increases the MTTDL by 15× for the write-heavy workload YCSB-A and 74× for the read-heavy workload YCSB-B. Increasing the delay reduces the MTTDL, because a larger fraction of data remains dirty (e.g., 21× and 13× for YCSB-B with 5 sec and 10 sec periods, respectively). For PMDK's key-value stores, Vilamb increases the MTTDL by up to two orders of magnitude (e.g., 112× for the RBTree insert-only workload with 32 threads).

Vilamb provides low-overhead system-redundancy for DAX NVM data by embracing an asynchronous approach. In doing so, Vilamb creates a tunable trade-off between performance and time-to-coverage. For example, decreasing the system-redundancy update delay from 5 seconds to 1 second reduces Vilamb's throughput for Redis with the YCSB-A workload by 10% but also increases the MTTDL by 3×. Vilamb's asynchronous approach amortizes the performance overhead of updating system-redundancy over multiple data writes. As a result, Vilamb outperforms the state-of-the-art synchronous system-redundancy solution, Pangolin, by up to 5×. Although Vilamb's delayed data coverage design is not suited for all applications, it adds a high throughput option to the suite of DAX NVM system-redundancy options available to applications.

References

[1] Intel Optane/Micron 3D-XPoint Memory.

[2] Nadav Amit. Optimizing the TLB Shootdown Algorithm with Page Access Tracking. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '17, pages 27–39, Berkeley, CA, USA, 2017. USENIX Association.

[3] Joy Arulraj, Andrew Pavlo, and Subramanya R. Dulloor. Let's Talk About Storage & Recovery Methods for Non-Volatile Memory Database Systems. In
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 707–722, New York, NY, USA, 2015. ACM.

[4] Joy Arulraj, Matthew Perron, and Andrew Pavlo. Write-behind Logging. Proc. VLDB Endow., 10(4):337–348, November 2016.

[5] Jens Axboe. Fio-flexible I/O tester. URL https://github.com/axboe/fio, 2014.

[6] Lakshmi N. Bairavasundaram, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Garth R. Goodson, and Bianca Schroeder. An Analysis of Data Corruption in the Storage Stack. Trans. Storage, 4(3):8:1–8:28, November 2008.

[7] Mary Baker, Mehul Shah, David S. H. Rosenthal, Mema Roussopoulos, Petros Maniatis, TJ Giuli, and Prashanth Bungale. A Fresh Look at the Reliability of Long-term Digital Storage. In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, EuroSys '06, pages 221–234, New York, NY, USA, 2006. ACM.

[8] Bill Bridge. NVM support for C applications, 2015.

[9] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu. Error characterization, mitigation, and recovery in flash-memory-based solid-state drives. Proceedings of the IEEE, 105(9):1666–1704, Sep. 2017.

[10] Peter M. Chen, Wee Teck Ng, Subhachandra Chandra, Christopher Aycock, Gurushankar Rajamani, and David Lowell. The Rio File Cache: Surviving Operating System Crashes. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VII, pages 74–83, New York, NY, USA, 1996. ACM.

[11] L. O. Chua. Memristor-the missing circuit element. Circuit Theory, IEEE Transactions on, 18(5):507–519, Sep 1971.

[12] Peloton Database Management Systems. http://pelotondb.org.

[13] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: Making Persistent Objects Fast and Safe with Next-generation, Non-volatile Memories. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 105–118, New York, NY, USA, 2011. ACM.

[14] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O Through Byte-addressable, Persistent Memory. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 133–146, New York, NY, USA, 2009. ACM.

[15] G. Copeland, T. Keller, R. Krishnamurthy, and M. Smith. The case for safe ram. In Proceedings of the 15th International Conference on Very Large Data Bases, VLDB '89, pages 327–335, San Francisco, CA, USA, 1989. Morgan Kaufmann Publishers Inc.

[16] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's Highly Available Key-value Store. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, SOSP '07, pages 205–220, New York, NY, USA, 2007. ACM.

[17] Mingkai Dong and Haibo Chen. Soft Updates Made Simple and Fast on Non-volatile Memory. In , pages 719–731, Santa Clara, CA, 2017. USENIX Association.

[18] Aleksandar Dragojević, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. No Compromises: Distributed Transactions with Consistency, Availability, and Performance. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 54–70, New York, NY, USA, 2015. ACM.

[19] Subramanya R. Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. System Software for Persistent Memory. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, pages 15:1–15:15, New York, NY, USA, 2014. ACM.

[20] Running FIO with pmem engines. https://pmem.io/2018/06/25/fio-tutorial.html.

[21] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, pages 29–43, New York, NY, USA, 2003. ACM.

[22] Sriram Govindan, Anand Sivasubramaniam, and Bhuvan Urgaonkar. Benefits and Limitations of Tapping into Stored Energy for Datacenters. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, pages 341–352, New York, NY, USA, 2011. ACM.

[23] Dave Hitz, James Lau, and Michael Malcolm. File system design for an nfs file server appliance. In Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference, WTEC'94, pages 19–19, Berkeley, CA, USA, 1994. USENIX Association.

[24] Qingda Hu, Jinglei Ren, Anirudh Badam, Jiwu Shu, and Thomas Moscibroda. Log-structured Non-volatile Main Memory. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '17, pages 703–717, Berkeley, CA, USA, 2017. USENIX Association.

[25] PMDK's libpmemobj Library. https://pmem.io/pmdk/libpmemobj/.

[26] PMDK: Intel Persistent Memory Development Kit. http://pmem.io.

[27] Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R. Dulloor, Jishen Zhao, and Steven Swanson. Basic Performance Measurements of the Intel Optane DC Persistent Memory Module. CoRR, abs/1903.05714, 2019.

[28] Minwen Ji, Alistair C. Veitch, and John Wilkes. Seneca: remote mirroring done write. In USENIX Annual Technical Conference, General Track, ATC'03, pages 253–268, 2003.

[29] Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky. Are Disks the Dominant Contributor for Storage Failures?: A Comprehensive Study of Storage Subsystem Failure Characteristics. Trans. Storage, 4(3):7:1–7:25, November 2008.

[30] Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, and Vijay Chidambaram. SplitFS: A File System that Minimizes Software Overhead in File Systems for Persistent Memory. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19), Ontario, Canada, October 2019.

[31] Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-sided (RDMA) Datagram RPCs. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 185–201, Berkeley, CA, USA, 2016. USENIX Association.

[32] Rajat Kateja, Anirudh Badam, Sriram Govindan, Bikash Sharma, and Greg Ganger. Viyojit: Decoupling Battery and DRAM Capacities for Battery-Backed DRAM. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA '17, pages 613–626, New York, NY, USA, 2017. ACM.

[33] Rajat Kateja, Nathan Bechmann, and Greg Ganger. Tvarak: Software-managed hardware offload for dax nvm storage redundancy. Parallel Data Lab Technical Report CMU-PDL-19-105.

[34] Kimberly Keeton, Cipriano Santos, Dirk Beyer, Jeffrey Chase, and John Wilkes. Designing for Disasters. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies, FAST'04, pages 5–5, Berkeley, CA, USA, 2004. USENIX Association.

[35] Taeho Kgil, David Roberts, and Trevor Mudge. Improving nand flash based disk caches. In
Proceed-ings of the 35th Annual International Symposiumon Computer Architecture , ISCA ’08, pages 327–338, Washington, DC, USA, 2008. IEEE ComputerSociety.[36] Hideaki Kimura. FOEDUS: OLTP Engine for aThousand Cores and NVRAM. In
Proceedings ofthe 2015 ACM SIGMOD International Conferenceon Management of Data , SIGMOD ’15, pages 691–706, New York, NY, USA, 2015. ACM.[37] Vasileios Kontorinis, Liuyi Eric Zhang, Baris Ak-sanli, Jack Sampson, Houman Homayoun, EddiePettis, Dean M. Tullsen, and Tajana Simunic Ros-ing. Managing Distributed Ups Energy for Effec-tive Power Capping in Data Centers. In
Proceed-ings of the 39th Annual International Symposiumon Computer Architecture , ISCA ’12, pages 488–499, Washington, DC, USA, 2012. IEEE ComputerSociety.[38] Harendra Kumar, Yuvraj Patel, Ram Kesavan,and Sumith Makam. High-performance Meta-data Integrity Protection in the WAFL Copy-on-write File System. In
Proceedings of the 15thUsenix Conference on File and Storage Technolo-gies , FAST’17, pages 197–211, Berkeley, CA,USA, 2017. USENIX Association.[39] Benjamin C. Lee, Engin Ipek, Onur Mutlu, andDoug Burger. Architecting Phase Change MemoryAs a Scalable Dram Alternative. In
Proceedingsof the 36th Annual International Symposium onComputer Architecture , ISCA ’09, pages 2–13, NewYork, NY, USA, 2009. ACM.[40]
Supporting filesystems in persistent memory . https://lwn.net/Articles/610174/ .[41] Virendra J. Marathe, Margo Seltzer, Steve Byan,and Tim Harris. Persistent Memcached: BringingLegacy Code to Byte-addressable Persistent Mem-ory. In Proceedings of the 9th USENIX Conferenceon Hot Topics in Storage and File Systems , Hot-Storage’17, pages 4–4, Berkeley, CA, USA, 2017.USENIX Association.[42] Sanketh Nalli, Swapnil Haria, Mark D. Hill,Michael M. Swift, Haris Volos, and Kimberly Kee-ton. An Analysis of Persistent Memory Use withWHISPER. In
Proceedings of the Twenty-SecondInternational Conference on Architectural Support for Programming Languages and Operating Sys-tems , ASPLOS ’17, pages 135–148, New York, NY,USA, 2017. ACM.[43] Sumit Narayan, John A. Chandy, Samuel Lang,Philip Carns, and Robert Ross. Uncovering Errors:The Cost of Detecting Silent Data Corruption. In
Proceedings of the 4th Annual Workshop on Petas-cale Data Storage , PDSW ’09, pages 37–41, NewYork, NY, USA, 2009. ACM.[44] Dushyanth Narayanan and Orion Hodson. Whole-system Persistence. In
Proceedings of the Seven-teenth International Conference on ArchitecturalSupport for Programming Languages and Operat-ing Systems , ASPLOS XVII, pages 401–410, NewYork, NY, USA, 2012. ACM.[45]
Intel Optane Memory SSDs . .[46] Darshan S. Palasamudram, Ramesh K. Sitaraman,Bhuvan Urgaonkar, and Rahul Urgaonkar. UsingBatteries to Reduce the Power Costs of Internet-scale Distributed Networks. In Proceedings of theThird ACM Symposium on Cloud Computing , SoCC’12, pages 11:1–11:14, New York, NY, USA, 2012.ACM.[47] David A. Patterson, Garth Gibson, and Randy H.Katz. A Case for Redundant Arrays of Inexpen-sive Disks (RAID). In
Proceedings of the 1988ACM SIGMOD International Conference on Man-agement of Data , SIGMOD ’88, pages 109–116,New York, NY, USA, 1988. ACM.[48] R. Hugo Patterson, Stephen Manley, Mike Fed-erwisch, Dave Hitz, Steve Kleiman, and ShaneOwara. SnapMirror: File-System-Based Asyn-chronous Mirroring for Disaster Recovery. In
Pro-ceedings of the 1st USENIX Conference on File andStorage Technologies , FAST ’02, Berkeley, CA,USA, 2002. USENIX Association.[49]
Deprecating the PCOMMIT instruction . https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction .[50] Plexistore keynote presentation atNVMW 2018 . http://nvmw.ucsd.edu/nvmw18-program/unzip/current/nvmw2018-paper97-presentations-slides.pptx .1551] Persistent Memory Emulation . http://pmem.io/2016/02/22/pm-emulation.html .[52] Persistent Memory Storage Engine . https://github.com/pmem/pmse .[53] Vijayan Prabhakaran, Lakshmi N. Bairavasun-daram, Nitin Agrawal, Haryadi S. Gunawi, An-drea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. IRON File Systems. In Proceedings ofthe Twentieth ACM Symposium on Operating Sys-tems Principles , SOSP ’05, pages 206–220, NewYork, NY, USA, 2005. ACM.[54] Moinuddin K. Qureshi, Vijayalakshmi Srinivasan,and Jude A. Rivers. Scalable High PerformanceMain Memory System Using Phase-change Mem-ory Technology. In
Proceedings of the 36th AnnualInternational Symposium on Computer Architec-ture , ISCA ’09, pages 24–33, New York, NY, USA,2009. ACM.[55]
Redis: in-memory key value store . http://redis.io/ .[56] Redis PMEM: Redis, enhanced to use PMDK’slibpmemobj . https://github.com/pmem/redis .[57] Ohad Rodeh, Josef Bacik, and Chris Mason.BTRFS: The Linux B-Tree Filesystem. Trans.Storage , 9(3):9:1–9:32, August 2013.[58] Bianca Schroeder, Sotirios Damouras, and PhillipaGill. Understanding Latent Sector Errors and Howto Protect Against Them.
ACM Trans. Storage ,6(3):9:1–9:23, September 2010.[59] Yizhou Shan, Shin-Yeh Tsai, and Yiying Zhang.Distributed Shared Persistent Memory. In
Proceed-ings of the 2017 Symposium on Cloud Computing ,SoCC ’17, pages 323–337, New York, NY, USA,2017. ACM.[60] Gopalan Sivathanu, Charles P. Wright, and ErezZadok. Ensuring data integrity in storage: Tech-niques and applications. In
Proceedings of the 2005ACM Workshop on Storage Security and Survivabil-ity , StorageSS ’05, pages 26–36, New York, NY,USA, 2005. ACM.[61] Haris Volos, Sanketh Nalli, Sankarlingam Panneer-selvam, Venkatanathan Varadarajan, Prashant Sax-ena, and Michael M. Swift. Aerie: Flexible file-system interfaces to storage-class memory. In
Proceedings of the Ninth European Conference on Computer Systems , EuroSys ’14, pages 14:1–14:14,New York, NY, USA, 2014. ACM.[62] Haris Volos, Andres Jaan Tack, and Michael M.Swift. Mnemosyne: Lightweight Persistent Mem-ory. In
Proceedings of the Sixteenth InternationalConference on Architectural Support for Program-ming Languages and Operating Systems , ASPLOSXVI, pages 91–104, New York, NY, USA, 2011.ACM.[63] Di Wang, Sriram Govindan, Anand Sivasubra-maniam, Aman Kansal, Jie Liu, and BadriddineKhessib. Underprovisioning Backup Power Infras-tructure for Datacenters. In
Proceedings of the 19thInternational Conference on Architectural Supportfor Programming Languages and Operating Sys-tems , ASPLOS ’14, pages 177–192, New York, NY,USA, 2014. ACM.[64] Di Wang, Chuangang Ren, Anand Sivasubrama-niam, Bhuvan Urgaonkar, and Hosam Fathy. En-ergy Storage in Datacenters: What, Where, andHow Much? In
Proceedings of the 12th ACM SIG-METRICS/PERFORMANCE Joint InternationalConference on Measurement and Modeling of Com-puter Systems , SIGMETRICS ’12, pages 187–198,New York, NY, USA, 2012. ACM.[65] Xiaojian Wu and A. L. Narasimha Reddy. SCMFS:A File System for Storage Class Memory. In
Pro-ceedings of 2011 International Conference for HighPerformance Computing, Networking, Storage andAnalysis , SC ’11, pages 39:1–39:11, New York, NY,USA, 2011. ACM.[66] Jian Xu and Steven Swanson. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. In ,pages 323–338, Santa Clara, CA, 2016. USENIXAssociation.[67] Jian Xu, Lu Zhang, Amirsaman Memaripour, Ak-shatha Gangadharaiah, Amit Borase, Tamires BritoDa Silva, Steven Swanson, and Andy Rudoff.NOVA-Fortis: A Fault-Tolerant Non-Volatile MainMemory File System. In
Proceedings of the 26thSymposium on Operating Systems Principles , SOSP’17, pages 478–496, New York, NY, USA, 2017.ACM.[68] Da Zhang, Vilas Sridharan, and Xun Jian. Explor-ing and optimizing chipkill-correct for persistent16emory based on high-density nvrams. In , pages 710–723.IEEE, 2018.[69] Lu Zhang and Steven Swanson. Pangolin: A Fault-Tolerant Persistent Memory Programming Library.In , Renton, WA, 2019. USENIXAssociation.[70] Yiying Zhang, Jian Yang, Amirsaman Memaripour,and Steven Swanson. Mojim: A Reliable andHighly-Available Non-Volatile Memory System. In
Proceedings of the Twentieth International Confer-ence on Architectural Support for ProgrammingLanguages and Operating Systems , ASPLOS ’15,pages 3–18, New York, NY, USA, 2015. ACM. [71] Yupu Zhang, Abhishek Rajimwale, Andrea C.Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.End-to-end Data Integrity for File Systems: A ZFSCase Study. In
Proceedings of the 8th USENIX Con-ference on File and Storage Technologies , FAST’10,pages 3–3, Berkeley, CA, USA, 2010. USENIXAssociation.[72] Jishen Zhao, Sheng Li, Doe Hyun Yoon, Yuan Xie,and Norman P. Jouppi. Kiln: Closing the Per-formance Gap Between Systems with and WithoutPersistence Support. In