Efficient Kernel Object Management for Tiered Memory Systems with KLOC
Sudarsun Kannan (Rutgers University), Yujie Ren (Rutgers University), Abhishek Bhattacharjee (Yale University)
Abstract
Software-controlled heterogeneous memory systems have the potential to improve performance, efficiency, and cost tradeoffs in emerging systems. Delivering on this promise requires efficient operating system (OS) mechanisms and policies for data management. Unfortunately, modern OSes do not support efficient tiering of data between heterogeneous memories. While this problem is known (and is being studied) for application-level data pages, the question of how best to tier OS kernel objects has largely been ignored.

We show that careful kernel object management is vital to the performance of software-controlled tiered memory systems. We find that state-of-the-art OS page management research leaves considerable performance on the table by overlooking how best to tier, migrate, and manage kernel objects like inodes, dentry caches, journal blocks, network socket buffers, etc., associated with the filesystem and networking stack. In response, we characterize hotness, reuse, and liveness properties of kernel objects to develop appropriate tiering/migration mechanisms and policies. We evaluate our proposal using a real-system emulation framework on large-scale workloads like RocksDB, Redis, Cassandra, and Spark, and achieve 1.4× to 4× higher throughput compared to prior art.

Introduction

Hardware heterogeneity is here. Vendors are coupling general-purpose CPUs with accelerators ranging from GPUs and FPGAs to domain-specific hardware for deep learning, signal processing, finite automata, and much more [10, 11, 32]. Memory systems are combining the best properties of emerging technologies optimized for latency, bandwidth, capacity, volatility, or cost. Researchers are already studying the benefits of die-stacked DRAM [9, 29, 44], while Intel's Knight's Landing uses high-bandwidth multi-channel DRAM (MCDRAM) alongside DDR4 memory to achieve both high bandwidth and high capacity [5]. Non-volatile 3D XPoint memories are now commercially available for database systems, and disaggregated memory is being touted as a promising solution to scale capacity for blade servers [40]. Next-generation systems will consist of heterogeneous compute nodes (CPU, GPU, or both) connected to multiple types of memory with different bandwidth, latency, and capacity properties.

To harness the promise of heterogeneity, software-controlled data management is necessary to ensure that as programs navigate different phases of execution, each with potentially distinct working sets, data is tiered appropriately. To optimize for performance, ideally the hottest data will be placed in the fastest memory node (in terms of latency or bandwidth) until that node is full, the next-hottest data will be filled into the second-fastest node up to its capacity, and so on. As a program executes, its data must be periodically assessed for hotness and re-organized to maximize performance. Doing this successfully requires effective OS policies and mechanisms to determine data reuse and control data migration.

Unfortunately, data hotness tracking and page migration in modern OSes have high overheads and are surprisingly inefficient. Recent studies address this in many ways, including support for transparent huge page migration [37], concurrent migration of multiple pages and symmetric exchange of pages [54], compiler- and MemIf-based frameworks [41, 50] that determine data hotness via heap profiling, and machine learning to determine page hotness [38]. Other OS-based approaches [8, 35] and hardware-accelerated approaches [13, 30, 42, 57] have been proposed.
Although effective, these studies focus on application-level data and largely ignore the question of how best to tier kernel objects associated with I/O.

We show that ignoring kernel objects leaves considerable performance on the table and that carefully tiering, migrating, and managing kernel objects like inodes, dentry caches, journal blocks, network socket buffers, etc., is vital to overall system performance. We also show – by characterizing hotness, reuse, and liveness properties – that techniques previously proposed for application-level data tiering are a poor fit for kernel objects. This is because of three key differences between application-level pages and kernel objects. First, kernel objects are not mapped or owned by a specific application. Second, kernel objects can be, and are, shared and reused across multiple tenants. Third, kernel objects are much shorter-lived than application-level data. For these reasons, traditional hotness scanning and page migration techniques cannot simply be extended to kernel objects, which can neither be associated with a specific application nor are sufficiently long-lived to tolerate the latencies of state-of-the-art application-level page migration techniques [54].

In response, we introduce the concept of kernel-level object contexts (or KLOCs). Each KLOC encapsulates a set of kernel-level objects associated with entities like files, sockets, virtual device files, etc. We build allocation, deallocation, page placement, and page migration mechanisms and policies for KLOCs, striking a compromise between tying kernel-level objects to application-level characteristics while also adapting to the unique reuse/lifetime characteristics of kernel-level objects. KLOC-based tiering allows the OS to associate applications and kernel-level objects (thereby tracking when, how, and why applications access kernel-level objects) while also understanding that kernel objects can be shared across different applications (with different I/O behavior) and have shorter lifetimes than application-level data. KLOC-based tiering also permits kernel objects to leverage aspects of the file system and networking stacks that have already been optimized for performance (e.g., data prefetching for I/O, etc.).

To build KLOCs, we address several research challenges. Unlike coarser-grained application-level approaches based on NUMA affinity, which generally tie data to particular memory sockets for large chunks of application lifetime, files and sockets can be rapidly spawned/killed and accessed/reused in a myriad of ways (for example, RocksDB [2] creates and operates on hundreds of files through its lifetime). Some kernel objects (like file system journals) can be shared across multiple files (and applications). Finally, current OSes lack support for kernel object migration entirely (kernel objects are managed by the OS's slab allocator and remain in the location where they were allocated). Even worse, building kernel object migration is a non-trivial effort – many kernel objects (like kernel buffer pages) have lifetimes that are so short (tens of milliseconds) that the latency to migrate them between memory tiers and perform associated operations like TLB shootdowns [42, 43] can be prohibitively high.
To accurately gauge the benefits of KLOCs, we prototype our approach in the mainline Linux kernel v4.17. To determine the set of kernel objects per KLOC, we rely on application-level system calls to identify the files, sockets, virtual devices, etc., being accessed and the newly-allocated kernel objects (e.g., a dentry, cache page, or packet/socket buffer) that they are associated with. Each file, socket, and virtual device has its own KLOC. We also associate kernel objects with CPUs in order to maintain information about CPU- and application-wide KLOC associations. To track chains of kernel objects associated with KLOCs in an efficient manner, we implement a lightweight object map table within the kernel. This map is accessed by the OS to determine the KLOCs associated with cold kernel objects so that they can be migrated to slow memory, and is designed to obviate the need for high-overhead page table scans to determine page hotness. We overcome this challenge by grouping kernel objects used by files and sockets into a large region of virtually contiguous pages. By doing this, and by then going beyond modern OSes and implementing support for kernel object migration, we enable good performance. To boost efficiency, we also exploit I/O stack optimizations like prefetching of file data and insertion of prefetched cache pages into faster memory (which is particularly beneficial for workloads with sequential I/O access patterns). Overall, we add 6K lines of code to the Linux kernel to implement these techniques, and make no changes to the hardware or applications.

We quantify the benefits of KLOCs by using a two-socket system to emulate a two-tier memory system with a fast and a slow memory, similar to prior work [23, 26, 36, 54]. We perform end-to-end evaluations with RocksDB (a persistent key-value store), Redis (a network-intensive memory store), Cassandra (a distributed wide-column store), and Spark (a distributed general-purpose cluster-computing framework), in addition to microbenchmarks like Filebench. Our performance gains of up to 1.4× and 4× respectively show the performance potential of KLOCs, and lay the groundwork for further exploration of kernel object tiering mechanisms and policies in the systems research community.
Background

Advances in heterogeneous memory hardware have motivated the need for efficient management of memory resources that vary in capacity, speed, and cost. We first discuss hardware and software trends, followed by related work on techniques for memory heterogeneity management and their limitations.
Several heterogeneous memory technologies, such as non-volatile memory (NVM), Hybrid Memory Cube (HMC), and High Bandwidth Memory (HBM), will coexist with traditional DRAM. On-chip memories such as stacked 3D-DRAM, HMC, and HBM [13, 42] are expected to provide 10× higher bandwidth and 1.5× lower latency, but 8-16× lower capacity [7, 9, 12, 43] compared to DRAM. Other technologies, like NVMs, offer 4-8× higher capacity compared to DRAM but suffer from 2-3× higher read latency, 5× higher write latency, and 3-5× lower random access bandwidth compared to DRAM. Several file systems [15, 21, 52, 53] and user-level libraries [28, 49] have been proposed to exploit persistence, as have approaches to integrate these memories as virtual memory [22, 36]. Given the differences in bandwidth, latency, and capacity, heterogeneous memory systems will increase the complexity of the memory management software stack.

Hardware-level management. There have been several prior proposals to manage memory heterogeneity in hardware. Batman modifies the memory controller to randomize data placement to increase the cumulative DRAM and stacked 3D-DRAM bandwidth [13]. Meswani et al. [42] discuss extending the TLB and the memory controller with additional logic for identifying page hotness. To reduce page migration cost, Dong et al. [20] propose dynamically remapping physical addresses, similar to an SSD FTL [14]. Oskin et al. propose an architectural mechanism to selectively invalidate entries in the TLB to reduce TLB shootdowns during migrations [43]. Ramos et al. propose a hybrid design with a hardware-driven page placement policy and the OS periodically updating its page tables using information from the memory controller [46].

Software-level management. Several recent studies augment traditional OS approaches to track page hotness by scanning page tables to migrate application pages of different sizes [8, 24, 35, 38, 41, 43, 54]. These approaches extend work originally proposed by Denning [19] for disk swapping. Gupta et al. propose HeteroVisor [24], which uses page hotness tracking and migration techniques for virtualized datacenters, whereas Kannan et al. [35] propose on-demand data placement for virtualized datacenters. Yan et al. [54] propose techniques to accelerate page migration in heterogeneous memory systems by increasing parallelism. Lagar-Cavilla et al. [38] propose combining OS-level hotness scanning with machine learning for data placement across fast and slow memories. Many of these techniques extend the concept of NUMA affinity to data pages; that is, they associate applications with particular memory sockets in order to accelerate memory access from CPUs that are physically closer. While these approaches are beneficial for application-level data, kernel object management for heterogeneous memory remains unexplored and is in its infancy (for example, there is no support for kernel object migration in modern OSes).
Application | Description | Resident Mem. Size
RocksDB [2] | Facebook's persistent key-value store based on a log-structured merge tree; workload: widely-used DBbench [3] with 1M keys. | 8.4 GB
Redis [6] | Network-intensive key-value store with support for persistence; uses RedisBench, 4 million ops., 75%/25% Set/Get distribution. | 14 GB
Filebench [47] | File system benchmark; uses eight threads, 8 GB per thread, performing sequential and random reads. | 16.3 GB
Cassandra [1] | Java-based NoSQL DB; run with a YCSB [16] workload using eight threads, 50% read-write ratio. | 11 GB
Spark [56] | Apache Spark; performs TeraSort on 20 GB of data using sixteen threads and uses the Hadoop file system. | 32.1 GB

Table 1: Applications and workloads. The table also shows application resident set size (in GB).
Experimental Environment
Processors | 2.4 GHz Intel E5-2650 v4 (Broadwell), 20 cores/socket, 2 threads/core
Cache | 512 KB L2, 25 MB LLC
Memory sockets | Two 80 GB sockets configured as NUMA nodes, max bandwidth of
Storage | 512 GB NVMe with 1.2 GB/s sequential and 412 MB/s random access bandwidth
OS | Debian Trusty, Linux v4.17.0
Table 2: System configurations.
For optimal application performance in heterogeneous memory systems, placing not only application-level pages but also kernel pages in faster memory is critical. However, page placement of kernel pages (and objects) is not well studied. To understand the need for kernel object placement in heterogeneous memory systems, we next analyze real-world I/O-intensive applications.
Quantifying the performance of kernel object tiering requires a platform that permits end-to-end execution of large-scale workloads (we use those summarized in Table 1) with full-system effects. Cycle-accurate simulators are too slow and lack the detail necessary to (easily and accurately) study kernel-level structures in the virtual memory, storage, and network subsystems [13, 20, 42]. Ideally, we would use a commercially-available heterogeneous memory platform with support for flexible tiering of kernel objects. Regrettably, there are no commercial platforms that can be configured to do this yet. For example, we considered Intel's DC memory [4], which attaches persistent Optane memory side-by-side with DRAM. Unfortunately, this system can currently only be configured such that the persistent memory is a direct-access file system accessible via custom user-level runtimes, or the DRAM is a direct-mapped L4 cache of the persistent memory. There is no way to configure the DC memory platform to make it entirely visible (or kernel objects visible) to the virtual memory subsystem, which is what we need for our studies.

Therefore, while we will revisit Intel's DC platform in Section 6, for now we use a two-socket memory system to emulate a two-level tiered memory in a manner similar to recent work [24, 26, 36, 43, 54]. These sockets have the architectural configuration described in Table 2. Like previous work [24, 26, 36], we emulate a fast memory on one of the sockets and a slow memory on the other by applying thermal throttling to slow down the latter. The slow memory's bandwidth and latency are configured by modifying the PCI-based thermal registers. The flexibility of this platform enables exploration of a generic software-controlled heterogeneous framework, as we can vary the capacities of the memory nodes as well as their latency/bandwidth characteristics. For our studies, we vary capacity/bandwidth differences between the fast and slow memories from 2× to 16×. Furthermore, for some of our experiments, we controlled application/kernel page placement in the fast/slow memories by adding hooks in Linux's memory management stack to redirect page allocations. All workloads are executed on the 20 CPUs of the node associated with fast memory. We will publicly release our kernel and tools so that they can be used by the community for follow-up studies.

Experimental Results

Table 1 shows that we focus on large-scale workloads that are compute-, file-, and network-intensive in order to stress-test our approach. We also use Filebench with a mixed random write and read access workload. We structure our studies around the following questions:
How are kernel memory objects allocated and accessed?
Modern data center applications are known to be I/O-intensive and expend considerable portions of their runtime within the kernel-level file system and networking stacks. For example, 40% of the runtime of RocksDB is spent within the file system code path. What is less well known, however, is that these applications also allocate millions of memory pages for kernel objects that perform I/O caching, in-memory metadata and book-keeping structures (like radix trees for caches), journals and logs, as well as ingress and egress network socket buffers. Figure 1a quantifies the number of pages allocated for different kernel objects. The data shows that all the workloads allocate many page cache pages and kernel buffers (using slab allocations via kmalloc). Filebench uses eight I/O threads to simultaneously write 4 KB blocks to separate files. The write and read operations entail allocation of page cache pages (for writing data or bringing data from disk) as well as updates to system metadata structures, which involve allocating journals, radix tree nodes, block driver buffers, etc. Consequently, both page cache and kernel metadata allocations increase significantly compared to user-level pages. In contrast, RocksDB updates hundreds of 4 MB files with key-value data. Therefore, slab allocations for inodes, dentry caches, radix tree nodes (for the indexing cache), driver block I/O structures, and journals are all frequent and contribute to 36% of the pages allocated to kernel objects. Redis is network-intensive and allocates many pages for ingress and egress socket buffers, and also page cache pages to periodically checkpoint key-value store state to a large file on disk [6]. Spark [56] uses the Hadoop file system (HDFS) to store and checkpoint data (RDDs [56]). Note that HDFS runs as a separate process. HDFS maintains a user-level cache and periodically updates the page cache (hence fewer kernel buffer pages).

We also profile the frequency with which these different kernel objects are accessed in Figure 1b and the distribution of last-level cache misses in Figure 1c. Even though fewer kernel buffers are allocated (see Figure 1a), they are accessed more often than other kernel objects. To understand why, consider, for example, a file write in Filebench. The virtual file system (VFS) looks up the page cache radix tree, allocates a new page if necessary, inserts the page into the radix tree, performs metadata/data journalling with logging, and finally commits to storage. These steps are more memory-intensive than writing the data to the page cache. In fact, scaling the workload inputs leads to a sharp increase in LLC misses due to higher traffic to kernel buffers. Filebench spends 86% of its execution time inside the OS, and hence its memory accesses increase proportionately, compared to RocksDB (54%) and Redis (38%).
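To make the kernel-object churn behind these numbers concrete, the sketch below walks through the allocations that one buffered file write can trigger. This is illustrative C-style pseudocode only; the helper names (pagecache_lookup, journal_log_metadata, etc.) are placeholders we introduce for exposition, not actual kernel symbols.

/* Illustrative sketch: kernel objects touched by one buffered file write.
 * Helper names are placeholders, not real kernel symbols.               */
ssize_t sketch_buffered_write(struct file *f, const char *buf,
                              size_t len, loff_t pos)
{
    /* 1. Look up the page cache radix tree; on a miss, allocate a cache
     *    page and possibly new radix-tree nodes (slab objects).          */
    struct page *pg = pagecache_lookup(f, pos);
    if (!pg)
        pg = pagecache_alloc_and_insert(f, pos);

    /* 2. Copy the user data into the page cache page. */
    copy_to_cache_page(pg, buf, len);

    /* 3. Log the update for crash consistency: journal (JBD2) records and
     *    in-memory journal handles are allocated here.                    */
    journal_log_metadata(f, pg);

    /* 4. On fsync()/commit, block I/O (bio) structures are allocated by
     *    the block layer to write the page to storage.                    */
    if (needs_commit(f))
        submit_block_io(pg);

    return len;
}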
How does tiering of kernel objects impact performance?
To study the impact of kernel object placement in fast/slow memory, we configure the capacity of the fast memory so that it cannot fit all of the application's user-level and kernel-level pages. Our results, illustrated in Figure 2, assume that fast and slow memory are 5 GB and 40 GB respectively, and that slow memory has a bandwidth of 5 GB/s, thereby emulating a 5× bandwidth difference between fast and slow memory. This is similar to recent work [24, 35, 42] and is representative of the bandwidth differences between HBM and NVM technologies relative to DRAM. In tandem, Figure 3 and Figure 4 show the impact of varying slow memory bandwidth and fast memory capacity. The App Slow + OS Slow bars show the worst-case scenario where all pages are placed in slower memory, App Slow + OS Fast shows the case where only the kernel pages are placed in fast memory, App Fast + OS Slow shows the case where only application-level pages are placed in fast memory, and App Fast + OS Fast shows an ideal case where all pages fit in fast memory. The y-axis shows the normalized throughput.

Placing both application and kernel pages (App Slow + OS Slow) in slow memory degrades performance across all workloads. As shown in Figure 2, placing only kernel pages (App Slow + OS Fast) in limited-capacity fast memory does improve performance; for example, RocksDB and Filebench improve by 1.58× and 2.8× respectively compared to App Slow + OS Slow. In real-world settings, one would not place only kernel pages in fast memory (but would also do so for application pages as much as possible), but this experiment shows that even just tiering kernel objects appropriately impacts performance. For the network- (and storage-) intensive Redis, placing kernel pages in fast memory boosts performance by 1.8× over App Slow + OS Slow. In Spark, we note high contention between Spark compute pages (heap) and HDFS storage (kernel pages) for the limited-capacity fast memory. Placing only kernel pages in fast memory improves performance by 1.3×, mostly due to page cache placement in fast memory. Finally, suppose we place application pages in fast memory but prevent kernel objects from being in fast memory. While Redis improves marginally, this is not the case for Filebench, which spends 86% of its execution inside the OS.

Figure 3 and Figure 4 show the sensitivity of RocksDB (highly I/O- and OS-intensive) and Spark (mostly compute-intensive with intermittent I/O) to lowering slow memory bandwidth or reducing fast memory capacity. The x-axis in Figure 3 shows the increasing ratio of slow memory bandwidth to fast memory bandwidth, whereas Figure 4 shows the increasing fast memory capacity ratio. The results show that reducing fast memory capacity or lowering slow memory bandwidth impacts the applications, and that the impact of placing kernel objects in slower memory affects RocksDB significantly.

Figure 1: Memory allocation, access, and last-level cache miss distribution. (a) Page allocation distribution: bars show the distribution across heap, page cache (OS), and kernel buffers (slab pages); the y-axis shows overall page allocations during an application's lifetime. (b) Memory access distribution: bars show the memory access distribution (in %) across application, kernel (OS), and other user-space libraries. (c) Last-level cache miss distribution (in millions).

Figure 2: Impact of kernel object (page) placement. Fast memory capacity is set to 1/8th (5 GB) of slow memory capacity (40 GB), and slow memory bandwidth is set to 1/5th (4 GB/s) of fast memory.
Figure 3: Memory bandwidth sensitivity. The x-axis varies slow memory bandwidth relative to fast memory, and the y-axis shows throughput (K ops/s) for RocksDB and Spark.
Figure 4: Sensitivity to fast memory capacity. The x-axis shows the fast memory capacity ratio relative to an all-fast-memory system, and the y-axis shows the throughput impact (K ops/s) for RocksDB and Spark. App Fast + OS Fast shows optimal-case performance in an all-fast-memory system. The slow memory bandwidth is set to 1/16th of fast memory.
Overall, these results show that the choice of where to place kernel objects impacts performance substantially, and that there is consequently a need to go beyond prior work and devise efficient tiering of kernel objects.

Figure 5: Active lifetime and access interval. Bars show the average active lifetime of cache and kernel-buffer pages before they are recycled; the y-axis is in milliseconds.
What is the lifetime of a typical kernel object?
Figure 5 shows the lifetime of OS cache and kernel buffer (slab) pages. Conceptually, kernel buffer pages expire after they are freed. Cache pages remain until they are evicted due to memory pressure (we mark them as expired when they are added to the LRU list). Figure 5 shows that RocksDB and Redis have cache pages with average lifetimes of less than 160 milliseconds, and smaller kernel pages are even shorter-lived, at 60 milliseconds. To understand why kernel objects are short-lived, consider, for example, a file write. A page cache page is allocated, the user data is copied to the cache page, a radix tree node is allocated using the slab allocator [31], and the cache page is inserted. The page cache page remains inactive until subsequent reads/writes or commits (i.e., fsync()) to the disk. In contrast, kernel buffers such as radix tree nodes are frequently queried, allocated, and deleted due to tree rebalancing or cache page deletion. Other in-memory structures such as dentry caches and in-memory journals are also frequently allocated and deleted when data and metadata are updated. These observations showcase the limitations of prior work like Thermostat [8], which uses a 30-second interval between two hotness tracking iterations. This relatively large time period was used because scans of page tables to ascertain hotness and invalidate TLB entries are long-latency events. Our results show, however, that kernel objects are far too short-lived to be amenable to such long intervals.
Our Approach: KLOC-Based Tiering
Having quantified the performance challenges posed and opportunities offered by tiering of kernel objects in generic software-controlled heterogeneous memory systems, we consider how to go beyond prior work (which neglects kernel object tiering) and devise kernel tiering. Our goals are high performance and ready implementation in commercial OSes. At first blush, one might consider extending OS support for NUMA affinity to also include kernel objects. This approach would permit placement of kernel objects – just like application pages – in memory devices in a manner that attempts to minimize the distance between CPUs and the data they frequently access. Unfortunately, such NUMA affinity approaches are not viable for kernel objects, which can be shared/reused across applications (making it challenging to assign a single affinity to kernel objects) and have lifetimes much shorter than application pages (making existing ways of measuring hotness and migrating pages inapplicable). Current OSes, including Linux, FreeBSD, and Solaris, lack the capability to associate kernel objects with application entities or to provide fine-grained placement of kernel objects. Instead, we use the concept of KLOCs to encapsulate groups of kernel objects into logical entities – associated with files and network sockets – that can be managed together in a lightweight manner. Using these entities as a unit of movement enables finer-grained decision-making about allocation, placement, and migration of kernel objects associated with an entity (i.e., a KLOC) than using NUMA affinity. These benefits are crucial for good performance but do require extending existing filesystems, network stacks, the OS slab allocator, journals, and device drivers with support for KLOCs. A key feature of our approach is to tap into application-level system calls to accurately track KLOCs, together with kernel-level maps that let us identify hot kernel objects much faster than traditional approaches that scan the page table. This in turn permits us to migrate kernel objects fast enough that the benefits are not outweighed by relatively short kernel object lifetimes. Finally, we leverage existing and highly-optimized I/O stack optimizations such as adaptive I/O prefetching (also known as readahead) and techniques to speculatively place I/O cache and filesystem objects in fast memory to further boost performance.
We next discuss the key design and implementation details of KLOC. We base our design discussion on support for the file system and network stacks, followed by ways to support effective kernel object placement and migration and to exploit traditional OS optimizations such as I/O prefetching.
We start with a set of page placement constraints. First, kernel objects are short-lived, and immediate placement of currently active objects in faster memory is critical. Second, because faster memory is usually lower-capacity, migrating inactive objects to slower memory to make way for active objects is critical. Third, reducing the frequency of long-latency hotness scans and reducing migration overheads is critical. Finally, application pages are always prioritized to use faster memory unless their pages become inactive. Importantly, KLOC object grouping, placement, and migration are not tightly bound to a specific placement policy.

Figure 6: KLOC support for the storage stack. Creation of a KLOC map and addition of VFS, file system, and device driver kernel objects across a file's lifetime (creation, write, fsync, close): (1) create the KLOC map and add the inode and dentry objects; (2) allocate journal, page cache, and radix-tree nodes; (3) allocate bio structures in fast memory and add them to the KLOC map; (4) deallocate unused objects and move the KLOC and its objects (cache pages, radix-tree nodes, jbd2, bio) to slow memory.
The workloads that we use spend a significant fraction of execution time in the file system dealing with page caches, in-memory structures like inodes, dentry caches, radix trees, block-device buffers, and journals, and in the network subsystem where they interact with ingress/egress socket buffers and network I/O queues.

/* KLOC-based dentry object allocation */
void *dentry_alloc(struct inode *inode)
{
    struct dentry *dentry;
    /* Get KLOC */
    struct kloc *kloc = inode_to_kloc(inode);
    struct alloc_policy *alloc_policy;

    /* Check if KLOC is active */
    if (kloc_active(kloc)) {
        /* Get KLOC's allocation policy */
        alloc_policy = kloc->alloc_policy;
        dentry = allocate_hetero(sizeof(struct dentry), alloc_policy);
        /* Add to KLOC's map */
        add_to_kloc_map(kloc->map, dentry);
    } else {
        /* Use default allocation */
        dentry = allocate(sizeof(struct dentry));
    }
    return dentry;
}
Figure 7: Pseudocode for KLOC-based dentry object allocation and mapping. A dentry is used to keep track of the hierarchy of files in directories. The pseudocode checks whether a file-based KLOC (represented by an inode) is active, allocates the dentry based on the KLOC's allocation policy, and finally adds the dentry object to the KLOC map.
To the best of our knowledge, current OSes lack abstractions and the capability to group kernel objects and efficiently place them in heterogeneous memory systems. Our focus is on CPU interactions for these workloads, but as accelerators like GPUs increasingly offer access to system services [48], we expect KLOCs to also be useful for processing paradigms beyond CPUs.
Ideally, KLOCs should be associated with entities that are accessible (and meaningful) to both applications and the OS, and that allow sharing of kernel objects across applications when required. Consider, for example, the notion of a file, which is visible to both the application and the OS code paths responsible for kernel object allocation and I/O servicing. During a file create operation, a set of in-memory kernel objects such as the inode, dentry cache, and journal blocks are allocated. During a file write operation, the virtual file system (VFS) allocates cache pages, radix tree nodes for managing the cache pages, journal records (for crash consistency), and extents [39]. When the block driver commits in-memory pages of a file to disk, it allocates the file's block I/O structures. All these allocations, whether initiated by the application or the OS, are associated with the notion of a file, making the file the appropriate entity around which to group the kernel objects allocated on its behalf. We therefore use a file's inode, which is used across all layers of the file system (i.e., the VFS, the file system, and the device driver), to create a KLOC that tracks all associated kernel objects.
We create a map structure in the inode KLOC to track all kernel objects associated with each file. The map structure is implemented as a red-black tree and maintains kernel objects in the virtual file system (VFS) layer, the dentry cache (used for name lookup), the page cache, the actual file system (without loss of generality, we use the Ext4 file system in our implementation), the log manager (for crash consistency), and the block device driver. We show an example of this map in Figure 6. One of the key challenges in maintaining inode KLOCs is understanding when to create and update kernel objects. In order to do this, we use OS system calls such as create(), write(), read(), and close() as semantic hints. Figure 6 shows that during file creation, a new inode and its map are created (with the inode as root), and the dentry object is added. On a file write operation, cache pages, their radix tree nodes, and the journal records (JBD2) are added to the map too. After a file is closed, the cache entries are removed. Consequently, the KLOC is deleted only when the file and inode are deleted [33].
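A minimal sketch of such a per-KLOC map, using the kernel's red-black tree primitives, is shown below. The kloc and kobj_entry structures and field names are our own illustrative choices and are not taken from the authors' implementation.

#include <linux/rbtree.h>
#include <linux/spinlock.h>

/* One tracked kernel object (dentry, radix-tree node, jbd2 record, bio,
 * cache page, ...) wrapped for insertion into the KLOC map.             */
struct kobj_entry {
    struct rb_node node;
    unsigned long  addr;   /* object (or backing page) address, used as key */
    unsigned int   type;   /* e.g., KOBJ_DENTRY, KOBJ_CACHE_PAGE            */
};

/* A KLOC: one per file (rooted at the inode) or per socket. */
struct kloc {
    struct rb_root       map;           /* red-black tree of kobj_entry        */
    spinlock_t           lock;
    bool                 active;        /* set while some CPU performs I/O     */
    struct alloc_policy *alloc_policy;  /* fast/slow placement policy (Fig. 7) */
};

/* Insert an object into the KLOC map, keyed by its address. */
static void kloc_map_insert(struct kloc *kloc, struct kobj_entry *e)
{
    struct rb_node **link = &kloc->map.rb_node, *parent = NULL;

    spin_lock(&kloc->lock);
    while (*link) {
        struct kobj_entry *cur = rb_entry(*link, struct kobj_entry, node);

        parent = *link;
        link = (e->addr < cur->addr) ? &(*link)->rb_left : &(*link)->rb_right;
    }
    rb_link_node(&e->node, parent, link);
    rb_insert_color(&e->node, &kloc->map);
    spin_unlock(&kloc->lock);
}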
Modern applications run on tens of CPUs, and future applications may run on hundreds of CPUs sharing and accessing per-thread files. Examples of such applications include those that perform graph computations and maintain key-value stores. In such cases, it is beneficial to be able to manage KLOCs across applications and CPUs. Therefore, our approach assumes that a file (i.e., inode) becomes active when thread(s) perform I/O on it. To integrate a notion of CPU numbers with KLOCs, we exploit Linux's per-CPU data structures. Each thread (or CPU) is represented by a task structure (task_struct) that represents the active process currently scheduled on the CPU. We extend the task_struct with an active context state that represents the files actively being accessed by the CPU. We leverage file system I/O system calls, such as open, read, write, close, and others, to identify the kernel objects being accessed by each CPU. These kernel objects are added to the per-CPU context map, as shown in the pseudocode in Figure 7. This approach enables KLOC to group kernel objects concurrently accessed by multiple CPUs.
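One way to realize this per-CPU/per-task active context is sketched below: a small array of active KLOCs carried with the task, updated from the I/O system-call entry points. The field names, array size, and helpers are illustrative assumptions, not the exact changes made in the prototype.

/* Per-task active-context state; the prototype embeds equivalent state in
 * task_struct (illustrative names, not the actual patch).                 */
struct kloc_task_ctx {
    struct kloc *active_klocs[8];   /* files/sockets this task is doing I/O on */
    int          nr_active;
};

/* Called from open()/read()/write() entry points: mark the file's KLOC
 * active so newly allocated kernel objects are placed in fast memory.   */
static void kloc_mark_active(struct kloc_task_ctx *ctx, struct file *file)
{
    struct kloc *kloc = inode_to_kloc(file->f_inode);  /* helper as in Figure 7 */

    if (!kloc || ctx->nr_active >= ARRAY_SIZE(ctx->active_klocs))
        return;
    kloc->active = true;
    ctx->active_klocs[ctx->nr_active++] = kloc;
}

/* Called from close(): the KLOC becomes a demotion candidate once no
 * CPU/task holds it active (sharing/refcounting elided for brevity).  */
static void kloc_mark_inactive(struct file *file)
{
    struct kloc *kloc = inode_to_kloc(file->f_inode);

    if (kloc)
        kloc->active = false;
}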
To group kernel objects allocated by the network subsystem, we use socket descriptors. Within Linux (and other OSes), these are also implemented as inodes, making our file system KLOC design easily extensible to networking. Sockets present an appropriate entity for KLOC creation as they are accessible to both the application and the OS. Page placement and migration policies are applied to pages grouped with respect to sockets.
The network stack uses packet buffers (skbuff) to send and receive application data. The packet buffers allocated on the ingress and egress paths dominate kernel object allocations. The ingress and egress paths in the network stack comprise multiple layers, including TCP, UDP, IP, and the network device driver (e.g., NAPI). To reduce the overheads of copying network packets across these layers, user-level buffers are copied to socket buffers and reused across the lower layers (TCP, UDP, IP, and NAPI) before being copied to the NIC. For the receive path, the device driver is responsible for allocations, which are subsequently reused by upper layers of the networking stack. In addition to the socket buffers, kernel objects such as network queues are also allocated, but they constitute a small fraction of allocated memory. Our networking KLOCs group these per-socket network kernel objects, enabling better memory placement and migration.

As with inode KLOCs, socket KLOCs also use a map structure to track the kernel objects associated with a socket. Therefore, KLOC uses the system calls responsible for socket creation (socket(), open()) to initiate per-socket map structures. When a CPU invokes an egress operation (send()), an ingress operation (recv()), or polls, KLOC marks the socket active and adds network kernel objects to the per-socket map structure. Whether newly-allocated kernel objects are placed in fast or slow memory depends on how the socket is managed. In general, all kernel objects of an active socket KLOC are directed to fast memory.
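As a sketch, the egress side of this policy can be applied directly at send() time, since the socket is known up front; sock_to_kloc(), skb_alloc_hetero(), and kloc_track_object() are hypothetical helpers mirroring the file-system-side pseudocode, while alloc_skb() is the stock kernel allocator used as the fallback.

#include <linux/skbuff.h>

/* Egress path (send()): allocate the packet buffer under the socket
 * KLOC's placement policy and record it in the socket's map.         */
static struct sk_buff *kloc_alloc_egress_skb(struct sock *sk, unsigned int size)
{
    struct kloc *kloc = sock_to_kloc(sk);   /* hypothetical lookup via the
                                               socket's backing inode        */
    struct sk_buff *skb;

    if (kloc && kloc->active)
        skb = skb_alloc_hetero(size, kloc->alloc_policy); /* fast memory      */
    else
        skb = alloc_skb(size, GFP_KERNEL);                /* default behavior */

    if (kloc && skb)
        kloc_track_object(kloc, skb);       /* add to the socket KLOC's map  */
    return skb;
}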
The egress path of a network stack is generally synchronous, and the network stack objects, including the socket buffers, are allocated during send() operations. As a result, grouping socket buffers and network stack kernel objects on the egress path involves simply adding them to the socket's context map structure. Unlike the egress path, the ingress (network receive) path depends on the asynchronous arrival of packets. As network packets arrive, the device driver process inside the OS allocates a generic packet buffer but does not know which socket the packet belongs to. This information is extracted in a higher layer of the TCP stack. This creates a research challenge: how to group ingress packet buffers under a socket KLOC as early as possible, so as to apply the appropriate set of memory tiering policies to them.

An initial, straightforward approach is to extract the packet's entire header and identify the associated socket within the driver itself (before transfer of control to the higher TCP layers), add the buffer to the per-socket context map, and apply the memory allocation policies of the socket context. Unfortunately, this naive approach means that we inspect the packet for socket information in both the driver and the higher layers, which is both CPU-intensive and time-consuming. In practice, we find that the latency for these steps is sufficiently high that it outweighs the performance benefits of placing these buffers in faster memory. The key problem is that the buffers are so short-lived that the additional time for header extraction becomes a performance bottleneck.

A better approach, and the one we use, is to extend the network stack device driver. We do indeed add code to extract socket information within the device driver, but we improve performance by avoiding redundant work at the higher-level layers. We do this by extending the packet buffer structure (skbuff) with an 8-byte socket field, which contains the socket information extracted in the device driver. This field elides the need for further socket information extraction at the higher levels of the TCP stack. We also extend the device driver to add the packet to the socket KLOC's map structure and allocate the packet in faster memory (provided the socket is active). Extending the idea of grouping kernel objects with a socket context down to the device driver provides the flexibility to group kernel objects that are allocated asynchronously and apply uniform allocation and data placement policies.
Grouping kernel objects via inode and socket KLOCs and supporting per-CPU active contexts enables better kernel object tiering. Unlike modern OSes, which do not support kernel object tiering, we can allocate kernel objects in fast memory, identify cold/inactive kernel objects, and migrate the latter to slow memory. To build efficient techniques for identifying cold/inactive kernel objects, and to build support for their migration, we go beyond prior application-level page placement systems. KLOC cannot rely on long-latency hotness scanning and page migrations (which incur TLB invalidations [38, 54]) because kernel objects are much too short-lived to tolerate long migration latencies.
Consider page hotness scanning. With KLOC, when a file or socket context is actively used, all kernel objects grouped in the context map are placed in faster memory. If the faster memory's capacity is full, the objects of inactive file or socket KLOCs (those currently not accessed by any CPU) are moved from faster to slower memory. Migrating inactive KLOCs is conceptually similar to garbage collecting fast memory, without incurring the cost of tracking each and every kernel object. We rely on quickly identifying cold/inactive kernel objects and migrating them to slow memory. To do this, we use and extend the Linux LRU mechanism developed initially for swapping and adopted by prior work [35, 54]. We introduce several extensions critical for kernel object placement, in addition to the concurrent page migrations introduced by Yan et al. [54].

First, OSes such as Linux (and FreeBSD) maintain an active page list, an active LRU list, and an inactive LRU list of pages. Due to limited faster memory capacity, we migrate pages not only from the inactive list but also from the active LRU list when the demand for fast memory is high. Second, we do not wait for Linux to identify LRU and inactive pages; instead, once a network or file context becomes inactive, we immediately mark and migrate its pages. Third, slab pages can be shared across one or more active and inactive file/socket entities. To avoid undue effects, we do not migrate shared pages that contain active objects. Finally, repeated migration of kernel pages between fast and slow memory can be detrimental to performance. To avoid repeated migration, we use an 8-bit per-page counter to track migrations and retain such pages in fast memory. We observe that a small fraction (less than 1%) of pages meet these conditions, due to the short lifetime of kernel objects.
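A simplified view of this demotion decision, including the per-page migration counter that pins pages which bounce repeatedly between tiers, is sketched below; the threshold, structure, and helper names are illustrative.

#define KLOC_MAX_MIGRATIONS 3   /* illustrative threshold, not from the paper */

/* Per-page bookkeeping kept alongside KLOC-managed pages. */
struct kloc_page_info {
    u8   migrations;      /* 8-bit counter of fast<->slow moves              */
    bool shared_active;   /* slab page also holds objects of an active KLOC? */
};

/* Should a page belonging to an inactive KLOC be demoted to slow memory? */
static bool kloc_should_demote(struct kloc *kloc, struct kloc_page_info *pi)
{
    if (kloc->active)
        return false;               /* some CPU still uses this file/socket     */
    if (pi->shared_active)
        return false;               /* shared slab page with active objects     */
    if (pi->migrations >= KLOC_MAX_MIGRATIONS)
        return false;               /* ping-ponging page: retain in fast memory */
    pi->migrations++;
    return true;                    /* demote immediately, without waiting for
                                       the LRU scan to age the page            */
}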
Now consider the actual mechanism to migrate kernel objects. Traditionally, kernel buffer pages allocated using the OS slab allocator (which excludes cache pages or kmalloc()) are managed independently of application pages and are reused across one or more applications. The slab allocator attempts to group objects of the same type or size together [23]. Each slab page can contain one or more kernel objects from different subsystems or shared by different tenants. Importantly, slab kernel pages are not directly mapped to an address space or process, and can also be accessed using a physical address.
Current OSes do not support migration of slab pages.
To enable migration of slab-allocated kernel objects, one might consider entirely redesigning the slab allocator. Although ideal, given the magnitude of changes to the OS design, in KLOC we propose an alternative solution to support the migration of kernel objects. First, for kernel objects related to files and sockets, we use vmalloc() (virtual malloc) support inside the OS. Using vmalloc() provides the ability to allocate a large region of virtually contiguous memory that also has an anonymous address space (anon_vma) [17] not backed by a file. Using vmalloc() allows us to allocate and coalesce the kernel objects of a context into a virtually contiguous region, and to use the anonymous address space to extend Linux's page migration code to support kernel object migration. (Our current approach is limited to supporting the migration of kernel objects that are not physically dereferenced.)
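One way to realize this grouping is a per-KLOC bump allocator carved out of a vmalloc()'d region, so that a context's objects land on the same virtually contiguous (and hence migratable) pages; the region size and bookkeeping below are illustrative, and chaining of additional regions is omitted.

#include <linux/vmalloc.h>

#define KLOC_REGION_SIZE (2UL << 20)   /* illustrative: 2 MB per KLOC */

/* A virtually contiguous region backing one KLOC's kernel objects. */
struct kloc_region {
    void   *base;   /* returned by vmalloc(); its pages can be remapped by an
                       extended version of the anonymous-page migration path  */
    size_t  used;
};

static int kloc_region_init(struct kloc_region *r)
{
    r->base = vmalloc(KLOC_REGION_SIZE);
    r->used = 0;
    return r->base ? 0 : -ENOMEM;
}

/* Bump-allocate a kernel object inside the KLOC's region, coalescing the
 * objects of one file/socket onto the same migratable pages.              */
static void *kloc_region_alloc(struct kloc_region *r, size_t size)
{
    void *obj;

    size = ALIGN(size, sizeof(long));
    if (r->used + size > KLOC_REGION_SIZE)
        return NULL;                 /* sketch: no chaining of extra regions */
    obj = (char *)r->base + r->used;
    r->used += size;
    return obj;
}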
Using KLOCs also allows us to leverage existing OS-level optimizations such as in-memory buffering, prefetching, and the adaptive readahead mechanism (also known as I/O prefetching), by making them aware of the notion of KLOCs so that workloads with temporal and spatial locality of I/O reference can be accelerated. Our approach requires no changes to I/O prefetching policies and mechanisms; we simply direct all I/O prefetches to fast memory. For example, consider that readahead speculatively reads a portion of a file's contents into memory with the expectation that the process working on the file will read/write that data in the future. Existing readahead mechanisms are adaptive; by tracking how often prefetched pages are actually used, the OS maintains a window of prefetch targets. When prefetches are used frequently, the window dynamically increases (to a maximum of 128 MB in Linux); when the prefetches are not used often, the window shrinks. Generally, the prefetching window increases when the workload demonstrates sequential memory access, while it decreases when access patterns become more random [27, 51]. By placing readahead pages in faster memory, we accelerate sequential workloads. For random-access workloads, the prefetch window automatically shrinks, and our approach has no punitive effect. (This also ensures that we do not over-aggressively allocate cold/inactive kernel objects to fast memory.)
Evaluation

Our evaluations answer the following questions:
• What are the benefits and implications of KLOC's fine-grained placement of the file system's kernel objects in heterogeneous memory systems? What is the impact on fast memory utilization?
• How effective is KLOC's support for the network subsystem?
• Is KLOC's capability to accelerate the I/O stack's prefetching optimization with fast memory pages beneficial?
We use a 40-core Intel Xeon 2.67 GHz dual-socket system with 80 GB of memory per socket and a 512 GB Intel Optane NVMe with peak sequential and random bandwidths of 1.2 GB/s and 425 MB/s respectively. We use the memory heterogeneity emulator from Section 3 and consider a generic fast (DRAM) and slow (throttled) memory. As discussed in Section 3 (see Figure 3), the fast memory capacity and slow memory bandwidth have a direct correlation with performance. We quantify performance sensitivity to slow memory bandwidth. We fix the fast memory capacity to 5 GB (this is representative of recent industrial and academic projections [4, 12, 18, 24, 35]). We use the same set of applications studied in Section 3. To understand the performance of KLOC on a real NVM device (for which the bandwidth can be modified) for RocksDB, we use a 64-core, 2 TB Intel DC platform.
Mechanism | Description
All-SlowMem | A worst-case slow-memory-only system; OS and application pages are always placed in slow memory
All-FastMem | An ideal all-fast-memory system (best case)
Naive | A greedy approach that naively uses NUMA and attempts to place application and OS pages in low-capacity fast memory even though they cannot fit
Nimble | State-of-the-art application page placement system with concurrent migration support [54]
Migration-only | Migration-based approach that only migrates cold pages from fast to slow memory, freeing fast memory for direct allocation, similar to [8, 35, 54]
KLOC-nomigrate | KLOC's kernel page placement, but without migration, for the storage stack
KLOC-migrate-fs-noprefetch | KLOC with kernel page migration but without network stack or I/O prefetcher support (Section 5.2)
KLOC-migrate-fs-nw-noprefetch | KLOC-migrate-fs-noprefetch with support for the network stack (Section 5.3)
KLOC-migrate-fs-nw-prefetch | Uses hints from the OS I/O prefetcher (Section 5.5)
Table 3: Evaluation Mechanisms.
System Configurations. We compare our approach to several other page placement and migration approaches, summarized in Table 3: (1) All-SlowMem represents a worst-case baseline on a system with only slow memory. (2) All-FastMem represents an ideal system that can fit the entire application workload in fast memory. (3) Naive represents an approach where the OS greedily places both application and kernel pages in fast memory; when fast memory is full, subsequent allocations for application and OS pages are served from slower memory until fast memory pages become free. (4) Nimble represents the state-of-the-art system for placement of application-level pages; Nimble also employs concurrent page migrations [54]. (5) Migration-only attempts to allocate OS and application pages in fast memory, but also identifies cold pages and migrates them to slower memory, similar to [8]. (6) KLOC-nomigrate represents a version of our approach using KLOCs that groups kernel objects and allocates them in fast memory; what it omits is the migration of pages of inactive KLOCs from fast to slow memory. (7) KLOC-migrate-fs-noprefetch goes beyond KLOC-nomigrate and also migrates cold/inactive file system kernel objects. Finally, (8) KLOC-migrate-fs-nw-noprefetch and KLOC-migrate-fs-nw-prefetch represent our final, full-blown KLOC approach, adding support for the network stack and I/O prefetching optimizations to fast memory.
KLOC for File System
We compare the following approaches summarized in Table 3: (1) All-SlowMem, the worst-case baseline; (2) All-FastMem, the ideal case; (3) Naive; (4) Migration-only; (5) KLOC-nomigrate, the proposed KLOC-based placement of kernel objects related to files and sockets, without migration; and finally, (6) KLOC-migrate-fs-noprefetch. To understand the reduction in the use of slow memory pages for kernel objects (and hence the better performance) with KLOC, Figure 9 shows statistics for RocksDB. The x-axis shows the total page cache pages, kernel buffer pages (slab pages for kernel in-memory structures, logs, socket buffers, etc.), and inactive pages migrated to slower memory, and the y-axis shows the page count (in units of 100K pages).

Figure 8: Performance and memory bandwidth sensitivity. (a) KLOC performance for applications and benchmarks: fast memory capacity is set to 5 GB (i.e., 1/8th of slow memory capacity) and the slow memory bandwidth is 1/8th of fast memory; Redis results only consider KLOC for the file system; the y-axis shows scaled throughput (ops/sec) for each application; Nimble is the state-of-the-art application data placement system. (b) Performance sensitivity to slow memory bandwidth: the x-axis varies slow memory's bandwidth ratio relative to fast memory, and the y-axis shows throughput; the fast memory capacity is set to 5 GB (1/8th of slow memory).
Observations.
First, as expected, incorrect placement of kernel-level objects in slow memory can significantly degrade performance, as shown by the difference between the All-FastMem (optimal) and All-SlowMem configurations for all applications. Second, the Naive greedy approach attempts to allocate all application and OS pages to the limited-capacity fast memory. As a result, the naive placement of non-critical kernel objects can drastically impact performance-sensitive application pages and kernel objects. RocksDB uses hundreds of 4 MB files to store the persistent key-values as string-sorted tables [34, 45]; once the files are filled, the low-capacity fast memory is polluted with inactive file caches and kernel objects (e.g., inode structures, dentry caches). As a result, the throughput of the Naive approach is 2.6× lower than the optimal case. We observe a similar trend for Cassandra and Filebench. In contrast, Redis uses a few large files to checkpoint data, so even the Naive approach provides gains over the worst-case baseline, but cache page pollution quickly limits the benefits of fast memory. Spark is composed of multiple applications, which include the Scala-based Spark benchmark, the Spark framework, and the HDFS storage server (Hadoop file system). While Spark itself is compute-intensive, the HDFS storage server is highly I/O-intensive. As a result, the Naive approach is ineffective.
Figure 9: RocksDB's kernel page slow memory use. The x-axis shows slow memory pages allocated for page cache, kernel buffers (for OS data structures), or inactive pages migrated to slow memory. KLOC reduces slow memory use for page cache and kernel buffers. All-FastMem does not use slow memory or incur migrations.
Next, Nimble mainly tracks application-level pages using the OS swapping mechanism's active and inactive lists. The lack of kernel page placement and excessive migration without context hurt performance for I/O-intensive applications. Next, the Migration-only approach performs application and kernel page migrations by tracking fast memory for cold (inactive) pages and migrating them to slow memory. However, blindly migrating all kernel (and application) objects increases page allocation and TLB and page table invalidation overheads. Further, the Migration-only approach lacks semantic information about how entities like files and sockets and their related kernel objects are used. For example, even after a close() operation, cache pages and journal data can continue to pollute faster memory.

Finally, the KLOC-nomigrate approach provides effective placement of kernel and file-related objects and attempts to place them in fast memory, but does not support page migration. In contrast, KLOC-migrate-fs-noprefetch, in addition to KLOC placement, also aggressively demotes objects of inactive files to slower memory using the modified LRU mechanism; because of this, fast memory allocation misses (and hence slow memory use) for active kernel objects are also reduced (see Figure 9). As a result, RocksDB throughput improves by up to 1.42× over the migration approach. Interestingly, for Redis, the migration-based KLOC-migrate-fs-noprefetch approach provides up to 1.70× gains even over KLOC-nomigrate. Finally, the Java-based Cassandra uses a large in-memory cache, and its application-level pages consume 95% of fast memory capacity, resulting in marginal gains with KLOC.

Figure 8b varies the slow memory bandwidth along the x-axis; for brevity, it studies only RocksDB and Spark. The fast memory capacity is set to 8 GB (1/10th of slow memory). Because RocksDB is highly memory-intensive, with a significant fraction of memory pages allocated in the kernel and a high rate of kernel-level memory references, the results show a high performance benefit across all bandwidth configurations. For Spark, KLOC-migrate-fs-noprefetch benefits are high until the slow memory bandwidth is 1/4th of the fast memory; increasing the slow memory bandwidth further (e.g., to 1/2 of fast memory) shows limited gains.

Figure 10: Redis performance with KLOC for the network stack. Results show the cumulative throughput of 8 Redis server instances using 1 KB value sizes for 4 million keys with 75% read (GET) requests.
Summary.
The results highlight the benefits of introducing a file context to encapsulate kernel objects, targeting placement only for objects currently accessed by an application and avoiding the placement cost of non-critical objects.
KLOC for Network Stack
We next discuss the benefits of introducing
KLOC supportfor the socket entity in the network stack and controlling theallocation and placement of related kernel objects in heteroge-neous memory. Figure 10 shows the performance of Redis, anetwork-intensive key-value store serving hundreds of clients.In contrast to the
KLOC approach studied in Figure 8a withoutthe network stack support,
KLOC -migrate-fs-nw-noprefetch in Figure 10 shows the performance of network stack with
KLOC support.For performance analysis, we use the well-known Redisbenchmark performing 4 million key operations, 25% insert(SET) and 75%, fetch (GET) operations, and 1 KB value sizefor each key as used by prior work [25]. Because Redis isa single-threaded application, we run instances of Redisserver with each instance using a set of dedicated ports. They-axis shows the throughput of SET and GET operations.Note that, for KLOC -migrate-fs-nw-noprefetch , each socketentity has an independent
KLOC map to encapsulate, allocate,and migrate kernel objects and in addition to
KLOC s for a file.As noted in section 3 and section 5, for the network subsystem,the socket buffers (a.k.a skbuff ) dominate the kernel objectallocation. The socket buffers are allocated and reused acrossdifferent layers (system calls, TCP, IP, and NAPI) and can arereused across operations.Redis is a memory-intensive application; the application’skey-value store, the network stack, and file system for check-pointing, all demand memory pages and are sensitive to fastmemory capacity and slow memory bandwidth. Comparedto the optimal case ( fast memory-only system), other ap-proaches with limited faster memory capacity suffer slow-down for both SET and GET operations. For the
Naive approach, contention for fast memory across the application and I/O subsystems leads to incorrect placement and performance slowdown. The shorter lifetime of network socket buffers, which are frequently allocated, released, or reused across network operations, makes the Migration-only approach ineffective; this approach suffers from high page migration and related overheads [24, 35, 42], reaping only marginal benefits from memory heterogeneity. The KLOC-migrate-fs-noprefetch approach can only handle efficient placement of the file system's kernel objects. In contrast, KLOC-migrate-fs-nw-noprefetch can map and efficiently handle the placement of both network and file system kernel objects. The network-supported approach first attempts to allocate objects mapped to a socket's KLOC in faster memory; when a direct allocation is not feasible, it uses the modified Linux LRU-based migration approach to move inactive socket-related pages to slow memory and make room for subsequent allocations in fast memory. As a result, KLOC provides 3.3× higher throughput compared to the migration-based approach.
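A rough way to picture the per-socket KLOC map is the user-space sketch below (names such as sk_kloc, kloc_net_map, and the fixed bucket count are assumptions for illustration, not the kernel data structures). Each socket gets its own context that records the skbuff-backed pages charged to it across the system-call, TCP/IP, and NAPI layers, so that placement, and later demotion, can treat a socket's objects as one unit:

    /*
     * User-space model of a per-socket KLOC map for skbuff pages.
     * Names (sk_kloc, kloc_net_map, NR_BUCKETS) are illustrative only.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define NR_BUCKETS 64

    struct sk_kloc {
        int             sock_fd;    /* socket this context encapsulates */
        unsigned        skb_pages;  /* skbuff-backed pages charged to it */
        struct sk_kloc *next;
    };

    static struct sk_kloc *kloc_net_map[NR_BUCKETS];

    /* Find (or create) the KLOC context for a socket. */
    static struct sk_kloc *sk_kloc_get(int sock_fd)
    {
        unsigned b = (unsigned)sock_fd % NR_BUCKETS;
        for (struct sk_kloc *k = kloc_net_map[b]; k; k = k->next)
            if (k->sock_fd == sock_fd)
                return k;
        struct sk_kloc *k = calloc(1, sizeof(*k));
        if (!k)
            abort();
        k->sock_fd = sock_fd;
        k->next = kloc_net_map[b];
        kloc_net_map[b] = k;
        return k;
    }

    /* Charge an skbuff page to the socket's KLOC (from any layer: syscall/TCP/IP/NAPI). */
    static void sk_kloc_charge(int sock_fd, unsigned pages)
    {
        sk_kloc_get(sock_fd)->skb_pages += pages;
    }

    /* When the socket goes idle or closes, its pages can be demoted or freed as a unit. */
    static void sk_kloc_release(int sock_fd)
    {
        struct sk_kloc *k = sk_kloc_get(sock_fd);
        printf("socket %d: %u skbuff page(s) eligible for demotion\n",
               k->sock_fd, k->skb_pages);
        k->skb_pages = 0;
    }

    int main(void)
    {
        sk_kloc_charge(7, 2);   /* e.g., receive path: NAPI and TCP reuse the same pages */
        sk_kloc_charge(7, 1);   /* send path */
        sk_kloc_charge(9, 4);
        sk_kloc_release(9);     /* idle connection: move its pages out together */
        sk_kloc_release(7);
        return 0;
    }

Grouping short-lived socket buffers under their socket's context is what lets the placement logic avoid per-page hotness scanning: the socket's activity stands in for the hotness of all pages it owns.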
Summary.
The substantial performance gains highlight the benefits of introducing fine-grained KLOC support for sockets to encapsulate network stack objects for efficient data placement. Note that KLOC supports multiple instances (processes), highlighting the generality of the proposed abstraction.
KLOC with the I/O Prefetcher
Enabling
KLOC to group a set of kernel-level objects associated with entities like files provides the capability to exploit OS-level hints and optimizations for page placement. To showcase such flexibility, we evaluate the benefits of combining KLOC and the file system's I/O prefetcher, as shown in Figure 11. The I/O prefetcher adapts to the application's I/O access pattern and varies the page cache allocation behavior. In OSes such as Linux, the I/O prefetcher expands the I/O prefetch window size (up to 128 MB) adaptively for both spatial and temporal locality [27]; the prefetch window shrinks for random access patterns. Capturing such semantic hints can be beneficial for KLOC's page placement and for reducing migration.

Figure 11 shows the throughput for all applications, and Figure 12 shows RocksDB's throughput for sequential and random access I/O patterns (on the x-axis). To demonstrate the incremental benefits, for brevity, we only compare
KLOC without prefetcher support (KLOC-migrate-fs-nw-noprefetch) with prefetcher-supported KLOC (KLOC-migrate-fs-nw-prefetch).

Next, combining the prefetch I/O optimization in the file system with KLOC improves the performance of several I/O-intensive applications with temporal or spatial locality. For example, applications such as RocksDB, Redis, and Cassandra benefit from proactively placing prefetched I/O pages in faster memory; RocksDB's overall throughput improves by 1.27×. As shown in Figure 12, RocksDB's sequential accesses benefit significantly from prefetching. For Filebench, we use a random read and write workload that neither gains nor loses performance; this is because the prefetch approach uses the I/O prefetch window size as a hint to predict random access and avoids aggressively placing pages that are unlikely to be used, thereby not polluting fast memory. As a side effect, this also reduces the migration of inactive fast memory pages to slow memory, contributing further to the performance benefits. For Redis, the performance gains (1.32×) come from the periodic checkpoint of the in-memory key-value store to storage, which is mostly sequential; the overall Redis performance improves by 4× compared to the migration-only approach.
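As a rough illustration of how the readahead hint can steer placement (a sketch under assumed names; PREFETCH_SEQ_THRESHOLD and place_prefetched_page are illustrative, not taken from the paper or the Linux sources), a large prefetch window suggests sequential reuse and favors fast memory, while a shrunken window suggests random access and leaves prefetched page-cache pages in slow memory to avoid polluting the fast tier:

    /*
     * Sketch: use the I/O prefetcher's current window size as a placement hint.
     * The threshold value and function names are assumptions for illustration.
     */
    #include <stdio.h>

    enum tier { FAST_MEM, SLOW_MEM };

    /* Windows below this size are treated as "random access detected". */
    #define PREFETCH_SEQ_THRESHOLD (1u << 20)   /* 1 MB, illustrative */

    static enum tier place_prefetched_page(unsigned window_bytes, unsigned fast_free_pages)
    {
        /* A shrunken readahead window implies random access: do not pollute fast memory. */
        if (window_bytes < PREFETCH_SEQ_THRESHOLD)
            return SLOW_MEM;
        /* Sequential/temporal locality: proactively place in fast memory if room exists. */
        return fast_free_pages > 0 ? FAST_MEM : SLOW_MEM;
    }

    int main(void)
    {
        printf("sequential scan (64 MB window): %s\n",
               place_prefetched_page(64u << 20, 128) == FAST_MEM ? "fast" : "slow");
        printf("random read (64 KB window):     %s\n",
               place_prefetched_page(64u << 10, 128) == FAST_MEM ? "fast" : "slow");
        return 0;
    }

The same decision also explains the reduced migration noted above: pages that were never promoted for a random workload never have to be demoted later.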
Figure 11: KLOC I/O prefetch gains. The y-axis is scaled application throughput (Ops/sec) for Filebench, Redis, RocksDB, Spark, and Cassandra, comparing KLOC-migrate-fs-nw-noprefetch and KLOC-migrate-fs-nw-prefetch.

Figure 12: RocksDB throughput breakdown. Values show throughput in 100K Ops/sec.

Access Pattern    KLOC-migrate-fs-nw-noprefetch    KLOC-migrate-fs-nw-prefetch
RandWrite         53                               55
RandRead          145                              146
SeqWrite          330                              300
SeqRead           1178                             1306
ReadWrite         76                               217
Figure 13: KLOC impact on DC-Optane for RocksDB. The y-axis shows RocksDB throughput (MB/sec) for All-SlowMem, All-FastMem, Naive, and KLOC-migrate-fs-nw-prefetch.

Summary.
The results show the benefits of combining
KLOC with traditional OS-level optimizations such as the I/O prefetcher. We believe these techniques could be adopted for other subsystems [55].
KLOC performance on DC-Optane Memory
Finally, to understand the impact of
KLOC on DC-Optane memory technologies, we use a 256GB DC-Optane module on a memory socket with a 16GB hardware-managed DRAM cache (memory mode) as slow memory, and a 48GB DRAM-only socket as fast memory. Due to the various limitations discussed in Section 2 and space constraints, we only show results for RocksDB. First, as shown in Figure 13, running RocksDB only on the slower memory shows substantial performance degradation compared to the optimal (fast-memory-only) approach. The current DC-Optane technology in memory mode manages the DRAM cache as a direct-mapped cache; for large working set sizes, we see a substantial increase in latency as well as a throughput reduction, possibly due to a combination of poor cache management and high cache miss overheads. Employing KLOC to maximize the placement of kernel as well as OS pages in faster memory accelerates performance considerably compared to fully leaving placement to the hardware and using a naive placement.
Conclusion

To provide efficient memory placement and management of kernel objects in heterogeneous memory systems, we present the
KLOC, a mechanism that encapsulates kernel objects within fine-grained contexts tied to entities such as files and sockets and provides efficient data placement and migration without requiring expensive hotness scanning. Our results on real-world applications such as RocksDB and Redis show up to 1.4× and 4× higher throughput compared to migration-based techniques. Our future work will explore supporting KLOC for other subsystems (e.g., GPU).

References

[1] "Apache Cassandra," http://cassandra.apache.org/.

[2] "Facebook RocksDB," http://rocksdb.org/.

[3] "Google LevelDB," http://tinyurl.com/osqd7c8.

[4] "Intel-Micron Memory 3D XPoint," http://intel.ly/1eICR0a.

[5] "Knights Landing (KNL): 2nd Generation Intel Xeon Phi," Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '17. New York, NY, USA: ACM, 2017, pp. 631–644. [Online]. Available: http://doi.acm.org/10.1145/3037697.3037706

[9] B. Akin, F. Franchetti, and J. C. Hoe, "Data reorganization in memory using 3d-stacked dram," in
Proceedings of the 42nd Annual International Symposium on Computer Architecture, ser. ISCA '15. New York, NY, USA: ACM, 2015, pp. 131–143. [Online]. Available: http://doi.acm.org/10.1145/2749469.2750397

[10] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan, K. Roy, and D. S. Milojicic, "Puma: A programmable ultra-efficient memristor-based accelerator for machine learning inference," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '19. New York, NY, USA: ACM, 2019, pp. 715–731. [Online]. Available: http://doi.acm.org/10.1145/3297858.3304049

[11] M. Becchi and P. Crowley, "A hybrid finite automaton for practical deep packet inspection," in Proceedings of the 2007 ACM CoNEXT Conference, ser. CoNEXT '07. New York, NY, USA: ACM, 2007, pp. 1:1–1:12. [Online]. Available: http://doi.acm.org/10.1145/1364654.1364656

[12] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb, "Die stacking (3d) microarchitecture," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 39. Washington, DC, USA: IEEE Computer Society, 2006, pp. 469–479. [Online]. Available: https://doi.org/10.1109/MICRO.2006.18

[13] C.-C. Chou, A. Jaleel, and M. Qureshi, "Batman: Maximizing bandwidth utilization for hybrid memory systems," in Technical Report, TR-CARET-2015-01 (March 9, 2015), 2015.

[14] T.-S. Chung, D.-J. Park, S. Park, D.-H. Lee, S.-W. Lee, and H.-J. Song, "A survey of flash translation layer," J. Syst. Archit., vol. 55, no. 5-6, pp. 332–343, May 2009. [Online]. Available: http://dx.doi.org/10.1016/j.sysarc.2009.03.005

[15] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee, "Better I/O Through Byte-addressable, Persistent Memory," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, ser. SOSP '09, Big Sky, Montana, USA, 2009.

[16] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, "Benchmarking Cloud Serving Systems with YCSB," in Proceedings of the 1st ACM Symposium on Cloud Computing, ser. SoCC '10, Indianapolis, Indiana, USA, 2010.

[17] T. Corbet, "On the proper use of vmalloc," https://lwn.net/Articles/57800/, 2003.

[18] Q. Deng, D. Meisner, L. Ramos, T. F. Wenisch, and R. Bianchini, "Memscale: Active low-power modes for main memory," SIGARCH Comput. Archit. News, vol. 39, no. 1, pp. 225–238, Mar. 2011. [Online]. Available: http://doi.acm.org/10.1145/1961295.1950392

[19] P. J. Denning, "The working set model for program behavior," Commun. ACM, vol. 11, no. 5, pp. 323–333, May 1968. [Online]. Available: http://doi.acm.org/10.1145/363095.363141

[20] X. Dong, Y. Xie, N. Muralimanohar, and N. P. Jouppi, "Simple but effective heterogeneous main memory with on-chip memory controller support," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–11. [Online]. Available: http://dx.doi.org/10.1109/SC.2010.50

[21] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson, "System Software for Persistent Memory," in Proceedings of the Ninth European Conference on Computer Systems, ser. EuroSys '14. New York, NY, USA: ACM, 2014, pp. 15:1–15:15. [Online]. Available: http://doi.acm.org/10.1145/2592798.2592814

[22] S. R. Dulloor, A. Roy, Z. Zhao, N. Sundaram, N. Satish, R. Sankaran, J. Jackson, and K. Schwan, "Data tiering in heterogeneous memory systems," in Proceedings of the Eleventh European Conference on Computer Systems, SIGPLAN Not., vol. 50, no. 7, pp. 79–92, Mar. 2015. [Online]. Available: http://doi.acm.org/10.1145/2817817.2731191

[25] A. Gutierrez, M. Cieslak, B. Giridhar, R. G. Dreslinski, L. Ceze, and T. Mudge, "Integrated 3d-stacked server designs for increasing physical density of key-value stores," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '14, Salt Lake City, Utah, USA, 2014.

[26] H. Hanson and K. Rajamani, "What computer architects need to know about memory throttling," in Proceedings of the 2010 International Conference on Computer Architecture, ser. ISCA '10. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 233–242.

[27] J. He, S. Kannan, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "The Unwritten Contract of Solid State Drives," in Proceedings of the Twelfth European Conference on Computer Systems, ser. EuroSys '17. New York, NY, USA: ACM, 2017, pp. 127–144. [Online]. Available: http://doi.acm.org/10.1145/3064176.3064187

[28] Intel, "NVM Library," https://github.com/pmem/nvml.

[29] D. Jevdjic, S. Volos, and B. Falsafi, "Die-stacked dram caches for servers: Hit ratio, latency, or bandwidth? have it all with footprint cache," in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA '13. New York, NY, USA: ACM, 2013, pp. 404–415. [Online]. Available: http://doi.acm.org/10.1145/2485922.2485957

[30] X. Jiang, N. Madan, L. Zhao, M. Upton, R. Iyer, S. Makineni, D. Newell, D. Solihin, and R. Balasubramonian, "Chop: Adaptive filter-based dram caching for cmp server platforms," in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, Jan 2010, pp. 1–12.

[31] C. Jonathan, "Linux Radix Trees," https://lwn.net/Articles/175432/.

[32] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA '17. New York, NY, USA: ACM, 2017, pp. 1–12. [Online]. Available: http://doi.acm.org/10.1145/3079856.3080246

[33] S. Kannan, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, Y. Wang, J. Xu, and G. Palani, "Designing a true direct-access file system with devfs," in Proceedings of the 16th USENIX Conference on File and Storage Technologies, ser. FAST '18. Berkeley, CA, USA: USENIX Association, 2018, pp. 241–255. [Online]. Available: http://dl.acm.org/citation.cfm?id=3189759.3189782

[34] S. Kannan, N. Bhat, A. Gavrilovska, A. Arpaci-Dusseau, and R. Arpaci-Dusseau, "Redesigning lsms for nonvolatile memory with novelsm," in Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference, ser. USENIX ATC '18. Berkeley, CA, USA: USENIX Association, 2018, pp. 993–1005. [Online]. Available: http://dl.acm.org/citation.cfm?id=3277355.3277450

[35] S. Kannan, A. Gavrilovska, V. Gupta, and K. Schwan, "Heteroos: Os design for heterogeneous memory management in datacenter," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA '17. New York, NY, USA: ACM, 2017, pp. 521–534. [Online]. Available: http://doi.acm.org/10.1145/3079856.3080245

[36] S. Kannan, A. Gavrilovska, and K. Schwan, "pvm: Persistent virtual memory for efficient capacity scaling and object storage," in Proceedings of the Eleventh European Conference on Computer Systems, ser. EuroSys '16. New York, NY, USA: ACM, 2016, pp. 13:1–13:16. [Online]. Available: http://doi.acm.org/10.1145/2901318.2901325

[37] Y. Kwon, H. Yu, S. Peter, C. J. Rossbach, and E. Witchel, "Coordinated and efficient huge page management with ingens," in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI '16. Berkeley, CA, USA: USENIX Association, 2016, pp. 705–721. [Online]. Available: http://dl.acm.org/citation.cfm?id=3026877.3026931

[38] A. Lagar-Cavilla, J. Ahn, S. Souhlal, N. Agarwal, R. Burny, S. Butt, J. Chang, A. Chaugule, N. Deng, J. Shahid, G. Thelen, K. A. Yurtsever, Y. Zhao, and P. Ranganathan, "Software-defined far memory in warehouse-scale computers," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '19. New York, NY, USA: ACM, 2019, pp. 317–330. [Online]. Available: http://doi.acm.org/10.1145/3297858.3304053

[39] C. Lee, D. Sim, J.-Y. Hwang, and S. Cho, "F2FS: A New File System for Flash Storage," in Proceedings of the 13th USENIX Conference on File and Storage Technologies, ser. FAST '15, Santa Clara, CA, 2015.

[40] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch, "Disaggregated memory for expansion and sharing in blade servers," SIGARCH Comput. Archit. News, vol. 37, no. 3, pp. 267–278, Jun. 2009. [Online]. Available: http://doi.acm.org/10.1145/1555815.1555789

[41] F. X. Lin and X. Liu, "Memif: Towards programming heterogeneous memory asynchronously," in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '16. New York, NY, USA: ACM, 2016, pp. 369–383. [Online]. Available: http://doi.acm.org/10.1145/2872362.2872401

[42] M. Meswani, S. Blagodurov, D. Roberts, J. Slice, M. Ignatowski, and G. Loh, "Heterogeneous memory architectures: A hw/sw approach for mixing die-stacked and off-package memories," in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, Feb 2015, pp. 126–136.

[43] M. Oskin and G. H. Loh, "A software-managed approach to die-stacked dram," in Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), ser. PACT '15. Washington, DC, USA: IEEE Computer Society, 2015, pp. 188–200. [Online]. Available: https://doi.org/10.1109/PACT.2015.30

[44] M. Radulovic, D. Zivanovic, D. Ruiz, B. R. de Supinski, S. A. McKee, P. Radojković, and E. Ayguadé, "Another trip to the wall: How much will stacked dram benefit hpc?" in Proceedings of the 2015 International Symposium on Memory Systems, ser. MEMSYS '15. New York, NY, USA: ACM, 2015, pp. 31–36. [Online]. Available: http://doi.acm.org/10.1145/2818950.2818955

[45] P. Raju, R. Kadekodi, V. Chidambaram, and I. Abraham, "PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees," in Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP '17), Shanghai, China, October 2017.

[46] L. E. Ramos, E. Gorbatov, and R. Bianchini, "Page placement in hybrid memory systems," in Proceedings of the International Conference on Supercomputing, ser. ICS '11. New York, NY, USA: ACM, 2011, pp. 85–95. [Online]. Available: http://doi.acm.org/10.1145/1995896.1995911

[47] T. Vasily, "Filebench," https://github.com/filebench/filebench.

[48] J. Veselý, A. Basu, A. Bhattacharjee, G. H. Loh, M. Oskin, and S. K. Reinhardt, "Generic system calls for gpus," in Proceedings of the 45th Annual International Symposium on Computer Architecture, ser. ISCA '18. Piscataway, NJ, USA: IEEE Press, 2018, pp. 843–856. [Online]. Available: https://doi.org/10.1109/ISCA.2018.00075

[49] H. Volos, A. J. Tack, and M. M. Swift, "Mnemosyne: Lightweight Persistent Memory," in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI, Newport Beach, California, USA, 2011.

[50] C. Wang, H. Cui, T. Cao, J. Zigman, H. Volos, O. Mutlu, F. Lv, X. Feng, and G. H. Xu, "Panthera: Holistic memory management for big data processing over hybrid memories," in Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI 2019. New York, NY, USA: ACM, 2019, pp. 347–362. [Online]. Available: http://doi.acm.org/10.1145/3314221.3314650

[51] F. Wu, H. Xi, J. Li, and N. Zou, "Linux readahead: less tricks for more," 08 2019.

[52] X. Wu and A. L. N. Reddy, "SCMFS: A File System for Storage Class Memory," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11, Seattle, Washington, 2011.

[53] J. Xu and S. Swanson, "NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories," in Proceedings of the 14th Usenix Conference on File and Storage Technologies, ser. FAST '16, 2016.

[54] Z. Yan, D. Lustig, D. Nellans, and A. Bhattacharjee, "Nimble page management for tiered memory systems," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '19. New York, NY, USA: ACM, 2019, pp. 331–345. [Online]. Available: http://doi.acm.org/10.1145/3297858.3304024

[55] Y. Yang, P. Xiang, M. Mantor, and H. Zhou, "Cpu-assisted gpgpu on fused cpu-gpu architectures," in Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture, ser. HPCA '12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 1–12. [Online]. Available: http://dx.doi.org/10.1109/HPCA.2012.6168948

[56] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud '10. Berkeley, CA, USA: USENIX Association, 2010, pp. 10–10. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863103.1863113

[57] L. Zhao, R. Iyer, R. Illikkal, and D. Newell, "Exploring dram cache architectures for cmp server platforms," in