Efficient Kernel Object Management for Tiered Memory Systems with KLOC
Sudarsun Kannan (Rutgers University), Yujie Ren (Rutgers University), Abhishek Bhattacharjee (Yale University)
Abstract
Software-controlled heterogeneous memory systems have the potential to improve performance, efficiency, and cost tradeoffs in emerging systems. Delivering on this promise requires efficient operating system (OS) mechanisms and policies for data management. Unfortunately, modern OSes do not support efficient tiering of data between heterogeneous memories. While this problem is known (and is being studied) for application-level data pages, the question of how best to tier OS kernel objects has largely been ignored.

We show that careful kernel object management is vital to the performance of software-controlled tiered memory systems. We find that state-of-the-art OS page management research leaves considerable performance on the table by overlooking how best to tier, migrate, and manage kernel objects like inodes, dentry caches, journal blocks, network socket buffers, etc., associated with the filesystem and networking stack. In response, we characterize hotness, reuse, and liveness properties of kernel objects to develop appropriate tiering/migration mechanisms and policies. We evaluate our proposal using a real-system emulation framework on large-scale workloads like RocksDB, Redis, Cassandra, and Spark, and achieve 1.4× to 4× higher throughput compared to prior art.

Introduction

Hardware heterogeneity is here. Vendors are coupling general-purpose CPUs with accelerators ranging from GPUs and FPGAs to domain-specific hardware for deep learning, signal processing, finite automata, and much more [10, 11, 32]. Memory systems are combining the best properties of emerging technologies optimized for latency, bandwidth, capacity, volatility, or cost. Researchers are already studying the benefits of die-stacked DRAM [9, 29, 44], while Intel's Knight's Landing uses high-bandwidth multi-channel DRAM (MCDRAM) alongside DDR4 memory to achieve both high bandwidth and high capacity [5]. Non-volatile 3D XPoint memories are now commercially available for database systems, and disaggregated memory is being touted as a promising solution to scale capacity for blade servers [40]. Next-generation systems will consist of heterogeneous compute nodes (CPU, GPU, or both) connected to multiple types of memory with different bandwidth, latency, and capacity properties.

To harness the promise of heterogeneity, software-controlled data management is necessary to ensure that as programs navigate different phases of execution, each with potentially distinct working sets, data is tiered appropriately. To optimize for performance, ideally the hottest data will be placed in the fastest memory node (in terms of latency or bandwidth) until that node is full, the next-hottest data will be filled into the second-fastest node up to its capacity, and so on. As a program executes, its data must be periodically assessed for hotness and re-organized to maximize performance. Doing this successfully requires effective OS policies and mechanisms to determine data reuse and control data migration.

Unfortunately, data hotness tracking and page migration in modern OSes have high overheads and are surprisingly inefficient. Recent studies address this in many ways, including support for transparent huge page migration [37], concurrent migration of multiple pages and symmetric exchange of pages [54], compiler- and MemIf-based frameworks [41, 50] that determine data hotness via heap profiling, and machine learning to determine page hotness [38]. Other OS-based approaches [8, 35] and hardware-accelerated approaches [13, 30, 42, 57] have been proposed.
Although effective, these studies focus on application-level data and largely ignore the question of how best to tier kernel objects associated with I/O.

We show that ignoring kernel objects leaves considerable performance on the table and that carefully tiering, migrating, and managing kernel objects like inodes, dentry caches, journal blocks, network socket buffers, etc., is vital to overall system performance. We also show – by characterizing hotness, reuse, and liveness properties – that techniques previously proposed for application-level data tiering are a poor fit for kernel objects. This is because of three key differences between application-level pages and kernel objects. First, kernel objects are not mapped or owned by a specific application. Second, kernel objects can be, and are, shared and reused across multiple tenants. Third, kernel objects are much shorter-lived than application-level data. For these reasons, traditional hotness scanning and page migration techniques cannot simply be extended to kernel objects, which can neither be associated with a specific application nor are sufficiently long-lived to tolerate the latencies of state-of-the-art application-level page migration techniques [54].

In response, we introduce the concept of kernel-level object contexts (or KLOCs). Each KLOC encapsulates a set of kernel-level objects associated with entities like files, sockets, virtual device files, etc. We build allocation, deallocation, page placement, and page migration mechanisms and policies for KLOCs, striking a compromise between tying kernel-level objects to application-level characteristics while also adapting to the unique reuse/lifetime characteristics of kernel-level objects. KLOC-based tiering allows the OS to associate applications and kernel-level objects (thereby tracking when, how, and why applications access kernel-level objects) while also understanding that kernel objects can be shared across different applications (with different I/O behavior) and have shorter lifetimes than application-level data. KLOC-based tiering also permits kernel objects to leverage aspects of the file system and networking stacks that have already been optimized for performance (e.g., data prefetching for I/O, etc.).

To build KLOCs, we address several research challenges. Unlike coarser-grained application-level approaches based on NUMA affinity, which generally tie data to particular memory sockets for large chunks of application lifetime, files and sockets can be rapidly spawned/killed and accessed/reused in a myriad of ways (for example, RocksDB [2] creates and operates on hundreds of files through its lifetime). Some kernel objects (like file system journals) can be shared across multiple files (and applications). Finally, current OSes lack support for kernel object migration entirely (kernel objects are managed by the OS's slab allocator and remain in the location where they were allocated). Even worse, building kernel object migration is a non-trivial effort – many kernel objects (like kernel buffer pages) have lifetimes that are so short (tens of milliseconds) that the latency to migrate them between memory tiers and perform associated operations like TLB shootdowns [42, 43] can be prohibitively high.
To accurately gauge the benefits of KLOCs, we prototype our approach in the mainline Linux kernel v4.17. To determine the set of kernel objects per KLOC, we rely on application-level system calls to identify the files, sockets, virtual devices, etc., being accessed and the newly-allocated kernel objects (e.g., a dentry, cache page, or packet/socket buffer) that they are associated with. Each file, socket, and virtual device has its own KLOC. We also associate kernel objects with CPUs in order to maintain information about CPU- and application-wide KLOC associations. To track chains of kernel objects associated with KLOCs in an efficient manner, we implement a lightweight object map table within the kernel. This map is accessed by the OS to determine the KLOCs associated with cold kernel objects so that they can be migrated to slow memory, and is designed to obviate the need for high-overhead page table scans to determine page hotness. We overcome this challenge by grouping kernel objects used by files and sockets into a large region of virtually contiguous pages. By doing this, and by then going beyond modern OSes and implementing support for kernel object migration, we enable good performance. To boost efficiency, we also exploit I/O stack optimizations like prefetching of file data and insertion of prefetched cache pages into faster memory (which is particularly beneficial for workloads with sequential I/O access patterns). Overall, we add 6K lines of code to the Linux kernel to implement these techniques, and make no changes to the hardware or applications.

We quantify the benefits of KLOCs by using a two-socket system to emulate a two-tier memory system with a fast and a slow memory, similar to prior work [23, 26, 36, 54]. We perform end-to-end evaluations with RocksDB (a persistent key-value store), Redis (a network-intensive memory store), Cassandra (a distributed wide-column store), and Spark (a distributed general-purpose cluster-computing framework), in addition to microbenchmarks like Filebench. Our performance gains of up to 1.4× and 4× respectively show the performance potential of KLOCs, and lay the groundwork for further exploration of kernel object tiering mechanisms and policies in the systems research community.
Background

Advances in heterogeneous memory hardware have motivated the need for efficient management of memory resources that vary in capacity, speed, and cost. We first discuss hardware and software trends, followed by related work on techniques for memory heterogeneity management and their limitations.
Several heterogeneous memory technologies, such as non-volatile memory (NVM), Hybrid Memory Cube (HMC), and High Bandwidth Memory (HBM), will coexist with traditional DRAM. On-chip memories such as stacked 3D-DRAM, HMC, and HBM [13, 42] are expected to provide 10× higher bandwidth and 1.5× lower latency, but 8-16× lower capacity [7, 9, 12, 43] compared to DRAM. Other technologies, like NVMs, offer 4-8× higher capacity compared to DRAM but suffer from 2-3× higher read latency, 5× higher write latency, and 3-5× lower random access bandwidth compared to DRAM. Several file systems [15, 21, 52, 53] and user-level libraries [28, 49] have been proposed to exploit persistence, as have approaches to integrate these memories as virtual memory [22, 36]. Given the differences in bandwidth, latency, and capacity, heterogeneous memory systems will increase the complexity of the memory management software stack.

Hardware-level management. There have been several prior proposals to manage memory heterogeneity in hardware. Batman modifies the memory controller to randomize data placement to increase the cumulative DRAM and stacked 3D-DRAM bandwidth [13]. Meswani et al. [42] discuss extending the TLB and the memory controller with additional logic for identifying page hotness. To reduce page migration cost, Dong et al. [20] propose dynamically remapping physical addresses, similar to an SSD FTL [14]. Oskin et al. propose an architectural mechanism to selectively invalidate entries in the TLB to reduce TLB shootdowns during migrations [43]. Ramos et al. propose a hybrid design with a hardware-driven page placement policy and the OS periodically updating its page tables using information from the memory controller [46].

Software-level management. Several recent studies augment traditional OS approaches to track page hotness by scanning page tables to migrate application pages of different sizes [8, 24, 35, 38, 41, 43, 54]. These approaches extend work originally proposed by Denning [19] for disk swapping. Gupta et al. propose HeteroVisor [24], which uses page hotness tracking and migration techniques for virtualized datacenters, whereas Kannan et al. [35] propose on-demand data placement for virtualized datacenters. Yan et al. [54] propose techniques to accelerate page migration in heterogeneous memory systems by increasing parallelism. Lagar-Cavilla et al. [38] propose combining OS-level hotness scanning with machine learning for data placement across fast and slow memories. Many of these techniques extend the concept of NUMA affinity to data pages; that is, they associate applications with particular memory sockets in order to accelerate memory access from CPUs that are physically closer. While these approaches are beneficial for application-level data, kernel object management for heterogeneous memory remains unexplored and is in its infancy (for example, there is no support for kernel object migration in modern OSes).
Application | Description | Resident Mem. Size
RocksDB [2] | Facebook's persistent key-value store based on a log-structured merge tree; workload: widely-used DBbench [3] with 1M keys. | 8.4 GB
Redis [6] | Network-intensive key-value store with support for persistence; uses RedisBench, 4 million ops., 75%/25% Set/Get distribution. | 14 GB
Filebench [47] | File system benchmark; uses eight threads, 8 GB per thread, performing sequential and random reads. | 16.3 GB
Cassandra [1] | Java-based NoSQL DB; run with a YCSB [16] workload using eight threads, 50% read-write ratio. | 11 GB
Spark [56] | Apache Spark; performs TeraSort on 20 GB of data using sixteen threads and uses the Hadoop file system. | 32.1 GB

Table 1: Applications and workloads. The table also shows application resident set size (in GB).
Experimental Environment
Processors | 2.4 GHz Intel E5-2650 v4 (Broadwell), 20 cores/socket, 2 threads/core
Cache | 512 KB L2, 25 MB LLC
Memory sockets | Two 80 GB sockets configured as NUMA nodes, max bandwidth of
Storage | 512 GB NVMe with 1.2 GB/s sequential and 412 MB/s random access bandwidth
OS | Debian Trusty, Linux v4.17.0
Table 2: System configurations.
For optimal application performance in heterogeneous memory systems, placing not only application-level pages but also kernel pages in faster memory is critical. However, page placement of kernel pages (and objects) is not well studied. To understand the need for kernel object placement in heterogeneous memory systems, we next analyze real-world I/O-intensive applications.
Quantifying the performance of kernel object tiering requires a platform that permits end-to-end execution of large-scale workloads (we use those summarized in Table 1) with full-system effects. Cycle-accurate simulators are too slow and lack the detail necessary to (easily and accurately) study kernel-level structures in the virtual memory, storage, and network subsystems [13, 20, 42]. Ideally, we would use a commercially-available heterogeneous memory platform with support for flexible tiering of kernel objects. Regrettably, there are no commercial platforms that can be configured to do this yet. For example, we considered Intel's DC memory [4], which attaches persistent Optane memory side-by-side with DRAM. Unfortunately, this system can currently only be configured such that the persistent memory is a direct-access file system accessible via custom user-level runtimes, or the DRAM is a direct-mapped L4 cache of the persistent memory. There is no way to configure the DC memory platform to make it entirely visible (or kernel objects visible) to the virtual memory subsystem, which is what we need for our studies.

Therefore, while we will revisit Intel's DC platform in Section 6, for now we use a two-socket memory system to emulate a two-level tiered memory in a manner similar to recent work [24, 26, 36, 43, 54]. These sockets have the architectural configuration described in Table 2. Like previous work [24, 26, 36], we emulate a fast memory on one of the sockets and a slow memory on the other by applying thermal throttling to slow down the latter. The slow memory's bandwidth and latency are configured by modifying the PCI-based thermal registers. The flexibility of this platform enables exploration of a generic software-controlled heterogeneous framework, as we can vary the capacities of the memory nodes as well as their latency/bandwidth characteristics. For our studies, we vary capacity/bandwidth differences between the fast and slow memories from 2× to 16×. Furthermore, for some of our experiments, we controlled application/kernel page placement in the fast/slow memories by adding hooks in Linux's memory management stack to redirect page allocations. All workloads are executed on the 20 CPUs of the node associated with fast memory. We will publicly release our kernel and tools so that they can be used by the community for follow-up studies.

Experimental Results

Table 1 shows that we focus on large-scale workloads that are compute-, file-, and network-intensive in order to stress-test our approach. We also use Filebench with a mixed random write and read access workload. We structure our studies around the following questions:
How are kernel memory objects allocated and accessed?
Modern data center applications are known to be I/O-intensive and expend considerable portions of their runtime within the kernel-level file system and networking stacks. For example, 40% of the runtime of RocksDB is spent within the file system code path. What is less well known, however, is that these applications also allocate millions of memory pages for kernel objects that perform I/O caching, in-memory metadata and book-keeping structures (like radix trees for caches), journals and logs, as well as ingress and egress network socket buffers. Figure 1a quantifies the number of pages allocated for different kernel objects. The data shows that all the workloads allocate many page cache pages and kernel buffers (using slab allocations via kmalloc). Filebench uses eight I/O threads to simultaneously write 4 KB blocks to separate files. The write and read operations entail allocation of page cache pages (for writing data or bringing data from disk) as well as updates to system metadata structures, which involve allocating journals, radix tree nodes, block driver buffers, etc. Consequently, both page cache and kernel metadata allocations increase significantly compared to user-level pages. In contrast, RocksDB updates hundreds of 4 MB files with key-value data. Therefore, slab allocations for inodes, dentry caches, radix tree nodes (for the indexing cache), driver block I/O structures, and journals are all frequent and contribute to 36% of the pages allocated to kernel objects. Redis is network-intensive and allocates many pages for ingress and egress socket buffers, and also page cache pages to periodically checkpoint key-value store state to a large file on disk [6]. Spark [56] uses the Hadoop file system (HDFS) to store and checkpoint data (RDDs [56]). Note that HDFS runs as a separate process. HDFS maintains a user-level cache and periodically updates the page cache (hence fewer kernel buffer pages).

We also profile the frequency with which these different kernel objects are accessed in Figure 1b and the distribution of last-level cache misses in Figure 1c. Even though fewer kernel buffers are allocated (see Figure 1a), they are accessed more often than other kernel objects. To understand why, consider, for example, a file write in Filebench. The virtual file system (VFS) looks up the page cache radix tree, allocates a new page if necessary, inserts the page into the radix tree, performs metadata/data journalling with logging, and finally commits to storage. These steps are more memory-intensive than writing the data to the page cache. In fact, scaling the workload inputs leads to a sharp increase in LLC misses due to higher traffic to kernel buffers. Filebench spends 86% of its execution time inside the OS, and hence its memory accesses increase proportionately, compared to RocksDB (54%) and Redis (38%).
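To make the kernel-object churn behind these numbers concrete, the sketch below walks through the allocations that one buffered file write can trigger. This is illustrative C-style pseudocode only; the helper names (pagecache_lookup, journal_log_metadata, etc.) are placeholders we introduce for exposition, not actual kernel symbols.

/* Illustrative sketch: kernel objects touched by one buffered file write.
 * Helper names are placeholders, not real kernel symbols.               */
ssize_t sketch_buffered_write(struct file *f, const char *buf,
                              size_t len, loff_t pos)
{
    /* 1. Look up the page cache radix tree; on a miss, allocate a cache
     *    page and possibly new radix-tree nodes (slab objects).          */
    struct page *pg = pagecache_lookup(f, pos);
    if (!pg)
        pg = pagecache_alloc_and_insert(f, pos);

    /* 2. Copy the user data into the page cache page. */
    copy_to_cache_page(pg, buf, len);

    /* 3. Log the update for crash consistency: journal (JBD2) records and
     *    in-memory journal handles are allocated here.                    */
    journal_log_metadata(f, pg);

    /* 4. On fsync()/commit, block I/O (bio) structures are allocated by
     *    the block layer to write the page to storage.                    */
    if (needs_commit(f))
        submit_block_io(pg);

    return len;
}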
How does tiering of kernel objects impact performance?
To study the impact of kernel object placement in fast/slow memory, we configure the capacity of the fast memory so that it cannot fit all of the application's user-level and kernel-level pages. Our results, illustrated in Figure 2, assume that fast and slow memory are 5 GB and 40 GB respectively, and that slow memory has a bandwidth of 5 GB/s, thereby emulating a 5× bandwidth difference between fast and slow memory. This is similar to recent work [24, 35, 42] and is representative of the bandwidth differences between HBM and NVM technologies relative to DRAM. In tandem, Figure 3 and Figure 4 show the impact of varying slow memory bandwidth and fast memory capacity. The App Slow + OS Slow bars show the worst-case scenario where all pages are placed in slower memory, App Slow + OS Fast shows the case where only the kernel pages are placed in fast memory, App Fast + OS Slow shows the case where only application-level pages are placed in fast memory, and App Fast + OS Fast shows an ideal case where all pages fit in fast memory. The y-axis shows the normalized throughput.

Placing both application and kernel pages (App Slow + OS Slow) in slow memory degrades performance across all workloads. As shown in Figure 2, placing only kernel pages (App Slow + OS Fast) in limited-capacity fast memory does improve performance; for example, RocksDB and Filebench improve by 1.58× and 2.8× respectively compared to App Slow + OS Slow. In real-world settings, one would not place only kernel pages in fast memory (but would also do so for application pages as much as possible), but this experiment shows that even just tiering kernel objects appropriately impacts performance. For the network- (and storage-) intensive Redis, placing kernel pages in fast memory boosts performance by 1.8× over App Slow + OS Slow. In Spark, we note high contention between Spark compute pages (heap) and HDFS storage (kernel pages) for the limited-capacity fast memory. Placing only kernel pages in fast memory improves performance by 1.3×, mostly due to page cache placement in fast memory. Finally, suppose we place application pages in fast memory but prevent kernel objects from being in fast memory. While Redis improves marginally, this is not the case for Filebench, which spends 86% of its execution inside the OS.

Figure 3 and Figure 4 show the sensitivity of RocksDB (highly I/O- and OS-intensive) and Spark (mostly compute-intensive with intermittent I/O) to lowering slow memory bandwidth or reducing fast memory capacity. The x-axis in Figure 3 shows the increasing ratio of slow memory bandwidth to fast memory bandwidth, whereas Figure 4 shows the increasing fast memory capacity ratio. The results show that reducing fast memory capacity or lowering slow memory bandwidth impacts the applications, and that the impact of placing kernel objects in slower memory affects RocksDB significantly.

Figure 1: Memory allocation, access, and last-level cache miss distribution. (a) Page allocation distribution: bars show the distribution across heap, page cache (OS), and kernel buffers (slab pages); the y-axis shows overall page allocations during an application's lifetime. (b) Memory access distribution: bars show the memory access distribution (in %) across application, kernel (OS), and other user-space libraries. (c) Last-level cache miss distribution (in millions).

Figure 2: Impact of kernel object (page) placement. Fast memory capacity is set to 1/8th (5 GB) of slow memory capacity (40 GB), and slow memory bandwidth is set to 1/5th (4 GB/s) of fast memory.
Figure 3: Memory bandwidth sensitivity. The x-axis varies slow memory bandwidth relative to fast memory, and the y-axis shows throughput (K ops/s) for RocksDB and Spark.
Figure 4: Sensitivity to fast memory capacity. The x-axis shows the fast memory capacity ratio relative to an all-fast-memory system, and the y-axis shows the throughput impact (K ops/s) for RocksDB and Spark. App Fast + OS Fast shows optimal-case performance in an all-fast-memory system. The slow memory bandwidth is set to 1/16th of fast memory.
Overall, these results show that the choice of where to place kernel objects impacts performance substantially, and that there is consequently a need to go beyond prior work and devise efficient tiering of kernel objects.

Figure 5: Active lifetime and access interval. Bars show the average active lifetime of cache and kernel-buffer pages before they are recycled; the y-axis is in milliseconds.
What is the lifetime of a typical kernel object?
Figure 5 shows the lifetime of OS cache and kernel buffer (slab) pages. Conceptually, kernel buffer pages expire after they are freed. Cache pages remain until they are evicted due to memory pressure (we mark them as expired when they are added to the LRU list). Figure 5 shows that RocksDB and Redis have cache pages with average lifetimes of less than 160 milliseconds, and smaller kernel pages are even shorter-lived, at 60 milliseconds. To understand why kernel objects are short-lived, consider, for example, a file write. A page cache page is allocated, the user data is copied to the cache page, a radix tree node is allocated using the slab allocator [31], and the cache page is inserted. The page cache page remains inactive until subsequent reads/writes or commits (i.e., fsync()) to the disk. In contrast, kernel buffers such as radix tree nodes are frequently queried, allocated, and deleted due to tree rebalancing or cache page deletion. Other in-memory structures such as dentry caches and in-memory journals are also frequently allocated and deleted when data and metadata are updated. These observations showcase the limitations of prior work like Thermostat [8], which uses a 30-second interval between two hotness tracking iterations. This relatively large time period was used because scans of page tables to ascertain hotness and invalidate TLB entries are long-latency events. Our results show, however, that kernel objects are far too short-lived to be amenable to such long intervals.
Our Approach: KLOC-Based Tiering
Having quantified the performance challenges posed and opportunities offered by tiering of kernel objects in generic software-controlled heterogeneous memory systems, we consider how to go beyond prior work (which neglects kernel object tiering) and devise kernel tiering. Our goals are high performance and ready implementation in commercial OSes. At first blush, one might consider extending OS support for NUMA affinity to also include kernel objects. This approach would permit placement of kernel objects – just like application pages – in memory devices in a manner that attempts to minimize the distance between CPUs and the data they frequently access. Unfortunately, such NUMA affinity approaches are not viable for kernel objects, which can be shared/reused across applications (making it challenging to assign a single affinity to kernel objects) and have lifetimes much shorter than application pages (making existing ways of measuring hotness and migrating pages inapplicable). Current OSes, including Linux, FreeBSD, and Solaris, lack the capability to associate kernel objects with application entities or to provide fine-grained placement of kernel objects. Instead, we use the concept of KLOCs to encapsulate groups of kernel objects into logical entities – associated with files and network sockets – that can be managed together in a lightweight manner. Using these entities as a unit of movement enables finer-grained decision-making about allocation, placement, and migration of kernel objects associated with an entity (i.e., a KLOC) than using NUMA affinity. These benefits are crucial for good performance but do require extending existing filesystems, network stacks, the OS slab allocator, journals, and device drivers with support for KLOCs. A key feature of our approach is to tap into application-level system calls to accurately track KLOCs, together with kernel-level maps that let us identify hot kernel objects much faster than traditional approaches that scan the page table. This in turn permits us to migrate kernel objects fast enough that the benefits are not outweighed by relatively short kernel object lifetimes. Finally, we leverage existing and highly-optimized I/O stack optimizations such as adaptive I/O prefetching (also known as readahead) and techniques to speculatively place I/O cache and filesystem objects in fast memory to further boost performance.
We next discuss the key design and implementation details of KLOC. We base our design discussion on support for the file system and network stacks, followed by ways to support effective kernel object placement and migration and to exploit traditional OS optimizations such as I/O prefetching.
We start with a set of page placement constraints. First, kernel objects are short-lived, and immediate placement of currently active objects in faster memory is critical. Second, because faster memory is usually lower-capacity, migrating inactive objects to slower memory to make way for active objects is critical. Third, reducing the frequency of long-latency hotness scans and reducing migration overheads is critical. Finally, application pages are always prioritized to use faster memory unless their pages become inactive. Importantly, KLOC object grouping, placement, and migration are not tightly bound to a specific placement policy.

Figure 6: KLOC support for the storage stack. Creation of a KLOC map and addition of VFS, file system, and device driver kernel objects across a file's lifetime (creation, write, fsync, close): (1) create the KLOC map and add the inode and dentry objects; (2) allocate journal, page cache, and radix-tree nodes; (3) allocate bio structures in fast memory and add them to the KLOC map; (4) deallocate unused objects and move the KLOC and its objects (cache pages, radix-tree nodes, jbd2, bio) to slow memory.
The workloads that we use spend a significant fraction of execution time in the file system dealing with page caches, in-memory structures like inodes, dentry caches, radix trees, block-device buffers, and journals, and in the network subsystem where they interact with ingress/egress socket buffers and network I/O queues.

/* KLOC-based dentry object allocation */
void *dentry_alloc(struct inode *inode)
{
    struct dentry *dentry;
    /* Get KLOC */
    struct kloc *kloc = inode_to_kloc(inode);
    struct alloc_policy *alloc_policy;

    /* Check if KLOC is active */
    if (kloc_active(kloc)) {
        /* Get KLOC's allocation policy */
        alloc_policy = kloc->alloc_policy;
        dentry = allocate_hetero(sizeof(struct dentry), alloc_policy);
        /* Add to KLOC's map */
        add_to_kloc_map(kloc->map, dentry);
    } else {
        /* Use default allocation */
        dentry = allocate(sizeof(struct dentry));
    }
    return dentry;
}
Figure 7: Pseudocode for KLOC-based dentry object allocation and mapping. A dentry is used to keep track of the hierarchy of files in directories. The pseudocode checks whether a file-based KLOC (represented by an inode) is active, allocates the dentry based on the KLOC's allocation policy, and finally adds the dentry object to the KLOC map.
To the best of our knowledge, current OSes lack abstractions and the capability to group kernel objects and efficiently place them in heterogeneous memory systems. Our focus is on CPU interactions for these workloads, but as accelerators like GPUs increasingly offer access to system services [48], we expect KLOCs to also be useful for processing paradigms beyond CPUs.
Ideally, KLOCs should be associated with entities that are accessible (and meaningful) to both applications and the OS, and that allow sharing of kernel objects across applications when required. Consider, for example, the notion of a file, which is visible to both the application and the OS code paths responsible for kernel object allocation and I/O servicing. During a file create operation, a set of in-memory kernel objects such as the inode, dentry cache, and journal blocks are allocated. During a file write operation, the virtual file system (VFS) allocates cache pages, radix tree nodes for managing the cache pages, journal records (for crash consistency), and extents [39]. When the block driver commits in-memory pages of a file to disk, it allocates the file's block I/O structures. All these allocations, whether initiated by the application or the OS, are associated with the notion of a file, making the file the appropriate entity around which to group the kernel objects allocated on its behalf. We therefore use a file's inode, which is used across all layers of the file system (i.e., the VFS, the file system, and the device driver), to create a KLOC that tracks all associated kernel objects.
We create a map structure in the inode KLOC to track all kernel objects associated with each file. The map structure is implemented as a red-black tree and maintains kernel objects in the virtual file system (VFS) layer, the dentry cache (used for name lookup), the page cache, the actual file system (without loss of generality, we use the Ext4 file system in our implementation), the log manager (for crash consistency), and the block device driver. We show an example of this map in Figure 6. One of the key challenges in maintaining inode KLOCs is understanding when to create and update kernel objects. In order to do this, we use OS system calls such as create(), write(), read(), and close() as semantic hints. Figure 6 shows that during file creation, a new inode and its map are created (with the inode as root), and the dentry object is added. On a file write operation, cache pages, their radix tree nodes, and the journal records (JBD2) are added to the map too. After a file is closed, the cache entries are removed. Consequently, the KLOC is deleted only when the file and inode are deleted [33].
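A minimal sketch of such a per-KLOC map, using the kernel's red-black tree primitives, is shown below. The kloc and kobj_entry structures and field names are our own illustrative choices and are not taken from the authors' implementation.

#include <linux/rbtree.h>
#include <linux/spinlock.h>

/* One tracked kernel object (dentry, radix-tree node, jbd2 record, bio,
 * cache page, ...) wrapped for insertion into the KLOC map.             */
struct kobj_entry {
    struct rb_node node;
    unsigned long  addr;   /* object (or backing page) address, used as key */
    unsigned int   type;   /* e.g., KOBJ_DENTRY, KOBJ_CACHE_PAGE            */
};

/* A KLOC: one per file (rooted at the inode) or per socket. */
struct kloc {
    struct rb_root       map;           /* red-black tree of kobj_entry        */
    spinlock_t           lock;
    bool                 active;        /* set while some CPU performs I/O     */
    struct alloc_policy *alloc_policy;  /* fast/slow placement policy (Fig. 7) */
};

/* Insert an object into the KLOC map, keyed by its address. */
static void kloc_map_insert(struct kloc *kloc, struct kobj_entry *e)
{
    struct rb_node **link = &kloc->map.rb_node, *parent = NULL;

    spin_lock(&kloc->lock);
    while (*link) {
        struct kobj_entry *cur = rb_entry(*link, struct kobj_entry, node);

        parent = *link;
        link = (e->addr < cur->addr) ? &(*link)->rb_left : &(*link)->rb_right;
    }
    rb_link_node(&e->node, parent, link);
    rb_insert_color(&e->node, &kloc->map);
    spin_unlock(&kloc->lock);
}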
Modern applications run on tens of CPUs, and future applications may run on hundreds of CPUs sharing and accessing per-thread files. Examples of such applications include those that perform graph computations and maintain key-value stores. In such cases, it is beneficial to be able to manage KLOCs across applications and CPUs. Therefore, our approach assumes that a file (i.e., inode) becomes active when thread(s) perform I/O on it. To integrate a notion of CPU numbers with KLOCs, we exploit Linux's per-CPU data structures. Each thread (or CPU) is represented by a task structure (task_struct) that represents the active process currently scheduled on the CPU. We extend the task_struct with an active context state that represents the files actively being accessed by the CPU. We leverage file system I/O system calls, such as open, read, write, close, and others, to identify the kernel objects being accessed by each CPU. These kernel objects are added to the per-CPU context map, as shown in the pseudocode in Figure 7. This approach enables KLOC to group kernel objects concurrently accessed by multiple CPUs.
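One way to realize this per-CPU/per-task active context is sketched below: a small array of active KLOCs carried with the task, updated from the I/O system-call entry points. The field names, array size, and helpers are illustrative assumptions, not the exact changes made in the prototype.

/* Per-task active-context state; the prototype embeds equivalent state in
 * task_struct (illustrative names, not the actual patch).                 */
struct kloc_task_ctx {
    struct kloc *active_klocs[8];   /* files/sockets this task is doing I/O on */
    int          nr_active;
};

/* Called from open()/read()/write() entry points: mark the file's KLOC
 * active so newly allocated kernel objects are placed in fast memory.   */
static void kloc_mark_active(struct kloc_task_ctx *ctx, struct file *file)
{
    struct kloc *kloc = inode_to_kloc(file->f_inode);  /* helper as in Figure 7 */

    if (!kloc || ctx->nr_active >= ARRAY_SIZE(ctx->active_klocs))
        return;
    kloc->active = true;
    ctx->active_klocs[ctx->nr_active++] = kloc;
}

/* Called from close(): the KLOC becomes a demotion candidate once no
 * CPU/task holds it active (sharing/refcounting elided for brevity).  */
static void kloc_mark_inactive(struct file *file)
{
    struct kloc *kloc = inode_to_kloc(file->f_inode);

    if (kloc)
        kloc->active = false;
}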
To group kernel objects allocated by the network subsystem, we use socket descriptors. Within Linux (and other OSes), these are also implemented as inodes, making our file system KLOC design easily extensible to networking. Sockets present an appropriate entity for KLOC creation as they are accessible to both the application and the OS. Page placement and migration policies are applied to pages grouped with respect to sockets.
The network stack uses packet buffers (skbuff) to send and receive application data. The packet buffers allocated on the ingress and egress paths dominate kernel object allocations. The ingress and egress paths in the network stack comprise multiple layers, including TCP, UDP, IP, and the network device driver (e.g., NAPI). To reduce the overheads of copying network packets across these layers, user-level buffers are copied to socket buffers and reused across the lower layers (TCP, UDP, IP, and NAPI) before being copied to the NIC. For the receive path, the device driver is responsible for allocations, which are subsequently reused by upper layers of the networking stack. In addition to the socket buffers, kernel objects such as network queues are also allocated, but they constitute a small fraction of allocated memory. Our networking KLOCs group these per-socket network kernel objects, enabling better memory placement and migration.

As with inode KLOCs, socket KLOCs also use a map structure to track the kernel objects associated with a socket. Therefore, KLOC uses the system calls responsible for socket creation (socket(), open()) to initiate per-socket map structures. When a CPU invokes an egress operation (send()), an ingress operation (recv()), or polls, KLOC marks the socket active and adds network kernel objects to the per-socket map structure. Whether newly-allocated kernel objects are placed in fast or slow memory depends on how the socket is managed. In general, all kernel objects of an active socket KLOC are directed to fast memory.
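As a sketch, the egress side of this policy can be applied directly at send() time, since the socket is known up front; sock_to_kloc(), skb_alloc_hetero(), and kloc_track_object() are hypothetical helpers mirroring the file-system-side pseudocode, while alloc_skb() is the stock kernel allocator used as the fallback.

#include <linux/skbuff.h>

/* Egress path (send()): allocate the packet buffer under the socket
 * KLOC's placement policy and record it in the socket's map.         */
static struct sk_buff *kloc_alloc_egress_skb(struct sock *sk, unsigned int size)
{
    struct kloc *kloc = sock_to_kloc(sk);   /* hypothetical lookup via the
                                               socket's backing inode        */
    struct sk_buff *skb;

    if (kloc && kloc->active)
        skb = skb_alloc_hetero(size, kloc->alloc_policy); /* fast memory      */
    else
        skb = alloc_skb(size, GFP_KERNEL);                /* default behavior */

    if (kloc && skb)
        kloc_track_object(kloc, skb);       /* add to the socket KLOC's map  */
    return skb;
}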
The egress path of a network stack is generally synchronous, and the network stack objects, including the socket buffers, are allocated during send() operations. As a result, grouping socket buffers and network stack kernel objects on the egress path involves simply adding them to the socket's context map structure. Unlike the egress path, the ingress (network receive) path depends on the asynchronous arrival of packets. As network packets arrive, the device driver process inside the OS allocates a generic packet buffer but does not know which socket the packet belongs to. This information is extracted in a higher layer of the TCP stack. This creates a research challenge: how to group ingress packet buffers under a socket KLOC as early as possible, so as to apply the appropriate set of memory tiering policies to them.

An initial, straightforward approach is to extract the packet's entire header and identify the associated socket within the driver itself (before transfer of control to the higher TCP layers), add the buffer to the per-socket context map, and apply the memory allocation policies of the socket context. Unfortunately, this naive approach means that we inspect the packet for socket information in both the driver and the higher layers, which is both CPU-intensive and time-consuming. In practice, we find that the latency for these steps is sufficiently high that it outweighs the performance benefits of placing these buffers in faster memory. The key problem is that the buffers are so short-lived that the additional time for header extraction becomes a performance bottleneck.

A better approach, and the one we use, is to extend the network stack device driver. We do indeed add code to extract socket information within the device driver, but we improve performance by avoiding redundant work at the higher-level layers. We do this by extending the packet buffer structure (skbuff) with an 8-byte socket field, which contains the socket information extracted in the device driver. This field elides the need for further socket information extraction at the higher levels of the TCP stack. We also extend the device driver to add the packet to the socket KLOC's map structure and allocate the packet in faster memory (provided the socket is active). Extending the idea of grouping kernel objects with a socket context down to the device driver provides the flexibility to group kernel objects that are allocated asynchronously and apply uniform allocation and data placement policies.
Grouping kernel objects via inode and socket KLOCs and supporting per-CPU active contexts enables better kernel object tiering. Unlike modern OSes, which do not support kernel object tiering, we can allocate kernel objects in fast memory, identify cold/inactive kernel objects, and migrate the latter to slow memory. To build efficient techniques for identifying cold/inactive kernel objects, and to build support for their migration, we go beyond prior application-level page placement systems. KLOC cannot rely on long-latency hotness scanning and page migrations (which incur TLB invalidations [38, 54]) because kernel objects are much too short-lived to tolerate long migration latencies.
Consider page hotness scanning. With KLOC, when a file or socket context is actively used, all kernel objects grouped in the context map are placed in faster memory. If the faster memory's capacity is full, the objects of inactive file or socket KLOCs (those currently not accessed by any CPU) are moved from faster to slower memory. Migrating inactive KLOCs is conceptually similar to garbage collecting fast memory, without incurring the cost of tracking each and every kernel object. We rely on quickly identifying cold/inactive kernel objects and migrating them to slow memory. To do this, we use and extend the Linux LRU mechanism developed initially for swapping and adopted by prior work [35, 54]. We introduce several extensions critical for kernel object placement, in addition to the concurrent page migrations introduced by Yan et al. [54].

First, OSes such as Linux (and FreeBSD) maintain an active page list, an active LRU list, and an inactive LRU list of pages. Due to limited faster memory capacity, we migrate pages not only from the inactive list but also from the active LRU list when the demand for fast memory is high. Second, we do not wait for Linux to identify LRU and inactive pages; instead, once a network or file context becomes inactive, we immediately mark and migrate its pages. Third, slab pages can be shared across one or more active and inactive file/socket entities. To avoid undue effects, we do not migrate shared pages that contain active objects. Finally, repeated migration of kernel pages between fast and slow memory can be detrimental to performance. To avoid repeated migration, we use an 8-bit per-page counter to track migrations and retain such pages in fast memory. We observe that a small fraction (less than 1%) of pages meet these conditions, due to the short lifetime of kernel objects.
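A simplified view of this demotion decision, including the per-page migration counter that pins pages which bounce repeatedly between tiers, is sketched below; the threshold, structure, and helper names are illustrative.

#define KLOC_MAX_MIGRATIONS 3   /* illustrative threshold, not from the paper */

/* Per-page bookkeeping kept alongside KLOC-managed pages. */
struct kloc_page_info {
    u8   migrations;      /* 8-bit counter of fast<->slow moves              */
    bool shared_active;   /* slab page also holds objects of an active KLOC? */
};

/* Should a page belonging to an inactive KLOC be demoted to slow memory? */
static bool kloc_should_demote(struct kloc *kloc, struct kloc_page_info *pi)
{
    if (kloc->active)
        return false;               /* some CPU still uses this file/socket     */
    if (pi->shared_active)
        return false;               /* shared slab page with active objects     */
    if (pi->migrations >= KLOC_MAX_MIGRATIONS)
        return false;               /* ping-ponging page: retain in fast memory */
    pi->migrations++;
    return true;                    /* demote immediately, without waiting for
                                       the LRU scan to age the page            */
}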
Now consider the actual mechanism to migrate kernel objects. Traditionally, kernel buffer pages allocated using the OS slab allocator (which excludes cache pages or kmalloc()) are managed independently of application pages and are reused across one or more applications. The slab allocator attempts to group objects of the same type or size together [23]. Each slab page can contain one or more kernel objects from different subsystems or shared by different tenants. Importantly, slab kernel pages are not directly mapped to an address space or process, and can also be accessed using a physical address.
Current OSes do not support migration of slab pages.
To enable migration of slab-allocated kernel objects, one might consider entirely redesigning the slab allocator. Although ideal, given the magnitude of changes to the OS design, in KLOC we propose an alternative solution to support the migration of kernel objects. First, for kernel objects related to files and sockets, we use vmalloc() (virtual malloc) support inside the OS. Using vmalloc() provides the ability to allocate a large region of virtually contiguous memory that also has an anonymous address space (anon_vma) [17] not backed by a file. Using vmalloc() allows us to allocate and coalesce the kernel objects of a context into a virtually contiguous region, and to use the anonymous address space to extend Linux's page migration code to support kernel object migration. (Our current approach is limited to supporting the migration of kernel objects that are not physically dereferenced.)
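One way to realize this grouping is a per-KLOC bump allocator carved out of a vmalloc()'d region, so that a context's objects land on the same virtually contiguous (and hence migratable) pages; the region size and bookkeeping below are illustrative, and chaining of additional regions is omitted.

#include <linux/vmalloc.h>

#define KLOC_REGION_SIZE (2UL << 20)   /* illustrative: 2 MB per KLOC */

/* A virtually contiguous region backing one KLOC's kernel objects. */
struct kloc_region {
    void   *base;   /* returned by vmalloc(); its pages can be remapped by an
                       extended version of the anonymous-page migration path  */
    size_t  used;
};

static int kloc_region_init(struct kloc_region *r)
{
    r->base = vmalloc(KLOC_REGION_SIZE);
    r->used = 0;
    return r->base ? 0 : -ENOMEM;
}

/* Bump-allocate a kernel object inside the KLOC's region, coalescing the
 * objects of one file/socket onto the same migratable pages.              */
static void *kloc_region_alloc(struct kloc_region *r, size_t size)
{
    void *obj;

    size = ALIGN(size, sizeof(long));
    if (r->used + size > KLOC_REGION_SIZE)
        return NULL;                 /* sketch: no chaining of extra regions */
    obj = (char *)r->base + r->used;
    r->used += size;
    return obj;
}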
Using KLOCs also allows us to leverage existing OS-level optimizations such as in-memory buffering, prefetching, and the adaptive readahead mechanism (also known as I/O prefetching), by making them aware of the notion of KLOCs so that workloads with temporal and spatial locality of I/O reference can be accelerated. Our approach requires no changes to I/O prefetching policies and mechanisms; we simply direct all I/O prefetches to fast memory. For example, consider that readahead speculatively reads a portion of a file's contents into memory with the expectation that the process working on the file will read/write that data in the future. Existing readahead mechanisms are adaptive; by tracking how often prefetched pages are actually used, the OS maintains a window of prefetch targets. When prefetches are used frequently, the window dynamically increases (to a maximum of 128 MB in Linux); when the prefetches are not used often, the window shrinks. Generally, the prefetching window increases when the workload demonstrates sequential memory access, while it decreases when access patterns become more random [27, 51]. By placing readahead pages in faster memory, we accelerate sequential workloads. For random-access workloads, the prefetch window automatically shrinks, and our approach has no punitive effect. (This also ensures that we do not over-aggressively allocate cold/inactive kernel objects to fast memory.)
Evaluation

Our evaluations answer the following questions:
• What are the benefits and implications of KLOC's fine-grained placement of the file system's kernel objects in heterogeneous memory systems? What is the impact on fast memory utilization?
• How effective is KLOC's support for the network subsystem?
• Is KLOC's capability to accelerate the I/O stack's prefetching optimization with fast memory pages beneficial?
We use a 40-core Intel Xeon 2.67 GHz dual-socket system with 80 GB of memory per socket and a 512 GB Intel Optane NVMe with peak sequential and random bandwidths of 1.2 GB/s and 425 MB/s respectively. We use the memory heterogeneity emulator from Section 3 and consider a generic fast (DRAM) and slow (throttled) memory. As discussed in Section 3 (see Figure 3), the fast memory capacity and slow memory bandwidth have a direct correlation with performance. We quantify performance sensitivity to slow memory bandwidth. We fix the fast memory capacity to 5 GB (this is representative of recent industrial and academic projections [4, 12, 18, 24, 35]). We use the same set of applications studied in Section 3. To understand the performance of KLOC on a real NVM device (for which the bandwidth can be modified) for RocksDB, we use a 64-core, 2 TB Intel DC platform.
Mechanism | Description
All-SlowMem | A worst-case slow-memory-only system; OS and application pages are always placed in slow memory
All-FastMem | An ideal all-fast-memory system (best case)
Naive | A greedy approach that naively uses NUMA and attempts to place application and OS pages in low-capacity fast memory even though they cannot fit
Nimble | State-of-the-art application page placement system with concurrent migration support [54]
Migration-only | Migration-based approach that only migrates cold pages from fast to slow memory, freeing fast memory for direct allocation, similar to [8, 35, 54]
KLOC-nomigrate | KLOC's kernel page placement, but without migration, for the storage stack
KLOC-migrate-fs-noprefetch | KLOC with kernel page migration but without network stack or I/O prefetcher support (Section 5.2)
KLOC-migrate-fs-nw-noprefetch | KLOC-migrate-fs-noprefetch with support for the network stack (Section 5.3)
KLOC-migrate-fs-nw-prefetch | Uses hints from the OS I/O prefetcher (Section 5.5)
Table 3: Evaluation Mechanisms.
System Configurations. We compare our approach to several other page placement and migration approaches, summarized in Table 3: (1) All-SlowMem represents a worst-case baseline on a system with only slow memory. (2) All-FastMem represents an ideal system that can fit the entire application workload in fast memory. (3) Naive represents an approach where the OS greedily places both application and kernel pages in fast memory; when fast memory is full, subsequent allocations for application and OS pages are served from slower memory until fast memory pages become free. (4) Nimble represents the state-of-the-art system for placement of application-level pages; Nimble also employs concurrent page migrations [54]. (5) Migration-only attempts to allocate OS and application pages in fast memory, but also identifies cold pages and migrates them to slower memory, similar to [8]. (6) KLOC-nomigrate represents a version of our approach using KLOCs that groups kernel objects and allocates them in fast memory; what it omits is the migration of pages of inactive KLOCs from fast to slow memory. (7) KLOC-migrate-fs-noprefetch goes beyond KLOC-nomigrate and also migrates cold/inactive file system kernel objects. Finally, (8) KLOC-migrate-fs-nw-noprefetch and KLOC-migrate-fs-nw-prefetch represent our final, full-blown KLOC approach, adding support for the network stack and I/O prefetching optimizations to fast memory.
KLOC for File System
We compare the following approaches summarized in Table 3: (1) All-SlowMem, the worst-case baseline; (2) All-FastMem, the ideal case; (3) Naive; (4) Migration-only; (5) KLOC-nomigrate, the proposed KLOC-based placement of kernel objects related to files and sockets, without migration; and finally, (6) KLOC-migrate-fs-noprefetch. To understand the reduction in the use of slow memory pages for kernel objects (and hence the better performance) with KLOC, Figure 9 shows statistics for RocksDB. The x-axis shows the total page cache pages, kernel buffer pages (slab pages for kernel in-memory structures, logs, socket buffers, etc.), and inactive pages migrated to slower memory, and the y-axis shows the page count (in units of 100K pages).

Figure 8: Performance and memory bandwidth sensitivity. (a) KLOC performance for applications and benchmarks: fast memory capacity is set to 5 GB (i.e., 1/8th of slow memory capacity) and the slow memory bandwidth is 1/8th of fast memory; Redis results only consider KLOC for the file system; the y-axis shows scaled throughput (ops/sec) for each application; Nimble is the state-of-the-art application data placement system. (b) Performance sensitivity to slow memory bandwidth: the x-axis varies slow memory's bandwidth ratio relative to fast memory, and the y-axis shows throughput; the fast memory capacity is set to 5 GB (1/8th of slow memory).
Observations.
First, as expected, incorrect placement of kernel-level objects in slow memory can significantly degrade performance, as shown by the difference between the All-FastMem (optimal) and All-SlowMem configurations for all applications. Second, the Naive greedy approach attempts to allocate all application and OS pages to the limited-capacity fast memory. As a result, the naive placement of non-critical kernel objects can drastically impact performance-sensitive application pages and kernel objects. RocksDB uses hundreds of 4 MB files to store the persistent key-values as string-sorted tables [34, 45]; once the files are filled, the low-capacity fast memory is polluted with inactive file caches and kernel objects (e.g., inode structures, dentry caches). As a result, the throughput of the Naive approach is 2.6× lower than the optimal case. We observe a similar trend for Cassandra and Filebench. In contrast, Redis uses a few large files to checkpoint data, so even the Naive approach provides gains over the worst-case baseline, but cache page pollution quickly limits the benefits of fast memory. Spark is composed of multiple applications, which include the Scala-based Spark benchmark, the Spark framework, and the HDFS storage server (Hadoop file system). While Spark itself is compute-intensive, the HDFS storage server is highly I/O-intensive. As a result, the Naive approach is ineffective.
Figure 9: RocksDB's kernel page slow memory use. The x-axis shows slow memory pages allocated for page cache, kernel buffers (for OS data structures), or inactive pages migrated to slow memory. KLOC reduces slow memory use for page cache and kernel buffers. All-FastMem does not use slow memory or incur migrations.
Next, Nimble mainly tracks application-level pages using the OS swapping mechanism's active and inactive lists. The lack of kernel page placement and excessive migration without context hurt performance for I/O-intensive applications. Next, the Migration-only approach performs application and kernel page migrations by tracking fast memory for cold (inactive) pages and migrating them to slow memory. However, blindly migrating all kernel (and application) objects increases page allocation and TLB and page table invalidation overheads. Further, the Migration-only approach lacks semantic information about how entities like files and sockets and their related kernel objects are used. For example, even after a close() operation, cache pages and journal data can continue to pollute faster memory.

Finally, the KLOC-nomigrate approach provides effective placement of kernel and file-related objects and attempts to place them in fast memory, but does not support page migration. In contrast, KLOC-migrate-fs-noprefetch, in addition to KLOC placement, also aggressively demotes objects of inactive files to slower memory using the modified LRU mechanism; because of this, fast memory allocation misses (and hence slow memory use) for active kernel objects are also reduced (see Figure 9). As a result, RocksDB throughput improves by up to 1.42× over the migration approach. Interestingly, for Redis, the migration-based KLOC-migrate-fs-noprefetch approach provides up to 1.70× gains even over KLOC-nomigrate. Finally, the Java-based Cassandra uses a large in-memory cache, and its application-level pages consume 95% of fast memory capacity, resulting in marginal gains with KLOC.

Figure 8b varies the slow memory bandwidth along the x-axis; for brevity, it studies only RocksDB and Spark. The fast memory capacity is set to 8 GB (1/10th of slow memory). Because RocksDB is highly memory-intensive, with a significant fraction of memory pages allocated in the kernel and a high rate of kernel-level memory references, the results show a high performance benefit across all bandwidth configurations. For Spark, KLOC-migrate-fs-noprefetch benefits are high until the slow memory bandwidth is 1/4th of the fast memory; increasing the slow memory bandwidth further (e.g., to 1/2 of fast memory) shows limited gains.

Figure 10: Redis performance with KLOC for the network stack. Results show the cumulative throughput of 8 Redis server instances using 1 KB value sizes for 4 million keys with 75% read (GET) requests.
Summary.
The results highlight the benefits of introducing a file context to encapsulate kernel objects, targeting placement only for objects currently accessed by an application and avoiding the placement cost of non-critical objects.
KLOC for Network Stack
We next discuss the benefits of introducing
KLOC supportfor the socket entity in the network stack and controlling theallocation and placement of related kernel objects in heteroge-neous memory. Figure 10 shows the performance of Redis, anetwork-intensive key-value store serving hundreds of clients.In contrast to the
KLOC approach studied in Figure 8a withoutthe network stack support,
KLOC -migrate-fs-nw-noprefetch in Figure 10 shows the performance of network stack with
KLOC support.For performance analysis, we use the well-known Redisbenchmark performing 4 million key operations, 25% insert(SET) and 75%, fetch (GET) operations, and 1 KB value sizefor each key as used by prior work [25]. Because Redis isa single-threaded application, we run instances of Redisserver with each instance using a set of dedicated ports. They-axis shows the throughput of SET and GET operations.Note that, for KLOC -migrate-fs-nw-noprefetch , each socketentity has an independent
KLOC map to encapsulate, allocate,and migrate kernel objects and in addition to
KLOC s for a file.As noted in section 3 and section 5, for the network subsystem,the socket buffers (a.k.a skbuff ) dominate the kernel objectallocation. The socket buffers are allocated and reused acrossdifferent layers (system calls, TCP, IP, and NAPI) and can arereused across operations.Redis is a memory-intensive application; the application’skey-value store, the network stack, and file system for check-pointing, all demand memory pages and are sensitive to fastmemory capacity and slow memory bandwidth. Comparedto the optimal case ( fast memory-only system), other ap-proaches with limited faster memory capacity suffer slow-down for both SET and GET operations. For the
Naive approach, contention for fast memory across the application and I/O subsystems leads to incorrect placement and performance slowdown. The shorter lifetime of network socket buffers, which are frequently allocated, released, or reused across network operations, makes the Migration-only approach ineffective; this approach suffers from high page migration and related overheads [24, 35, 42], reaping only marginal benefits from memory heterogeneity. The KLOC-migrate-fs-noprefetch approach can only handle efficient placement of the file system's kernel objects. In contrast, KLOC-migrate-fs-nw-noprefetch can map and efficiently handle the placement of both network and file system kernel objects. The network-supported approach first attempts to allocate objects mapped to a socket's KLOC in faster memory; when a direct allocation is not feasible, it uses the modified Linux LRU-based migration approach to move inactive socket-related pages to slow memory and make room for subsequent allocations in fast memory. As a result, KLOC provides 3.3× higher throughput compared to the migration-based approach.
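A rough way to picture the per-socket KLOC map is the user-space sketch below (names such as sk_kloc, kloc_net_map, and the fixed bucket count are assumptions for illustration, not the kernel data structures). Each socket gets its own context that records the skbuff-backed pages charged to it across the system-call, TCP/IP, and NAPI layers, so that placement, and later demotion, can treat a socket's objects as one unit:

    /*
     * User-space model of a per-socket KLOC map for skbuff pages.
     * Names (sk_kloc, kloc_net_map, NR_BUCKETS) are illustrative only.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define NR_BUCKETS 64

    struct sk_kloc {
        int             sock_fd;    /* socket this context encapsulates */
        unsigned        skb_pages;  /* skbuff-backed pages charged to it */
        struct sk_kloc *next;
    };

    static struct sk_kloc *kloc_net_map[NR_BUCKETS];

    /* Find (or create) the KLOC context for a socket. */
    static struct sk_kloc *sk_kloc_get(int sock_fd)
    {
        unsigned b = (unsigned)sock_fd % NR_BUCKETS;
        for (struct sk_kloc *k = kloc_net_map[b]; k; k = k->next)
            if (k->sock_fd == sock_fd)
                return k;
        struct sk_kloc *k = calloc(1, sizeof(*k));
        if (!k)
            abort();
        k->sock_fd = sock_fd;
        k->next = kloc_net_map[b];
        kloc_net_map[b] = k;
        return k;
    }

    /* Charge an skbuff page to the socket's KLOC (from any layer: syscall/TCP/IP/NAPI). */
    static void sk_kloc_charge(int sock_fd, unsigned pages)
    {
        sk_kloc_get(sock_fd)->skb_pages += pages;
    }

    /* When the socket goes idle or closes, its pages can be demoted or freed as a unit. */
    static void sk_kloc_release(int sock_fd)
    {
        struct sk_kloc *k = sk_kloc_get(sock_fd);
        printf("socket %d: %u skbuff page(s) eligible for demotion\n",
               k->sock_fd, k->skb_pages);
        k->skb_pages = 0;
    }

    int main(void)
    {
        sk_kloc_charge(7, 2);   /* e.g., receive path: NAPI and TCP reuse the same pages */
        sk_kloc_charge(7, 1);   /* send path */
        sk_kloc_charge(9, 4);
        sk_kloc_release(9);     /* idle connection: move its pages out together */
        sk_kloc_release(7);
        return 0;
    }

Grouping short-lived socket buffers under their socket's context is what lets the placement logic avoid per-page hotness scanning: the socket's activity stands in for the hotness of all pages it owns.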
Summary.
The substantial performance gains highlight the benefits of introducing fine-grained KLOC support for sockets to encapsulate network stack objects for efficient data placement. Note that KLOC supports multiple instances (processes), highlighting the generality of the proposed abstraction.
KLOC with the I/O Prefetcher
Enabling
KLOC to group a set of kernel-level objects associated with entities like files provides the capability to exploit OS-level hints and optimizations for page placement. To showcase such flexibility, we evaluate the benefits of combining KLOC and the file system's I/O prefetcher, as shown in Figure 11. The I/O prefetcher adapts to the application's I/O access pattern and varies the page cache allocation behavior. In OSes such as Linux, the I/O prefetcher expands the I/O prefetch window size (up to 128 MB) adaptively for both spatial and temporal locality [27]; the prefetch window shrinks for random access patterns. Capturing such semantic hints can be beneficial for KLOC's page placement and for reducing migration.

Figure 11 shows the throughput for all applications, and Figure 12 shows RocksDB's throughput for sequential and random access I/O patterns (on the x-axis). To demonstrate the incremental benefits, for brevity, we only compare
KLOC without prefetcher support (KLOC-migrate-fs-nw-noprefetch) with prefetcher-supported KLOC (KLOC-migrate-fs-nw-prefetch).

Next, combining the prefetch I/O optimization in the file system with KLOC improves the performance of several I/O-intensive applications with temporal or spatial locality. For example, applications such as RocksDB, Redis, and Cassandra benefit from proactively placing prefetched I/O pages in faster memory; RocksDB's overall throughput improves by 1.27×. As shown in Figure 12, RocksDB's sequential accesses benefit significantly from prefetching. For Filebench, we use a random read and write workload that neither gains nor loses performance; this is because the prefetch approach uses the I/O prefetch window size as a hint to predict random access and avoids aggressively placing pages that are unlikely to be used, thereby not polluting fast memory. As a side effect, this also reduces the migration of inactive fast memory pages to slow memory, contributing further to the performance benefits. For Redis, the performance gains (1.32×) come from the periodic checkpoint of the in-memory key-value store to storage, which is mostly sequential; the overall Redis performance improves by 4× compared to the migration-only approach.
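As a rough illustration of how the readahead hint can steer placement (a sketch under assumed names; PREFETCH_SEQ_THRESHOLD and place_prefetched_page are illustrative, not taken from the paper or the Linux sources), a large prefetch window suggests sequential reuse and favors fast memory, while a shrunken window suggests random access and leaves prefetched page-cache pages in slow memory to avoid polluting the fast tier:

    /*
     * Sketch: use the I/O prefetcher's current window size as a placement hint.
     * The threshold value and function names are assumptions for illustration.
     */
    #include <stdio.h>

    enum tier { FAST_MEM, SLOW_MEM };

    /* Windows below this size are treated as "random access detected". */
    #define PREFETCH_SEQ_THRESHOLD (1u << 20)   /* 1 MB, illustrative */

    static enum tier place_prefetched_page(unsigned window_bytes, unsigned fast_free_pages)
    {
        /* A shrunken readahead window implies random access: do not pollute fast memory. */
        if (window_bytes < PREFETCH_SEQ_THRESHOLD)
            return SLOW_MEM;
        /* Sequential/temporal locality: proactively place in fast memory if room exists. */
        return fast_free_pages > 0 ? FAST_MEM : SLOW_MEM;
    }

    int main(void)
    {
        printf("sequential scan (64 MB window): %s\n",
               place_prefetched_page(64u << 20, 128) == FAST_MEM ? "fast" : "slow");
        printf("random read (64 KB window):     %s\n",
               place_prefetched_page(64u << 10, 128) == FAST_MEM ? "fast" : "slow");
        return 0;
    }

The same decision also explains the reduced migration noted above: pages that were never promoted for a random workload never have to be demoted later.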
Figure 11: KLOC I/O prefetch gains. The y-axis is scaled application throughput (Ops/sec) for Filebench, Redis, RocksDB, Spark, and Cassandra, comparing KLOC-migrate-fs-nw-noprefetch and KLOC-migrate-fs-nw-prefetch.

Figure 12: RocksDB throughput breakdown. Values show throughput in 100K Ops/sec.

Access Pattern    KLOC-migrate-fs-nw-noprefetch    KLOC-migrate-fs-nw-prefetch
RandWrite         53                               55
RandRead          145                              146
SeqWrite          330                              300
SeqRead           1178                             1306
ReadWrite         76                               217
Figure 13: KLOC impact on DC-Optane for RocksDB. The y-axis shows RocksDB throughput (MB/sec) for All-SlowMem, All-FastMem, Naive, and KLOC-migrate-fs-nw-prefetch.

Summary.
The results show the benefits of combining
KLOC with traditional OS-level optimizations such as the I/O prefetcher. We believe these techniques could be adopted for other subsystems [55].
KLOC performance on DC-Optane Memory
Finally, to understand the impact of
KLOC on DC-Optane memory technologies, we use a 256GB DC-Optane module on a memory socket with a 16GB hardware-managed DRAM cache (memory mode) as slow memory, and a 48GB DRAM-only socket as fast memory. Due to the various limitations discussed in Section 2 and space constraints, we only show results for RocksDB. First, as shown in Figure 13, running RocksDB only on the slower memory shows substantial performance degradation compared to the optimal (fast-memory-only) approach. The current DC-Optane technology in memory mode manages the DRAM cache as a direct-mapped cache; for large working set sizes, we see a substantial increase in latency as well as a throughput reduction, possibly due to a combination of poor cache management and high cache miss overheads. Employing KLOC to maximize the placement of kernel as well as OS pages in faster memory accelerates performance considerably compared to fully leaving placement to the hardware and using a naive placement.
Conclusion

To provide efficient memory placement and management of kernel objects in heterogeneous memory systems, we present the
KLOC, a mechanism that encapsulates kernel objects within fine-grained contexts tied to entities such as files and sockets and provides efficient data placement and migration without requiring expensive hotness scanning. Our results on real-world applications such as RocksDB and Redis show up to 1.4× and 4× higher throughput compared to migration-based techniques. Our future work will explore supporting KLOC for other subsystems (e.g., GPU).

References

[1] "Apache Cassandra," http://cassandra.apache.org/.

[2] "Facebook RocksDB," http://rocksdb.org/.

[3] "Google LevelDB," http://tinyurl.com/osqd7c8.

[4] "Intel-Micron Memory 3D XPoint," http://intel.ly/1eICR0a.

[5] "Knights Landing (KNL): 2nd Generation Intel Xeon Phi," Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '17. New York, NY, USA: ACM, 2017, pp. 631–644. [Online]. Available: http://doi.acm.org/10.1145/3037697.3037706

[9] B. Akin, F. Franchetti, and J. C. Hoe, "Data reorganization in memory using 3d-stacked dram," in
Proceedings of the 42nd Annual International Symposium on Computer Architecture, ser. ISCA '15. New York, NY, USA: ACM, 2015, pp. 131–143. [Online]. Available: http://doi.acm.org/10.1145/2749469.2750397

[10] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan, K. Roy, and D. S. Milojicic, "Puma: A programmable ultra-efficient memristor-based accelerator for machine learning inference," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '19. New York, NY, USA: ACM, 2019, pp. 715–731. [Online]. Available: http://doi.acm.org/10.1145/3297858.3304049

[11] M. Becchi and P. Crowley, "A hybrid finite automaton for practical deep packet inspection," in Proceedings of the 2007 ACM CoNEXT Conference, ser. CoNEXT '07. New York, NY, USA: ACM, 2007, pp. 1:1–1:12. [Online]. Available: http://doi.acm.org/10.1145/1364654.1364656

[12] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb, "Die stacking (3d) microarchitecture," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 39. Washington, DC, USA: IEEE Computer Society, 2006, pp. 469–479. [Online]. Available: https://doi.org/10.1109/MICRO.2006.18

[13] C.-C. Chou, A. Jaleel, and M. Qureshi, "Batman: Maximizing bandwidth utilization for hybrid memory systems," in Technical Report, TR-CARET-2015-01 (March 9, 2015), 2015.

[14] T.-S. Chung, D.-J. Park, S. Park, D.-H. Lee, S.-W. Lee, and H.-J. Song, "A survey of flash translation layer," J. Syst. Archit., vol. 55, no. 5-6, pp. 332–343, May 2009. [Online]. Available: http://dx.doi.org/10.1016/j.sysarc.2009.03.005

[15] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee, "Better I/O Through Byte-addressable, Persistent Memory," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, ser. SOSP '09, Big Sky, Montana, USA, 2009.

[16] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, "Benchmarking Cloud Serving Systems with YCSB," in Proceedings of the 1st ACM Symposium on Cloud Computing, ser. SoCC '10, Indianapolis, Indiana, USA, 2010.

[17] T. Corbet, "On the proper use of vmalloc," https://lwn.net/Articles/57800/, 2003.

[18] Q. Deng, D. Meisner, L. Ramos, T. F. Wenisch, and R. Bianchini, "Memscale: Active low-power modes for main memory," SIGARCH Comput. Archit. News, vol. 39, no. 1, pp. 225–238, Mar. 2011. [Online]. Available: http://doi.acm.org/10.1145/1961295.1950392

[19] P. J. Denning, "The working set model for program behavior," Commun. ACM, vol. 11, no. 5, pp. 323–333, May 1968. [Online]. Available: http://doi.acm.org/10.1145/363095.363141

[20] X. Dong, Y. Xie, N. Muralimanohar, and N. P. Jouppi, "Simple but effective heterogeneous main memory with on-chip memory controller support," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–11. [Online]. Available: http://dx.doi.org/10.1109/SC.2010.50

[21] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson, "System Software for Persistent Memory," in Proceedings of the Ninth European Conference on Computer Systems, ser. EuroSys '14. New York, NY, USA: ACM, 2014, pp. 15:1–15:15. [Online]. Available: http://doi.acm.org/10.1145/2592798.2592814

[22] S. R. Dulloor, A. Roy, Z. Zhao, N. Sundaram, N. Satish, R. Sankaran, J. Jackson, and K. Schwan, "Data tiering in heterogeneous memory systems," in Proceedings of the Eleventh European Conference on Computer Systems, SIGPLAN Not., vol. 50, no. 7, pp. 79–92, Mar. 2015. [Online]. Available: http://doi.acm.org/10.1145/2817817.2731191

[25] A. Gutierrez, M. Cieslak, B. Giridhar, R. G. Dreslinski, L. Ceze, and T. Mudge, "Integrated 3d-stacked server designs for increasing physical density of key-value stores," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '14, Salt Lake City, Utah, USA, 2014.

[26] H. Hanson and K. Rajamani, "What computer architects need to know about memory throttling," in Proceedings of the 2010 International Conference on Computer Architecture, ser. ISCA '10. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 233–242.

[27] J. He, S. Kannan, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "The Unwritten Contract of Solid State Drives," in Proceedings of the Twelfth European Conference on Computer Systems, ser. EuroSys '17. New York, NY, USA: ACM, 2017, pp. 127–144. [Online]. Available: http://doi.acm.org/10.1145/3064176.3064187

[28] Intel, "NVM Library," https://github.com/pmem/nvml.

[29] D. Jevdjic, S. Volos, and B. Falsafi, "Die-stacked dram caches for servers: Hit ratio, latency, or bandwidth? have it all with footprint cache," in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA '13. New York, NY, USA: ACM, 2013, pp. 404–415. [Online]. Available: http://doi.acm.org/10.1145/2485922.2485957

[30] X. Jiang, N. Madan, L. Zhao, M. Upton, R. Iyer, S. Makineni, D. Newell, D. Solihin, and R. Balasubramonian, "Chop: Adaptive filter-based dram caching for cmp server platforms," in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, Jan 2010, pp. 1–12.

[31] C. Jonathan, "Linux Radix Trees," https://lwn.net/Articles/175432/.

[32] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA '17. New York, NY, USA: ACM, 2017, pp. 1–12. [Online]. Available: http://doi.acm.org/10.1145/3079856.3080246

[33] S. Kannan, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, Y. Wang, J. Xu, and G. Palani, "Designing a true direct-access file system with devfs," in Proceedings of the 16th USENIX Conference on File and Storage Technologies, ser. FAST '18. Berkeley, CA, USA: USENIX Association, 2018, pp. 241–255. [Online]. Available: http://dl.acm.org/citation.cfm?id=3189759.3189782

[34] S. Kannan, N. Bhat, A. Gavrilovska, A. Arpaci-Dusseau, and R. Arpaci-Dusseau, "Redesigning lsms for nonvolatile memory with novelsm," in Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference, ser. USENIX ATC '18. Berkeley, CA, USA: USENIX Association, 2018, pp. 993–1005. [Online]. Available: http://dl.acm.org/citation.cfm?id=3277355.3277450

[35] S. Kannan, A. Gavrilovska, V. Gupta, and K. Schwan, "Heteroos: Os design for heterogeneous memory management in datacenter," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA '17. New York, NY, USA: ACM, 2017, pp. 521–534. [Online]. Available: http://doi.acm.org/10.1145/3079856.3080245

[36] S. Kannan, A. Gavrilovska, and K. Schwan, "pvm: Persistent virtual memory for efficient capacity scaling and object storage," in Proceedings of the Eleventh European Conference on Computer Systems, ser. EuroSys '16. New York, NY, USA: ACM, 2016, pp. 13:1–13:16. [Online]. Available: http://doi.acm.org/10.1145/2901318.2901325

[37] Y. Kwon, H. Yu, S. Peter, C. J. Rossbach, and E. Witchel, "Coordinated and efficient huge page management with ingens," in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI '16. Berkeley, CA, USA: USENIX Association, 2016, pp. 705–721. [Online]. Available: http://dl.acm.org/citation.cfm?id=3026877.3026931

[38] A. Lagar-Cavilla, J. Ahn, S. Souhlal, N. Agarwal, R. Burny, S. Butt, J. Chang, A. Chaugule, N. Deng, J. Shahid, G. Thelen, K. A. Yurtsever, Y. Zhao, and P. Ranganathan, "Software-defined far memory in warehouse-scale computers," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '19. New York, NY, USA: ACM, 2019, pp. 317–330. [Online]. Available: http://doi.acm.org/10.1145/3297858.3304053

[39] C. Lee, D. Sim, J.-Y. Hwang, and S. Cho, "F2FS: A New File System for Flash Storage," in Proceedings of the 13th USENIX Conference on File and Storage Technologies, ser. FAST '15, Santa Clara, CA, 2015.

[40] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch, "Disaggregated memory for expansion and sharing in blade servers," SIGARCH Comput. Archit. News, vol. 37, no. 3, pp. 267–278, Jun. 2009. [Online]. Available: http://doi.acm.org/10.1145/1555815.1555789

[41] F. X. Lin and X. Liu, "Memif: Towards programming heterogeneous memory asynchronously," in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '16. New York, NY, USA: ACM, 2016, pp. 369–383. [Online]. Available: http://doi.acm.org/10.1145/2872362.2872401

[42] M. Meswani, S. Blagodurov, D. Roberts, J. Slice, M. Ignatowski, and G. Loh, "Heterogeneous memory architectures: A hw/sw approach for mixing die-stacked and off-package memories," in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, Feb 2015, pp. 126–136.

[43] M. Oskin and G. H. Loh, "A software-managed approach to die-stacked dram," in Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), ser. PACT '15. Washington, DC, USA: IEEE Computer Society, 2015, pp. 188–200. [Online]. Available: https://doi.org/10.1109/PACT.2015.30

[44] M. Radulovic, D. Zivanovic, D. Ruiz, B. R. de Supinski, S. A. McKee, P. Radojković, and E. Ayguadé, "Another trip to the wall: How much will stacked dram benefit hpc?" in Proceedings of the 2015 International Symposium on Memory Systems, ser. MEMSYS '15. New York, NY, USA: ACM, 2015, pp. 31–36. [Online]. Available: http://doi.acm.org/10.1145/2818950.2818955

[45] P. Raju, R. Kadekodi, V. Chidambaram, and I. Abraham, "PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees," in Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP '17), Shanghai, China, October 2017.

[46] L. E. Ramos, E. Gorbatov, and R. Bianchini, "Page placement in hybrid memory systems," in Proceedings of the International Conference on Supercomputing, ser. ICS '11. New York, NY, USA: ACM, 2011, pp. 85–95. [Online]. Available: http://doi.acm.org/10.1145/1995896.1995911

[47] T. Vasily, "Filebench," https://github.com/filebench/filebench.

[48] J. Veselý, A. Basu, A. Bhattacharjee, G. H. Loh, M. Oskin, and S. K. Reinhardt, "Generic system calls for gpus," in Proceedings of the 45th Annual International Symposium on Computer Architecture, ser. ISCA '18. Piscataway, NJ, USA: IEEE Press, 2018, pp. 843–856. [Online]. Available: https://doi.org/10.1109/ISCA.2018.00075

[49] H. Volos, A. J. Tack, and M. M. Swift, "Mnemosyne: Lightweight Persistent Memory," in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI, Newport Beach, California, USA, 2011.

[50] C. Wang, H. Cui, T. Cao, J. Zigman, H. Volos, O. Mutlu, F. Lv, X. Feng, and G. H. Xu, "Panthera: Holistic memory management for big data processing over hybrid memories," in Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI 2019. New York, NY, USA: ACM, 2019, pp. 347–362. [Online]. Available: http://doi.acm.org/10.1145/3314221.3314650

[51] F. Wu, H. Xi, J. Li, and N. Zou, "Linux readahead: less tricks for more," 08 2019.

[52] X. Wu and A. L. N. Reddy, "SCMFS: A File System for Storage Class Memory," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11, Seattle, Washington, 2011.

[53] J. Xu and S. Swanson, "NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories," in Proceedings of the 14th Usenix Conference on File and Storage Technologies, ser. FAST '16, 2016.

[54] Z. Yan, D. Lustig, D. Nellans, and A. Bhattacharjee, "Nimble page management for tiered memory systems," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '19. New York, NY, USA: ACM, 2019, pp. 331–345. [Online]. Available: http://doi.acm.org/10.1145/3297858.3304024

[55] Y. Yang, P. Xiang, M. Mantor, and H. Zhou, "Cpu-assisted gpgpu on fused cpu-gpu architectures," in Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture, ser. HPCA '12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 1–12. [Online]. Available: http://dx.doi.org/10.1109/HPCA.2012.6168948

[56] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud '10. Berkeley, CA, USA: USENIX Association, 2010, pp. 10–10. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863103.1863113

[57] L. Zhao, R. Iyer, R. Illikkal, and D. Newell, "Exploring dram cache architectures for cmp server platforms," in