BPF for storage: an exokernel-inspired approach
Yu Jian Wu, Hongyi Wang, Yuhong Zhong, Asaf Cidon, Ryan Stutsman, Amy Tai, Junfeng Yang
Columbia University, University of Utah, VMware Research
Abstract
The overhead of the kernel storage path accounts for half of the access latency for new NVMe storage devices. We explore using BPF to reduce this overhead, by injecting user-defined functions deep in the kernel's I/O processing stack. When issuing a series of dependent I/O requests, this approach can increase IOPS by over 2.5× and cut latency by half, by bypassing kernel layers and avoiding user-kernel boundary crossings. However, we must avoid losing important properties when bypassing the file system and block layer, such as the safety guarantees of the file system and the translation between physical block addresses and file offsets. We sketch potential solutions to these problems, inspired by exokernel file systems from the late 90s, whose time, we believe, has finally come!

“As a dog returns to his vomit, so a fool repeats his folly.”
Attributed to King Solomon

Storage devices have historically lagged behind networking devices in achieving high bandwidth and low latency. While 100 Gb/s bandwidth is now common for network interface cards (NICs) and the physical layer, storage devices are only beginning to support 2-7 GB/s bandwidth and 4-5 µs latencies [8, 13, 19, 20]. With such devices, the software stack is now a substantial overhead on every storage request. In our experiments this can account for about half of I/O operation latency, and the impact on throughput can be even more significant. Kernel-bypass frameworks (e.g. SPDK [44]) and near-storage processing reduce kernel overheads. However, there are clear drawbacks to both, such as significant, often bespoke, application-level changes [40, 41], lack of isolation, wasteful busy waiting when I/O usage isn't high, and the need for specialized hardware in the case of computational storage [10, 16, 42]. Therefore, we want a standard OS-supported mechanism that can reduce the software overhead for fast storage devices.

To address this, we turn to the networking community, which has long had high-bandwidth devices. Linux's eBPF [6] provides an interface for applications to embed simple functions directly in the kernel. When used to intercept I/O, these functions can perform processing that is traditionally done in the application and can avoid having to copy data and incurring context switches when going back and forth between the kernel and user space. Linux eBPF is widely used for packet processing and filtering [5, 30], security [9] and tracing [29].

BPF's ubiquity in the kernel and its wide acceptance make it a natural scheme for application-specific kernel-side extensions in layers outside of the networking stack. BPF could be used to chain dependent I/Os, eliminating costly traversals of the kernel storage stack and transitions to/from userspace. For example, it could be used to traverse a disk-based data structure like a B-tree where one block references another. Embedding these BPF functions deep enough in the kernel has the potential to eliminate nearly all of the software overhead of issuing I/Os, like kernel-bypass, but, unlike kernel-bypass, it does not require polling and wasted CPU time.

To realize the performance gains of BPF we identify four substantial open research challenges which are unique to the storage use case. First, for ease of adoption, our architecture must support Linux with standard file systems and applications with minimal modifications to either. It should also be efficient and bypass as many software layers as feasible. Second, storage pushes BPF beyond current, simple packet processing uses. Packets are self-describing, so BPF can operate on them mostly in isolation. Traversing a structure on-disk is stateful and frequently requires consulting outside state. Storage BPF functions will need to understand applications' on-disk formats and access outside state in the application or kernel, for example, to synchronize concurrent accesses or to access in-memory structure metadata. Third, we need to ensure that BPF storage functions cannot violate file system security guarantees while still allowing sharing of disk capacity among applications. Storage blocks themselves typically do not record ownership or access control attributes, in contrast to network packets whose headers specify the flows to which the packets belong. Hence, we need an efficient scheme for enforcing access control that doesn't induce the full cost of the kernel's file system and block layers. Fourth, we need to enable concurrency. Applications support concurrent accesses via fine-grained synchronization (e.g. lock coupling [28]) to avoid read-write interference with high throughput; synchronization from BPF functions may be needed.

Our approach is inspired by the work on exokernel file system designs. User-defined kernel extensions were the cornerstone of XN for the Xok exokernel. It supported mutually distrusting "libfs"es via code downloaded from user processes to the kernel [32]. In XN, these untrusted deterministic functions were interpreted to give the kernel a user-defined understanding of file system metadata. There, the application and kernel design was a clean slate, and absolute flexibility in file system layout was the goal. While we require a similar mechanism to allow users to enable the kernel to understand their data layout, our goals are different: we want a design that allows applications to define custom BPF-compatible data structures and functions for traversing and filtering on-disk data, works with Linux's existing interfaces and file systems, and substantially prunes the amount of kernel code executed per-I/O to drive millions of IOPS.

Figure 1: Kernel's latency overhead with 512 B random reads. HDD is Seagate Exos X16, NAND is Intel Optane 750 TLC NAND, NVM-1 is first generation Intel Optane SSD (900P), and NVM-2 is second generation Intel Optane SSD (P5800X).
The past few years have seen new memory technologies emerge in SSDs attached to high bandwidth PCIe using NVMe. This has led to storage devices that now rival the performance of fast network devices [1], with a few microseconds of access latency and gigabytes per second of bandwidth [8, 13, 19, 20]. Hence, just as the kernel networking stack emerged as a CPU bottleneck for fast network cards [22, 34, 37, 43], the kernel storage stack is now becoming a bottleneck for these new devices. Figure 1 shows this; it breaks down the fraction of read I/O latency that can be attributed to the device hardware and the system software for increasingly fast storage devices. The results show the kernel's added software overhead was already measurable (10-15% latency overhead) with the first generation of fast NVMe devices (e.g. first generation Optane SSD or Samsung's Z-NAND); in the new generation of devices, software accounts for about half of the latency of each read I/O.

Layer             Avg. latency   Fraction
kernel crossing   351 ns         5.6%
read syscall      199 ns         3.2%
ext4              2006 ns        32.0%
bio               379 ns         6.0%
NVMe driver       113 ns         1.8%
storage device    3224 ns        51.4%
total             6.27 µs        100.0%
Table 1:
Average latency breakdown of a 512 B random read() syscall using Intel Optane SSD gen 2.
Figure 2:
Dispatch paths for the application and the two kernel hooks.

The Overhead Source.
To break down this software overhead, we measured the average latency of the different software layers when issuing a random 512 B read() system call with O_DIRECT on an Intel Optane SSD Gen 2 prototype (P5800X) on a 6-core i5-8500 3 GHz server with 16 GB of memory, Ubuntu 20.04, and Linux 5.8.0. We use this setup throughout the paper. We disable processor C-states and turbo boost and use the maximum performance governor. Table 1 shows that the layers that add the most latency are the file system (ext4 here), followed by the transition from user space into kernel space.
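For concreteness, the measured operation is essentially a plain O_DIRECT random read. The sketch below is illustrative rather than the exact benchmark harness: the file name, file size, and iteration count are assumptions, and the file is assumed to be preallocated on ext4. Every pread() here crosses the kernel boundary, runs through ext4, the bio layer, and the NVMe driver, which is exactly the path Table 1 breaks down.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK 512
#define NBLOCKS (1 << 20)   /* assumed file size: 512 MB of 512 B blocks */

int main(void) {
    /* O_DIRECT bypasses the page cache, so every read reaches the device. */
    int fd = open("data.db", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    /* O_DIRECT requires sector-aligned buffers and offsets. */
    if (posix_memalign(&buf, BLK, BLK)) { perror("posix_memalign"); return 1; }

    for (int i = 0; i < 1000; i++) {
        off_t off = (off_t)(rand() % NBLOCKS) * BLK;   /* random 512 B-aligned offset */
        if (pread(fd, buf, BLK, off) != BLK) { perror("pread"); return 1; }
    }

    free(buf);
    close(fd);
    return 0;
}
```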
Figure 3: Search throughput improvement on B-tree with varying depth when reissuing lookups from different kernel layers. (a) Lookups with read syscall, using the syscall dispatch layer hook (1-12 threads). (b) Lookups with read syscall, using the NVMe driver layer hook. (c) Single-threaded lookup latency with read syscall, dispatching from user space, from the syscall layer hook, and from the NVMe driver layer hook. (d) Single-threaded lookups with io_uring, using the NVMe driver hook (batch sizes 1-8).

Kernel-bypass allows applications to directly submit requests to devices, effectively eliminating all of these costs except the costs to post NVMe requests ("NVMe driver") and the device latency itself [22, 33, 44, 45]. However, eliminating all of these layers comes at a high cost. The kernel can only delegate whole devices to processes, and applications must implement their own file systems on raw devices [40, 41]. Even when they do, they still cannot safely share files or capacity between distrusting processes. Finally, lack of efficient application-level interrupt dispatching means these applications must resort to polling to be efficient at high load; this makes it impossible for them to efficiently share cores with other processes, resulting in wasteful busy polling when I/O utilization isn't high enough.

There have been recent efforts to streamline kernel I/O submission costs via a new system call called io_uring [7]. It provides a batched and asynchronous I/O submission/completion path that amortizes the costs of kernel boundary crossings and can avoid expensive scheduler interactions to block kernel threads. However, each submitted I/O still must pass through all of the kernel layers shown in Table 1. So, each I/O still incurs significant software overhead when accessing fast storage devices (we quantify this in §3).
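As a point of reference, the io_uring baseline looks roughly like the following sketch using the liburing helper library; the file name, batch size, and offsets are placeholders, and the file is assumed to exist and be at least a few kilobytes long. Batching amortizes the boundary crossing, but each completed read below still traversed every kernel layer in Table 1.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK   512
#define BATCH 8

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(64, &ring, 0) < 0) {
        fprintf(stderr, "io_uring_queue_init failed\n");
        return 1;
    }

    int fd = open("data.db", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *bufs[BATCH];
    for (int i = 0; i < BATCH; i++)
        if (posix_memalign(&bufs[i], BLK, BLK)) return 1;

    /* Queue BATCH independent reads, then submit them with one system call. */
    for (int i = 0; i < BATCH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], BLK, (off_t)i * BLK);
    }
    io_uring_submit(&ring);

    /* Reap all completions; each one still went through the file system,
     * the bio layer, and the NVMe driver inside the kernel. */
    for (int i = 0; i < BATCH; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) == 0)
            io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```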
With the rise of fast networks, Berkeley Packet Filter (BPF) has gained in popularity for efficient packet processing, since it eliminates the need to copy each packet into userspace; instead, applications can safely operate on packets in the kernel. Common networking use cases include filtering packets [4, 5, 21, 30], network tracing [2, 3, 29], load balancing [4, 12], packet steering [27] and network security checks [9]. It has also been used as a way to avoid multiple network crossings when accessing disaggregated storage [35]. Linux supports BPF via the eBPF extension since Linux 3.15 [18]. The user-defined functions can be executed either using an interpreter or a just-in-time (JIT) compiler.
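For readers unfamiliar with the programming model, the listing below is a minimal XDP program in the libbpf style. It merely passes every packet through, but it shows the shape of a user-supplied function that the kernel verifies and then runs at a hook point.

```c
/* xdp_pass.bpf.c: compiled with clang -target bpf, attached with libbpf or ip(8). */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_pass(struct xdp_md *ctx)
{
    /* A real filter would parse the packet between ctx->data and
     * ctx->data_end here and return XDP_DROP for unwanted traffic. */
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

The storage hooks discussed below would accept functions of the same flavor, attached to the I/O completion path rather than to a network device.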
BPF for Storage.
We envision similar use of BPF for storage by removing the need to traverse the kernel's storage stack and move data back and forth between the kernel and user space when issuing dependent storage requests. Many storage applications consist of many "auxiliary" I/O requests, such as index lookups. A key characteristic of these requests is that they occupy I/O bandwidth and CPU time to fetch data that is ultimately not returned to the user. For example, a search on a B-tree index is a series of pointer lookups that lead to the final I/O request for the user's data page. Each of these lookups makes a roundtrip from the application through the kernel's storage stack, only for the application to throw the data away after simple processing. Other, similar use cases include database iterators that scan tables sequentially until an attribute satisfies a condition [15] or graph databases that execute depth-first searches [24].
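To make this cost concrete, here is a minimal sketch of the conventional user-space traversal; the node layout, field names, and root offset are illustrative assumptions, not any particular engine's format. Each level issues a blocking read whose only purpose is to compute the offset of the next read.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PAGE 4096

/* Assumed on-disk node layout: one node per page, sorted keys, and for
 * interior nodes the file offsets of the children. */
struct node {
    uint32_t nkeys;
    uint32_t leaf;
    uint64_t keys[255];
    uint64_t children[256];   /* child page offsets, or values in a leaf */
};

/* Baseline traversal: every level is one read() round trip through the
 * syscall layer, file system, bio layer, and NVMe driver. */
uint64_t btree_lookup(int fd, uint64_t key)
{
    char page[PAGE] __attribute__((aligned(512)));
    uint64_t off = 0;                        /* root page assumed at offset 0 */
    for (;;) {
        pread(fd, page, PAGE, (off_t)off);   /* dependent I/O: needs prior result */
        struct node *n = (struct node *)page;
        uint32_t i = 0;
        while (i < n->nkeys && key >= n->keys[i])
            i++;                             /* "pointer lookup": pick the next page */
        if (n->leaf)
            return (i && n->keys[i - 1] == key) ? n->children[i - 1] : 0;
        off = n->children[i];                /* next page to fetch */
    }
}
```

With the kernel hooks described next, all but the first of these reads could instead be reissued from inside the kernel.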
The Benefits.
We design a benchmark that executes lookups on an on-disk B+-tree (which we call a B-tree for simplicity), a common data structure used to index databases [17, 31]. For simplicity, our experiments assume that the leaves of the index contain user data rather than pointers [36]. We assume each node of the B-tree sits on a separate disk page, which means for a B-tree of depth d, a lookup requires reading d pages from disk. The core operation of a B-tree lookup is parsing the current page to find the offset of the next disk page, and issuing a read for the next page, a "pointer lookup". Traditionally, a B-tree lookup requires d successive pointer lookups from userspace. To improve on the baseline, we reissue successive pointer lookups from one of two hooks in the kernel stack: the syscall dispatch layer (which mainly eliminates kernel boundary crossings) or the NVMe driver's interrupt handler on the completion path (which eliminates nearly all of the software layers from the resubmission). Figure 2 shows the dispatch paths for the two hooks along with the normal user space dispatch path. These two hooks are proxies for the eBPF hooks that we would ultimately use to offload user-defined functions.

Figures 3a and 3b show the throughput speedup of both hooks relative to the baseline application traversal, and Figure 3c shows the latency of both hooks while varying the depth of the B-tree. When lookups are reissued from the syscall dispatch layer, the maximum speedup is 1.25×. The improvement is modest because each lookup still incurs the file system and block layer overhead; the speedup comes exclusively from eliminating kernel boundary crossings. As storage devices approach 1 µs latencies, we expect greater speedups from this dispatch hook. On the other hand, reissuing from the NVMe driver makes subsequent I/O requests significantly less computationally expensive, by bypassing nearly the entire software stack. Doing so achieves speedups of up to 2.5×, and reduces latency by up to 49%. Relative throughput improvement actually goes down when adding more threads, because the baseline application also benefits from more threads until it reaches CPU saturation at 6 threads. Once the baseline hits CPU saturation, the computational savings due to reissuing at the driver become much more apparent. The throughput improvement from reissuing in the driver continues to scale with deeper trees, because each level of the tree compounds the number of requests that are issued cheaply.

What about io_uring?
The previous experiments use Linux's standard, synchronous read system call. Here, we repeat these experiments using the more efficient and batched io_uring submission path to drive B-tree lookups from a single thread. Like before, we reissue lookups from within the NVMe driver and plot the throughput improvement against an application that simply batches I/Os using unmodified io_uring calls. Figure 3d shows the throughput speedup due to reissuing from within the driver relative to the application baseline. As expected, increasing the batch size (the number of system calls batched in each io_uring call) increases the speedup, since a higher batch size increases the number of requests that can be reissued at the driver. For example, for a batch size of 1 only 1 request (per B-tree level) can be reissued inexpensively, whereas for a batch size of 8, each B-tree level saves on 8 concurrent requests. Therefore, placing the hooks close to the device benefits both standard, synchronous read calls and more efficient io_uring calls. With deep trees, BPF coupled with io_uring delivers over 2.5× higher throughput; even three dependent lookups give 1.3-1.5× speedups.

Our experiments have given us reason to be optimistic about BPF's potential to accelerate operations with fast storage devices; however, to realize these gains, I/O resubmissions must happen as early as possible, ideally within the kernel NVMe interrupt handler itself. This creates significant challenges in using BPF to accelerate storage lookups for a practical system such as a key-value store. We envision building a library that provides a higher-level interface than BPF and new BPF hooks in the Linux kernel as early in the storage I/O completion path as possible, similar to XDP [21]. This library would contain BPF functions to accelerate access and operations on popular data structures, such as B-trees and log-structured merge trees (LSM).

Within the kernel, these BPF functions would be triggered in the NVMe driver interrupt handler on each block I/O completion. By giving these functions access to raw buffers holding block data, they could extract file offsets from blocks fetched from storage and immediately reissue an I/O to those offsets; they could also filter, project, and aggregate block data by building up buffers that they later return to the application. By pushing application-defined structures into the kernel, these functions can traverse persistent data structures with limited application involvement. Unlike XN, where functions were tasked with implementing full systems, these storage BPF functions would mainly be used to define the layout of the storage blocks that make up application data structures.

We outline some of the key design considerations and challenges for our preliminary design, which we believe we can realize without substantial re-architecture of the Linux kernel.
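As a rough illustration of the envisioned library interface, an application might attach a traversal function to an index file once and then issue whole lookups with a single call. All of the names below are hypothetical; no such API exists in Linux or in any shipped library today.

```c
/* bpf_storage.h: hypothetical user-facing interface for the proposed library. */
#ifndef BPF_STORAGE_H
#define BPF_STORAGE_H

#include <stddef.h>
#include <stdint.h>

/* A compiled BPF traversal plus a description of the on-disk node layout. */
struct bpf_storage_prog;

/* Wraps the special ioctl: attaches the traversal to an open file descriptor
 * and pushes the file's extents down to the NVMe layer. */
int bpf_storage_attach(int fd, const struct bpf_storage_prog *prog);

/* One call, d dependent device I/Os, no intermediate user-kernel round trips:
 * the kernel-side function follows the index and returns only the final record. */
long bpf_btree_lookup(int fd, uint64_t key, void *record, size_t record_len);

#endif /* BPF_STORAGE_H */
```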
Installation & Execution.
To accelerate dependent accesses, our library installs a BPF function using a special ioctl. Once installed, the application I/Os issued using that file descriptor are "tagged"; submissions to the NVMe layer propagate this tag. The kernel I/O completion path, which is triggered in the NVMe device interrupt handler, checks for this tag. For each tagged submission/completion, our NVMe interrupt handler hook passes the read block buffer into the BPF function.

When triggered, the function can perform a few actions. For example, it can extract a file offset from the block; then, it can "recycle" the NVMe submission descriptor and I/O buffer by calling a helper function that retargets the descriptor to the new offset and reissues it to the NVMe device submission queue. Hence, one I/O completion can determine the next I/O that should be submitted with no extra allocations, CPU cost, or delay. This lets functions perform rapid traversals of structures without application-level involvement.

The function can also copy or aggregate data from the block buffer into its own buffers. This lets the function perform selection, projection, or aggregation to build results to return to the application. When the function completes, it can indicate which buffer should be returned to the application. For cases where the function started a new I/O and isn't ready to return results to the application yet (for example, if it hasn't found the right block yet), it can return no buffer, preventing the I/O completion from being raised to the application.
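A traversal function for this hook might look like the sketch below. The section name, context struct, and helpers are hypothetical stand-ins for the interface just described (none of them exist upstream), and a real function would also carry the search key and the bounds checks the verifier demands.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical context handed to the function for each tagged completion. */
struct storage_ctx {
    char  *block;        /* raw buffer of the completed block read */
    __u64  file_off;     /* file offset the block was read from */
};

/* Hypothetical helpers exposed by the hook: recycle the NVMe descriptor for a
 * new file offset, or finish the chain and hand this buffer to the application. */
extern long bpf_storage_resubmit(struct storage_ctx *ctx, __u64 next_off) __ksym;
extern long bpf_storage_complete(struct storage_ctx *ctx) __ksym;

struct node {            /* application-defined page layout (illustrative) */
    __u32 nkeys;
    __u32 leaf;
    __u64 next_off;      /* offset of the next page to visit */
};

SEC("storage/nvme_completion")       /* hypothetical hook point */
int btree_hop(struct storage_ctx *ctx)
{
    struct node *n = (struct node *)ctx->block;

    if (n->leaf)
        return bpf_storage_complete(ctx);          /* raise completion to the app */
    return bpf_storage_resubmit(ctx, n->next_off); /* recycle: no allocation, no crossing */
}

char LICENSE[] SEC("license") = "GPL";
```

The application would install such a function with the special ioctl on the index file's descriptor, after which its reads on that descriptor are tagged as described above.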
Translation & Security.
In Linux the NVMe driver doesn't have access to file system metadata. If an I/O completes for a block at offset o in a file, a BPF function might extract from that block the file offset of the next I/O to issue. However, that file offset is meaningless to the NVMe context, since it cannot tell which physical block the offset corresponds to without access to the file's metadata and extents. Blocks could embed physical block addresses to avoid the need to consult the extents, but without imposing limits on these addresses, BPF functions could access any block on the device. Hence, a key challenge is imbuing the NVMe layer with enough information to efficiently and safely map file offsets to the file's corresponding physical block offsets without restricting the file system's ability to remap blocks as it chooses.

For simplicity and security in our design, each function only uses the file offsets in the file to which the ioctl attached the function. This ensures functions cannot access data that does not belong to the file. To do this without slow file system layer calls and without constraining the file system's block allocation policy, we plan to only trigger this block recycling when the extents for a file do not change. We make the observation that many data center applications do not modify persistent structures on block storage in place. For example, once an LSM-tree writes SSTable files to disk, they are immutable and their extents are stable [26]. Similarly, index file extents remain nearly stable in on-disk B-tree implementations; in a 24 hour YCSB [25] (40% reads, 40% updates, 20% inserts, Zipfian 0.7) experiment on MariaDB running TokuDB [14], we found the index file's extents only changed every 159 seconds on average, with only 5 extent changes in 24 hours unmapping any blocks. Note that in these index implementations, each index is stored in a single file and does not span multiple files, which further helps simplify our design.

We exploit the relative stability of file extents via a soft-state cache of the extents at the NVMe layer. When the ioctl first installs the function on the file storing the data structure, its extents are propagated to the NVMe layer. If any block is unmapped from any of the file's extents, a new hook in the file system triggers an invalidation call to the NVMe layer. Ongoing recycled I/Os are then discarded, and an error is returned to the application layer, which must rerun the ioctl to reset the NVMe layer extents before it can reissue tagged I/Os. This is a heavy-handed but simple approach. It leaves the file system almost entirely decoupled from the NVMe layer, and it places no restrictions on the file system's block allocation policies. Of course, these invalidations need to be rare for the cache to be effective, but we expect this is true in most of the applications we target.
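The soft-state cache itself can be small. The sketch below uses plain C with made-up names (a kernel version would live in the NVMe driver and use its locking); it shows the two operations the cache needs: translate a file block to a device block, and invalidate everything when the file system unmaps any block of the file.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct extent {                 /* one contiguous file range, pushed down at ioctl time */
    uint64_t file_lba;          /* first logical block of the range within the file */
    uint64_t disk_lba;          /* corresponding physical block on the device */
    uint64_t len;               /* length in blocks */
};

struct extent_cache {
    struct extent *ext;
    size_t n;
    bool valid;                 /* cleared by the file-system unmap hook */
};

/* Translate a file block to a device block; -1 means the recycled I/O must be
 * failed back to the application, which reruns the ioctl to refresh the cache. */
int64_t extent_lookup(const struct extent_cache *c, uint64_t file_lba)
{
    if (!c->valid)
        return -1;
    for (size_t i = 0; i < c->n; i++)
        if (file_lba >= c->ext[i].file_lba &&
            file_lba <  c->ext[i].file_lba + c->ext[i].len)
            return (int64_t)(c->ext[i].disk_lba + (file_lba - c->ext[i].file_lba));
    return -1;
}

/* Called from the (new) file-system hook when any block of the file is unmapped. */
void extent_invalidate(struct extent_cache *c) { c->valid = false; }
```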
I/O Granularity Mismatches.

When the BIO layer "splits" an I/O, e.g. across two discontiguous extents, it will generate multiple NVMe operations that complete at different times. We expect these cases to be rare enough that we can perform that I/O as a normal BIO and return the buffer and completion to the application. There, it can run the BPF function itself and restart the I/O chain with the kernel starting at the next "hop". This avoids extra complexity in the kernel. Similarly, if the application needs to generate more than one I/O in response to a single I/O completion, we propagate the completion up to the BIO layer, which allocates and submits the multiple I/Os to the NVMe layer. This avoids returning to userspace.
Caching.
As the caching of indices is often managed by the application [14, 23, 26], we assume the BPF traversal will not interact with the buffer cache directly and that applications manage caching and synchronizing with traversals. Cache eviction and management is increasingly done at the granularity of application-meaningful objects (e.g. individual data records) instead of whole pages. Our scheme fits well into this model, where BPF functions can return specific objects to the application rather than pages, to which it can apply its own caching policies.
Concurrency and Fairness.
A write issued through the file system might only be reflected in the buffer cache and would not be visible to the BPF traversal. This could be addressed by locking, but managing application-level locks from within the NVMe driver could be expensive. Therefore, data structures that require fine-grained locking (e.g. lock coupling in B+-trees [28]) require careful design to support BPF traversal. To avoid read/write conflicts, we initially plan to target data structures that remain immutable (at least for a long period of time). Fortunately, many data structures have this property, including LSM SSTable files that remain immutable [26, 39], and on-disk B-trees that are not updated dynamically in place, but rather in a batch process [14]. In addition, due to the difficulty of acquiring locks, we plan initially to only support read-only BPF traversals.

BPF-issued requests do not go through the file system or block layer, so there is no easy place to enforce fairness or QoS guarantees among processes. However, the default block layer scheduler in Linux is the noop scheduler for NVMe devices, and the NVMe specification supports command arbitration at hardware queues if fairness is a requirement [11]. Another challenge is that the NVMe layer may reissue an infinite number of I/O requests. The eBPF verifier prevents loops with unknown bounds [38], but we would also need to prevent unbounded I/O loops at our NVMe hook. For fairness purposes and to prevent unbounded traversals, we plan to implement a per-process counter in the NVMe layer that tracks the number of chained submissions, and to set a bound on this counter. The counter's values can periodically be passed to the BIO layer to account for the number of requests.
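A minimal sketch of that resubmission bound follows; the names and the limit are illustrative, not an existing kernel interface, and the real accounting would live in the NVMe-layer hook and be reported back to the BIO layer periodically as described above.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_CHAINED_IOS 32          /* assumed per-chain bound */

struct chain_state {
    uint32_t resubmitted;           /* chained I/Os issued on behalf of this process */
};

/* Returns true if the BPF function may recycle the request for another hop;
 * otherwise the completion is failed up to the application, which also gives
 * the BIO layer a chance to account for the requests already consumed. */
bool chain_may_resubmit(struct chain_state *s)
{
    if (s->resubmitted >= MAX_CHAINED_IOS)
        return false;
    s->resubmitted++;
    return true;
}
```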
BPF has the potential to significantly speed up dependent lookups to fast storage devices. However, it creates several key challenges, arising due to the loss of context when operating deep in the kernel's storage stack. In this paper, we focused primarily on enabling the initial use case of index traversals. Notwithstanding, even for current fast NVMe devices (and more so for future ones), chaining a small number of requests using BPF provides significant gains. We envision that a BPF for storage library could help developers offload many other standard storage operations to the kernel, such as compaction, compression, deduplication and scans. We also believe that the interactions of BPF with the cache and scheduler policies create exciting future research opportunities.
Acknowledgments

We would like to thank Frank Hady and Andrew Ruffin for their generous support, and Adam Manzanares, Cyril Guyot, Qing Li and Filip Blagojevic for their guidance throughout the project and feedback on the paper.
References

[1] 200Gb/s ConnectX-6 Ethernet Single/Dual-Port Adapter IC | NVIDIA. https://www.mellanox.com/products/ethernet-adapter-ic/connectx-6-en-ic.
[2] bcc. https://github.com/iovisor/bcc.
[3] bpftrace. https://github.com/iovisor/bpftrace.
[4] Cilium. https://github.com/cilium/cilium.
[5] Cloudflare architecture and how BPF eats the world. https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/.
[6] eBPF. https://ebpf.io/.
[7] Efficient IO with io_uring. https://kernel.dk/io_uring.pdf.
[8] Intel Optane SSD DC P5800X Series. https://ark.intel.com/….html.
[9] MAC and Audit policy using eBPF. https://lkml.org/lkml/2020/3/28/479.
[10] NGD Systems Newport platform. https://www.ngdsystems.com/technology/computational-storage.
[11] NVMe base specification. https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4b-2020….pdf.
[12] Open-sourcing Katran, a scalable network load balancer. https://engineering.fb.com/2018/05/22/open-source/open-sourcing-katran-a-scalable-network-load-balancer/.
[13] Optimizing software for the next gen Intel Optane SSD P5800X. https://www.intel.com/….html?videoId=6215534787001.
[14] Percona TokuDB. https://www.percona.com/software/mysql-database/percona-tokudb.
[15] RocksDB iterator. https://github.com/facebook/rocksdb/wiki/Iterator.
[16] SmartSSD computational storage drive. https://www.xilinx.com/applications/data-center/computational-storage/smartssd.html.
[17] SQL Server index architecture and design guide. https://docs.microsoft.com/en-us/sql/relational-databases/sql-server-index-design-guide.
[18] A thorough introduction to eBPF. https://lwn.net/Articles/740157/.
[19] Toshiba Memory introduces XL-FLASH storage class memory solution. https://business.kioxia.com/en-us/news/2019/memory-20190805-1.html.
[20] Ultra-low latency with Samsung Z-NAND SSD. https://www.samsung.com/semiconductor/global.semi.static/Ultra-Low_Latency_with_Samsung_Z-NAND_SSD-0.pdf.
[21] XDP. https://www.iovisor.org/technology/xdp.
[22] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 49–65, Broomfield, CO, October 2014. USENIX Association.
[23] Badrish Chandramouli, Guna Prasaad, Donald Kossmann, Justin Levandoski, James Hunter, and Mike Barnett. FASTER: A concurrent key-value store with in-place updates. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, pages 275–290, New York, NY, USA, 2018. Association for Computing Machinery.
[24] J. Chao. Graph databases for beginners: Graph search algorithm basics. https://neo4j.com/blog/graph-search-algorithm-basics/.
[25] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 143–154, 2010.
[26] Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, Tony Savor, and Michael Strum. Optimizing space amplification in RocksDB. In CIDR, volume 3, page 3, 2017.
[27] Pekka Enberg, Ashwin Rao, and Sasu Tarkoma. Partition-aware packet steering using XDP and eBPF for improving application-level parallelism. In Proceedings of the 1st ACM CoNEXT Workshop on Emerging In-Network Computing Paradigms, ENCP '19, pages 27–33, New York, NY, USA, 2019. Association for Computing Machinery.
[28] Goetz Graefe. A survey of B-tree locking techniques. ACM Transactions on Database Systems, 35(3), July 2010.
[29] Brendan Gregg. BPF Performance Tools. Addison-Wesley Professional, 2019.
[30] Toke Høiland-Jørgensen, Jesper Dangaard Brouer, Daniel Borkmann, John Fastabend, Tom Herbert, David Ahern, and David Miller. The express data path: Fast programmable packet processing in the operating system kernel. In Proceedings of the 14th International Conference on Emerging Networking Experiments and Technologies, pages 54–66, 2018.
[31] Stratos Idreos, Niv Dayan, Wilson Qin, Mali Akmanalp, Sophie Hilgard, Andrew Ross, James Lennon, Varun Jain, Harshita Gupta, David Li, et al. Design continuums and the path toward self-designing key-value stores that know and learn. In CIDR, 2019.
[32] M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Hector M. Briceño, Russell Hunt, David Mazières, Thomas Pinckney, Robert Grimm, John Jannotti, and Kenneth Mackenzie. Application performance and flexibility on exokernel systems. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, SOSP '97, pages 52–65, New York, NY, USA, 1997. Association for Computing Machinery.
[33] Kostis Kaffes, Timothy Chong, Jack Tigar Humphries, Adam Belay, David Mazières, and Christos Kozyrakis. Shinjuku: Preemptive scheduling for µsecond-scale tail latency. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 345–360, 2019.
[34] Anuj Kalia, Michael Kaminsky, and David Andersen. Datacenter RPCs can be general and fast. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 1–16, Boston, MA, February 2019. USENIX Association.
[35] Kornilios Kourtis, Animesh Trivedi, and Nikolas Ioannou. Safe and efficient remote application code execution on disaggregated NVM storage with eBPF. arXiv preprint arXiv:2002.11528, 2020.
[36] Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Hariharan Gopalakrishnan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. WiscKey: Separating keys from values in SSD-conscious storage. ACM Transactions on Storage (TOS), 13(1):1–28, 2017.
[37] Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean Bauer, Carlo Contavalli, Mike Dalton, Nandita Dukkipati, William C. Evans, Steve Gribble, Nicholas Kidd, Roman Kononov, Gautam Kumar, Carl Mauer, Emily Musick, Lena Olson, Mike Ryan, Erik Rubow, Kevin Springborn, Paul Turner, Valas Valancius, Xi Wang, and Amin Vahdat. Snap: A microkernel approach to host networking. In ACM SIGOPS 27th Symposium on Operating Systems Principles, New York, NY, USA, 2019.
[38] Luke Nelson, Jacob Van Geffen, Emina Torlak, and Xi Wang. Specification and verification in the field: Applying formal methods to BPF just-in-time compilers in the Linux kernel. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 41–61. USENIX Association, November 2020.
[39] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.
[40] Weikang Qiao, Jieqiong Du, Zhenman Fang, Michael Lo, Mau-Chung Frank Chang, and Jason Cong. High-throughput lossless compression on tightly coupled CPU-FPGA platforms. In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 37–44. IEEE, 2018.
[41] Zhenyuan Ruan, Tong He, and Jason Cong. INSIDER: Designing in-storage computing system for emerging high-performance drive. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 379–394, 2019.
[42] Sudharsan Seshadri, Mark Gahagan, Sundaram Bhaskaran, Trevor Bunker, Arup De, Yanqin Jin, Yang Liu, and Steven Swanson. Willow: A user-programmable SSD. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 67–80, Broomfield, CO, October 2014. USENIX Association.
[43] Shin-Yeh Tsai and Yiying Zhang. LITE kernel RDMA support for datacenter applications. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 306–324, 2017.
[44] Ziye Yang, James R. Harris, Benjamin Walker, Daniel Verkamp, Changpeng Liu, Cunyin Chang, Gang Cao, Jonathan Stern, Vishal Verma, and Luse E. Paul. SPDK: A development kit to build high performance storage applications. In 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pages 154–161. IEEE, 2017.
[45] Irene Zhang, Jing Liu, Amanda Austin, Michael Lowell Roberts, and Anirudh Badam. I'm not dead yet! The role of the operating system in a kernel-bypass era. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS), 2019.