CleanQ: a lightweight, uniform, formally specified interface for intra-machine data transfer
Roni Haecki, Lukas Humbel, Reto Achermann, David Cock, Daniel Schwyn, Timothy Roscoe
Systems Group, Department of Computer Science, ETH Zurich
Abstract
We present CleanQ, a high-performance operating-system interface for descriptor-based data transfer with rigorous formal semantics, based on a simple, formally-verified notion of ownership transfer, with a fast reference implementation. CleanQ aims to replace the current proliferation of similar, but subtly diverse, and loosely specified, descriptor-based interfaces in OS kernels and device drivers. CleanQ has strict semantics that not only clarify both the implementation of the interface for different hardware devices and software use-cases, but also enable composition of modules as in more heavyweight frameworks like Unix streams.

We motivate CleanQ by showing that loose specifications derived from implementation lead to security and correctness bugs in production systems that a clean, formal, and easily-understandable abstraction helps eliminate. We further demonstrate by experiment that there is negligible performance cost for a clean design: we show overheads in the tens of cycles for operations, and comparable end-to-end performance to the highly-tuned Virtio and DPDK implementations on Linux.
1. Introduction
CleanQ is a uniform operating system interface for transferring bulk data, which abstracts from and unifies a wide variety of descriptor-based data transfer interfaces used by both software and hardware in a modern OS. Queues based on descriptor rings are pervasive in OS code for moving data between processes, hardware devices like network adaptors, kernel and user space, etc. Despite this, there is a wide range of different queue interfaces and implementations (even within a single OS). So-called standard interfaces to queues, where they exist, are typically specified informally using a reference implementation in C.

This leads to a serious problem, which we elaborate on in Section 2. Implementations cannot be reused (since the semantics are subtly different), and thus implementation bugs can recur (and do) when a new queue is built. The lack of clear semantics means that bugs also arise due to inappropriate use of a given descriptor queue, by a client programmer who may not understand its subtleties, and also makes it hard to compose code modules which operate on queues of data, as is possible with Unix Streams or other I/O frameworks. Without a formally sound description of a queue's behavior, it is impossible to reason about the correct behavior of the OS which uses them. Finally, a lack of uniformity is a lost opportunity to build and apply standard tools for debugging, validating, profiling, and monitoring such queues at runtime, making OS development more difficult and time-consuming.

CleanQ addresses these problems not by simply proposing yet another queue interface, but by starting from a provably sound formal specification (presented in Section 3) of how a descriptor queue should behave. This gives clear guarantees to clients of a queue (whether it be a device driver or an inter-process communication system), and also states clear obligations on the code that implements an end-point of the queue.

The specification is strict, and so the chance of subtle mismatches between the expectations of client and implementation is eliminated. It allows for full concurrency between actors, such as a driver process and a network card. It also subsumes the memory model in use: the client of a CleanQ queue does not need to be concerned about weak consistency or non-coherent memory in order to write correct, portable code to use it. However, CleanQ focuses purely on the dataplane interface, leaving flexibility to system designers in how such queues are instantiated and provisioned.

From the specification, we then proceed to a C interface which captures it. This interface is highly general. In Section 4 we describe it and demonstrate its generality with a number of implementations we have built behind it, for network cards, storage adaptors, and inter-process communication. Moreover, the CleanQ C interface composes: we also describe CleanQ modules that provide loopback, debugging and network stack functionality.

Finally, we evaluate the overhead of using our CleanQ implementation modules on Linux and a microkernel-based research operating system in Section 5, to show both performance and portability. We show that, despite the strict semantics and highly specified, uniform interface, CleanQ is cheap: it is comparable to Virtio and DPDK in operation latency and imposes less than 1% overhead for set/get operations using Memcached [17].
2. Background and Motivation
CleanQ is a formalization of descriptor rings. Rings of descriptors are a fairly pervasive technique for transferring data between end-points (software processes or threads, GPUs, address spaces, hardware I/O devices, virtual machines, etc.) in a modern OS. Each descriptor refers to a region of memory (usually elsewhere) plus some metadata, including which end of the communication "owns" the data (and metadata). The Linux kernel alone has at least 6 different descriptor queue implementations, not including hardware-specific I/O queues for network and storage devices.

Descriptor rings work well because they highly decouple sender and receiver: sending data between a user process and a high-performance network adaptor using Intel's DPDK [28], for example, doesn't require either side to touch payload data as part of the transfer, and in the common case requires no synchronization between sender and receiver. In such drivers, interrupts and hardware register access are only used for coarse-grained synchronization at low load levels; for the most part each side of the communication can proceed in parallel without explicit coordination.

However, while the technique of descriptor rings is almost universal, there is little consensus on what a given ring should look like. A great number of software interfaces, libraries and 'standards' have been proposed over time (e.g. [16, 19, 43, 1, 32, 22, 3]), all of which are variations or enhancements of the same basic theme. Moreover, every new high-performance I/O device adopts a different descriptor format for its I/O queues, including devices from the same vendor (e.g. [39, 24, 26, 27, 25, 4]).

This proliferation of implementations is often for good reasons: queue implementations (whether communicating between processes or between software and hardware devices) have different requirements in how they are constructed and set up, metadata that may need to be passed with each buffer, and additional, implementation-specific semantics associated with enqueue and dequeue operations. Examples are Virtio [43] for buffer transfer between host and virtual machines, SKBufs [35] in the Linux network stack, or mbufs in DPDK [28]. However, the proliferation of implementations combined with the difficulty of getting memory semantics right leads to a steady stream of serious bugs, ranging from performance problems to critical security vulnerabilities.

For example, the Virtio framework used in the QEMU emulator and the KVM virtual machine monitor has allowed a malicious guest to break security by inserting more requests than the size of the queue [13]. Changing the queue parameters in Virtio has caused the hosting QEMU process to crash [41]. Worse, these bugs have not only been appearing for a long time: they continue to appear [7, 8, 9, 40, 12, 10, 11, 13, 15, 14, 41, 6] in Linux and Android (and, we suspect, other system software). A new class of "double-fetch" bugs/vulnerabilities has recently appeared [46], whereby data is fetched twice but changed by another party in between. All these bugs ultimately boil down to production code making incorrect assumptions about when and how memory can be accessed safely by one side of a descriptor queue-based channel: when it is safe to reuse a buffer, when an endpoint can safely enqueue another buffer, etc.

We argue the main reason this problem is not going away is the prevalence of specification by implementation: the documentation, where it exists at all, is written in English prose and is not formally specified.
For example, consider the virtqueue mechanism in Virtio [43] for transferring data between a device driver and a virtual device. A virtqueue consists of (1) a descriptor table specifying which buffers a driver is using for its device, (2) an available ring containing descriptors offered to the device, and (3) a used ring containing buffers processed by the device and returned to the driver. While the specification makes it clear that buffers available to the programmer are on one of the two rings, the corner-cases (such as adding a descriptor twice to a ring) are undocumented. Programmers are ultimately advised to read the code. To make things worse, Virtio has two different queue interfaces, one for the host and one for the guest, having slightly different semantics (e.g. a memcpy on the host side).
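For concreteness, the following sketch gives the three virtqueue components as C structures in the split-ring layout. It is written from our reading of the Virtio specification rather than taken from any particular implementation, so field widths, flag values, and alignment rules should be checked against the specification itself.

    #include <stdint.h>

    struct vring_desc {              /* (1) descriptor table entry */
        uint64_t addr;               /* guest-physical address of the buffer */
        uint32_t len;                /* buffer length in bytes */
        uint16_t flags;              /* e.g. NEXT (chained) and WRITE (device writes) bits */
        uint16_t next;               /* index of the next descriptor in a chain */
    };

    struct vring_avail {             /* (2) available ring: driver offers descriptors */
        uint16_t flags;
        uint16_t idx;                /* where the driver will place the next entry */
        uint16_t ring[];             /* indices into the descriptor table */
    };

    struct vring_used_elem {
        uint32_t id;                 /* head of the descriptor chain that was used */
        uint32_t len;                /* bytes written by the device */
    };

    struct vring_used {              /* (3) used ring: device returns descriptors */
        uint16_t flags;
        uint16_t idx;
        struct vring_used_elem ring[];
    };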
A further consequence of specification-via-implementation is that correctness often depends on a particular memory model. The combination of program operations and fences/barriers required for correct operation is not at all obvious from the documentation, and is different for different processor architectures. For example, when a device driver enqueues a buffer, it first writes the buffer contents, then the descriptor, and finally updates the head pointer of the ring buffer to inform the device that there is a new descriptor. On a machine with a weak memory model, these operations can be reordered and the device notified before the descriptor is written to memory.

Consider, for example, invoking a Virtio [43] or fbuf [16] based queue between two cores. An Intel x86 machine implements Total Store Ordering [44], leading to a relatively simple implementation: if a thread performs two writes w1 and then w2, another thread that observes w2 will also observe w1. Implementing buffer transfer for x86-TSO or similar models, it suffices to ensure correct write ordering. However, machines with weak memory consistency like ARMv7, ARMv8, or IBM Power [18] allow considerable relaxation in the visibility order from a given core: any load, store, or atomic instruction can be extensively reordered around other loads and stores. A correct queue therefore requires tricky use of barrier and fence instructions.

Moreover, given the number of descriptor queue implementations in a typical OS, it is surprising and disappointing that generic functionality cannot be shared among implementations, nor can implementations compose efficiently into a pipeline, in the manner of more heavyweight data transfer frameworks like Streams in AT&T [42] and Plan 9 [38], or the protocol modules in the x-Kernel [23].

The duplication of code and continuing stream of new bugs in ad-hoc descriptor queues led us to develop CleanQ, a formally-specified data transfer model for descriptor queues with an associated C-language interface.
Figure 1: Ownership transfer

To the best of our knowledge, CleanQ is the first practical, formally-specified, general-purpose descriptor ring abstraction, and we show that the generality and strict semantics of CleanQ come with negligible performance penalty compared with poorly-specified (but well-implemented) subsystems in production use.

By decoupling rigorously defined data transfer semantics from implementation, CleanQ allows clients and implementations to be developed and tested separately with much greater assurance of correctness. CleanQ is a specification and not an implementation, thus leaving orthogonal issues like metadata, setup, and additional semantics to the implementation while adhering to a single common model for data transfer.

CleanQ is highly general: it can express a variety of hardware descriptor queues as well as communication channels between processes in an OS. We demonstrate this functionality by later implementing, among other things, a debugging module which is transparent to a CleanQ queue but applies rigorous online checking of its arguments.

CleanQ is not "yet another queue implementation", nor is it a bug-finding technique for existing implementations. Moreover, our goal is not to build a formally verified system (such as seL4 [31] or CertiKOS [21]), but rather a sound basis for reasoning about the system and its behavior, with clearly defined semantics. CleanQ is an example of how a useful subset of the benefits and guarantees of full-stack verification can be practically introduced into existing systems in a portable and incremental manner.

A concise, formally-sound model such as CleanQ is essential to the development and proof of formal systems software, but it is just as important for non-verified systems such as Linux. As long as the implementation of a CleanQ module adheres to the specification, we can guarantee properties proved on the formal model.
3. Model
The interfaces we consider (e.g. virtio) all transfer data in buffers (packets, VM pages, disk blocks, etc.) between processes (including software processes, device drivers, hardware devices, etc.). The copy itself is simple (for a zero-copy implementation, it is completely absent). The principal difficulty is the bookkeeping: When can a process safely read a buffer it has received? When must it stop writing before handing it off? When, exactly, is the buffer handed off?

We therefore take the concept of ownership as our primitive abstraction, and base our formal invariants on the following four properties that must hold if an entity can really be said to "own" a thing:

1. A thing has at most one owner.
2. If an entity owns a thing, it has exclusive use of it.
3. An entity knows whether it owns a thing or not.
4. Ownership can be transferred.

From the first property we infer the fundamental invariant of the model:

    O_A ∩ O_B = ∅

The set of things owned by A (O_A) is disjoint from the set of things owned by B (O_B), for any processes A and B. Note that a process is anything that might read or modify a buffer: a user-space process, a device driver, a hardware component (e.g. a network card).

The second property expresses the most important guarantees that the system must provide to processes (and that processes must, in turn, respect): First, if A owns a buffer, any changes to the buffer visible to A must be due to modifications A itself made since gaining ownership; Second, no other process (B) may rely on the contents of the buffer until A relinquishes ownership and B acquires it; Third, all changes caused by A (while it owned the buffer) must be visible to B immediately upon acquiring ownership. This guarantees isolation among processes, and provides clear requirements for any code needing to manage a weak or non-coherent memory system (by dictating barriers, flushes, etc.; see Section 3.4).

The third and fourth properties force us to elaborate the formal model: If atomic transfer were possible, we could stick with just the sets owned by A and B, and the bookkeeping problem would be straightforward. Generally, however, no atomic transfer of ownership is possible: most implementations (especially hardware) transfer buffers by means of a descriptor ring or similar. The relinquishing process enqueues a descriptor referring to the buffer to be transferred, which the acquiring process (eventually) dequeues.

Note that the transfer sets (Q_AB and Q_BA) at this point do not preserve ordering. We add the FIFO property by refining the model in section 3.2.

While a buffer is in the queue (descriptor ring), it cannot be said to belong to either A or B in a way compatible with our four properties. If the buffers queued from A to B (Q_AB) belong to A, A is free to modify them as it likes (property 2). But as soon as the descriptor is dequeued, B will assume it owns it (e.g. the NIC will start writing). As enqueue and dequeue are asynchronous, A has no way of knowing when to stop writing! Likewise, assigning ownership of Q_AB to B violates property 3: B gains ownership (and thus responsibility) without being informed (when A enqueues). The queues are therefore distinct from the ownership sets (and from each other):

    O_A ∩ Q_AB = O_B ∩ Q_AB = ∅
    O_A ∩ Q_BA = O_B ∩ Q_BA = ∅
    Q_AB ∩ Q_BA = ∅

This is the complete model, illustrated by Figure 1. Here we see the four sets describing the transfer of ownership between processes A and B, and the allowable transitions. Barring A.register() and B.register() (which add buffers to, and remove them from, bookkeeping), the four operations (A|B.enqueue(), A|B.dequeue()) transfer ownership of buffers clockwise: O_A → Q_AB → O_B → Q_BA → O_A.

One final invariant completes the model, and expresses that buffers are never lost, or invented out of thin air:

    O_A ∪ Q_AB ∪ O_B ∪ Q_BA = CONST
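As an illustration (ours, not part of the model's Isabelle/HOL sources), the four sets can be read as a per-buffer state machine in which every buffer is in exactly one state at any time, so the disjointness and conservation invariants hold by construction; the guards correspond to the clockwise transfer of ownership.

    #include <stdbool.h>

    enum owner { OWNED_A, QUEUED_AB, OWNED_B, QUEUED_BA };

    /* A.enqueue(X): legal only if A owns X; X moves into the transfer set Q_AB. */
    static bool a_enqueue(enum owner *x) {
        if (*x != OWNED_A) return false;   /* would violate exclusive ownership */
        *x = QUEUED_AB;
        return true;
    }

    /* B.dequeue(X): X moves from Q_AB into O_B; only now may B touch the buffer. */
    static bool b_dequeue(enum owner *x) {
        if (*x != QUEUED_AB) return false;
        *x = OWNED_B;
        return true;
    }

    /* B.enqueue(X) and A.dequeue(X) complete the cycle O_B -> Q_BA -> O_A. */
    static bool b_enqueue(enum owner *x) {
        if (*x != OWNED_B) return false;
        *x = QUEUED_BA;
        return true;
    }

    static bool a_dequeue(enum owner *x) {
        if (*x != QUEUED_BA) return false;
        *x = OWNED_A;
        return true;
    }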
Figure 2: Ownership in the i82599 ring buffer

Figure 2 illustrates the descriptor ring buffer of the Intel i82599 10GbE network controller [29, 27] used in Intel's popular X520 server network cards, and how it is interpreted in the CleanQ model. The ring itself (figure center) is a circular buffer with two pointers: head and tail. The descriptors from head up to (but not including) tail are those enqueued but not yet taken by the device, i.e. the set Q_AB (here A is the driver and B the NIC).

The driver enqueues X by writing at tail, then incrementing the pointer, atomically transferring X from O_A to Q_AB. The NIC dequeues by incrementing head, atomically moving a buffer from Q_AB to O_B (done up to head). The done pointer is only modified by hardware and points to the last (oldest) buffer that the NIC has dequeued but not yet processed. (There are two hardware modes: an explicit done pointer, or a done bit in the descriptor which defines an implicit done pointer.)

Only tail, head and done have hardware-dictated meaning. The NIC doesn't distinguish (and doesn't need to) between buffers that are enqueued back to the driver (Q_BA) and those already dequeued in software and ready for reuse (unshaded descriptors in Figure 2). The driver keeps track of which descriptors it has dequeued (and are safe for reuse) with the recl pointer. This points to the oldest descriptor in Q_BA (the last shaded). The i82599 processes buffers in order. recl divides the region between tail and done into the returned descriptor queue Q_BA (recl to done) and 'free' descriptors (tail to recl).

All four of the queue operations consist of atomically incrementing a pointer (as indicated in gray in Figure 2, together with the guards (→) against letting the head of a queue overtake its tail).

This shows that the CleanQ specification and its notion of ownership do, in fact, model the i82599 hardware queues: CleanQ closely corresponds to the design of real, high-performance hardware. The extremely simple implementation possible in this case also demonstrates that there is no inherent overhead to a well-specified formal interface, such as CleanQ.

O_A (the buffers owned by A) cannot be defined by the content of the descriptor ring. A might have, e.g., register-ed a pool of buffers, shared between multiple queues. Operations on O_A are thus defined abstractly:

    A.enqueue(X): O_A := O_A − {X}
    A.dequeue(Y): O_A := O_A ∪ {Y}

Any implementation of A.enqueue() must cause A to relinquish ownership of X, and that of A.dequeue() must cause A to take ownership of Y. This is not a property of the ring buffer, but rather a correctness requirement for software that uses the ring buffer: It tells the programmer exactly when they must relinquish ownership, and exactly when they may assume they have re-acquired it.
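The driver-side half of this interpretation is small enough to sketch directly from the guards annotated in Figure 2. The following is our illustration, with made-up names; it elides descriptor formatting, the device-register write that publishes tail, and the fences discussed below.

    #include <stdbool.h>
    #include <stdint.h>

    #define N 512                          /* ring size */
    struct e10k_desc { uint64_t addr; uint32_t len; uint32_t status; };

    static struct e10k_desc ring[N];
    static uint32_t tail, recl;            /* advanced only by the driver (A)       */
    static volatile uint32_t done_ptr;     /* advanced by the NIC (B); head is only */
                                           /* needed on the device side             */

    /* A.enqueue(X): write a descriptor at tail, then advance it.
       Guard from Figure 2: tail may not overtake recl. */
    static bool drv_enqueue(struct e10k_desc x) {
        if ((tail + 1) % N == recl) return false;   /* ring full            */
        ring[tail] = x;                             /* X leaves O_A ...     */
        tail = (tail + 1) % N;                      /* ... and enters Q_AB  */
        return true;
    }

    /* A.dequeue(Y): reclaim the oldest descriptor the NIC has completed.
       Guard from Figure 2: recl may not overtake done. */
    static bool drv_dequeue(struct e10k_desc *y) {
        if (recl == done_ptr) return false;         /* Q_BA is empty        */
        *y = ring[recl];                            /* Y re-enters O_A      */
        recl = (recl + 1) % N;
        return true;
    }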
This notion of a 'specification to be implemented' is a data refinement (as used e.g. in the seL4 proof [5, 47]), and is also how our i82599 interpretation is formally specified. Figure 3 depicts a stepwise refinement of A.enqueue(X) from the abstract set-based model described so far, via an intermediate model where queues become lists (establishing FIFO order), to the ring buffer model just described.

Figure 3: Refinement steps

Each layer is the ownership transfer ring (c.f. Figure 1) at a given refinement level. Double lines indicate elements linked by the state relation, e.g. the set Q_AB contains exactly the elements of the list L_AB, which is in turn the descriptors from done up to head in the ring. Red arrows highlight the refinement of A.enqueue(X), from set insertion (Q_AB ∪ {X}) to list append (L_AB + [X]) and finally pointer increment (head++).

The state relations, and refined datatypes and operations, are all formalised in Isabelle/HOL. Following the convention used in Formal Methods conferences, we do not include them here for space reasons, but all Theory sources will be published and are available on request.

CleanQ is fully concurrent, and mandates no locks. A and B may simultaneously enqueue and dequeue to their shared queues, as long as the invariants are preserved. For the i82599 this reflects that, for example, head is updated by the NIC obliviously to everything except that it does not overtake tail. The driver is free to enqueue at tail at precisely the moment that the NIC dequeues at head.

The strict postconditions of Figure 3 are not preserved by the actions of a concurrent process. For example, the strict postcondition Q_AB^new = Q_AB^old ∪ {X} for A.enqueue(X) is invalidated if B dequeues X.

In reasoning about A, we cannot rely on X being in Q_AB just because A has executed enqueue(X). We can, however, infer that X is in one of Q_AB, O_B or Q_BA: everywhere B might have put it without A doing anything (i.e. calling dequeue). Figure 4 summarizes the weakened postcondition for enqueue(X) that is preserved under interference by B.

Figure 4: Weak postconditions for concurrency

These weakened postconditions are a prerequisite for verifying the correctness of a particular implementation under full concurrency, using Owicki-Gries [36] logic as, for example, in the verification of the eChronos real-time operating system [2]. We have formalized these, also in Isabelle/HOL, including noninterference and refinement proofs for all abstract levels.

Knowing exactly when ownership is gained and lost is essential to knowing exactly which cache management operations and fences/barriers are needed, and when, in order to correctly provide the guarantees implied by our 'four properties of ownership'.
In particular, weak-memory-model architectures (such as ARM and Power [34]) and partially- or non-coherent systems (e.g. accelerators) may violate the exclusivity guarantees by reordering memory operations past the ownership transfer (by reordering or speculatively executing instructions, or by serving stale values from non-coherent caches).

Figure 5: Ownership implies fences

Consider Figure 5, depicting the transfer of the buffer X between sender A and receiver B, on a hypothetical very-weak-memory architecture (similar situations are or were observable on IBM Power and DEC Alpha systems). In the absence of fences (barriers), the only orderings guaranteed are those with a data dependency, marked with a solid arrow. The two dotted arrows between A's modifications to the buffer and its relinquishing ownership (incrementing tail) only indicate the intended ordering; these operations may occur in any order. In particular, the execution order indicated by the circled red numbers is consistent with the constraints (and can actually be observed).

Here we see A relinquish ownership (1), then B acquire it (2) and immediately write value y to X[1] (3). Only then are A's modifications scheduled: it reads the value in X[1] (y, at 4), and writes it into X[0] (5). Finally, B reads the value in X[0], and sees the value y that it itself wrote. B has communicated to itself by traveling into the 'past'!

A sufficient fix in this case is to add the read and write fences as indicated. A's read from X[0] is forced to commit before the increment to tail (as it also involves a read), and likewise the write to X[1].
The read and write fences in B are redundant in this example (as in fact is A's write fence, due to the data dependency). If not all instructions in A and B are known, however, all four barriers are necessary.

Where the ownership model helps here is that all four barriers can be inferred from the guarantees and responsibilities of ownership: A must ensure that any writes to X become visible to B before B learns that it owns X (i.e. before A's write to tail becomes visible to B). The last point at which A can ensure this (as B's dequeue is asynchronous) is when it enqueues X; hence the write fence before updating tail. Likewise, A must not rely on X after relinquishing ownership; hence the read fence before the enqueue. An equivalent argument implies the necessity of the fences on B's side (e.g. in the absence of a data dependency).

Furthermore, it should be possible to automatically place the required fences, for some combination of a (possibly-stronger) memory model (e.g. ARM or TSO) and known code in A and B. Such automatic inference in code of similar complexity was demonstrated by Liu et al. [33].
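One idiomatic way to obtain the orderings that the ownership argument demands is C11 release/acquire: the release store that publishes tail orders A's buffer writes before the transfer, and the acquire load on B's side orders the transfer before B's accesses. This is our sketch of the principle, not the CleanQ implementation; on x86-TSO the fences compile away, while on ARM or Power they emit the barriers discussed above.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    #define N 256
    static uint8_t buffers[N][2048];        /* payload buffers; in reality shared with B */
    static _Atomic uint32_t tail;           /* A publishes transfers here                */

    /* A side: finish writing the buffer, then make the transfer visible. */
    void a_enqueue(uint32_t slot, const void *payload, size_t len) {
        memcpy(buffers[slot], payload, len);            /* A's last writes to X */
        /* "Write fence before updating tail": release orders the payload
           writes before B can observe the new tail value.                 */
        atomic_store_explicit(&tail, (slot + 1) % N, memory_order_release);
    }

    /* B side: learn how far A has published; only then is it safe to read. */
    uint32_t b_dequeue_limit(void) {
        /* The acquire pairs with A's release: B's subsequent reads of the
           buffers cannot be satisfied from before the ownership transfer. */
        return atomic_load_explicit(&tail, memory_order_acquire);
    }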
4. Interface and Implementation
In this section we describe the C interface we derive from the formal model in the previous section, together with a set of implementations (termed "modules", following Unix Streams [42]) we have built and evaluated for inter-process communication and device drivers.

Figure 7 shows the software architecture. To show the applicability of CleanQ to different devices and other use-cases, we have implemented modules for an AHCI [30] storage host adapter, an Intel e1000 NIC [25], an Intel i82599 10Gb NIC [27], a Solarflare SFN5122F low-latency NIC [45], a network protocol stack implementing UDP/IP, a shared-memory inter-process queue, and a DPDK [28] module for the Intel i82599 10Gb NIC. We also implemented a debug module, which checks the interface contract at runtime. We describe the details of the network stack and debug modules in section 4.2.

We have applied CleanQ in Linux, DPDK, and a microkernel-based research OS, showing that it is deployable across multiple, complete existing systems. An additional, Rust-based implementation which exploits Rust's ownership-based type system is beyond the scope of this paper.
Figure 6 shows the C declarations for CleanQ. This interface is implemented by generic code which performs various integrity checks (such as region bounds for buffers) before calling corresponding module-specific methods in a vtable associated with the struct cleanq argument.

The enqueue and dequeue methods must adhere to their specification introduced in section 3. The additional notify, register, and deregister calls are described below.

We now describe the semantics of the CleanQ interface functions in detail. Interface calls that do not satisfy the required preconditions are bugs on the caller side and cause undefined behavior (though many are caught by the generic checking code). As in section 3, "process" denotes anything that changes buffers, for instance a software driver or the hardware of a network interface card.

Creation and destruction of queues is not part of the interface, since these processes are highly implementation specific and typically need module-specific parameters, such as the device registers of a network card, a shared memory buffer for the loopback/IPC queue, or another queue in the case of the debug queue. Creating a queue must include initializing the cleanq data structure, including the generic state and vtable.
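The generic handle can be pictured as follows. This is a sketch with field and type names of our own choosing (the placeholder typedefs stand in for the real CleanQ definitions), showing only the dispatch structure implied by the description above.

    #include <stddef.h>
    #include <stdint.h>

    typedef int err_t;              /* placeholder typedefs: the real ones */
    typedef uint32_t regionid_t;    /* live in the CleanQ headers          */

    struct cleanq;                  /* forward declaration */

    struct cleanq_ops {             /* module-specific methods (our names) */
        err_t (*enqueue)(struct cleanq *q, regionid_t rid, size_t offset, size_t length,
                         size_t valid_data, size_t valid_length, uint64_t flags);
        err_t (*dequeue)(struct cleanq *q, regionid_t *rid, size_t *offset, size_t *length,
                         size_t *valid_data, size_t *valid_length, uint64_t *flags);
        err_t (*do_register)(struct cleanq *q, void *mem, regionid_t *rid);  /* "register" is a C keyword */
        err_t (*deregister)(struct cleanq *q, regionid_t rid);
        err_t (*notify)(struct cleanq *q);
    };

    struct cleanq {
        struct cleanq_ops ops;      /* filled in when the module creates the queue */
        void *region_state;         /* generic bookkeeping used for bounds checks  */
    };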
Register takes a contiguous region of memory, previously owned by neither side of the queue, inserts it into the set of owned buffers, and returns an identifier for it to be used in subsequent enqueue and dequeue operations. Register is typically used in conjunction with buffer pools or slab allocators that allocate a large chunk of memory at once.

Register is of practical importance in cases where address-related state must be set up in advance, for example regions for RDMA-based transfers, or programming an IOMMU to make a region of memory accessible to both sides of the queue. In simple shared-memory cases it can be implemented as a null operation which returns the pointer address as the handle.

The mem argument is an OS-specific handle to a memory resource, e.g. a pointer to anonymous memory, a file handle to a mapped segment, or a capability to physical memory. The memory region has to be at least read-accessible from the calling process. At no time may registered regions overlap with each other.
Deregister removes a previously-registered region with the supplied id from the queue. Deregister can only succeed if the region has not already been deregistered, and all memory in the region is currently owned by the calling process.
Enqueue enqueues a buffer of a previously registered region for ownership transfer. Buffers are identified by a region id, an offset into this region, and a length. The buffer described by offset and length must lie within the registered region, and must be owned by the process (i.e. a buffer cannot be enqueued twice without dequeuing it beforehand). The operation can fail if the underlying queue has run out of space.

The valid payload is specified by a further offset and length within the buffer, allowing clients to leave space for headers and footers (metadata) added later.

As specified, a successful enqueue relinquishes ownership of the buffer and inserts it into the transfer set. Eventually the ownership of the buffer will be obtained by the peer process, but there is no guarantee when this happens.

A client must not alter a buffer once it has given up ownership, and doing so will result in undefined behavior. Since we know precisely when we yield the ownership and which memory region is described by the buffer, the implementation can and must guarantee that all changes to the buffer are observable (using memory fences) before the ownership is transferred.

The flags field allows additional metadata to be passed along orthogonally with the buffer, with the proviso that the formal semantics from section 3 are not altered in any way. For example, a DMA copy engine might require the client to distinguish source and destination buffers for a copy.
Dequeue removes a previously enqueued buffer from the queue and transfers ownership of the buffer to the calling process. As long as the process owns a buffer, the process can alter the contents of this buffer. Dequeue can be called any time but returns an error if there is nothing to dequeue.

Absent an error, a correct implementation must return a valid buffer, i.e. one that is within a previously registered region and is not yet owned by the calling process. As with enqueue, a subset of the buffer can be declared valid using valid_data and valid_length, to allow stripping of headers (for example, in the UDP queue example we show in section 4.2).

Metadata about the transfer can, as with enqueue, be returned in flags. Again, this is implementation defined but must be orthogonal to the memory ownership semantics. It can be used to signal corrupt packets from a network adapter, for example, or as part of a chaining protocol (section 4.4).

    err_t cleanq_register(struct cleanq *q, void *mem, regionid_t *rid);
    err_t cleanq_deregister(struct cleanq *q, void *mem, regionid_t rid);
    err_t cleanq_enqueue(struct cleanq *q, regionid_t rid, size_t offset, size_t length,
                         size_t valid_data, size_t valid_length, uint64_t flags);
    err_t cleanq_dequeue(struct cleanq *q, regionid_t *rid, size_t *offset, size_t *length,
                         size_t *valid_data, size_t *valid_length, uint64_t *flags);
    err_t cleanq_notify(struct cleanq *q);

Figure 6: The CleanQ library interface

Figure 7: Implemented modules of the CleanQ library: the generic interface and common code, with NetworkQ, AHCIQ, LoopbackQ, DebugQ, EthernetQ, IPQ, UDPQ, and DPDKQ modules behind it
Notify is an optional performance optimization mechanism: for example, a doorbell informing the process on the other side of the queue that there might (i.e. no guarantee) be buffers in the queue that are ready for processing. It has no formal semantics at all, and its use (or omission) must not affect the correctness of any implementation relative to the specification.
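Putting the calls of Figure 6 together, a minimal client loop looks roughly as follows. This is a usage sketch under assumptions of ours: queue creation is module-specific and not shown, CLEANQ_ERR_OK is a stand-in for whatever success code an implementation defines, and buf_size divides the registered region into equal buffers.

    /* Echo every buffer we receive back to the peer. */
    void echo_loop(struct cleanq *q, void *pool, size_t buf_size) {
        regionid_t rid;
        if (cleanq_register(q, pool, &rid) != CLEANQ_ERR_OK)   /* hypothetical success code */
            return;

        /* Offer the first buffer of the region to the peer. */
        cleanq_enqueue(q, rid, 0, buf_size, 0, buf_size, 0);

        for (;;) {
            regionid_t r;
            size_t off, len, vd, vl;
            uint64_t flags;

            if (cleanq_dequeue(q, &r, &off, &len, &vd, &vl, &flags) != CLEANQ_ERR_OK)
                continue;                          /* nothing to dequeue yet        */

            /* We now own the buffer and may read or modify its valid payload. */
            cleanq_enqueue(q, r, off, len, vd, vl, 0);   /* hand ownership back */
            cleanq_notify(q);                            /* optional doorbell   */
        }
    }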
We have implemented CleanQ modules which communicate with a variety of hardware devices using their native descriptor format and protocol, along with inter-process communication channels which pass descriptors in shared memory using a variant of FastForward [20]. Despite incorporating basic run-time checks, we show in Section 5 that these modules are comparable in performance with the "native" implementations they replace. However, CleanQ's formally specified interface has a further advantage: in contrast to ad-hoc, C-specified queues, CleanQ modules can compose in a pipeline or stack, analogous to System V streams. The "null" implementation (a module which sits in front of another CleanQ module but simply passes data through) imposes negligible overhead, and we have implemented a debug module which augments CleanQ's default bounds checks with more extensive bookkeeping to detect violations of the queue's contract by either client or downstream module. For example, it maintains an operation log for debugging purposes, and detects overlapping or duplicate enqueues, thereby preventing "double fetch" race condition vulnerabilities [46].
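The core of such a check is a small amount of per-buffer state. The sketch below is our illustration, with a flat array standing in for the debug module's real bookkeeping; it rejects an enqueue of any buffer the client does not currently own, which is exactly the duplicate-enqueue case.

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_BUFS 1024

    struct debug_state {
        bool client_owns[MAX_BUFS];   /* true: the client may legally enqueue this buffer */
    };

    /* Called by the debug module before forwarding an enqueue downstream. */
    static bool debug_check_enqueue(struct debug_state *d, uint32_t buf) {
        if (!d->client_owns[buf])
            return false;             /* duplicate enqueue or foreign buffer: contract violation */
        d->client_owns[buf] = false;  /* ownership has moved into the transfer set */
        return true;
    }

    /* Called after a successful dequeue has returned the buffer to the client. */
    static void debug_note_dequeue(struct debug_state *d, uint32_t buf) {
        d->client_owns[buf] = true;
    }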
We have built a full-duplex UDP protocol stack which sits atop a CleanQ module implementing a NIC's hardware queues, and which itself consists of two CleanQ modules: one for the UDP headers and one for IP and Ethernet headers. The result is a dataplane implementation similar to Arrakis [37]. The structure of the enqueue call is shown in Figure 8.

    struct udp_q *que = ...;
    struct cleanq *q = (struct cleanq *)que;
    cleanq_enqueue(q, ...) {
        // Some interface checks
        q.enq(q, ...) {
            // Build UDP header
            q->ip_q.enq(q->ip_q, ...) {
                // Build IP + Ethernet header
                ip_q->nic_q.enq(ip_q->nic_q, ...) {
                    // Build descriptor and inform hardware
                }
            }
        }
    }

Figure 8: Stacking queues to implement UDP

In order to implement different layers of the stack, we use valid_data and valid_length. When receiving a packet, each layer reads and interprets the header found at offset valid_data. To pass a packet up to the next higher layer, valid_data is incremented by the header size, such that the next higher layer will ignore the current layer's header.
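Concretely, the receive path of each layer only has to move the valid window past its own header before handing the buffer to the layer above; a sketch (with an assumed fixed 8-byte UDP header and no validation) looks like this.

    #include <stddef.h>

    #define UDP_HDR_LEN 8   /* fixed-size UDP header */

    /* Strip this layer's header: the next layer up sees only its own payload. */
    static void udp_rx_strip(size_t *valid_data, size_t *valid_length) {
        *valid_data   += UDP_HDR_LEN;
        *valid_length -= UDP_HDR_LEN;
    }

    /* On transmit the adjustment runs the other way: grow the valid window
       downwards by UDP_HDR_LEN and write the header into the reclaimed space. */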
Our experience building a number of CleanQ modules, and composing them, has so far been very positive. Implementation is generally straightforward, similar to a Virtio queue or an ad-hoc implementation, and establishes that the model is sufficiently general to cover all the use-cases we have encountered so far.

CleanQ's formal semantics make it very clear what obligations exist for a module programmer at every point in the code, and remove most of the uncertainty about what the code needs to guarantee and when. The use of stackable modules provides the expected benefits in composability, and we have made extensive use of the debug module for checking.

Compared with other queue implementations in systems like Linux, however, CleanQ is something of a radical simplification, and this might raise several concerns.

Firstly, we are paying the price of abstraction: the clearer interface requires indirect method calls for each module. Historically these have been viewed as expensive, but as we show in Section 5, modern processors have reduced this overhead to considerably less than the cost of, e.g., formatting hardware descriptors, and so this appears not to be an issue.

Secondly, each enqueue or dequeue operation acts on a single buffer: there is no batching. In practice, the cost of multiple enqueue/dequeue operations is sufficiently small in our implementations that this does not degrade performance significantly.

Finally, we do not directly support chaining of multiple, discontiguous buffers as with BSD mbufs or Linux sk_buffs. Instead, we chain buffers using a simple protocol above CleanQ's single-buffer enqueue/dequeue operations.
As with batching, the additional overhead is small for our use cases. Our argument, backed up by performance measurements, is that the simplicity and rigorous semantics of CleanQ outweigh the small overhead the design might incur.
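One possible chaining protocol, given here purely as an illustration (the flag value and the convention are ours, not those of the CleanQ modules), marks every fragment except the last with a 'more' bit in the implementation-defined flags word; the receiver reassembles a packet by dequeuing until it sees a buffer without the bit set.

    #define CLEANQ_FLAG_MORE (1ULL << 0)   /* hypothetical flag: another fragment follows */

    /* Enqueue a packet that spans several buffers of one registered region. */
    static void enqueue_chain(struct cleanq *q, regionid_t rid,
                              const size_t *off, const size_t *len, int n) {
        for (int i = 0; i < n; i++) {
            uint64_t flags = (i + 1 < n) ? CLEANQ_FLAG_MORE : 0;
            cleanq_enqueue(q, rid, off[i], len[i], 0, len[i], flags);
        }
    }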
5. Evaluation
To evaluate the performance of CleanQ we first benchmark the overhead of our implementation of the interface, then compare the equivalent operations of Virtio to our queues. Following this, we put the overhead into the perspective of a real application. We then further evaluate the mechanisms of stacking and the debug queue, and finish the performance benchmarks with a more complex example: an implementation of a UDP stack based on our queues as well as on DPDK. Finally we discuss the performance of CleanQ based on the previously presented benchmarks.

With these benchmarks we show that there is no significant performance loss when changing from existing systems to CleanQ, while we gain a clean, easier to use, well-defined interface with the ability to stack queues. Furthermore, in our implementation of the interface we added sanity checks on the buffers through the thin library layer that are not included in most systems.

All experiments were conducted on a two-socket Intel Xeon E5-2670 v2 (Ivy Bridge, 2.5 GHz) system with hyper-threading disabled. We used an i82599-based Intel X520 dual-port 10GbE card to evaluate the performance of the UDP queue. Unless indicated otherwise, all measurements are taken using the timestamp counter of the processor. We evaluated CleanQ on Linux (Ubuntu 18.04 LTS) and a microkernel-based research OS.
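For reference, per-operation cycle counts of this kind are obtained by bracketing the call with timestamp-counter reads along the following lines. This is a generic sketch of the technique (using the rdtscp intrinsic), not our benchmark harness itself; the measurement overhead is measured the same way and subtracted.

    #include <stdint.h>
    #include <x86intrin.h>

    static inline uint64_t cycles_now(void) {
        unsigned int aux;
        return __rdtscp(&aux);   /* TSC read ordered after preceding instructions */
    }

    /* usage:
         uint64_t t0 = cycles_now();
         cleanq_enqueue(q, rid, off, len, 0, len, 0);
         uint64_t t1 = cycles_now();
         // store the sample (t1 - t0)                     */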
This benchmark shows that the performance overhead of our C implementation, which provides some sanity checks as common code, is small in absolute terms for all four operations: enqueue, dequeue, register and deregister.

The benchmark setup is as follows: We configure CleanQ to use the loopback module, which resembles an in-memory ring buffer where enqueue writes the descriptor into memory, dequeue reads the descriptor contents from memory, and the corresponding pointers are updated accordingly. We measure at two points: at the calls to the interface and at the calls to the module (before the vtable invocation). We run the benchmark for 100,000 repetitions and account for our measuring instrumentation.
Figure 9: Overhead of C interface implementation

Figure 9 shows the median (and standard deviation) of each of the four operations. We observe that the cost of the thin library layer of our CleanQ implementation is on the order of tens of cycles for the enqueue, dequeue and deregister operations, whereas register requires an additional check that the memory region is actually owned by the caller, resulting in about 400 cycles of overhead for a system call and the required bookkeeping.

The results show that our implementation adds little overhead in exchange for a well-defined and clean interface based on a formal model. The overheads of the fast-path operations enqueue/dequeue are less than 30 cycles and require fewer cycles than the simple loopback module itself. We expect the register/deregister operations to be on the slow path, but nevertheless they only add a few hundred cycles at most for the bookkeeping operations.
In this benchmark we compare the operations of CleanQ (with our loopback module) that have an equivalent in Virtio (add/get vs. enqueue/dequeue) to show that the performance is comparable.

We compare CleanQ with Virtio (both on Linux) by measuring the calls to the application interface of CleanQ and Virtio's virtqueue implementation. To measure the performance of Virtio we adapted one of the Linux Virtio tests, adding measurement code and increasing the number of repetitions. The Virtio test uses virtqueue_add_inbuf() to add buffers to the queue and, after the host side has removed them, the buffers are reclaimed from the guest side by calling virtqueue_get_buf(). Note, the host side of the virtqueue is accessible through a different interface that requires a memcpy for adding data to the queue. For fairness we did not include the host side interface operations in this benchmark. The result of the benchmark is shown in Figure 10.

Figure 10: Performance of Virtio ring buffer (add/get) compared to CleanQ (enqueue/dequeue), both running on Linux

Enqueueing a descriptor to the Virtio virtqueue costs 56 cycles while enqueuing a buffer through our interface and then processing it in the module costs 72 cycles. Getting a descriptor from the virtqueue is more expensive at 100 cycles while dequeueing a buffer from CleanQ only costs 64 cycles.

Overall the performance is similar to Virtio's guest side, while CleanQ provides additional checks on the buffers and a cleaner and simpler interface that allows for more complex constructs by stacking queues on top of each other.
To put the C implementation overhead of CleanQ into perspective, we measure the total processing time of Memcached (v1.5.10) [17], including network stack and hashtable lookup, for set and get requests. This is a simple application context in which the CleanQ interface can be used.

We send small get/set requests (key + value < 16 bytes) over the network to our Memcached server. We profile incoming requests by measuring the network stack processing time and the duration of Memcached handling the get/set request. Note, the resolution of the software timestamps provided by the network stack is one microsecond (or 2500 cycles).

The results, in the form of a CDF plot of 100,000 set/get operations, are shown in Figure 11 for request handling in Memcached and Figure 12 for processing the packet in the network stack. The median time spent from the kernel to the userspace application on the receive path of a UDP packet is 3 microseconds, or around 7500 cycles. The median of both set/get is around 3450 cycles (1.3 µs). Combining these two measurements results in 10950 cycles (4.38 µs) spent on the application's path on which the CleanQ interface could realistically be used.

Comparing the time spent in the library on the fast path to the application's time spent in other code leaves the overhead of the library at < 1%. The small overhead of the library is dominated by the processing time of other parts of the code.

Figure 11: Memcached set/get processing time

Figure 12: Linux: kernel to userspace network stack performance on receive path

In this experiment we measure the scalability of the implementation's module stacking. We repeat the same experiment of section 5.1, but we now stack ten null modules on top of the loopback module. The null module mimics a no-op: it only invokes the same operation on the next module in the stack, and therefore all observed overhead originates from stacking itself. We measure, at different levels of the stack, the time it takes until the lower level completes the operation. Again, we conducted 100,000 runs for each operation.

Figure 13 shows the median execution time and standard deviation at three different points of measurement: i) the baseline (loopback) represents the lowest level of the stack, which includes only the loopback module (corresponding to section 5.1); ii) Null 1 represents the time taken when a single null module is stacked on top of the loopback module, where we observe a negligible overhead of less than 10 cycles for any of the operations; iii) Null 10 measures the full stack of ten null modules stacked on top of the loopback module, where the entire stack of ten modules results in about 100 cycles of overhead compared to the baseline.

Each additional module stacked on top corresponds to an additional indirect function call, which results in the overhead of less than 10 cycles for stacking a single module. Moreover, our results suggest that the overhead per stacked module stays constant when more modules are stacked. With this experiment we have shown that CleanQ's stacking functionality is efficient.

Figure 13: Overhead of stacking queues

In this experiment we measure the overhead of the debug module that performs additional checks of buffer ownership on every queue operation. We repeat the experiment of section 5.1 with the only difference being that we stack the debug module on top of the loopback module. Again we perform 100,000 repetitions and measure the total execution time of each module in the stack.
Figure 14: Overhead of debug queue
Figure 14 shows the median completion time for each operation including standard errors. We observe an overhead for the additional checks and stacking of the debug module of about 50-80 cycles for the enqueue and dequeue operations respectively, and a total duration of 120-130 cycles. The deregister operation only adds about 20 cycles of overhead. Register is the most expensive operation, adding 300 cycles.

The results show that even with tracking ownership, which requires looking up and updating internal data structures to reflect the change of ownership, we observe a total completion time for the two fast-path operations of less than 130 cycles. Deregister simply checks whether the region has been registered before and then removes it if all buffers are owned by the caller, resulting in low overhead. The register operation requires verifying the size and access rights, which requires a syscall in our implementation, resulting in a 2x increase which is, however, still a fraction of a microsecond.

Putting the overhead of the debug module into perspective, despite an increase of up to 2x relative to the loopback module, the additional 50-80 cycles are dwarfed by the 3450 cycles of processing time of our simple example application (Memcached), resulting in less than 2% overhead.
Figure 15: Performance of UDP queue
In this benchmark we show that we can implement a more complex construct based on our stacking mechanism to realize a high-performance, low-overhead UDP network stack similar to that in the Arrakis system [37]. This benchmark was implemented on the microkernel OS.

The benchmark consists of a UDP/IP echo server using CleanQ: a UDP module and an IP/Ethernet module, both stacked on top of the e10k module which drives the Intel X520 dual-port 10GbE card. The resulting queue is a stack of three modules. The network card has a distinct queue for transmit and receive, and for each of the two hardware queues we initialize a CleanQ stack. Note, the e10k module needs to convert from and to the descriptor format the network card understands. We generate 64-byte UDP packets and send them to the echo server with the CleanQ stack. We measure the processing time for sending and receiving the packet on the echo server.

Figure 15 shows the median and standard deviation based on 100,000 measured packets for the e10k module and the rest of the UDP network stack (Ethernet/IP/UDP) combined. We observe a much higher standard deviation compared to previous experiments.

On investigation, this latency distribution is heavily bimodal: the latency highly depends on whether the NIC hardware registers need to be updated. A write to the device register results in a 10x increase of the enqueue operation, but is only performed for a small fraction of enqueues. How frequently this expensive register write occurs depends on load and batching heuristics, but is inherent in the hardware and not a feature of CleanQ per se.

Enqueuing buffers into the transmit and receive queues of the NIC takes about 90 cycles, whereas processing in the UDP module takes about 100-250 cycles. When a buffer is written into the hardware descriptor queue, the descriptor needs to be formatted, which takes most of the 90 cycles. The work done by the UDP module includes growing the valid pointers of the buffer to make space for the network headers as well as formatting the UDP, IP and Ethernet headers, including the generation of checksums, which account for the majority of the 250 cycles of latency.

Dequeuing from the NIC queue is generally more expensive than enqueuing descriptors, resulting in 200-400 cycles of latency. Dequeuing on the UDP module takes 20 cycles on the transmit queue and 120 cycles on the receive path. Whenever the NIC completes a descriptor it updates the status bit of the descriptor, which in turn ends up in main memory or the last-level cache, introducing latency (90-150 cycles) when the descriptor is read by software. Moreover, the UDP module needs to verify the headers on the received descriptors.

To summarize, the performance characteristics of the hardware descriptor queues are generally dominated by the cost of formatting a descriptor and updating the register containing the receive and send pointers. Moreover, writing hardware registers to inform the card about the software state can be expensive (> 3000 cycles), and batching the updates can amortize the cost, which is the source of the large variance in this experiment. We conducted similar experiments on a SolarFlare SFN5122F with comparable results.

Putting this into perspective against the library overhead, formatting a descriptor costs twice as much as our interface abstraction.

Table 1: DPDK vs. CleanQ vs. CleanQ UDP results

                    Pkt/s      Standard deviation
    DPDK            705,000    3362
    CleanQ          713,100    7305
    CleanQ UDP      742,600    5066
In this benchmark we show that the CleanQ interface can also be integrated into existing systems without degrading performance.

We implemented a module based on the DPDK Intel ixgbe driver for the (i82599-based) Intel X520. We reuse the setup code and control plane of DPDK [28] but reimplement the dataplane as a CleanQ module. Additionally, we stacked the UDP stack on top of the NIC CleanQ module.

We compare the CleanQ module (and the UDP stack) to the original DPDK driver, measuring the packets per second of a single NIC queue using one core running a UDP echo server implemented using DPDK routines (send/recv burst). Furthermore, we also measure the performance of the CleanQ UDP stack directly using the CleanQ interface. To generate load we implemented a benchmark using standard sockets: on each core we run a thread which sends and receives the UDP packets in a closed loop with a configurable number of packets in flight. Table 1 shows packets per second (median of 10 runs) using minimum-sized packets.

DPDK alone achieves a throughput of 705,000 pkts/s, while DPDK using CleanQ achieved 713,100 pkts/s and the full UDP stack using CleanQ reached 742,600 pkts/s. Incorporating the CleanQ interface into DPDK resulted in similar performance, while implementing a UDP stack using only CleanQ increased performance by 5%.

This result shows that incorporating CleanQ into a high-performance networking framework (DPDK) does not degrade throughput and can even deliver better performance.
In the evaluation we demonstrated that CleanQ provides a clean and well-defined queue abstraction with a strong notion of ownership transfer while still being lightweight (compared to full-stack verification) and able to deliver performance comparable to Virtio's virtqueue in a direct comparison. We have shown a C implementation of CleanQ with low overhead in absolute numbers; when used in an application context, the resulting overhead is less than 1% of the receive and processing time of a Memcached get/set request. We further show that the overhead of CleanQ is dwarfed not only by application processing time but also by interfacing with the hardware itself, such as formatting descriptors and writing registers.

Furthermore, we have demonstrated CleanQ's flexibility in building efficient protocols and adding strict bounds checks by stacking of modules, a feature which is enabled by the well-defined abstraction of CleanQ. We have shown that stacking a module has a small overhead and is scalable to multiple modules. Finally, we have demonstrated this functionality by implementing a UDP network stack.
6. Conclusion
CleanQ demonstrates that it is possible to unify many of the proliferation of descriptor queues in common usage behind a single interface with strict formal semantics, at no performance cost on the dataplane.

The benefits begin with the elimination of many subtle bugs which appear (and reappear) whenever such an interface is defined informally. These include failing to catch all the usage cases, as well as different interpretations of the interface between client and implementation (which a strict formal specification prevents). The ownership-transfer model of CleanQ is sound under lock-free concurrency, and provides a framework for the verification of implementations.

The many CleanQ modules already implemented show that such an interface is widely applicable within an OS, and also permits the composition of modules to provide reuse of functionality.

Moreover, this generality, composability, and soundness come at a low cost: CleanQ matches Virtio for operation latency and imposes less than 1% overhead on end-to-end application workloads such as Memcached.

The Isabelle/HOL formalisations will be published separately as an extended technical report, and all CleanQ modules and support code will be released under an open source license.

References

[1] Zach Amsden, Daniel Arai, Daniel Hecht, Anne Holler, Pratap Subrahmanyam, and VMware Inc. VMI: An Interface for Paravirtualization, 2006.
[2] June Andronick, Corey Lewis, Daniel Matichuk, Carroll Morgan, and Christine Rizkallah. Proof of OS scheduling behavior in the presence of interrupt-induced concurrency. In Jasmin Christian Blanchette and Stephan Merz, editors, International Conference on Interactive Theorem Proving, pages 52–68, Nancy, France, August 2016. Springer.
[3] Matias Bjørling, Jens Axboe, David Nellans, and Philippe Bonnet. Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems. In Proceedings of the 6th International Systems and Storage Conference, SYSTOR '13, pages 22:1–22:10, Haifa, Israel, 2013. ACM.
[4] Broadcom. Standard Broadcom NetXtreme II Family Highly Integrated Media Access Controller, October 2008. Revision PG203-R.
[5] David Cock, Gerwin Klein, and Thomas Sewell. Secure Microkernels, State Monads and Scalable Refinement. In Proceedings of the 21st International Conference on Theorem Proving in Higher Order Logics, TPHOLs '08, pages 167–182, Berlin, Heidelberg, 2008. Springer-Verlag.
[6] MITRE Corporation. CVE-2017-9986. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-9986, November 2017.
[7] CVE Details. CVE-2004-0555, July 2004.
[8] CVE Details. CVE-2008-1317, March 2008.
[9] CVE Details. CVE-2010-1187, October 2010.
[10] CVE Details. CVE-2015-1805, August 2015.
[11] CVE Details. CVE-2015-5366, August 2015.
[12] CVE Details. CVE-2015-7613, September 2015.
[13] CVE Details. CVE-2016-5403, August 2016.
[14] CVE Details. CVE-2016-7618, August 2017.
[15] CVE Details. CVE-2017-14916, December 2017.
[16] Peter Druschel and Larry L. Peterson. Fbufs: A High-bandwidth Cross-domain Transfer Facility. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, SOSP '93, pages 189–202, Asheville, North Carolina, USA, 1993. ACM.
[17] Brad Fitzpatrick. Distributed Caching with Memcached. Linux Journal, 2004(124):5, August 2004.
[18] Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit Sarkar, Ali Sezgin, Luc Maranget, Will Deacon, and Peter Sewell. Modelling the ARMv8 Architecture, Operationally: Concurrency and ISA. In Proceedings of POPL: the 43rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2016.
[19] Andrew Friedley, Torsten Hoefler, Greg Bronevetsky, Andrew Lumsdaine, and Ching-Chen Ma. Ownership Passing: Efficient Distributed Memory Programming on Multi-core Systems. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 177–186, Shenzhen, China, 2013. ACM.
[20] John Giacomoni, Tipp Moseley, and Manish Vachharajani. FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, pages 43–52, New York, NY, USA, 2008. ACM.
[21] Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI '16, pages 653–669, Savannah, GA, USA, 2016. USENIX Association.
[22] Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. MegaPipe: A New Programming Interface for Scalable Network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 135–148, Hollywood, CA, USA, 2012. USENIX Association.
[23] Norman C. Hutchinson and Larry L. Peterson. The x-Kernel: An Architecture for Implementing Network Protocols. IEEE Transactions on Software Engineering, 17(1):64–76, January 1991.
[24] Intel Corporation. Intel 82571 1 GbE Controller: Datasheet.
[25] Intel Corporation. PCIe GbE Controllers Open Source Software Developer's Manual: 631xESB/632xESB, 82563EB/82564EB, 82571EB/82572EI & 82573E/82573V/82573L, 2009. Revision 2.3.
[26] Intel Corporation. Intel 82576 Gigabit Ethernet Controller Datasheet, December 2010. Revision 2.61.
[27] Intel Corporation. Intel 82599 10 GbE Controller Datasheet, December 2010. Revision 2.6.
[28] Intel Corporation. Data Plane Development Kit, August 2018.
[29] Intel Corporation. Intel Ethernet Converged Network Adapter X520, August 2018.
[30] Intel Corporation. Serial ATA Advanced Host Controller Interface (AHCI) 1.3.1, August 2018.
[31] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: Formal Verification of an OS Kernel. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 207–220, Big Sky, Montana, USA, 2009. ACM.
[32] Alexander Krizhanovsky. Linux Netlink Mmap: Bulk Data Transfer for Kernel Database. http://natsys-lab.blogspot.com/2015/03/linux-netlink-mmap-bulk-data-transfer.html, March 2015.
[33] Feng Liu, Nayden Nedev, Nedyalko Prisadnikov, Martin Vechev, and Eran Yahav. Dynamic Synthesis for Relaxed Memory Models. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '12, pages 429–440, Beijing, China, 2012. ACM.
[34] Luc Maranget, Susmit Sarkar, and Peter Sewell. A Tutorial Introduction to the ARM and POWER Relaxed Memory Models. 2012.
[35] David S. Miller. How skbuffs work. Online: http://vger.kernel.org/~davem/skb.html.
[36] Susan Owicki and David Gries. An Axiomatic Proof Technique for Parallel Programs I. Acta Informatica, 6(4):319–340, December 1976.
[37] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The Operating System is the Control Plane. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 1–16, Broomfield, CO, 2014. USENIX Association.
[38] David Leo Presotto. Multiprocessor Streams for Plan 9. In Proceedings of the United Kingdom UNIX User Group Summer Proceedings, pages 11–19, 1993.
[39] Realtek Semi-Conductor Co., Ltd. RTL8029AS: Realtek PCI Full-Duplex Ethernet Controller with built-in SRAM, January 1997.
[40] RedHat. CVE-2015-9016. https://access.redhat.com/security/cve/cve-2015-9016, August 2015.
[41] RedHat. CVE-2017-17381. https://access.redhat.com/security/cve/cve-2017-17381, November 2017.
[42] D. M. Ritchie. A Stream Input-Output System. AT&T Bell Laboratories Technical Journal, 63:1897–1910, October 1984.
[43] Rusty Russell. Virtio: Towards a De-facto Standard for Virtual I/O Devices. SIGOPS Operating Systems Review, 42(5):95–103, July 2008.
[44] Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors. Communications of the ACM, 53(7):89–97, July 2010.
[45] Solarflare Communications, Inc. Solarflare SFN5122F Dual-Port 10GbE Enterprise Server Adapter, 2010.
[46] Pengfei Wang, Jens Krinke, Kai Lu, Gen Li, and Steve Dodier-Lazaro. How Double-Fetch Situations Turn into Double-Fetch Vulnerabilities: A Study of Double Fetches in the Linux Kernel. In Proceedings of the 26th USENIX Conference on Security Symposium, SEC '17, pages 1–16, Berkeley, CA, USA, 2017. USENIX Association.
[47] S. Winwood, G. Klein, T. Sewell, J. Andronick, D. Cock, and M. Norrish. Mind the Gap. In S. Berghofer, T. Nipkow, C. Urban, and M. Wenzel, editors, Theorem Proving in Higher Order Logics (TPHOLs), volume 5674 of Lecture Notes in Computer Science, Berlin, Heidelberg, 2009. Springer-Verlag.