A Software-only Mechanism for Device Passthrough and Sharing
Piyus Kedia (Microsoft Research) and Sorav Bansal (IIT Delhi)
Abstract
Network processing elements in virtual machines, also known as Network Function Virtualization (NFV), often face CPU bottlenecks at the virtualization interface. Even highly optimized paravirtual device interfaces fall short of the throughput requirements of modern devices. Passthrough devices, together with SR-IOV support for multiple device virtual functions (VFs) and IOMMU support, mitigate this problem somewhat, by allowing a VM to directly control a device partition, bypassing the virtualization stack. However, device passthrough is riddled with its own problems of low consolidation ratios, relatively static resource partitioning, and difficulties in VM migration.

We present a paravirtual interface that securely exposes an I/O device directly to the guest OS running inside the VM, and yet allows that device to be securely shared among multiple VMs and the host. Compared to the best-known paravirtualization interfaces, our paravirtual interface supports up to 2x higher throughput, and is closer in performance to device passthrough. Unlike device passthrough, however, we do not require SR-IOV or IOMMU support, and we allow fine-grained dynamic resource allocation, significantly higher consolidation ratios, and seamless VM migration. Our security mechanism is based on a novel approach called dynamic binary opcode subtraction.
1. Introduction
Today's networks rely on "middleboxes" [14, 29] (also called network appliances) for a variety of network processing needs, such as overlay network switches, firewalls, load balancers, routers, etc. Network function virtualization (NFV) proposes shifting middlebox processing from specialized hardware appliances to software running on commodity hardware. Further, NFV benefits significantly from virtualization capabilities, which improve resource utilization [9] and allow sharing of hardware resources among multiple (and potentially untrusted) tenants [9, 20].

Virtualized network appliances, however, present a new set of performance challenges. For example, ClickOS [20] showed that the network stack on the Xen hypervisor [2] falls far short of the maximum achievable throughput on a 10Gbps NIC, using commodity x86 hardware. Because the original network stacks were not designed for such high-throughput workloads, inefficiencies lurk at multiple levels in current network stacks: (a) the guest-side and host-side user/kernel network API (e.g., the socket API) was not designed to handle such workloads; (b) the device virtualization interface between the guest and the host (e.g., virtio [19]) is often a performance bottleneck; and (c) the host-side network bridge/switch (e.g., Linux bridge, Open vSwitch [23]) is usually incapable of handling high rates of traffic.

The netmap framework [25] proposes an efficient user/kernel interface, best suited for high-throughput I/O. A netmap-capable user process maps shared-memory producer-consumer rings to communicate efficiently with the kernel. This allows a zero-copy interface between the user and the kernel, and also allows the user to perform I/O in batches, thus amortizing the cost of traversing the kernel's network stack over multiple packets. This results in high overall throughputs.
For example, netmap's pkt-gen [25] can saturate a 10Gbps link with a single 3GHz core using 64B packets (14.88 Mpps), while the socket-based network stack reaches only about a third of the maximum achievable throughput [25].

However, if netmap is used inside a VM, performance bottlenecks emerge at the device virtualization interface. Software-emulated devices exhibit significant CPU overheads related to faithful execution of the device state machine in software. Paravirtual interfaces for device virtualization are relatively faster, but even they prove inadequate in such high-throughput settings. Table 1a shows the throughput of running netmap's pkt-gen inside a virtual machine using the virtio-based paravirtual device [19]. The virtio interface involves shared-memory communication between the guest and the host through producer-consumer rings. The table shows results for three different implementations of the host-side networking stack, namely tap, tap-vhost, and netmap (more details in Section 5).

(a) Multiprocessor host:

                          tx (Kpps)          rx (Kpps)
                         60B    1500B       60B    1500B
    socket-baremetal      470     180        394     214
    socket-virtio-vhost   250     170        300     150
    netmap-baremetal    14810     820      13304     820
    netmap-virtio         236     193        306     268
    netmap-virtio-vhost   357     285        416     422
    netmap-virtio-netmap  154     154          -       -
    netmap-fastio
    netmap-fastio-no-rzc    -       -      12970     813

(b) Uniprocessor host:

                          tx (Kpps)          rx (Kpps)
                         60B    1500B       60B    1500B
    netmap-baremetal    14600     822      13620     820
    netmap-virtio         188     182         35      20
    netmap-virtio-vhost   331     256         68      41
    netmap-fastio       14632     815      13001     816
    netmap-fastio-no-rzc    -       -      13049     811

Table 1: Single-guest throughput on (a) multiprocessor and (b) uniprocessor hosts with a 10Gbps NIC. All numbers are in Kpps (thousand packets per second). To calculate throughput, use Mbps = Kpps * (pktsize + 20.2) * 8 / 1000. netmap-fastio-no-rzc refers to fastio without receive-side zero-copy.

Compared to bare-metal, the performance penalty of the virtio interface is significant. The penalty is even more pronounced in CPU-constrained settings.
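As a sanity check on the throughput formula in Table 1's caption (Mbps = Kpps * (pktsize + 20.2) * 8 / 1000), the conversion can be written as a one-line helper; the function name is ours:

```c
#include <assert.h>

/* Convert a packet rate in Kpps to line throughput in Mbps,
 * using the per-packet overhead constant (20.2 bytes) from
 * Table 1's caption. */
static double kpps_to_mbps(double kpps, double pktsize_bytes) {
    return kpps * (pktsize_bytes + 20.2) * 8.0 / 1000.0;
}
```

For example, the netmap-baremetal 60B transmit rate of 14810 Kpps corresponds to roughly 9500 Mbps, i.e., close to the 10Gbps line rate; the 1500B receive rate of 820 Kpps similarly corresponds to roughly 9970 Mbps.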
To show this, Table 1b reports experiments with uniprocessor hosts using the virtio interface. While guest-host communication involves cacheline transfers on multiprocessor hosts, uniprocessor hosts require expensive VM exits. Further, sharing the network device among multiple VMs incurs performance penalties at the host-side switch.

These CPU bottlenecks indicate that our I/O virtualization stacks are perhaps a bit too "deep". The cost of traversing the I/O virtualization stack (even with paravirtualization) is often more than the cost of actual network processing. This I/O virtualization cost stems primarily from the need to secure the host and the device from untrusted VMs, which forces us to use narrow interfaces between the VM and the host (such as producer-consumer rings), resulting in relatively deep I/O stacks.

In this paper, we show that security can be provided in alternate ways; we use this observation to make the I/O stack significantly thinner. We allow a VM to have direct visibility into the hardware device. The VM can read/write to the hardware device without host intervention. Yet, we ensure that an untrusted VM cannot harm the host and/or other guests.

One of the central ideas in this paper is dynamic binary opcode subtraction, or DBOS. DBOS enables the hypervisor to restrict VM behaviour; we use DBOS to implement the requisite security in the I/O virtualization stack. Table 1a shows results using our approach (labeled netmap-fastio) to paravirtualization. Using fastio, the I/O throughput achievable inside the VM is comparable to bare-metal performance, even on a single processor. We achieve this for an off-the-shelf guest OS (Linux), without assumptions about SR-IOV and IOMMU hardware support. Our interface supports fair allocation of device resources among untrusted VMs, and allows fast switching among them.
Compared to the netmap-based VALE switch [26], our software switch provides up to 10x higher throughput.

In summary, this paper makes two primary contributions:

• We present a fast device paravirtualization mechanism which exhibits close to bare-metal performance. Compared to conventional paravirtualization (e.g., virtio), our scheme provides up to 25x higher throughput. Compared to highly hand-optimized I/O virtualization stacks (e.g., ClickOS [20]), we achieve around 2x throughput improvement for small 60B packets, and around 30% throughput improvement for 1500B packets.

• We introduce a novel security mechanism, DBOS, which allows a hypervisor to restrict guest behaviour. We demonstrate an application of DBOS to improve I/O virtualization performance.

This paper is organized as follows. Section 2 provides relevant background on network processing and switching; we discuss our DBOS-based solution along with its security considerations in Section 3. The operation of our guest-side paravirtual driver is presented in Section 4; Section 5 presents our experiments and results; Section 6 discusses design considerations and alternate design choices; Section 7 discusses related work; and Section 8 concludes.
2. Background
Previous work [25] has shown that the traditional user/kernel interface for network processing can become a performance bottleneck for high-throughput workloads. The netmap API defines a user/kernel interface whereby a user process can pre-allocate a set of ring buffers to communicate with the kernel, and map this allocated memory into its address space. The interface between the user and the kernel is that of a shared ring containing buffer pointers. On the transmit path, the user writes the packet contents into pre-allocated buffers, sets up the buffer pointers in the shared ring, and increments the ring's head pointer. The kernel consumes buffers from the shared ring by incrementing the tail pointer, and sends them to the network port/device. Similarly, on the receive path, the user first sets up empty buffers in its ring and updates the head pointer; the kernel copies received packets to the ring buffers, and increments the ring's tail pointer (so the user can read the packet contents). Unlike the traditional socket API, netmap does not involve overheads related to memory allocation/deallocation, copying, and other book-keeping on the I/O datapath. The overall result is a much faster user/kernel API. As we see in Table 1a, the netmap API can saturate a 10Gbps NIC (14.8 Mpps), while the traditional socket API can only reach a fraction of the link capacity for 60B packets on bare-metal.

Running network processing elements inside a virtual machine requires device virtualization. Unlike full virtualization, paravirtualization allows flexibility in choosing the right VM/hypervisor interface for optimum performance. Today, device paravirtualization is typically done using a shared-memory producer-consumer ring between the guest and the host. For example, KVM/virtio sets up shared-memory rings between the guest and the Qemu process. On the transmit path, the guest writes to the shared ring and the host-side Qemu process reads from it.
Similarly, data flows in the reverse direction on the receive path. As an optimization, the host-side kernel may read/write the queue directly (virtio-vhost) instead of the Qemu process.

A host-side software switch (e.g., Linux tun/tap, Open vSwitch), typically implemented as a part of the kernel, multiplexes/demultiplexes packets for multiple guests. Current switches are unable to sustain high throughputs; e.g., the tap interface on Linux and Open vSwitch peak at around 300-600 Kpps [20, 26]. A recent software switch, VALE, takes advantage of the fact that its ports may be using the netmap API. Using this, VALE implements switching in batches, thus exposing opportunities for improving forwarding performance, and optimizing cache utilization through prefetching. Even with these optimizations, a VALE switch together with the virtio-vhost interface can handle only up to 3.5 Mpps [26].

Due to these limitations, a common approach today for high-performance networking with virtual machines is device passthrough, whereby a NIC is exposed directly to a VM. Device passthrough reduces scalability, as the device is exclusively controlled by the given virtual machine. This problem is mitigated by modern NICs supporting hardware multi-queuing, VMDq, and SR-IOV [15]. Further, device passthrough complicates live migration, and requires IOMMU support for security.

We show that it is possible to achieve performance equivalent to device passthrough without compromising scalability or live migration. We also do not require SR-IOV or IOMMU hardware support.
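The shared-ring transmit path described in this section can be sketched as follows. This is a simplified model, not the actual netmap data structures; the types and names are ours. Buffers are pre-allocated in the shared slots, the producer advances the head pointer to publish packets, and the consumer advances the tail pointer to drain them in one batch:

```c
#include <assert.h>
#include <string.h>

#define RING_SLOTS 8
#define BUF_SIZE   2048

/* Simplified model of a netmap-style shared ring: slots are
 * pre-allocated, the producer fills them and advances head,
 * and the consumer drains them and advances tail. No memory
 * allocation happens on the datapath. */
struct slot { unsigned len; char buf[BUF_SIZE]; };
struct ring {
    volatile unsigned head;   /* next slot the producer will fill  */
    volatile unsigned tail;   /* next slot the consumer will drain */
    struct slot slots[RING_SLOTS];
};

/* Producer side: write one packet into a pre-allocated slot and
 * publish it by advancing head. Returns 0 if the ring is full. */
static int ring_tx(struct ring *r, const char *pkt, unsigned len) {
    unsigned next = (r->head + 1) % RING_SLOTS;
    if (next == r->tail) return 0;            /* ring full */
    memcpy(r->slots[r->head].buf, pkt, len);
    r->slots[r->head].len = len;
    r->head = next;                           /* publish */
    return 1;
}

/* Consumer side: drain all pending packets in one batch,
 * amortizing the cost of a single kernel crossing over many
 * packets. Returns the number of packets drained. */
static unsigned ring_drain(struct ring *r) {
    unsigned n = 0;
    while (r->tail != r->head) {
        /* ...hand r->slots[r->tail] to the NIC here... */
        r->tail = (r->tail + 1) % RING_SLOTS;
        n++;
    }
    return n;
}
```

The batching in ring_drain is the key to the amortization argument above: one notification (or poll) can cover many published slots.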
3. Our Solution
Our paravirtual device driver, called the fastio driver, is different from current paravirtual drivers (e.g., virtio) in several ways. Firstly, we require that our device driver authenticate itself with the hypervisor at load time. Once authenticated, the hypervisor can trust the fastio driver with privileged device state. The authentication and trust are established in multiple steps:

1. Our fastio driver does not rely on any read/write static data, i.e., its binary object file contains only code pages and read-only static data. For all other memory needs, the driver must use "special" stack space or heap memory. As we see later, the hypervisor ensures that this stack and heap memory remain private to our driver, i.e., the rest of the guest kernel cannot read/write to it.

2. At load time, the fastio driver loads its code and read-only data pages into guest memory, and informs the hypervisor about its loaded addresses and size, using a hypercall.

3. Using page-protection bits in the x86 extended page tables (EPTs) [17], the hypervisor write-protects all code and data pages of the fastio driver. This allows the hypervisor to ensure that the guest OS cannot change the driver code/data after it has been loaded and authenticated.

4. The hypervisor verifies the contents of the driver's code pages. We perform this verification using cryptographic digital signatures. The fastio driver presents a certificate (signed by the hypervisor) which certifies the contents of the code pages; the hypervisor computes a SHA-1 hash of the code pages, and ensures that it matches the value presented in the signed certificate.

Next, we expose the privileged device state to the guest, by mapping its memory addresses (including MMIO addresses) into the guest's physical address space (GPA space). This involves creating mappings in the guest's EPT for the device data structures at "privileged GPA addresses", or PGPAs.
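The digest comparison in step 4 can be sketched as follows. This is only an illustration of the check's shape: a toy FNV-1a hash stands in for the SHA-1 digest, signature verification of the certificate itself is elided, and all names are ours:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy stand-in for the SHA-1 digest of the driver's code pages
 * (FNV-1a over the page contents). */
static uint32_t page_digest(const uint8_t *pages, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= pages[i];
        h *= 16777619u;
    }
    return h;
}

/* Hypervisor-side check: recompute the digest of the loaded code
 * pages and compare it against the value carried in the signed
 * certificate. Returns 1 if the pages are authentic. */
static int verify_driver(const uint8_t *pages, size_t len,
                         uint32_t certified_digest) {
    return page_digest(pages, len) == certified_digest;
}
```

Because the pages are write-protected via the EPT (step 3) before this check, a successful verification remains valid for the driver's lifetime.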
We need to ensure that the PGPAs are distinct from the actual guest physical memory addresses, to avoid conflicts. In our prototype implementation, we use addresses above 4GB for PGPAs; we assume that all our guests have less than 4GB of RAM. This is not a fundamental restriction, as PGPAs can be chosen to be arbitrarily large in a 64-bit address space. Mapping privileged state at PGPAs exposes the hypervisor to attacks from the untrusted guest. To prevent these attacks, we next ensure that the PGPAs cannot be mapped into the guest's virtual address space unless the hypervisor explicitly allows it. Essentially, we will ensure that the fastio driver is allowed to map the PGPAs into its address space, while the rest of the guest kernel is not.

If the hypervisor can successfully ensure that the PGPAs are not mapped in the guest's virtual address space (GVA space), it effectively ensures that the guest cannot access the PGPA addresses. We configure the virtualization hardware to ensure that a VM exit occurs on every change to the guest's virtual address space, i.e., an exit should occur on every execution of mov-to-cr3, mov-to-cr4, mov-to-cr0, and other privileged instructions that can potentially change the VA space. On VM exits resulting from these instructions, the hypervisor checks the new GVA space to ensure that no mappings to our PGPAs exist in it. For example, on the execution of a mov-to-cr3 instruction inside the guest, a VM exit occurs, and the hypervisor walks the page table to ensure that none of its entries point to the PGPAs.

Because x86 paging allows changes to the VA space through simple modifications to the page-table entries, we further mark all the page-table pages read-only on every cr3 load. Thus, while walking the guest's page table during the VM exit caused by the mov-to-cr3 instruction, the hypervisor marks all the GPAs corresponding to the page-table pages as read-only.
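The cr3-load check just described can be sketched as a walk that rejects any page table mapping a PGPA frame. This is a deliberately simplified single-level table (real x86 tables are multi-level), using the prototype's 4GB PGPA boundary; the names are ours:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PTE_PRESENT 0x1u
/* In the prototype, PGPAs are guest-physical addresses at or
 * above 4GB (guests are assumed to have less than 4GB of RAM). */
#define PGPA_BASE   (1ull << 32)

/* Walk a (simplified, single-level) guest page table on a cr3
 * load and refuse to install it if any present entry maps a
 * PGPA frame. Returns 1 if the table is safe, 0 otherwise. */
static int pt_is_safe(const uint64_t *pt, size_t nentries) {
    for (size_t i = 0; i < nentries; i++) {
        uint64_t frame = pt[i] & ~0xfffull;   /* strip flag bits */
        if ((pt[i] & PTE_PRESENT) && frame >= PGPA_BASE)
            return 0;   /* maps privileged device state: reject */
    }
    return 1;
}
```

In the real mechanism, the same walk also marks every visited page-table page read-only in the EPT, so later edits to the table trap to the hypervisor.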
Any future write access by the guest to its page-table pages causes an EPT violation, resulting in a VM exit. The hypervisor then emulates the exiting instruction before returning control back to the VM. Using this, we ensure that the guest can never directly access the PGPA space. This solution requires VM exits on every execution of the mov-to-cr3 instruction (among other such instructions that can change the address space) and on every write access to a page-table page within the guest. Now, the hypervisor needs to implement a mechanism which allows our fastio driver to access the PGPA space directly (but still disallows the rest of the guest kernel from doing so).

The hypervisor sets up a special page table, called the privileged page table (PPT), using pages in the read-only data section of the fastio driver. Notice that the hypervisor is free to write to the PPT, even though the PPT pages appear read-only to the guest kernel. The PPT contains mappings to the PGPA space, and the fastio driver can switch to it using the mov-to-cr3 instruction to access the device state directly. The PPT also contains mappings for the guest's kernel data structures, so that the driver can efficiently communicate between the device and the guest kernel. In our 32-bit implementation, the PPT maps the entire guest kernel (at addresses above 0xc0000000), and uses the "userspace addresses" (0x00000000-0xc0000000) to map the privileged device state.

The fastio driver's pseudo-code is shown in Figure 1. The driver first disables interrupts (line 2), then loads the address of the
PPT into the %eax register (line 3), and finally executes mov-to-cr3 to load the PPT (line 5). (We discuss the need to save/restore the stack pointer later.) The body of the fastio driver can now access the device and the guest's data structures to efficiently implement the transmit/receive logic. In particular, it transfers packets between the guest's netmap ring and the device's hardware ring. Finally, after the body of the fastio driver has executed, the mov-to-cr3 instruction is executed to restore the guest's original page table (line 13), before restoring the original interrupt flag and returning to the caller. (We discuss the need for lines 14-15 later.)

    fastio_driver() {
     1    save_flags()
     2    cli                      /* disable interrupts */
     3    mov $PPT_address, %eax
     ...
     5    mov %eax, %cr3           /* load the PPT */
     6    save stack pointer
     7    switch to the CPU-private stack
     ...  /* body: transfer packets between the guest's netmap
             ring and the device's hardware ring */
     11   restore stack pointer
     ...
     13   mov %eax, %cr3           /* restore the guest's page table */
     14   restore_flags()
     15   ret
    }
Figure 1: Fastio driver pseudo-code

Apart from security considerations, this solution has a serious performance concern. Every call to the fastio driver involves two executions of the mov-to-cr3 instruction, and each of them will cause a VM exit in our model. The performance overhead of these exits is likely to be more than the overhead of the virtio interface, which requires only one exit (or no exits on multiprocessor hosts). Ideally, we would like to ensure that the two mov-to-cr3 instructions executed by our fastio driver do not cause VM exits, while the other mov-to-cr3 instructions executed by the guest kernel do cause VM exits.

This differentiation is perhaps hard to achieve efficiently through runtime mechanisms alone. We use dynamic binary opcode subtraction (DBOS) to solve this problem. DBOS involves ensuring that an opcode is not present in the guest's executable address space. To implement DBOS, the hypervisor removes execute privileges from all guest pages, except the fastio driver's code pages. This is done at the time when the fastio driver is loaded and verified by the hypervisor. Subsequently, any instruction execution by the guest OS (outside of the fastio driver) causes a VM exit resulting from an EPT execute-privilege violation. At this point, the hypervisor scans the page containing the instruction being executed to ascertain the absence of the mov-to-cr3 opcode in that page. Checking the absence of the mov-to-cr3 opcode involves checking the absence of the following byte sequence in the page: 0x0f, 0x22, B. Here B is any byte that satisfies the equation (B & 0x38) == 0x18, i.e., bits 3, 4, and 5 of B should equal 0b011 (cr3). The prime observation is that if a byte sequence corresponding to an opcode is not even present in the executable address space of a guest, the guest can never execute that opcode. In the rest of the paper, we will also call this sequence of bytes representing the mov-to-cr3 opcode the 2.5-byte sequence (two exact bytes, and one byte with three bits set to a certain value).

We noticed that it is quite rare to find the 2.5-byte sequence in typical code. For example, the only code pages in the Linux kernel that contain this 2.5-byte sequence are the mov-to-cr3 instructions themselves. Notice that while ascertaining the absence of a byte sequence, we disregard any assumptions about instruction boundaries. We call this technique "dynamic binary opcode subtraction", because it subtracts an opcode dynamically from the execution stream of a guest.

The mov-to-cr3 opcode needs to be subtracted not just from the guest kernel's code stream, but also from the execution streams of user programs running within the guest. If we disregarded user programs, the guest could launch a simple attack, whereby it branches to a user code page with kernel privileges to execute the mov-to-cr3 instruction. Even in user code pages, the presence of the 2.5-byte sequence is extremely rare. In fact, in all our experiments involving execution of several programs shipped with stock Ubuntu Linux, including the SPEC Integer programs, we did not find the 2.5-byte sequence in any of them.

If the page containing the currently executing instruction (and causing the EPT violation) does not contain the 2.5-byte sequence, we restore execute privileges on it. To guard against attacks involving page boundaries, we also check the successor and predecessor pages of the currently executing page.
If either of them has already been marked executable, we ensure that the 2.5-byte sequence does not appear even when the two pages are considered together as one contiguous block. Similarly, each time an executable page is installed in a page table (through a page-table update, hence causing a VM exit), we perform the same check again to ensure that the executable page's new neighbours do not create an occurrence of the 2.5-byte sequence.

When we mark a page with execute privileges, we also take away write privileges from that page (again through manipulation of page-protection bits in the corresponding EPT entry). If that page is ever written to subsequently (to implement page-swapping, for example), an EPT violation occurs; in this case, the hypervisor removes execute privileges from that page, and re-instates write privileges on it. This mechanism can also handle dynamically generated and self-modifying code.

This scheme works well if none of the guest pages contain the 2.5-byte sequence. However, if a page (or a combination of two successive pages in the GVA space) does contain the 2.5-byte sequence, the hypervisor needs to handle it gracefully. A straw-man solution is to never grant execute privileges to any such page, causing an EPT violation each time an instruction on that page gets executed. This is likely to result in a huge slowdown, especially if instructions within the page execute a large number of times (e.g., in a loop). We instead use in-place binary patching to deal with such situations efficiently.

We patch any instruction containing the 2.5-byte sequence with the single-byte int3 opcode (0xcc), resulting in a VM exit on its execution. (We configure the virtualization hardware to generate a VM exit on the int3 instruction.) The hypervisor keeps track of all such patches, and emulates the original instruction on the patch-induced VM exits.
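The page scan at the heart of DBOS reduces to a short byte search for the pattern 0x0f, 0x22, B with (B & 0x38) == 0x18, deliberately ignoring instruction boundaries. A sketch (the function name is ours); to catch sequences straddling a page boundary, the caller can pass two adjacent pages as one contiguous buffer:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Return 1 if buf contains the mov-to-cr3 2.5-byte sequence:
 * 0x0f, 0x22, then any byte whose bits 3-5 encode cr3 (0b011).
 * Instruction boundaries are deliberately ignored, so a match
 * inside an immediate operand also counts. */
static int has_mov_to_cr3(const uint8_t *buf, size_t len) {
    for (size_t i = 0; i + 2 < len; i++) {
        if (buf[i] == 0x0f && buf[i + 1] == 0x22 &&
            (buf[i + 2] & 0x38) == 0x18)
            return 1;
    }
    return 0;
}
```

A page passing this check (alone and concatenated with its executable neighbours) can safely be granted execute privileges.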
The use of the int3 instruction does not preclude the guest from using it for its own purposes (e.g., debugging), as the hypervisor can easily differentiate between the guest's int3 and the hypervisor's patched int3. The only remaining complication is that of identifying the boundary of the instruction containing the 2.5-byte sequence. As we discussed earlier, in all our experiments, the only occurrence of the 2.5-byte sequence involved an actual mov-to-cr3 instruction within the guest. Hence, simply patching all 3 bytes in the sequence would achieve the desired result. Patching all 3 bytes also takes care of cases where the 2.5-byte sequence straddles two instructions, i.e., some of the bytes belong to one instruction, while others belong to the successor instruction.

In general, it is possible that the 2.5-byte sequence appears in the middle of an instruction. In this case, if we simply patch the sequence, the guest's instruction semantics can change silently. Here is an example of an instruction that could get incorrectly patched:

    assembly                  binary representation
    mov $0x18220f, %eax       0xb8, 0x0f, 0x22, 0x18, 0x00

This instruction contains the 2.5-byte sequence (inside its immediate operand), and if we patched it with the int3 opcode, we would replace it with:

    assembly                  binary representation
    mov $0xcccccc, %eax       0xb8, 0xcc, 0xcc, 0xcc, 0x00
Hence, this instruction would silently behave incorrectly (without causing a VM exit) if patched by us. To deal with this situation, we need to identify the boundary of the instruction containing the 2.5-byte sequence, and patch its first byte (along with patching the 2.5-byte sequence itself). In this example, we should also have replaced the first byte (0xb8) with the int3 opcode (0xcc). Doing so ensures that a VM exit occurs on the execution of this instruction, allowing the hypervisor to emulate it correctly.

Our current method for identifying instruction boundaries involves tracking the values of eip for each process/kernel (identified using the value of the cr3 register). On noticing a 2.5-byte sequence, we start disassembling instructions from a known predecessor eip in the current GVA space. Using this disassembly, we can identify the boundary of the instruction containing the 2.5-byte sequence. If a predecessor eip is not known yet, we simply leave the page without execute privileges, and emulate the instructions in the hypervisor if that page executes again, in the hope that we will eventually find some predecessor eip for the 2.5-byte sequence. If we still do not find a predecessor eip after a large number of EPT-induced VM exits on that page, we simply patch the 2.5-byte sequence with int3 opcodes. We had to do this for one page during our experiments on the Linux kernel, where we found a mov-to-cr3 instruction in the first page of the 32-bit Linux kernel image (v3.9.0). None of the instructions preceding this instruction (including this instruction itself) ever executed after loading the driver, and so we could not reliably determine the instruction boundary. In this particular case, patching the 2.5-byte sequence was anyway the correct thing to do.
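The full patch can be demonstrated mechanically: overwriting only the 2.5 bytes inside the immediate still leaves a valid mov (now loading a different constant), so the instruction's first byte must be turned into int3 as well. A sketch (the function name is ours):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Patch an instruction that contains the 2.5-byte sequence at
 * offset seq_off: overwrite the sequence AND the instruction's
 * first byte with int3 (0xcc), so executing the instruction
 * traps to the hypervisor instead of silently misbehaving. */
static void patch_insn(uint8_t *insn, size_t seq_off) {
    insn[0] = 0xcc;             /* first byte: force a trap      */
    insn[seq_off]     = 0xcc;   /* the 2.5-byte sequence ...     */
    insn[seq_off + 1] = 0xcc;
    insn[seq_off + 2] = 0xcc;   /* ... patched in full           */
}
```

On the resulting int3-induced VM exit, the hypervisor looks up the saved original bytes and emulates the instruction.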
Notice that mis-identification (or non-identification) of instruction boundaries does not pose a security risk; it can only cause misbehaviour within a guest. If a guest is aware of our approach, it can easily help the VMM by avoiding such situations.

Using DBOS, we configure the virtualization hardware not to cause VM exits on execution of the mov-to-cr3 instruction inside the guest; yet we ensure that the mov-to-cr3 instruction causes a VM exit inside the guest kernel, but does not cause an exit within our fastio driver. This gives us an exitless guest-fastio-guest I/O path, and yet provides fastio direct visibility into the hardware device state. This enables us to obtain I/O performance close to bare-metal within the guest. The fastio driver can now be used not just to access the hardware device, but also to access other privileged state of the host/other VMs, and to implement fast VM-to-VM and VM-to-host communication.

We next discuss the security threats to our scheme, and our solutions to them. Our security model relies on the inability of the guest to change its virtual address space without hypervisor intervention. We achieve this by ensuring that the guest's executable address space cannot contain the mov-to-cr3 opcode. We configure the virtualization hardware such that all guest instructions that can potentially modify the address space cause VM exits, with the exception of the mov-to-cr3 instruction. The mov-to-cr3 instruction does not cause a VM exit; instead we use DBOS and binary patching to ensure that the guest exits on mov-to-cr3 executions. The fastio driver's mov-to-cr3 instructions execute without VM exits. We further need to ensure that the fastio driver's code does not itself contain the 2.5-byte sequence, except at entry and exit (for the mov-to-cr3 instructions shown in Figure 1).
These two instructions at the fastio entry/exit points are the only occurrences of the 2.5-byte sequence in the guest's executable address space.

The body of the fastio driver is our "trusted computing base" (TCB), as it enjoys visibility into privileged state through the PPT. As discussed earlier, we ensure through EPT page-protection bits that the TCB cannot be modified. Further, we ensure that all execution within the TCB happens with interrupts disabled (notice the cli instruction at fastio entry), so that no other code can run while the PPT is operational. We also need to ensure that the fastio code is bug-free and cannot cause any exception, lest the guest's untrusted exception handler get called while the PPT is operational. We also ensure that all non-maskable interrupts cause VM exits, so that the hypervisor can interpose and disallow the guest from running while the PPT is operational. Further, the TCB uses a separate CPU-private stack to prevent another processor from interfering with our execution by causing race conditions on our stack state (lines 6, 7, and 11 in Figure 1). The CPU-private stack is also mapped in the PPT and the PGPA space, to protect it from the rest of the guest kernel. Finally, we ensure that the pages belonging to the fastio driver are mapped correctly in the guest page table (if mapped at all), and only in one place, at their designated virtual addresses.

We ensure that the only mov-to-cr3 opcodes in the guest's executable address space are the ones belonging to the fastio driver at entry and exit. The guest could potentially launch an attack by directly jumping to one of these two mov-to-cr3 instructions inside the fastio driver, to subvert our protection. We discuss three potential attack scenarios, and how we neutralize them.
Jumping to one of fastio's mov-to-cr3 instructions with an arbitrary value in the eax register
The guest could load a pointer to an arbitrary page-table structure into the eax register and jump directly to one of the mov-to-cr3 instructions in the fastio driver. This can cause an arbitrary page table to get loaded within the guest. This new page table could contain mappings to the PGPA pages; further, the new page table may map the virtual address corresponding to the EIP register to a totally different GPA, thus allowing execution of arbitrary code while the PGPA pages are mapped.

Fortunately, this attack can be prevented by using a feature of the x86 virtualization hardware called "CR3 target controls". Using this, it is possible to configure the hardware such that VM exits occur on each execution of the mov-to-cr3 instruction, except when the value being loaded into the cr3 register is one of the values specified in the CR3 target controls. The x86 hardware supports specification of up to four target controls; this capability was perhaps included for efficient shadow-page-table-based virtualization.

We use this capability in the following way: we configure the hardware to exit whenever the guest executes the mov-to-cr3 opcode, except when the value being loaded into the cr3 register is the PPT address. We do so by specifying the PPT address as one of the CR3 target controls. We use the other three CR3 target controls as a cache for recently seen values of the cr3 register at the time of the call to the fastio driver.

Assuming that a small number of processes (typically one) access the fastio driver, the cr3 values for these processes get cached in the CR3 target controls. Hence, VM exits are avoided at fastio entry (because the loaded CR3 value is the PPT address, which is one of the CR3 target controls) and at fastio exit (because the loaded CR3 value is typically one of the cached values in the CR3 target controls).
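The exit decision implied by the target controls can be modeled as a simple four-entry lookup. This is a sketch of the policy, not the hardware interface; the slot assignment (slot 0 pinned to the PPT, slots 1-3 as a cr3 cache) follows the text, and the names are ours:

```c
#include <assert.h>
#include <stdint.h>

#define CR3_TARGETS 4

/* Model of the CR3-target-control exit decision: a mov-to-cr3
 * causes a VM exit unless the value being loaded matches one of
 * the four target-control slots. Slot 0 is pinned to the PPT
 * address; slots 1-3 cache recently seen guest cr3 values. */
static int cr3_load_exits(uint64_t newcr3,
                          const uint64_t targets[CR3_TARGETS]) {
    for (int i = 0; i < CR3_TARGETS; i++)
        if (targets[i] == newcr3)
            return 0;   /* exitless cr3 load */
    return 1;           /* VM exit: hypervisor interposes */
}
```

Under this policy, fastio's entry (loading the PPT) and exit (reloading a cached guest cr3) are exitless, while an attacker's jump to fastio's mov-to-cr3 with an arbitrary page-table pointer in eax traps to the hypervisor.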
In all our experiments involving netmap, there was only one value of cr3 observed at fastio entry, which was easily cached using the CR3 target controls, resulting in an exitless guest-fastio-guest path.

Now, the original attack, whereby the guest jumps to fastio's mov-to-cr3 instruction with an arbitrary value in eax, is thwarted because the value in the eax register is not going to be one of the CR3 target controls. Hence, a VM exit would occur, and the hypervisor can interpose and prevent the attack.

Further, the guest may try to set its eax register to one of the cached target controls, and then branch to fastio's mov-to-cr3, thus avoiding the VM exit. The only security threat from this behaviour could occur if the guest uses the PPT's address in the eax register; all other cached target controls do not map the PGPA space. We next discuss these attacks in detail.

Jumping to fastio's entry mov-to-cr3 instruction with PPT's address in the eax register
The guest could jump directly to the first mov-to-cr3 instruction (at fastio driver entry) without disabling interrupts. Because the guest will only try to load the PPT into the cr3 register, a VM exit will not occur (as the PPT address is one of the CR3 target controls). This can potentially allow the guest to receive an external interrupt (as interrupts were not disabled), and execute its untrusted interrupt handler while the PPT is operational. To prevent this attack, we configure the virtualization hardware such that all external interrupts cause VM exits. Because our experiments involve high-throughput workloads, our fastio driver operates in polling mode, and this extra interrupt-handling cost does not cause performance degradation. In Section 6, we discuss a solution which allows the guest to directly handle hardware interrupts (through Intel's VT-d posted-interrupts hardware feature [16]) without VM exits, and yet ensures VM exits on any interrupts received while the PPT was operational.

Thus, we can effectively ensure that the hypervisor gets to interpose on any external interrupt received while the PPT was operational. The hypervisor identifies the attack by determining if the PPT was operational while the interrupt was received, and thwarts it (potentially by terminating the guest).
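The hypervisor-side decision on an external-interrupt VM exit reduces to a single check, sketched below. The PPT address is a hypothetical constant; the real implementation would read the guest's cr3 from the VMCS.

```python
PPT_CR3 = 0x10000000  # hypothetical PPT root address

def classify_interrupt_exit(guest_cr3, ppt_cr3=PPT_CR3):
    """Hypervisor logic on an external-interrupt VM exit: if the guest
    was running on the PPT when the interrupt arrived, it must have
    entered the TCB without the cli at fastio entry (e.g., by jumping
    straight to the entry mov-to-cr3), so the exit is treated as an
    attack; otherwise the interrupt is benign."""
    return "attack" if guest_cr3 == ppt_cr3 else "benign"
```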
Jumping to fastio's exit mov-to-cr3 instruction with PPT's address in the eax register
The guest could load the PPT's address in a register and branch to the second mov-to-cr3 instruction (at fastio driver exit). This can enable the guest to execute untrusted code inside fastio's caller while the PPT is operational. To thwart this attack, we further add a check at fastio's exit (after the mov-to-cr3 instruction has executed) to confirm that the current value of cr3 is not the PPT's address. If it is, the guest makes a hypercall to alert the hypervisor. Because interrupts are not allowed while this code is executing, and the PPT is guaranteed to map the fastio code pages correctly (and only in one place), this ensures that such an attack gets thwarted.

If the guest tries to jump anywhere into the middle of our fastio driver, it does not pose a security risk, as that cannot allow the guest to load the PPT or otherwise obtain the capability to map the PGPA pages in its GVA space, other than in the ways discussed above.
4. The fastio driver
The fastio driver is our privileged code (TCB); it can access both the guest's data structures and the hypervisor's privileged state (including device state). The driver acts as a bridge between the guest kernel and the device, and also allows sharing of the device among multiple VMs and the host.
To simplify the design, we use the same fastio driver both within the guests and at the host. The host's fastio driver performs a few extra operations related to initialization of the actual hardware device. At boot time, the host loads the fastio driver, thus initializing the hardware device and a PPT for its own use. The host's PPT maps the host kernel (at its original virtual addresses) and the device pages at a fixed virtual address, say PDVA (privileged device virtual address). After the host's PPT has been correctly initialized, the host can use the fastio driver to communicate to/from the device. For example, a transmit call from the host involves switching to the PPT (within the host) and transferring packets from the host kernel to the device. For 32-bit Linux, we use the top 1GB of the VA space for the host kernel (0xc0000000-0xffffffff), and use 516 contiguous pages starting at PDVA to map the device state. For our prototype, we use PDVA = 4MB (it must not overlap with the kernel's address space). Of the 516 device pages in the PPT, 512 pages are for device MMIO, and four pages are for storing the device rings. Because we use the netmap API for the device driver, whereby all rings and buffers are pre-allocated by the kernel, the fastio driver is also responsible for allocating the host's netmap rings/buffers at load time.

The initialization of the guest-side fastio driver is relatively simple. The fastio driver authenticates itself to the host, allocates its netmap rings/buffers, and ensures that the guest kernel can see an attached NIC. The guest's fastio driver also initializes its PPT, which contains mappings for the device state. These mappings are made at the same virtual addresses in all guest/host PPTs.

Besides device state, we also need mappings inside the PPT for the buffers and rings of the host and other guests.
This is required to allow sharing of the network device; one guest should be able to receive packets for other guests/host inside its fastio driver. We follow a convention whereby each guest (and the host) is allocated a fixed amount of VA space in the PPT, to map its network buffers. In our experiments, a 16MB space for each guest is enough to map all its network rings and buffers. These mappings start at a fixed address, called ppt_va_start, which must be distinct from the kernel addresses and the addresses used to map the device state. In our prototype implementation for 32-bit Linux, we use ppt_va_start = 16MB. The device state is mapped below ppt_va_start, and the kernel is mapped starting at address 0xc0000000 (3GB). Thus, the VA space between 16MB and 3GB is available for mapping guest buffers and rings. Assuming 16MB per guest, this allows us to map up to 191 guests inside the PPT at the same time.

Each guest's slab of VA space in the PPT is laid out in a fixed format: the first few pages are dedicated to the transmit and receive netmap rings, and all the other pages contain the guest's network buffers. A slot inside a netmap ring contains pointers to the network buffers. Along with the original buffer pointers (pointing to the guest kernel addresses of the buffers), we also keep the corresponding PPT pointers (pointing to the same buffers through their PPT addresses) with each netmap ring slot. The PPT pointers are initialized at fastio load time, as discussed below.

To distinguish between guests, we assign a fastio ID to each guest/host. The host always has ID 0, while guests are given IDs dynamically by the host, using bitmap-based allocation. The netmap buffers of guest ID n are accessible at virtual address ppt_va_start + n * 16MB in the PPT.
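The PPT address-space layout and per-guest slab addressing can be sketched as a few constants and checks. The sketch also includes the base-and-bounds check the fastio driver applies to guest-supplied PPT pointers; the constants follow the prototype values stated in the text (PDVA = 4MB, 516 device pages, ppt_va_start = 16MB, kernel at 3GB).

```python
MB = 1 << 20
PAGE = 4096

PDVA = 4 * MB              # device window: 516 pages (512 MMIO + 4 ring pages)
DEVICE_PAGES = 516
PPT_VA_START = 16 * MB     # first slab (host, fastio ID 0)
SLAB_SIZE = 16 * MB        # per-guest slab of netmap rings + buffers
KERNEL_BASE = 0xC0000000   # 32-bit Linux: top 1GB is the kernel

def slab_base(fastio_id):
    """PPT virtual address of the slab for a given fastio ID."""
    return PPT_VA_START + fastio_id * SLAB_SIZE

def ppt_pointer_ok(ptr, fastio_id):
    """Base-and-bounds check on a guest-supplied PPT pointer."""
    return slab_base(fastio_id) <= ptr < slab_base(fastio_id + 1)

# Device window fits below the first slab, and 191 slabs of 16MB fit
# between ppt_va_start and the kernel base (3GB).
assert PDVA + DEVICE_PAGES * PAGE <= PPT_VA_START
assert (KERNEL_BASE - PPT_VA_START) // SLAB_SIZE == 191
```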
Each guest slab (of 16MB) contains its netmap rings and netmap buffers, laid out sequentially and contiguously. Mappings between the slabs at ppt_va_start and the corresponding physical memory need to be created dynamically, as guests boot and shut down. The mapping for the host's slab remains static, and gets initialized at PPT creation time (along with the mappings for the device state). When the fastio driver initializes within the guest, it allocates its netmap rings and buffers, and makes a hypercall to inform the host about the GPA addresses of these rings/buffers. The host translates the GPA addresses to the corresponding host physical addresses (HPA), and creates the appropriate mappings in the host's PPT. Even if the buffers/rings were discontiguous in the guest kernel, they are laid out sequentially and contiguously in the PPT's address space. Further, the host walks through the netmap rings of the guest, and initializes the PPT pointers inside the netmap ring slots. (Recall that the PPT pointer for a guest buffer is the address at which that buffer was mapped in the PPT in the guest's slab.) These pointers can now be used by the host/other guests to access this guest's network buffers.

The PPT pointers stored in netmap ring slots are now visible to the untrusted guest, and the guest could potentially modify these pointers to try and confuse the fastio driver. To avoid this attack, our fastio driver performs a base-and-bounds check on the PPT pointer before dereferencing it: (ppt_pointer ≥ ppt_va_start + n * 16MB) and (ppt_pointer < ppt_va_start + (n + 1) * 16MB), where n is the guest ID. This base-and-bounds check ensures that the PPT pointer lies within the guest's PPT slab, and so the guest cannot cause the fastio driver to incorrectly read/write memory outside its own address space.

Because we have multiple PPTs (one for each guest, and one for the host), we may need to create these mappings inside all PPTs.
For the host, we create these mappings immediately (at the time of the fastio initialization hypercall). For the guests, we create these mappings on-demand as follows:

1. All guest PPTs are initialized such that they contain GVA → GPA mappings from ppt_va_start (in GVA space) to pgpa_start (in GPA space) for a contiguous block of size (MAX_GUESTS * 16MB). pgpa_start is a PGPA address, i.e., an address in the guest's physical address space which is distinct from the guest's physical memory. We use pgpa_start = 4GB + 16MB. (Recall that the PGPA addresses in our implementation start at 4GB and are used to map privileged device and hypervisor state.) Thus, the mappings for guest ID n are accessible at address pgpa_start + n * 16MB in the GPA space (for all guests). Initially (when there are no guests), these GPA addresses do not map to any host physical addresses (HPA), i.e., the present bit in the corresponding EPT entries is set to zero.

2. If a guest tries to access the netmap buffers of another guest (or its own netmap buffers through the PPT), an EPT violation may result if the corresponding GPA address (in the pgpa_start region) is currently unmapped. If this happens, the hypervisor handles the EPT violation by creating the required GPA → HPA mapping on demand, before resuming the guest.
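The on-demand EPT mapping of steps 1 and 2 can be sketched with a toy dict-based model. The HPA base addresses and the page-granular math are illustrative assumptions; the real handler walks EPT structures.

```python
MB = 1 << 20
GB = 1 << 30
PGPA_START = 4 * GB + 16 * MB   # GPA base of the guest-slab window
SLAB = 16 * MB

class EptSlabMapper:
    """Toy model of on-demand GPA->HPA mapping for guest slabs: the
    EPT-violation handler fills in a mapping the first time a slab
    page is touched, then the access is replayed."""
    def __init__(self):
        self.ept = {}           # gpa_page -> hpa_page
        self.hpa_of_slab = {}   # fastio ID -> HPA base, set at guest init

    def register_guest(self, fastio_id, hpa_base):
        self.hpa_of_slab[fastio_id] = hpa_base

    def access(self, gpa):
        page = gpa & ~0xFFF
        if page not in self.ept:  # EPT violation: map on demand
            fastio_id = (gpa - PGPA_START) // SLAB
            base = self.hpa_of_slab[fastio_id]
            self.ept[page] = base + ((gpa - PGPA_START) % SLAB & ~0xFFF)
        return self.ept[page] | (gpa & 0xFFF)
```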
The fastio driver is trusted and works cooperatively with all the other fastio drivers to transmit/receive network packets. Mutual exclusion among different fastio drivers is ensured through a shared lock, which is also mapped using the PPT in all guests/host.

On the transmit path, the driver flushes its own tx buffers into the hardware ring. On the receive path, the driver consumes packets from the hardware ring, determines the destination for each packet, and copies that packet into the rx buffers of the appropriate guest/host. (Recall that using the PPT, guest IDs, and PPT pointers, any guest/host can read the rings/buffers of any other guest.)

On the receive path, it is possible for a guest to behave "selfishly" by never calling into the fastio driver, while still enjoying the "service" of other guests (as this guest's packets would still be received by other guests). Such an attack can easily be prevented by maintaining statistics on the number of fastio calls by each guest, and selectively dropping guest packets on noticing selfish behaviour. Even if all guests are selfish, the host would still be able to receive packets for all of them and for itself.

While we use coarse-grained locking (one shared lock), finer-grained synchronization could potentially increase concurrency and perhaps performance. Further, in high-contention scenarios, it may be better to select a "leader" (or a few leaders) responsible for switching packets for all guests/host. Priorities could also be introduced during acquisition of the shared lock, if needed.

Finally, we try to eliminate packet copies between the hardware ring and the guest/host netmap rings. On the transmit path, zero-copy is straightforward. The fastio driver maintains a hash table, which contains a mapping between PPT pointers (for guests' network buffers) and their corresponding HPA addresses.
Insertions into this table happen at fastio driver initialization time (when PPT pointers are determined by the host during the hypercall). Further, the host pins these HPA addresses in memory, i.e., it ensures that its swapper will never swap these addresses to disk.

During transmit, the fastio driver performs a fast hash lookup to convert the PPT pointer of the network buffer (to be transmitted) to its HPA address, and writes the computed HPA address to the hardware ring, thus avoiding packet copies. Using this mechanism, the hardware transmit ring can contain HPA pointers to buffers belonging to different guests simultaneously.

On the receive path, zero-copy is similar. The guests/host provide PPT pointers to empty buffers through their netmap rings. The host converts the PPT pointers to their HPA addresses and writes them to the hardware ring. Again, the hardware receive ring can contain HPA pointers to buffers belonging to different guests simultaneously. The hardware stores the received packets into these buffers. There are two possibilities on the receive path: either the packet destination is the same as the owner of the buffer in which it was received (a match), or the packet destination is different from the buffer owner (a mismatch).

Matches are easy to handle: we simply enqueue the buffer pointer to the destination guest's netmap receive ring, so the guest can read the contents of the received packet. Mismatches cannot be handled in this way; we cannot enqueue the buffer pointer to the destination guest's receive ring, as the buffer does not belong to this guest. For mismatches, we allocate a fresh buffer from the destination guest, and copy the packet contents into it before enqueueing the newly allocated buffer into the destination guest's netmap receive ring.

There are two caveats to receive-side zero-copy. First, it is now possible for one guest to snoop on the packets of another guest (if the first guest's buffer is used to receive the second guest's packet).
This opens the possibility of one guest launching a man-in-the-middle attack on another guest. Because modern network stacks are usually resilient to man-in-the-middle attacks (through end-to-end encryption, for example), this is usually not an issue. Second, the total number of buffers available to implement the receive stack is now smaller: without zero-copy, there was an extra set of buffers available exclusively to the hardware ring; with zero-copy, the hardware ring relies on the buffers provided by the guest/host netmap rings. The extra set of buffers available to the hardware ring allows a "double-buffering" effect, whereby the hardware can receive packets into its own set of buffers while the user application reads already-received packets. To avoid this downside, we allocate netmap rings with twice the number of buffers as the hardware ring. This ensures that the double-buffering effect remains intact, even in zero-copy mode. Due to mismatches, the order of received buffers can be different from the order in which buffers were allocated in the ring. Because zero-copy receive has slightly weaker security guarantees and requires more memory, we show results both with and without rx-zerocopy.
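The match/mismatch dispatch on the receive path can be sketched as a toy model. Buffer pools, ring representation, and sizes here are illustrative assumptions; the real driver operates on netmap ring slots and PPT pointers.

```python
class RxDispatch:
    """Sketch of the receive-path match/mismatch logic. Each fastio ID
    has a netmap rx ring (a list) and a pool of free buffers."""
    def __init__(self, ids):
        self.rx_ring = {i: [] for i in ids}
        self.free_bufs = {i: [bytearray(2048) for _ in range(4)] for i in ids}

    def deliver(self, owner, dest, payload):
        if dest == owner:
            # Match: hand the hardware buffer straight to its owner
            # (zero-copy: no data movement).
            self.rx_ring[dest].append(payload)
            return "zero-copy"
        # Mismatch: take a fresh buffer from the destination and copy
        # the packet into it before enqueueing.
        buf = self.free_bufs[dest].pop()
        buf[:len(payload)] = payload
        self.rx_ring[dest].append(bytes(buf[:len(payload)]))
        return "copied"
```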
5. Experiments
We conducted our experiments on a 4-core machine with 16GB RAM and an Intel X540-T2 10Gbps network adapter. The machine was connected to a 10Gbps network switch. To send/receive packets at the other end, we used another machine with an identical 10Gbps network adapter, also connected to our 10Gbps network switch. We used 32-bit Linux 3.9.0 in PAE paging mode on our host and guest. Our guest was configured with two CPUs and 1GB memory. For experiments involving a single-core host, our guest was also given only a single CPU. For our netmap-based experiments, we used netmap's pkt-gen utility to send/receive packets. For experiments involving socket-based I/O, we used the netperf utility.

Our fastio driver is based on the Linux ixgbe driver, with the netmap ixgbe patch. The netmap patch defines two functions, "txsync" and "rxsync", which are used to transmit/receive packets between the user's netmap rings and the hardware device, respectively. The transfers between the netmap rings and the hardware device are performed in batches, where the batch size depends on the size of the hardware ring. We used a hardware ring with 512 slots (the default). To implement fastio, we modified the txsync and rxsync functions to switch to our PPT at entry, and switch back to the original page table at exit; i.e., netmap's txsync and rxsync functions form the body of our fastio driver (Figure 1). We configured the driver to use polling, to avoid extraneous scheduling issues during our experiments [21]. Our source code and raw data will be made publicly available.

Because the txsync and rxsync functions execute within our trusted fastio driver while the PPT is operational, they can access the rings of all guests/host, as well as the hardware ring, using the PPT addresses.
We implemented our zero-copy logic within the txsync and rxsync functions. The same logic (for txsync and rxsync) executes both within the host and within all guests, as discussed in Section 4.2.

Table 1a presents our results for network throughput for a single guest on a multi-core host. The rows whose labels begin with netmap- use the netmap API for user/kernel communication within the guest, while the rows whose labels begin with socket- use the socket API. We show results using the virtio interface without/with vhost support (labeled -virtio- and -virtio-host-, respectively). Further, we show results for two different types of host-side switches, namely tap and netmap. For the netmap "switch", the netmap ethernet interface was directly exposed to the Qemu process (without using the VALE switch that ships with netmap). The performance with the VALE switch is inferior to the performance without it, as it adds extra computation on the switching path.

Without the netmap API within the guest, the throughput is heavily CPU-bottlenecked at the guest's user/kernel interface. The netmap API on the baremetal host is able to saturate the 10Gbps NIC even with 60B packets (14.8 Mpps). However, using netmap with virtio incurs a large performance penalty (2.3 Kpps for 60B packets). virtio-vhost-tap improves the throughput marginally, while virtio-netmap does not perform any better. We were unable to set up virtio-netmap to receive packets; the transmit-side throughput in this configuration is only 154 Kpps, and we do not expect the receive throughput to be significantly higher.

These CPU bottlenecks have also been previously reported by ClickOS [20]. The ClickOS project addresses this problem by overhauling the Xen hypervisor's I/O virtualization subsystem. Just like netmap, the ClickOS optimizations involve memory preallocation, batching, and fast switching, albeit at the VM/hypervisor interface. Even with all these optimizations, ClickOS peaks at around 11 Mpps while transmitting 60B packets using a 512-slot device ring.
On the receive side, ClickOS peaks at around 6.2 Mpps for 60B packets using a 512-slot device ring. (These figures have been taken from [20].)

The row labeled -fastio shows the throughputs achieved by our solution. Expectedly, our achieved throughputs are very close to the throughputs achieved on bare metal. Compared to ClickOS, our transmit throughput is around 33% better, and our receive throughput is around 115% better, for the same hardware configuration. The row labeled -fastio-no-rzc shows throughput without our zero-copy optimization on the receive side; because the cost of packet copies is not significant on the I/O path, the throughputs are largely similar to -fastio.

Table 1b presents the throughput results for a uniprocessor host and guest. Our solution remains unaffected by the scarcity of CPUs, while virtio and virtio-vhost observe (sometimes significant) performance penalties.

Number of tx/rx   1      2      3      4
tx-60B            14704  14753  14776  14860
tx-1500B          815    820    820    820
rx-60B            13292  11712  9800   8311
rx-1500B          816    820    820    820

Table 2: Transmit and receive performance for multiple VMs on a 10Gbps NIC (Kpps).

Table 2 shows the total throughput with multiple transmitters and receivers for 60B/1500B packets. For multiple tx/rx agents (> 1), one of the agents is the host and the rest are VMs. We show results for up to four tx/rx agents, as our test machine had four CPU cores. The transmit-side throughput remains largely unaffected with an increasing number of transmitters; moreover, the bandwidth is largely fairly allocated among the transmitters. On the receive side, we notice throughput degradation for small packets with an increasing number of agents; this degradation is largely due to packets dropped while trying to enqueue them to the receive-side rings of other agents. For a single receiver, packet drops cannot happen, as the receiving agent simply returns to its userspace (which consumes the packet) on observing a full ring.
However, if one agent tries to enqueue a packet to a full ring of another receiver, the packet gets dropped (wasted work). The probability of packet drops increases with an increasing number of receivers. The probability is smaller for larger packets, as the CPU has to do less work; thus we do not see the effect of dropped packets for 1500B-sized transfers.

Next, we discuss the maximum achievable throughput of our fastio packet switch, assuming both the transmitter and receiver are running as VMs (without NIC involvement). For this experiment, we implemented a shared software ring which replaced our hardware ring. The transmitters enqueued to this software ring, while the receivers dequeued from it. The transmit-side operation involved copying packets (always) from netmap rings to the shared software ring; on the receive side, we show results with and without the zero-copy optimization. Table 3 presents throughput results for one transmitter/one receiver (tx1-rx1), one transmitter/three receivers (tx1-rx3), three transmitters/one receiver (tx3-rx1), and two transmitters/two receivers (tx2-rx2). The maximum achievable throughput with receive-side zero-copy enabled is around 30 Mpps for small packets, and 10.9 Mpps for large packets; without zero-copy, the throughputs decrease to 22.6 and 7.8 Mpps respectively.

config    fastio          fastio-no-rzc
pktsize   60B     1500B   60B     1500B
tx1-rx1   30.43   10.93   22.59   7.84
tx1-rx3   12.08   6.18    11.37   5.21
tx3-rx1   27.19   8.02    20.06   6.20
tx2-rx2   15.61   7.50    13.80   5.81

Table 3: Total throughput for software-only switching without NIC involvement (Mpps).

In all these cases, the achieved throughputs are well above the line rates supported by current NICs. With an increasing number of receivers, packets start getting dropped, resulting in lower overall throughput. With an increasing number of transmitters (tx3-rx1), the throughput drops marginally, presumably due to lock contention.
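The shared software ring used in this experiment can be sketched as a toy model: transmitters enqueue packet copies under a shared lock, receivers dequeue, and a full ring means a dropped packet. The capacity, locking, and data representation below are illustrative assumptions, not the in-kernel implementation.

```python
from collections import deque
from threading import Lock

class SoftwareRing:
    """Toy stand-in for the shared software ring that replaced the
    hardware ring in the switch-only experiment."""
    def __init__(self, slots=512):   # mirrors the 512-slot hardware ring
        self.slots = slots
        self.ring = deque()
        self.lock = Lock()           # models the shared fastio lock

    def enqueue(self, pkt):
        with self.lock:
            if len(self.ring) == self.slots:
                return False              # full ring: packet dropped
            self.ring.append(bytes(pkt))  # transmit side always copies
            return True

    def dequeue(self):
        with self.lock:
            return self.ring.popleft() if self.ring else None
```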
VALE, another netmap-based software switch, reports throughputs of around 3.4 Mpps (tx) and 2.5 Mpps (rx) while running KVM-based virtual machines [26]. While exact/fair comparisons with VALE are not possible (as our switch is perhaps lacking many features provided by VALE), the performance improvements provided by our switch due to in-guest switching are clearly visible. In contrast, VALE requires host-side involvement.

Finally, we discuss the runtime overheads of DBOS. DBOS overheads are related to VM exits caused by the guest's execution of the mov-to-cr3 opcode (which is patched by us to the int3 opcode), and VM exits due to write accesses to the page-table pages (which are write-protected by us using the EPT). For all our experiments involving pkt-gen and the Linux kernel, we encountered zero overhead due to DBOS. This is expected, because only a few page-table switches occur during pkt-gen execution, and almost no writes happen to the page-table pages. However, it is possible for a VM to be running other programs simultaneously with pkt-gen; to characterize these overheads, we ran some CPU-intensive programs (taken from SPEC CPUInt2000 [13]) and present runtime overheads, along with the statistics collected for VM exits. The overheads range between -2% and 34%; the majority of this overhead is due to writes to page-table pages by the guest kernel, presumably to implement the LRU page-replacement algorithm.

We also show results for the forkwait microbenchmark [1], which forks 40,000 processes and waits for each of them to exit in turn. Given that the forkwait benchmark creates and destroys a large number of page tables, the resulting DBOS overhead is significant (18.74x). After running all these benchmarks, 12 kernel pages contained at least one int3 patch. All these patches were in the guest's kernel, and were due to mov-to-cr3 instructions. We also found 193 instances where the suffix of the 2.5-byte sequence occurred at the top of an executable page in the guest.
Of these, …

[Table: per-benchmark DBOS overheads. Columns: program, original time (s), DBOS slowdown, ptable-exits, cr3-exits; only the first entry (gcc, 35.6 s) survives in this copy.]
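The mov-to-cr3-to-int3 patching step mentioned above can be sketched as follows. This is an illustrative assumption about the mechanics: the first opcode byte of each 3-byte mov-to-cr3 (0F 22 D8..DF) is overwritten with int3 (CC), and the hypervisor records the patch site so it can emulate the original instruction and advance EIP past it when the trap fires.

```python
INT3 = 0xCC

def patch_mov_to_cr3(code: bytearray):
    """Overwrite the first byte of every mov-to-cr3 instruction with
    int3, returning the patch sites so the hypervisor can emulate the
    original 3-byte instruction on each trap."""
    sites = []
    for i in range(len(code) - 2):
        if (code[i] == 0x0F and code[i + 1] == 0x22
                and (code[i + 2] & 0xF8) == 0xD8):
            code[i] = INT3
            sites.append(i)
    return sites
```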
6. Discussion
Our performance experiments and our discussion on security demonstrate the utility of DBOS as a security technique. We use several x86 mechanisms to achieve a practical implementation of DBOS, namely EPT-based read/write/execute page protection, the length of the mov-to-cr3 instruction opcode, support for CR3 target controls, and the ability to configure the virtualization hardware to cause VM exits on certain instructions, to name a few. For example, if the length of the mov-to-cr3 opcode were smaller (e.g., one byte instead of 2.5 bytes), the number of patch sites and resulting exits would have been greater. Similarly, security would not have been possible without support for CR3 target controls.

As we discuss in Section 3.1, we configure the virtualization hardware to cause a VM exit on every interrupt, and we discussed why we need this capability to ensure security (Section 3.1). For interrupt-intensive workloads, this may be a severe performance penalty [10]. Recent support for x86 VT-d posted interrupts allows the guest to directly receive interrupts without requiring VM exits. Even if the guest wants to use VT-d posted interrupts, we could still ensure that any interrupt received within the TCB causes a VM exit, by ensuring that the virtual addresses containing the interrupt descriptor table (IDT) are unmapped in our PPT. Typically, the kernel initializes the IDT in the beginning and stores its virtual address in the IDTR using the lidt instruction. The hypervisor can interpose on the execution of the lidt instruction (by requiring a VM exit), and record the address of the IDT. Thereafter, it can ensure that the virtual addresses corresponding to the IDT addresses in the PPT are mapped to a shadow
IDT [10]. All entries in our shadow IDT would have their present bit set to 0, causing a not-present exception on an interrupt. Additionally, the host is configured to force a VM exit whenever a not-present exception occurs. Through this, the hypervisor would get to interpose on any interrupt received during TCB execution.

Because we ensure that all mov-to-cr3 instruction executions within the guest cause VM exits, this can cause overhead for applications that involve significant context switching, as also seen in our experiments. One could potentially optimize this further by using in-place binary patching to replace all mov-to-cr3 instructions with a call to a special trusted function (another TCB) that allows loading cr3 without requiring a VM exit. This would involve caching the most frequently used cr3 values in the CR3 target controls. We leave this optimization for future work.

The choice of the opcode to subtract is also interesting; on the x86 architecture, we identified three different possible ways of accomplishing security through opcode subtraction. We have discussed the first one, involving subtraction of the mov-to-cr3 opcode, in this paper. The other two involve subtracting either the lgdt opcode or the mov-to-cr4 opcode. In both these cases, the subtracted opcode is used at the entry and exit of the fastio driver. Of the three choices, the mov-to-cr3 opcode incurs the least overhead on the I/O path.

Finally, we discuss guest fidelity. In general, DBOS does not change guest behaviour in any way (apart from potentially slowing it down). The only exception is that we rely on the identification of instruction boundaries; if instruction boundaries are incorrectly determined, or they can change dynamically, DBOS can cause the guest's logical behaviour to change. As we show in our experiments for Linux, this is usually not an issue.
While this work is about implementing DBOS on existing OSes/programs, it should be possible to make the compiler DBOS-aware, so that it prevents emission of certain byte sequences.
7. Related Work
There are two categories of related work for this paper: one involving network I/O optimization and virtualization, and the other involving software-based security techniques such as dynamic binary translation, proof-carrying code, and typed assembly language.
Network I/O Optimization and Virtualization
RouteBricks [6] implemented fast software routers by scaling them across a number of servers. PFQ [3], PF_RING [5], Intel DPDK [7], and netmap [25] are all approaches involving mapping NIC buffers into user address space.

As already discussed, ClickOS [20] suggests a complete overhaul of the Xen hypervisor's network interface, which increases the effective bandwidth between the guest OS and the hardware NIC. Other work on improving hypervisor networking performance [4, 24, 27, 28] suggests similar optimizations; [12, 30] discuss scheduling optimizations for good networking throughput. Efforts involving vhost-like optimizations for Open vSwitch [23, 24] are also interesting. All these efforts involve optimizing either the guest-side stack, or the host-side stack, or the guest/host virtualization interface. In comparison, our approach completely obviates the host-side stack, and provides direct NIC access to the guest, resulting in significantly higher throughputs and lower latencies.
Software Techniques for Security
The closest competing technique to DBOS is perhaps dynamic binary translation (DBT). Unlike DBOS, DBT incurs large overheads for indirect jumps and interrupts/exceptions [1, 8]. BTKernel [18] optimizes DBT for interrupts/exceptions and indirect branches; however, BTKernel cannot provide the security guarantees required for our application. For example, BTKernel's approach of leaving code-cache addresses on return stacks, and jumping directly to them, could be used to launch a security attack in our case. DBOS is a low-overhead mechanism for ensuring security, and usually results in much lower overheads than DBT for similar security guarantees. Conversely, DBOS is not as powerful as DBT, and cannot be used for several other DBT applications.

DBOS is similar to verification techniques such as proof-carrying code (PCC) [22] and typed assembly language (TAL) [11], in that both techniques involve analyzing the code at load time to ascertain safety. However, unlike PCC and TAL, our analysis is much simpler: we only check for the occurrence of a certain pattern (grep) in the executable address space. In contrast, PCC and TAL require detailed reasoning about the semantics of individual instructions, and control flow. While PCC and TAL have been successfully used to ascertain safety for relatively small programs, ascertaining safety for a full guest operating system still remains an open problem. Also, verification techniques seldom worry about instruction boundaries, and the potential of being able to jump into the middle of an instruction. Our current method for driver certification involves digital signatures; it remains to be seen whether methods like proof-carrying code may be used instead. A PCC-based certifier would need to certify that the fastio driver behaves as expected, and does not allow PPT access to be leaked to the untrusted guest.
8. Conclusions
We present a novel security mechanism, DBOS, and show its successful application to I/O virtualization. Using DBOS, we are able to expose privileged hypervisor/device state to the guest without security risks. Our trusted guest-side driver can access this privileged state to perform fast I/O and switching. We show significant improvements over state-of-the-art device virtualization solutions and software switches.

References

[1] Adams, K., and Agesen, O. A comparison of software and hardware techniques for x86 virtualization. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2006), ASPLOS '06, ACM, pp. 2–13.

[2] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A. Xen and the art of virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 164–177.

[3] Bonelli, N., Di Pietro, A., Giordano, S., and Procissi, G. On multi-gigabit packet capturing with multi-core commodity hardware. In Proceedings of the 13th International Conference on Passive and Active Measurement (Berlin, Heidelberg, 2012), PAM '12, Springer-Verlag, pp. 64–73.

[4] Cardigliano, A., Deri, L., Gasparakis, J., and Fusco, F. vPF_RING: Towards wire-speed network monitoring using virtual machines. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference (New York, NY, USA, 2011), IMC '11, ACM, pp. 533–548.

[5] Deri, L. Direct NIC access, December 2011.

[6] Dobrescu, M., Egi, N., Argyraki, K., Chun, B.-G., Fall, K., Iannaccone, G., Knies, A., Manesh, M., and Ratnasamy, S. RouteBricks: Exploiting parallelism to scale software routers. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (New York, NY, USA, 2009), SOSP '09, ACM, pp. 15–28.

[7] DPDK: Data Plane Development Kit. http://dpdk.org, March 2015.

[8] Feiner, P., Brown, A. D., and Goel, A. Comprehensive kernel instrumentation via dynamic binary translation. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2012), ASPLOS XVII, ACM, pp. 135–146.

[9] Ghodsi, A., Sekar, V., Zaharia, M., and Stoica, I. Multi-resource fair queueing for packet processing. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (New York, NY, USA, 2012), SIGCOMM '12, ACM, pp. 1–12.

[10] Gordon, A., Amit, N., Har'El, N., Ben-Yehuda, M., Landau, A., Schuster, A., and Tsafrir, D. ELI: Bare-metal performance for I/O virtualization. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2012), ASPLOS XVII, ACM, pp. 411–422.

[11] Grossman, D., and Morrisett, J. G. Scalable certification for typed assembly language. In Selected Papers from the Third International Workshop on Types in Compilation (London, UK, 2001), TIC '00, Springer-Verlag, pp. 117–146.

[12] Har'El, N., Gordon, A., Landau, A., Ben-Yehuda, M., Traeger, A., and Ladelsky, R. Efficient and scalable paravirtual I/O system. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference (Berkeley, CA, USA, 2013), USENIX ATC '13, USENIX Association, pp. 231–242.

[13] Henning, J. L. SPEC CPU2000: Measuring CPU performance in the new millennium. IEEE Computer 33, 7 (July 2000), 28–35.

[14] Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A., Handley, M., and Tokuda, H. Is it still possible to extend TCP? In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference (New York, NY, USA, 2011), IMC '11, ACM, pp. 181–194.

[15] Intel virtualization technology for connectivity, March 2015.

[16] Intel virtualization technology for Directed I/O.

[17] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2.

[18] Kedia, P., and Bansal, S. Fast dynamic binary translation for the kernel. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 101–115.

[19] Libvirt Virtualization API: VirtIO. http://wiki.libvirt.org/page/Virtio.

[20] Martins, J., Ahmed, M., Raiciu, C., Olteanu, V., Honda, M., Bifulco, R., and Huici, F. ClickOS and the art of network function virtualization. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2014), NSDI '14, USENIX Association, pp. 459–473.

[21] Mogul, J. C., and Ramakrishnan, K. K. Eliminating receive livelock in an interrupt-driven kernel. ACM Trans. Comput. Syst. 15, 3 (Aug. 1997), 217–252.

[22] Necula, G. C. Proof-carrying code. In Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (New York, NY, USA, 1997), POPL '97, ACM, pp. 106–119.

[23] Pfaff, B., Pettit, J., Amidon, K., Casado, M., Koponen, T., and Shenker, S. Extending networking into the virtualization layer. In Eighth ACM Workshop on Hot Topics in Networks (HotNets-VIII), New York City, NY, USA, October 22–23, 2009 (2009).

[24] Ram, K. K., Santos, J. R., Turner, Y., Cox, A. L., and Rixner, S. Achieving 10 Gb/s using safe and transparent network interface virtualization. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (New York, NY, USA, 2009), VEE '09, ACM, pp. 61–70.

[25] Rizzo, L. netmap: A novel framework for fast packet I/O. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference (Berkeley, CA, USA, 2012), USENIX ATC '12, USENIX Association, pp. 9–9.

[26] Rizzo, L., and Lettieri, G. VALE, a switched ethernet for virtual machines. In Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies (New York, NY, USA, 2012), CoNEXT '12, ACM, pp. 61–72.

[27] Rizzo, L., Lettieri, G., and Maffione, V. Speeding up packet I/O in virtual machines. In Proceedings of the Ninth ACM/IEEE Symposium on Architectures for Networking and Communications Systems (Piscataway, NJ, USA, 2013), ANCS '13, IEEE Press, pp. 47–58.

[28] Santos, J. R., Turner, Y., Janakiraman, G., and Pratt, I. Bridging the gap between software and hardware techniques for I/O virtualization. In USENIX 2008 Annual Technical Conference (Berkeley, CA, USA, 2008), ATC '08, USENIX Association, pp. 29–42.

[29] Sherry, J., Hasan, S., Scott, C., Krishnamurthy, A., Ratnasamy, S., and Sekar, V. Making middleboxes someone else's problem: Network processing as a cloud service. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (New York, NY, USA, 2012), SIGCOMM '12, ACM, pp. 13–24.

[30] Xu, C., Gamage, S., Lu, H., Kompella, R., and Xu, D. vTurbo: Accelerating virtual machine I/O processing using designated turbo-sliced core. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference (Berkeley, CA, USA, 2013), USENIX ATC '13, USENIX Association, pp. 243–254.