PiBooster: A Light-Weight Approach to Performance Improvements in Page Table Management for Paravirtual Virtual-Machines
Zhi Zhang, Yueqiang Cheng
Abstract—In paravirtualization, the page table management components of the guest operating systems are properly patched for the security guarantees of the hypervisor. However, none of them pays enough attention to performance improvements, which results in two noticeable performance issues. First, such security patches exacerbate the problem that the execution paths of the guest page table (de)allocations become extremely long, which consequently increases the latencies of process creations and exits. Second, the patches introduce many additional IOTLB flushes, leading to extra IOTLB misses, and the misses have negative impacts on the I/O performance of all peripheral devices. In this paper, we propose PiBooster, a novel lightweight approach for improving the performance of page table management. First, PiBooster shortens the execution paths of the page table (de)allocations with the PiBooster cache, which maintains dedicated buffers for serving page table (de)allocations. Second, PiBooster eliminates the additional IOTLB misses with a fine-grained validation scheme, which performs page table and DMA validations separately, instead of doing both together. We implement a prototype on Xen with Linux as the guest kernel. We make small modifications to Xen ( SLoC) and the Linux kernel ( SLoC). We evaluate the I/O performance in both micro and macro ways. The micro experiment results indicate that PiBooster is able to completely eliminate the additional IOTLB flushes in workload-stable environments, and effectively reduces the (de)allocation time of the page table by 47% on average. The macro benchmarks show that the latencies of process creations and exits are reduced by 16% on average, as expected. Moreover, the SPECINT, lmbench and netperf results indicate that PiBooster has no negative performance impact on CPU computation, network I/O, and disk I/O.

Index Terms—Para-virtualization, Peripheral Devices, Page-table Management.
1 INTRODUCTION
In paravirtualization [7], [30], the operating system of each Virtual Machine (a.k.a. guest or guest domain) and the hypervisor share the same virtual space. In order to prevent malicious accesses from the guest OS, the hypervisor sets the guest page tables read-only, and validates their updates to ensure that there is no runtime violation [7]. However, page table based protection alone is not enough to defend against DMA attacks driven by a malicious guest OS [25]. To close this gap, the hypervisor resorts to the I/O virtualization (AMD-Vi [5] or Intel VT-d [14]) technology, which leverages a new Input/Output Memory Management Unit (IOMMU) to restrict DMA accesses to the physical memory pages occupied by the hypervisor and the guest page tables. To integrate the above protection techniques, the guest page table management and the hypervisor are required to be properly patched.
Problem.
However, all existing patches mainly focus on the security enhancements of the hypervisor, without enough attention to performance improvements, which results in two noticeable performance issues. The first one is the long execution paths of the guest page table (de)allocations, which involve a complex memory allocation process and an additional security validation procedure. The memory allocation process frequently involves a slab allocator [9] and page frame allocations that are managed with a buddy system [16], which introduces deep invocations for each page-table (de)allocation. Moreover, the additional security validation procedure always adds extra costs for preventing page tables from malicious software and DMA modifications. All these lead to poor performance of the page table (de)allocation, and consequently result in long latencies of the creations and exits of processes.

The other one is the additional IOTLB flushes introduced by the security validations of the page table (de)allocations. The guest page tables should be non-readable for any DMA requests, and the corresponding pages should be readable and writable when the page tables are deallocated. The updates of the access permissions require IOTLB flushes to refresh the access permissions, which is necessary for the sake of the security of the hypervisor. In addition, these access permission update events are often triggered during the whole life cycle of a running system. As a consequence, IOTLB flushing events are frequently involved, which inevitably increases the IOTLB miss rate and lowers the speed of the DMA address translation. All these are likely to introduce negative impacts on the I/O performance of all peripheral devices.

• Zhi Zhang is with Data61, CSIRO, Australia and the University of New South Wales, Australia. E-mail: [email protected]
• Yueqiang Cheng is with Baidu XLab, America. E-mail: [email protected]
The baseline of Figure 1 (a) illustrates the two issues, taking page table allocation as an example.
Solution.
The above identified performance issues urge us to revise the design of the page table management to improve the performance without sacrificing the security guarantees. In response, in this paper we propose PiBooster, a novel software-only lightweight approach for improving performance in page table management. First, PiBooster shortens the execution paths of the page table (de)allocations by the PiBooster cache, which maintains dedicated buffers for serving page table (de)allocations (Figure 1 (b)). The PiBooster cache queues the deallocated page-table pages in constant time, with the hope that they will be reused (popped out of the cache) by page table allocations in the near future. By doing so, the page table allocations do not need to involve the costly memory management subsystem every time; instead they can directly get pages from the cached buffers, dramatically shortening the execution paths.

Fig. 1: Solution Overview. When the PiBooster cache and fine-grained validation scheme are enabled, the execution path of page table allocation is dramatically reduced, and the additional IOTLB flushes are eliminated. (Panels: (a) Baseline (Page Table Allocation); (b) PiBooster Cache; (c) PiBooster Cache + Fine-grained Validation.)

Second, PiBooster eliminates the additional IOTLB flushes with a fine-grained validation scheme (Figure 1 (c)), which separates the page table and DMA validations. In the traditional design, there are two types of pages: the writable page, which is writable for software and DMA accesses, and the non-writable page (e.g., a page-table page), which is non-writable for both software and DMA. The page table allocations and deallocations always involve type changes between the two types. Thus, the hypervisor has to do both page table and DMA validations to ensure that neither of them violates the security policies. However, we observe that it is not necessary to do the DMA validation every time if we create a new page type (i.e., the semi-writable page) with the non-writable permission for DMA access, and enforce that page type changes only occur between page-table pages and semi-writable pages during page table allocations and deallocations. The management of the semi-writable pages can be assisted by the PiBooster cache.

We implement a prototype on Xen with Linux as the guest kernel. We make small modifications to Xen version 4.2.1 ( SLoC) and Linux kernel version 3.2.0 ( SLoC). We evaluate the I/O performance in both micro and macro ways. The micro experiment results indicate that PiBooster is able to completely eliminate the additional IOTLB flushes, and effectively reduce the (de)allocation time of the page table. There are (34%, 47%), (38%, 22%) and (65%, 65%) improvements for a pair of allocation and deallocation for the three-level page table, from top to bottom. Even in the worst cases, when PiBooster has to go through the traditional path to allocate pages, the performance overhead for one page table allocation is still very small, adding only about 20 instructions. Fortunately, the worst case can be avoided by carefully setting the number of initial pages of the PiBooster cache. The macro benchmarks show that PiBooster has no negative impact on CPU computation, network I/O and disk I/O. In particular, the latencies of process creations and exits are reduced by 16% on average, as expected.

In summary, we make the following contributions:

1) We identify two significant performance issues in page table management. In particular, we are the first, to the best of our knowledge, to identify the performance issue between guest page table (de)allocations and the IOTLB flushes.
2) We propose a novel approach, called PiBooster, to shorten the execution paths of the page table allocations and deallocations, and eliminate the additional IOTLB flushes, without sacrificing the system security.
3) We implemented a prototype of PiBooster and evaluated the performance in both micro and macro ways. The experiment results indicate that PiBooster can benefit the page table (de)allocations and eliminate additional IOTLB flushes, without negative performance impacts on the system.

The rest of the paper is structured as follows: In Section 2, we briefly describe the background knowledge and highlight the performance issues. Then we describe the system overview and implementation in Section 3 and Section 4. In Section 5, we evaluate the performance of the PiBooster system, and discuss several issues in Section 6. At last, we discuss the related work in Section 7, and conclude the whole paper in Section 8.
2 PROBLEM DEFINITION
In this section, we describe the necessary background knowledge and highlight the two identified performance issues: 1) the long execution paths of the guest page table (de)allocations, and 2) the additional IOTLB flushes. As Xen [7] is a typical and popular paravirtual hypervisor, we use Xen in the x86 MMU model [1] to illustrate the details. The presented mechanisms are also available on other paravirtual platforms.
The long execution path issue refers to the execution paths of the guest page table allocations and deallocations. In the traditional design, allocating and deallocating a page-table page are time-costly, as they have to invoke the complex memory allocators and perform both software (i.e., page table) and DMA validations. In addition, such allocation and deallocation events frequently occur in a system; i.e., the numerous process creations and exits result in many page-table allocations and deallocations. Thus, it is necessary to deeply understand the long execution path issue and improve the performance accordingly.
To allocate a page, the guest kernel has to invoke the system allocators, typically from the slab allocator down to the buddy allocator. The slab allocator [9] manages a variable number of caches that are linked together by a doubly linked circular list. Each cache maintains blocks of contiguous pages in memory called slabs, which are carved up into small chunks for data structures and objects of specific sizes. When kmalloc is called, the allocator searches through the prepared caches. If there is no suitable object, the buddy allocator is involved. The buddy allocator [16] manages all free memory blocks of permitted sizes using free lists. The blocks consist of pages, and the sizes are usually powers of 2. When receiving a request for memory, the allocator first checks whether the request size is equal to a permitted size. If it is, the allocator returns a block from the corresponding free list. If it is not, or the free list is empty, the allocator splits a larger block (e.g., one whose permitted size is twice that of the requested size) into multiple sub-blocks and returns one sub-block, inserting the rest into the proper free lists.

In contrast to the page allocation, the page deallocation returns the page back to the system. The deallocation process may also invoke the slab allocator and/or the buddy allocator, and triggers updates of the corresponding data structures; e.g., the buddy allocator always attempts to merge adjacent freed blocks into a larger permitted size.

In brief, the page allocation and deallocation are time-costly due to the deep invocations and the complex updates of the dependent data structures.
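The split-on-miss behaviour described above can be summarized with a minimal sketch (illustrative Python, not the actual Linux buddy allocator; order-based free lists where a small request repeatedly splits a larger free block):

```python
# Minimal sketch of buddy-style allocation: free lists keyed by order
# (block size = 2**order pages); a request splits larger blocks on demand.
class BuddySketch:
    def __init__(self, total_order):
        # Start with one big free block covering the whole region.
        self.free_lists = {o: [] for o in range(total_order + 1)}
        self.free_lists[total_order].append(0)  # block at page offset 0

    def alloc(self, order):
        # Find the smallest free block of order >= the requested order.
        o = order
        while o <= max(self.free_lists) and not self.free_lists[o]:
            o += 1
        if o > max(self.free_lists):
            return None  # out of memory
        block = self.free_lists[o].pop()
        # Split down to the requested order, keeping the "buddy" halves free.
        while o > order:
            o -= 1
            self.free_lists[o].append(block + (1 << o))  # upper half stays free
        return block

    def free(self, block, order):
        # A real buddy system would try to coalesce with the adjacent
        # buddy here; the sketch just returns the block to its free list.
        self.free_lists[order].append(block)

buddy = BuddySketch(total_order=4)   # 16-page region
page = buddy.alloc(0)                # a one-page request forces repeated splits
```

The deep invocation chain the paper refers to is visible even in this toy: a single-page request walks every free list and performs one split per level.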
Page tables are used by the hardware, i.e., the Memory Management Unit (MMU), to translate linear addresses into physical addresses. In the PAE-enabled paging mode, a page table has three levels: the L3 level (bottom level), the L2 level (middle level) and the L1 level (top level). The slots in the L3, L2 and L1 levels are known as Page Table Entries (PTE), Page Middle Directories (PMD) and Page Global Directories (PGD), respectively. A PTE slot determines the access permissions of a page; e.g., the kernel can set a page read-only by clearing the bit within a PTE slot that represents the writable permission. Typically, each user process has its own page table, and the creation and exit of a user process are accompanied by the allocation and deallocation of a page table, respectively.

Fig. 2: Page type updates between writable pages and non-writable pages.

The hypervisor defines two page types: 1) the writable page, which is writable for both software and DMA, and 2) the non-writable page, which is non-writable for both software and DMA. The non-writable page has several sub-page types, such as page-table pages and GDT/LDT pages. In addition, the hypervisor requires that every page type update only occur between writable and non-writable pages, as summarized in Figure 2. The hypervisor maintains a type reference count for each page, and enforces the policy that any given page has exactly one type at any given time. It also enforces that only pages with the writable type have a writable mapping in the page tables. By doing this it can ensure that the guest OS is not able to directly modify any page-table pages and therefore cannot subvert the security of the hypervisor.

Whenever a page table is loaded to work, there should be some page type updates from writable pages to non-writable pages (i.e., page-table pages). First, the hypervisor must ensure that the page-table page has a type reference count of zero.
In addition, it must be validated to ensure that it complies with the following policy: for a page with a page-table type to be valid, it is required that any pages referenced by a present page table entry in the page have the type of the next level down. For instance, any page referenced by a page with type L1 Page Table must itself have the type L2 Page Table. This policy is applied recursively down to the L3 page table layer. At L3, the invariant is that any data page mapped as writable by an L3 entry must be a writable page. By applying these policies, the hypervisor ensures that all page-table pages as a whole are safe to be loaded. Note that since the hypervisor is always involved in all updates of the page tables, the policies for the page table updates are non-bypassable.

The validation process for the page table deallocation is the reverse of the one for the page table allocation. In brief, the hypervisor validates the page table's type reference count, access permissions and slot mappings to ensure that the page table is completely and securely deallocated.
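The recursive policy can be expressed compactly. The following is an illustrative Python sketch of the check (our simplification, not Xen's code): intermediate levels may only reference pages of the next level down, and writable L3 entries must target writable data pages.

```python
# Illustrative sketch of the recursive validation policy: a page of type
# "L1" may only reference "L2" pages, "L2" only "L3" pages, and "L3"
# entries mapped writable must point to writable data pages.
NEXT_LEVEL = {"L1": "L2", "L2": "L3"}

def validate(page, page_types):
    """page = {"type": ..., "entries": [(target_id, writable), ...]};
    page_types maps page IDs to such page descriptors."""
    t = page["type"]
    for target, writable in page["entries"]:
        if t in NEXT_LEVEL:
            # Intermediate level: every present entry must reference a
            # page holding the type of the next level down.
            if page_types[target]["type"] != NEXT_LEVEL[t]:
                return False
            if not validate(page_types[target], page_types):
                return False
        else:
            # Bottom (L3) level: writable mappings must target writable pages.
            if writable and page_types[target]["type"] != "writable":
                return False
    return True
```

A three-level table thus validates as a whole: one illegal writable mapping at the bottom invalidates the entire load.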
The input/output memory management unit (IOMMU) [14] is a memory management unit (MMU) that connects a DMA-capable I/O bus to the main memory. Like a traditional MMU, the IOMMU maps device addresses (also called I/O addresses) to physical addresses through a dedicated page table. The IOMMU page table, which is created and maintained by the hypervisor in its own space, is able to restrict the access to a particular page by configuring the permission bits. The hypervisor grants different access permissions to different page types; e.g., the writable pages are always allowed full access permissions, while the page-table pages are always inaccessible to any devices.

However, if the DMA address translation always needs to look up the IOMMU page table, it will be slow and inefficient. To accelerate the translation, the I/O translation look-aside buffer (IOTLB) is introduced. The IOTLB is used to cache frequently accessed page table entries. By doing so, the IOTLB is very likely to be hit, meaning that the physical address of a queried DMA address is immediately fetched through the IOTLB path (Figure 3a). If, unlikely, an IOTLB miss occurs, the DMA address translation can still go along the slow I/O page-table path to get the physical address (Figure 3b). To achieve better I/O performance, the DMA address translation should avoid taking the I/O page-table path.

Fig. 3: The IOTLB path is much faster than the I/O page-table path. ((a) IOTLB path; (b) I/O page-table path.)
The page table allocation and deallocation always trigger updates between the writable pages and the page-table pages, which have different access permissions for DMA requests. To keep the hypervisor secure, the DMA validation is necessary. Specifically, the hypervisor will update the corresponding entries of the IOMMU page table to set correct access permissions. After that, the hypervisor also needs to flush IOTLB entries to invalidate the stale entries, without which the security of the hypervisor would be violated; e.g., DMA requests could write the page-table pages through stale IOTLB entries.

Fig. 4: The DMA validations introduce additional IOTLB flushes, which lead to selecting the slow path (i.e., the I/O page-table path) for the DMA address translation.

In addition, the additional IOTLB flushes are likely to make the DMA address translations take the slow and inefficient page-table path, instead of the fast and efficient IOTLB path (Figure 4), due to the additional IOTLB misses. Numerous IOTLB misses would lower the speed of DMA transfers, especially for high performance devices, such as Intel I/OAT [18].

3 PIBOOSTER OVERVIEW
In the design of PiBooster, we consider several requirements, which are summarized as follows.

1) Unaltered system security. The new scheme should not sacrifice the system security to obtain performance benefits. No one likes to use a system with known design loopholes.
2) Compatibility with legacy applications. The new scheme should limit the modifications to the guest kernel and the hypervisor, without any modifications to existing applications.
3) Small modifications. The new scheme should minimize the development cost on the guest kernel and the hypervisor.
Figure 5 depicts the architecture of PiBooster. It consists of a PiBooster cache in the guest kernel and a PiBooster module in the hypervisor space. The PiBooster cache allocates a number of semi-writable pages at the initial stage; these are newly introduced by the fine-grained validation scheme (Section 3.3). These semi-writable pages are maintained in a dedicated cache. At runtime, the PiBooster cache attempts to satisfy all page table allocation and deallocation requests, shortening the execution path and saving execution time. There are many page type updates from/to the semi-writable pages. The PiBooster cache mediates these updates and issues hypercalls to the hypervisor, asking it to perform the fine-grained security validations on the update requests (i.e., the communication channel). Once the verified pages pass the validations, their page types are updated accordingly.

The PiBooster module enforces the fine-grained validation scheme on the page table updates. It performs (1) the page table validations, which validate the page table contents as well as the type reference count, and (2) the DMA validations, which ensure that DMA requests cannot write the page-table pages and the semi-writable pages. In the traditional validation scheme, the DMA validations always trigger additional IOTLB flushes due to the access permission updates between the writable pages and the page-table pages. In the fine-grained validation scheme, both semi-writable pages and page-table pages are non-writable for DMA requests. Thus, the hypervisor never needs to do the IOTLB flushes or IOMMU page table updates (i.e., the vanished channel), which benefits the I/O performance of the peripheral devices. Moreover, it also saves the time of the whole security validation, further reducing the execution time of the page table allocations and deallocations.
The fine-grained validation scheme aims to eliminate the additional IOTLB flushes and reduce the total time of the security validations. Specifically, in the traditional security validation scheme, there are only two general page types, the writable page and the non-writable page, and type updates between them are required to do both page table validations and DMA validations. The DMA validations not only increase the total validation time, but also introduce numerous additional IOTLB flushes. Moreover, the additional IOTLB flushes cannot be skipped, because that would provide a time gap for the adversary to attack the hypervisor by leveraging stale IOTLB entries.

To address this problem without sacrificing the system security, we introduce a new page type, the semi-writable page, which is writable for software but non-writable for DMA. In addition, we enforce that the page type updates between the writable page and the page-table page must go through the semi-writable page first (as illustrated in Figure 5). As the semi-writable page and the page-table page are already inaccessible to DMA, there is no need to do the DMA validations, meaning that the additional IOTLB flushes can be totally avoided. As a consequence, the time of the whole security validation process is reduced, accelerating the page table allocations and deallocations. Similar to the management of the page-table pages, the hypervisor is only responsible for the final validation of the semi-writable page, leaving all other management operations to the PiBooster cache. By doing this, we can keep the modifications as small as possible by reusing the existing validation process and page-table management subsystem, and also retain the system security.

Fig. 5: PiBooster Architecture. The semi-writable pages are managed by the PiBooster cache, and all page type updates are validated by the PiBooster module. Through the cooperation of the PiBooster cache and the PiBooster module, PiBooster successfully shortens the execution paths of the guest page table (de)allocations and eliminates additional IOTLB flushes.
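The enforced transition discipline can be pictured as a small state machine. The sketch below is our illustration (not Xen's implementation): direct updates between writable and page-table pages are rejected; every path passes through the semi-writable type.

```python
# Sketch of the page type transitions enforced by the fine-grained scheme:
# updates between "writable" and "page-table" must pass through
# "semi-writable"; any direct jump is an illegal update.
ALLOWED = {
    ("writable", "semi-writable"),
    ("semi-writable", "writable"),
    ("semi-writable", "page-table"),
    ("page-table", "semi-writable"),
}

def transition(current, new):
    if (current, new) not in ALLOWED:
        raise ValueError(f"illegal page type update: {current} -> {new}")
    return new
```

Because the two DMA-protected types sit adjacent in this graph, the frequent allocation/deallocation transitions (semi-writable ↔ page-table) never change a page's DMA permission.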
The PiBooster module works in the hypervisor space, extended from the traditional coarse-grained validation module. The first task of the PiBooster module is to add support for the semi-writable page. Instead of adding a new data structure for marking the semi-writable pages, the PiBooster module chooses to reuse the existing one. By doing so, the PiBooster module can reuse the existing interfaces. In particular, we find that the page-type data structure can be extended by borrowing a bit from the type reference count for the semi-writable page.

The second task of the PiBooster module is to perform the fine-grained security validations for all page type update requests. The main logic is to check the page type first, and then determine whether to perform one validation or both of them. Specifically, if the page type update occurs between a writable page and a semi-writable page, it performs both page table and DMA validations. However, if the page type update occurs between a semi-writable page and a page-table page, the PiBooster module only performs the page table validations, skipping the DMA validations. Note that the page table validations are always necessary, as the semi-writable pages are writable for the guest OS, and modifications from the untrusted guest OS could subvert the enforced security policies.

The PiBooster module also exports a new hypercall interface for the PiBooster cache to facilitate their communications. Through the new interface, the PiBooster cache can explicitly invoke the PiBooster module to perform security validations on specific page type updates.
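The dispatch logic above can be sketched as follows (illustrative Python with hypothetical names; the real module patches Xen's type-checking functions in C). The key point is that the DMA validation, and the IOTLB flush it implies, is needed only when one side of the transition is DMA-writable.

```python
# Illustrative dispatch of the fine-grained validation: the page table
# validation always runs; the DMA validation (which implies an IOTLB
# flush) runs only when the update crosses the DMA-writability boundary.
def required_validations(current, new):
    dma_protected = {"semi-writable", "page-table"}
    checks = ["page_table_validation"]       # always required
    if not (current in dma_protected and new in dma_protected):
        # One side is DMA-writable, so IOMMU permissions must change.
        checks.append("dma_validation")      # implies an IOTLB flush
    return checks
```

Under this rule, the frequent semi-writable ↔ page-table transitions performed on every page table (de)allocation trigger no IOTLB flush at all.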
The basic idea behind the PiBooster cache is to have caches of semi-writable pages available for page table allocations and deallocations. Without the page-oriented PiBooster cache, the kernel would spend much of its time allocating, initializing and freeing page-table pages. The slab allocator, which is similar to the PiBooster cache, is not used in our setting for the following reasons. First, it is not aware of the page type or the security requirements, which would lead to unexpected crashes of the guest OS. Adding such functionality would introduce many internal checking operations that are unnecessary for the other objects maintained by the slab allocator; at the least, those additional checking operations would lower the (de)allocation efficiency of the other objects. Second, the existing size-oriented management of the slab does not distinguish the page-table pages from other pages of the same size, and customizing this management mechanism would require substantial development costs, such as interface updates and internal data structure updates. In addition, the related components that rely on the slab allocator may also be affected. Lastly, adding the fine-grained validation mechanism for only one object type (i.e., the page-table page) would subvert the generality of the slab allocator. Considering the above three reasons, we give up reusing the existing slab allocator and instead build a dedicated one, maintained by the PiBooster cache, to serve the page table allocations and deallocations.
The PiBooster cache is enabled in the system bootup phase by default. By doing so, the page tables of all user processes are served by the PiBooster cache from the very beginning. To increase flexibility, it also allows dynamic activation at runtime through an exported interface.

In the initialization phase, the PiBooster cache allocates a bulk of pages from the existing system allocators, converts them into semi-writable pages, and maintains them in a dedicated cache list. At runtime, a page table is always successfully deallocated by efficiently pushing the deallocated pages into the PiBooster cache. The cases for the page table allocation are a little more complex. Normally, the PiBooster cache can serve all page table allocation requests using the cached pages. However, in the worst case, the cached pages may not be able to satisfy the allocation requests. In such conditions, the PiBooster cache has to re-invoke the system allocators to get new pages. Fortunately, the re-invocations can be avoided by carefully setting the number of initial pages. In fact, the re-invocations rarely occur in a workload-stable system. In our experiments, there are always several semi-writable pages in the PiBooster cache ready for the page-table allocations after PiBooster has been running for a few minutes. All these cases are evaluated in Section 5.

The PiBooster cache works throughout the whole life cycle of the guest by default, but the end user is able to explicitly disable it at any time through the exported interface. Once the PiBooster cache receives the disable command, it releases all resources, e.g., deallocating the cached pages with the assistance of the existing system allocators as well as the PiBooster module, and releasing the data structures used for managing the cached pages. In addition, it issues a hypercall to inform the hypervisor, which disables the fine-grained validation mechanism and switches back to the traditional coarse-grained validation scheme.
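The runtime behaviour described above, constant-time fast paths with a slow-path fallback, can be sketched as follows (illustrative Python; `system_alloc` is our stand-in for the slab/buddy path, not an API from the paper):

```python
# Sketch of the PiBooster cache's runtime behaviour: serve (de)allocations
# from a pre-filled pool of semi-writable pages, and fall back to the
# system allocators only when the pool runs dry (the worst case).
class PiBoosterCacheSketch:
    def __init__(self, initial_pages, system_alloc):
        self.system_alloc = system_alloc
        # Pre-allocate semi-writable pages at initialization time.
        self.cached = [system_alloc() for _ in range(initial_pages)]
        self.fallbacks = 0     # how often the slow path was taken

    def alloc_page_table_page(self):
        if self.cached:
            return self.cached.pop()       # fast path: constant time
        self.fallbacks += 1                # worst case: re-invoke allocators
        return self.system_alloc()

    def free_page_table_page(self, page):
        self.cached.append(page)           # constant-time push
```

Sizing `initial_pages` generously for the expected workload keeps `fallbacks` at zero, which is exactly the workload-stable behaviour observed in the experiments.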
When the memory management daemon (e.g., kswapd) notices that the available memory is tight, it can explicitly call the exported interfaces of the PiBooster cache to free some cached pages. There are two interfaces for shrinking the cached pages. One is based on a page number: the memory management daemon can specify the number of pages the PiBooster cache should release. The other is based on a percentage: for instance, the kernel can ask the PiBooster cache to release 50% of the cached pages. In fact, the number of semi-writable pages maintained in the PiBooster cache is not high. In our experiments, it is always less than , meaning that the size of the cache is less than  KB.

The PiBooster cache can also automatically shrink itself through a predefined threshold. The threshold can be defined according to the page number or the proportion (i.e., the number of the cached semi-writable pages over the number of the page-table pages), or a combination of them.

4 IMPLEMENTATION
In this section, we present the implementation details of the PiBooster module and the PiBooster cache, based on Xen (the hypervisor) and Linux (the guest kernel).
Fig. 6: The layout of the traditional data structure for the page type.

The first main task of the PiBooster module is to extend the existing data structure to support the semi-writable page. As illustrated in Figure 6, the data structure labelling page types occupies 32 bits: bits 28-31 are allocated for the page type, bits 23-27 are for others (e.g., bit 26 indicates whether this page has been validated), and bits 0-22 are for the reference count of one page type. The existing page types have occupied all page type bits, and there is no extra bit available for the semi-writable page. Facing this problem, we do not choose to introduce new data structures, as that would increase the management complexity and result in many modifications of all related management functions. Instead, we choose to borrow a bit from the reference count. In particular, the reference count field has 23 bits, recording the number of type references for a page as its current type. In fact, the system usually does not build so many references (i.e., 2^23 − 1) to one page. Thus, we borrow the highest bit (bit 22) as the semi-page table bit (as illustrated in Figure 7). As a consequence, it still supports more than 4 million reference counts, enough for almost all cases. The hypervisor always checks the reference count whenever operating on it, so as to avoid count overflows. In practice, the hypervisor with PiBooster functions well in our experiments.

The PiBooster module also needs to patch the page type checking functions, e.g., get_page_type, to adjust the checking logic. The added checking logic is straightforward, such that we only add or change 166 SLoC to achieve the whole patch. In addition, some validation steps (e.g., DMA validations) necessary in the traditional design can be skipped when the newly introduced semi-writable page is involved in the page type update, simplifying the whole validation logic to a certain extent.

Fig. 7: The semi-writable page type support.
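The borrowed-bit layout can be made concrete with a few masks (illustrative Python; the field positions follow Fig. 6/7, but the accessor names are ours, not Xen's):

```python
# Bit-layout sketch of the 32-bit page type word: bits 28-31 hold the
# page type, bit 26 the validated flag, and bit 22 is borrowed from the
# 23-bit reference count as the semi-page-table bit, leaving bits 0-21
# (about 4.2 million values) for the count itself.
TYPE_SHIFT, TYPE_MASK = 28, 0xF << 28          # bits 28-31: page type
VALIDATED_BIT = 1 << 26                        # bit 26: validated flag
SEMI_PT_BIT = 1 << 22                          # bit 22: borrowed semi-page-table bit
COUNT_MASK = (1 << 22) - 1                     # bits 0-21: remaining reference count

def is_semi_writable(word):
    return bool(word & SEMI_PT_BIT)

def ref_count(word):
    return word & COUNT_MASK

def inc_ref(word):
    # The hypervisor must check for overflow before incrementing, since
    # the count field lost its top bit to the semi-page-table flag.
    assert ref_count(word) < COUNT_MASK, "type reference count overflow"
    return word + 1
```

Reusing the existing word this way is what lets the module keep the existing type-checking interfaces with only small patches.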
The guest page table has three levels, and the PiBooster cache maintains a singly-linked list for each of them. Each node of a list has a pointer to the next node and a page ID, which is the base address of a cached page. Note that this address should be a physical address rather than a virtual address, because the physical address of a page is unique in the whole system while its virtual addresses can be multiple. Thus, using a virtual address would lead to confusion in the page type tracing and the semi-writable page management. As the page table allocations and deallocations can happen at any time on any core, each list has its own lock to support concurrent updates.

The PiBooster cache has two interfaces for the runtime page table allocations and deallocations. The pop interface serves the page table allocations. When this interface is invoked, the PiBooster cache fetches the top node of the corresponding list, extracts the base address of the cached page, and then returns it to the caller. Correspondingly, the push interface serves the page table deallocations. In the push interface, the PiBooster cache saves the base address of the deallocated page into a node, and inserts it at the top of the list. Obviously, the pop and push interfaces are extremely fast, as they can respond to requests in constant time. In addition, the PiBooster cache also exports an interface to the memory management daemon for explicitly shrinking the cached pages.

The PiBooster cache also adds several virtual files in sysfs [21], [24], which is a virtual file system provided by the Linux kernel. Using the virtual files, the end user can send commands to the PiBooster cache, as well as query and configure its internal status. In the current implementation, we only support one command, which activates and deactivates the cache service in an on-demand way.
In addition, the end user can read and write the virtual files to query the number of the cached pages and dynamically shrink the cache size. For instance, the end user can explicitly ask the PiBooster cache to release all cached pages by setting the number of cached pages to zero.

The PiBooster cache mediates all page type updates related to the semi-writable pages. For each page type update, the PiBooster cache issues the hypercall exported by the PiBooster module to explicitly ask the hypervisor to perform security validations. In certain cases, such as shrinking the cached semi-writable pages back to writable pages, the PiBooster cache can submit a batch of requests in one hypercall. In such cases, the hypervisor updates the I/O page tables and flushes the IOTLBs at one time, saving time during cache shrinking.
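The batched shrinking can be illustrated as follows. The batch size and the hypercall stand-in are our assumptions, since the paper does not specify them; the point is that n pages cost one hypercall, and thus one IOTLB flush, per batch rather than per page.

```c
#include <stddef.h>

#define PIB_BATCH 64  /* illustrative batch size, not specified in the paper */

/* Stand-in for the shrink hypercall: a real implementation would hand the
 * page array to the hypervisor, which updates the I/O page tables and
 * flushes the IOTLB once per call. */
static void shrink_hypercall(const unsigned long *pages, size_t n)
{
    (void)pages;
    (void)n;
}

/* Shrink n cached semi-writable pages back to writable pages, batching
 * up to PIB_BATCH pages per hypercall; returns the number of hypercalls
 * made, which equals the number of IOTLB flushes incurred. */
static unsigned pib_shrink_all(const unsigned long *pages, size_t n)
{
    unsigned calls = 0;

    while (n > 0) {
        size_t chunk = n < PIB_BATCH ? n : PIB_BATCH;

        shrink_hypercall(pages, chunk);
        pages += chunk;
        n -= chunk;
        calls++;
    }
    return calls;
}
```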
EVALUATION
We have implemented the prototype of PiBooster on our experiment platform. Xen version 4.2.1 is the hypervisor, while the guest VM (i.e., Dom0) is Ubuntu 12.04 with Linux kernel 3.2.0. The PiBooster added or changed SLoC in the Linux kernel and SLoC in Xen. To fully evaluate the performance and its effects on the whole system, we measured PiBooster with both micro-benchmarks (e.g., the frequency of IOTLB flushes, the execution time of page table (de)allocations, and the memory usage of the PiBooster cache) and macro-benchmarks (e.g., SPECINT, netperf and lmbench). The experiment platform is a LENOVO QiTian M4390 PC with four CPU cores (i.e., Intel Core i5-3470) running at 3.20 GHz. We enable the Intel VT-d feature through the BIOS and the grub configuration file.
Workload Emulation.
In order to allow us to repeatedly measure the effects of the PiBooster on 1) page table allocations and deallocations, and 2) the IOTLB flushes, we use a stress tool to explicitly emulate a heavy workload with many short-lived and concurrently-running processes. Specifically, the tool periodically launches a browser (i.e., Mozilla Firefox 31.0 in the experiment), continuously opens new tabs one by one, and terminates the browser gracefully. The purpose of these operations is to frequently create and terminate a large number of processes, leading to many page table allocations and deallocations. The frequency can be configured. In our experiment setting, there are processes created and exited per minute. In order to prevent the browser from occupying too much memory, we terminate it every 5 minutes. At that moment, the memory usage of the browser reaches . MB on average.
The micro-benchmark measurements evaluate the frequency of the IOTLB flushes, the execution time of the page table (de)allocations, and the memory usage of the PiBooster cache. For each measurement, there are two control groups and one baseline group. In the baseline group, we run the workload emulation in the guest VM with default settings, without enabling the PiBooster mechanism. The two control groups are: 1) the Pre-PiBooster group, where the PiBooster is enabled before the workload emulation starts, and 2) the Dyn-PiBooster group, where the PiBooster is dynamically enabled (e.g., five minutes after the workload emulation launches). The Dyn-PiBooster group evaluates 1) whether the PiBooster is able to enter a stable state as the Pre-PiBooster group does, and 2) how fast the PiBooster is able to enter the stable state.

Fig. 8: The frequency of IOTLB flushes. In the Pre-PiBooster group, the frequency is reduced from a very low level to zero within 1 minute. In the Dyn-PiBooster group, the frequency drops sharply within two minutes from the high level to zero. Both control groups indicate that the PiBooster always enters the stable state (i.e., zero frequency).
In this test, we aim to evaluate the effectiveness of the fine-grained validation on the additional IOTLB flushes. We sample the frequency of the IOTLB flushes over 30 minutes. The measurement results are illustrated in Figure 8. In the baseline group, the frequency of the IOTLB flushes increases in the first five minutes and then stays at a high flush rate (i.e., 9050 flushes per minute on average) until the test completes. In the Pre-PiBooster group, the flush frequency quickly decreases from a low level (i.e., 332 flushes in the first minute) to zero in about one minute and stays at that level. In the Dyn-PiBooster group, the flush frequency sharply decreases to zero when the PiBooster is enabled. The PiBooster roughly spends two minutes entering the stable state. Thus, we conclude that the fine-grained validation scheme is able to efficiently and effectively eliminate the IOTLB flushes introduced by the DMA validations.
There are three levels in the guest page table, and each level has its own (de)allocation functions, e.g., pgd_alloc and pgd_free for the L1 (top) level. In order to clearly observe the changes in CPU usage, we measure them separately. We continuously measure them for 30 minutes, and calculate the average execution time of each function in each minute. Note that in the Dyn-PiBooster group, we enable the PiBooster 5 minutes after the workload starts to run, which is why the measurements of the Dyn-PiBooster group in the first 5 minutes almost overlap with the baseline results.

As shown in Figures 9, 10, and 11, the average execution times in the Pre-PiBooster group for a pair of allocation and deallocation of the three-level page table, from top to bottom, are (622, 196), (424, 242) and (260, 217) in nanoseconds, while in the baseline group the corresponding average execution times are

(a) Execution time of L1 allocation is reduced by 34%. (b) Execution time of L1 deallocation is reduced by 47%.
Fig. 9: The execution time of L1 (i.e., the top level) allocation and deallocation. The PiBooster in the Dyn-PiBooster group can quickly enter the stable state in 2 minutes. (a) Execution time of L2 allocation is reduced by 38%. (b) Execution time of L2 deallocation is reduced by 22%.
Fig. 10: The execution time of L2 allocation and deallocation. The PiBooster in the Dyn-PiBooster group can quickly enter the stable state in 2 minutes. (a) Execution time of L3 allocation is reduced by 65%. (b) Execution time of L3 deallocation is reduced by 65%.
Fig. 11: The execution time of L3 (i.e., the bottom level) allocation and deallocation. The PiBooster in the Dyn-PiBooster group can quickly enter the stable state in 2 minutes.

(944, 366), (682, 313), and (755, 627) in nanoseconds. Correspondingly, the improvements are (34%, 47%), (38%, 22%) and (65%, 65%), from top to bottom. Putting them together, the page table allocations and deallocations are improved by 45% and 55% on average, respectively. The results also indicate that the PiBooster in the Dyn-PiBooster group and the Pre-PiBooster group achieve the same performance improvements. The only difference is that the PiBooster in the Dyn-PiBooster group needs a transitional period of 2 minutes to become stable.
The Worst Case.
When the PiBooster starts to work, the guest kernel always invokes the PiBooster cache first for page table allocations. If the cached semi-writable pages cannot satisfy the requirements (a.k.a. a cache miss), the PiBooster cache has to allocate writable pages from the existing memory allocators following the traditional paths. In this case, the execution time is that of the traditional execution path plus the path through the PiBooster cache. As a result, the execution time of the page table allocation is even longer than that of the baseline (Figure 1a). However, the overhead introduced by the PiBooster path is negligible, as the control flow returns immediately when the cache is empty. More specifically, the overhead consists of the function invocation, the stack adjustments, and the checking logic of both the cache list and the page type. Putting them together, the overhead is less than 20 instructions. Fortunately, the worst case does not occur often. According to our observations in the Dyn-PiBooster group, the number of the worst cases is , out of allocation requests in 30 minutes. Note that the page-table deallocation always succeeds in constant time.
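The worst-case path can be sketched as below: the cache-empty check is only a compare and a branch before control falls through to the traditional allocator, modeled here by a stand-in function. All names and the fixed-array cache are illustrative simplifications.

```c
#include <stddef.h>

#define PIB_CAP 16                     /* illustrative capacity */

static unsigned long cache[PIB_CAP];   /* cached page addresses (one level) */
static size_t cached;                  /* number of cached pages */
static unsigned slow_allocs;           /* how often the worst case was hit */

/* Stand-in for the traditional allocation path of the guest kernel. */
static unsigned long legacy_alloc_page(void)
{
    slow_allocs++;
    return 0xabc000UL;                 /* pretend page address */
}

/* Fast path: a handful of instructions decide hit vs. miss, so even the
 * worst case (empty cache) adds almost nothing to the traditional path. */
static unsigned long pib_alloc_page(void)
{
    if (cached == 0)
        return legacy_alloc_page();    /* worst case: fall back immediately */
    return cache[--cached];            /* hit: constant time */
}

/* Deallocation always succeeds in constant time; overflow pages would be
 * handed back to the system allocator (not modeled here). */
static void pib_free_page(unsigned long paddr)
{
    if (cached < PIB_CAP)
        cache[cached++] = paddr;
}
```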
Levels of Page Table    Pre-PiBooster    Dyn-PiBooster
L1                      5                4
L2                      26               20
L3                      145              136
Total                   176              160

TABLE 1: Cache usages in both groups of Pre-PiBooster and Dyn-PiBooster are small, occupying 176 pages and 160 pages, respectively.
To clearly observe the page usage at each level, we measure the numbers of cached pages at the three levels; the results are listed in Table 1. In the Pre-PiBooster group, the PiBooster cache has semi-writable pages ready for allocations at the three levels, in total occupying KB, which is quite similar to the Dyn-PiBooster group, where the cache holds pages in total, occupying KB. As a result, the PiBooster cache in both groups consumes an insignificant amount of memory, always less than MB.

The purpose of the macro-benchmarks is to evaluate the effects of PiBooster on the overall system. All measurements are divided into two groups: 1) the baseline group with the default settings, and 2) the PiBooster group with the PiBooster enabled.

Fig. 12: The latencies on average are reduced by 11%, 21% and 17%, from left to right.
Lmbench is a macro-benchmark tool for measuring the latencies of process creations and exits (i.e., fork+exit, fork+execve, fork+/bin/sh -c), shown in Figure 12. Lmbench is configured with the default parameters, except for the processor MHz and memory range parameters. On our experiment platform, the CPU frequency is 3.20 GHz, and the memory range is set to MB to save measurement time. As illustrated in Figure 12, the fork+exit, fork+execve and fork+/bin/sh -c processes in the PiBooster group cost , , and microseconds, which are %, % and % faster than the ones in the baseline group. We believe that this improvement will significantly benefit workloads that rely on many temporary processes.

Fig. 13: The improvements among all the benchmarks are within 0.42%, and the PiBooster has no negative impact on system performance.

SPECint2006 [10] is an industry-standard benchmark intended for measuring the performance of the CPU and memory. In our experiment, the tool version is SPECint 2006 v1.2, which has 12 benchmarks in total, all invoked with the configuration file linux64-ia32-gcc43+.cfg. All measurement results are listed in Figure 13. Among the benchmarks, all benchmark tools in the PiBooster group produce the same or slightly better performance results, indicating that the PiBooster has no negative effect on system performance. The maximum improvement is about 0.42%.

As we know, the IOTLB is used to accelerate DMA address translation to achieve better performance for I/O devices. Therefore, if there are many IOTLB misses caused by frequent IOTLB flushes, they will introduce negative effects on the I/O performance. However, rIOMMU [22] claims that the overhead caused by walking the IOMMU page tables due to IOTLB misses is so negligible that it cannot be measured with netperf, because the main latency, induced by I/O interrupt processing and the TCP/IP stack, is several orders of magnitude larger than that of walking the page tables. In addition, Amit et al. [6] make similar statements. In this paper, we test the network I/O and disk I/O performance under regular circumstances, and use these experiments to revisit the relationship between IOTLB misses and I/O performance. We measured the network I/O using the netperf tool.
Specifically, we have two machines directly connected through an Ethernet cable. The client on one machine sends a bulk of TCP packets to the server on the other machine. Note that the emulated workload is also enabled on the client machine to trigger IOTLB flushes. The sending buffer is KB and the test lasts 60 seconds. The measurement results of the server are listed in Table 2. By comparing the values of both µ and σ, we find that there is no detectable improvement in the network I/O. Similarly, we also test the disk I/O using lmbench. The results indicate that the disk I/O speed remains the same. These two experiments serve as new evidence supporting the observations in [6], [22].

Throughput (Mbps)         PiBooster      Baseline
Range                     . − .010       87. − .
Arithmetic Mean (µ)       .926           87.
Standard Deviation (σ)    .028           0.

TABLE 2: The netperf results of network I/O indicate that the overhead introduced by the IOTLB misses is negligible.

In fact, the effects of the IOTLB misses are measurable, but measuring them relies on an extremely high-speed I/O device, e.g., Intel's I/O Acceleration Technology [18], or a newly designed IOMMU. For instance, the rIOMMU project [22] establishes a high-performance setting with a newly designed IOMMU and the ibverbs library [2], [15]. We believe that the PiBooster can effectively reduce the overhead in such high-speed settings by eliminating the additional IOTLB misses, and we plan to conduct these experiments in the future.
DISCUSSION
According to the IOMMU specifications [14], [5], a typical IOMMU provides three types of IOTLB invalidation schemes, i.e., global invalidation, domain-selective invalidation, and page-selective invalidation, which differ in granularity. Specifically, the global invalidation always invalidates all IOTLB entries as a whole. The domain-selective invalidation only invalidates the selected VM domain's IOTLB entries, and its performance is better than that of the global invalidation. The page-selective invalidation, which only invalidates the corresponding IOTLB entry, achieves the best performance of the three schemes. If the IOMMU is configured to do global or domain-selective invalidation, an IOTLB flush invalidates all IOTLB entries for at least one domain, inevitably resulting in IOTLB misses for the DMA address translations of that domain. However, if the IOMMU works with the page-selective invalidation scheme, one IOTLB flush only invalidates one IOTLB entry that may not be used immediately, which means that the current DMA request may not be affected, but the invalidation will badly affect DMA transfers in the near future.
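The three granularities can be modeled in a few lines of C over a software IOTLB; the structures below are our simplification of the hardware, not the specification's layout.

```c
#include <stddef.h>

enum iotlb_inval { INVL_GLOBAL, INVL_DOMAIN, INVL_PAGE };

struct iotlb_entry {
    int           valid;
    int           domain;   /* owning VM domain */
    unsigned long iova;     /* I/O virtual address of the cached translation */
};

/* Apply one invalidation request to a software model of the IOTLB,
 * illustrating the three granularities described above. */
static void iotlb_invalidate(struct iotlb_entry *tlb, size_t n,
                             enum iotlb_inval kind,
                             int domain, unsigned long iova)
{
    for (size_t i = 0; i < n; i++) {
        switch (kind) {
        case INVL_GLOBAL:                      /* drop every entry */
            tlb[i].valid = 0;
            break;
        case INVL_DOMAIN:                      /* drop one domain's entries */
            if (tlb[i].domain == domain)
                tlb[i].valid = 0;
            break;
        case INVL_PAGE:                        /* drop a single translation */
            if (tlb[i].domain == domain && tlb[i].iova == iova)
                tlb[i].valid = 0;
            break;
        }
    }
}
```

Even in this toy model, the trade-off is visible: a page-selective request discards one entry, while the other two modes discard translations that unrelated DMA streams were still using.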
Besides the command for explicitly shrinking the PiBooster cache through the virtual files, the PiBooster cache can shrink itself according to a predefined threshold. However, the threshold is likely to differ across workload environments. Thus, we should adjust the threshold according to the characteristics of the workload. An interesting solution is to upgrade the PiBooster cache to make it aware of workload changes and adjust the threshold accordingly. It is not easy to propose such a self-adaptation algorithm, as many factors affect the workload in a real system. Another practical solution is to run a training tool in the guest kernel to calculate the number and the proportion of the needed semi-writable pages and page table pages. However, this approach requires that the workload be relatively stable; otherwise, the end user has to repeat the training over and over again. Fortunately, if the workload is updated in a regular way and the update pattern is known to the end user, the end user can use a shell script to automatically update the threshold with the help of the exported virtual files.
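One conceivable adaptation policy, purely our assumption since the paper leaves the algorithm open, is to track recent per-interval page table demand with an exponential moving average and shrink the cache down to that threshold:

```c
/* Hypothetical self-adaptation sketch: the threshold follows an
 * exponential moving average of observed demand with decay 3/4. */
static unsigned ema_threshold(unsigned prev, unsigned demand)
{
    /* new = (3 * prev + demand) / 4 */
    return (3u * prev + demand) / 4u;
}

/* Number of cached pages the management daemon should release now. */
static unsigned pages_to_release(unsigned cached, unsigned threshold)
{
    return cached > threshold ? cached - threshold : 0;
}
```

A daemon could sample demand each interval, recompute the threshold, and write the resulting cache size into the exported virtual file.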
We believe that the design of the PiBooster cache could also benefit page table allocations and deallocations on bare-metal OSes that work directly on hardware. The usage pattern of page table pages in a paravirtual environment is similar, if not identical, to the one on a bare-metal OS. Based on this observation, the deallocated page table pages are likely to be used in the near future by newly created processes. By caching the deallocated page-table pages, the PiBooster cache can quickly respond to upcoming allocation requests without invoking the system allocators every time. In the future, we plan to port the PiBooster cache onto a bare-metal OS, such as Linux, and fully evaluate its benefits.
RELATED WORK
Guest Page Table Protection.
In paravirtualization, the guest page table becomes a security-critical data structure. To protect the integrity of the page table while allowing legitimate updates, Xen [8] sets the guest page table read-only and validates every update of the guest page tables so as to prevent any malicious access. However, the protected guest page table is still writable for DMA requests [25], [3]. To close this gap, the hypervisor has to enable an I/O virtualization technology (AMD-Vi [5] or Intel VT-d [14]), preventing any DMA access to the guest page table. In this paper, we do not break or downgrade the page-table-based security; instead, we keep its security guarantees and accelerate page table management.
IOTLB Misses Reduction.
There are several existing approaches [6], [22], [31] that analyze and reduce the negative effects of IOTLB misses. Amit et al. [6] first analyze the role of the IOTLB in DMA operations and quantify the performance overhead of IOTLB misses. They then present new software and hardware enhancement strategies to reduce the IOTLB miss rate in order to facilitate DMA address translation. rIOMMU [22] redesigns the architecture of the IOMMU to achieve high performance in DMA transactions, which also largely reduces IOTLB misses. Willmann et al. [31] propose new strategies for Xen to reconfigure the addressing mode of the IOMMU, resulting in fewer IOTLB misses. Different from the previous approaches that attempt to reduce overall IOTLB misses, our approach focuses on eliminating the additional IOTLB misses introduced by the security validations (i.e., the DMA validations) during guest page table allocations and deallocations.
Other I/O Performance Improvements.
There are many schemes trying to improve I/O performance in different directions. The approaches in [23], [27], [28] aim to improve network throughput by optimizing the paravirtual network I/O model. In CDNA [32], the authors propose a method for concurrent and direct network access for virtual machines. Ongaro et al. [26] study the impacts of guest scheduling on guest I/O performance by concurrently running different combinations of processor-intensive, bandwidth-intensive and latency-sensitive workloads. The approaches in [12], [13] attempt to reduce the number of VM exits. The studies in [19], [20], [29], [17], [33] accelerate I/O performance by designating cores to specific uses. Several studies, such as ELI [11] and vIC [4], attempt to reduce the overhead of I/O interrupts in virtual environments.

CONCLUSION
The paravirtual guest OS has two important performance issues in page table management: 1) the long execution paths of page table (de)allocations, and 2) the additional IOTLB flushes introduced by the DMA validations. In this paper, we proposed the PiBooster system to address these problems. We shortened the execution paths of the page table (de)allocations by introducing the PiBooster cache, which quickly responds to page table (de)allocation requests. We introduced a fine-grained validation scheme, which successfully eliminated all additional IOTLB flushes and saved the time cost of enforcing the DMA validations, further shortening the execution paths. We implemented a prototype of the PiBooster and fully evaluated its performance with micro- and macro-benchmarks. The micro experiment results indicated that PiBooster could completely eliminate the additional IOTLB flushes in workload-stable environments, and effectively reduced the (de)allocation time of the page table by 47% on average. The macro-benchmarks showed that the latencies of process creations and exits were reduced by 16% on average, as expected. Moreover, the
SPECINT, lmbench and netperf results indicated that PiBooster had no negative impacts on CPU computation, network I/O, and disk I/O.

REFERENCES
ACM SIGPLAN Notices, vol. 41, no. 11, pp. 2–13, 2006.
[4] I. Ahmad, A. Gulati, and A. Mashtizadeh, "vIC: Interrupt coalescing for virtual machine storage device IO," in , 2011, p. 45.
[5] AMD, "Secure virtual machine architecture reference manual," Dec. 2005.
[6] N. Amit, M. Ben-Yehuda, and B.-A. Yassour, "IOMMU: Strategies for mitigating the IOTLB bottleneck," in Computer Architecture. Springer, 2012, pp. 256–274.
[7] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. New York, NY, USA: ACM, 2003, pp. 164–177.
[8] ——, "Xen and the art of virtualization," ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 164–177, 2003.
[9] J. Bonwick et al., "The slab allocator: An object-caching kernel memory allocator," in USENIX Summer
Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVII. New York, NY, USA: ACM, 2012, pp. 411–422. [Online]. Available: http://doi.acm.org/10.1145/2150976.2151020
[12] A. Gordon, N. Har'El, A. Landau, M. Ben-Yehuda, and A. Traeger, "Towards exitless and efficient paravirtual I/O," in Proceedings of the 5th Annual International Systems and Storage Conference. ACM, 2012, p. 10.
[13] N. Har'El, A. Gordon, A. Landau, M. Ben-Yehuda, A. Traeger, and R. Ladelsky, "Efficient and scalable paravirtual I/O system," in USENIX Annual Technical Conference, 2013, pp. 231–242.
[14] Intel, "Intel virtualization technology for directed I/O," Sep. 2007.
[15] G. Kerr, "Dissecting a small InfiniBand application using the verbs API," arXiv preprint arXiv:1105.1827, 2011.
[16] K. C. Knowlton, "A fast storage allocator," Communications of the ACM, no. 8, pp. 623–624, 1965.
[17] A. Landau, M. Ben-Yehuda, and A. Gordon, "SplitX: Split guest/hypervisor execution on multi-core," in WIOV, 2011.
[18] K. Lauritzen, T. Sawicki, T. Stachura, and C. E. Wilson, "Intel I/O acceleration technology improves network performance, reliability and efficiency," Technology@Intel Magazine, March 2005.
[19] G. Liao, D. Guo, L. Bhuyan, and S. R. King, "Software techniques to improve virtualized I/O performance on multi-core systems," in Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems. ACM, 2008, pp. 161–170.
[20] J. Liu and B. Abali, "Virtualization polling engine (VPE): Using dedicated CPU cores to accelerate I/O virtualization," in Proceedings of the 23rd International Conference on Supercomputing. ACM, 2009, pp. 225–234.
[21] R. Love, Linux Kernel Development, 2nd ed., 2004.
[22] M. Malka, N. Amit, M. Ben-Yehuda, and D. Tsafrir, "rIOMMU: Efficient IOMMU for I/O devices that employ ring buffers," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2015, pp. 355–368.
[23] A. Menon, A. L. Cox, and W. Zwaenepoel, "Optimizing network virtualization in Xen," in Proc. USENIX Annual Technical Conference (USENIX 2006), 2006, pp. 15–28.
[24] P. Mochel, "The sysfs filesystem," in Linux Symposium, 2005, p. 313.
[25] D. G. Murray, G. Milos, and S. Hand, "Improving Xen security through disaggregation," in Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, ser. VEE '08. New York, NY, USA: ACM, 2008, pp. 151–160. [Online]. Available: http://doi.acm.org/10.1145/1346256.1346278
[26] D. Ongaro, A. L. Cox, and S. Rixner, "Scheduling I/O in virtual machine monitors," in Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. ACM, 2008, pp. 1–10.
[27] K. K. Ram, J. R. Santos, Y. Turner, A. L. Cox, and S. Rixner, "Achieving 10 Gb/s using safe and transparent network interface virtualization," in International Conference on Virtual Execution Environments, 2009, pp. 61–70.
[28] J. R. Santos, Y. Turner, G. J. Janakiraman, and I. Pratt, "Bridging the gap between software and hardware techniques for I/O virtualization," in USENIX Annual Technical Conference, 2008, pp. 29–42.
[29] L. Shalev, J. Satran, E. Borovik, and M. Ben-Yehuda, "IsoStack: Highly efficient network processing on dedicated cores," in USENIX Annual Technical Conference, 2010, p. 5.
[30] A. Whitaker, M. Shaw, and S. D. Gribble, "Scale and performance in the Denali isolation kernel," ACM SIGOPS Operating Systems Review, vol. 36, no. SI, pp. 195–209, 2002.
[31] P. Willmann, S. Rixner, and A. L. Cox, "Protection strategies for direct access to virtualized I/O devices," in USENIX Annual Technical Conference, 2008, pp. 15–28.
[32] P. Willmann, J. Shafer, D. Carr, S. Rixner, A. L. Cox, and W. Zwaenepoel, "Concurrent direct network access for virtual machine monitors," in High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on. IEEE, 2007, pp. 306–317.
[33] C. Xu, S. Gamage, H. Lu, R. R. Kompella, and D. Xu, "vTurbo: Accelerating virtual machine I/O processing using designated turbo-sliced core," in USENIX Annual Technical Conference, 2013, pp. 243–254.

Yueqiang Cheng is a Staff Security Scientist at Baidu XLab America. He earned his PhD degree in the School of Information Systems from Singapore Management University under the guidance of Professor Robert H. Deng and Associate Professor Xuhua Ding. His research interests are system security, trustworthy computing, software-only root of trust, and software security.