XOS: An Application-Defined Operating System for Data Center Servers
Chen Zheng ∗†, Lei Wang ∗, Sally A. McKee ‡, Lixin Zhang ∗, Hainan Ye § and Jianfeng Zhan ∗†
∗ State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences
† University of Chinese Academy of Sciences, China
‡ Chalmers University of Technology
§ Beijing Academy of Frontier Sciences and Technology
The corresponding author is Jianfeng Zhan.
Abstract—Rapid growth of datacenter (DC) scale, urgency of cost control, increasing workload diversity, and huge software investment protection place unprecedented demands on the operating system (OS) efficiency, scalability, performance isolation, and backward-compatibility. The traditional OSes are not built to work with deep-hierarchy software stacks, large numbers of cores, tail latency guarantees, and the increasingly rich variety of applications seen in modern DCs, and thus they struggle to meet the demands of such workloads.

This paper presents XOS, an application-defined OS for modern DC servers. Our design moves resource management out of the OS kernel, supports customizable kernel subsystems in user space, and enables elastic partitioning of hardware resources. Specifically, XOS leverages modern hardware support for virtualization to move resource management functionality out of the conventional kernel and into user space, which lets applications achieve near bare-metal performance. We implement XOS on top of Linux to provide backward compatibility. XOS speeds up a set of DC workloads by up to 1.6× over our baseline Linux on a 24-core server, and outperforms the state-of-the-art Dune by up to 3.3× in terms of virtual memory management. In addition, XOS demonstrates good scalability and strong performance isolation.

Index Terms—Operating System, Datacenter, Application-defined, Scalability, Performance Isolation
I. INTRODUCTION
Modern DCs support increasingly diverse workloads that process ever-growing amounts of data. To increase resource utilization, DC servers deploy multiple applications together on one node, but the interference among these applications and the OS lowers individual performance and introduces unpredictability. Most state-of-the-practice and state-of-the-art OSes (including Linux) were designed for computing environments that lack the diversity of resources and workloads found in modern systems and DCs, and they present user-level software with abstracted interfaces to those hardware resources, whose policies are optimized only for a few specific classes of applications. In contrast, DCs need streamlined OSes that can exploit modern multi-core processors, reduce or remove the overheads of resource allocation and management, and better support performance isolation. Although Linux has been widely adopted as a preferred OS for its excellent usability and programmability, it has limitations with respect to performance, scalability, and isolation.
Such general-purpose, one-size-fits-all designs simply cannot meet the needs of all applications. The kernel traditionally controls resource abstraction and allocation, which hides resource-management details but lengthens application execution paths. Giving applications control over functionalities usually reserved for the kernel can streamline systems, improving both individual performance and system throughput [5].

There is a large and growing gap between what DC applications need and what commodity OSes provide. Efforts to bridge this gap range from bringing resource management into user space to bypassing the kernel for I/O operations, avoiding cache pollution by batching system calls, isolating performance via user-level cache control, and factoring OS structures for better scalability on many-core platforms. Nonetheless, several open issues remain. Most approaches to constructing kernel subsystems in user space support only limited capabilities, leaving the traditional kernel responsible for expensive activities like switching contexts, managing virtual memory, and handling interrupts. Furthermore, many instances of such approaches cannot securely expose hardware resources to user space: applications must load code into the kernel (which can affect system stability). Implementations based on virtual machines incur overheads introduced by the hypervisor layer. Finally, many innovative approaches break current programming paradigms, sacrificing support for legacy applications.

Ideally, DC applications should be able to finely control their resources, customize kernel policies for their own benefit, avoid interference from the kernel or other application processes, and scale well with the number of available cores. From 2011 to 2016, in collaboration with Huawei, we carried out a large project to investigate different aspects of DC computing, ranging from benchmarking [6], [7], architecture [8], hardware [9], and OS [4] to programming [10]. Specifically, we explored two different OS architectures for DC computing [11]–[13]. This paper is dedicated to XOS, an application-defined OS architecture that follows three main design principles:
• Resource management should be separated from the OS kernel, which then merely provides resource multiplexing and protection. (The other OS architecture for DC computing is reported in [4].)
Fig. 1: The difference of the XOS nokernel model from the other OS models: (a) monolithic; (b) microkernel; (c) exokernel [1]; (d) unikernel [2]; (e) multikernel [3]; (f) IFTS OS [4]; (g) XOS nokernel.
• Applications should be able to define user-space kernel subsystems that give them direct resource access and customizable resource control.
• Physical resources should be partitioned such that applications have exclusive access to allocated resources.

Our OS architecture provides several benefits. Applications have direct control over hardware resources and direct access to kernel-specific functions; this streamlines execution paths and avoids both kernel overheads and those of crossing between user and kernel spaces. Resource-control policies can be tailored to the classes of applications coexisting on the same OS. And elastic resource partitioning enables better scalability and performance isolation by vastly reducing resource contention, both inside and outside the kernel. We quantify these benefits using microbenchmarks that stress different aspects of the system plus several important DC workloads. XOS outperforms state-of-the-art systems like Linux and Dune/IX [14], [15] by up to 2.3× and 3.3×, respectively, for a virtual memory management microbenchmark. Compared to Linux, XOS speeds up datacenter workloads from BigDataBench [7] by up to 70%. Furthermore, XOS scales 7× better on a 24-core machine and performs 12× better when multiple memory-intensive microbenchmarks are deployed together. XOS keeps tail latency, i.e., the time it takes to complete requests falling into the 99th latency percentile, in check while achieving much higher resource utilization.

II. BACKGROUND AND RELATED WORK
XOS is inspired by and built on much previous work in OSes to minimize the kernel, reduce competition for resources, and specialize functionalities for the needs of specific workloads. Each of these design goals addresses some, but not all, of the requirements for DC workloads. Our application-defined OS model is made possible by hardware support for virtualizing CPU, memory, and I/O resources.

Hardware-assisted virtualization goes back as far as the IBM System/370, whose VM/370 [16] operating system presented each user (among hundreds or even thousands) with a separate virtual machine having its own address space and virtual devices. For modern CPUs, technologies like Intel VT-x [17] break conventional CPU privilege modes into two new modes: VMX root mode and VMX non-root mode.
For modern I/O devices, the single-root I/O virtualization (SR-IOV) [18] standard allows a network interface (in this case PCI Express, or PCI-e) to be safely shared among several software entities. The kernel still manages physical device configuration, but each virtual device can be configured independently from user level.

OS architectures have leveraged virtualization to better meet design goals spanning support for increasing hardware heterogeneity, better scalability, elasticity with respect to resource allocation, fault tolerance, customizability, and support for legacy applications. Most approaches break monolithic kernels into smaller components to specialize functionality and deliver more efficient performance and fault isolation by moving many traditionally privileged activities out of the kernel and into user space.

Early efforts like HURRICANE [19] and Hive [20] were designed for scalable shared-memory (especially NUMA) machines. They deliver greater scalability by implementing microkernels (Figure 1b) based on hierarchical clustering. In contrast, exokernel architectures [1] (Figure 1c) support customizable application-level management of physical resources. A stripped-down kernel securely exports hardware resources to untrusted library operating systems running within the applications themselves. Modern variants use this approach to support Java workloads (IBM's Libra [21]) or Windows applications (Drawbridge [22]). As with Libra, avoiding execution on a traditional Linux kernel that supports legacy applications requires adapting application software. Unikernels such as Mirage and OSv [23] (Figure 1d) represent exokernel variants built for cloud environments.

Early exokernels often loaded application extensions for resource provisioning into the kernel, which poses stability problems. Implementations relying on virtual machine monitors can unacceptably lengthen application execution paths. XOS avoids such overheads by leveraging virtualization hardware to securely control physical resources and only uploading streamlined, specialized kernel subsystems to user space.

The advent of increasingly high core-count machines and the anticipation of future exascale architectures have brought a new emphasis on problems of scalability. Corey [24] implements OS abstractions that allow applications to control inter-core resource sharing, which can significantly improve performance (and thus scalability) on multicore machines. Whereas domains in earlier exokernel/library-OS solutions often shared an OS image, replicated-kernel approaches like Hive's are increasingly being employed. The Multikernel [3] (Figure 1e) runs directly on bare hardware without a virtualization layer; the OS is composed of different CPU drivers, each running on a single core, and the CPU drivers cooperate to maintain a consistent OS view.

Our other DC OS architecture [4] introduces the "isolate first, then share" (IFTS) OS model, in which hardware resources are split among disparate OS instances. The IFTS OS model decomposes the OS into a supervisor and several subOSes, and avoids shared kernel state between applications, which in turn reduces performance loss caused by contention.

Several systems streamline the OS by moving traditional functionalities into user space.
For instance, solutions such as mTCP [25], Chronos [26], and MegaPipe [27] build user-space network stacks to lower or avoid protocol-processing overheads and to exploit common communication patterns. NoHype [28] eliminates the hypervisor to reduce OS noise. XOS shares the philosophy of minimizing the kernel to realize performance much closer to that of bare metal.

Dune [14]/IX [15] and Arrakis [29] employ approaches very similar to XOS. Dune uses nested paging to support user-level control over virtual memory. IX (Dune's successor) and Arrakis use hardware virtualization to separate resource management and scheduling functions from network processing. IX uses a full Linux kernel as the control plane (as in Software-Defined Networking), leveraging memory-mapped I/O to implement pass-through access to the NIC and Intel's VT-x to provide three-way isolation between the control plane, network stack, and application. Arrakis uses Barrelfish as the control plane and exploits the IOMMU and SR-IOV to provide direct I/O access. XOS allows applications to directly access not just I/O subsystems but also hardware features such as memory management and exception handling.

III. THE XOS MODEL
The growing gap between OS capabilities and DC workload requirements necessitates that we rethink the design of OSes for modern DCs. We propose an application-defined OS model. This OS model is guided by three principles: 1) separation of resource management from the kernel; 2) application-defined kernel subsystems; and 3) elastic resource partitioning.
A. Separating Resource Management
We contend that layered kernel abstractions are the main cause of both resource contention and application interference. Achieving near bare-metal performance thus requires that we remove the kernel from the critical path of an application's execution.

In an application-defined OS (Figure 1g), the traditional role of the OS kernel is split. Applications take over OS kernel duties with respect to resource configuration, provisioning, and scheduling. This allows most kernel subsystems to be constructed in user space. The kernel retains responsibility for resource allocation, multiplexing, and protection, but it no longer mediates every application operation. Reducing kernel involvement in application execution has several advantages: first, applications need not trap into kernel space. In current general-purpose OSes like Linux, applications must access resources through the kernel, lengthening their execution paths. The kernel may also interrupt application execution: for instance, a system call invocation typically raises a synchronous exception, which forces two transitions between user and kernel modes. Moreover, it flushes the processor pipeline twice and pollutes critical processor structures, such as the TLB, branch prediction tables, prefetch buffers, and private caches. When a system call competes for shared kernel structures, it may also stall the execution of other processes using those structures.

One challenge in separating resource management from the kernel is how to securely expose hardware to user space. Our model leverages modern virtualization-support hardware. Policies governing the handling of privilege levels, address translation, and exception triggers can enforce security when applications directly interact with hardware. They make it possible for an application-defined OS to give applications the ability to access all privileged instructions, interrupts, exceptions, and cross-kernel calls, and to have direct control over physical resources, including physical memory, CPU cores, I/O devices, and processor structures.
B. Application-Defined Kernel Subsystems
Our OS model is designed to separate all resource management from the kernel, including CPU, memory, I/O, and exceptions. Applications are allowed to customize their own kernel subsystems, choosing the types of hardware resources to expose in user space. Furthermore, applications running on the same node may implement different policies for a given kernel subsystem. Kernel services not built into user space are requested from the kernel just as in normal system calls.

Application-defined kernel subsystems in user space are a major feature of XOS. Applications know what resources they need and how they want to use them. Tradeoffs in traditional OS design are often made to address common application needs, but this leads to poor performance for applications without such "common" needs or behaviors. For example, data analysis workloads with fixed memory requirements will not always benefit from demand paging in the virtual memory subsystem, and network applications often need different network-stack optimizations for better throughput and latency.

In this model, a cell is an XOS process that is granted exclusive resource ownership and that runs in VMX non-root mode. Cells can bypass the kernel to have fine-grained, direct control over physical resources. Unlike other user-space I/O approaches that construct only I/O stacks in user space, XOS allows the construction of any kernel subsystem within each cell. Such subsystems can include paging, physical memory management, I/O, and interrupt handling, and the application-defined OS model allows diverse policies for each subsystem. For example, XOS allows applications to request physical pages in specific colors or banks in order to reduce cache conflicts and enable better performance isolation. The XOS model implements pass-through functionality for PCI-e devices with the help of SR-IOV. To avoid device exhaustion, message-based I/O system calls within the XOS runtime service I/O requests.
C. Elastic Resource Partitioning
In the XOS model, each application has exclusive ownership of the resources allocated to it, including kernel structures and physical resources. Figure 1g shows how elastic partitioning works. Each cell has unrestricted access to its resources, including CPU cores, physical memory, block devices, NICs, and privileged features. It controls its physical resources by scheduling them in whatever way it chooses. A cell with exclusive resources runs with little kernel involvement, since it shares no state with other cells. The kernel only provides multiplexing, protection, and legacy support.

One obvious drawback of such elastic partitioning is that it prevents resource sharing, which may reduce utilization. To overcome this limitation, we build reserved resource pools both in the kernel and in the XOS runtime. These pools can grow or shrink as needed. The kernel's resource pools are per-CPU structures, which ensures that cells do not contend for resources. If a cell exhausts its allocated resources, it can invoke XOS runtime routines to request more resources from the kernel. By carefully accounting for the resources allocated to each cell, the kernel tracks resource consumption for each software component. It can thus guarantee QoS for each cell through performance isolation and judicious resource allocation. For instance, the kernel could choose to devote a fraction of the memory and I/O devices to a resource pool serving a critical cell. The need for elastic partitioning grows when co-existing applications compete for resources, which is common in real-world deployments. Without effective mechanisms to provide performance guarantees, co-existing applications may perform unpredictably, especially if they are kernel-intensive.
Fig. 2: The XOS architecture. Each cell runs its application plus a private XOS runtime (pager, memory manager, interrupt and exception handlers, POSIX and kernel-interaction APIs) in VMX non-root mode ring 0, while the XOS kernel (VMX module, resource module, XOS handler, reserved memory pool, and I/O kernel thread pool) runs in VMX root mode ring 0; cells and kernel interact through VMCALLs and message-based I/O system calls.

IV. XOS IMPLEMENTATION
Figure 2 shows the XOS architecture, which consists of a number of dynamically loadable kernel modules plus the XOS runtimes. The XOS kernel has five more functionalities than an ordinary kernel: initiating and configuring the VMX hardware (the VMX module); allocating and accounting resources for cells (the resource module); handling XOS runtime requests and violations (the XOS handler); handling message-based I/O system calls (the I/O kernel thread pool); and providing a physical memory allocator that reserves a physical memory pool for XOS processes.

The XOS kernel runs in VMX root mode ring 0, and the cells run in VMX non-root mode ring 0. XOS leverages VT-x to enable privileged hardware features to be securely exposed in VMX non-root mode, which allows the cells to directly manage hardware resources without trapping into the kernel. The XOS handler intercepts VM exits caused by privilege violations and VMCALLs initiated by the cells; this is the main means by which a cell interacts with the kernel.

XOS cells can coexist with normal Linux processes. Each cell has an independent virtual address space and a private XOS runtime. The resources assigned to each cell are exclusive and cannot be accessed by other cells. We construct traditional kernel services in user space by inlining kernel subsystems into the XOS runtime. These kernel services can be customized to meet the needs of different workloads. In addition to the kernel services defined in the XOS runtime, other services can be obtained directly from the kernel via hardware traps (VMCALLs) or messages. The runtime wraps user-space OS services with the POSIX API, which provides compatibility with legacy software.

The XOS runtime is a thin, trusted layer that is responsible for resource management and kernel interaction during resource (re)allocation. It is implemented as statically linked libraries. We offer two classes of interfaces, as sketched below. One class includes explicit interfaces for direct hardware control, including pre-allocated pages, colored page allocation, page table entries, TLB entry invalidation, I/O control-space access, and DMA management. The other class includes POSIX-like interfaces: these invoke the inlined kernel functions customized in the XOS runtime, while other system calls are served by the kernel. In particular, when there are too few I/O devices to dedicate one to each cell, the XOS runtime provides dedicated message-based I/O interfaces, redirecting I/O system calls to an I/O system service cell via messages. In such a design, I/O services are deployed in different kernel threads, so the processor structures within cells are not flushed.
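To make the shape of these interfaces concrete, the declarations below sketch the two classes. All names and signatures are illustrative placeholders rather than the actual XOS runtime API, which the paper does not list.

#include <stddef.h>
#include <stdint.h>

/* Class 1: explicit interfaces for direct hardware control (illustrative). */
void    *xos_prealloc_pages(size_t npages);              /* pre-allocated page frames     */
void    *xos_alloc_colored_page(int color);              /* colored page allocation       */
int      xos_set_pte(uint64_t vaddr, uint64_t pte);      /* edit the cell's page table    */
void     xos_invlpg(uint64_t vaddr);                     /* invalidate a TLB entry        */
void    *xos_map_io_space(uint64_t bus_addr, size_t n);  /* I/O control-space access      */
int      xos_dma_map(void *buf, size_t len, uint64_t *iova); /* DMA management           */

/* Class 2: POSIX-like wrappers.  Calls hitting a subsystem inlined into the
 * runtime are served in user space; everything else falls back to the kernel
 * via VMCALLs or message-based I/O system calls. */
void    *xos_mmap(void *addr, size_t len, int prot, int flags, int fd, long off);
long     xos_read(int fd, void *buf, size_t len);
long     xos_write(int fd, const void *buf, size_t len);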
A. Booting a New Cell
In XOS, a normal Linux process is converted into a cell when the process needs acceleration in certain kernel subsystems. XOS needs two mode switches to bring a cell online; the most important part is to set up a suitable operating environment before and after booting. XOS provides a control interface for applications to apply for resources. Through the control interface, applications can specify exclusive resources and privileged features. Once an application invokes the control interface, the VMX module initiates the VT-x hardware via an ioctl() system call, and the application makes the first mode switch into VMX root mode. After that, the resource module allocates exclusive resources from the resource pool. Then the VMX module uploads the original page table into the allocated memory for further user-space page table initialization. The VMX module constructs a new Interrupt Descriptor Table (IDT) for the cell. If the application reserves I/O devices, the IOMMU is configured to map the allocated memory region accordingly. Meanwhile, the XOS handler in the kernel registers new exception handlers for the cell, as specified by the application. Finally, XOS duplicates the current processor state into the VMCS host-state area, and sets the processor state and entry point in the VMCS guest-state area. The VM-execution control fields of the VMCS are also set to identify the privileged features available in user space. In VMX root mode, the VMX module executes the VMLAUNCH instruction so that the process runs in non-root mode as a cell.
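The fragment below sketches how an application might request conversion into a cell through such a control interface. The device path, ioctl request code, and the layout of the cell specification are assumptions made for illustration only; they are not the actual XOS ABI.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Illustrative control interface; names and layout are hypothetical. */
#define XOS_CTL_DEV     "/dev/xos"
#define XOS_IOC_LAUNCH  _IOW('x', 1, struct xos_cell_spec)

struct xos_cell_spec {
    uint64_t mem_bytes;   /* exclusive physical memory to reserve       */
    uint32_t cpu_mask;    /* cores the cell will own                    */
    uint32_t flags;       /* e.g., pass-through NIC, user-level #PF ... */
};

int main(void)
{
    struct xos_cell_spec spec = {
        .mem_bytes = 1UL << 30,   /* 1GB from the reserved pool */
        .cpu_mask  = 0x3,         /* cores 0 and 1              */
        .flags     = 0,
    };

    int fd = open(XOS_CTL_DEV, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* The VMX module sets up the VMCS and issues VMLAUNCH; when the ioctl
     * returns, this process is executing as a cell in VMX non-root mode
     * with its private XOS runtime installed. */
    if (ioctl(fd, XOS_IOC_LAUNCH, &spec) < 0) { perror("ioctl"); close(fd); return 1; }

    close(fd);
    return 0;
}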
B. Memory Management
As many DC applications need large chunks of memory to hold their growing data sets, the memory management subsystem in the kernel can easily become a bottleneck. During memory allocation, the kernel needs to perform multiple actions that may have negative performance implications. For instance, it needs to lock related structures, such as page table entries, before modifying them, and it needs to trigger TLB shootdowns to flush associated TLB entries. When multiple processes make memory allocation calls, the ensuing contention can noticeably slow down all processes involved. XOS differs from other systems in several ways with respect to memory management. Each cell has its own pager and physical memory, and handles virtual memory in the XOS runtime rather than in the kernel; the memory manager shares no state with others. The XOS kernel merely allocates and deallocates memory resources and maintains an access control list for them.
Physical memory management.
XOS implements two-phase physical memory management. The first phase reserves physical memory in the kernel to launch XOS cells; the second is a user-space memory allocator in the XOS runtime that delivers memory management service. We argue that applications can benefit from simple memory allocation with large chunks of contiguous memory. Though fine-grained, discontiguous memory may improve resource utilization, it increases the complexity of the XOS runtime memory management module. In the Linux buddy allocator, the default largest memory chunk that can be allocated is 4MB. Even if we modified the buddy algorithm to allow larger chunk sizes, memory would become fragmented soon after the OS boots. Consequently, we modify the Linux kernel to reserve memory chunks when the OS boots up. The reserved memory is managed by a buddy allocator in the XOS resource module; the maximum chunk allowed in this memory pool is 1024MB. Furthermore, to avoid lock contention when multiple cells apply for memory, we build per-CPU memory pools. The XOS resource manager allocates, reclaims, and records the state of memory usage in each XOS runtime. The XOS runtime maintains another buddy allocator similar to the one adopted in the XOS resource module, but with a much smaller maximum chunk: the maximum supported chunk is 64MB, while the minimum is the base page size. The XOS runtime uses its buddy allocator and memory pool to map smaller parts of these memory regions into the cell's address space.
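The sketch below illustrates the resulting control flow: allocations are satisfied from the cell-local pool, and the runtime falls back to the kernel's reserved per-CPU pool only when the local pool is exhausted. The xos_request_chunk() helper and the bump-pointer pool are simplifications made for illustration; as described above, the real runtime uses a buddy allocator at both levels.

#include <stddef.h>
#include <stdint.h>

#define BASE_PAGE   4096UL
#define POOL_CHUNK  (64UL << 20)   /* 64MB: largest chunk served by the runtime allocator */

/* Hypothetical runtime call that asks the XOS kernel (via a VMCALL) for
 * another large contiguous chunk from the reserved per-CPU memory pool. */
extern void *xos_request_chunk(size_t bytes);

struct xos_pool {
    uint8_t *base;   /* start of the current 64MB chunk          */
    size_t   used;   /* bytes already handed out from this chunk */
};

/* Allocation inside the cell; only falls back to the kernel (a VM exit)
 * when the current chunk is exhausted. */
static void *cell_alloc_pages(struct xos_pool *p, size_t npages)
{
    size_t bytes = npages * BASE_PAGE;
    if (bytes > POOL_CHUNK)
        return NULL;                              /* larger requests go to the kernel */
    if (p->base == NULL || p->used + bytes > POOL_CHUNK) {
        p->base = xos_request_chunk(POOL_CHUNK);  /* rare slow path: VM exit */
        p->used = 0;
        if (p->base == NULL)
            return NULL;
    }
    void *ret = p->base + p->used;
    p->used += bytes;
    return ret;
}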
Virtual memory management.
The XOS runtime handles virtual memory for cells directly, rather than in the kernel. For some applications the memory requirement cannot be predicted in advance, so the XOS resource module allocates large contiguous memory regions and hands them over to the user-space buddy allocator; the XOS runtime then maintains the demand-paging policies. For performance reasons, we build both pre-paging and demand-paging utilities, and an application can choose which one to use. XOS uploads the page table and other process-related kernel data into the cell's address space, and backs up the originals in the kernel. We configure the relevant fields in the VMCS guest-state area, including control registers (e.g., CR3), the interruptibility state, etc. When a normal process becomes a cell by entering VMX non-root mode, the hardware loads the processor structures from the VMCS region; consequently, a cell inherits its initial address space from Linux. To ensure correctness, the kernel mlock()s the already-mapped page frames in the original page table, preventing them from being swapped out.
User-level page fault handler.
Most page faults occur when a process attempts to access an address that is not currently mapped to a physical memory location. The hardware raises a page-fault exception and traps into the exception handler routine. We set bits in the VMCS so that page-fault exceptions do not cause a VM exit in non-root mode. We replace the default Linux IDT with our modified one, which invokes the user-space page fault handler registered inside the XOS runtime. The page fault handler then constructs a new page table entry with a page frame from the user-space buddy allocator. When a cell needs more memory pages than are available in its user-space memory pool, it requests resources from the kernel by triggering a VM exit; the XOS handler in the kernel synchronizes the two page tables, serves the request, and allocates physical memory from the reserved memory pool. With the ability to access its private page table, applications [30] with predictable memory needs can potentially achieve additional performance gains. Functions such as garbage collection can benefit from manipulating page table entries directly, and process live migration can benefit from the user-level page fault handler.
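A minimal sketch of such a user-level pager is shown below. The registration hook, pool allocator, and mapping helper are hypothetical names introduced for illustration; the logic follows the description above: handle the fault entirely inside the cell, and trap to the kernel only when the local pool is empty.

#include <stdint.h>

/* Hypothetical XOS runtime hooks (illustrative only). */
extern void  xos_register_exception(int vector, void (*handler)(uint64_t addr, uint64_t err));
extern void *xos_pool_alloc_frame(void);                          /* frame from the user-space buddy pool */
extern void  xos_map_page(uint64_t vaddr, void *frame, int prot); /* writes the cell's private page table */
extern void  xos_vmcall_refill(void);                             /* slow path: ask the kernel for memory */

#define PF_VECTOR  14
#define PAGE_MASK  (~0xfffULL)
#define PROT_RW    0x3

/* Demand-paging handler run entirely inside the cell: no VM exit and no
 * kernel page walk unless the local pool is empty. */
static void cell_page_fault(uint64_t fault_addr, uint64_t err)
{
    (void)err;
    void *frame = xos_pool_alloc_frame();
    if (frame == NULL) {
        xos_vmcall_refill();          /* trap to the XOS handler, then retry */
        frame = xos_pool_alloc_frame();
    }
    xos_map_page(fault_addr & PAGE_MASK, frame, PROT_RW);
}

static void install_pager(void)
{
    xos_register_exception(PF_VECTOR, cell_page_fault);
}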
C. Interrupt Management
In XOS, the kernel hands I/O devices over to user space, including buffer rings, descriptor queues, and interrupts. As a result, the XOS runtime can implement user-space device drivers to serve I/O operations. Applications then have direct access to I/O devices, bypassing the entire kernel stack, and can also use customized device drivers for further performance improvement; for instance, one could build an aggressive network stack for small-message processing. PCI-e devices are initialized in the Linux kernel. We use pci-stub to reserve devices for given cells. Once a device is allocated to a cell, the descriptor rings and configuration space of that device are all mapped into the cell's address space. While a cell manages a device, no other cell can touch it. In particular, when a PCI-e device has multiple virtual functions (e.g., SR-IOV), XOS passes a single queue through to each cell.

The physical interrupts generated by PCI-e devices raise challenges for the performance of a user-space device manager. When the XOS runtime maintains a user-space device driver, device interrupts are handled directly in user space, because handling them in the kernel would change the context of the process structure and would require redirecting interrupts to specific cores. When allocating a PCI-e device to user space, we deliver its interrupts to the CPUs on which the cell runs. We replace the default Linux IDT with the XOS cell's IDT, which is registered in the XOS runtime. For interrupts permitted in non-root mode, the interrupt handler found through the IDTR handles the interrupt directly in user space; for interrupts not permitted in VMX non-root mode, the hardware traps into the kernel and triggers the corresponding kernel interrupt handler. We configure explicitly in the XOS kernel which interrupts do not cause a VM exit. After an interrupt is processed, the interrupt handler in the XOS runtime completes it with a write to the x2APIC MSR; if the corresponding MSR bitmap in the VMCS is set, signaling interrupt completion does not cause a VM exit.
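The sketch below illustrates this flow for a pass-through NIC: a handler installed in the cell's private IDT drains the device ring and then signals completion by writing the architectural IA32_X2APIC_EOI MSR directly from VMX non-root ring 0. The registration hook is a hypothetical runtime call, and the handler body is schematic.

#include <stdint.h>

#define MSR_X2APIC_EOI  0x80B   /* IA32_X2APIC_EOI */

/* Hypothetical runtime hook that installs a handler in the cell's private IDT. */
extern void xos_register_interrupt(int vector, void (*handler)(void));

/* Signal end-of-interrupt from VMX non-root ring 0.  If the MSR bitmap in the
 * VMCS permits this MSR, the write does not cause a VM exit. */
static inline void x2apic_eoi(void)
{
    uint32_t lo = 0, hi = 0;
    __asm__ volatile("wrmsr" :: "c"(MSR_X2APIC_EOI), "a"(lo), "d"(hi));
}

/* User-space NIC interrupt handler: drain the RX descriptor ring mapped into
 * the cell, then complete the interrupt without a VM exit. */
static void nic_rx_interrupt(void)
{
    /* ... poll the device's RX ring here ... */
    x2apic_eoi();
}

static void install_nic_handler(int vector)
{
    xos_register_interrupt(vector, nic_rx_interrupt);
}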
D. Message-Based I/O System Calls
When there are not enough physical devices or virtual device queues for the cells that need dedicated devices, some cells may be stuck waiting for a device to become available. Meanwhile, context switches due to I/O system calls and external interrupts delivered to a computing process may also degrade performance. To address this issue, we implement message-based I/O system calls that separate kernel I/O operations from the normal execution path of applications.

Figure 2 shows the XOS architecture with message-based system calls. XOS is divided into several parts, with cells and I/O services running on their respective resources. The I/O services run on different CPUs and are given specific devices. I/O services are classified into two classes: polling service threads and serving threads. Polling service threads only poll I/O requests from cells and dispatch them among serving threads. Serving threads receive requests from message queues, perform the received I/O system calls, and respond to the requesting cells. In XOS, we intend to implement the I/O threads in non-root mode so that they serve message-based system calls with user-space device drivers; in the current implementation, we create kernel threads to provide the I/O services. Once a normal process becomes a cell, a shared memory buffer with each I/O serving thread is established. We modified the standard libc, hooked the POSIX I/O system calls with message-based I/O syscalls, and created multiple pthread-like fibers to serve them. Once an I/O system call is invoked, a fiber captures the current cell's context, issues an asynchronous message-based syscall, and yields the execution environment to the next fiber. The message-based I/O syscall writes a request message into the shared memory buffer and waits for the return code. To obtain the best performance for each cell, at least one exclusive serving thread per cell is created to respond to system call requests. As the number of cores increases from one processor generation to the next, we have found it acceptable to bind kernel threads to separate CPU cores; optimizing the placement of these kernel threads is left for future work.

The main challenge in this implementation is synchronizing the context of a cell. To perform asynchronous message-based I/O system calls, an I/O serving thread needs the context of the requesting cell, including its virtual address space structures and file descriptors. To that end, the process-related data is backed up in the kernel thread's address space and updated with every change. An I/O system call message is contained in a fixed-size structure to avoid cache line evictions; it includes the syscall number, parameters, status bits, and the data pointed to by the arguments.
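The fragment below sketches how the hooked libc might post such a request. The message layout and the slot/yield helpers are illustrative assumptions; only the overall protocol follows the description above: fill a fixed-size, cache-line-aligned message, mark it posted, and yield to other fibers until the serving thread marks it done.

#include <stdint.h>
#include <stddef.h>

/* One request per cache line to avoid false sharing between the cell and the
 * serving thread (the layout is illustrative, not the actual XOS format). */
struct xos_io_msg {
    uint32_t syscall_nr;           /* e.g., SYS_write                       */
    uint32_t status;               /* 0 = free, 1 = posted, 2 = done        */
    uint64_t args[4];              /* fd, buffer pointer, length, ...       */
    int64_t  ret;                  /* return code filled by the I/O thread  */
} __attribute__((aligned(64)));

/* Hypothetical fiber hooks provided by the modified libc. */
extern struct xos_io_msg *xos_msg_slot(void);    /* slot in the shared buffer */
extern void               xos_fiber_yield(void); /* switch to the next fiber  */

/* write() as hooked by the XOS libc: post the request, yield while a serving
 * thread executes it on another core, then collect the result. */
static long xos_write(int fd, const void *buf, size_t len)
{
    struct xos_io_msg *m = xos_msg_slot();
    m->syscall_nr = 1;                    /* SYS_write on x86-64 */
    m->args[0] = (uint64_t)fd;
    m->args[1] = (uint64_t)buf;
    m->args[2] = (uint64_t)len;
    __atomic_store_n(&m->status, 1, __ATOMIC_RELEASE);     /* posted */

    while (__atomic_load_n(&m->status, __ATOMIC_ACQUIRE) != 2)
        xos_fiber_yield();                /* let other fibers run */
    return m->ret;
}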
E. Security and Fault Tolerance
To achieve a comparable level of security, no XOS cell should be able to access the data or functions of other cells, Linux processes, or the kernel without permission. Note that this does not mean that applications are completely protected, as vulnerabilities in the XOS runtime and the Linux kernel could still be used to compromise the OS.

To modify or inspect another cell's data, one needs to access its registers or memory. Since each cell runs on exclusive cores, the registers cannot be accessed, and memory access violations are also mitigated: because XOS enforces an access control list for resources, the only way a cell could access another's physical memory would be to alter the pager in the XOS runtime. In the XOS kernel, we set up an integrity check derived from integrity measurement schemes to collect integrity measurements and compare them against the values signed in the kernel, with which we verify that an XOS runtime is running as expected and in a trusted state, ensuring the necessary chain of trust.

The behaviors of applications are constrained by the XOS runtime. Exceptions in user space, such as divide-by-zero, single step, NMI, invalid opcode, and page faults, are caught and handled by the XOS runtime. An XOS cell is considered an "outsider" process by the kernel: when a cell crashes, it is automatically replaced without any reboot.

V. EVALUATION
A. Evaluation Methodology
XOS currently runs on x86-64 multiprocessors. It is implemented on Ubuntu with Linux kernel version 3.13.2. We choose Linux with the same kernel as the baseline and Dune (IX) as another comparison point for XOS. Most state-of-the-art proposals, like Arrakis, OSv, and Barrelfish, are initial prototypes or are built from the bottom up and do not support current Linux software stacks; thus only Dune, which is built on Linux, is chosen as the adversary in our study.

We deploy XOS on a node with two six-core 2.4GHz Intel Xeon E5645 processors, 32GB of memory, a 1TB disk, and one Intel Gigabit Ethernet NIC. Each core has two hardware threads and a private L2 cache. Each processor chip has a 12MB L3 cache shared by the six on-chip cores and supports VT-x.

We use well-understood microbenchmarks and a few publicly available application benchmarks. All of them are tested on the baseline Linux and on XOS; however, because Dune does not provide robust backward compatibility, only the microbenchmarks can run on Dune. The microbenchmarks also include the Will-It-Scale [31] benchmark, which exercises 47 typical system calls for scalability tests, and the Stress [32] benchmark, a performance isolation evaluation tool with configurable CPU, memory, and I/O parameters. The full-application benchmarks used in our study come from BigDataBench, which includes MPI applications for data analysis, e-commerce, search engines, and social networking. We run each test ten times and report the average performance figures.

During the evaluation, hyperthreading of the Xeon processors is enabled, while power management features and Intel Turbo Boost are disabled. The benchmarks are manually pinned to hardware threads, which helps avoid the influence of process scheduling. In order to measure performance precisely, we use rdtsc() to obtain the current time stamp of a CPU. To ensure that the sequence of rdtsc() calls is not reordered by the gcc compiler, rdtsc() is declared volatile.
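A minimal sketch of this measurement harness is shown below; the inline assembly mirrors the volatile rdtsc() wrapper described above (a fully serialized measurement would additionally fence with cpuid or use rdtscp, which this sketch omits).

#include <stdint.h>
#include <stdio.h>

/* Read the time-stamp counter; "volatile" keeps gcc from reordering or
 * merging the two reads around the measured region. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start = rdtsc();
    /* ... operation under test, e.g., a null system call ... */
    uint64_t end = rdtsc();
    printf("elapsed cycles: %llu\n", (unsigned long long)(end - start));
    return 0;
}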
B. Performance
To understand the overhead of OS kernel activities, we conducted experiments that measure the execution cycles of a few simple system calls. A null system call is a system call (e.g., getpid()) that does not invoke other routines; the others are built to directly execute x86 instructions. Due to space limitations, we only present the results for rdtsc, rdtscp, rdpmc, read/write cr8, and load/store of the IDT and GDT here.
Fig. 3: Microbenchmark performance: average cycles vs. memory size (4KB to 1GB) for Linux, Dune, and XOS on (a) sbrk(), (b) mmap(), (c) malloc()/free(), and (d) malloc().
We measure the performance of these system calls on both Linux and the XOS runtime, and categorize the x86 instructions into privileged and unprivileged instructions. Table I shows that a native Linux environment adds an additional 400% overhead to the execution time of a null system call; the overhead in Linux mainly consists of the two mode switches and other architectural impacts such as cache flushing.

TABLE I: System Calls and Privileged Features (Cycles)
        null syscall  rdtsc  rdtscp  rdpmc  read cr8  write cr8  lgdt  sgdt  lidt  sidt
Linux   174           4167   4452    226    N/A       N/A        N/A   N/A   N/A   N/A
XOS     42            65     101     134    55        46         213   173   233   158
TABLE II: Average Cycles for XOS Operations
XOS Operation          Cycles
Launch a cell          198846
Interact with kernel   3090
TABLE III: Average Cycles for Memory Access
        read     write
Linux   305.5    336.0
Dune    202.5    291.2
XOS     1418.0   332.0

Compared to Linux, XOS gains almost 60× performance on unprivileged instructions, including rdtsc, rdtscp, and rdpmc, none of which need to trap into the kernel when running on XOS. For XOS, we constructed VMCS regions and implemented a simple low-level API in the XOS runtime to directly execute privileged instructions in user space. As a result, the user-space x86 privileged instructions in XOS show overheads similar to the unprivileged ones.

To get an initial characterization of XOS, we use a set of microbenchmarks representing the basic operations in big-memory workloads: malloc(), mmap(), brk(), and read/write. Their data sets range from 4KB (a page frame) to 1GB. Each benchmark allocates or maps a fixed-size memory region and randomly writes each page to ensure the page table is set. The results are shown in Figure 3. We can see that XOS is up to 3× and 16× faster than Linux and Dune for malloc() (Figure 3d), 53× and 68× faster for malloc()/free() (Figure 3c), 3.2× and 22× faster for mmap() (Figure 3b), and 2.4× and 30× faster for sbrk() (Figure 3a). The main reason is that XOS provides each process with independent memory access and user-space resource management, while Linux and Dune have to compete for the shared data structures in the monolithic kernel. Moreover, Dune needs to trigger VM exits to obtain resources from the kernel.
Fig. 4: BigDataBench execution times (seconds) for Sort, Grep, Wordcount, Kmeans, and Bayes under Linux and XOS.

VM exits are expensive, which causes poor performance. As the memory size grows from 4KB to 1GB, the elapsed time with XOS shows no significant change, while the elapsed times with Linux and Dune increase by orders of magnitude. The main reason is that XOS processes own exclusive memory resources and do not need to trap into the kernel. For Linux and Dune, most of the elapsed time is spent on page walks. A page walk is a kernel procedure for handling a page-fault exception, which brings inevitable overhead for exception delivery and memory completion in the kernel. In XOS, page-walk overheads are reduced thanks to the user-space page fault handler. In particular, Linux and Dune experience a significant increase in the execution time of malloc()/free() (Figure 3c). malloc()/free() is a benchmark that mallocs fixed-size memory regions, writes to each region, and then frees the allocated memory. The XOS runtime hands the released memory regions back to the XOS resource pool rather than the kernel and completes all memory management in user space, which reduces the chance of contention. These results show that XOS achieves better performance than Linux and Dune. Table III shows that the average times of data reads and writes in XOS are similar to those in Linux, while Dune has slightly better read performance but significantly worse write performance: Dune takes two page walks per page fault, which may stall execution.

The selected application benchmarks from BigDataBench consist of Sort, Grep, Wordcount, Kmeans, and Bayes, all classic representative DC workloads. During the evaluation, the data set for each of them is 10GB. From Figure 4, we can observe that XOS is up to 1.6× faster than Linux in the best case. Compared to the other workloads, Kmeans and Bayes gain less performance improvement because they are more CPU-intensive and do not frequently interact with the OS kernel. The results show that XOS achieves better performance than Linux for common DC workloads, mainly because XOS has efficient built-in user-space resource management and reduces contention in the kernel.

C. Scalability
To evaluate the scalability of XOS, we run system call benchmarks from Will-It-Scale. Each system call benchmark forks multiple processes that intensively invoke a certain system call. We test all of these benchmarks and find that Linux scales poorly for most of the system call implementations: some have a turning point at about six hardware threads, while others have a turning point at about 12 hardware threads. Figure 5 presents the results for brk(), futex(), malloc(), mmap(), and page faults on XOS and Linux with different numbers of hardware threads.
Fig. 5: Scalability tests: invocations per second vs. number of hardware threads for Linux and XOS on (a) brk(), (b) futex(), (c) mmap(), and (d) page fault.
Fig. 6: Tail latency of the Search workload with interference (latency percentile CDFs for XOS search+stress×3, Linux search, and Linux search+stress×3).

The throughput in XOS is better than in Linux by up to 7×. The results show that Linux scales poorly when the core count reaches 15, while XOS consistently achieves good scalability. XOS physically partitions resources and bypasses the entire kernel, thus largely avoiding contention on shared kernel structures. Note, however, that a real application will not be as OS-intensive as these system call benchmarks, which reflect the best-case scenario for XOS.

D. Performance Isolation
As XOS targets DC computing, performance isolation for co-located workloads is a key issue. The final set of experiments evaluates the performance isolation provided by the XOS architecture. In this experiment, the system node was set up to run co-existing DC workloads. The workloads used for these tests were the stress benchmark and Search from BigDataBench. Search is a latency-critical workload deployed on a three-node cluster: the front-end Tomcat distributes requests from client nodes to back-end Nutch index servers. Based on extensive tests, we set the client to 150 requests/second as a tradeoff between throughput and request latency. We run the Nutch servers and the stress benchmark on our target OS node, as Nutch is the bottleneck of Search in our experiment. The stress benchmark is a multi-threaded application in which each thread repeatedly allocates 512MB of memory and touches one byte per page; in this experiment we use a three-thread stress benchmark for stability. We bind each workload to dedicated cores to avoid interference. The latencies of all requests are profiled and presented as cumulative distribution functions (CDFs) in Figure 6, with each request latency normalized to the maximum latency across all experiments. The results show that tail latency in XOS outperforms that in Linux; in particular, the 99th latency percentile in XOS is 3× better than in Linux. In addition, the number of outliers (the length of the tail) is generally much smaller for XOS.

VI. CONCLUSION
This paper explores the OS architecture for DC servers. We propose an application-defined OS model guided by three design principles: separation of resource management from the kernel; application-defined kernel subsystems; and elastic partitioning of the OS kernel and physical resources. We built a Linux-based prototype that adheres to this design model. Our experiments demonstrate XOS's advantages over Linux and Dune in performance, scalability, and performance isolation, while still providing full support for legacy code. We believe that the application-defined OS design is a promising direction for the increasingly rich variety of DC workloads.

ACKNOWLEDGMENT
This work is supported by the National Key Research and Development Plan of China (Grant No. 2016YFB1000600 and 2016YFB1000601). The authors are grateful to the anonymous reviewers for their insightful feedback.
REFERENCES
[1] D. R. Engler, M. F. Kaashoek, and J. O'Toole, Jr., "Exokernel: An operating system architecture for application-level resource management," in Proc. ACM Symposium on Operating Systems Principles (SOSP), Dec. 1995, pp. 251–266.
[2] A. Madhavapeddy, R. Mortier, C. Rotsos, D. Scott, B. Singh, T. Gazagnaire, S. Smith, S. Hand, and J. Crowcroft, "Unikernels: Library operating systems for the cloud," in Proc. ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar. 2013, pp. 461–472.
[3] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania, "The Multikernel: A new OS architecture for scalable multicore systems," in Proc. ACM Symposium on Operating Systems Principles (SOSP), Oct. 2009, pp. 29–44.
[4] G. Lu, J. Zhan, C. Tan, X. Lin, D. Kong, T. Hao, L. Wang, F. Tang, and C. Zheng, "Isolate first, then share: A new OS architecture for datacenter computing," arXiv:1604.01378.
[5] D. Engler, S. K. Gupta, and M. F. Kaashoek, "AVM: Application-level virtual memory," in Proc. IEEE Workshop on Hot Topics in Operating Systems (HotOS), May 1995, pp. 72–77.
[6] Z. Jia, L. Wang, J. Zhan, L. Zhang, and C. Luo, "Characterizing data analysis workloads in data centers," in Proc. IEEE International Symposium on Workload Characterization (IISWC), 2013, pp. 66–76.
[7] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, "BigDataBench: A big data benchmark suite from internet services," in Proc. IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb. 2014, pp. 488–499.
[8] J. Ma, X. Sui, N. Sun, Y. Li, Z. Yu, B. Huang, T. Xu, Z. Yao, Y. Chen, H. Wang, L. Zhang, and Y. Bao, "Supporting differentiated services in computers via Programmable Architecture for Resourcing-on-Demand (PARD)," in Proc. ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar. 2015, pp. 131–143.
[9] R. Hou, T. Jiang, L. Zhang, P. Qi, J. Dong, H. Wang, X. Gu, and S. Zhang, "Cost effective data center servers," in Proc. IEEE International Symposium on High Performance Computer Architecture (HPCA), 2013, pp. 179–187.
[10] W. He, H. Cui, B. Lu, J. Zhao, S. Li, G. Ruan, J. Xue, X. Feng, W. Yang, and Y. Yan, "Hadoop+: Modeling and evaluating the heterogeneity for MapReduce applications in heterogeneous clusters," in Proc. 29th ACM International Conference on Supercomputing (ICS), 2015, pp. 143–153.
[11] C. Zheng, R. Hou, J. Zhan, and L. Zhang, "Method and apparatus for accessing hardware resource," US Patent 9,529,650, Dec. 27, 2016.
[12] C. Zheng, L. Fu, J. Zhan, and L. Zhang, "Method and apparatus for accessing physical resources," US Patent App. 15/160,863, Sep. 15, 2016.
[13] G. Lu, J. Zhan, Y. Gao, C. Tan, and D. Xue, "Resource processing method, operating system, and device," US Patent App. 15/175,742, Oct. 6, 2016.
[14] A. Belay, A. Bittau, A. Mashtizadeh, D. Terei, D. Mazières, and C. Kozyrakis, "Dune: Safe user-level access to privileged CPU features," in Proc. USENIX Conference on Operating Systems Design and Implementation (OSDI), Oct. 2012, pp. 335–348.
[15] A. Belay, G. Prekas, A. Klimovic, S. Grossman, C. Kozyrakis, and E. Bugnion, "IX: A protected dataplane operating system for high throughput and low latency," in Proc. USENIX Symposium on Operating Systems Design and Implementation (OSDI), Oct. 2014, pp. 49–65.
[16] R. J. Creasy, "The origin of the VM/370 time-sharing system," IBM Journal of Research and Development, vol. 25, no. 5, pp. 483–490, Sep. 1981.
[17] R. Uhlig, G. Neiger, D. Rodgers, A. L. Santoni, F. C. M. Martins, A. V. Anderson, S. M. Bennett, A. Kagi, F. H. Leung, and L. Smith, "Intel virtualization technology," Computer, vol. 38, no. 5, pp. 48–56, May 2005.
[18] P. Kutch, "PCI-SIG SR-IOV primer: An introduction to SR-IOV technology," Intel Application Note 321211-002, Jan. 2011.
[19] R. Unrau, O. Krieger, B. Gamsa, and M. Stumm, "Hierarchical clustering: A structure for scalable multiprocessor operating system design," Journal of Supercomputing, vol. 9, pp. 105–134, Mar. 1995.
[20] J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, and A. Gupta, "Hive: Fault containment for shared-memory multiprocessors," in Proc. ACM Symposium on Operating Systems Principles (SOSP), Dec. 1995, pp. 12–25.
[21] G. Ammons, J. Appavoo, M. Butrico, D. Da Silva, D. Grove, K. Kawachiya, O. Krieger, B. Rosenburg, E. Van Hensbergen, and R. W. Wisniewski, "Libra: A library operating system for a JVM in a virtualized execution environment," in Proc. ACM International Conference on Virtual Execution Environments (VEE), Jun. 2007, pp. 44–54.
[22] D. E. Porter, S. Boyd-Wickizer, J. Howell, R. Olinsky, and G. C. Hunt, "Rethinking the library OS from the top down," in Proc. ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar. 2011, pp. 291–304.
[23] A. Kivity, D. Laor, G. Costa, P. Enberg, N. Har'El, D. Marti, and V. Zolotarov, "OSv: Optimizing the operating system for virtual machines," in Proc. USENIX Annual Technical Conference (USENIX ATC), Jun. 2014, pp. 61–72.
[24] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang, "Corey: An operating system for many cores," in Proc. USENIX Conference on Operating Systems Design and Implementation (OSDI), Dec. 2008, pp. 43–57.
[25] E. Jeong, S. Wood, M. Jamshed, H. Jeong, S. Ihm, D. Han, and K. Park, "mTCP: A highly scalable user-level TCP stack for multicore systems," in Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), Apr. 2014, pp. 489–502.
[26] R. Kapoor, G. Porter, M. Tewari, G. M. Voelker, and A. Vahdat, "Chronos: Predictable low latency for data center applications," in Proc. ACM Symposium on Cloud Computing (SoCC), Oct. 2012, pp. 9:1–9:14.
[27] S. Han, S. Marshall, B.-G. Chun, and S. Ratnasamy, "MegaPipe: A new programming interface for scalable network I/O," in Proc. USENIX Symposium on Operating Systems Design and Implementation (OSDI), Oct. 2012, pp. 135–148.
[28] E. Keller, J. Szefer, J. Rexford, and R. B. Lee, "NoHype: Virtualized cloud infrastructure without the virtualization," in Proc. ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), Jun. 2010, pp. 350–361.
[29] S. Peter, J. Li, I. Zhang, D. R. K. Ports, D. Woos, A. Krishnamurthy, T. Anderson, and T. Roscoe, "Arrakis: The operating system is the control plane," in Proc. USENIX Symposium on Operating Systems Design and Implementation (OSDI), Oct. 2014, pp. 1–16.
[30] R. Riesen and K. Ferreira, "An extensible operating system design for large-scale parallel machines," Sandia National Laboratories, Tech. Rep. SAND09-2660, Apr. 2009, https://cfwebprod.sandia.gov/cfdocs/CompResearch/templates/insert/pubs.cfm.
[31] A. Blanchard, "Will It Scale benchmark suite," https://github.com/antonblanchard/will-it-scale, 2015 [Online; accessed 8-July-2016].
[32] A. Waterland, "stress benchmark suite," http://people.seas.harvard.edu/~