Multiverse: Easy Conversion of Runtime Systems into OS Kernels via Automatic Hybridization
Kyle C. Hale
Department of Computer Science, Illinois Institute of Technology
[email protected]
Conor Hetland, Peter Dinda
Department of Electrical Engineering and Computer Science, Northwestern University
[email protected], [email protected]

©2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/ICAC.2017.24 (https://dx.doi.org/10.1109/ICAC.2017.24)
Abstract—The hybrid runtime (HRT) model offers a path towards high performance and efficiency. By integrating the OS kernel, runtime, and application, an HRT allows the runtime developer to leverage the full feature set of the hardware and specialize OS services to the runtime's needs. However, conforming to the HRT model currently requires a port of the runtime to the kernel level, for example to the Nautilus kernel framework, and this requires knowledge of kernel internals. In response, we developed Multiverse, a system that bridges the gap between a built-from-scratch HRT and a legacy runtime system. Multiverse allows unmodified applications and runtimes to be brought into the HRT model without any porting effort whatsoever by splitting the execution of the application between the domains of a legacy OS and an HRT environment. We describe the design and implementation of Multiverse and illustrate its capabilities using the massive, widely-used Racket runtime system.
I. INTRODUCTION
Runtime systems can gain significant benefits from executing in a tailored software environment, such as our Hybrid Runtime (HRT) [18]. In an HRT, a light-weight kernel framework (called an AeroKernel), a runtime, and an application coalesce into a single kernel-level entity. The OS is this composite of the application, runtime, and AeroKernel. As such, the runtime and application enjoy a base platform of fully privileged access to the underlying hardware, and can also construct task-appropriate abstractions on top of this base, instead of being limited to abstractions provided by a commodity OS. These capabilities demonstrably enhance performance, scalability, and efficiency, particularly for parallel runtime systems running on current NUMA server hardware and next-generation high core-count multicore processors. The capabilities also enable forms of adaptation, both during the design process and during execution, that are simply not available to user-level systems.

An AeroKernel facilitates the creation of HRTs by providing core kernel functionality and optional mechanisms whose interfaces are geared to user-level developers instead of kernel developers. An AeroKernel helps ease the migration of user-level code to kernel level. The motivation for an AeroKernel draws from the reliable performance of light-weight kernels [22], [21], [16], the philosophy regarding kernel abstractions of Exokernel [12], new techniques and ideas developed in multicore OS research [23], [13], and the simplicity of other experimental OSes from previous decades [20], [28]. In this paper, we leverage the Nautilus AeroKernel [17], which we describe in more detail in Section II.

Prior to the work and system we describe here, the implementation of an HRT consisted entirely of manual processes. HRT developers needed first to extend an AeroKernel framework such as Nautilus with the functionality the runtime needed. The HRT developers would then port the runtime to this AeroKernel manually.
While a manual port can produce the highest performance gains, it requires an intimate familiarity with the runtime system's functional requirements, which may not be obvious. These requirements must then be implemented in the AeroKernel layer and the AeroKernel and runtime combined. This requires a deep understanding of kernel development. This manual process is also iterative: the developer adds AeroKernel functionality until the runtime works correctly. The end result might be that the AeroKernel interfaces support a small subset of POSIX, or that the runtime developer replaces such functionality with custom interfaces.

While such a development model is tractable ([17] gives three examples), it represents a substantial barrier to entry to creating HRTs, which we seek here to lower. The manual porting method is additive in nature: we must add functionality until we arrive at a working system. A more expedient method would allow us to start with a working HRT produced by an automatic process, and then incrementally extend and specialize it to enhance its performance.

The Multiverse system we describe in this paper supports such a method using a technique called automatic hybridization to create a working HRT from an existing, unmodified runtime and application. With Multiverse, runtime developers can take an incremental path towards adapting their systems to run in the HRT model. From the user's perspective, a hybridized runtime and application behaves the same as the original. It can be run from a Linux command line and interact with the user just like any other executable. But internally, it executes in kernel mode as an HRT.

While in this paper we present one instance of an AeroKernel, Multiverse can work with any AeroKernel (or specialized OS kernel). Such pairings could enable new forms of adaptive computing, both in datacenter and HPC environments.
For example, hybridization decisions could be made at runtime to merge an application or runtime system with the most suitable specialized OS kernel in response to, e.g., application characteristics or phases, hardware capabilities, or energy constraints. In this sense, Multiverse would behave as a kind of dynamic linker for applications/runtimes and AeroKernels in that it would support runtime binding of application functionality to specialized kernel services.

By enabling an existing Linux program to run partially in kernel mode as an HRT, Multiverse has the potential to expose additional adaptation mechanisms, enable additional adaptation policies, and allow both to be incrementally added to the program. That is, Multiverse expands the range of autonomic computing possible within the program. For example, the initial program could be extended over time to have direct control over the hardware paging and timer mechanisms when it is running in HRT mode. A specialized policy could then be added to the program to drive these mechanisms, for example to lower TLB miss rates by making use of program-specific information, and to achieve program-specific desirable scheduling behavior, such as real-time behavior.

Although our general focus has been on supporting parallel programs and runtimes, we note that a related concept to HRTs, namely Unikernels [24], has garnered considerable interest in the cloud/datacenter computing space. A similar premise to that given above applies in this space: a Multiverse-like approach expands the range of autonomic computing possible.

Multiverse bridges a specialized HRT with a legacy environment by borrowing functionality from a legacy OS, such as Linux. Functions not provided by the existing AeroKernel are forwarded to another core that is running the legacy OS, which handles them and returns their results.
The runtime developer can examine traces of these forwarded events, identify hot spots in the legacy interface, and move their implementations (possibly even changing their interfaces) into the AeroKernel. The porting process with Multiverse is subtractive in that a developer iteratively removes dependencies on the legacy OS. At the same time, the developer can take advantage of the kernel-level environment of the HRT.

To demonstrate the capabilities of Multiverse, we automatically hybridize the Racket runtime system. Racket has a complex, JIT-based runtime system with garbage collection and makes extensive use of the Linux system call interface, memory protection mechanisms, and external libraries. Hybridized Racket executes in kernel mode as an HRT, and yet the user sees precisely the same interface (an interactive REPL environment, for example) as out-of-the-box Racket.

Our contributions in this paper are as follows:
• We introduce the concept of automatic hybridization for transforming runtime systems and their applications into HRTs, enabling them to run in kernel mode with full access to hardware features and the ability to adapt the kernel to their needs.
• We describe the design of Multiverse, an implementation of automatic hybridization that combines compile-time, link-time, run-time, and virtualization-based techniques.
• We demonstrate automatic hybridization with Multiverse by transforming the Racket runtime into an HRT.
• We evaluate the performance of Multiverse.
Multiverse will be made publicly available.

II. HRT AND HVM

Multiverse builds on the previously described Nautilus AeroKernel and Hybrid Virtual Machine [18], [17], which were developed to support the hybrid runtime (HRT) model. We describe the key salient findings and components here.

The core premise of the HRT model is that by moving the runtime (and its application) to the kernel level, we enable the runtime developer to leverage all hardware features (including privileged features), and to specialize kernel features specifically for the runtime's needs. These capabilities in turn allow for greater performance or efficiency than possible at user level. The Nautilus AeroKernel facilitates doing exactly this. Nautilus runs on bare metal or under virtualization on x64 machines and the Intel Xeon Phi. It is open source (MIT license) and publicly available.

Three runtimes have been hand-ported to Nautilus, namely Legion [4], the NESL VCODE interpreter [8], and the runtime of a home-grown nested data parallel language. Using the HPCG (High Performance Conjugate Gradient) benchmark [10], [19] developed by Sandia National Labs and ported to Legion by Los Alamos National Labs, speedups over Linux of up to 20% for the Intel Xeon Phi, and up to 40% for a 4-socket, 64-core x64 AMD Opteron 6272 machine were measured. Nautilus provides basic primitives, for example thread creation and events, that outperform Linux by orders of magnitude because they are designed to support runtimes in lieu of general-purpose computing, and because there are no kernel/user boundaries to cross. This, combined with hardware and software capabilities available only in kernel mode, for example complete interrupt control and runtime-specific scheduling, leads to these performance gains in applications.

Multiverse also builds on the Hybrid Virtual Machine (HVM), an extension to the open source (BSD license) Palacios VMM [22], and available in its repository.
HVM allows for the creation of a VM whose memory, cores, and interrupt logic are segregated so that one VM simultaneously runs two operating systems, the "Regular Operating System" (ROS) (e.g., Linux) and an HRT-based OS (e.g., Nautilus). The ROS runs on a partition of the cores and can only see and touch the ROS cores and the ROS subset of physical memory. In contrast, the HRT, while only allowed to run on its own distinct partition of the cores, has full access to all the memory, cores, and interrupt logic of the entire VM. The HVM design and the Nautilus design are closely coupled, with the result being that HRTs based on Nautilus can execute with negligible virtualization overheads on Palacios. The ROS and HRT can be booted and rebooted independently, the latter with a latency comparable to a fork()/exec() in Linux. That is, a boot of a Nautilus-based HRT is comparable to a Linux process creation.

III. MULTIVERSE
We designed the Multiverse system to support automatic hybridization of existing runtimes and applications that run at user level on Linux platforms.
A. Perspectives
Multiverse's goal is to ease the path for developers transforming a runtime into an HRT. We seek to make the system look like a compilation option from the developer's perspective. That is, to the greatest extent possible, the HRT is a compilation target. Compiling to an HRT results in an executable that is a "fat binary" containing additional code and data that enables kernel-mode execution in an environment that supports it. An HVM-enabled virtual machine on Palacios is the first such environment. The developer can extend this incrementally; Multiverse facilitates a path for runtime and application developers to explore how to specialize their HRT to the full hardware feature set and the extensible kernel environment of the AeroKernel.

From the user's perspective, the executable behaves as if it were compiled for a user-level Linux environment. The user sees no difference between HRT and user-level execution.
B. Techniques
The Multiverse system relies on three key techniques: split execution, event channels, and state superpositions.
Split execution:
In Multiverse, a runtime and its application begin their execution in the ROS as an ordinary Linux process. Through a well-defined interface discussed in Section III-C, the runtime on the ROS side can spawn an execution context in the HRT. At this point, Multiverse splits its execution into two components, each running in a different context; one executes in the ROS and the other in the HRT; there is now a Linux process and a kernel. The semantics of these execution contexts differ from traditional threads depending on their characteristics. We discuss these differences in Section IV. In the current implementation, the context on the ROS side comprises a Linux thread, the context on the HRT side comprises an AeroKernel thread, and we refer to them collectively as an execution group. While execution groups in our current system consist of threads in different OSes, this need not be true in general. The context on the HRT side executes until it triggers a fault, a system call, or other event. The execution group then converges on this event, with each side participating in a protocol for requesting events and receiving results. This protocol exchange occurs in the context of HVM event channels, which we discuss below.

Figure 1 illustrates the split execution of Multiverse for a ROS/HRT execution group. At this point, the ROS has already made a request to create a new context in the HRT, e.g. through an asynchronous function invocation. When the HRT thread begins executing in the HRT side, exceptional events, such as page faults, system calls, and other exceptions vector to stub handlers in the AeroKernel (1). The AeroKernel redirects these events through an event channel (2) to request handling in the ROS. The VMM then injects these into the originating
ROS thread, which can take action on them directly (3). For example, in the case of a page fault that occurs in the ROS portion of the virtual address space, the HVM library replicates the access, which will cause the same exception to occur on the ROS core. The ROS will then handle it as it would normally. For events that need direct handling by the ROS kernel, such as system calls, the HVM library can forward them (4).

Fig. 1: Split execution in Multiverse.

Item                                  Cycles   Time
Address Space Merger                  ~33 K    15 µs
Asynchronous Call                     ~25 K    11 µs
Synchronous Call (different socket)   ~790     359 ns

Fig. 2: Round-trip latencies of ROS ↔ HRT interactions.
Event channels:
When the HRT needs functionality that the ROS implements, access to that functionality occurs over event channels, event-based, VMM-controlled communication channels between the two contexts. The VMM only expects that the execution group adheres to a strict protocol for event requests and completion.

Figure 2 shows the measured latency of event channels with the Nautilus AeroKernel performing the role of HRT. Note that these calls are bounded from below by the latency of hypercalls to the VMM.
State superpositions:
In order to forego the addition of burdensome complexity to the AeroKernel environment, it helps to leverage functions in the ROS other than those that lie at a system call boundary. This includes functionality implemented in libraries and more opaque functionality like optimized system calls in the vdso and the vsyscall page. To use this functionality, Multiverse can set up the HRT and ROS to share portions of their address space, in this case the user-space portion. Aside from the address space merger itself, Multiverse leverages other state superpositions to support a shared address space, including superpositions of the ROS GDT and thread-local storage state.

In principle, we could superimpose any piece of state visible to the VMM. The ROS or the runtime need not be aware of this state, but the state is nonetheless necessary for facilitating a simple and approachable usage model.
Fig. 3: Merged address space between ROS and HRT.

The superposition we leverage most in Multiverse is a merged address space between the ROS and the HRT, depicted in Figure 3. The merged address space allows execution in the HRT without a need for implementing ROS-compatible functionality. When a merged address space takes effect, the HRT can use the same user-mode virtual addresses present in the ROS. For example, the runtime in the ROS might load files and construct a complex pointer-based data structure in memory. It can then invoke a function within its counterpart in the HRT to compute over that data.
C. Usage models
The Multiverse system is designed to give maximum flexibility to application and runtime developers in order to encourage exploration of the HRT model. While the degree to which a developer leverages Multiverse can vary, for the purposes of this paper we classify the usage model into three categories, discussed below.
Native:
In the native model, the application/runtime is ported to operate fully within the HRT/AeroKernel setting. That is, it does not use any functionality not exported by the AeroKernel, such as glibc functionality or system calls like mmap(). This category allows maximum performance, but requires more effort, especially in the compilation process. The ROS side is essentially unnecessary for this usage model, but may be used to simplify the initiation of HRT execution (e.g. requesting an HRT boot). The native model is also native in another sense: it can execute on bare metal without any virtualization support.
Accelerator:
In this model, the app/runtime developer leverages both legacy (e.g. Linux) functionality and AeroKernel functionality. This requires less effort, but allows the developer to explore some of the benefits of running their code in an HRT. Linux functionality is enabled by the merged address space discussed previously, but the developer can also leverage AeroKernel functions.

Figure 4 shows a small example of code that will create a new HRT thread and use event channels and state superposition to execute to completion. Runtime initialization is opaque to the user, much like C runtime initialization code. When the program invokes the hrt_invoke_func() call, the Multiverse runtime will make a request to the HVM to run routine() in a new thread on the HRT core. Notice how this new thread can call an AeroKernel function directly, and then use the standard printf() routine to print its result. This printf call relies both on a state superposition (merged address space) for the function call linkage to be valid, and on event channels, which will be used when the C library code invokes a system call (e.g. write()).

static void *routine(void *in) {
    void *ret = aerokernel_func();
    printf("Result = %p\n", ret);
    return NULL;
}

int main(int argc, char **argv) {
    hrt_invoke_func(routine);
    return 0;
}

Fig. 4: Example user code adhering to the accelerator model.

Incremental:
The application/runtime executes in the HRT context, but does not initially leverage AeroKernel functionality. Benefits are immediately limited to aspects of the HRT environment. However, the developer need only recompile their application to explore this model. In the incremental model, the path to converting a runtime and its application into a kernel is straightforward. Instead of raising an explicit HRT thread creation request, Multiverse creates a new thread in the HRT corresponding to the program's main() routine. The incremental model also allows parallelism, as legacy threading functionality automatically maps to the corresponding AeroKernel functionality with semantics matching those used in pthreads. The developer can then incrementally expand their usage of hardware- and AeroKernel-specific features.

While the accelerator and incremental usage models rely on the HVM virtualized environment of Palacios, it is important to note that they could also be built on physical partitioning [26]. At its core, HVM provides to Multiverse a resource partitioning, the ability to boot multiple kernels simultaneously on distinct partitions, and the ability for these kernels to share memory and communicate.
D. Function overrides
One way a developer can enhance a generated HRT is through function overrides. The AeroKernel can implement functionality that conforms to the interface of, for example, a standard library function, but that may be more efficient or better suited to the HRT environment. This technique allows users to get some of the benefits of the accelerator model without any explicit porting effort. However, it is up to the AeroKernel developer to ensure that the interface semantics and any usage of global data make sense when using these function overrides. Function overrides are specified in a simple configuration file that is discussed in Section IV.

Figure 5 shows the same code from Figure 4 using function overrides. Here the AeroKernel developer has overridden the standard pthreads routines so that pthread_create() will create a new HRT thread in the same way that hrt_invoke_func() did in the previous example.

static void *routine(void *in) {
    void *ret = aerokernel_func();
    printf("Result = %p\n", ret);
    return NULL;
}

int main(int argc, char **argv) {
    pthread_t t;
    pthread_create(&t, NULL, routine, NULL);
    pthread_join(t, NULL);
    return 0;
}

Fig. 5: Example of user code using overrides.
E. Toolchain
The Multiverse toolchain consists of two main components, the runtime system code and the build setup. The build setup consists of build tools, configuration files, and an AeroKernel binary provided by the AeroKernel developer. To leverage Multiverse, a user integrates their application or runtime with the provided Makefile and rebuilds it. This will result in the compilation of the AeroKernel components necessary for HRT operation and the Multiverse runtime system, including function overrides, exit and signal handlers, and initialization code, into the user program.

IV. IMPLEMENTATION
We now discuss the implementation of Multiverse. This includes the components that are automatically compiled and linked into the application's address space at build time and the parts of Nautilus and the HVM that support event channels and state superpositions. Unless otherwise stated, we assume the incremental usage model discussed in Section III-C.
A. Multiverse runtime initialization
As mentioned in Section III, a new HRT thread must be created from the ROS side (the originating ROS thread). This, however, requires an AeroKernel present on the requested core to create that thread. The runtime component (which includes the user-level HVM library) is in charge of booting an AeroKernel on all required HRT cores during program startup. They can either be booted on demand or at application startup. We use the latter in our current setup.

Our toolchain inserts program initialization hooks before the program's main() function, which carry out runtime initialization, including:
• Registering ROS signal handlers
• Hooking process exit for HRT shutdown
• AeroKernel function linkage
• AeroKernel image installation in the HRT
• AeroKernel boot
• Merging ROS and HRT address spaces
Fig. 6: AeroKernel boot process.
Fig. 7: Interactions within an execution group.
AeroKernel Boot:
Our toolchain embeds an AeroKernel binary into the ROS program's ELF binary. This is the image to be installed in the HRT. At program startup, the Multiverse runtime component parses this embedded AeroKernel binary and sends a request to the HVM asking that it be installed in physical memory, as shown in Figure 6. Multiverse then requests that the AeroKernel be booted on the HRT cores. The boot process brings the AeroKernel up into an event loop that waits for HRT thread creation requests. The above initialization tasks are opaque to the user.
B. Execution model
To implement split execution, we rely on HVM's ability to forward requests from the ROS core to the HRT, along with event channels and merged address spaces.

The runtime developer can use two mechanisms to create HRT threads, as discussed in Section III-C. Furthermore, two types of threads are possible on the HRT side: top-level threads and nested threads. Top-level threads are explicitly created by the ROS. A top-level HRT thread can create its own child threads as well; we classify these as nested threads. The semantics of the two thread types differ slightly in their operation. Nested threads resemble pure AeroKernel threads, but their execution can proceed in the context of the ROS user address space. Top-level threads require extra semantics in the HRT and in the Multiverse component linked with the ROS application.

Partner threads:
Multiverse pairs each top-level HRT thread with a partner thread that executes in the ROS. This thread has two purposes. First, it allows us to preserve join semantics. Second, it gives us the proper thread context in the ROS to initiate a state superposition for the HRT. Figure 7 depicts the creation of HRT threads and their interaction with the ROS. First, in (1) the main thread is created in the ROS. It sets up the runtime environment for Multiverse. When the runtime system creates a thread, e.g. with pthread_create() or with hrt_invoke_func(), Multiverse creates a corresponding partner thread that executes in the ROS (2). It is the duty of the partner thread to allocate a ROS-side stack for a new HRT thread and then invoke the HVM to request a thread creation in the HRT using that stack (3). When the partner creates the HRT thread, it also conveys information to initiate a state superposition that mirrors the ROS-side GDT and ROS-side architectural state corresponding to thread-local storage (primarily the %fs register). The HRT thread can then create as many nested HRT threads as it desires (4). Both top-level HRT threads and nested HRT threads raise events to the ROS through event channels, with the top-level HRT thread's partner thread acting as the communication endpoint (5).

As is typical in threading models, the main thread can wait for HRT threads to finish by using join() semantics, where the joining thread blocks until the child exits. While in theory we could implement the ability to join an HRT thread directly, it would add complexity to both the HRT and the ROS component of Multiverse. Instead, we chose to allow the main thread to join a partner thread directly and provide the guarantee that a partner thread will not exit until its corresponding HRT thread exits on the remote core.
When an HRT thread exits, it signals the ROS of the exit event. When Multiverse creates an HRT thread, it keeps track of the Nautilus thread data (sent from the remote core after creation succeeds), which it uses to build a mapping from HRT threads to partner threads. The thread exit signal handler in the ROS flips a bit in the appropriate partner thread's data structure, notifying it of the HRT thread completion. The partner can then initiate its cleanup routines and exit, at which point the main thread will be unblocked from its initial join().

Function overrides:
In Section III-C we described how a developer can use function overrides to select AeroKernel functionality over default ROS functionality. The Multiverse runtime component enforces default overrides that interpose on pthread function calls. All function overrides operate using function wrappers. For simple function wrappers, the AeroKernel developer can simply make an addition to a configuration file included in the Multiverse toolchain that specifies the function's attributes and argument mappings between the legacy function and the AeroKernel variant. This configuration file then allows Multiverse to automatically generate function wrappers at build time.

When an overridden function is invoked, the wrapper runs instead, consults a stored mapping to find the symbol name for the AeroKernel variant, and performs a lookup to find its HRT virtual address. This symbol lookup currently occurs on every function invocation, so it incurs a non-trivial overhead. A symbol cache, much like that used in the ELF standard, could easily be added to improve lookup times. When the address of the AeroKernel override is resolved, the wrapper invokes the function directly (since it is already executing in the HRT context, where it has appropriate page table mappings for AeroKernel addresses).
C. Event channels
The HVM model enables the building of essentially any communication mechanism between two contexts (in our case, the ROS and HRT), and most of these require no specific support in the HVM. As a consequence, we minimally define the basic communication between the ROS, HRT, and the VMM using shared physical memory, hypercalls, and interrupts.

The user-level code in the ROS can use hypercalls to sequentially request HRT reboots, address space mergers (state superpositions), and asynchronous sequential or parallel function calls. The VMM handles reboots internally, and forwards the other two requests to the HRT as special exceptions or interrupts. Because the VMM and HRT may need to share additional information, they share a data page in memory. For a function call request, the page contains a pointer to the function and its arguments at the start, and the return code at completion. For an address space merger, the page contains the CR3 of the calling process. The HRT indicates to the VMM when it is finished with the current request via a hypercall.

After an address space merger, the user-level code in the ROS can also use a hypercall to initiate synchronous operation with the HRT. This hypercall indicates to the HRT a virtual address which will be used for future synchronization between the HRT and ROS. They can then use a memory-based protocol to communicate, for example to allow the ROS to invoke functions in the HRT without VMM intervention.
D. Merged address spaces
To achieve a merged address space, we leverage the canon-ical 64-bit address space model of x64 processors, and itswide use within existing kernels, such as Linux. In this model,the virtual address space is split into a “lower half” and a“higher half” with a gap in between, the size of which isimplementation dependent. In a typical process model, e.g.,Linux, the lower half is used for user addresses and the higherhalf is used for the kernel.For an HRT that supports it, the HVM arranges that thephysical address space is identity-mapped into the higherhalf of the HRT address space. That is, within the HRT, thephysical address space mapping (including the portion of thephysical address space only the HRT can access) occupies thesame portion of the virtual address space that the ROS kerneloccupies, namely the higher half. Without a merger, the lowerhalf is unmapped and the HRT runs purely out of the higherhalf. When the ROS side requests a merger, we map the lowerhalf of the ROS’s current process address space into the lowerhalf of the HRT address space. omponent SLOC
Component             C     ASM   Perl  Total
Multiverse runtime    2232  65    0     2297
Multiverse toolchain  0     0     130   130
Nautilus additions    1670  0     0     1670
HVM additions         600   38    0     638
Total                 4502  103   130   4735

Fig. 8: Source Lines of Code for Multiverse.
E. Nautilus additions
In order to support Multiverse in the Nautilus AeroKernel, we needed to make several additions to the codebase. Most of these focus on runtime initialization and the correct operation of event channels. When the runtime and application are executing in the HRT, page faults in the ROS portion of the virtual address space must be forwarded. We added a check in the page fault handler to look for ROS virtual addresses and forward them appropriately over an event channel.

One issue with our current method of copying a portion of the PML4 on an address space merger is that we must keep the PML4 synchronized. We must account for situations in which the ROS changes top-level page table mappings, even though such changes are rare. We currently handle this by detecting repeated page faults. Nautilus keeps a per-core variable that tracks recent page faults and matches duplicates. If a duplicate is found, Nautilus re-merges the PML4. More clever schemes to detect this condition are possible, but unnecessary since it does not lie on the critical path.

For correct operation, Multiverse requires that we catch all page faults and forward them to the ROS. If we collect a trace of page faults in the application running natively and under Multiverse, the traces should look identical. However, because the HRT runs in kernel mode, some paging semantics (specifically with copy-on-write) change. By default, an x86 CPU raises a page fault on a write to a read-only page only in user mode; writes to read-only pages while running in ring 0 are allowed to proceed. This issue manifests itself as mysterious memory corruption, e.g., from writes to the zero page. Luckily, there is a bit (WP) in the cr0 control register that enforces write faults in ring 0.

Before we built Multiverse, Nautilus lacked support for system calls, as the HRT operates entirely in kernel mode. However, a legacy application will leverage a wide range of system calls. To support them, we added a small system call stub handler in Nautilus that immediately forwards the system call to the ROS over an event channel.
F. Complexity
Multiverse development took roughly five person-months of effort. Figure 8 shows the amount of code needed to support Multiverse. The entire system is compact and compartmentalized, so that users can experiment with other AeroKernels or runtime systems with relative ease. While the codebase is small, much of the time went into careful design of the execution model and into working out idiosyncrasies of the hybridization, specifically those dealing with operation in kernel mode.

V. EVALUATION
We evaluate Multiverse using a hybridized Racket runtime system running a set of benchmarks from The Computer Language Benchmarks Game. We ran all experiments on a Dell PowerEdge 415 with 8 GB of RAM and an 8-core, 64-bit x86_64 AMD Opteron 4122 clocked at 2.2 GHz. Each CPU core has a single hardware thread, with four cores per socket. The host machine has stock Fedora Linux (2.6.38.6-26.rc1.fc15.x86_64) installed, and is configured for maximum performance in the BIOS. Benchmark results are reported as averages of 10 runs. Experiments in a VM were run on a guest setup consisting of a simple BusyBox distribution running an unmodified Linux 2.6.38-rc5+ image with two cores (one core for the HVM and one core for the ROS) and 1 GB of RAM.

Racket [15], [14] is the most widely used Scheme implementation and has been under continuous development for over 20 years. It is an open-source codebase that is downloaded over 300 times per day. Recently, support has been added to Racket for parallelism via futures [30] and places [31].

The Racket runtime, which comprises over 800,000 lines of code, is a good candidate for testing Multiverse, particularly its most complex usage model, the incremental model, because Racket includes many of the challenging features emblematic of modern dynamic programming languages that make extensive use of the Linux ABI, including system calls, memory mapping, processes, threads, and signals. These features include complex package management via the filesystem, shared-library-based support for native code, JIT compilation, tail-call elimination, live variable analysis (using memory protection), and garbage collection.

Our port of Racket to the HRT model takes the form of an instance of the Racket engine embedded into a simple C program. Racket already provides support for embedding an instance of Racket into C, so it was straightforward to produce a Racket port under the Multiverse framework.
This port uses a conservative garbage collector, the SenoraGC, which is more portable and less performant than the default, precise garbage collector. The port was compiled with GCC 4.6.3. The C program launches a pthread that in turn starts the engine. Combined with the incremental usage model of Multiverse, the result is that the existing, unmodified Racket engine executes entirely in kernel mode as an HRT.

When compiled and linked for regular Linux, our port provides either a REPL interactive interface through which the user can type Scheme, or a command-line batch interface through which the user can execute a Scheme file (which can include other files). When compiled and linked for HRT use, our port behaves identically.

To evaluate the correctness and performance of our port, we tested it on a series of benchmarks submitted to The Computer Language Benchmarks Game [1]. We tested seven different benchmarks: a garbage collection benchmark (binary-tree-2), a permutation benchmark (fannkuch), two implementations of a random DNA sequence generator (fasta and fasta-3), a generation of the Mandelbrot set (mandelbrot-2), an n-body simulation (n-body), and a spectral norm algorithm. Figure 9 characterizes these benchmarks from the low-level perspective. Note that while this is an implementation of a high-level language, the actual execution of Racket programs involves many interactions with the operating system. These exercise Multiverse's system call and fault forwarding mechanisms. The total number of forwarded events is in the last column.

Benchmark       System Calls  Time (User/Sys) (s)  Max Resident Set (KB)  Page Faults  Context Switches  Forwarded Events
spectral-norm   23800         39.39/0.24           182300                 51452        1695              75252
n-body          18763         41.15/0.19           152300                 45064        1430              63827
fasta-3         35115         31.28/0.17           80492                  25418        1075              60533
fasta           29989         12.23/0.10           43568                  14956        627               44945
binary-tree-2   1260          31.98/0.10           82072                  31082        491               32342
mandelbrot-2    3667          7.76/0.05            43600                  14250        291               17917
fannkuch-redux  1279          2.73/0.01            21284                  5358         33                6637

Fig. 9: System utilization for Racket benchmarks. A high-level language has many low-level interactions with the OS.

Fig. 10: Performance of Racket benchmarks running Native, Virtual, and in Multiverse. With Multiverse, the existing, unmodified Racket implementation has been automatically transformed to run entirely in kernel mode, as an HRT, with little to no overhead.

Figure 10 compares the performance of the Racket benchmarks run natively (here, native means on bare metal rather than virtualized) on our hardware, under virtualization, and as an HRT created with Multiverse. Error bars are included, but are barely visible because these workloads run in a predictable way. The key takeaway is that Multiverse performance is on par with native and virtualized performance: Multiverse let us move, with little to no effort, the existing, unmodified Racket runtime into kernel mode and run it as an HRT with little to no overhead.

The small overhead of the Multiverse case compared to the virtualized and native cases is due to the frequent interactions, such as those described above, with the Linux ABI. However, in all but two cases, the hybridized benchmarks actually outperform the equivalent versions running without Multiverse. This is due to the nature of the accelerated environment that the HRT provides, which ameliorates the event channel overheads. As we expect, the two benchmarks that perform worse under Multiverse (n-body and spectral-norm) incur the most overhead from event channel interactions. The most frequent interactions in both cases are due to page faults.

While Figure 9 tells us the number of interactions that occur, we now consider the overhead of each using microbenchmarks. This estimate also applies to page faults, since their forwarding mechanism is identical to that for system calls. The most frequent system calls used in the Racket runtime (independent of any benchmark) are mmap() and munmap(), and so we focus on these two. Figure 11 shows microbenchmark results (averages over 100 runs) for these two calls, comparing virtualized execution and Multiverse. Note that neither system calls nor page faults involve the VMM in the virtualized case. For both system calls, the Multiverse event forwarding facility adds roughly 1500 cycles of overhead. If we multiply this number by the number of forwarded events for these benchmarks listed in Figure 9, we expect that Multiverse will add about 112 million cycles (51 ms) of overhead for spectral-norm, and 96 million cycles (43 ms) for n-body. This is roughly in line with Figure 10. Note that the relative overhead cost for these benchmarks is roughly 0.1%. While the cost per system call is almost doubled here, such events are relatively rare.

Fig. 11: Multiverse event forwarding overheads demonstrated with mmap() and munmap() system calls.

Note that event forwarding will decrease as a runtime is incrementally extended to take advantage of the HRT model. As an illustration of this, we modified Racket's garbage collector to prefault in pages any time that it allocates a range of memory for internal use. We accomplished this with a simple change that added the MAP_POPULATE flag to Racket's internal mmap() invocations. This reduces the number of page faults incurred during the execution of a benchmark, and therefore the number of forwarded events. The results are shown in Figure 12. While this change increases the overall running time for the benchmark (indicating that it is not a change one would introduce in practice), it shows what the relative performance would be with fewer forwarded events. Indeed, with fewer events, the hybridized versions (Multiverse-prefault) of these benchmarks running in a VM now outperform their counterparts (Virtual-prefault).

Fig. 12: Performance of the n-body and spectral-norm benchmarks with and without a modification to the runtime that prefaults in all mapped pages, reducing the number of forwarded events.

It is worth reflecting on what exactly has happened here: we have taken a massive (800K line), complex runtime system off the shelf, run it through Multiverse without changes, and as a result have a version of the runtime system that correctly runs in kernel mode as an HRT and behaves identically, with virtually identical performance. To be clear, all of the Racket runtime except its Linux kernel ABI interactions is seamlessly running as a kernel.
While this codebase is the endpoint for user-level development, it represents a starting point for HRT development in the incremental model.

VI. RELATED WORK
As far as we are aware, Multiverse is the only system that automatically transforms an existing user-level application into a split-execution environment that runs within the context of both a general-purpose OS and a specialized OS.

Work on specialized kernels goes back decades, and the design of Nautilus is heavily influenced by much of this early work, including Exokernels [11], [12], SPIN [7], Scout [25], KeyKOS [9], and ADEOS [33]. This line of work has recently been revitalized in the context of Unikernels [24].

The Dune system [6] allows a special kernel module to promote selected processes to ones that can access privileged CPU features on a legacy Linux system. Dune leverages virtualization support to give applications the ability to access, for example, page tables and protection hardware. While Dune gives applications access to previously unavailable hardware, it does so from within the context of a Linux process. Unlike Multiverse, Dune does not give the application the capability to run in an entirely separate OS.

Libra [3] bears similarities to our system in its overall architecture. A Java Virtual Machine (JVM) runs on top of the Libra libOS, which in turn executes under virtualization. A general-purpose OS runs in a controller partition and accepts requests for legacy functionality from the JVM/Libra partition. This system involved a manual port. In contrast, the HVM gives us a more powerful mechanism for sharing between the ROS and HRT, as they share a large portion of the address space. This allows us to leverage complex functionality in the ROS, like shared libraries and symbol resolution. Furthermore, the Libra system does not provide a way to automatically create these specialized JVMs from their legacy counterparts.

The Blue Gene/L series of supercomputer nodes runs a Lightweight Kernel (LWK) called the Blue Gene/L Run Time Supervisor (BLRTS) [2] that shares an address space with applications and forwards system calls to a specialized I/O node.
While the bridging mechanism between the nodes is similar, there is no mechanism for porting a legacy application to BLRTS. Others in the HPC community have proposed similar solutions that bridge a full-weight kernel with an LWK in a hybrid model. Examples of this approach include mOS [32], ARGO [5], and IHK/McKernel [29]. The Pisces Co-Kernel [27] treats performance isolation as its primary goal and can partition hardware between enclaves, or isolated OS/Rs, which can involve different specialized OS kernels.

In contrast to the above systems, the HRT model is the only one that allows a runtime to act as a kernel, enjoying fully privileged access to the underlying hardware. Furthermore, as far as we are aware, none of these systems provides an automated mechanism for producing an initial port to the specialized OS/R environment.

VII. CONCLUSIONS
We introduced Multiverse, a system that implements automatic hybridization of runtime systems in order to transform them into hybrid runtimes (HRTs). We illustrated the design and implementation of Multiverse and described how runtime developers can use it for incremental porting of runtimes and applications from a legacy OS to a specialized AeroKernel. To demonstrate its power, we used Multiverse to automatically hybridize the Racket runtime system, a complex, widely-used, JIT-based runtime. With automatic hybridization, we can take an existing Linux version of a runtime or application and automatically transform it into a package that appears to run just like any other program, but actually executes on remote cores in kernel mode, in the context of an HRT, and with full access to the underlying hardware. We evaluated the performance overheads of an unoptimized Multiverse hybridization of Racket and showed that performance varies with the usage of legacy functionality. In cases where such use is minimized, the hybridized runtime can outperform the baseline. Runtime developers can use Multiverse to start with a working system and incrementally migrate hot-spot functionality to custom components within an AeroKernel.
REFERENCES

[1] The Computer Language Benchmarks Game. http://benchmarksgame.alioth.debian.org/.
[2] G. Almási, R. Bellofatto, J. Brunheroto, C. Caşcaval, J. Castaños, L. Ceze, P. Crumley, C. C. Erway, J. Gagliano, D. Lieber, X. Martorell, J. E. Moreira, A. Sanomiya, and K. Strauss. An overview of the Blue Gene/L system software organization. In Proceedings of the Euro-Par Conference on Parallel and Distributed Computing (EuroPar 2003), Aug. 2003.
[3] G. Ammons, J. Appavoo, M. Butrico, D. Da Silva, D. Grove, K. Kawachiya, O. Krieger, B. Rosenburg, E. Van Hensbergen, and R. W. Wisniewski. Libra: A library operating system for a JVM in a virtualized execution environment. In Proceedings of the 3rd International Conference on Virtual Execution Environments (VEE 2007), pages 44–54, June 2007.
[4] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. Legion: Expressing locality and independence with logical regions. In Proceedings of Supercomputing (SC 2012), Nov. 2012.
[6] A. Belay, A. Bittau, A. Mashtizadeh, D. Terei, D. Mazières, and C. Kozyrakis. Dune: Safe user-level access to privileged CPU features. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI 2012), pages 335–348, Oct. 2012.
[7] B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. E. Fiuczynski, D. Becker, C. Chambers, and S. Eggers. Extensibility, safety and performance in the SPIN operating system. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP 1995), pages 267–283, Dec. 1995.
[8] G. E. Blelloch, S. Chatterjee, J. Hardwick, J. Sipelstein, and M. Zagha. Implementation of a portable nested data-parallel language. Journal of Parallel and Distributed Computing, 21(1):4–14, Apr. 1994.
[9] A. C. Bomberger, W. S. Frantz, A. C. Hardy, N. Hardy, C. R. Landau, and J. S. Shapiro. The KeyKOS nanokernel architecture. In Proceedings of the USENIX Workshop on Micro-kernels and Other Kernel Architectures, pages 95–112, Apr. 1992.
[10] J. Dongarra and M. A. Heroux. Toward a new metric for ranking high performance computing systems. Technical Report SAND2013-4744, Sandia National Laboratories, June 2013.
[11] D. R. Engler and M. F. Kaashoek. Exterminate all operating system abstractions. In Proceedings of the 5th Workshop on Hot Topics in Operating Systems (HotOS 1995), pages 78–83, May 1995.
[12] D. R. Engler, M. F. Kaashoek, and J. O'Toole, Jr. Exokernel: An operating system architecture for application-level resource management. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP 1995), pages 251–266, Dec. 1995.
[13] D. G. Feitelson and L. Rudolph. Gang scheduling performance benefits for fine-grain synchronization. Journal of Parallel and Distributed Computing, 16(4):306–318, Dec. 1992.
[14] M. Felleisen, R. B. Findler, M. Flatt, S. Krishnamurthi, E. Barzilay, J. McCarthy, and S. Tobin-Hochstadt. The Racket Manifesto. In T. Ball, R. Bodik, S. Krishnamurthi, B. S. Lerner, and G. Morrisett, editors, 1st Summit on Advances in Programming Languages (SNAPL 2015), volume 32 of Leibniz International Proceedings in Informatics (LIPIcs), pages 113–128, Dagstuhl, Germany, 2015. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
[15] M. Flatt and PLT. Reference: Racket. Technical Report PLT-TR-2010-1, PLT Design Inc., 2010. https://racket-lang.org/tr1/.
[16] M. Giampapa, T. Gooding, T. Inglett, and R. W. Wisniewski. Experiences with a lightweight supercomputer kernel: Lessons learned from Blue Gene's CNK. In Proceedings of Supercomputing (SC 2010), Nov. 2010.
[17] K. Hale and P. Dinda. Enabling hybrid parallel runtimes through kernel and virtualization support. In Proceedings of the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE 2016), Apr. 2016.
[18] K. C. Hale and P. A. Dinda. A case for transforming parallel runtimes into operating system kernels. In Proceedings of the 24th ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC 2015), June 2015.
[19] M. A. Heroux, J. Dongarra, and P. Luszczek. HPCG technical specification. Technical Report SAND2013-8752, Sandia National Laboratories, Oct. 2013.
[20] G. C. Hunt and J. R. Larus. Singularity: Rethinking the software stack. SIGOPS Operating Systems Review, 41(2):37–49, Apr. 2007.
[21] S. M. Kelly and R. Brightwell. Software architecture of the light weight kernel, Catamount. In Proceedings of the 2005 Cray User Group Meeting (CUG 2005), May 2005.
[22] J. Lange, K. Pedretti, T. Hudson, P. Dinda, Z. Cui, L. Xia, P. Bridges, A. Gocke, S. Jaconette, M. Levenhagen, and R. Brightwell. Palacios and Kitten: New high performance operating systems for scalable virtualized and native supercomputing. In Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010), Apr. 2010.
[23] R. Liu, K. Klues, S. Bird, S. Hofmeyr, K. Asanović, and J. Kubiatowicz. Tessellation: Space-time partitioning in a manycore client OS. In Proceedings of the 1st USENIX Conference on Hot Topics in Parallelism (HotPar 2009), pages 10:1–10:6, Mar. 2009.
[24] A. Madhavapeddy, R. Mortier, C. Rotsos, D. Scott, B. Singh, T. Gazagnaire, S. Smith, S. Hand, and J. Crowcroft. Unikernels: Library operating systems for the cloud. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2013), pages 461–472, Mar. 2013.
[25] A. B. Montz, D. Mosberger, S. W. O'Malley, L. L. Peterson, and T. A. Proebsting. Scout: A communications-oriented operating system. In Proceedings of the 5th Workshop on Hot Topics in Operating Systems (HotOS 1995), pages 58–61, May 1995.
[26] J. Ouyang, B. Kocoloski, J. Lange, and K. Pedretti. Achieving performance isolation with lightweight co-kernels. In Proceedings of the 24th International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC 2015), June 2015.
[27] J. Ouyang, B. Kocoloski, J. R. Lange, and K. Pedretti. Achieving performance isolation with lightweight co-kernels. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, pages 149–160, June 2015.
[28] T. Roscoe. Linkage in the Nemesis single address space operating system. ACM SIGOPS Operating Systems Review, 28(4):48–55, Oct. 1994.
[29] T. Shimosawa, B. Gerofi, M. Takagi, G. Nakamura, T. Shirasawa, Y. Saeki, M. Shimizu, A. Hori, and Y. Ishikawa. Interface for heterogeneous kernels: A framework to enable hybrid OS designs targeting high performance computing on manycore architectures. In Proceedings of the IEEE International Conference on High Performance Computing (HiPC 2014), Dec. 2014.
[30] J. Swaine, K. Tew, P. Dinda, R. Findler, and M. Flatt. Back to the futures: Incremental parallelization of existing sequential runtime systems. In Proceedings of the ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2010), Oct. 2010.
[31] K. Tew, J. Swaine, M. Flatt, R. Findler, and P. Dinda. Places: Adding message passing parallelism to Racket. In Proceedings of the 2011 Dynamic Languages Symposium (DLS 2011), Oct. 2011.
[32] R. W. Wisniewski, T. Inglett, P. Keppel, R. Murty, and R. Riesen. mOS: An architecture for extreme-scale operating systems. In