Cichlid: Explicit physical memory management for large machines

Simon Gerber, Gerd Zellweger, Reto Achermann, Moritz Hoffmann, Kornilios Kourtis, Timothy Roscoe, Dejan Milojicic†
Systems Group, Department of Computer Science, ETH Zurich
†Hewlett-Packard Labs
Abstract
In this paper, we rethink how an OS supports virtual memory. Classical VM is an opaque abstraction of RAM, backed by demand paging. However, most systems today (from phones to data-centers) do not page, and indeed may require the performance benefits of non-paged physical memory, precise NUMA allocation, etc. Moreover, MMU hardware is now useful for other purposes, such as detecting page access or providing large page translation. Accordingly, the venerable VM abstraction in OSes like Windows and Linux has acquired a plethora of extra APIs to poke at the policy behind the illusion of a virtual address space.

Instead, we present Cichlid, a memory system which inverts this model. Applications explicitly manage their physical RAM of different types, and directly (though safely) program the translation hardware. Cichlid is implemented in Barrelfish, requires no virtualization support, and outperforms VMM-based approaches for all but the smallest working sets. We show that Cichlid enables use-cases for virtual memory not possible in Linux today, and that other use-cases are simple to program and significantly faster.

1 Introduction

We argue that applications for modern machines should manage physical RAM explicitly and directly program MMUs according to their needs, rather than manipulating such hardware implicitly through a virtual address abstraction as in Linux. We show that explicit primitives for managing physical memory and the MMU deliver comparable or better application performance, greater functionality, and a simpler and orthogonal interface that avoids the feature interaction and performance anomalies seen in Linux.

Traditional virtual memory (VM) systems present a conceptually simple view of memory to the application programmer: a single, uniform virtual address space which the OS transparently backs with physical memory. In its pure form, applications never see page faults, RAM allocation, address translation, TLB misses, etc.

This simplicity has a price. VM is an illusion: one can exhaust physical memory, resulting in thrashing, or the OS killing the application. Moreover, performance is unpredictable. VM hardware is complex, with multiple caches, TLBs, page sizes, NUMA nodes, etc.

For applications like databases, the performance gains from closely managing the MMU mappings and the locations of physical pages on memory controllers are as important to the end user as the functional correctness of the program [39, 56]. Consequently, the once-simple VM abstraction in systems such as Linux has become steadily more complex, as application developers demand more control over the mapping hardware, piercing the VM abstraction with features like transparent huge pages, NUMA allocation, pinned mappings, etc. In Section 2, we discuss the complexity, redundancy, and feature interaction in the formerly simple VM interface.

In response, we investigate the consequences of turning the VM system inside-out: applications (1) directly manage physical RAM, and (2) directly (but safely) program MMUs to build the environment in which they operate. Our contribution is a comprehensive design which achieves these goals, allows the full range of use-cases for memory system hardware, and performs well.

Cichlid, a new memory management system built in the Barrelfish research OS, adopts a radically inverted view of memory management compared with a traditional system like Linux.
Cichlid processes still run inside a virtual address space (the MMU is enabled), but this address space is securely constructed by the application itself with the help of a library which exposes the full functionality of the MMU. Above this, all the functionality of a traditional OS memory system is provided. (Cichlid is pronounced /ˈsɪklɪd/; see https://en.wikipedia.org/wiki/Cichlid.)

Application-level management of the virtual address space is not a new idea; we review its history in Section 5. Cichlid itself is an extension of the original Barrelfish physical memory management system described in Baumann et al. [8], which itself was based on seL4 [32]. The contributions of Cichlid over these prior systems are:

• A comprehensive implementation of application-level memory management for modern hardware, capable of supporting applications which exploit its features. We extend the Barrelfish model to support safe user construction of page tables, arbitrary superpage mapping, demand paging, and fast access to page status information without needing virtualization hardware.

• A detailed performance evaluation of Cichlid comparing it with a variety of techniques provided by, and different configurations of, a modern Linux kernel, showing that useful performance gains are achieved while greatly simplifying the interface.

In the next section of this paper we first review the various memory management features in Linux as an example of the traditional Unix-based approach. In Section 3 we then present Cichlid, and evaluate its performance in Section 4. Section 5 discusses the prior work on explicit physical memory management, and Section 6 summarizes the contribution and future work.

2 Virtual memory in Linux

We now discuss traditional VM systems, as context for Cichlid. We focus on Unix-like systems and Linux in particular as representative of mainstream approaches and the problems they exhibit. Later, in Section 5, we discuss prior systems which have adopted a different approach, some of which have strongly influenced Cichlid.

Unix was designed when RAM was scarce, and demand paging essential to system operation. Virtual memory is fully decoupled from backing storage via paging. Each process sees a uniform virtual address space. All memory is paged to disk by a single system-wide policy. The basic virtual memory primitive visible to software is fork(), which creates a complete copy of the virtual address space. Modern fork() is highly optimized (e.g. using copy-on-write).

Today, RAM is often plentiful, MMUs are sophisticated and featureful devices (e.g. supporting superpages), and the memory system is complex, with multiple controllers and set-associative caches (e.g. which can be exploited with page coloring).

Workloads have also changed. High-performance multicore code pays careful attention to locality and memory controller bandwidth. Pinning pages is a common operation for performance and correctness reasons, and personal devices like phones are often designed not to page at all. Instead, the MMU is used for purposes aside from paging. In addition to protection, remapping, and sharing of physical memory, MMUs are used to interpose on main memory (e.g. for copy-on-write, or virtualization) or otherwise record access (such as the use of "dirty" bits in garbage collection).
The need to exploit the memory system fully is evident from the range of features added to Linux over the years to "poke through" the basic Unix virtual address abstraction. The most basic of these creates additional "shared-memory objects" in a process' address space, which may or may not be actually shared. Such segments are referred to by file descriptors and can either be backed by files or "anonymous". The basic operation for mapping such an object is mmap(), which in addition to protection information accepts around 16 different flags specifying whether the mapping is shared, at a fixed address, contains pre-zeroed memory, etc. We describe basic usage of mmap() and related calls in Section 2.3; above this are a number of extensions.
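To make this concrete, here is a minimal sketch of the anonymous-mapping path using standard POSIX/Linux calls (MAP_POPULATE is Linux-specific):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t len = 1UL << 21;            /* 2 MB of anonymous memory */

        /* Anonymous, private mapping; MAP_POPULATE pre-faults the pages,
           which forces the kernel to zero them at mmap() time. */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); exit(1); }

        buf[0] = 42;                       /* touch the region */

        /* Change protection, then remove the mapping. */
        if (mprotect(buf, len, PROT_READ) != 0) { perror("mprotect"); exit(1); }
        munmap(buf, len);
        return 0;
    }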
Large pages: Modern MMUs support mappings at a coarser granularity than individual pages, typically by terminating a multi-level page table walk early. For example, x86_64 supports 2 MB and 1 GB superpages as well as 4 kB pages, and for simplicity we assume this architecture in the discussion that follows (others are similar).

Linux support for superpage mappings is somewhat complex. Firstly, mappings can be created for large (2 MB) or huge (1 GB) pages via a file system, hugetlbfs [40, 58], either directly or through libhugetlbfs [41]. For each supported superpage size, a command-line argument tells the kernel to allocate a fixed pool of superpages at boot-time. This pool can be dynamically resized by an administrator. Shrinking a pool deallocates superpages from applications using a hard-wired balancing policy. In addition, one superpage size is defined as a system-wide default which will be used for allocation if not explicitly specified otherwise. Once an administrator has set up the page pools, users can be authorized to create memory segments with superpage mappings, either by mapping files created in the hugetlbfs file system, or mapping anonymous segments with appropriate flags. Superpages may not be demand-paged [5].

The complexity of configuring different memory pools in Linux at boot has led to an alternative, transparent huge pages (THP) [27, 59]. When configured, the kernel allocates large pages on page faults if possible according to a single, system-wide policy, while a low-priority kernel thread scans pages for opportunities to use large pages through defragmentation. Demand-paging is allowed by first splitting the superpage into 4 kB pages [5]. A typical modern x86_64 kernel is configured for transparent support of 2 MB pages, but not 1 GB pages. Alternatively, an administrator can disable system-wide THP at boot or by writing to sysfs, and programs can enable it on a per-region basis at runtime using madvise(), as sketched below.
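A minimal sketch of such a per-region THP request (MADV_HUGEPAGE is a hint the kernel may ignore):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    /* Allocate an anonymous region and ask the kernel to back it with
       transparent huge pages where possible; the kernel may still fall
       back to 4 kB pages. */
    static void *alloc_thp(size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf != MAP_FAILED)
            madvise(buf, len, MADV_HUGEPAGE);
        return buf == MAP_FAILED ? NULL : buf;
    }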
NUMA: The mbind() system call sets a NUMA policy for a specific virtual memory region. A policy consists of a set of NUMA nodes and a mode: bind to restrict allocation to the given nodes; preferred to prefer those nodes, but fall back to others; interleave allocations across the nodes; and default to lazily allocate backing memory on the local node of the first thread to touch the virtual addresses. This "first touch" policy has proved problematic for performance [29]. libNUMA provides an additional numa_alloc_onnode() call to allocate anonymous memory on a specific node with mmap() and mbind(). Linux can also move pages between nodes: migrate_pages() attempts to move all pages of a process that reside on a set of given nodes to another set of nodes, while move_pages() moves a set of pages (specified as an array of virtual addresses) to a set of nodes. Note that policy is expressed in terms of virtual, not physical, memory.

There are also attempts [19–22, 25, 29] to deal with NUMA performance issues transparently in the kernel, by migrating threads closer to the nodes containing memory they frequently access, or conversely migrating pages to threads' NUMA nodes, based on periodically revoking access to pages and tracking usage with soft page faults. A good generic policy, however, may be impossible; highly performance-dependent applications currently implement custom NUMA policies by modifying the OS [29].
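Returning to mbind(), a minimal sketch of the bind mode (Linux-specific; error handling elided):

    #include <numaif.h>        /* mbind(), MPOL_BIND; link with -lnuma */
    #include <stddef.h>

    /* Restrict the physical backing of [addr, addr+len) to one NUMA node.
       Note that the policy is attached to the *virtual* region: it only
       constrains where the kernel will allocate backing pages. */
    static long bind_to_node(void *addr, size_t len, int node)
    {
        unsigned long nodemask = 1UL << node;
        return mbind(addr, len, MPOL_BIND, &nodemask,
                     /* maxnode = */ sizeof(nodemask) * 8, 0);
    }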
User-space faults:
Linux signals can be used to reflect page faults to the application. GNU libsigsegv [43] provides a portable interface for handling page faults: a user fault handler is called with the faulting virtual address and must then be able to distinguish the type of fault, and possibly map new pages to the faulting address. When used with system calls such as mprotect() and madvise(), this enables basic user-space page management. The current limitations of this approach (both in performance and flexibility) have led to a proposed facility for user-space demand paging [23, 26].
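A sketch of this user-space write-tracking pattern with raw POSIX signals; mark_dirty() is a hypothetical bookkeeping routine, and production code would need more care (e.g. distinguishing genuine faults from tracked ones):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <sys/mman.h>
    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096UL

    extern void mark_dirty(void *page);   /* hypothetical bookkeeping */

    /* On a write fault, record the page and re-enable writes so the
       faulting instruction can be restarted. */
    static void on_segv(int sig, siginfo_t *si, void *uctx)
    {
        void *page = (void *)((uintptr_t)si->si_addr & ~(PAGE_SIZE - 1));
        mark_dirty(page);
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    }

    void track_writes(void *base, size_t len)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_segv;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        mprotect(base, len, PROT_READ);   /* subsequent writes now trap */
    }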
Figure 1: Managing memory on Linux: time per page for map, unmap, and protect operations as a function of buffer size, for the MMAP, SHM, and SHMAT mechanisms.

Based on the simple Unix virtual address space, the Linux VM system has evolved in response to new demands by accreting new features and functionality. This has succeeded up to a point, but has resulted in a number of problems.

The first is mechanism redundancy: there are multiple mechanisms available to users with different performance characteristics. For example, Figure 1 shows the performance of three different Linux facilities for creating, destroying, and changing "anonymous mappings": regions of virtual address space backed by RAM but not corresponding to a file. These measurements were obtained on the machine in Table 1, using 4k pages throughout.
MMAP uses an mmap() call with MAP_POPULATE and MAP_ANONYMOUS to map and unmap regions, and mprotect() for protection. This forces the kernel to zero pages being mapped, dominating execution time. Avoiding this behavior, even when safe, requires kernel reconfiguration at build time, a global policy aimed at embedded systems.

SHM creates a shared memory object with shm_open() and passes it to mmap() and mprotect(). In this case, mmap() will not zero the memory. Unmapping is also faster since memory is not immediately reclaimed. The object can be shared with other processes, but (unlike MMAP mappings) cannot use large pages.

SHMAT attaches a shared segment with shmat(), and does allow large pages if the process has the CAP_IPC_LOCK capability. Internally, the mechanism is similar to mmap(), with system-wide limits on the number and size of segments.
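For comparison with the MMAP sketch earlier, a minimal sketch of the SHM path (standard POSIX; the object name is arbitrary):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stddef.h>

    /* SHM: create a shared-memory object and map it. Unlike the MMAP
       variant with MAP_POPULATE, pages are not pre-zeroed at map time,
       and unmapping does not immediately reclaim the memory. */
    static void *shm_map(const char *name, size_t len)
    {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, (off_t)len) != 0) { close(fd); return NULL; }
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        close(fd);             /* the mapping keeps the object alive */
        return buf == MAP_FAILED ? NULL : buf;
    }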
For buffers up to 2 MB, the cost per page decreases with size for all operations due to amortization of the system call overhead. Afterwards, the time stays constant, except for MMAP map operations.

CPU                  Intel Xeon E5-2670 v2 (Ivy Bridge)
Cores                2x10 @ 2.5 GHz
L1 / L2 cache        32 kB / 256 kB (per core)
L3 cache             25 MB (shared)
dTLB (4 kB pages)    64 entries (4-way)
dTLB (2 MB pages)    32 entries (4-way)
dTLB (1 GB pages)    4 entries (4-way)
L2 TLB (4 kB)        512 entries (4-way)
RAM                  256 GB (128 GB per node)
Linux kernel         v4.2.0 (Ubuntu 15.10)

Table 1: Test bed specifications.

Table 2: Tested Linux configurations (including hugetlbfs [49] enabled).

libhugetlbfs provides get_hugepage_region and get_huge_pages calls to directly allocate superpage-backed memory using a malloc-style interface. The actual page size cannot be specified and depends on a system-wide default; 4 kB pages may be used transparently unless the GHR_STRICT flag is set. By default, hugetlbfs prefaults pages.

The high-level observation is:
No single Linux API is always optimal, even for very simple VM operations.

A second problem is policy inflexibility. The appropriate policies for many memory management operations, such as page replacement, NUMA allocation, or the handling of superpages, depend strongly on the individual application's workload. In Linux, however, these policies usually either apply system-wide, require administrator configuration (often at boot), must be enabled at compile time, or some combination of the above. For example, supporting two superpage sizes in hugetlbfs requires two different, pre-allocated pools of physical memory, each assigned to a different file system, precluding a dynamic algorithm that could adapt to changing workloads.

In addition to the added complexity in the kernel [24], the system-wide policies in transparent superpage support have led to a variety of performance issues: Oracle DB has suffered from I/O performance degradation when reading large extents from disk [5, 17]. Redis incurs unexpected latency spikes using THP due to copy-on-write overhead for large pages, since the application periodically uses fork() to persist database snapshots [65]. The jemalloc memory allocator experiences performance anomalies due to its use of madvise() to release small regions of memory inside bigger chunks which have been transparently backed by large pages; the resulting holes prevent later merging of the region back into a large page [37]. These issues are not minor implementation bugs, but arise from the philosophy that memory system complexity should be hidden from applications, and that resource allocation policies should be handled transparently by the kernel.

The third class of problem is feature interaction. We have seen how superpages cannot be demand paged (even though modern SSDs can transfer 2 MB pages with low latency). Another example is the complex and subtle interaction between kernel-wide policies for NUMA allocation and superpage support [58]. At one level, this shows up in the inability to control initial superpage allocation at boot time (superpages are always balanced over all NUMA nodes). Worse, Gaud et al. [38] show that treating large pages and NUMA separately does not work well: large pages hurt the performance of parallel applications on NUMA machines because hot pages are more likely, and larger, and false page sharing makes replication or migration less effective. Accordingly, the Carrefour [29] system modifies the kernel's NUMA-aware page placement to realize its performance gains.

Collectively, these issues motivate investigating alternative approaches. As memory hardware diversifies in the future, memory management policies will become increasingly complicated. We note that none of the Linux memory APIs actually deal with physical memory directly, but instead select from a limited number of complex, in-kernel policies for backing traditional virtual memory. In contrast, therefore, Cichlid safely exposes to programs and runtime systems both physical memory and translation hardware, and allows libraries to build familiar virtual memory abstractions above this.

3 Cichlid

We now describe the design of Cichlid, and how it is implemented over the basic memory functionality of Barrelfish. While Cichlid allows great flexibility in arranging an address space, it nevertheless ensures the following key safety property: no Cichlid process can issue read or write instructions for any area of physical memory for which it does not have explicit access rights.
Subject to this requirement, Cichlid also provides the following completeness property: a Cichlid process can create any address space layout permitted by the MMU for which it has sufficient resources. In other words, Cichlid itself poses no restriction on how the memory hardware can be used.

There are three main challenges the implementation of Cichlid must address. Firstly, it must securely name and authorize access to, and control over, regions of physical memory. Cichlid achieves this using partitioned capabilities. Secondly, it must allow safe control of hardware data structures (such as page tables) by application programs. This is achieved by considerably extending the set of memory types supported by the capability system in Barrelfish (and seL4) for Cichlid to use. Finally, Cichlid must give applications direct access to information provided by the MMU (such as access and write-tracking bits in the page tables). Unlike prior approaches which rely on virtualization technology, Cichlid allows direct read-only access to page table entries; we explain below why this is safe.

Cichlid has three main components. First, the kernel provides capability invocations that allow application processes to install, modify and remove page table entries and query for the base address and size of physical regions. Second, the kernel exception handler redirects any exceptions generated by the MMU to the application process that caused the exception. Third, a runtime library provides to applications an abstraction layer over the capability system which exposes a simple, but expressive, API for managing page tables.

Cichlid applications directly allocate regions of physical memory and pass around authorization for these regions in the form of capabilities. Regions can be mapped into a virtual address space by changing a page table, or used for other purposes such as holding page tables themselves.

Cichlid extends the Barrelfish capability design, itself inspired by seL4 [30, 33, 53]. All physical regions are represented by capabilities, which also confer a particular memory type. For example, the integrity of the capability system itself is ensured by storing capability representations in memory regions of type CNode, which can never be directly written by user-space programs. Instead, a region must be of type Frame to be mapped writable into a virtual address space. Holding both Frame and CNode capabilities to the same region would enable a process to forge new capabilities by directly manipulating their bit representations, and so is forbidden. Such a situation is prevented by a kernel-enforced type hierarchy for capabilities.

Capabilities to memory regions can be split and retyped according to a set of rules. At system start-up, all memory is initially of type Untyped, and physical memory is allocated to processes by splitting the initial untyped region. Retyping and other operations on capabilities are performed by system calls to the kernel.

seL4 capabilities are motivated by the desire to prove correctness properties of the seL4 kernel, in particular the property that no system call can fail due to lack of memory. Hence, seL4 and Barrelfish perform no dynamic memory allocation in the kernel; instead, memory for all dynamic kernel data structures is allocated by user-space programs and retyped appropriately, for example to a kernel thread control block or a CNode.

Capabilities are attractive since they export physical memory to applications in a safe manner: applications may not arbitrarily use physical memory; they must instead "own" the corresponding capability. Furthermore, capabilities can be passed between applications. Finally, capabilities have some characteristics of objects: each capability type has a set of operations which can be invoked on it by a system call.

In Barrelfish, seL4, and Cichlid, the kernel enforces safety using two types of meta-data: a derivation database and a per-process capability space. All capability objects managed by a kernel are organized in a capability derivation tree. This tree enables efficient queries for descendants (of retype and split operations) and copies. These queries are used to prevent retype races on separate copies of a capability that might compromise the system.

User processes refer to capabilities and invoke operations on them using opaque handles. Each process has its own capability address space, which is explicitly maintained via a radix tree in the kernel which functions as a guarded page table. The nodes of the tree are also capabilities (retyped from RAM capabilities) and are allocated by the application. The root of the radix tree for each process is stored in the process control block. When a process invokes a capability operation, it passes to the kernel the capability handle with the invocation arguments. To perform the operation, the kernel traverses the process' capability space to locate the capability corresponding to the handle and authorizes the invocation.

Cichlid builds on the basic Barrelfish capability mechanisms to allow explicit allocation of different kinds of memory. A memory region has architectural attributes such as the memory controller it resides on, whether it is on an external co-processor like a GPGPU or Intel Xeon Phi, whether it is persistent, etc. Applications explicitly acquire memory with particular attributes by requesting a capability from an appropriate memory allocator process, of which there are many. Furthermore, less explicit "best effort" policies can be layered on top by implementing further virtual allocators which can, for example, steal RAM from nearby controllers if local memory is scarce.

Page tables are hardware specific, and at the lowest level Cichlid's interface (like seL4 and Barrelfish) reflects the actual hardware. Applications may use this interface directly, or a high-level API with common abstractions for different MMUs, to safely build page tables, exchange page tables on a core, and install mappings for any physical memory regions for which the application is authorized. The choice of virtual memory layout, and its representation in page tables, is fully controlled by the application. Cores can share sub-page-tables between different page-table hierarchies to alias a region of memory at a different address or to share memory between different cores as in Corey [16].

Cichlid adds support for multiple page sizes (2 MB and 1 GB superpages on x86_64, and 16 MB, 1 MB, and 64 kB pages on ARMv7-A [4]) to the Barrelfish memory management system [8]. Cichlid decouples physical memory allocation from programming the MMU.
This decoupling means the API allows applications to explicitly select the page size for individual mappings, map pages of a mixture of different sizes, and change the virtual page size used for mappings of contiguous physical memory regions, all directly from the application itself instead of relying on the kernel to implement the correct policy for all cases.

To do this, Cichlid extends the Barrelfish memory system (and that of seL4) by introducing a new capability type for every level of page table for every architecture supported by the OS. This is facilitated by the Hamlet domain-specific language for specifying capability types [28]. For example, for an MMU in x86_64 long mode there are four different types of page table capability, corresponding to the 4 levels of a 64-bit x86 page table (PML4, PDPT, PD, and PT). A PT (last-level page table) capability can only refer to a 4k page-aligned region of RAM and has a map operation which takes an additional capability plus an entry number as arguments. This capability in turn must be of type Frame and refer to another 4k page. The operation installs the appropriate page table entry in the PT to map the specified frame. The kernel imposes no policy on this mapping, other than restricting the type and size of capabilities. Similarly, a map on a PD (a 2nd-level "page directory") capability only accepts a capability argument which is of size 4 kB and type PT, or of type Frame and size 2 MB (signifying a large page mapping).

A small set of rules therefore captures all possible valid and authorized page table operations for a process, while excluding any that would violate the safety property. Moreover, checking these rules is fast, and is partly responsible for Cichlid's superior performance described in Section 4.2. This type system allows user-space Cichlid programs to construct flexible page tables while enforcing the safety property stated at the start of this section. Cichlid's full kernel interface contains the following capability invocations: identify, map, unmap, modify_flags (protect), and clear_dirty_bits.

Memory regions represented by capabilities and associated rights allow user-level applications to safely construct page tables: they allocate physical memory regions, retype them to hold a page table, and install the entries as needed. Typed capabilities ensure a process cannot successfully map a physical region for which it does not have authorization. The process of mapping itself is still a privileged operation handled by the kernel, but the kernel must only validate the references and capability types before installing the mapping. Safety is guaranteed by the type system: page tables have a specific type which cannot be mapped writable.

Care must be taken in Cichlid to handle capability revocation. In particular, when a Frame capability is revoked, all page table entries for that frame must be quickly identified and removed. Cichlid handles this by requiring each instance of a Frame capability to correspond to at most one hardware page table entry. To map a frame into multiple page tables, or at multiple locations in the same page table, the program must explicitly create copies of the capability.

As described so far, each operation requires a separate system call. Cichlid optimizes this in a straightforward way by allowing batching of requests, amortizing system call cost for large region operations. The map, unmap, and modify_flags operations all take multiple consecutive entries for a given page table as arguments.

In Section 4.3 we confirm existing work on the effect of page size on the performance of particular workloads, and in Section 4.4 we show that the choice of page size is highly dynamic and depends on the program's configuration, such as the number of threads and where memory is allocated. In contrast, having the OS transparently select a page size is an old idea [60] and is the default in many Linux distributions today, but finding a policy that satisfies a diverse set of different workloads is difficult in practice and leads to inherent complexity with questionable performance benefits [17, 38, 42, 65].

Cichlid uses the existing Barrelfish functionality for reflecting VM-related processor exceptions back to the faulting process, as in Nemesis [44] and K42 [55]. This incurs lower kernel overhead than classical VM and allows the application to implement its own paging policies. In Sections 4.1 and 4.2 we show that Cichlid's trap latency to user space is considerably lower than in Linux.

Cichlid extends Barrelfish to allow page-traps to be eliminated for some use-cases when the MMU maintains page access information in the page table entries. While Dune [9] uses nested paging hardware to present "dirty" and "accessed" bits to applications, Cichlid exposes these x86_64 page table bits without hardware support for virtualization.

We extend the kernel's mapping rules of Section 3.2 to allow page tables themselves to be mapped read-only into a process' address space. Essentially, this boils down to allowing a 4 kB capability of type PML4, PDPT, PD, or PT to be mapped in an entry in a PT instead of a Frame capability, with the added restriction that the mapping must be read-only. This allows applications (or libraries) to read "dirty" and "accessed" bits directly from page table entries without trapping to the kernel. Setting or clearing these bits remains a privileged operation which can only be performed by a kernel invocation passing the capability for the page table. Note that this functionality remains safe under the capability system: an application can only access the mappings it has installed itself (or for which it holds a valid capability), and cannot subvert them. In Section 4.5 we demonstrate the benefits of this approach for a garbage collector. Cichlid's demand-paging functionality for x86_64 also uses mapped dirty bits to determine if a frame's contents should be paged out before reusing the frame.

Since Cichlid doesn't need hardware virtualization support, such hardware, if present, can be used for virtualization. Cichlid can work both inside a virtual machine, or as a better memory management system for a low-level hypervisor. Moreover, nested paging has a performance cost for large working sets, since TLB misses can be twice as expensive. In Section 4.6 we show that for small working sets (below 16 MB for our hardware) a Dune-like approach outperforms Cichlid due to lower overhead in clearing page table bits, but for medium-to-large working sets Cichlid's lower TLB miss latency improves performance. The Cichlid and Dune approaches are complementary, and a natural extension to Cichlid (not pursued here) would allow applications access to both the physical (machine) page tables and nested page tables if the workload can exploit them.
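Continuing the hypothetical sketch above: once a PT is mapped read-only into the application's address space, polling a dirty bit is an ordinary load (x86_64 PTEs keep the dirty bit in bit 6), while clearing it remains a kernel invocation. All names below are illustrative, not Cichlid's actual API:

    #include <stdint.h>

    typedef struct { unsigned long cslot; } capref_t;

    /* Illustrative wrappers for the kernel's map / clear_dirty_bits
       invocations described above. */
    extern int pt_map(capref_t pt, int entry, capref_t frame, uint64_t flags);
    extern int pt_clear_dirty_bits(capref_t pt, int entry, int count);

    #define PTE_DIRTY (1ULL << 6)      /* x86_64 dirty bit */

    /* ptes points at the PT itself, mapped read-only into our address
       space via the extended mapping rules; reading the dirty bit needs
       no trap, while clearing it is still a kernel invocation. */
    static int frame_dirty(volatile const uint64_t *ptes, int entry)
    {
        return (ptes[entry] & PTE_DIRTY) != 0;
    }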
Cichlid provides a number of APIs layered above the capability invocations discussed so far. The first layer of Cichlid's user-space code wraps the invocations to create, modify or remove single mappings, and keeps track of the application's virtual address space layout. While application programmers can build directly on this, the Cichlid library provides higher-level abstractions based on the concepts of virtual regions (contiguous sets of virtual addresses) and memory objects, which can be used to back one or more virtual regions and can themselves be comprised of one or more physical regions.

This layer is important to Cichlid's usability. Manually invoking operations on capabilities to manage the virtual address space can be cumbersome; take the example of a common operation such as mapping an arbitrarily-sized region of physical memory R with physical base address P and size S bytes, R = (P, S), at an arbitrary virtual base address V. The number of invocations needed to create this simple mapping varies based on V, S, and the desired properties of the mapping (such as page size), as well as the state of the application's virtual address space before the operation. In particular, installing a mapping can potentially entail creating multiple levels of page table in addition to installing a page table entry. The library encapsulates the code to do this on demand, as well as batching operations up to amortize system call overhead.

Finally, the library also provides traditional interfaces such as sbrk() and malloc() for areas of memory where performance is not critical. To simplify start-up, programs running over Cichlid start with a limited, conventional virtual address space with key segments (text, data, bss) backed with RAM, though this address space is, itself, constructed by the process' parent using Cichlid (rather than the kernel).

In addition, the Cichlid library provides demand paging to disk as in Nemesis [44], but not by default: many time-sensitive applications rely on not paging for correctness, small machines such as phones typically do not page anyway, and the growth of non-volatile main memory [48, 62] may make demand-paging obsolete. Unlike the Linux VM system, demand paging is orthogonal to page size: the Cichlid library can demand-page superpages provided the application has sufficient frames and disk space. Furthermore, the application is aware of the number of backing frames and can add or remove frames explicitly at runtime if required. The library shows that building a classic VM abstraction over Cichlid is straightforward, but the reverse is not the case.
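As an illustration of this layer, backing a fresh virtual region with a frame the application already holds might look as follows (the names loosely echo Barrelfish's vspace/vregion/memobj concepts but are hypothetical here):

    #include <stddef.h>

    typedef struct memobj  memobj_t;    /* physical backing (1+ frames) */
    typedef struct vregion vregion_t;   /* contiguous virtual range */
    typedef struct { unsigned long cslot; } capref_t;

    /* Illustrative library calls. */
    extern memobj_t  *memobj_create(size_t bytes);
    extern int        memobj_fill(memobj_t *m, capref_t frame, size_t off);
    extern vregion_t *vregion_map(memobj_t *m, int flags);

    enum { VREGION_READ_WRITE = 3 };

    /* Map a 64 MB object backed by a frame we already hold. The library
       creates any missing page tables on demand and batches the
       per-entry map invocations behind this one call. */
    static vregion_t *map_64mb(capref_t frame)
    {
        memobj_t *obj = memobj_create(64UL << 20);
        memobj_fill(obj, frame, 0);
        return vregion_map(obj, VREGION_READ_WRITE);
    }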
4 Evaluation

We evaluate Cichlid by first demonstrating that primitive operations have performance as good as, or better than, those of Linux, and then showing that Cichlid's flexible interface allows application programmers to usefully optimize their systems.

All Linux results, other than those for Dune (Section 4.5), are for version 4.2.0, as shipped with Ubuntu 15.10, with three large-page setups: none, hugetlbfs, and transparent huge pages. As the Dune patches (git revision 6c12ba0) require a version 3 kernel, these benchmarks use kernel version 3.16 instead. These configurations are summarized in Table 2. Thread and memory pinning was done using numactl and taskctl. Performance numbers for Linux are always the best among all tested configurations.

Figure 2: Appel-Li benchmark: execution time in cycles per page (or per trap) for the prot1-trap-unprot, protN-trap-unprot, and trap-only strategies, comparing Linux and Cichlid under default, full, and selective TLB flushing (Cichlid also with direct invocations, +DI).
The Appel and Li benchmark [3] tests operations relevant to garbage collection and other non-paging tasks. This benchmark is compiled with flags -O2 -DNDEBUG, and summarized in Figure 2. We compare Linux and Cichlid with three different TLB flush modes: 1) Full: invalidate the whole TLB (writing cr3 on x86) every time; 2) Selective: only invalidate those entries relevant to the previous operation (using the invlpg instruction); and 3) System default: Cichlid, by default, does a full flush only for more than one page. Linux's default behavior depends on kernel version. The version tested (4.2.0) does a selective flush for up to 33 pages, and a full flush otherwise [45]. We vary this value to change Linux's flush mode. The working set here is less than 2 MB, and thus large pages have no effect and are disabled.

Cichlid is consistently faster than Linux here. For multi-page protect-trap-unprotect (protN-trap-unprot), Cichlid is 46% faster than Linux. For both systems, the default adaptive behavior is as good as, or better than, selective flushing. The Cichlid +DI results use the kernel primitives directly, to isolate the cost of user-space accounting, which is around 5%.

We extend the Appel and Li benchmarks to establish how the primitive operations scale for large address spaces, using buffers up to 64 GB. We map, protect and unmap the entire buffer, and time each operation separately. We compare Cichlid to the best Linux method for each page size established in §2.3. On Cichlid we use the high-level interfaces on a previously allocated frame, for similar semantics to shared memory objects in Linux. The experiments were conducted on a 2x10-core Intel Xeon E5 v2. Figure 3 shows execution time per page.
Map: Cichlid per-page performance is highly predictable, regardless of page size. Since all the information needed is presented with each system call, the kernel does very little. On Linux we use shm_open for 4k pages and shmat for others. Linux needs to consult the shared segment descriptor and validate it. This results in a general performance improvement for Cichlid over Linux of up to 15x for 4 kB pages or 93x for large pages, once some upfront overhead is amortized.
Protect:
These are in line with the Appel and Li benchmarks: Cichlid outperforms Linux's mprotect() on an mmap'ed region in all configurations except for small buffers of 4 kB pages. For large buffers, the differences between Cichlid and Linux are up to 4x (4 kB pages) or 8x (huge pages).
Unmap: Doing an unmap in Cichlid is expensive: the relevant page table capability must be looked up to invoke it, and the mapped Frame capability needs to be marked unmapped. Linux's shmdt, however, simply detaches the segment from the process but doesn't destroy it. Cichlid could be modified to directly invoke the page table, and thereby match the performance of Linux.

Cichlid memory operations are competitive: capabilities and fast traps allow an efficient virtual memory interface. Even when multiple page table levels are changed, Cichlid usually outperforms Linux, despite requiring several system calls.

Many HPC workloads have a random memory access pattern, and spend up to 50% of their time in TLB misses [67]. Using the RandomAccess benchmark [54] from the HPC Challenge [68] suite, we demonstrate that carefully user-selected page sizes, as enabled by Cichlid, have a dramatic performance effect. We measure the update rate (giga updates per second, or GUPS) for read-modify-write operations on an array of 64-bit integers, using a single thread. We measure working sets up to 32 GB, which exceeds TLB coverage for all page sizes. The Linux configuration uses hugetlbfs (Table 2), with pages allocated from the local NUMA node. If run with transparent huge pages instead, the system always selects 2 MB pages, and achieves lower performance.
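The measured kernel is essentially the following loop (a sketch: the HPCC benchmark's actual random-number generator differs, though the gups_lcg variant uses an LCG like this one):

    #include <stdint.h>

    /* Random read-modify-write updates over a table of 64-bit words;
       with a large table, almost every access misses the TLB. */
    static void gups(uint64_t *table, uint64_t nwords, uint64_t nupdates)
    {
        uint64_t r = 1;
        for (uint64_t i = 0; i < nupdates; i++) {
            /* 64-bit LCG step (constants from Knuth's MMIX) */
            r = r * 6364136223846793005ULL + 1442695040888963407ULL;
            table[r % nwords] ^= r;
        }
    }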
Figure 3: Comparison of memory operations on Cichlid and Linux using shmat, mprotect and shmdt: time per page as a function of buffer size, for 4k, 2M, and 1G pages.

Page Size    Cichlid GUPS    Cichlid Time    Linux GUPS    Linux Time
4k           0.0122          1397 s          0.0121        1414 s
2M           0.0408          420 s           0.0408        421 s
1G           0.0659          260 s           0.0658        261 s

Table 3: GUPS as a function of page size, 32 GB table.
Figure 4: GUPS as a function of table size, normalized to 1 GB pages (4k, 2M, and 1G pages).

Figure 4 shows the results on Cichlid, normalized to 1 GB pages. Performance drops once we exceed TLB coverage: at 2 MB for 4 kB pages, and at 128 MB for 2 MB pages. The apparent improvement at 32 MB is due to exhausting the L3 cache, which slows all three page sizes equally, bringing the normalized results together. Large pages not only increase TLB coverage, but also cause fewer table walk steps to service a TLB miss. Page-structure caches would reduce the number of memory accesses even further, but are rather small in size [6, 12]. Cichlid and Linux perform identically in this test, as Table 3 shows. These results support previous findings on TLB overhead [7, 67], and emphasize the importance of applications being able to select the correct page size for their workload.
Figure 5: GUPS variance, 2 MB pages: histogram of runtimes over repeated runs.

On Linux, even with NUMA-local memory, high scheduling priority, and no frequency scaling or power management, there is significant variance between benchmark runs, evidenced by the multimodal distribution in Figure 5. This occurs for both hugetlbfs and transparent huge pages, and is probably due to variations in memory allocation, although we have been unable to isolate the precise cause. This variance is completely absent under Cichlid, even when truly randomizing paging layout and access patterns, demonstrating again the benefit of predictable application-driven allocation.
Previous work [38] has shown that while large pages can be beneficial on NUMA systems, they can also hurt performance. Things are even more complicated when there are more page sizes (e.g. 4 kB, 2 MB, and 1 GB for x86_64). Furthermore, modern machines often have a distinct TLB for each page size, suggesting that using a mix of page sizes increases TLB coverage.

Kaestle et al. [50] showed that distribution and replication of program data across NUMA nodes can significantly improve performance. We evaluate the effect of the page size on application performance using Shoal's Green-Marl PageRank [50]. NUMA effects are minimal on the 2-socket machine we use in other experiments, so for this experiment we use the machine in Table 4, and note that AMD's SMT threads (CMT) are disabled in our experiments.

CPU                 AMD Opteron 6378
Microarchitecture   Piledriver
Cores               32 @ 2.4 GHz
L1 cache            16 kB

Table 4: Specifications of the AMD test bed.

We evaluate two configurations: first, single-threaded (T = 1); and second, multi-threaded with 32 threads, using distribution alone (T = 32 (dist)) and replication plus distribution (T = 32 (repl + dist)).

Table 5: PageRank execution times per page size and configuration (repl = replication, dist = distribution, T is the number of threads). Highlighted are the best numbers for each configuration. Standard error is very small.

90% of the working set is replicated; however, the last 10% still cannot be distributed efficiently, which leads to worse performance.

It is clear that the right page size is highly dynamic and depends on workload and application characteristics. It is impractical to statically configure a system with pools (as in Linux) optimally for all programs, as the requirements are not known beforehand. Also, memory allocated to pools is not available for allocations with different page sizes. In contrast, Cichlid's simpler interface allows arbitrary use of page sizes and replication by the application without requiring a priori configuration of the OS.

The potential of using the MMU to improve garbage collection is well known [3]. Out of many possible applications, we consider detecting page modifications, a feature used, for example, in the Boehm garbage collector [15] to avoid stopping the world. Only after tracing does the collector stop the world and perform a final trace that need only consider marked objects in dirty pages. This way, newly reachable objects are accounted for and not collected.

There are two ways to detect modified pages. The first is to make the pages read-only (e.g. via mprotect() or transparently by the kernel using soft-dirty PTEs [66]) and handle page faults in user-space or kernel-space. The handler sets a virtual dirty bit, and unprotects the page to allow the program to continue. The second approach uses hardware dirty bits, set when a page is updated. Some OSes (e.g. Linux) do not provide access to these bits. This is not just an interface issue: the bits are actively used by Linux to detect pages that need to be flushed to disk during page reclamation. Other OSes, such as Solaris, expose these dirty bits in a read-only manner via the /proc file-system. In this case, applications are required to perform a system call to read the bits, which can lead to worse performance than using mprotect() [13].

In Cichlid, physical memory and page tables are directly visible to applications. Applications can map page tables read-only in their virtual address space. Only clearing the dirty bits requires a system call.
Figure 6: GCBench on Linux, Cichlid and Dune, runtime normalized to Linux, comparing Cichlid (prot), Linux (prot), Cichlid (dirty), Dune (dirty), and Cichlid/NP (dirty).

Dune [9] provides this functionality through nested paging hardware, intended for virtualization, by running applications as a guest OS. Dune applications have direct access to the virtualized (nested) page tables. This approach avoids any system call overhead to reset the dirty bits, but depends on virtualization hardware and can lead to a performance penalty due to greater TLB usage [7, 11].

We use the Boehm garbage collector [15] and the GCBench microbenchmark [14]. GCBench tests the garbage collector by allocating and collecting binary trees of various sizes. We run this benchmark with the three described memory systems, Linux, Dune and Cichlid, with five different configurations C1 to C5, which progressively increase the size of the allocated trees.

In Figure 6 we compare the runtime of each system. Cichlid implements all three mechanisms: protecting pages (Cichlid (prot)), hardware dirty bits in user-space (Cichlid (dirty)), and hardware dirty bits in guest ring 0 (Cichlid/NP (dirty)), as does Dune. Our virtualization code is based on Arrakis [63].

Cichlid (prot) performs slightly worse than Linux (prot). This is consistent with Figure 3, where Linux performs better than Cichlid for protecting a single 4 kB page. We achieve better performance (between 13% (C2) and 19% (C4)) than Linux when we use hardware dirty bits, by avoiding traps when writing to pages. We still incur some overhead as we have to make a system call to reset the dirty bits on pages. Dune outperforms Cichlid (dirty) by up to 21% (C1), as direct access to the guest page tables enables resetting the dirty bits without having to make a system call. However, Cichlid manages to close the gap as the working set becomes larger, in which case Dune performance noticeably shows the overhead of nested paging. Unfortunately, we were unable to get Dune working with larger heap sizes on our hardware and thus have no numbers for Dune for configurations C4 and C5.

On Linux, using transparent huge pages did not have a significant impact on performance, and we report the Linux numbers with THP disabled. In a similar vein, we were unable to get Dune working with superpages, but we believe that having superpages might improve Dune performance for larger heap sizes (cf. Section 4.3).

Cichlid/NP (dirty) runs GCBench in guest ring 0 and reads and clears dirty bits directly on the guest hardware page tables. The performance of Cichlid/NP is similar to Cichlid (dirty) and slower than Dune.
However, this can be attributed to the fact that Cichlid/NP does not fully leverage the advantage of having direct access to the guest hardware page tables, and still uses system calls to construct the address space.

Table 6 shows the total runtime, the number of collections the GC performed, and the heap size used by the application. Ideally, the heap size would be identical for all systems, since it is always possible to trade memory for better runtime in a garbage collector. In practice this is very difficult to enforce, especially across entirely different operating systems. For example, Cichlid uses less memory (28%) for C4 compared to Linux (prot), but more memory (12%) for C5. We conclude that with Cichlid we can safely expose MMU information to applications, which in turn can benefit from it without relying on virtualization hardware features.

Config                 C1     C2     C3     C4     C5
Runtime (s)
  Linux (prot)         2.1    9.6    42     191    848
  Cichlid (prot)       2.4    10.5   43     203    928
  Cichlid (dirty)      1.9    8.3    34     153    692
  Dune (dirty)         1.5    7.3    33     –      –
  Cichlid/NP (dirty)   2.0    8.6    36     157    720
Collections
  Linux (prot)         251    336    381    428    448
  Cichlid (prot)       245    335    393    432    442
  Cichlid (dirty)      230    323    383    435    441
  Dune (dirty)         318    367    403    –      –
  Cichlid/NP (dirty)   233    325    381    434    443
Heap size (MB)
  Linux (prot)         139    411    1924   7972   24932
  Cichlid (prot)       132    453    1413   6789   26821
  Cichlid (dirty)      100    453    1477   5669   28132
  Dune (dirty)         106    386    1579   –      –
  Cichlid/NP (dirty)   100    453    1573   5541   28132

Table 6: GCBench reported total runtime, number of collections, and heap size.

Figure 7: Comparison of the execution time of RandomAccess with and without nested paging for varying working set sizes, normalized to GUPS on native Linux.

To illustrate the potential downside of nested paging, we revisit the HPC Challenge RandomAccess benchmark. Resolving a TLB miss with nested paging requires a 2D page table walk and up to 24 memory accesses [2], resulting in a much higher miss penalty, and the overhead of nested paging may end up outweighing the benefits of direct access to privileged hardware in guest ring zero. GUPS represents a worst-case scenario due to its lack of locality.

We conduct the same experiment as in Section 4.3 on Dune [9] with a working set size ranging from 1 MB to 128 MB. Figure 7 and Table 7 show that for the smallest table sizes (1 MB and 2 MB) the performance of RandomAccess under Dune and Linux is comparable. Larger working set sizes exceed the TLB coverage, and hence more TLB misses occur. This results in almost 2x higher runtime for RandomAccess in Dune than in Linux. As for all comparisons with Dune, we disable transparent huge pages on Linux.

Running applications in guest ring zero as in Dune has pros and cons: on one hand, the application gets access to privileged hardware features; on the other hand, performance may be degraded due to larger TLB miss costs for working sets which cannot be covered by the TLB.
The core principle of paged virtual memory is that virtual pages are backed by arbitrary physical pages. This can adversely affect application performance due to unnecessary conflict misses in the CPU caches and an increase in non-determinism [52]. In addition, system-wide page coloring introduces constraints on memory management which may interfere with the application's memory requirements [70].

             Linux                 Dune
Size (MB)    GUPS    GUPS LCG      GUPS    GUPS LCG
1            2       1             2       1
2            3       3             3       3
4            11      11            18      19
8            35      36            61      65
16           90      93            165     169
32           236     240           421     425
64           594     595           1098    1113
128          1510    1571          2999    3043

Table 7: RandomAccess absolute execution times in milliseconds.

Implementing page placement policies is non-trivial: the complexity of the FreeBSD kernel is increased significantly [31], Solaris allows applications to choose from multiple algorithms [61], and there have been several failed attempts to implement page placement algorithms in Linux. Other systems, like COLORIS [69], replace Linux's page allocator entirely in order to support page coloring.

In contrast, Cichlid allows an application to explicitly request physical memory of a certain color and map it according to its needs. For instance, a streaming database join operator can restrict the large relation (which is streamed from disk) to a small portion of the cache, as most accesses would result in a cache miss anyway, and keep the smaller relation completely in cache.

Table 8 shows the results of parallel execution of two instances of the HPC Challenge suite RandomAccess benchmark on cores that share the same last-level cache. In the first column we show the performance of each instance running in isolation. We see a significant drop in GUP/s for the instance with the smaller working set when both instances run in parallel. By applying cache partitioning we can keep the performance impact on the smaller instance to a minimum, while improving the performance of the larger instance even compared to the case where the larger instance runs in isolation. The reason behind this unexpected performance improvement is that the working set (the table) of the larger instance is restricted to a small fraction of the cache, which reduces conflict misses between the working set and other data structures such as process state.

Process     Isolation   Parallel          Parallel + Colors
16M Table   0.0926      0.0834 (90.0%)    0.0921 (99.5%)
64M Table   0.0570      0.0561 (98.4%)    0.0631 (110.7%)

Table 8: Parallel execution of GUPS on Cichlid with and without cache coloring. Values in GUP/s.
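For intuition, a frame's cache color in a physically indexed, set-associative last-level cache can be computed as in this sketch (illustrative cache parameters, not those of our test machines):

    #include <stdint.h>

    #define PAGE_BITS  12u                 /* 4 kB frames */
    #define LINE_BYTES 64u
    #define LLC_BYTES  (8u << 20)          /* e.g. an 8 MB LLC */
    #define LLC_WAYS   16u

    /* Frames with the same color compete for the same cache sets, so an
       allocator that hands out disjoint colors partitions the cache. */
    static unsigned frame_color(uint64_t paddr)
    {
        unsigned sets   = LLC_BYTES / (LLC_WAYS * LINE_BYTES);
        unsigned colors = (sets * LINE_BYTES) >> PAGE_BITS;
        return (unsigned)((paddr >> PAGE_BITS) % colors);
    }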
With this evaluation, we have shown that the flexibility of Cichlid's memory system allows applications to optimize their physical resources for a particular workload, independent of a system-wide policy, without sacrificing performance.

Cichlid's strength lies in its flexibility. By stripping back the policies baked into traditional VM systems over the years (many motivated by RAM as a scarce resource) and exposing hardware resources securely to programs, it performs as well as or better than Linux for most benchmarks, while enabling performance optimizations not previously possible in a clean manner.

5 Related work

Prior to Barrelfish and seL4, the idea of moving memory management into the application rather than a kernel or external paging server had been around for some time. Engler et al. in 1995 [35] outlined much of the motivation for moving memory management into the application rather than the kernel or an external paging server, and described AVM, an implementation for the Exokernel [36] based on a software-loaded TLB, presenting a small performance evaluation on microbenchmarks. AVM referred to physical memory explicitly by address, and "secure bindings" conferred authorization to map it. Since then, software-loaded TLBs have fallen out of favor due to hardware performance trends. Cichlid targets modern hardware page tables, and uses capabilities to both name and authorize physical memory access.

The V++ Cache Kernel [18] implemented user-level management of physical memory through page-frame caches [46], allowing applications to monitor and control the physical frames they have, with a focus on better page-replacement policies. A virtual address space is a segment which is composed of regions from other segments, called bound regions. A segment manager, associated with each segment, is responsible for keeping track of the segment-to-page mappings and hence for handling page faults. Pages are migrated between segments to handle faults. Segment managers can run separately from the faulting application; it is critical to avoid double faults in the segment manager. Initialization is handled by the kernel, which creates a well-known segment.

Other systems have also reflected page faults to user space. Microkernels like L4 [57], Mach [64], Chorus [1], and Spring [51] allow server processes to implement custom page management policies. In contrast, the soft-realtime requirements of continuous media motivated Nemesis [44] to redirect faults to the application itself, to ensure resource accountability. As with AVM, the target hardware was a uniprocessor with a software-loaded TLB. A similar upcall mechanism for reflecting page faults was used in K42 [55].

In contrast, extensible kernels like SPIN [10] and VINO [34] allow downloading of safe policy extensions into the kernel for performance. For example, SPIN's kernel interface to memory has some similarity with Cichlid's user-space API:
PhysAddr allowed allocation, deallocation, and reclamation of physical memory; VirtAddr managed a virtual address space; and Translation allowed the installation of mappings between the two, as well as event handlers to be installed for faults. In comparison, Cichlid allows applications to define policies completely in user-space, whereas SPIN has to rely on compiler support to make sure the extensions are safe for use in kernel-space.
6 Conclusion

Cichlid inverts the classical VM model and securely exposes physical memory and MMU hardware to applications without recourse to virtualization hardware. It enables a variety of optimizations based on the memory system which are either impossible to express in Unix-like systems, or can only be cast as "hints" to a fixed kernel policy. Although MMU hardware has evolved to support a Unix-oriented view of virtual memory, Cichlid outperforms the Linux VM in many cases, and equals it in others.

Cichlid explores a very different style of OS service provision. Demand paging often badly impacts modern applications that rely on fast memory; the virtual address space can be an abstraction barrier which degrades performance. In Cichlid, in contrast, an application knows when it has insufficient physical memory and must explicitly deal with it. Given current trends in both applications and hardware, we feel this "road less travelled" in OS design is worthy of further attention. Exposing hardware securely to applications, libraries, and language runtimes may be the only practical way to avoid the increasing complexity of memory interfaces based purely on virtual addressing.

References

[1] Abrossimov, E., Rozier, M., and Shapiro, M. Generic Virtual Memory Management for Operating System Kernels. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles (1989), SOSP '89, ACM, pp. 123–136.

[2] Ahn, J., Jin, S., and Huh, J. Revisiting Hardware-assisted Page Walks for Virtualized Systems. In Proceedings of the 39th Annual International Symposium on Computer Architecture (Washington, DC, USA, 2012), ISCA '12, IEEE Computer Society, pp. 476–487.

[3] Appel, A. W., and Li, K. Virtual Memory Primitives for User Programs. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 1991), ASPLOS IV, ACM, pp. 96–107.

[4] ARM Ltd. Cortex-A9 Technical Reference Manual. Revision r4p1.

[5] Aziz, K. Improving the Performance of Transparent Huge Pages in Linux. https://blogs.oracle.com/linuxkernel/entry/performance_impact_of_transparent_huge, Aug 2014.

[6] Barr, T. W., Cox, A. L., and Rixner, S. Translation Caching: Skip, Don't Walk (the Page Table). In Proceedings of the 37th Annual International Symposium on Computer Architecture (New York, NY, USA, 2010), ISCA '10, ACM, pp. 48–59.

[7] Basu, A., Gandhi, J., Chang, J., Hill, M. D., and Swift, M. M. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (New York, NY, USA, 2013), ISCA '13, ACM, pp. 237–248.

[8] Baumann, A., Barham, P., Dagand, P.-E., Harris, T., Isaacs, R., Peter, S., Roscoe, T., Schüpbach, A., and Singhania, A. The Multikernel: a new OS architecture for scalable multicore systems. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (2009), pp. 29–44.

[9] Belay, A., Bittau, A., Mashtizadeh, A., Terei, D., Mazières, D., and Kozyrakis, C. Dune: safe user-level access to privileged CPU features. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI) (Hollywood, CA, USA, 2012).

[10] Bershad, B. N., Savage, S., Pardyak, P., Sirer, E. G., Fiuczynski, M. E., Becker, D., Chambers, C., and Eggers, S. Extensibility, Safety and Performance in the SPIN Operating System.
In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 1995), SOSP '95, ACM, pp. 267–283.

[11] Bhargava, R., Serebrin, B., Spadini, F., and Manne, S. Accelerating Two-dimensional Page Walks for Virtualized Systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (2008), ASPLOS XIII, pp. 26–35.

[12] Bhattacharjee, A. Large-reach Memory Management Unit Caches. In
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (New York, NY, USA, 2013), MICRO-46, ACM, pp. 383–394.

[13] Boehm, H.-J. Conservative GC algorithmic overview.

[14] Boehm, H.-J. GCBench. http://hboehm.info/gc/gc_bench/.

[15] Boehm, H.-J., Demers, A. J., and Shenker, S. Mostly Parallel Garbage Collection. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation (1991), PLDI '91, pp. 157–164.

[16] Boyd-Wickizer, S., Chen, H., Chen, R., Mao, Y., Kaashoek, F., Morris, R., Pesterev, A., Stein, L., Wu, M., Dai, Y., Zhang, Y., and Zhang, Z. Corey: An Operating System for Many Cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI '08, USENIX Association, pp. 43–57.

[17] Casey, M. Performance Issues with Transparent Huge Pages (THP). https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge, Sep 2013.

[18] Cheriton, D. R., and Duda, K. J. A Caching Model of Operating System Kernel Functionality. In Proceedings of the 1st USENIX Conference on Operating Systems Design and Implementation (Monterey, California, 1994), OSDI '94, USENIX Association.

[19] Corbet, J. AutoNUMA: the other approach to NUMA scheduling. http://lwn.net/Articles/488709/, Mar 2012.

[20] Corbet, J. NUMA in a hurry. http://lwn.net/Articles/524977/, Nov 2012.

[21] Corbet, J. Toward better NUMA scheduling. http://lwn.net/Articles/486858/, Mar 2012.

[22] Corbet, J. NUMA scheduling progress. http://lwn.net/Articles/568870/, Oct 2013.

[23] Corbet, J. User-space page fault handling. http://lwn.net/Articles/550555/, May 2013.

[24] Corbet, J. 2014 LSFMM summit: Huge page issues. http://lwn.net/Articles/592011/, Mar 2014.

[25] Corbet, J. NUMA placement problems. http://lwn.net/Articles/591995/, Mar 2014.

[26] Corbet, J. Page faults in user space: MADV_USERFAULT, remap_anon_range(), and userfaultfd(). http://lwn.net/Articles/615086/, Oct 2014.

[27] Corbet, J. Transparent huge pages in 2.6.38. http://lwn.net/Articles/423584/, Jan 2014.

[28] Dagand, P.-E., Baumann, A., and Roscoe, T. Filet-o-Fish: practical and dependable domain-specific languages for OS development. In (Oct 2009).

[29] Dashti, M., Fedorova, A., Funston, J., Gaud, F., Lachaize, R., Lepers, B., Quema, V., and Roth, M. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (Houston, Texas, USA, 2013), ASPLOS '13, ACM, pp. 381–394.

[30] Derrin, P., Elkaduwe, D., and Elphinstone, K. seL4 Reference Manual. NICTA, 2006.

[31] Dillon, M. Design elements of the FreeBSD VM system - Page Coloring. Online, Nov 2013. Accessed 2015-08-26.

[32] Elkaduwe, D., Derrin, P., and Elphinstone, K. A memory allocation model for an embedded microkernel. In Proceedings of the 1st International Workshop on Microkernels for Embedded Systems (MIKES) (2007), pp. 28–34.

[33] Elkaduwe, D., Derrin, P., and Elphinstone, K. Kernel Design for Isolation and Assurance of Physical Memory. In Proceedings of the 1st Workshop on Isolation and Integration in Embedded Systems (New York, NY, USA, 2008), IIES '08, ACM, pp. 35–40.

[34] Endo, Y., Seltzer, M., Gwertzman, J., Small, C., Smith, K. A., and Tang, D. VINO: The 1994 Fall Harvest.
Technical Report TR-34-94, Center for Research in Computing Technology, Harvard University, December 1994.

[35] Engler, D. R., Gupta, S. K., and Kaashoek, M. F. AVM: Application-level Virtual Memory. In Proceedings of the Fifth Workshop on Hot Topics in Operating Systems (HotOS-V) (1995), HOTOS '95, IEEE Computer Society, pp. 72–.

[36] Engler, D. R., Kaashoek, M. F., and