A Least-Privilege Memory Protection Model for Modern Hardware

Abstract

We present a new least-privilege-based model of addressing on which to base memory management functionality in an OS for modern computers like phones or server-based accelerators. Existing software assumptions do not account for heterogeneous cores with different views of the address space, leading to the related problems of numerous security bugs in memory management code (for example when programming IOMMUs), and an inability of mainstream OSes to securely manage the complete set of hardware resources on, say, a phone System-on-Chip. Our new work is based on a recent formal model of address translation hardware which views the machine as a configurable network of address spaces. We refine this to capture existing address translation hardware from modern SoCs and accelerators at a sufficiently fine granularity to model minimal rights both to access memory and to configure translation hardware. We then build an executable specification in Haskell, which expresses the model and metadata structures in terms of partitioned capabilities. Finally, we show a fully functional implementation of the model in C created by extending the capability system of the Barrelfish research OS. Our evaluation shows that our unoptimized implementation has performance comparable to, and in some cases better than, the Linux virtual memory system, despite both capturing all the functionality of modern hardware addressing and enabling least-privilege, decentralized authority to access physical memory and devices.



Reto Achermann, Nora Hossle, Lukas Humbel, Daniel Schwyn, David Cock, Timothy Roscoe
Systems Group, Department of Computer Science, ETH Zurich


1. Introduction

Both modern, fully-verified operating systems and traditional production-quality kernels rely on a model of memory addressing and protection so simple it is rarely remarked on: RAM and devices reside at unique addresses in a single, shared physical address space, and all cores have homogeneous memory management units which translate from a virtual address space into these physical addresses. These MMUs are all configured by a single monolithic kernel.

Unfortunately, this model bears little relation to modern hardware. Modern platforms like phone SoCs violate the assumption of a single physical address space. A modern computer is, in reality, a network of address spaces with ad-hoc address translation functions between them, many configurable by sufficiently-privileged system software. Access to memory is performed by a variety of heterogeneous cores and I/O devices from different points in this network. Simply configuring a given platform correctly to maintain the assumption of a single physical address space, on which the verification is based, is by itself a complex and error-prone process.

The result is that traditional kernels suffer from numerous (and continuing) security bugs arising from incorrect assumptions about memory addressing in the system, while correctness proofs for verified kernels are cast into doubt by the existence of “cross-SoC” attacks.

Moreover, centralized authority over all memory access does not accommodate features like the secure co-processors and management engines standard in modern PCs as well as phone platforms. Authority to grant access to memory and devices needs to be decentralized, and this decentralization represented in the specification of the OS itself.

This paper develops an alternative model of hardware memory addressing, protection, and authorization which captures the richness, complexity, and diversity of modern hardware platforms. Our model can serve as a basis for formal verification of system software but also as an informal basis for designing correct memory management functionality.

The model is based on two guiding principles: completeness, meaning that we capture the full semantics of real addressing hardware without simplifying assumptions, and least-privilege, meaning that we represent individual authority to both access memory and modify translations at as fine a granularity as allowed by the hardware.

In the next section we elaborate on the mismatch between modern hardware and OS designs, and existing efforts to address it. In Section 3 we review the recent related work on which this paper builds, and lay out our methodology.

Our first contribution, in Section 4, is the development of the model itself. We start from an abstract model of memory addressing and progressively refine it until it captures the salient features of modern memory hardware, including the rights to modify translations in multi-level page tables and custom protection units. We build an executable spec in Haskell which serves as a basis for an implementation in a real OS.

In Section 5 we describe how Linux might be extended with a subset of our model (forgoing least-privilege and centralizing authority in the kernel), and present a full implementation of the model by extending the capability system in the Barrelfish research OS. Our implementation runs on real hardware and can manage protection rights on a variety of hardware platforms. We also discuss the minimal overhead it incurs for metadata and bookkeeping.

In Section 6 we evaluate the performance of this memory system and show that, despite the richer and more faithful view of hardware it embodies, it provides comparable performance to the highly optimized, but less functional, Linux virtual memory system on identical hardware.

This paper also has a set of non-goals. First, we do not present any formally-verified OS software; our goal is rather to show a model which can be used as a replacement for the over-simplified addressing models currently used in the proofs for verified systems like seL4 and CertiKOS.

Second, we develop no new memory subsystem for Unix-like OSes. As we note below, reasoning about the correctness and security of a modern computer requires going beyond a Linux kernel to capture co-processors and intelligent devices. We sketch in Section 5.1 how a simplified version of our model might be retrofitted to Linux.

Finally, this is not a bug-finding paper. We do not aim to find problems in existing OS code (though we cite numerous examples from other work). Instead, we lay the foundations for a more faithful view of the hardware on which to base better system software.

2. Motivation and Related Work

We first review the implicit model of memory addressing used by existing OSes, and then explain with concrete examples why it no longer reflects hardware reality. We discuss the implications of this for both new, formally verified OSes and traditional kernels like Linux, and the limitations of existing approaches to the problem in both kinds of OS.

Address translation is a fundamental technique in computing, enabling relocation, demand paging, machine virtualization (either via processes or full virtual machines), shared memory, inter-process protection, and much other functionality.

Typically, the key abstraction employed is a virtual address space, accesses to which are translated into addresses within a unique, machine-wide physical address space by hardware mechanisms (TLBs, multi-level page tables, etc.).

Physical memory addresses can thus be used as unambiguous, system-wide identifiers for memory and devices, and so are also used to keep track of access rights: Linux maintains such a data structure for each page or frame of physical memory, while some microkernels like seL4 [14, 23] and Barrelfish [11, 15] use a capability system [26] to represent physical memory regions with access rights.

An OS must configure translation hardware and maintain these data structures to ensure correct and secure operation. For example, user programs should only be able to load and store to physical resources (memory, or memory-mapped I/O devices) the OS has granted them access rights to.

Unfortunately, modern hardware platforms violate the assumptions in the traditional model above. They are composed of multiple, heterogeneous cores and devices, each of which can issue accesses to byte-addressable memory resources such as DRAM, non-volatile memory or device registers. Worse, there is no single “reference” physical address space [16]. Instead, a network of address spaces or buses is connected by address translation units which “route” accesses through the network.

This breaks most of the assumptions of the classical model: different cores and devices translate their virtual addresses into different physical address spaces, physical addresses can no longer be used as global identifiers without further scoping, address aliasing is not only possible but likely, and finally, software with access to translation units can reconfigure the physical address space underneath the system's MMUs.

For example, the Xeon Phi co-processor [21] implements a “system memory page table” which further translates physical (post-MMU) addresses from the accelerator cores into the host's PCI address space using a single, shared register array where each register controls the translation of a fixed 16GB page in the Xeon Phi's “physical” address space.

Such additional layers of translation are commonplace in phone Systems-on-Chip like the NXP iMX8 [33], Texas Instruments OMAP [39], and NVIDIA Parker [31] processors. Such SoCs contain a variety of different processors with different physical address spaces, which overlap and intersect [16]. This is a deliberate, rational design choice: for example, it is important that a secure co-processor holding encryption keys has private memory that cannot be accessed from application cores, even in kernel mode.

I/O memory management units (IOMMUs, or System MMUs) translate addresses generated by accelerators and DMA-capable devices into a “canonical” system-wide physical address space. This allows user-space programs to share a virtual address space with a context on the device, but imposes a further complexity burden on the underlying OS, which must now ensure that IOMMUs are always correctly programmed. This code is fraught with complexity and consequent bugs and vulnerabilities, as it is also intended to provide protection from malicious memory accesses [29, 30, 28, 27]. The problem is likely to get worse with the proliferation of IOMMU designs built into GPUs, co-processors, and intelligent NICs.

OpenCL's Shared Virtual Memory extends the global memory region into the host memory region using three different types [22]. Similarly, NVIDIA's CUDA [32] and HSA [19] provide a unified view of memory. The same concerns apply here: the complexity of maintaining a shared virtual address space is pushed to system software, but remains.

Even memory controllers can violate the traditional model. Hillenbrand et al. [18] reconfigure memory controllers from system software to provide DRAM aliases for mitigating the performance effects of channel and bank interleaving. Proposals for “in-memory” or “near-data” processing [34] raise further questions for OS abstractions [9] and require a way to unambiguously refer to memory regardless of which module accesses it.
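As an illustration of such an extra translation layer, the register-array translation described above for the Xeon Phi can be sketched in Haskell. This is a minimal sketch rather than the device's programming interface: the names `Smpt`, `translateSmpt` and `lookupIdx` are hypothetical, the table is assumed to start at address 0 of the translated window, and each programmed entry is assumed to hold a 16GB-aligned base in the host's PCI address space.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)

-- Each register covers one fixed 16GB page, i.e. 2^34 bytes.
pageBits :: Int
pageBits = 34

-- One entry per 16GB page (assumed layout); Nothing models an
-- unprogrammed register, Just holds a 16GB-aligned host PCI base.
type Smpt = [Maybe Word64]

-- Safe list indexing (hypothetical helper).
lookupIdx :: Int -> [a] -> Maybe a
lookupIdx i xs = case drop i xs of
  (x:_) | i >= 0 -> Just x
  _              -> Nothing

-- Translate a post-MMU address into the host's PCI address space.
-- The upper bits select the register, the lower 34 bits pass through.
translateSmpt :: Smpt -> Word64 -> Maybe Word64
translateSmpt smpt addr = do
  let idx    = fromIntegral (addr `shiftR` pageBits) :: Int
      offset = addr .&. ((1 `shiftL` pageBits) - 1)
  entry <- lookupIdx idx smpt
  base  <- entry
  pure (base .|. offset)
```

Because the table is shared by all accelerator cores, rewriting a single entry changes where every core's accesses to that 16GB page end up, which is the kind of reconfiguration beneath the MMUs that the classical model does not capture.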

Correctness arguments about OS code therefore rely on assumptions about the hardware that no longer hold. Proofs for the seL4 microkernel [23] assume a single, fixed, physical address space without other translation hardware, and provide no guarantees of safety in the presence of other cores or incorrectly programmed DMA devices. CertiKOS [17] proves functional correctness based on a model of memory accesses to abstract regions of private, shared or atomic memory, but again provides no proof in the presence of other translation units and heterogeneous cores. Even work on verifying memory consistency in the presence of translation only considers the simple case of virtual-to-physical mappings [36].

Proofs aside, the difficulty of getting complex memory addressing right in an OS is shown by the steady stream of related bugs and vulnerabilities in Linux [20], for example ignoring holes in huge pages (CVE-2017-16994), miscalculation of the number of affected pages (CVE-2014-3601), access rights for data pages (CVE-2014-9888), interactions of virtually mapped stack with DMA scatter lists (CVE-2017-8061), and handling of shadow page tables (CVE-2016-3960). Moreover, miscalculations, misinterpretations or underflows of addresses and offsets (Linux commits 9d8c3af3160, 7655739143, 29a90b708 and 5016bdb79), mixing up memory addresses with MSI-X interrupt ranges (Linux: 17f5b569e09cf) and IOMMU address space allocations (Linux: a15a519ed6e) cause unexpected behavior, crashes or memory corruption.

Faced with the complexity of hardware, a number of ad-hoc point solutions have appeared for specific cases, primarily GPUs, such as VAST [25], which uses compiler support to dynamically copy memory to and from the GPU, and Mosaic [8], which provides support for multiple sizes of page translation in a shared virtual address space between CPU and GPU. In DVMT [3], applications request physical frames from the OS that have specified properties. The system allows applications to customize how the virtual-to-physical mapping is set up by registering a TLB miss handler for the special DVMT range. The CBuf [35] system globally manages virtual and physical memory, focusing on efficient sharing and moving data between protection domains. CBuf unifies shared memory, memory allocation and system-wide physical memory allocation.

All these approaches aim to simplify user code, at the cost of OS complexity. In contrast, our work is a response to this complexity: the central OS abstraction of a single, shared, global physical address space, combined with straightforward translations to it from virtual address spaces, is inadequate for a secure and reliable OS running on modern hardware. We need a richer model of addressing, and this paper is based on one which views address spaces as nodes in a network of translation units.

3. Methodology

Our new model builds on the existing decoding net model of Achermann et al. [1, 2], which has been shown to provide a precise formal model of many of the sorts of systems we consider in this work: multi-socket NUMA systems, ARM SoCs, plug-in accelerators, etc.

Achermann et al. model the addressing structure of a system as a directed graph, where nodes represent (virtual or physical) address spaces or devices (including RAM), and edges the translation of AS-local addresses into other address spaces (ASs) or devices. The graph is a set of nodes, defined as an abstract datatype so:

    name = Name nodeid address
    node = Node accept :: {address}
                translate :: address → {name}

Their model distinguishes local names (address), relative to some address space, and global names (name), which qualify a local name with its enclosing address space. Each node may accept a set of (local) addresses, and/or translate them to one or more global names (addresses in other address spaces).

This existing model is a long way from being a basis for an operational system. In Section 4.1 we add two important features: dynamic configuration of the translate function, which captures how real translation units can be programmed, and rights corresponding to the ability for software processes to configure such units. We model the complex network of interacting address spaces, and identify and label the necessary divisions of authority as finely as possible, following the principle of least privilege.

We adopt a methodology strongly influenced by the successful combination of refinement and executable specification used in the seL4 project.

Specifically, we begin by identifying all relevant objects (page tables, address spaces, ...), the subjects that manipulate them (processes, the kernel, devices, ...), and which authority each subject exercises over an object (e.g. in mapping a frame to a virtual address). These are expressed in an access-control matrix (following Lampson [24]) which forms our abstract specification, analogous to the high-level security policy (integrity) shown to be refined (correctly implemented) all the way down to compiled binaries for seL4 [38].

Again, as in seL4 [12], we next develop an executable specification in Haskell (see Section 4.2), expressing subjects, objects, and authority as first-class objects, permitting rapid prototyping without giving up strong formal semantics. Correspondence between abstract and executable models is thus far by inspection and careful construction.

Finally, we show (again with precedent [40]) that the executable model (and hence the abstract model) permits multiple high-performance implementations: in the Barrelfish OS (expressing rows of the access matrix with capabilities, see Section 5.2), and in Linux (collapsing distinct authorities held by the kernel, and taking columns as access-control lists, see Section 5.1). Barrelfish and seL4 have closely related capability-based resource management and authorization systems and our implementation transfers naturally to seL4; Barrelfish is currently a better platform for our work, due to its support for multiprocessing and heterogeneous hardware.

By adopting a proven methodology, we can be confident that the resulting artifact is compatible with an seL4-style verification, and could thus serve as a more accurate replacement for the hardware model underlying the seL4 or CertiKOS proofs. Simultaneously, by careful selection of an abstract model (the access-control matrix) and through the use of refinement, our model is not specific to a particular implementation.
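To make the decoding net datatype reviewed at the start of this section more concrete, the following Haskell fragment is a minimal sketch of nodes that accept or translate addresses, together with a function that follows translations until an access is accepted. It is an illustration under stated assumptions, not the formalization of Achermann et al.: the accept set is represented by a predicate, and `Net`, `resolve`, its `fuel` bound and `exampleNet` are hypothetical names introduced here.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

type NodeId  = Int
type Address = Integer

-- A global name qualifies a local address with its enclosing address space.
data Name = Name NodeId Address deriving (Eq, Ord, Show)

-- A node may accept local addresses and/or translate them to global names.
data Node = Node
  { accept    :: Address -> Bool      -- characteristic function of the accept set
  , translate :: Address -> Set Name
  }

-- The net assigns a node to every node identifier.
type Net = NodeId -> Node

-- All names at which an access to the given name may be accepted.
-- The fuel argument bounds the number of hops so that cyclic nets terminate.
resolve :: Int -> Net -> Name -> Set Name
resolve fuel net n@(Name nd a)
  | fuel <= 0 = Set.empty
  | otherwise = Set.union here onward
  where
    node   = net nd
    here   = if accept node a then Set.singleton n else Set.empty
    onward = Set.unions
      [ resolve (fuel - 1) net m | m <- Set.toList (translate node a) ]

-- Node 0: a virtual address space mapping one 4 KiB page into node 1.
-- Node 1: a RAM region that accepts accesses. All other nodes are empty.
exampleNet :: Net
exampleNet 0 = Node (const False)
                    (\a -> if a >= 0x1000 && a < 0x2000
                             then Set.singleton (Name 1 (a + 0x80000000))
                             else Set.empty)
exampleNet 1 = Node (\a -> a >= 0x80000000 && a < 0x90000000) (const Set.empty)
exampleNet _ = Node (const False) (const Set.empty)
```

Loaded into GHCi, `resolve 8 exampleNet (Name 0 0x1234)` follows the single translation step from node 0 into node 1, while an address outside the mapped page resolves to the empty set.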

4. Model

We derive our abstract, formal model from the existing decoding net model in two steps. First, we extend the model to include dynamic behavior (updating translations), and express the required authority using an access-control matrix. Second, we build a (still relatively abstract) executable specification in Haskell, allowing us to reason concretely about implementation trade-offs.

Decoding nets are static: they represent the current state of the system. To describe the dynamic behavior of a system, we add an abstraction above the decoding net, consisting of a set of (dynamic) address spaces. The state of the system is then expressed as a function from address space to the mapping node representing its current configuration:

    configuration = address space → node

We can then express the configuration space of an address space as a set of possible configurations:

    config space = address space → {node}

The configuration space of a page table in a system with 4 KiB translation granularity would, for example, only include nodes that map all addresses in any naturally-aligned 4 KiB region contiguously. We will use the configuration space to express allowable system states according to a security property.

At this level of abstraction, state transitions are simply changes in the current configuration of the address spaces:

    ModifyMap :: address → name → configuration → configuration

Authority. Consider Figure 1, representing the general case of an update to an intermediate address space (for example the intermediate physical address, IPA, in a two-stage translation system). We identify two distinct rights (authorities): the map right, or the right to change the meaning of an IPA by changing its mapping; and the grant right, or the right to grant access (by mapping) to some range of physical addresses. These two rights do not necessarily go together.

[Figure 1 diagram: Virtual Address Space, Intermediate Address Space and Physical Address Space, connected by mappings labelled map and grant.]

Figure 1: Mappings between address spaces showing grant and map rights of mapped segments.

[Figure 2 diagram: Xeon Phi Core, Xeon Phi Bus, Xeon Phi SMPT, GDDR, IOMMU, Registers, DMA Core, IOMMU, PCI Bridge Window, RAM, CPU Core.]

Figure 2: Address spaces in a system with two PCI devices

Consider Figure 2, showing the address-space structure of a system with two PCI devices: a DMA engine and an Intel Xeon Phi co-processor. Imagine that we wish to establish a shared mapping to allow a process on a Xeon Phi core to receive DMA transfers (e.g. network packets) into a buffer allocated to it in the on-board GDDR.

The process ‘owns’ the buffer, and has the ability to call recv(), triggering a DMA transfer. We interpret this as the process having the right to grant access (temporarily) to the DMA core. The user-level process, however, clearly should not have the ability to modify the IOMMU mappings of the DMA core at will (or its own, for that matter). That is, it does not have the map right on the relevant address space.

What is needed is some agent (hereafter a subject, in standard authority-control terminology) with both the grant right on the buffer object, and the map right on the address space object. In a traditional monolithic kernel, both these rights are held (implicitly) by the kernel, which exercises them on behalf of the subjects. It is up to the kernel to maintain accurate bookkeeping to determine whether any such request is safe, typically using an ACL (access-control list), i.e. authority tied to the subject.

In a microkernel such as seL4 or Barrelfish, these rights are represented by capabilities, handed explicitly to one subject in order to authorize the operation. In this case, authority is tied to the object. These are equivalent from the perspective of access control, differing only in implementation detail: in both cases, the same two basic authorities are present.

    subject / object     DMA IOMMU    buffer
    IOMMU driver         map
    Xeon Phi process                  grant

Table 1: Access control matrix of the Xeon Phi example

Right R1 (Grant)

The right to insert this object into some address space

Right R2 (Map)

The right to insert some object into this address space

Note that the 'virtual' and 'physical' address spaces of Figure 1 can be viewed as special cases of an intermediate address space: a top-level 'virtual' address space is simply one to which nobody has a grant right, and a 'physical' address space, e.g. DRAM, is one to which there exists no map right.

The standard representation of authority in systems is an access control matrix [24], such as that of Table 1. This can be read in rows: the IOMMU driver has the map capability to the IOMMU address space, and the process the grant capability to the buffer. Alternatively, reading down the columns gives the ACLs: the IOMMU records map permission for the driver, and for the buffer is recorded a grant permission for the process.

This access control matrix on maps and grants is our abstract model. A system is correct (secure) statically if its current configuration is consistent with the access control matrix. It is secure dynamically if any possible transition, beginning in a secure state, must leave the system in a secure state.
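The access control matrix of the Xeon Phi example, with its row (capability) and column (ACL) readings, can be sketched in Haskell as follows. This is an illustrative sketch under assumptions, not part of the paper's specification: subjects and objects are plain strings, and the names `Perm`, `Matrix`, `holds`, `capabilities`, `acl`, `mayInstall` and `handOver` are hypothetical.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- R1 (grant) and R2 (map) from the text.
data Perm = GrantRight | MapRight deriving (Eq, Ord, Show)

type Subject = String
type Object  = String

-- The access-control matrix: the rights a subject holds on an object.
type Matrix = Map.Map (Subject, Object) (Set.Set Perm)

holds :: Matrix -> Subject -> Perm -> Object -> Bool
holds m s p o = maybe False (Set.member p) (Map.lookup (s, o) m)

-- Row view: the capabilities held by one subject.
capabilities :: Matrix -> Subject -> [(Object, Perm)]
capabilities m s =
  [ (o, p) | ((s', o), ps) <- Map.toList m, s' == s, p <- Set.toList ps ]

-- Column view: the ACL attached to one object.
acl :: Matrix -> Object -> [(Subject, Perm)]
acl m o =
  [ (s, p) | ((s, o'), ps) <- Map.toList m, o' == o, p <- Set.toList ps ]

-- Installing obj into aspace needs one subject holding both
-- the grant right on obj and the map right on aspace.
mayInstall :: Matrix -> Subject -> Object -> Object -> Bool
mayInstall m s obj aspace =
  holds m s GrantRight obj && holds m s MapRight aspace

-- Hand a right that `from` holds over to `to`; no effect if `from` lacks it.
handOver :: Subject -> Subject -> Perm -> Object -> Matrix -> Matrix
handOver from to p o m
  | holds m from p o = Map.insertWith Set.union (to, o) (Set.singleton p) m
  | otherwise        = m

-- Table 1 of the text.
table1 :: Matrix
table1 = Map.fromList
  [ (("IOMMU driver", "DMA IOMMU"), Set.singleton MapRight)
  , (("Xeon Phi process", "buffer"), Set.singleton GrantRight)
  ]
```

With `table1` as given, neither subject alone can install the buffer into the DMA IOMMU address space. Once the process hands its grant right on the buffer to the IOMMU driver, `mayInstall` succeeds for the driver, which mirrors the capability hand-off described for microkernels.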

Thus far we have expanded upon the existing decoding net model, giving us a dynamic access-control matrix formulation of the system's correctness property. Next, we implement a (still abstract) reference monitor [4] in Haskell, to aid in rapid prototyping of both model and implementation, and as an intermediate step in the process of refinement from abstract specification to operational, high-performance implementation. In this, we again take our example from the seL4 approach, which used just such an executable specification [13] to prototype the kernel prior to implementation in C.

Given our target environments of Linux and Barrelfish, operations and data structures for the reference monitor are named in a manner suggestive of an OS kernel, although other implementations would be possible. The most important detail added at this stage is to make translation structures explicitly visible. The reason for this is to allow us to express the fact that the translation state of the system depends, in a deterministic manner, on the contents of RAM and device registers (e.g. segment registers). This in turn allows us to express the invariant (necessary for integrity of the reference monitor) that no such objects are ever made accessible (i.e. mapped) outside the monitor itself:

[Figure 3 diagram: RAM, Frame and TStructure objects, grouped into mappable and unmappable objects, with retype, use, map and grant labels and the AddressSpaceOf relation to a Configurable Address Space.]

Figure 3: Object Type Hierarchy and possible rights (green).

    mappingTrace :: (Operation KernelState)
    mappingTrace = do
      ...
      -- retype a RAM object to a Frame
      res <- retype RAM Disp Frame Disp
      -- retype another RAM object to a translation structure
      res <- retype RAM2 Disp TStructure Disp
      -- map the frame into the translation structure
      mapping1 <- Model.map TStructure Frame Disp
      ...

Figure 4: Mapping a RAM object

Invariant I1 (Never Accessible)

Subjects can never access unmappable objects

Note that (in contrast to the seL4 executable specification), the details of the translation structures are kept opaque at this point; we merely record that they exist at certain locations by dividing the mappable address spaces into objects (with terminology borrowed from Barrelfish):

    data Object = RAM {base :: Name, size :: Natural}
                | Frame {base :: Name, size :: Natural}
                | TStructure {base :: Name, size :: Natural}

Objects form a hierarchy (Figure 3) which defines how objects can be derived from each other. For example, translation structures (TStructure) are created by retyping RAM objects. The previous invariant now reduces to stating that no object of type TStructure is ever mapped. RAM is the base type for untyped memory, and a Frame is RAM that has been retyped to be mappable.

In addition, the set of translation structures defines (again in an implementation-specific manner), the set of address spaces:

AddressSpaceOf :: TStructure -> AddressSpace
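The retyping rules and the reduced form of Invariant I1 stated above can be checked mechanically. The following is a minimal sketch, assuming a concrete layout for `Name` (an address space identifier plus an address) and a side condition that a derived object must lie inside the region it is retyped from. Both are assumptions made for illustration, as are the helper names `mappable`, `isTStructure`, `canRetype` and `invariantI1`.

```haskell
import Numeric.Natural (Natural)

-- Assumed layout: (address space identifier, address).
data Name = Name Int Natural deriving (Eq, Show)

data Object
  = RAM        { base :: Name, size :: Natural }
  | Frame      { base :: Name, size :: Natural }
  | TStructure { base :: Name, size :: Natural }
  deriving (Eq, Show)

-- Only frames are mappable: RAM is untyped memory and translation
-- structures must stay hidden inside the monitor.
mappable :: Object -> Bool
mappable Frame{} = True
mappable _       = False

isTStructure :: Object -> Bool
isTStructure TStructure{} = True
isTStructure _            = False

-- Derivations named in the text: RAM may be retyped to a Frame or a
-- TStructure. The containment check is an assumed side condition.
canRetype :: Object -> Object -> Bool
canRetype src@RAM{} dst = typeOk dst && within
  where
    typeOk Frame{}      = True
    typeOk TStructure{} = True
    typeOk RAM{}        = False
    Name sAs sAddr = base src
    Name dAs dAddr = base dst
    within = sAs == dAs
          && dAddr >= sAddr
          && dAddr + size dst <= sAddr + size src
canRetype _ _ = False

-- Invariant I1 over the list of currently mapped objects:
-- no translation structure is ever mapped.
invariantI1 :: [Object] -> Bool
invariantI1 mapped = not (any isTStructure mapped)
```

A monitor built along these lines would call `mappable` before installing a mapping, so that `invariantI1` holds by construction rather than being checked after the fact.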

Authority is likewise stored as explicit rights:

    data Authority = Access Object | Map Object
                   | Grant Authority

The monitor (kernel) state is a set of subjects (the term dispatcher being borrowed from Barrelfish), a mapping database (MDB) recording the derivation relation between objects, and a set of active address spaces:

    data KernelState
      = KernelState (Set Dispatcher) MDB (Set AddrSpace)

State transitions are expressed using a monad, Operation. Thus changes to the system's state are a sequence of API calls, e.g. retype or map:

    data Operation a = Operation (State -> (a, State))
    instance Monad (Operation) where ...

Traces are thus sequences of such operations, corresponding to an observed sequence of KernelStates. Each of these states defines a static configuration of the decoding net. Operations include:

• retype converts an existing object into an object of a permissible subtype.
• map installs a mapping in a translation structure.
• copy copies the rights from one subject to another.

Contained within the set of all possible traces T, there is a set of correct traces CT ⊆ T that correspond to sequences of consistent KernelStates. All other traces indicate that execution had to be aborted at some point, since an operation was applied that would otherwise have led to a transition to an inconsistent or disallowed system state.
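The distinction between correct and aborted traces can be illustrated with a small, self-contained state monad. This is a sketch under simplifying assumptions and not the paper's executable specification: the state `KState` tracks only object types and installed mappings, `retype` changes an object's type in place instead of deriving a new object, errors are reported through `Either`, and all names below are hypothetical.

```haskell
import qualified Data.Set as Set

data ObjType = RAM | Frame | TStructure deriving (Eq, Ord, Show)
type ObjId = Int

-- Simplified monitor state: typed objects plus installed mappings.
data KState = KState
  { objects  :: [(ObjId, ObjType)]
  , mappings :: Set.Set (ObjId, ObjId)   -- (translation structure, frame)
  } deriving (Eq, Show)

-- A state transformer that may abort with an error message.
newtype Operation a = Operation
  { runOperation :: KState -> Either String (a, KState) }

instance Functor Operation where
  fmap f (Operation g) = Operation $ \s -> case g s of
    Left e        -> Left e
    Right (a, s') -> Right (f a, s')

instance Applicative Operation where
  pure a = Operation $ \s -> Right (a, s)
  Operation f <*> Operation g = Operation $ \s -> case f s of
    Left e        -> Left e
    Right (h, s') -> case g s' of
      Left e         -> Left e
      Right (a, s'') -> Right (h a, s'')

instance Monad Operation where
  Operation g >>= k = Operation $ \s -> case g s of
    Left e        -> Left e
    Right (a, s') -> runOperation (k a) s'

-- Only RAM may be retyped, and only to a different type.
retype :: ObjId -> ObjType -> Operation ()
retype o t = Operation $ \s -> case lookup o (objects s) of
  Just RAM | t /= RAM ->
    Right ((), s { objects = (o, t) : filter ((/= o) . fst) (objects s) })
  Just _  -> Left ("cannot retype object " ++ show o)
  Nothing -> Left ("no such object " ++ show o)

-- Only a Frame may be mapped, and only into a TStructure.
mapFrame :: ObjId -> ObjId -> Operation ()
mapFrame ts fr = Operation $ \s ->
  case (lookup ts (objects s), lookup fr (objects s)) of
    (Just TStructure, Just Frame) ->
      Right ((), s { mappings = Set.insert (ts, fr) (mappings s) })
    _ -> Left "map needs a TStructure and a Frame"

initial :: KState
initial = KState [(1, RAM), (2, RAM)] Set.empty

goodTrace, badTrace :: Operation ()
goodTrace = retype 1 Frame >> retype 2 TStructure >> mapFrame 2 1
badTrace  = retype 1 TStructure >> retype 2 TStructure >> mapFrame 2 1
```

Running `runOperation goodTrace initial` yields a final state with one mapping, so the trace would belong to CT. `badTrace` attempts to map a translation structure and is aborted with a `Left` value, leaving no inconsistent state behind.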

5. Implementation

We first describe how a subset of our model might be implemented in the Linux monolithic kernel, and then present a full implementation based on the open-source Barrelfish OS [11]. We refer to this new implementation as Barrelfish/MAS, where MAS refers to Multiple Address Spaces.

We describe how one could implement the least-privilege model and add support for multiple address spaces in a monolithic kernel, using Linux as an example.

The Linux kernel acts as the reference monitor and therefore assumes authority over all address spaces in the system: it can change address space mappings and grant access to memory at will. This happens mostly as a reaction to user-space requests such as mmap, but may also originate from policy decisions inside the kernel, e.g. demand paging or page caches where the kernel decides to unmap memory from a process.

A possible way to achieve separation is through intercepting updates to the translation tables, which can be done using the para-virtualization subsystem. Whenever a translation table is changed, this gets converted into an API call to the reference monitor. This gives some form of separation, but without proper virtualization it cannot be strictly enforced.

User-space processes may share memory by creating shared memory objects, which are implemented as files in a ramfs. Linux manages access to such a shared memory object, and file-based objects in general, using standard UNIX permissions, representing an access control list. Consequently, every process with a matching user or group id can access the shared memory object. An ACL means that for each object (resource) in the system there is a list of subjects plus rights. Files can be opened, which gives the process a file descriptor, which can be mmaped; hence a read right on a file can be seen as a grant right to the memory described by that file. After opening a file, the file descriptor can also be passed around, hence the file descriptor also represents a grant right.

Most memory used by applications is not file backed, and hence referred to as anonymous memory. User-space processes have the access right to mapped anonymous memory. The process cannot explicitly hand over the grant right to anonymous memory to another process other than by forking itself, where the child process inherits rights on resources from its parent.

A process can request memory to be mapped and unmapped from its address space. It may supply hints on what type of memory it would like, but in the end the Linux kernel decides where to map and what memory to grant.

Apart from tracking rights, our design also requires an understanding of multiple address spaces, with rights referring to qualified names instead of addresses. In order for Linux to do this, we have to make sure that the physical frames are correctly identified in the presence of multiple address spaces (e.g. the kernel sees a RAM region at a different address than, say, a DMA engine). Each frame is identified by a physical frame number (PFN). We can use this PFN as the canonical name for the frame itself and the sparse memory model in Linux [5] to implement multiple address spaces holding physical resources as memory sections. For each frame of memory, Linux maintains a data structure tracking its use. We can augment the data structure to include type information to implement different memory object types (Linux already distinguishes between user and kernel objects). The relationship between PFNs and local physical addresses would need to be changed from a fixed offset to one that depends on the current configuration of translation hardware plus the currently executing core.

In conclusion, the Linux kernel acts as a central authority holding the grant and map right to all address spaces and resources. A separation is possible when using the para-virtualization subsystem to intercept updates to translation tables. The kernel data structures could be modified to support the notion of multiple address spaces. Because it is very hard to support the full granularity of our model in a monolithic kernel, we chose to implement and evaluate it in a capability system, which is described in the next section.

Barrelﬁsh/MAS

We chose the open-source Barrelfish OS [11] as the basis for our implementation because it uses an seL4-style capability system for authorization and resource management but, in contrast to seL4, has support for heterogeneous platforms and has drivers for IOMMUs and the Intel Xeon Phi co-processor, thus providing a real-world example of complex addressing. We describe the relevant parts of our implementation in Barrelfish/MAS: the capability system that supports multiple address spaces (§ 5.2.1), the implementation of runtime support by generating code for known translations and maintaining a graph of configurable nodes (§ 5.2.2), and finally the adaptation of user-space device drivers (§ 5.2.3).

Barrelfish manages physical resources using a capability system for naming, access control, and accounting of objects in a single physical address space. We describe the Barrelfish/MAS capability system as a whole here, since a clear description of the original Barrelfish capability system has not been published. As in seL4 [14], capabilities are typed to indicate what can be done with the memory they refer to; rules dictate valid retype operations (e.g. retyping RAM to a Frame).

Barrelfish/MAS builds on Barrelfish by adding multiple address spaces and by having capabilities that refer to memory objects hold the object’s canonical base name, the size of the object they are referring to, as well as its type and rights.

Barrelfish/MAS is a partitioned capability system: capabilities are stored in memory-resident objects as well, but these are unmappable, ensuring that no user-space process can forge capabilities by writing to memory locations. A process holding a capability obtains a certain set of rights on the object referred to by the capability. These rights can be exercised by invoking the reference monitor API, which is implemented as a system call interface.

Capabilities encode the canonical names of the objects they refer to, implemented as a struct with two fields: the address space identifier (ASID) and the address within the address space. An optimized variant packs both values into a 64-bit integer providing support for a 16-bit ASID and a 48-bit address, which is sufficient for current platforms.

ASIDs nevertheless are a limited resource, and their allocation must be managed accordingly to avoid ASID exhaustion. We use a dedicated capability to manage ASIDs, where a new range of ASIDs can be allocated by retyping a larger range of ASIDs.

There may be multiple capabilities pointing to the same object, but there is always at least one capability for every given byte in memory.
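The packed variant of the canonical name can be illustrated with a few lines of Python. This is only a sketch: the text fixes the field widths (16-bit ASID, 48-bit address) but not the bit layout, so placing the ASID in the upper bits is an assumption, and the names `pack_name` and `unpack_name` are illustrative rather than taken from the Barrelfish/MAS sources.

```python
ASID_BITS = 16
ADDR_BITS = 48
ASID_MASK = (1 << ASID_BITS) - 1
ADDR_MASK = (1 << ADDR_BITS) - 1


def pack_name(asid, addr):
    """Pack (ASID, address) into one 64-bit integer.

    Assumed layout: ASID in the top 16 bits, address in the low 48 bits.
    """
    if not 0 <= asid <= ASID_MASK:
        raise ValueError("ASID does not fit in 16 bits")
    if not 0 <= addr <= ADDR_MASK:
        raise ValueError("address does not fit in 48 bits")
    return (asid << ADDR_BITS) | addr


def unpack_name(name):
    """Recover the (ASID, address) pair from a packed canonical name."""
    return name >> ADDR_BITS, name & ADDR_MASK
```

With this layout, comparing two packed names as plain integers orders them first by ASID and then by address, which is convenient for an index keyed on the canonical name.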

The mapping database:

Barrelfish/MAS manages a mapping database, a data structure that allows efficient lookup of all related capabilities given the name of the object they refer to. The mapping database is a balanced tree structure of all capabilities present in the database.

The mapping database stores the capabilities in a canonical ordering, allowing efficient lookup and range query operators such as “overlap” and “contains”. The canonical ordering of the capabilities is defined on their canonical name (address space and address), size and type. Capabilities to objects with a smaller name appear first. If the base names of two capabilities are equal, then the larger object comes first. Finally, all other attributes being equal, the type of the capability defines the order: types higher up in the hierarchy come first.

This ordering is important, because based on the canonical order of the capabilities one can define the descendant relation. We say a capability B is a descendant of capability A if B is smaller than A and B is fully contained in the range covered by A:

$\mathit{descendant}(c_1, c_2) \leftrightarrow c_1 \cap c_2 = c_1 \wedge c_1.\mathit{type} \le c_2.\mathit{type}$

The mapping database can therefore be traversed to find the descendants of a capability (successors) and ancestors (predecessors) efficiently.

It is important that the ordering relation is in line with the retype operation. If B can be retyped from A, B must be smaller than A. Our definition fulfils this: a retype can increase the name, decrease the size or change the type to a subtype. With the help of the mapping database, we can efficiently find all the ancestors and descendants of a particular object.

Page tables and address spaces:

Barrelfish/MAS has a distinct capability type for each hardware-defined translation table, e.g. one for each of the four levels of the x86_64 architecture. Each of these capability types is a translation structure in the sense of the executable spec.

User processes can construct their own page tables through capability invocations. This is safe, because the invocations only allow operations resulting in correct-by-construction page tables, and processes can only map resources for which they hold a capability with the grant right.

Since a page table defines an address space, we can derive an address space capability from a page table. This address space represents the input address space of the translation table. For each translation table, the spanning address space can only be derived once.

When we delete a page table, we use this stored ASID to query the mapping database for address space capabilities and start a recursive deletion. This ensures that upon deletion of the page table, the address space is deleted including all segments within it. This is equivalent to revoking all descendants of the address space capability and then deleting it.

Tracking mappings:

When access to an object is revoked, all positions where this object has been mapped must be found and removed. We manage this bookkeeping using the capability system. For each mappable object there exists a corresponding mapping capability. The mapping capability is a descendant of (retyped from) the mapped object, and hence we can find all locations where an object is mapped by walking the mapping database in ascending order. Each mapping capability indicates the page table object and slot range where the object has been mapped.

The same technique is used to track mappings of multi-level page tables. For each valid entry in a page table there exists a mapping capability. When the last mapping capability is deleted, the page table entry is invalidated.
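The canonical ordering and the descendant relation of the mapping database can be illustrated with a small model. The sketch below is not the Barrelfish/MAS implementation, which is written in C and uses a balanced tree. It assumes a hypothetical three-level type hierarchy (RAM, Frame, Mapping), keeps capabilities in a sorted Python list, and uses the illustrative names `Cap`, `MappingDB` and `TYPE_RANK`.

```python
from dataclasses import dataclass

# Hypothetical type hierarchy: a smaller rank is higher up in the hierarchy.
TYPE_RANK = {"RAM": 0, "Frame": 1, "Mapping": 2}


@dataclass(frozen=True)
class Cap:
    asid: int
    base: int
    size: int
    type: str

    def key(self):
        # Smaller name first, then larger object first, then higher type first.
        return (self.asid, self.base, -self.size, TYPE_RANK[self.type])


def is_descendant(b, a):
    """True if b's range lies within a's range and b's type is not above a's."""
    same_space = b.asid == a.asid
    contained = a.base <= b.base and b.base + b.size <= a.base + a.size
    return same_space and contained and TYPE_RANK[b.type] >= TYPE_RANK[a.type]


class MappingDB:
    """Sorted list standing in for the balanced tree of the real system."""

    def __init__(self):
        self._caps = []

    def insert(self, cap):
        self._caps.append(cap)
        self._caps.sort(key=Cap.key)  # re-sorting keeps the sketch short

    def descendants(self, a):
        # Descendants are successors of `a` in the canonical order, so we scan
        # forward and stop once we leave the range covered by `a`.
        start = self._caps.index(a) + 1
        end = a.base + a.size
        found = []
        for cap in self._caps[start:]:
            if cap.asid != a.asid or cap.base >= end:
                break
            if is_descendant(cap, a):
                found.append(cap)
        return found
```

Because a mapping capability is a descendant of the object it maps, the same forward scan also yields every location where a frame has been mapped, which is the lookup needed on revocation.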

In Figure 2 we draw a diagram of the different address spaces present in a heterogeneous multiprocessor system. To acquire the access right to a particular memory object, a sequence of translations needs to be set up. Which address spaces need to be configured depends on the system topology, which may only be discovered at runtime.

SoC-Platforms

The topology of SoC platforms is typically fixed and known at compile time. We can therefore enumerate all address spaces of the SoC, pre-compute all fixed translations, and store a graph of the topology consisting of configurable and leaf address spaces in the kernel. We can generate core-specific translation functions that convert local addresses to global names and vice versa. The name can then be resolved by walking the translation structures of the configurable address spaces until it reaches an accepting address space or there is no translation. We evaluate this scenario in § 6.4.
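The resolution walk described above can be sketched as follows. The example is a minimal illustration under stated assumptions: the two-core topology, its addresses and the names `TOPOLOGY`, `step`, `resolve` and `to_local` are hypothetical, and the table is interpreted at runtime here, whereas the text describes translation functions generated ahead of time.

```python
# Each address space maps windows (base, size) onto a target name
# (target address space, target address of the window's first byte).
# Hypothetical SoC: both cores reach the same DRAM at different local addresses.
TOPOLOGY = {
    "core0": [(0x8000_0000, 0x1000_0000, ("bus", 0x0))],
    "core1": [(0x0000_0000, 0x1000_0000, ("bus", 0x0))],
    "bus": [(0x0, 0x1000_0000, ("dram", 0x0))],
    "dram": [],
}
ACCEPTING = {"dram"}  # leaf address spaces that accept the access


def step(name):
    """Apply one translation; return None if the address is not translated."""
    space, addr = name
    for base, size, (target_space, target_base) in TOPOLOGY[space]:
        if base <= addr < base + size:
            return (target_space, target_base + addr - base)
    return None


def resolve(name):
    """Follow translations until an accepting address space is reached."""
    for _ in range(len(TOPOLOGY)):  # bounded, so a cyclic table cannot hang us
        if name is None or name[0] in ACCEPTING:
            return name
        name = step(name)
    return None


def to_local(core, target):
    """Reverse direction: a local address on `core` that resolves to `target`.

    Assumes every window stays contiguous along the whole path.
    """
    for base, size, _ in TOPOLOGY[core]:
        first = resolve((core, base))
        if first and first[0] == target[0] and first[1] <= target[1] < first[1] + size:
            return base + target[1] - first[1]
    return None
```

In this example the same DRAM byte has the global name `("dram", 0x1000)` but appears at `0x80001000` on `core0` and at `0x1000` on `core1`.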

Device Discovery

In general, the information about the hardware topology and its address spaces may be incomplete and must be discovered during runtime. For instance, the presence of an IOMMU is known after parsing the ACPI tables, the Xeon Phi co-processor of our example (Figure 2) is discovered by PCI, and lastly the size of the GDDR available on the co-processor is known by the driver. The state of the model is therefore populated by multiple sources of information.

In Barrelfish, there exists the system knowledge base (SKB) [37], which stores information about the system. The SKB in a nutshell is a database storing facts about the system which can be queried using Prolog. We implement the model inside the SKB. During device discovery, processes insert information about the discovered address spaces and how they are connected with each other.

Model Queries

Device drivers must configure translation units to enable devices to access memory. Booting a core on the Xeon Phi co-processor is a particular example: application modules to be run on the co-processor may reside in host RAM. To make this accessible from the co-processor, the IOMMU and the SMPT must be configured accordingly. This information can be obtained by querying the SKB, which returns a list of address spaces that must be configured. The query is based on a shortest path algorithm between the address space of the Xeon Phi core and the address space where host RAM resides.

Running the queries in the SKB is costly (§ 6.3). We provide a library that caches the graph representation of configurable address spaces and runs shortest path on it.

The result of the query is a list of address spaces that need to be configured to make the memory object accessible from the source address space. This blueprint is then converted by the user-space process into a sequence of capability operations to allocate memory, set up translation structures and perform the relevant mappings. The model queries only provide a ‘hint’ on what needs to be configured, while the capability system enforces the authorization required to perform the required mappings. We evaluate the latency of this scenario in § 6.2.
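A model query of this kind can be sketched as a shortest-path search that returns only the configurable address spaces along the path. The sketch below is illustrative Python, not the C library: the node names loosely follow the Xeon Phi example, the edge list is an assumption, and every edge has unit weight, so Dijkstra's algorithm on the adjacency matrix degenerates to a breadth-first search.

```python
NAMES = ["core_mmu", "smpt", "iommu", "system_bus", "host_ram", "gddr"]
EDGES = [("core_mmu", "smpt"), ("core_mmu", "gddr"), ("smpt", "iommu"),
         ("iommu", "system_bus"), ("system_bus", "host_ram")]
CONFIGURABLE = {"core_mmu", "smpt", "iommu"}
INDEX = {name: i for i, name in enumerate(NAMES)}


def build_adjacency(edges):
    """Build an adjacency matrix: adj[u][v] == 1 if u translates into v."""
    adj = [[0] * len(NAMES) for _ in NAMES]
    for src, dst in edges:
        adj[INDEX[src]][INDEX[dst]] = 1
    return adj


def nodes_to_configure(adj, src, dst):
    """Dijkstra with unit weights; returns configurable nodes on the path."""
    n = len(adj)
    inf = float("inf")
    dist = [inf] * n
    parent = [None] * n
    done = [False] * n
    dist[INDEX[src]] = 0
    for _ in range(n):
        pending = [i for i in range(n) if not done[i] and dist[i] < inf]
        if not pending:
            break
        u = min(pending, key=lambda i: dist[i])
        done[u] = True
        for v in range(n):
            if adj[u][v] and dist[u] + 1 < dist[v]:
                dist[v] = dist[u] + 1
                parent[v] = u
    if dist[INDEX[dst]] == inf:
        return None  # destination not reachable from the source
    path, v = [], INDEX[dst]
    while v is not None:
        path.append(NAMES[v])
        v = parent[v]
    return [name for name in reversed(path) if name in CONFIGURABLE]
```

The returned list is only the blueprint. Turning it into mappings still requires capability operations, which is where authorization is enforced.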

Address Resolution

While the SKB stores the address space topology of the system, it does not store the actual translations of configurable address spaces. An address can be fully resolved by performing the previous query and, instead of changing the configuration of the address spaces, using the translation structure to calculate where each address space translates the address.

We adapt the user-space device drivers in Barrelfish to use the runtime support described above when configuring their devices and allocating in-memory data structures. In Barrelfish/MAS, device drivers run in user-space. They are started by a device manager, which passes a set of capabilities including a capability to the device registers and the IOMMU IPC endpoint. The driver can then use capability operations to map the device registers into its address space or program the IOMMU translation through the IOMMU IPC endpoint. Devices with additional memory, such as the Xeon Phi with GDDR, receive a capability to the leaf address space, which the driver can then use to retype new RAM capabilities from it.

Memory access from the device might be translated by the IOMMU. To set up a shared buffer between the driver and the device, the driver needs to: i) allocate memory, ii) map the memory into the driver’s own address space, iii) query the graph to determine the necessary configuration steps, and iv) follow the result to map the memory into the device’s address space. For an evaluation of these steps see § 6.2.

To set up the IOMMU, we implemented two alternatives: i) an RPC to the IOMMU reference monitor that manages the translations, or ii) direct capability invocations on the translation table used by the IOMMU for this device. This is safe, because the capability system enforces that only memory for which the driver has a capability can be mapped.
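The safety argument for direct invocations can be made concrete with a toy model. The sketch below is an illustration only: `FrameCap`, `TranslationTable` and the rights strings are hypothetical names, and passing the set of held capabilities as a Python argument stands in for the kernel-side check that the caller really owns the capability it invokes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameCap:
    base: int
    size: int
    rights: frozenset  # e.g. frozenset({"read", "write", "grant"})


class TranslationTable:
    """Stand-in for an IOMMU translation table reached by capability invocation."""

    def __init__(self):
        self.entries = {}  # device-visible page address -> physical page address

    def map(self, held_caps, cap, iova, page=0x1000):
        # Refuse to install anything unless the caller holds the capability
        # and the capability carries the grant right.
        if cap not in held_caps or "grant" not in cap.rights:
            raise PermissionError("no capability with grant right for this memory")
        if cap.base % page or cap.size % page or iova % page:
            raise ValueError("unaligned mapping")
        for offset in range(0, cap.size, page):
            self.entries[iova + offset] = cap.base + offset
```

A driver holding a read-only capability, or naming memory it holds no capability for, gets an error and the table is left unchanged.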

6. Evaluation

We evaluate our implementation by showing memory management performance comparable to Linux (§ 6.1) and applicability to a real-world scenario using co-processors (§ 6.2). We also show the scaling behavior of the model queries (§ 6.3) and demonstrate how the model can also be used in pathological topologies using simulators (§ 6.4). Finally, we analyze the space-time overheads of our implementation (§ 6.5).

All performance evaluations use a dual-socket Intel Xeon E5 v2 2600 (“Ivy Bridge”) with 256GB of main memory. There are 10 cores per socket with HyperThreading, TurboBoost, and speed stepping disabled, and the system runs in “performance” mode. The system also has two Intel Xeon Phi co-processors (“Knights Corner”). All Linux experiments use Ubuntu 18.04LTS, with kernel version 4.15 and the latest patches for mitigating Meltdown and Spectre attacks.

We compare the performance of Barrelfish/MAS’s memory subsystem against Linux with Spectre/Meltdown mitigation both enabled and disabled, using two microbenchmarks. Barrelfish/MAS has no mitigation measures.

[Figure 5 plot. Y-axis: kcycles / (page | trap). Groups: prot1-trap-unprot, protN-trap-unprot, trap only. Series: Linux - Heuristic, Linux - Full, Linux NS - Heuristic, Linux NS - Full, Barrelfish/MAS Default, Barrelfish/MAS Direct.]

Figure 5: Appel-Li benchmark on Barrelfish/MAS and Linux with and without Spectre/Meltdown mitigation (NS).

The Appel-Li benchmark [6] tests operations relevant to garbage collection and other non-paging tasks by measuring the time to protect, and to trap-and-unprotect, pages of memory.

We run the benchmark with working sets of less than 2MB (512 pages). We measure Linux in four configurations: i) the default TLB flush heuristic and ii) always full TLB flush, each with Spectre/Meltdown mitigation both enabled and disabled. We benchmark Barrelfish/MAS in two ways: i) direct invocation of the mapping capability and ii) protecting the page through user-level data structures tracking the mapping. Note that Barrelfish/MAS does not support selective TLB flushing.

The results are shown in Figure 5. We observe that Barrelfish/MAS is consistently faster than Linux in all cases. The Spectre/Meltdown mitigation incurs a 45-53% slowdown. For both multi-page (protN-trap-unprot) and single-page (prot1-trap-unprot) protect-trap-unprotect, Barrelfish/MAS is up to 4x faster than Linux. We observe a slight increase in execution time when full TLB flushes are enabled. The Barrelfish/MAS “Direct” results use the kernel primitives directly. This enables us to isolate the cost of user-space accounting, which accounts for 10-17% of the execution time.

The second microbenchmark measures the performance of the primitive operations map, protect and unmap with respect to an increasing buffer size.

The benchmark works as follows: i) allocate a region of virtual memory and fault on it to map memory, ii) write-protect the entire virtual region, and iii) unmap the virtual memory region again. We time each operation separately. We measured different ways to map memory on Linux using mmap, shmat and shmfd, and compare Barrelfish/MAS against the best performance we obtained on Linux for each operation and page size. For mapping and unmapping 4kB pages, this was passing a file descriptor obtained through shm_open to mmap. For map/unmap with larger page sizes, shared memory segments (shmat, shmdt) performed best. Changing page protection was always fastest using mprotect. Again, we benchmark Linux with and without Spectre/Meltdown mitigation enabled. If possible, we do not measure the time for memory allocation as this is dominated by memset. On Barrelfish/MAS we use the high-level interfaces to include user-space book-keeping in the measurements.

Figure 6 shows the execution time of the three operations per page for an increasing buffer size and three page sizes. Enabling Spectre/Meltdown mitigation results in a slowdown of up to 2x for small page numbers. In all cases, the cost per page decreases as the number of pages increases, amortizing the system call cost.

Map: Barrelfish/MAS is able to match and outperform Linux in all but one case, with a significant difference when using large and huge pages.

Protect: These are in line with the Appel and Li benchmarks above; Barrelfish/MAS outperforms Linux in all configurations.

Unmap: We observe very similar performance characteristics here. For small buffer sizes Linux is slightly faster; for larger buffers Barrelfish/MAS slightly outperforms Linux.

From these two microbenchmarks, we conclude that Barrelfish/MAS memory operations are competitive: capabilities and fast traps allow an efficient virtual memory interface despite splitting up larger mappings into multiple capability operations and syscalls. It is possible to build a fast and competitive memory system which still fully implements our fine-grained, least-privilege model.

We now profile the support for address space networks in Barrelfish/MAS, including memory mappings and model queries. We put the cost in the context of related operations a device driver has to perform.

We profile the boot process of the Xeon Phi co-processor on our server platform. All accesses from the co-processor to host RAM are translated multiple times, most notably:

CoreMMU → SMPT → IOMMU → SystemBus

Each step must be configured correctly. We adapted the existing drivers for the co-processor, the system memory page table (SMPT), and the IOMMU to use our new capabilities and model queries. The MMU is managed by the kernel running on the co-processor cores. Resources are managed using the capability system, which allows safe programming of translation tables.

First, we allocate 6MB from host RAM, and map this into the device driver’s address space (equivalent to performing an anonymous mmap in Linux). Then we copy the boot image into this allocated buffer. We then query the model representation to determine which translation units must be reprogrammed. We map the buffer into the device’s IOMMU address space, and then map the resulting segment into the SMPT address space. We compare the IOMMU mapping in two cases. In the first, we ask the IOMMU driver to perform the mapping; this ad-hoc approach corresponds to the current state of the art. In the second, enabled by our model, we perform the mapping directly with capability invocations.

[Figure 6 plot. Panels: Map, Protect, Unmap. X-axis: buffer size (4k, 2M, 1G, 64G). Y-axis: time per page [µs]. Series: 4k, 2M and 1G pages, each for Linux, Linux NS and Barrelfish/MAS.]

Figure 6: Comparison of memory operations on Barrelfish/MAS and Linux with and without Spectre/Meltdown mitigation (NS). Execution time per page in µs. Buffer sizes in powers of two from 4kB to 64GB.

[Figure 7 plot. Bars: Linux MMAP, Local Map, RPC Map. Y-axis: time [µs]. Stacked components: Map SMPT, Map IOMMU, Query Model, Write Memory, Allocate & map process.]

Figure 7: Profiling Configuration Time for a Xeon Phi Co-Processor comparing Local Syscalls and RPCs to Perform IOMMU Mappings with Linux mmap’ing a buffer of the same size for perspective.

As Figure 7 shows, the cost is dominated by memory allocation, which takes about 625 µs and involves an RPC to a memory server. Writing the buffer content using memcpy takes 224 µs. Determining the units to be configured to make the buffer available to the device takes 71 µs, using the C graph implementation. Setting up the IOMMU mapping takes 2 µs (or 32 µs when using RPC). Mapping the segment into the SMPT using a kernel driver takes 5 µs.

For comparison we also show the cost in Linux to mmap an anonymous 6MB region in a userspace process, which is equivalent in our implementation to allocating and mapping the buffer in the driver. We perform this operation slightly faster than Linux, but pay an additional 71 µs to dynamically determine the nodes that have to be configured. A less flexible approach might pre-compute or memoize this step, avoiding the latency at map time. Compared with the cost of allocating and writing the memory, the cost of setting up the IOMMU and SMPT mappings (together 7 µs) is negligible. Note that in any system an untrusted agent will have to perform some sort of invocation (such as a system call) to install these mappings. Despite our fine-grained rights and dynamic implementation, performance is comparable to Linux.

We now turn to the scaling properties of the model representation with respect to the system complexity. In real systems, we see ever-increasing numbers of cores and DMA-capable devices, but the diameter of the decoding net representation grows much more slowly, and rarely exceeds 10. This is true not only for x86 systems, but also for all the ARM SoCs we have encountered to date.

[Figure 8 plot. Y-axis: time [µs]. Series: Native, EclipseCLP.]

Figure 8: Cost of determining mappable nodes on an x86 system with a growing number of PCI devices.

We write a synthetic benchmark that simulates a system with an increasing number of PCIe devices, each of which has its own address space and translation unit, much like the Intel Xeon Phi described in the previous section. This grows the model state in two ways: the total number of address spaces, as well as the number of these that are configurable. Both grow linearly with the number of PCIe devices. We measure the time it takes to determine the configurable address spaces between a PCIe device and the system bus, a typical setup operation for a device driver that has to set up IOMMU and device-local translation structures. We evaluate two implementations: i) a Prolog implementation of the model using the EclipseCLP interpreter in Barrelfish, and ii) our C implementation based on a graph represented as an adjacency matrix. Both use Dijkstra’s algorithm on the graph representation.

Figure 8 shows that, due to internal memory allocations, the performance of the EclipseCLP implementation scales linearly in the number of devices. The native C implementation, in contrast, shows almost constant performance. Small linear factors stem from walking the adjacency matrix and initializing the parent array.

We conclude that the cost of determining configurable nodes remains almost independent of the system complexity, as long as the graph is of low diameter and is maintained in an efficient data structure, suggesting that the routing calculation is feasible for modern hardware.

6.4. Correctness on simulated platforms

In this qualitative evaluation we show that the model implementation is functional and performant even when run on simulated platforms with unusual address space topologies not supported by other systems.
While these topologies are extreme, their envelope includes other real systems (such as those with secure co-processors) which are not handled by current systems.

We wrote a series of system descriptions for the ARM Fast Models simulator [7]. We use this description to i) configure the simulator and ii) extract the topology of the memory subsystem. We then use this information to populate the address space model, which is used at compile time to generate operating system code and at runtime to query information about the memory system as in the previous evaluation. We mention four configurations, where each consists of two ARM Cortex-A57 clusters, each having their own memory map connecting to DRAM and other devices. The memory map is configured as follows:

1. Uniform: Uniform memory map between all clusters.
2. Swapped: Memory map contains two areas whose addresses are swapped (exchanged) between the two clusters.
3. Private: Each cluster has its own private memory region.
4. Private Swapped: A combination of the Private and Swapped configurations.

We know of no other current OS designs which can manage memory globally in all these cases. Popcorn Linux [10] and Barrelfish have limited support for case 3, while regular Linux and seL4 only support case 1.

Barrelfish/MAS is able to boot and manage memory on all platforms without modifications, regardless of the topology, by virtue of the capabilities used to refer to memory containing the canonical name of the object. Whenever an object is accessed, this canonical name is converted into a local address using a generated function.
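For the Swapped configuration, the conversion from canonical name to local address might look like the following sketch. The region names, sizes and addresses are invented for illustration, and `make_to_local` builds a closure at runtime where the text describes a function generated at compile time.

```python
REGION_SIZE = 0x4000_0000  # assumed size of each of the two swapped areas

# Hypothetical local base address of each region: (on cluster 0, on cluster 1).
LOCAL_BASE = {
    "dram_a": (0x0000_0000, 0x4000_0000),
    "dram_b": (0x4000_0000, 0x0000_0000),
}


def make_to_local(cluster):
    """Return the name-to-local-address conversion for one cluster."""
    def to_local(name):
        region, offset = name
        if not 0 <= offset < REGION_SIZE:
            raise ValueError("offset outside the region")
        return LOCAL_BASE[region][cluster] + offset
    return to_local
```

The canonical name `("dram_a", 0x1000)` then denotes the same memory on both clusters, although cluster 0 reaches it at `0x1000` and cluster 1 at `0x40001000`.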

Finally, we analyze the time and space complexity of managing the physical resources of the system using capabilities in Barrelfish.

We are interested in the space overhead to store the capabilities, managing the lookup of capabilities in the mapping database, and creating new mappings.

In the implementation, capabilities occupy 64 bytes each. There is typically less than one capability for each frame of memory, as each capability can represent up to 2 − [...] Barrelfish/MAS creates a capability for bookkeeping. The number of these mapping capabilities also grows sub-linearly in the total number of mapped frames. Large frames result in one mapping capability per page table that is spanned by the mapping. Since the 64-byte capability representation also includes all the pointers necessary to index the mapping database, the latter incurs no additional overhead. The index itself is a balanced tree; lookups are logarithmic in the total number of capabilities.

Overall, keeping track of memory resources with capabilities incurs a space overhead which grows at worst linearly in the available physical memory. Furthermore, the mapping database can be implemented and queried efficiently. Assuming the Barrelfish/MAS worst case of one capability per 4kB frame, this accounts for a 1.5% total memory overhead. In comparison, Linux manages a struct page per physical frame of up to 80 bytes in size, an overhead of almost 2%.
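The two overhead figures follow directly from dividing the per-frame metadata size by the 4kB frame size. The helper below is only a worked check of that arithmetic; `overhead_percent` is an illustrative name.

```python
FRAME_BYTES = 4096  # one 4kB frame


def overhead_percent(metadata_bytes, frame_bytes=FRAME_BYTES):
    """Metadata size as a percentage of the memory it describes."""
    return 100.0 * metadata_bytes / frame_bytes


capability_overhead = overhead_percent(64)   # one 64-byte capability per frame
struct_page_overhead = overhead_percent(80)  # one struct page of up to 80 bytes
```

This gives 1.5625% for one capability per frame and about 1.95% for an 80-byte struct page, matching the rounded values quoted above.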

7. Conclusion

In this paper we have built on existing work in modelling the complex interacting address spaces in modern hardware by adopting the proven methodology of the seL4 project to produce a rigorous, no-stone-left-unturned model of memory management. Our model applies well-known concepts in access control, giving an abstract model amenable to implementation in capability-based systems (e.g. Barrelfish), as well as ACL-based systems such as Linux.

We have shown that it is possible to implement the model efficiently in an operating system delivering excellent memory management performance while at the same time offering a clean and safe way to deal with the complexity of the allocation and enforcement problem.

We have shown that the model can be used to configure real, complex (even pathological) systems, scales well, and introduces little overhead. Our model is a sound foundation for both fully verified systems and more reliable memory management in existing systems.

References

[1] Reto Achermann, Lukas Humbel, David Cock, and Timothy Roscoe. Formalizing Memory Accesses and Interrupts. In Proceedings of the 2nd Workshop on Models for Formal Analysis of Real Systems, MARS 2017, pages 66–116, 2017.
[2] Reto Achermann, Lukas Humbel, David Cock, and Timothy Roscoe. Physical Addressing on Real Hardware in Isabelle/HOL. In Interactive Theorem Proving, ITP’18, pages 1–19, Oxford, United Kingdom, 2018. Springer International Publishing.
[3] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. Do-It-Yourself Virtual Memory Translation. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, pages 457–468, New York, NY, USA, 2017. ACM.
[4] James P. Anderson. Computer Security Technology Planning Study. Technical Report ESD-TR-73-51, Vol. I, AD-758 206, Electronic Systems Division, Deputy for Command and Management Systems HQ Electronic Systems Division (AFSC), L. G. Hanscom Field, Bedford, Massachusetts 01730, USA, October 1972.
[5] Andy Whitcroft. Sparsemem Memory Model. https://lwn.net/Articles/134804/ , Aug 2019.
[6] Andrew W. Appel and Kai Li. Virtual Memory Primitives for User Programs. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 96–107, New York, NY, USA, 1991. ACM.
[7] ARM Ltd. Development Tools and Software: Fast Models. , August 2019.
[8] Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach, and Onur Mutlu. Mosaic: A GPU Memory Manager with Application-transparent Support for Multiple Page Sizes. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, pages 136–150, New York, NY, USA, 2017. ACM.
[9] Antonio Barbalace, Anthony Iliopoulos, Holm Rauchfuss, and Goetz Brasche. It’s Time to Think About an Operating System for Near Data Processing Architectures. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems, HotOS ’17, pages 56–61, New York, NY, USA, 2017. ACM.
[10] Antonio Barbalace, Marina Sadini, Saif Ansary, Christopher Jelesnianski, Akshay Ravichandran, Cagil Kendir, Alastair Murray, and Binoy Ravindran. Popcorn: Bridging the Programmability Gap in heterogeneous-ISA Platforms. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys ’15, pages 29:1–29:16, New York, NY, USA, 2015. ACM.
[11] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The Multikernel: A New OS Architecture for Scalable Multicore Systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 29–44, New York, NY, USA, 2009. ACM.
[12] David Cock, Gerwin Klein, and Thomas Sewell. Secure Microkernels, State Monads and Scalable Refinement. In Proceedings of the 21st International Conference on Theorem Proving in Higher Order Logics, TPHOLs ’08, pages 167–182, Berlin, Heidelberg, 2008. Springer-Verlag.
[13] Philip Derrin, Kevin Elphinstone, Gerwin Klein, David Cock, and Manuel M. T. Chakravarty. Running the Manual: An Approach to High-assurance Microkernel Development. In Proceedings of the 2006 ACM SIGPLAN Workshop on Haskell, Haskell ’06, pages 60–71, New York, NY, USA, 2006. ACM.
[14] Dhammika Elkaduwe, Gerwin Klein, and Kevin Elphinstone. Verified Protection Model of the seL4 Microkernel. In Proceedings of the 2nd International Conference on Verified Software: Theories, Tools, Experiments, VSTTE ’08, pages 99–114, Berlin, Heidelberg, 2008. Springer-Verlag.
[15] Simon Gerber. Authorization, Protection, and Allocation of Memory in a Large System. PhD thesis, ETH Zurich, 2018.
[16] Simon Gerber, Gerd Zellweger, Reto Achermann, Kornilios Kourtis, Timothy Roscoe, and Dejan Milojicic. Not Your Parents’ Physical Address Space. In Proceedings of the 15th USENIX Conference on Hot Topics in Operating Systems, HOTOS’15, pages 16–16, Berkeley, CA, USA, 2015. USENIX Association.
[17] Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 653–669, Berkeley, CA, USA, 2016. USENIX Association.
[18] Marius Hillenbrand, Mathias Gottschlag, Jens Kehne, and Frank Bellosa. Multiple Physical Mappings: Dynamic DRAM Channel Sharing and Partitioning. In Proceedings of the 8th Asia-Pacific Workshop on Systems, APSys ’17, pages 21:1–21:9, Mumbai, India, 2017.
[19] HSA Foundation. HSA Runtime Programmer’s Reference Manual, version: 1.1.4 edition, Oct 2016.
[20] Jian Huang, Moinuddin K. Qureshi, and Karsten Schwan. An Evolutionary Study of Linux Memory Management for Fun and Profit. In Proceedings of the 2016 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC ’16, pages 465–478, Berkeley, CA, USA, 2016. USENIX Association.
[21] Intel Corporation. Intel Xeon Phi Coprocessor System Software Developers Guide, 2014.
[22] Khronos OpenCL Working Group. The OpenCL Specification, version: 2.0, document revision: 29 edition, July 2015.
[23] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: Formal Verification of an OS Kernel. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 207–220, New York, NY, USA, 2009. ACM.
[24] Butler W Lampson. Protection. ACM SIGOPS Operating Systems Review, 8(1):18–24, 1974.
[25] Janghaeng Lee, Mehrzad Samadi, and Scott Mahlke. VAST: The Illusion of a Large Memory Space for GPUs. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT ’14, pages 443–454, New York, NY, USA, 2014. ACM.
[26] Henry M. Levy. Capability-Based Computer Systems. Butterworth-Heinemann, Newton, MA, USA, 1984.
[27] A Theodore Markettos, Colin Rothwell, Brett F Gutstein, Allison Pearce, Peter G Neumann, Simon W Moore, and Robert NM Watson. Thunderclap: Exploring Vulnerabilities in Operating System IOMMU Protection via DMA from Untrustworthy Peripherals. In NDSS, 2019.
[28] Alex Markuze, Adam Morrison, and Dan Tsafrir. True IOMMU Protection from DMA Attacks: When Copy is Faster Than Zero Copy. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’16, pages 249–262, New York, NY, USA, 2016. ACM.
[29] Benoît Morgan, Eric Alata, Vincent Nicomette, and Mohamed Kaaniche. Bypassing IOMMU Protection against I/O Attacks. In , pages 145–150, Oct 2016.
[30] Benoît Morgan, Eric Alata, Vincent Nicomette, and Mohamed Kaaniche. IOMMU Protection Against I/O Attacks: A Vulnerability and a Proof of Concept. Journal of the Brazilian Computer Society, 24(1):2, Jan 2018.
[31] NVIDIA. NVIDIA Parker Series SoC Technical Reference Manual, v.1.0p edition, June 2017.
[32] NVIDIA Corporation. Unified Memory in CUDA 6, Nov 2013. https://devblogs.nvidia.com/unified-memory-in-cuda-6/ .
[33] NXP. i.MX 8DualXPlus/8QuadXPlus Applications Processor Reference Manual, January 2019. REV 1, .
[34] David Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberly Keeton, Christoforos Kozyrakis, Randi Thomas, and Katherine Yelick. A Case for Intelligent RAM. IEEE Micro, 17(2):34–44, March 1997.
[35] Yuxin Ren, Gabriel Parmer, Teo Georgiev, and Gedare Bloom. CBufs: Efficient, System-wide Memory Management and Sharing. In Proceedings of the 2016 ACM SIGPLAN International Symposium on Memory Management, ISMM 2016, pages 68–77, New York, NY, USA, 2016. ACM.
[36] Bogdan F. Romanescu, Alvin R. Lebeck, and Daniel J. Sorin. Specifying and Dynamically Verifying Address Translation-aware Memory Consistency. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 323–334, New York, NY, USA, 2010. ACM.
[37] Adrian Schüpbach, Andrew Baumann, Timothy Roscoe, and Simon Peter. A Declarative Language Approach to Device Configuration. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 119–132, New York, NY, USA, 2011. ACM.
[38] Thomas Sewell, Simon Winwood, Peter Gammie, Toby Murray, June Andronick, and Gerwin Klein. seL4 Enforces Integrity. In Marko van Eekelen, Herman Geuvers, Julien Schmaltz, and Freek Wiedijk, editors,

Interactive Theorem Proving , pages 325–340, Berlin, Heidelberg, 2011.Springer Berlin Heidelberg.[39] Texas Instruments.

OMAP44xx Multimedia Device Technical Ref-erence Manual , April 2014. Version AB, .[40] Simon Winwood, Gerwin Klein, Thomas Sewell, June Andronick,David Cock, and Michael Norrish. Mind the Gap. In

Proceedingsof the 22Nd International Conference on Theorem Proving in HigherOrder Logics , TPHOLs ’09, pages 500–515, Berlin, Heidelberg, 2009.Springer-Verlag., TPHOLs ’09, pages 500–515, Berlin, Heidelberg, 2009.Springer-Verlag.