Secure Memory Management on Modern Hardware

Abstract

Almost all modern hardware, from phone SoCs to high-end servers with accelerators, contains memory translation and protection hardware such as IOMMUs, firewalls, and lookup tables, which makes it impossible to reason about and enforce protection and isolation based solely on the processor's MMUs. This has led to numerous bugs and security vulnerabilities in today's system software. In this paper we regain the ability to reason about and enforce access control using the proven concept of a reference monitor mediating accesses to memory resources. We present a fine-grained, realistic memory protection model that makes this traditional concept applicable today, and brings system software in line with the complexity of modern, heterogeneous hardware. Our design is applicable to any operating system, regardless of architecture. We show that it not only enforces the integrity properties of a system, but does so with no inherent performance overhead, and that it is even amenable to automation through code generation from trusted hardware specifications.

Reto Achermann, Nora Hossle, Lukas Humbel, Daniel Schwyn, David Cock, Timothy Roscoe
Systems Group, Department of Computer Science, ETH Zurich

Both new, fully-verified kernels and traditional production-quality operating systems rely on a model of memory addressing and protection so simple it is rarely remarked on: RAM and devices reside at unique addresses in a single, shared physical address space, and all cores have homogeneous memory management units (MMUs) which translate virtual addresses into this single physical address space.

The OS running on the platform then fulfills two roles. First, it manages resource allocation. Virtual memory makes multiplexing hardware easier by decoupling the application's view of memory from the physical resources managed by the OS, allowing late binding of addresses. Second, it forms, alongside the MMU, a reference monitor [4]: all resource accesses (dereferences) are intercepted by the monitor (specifically the TLB) and checked against an access-control policy. This has for decades formed the basis for secure process isolation in all operating systems implementing virtual memory.

The reference monitor concept repeats throughout traditional OS design, with more sophisticated abstractions gradually built up, and their associated security properties enforced through a combination of hardware-provided monitors (e.g. MMUs) and software ones (e.g. traps and syscalls).

For example, consider name (or address) resolution and authorization checks in the mmap() syscall. A process begins with a reference to a file: its filename. The OS, meanwhile, enforces some access-control policy, e.g. UNIX-style permissions. The calling process dereferences the filename by passing it to the open() syscall, whereupon the OS validates the request against policy (permissions), and resolves the reference to another reference: the file descriptor (FD), now referring to an entry in the global open-file table. The existence of this entry, and the fact that the process may possess a reference to it, are justified by the top-level policy: the pattern of open files and FDs (the state) is a projection of something permitted by the policy.

This pattern is replicated in the VM system thanks to mmap(). Unix cannot directly interpose on memory reads and writes (to the buffer cache page mapped to the user), but does implement the initial mmap() call and the page fault handler. The kernel builds a reference monitor by composing itself with that provided by the MMU. On an mmap() call, the kernel verifies that the FD is valid, with appropriate permissions (e.g. write), before constructing a VM region to back the mapping. The policy encoded in the region's flags is thus a (transitive) projection of the original file system permissions. On a page fault, the kernel is again invoked to lazily populate the region (from the buffer cache). Now, it can consult the mapping parameters (e.g. writable), and translate these to flags in the page-table entry.

Thus, the page-table state (e.g. permission bits), and thence the eventual TLB state, are justified by a chain of monitors all the way back up to the system policy (file system permissions). The MMU enforces this projected policy on the OS's behalf. Together they form, in security terms, a compound reference monitor to enforce a policy both on real hardware resources (RAM) and on abstract OS-specific objects (processes, files).

This model has worked well for decades, but has been undermined by a changing hardware contract. A modern system contains not just processors and their attached MMUs, but system MMUs or IOMMUs, memory firewalls, region lookup tables, etc., all of which mediate access to and from parts of the platform. "Smart" devices like GPGPUs, co-processors, network cards, or accelerators come with their own hardware protection and translation units [20].

In such a system, the processor's MMU alone does not form a reference monitor for memory, as it is not invoked on all accesses. Indeed, the complex address-translation topology of these systems renders even the concept of a unique physical address meaningless, raising the risk that the policy encoded into the distributed hardware reference monitor (the collection of MMUs, SMMUs, etc.) is inconsistent due to their differing views of the machine. These two problems have already led to security vulnerabilities [32, 33, 37, 41].

We identify three classes of security vulnerabilities and bugs (Table 1) that i) cause the execution of an operation without sufficient rights (a failure of policy enforcement), ii) allow a compromise of the reference monitor itself (e.g. writing translation tables, a failure of partitioning), or iii) use the wrong addresses in descriptors or pointers (a failure of name resolution). The lack of a proper reference monitor which is aware of the complex and configurable addressing network continues to result in numerous bugs and security vulnerabilities [14, 21, 42, 45, 46, 53, 61]. Confining these bugs in a kernel is hard, and they are likely to compromise the entire system [13].

In this paper we demonstrate that these whole classes of bugs can be prevented by extending the traditional OS-MMU reference monitor to cover all hardware translation and enforcement engines, allowing policy enforcement on all memory accesses; by ensuring consistent name resolution through adopting the decoding net [1, 2] as a more faithful model of modern addressing hardware; and by ensuring the secure partitioning of reference monitor state, either through a partitioned capability system or, in a traditional kernel (such as Linux), by good software engineering practice and the application of existing memory management interfaces.

Our first contribution is to identify the undermining of the traditional OS-MMU reference monitor by a changing hardware/software contract as the root cause of several large classes of critical security bugs.

Our second contribution is to adopt a faithful model of complex addressing hardware (the decoding net), and from it derive a minimal least-privilege model of memory management authority on modern hardware, covering the common functionality of all virtual memory systems (§ 4.1).

Our third contribution is the specification of an OS-agnostic reference monitor to enforce policy expressed in the above model, prototyped as an executable specification in Haskell, and abstracting the OS's internal policy language (e.g. capabilities or ACLs) as an access-control matrix.

Our fourth contribution is to demonstrate that this reference monitor design can be implemented without invasive changes either on partitioned capability systems (e.g. seL4 or Barrelfish) or on an ACL-based UNIX-style kernel (such as Linux). Furthermore, our benchmarks demonstrate that there is no measurable performance cost for a secure, fully-explicit, least-privilege, system-wide virtual memory authority implementation (§ 6).

Type                 CVE-...
Policy enforcement   1999-1166, 2014-3601, 2014-8369, 2014-9888, 2017-16994, 2019-2250, 2019-10538, 2019-10539, 2019-10540
Partitioning         2011-1898, 2013-4329, 2014-0972, 2018-1038, 2018-11994, 2019-2182, 2019-19579
Name resolution      2013-4329, 2014-9932, 2016-3960, 2016-5349, 2017-8061, 2017-12188, 2019-15099

Table 1: Classes of Security Vulnerabilities.

The difficulty of getting complex memory addressing right in an OS is shown by the steady, ongoing stream of related bugs and vulnerabilities in operating systems, for example in the policy enforcement of Linux's memory management code [25]. We identify three classes of common bugs and security vulnerabilities related specifically to the incompleteness of the current reference monitor, which would be rendered impossible under a comprehensive reference monitor that faithfully reflects the hardware:

Policy Enforcement.

These are bugs where a subject was able to change the configuration of a translation unit without having the proper rights to do so. The reference monitor fails here to enforce the system policy:

• Mappings with holes belonging to another subject [39].
• Incorrect permissions on data pages [40].
• IOMMU configured to map too large a range [47–49].

All these bugs are impossible once the operations are performed through a (correct) reference monitor implementing the system security property.

Partitioning.

These bugs involve bypassing the reference monitor, e.g. by directly modifying its internal state:

• DMA transfers into MSI-x interrupt registers [36].
• DMA transfers into IOMMU control registers [38].
• Process modifies its own page table [44].

These are prevented once the reference monitor state is identified and partitioned by subjecting it to system policy, e.g. that no DMA engine or process may map a page table.

Name Resolution.

This class represents inconsistent interpretations of pointers (names):

• Insufficient context to identify the correct object [42].
• Resolving addresses in the wrong context [43].

These are prevented once names are dereferenced (resolved) through a monitor with a complete, accurate model of addressing.

Before presenting our authority model and the executable specification in the next section, we briefly cover reference monitors in a little more detail, in particular the importance of consistent naming, and how complex addressing topologies make it difficult.

We also summarize the existing decoding net model, the executable specification/refinement approach which we borrow from the seL4 system, and the related work.

The reference monitor is a powerful structuring concept in access control, and is implicitly used in practically every OS. A reference monitor enforces an access-control policy, allowing a separation of concerns, and thus effort: if every access is subject to the policy, then the overall safety of the system (w.r.t. the policy) can be guaranteed independently of the correctness of the components making the accesses. This is of enormous benefit to a monolithic system (e.g. Linux), where a fault in one subsystem can easily spread to others, particularly as any subsystem can, in principle, modify translations. Even without enforcing a strict boundary between components (as in a microkernel), routing all updates via a single component responsible for safety ensures that accidental errors will no longer lead to a whole-system compromise.

The critical point for a reference monitor is that all accesses must pass through it, and that it is able to accurately identify which resources are being accessed (e.g. which DRAM address will ultimately be written) when applying its policy. Both of these are undermined in the complex address-translation networks of modern systems, but not fatally so: the hardware component of the reference monitor is now distributed among multiple system MMUs, firewalls, etc.; and addresses may be rewritten after policy is applied, routing them to locations that should not be accessible.

Both of these problems are solved with an accurate model of the hardware: first, to know the complete set of access-control components that must be included in the reference monitor, and second, to guarantee that any translation below the access-control level is consistent with policy.

As established, modern platforms are composed of multiple, heterogeneous cores and devices, each of which can issue accesses to addressable resources such as DRAM, non-volatile memory or device registers. Worse, there is no single "reference" physical address space [20]. Instead, a network of address spaces or buses is connected by address translation units which "route" memory accesses. As just described, in order to securely enforce access control, it is essential to know what final resource some intermediate address (or name) refers to.

I/O memory management units (IOMMUs, or system MMUs) translate addresses generated by accelerators and DMA-capable devices into a "canonical" system-wide physical address space. This allows user-space programs to share a virtual address space with a context on the device, but imposes a further complexity burden on the underlying OS, which must now ensure that IOMMUs are always correctly programmed. This code is fraught with complexity and consequent bugs and vulnerabilities, as it is also intended to provide protection from malicious memory accesses [32–35]. The problem is likely to get worse with the proliferation of IOMMU designs built into GPUs, co-processors, and intelligent NICs.

Even memory controllers can violate the traditional model. Hillenbrand et al. [23] reconfigure memory controllers from system software to provide DRAM aliases for mitigating the performance effects of channel and bank interleaving. Proposals for "in-memory" or "near-data" processing [51, 56, 60] raise further questions for OS abstractions [10] and require a way to unambiguously refer to memory regardless of which module accesses it.

A systematic and accurate way to establish canonical names for access-controlled resources that may be referred to by different local names in different parts of the system is provided by the established decoding net [1, 2] model of address translation.

Decoding nets model the addressing structure of a system as a directed graph, where nodes represent (virtual or physical) address spaces or devices (including RAM), and edges the translation of AS-local addresses into other address spaces or devices. The graph is a set of nodes, defined as an abstract datatype:

    name = Name nodeid address
    node = Node accept    :: {address}
                translate :: address → {name}

The model distinguishes local names (address), relative to some address space, and global names (name), which qualify a local name with its enclosing address space. Each node may accept a set of (local) addresses (e.g. RAM or memory-mapped device registers), and/or translate them to one or more global names (addresses in other address spaces, e.g. MMU or PCI bridges).

Figure 1: Methodology Overview: Refinement steps (Decoding Net Model [prior work]; Abstract Authorization Model [dynamic updates, subjects/objects, authority]; Executable Specification [executable specification of a reference monitor]; Operating System Implementation [OS implementation]).

This approach dovetails nicely with the reference monitor concept as described above. Every translate step corresponds to a dereference operation, and any accept can be used as a canonical name: the ID of the accepting node, plus the local address at which it accepts (e.g. the address within a DRAM bank).

Decoding nets have been successfully used to model a wide variety of systems of exactly the sort that is of interest to us, and give a trustworthy, precise guide to where a reference monitor is required: any configurable translation node must be treated as part of the distributed reference monitor. It must only be configured such that its local translations are a projection of the higher-level security property, exactly as for a processor's MMU. Static configuration nodes must be configured in such a way (either by construction or static verification) that their translations are consistent with the projected policy at the point they are applied.
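The name-resolution step described above can be illustrated with a minimal Haskell sketch. It is not taken from the decoding net formalization [1, 2]: the record layout, the fuel bound, and the three-node example net (mmu, bus, dram, with made-up address ranges) are assumptions chosen for illustration only.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type NodeId  = String
type Address = Integer

-- A global name qualifies a local address with its enclosing node.
data Name = Name NodeId Address deriving (Eq, Ord, Show)

-- A node accepts some local addresses and/or translates them onward.
data Node = Node
  { accept    :: Address -> Bool
  , translate :: Address -> [Name]
  }

type Net = Map.Map NodeId Node

-- Follow translate edges until accepting nodes are reached and return the
-- set of canonical names. The fuel bound (an assumption of this sketch)
-- guards against translation loops.
resolve :: Net -> Int -> Name -> Set.Set Name
resolve net fuel name@(Name nid addr)
  | fuel <= 0 = Set.empty
  | otherwise = case Map.lookup nid net of
      Nothing   -> Set.empty
      Just node ->
        let here    = if accept node addr then Set.singleton name else Set.empty
            onwards = Set.unions [resolve net (fuel - 1) n | n <- translate node addr]
        in  Set.union here onwards

-- Hypothetical net: an MMU maps one 4 KiB page onto a bus, which forwards
-- a window to DRAM; only DRAM accepts addresses.
exampleNet :: Net
exampleNet = Map.fromList
  [ ("mmu",  Node (const False)
                  (\a -> [Name "bus" (a + 0x80000000) | a >= 0x1000, a < 0x2000]))
  , ("bus",  Node (const False)
                  (\a -> [Name "dram" (a - 0x80000000) | a >= 0x80000000]))
  , ("dram", Node (\a -> a >= 0 && a < 0x40000000) (const []))
  ]
```

In this sketch, resolving Name "mmu" 0x1234 yields the single canonical name Name "dram" 0x1234, while an unmapped address such as Name "mmu" 0x5000 resolves to the empty set.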

We borrow our modeling technique, combining refinement with executable specification, from the successful seL4 project. We identify all relevant objects (page tables, address spaces, frames, ...), the subjects that manipulate them (processes, devices, ...), and which authority each subject exercises over each object (e.g. in mapping a frame to a virtual address). These are expressed in an access-control matrix (following Lampson [29]) which forms our abstract specification, analogous to the high-level security policy (integrity) shown to be refined (correctly implemented) all the way down to compiled binaries for seL4 [55].

Again, as in seL4 [15], we next develop an executable specification in Haskell (see § 4.2), expressing subjects, objects, and authority as first-class objects, permitting rapid prototyping without giving up strong formal semantics. Correspondence between the abstract and executable models is thus far established by inspection and careful construction.

Finally, we show (again with precedent [59]) that the executable model (and hence the abstract model) permits multiple high-performance implementations (see § 5): on Barrelfish, as a representative of partitioned-capability systems including seL4 (capabilities corresponding to rows in the matrix), and on Linux, as a representative UNIX-style monolithic kernel (where ACLs correspond to columns in the matrix).

The seL4 proof [28] assumed a single, fixed, physical address space and a single MMU, and provides no guarantees in the presence of other cores or DMA devices. CertiKOS [22] builds on a model of memory accesses to abstract regions of private, shared or atomic memory, but again provides no proof in the presence of other translation units or cores. Even work on verifying memory consistency in the presence of translation currently only considers the simple case of virtual-to-physical mappings [52].

Graviton [57] provides a trusted execution environment for GPUs by requiring that all updates to the page tables go through the command processor, which acts as a reference monitor for the GPU. Komodo [19] uses ARM TrustZone [6] to implement a software enclave. Both of these works are steps in the right direction, and in this work we extend this approach to the whole system.

OpenCL's Shared Virtual Memory [27], nVidia's CUDA [50] and HSA [24] provide a unified view of memory, ensuring addresses remain valid between CPU and GPU. VAST [30] uses compiler support to dynamically copy memory to and from the GPU, and Mosaic [8] provides support for multiple sizes of page translation in a shared virtual address space between CPU and GPU. These approaches ensure address consistency in the specific case of CPU-GPU sharing, but are again not whole-system approaches.

In DVMT [3], a customized TLB miss handler implemented as a helper thread installs entries in the TLB using specialized instructions. Similar to the MMU case, the OS/hypervisor sets up data structures specifying the policy that determines which mappings the thread is allowed to install. Again, this solution focuses on the processor and its MMU.

A static decoding net is a snapshot of the address translation configuration of a system at a particular moment. We augment the static decoding net with a transition relation, modelling the dynamic reconfiguration of the translation hardware, such as when a page table is modified. The allowable transitions express the actions (or traces) permitted by the model.

The system consists of a set of address spaces, each having a current configuration, which corresponds to a decoding net node that defines the translation of local addresses in this address-space context:

    configuration :: address space → node

This lets us reason about translations with the existing mechanisms available for decoding nets. Hardware constraints, e.g. an MMU that only supports the translation of naturally aligned 4 KiB blocks of addresses, are expressed as a restriction on the set of possible nodes an address space can map to. This set is the configuration space of an address space. Invariant I1 requires that every address space must have a well-defined configuration. The configuration space of a fixed address space is a singleton set.

    Nodes:                        node :: Decoding Net Node
    Objects:                      Object = {name}
    Rights:                       Right = Grant | Map | Access
    Configuration Space:          ConfSpace :: AddressSpace → {node}
    Address Space Configuration:  Configuration :: AddressSpace → node
    Access Control Matrix:        AccessControlMatrix :: Subject × Object → {Right}
    Model State:                  State = (AccessControlMatrix, Configuration)
    State Transitions:            ModifyMap :: Subject → (name → {name}) → State → State

Figure 2: Model Definition

Figure 3: Mappings between address spaces showing grant and map rights of mapped segments (virtual address space, intermediate address space, physical address space).

Invariant I1 (Well-defined Configuration) ∀ a :: AddressSpace. Configuration a ∈ ConfSpace a.

Configuration Authority (Mapping).

The configuration of some address spaces can be changed. The configuration space defines the set of possible states an address space may occupy. An authority is a subset of configuration transitions, representing what configuration actions a given subject is permitted to take.

Consider Figure 3, representing the general case of an update to an intermediate address space (for example the intermediate physical address, IPA, in a two-stage translation system). We identify two distinct authorities: the MAP authority, or the authority to change the meaning of an IPA by changing its mapping; and the GRANT authority, or the right to grant ACCESS (by mapping) to some range of physical addresses. Note that the 'virtual' and 'physical' address spaces of Figure 3 can be viewed as special cases of an intermediate address space: a top-level 'virtual' address space is simply one to which nobody has a GRANT authority, and a 'physical' address space, e.g. DRAM, is one to which there exists no MAP authority.

Right R1 (Grant)

The right to insert this memory object into some address space

Right R2 (Map)

The right to insert some memory object into this address space

Right R3 (Access)

The right to read or write an object.

Figure 4: Address spaces in a system with two PCI devices (labels shown: Xeon Phi Core, Xeon Phi Bus, Xeon Phi SMPT, GDDR, IOMMU, Registers, DMA Core, PCI Bridge Window, RAM, Processor Memory Controller, CPU Core, IOMMU).

    subject / object     DMA IOMMU    buffer
    IOMMU driver         MAP
    Xeon Phi process                  GRANT

Table 2: Access control matrix of the Xeon Phi example

Changing Mappings.

Consider Figure 4, showing the address space configuration of a system with two PCI devices: a DMA engine and an Intel Xeon Phi co-processor. Imagine that we wish to establish a shared mapping to allow a process on a Xeon Phi core to receive DMA transfers (e.g. network packets) into a buffer allocated on the GDDR (following the highlighted path from the DMA core to the GDDR).

The process 'owns' the buffer, and has the ability to call recv(), triggering a DMA transfer. In other words, the process has the right to grant ACCESS (temporarily) to the DMA core, but it clearly should not have the ability to modify the IOMMU mappings of the DMA core at will. Hence, it does not have the MAP authority on the relevant address space.

To change the mappings of an address space, an agent (a subject, in standard access-control terminology) needs both the GRANT authority on the buffer object and the MAP authority on the address space object.

The state transition, i.e. changing the configuration and therefore how an address space translates addresses, is expressed by the operation ModifyMap(): a subject tries to change how a name is being translated by the system, and thus updates its state.

Authority Representation.

In a monolithic kernel, both these authorities are held (implicitly) by the kernel, which exercises them on behalf of the subjects. It is up to the kernel to maintain accurate bookkeeping to determine whether any such request is safe, typically using an ACL (access-control list), i.e. the object lists the subjects and their authorities on it. In a partitioned-capability system such as seL4 or Barrelfish, these authorities are represented by capabilities, handed explicitly to one subject, to authorize the operation. In this case, subjects hold the authority on the object. These are equivalent from the perspective of access control, differing only in implementation: the same two basic types of authority are present.

The standard representation of authority in systems is an access control matrix [29], such as that of Table 2. This can be read in rows: the IOMMU driver has the MAP capability to the IOMMU address space, and the process the GRANT capability to the buffer. Alternatively, reading down the columns gives the ACLs: the IOMMU records MAP permission for the driver, and the buffer records a GRANT permission for the process.

Figure 5: Object Type Hierarchy and possible rights (labels shown: AddressSpaceOf(), mappable objects, unmappable objects, retype(), RAM, Translation Structure, Frame, Config. Address Space, grant, grant/map).
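The equivalence of the row (capability) and column (ACL) readings can be sketched in Haskell as follows. The encoding of Table 2 as a finite map and the function names capabilities, acl and holds are assumptions of this sketch, not part of the paper's specification; the rights type of Figure 2 is named Perm here only to avoid visual confusion with the Prelude's Right constructor.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

-- The three rights of the model (called Right in Figure 2).
data Perm = Grant | Map | Access deriving (Eq, Ord, Show)

type Subject = String
type Object  = String
type Matrix  = M.Map (Subject, Object) (S.Set Perm)

-- Table 2, encoded as a sparse matrix.
table2 :: Matrix
table2 = M.fromList
  [ (("IOMMU driver", "DMA IOMMU"), S.singleton Map)
  , (("Xeon Phi process", "buffer"), S.singleton Grant)
  ]

-- Row view: the capability list held by one subject.
capabilities :: Subject -> Matrix -> [(Object, S.Set Perm)]
capabilities s m = [ (o, ps) | ((s', o), ps) <- M.toList m, s' == s ]

-- Column view: the ACL stored with one object.
acl :: Object -> Matrix -> [(Subject, S.Set Perm)]
acl o m = [ (s, ps) | ((s, o'), ps) <- M.toList m, o' == o ]

-- Both views answer the same authorization question.
holds :: Subject -> Perm -> Object -> Matrix -> Bool
holds s p o m = maybe False (S.member p) (M.lookup (s, o) m)
```

Under these assumptions, the Xeon Phi process holds Grant on the buffer but not Map on the DMA IOMMU, whichever view is consulted.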

Security Property.

This access control matrix is our abstract model. A system is correct (secure) statically if its current configuration is consistent with the access control matrix. It is secure dynamically if any possible transition, beginning in a secure state, must leave the system in a secure state. The access control matrix, together with the configuration space, defines the allowable state transitions. The address space must have a valid configuration supported by hardware, and the subject modifying it must have sufficient rights to do so.
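A minimal Haskell sketch of one such guarded transition follows. It is a simplification under stated assumptions: a node is reduced to a finite address-to-object map, the configuration space is a hypothetical 4 KiB alignment check, and modifyMap returns Maybe State to signal refusal, whereas the ModifyMap of Figure 2 is typed as a total function on State. The rights type is named Perm here rather than Right.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

data Perm = Grant | Map | Access deriving (Eq, Ord, Show)

type Subject      = String
type AddressSpace = String
type Object       = String
-- A node configuration, simplified to a finite address-to-object mapping.
type Node         = M.Map Integer Object

data State = State
  { acm    :: M.Map (Subject, String) (S.Set Perm)  -- access control matrix
  , config :: M.Map AddressSpace Node               -- current configuration
  }

hasRight :: State -> Subject -> Perm -> String -> Bool
hasRight st s p o = maybe False (S.member p) (M.lookup (s, o) (acm st))

-- Stand-in for the configuration space (Invariant I1): only naturally
-- aligned 4 KiB addresses are representable. Purely illustrative.
inConfSpace :: AddressSpace -> Node -> Bool
inConfSpace _ node = all (\a -> a `mod` 4096 == 0) (M.keys node)

-- The transition succeeds only if the subject holds MAP on the address
-- space and GRANT on the object, and the new configuration is valid.
modifyMap :: Subject -> AddressSpace -> Integer -> Object -> State -> Maybe State
modifyMap s space va obj st
  | not (hasRight st s Map space)  = Nothing
  | not (hasRight st s Grant obj)  = Nothing
  | not (inConfSpace space node')  = Nothing
  | otherwise = Just (st { config = M.insert space node' (config st) })
  where
    node' = M.insert va obj (M.findWithDefault M.empty space (config st))

-- Hypothetical initial state: the driver holds both rights, the process
-- only GRANT on its buffer.
initial :: State
initial = State
  (M.fromList [ (("driver", "iommu"),  S.singleton Map)
              , (("driver", "buffer"), S.singleton Grant)
              , (("proc",   "buffer"), S.singleton Grant) ])
  M.empty
```

Because every successful transition re-checks both the matrix and the configuration space, a trace that starts in a secure state stays in one, which is the dynamic property stated above.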

We refine this abstract model into an executable specification of a reference monitor [4] for ModifyMap(). When composed with the reference monitor for ACCESS, i.e. the MMU, we have our desired compound reference monitor for the fully-dynamic VM system, secure for accesses beginning at any core or device.

This specification serves as an intermediate step (Figure 1) between the abstract model and the concrete OS implementation of the next section, and also as an OS-agnostic prototype for implementation in other systems. This approach is inspired by seL4 [17], which also employed an intermediate Haskell specification to facilitate prototyping.

Explicit Translation Structures.

We now explicitly represent address translation structures (e.g. page tables, or memory-mapped device registers) as memory objects, without imposing any particular layout on them. This allows us to reason about the manner in which address translation depends on the contents of a memory object (e.g. page tables in RAM, or the contents of device registers).

Once the translation structures are explicit, and noting that these are exactly the reference monitor state we must securely partition, we can state the partitioning invariant (Invariant I2) in terms of implementation-visible objects.

Invariant I2 (Partitioning) No subject has ACCESS to a translation object.

We model address translation structures as an opaque data type (TStructure). This allows us to maintain generality by assuming nothing about their actual inner structure:

    data Object = RAM        {base :: Name, size :: Natural}
                | Frame      {base :: Name, size :: Natural}
                | TStructure {base :: Name, size :: Natural}

Memory objects form a hierarchy (Figure 5 shows an excerpt) which defines how the different types of objects can be derived from each other. For example, in-memory translation structures (TStructure) are created by retyping RAM objects. RAM is the base type for untyped memory. Retyping RAM to a Frame makes it possible to map it into an address space, i.e. to GRANT access to it. Note that neither RAM nor TStructure has the GRANT right, and therefore these may never become accessible (partitioning).

An address space is derived from (and defined by) a translation structure, and is an explicit object granting the right to map this space into higher-level address spaces (e.g. a second-stage page table defining an IPA space, assigned to the guest-physical address space of a virtualized OS; see Figure 3).

    AddressSpaceOf :: Object -> AddressSpace
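The retype rules and their link to partitioning can be sketched in Haskell as below. This is an illustrative reading of the hierarchy rather than the paper's specification: object types are reduced to bare tags without base and size, the assignment of Grant and Access to Frame is an assumption, and the names canRetype, possibleRights and mappable are hypothetical. The rights type is named Perm because this block also uses the Prelude's Right constructor.

```haskell
data ObjType = RAM | Frame | TStructure deriving (Eq, Show)
data Perm    = Grant | Map | Access      deriving (Eq, Show)

-- Permissible retype steps: RAM is the untyped base of the hierarchy.
canRetype :: ObjType -> ObjType -> Bool
canRetype RAM Frame      = True
canRetype RAM TStructure = True
canRetype _   _          = False

-- Rights that can ever exist on an object of the given type
-- (assumed reading; RAM and TStructure never carry GRANT).
possibleRights :: ObjType -> [Perm]
possibleRights Frame      = [Grant, Access]
possibleRights RAM        = []
possibleRights TStructure = []

-- An object can be mapped, and hence become accessible, only if GRANT
-- can exist on it. Translation structures therefore stay partitioned.
mappable :: ObjType -> Bool
mappable t = Grant `elem` possibleRights t

retype :: ObjType -> ObjType -> Either String ObjType
retype from to
  | canRetype from to = Right to
  | otherwise         = Left ("cannot retype " ++ show from ++ " to " ++ show to)
```

In this sketch Invariant I2 holds by construction: no sequence of retype steps produces a TStructure on which Grant, and thus Access through a mapping, can exist.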

Authority and State.

The system is a set of agents, a mapping database (MDB) recording the derivation relation between objects, and a set of active address spaces:

    data KState = KState (Set Agent) MDB (Set AddrSpace)

Authority is either directly to an object, or a meta-authority: the right to grant an authority to another. In turn, a set of such authorities, coupled with an identifier, defines an agent.

    data Authority = Access Object
                   | Map Object
                   | Grant Authority

Reference monitor.

The model exposes a set of operations that either change a configuration or access a memory address. The set of permitted operations defines the behavior of the reference monitor. We express this in Haskell as a custom state monad:

    data Operation a = Operation (State -> (a, State))
    instance Monad (Operation) where ...
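For readers unfamiliar with hand-written state monads, the elided instance could be completed along the following lines. This is a generic sketch, not the paper's code: State is replaced by a placeholder log of operation names, and runOperation, record and demo are hypothetical helpers introduced only for this example.

```haskell
-- Placeholder monitor state: a log of the operations applied so far.
type State = [String]

data Operation a = Operation (State -> (a, State))

runOperation :: Operation a -> State -> (a, State)
runOperation (Operation f) = f

instance Functor Operation where
  fmap f op = Operation $ \s ->
    let (a, s') = runOperation op s in (f a, s')

instance Applicative Operation where
  pure a = Operation $ \s -> (a, s)
  opF <*> opA = Operation $ \s ->
    let (f, s')  = runOperation opF s
        (a, s'') = runOperation opA s'
    in  (f a, s'')

instance Monad Operation where
  -- Thread the state from the first operation into the second.
  op >>= k = Operation $ \s ->
    let (a, s') = runOperation op s in runOperation (k a) s'

-- Append an operation name to the log.
record :: String -> Operation ()
record name = Operation $ \s -> ((), s ++ [name])

-- Two operations in sequence, returning how many were logged.
demo :: Operation Int
demo = do
  record "retype"
  record "map"
  Operation $ \s -> (length s, s)
```

Modern GHC requires the Functor and Applicative instances alongside Monad, which is why all three appear here.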

The reference monitor intercepts operations and verifies that the agent performing the operation has sufficient rights to execute it. We express the changes to the system's state as a sequence of operations on the reference monitor, e.g. retype() or map(), forming a trace of operations:

    mappingTrace = do
      ...
      -- retype a RAM object to a Frame
      res <- Model.retype RAM Agent Frame Agent
      -- retype another RAM object to a TStructure
      res <- Model.retype RAM2 Agent TStructure Agent
      -- map the frame into the translation structure
      mapping1 <- Model.map TStructure Frame Agent
      ...

Model traces are sequences of monitor states (KState), each corresponding to a static decoding net model. Operations include:

• retype() converts an existing object into an object of a permissible subtype.
• map() installs a mapping in a translation structure.
• copy() copies an authority from one subject to another.

Valid Traces.

Contained within the set T of all possible traces, there is a set of traces T_V ⊆ T that conform to all constraints enforced by the executable specification. We express these traces in the model as sequences of KStates. All other traces (T − T_V) end in a failure state (e.g. execution ended in a state not satisfying the access-control policy).

Figure 6: Implementation Overview (labels shown: policy, mechanism, static platform description, hardware discovery, model runtime state, algorithms, state population, access control system, application, model runtime, query, reference monitor operations, reference monitor).

Summary.

The executable specification allows us to both simulate and specify sequences of operations, such as memory accesses or translation configurations, as they would be performed by a concrete OS implementing the new abstract model.
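The distinction between valid traces and failing traces described above can be sketched as a fold over operations in the Either monad. This is a simplified illustration, not the executable specification itself: the state omits agents and authorities, objects are identified by hypothetical integer IDs, and only retype and map are modelled.

```haskell
import Control.Monad (foldM)
import qualified Data.Set as S

data ObjType = RAM | Frame | TStructure deriving (Eq, Ord, Show)
type ObjId = Int

-- Simplified monitor state: typed objects plus installed mappings.
data KState = KState
  { objects  :: S.Set (ObjId, ObjType)
  , mappings :: S.Set (ObjId, ObjId)   -- (translation structure, frame)
  } deriving (Eq, Show)

data Op = Retype ObjId ObjType | MapInto ObjId ObjId deriving Show

-- One monitor step: either a new state or the reason for refusal.
step :: KState -> Op -> Either String KState
step st (Retype o t)
  | (o, RAM) `S.notMember` objects st = Left "retype: source is not RAM"
  | t == RAM                          = Left "retype: target must be a subtype"
  | otherwise =
      Right (st { objects = S.insert (o, t) (S.delete (o, RAM) (objects st)) })
step st (MapInto ts f)
  | (ts, TStructure) `S.notMember` objects st = Left "map: not a translation structure"
  | (f, Frame) `S.notMember` objects st       = Left "map: only frames are mappable"
  | otherwise = Right (st { mappings = S.insert (ts, f) (mappings st) })

-- A trace is valid iff every step is accepted; the first refusal ends it.
runTrace :: KState -> [Op] -> Either String KState
runTrace = foldM step

-- Hypothetical initial state with two untyped RAM objects.
initial :: KState
initial = KState (S.fromList [(1, RAM), (2, RAM)]) S.empty
```

A trace mirroring mappingTrace, [Retype 1 Frame, Retype 2 TStructure, MapInto 2 1], is accepted, whereas attempting to map one translation structure into another is rejected at the map step.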

In this section, we describe the implementation of the reference monitor and the runtime support libraries and services in two classes of operating systems: a complete implementation in Barrelfish/MAS, as a representative of a partitioned capability system, derived from the open-source Barrelfish OS [12], and, side by side, a sketch of an implementation within Linux, as a representative of a traditional UNIX-style kernel.

Architecture Overview.

Figure 6 shows an overview of the resulting architecture. We separate policy and mechanism: (1) at the center is the runtime representation of the model (§ 5.1), which stores the memory topology and provides queries and algorithms for memory allocation policies; (2) the reference monitor enforces access control and provides the mechanisms for resource management and configuration; and (3) static platform descriptions and dynamic discovery mechanisms (§ 5.3) provide input for the policy and mechanism implementations.

We implement the runtime representation of the address space model (Figure 6, (1)) in a policy engine. On Barrelfish, this is merged into the Prolog-based system knowledge base (SKB) [54], which already stores both static and dynamic facts about the system. On Linux, we could use a standalone Prolog instance and run it as a service, or implement the model directly along with other memory allocation policies inside the kernel. We now describe the model representation, its algorithms and potential optimizations. (MAS stands for multiple address spaces.)

Model representation. We implement the model representation by asserting facts for the accept, translate and overlay constructs of the model (see syntax in [2]). Listing 1 shows the corresponding Prolog rules. This encodes the decoding net, and adds the information to the database.

    assert(translate(RegionFrom, RegionTo)).
    assert(overlay(NodeFrom, NodeTo)).
    assert(accept(Region)).
    dn_get_allocation_range(NodeSrc, NodeDst).
    dn_get_config_nodes(NodeSrc, NodeDst).
    dn_resolve_range(Node, Addr, Size).
    dn_resolve_range(NodeSrc, Addr, Size, DstSrc).

Listing 1: Prolog Model Representation

Algorithms. On top of the model encoding, we implement several algorithms that are useful for making allocation and configuration policy decisions. For instance, to set up a device, the driver uses the dn_get_allocation_range() query to find a suitable address space for memory allocation, then runs dn_get_config_nodes() to get the list of address spaces which need to be configured to make the memory resource accessible, and lastly executes dn_resolve_range() to obtain the address at which the device sees the memory resource. The result of the queries is then converted into a sequence of capability operations to allocate memory, set up translation structures and perform the relevant mappings. Note that the model queries only provide a roadmap; the actual reconfiguration steps are invocations of the reference monitor, which enforces the authority and integrity of the system following the definition of the executable specification (§ 4.2).
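To make the accept, translate and overlay facts and the resolve query more concrete, the following Python sketch walks a toy decoding net. It is not the Prolog implementation: the DecodingNet class, its method names and the example topology (phi, pcie, iommu, dram and all addresses) are assumptions chosen for illustration, and resolve() only loosely mirrors what a query such as dn_resolve_range() has to compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translate:
    base: int      # first source address covered by the rule
    size: int      # length of the translated region in bytes
    dst_node: str  # node the region is forwarded to
    dst_base: int  # address of the region at the destination


class DecodingNet:
    def __init__(self):
        self.accepts = {}     # node -> [(base, size)]
        self.translates = {}  # node -> [Translate]
        self.overlays = {}    # node -> node receiving all unmatched addresses

    def accept(self, node, base, size):
        self.accepts.setdefault(node, []).append((base, size))

    def translate(self, node, base, size, dst_node, dst_base):
        rule = Translate(base, size, dst_node, dst_base)
        self.translates.setdefault(node, []).append(rule)

    def overlay(self, node, dst_node):
        self.overlays[node] = dst_node

    def resolve(self, node, addr, max_hops=32):
        """Follow the net from (node, addr); return the accepting (node, addr) or None."""
        for _ in range(max_hops):  # bound the walk in case the net has a loop
            if any(b <= addr < b + s for b, s in self.accepts.get(node, [])):
                return node, addr
            rule = next((t for t in self.translates.get(node, [])
                         if t.base <= addr < t.base + t.size), None)
            if rule is not None:
                node, addr = rule.dst_node, rule.dst_base + (addr - rule.base)
            elif node in self.overlays:
                node = self.overlays[node]
            else:
                return None
        return None


# Hypothetical topology: a co-processor reaches host DRAM through PCIe and an IOMMU.
net = DecodingNet()
net.accept("dram", 0x0, 0x1_0000_0000)
net.translate("iommu", 0x0, 0x1000, "dram", 0x4000_0000)
net.overlay("pcie", "iommu")
net.translate("phi", 0x8000_0000, 0x1000, "pcie", 0x0)

resolved = net.resolve("phi", 0x8000_0010)
unmapped = net.resolve("phi", 0x0)
```

In this toy net, resolved names the DRAM node and the address the access finally decodes to, while unmapped is None because no rule covers that address.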

Optimization. Running the Prolog queries on the full graph is costly. We provide a library that caches the (flattened) graph representation, consisting only of cores/devices, configurable address spaces and memory nodes, in the Prolog engine and directly in C using adjacency lists. We can then run a shortest-path algorithm to perform the queries, which minimizes the number of address spaces to configure.
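A minimal sketch of such a query over a flattened graph is shown below, in Python rather than C. The adjacency lists, node names and the set of configurable nodes are made up for illustration. The sketch uses a plain breadth-first search, which minimizes the number of hops; treating this as a stand-in for minimizing the number of configurable address spaces is a simplifying assumption.

```python
from collections import deque


def config_path(adj, src, dst):
    """Return the shortest node path from src to dst, or None if unreachable."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk predecessors back to src
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical flattened graph: cores/devices, configurable spaces, memory nodes.
adj = {
    "xeon_phi": ["smpt"],
    "smpt": ["pcie"],
    "pcie": ["iommu"],
    "iommu": ["dram0", "dram1"],
    "cpu": ["mmu"],
    "mmu": ["dram0", "dram1"],
}
configurable = {"smpt", "iommu", "mmu"}

path = config_path(adj, "xeon_phi", "dram0")
to_configure = [node for node in path if node in configurable]
```

Here to_configure lists the translation units on the path from the co-processor to the memory node, in the order in which an access passes through them.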

We now describe the implementation of the reference monitor defined by the executable specification in Linux and Barrelfish/MAS.

Resource Management. Both Linux and Barrelfish already have thorough, albeit different, resource management mechanisms. Barrelfish manages physical resources using a distributed, partitioned capability system for naming, access control, and accounting of objects. As in seL4 [18], capabilities are typed to indicate what can be done with the memory they refer to; rules dictate valid retype operations (e.g. RAM to a Frame). Linux maintains a data structure, the page struct, for every 4 KiB page of memory. In both systems, only the kernel has direct access to those data structures, and can maintain the partitioning invariant.

Reference Monitor. As with all microkernels, Barrelfish's kernel is essentially nothing but a reference monitor. It uses the capability system to express the objects in memory and the authority a process (subject) has over them. Any changes to the translation units (e.g. mapping a memory frame into the IOMMU) correspond to capability operations. The reference monitor checks the type, address spaces and rights of the capabilities.

On Linux, we can use the para-virtualization interface (PV-Ops) to implement a reference monitor inside the kernel itself. We can then extend the PV-Ops interface to include all address translation units in the system. This effectively implements a well-defined hypercall interface to request changes to the translation tables from the hypervisor acting as the reference monitor. Similarly, the nested kernel [16] integrates a privileged kernel inside the monolithic kernel which interposes on all updates to translation tables. Extending this interface to include all other translation hardware as well would present a good way to implement a reference monitor inside the Linux kernel.

Naming of Resources. Barrelfish's capabilities contain physical addresses to identify the objects they refer to. To still identify objects uniquely in the presence of multiple address spaces, we change the capability system in Barrelfish/MAS to use canonical base names, consisting of an address space identifier (ASID) and an address within that address space. We adapt the kernel to consider the ASID when performing capability operations. An operation may now fail in new ways due to incompatible address spaces of the capabilities (e.g. one cannot directly map a host physical frame to a guest virtual address).

Linux uses the physical frame number (PFN) to uniquely identify every 4 KiB page of memory. Using the sparse memory model [58] or heterogeneous memory [31], we can implement memory nodes (address spaces) and a dynamic mapping of the PFN to the underlying page struct. In this manner, we can use the PFN as the memory resource's canonical name.

On both operating systems, we need a function to dereference the canonical name of a resource into a locally valid address. We can generate such a translation function based on the platform description or the model state.
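The idea of a canonical name and its dereference function can be sketched as follows. This is an illustrative Python model, not the Barrelfish/MAS kernel code: CanonicalName, Window, the window table, the ASID values and all addresses are hypothetical, and a generated translation function would be derived from the platform description rather than written by hand.

```python
from typing import NamedTuple


class CanonicalName(NamedTuple):
    asid: int  # address space identifier
    addr: int  # address within that address space


class Window(NamedTuple):
    local_base: int   # where the window starts in the viewer's address space
    target_base: int  # where it starts in the named address space
    size: int         # window length in bytes


def dereference(windows, viewer_asid, name):
    """Turn a canonical name into an address that is valid in viewer_asid."""
    if name.asid == viewer_asid:
        return name.addr
    for win in windows.get((viewer_asid, name.asid), []):
        offset = name.addr - win.target_base
        if 0 <= offset < win.size:
            return win.local_base + offset
    raise LookupError(f"{name} is not reachable from address space {viewer_asid}")


HOST, DEVICE = 0, 1  # hypothetical ASIDs
windows = {
    # The device sees 8 MiB of host memory at a high local address.
    (DEVICE, HOST): [Window(local_base=0x80_0000_0000,
                            target_base=0x1000_0000,
                            size=0x80_0000)],
}

frame = CanonicalName(HOST, 0x1000_2000)
host_view = dereference(windows, HOST, frame)
device_view = dereference(windows, DEVICE, frame)
```

A name that no window covers raises LookupError, which corresponds to the new failure mode of operations on capabilities with incompatible address spaces.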

Object Types. In addition, Barrelfish/MAS introduces new capability types for all hardware translation units (not just page tables), ASID allocation, and entire physical, intermediate or virtual address spaces. Like Barrelfish, we allow a capability to refer to a memory region of arbitrary size, but require that it must not span multiple address spaces.

On Linux, we do not need to use typed objects as such, as the kernel does not expose handles to physical resources to user space. Internally, Linux already uses different accounting types for memory allocations.

Page Tables and Address Spaces. Barrelfish/MAS introduces distinct capability types for all hardware-defined translation structures (register sets or page table levels). Each of these capability types represents a translation structure in the sense of the executable spec. Since a page table defines an address space, we can derive an address space capability from it, and use it to install mappings in other address spaces. Deleting the page table capability triggers a recursive deletion of its spanned address spaces and all possible mappings. We integrated this process into the capability system. This is effectively equivalent to revoking all descendants of the address space capability and then deleting it. This ensures that there are no mappings referring to an invalid address space.

With the implementation of para-virtualization and KVM-based virtualization, Linux has support to represent the guest address space inside the kernel. This would be one possibility to get support for different address spaces in the kernel. Alternatively, we can use the sparse memory model or HMM to create "virtual" memory nodes that correspond to an intermediate address space.

Tracking Mappings. Barrelfish/MAS uses designated mapping capabilities to track mappings. For every mapped object, there is a corresponding mapping capability, which is a descendant thereof. Therefore, the capability system is able to locate and invalidate all mappings when access to an object is revoked. Note that translation structures effectively define an address space, and hence there is no difference between mappings of multi-level page tables and mappings of actual frames.

Similar to the mapping capabilities, Linux uses the rmap data structure to store where a page of memory is mapped. This is already maintained for the page cache, as well as for guest memory pages. We can use this mechanism to track all mappings of a page in Linux.
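The role of mapping capabilities during revocation can be sketched with a small capability tree. The Python below is an illustration under assumed names (Cap, map_frame, revoke, and a dictionary standing in for the page table entries); it is not the Barrelfish/MAS capability database.

```python
class Cap:
    """A capability node; children are its descendants in the derivation tree."""

    def __init__(self, kind, name, parent=None):
        self.kind = kind
        self.name = name
        self.children = []
        if parent is not None:
            parent.children.append(self)


def map_frame(frame, table, slot, entries):
    """Write a table entry and record it as a Mapping capability under the frame."""
    entries[(table.name, slot)] = frame.name
    return Cap("Mapping", (table.name, slot), parent=frame)


def revoke(cap, entries):
    """Delete every descendant of cap; Mapping descendants clear their entry."""
    for child in cap.children:
        revoke(child, entries)
        if child.kind == "Mapping":
            entries.pop(child.name, None)
    cap.children.clear()


entries = {}  # (translation structure, slot) -> mapped frame
ram = Cap("RAM", "ram0")
frame = Cap("Frame", "frame0", parent=ram)
cpu_pt = Cap("PageTable", "cpu_pt")
iommu_pt = Cap("PageTable", "iommu_pt")

map_frame(frame, cpu_pt, 3, entries)    # mapped for a CPU core
map_frame(frame, iommu_pt, 7, entries)  # and for a device behind the IOMMU
mapped_before = dict(entries)

revoke(ram, entries)  # revoking the RAM capability removes both mappings
```

Because each mapping is a descendant of the mapped object, a single revoke on an ancestor finds the entries in every translation structure without scanning the tables themselves.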

The last part of the implementation describes how the model state is populated ((3) in Figure 6). There are two major sources of memory topology information building up the runtime representation: i) static descriptions of platforms (or parts thereof), and ii) discovery mechanisms such as PCI or ACPI, which may instantiate predefined descriptions.

Static Platform Descriptions. The memory topology of parts of the system (or, in the case of an SoC, the entire system) is fixed and known in advance: for instance, the Xeon Phi co-processor has a defined number of cores and memory. We can therefore write down a description of the memory subsystem. For this, we use a domain-specific language (DSL), which closely follows the syntax of the formal model and allows writing down the memory topology of the entire system, or its sub-components. The DSL compiler then produces a set of Prolog rules, which populate the model at runtime, either fully or in response to hardware discovery events. On Linux, we can use procfs and sysfs, as well as device trees, to obtain system topology descriptions.

Using Static Descriptions: Code Generation. From the static descriptions, we can pre-compute and enumerate the address spaces of the hardware component, or in the case of SoC platforms, the entire memory topology. The DSL compiler generates data structures and code used by the reference monitor: the initial set of capabilities, checks for address space compatibility in capability operations, translation tables, and functions that convert canonical names into valid local physical or virtual addresses. We evaluate this scenario in § 6.4.

Using Static Descriptions: Hardware Discovery. In general, the configuration of a platform is known after device discovery mechanisms such as ACPI or PCI (if present). During this process, the model is dynamically populated with the partial descriptions of its components: e.g. the ACPI table indicates the presence and version of an IOMMU, and in response the partial description of the IOMMU is instantiated and added to the model at runtime. A driver may update the model with more precise information, e.g. only the Xeon Phi driver knows the precise number of cores and memory size of the PCI Express attached co-processor.
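As a rough illustration of populating the model from partial descriptions, the Python sketch below instantiates predefined descriptions in response to discovery events and lets a driver refine them later. The description functions, the fact tuples and the topology they encode are assumptions invented for this example; they are neither the DSL nor the Prolog rules that the DSL compiler produces.

```python
def iommu_description(name, upstream, memory):
    """Partial description: the IOMMU sits between a bus and memory."""
    return {("overlay", upstream, name), ("overlay", name, memory)}


def xeon_phi_description(name, bus, cores=None, mem_bytes=None):
    """Partial description; core count and memory size are filled in by the driver."""
    facts = {("overlay", f"{name}.smpt", bus)}
    if mem_bytes is not None:
        facts.add(("accept", f"{name}.gddr", 0, mem_bytes))
    if cores is not None:
        facts |= {("overlay", f"{name}.core{i}", f"{name}.smpt")
                  for i in range(cores)}
    return facts


DESCRIPTIONS = {"iommu": iommu_description, "xeon_phi": xeon_phi_description}


class Model:
    def __init__(self, static_facts=()):
        self.facts = set(static_facts)  # facts known from static descriptions

    def discovered(self, kind, name, **params):
        """Instantiate the predefined description for a discovered component."""
        self.facts |= DESCRIPTIONS[kind](name, **params)


model = Model({("accept", "dram", 0, 256 << 30)})
# An ACPI-style event reports an IOMMU.
model.discovered("iommu", "iommu0", upstream="pcie", memory="dram")
# Bus enumeration finds the co-processor, without details.
model.discovered("xeon_phi", "phi0", bus="pcie")
# The driver later refines the description with precise numbers.
model.discovered("xeon_phi", "phi0", bus="pcie", cores=57, mem_bytes=8 << 30)
```

Since facts are kept in a set, re-instantiating a description with more parameters only adds the newly known facts.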

In this section, we present a quantitative and qualitative performance evaluation of the address space and least-privilege authority model in Barrelfish/MAS. The goal of this section is to establish the following:
1. The mechanism implementation results in a performant memory system (§ 6.1, § 6.2).
2. The policy implementation produces usable results within reasonable overheads (§ 6.3).
3. The resulting system is able to handle complex memory topologies, which we demonstrate qualitatively (§ 6.4).

Evaluation Platform. All performance measurements are performed on a dual-socket server consisting of two Intel Xeon E5-2670 v2 processors (Ivy Bridge micro-architecture) with 10 cores each. The machine has 256 GiB of main memory split equally into two NUMA nodes. The machine runs in "performance mode", with simultaneous multi-threading (SMT), Intel TurboBoost technology, and Intel SpeedStep disabled, to ensure consistent measurements. The machine further contains two Intel Xeon Phi 31S1 co-processors, each attached as a PCI Express 3.0 device. The co-processors have 57 cores with four hardware threads per core, and 8 GiB GDDR memory. The Intel VT-d [26] (IOMMU) is enabled. We use a vanilla Ubuntu 18.04 LTS with Linux kernel 4.15. For a fair comparison we disable the Spectre/Meltdown mitigations, as they slow down memory operations significantly and Barrelfish does not implement them. Barrelfish and Barrelfish/MAS are compiled in release mode.

In this part of the evaluation, we quantitatively evaluate the performance of Barrelfish/MAS's virtual memory operations in comparison to vanilla Barrelfish and Linux.

[Figure 7: Measured Latency per Page for the VM Operations on Linux, Barrelfish and Barrelfish/MAS. Panels: (a) map() operation, (b) protect() operation. Y-axis: time per page. Series: Linux, Barrelfish, Barrelfish/MAS.]

[Table 3: The Best Configuration of the Linux VM Operations. Columns: map(), protect(), unmap().]

Benchmark Methodology. We compare the performance of the virtual memory operations map(), protect() and unmap() for buffer sizes from 4 KiB to 64 GiB using one of the three natively supported page sizes (4 KiB, 2 MiB and 1 GiB). On Barrelfish/MAS and Barrelfish, we use the default user-level virtual memory management library, and on Linux we take the fastest of the measured techniques to map memory: anonymous memory (mmap()), shared memory objects (shmfd()) or shared memory segments (shmat()). We exclude the allocation and clearing of backing memory in this benchmark, as it affects all systems equally and would dominate the execution times.
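The relation between buffer size, page size and the number of page table entries touched by one operation is simple arithmetic, worked through in the Python sketch below. The helper names are made up, and stepping the buffer size in powers of two is an assumption about the sweep, not something the methodology states.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30
PAGE_SIZES = {"4 KiB": 4 * KIB, "2 MiB": 2 * MIB, "1 GiB": GIB}


def entries(buffer_bytes, page_bytes):
    """Page table entries modified when the buffer is mapped with one page size."""
    return -(-buffer_bytes // page_bytes)  # ceiling division


def sweep(page_bytes, min_buf=4 * KIB, max_buf=64 * GIB):
    """(buffer size, entry count) pairs for power-of-two buffers up to max_buf."""
    size = max(min_buf, page_bytes)  # a buffer holds at least one page
    points = []
    while size <= max_buf:
        points.append((size, entries(size, page_bytes)))
        size *= 2
    return points


huge_page_counts = [count for _, count in sweep(PAGE_SIZES["1 GiB"])]
small_page_points = len(sweep(PAGE_SIZES["4 KiB"]))
offload_buffer_entries = entries(8 * MIB, PAGE_SIZES["4 KiB"])
```

With 1 GiB pages the sweep covers between 1 and 64 entries, whereas with 4 KiB pages the same range of buffer sizes spans 25 doublings, which is why per-entry cost rather than total time is the comparable quantity.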

Results. Figure 7 contains the results of this evaluation for the three operations and page sizes. The graphs show the median latency (lower is better) and standard error per modified page table entry. We scale the number of changed page table entries. For Linux, we select the best configuration as indicated in Table 3. We make the following observations:
• Amortization: The general pattern is similar: the cost per page decreases with increasing numbers of affected pages. The cost of the virtual region management, the syscall overhead and locating the page table entry is amortized among multiple pages, whose mappings are likely to be in consecutive page table entries.
• map(). Both Barrelfish and Barrelfish/MAS have matching performance patterns, independent of the page size used. Linux is faster for mapping up to two 4 KiB pages. For larger pages, Barrelfish (as well as Barrelfish/MAS) outperforms Linux. This is not an effect of our implementation but is due to Linux allocating lower-level page tables in case the super-page mapping needs to be broken up. Therefore, Linux allocates and clears memory to hold the page table. Zeroing a page can add up to 0.71 µs, which is the difference we see in the graph. Both Barrelfish and Barrelfish/MAS only have to create a new mapping capability and insert it into the MDB.
• protect(). We observe very predictable patterns for Barrelfish and Barrelfish/MAS, where vanilla Barrelfish is slightly faster because it stores an explicit pointer to the page table directly in the mapping capability, whereas Barrelfish/MAS stores the canonical name, which requires an address translation and therefore more work. In both cases, the mapping capability contains all information needed to perform the operation. Linux needs to walk the page table to locate the page table entry to be protected. This is again not an effect of the MAS extension but a difference between Linux and vanilla Barrelfish.
• unmap(). Up to eight affected pages, Linux is faster than Barrelfish and Barrelfish/MAS, which both need to remove and delete the mapping capability from the MDB. This results in another syscall on Barrelfish (Barrelfish/MAS removes this when clearing the page table entry). Removing the mapping capability gets amortized when more pages are affected.

Discussion. In direct comparison with Barrelfish, we observe that Barrelfish/MAS is able to match the performance in all cases. Moreover, the comparison with Linux shows that Barrelfish/MAS has comparable performance to a mainstream OS. We conclude that our least-privilege access control model with support for multiple address spaces can be implemented with fine granularity while maintaining competitive memory management performance.

The Appel-Li benchmark [5] exercises the virtual memory subsystem with operations which are relevant to tasks such as garbage collection or tracking page modifications.

Benchmark Methodology. The benchmark consists of the following three experiments:
1. prot1-trap-unprot. Randomly pick a page of memory, write-protect the page, write to it, take a trap, unprotect the page, continue with the next page.
2. protN-trap-unprot. Write-protect 512 pages of memory at once, write to each page of memory in turn, taking a trap and unprotecting the page.
3. trap only. Pick a protected page, write to it and take the trap, then continue with the next page without changing any permissions.

We run this benchmark on Barrelfish and Barrelfish/MAS. In addition, we compare to Linux as a frame of reference. On Barrelfish and Barrelfish/MAS the numbers include the cost of virtual address space accounting in user space.

Results. We show the benchmark results in Figure 8. Each bar corresponds to a different OS and represents the time taken per page. The three bar groups represent the three benchmark experiments. The standard error is less than 0.5%.

[Figure 8: Appel-Li Benchmark on Barrelfish/MAS and Linux. Bar groups: prot1-trap-unprot, protN-trap-unprot, trap only. Y-axis: kcycles per page or trap. Series: Linux, Barrelfish, Barrelfish/MAS.]

We make the following observations:
• Barrelfish vs. Barrelfish/MAS. Direct comparison shows a slowdown of less than 5% for Barrelfish/MAS vs. Barrelfish. The trap performance of both systems is the same.
• Linux vs. Barrelfish. Barrelfish outperforms Linux in all experiments. Barrelfish can use its capability system to efficiently find the page table that has to be modified, while Linux needs to walk the page table tree. Furthermore, Barrelfish reflects the trap directly to user space without checking whether the faulting address has been previously allocated [9]. This applies to Barrelfish/MAS as well as vanilla Barrelfish and is independent of our extension.
• Batching. The protection of 512 pages in one syscall (protN-trap-unprot) amortizes the total syscall overhead, which reduces the time per page on all systems by 600-2000 cycles.

Discussion. In this evaluation, we show that Barrelfish/MAS is able to match the performance of Barrelfish with a maximum overhead of less than 5%, despite support for explicit address spaces. The comparison to Linux again shows that Barrelfish/MAS's memory operation performance is competitive with that of a mainstream OS.

In this evaluation, we investigate the overheads of the model runtime representation and the translation unit reconfiguration following the principle of least privilege.

Benchmark Methodology. This benchmark models an offload scenario, where an application workload wants to make use of a co-processor attached to PCI Express. We use the Xeon Phi co-processor for this purpose. We are interested in the sequence of initialization steps to establish a shared buffer between the CPU cores and the co-processor:
1. Model Query. Evaluate the runtime representation to find a suitable memory region and the needed reconfiguration steps.
2. Allocate and Map. Request memory from the allocator and map it into the application's virtual address space.
3. Program Translation Units. Reconfigure the translation units indicated in the model query response. Here, this includes i) the IOMMU, and ii) the SMPT of the co-processor.

[Figure 9: Breakdown of the Offloading Scenario. Bars: Linux MMAP, Barrelfish Alloc and Map, Barrelfish/MAS Local Map, Barrelfish/MAS RPC Map. X-axis: time [us]. Segments: IOMMU programming, SMPT programming, model query, memory allocation and mapping, memset of allocated memory.]

We profile the execution of these steps and measure the time it takes to perform each step individually. We evaluate two mechanisms to program the IOMMU: i) using capability invocations directly, and ii) using an RPC to the IOMMU service acting as a reference monitor. The buffer size used is 8 MiB. As a frame of reference, we measure the time it takes to just allocate and map memory on both Linux (using mmap()) and vanilla Barrelfish.

Results. The breakdown of the operation into the steps is shown in Figure 9. We show the numbers for both mechanisms to program the IOMMU, and for comparison, we include the time it takes to just allocate and map the memory on vanilla Barrelfish and Linux. The x-axis represents the measured times in µs. We make the following observations:
• Memory Allocation and Mapping. All three OSes use about the same time to allocate and map the required memory region, which accounts for the majority of the profiled time. It is dominated by zeroing the newly allocated memory.
• Model Query. Evaluating the model at runtime accounts for less than 5% of the total runtime.
• SMPT Configuration. Programming the SMPT of the co-processor uses less than 0.3% of the runtime.
• IOMMU Programming. The configuration of the IOMMU using direct capability invocations is fast (0.2% of the runtime). Using the RPC to the IOMMU reference monitor requires capability transfers, which correspond to about 3% of the execution time.

Overall, the resulting overhead for the model query and the address space configuration accounts for 5 . Barrelfish/MAS compared to Barrelfish and Linux.

Discussion. In this evaluation, we have shown that it is possible to efficiently implement a representation of our executable model in an operating system and reconfigure address spaces following the principle of least privilege. Moreover, subsequent allocations may use the cached results of the model query, reducing the overhead even further. Note that the queries merely indicate the operations to be carried out, while the capability system enforces their integrity.

In this evaluation, we qualitatively show the application and integration of the address space model into the OS toolchain to generate low-level, platform-specific OS code and data structures. By doing that we show that our implementation is functional even when run on simulated platforms with unusual address space topologies not supported by other systems. While these simulated platforms are extreme, they include other real systems such as those with secure co-processors.

[Figure 10: Running Barrelfish/MAS on an ARM FastModels [7] Platform Based on a Hardware Description. Toolchain stages: platform description (DSL); Prolog model representation, platform data structures and functions, LISA+ simulator configuration; Barrelfish SKB state, Barrelfish/MAS OS image, FastModels simulator binary.]

[Figure 11: FastModels Simulator Configuration. Two ARM Cortex A57 processors, each with a configurable memory map, and DRAM 0 to DRAM 3.]

Evaluation Methodology. We design and build the toolchain illustrated in Figure 10 and write a series of different platform descriptions using a DSL. These platform descriptions then specify the memory topology of the simulated platforms. The DSL compiler then generates:
1. Executable Model. A runtime representation of the memory topology model, and
2. Simulator Configuration. The LISA+ hardware description that configures the ARM FastModels simulator [7].

The generated runtime representation of the topology model then acts as the initial state for the Barrelfish SKB, and is used to generate low-level OS code and data structures, which are compiled and linked into a platform-specific Barrelfish/MAS OS image.

We mention four example configurations we tested for this evaluation. Figure 11 shows an illustration of the simulated platform, which consists of two ARM Cortex A57 processors, each having a configurable local memory map which defines at which addresses they see the DRAM regions (and the rest of the system in general) in their local address space. We evaluated the following configurations:
1. Uniform. Both cores have an identical memory map.
2. Swapped. DRAM is split in two halves, where each core sees the two halves at swapped address ranges.
3. Private. One shared memory region, and each core further has a private memory region, inaccessible by the other.
4. Private Swapped. Combines the swapped and private setups: shared memory with swapped views, and private memory per core.
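The four configurations can be modeled as per-core memory maps over named regions, as in the Python sketch below. The region names, sizes and base addresses are invented for illustration and do not reflect the actual platform descriptions. The share() helper shows why a canonical name is needed to pass a buffer between cores: the sender's local address is first converted to a canonical (region, offset) name and then to the receiver's local address.

```python
GIB = 1 << 30

# Each map entry: (local_base, size, region, region_offset). All values are
# hypothetical stand-ins for the four configurations described above.
CONFIGS = {
    "uniform": {
        "core0": [(0, 2 * GIB, "dram", 0)],
        "core1": [(0, 2 * GIB, "dram", 0)],
    },
    "swapped": {
        "core0": [(0, GIB, "dram", 0), (GIB, GIB, "dram", GIB)],
        "core1": [(0, GIB, "dram", GIB), (GIB, GIB, "dram", 0)],
    },
    "private": {
        "core0": [(0, GIB, "shared", 0), (GIB, GIB, "private0", 0)],
        "core1": [(0, GIB, "shared", 0), (GIB, GIB, "private1", 0)],
    },
    "private_swapped": {
        "core0": [(0, GIB, "shared", 0), (GIB, GIB, "shared", GIB),
                  (2 * GIB, GIB, "private0", 0)],
        "core1": [(0, GIB, "shared", GIB), (GIB, GIB, "shared", 0),
                  (2 * GIB, GIB, "private1", 0)],
    },
}


def to_canonical(memory_map, local_addr):
    """Local address -> canonical (region, offset), or None if unmapped."""
    for base, size, region, offset in memory_map:
        if base <= local_addr < base + size:
            return region, offset + (local_addr - base)
    return None


def to_local(memory_map, name):
    """Canonical (region, offset) -> local address, or None if not visible."""
    region, addr = name
    for base, size, reg, offset in memory_map:
        if reg == region and offset <= addr < offset + size:
            return base + (addr - offset)
    return None


def share(config, sender, receiver, local_addr):
    """Address at which receiver sees what sender sees at local_addr, or None."""
    cores = CONFIGS[config]
    name = to_canonical(cores[sender], local_addr)
    return None if name is None else to_local(cores[receiver], name)


uniform_view = share("uniform", "core0", "core1", 0x1000)
swapped_view = share("swapped", "core0", "core1", 0x1000)
private_view = share("private", "core0", "core1", GIB + 0x1000)
```

Only in the uniform case can a raw local address be handed to the other core unchanged; in the swapped case it must be rewritten, and a private region has no valid address on the other core at all.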

Results. During our experiments, we managed to compile Barrelfish/MAS and run it successfully on all tested platform configurations. This includes various memory management tasks and shared-memory message passing between the cores. There was no programmer effort required besides writing the platform description.

Discussion. We know of no other current OS designs which can manage memory globally in all these cases. Popcorn Linux [11] and Barrelfish have limited support for case 3, while regular Linux and seL4 only support case 1. In contrast, Barrelfish/MAS supports all four cases. Barrelfish/MAS is able to boot and manage memory on all platforms without modifications, regardless of the topology.

In this evaluation, we have shown that it is possible to efficiently implement the address space model and least-privilege memory management in an OS. We have quantitatively evaluated Barrelfish/MAS's virtual memory system and the reconfiguration operations, and analyzed the space and runtime complexity of maintaining kernel state.

Moreover, we have seen that Barrelfish/MAS is able to handle complex and non-standard memory topologies by strictly using the memory object's canonical name in the capability system, together with generated translation functions which convert this canonical name to a valid local address.

In this paper, we made the case for bringing back the concept of a reference monitor to mediate access to memory resources on modern, heterogeneous platforms. We presented a fine-grained, realistic memory protection model based on which we can extend the reference monitor to include all memory translation and protection hardware present in the system. This allows systems software to adapt its access control model and catch up with the complexity of modern hardware.

We have shown that our design is applicable to any OS, regardless of its architecture. We have developed an executable specification of a reference monitor including the state, operations and authority, on which we have based our prototype implementation in Barrelfish/MAS. Not only can this memory protection model eliminate three different classes of bugs and vulnerabilities, but there is also no inherent performance overhead in implementing it in an operating system. Moreover, based on trusted hardware specifications we can increase the level of automation and generate low-level operating systems code. We believe that our approach can lay the foundation for both fully verified systems and more reliable memory management in existing systems.

We plan to open-source the reference monitor and Barrelfish/MAS implementations.

References

[1] Reto Achermann, Lukas Humbel, David Cock, and Timothy Roscoe. Formalizing Memory Accesses and Interrupts. In Proceedings of the 2nd Workshop on Models for Formal Analysis of Real Systems, MARS 2017, pages 66–116, 2017.
[2] Reto Achermann, Lukas Humbel, David Cock, and Timothy Roscoe. Physical Addressing on Real Hardware in Isabelle/HOL. In Proceedings of the 9th International Conference on Interactive Theorem Proving, ITP’18, pages 1–19, Oxford, United Kingdom, 2018. Springer International Publishing.
[3] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. Do-It-Yourself Virtual Memory Translation. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, pages 457–468, New York, NY, USA, 2017. ACM.
[4] James P. Anderson. Computer Security Technology Planning Study. Technical Report ESD-TR-73-51, Vol. I, AD-758 206, Electronic Systems Division, Deputy for Command and Management Systems HQ Electronic Systems Division (AFSC), L. G. Hanscom Field, Bedford, Massachusetts 01730, USA, 10 1972.
[5] Andrew W. Appel and Kai Li. Virtual Memory Primitives for User Programs. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 96–107, New York, NY, USA, 1991. ACM.
[6] ARM Ltd. ARM Security Technology - Building a Secure System using TrustZone Technology, prd29-genc-009492c edition, 4 2009.
[7] ARM Ltd. Development Tools and Software: Fast Models. 8 2019.
[8] Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach, and Onur Mutlu. Mosaic: A GPU Memory Manager with Application-transparent Support for Multiple Page Sizes. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, pages 136–150, New York, NY, USA, 2017. ACM.
[9] Moshe Bar. The Linux Signals Handling Model. Linux Journal, 5 2000.
[10] Antonio Barbalace, Anthony Iliopoulos, Holm Rauchfuss, and Goetz Brasche. It’s Time to Think About an Operating System for Near Data Processing Architectures. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems, HotOS ’17, pages 56–61, New York, NY, USA, 2017. ACM.
[11] Antonio Barbalace, Marina Sadini, Saif Ansary, Christopher Jelesnianski, Akshay Ravichandran, Cagil Kendir, Alastair Murray, and Binoy Ravindran. Popcorn: Bridging the Programmability Gap in heterogeneous-ISA Platforms. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys ’15, pages 29:1–29:16, New York, NY, USA, 2015. ACM.
[12] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The Multikernel: A New OS Architecture for Scalable Multicore Systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 29–44, New York, NY, USA, 2009. ACM.
[13] Simon Biggs, Damon Lee, and Gernot Heiser. The jury is in: Monolithic OS design is flawed: Microkernel-based designs improve security. In Proceedings of the 9th Asia-Pacific Workshop on Systems, APSys ’18, New York, NY, USA, 2018. Association for Computing Machinery.
[14] Adam Chester. Exploiting CVE-2018-1038 - Total Meltdown. Online. https://blog.xpnsec.com/total-meltdown-cve-2018-1038/, 4 2018.
[15] David Cock, Gerwin Klein, and Thomas Sewell. Secure Microkernels, State Monads and Scalable Refinement. In Proceedings of the 21st International Conference on Theorem Proving in Higher Order Logics, TPHOLs ’08, pages 167–182, Berlin, Heidelberg, 2008. Springer-Verlag.
[16] Nathan Dautenhahn, Theodoros Kasampalis, Will Dietz, John Criswell, and Vikram Adve. Nested kernel: An operating system architecture for intra-kernel privilege separation. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, pages 191–206, New York, NY, USA, 2015. ACM.
[17] Philip Derrin, Kevin Elphinstone, Gerwin Klein, David Cock, and Manuel M. T. Chakravarty. Running the Manual: An Approach to High-assurance Microkernel Development. In Proceedings of the 2006 ACM SIGPLAN Workshop on Haskell, Haskell ’06, pages 60–71, New York, NY, USA, 2006. ACM.
[18] Dhammika Elkaduwe, Gerwin Klein, and Kevin Elphinstone. Verified protection model of the seL4 microkernel. In Proceedings of the 2nd International Conference on Verified Software: Theories, Tools, Experiments, VSTTE ’08, pages 99–114, Berlin, Heidelberg, 2008. Springer-Verlag.
[19] Andrew Ferraiuolo, Andrew Baumann, Chris Hawblitzel, and Bryan Parno. Komodo: Using Verification to Disentangle Secure-enclave Hardware from Software. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, pages 287–305, New York, NY, USA, 2017. ACM.
[20] Simon Gerber, Gerd Zellweger, Reto Achermann, Kornilios Kourtis, Timothy Roscoe, and Dejan Milojicic. Not Your Parents’ Physical Address Space. In Proceedings of the 15th USENIX Conference on Hot Topics in Operating Systems, HOTOS’15, pages 16–16, Berkeley, CA, USA, 2015. USENIX Association.
[21] Xiling Gong. Exploiting Qualcomm WLAN and Modem Over the Air. In Proceedings of the BlackHat USA 2019, 2019.
[22] Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 653–669, Berkeley, CA, USA, 2016. USENIX Association.
[23] Marius Hillenbrand, Mathias Gottschlag, Jens Kehne, and Frank Bellosa. Multiple Physical Mappings: Dynamic DRAM Channel Sharing and Partitioning. In Proceedings of the 8th Asia-Pacific Workshop on Systems, APSys ’17, pages 21:1–21:9, Mumbai, India, 2017.
[24] HSA Foundation. HSA Runtime Programmer’s Reference Manual, version: 1.1.4 edition, 10 2016.
[25] Jian Huang, Moinuddin K. Qureshi, and Karsten Schwan. An Evolutionary Study of Linux Memory Management for Fun and Profit. In Proceedings of the 2016 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC ’16, pages 465–478, Berkeley, CA, USA, 2016. USENIX Association.
[26] Intel Corporation. Intel Virtualization Technology for Directed I/O - Architecture Specification, d51397-011, revision 3.1 edition, 6 2019.
[27] Khronos OpenCL Working Group. The OpenCL Specification, version: 2.1, document revision: 24 edition, 2 2018.
[28] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: Formal Verification of an OS Kernel. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 207–220, New York, NY, USA, 2009. ACM.
[29] Butler W. Lampson. Protection. ACM SIGOPS Operating Systems Review, 8(1):18–24, 1974.
[30] Janghaeng Lee, Mehrzad Samadi, and Scott Mahlke. VAST: The Illusion of a Large Memory Space for GPUs. In

Proceedings of the 23rd International Conferenceon Parallel Architectures and Compilation , PACT ’14,pages 443–454, New York, NY, USA, 2014. ACM.[31] Linux Kernel Documentation.

Heterogeneous MemoryManagement (HMM) , version 5.0 edition, 4 2019.[32] A Theodore Markettos, Colin Rothwell, Brett F Gut-stein, Allison Pearce, Peter G Neumann, Simon WMoore, and Robert NM Watson. Thunderclap: Ex-ploring Vulnerabilities in Operating System IOMMUProtection via DMA from Untrustworthy Peripherals.In

NDSS , 2019.[33] Alex Markuze, Adam Morrison, and Dan Tsafrir. TrueIOMMU Protection from DMA Attacks: When Copy isFaster Than Zero Copy. In

Proceedings of the Twenty-First International Conference on Architectural Supportfor Programming Languages and Operating Systems ,ASPLOS ’16, pages 249–262, New York, NY, USA,2016. ACM.[34] Benot Morgan, Eric Alata, Vincent Nicomette, andMohamed Kaaniche. Bypassing IOMMU Protectionagainst I/O Attacks. In , pages145–150, 10 2016.[35] Benot Morgan, Eric Alata, Vincent Nicomette, and Mo-hamed Kaaniche. IOMMU Protection Against I/O At-tacks: A Vulnerability and a Proof of Concept.

Journalof the Brazilian Computer Society , 24(1):2, 1 2018.[36] NATIONAL VULNERABILITY DATABASE NVD.CVE-2011-1898. Online, 8 2011.[37] NATIONAL VULNERABILITY DATABASE NVD.CVE-2013-4329. Online, 9 2013.[38] NATIONAL VULNERABILITY DATABASE NVD.CVE-2014-0972. Online, 8 2014.[39] NATIONAL VULNERABILITY DATABASE NVD.CVE-2014-3601. Online, 8 2014. [40] NATIONAL VULNERABILITY DATABASE NVD.CVE-2014-9888. Online, 8 2014.[41] NATIONAL VULNERABILITY DATABASE NVD.CVE-2015-6994. Online, 1 2017.[42] NATIONAL VULNERABILITY DATABASE NVD.CVE-2016-5349. Online, 4 2017.[43] NATIONAL VULNERABILITY DATABASE NVD.CVE-2017-12188. Online, 10 2017.[44] NATIONAL VULNERABILITY DATABASE NVD.CVE-2018-1038. Online, 8 2018.[45] NATIONAL VULNERABILITY DATABASE NVD.CVE-2015-4421. Online, 5 2019.[46] NATIONAL VULNERABILITY DATABASE NVD.CVE-2015-4422. Online, 5 2019.[47] NATIONAL VULNERABILITY DATABASE NVD.CVE-2019-10538 - Modem into Linux Kernel issue.Online, 8 2019.[48] NATIONAL VULNERABILITY DATABASE NVD.CVE-2019-10539 - Compromise WLAN Issue. Online,8 2019.[49] NATIONAL VULNERABILITY DATABASE NVD.CVE-2019-10540 - WLAN into Modem issue. Online,8 2019.[50] NVIDIA Corporation.

Uniﬁed Memory in CUDA 6 , 112013.[51] David Patterson, Thomas Anderson, Neal Card-well, Richard Fromm, Kimberly Keeton, ChristoforosKozyrakis, Randi Thomas, and Katherine Yelick. ACase for Intelligent RAM.

IEEE Micro , 17(2):34–44, 31997.[52] Bogdan F. Romanescu, Alvin R. Lebeck, and Daniel J.Sorin. Specifying and Dynamically Verifying AddressTranslation-aware Memory Consistency. In

Proceed-ings of the Fifteenth Edition of ASPLOS on ArchitecturalSupport for Programming Languages and OperatingSystems , ASPLOS XV, pages 323–334, New York, NY,USA, 2010. ACM.[53] Pierre Schnarz, Joachim Wietzke, and Ingo Stengel. To-wards attacks on restricted memory areas through co-processors in embedded multi-os environments via ma-licious ﬁrmware injection. In

Proceedings of the FirstWorkshop on Cryptography and Security in ComputingSystems , CS2 ’14, pages 25–30, New York, NY, USA,2014. ACM.1454] Adrian Schüpbach, Andrew Baumann, Timothy Roscoe,and Simon Peter. A Declarative Language Approach toDevice Conﬁguration. In

Proceedings of the SixteenthInternational Conference on Architectural Support forProgramming Languages and Operating Systems , ASP-LOS XVI, pages 119–132, New York, NY, USA, 2011.ACM.[55] Thomas Sewell, Simon Winwood, Peter Gammie, TobyMurray, June Andronick, and Gerwin Klein. seL4Enforces Integrity. In Markovan Eekelen, HermanGeuvers, Julien Schmaltz, and Freek Wiedijk, editors,

Interactive Theorem Proving , pages 325–340, Berlin,Heidelberg, 2011. Springer Berlin Heidelberg.[56] Erik Vermij, Leandro Fiorin, Rik Jongerius, ChristophHagleitner, Jan Van Lunteren, and Koen Bertels. Anarchitecture for integrated near-data processors.

ACMTrans. Archit. Code Optim. , 14(3):30:1–30:25, Septem-ber 2017.[57] Stavros Volos, Kapil Vaswani, and Rodrigo Bruno.Graviton: Trusted execution environments on gpus. In

Proceedings of the 12th USENIX Conference on Oper-ating Systems Design and Implementation , OSDI’18,page 681–696, USA, 2018. USENIX Association. [58] Andy Whitcroft. Sparsemem Memory Model. https://lwn.net/Articles/134804/ , 8 2019.[59] Simon Winwood, Gerwin Klein, Thomas Sewell, JuneAndronick, David Cock, and Michael Norrish. Mindthe Gap. In

Proceedings of the 22nd InternationalConference on Theorem Proving in Higher Order Logics ,TPHOLs ’09, pages 500–515, Berlin, Heidelberg, 2009.Springer-Verlag.[60] Dongping Zhang, Nuwan Jayasena, Alexander Lya-shevsky, Joseph L. Greathouse, Lifan Xu, and MichaelIgnatowski. Top-pim: Throughput-oriented pro-grammable processing in memory. In

Proceedings ofthe 23rd International Symposium on High-performanceParallel and Distributed Computing , HPDC ’14, pages85–98, New York, NY, USA, 2014. ACM.[61] Zhiting Zhu, Sangman Kim, Yuri Rozhanski, Yige Hu,Emmett Witchel, and Mark Silberstein. Understandingthe security of discrete gpus. In