MI6: Secure Enclaves in a Speculative Out-of-Order Processor
Thomas Bourgeat, Ilia Lebedev, Andrew Wright, Sizhuo Zhang, Arvind, Srinivas Devadas
MIT Computer Science and Artificial Intelligence Laboratory
ABSTRACT
Recent attacks have broken process isolation by exploiting microarchitectural side channels that allow indirect access to shared microarchitectural state. Enclaves strengthen the process abstraction to restore isolation guarantees.

We propose MI6, an aggressively speculative out-of-order processor capable of providing secure enclaves under a threat model that includes an untrusted OS and an attacker capable of mounting any software attack currently considered practical, including those utilizing control flow mis-speculation. MI6 is inspired by Sanctum [16] and extends its isolation guarantee to a more realistic memory hierarchy. It also introduces a purge instruction, which is used only when a secure process is (de)scheduled, and implements it for a complex processor microarchitecture. We model the performance impact of enclaves in MI6 through FPGA emulation on AWS F1 FPGAs by running SPEC CINT2006 benchmarks as enclaves within an untrusted Linux OS. Security comes at the cost of approximately 16.4% average slowdown for protected programs.
1. INTRODUCTION

1.1 Secure Enclaves
The process abstraction is pervasive and underpins modern software systems. Conventional wisdom pertaining to process security teaches that an attacker that cannot name a particular state element cannot attack that element. Even in a situation where unconditional trust in the OS is considered tenable, the architectural process isolation an OS can enforce with commodity hardware falls short: microarchitectural side channels allow indirect access to shared microarchitectural state from which information about a process can be leaked (e.g., [8], [65]). Recent attacks based on control flow speculation (e.g., Spectre [33]) have brought this issue to the fore.

Our focus in this paper is to strengthen the capability of conventional systems, i.e., speculative out-of-order (OOO) multicores running an OS (e.g., Linux), with processes with strong isolation guarantees. We will call such processes enclaves, borrowing the term from Intel SGX [45], though the idea predates SGX. Enclaves are to be used to run secure tasks, and will coexist with ordinary processes.

∗ Student authors listed in alphabetical order.
Strong isolation requires a stronger guarantee than private memory. The goal of the enclave abstraction is to achieve the following property:

Property 1 (Strong Isolation). Any attack by a privileged attacker program co-located on a machine with the victim enclave that can extract a secret inside the victim, can also be mounted successfully by an attacker on a different machine than the victim.
The attacker on a different machine than the victim is only able to communicate with the victim through the victim's public API and can observe the latency of these API calls. No other program should be able to infer anything private from the enclave program through its use of shared resources or shared microarchitectural state, e.g., cache or branch predictor state. This implies that it is not enough to just give a unique set of addresses to an enclave; separation of resources has to be provided at all levels of the cache hierarchy where these addresses may reside. Our goal is to make sure that an enclave cannot leak information to or be influenced by any program running in the system. To achieve this, all shared resources in our microarchitecture are isolated spatially and temporally as required by our threat model, described in Section 2.3.

Enclaves trade expressivity for security; enclave software is restricted in how it interacts with untrusted software, including the OS (cf. Section 2.1). A trusted security monitor running in machine mode mediates enclave entry and exit, and verifies resource allocation decided by untrusted system software to ensure, for example, allocation of non-overlapping memory to different enclaves. The security monitor (cf. Section 6.2) is designed to protect itself from tampering (by appropriately configuring hardware protection mechanisms), even if the OS has been compromised.

We trust our hardware and assume that it is bug free. Our threat model assumes a privileged software adversary and is detailed in Section 2.3; we also describe what is outside our threat model in Section 2.4. We are primarily concerned with attacks that exploit shared hardware resources to interact with the victim enclave via methods outside its public API.
MI6 is based on the open-source out-of-order RiscyOO [67] processor, and provides secure enclaves under our threat model. We model the performance impact of enclaves in MI6 through FPGA emulation on AWS F1 FPGAs by running benchmarks from SPEC CINT2006 on top of an untrusted Linux OS.

Like Sanctum [16], we argue that enclaves defined as a strengthened process are an excellent abstraction for secure computation. We make the following contributions in support of our argument:

1. We show that there are many subtle side channels associated with queues and associated arbitration required to handle multiple outstanding memory requests in the memory hierarchy, and describe how to enforce strong isolation in such systems (cf. Section 5.4). MI6 therefore protects against a broader class of attacks than Sanctum, which excludes, for example, attacks that use the cache coherence bandwidth and DRAM controller bandwidth channels from its threat model.

2. We show that the complexity of the out-of-order processor core can be completely decoupled from the complexity of a modern memory hierarchy by using a new purge instruction, which can be easily incorporated in any Instruction Set Architecture (ISA). We describe optimizations based on indistinguishability to software in our purge implementation (cf. Section 6.1).

3. We describe key modifications to the security monitor to maintain strong isolation in a speculative processor (cf. Section 6.2), and an optimization relating to access permissions checking (cf. Section 5.3).

4. We provide a detailed evaluation of the performance overhead of enclaves in an implementation of the MI6 OOO processor. We show, for example, the cost of flushing shared microarchitectural state on each system call, which is required for strong isolation. Overall, security comes at the cost of approximately 16.4% average slowdown for protected programs. We note that this number assumes a baseline that is not protected against physical attacks on memory, unlike Aegis [53] and Intel SGX [45], and with a constant latency DRAM controller.
Organization: In Section 2, we describe the enclave abstraction and threat model. Related work is the subject of Section 3. In Section 4 we describe the baseline implementation of the RiscyOO speculative out-of-order processor based on the RISC-V ISA [58] and list the hardware modifications we made. Section 5 describes how we provide steady state isolation of enclaves in MI6. Section 6 describes how MI6 handles transitions between enclaves. The performance of MI6 is evaluated in Section 7. We conclude in Section 8.
2. ENCLAVES AND ISOLATED EXECUTION
In this section, we describe the enclave abstraction as presented in [52] (cf. Section 2.1). We describe the high-level approach to (micro)architectural isolation MI6 employs to achieve Property 1 (cf. Section 2.2). We then describe the capabilities of the adversary (cf. Section 2.3). We also discuss what falls outside our threat model (cf. Section 2.4).
A processor that can serve as an enclave platform implements isolated software environments in disjoint address spaces only accessible from within a given enclave's threads of execution. An enclaved process resides entirely and exclusively in such isolated memory, which contains all code and data structures comprising the enclave, and is isolated from all other software in the system.

An enclave platform must guarantee integrity and private execution of an enclave in the presence of other software, as per its threat model, and thus strongly isolates the enclave from all other software. As a consequence of its isolation, an enclave cannot transparently receive system services or issue system calls, as software outside the enclave is not trusted with access to enclave private memory. The enclave platform mediates control transfer to and from an enclave via statically-defined locations called entry points, and guarantees no side effects of execution remain across these context switches.

Enclaves requiring system services must proxy through untrusted software, and must respect that the OS services are untrusted. With these guarantees, enclaves can become the only trusted components of an application (aside from the platform itself), and a carefully programmed enclave can execute safely even when privileged software is compromised.

The platform also implements the measurement and attestation protocol of [36] to prove enclave integrity to a remote party.
The goal of the enclave abstraction is to achieve Property 1. To protect enclave integrity, the processor implements architectural isolation (that of memory) by setting up invariants in a hardware mechanism to prevent all accesses to enclave-owned physical memory, allowing only the enclave's code access. Any sharing of microarchitectural resources by mutually distrusting software may transmit private information via the availability of these finite resources, measurable via the timing of certain operations or other side channels. Microarchitectural isolation is necessary not only to protect the confidentiality of enclave execution (for example, closing side channels through cache state), but also to protect enclave integrity in the context of a core that executes speculatively.

Instead of directly implementing isolated enclaves, the MI6 hardware implements flushing, constraints on core instruction fetch, and a set of low-level isolation primitives (cf. Section 5) sufficient to partition each relevant sub-system of MI6 into "protection domains" that are non-overlapping allocations of machine resources. When programs are running on multiple cores in different protection domains, in this steady state, MI6 guarantees non-interference and isolation of these domains.

The platform must also allow for transitions between different protection domain configurations in order to implement enclaves. Following the example of [16], MI6 relies on a small trusted security monitor (cf. Section 6.2), which executes in a dedicated protection domain, to compose the machine's protection domains and constraints on execution into the high-level properties of isolated enclaves. While the implementation details of an MI6 security monitor largely borrow from [37], and are out of scope in this manuscript, Section 6.2 details the aspects of the monitor relevant to the isolation of enclaves.

The trusted computing base (TCB) of the MI6 processor includes the processor chip, memory (i.e., DRAM), as well as the security monitor binary. The security monitor is the only software running in machine mode, which is the highest privilege mode in RISC-V and is more privileged than the supervisor mode used by the untrusted OS [57]. MI6 isolates the software inside an enclave from other software on the same speculative out-of-order processor to satisfy Property 1.
We assume an insidious remote adversary able to exploit software vulnerabilities expected to be present in large system software. Specifically, we assume that an attacker can compromise any operating system and hypervisor present on the computer executing the enclave, and can launch malicious enclaves. The attacker has complete knowledge of the enclave platform's architecture and microarchitecture, and the software it loads. The attacker can analyze passively observed data, such as page fault addresses, as well as mount active attacks, such as memory probing and cache tag state attacks (e.g., [65]). The attacker can exploit speculative state, branch predictor state, and other shared microarchitectural state in any software attack, e.g., Spectre [33].
MI6's isolation mechanisms exclusively address software attacks, and assume the absence of any adversary with physical access. We do not protect against attacks such as [11], where the victim application leaks information via its public API, and the leak occurs even if the victim runs on a dedicated machine. We also consider any software attacks that rely on sensor data to be physical attacks. MI6 does not protect against physical attacks on memory, but can be augmented with memory encryption and integrity verification exemplified by processors such as Aegis [53], and Oblivious Random Access Memory (ORAM) [23, 51] as in Ascend [22], to variously defend against these attacks. ORAM overhead can be substantially reduced if smart memory is assumed [2, 6]. MI6 does not protect against denial-of-service (DoS) attacks, assumes correct underlying hardware, and does not protect against software attacks that exploit hardware bugs (fault-injection attacks such as rowhammer [31, 49]). Finally, we exclude attacks that utilize shared performance counters.
3. RELATED WORK

3.1 Microarchitectural side channel attacks
Attacks that exploit microarchitectural side channels to leak information come in many varieties. Attacks using cache tag state-based channels are known to retrieve cryptographic keys from a growing body of cryptographic implementations: AES [9, 46], RSA [11], Diffie-Hellman [34], and elliptic-curve cryptography [10], to name a few. Such attacks can be mounted by unprivileged software sharing a computer with the victim software [7].

Sophisticated channel modulation schemes such as flush+reload [64] and variants of prime+probe [44] target the last-level cache (LLC), which is shared by all cores in a socket. The evict+reload variant of flush+reload uses cache contention rather than flushing [44]. A less common yet viable strategy corresponds to observing changes in coherence [63] or replacement metadata [32]. Directories are readily reverse-engineered to construct eviction sets in [62].

Recently, multiple security researchers (e.g., [25, 33, 40]) have found ways for an attacker to exploit speculative execution to create a transmitter within victim code in order to leak secrets. Spectre and Meltdown have exploited the fact that code executing speculatively has unrestricted access to any registers and memory, and a host of variants of these attacks have been proposed. These attacks have motivated MI6.

Set partitioning, i.e., not allowing occupancy of any cache set by data from different protection domains, can disable cache state-based channels. It has the advantage of requiring no new hardware, provided groups of sets are allocated at page granularity [39, 68] via page coloring [29, 54]. Sanctum uses set partitioning to block cache timing attacks, as does MI6. Intel's Cache Allocation Technology (CAT) [24, 26] and its variants (e.g., CATalyst [41]) provide a mechanism to configure each logical process with a class of service, and allocate Last Level Cache (LLC) ways to logical processes, falling somewhat short of isolation. In CAT, access patterns may leak through metadata updates on hitting loads, as the replacement metadata is shared across protection domains. DAWG [32] endows a set associative structure with a notion of protection domains to provide strong isolation. Unlike CAT, DAWG explicitly isolates cache replacement state. Page coloring can be replaced with DAWG or another cache isolation mechanism in MI6.

An alternative strategy to protect against cache timing attacks, separate from partitioning, is to randomize interference in the cache (e.g., RPcache [35, 56], Random Fill Cache [42], Newcache [43], CEASER [47], CEASER-S [48]) or alter replacement policies (e.g., SHARP [61], RIC [27], [17]). In the MI6 design, we do not allow adaptivity of the cache area allocated to an enclave, to protect against cache occupancy attacks (e.g., [50]), and therefore we cannot use these techniques.
MI6 uses techniques similar to [20] to achieve timing independence (non-interference) in its memory system. To support demand paging in a secure fashion in MI6 enclaves, page-level ORAM as briefly described in Sanctum, or the more efficient approach of InvisiPage [3], can be used.
A detailed review of secure processors is provided in [15]. Early architectures such as XOM [38], Aegis [53], ARM's TrustZone [4], and Bastion [14] did not consider side channel attacks in their threat model.

Intel's SGX [5, 45] adapted the ideas in XOM, Aegis and Bastion to multi-core processors with a shared, coherent last-level cache. SGX does not protect against cache timing attacks, nor control flow speculation attacks [12]. Iso-X extends SGX enclave allocation and does not protect against cache timing attacks [18].

Sanctum [16] introduced a straightforward software-hardware co-design to provide enclaves resistant to software attacks, including those that exploit shared microarchitectural state in a simple in-order processor. MI6 is similar to Sanctum, but protects against a broader class of attacks including those that use the cache/cache directory/DRAM controller bandwidth channels. The cache hierarchy of MI6 corresponds to a modern processor as compared to the cache hierarchy modeled by Sanctum, which did not include queues or arbitration logic. Further, speculation in MI6 provides a rich attack surface, and MI6 introduces a purge instruction to flush the out-of-order pipeline. Keystone [30] borrows from Sanctum to implement enclaves targeting standard RISC-V in-order processors using RISC-V's physical memory protection primitive [57].

Komodo [19] builds a privileged software monitor on top of ARM TrustZone that implements enclaves. Komodo, as described in [19], does not support multi-processor execution, and runs on an in-order processor. InvisiSpec [60] makes speculation invisible in the data cache hierarchy. This comes at a significant performance cost, and does not preclude speculation-based attacks on other shared microarchitectural state.
4. BASELINE OOO PROCESSOR
MI6 is based on the open-source RiscyOO speculative OOO processor [67]. RiscyOO implements most features of modern microprocessors, including register renaming, branch prediction, non-blocking caches and TLBs, and superscalar and speculative execution. In particular, RiscyOO can issue a load speculatively even when older instructions have unresolved branches or memory addresses. RiscyOO also has a shared L2 cache which is coherent and inclusive with the private L1 caches of each core. We also refer to the shared L2 as the last level cache (LLC). Therefore, RiscyOO contains cache-timing side channels and is vulnerable to speculation-based attacks like Spectre. The detailed configuration of RiscyOO used in this paper can be found in Figure 4 in Section 7.

RiscyOO uses the open-source RISC-V instruction set [1], and has been prototyped on AWS F1 FPGAs. Its FPGA implementation boots Linux and completes SPEC CINT2006 benchmarks with the largest input size (ref input) in slightly more than a day. RiscyOO is a great platform on which to build secure enclaves: it is vastly more complex than the implementation of Sanctum in [16], and presents new challenges for security.
Enclave support in a modern processor system requires three interventions:

1. Physical address protection and isolation through the memory hierarchy.

2. A rigorous implementation of a "purge" operation to scrub each type of physical resource that can be separately allocated to an enclave.

3. A speculation guard for the security monitor: this software has access to all physical addresses, and must not speculatively load addresses, nor should it speculatively fetch outside of the security monitor's own binary.

Specifically, we summarize the hardware modifications used by MI6 to support enclaves below. The details of each modification will be discussed in the rest of the paper.

• Flushing microarchitectural state: Sections 6 and 7.1.
• Page-walk check: Section 5.3.
• Turning off speculation and checking instruction fetches in machine mode: Sections 6.2 and 7.5.
• LLC set-partitioning: Sections 5.2 and 7.2.
• MSHR partitioning in LLC: Sections 5.2 and 7.3.
• Sizing LLC MSHRs: Sections 5.2 and 7.3.
• Other LLC changes to block side channels: Sections 5.4.3 and 7.4.
5. STEADY STATE ISOLATION
When examining the interactions between two programs, there are two cases to consider: the first is when the two programs are running on different cores (this section), and the second is when the two programs run on the same core but at different times (cf. Section 6).

From an ISA-level point of view, two programs are independent if a program's output does not depend on another program. We call this architectural isolation. This captures the interaction of instructions in the ISA such as loads and stores, but it does not include side channels such as timing.

Beyond the basic operations provided by the ISA, we assume that the precise time of any microarchitectural event within a core (instruction fetch, issue, execute, commit, etc.) can be measured by the program running on the core. This conservatively abstracts a cleverly engineered program's ability to measure latencies.

First, we classify timing leaks by the underlying cause of variation in timing. If the timing variation is due to one program waiting multiple cycles for another core to release a reserved resource, we call this a major timing leak. On the other hand, if the timing variation is due to two programs competing for a resource within a single clock cycle, requiring per-cycle arbitration, we call this a minor timing leak. Two programs have weak timing independence if they are architecturally isolated and only exhibit minor timing leakage (cf. Sections 5.2 and 5.3). Two programs have strong timing independence if they are architecturally isolated and have neither flavor of timing leak (cf. Section 5.4).

In Sections 5.2 and 5.3, we ensure that programs using disjoint protection domains have weak timing independence. Section 5.4 achieves strong timing independence. If two programs use resources from disjoint microarchitectural protection domains, then they are timing independent.
Definitions and Assumptions: We consider a program to be the collection of all the instructions running on a core in supervisor and lower privilege modes and the corresponding initial data, and we assume no machine mode code is run as part of the enclave program or on the program's behalf during the program's lifetime. We ignore machine mode because software in machine mode can tamper with arbitrary configuration registers and alter active protection domains. Including supervisor mode in our analysis simplifies the problem because we do not have to worry about what the operating system is doing to provide services such as virtual memory to user mode programs. All configuration and usage of virtual memory falls entirely within a program to keep the security monitor as lean as possible, and so minor page table operations do not cause an enclave exit to the security monitor.

Each program has a set of physical addresses it accesses. There are many ways a program can access a physical address. When virtual memory is off, a physical address is accessed for each instruction fetch and for each load and store. When virtual memory is on, physical addresses are also accessed for page table walks. Also, since RiscyOO is an aggressive out-of-order execution processor, speculative instruction fetches, speculative loads, and speculative page table walks also cause physical memory accesses even if the speculation was incorrect. When talking about the set of all physical addresses accessed by a program, we mean the physical addresses of all the above physical memory accesses, even the speculative ones.

We also assume that the entire address space is normal memory, not memory-mapped I/O, since we do not trust devices and drivers.
If two programs do not share any addresses, then without using the timing of microarchitectural events, the execution of one program cannot affect the other. Therefore, disjoint address spaces imply that programs are architecturally isolated.

Unfortunately, having disjoint address spaces between programs, while enough for architectural isolation, is not enough even for weak timing independence.
Cache Partitioning: As an example, consider programs p1 and p2 which access disjoint address spaces. If the two programs access physical addresses a1 and a2 in the same L2 cache set, then accesses from p1 to a1 can cause a2 to get evicted, causing p2 to see a miss instead of a hit the next time p2 accesses a2.

This issue stems from the two programs dynamically sharing resources (in this case, entries in the same cache set). The on-demand transition of resources from one program to the other is observable by the programs and can therefore be used to infer the demand of the other program. In order to get around this problem and achieve weak (or strong) timing independence, caches need to be statically partitioned between programs.

MI6 partitions the cache through set partitioning. Similar to Sanctum [16], MI6 divides the physical memory (DRAM) equally into multiple contiguous DRAM regions. MI6 modifies the LLC cache indexing function so that each pair of DRAM regions maps to disjoint cache sets. That is, the higher bits of the original LLC index are replaced by the DRAM-region ID, which is the highest bits of the physical address (e.g., the highest 6 bits for 64 DRAM regions). Two programs using set partitioning must only use physical addresses that map to disjoint DRAM regions in order to avoid timing leakage through dynamic sharing of the cache.
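The modified indexing scheme can be sketched as follows. The parameter values (64-byte lines, 4096 sets, a 32-bit physical address) are illustrative assumptions rather than the RiscyOO configuration, but the bit manipulation mirrors the scheme described above:

```python
import math

# Sketch of a set-partitioned LLC index function. All parameter values are
# illustrative assumptions, not the exact RiscyOO configuration.
LINE_BYTES  = 64    # cache line size
NUM_SETS    = 4096  # LLC sets -> 12 index bits
NUM_REGIONS = 64    # DRAM regions -> 6 region-ID bits
PADDR_BITS  = 32    # assumed physical address width

OFFSET_BITS = int(math.log2(LINE_BYTES))   # 6 line-offset bits
INDEX_BITS  = int(math.log2(NUM_SETS))     # 12 set-index bits
REGION_BITS = int(math.log2(NUM_REGIONS))  # 6 region-ID bits

def llc_set_index(paddr: int) -> int:
    """The high REGION_BITS of the conventional set index are replaced by
    the DRAM-region ID (the top bits of the physical address), so distinct
    regions can never collide in a cache set."""
    low_index = (paddr >> OFFSET_BITS) & ((1 << (INDEX_BITS - REGION_BITS)) - 1)
    region_id = paddr >> (PADDR_BITS - REGION_BITS)
    return (region_id << (INDEX_BITS - REGION_BITS)) | low_index
```

For instance, 0x00001040 (region 0) and 0x04001040 (region 1) share the same conventional low index bits but land in disjoint groups of sets, so a program confined to its own DRAM regions cannot evict another domain's lines.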
MSHR Partitioning:
The L2 cache in RiscyOO can only handle a fixed number of requests at a time. These requests are tracked using miss status handling registers (MSHRs). If there are no free MSHRs, then the L2 cannot take any more requests, and causes multiple cycles of backpressure for the child L1 caches trying to send requests.

Consider programs p1 and p2 where p1 is causing many cache misses and its requests fill the L2's MSHRs. If p2 then causes a cache miss, there will be no MSHR available for p2 and the request will be stalled. This is a timing variation in p2 that is caused by p1 and therefore a major timing leak.

To avoid this timing leak, MI6 partitions the MSHRs in the LLC. Since only processes that are actively running on the processor cores can occupy MSHRs, MI6 divides the MSHRs in the LLC equally by the number of processor cores, and statically associates each MSHR partition with a processor core. p1 using all of its allocated MSHRs will not affect whether or not a request from p2 is stalled due to backpressure.

Sizing the MSHRs to Avoid DRAM Backpressure:
Say that dmax is the maximum number of outstanding requests the DRAM controller can handle. After dmax in-flight requests, the DRAM controller asserts backpressure and prevents further requests from being enqueued into it. If the DRAM controller is asserting backpressure, then DRAM requests will get delayed, causing a major timing leak.

In RiscyOO, every request sent to the DRAM controller comes from an L2 request in an MSHR. Moreover, each request put into an MSHR can send up to two DRAM requests during its lifetime: one for a possible writeback and one for a read request. Therefore, for a machine with a DRAM controller accepting dmax outstanding requests, the number of MSHRs in the cache needs to be at most dmax/2. That amounts to about dmax/(2N) MSHRs per core, where N is the number of cores.

DRAM Controller Latency:
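The two constraints (sizing against DRAM backpressure, then static per-core partitioning) reduce to simple arithmetic; a sketch with purely illustrative numbers, where the function name and parameters are ours rather than the paper's:

```python
def mshrs_per_core(d_max: int, num_cores: int) -> int:
    """Each LLC MSHR entry can have up to two DRAM requests in flight
    (a writeback plus a read), so the LLC may hold at most d_max // 2
    MSHRs without ever triggering DRAM-controller backpressure; these
    are then divided equally among the cores."""
    total_mshrs = d_max // 2          # bound from the DRAM controller
    return total_mshrs // num_cores   # static per-core partition

# Illustrative: a controller accepting 24 outstanding requests on a
# 4-core machine allows 12 MSHRs total, i.e., 3 statically per core.
```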
Another complication in ensuring timing independence is the DRAM and the DRAM controller. DRAM controllers often reorder requests so that requests to the same bank are done back-to-back to increase the achieved bandwidth of the DRAM.

Consider programs p1 and p2 where p1 is accessing addresses in DRAM bank 0, and p2 is accessing addresses in DRAM banks 1, 2, and 3. If p1 and p2 send interleaved requests to DRAM, and p1's requests were all to the same bank, a reordering DRAM controller would perform p1's requests back-to-back, changing the timing of p2's requests. This changes the timing of p2 based on p1, breaking weak timing independence between the two programs.

For weak (or strong) timing independence, MI6 must either use a DRAM controller with a constant latency, or use a more sophisticated DRAM controller that is aware of protection domains and associated DRAM regions and ensures timing independence across protection domains. That is, optimizations such as row buffering are allowed within a protection domain but not across protection domains. The DRAM controller model RiscyOO used for evaluation has a constant latency, and we leave the exploration of variable latency DRAM controllers that are timing independent to future work.
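A toy model of such a reordering scheduler (purely illustrative; not the RiscyOO memory controller) makes the leak concrete: the position at which p2's requests are served depends on whether p1's requests all target one bank.

```python
def schedule(requests):
    """Toy 'same bank back-to-back' scheduler. requests is a list of
    (program, bank) pairs in arrival order. Each cycle, prefer a queued
    request to the bank served last; otherwise serve first-come
    first-served. Returns the service order."""
    queue, order, last_bank = list(requests), [], None
    while queue:
        pick = next((r for r in queue if r[1] == last_bank), queue[0])
        queue.remove(pick)       # serve this request
        order.append(pick)
        last_bank = pick[1]      # its bank's row is now "open"
    return order

# p1 hammers bank 0: its requests are batched, pushing p2's requests back.
batched = schedule([("p1", 0), ("p2", 1), ("p1", 0),
                    ("p2", 2), ("p1", 0), ("p2", 3)])
# p1 touches distinct banks: the interleaved arrival order is preserved.
spread = schedule([("p1", 4), ("p2", 1), ("p1", 5),
                   ("p2", 2), ("p1", 6), ("p2", 3)])
```

In the first run p2's first request is served only after all three of p1's bank-0 requests; in the second it is served immediately after p1's first request, so p2 can observe p1's access pattern through its own latency.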
Unlike Sanctum, in MI6 an enclave does not share virtual address space with untrusted software, as explored in Section 6.2. Coupled with the mechanism of routing page faults to an enclave, and per-enclave page tables, this blocks the page fault and page access side channel and prevents attacks such as [13, 59], where the untrusted OS views page faults or accesses.

In order to achieve timing independence, programs need to be able to ensure they only use certain cache sets and no other programs use those sets. This restriction on address usage includes all accesses to memory. It is easy to write a program that only performs loads and stores on certain addresses, but much harder to ensure that the program will not emit speculative accesses or page table walks to memory that falls outside that range. It is also much harder to ensure another program will stay outside of your address range.

To make set partitioning easier, MI6 has hardware support to ensure all physical accesses fall within the specified DRAM regions (and therefore cache sets) allocated to the running program. Each core in MI6 has a machine-mode-modifiable bitvector containing a bit for each DRAM region determining whether that region can be accessed or not. If the program makes an access (speculative or non-speculative) to an address outside the allocated cache sets, the core will not emit the access to that location, and will raise an exception if that access ends up becoming non-speculative.

We need to check the DRAM region for each physical cache access. MI6 performs an optimization to simplify the design by caching DRAM region permissions in the TLB. Each DRAM region is large enough and has proper alignment so that no 4 KB page falls in two DRAM regions. Therefore, if a page table walk determines an access to a specified page is legal, the translation is added to the TLB and the accesses using the translation are all legal until the DRAM region allocation changes. To support programs using physical addresses, the security monitor configures cores to trap on virtual memory management instructions so the security monitor can swap in an identity page table when programs try to turn off virtual memory. Section 6.2 describes how the TLB is maintained to ensure state transitions do not violate isolation.
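The per-core region check can be sketched as follows; the address and region widths are illustrative assumptions consistent with the 64-region example above:

```python
def region_allowed(paddr: int, region_bitvector: int,
                   paddr_bits: int = 32, region_bits: int = 6) -> bool:
    """Sketch of the per-core DRAM-region check: bit i of the machine-mode
    -configured bitvector grants access to DRAM region i. Widths are
    illustrative assumptions, not the RiscyOO parameters."""
    region_id = paddr >> (paddr_bits - region_bits)  # top bits of the address
    return bool((region_bitvector >> region_id) & 1)

# Example: the security monitor grants a program regions 2 and 3 only.
allowed_regions = 0b1100
```

In hardware this predicate gates every access, speculative or not: a failing check suppresses the memory access itself, and raises an exception only if the offending instruction becomes non-speculative.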
This section presents the remaining modifications required to achieve strong timing independence, and therefore Property 1. The mechanisms introduced in Section 5.2 cannot achieve strong timing independence primarily because of the shared LLC. We first explain the structure of the LLC in RiscyOO in Section 5.4.1, then show the possible minor leakages that break strong timing independence in Section 5.4.2, and present our solution in Section 5.4.3. We also analyze the performance cost qualitatively in Section 5.4.4.
[Figure 1: Integration of the LLC in RiscyOO. Two cores (Core 0, Core 1) each connect to the shared last level cache (LLC), which connects to DRAM.]
Figure 1 shows how the LLC is integrated in a two-core RiscyOO machine. The LLC uses an MSI directory-based cache coherency protocol [55], and it uses a dedicated link to communicate cache-coherence messages with the L1s in each processor core. Each link contains three independent FIFOs to transfer (1) upgrade requests from the L1, (2) downgrade responses from the L1, and (3) upgrade responses and downgrade requests from the LLC, respectively [55]. The LLC is connected to the DRAM controller using a pair of FIFOs, and the DRAM controller sends responses only for reads.

Figure 2 shows the internal details of the LLC. The LLC contains MSHRs and a cache-access pipeline. Every incoming message sent to the LLC, including L1 upgrade requests, L1 downgrade responses and DRAM responses, needs to go through the cache-access pipeline to access the tag and data SRAMs of the LLC. An upgrade request from an L1 also needs to reserve an MSHR entry before entering the pipeline. A DRAM response is buffered in the MSHR entry that initiated the corresponding DRAM read request before it enters the pipeline, and thus there is no backpressure on DRAM responses.

After a message finishes accessing the SRAMs in the pipeline, it is processed at the end of the pipeline. After the processing, an L1 upgrade request could be ready to respond; in this case we enter the MSHR index of the request into a FIFO, i.e., UQ in Figure 2, and the response data is buffered in the MSHR entry. The depth of UQ is equal to the number of MSHRs, so it will never backpressure the pipeline. In other cases where a cache replacement or a cache miss occurs, the L1 upgrade request that causes the replacement or cache miss needs to request DRAM. In this case, we enter its MSHR index into a FIFO, i.e., DQ in Figure 2, and buffer the data in the MSHR entry if writeback is needed. The depth of DQ is also equal to the number of MSHRs, so it will also never backpressure the pipeline. That is, the cache-access pipeline can never be backpressured.
The final piece in the LLC is the Downgrade-L1 logic in Figure 2. Every cycle, this logic looks for an MSHR entry that needs to downgrade any L1s, and sends the downgrade request.
Figure 2: Internal microarchitecture of the LLC in RiscyOO
Section 5.2 partitioned the storage elements in the LLC, i.e., the cache arrays and MSHRs, to prevent major timing leakage. However, there are still other shared resources in the LLC that are contended for by messages belonging to different cores and potentially different protection domains. As long as such contention exists, minor timing leakage is possible. Here, we enumerate all such contended resources.
Entry port of the cache-access pipeline:
All the incoming messages contend for entry into the cache-access pipeline through a two-level mux, as shown in Figure 2. If two messages from two different cores arrive at the LLC at the same time, then one message will block the other for a cycle. This can lead to minor timing leakage. Contention between different types of messages from different cores can also form a minor leakage, e.g., a DRAM response for a miss by core 0 and an L1 upgrade request from core 1.
Downgrade-L1 logic:
All the MSHR entries that need to send downgrade requests contend for the Downgrade-L1 logic. If the arbitration is not fair, then a large number of MSHR entries of core 0 that need to send downgrade requests can block MSHR entries of core 1 from sending downgrade requests.
UQ and Downgrade requests:
Head-of-line blocking in UQ caused by a response to core 0 can stall a later response to core 1 in UQ. The downgrade requests sent by the Downgrade-L1 logic also contend with the responses in UQ for the outgoing port to the processor cores.
DQ:
If an MSHR entry is entered into DQ because of a cache miss without replacement, then it only needs to send one DRAM read request when it is dequeued from DQ. This will not block the dequeue port of DQ or lead to leakage. However, if an MSHR entry enters DQ because of the completion of a cache replacement, then it needs to send not only a DRAM writeback request but also a DRAM read request, because the MSHR entry must have missed in the LLC. In this case, the dequeue port of DQ will be blocked for one cycle in order to send both requests to DRAM. This block may delay later requests in DQ, creating minor timing leakage.

We have enumerated all instances of contention in the LLC that can create minor timing leakage. It should be noted that the DRAM-response port is not a source of leakage, even though responses for cache misses from different cores all go through this port. This is because the DRAM-response port is never backpressured, as explained in Section 5.4.1.
Figure 3 shows the microarchitecture of the LLC in MI6 that prevents all the above minor timing leakages and achieves strong timing independence. We explain the changes made to handle each of the contended resources listed in Section 5.4.2.
Figure 3: Microarchitecture of the LLC in MI6 to achieve microarchitectural isolation

Entry port of the cache-access pipeline:
Instead of first merging incoming messages of the same type and then merging different message types, the LLC in Figure 3 first merges all incoming messages for the same core, including L1 upgrade requests, L1 downgrade responses, and DRAM responses for the misses caused by the core. Contention between messages for the same core (and thus the same protection domain) does not cause any leakage. After merging messages for the same core, we use a round-robin arbiter to arbitrate messages from different cores before they enter the cache-access pipeline. Consider the case where we have N cores with IDs 0 . . . N − 1. In this case, in cycle T, only one message, from core T % N (T modulo N), can enter the pipeline. It should be noted that even if there is no incoming message from this core, messages from other cores cannot proceed. Since the cache-access pipeline has no backpressure (Section 5.4.1), the round-robin arbiter ensures that whether messages from a given core can enter the pipeline is independent of the activity of other cores or protection domains. That is, strong timing independence is achieved at this entry port.
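The strict round-robin discipline can be captured in a small timing model (an illustrative Python sketch, not the RiscyOO RTL; names are ours). The point of the model is that the cycles in which core i's messages enter the pipeline depend only on the cycle count and core i's own queue, never on the other cores:

```python
from collections import deque

def simulate_arbiter(queues, cycles):
    """queues: one deque of pending messages per core.
    Returns the (cycle, core, message) trace of pipeline entries."""
    n = len(queues)
    trace = []
    for t in range(cycles):
        owner = t % n  # in cycle T, only core T % N may enter the pipeline
        if queues[owner]:
            trace.append((t, owner, queues[owner].popleft()))
        # If the owner has nothing to send, the slot is deliberately wasted:
        # letting another core slip in would make that core's timing depend
        # on the owner's activity, i.e., a cross-domain timing channel.
    return trace
```

For example, core 1's entry cycles are identical whether core 0 is busy or idle, which is exactly the timing independence claimed above.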
Downgrade-L1 logic: There are two approaches to solving the contention in the Downgrade-L1 logic. In the first approach, given that MI6 has already partitioned the MSHRs across processor cores, instead of checking all the MSHRs, the Downgrade-L1 logic examines only one MSHR partition, for a single core, in each cycle. The logic iterates through all partitions in a round-robin fashion, providing timing independence. The logic still spends a cycle on a partition even if no MSHR in the partition has a downgrade request to send. This ensures that whether the MSHRs for a given core can send downgrade requests is independent of other cores or protection domains. In the second approach, which we follow, we duplicate the Downgrade-L1 logic for each MSHR partition. Each copy of the Downgrade-L1 logic is therefore responsible for only one MSHR partition, thereby removing any contention (cf. Figure 3).
UQ and Downgrade requests:
To resolve the timing dependence due to UQ, we split the original UQ into multiple FIFOs (see Figure 3). That is, UQ_i keeps only the MSHR indexes for core i, and the depth of UQ_i is equal to the size of the MSHR partition for core i. Thus, head-of-line blocking in UQ_i is simply a stall among responses for core i; there is no timing dependence across different cores or protection domains. It should be noted that the total number of entries of all the UQs is still equal to the total number of MSHRs, that is, there is no area overhead.

After the split of UQ, a downgrade request sent to core i can contend only with responses to core i in UQ_i. Consider the case where the downgrade request is initiated by an upgrade request from core j. In this case, the address requested by core j is in the same cache set as the downgrading address owned by core i. Thus, cores i and j must be in the same protection domain (see Section 5.2), e.g., a multithreaded enclave assigned to cores i and j. Therefore, the contention between UQ and downgrade requests cannot influence timing across protection domains.

DQ:
The key to solving the problem is to have the dequeue of an MSHR index from DQ always take one cycle, in particular for MSHR entries that are completing a cache replacement. In this way, the dequeue port of DQ is never blocked, and there is no timing influence through DQ. Consider an MSHR entry which enters DQ because of completing a cache replacement. In Figure 3, when the MSHR index is entered into DQ, we set a retry bit in the MSHR entry. When the MSHR index is dequeued from DQ, it sends only the writeback request to DRAM, so the dequeue takes only one cycle. An MSHR entry with the retry bit set will try to re-enter the cache-access pipeline, and enters DQ again as a pure cache miss to issue the DRAM read request. The re-entry contends only with messages for the same core, and will not create new leakage. The cache slot is also locked to the MSHR entry so that other upgrade requests cannot occupy the slot.
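The retry-bit mechanism can be sketched as follows (illustrative Python; the MSHR fields and request names are ours, not the RTL interface). Every dequeue issues exactly one DRAM request, so DQ's dequeue port is never blocked:

```python
from collections import deque

def drain_dq(dq, mshrs):
    """dq: FIFO of MSHR indexes; mshrs: index -> {'replacing': bool}.
    Returns the DRAM requests issued (one per cycle) and the MSHR
    indexes that must re-enter the cache-access pipeline."""
    requests, retries = [], []
    while dq:
        idx = dq.popleft()
        if mshrs[idx]['replacing']:
            requests.append(('writeback', idx))  # the only request this cycle
            mshrs[idx]['replacing'] = False      # retry: the entry re-enters
            retries.append(idx)                  # the pipeline as a pure miss
                                                 # and later issues the read
        else:
            requests.append(('read', idx))
    return requests, retries
```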
First, we note that the split of UQ into multiple FIFOs has zero performance overhead. The retry of a request that finishes replacement increases the total processing latency of this request by a few cycles; since the request needs to read DRAM anyway, this increase is negligible. The Downgrade-L1 logic was duplicated and therefore has zero performance overhead. If we had instead chosen to operate the logic in the round-robin way, the latency to downgrade L1s could increase proportionately to the total number of cores.

The performance overhead comes mainly from the round-robin arbiter in front of the cache-access pipeline. The arbiter gives each core 1/N of the SRAM bandwidth, where N is the total number of cores. It should be noted that even without the arbiter, messages from N cores are still contending for the bandwidth of the SRAM. Therefore, there is no bandwidth loss in the average case. Performance will decrease if the traffic from each core is bursty. This is not a big problem for a small multiprocessor with 2 or 4 cores, because the LLC can still take a request from a given core every 2 or 4 cycles, and a core typically will not miss in L1 every cycle. This issue will be exacerbated as the number of cores increases, but it depends strongly on the timing and contention of memory accesses from different cores.

The arbiter also introduces extra latency in accessing the pipeline, i.e., a message from a given core has to wait for its turn to enter the pipeline. The average latency is roughly N/2 cycles.
6. ISOLATION ACROSS PROTECTION DOMAIN TRANSITIONS
In addition to achieving isolation of programs scheduled concurrently onto different cores in the system, we must achieve isolation between programs scheduled to use the same core at different times. As described earlier, MI6 relies on flushing of microarchitectural state to erase any program-dependent microarchitectural state when scheduling a new protection domain onto the core. Operationally, we add a microarchitectural purge instruction to achieve this; below, we consider the specifics of purge by visiting each module it scrubs. Section 6.2 discusses how the privileged security monitor, which occupies a dedicated protection domain and executes at the highest privilege, orchestrates the low-level operations to implement a secure context switch across protection domains.

6.1 purge instruction
In-flight instructions: The out-of-order core consists of many modules containing bookkeeping for in-flight instructions, including the register renaming table, register free list, reorder buffer, issue queues, scoreboard, speculation tag manager, load-store queue (with corresponding MSHRs), store buffers, and various FIFOs and smaller modules. Between contexts, all of this state must correspond to "no instruction is currently in the processor pipeline" in order to achieve a comprehensive flush. The baseline RiscyOO processor correctly flushes this state on a privilege change, in order to handle read-after-write hazards on the privilege level.

A multitude of states throughout the modules of the out-of-order core equivalently describe an empty pipeline. For example, a complete register free list indicates an empty pipeline, but there exist multiple permutations of the free list. Interestingly, this does not require special consideration so long as these equivalent "free" states are not distinguishable by any software. A similar situation pertains to the issue queue, from where instructions are issued to execute when ready. In the RiscyOO processor, the issue queue is a circular buffer with associated head and tail pointers. Any configuration where the head and tail pointers are equal maps to an empty state, yet these are entirely indistinguishable by software means. While not applicable in the case of RiscyOO's issue queue, this module would require additional care to correctly flush program-dependent state in some priority queues, such as that of the MIPS R10000 [66], which favors issuing instructions from low-numbered slots in the queue.
Branch predictor structures:
Branch predictors were demonstrated as a surface for hijacking speculative execution as part of Spectre [33]. These structures, being deeply stateful, can also transmit information about the control flow of a previously scheduled program via the branch predictions observed after a context switch. To purge the branch predictor state, the branch predictor must reach a well-defined public state, for example via a reset to its initial state. To reduce the overhead of cold branch prediction after each context switch, the processor may opt to implement primitives for saving and restoring predictor state, if practical.
L1 Caches, TLBs, and translation caches:
Attacks exploiting shared cache tag state to observe secret-dependent changes in memory access latency are a well-explored field, offering increasingly practical examples of private state leakage via the cache. Any cache timing attack may use not only the shared L2 cache (which MI6 partitions to address this class of vulnerability), but also the L1 caches, which are time-shared by the programs scheduled onto the same core. In order to close these attack surfaces, the L1 caches, along with the TLBs (both L1 and L2) and translation caches, must be flushed on context switches. (The TLBs and translation caches are all private to the core.) In addition to the tag state, the cache lines' replacement-policy state must also be scrubbed; RiscyOO fortunately employs a pseudo-random replacement policy with no replacement state. The TLBs (both L1 and L2) and translation caches use set-associative structures with LRU replacement policies. RiscyOO's implementation of the LRU policy is self-cleaning: when no line's data is present in a set, new lines are filled in a predefined order; the act of filling an LRU cache to prime it for eviction scrubs private information in the replacement state.

A noteworthy observation: L2 cache sets need only be scrubbed when re-allocating physical memory. Protection domains use disjoint regions of physical memory, which correspond to disjoint sets in the L2 via set coloring. L2 lines belonging to a de-scheduled protection domain are inaccessible and can remain in the L2 until the domain is scheduled again.
MI6 does not allow memory to be shared between arbitrary protection domains; all communication between domains is mediated by a security monitor, as described in Section 6.2.
As in Sanctum [16], we employ a software security monitor to map the high-level semantics of enclaves onto the low-level invariants implemented by the hardware. While the implementation of a security monitor for MI6 is not a contribution of this manuscript, this section briefly describes its required functionality, and where it differs slightly from the prior construction [37] for Sanctum. The security monitor interposes on scheduling and physical resource allocation decisions made by the untrusted OS to assert that a given enclave's resources do not overlap with any other software, and to scrub these resources before they are available for re-allocation. In MI6, the security monitor considers two classes of resources: core state and memory; both include their respective space of subtle side effects these resources have on the memory hierarchy, the microarchitecture, and the network-on-chip. When an OS requests that enclaves be scheduled or de-scheduled, the security monitor uses purge, as described above, to scrub a core when scheduling enclaves (to create a pristine environment free of adversarial influence) and when de-scheduling enclaves (to erase side effects of enclave execution). Likewise, before memory is granted to an enclave, or when an enclave is destroyed, the security monitor must be invoked to scrub it before it can be given to a new owner. During steady-state execution, the security monitor is de-scheduled, and protection domains are isolated via the mechanism described in Section 5.3.

The monitor also interposes on an enclave's asynchronous events and exceptions in order to safely de-schedule the enclave before delegating control to the OS handler; the OS observes the event as occurring at the syscall that scheduled the enclave. The security monitor itself occupies a dedicated protection domain outside the OS's reach, and statically reserves a sufficient amount of physical memory for the text and data structures implementing the monitor's limited functionality. The security monitor sets up a physical memory protection primitive (PAR: Sanctum's protected address region) to ensure its own integrity against all other software.

All protection domains, with the exception of the security monitor, execute in virtual memory; the operating system transparently uses an identity page table to access physical addresses. When protection domains are created or destroyed, stale translations system-wide must be scrubbed: the security monitor forces a TLB shootdown during these transitions to ensure cached translations remain coherent with the system's current security policy. Because no translations exist to map any virtual address in a protection domain to a physical address outside the protection domain (except as described at the end of the section), speculative fetches and loads will not fall outside protection domain boundaries, and uphold isolation.

As described in Section 5.3, MI6 ensures that no protection domain may access the memory of another (with the notable exception of the security monitor, which relies on a restricted mode of execution to sidestep any speculative misbehavior, as described below), resulting in straightforward isolation in the L2 cache via page coloring. Of course, enclaves must occasionally communicate with other software, at a minimum to receive inputs and produce outputs. While Sanctum and SGX allowed for rich communication between an enclave and its host (the untrusted Linux process in whose address space the enclaved process exists), MI6 cannot allow sharing any portion of the address space with untrusted software, in order to silo speculative execution. The security monitor implements explicit messaging between protection domains by allowing a sender to request a message to be copied from the sending domain to a pre-allocated buffer in the receiving domain. Sanctum's mailbox primitive is one such mechanism, allowing enclaves to send and receive authenticated private 64-byte messages (local attestation). MI6 extends this primitive to also implement a privileged memcopy between an enclave and the untrusted software via an agreed-upon pair of buffers of equal size. The security monitor responds to an enclave's request to "read" the OS buffer by copying its contents to the enclave, and to "write" it by copying from the enclave's buffer to OS memory. The security monitor's handling of the primitives above does not depend on the transmitted data, and an enclave's invocation of these APIs is not considered private. The security monitor therefore need not perform a purge when it mediates communication. These primitives are a restriction of the more permissive communication mechanisms allowed by SGX and Sanctum, so as to defend against timing attacks on shared memory, including those that exploit speculative execution.

Allowing an enclave to interact with the outside world, even only through the security monitor, has implications for our definition of security: instead of leaking only its completion time, the enclave also transmits the timing and sequence of its interactions to a potential adversary. Any communications received from untrusted software are untrusted, and a potential influence channel. The enclave is responsible for padding the timing of its interactions, and for tolerating malicious responses. The padding can be to a constant value for zero leakage, or to some value from a fixed-size set to limit leakage [21].

Protection domains other than the security monitor (enclaves, untrusted software) share no resources with one another, so side effects of speculative execution within these domains are not visible across protection domain boundaries. Cores executing these protection domains may therefore speculate with no restrictions. This is not the case for the security monitor's domain, which executes with the highest privilege and may access arbitrary virtual memory. As in [16], the security monitor's code is trusted to maintain its own integrity, and to not violate the isolation of other domains.
This trust is insufficient in MI6, and we must restrict speculative execution of the security monitor to prevent side effects of mis-speculated fetches and accesses from being observable across protection domain boundaries. We achieve this by restricting instruction fetch to a range of addresses corresponding to the security monitor, and by throttling register renaming in machine-mode execution (exclusively used by the security monitor), effectively serializing execution. Restricting instruction fetch prevents the security monitor from leaking information via the shared cache by jumping/branching to a data-dependent address visible to an adversary. Further, we replicate the security monitor's code within each enclave (these replicas contain nothing confidential and protect their integrity via PAR, Sanctum's physical memory protection mechanism), so invoking an enclave's communication primitive does not leave unintended side effects: the monitor's text is not shared by protection domains, and the memcopy is non-speculative (see above) and touches only the two buffers the enclave explicitly intends to access. A more sophisticated implementation that allows for safe speculation within the security monitor while guaranteeing isolation, and allows finer-grained communication between protection domains, is deferred to future work.
In steady-state execution, the enclave is straightforwardly isolated through its uniquely allocated core and address range. Since it does not share a core with any other software, its core-local microarchitectural state is private. Since it does not share an allocated address range with any threads running on other cores, the page invariant circuit ensures that external software cannot access the enclave's physical memory, and the enclave cannot access addresses outside its address range, including via speculative accesses. Page coloring results in cache-set isolation in the shared cache, and MSHR partitioning and other changes result in memory-request timing isolation in the cache hierarchy. The enclaved program is responsible for safeguarding the timing of its public operations.

In transient execution, which includes enclave scheduling, de-scheduling, creation, and destruction, the enclave is isolated by the security monitor sanitizing architectural and microarchitectural state within the core during each event. When the security monitor is called to perform one of the enclave transient operations, the core switches to machine mode, and that switch causes the pipeline to flush all in-flight speculation from the previously executing program. The security monitor uses the purge instruction to fully flush all the microarchitectural state from the processor, and it runs a software routine to scrub the architectural state to a known initial state. Before returning control to software outside machine mode, the security monitor re-purges relevant architectural and microarchitectural state to grant the incoming software a fresh execution environment.

A subtle complication in this process is the security monitor's own unrestricted access to physical memory. Speculative misbehavior within the security monitor itself might be able to circumvent the isolation of private enclave memory, so we do not speculate in machine mode.
7. PERFORMANCE EVALUATION
The performance overheads of MI6 come from (1) flushing per-core microarchitectural state on a context switch, (2) partitioning the shared last-level cache (LLC), (3) partitioning and sizing the MSHRs in the LLC, (4) the round-robin arbiter for the LLC pipeline, and (5) turning off speculation in machine mode. Here, we do not evaluate the performance overheads of the changes made in Section 5.4.3 other than the arbiter, for the reasons explained in Section 5.4.4. We evaluate each of these five overheads in Sections 7.1 to 7.5, respectively, and summarize the overall performance overheads of MI6 in Section 7.6. Overheads (1) and (5) apply only to enclave applications, while overheads (2), (3) and (4) apply to all processes. Turning off speculation is needed only when the security monitor transfers data on behalf of the enclave to and from the outside world. We expect these to be rare events which typically happen only at the beginning and end of enclave execution, so we will not take into account the effect of turning off speculation when evaluating the overall performance overheads of MI6 in Section 7.6.

We use the following 7 variants of the RiscyOO processor:

1. BASE: the baseline insecure RiscyOO processor with the parameters listed in Figure 4.
2. FLUSH: flushes per-core microarchitectural state at every context switch on top of the BASE processor (used in Section 7.1).
3. PART: set-partitions the LLC of the BASE processor (used in Section 7.2).
4. MISS: changes the organization of LLC MSHRs of the BASE processor to model the effect of LLC-MSHR partitioning and sizing (used in Section 7.3).
5. ARB: increases the LLC pipeline latency of the BASE processor to model the effect of the round-robin arbiter for the LLC pipeline (used in Section 7.4).
6. NONSPEC: executes memory instructions non-speculatively on top of the BASE processor (used in Section 7.5).
7. F+P+M+A: the combination of FLUSH, PART, MISS and ARB (used in Section 7.6).

We prototyped all 7 processors on AWS F1 FPGAs, and ran the SPEC CINT2006 benchmarks with the ref input size. We did not run the perlbench benchmark because we could not cross-compile it to RISC-V. For benchmarks with multiple ref inputs, we present the aggregate performance number over all the inputs. In most cases, all benchmarks ran to completion under Linux without sampling. The only exception is the evaluation of turning off speculation, in which case we truncate the runs because the processor without speculation becomes too slow to finish the benchmarks (see Section 7.5).
Front-end: 2-wide superscalar fetch/decode/rename; 256-entry direct-mapped BTB; tournament branch predictor as in Alpha 21264 [28]; 8-entry return address stack
Execution Engine: 80-entry ROB with 2-way insert/commit; 4 pipelines total: 2 ALU, 1 MEM, 1 FP/MUL/DIV; 16-entry IQ per pipeline
Ld-St Unit: 24-entry LQ, 14-entry SQ, 4-entry SB (each 64B wide)
L1 TLBs: both I and D are 32-entry, fully associative; D TLB has max 4 requests
L2 TLB: private to each core; 1024-entry, 4-way associative; max 2 requests; includes a translation cache which contains 24 fully associative entries for each intermediate translation step
L1 Caches: both I and D are 32KB, 8-way associative, max 8 requests
L2 Cache (LLC): 1MB, 16-way, max 16 requests; coherent with I and D
Memory: 2GB, 120-cycle latency, max 24 requests (12.8GB/s for a 2GHz clock)
Figure 4: Insecure baseline (BASE) configuration
Methodology:
We compare the single-core performance of BASE and FLUSH to study the influence of flushing per-core microarchitectural state, including TLBs, L1 caches and branch predictors. In FLUSH, these states are flushed whenever the processor takes a trap (i.e., an exception or an interrupt) or returns from trap handling. In one cycle, the hardware can flush only a few entries of the L1 caches, TLBs and branch predictors. Each L1 cache has 512 cache lines, and we invalidate one line per cycle. We cannot invalidate a whole set (i.e., 8 lines) per cycle, because the coherence protocol used in RiscyOO requires the L1 to notify the L2 even for the invalidation of a clean line; the L2 can sustain a bandwidth of one eviction per cycle, though the latency of completing an eviction is larger than one cycle. Entries in TLBs and branch predictors can be discarded directly. The fully associative L1 TLBs can be flushed in one cycle. The L2 TLB has 256 sets (each set has 4 entries), and we discard one set per cycle. For the tournament branch predictor, the largest table has 4096 entries (each of 2 bits), and we discard 8 entries per cycle. All these flushes, done in parallel, take 512 cycles to complete, during which time the processor idles.

The stall time for flushing is not the only cost. When the application resumes from trap handling, the L1 caches, TLBs and branch predictors are all "cold," and it takes time to warm them up. This leads to more misses in caches and TLBs, and more branch mispredictions.
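The 512-cycle figure follows directly from the per-structure rates above; a quick sanity check (the variable names are ours):

```python
# Each structure is flushed at the stated per-cycle rate, all in parallel,
# so the total stall equals the slowest structure.
l1_cache_cycles = (32 * 1024) // 64  # 512 lines, 1 invalidation per cycle
l2_tlb_cycles = 1024 // 4            # 256 sets, 1 set discarded per cycle
bp_cycles = 4096 // 8                # largest predictor table, 8 entries/cycle
l1_tlb_cycles = 1                    # fully associative, flushed in one cycle
stall = max(l1_cache_cycles, l2_tlb_cycles, bp_cycles, l1_tlb_cycles)
print(stall)  # -> 512
```

Note that the L1 cache and the branch predictor are both bottlenecks at exactly 512 cycles.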
Results:
Figure 5 shows the increased execution time caused by flushing for each benchmark. The last column shows the average across all benchmarks; lower bars mean lower performance overhead. The average overhead in execution time is 5.4%, and the maximum overhead is 10.9% (in benchmark astar). As explained earlier, these overheads are caused by (1) the stall time waiting for flushes to complete, (2) the additional cache and TLB misses caused by the cold start after flushing, and (3) the additional branch mispredictions caused by the cold start after flushing. We now examine each of these three sources one by one.

Figure 5: The overall overhead of FLUSH in execution time, normalized to the execution time of BASE

Figure 6: The stall time for flushing states in FLUSH, normalized to the execution time of BASE

Figure 7: Branch mispredictions in BASE and FLUSH

Figure 6 shows the stall time in FLUSH just waiting for the microarchitectural states to be flushed; the last column is the average across all benchmarks. As we can see, flushing the states takes merely 0.4% of the execution time. Benchmark xalancbmk has the longest stall time (3.2%) because it makes a large number of system calls (which trigger exceptions) to print characters to stdout.

While we do not show the results here, the changes in instruction- and data-cache and TLB misses are negligible, so they are unlikely to be the source of performance degradation.

Figure 7 shows the number of branch mispredictions per thousand instructions in BASE and FLUSH. The last column is the average across all benchmarks. It turns out that flushing the branch predictors has a substantial impact on the misprediction rate. On average, the mispredictions per thousand instructions rise from 18.3 to 24.3 after microarchitecture-state flushing is enabled. This significant increase is responsible for the 5.4% overall performance overhead in Figure 5. For benchmark astar, mispredictions go up from 30.1 to 46.2, leading to the maximum overhead of 10.9% in Figure 5.
Summary:
Flushing per-core microarchitectural state on context switches has little impact on the miss rates of caches or TLBs, but it increases the branch-misprediction rate substantially. However, the overall performance overhead caused by flushing is still small.
Methodology:
For LLC partitioning, the ideal evaluationmethodology would run multiprogrammed workloads on amultiprocessor. For example, we would like to evaluate a 16-core multiprocessor with 16MB shared L2 cache (LLC). Weassume that each core is still using the parameters in Figure 4, and the LLC is still 16-way set-associative and using 64B cachelines. However, the FPGA does not have enough logic gatesand SRAMs for us to prototype such a multiprocessor. Wenow explain how to closely approximate the evaluation of amultiprocessor using a single core.We first point out that the performance overheads of LLCset partitioning mainly come from (1) the decreased LLC sizeallocated for the enclave application, and (2) the additionalconflict misses in LLC caused by changing the cache-indexingfunction as described below. In fact, the allocated LLC size isnot a concern because our system allows an enclave to claimmultiple sets of the LLC. Besides, running multiprogrammedworkloads without considering security would also require par-titioning the LLC for Quality of Service (QoS). This evaluationis not about how to size each LLC partition to achieve the bestQoS, so we focus on the second type of overhead, i.e., cachemisses caused by changing the indexing function of LLC.Consider the case that we run a multiprogrammed workload,which consists of 16 SPEC benchmarks, on the 16-core mul-tiprocessor with one benchmark on each core. For simplicity,we assume the insecure baseline partitions the LLC to allocate1MB to each core. If the baseline is using way partitioning,then each core is using effectively a 1MB direct-mapped LLC.Here, we overestimate the baseline performance by assumingthat each core in baseline is using a 1MB cache which stillhas 16 associative ways. That is, the insecure baseline perfor-mance can be approximated by the performance of the BASEprocessor. 
In our secure system, consider the case that we set-partition the 16MB LLC into 64 regions of 256KB each, and we assign 4 regions (1MB) to each enclave, which runs one SPEC benchmark on a core. In this case, there is no difference in the allocated LLC size, and the performance overhead of set-partitioning comes only from the change in the cache-indexing function, which we explain next.

Consider a cache-line address A, which does not contain the 6-bit line offset (for 64B cache lines). For the baseline insecure 1MB LLC of BASE, which has 2^10 sets, the cache index for this line is the lower 10 bits of A, i.e., A[9:0]. For the set-partitioned 16MB LLC of our conceptual multiprocessor, which has 2^14 sets, the cache index is the combination of the 6-bit DRAM region R and the lower 8 bits of A, i.e., {R[5:0], A[7:0]}. Since one enclave only gets 4 regions, the higher 4 bits of region R are fixed, and the effective cache index is 10 bits, i.e., {R[1:0], A[7:0]}. This is equivalent to indexing the 1MB LLC of BASE using a different index function.

As a result, to approximate the performance impact of LLC set-partitioning, we can simply measure the influence of replacing the higher 2 bits of the LLC index of BASE with the lower 2 bits of the DRAM region, i.e., the effect of changing the LLC index from A[9:0] to {R[1:0], A[7:0]}. We refer to the processor that uses the new LLC index as PART, and we compare the performance of PART against BASE. The DRAM region R consists of the higher bits of the cache-line address. In the case of 2GB DRAM, R[1:0] would be A[20:19]. The performance overhead of PART is thus caused by using higher address bits to index the LLC.

Results:
Figure 8 shows the increased execution time caused by LLC set-partitioning. The last column shows the average across all benchmarks. Lower bars mean less performance overhead. The average overhead in execution time is 7.4%, and the maximum overhead is 21.6% (in benchmark gcc).

To better understand the performance overhead, we show the number of LLC misses per thousand instructions for BASE and PART in Figure 9. Because of using higher address bits in the LLC index, the average LLC misses per thousand instructions increase from 17.4 to 19.6 (the last column in the figure). For benchmark gcc, which has the maximum execution-time overhead, the LLC misses double. These increased misses can be understood as follows. In our evaluation, we start the benchmark right after Linux boots. At that time, most of the physical memory has not been allocated, and Linux tends to allocate physical pages sequentially for the benchmark. Thus, physical addresses in the working set of the benchmark are likely to share the same higher bits, and thus get mapped to the same LLC index, leading to more LLC conflict misses.

Figure 8: Overhead of PART in execution time normalized to the execution time of BASE

Figure 9: LLC (L2$) misses in BASE and PART
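To make the index change and its conflict-miss effect concrete, the two index functions can be sketched as follows (our own illustration; the helper names are not from the paper's artifact):

```python
def base_index(a: int) -> int:
    """BASE: 1MB, 16-way, 64B lines -> 2^10 sets, indexed by A[9:0]."""
    return a & 0x3FF

def part_index(a: int) -> int:
    """PART: {R[1:0], A[7:0]}, where R[1:0] = A[20:19] for 2GB DRAM."""
    r = (a >> 19) & 0x3
    return (r << 8) | (a & 0xFF)

# Over a contiguous 1MB working set (2^14 consecutive line addresses),
# every line shares A[20:19], so PART touches only 256 of its 1024
# effective sets, while BASE spreads the lines over all 1024 sets --
# the conflict-miss effect of sequential page allocation described above.
lines = range(1 << 14)
print(len({base_index(a) for a in lines}))  # 1024
print(len({part_index(a) for a in lines}))  # 256
```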
Methodology:
We still consider the case of a multiprocessor as in Section 7.2. For the insecure baseline, we assume the average memory-system bandwidth available to each core is the same as for a single-core BASE processor. That is, each core in the multiprocessor can occupy 16 LLC MSHR entries and have 24 in-flight DRAM requests on average.

Now consider the secure case, in which we partition and size the LLC MSHRs to prevent side channels due to contention for LLC MSHRs and DRAM bandwidth. According to Section 5.2, the LLC MSHRs should be partitioned to allocate 12 entries to each core, so that the memory requests (including both writebacks and data fetches) generated by each core can never exceed the DRAM bandwidth available to the core (i.e., 24 requests). If the LLC is organized as several cache banks, then the partitioning of MSHRs should be done in each LLC bank. In this evaluation, we consider the case that the LLC is sliced into 4 banks according to the lower bits of the cache index. In this case, one core is allocated 3 entries in each bank (still 12 entries in total). In the insecure baseline, cache misses by a core can occupy MSHR entries in any bank, i.e., the 16 LLC MSHRs can be distributed across the 4 banks in any manner.

According to the above analysis, the performance overheads of LLC-MSHR partitioning and sizing come from (1) the reduction in MSHR size, and (2) insufficient MSHRs in a single cache bank (i.e., bank conflicts). To model the effects of these two overheads, we instantiate the MISS processor based on the BASE processor. The MISS processor has only 12 LLC-MSHR entries compared to 16 in BASE, and the MSHRs in MISS are sliced into 4 banks (according to the lower bits of cache-line addresses). The performance overhead of LLC-MSHR partitioning can be characterized by the performance difference between BASE and MISS. The performance of MISS is a pessimistic estimate: overwhelming one MSHR bank stalls the whole MSHR structure in MISS, while different MSHR banks are independent of each other in the real design. This modeling error can be avoided by refining the implementation in the future.
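As a concrete reading of the pessimistic MISS model, the banked allocation rule might be sketched as follows (an illustrative model of our own, not the RTL):

```python
N_BANKS = 4
ENTRIES_PER_BANK = 12 // N_BANKS        # 3 MSHR entries per bank per core

def mshr_bank(line_addr: int) -> int:
    # Bank select uses the low bits of the cache-line address.
    return line_addr & (N_BANKS - 1)

inflight = [0] * N_BANKS                # outstanding misses per bank

def can_allocate(line_addr: int, pessimistic: bool) -> bool:
    """pessimistic=True models MISS: one full bank stalls every bank."""
    if pessimistic and any(n >= ENTRIES_PER_BANK for n in inflight):
        return False
    return inflight[mshr_bank(line_addr)] < ENTRIES_PER_BANK

# Fill bank 0 with 3 outstanding misses.
for _ in range(3):
    inflight[mshr_bank(0)] += 1
print(can_allocate(1, pessimistic=False))  # True: bank 1 has free entries
print(can_allocate(1, pessimistic=True))   # False: MISS stalls on a full bank
```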
Results:
Figure 10 shows the increased execution time caused by partitioning the MSHRs in the LLC. Lower bars mean less performance overhead. The average overhead in execution time is 3.2% (the last column), and the maximum is 8.3% (in benchmark astar). This overhead is not large.

Figure 10: Overhead of MISS in execution time normalized to the execution time of BASE
Methodology:
As described in Section 5.4.4, the performance overheads of the round-robin arbiter associated with the LLC cache-access pipeline are caused by bandwidth loss under bursty cache traffic and by the increased latency of accessing the pipeline. We do not evaluate the overhead due to bursty traffic because it depends strongly on the timing of the concurrently running applications, and we are unable to fit a large RiscyOO multiprocessor on an FPGA. As an approximation, we evaluate only the overhead caused by the increased pipeline latency. For a 16-core multiprocessor, the pipeline latency is increased by 8 cycles on average (cf. Section 5.4.4). We instantiate the ARB processor, which increases the LLC-pipeline latency of BASE by 8 cycles, to model this overhead.
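The 8-cycle figure is consistent with a simple expected-wait argument (our own back-of-the-envelope check, not the paper's derivation): a request arriving at a uniformly random phase of a 16-slot round-robin schedule waits between 0 and 15 cycles for its core's slot.

```python
N_CORES = 16
# Expected wait for a request arriving at a uniformly random slot offset.
avg_wait = sum(range(N_CORES)) / N_CORES
print(avg_wait)  # 7.5, i.e., ~8 cycles of added LLC-pipeline latency
```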
Results:
Figure 11 shows the increased execution time caused by the LLC arbiter. Lower bars mean less performance overhead. The average overhead in execution time is 8.5% (the last column), and the maximum is 14% (in benchmark libquantum). This overhead is comparable to that of LLC set-partitioning.

Figure 11: Overhead of ARB in execution time normalized to the execution time of BASE

Methodology:
When a processor runs non-speculatively, the address translation and execution of a memory instruction (e.g., a load or a store) cannot start until the instruction can no longer be squashed (e.g., by branch mispredictions or exceptions). Since turning off speculation is uncommon, we implement the non-speculative mode on top of the BASE processor in a simple (but less optimized) way. In the non-speculative mode, the processor does not rename a memory instruction (and thus cannot enter it into the ROB) until the ROB is empty. We refer to this processor as NONSPEC.

To evaluate the performance overhead of turning off speculation, we run benchmarks on NONSPEC entirely in non-speculative mode. Since the non-speculative mode is much slower than the normal speculative mode, we truncate the benchmarks. Each benchmark was run for 20 billion instructions without collecting performance data, and then run for 40 billion instructions while collecting performance data. Benchmarks were rerun on BASE using this methodology to get the baseline performance.
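The renaming rule of NONSPEC can be sketched as a toy model (ours, not the processor's RTL): a memory instruction stalls at rename until the ROB has drained.

```python
from collections import deque

rob = deque()  # toy reorder buffer holding renamed instructions

def try_rename(instr: dict, nonspec: bool) -> bool:
    """NONSPEC rule: a memory instruction may not be renamed (and so
    cannot enter the ROB) until the ROB is empty."""
    is_mem = instr["kind"] in ("load", "store")
    if nonspec and is_mem and rob:
        return False            # stall: wait for the ROB to drain
    rob.append(instr)
    return True

print(try_rename({"kind": "add"}, nonspec=True))   # True: non-memory
print(try_rename({"kind": "load"}, nonspec=True))  # False: ROB non-empty
rob.clear()
print(try_rename({"kind": "load"}, nonspec=True))  # True: ROB empty
```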
Results:
Figure 12 shows the increased execution time caused by running in the non-speculative mode. Lower bars mean less performance overhead. The average overhead in execution time is 205% (the last column), and the maximum overhead is 427% (in benchmark h264ref). Although the overhead is large, it is incurred only at the beginning and end of an enclave program for the common use cases of enclaves, and it does not apply to insecure programs running outside enclaves.

Figure 12: Overhead of NONSPEC in execution time normalized to the execution time of BASE
Overheads of MI6

Overheads of enclave processes: An enclave program running on MI6 is affected by flushing microarchitectural state on every context switch, by LLC set-partitioning, by LLC-MSHR partitioning and sizing, and by the LLC round-robin arbiter. (We omit the influence of turning off speculation, as discussed earlier.) Its overhead can be approximated by evaluating the performance of the F+P+M+A processor, which is simply a combination of FLUSH, PART, MISS, and ARB. Figure 13 shows the increased execution time of F+P+M+A compared to BASE, i.e., the performance overhead of an enclave program. Lower bars mean less performance overhead. The average overhead of running in an enclave of MI6 is 16.4% (the last column in the figure), and the maximum overhead is 34.8% (benchmark gcc).

These performance numbers are a good approximation. The primary omission is the effect of bursty traffic on the overhead of the LLC arbiter. On the other hand, we are conservative in modeling the overhead of LLC-MSHR partitioning and sizing.

Figure 13: Execution-time overhead of an enclave application in MI6 normalized to the execution time of BASE

Overheads of non-enclave processes: Compared to the performance overhead of an enclave program shown in Figure 13, the overhead of a non-enclave program in MI6 will be lower because there is no flushing of microarchitectural state.
Area overhead:
Our synthesis results show that both BASE and F+P+M+A can be clocked at a maximum of 1GHz. Therefore, the additional hardware for enforcing security does not affect clock frequency. As for area, F+P+M+A is approximately 2% larger than BASE. Several area-consuming components like the LLC, L1 SRAMs, and FPUs are not included in the area results, so a 2% area increase on the remainder is quite small.
8. CONCLUSION
Enclaves strengthen the process abstraction to restore isolation guarantees under a specified threat model. Through careful design, prototyping, and evaluation, we show how such enclaves can be supported in MI6, an aggressive speculative out-of-order processor prototype, with reasonable overhead. Further design effort can lower the overhead for unprotected programs by turning on strong timing independence only when at least one enclave is running, and for both unprotected and protected programs by modifying the OS to reduce the overhead of cache indexing. A primary remaining challenge is allowing enclave software to be more expressive, e.g., allowing sharing of memory across protection domains while maintaining isolation.
9. ACKNOWLEDGMENTS
Funding for this research was partially provided by the National Science Foundation under contract number CNS-1413920, Analog Devices, Inc., DARPA & SPAWAR under contract N66001-15-C-4066, and DARPA under HR001118C0018.