On Value Recomputation to Accelerate Invisible Speculation
Christos Sakalis
Uppsala University, Sweden [email protected]
Zamshed I. Chowdhury
University of Minnesota, Twin Cities, USA [email protected]
Shayne Wadle
University of Wisconsin, Madison, USA [email protected]
Ismail Akturk
University of Missouri, Columbia, USA [email protected]
Alberto Ros
University of Murcia, Spain [email protected]
Magnus Själander
Norwegian University of Science and Technology, Norway [email protected]
Stefanos Kaxiras
Uppsala University, Sweden [email protected]
Ulya R. Karpuzcu
University of Minnesota, Twin Cities, USA [email protected]

Abstract
Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply delays loads that miss in the L1 cache until they become non-speculative, resulting in no transient changes in the memory hierarchy. However, this costs performance, prompting the use of value prediction (VP) to recover some of the loss. However, the problem cannot be solved by simply introducing a new kind of speculation (value prediction). Value-predicted loads have to be validated, which cannot commence until the load becomes non-speculative. Thus, value-predicted loads occupy the same amount of precious core resources (e.g., reorder buffer entries) as Delay-on-Miss. The end result is that VP yields only marginal benefits over Delay-on-Miss.

In this paper, our insight is that we can achieve the same goal as VP (increasing performance by providing the value of loads that miss) without incurring its negative side-effect (delaying the release of precious resources), if we can safely, non-speculatively, recompute a value in isolation (without being seen from the outside), so that we do not expose any information by transferring such a value via the memory hierarchy. Value Recomputation, which trades computation for data transfer, was previously proposed in an entirely different context: to reduce energy-expensive data transfers in the memory hierarchy. In this paper, we demonstrate the potential of value recomputation in relation to the Delay-on-Miss approach of hiding speculation, discuss the trade-offs, and show that we can achieve the same level of security, reaching 93% of the unsecured baseline performance (5% higher than Delay-on-Miss), and exceeding (by 3%) what even an oracular (100% accuracy and coverage) value predictor could do.

Keywords: Hardware Security · Invisible Speculation
With the disclosure of Spectre [1] and Meltdown [2] in early 2018, speculation, one of the fundamental techniques for achieving high performance, proved to be a significant security hole, leaving the door wide open for side-channel attacks [3, 4, 5, 6] to “see” protected data [1, 2]. As far as the instruction set architecture (ISA) and the target program are concerned, this type of information leakage through microarchitectural (µ-architectural) state and structures is not illegal, because it does not violate the functional behavior of the program. But speculative side-channel attacks reveal secret information during misspeculations, i.e., discarded execution that is not a part of the normal execution of a program. The stealthy nature of a speculative side-channel attack is based on microarchitectural state being changed by misspeculation even when the architectural state is not.
First response techniques: delay, hide&replay, or cleanup?
A number of techniques have already been proposed to prevent microarchitectural state from leaking information during speculation, either by delaying such effects [7, 8, 9, 10], hiding them and making them re-appear for successful speculation (hide&replay) [11, 12], or cleaning up the changes when speculation fails [13]. Because these techniques were proposed for different threat models (i.e., responding to a different set of known or unknown threats), provide different protection for parts of the system that can leak secrets (e.g., caches, DRAM, core), and make different assumptions about what other parts of the system are protected (hence carry different costs), a direct comparison of all of them is, as of yet, not feasible. In this paper, without loss of generality, we focus on delay techniques and for convenience we adopt the threat model of the work of Sakalis et al., referred to as Delay-on-Miss (DoM) [7].
What problem are we solving?
Delay-on-Miss is the simple idea of delaying any speculative load that misses in the L1 cache until the earliest time when it becomes non-speculative. To recover some of the performance lost from delaying critical instructions (loads that miss), Sakalis et al. proposed to use value prediction (VP) for the delayed misses, in the hope of performing useful work for the delayed loads and their dependent instructions. In other words, the aim of VP is to increase instruction-level parallelism (ILP) by executing dependent instructions using load-value prediction. The conundrum of this approach is the following: VP, as another form of speculation, forces predicted loads to be validated in-order in the memory hierarchy, as each load remains speculative until all older loads have been performed non-speculatively. This means that the validation of these loads cannot have any memory-level parallelism (MLP). Thus, any possible gains in ILP from VP during speculation could be compromised by the hindrance of MLP at validation [14].

A new perspective: In this paper, we ask the question: Can we create “secret” values, invisible to an attacker, for the delayed loads, without having to compromise MLP to validate them afterwards? Our key intuition is that the answer lies in value recomputation (VRC), also known as Amnesic Computing [15]. The idea is that recomputing a value on an L1 miss — a value that otherwise would have been loaded from the memory hierarchy — can replace the need to access the memory hierarchy. This requires having a backward slice of producer instructions on a per (load) value basis, along with the necessary input operands to perform recomputation. By construction, slices do not contain any branch or memory references (be it a store or a load). Most importantly, recomputation is also not speculative by construction, and hence prevents nested speculation (and negative side effects thereof).
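To make the notion of a backward slice concrete, consider the hedged C sketch below (the names and operations are our own illustration, not taken from the paper): the stored value is produced by a short, branch-free chain of ALU instructions, so a later speculative load of that value can be regenerated by re-executing the chain instead of accessing the memory hierarchy.

```c
/* Hypothetical illustration of a backward slice: the value of 'checksum'
 * is produced by a branch-free chain of ALU instructions.  If a later
 * speculative load of 'checksum' misses in the L1, re-executing the
 * marked chain regenerates the value without touching memory. */
long compute_and_store(long a, long b, long *out)
{
    long t1 = a << 3;         /* slice: terminal instruction, input 'a' */
    long t2 = b + 17;         /* slice: terminal instruction, input 'b' */
    long checksum = t1 ^ t2;  /* slice: root (immediate producer)       */
    *out = checksum;          /* store: NOT part of the slice, but it   */
                              /* identifies the address being covered   */
    return checksum;
}
```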
Our Contributions:
• We propose to apply an unconventional idea, value recomputation (previously proposed as a means to evade the cost of moving data in the memory hierarchy), to solve this problem. We devise a µ-architectural framework for security-aware value recomputation, well fitted to the threat model at hand, and show the synergy with Delay-on-Miss.
• We evaluate the potential of value recomputation in eliminating speculative metadata, which makes classic processors vulnerable to numerous threats, including but not limited to what is known so far.
A summary of our results:
This is the first µ-architectural proposal that has the potential of outperforming the (unsecured) baseline in terms of performance and energy-efficiency, reducing the performance overhead of Delay-on-Miss by more than one third. In this paper, we provide a quantitative discussion on how to unlock this potential. Practically, we cover (known or yet-to-come) threats posed by speculative memory reads.

Sakalis et al. introduced the concept of
Speculative Shadows to reason about the earliest time an instruction becomes non-speculative and is considered safe to execute regardless of its effects on µ-architectural state [12, 7]. Speculative shadows can be of the following types: E-Shadows are cast by any instruction that can cause an exception; C-Shadows are cast by control instructions, such as branches and jumps, when either the branch condition or the target address is unknown or has been predicted but not yet verified; D-Shadows are cast by potential data dependencies through stores with unresolved addresses (read-after-write dependencies); M-Shadows are cast by speculatively executed memory accesses that may be caught violating the ordering rules of a memory model (e.g., total store order—TSO) and therefore may need to be squashed; and VP-Shadows are cast by value-predicted loads [7]. To be more specific, shadows demarcate regions of speculative instructions. So far, attacks have been demonstrated under the E- [2], C- [1], and D-Shadows [16] only, but we cannot exclude future attacks using the rest.
We target speculative side- or covert-channel attacks that utilize the memory hierarchy (caches, directories, and the main memory) as their side-channel. Non-speculative cache side-channel attacks, as well as attacks that use other side-channels (such as port contention), are not covered by Delay-on-Miss and, although still possible, are outside the scope of this work. We make no assumptions as to where the attacker is located in relation to the victim (on the core) or whether they share the same virtual memory address space. As a matter of fact, the attacker and the victim can be the same process, as in the Spectre v1 attack [1]. We assume that the attacker can execute arbitrary code or otherwise redirect the execution of running code arbitrarily. How the attacker manages to execute or redirect such code is beyond the scope of this work. Instead of focusing on preventing the attacker from accessing data illegally, we focus on preventing the transmission of such data through a cache or memory side- or covert-channel.

In this work, we use the concept of speculative shadows to determine when a load is safe or not. Speculative shadows determine the earliest point at which an instruction is guaranteed to be committed and retired successfully. Other works, such as InvisiSpec [11] and NDA [8], make different assumptions based on the threat model. For example, InvisiSpec provides two different versions, one based on the initial Spectre attacks where only the equivalent of C-Shadows is considered as part of the threat model, and one aimed at protecting against all possible future attacks, utilizing all the shadows. Similarly, NDA provides different solutions if only C-Shadows are considered (strict/permissive data propagation), if D-Shadows should also be considered (bypass restriction), or if all shadows should be considered (load restriction). In this work we follow the strictest approach and assume that all shadows have the potential of being abused, as we cannot reasonably argue that any of them are not exploitable.

To summarize, we cover any known or yet-to-be-discovered side-channel posed by a speculative memory read. We assume that all system components operate correctly.
The goal of Delay-on-Miss is to hide speculative changes in the memory hierarchy (including main memory). To achieve this, Delay-on-Miss delays speculative loads that miss in the L1 cache. Loads that hit in the L1 (and their dependent instructions) are allowed to execute speculatively, as their effects (i.e., on the L1 replacement state) can be deferred to when the loads are cleared from any speculative shadow. The miss of a delayed load is allowed to be resolved in the memory hierarchy at the earliest point the load becomes non-speculative. An efficient mechanism to track shadows is proposed by Sakalis et al. [7].

Under Delay-on-Miss, the vast majority of loads are executed speculatively (80+% on average [7]), which causes a notable fraction of the loads to be delayed. This takes up precious resources (i.e., entries in the instruction queue, the reorder buffer, and the load/store queue) and eventually stalls instructions from committing. The significant amount of speculation that is performed results in each load being covered by several speculative shadows (five on average according to our simulations). This forces the majority of the loads to be executed serially, severely limiting MLP [17, 14]. Furthermore, removing any individual shadow (e.g., the C-Shadow) has a limited effect, as the load can be covered by another overlapping shadow [17].
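The core rule is compact enough to state as code. The following is a minimal sketch of the Delay-on-Miss decision as described above (our own rendering, not the authors' hardware implementation):

```c
/* Minimal sketch of the Delay-on-Miss rule described above. */
typedef struct {
    int shadowed;   /* is the load under any speculative shadow?   */
    int l1_hit;     /* does it hit in the L1 data cache?           */
} load_state;

enum action { EXECUTE, EXECUTE_DEFER_EFFECTS, DELAY };

enum action delay_on_miss(const load_state *ld)
{
    if (!ld->shadowed)
        return EXECUTE;                /* non-speculative: proceed */
    if (ld->l1_hit)
        return EXECUTE_DEFER_EFFECTS;  /* hit: defer replacement-
                                          state update until the
                                          load is unshadowed       */
    return DELAY;                      /* shadowed miss: wait until
                                          all shadows are lifted   */
}
```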
The concept behind using value prediction (VP) with Delay-on-Miss is to speed up the delayed loads (and their dependent instructions) and regain some of the lost performance. However, VP—no matter how good we make it (even under 100% coverage and accuracy)—gives only a limited benefit on top of Delay-on-Miss [14]. VP clearly cannot regain the lost performance because of the following: VP cannot help much as it simply provides values early; the validation is still delayed until all shadows have been lifted. Thus, precious core resources are still occupied until the same point in time as with simply delaying the load. The only perceptible difference is a faster commit of pre-executed dependent instructions if the validation of a value-predicted load proves to be correct. Furthermore, VP introduces a new speculative shadow, referred to as the VP-Shadow. This new shadow is only lifted from younger loads when the validation of the VP is complete. This prevents younger loads from validating in parallel and limits the MLP, which results in VP occupying precious resources in the same manner as Delay-on-Miss.
Due to imbalances in technology scaling, the energy usage (and latency) of data transfers in the memory hierarchy can easily exceed the energy usage (and latency) of value recomputation [18]. Value recomputation (VRC) was proposed as a way to trade off data movement in the memory hierarchy for in-core computation to save energy [15, 19]. The basic idea is to swap slow and energy-hungry loads for recomputation of the respective data values. This is achieved by identifying a slice of producer instructions of the respective data values and executing them when the value is needed. Each such slice forms a backward slice of execution, and strictly contains only arithmetic and logic instructions.

Figure 1: (a) Backward slice; (b) ISER overview: All µ-architectural buffers have an invalid field per entry to manage space (de)allocation.

As depicted in Figure 1a, each slice represents a data-dependency graph, where nodes correspond to producer instructions to be (re)executed. Data flows from the leaf nodes to the root. The root represents the producer of the store whose value will be recomputed when its corresponding (consumer) load is encountered, i.e., a load accessing the same memory location. Nodes at level 1 are immediate producers of (the input operands of) the root, nodes at level 2 are producers of nodes at level 1, and so on and so forth. The nodes which do not have any producers are terminal instructions, whose input operands must be available at the time of recomputation. If these input operands are read-only values to be loaded from memory (such as program inputs) or register values that will be overwritten, then buffering of these values is needed to enable recomputation of the load [15].
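The dependency-graph structure of Figure 1a can be captured in a few lines. Below is a hypothetical encoding (our own, with illustrative opcodes; not a structure from the paper): terminal nodes have no producers and read buffered inputs, and evaluation flows from the leaves toward the root.

```c
#define MAX_OPS 2

typedef struct slice_node {
    unsigned opcode;                      /* arithmetic/logic only       */
    struct slice_node *producer[MAX_OPS]; /* NULL for a terminal input   */
    long buffered_input[MAX_OPS];         /* used when producer is NULL  */
} slice_node;

/* Recursive evaluation: resolve producers first, then apply the node's
 * operation.  The recursion bottoms out at the terminal instructions. */
long eval(const slice_node *n)
{
    long in[MAX_OPS];
    for (int i = 0; i < MAX_OPS; i++)
        in[i] = n->producer[i] ? eval(n->producer[i])
                               : n->buffered_input[i];
    switch (n->opcode) {      /* illustrative opcode encoding            */
    case 0:  return in[0] + in[1];
    case 1:  return in[0] ^ in[1];
    default: return in[0];
    }
}
```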
Premise: VRC has the potential to render a more energy-efficient (and faster) execution than servicing a miss in the memory hierarchy. At the same time, there is no need for MLP, since, as opposed to VP, VRC is not speculative by itself and does not require any costly validation. A recomputed load can be committed as soon as all the shadows are lifted—this is in stark contrast to Delay-on-Miss/VP, which require a load/validation to be performed before commit.
We will next detail the mechanics of our novel approach, Invisible Speculative Execution through (Value) Recomputation (ISER). Due to space limitations, we will focus on how value recomputation can help eliminate the targeted threats (Section 2.2). For a thorough discussion of value recomputation we refer the reader to [15, 19].
ISER only resorts to recomputation for regenerating values that otherwise would be read by a speculative load from the memory hierarchy, and only so if the respective speculative load misses in the L1 cache. Recomputation takes place as long as a slice exists and the input operands to the slice instructions can be made readily available. While ISER shares basic µ-architectural structures with Amnesiac [15] to facilitate VRC (such as dedicated buffers to prevent corruption of µ-architectural state during recomputation), its execution semantics are quite different when it comes to slice identification and triggering recomputation. These stem from the defining difference in optimization targets: Amnesiac uses VRC to maximize energy efficiency irrespective of security implications. ISER, on the other hand, uses VRC to eliminate (already known or yet to be discovered) threats induced by speculative loads. In a nutshell, the differences between Amnesiac and ISER expand along two axes (the two trigger policies are also contrasted in the sketch below):

• What to recompute (slice identification): As opposed to Amnesiac, ISER does not impose any direct constraint to preserve energy efficiency, as we are not after minimizing energy or latency per load. As long as a slice exists, and its inputs can be made readily available at the anticipated time of recomputation, ISER would consider it for recomputation. The only practical limitation on slice length may stem from the storage overhead of µ-architectural buffers in this case (Section 3.3).

• When to recompute: ISER swaps speculative loads that miss in L1 for recomputation (i.e., for the producer instructions of the respective value along a slice). Amnesiac, on the other hand, triggers recomputation (irrespective of whether the load is speculative or not) only if it is more energy-efficient to do so.
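A hedged sketch of the two “when to recompute” policies (the predicates are our own condensation of the two designs' descriptions, not code from either implementation):

```c
/* Amnesiac: recompute whenever it is estimated to be cheaper than the
 * data transfer, regardless of whether the load is speculative. */
int amnesiac_should_recompute(long energy_recompute, long energy_transfer)
{
    return energy_recompute < energy_transfer;
}

/* ISER: recompute exactly the loads that Delay-on-Miss would delay,
 * provided a valid slice (with available inputs) exists. */
int iser_should_recompute(int shadowed, int l1_miss, int slice_available)
{
    return shadowed && l1_miss && slice_available;
}
```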
We continue with ISER design specifics, limitations, and side effects, including coherence and consistency implications.

Similar to Amnesiac, we rely on a compiler pass (backed up by profiling) to form and annotate slices, which mainly constitutes dependency analysis to identify the producer instructions for each load. Slice creation is a best effort under strict validity guarantees. Not being able to generate a recomputation slice for a load is not a security weakness under a security technique such as Delay-on-Miss, but simply a missed optimization opportunity. Although in this paper the slice formation is conservative, as we will see later, the requirement for strict guarantees of recomputation validity can be relaxed (potentially increasing the coverage of recomputation, i.e., the portion of load values that can be recomputed, and addressing coherence issues) if the appropriate architectural support is available. However, such extensions are outside the scope of this paper and will be fully evaluated in future work.

The slice formation pass builds the slice as a data-dependency graph, where the immediate producer of the value to be loaded resides at the root (Figure 1a). As opposed to Amnesiac, the restriction on slice length comes from slice inputs or storage requirements (rather than the associated energy cost). If, during the traversal of data dependencies, we encounter other load instructions, we replace them recursively with the respective producer instructions. This recursive growth can continue until a store to the same address is encountered. Loads and stores cannot be present in any slice by definition. Once construction is complete, each slice gets embedded into the binary. Similar to Amnesiac, the special control-flow instruction RCMP communicates recomputation opportunities to the runtime; it semantically corresponds to an atomic bundle of a conditional branch + load (where no prediction is involved for the “branch” portion). The runtime scheduler resolves the branching condition: if the respective load (while shadowed) misses in L1, RCMP acts as a jump to the entry point (starting from the terminal instructions) of the corresponding slice. Otherwise (i.e., the load is not shadowed or the shadowed load hits in L1), RCMP acts as a conventional load. All operands of the respective load and the starting address of its slice form the operands of the RCMP. An RTN instruction (similar to a procedure return in nature) demarcates the end of each slice and returns control to the instruction following the RCMP. Before the return takes place, the recomputed value is provided to the consumers of the respective load, in the same way as if the load was actually performed (i.e., by passing the value in a physical register).

As explained by Akturk and Karpuzcu [15], recomputation is possible, even if the compiler cannot prove that all input operands of terminal instructions correspond to immediate or live register values at the anticipated time of recomputation, by keeping such input operands (e.g., overwritten register values) in a dedicated buffer. For any operand of this sort, a REC instruction is inserted directly after the instruction producing the value of the operand. REC takes as operands the destination register of the previous instruction and an integer operand, leaf-address, which points to the address of the corresponding terminal instruction in the slice. REC practically checkpoints the input operand to a dedicated buffer.
ISER implements the shadow-tracking technique proposed by Sakalis et al. [7]. The shadow tracking consists of a shadow buffer (SB) that acts as a circular buffer similar to the reorder buffer (ROB). When a shadow-casting instruction enters the ROB, a new entry is allocated at the tail of the SB. Every load that enters the ROB checks the SB and, if it is not empty, an entry is allocated in a release queue that associates the load with the youngest entry in the SB (i.e., its tail). The load remains speculative as long as the head of the SB is marked as unresolved and not equal to the SB entry associated with the load. This mechanism performs a simple comparison between the head of the release queue and the head of the shadow buffer to identify when loads exit all their shadows, thus avoiding the need for costly content-addressable memory (CAM) searches.

On top of this, as depicted in Figure 1b, ISER uses a few small buffers that serve two main purposes: (1) keeping µ-architectural state intact during recomputation; (2) making slice instructions and operands available at the time of recomputation. The Scratch-File (SFile) acts as a small physical register file during recomputation. Specifically, while recomputation is in progress, all data flows through the SFile. Thereby, ISER preserves µ-architectural state during recomputation. No structure beyond the SFile is necessary in this case, as no memory access instruction is permitted in a slice. The Rename logic translates (architectural) register references of slice instructions to SFile entries. To this end, ISER can re-use the rename logic of a conventional out-of-order processor with the addition of a small dedicated set of rename tables.
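For concreteness, the shadow-tracking check described at the start of this section reduces to a comparison of two queue heads. The sketch below is our own illustration (field names and sizes are invented; the real design is described by Sakalis et al. [7]):

```c
#define SB_SIZE 64                 /* illustrative capacity             */

typedef struct {
    unsigned head, tail;           /* circular-buffer indices           */
    int resolved[SB_SIZE];         /* per entry: has this shadow lifted? */
} shadow_buffer;

/* Each release-queue entry associates a load with the SB entry that was
 * youngest when the load entered the ROB. */
typedef struct { unsigned sb_entry; } release_entry;

/* The load at the release-queue head stays speculative until the SB head
 * has resolved and reached the load's associated entry.  Only the two
 * heads are compared, so no CAM search is needed. */
int load_is_speculative(const shadow_buffer *sb, const release_entry *re)
{
    if (sb->head == sb->tail)
        return 0;                  /* SB empty: no shadows outstanding  */
    return !(sb->resolved[sb->head] && sb->head == re->sb_entry);
}
```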
The Instruction Buffer (IBuff) caches slice instructions in order to avoid unnecessary pressure on the instruction cache. Each entry of IBuff corresponds to a recomputing instruction. Fetch logic fills IBuff, while IBuff feeds the rename logic. For each slice where the input operands of terminal instructions represent immediate or live register values, no additional buffering is necessary. Otherwise, ISER keeps the input operands (such as overwritten register values) for each terminal instruction in a dedicated buffer called the History Table (Hist). The address (leaf-address) and the (non-constant, non-live) input operands of a terminal instruction constitute each Hist entry. (Instead of a compiler, the same job can be performed by dynamic binary instrumentation at run time, albeit with probably inferior alias analysis but more dynamic information, rendering recompilation unnecessary in deployments where recompilation is not an option.)

Figure 2: Illustration of (a) slice identification; (b) slice generation; and (c) VRC-enabled code.
For ISER, an RCMP always translates into a branch on L1 miss for speculative loads. As shown in Figure 1b, for each RCMP instruction encountered, ISER first checks whether the corresponding load is speculative, and if so, whether it misses in L1. Here we define an L1 miss as (i) the cache block does not reside in the L1 cache and (ii) there is no MSHR entry for that cache block. If an MSHR already exists, it is safe to take advantage of the existing MLP and service the load as soon as the older load is completed (i.e., the load that caused the MSHR to be allocated). ISER triggers recomputation for any shadowed load that misses in L1. An RCMP instruction will always produce a value (either loaded or recomputed), so for each RCMP a physical register is allocated by the conventional renaming mechanism.

On a speculative L1 miss, ISER jumps to the entry point of the corresponding slice and starts fetching instructions. Inputs to a slice instruction can come from (i) live register inputs, (ii) live values stored in Hist, or (iii) temporary values written to the SFile by an older slice instruction. Live registers are read directly from the physical register file using the conventional renaming tables. Architectural registers written by slice instructions are mapped to the SFile registers using the rename logic, similar to how conventional renaming would map them to the physical register file. Potential values stored in Hist are referenced using the address of the slice (leaf) instruction. Instructions are fetched until hitting RTN, which copies the produced value from the SFile to the physical destination register and wakes up the consumers of the recomputed value. The RCMP instruction is then committed as any other instruction, without further delays.

Figure 2 provides an illustrative example. Figure 2(a) shows a pseudo-code excerpt, where we want to create a backward slice for the stored data sumArr, which will later be (speculatively) loaded (line N + 1). The instructions within boxes in Figure 2(a) are involved in the calculation of sumArr and are identified by the compiler (notice that the store instruction for sumArr is not part of the slice but informs us about the memory address of the corresponding value). Figure 2(b) shows the resulting backward slice (i.e., only the instructions involved in generating the value of sumArr). In this illustration, we assume that the inputs to the slice, i and j, are stored in Hist by REC instructions that associate the inputs with the leaf instructions at addresses S + 1 and S + 2, respectively, as shown in Figure 2(c). Notice that the slice does not contain any control-flow instruction; thus, the while loop used in Figure 2(a) to generate the value of sumArr is unrolled in Figure 2(b). Following the semantics explained earlier, the RCMP instruction at address M + 1 in Figure 2(c) replaces the ordinary load instruction. Recall that RCMP works as an ordinary load instruction if it hits in L1. However, if it misses in L1, RCMP jumps to the entry point of the corresponding slice (which is at address S in Figure 2(b)), and thereby avoids any access to the lower levels of the memory hierarchy. After jumping to the slice entry point, the inputs to the slice which were recorded earlier can be read from Hist (by the instructions at addresses S + 1 and S + 2 in Figure 2(b)), and the desired output can be recalculated by fetching and executing the instructions in the slice. Notice that recArr is used as a temporary placeholder for the recomputed value (allocated in the SFile), to keep the content of the memory address of sumArr intact (i.e., recomputation causes no side effect or change in the existing architectural state of the ongoing computation). Finally, the intended value of sumArr is recomputed and returned by the RTN instruction (by copying recArr to the destination of the load instruction). Then, the control flow jumps back to the next instruction following the RCMP in Figure 2(c). The instructions contained in boxes in Figure 2(c) are extra instructions added into the binary to facilitate VRC.
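The comment block below is a best-effort reconstruction of the Figure 2 example from the description above (the loop body, names, and addresses follow the prose; the original figure may differ in detail):

```c
/* (a) Original code -- [slice] marks the boxed producers of sumArr:
 *
 *          sumArr = 0;                // [slice]
 *          while (i < j) {            // control flow: NOT in the slice
 *              sumArr += i;           // [slice]
 *              i++;                   // [slice]
 *          }
 *          store sumArr               // identifies the address; not in slice
 *          ...
 *   N+1:   load  sumArr               // later, speculative load
 *
 * (b) Generated slice at address S (loop unrolled, branch-free):
 *
 *   S   :                             // slice entry point
 *   S+1 :  read i from Hist           // terminal; input checkpointed by REC
 *   S+2 :  read j from Hist           // terminal; input checkpointed by REC
 *   ... :  recArr = ...               // recompute into the SFile temporary
 *   S+n :  RTN                        // copy recArr to the load's physical
 *                                     // destination register and return
 *
 * (c) VRC-enabled code:
 *
 *   ... :  REC i, S+1                 // checkpoint inputs into Hist right
 *   ... :  REC j, S+2                 // after they are produced
 *   M+1 :  RCMP sumArr, S             // act as a load if unshadowed or on
 *                                     // an L1 hit; jump to S on a shadowed
 *                                     // L1 miss
 */
```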
Overhead: Latency and energy per recomputing instruction in a slice are no different than those of their non-recomputing, conventional counterparts. The only difference is that ISER executes these instructions using a dedicated instruction and data supply rather than the conventional instruction cache and physical register file/data cache.
Coverage: We cannot guarantee that all speculative loads missing in L1 have a corresponding slice. This may be due to complex producer-consumer chains, which cannot be expressed by a chain of arithmetic/logic instructions only, and/or slice inputs that cannot be guaranteed to be available during recomputation. Furthermore, some values, such as I/O, are not produced by the application and are impossible to recompute.
Locality: Any speculative load that misses in L1 and gets replaced with recomputation will never reach the memory hierarchy. As a result, subsequent memory requests to the same cache block become more likely to miss in the cache hierarchy as well. This adverse effect can easily degrade performance, but recomputation targeting such new misses may be able to recover some of the lost performance. We will discuss this effect further in the evaluation (Section 5).
Exception Handling (during Recomputation): Exception handling during recomputation should be rare, as recomputation simply re-executes a previously seen slice of instructions with equivalent inputs. However, in case an exception is raised, we revert to the Delay-on-Miss alternative: we simply wait until all the shadows have been lifted (the load is no longer speculative) and execute the load as normal.
Pipeline Integration: The only negative impact may be due to a potential increase in the pressure on execution units, as execution units are shared with the rest of the instructions. However, recomputing instructions along a slice (which form a dependency chain) are executed sequentially, one at a time. The impact would, therefore, be one additional instruction competing for the respective functional unit at a time. This can also be regarded as an opportunity to utilize the cycles (and functional units) that would otherwise be wasted on stalled instructions waiting on delayed loads.
ISER is based on the premise that we do not have to validate recomputed values: VRC is not a speculation (i.e., it is not a prediction). This is certainly the case for immutable values that we can safely recompute instead of fetching them from the memory hierarchy. As long as the compiler guarantees via alias analysis that recomputed loads access immutable values (from the time they were written by the corresponding producer), the approach is compatible with any consistency model and coherence protocol, simply because neither is needed to ensure correctness. We evaluate this case, which, however, restricts VRC coverage and limits the potential gains. (Naturally, we are not targeting mutable values, as successful VRC would likely be much less prevalent.)

Here, we sketch one approach for increasing coverage by relaxing the restrictions on slice formation, but the actual mechanisms are beyond the scope of this paper. Our aim is to show that there is significant untapped potential in this direction. In the evaluation we show the upper bound for such a potential approach with an oracle model.

The central question is what happens if it is not possible to statically ascertain the immutability of a load's value. In other words, what happens for recomputed values that are considered immutable but that might, with however small a probability, be changed by some unknown store. We refer to such values as mostly-immutable.

For mostly-immutable values, we still want to maintain the property that is essential for our purposes: that VRC is not a prediction that needs to be validated. Instead, what we want is to be able to make a simple binary decision: to recompute (if the value has not changed) or not (if the value has changed). In other words, we never validate VRC, but we expect that a store would prevent future recomputation of loads that access the same address. This implies that we must track any possible change to the data that could be accessed by recomputed loads.

For single-threaded applications, handling the recomputation of mostly-immutable values implies a mechanism to match the thread's own stores to the recomputed loads and invalidate the corresponding VRC slices when such matches are found. (We assume, for the single-threaded case, that we would not recompute loads that touch I/O space, which can be changed by a device without any of our own stores modifying that space.) To enable such a mechanism, the target address of the producer instruction is saved as a tag for the corresponding slice in the ISER structures. This tag can be matched by future stores to the same address, to invalidate the slice (and cancel recomputation) by invalidating, selectively or in bulk, the ISER structures. Since we expect this to be a rare occurrence (for what we choose to recompute), we can optimize for the case when it does not happen: producer tags (store target addresses) can be encoded in signatures (Bloom filters), and if a future store hits in a signature, the ISER structures and signatures are reset in bulk and need to be repopulated anew.

For multithreaded applications, this matching and invalidation of recomputation slices should be expanded to include stores from other threads besides the thread's own stores. This requires an additional “coherence” mechanism to detect remote writes even when there is no copy of the relevant cache line in the local cache. A solution can be based on an approach that serves a similar purpose: detecting remote writes in the absence of cached copies. Specifically, the Callback concept, introduced by Ros and Kaxiras [20], can serve as the substrate on which to build a solution. A callback simply says “notify me if someone writes to this address” and does not need cached copies that invite invalidations. Callback was introduced for synchronization, as an explicit request for an invalidation in the absence of coherence invalidations (or, more broadly, in the absence of sharing). Callback can be generalized to perform a similar role in our situation with regard to detecting changes to what we would otherwise consider immutable values. Similarly to the single-threaded case, tracked addresses can be encoded in signatures for efficient matching. Security implications of using callbacks (such as perhaps new side-channels enabled by the callback directories [21]) must also be addressed, in the same way as in the work of Yan et al., SecDir [22].

To conclude, we argue that VRC slices can be made coherent by explicitly detecting changes to what we would consider immutable values. Techniques for explicitly detecting writes without invalidations have been proposed in prior work [21, 20] and their adaptation to our purposes is feasible.
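As an illustration of the signature scheme mentioned above, the sketch below folds producer-store addresses into a Bloom filter so that a later store can conservatively trigger a bulk reset (the filter size and hash functions are our own choices, purely for illustration):

```c
#include <stdint.h>

#define SIG_BITS 1024

typedef struct { uint64_t bits[SIG_BITS / 64]; } signature_t;

static unsigned h1(uint64_t a) { return (a ^ (a >> 17)) % SIG_BITS; }
static unsigned h2(uint64_t a) { return (a * 0x9E3779B97F4A7C15ULL) % SIG_BITS; }

static void sig_set(signature_t *s, unsigned b) { s->bits[b / 64] |= 1ULL << (b % 64); }
static int  sig_get(const signature_t *s, unsigned b) { return (s->bits[b / 64] >> (b % 64)) & 1; }

/* Record the target address of a slice's producer store. */
void track_producer(signature_t *s, uint64_t addr)
{
    sig_set(s, h1(addr));
    sig_set(s, h2(addr));
}

/* Checked on every committed store: a hit (possibly a false positive)
 * resets the ISER structures in bulk; slices are repopulated anew. */
int store_hits_signature(const signature_t *s, uint64_t addr)
{
    return sig_get(s, h1(addr)) && sig_get(s, h2(addr));
}
```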
While the coherence approaches sketched above enable us to explicitly detect changes to mostly-immutable values and invalidate the corresponding VRC slice, here we discuss the order in which this would need to happen in relation to the consistency model of the baseline architecture. We use total store order (TSO) and release consistency (RC) as our prime examples, but our reasoning can be generalized to other consistency models. We use the term callback invalidation to distinguish it from a normal coherence invalidation, which may not be available when we have no cached copy of the corresponding data. The question here is: once a change is detected to a value that we are capable of recomputing, when exactly is VRC canceled?

If VRC occurs well in advance of the callback invalidation, it is safe in any consistency model, such as TSO or RC. By “well in advance” we mean that the recomputed load is retired from the reorder buffer. In this case, it is as if the corresponding load has seen the old value, well in advance of the change in the value. Once the callback invalidation reaches the core, there will be no further VRC of that load. Thus, we only need to clarify what happens when a callback invalidation and the corresponding VRC occur in a critical window where consistency rules could be violated.

In RC, VRC is safe between memory fences. (RC allows both loads and stores to be reordered, unless otherwise enforced by memory fences.) Callback invalidations received before an acquire memory fence must take hold and cancel VRC before crossing the fence.

In TSO, load–load reordering is not allowed to be observed. A recomputed load is considered performed, as we consider it equivalent to accessing the actual data. In a speculative implementation of TSO, a recomputed load would be speculative with respect to an older load that is not performed. In other words, a recomputed load can be in the M-Shadow of one or more older loads. A callback invalidation reaching the core while a recomputed load is still under an M-Shadow (i.e., one or more older loads are still not performed) should squash the recomputed load (and its dependents) and cancel further VRC.

To conclude, we argue that VRC is compatible with both TSO and RC by observing a correct ordering between callback invalidations and VRC.
ISER is based on slice formation, the replacement of corresponding loads with RCMP instructions, and the checkpointing of input operands with REC instructions. The question here is what happens if any part or all of the ISER infrastructure can be abused by an adversary. This is of course equivalent to hijacking the compiler or the dynamic instrumentation (or even the binary of an application, where the same security risks would apply). However, even under such assumptions, ISER still cannot leak information speculatively, which is the main goal of our work.

To see this, assume that the compiler is compromised and the attacker can make it do anything they want. We are still safe with respect to leaking information via speculative side-channel attacks for the following reasons:

1. VRC itself cannot be used to construct a speculative side-channel in the memory hierarchy because it does not perform any memory accesses at all.
2. VRC is only used if the load is already under a speculative shadow. Even if VRC recomputes a secret value, all future loads will be restricted under Delay-on-Miss.

To expand on (2), VRC only starts if the RCMP is under a speculative shadow. While VRC has access to input operands that may hold secrets, the recomputation slice cannot perform any memory accesses to leak those secrets, and the only way would be to pass the secret value to another (younger) load, which will also be speculative. Delay-on-Miss guarantees that the younger load cannot have any visible side-effects, preventing any information leakage. Essentially, VRC maintains the Delay-on-Miss invariant that only non-speculative loads are allowed to cause side-effects in the memory hierarchy. Therefore, we conclude that VRC is safe from cache and memory speculative side-channel attacks, no matter how compromised the compiler, the dynamic instrumentation, or the binary is.

In addition, the VRC structures are local to the core and cannot be observed by another core. While under speculation, the only changes allowed are ones that cannot be observed from the outside, such as writes to the SFile. Any other changes (e.g., to the IBuff and Hist) are buffered or squashed, i.e., they are only committed once the instruction causing the change is no longer speculative. Furthermore, if SMT is present, the VRC structures can be partitioned where necessary, to avoid contention attacks between SMT threads. It should be mentioned that if SMT is present, since the slices use the functional units (FUs) of the core, it is possible to perform an FU-contention attack. However, such attacks are outside the scope of Delay-on-Miss and this work, and are possible with or without VRC. Thus, VRC does not open up any new attack opportunities under our current threat model. Note also that disabling SMT has been recommended by vendors (e.g., Microsoft [23]) as a measure against several attacks.

Table 1: The simulated system parameters.

Technology node: 22 nm
Processor type: out-of-order x86 CPU
Processor frequency: 3.4 GHz
Issue / Execute / Commit width: 8
Cache line size: 64 bytes
L1 private cache size: 32 KiB, 8-way
L1 access latency: 2 cycles
L2 shared cache size: 1 MiB, 16-way
L2 access latency: 20 cycles
Value predictor: VTAGE
Value predictor size: 13 components × 128 entries
We use a Pin-based tool [24] to identify and annotate recomputation slices. For practical reasons, we limit the maximum slice size during construction to 100 instructions (which represents a loose upper bound in practice). The annotated slices, together with the original binary, are fed to the gem5 [25] simulator, where the shadows, Delay-on-Miss, and VP have been implemented as described in the Delay-on-Miss work by Sakalis et al. [7]. In gem5, we begin by fast-forwarding through the first one billion instructions of the application and then simulate in detail for another billion. We use McPAT [26] with CACTI [27], as well as the dynamic DRAM energy provided by gem5, to calculate the energy breakdown of the system. The configuration used for the simulations is shown in Table 1. We evaluate the following versions:
Baseline: An unsecured out-of-order CPU.

DoM: Delay-on-Miss without any value prediction or recomputation. This is considered the secure baseline.

VP: DoM with an added VTAGE value predictor.

VRC: DoM with added value recomputation. This is the solution we are proposing. This version does not include callbacks; only immutable values are recomputed.

VRC (2 cycles): Same as the VRC version, but we have artificially limited the latency of every slice to at most two cycles. We have also limited the number of instructions needed for the recomputation accordingly. As all VP versions take at most two cycles per prediction in our implementation, this VRC version enables an iso-performance comparison with the VP variants.

Oracle VP: Same as the VP version but with an oracle predictor capable of correctly predicting all speculative L1 misses. Even though the predictor is perfect, its results are still validated once the loads have been unshadowed.

Oracle VRC: Same as the VRC (2 cycles) version but with an oracle compiler capable of recomputing all speculative L1 misses. Note that this is the oracle with regard to VRC coverage, not performance. We discuss the implications of recomputing all speculative L1 misses in the evaluation, Section 5.

For the sake of brevity, the last three versions are only shown in the performance (IPC) results and are excluded from the rest of the figures. We evaluate all these different versions using the SPEC2006 benchmark suite [28], with the reference inputs, as in previous work [7]. For one of the benchmarks,
GemsFDTD, none of the techniques we tried produced any improvement. GemsFDTD is a floating-point benchmark that is dominated by overlapping C-Shadows. It achieves only about 20% of the baseline performance with DoM (also corroborated by Sakalis et al. [7]). In our work, we were unable to achieve any improvement with either VP or VRC because of near-zero coverage. In contrast, it shows an impressive improvement with an oracle VRC (100% coverage)—however, this may be impractical to attain. Energy results follow the same pattern, either showing high energy consumption (a multiple of the baseline) with all the techniques we tried, or lower than the baseline with the VRC oracle. We surmise that GemsFDTD performs badly, in general, under any “delay” technique (including NDA [8] and STT [9]). Unfortunately, it is not included in these works to allow for comparisons. Because GemsFDTD represents such a special case for delay techniques, we believe that further work is required to specifically address its shortcomings. For these reasons, we point out its idiosyncrasy here, instead of discussing it with the rest of the benchmarks.

Figure 3: The coverage of VP and VRC, i.e., the ratio of shadowed L1 misses that can be predicted or recomputed instead of being delayed (bars). Also depicted on the same plot is the L1 miss ratio for both versions (circles/crosses).

Figure 4: The mean latency for recomputing a shadowed L1 miss.
The coverage of VRC can be seen in Figure 3, together with the VP coverage. We can immediately observe that, on average, VRC has higher coverage of speculative L1 misses than VP. A notable example is mcf, which is one of the worst-performing benchmarks with DoM (Section 5.2). On the other hand, lbm is a counter-example, where we have almost zero VRC coverage. This, however, does not affect the performance negatively, as lbm does not suffer from any performance penalties even with plain DoM. In the same figure, we have also superimposed the cache miss ratio for both versions. We only predict or recompute L1 misses, so the miss ratio is needed in conjunction with the coverage to infer the percentage of loads in the application that are being predicted or recomputed. More detailed L1D miss data can be found in Figure 5. Note how, as discussed in Section 3.4, VRC increases the miss ratio.

With VP, all loads that can be predicted are predicted in the same amount of time (two cycles in our setup), but the same is not true for VRC, where the latency depends on the slice length and the instructions it contains. In Figure 4 we can see the mean recomputation latency for each benchmark, as well as the overall mean. In all cases, VRC requires more cycles than VP to recompute a value, with a mean of seven cycles per slice. However, as we will see in Section 5.2, this does not impact the performance significantly.
Figure 5: L1D miss ratio for Delay-on-Miss with VP and VRC.

Figure 6: Performance (IPC – higher is better) normalized to an unsecured OoO baseline.

Figure 6 contains the number of committed instructions per cycle, normalized to the unsecured baseline processor. Delay-on-Miss without VP or VRC, which is our secure baseline, performs at 88% of the unsecured baseline, similar to the results reported by Sakalis et al. [7]. The benchmarks that incur the biggest hit in performance are mcf, followed by milc, cactusADM, and libquantum. Out of these benchmarks, three (mcf, milc, and libquantum) have high LLC MPKI, but that in itself is not the only factor, as other benchmarks (e.g., lbm) also have a high MPKI. Instead, the cost of Delay-on-Miss also depends on the amount of MLP that the benchmarks exhibit; the more MLP that is taken advantage of in the baseline, the higher the performance loss.

If VP is introduced, then the performance is similar. This result contradicts the results given by Sakalis et al. [7], where VP gives a significant performance advantage (we contacted the authors and verified that our results are indeed valid). The reason that VP does not offer a significant advantage is that VP itself is speculative: when a value is predicted, it still needs to be validated at a later point. By predicting the value, a small amount of parallelism (ILP) can be exploited during execution, but the slow L1 misses still need to be satisfied for the validation. Due to the high number of speculative shadows, validations become serialized and are not able to take advantage of any MLP that might be found in the application. In essence, VP pushes the cost of delaying speculative loads from the execution stage to the validation stage, but it does not eliminate it. This can be seen in the Oracle VP results, where even a 100% prediction rate (i.e., all shadowed L1 misses are successfully predicted) only leads to a marginal performance improvement of one percentage point.

The same is not true for VRC: once a value has been recomputed, it does not need to be validated, meaning that the cost of delaying a long-latency miss is eliminated and no serialization is enforced. While VRC does not increase the amount of MLP that can be taken advantage of, it does eliminate some of the need for it. Overall, VRC performs at 93% of the unsecured baseline, decreasing the performance cost of Delay-on-Miss by more than one third. The benchmark with the most dramatic performance increase is mcf, which is the worst-performing benchmark for Delay-on-Miss. VRC reduces mcf's performance cost to one fifth of that under Delay-on-Miss.

We have also evaluated an artificial version of VRC where we keep the same slice coverage but reduce the cost of the slices to at most two cycles. This version exhibits almost identical performance to the real VRC, with a mean performance difference of half a percentage point. This strongly indicates that, instead of trying to keep the cost of the slices low, it is more important to increase the coverage, even if large slices are required. This is further corroborated by the results from the Oracle version, discussed below. However, large slices do increase the energy usage, as we will see in Section 5.3, so a balance still needs to be kept.

If we introduce an Oracle VRC that can recompute all shadowed L1 misses, the difference between the VP and the VRC approaches becomes even more apparent. Both Oracle versions have 100% coverage and the same latency; the only difference is that with VP the loads need to be validated when they are unshadowed, while with VRC they are completed as soon as the value has been recomputed. While, as we have seen, the VP Oracle can only achieve marginal improvements over the non-Oracle version, the VRC Oracle is able to outperform even the baseline, including on benchmarks such as mcf, cactusADM, and libquantum. Of course, such an Oracle is unrealistic, but it does support our argument that the limiting factor for VP is the cost of validation.

However, it is worth noting here that a 100%-coverage VRC does not necessarily guarantee that the performance will exceed that of the baseline. In fact, there are four benchmarks where the Oracle VRC is slower than the baseline: bwaves, milc, leslie3d, and lbm. Out of these, the bwaves and lbm VRC Oracle is also slower than DoM. There are various factors that contribute to this result: in bwaves and leslie3d the L1 and L2 miss ratios (not shown) increase significantly with the Oracle; in milc the Oracle increases the number of write misses in the L1 (not shown), as well as the average write-miss latency (not shown); finally, in lbm a combination of many factors contributes to worse cache performance. The problem is that, even with 100% coverage, not every single memory access is recomputed: stores, non-speculative loads, and speculative L1 misses that hit in the MSHRs are still served by the memory hierarchy. By recomputing the rest of the loads, which account for the majority of the L1 misses, the Oracle VRC disrupts the normal operation of the cache and the prefetcher, resulting in performance losses. Essentially, there is a trade-off between the benefits of eliminating long-latency L1 misses and the cost of disrupting the normal cache operation. For the majority of the benchmarks, this trade-off leans towards the benefits, but not for all of them. Future work aiming to increase VRC coverage must account for these factors to achieve optimal performance.

Figure 7: Energy usage, where each bar consists of four parts (from bottom up): the bottom, light-colored part is the dynamic energy of the CPU; the middle, dark-colored one is the static energy of the CPU; the middle light part is the DRAM energy, including refresh and power-down energy; and the top dark part is the overhead of VP and VRC, both static and dynamic.
Energy, in our case, is affected by three main factors: the execution time/performance, the number of accesses in the memory hierarchy (especially the DRAM), and the cost of predicting (VP) or recomputing (VRC) a value. Figure 7 shows, starting from the bottom, the dynamic (bottom, light color) and static (middle, dark color) energy of the CPU, the total DRAM energy (middle, light), and, finally, the overhead (if any) of VP and VRC (top, dark). Overall, Delay-on-Miss and VP increase the mean energy usage over the unsecured baseline, while VRC increases it by a smaller margin. The dynamic energy of the CPU (excluding the overheads) remains mostly the same across all versions; instead, it is the static, DRAM, and overhead energy that changes.

Static energy is affected because the execution time is affected. This is most obvious in mcf, the application with the worst DoM performance, followed by milc. None of the evaluated solutions affect the LLC MPKI significantly (not shown), so the increase in DRAM energy is not due to an increase in the number of accesses but due to other operations such as refresh and power-down states. These operations do depend on the access patterns, but they also depend on the execution time, similar to the static energy usage of the system.

On the other hand, the overheads introduced by VP and VRC are affected both by the execution time (static energy) and by the operations performed. This is particularly visible in the case of VRC, where the majority of the overhead is due to the instructions of the slices. As we have discussed in Section 5.2, smaller slices do not lead to better performance, but the same is not true for the energy costs. Instead, a balance between coverage (which increases the performance) and slice length (which increases the energy usage) needs to be achieved.

Out of all the benchmarks, the ones with the highest (relative to the baseline) energy usage are milc, gromacs, and libquantum; the rest of the benchmarks have smaller energy overheads. milc is the benchmark with the worst performance, so part of the energy increase is due to static and DRAM energy. It also has high VRC coverage, as well as the third most expensive (in cycles, on average) slices among all the benchmarks, which increases the VRC overhead energy. On the other hand, gromacs's performance comes very close to the baseline, but it has the second most expensive slices, while also having high coverage. Finally, libquantum also sees an increase in execution time and, by extension, energy usage. The next-highest energy increase over the baseline is seen in mcf, but this is far better than under DoM, with or without VP.

Thus far, related security proposals exert a toll on performance and/or increase cost/complexity. In ISER, as well, microarchitectural support for VRC increases hardware complexity, but only slightly: slices differ in length, but here we conservatively assume that all would be as long as the maximum-length slice we observe across all benchmarks. In this case, 22 KiB suffices to accommodate all “live” slices in Hist, which represents the largest structure. This is similar to the storage overhead of the VTAGE value predictor we use for the VP configurations. Furthermore, static loads that need to be recomputed at runtime are few, so the overhead in the binary is small: <3% across all applications. Finally, as we pointed out throughout the evaluation, since our conservative VRC implementation leaves many optimization opportunities untapped, it still has potential for even further improvement.
The architecture community promptly proposed a number of techniques (starting with the ground-breaking InvisiSpec work [11]) to prevent disclosure gadgets from revealing secrets. The techniques fall into one of the three broad categories shown below, but each individual proposal makes different assumptions as to the threat model (type of speculative shadows covered) and the prevention of information leakage (disclosure gadgets). It is obvious that at this point no direct comparison is possible, but we make an effort to compare the solutions qualitatively.
Hide&Replay:
Perform speculative memory accesses in a manner that does not perturb any µ-architectural state in the memory system; subsequently, perform a replay of the access (when it becomes non-speculative) to effect the correct changes in the µ-architectural state [29, 30, 11, 12, 31]. InvisiSpec (Yan et al.) [11] and Ghost loads (Sakalis et al.) [12] were the first such proposals. Hide&Replay techniques, as the first to be proposed, incur a significant performance cost (and a moderate implementation cost). They only protect against information leaks via the memory hierarchy (and not even all of it, as DRAM leaks are possible [32]). On the other hand, both of these techniques were designed to protect against attacks on any possible speculation primitive, i.e., they cover all the speculative shadows mentioned above. A recent work, InvarSpec [33], relies on compile-time analysis to identify instructions that may become non-speculative during execution (i.e., speculation invariant). The protection scheme used for these speculative instructions can be lifted at runtime, thus reducing the performance overhead associated with speculation-related protection mechanisms in hardware. The reported performance improvement from such HW-SW co-design, however, cannot reach the negative overhead of ISER. Instead, as it takes an orthogonal approach, InvarSpec can be used in conjunction with ISER to further improve performance while also reducing the size of the structures needed for recomputation, by reducing the number of loads that trigger recomputation.
Delay:
Delaying speculative changes in µ-architectural state until execution is non-speculative. Sakalis et al. proposed to delay loads that miss in the L1 (Delay-on-Miss) until they are non-speculative [7, 14]. This delays any µ-state change in the memory hierarchy. A different form of delay (such as NDA, proposed by Weisse et al. [8]) is to prevent speculative data propagation by delaying dependent instructions from executing with speculative inputs [8, 9, 10, 34, 35, 36]. Delay-on-Miss protects against all speculative shadows (i.e., any possible "Speculation Primitive") but delays only changes in the memory hierarchy (including DRAM); its issue-time decision, and how VRC extends it, is sketched below. Subsequent work that delays speculative propagation of data [8] achieves good performance by protecting against any µ-state changes (i.e., a much larger gamut of "disclosure gadgets" than just the memory hierarchy) but responds only to C-Shadows, i.e., control speculation primitives. Another similar alternative, STT [9], also protects against other shadows (referred to as the "Futuristic" model), but at a higher performance cost. In a recent publication, STT has been extended to utilize speculation as well, referred to as "speculative data-oblivious speculation – SDO" [37], in order to replace the potentially leaky speculative paths with secure, data-independent paths. This approach is similar to the approach that ISER takes, only ISER is non-speculative and does not require any verification or squashing, further reducing the runtime overhead. Tran et al. propose a SW-HW extension that can reduce the time during which loads are shadowed (i.e., are speculative) and can thereby increase the MLP [17]. Their proposal includes instruction reordering to prioritize calculations that minimize the speculation window, such as target address computation of memory accesses and resolution of branch conditions. Much like InvarSpec, their approach may reduce the performance overhead of delay-based security solutions by reducing the number of speculative loads or the time spent in speculation, and it is orthogonal to our proposal. Both approaches can be combined to offer better security coverage with minimal performance overhead. SPECCFI [38] uses Control-Flow Integrity (CFI) to prevent Spectre-type attacks that abuse illegal control flow during speculative execution. Not all possible speculative side-channel attacks are covered by this technique but, much like the other compiler-based techniques we have discussed, it can be used in conjunction with our technique to limit the cases where recomputation is needed.
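The sketch below contrasts baseline Delay-on-Miss with the VRC extension at the moment a load is about to issue. It is a minimal illustration under stated assumptions: the class, field, and function names are hypothetical, not the paper's actual microarchitectural implementation.

```python
# Minimal sketch (illustrative assumptions only) of the issue-time
# decision under Delay-on-Miss, with the VRC extension of this paper.

from dataclasses import dataclass

@dataclass
class Load:
    under_shadow: bool  # still covered by some speculative shadow
    l1_hit: bool        # the access would hit in the L1
    has_slice: bool     # a recomputation slice exists for this load

def issue_decision(load: Load) -> str:
    # Non-speculative loads proceed normally; speculative L1 hits are
    # serviced as well (with hit side effects suppressed, per DoM).
    if not load.under_shadow or load.l1_hit:
        return "execute"
    # VRC: regenerate the value locally via its slice. The result is
    # correct by construction, so no later validation or squash is
    # needed and core resources (e.g., ROB entries) release as usual.
    if load.has_slice:
        return "recompute"
    # Baseline DoM (or VP, whose prediction must still be validated
    # once non-speculative): hold the miss back until it is safe.
    return "delay"

print(issue_decision(Load(under_shadow=True, l1_hit=False, has_slice=True)))
```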
Cleanup:
Perform a speculative change in µ-architectural state but then undo it if speculation is squashed. In the first such proposal, CleanupSpec, by Saileshwar et al. [13], the undo is expensive, so its application is restricted to the L1 cache. The rest of the memory hierarchy (L2, LLC, and coherence directory) is assumed to be protected in other ways, including randomization and delaying of coherence state changes, but DRAM row buffers still remain a security hole. Cleanup techniques only protect the L1, assuming—at a cost—that the rest of the hierarchy (excluding DRAM) is protected otherwise [13].
(Generic) Recomputation:
Amnesiac [15] introduces a µ-architecture for recomputation that differs from ISER in the way slices are generated and used. The goal of Amnesiac is to replace as many energy-hungry loads as possible with recomputations of the respective data values. In contrast, ISER recomputes slices selectively, such that recomputation is triggered only for shadowed loads that miss in the L1.
Kandemir et al. proposed a recomputation-based approach to reduce off-chip memory space in embedded processors [39]. Koc et al. investigated how recomputation of data residing in memory banks in low-power states can reduce energy consumption [40], and devised compiler optimizations for scratchpads [41] that are limited to array variables. The dual of recomputation, memoization [42, 43], replaces computation with table look-ups of pre-computed values (for the ones that are frequent and expensive to recompute). Memoization can mitigate the communication overhead as long as table look-ups are cheaper than long-distance data retrieval, but it is only effective if the respective computations exhibit significant value locality. Therefore, memoization and recomputation can complement each other in boosting energy efficiency.
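As a small illustration of this duality, the sketch below contrasts the two sides using Python's standard cache decorator; it is illustrative only, and the computed expression is arbitrary.

```python
# Illustrative sketch of the memoization/recomputation duality:
# memoization trades storage (a look-up table) for computation,
# recomputation trades computation for storage/communication.

from functools import lru_cache
import math

@lru_cache(maxsize=1024)          # memoization: cache pre-computed values;
def memoized(x: float) -> float:  # pays off only with high value locality
    return math.sin(x) ** 2 + math.cos(x) ** 2

def recomputed(x: float) -> float:
    # recomputation: always re-derive the value, so nothing needs to be
    # stored or transferred through the memory hierarchy
    return math.sin(x) ** 2 + math.cos(x) ** 2

assert memoized(0.5) == recomputed(0.5)
```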
Idempotent Processors [44] execute programs as a sequence of compiler-constructed idempotent (i.e., re-executable without any side effects) code regions. As the name suggests, idempotent regions regenerate the same output regardless of how many times they are executed with a given program state. Idempotent regions are generally larger and therefore tend to incur a higher recomputation overhead, whereas VRC employs fine-grain data recomputation (along a short, independent slice for each value), where each slice contains only the instructions necessary to generate a value. Accordingly, slices for VRC may provide more flexibility than idempotent regions.
Elnawawy et al. demonstrated the applicability of recomputation to loop-based code [45] to reduce checkpointing overheads. In their proposal, a whole loop is (re)executed during recovery, where only the initial state of the loop needs to be checkpointed. The loops may contain extra computations that are not relevant to the production of the value to be recovered. Compared to such coarse-grain recomputation, slice-based recomputation does not contain any irrelevant instructions. Also, slices used for VRC do not contain load instructions, as opposed to [45]; and recomputation applies outside of loops, providing wider applicability. To summarize, although value recomputation has been explored in different contexts before, to the best of our knowledge, none of the prior works has evaluated recomputation in the context of security.
Slice Generation:
Automatic creation of VRC slices in hardware is complicated because we are not after the slice of the load to be replaced (which could be created by existing techniques like IBDA [46]) but the slice of the corresponding store creating the value. This would require tracking all stores and their slices and somehow matching these with a (speculative) load missing in the L1 cache. Srinivasan et al. [47] generate "forward" slices for loads that miss in the LLC. This is easier to do in hardware, since the dependency tracking starts with the producer (i.e., the load that misses in the LLC) and the consumers (following use-def chains) execute after the producer. In our case, however, we have to identify "backward" slices – i.e., the producers, not the consumers, of a value that will be loaded – where all the producers were executed before the load itself. Such backward dependency tracking would likely require expensive bookkeeping in hardware; a software-side sketch of the backward walk is shown below.
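The following is a minimal sketch of such a backward def-use walk over an instruction trace, assuming a toy three-address representation. All names and the trace format are illustrative assumptions, not the paper's actual slicing tool; it does reflect the constraint stated above that VRC slices must not contain loads.

```python
# Illustrative backward-slice extraction over a toy instruction trace.
# Each instruction is (dest_reg, opcode, [src_regs]).

def backward_slice(trace, value_reg, start_idx):
    """Collect the producers of `value_reg` as of trace[start_idx].

    Walks def-use chains backwards from the register whose value the
    store writes to memory. Returns None if the slice would contain a
    load, since VRC slices must be free of memory reads.
    """
    needed = {value_reg}
    slice_insts = []
    for dest, op, srcs in reversed(trace[:start_idx + 1]):
        if dest in needed:
            if op == "load":
                return None          # slice ineligible for VRC
            needed.discard(dest)     # this definition is now covered...
            needed.update(srcs)      # ...but its inputs must be produced
            slice_insts.append((dest, op, srcs))
    return list(reversed(slice_insts))

# Toy trace: r3 = r1 + r2, where r1 and r2 are immediates.
trace = [("r1", "li", []), ("r2", "li", []), ("r3", "add", ["r1", "r2"])]
print(backward_slice(trace, "r3", start_idx=2))
```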
Delay techniques aim to hide the effects of transient execution by simply delaying instructions until they become non-speculative. Whether delaying loads that miss in the L1, as Delay-on-Miss does, or delaying the propagation of speculative data to dependent instructions, as NDA and STT do, delay techniques exact a heavy toll in performance, in direct relation to the set of speculative shadows they protect against. Delay techniques would be at an impasse with respect to improvement if we could not regain some of this lost performance in some other way. To this end, value prediction, invisible from the outside, was initially proposed as a solution.
However, value prediction is not the right abstraction for recovering lost performance in Delay-on-Miss. This is not because of coverage or accuracy but because value prediction is just another form of speculation that needs to be validated. Validation limits the potential benefits to the point where even an oracle VP (100% coverage and accuracy) does not do any better than a practical VP. In our evaluation we found that, no matter how good, VP is limited to just one percentage point of improvement over Delay-on-Miss.
Instead, we propose another, non-speculative, abstraction to regain performance for delay techniques, and in particular for Delay-on-Miss. We propose to use recomputation, which yields correct values—not predictions—as the key to overcoming the performance limitations of Delay-on-Miss. We describe the architecture, we evaluate it using a practical approach to generating recomputation slices, albeit with modest coverage, and we exceed the performance of Oracle VP (93% vs. 90% of the unsecured baseline) with lower energy usage. Finally, we discuss the potential for increasing the coverage of recomputation with future architectural support. Because, as we show, oracle recomputation easily exceeds even the performance of the unmodified (unsecured) baseline, this direction provides tangible motivation for researching such techniques for a future secure processor.
To regain the performance lost in securing the memory hierarchy, we need to identify methods that improve the MLP. This paper demonstrates, for the first time, value recomputation's unique ability to overcome the MLP restriction that is inherent in VP when applied to the Delay-on-Miss technique. To the best of our knowledge, no previous study on recomputation has considered any security impact. Finally, these findings should be considered in the context of our representative threat model (Section 2.2). In the end, no threat model can cover all possible security vulnerabilities. But, as explained in Section 3.7, ISER does not introduce any new attack opportunities under the provided threat model. That said, like any technique that affects control-flow timing – including value prediction or even Delay-on-Miss, to name a few – recomputation may give rise to timing channels, where the information to be leaked gets encoded in timing differences between various microarchitectural events.
Even if a given value is recomputed multiple times throughout execution, since resource contention and speculation can easily change the timing of microarchitectural events non-deterministically, there is a very good chance that recomputation rather obfuscates control-flow timing. Specifically, provided that (i) a slice is executed only upon an associated L1 miss; (ii) each slice may have not only a different number but also a different composition of arithmetic/logic instructions, with each instruction featuring a different number of operands (all of which affect how much time it takes to process each slice instruction through the SFile); (iii) none, one, or more Hist accesses may occur per slice execution; and (iv) due to the small footprint of such microarchitectural buffers, their access times are relatively short, identifying a unique timing signature for each slice encountered throughout execution (in order to associate slice timings with values) would not be easy – even more so under speculation and resource contention.
To conclude, potential timing channels, if present at all, would not necessarily be straightforward to exploit. In fact, recomputation is more likely to result in control-flow obfuscation. We leave the exploration of such effects to future work, confining the analysis in this paper to memory side-channels only, because they are easier to exploit and can be exploited across cores. This does not imply that side-channels such as those based on functional-unit contention are not possible; they are just outside the scope of our threat model.
References
[1] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre attacks: Exploiting speculative execution. In IEEE Symposium on Security and Privacy, New York, NY, USA, May 2019. IEEE.
[2] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown, January 2018.
[3] Daniel J Bernstein. Cache-timing attacks on AES, 2005.
[4] Yuval Yarom and Katrina Falkner. FLUSH+RELOAD: A high resolution, low noise, L3 cache side-channel attack. In USENIX Conference on Security Symposium, pages 719–732. USENIX Association, 2014.
[5] Fangfei Liu, Yuval Yarom, Qian Ge, Gernot Heiser, and Ruby B. Lee. Last-level cache side-channel attacks are practical. In IEEE Symposium on Security and Privacy, pages 605–622, New York, NY, USA, May 2015. IEEE.
[6] Gorka Irazoqui, Thomas Eisenbarth, and Berk Sunar. Cross processor cache attacks. In ASIA Conference on Computer and Communications Security (ASIACCS), pages 353–364, New York, NY, USA, 2016. ACM.
[7] Christos Sakalis, Stefanos Kaxiras, Alberto Ros, Alexandra Jimborean, and Magnus Själander. Efficient invisible speculative execution through selective delay and value prediction. In International Symposium on Computer Architecture (ISCA), pages 723–735, New York, NY, USA, 2019. ACM.
[8] Ofir Weisse, Ian Neal, Kevin Loughlin, Thomas F Wenisch, and Baris Kasikci. NDA: Preventing speculative execution attacks at their source. In International Symposium on Microarchitecture, pages 572–586, New York, NY, USA, 2019. ACM.
[9] Jiyong Yu, Mengjia Yan, Artem Khyzha, Adam Morrison, Josep Torrellas, and Christopher W. Fletcher. Speculative Taint Tracking (STT): A comprehensive protection for speculatively accessed data. In International Symposium on Microarchitecture, pages 954–968, New York, NY, USA, 2019. ACM.
[10] Jacob Fustos, Farzad Farshchi, and Heechul Yun. SpectreGuard: An efficient data-centric defense mechanism against Spectre attacks. In The Design Automation Conference (DAC), pages 1–6, Las Vegas, NV, USA, 2019. ACM.
[11] Mengjia Yan, Jiho Choi, Dimitrios Skarlatos, Adam Morrison, Christopher W. Fletcher, and Josep Torrellas. InvisiSpec: Making speculative execution invisible in the cache hierarchy. In International Symposium on Microarchitecture, pages 428–441, New York, NY, USA, October 2018. IEEE.
[12] Christos Sakalis, Mehdi Alipour, Alberto Ros, Alexandra Jimborean, Stefanos Kaxiras, and Magnus Själander. Ghost loads: What is the cost of invisible speculation? In ACM International Conference on Computing Frontiers, pages 153–163, New York, NY, USA, 2019. ACM.
[13] Gururaj Saileshwar and Moinuddin K Qureshi. CleanupSpec: An undo approach to safe speculation. In International Symposium on Microarchitecture, pages 73–86, New York, NY, USA, 2019. ACM.
[14] C. Sakalis, S. Kaxiras, A. Ros, A. Jimborean, and M. Själander. Understanding selective delay as a method for efficient secure speculative execution. IEEE Trans. Comput., 69(11):1584–1595, 2020.
[15] Ismail Akturk and Ulya R. Karpuzcu. AMNESIAC: Amnesic automatic computer - trading computation for communication for energy efficiency. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), New York, NY, USA, 2017. ACM.
[16] National Vulnerability Database. CVE-2018-3693. Available from MITRE, CVE-ID CVE-2018-3693, December 28, 2017.
[17] Kim-Anh Tran, Christos Sakalis, Magnus Själander, Alberto Ros, Stefanos Kaxiras, and Alexandra Jimborean. Clearing the shadows: Recovering lost performance for invisible speculative execution through HW/SW co-design. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 241–254, New York, NY, USA, 2020. Association for Computing Machinery.
[18] Mark Horowitz. Computing's energy problem (and what we can do about it). Keynote at International Conference on Solid State Circuits, April 2014.
[19] I. Akturk and U. R. Karpuzcu. Trading computation for communication: A taxonomy of data recomputation techniques. IEEE Transactions on Emerging Topics in Computing, 2018.
[20] Alberto Ros and Stefanos Kaxiras. Callback: Efficient synchronization without invalidation with a directory just for spin-waiting. In International Symposium on Computer Architecture (ISCA), pages 427–438. IEEE, 2015.
[21] Alberto Ros and Stefanos Kaxiras. Racer: TSO consistency via race detection. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 33. IEEE Press, 2016.
[22] Mengjia Yan, Jen-Yang Wen, Christopher W Fletcher, and Josep Torrellas. SecDir: A secure directory to defeat directory side-channel attacks. In Proceedings of the 46th International Symposium on Computer Architecture, pages 332–345. ACM, 2019.
[23] Microsoft Support. Windows guidance to protect against speculative execution side-channel vulnerabilities, November 12, 2019.
[24] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005.
[25] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1–7, August 2011.
[26] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In International Symposium on Microarchitecture, pages 469–480, New York, NY, USA, December 2009. IEEE.
[27] Sheng Li, Ke Chen, Jung Ho Ahn, Jay B Brockman, and Norman P Jouppi. CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques. In International Conference On Computer Aided Design (ICCAD), pages 694–701, New York, NY, USA, 2011. IEEE.
[28] Standard Performance Evaluation Corporation. SPEC CPU 2006 benchmark suite. https://www.spec.org/cpu2006/, 2006.
[29] Peinan Li, Lutan Zhao, Rui Hou, Lixin Zhang, and Dan Meng. Conditional speculation: An effective approach to safeguard out-of-order execution against Spectre attacks. In IEEE International Symposium on High-Performance Computer Architecture, pages 264–276, Washington, DC, USA, February 2019. IEEE.
[30] Khaled N. Khasawneh, Esmaeil Mohammadian Koruyeh, Chengyu Song, Dmitry Evtyushkin, Dmitry Ponomarev, and Nael Abu-Ghazaleh. SafeSpec: Banishing the Spectre of a Meltdown with leakage-free speculation. In The Design Automation Conference (DAC), pages 1–6, June 2019. ISSN: 0738-100X.
[31] Sam Ainsworth and Timothy M. Jones. MuonTrap: Preventing cross-domain Spectre-like attacks by capturing speculative state. In International Symposium on Computer Architecture (ISCA). IEEE, 2020.
[32] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, and Stefan Mangard. DRAMA: Exploiting DRAM addressing for cross-CPU attacks. In USENIX Conference on Security Symposium, pages 565–581, 2016.
[33] Zirui Neil Zhao, Houxiang Ji, Mengjia Yan, Jiyong Yu, Christopher W. Fletcher, Adam Morrison, Darko Marinov, and Josep Torrellas. Speculation invariance (InvarSpec): Faster safe execution through program analysis. In International Symposium on Microarchitecture, pages 1138–1152. IEEE, 2020.
[34] K. Barber, A. Bacha, L. Zhou, Y. Zhang, and R. Teodorescu. SpecShield: Shielding speculative data from microarchitectural covert channels. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 151–164, September 2019.
[35] Michael Schwarz, Robert Schilling, Florian Kargl, Moritz Lipp, Claudio Canella, and Daniel Gruss. ConTExT: Leakage-free transient execution. arXiv:1905.09100 [cs], May 2019.
[36] Mohammadkazem Taram, Ashish Venkat, and Dean Tullsen. Context-sensitive fencing: Securing speculative execution via microcode customization. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 395–410, Providence, RI, USA, 2019. ACM Press.
[37] Jiyong Yu, Namrata Mantri, Josep Torrellas, Adam Morrison, and Christopher W. Fletcher. Speculative data-oblivious execution: Mobilizing safe prediction for safe and efficient speculative execution. In International Symposium on Computer Architecture (ISCA). IEEE, 2020.
[38] E. M. Koruyeh, S. Haji Amin Shirazi, K. N. Khasawneh, C. Song, and N. Abu-Ghazaleh. SpecCFI: Mitigating Spectre attacks using CFI informed speculation. In IEEE Symposium on Security and Privacy, pages 39–53, 2020.
[39] M Kandemir, Feihui Li, Guilin Chen, Guangyu Chen, and O Ozturk. Studying storage-recomputation tradeoffs in memory-constrained embedded processing. In Design, Automation and Test in Europe (DATE), 2005.
[40] H Koc, O Ozturk, M Kandemir, and E Ercanli. Minimizing energy consumption of banked memories using data recomputation. In International Symposium on Low Power Electronics and Design (ISLPED), 2006.
[41] H Koc, M Kandemir, E Ercanli, and O Ozturk. Reducing off-chip memory access costs using data recomputation in embedded chip multi-processors. In The Design Automation Conference (DAC), 2007.
[42] A Sodani and G S Sohi. Dynamic instruction reuse. In International Symposium on Computer Architecture (ISCA), 1997.
[43] Xiaochen Guo, Engin Ipek, and Tolga Soyata. Resistive computation: Avoiding the power wall with low-leakage, STT-MRAM based computing. In International Symposium on Computer Architecture (ISCA), 2010.
[44] Marc de Kruijf and Karthikeyan Sankaralingam. Idempotent processor architecture. In International Symposium on Microarchitecture (MICRO), December 2011.
[45] H. Elnawawy, M. Alshboul, J. Tuck, and Y. Solihin. Efficient checkpointing of loop-based codes for non-volatile main memory. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017.
[46] T. E. Carlson, W. Heirman, O. Allam, S. Kaxiras, and L. Eeckhout. The load slice core microarchitecture. In International Symposium on Computer Architecture (ISCA), 2015.
[47] S. T. Srinivasan, R. Rajwar, H. Akkary, A. Gandhi, and M. Upton. Continual flow pipelines: Achieving resource-efficient latency tolerance. IEEE Micro, 24(6):62–73, 2004.