A Survey on Recent Hardware Data Prefetching Approaches with An Emphasis on Servers
MOHAMMAD BAKHSHALIPOUR,
Sharif University of Technology
MEHRAN SHAKERINAVA,
Sharif University of Technology
FATEMEH GOLSHAN,
Sharif University of Technology
ALI ANSARI,
Sharif University of Technology
PEJMAN LOTFI-KAMRAN,
Institute for Research in Fundamental Sciences (IPM)
HAMID SARBAZI-AZAD,
Sharif University of Technology & Institute for Research in Fundamental Sciences (IPM)

Data prefetching, i.e., the act of predicting an application's future memory accesses and fetching those that are not in the on-chip caches, is a well-known and widely used approach to hide the long latency of memory accesses. The fruitfulness of data prefetching is evident to both industry and academia: nowadays, almost every high-performance processor incorporates a few data prefetchers for capturing various access patterns of applications; besides, there is a myriad of proposals for data prefetching in the research literature, where each proposal enhances the efficiency of prefetching in a specific way. In this survey, we discuss the fundamental concepts in data prefetching and study state-of-the-art hardware data prefetching approaches.

Additional Key Words and Phrases: Data Prefetching, Scale-Out Workloads, Server Processors, and Spatio-Temporal Correlation.
Server workloads like
Media Streaming and
Web Search serve millions of users and are considered an important class of applications. Such workloads run on large-scale data-center infrastructures that are backed by processors which are essentially tuned for low latency and quality-of-service guarantees. These processors typically include a handful of high-clock-frequency, aggressively-speculative, and deeply-pipelined cores so as to run server applications as fast as possible, satisfying end-users' latency requirements [1–11].

Much to processor designers' chagrin, bottlenecks in the memory system prevent server processors from achieving high performance on server applications. As server workloads operate on a large volume of data, they produce active memory working sets that dwarf the capacity-limited on-chip caches of server processors and reside in the off-chip memory; hence, these applications frequently miss the data in the on-chip caches and access the long-latency memory to retrieve
it. Such frequent data misses preclude server processors from reaching their peak performance because cores are idle waiting for the data to arrive [1, 4, 12–24].

System architects have proposed various strategies to overcome the performance penalty of frequent memory accesses. Data Prefetching is one of these strategies, and it has demonstrated significant performance potential. Data prefetching is the art of predicting future memory accesses and fetching those that are not in the cache before a core explicitly asks for them, in order to hide the long latency of memory accesses. Nowadays, virtually every high-performance computing chip uses a few data prefetchers (e.g.,
Intel Xeon Phi [25],
IBM Blue Gene/Q [26],
AMD Opteron [27], and
UltraSPARC III [28]) to capture regular and/or irregular memory access patterns of various applications. In the research literature, likewise, there is a myriad of proposals for data prefetching, where every proposal makes prefetching more efficient in a specific way.

In this study, we first discuss the fundamental concepts in data prefetching and then study recent, as well as classic, hardware data prefetchers in the context of server workloads. We describe the operations of every data prefetcher in detail and shed light on its design trade-offs. In a nutshell, we make the following contributions in this study:

• We describe memory access patterns of applications and discuss how these patterns lead to different classes of correlations, from which data prefetchers can predict future memory accesses.

• We describe the operations of state-of-the-art hardware data prefetchers in the research literature and discuss how they are able to capture data cache misses.

• We highlight the overheads of every data prefetching technique and discuss the feasibility of implementing it in modern processors.
Progress in fabrication technology, accompanied by circuit-level and microarchitectural advancements, has brought about significant enhancements in processors' performance over the past decades. Meanwhile, the performance of memory systems has not kept pace with that of the processors, forming a large gap between the performance of processors and memory systems [29–41]. As a consequence, numerous approaches have been proposed to enhance the execution performance of applications by bridging the processor-memory performance gap. Hardware data prefetching is just one of these approaches. Hardware data prefetching bridges the gap by proactively fetching the data ahead of the cores' requests in order to eliminate the idle cycles in which the processor is waiting for the response of the memory system. In this section, we briefly review the other approaches that target the same goal (i.e., bridging the processor-memory performance gap) but in other ways.
Multithreading [42] enables the processor to better utilize its computational resources, as stalls in one thread can be overlapped with the execution of other thread(s) [43–46]. Multithreading, however, only improves throughput and does nothing for (or even worsens) the response time [1, 33, 47], which is crucial for satisfying the strict latency requirements of server applications.
Thread-Based Prefetching techniques [48–52] exploit idle thread contexts or distinct pre-execution hardware to drive helper threads that try to overlap the cache misses with speculative execution. Such helper threads, formed either by the hardware or by the compiler, execute a piece of code that prefetches for the main thread. Nonetheless, the additional threads and fetch/execution bandwidth may not be available when the processor is fully utilized. The abundant request-level parallelism of server applications [1, 4] makes such schemes ineffective, in that the helper threads need to compete with the main threads for the hardware context.
Runahead Execution [53, 54] uses the execution resources of a core that would otherwise be stalled on an off-chip cache miss to go ahead of the stalled execution in an attempt to discover additional load misses. Similarly,
Branch Prediction Directed Prefetching [55] utilizes the branch predictor to run in advance of the executing program, thereby prefetching load instructions along the expected future path. Such approaches, nevertheless, are constrained by the accuracy of the branch predictor and can cover only a portion of the miss latency, since the runahead thread/branch predictor may not be able to run far enough ahead to completely hide a cache miss. Moreover, these approaches can only prefetch independent cache misses [56] and may not be effective for many of the server workloads, e.g.,
OLTP and
Web applications, which are characterized by long chains of dependent memory accesses [57, 58].

On the software side, there are efforts to re-structure programs to boost chip-level
Data Sharing and
Data Reuse [59, 60] in order to decrease off-chip accesses. While these techniques are useful for workloads with modest datasets, they fall short of efficiency for big-data server workloads, where the multi-gigabyte working sets of workloads dwarf the few megabytes of on-chip cache capacity. The ever-growing datasets of server workloads make such approaches unscalable.

Software Prefetching techniques [61–65] profile the program code and insert prefetch instructions to eliminate cache misses. While these techniques are shown to be beneficial for small benchmarks, they usually require significant programmer effort to produce optimized code that generates timely prefetch requests.
Memory-Side Prefetching techniques [66–68] place the hardware for data prefetching near DRAM, for the sake of saving precious SRAM budget. In such approaches (e.g., [67]), prefetching is performed by a user thread running near the DRAM, and prefetched pieces of data are sent to the on-chip caches. Unfortunately, such techniques lose the predictability of core requests [69] and are incapable of performing cache-level optimizations (e.g., avoiding cache pollution [70]).
In this section, we briefly overview a background on hardware data prefetching and refer the reader to prior work [17, 69, 71] for more details. For simplicity, in the rest of the manuscript, we use the term prefetcher to refer to the core-side hardware data prefetcher.

The first step in data prefetching is predicting future memory accesses. Fortunately, data accesses demonstrate several types of correlations and localities that lead to the formation of patterns among memory accesses, from which data prefetchers can predict future memory references. These patterns emanate from the layout of programs' data structures in the memory, and the algorithm and the high-level programming constructs that operate on these data structures. In this work, we classify the memory access patterns of applications into three distinct categories: (1) strided, (2) temporal, and (3) spatial access patterns.
Strided Accesses:
Strided access pattern refers to a sequence of memory accesses in which the distance of consecutive accesses is constant (e.g., {A, A+k, A+2k, . . . }). Such patterns are abundant in programs with dense matrices and frequently come into sight when programs operate on multi-dimensional arrays. Strided accesses also appear in pointer-based data structures when memory allocators arrange the objects sequentially and in a constant-size manner in the memory [72].

Temporal Address Correlation:
Temporal address correlation [58] refers to a sequence of addresses that favor being accessed together and in the same order (e.g., if we observe {A, B, C, D}, then it is likely for {B, C, D} to follow {A} in the future). Temporal address correlation stems fundamentally from the fact that programs consist of loops, and is observed when data structures such as lists, arrays, and linked lists are traversed. When data structures are stable [73], access patterns recur, and the temporal address correlation is manifested [58].

Spatial Address Correlation:
Spatial address correlation [74] refers to the phenomenon that similar access patterns occur in different regions of memory (e.g., if a program visits locations {A, B, C, D} of Page X, it is probable that it visits locations {A, B, C, D} of other pages as well). Spatial correlation transpires because applications use various objects with a regular and fixed layout, and accesses reappear while traversing data structures [74].

Prefetchers need to issue timely prefetch requests for the predicted addresses. Preferably, a prefetcher sends prefetch requests well in advance and supplies enough storage for the prefetched blocks in order to hide the entire latency of memory accesses. An early prefetch request may evict a useful block from the cache, and a late prefetch may decrease the effectiveness of prefetching in that a portion of the long latency of a memory access is exposed to the processor.

Prefetching lookahead refers to how far ahead of the demand miss stream the prefetcher can send requests. An aggressive prefetcher may offer a high prefetching lookahead (say, eight) and issue many prefetch requests ahead of the processor to hide the entire latency of memory accesses; on the other hand, a conservative prefetcher may offer a low prefetching lookahead and send a single prefetch request in advance of the processor's demand to avoid wasting resources (e.g., cache storage and memory bandwidth). Typically, there is a trade-off between the aggressiveness of a prefetching technique and its accuracy: making a prefetcher more aggressive usually leads to covering more data-miss–induced stall cycles, but at the cost of fetching more useless data.

Some pieces of prior work propose to dynamically adjust the prefetching lookahead [55, 70, 75]. Based on the observation that the optimal prefetching degree differs across applications, as well as across execution phases of a particular application, these approaches employ heuristics to increase or decrease the prefetching lookahead dynamically at run-time. For example, SPP [75] monitors the accuracy of issued prefetch requests and reduces the prefetching lookahead if the accuracy becomes smaller than a predefined threshold.
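To make this kind of throttling heuristic concrete, the following minimal Python sketch (ours, not the actual SPP algorithm; the thresholds, bounds, and interval bookkeeping are illustrative assumptions) shows how a prefetcher might raise or lower its lookahead based on measured accuracy:

```python
class LookaheadController:
    """Adjusts the prefetching lookahead from observed accuracy.

    A simplified, SPP-style heuristic; the threshold values and bounds
    below are illustrative assumptions, not values from the SPP paper.
    """
    def __init__(self, low=0.40, high=0.75, min_la=1, max_la=8):
        self.low, self.high = low, high
        self.min_la, self.max_la = min_la, max_la
        self.lookahead = min_la
        self.issued = 0      # prefetches issued in the current interval
        self.useful = 0      # issued prefetches later hit by demand accesses

    def on_prefetch_issued(self):
        self.issued += 1

    def on_prefetch_hit(self):
        self.useful += 1

    def end_of_interval(self):
        # Accuracy = useful prefetches / issued prefetches.
        accuracy = self.useful / self.issued if self.issued else 0.0
        if accuracy < self.low and self.lookahead > self.min_la:
            self.lookahead -= 1   # too many useless prefetches: back off
        elif accuracy > self.high and self.lookahead < self.max_la:
            self.lookahead += 1   # prefetches are accurate: run further ahead
        self.issued = self.useful = 0
        return self.lookahead
```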
Prefetching can be employed to move data from lower levels of the memory hierarchy to any higher level. (We use the terms higher and lower levels of the memory hierarchy to refer to the levels closer to and further away from the core, respectively.) Prior work has used data prefetchers at all cache levels, from the primary data cache to the shared last-level cache.

The location of a data prefetcher has a profound impact on its overall behavior [76]. A prefetcher in the first-level cache can observe all memory accesses, and hence, is able to issue highly-accurate prefetch requests, but at the cost of imposing large storage overhead for recording the metadata information. In contrast, a prefetcher in the last-level cache observes the access sequences that have been filtered at higher levels of the memory hierarchy, resulting in lower prediction accuracy but higher storage efficiency.

A naive deployment of a data prefetcher not only may fail to improve system performance but may significantly harm performance and energy-efficiency [77]. The two well-known major drawbacks of data prefetching are (1) cache pollution and (2) off-chip bandwidth overhead.
Cache Pollution:
Data prefetching may increase the demand misses by replacing useful cache blocks with useless prefetched data, harming performance. Cache pollution usually occurs when an aggressive prefetcher exhibits low accuracy and/or when prefetch requests of a core in a many-core processor compete for shared resources with demand accesses of other cores [78].
Bandwidth Overhead:
In a many-core processor, prefetch requests of a core can delay demand requests of another core because of contention for memory bandwidth [78]. This interference is the major obstacle to using data prefetchers in many-core processors, and the problem gets thornier as the number of cores increases [79, 80].
Data prefetchers usually place the prefetched data into one of the following two structures: (1) the cache itself, or (2) an auxiliary buffer next to the cache. In case an auxiliary buffer is used for the prefetched data, demand requests first look for the data in the cache; if the data is not found, the auxiliary buffer is searched before sending a request to the lower levels of the memory hierarchy.

Storing the prefetched data in the cache lowers the latency of accessing data when the prediction is correct. However, when the prediction is incorrect or when the prefetch request is not timely (i.e., too early), having the prefetched data in the cache may result in evicting useful cache blocks.
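This lookup order can be summarized in a short sketch. The following Python fragment is a simplified model of the flow just described (the dictionary-based structures and the next_level interface are assumptions for illustration, not an actual cache implementation):

```python
def demand_access(addr, cache, prefetch_buffer, next_level):
    """Lookup order for a demand request when prefetched data is kept in an
    auxiliary buffer next to the cache (structures are illustrative)."""
    if addr in cache:
        return cache[addr]                 # ordinary cache hit
    if addr in prefetch_buffer:
        block = prefetch_buffer.pop(addr)  # prefetch hit: promote the block
        cache[addr] = block                # into the cache on first use
        return block
    block = next_level.fetch(addr)         # miss in both: go down the hierarchy
    cache[addr] = block
    return block
```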
To give insight into how a stride prefetcher operates, we now describe a preliminary yet prevalent type of stride prefetching. Stride prefetchers are widely used in commercial processors (e.g.,
IBM Power4 [81],
Intel Core [82],
AMD Opteron [27],
Sun UltraSPARC III [28]) and have been shown to be quite effective for desktop and engineering applications. Stride prefetchers [83–90] detect streams (i.e., sequences of consecutive addresses) that exhibit strided access patterns (Section 1.2.1) and generate prefetch requests by adding the detected stride to the last observed address.

Instruction-Based Stride Prefetcher (IBSP) [83] is a preliminary type of stride prefetcher. The prefetcher tracks strided streams on a per-load-instruction basis: the prefetcher observes the accesses issued by individual load instructions and sends prefetch requests if the accesses manifest a strided pattern. Figure 1 shows the organization of IBSP's metadata table, named the
Reference Prediction Table (RPT). The RPT is a structure tagged and indexed with the Program Counter (PC) of load instructions. Each entry in the
RPT corresponds to a specific load instruction; it keeps the
Last Block referenced by the instruction and the Last Stride observed in the stream (i.e., the distance between the two last addresses accessed by the instruction).

Upon each trigger access (i.e., a cache miss or a prefetch hit), the RPT is searched with the PC of the instruction. If the search results in a miss, it means that no history exists for the instruction, and hence, no prefetch request can be issued. Under two circumstances, a search may result in a miss: (1) whenever a load instruction is new in the execution flow of the program, and ergo, no history has been recorded for it so far, and (2) whenever a load instruction is re-executed after a long time, and the corresponding recorded metadata information has been evicted from the RPT due to conflicts. In such cases, when no matching entry exists in the RPT, a new entry is allocated for the instruction, and possibly a victim entry is evicted. The new entry is tagged with the PC, and the
Last Block field of the entry is filled with the referenced address. The
Last Stride is also set to zero (an invalid value) as no stride has yet been observed for this stream. However, if searching the
RPT results in a hit, it means that there is a recorded history for the
instruction.

Fig. 1. The organization of the Instruction-Based Stride Prefetcher (IBSP). The RPT keeps track of various streams.

In this case, the recorded history information is checked against the current access to find out whether or not the stream is a strided one. To do so, the difference between the current address and the
Last Block is calculated to get the current stride. Then, the current stride is checked against the recorded
Last Stride. If they do not match, it is implied that the stream does not exhibit a strided access pattern. However, if they match, it is construed that the stream is a strided one, as three consecutive accesses have produced two identical strides. In this case, based on the lookahead of the prefetcher (Section 1.2.2), several prefetch requests are issued by consecutively adding the observed stride to the requested address. For example, if the current address and the current stride are A and k, respectively, and the lookahead of prefetching is three, the prefetch candidates will be {A+k, A+2k, A+3k}. Finally, regardless of whether the stream is strided or not, the corresponding RPT entry is updated: the
Last Block is updated with the current address, and the
Last Stride takes the value of the current stride. A compact sketch of this lookup-and-update logic follows.
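The following minimal Python sketch (ours, not taken from [83]; the unbounded dictionary stands in for the real set-associative RPT, and eviction is omitted) summarizes the per-PC flow just described:

```python
class IBSP:
    """Sketch of the Instruction-Based Stride Prefetcher's RPT logic,
    following the description above (table sizing is simplified away)."""
    def __init__(self, lookahead=3):
        self.rpt = {}              # PC -> (last_block, last_stride)
        self.lookahead = lookahead

    def on_trigger(self, pc, addr):
        if pc not in self.rpt:                     # no history for this load
            self.rpt[pc] = (addr, 0)               # stride 0 acts as "invalid"
            return []
        last_block, last_stride = self.rpt[pc]
        stride = addr - last_block                 # current stride
        prefetches = []
        if stride != 0 and stride == last_stride:  # two identical strides seen
            prefetches = [addr + i * stride
                          for i in range(1, self.lookahead + 1)]
        self.rpt[pc] = (addr, stride)              # update the entry regardless
        return prefetches
```

As in the description above, a prefetch is issued only after three consecutive accesses by the same load instruction have produced two identical strides.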
In the following chapters, we introduce state-of-the-art data prefetchers and describe their mechanisms.

Spatial data prefetchers predict future memory accesses by relying on spatial address correlation, i.e., the similarity of access patterns among multiple regions of memory. Access patterns demonstrate spatial correlation because applications use data objects with a regular and fixed layout, and accesses reoccur when data structures are traversed [74]. Spatial data prefetchers [74, 75, 91–98] divide the memory address space into fixed-size sections, named Spatial Regions, and learn the memory access patterns over these sections. The learned access patterns are then used for prefetching future memory references when the application touches the same or similar Spatial Regions.

Spatial data prefetchers impose low area overhead because they store offsets (i.e., the distance of a block address from the beginning of a Spatial Region) or deltas (i.e., the distance between two consecutive accesses that fall into a Spatial Region) as their metadata information, and not complete addresses. Another equally remarkable strength of spatial data prefetchers is their ability to eliminate compulsory cache misses. Compulsory cache misses are a major source of performance degradation in important classes of applications, e.g., scan-dominated workloads, where scanning large volumes of data produces a bulk of unseen memory accesses that cannot be captured by caches [74]. By applying the pattern that was observed in a past
Spatial Region to a new unobserved
Spatial Region, spatial prefetchers can alleviate compulsory cache misses, significantly enhancing system performance.
The critical limitation of spatial data prefetching is its ineptitude in predicting pointer-chasing–caused cache misses. As dynamic objects can potentially be allocated anywhere in memory, pointer-chasing accesses do not necessarily exhibit spatial correlation, producing bulks of dependent cache misses for which spatial prefetchers can do very little (cf. Section 3).

We include two state-of-the-art spatial prefetching techniques: (1) Spatial Memory Streaming [91], and (2) Variable Length Delta Prefetcher [93].

SMS is a state-of-the-art spatial prefetcher that was proposed and evaluated in the context of server and scientific applications. Whenever a Spatial Region is requested for the first time, SMS starts to observe and record accesses to that
Spatial Region as long as the
Spatial Region is actively used by the application. Whenever the Spatial Region is no longer utilized (i.e., the corresponding blocks of the Spatial Region start to be evicted from the cache), SMS stores the information of the observed accesses in its metadata table, named the Pattern History Table (PHT).

The information in the PHT is stored in the form of ⟨event, pattern⟩. The event is a piece of information to which the observed access pattern is correlated. That is, the stored access pattern is expected to be used whenever the event reoccurs in the future. SMS empirically chooses the PC+Offset of the trigger access (i.e., the PC of the instruction that first accesses the Spatial Region, combined with the distance of the first requested cache block from the beginning of the Spatial Region) as the event to which the access patterns are correlated. Doing so, whenever a PC+Offset reoccurs, the correlated access pattern history is used for issuing prefetch requests. The pattern is the history of accesses that happen in every Spatial Region. SMS encodes the patterns of the accesses as a bit vector: for every cache block in a Spatial Region, a bit is stored, indicating whether the block has been used during the latest usage of the Spatial Region ('1') or not ('0'). Therefore, whenever a pattern is going to be used, prefetch requests are issued only for blocks whose corresponding bit in the stored pattern is '1'. Figure 2 shows the hardware realization of SMS.
Fig. 2. The organization of Spatial Memory Streaming (SMS).
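To make the training and prediction flow of SMS concrete, the following Python sketch (ours, not from [91]; the 32-block region size, the unbounded tables, and the eviction callback name are illustrative assumptions) models the PHT-based mechanism described above:

```python
class SMS:
    """Sketch of Spatial Memory Streaming: per-region footprints (bit
    vectors) are learned and indexed by the PC+Offset of the trigger."""
    BLOCKS_PER_REGION = 32   # assumed region size

    def __init__(self):
        self.pht = {}        # (pc, offset) -> footprint bit vector
        self.training = {}   # region base -> (trigger event, footprint)

    def on_access(self, pc, block_addr):
        base = block_addr - (block_addr % self.BLOCKS_PER_REGION)
        offset = block_addr - base
        if base in self.training:                  # region already active
            self.training[base][1][offset] = 1     # record another footprint bit
            return []
        footprint = [0] * self.BLOCKS_PER_REGION   # trigger access: new region
        footprint[offset] = 1
        self.training[base] = ((pc, offset), footprint)
        pattern = self.pht.get((pc, offset))       # event = PC+Offset of trigger
        if pattern is None:
            return []                              # no history for this event yet
        # prefetch every block whose bit was set in the recorded footprint
        return [base + i for i, bit in enumerate(pattern) if bit and i != offset]

    def on_region_eviction(self, base):
        event, footprint = self.training.pop(base)  # region no longer active:
        self.pht[event] = footprint                 # store the <event, pattern> pair
```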
VLDP is a recent state-of-the-art spatial data prefetcher that relies on the similarity of delta patterns among Spatial Regions of memory. VLDP records the distances between consecutive accesses that fall into Spatial Regions and uses them to predict future misses. The key innovation of VLDP is the deployment of multiple prediction tables for predicting delta patterns. VLDP employs several history tables, where each table keeps the metadata based on a specific length of the input history.

Figure 3 shows the metadata organization of VLDP. The three major components are the Delta History Buffer (DHB), the Delta Prediction Table (DPT), and the Offset Prediction Table (OPT). DHB is a small table that records the delta history of currently-active Spatial Regions. Each entry in the DHB is associated with an active
Spatial Region and contains details like the
Last Referenced Block. These details are used to index the
OPT and
DPTs for issuing prefetch requests.
Fig. 3. The organization of Variable Length Delta Prefetcher (VLDP).
DPT is a set of key-value pairs that correlates a delta sequence with the next expected delta. VLDP benefits from multiple DPTs, where each DPT records the history with a different input length. DPT-i associates a sequence of i deltas with the next expected delta. For example, if the last three deltas in a Spatial Region are d1, d2, and d3 (d3 is the most recent delta), DPT-2 stores [⟨d1, d2⟩ → d3], while DPT-1 records [⟨d2⟩ → d3]. While looking up the DPTs, if several of them offer a prediction, the prediction of the table with the longest sequence of deltas is used, because predictions that are made based on longer inputs are expected to be more accurate [74]. This way, VLDP eliminates wrong predictions that are made by short inputs, enhancing both the accuracy and miss coverage of the prefetcher.
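The longest-match selection among the DPTs can be sketched as follows (a simplified model, assuming plain dictionaries in place of the small set-associative tables of the actual design):

```python
def predict_next_delta(dpts, recent_deltas):
    """Longest-match lookup across VLDP's Delta Prediction Tables.

    dpts[i] maps a tuple of (i+1) recent deltas to the next expected delta;
    the prediction backed by the longest matching history wins.
    """
    for length in range(len(dpts), 0, -1):   # prefer the longest history
        if length > len(recent_deltas):
            continue
        key = tuple(recent_deltas[-length:])
        prediction = dpts[length - 1].get(key)
        if prediction is not None:
            return prediction                # longest matching input wins
    return None

# Example: after observing deltas 1, 2, 3 in a region, DPT-2 holds {(1, 2): 3}
# and DPT-1 holds {(2,): 3}; a lookup with history [1, 2] returns 3 via DPT-2.
dpts = [{(2,): 3}, {(1, 2): 3}, {}]          # DPT-1, DPT-2, DPT-3
assert predict_next_delta(dpts, [1, 2]) == 3
```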
OPT is another metadata table of VLDP, which is indexed using the offset (and not the delta) of the first access to a Spatial Region. Merely relying on deltas for prefetching requires the prefetcher to observe at least the first two accesses to a Spatial Region before issuing prefetch requests; however, there are many sparse Spatial Regions in which only a few, say two, of the blocks are used by the application. Therefore, waiting for two accesses before starting the prefetching may deprive the prefetcher of issuing enough prefetch requests when the application operates on a significant number of sparse Spatial Regions. Employing the OPT enables VLDP to start prefetching immediately after the first access to a Spatial Region. OPT associates the offset of the first access of a Spatial Region with the next expected delta. After the first access to a Spatial Region, the OPT is looked up using the offset of the access, and the output of the table is used for issuing a prefetch request. For the rest of the accesses to the Spatial Region (i.e., from the second access onward), VLDP uses only the DPTs.

Even though VLDP relies on prediction tables with a single next expected delta, it is still able to offer a prefetching lookahead larger than one (Section 1.2.2), using the proposed multi-degree prefetching mechanism. In the multi-degree mode, upon predicting the next delta in a Spatial Region, VLDP uses the prediction as an input to the
DPTs to make more predictions.
Spatial prefetching has been proposed and developed to capture the similarity of access patterns among memory pages (e.g., if a program visits locations {A, B, C, D} of Page X, it is probable that it visits locations {A, B, C, D} of other pages as well). Spatial prefetching works because applications use data objects with a regular and fixed layout, and accesses reoccur when data structures are traversed. Spatial prefetching is appealing since it imposes low storage overhead on the system, paving the way for its adoption in future systems.

Temporal prefetching refers to replaying the sequence of past cache misses in order to avert future misses. Temporal data prefetchers [58, 67, 92, 99–104] record the sequence of data misses in the order they appear and use the recorded history for predicting future data misses. Upon a new data miss, they search the history, find a matching entry, and replay the sequence of data misses after the match in an attempt to eliminate potential future data misses. A tuned version of temporal prefetching has been implemented in
IBM Blue Gene/Q, where it is called List Prefetching [26]. Temporal prefetching is an ideal choice for eliminating long chains of dependent cache misses, which are common in pointer-chasing applications (e.g.,
OLTP and
Web) [58]. A dependent cache miss refers to a memory operation that results in a cache miss and is dependent on data from a prior cache miss. Such misses have a marked effect on the execution performance of applications and impede the processor from making forward progress, since both misses are fetched serially [56, 58]. Because of the lack of strided/spatial correlation among dependent misses, stride and spatial prefetchers are usually unable to prefetch such misses [105]; however, temporal prefetchers, by recording and replaying the sequences of data misses, can prefetch dependent cache misses and provide a significant performance improvement.

Temporal prefetchers, on the other face of the coin, also have shortcomings. Temporal prefetching techniques exhibit low accuracy as they do not know where streams end. That is, at the foundation of temporal prefetching, there is no wealth of information about when prefetching should be stopped; hence, temporal prefetchers continue issuing many prefetch requests until another triggering event occurs, resulting in large overprediction. Moreover, as temporal prefetchers rely on address repetition, they are unable to prevent compulsory misses (unobserved misses) from happening. In other words, they can only prefetch cache misses that have been observed at least once in the past; however, there are many important applications (e.g., DSS) in which the majority of cache misses occur only once during the execution of the application [74], for which temporal prefetching can do nothing. Furthermore, as temporal prefetchers need to store correlations between addresses, they usually impose large storage overhead (tens of megabytes) that cannot be accommodated on-chip next to the cores. Consequently, temporal prefetchers usually place their metadata tables off-chip in the main memory. Unfortunately, placing the history information off-chip increases the latency of accessing metadata and, more importantly, results in a drastic increase in off-chip bandwidth consumption for fetching and updating the metadata.

We include two state-of-the-art temporal prefetching techniques: (1) Sampled Temporal Memory Streaming [101], and (2) Irregular Stream Buffer [103].
STMS is a state-of-the-art temporal data prefetcher that was proposed and evaluated in the context of server and scientific applications. The main observation behind STMS is that the length of temporal streams widely differs across programs, as well as across different streams in a particular program, ranging from a couple to hundreds of thousands of cache misses. In order to efficiently store the information of various streams, STMS uses a circular FIFO buffer, named the History Table, and appends every observed cache miss to its end. This way, the prefetcher is not required to fix a specific predefined length for temporal streams in the metadata organization, which would result in wasting storage for streams shorter than the predefined length or discarding streams longer than it; instead, all streams are stored next to each other in a storage-efficient manner. For locating every address in the History Table, STMS uses an auxiliary set-associative structure, named the Index Table. The Index Table stores, for every observed miss address, a pointer to its last occurrence in the History Table. Therefore, whenever a cache miss occurs, the prefetcher first looks up the Index Table with the missed address and gets the corresponding pointer. Using the pointer, the prefetcher proceeds to the History Table and issues prefetch requests for the addresses that followed the missed address in the history.

Figure 4 shows the metadata organization of STMS, which mainly consists of a History Table and an Index Table. As both tables require multi-megabyte storage for STMS to have reasonable miss coverage, both tables are placed off-chip in the main memory. Consequently, every access to these tables (read or update) must be sent to the main memory and brings/updates a cache block worth of data. That is, for every stream, STMS needs to wait for two long (serial) memory requests to be sent (one to read the
Index Table and one to read the correct location of the
History Table) and their responses to come back to the prefetcher before issuing prefetch requests for the stream. The delay of the two off-chip memory accesses, however, is compensated over several prefetch requests of a stream if the stream is long enough.
Fig. 4. The organization of Sampled Temporal Memory Streaming (STMS).
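The following Python sketch (ours, not from [101]; both tables are modeled as in-memory Python structures, whereas the real design places them off-chip, and the replay degree is an assumption) approximates the History Table/Index Table interplay described above:

```python
class STMS:
    """Sketch of Sampled Temporal Memory Streaming: a circular History
    Table of misses plus an Index Table of last-occurrence pointers."""
    def __init__(self, capacity=1 << 16, degree=4):
        self.history = [None] * capacity   # circular FIFO of miss addresses
        self.head = 0
        self.index = {}                    # miss address -> last position
        self.degree = degree               # how many successors to replay

    def on_miss(self, addr):
        ptr = self.index.get(addr)         # 1st (off-chip) access: Index Table
        prefetches = []
        if ptr is not None:                # 2nd (off-chip) access: History Table
            for i in range(1, self.degree + 1):
                successor = self.history[(ptr + i) % len(self.history)]
                if successor is not None:
                    prefetches.append(successor)
        # append the new miss and update its last-occurrence pointer
        self.history[self.head] = addr
        self.index[addr] = self.head
        self.head = (self.head + 1) % len(self.history)
        return prefetches
```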
ISB is another state-of-the-art proposal for temporal data prefetching that targets irregular streams of temporally-correlated memory accesses. Unlike STMS, which operates on the global miss sequence, ISB attempts to extract temporal correlation among memory references on a per-load-instruction basis (Section 1.3). The key innovation of ISB is the introduction of an extra indirection level for storing metadata information. ISB defines a new conceptual address space, named the Structural Address Space (SAS), and maps temporally-correlated physical addresses to this address space in a way that they appear sequentially. That is, with this indirection mechanism, physical addresses that are temporally-correlated and used one after another, regardless of their distribution in the
Physical Address Space (PAS) of memory, become spatially-located and appear one after another in
SAS. Figure 5 shows a high-level example of this linearization.

ISB utilizes two tables to record a bidirectional mapping between addresses in the
PAS and
SAS: one table, named the Physical-to-Structural Address Mapping (PSAM), records temporally-correlated physical addresses and their mapping information (i.e., to which location in the SAS they are mapped); the other table, named the Structural-to-Physical Address Mapping (SPAM), keeps the linearized form of the physical addresses in the SAS and the corresponding mapping information (i.e., which physical addresses are mapped to every structural address).

Fig. 5. An example of linearizing scattered temporally-correlated memory references.

The main purpose of such a linearization is to represent the metadata in a spatially-located manner, paving the way to putting the metadata off-chip and caching its content in on-chip structures [106]. Like STMS, ISB places its metadata off-chip to save the precious SRAM storage; unlike it, however, ISB caches the content of its off-chip metadata tables in on-chip structures. Caching the metadata works for ISB as a result of the provided spatial locality. By caching the metadata information, ISB (1) provides faster access to metadata, since the caches offer a high hit ratio and it is not required to proceed to the off-chip memory for every metadata access, and (2) reduces the metadata-induced off-chip bandwidth overhead, as many of the metadata manipulations coalesce in the on-chip caches. Figure 6 shows an overview of the metadata structures of ISB.

Another important contribution of ISB is the synchronization of off-chip metadata manipulations with Translation Lookaside Buffer (TLB) misses. That is, whenever a TLB miss occurs, concurrently with resolving the miss, ISB fetches the corresponding metadata from the off-chip metadata tables; moreover, whenever a TLB entry is evicted, ISB evicts the corresponding entries from the on-chip metadata structures and updates the off-chip metadata tables. Doing so, ISB ensures that the required metadata is always present in the on-chip structures, significantly hiding the latency of off-chip memory accesses that would otherwise be exposed.
Fig. 6. The organization of Irregular Stream Buffer (ISB).
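As a rough illustration of ISB's linearization, the following Python sketch (ours, not from [103]; stream detection, the structural-region size, per-PC training, and the TLB-synchronized caching are simplified or omitted) shows how temporally-correlated physical addresses can be given consecutive structural addresses:

```python
class ISB:
    """Sketch of the Irregular Stream Buffer's two-level mapping:
    temporally-correlated physical addresses receive consecutive
    structural addresses, so successors become structural neighbors."""
    def __init__(self, degree=2, region=1024):
        self.psam = {}        # physical addr -> structural addr
        self.spam = {}        # structural addr -> physical addr
        self.next_sa = 0      # next free structural region base (assumption)
        self.region = region
        self.last_pa = None
        self.degree = degree

    def on_miss(self, pa):
        if pa not in self.psam:                     # assign pa a structural slot
            prev_sa = self.psam.get(self.last_pa)
            if prev_sa is not None and prev_sa + 1 not in self.spam:
                sa = prev_sa + 1                    # extend predecessor's stream
            else:
                sa = self.next_sa                   # start a fresh region
                self.next_sa += self.region
            self.psam[pa] = sa
            self.spam[sa] = pa
        self.last_pa = pa
        sa = self.psam[pa]
        # structural neighbors sa+1, sa+2, ... are the temporal successors
        return [self.spam[sa + i]
                for i in range(1, self.degree + 1) if sa + i in self.spam]
```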
Temporal prefetching has been proposed and developed to capture temporally-correlated access patterns (i.e., the repetition of access patterns in the same order; e.g., if we observe {A, B, C, D}, then it is likely for {B, C, D} to follow {A} in the future). Temporal prefetching is beneficial in the context of pointer-chasing applications, where applications produce bulks of cache misses that exhibit no spatial correlation but temporal repetition. Temporal prefetchers, however, impose significant overheads on the system, which is still a grave concern in the research literature.

Temporal and spatial prefetching techniques capture separate subsets of cache misses, and hence, each leaves a considerable portion of cache misses unpredicted. As a considerable fraction of data misses is predictable only by one of the two prefetching techniques, spatio-temporal prefetching tries to combine them in order to reap the benefits of both methods. Another motivation for spatio-temporal prefetching is the fact that the effectiveness of temporal and spatial prefetching techniques varies across applications. As discussed, pointer-chasing applications (e.g., OLTP) produce long chains of dependent cache misses, which cannot be effectively captured by spatial prefetching, but only by temporal prefetching. On the contrary, scan-dominated applications (e.g., DSS) produce a large number of compulsory cache misses that are predictable by spatial prefetchers but not by temporal prefetchers.

We include Spatio-Temporal Memory Streaming (STeMS) [105], as it is the only proposal in this class of prefetching techniques.
STeMS synergistically integrates spatial and temporal prefetching techniques in a unified prefetcher; STeMS uses a temporal prefetcher to capture the stream of trigger accesses (i.e., the first access to each spatial region) and a spatial prefetcher to predict the expected misses within the spatial regions. The metadata organization of STeMS mainly consists of the metadata tables of STMS [101] and SMS [91]. STeMS, however, seeks to stream the sequence of cache misses in the order they have been generated by the processor, regardless of how the corresponding metadata information has been stored in the history tables of STMS and SMS. To do so, STeMS employs a Reconstruction Buffer, which is responsible for reordering the prefetch requests generated by the temporal and the spatial prefetchers of STeMS so as to send prefetch requests (and deliver their responses) in the order the processor is supposed to consume them.

For enabling the reconstruction process, the metadata tables of SMS and STMS are slightly modified. SMS is modified to record the order of the accessed cache blocks within a spatial region by encoding spatial patterns as ordered lists of offsets, stored in the Patterns Sequence Table (PST). Although the PST is less compact than the PHT (in the original SMS), the offset lists maintain the order required for accurately interleaving temporal and spatial streams. STMS is also modified and records only spatial triggers (and not all events as in the original STMS) in a Region Miss Order Buffer (RMOB). Moreover, entries in both spatial and temporal streams are augmented with a delta field. The delta field in a spatial (temporal) stream represents the number of events from the temporal (spatial) stream that are interleaved between the current and next events of the same type. Figure 7 gives an example of how STeMS reconstructs the total miss order.
Fig. 7. The organization of Spatio-Temporal Memory Streaming (STeMS) and the reconstruction process.
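The interleaving performed by the Reconstruction Buffer can be approximated with the following Python sketch (ours; the exact RMOB/PST encodings differ, delta handling is simplified to one direction, and the real buffer also reorders prefetch responses):

```python
def reconstruct_total_order(rmob_stream, pst):
    """Simplified sketch of STeMS's reconstruction: interleave the temporal
    stream of region triggers with per-region spatial offset sequences.

    rmob_stream: ordered region triggers, e.g. [(pc, base_addr), ...]
    pst: maps a trigger pc to its ordered (offset, delta) list, where delta
         counts the trigger events interleaved before that offset is due.
    """
    order, active = [], []
    for pc, base in rmob_stream:
        for seq in active:
            seq["seen"] += 1                 # one more trigger interleaved
        order.append(base)                   # trigger access of a new region
        active.append({"base": base, "seen": 0,
                       "offsets": list(pst.get(pc, []))})
        for seq in active:                   # emit offsets whose delta expired
            while seq["offsets"] and seq["offsets"][0][1] <= seq["seen"]:
                offset, _delta = seq["offsets"].pop(0)
                order.append(seq["base"] + offset)
    for seq in active:                       # flush what remains at the end
        order.extend(seq["base"] + off for off, _ in seq["offsets"])
    return order
```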
Spatio-temporal prefetching has been proposed and developed to capture both temporal and spatial memory access patterns of applications. Spatio-temporal prefetching is based on the observation that temporal and spatial prefetchers each target a specific subset of cache misses and leave the rest uncovered. Spatio-temporal data prefetching tries to synergistically capture both types of patterns, which neither the temporal nor the spatial prefetcher alone can.

In this chapter, we describe our own proposals for efficient data prefetching that have been published in recent years. We include them in chronological order based on their publication date.
Domino is a state-of-the-art temporal data prefetcher that is built upon STMS (Section 3.1) and seeks to improve its effectiveness. Domino is based on the observation that a single miss address, as used in the lookup mechanism of STMS, cannot always identify the correct miss stream in the history. Therefore, Domino provides a mechanism to look up the history of miss addresses with a combination of the last one or two miss addresses. To do so, Domino replaces the Index Table of STMS with a novel structure, named the Enhanced Index Table (EIT). The EIT, like the Index Table of STMS, stores a pointer for each address in the history; but unlike it, additionally keeps the subsequent miss of each address. Having the next miss of every address in the EIT enables Domino to find the correct stream in the history using the last one or two miss addresses. Moreover, with this organization, Domino becomes able to start prefetching (i.e., issuing the first prefetch request) right after touching the EIT. That is, unlike STMS, which needs to wait for two serial memory accesses (one for the Index Table, then another for the History Table) to start prefetching, Domino can start prefetching immediately after accessing the EIT, because the EIT contains the address of the first prefetch candidate. Starting prefetching sooner causes Domino to offer superior timeliness as compared to STMS. Figure 8 shows the organization of the
EIT.

Fig. 8. The details of the Enhanced Index Table in the Domino prefetcher [58].
The EIT is indexed by a single miss address. Associated with every tag, there are several address-pointer pairs, where the address is a miss of the core and the pointer is a location in the History Table. An (a, p) pair associated with tag t indicates that p is the pointer to the last occurrence of miss address t followed by a. The tag, along with its associated address-pointer pairs, is called a super-entry, and every address-pointer pair is named an entry. Every row of the EIT has several super-entries, and each super-entry has several entries. Domino keeps an LRU stack among both the super-entries and the entries within each super-entry. Upon a cache miss, Domino uses the missed address to fetch a row of the EIT. Then, Domino attempts to find the super-entry associated with the missed address. In case a match is not found, nothing is done; otherwise, a prefetch is sent for the address field of the most recent entry in the found super-entry. When the next triggering event occurs (miss or prefetch hit), Domino searches the super-entry and picks the entry whose address field matches the triggering event. In case no match is found, Domino uses the triggering event to bring another row from the EIT. Otherwise, Domino sends a request to read the row of the History Table pointed to by the pointer field of the matched entry. Once the sequence of miss addresses from the
History Table arrives, Domino issues prefetch requests.
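The following Python sketch (ours, not from [58]; EIT rows, LRU stacks, the first-candidate optimization, and off-chip placement are simplified away) captures the essence of Domino's two-miss lookup:

```python
class Domino:
    """Sketch of Domino's lookup: the Enhanced Index Table (EIT) maps a
    miss address to (next_miss, pointer) entries, so a stream is
    identified by the last TWO misses rather than a single one."""
    def __init__(self, degree=4):
        self.history = []      # FIFO of miss addresses (off-chip in reality)
        self.eit = {}          # addr -> list of (next_addr, pointer), MRU first
        self.last_miss = None
        self.degree = degree

    def on_miss(self, addr):
        prefetches = []
        for next_addr, ptr in self.eit.get(self.last_miss, []):
            if next_addr == addr:              # pair (last_miss, addr) matched
                # replay the misses that followed this pair in the history
                prefetches = self.history[ptr + 2 : ptr + 2 + self.degree]
                break
        if self.last_miss is not None:         # record addr as the successor
            ptr = len(self.history) - 1        # last_miss sits at history tail
            self.eit.setdefault(self.last_miss, []).insert(0, (addr, ptr))
        self.history.append(addr)
        self.last_miss = addr
        return prefetches
```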
Bingo is a recent proposal for spatial data prefetching, as well as the runner-up (and the winner in the multi-core evaluations) of The Third Data Prefetching Championship (DPC-3) [107]. Bingo is based on the observation that assigning footprint information to a single event is suboptimal as compared to a case where footprints are correlated with multiple events. Bingo argues that events are either short, in which case their probability of recurrence is high but assigning footprints to them results in low accuracy, or long, in which case prefetching using them results in high accuracy but much of the opportunity is lost, since their probability of recurrence is quite low. Bingo, for this reason, proposes to associate the observed footprint information with multiple events in order to provide both high opportunity and high accuracy. More specifically, Bingo correlates the observed footprint information of various pages with both the
PC+Address and
PC+Offset of trigger accesses. In the context of Bingo, PC+Address is considered a long event, while PC+Offset is treated as a short event. Whenever the time for prefetching comes (i.e., a triggering access occurs), Bingo uses the footprint that is associated with the longest occurred event for prefetching (i.e., PC+Address; and, if no history is recorded for the PC+Address, the PC+Offset).

A naive implementation of Bingo requires two distinct
PHTs: one table maintains the history of footprints observed after each PC+Address, while the other keeps the footprint metadata associated with each PC+Offset. Upon looking for a pattern to prefetch, logically, first, the PC+Address of the trigger access is used to search the long PHT. If a match is found, the corresponding footprint is utilized to issue prefetch requests. Otherwise, the
PC+Offset of the trigger access is used to look up the short
PHT. In case of a match, the footprint metadata of the matched entry is used for prefetching. If no matching entry is found, no prefetch is issued. Such an implementation, however, would impose significant storage overhead. The authors of Bingo observe that, in the context of spatial data prefetching, a considerable fraction of the metadata stored in the PHTs is redundant. That is, there are many cases where both metadata tables (tables associated with long and short events) offer the same prediction.

To efficiently eliminate redundancies in the metadata storage, instead of using multiple history tables, Bingo proposes having a single history table but looking it up multiple times, each time with a different event. Figure 9 details the practical design of Bingo, which uses only one PHT. The main insight is that short events are carried in long events. That is, by having the long event at hand, one can find out what the short events are, just by ignoring parts of the long event. For the case of Bingo, the information of
PC+Offset is carried in
PC+Address; therefore, by knowing the PC+Address, the PC+Offset is also known. To exploit this phenomenon, Bingo proposes having only one history table that stores just the history of the long event but is looked up with both long and short events. For the case of Bingo, the history table stores the footprints that were observed after each
PC+Address event, but is looked up with both the
PC+Address and
PC+Offset of the trigger access in order to offer high accuracy while not losing prefetching opportunities.

Fig. 9. The details of the PHT lookup in the Bingo prefetcher. Gray parts indicate the case where the lookup with the long event fails to find a match. Each large rectangle indicates a physical way of the history table [74].

To enable this, Bingo indexes the table with a hash of the shortest event but tags it with the longest event. Whenever a piece of information is going to be stored in the history metadata, it is associated with the longest event, and then stored in the history table. To do so, the bits corresponding to the shortest event are used for indexing the history table to find the set in which the metadata should be stored; however, all bits of the longest event are used to tag the entry. More specifically, with Bingo, whenever a new footprint is going to be stored in the metadata organization, it is associated with the corresponding
PC+Address. To find a location in the history table for the new entry, a hash of only the PC+Offset is used to index the table. Knowing the set, the baseline replacement algorithm (e.g., LRU) is used to choose a victim to make room for the new entry. After determining the location, the entry is stored in the history table, but all bits of the PC+Address are used for tagging the entry.

Whenever there is a need for prediction, the PHT is first looked up with the longest event; if a match is found, it is used to make a prediction. Otherwise, the table is looked up with the next-longest event. As both long and short events are mapped to the same set, there is no need to check a new set; instead, the entries of the same set are tested to find a match with the shorter event. With Bingo, the table is first looked up with the PC+Address of the trigger access. If a match is found, the corresponding footprint metadata is used for issuing prefetch requests. Otherwise, the table is looked up with the PC+Offset of the trigger access. As both the
PC+Address and
PC+Offset are mapped to the same set, there is no need to check a new set. That is, all the corresponding PC+Offset entries should be in the same set. Therefore, the entries of the same set are tested to find a match. In this case, however, not all bits of the stored tags need to match; only the PC+Offset bits need to be matched. This way, Bingo associates each footprint with more than one event (i.e., both
PC+Address and
PC+Offset) but stores the footprint metadata in the table with only one of them (the longer one) to reduce the storage requirement. Doing so, redundancies are automatically eliminated because a metadata footprint is stored only once, with its
PC+Address tag. The sketch below illustrates this single-table, two-event lookup.
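The following minimal Python sketch is ours, not from the Bingo paper; the set count, the associativity, and the 32-block region size are illustrative assumptions:

```python
class BingoPHT:
    """Sketch of Bingo's single-table design: sets are indexed by a hash
    of the short event (PC+Offset), while entries are tagged with the
    long event (PC+Address)."""
    REGION_BLOCKS = 32   # assumed Spatial Region size, in blocks

    def __init__(self, num_sets=1024, ways=16):
        self.sets = [[] for _ in range(num_sets)]
        self.num_sets, self.ways = num_sets, ways

    def _index(self, pc, offset):
        return hash((pc, offset)) % self.num_sets   # index with the SHORT event

    def insert(self, pc, address, footprint):
        offset = address % self.REGION_BLOCKS
        s = self.sets[self._index(pc, offset)]
        s.insert(0, ((pc, address), footprint))     # tag with the LONG event
        del s[self.ways:]                           # crude LRU replacement

    def lookup(self, pc, address):
        offset = address % self.REGION_BLOCKS
        s = self.sets[self._index(pc, offset)]
        for (t_pc, t_addr), fp in s:                # 1st: full PC+Address match
            if (t_pc, t_addr) == (pc, address):
                return fp
        for (t_pc, t_addr), fp in s:                # 2nd: PC+Offset match only
            if t_pc == pc and t_addr % self.REGION_BLOCKS == offset:
                return fp
        return None
```

Note how both lookups probe the same set: the short event determines the set, so falling back from the long tag to the short tag never requires reading another set.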
Multi-Lookahead Offset Prefetcher (MLOP) [108] is a state-of-the-art offset prefetcher. Offset prefetching, in fact, is an evolution of stride prefetching (Section 1.3), in which the prefetcher does not try to detect strided streams. Instead, whenever a core requests a cache block (e.g., A), the offset prefetcher prefetches the cache block that is distanced by k cache lines (e.g., A + k), where k is the prefetch offset. In other words, offset prefetchers do not correlate the accessed address with any specific stream; rather, they treat addresses individually and, based on the prefetch offset, issue a prefetch request for every accessed address. Offset prefetchers have been shown to offer significant performance benefits while imposing small storage and logic overheads [109, 110].

The initial proposal for offset prefetching, named Sandbox Prefetcher (SP) [109], attempts to find offsets that yield accurate prefetch requests. To find such offsets, SP evaluates the prefetching accuracy of several predefined offsets (e.g., −8, −7, . . . , +8) and finally allows offsets whose prefetching accuracy is beyond a certain threshold to issue actual prefetch requests. The later work, named Best-Offset Prefetcher (BOP) [110], tweaks SP and sets timeliness as the evaluation metric. BOP is based on the insight that accurate but late prefetch requests do not accelerate the execution of applications as much as timely requests do. Therefore, BOP finds offsets that yield timely prefetch requests in an attempt to have the prefetched blocks ready before the processor actually asks for them.

MLOP takes another step and proposes a novel offset prefetcher. MLOP is based on the observation that while BOP is able to generate timely prefetch requests, it loses much opportunity at covering cache misses because it relies on a single best offset and discards many other appropriate offsets. BOP evaluates several offsets and considers the offset that can generate the most timely prefetch requests as the best offset; then, it relies only on this best offset to issue prefetch requests until another offset becomes better, and hence, the new best. In fact, this is a binary classification: the prefetch offsets are considered either timely offsets or late offsets. After classification, the prefetcher does not allow the so-called late offsets to issue any prefetch requests. However, there might be many other appropriate offsets that are less timely but are of value in that they can hide a significant fraction of cache miss delays.

To overcome the deficiencies of prior work, MLOP proposes to have a spectrum of timeliness for the various prefetch offsets during their evaluation, rather than binarily classifying them. During the evaluation of the various prefetch offsets, MLOP considers multiple lookaheads for every prefetch offset: with which lookahead can an offset cover a specific cache miss?
To implement this, MLOP considers several lookaheads for each offset and assigns score values to every offset at every lookahead, individually. Finally, when the time for prefetching comes, MLOP finds the best offset for each lookahead and allows it to issue prefetch requests; however, the prefetch requests for smaller lookaheads are prioritized and issued first. By doing so, MLOP ensures that it allows the prefetcher to issue enough prefetch requests (i.e., various prefetch offsets are utilized; high miss coverage) while the timeliness is well considered (i.e., the prefetch requests are ordered).

Figure 10 shows an overview of MLOP. To extract offsets from access patterns, MLOP uses an
Access Map Table (AMT). The
AMT keeps track of several recently-accessed addresses, along with a bit-vector for each base address. Each bit in the bit-vector corresponds to a cache block in the neighborhood of the address, indicating whether or not the block has been accessed.

MLOP considers an evaluation period in which it evaluates several prefetch offsets and chooses the qualified ones for issuing prefetch requests later on. For every offset, it considers multiple levels of score, where each level corresponds to a specific lookahead. The score of an offset at lookahead level X indicates the number of cases where the offset prefetcher could have prefetched an access at least X accesses prior to its occurrence. For example, the score of offsets at lookahead level 1 indicates the number of cases where the offset prefetcher could have prefetched any of the future accesses.

Fig. 10. The hardware realization of our MLOP.

To efficiently mitigate the negative effect of all predictable cache misses, MLOP selects one best offset from each lookahead level. Then, during the actual prefetching, it allows all selected best offsets to issue prefetch requests. Doing so, MLOP ensures that it chooses enough prefetch offsets (i.e., it does not suppress many qualified offsets like prior work [110]) and will cover a significant fraction of cache misses that are predictable by offset prefetching. To handle the timeliness issue, MLOP tries to send the prefetch requests in the way the application would have sent them if there had not been any prefetcher: MLOP starts from lookahead level 1 (i.e., the accesses that are expected to happen the soonest) and issues the corresponding prefetch requests (using its best offset), then goes to the upper level; this process repeats. With this prioritization, MLOP tries to hide the latency of all predictable cache misses, as much as possible.
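The scoring-and-selection step can be sketched as follows (ours, not from [108]; the AMT is modeled as ordered per-region access lists, which is a simplification of the actual bit-vector organization and its interval bookkeeping):

```python
def select_best_offsets(access_maps, candidate_offsets, max_lookahead=4):
    """Sketch of MLOP's multi-lookahead scoring: count, for each candidate
    offset and each lookahead level X, the accesses the offset would have
    prefetched at least X accesses in advance; pick one best offset per
    level (level-1 requests are issued first during actual prefetching)."""
    scores = {x: {off: 0 for off in candidate_offsets}
              for x in range(1, max_lookahead + 1)}
    for accesses in access_maps:         # one ordered block list per region
        position = {}                    # block -> index of its first access
        for i, block in enumerate(accesses):
            for off in candidate_offsets:
                base = block - off
                if base in position:
                    # prefetching `block` when `base` was accessed would
                    # have run `lead` accesses ahead of the demand access
                    lead = i - position[base]
                    for x in range(1, min(lead, max_lookahead) + 1):
                        scores[x][off] += 1
            position.setdefault(block, i)
    return [max(scores[x], key=scores[x].get)
            for x in range(1, max_lookahead + 1)]
```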
Runahead MetaData (RMD) [111] is a general technique for harnessing pairwise-correlating prefetching. Pairwise-correlating prefetching refers to methods that correlate every address (or, more generally, every event [74]) with a single next prediction. The next prediction can be the next expected address, in address-based pairwise-correlating prefetchers [99, 104, 112–114], or the next expected delta, in delta-based pairwise-correlating prefetchers [75, 93, 110].

A main challenge in pairwise-correlating prefetching is harnessing the prefetching degree. Unlike streaming prefetchers [58, 101, 102], which prefetch multiple data addresses that follow the correlated address in the FIFO history buffer, or footprint-based prefetchers [74, 91, 95], which prefetch multiple data addresses whose corresponding bit in the bit-vector is set, pairwise-correlating prefetchers are limited to a single prediction per correlation entry; they cannot trivially issue multi-degree prefetching. With this lookahead limitation, pairwise-correlating prefetching faces timeliness as a major problem, in that issuing merely a single prefetch request every time may not result in prefetch requests that cover the whole latency of cache misses (Section 1.2).

What is typically employed in state-of-the-art pairwise-correlating data prefetchers as the de facto mechanism, including delta-based [93] and address-based [104] ones, and even similar instruction prefetchers [115], is using the prediction as input to the metadata tables to make more predictions: whenever a prediction is made, the prefetcher assumes it to be correct and repeatedly indexes the metadata table with the prediction to make more predictions. While this approach has no storage overhead, it offers poor accuracy, as explicitly shown by recent work [58, 74, 75, 115]. The problem with such an approach is that the prefetcher has no information about how many times it should repeat this process. In fact, this emanates from dissimilar stream lengths: if the prefetcher repeats this process N times, for streams whose length is smaller than N, say M, it overprefetches N − M addresses, resulting in inaccuracy; for streams longer than N, it may lose timeliness. Prior approaches that perform multi-degree prefetching in such a way typically choose the degree of prefetching empirically, based on a set of studied workloads. For example, Shevgoor et al. [93] set the degree to four; Bakhshalipour et al. [104] set it to three. These numbers are chosen completely experimentally, for a specific configuration and by examining a limited number of workloads, with which the chosen number provides a reasonable trade-off between accuracy and timeliness. Obviously, limiting the degree to a certain predefined number neither is a solution that scales to various configurations and workloads, nor is optimal (w.r.t. accuracy and timeliness) for the very configuration/workloads examined.

RMD proposes a novel solution to harness multi-degree prefetching in the context of pairwise-correlating prefetchers. The key idea is to have separate metadata information for predicting the next-but-one expected event (e.g., the delta following the next delta; two deltas away from now). This way, RMD employs two separate metadata tables: one predicts the next event (Distance1; D1), and the other, called the Runahead Metadata Table, predicts the next-but-one event (Distance2; D2). When issuing multi-degree prefetching, the first prefetch is issued using only D1.
RMD proposes a novel solution to harness multi-degree prefetching in the context of pairwise-correlating prefetchers. The key idea is to keep separate metadata for predicting the next-but-one expected event (e.g., the delta following the next delta; two deltas away from now). To this end, RMD employs two separate metadata tables: one predicts the next event (Distance-1; D1); the other, called the Runahead Metadata Table, predicts the next-but-one event (Distance-2; D2). When issuing multi-degree prefetches, the first prefetch is issued using only D1. For the second prefetch, D1 is searched using its previous prediction, similar to the multi-degree prefetching of previous methods; meanwhile, D2 is searched using the actual input (not a prediction). The prefetch request is issued only if the predictions of the two tables match; otherwise, prefetching stops. From the third prefetch request (if any) onward, both tables are searched using the corresponding inputs from the previous steps; if their predictions match, the prefetch request is issued and the process continues; otherwise, prefetching stops, concluding that the stream has come to an end.

The reason for adding D2 is to harness the multi-degree prefetching of D1 by answering the question: until when should the recursive lookups continue? As D2 operates one step ahead of D1, what D2 offers now is what D1 is expected to offer in the next step. Hence, when D1's second prediction (i.e., the prediction made using the previous prediction as input) does not equal D2's prediction, it is concluded that the stream has finished, and no further prefetch request is issued for the current stream. As long as the predictions match, however, the prefetcher continues prefetching to provide good timeliness while preserving accuracy.

Figure 11 illustrates how RMD works. The entries in the tables are interpreted as follows: ⟨A, B⟩ in D1 means that B is expected to happen immediately after A; ⟨C, J⟩ in D2 means that J is expected to happen two steps after C. Suppose A happens. RMD indexes D1 by A. The prediction of D1 is B, so a prefetch request is issued for B. Then, D1 is indexed by B while D2 is indexed by A. The predictions of both D1 and D2 are C; they match, and the prefetcher prefetches C. Then, D1 is indexed by C and D2 by B. The prediction of D1 is D, whereas the prediction of D2 is P; the predictions do not match, so no further prefetch request is issued.
Fig. 11. An illustration of how RMD works. The D1 table holds the ⟨Delta, Pred.⟩ pairs ⟨A, B⟩, ⟨B, C⟩, and ⟨C, D⟩; the D2 table holds ⟨A, C⟩, ⟨B, P⟩, and ⟨C, J⟩.
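To make the stopping rule concrete, here is a minimal C++ sketch of RMD's degree control under simplifying assumptions: the metadata tables are plain hash maps, events (addresses or deltas) are abstracted as integers, and the names (Table, lookup, rmd_prefetch) are ours rather than RMD's.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// D1 predicts the next event; D2 (the Runahead Metadata Table) predicts the
// event two steps away. Hash-map tables are a simplifying assumption.
using Table = std::unordered_map<uint64_t, uint64_t>;

static std::optional<uint64_t> lookup(const Table& t, uint64_t key) {
  auto it = t.find(key);
  return it == t.end() ? std::nullopt : std::optional<uint64_t>(it->second);
}

std::vector<uint64_t> rmd_prefetch(const Table& d1, const Table& d2,
                                   uint64_t event) {
  std::vector<uint64_t> out;
  auto first = lookup(d1, event);
  if (!first) return out;
  out.push_back(*first);       // first prefetch: issued using D1 alone

  uint64_t d2_input = event;   // D2 is indexed by inputs from previous steps
  uint64_t d1_input = *first;  // D1 is indexed by its own previous prediction
  while (true) {
    auto p1 = lookup(d1, d1_input);
    auto p2 = lookup(d2, d2_input);
    if (!p1 || !p2 || *p1 != *p2)
      break;                   // predictions disagree: the stream has ended
    out.push_back(*p1);        // predictions match: prefetch and continue
    d2_input = d1_input;       // slide both tables one step forward
    d1_input = *p1;
  }
  return out;
}
```

Tracing this sketch on the tables of Figure 11 with input A issues prefetches for B and C, then stops when D1 predicts D while D2 predicts P, matching the walkthrough above.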
REFERENCES
[1] P. Lotfi-Kamran, B. Grot, M. Ferdman, S. Volos, O. Kocberber, J. Picorel, A. Adileh, D. Jevdjic, S. Idgunji, E. Ozer, and B. Falsafi, “Scale-Out Processors,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 500–511, 2012.
[2] B. Grot, D. Hardy, P. Lotfi-Kamran, C. Nicopoulos, Y. Sazeides, and B. Falsafi, “Optimizing Data-Center TCO with Scale-Out Processors,” IEEE Micro, vol. 32, pp. 1–63, September 2012.
[3] K. Lim, P. Ranganathan, J. Chang, C. Patel, T. Mudge, and S. Reinhardt, “Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 315–326, 2008.
[4] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the Clouds: A Study of Emerging Scale-Out Workloads on Modern Hardware,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 37–48, 2012.
[5] P. Esmaili-Dokht, M. Bakhshalipour, B. Khodabandeloo, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Scale-Out Processors & Energy Efficiency,” arXiv preprint arXiv:1808.04864, 2018.
[6] A. Ansari, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Code Layout Optimization for Near-Ideal Instruction Cache,” IEEE Computer Architecture Letters (CAL), 2019.
[7] F. Mireshghallah, M. Bakhshalipour, M. Sadrosadati, and H. Sarbazi-Azad, “Energy-Efficient Permanent Fault Tolerance in Hard Real-Time Systems,” IEEE Transactions on Computers (TC), 2019.
[8] M. R. Jokar, J. Qiu, F. T. Chong, L. L. Goddard, J. M. Dallesasse, M. Feng, and Y. Li, “Baldur: A Power-Efficient and Scalable Network Using All-Optical Switches,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 153–166, 2020.
[9] M. R. Jokar, J. Qiu, L. L. Goddard, J. M. Dallesasse, M. Feng, Y. Li, and F. T. Chong, “A High-Performance and Energy-Efficient Optical Network Using Transistor Laser,” in TECHCON, 2019.
[10] M. Al-Fares, A. Loukissas, and A. Vahdat, “A Scalable, Commodity Data Center Network Architecture,” ACM SIGCOMM Computer Communication Review (CCR), pp. 63–74, 2008.
[11] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat, “Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers,” in Proceedings of the ACM SIGCOMM Conference, pp. 339–350, 2010.
[12] T. S. Karkhanis and J. E. Smith, “A First-Order Superscalar Processor Model,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 338–349, 2004.
[13] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz, “Understanding Sources of Inefficiency in General-Purpose Chips,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 37–47, ACM, 2010.
[14] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Quantifying the Mismatch Between Emerging Scale-Out Applications and Modern Processors,” ACM Transactions on Computer Systems (TOCS), vol. 30, pp. 15:1–15:24, November 2012.
[15] A. Vakil-Ghahani, S. Mahdizadeh-Shahri, M.-R. Lotfi-Namin, M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Cache Replacement Policy Based on Expected Hit Count,” IEEE Computer Architecture Letters (CAL), vol. 17, no. 1, pp. 64–67, 2018.
[16] S. A. V. Ghahani, S. M. Shahri, M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Making Belady-Inspired Replacement Policies More Effective Using Expected Hit Count,” arXiv preprint arXiv:1808.05024, 2018.
[17] M. Bakhshalipour, S. Tabaeiaghdaei, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Evaluation of Hardware Data Prefetchers on Server Processors,” ACM Computing Surveys (CSUR), vol. 52, pp. 52:1–52:29, June 2019.
[18] M. Bakhshalipour, A. Faraji, S. A. V. Ghahani, F. Samandi, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Reducing Writebacks Through In-Cache Displacement,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 24, no. 2, p. 16, 2019.
[19] E. Lockerman, A. Feldmann, M. Bakhshalipour, A. Stanescu, S. Gupta, D. Sanchez, and N. Beckmann, “Livia: Data-Centric Computing Throughout the Memory Hierarchy,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 417–433, 2020.
[20] H. A. Esfeden, F. Khorasani, H. Jeon, D. Wong, and N. Abu-Ghazaleh, “CORF: Coalescing Operand Register File for GPUs,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), ACM, 2019.
[21] F. Khorasani, H. A. Esfeden, N. Abu-Ghazaleh, and V. Sarkar, “In-Register Parameter Caching for Dynamic Neural Nets with Virtual Persistent Processor Specialization,” in Proceedings of the International Symposium on Microarchitecture (MICRO), pp. 377–389, IEEE, 2018.
[22] F. Khorasani, H. A. Esfeden, A. Farmahini-Farahani, N. Jayasena, and V. Sarkar, “RegMutex: Inter-Warp GPU Register Time-Sharing,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 816–828, IEEE Press, 2018.
[23] M. Kayaalp, K. N. Khasawneh, H. A. Esfeden, J. Elwell, N. Abu-Ghazaleh, D. Ponomarev, and A. Jaleel, “RIC: Relaxed Inclusion Caches for Mitigating LLC Side-Channel Attacks,” in Proceedings of the Design Automation Conference (DAC), pp. 7:1–7:6, ACM, 2017.
[24] R. Hojabr, M. Modarressi, M. Daneshtalab, A. Yasoubi, and A. Khonsari, “Customizing Clos Network-on-Chip for Neural Networks,” IEEE Transactions on Computers (TC), 2017.
[25] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights Landing: Second-Generation Intel Xeon Phi Product,” IEEE Micro, vol. 36, pp. 34–46, March 2016.
[26] R. Haring, M. Ohmacht, T. Fox, M. Gschwind, D. Satterfield, K. Sugavanam, P. Coteus, P. Heidelberger, M. Blumrich, R. Wisniewski, A. Gara, G. Chiu, P. Boyle, N. Chist, and C. Kim, “The IBM Blue Gene/Q Compute Chip,” IEEE Micro, vol. 32, no. 2, pp. 48–60, 2012.
[27] P. Conway and B. Hughes, “The AMD Opteron Northbridge Architecture,” IEEE Micro, vol. 27, pp. 10–21, March 2007.
[28] T. Horel and G. Lauterbach, “UltraSPARC-III: Designing Third-Generation 64-bit Performance,” IEEE Micro, vol. 19, no. 3, pp. 73–85, 1999.
[29] W. A. Wulf and S. A. McKee, “Hitting the Memory Wall: Implications of the Obvious,” SIGARCH Computer Architecture News, vol. 23, pp. 20–24, March 1995.
[30] P. Trancoso, J.-L. Larriba-Pey, Z. Zhang, and J. Torrellas, “The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 250–260, 1997.
[31] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood, “DBMSs on a Modern Processor: Where Does Time Go?,” in Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 266–277, 1999.
[32] R. A. Hankins, T. Diep, M. Annavaram, B. Hirano, H. Eri, H. Nueckel, and J. P. Shen, “Scaling and Characterizing Database Workloads: Bridging the Gap Between Research and Practice,” in Proceedings of the International Symposium on Microarchitecture (MICRO), pp. 116–120, 2003.
[33] N. Hardavellas, I. Pandis, R. Johnson, N. G. Mancheril, A. Ailamaki, and B. Falsafi, “Database Servers on Chip Multiprocessors: Limitations and Opportunities,” in Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR), pp. 79–87, 2007.
[34] M. Bakhshalipour, H. Zare, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Die-Stacked DRAM: Memory, Cache, or MemCache?,” arXiv preprint arXiv:1809.08828, 2018.
[35] S. Rashidi, M. Jalili, and H. Sarbazi-Azad, “Improving MLC PCM Performance Through Relaxed Write and Read for Intermediate Resistance Levels,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 15, no. 1, pp. 12:1–12:31, 2018.
[36] S. Rashidi, M. Jalili, and H. Sarbazi-Azad, “A Survey on PCM Lifetime Enhancement Schemes,” ACM Computing Surveys (CSUR), 2019.
[37] A. Mirzaeian, H. Homayoun, and A. Sasan, “NESTA: Hamming Weight Compression-Based Neural Proc. Engine,” in Proceedings of the Asia and South Pacific Design Automation Conference (ASPDAC), 2020.
[38] A. Mirzaeian, H. Homayoun, and A. Sasan, “TCD-NPE: A Re-Configurable and Efficient Neural Processing Engine, Powered by Novel Temporal-Carry-Deferring MACs,” in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2020.
[39] S. A. Vakil Ghahani, M. T. Kandemir, and J. B. Kotra, “DSM: A Case for Hardware-Assisted Merging of DRAM Rows with Same Content,” in Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), 2020.
[40] H. Jeon, H. A. Esfeden, N. B. Abu-Ghazaleh, D. Wong, and S. Elango, “Locality-aware GPU Register File,” IEEE Computer Architecture Letters (CAL), 2019.
[41] M. R. Jokar, L. Zhang, and F. T. Chong, “Cooperative NV-NUMA: Prolonging Non-Volatile Memory Lifetime Through Bandwidth Sharing,” in Proceedings of the International Symposium on Memory Systems (MEMSYS), pp. 67–78, 2018.
[42] M. Nemirovsky and D. M. Tullsen, Multithreading Architecture. Morgan & Claypool Publishers, 1st ed., 2013.
[43] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, “Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA,” in Proceedings of the Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 73–82, ACM, 2008.
[44] H. Akkary and M. A. Driscoll, “A Dynamic Multithreading Processor,” in Proceedings of the International Symposium on Microarchitecture (MICRO), pp. 226–236, IEEE, 1998.
[45] H. Cui, J. Wu, C.-C. Tsai, and J. Yang, “Stable Deterministic Multithreading Through Schedule Memoization,” in Proceedings of the USENIX Conference on Operating Systems Design and Implementation (OSDI), pp. 207–221, USENIX Association, 2010.
[46] M. Bakhshalipour and H. Sarbazi-Azad, “Parallelizing Bisection Root-Finding: A Case for Accelerating Serial Algorithms in Multicore Substrates,” arXiv preprint arXiv:1805.07269, 2018.
[47] J. L. Lo, L. A. Barroso, S. J. Eggers, K. Gharachorloo, H. M. Levy, and S. S. Parekh, “An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 39–50, 1998.
[48] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen, “Speculative Precomputation: Long-Range Prefetching of Delinquent Loads,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 14–25, 2001.
[49] I. Ganusov and M. Burtscher, “Future Execution: A Prefetching Mechanism That Uses Multiple Cores to Speed Up Single Threads,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 3, pp. 424–449, December 2006.
[50] J. Lee, C. Jung, D. Lim, and Y. Solihin, “Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 20, pp. 1309–1324, September 2009.
[51] M. Kamruzzaman, S. Swanson, and D. M. Tullsen, “Inter-Core Prefetching for Multicore Processors Using Migrating Helper Threads,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 393–404, 2011.
[52] J. D. Collins, D. M. Tullsen, H. Wang, and J. P. Shen, “Dynamic Speculative Precomputation,” in Proceedings of the International Symposium on Microarchitecture (MICRO), pp. 306–317, 2001.
[53] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 129–, 2003.
[54] O. Mutlu, H. Kim, and Y. N. Patt, “Techniques for Efficient Processing in Runahead Execution Engines,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 370–381, 2005.
[55] D. Kadjo, J. Kim, P. Sharma, R. Panda, P. Gratz, and D. Jimenez, “B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors,” in Proceedings of the International Symposium on Microarchitecture (MICRO), pp. 623–634, 2014.
[56] M. Hashemi, Khubaib, E. Ebrahimi, O. Mutlu, and Y. N. Patt, “Accelerating Dependent Cache Misses with an Enhanced Memory Controller,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 444–455, 2016.
[57] P. Ranganathan, K. Gharachorloo, S. V. Adve, and L. A. Barroso, “Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 307–318, 1998.
[58] M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Domino Temporal Data Prefetcher,” in Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), pp. 131–142, IEEE, 2018.
[59] R. Johnson, S. Harizopoulos, N. Hardavellas, K. Sabirli, I. Pandis, A. Ailamaki, N. G. Mancheril, and B. Falsafi, “To Share or Not to Share?,” in Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 351–362, 2007.
[60] J. R. Larus and M. Parkes, “Using Cohort-Scheduling to Enhance Server Performance,” in Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (ATEC), pp. 103–114, 2002.
[61] W. Zhang, B. Calder, and D. M. Tullsen, “A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework,” in Proceedings of the International Symposium on Code Generation and Optimization (CGO), pp. 50–64, 2006.
[62] C.-K. Luk and T. C. Mowry, “Compiler-Based Prefetching for Recursive Data Structures,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 222–233, 1996.
[63] A. Roth and G. S. Sohi, “Effective Jump-Pointer Prefetching for Linked Data Structures,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 111–121, 1999.
[64] T. M. Chilimbi and M. Hirzel, “Dynamic Hot Data Stream Prefetching for General-Purpose Programs,” in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 199–209, 2002.
[65] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry, “Improving Hash Join Performance Through Prefetching,” ACM Transactions on Database Systems (TODS), vol. 32, August 2007.
[66] C. J. Hughes and S. V. Adve, “Memory-Side Prefetching for Linked Data Structures for Processor-in-Memory Systems,” Journal of Parallel and Distributed Computing, vol. 65, pp. 448–463, April 2005.
[67] Y. Solihin, J. Lee, and J. Torrellas, “Using a User-Level Memory Thread for Correlation Prefetching,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 171–182, 2002.
[68] P. Yedlapalli, J. Kotra, E. Kultursay, M. Kandemir, C. R. Das, and A. Sivasubramaniam, “Meeting Midway: Improving CMP Performance with Memory-Side Prefetching,” in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 289–298, 2013.
[69] S. Mittal, “A Survey of Recent Prefetching Techniques for Processor Caches,” ACM Computing Surveys (CSUR), vol. 49, pp. 35:1–35:35, August 2016.
[70] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, “Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 63–74, 2007.
[71] B. Falsafi and T. F. Wenisch, A Primer on Hardware Prefetching. Morgan & Claypool Publishers, 2014.
[72] F. Dahlgren and P. Stenstrom, “Effectiveness of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 68–, 1995.
[73] T. M. Chilimbi, “On the Stability of Temporal Data Reference Profiles,” in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 151–160, 2001.
[74] M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Bingo Spatial Data Prefetcher,” in Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), pp. 399–411, 2019.
[75] J. Kim, S. H. Pugsley, P. V. Gratz, A. L. N. Reddy, C. Wilkerson, and Z. Chishti, “Path Confidence Based Lookahead Prefetching,” in Proceedings of the International Symposium on Microarchitecture (MICRO), pp. 60:1–60:12, 2016.
[76] S. Mehta, Z. Fang, A. Zhai, and P.-C. Yew, “Multi-Stage Coordinated Prefetching for Present-Day Processors,” in Proceedings of the International Conference on Supercomputing (ICS), pp. 73–82, 2014.
[77] M. Bakhshalipour, P. Lotfi-Kamran, A. Mazloumi, F. Samandi, M. Naderan-Tahan, M. Modarressi, and H. Sarbazi-Azad, “Fast Data Delivery for Many-Core Processors,” IEEE Transactions on Computers (TC), vol. 67, no. 10, pp. 1416–1429, 2018.
[78] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt, “Coordinated Control of Multiple Prefetchers in Multi-Core Systems,” in Proceedings of the International Symposium on Microarchitecture (MICRO), pp. 316–326, 2009.
[79] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt, “Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 335–346, 2010.
[80] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 1–12, 2010.
[81] J. M. Tendler, J. S. Dodson, J. Fields, H. Le, and B. Sinharoy, “POWER4 System Microarchitecture,” IBM Journal of Research and Development, vol. 46, no. 1, pp. 5–25, 2002.
[82] J. Doweck, “Inside Intel® Core Microarchitecture,” in IEEE Hot Chips Symposium (HCS), pp. 1–35, 2006.
[83] J.-L. Baer and T.-F. Chen, “An Effective On-chip Preloading Scheme to Reduce Data Access Penalty,” in Proceedings of the ACM/IEEE Conference on Supercomputing, pp. 176–186, 1991.
[84] T. Sherwood, S. Sair, and B. Calder, “Predictor-Directed Stream Buffers,” in Proceedings of the International Symposium on Microarchitecture (MICRO), pp. 42–53, 2000.
[85] Y. Ishii, M. Inaba, and K. Hiraki, “Access Map Pattern Matching for Data Cache Prefetch,” in Proceedings of the International Conference on Supercomputing (ICS), pp. 499–500, 2009.
[86] S. Sair, T. Sherwood, and B. Calder, “A Decoupled Predictor-Directed Stream Prefetching Architecture,” IEEE Transactions on Computers (TC), vol. 52, pp. 260–276, March 2003.
[87] N. P. Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 364–373, 1990.
[88] S. Palacharla and R. E. Kessler, “Evaluating Stream Buffers As a Secondary Cache Replacement,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 24–33, 1994.
[89] C. Zhang and S. A. McKee, “Hardware-Only Stream Prefetching and Dynamic Access Ordering,” in Proceedings of the International Conference on Supercomputing (ICS), pp. 167–175, 2000.
[90] S. Iacobovici, L. Spracklen, S. Kadambi, Y. Chou, and S. G. Abraham, “Effective Stream-Based and Execution-Based Data Prefetching,” in Proceedings of the International Conference on Supercomputing (ICS), pp. 1–11, 2004.
[91] S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos, “Spatial Memory Streaming,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 252–263, 2006.
[92] K. J. Nesbit and J. E. Smith, “Data Cache Prefetching Using a Global History Buffer,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 96–, 2004.
[93] M. Shevgoor, S. Koladiya, R. Balasubramonian, C. Wilkerson, S. H. Pugsley, and Z. Chishti, “Efficiently Prefetching Complex Address Patterns,” in Proceedings of the International Symposium on Microarchitecture (MICRO), pp. 141–152, 2015.
[94] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith, “AC/DC: An Adaptive Data Cache Prefetcher,” in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 135–145, 2004.
[95] S. Kumar and C. Wilkerson, “Exploiting Spatial Locality in Data Caches Using Spatial Footprints,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 357–368, 1998.
[96] C. F. Chen, S.-H. Yang, B. Falsafi, and A. Moshovos, “Accurate and Complexity-Effective Spatial Pattern Prediction,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 276–287, 2004.
[97] J. F. Cantin, M. H. Lipasti, and J. E. Smith, “Stealth Prefetching,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 274–282, 2006.
[98] M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Accurately and Maximally Prefetching Spatial Data Access Patterns with Bingo,” The Third Data Prefetching Championship, 2019.
[99] D. Joseph and D. Grunwald, “Prefetching Using Markov Predictors,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 252–263, 1997.
[100] Y. Chou, “Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications,” in Proceedings of the International Symposium on Microarchitecture (MICRO), pp. 301–313, 2007.
[101] T. F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, and A. Moshovos, “Practical Off-Chip Meta-Data for Temporal Memory Streaming,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 79–90, 2009.
[102] T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi, “Temporal Streaming of Shared Memory,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 222–233, 2005.
[103] A. Jain and C. Lin, “Linearizing Irregular Memory Accesses for Improved Correlated Prefetching,” in Proceedings of the International Symposium on Microarchitecture (MICRO), pp. 247–259, 2013.
[104] M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad, “An Efficient Temporal Data Prefetcher for L1 Caches,” IEEE Computer Architecture Letters (CAL), vol. 16, no. 2, pp. 99–102, 2017.
[105] S. Somogyi, T. F. Wenisch, A. Ailamaki, and B. Falsafi, “Spatio-Temporal Memory Streaming,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 69–80, 2009.
[106] I. Burcea, S. Somogyi, A. Moshovos, and B. Falsafi, “Predictor Virtualization,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 157–167, 2008.
[107] “The Third Data Prefetching Championship.” https://dpc3.compas.cs.stonybrook.edu/, 2019.
[108] M. Shakerinava, M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Multi-Lookahead Offset Prefetching,” The Third Data Prefetching Championship, 2019.
[109] S. H. Pugsley, Z. Chishti, C. Wilkerson, P.-f. Chuang, R. L. Scott, A. Jaleel, S.-L. Lu, K. Chow, and R. Balasubramonian, “Sandbox Prefetching: Safe Run-Time Evaluation of Aggressive Prefetchers,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 626–637, 2014.
[110] P. Michaud, “Best-Offset Hardware Prefetching,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 469–480, 2016.
[111] F. Golshan, M. Bakhshalipour, M. Shakerinava, A. Ansari, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Harnessing Pairwise-Correlating Data Prefetching with Runahead Metadata,” IEEE Computer Architecture Letters (CAL), 2020.
[112] Z. Hu, S. Kaxiras, and M. Martonosi, “Timekeeping in the Memory System: Predicting and Optimizing Memory Behavior,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 209–220, 2002.
[113] Z. Hu, M. Martonosi, and S. Kaxiras, “TCP: Tag Correlating Prefetchers,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 317–326, 2003.
[114] A.-C. Lai, C. Fide, and B. Falsafi, “Dead-Block Prediction & Dead-Block Correlating Prefetchers,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 144–154, 2001.
[115] L. Spracklen, Y. Chou, and S. G. Abraham, “Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), 2005.