MANA: Microarchitecting an Instruction Prefetcher
ALI ANSARI, EPFL, Switzerland
FATEMEH GOLSHAN, Sharif University of Technology, Iran
PEJMAN LOTFI-KAMRAN, Institute for Research in Fundamental Sciences (IPM), Iran
HAMID SARBAZI-AZAD, Sharif University of Technology, Iran and Institute for Research in Fundamental Sciences (IPM), Iran

L1 instruction (L1-I) cache misses are a source of performance bottleneck. Sequential prefetchers are simple solutions to mitigate this problem; however, prior work has shown that these prefetchers leave considerable potential uncovered. This observation has motivated many researchers to come up with more advanced instruction prefetchers. In 2011, Proactive Instruction Fetch (PIF) showed that a hardware prefetcher could effectively eliminate all instruction-cache misses. However, its enormous storage cost makes it an impractical solution. Consequently, reducing the storage cost has been the main research focus in instruction prefetching in the past decade.
Several instruction prefetchers, including RDIP and Shotgun, were proposed to offer PIF-level performance with significantly lower storage overhead. However, our findings show that there is a considerable performance gap between these proposals and PIF. While these proposals use different mechanisms for instruction prefetching, the performance gap is largely not because of the mechanism; instead, it is due to not having sufficient storage. Prior proposals suffer from one or both of the following shortcomings: (1) a large number of metadata records is needed to cover the potential, and (2) each record has a high storage cost. The first problem causes metadata misses, and the second prohibits the prefetcher from storing enough records within reasonably-sized storage.
In this paper, we make the case that the key to designing a powerful and cost-effective instruction prefetcher is choosing the right metadata record and microarchitecting the prefetcher to minimize the storage. We find that the high spatial correlation among instruction accesses leads to compact, accurate, and minimal metadata records. We also show that chaining these records is an effective way to enable robust and timely prefetching. Based on these findings, we propose
MANA, which offers PIF-level performance at a 15.7× lower storage cost. MANA outperforms RDIP and Shotgun by 12.5% and 29%, respectively. We also evaluate a version of MANA with no storage overhead and show that it offers 98% of the peak performance benefits.
CCS Concepts: • Computer systems organization → Architectures.
Additional Key Words and Phrases: processors, frontend bottleneck, instruction prefetching, instruction cache
ACM Reference Format:
Ali Ansari, Fatemeh Golshan, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2020. MANA: Microarchitecting an Instruction Prefetcher.
ACM Trans. Comput. Syst.
1, 1 (February 2020), 24 pages.
Instruction cache misses are a well-known source of performance degradation when the limited-capacity L1 instruction (L1-I) cache cannot capture the large number of instruction blocks demanded by a processor [7, 10, 12, 14, 20, 24, 27, 35] (by instruction block, we mean a 64-byte cache line). In modern processors, the address generator is responsible for filling the fetch queue, which is a queue of addresses expected to be demanded by the processor. The fetch engine looks up the L1-I cache to extract the instructions at the addresses in the fetch queue. These instructions are decoded and sent to the core backend for execution. While modern processors support out-of-order execution, instruction supply to the core backend is still in-order. Therefore, if the instruction at the head of the fetch queue misses in the L1-I cache, the core will no longer be fed new instructions until the missing instruction arrives at the L1-I cache, which results in performance degradation. However, the fetch engine may continue to fetch the remaining addresses in the fetch queue from the L1-I.
Instruction prefetching is a technique to address this problem. An instruction prefetcher predicts future cache misses and sends prefetch requests to fill the cache before demand requests arrive. The most common instruction prefetchers are sequential prefetchers that, upon activation, send prefetch requests for a few subsequent blocks [47, 56]. While sequential prefetchers are used in commodity processors [25, 42, 46, 47], prior work has shown that such prefetchers leave a significant fraction of instruction misses uncovered, and hence, there is a substantial opportunity for improvement [21, 32, 34].
Sequential prefetchers' limitations motivated researchers to propose more sophisticated prefetchers. Proactive Instruction Fetch (PIF) is a pioneer that showed a hardware instruction prefetcher could eliminate most of the instruction cache misses [21]. However, the proposed prefetcher is impractical because of its high storage cost. Nonetheless, PIF motivated many researchers to develop effective and storage-efficient prefetchers [8, 26, 28, 32–34].
We evaluate Return-Address-Stack Directed Instruction Prefetcher (RDIP) [32], Shotgun [33], and PIF, as three state-of-the-art prefetchers that use entirely different approaches for instruction prefetching. We show that PIF offers considerably higher speedup as compared to the other two prefetchers. Our findings indicate that the inefficiency of the competitors is mainly because of metadata misses as a result of not having enough storage. If the storage budget is unlimited, they offer almost the same level of performance. These results suggest that designing a strong instruction prefetcher is mainly about the storage-efficiency of the prefetcher.
In this paper, we argue that designing a strong instruction prefetcher requires considering the following. (1) Prefetchers create and store metadata records and prefetch accordingly.
These metadata records should be chosen carefully to minimize the number of distinct records. A small number of such records enables a prefetcher to experience a high hit ratio in its metadata storage. (2) Along with the small number of distinct records, each record should need as few bits as possible. This feature enables a prefetcher to store a larger number of records within a given storage budget.
Based on these guidelines, we introduce the MANA prefetcher, which benefits from spatial correlation. Not only does spatial correlation offer compact metadata records, but the number of distinct records is also small. We also find that chaining spatial-metadata records provides a space-efficient way to take advantage of temporal correlation among spatial records to maximize the benefit. We organize the metadata storage so that MANA stores as few records as possible, and each record requires a minimal storage cost. The low storage cost enables MANA to achieve over 92% of the performance potential with only 15 KB of storage, considerably outperforming RDIP and Shotgun. Moreover, MANA can prefetch for smaller L1-I caches to eliminate the storage overhead as compared to the baseline design.
This section introduces the primary instruction prefetchers.

2.1 Temporal Prefetchers
Temporal prefetching is based on the fact that the sequence of instruction cache accesses or misses is repetitive, and hence, predictable [21, 22]. Consequently, temporal instruction prefetchers record and replay this sequence to eliminate future instruction cache misses. Temporal Instruction Fetch Streaming (TIFS) [22] records and replays the sequence of misses and offers adequately good results. However, PIF [21] offers a more significant improvement than TIFS by recording and replaying the sequence of instruction accesses.
Temporal prefetchers have two main components: a history in which the sequence is recorded, and an index table that determines the last location of every address (more precisely, every trigger address, as we discuss shortly) in the history. Such a structure imposes a high storage cost, which is the main shortcoming of temporal prefetchers [11, 13]. As an example, PIF requires more than 200 KB of storage per core to work effectively. As providing such ample storage is not feasible, researchers proposed techniques to reduce the storage cost.
Shared History Instruction Fetch (SHIFT) [28] shares PIF's metadata among cores and virtualizes [16] it in the last-level cache (LLC). In multi-core processors, when cores execute the same application, the sequence that is created by one core can be used by the others as well. As a result, it is not necessary to use a dedicated metadata table for each core. Sharing and virtualizing PIF's metadata in the LLC reduces the prefetcher's storage cost from more than 200 KB per core to a total of 240 KB virtualized in an 8 MB LLC (SHIFT's storage cost is proportional to the LLC size). However, the results show that sharing and virtualizing the metadata in the LLC degrades the performance boost of SHIFT as compared to PIF.
Return-Address-Stack Directed Instruction Prefetcher (RDIP) [32] was proposed to offer PIF-level performance with a significantly lower storage cost. RDIP observes that the current state of the return-address stack (RAS) gives an accurate representation of the program's state. To exploit this observation, RDIP XORs the four top entries of the RAS and calls the result a signature. It then assigns the observed instruction cache misses to the corresponding signature. Finally, it stores these misses in a set-associative table that is looked up using the signature. RDIP reduces the per-core storage to over 60 KB. While RDIP requires considerably lower storage than PIF, it still needs a significant storage budget.
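For illustration, a minimal sketch of how such a signature could be formed is shown below. The 32-bit width follows the entry size discussed in Section 3; the reduction from 64-bit addresses to 32 bits and the exact bit used to encode the call/return type are our assumptions, and the function name is ours, not RDIP's.

```cpp
#include <cstdint>
#include <vector>

// Sketch: XOR the four top return-address-stack entries into one value and
// encode the type of control-flow change (call vs. return) in the last bit.
uint32_t rdip_signature(const std::vector<uint64_t>& ras, bool is_return) {
    uint64_t sig = 0;
    size_t n = ras.size();
    for (size_t i = (n >= 4 ? n - 4 : 0); i < n; ++i)
        sig ^= ras[i];                                 // fold the top (up to) four entries
    uint32_t sig32 = uint32_t(sig ^ (sig >> 32));      // reduce to a 32-bit signature
    return (sig32 & ~1u) | (is_return ? 1u : 0u);      // last bit = control-flow type
}
```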
Branch Target Buffer (BTB)-directed prefetchers are advertised as metadata-free prefetchers. Fetch Directed Instruction Prefetcher (FDIP) [44] is the pioneer of such prefetchers. The main idea is to decouple the fetch engine from the branch predictor unit. This way, the branch predictor unit runs ahead of the fetch stream to discover the instruction blocks that will be demanded shortly. The prefetcher checks whether any of those blocks are missing and prefetches them. For this goal, BTB-directed prefetchers use a deep queue of discovered instructions named the Fetch Target Queue (FTQ). The FTQ fills the gap between the fetch engine and the branch prediction unit. The prefetcher makes progress instruction by instruction, finds branch instructions by looking up the BTB, consults the branch predictor to determine their targets, and inserts the instructions into the FTQ. The instructions at the head of the FTQ are the demand instructions. The remaining entries, on the other hand, enable the prefetch engine to look up the cache and prefetch the missing blocks.
The main bottleneck of BTB-directed prefetchers is BTB misses [33, 34]. To correctly go far ahead of the fetch stream, such a prefetcher needs to detect the branches and predict their targets. The BTB is the component that is used to detect branch instructions. The branches' directions can be identified using a branch predictor. Kumar et al. investigated the effect of these two components on BTB-directed instruction prefetching [34]. They showed that while the branch predictor's accuracy is not important, BTB misses can significantly limit BTB-directed prefetchers' efficiency. Hence, their proposal, Boomerang, prefetches not only for the L1-I cache but also for the BTB.
Boomerang uses a basic-block-oriented BTB. The main advantage of this BTB type is that the target of each branch is the starting address of a basic block. As a result, Boomerang can detect BTB misses. Moreover, Boomerang uses an instruction pre-decoder to detect branch instructions and extract their targets. By detecting BTB misses and extracting the missing entries from the instruction blocks, Boomerang can prefill the BTB to continue going ahead of the fetch stream.
With Boomerang, BTB misses are still a bottleneck [33]. Boomerang stalls on a BTB miss and waits until it is resolved. However, resolving a BTB miss requires the instruction block that holds the missing BTB entry to be present in the cache. Otherwise, Boomerang must fetch that block, and then pre-decode it to extract the required BTB entry and fill the BTB. As fetching an instruction block takes considerable clock cycles, in workloads with very large instruction footprints, in which instruction block and BTB misses are frequent, Boomerang does not offer a considerable performance boost [33].
To address this problem, Kumar et al. proposed a new BTB organization within the Shotgun prefetcher [33] that offers two advantages on top of Boomerang. First, Shotgun prefetches for the L1-I without the need to hold all the basic blocks of the instruction cache blocks in the BTB. Moreover, the new BTB covers a larger address space than a conventional BTB. Shotgun has three distinct BTB structures, one for unconditional branches (U-BTB), one for conditional branches (C-BTB), and one for function and trap returns (RIB). The idea is that unconditional branches determine the global control flow of a program. Consequently, Shotgun uses a large part of its BTB to hold unconditional branches.
Moreover, Shotgun stores two footprints for each unconditional branch that show which instruction blocks are accessed around that branch and its target. Using these modifications, Shotgun offers better performance, mainly on workloads with larger instruction footprints.
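To make the BTB-directed flow above concrete, below is a heavily simplified, hedged sketch of the run-ahead loop. Every structure and name here (Frontend, BtbEntry, the fixed 4-byte fall-through) is an illustrative placeholder rather than FDIP's, Boomerang's, or Shotgun's actual design, and each step is reduced to a single BTB lookup per block.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <unordered_set>

struct BtbEntry { uint64_t target; bool conditional; };

struct Frontend {
    std::unordered_map<uint64_t, BtbEntry> btb;  // branch PC -> predicted target
    std::deque<uint64_t> ftq;                    // fetch target queue (block addresses)
    std::unordered_set<uint64_t> l1i;            // block addresses present in L1-I

    bool predict_taken(uint64_t /*pc*/) { return true; }        // stand-in direction predictor
    void issue_prefetch(uint64_t block) { l1i.insert(block); }  // model the prefetch fill

    // Run ahead of the fetch engine: follow the predicted path, enqueue the
    // discovered blocks into the FTQ, and prefetch the ones missing from L1-I.
    void run_ahead(uint64_t pc, size_t ftq_capacity) {
        while (ftq.size() < ftq_capacity) {
            uint64_t block = pc >> 6;            // 64-byte instruction block
            ftq.push_back(block);
            if (!l1i.count(block)) issue_prefetch(block);
            auto it = btb.find(pc);
            if (it == btb.end()) break;          // BTB miss: the run-ahead path is lost
            pc = (!it->second.conditional || predict_taken(pc))
                     ? it->second.target         // taken: jump to the BTB target
                     : pc + 4;                   // not taken: fall through (4-byte insns)
        }
    }
};
```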
We compare RDIP, Shotgun, and PIF to determine their advantages and disadvantages. We then discuss why we need to develop a new instruction prefetcher and how this prefetcher should be designed to address the shortcomings of the prior work. The details of the competing approaches, the simulated core, and the benchmarks are given in Section 5.
Figure 1 compares the performance improvement of RDIP, Shotgun, and PIF over a baseline without a prefetcher. For RDIP and Shotgun, along with the authors-proposed configuration, we evaluate a configuration with infinite storage. Moreover, we evaluate an implementation of PIF in which the history buffer has 5 K entries, while it has 32 K entries in the authors-proposed configuration. The results reveal three essential facts. First, PIF outperforms RDIP and Shotgun by a large gap. It means that the reduction in storage in RDIP and Shotgun is achieved at the cost of losing considerable speedup. Second, RDIP and Shotgun considerably fill this gap when infinite storage is available to them. As such, the performance gap in the original configuration is because RDIP and Shotgun are incapable of holding the large number of records that they require to prefetch effectively. Finally, PIF loses performance when the history buffer is reduced. It means that PIF needs a large history buffer to exploit the potential.
Fig. 1. Speedup offered by state-of-the-art prefetchers.
To find out why RDIP and Shotgun suffer from metadata misses, we should know what their prefetching records are and how they are stored. RDIP creates signatures that are the bit-wise XOR of the four top entries in the RAS. Moreover, it sets the last bit of each signature based on the type of control-flow change (return or call). The signatures are used to look up the Miss Table, in which the addresses of missed blocks are recorded. As a result, RDIP should have a record for each observed signature in the Miss Table, and its number of required records is equal to the number of distinct signatures. We note that RDIP suggested a 4 K-entry Miss Table that is organized as a 4-way set-associative structure.
On the other hand, Shotgun needs to store basic blocks in its BTBs. Shotgun discovers basic blocks one after another, inserts them into the FTQ, and prefetches the blocks associated with those basic blocks. As a result, the prefetching record of Shotgun is a basic block, and the BTB should be large enough to accommodate the basic blocks. To hold these basic blocks, Shotgun uses three BTBs that hold 2 K entries altogether. However, Shotgun attempts to prefill its BTBs to compensate for their relatively small size. Nevertheless, Figure 1 shows that even with the prefilling mechanism, the metadata-miss problem is still a considerable bottleneck.
Finally, PIF benefits from spatial regions. Each spatial region consists of a block address, called a trigger, and a footprint that is a bitmap of the accessed blocks around the trigger. Using the footprint, PIF can record the accessed blocks by keeping a single bit for each block at the cost of storing the full address of the trigger. PIF writes these spatial regions in a long circular history. Moreover, it uses an index table that records the latest entry of the history buffer in which a particular spatial region is recorded. PIF suggests an index table with 8 K entries that is organized as a 4-way set-associative structure and a 32 K-entry circular history buffer to hold the required prefetching records.
Figure 2 shows the number of distinct records that are observed for each prefetcher. In other words, we count the number of signatures for RDIP, basic blocks for Shotgun, and spatial regions' trigger addresses for PIF. It can easily be inferred that RDIP and Shotgun have a significantly larger number of distinct prefetching records. Comparing these values to the number of entries that are suggested for RDIP's Miss Table and Shotgun's BTBs, we conclude that these approaches need orders of magnitude more entries to obtain the full potential. Moreover, we observe that PIF has a significantly smaller number of distinct records. The absolute value is close to 5 K on average, and an 8 K-entry index table can accommodate the records.

Fig. 2. Number of distinct prefetching records.

While Figure 2 suggests that PIF has fewer distinct prefetching records, its design cannot exploit this advantage. Figure 1 shows that by decreasing the number of history-buffer entries from 32 K to 5 K, the obtained speedup shrinks from 42.5% to 32%. This result corroborates a similar study in prior work [28]. The reason is that multiple instances of a spatial record may be written in PIF's history buffer. Consequently, while the number of distinct records is about 5 K, the history buffer should be much larger to hold all of the records successfully. Note that a version of PIF that has a 5 K-entry index table and a 5 K-entry history buffer requires 59 KB, which is still significant.
Not only the number of distinct records but also the size of each record influences the storage overhead. In this section, we take a look at the records of various prefetchers. To make a fair and consistent comparison, we assume that the prefetchers deal with a 46-bit address space.
RDIP:
Each entry in RDIP's Miss Table consists of a signature tag and three trigger addresses, each having an 8-bit footprint. As signatures are 32 bits long, and the Miss Table is a 4 K-entry, 4-way set-associative structure, each signature tag is 22 bits long. On the other hand, each trigger address is 40 bits. Summing up, the total storage cost of each Miss Table entry is 166 bits (over 20 bytes).
Shotgun:
Shotgun associates its required information with the BTB and states that it is a metadata-free prefetcher. Nevertheless, to have a powerful Shotgun prefetcher, there is no other way than increasing the BTB size. Unfortunately, the BTB is a storage-hungry component as it requires two instruction addresses: the branch address and the branch target. While Shotgun proposes three separate BTBs with some differences among them, we consider the U-BTB in this study, as it is the largest BTB of Shotgun.
Considering a 2 K-entry BTB that is organized as a 4-way set-associative structure, the tag of the basic-block address needs 37 bits. The target address is 46 bits. Moreover, an entry needs 5 bits for the basic block's size and 1 bit for the branch type. Finally, Shotgun adds two Call and Return footprints to a BTB entry, and each footprint is 8 bits. Altogether, every BTB entry requires 105 bits (over 13 bytes).
PIF:
Each entry of the index table has a spatial-region tag and a pointer to a 32 K-entry history buffer. As the trigger address of each spatial region is a block address, a spatial-region tag is 29 bits long. Moreover, the pointer requires 15 bits to index a 32 K-entry history buffer. As a result, 44 bits are used for an entry in the index table. Moreover, every entry of the history buffer is a spatial region. A spatial region needs 40 bits for the trigger address and 8 bits for the footprint. As a result, 48 bits are used for an entry in the history buffer. The sum of the number of bits of the entries in the index table and the history buffer is 92 bits (over 11 bytes).

Fig. 3. Speedup of RDIP, Shotgun, and PIF with an equal storage budget, averaged across 50 benchmarks.
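As a quick check, the three per-record costs quoted in this section tally directly from the field widths given above (a worked restatement of the text, not new data):

```latex
\begin{align*}
\text{RDIP Miss Table entry} &= 22_{\text{(tag)}} + 3 \times \big(40_{\text{(trigger)}} + 8_{\text{(footprint)}}\big) = 166~\text{bits} \approx 20.8~\text{bytes}\\
\text{Shotgun U-BTB entry}   &= 37_{\text{(tag)}} + 46_{\text{(target)}} + 5_{\text{(size)}} + 1_{\text{(type)}} + 2 \times 8_{\text{(footprints)}} = 105~\text{bits} \approx 13.1~\text{bytes}\\
\text{PIF index + history}   &= \big(29_{\text{(tag)}} + 15_{\text{(pointer)}}\big) + \big(40_{\text{(trigger)}} + 8_{\text{(footprint)}}\big) = 92~\text{bits} = 11.5~\text{bytes}
\end{align*}
```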
This study shows that due to having a large number of distinct records and the high storage cost of each record, prior prefetchers either do not offer the full potential or need an impractically large storage budget. Figure 3 experimentally verifies this conclusion by showing the speedup of these prefetchers when we change the storage budget from 8 to 256 KB. The reported speedup is the average across all 50 benchmarks (see Section 5). We assume that the baseline design has a 2 K-entry BTB. In the case of Shotgun, we use the storage budget to enlarge the BTB.
Generally, Shotgun and PIF offer very close performance improvement across different storage budgets. However, Figure 3 shows that RDIP lags behind the other prefetchers at all studied storage budgets. Finally, comparing Figure 3 and Figure 1 reveals that even a 256 KB storage budget is not sufficient for RDIP to reach its full potential, which is offered by RDIP Infinite.
All three prefetchers use a structure similar to PIF's spatial regions. RDIP's Miss Table and Shotgun's U-BTB both have footprints to encode the prefetch candidates associated with the trigger. Such a structure is used because accessed or missed blocks have high spatial correlation, and a footprint can encode the blocks in a lightweight manner. However, surprisingly, none of the prior work used a simple table of spatial-region footprints that is looked up by the trigger address.
RDIP, Shotgun, and PIF not only successfully detect the current spatial region but also have the ability to find the successor spatial regions. By providing this feature, a prefetcher can (a) prefetch the trigger address of spatial regions and (b) offer excellent timeliness. Providing this feature contributed to many of the complexities of these prefetchers. RDIP associates the current signature's misses with the prior signature. As a result, the prefetched misses are one signature (or equivalently, one call or return) ahead of the fetch stream. Shotgun follows basic blocks one after another to reach the next U-BTB or RIB hit to prefetch the corresponding footprints. PIF writes the sequence of spatial regions in its history buffer, and consequently, the successive spatial regions are the subsequent entries in the history buffer.
We know that the spatial region is an excellent prefetching record because of the high spatial correlation among accesses, the compactness of the footprint, and the smaller number of distinct spatial regions' trigger addresses as compared to other record types (see Figure 2). Consequently, a prefetcher that uses spatial regions as its prefetching records does not suffer from the two crucial limitations of prior work. Moreover, prior work suffered from the mechanism of identifying the subsequent spatial regions. Unfortunately, a spatial region by itself does not identify the next spatial region. In Section 4.4, we discuss how this feature can be provided without the complicated and storage-hungry approaches used in the prior work.
MANA creates the spatial regions using a spatial region creator and stores them in a set-associative table named MANA_Table. Each spatial region is also associated with a pointer to another MANA_Table entry in which its successor is recorded. A pointer is sufficient to benefit from temporal correlation, as prior temporal prefetchers showed that recently-accessed addresses tend to recur [19, 21, 22, 54, 55]. To reduce the storage cost, MANA exploits the observation that there are only a small number of distinct high-order-bits patterns. This phenomenon arises because the code base of a program has high spatial locality and is much smaller than the size of the physical memory [9, 15]. Consequently, instead of recording the complete trigger address, which is the largest field of a spatial region, MANA uses pointers to the observed high-order-bits patterns, which need considerably fewer bits.
The Spatial Region Creator (SRC) is responsible for creating MANA's prefetching records. Spatial regions consist of a trigger address and a footprint that shows which instruction blocks are observed in the neighborhood of the trigger address. SRC tracks the retire-order instruction stream and extracts its instruction blocks. If the current instruction block is different from the last observed instruction block, SRC attempts to assign this new instruction block to a spatial region. SRC has a queue of spatial regions named the Spatial Region Queue (SRQ). After detecting a new instruction block, SRC compares this instruction block with the SRQ entries. If the block falls in the address space covered by one of the SRQ entries, SRC sets the corresponding bit in that spatial region's footprint. Otherwise, SRC dequeues an item from SRQ, creates a new spatial region whose trigger address is the new instruction block, resets the footprint, and enqueues it in the SRQ.
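The following is a minimal sketch of this logic under stated assumptions: a (0, 8) spatial region (the trigger plus the eight following blocks, as selected later in Section 6.1) and an 8-entry SRQ; class and function names are ours for illustration, not MANA's hardware.

```cpp
#include <cstdint>
#include <deque>

struct SpatialRegion {
    uint64_t trigger;     // trigger block address
    uint8_t  footprint;   // bit i set => block (trigger + i + 1) was accessed
};

class SpatialRegionCreator {
    static constexpr size_t SRQ_SIZE = 8;
    std::deque<SpatialRegion> srq;    // Spatial Region Queue
    uint64_t last_block = ~0ull;

public:
    // Called with the block address of every retired instruction.
    void observe(uint64_t block) {
        if (block == last_block) return;          // still inside the same block
        last_block = block;
        for (auto& sr : srq) {
            if (block == sr.trigger) return;      // the trigger is held implicitly
            if (block > sr.trigger && block <= sr.trigger + 8) {
                sr.footprint |= uint8_t(1u << (block - sr.trigger - 1));
                return;                           // block falls in an existing region
            }
        }
        if (srq.size() == SRQ_SIZE) {             // SRQ full: evict the oldest region
            send_to_mana_table(srq.front());
            srq.pop_front();
        }
        srq.push_back({block, 0});                // open a new region with this trigger
    }
    void send_to_mana_table(const SpatialRegion&) { /* see the insertion sketch below */ }
};
```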
When SRC dequeues an entry from SRQ to enqueue a new spatial region, the evicted spatial region is inserted into MANA_Table. MANA_Table stores the spatial regions in a set-associative structure with a Least Recently Used (LRU) replacement policy that is looked up by the trigger address of the spatial region. Upon an insertion, if a spatial region hit occurs, the spatial region's footprint is updated with the latest footprint. Otherwise, the LRU entry is evicted, and the new spatial region is inserted into MANA_Table.

4.3 Finding the Next Spatial Region
We include in each MANA_Table prefetching record a pointer to another MANA_Table entry to provide sufficient lookahead to prefetch subsequent spatial regions. Whenever a spatial region is inserted into MANA_Table, MANA records its location. By knowing this location, when MANA records a new entry in the table, the pointer of the last recorded spatial region is set to the location of the new entry. Using these pointers, MANA can chase the spatial regions one after another by iteratively going from a spatial region to its successor.
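A minimal sketch of this insertion and pointer-update step is given below, assuming a 1 K-set, 4-way MANA_Table with LRU replacement and a flat entry index of set × ways + way. For clarity, the sketch keeps the full trigger address in each entry, whereas the real design splits it as described in the following discussion of HOBPT; all names are illustrative.

```cpp
#include <array>
#include <cstdint>
#include <vector>

struct TableEntry {
    bool     valid = false;
    uint64_t trigger = 0;     // full trigger kept here for clarity only
    uint8_t  footprint = 0;
    uint16_t successor = 0;   // flat index of the entry recorded right after this one
    uint8_t  age = 0;         // larger = older (simple LRU)
};

class ManaTable {
    static constexpr size_t SETS = 1024, WAYS = 4;
    std::vector<std::array<TableEntry, WAYS>> table{SETS};
    long last_inserted = -1;  // flat index of the most recently inserted region

    TableEntry& at(long flat) { return table[size_t(flat) / WAYS][size_t(flat) % WAYS]; }

public:
    void insert(uint64_t trigger, uint8_t footprint) {
        auto& set = table[trigger % SETS];
        size_t way = 0, victim = 0;
        for (way = 0; way < WAYS; ++way) {
            if (set[way].valid && set[way].trigger == trigger) break;   // hit
            if (!set[victim].valid) continue;                           // keep an empty victim
            if (!set[way].valid || set[way].age > set[victim].age) victim = way;
        }
        if (way == WAYS) {                     // miss: take the LRU (or empty) way
            way = victim;
            set[way] = TableEntry{true, trigger, footprint, 0, 0};
        } else {
            set[way].footprint = footprint;    // hit: update with the latest footprint
        }
        for (auto& e : set) if (e.age < 255) ++e.age;
        set[way].age = 0;                      // mark as most recently used

        long here = long((trigger % SETS) * WAYS + way);
        if (last_inserted >= 0)
            at(last_inserted).successor = uint16_t(here);  // chain predecessor -> here
        last_inserted = here;
    }
};
```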
Considering a 46-bit address space and a 4 K-entry, 4-way set-associative MANA_Table, each record requires a 30-bit trigger-address tag, an 8-bit footprint, and a 12-bit pointer to the successor. The 8-bit footprint is derived from prior work [21, 28, 33]; however, in Section 6.1.1, we choose the appropriate footprint type for MANA. To further reduce the storage cost, we observe that there is considerable similarity between the high-order bits of the instruction blocks, and there are only a few distinct patterns due to the high spatial locality of the code base of programs. As a result, we divide the trigger-address tag into two separate parts: a partial tag and the rest of the high-order bits. We store the partial tag in MANA_Table and the rest of the bits in a separate structure. The division of the tag should be done in a way that minimizes the storage overhead. If we devote more bits to the partial tag, we will have fewer high-order-bits patterns (HOBPs), but we need to store longer partial tags in MANA_Table. On the contrary, if we devote fewer bits to the partial-tag field, we will encounter more distinct HOBPs. In the evaluation section, we show how to divide the tag to minimize the overhead.
MANA stores HOBPs in a set-associative table named the high-order-bits patterns' table (HOBPT). Every newly observed HOBP is inserted into HOBPT. Moreover, each MANA_Table record has a HOBP index, which points to the HOBPT entry in which the corresponding HOBP is recorded.
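The resulting record layout can be summarized as below. The widths of the partial tag and HOBP index are the knobs discussed above (the 2-/7-bit split shown here is the one the evaluation in Section 6.1.7 settles on), and the struct is only an illustrative packing, not the actual hardware layout.

```cpp
#include <cstdint>

constexpr unsigned HOBP_INDEX_BITS  = 7;   // pointer into the high-order-bits table
constexpr unsigned PARTIAL_TAG_BITS = 2;   // low bits of the trigger-address tag
constexpr unsigned FOOTPRINT_BITS   = 8;   // accessed blocks around the trigger
constexpr unsigned SUCCESSOR_BITS   = 12;  // pointer to the next MANA_Table entry

struct ManaTableRecord {
    uint32_t hobp_index  : HOBP_INDEX_BITS;
    uint32_t partial_tag : PARTIAL_TAG_BITS;
    uint32_t footprint   : FOOTPRINT_BITS;
    uint32_t successor   : SUCCESSOR_BITS;
};  // 7 + 2 + 8 + 12 = 29 bits of payload per record

static_assert(HOBP_INDEX_BITS + PARTIAL_TAG_BITS + FOOTPRINT_BITS + SUCCESSOR_BITS == 29,
              "4096 records x 29 bits = 14.5 KB of MANA_Table storage");
```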
Fig. 4. Overview of MANA recording spatial regions.

4.5 Example
Figure 4 shows how MANA records the spatial regions. The configuration of MANA here is for illustrative purposes; the actual configuration is determined through sensitivity analyses in Section 6. The shown process has six steps. SRQ holds two spatial regions whose footprints cover the four instruction blocks ahead of the trigger address. The figure also shows the state of HOBPT and MANA_Table. We represent a spatial region as '(A, XXXX)', where 'A' is the trigger address and 'XXXX' is the footprint of the following four blocks. In the beginning, SRQ is empty. SRC tracks the block address of the retired instructions. In the first step, the retired instruction belongs to block A. SRC looks up SRQ, and as it is empty, SRC finds no match. As a result, SRC creates a new spatial region, which is '(A, 0000)'. In Step 2, the retired instruction is from block A+1. SRC looks up SRQ and finds that A+1 falls in the address space covered by '(A, 0000)' and sets the corresponding bit in its footprint. In the next step, the observed instruction block is B, which does not fall in any of the already created spatial regions. Consequently, SRC creates a new spatial region and enqueues it in the SRQ. In Step 4, similar to Step 2, another bit is set in the footprint of the spatial region whose trigger address is A. In Step 5, SRC cannot match the instruction block with any of the SRQ entries. Moreover, SRQ is full. As a result, it dequeues an entry from SRQ to provide space for the new spatial region. SRC inserts '(C, 0000)' into SRQ. Moreover, the evicted spatial region is sent to MANA_Table. Similarly, in Step 6, '(D, 0000)' is enqueued in SRQ, and '(B, 0000)' is evicted and stored in MANA_Table.
We now describe how spatial regions are placed in MANA_Table. For simplicity, we consider a direct-mapped MANA_Table with 256 entries. Suppose that the address of instruction block A is 0x1FFCAB32. Due to MANA_Table's structure, the eight lower-order bits of A are used to determine the table index in which this spatial region should be stored. Because the lower-order bits are 0x32, or 50 in decimal representation, as the figure shows, this spatial region is inserted into the 50th entry of the table. MANA_Table exploits the commonality among the high-order bits of the instruction block addresses. This means that MANA compares the 16 high-order bits of A with the observed patterns in HOBPT. As MANA finds a match in the first entry of HOBPT, zero is recorded in the HOBP index field of MANA_Table. This way, instead of keeping 16 bits for the higher-order bits, we need only four bits, assuming that HOBPT keeps 16 distinct patterns. Moreover, we use an 8-bit partial tag to store the remaining bits of the trigger-address tag that are not stored in HOBPT. Figure 4 shows how an instruction cache block address can be constructed by combining the HOBP index, the partial tag, and the set number in which a record is stored.
MANA also updates the successor pointer. When MANA inserts '(A, 1100)' into MANA_Table, it also stores the table index in which this record is inserted (i.e., 50). Later on, when MANA inserts the next record, '(B, 0000)', in the 120th row of MANA_Table, it also goes to row 50 and sets the successor pointer to 120. It means that when MANA prefetches the instruction blocks recorded in '(A, 1100)', it can easily prefetch '(B, 0000)' by following the pointer. After setting the successor pointer, MANA updates its last inserted record from 50 to 120.
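The address split in this example can be written out explicitly. The sketch below uses the same illustrative configuration as Figure 4 (a direct-mapped, 256-entry table, an 8-bit partial tag, and 16 high-order bits kept in HOBPT) and simply re-derives the worked numbers; the helper names are ours.

```cpp
#include <cstdint>

constexpr unsigned SET_BITS = 8, PARTIAL_BITS = 8;   // the 16 high-order bits go to HOBPT

// Decompose a block address into the three stored pieces.
constexpr uint32_t set_index(uint32_t block)   { return block & 0xFF; }
constexpr uint32_t partial_tag(uint32_t block) { return (block >> SET_BITS) & 0xFF; }
constexpr uint32_t hobp(uint32_t block)        { return block >> (SET_BITS + PARTIAL_BITS); }

// Rebuild the block address from the HOBP, the partial tag, and the set number.
constexpr uint32_t rebuild(uint32_t h, uint32_t tag, uint32_t set) {
    return (h << (SET_BITS + PARTIAL_BITS)) | (tag << SET_BITS) | set;
}

// Block A = 0x1FFCAB32: set 0x32 (50), partial tag 0xAB, HOBP 0x1FFC.
static_assert(set_index(0x1FFCAB32) == 0x32 && partial_tag(0x1FFCAB32) == 0xAB &&
              hobp(0x1FFCAB32) == 0x1FFC, "decomposition matches the Figure 4 example");
static_assert(rebuild(0x1FFC, 0xAB, 0x32) == 0x1FFCAB32, "the address is reconstructible");
```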
MANA takes advantage of the stream address buffer (SAB), which was previously used by prior temporal prefetchers [21, 28, 29], to prefetch for the L1-I cache. A SAB is a fixed-length sequence of spatial regions that is created by chasing the spatial regions one after another through the pointers stored in MANA_Table. Moreover, a SAB has a pointer to the MANA_Table entry from which the last spatial region was fetched and inserted into the SAB.
The SAB has three goals. First, it enables MANA to go sufficiently ahead of the retire-order instruction stream to issue timely prefetches. Second, the SAB helps MANA know which instruction blocks are already prefetched and the current lookahead over them. Finally, by tracking the spatial regions that are already prefetched, the SAB enables MANA to eliminate redundant and repetitive prefetch requests.
MANA attempts to keep a fixed lookahead ahead of the fetch stream to prefetch the trigger address of the successor spatial regions and also to achieve timeliness. This lookahead is defined as the number of spatial regions that MANA prefetches ahead when it observes an instruction cache block. MANA tracks the fetch stream and extracts its instruction block addresses. If the block address falls in the address space that is covered by a spatial region in a SAB, MANA checks the number of spatial regions that were prefetched after the matched spatial region, and hence, were inserted into the SAB. If this number is lower than the lookahead, MANA chases the spatial regions using the SAB's pointer to MANA_Table to restore the lookahead. If MANA finds no SAB associated with the block address, it considers the instruction block as the trigger address of a spatial region and looks up MANA_Table to find the corresponding spatial region. If MANA_Table finds a match, MANA evicts the LRU SAB entry (if it has multiple SABs) and creates a new SAB by inserting the found spatial region into the SAB and chasing its successor pointer to find the next spatial region. MANA repeats this process until the number of spatial regions inserted into the SAB reaches the predefined lookahead depth. Finally, MANA extracts the instruction blocks that are encoded in the footprints of the inserted spatial regions and prefetches them.
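The prefetch-side flow can be sketched as follows, under stated assumptions: a single SAB of five regions and a lookahead of three (the configuration chosen in Section 6.1), (0, 8) footprints, and a stand-in MANA_Table interface. The names and the exact bookkeeping are illustrative, not the paper's implementation.

```cpp
#include <cstdint>
#include <deque>
#include <optional>

struct Region { uint64_t trigger; uint8_t footprint; uint16_t successor; };

struct ManaTableView {                        // stand-in for the real MANA_Table
    std::optional<Region> lookup_by_trigger(uint64_t /*block*/) { return {}; }
    Region                entry_at(uint16_t /*flat index*/)     { return {}; }
};

class StreamAddressBuffer {
    static constexpr size_t LOOKAHEAD = 3, SAB_LEN = 5;
    std::deque<Region> sab;                   // most recently prefetched regions
    uint16_t next = 0;                        // successor pointer of the last region

    void issue_prefetch(uint64_t /*block*/) { /* enqueue toward the L1-I */ }

    void prefetch_region(const Region& r) {
        issue_prefetch(r.trigger);            // the trigger block itself
        for (int i = 0; i < 8; ++i)
            if (r.footprint & (1u << i)) issue_prefetch(r.trigger + i + 1);
        if (sab.size() == SAB_LEN) sab.pop_front();
        sab.push_back(r);
        next = r.successor;
    }

public:
    void on_fetch_block(uint64_t block, ManaTableView& table) {
        for (size_t i = 0; i < sab.size(); ++i) {
            if (block < sab[i].trigger || block > sab[i].trigger + 8) continue;
            // Matched: keep LOOKAHEAD regions prefetched ahead of this one.
            for (size_t ahead = sab.size() - 1 - i; ahead < LOOKAHEAD; ++ahead)
                prefetch_region(table.entry_at(next));    // chase the successor chain
            return;
        }
        // No match: treat the block as a trigger and try to start a new stream.
        if (auto r = table.lookup_by_trigger(block)) {
            sab.clear();
            prefetch_region(*r);
            for (size_t d = 1; d < LOOKAHEAD; ++d)
                prefetch_region(table.entry_at(next));
        }
    }
};
```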
To evaluate our proposal, we use the ChampSim simulator [1] with the configuration shown in Table 1. We made substantial changes to ChampSim to accurately model the frontend of a processor. Among them, we model the BTB in ChampSim, which is not modeled in the baseline implementation. Moreover, as a result of modeling the BTB, we simulate BTB-miss and branch direction/target misprediction stalls in the address generation component (the baseline ChampSim implementation only models branch direction misprediction stalls). Furthermore, we model wrong-path fetches when a taken-branch BTB miss or a branch direction/target misprediction happens. Modeling the wrong-path fetches is important as they may pollute the cache hierarchy if they eventually become useless, or they may become useful, as observed in [34], and hence lower the usefulness of a prefetcher. The evaluated prefetchers are triggered on instruction or block addresses touched on the wrong path, which affects their behavior. We find these changes crucial to a fair and accurate evaluation of the competing prefetchers in the context of ChampSim.
We use the public benchmarks that are provided by the first instruction prefetching championship (IPC-1) [2]. This package contains 50 benchmarks, including eight client, 35 server, and seven SPEC benchmarks. While we execute all 50 benchmarks, as it is not possible to show all the results, we selected ten benchmarks that represent the various observed behaviors. Moreover, we report the average of the ten selected benchmarks as well as the average of all 50 benchmarks.
Each benchmark is executed for 50 million instructions to warm up the system, including the caches, the branch predictor, and the prefetchers' metadata. The rest of the instructions (i.e., 50 million instructions) are used to collect the evaluation metrics, including Instructions Per Cycle (IPC).

Baseline:
Table 1 summarizes our baseline core's configuration. All performance improvements are measured against this baseline.

Table 1. Evaluation Parameters
Parameter        | Value
Core             | 14 nm, a single 4 GHz OoO core; 352-entry ROB; 128-entry Load Queue; 72-entry Store Queue
Fetch Unit       | 32 KB, 8-way, 64 B block size, 4-cycle latency; hashed-perceptron branch predictor [51]; 2 K-entry Branch Target Buffer; 8-entry MSHRs
L1-D Cache       | 48 KB, 12-way, 64 B block, 5-cycle latency, 16-entry MSHRs, next-line prefetcher
L2 Cache         | 512 KB, 8-way, 10-cycle latency, Signature Path Pattern (SPP) [31] prefetcher
Last-Level Cache | 2 MB, 16-way, 20-cycle latency, 64-entry MSHRs
RDIP:
RDIP [32] uses the top four entries of the RAS to create the signatures. Moreover, we model a 4 K-entry, 4-way set-associative Miss Table, where each entry holds three trigger addresses and the associated footprints of the observed instruction cache misses. All the parameters are chosen based on the original proposal [32]. This prefetcher imposes 83 KB of storage overhead on the baseline design with no instruction prefetcher.
Shotgun:
We model Shotgun [33] with a 1.5 K-entry U-BTB, a 128-entry C-BTB, and a 512-entry RIB, as suggested in the original proposal. Moreover, Shotgun uses a 64-entry instruction cache prefetch buffer and a 32-entry BTB prefetch buffer. Shotgun imposes 6 KB of storage cost on the baseline design; 4 KB comes from the prefetch buffers, and the rest is because of the changes made to the BTB.
PIF:
PIF [21] records the sequence of spatial regions in a circular history buffer of 32 K spatial regions. To find a record in the history buffer, PIF uses an index table that holds pointers to the history buffer's entries. We model a 4-way set-associative index table with 2 K sets, as in the original proposal. PIF imposes over 236 KB of storage overhead on the baseline design. As in the original proposal, the temporal compactor contains eighteen spatial regions, the lookahead is five, and four SABs are used, where each one tracks seven consecutive spatial regions [21, 28].
MANA:
MANA stores the spatial regions in a 4 K-entry table with 1 K sets. Each MANA_Table record consists of 7 bits to indicate the HOBP index, a 2-bit partial tag, an 8-bit footprint, and a 12-bit pointer to the successor spatial region. Moreover, it uses a 128-entry, 8-way set-associative HOBPT. MANA needs a 15 KB storage budget for its metadata.
Minimum Latency (MinLat) L1-I:
MinLat L1-I is used to show the potential of instruction prefetching. With MinLat L1-I, lower levels of the caching hierarchy spend a single cycle when they serve instruction block requests.
MinLat L1-I + Perfect BTB:
In this implementation, an ideal BTB that only faces compulsory misses is also used along with the MinLat L1-I.
In recent proposals, the instruction prefetching problem is combined with the BTB prefilling problem to offer a unified solution to the frontend bottleneck [29, 33, 34]. The main idea is that an instruction block has the required information to fill in the missing BTB entries. A simple instruction pre-decoder decodes the fetched and prefetched blocks to extract the branches. Then, these branches are inserted into the BTB to avoid BTB misses. In our evaluation, we assume that RDIP, PIF, and MANA also benefit from this BTB prefilling mechanism. It also provides a fairer comparison with Shotgun, as Shotgun benefits from a BTB prefilling mechanism in its design. However, in Section 6.2.3, we evaluate the competing proposals based only on their instruction prefetching abilities.
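A highly simplified, hypothetical sketch of the prefill step described above: whenever a block is fetched or prefetched, a pre-decoder hands back the branches it found in that block, and they are installed into the BTB. The Branch/BtbEntry types and predecode_block() are placeholders, not a real decoder for any ISA.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Branch   { uint64_t pc; uint64_t target; bool conditional; };
struct BtbEntry { uint64_t target; bool conditional; };

std::vector<Branch> predecode_block(uint64_t /*block address*/) { return {}; }  // stub

void prefill_btb(std::unordered_map<uint64_t, BtbEntry>& btb, uint64_t block) {
    for (const Branch& b : predecode_block(block))              // branches in the block
        btb.emplace(b.pc, BtbEntry{b.target, b.conditional});   // install if not present
}
```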
6 EVALUATION RESULTS

6.1 Parameter Selection
We start with an initial MANA prefetcher and run sensitivity analyses to tune the parameters. In the initial MANA, the lookahead is five, the SRQ size is eighteen, and there are four SABs, each having twelve entries. Prior work has shown that this configuration successfully exploits the potential [21, 28]. Moreover, MANA_Table holds 4 K entries in a 4-way set-associative structure. We start with a 4 K-entry table because we found that MANA creates fewer than 4 K distinct records, on average (in Figure 2, we showed that PIF creates 5 K records on average; MANA creates fewer distinct records because, instead of the decoupled spatial and temporal compactors used in PIF, it uses a coupled approach in its SRC that helps it create its spatial regions more efficiently). As a result, we expect that such a table can hold the required prefetching records.

6.1.1 Spatial Region Type. The spatial region type determines the length of the spatial region's footprint and the instruction cache blocks that are encoded into it. We use the (X, Y) notation to represent a spatial region that holds X blocks behind and Y blocks ahead of the trigger block. Such a spatial region holds X+Y+1 instruction blocks. Note that the trigger block is held implicitly and does not need a dedicated bit in the footprint. Some pieces of prior work have used (2, 6) spatial regions [21, 32, 33]. However, we examine different spatial regions to find the best-performing one. Figure 5 shows MANA's speedup when (0, 4), (0, 6), (0, 8), (1, 7), and (2, 6) spatial regions are used. The results show that the gap between these regions is negligible. However, (0, 8) performs slightly better than the other types. As such, we use this region type in the rest of the evaluation. We note that (0, 4) could be a good design choice as well: it offers competitive speedup and requires four fewer bits, which provides a storage-saving opportunity.

Fig. 5. Speedup of various region types.
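A small sketch of how a block offset could map to a footprint bit in this (X, Y) notation is shown below; the behind-blocks-first bit ordering is our assumption for illustration, not a statement of MANA's encoding.

```cpp
#include <cstdint>
#include <optional>

// Footprint bit of `block` inside an (X, Y) region at `trigger`, if any.
// The region covers X blocks behind and Y blocks ahead of the trigger; the
// trigger itself is held implicitly and therefore needs no bit.
std::optional<unsigned> footprint_bit(uint64_t trigger, uint64_t block, unsigned X, unsigned Y) {
    int64_t off = int64_t(block) - int64_t(trigger);
    if (off == 0) return std::nullopt;                                    // the trigger block
    if (off < 0 && uint64_t(-off) <= X) return unsigned(off + X);         // behind: bits 0..X-1
    if (off > 0 && uint64_t(off)  <= Y) return unsigned(X + off - 1);     // ahead:  bits X..X+Y-1
    return std::nullopt;                                                  // outside the region
}
```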
The next parameter that we set is the SRQ length. A longer SRQ holds the spatial regions for a more extended period. It enables SRC to better associate the observed instruction blocks with the already created spatial regions in SRQ. However, a longer SRQ imposes more search overhead. Figure 6 shows MANA's speedup when we change the SRQ length from four to eighteen. As expected, increasing the SRQ length improves speedup. However, we choose an 8-entry SRQ to have a more practical and simpler search process.

Fig. 6. Speedup of various SRQ sizes.
We analyze the effect of various lookaheads on the MANA prefetcher. Figure 7 shows the obtained speedup when the lookahead is 1, 2, 3, 5, and 7. The results show that a lookahead of one lags significantly behind the others. Consequently, the pointer to the successor spatial region is necessary to offer good speedup. Moreover, changing the lookahead from two to seven makes only small differences.
While a lookahead of one offers only 24% speedup, comparing Figure 7 and Figure 1 reveals that even MANA with a lookahead of one outperforms RDIP and Shotgun. When the lookahead is one, MANA does not need the ability to chase the spatial regions. As a result, we can remove the pointer to the successor region from the MANA_Table record to reduce the storage overhead. Moreover, it does not need a full tag because MANA only prefetches the instruction blocks that are in the proximity of the trigger address. MANA needs the HOBP index, the partial tag, and the set number to reconstruct the trigger address of a spatial region when it chases the pointers. However, when the lookahead is one, MANA looks up MANA_Table using a trigger address that is completely known. Just to determine whether an entry is in the table, MANA needs only a partial tag. We find that an 8-bit partial tag is sufficient to separate MANA_Table hits from misses. This way, each MANA_Table record contains an 8-bit partial tag and an 8-bit footprint, eliminating the need for a HOBP index. Considering a 4 K-entry MANA_Table, 24% speedup can be achieved using only 8 KB of storage.

Fig. 7. Effect of lookahead on speedup of MANA.

We also evaluate the coverage and overprediction of MANA when its lookahead varies from one to seven. A larger lookahead may offer better speedup but may cause many useless prefetches because of going too far ahead of the execution stream. Figure 8 clearly shows this effect. In this figure, miss coverage, non-covered misses, untimely prefetches, and overprediction are shown. Miss coverage shows the fraction of misses that are eliminated by a prefetcher. The remaining misses are divided into two separate categories: the first includes those misses for which MANA did not send a prefetch, and the second contains cache misses for which a prefetch was sent, but the demand arrived before the prefetch reply. Overprediction is the ratio of useless prefetches to the number of baseline misses and is used to evaluate prefetchers' accuracy [23].
Figure 8 shows that increasing the lookahead from one to two improves the miss coverage from 40% to 61%. The considerably better miss coverage is because, when the lookahead is one, MANA is not able to prefetch the trigger address blocks, and it can only prefetch the footprint. However, when the lookahead is two, MANA can also prefetch the trigger address of the successor spatial regions. However, by going from one to two, the untimely prefetches increase from 5% to 13%. It means that MANA issues more prefetch requests, but they are not sufficiently timely to completely hide the fetch access latency. Going from two to three mitigates this problem: the untimely prefetch requests decrease from 13% to 9%, and the miss coverage reaches 69%. Increasing the lookahead from three to five and seven makes only negligible differences to the miss coverage and the untimely prefetches. However, the overprediction increases significantly. Based on Figures 7 and 8, we set MANA's lookahead to three in the rest of the evaluation, as it offers the right balance between the obtained speedup and the overprediction.

Fig. 8. Effect of lookahead on MANA miss coverage.
In Section 4.6, we described how MANA uses the SAB to issue or filter the prefetch requests. A single stream whose length is at least equal to the lookahead is the smallest possible SAB configuration. However, tracking multiple SABs or a longer SAB may be helpful because they capture more of the recently prefetched spatial regions. Prior work has suggested using four SABs, where each one tracks seven [21] or twelve [28] spatial regions. However, we find a negligible difference between the configurations, both in terms of performance improvement and the ability to filter redundant prefetches. We use a single SAB tracking five spatial regions to have a simple and practical design.
Figure 9 shows the speedup of MANA for various MANA_Table sizes. In prior experiments, MANA_Table had 4 K entries in a 4-way set-associative structure. In this part, we still use a 4-way set-associative table but vary the number of entries from 1 K to 16 K. Figure 9 shows that a 4 K-entry table offers considerably better results as compared to 1 K- and 2 K-entry tables. Moreover, increasing the table size to 8 K and 16 K offers only a small improvement. Consequently, we set the number of MANA_Table entries to 4 K in the rest of the evaluation.

Fig. 9. Speedup of various MANA_Table sizes.
Figure 10 shows how the speedup of MANA varies when we change MANA_Table's associativity from 1 to 8. Increasing the associativity improves performance; however, the improvement from 4 to 8 is not considerable. As such, we use a 4-way set-associative structure for MANA_Table.

Fig. 10. Effect of MANA_Table's associativity on speedup.

6.1.7 High-Order-Bits Patterns.
In Section 4.4, we described how the storage requirements of MANA could be reduced by using a partial tag and the commonality of HOBPs. Table 2 shows how changing the partial tag's length affects MANA's storage requirements. In this study, we assume that HOBPT is large enough to accommodate all observed patterns. We can infer that MANA requires the lowest storage when the partial tag's length is two. In this configuration, HOBPT needs to store up to 128 distinct HOBPs, which needs 0.44 KB. Moreover, every MANA_Table entry contains a 7-bit HOBP index into HOBPT. Altogether, HOBPT and MANA_Table require 14.94 KB of storage.
Table 2. Effect of partial-tag length on MANA’s storage requirements.
Partial Tag Bits | HOBP Index Bits | HOBPT Storage | MANA_Table Storage | Sum
0                | 9               | 1.88 KB       | 14.5 KB            | 16.38 KB
1                | 8               | 0.9 KB        | 14.5 KB            | 15.4 KB
2                | 7               | 0.44 KB       | 14.5 KB            | 14.94 KB
5                | 5               | 0.1 KB        | 15 KB              | 15.1 KB
8                | 3               | 0.02 KB       | 15.5 KB            | 15.52 KB
11               | 3               | 0.02 KB       | 17 KB              | 17.02 KB
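As a sanity check on Table 2, the two storage components can be recomputed from the record layout. The sketch below works the partial-tag-of-two row, assuming HOBPT holds 2^7 = 128 patterns of 30 − 2 = 28 bits each and MANA_Table holds 4 K records:

```latex
\begin{align*}
\text{HOBPT}       &= 128 \times (30 - 2)~\text{bits} = 3584~\text{bits} \approx 0.44~\text{KB}\\
\text{MANA\_Table} &= 4096 \times \big(2_{\text{(partial tag)}} + 7_{\text{(HOBP index)}} + 8_{\text{(footprint)}} + 12_{\text{(pointer)}}\big)~\text{bits} = 14.5~\text{KB}\\
\text{Total}       &\approx 14.94~\text{KB}
\end{align*}
```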
In this section, we compare the MANA prefetcher against the state-of-the-art proposals.
Figure 11 shows the speedup of the competing prefetchers. This figure reveals that MANA significantly outperforms RDIP and Shotgun and has only a small gap with PIF. RDIP, Shotgun, PIF, and MANA provide 22%, 6.5%, 42%, and 38% speedup, respectively, on top of a baseline design without any prefetcher. It clearly shows that MANA either outperforms its competitors by a large gap or offers competitive performance improvement with significantly smaller storage. Comparing MANA with MinLat L1-I + Perfect BTB, we conclude that MANA offers 92% of the performance that can be obtained by MinLat L1-I + Perfect BTB.
To provide a fair comparison, in Figure 11, we also evaluate two other prefetchers: Shotgun (+16 KB) and MANA Aggressive. Shotgun's BTB structures require 23.7 KB of storage [33]. As MANA imposes 15 KB of storage overhead on the baseline design, we enlarge Shotgun's BTBs to 40 KB to provide the same storage to Shotgun as MANA's. This design is shown as Shotgun (+16 KB). Despite the increased BTB size, Shotgun (+16 KB) still lags behind MANA. Moreover, to have a fair comparison with PIF, we also evaluate an aggressive implementation of MANA, shown as MANA Aggr., in which, similar to PIF, the lookahead is five, the SRQ length is eighteen, the number of SABs is four, and each SAB tracks seven consecutive spatial regions. Besides, as PIF's index table holds 8 K pointers to the history buffer, we enlarge MANA_Table to have an equal number of entries. This configuration needs 30 KB of storage. Figure 11 shows that such an implementation has almost no gap with PIF. It means that the gap between MANA and PIF is due to our policy of choosing practical design parameters, and the aggressive implementation offers the level of performance that PIF offers while requiring about 7.9× less storage (30 KB versus over 236 KB).
Fig. 11. Speedup of competing prefetchers.
Figure 12 shows how the competing prefetchers perform in terms of covering the L1-I cache misses. RDIP covers a smaller fraction of misses and also has significant untimely prefetches, which results in a lower performance improvement. On the other hand, we see considerably large miss coverage from Shotgun despite its poor performance improvement. The reason is that Shotgun stalls feeding the fetch engine when a BTB miss occurs. In such cases, Shotgun triggers its BTB prefilling mechanism, which requires bringing the blocks into the L1-I cache and then feeding the pre-decoder with the arriving instructions to extract the missing branches. After resolving the BTB miss, the basic block is fed into the FTQ, and in the next step, it is consumed by the fetch engine. Consequently, these blocks hit in the cache, resulting in high miss coverage; however, the fetch engine is stalled for a considerable amount of time to resolve the BTB miss, which hurts performance. This observation also corroborates a similar study [6]. PIF has better miss coverage as compared to MANA, which also translates to its better speedup. Moreover, it has higher overprediction since it has a higher prefetching depth as compared to MANA.
Fig. 12. Miss coverage of competing prefetchers.
In prior experiments, all competing designs could prefill the BTB to gain performance by eliminating BTB misses. In this section, we compare them solely based on their instruction prefetching abilities. In other words, all designs have a 2 K-entry BTB that is not prefilled. As BTB prefilling is vital for Shotgun because of its very small C-BTB, we provide two separate BTBs to evaluate Shotgun in this section: a 2 K-entry BTB that drives the fetch engine and is not prefilled, similar to its competitors, and Shotgun's BTBs, which are only used for L1-I prefetching and benefit from BTB prefilling. The results are shown in Figure 13. Comparing MinLat L1-I in this figure and MinLat L1-I + Perfect BTB in Figure 11, we can see how crucial eliminating BTB misses is to a high-performance instruction supply. Moreover, we see that Shotgun offers nearly the same level of performance when we compare its results in Figures 11 and 13. It means that a conventional 2 K-entry BTB that does not benefit from BTB prefilling is as strong as Shotgun's BTBs, where its small C-BTB is aggressively prefilled, in terms of driving the fetch engine. RDIP, Shotgun, PIF, and MANA offer 13%, 7%, 25%, and 21% speedup, respectively, where MinLat L1-I offers 25%, which is very close to PIF. It can be seen that for some traces, like server 12, PIF outperforms MinLat L1-I. This is because a powerful instruction prefetcher can completely hide the miss latency, while MinLat L1-I still faces some delay to process the fetch requests in the lower levels of the memory hierarchy.
Fig. 13. Speedup of competing proposals when BTB prefilling is disabled.
In this section, we use MANA to prefetch for a smaller cache-size. This study has two goals. First,by decreasing the size of the L1-I cache, the number of misses increases, and hence, puts morepressure on the prefetcher. Moreover, if the prefetcher still offers good speedup when it is usedwith a smaller cache-size, we can reduce the instruction cache size to provide space for MANAprefetcher. However, this design increases the traffic between the L1-I and L2 caches. Moreover,we expect the L2 external bandwidth usage does not change as almost all of the prefetch requestsare served by the L2. To provide quantitative analyses, we show the speedup and the L1-I and L2external bandwidth usage when MANA is used to prefetch for a 16 KB and an 8 KB L1-I cache. Byexternal bandwidth usage, we mean the number of fetch and prefetch requests that are sent to thelower level of the cache hierarchy to the number of fetch requests that are sent when we have useda 32 KB cache with no prefetcher.Figure 14 shows that when we decrease the L1-I cache size, MANA still offers good speedup. Thespeedup is 35% and 33% for 16 KB and 8 KB caches, respectively. Note that MANA is designed to beindependent of what is going on in L1-I caches. In other words, MANA does the same independentof the L1-I cache. The offered speedup on 16 KB cache is very close to the speedup obtainedby the conventional 32 KB cache. So, we can use MANA with a 16 KB cache to avoid imposingstorage overhead. This way, the design imposes no storage overhead while offers almost the sameperformance as MANA with a 32 KB cache. .01.2 client client server server server server server server spec gcc-3 spec x264-1 Avrg. Avrg.
Fig. 14. Speedup of MANA for various L1-I cache sizes.
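To make the external bandwidth metric defined above concrete, the following minimal sketch shows how it could be computed from simple event counters. The BandwidthStats structure, the counter names, and the counts in main are illustrative assumptions, not part of our evaluation infrastructure.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative counters; a simulator would increment these while running a trace.
struct BandwidthStats {
  uint64_t fetch_requests_to_l2;     // demand fetches sent from the L1-I to the L2
  uint64_t prefetch_requests_to_l2;  // prefetch requests sent from the L1-I to the L2
  uint64_t baseline_fetch_requests;  // demand fetches of the 32 KB, no-prefetcher baseline
};

// External bandwidth usage, normalized to the 32 KB no-prefetcher baseline:
// (fetch + prefetch requests to the next level) / (baseline fetch requests).
double normalized_l1i_bandwidth(const BandwidthStats& s) {
  return static_cast<double>(s.fetch_requests_to_l2 + s.prefetch_requests_to_l2) /
         static_cast<double>(s.baseline_fetch_requests);
}

int main() {
  // Hypothetical counts, chosen only to illustrate a roughly 2x result.
  BandwidthStats s{1500000, 2500000, 2000000};
  std::printf("normalized L1-I external bandwidth: %.2fx\n", normalized_l1i_bandwidth(s));
  return 0;
}
```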
As expected, however, the external bandwidth usage increases as the L1-I size decreases. Figure 15 shows that the L1-I external bandwidth usage of the 16 KB and 8 KB caches increases by 2× and 2.65×, respectively. While the L1-I external bandwidth grows, it is on-chip bandwidth between the L1 and L2 caches, so trading it for a significant performance improvement is beneficial. On the other hand, the L2 external bandwidth usage does not change (and even slightly decreases in some cases) as the L1-I cache size shrinks. We find that almost all of the L1-I requests are served by the L2 cache, so they add no extra off-chip traffic. Moreover, L1-I requests regularly promote instruction blocks in the L2 cache to the most-recently-used (MRU) position, helping them stay longer in the L2 cache and resulting in lower L2 external traffic in some cases; a didactic sketch of this promotion effect follows Figure 15.
Fig. 15. L1-I and L2 external bandwidth usage when MANA prefetches for various L1-I cache sizes.
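The L2 behavior above can be illustrated with a small, didactic model of one LRU-managed cache set; this is a generic sketch, not the replacement policy of any specific simulated L2. Any L1-I request, demand or prefetch, that hits in the set promotes the block to the MRU position, so instruction blocks that MANA keeps requesting are evicted later.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// Didactic model of one LRU-managed set: a hit (demand or prefetch) promotes
// the block to the MRU position, delaying its eviction.
class LruSet {
  std::list<uint64_t> stack_;  // front = MRU, back = LRU
  std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos_;
  size_t ways_;

 public:
  explicit LruSet(size_t ways) : ways_(ways) {}

  // Returns true on a hit. Both demand fetches and L1-I prefetches call this,
  // so instruction blocks that are repeatedly requested stay near the MRU end.
  bool access(uint64_t block) {
    auto it = pos_.find(block);
    if (it != pos_.end()) {
      stack_.splice(stack_.begin(), stack_, it->second);  // promote to MRU
      return true;
    }
    if (stack_.size() == ways_) {  // evict the LRU block
      pos_.erase(stack_.back());
      stack_.pop_back();
    }
    stack_.push_front(block);      // insert the new block at the MRU position
    pos_[block] = stack_.begin();
    return false;
  }
};
```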
Many pieces of prior work showed that the frontend bottleneck is a major source of performance degradation [3, 4, 8, 17, 24, 30, 36, 43, 50]. A myriad of proposals suggested prefetchers to address this problem [5, 18, 21, 22, 26, 28, 29, 32–34, 37, 40, 44, 45, 48, 49, 52, 57, 58].

Some of these proposals are branch-predictor-directed prefetchers that leverage the branch predictor unit to run ahead of the fetch stream, discover the missed blocks, and prefetch them [18, 33, 34, 40, 44, 45, 49, 52].

The discontinuity prefetcher [48] uses a sequential prefetcher to eliminate sequential misses and then records the remaining cache misses in a discontinuity table. Using this table, it prefetches the discontinuity misses. Nonetheless, this prefetcher demands a large storage budget, as each discontinuity entry consists of two distinct cache-block addresses.

Code-layout optimization is another way to tackle the L1-I miss problem [39, 41]. In these techniques, the program is profiled, and a control-flow graph (CFG) is created. By chaining the most frequently executed control-flow changes in the CFG, the layout of the program is optimized. Code-layout optimization works well for workloads where the control-flow changes are mostly static and can be determined at compile time.

Some pieces of prior work inserted prefetching instructions into the program code [5, 8, 26, 37, 38, 53]. These proposals use offline or online program profiling to choose where the prefetching instructions should be added.

Confluence [29] is the first proposal that offers a unified solution to address both instruction-cache and BTB misses. It uses a pre-decoder to extract the branch instructions from the fetched or prefetched blocks and fill the BTB. Confluence used SHIFT [28] as its instruction prefetcher; however, its idea can be applied to any other instruction prefetcher. In this work, we used Confluence's notion along with the competing prefetchers to eliminate BTB misses, as sketched below.
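The following sketch illustrates the Confluence-style BTB prefilling applied to the evaluated prefetchers: on every L1-I fill, whether caused by a demand fetch or a prefetch, a pre-decoder scans the incoming block for branches and installs them in the BTB. The Btb structure and the predecode_branches stub are simplified placeholders assumed for this sketch, not the exact structures of Confluence or of our simulator.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified placeholder types for this sketch.
struct BranchInfo {
  uint64_t pc;      // address of the branch instruction
  uint64_t target;  // its decoded target
};

// A toy BTB: maps a branch PC to its target.
struct Btb {
  std::unordered_map<uint64_t, uint64_t> entries;
  void insert(uint64_t pc, uint64_t target) { entries[pc] = target; }
};

// Stub pre-decoder: a real implementation would partially decode the 64-byte
// block and extract its branch instructions; here it is left as a placeholder.
std::vector<BranchInfo> predecode_branches(uint64_t /*block_addr*/,
                                           const uint8_t* /*bytes*/) {
  return {};
}

// Called on every L1-I fill, whether triggered by a demand fetch or by the
// instruction prefetcher, so the BTB is warmed ahead of the fetch stream.
void on_l1i_fill(uint64_t block_addr, const uint8_t* bytes, Btb& btb) {
  for (const BranchInfo& br : predecode_branches(block_addr, bytes)) {
    btb.insert(br.pc, br.target);  // prefill the BTB alongside the L1-I
  }
}
```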
Prior work used various prefetchers to eliminate instruction-cache misses; however, as shown in this paper, they either do not offer the full potential or require excessive storage. We showed that prior proposals suffer from requiring a large number of prefetching records to offer their full potential and from the high storage cost of these records. In this paper, we made the case that designing an effective and cost-efficient instruction prefetcher is about choosing the right metadata record and microarchitecting the prefetcher to minimize the storage requirement. Given these insights, we introduced the MANA prefetcher. With only 15 KB of storage, MANA offers a level of performance close to that of the best-performing and highly storage-hungry instruction prefetcher. Moreover, MANA significantly outperforms all prior prefetchers when they have the same storage budget.
REFERENCES
[1] 2020. ChampSim. https://github.com/ChampSim/.
[2] 2020. IPC-1. https://research.ece.ncsu.edu/ipc/.
[3] Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, and David A. Wood. 1999. DBMSs on a Modern Processor: Where Does Time Go?. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB). 266–277.
[4] Samira Mirbagher Ajorpaz, Elba Garza, Sangam Jindal, and Daniel A. Jiménez. 2018. Exploring Predictive Replacement Policies for Instruction Cache and Branch Target Buffer. In Proceedings of the International Symposium on Computer Architecture (ISCA).
[5] Murali Annavaram, Jignesh M. Patel, and Edward S. Davidson. 2003. Call Graph Prefetching for Database Applications. ACM Transactions on Computer Systems (TOCS) 21, 4 (Nov. 2003), 412–444. https://doi.org/10.1145/945506.945509
[6] Ali Ansari, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2020. Divide and Conquer Frontend Bottleneck. In Proceedings of the 47th Annual International Symposium on Computer Architecture (ISCA).
[7] Grant Ayers, Jung Ho Ahn, Christos Kozyrakis, and Parthasarathy Ranganathan. 2018. Memory Hierarchy for Web Search. In Proceedings of the 24th IEEE International Symposium on High-Performance Computer Architecture (HPCA). 643–656. https://doi.org/10.1109/HPCA.2018.00061
[8] Grant Ayers, Nayana Prasad Nagendra, David I. August, Hyoun Kyu Cho, Svilen Kanev, Christos Kozyrakis, Trivikram Krishnamurthy, Heiner Litz, Tipp Moseley, and Parthasarathy Ranganathan. 2019. AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA). 462–473.
[9] Mohammad Bakhshalipour, Aydin Faraji, Seyed Armin Vakil Ghahani, Farid Samandi, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2019. Reducing Writebacks Through In-Cache Displacement. ACM Transactions on Design Automation of Electronic Systems (TODAES) 24, 2 (2019), 1–21.
[10] Mohammad Bakhshalipour, Pejman Lotfi-Kamran, Abbas Mazloumi, Farid Samandi, Mahmood Naderan-Tahan, Mehdi Modarressi, and Hamid Sarbazi-Azad. 2018. Fast Data Delivery for Many-Core Processors. IEEE Trans. Comput. 67, 10 (2018), 1416–1429.
[11] Mohammad Bakhshalipour, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2017. An Efficient Temporal Data Prefetcher for L1 Caches. IEEE Computer Architecture Letters 16, 2 (2017), 99–102.
[12] Mohammad Bakhshalipour, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2018. Domino Temporal Data Prefetcher. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 131–142.
[13] Mohammad Bakhshalipour, Mehran Shakerinava, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2019. Bingo Spatial Data Prefetcher. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 399–411.
[14] Mohammad Bakhshalipour, Seyedali Tabaeiaghdaei, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2019. Evaluation of Hardware Data Prefetchers on Server Processors. ACM Computing Surveys (CSUR) 52, 3 (2019), 1–29.
[15] Mohammad Bakhshalipour, HamidReza Zare, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2018. Die-Stacked DRAM: Memory, Cache, or MemCache? arXiv preprint arXiv:1809.08828 (2018).
[16] Ioana Burcea, Stephen Somogyi, Andreas Moshovos, and Babak Falsafi. 2008. Predictor Virtualization. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 157–167. https://doi.org/10.1145/1346281.1346301
[17] Qiang Cao, Pedro Trancoso, J-L Larriba-Pey, Josep Torrellas, Robert Knighten, and Youjip Won. 1999. Detailed Characterization of a Quad Pentium Pro Server Running TPC-D. In Proceedings of the IEEE International Conference on Computer Design (ICCD). 108–115. https://doi.org/10.1109/ICCD.1999.808414
[18] I-Cheng K. Chen, Chih-Chieh Lee, and Trevor N. Mudge. 1997. Instruction Prefetching Using Branch Prediction Information. In Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors (ICCD). IEEE, 593–601.
[19] Michael Ferdman. 2012. Proactive Instruction Fetch. Ph.D. Dissertation. Carnegie Mellon University.
[20] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Clearing the Clouds: A Study of Emerging Scale-Out Workloads on Modern Hardware. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 37–48.
[21] Michael Ferdman, Cansu Kaynak, and Babak Falsafi. 2011. Proactive Instruction Fetch. In Proceedings of the 44th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). 152–162.
[22] Michael Ferdman, Thomas F. Wenisch, Anastasia Ailamaki, Babak Falsafi, and Andreas Moshovos. 2008. Temporal Instruction Fetch Streaming. In Proceedings of the 41st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). 1–10.
[23] Fatemeh Golshan, Mohammad Bakhshalipour, Mehran Shakerinava, Ali Ansari, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2020. Harnessing Pairwise-Correlating Data Prefetching With Runahead Metadata. IEEE Computer Architecture Letters 19, 2 (2020), 130–133.
[24] Nikos Hardavellas, Ippokratis Pandis, Ryan Johnson, Naju G. Mancheril, Anastassia Ailamaki, and Babak Falsafi. 2007. Database Servers on Chip Multiprocessors: Limitations and Opportunities. In Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research (CIDR). 79–87.
[25] Norman P. Jouppi. 1990. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. ACM SIGARCH Computer Architecture News 18, 2SI (1990), 364–373.
[26] Prathmesh Kallurkar and Smruti R. Sarangi. 2016. pTask: A Smart Prefetching Scheme for OS Intensive Applications. In Proceedings of the 49th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). 3:1–3:12. https://doi.org/10.1109/MICRO.2016.7783706
[27] Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. 2015. Profiling a Warehouse-Scale Computer. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA). 158–169. https://doi.org/10.1145/2749469.2750392
[28] Cansu Kaynak, Boris Grot, and Babak Falsafi. 2013. SHIFT: Shared History Instruction Fetch for Lean-Core Server Processors. In Proceedings of the 46th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). 272–283.
[29] Cansu Kaynak, Boris Grot, and Babak Falsafi. 2015. Confluence: Unified Instruction Supply for Scale-Out Servers. In Proceedings of the 48th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). 166–177. https://doi.org/10.1145/2830772.2830785
[30] Kimberly Keeton, David A. Patterson, Yong Qiang He, Roger C. Raphael, and Walter E. Baker. 1998. Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA). 15–26.
[31] Jinchun Kim, Seth H. Pugsley, Paul V. Gratz, A. L. Narasimha Reddy, Chris Wilkerson, and Zeshan Chishti. 2016. Path Confidence Based Lookahead Prefetching. In Proceedings of the 49th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). IEEE, 1–12.
[32] Aasheesh Kolli, Ali Saidi, and Thomas F. Wenisch. 2013. RDIP: Return-Address-Stack Directed Instruction Prefetching. In Proceedings of the International Symposium on Microarchitecture (MICRO).
[33] Rakesh Kumar, Boris Grot, and Vijay Nagarajan. 2018. Blasting Through the Front-End Bottleneck with Shotgun. In Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 30–42. https://doi.org/10.1145/3173162.3173178
[34] Rakesh Kumar, Cheng-Chieh Huang, Boris Grot, and Vijay Nagarajan. 2017. Boomerang: A Metadata-Free Architecture for Control Flow Delivery. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA). 493–504. https://doi.org/10.1109/HPCA.2017.53
[35] Kevin Lim, Parthasarathy Ranganathan, Jichuan Chang, Chandrakant Patel, Trevor Mudge, and Steven Reinhardt. 2008. Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments. In Proceedings of the 35th Annual International Symposium on Computer Architecture (ISCA). 315–326.
[36] Jack L. Lo, Luiz André Barroso, Susan J. Eggers, Kourosh Gharachorloo, Henry M. Levy, and Sujay S. Parekh. 1998. An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA). 39–50.
[37] Chi-Keung Luk and Todd C. Mowry. 1998. Cooperative Prefetching: Compiler and Hardware Support for Effective Instruction Prefetching in Modern Processors. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). 182–194. https://doi.org/10.1109/MICRO.1998.742780
[38] Chi-Keung Luk and Todd C. Mowry. 2001. Architectural and Compiler Support for Effective Instruction Prefetching: A Cooperative Approach. ACM Transactions on Computer Systems (TOCS) 19, 1 (Feb. 2001), 71–109. https://doi.org/10.1145/367742.367786
[39] Karl Pettis and Robert C. Hansen. 1990. Profile Guided Code Positioning. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 16–27. https://doi.org/10.1145/93542.93550
[40] Jim Pierce and Trevor Mudge. 1996. Wrong-Path Instruction Prefetching. In Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). 165–175. https://doi.org/10.1109/MICRO.1996.566459
[41] Alex Ramirez, Luiz André Barroso, Kourosh Gharachorloo, Robert Cohn, Josep Larriba-Pey, P. Geoffrey Lowney, and Mateo Valero. 2001. Code Layout Optimizations for Transaction Processing Workloads. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA). 155–164. https://doi.org/10.1145/379240.379260
[42] Alex Ramirez, Oliverio J. Santana, Josep L. Larriba-Pey, and Mateo Valero. 2002. Fetching Instruction Streams. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). 371–382. https://doi.org/10.1109/MICRO.2002.1176264
[43] Parthasarathy Ranganathan, Kourosh Gharachorloo, Sarita V. Adve, and Luiz André Barroso. 1998. Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 307–318.
[44] Glenn Reinman, Brad Calder, and Todd Austin. 1999. Fetch Directed Instruction Prefetching. In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). 16–27.
[45] Glenn Reinman, Brad Calder, and Todd Austin. 2001. Optimizations Enabled by a Decoupled Front-End Architecture. IEEE Trans. Comput. 50, 4 (2001), 338–355.
[46] Oliverio J. Santana, Alex Ramirez, and Mateo Valero. 2007. Enlarging Instruction Streams. IEEE Transactions on Computers (TC) 56, 10 (Oct. 2007), 1342–1357. https://doi.org/10.1109/TC.2007.70742
[47] Alan J. Smith. 1978. Sequential Program Prefetching in Memory Hierarchies. Computer 11, 12 (Dec. 1978), 7–21. https://doi.org/10.1109/C-M.1978.218016
[48] Lawrence Spracklen, Yuan Chou, and Santosh G. Abraham. 2005. Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications. In Proceedings of the 11th IEEE International Symposium on High-Performance Computer Architecture (HPCA). 225–236. https://doi.org/10.1109/HPCA.2005.13
[49] Viji Srinivasan, Edward S. Davidson, Gary S. Tyson, Mark J. Charney, and Thomas R. Puzak. 2001. Branch History Guided Instruction Prefetching. In Proceedings of the 7th IEEE International Symposium on High-Performance Computer Architecture (HPCA). 291–300. https://doi.org/10.1109/HPCA.2001.903271
[50] Robert Stets, Kourosh Gharachorloo, and L. Barroso. 2002. A Detailed Comparison of Two Transaction Processing Workloads. In Proceedings of the IEEE International Workshop on Workload Characterization (WWC). 37–48. https://doi.org/10.1109/WWC.2002.1226492
[51] David Tarjan and Kevin Skadron. 2005. Merging Path and Gshare Indexing in Perceptron Branch Prediction. ACM Transactions on Architecture and Code Optimization (TACO) 2, 3 (2005), 280–300.
[52] Alexander V. Veidenbaum. 1997. Instruction Cache Prefetching Using Multilevel Branch Prediction. In Proceedings of the International Symposium on High-Performance Computing (ISHPC). 51–70. https://doi.org/10.1007/BFb0024203
[53] Zhenlin Wang, Doug Burger, Kathryn S. McKinley, Steven K. Reinhardt, and Charles C. Weems. 2003. Guided Region Prefetching: A Cooperative Hardware/Software Approach. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA). 388–398. https://doi.org/10.1145/859618.859663
[54] Thomas F. Wenisch, Michael Ferdman, Anastassia Ailamaki, Babak Falsafi, and Andreas Moshovos. 2009. Practical Off-Chip Meta-Data for Temporal Memory Streaming. In Proceedings of the 15th IEEE International Symposium on High-Performance Computer Architecture (HPCA). 79–90. https://doi.org/10.1109/HPCA.2009.4798239
[55] Thomas F. Wenisch, Stephen Somogyi, Nikolaos Hardavellas, Jangwoo Kim, Anastassia Ailamaki, and Babak Falsafi. 2005. Temporal Streaming of Shared Memory. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA). 222–233. https://doi.org/10.1109/ISCA.2005.50
[56] Chun Xia and Josep Torrellas. 1996. Instruction Prefetching of Systems Codes with Layout Optimized for Reduced Cache Misses. In Proceedings of the 23rd Annual International Symposium on Computer Architecture (ISCA). 271–282. https://doi.org/10.1145/232973.233001
[57] Jun Yan and Wei Zhang. 2008. Analyzing the Worst-Case Execution Time for Instruction Caches with Prefetching. ACM Transactions on Embedded Computing Systems (TECS) (2008).
[58] Yi Zhang, Steve Haga, and Rajeev Barua. 2002. Execution History Guided Instruction Prefetching. In