Arsenal of Hardware Prefetchers
Dishank Yadav*, Chaitanya Paikara+
*Indian Institute of Technology Kharagpur, +University of Washington
ABSTRACT
Hardware prefetching is a latency-tolerance technique that hides costly DRAM accesses. Although hardware prefetching is a fundamental mechanism on most commercial machines, no single prefetching technique works well across all access patterns and workload types. In this paper, we propose Arsenal, a prefetching framework that combines the advantages of different data prefetchers by dynamically selecting the prefetcher best suited to the current workload, effectively improving the versatility of the prefetching system. Arsenal builds on the classic Sandbox prefetcher, which dynamically adapts and utilizes multiple offsets for sequential prefetching. We take this a step further by switching between prefetchers such as the Multi-Lookahead Offset Prefetcher and the Timing SKID Prefetcher at run time. Arsenal uses a space-efficient pooling filter, the Bloom filter, to track the useful prefetches of each component prefetcher and thus maintain a score for each of them. This approach is shown to provide better speedup than any one prefetcher alone. Arsenal provides a performance improvement of 44.29% on the single-core mixes and 19.5% on a set of 25 representative multi-core mixes.
1. INTRODUCTION
Most modern prefetchers are designed with a particular scenario in mind and thus perform well only when the cache access pattern matches that scenario. In this work we present Arsenal, a data prefetching framework that dynamically selects the best-suited prefetcher among its components for the current workload and deploys it, ensuring the highest possible speedup irrespective of the cache access pattern. As proof of concept, we present two cases: one at the L1D cache level and another at the L2 cache level.
To understand the effectiveness of state-of-the-art prefetchers within a common framework, we analysed the performance of various L1D-cache-centric prefetchers such as T-SKID [1], MLOP [2], Bingo [3], and Pangloss [4], as well as L2-cache-centric prefetchers such as SPP [5], VLDP [6], and Best-Offset [7], using the trace-based simulator ChampSim. We used SPEC CPU 2017 traces to compare their performance. Among the L1D-centric prefetchers, T-SKID comes out as a clear winner on the single-core mixes in terms of overall performance; however, there are workloads on which T-SKID underperforms compared to other prefetchers. For example, on the gcc and fotonik3d traces, MLOP provides greater speedup than T-SKID. Detailed analysis across the benchmarks, as shown in Figure 1, reveals that a much higher speedup is achievable if we pick the best-performing prefetcher for each workload. A similar observation holds for the L2-cache-centric prefetchers, as summarised in Figure 2. This led to the idea of recognizing the type of workload dynamically and deploying the suitable prefetcher on the run. For such a framework to provide the maximum benefit with a minimum number of component prefetchers (and thus minimum overhead), the chosen components have to be orthogonal, i.e., perform well on complementary sets of workloads. Our analysis shows that T-SKID [1] and MLOP [2] form such a pair among the L1D-centric prefetchers, so these were chosen for the first test case targeting the L1D cache. Among the L2-centric prefetchers, SPP [5] and IP-stride were chosen. We explored all the conventional and state-of-the-art prefetchers, including those that appeared in the First [8, 9, 10, 11, 12, 13, 14, 15], Second [16, 17, 18, 19, 20, 6, 5, 7], and Third [21, 22, 23, 1, 4, 3, 2] Data Prefetching Championships, to find this best combination. In this article, we present these two cases as proof of concept for the Arsenal framework.
2. IMPLEMENTATION
In this section, we provide implementation details of the Arsenal framework for both test cases. Portions with no distinction between test case 1 and test case 2 are common to both.
Here we introduce the different component prefetchers thatwe have analyzed and eventually used as a proof of conceptfor our Arsenal prefetching framework.
Test Case 1
Timing SKID (T-SKID) [1] utilizes repetitive access patterns spread over a larger instruction window, which conventional prefetchers fail to recognize because of their short instruction windows. Cache misses, even if predicted and prefetched successfully, may be evicted before being accessed because of intermediary thrashing. T-SKID learns these access patterns and effectively controls the prefetch timing based on the PC, which correlates strongly with memory access patterns even in different address zones.
Figure 1: Normalized performance with different prefetchers for test case 1.
Figure 2: Normalized performance with different prefetchers for test case 2.
The Multi-Lookahead Offset Prefetcher (MLOP) [2] evaluates prefetching offsets on the two metrics of timeliness and miss coverage, as many conventional offset prefetchers either neglect timeliness or sacrifice miss coverage while selecting the optimum prefetch offset. State-of-the-art offset prefetchers generally lose cache miss coverage because of their reliance on the single best offset that generates the most timely prefetch requests. Instead of such a binary classification, MLOP considers multiple lookaheads for every prefetch offset and scores them individually. It then selects one offset for each lookahead level, which allows the prefetcher to issue enough requests while still considering the timeliness of those requests.
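To make the multi-lookahead idea concrete, the following toy sketch scores every (lookahead, offset) pair over a short access history and picks one offset per lookahead level. This is our simplified approximation of MLOP's access-map scoring, not the paper's implementation; `select_offsets` and its arguments are hypothetical names.

```python
# Toy illustration of MLOP's core idea: instead of one global best offset,
# score every (lookahead, offset) pair and keep one offset per lookahead level.
def select_offsets(history, offsets, max_lookahead):
    """history: ordered list of accessed block addresses.
    Returns {lookahead_level: best_offset}."""
    best = {}
    for lookahead in range(1, max_lookahead + 1):
        scores = {}
        for o in offsets:
            s = 0
            for i, addr in enumerate(history):
                # A prefetch of addr+o issued at step i is counted for this
                # lookahead level if addr+o is demanded at least `lookahead`
                # steps later, i.e. the prefetch would have been timely.
                if addr + o in history[i + lookahead:]:
                    s += 1
            scores[o] = s
        best[lookahead] = max(scores, key=scores.get)
    return best
```

For a purely sequential stream, a deeper lookahead level naturally selects a larger offset, which is exactly the behaviour the paper argues a single-offset prefetcher cannot provide.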
Test Case 2
IP-stride is a stride prefetcher that handles stride patterns based on the instruction pointer. It maintains a table of the previous addresses accessed by a list of instruction pointers. When the same instruction is executed again, the stride between the accessed addresses is calculated and a prefetch request is issued based on it. Replacement of stored IPs follows the LRU algorithm.
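A minimal IP-stride sketch in Python follows. The class name, table size, and the stride-confirmation check (prefetch only once the stride repeats) are our assumptions, not details specified in the text; an `OrderedDict` stands in for the LRU-managed hardware table.

```python
# Minimal IP-stride prefetcher sketch (illustrative; not ChampSim's code).
from collections import OrderedDict

class IPStridePrefetcher:
    def __init__(self, table_size=64):
        # IP -> (last_addr, last_stride); insertion order tracks LRU.
        self.table = OrderedDict()
        self.table_size = table_size

    def access(self, ip, addr, degree=1):
        """Return the prefetch addresses generated for this (ip, addr) access."""
        prefetches = []
        if ip in self.table:
            last_addr, last_stride = self.table.pop(ip)  # pop to refresh LRU order
            stride = addr - last_addr
            # Our added confirmation step: prefetch only when the stride repeats.
            if stride != 0 and stride == last_stride:
                prefetches = [addr + stride * d for d in range(1, degree + 1)]
            self.table[ip] = (addr, stride)
        else:
            if len(self.table) >= self.table_size:
                self.table.popitem(last=False)  # evict the least-recently-used IP
            self.table[ip] = (addr, 0)
        return prefetches
```

For example, three accesses from the same IP at addresses 100, 102, 104 train a stride of 2, and the third access triggers a prefetch of line 106.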
The Signature Path Prefetcher (SPP) [5] stores stride patterns in compressed form in a signature table (ST). Each ST entry indexes into the pattern table (PT), which predicts the next stride and holds the confidence of the current prefetch. The signature is then updated with the latest stride and used to recursively look up the PT to predict further strides. This continues until the path confidence, obtained by multiplying the new confidence with the previous prefetch confidence, falls below a threshold. A global history register (GHR) stores prefetch requests that cross page boundaries so that prefetching can continue across pages.
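The recursive lookahead with multiplicative confidence can be sketched as below. The signature-update hash and the 0.25 threshold are simplifications of ours; SPP's real tables, signature widths, and GHR handling are more involved.

```python
# Sketch of SPP's recursive lookahead with confidence throttling (simplified).
def update_signature(signature, stride, bits=12):
    # SPP-style signature update: shift the old signature and XOR the new
    # stride in, keeping a fixed-width compressed history (simplified hash).
    return ((signature << 3) ^ (stride & 0x3F)) & ((1 << bits) - 1)

def spp_lookahead(pattern_table, signature, base_addr, threshold=0.25):
    """pattern_table: signature -> (predicted_stride, confidence).
    Walk the signature path, multiplying confidences, until the path
    confidence drops below the threshold."""
    prefetches = []
    addr, conf = base_addr, 1.0
    while signature in pattern_table:
        stride, c = pattern_table[signature]
        conf *= c                      # path confidence = product along the path
        if conf < threshold:
            break                      # too speculative; stop the lookahead
        addr += stride
        prefetches.append(addr)
        signature = update_signature(signature, stride)
    return prefetches
```

With a high-confidence second step the path extends two strides ahead; with a low-confidence one, the product immediately falls under the threshold and the recursion stops after one prefetch.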
Next-line is one of the simplest prefetchers: it prefetches the next cache line on each cache miss or prefetch hit. Here we used a modified version that varies its aggressiveness, i.e., the number of cache lines prefetched, based on its score.
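A variable-degree next-line prefetcher is small enough to show in full. The score-to-degree mapping below (one extra line per 500 points of score, capped at 5) is our assumption; the text only says aggressiveness scales with the score.

```python
# Variable-degree next-line prefetcher: aggressiveness scales with the score.
def next_line_prefetches(block_addr, score, max_degree=5):
    """Prefetch the next 1..max_degree cache lines after block_addr.
    The score//500 mapping is an illustrative choice, not from the paper."""
    degree = max(1, min(max_degree, score // 500))
    return [block_addr + i for i in range(1, degree + 1)]
```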
Arsenal is motivated by the basic Sandbox prefetcher [24], which searches for and selects the best offset for an application. With Arsenal, we try to select the best prefetcher among the available components. The framework is trained on prefetch activation events (PAEs), i.e., cache misses and cache prefetch hits, and works in two phases: (i) a continuous evaluation phase and (ii) a selection phase. The selection phase is triggered when the evaluation count (number of prefetcher calls) of every component prefetcher crosses a threshold, chosen after careful examination. At the end of every selection phase, the best-suited prefetcher is selected using the parameters gathered during the evaluation phase (in some cases, none may be selected). At each PAE, the Arsenal framework triggers all the prefetchers. Cache lines prefetched by the components are stored in their respective Bloom filters [25] without being passed along to the prefetch queue; only the requests of the prefetcher selected during the last selection phase are passed to the prefetch queue, i.e., actually prefetched. In addition, the evaluation counter of each prefetcher is incremented by one, and its prefetch counter is incremented by the number of prefetch requests. At every miss and prefetch hit, the cache line address corresponding to the demanded address is compared against the contents of each Bloom filter. If a Bloom filter produces a match, the corresponding prefetcher's score is incremented by SCORE-INC; otherwise, the score is decremented by SCORE-DEC. A switch from the evaluation to the selection phase happens when all of the evaluation counters exceed EVAL-CNT.
Figure 3: The Arsenal framework: evaluation phase and selection.
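The evaluation loop can be sketched as follows. A plain Python set stands in for the Bloom filter of [25] (so there are no false positives here), and the generic "highest score wins" selection is a placeholder for the per-test-case policies described below; class and method names are ours.

```python
# Sketch of Arsenal's evaluation/selection machinery. SCORE_INC, SCORE_DEC and
# EVAL_CNT follow the thresholds reported in the paper; a set replaces the
# hardware Bloom filter for clarity.
SCORE_INC, SCORE_DEC, EVAL_CNT = 4, 1, 512

class Arsenal:
    def __init__(self, prefetchers):
        self.prefetchers = prefetchers                # name -> callable(addr) -> [addrs]
        self.filters = {n: set() for n in prefetchers}
        self.scores = {n: 0 for n in prefetchers}
        self.eval_cnt = {n: 0 for n in prefetchers}
        self.selected = None                          # winner of the last selection phase

    def on_activation(self, addr):
        """Called on every prefetch activation event (cache miss or prefetch hit)."""
        # Reward/penalize each component by checking the demanded line
        # against its sandboxed predictions.
        for name, f in self.filters.items():
            if addr in f:
                self.scores[name] += SCORE_INC
            else:
                self.scores[name] -= SCORE_DEC
        issued = []
        for name, pf in self.prefetchers.items():
            candidates = pf(addr)
            self.filters[name].update(candidates)     # sandboxed, not issued
            self.eval_cnt[name] += 1
            if name == self.selected:
                issued = candidates                   # only the winner really prefetches
        if all(c > EVAL_CNT for c in self.eval_cnt.values()):
            self.run_selection()
        return issued

    def run_selection(self):
        # Simplest policy: pick the highest-scoring component. Test cases 1
        # and 2 refine this with attempt counts and per-prefetcher thresholds.
        self.selected = max(self.scores, key=self.scores.get)
        self.filters = {n: set() for n in self.prefetchers}
        self.scores = {n: 0 for n in self.prefetchers}
        self.eval_cnt = {n: 0 for n in self.prefetchers}
```

Feeding the framework a stride-2 access stream with a next-line and a stride-2 toy component lets only the stride-2 filter match, so it wins the first selection phase.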
Test case 1
As T-SKID and MLOP are intelligent prefetchers that adjust their own aggressiveness through a feedback mechanism, the number of prefetches they attempt is considered in addition to their scores. Specifically, when a wrong (less favourable) prefetcher is selected, its score might get inflated, leading to a faulty cycle in which the wrong prefetcher keeps getting selected; the number of attempted prefetches can be used to correct this. If T-SKID's score is higher, or if T-SKID attempts more prefetches than TSKID_SELECTION_ATTEMPT, T-SKID is selected. If MLOP's score is higher, or if MLOP's prefetch attempts equal T-SKID's, MLOP is selected.
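One possible reading of this rule, evaluated in the order stated, is sketched below; the exact tie-break wording is somewhat ambiguous, so this is an interpretation, with the threshold value taken from the list later in the section.

```python
# Interpretation of the test-case-1 selection rule (order of checks matters).
TSKID_SELECTION_ATTEMPT = 10000  # threshold from the paper's list of thresholds

def select_l1d(tskid_score, mlop_score, tskid_attempts, mlop_attempts):
    if tskid_score > mlop_score or tskid_attempts > TSKID_SELECTION_ATTEMPT:
        return "T-SKID"
    if mlop_score > tskid_score or mlop_attempts == tskid_attempts:
        return "MLOP"
    return None  # neither condition holds: keep no prefetcher selected
```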
Test case 2
If SPP or IP-stride has the maximum score and that score is greater than the MIN-SCORE threshold, that prefetcher is selected. If next-line's score is the maximum among the three and greater than the NEXT-LINE-MIN-SCORE threshold, next-line is selected. If next-line's score is the maximum but not higher than NEXT-LINE-MIN-SCORE, then whichever of SPP and IP-stride has the higher score (and is also greater than MIN-SCORE) is selected. If none of the scores cross their respective thresholds, no prefetcher is selected.
Figure 3 illustrates the evaluation and selection phases of our Arsenal framework.
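This policy maps directly onto a small function, shown below with the threshold values from the list later in this section; the function name and the SPP-over-IP-stride tie-break are our choices.

```python
# The test-case-2 selection policy as we read it.
MIN_SCORE, NEXT_LINE_MIN_SCORE = 0, 1500  # thresholds from the paper

def select_l2(spp, ip_stride, next_line):
    scores = {"SPP": spp, "IP-stride": ip_stride, "next-line": next_line}
    best = max(scores, key=scores.get)
    if best == "next-line":
        if next_line > NEXT_LINE_MIN_SCORE:
            return "next-line"
        # Next-line won but missed its (higher) bar: fall back to the better
        # of SPP and IP-stride, which must still clear MIN-SCORE below.
        best = "SPP" if spp >= ip_stride else "IP-stride"
    if scores[best] > MIN_SCORE:
        return best
    return None  # nothing crossed its threshold: no prefetcher selected
```

Next-line's much higher bar (1500 vs. 0) reflects that an indiscriminate next-line prefetcher should only win when it is matching the access stream overwhelmingly often.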
Thresholds of interest:
These are some of the thresholds used by the Arsenal framework in the presented work:
SCORE-INC: score increment on a prefetch request hit in the Bloom filter: +4
SCORE-DEC: score decrement on a prefetch request miss in the Bloom filter: -1
MIN-SCORE: minimum score required for selection of SPP, IP-stride, T-SKID or MLOP: 0
NEXT-LINE-MIN-SCORE: minimum score required for selection of next-line: 1500
EVAL-CNT: number of prefetch activation events after which the selection process is repeated: 512
TSKID_SELECTION_ATTEMPT: number of prefetch attempts by T-SKID that leads to its selection: 10000
BF-FPP: required false-positive rate of the Bloom filter: 0.01
BF-EST-CAP: number of entries in the Bloom filter: 2000

Figure 4: Normalized performance provided by the Arsenal framework compared to its components in test case 1.
Figure 5: Normalized performance provided by the Arsenal framework compared to its components in test case 2.

Structure | Entry size × entries | Size
Arsenal counters | Evaluation counters for N prefetchers (9×N bits) + prefetch counters for N prefetchers (12×N bits) + prefetch scores for N prefetchers (11×N bits) | 0.03 KB
Bloom filters | 5 bits (false-positive probability) + 32 bits (random seed) + 11 bits (inserted-element count) + 11 bits (projected-element count) + 15 bits (table size) + 3 bits (salt count) + 2399×8 bits (bit table) + 135×32 bits (salt table) = 2948 B × N | 2.87 KB
Thresholds | | 0.1 KB
Total | | 3 KB
Table 1: Arsenal Hardware Overhead per component.
Test case 1
T-SKID overhead: 52.5 KB
MLOP overhead: 12 KB
Arsenal framework: 6 KB
Total: 70.5 KB

Test case 2
SPP overhead: 5.73 KB
IP-stride overhead: 5.47 KB
Next-line overhead: 0 KB
Arsenal framework: 9 KB
Total: 20.2 KB
Table 2: Hardware overhead for both test cases.

Hardware Overhead: Table 1 shows the hardware overhead of the Arsenal framework alone, i.e., in addition to the memory requirements of the component prefetchers. Table 2 shows the hardware overhead for the two test cases, taking into account the memory requirements of the component prefetchers.
3. EVALUATION AND RESULTS
We used SPEC CPU 2017 traces to evaluate the performance of the Arsenal framework. Figure 4 shows the normalized performance with the Arsenal framework for test case 1. On average, Arsenal provides a 44.29% performance improvement, whereas MLOP and T-SKID provide 38% and 40%, respectively. If we select the best prefetcher for each trace independently, a 43.76% improvement is achieved. This configuration thus outperforms IPCP [22], the winner of the Third Data Prefetching Championship.
Figure 5 shows the normalized performance with the Arsenal framework for test case 2. On average, Arsenal provides a 39.42% performance improvement, whereas SPP, next-line with degree 5, and IP-stride with degree 8 provide 35.1%, 32.6%, and 31.89%, respectively. If we select the best prefetcher for each trace independently, a 35.92% improvement is achieved.
The fact that the Arsenal framework outperforms even the ideal case, where we pick the maximum among the speedups provided by its component prefetchers, shows the effectiveness of our framework.
For the multi-core evaluation, we created 25 representative mixes, as listed in Table 3. For these mixes, Arsenal provides an average speedup of 19.51% in test case 1 and 16.39% in test case 2.

Mix 0: 600.perlbench-570B 657.xz-2302B 605.mcf-994B 620.omnetpp-874B
Mix 1: 620.omnetpp-141B 641.leela-1083B 605.mcf-665B 607.cactuBSSN-4004B
Mix 2: 607.cactuBSSN-3477B 654.roms-1613B 623.xalancbmk-10B 605.mcf-1152B
Mix 3: 654.roms-1007B 628.pop2-17B 627.cam4-490B 605.mcf-782B
Mix 4: 607.cactuBSSN-2421B 605.mcf-1644B 619.lbm-4268B 619.lbm-2677B
Mix 5: 602.gcc-734B 605.mcf-1554B 619.lbm-3766B 605.mcf-472B
Mix 6: 649.fotonik3d-10881B 621.wrf-8065B 605.mcf-484B 619.lbm-2676B
Mix 7: 621.wrf-6673B 623.xalancbmk-165B 605.mcf-1536B 654.roms-293B
Mix 8: 654.roms-294B 602.gcc-1850B 603.bwaves-2931B 623.xalancbmk-202B
Mix 9: 649.fotonik3d-1176B 649.fotonik3d-8225B 654.roms-1070B 654.roms-523B
Mix 10: 654.roms-1390B 649.fotonik3d-7084B 603.bwaves-891B 602.gcc-2226B
Mix 11: 600.perlbench-570B 657.xz-2302B 605.mcf-1152B 654.roms-1007B
Mix 12: 628.pop2-17B 627.cam4-490B 619.lbm-2677B 602.gcc-734B
Mix 13: 605.mcf-472B 649.fotonik3d-10881B 619.lbm-2676B 621.wrf-6673B
Mix 14: 621.wrf-8065B 605.mcf-484B 623.xalancbmk-165B 605.mcf-1536B
Mix 15: 654.roms-294B 602.gcc-1850B 603.bwaves-1740B 603.bwaves-2609B
Mix 16: 654.roms-1070B 654.roms-523B 603.bwaves-891B 602.gcc-2226B
Mix 17: 600.perlbench-570B 657.xz-2302B 603.bwaves-891B 602.gcc-2226B
Mix 18: 654.roms-1070B 654.roms-523B 605.mcf-1644B 619.lbm-4268B
Mix 19: 603.bwaves-2609B 649.fotonik3d-1176B 607.cactuBSSN-2421B 605.mcf-1644B
Mix 20: 654.roms-1007B 619.lbm-2676B 603.bwaves-1740B 602.gcc-2226B
Mix 21: 605.mcf-1536B 605.mcf-1554B 605.mcf-1644B 605.mcf-994B
Mix 22: 603.bwaves-1740B 603.bwaves-2609B 603.bwaves-2931B 603.bwaves-891B
Mix 23: 649.fotonik3d-10881B 649.fotonik3d-1176B 649.fotonik3d-7084B 649.fotonik3d-8225B
Mix 24: 619.lbm-2676B 619.lbm-2677B 619.lbm-3766B 619.lbm-4268B

Table 3: 25 representative 4-core mixes.
4. CONCLUSION AND FUTURE WORK
This paper proposed the Arsenal framework, which selects the best prefetcher from among three prefetchers using a sandbox method. The framework uses Bloom filters to test the effectiveness of all the prefetchers. Arsenal provides an average performance improvement of 44.29% on the single-core traces. The effectiveness of Arsenal will improve if the framework is given component prefetchers that complement each other, such as a combination of regular and irregular prefetchers. Exploring this, along with modeling DRAM contention for multi-cores, is an exciting avenue for future work. Further research is also required to make the selection process adaptive, so that the framework can modify its selection criterion on the run when it encounters new workloads.
5. ACKNOWLEDGEMENT
Thanks to Biswabandan Panda, IIT Kanpur, for his valuable suggestions.
6. REFERENCES
[1] T. Nakamura, T. Koizumi, Y. Degawa, H. Irie, S. Sakai, and R. Shioya, "T-SKID: Timing skid prefetcher," The Third Data Prefetching Championship, 2019.
[2] M. Shakerinava, M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad, "Multi-lookahead offset prefetching," The Third Data Prefetching Championship, 2019.
[3] M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi-Azad, "Bingo spatial data prefetcher," pp. 399–411, 2019.
[4] P. Papaphilippou, P. H. J. Kelly, and W. Luk, "Pangloss: a novel Markov chain prefetcher," CoRR, vol. abs/1906.00877, 2019.
[5] J. Kim, S. H. Pugsley, P. V. Gratz, A. L. N. Reddy, C. Wilkerson, and Z. Chishti, "Path confidence based lookahead prefetching," pp. 60:1–60:12, 2016.
[6] M. Shevgoor, S. Koladiya, R. Balasubramonian, C. Wilkerson, S. H. Pugsley, and Z. Chishti, "Efficiently prefetching complex address patterns," in Proceedings of the 48th International Symposium on Microarchitecture, MICRO 2015, Waikiki, HI, USA, December 5-9, 2015, pp. 141–152, 2015.
[7] P. Michaud, "Best-offset hardware prefetching," pp. 469–480, 2016.
[8] M. Dimitrov and H. Zhou, "Combining local and global history for high performance data prefetching," J. Instruction-Level Parallelism, vol. 13, 2011.
[9] M. Grannæs, M. Jahre, and L. Natvig, "Storage efficient hardware prefetching using delta-correlating prediction tables," J. Instruction-Level Parallelism, vol. 13, 2011.
[10] Y. Ishii, M. Inaba, and K. Hiraki, "Access map pattern matching for high performance data cache prefetch," J. Instruction-Level Parallelism, vol. 13, 2011.
[11] G. Liu, Z. Huang, J. Peir, X. Shi, and L. Peng, "Enhancements for accurate and timely streaming prefetcher," J. Instruction-Level Parallelism, vol. 13, 2011.
[12] L. M. Ramos, J. L. Briz, P. E. Ibáñez, and V. Viñals, "Multi-level adaptive prefetching based on performance gradient tracking," J. Instruction-Level Parallelism, vol. 13, 2011.
[13] A. Sharif and H.-H. S. Lee, "Data prefetching mechanism by exploiting global and local access patterns," The Journal of Instruction-Level Parallelism Data Prefetching Championship, 2009.
[14] S. Verma, D. M. Koppelman, and L. Peng, "A hybrid adaptive feedback based prefetcher," Proceedings of the 1st JILP/Intel Data Prefetching Championship (DPC-1) in conjunction with HPCA, vol. 15, 2009.
[15] M. Ferdman, S. Somogyi, and B. Falsafi, "Spatial memory streaming with rotated patterns," vol. 29, 2009.
[16] I. B. Karsli, M. Cavus, and R. Sendag, "Prefetching on-time and when it works."
[17] M. Sutherland, A. Kannan, and N. Enright Jerger, "Not quite my tempo: Matching prefetches to memory access times," June 2015.
[18] N. T. Brown and R. Sendag, "Sandbox based optimal offset estimation."
[19] A. K. N L and V. Young, "Towards bandwidth efficient prefetching with slim AMPM," June 2015.
[20] Q. Jia, M. B. Padia, K. Amboju, and H. Zhou, "An optimized AMPM-based prefetcher coupled with configurable cache line sizing."
[21] E. Bhatia, G. Chacon, S. H. Pugsley, E. Teran, P. V. Gratz, and D. A. Jiménez, "Perceptron-based prefetch filtering," in Proceedings of the 46th International Symposium on Computer Architecture, ISCA 2019, Phoenix, AZ, USA, June 22-26, 2019, pp. 1–13, 2019.
[22] S. Pakalapati and B. Panda, "Bouquet of instruction pointers: Instruction pointer classifier based hardware prefetching," 2019.
[23] C. Sakalis, M. Alipour, A. Ros, A. Jimborean, S. Kaxiras, and M. Själander, "Ghost loads: What is the cost of invisible speculation?," in ACM International Conference on Computing Frontiers, (Alghero, Sardinia, Italy), pp. 153–163, Association for Computing Machinery (ACM), Apr. 2019.
[24] S. H. Pugsley, Z. Chishti, C. Wilkerson, P. Chuang, R. L. Scott, A. Jaleel, S. Lu, K. Chow, and R. Balasubramonian, "Sandbox prefetching: Safe run-time evaluation of aggressive prefetchers," pp. 626–637, 2014.
[25] Bloom filter library, https://github.com/ArashPartow/bloom.