A Study of Runtime Adaptive Prefetching for STTRAM L1 Caches
Kyle Kuan and Tosiron Adegbija
Department of Electrical & Computer Engineering
University of Arizona, Tucson, AZ, USA
Email: {ckkuan, tosiron}@email.arizona.edu

Abstract—Spin-Transfer Torque RAM (STTRAM) is a promising alternative to SRAM in on-chip caches due to several advantages, including non-volatility, low leakage, high integration density, and CMOS compatibility. Prior studies have shown that relaxing and adapting the STTRAM retention time to runtime application needs can substantially reduce overall cache energy without significant latency overheads, due to the lower STTRAM write energy and latency at shorter retention times. In this paper, as a first step towards efficient prefetching across the STTRAM cache hierarchy, we study prefetching in reduced retention STTRAM L1 caches. Using SPEC CPU 2017 benchmarks, we analyze the energy and latency impact of different prefetch distances at different STTRAM cache retention times for different applications. We show that expired unused prefetches—the number of unused prefetches expired by the reduced retention time STTRAM cache—can accurately determine the best retention time for energy consumption and access latency. This new metric can also provide insights into the best prefetch distance for memory bandwidth consumption and prefetch accuracy. Based on our analysis and insights, we propose Prefetch-Aware Retention time Tuning (PART) and Retention time-based Prefetch Control (RPC). Compared to a base STTRAM cache, PART and RPC collectively reduced the average cache energy and latency by 22.24% and 24.59%, respectively. When the base architecture was augmented with the state-of-the-art near-side prefetch throttling (NST), PART+RPC reduced the average cache energy and latency by 3.50% and 3.59%, respectively, and reduced the hardware overhead by 54.55%.
I. INTRODUCTION
Much research has focused on optimizing caches' performance and energy efficiency due to the caches' non-trivial impact on processor architectures. These optimization efforts are especially important for resource-constrained devices, for which low-overhead energy reduction remains a major concern. An increasingly popular approach for improving caches' energy efficiency involves replacing the traditional SRAM with emerging non-volatile memory (NVM) technologies.

Among several NVM alternatives, Spin-Transfer Torque RAM (STTRAM) has emerged as a promising candidate for replacing traditional SRAMs in future on-chip caches. STTRAMs offer several attractive characteristics, such as non-volatility, low leakage, high integration density, and CMOS compatibility. However, some of STTRAM's most important challenges include its long write latency and high write energy [1], [2]. These challenges are attributed, in part, to the STTRAM's long retention time—the duration for which data is maintained in the memory in the absence of power. For caches, the intrinsic STTRAM retention time of up to 10 years is unnecessary, since most cache blocks need to be retained in the cache for no longer than 1s [3]. Furthermore, different applications or application phases may have different retention time requirements [4]. Thus, prior research has proposed reduced retention STTRAMs that can be specialized to the needs of various applications [4] or different cache levels [5].

To further improve cache efficiency, cache prefetching is a popular technique that fetches data blocks from lower memory levels before the data is actually needed. While prefetching can be very effective for improving cache access time, inaccurate prefetching can cause cache pollution, increase memory bandwidth contention, and, in effect, degrade the cache's performance and energy efficiency [6], [7]. Apart from determining the right prefetch targets, the prefetch distance must also be well-monitored such that it maintains good prefetch accuracy [7]. This is especially important in reduced retention STTRAMs, which, as our analysis shows, exhibit different locality behaviors than traditional SRAM caches due to cache block expiration.

In this paper, as an important first step towards understanding prefetching across the STTRAM cache hierarchy, we study data prefetching in the context of a reduced retention L1 STTRAM cache—simply referred to hereafter as 'STTRAM cache'. We assume an STTRAM cache that features the ability to adapt to different applications' retention time requirements (e.g., [5], [4]). We focus on the potential of data prefetching for improving the STTRAM cache's energy efficiency. To motivate this study, we performed extensive experiments using a variety of SPEC CPU 2017 benchmarks and a PC-based stride prefetcher that prefetches memory addresses based on the current program counter (PC) [8]. We observed that if earlier prefetched data blocks expire because of the reduced retention time, a conventional prefetcher would not reload these blocks. However, a prefetcher could be modified to reload these blocks, thereby reducing the miss penalty caused by the premature expiration of blocks (i.e., expiration misses [9]). Furthermore, the low write energy at reduced retention times also mitigates the negative impact of writing blocks in addition to demand requests. We also observed that common metrics for determining the best retention time during runtime (e.g., cache miss rates [4]) may not be accurate in the presence of a prefetcher and can unnecessarily waste energy.
As such, prefetching, if carefully designed in the context of reduced retention STTRAMs, can increase energy savings as compared to prior reduced retention STTRAM design techniques, without incurring significant latency overheads.

Based on the above observations, we propose a new metric, which we call expired unused prefetches, to evaluate the quality of a current retention time and prefetch distance. The expired unused prefetches metric represents the number of prefetched blocks that were not accessed by a demand request before expiry. Using this metric, we developed Prefetch-Aware Retention time Tuning (PART) and
Retention time-based Prefetch Control (RPC). During a brief runtime profiling phase for each application, PART uses the ratio of expired unused prefetches to total prefetches to determine if the current retention time suffices for the application. The retention time selected by PART indicates the average amount of time for which cache blocks used by an application reside in the cache. As such, if too many prefetches expire without being used, it is likely that those prefetches were inaccurate. RPC uses this idea to map expired unused prefetches to the prefetch distance.

Our major contributions are summarized as follows:

• We study prefetching in STTRAM caches and propose a metric—expired unused prefetches—that can be used to effectively determine both the retention time and the prefetch distance, without the need for any complex hardware overhead.

• Using expired unused prefetches, we propose an algorithm to determine the retention time and prefetch distance during runtime.

• Compared to a base state-of-the-art reduced retention time STTRAM cache, PART+RPC reduced the average energy and latency by up to 22.24% and 24.59%, respectively. Furthermore, when the base architecture was augmented with the state-of-the-art near-side prefetch throttling (NST) prefetcher, our approach reduced the average energy and latency by 3.50% and 3.59%, respectively, and substantially reduced the hardware overhead by 54.55%.
II. BACKGROUND AND RELATED WORK
STTRAM's basic structure, comprising magnetic tunnel junction (MTJ) cells, and its characteristics have been detailed in prior work [10]. Earlier works suggest the use of very short retention times (e.g., 26.5µs [5]) with a DRAM-style refresh scheme for cache implementation [5], [3]. Recent works show that adapting a set of pre-determined retention times to applications' needs, specifically the cache block lifetimes, can further improve energy consumption [4], [11]. In this section, we present a brief overview of prior work on adaptable retention time STTRAM caches—the architecture on which we build the analysis presented herein—and an overview of prefetch distance control.

A. Adaptable Retention Time STTRAM Caches
Recent optimizations of STTRAM caches exploit the variable cache block needs of different applications for energy minimization. For example, Sun et al. [5] proposed a multi-retention time cache featuring various retention times enabled by various MTJ designs, wherein different applications could be run on the retention time that suits them best. More recently, Kuan et al. [4] analyzed the retention times of different applications and proposed a logically adaptable retention time (LARS) cache that uses multiple STTRAM units with different retention times. LARS involves a hardware structure that samples an application's characteristics during its very first run. Based on its retention time requirements, each application is then executed on the retention time unit that best satisfies its needs. In this paper, we assume a multi-retention time architecture similar to LARS. For brevity, we direct readers to [4] for additional low-level details of the architecture, and omit those details herein.
B. Prefetch distance control
Prefetch distance refers to how far into a demand miss stream a prefetcher can prefetch [8]. Effective prefetching relies on accurate prefetch addresses and the timely arrival of data blocks to hide the latency between the processor and main memory. As such, the prefetch distance must not be so short as to generate excessive late prefetches [6], nor so long as to lose prefetch accuracy [6], [7]. Inaccurate prefetches can cause performance degradation due to the saturation of memory bandwidth and cache pollution. As such, many prior works discuss techniques for controlling the prefetch distance, including feedback-directed prefetching, ways to monitor the number of total and late prefetches to evaluate prefetch accuracy and lateness, and ways to determine the prefetcher's aggressiveness. For example, Ebrahimi et al. [12] proposed a rule-based control method to separate global throttling from local throttling and reduce inter-core interference. Both [12] and [6] tracked the number of useless prefetches, determined as prefetches that are not used before they are evicted. Heirman et al. [7] referred to the aforementioned methods as far-side throttling, since they maintain a high prefetch distance and throttle down when negative effects are observed. Heirman et al. [7] instead proposed near-side prefetch throttling (NST), which monitors the ratio of late prefetches to total prefetches, keeps the prefetch distance low, and only raises the distance if necessary. None of these techniques, however, considered prefetching in STTRAM caches. As we show in our analysis herein, state-of-the-art prefetchers may under-perform if simply implemented on STTRAM caches without considering execution characteristics and metrics that are unique to STTRAM caches.
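To make the roles of prefetch degree and distance concrete, the following minimal sketch shows the core of a PC-based stride prefetcher in the style of [8]. The structure and names (StrideEntry, issuePrefetch, setDistance) are our own illustrative assumptions, not code from any cited work or from our simulator.

    #include <cstdint>
    #include <functional>
    #include <unordered_map>

    // Minimal PC-based stride prefetcher sketch (in the style of [8]).
    // 'degree' is the number of prefetches issued per trigger; 'distance'
    // is how far ahead of the demand stream the prefetcher runs.
    struct StrideEntry {
        uint64_t lastAddr = 0;   // last demand block address seen for this PC
        int64_t  stride   = 0;   // last observed stride
        bool     trained  = false;
    };

    class StridePrefetcher {
    public:
        StridePrefetcher(int degree, int distance)
            : degree_(degree), distance_(distance) {}

        // Called on each demand access; issues prefetches via the callback.
        void onDemandAccess(uint64_t pc, uint64_t addr,
                            const std::function<void(uint64_t)> &issuePrefetch) {
            StrideEntry &e = table_[pc];
            int64_t stride = (int64_t)addr - (int64_t)e.lastAddr;
            if (e.trained && stride != 0 && stride == e.stride) {
                // Confirmed stride: prefetch 'degree' blocks, starting
                // 'distance' strides ahead of the current demand address.
                for (int d = 0; d < degree_; ++d)
                    issuePrefetch(addr + (uint64_t)((distance_ + d) * stride));
            }
            e.trained  = (stride != 0);
            e.stride   = stride;
            e.lastAddr = addr;
        }

        // Throttling schemes (e.g., NST, or RPC in Section III-C) would
        // adjust this knob at runtime.
        void setDistance(int distance) { distance_ = distance; }

    private:
        int degree_;
        int distance_;
        std::unordered_map<uint64_t, StrideEntry> table_;
    };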
III. ENABLING PREFETCHING IN STTRAM CACHE
A. Effectiveness of prefetching expired blocks
Expired blocks in STTRAM caches incur misses when a demand request accesses an expired block prior to eviction. We refer to these misses as expiration misses, similar to prior work [9]. As the retention time becomes shorter, expiration misses increase, until they become the majority of misses and essentially disable the cache's ability to exploit temporal locality. Given the uniqueness of expiration misses in STTRAM caches, we first studied the impact of prefetching on expired cache blocks. Figure 1 illustrates a simplified diagram of a data cache, with each cell representing a cache block. The horizontal blocks represent the cache ways (four ways in total) and the vertical blocks represent the set addresses (seven set addresses in total). The blocks' colors represent the prefetch stream that brought the cache blocks into the cache. We used the stride prefetcher [8] as the base to illustrate our idea and in our experiments. The number associated with each color represents the program counter (PC) value of the load/store instruction that begins the stream due to a demand miss.

Fig. 1: Prefetching expired blocks. In (a), the prefetcher does not bring previously expired blocks back into the cache; in (b), the previously expired blocks are brought back into the cache

Figure 1a illustrates the STTRAM cache without prefetching expired blocks. Assume that the instruction at PC 504 brought three cache blocks into the cache. Since the blocks are brought in by the same stream, they are likely to expire around the same time. If the prefetcher is disabled on those expired blocks, as in a conventional prefetcher, when the demand request accesses the blocks again, loading each block will incur the miss penalty due to expiration misses. Alternatively, enabling the prefetcher for the expired blocks can have a positive effect, since, as shown in Figure 1b, the prefetcher brings in subsequent blocks after the first demand miss (expiration miss). Thus, subsequent accesses to the prefetched blocks become demand hits without exposing the memory latency.

To quantify the benefits of prefetching expired blocks, we performed experiments using SPEC CPU 2017 rate (_r) benchmarks and evaluated the energy and latency changes. We used a base stride prefetcher with a prefetch distance of 16, similar to [13], and considered retention times from 25µs to 1ms. Our detailed simulation setup is described in Section IV-A. We use the term prefetchable expired blocks to represent expired blocks that can be accurately predicted and reloaded through the stride prefetcher, and would therefore incur no expiration miss. Figure 2 shows the percentage of prefetchable expired blocks in total expired blocks across the benchmarks, assuming the best retention times. On average across all benchmarks, 10.85% of expired blocks can be reloaded into the cache for reuse. Depending on the applications' access patterns and cache block lifetimes, the reused expired blocks can be as high as 29.66% for leela, while over half of the benchmarks (13 of 21) have reuse rates over 10%. To further illustrate this behavior, Figure 3 shows the percentage of prefetchable expired blocks in total expired blocks for different retention times. For brevity, the geometric mean is shown for the different retention times. In general, the percentage of reused expired blocks increases as the retention time decreases, with the highest being 8.69% at 25µs. These analyses motivate us to explore low-overhead techniques for prefetching and determining the best retention time in STTRAM caches during runtime.
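To make the mechanism of Figure 1b concrete, the sketch below shows how a cache lookup might distinguish an expiration miss from an ordinary miss and re-trigger the prefetch stream for the expired block. This is a minimal sketch under our own assumptions; the field and function names (expiresAt, retriggerStream) are illustrative, not our simulator's actual interface.

    #include <cstdint>

    // Sketch of an STTRAM cache block with retention-time expiry.
    struct CacheBlock {
        uint64_t tag        = 0;
        bool     valid      = false;
        bool     prefetched = false;  // set if brought in by the prefetcher
        uint64_t expiresAt  = 0;      // insertion time + retention time (ticks)
    };

    enum class AccessResult { Hit, Miss, ExpirationMiss };

    // Classify a demand access. An expired-but-present block is an
    // expiration miss (Section III-A); a conventional prefetcher would
    // not reload it, so we re-trigger its stride stream instead, turning
    // subsequent accesses (Figure 1b) into demand hits.
    AccessResult demandAccess(CacheBlock &blk, uint64_t tag, uint64_t now,
                              void (*retriggerStream)(uint64_t tag)) {
        if (!blk.valid || blk.tag != tag)
            return AccessResult::Miss;            // ordinary miss
        if (now >= blk.expiresAt) {
            blk.valid = false;                    // data has decayed
            retriggerStream(tag);                 // reload expired stream blocks
            return AccessResult::ExpirationMiss;  // pay the miss penalty once
        }
        return AccessResult::Hit;
    }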
Fig. 2: Percentage of prefetchable expired blocks in total expired blocks across SPEC CPU 2017 benchmarks

Fig. 3: Percentage of prefetchable expired blocks in total expired blocks for different retention times for SPEC CPU 2017 benchmarks (geometric mean shown for brevity)
Fig. 4: Retention time expiration detects potentially unused prefetches
B. Prefetch-Aware Retention time Tuning (PART)
A key point of our analysis so far is that, as illustrated in Figure 1b, the expiration of cache blocks must be considered in the design of prefetchers. Furthermore, we also analyzed prior adaptable retention time techniques (e.g., [4]) that used miss rates to predict the best retention time. We found that these techniques only accurately predicted the best retention time using cache miss rates in the absence of a prefetcher. When a prefetcher is introduced, using miss rates may not be as accurate due to the interplay of expiration misses and prefetching. Thus, we designed the prefetch-aware retention time tuning (PART) technique to take expiration misses into account.

Algorithm 1: Prefetch-Aware Retention Time Tuning
Data: Retention time set R = {25µs, 50µs, 75µs, 100µs, 1ms}
Result: OutputRetentionTime
1:  OutputRetentionTime ← 1ms;
2:  foreach r ∈ R do
3:      allPF ← totalPrefetches(r) / totalMSHRRequests(r);
4:      expiredPF ← expiredUnusedPrefetches(r) / totalPrefetches(r);
5:      if allPF > 0.1% then
6:          if baseExpiredPF is set then
7:              if expiredPF < 2 × baseExpiredPF then
8:                  OutputRetentionTime ← r;
9:              end
10:             else
11:                 return OutputRetentionTime;
12:             end
13:         end
14:         else
15:             OutputRetentionTime ← r;
16:             if expiredPF > 0.02% then
17:                 baseExpiredPF ← expiredPF;
18:             end
19:         end
20:     end
21:     else
22:         OutputRetentionTime ← r;
23:         missBasedTuning(OutputRetentionTime);
24:         return OutputRetentionTime;
25:     end
26: end
27: return OutputRetentionTime;

To motivate PART, Figure 4 illustrates the timeline of when prefetched blocks are brought into the cache and then expired.
Assume that a LOAD(A) instruction accesses memory address A and causes a demand miss; the prefetcher then sends out four requests, from address A+1 to A+4. The prefetch arrival times are marked in green. After the retention time elapses, prefetched blocks begin to expire. We record the number of blocks that were not used by demand requests before expiration; we refer to these blocks as expired unused prefetches. The basic idea of PART is to use the shortest retention time that does not excessively increase the expired unused prefetches. To this end, PART tracks the changes in expired unused prefetches at prefetch degree 1 during different tuning intervals to determine the best retention time.

Algorithm 1 depicts the PART algorithm, which takes as input the available retention times in the system and outputs the best retention time. PART iterates through the available retention time set starting from the longest to the shortest (i.e., 1ms to 25µs), runs the application for a sampling period, and takes the ratio of total prefetches to total MSHR requests (allPF) and the ratio of expired unused prefetches to total prefetches (expiredPF), as shown in lines 3-4. If allPF is smaller than 0.1%, we infer that prefetches do not substantially contribute to memory traffic. Therefore, the algorithm switches to a subroutine that predicts the retention time based on cache misses, similar to prior techniques [4] (line 23). If allPF is greater than 0.1%, the algorithm first checks if expiredPF is significant enough (> 0.02%). If expiredPF is greater than 0.02%, this expiredPF is stored as baseExpiredPF and used in subsequent tuning stages. Otherwise, PART iterates through the next available retention times to see if the thresholds are satisfied (lines 15-18). Note that we determined the thresholds empirically through extensive experiments and analysis. After obtaining baseExpiredPF, PART explores shorter retention times to find the one that does not excessively increase expiredPF as compared to baseExpiredPF. PART checks if expiredPF is smaller than twice baseExpiredPF. If so, it proceeds to the next shorter retention time; otherwise, the current retention time is returned as the tuning result (lines 7-12).
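For readers who prefer code to pseudocode, the sketch below renders the Algorithm 1 decision flow in C++. The Sample structure and the missBasedTuning callback stand in for the profiling hardware (Section III-D) and the miss-based subroutine of [4], and the thresholds are the empirical values from the text; this is an illustrative rendering, not our hardware implementation.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Per-interval counters, assumed to be supplied by the profiling
    // hardware described in Section III-D.
    struct Sample {
        uint64_t totalPrefetches;
        uint64_t totalMSHRRequests;
        uint64_t expiredUnusedPrefetches;
    };

    // Illustrative rendering of Algorithm 1. samples[i] holds the
    // sampling-interval counters at retention time retentions[i],
    // ordered longest (1ms) to shortest (25us).
    double tuneRetentionTime(const std::vector<double> &retentions,
                             const std::vector<Sample> &samples,
                             double (*missBasedTuning)()) {
        double out = retentions.front();  // default: longest retention time
        double baseExpiredPF = -1.0;      // negative means "not set"
        for (size_t i = 0; i < retentions.size(); ++i) {
            const Sample &s = samples[i];
            double allPF =
                (double)s.totalPrefetches / (double)s.totalMSHRRequests;
            if (allPF <= 0.001)           // < 0.1%: prefetches insignificant;
                return missBasedTuning(); // fall back to miss-based tuning [4]
            double expiredPF =
                (double)s.expiredUnusedPrefetches / (double)s.totalPrefetches;
            if (baseExpiredPF < 0.0) {
                out = retentions[i];      // accept r; look for a baseline
                if (expiredPF > 0.0002)   // > 0.02%: significant enough
                    baseExpiredPF = expiredPF;
            } else if (expiredPF < 2.0 * baseExpiredPF) {
                out = retentions[i];      // shorter time did not hurt; continue
            } else {
                return out;               // expired unused prefetches grew too much
            }
        }
        return out;
    }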
C. Retention Time-based Prefetch Control (RPC)

We also developed a simple heuristic, called retention time-based prefetch control (RPC), that works in conjunction with PART to determine the best prefetch distance during runtime. To minimize tuning overhead, RPC determines the best prefetch distance in 'one shot', alongside the retention time tuning performed by the PART algorithm. PART tracks expired unused prefetches at prefetch degree 1 for tuning the retention time. The determined retention time represents the period that suffices, on average, for the executing application's cache block lifetimes. A prefetch degree of 1 is usually considered conservative in prefetch distance throttling [6], [12]. As such, if expired unused prefetches remains excessively high after retention time tuning, it is likely that wrong addresses were prefetched. In this case, we maintain a prefetch distance of 1 to minimize cache pollution and memory bandwidth contention. RPC takes expiredPF from Algorithm 1 as input to determine the prefetch aggressiveness, and maps it to the prefetch distance similarly to [6]. Table I shows the distribution of this mapping, representing the different ranges of expiredPF and the associated prefetch distances. If expiredPF is above 5%, the stride pattern does not match the current application's data accesses; thus, the prefetch distance is kept at 1 in order to maintain prefetch functionality. At the other extreme, we observed that some applications have the lowest expiredPF (and energy consumption) at prefetch distance 32, which indicates that the stride prefetcher captures the applications' data access pattern and is able to recover expired blocks.
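The Table I mapping then reduces to a small range lookup. The function below is a minimal sketch of RPC's mapping, assuming expiredPF is expressed as a fraction (e.g., 0.03 for 3%):

    // Map expiredPF (measured at prefetch degree 1) to a prefetch
    // distance, following the ranges in Table I.
    int rpcPrefetchDistance(double expiredPF) {
        if (expiredPF > 0.05)   return 1;   // above 5%: stride mismatch, stay conservative
        if (expiredPF > 0.01)   return 4;   // 1.01% - 5%
        if (expiredPF > 0.005)  return 8;   // 0.51% - 1%
        if (expiredPF > 0.0005) return 16;  // 0.05% - 0.5%
        return 32;                          // below 0.05%: pattern captured well
    }

Because the mapping is evaluated once, during the sampling phase, RPC adds no steady-state tuning activity; the single comparison chain above corresponds to the one 32-bit comparator discussed in the overhead analysis below.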
D. Overhead
Assuming a base architecture that has the capability of multiple retention times (e.g., [4]), PART's major advantage is that it imposes negligible hardware and tuning overhead. PART exploits most of the hardware components described in [4] for tuning. In addition to the four 32-bit registers and one division circuit used in prior work, PART only requires one additional 32-bit register for allPF and expiredPF. To keep track of expired unused prefetches, PART requires only one custom hardware counter, which increments when an expiring block's prefetch bit is valid. Using the expiredPF shared with PART, RPC requires only one 32-bit comparator. In total, we estimate that the area overhead is less than 1% of modern processors like the ARM Cortex-A72 [13].

TABLE I: Prefetch distances for different ExpiredPF

ExpiredPF at prefetch degree 1    Prefetch distance
Above 5%                          1
1.01% - 5%                        4
0.51% - 1%                        8
0.05% - 0.5%                      16
Below 0.05%                       32

We note that the base architecture incurs energy and latency switching overheads from migrating the cache state from one STTRAM unit to another. Switching occurs when an application is first executed, during its sampling period. For example, given a tuning interval of 10 million instructions and five retention time options, sampling would require 50 million instructions. However, PART does not increase the switching overhead with respect to the base. In the worst case, each migration takes approximately 2560 cycles and 8.192nJ of energy; with four migrations across the five sampled retention times, this results in total time and energy overheads of 10240 cycles and 32.768nJ, respectively. While these overheads are minimal in the context of full application execution, we reiterate that PART does not add to the base architecture's switching overhead.

IV. SIMULATION RESULTS
A. Experimental Setup
To perform our analysis and evaluate PART, we implemented PART using an in-house modified version of the GEM5 simulator [14]. We modified GEM5 to model cache block expiration, variable tag lookup and cache write latencies, variable retention time units, and variable prefetch distance, as described herein. To enable rigorous comparison of PART against the state-of-the-art, we used two recent prior works to represent the state-of-the-art—LARS [4] for adaptable retention time and NST [7] for variable prefetch distance—and implemented both techniques in GEM5. We used configurations similar to the ARM Cortex-A72 [13], featuring a 2GHz clock frequency and a private L1 cache with separate instruction and data caches. For this work, we focused on data cache prefetching, since it provides much more opportunity for runtime adaptability than the instruction cache [4]. Every MSHR request from the L1 data cache is sent directly to an 8GB main memory and incurs the memory latency. We intend to explore the impact of our work on the instruction cache and lower level caches in future work.

We considered five retention times—25µs, 50µs, 75µs, 100µs, and 1ms—which we empirically found to be sufficient for the considered benchmarks. We used the MTJ modeling techniques proposed in [15] to model the different retention times, and used NVSim [16] to estimate the energy for the different retention times. Table II depicts the prefetcher configuration and the STTRAM cache parameters used in our experiments, as obtained from these modeling tools and techniques. We used twenty-one SPECrate CPU 2017 benchmarks [17], cross-compiled for the ARMv8-A instruction set architecture. Each benchmark was run using the reference input sets for 1B instructions after restoring checkpoints from 240B instructions.
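For intuition on how the Table II parameters translate into the reported cache energy, a simple accounting of dynamic plus leakage energy can be used. The function below is our simplifying illustration of such a model, not necessarily the exact accounting in our modified GEM5 setup:

    #include <cstdint>

    // Illustrative cache energy model using Table II parameters (a
    // simplifying assumption for intuition, not the simulator's exact
    // accounting): E_total = writes*E_write + hits*E_hit + P_leak*t_exec.
    double cacheEnergyJoules(uint64_t writes, uint64_t hits, double execSeconds,
                             double writeEnergyJ, double hitEnergyJ,
                             double leakagePowerW) {
        return (double)writes * writeEnergyJ + (double)hits * hitEnergyJ +
               leakagePowerW * execSeconds;
    }

    // Example with the STTRAM-25us parameters from Table II:
    //   cacheEnergyJoules(1000000, 10000000, 0.5, 0.006e-9, 0.005e-9, 11.778e-3)
    // Shorter retention times lower the per-access write energy, and STTRAM's
    // low leakage (~11.8mW vs. 75.968mW for SRAM) dominates the savings.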
B. Results and Comparisons

In this section, we compare the cache energy and access latency benefits of our work to prior work in various prefetch distance control scenarios. We denote a uniform prefetch distance of 1 to 32 as PFD_N, where N represents the memory address distance. RPC represents the optimal static distance among the PFD_N configurations, since RPC accurately determines the distance in the sampling phase and uses that distance throughout the application's run. We use NST [7] to represent state-of-the-art dynamic prefetch distance throttling, and we compare PART to the miss-based tuning algorithm used in LARS. We start with a direct comparison of PART to LARS without prefetching. Next, we compare PART to LARS with a uniform stride prefetcher of moderate aggressiveness: prefetch degree 2 and prefetch distance 16 (LARS+PFD_16), similar to prior work [6]. Thereafter, we compare PART to LARS with the NST stride prefetcher (LARS+NST) to evaluate the improvement over dynamic prefetch distance throttling. Lastly, we compare PART to an SRAM cache with the NST stride prefetcher (SRAM+NST) to show the collective improvements of the adaptable retention time STTRAM cache when prefetching is active. All energy and latency results of PART are normalized to the subject of comparison.
1) Comparison to the base STTRAM cache (LARS):
Figure 5a depicts the energy consumption of PART in different prefetch distance scenarios normalized to LARS. On average across all benchmarks, PART reduced the energy by 19.53%, 21.25%, 21.29%, 20.09%, and 17.68% for PFD_1, PFD_4, PFD_8, PFD_16, and PFD_32, respectively. RPC properly mapped expired unused prefetches (expiredPF) to the prefetch distance and ensured that the ideal static prefetch distance was selected. As such, PART+RPC reduced the average energy by 22.24%, with savings as high as 65.96% for imagick. For parest, imagick, lbm, roms, and fotonik3d, PART+RPC reduced the energy by more than 40%, and no benchmark's energy consumption was degraded by PART. Figure 5b depicts the cache access latency normalized to LARS without prefetching. On average across all benchmarks, PART reduced the latency by 21.52%, 23.50%, 23.51%, 22.08%, 19.29%, and 24.59% for PFD_1, PFD_4, PFD_8, PFD_16, PFD_32, and RPC, respectively. PART+RPC reduced the latency by up to 70.41% for imagick. PART incurred only a negligible latency overhead (1.07%) for cactusBSSN, while latency reductions were achieved for the remaining twenty benchmarks.

TABLE II: Prefetcher configuration and STTRAM cache parameters with different retention times

Prefetcher configuration: stride prefetcher, degree: 4, adaptable prefetch distance: 1, 4, 8, 16, 32
Cache configuration: 32KB, 64B line size, 4-way, 22nm technology

Memory device              SRAM      STTRAM-25µs  STTRAM-50µs  STTRAM-75µs  STTRAM-100µs  STTRAM-1ms
Write energy (per access)  0.002nJ   0.006nJ      0.007nJ      0.007nJ      0.008nJ       0.011nJ
Hit energy (per access)    0.008nJ   0.005nJ (all STTRAM units)
Leakage power              75.968mW  11.778mW     11.778mW     11.778mW     11.778mW      11.365mW
Hit latency (cycles)       2         1 (all STTRAM units)
Write latency (cycles)     2         2            3            3            3             4

Fig. 5: PART with different prefetch scenarios (PFD_N and RPC) normalized to the base STTRAM cache (LARS); (a) energy, (b) latency

Compared to LARS, we observed that the energy reduction trends were similar to the latency trends. Since prefetching can reduce compulsory misses, increased latency benefits are achieved as a result of the impact of expiration misses, as discussed in Section III-A. As shown in Figure 2, on average, up to 10.85% of expired blocks can be accurately prefetched and 'reused'. Thus, the reduced expiration misses contributed significantly to the miss latency reduction.
2) Comparison to LARS with uniform prefetch distance (LARS+PFD_16):
Figure 6 depicts the energy and latency of PART normalized to LARS+PFD_16. For brevity, only the geometric mean (across all twenty-one benchmarks in Figure 5) and a subset of notable benchmarks are shown. Figure 6a shows that, across all the benchmarks, PART+RPC reduced the average energy consumption by 4.75% compared to LARS+PFD_16 (the uniform prefetch distance). PART+RPC reduced the energy by up to 20.51% and 18.77% for roms and exchange2, respectively, with energy savings over 5% for perlbench, mcf, xalancbmk, namd, nab, and imagick. We observed that PART generally selected shorter retention times than LARS+PFD_16. By incorporating the expiration misses into the decision making about prefetching, PART achieved a balance: short retention times that did not translate into increases in miss latency, since PART allowed the stride prefetcher to recover expired blocks at short retention times. PART+RPC only degraded the energy (by 0.48%) for parest.

Fig. 6: PART with different prefetch scenarios (PFD_N and RPC) normalized to LARS+PFD_16; (a) energy, (b) latency

As described in Section III-A, due to the reduced latency achieved by prefetching expired blocks, PART uses shorter retention times to improve energy consumption, since the short retention times do not substantially increase the latency. Figure 6b shows that, similar to the energy improvement, PART+RPC reduced the average latency by 4.99% as compared to LARS+PFD_16. PART+RPC reduced the latency by up to 25.76%, 21.09%, and 14.15% for roms, exchange2, and nab, respectively. To understand why PART performed so well for these benchmarks, we studied their execution more closely. For exchange2, we observed that LARS selected a long retention time (1ms) due to low miss rates at 1ms, whereas shorter retention times increased the miss rates substantially (by up to 9x). However, the large number of misses at shorter retention times was rapidly amortized by stride prefetching and did not have a substantial negative impact on the latency. We observed that even though shorter retention times increased totalPrefetches for expired blocks, the expired unused prefetches increased at a much slower rate, thereby substantially reducing expiredPF (by up to 42%). As such, PART selected short retention times (e.g., 25µs) and was able to improve the latency for these benchmarks.

On the other hand, for roms, LARS selected a short retention time of 25µs due to the low miss rates. However, expiredPF was substantially higher at retention times shorter than 1ms. As such, PART selected 1ms for roms to save potentially useful prefetches with the longer retention time. The reduced latency for nab resulted from the optimal prefetch distance (at PFD_1) as determined by RPC. These results illustrate the importance of an adaptable prefetch distance to satisfy different applications' needs. PART incurred minor latency overheads of up to 1.6% and 0.19% for cactusBSSN and parest, respectively, but reduced the latency for the majority of the benchmarks (19 of 21).
3) Comparison to LARS with dynamic prefetch distance (LARS+NST):
We further compared PART with LARS+NST to evaluate the improvement when dynamic prefetch throttling is enabled, as in prior work [7]. For brevity, Figure 7 compares PART to LARS+NST using a subset of notable benchmarks and the geometric mean of all the benchmarks. Figure 7a shows that, on average, PART+RPC improved the energy by 3.50% over LARS+NST, with energy savings of up to 18.77% for exchange2. On the other hand, LARS+NST improved over LARS+PFD_16 by only 1.43% on average. We observed that in an STTRAM cache without PART, the dynamic prefetcher (NST) offered minimal energy savings, even when it recovered expired blocks. As shown in Figure 7b, PART+RPC reduced the average latency by 3.59% compared to LARS+NST, with reductions of up to 21.09% and 12.23% for exchange2 and roms, respectively. In the worst case, the latency overhead was 1.60% for cactusBSSN, while the rest of the benchmarks benefited from latency reductions.

Fig. 7: PART with different prefetch scenarios (PFD_N and RPC) normalized to LARS+NST; (a) energy, (b) latency

In a few cases, PART+RPC did not improve the latency or energy as compared with LARS+PFD_16 or LARS+NST (for example, for cactusBSSN). CactusBSSN was one of the benchmarks with a low percentage of prefetches in its total MSHR requests. As defined in Algorithm 1 (line 3), the allPF for cactusBSSN was very low, at 0.0002%. Thus, PART reverts to miss-based tuning for cactusBSSN, as described in Section III-B. However, to provide a clear contrast between our work and prior work, we used expiredPF-based tuning in all PART+RPC results. For cactusBSSN, the RPC table was unable to map the correct prefetch distance for latency or energy improvement. We note, however, that in almost all cases (20 out of 21 benchmarks), PART+RPC outperformed both LARS+PFD_16 and LARS+NST in both energy and latency. Importantly, we also reiterate that LARS+NST required additional hardware structures to implement the NST prefetcher, whereas RPC's overhead was marginal compared to LARS+PFD_16, as described in Section III-D. The main advantage of PART+RPC is its negligible hardware overhead compared to NST. For instance, NST required seven 32-bit registers for storage [7], whereas PART introduced only one additional register to LARS in order to track the number of outgoing MSHR requests, total prefetches, and expired unused prefetches. Overall, PART+RPC reduced the implementation overhead by 54.55% compared to LARS+NST.
4) Exploring the synergy of PART and NST:
We also explored the extent of the benefit, if any, of combining PART with NST (i.e., PART+NST). Figure 8 summarizes the energy and latency of PART+RPC and PART+NST normalized to LARS+NST. For brevity, only the geometric means across all the SPEC CPU 2017 benchmarks are shown. On average, PART+NST improved the energy and latency by 2.75% and 2.63%, respectively, compared to LARS+NST, whereas PART+RPC reduced the energy and latency by 3.50% and 3.59%, respectively. The results show that while providing dynamic prefetch distance control, NST's increased hardware overhead compared to PART does not translate into energy or latency benefits. In fact, PART still reduced the energy and latency, albeit marginally, while substantially reducing the implementation overheads (Section IV-B3). The results also reveal the promise of a low-overhead dynamic prefetch distance control for STTRAM caches based on expiredPF. We anticipate that even more energy and latency benefits can be achieved in larger STTRAM caches (such as the LLC), and we intend to explore and quantify these benefits in future work.

Fig. 8: PART normalized to LARS+NST

Fig. 9: PART normalized to SRAM+NST
5) Comparison to SRAM with dynamic prefetch distance (SRAM+NST):
We also compared PART to an SRAM cache with the NST prefetcher enabled (SRAM+NST). Figure 9 summarizes the energy and latency of PART in the different configurations normalized to SRAM+NST. On average, in all prefetch configurations, PART reduced the energy by more than 80%. We attribute this reduction largely to STTRAM's low leakage power (Table II) and to PART's ability to select retention times that satisfy the different applications' cache block requirements. As a result of this specialization, PART was also able to reduce the latency (e.g., by 10.28% for PART+RPC). As shown in Table II, STTRAM has an advantage in hit latency but not in write latency. However, with the help of PART, STTRAM was able to select shorter retention times that satisfy the applications' needs while maintaining write latencies close to SRAM's. We took a closer look at benchmarks with high write activity, where write requests and miss responses were greater than 40%, such as perlbench, cactusBSSN, povray, lbm, cam4, and fotonik3d. Our analysis revealed that the synergy of prefetching and PART's retention time selection made the write performance for these benchmarks comparable to SRAM. As a result, the STTRAM cache with PART did not degrade the latency compared to SRAM.

V. CONCLUSIONS AND FUTURE WORK
In this paper, we studied prefetching in reduced retention STTRAM L1 caches. We showed that by using expired unused prefetches—practically, by tracking changes in expired prefetches (expiredPF) with respect to total prefetches (allPF)—we could accurately determine the best retention time with regards to energy consumption and derive insights into the best prefetch distance. Based on these insights, we proposed prefetch-aware retention time tuning (PART) and retention time-based prefetch control (RPC) to predict the best retention time and the best prefetch distance during runtime. Experiments show that PART+RPC can reduce the average cache energy and latency by 22.24% and 24.59%, respectively, compared to a base architecture, and by 3.50% and 3.59%, respectively, compared to prior work, while reducing the implementation hardware overheads by 54.55%. For future work, we plan to explore the implications of PART on shared lower level caches and in the presence of workload variations.

ACKNOWLEDGEMENT
This work was supported in part by the National Science Foundation under grant CNS-1844952. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] J. Ahn, S. Yoo, and K. Choi, "DASCA: Dead write prediction assisted STT-RAM cache architecture," in IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2014, pp. 25–36.
[2] N. Sayed, R. Bishnoi, F. Oboril, and M. B. Tahoori, "A cross-layer adaptive approach for performance and power optimization in STT-MRAM," in Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2018, pp. 791–796.
[3] A. Jog, A. K. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, and C. R. Das, "Cache Revive: Architecting volatile STT-RAM caches for enhanced performance in CMPs," in DAC Design Automation Conference 2012, June 2012, pp. 243–252.
[4] K. Kuan and T. Adegbija, "Energy-Efficient Runtime Adaptable L1 STT-RAM Cache Design," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 6, pp. 1328–1339, 2020.
[5] Z. Sun, X. Bi, H. Li, W. F. Wong, Z. L. Ong, X. Zhu, and W. Wu, "Multi retention level STT-RAM cache designs with a dynamic refresh scheme," in IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec 2011, pp. 329–338.
[6] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers," in IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2007, pp. 63–74.
[7] W. Heirman, K. D. Bois, Y. Vandriessche, S. Eyerman, and I. Hur, "Near-Side Prefetch Throttling: Adaptive Prefetching for High-Performance Many-Core Processors," in Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '18. New York, NY, USA: Association for Computing Machinery, 2018.
[8] T.-F. Chen and J.-L. Baer, "Effective hardware-based data prefetching for high-performance processors," IEEE Transactions on Computers, vol. 44, no. 5, pp. 609–623, May 1995.
[9] D. Gajaria and T. Adegbija, "ARC: DVFS-aware asymmetric-retention STT-RAM caches for energy-efficient multicore processors," in Proceedings of the International Symposium on Memory Systems, ser. MEMSYS '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 439–450. [Online]. Available: https://doi.org/10.1145/3357526.3357553
[10] C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. R. Stan, "Relaxing non-volatility for fast and energy-efficient STT-RAM caches," in IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2011, pp. 50–61.
[11] K. Kuan and T. Adegbija, "HALLS: An Energy-Efficient Highly Adaptable Last Level STT-RAM Cache for Multicore Systems," IEEE Transactions on Computers, vol. 68, no. 11, pp. 1623–1634, Nov 2019.
[12] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt, "Coordinated Control of Multiple Prefetchers in Multi-Core Systems," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 42. New York, NY, USA: Association for Computing Machinery, 2009, pp. 316–326.
[13] "ARM Cortex-A72 MPCore Processor Technical Reference Manual, Revision r0p3." [Online]. Available: https://developer.arm.com/docs/100095/0003
[14] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011.
[15] K. C. Chun, H. Zhao, J. D. Harms, T. H. Kim, J. P. Wang, and C. H. Kim, "A Scaling Roadmap and Performance Evaluation of In-Plane and Perpendicular MTJ Based STT-MRAMs for High-Density Cache Memory," IEEE Journal of Solid-State Circuits, vol. 48, no. 2, pp. 598–610, Feb 2013.
[16] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 7, pp. 994–1007, July 2012.
[17] Standard Performance Evaluation Corporation, "SPEC CPU 2017." [Online]. Available: https://www.spec.org/cpu2017/