CBP: Coordinated management of cache partitioning, bandwidth partitioning and prefetch throttling
Nadja Ramhöj Holtryd, Madhavan Manivannan, Per Stenström, Miquel Pericàs
Department of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden. Email: {holtryd,madhavan,per.stenstrom,miquelp}@chalmers.se
ABSTRACT
Reducing the average memory access time is crucial for improving the performance of applications running on multi-core architectures. With workload consolidation this becomes increasingly challenging due to shared resource contention. Techniques for partitioning of shared resources - cache and bandwidth - and prefetch throttling have been proposed to mitigate contention and reduce the average memory access time. However, existing proposals only employ a single or a subset of these techniques and are therefore not able to exploit the full potential of coordinated management of cache, bandwidth and prefetching. Our characterization results show that application performance, in several cases, is sensitive to prefetching, cache and bandwidth allocation. Furthermore, the results show that managing these together provides higher performance potential during workload consolidation as it enables more resource trade-offs. In this paper, we propose CBP, a coordination mechanism for dynamically managing prefetch throttling, cache partitioning and bandwidth partitioning, in order to reduce the average memory access time and improve performance. CBP works by employing individual resource managers to determine the appropriate setting for each resource and a coordination mechanism to enable inter-resource trade-offs. Our evaluation on a 16-core CMP shows that CBP, on average, improves performance by 11% compared to the state-of-the-art technique that manages cache partitioning and prefetching, and by 50% compared to the baseline without cache partitioning, bandwidth partitioning and prefetch throttling.
1 INTRODUCTION

Memory access time has a significant impact on application performance. Effective utilization of the memory system is therefore necessary. Typically, resources in the memory system (e.g., last-level cache (LLC) and off-chip memory bandwidth) are shared among multiple cores, as this helps improve resource utilization during workload consolidation. However, sharing can detrimentally impact average memory access time and performance due to resource contention. Prior works have proposed partitioning of shared resources - cache [10, 15, 19, 26, 28] and bandwidth [13, 17, 24] - and prefetching [11] to mitigate contention, reduce or hide memory access time and improve performance.

Recent works have proposed combining cache and bandwidth partitioning [4, 23, 27], prefetching and cache partitioning [31, 33], and bandwidth partitioning and prefetching [8, 18] to provide additional performance gains. The key insight from these papers is that coordinated management of two techniques is more advantageous than considering each in isolation because of the trade-offs that are made possible. However, no study so far has considered combining all three techniques. The goal of this paper is to do so.

Coordinated management of cache partitioning, bandwidth partitioning and prefetch throttling provides the following advantages. Firstly, it makes it possible to address more applications and cover a broader range of workloads, as shown in our in-depth performance characterization (see Section 2). The results show that 90% of the applications in the SPEC CPU2006 suite have performance sensitivity (over 10% change in IPC) to at least one of the techniques, and 70% are also sensitive to multiple techniques. Secondly, managing these techniques jointly opens up the opportunity for new and improved trade-offs.
There are synergistic interactions between the techniques and these cannot be realized if cache partitioning, bandwidth partitioning and prefetch throttling are not jointly managed.

As an example, consider the simple case of a workload comprising two applications. The first application, lbm, is sensitive to bandwidth and prefetching, while the second, xalancbmk, is sensitive to cache size and has lower performance when prefetching is enabled. The best solution when managing all three techniques is to give xalancbmk a large cache allocation and a small bandwidth allocation and to disable its prefetcher, while giving lbm a large bandwidth allocation and a small cache allocation while keeping its prefetcher active. Figure 1 shows the performance from coordinated management of all three techniques (cache+bw+pref) compared to managing a subset of the techniques. The results show that the solution that manages all three techniques is better than others that manage two of the techniques, and leads to an additional performance gain of 15%. The main challenge of coordinately managing all three techniques is the complexity of evaluating all possible allocations dynamically and determining the best possible allocation while exploiting the large number of possible trade-offs.

Figure 1: Workload with lbm and xalancbmk. Total bandwidth 16GB/s, total cache size 2MB. Executing applications for 1B instructions, more details in Section 4. Settings: lbm - prefetching active, 12GB/s; xalancbmk - prefetcher inactive, 4GB/s, determined from characterization (Section 2). Cache partition sizes are decided dynamically.

Guided by our characterization results, we propose CBP, a coordinated mechanism for dynamically managing cache partitioning, bandwidth partitioning and prefetch throttling for multi-programmed workloads. CBP consists of three local controllers, one for each resource, that together with a coordination mechanism manage and allocate the resources. With CBP, the three techniques are dynamically tuned in an iterative fashion. First, cache space is allocated, since avoiding a memory access altogether is better than reducing its latency. As a next step, bandwidth is allocated taking into consideration the impact of the cache allocation. Lastly, the prefetch setting is determined by testing the impact of prefetching on performance for the current allocation of bandwidth and cache. The prefetcher performance influences the next reallocation of cache space and bandwidth. The feedback mechanism between the different techniques dynamically adapts the allocations in order to reach a good configuration depending on the characteristics of the individual applications in the workload. Our approach of combining local controllers with a feedback mechanism reduces the complexity.

In summary, we make the following contributions:

(a) We present an in-depth characterization of the performance impact of cache, bandwidth and prefetching on the entire SPEC CPU2006 suite.
Our characterization results provide several insights: i) a majority of the applications (over 90%) are sensitive to one or multiple techniques, ii) managing cache, bandwidth and prefetching together opens up opportunities for exploiting more trade-offs and improving performance for consolidated workloads, and iii) managing cache, bandwidth and prefetching jointly has the potential to outperform combinations of two of the techniques.

(b) We propose CBP, a mechanism to dynamically manage the three resources in coordination. The solution is based on simple heuristics in order to sidestep the complexity associated with evaluating all possible configurations and choosing the most efficient one. CBP works by employing individual resource managers to determine the appropriate setting for each resource and a coordination mechanism to enable inter-resource trade-offs.

(c) We evaluate our solution with multi-programmed workloads on a 16-core tiled CMP. CBP improves performance by up to 36% (geom. mean 11%) compared to the state-of-the-art technique that manages cache partitioning and prefetching in a coordinated manner, and by up to 86% (geom. mean 50%) compared to an unpartitioned S-NUCA baseline without cache partitioning, bandwidth partitioning and prefetching.

The rest of the paper is organized as follows. Section 2 motivates the need for a coordinated approach using cache partitioning, bandwidth partitioning and prefetch throttling. Section 3 describes our proposed solution in detail. We then discuss the methodology in Section 4 and Section 5 presents the evaluation of the proposal. We provide an overview of related work in Section 6 and conclude in Section 7.

2 MOTIVATION

In order to motivate the need for coordinated management of cache partitioning, bandwidth partitioning and prefetch throttling, we perform a detailed characterization study of applications in the SPEC CPU2006 suite.
The aim of this study is to: i) characterize applications to determine the extent to which they are performance sensitive to cache, bandwidth and prefetch settings, ii) understand the different resource interactions, their impact on performance and the inter-resource trade-offs that are possible, and iii) demonstrate the performance potential of coordinated management of all three resources over a subset of the resources.
To understand the sensitivity of applications to cache, bandwidth and prefetch settings we model a system consisting of one out-of-order core with a 3-level cache hierarchy using the Sniper simulator [5]. Details about the methodology are provided in Section 4. For this experiment, the baseline LLC and bandwidth allocation is 512kB and 4GB/s, respectively. We run the application in steady state for 1B instructions and use IPC as a measure of performance.

Figure 2 shows the performance impact of changing the cache allocation, changing the bandwidth allocation and enabling prefetching, normalized to the baseline allocation without prefetching. Note that we only change the setting for one resource at a time. In Figure 2a, C-L and B-L represent low allocation settings where the cache allocation is decreased to 128kB and the bandwidth allocation is decreased to 1GB/s, respectively, while prefetching is disabled. Similarly, in Figure 2b, C-H and B-H represent high allocation settings where the cache allocation is increased to 2MB and the bandwidth allocation is increased to 16GB/s, while prefetching is disabled. Finally, P-B represents the setting where prefetching is enabled with the baseline cache and bandwidth allocation. We classify applications as performance sensitive to a specific resource if the modified allocation results in a 10% deviation from the baseline IPC. We refer to applications that are performance sensitive to a change in cache allocation as cache sensitive (CS), sensitive to a change in bandwidth allocation as bandwidth sensitive (BS) and sensitive to prefetch throttling as prefetch sensitive (PS).

The sensitivity results for cache size show that nearly 60% of the applications (17 out of 29) are sensitive to changes in cache allocation. The extent to which applications are performance sensitive varies greatly, with a performance increase of up to 4x in some cases. Furthermore, a larger number of applications are sensitive in the low allocation setting in comparison to the high allocation setting (17 compared to 11). The sensitivity results for bandwidth allocation also show a similar trend, as more applications are sensitive in the low allocation setting (23 compared to 15). Also, the extent of performance sensitivity varies greatly, with an increase of up to 3x. The sensitivity results for prefetch throttling indicate that nearly 38% of the applications (11 out of 29) are sensitive to prefetching and experience a speedup. However, there are some applications that experience a slowdown due to prefetching. In summary, we make the following observation:
OBSERVATION 1.
In the SPEC CPU2006 suite, 90% of the applications are sensitive to at least one resource, while 70% are sensitive to multiple resources, and the extent of sensitivity varies greatly.

(a) Slowdown when decreasing the cache size to 128kB (C-L), and slowdown when decreasing the bandwidth allocation to 1GB/s (B-L), in comparison to the baseline allocation, with prefetching disabled.

(b) Performance improvement from increasing the cache allocation to 2MB (C-H), and from increasing the bandwidth allocation to 16GB/s (B-H), in comparison to the baseline allocation, with prefetching disabled. Performance improvement from enabling prefetching for the baseline allocation (P-B).
Figure 2: Performance impact of changing cache size, bandwidth allocation and prefetcher setting. There are 6 CS-BS-PS applications, 8 CS-BS, 6 BS-PS, 3 CS, 3 BS and 3 applications that are insensitive (I) to all three techniques.
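The 10% sensitivity rule used in this classification is simple enough to state as code. The following Python sketch (function and label names are ours, not from the paper's toolchain) classifies an application from its sampled IPCs:

```python
def is_sensitive(baseline_ipc: float, modified_ipc: float,
                 threshold: float = 0.10) -> bool:
    """An application is sensitive to a resource if the modified
    allocation changes IPC by more than `threshold` (10%) relative
    to the baseline IPC."""
    return abs(modified_ipc - baseline_ipc) / baseline_ipc > threshold


def classify(baseline_ipc, cache_ipc, bw_ipc, pref_ipc):
    """Return the set of sensitivity labels: CS (cache sensitive),
    BS (bandwidth sensitive), PS (prefetch sensitive)."""
    labels = set()
    if is_sensitive(baseline_ipc, cache_ipc):
        labels.add("CS")
    if is_sensitive(baseline_ipc, bw_ipc):
        labels.add("BS")
    if is_sensitive(baseline_ipc, pref_ipc):
        labels.add("PS")
    return labels
```

An application like leslie3d, sensitive to all three, would be labelled CS-BS-PS; an insensitive application yields the empty set (label I).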
Next, we investigate the inter-resource interactions and trade-offs that are enabled when jointly managing cache, bandwidth and prefetching. We focus on intra-application resource interactions initially. There are four possible types of interactions within an application: cache-bandwidth-prefetch, bandwidth-prefetch, cache-prefetch and cache-bandwidth.

Regarding the cache-bandwidth-prefetch trade-off, we want to find out how the performance impact from prefetching varies with the allocation of cache and bandwidth. Figure 3 shows the performance impact of prefetching for three different cache/bandwidth settings, normalized to the respective baseline setting without prefetching. The cache and bandwidth setting for an application in the low allocation scenario (P-L) is 128kB and 1GB/s, the baseline allocation scenario (P-B) setting is 512kB and 4GB/s, while the high allocation scenario (P-H) setting is 2MB and 16GB/s.
Figure 3: Performance impact of enabling prefetching relative to the allocation of cache and bandwidth, for allocation settings L: 128kB, 1GB/s; B: 512kB, 4GB/s; H: 2MB, 16GB/s.
For some applications, a lower bandwidth and cache allocation leads to higher sensitivity to prefetching, as seen in hmmer. This is because avoiding a miss altogether, as a consequence of accurate prefetching, can have a larger impact in low allocation settings where bandwidth is scarce and the memory queuing delays tend to be longer. Also, there are applications like gcc which experience higher prefetch sensitivity with a larger cache and bandwidth allocation. The results indicate that applications tend to be prefetch sensitive in some settings and prefetch insensitive in others. We make the following observation:
OBSERVATION 2.
The allocation of cache and bandwidth influences prefetch sensitivity. Furthermore, applications tend to be prefetch sensitive in some settings and prefetch insensitive in others.
In the interest of space we use leslie3d as a representative example to illustrate the other pairwise resource interactions and trade-offs, since it is sensitive to all three techniques. Note that the baseline setting is used for the resources unless specified otherwise. The bandwidth-prefetch interaction manifests in two different ways. Firstly, prefetching typically increases the number of memory accesses and this in turn increases the bandwidth pressure. In the case of leslie3d, prefetching results in a 15% increase in the number of memory requests in comparison to the baseline. Note that prefetch misses, i.e., prefetched blocks that are evicted before use, can result in a further increase in pressure on the memory bandwidth. Secondly, the performance improvement from prefetching can be influenced by the bandwidth allocation. The results in Figure 4a show the performance for different bandwidth allocations with and without prefetching. We make the following observation:
OBSERVATION 3.
A larger bandwidth allocation can compensate for the increased bandwidth demands due to inaccurate prefetches, leading to increased performance with prefetching.
The cache-prefetch interaction also manifests in two main ways. Firstly, the performance loss from a reduced cache allocation can be offset if prefetching is effective. Figure 4c shows the IPC for different cache allocations with and without prefetching. The results show that the performance of a 128kB allocation with prefetching is better than a 512kB allocation without prefetching. Secondly, larger cache sizes (if used efficiently) can lead to higher speedup from prefetching. Figure 4b shows the performance improvement from prefetching with different cache allocations, normalized to the respective cache allocation without prefetching. The results show that prefetching is effective with lower cache allocation and that its effectiveness can increase with additional allocation.

(a) IPC with and without prefetching for different bandwidth allocations. (b) Performance improvement from prefetching depending on cache allocation. (c) IPC with and without prefetching for different cache allocations. (d) Performance improvement from increasing the cache allocation to 2MB depending on bandwidth allocation.

Figure 4: leslie3d - example of interactions and their impact on performance within a single application.

The reason for this behaviour is that a larger cache reduces the number of memory accesses, which has the same effect as increasing the available bandwidth, i.e., a lower queuing delay, which is more forgiving when there is an increase in memory accesses caused by inaccurate prefetches (in leslie3d there is a 15% increase in DRAM accesses caused by prefetching). These results lead to the following observation:
OBSERVATION 4.
A trade-off can be made between either increasing the cache size or enabling prefetching, leading to the same performance, for applications which are performance sensitive to both cache and prefetching.
As for the cache-bandwidth interaction, a lower bandwidth allocation can result in a larger sensitivity to cache size. Figure 4d shows the performance improvement from increasing the cache allocation from 512kB to 2MB with different bandwidth allocation settings. The results show that the performance improvement from additional cache allocation is much higher in low bandwidth allocation settings (see the result for the 1GB/s bandwidth allocation). This is because the average cost of a miss is much higher in the case of a lower bandwidth allocation. The results also show that a large cache allocation can reduce the performance sensitivity to bandwidth allocation (see the result for the 16GB/s bandwidth allocation). These results lead to the following observation:
OBSERVATION 5.
A trade-off can be made between either increased cache space or increased bandwidth allocation, for applications which are performance sensitive to both cache and bandwidth.
We now describe how the observed intra-application interactions and trade-offs can be leveraged in the inter-application setting for multi-programmed workloads. Let us revisit the example of running a simple workload comprising two applications (lbm and xalancbmk) on a dual-core system with 2MB LLC capacity and 16GB/s bandwidth, discussed in Figure 1. To achieve the best aggregate performance we expect xalancbmk to get the majority of the cache (nearly 1.75MB), while lbm is given a smaller cache allocation of 256kB. For bandwidth, we would expect lbm to have a large allocation (12GB/s) of the available bandwidth and xalancbmk to get a smaller allocation (4GB/s). This is reflected in Observation 5 about the trade-off between cache and bandwidth, where we would prioritize the application that shows the highest sensitivity to the resource. Furthermore, as reflected in Observation 2, we expect prefetching to be more effective for lbm since it has a large allocation of bandwidth. In the case of xalancbmk, prefetching leads to lower performance regardless of the allocation of cache and bandwidth.
In order to show the potential of coordinated management of cache, bandwidth and prefetching, we run 640 randomly generated workloads, each comprising 4 SPEC CPU2006 applications. We compare the performance of jointly managing all three resources to other resource managers that only manage a subset of these resources. For this experiment the baseline allocation of cache and bandwidth for each application is 512kB and 4GB/s. We use an exhaustive search algorithm to find the best static configuration (over 1B instructions) for the different resources when running each workload. Figure 5a shows the average (geometric mean) performance with different resource managers normalized to the baseline settings without prefetching. equal on depicts the performance when prefetching is enabled for all applications and improves performance by 6%, while only pref depicts the performance when prefetching is selectively activated and improves performance by 9%. The cache+bw+pref results show that coordinately managing cache partitioning, bandwidth partitioning and prefetch throttling improves performance by 5% compared to the best combination of two techniques (22% compared to 17%).

Figure 5b shows the number of workloads (among the 640 workloads considered) that experience a performance gain of at least 10% using the different resource managers discussed previously. The results show that 90% (597) of the workloads are sensitive to the resource manager that jointly manages all three techniques.
A smaller fraction of the workloads are sensitive to resource managers that manage a subset of these techniques (77% are sensitive to the cache+pref resource manager and 69% are sensitive to the cache+bw resource manager).

(a) Performance potential of using different resource managers. (b) Fraction of the workloads with at least 10% performance improvement.

Figure 5: Potential for coordinated management measured using 640 random workloads of 4 SPEC CPU2006 applications. Performance is obtained using an exhaustive search algorithm that evaluates bandwidth settings (2GB/s, 4GB/s, 6GB/s), cache settings (256kB, 512kB, 1024kB) and prefetching settings (active/inactive), in conjunction, to determine the resource allocation for each application in the workload that maximizes aggregate performance.

In summary, the results from the characterization study demonstrate that around 90% of applications in the SPEC CPU2006 suite are sensitive to different resources and that coordinately managing them opens up new possibilities for improving performance by trading resource allocations. Furthermore, jointly managing these resources has the potential to cover a broader range of workloads and outperform resource managers that manage a subset of the resources. In the next section we discuss how the proposed resource manager, CBP, determines cache, bandwidth and prefetch settings for different applications in a workload.
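For concreteness, the exhaustive search over the settings grid of Figure 5 can be sketched as follows. The `perf` callback, which in the study would come from a simulator run, is a hypothetical stand-in here; the grid values are the ones listed in the caption:

```python
import itertools

# Settings grid from Figure 5.
BW_SETTINGS = [2, 4, 6]            # GB/s
CACHE_SETTINGS = [256, 512, 1024]  # kB
PREF_SETTINGS = [False, True]      # prefetcher inactive/active


def best_static_config(num_apps, perf):
    """Brute-force the best per-application (cache, bandwidth, prefetch)
    assignment. `perf` maps a full assignment (one tuple per application)
    to aggregate performance. The search visits 18**num_apps points,
    which is exactly why CBP resorts to heuristics at run time."""
    per_app = list(itertools.product(CACHE_SETTINGS, BW_SETTINGS, PREF_SETTINGS))
    best, best_perf = None, float("-inf")
    for assignment in itertools.product(per_app, repeat=num_apps):
        p = perf(assignment)
        if p > best_perf:
            best, best_perf = assignment, p
    return best, best_perf
```

Even for 4-application workloads this is roughly 10^5 evaluations per workload, each requiring a full simulation interval, which makes the exhaustive search usable only for offline potential studies like this one.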
3 CBP DESIGN

Section 3.1 provides an overview of the CBP resource manager. Section 3.2 discusses the individual resource controllers, while Section 3.3 describes how the coordination mechanism ties the local resource controllers together. Finally, the implementation and overhead of the proposed mechanism are discussed in Section 3.4.

3.1 Overview
CBP is a coordinated mechanism for dynamically managing cache partitioning, bandwidth partitioning and prefetch throttling. The design consists of one local controller for each of the three techniques, and a coordination mechanism, as shown in Figure 6.
Cache resource controller
Coordination mechanism
Bandwidth resource controller
Prefetch throttling controller
Cache allocation per application
Bandwidth allocation per application
Prefetch setting per application
ATDs estimating misses for different cache sizes
Queuing delay per application
Sampled IPC with prefetch enabled/disabled per application
Figure 6: Overview of CBP resource manager.
The cache allocation controller estimates the number of misses for different cache sizes using auxiliary tag directories (ATDs) [26] and uses this as input for determining the cache allocation. The cache allocation per application is determined such that it reduces the aggregate number of cache misses for the entire workload. The bandwidth allocation controller uses the memory request queuing delay experienced by the applications as input and allocates the available bandwidth in proportion to the delay. It assigns a larger share of the available bandwidth to applications that experience longer queuing delays, and a comparatively lower share to those that experience shorter queuing delays. Lastly, the prefetch controller samples IPC with and without prefetching to determine whether the prefetcher should be enabled or disabled for each application.

The three techniques are dynamically tuned using the coordination mechanism in an iterative manner such that each local controller takes into consideration the decisions taken by the other controllers.
3.2 Resource Controllers

A partitioning solution for shared resources like cache and bandwidth typically comprises two components: an allocation policy, which determines how a resource is divided among multiple co-running applications, and an enforcement mechanism, which enforces the partitioning decision. Similarly, prefetch throttling involves a policy to determine the best prefetch setting to use and a mechanism implemented in hardware to enforce the setting. In the context of CBP, the policy component is of particular interest because it enables inter-resource trade-offs. We discuss the allocation policies in this section and defer the details of the enforcement mechanisms to Section 3.4.
Cache allocation controller: The cache allocation controller uses the Lookahead algorithm [26] to determine the cache allocation. In a nutshell, the algorithm computes the utility for each application, where utility is a measure of how many additional misses can be avoided with the allocation of cache ways. It then computes the number of ways that maximizes the utility for each application (while ensuring this is less than the total number of ways available for allocation). Finally, it compares the utility values for the different applications, determines the application that has the highest utility and assigns the pre-computed number of ways that maximizes the utility for that application. The process repeats, with recomputation of the utility for each application and reassignment of available cache ways to the application that has the largest utility, until the rest of the available capacity is distributed. The allocation controller relies on sampled ATDs to estimate, based on past behaviour, the number of misses that can be avoided with additional allocation of cache ways for each application. In order to adapt to an inclusive cache hierarchy, we assign a minimum allocation of cache space (min_ways) to all applications before distributing the remaining capacity.
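The greedy marginal-utility loop described above can be sketched as follows, assuming `misses[a][w]` holds the ATD-estimated miss count for application `a` when given `w` ways (the structure follows the Lookahead heuristic of [26]; the code itself is an illustrative sketch, not the hardware implementation):

```python
def lookahead_partition(misses, total_ways, min_ways=1):
    """Greedy Lookahead-style way partitioning.

    misses[a][w]: estimated misses for app a with w ways (w = 0..total_ways).
    Returns a per-application way allocation summing to at most total_ways.
    """
    num_apps = len(misses)
    alloc = [min_ways] * num_apps          # floor allocation, as in CBP
    remaining = total_ways - min_ways * num_apps
    while remaining > 0:
        best = None  # (utility per way, app, extra ways)
        for a in range(num_apps):
            cur = alloc[a]
            max_extra = min(remaining, len(misses[a]) - 1 - cur)
            for k in range(1, max_extra + 1):
                # marginal utility: misses avoided per additional way
                u = (misses[a][cur] - misses[a][cur + k]) / k
                if best is None or u > best[0]:
                    best = (u, a, k)
        if best is None:
            break  # no application can accept more ways
        _, a, k = best
        alloc[a] += k
        remaining -= k
    return alloc
```

With one cache-sensitive and one cache-insensitive application, the loop funnels the contested ways to the sensitive one while the floor allocation keeps the insensitive one from starving.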
Bandwidth allocation controller: We propose a bandwidth allocation algorithm that partitions bandwidth proportionally to the memory queuing delay experienced by each application. The pseudo-code for the proposed bandwidth allocation controller is outlined in Algorithm 1. The controller assigns a minimum bandwidth allocation (min_bandwidth_allocation) to each application in order to avoid unfairly giving a very low allocation to applications with a small queuing delay. The remaining bandwidth is set for distribution among the applications (see line 2). The algorithm computes the total queuing delay by summing up the individual queuing delays experienced by each application (line 4) while assigning the minimum allocation to each application (line 5). As the next step, the remaining bandwidth is allocated proportionally to the queuing delay experienced by the application (lines 7-9). The queuing delay for each application is obtained by measuring the memory access time for requests from each application.

Algorithm 1: Bandwidth allocation controller pseudo-code
Input: a list queuingDelayPerApplication
Output: a list bandwidthAllocationPerApplication
1: at time period reconfiguration_interval do
2:   remainingBandwidth = totalBandwidth - min_bandwidth_allocation * totalNumberOfCores; totalDelay = 0
3:   for i ← 0 to totalNumberOfCores - 1 do
4:     totalDelay += queuingDelayPerApplication[i]
5:     bandwidthAllocationPerApplication[i] = min_bandwidth_allocation
6:   end for
7:   for i ← 0 to totalNumberOfCores - 1 do
8:     bandwidthAllocationPerApplication[i] += (queuingDelayPerApplication[i] / totalDelay) * remainingBandwidth
9:   end for

Prefetch throttling controller: The prefetch throttling policy determines the best prefetcher setting for each application. The pseudo-code for the prefetch throttling controller is outlined in Algorithm 2. The algorithm considers two possible settings - prefetcher enabled and prefetcher disabled - but can easily be extended to support other aggressiveness settings as well. The algorithm uses the IPC values sampled for each application under the different prefetcher settings over a sample period (prefetch_sampling_period) as input. The algorithm first computes the speedup from prefetching for each application using the sampled IPC values. If the speedup is below a threshold (speedup_threshold), the prefetcher is deactivated for the next prefetch interval (prefetch_interval) (lines 3-4). If the speedup is above the threshold, prefetching is activated for the next interval (line 6).
The prefetch throttling controller is generic enough to support any type of prefetcher.
Algorithm 2: Prefetch throttling controller pseudo-code
Input: two lists ipcWithPrefetchingActive, ipcWithPrefetchingInactive
Output: a list prefetchSettingPerCore
1: at time period prefetch_interval do
2:   for i ← 0 to totalNumberOfCores - 1 do
3:     if (ipcWithPrefetchingActive[i] / ipcWithPrefetchingInactive[i]) < speedup_threshold then
4:       prefetchSettingPerCore[i] = 0
5:     else
6:       prefetchSettingPerCore[i] = 1
7:   end for

3.3 Coordination Mechanism

The goal of the coordination mechanism is to ensure that each local controller takes into account the decisions taken by the other controllers. This is necessary to exploit the trade-offs outlined earlier in Section 2. Two essential tasks are carried out in order to establish this: i) controller prioritization and ii) inter-controller interaction.
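The per-core throttling decision of Algorithm 2 amounts to a single speedup comparison per core and can be sketched as a runnable function (1 = prefetcher enabled, 0 = disabled; the default threshold of 1.0 is illustrative, not a value from the paper):

```python
def prefetch_settings(ipc_pref_on, ipc_pref_off, speedup_threshold=1.0):
    """Decide the per-core prefetcher setting from sampled IPCs.

    The prefetcher stays enabled (1) only when the sampled speedup
    from prefetching exceeds the threshold; otherwise it is disabled
    (0) for the next prefetch interval."""
    settings = []
    for on, off in zip(ipc_pref_on, ipc_pref_off):
        speedup = on / off
        settings.append(1 if speedup > speedup_threshold else 0)
    return settings
```

A core whose application slows down under prefetching (speedup below 1.0) thus has its prefetcher turned off until the next sampling phase revisits the decision.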
Controller prioritization:
Since the decision taken by one controller has the potential to influence those taken by others, it is important to establish priority among the different local controllers, as priority determines the order in which the local controllers make allocation decisions. In CBP, the highest priority is given to the cache allocation controller, which makes its allocation decision first. The rationale is that avoiding a memory access is typically more effective for reducing the average memory access time than lowering the memory access penalty. Next, priority is given to the bandwidth allocation controller, since our characterization results show that applications are comparatively more sensitive to bandwidth than to prefetching. The lowest priority is given to the prefetch throttling controller. This is because it is important that the prefetcher setting is determined based on the current allocation of cache and bandwidth, since prefetching can have a negative impact on performance if the bandwidth allocation is insufficient.
Inter-controller interaction:
Figure 7 provides an overview of the interactions that happen between the different resource allocation controllers. First, we describe how the bandwidth allocation controller takes into account the decisions made by the cache and prefetch controllers. The bandwidth allocation controller makes decisions based on the queuing delay of each application, which is affected by the number of memory accesses. The cache allocation controller, through a larger cache allocation, can reduce the number of memory accesses (Interaction 1).
Figure 7: Interactions among the different resource allocation controllers in CBP. [Figure labels: 1. Cache hits affect queuing time; 2. Prefetch misses affect bandwidth demand; 3./4. Allocation affects prefetch sensitivity; 5. Prefetching affects ATDs and subsequent allocations.]

Prefetch requests are also monitored in the ATDs. Since the cache allocation is computed based on the counter values observed in the ATDs, this ends up affecting the subsequent cache allocation decision, resulting in a smaller cache allocation for prefetch-sensitive applications.
Putting it all together:
We discuss the timeline showing when the different resource allocation controllers are invoked and how they interact with each other, in an iterative manner, using the example illustrated in Figure 8. The three resource controllers are invoked after a specific interval, which we refer to as the reconfiguration_interval, in the sequence shown in the figure. First, cache and bandwidth are equally partitioned among all applications at time 0, since information about misses and queuing delay is initially unavailable, as shown in Step 0. This is followed by sampling the IPC of the applications with different prefetch settings for a specific interval (twice the prefetch_sampling_period), as shown in Step 1. Based on the sampled IPCs, the prefetch throttling controller determines the appropriate prefetcher setting for each core for the current reconfiguration_interval. The cache and bandwidth allocation controllers are again invoked after the reconfiguration_interval, as shown in Step 2. A cache allocation decision for the next interval is influenced by the number of hits and misses observed in the previous interval. The ATD counter values are halved after each reconfiguration, in order to be sensitive to changes in the last time interval, while the per-application queuing delays are accumulated with those from the previous interval. The bandwidth allocation decision, shown in Step 3, is influenced both by the cache allocation and by the prefetcher setting in the previous interval, as discussed previously. Finally, the prefetch throttling controller, shown in Step 4, is influenced by the new cache and bandwidth allocation. The interactions among the different resource allocation controllers take place over multiple iterations, which is key to finding an effective solution.
[Figure 8 annotations: prefetch_sampling_period = a; reconfiguration_interval = b. 0. Initial cache and bandwidth allocation. 1. Prefetch throttling decision, influenced by 0. 2. New cache allocation, influenced by 1. 3. New bandwidth allocation, influenced by 0 and 1. 4. New prefetch throttling decision, influenced by 2 and 3.]
Figure 8: Timing and interactions of CBP resource manager.
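The iterative scheme in Figure 8 can be summarized as a control-loop sketch (ours, heavily simplified; the controller functions are injected stand-ins for the hardware mechanisms, and the state keys are illustrative):

```python
# Illustrative sketch of one CBP reconfiguration_interval.
def reconfigure(state, cache_ctl, bw_ctl, pf_ctl):
    # 1) Cache first (highest priority): avoiding a memory access
    #    beats reducing its penalty; decided from ATD counters.
    state["cache"] = cache_ctl(state["atd"])
    # 2) Bandwidth next: accumulated queuing delays already reflect
    #    the effect of cache allocations on the number of accesses.
    state["bw"] = bw_ctl(state["qdelay"])
    # 3) Prefetch last: decided under the current cache and
    #    bandwidth allocation via IPC sampling.
    state["prefetch"] = pf_ctl(state["ipc_on"], state["ipc_off"])
    # ATD counters are halved each interval to favor recent behavior;
    # queuing delays keep accumulating across intervals.
    state["atd"] = [v // 2 for v in state["atd"]]
    return state
```

The fixed invocation order encodes the controller prioritization described above, and the counter halving implements the recency weighting of the ATDs.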
The computational overhead of CBP resource management is low, since the design uses heuristics to guide the allocation decisions instead of exhaustively evaluating the different possible allocations. The cache allocation controller needs hardware support in order to estimate the number of misses with different cache sizes. We use sampled ATDs [26], as discussed previously, to compute the effect of different cache allocations on the misses. When enforcing cache partitioning there is an overhead associated with invalidations due to reconfiguration decisions. This is modelled faithfully by invalidating the addresses and re-fetching them when accessed, including the latency and the bandwidth impact of accessing memory. We use the enforcement mechanism proposed by Holtryd et al. [12], since it is suitable for a modern tile-based CMP and is both fine-grained and locality-aware. The enforcement mechanism uses per-core Cache Bank Tables (CBTs), which record mappings between addresses and banks. When a request needs to access the LLC, the CBT is used to identify the cache bank that the address is mapped to. Inside each bank, way-partitioning hardware divides the capacity. The enforcement results in a partition granularity of 32kB on our system (see Section 4). We incur hardware cost for implementing the ATDs and cache partition enforcement [12, 26].

The bandwidth partition enforcement is similar to Intel Memory Bandwidth Allocation (MBA) technology [1, 2], which is commercially available. The solution uses delays as a way to allocate bandwidth: an application with a high delay has a low allocation, and experiences a longer queuing delay for each memory access. In our solution the additional delay is added after the LLC, instead of after the L2 as in the original proposal.

The overhead of prefetch throttling comes from sampling each application with different prefetcher settings.
This is because deactivating prefetching (for a sample period) can be detrimental to the performance of an application for which prefetching is effective; likewise, it is detrimental to turn on prefetching (for a sample period) for an application whose performance is hurt by prefetching.
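To make the delay-based bandwidth enforcement concrete, a hypothetical mapping from a core's bandwidth share to an added per-request delay might look as follows (the formula and the base delay of 40 cycles are our assumptions for illustration, not MBA's actual programmed values):

```python
# Hypothetical delay-based bandwidth throttle (illustrative only).
def request_delay(bw_share, base_delay_cycles=40):
    """Map a bandwidth share in (0, 1] to extra stall cycles inserted
    after the LLC: smaller shares get proportionally longer stalls."""
    assert 0 < bw_share <= 1
    return int(base_delay_cycles * (1 / bw_share - 1))
```

A core with a full share incurs no extra delay, while halving or quartering the share adds progressively longer stalls, which is how a delay knob translates into an effective bandwidth allocation.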
We evaluate our proposal on a 16-core tiled CMP architecture modeled using the Sniper simulator [5]. Each tile has an out-of-order (OOO) core with private L1 data and instruction caches, a unified private L2 cache and a 512KB LLC bank. The cache latencies have been modelled using CACTI 6.5 [20]. Details about the baseline architecture are shown in Table 1. A sensitivity study for the CBP parameters is provided in Section 5.2.
Cores | 16 cores, x86-64 ISA, 4GHz, OOO, Nehalem-like, 128 ROB entries, dispatch width 4
L1 caches |
L2 caches |
LLC | 512KB bank per tile
Coherence protocol | MESIF protocol, 64B lines, in-cache directory
Global NoC |
Memory controllers |
Prefetcher | stride-based, located in L2, 4 prefetches, stop at page boundary, 8 flows/core
CBP parameters | reconfiguration_interval = 10ms, prefetch_sampling_period = 0.5ms, speedup_threshold = 1.05, prefetch_interval = 10ms, min_bandwidth_allocation = 1, min_ways = 4

Table 1: Configuration of the simulated 16-core tiled CMP.
We use the entire SPEC CPU2006 suite in our evaluation. The applications are in the format of whole-program pinballs [29]. We create 14 workload mixes (each comprising 16 applications) by randomly selecting applications from the entire SPEC CPU2006 suite. Details about the workload mixes are presented in Table 2.

Mix | Classification | Applications
w1 | 4CS-BS-PS, 5CS-BS, 3BS-PS, 3CS, 1BS | xalancbmk(xa), gromacs(gr), libquantum(li)(2), h264ref(h2), zeusmp(ze), tonto(to), soplex(so), lbm(lb), perlbench(pe), calculix(ca), milc(mi), sphinx3(sp), bwaves(bw), gobmk(go), gamess(ga)
w2 | 3CS-BS-PS, 5CS-BS, 5BS-PS, 2CS, 1BS | lb, to, pe, go, gcc(gc), mi, li(2), namd(na), h2, cactusADM(cac), ze(2), ca, so, astar(as)
w3 | 6BS-PS, CS, 5BS, 4I | bw(2), povray(po)(2), sjeng(sj)(2), sp(2), na(2), ze, GemsFDTD(Ge), cac, li, mi, wrf(wr)
w4 | CS-BS-PS, 2CS-BS, 5BS-PS, 3CS, 2BS, 3I | po, bw(2), h2, sj, li(2), gr, na, mi(2), as, Ge, ga, wr, lb
w5 | 5CS-BS-PS, 10CS-BS, BS-PS | dealII(de), omnetpp(om)(2), go(2), hmmer(hm), xa, leslie3d(le), bzip2(bz)(2), gc, so, mcf(mc), pe, ca(2)
w6 | 3CS-BS-PS, 5CS-BS, 4BS-PS, 2CS, 2BS | sp, bw(2), h2, om, li, gr, go, mi(2), as, hm, ga, le, lb, ca
w7 | 2CS-BS-PS, 2CS-BS, 3BS-PS, 5CS, 4I | po(2), to, sj, h2(2), na, lb(2), ze(2), gr, Ge, as, wr, ga
w8 | 4CS-BS-PS, 4CS-BS, 2CS-PS, 3BS-PS, 3BS | de, bw(3), xa, mi(3), om, li(2), bz, go, so, hm, pe
w9 | 2CS-BS-PS, 5CS-BS, 2BS-PS, 3CS, BS, 2I | gc, po, to, hm, sj, h2, bz, ze, gr, so, Ge, as, pe, wr, ga, cac
w10 | 2CS-BS-PS, 3CS-BS, 6BS-PS, CS, 2BS, 2I | sj, bw(2), de, na, li(2), om, ze, mi(2), xa, Ge, bz, wr, gc
w11 | 2CS-BS-PS, 4CS-BS, 4BS-PS, CS, 2BS, 3I | po, om, sj, go, na(2), le, ze, xa, Ge, bz, wr, ca, sj, sp, gc
w12 | 6CS-BS-PS, 8CS-BS, 2CS | de, to, go, h2(2), hm, gr, xa, as(2), bz, ga, gc, lb, so, ca
w13 | 3CS-BS-PS, 2CS-BS, 4BS-PS, 4CS, 3I | to, po, h2, sj, gr, na, as, ze, ga, Ge, lb(2), li, to, mi, wr
w14 | 5CS-BS-PS, 2CS-BS, 5BS-PS, CS, BS, 2I | de, bw, go, po, hm, na, xa, ze, so, Ge, mc, li, pe, mi, ca, wr

Table 2: 16-core workloads.

We fast-forward for 16B instructions (in total) and then carry out detailed simulation until all benchmarks have completed at least 500M instructions. Statistics are reported based on the detailed simulation of 500M instructions. After this period the applications continue to run and compete for resources, to avoid a lighter load on long-running applications. The methodology is in line with earlier works [3, 19, 28].
We report normalized weighted speedup over the baseline for each workload, in order to evaluate system performance for multi-programmed workloads. This is computed as (1/N) Σ_{i=1}^{N} IPC_{i,RM} / IPC_{i,baseline}, where RM refers to the system with a resource manager that manages cache, bandwidth and prefetcher settings, and the baseline refers to a system with unpartitioned cache and bandwidth and without prefetching.

We also report average normalized turnaround time (ANTT) for each workload, since this is a user-oriented performance metric which reflects fairness. ANTT is given by (1/N) Σ_{i=1}^{N} CPI_{i,RM} / CPI_{i,baseline}.
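Both metrics can be computed directly from per-application IPC and CPI values (a direct transcription of the formulas; list arguments are illustrative):

```python
# Evaluation metrics for multi-programmed workloads.
def weighted_speedup(ipc_rm, ipc_base):
    """Normalized weighted speedup: mean of per-application IPC ratios
    between the resource-managed system and the baseline."""
    return sum(a / b for a, b in zip(ipc_rm, ipc_base)) / len(ipc_rm)

def antt(cpi_rm, cpi_base):
    """Average normalized turnaround time: mean of per-application CPI
    ratios (lower is better, i.e., fairer)."""
    return sum(a / b for a, b in zip(cpi_rm, cpi_base)) / len(cpi_rm)
```

Since CPI is the inverse of IPC, ANTT rewards configurations that avoid slowing down any single application, whereas weighted speedup rewards aggregate throughput.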
Table 3 shows the resource managers we evaluate, in addition to CBP, and the corresponding settings they use for cache, bandwidth and prefetching. The baseline configuration represents a system with unpartitioned cache, unpartitioned bandwidth and prefetching disabled. The equal off configuration represents a system where cache and bandwidth are equally partitioned and prefetching is disabled. only cache represents the configuration where the cache is partitioned as described in Section 3.2.1, while bandwidth is unpartitioned and prefetching is disabled. Likewise, only bw represents the configuration where bandwidth is partitioned as described in Section 3.2.2, while the cache is unpartitioned and prefetching is disabled. As for only pref, prefetch throttling is performed as described in Section 3.2.3, while cache and bandwidth remain unpartitioned.

RM | cache setting | bandwidth setting | prefetch setting
baseline | unpartitioned | unpartitioned | disabled
equal off | equal | equal | disabled
only cache | dynamic (see 3.2.1) | unpartitioned | disabled
only bw | unpartitioned | dynamic (see 3.2.2) | disabled
only pref | unpartitioned | unpartitioned | dynamic (see 3.2.3)
bw+pref | unpartitioned | dynamic | dynamic
bw+cache | dynamic | dynamic | disabled
cache+pref | dynamic | unpartitioned | dynamic
CPpf | dynamic | unpartitioned | enabled
CBP | dynamic | dynamic | dynamic

Table 3: Configurations evaluated.

The resource managers that jointly manage two out of the three resources (bw+pref, bw+cache, cache+pref) in a coordinated manner can leverage a subset of the interactions described in Section 3.3. We also compare against CPpf [33], a recently proposed technique for jointly managing prefetching and cache partitioning. In CPpf, prefetch-friendly applications are allocated small partition sizes (because the benefit from prefetching can offset the performance drop from a small allocation), while the rest of the cache is allocated to the non-prefetch-friendly applications. In our implementation, we give the minimum allocation to the prefetch-friendly applications and use UCP (see Section 3.2.1) to partition the remaining capacity among the non-prefetch-friendly applications. We use UCP and per-application partitioning in order not to put CPpf at a disadvantage in comparison to the other schemes. Finally, CBP jointly manages all three techniques dynamically.
We first compare CBP to other resource managers that manage a subset of the resources. Next, we carry out a sensitivity analysis for the different design parameters to understand their impact on the performance of CBP.
Figure 9 shows the normalized weighted speedup for each of the 14 workloads, with the bars representing the different resource managers. We use normalized weighted speedup as a measure of performance for the entire workload. equal off improves performance in 12 of the mixes, by 10% on average over the baseline. only bw improves performance in 7 of the mixes and on average by 4% over the baseline. only pref improves performance for 12 workloads and provides an average improvement of 9%. only cache improves performance for all the workloads and provides an average improvement of around 28%.

Figure 9: Performance results: normalized weighted speedup over the baseline.

Figure 10: Fairness results: average normalized turnaround time (ANTT) over the baseline, where lower is better.

Coordinated management of bandwidth partitioning and prefetch throttling (bw+pref) leads to higher performance than the baseline in 12 workloads and an average overall improvement of 10%. Coordinated management of bandwidth and cache partitioning (bw+cache) improves performance across all workloads and provides an average performance improvement of 37% (up to 64%). Coordinated cache partitioning and prefetch throttling (cache+pref) improves performance across all workloads, on average by 39% (up to 57%). CPpf, cache partitioning influenced by prefetching, improves performance by 39% (up to 63%).

Among the resource managers that perform coordinated management of two resources, cache+pref and CPpf achieve the best performance. The results also show that the improvement achieved with coordinated management of two techniques is larger than the sum of the improvements from the individual techniques. This shows that coordinated management helps exploit synergistic interactions among the different techniques which cannot otherwise be leveraged. Finally, CBP is the best performing coordinated resource manager in 13 of the 14 workloads and provides an average improvement of 50% (up to 86%). CBP improves performance by an additional 11% in comparison to the best performing resource manager that coordinates two techniques, which is also the state-of-the-art. In one workload, w3, CBP achieves slightly lower performance (2%) than cache+pref. This is because bandwidth partitioning is not very effective for this specific workload.

Figure 10 shows the average normalized turnaround time, which reflects the fairness of the different resource managers. Note that a lower value signifies greater fairness. On average, CBP shows 27% better fairness than the baseline and 4% better fairness than the best combination of two techniques, cache+pref. cache+pref has 4% better fairness than CPpf.

Case study: We investigate the performance of a single workload in detail to understand how CBP improves performance in comparison to resource managers that manage a subset of the techniques. Figure 11 shows the IPC of the individual applications in a specific workload (w2), normalized to the baseline IPC. We have classified the applications in this workload into two groups. Group 1 comprises the applications from lbm to gcc (in the figure), for which the cache+pref resource manager performs better than the bw+cache resource manager.

Figure 11: Results for workload 2.

Group 2 comprises the rest of the applications in the workload, from soplex to namd, where the bw+cache resource manager performs better than the cache+pref resource manager. The cache+pref resource manager provides the best performance for applications in group 1 because these applications are comparatively more memory-intensive and get a larger share of the available bandwidth under this resource manager. Applications in group 2 benefit from bandwidth partitioning, since they then get a fair bandwidth share, and in addition they are not sensitive to prefetching. When performing coordinated management of all three resources, we would ideally prefer the allocation decisions made by the cache+pref resource manager for applications in group 1 and by the bw+cache resource manager for applications in group 2. With CBP, some applications in group 1 end up with a lower allocation of bandwidth (compared to cache+pref), which hurts their performance (see lbm, perlbench, cactusADM), while the rest of the applications in the group see a performance improvement from getting the right amount of allocation. For the applications in group 2, CBP manages to match the performance of the bw+cache resource manager. In summary, CBP enables better trade-offs, resulting in a solution that improves overall performance for the workload and outperforms the other resource managers that only manage a subset of the techniques.
We investigate the sensitivity of CBP to different design parametersin this section.
The reconfiguration interval determines how frequently the different resource allocation controllers are invoked when running a workload. We investigate the sensitivity of CBP to different reconfiguration interval values in order to determine an appropriate interval. Figure 12a shows the average (geometric mean) performance when using three different reconfiguration intervals: 1ms, 10ms and 100ms. A shorter reconfiguration period has the potential to adapt faster to phase-change behaviour. However, it also incurs a higher overhead, because a larger fraction of the interval is used up for IPC sampling (required for prefetch throttling). Overall, the results show that a 10ms period provides a good trade-off between quick adaptation and the overhead incurred for IPC sampling.
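This trade-off can be quantified: each reconfiguration interval spends two prefetch_sampling_periods on IPC sampling (one with prefetching on, one with it off), so with the 0.5ms period from Table 1 the sampling consumes 100%, 10% and 1% of a 1ms, 10ms and 100ms interval, respectively (our back-of-the-envelope model):

```python
# Fraction of a reconfiguration interval consumed by IPC sampling.
def sampling_overhead(reconf_interval_ms, sampling_period_ms=0.5):
    """Two sample periods per interval: prefetching on, then off."""
    return 2 * sampling_period_ms / reconf_interval_ms
```

At a 1ms interval the entire interval would be spent sampling, which is consistent with the observed performance loss for short intervals.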
The results thus far assume that each tile has a baseline cache allocation of 512kB; the total LLC capacity of the 16-core tiled CMP is then 8MB. We next study the impact of increasing the cache capacity available to a single tile to 1MB.

Figure 12: Sensitivity analysis. (a) Reconfiguration period. (b) Larger cache size. (c) Minimum bandwidth allocation. (d) Sample period in prefetch throttling.

Figure 12b shows the average performance achieved using CBP with the different per-tile capacities, normalized to the baseline configuration with the same capacity. The results show that increasing the available LLC capacity leads to a 5% drop in aggregate performance with CBP. This performance drop can be attributed to an increase in cache access time for the larger cache.
We investigate the sensitivity to the minimum bandwidth allocation used in the bandwidth allocation algorithm presented in Section 3.2.2. Figure 12c shows the difference in performance with a minimum bandwidth allocation of 0.5GB/s and 1GB/s, normalized to the baseline. There were small variations among the workloads: some experience a slight improvement while others see a small drop in performance. Overall, reducing the minimum bandwidth allocation did not have a considerable impact on the performance of CBP, as long as it is sufficient for workloads which experience very little queuing delay.
Finally, we investigate the impact of changing the prefetch_sampling_period used in the prefetch throttling controller (Section 3.2.3). Figure 12d shows the impact of the sampling period on performance, normalized to the baseline. The intervals we evaluate are 0.25ms, 0.5ms and 1ms. The advantage of a shorter sampling period is that it carries a lower overhead, while the drawback is the risk of over- or under-estimating the performance benefit from prefetching. The results indicate that a sampling interval of 0.5ms achieves the best performance.
Isolated Management:
Several techniques have been proposed in the literature that focus specifically on cache partitioning, bandwidth partitioning or prefetch throttling. Cache partitioning techniques [10, 15, 19, 26, 28] help improve performance and achieve better utilization of the available cache resources by avoiding interference among co-running applications and reducing the number of accesses to memory. Bandwidth partitioning techniques [13, 17, 24, 32] reduce the average memory access penalty by dynamically determining how bandwidth should be shared among the co-running applications. Prefetching can hide memory access latency by fetching data before it is requested [6, 7, 14, 16, 21, 22]. However, inaccurate prefetches can hurt application performance, since they can increase the number and cost of demand misses [9]. Prefetch throttling [6, 30] adaptively tunes when and with what settings the prefetcher is used, based on application characteristics, and has been shown to provide better performance and address the drawbacks of prefetching. The aforementioned works consider each of the techniques in isolation and, as shown in this work, leave room for improvement, since they do not take the interactions between cache partitioning, bandwidth partitioning and prefetch throttling into account.
Coordinated Management:
Several works have proposed combining two of the techniques in order to exploit the benefits of coordination. These works can be broadly classified into the following groups: i) coordinated cache and bandwidth partitioning [23, 27], ii) coordinated prefetching and cache partitioning [31, 33], and iii) coordinated bandwidth partitioning and prefetching [18]. Sahu et al. [27] propose a method for cache and bandwidth partitioning using a CPI model for bandwidth and set partitioning for the cache. CoPart [23] combines bandwidth and cache partitioning using a user-level run-time; unlike CBP, their goal is to improve fairness. Recently, CPpf [33] and Sun et al. [31] proposed coordinated approaches for cache partitioning and prefetching in which prefetch-friendly applications are given a smaller cache allocation. Unlike CBP, which maintains per-application partitions, cache partitioning in these two proposals is performed for groups of applications. Ebrahimi et al. [8] propose general mechanisms to make memory scheduling techniques prefetch-aware. However, these works cannot exploit the additional interactions and trade-offs that become available when coordinately managing all three resources, which we have shown is important for performance.

Some works have also proposed coordinated management of multiple resources. For instance, CLITE [25] uses Bayesian optimization to provide theoretically-grounded partitioning of multiple resources (e.g., cores, caches, memory bandwidth, memory capacity, disk bandwidth) that meets the QoS targets of multiple co-located jobs. Bitirgen et al. [4] use machine learning to manage power, cache and bandwidth in a coordinated way, anticipating the system-level performance impact of allocation decisions. However, neither of these works considers prefetch throttling, which we have shown is important in order to realize the full potential of coordinated resource management. To the best of our knowledge, CBP is the first coordinated resource manager for cache partitioning, bandwidth partitioning and prefetch throttling.
We present CBP, a mechanism for coordinated management of cache partitioning, bandwidth partitioning and prefetch throttling. The design is motivated by our in-depth characterization of the performance impact of cache, bandwidth and prefetch allocation and their interactions. CBP combines local resource allocation controllers with a coordination mechanism that dynamically manages and allocates the resources in a way that considers both inter- and intra-application interactions. Our evaluation on a tiled 16-core CMP demonstrates that CBP improves performance by up to 86% (geo. mean 50%) compared to a system without partitioning and prefetching, and by up to 36% (geo. mean 11%) over the state-of-the-art technique that manages cache partitioning and prefetching in a coordinated manner.
REFERENCES
[3] In Proc. PACT-22. https://doi.org/10.1109/PACT.2013.6618818
[4] Ramazan Bitirgen, Engin Ipek, and José F. Martínez. 2008. Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach. In Proc. MICRO-41, 318–329. https://doi.org/10.1109/MICRO.2008.4771801
[5] Trevor E. Carlson, Wim Heirman, Stijn Eyerman, Ibrahim Hur, and Lieven Eeckhout. 2014. An Evaluation of High-Level Mechanistic Core Models. ACM TACO (2014). https://doi.org/10.1145/2629677
[6] F. Dahlgren, M. Dubois, and P. Stenstrom. 1993. Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors. In Proc. ICPP, Vol. 1, 56–63. https://doi.org/10.1109/ICPP.1993.92
[7] Fredrik Dahlgren and Per Stenström. 1996. Evaluation of hardware-based stride and sequential prefetching in shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems 7(4), 385–398. https://doi.org/10.1109/71.494633
[8] Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt. 2011. Prefetch-aware shared resource management for multi-core systems. ACM SIGARCH Computer Architecture News 39(3), 141. https://doi.org/10.1145/2024723.2000081
[9] Eiman Ebrahimi, Onur Mutlu, Chang Joo Lee, and Yale N. Patt. 2009. Coordinated control of multiple prefetchers in multi-core systems. In Proc. MICRO-42, 316. https://doi.org/10.1145/1669112.1669154
[10] Nosayba El-Sayed, Anurag Mukkara, Po-An Tsai, Harshad Kasture, Xiaosong Ma, and Daniel Sanchez. 2018. KPart: A Hybrid Cache Partitioning-Sharing Technique for Commodity Multicores. In Proc. HPCA.
[11] Synthesis Lectures on Computer Architecture 28 (2014), 1–69. https://doi.org/10.2200/S00581ED1V01Y201405CAC028
[12] N. Holtryd, M. Manivannan, P. Stenström, and M. Pericàs. 2020. DELTA: Distributed Locality-Aware Cache Partitioning for Tile-based Chip Multiprocessors. In Proc. IPDPS, 578–589. https://doi.org/10.1109/IPDPS47924.2020.00066
[13] D. R. Hower, H. W. Cain, and C. A. Waldspurger. 2017. PABST: Proportionally Allocated Bandwidth at the Source and Target. In Proc. HPCA, 505–516. https://doi.org/10.1109/HPCA.2017.33
[14] Sushant Kondguli and Michael Huang. 2018. Division of labor: A more effective approach to prefetching. In Proc. ISCA, 83–95. https://doi.org/10.1109/ISCA.2018.00018
[15] Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers. 2011. CloudCache: Expanding and shrinking private caches. In Proc. HPCA-17. https://doi.org/10.1109/HPCA.2011.5749731
[16] Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012. When Prefetching Works, When It Doesn't, and Why. ACM Transactions on Architecture and Code Optimization.
[17] In Proc. HPCA-16, 1–12. https://doi.org/10.1109/HPCA.2010.5416655
[18] Fang Liu and Yan Solihin. 2011. Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors. In Proc. SIGMETRICS '11, 37. https://doi.org/10.1145/1993744.1993749
[19] R. Manikantan, Kaushik Rajan, and R. Govindarajan. 2012. Probabilistic shared cache management (PriSM). In Proc. ISCA-39. https://doi.org/10.1109/ISCA.2012.6237037
[20] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. 2007. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. In Proc. MICRO-40. https://doi.org/10.1109/MICRO.2007.33
[21] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith. 2004. AC/DC: An adaptive data cache prefetcher. In Proc. PACT 2004, 135–145. https://doi.org/10.1109/pact.2004.1342548
[22] Kyle J. Nesbit and James E. Smith. 2005. Data cache prefetching using a global history buffer. IEEE Micro 25(1), 90–97. https://doi.org/10.1109/MM.2005.6
[23] Jinsu Park, Seongbeom Park, and Woongki Baek. 2019. CoPart: Coordinated Partitioning of Last-Level Cache and Memory Bandwidth for Fairness-Aware Workload Consolidation on Commodity Servers. In Proc. EuroSys 2019. https://doi.org/10.1145/3302424.3303963
[24] Jinsu Park, Seongbeom Park, Myeonggyun Han, Jihoon Hyun, and Woongki Baek. 2018. HyPart: A hybrid technique for practical memory bandwidth partitioning on commodity servers. In Proc. PACT, 1–14. https://doi.org/10.1145/3243176.3243211
[25] T. Patel and D. Tiwari. 2020. CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers. In Proc. HPCA, 193–206. https://doi.org/10.1109/HPCA47549.2020.00025
[26] Moinuddin K. Qureshi and Yale N. Patt. 2006. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proc. MICRO-39. https://doi.org/10.1109/MICRO.2006.49
[27] Aryabartta Sahu and Saparapu Ramakrishna. 2014. Creating heterogeneity at run time by dynamic cache and bandwidth partitioning schemes. In Proc. SAC, 872–879. https://doi.org/10.1145/2554850.2554992
[28] Daniel Sanchez and Christos Kozyrakis. 2011. Vantage: Scalable and Efficient Fine-Grain Cache Partitioning. In Proc. ISCA-38. https://doi.org/10.1145/2000064.2000073
[29] Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. 2002. Automatically Characterizing Large Scale Program Behavior. In Proc. ASPLOS-10 (San Jose, California). https://doi.org/10.1145/605397.605403
[30] Santhosh Srinath, Onur Mutlu, Hyesoon Kim, and Yale N. Patt. 2007. Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. In Proc. HPCA-13, 63–74. https://doi.org/10.1109/HPCA.2007.346185
[31] Gongjin Sun, Junjie Shen, and Alexander V. Veidenbaum. 2019. Combining prefetch control and cache partitioning to improve multicore performance. In Proc. IPDPS 2019, 953–962. https://doi.org/10.1109/IPDPS.2019.00103
[32] Yaocheng Xiang, Chencheng Ye, Xiaolin Wang, Yingwei Luo, and Zhenlin Wang. 2019. EMBA: Efficient Memory Bandwidth Allocation to Improve Performance on Intel Commodity Processor. In Proc. ICPP 2019, Article 16, 12 pages. https://doi.org/10.1145/3337821.3337863
[33] Jun Xiao, Andy D. Pimentel, and Xu Liu. 2019. CPpf: A prefetch aware LLC partitioning approach.