Effective Cache Apportioning for Performance Isolation Under Compiler Guidance
Bodhisatwa Chatterjee
Georgia Institute of Technology, Atlanta, GA
[email protected]
Sharjeel Khan
Georgia Institute of Technology, Atlanta, GA
[email protected]
Santosh Pande
Georgia Institute of Technology, Atlanta, GA
[email protected]
Abstract
With a growing number of cores per socket in modern data-centers, where multi-tenancy of a diverse set of applications must be efficiently supported, effective sharing of the last-level cache is a very important problem. This is challenging because modern workloads exhibit dynamic phase behaviour: their cache requirements and sensitivity vary across different execution points. To tackle this problem, we propose Com-CAS, a compiler-guided cache apportioning system that provides smart cache allocation to co-executing applications in a system. The front-end of Com-CAS is primarily a compiler framework equipped with learning mechanisms to predict cache requirements, while the back-end consists of an allocation framework with a proactive scheduler that apportions cache dynamically to co-executing applications. Our system improved average throughput by 21%, with a maximum of 54%, while maintaining the worst individual application execution-time degradation within 15% to meet SLA requirements.
1 Introduction

High-performance computing systems facilitate concurrent execution of multiple applications by sharing resources among them. The
Last-Level Cache (LLC) is one such resource, which is usually shared by all running applications in the system. However, sharing the cache often results in inter-application interference, where multiple applications can map to the same cache line and incur conflict misses, resulting in potential performance degradation. Another aspect of this application co-existence in the LLC is the increased vulnerability to shared-cache attacks such as side-channel attacks [9, 17, 33], timing attacks [3, 13], and cache-based denial-of-service (DoS) attacks [2, 11, 19, 35]. Furthermore, these problems are exacerbated by the fact that the number of cores (and thus the processes) that share the LLC is rapidly increasing in recent architectural designs. Therefore, from both a performance and a security point of view, it is imperative that LLCs are carefully managed.

To address these issues, modern computing systems use cache partitioning to divide the LLC among the co-executing applications in the system. Ideally, a cache partitioning scheme obtains overall gains in system performance by providing a dedicated region of cache memory to high-priority cache-intensive applications, and ensures security against cache-sharing attacks through the notion of isolated execution in an otherwise shared LLC. Apart from achieving superior application performance and improving system throughput [7, 20, 31], cache partitioning can also serve a variety of purposes: improving system power and energy consumption [6, 23], ensuring fairness in resource allocation [26, 36], and even enabling worst-case execution-time analysis of real-time systems [18]. Owing to these overwhelming benefits of cache partitioning, modern processor families (Intel® Xeon series) implement hardware way-partitioning through
Cache Allocation Technology (CAT) [10]. Such systems aim to provide extended control and flexibility to the user by allowing them to customize cache partitions according to each application's requirements.

On the other hand, modern workloads exhibit dynamic phase behaviour throughout their execution, resulting in rapidly changing cache demands. These behaviours arise due to complex control flows and input-data-driven behaviors in applications and their co-execution environments. It is often the case that even the same program artifact (such as a loop) exhibits different cache requirements during different invocations. The majority of prior works on cache partitioning [6, 20, 26, 31] tend to classify applications into several categories along the lines of whether they are cache-sensitive or not. Based on this offline characterization, cache allocations for the various applications are decided through a mix of static and runtime policies. The fact that a single application (or even a given loop) can exhibit dual behaviour, i.e. both cache-sensitive and cache-insensitive behaviour during its execution, is not taken into account. Some approaches employ a 'damage-control' methodology, where attempts are made to adjust cache partitions dynamically after detecting changes in application behaviour through runtime monitoring using performance counters and hardware monitors [6, 22, 28, 32]. However, these approaches are reactive and suffer from the problem of detection and reaction lag, i.e. the behaviour of modern workloads is likely to change before the adjustments are made, leading to lost performance.

In order to account for applications' dynamic phase behaviors and provide smart and proactive cache partitioning, we propose the Compiler-Guided Cache Apportioning System (Com-CAS), with the goal of providing superior performance and maximal isolation to reduce the threat of shared-cache-based attacks. The front-end of
Com-CAS consists of
Probes Compiler Framework, which uses traditional compiler analysis and learning mechanisms to predict the dynamic cache requirements of co-executing applications. The attributes that determine cache requirements (memory footprint and data reuse) are encapsulated in specialized library markers called 'probes', which are statically inserted outside each loop-nest in an application. These probes communicate the dynamic resource information to the BCache Allocation Framework, which makes up the back-end of Com-CAS. It consists of allocation algorithms that dynamically apportion the LLC among co-executing applications by aggregating the probe-generated information, and a proactive scheduler that invokes Intel CAT to perform the actual cache partitioning according to application phase changes. We evaluated
Com-CAS on an Intel Xeon Gold system with 35 application mixes from GAP [1], a graph benchmark suite; Polybench [21], a numerical computational suite; and Rodinia [5], a heterogeneous compute-intensive benchmark suite. Together these represent a variety of workloads typically encountered in a cloud environment. The average throughput gain obtained over all the mixes is 21%, with the maximum being up to 54%, with no degradation over a vanilla co-execution environment using CFS and no cache apportioning scheme. We also show that
Com-CAS maximises the isolation of reuse-heavy loops, providing protection against DoS attacks while simultaneously maintaining the SLA agreement of degradation below 15%.

The rest of the paper is organized as follows. The motivation for dynamic cache apportioning is presented in Section 2, and an overview of Com-CAS in Section 3. The prior works and their shortcomings are discussed in Section 4. The Probes Compiler Framework and the techniques to estimate cache requirements are presented in Section 5, and the apportioning algorithms and cache-partitioning scheme of the BCache Allocation Framework are explained in Section 6. The experimental evaluation of our system, along with detailed case studies, is discussed in Section 7. The final remarks and conclusions are presented in Section 8.
2 Motivation

Determining the memory requirements for a mix of running applications in a system is quite a complex task. The cache demand of each process depends on specific program points and can change drastically throughout its execution. It is possible for the same program artifact (a loop) to exhibit different behaviour across multiple invocations within a single execution. Identifying these dynamic variations in the cache requirements of an application is paramount for apportioning cache and obtaining superior performance. For example, let us consider an application mix consisting of 5 instances of the BC benchmark from the GAP suite [1]. These applications are provided with the same Uniform Random Graph input. Our goal is to leverage cache partitioning through Intel® CAT to obtain superior performance. We now consider different possibilities to show that the obvious solutions do not work. Can we partition the cache equally among identical processes?
Since the processes in the mix are identical, it is a natural inclination to provide each process an equal amount of isolated cache ways. In an 11-way set-associative cache system, one possible partition gives four of the processes 2 ways each and the fifth process 3 ways. However, as we can see in Fig. 1, following this approach results in a performance degradation of over 16% compared to the processes running without any cache apportioning. Clearly, giving each process 2 ways is not enough in terms of capacity.
How about increasing the cache allocation to the optimal number of ways?
By performing offline profiling, the optimal number of cache-ways that minimizes the execution time of an application can be figured out statically. For a single BC instance, however, even such a statically optimal allocation cannot track the changing cache requirements across the application's execution phases, which must be exploited to obtain performance improvements while providing as much isolation as possible.

Figure 1: Performance degradation by injudicious cache partitioning

3 Overview of Com-CAS

In order to satisfy the need for dynamic cache apportioning, we propose a compiler-guided approach for solving the problem of apportioning the LLC across co-executing applications on a socket. Fig. 2 depicts a high-level representation of our dynamic cache apportioning system
Com-CAS. The system can be broadly divided into two major components: the Probes Compiler Framework, which profiles applications to determine cache requirements at loop-nest granularity and statically inserts 'probes' to encapsulate and broadcast the necessary information during runtime, and the BCache Allocation Framework, which makes use of the probe-generated information to dynamically make apportioning and co-location decisions on a socket.

To estimate precise cache requirements, the probes generate the cache memory footprint and sensitivity, the data-reuse behaviour, and the anticipated phase duration for each loop-nest present in an application (Fig. 3). The cache footprint of a loop-nest is estimated by polyhedral compiler analysis [8], and the expected duration is predicted by training a regression model on the loop bounds. Static compiler loop analysis based on data-reuse patterns is performed to classify loop-nests into streaming and reuse, while cache sensitivity is determined by offline profiling. Each probe encapsulates all this information, and the probes are inserted at the entrances of their respective loop nests. During runtime, the scheduler aggregates the dynamic information coming from all co-executing probe-instrumented applications and invokes the BCache Allocation Framework, whose goal is to maximize isolation and enhance performance. The apportioning algorithms attempt as much isolation as possible; when sharing the cache is inevitable, they aggressively share certain kinds of loops (the ones exhibiting low data reuse) while minimizing the sharing of loops exhibiting heavy data reuse. We assume that all processes go through the Probes Compiler Framework, so that the BCache Allocation Framework can allocate ways to these processes. In addition, we assume that the mixes contain neither processes with malicious loops nor high-security applications that need full isolation.

Figure 2: Overview of Com-CAS, consisting of the Probes Compiler Framework and the BCache Allocation Framework
Figure 3: The cache profile for each loop nest is generated and encapsulated in probes

4 Related Work

Cache partitioning is commonly used to divide the last-level cache (LLC) capacity among the multiple co-running applications in a system. In this section, we present prior works on cache partitioning, focusing on both hardware-driven and software/compiler-driven approaches. We conclude this section by giving an overview of Intel CAT, which is a reconfigurable implementation of way-partitioning and is used by our system to partition the LLC.
Hardware-driven cache partitioning: Cache partitioning can be achieved by using specialized hardware support to physically enforce partitions in the cache. The enforced partitions are usually allocated at the granularity of cache-ways, and this technique is commonly known as way-partitioning. Since sophisticated partitioning at fine granularity often requires extensive hardware support and appropriate software modifications to interact with the novel hardware, most research on sophisticated way-partitioning is validated only with architectural simulators [4, 12, 25, 28]. Modern processors typically implement only basic way-partitioning, due to the feasibility of such an implementation in real hardware. The main drawback of basic way-partitioning is that one ends up with coarse-grain cache partitions, limited by the number of cache-ways in the system. Several works have been proposed to counter these limitations and obtain fine-grain partitions, such as using hash functions to access ways [24], ranking cache-line utility to decide replacements [30], or extending hardware support. Since current commodity hardware supports basic way-level partitioning, this work is based on the same.
Software-driven cache partitioning: Such techniques involve using either the compiler or the OS to restrict memory pages to confined cache sets. The OS utilizes page-coloring [16, 27, 29, 34, 37, 38] to partition the cache: physical pages are divided into distinct colors, and each color can be mapped to a specific cache set. While this approach does not need additional hardware support, its main drawbacks are that dynamically changing the partitions incurs the overhead of re-coloring the physical pages, and additional software modifications, like partitioning the entire main memory, are often inevitable. Compiler approaches, on the other hand, use special instructions to place different applications into different cache partitions [18, 23]. However, these techniques mostly treat cache partitioning as a static problem and do not take into consideration dynamic loop information (such as memory footprints, cache behaviour, etc.); nor do they account for the dynamic concurrency of different application phases to determine the dynamic cache demands. Modern applications exhibit complex dynamic phase behaviors, with both their cache requirements and their cache sensitivity differing at different program points. In addition, these behaviors even change across multiple invocations of the same loop nest. The solution involves deciding both the co-location of applications on a socket and the effective apportioning of the LLC, which is a distinct characteristic of our proposed approach.

Recent works have focused on new approaches to overcome the limitations of way-partitioning and page coloring [7, 20, 31]. El-Sayed et al. [7] take the approach of clustering a group of applications based on their isolated and combined miss-curves and partitioning the cache among the clusters. This approach requires profiling every pair of applications in the mix in addition to profiling them separately, i.e. performing O(n²) profiling operations on top of O(n) individual ones. Apart from being non-scalable for mixes containing a large number of applications, this approach is impractical in real-world scenarios like data centers, where the mix composition, or even the participating processes, are not known a priori. Moreover, in the above approaches, the proposed cache-partition "plans" are adjusted over fixed intervals: the approaches do not account for the dynamic phase changes within each application. An alternative approach that accounts for application phase changes is proposed by Pons et al. [20], wherein applications are categorized into multiple categories through extensive profiling of their execution behavior. However, such static categorization misses an important aspect: application phases are input-dependent, dynamic, and contingent on the entire mix composition. Thus, an application's behaviour might change under a different execution scenario or a new input data set. Moreover, the approach does not account for dynamic interactions; for these reasons, we feel that application classification is not a feasible approach.

DoS attacks have been demonstrated in shared-cache environments, in which the data belonging to cache-sensitive processes is evicted to force cache misses and slow them down [2, 11, 19, 35]. Intel CAT provides an attractive way of maintaining isolation through the use of CLOS groups, as described in the next subsection.
In our proposed work, we minimize the sharing of cache-ways between reuse-oriented processes, and in cases where they must share, the duration of sharing is minimized, thereby making a best effort to minimize the possibility of DoS attacks. The framework, coupled with our system assumptions, in our opinion provides a robust real-world solution to the problem of cache-based performance isolation. Our experimentation shows that the performance degradation associated with full isolation in terms of cache ways is extremely severe, ruling out such solutions. On the other hand, our best-effort solution carefully allocates ways with minimal sharing, and thus the resulting performance degradation is negligible, as shown in our empirical evaluation.

Intel® Cache Allocation Technology (CAT) [10] allows the user to specify the cache capacity designated to each running application. The primary objective of Intel CAT is to improve performance by prioritizing and isolating critical applications. Through isolation, Intel CAT also provides a powerful means to stop certain kinds of side-channel and DoS attacks. It provides a customized implementation of way-partitioning with a "software programmable" user interface, which is used to invoke built-in libraries that perform the cache partitioning. To reason about different cache partitions, Intel CAT introduces the notion of
Class-of-Service (CLOS) groups, which are distinctive groups for cache allocation. The implication of the CLOS abstraction is that one or more applications belonging to the same CLOS will experience the same cache partition. The Intel CAT framework allows users to specify the amount of cache allocated to each CLOS in terms of Boolean-vector-like representations called
Capacity Bitmasks (CBMs). These are used to specify the entire cache configuration for each CLOS, including isolation and overlapping of ways between one or more CLOS groups. On the hardware side, Intel CAT uses arrays of MSRs to keep track of CLOS-to-ways mappings. Minor changes are then made in the resource-allocation mechanism of the Linux kernel to interact with these dedicated registers. In our work, we use apportioning algorithms to generate CBMs for a process residing in a particular CLOS. We then use specialized library calls to interact with the Intel CAT interface to perform the required allocation.
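To make the CBM interface concrete, here is a small illustrative sketch (ours, not part of Com-CAS): it builds a contiguous capacity bitmask for a given way allocation and applies it through the pqos command-line utility from Intel's intel-cmt-cat package. The helper names, and the choice of driving the CLI from Python rather than the built-in libraries the paper uses, are assumptions made for the example.

    import subprocess

    def contiguous_cbm(num_ways: int, start_way: int = 0) -> int:
        # Intel CAT requires the set bits of a CBM to be contiguous;
        # build a mask covering `num_ways` ways starting at `start_way`
        # (way 0 is the least-significant bit).
        return ((1 << num_ways) - 1) << start_way

    def apply_clos(clos_id: int, cbm: int, cores: list) -> None:
        # Program the CBM for this CLOS, then associate cores with it.
        subprocess.run(["pqos", "-e", f"llc:{clos_id}={hex(cbm)}"], check=True)
        core_list = ",".join(str(c) for c in cores)
        subprocess.run(["pqos", "-a", f"llc:{clos_id}={core_list}"], check=True)

    # Example: give CLOS 1 three isolated ways (mask 0b00000000111) on an
    # 11-way LLC and associate cores 2 and 3 with it.
    apply_clos(1, contiguous_cbm(3, start_way=0), cores=[2, 3])

The contiguity requirement is the reason the framework's bitmask generator only has to track a start way and a width per CLOS.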
5 Probes Compiler Framework

We first describe the compiler phase of probe insertion, outlined in the overall system solution of Fig. 2.
The Probes Compiler Framework is an LLVM-based instrumentation framework equipped with learning mechanisms to determine an application's resource requirements across various execution points. It inserts 'probes' (specialized library markers) at each outermost nested-loop level within each function of an application. These probes estimate the attributes that direct an application's cache requirements for an execution phase: memory footprint, cache sensitivity, reuse behaviour, and phase timing. In our work, we consider each nested loop to be an execution phase. Once inserted, the probes encapsulate this resource information and broadcast it during execution to a proactive scheduler, which performs smart cache partitioning. While loop attributes like memory footprint and reuse behaviour can be statically analyzed and dynamically computed at runtime, other attributes such as phase timing have to be predicted using learned regression models which are embedded statically at compile time. In order to both incorporate the trained model and compute attributes dynamically, the Probes Framework has two components: a compilation component and a runtime component.

The compilation component primarily consists of multiple LLVM [15] compiler passes that instrument probes into the application and embed the loop memory-footprint usage, the data-reuse behaviour analysis, and trained machine-learning models for phase timing. First, a preliminary pass profiles each loop-nest to train regression models that predict their execution times. The loop-phase timing is established as a linear function of the loop bounds, which constitutes the linear regression model. Apart from encapsulating loop attributes, the compilation component also inserts a special probe-start function in the preheader of the loop-nest and a probe-completion function at the exit node of the loop-nest. For loops present at inner nesting levels, the probe functions are hoisted outside the outermost loop inter-procedurally. During hoisting, the pass combines all attributes of the inner loops with the outermost loop. For example, if any of the innermost loops exhibits significant data reuse, then the entire loop-nest is considered to be a reuse-heavy phase. The runtime component, on the other hand, complements the compilation component by dynamically computing the values of memory-footprint usage and phase timing and conveying them to the proactive scheduler. This communication is facilitated by passing the attributes as arguments to the probe library function calls, which are further transferred to the scheduler via shared memory.
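To illustrate the instrumentation contract, the following Python stand-in (ours; the real probes are native library calls inserted by LLVM passes, and every name below is hypothetical) mimics what an instrumented loop-nest does at runtime:

    # Illustrative sketch only: the real probes are inserted by LLVM
    # passes around native loop-nests and talk to the scheduler over
    # shared memory; printing stands in for that channel here.

    def timing_model(N, M, c=(1e-6, 2e-9)):
        # Linear regression form of Sec. 5.1: T = c0 + c1 * (N * M).
        return c[0] + c[1] * N * M

    def footprint_model(N, M, elem_size=8):
        # Footprint of the loop below: A has M unique elements, B has N.
        return (N + M) * elem_size

    def probe_start(loop_id, footprint_bytes, phase_time_est, is_reuse):
        print(f"probe_start loop={loop_id} fp={footprint_bytes}B "
              f"t={phase_time_est:.3g}s reuse={is_reuse}")

    def probe_end(loop_id):
        print(f"probe_end loop={loop_id}")

    def kernel(A, B):
        M, N = len(A), len(B)
        probe_start(loop_id=7,
                    footprint_bytes=footprint_model(N, M),
                    phase_time_est=timing_model(N, M),
                    is_reuse=True)      # B is re-read on every i-iteration
        for i in range(M):
            for j in range(N):
                A[i] += B[j]            # loop body
        probe_end(loop_id=7)

    kernel([0.0] * 4, [1.0, 2.0, 3.0])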
5.1 Phase-Timing Prediction

The phase timing is defined as the time taken to execute an entire loop-nest in an application, and each loop-nest corresponds to an application phase. The Probes Compiler Framework uses a linear regression model to predict the execution time of each loop-nest. The idea here is to establish a numerical relation between loop timing and loop iterations. Since the total number of loop iterations can be expressed in terms of the loop's bounds, the loop-time model can be further refined as a function of the loop bounds. In general, for a loop with arbitrary statements in its body, the loop timing is proportional to the loop bounds. For nested loops, the bounds of each individual loop have to be included as well. To make this analysis easier, the Probes Framework uses the LLVM [15] loop-simplify and loop-normalization passes, which transform each loop to a lower bound of 0 and a unit step size. Thus, the timing model of a normalized loop-nest with n perfectly-nested loops having upper-bounds u_1, u_2, ..., u_n is:

    T = f(u_1, u_2, ..., u_n)    (1)

To enable the analysis of imperfectly nested loops, loop distribution can be performed to transform them into a series of perfectly-nested loops. An example of semantically equivalent loops obtained by loop distribution, with the same phase time, is shown in Fig. 4.

Figure 4: Semantically equivalent loops by distribution

Therefore, the phase-timing equation can be decomposed into a sum over the n distributed loops:

    T = f_1(u_1) + f_2(u_1, u_2) + ... + f_n(u_1, u_2, ..., u_n)    (2)

Eq. 2 can be interpreted as a linear-regression model T_c(u) with weights c_1, c_2, ..., c_n and intercept c_0:

    T_c(u) = c_0 + c_1 u_1 + ... + c_n u_1 u_2 ... u_n    (3)

The Probes Framework uses a specialized pass to generate the loop bounds and timings for a specific set of application inputs, producing the test and training data for the regression model. Once the model is trained, the timing equation for each loop-nest is embedded in its corresponding probe. During runtime, the actual loop bounds are plugged into this phase-timing equation to generate the phase time, which is passed as an argument to the probe functions. For non-affine loops that have unknown bounds, we generate an approximate loop bound based on the test input sets to predict the timing.
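As a minimal sketch of how such a model could be fit (ours, assuming NumPy is available; the framework's own profiling pass gathers the bound/time samples, and the numbers below are made up), note that for a 2-deep nest the features of Eq. 3 are 1, u_1, and u_1*u_2:

    import numpy as np

    # Profiled samples: loop bounds (u1, u2) and measured nest time (sec).
    samples = [((100, 50), 0.012), ((200, 50), 0.024),
               ((200, 100), 0.046), ((400, 100), 0.091)]

    # Feature vector for Eq. (3) with n = 2: [1, u1, u1*u2].
    X = np.array([[1.0, u1, u1 * u2] for (u1, u2), _ in samples])
    y = np.array([t for _, t in samples])

    # Least-squares fit gives the weights c0, c1, c2 embedded in the probe.
    c, *_ = np.linalg.lstsq(X, y, rcond=None)

    def phase_time(u1, u2):
        # Runtime evaluation: plug the actual bounds into Eq. (3).
        return c @ np.array([1.0, u1, u1 * u2])

    print(phase_time(300, 80))  # predicted phase time for unseen bounds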
5.2 Memory Footprint Estimation

The memory footprint is the amount of cache that will be utilized during an application phase. The Probes Framework calculates the memory footprint through static polyhedral analysis of the memory accesses in a loop nest. Polyhedral analysis creates memory-access relations for each read- and write-access statement present in the loop body. These relations map the dynamic execution instances of the statements to the set of data elements accessed by them. Apart from this mapping, polyhedral access relations also contain the loop-invariant parameters (often unknown at compile time) and Presburger formulas that capture the conditions surrounding the memory access. A simple example illustrating a polyhedral access relation is depicted in Fig. 5.

Figure 5: Constructing a polyhedral access relation

In the above example, there are four polyhedral access relations in total, two each for statements P and Q, denoting the mapping of their dynamic instances to the accessed elements. Since both statements are enclosed within the same loop, the number of array elements of A and B accessed by each statement is the same. Now, the total number of unique data elements accessed in an array gives the memory footprint exhibited by the loop-nest. However, in this loop both statements have partially overlapping memory accesses over both arrays A and B. Therefore, the total number of unique data elements can be found by taking the union of the memory elements accessed by statements P and Q for arrays A and B, denoted n_A(P ∪ Q) and n_B(P ∪ Q):

    n_A(P ∪ Q) = n_A(P) + n_A(Q) − n_A(P ∩ Q)
    n_B(P ∪ Q) = n_B(P) + n_B(Q) − n_B(P ∩ Q)

where n_A(P) denotes the number of data elements of array A accessed by the dynamic instances of P, n_A(P ∩ Q) denotes the common elements of A accessed by P and Q, and so on. For the accesses of Fig. 5, each union evaluates to a linear expression in the loop upper-bound N, and the total number of data elements accessed, i.e. the memory footprint of the loop nest, is thus a linear function of N (valid for N ≥ 0). Therefore, at runtime, once the value of the loop upper-bound N is known, the probe calculates the exact footprint and passes it to the runtime component.
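To ground the inclusion-exclusion computation, here is a brute-force illustration (ours; the compiler computes the same counts symbolically from the access relations rather than by enumeration, and the access offsets are hypothetical stand-ins for Fig. 5):

    # Brute-force illustration of the footprint union n_A(P ∪ Q) for a
    # loop in which statement P reads A[i] and statement Q reads A[i+1].

    def footprint_A(N):
        P = {i for i in range(N)}        # elements of A touched by P: A[i]
        Q = {i + 1 for i in range(N)}    # elements of A touched by Q: A[i+1]
        # Inclusion-exclusion: |P ∪ Q| = |P| + |Q| - |P ∩ Q|
        assert len(P | Q) == len(P) + len(Q) - len(P & Q)
        return len(P | Q)                # = N + 1 for these accesses

    print(footprint_A(10))  # 11 unique elements: a linear function of N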
5.3 Data-Reuse Behaviour

Most applications exhibit temporal and spatial locality across various loop iterations. To fully utilize the locality benefits, the blocks of memory with reuse potential have to remain in the cache, along with all the intermediate memory accesses. To determine the amount of cache required by a loop-nest to ensure that locality is maximized, we need to obtain a sense of the reuse behaviour exhibited by the loop-nest. To classify reuse behaviour, the Probes Framework uses the Static Reuse Distance (SRD), which is a measure of the number of memory instructions between two accesses to the same memory location. Based on this metric, a loop-nest is classified as Streaming, if its SRD is negligible, or as Reuse, if the SRD is significant and the reuse occurs over a large set of intermediate memory instructions. Consequently, reuse loops require a significantly larger cache than streaming loops, and this has to be accounted for while deciding cache apportioning. Fig. 6 shows an example of how the SRD can be leveraged to classify the data-reuse behaviour of loop nests.

Figure 6: Classifying reuse behaviour in a loop nest

The loop-nest shown in the example has statements S1 and S2, which have potential reuse of array A between them, and statement S3, which has potential reuse of array B among iterations of the outer loop. Statement S1 reads a memory location which is re-read by S2 after two iterations of the I-loop. However, since every iteration of the I-loop comes with N intermediate iterations of statement S3 from the inner loop, the SRD between two accesses of the same A[i] from S1 and S2 will be 2N. Similarly, each access to array B from S3 is repeated after every iteration of the outer loop, which makes its SRD equal to M. Since M >> N, the SRD from S1 and S2 is insignificant and classified as streaming, while the SRD from S3 classifies the inner loop as reuse. Overall, the entire loop-nest is classified as reuse. At runtime, the exact SRD value is computed dynamically and passed on to the probe functions. Similar to footprint estimation, approximate bounds are generated to compute the SRD for loops whose bounds are unknown. Also, loops containing indirect subscripts or unanalyzable access patterns are conservatively classified as reuse, since at runtime they could potentially exhibit data reuse.

5.4 Cache Sensitivity

Intuitively, the performance of a cache-sensitive application should improve with an increase in allocated cache size. However, after an initial performance improvement, most applications exhibit a saturation point in their performance: after a certain number of allocated cache-ways, there is no further performance benefit from more cache allocation. The number of cache ways that corresponds to the performance saturation point of an application is defined as max-ways. The Probes Framework estimates this by executing the application with various cache sizes and plotting the execution time (or cache misses) against the allocated cache-ways. For applications that are cache-insensitive, the max-ways is assumed to be 2, as allocating less than that degrades performance by behaving as a directly-mapped cache.

Although max-ways identifies whether an application exhibits cache-sensitive behaviour during its execution, it does not fully account for the degree of cache sensitivity. This is important because the more cache-sensitive an application is, the greater the performance benefit it experiences from the same cache allocation. To quantify this varying degree of cache-allocation sensitivity among different applications, the Probes Framework uses a metric called the performance sensitivity factor (α), which captures the change in an application's loop-nest execution times as a function of the cache ways allocated to that application. For an application A, the performance sensitivity factor can be defined as:

    α_A = Σ_{i=1..max_ways} |Δt_i| / Δw_i    (6)

where Δt_i = |t_i − t_{i−1}| and Δw_i = |w_i − w_{i−1}|. A higher value of α indicates that the application's performance could be improved by changing the cache allocation when a particular loop nest is being executed. Probes profiles and sends the tuple (α, max-ways) for an application to the scheduler at runtime to guide the apportioning algorithms.
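A small sketch (ours; the profile numbers and the saturation threshold are made-up assumptions) of how an offline ways-versus-time profile could be reduced to the (α, max-ways) tuple of Eq. 6:

    # Reduce an offline (ways -> execution time) profile to (alpha, max_ways).
    profile = {2: 10.0, 3: 8.1, 4: 6.9, 5: 6.3, 6: 6.25, 7: 6.24}  # ways: sec

    def sensitivity(profile, saturation_eps=0.1):
        ways = sorted(profile)
        # max_ways: first point after which extra ways stop helping.
        max_ways = ways[-1]
        for prev, cur in zip(ways, ways[1:]):
            if profile[prev] - profile[cur] < saturation_eps:
                max_ways = prev
                break
        # alpha: sum of |dt|/dw up to the saturation point (Eq. 6).
        alpha = sum(abs(profile[cur] - profile[prev]) / (cur - prev)
                    for prev, cur in zip(ways, ways[1:]) if cur <= max_ways)
        return alpha, max_ways

    print(sensitivity(profile))  # approx. (3.7, 5): a cache-sensitive profile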
6 BCache Allocation Framework

BCache Allocation is a compiler-guided cache apportioning framework that obtains efficient cache partitions for diverse application mixes, based on their execution phases. A program execution phase is primarily characterized by the reuse behaviour of the executing loop and its corresponding memory footprint, both already predicted by the Probes Framework. Typically, during a particular instance of system execution, an application executing a reuse loop with a higher memory footprint is allocated a greater portion of the LLC than an application executing a streaming loop. A phase change occurs whenever an executing loop changes its reuse behaviour or exhibits a significant variation in its memory footprint, and the cache partitions are adjusted accordingly. The phase information is relayed by the respective probes to the scheduler and is further used by the apportioning algorithms to obtain phase-based cache partitions for the co-executing applications. The apportioning decisions of BCache also minimize shared-cache attacks by avoiding the grouping of reuse processes with other reuse processes whenever possible.

An overview of the BCache Allocation Framework working on a mix of N applications is shown in Fig. 7. These applications are instrumented by the Probes Framework. The overall BCache Allocation Framework is triggered upon an application's dynamic phase changes and can be broadly categorized into two steps: Cache Apportioning based on per-application demand, and
Cache Allocation through CBM Generation, to invoke Intel CAT and alter the cache configurations.

Figure 7: BCache Allocation Framework

The BCache framework apportions the last-level cache among the applications by adopting a unit-based fractional cache partitioning scheme. The idea is that each application will be allocated a fraction of the LLC, measured by estimating how much it contributes to the entire memory footprint. The actual cache allocation is done in two parts. First, the cache demand for each application phase is calculated as the fraction of its memory footprint over all the co-executing applications' memory footprints. These footprints are adjusted to account for the data-reuse nature of the loop nests. The apportioning algorithms also make use of other probe information, such as phase timing and max-ways, to enhance the accuracy of the partitioning decisions.

Secondly, based on the fractional cache amount calculated by the apportioning scheme for each running application, the framework determines cache partitions in terms of capacity bitmasks (CBMs) to invoke Intel® CAT. CBMs ensure that during application phase changes, the bitmasks are updated in a manner consistent with the overall system cache configuration. The other key factors in cache allocation include maintaining data locality and grouping compatible applications. Shifting an application from one cache partition to another during the course of its execution can jeopardize the benefits of data locality and increase compulsory misses. Therefore, we need to ensure that dynamically changing the cache allocations during application phase changes does not adversely affect data locality. Furthermore, when the number of running processes in a system increases or the cache demand of an application rises, sharing the LLC between multiple applications sometimes becomes inevitable. This necessitates grouping more than one application to share the same cache-ways, and forcing two incompatible processes into the same group can result in excessive conflict misses. Therefore, determining application compatibility is imperative in cache allocation. Using the notion of compatibility, the framework maximizes the isolation of reuse-sensitive loops in terms of the allocation of cache-ways, to minimize the chances of DoS attacks.

Keeping these requirements in mind, the BCache Allocation Framework uses two phase-aware cache partitioning algorithms:
Initial Phased Cache Allocation (IPCA) and
Phase Change Cache Allocation (PCCA). These algorithms generate CBMs in a manner that preserves data locality and takes application compatibility into account. System socket selection and core allocation to applications are also taken into account, while simultaneously limiting the number and the kind of loops/processes that can be grouped in the same CLOS to enhance security. IPCA and PCCA are invoked by the BCache scheduler for the initial cache allocation when a process starts, and upon each application's phase changes, respectively.
6.1 Fractional Cache Apportioning Scheme

The BCache Allocation Framework determines the cache apportioning for an application based on the loop memory footprint and its data-reuse nature. First, the memory footprint of each loop is scaled according to whether it is a streaming loop or a reuse loop. Scaling ensures that reuse loops get a bigger portion of the cache than streaming loops. For an application K whose currently executing loop n has memory footprint K^m, the adjusted loop memory footprint is defined as K^m_phase = K^m × S_n, where S_n denotes the scaling factor of the current loop n and can be adjusted dynamically. Depending on the reuse behaviour exhibited by the loop, the scaling factor ranges over:

    S_n ≥ 1, if n → reuse
    S_n ∈ [0, 1), if n → stream

Based on this adjusted footprint value and the adjusted footprints of the other executing loops in the system, the fraction of the cache that will be allocated to the application is determined. This fraction is calculated by determining how much the current application's loop footprint contributes to the overall loop footprints in the system. In a system with N applications, the fraction of cache allocated to application K at time t is given by:

    f^cache_t(K) = K^m_phase / Σ_{I=1..N} I^m_phase*    (7)

where K^m_phase denotes the adjusted loop memory footprint of application K, and I^m_phase* denotes the adjusted loop memory footprint of application I in its current execution phase. At t = 0, all the applications are allocated an initial fraction of the cache according to the memory footprints of their first executing loops. The sum of these fractions over all applications at a given time t falls into one of three scenarios, summarized in Table 1.

Sum of cache fractions Σ_{I=1..N} f^cache_t(I)    System scenario
= 1    demand exactly matches the LLC capacity
> 1    demand exceeds capacity; some allocations remain unsatisfied
< 1    spare cache capacity remains

Table 1: Three possible scenarios in the fractional cache partitioning scheme
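A sketch (ours) of the fractional scheme of Eq. 7 turning reuse-scaled phase footprints into way counts on an 11-way LLC; the scaling-factor values and the rounding policy are illustrative assumptions:

    # Fractional apportioning (Eq. 7): each application's share of the
    # 11 LLC ways is its reuse-scaled phase footprint over the total.
    TOTAL_WAYS = 11

    def apportion(phases):
        # phases: {app: (footprint_mb, is_reuse)} for the current loop-nests.
        adjusted = {app: fp * (1.5 if reuse else 0.5)   # illustrative S_n
                    for app, (fp, reuse) in phases.items()}
        total = sum(adjusted.values())
        # Round each fraction to whole ways, keeping at least 1 way per app.
        return {app: max(1, round(TOTAL_WAYS * m / total))
                for app, m in adjusted.items()}

    # Example mix: one reuse-heavy phase and two streaming phases.
    print(apportion({"bc": (12.0, True), "pr": (6.0, False), "sssp": (4.0, False)}))
    # -> {'bc': 9, 'pr': 1, 'sssp': 1}; rounding can over- or under-subscribe,
    #    which is exactly what Table 1's scenarios and Sec. 6.2 handle.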
6.2 Phase-Aware Dynamic Cache Allocation

Generating the cache partitions and their respective CBMs for each application from the cache fractions is achieved by the Initial Phased Cache Allocation (IPCA) and Phase Change Cache Allocation (PCCA) algorithms. The IPCA algorithm (presented as Algorithm 1) is invoked whenever an application begins its execution. It is responsible for assigning a socket to each application. The sockets are selected based on the α values and the available cores. A separate socket is dedicated to applications with a higher value of α, while low-α applications reside in the other sockets. If there are no cache-ways left in the high-α socket, then applications are distributed across sockets according to the available cores.

Algorithm 1: Initial Phased Cache Allocation (IPCA)
Input: Process P
Result: An effective cache partition for the application based on its initial phase

    ****** Initializing Socket ******
    if (P → α) > threshold then
        if highAlphaSocket.ways() > (P → max_ways) then
            Assign P to the high-α socket
        else
            Assign P to a different socket with max available cores
        end
    end
    ****** Selecting CLOS ******
    if socket.availCLOS > (0.25 × socket.totalCLOS) then
        find a new CLOS for a REUSE process
        find a compatible CLOS for a STREAM process
    else
        group a STREAM process into the compatible CLOS having the max(Δt) value
        group a REUSE process into the compatible CLOS having the min(Δα/Δt) ratio
    end
    ****** Allocating Cache Ways ******
    req_ways = P → getCacheApportion()
    if avail_ways > req_ways then
        Allocate req_ways to P → CLOS
    else
        Allocate avail_ways to P → CLOS
    end
    Generate appropriate bitmasks for the allocated ways
After socket allocation, a suitable CLOS is obtained for the process. If less than 75% of the CLOS groups on a socket are occupied, each reuse process gets its own CLOS; for processes executing streaming loops, a separate CLOS is assigned only if there is no compatible CLOS available on the socket. Once the number of occupied CLOS on a socket crosses the 75% mark, both streaming and reuse loops are grouped into compatible CLOS groups. Next, the IPCA algorithm estimates the initial cache demand using the fractional scheme, which is passed to a bitmask generator to obtain the required CBMs after checking against the available cache-ways in the system. In case the demand exceeds the available cache capacity, the framework allocates all possible ways to the CLOS and marks it as 'unsatisfied'. Finally, Intel® CAT is invoked with the generated bitmasks to create the partitions.

The BCache scheduler obtains the information broadcast by the probes in the running applications. The probes convey possible phase changes in an application to the scheduler, and the PCCA algorithm (presented as Algorithm 2) is invoked to update the existing cache partitions based on the new requirements. The PCCA algorithm obtains the new demand using the same fractional apportioning scheme and checks whether the cache demand has changed. If the demand increases, extra ways are allocated to the CLOS if possible. However, if the current demand of the application can be satisfied with fewer ways, the extra ways are freed and allocated to the 'most unsatisfied' CLOS (the one with α_max) in the system. The required CBMs are obtained from the bitmask generator accordingly. A similar approach is followed when a loop finishes its execution, along with updating all system parameters like occupied CLOS, available ways, etc.

Algorithm 2: Phase Change Cache Allocation (PCCA)
Input: Process P
Result: An efficient cache partition for the application based on its phase change

    req_ways = P → getCacheApportion()
    curr_ways = P → ways
    if req_ways > curr_ways then
        extra_ways = req_ways − curr_ways
        avail_ways = socket → getAvailWays()
        if avail_ways ≥ extra_ways then
            Allocate extra_ways to P → CLOS
        else
            Allocate avail_ways to P → CLOS
        end
    else
        free_ways = curr_ways − req_ways
        Allocate free_ways to the most unsatisfied P′ → CLOS
    end

Overall, the cache allocation algorithms are responsible for managing the following aspects of the system (a sketch of PCCA's way-adjustment step follows the list):

• Preserving Data Locality: All processes are confined to their initial socket throughout their execution and are not allowed to move to another socket, which would jeopardize data locality. Applications that possess reuse loops with large footprints are kept within the same CLOS, to maintain a fixed subset of cache-ways for their entire execution period. On phase changes, the number of ways in that CLOS is expanded or shrunk in a way that ensures the application does not start over with an entirely new portion of the cache.

• Finding Compatible CLOS: When the total allocated ways rise above 75% of the maximum available ways, streaming processes are aggressively grouped together if their way-demands are similar and their difference in execution time is maximal. This preserves ways for the reuse processes. The compatibility relation for reuse processes involves similarity of their cache demands in terms of ways needed, with minimal α differences and overlapping execution times (Δα/Δt). This ensures that if two or more reuse applications are grouped together because they require the same number of ways, their overall performance will not be impacted (Δα min) and they will not co-execute for long (Δt max).

• Limiting Total Processes in a CLOS: The allocation algorithms make sure that, at any time instant, the total number of processes in a given CLOS stays ≤ GFactor, to avoid excessive thrashing.
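Under these constraints, the grow/shrink core of PCCA (Algorithm 2) can be sketched as follows (ours; the bookkeeping classes and field names are hypothetical):

    from dataclasses import dataclass, field

    @dataclass
    class Clos:
        name: str
        ways: int
        alpha: float = 0.0
        unsatisfied: bool = False

    @dataclass
    class Socket:
        avail_ways: int
        clos_list: list = field(default_factory=list)

    def pcca(clos, demand_ways, socket):
        # On a phase change: grow the CLOS from the free pool if demand
        # rose; otherwise shrink it and hand the freed ways to the most
        # unsatisfied CLOS (highest alpha), falling back to the free pool.
        if demand_ways > clos.ways:
            extra = demand_ways - clos.ways
            grant = min(extra, socket.avail_ways)
            clos.ways += grant
            socket.avail_ways -= grant
            clos.unsatisfied = grant < extra   # could not meet full demand
        else:
            freed = clos.ways - demand_ways
            clos.ways = demand_ways
            needy = max((c for c in socket.clos_list if c.unsatisfied),
                        key=lambda c: c.alpha, default=None)
            if needy:
                needy.ways += freed
                needy.unsatisfied = False
            else:
                socket.avail_ways += freed

    # Demo: BC's reuse phase ends; its freed ways go to an unsatisfied CC.
    bc = Clos("bc", 6, alpha=5.0)
    cc = Clos("cc", 2, alpha=4.0, unsatisfied=True)
    sock = Socket(avail_ways=3, clos_list=[bc, cc])
    pcca(bc, 2, sock)
    print(bc.ways, cc.ways, sock.avail_ways)  # 2 6 3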
7 Experimental Evaluation

Experimental Setup: Com-CAS is evaluated on a Dell PowerEdge R440 server that supports way-partitioning through Intel CAT. Table 2 lists our system configuration. The Probes Framework was compiled as a unified set of compiler passes in LLVM 3.8. For interacting with Intel CAT, the pqos library 1.2.0 was used. On each socket of our system, one core and a single CLOS are kept vacant in order to avoid interfering with system daemon processes.
Machine: Dell PowerEdge R440
Processor: Intel Xeon Gold 5117 @ 2.00 GHz
OS: Ubuntu 18.04, Linux Kernel 4.15
Sockets: 2
Cores: 14 per socket
CLOS: 16 per socket
L1 Cache: 8-way set-associative, 32 KB, private
L2 Cache: 16-way set-associative, 1 MB, private
L3 Cache: 11-way set-associative, 19 MB, shared
Table 2: System Specification
Benchmark Choice: Three diverse benchmark suites are chosen for our experiments: PolyBench [21], a numerical computational suite containing 30 individual benchmarks solving linear algebra problems and data analysis applications; the GAP Benchmark Suite [1], a standard graph-processing workload containing 7 benchmark kernels from problems in graph theory; and Rodinia [5], a compute-intensive heterogeneous benchmark suite with 20 benchmarks from domains like medical imaging and pattern recognition. From these suites, we excluded benchmarks with insignificant execution time, since they do not exhibit cache-sensitive behaviour and their short execution time makes the results non-repeatable. Table 3 shows the final 36 benchmarks, which represent a variety of computational steps across a variety of domains that could execute in a multi-tenant environment.
Creating Effective Workload Mixes: To test the potential of our allocation algorithms, the mixes should be composed of processes that can potentially saturate the system in terms of either cache demand or core demand. Keeping this constraint in mind, a total of 35 mixes were created, with approximately 8 processes per mix on average.
Table 3: List of Benchmarks with their numbering used forEvaluationare added in every mix since we are interested in determiningthe effectiveness of our algorithms in sensitive benchmarksto improve throughput.Figure 8: Max-ways & coredemands of each mixes Figure 9: Distribution of α across benchmarks Prediction Accuracy : The loop timing, shown in Fig-ure 10 had an average of 86% with a minimum of 64% andmaximum of 100%. On the other hand, memory footprint,shown in Figure 11 had an average of with minimumof and maximum of . Both loop timing and mem-ory footprint are affected by the hoisting of probes becauseit replaces exact values with expected values leading to lessprecise timing and footprint values. This result can be seenespecially GAP benchmarks because these benchmarks tendto have many interprocedural loops compared to Polybenchand Rodinia. For the benchmarks that are high in Polybench,these are the small benchmarks that did not change drasticallyfrom training inputs leading to very high accuracies. Evenif these two variables might not be super accurate, they maynot have a huge factor in an allocation if a majority of theways are already allocated or the maxways of the loop cutsit off. Since the accuracy for both is above , it providesreasonable allocations that improved execution time.
Com-CAS's performance on all 35 mixes from Polybench, GAP and Rodinia is summarized in Fig. 12. We compare Com-CAS against an unpartitioned cache, where the mixes are simply compiled with normal LLVM and executed on Linux's CFS scheduler [14], which is the most commonly used scheduler in real-world systems. The mix completion times were noted on the baseline system (unpartitioned cache) and on Com-CAS, and the improvement of Com-CAS over the base was calculated. Com-CAS had an average improvement of [...] over all 15 Polybench mixes, [...] over all 10 GAP mixes and [...] over all 10 Rodinia mixes. In general, Com-CAS obtains a completion (throughput) speedup on every single application mix over the system with an unpartitioned cache. In addition, Com-CAS maintains the SLA agreements of the applications in all mixes. The SLA agreement, in our case, is performance degradation below 15% of their 'original-unmixed' time. The original-unmixed time is measured by running the process individually on an unpartitioned-cache system. To best show this SLA agreement, we select two representative mixes from each benchmark suite, shown in Figure 13: the mix that obtained the maximum speedup over the unpartitioned cache (best-performing mix) and the mix that obtained the minimum speedup (worst-performing mix). Overall, across all 271 individual processes distributed over the 35 mixes, Com-CAS achieves an average performance degradation of [...] compared to their original-unmixed time, and all of them were within the SLA agreement, i.e. none showed a degradation worse than 15%. In some cases, applications do better than their original time; this can be due to code transformations done by our LLVM probe-insertion pass.
The allocation algorithms in the BCache Framework focus on enhancing overall system performance by apportioning a larger amount of cache to reuse-based processes that exhibit higher cache-sensitivity (high α). As a result, the general trend observed is that the reductions in LLC cache misses are shifted towards the reuse-based applications that "need a greater amount of cache". This is important because system throughput is determined by the execution time of the longest-running process in a mix, and typically such processes demand more cache and are cache-sensitive. The processes Floyd-Warshall (Polybench), BC (GAP) and SRad (Rodinia) showed reductions of [...], [...] and [...] respectively. For the other applications present in a mix, the cache misses are either the same or somewhat increased. This is because the BCache Framework prioritizes applications that exhibit higher degrees of cache-sensitivity. Also, non-cache-sensitive applications that have been instrumented with probes exhibit more misses because of the additional probe functions. However, the resulting penalty is still within the 15% SLA limit. Overall, Com-CAS achieved a reduction of [...] in LLC misses over all the mixes.
We now take a closer look at particular mixes and show exactly how Com-CAS affects the applications. We look at the worst- and best-performing GAP mixes, with their performance improvements shown in Figure 13.

Mix 22 consisted of BC, CC, and CC_SV processes, which are all cache-sensitive. In the initial allocation, BC was placed in Socket 1 with 2 ways, CC was placed in Socket 1 with 2 ways, and CC_SV was placed in Socket 0 with 2 ways. All processes begin with 2 ways because the first loop-nest, which is common to all benchmarks in GAP, is a streaming loop that requires only 2 ways at most. The max-ways do not change across the execution, so each application also ends with 2 ways. CC ends first, followed by CC_SV and then BC, with a mix speedup of [...] and a decrease of [...] in total LLC cache misses.

Mix 25 consisted of 2 BCs, 2 CCs, and 2 SSSPs. Compared to CC_SV, SSSP is considered cache-insensitive. Both BCs were placed in Socket 1, and both CCs and both SSSPs were placed in Socket 0. Each CC and BC is placed in a different CLOS, starting with 2 separate ways. The two SSSPs share the same CLOS, meaning they share the same ways. Eventually, each CC application decreases its ways from two to one and then increases them back to two. Just like the previous mix, each process ends with two ways. The order of completion is CC, CC, BC, BC, SSSP, and SSSP. Overall, the mix speedup is [...] with a decrease of [...] in total LLC cache misses.
Across all the mixes, Com-CAS performed 203 allocations, with 157 of them being initial apportionings. During the execution of these mixes, Com-CAS increased partitions 20 times and shrunk partitions 26 times; the largest increase was 3 ways and the largest decrease 2 ways. As most allocations happened initially, the system evidently finds it optimal for the mixes not to change ways very often. In total, across all mixes, there were only 157 CLOS groups. Typically, streaming applications can share CLOS groups with other applications; reuse-oriented applications, on the other hand, tend not to share a CLOS group. Com-CAS had only 11 CLOS groups containing more than one reuse-oriented application, making the probability of a CLOS group holding at least 2 reuse applications 7%. This low probability means a low likelihood of one application mounting an LLC DoS attack on another, due to the reuse separation. In addition, a majority of the loops even in reuse-oriented applications are streaming, meaning the reuse loops rarely execute at the same time. As a result, the probability of an LLC DoS attack is lower than [...]. Moreover, the timing overlap of loops sharing a CLOS was minimized while co-locating them.

Figure 12: Improvement in mix execution time over the unpartitioned-cache system
Figure 13: Representative mixes adhering to the SLA agreement of 15% degradation
Figure 14: Timeline of the worst-performing GAP mix
Figure 15: Timeline of the best-performing GAP mix

8 Conclusion

In this work, we propose a
Compiler-Guided Cache Apportioning System (Com-CAS) for effectively apportioning the shared LLC, leveraging Intel CAT under compiler guidance. The Probes Compiler Framework evaluates cache-related loop properties, such as the loop's reuse property (streaming or reuse), cache footprint, loop timing, saturation factor (in terms of the number of ways), and cache sensitivity, and relays them to the apportioning scheduler. The BCache Allocation Framework, in turn, uses allocation algorithms to schedule processes to sockets, cores, and CLOS groups, and then dynamically partitions the cache based on the above information. Our system improved average throughput by 21%, with a maximum of 54%, while maintaining the worst individual application execution-time degradation within 15% to meet SLA requirements in a multi-tenancy setting. In addition, the system's scheduling minimizes the co-location of reuse applications within the same CLOS group, along with their overlap. With improved throughput, fulfilled SLA agreements, and increased security of processes, Com-CAS is a viable system for multi-tenant settings.
References

[1] S. Beamer, K. Asanović, and D. Patterson, "The GAP benchmark suite," arXiv preprint arXiv:1508.03619, 2015.
[2] M. G. Bechtel and H. Yun, "Denial-of-service attacks on shared cache in multicore: Analysis and prevention," CoRR, vol. abs/1903.01314, 2019. [Online]. Available: http://arxiv.org/abs/1903.01314
[3] D. J. Bernstein, "Cache-timing attacks on AES," 2005.
[4] J. Chang and G. S. Sohi, "Cooperative cache partitioning for chip multiprocessors," in ACM International Conference on Supercomputing 25th Anniversary Volume, 2007, pp. 402–412.
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in 2009 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2009, pp. 44–54.
[6] H. Cook, M. Moreto, S. Bird, K. Dao, D. A. Patterson, and K. Asanovic, "A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness," ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 308–319, 2013.
[7] N. El-Sayed, A. Mukkara, P.-A. Tsai, H. Kasture, X. Ma, and D. Sanchez, "KPart: A hybrid cache partitioning-sharing technique for commodity multicores," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 104–117.
[8] T. Grosser, H. Zheng, R. Aloor, A. Simbürger, A. Größlinger, and L.-N. Pouchet, "Polly: polyhedral optimization in LLVM," in Proceedings of the First International Workshop on Polyhedral Compilation Techniques (IMPACT), vol. 2011, 2011, p. 1.
[9] D. Gruss, C. Maurice, K. Wagner, and S. Mangard, "Flush+Flush: a fast and stealthy cache attack," in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2016, pp. 279–299.
[10] A. Herdrich, E. Verplanke, P. Autee, R. Illikkal, C. Gianos, R. Singhal, and R. Iyer, "Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family," in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2016, pp. 657–668.
[11] S. Khan, G. Mururu, and S. Pande, "A compiler assisted scheduler for detecting and mitigating cache-based side channel attacks," arXiv preprint arXiv:2003.03850, 2020.
[12] S. Kim, D. Chandra, and Y. Solihin, "Fair cache sharing and partitioning in a chip multiprocessor architecture," in Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques (PACT 2004). IEEE, 2004, pp. 111–122.
[13] V. Kiriansky, I. Lebedev, S. Amarasinghe, S. Devadas, and J. Emer, "DAWG: A defense against cache timing attacks in speculative execution processors," in 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2018, pp. 974–987.
[14] J. Kobus and R. Szklarski, "Completely fair scheduler and its tuning," draft on Internet, 2009.
[15] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in International Symposium on Code Generation and Optimization (CGO 2004). IEEE, 2004, pp. 75–86.
[16] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan, "Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems," in 2008 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2008, pp. 367–378.
[17] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, "Last-level cache side-channel attacks are practical," in 2015 IEEE Symposium on Security and Privacy, May 2015, pp. 605–622.
[18] F. Mueller, "Compiler support for software-based cache partitioning," ACM SIGPLAN Notices, vol. 30, no. 11, pp. 125–133, 1995.
[19] N. Pimpalkar and J. Abraham, "An LLC-based DoS attack technique on virtualization system with detection and prevention model," 2018, pp. 419–424.
[20] L. Pons, J. Sahuquillo, V. Selfa, S. Petit, and J. Pons, "Phase-aware cache partitioning to target both turnaround time and system performance," IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 11, pp. 2556–2568, 2020.
[21] L.-N. Pouchet et al., "PolyBench: The polyhedral benchmark suite," 2012.
[22] M. K. Qureshi and Y. N. Patt, "Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches," in 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2006, pp. 423–432.
[23] R. Ravindran, M. Chu, and S. Mahlke, "Compiler-managed partitioned data caches for low power," ACM SIGPLAN Notices, vol. 42, no. 7, pp. 237–247, 2007.
[24] D. Sanchez and C. Kozyrakis, "The ZCache: Decoupling ways and associativity," in 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2010, pp. 187–198.
[25] D. Sanchez and C. Kozyrakis, "Vantage: scalable and efficient fine-grain cache partitioning," in Proceedings of the 38th Annual International Symposium on Computer Architecture, 2011, pp. 57–68.
[26] V. Selfa, J. Sahuquillo, L. Eeckhout, S. Petit, and M. E. Gómez, "Application clustering policies to address system fairness with Intel's cache allocation technology," in 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 2017, pp. 194–205.
[27] T. Sherwood, B. Calder, and J. Emer, "Reducing cache misses using hardware and software page placement," in Proceedings of the 13th International Conference on Supercomputing, 1999, pp. 155–164.
[28] G. E. Suh, L. Rudolph, and S. Devadas, "Dynamic partitioning of shared cache memory," The Journal of Supercomputing, vol. 28, no. 1, pp. 7–26, 2004.
[29] D. Tam, R. Azimi, L. Soares, and M. Stumm, "Managing shared L2 caches on multicore systems in software," in Workshop on the Interaction between Operating Systems and Computer Architecture, 2007, pp. 26–33.
[30] R. Wang and L. Chen, "Futility scaling: High-associativity cache partitioning," in 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2014, pp. 356–367.
[31] X. Wang, S. Chen, J. Setter, and J. F. Martínez, "SWAP: Effective fine-grain management of shared last-level caches with minimum hardware support," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 121–132.
[32] Y. Wang, A. Ferraiuolo, D. Zhang, A. C. Myers, and G. E. Suh, "SecDCP: secure dynamic cache partitioning for efficient timing channel protection," in Proceedings of the 53rd Annual Design Automation Conference, 2016, pp. 1–6.
[33] Y. Yarom and K. Falkner, "Flush+Reload: A high resolution, low noise, L3 cache side-channel attack," in Proceedings of the 23rd USENIX Security Symposium, 2014.
[35] T. Zhang, Y. Zhang, and R. B. Lee, "DoS attacks on your memory in cloud," in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (ASIA CCS '17). New York, NY, USA: Association for Computing Machinery, 2017, pp. 253–265. [Online]. Available: https://doi.org/10.1145/3052973.3052978
[36] X. Zhang and Q. Zhu, "HiCA: Hierarchical cache partitioning for low-tail-latency QoS over emergent-security enabled multicore data centers networks," in ICC 2020 - 2020 IEEE International Conference on Communications (ICC). IEEE, 2020, pp. 1–6.
[37] X. Zhang, S. Dwarkadas, and K. Shen, "Towards practical page coloring-based multicore cache management," in Proceedings of the 4th ACM European Conference on Computer Systems, 2009, pp. 89–102.
[38] H. Zhu and M. Erez, "Dirigent: Enforcing QoS for latency-critical tasks on shared multicore systems," in Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.