A Fast Analytical Model of Fully Associative Caches
Tobias Gysi, Tobias Grosser, Laurin Brandner, Torsten Hoefler
ETH Zurich, Switzerland
Abstract
While the cost of computation is an easy to understand local property, the cost of data movement on cached architectures depends on global state, does not compose, and is hard to predict. As a result, programmers often fail to consider the cost of data movement. Existing cache models and simulators provide the missing information but are computationally expensive. We present a lightweight cache model for fully associative caches with least recently used (LRU) replacement policy that gives fast and accurate results. We count the cache misses without explicit enumeration of all memory accesses by using symbolic counting techniques twice: 1) to derive the stack distance for each memory access and 2) to count the memory accesses with stack distance larger than the cache size. While this technique seems infeasible in theory, due to non-linearities after the first round of counting, we show that the counting problems are sufficiently linear in practice. Our cache model often computes the results within seconds and, contrary to simulation, the execution time is mostly problem size independent. Our evaluation measures modeling errors below 0.6% on real hardware. By providing accurate data placement information we enable memory hierarchy aware software development.
CCS Concepts: • Software and its engineering → Software performance; Compilers.

Keywords: static analysis, cache model, performance tool
1 Introduction

Most programmers know the time complexity of their algorithms and tune codes by minimizing computation. Yet, ever increasing data-movement costs urge them to pay more attention to data-locality as a prerequisite for peak performance. When considering different implementation variants of an algorithm, we typically have a good understanding of which variant performs less computation or can be vectorized well. Selecting the optimal tile size or deciding which loop fusion choice is optimal is far less intuitive. Essentially, we lack a perception of the cache state that allows us to reason about data movement.

Data-locality optimizations are often pushed to the end of the development cycle when the code is available for benchmarking. But at this stage eliminating fundamental design flaws may be hard. We believe a cache model responsive enough to be part of the day-to-day workflow of a performance engineer can provide the necessary guidance to make good design choices upfront. After the completion of the development, the very same model could provide the necessary data for accurate model driven automatic memory tuning.

Figure 1. Scaling of the cache model compared to simulation (execution time versus problem size for gemm and cholesky; haystack analytical model versus dinero IV simulation, with speedups of 25x to 54285x).

We present HayStack, the first cache model for fully associative caches with least recently used (LRU) replacement policy which is both fast and accurate. At the core of our model, we calculate the LRU stack distance [29] (also called reuse distance [5, 15, 43]) symbolically for each memory access. The stack distance counts the distinct memory accesses between two subsequent accesses of the same memory location. All memory accesses with distance shorter than the cache size hit a fully associative LRU cache.

We show in Figure 1 the scaling of HayStack compared to the Dinero IV [17] cache simulator for increasing problem sizes. The simulation times are proportional to the problem size since simulators [7, 10, 17, 25] enumerate all memory accesses. We use the Barvinok algorithm [40] to count the cache misses. The algorithm avoids explicit enumeration by deriving symbolic expressions that evaluate to the cardinality of the counted affine integer sets and maps. As demonstrated by the flat GEMM scaling curve, this symbolic counting makes the model execution time problem size independent. Even for Cholesky factorization, with its known non-linearities [6] that prevent full symbolic counting, the scaling of the execution time remains flat compared to simulation. The source code is available at https://github.com/spcl/haystack.

While computing stack distances for static control programs is a well known technique, reducing stack distance information for all dynamic memory accesses to a single cache miss count is difficult. Beyls et al. [6] show that stack distances in general are non-affine. The divisions introduced when modeling cache lines add even more non-affine constraints.
While symbolic summation over affine constraint sets is possible with the Barvinok algorithm, symbolic counting over non-affine constraints is considered hard in general. In this work, we show that this generally hard problem can in practice become surprisingly tractable if non-linearities are carefully eliminated by either specialization or partial enumeration. As a result we contribute:

• The first efficient cache model to accurately predict static affine programs on fully associative LRU caches.
• An efficient hybrid algorithm that combines symbolic counting with partial enumeration to reduce the asymptotic cost of the cache miss counting.
• A set of simplification techniques that exploit the regular patterns induced by the cache line structure to make the stack distance polynomials affine.
• An exhaustive evaluation which shows that our cache model performs well in practice with large speedups compared to existing approaches while achieving high accuracy compared to measurements on real hardware.
2 Background

We first introduce our hardware model, provide background on cache misses, explain the concept of affine integer sets and maps, and discuss the set of considered programs.
2.1 Hardware Model

A cache implements various complex and sometimes undisclosed policies that define the exact behavior. We deliberately model a generic cache with full associativity and LRU replacement policy. When writing, we assume the caches allocate a cache line and load the memory reference if necessary (write-allocate) and then forward the write to all higher-level caches (write-through). We parametrize our cache model with the cache line size L and the cache size C in bytes. When modeling multiple cache hierarchy levels, we assume inclusive caches and specify the cache size for every hierarchy level. These design choices avoid an overly detailed model that is only correct in a very controlled environment with known data alignment and allocation. As shown by Section 4.2, we still model enough detail to produce actionable and accurate results in practice.

We assume that the modeled programs run in isolation and that their execution starts with an empty cache. We count data accesses and ignore instruction fetches.

Following Hill [23], we distinguish three types of cache misses: 1) compulsory misses happen if a program accesses a cache line for the first time, 2) capacity misses happen if a program accesses too many distinct cache lines before accessing a cache line again, and 3) conflict misses happen if a program accesses too many distinct cache lines that map to the same cache set of an associative cache before accessing a cache line again. We model fully associative caches and thus compute only compulsory and capacity misses.

Not every access of a program variable translates into a cache access as the compiler may place scalar variables in registers. Compiler and hardware techniques such as out-of-order execution also change the order of the memory accesses.
We assume all scalar variables are buffered in registers and count only array accesses in the order provided by the compiler front end.

The cache misses measured when profiling a program depend on many factors generally unknown to an analytical cache model; for example, concurrent programs or the operating system may pollute the caches, or the hardware prefetchers may load more data than necessary. We do not consider this system noise and instead provide an approximate but deterministic cache model.
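To make this miss taxonomy concrete, the following sketch (our own illustration, not part of HayStack) classifies the accesses of a memory trace on a fully associative LRU cache twice, once by direct simulation and once via the stack distance, and checks that both agree. Addresses stand for whole cache lines, i.e., the element size is assumed equal to the cache line size.

```python
from collections import OrderedDict

def lru_misses(trace, cache_size):
    """Simulate a fully associative LRU cache; return (compulsory, capacity) misses."""
    cache = OrderedDict()  # keys = cached lines, ordered by recency of use
    seen = set()
    compulsory = capacity = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # hit: update the recency order
        else:
            if line in seen:
                capacity += 1              # line was cached once but got evicted
            else:
                compulsory += 1            # first touch of the cache line
                seen.add(line)
            cache[line] = None
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used line
    return compulsory, capacity

def stack_distance_misses(trace, cache_size):
    """Classify misses via the stack distance: the number of distinct lines
    accessed since (and including) the previous access of the same line."""
    compulsory = capacity = 0
    for pos, line in enumerate(trace):
        previous = [p for p in range(pos) if trace[p] == line]
        if not previous:
            compulsory += 1                # no previous access: compulsory miss
            continue
        distance = len(set(trace[previous[-1] + 1 : pos + 1]))
        if distance > cache_size:
            capacity += 1                  # too many distinct lines in between
    return compulsory, capacity

# Memory trace of the example program of Figure 2, one entry per array element:
trace = [0, 1, 2, 3, 3, 2, 1, 0]
assert lru_misses(trace, 2) == stack_distance_misses(trace, 2) == (4, 2)
```

With cache size two, both views report the four compulsory misses of S0 and the two capacity misses of S1 discussed later in Section 3.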
2.2 Integer Sets and Maps

We use sets and maps of integer tuples to count the cache misses. We next define the relevant set and map operations necessary for the model implementation. These operations are a subset of the functionality provided by the integer set library (isl) [38].

An affine set

S = {(i_1, ..., i_n) : con(i_1, ..., i_n)}

defines the subset of integer tuples (i_1, ..., i_n) ∈ Z^n that satisfy the constraints con(i_1, ..., i_n). The constraints are Presburger formulas that combine affine expressions with comparison operators, boolean operators, and existential quantifiers. Presburger arithmetic [21] also admits floor division and modulo with a constant divisor.

An affine map

R = {(i_1, ..., i_n) → (j_1, ..., j_m) : con(i_1, ..., i_n, j_1, ..., j_m)}

defines the relation from integer tuples (i_1, ..., i_n) ∈ Z^n to integer tuples (j_1, ..., j_m) ∈ Z^m that satisfy the constraints con(i_1, ..., i_n, j_1, ..., j_m), where the constraints have the same restrictions as the set constraints. The domain R_dom defines the set of integer tuples (i_1, ..., i_n) of the input dimensions for which a relation exists, and conversely the range R_ran defines the set of integer tuples (j_1, ..., j_m) of the output dimensions for which a relation exists.

Both sets and maps support the set operations intersection S_1 ∩ S_2, union S_1 ∪ S_2, projection, and cardinality |S|. The domain intersection R ∩_dom S intersects the domain of the map R with the set S. Maps also support the map operations composition R_2 ◦ R_1 and inversion R^-1. The operator

lexmin(R) = {(i_1, ..., i_n) → (m_1, ..., m_m) : ∄ (i_1, ..., i_n) → (j_1, ..., j_m) ∈ R s.t. (j_1, ..., j_m) ≺ (m_1, ..., m_m)}

computes for every input tuple (i_1, ..., i_n) the lexicographically smallest output tuple (m_1, ..., m_m) of all tuples (j_1, ..., j_m) related to the input tuple.

A named set or map prefixes the integer tuples with names that convey semantic information. For example, we prefix the array element M(i) with the array name and the statement instance S0(i) with the statement name. We use statement names starting with the letter S and array names starting with any other letter. The names are semantically equivalent to an additional tuple dimension.

int sum = 0;
for (int i = 0; i < 4; ++i)
S0:  M[i] = i;
for (int j = 0; j < 4; ++j)
S1:  sum += M[3-j];

Figure 2. Example program used for illustration.

Figure 3. The statement instances (S0(i); S1(j) : i,j = [0..3]) and the related schedule values (schedule S: (0,i); (1,j)) and memory locations (access map A: M(k) : k = [0..3]) are sufficient to compute the cache misses of a program.

2.3 Static Control Programs

Our cache model analyzes affine static control programs consisting of loop nests with known loop bounds that perform array accesses with affine index expressions. Figure 2 shows an example program with two statements: the statement S0 initializes an array M and the statement S1 accumulates the array elements. Before analyzing a program, we extract the sets and maps that specify the statement execution order and the memory access offsets.

The iteration domain

I = {S0(i) : 0 ≤ i < 4; S1(j) : 0 ≤ j < 4}

defines the set of all executed statement instances. For the two statements of the example program, the loop variables i and j are limited to the range zero to three. To define the execution order, the schedule

S = {S0(i) → (0, i); S1(j) → (1, j)} ∩_dom I
Figure 4. The (1) statement instance trace and the (2) memory access trace of the example program allow us to compute whether the access M(1) of the statement S1(2) hits the cache: the access is a cache hit if |{M(1), M(2), M(3)}| ≤ cache size.

maps the statement instances to a multi-dimensional schedule value. The statement instances then execute according to the lexicographic order of the schedule values. The intersection with the iteration domain I limits the schedule domain to the program loop bounds. The access map

A = {S0(i) → M(i); S1(j) → M(3 − j)}

maps the array accesses of the statement instances to the accessed array elements. The iteration domain I, the schedule S, and the access map A capture all relevant program properties necessary to evaluate the cache model. Figure 3 shows how the schedule S and the access map A relate statement instances, schedule values, and memory locations.

3 Cache Model

Our cache model computes for every memory access the stack distance parametric in the loop variables and counts the instances with a stack distance larger than the cache capacity to determine the capacity misses. All memory accesses with undefined backward stack distance access the cache line for the first time and count as compulsory misses.

Figure 4 shows the computation of the capacity misses for the example program introduced by Figure 2: (1) enumerates the statement instances according to the schedule S and (2) applies the access map A to the statement instances to compute the memory trace. Assuming the array element size is equal to the cache line size, the stack distance corresponds to the cardinality of the set {M(1), M(2), M(3)} which contains the array elements accessed between and including the two subsequent accesses of M(1). The second access of M(1) hits the cache if the cardinality of the set is lower than or equal to the cache capacity.

3.1 Stack Distance Computation

The stack distance computation counts the number of distinct memory accesses between subsequent accesses of the same memory location.
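Because the example program is finite, the isl set and map operations can be mimicked with explicit Python sets of (input, output) pairs. The following sketch (illustrative helper names, not the isl API) builds the schedule S and the access map A of the example program and checks the composition A ◦ S⁻¹ used below.

```python
# Relations as sets of (input, output) pairs; names become tuple prefixes.
S = ({(('S0', i), (0, i)) for i in range(4)}
     | {(('S1', j), (1, j)) for j in range(4)})        # schedule: instance -> schedule value
A = ({(('S0', i), ('M', i)) for i in range(4)}
     | {(('S1', j), ('M', 3 - j)) for j in range(4)})  # access map: instance -> array element

def inverse(rel):
    """Swap the input and output tuples of a relation."""
    return {(y, x) for x, y in rel}

def compose(r2, r1):
    """Relational composition r2 ◦ r1: apply r1 first, then r2."""
    return {(x, z) for x, y1 in r1 for y2, z in r2 if y1 == y2}

# A ◦ S⁻¹ maps every schedule value to the accessed array element.
sched_to_mem = compose(A, inverse(S))
assert sched_to_mem == ({((0, i), ('M', i)) for i in range(4)}
                        | {((1, j), ('M', 3 - j)) for j in range(4)})
```

The same two helpers suffice to reproduce, by brute force, every composition used in the remainder of this section; the symbolic implementation performs them on parametric sets instead of enumerated points.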
We determine for every memory reference the last access to the same memory location and count the set of memory accesses since this last access to obtain the stack distance parametric in the loop variables. For our example program, the stack distance of the memory access in statement S1 is equal to the loop variable j plus one. We can thus express the stack distance of the memory access with the map

D = {S1(j) → j + 1 : 0 ≤ j < 4}

limited to the statement iteration domain. As the statement S0 accesses all array elements for the first time, its backward stack distance is undefined and the accesses count as compulsory misses.

Our discussion of the stack distance computation initially assumes that every statement performs at most one access of a one-dimensional array with an element size equal to the cache line size. At the end of this section, we show how to overcome these limitations.

The memory accesses execute according to the statement execution order defined by the schedule. The map

L≺ = {(i_1, ..., i_n) → (j_1, ..., j_n) : (i_1, ..., i_n) ≺ (j_1, ..., j_n) ∧ (i_1, ..., i_n), (j_1, ..., j_n) ∈ S_ran}

relates the schedule values (i_1, ..., i_n) to all lexicographically larger schedule values (j_1, ..., j_n), and the map

L⪯ = {(i_1, ..., i_n) → (j_1, ..., j_n) : (i_1, ..., i_n) ⪯ (j_1, ..., j_n) ∧ (i_1, ..., i_n), (j_1, ..., j_n) ∈ S_ran}

relates the schedule values (i_1, ..., i_n) to all lexicographically larger or equal schedule values (j_1, ..., j_n). Later on, we use these helper maps to filter relations by execution order.

The stack distance computation first identifies all accesses to the same array element. The equal map

E = S ◦ A^-1 ◦ A ◦ S^-1

relates each schedule value to all schedule values that access the same array element.
The concatenation A ◦ S^-1 maps the schedule values to the accessed array elements and its reverse S ◦ A^-1 maps the accesses back to the schedule values. For our example program, the composition

A ◦ S^-1 = {(0, i) → M(i) : 0 ≤ i < 4; (1, j) → M(3 − j) : 0 ≤ j < 4}

relates the schedule values to the accesses of the array M. The equal map then relates all schedule values that access the same array element. For example, the relations (0, i) → M(i) and (1, j) → M(3 − j) access the same array element if i is equal to 3 − j. The resulting equal map

E = {(0, i) → (0, i) : 0 ≤ i < 4;
     (1, j) → (1, j) : 0 ≤ j < 4;
     (0, i) → (1, j) : j = 3 − i ∧ 0 ≤ i < 4;
     (1, j) → (0, i) : i = 3 − j ∧ 0 ≤ j < 4}

contains the relation (0, i) → (1, j) with j = 3 − i and its reverse, but also the self relations of the schedule values.

Figure 5. The relations of the forward map F and the backward map B for the statement instance S1(2) of the example program (the forward map F corresponds to the concatenation of the blue backward arrow and the black forward arrows). The map intersection defines the statement instances between and including the two accesses of M(1). The concatenation with the map A yields the related memory accesses.

The lexicographically shortest relations of the equal map denote the subsequent accesses to the same array element which are closest in time. The next map

N = S^-1 ◦ lexmin(L≺ ∩ E) ◦ S

intersects the equal map E with the map L≺ to filter out all backward in time and self relations, and the lexmin operator removes all forward in time relations except for the shortest ones. We compose the result with S and S^-1 to convert the schedule values to statement instances. The next map consequently relates every statement instance to the next statement instance that accesses the same array element. For our example program, the intersection L≺ ∩ E contains only the forward relation (0, i) → (1, j), which means the lexmin operator has no effect since there is only one relation per statement instance. The next map

N = {S0(i) → S1(j) : j = 3 − i ∧ 0 ≤ i < 4}

thus relates the instances of statement S0 to the instances of statement S1 that access the same array element.

The next map contains subsequent statement instances that access the same array element but not the statement instances executed in between. To compute them, we intersect the set of statement instances executed after the first access with the set of statement instances executed before the second access of the same array element. Figure 5 illustrates this intersection. The backward map

B = S^-1 ◦ L⪯^-1 ◦ S

relates the statement instances to all statement instances with lexicographically smaller or equal schedule value.
The maps S and S^-1 convert from statement instances to schedule values and back. The forward map

F = (S^-1 ◦ L⪯ ◦ S) ◦ N^-1

relates the statement instances to all statement instances with lexicographically larger or equal schedule value than the statement instance that last accessed the same array element. We reverse the next map N to compute the statement instance that accessed the array element last. The intersection of the forward map and the backward map contains all statement instances executed between subsequent accesses of the same array element.

Figure 5 shows the forward and backward map relations for the statement instance S1(2) of the example program that accesses the array element M(1). The forward map F corresponds to the concatenation of the blue backward arrow and the black forward arrows. The intersection of the two maps contains the statement instances executed between the subsequent accesses of the array element M(1). We finally concatenate this intersection with the access map A to obtain the stack distance map that relates every statement instance to the array accesses performed since the last access of the same array element.

The number of related array elements defines the stack distance of the statement instances in the stack distance map. We use the isl [38] implementation of the Barvinok algorithm [40] to count the relations symbolically. The algorithm computes the map cardinality by counting the points of the range related to every point of the domain. The result of the computation are quasi-polynomials parametric in the input dimensions of the map that evaluate to the number of related range points. As the domain is not always homogeneous, the algorithm splits the map domain into pieces that consist of a quasi-polynomial and the subdomain of the map domain where the polynomial is valid.
After counting the stack distance map, the distance set

D = |A ◦ (F ∩ B)|

contains pieces with quasi-polynomials parametric in the schedule input dimensions that for a subdomain of the iteration domain evaluate to the stack distance. The pieces do not overlap and together cover the full iteration domain. For our example program, the distance set

D = {S1(j) → j + 1 : 0 ≤ j < 4}

contains one piece with the polynomial S1(j) → j + 1 valid for the subdomain 0 ≤ j < 4.

Cache lines and multi-dimensional arrays. An adapted access map A that relates statement instances to cache lines instead of array elements suffices to support cache lines and multi-dimensional arrays. Let us assume our example program initializes the diagonal elements of a two-dimensional array M(i, i). Then the access map

A = {S0(i) → M(i, c) : c = ⌊i · E / L⌋}

models the accessed cache lines given the size of the array elements E and the cache line size L in bytes. We replace the innermost dimension of the array access with the cache line index c, which multiplies the array index with the element size and divides the result by the cache line size. As a result, accesses of neighboring array elements map to the same cache line. The outer dimensions of the array index remain unchanged since we assume the innermost dimension is cache line aligned and padded to an integer multiple of the cache line size. This restriction can be lifted at the expense of a more complex formulation.

Multiple memory accesses per statement. An extension of the schedule S and the access map A with an additional schedule dimension that orders the memory accesses of the statements allows us to model more than one memory access per statement. Let us assume the statement S0 of the example program reads the array element I(i) and writes the result to the array element M(i). We then extend the schedule

S = {S0(i, a) → (0, i, a); S1(j, a) → (1, j, a)}

with the access dimension a that orders the memory accesses of the statement.
Then the access map

A = {S0(i, 0) → I(i); S0(i, 1) → M(i); S1(j, 0) → M(3 − j)}

assigns every array access to a unique statement instance since the access dimension enumerates the array accesses of every statement in the order provided by the compiler front end. The extended schedule executes only one array access per statement instance and thus requires no further modifications of the stack distance computation.

The output of the stack distance computation is a set of polynomials that defines the backward stack distance for every array access of the static control program.

3.2 Counting the Capacity Misses

All memory accesses with stack distance larger than the cache size count as capacity misses. As discussed in Section 3.1, the stack distance computation splits the iteration domain into pieces. Each piece defines the stack distance for a subdomain of the iteration domain. To obtain the capacity misses, we count for every piece the points of the subdomain for which the polynomial evaluates to a stack distance larger than the cache size.

The piece with polynomial S1(j) → j + 1 for 0 ≤ j < 4 covers the full iteration domain of our example program. The cache miss set

M = {S1(j) : j + 1 > C ∧ 0 ≤ j < 4}

contains all points of the piece with stack distance larger than the cache size C, which means the cardinality of the cache miss set |M| is equal to the number of capacity misses. Assuming cache size two, the cache miss set contains the statement instances S1(2) and S1(3) that cause two capacity misses.

Figure 6. To count the non-affine piece P = {S0(i, j) → i + j² : i, j = [0..2]}, we project out the affine i-dimension to obtain the enumeration domain E = {j : j = [0..2]}. We next bind the j-dimension of the piece P to the j-values in the enumeration domain and separately count the cache misses for the resulting affine pieces P_{j=0}, P_{j=1}, and P_{j=2}.

The distance set specifies the stack distance for all program statements. To count the capacity misses per statement, we split the distance set by statement and compute the cache misses separately. Without loss of generality, we discuss the cache miss computation for a statement S0.

The Barvinok algorithm also computes the set cardinality by counting the points symbolically. We use the algorithm to count affine cache miss sets and resort to explicit enumeration for non-affine sets. As explicit enumeration is expensive, we only enumerate the non-affine polynomial dimensions and count the affine dimensions symbolically. This partial enumeration technique splits cache miss sets into pieces with affine lower-dimensional polynomials. Figure 6 demonstrates the technique for an example polynomial with non-affine j-dimension. Section 3.3 discusses further techniques to split non-affine pieces into multiple affine pieces.

Algorithm 1 counts the total number of cache misses T given the distance set D of the program. The algorithm enumerates all pieces P of the distance set (lines 2-12). Every piece P consists of a polynomial and a domain that define the stack distance of a memory access for a subdomain of the iteration domain.
If the polynomial of the piece P is affine, we count the cache misses symbolically (lines 3-4); otherwise the partial enumeration projects the non-affine dimensions out of the domain of the piece P and enumerates all points of the resulting non-affine enumeration domain E (lines 6-9). For every such point pt, we bind the non-affine dimensions of the piece P to the coordinates of the point pt and count the cache misses of the affine piece P_pt symbolically. Figure 6 illustrates the splitting of non-affine pieces (lines 6-9).

Algorithm 1: counting the capacity misses
  input: D distance set of pieces
  output: T total number of cache misses
  parameter: C cache size
 1: T ← 0
 2: foreach P in D do
 3:   if isPieceAffine(P) then
 4:     T ← T + countAffinePiece(P, C)
 5:   else
 6:     E ← getNonAffineDomain(P)
 7:     foreach pt in E do
 8:       P_pt ← bindNonAffineDimensions(P, pt)
 9:       T ← T + countAffinePiece(P_pt, C)
10:     end
11:   end
12: end
13: return T

The method countAffinePiece counts the cache misses of the piece P with affine stack distance polynomial. A polynomial is affine if its degree is zero or one. We first compute the cache miss set

M = {S0(i_1, ..., i_n) : P_p(i_1, ..., i_n) > C ∧ (i_1, ..., i_n) ∈ P_D}

where P_p denotes the polynomial and P_D the domain of the piece P. The cache miss set contains all memory accesses with stack distance larger than the cache size C. To count the cache misses, we compute the cardinality |M| using the Barvinok algorithm.

The method getNonAffineDomain projects all points of the piece P to the non-affine dimensions to obtain the enumeration domain E. For example, Figure 6 projects the piece

P = {S0(i, j) → i + j² : 0 ≤ i < 3 ∧ 0 ≤ j < 3}

which contains the quadratic term j². We project the points to the non-affine j-dimension to compute the enumeration domain E = {j : 0 ≤ j < 3}. The enumeration always spans all dimensions with degree larger than one. But the polynomial may also contain product terms with multiple dimensions.
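The hybrid of symbolic counting and partial enumeration in Algorithm 1 can be sketched on the non-affine piece of Figure 6. This is our own illustration: countAffinePiece is stood in for by a closed-form interval count over the single affine dimension, whereas the actual implementation counts with the Barvinok algorithm.

```python
def count_affine_1d(a, b, lo, hi, C):
    """Count the integers i in [lo, hi) with a*i + b > C, for a > 0 (closed form)."""
    # a*i + b > C  <=>  i > (C - b) / a  <=>  i >= floor((C - b) / a) + 1
    first = max(lo, (C - b) // a + 1)
    return max(0, hi - first)

def count_misses_partial_enumeration(C):
    """Capacity misses of the piece P = {S0(i,j) -> i + j^2 : 0 <= i,j < 3}."""
    total = 0
    for j in range(3):                       # enumerate the non-affine j-dimension
        # binding j yields the affine piece P_j = {S0(i) -> i + j^2 : 0 <= i < 3}
        total += count_affine_1d(1, j * j, 0, 3, C)
    return total

# Cross-check the hybrid count against full enumeration of the domain.
for C in range(8):
    brute = sum(1 for i in range(3) for j in range(3) if i + j * j > C)
    assert count_misses_partial_enumeration(C) == brute
```

Only the quadratic j-dimension is enumerated; the cost is therefore proportional to the extent of the non-affine dimensions rather than to the full domain size.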
We then greedily select the dimensions that conflict with most other dimensions. For example, if the polynomial contains the products ij and ik, we enumerate the i-dimension since it conflicts with both other dimensions.

The method bindNonAffineDimensions binds the non-affine dimensions of the piece P to the values of the point pt. For example, Figure 6 binds the j-dimension of the piece

P = {S0(i, j) → i + j² : 0 ≤ i < 3 ∧ 0 ≤ j < 3}

to the value two and obtains the piece

P_{j=2} = {S0(i) → i + 4 : 0 ≤ i < 3}

which we can count with the method countAffinePiece.

Figure 7.
Equalization replaces the non-affine piece P = {S0(i, j) → (⌊(1 + i)/3⌋ − ⌊i/3⌋) · j : i = [0..2] ∧ j = [0..1]} with the affine pieces P_{i%3<2} and P_{i%3=2} to model a stack distance that varies at the last cache line offset.

Figure 8.
Rasterization replaces the non-affine piece P = {S0(i, j) → (i − 3⌊i/3⌋) · j : i = [0..2] ∧ j = [0..1]} with the affine pieces P_{i%3=0}, P_{i%3=1}, and P_{i%3=2} to model a stack distance that varies at every cache line offset.

3.3 Eliminating Non-Affine Terms

Many stack distance polynomials contain non-affine terms that prevent fast symbolic counting. We develop rewrite strategies that eliminate non-affine terms containing floor expressions. The floor expressions themselves are quasi-affine but often appear in products with other non-constant operands, modeling effects such as the stack distance variation for different cache line offsets. We specialize the stack distance polynomials for different cache line offsets to make them affine, which enables the efficient symbolic counting.

The floor expressions of some polynomials differ only by a constant offset. For example, the piece

P = {S0(i, j) → (⌊(1 + i)/3⌋ − ⌊i/3⌋) · j : 0 ≤ i < 3 ∧ 0 ≤ j < 2}

contains the floor expressions ⌊(1 + i)/3⌋ and ⌊i/3⌋. The two floor expressions are equal except if i modulo three is equal to two; then ⌊(1 + i)/3⌋ is larger by one. The difference of the two floor expressions thus evaluates to zero for the first two elements and to one for the last element of every cache line. Figure 7 shows how to introduce simplified polynomials for the first two and the last element of every cache line. This equalization technique splits the cache line into multiple regions that typically contain more than one element.

The polynomials may also contain terms with the plain variable and other terms which compute the floor of the variable. For example, the piece

P = {S0(i, j) → (i − 3⌊i/3⌋) · j : 0 ≤ i < 3 ∧ 0 ≤ j < 2}

contains the term 3⌊i/3⌋ which is equal to i except for a constant that depends on the cache line offset. Figure 8 shows how to replace the polynomial with one simplified polynomial per cache line offset.
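Both rewrites rest on floor identities that can be checked exhaustively. The following self-contained check (our own illustration, using Python's floor division for ⌊·⌋) verifies the equalization and rasterization examples of Figures 7 and 8.

```python
# Equalization (Figure 7): the difference of the two floor expressions is the
# indicator of the last cache line offset.
for i in range(30):
    delta = (1 + i) // 3 - i // 3
    assert delta == (1 if i % 3 == 2 else 0)

# Rasterization (Figure 8): i - 3*floor(i/3) equals the cache line offset
# i mod 3, which is a constant once the offset is fixed.
for i in range(30):
    assert i - 3 * (i // 3) == i % 3

# Consequently, binding the cache line offset makes the example polynomials
# affine in j: (floor((1+i)/3) - floor(i/3)) * j is 0 if i%3 < 2 and j otherwise.
for i in range(6):
    for j in range(2):
        poly = ((1 + i) // 3 - i // 3) * j
        assert poly == (j if i % 3 == 2 else 0)
```

Within each specialized piece the floor terms collapse to constants, so the remaining polynomial has degree at most one and can be counted symbolically.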
This rasterization technique enumerates all cache line offsets. We apply the two floor elimination techniques in the order of presentation and only keep the results if the degree of at least one simplified polynomial is lower than the degree of the original polynomial.

3.4 Counting the Compulsory Misses

All memory accesses that touch a cache line for the first time are compulsory misses. As the array M of our example program is initialized by the statement S0, the first map

F = {M(i) → S0(i) : 0 ≤ i < 4}

relates every array element to the statement instance that accesses the element first, which means the cardinality |F_dom| of the first map domain counts the compulsory misses.

The compulsory misses are the memory accesses with lexicographically minimal schedule value. The first map

F = S^-1 ◦ lexmin(S ◦ A^-1)

thus selects for every memory access the lexicographically minimal relation of the composition S ◦ A^-1 that relates memory accesses to schedule values and composes the result with the inverse schedule S^-1 to obtain the related statement instances. The composition with the inverse schedule allows us to intersect the range of the first map with the iteration domain of the individual statements to count the compulsory misses per statement. For our example program, the composition

S ◦ A^-1 = {M(i) → (0, i) : 0 ≤ i < 4; M(j) → (1, 3 − j) : 0 ≤ j < 4}

contains two accesses for every array element. The lexmin operator removes the second access due to the lexicographically larger schedule value. After the composition with the inverse schedule S^-1, we use the Barvinok algorithm to count the compulsory misses |F_dom|.
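For the finite example program, the first map can be reproduced by keeping, per array element, the access with the lexicographically minimal schedule value, a brute-force analogue of F = S⁻¹ ◦ lexmin(S ◦ A⁻¹) for illustration only.

```python
# Schedule values and accessed elements of the example program (Figure 2).
accesses = [((0, i), ('M', i)) for i in range(4)] + \
           [((1, j), ('M', 3 - j)) for j in range(4)]

# lexmin: for every array element keep the lexicographically smallest
# schedule value; Python tuples already compare lexicographically.
first_access = {}
for sched, elem in sorted(accesses):
    first_access.setdefault(elem, sched)

# Every element is touched first by statement S0, i.e. F = {M(i) -> S0(i)}.
assert first_access == {('M', i): (0, i) for i in range(4)}

# The domain cardinality |F_dom| equals the number of compulsory misses.
assert len(first_access) == 4
```

The symbolic model performs the same selection with the lexmin operator on parametric maps instead of sorting an enumerated trace.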
Figure 9. Cache misses and hits predicted by HayStack compared to the measured cache misses (median of 10 measurements) for the PolyBench kernels, with the prediction error relative to the number of memory accesses on top: (a) L1 cache, geometric mean error 0.6%; (b) L2 cache, geometric mean error 0.2%.

Figure 10. Cache misses and hits simulated by Dinero IV compared to the measured cache misses (median of 10 measurements) for the PolyBench kernels, with the prediction error relative to the number of memory accesses on top: (a) L1 cache (fully associative), geometric mean error 0.6%; (b) L1 cache (8-way associative), geometric mean error 0.4%.
All compute-heavy parts of our cache model perform Presburger arithmetic, which in general is known to have very high computational complexity [21, 30]. The established complexity bounds range from polynomial time decidable [26], for expressions with fixed dimensionality and only existential quantification, to double exponential [19] for arbitrary expressions. Haase [21] presents further results that show a complexity increase with the dimensionality and the number of quantifier alternations of the Presburger expression.

The Presburger relations computed by our cache model have only existential quantification, and their dimensionality is limited by the loop depth, suggesting polynomial complexity. Yet, the cache model may introduce further variables to model divisions or modulo operations, making the complexity exponential in the number of dimensions.

Although the cache model has exponential worst-case complexity, the empirical performance evaluation presented in Section 4.3 shows that our cache model performs well for typical input programs. The dimensionality of the observed Presburger relations remains limited since most real-world programs do not make extensive use of branch conditions and index expressions that result in integer divisions or modulo operations.
We next evaluate the performance of HayStack and compare its accuracy to simulated and measured results.
We evaluate on a test system with two 18-core Intel Xeon Gold 6150 processors. Every core has a 32KiB L1 cache (8-way set associative) and an inclusive 1MiB L2 cache (16-way set associative). The non-inclusive 18x1.375MiB L3 cache (11-way set associative) is shared among all cores. A non-inclusive cache may, and an inclusive cache has to, duplicate all cache lines stored by the lower-level caches. All caches load the cache line before writing (write-allocate) and forward the write only when the cache line is evicted (write-back). We compile with GCC 6.3 and use the Dinero IV cache simulator [17] to compute and the PAPI-C library [34] to measure the number of cache misses. We evaluate the model for a number of different kernels. PolyBench 4.2.1-beta [32] is a collection of static control programs that implement algorithmic motifs from scientific computing. If not stated otherwise, the PolyBench experiments use the default configuration (large) and the model emulates fully associative L1 and L2 caches with the capacities of the test system. All performance measurements run single-threaded using only one core of the test system. To quantify measurement noise, the execution times show the median and the non-parametric 95% confidence intervals [24] of 10 measurements.

All mathematical models are a trade-off between accuracy and complexity. A static cache model cannot predict dynamic measurement noise, for example due to concurrent code execution. We aim at an accurate prediction of the cache misses without modeling too many implementation details. A comparison to measurements on a real system is the main benchmark for every cache model. To measure the cache misses, we compile the PolyBench [32] kernels with PAPI [34] support using GCC optimization level O2. PolyBench [32] flushes the caches before every kernel execution, which allows us to measure compulsory and capacity misses. We collect the counters
PAPI_L1_DCM and PAPI_L2_DCM, which sum the data cache misses for the L1 and L2 caches, respectively.

Figure 9 compares the sum of the compulsory and capacity misses predicted by HayStack to the measured cache misses shown by black lines. Most kernels cause more cache misses than predicted, which is expected since we model idealized fully associative caches with LRU instead of a pseudo-LRU replacement policy. We also do not consider possible overfetch due to the hardware prefetchers. To quantify the error, Figure 9 shows for every kernel the prediction error relative to the total number of memory accesses computed by the model. Most kernels have low single-digit prediction errors, with a geometric mean error of 0.6% for the L1 cache and 0.2% for the L2 cache. Only doitgen and gramschmidt have prediction errors above 10%.

We also execute the PolyBench kernels with Dinero IV [17] to simulate the number of cache misses with full associativity and with the associativity of our test system. Figure 10 compares the sum of the simulated compulsory, capacity, and conflict misses to the measured cache misses shown by black lines. We observe that the simulation results for the fully associative L1 cache qualitatively agree with the model. All simulation results are within 0.1% of the model for the L1 cache and within 3% of the model for the L2 cache (relative to the total number of memory accesses). We conclude that our design decisions of padding the innermost dimension of multi-dimensional arrays, discussed in Section 3.1, and of modeling only array accesses and not scalar accesses, discussed in Section 2.2, have no significant impact on the accuracy of the model. The simulation results with test-system associativity eliminate the error for the doitgen kernel. We conclude that modeling set associativity is only relevant for one of the PolyBench kernels. The error of the remaining kernels is dominated by other error sources, such as the difference between LRU and pseudo-LRU replacement policy, which are considered neither by the simulator nor by the model. HayStack reproduces the simulation results for full associativity, and the associativity mismatch compared to the test system does not dominate the modeling error.

Figure 11. Execution times for the main components of HayStack compared to the number of separately counted pieces for the PolyBench kernels, sorted by execution time.

Figure 12. Execution times for the extra large (XL), large (L), and medium (M) problem sizes of PolyBench compared to the number of counted pieces.
We next analyze the performance of HayStack and its sensitivity to model parameters such as the problem size or the number of cache hierarchy levels.

Two components dominate the model execution time: 1) the stack distance computation discussed in Section 3.1 and 2) the capacity miss counting discussed in Section 3.2. Figure 11 shows the cost of the two components compared to the total model execution times for the PolyBench kernels. The analysis of most kernels terminates within 5 seconds (jacobi-1d to heat-3d), while the more expensive kernels take up to 20 seconds (adi to cholesky). The capacity miss counting dominates the cost of the expensive kernels. When counting the capacity misses, the partial enumeration and, to a lesser extent, the equalization and rasterization, discussed in Section 3.3, split the iteration domain into pieces with affine stack distance polynomials that support symbolic counting. The solid line in Figure 11 shows the number of counted pieces. We observe that the expensive kernels require more splits due to non-affine stack distance polynomials and that the counting costs correlate with the number of pieces.

Unlike for a cache simulator, the model execution time is not proportional to the number of memory accesses. Figure 12 shows the model execution times for the three largest PolyBench problem sizes. The large (L) and the extra large (XL) problem sizes perform roughly 100 and 1000 times more memory accesses than the medium (M) problem size, respectively. Yet, the execution times remain constant for a majority of the kernels. Only the execution times of the expensive kernels increase since the partial enumeration requires more splits. The number of counted pieces, shown by the solid, dashed, and dotted lines in Figure 12, correlates with the cost increase for the larger problem sizes. Even for the expensive kernels, the increase of the execution time is not proportional to the number of memory accesses since we enumerate only the non-affine dimensions of the stack distance polynomials.

Figure 13. Comparison of the execution times when modeling one, two, or three cache hierarchy levels.

When counting the cache misses for multiple cache hierarchy levels, we reuse the stack distance polynomials and enumerate the non-affine dimensions only once. The counting of the individual pieces is the only step repeated for every cache size.
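Assuming fully associative LRU caches, this reuse can be sketched in a few lines of Python (our illustration; the distances and capacities below are made up, and the real model counts symbolically instead of iterating over accesses):

```python
def misses(stack_distances, cache_lines):
    # compulsory misses have infinite stack distance; an access misses a
    # fully associative LRU cache iff its stack distance reaches the capacity
    return sum(1 for d in stack_distances if d >= cache_lines)

INF = float("inf")
dists = [INF, INF, 1, 5, 2, INF, 9]  # computed once per program
l1_misses = misses(dists, 4)         # one counting pass per cache level
l2_misses = misses(dists, 8)         # reuses the same stack distances
```

The expensive step, deriving the stack distances, runs once; each additional cache level only adds one cheap counting pass.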
As the Barvinok algorithm [40] supports parametric counting, we can count the capacity misses parametrically in the cache size, which avoids any additional overhead when modeling additional cache hierarchy levels. We benchmark the non-parametric version of the code as it runs faster even when modeling three cache hierarchy levels. Figure 13 shows minor increases of the total execution time for two and three cache hierarchy levels.

Figure 14.
Speedup due to equalization, rasterization, and partial enumeration. All kernels without speedup (gray bars) are not included in the geometric mean. Only a few kernels run fast without any optimization (gray labels).

Table 1.
Number of non-affine polynomials with zero, one, or two affine dimensions.

The partial enumeration, discussed in Section 3.2, combines enumeration of the non-affine dimensions with symbolic counting of the affine dimensions. Figure 14 compares partial enumeration to the explicit enumeration of all points. When considering only kernels with non-affine stack distance polynomials, we measure a geometric mean speedup of 12.4x with pieces that contain 4,400 points on average. The more points per piece, the bigger the efficiency gain due to our hybrid counting approach. We still require explicit enumeration for all non-affine polynomials without an affine dimension. Table 1 shows that most non-affine polynomials have at least one affine dimension. For these polynomials, partial enumeration reduces the asymptotic complexity of the capacity miss counting.
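The hybrid counting idea can be sketched with a toy polynomial (ours; the polynomial, bounds, and cache size are invented). The non-affine dimension i is enumerated while the affine dimension j is counted in closed form, which matches explicit enumeration:

```python
# toy sizes and cache capacity (made up for illustration)
N, M, C = 300, 1000, 512

def d(i, j):
    # toy stack distance polynomial: non-affine in i (it depends on the
    # cache line offset i % 3) but affine in j for every fixed i
    return (i % 3) * j + i

def count_full():
    # explicit enumeration of all points (what partial enumeration avoids)
    return sum(1 for i in range(N) for j in range(M) if d(i, j) >= C)

def count_partial():
    # enumerate the non-affine dimension i, count the affine dimension j
    # in closed form: solve a*j + b >= C over 0 <= j < M
    total = 0
    for i in range(N):
        a, b = i % 3, i
        if a == 0:
            total += M if b >= C else 0
        else:
            lo = max(0, -((b - C) // a))  # ceil((C - b) / a), integers only
            total += max(0, M - lo)
    return total

assert count_full() == count_partial()
```

The partial variant touches N points instead of N·M, which is the asymptotic saving the text describes for polynomials with at least one affine dimension.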
Figure 15.
Speedup of HayStack compared to PolyCache and Dinero IV for the PolyBench 3.2 and 4.2.1 kernels, respectively.

As discussed in Section 3.3, the floor elimination techniques simplify non-affine stack distance polynomials with fewer splits than partial enumeration but are less generic and do not apply to all polynomials. Figure 14 shows the speedups for equalization compared to a baseline without equalization and rasterization. We disable both techniques since otherwise rasterization optimizes the polynomials normally handled by equalization. We observe a geometric mean speedup of 1.9x for the kernels that benefit. Figure 14 also compares the speedups for rasterization to a baseline without rasterization. We measure a geometric mean speedup of 1.9x for cholesky, lu, ludcmp, nussinov, and seidel-2d. Overall, the floor elimination techniques reduce the number of counted pieces by more than 80%, which results in bigger pieces with better counting performance.

A majority of the kernels perform well independent of problem size and number of cache hierarchy levels. Yet, the model execution times for kernels with non-affine polynomials are higher and problem size dependent. We mitigate this with efficient enumeration and floor elimination techniques.
The polyhedral cache model PolyCache [2] and the cache simulator Dinero IV [17] are alternative cache modeling tools. We compare their performance to HayStack.

PolyCache models set associative caches with an LRU replacement policy. We compare to the published results that show the performance for the default problem size of PolyBench 3.2 and adapt the configuration of our model to match the cache sizes of the published experiments (32KiB of L1 cache and 256KiB of L2 cache). The only difference is that we model fully associative caches instead of 4-way associative caches. Figure 15a shows an average speedup of 21x (geometric mean) of HayStack compared to PolyCache, even though PolyCache computes the cache misses for all 1024 cache sets in parallel.

Dinero IV is a trace-driven cache simulator, which means the expected simulation costs are proportional to the number of memory accesses (Figure 1). Figure 15b shows the speedup of HayStack compared to the Dinero IV simulation times that include the trace generation with QEMU [3]. Dinero IV simulates the associativity of our test system while we model fully associative caches. As simulation and model run single-core, the execution times are comparable. We measure an average speedup of 370x (geometric mean) for the large problem size, which would be even bigger for the extra large problem size. Simulating full associativity further increases the average simulation time by a factor of 2.2x (geometric mean).

PolyCache models cache behavior in depth, which allows developers to analyze the effects of set associativity and different write policies, but its high accuracy can make it costly to compute. Dinero IV works for small problem sizes, but the cost increase for realistic problem sizes is dramatic.

Figure 16. Execution times for the main components of HayStack for tiled versions of the PolyBench kernels. A few kernels (gray labels) have no rectangular tiling.
A tiled code decomposes the iteration domain into tiles and executes tile-by-tile to improve the spatial locality. Tiling can double the loop nest depth, which allows us to evaluate our approach for more complex codes. At the same time, estimating the benefits of tiling or even selecting optimal tile sizes is an important application for a cache model.

We employ the PPCG [39] source-to-source compiler to tile all PolyBench kernels with tile size 16. We limit the sum of all scheduling coefficients to one and disable loop fusion to obtain a rectangular tiling without loop skewing (time-tiling). All kernels except for jacobi-1d, durbin, seidel-2d, and nussinov have a rectangular tiling. Figure 16 shows the model execution times for the tiled kernels. Tiling makes the cache miss computation more expensive. Especially the stack distance computation of the heat-3d kernel runs long. We attribute the cost increase to the more complex iteration domains and memory access patterns. Tiling increases the model execution times, but for a majority of the kernels the cache miss computation still takes only a few seconds.

Cache behavior analysis is a prerequisite when tuning for the memory hierarchy. We distinguish three main approaches: 1) simulation, 2) profiling, and 3) analytical modeling.
Simulators
Dinero [17] and CASPER [25] are examples of trace-based cache simulators that compute the cache misses for the full memory hierarchy. Sniper [10] and gem5 [7] have a broader scope and simulate the full system including the caches. All simulators execute the program to count the cache misses, which means the simulation costs are proportional to the number of executed memory accesses.
Profiling
Multiple works discuss the analysis of memory access traces to extract locality metrics. Mattson et al. [29] compute the stack distance using a linked list and derive the cache hit rate for different cache sizes. Tree-based implementations [4, 31, 33] reduce the cost of the stack distance computation. Kim et al. [28] apply hashing and approximation to increase the efficiency. Ding et al. [15] discuss tree-based approximate algorithms that reduce the time and space complexity of the stack distance computation and predict the stack distance histogram for arbitrary problem sizes given training inputs for a few different problem sizes. Eklov et al. [16] sample the reuse distance for a few memory accesses and employ statistics to estimate stack distances and the cache miss ratio. Xiang et al. [43] discuss five different locality metrics and show how to derive miss rate and reuse distance given a single measure called average footprint, which they compute with an efficient linear-time algorithm [42]. A disadvantage of the profiling approaches is the acquisition and the handling of the large program traces. Chen et al. [14] sample the reuse time during compilation, which allows them to estimate the cache miss ratio of complex loop nests.
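As an illustration of the linked-list formulation of Mattson et al. [29], the following Python sketch (ours, using a plain list rather than an efficient tree) computes the stack distance of every access and derives the hit count for any fully associative LRU cache size from a single pass over the trace:

```python
def stack_distances(trace):
    # Mattson's algorithm with a list as the LRU stack: the stack distance
    # of an access is the depth of its line on the stack (inf on first touch)
    stack, dists = [], []
    for line in trace:
        if line in stack:
            depth = stack.index(line)
            dists.append(depth)
            del stack[depth]
        else:
            dists.append(float("inf"))  # compulsory miss
        stack.insert(0, line)           # most recently used on top
    return dists

def hits(trace, cache_size):
    # a fully associative LRU cache of the given size hits exactly the
    # accesses whose stack distance is smaller than the cache size
    return sum(1 for d in stack_distances(trace) if d < cache_size)
```

For example, the trace [0, 1, 2, 0, 3, 0, 1] yields the distances [inf, inf, inf, 2, inf, 1, 3], so a 2-line cache scores 1 hit and a 4-line cache scores 3; one distance histogram answers the question for every cache size at once.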
Analytical models
Agarwal et al. [1] develop an analytical model that uses parameters extracted from the program trace. Harper et al. [22] model set associative caches for regular loop nests. Cost models [8, 11, 27] allow compilers to decide if data-locality transformations are beneficial. All of these models only approximate the number of cache misses. Ferdinand et al. [18] use abstract interpretation to model set associative LRU caches. Model checking [13, 35] increases the accuracy of this analysis, which distinguishes always hit, always miss, and not classified. Touzeau et al. [36] show how to attain high accuracy without costly model checking. The abstract interpretation approaches are complementary to our cache model since they support dynamic control flow but approximate the cache misses of loop nests by classifying all instances of a memory access at once.

Ghosh et al. [20] derive cache miss equations to count the cache misses for perfect loop nests with data dependencies represented by reuse vectors [41]. Assuming an LRU replacement policy, a cache miss occurs if the number of solutions to a cache miss equality exceeds the cache associativity. Counting the solutions for every point of the iteration domain is expensive. Vera and Xue [37, 44] thus sample the iteration domain to speed up the cache miss computation, which allows them to perform approximate whole-program analysis. Cascaval et al. [9] compute the stack distance histogram symbolically for perfect loop nests with uniform data dependencies. They model fully associative caches with an LRU replacement policy and use statistics to model set associative caches. Chatterjee et al. [12] use Presburger formulas to express the set of compulsory and capacity misses of imperfect loop nests for associative caches. At the time, their approach was limited to small problem sizes and low associativity since the computation of analytical results for realistic hardware and even small benchmark kernels was prohibitively complex. While Beyls et al.
[6] did not address the cache miss problem, they use analytically computed stack distances to generate cache hints at runtime. Their stack distance computation, extended by our cache miss counting technique for non-affine polynomials, is the foundation of our cache model. PolyCache [2] presented the first analytical approach fast enough to compute the cache behavior of static control programs for interesting benchmark kernels and realistic hardware parameters. Its analytical model relates, for every cache set, successive accesses of distinct cache lines and repeatedly removes the shortest relations to model set associativity with an LRU replacement policy. While PolyCache also uses symbolic counting techniques to avoid a complete enumeration of the computation, its complexity increases with high associativity. Our work provides a fast analytical model for fully associative caches and shows that fully associative models introduce only small errors compared to measurements on actual hardware.
As memory behavior depends on the cache state, understanding the cost of memory accesses is much more difficult than understanding the cost of arithmetic instructions. With HayStack, we close this gap by providing developers with accurate information about the interaction of memory accesses with the large and deep cache hierarchies of modern processors. HayStack allows the programmer to predict memory access costs accurately and to develop programs well optimized for the memory hierarchy. When striving for ultimate performance, both a good baseline and an accurate surrogate model accelerate empirical tuning. As a result, cache-aware program optimization becomes accessible.

Responsiveness is key for the adoption of any cache model. We demonstrate excellent, often problem size independent response times that for the first time make analytical cache modeling practical. In addition, the cache size independent costs allow our model to easily scale to future hardware. We show the practicality of our deliberate decision against high fidelity and in favor of a generic fully associative cache model. The proposed model is robust to memory layout choices and hardware implementation details and yet reaches very high accuracy on real hardware across a wide range of computations.

Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 programme (grant agreement DAPP, No. 678880), the Swiss National Science Foundation under the Ambizione programme (grant PZ00P2168016), and ARM Holdings plc and Xilinx Inc in the context of Polly Labs. We would also like to thank the Swiss National Supercomputing Center for providing the computing resources.
References

[1] Anant Agarwal, John Hennessy, and Mark Horowitz. 1989. An analytical cache model. ACM Transactions on Computer Systems (TOCS) 7, 2 (1989), 184–215.
[2] Wenlei Bao, Sriram Krishnamoorthy, Louis-Noël Pouchet, and P. Sadayappan. 2017. Analytical modeling of cache behavior for affine programs. Proceedings of the ACM on Programming Languages 2, POPL (2017), 32.
[3] Fabrice Bellard. 2005. QEMU, a fast and portable dynamic translator. In USENIX Annual Technical Conference, FREENIX Track, Vol. 41. 46.
[4] Bryan T. Bennett and Vincent J. Kruskal. 1975. LRU stack processing. IBM Journal of Research and Development 19, 4 (1975), 353–357.
[5] Kristof Beyls and Erik H. D'Hollander. 2001. Reuse Distance as a Metric for Cache Behavior. In Proceedings of the IASTED Conference on Parallel and Distributed Computing and Systems. 617–662.
[6] Kristof Beyls and Erik H. D'Hollander. 2005. Generating cache hints for improved program efficiency. Journal of Systems Architecture 51, 4 (2005), 223–250.
[7] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, et al. 2011. The gem5 simulator. ACM SIGARCH Computer Architecture News 39, 2 (2011), 1–7.
[8] Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In ACM SIGPLAN Notices, Vol. 43. ACM, 101–113.
[9] Calin Cascaval and David A. Padua. 2003. Estimating cache misses and locality using stack distances. In Proceedings of the 17th Annual International Conference on Supercomputing. ACM, 150–159.
[10] Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. 2011. Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 52.
[11] Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. 1994. Compiler optimizations for improving data locality. Vol. 29. ACM.
[12] Siddhartha Chatterjee, Erin Parker, Philip J. Hanlon, and Alvin R. Lebeck. 2001. Exact analysis of the cache behavior of nested loops. ACM SIGPLAN Notices 36, 5 (2001), 286–297.
[13] Sudipta Chattopadhyay and Abhik Roychoudhury. 2013. Scalable and precise refinement of cache timing analysis via path-sensitive verification. Real-Time Systems 49, 4 (July 2013), 517–562.
[14] Dong Chen, Fangzhou Liu, Chen Ding, and Sreepathi Pai. 2018. Locality analysis through static parallel sampling. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 557–570.
[15] Chen Ding and Yutao Zhong. 2003. Predicting whole-program locality through reuse distance analysis. In ACM SIGPLAN Notices, Vol. 38. ACM, 245–257.
[16] David Eklov and Erik Hagersten. 2010. StatStack: Efficient modeling of LRU caches. In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on. IEEE, 55–65.
[17] Jan Edler and Mark D. Hill. 2003. Dinero IV Trace-Driven Uniprocessor Cache Simulator.
[18] Christian Ferdinand, Florian Martin, Reinhard Wilhelm, and Martin Alt. 1999. Cache Behavior Prediction by Abstract Interpretation. Sci. Comput. Program. 35, 2-3 (Nov. 1999), 163–189.
[19] M. J. Fischer and M. O. Rabin. 1974. Super-Exponential Complexity of Presburger Arithmetic. Technical Report. Cambridge, MA, USA.
[20] Somnath Ghosh, Margaret Martonosi, and Sharad Malik. 1999. Cache miss equations: a compiler framework for analyzing and tuning memory behavior. ACM Transactions on Programming Languages and Systems (TOPLAS) 21, 4 (1999), 703–746.
[21] Christoph Haase. 2018. A Survival Guide to Presburger Arithmetic. ACM SIGLOG News 5, 3 (July 2018), 67–82.
[22] John S. Harper, Darren J. Kerbyson, and Graham R. Nudd. 1999. Analytical modeling of set-associative cache behavior. IEEE Trans. Comput. 48, 10 (1999), 1009–1024.
[23] Mark D. Hill. 1987. Aspects of cache memory and instruction buffer performance. Technical Report. University of California, Berkeley, Department of Electrical Engineering and Computer Sciences.
[24] Torsten Hoefler and Roberto Belli. 2015. Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 73.
[25] Ravi Iyer. 2003. On modeling and analyzing cache hierarchies using CASPER. In Proceedings of MASCOTS 2003. IEEE, 182.
[26] H. W. Lenstra, Jr. 1983. Integer Programming with a Fixed Number of Variables. Report 81-03, Mathematisch Instituut Amsterdam (1981).
[27] Ken Kennedy and Kathryn S. McKinley. 1992. Optimizing for parallelism and data locality. In Proceedings of the 6th International Conference on Supercomputing. ACM, 323–334.
[28] Yul H. Kim, Mark D. Hill, and David A. Wood. 1991. Implementing stack simulation for highly-associative memories. Vol. 19. ACM.
[29] Richard L. Mattson, Jan Gecsei, Donald R. Slutz, and Irving L. Traiger. 1970. Evaluation techniques for storage hierarchies. IBM Systems Journal 9, 2 (1970), 78–117.
[30] Danh Nguyen Luu. 2018. The Computational Complexity of Presburger Arithmetic. Ph.D. Dissertation. UCLA.
[31] Frank Olken. 1981. Efficient methods for calculating the success function of fixed space replacement policies. (1981).
[32] Louis-Noël Pouchet. 2012. PolyBench: The polyhedral benchmark suite. URL: https://sourceforge.net/projects/polybench/ (2012).
[33] Rabin A. Sugumar and Santosh G. Abraham. 1993. Efficient simulation of caches under optimal replacement with applications to miss characterization. Vol. 21. ACM.
[34] Dan Terpstra, Heike Jagode, Haihang You, and Jack Dongarra. 2010. Collecting performance data with PAPI-C. In Tools for High Performance Computing 2009. Springer, 157–173.
[35] Valentin Touzeau, Claire Maïza, David Monniaux, and Jan Reineke. 2017. Ascertaining Uncertainty for Efficient Exact Cache Analysis. CoRR abs/1709.10008 (2017).
[36] Valentin Touzeau, Claire Maïza, David Monniaux, and Jan Reineke. 2019. Fast and Exact Analysis for LRU Caches. Proc. ACM Program. Lang. 3, POPL, Article 54 (Jan. 2019), 29 pages.
[37] Xavier Vera and Jingling Xue. 2002. Let's study whole-program cache behaviour analytically. In High-Performance Computer Architecture, 2002. Proceedings. Eighth International Symposium on. IEEE, 175–186.
[38] Sven Verdoolaege. 2010. isl: An integer set library for the polyhedral model. In International Congress on Mathematical Software. Springer, 299–302.
[39] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization (TACO) 9, 4 (2013), 54.
[40] Sven Verdoolaege, Rachid Seghir, Kristof Beyls, Vincent Loechner, and Maurice Bruynooghe. 2007. Counting integer points in parametric polytopes using Barvinok's rational functions. Algorithmica 48, 1 (2007), 37–66.
[41] Michael E. Wolf and Monica S. Lam. 1991. A data locality optimizing algorithm. In ACM SIGPLAN Notices, Vol. 26. ACM, 30–44.
[42] Xiaoya Xiang, Bin Bao, Chen Ding, and Yaoqing Gao. 2011. Linear-time modeling of program working set in shared cache. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on. IEEE, 350–360.
[43] Xiaoya Xiang, Chen Ding, Hao Luo, and Bin Bao. 2013. HOTL: a higher order theory of locality. In ACM SIGARCH Computer Architecture News, Vol. 41. ACM, 343–356.
[44] Jingling Xue and Xavier Vera. 2004. Efficient and accurate analytical modeling of whole-program data cache behavior. IEEE Trans. Comput. 53, 5 (2004), 547–566.