AdaptMemBench: Application-Specific Memory Subsystem Benchmarking
Mahesh Lakshminarasimhan
Department of Computer Science, Boise State University
Boise, Idaho, [email protected]
Catherine Olschanowsky
Department of Computer Science, Boise State University
Boise, Idaho, [email protected]
Abstract—Optimizing scientific applications to take full advantage of modern memory subsystems is a continual challenge for application and compiler developers. Factors beyond working set size affect performance. A benchmark framework that explores performance in an application-specific manner is essential to characterize memory performance and, at the same time, inform memory-efficient coding practices. We present AdaptMemBench, a configurable benchmark framework that measures achieved memory performance by emulating application-specific access patterns with a set of kernel-independent driver templates. This framework can explore the performance characteristics of a wide range of access patterns and, thanks to the flexibility of polyhedral code generation, can be used as a testbed for potential optimizations. We demonstrate the effectiveness of AdaptMemBench with case studies on commonly used computational kernels such as triad and multidimensional stencil patterns.
Index Terms—Benchmarking, Memory Performance, Code Generation, Stencil Computations, Tiling Optimizations.
I. INTRODUCTION
Scientific application performance is a function of memory bandwidth, instruction mix and order, memory footprint, and memory access patterns. The contribution of each is often not clear, and interdependencies exist between the variables. This complexity, combined with the difficulty of instrumenting large applications, makes efficient optimization of these applications difficult. AdaptMemBench provides a framework for application developers and optimization experts to isolate portions of their application and measure execution characteristics. The framework provides a starting point to identify performance bottlenecks, identify potential optimizations, and explore the potential gains of those optimizations.

Application performance is often bottlenecked by interaction with the memory subsystem due to the memory wall [27]. Modern architectures combat this by using deep memory hierarchies and physically fragmented system memory. Reducing working set sizes is considered a good first step in optimization to take advantage of the caching capability of machines. However, optimizing is more complex than that, especially when dealing with shared memory parallelization. Memory access patterns, instruction mix, data sharing across caches, and vectorizability must all be considered in concert.

Selecting and applying optimizations remains a primary challenge during performance enhancement. Testing and understanding optimizations in situ when working with a large application can be cumbersome and error prone. Given the difficulty of manipulating access patterns in situ, fewer optimization strategies are attempted and potential performance improvements are overlooked. Additionally, performance tools, such as hardware counters, remain difficult to use in the context of a large application. The combination of these factors discourages effective optimization.

A framework that allows extracted code to be isolated and measured will benefit the optimization process for specific projects, and will improve the reliability and reproducibility of performance experiments in the compiler optimization and programming construct research communities. During the exploration and experimentation phase, many different variants of the same code are produced. Tracking the differences between variants and maintaining correct execution becomes time consuming and challenging. A shared framework that supports experimentation and tracks code versions while outputting metadata with measurements will ease this challenge.

We propose a tool to explore the design landscape of the target architecture. The
AdaptMemBench framework can be used to measure system performance and to guide application-specific optimization decisions. Expensive kernels extracted from larger applications can be manipulated in isolation to find the best optimization strategies. The framework reduces the amount of code that is transferred and provides mechanisms to experiment with data storage layout, execution order, and parallelization strategies.

AdaptMemBench provides several execution templates. The templates are combined with user-provided code segments. The templates provide a common command line interface, handle all timing and hardware counter code, and output metadata and measurements in a common format. The code segments provided by the user can be expressed as C code or by using the polyhedral model. The latter provides a convenient mechanism for optimization experiments.

Several benchmarks [2], [10], [13], [14], [18]–[20] exist that measure machine performance, with the benchmarking results conveying essential information about application performance on the memory hierarchy of the machine. Existing memory benchmarks [14], [18], [19] measure performance using a limited collection of streaming access patterns. However, benchmarking application-specific patterns, which tend to be more complex, remains a challenge. Current benchmarks [10], [15] are further constrained by the data sizes which can be executed, specifically in the higher levels of the memory subsystem.

Fig. 1. The proposed framework.

AdaptMemBench differs from previous efforts by incorporating polyhedral code generation. This creates a configurable benchmarking framework that measures achieved memory bandwidth while mimicking application-specific memory access patterns. The polyhedral model [25] simplifies writing the initial benchmark and provides a mechanism to automatically transform the code. Furthermore, our benchmark supports parallel applications and systems, and measures memory performance for data sizes across all levels of the memory hierarchy.

The primary contribution of this paper is a description and various demonstrations of the AdaptMemBench framework. Additionally, the framework was used to explore the performance of our university's HPC cluster. The contributions of this paper include the following:
• A configurable benchmarking framework for application-specific memory performance characterization.
• A detailed performance study on common computational kernels found in scientific applications, examining the impact of implicit locks, shared data spaces, and false sharing.
• An interleaved optimization strategy and its demonstrated effectiveness for the triad pattern.
• An evaluation of the efficacy of spatial tiling strategies for multidimensional Jacobi patterns using AdaptMemBench.

II. ADAPTMEMBENCH DESIGN
The AdaptMemBench framework separates the user interface, validation, and output of the benchmark from the code being measured and provides low-overhead access to PAPI. Figure 1 illustrates the building blocks of the framework. Each computational kernel of interest is coded in a pattern specification. If that pattern specification involves the polyhedral model, it is passed through a polyhedral compiler. The resulting (or original) C code is compiled together with one of several potential templates. The templates provide a uniform interface and handle code to vary the working set size to cover each portion of the memory hierarchy, along with timing, PAPI data collection, and output formatting. The use of the polyhedral model adds a great deal of flexibility in terms of exploring optimizations. The following subsection provides a brief overview of the polyhedral model. After the overview, the benchmark framework is described.
A. Polyhedral Code Generation
Polyhedral code generation enables loop constructs to be expressed and manipulated mathematically. The iteration sets can be expressed without ordering unless a specific ordering is required. Figure 2 shows a loop nest for solving the heat equation. The associated iteration space is shown graphically as a two-dimensional space (i, j). Each node in the graph represents an iteration. The Presburger formula for this example is shown at the bottom of the figure.

Fig. 2. An example of polyhedral code generation with ISCC/ISL.
Code generation is performed on sets through polyhedral scanning; the result is control flow that produces the iterations in lexicographical order. Generating code directly from the set in Figure 2 reproduces the original loop nest. Transformations on the code are realized through the application of relations (or functions). Loop interchange is a loop transformation that switches the order of two loops. Figure 3 shows the relation used to apply loop interchange to the code in Figure 2. For the relation from {i,j} to {j,i}, we apply the transformation to the defined execution domain using the intersection operator. More complex transformations, such as tiling, can be performed with ease using the polyhedral model.

Fig. 3. An illustration of loop interchange using ISCC.
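To make the effect of the transformation concrete, the following sketch shows the shape of the generated loops before and after interchange. N, M, and the statement S(i, j) are placeholders of our own; the actual statement in Figure 2 is the heat-equation update, which is not reproduced here.

    /* Original schedule from the set in Figure 2: i outer, j inner. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            S(i, j);

    /* After applying the relation {[i,j] -> [j,i]}: the same statement
       instances execute, but the generated loops are interchanged. */
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            S(i, j);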
The polyhedral model represents iteration spaces that are affine. A significant amount of work has been done to expand the iteration spaces and schedules that can be represented, including work that uses schedule trees for code generation within ISL [24]. The Omega+ code generation tool is also able to incorporate iteration bounds based on runtime information using uninterpreted functions [11]. Even with recent advances, the polyhedral model cannot express all C kernels and is, therefore, an optional step in the benchmark specification.

In the proposed benchmark, the ISCC [25] polyhedral code generation tool is used to automatically generate schedules for the application kernel initialization, execution, and validation. ISCC offers an interface to the functionality provided by the Integer Set Library (ISL) [24] and the barvinok library [23]. The tool enables the end user to manipulate sets and relations and to generate source code reflecting their input.

    ...
    //Execution
    for(int k = 0; k < ntimes; k++) {
    }
    ...

Listing 1. The inner-most section of the Unified Data Spaces Template.

    ...
    //Execution
    #pragma omp parallel
    {
        int t_id = omp_get_thread_num();
        for(int k = 0; k < ntimes; k++) {
        }
    }
    ...

Listing 2. The inner-most section of the Independent Data Spaces Template.
B. Benchmark Implementation
The proposed framework uses a set of generic benchmark driver templates for all variations of the access patterns. These driver templates provide a standard command line interface and a standard machine-parsable and human-readable output. Currently, the framework supports the following three varieties of benchmark driver templates for shared memory applications:
1) The Unified Data Spaces Template (Listing 1): The standard benchmarking template, which utilizes unified data spaces shared among threads. It uses the work sharing and scheduling constructs offered by OpenMP to distribute resources among threads. The OpenMP clauses can be easily configured using the framework.
2) The Independent Data Spaces Template (Listing 2): A modified version of the unified data spaces template. It supports distinct data spaces separated into different memory regions and accessed without any overlap, avoiding false sharing. As indicated by the experimental results that follow, benchmarking in this paradigm yields the best performance in the higher cache levels.
3) The PAPI Measurement Template: This template is built on top of the above two templates, using PAPI's low-level API. The user is given the option to choose between the two benchmarking paradigms above and to specify the PAPI events to be recorded. A sketch of this instrumentation is shown below.

Input pattern specifications consist of a header file and a set of ISCC input files. The initial step is to run the polyhedral code generator on the ISCC input files and transform them into corresponding C code files. The user-chosen driver template is then updated with the appropriate header and source files to create the customized .cpp benchmark driver. This benchmark driver is compiled and executed with runtime arguments such as working set size, thread count, and other parameters depending on the access pattern for which the benchmark is run.
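The following sketch illustrates the kind of wrapper the PAPI measurement template places around the timed kernel, using PAPI's low-level C API. The function name run_with_papi, its structure, and the specific preset events are our assumptions for illustration; preset availability (PAPI_L1_DCM, PAPI_CA_SHR) depends on the CPU.

    #include <papi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Minimal PAPI instrumentation around a timed region (sketch only). */
    void run_with_papi(void (*kernel)(void)) {
        int eventset = PAPI_NULL;
        long long counts[2];
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            exit(1);
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_L1_DCM);  /* L1 data cache misses */
        PAPI_add_event(eventset, PAPI_CA_SHR);  /* exclusive access to shared line */
        PAPI_start(eventset);
        kernel();                               /* the generated benchmark kernel */
        PAPI_stop(eventset, counts);
        printf("L1_DCM=%lld CA_SHR=%lld\n", counts[0], counts[1]);
    }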
Fig. 4. Implementation of the benchmark.
The purpose and functionality of each component in the pattern specifications shown in Figure 4 are described below:
1) Header file
2) Initialization steps
III. CASE STUDIES

The performance characteristics of a set of computational kernels commonly used in performance studies are presented in this section. The kernels are STREAM's triad and Jacobi 1D, 2D, and 3D. The kernels were chosen for their simplicity and well-understood performance behaviors. The use cases demonstrate the need to separate implementation concerns when studying the performance of even simple kernels. The structure provided by AdaptMemBench improves the breadth of data collected and makes experiment reliability and reproducibility more easily attained. For each kernel we explore the impact of implicit locks, shared data spaces, and false sharing in SMP systems.
Hardware: Experiments were run on one of the nodes of the R2 HPC cluster at Boise State University, which has a 2.40 GHz dual Intel Xeon E5-2680 v4 CPU. This node consists of two NUMA domains, each containing 14 cores. Each core has a dedicated 32 KB L1 data cache and a 256 KB L2 cache. The 35 MB L3 cache is shared among all the cores in each NUMA domain. The size of each cache line in this architecture is 64 bytes.

Compilers: GNU's gcc (version 6.3). When building C++ benchmark drivers, the -fopenmp and -O3 optimization flags were used. The -lpapi flag was set for PAPI-enabled benchmark drivers.

Profiling Tool: The benchmark drivers are instrumented with the Performance API (PAPI) [16] library to access performance counters across the CPUs evaluated. PAPI is used to measure cache hits and the requests for exclusive access to cache lines.

Problem size: We executed the benchmarks with problem sizes across all levels of cache, as well as sizes that exceed the last-level cache and fit into main memory. Each benchmark is executed for 1000 time iterations. The number of repetitions is configurable.
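Throughout the case studies we report achieved memory bandwidth. The framework's exact accounting is not reproduced here; the sketch below shows the conventional STREAM-style accounting we assume for a triad-like kernel (report_bandwidth and the timed_region callback are hypothetical names for illustration).

    #include <omp.h>
    #include <stdio.h>

    /* STREAM-style bandwidth accounting: bytes moved per sweep times the
       number of sweeps, divided by wall-clock time over all sweeps. */
    double report_bandwidth(long n, long ntimes, void (*timed_region)(void)) {
        double t0 = omp_get_wtime();
        timed_region();                                    /* runs all ntimes sweeps */
        double seconds = omp_get_wtime() - t0;
        double bytes = 3.0 * sizeof(double) * n * ntimes;  /* triad touches A, B, C */
        double mbps = bytes / seconds / 1.0e6;
        printf("achieved bandwidth: %.1f MB/s\n", mbps);
        return mbps;
    }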
A. The Triad benchmark
We demonstrate the simplicity of AdaptMemBench by implementing the triad kernel from the STREAM benchmark, chosen for its brevity and well-known performance. Listings 3 and 4, along with the templates in Listings 1 and 2, illustrate the process of creating a custom benchmark from a combination of input C code files, bypassing the polyhedral code generator. Alternatively, the kernel could have been expressed as a set: { [j] | 0 <= j < n }. The results are equivalent.

Cost of Barriers in OpenMP
We use the generated triad benchmark to evaluate the overhead associated with barriers in OpenMP by using the nowait clause. With the AdaptMemBench framework, all that is required is to modify the definition of the macro
CLAUSE to be nowait. As the memory bandwidth results in Figure 5 indicate, there is significant overhead caused by the implicit barrier at the end of each OpenMP work-sharing construct.

    //Allocation Code
    double* B = (double *) malloc(sizeof(double) * n); \
    double* C = (double *) malloc(sizeof(double) * n);
    //Memory Mapping
    //Initialization
    //Statement Definition
    //OpenMP clause

Listing 3. Header file triad.h.
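For reference, with CLAUSE defined as nowait the timed region of the unified data spaces driver takes roughly the following shape. This is our sketch, modeled on Listing 5; the template's allocation, timing, and validation code is omitted.

    #pragma omp parallel
    {
        for (int k = 0; k < ntimes; k++) {
            /* nowait removes the implicit barrier at the end of the loop,
               so threads proceed to the next sweep without synchronizing */
            #pragma omp for nowait
            for (int i = 0; i < n; i++)
                A[i] = B[i] + scalar * C[i];
        }
    }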
Overhead of shared data spaces
The shape of the curve in the performance results for the triad benchmark is disconcerting. Specifically, bandwidth in L1 is less than that in L2. There is a significant amount of overhead in utilizing shared memory parallel applications. We explore the resulting performance bottleneck with two variants of the triad benchmark: unified data spaces and independent data spaces.
Fig. 5. The impact of OpenMP barriers on achieved memory bandwidth.

    for(int k = 0; k < ntimes; k++) {
        /* work-sharing loop; runs inside the template's enclosing parallel region */
        #pragma omp for schedule(static, n/t) nowait
        for (int i = 0; i < n; i++){
            A[i] = B[i] + scalar * C[i];
        }
    }

Listing 5. Utilizing the OpenMP work sharing construct for data spaces of size n and t threads.

    int N = n/t;
    #pragma omp parallel
    {
        int t_id = omp_get_thread_num();
        for(int k = 0; k < ntimes; k++) {
            for (int i = 0; i < N; i++){
                A[t_id][i] = B[t_id][i] + scalar * C[t_id][i];
            }
        }
    }

Listing 6. The resultant triad benchmark using the independent data spaces driver template.

The first variant is implemented with unified data spaces using OpenMP's work sharing constructs. Listing 5 is a part of the benchmark driver generated from the unified data spaces template, with the macro
CLAUSE in triad.h set to schedule(static, n/t) nowait.

The second benchmark uses the independent data spaces template, implemented with distinct data spaces, one per thread (Listing 6). The only change needed in the benchmark specification is to the data mapping in the header file. The listing shows the result after macro expansion.

Memory bandwidth results in Figure 6 clearly indicate the benefit of using distinct data spaces over the shared data spaces variant implemented with OpenMP work-sharing and scheduling constructs. Using independent data spaces separates data domains into separate memory regions, eliminating cross-thread communication. This in turn eliminates performance bottlenecks, for example, multiple threads accessing the same cache line. We observe an approximately two-fold performance boost in the L1 cache with this approach compared to unified data spaces using OpenMP work-sharing constructs, which are generally considered efficient.

Scheduling to Maximize Bandwidth
The triad pattern, which comprises three data spaces, is often considered to yield optimal performance on a given architecture. With the configurability offered by our benchmarking framework, we expand the number of data spaces evaluated from 3 (in triad) to 20 data streams that are simultaneously read in the body of the loop. This is achieved by modifying the statement definition and memory allocation specifications in the header file.
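The exact statement used for the multi-stream sweep is defined in the header file and is not reproduced in the paper; the sketch below shows the general shape we assume, with S read streams X[0..S-1] combined into one write stream A.

    /* Hypothetical generalization of triad to S read streams; the paper
       varies S from 3 (triad) to 20 by editing the statement definition. */
    for (int k = 0; k < ntimes; k++) {
        for (int i = 0; i < n; i++) {
            double acc = 0.0;
            for (int s = 0; s < S; s++)
                acc += scalar * X[s][i];   /* X[s] is the s-th read stream */
            A[i] = acc;
        }
    }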
Fig. 6. Illustrating the overhead associated with data shared among threads.

Fig. 7. An experiment to identify the number of data streams fetched simultaneously that gives optimal performance on parallel execution with 28 threads.
Figure 7 shows the results of running this experiment in parallel with 28 threads. The memory bandwidth values are inconsistent for working set sizes that sit in the L1 cache, since small data sets are shared among a large number of threads. Considering working sets in the L2 cache, where the performance is more consistent, we observe that the achieved memory bandwidth peaks at 11 data spaces, which is considerably higher than for triad, which comprises 3 data streams.

    for (int i = 0; i < n/2; i++){
        A[i] = B[i] + scalar * C[i];
        A[i+n/2] = B[i+n/2] + scalar * C[i+n/2];
    }

Listing 7. Customized benchmark driver with unified spaces illustrating the interleaved optimization for triad.

Fig. 8. Illustration of interleaved optimization with a single data space of size n.

This experiment led us to reschedule the execution of triad. Listing 7 shows the interleaved optimization implemented for triad. It splits each data space of size n into two independent blocks of size n/2 each. These blocks are fused together to execute in a single iteration, and elements in both blocks are accessed simultaneously. So, instead of reading three data spaces at the same time, six data streams are accessed concurrently, hence better utilizing the available prefetching lines. Figure 8 illustrates how a single data space is interleaved into two blocks that are fused together and accessed simultaneously within a single iteration.

Fig. 9. Interleaved optimization for triad is beneficial in the L1 cache on parallel execution with 28 threads.
Performance results in Figure 9 illustrate the improvement in achieved bandwidth for triad in the L1 cache. A significant speedup is observed over the naïve triad operation implemented with independent data spaces. For working set sizes falling out of the L1 cache, this optimization is not effective due to poor prefetching. This further validates the experimental results from Figure 7, wherein we achieve higher performance with 6 data spaces (i.e., the naïve hexad operation) than with 3, as in triad. We attempted interleaving data spaces for triad with interleaving factors greater than two, but we obtain the highest performance when interleaving by a factor of 2, where accesses remain within a single cache line while the data spaces stay truly independent.
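For clarity, the interleaving in Listing 7 can be written with an arbitrary factor f, as sketched below. This generalization is our illustration, not the paper's code; the paper reports that f = 2 performs best.

    /* f blocks of size n/f are advanced together in the same iteration;
       with f = 2 this reduces exactly to Listing 7. */
    int blk = n / f;
    for (int i = 0; i < blk; i++)
        for (int b = 0; b < f; b++)
            A[i + b * blk] = B[i + b * blk] + scalar * C[i + b * blk];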
B. Multidimensional Jacobi patterns
Iterative Jacobi stencils are at the core of a wide range of scientific applications and are represented in the Structured Grid motif [4]. These patterns involve nearest-neighbor computations in which each point in a multidimensional grid is iteratively updated using a subset of its neighbors. The polyhedral model is used to generate benchmark drivers for the Jacobi patterns, as it is helpful for testing potential optimizations such as tiling, exercising the flexibility of AdaptMemBench.
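For reference, the 3-point Jacobi 1D sweep used in this section has the following shape. This sketch matches the padded variant shown later in Listing 8 with the padding removed and the template's OpenMP work-sharing pragma omitted; the actual drivers are generated from the ISCC specifications (Figure 11) rather than written by hand.

    for (int k = 0; k < ntimes; k++) {
        for (int i = 1; i < n - 1; i++) {
            A[i] = (B[i - 1] + B[i] + B[i + 1]) * 0.33;
        }
    }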
Figure 11 demonstrates the process of custom benchmark generation for this pattern, applying polyhedral code generation to the input pattern specifications with the unified data spaces benchmark template. Allocating independent spaces is advantageous for this pattern as well, as reflected by the memory bandwidth results in Figure 12. However, performance scaling in L1 is still an issue, due to false sharing.
Impact of false sharing
In symmetric multiprocessing systems, where each processor core has dedicated local cache(s), false sharing is a well-known performance issue. False sharing occurs when multiple threads modify independent variables that share the same cache line, forcing unnecessary cache-line flushes and subsequent reloads. A typical source of false sharing is multiple threads simultaneously accessing dynamically allocated or global shared data structures.

The impact of false sharing is quantified by recording performance counters with PAPI. We measure the data cache hits in L1 and the requests for exclusive access to shared cache lines in Figure 10(a). We observe that the shared data spaces suffer nearly 10 times more cache misses than the independent data spaces. Note that Figures 10(a) and 10(b) each have a primary and a secondary y-axis; the data plotted with green triangles is associated with the secondary axis (on the right). The cache misses recorded for independent data spaces are lower, but the number of requests for exclusive access to a clean cache line, compared across the three cases in Figure 10(b), is much higher in L1 for the case that suffers from false sharing.

Padding arrays is a common solution to false sharing. In the architecture evaluated, each cache line is 64 bytes. As shown in Listing 8, the data spaces of type double are padded by a factor of 8 so that each thread's data falls on different cache lines, avoiding false sharing. With AdaptMemBench, this is achieved simply by modifying the memory mapping. Eliminating false sharing leads to a drastic performance speedup in the L1 cache, as the results in Figure 12 reflect. The PAPI results were collected by running the same code configurations with a PAPI driver within the framework, and the memory bandwidth results exclude the minimal overhead of accessing hardware counters.

(a) Number of L1 data cache misses. (b) Number of requests to shared cache line.
Fig. 10. Cache misses and cache line requests for 3-pt Jacobi 1D. Measurements for the unified data spaces are plotted along the secondary y-axis for better readability of results.

Fig. 11. Illustration of custom benchmark generation for the 3-pt Jacobi 1D kernel with unified data spaces using the polyhedral model.

    #pragma omp parallel
    {
        int t_id = omp_get_thread_num();
        for(int k = 0; k < ntimes; k++) {
            for (int i = 1; i < n - 1; i++){
                A[t_id * 8][i] = (B[t_id * 8][i - 1] + B[t_id * 8][i]
                                  + B[t_id * 8][i + 1]) * 0.33;
            }
        }
    }

Listing 8. The resultant independent data spaces benchmark driver reflecting array padding for Jacobi 1D.
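As a side note on where the padding factor in Listing 8 comes from, the arithmetic is simply the cache line size divided by the element size:

    /* A cache line on the evaluated node is 64 bytes and a double is 8 bytes,
       so 8 doubles fill one line; hence the factor of 8 in A[t_id * 8][i]. */
    const size_t PAD = 64 / sizeof(double);   /* = 8 */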
Fig. 12. Demonstration of the overhead associated with shared data spaces in SMP systems with Jacobi 1D.
Higher dimensional Jacobi patterns
The process of creating a custom benchmark driver for 9-pt Jacobi 2D using unified data spaces is illustrated in Figure 13. A 7-point Jacobi 3D benchmark driver can be created similarly, with an added dimension in the code generation script and corresponding modifications to the pattern specification.

From Figures 14 and 15, it can be noted that separating data spaces into different memory regions is beneficial for both Jacobi 2D and Jacobi 3D. However, false sharing does not affect performance, and both patterns struggle to scale in the L1 cache.

Fig. 13. Illustration of custom benchmark generation for the 9-pt Jacobi 2D kernel with unified data spaces using the polyhedral model.

Fig. 14. Analyzing the performance bottleneck caused by shared data spaces in Jacobi 2D.
Tiling Optimization for Jacobi transformations
Rectangular space tiling [8] is one of the traditional optimization strategies for stencil computations. Rectangular tiling breaks a large iteration space into a set of smaller iteration spaces, which improves spatial and temporal locality. When iterating over a large two-dimensional data space applying a multipoint stencil, it is highly probable that one of the neighbors accessed will have fallen out of the cache by the time the iteration comes around to the same point again. Tiling the iteration space eliminates such cache misses and improves data reuse. This optimization is explored not to provide another data point on the impact of tiling, but to demonstrate the advantages of including polyhedral code representations in the framework.
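The sketch below shows the general shape of a rectangularly tiled 2D sweep; N, M, the tile sizes TI and TJ, and the statement S(i, j) are placeholders. In AdaptMemBench such nests are generated from ISCC relations (as in Listing 9) rather than written by hand.

    /* Outer tile loops (ti, tj) walk over TI x TJ blocks; the inner point
       loops stay within one block, improving reuse of neighboring points. */
    for (int ti = 0; ti < N; ti += TI)
        for (int tj = 0; tj < M; tj += TJ)
            for (int i = ti; i < ti + TI && i < N; i++)
                for (int j = tj; j < tj + TJ && j < M; j++)
                    S(i, j);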
Tiling Three-dimensional Jacobi
We implement this spatial tiling strategy on the 7-point Jacobi 3D transformation. The initial approach is to tile the 3D grid in all directions. Listing 9 shows the ISCC input script and the corresponding generated C code file.
Fig. 15. Impact of varying memory allocation on performance in Jacobi 3D.

AdaptMemBench simplifies the testing of this optimization: the input ISCC script serves as the execution schedule file, with the other pattern specifications remaining the same as for the naïve Jacobi 3D benchmark.

    Domain_run := [n] -> { STM_3DS_run[k,j,i] : i <= n and i >= 1 and
                           j <= n and j >= 1 and k <= n and k >= 1 };
    Tiling := [n] -> { STM_3DS_run[k,j,i] -> STM_3DS_run[tk,tj,ti,k,j,i] :
                       exists rk,rj,ri : 0 <= rk < 32 and k = tk*32 + rk
                       and 0 <= rj < 64 and j = tj*64 + rj
                       and 0 <= ri < 16 and i = ti*16 + ri };
    codegen (Tiling * Domain_run);

    for (int c0 = 0; c0 <= floord(n, 32); c0 += 1)
      for (int c1 = 0; c1 <= n / 64; c1 += 1)
        for (int c2 = 0; c2 <= n / 16; c2 += 1)
          for (int c3 = max(1, 32 * c0); c3 <= min(n, 32 * c0 + 31); c3 += 1)
            for (int c4 = max(1, 64 * c1); c4 <= min(n, 64 * c1 + 63); c4 += 1)
              for (int c5 = max(1, 16 * c2); c5 <= min(n, 16 * c2 + 15); c5 += 1)
                STM_3DS_run(c3, c4, c5);
Listing 9. ISCC script Jacobi3D_xyz_tiled.in and the generated C code file
Jacobi3D_xyz_tiled.c.

We initially block the iteration space in all three dimensions, for three block sizes. Our results agree with previous experimental evaluations showing no performance gain [10].

We then implement the partial blocking strategy [17], in which blocking is done in the two least significant dimensions alone. This results in a series of 2D slices that are stacked one over the other in the unblocked dimension. We tested the efficacy of this technique on grid sizes up to 256, with block sizes ranging from 16 to 64 in both directions. This approach, too, does not offer any speedup if we compare the peak bandwidth from Figure 15 with the most performant block area in Figure 16. Large on-chip caches already capture much of the available reuse and thus this blocking strategy provides no performance gain. Increasing grid sizes would be impractical, since many scientific applications, such as computational fluid dynamics, typically use small box sizes [1].

Fig. 16. Achieved memory bandwidth with 2D cache blocking for Jacobi 3D, with a tile sweep for sizes ranging from 16 to 64 in both tiled directions.

These results confirm conclusions from previous studies [5], [9], [10] on these tiling strategies performed for serial execution. We extend these studies to parallel applications and systems using the flexibility of the polyhedral model offered by AdaptMemBench. Several temporal tiling strategies [3], [7], [12], [26] have proved to be effective for higher-dimensional stencil patterns; they are not evaluated in this paper, but the framework can accommodate them.

IV. RELATED WORK
Several categories of memory benchmarks have been developed over the years. Most relevant to our work are the streaming bandwidth benchmarks, which use a predefined set of access patterns to measure achieved memory bandwidth, and the stencil benchmarks. The following section presents representatives from each benchmarking category.

Our benchmarking framework adds capabilities beyond these benchmarks by offering the configurability to explore the performance of scientific applications. It emulates application-specific memory access patterns using polyhedral code generation. It is a flexible and consistent testbed for evaluating various code optimizations without needing to port or modify the entire application.
A. Streaming Bandwidth Benchmarks
STREAM [14] is a microbenchmark that measures sustainable memory bandwidth and the corresponding computation rates for the performance evaluation of high performance computing systems. STREAM measures the performance of four operations: COPY (a[i] = b[i], measures data transfer without arithmetic), SCALE (a[i] = q*b[i], adds a simple arithmetic operation), SUM (a[i] = b[i] + c[i], tests multiple load and store operations), and TRIAD (a[i] = b[i] + q*c[i]). The STREAM benchmark does not measure memory bandwidth for small data sizes in the higher levels of the memory hierarchy, i.e., in the level 1 cache and, depending on the target architecture, some portion of the level 2 cache. AdaptMemBench accumulates the computation time over the overall execution of the kernel, enabling it to explore achieved performance in the higher levels of cache.

MultiMAPS [19] is a benchmark probe designed to measure platform-specific bandwidths. Similar to STREAM, it accesses data arrays repeatedly. In MultiMAPS, the access pattern is varied in stride and array size, varying spatial and temporal locality. It measures achieved memory bandwidth for different memory levels, different working set sizes, and a small set of access patterns. This benchmark is the one most closely related to ours; the primary difference is our ability to include arbitrary memory access patterns and to test optimization strategies.

Stanza Triad [10] is a microbenchmark derived from STREAM that measures the impact of prefetching on modern microprocessors. It works by comparing bandwidth measurements while varying the stanza length L and the stride of access S for different data sizes, and predicts performance. Being a serial benchmark, it cannot be scaled to parallel applications and cannot be configured for patterns other than triad.

B. Synthetic memory benchmarks
Apex-MAP [20] is a synthetic benchmark that characterizes application performance, implemented sequentially [21] and in parallel using MPI [22]. This benchmark approximates memory access performance based on concurrent address streams, considering the regularity of the access pattern, spatial locality, and temporal reuse. Using a set of characteristic performance factors, its execution profile is tuned such that these factors act as a proxy for the performance behavior of code with similar characteristics.

Stencil Probe [10] is a lightweight, flexible, stencil-application-specific benchmark that explores the behavior of grid-based computations. Stencil Probe mimics the kernels of applications that use stencils on regular grids by modifying the operations in the inner loop of the benchmark. Similar to Stanza Triad, this benchmark is serial and cannot be extended to large-scale parallel applications and systems. Furthermore, the probe is not friendly to testing code optimizations and requires rewriting the entire benchmark code for each transformation.

Bandwidth [18] is an artificial benchmark that measures memory bandwidth on x86 and x86_64 architectures. It can be used to evaluate the performance of the memory subsystem, the bus architecture, the cache architecture, and the processor. Memory bandwidth is measured by performing sequential and random reads and writes of varying sizes across the levels of the memory hierarchy. However, this benchmark is neither application-specific nor customizable: it measures performance for a predefined set of memory access patterns and cannot be configured for a target application. Moreover, it executes serially and cannot be scaled to parallel systems and applications.
C. Application Benchmarks
Application benchmarks are used as exemplars of application patterns. The NAS Parallel Benchmarks [2] comprise benchmarks developed to represent the major types of computations performed by highly parallel supercomputers and to mimic the computation and data movement characteristics of scientific applications. The suite consists of five parallel kernel benchmarks (EP, an embarrassingly parallel kernel; MG, a simplified multigrid kernel; CG, a conjugate gradient method; FT, fast Fourier transforms; and IS, a large integer sort) and three simulated application benchmarks (LU, lower and upper triangular system solution; SP, a scalar pentadiagonal solver; and BT, a set of block tridiagonal equations).

The HPC Challenge benchmark suite [13] provides a set of benchmarks intended to define the performance boundaries of future Petascale computing systems. This hybrid benchmark suite examines the performance of HPC architectures as a function of memory access characteristics using different access patterns. It is composed of well-known computational kernels such as STREAM, HPL [6], matrix multiply, parallel matrix transpose, FFT, RandomAccess, and bandwidth/latency tests that span the high and low spatial and temporal locality space.

V. CONCLUSIONS
This paper presents a configurable benchmark framework that captures application-specific memory access patterns that can be expressed using the polyhedral model. The use of the polyhedral model and associated code generation tools allows for quick development of, and experimentation with, optimization strategies. The AdaptMemBench framework was used to demonstrate the benefit of using distinct data spaces per thread, as well as the overhead of OpenMP constructs and false sharing when targeting the L1 cache.
REFERENCES

[1] M. Adams, P. O. Schwartz, H. Johansen, P. Colella, T. J. Ligocki, D. Martin, N. D. Keen, D. Graves, D. Modiano, B. Van Straalen, et al. Chombo software package for AMR applications: design document. Technical report, 2015.
[2] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, et al. The NAS parallel benchmarks. The International Journal of Supercomputing Applications, 5(3):63–73, 1991.
[3] V. Bandishti, I. Pananilath, and U. Bondhugula. Tiling stencil computations to maximize parallelism. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–11. IEEE, 2012.
[4] P. Colella. Defining software requirements for scientific computing. 2004.
[5] K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Optimization and performance modeling of stencil computations on modern microprocessors. SIAM Review, 51(1):129–159, 2009.
[6] J. J. Dongarra, P. Luszczek, and A. Petitet. The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience, 15(9):803–820, 2003.
[7] M. Frigo and V. Strumpen. Cache oblivious stencil computations. In Proceedings of the 19th Annual International Conference on Supercomputing, pages 361–366. ACM, 2005.
[8] F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 319–329. ACM, 1988.
[9] S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Implicit and explicit optimizations for stencil computations. In Proceedings of the 2006 Workshop on Memory System Performance and Correctness, pages 51–60. ACM, 2006.
[10] S. Kamil, P. Husbands, L. Oliker, J. Shalf, and K. Yelick. Impact of modern memory subsystems on cache optimizations for stencil computations. In Proceedings of the 2005 Workshop on Memory System Performance, pages 36–43. ACM, 2005.
[11] W. Kelly. Optimization within a unified transformation framework. Technical report, 1998.
[12] S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In ACM SIGPLAN Notices.
[14] J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE TCCA Newsletter, pages 19–25, 1995.
[16] P. J. Mucci, S. Browne, C. Deane, and G. Ho. PAPI: A portable interface to hardware performance counters. In Proceedings of the Department of Defense HPCMP Users Group Conference, volume 710, 1999.
[17] G. Rivera and C.-W. Tseng. Tiling optimizations for 3D scientific computations. In Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, page 32. IEEE Computer Society, 2000.
[18] Z. Smith. Bandwidth: a memory bandwidth benchmark, 2008.
[19] A. Snavely, L. Carrington, N. Wolter, J. Labarta, R. Badia, and A. Purkayastha. A framework for performance modeling and prediction. In Supercomputing 2002, pages 21–21. IEEE, 2002.
[20] E. Strohmaier and H. Shan. Apex-Map: A global data access benchmark to analyze HPC systems and parallel programming paradigms. In Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, pages 49–49, Nov. 2005.
[21] E. Strohmaier and H. Shan. Architecture independent performance characterization and benchmarking for scientific applications. In Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, 2004 (MASCOTS 2004). Proceedings. The IEEE Computer Society's 12th Annual International Symposium on, pages 467–474. IEEE, 2004.
[22] E. Strohmaier and H. Shan. Apex-Map: A synthetic scalable benchmark probe to explore data access performance on highly parallel systems. In European Conference on Parallel Processing, pages 114–123. Springer, 2005.
[23] S. Verdoolaege. barvinok: User guide. 2007.
[24] S. Verdoolaege. isl: An integer set library for the polyhedral model. In International Congress on Mathematical Software, pages 299–302. Springer, 2010.
[25] S. Verdoolaege and T. Grosser. Polyhedral extraction tool. In Second International Workshop on Polyhedral Compilation Techniques (IMPACT 2012), Paris, France, pages 1–16, 2012.
[26] D. Wonnacott. Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In Parallel and Distributed Processing Symposium, 2000 (IPDPS 2000). Proceedings. 14th International, pages 171–180. IEEE, 2000.
[27] W. A. Wulf and S. A. McKee. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24, 1995.