AdaptMemBench: Application-Specific Memory Subsystem Benchmarking
Mahesh Lakshminarasimhan
Department of Computer Science, Boise State University
Boise, Idaho, [email protected]
Catherine Olschanowsky
Department of Computer Science, Boise State University
Boise, Idaho, [email protected]
Abstract—Optimizing scientific applications to take full advantage of modern memory subsystems is a continual challenge for application and compiler developers. Factors beyond working set size affect performance. A benchmark framework that explores performance in an application-specific manner is essential to characterize memory performance and, at the same time, inform memory-efficient coding practices. We present AdaptMemBench, a configurable benchmark framework that measures achieved memory performance by emulating application-specific access patterns with a set of kernel-independent driver templates. This framework can explore the performance characteristics of a wide range of access patterns and, thanks to the flexibility of polyhedral code generation, can be used as a testbed for potential optimizations. We demonstrate the effectiveness of AdaptMemBench with case studies on commonly used computational kernels such as triad and multidimensional stencil patterns.
Index Terms—Benchmarking, Memory Performance, Code Generation, Stencil Computations, Tiling Optimizations.
I. INTRODUCTION
Scientific application performance is a function of memory bandwidth, instruction mix and order, memory footprint, and memory access patterns. The contribution of each is often not clear, and interdependencies exist between the variables. This complexity, combined with the difficulty of instrumenting large applications, makes efficient optimization of these applications difficult. AdaptMemBench provides a framework for application developers and optimization experts to isolate portions of their application and measure execution characteristics. The framework provides a starting point to identify performance bottlenecks, identify potential optimizations, and explore the potential gains of those optimizations.

Application performance is often bottlenecked by interaction with the memory subsystem due to the memory wall [27]. Modern architectures combat this by using deep memory hierarchies and physically fragmented system memory. Reducing working set sizes is considered a good first step in optimization to take advantage of the caching capability of machines. However, optimizing is more complex than that, especially when dealing with shared memory parallelization. Memory access patterns, instruction mix, data sharing across caches, and vectorizability must all be considered in concert.

Selecting and applying optimizations remains a primary challenge during performance enhancement. Testing and understanding optimizations in situ when working with a large application can be cumbersome and error prone. Given the difficulty of manipulating access patterns in situ, fewer optimization strategies are attempted and potential performance improvements are overlooked. Additionally, performance tools, such as hardware counters, remain difficult to use in the context of a large application. The combination of these factors discourages effective optimization.

A framework that allows extracted code to be isolated and measured will benefit the optimization process for specific projects, and will improve the reliability and reproducibility of performance experiments in the compiler optimization and programming construct research communities. During the exploration and experimentation phase, many different variants of the same code are produced. Tracking the differences between variants and maintaining correct execution becomes time consuming and challenging. A shared framework that supports experimentation and tracks code versions while outputting metadata with measurements will ease this challenge.

We propose a tool to explore the design landscape of the target architecture. The
AdaptMemBench framework can be used to measure system performance and to guide application-specific optimization decisions. Expensive kernels extracted from larger applications can be manipulated in isolation to find the best optimization strategies. The framework reduces the amount of code that is transferred and provides mechanisms to experiment with data storage layout, execution order, and parallelization strategies.

AdaptMemBench provides several execution templates. The templates are combined with user-provided code segments. The templates provide a common command line interface, handle all timing and hardware counter code, and output metadata and measurements in a common format. The code segments provided by the user can be expressed as C code or by using the polyhedral model. The latter provides a convenient mechanism for optimization experiments.

Several benchmarks [2], [10], [13], [14], [18]–[20] exist that measure machine performance, with the benchmarking results conveying essential information about application performance on the memory hierarchy of the machine. Existing memory benchmarks [14], [18], [19] measure performance using a limited collection of streaming access patterns. However, benchmarking application-specific patterns, which tend to be more complex, remains a challenge. Current benchmarks [10], [15] are further constrained by the data sizes which can be executed, specifically in the higher levels of the memory subsystem.

Fig. 1. The proposed framework.

AdaptMemBench differs from previous efforts by incorporating polyhedral code generation. This creates a configurable benchmarking framework that measures achieved memory bandwidth while mimicking application-specific memory access patterns. The polyhedral model [25] simplifies writing the initial benchmark and provides a mechanism to automatically transform the code. Furthermore, our benchmark supports parallel applications and systems, and measures memory performance for data sizes across all levels of the memory hierarchy.

The primary contribution of this paper is a description and various demonstrations of the AdaptMemBench framework. Additionally, the framework was used to explore the performance of our university's HPC cluster. The contributions of this paper include the following:
• A configurable benchmarking framework for application-specific memory performance characterization.
• A detailed performance study on common computational kernels found in scientific applications, examining the impact of implicit locks, shared data spaces, and false sharing.
• An interleaved optimization strategy and its demonstrated effectiveness for the triad pattern.
• An evaluation of the efficacy of spatial tiling strategies for multidimensional Jacobi patterns using AdaptMemBench.

II. ADAPTMEMBENCH DESIGN
The AdaptMemBench framework separates the user interface, validation, and output of the benchmark from the code being measured and provides low-overhead access to PAPI. Figure 1 illustrates the building blocks of the framework. Each computational kernel of interest is coded in a pattern specification. If that pattern specification involves the polyhedral model, it is passed through a polyhedral compiler. The resulting (or original) C code is compiled together with one of several potential templates. The templates provide a uniform interface and handle code to vary the working set size to cover each portion of the memory hierarchy, along with timing, PAPI data collection, and output formatting. The use of the polyhedral model adds a great deal of flexibility in terms of exploring optimizations. The following subsection provides a brief overview of the polyhedral model. After the overview, the benchmark framework is described.
A. Polyhedral Code Generation
Polyhedral code generation enables loop constructs to be expressed and manipulated mathematically. The iteration sets can be expressed without ordering unless a specific ordering is required. Figure 2 shows a loop nest for solving the heat equation. The associated iteration space is shown graphically as a two-dimensional space (i, j). Each node in the graph represents an iteration. The Presburger formula for this example is shown at the bottom of the figure.

Fig. 2. An example of polyhedral code generation with ISCC/ISL.
Code generation is performed on sets through polyhedral scanning; the result is control flow that produces the iterations in lexicographical order. Generating code directly from the set in Figure 2 reproduces the original loop nest. Transformations on the code are realized through the application of relations (or functions). Loop interchange is a loop transformation that switches the order of two loops. Figure 3 shows the relation used to apply loop interchange to the code in Figure 2. For the relation from {i,j} to {j,i}, we apply the transformation to the defined execution domain using the intersection operator. More complex transformations, such as tiling, can be performed with ease using the polyhedral model.

Fig. 3. An illustration of loop interchange using ISCC.
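To make the effect of the transformation concrete, the following sketch shows the shape of the generated loops before and after interchange. N, M, and the statement S(i, j) are placeholders of our own; the actual statement in Figure 2 is the heat-equation update, which is not reproduced here.

    /* Original schedule from the set in Figure 2: i outer, j inner. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            S(i, j);

    /* After applying the relation {[i,j] -> [j,i]}: the same statement
       instances execute, but the generated loops are interchanged. */
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            S(i, j);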
The polyhedral model represents iteration spaces that are affine. A significant amount of work has been done to expand the iteration spaces and schedules that can be represented, including work that uses schedule trees for code generation within ISL [24]. The Omega+ code generation tool is also able to incorporate iteration bounds based on runtime information using uninterpreted functions [11]. Even with recent advances, the polyhedral model cannot express all C kernels and is, therefore, an optional step in the benchmark specification.

In the proposed benchmark, the ISCC [25] polyhedral code generation tool is used to automatically generate schedules for the application kernel initialization, execution, and validation. ISCC offers an interface to the functionality provided by the Integer Set Library (ISL) [24] and the barvinok library [23]. The tool enables the end user to manipulate sets and relations and to generate source code reflecting their input.

    ...
    //Execution
    for(int k = 0; k < ntimes; k++) {
    }
    ...

Listing 1. The inner-most section of the Unified Data Spaces Template.

    ...
    //Execution
    #pragma omp parallel
    {
        int t_id = omp_get_thread_num();
        for(int k = 0; k < ntimes; k++) {
        }
    }
    ...

Listing 2. The inner-most section of the Independent Data Spaces Template.
B. Benchmark Implementation
The proposed framework uses a set of generic benchmark driver templates for all variations of the access patterns. These driver templates provide a standard command line interface and a standard machine-parsable and human-readable output. Currently, the framework supports the following three varieties of benchmark driver templates for shared memory applications:
1) The Unified Data Spaces Template (Listing 1): The standard benchmarking template, which utilizes unified data spaces shared among threads. It uses the work sharing and scheduling constructs offered by OpenMP to distribute resources among threads. The OpenMP clauses can be easily configured using the framework.
2) The Independent Data Spaces Template (Listing 2): A modified version of the unified data spaces template. It supports distinct data spaces separated into different memory regions and accessed without any overlap, avoiding false sharing. As indicated by the experimental results that follow, benchmarking in this paradigm yields the best performance in the higher cache levels.
3) The PAPI Measurement Template: This template is built on top of the above two templates, using PAPI's low-level API. The user is given the option to choose between the two benchmarking paradigms above and to specify the PAPI events to be recorded. A sketch of this instrumentation is shown below.

Input pattern specifications consist of a header file and a set of ISCC input files. The initial step is to run the polyhedral code generator on the ISCC input files and transform them into corresponding C code files. The user-chosen driver template is then updated with the appropriate header and source files to create the customized .cpp benchmark driver. This benchmark driver is compiled and executed with runtime arguments such as working set size, thread count, and other parameters depending on the access pattern for which the benchmark is run.
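The following sketch illustrates the kind of wrapper the PAPI measurement template places around the timed kernel, using PAPI's low-level C API. The function name run_with_papi, its structure, and the specific preset events are our assumptions for illustration; preset availability (PAPI_L1_DCM, PAPI_CA_SHR) depends on the CPU.

    #include <papi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Minimal PAPI instrumentation around a timed region (sketch only). */
    void run_with_papi(void (*kernel)(void)) {
        int eventset = PAPI_NULL;
        long long counts[2];
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            exit(1);
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_L1_DCM);  /* L1 data cache misses */
        PAPI_add_event(eventset, PAPI_CA_SHR);  /* exclusive access to shared line */
        PAPI_start(eventset);
        kernel();                               /* the generated benchmark kernel */
        PAPI_stop(eventset, counts);
        printf("L1_DCM=%lld CA_SHR=%lld\n", counts[0], counts[1]);
    }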
Fig. 4. Implementation of the benchmark.
The purpose and functionality of each component in the pattern specifications shown in Figure 4 are described below:
1) Header file
2) Initialization steps
III. CASE STUDIES

The performance characteristics of a set of computational kernels commonly used in performance studies are presented in this section. The kernels are STREAM's triad and Jacobi 1D, 2D, and 3D. The kernels were chosen for their simplicity and well-understood performance behaviors. The use cases demonstrate the need to separate implementation concerns when studying the performance of even simple kernels. The structure provided by AdaptMemBench improves the breadth of data collected and makes experiment reliability and reproducibility more easily attained. For each kernel we explore the impact of implicit locks, shared data spaces, and false sharing in SMP systems.
Hardware: Experiments were run on one of the nodes of the R2 HPC cluster at Boise State University, which has a 2.40 GHz dual Intel Xeon E5-2680 v4 CPU. This node consists of two NUMA domains, each containing 14 cores. Each core has a dedicated 32 KB L1 data cache and a 256 KB L2 cache. The 35 MB L3 cache is shared among all the cores in each NUMA domain. The size of each cache line in this architecture is 64 bytes.

Compilers: GNU's gcc (version 6.3). When building C++ benchmark drivers, the -fopenmp and -O3 optimization flags were used. The -lpapi flag was set for PAPI-enabled benchmark drivers.

Profiling Tool: The benchmark drivers are instrumented with the Performance API (PAPI) [16] library to access performance counters across the CPUs evaluated. PAPI is used to measure cache hits and the requests for exclusive access to cache lines.

Problem size: We executed the benchmarks with problem sizes across all levels of cache, as well as sizes that exceed the last-level cache and fit into main memory. Each benchmark is executed for 1000 time iterations. The number of repetitions is configurable.
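Throughout the case studies we report achieved memory bandwidth. The framework's exact accounting is not reproduced here; the sketch below shows the conventional STREAM-style accounting we assume for a triad-like kernel (report_bandwidth and the timed_region callback are hypothetical names for illustration).

    #include <omp.h>
    #include <stdio.h>

    /* STREAM-style bandwidth accounting: bytes moved per sweep times the
       number of sweeps, divided by wall-clock time over all sweeps. */
    double report_bandwidth(long n, long ntimes, void (*timed_region)(void)) {
        double t0 = omp_get_wtime();
        timed_region();                                    /* runs all ntimes sweeps */
        double seconds = omp_get_wtime() - t0;
        double bytes = 3.0 * sizeof(double) * n * ntimes;  /* triad touches A, B, C */
        double mbps = bytes / seconds / 1.0e6;
        printf("achieved bandwidth: %.1f MB/s\n", mbps);
        return mbps;
    }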
A. The Triad benchmark
We demonstrate the simplicity of AdaptMemBench by implementing the triad kernel from the STREAM benchmark, chosen for its brevity and well-known performance. Listings 3 and 4, along with the templates in Listings 1 and 2, illustrate the process of creating a custom benchmark from a combination of input C code files, bypassing the polyhedral code generator. Alternatively, the kernel could have been expressed as a set: { [j] | 0 <= j < n }. The results are equivalent.

Cost of Barriers in OpenMP
We use the generated triad benchmark to evaluate the overhead associated with barriers in OpenMP by using the nowait clause. With the AdaptMemBench framework, all that is required is to modify the definition of the macro
CLAUSE to be nowait. As the memory bandwidth results in Figure 5 indicate, there is significant overhead caused by the implicit barrier at the end of each OpenMP work-sharing construct.

    //Allocation Code
    double* B = (double *) malloc(sizeof(double) * n); \
    double* C = (double *) malloc(sizeof(double) * n);
    //Memory Mapping
    //Initialization
    //Statement Definition
    //OpenMP clause

Listing 3. Header file triad.h.
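For reference, with CLAUSE defined as nowait the timed region of the unified data spaces driver takes roughly the following shape. This is our sketch, modeled on Listing 5; the template's allocation, timing, and validation code is omitted.

    #pragma omp parallel
    {
        for (int k = 0; k < ntimes; k++) {
            /* nowait removes the implicit barrier at the end of the loop,
               so threads proceed to the next sweep without synchronizing */
            #pragma omp for nowait
            for (int i = 0; i < n; i++)
                A[i] = B[i] + scalar * C[i];
        }
    }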
Overhead of shared data spaces
The shape of the curve in the performance results for the triad benchmark is disconcerting. Specifically, bandwidth in L1 is less than that in L2. There is a significant amount of overhead in utilizing shared memory parallel applications. We explore the resulting performance bottleneck with two variants of the triad benchmark: unified data spaces and independent data spaces.
Fig. 5. The impact of OpenMP barriers on achieved memory bandwidth.

    for(int k = 0; k < ntimes; k++) {
        /* work-sharing loop; runs inside the template's enclosing parallel region */
        #pragma omp for schedule(static, n/t) nowait
        for (int i = 0; i < n; i++){
            A[i] = B[i] + scalar * C[i];
        }
    }

Listing 5. Utilizing the OpenMP work sharing construct for data spaces of size n and t threads.

    int N = n/t;
    #pragma omp parallel
    {
        int t_id = omp_get_thread_num();
        for(int k = 0; k < ntimes; k++) {
            for (int i = 0; i < N; i++){
                A[t_id][i] = B[t_id][i] + scalar * C[t_id][i];
            }
        }
    }

Listing 6. The resultant triad benchmark using the independent data spaces driver template.

The first variant is implemented with unified data spaces using OpenMP's work sharing constructs. Listing 5 is a part of the benchmark driver generated from the unified data spaces template, with the macro
CLAUSE in triad.h set to schedule(static, n/t) nowait.

The second benchmark uses the independent data spaces template, implemented with distinct data spaces, one per thread (Listing 6). The only change needed in the benchmark specification is to the data mapping in the header file. The listing shows the result after macro expansion.

Memory bandwidth results in Figure 6 clearly indicate the benefit of using distinct data spaces over the shared data spaces variant implemented with OpenMP work-sharing and scheduling constructs. Using independent data spaces separates data domains into separate memory regions, eliminating cross-thread communication. This in turn eliminates performance bottlenecks, for example, multiple threads accessing the same cache line. We observe an approximately two-fold performance boost in the L1 cache with this approach compared to unified data spaces using OpenMP work-sharing constructs, which are generally considered efficient.

Scheduling to Maximize Bandwidth
The triad pattern, which comprises three data spaces, is often considered to yield optimal performance on a given architecture. With the configurability offered by our benchmarking framework, we expand the number of data spaces evaluated from 3 (in triad) to 20 data streams that are simultaneously read in the body of the loop. This is achieved by modifying the statement definition and memory allocation specifications in the header file.
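The exact statement used for the multi-stream sweep is defined in the header file and is not reproduced in the paper; the sketch below shows the general shape we assume, with S read streams X[0..S-1] combined into one write stream A.

    /* Hypothetical generalization of triad to S read streams; the paper
       varies S from 3 (triad) to 20 by editing the statement definition. */
    for (int k = 0; k < ntimes; k++) {
        for (int i = 0; i < n; i++) {
            double acc = 0.0;
            for (int s = 0; s < S; s++)
                acc += scalar * X[s][i];   /* X[s] is the s-th read stream */
            A[i] = acc;
        }
    }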
Fig. 6. Illustrating the overhead associated with data shared among threads.

Fig. 7. An experiment to identify the number of data streams fetched simultaneously that gives optimal performance on parallel execution with 28 threads.
Figure 7 shows the results of running this experiment in parallel with 28 threads. The memory bandwidth values are inconsistent for working set sizes that sit in the L1 cache, since small data sets are shared among a large number of threads. Considering working sets in the L2 cache, where the performance is more consistent, we observe that the achieved memory bandwidth peaks at 11 data spaces, which is considerably higher than for triad, which comprises 3 data streams.

    for (int i = 0; i < n/2; i++){
        A[i] = B[i] + scalar * C[i];
        A[i+n/2] = B[i+n/2] + scalar * C[i+n/2];
    }

Listing 7. Customized benchmark driver with unified spaces illustrating the interleaved optimization for triad.

Fig. 8. Illustration of interleaved optimization with a single data space of size n.

This experiment led us to reschedule the execution of triad. Listing 7 shows the interleaved optimization implemented for triad. It splits each data space of size n into two independent blocks of size n/2 each. These blocks are fused together to execute in a single iteration, and elements in both blocks are accessed simultaneously. So, instead of reading three data spaces at the same time, six data streams are accessed concurrently, hence better utilizing the available prefetching lines. Figure 8 illustrates how a single data space is interleaved into two blocks that are fused together and accessed simultaneously within a single iteration.

Fig. 9. Interleaved optimization for triad is beneficial in the L1 cache on parallel execution with 28 threads.
Performance results in Figure 9 illustrate the improvement in achieved bandwidth for triad in the L1 cache. A significant speedup is observed over the naïve triad operation implemented with independent data spaces. For working set sizes falling out of the L1 cache, this optimization is not effective due to poor prefetching. This further validates the experimental results from Figure 7, wherein we achieve higher performance with 6 data spaces (i.e., the naïve hexad operation) than with 3, as in triad. We attempted interleaving data spaces for triad with interleaving factors greater than two, but we obtain the highest performance when interleaving by a factor of 2, where accesses remain within a single cache line while the data spaces stay truly independent.
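For clarity, the interleaving in Listing 7 can be written with an arbitrary factor f, as sketched below. This generalization is our illustration, not the paper's code; the paper reports that f = 2 performs best.

    /* f blocks of size n/f are advanced together in the same iteration;
       with f = 2 this reduces exactly to Listing 7. */
    int blk = n / f;
    for (int i = 0; i < blk; i++)
        for (int b = 0; b < f; b++)
            A[i + b * blk] = B[i + b * blk] + scalar * C[i + b * blk];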
B. Multidimensional Jacobi patterns
Iterative Jacobi stencils are at the core of a wide range of scientific applications and are represented in the Structured Grid motif [4]. These patterns involve nearest-neighbor computations in which each point in a multidimensional grid is iteratively updated using a subset of its neighbors. The polyhedral model is used to generate benchmark drivers for the Jacobi patterns, as it is helpful for testing potential optimizations such as tiling, exercising the flexibility of AdaptMemBench.
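For reference, the 3-point Jacobi 1D sweep used in this section has the following shape. This sketch matches the padded variant shown later in Listing 8 with the padding removed and the template's OpenMP work-sharing pragma omitted; the actual drivers are generated from the ISCC specifications (Figure 11) rather than written by hand.

    for (int k = 0; k < ntimes; k++) {
        for (int i = 1; i < n - 1; i++) {
            A[i] = (B[i - 1] + B[i] + B[i + 1]) * 0.33;
        }
    }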
Figure 11 demonstrates the process of custom benchmark generation for this pattern, applying polyhedral code generation to the input pattern specifications with the unified data spaces benchmark template. Allocating independent spaces is advantageous for this pattern as well, as reflected by the memory bandwidth results in Figure 12. However, performance scaling in L1 is still an issue, due to false sharing.
Impact of false sharing
In symmetric multiprocessing systems, where each processor core has dedicated local cache(s), false sharing is a well-known performance issue. False sharing occurs when multiple threads modify independent variables that share the same cache line, forcing unnecessary cache-line flushes and subsequent reloads. A typical source of false sharing is multiple threads simultaneously accessing dynamically allocated or global shared data structures.

The impact of false sharing is quantified by recording performance counters with PAPI. We measure the data cache hits in L1 and the requests for exclusive access to shared cache lines in Figure 10(a). We observe that the shared data spaces suffer nearly 10 times more cache misses than the independent data spaces. Note that Figures 10(a) and 10(b) each have a primary and a secondary y-axis; the data plotted with green triangles is associated with the secondary axis (on the right). The cache misses recorded for independent data spaces are lower, but the number of requests for exclusive access to a clean cache line, compared across the three cases in Figure 10(b), is much higher in L1 for the case that suffers from false sharing.

Padding arrays is a common solution to false sharing. In the architecture evaluated, each cache line is 64 bytes. As shown in Listing 8, the data spaces of type double are padded by a factor of 8 so that each thread's data falls on different cache lines, avoiding false sharing. With AdaptMemBench, this is achieved simply by modifying the memory mapping. Eliminating false sharing leads to a drastic performance speedup in the L1 cache, as the results in Figure 12 reflect. The PAPI results were collected by running the same code configurations with a PAPI driver within the framework, and the memory bandwidth results exclude the minimal overhead of accessing hardware counters.

(a) Number of L1 data cache misses. (b) Number of requests to shared cache line.
Fig. 10. Cache misses and cache line requests for 3-pt Jacobi 1D. Measurements for the unified data spaces are plotted along the secondary y-axis for better readability of results.

Fig. 11. Illustration of custom benchmark generation for the 3-pt Jacobi 1D kernel with unified data spaces using the polyhedral model.

    #pragma omp parallel
    {
        int t_id = omp_get_thread_num();
        for(int k = 0; k < ntimes; k++) {
            for (int i = 1; i < n - 1; i++){
                A[t_id * 8][i] = (B[t_id * 8][i - 1] + B[t_id * 8][i]
                                  + B[t_id * 8][i + 1]) * 0.33;
            }
        }
    }

Listing 8. The resultant independent data spaces benchmark driver reflecting array padding for Jacobi 1D.
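As a side note on where the padding factor in Listing 8 comes from, the arithmetic is simply the cache line size divided by the element size:

    /* A cache line on the evaluated node is 64 bytes and a double is 8 bytes,
       so 8 doubles fill one line; hence the factor of 8 in A[t_id * 8][i]. */
    const size_t PAD = 64 / sizeof(double);   /* = 8 */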
Fig. 12. Demonstration of the overhead associated with shared data spaces in SMP systems with Jacobi 1D.
Higher dimensional Jacobi patterns
The process of creating a custom benchmark driver for 9-pt Jacobi 2D using unified data spaces is illustrated in Figure 13. A 7-point Jacobi 3D benchmark driver can be created similarly, with an added dimension in the code generation script and corresponding modifications to the pattern specification.

From Figures 14 and 15, it can be noted that separating data spaces into different memory regions is beneficial for both Jacobi 2D and Jacobi 3D. However, false sharing does not affect performance, and both patterns struggle to scale in the L1 cache.

Fig. 13. Illustration of custom benchmark generation for the 9-pt Jacobi 2D kernel with unified data spaces using the polyhedral model.

Fig. 14. Analyzing the performance bottleneck caused by shared data spaces in Jacobi 2D.
Tiling Optimization for Jacobi transformations
Rectangular space tiling [8] is one of the traditional optimization strategies for stencil computations. Rectangular tiling breaks a large iteration space into a set of smaller iteration spaces, which improves spatial and temporal locality. When iterating over a large two-dimensional data space applying a multipoint stencil, it is highly probable that one of the neighbors accessed will have fallen out of the cache by the time the iteration comes around to the same point again. Tiling the iteration space eliminates such cache misses and improves data reuse. This optimization is explored not to provide another data point on the impact of tiling, but to demonstrate the advantages of including polyhedral code representations in the framework.
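The sketch below shows the general shape of a rectangularly tiled 2D sweep; N, M, the tile sizes TI and TJ, and the statement S(i, j) are placeholders. In AdaptMemBench such nests are generated from ISCC relations (as in Listing 9) rather than written by hand.

    /* Outer tile loops (ti, tj) walk over TI x TJ blocks; the inner point
       loops stay within one block, improving reuse of neighboring points. */
    for (int ti = 0; ti < N; ti += TI)
        for (int tj = 0; tj < M; tj += TJ)
            for (int i = ti; i < ti + TI && i < N; i++)
                for (int j = tj; j < tj + TJ && j < M; j++)
                    S(i, j);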
Tiling Three-dimensional Jacobi
We implement this spatial tiling strategy on the 7-point Jacobi 3D transformation. The initial approach is to tile the 3D grid in all directions. Listing 9 shows the ISCC input script and the corresponding generated C code file.
Fig. 15. Impact of varying memory allocation on performance in Jacobi 3D.

AdaptMemBench simplifies the testing of this optimization: the input ISCC script serves as the execution schedule file, with the other pattern specifications remaining the same as for the naïve Jacobi 3D benchmark.

    Domain_run := [n] -> { STM_3DS_run[k,j,i] : i <= n and i >= 1 and
                           j <= n and j >= 1 and k <= n and k >= 1 };
    Tiling := [n] -> { STM_3DS_run[k,j,i] -> STM_3DS_run[tk,tj,ti,k,j,i] :
                       exists rk,rj,ri : 0 <= rk < 32 and k = tk*32 + rk
                       and 0 <= rj < 64 and j = tj*64 + rj
                       and 0 <= ri < 16 and i = ti*16 + ri };
    codegen (Tiling * Domain_run);

    for (int c0 = 0; c0 <= floord(n, 32); c0 += 1)
      for (int c1 = 0; c1 <= n / 64; c1 += 1)
        for (int c2 = 0; c2 <= n / 16; c2 += 1)
          for (int c3 = max(1, 32 * c0); c3 <= min(n, 32 * c0 + 31); c3 += 1)
            for (int c4 = max(1, 64 * c1); c4 <= min(n, 64 * c1 + 63); c4 += 1)
              for (int c5 = max(1, 16 * c2); c5 <= min(n, 16 * c2 + 15); c5 += 1)
                STM_3DS_run(c3, c4, c5);
Listing 9. ISCC script Jacobi3D_xyz_tiled.in and the generated C code file
Jacobi3D_xyz_tiled.c.

We initially block the iteration space in all three dimensions, for three block sizes. Our results agree with previous experimental evaluations showing no performance gain [10].

We then implement the partial blocking strategy [17], in which blocking is done in the two least significant dimensions alone. This results in a series of 2D slices that are stacked one over the other in the unblocked dimension. We tested the efficacy of this technique on grid sizes up to 256, with block sizes ranging from 16 to 64 in both directions. This approach, too, does not offer any speedup if we compare the peak bandwidth from Figure 15 with the most performant block area in Figure 16. Large on-chip caches already capture much of the available reuse and thus this blocking strategy provides no performance gain. Increasing grid sizes would be impractical, since many scientific applications, such as computational fluid dynamics, typically use small box sizes [1].

Fig. 16. Achieved memory bandwidth with 2D cache blocking for Jacobi 3D, with a tile sweep for sizes ranging from 16 to 64 in both tiled directions.

These results confirm conclusions from previous studies [5], [9], [10] on these tiling strategies performed for serial execution. We extend these studies to parallel applications and systems using the flexibility of the polyhedral model offered by AdaptMemBench. Several temporal tiling strategies [3], [7], [12], [26] have proved to be effective for higher-dimensional stencil patterns; they are not evaluated in this paper, but the framework can accommodate them.

IV. RELATED WORK
Several categories of memory benchmarks have been developed over the years. Most relevant to our work are the streaming bandwidth benchmarks, which use a predefined set of access patterns to measure achieved memory bandwidth, and the stencil benchmarks. The following section presents representatives from each benchmarking category.

Our benchmarking framework adds capabilities beyond these benchmarks by offering the configurability to explore the performance of scientific applications. It emulates application-specific memory access patterns using polyhedral code generation. It is a flexible and consistent testbed for evaluating various code optimizations without needing to port or modify the entire application.
A. Streaming Bandwidth Benchmarks
STREAM [14] is a microbenchmark that measures sustainable memory bandwidth and the corresponding computation rates for the performance evaluation of high performance computing systems. STREAM measures the performance of four operations: COPY (a[i] = b[i], measures data transfer without arithmetic), SCALE (a[i] = q*b[i], adds a simple arithmetic operation), SUM (a[i] = b[i] + c[i], tests multiple load and store operations), and TRIAD (a[i] = b[i] + q*c[i]). The STREAM benchmark does not measure memory bandwidth for small data sizes in the higher levels of the memory hierarchy, i.e., in the level 1 cache and, depending on the target architecture, some portion of the level 2 cache. AdaptMemBench accumulates the computation time over the overall execution of the kernel, enabling it to explore achieved performance in the higher levels of cache.

MultiMAPS [19] is a benchmark probe designed to measure platform-specific bandwidths. Similar to STREAM, it accesses data arrays repeatedly. In MultiMAPS, the access pattern is varied in stride and array size, varying spatial and temporal locality. It measures achieved memory bandwidth for different memory levels, different working set sizes, and a small set of access patterns. This benchmark is the one most closely related to ours; the primary difference is our ability to include arbitrary memory access patterns and to test optimization strategies.

Stanza Triad [10] is a microbenchmark derived from STREAM that measures the impact of prefetching on modern microprocessors. It works by comparing bandwidth measurements while varying the stanza length L and the stride of access S for different data sizes, and predicts performance. Being a serial benchmark, it cannot be scaled to parallel applications and cannot be configured for patterns other than triad.

B. Synthetic memory benchmarks
Apex-MAP [20] is a synthetic benchmark that characterizes application performance, implemented sequentially [21] and in parallel using MPI [22]. This benchmark approximates memory access performance based on concurrent address streams, considering the regularity of the access pattern, spatial locality, and temporal reuse. Using a set of characteristic performance factors, its execution profile is tuned such that these factors act as a proxy for the performance behavior of code with similar characteristics.

Stencil Probe [10] is a lightweight, flexible, stencil-application-specific benchmark that explores the behavior of grid-based computations. Stencil Probe mimics the kernels of applications that use stencils on regular grids by modifying the operations in the inner loop of the benchmark. Similar to Stanza Triad, this benchmark is serial and cannot be extended to large-scale parallel applications and systems. Furthermore, the probe is not friendly to testing code optimizations and requires rewriting the entire benchmark code for each transformation.

Bandwidth [18] is an artificial benchmark that measures memory bandwidth on x86 and x86_64 architectures. It can be used to evaluate the performance of the memory subsystem, the bus architecture, the cache architecture, and the processor. Memory bandwidth is measured by performing sequential and random reads and writes of varying sizes across the levels of the memory hierarchy. However, this benchmark is neither application-specific nor customizable: it measures performance for a predefined set of memory access patterns and cannot be configured for a target application. Moreover, it executes serially and cannot be scaled to parallel systems and applications.
C. Application Benchmarks
Application benchmarks are used as exemplars of application patterns. The NAS Parallel Benchmarks [2] comprise benchmarks developed to represent the major types of computations performed by highly parallel supercomputers and to mimic the computation and data movement characteristics of scientific applications. The suite consists of five parallel kernel benchmarks (EP, an embarrassingly parallel kernel; MG, a simplified multigrid kernel; CG, a conjugate gradient method; FT, fast Fourier transforms; and IS, a large integer sort) and three simulated application benchmarks (LU, lower and upper triangular system solution; SP, a scalar pentadiagonal solver; and BT, a set of block tridiagonal equations).

The HPC Challenge benchmark suite [13] provides a set of benchmarks intended to define the performance boundaries of future Petascale computing systems. This hybrid benchmark suite examines the performance of HPC architectures as a function of memory access characteristics using different access patterns. It is composed of well-known computational kernels such as STREAM, HPL [6], matrix multiply, parallel matrix transpose, FFT, RandomAccess, and bandwidth/latency tests that span the high and low spatial and temporal locality space.

V. CONCLUSIONS
This paper presents a configurable benchmark framework that captures application-specific memory access patterns that can be expressed using the polyhedral model. The use of the polyhedral model and associated code generation tools allows for quick development of, and experimentation with, optimization strategies. The AdaptMemBench framework was used to demonstrate the benefit of using distinct data spaces per thread, as well as the overhead of OpenMP constructs and false sharing when targeting the L1 cache.
REFERENCES

[1] M. Adams, P. O. Schwartz, H. Johansen, P. Colella, T. J. Ligocki, D. Martin, N. D. Keen, D. Graves, D. Modiano, B. Van Straalen, et al. Chombo software package for AMR applications: design document. Technical report, 2015.
[2] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, et al. The NAS parallel benchmarks. The International Journal of Supercomputing Applications, 5(3):63–73, 1991.
[3] V. Bandishti, I. Pananilath, and U. Bondhugula. Tiling stencil computations to maximize parallelism. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–11. IEEE, 2012.
[4] P. Colella. Defining software requirements for scientific computing. 2004.
[5] K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Optimization and performance modeling of stencil computations on modern microprocessors. SIAM Review, 51(1):129–159, 2009.
[6] J. J. Dongarra, P. Luszczek, and A. Petitet. The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience, 15(9):803–820, 2003.
[7] M. Frigo and V. Strumpen. Cache oblivious stencil computations. In Proceedings of the 19th Annual International Conference on Supercomputing, pages 361–366. ACM, 2005.
[8] F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 319–329. ACM, 1988.
[9] S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Implicit and explicit optimizations for stencil computations. In Proceedings of the 2006 Workshop on Memory System Performance and Correctness, pages 51–60. ACM, 2006.
[10] S. Kamil, P. Husbands, L. Oliker, J. Shalf, and K. Yelick. Impact of modern memory subsystems on cache optimizations for stencil computations. In Proceedings of the 2005 Workshop on Memory System Performance, pages 36–43. ACM, 2005.
[11] W. Kelly. Optimization within a unified transformation framework. Technical report, 1998.
[12] S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In ACM SIGPLAN Notices.
[14] J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE TCCA Newsletter, pages 19–25, 1995.
[16] P. J. Mucci, S. Browne, C. Deane, and G. Ho. PAPI: A portable interface to hardware performance counters. In Proceedings of the Department of Defense HPCMP Users Group Conference, volume 710, 1999.
[17] G. Rivera and C.-W. Tseng. Tiling optimizations for 3D scientific computations. In Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, page 32. IEEE Computer Society, 2000.
[18] Z. Smith. Bandwidth: a memory bandwidth benchmark, 2008.
[19] A. Snavely, L. Carrington, N. Wolter, J. Labarta, R. Badia, and A. Purkayastha. A framework for performance modeling and prediction. In Supercomputing 2002, pages 21–21. IEEE, 2002.
[20] E. Strohmaier and H. Shan. Apex-Map: A global data access benchmark to analyze HPC systems and parallel programming paradigms. In Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, pages 49–49, Nov. 2005.
[21] E. Strohmaier and H. Shan. Architecture independent performance characterization and benchmarking for scientific applications. In Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, 2004 (MASCOTS 2004). Proceedings. The IEEE Computer Society's 12th Annual International Symposium on, pages 467–474. IEEE, 2004.
[22] E. Strohmaier and H. Shan. Apex-Map: A synthetic scalable benchmark probe to explore data access performance on highly parallel systems. In European Conference on Parallel Processing, pages 114–123. Springer, 2005.
[23] S. Verdoolaege. barvinok: User guide. 2007.
[24] S. Verdoolaege. isl: An integer set library for the polyhedral model. In International Congress on Mathematical Software, pages 299–302. Springer, 2010.
[25] S. Verdoolaege and T. Grosser. Polyhedral extraction tool. In Second International Workshop on Polyhedral Compilation Techniques (IMPACT 2012), Paris, France, pages 1–16, 2012.
[26] D. Wonnacott. Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In Parallel and Distributed Processing Symposium, 2000 (IPDPS 2000). Proceedings. 14th International, pages 171–180. IEEE, 2000.
[27] W. A. Wulf and S. A. McKee. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24, 1995.