Understanding Memory Access Patterns Using the BSC Performance Tools
Harald Servat a, Jesús Labarta b,c, Hans-Christian Hoppe a, Judit Giménez b,c, Antonio J. Peña b
a Intel Corporation
b Barcelona Supercomputing Center (BSC)
c Universitat Politècnica de Catalunya (UPC)
Abstract
The growing gap between processor and memory speeds has led to complex memory hierarchies as processors evolve to mitigate such divergence by exploiting the locality of reference. In this direction, the BSC performance analysis tools have been recently extended to provide insight into the application memory accesses by depicting their temporal and spatial characteristics, correlating them with the source code and the achieved performance simultaneously. These extensions rely on the Precise Event-Based Sampling (PEBS) mechanism available in recent Intel processors to capture information regarding the application memory accesses. The sampled information is later combined with the Folding technique to represent a detailed temporal evolution of the memory accesses in conjunction with the achieved performance and the source-code counterpart. The reports generated by the latter tool help not only application developers but also processor architects to better understand how the application behaves and how the system performs. In this paper, we describe a tighter integration of the sampling mechanism into the monitoring package. We also demonstrate the value of the complete workflow by exploring already optimized state-of-the-art benchmarks, providing detailed insight into their memory access behavior. We have taken advantage of this insight to apply small modifications that improve the applications' performance.
Keywords: performance analysis, memory references, sampling, instrumentation
1. Introduction
The growing gap between processor and memory speeds leads to more and more complex memory hierarchies as processors evolve generation after generation. The memory hierarchy is organized in different strata to exploit the applications' temporal and spatial localities of reference. On one end of the hierarchy lie extremely fast, tiny and power-hungry registers, while on the other end there is the slow, huge and less energy-consuming DRAM. In between these two extremes, there are multiple cache levels that mitigate the expense of bringing data from the DRAM when the application exposes either spatial or temporal locality. Still, researchers and manufacturers look for alternatives to improve the memory hierarchy performance- and energy-wise. For instance, they consider additional integration directions so that the memory hierarchy adds layers such as scratchpad memories, stacked 3D DRAM [1] and even non-volatile RAM [2].

A proper analysis of the application memory references and its data structures is vital to identify which application variables are referenced the most and their access cost, as well as to detect memory streams. All this information might provide hints to improve the execution behavior by helping prefetch mechanisms, suggesting the usage of non-temporal instructions, calculating reuse distances, tuning the cache organization and even facilitating research on multi-tiered memory systems. Two approaches are typically used to address these studies. First, instruction-based instrumentation tools monitor load/store instructions and decode them to capture the referenced addresses and the time to solve the reference. While this approach can capture all data references and accurately correlate code statements with data references, it estimates cache access costs by simulating the cache hierarchy and introduces significant overheads that alter the observed performance; it also burdens the analysis with large data collections and time-consuming processing, and is thus not practical for production runs. Second, some processors have enhanced their Performance Monitoring Unit (PMU) to sample memory instructions and capture data such as the referenced address, the time to solve the reference and the memory hierarchy level that provides the data. The sampling mechanisms help to reduce the amount of data captured and the overhead imposed, and thus allow targeting production application runs. However, the results obtained using statistical approximations may require sufficiently long runs to approximate the actual distribution; still, highly dynamic access patterns or rare performance excursions may be missed.

The Extrae instrumentation package [3] and the Folding tool [4] belong to the BSC performance tools suite and have been recently extended to explore the performance behavior and the references to the application data objects simultaneously [5]. However, the initial research prototype combined the results of two independent monitoring tools (Extrae and the perf tool [6]) that monitored the same process before depicting the results through the Folding tool. The changes described in this paper address several of the limitations of that prototype.

In this document we describe a fully integrated solution of the initial prototype. The novelties of this integration include:

• A simplified collection mechanism that uses the perf kernel infrastructure directly from Extrae to access the Intel Precise Event-Based Sampling (PEBS) [7] mechanism. This avoids loading a kernel module to correlate clocks between the two tools and reduces the overall overhead suffered by the application.

• The use of Extrae capabilities to multiplex load and store instructions in a single application execution.
This naturally provides load and store references in a single report, while in the prototype this was cumbersome due to kernel security features.
Figure 1: Tool integration for the memory reference analysis.
• An extension of the Extrae API to create synthetic events that delimit a memory region. This reduces the space needed for intermediate files for applications that allocate data in small consecutive chunks.

The organization of this paper is as follows. Section 2 describes the extensions made to the BSC performance tools in order to collect and represent data related to memory data-objects and the references to them. Section 3 follows with exhaustive performance and memory access analyses of several benchmarks, including code modifications and a comparison of the execution behavior before and after the code changes. Then Section 4 contextualizes this tool with respect to the state-of-the-art tools. Finally, Section 5 draws conclusions.
2. Extensions to the BSC performance tools
This section covers the extensions applied to the Extrae and Folding tools. Figure 1 depicts the interaction of these tools when exploring a target application. First, Extrae monitors the target application. Extrae is an open-source instrumentation and sampling software which generates Paraver [8] timestamped event traces for offline analysis. The package monitors several programming models (e.g. MPI, OpenMP, OmpSs and POSIX threads) to allow the analyst to understand the application behavior. Although Extrae offers an API for manual instrumentation, it also monitors in-production optimized binaries through the shared-library preloading mechanisms. Extrae can also multiplex the performance counters, capturing more performance counters over the application run than the underlying hardware can collect simultaneously. The sampling mechanism is implemented on top of time-based alarms as well as on top of hardware counters.

After the trace-file has been generated, the Folding tool is invoked. The Folding tool takes advantage of the repetitive nature of many applications (especially in the HPC environment) and combines sampled and instrumented information to provide detailed progression within a repetitive computing region. This allows monitoring the application at a coarse sampling frequency without impacting the application performance. The Folding tool generates two different outputs for the delimited repetitive regions. On the one hand, it generates summarized performance reports that can be explored using the gnuplot tool (http://gnuplot.info). On the other hand, it generates synthetic Paraver trace-files that include all the information from the summarized performance reports and additional details that cannot be represented in the plots.

2.1. Extensions to Extrae

2.1.1. On the collection of the application data-objects

The modifications on Extrae focus on capturing information about the application data structures and collecting information about the references to these structures. To help the analyst understand the access patterns, it is necessary to map addresses to actual application data structures. Consequently, Extrae has been extended to capture some properties of static and dynamic variables. With respect to the static variables, the instrumentation package scans the symbols within the application binary image using the binutils library to acquire their name, starting address and size. Regarding the dynamically allocated variables, the monitoring package has been extended to instrument the malloc-related routines. Extrae captures their input parameters and output results to determine the starting address and size, as well as a portion of the call-stack in order to locate them within the user code; Extrae identifies these variables by the top of the call-stack at the allocation site instead. Extrae also captures the references to the local (stack) variables, but the tool cannot track their creation and thus these references remain unnamed.

As applications may allocate and de-allocate many variables during the application lifetime, Extrae ignores allocations smaller than a given threshold (which defaults to 1 MByte but can be changed by the user) to avoid generating huge trace-files. This approach limits the analysis when targeting graph- or tree-based or other irregular applications where allocations may be tiny. To circumvent this limitation, we have extended the Extrae API to create synthetic events that delimit a memory region based upon begin and end addresses. This approach lets a user wrap small and consecutive dynamic allocations through this API and correlate the memory references to a synthetic object that represents all the allocations. As this approach requires manual intervention, one alternative (not currently implemented) would be to limit the instrumentation to a given number of small allocations.
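As an illustration of the dynamic-allocation monitoring described above, the following is a simplified sketch of the generic preload-based interposition technique on which such instrumentation typically relies. It is an assumption about the mechanism, not Extrae's actual code, and it ignores the re-entrancy issues a real tool must handle (e.g. backtrace() allocating memory on its first call). It would be built as a shared object and loaded with LD_PRELOAD in front of the unmodified, optimized binary (compile with g++, which defines _GNU_SOURCE so that RTLD_NEXT is available).

    #include <dlfcn.h>
    #include <execinfo.h>
    #include <stddef.h>
    #include <cstdio>

    static const size_t kThreshold = 1UL << 20;   // 1 MByte default threshold

    extern "C" void *malloc(size_t size) {
        // Forward to the real allocator found behind this interposed symbol.
        static void *(*real_malloc)(size_t) =
            (void *(*)(size_t)) dlsym(RTLD_NEXT, "malloc");
        void *ptr = real_malloc(size);
        if (size >= kThreshold) {
            void *callers[2];
            int n = backtrace(callers, 2);        // top of the call-stack = allocation site
            std::fprintf(stderr, "alloc %zu bytes at %p from %p\n",
                         size, ptr, n > 1 ? callers[1] : (void *) 0);
        }
        return ptr;
    }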
2.1.2. On the collection of the application memory references

For monitoring the application's memory references, Extrae uses the PEBS infrastructure. Although Extrae relies on PAPI [9] to collect the values of the hardware performance counters from the PMU, this performance library does not capture the PEBS-generated information (as of the latest released PAPI version, 5.5.1). Consequently, we have modified Extrae to use the perf subsystem of the Linux kernel (see perf_event_open(2) in the Linux manual pages) to monitor the memory references. In brief, to configure PEBS to sample memory references, perf has to (as sketched below):

• allocate a buffer to hold the PEBS samples,

• set up a perf event for a performance counter that captures memory references (e.g. memory operations) and specify a sampling period, and

• associate an interrupt handler with the interrupts generated when the PEBS buffer is full, which processes the PEBS buffer.
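The following is a minimal sketch (not Extrae's actual code) of the configuration just listed: a PEBS-capable load event sampled roughly every 137K loads, recording the instruction pointer, referenced address, access cost and data source. The raw event encoding 0x81d0 (MEM_UOPS_RETIRED.ALL_LOADS on Haswell) is an assumption for illustration; other processors need their own encoding.

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    static long perf_event_open(perf_event_attr *attr, pid_t pid, int cpu,
                                int group_fd, unsigned long flags) {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;
        attr.config = 0x81d0;            // load-instruction event (assumed encoding)
        attr.sample_period = 137000;     // about 137K loads, as in Section 3
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR |
                           PERF_SAMPLE_WEIGHT | PERF_SAMPLE_DATA_SRC;
        attr.precise_ip = 2;             // request PEBS precision
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = (int) perf_event_open(&attr, 0 /* calling process */, -1, -1, 0);
        if (fd < 0) { std::perror("perf_event_open"); return 1; }

        // Buffer that receives the samples: one metadata page plus 2^n data pages.
        void *ring = mmap(0, (1 + 8) * sysconf(_SC_PAGESIZE),
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (ring == MAP_FAILED) { std::perror("mmap"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        // ... run the monitored code; an overflow handler drains the ring buffer ...
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        return 0;
    }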
Figure 2: The extensions to the Extrae instrumentation package allow monitoring at instrumentation points (e.g. function entry/exit points) as well as PEBS-based sample points.
The reader may wonder about the portability of the monitoring tool to other processors. We believe that the approach of using the perf subsystem holds for mechanisms similar to PEBS (e.g. IBS in AMD Opteron [10] and MRK in IBM POWER7 [11]); however, we cannot provide specific details.

It is worth mentioning that the metrics associated with the memory references depend on the monitored performance counter and vary within processor families. For instance, Intel Xeon processors extend PEBS with the Load-Latency feature, which allows monitoring load instructions and provides the address referenced, the access cost and which part of the memory hierarchy provided the data. Store instructions, however, only provide information regarding the address referenced and whether the access hit in cache.

We want to highlight that PEBS records do not contain a time-stamp (the Intel Skylake generation introduces time-stamps in the PEBS records). Also, as we stated before, the PEBS buffer is forwarded to the performance tool through an interrupt handler when the buffer is full. Since Extrae needs to associate a time-stamp with each referenced address, we have taken the approach of allocating a 1-entry buffer in Extrae. When PEBS interrupts the tool after generating each sample, the interrupt handler associates a time-stamp with it.

The illustration shown in Figure 2 depicts where the combined instrumentation- and sampling-based monitoring occurs during the application execution. In the Figure, black markers represent instrumentation-based points that record when a routine has started or finished executing, while red markers represent when the PMU has interrupted the application for a PEBS sample after X loads. The instrumentation monitors capture the value of the performance counters and the top executing routine, while PEBS samples capture performance counters, a portion of the call-stack and the PEBS record associated with the sample.

Finally, the integration between Extrae and perf also includes multiplexing capabilities for the PEBS sampling. That is, Extrae not only automatically changes the performance counters being collected at runtime but can also multiplex over PEBS sampling events. Our implementation covers the case in which Extrae monitors load and store instructions within a single run by alternating between them.
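The 1-entry-buffer timestamping described above can be sketched as follows, under the assumption that the event was created with attr.wakeup_events = 1 so that every PEBS sample raises a wakeup; the overflow signal handler then attaches a wall-clock timestamp to the freshly produced record. This is illustrative, not Extrae's implementation, and it ignores async-signal-safety concerns.

    #include <fcntl.h>
    #include <signal.h>
    #include <time.h>
    #include <unistd.h>

    static volatile sig_atomic_t g_pending = 0;
    static struct timespec g_last_sample_time;

    static void on_sample(int, siginfo_t *, void *) {
        clock_gettime(CLOCK_MONOTONIC, &g_last_sample_time); // timestamp for this record
        g_pending = 1;   // the tool later drains the single record from the mmap ring
    }

    void arm_sample_signal(int perf_fd) {
        struct sigaction sa = {};
        sa.sa_sigaction = on_sample;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGIO, &sa, 0);
        fcntl(perf_fd, F_SETOWN, getpid());                         // deliver the signal to us
        fcntl(perf_fd, F_SETFL, fcntl(perf_fd, F_GETFL) | O_ASYNC); // SIGIO on each wakeup
    }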
2.2. Extensions to the Folding tool

The additional performance information captured by Extrae allows the Folding tool to enrich its outputs (the reports and the synthetic trace-file). The reports generated by the Folding tool correlate the progression within the source code, the performance and the address space in a single plot. As the reports are limited by display properties, all the meaningful data is included into a Paraver trace-file for a quantitative and more detailed analysis.

In the report, the address space is partitioned based on the existing data objects and labeled with the variable names (or call-stack) if available, which allows the analyst to identify the data structures. The report also includes the memory references, which show how the data objects (including the data objects wrapped through the API extensions in Extrae) are accessed. For multi-threaded/multi-process applications, one report is generated for each executing thread/process. This approach allows the analyst to explore each thread/process independently and to learn whether different threads are accessing shared or private variables, although this exploration has to be done manually at the moment. The forthcoming Section 3.2 provides a full-featured analysis example.

Although the Folding tool can combine the performance metrics from different processes (as long as they refer to the same code), this is no longer possible when combining memory-related information. The inclusion of the Address Space Layout Randomization (ASLR) security technique leads to unique address spaces on each process, even for the same binary, and makes it difficult to combine the address space information from multiple processes. As a result, the Folding tool only uses information from one process when exposing memory-related information. The ASLR mechanism required a manual matching of the data-objects in the initial prototype when analyzing reports generated for load and store instructions independently. This task was tedious, especially when applications refer to a large number of data objects. However, the usage of the multiplexing capabilities for the PEBS sampling mechanism in Extrae allows the Folding technique to depict the load and store references in the process address space with a single execution.
3. Application evaluation
We have evaluated several applications on the Jureca system [12] to show the usability of the extensions described above when exploring the load and store references. Each node of the system contains two Intel Xeon E5-2680v3 (codename Haswell) 12-core processors with hyper-threading enabled, for a total of 48 threads per node. The nominal and maximum "turbo" processor frequencies are 2.50 GHz and 3.30 GHz, respectively. The processor has three levels of cache with a line size of 64 bytes: level 1 consists of two independent 32 KByte caches (one for data, one for instructions) per core, level 2 is a 256 KByte unified cache per core, and level 3 is a 30 MByte cache shared among the cores of the socket. The applications have been compiled with the Intel C and Fortran compilers v15.0 and use the Intel MPI library v5.1. We have manually delimited with instrumentation points the main iteration loop body of the respective applications.

With respect to the Extrae configuration, we have only captured dynamically-allocated objects that are larger than or equal to 32 KByte. Applications have been sampled every 137K load instructions and every 8231K store instructions, and the package has been configured to multiplex them every 15 seconds. We use prime numbers to minimize the correlation between the sampling and the application periods. The store sampling period is higher than the load sampling period because the Load-Latency feature already subsamples load instructions through a randomized tagging mechanism. On the selected machine, the monitors (collecting time-stamp, call-stack and performance counters) take less than 2 µs to execute, for a measured overhead below 5% on the presented experiments.

For exemplification purposes, we have monitored the serial version of the Stream benchmark [13] (SHA-1 afe4e58ec9ba61eba0b8b65cb24789295f8a539e). The benchmark has been compiled using the GNU compiler suite, with each array being of size N = 2×… elements. As Stream accesses statically allocated variables through ordered linear accesses, we have modified the code so that: (i) the b array is no longer a static variable but is allocated by malloc, (ii) the Scale kernel loads data from pseudo-random indices of the c array, and (iii) we have delimited the main application loop using the Extrae API calls. Due to modification (ii), Scale executes additional instructions and exposes less locality of reference, thus we have reduced the loop trip count in this kernel to N/…. The resulting main loop is:

    for i := 1 to NTIMES do                                  ! main loop
        Extrae_function_begin()                              ! delimit begin of loop body
        for j := 1 to N   do c[j] := a[j];              od   ! Copy
        for j := 1 to N/… do b[j] := s * c[random(j)];  od   ! Scale
        for j := 1 to N   do c[j] := a[j] + b[j];       od   ! Add
        for j := 1 to N   do a[j] := b[j] + s * c[j];   od   ! Triad
        Extrae_function_end()                                ! delimit end of loop body
    od

Figure 3 shows the result of the extensions to the Folding mechanism. The Figure consists of three plots: source code references (top), address space load references (middle), and performance metrics (bottom). In the source-code profile each color represents the active routine (identified by a label of the form X[n], where X refers to the active routine and n refers to the most observed code line). Additionally, the purple dots represent a time-based profile of the sampled code lines, where the top (bottom) of the plot represents the beginning (end) of the source file.
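The Scale-kernel modification described earlier in this subsection can be made concrete with the following C-style sketch. The array size, the reduced trip count N/4 and the inlined pseudo-random index function are illustrative assumptions (the paper only states that the trip count was reduced and that random() is inlined and register-only); the Extrae delimiting calls are the names used in the pseudocode above.

    #include <cstdlib>

    extern "C" void Extrae_function_begin(void);   // hooks named in the pseudocode above
    extern "C" void Extrae_function_end(void);

    const std::size_t N = 1 << 24;                 // illustrative array size
    static double c[N];
    static double *b = 0;                          // (i) no longer static, allocated with malloc
    const double s = 3.0;

    inline std::size_t random_index(std::size_t j) // (ii) register-only pseudo-random index
    {
        return (j * 2654435761UL) % N;             // multiplicative hash, an assumption
    }

    void scale_modified() {
        if (!b) b = static_cast<double *>(std::malloc(N * sizeof(double)));
        Extrae_function_begin();                   // (iii) delimit the folded region
        for (std::size_t j = 0; j < N / 4; ++j)    // reduced trip count (assumed divisor)
            b[j] = s * c[random_index(j)];
        Extrae_function_end();
    }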
Figure 3: Analysis of the modified version of the Stream benchmark using the results from the Folding tool. There are triple-correlation time-lines for the main iteration (from top to bottom): source code, addresses referenced and performance.

Table 1: Classification and average costs of different accesses to the memory hierarchy per routine for the modified version of the Stream benchmark.
    Routine   Metric                      L1      LFB       L2    L3     DRAM
    Copy      % of load references        75.8%   22.5%     1.0%  0%     0.5%
    Copy      Average cost (in cycles)    7       28, 40    14    n/a    n/a
    Scale     % of load references        …       …         …     …      …
    Scale     Average cost (in cycles)    …       …         …     70     350, 800
    Add       % of load references        3.8%    73.1%     0%    18.6%  3.8%
    Add       Average cost (in cycles)    7       50, 100   n/a   70     400, 440
    Triad     % of load references        9.4%    74.2%     3.9%  8.9%   3.4%
    Triad     Average cost (in cycles)    7       74, 108   19    84     350

This plot shows that the application progresses through four routines (each representing a kernel) and that most of the activity observed in each of these routines occurs in a small number of lines. The second plot depicts the address space, including the variable names of allocated objects and the memory references to the address space. In this plot, the variables (either static or dynamically allocated) and their size are on the left Y-axis, if any, and the right Y-axis shows the address space. The dots in this plot show a time-based profile of the addresses referenced through load/store instructions. Load instructions are colored with a gradient that ranges from green to blue, referring to low and high access costs, respectively. Store instructions are colored in black. Finally, the third plot shows in black the achieved instruction rate (MIPS, referenced on the right Y-axis) within the instrumented region, as well as the L1D, L2 and L3 cache misses per instruction (on the left Y-axis) using red, orange and yellow, respectively. With this plot, the performance analyst can correlate different performance metrics and see how they progress as the execution traverses code regions and accesses data objects.

3.2.2. Analysis of the folding report
We outline several phenomena exposed by Figure 3. First, as expected, the access pattern in the Scale kernel to the variable c shows a randomized access pattern with many high-latency (blue) references, while storing into a portion of the memory allocated in line 181 of file stream.c (the original variable b). The straight lines formed by the references in the rest of the routines denote that they advance linearly and thus expose spatial locality, and the greenish color indicates that these references take less time to be served. Second, the instructions within routines Add and Triad reference two addresses per instruction on average, the loaded data comes from two independent variables (or streams) simultaneously, and their accesses go from low to high addresses, honoring the user code. Finally, and surprisingly, the Copy routine accesses the array in a downwards direction although the loop is written with its index going upwards. This effect occurs because the compiler has replaced the loop by a call to memcpy (from glibc 2.17) that reverses the loop traversal and uses SSSE3 vector instructions (through the actual implementation __memcpy_ssse3_back).

We observe that Triad and Copy achieve the highest and lowest MIPS rates, respectively. The low MIPS rate in Copy may be explained by the execution of vector instructions: these instructions take more cycles to complete, but as a single instruction operates on multiple data, the kernel finalizes faster. Additionally, we expected a noticeable difference regarding MIPS in Scale due to the introduction of the random access to the variable. However, we observe that the instruction rate is not significantly different between the kernels. This happens because the random() function is inlined and avoids accessing memory by means of registers; thus, the additional instructions do not miss in the cache and reduce the cache miss ratio per instruction. Globally speaking, we notice that the L2 cache miss ratio is similar to the L1D cache miss ratio. This effect suggests that L2 provides little benefit for this benchmark because L2 is not sufficiently large to keep the working set. More specifically, we observe in the Scale kernel that the L1D, L2 and L3 miss ratios are very similar (about 5%), indicating that each instruction that misses in L1D is likely to miss in L2 and L3 as well, as a result of the low temporal locality. In addition, we can estimate the memory bandwidth used in kernels that linearly access their variables (such as Copy, Add and Triad) if we consider that the whole variable is traversed (i.e. the loop has 1-stride accesses). Given these assumptions, the estimates indicate that Copy and Triad may use 20097 and 15263 MB/s of memory bandwidth, respectively. While these numbers are far from the nominal maximum memory bandwidth for a single socket (68 GB/s, see http://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2_50-GHz), the benchmark ran with one thread/process only and thus it is unlikely that it saturates the memory bus.

We have also explored the synthetic trace-files generated by the Folding tool using Paraver. Table 1 summarizes these results by showing the proportion of memory accesses to the different parts of the memory hierarchy as well as the average cost when accessing each part, depending on the active routine. Our first observation is the important contribution of the Line-Fill Buffer (LFB), both in terms of percentage of accesses and in terms of the average cost in cycles. The LFB is a buffer that keeps track of already requested cache-lines, so memory references served by the LFB refer to load instructions whose cache-lines were requested by earlier and still incomplete instructions, thus exposing locality. Consequently, the reported cost depends on the distance between load instructions and the service time. We highlight that LFB and DRAM costs show multi-modal behaviors with high variability. For instance, in the Scale kernel, data coming from DRAM takes either 350 or 800 cycles. It is also worth mentioning that DRAM and LFB provide about 13.8% and 80.5% of the data to the Scale routine, respectively, indicating a poor efficiency of the L1, L2 and L3 caches as a result of adding a random indirection.
Table 2: Association for the labels shown in Figure 4a, including the most observed code line (MOCL) for each region.

    Label   User function                      MOCL   Duration
    A a1    CalcVolumeForceForElems            …      …
    A a2    CalcVolumeForceForElems            …      …
    B       CalcHourglassControlForElems       …      …
    C       LagrangeNodal                      …      …
    D       CalcLagrangeElements               …      …
    E       CalcQForElems                      …      …
    F       ApplyMaterialPropertiesForElems    …      …
Table 3: Classification and average costs of different accesses to the memory hierarchy per routine for the Lulesh benchmark.

    Region  Subregion  Metric                      L1       LFB      L2      L3     DRAM
    A       a1         % of load references        99.66%   0.05%    0.28%   0%     0%
    A       a1         Average cost (in cycles)    7        25       14      n/a    n/a
    A       a2         % of load references        99.59%   0.14%    0.20%   0.06%  0%
    A       a2         Average cost (in cycles)    7        20       14      80     n/a
    B                  % of load references        98.37%   0.54%    1.00%   0.07%  0%
    B                  Average cost (in cycles)    7        21       14      49     n/a
    C                  % of load references        98.80%   1.19%    0%      0%     0%
    C                  Average cost (in cycles)    7        15       n/a     n/a    n/a
    D                  % of load references        99.44%   0.12%    0.36%   0%     0.06%
    D                  Average cost (in cycles)    7        14       14      n/a    n/a
    F                  % of load references        96.55%   2.48%    0.45%   0.16%  0.33%
    F                  Average cost (in cycles)    7        95, 230  14, 19  109    300, 600
The Livermore Unstructured Lagrange Explicit Shock Hydrodynamics (LULESH) proxy application [14] is a representative of simplified 3D Lagrangian hydrodynamics on an unstructured mesh. We have compiled the reference code of the application (downloaded from https://codesign.llnl.gov/lulesh/lulesh2.0.3.tgz) using the Intel compiler suite with the -O3 -xAVX -g compilation flags. We have delimited with instrumentation points the main iteration loop of the application and executed it using 27 MPI processes on two nodes of Jureca with a problem size of 96 for 200 iterations.
(a) Folding results for the main computation region of the Lulesh 2.0 benchmark.
(b) Folding results for the main computation region of the Lulesh 2.0 benchmark after the code modifications (at the same time-scale as in Figure 4a).

Figure 4: Analysis of Lulesh 2.0.

Table 4: Top 5 referenced variables in Lulesh identified by their allocation call-site.

    Allocation site   Size    % of references   Comment
    lulesh.h 156      7 MB    1.23%             coordinates
    lulesh.h 163      27 MB   1.20%             node list
    lulesh.h 143      7 MB    0.91%             forces
    lulesh.h 154      7 MB    0.89%             accelerations
    lulesh.h 147      7 MB    0.74%             velocities
Figure 4a shows the evolution of the code regions, the accesses to the address space and the performance within the main iteration of the benchmark. Due to lack of space in the plot, we have added the labels (A-F) manually; Table 2 shows the correspondence between the labels and the names of the routines. The main loop traverses seven application regions (A-F), in which A is divided into two phases (a1 and a2). Regions A-E show good MIPS performance with IPC rates close to 2, while region F exposes a much lower performance. Such a lower performance is correlated with an increase of the cache misses per instruction at all cache levels, although still below 4%. The high part of the address space refers to local variables allocated on the stack, and the rest of the allocations are performed through the new C++ language construct. We observe a larger number of modifications within the stack region compared to the other parts of the address space. We also notice that region a2 writes to a region of the memory space (addresses prefixed by …) that is later read by region B, and that phases E and F modify disjoint parts of the lower address space. This information would be valuable when searching for parallelization opportunities using data-dependent task-based programming models.

Table 3 provides detailed access statistics for the identified regions within the main iteration. We see a general trend: L1 serves most of the memory references, except for region F, which shows a high value in the number of accesses provided by the LFB, whose access cost exposes a bi-modal behavior between 95 and 230 core cycles.

Further analysis with Paraver shows that there are many referenced memory objects; we tabulate the most referenced ones in Table 4. The most referenced objects involve the nodelist (allocated in lulesh.h line 163) and the coordinates, forces, accelerations and velocities of each element. The four latter objects implement 3D floating-point arrays using three C++ vector containers (one per dimension) in a C++ class, i.e. as a struct of arrays (SoA). This storage method may not be efficient because the code pointed to by Table 2 shows concurrent accesses to the three dimensions of each element, which may result in poor locality because the memory references point to different containers.

We have changed the implementation of these 3D floating-point arrays to an array of structs (AoS), aiming to increase the locality (the transformation is sketched below). After applying the change, the Figure of Merit (FOM) increased from 11891 z/s to 12414 z/s, a 4.40% increase. (As a side note, the benchmark includes a header file, lulesh_tuple.h, to apply this change to additional structures beyond those we indicated, but its usage reduced the FOM to 11081.57 z/s.)
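A minimal sketch of the layout change, with illustrative names rather than LULESH's actual class definitions: the reference code keeps one container per dimension (SoA), while the modification stores the three coordinates of an element contiguously (AoS) so that concurrent accesses to x, y and z of the same element tend to fall in the same cache line.

    #include <vector>
    #include <cstddef>

    // Struct-of-arrays layout, similar in spirit to the reference LULESH storage:
    // the three dimensions live in three independent containers.
    struct CoordsSoA {
        std::vector<double> x, y, z;
        double norm2(std::size_t i) const {          // touches three distant locations
            return x[i]*x[i] + y[i]*y[i] + z[i]*z[i];
        }
    };

    // Array-of-structs layout used in the modification: one contiguous record per
    // element, so x, y and z of element i share a cache line in the common case.
    struct Coord { double x, y, z; };
    struct CoordsAoS {
        std::vector<Coord> c;
        double norm2(std::size_t i) const {
            return c[i].x*c[i].x + c[i].y*c[i].y + c[i].z*c[i].z;
        }
    };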
Additionally, if we focus on the longest region (F) and explore the pointed code, we observe that it refers to an inlined invocation of the routine EvalEOSForElems. The main loop of this routine consists of three inner loops that iterate over the number of elements, plus additional conditionals that may also execute an additional loop over all the elements. By joining these loops to increase the locality and reduce the number of branch instructions (sketched below), the FOM increased to 12480 z/s (a 4.80% increase from baseline). Figure 4b shows the results of the Folding process when applied to the modified binary. While the application behavior does not change abruptly, the overall MIPS rate is higher by 2%, responding to an L1 data-cache miss reduction of 9.6%. The optimized version also executes 2% fewer instructions due to the reduction of the branch instructions executed (15.8%).
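The loop-joining change can be pictured as follows; this is a hedged, simplified sketch, as the actual EvalEOSForElems loops operate on several work arrays and contain additional conditionals.

    #include <cstddef>

    // Before: three separate sweeps over the elements; each sweep re-reads the
    // same data and pays its own loop-control branches.
    void eval_split(double* a, double* b, double* c, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) a[i] = a[i] * 0.5;
        for (std::size_t i = 0; i < n; ++i) b[i] = b[i] + a[i];
        for (std::size_t i = 0; i < n; ++i) c[i] = b[i] * b[i];
    }

    // After: one fused sweep; a[i] and b[i] are reused while still in cache and
    // the number of branch instructions drops to a single loop's worth.
    void eval_fused(double* a, double* b, double* c, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            a[i] = a[i] * 0.5;
            b[i] = b[i] + a[i];
            c[i] = b[i] * b[i];
        }
    }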
The High Performance Conjugate Gradient (HPCG) benchmark evaluates computer systems based on a simple additive Schwarz, symmetric Gauss-Seidel preconditioned conjugate gradient solver [15]. We have compiled the reference code of the application using the Intel compiler suite with the -O3 -xAVX -g compilation flags. We have executed the benchmark using 24 MPI processes on a single node of the Jureca system with a problem size of nx = ny = nz = 104. The application first undergoes a setup phase to test the system resilience and ability to remain operational, and then it runs the execution phase. Here, we have ignored the setup phase and have delimited with instrumentation points the main execution phase only.

In a preliminary analysis of the application, we observed that most of the PEBS references were not associated with a memory object. This occurred because the application allocates its data using many consecutive tiny allocations (tens to hundreds of bytes), which are below the specified threshold (32 KByte) and thus not traced. The data objects are allocated using two different mechanisms, in lines 108-110 and 143 within the file GenerateProblem_ref.cpp, respectively. The first set of objects is allocated through the new C++ language construct, while the second set is allocated through the []-operator of the C++ STL-based map structures. We avoided creating huge event trace-files by grouping these allocations into two groups using the new API call from Extrae (a sketch of this usage follows). Each grouped allocation covers the first and last addresses of all the included allocations. Even though memory allocators may use different arenas (each on a different part of the address space) to reduce memory fragmentation, this approach served our purposes because the allocated regions were located in consecutive addresses.
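The grouping just described can be expressed with the new Extrae API extension. The call name and signature below are illustrative assumptions (the paper only states that the API takes the begin and end addresses of the region), not Extrae's documented interface:

    #include <cstddef>
    #include <vector>

    // Hypothetical wrapper for the API extension described in Section 2: it
    // registers a synthetic data object spanning [begin, end) so that PEBS
    // references falling inside it are attributed to one named object.
    extern "C" void Extrae_register_memory_region(const void* begin,
                                                  const void* end,
                                                  const char* name); // assumed name/signature

    void group_small_allocations(const std::vector<double*>& chunks,
                                 std::size_t chunk_elems) {
        if (chunks.empty()) return;
        // The HPCG allocations were consecutive, so the first and last chunks
        // bound the whole set; one synthetic object covers all of them.
        const void* begin = chunks.front();
        const void* end   = chunks.back() + chunk_elems;
        Extrae_register_memory_region(begin, end, "sparse_matrix_rows");
    }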
Figure 5a shows the result of the Folding tool when applied to the modified version of the HPCG benchmark, and Table 5 associates the code regions (A-E) shown in the Figure with the actual code.

Table 5: Code association for the labels shown in Figure 5a, including the most observed code line (MOCL) for each region.

    Label   User function        MOCL   Duration
    A a1    ComputeSYMGS_ref     76     147 ms
    A a2    ComputeSYMGS_ref     95     143 ms
    B       ComputeSPMV_ref      68     116 ms
    C       ComputeMG_ref        47     96 ms
    D d1    ComputeSYMGS_ref     76     147 ms
    D d2    ComputeSYMGS_ref     95     143 ms
    E       ComputeSPMV_ref      68     136 ms
Table 6: Classification and average costs of different accesses to the memory hierarchy per routine for the HPCG benchmark.

    Region  Subregion  Metric                      L1      LFB     L2    L3    DRAM
    A       a1         % of load references        58.8%   30.7%   1.8%  1.6%  1.2%
    A       a1         Average cost (in cycles)    7       15, 70  14    50    350, 450
    A       a2         % of load references        61.8%   21.8%   2.2%  1.6%  2.0%
    A       a2         Average cost (in cycles)    7       15      14    65    350, 700
    B                  % of load references        58.9%   30.9%   1.5%  1.6%  1.2%
    B                  Average cost (in cycles)    7       60      14    65    540
    C                  % of load references        62.2%   24.8%   2.0%  1.4%  2.0%
    C                  Average cost (in cycles)    7       70      14    110   300, 450
    D       d1         % of load references        59.9%   30.5%   2.2%  1.0%  1.4%
    D       d1         Average cost (in cycles)    7       15, 70  14    50    350
    D       d2         % of load references        62.7%   22.5%   1.7%  1.0%  2.3%
    D       d2         Average cost (in cycles)    7       15      15    50    350
    E                  % of load references        57.1%   33.8%   1.7%  1.5%  0.5%
    E                  Average cost (in cycles)    7       15, 50  16    70    270, 330

We notice that each iteration consists of two rounds of calls to ComputeSYMGS_ref (labels A and D) and ComputeSPMV_ref (labels B and E), and in between there is a call to ComputeMG_ref (label C). We identify linear accesses in the higher and lower parts of the address space. More precisely, regions A and D present a phase (labeled a1 and d1, in blue) that accesses the address space from lower to upper addresses, followed by a phase (labeled a2 and d2, also in blue) that accesses the address space from upper to lower addresses. The lower-to-upper accesses represent one forward sweep, while the upper-to-lower accesses represent a backward sweep. It is worth noting that there are no stores (i.e. black points) in the lower part of the address space during the execution phase, suggesting that this data has been written in an earlier application phase.

From the performance perspective, the code does not exceed 1500 MIPS (an IPC of 0.6 at the nominal frequency). The transitions between phases expose higher instruction and branch instruction rates and a cache miss reduction. Within routines, the instruction rate increases marginally when the application moves from the forward sweep to the backward sweep (regions a1 to a2 and d1 to d2).

The report indicates that a1 and a2 traverse the whole data structure; the approximations for the memory bandwidth while traversing the structure are 4197 MB/s and … MB/s, respectively, while region B achieves 6427 MB/s. These values are smaller than the bandwidth observed in the Stream example, but we have to consider that (i) there were 24 MPI ranks running in the same node and competing for the bandwidth, and (ii) the report provides the performance for a single process/thread.
Table 7: Top 5 referenced variables in HPCG identified by their allocation call-site.

    Allocation site               Size     % of refs   Comment
    GenerateProblem_ref.cpp 124   617 MB   46.21%      sparse matrix
    GenerateProblem_ref.cpp 205   89 MB    4.96%       global/local maps
    GenerateProblem_ref.cpp 205   89 MB    4.96%       local/global maps
    GenerateProblem_ref.cpp 124   78 MB    4.74%       sparse matrix
    GenerateProblem_ref.cpp 124   10 MB    0.70%       sparse matrix

(In the instrumented source, the allocation in line 124 corresponds to lines 108-110 in the original code, and the allocation in line 205 corresponds to lines 132-134 in the original code.)

Table 6 shows that the backward sweeps hit approximately 3% more in L1D compared to the forward sweeps, about 8-9% less in the LFB, and that an additional 1% of the references miss in all the caches and have to go to DRAM. The data provided by the LFB presents a multi-modal access cost that is difficult to characterize.

The results shown in Table 7 prove the high number of references to the memory objects that we wrapped. It is known that the C library does not provide consecutive addresses for consecutive allocation calls because of (i) internal book-keeping to track free blocks, (ii) the minimum allocation size, and (iii) alignment padding if needed. Consequently, an object created in a single allocation will be more compact than an object created through many small allocation calls and is thus likely to expose better spatial locality. With this in mind, we changed the allocation of the data objects to minimize the number of allocations; a sketch of the change follows.
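The change can be pictured as replacing the per-row allocations by one backing block that is then partitioned. The names below are illustrative and do not reproduce HPCG's actual data structures:

    #include <cstddef>

    // Before (schematic): one small allocation per matrix row, so rows may be
    // scattered across the heap and padded individually by the allocator.
    double** alloc_rows_scattered(std::size_t rows, std::size_t nnz_per_row) {
        double** values = new double*[rows];
        for (std::size_t r = 0; r < rows; ++r)
            values[r] = new double[nnz_per_row];
        return values;
    }

    // After (schematic): a single backing block partitioned into rows; the data
    // is contiguous, which reduces allocator overhead and improves spatial locality.
    double** alloc_rows_packed(std::size_t rows, std::size_t nnz_per_row) {
        double*  block  = new double[rows * nnz_per_row];
        double** values = new double*[rows];
        for (std::size_t r = 0; r < rows; ++r)
            values[r] = block + r * nnz_per_row;
        return values;
    }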
Using this modified version of the code, the FOM reported by the benchmark increased from 9.95 to 15.64 GFLOP/s (57% higher than the original), and the performance results of the new version are shown in Figure 5b. The Figure shows that the main computation phase in the new version lasts approximately 618 ms (37% less) and that cache misses have decreased (for instance, L1D misses, in red, are always below 5%). Regarding the address space, we observe the following. First, the (wrapped) memory object allocated in GenerateProblem_ref.cpp line 124 split into two memory objects. Second, the (wrapped) object allocated in the original version occupied 617 MB, while the two objects in the newer version occupy 346 MB (56% of the original size), demonstrating that the object is more packed and might expose better spatial locality. Third, the linear accesses that we recognized in the (wrapped) object are still visible in the two objects, but there are concurrent linear accesses to both. With respect to performance, we notice a higher MIPS rate (a 70% increase compared to the original) due to improved cache usage, which largely compensates for the additional instructions executed (7%). Regions a1, a2 and B show less bandwidth usage (3844, 4325 and 5580 MB/s, respectively) than the previous version, which means that there is room for growth.
(a) Folding results for the main computation region of the HPCG 3.0 benchmark.
(b) Folding results for the main computation region of the HPCG 3.0 benchmark after the code modifications (at the same time-scale as in Figure 5a).

Figure 5: Analysis of HPCG 3.0.

4. Related work

This section describes earlier approaches related to performance analysis tools that have focused to some extent on the analysis of data structures and the efficiency achieved while accessing them. We divide this research into two groups depending on the mechanism used to capture the addresses referenced by the load/store instructions.

The first group includes tools that instrument the application instructions to obtain the referenced addresses. MemSpy [16] is a prototype tool to profile applications on a system simulator that introduces the notion of data-oriented, in addition to code-oriented, performance tuning. This tool instruments every memory reference of an application run and forwards the references to a memory simulator that calculates statistics such as cache hits and cache misses according to a given cache organization. SLO [17] suggests locality optimizations by analyzing the application reuse paths to find the root causes of poor data locality. This tool extends the GCC compiler to capture the application's memory accesses, function calls, and loops in order to track data reuses, and then it analyzes the reuse paths to suggest code loop transformations. MACPO [18] captures memory traces and computes metrics for the memory access behavior of source-level data structures. The tool uses PerfExpert [19] to identify code regions with memory-related inefficiencies, then employs the LLVM compiler to instrument the memory references, and finally calculates several reuse factors and the number of data streams in a loop nest. Intel Advisor is a component of Intel Parallel Studio XE [20] that provides users with insight into their applications' vectorization. It relies on PIN [21] to instrument binaries and precisely correlates memory accesses in user-selected routines with source code. Tareador [22] is a tool that estimates how much parallelism can be achieved with a task-based data-flow programming model. The tool employs dynamic instrumentation to monitor the memory accesses of delimited regions of code in order to determine whether they can run simultaneously without data race conditions, and then it simulates the application execution based on this outcome. EVOP is an emulator-based data-oriented profiling tool to analyze actual program executions in a system equipped only with a DRAM-based memory [23]. EVOP uses dynamic instrumentation to monitor the memory references in order to detect which memory structures are the most referenced and then estimates the CPU stall cycles incurred by the different memory objects to decide their optimal placement in a heterogeneous memory system by means of the dmem_advisor tool [24]. ADAMANT [25] uses the PEBIL instrumentation package [26] and includes tools to characterize application data objects, to provide reports that help in algorithm design and tuning by devising optimal data placement, and to manage data movement to improve locality.

The second group of tools takes advantage of hardware mechanisms to sample the addresses referenced when processor counter overflows occur and to estimate the access weight from the sample count. The Oracle Developer Studio [27] (formerly known as Sun ONE Studio) incorporates a tool to explore memory system behavior in the context of the application's data space [28].
This extension brings the analyst independent and uncorrelated views that rank program counters and data objects according to hardware counter metrics, and it shows metrics for each element of the data object structures. HPCToolkit has been recently extended to support data-centric profiling of parallel programs [29], providing a graphical user interface that presents data- and code-centric metrics in a single panel, easing the correlation between the two. Roy and Liu developed StructSlim [30] on top of HPCToolkit to determine memory access patterns to guide structure splitting. Giménez et al. use PEBS to monitor load instructions that access addresses within memory regions delimited by user-specified data objects, focusing on those that surpass a given latency [31]. Then, they associate the memory behavior with semantic attributes, including the application context, which is shown through the MemAxes visualization tool.

The BSC tools for memory exploration adopt a hybrid approach combining PEBS-based sampling and minimal instrumentation usage, and their main difference from existing tools lies in the ability to report time-based memory access patterns in addition to source-code profiles and performance bottlenecks. Regarding the monitoring mechanism, the tool brings two benefits. First, limiting the instrumentation usage reduces the overhead suffered by the application and thus increases the representativeness of the performance results. Second, the folding mechanism allows the analyst to blindly choose a sampling frequency because the mechanism gathers samples from repetitive code regions into a synthetic one, and consequently minimizes the number of application executions. Regarding the results provided, the inclusion of the temporal analysis permits time-based studies such as the detection of simultaneous memory streams, the ordering of accesses to the memory hierarchy, and even insights for extracting parallelism through task-based data-flow programming models. The results also allow manually estimating the memory bus bandwidth usage per variable in a given region of code for linear accesses.
5. Conclusions
Memory hierarchies are getting more complex, and it is necessary to better understand the application behavior in terms of memory accesses to improve the application performance and prepare for future memory technologies. The PEBS hardware infrastructure assists with sampling memory-related instructions and gathers valuable details about the application behavior. We have described the latest extensions in the Extrae instrumentation package in order to enable performance analysts to understand the application and system behavior in terms of memory accesses, even for in-production optimized binaries. The additional extension to the folding mechanism depicts the temporal evolution of the memory accesses in a compute region by using a coarse-grain, non-intrusive sampling frequency and minimal instrumentation. The usage of these tools results in a thorough exploration of the memory access patterns of two state-of-the-art benchmarks without having to use high-frequency sampling and thus without incurring large overheads. The exploration included a scan of the memory access patterns from a time perspective and the identification of the most dominant data streams and their temporal evolution along the computing regions. As a result of this exploration, we have proposed small changes to both of them that improved their performance.

In addition to the optimization efforts, application developers can use the presented tools to explore how the address space is being accessed and confirm whether the results match their expectations. For instance, the results for the modified Stream show that a user can identify the modification applied to the benchmark as well as the compiler decision to replace the source code by a memcpy call that accesses the address space in reverse order compared to what the developer would expect. Concerning Lulesh, the results show potentially independent load and store accesses to the same parts of the address space by different routines, which may be a valuable insight for using data-dependent task-based programming models. Finally, the HPCG results show that the main routine traverses the address space two times (in a forward direction followed by a backward direction) and that a part of the address space is not modified. HPCG also shows different performance values for forward and backward sweeps, not only in cache miss ratios but also in the cost of providing data from memory.

Hardware architects may also find valuable insight in the results obtained. One possible suggestion according to the Stream results would be not to cache given parts of the address space in L2 for a period of time, with the consequent energy savings. Additionally, the results for HPCG indicate that a portion of the address space is only read during the execution phase, and thus this region may benefit from memory technologies where loads are faster than stores.
Acknowledgments
This work has been performed in the Intel-BSC Exascale Lab. We would like to thank Forschungszentrum Jülich for the compute time on the Jureca system. This project has received funding from the European Union's Horizon 2020 research and innovation program under Marie Skłodowska-Curie grant agreement No. 749516.
References

[1] G. H. Loh, 3D-stacked memory architectures for multi-core processors, in: Proceedings of the 35th Annual International Symposium on Computer Architecture (ISCA), IEEE Computer Society, Washington, DC, USA, 2008, pp. 453–464.
[2] C. Wang, S. S. Vazhkudai, X. Ma, F. Meng, Y. Kim, C. Engelmann, NVMalloc: Exposing an aggregate SSD store as a memory partition in extreme-scale machines, in: IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS), 2012, pp. 957–968.
[3] Barcelona Supercomputing Center, Extrae user guide. Last accessed November, 2017.
[4] H. Servat, G. Llort, J. Gimenez, K. Huck, J. Labarta, Unveiling internal evolution of parallel application computation phases, in: International Conference on Parallel Processing (ICPP), 2011, pp. 155–164.
[5] H. Servat, G. Llort, J. González, J. Giménez, J. Labarta, Low-overhead detection of memory access patterns and their time evolution, in: Euro-Par 2015: Parallel Processing - 21st International Conference on Parallel and Distributed Computing, 2015, pp. 57–69.
[6] A. C. de Melo, The new Linux "perf" tools, in: Linux Kongress, 2010, http://vger.kernel.org/~acme/perf-devconf-2015.pdf. Last accessed November, 2017.
[7] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3B: System Programming Guide, Part 2, 2016, Ch. 18.9.
[8] J. Labarta, S. Girona, V. Pillet, T. Cortes, L. Gregoris, DiP: A parallel program development environment, in: Proceedings of the Second International Euro-Par Conference on Parallel Processing - Volume II, Euro-Par '96, Springer-Verlag, London, UK, 1996, pp. 665–674.
[9] S. Browne, J. Dongarra, N. Garner, G. Ho, P. Mucci, A portable programming interface for performance evaluation on modern processors, Int. J. High Perform. Comput. Appl. 14 (3) (2000) 189–204, http://icl.cs.utk.edu/papi - Last accessed November, 2017.
[10] P. Drongowski, L. Yu, F. Swehosky, S. Suthikulpanit, R. Richter, Incorporating instruction-based sampling into AMD CodeAnalyst, in: IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), 2010, pp. 119–120.
[11] M. Srinivas, B. Sinharoy, R. Eickemeyer, R. Raghavan, S. Kunkel, T. Chen, W. Maron, D. Flemming, A. Blanchard, P. Seshadri, et al., IBM POWER7 performance modeling, verification, and evaluation, IBM Journal of Research and Development 55 (3) (2011) 4–1.
[12] Jureca system architecture. Last accessed November, 2017.
[13] J. D. McCalpin, Memory bandwidth and machine balance in current high performance computers, IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter (1995) 19–25.
[14] Hydrodynamics challenge problem, Tech. rep., Lawrence Livermore National Laboratory, https://codesign.llnl.gov/pdfs/LULESH2.0_Changes.pdf (2011). Last accessed November, 2017.
[15] J. Dongarra, M. A. Heroux, P. Luszczek, High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems, IJHPCA 30 (1) (2016) 3–10.
[16] M. Martonosi, A. Gupta, T. Anderson, MemSpy: Analyzing memory system bottlenecks in programs, in: Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, ACM, 1992, pp. 1–12.
[17] K. Beyls, E. D'Hollander, Refactoring for data locality, Computer 42 (2) (2009) 62–71.
[18] A. Rane, J. Browne, Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics, in: International Conference on Parallel Architectures and Compilation Techniques, 2012, pp. 147–156.
[19] M. Burtscher, B.-D. Kim, J. Diamond, J. McCalpin, L. Koesterke, J. Browne, PerfExpert: An easy-to-use performance diagnosis tool for HPC applications, in: Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC'10, IEEE Computer Society, Washington, DC, USA, 2010, pp. 1–11.
[20] Intel Parallel Studio XE, https://software.intel.com/en-us/intel-parallel-studio-xe. Last accessed November, 2017.
[21] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, K. Hazelwood, PIN: Building customized program analysis tools with dynamic instrumentation, in: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI'05, ACM, 2005, pp. 190–200.
[22] V. Subotic, R. Ferrer, J. C. Sancho, J. Labarta, M. Valero, Quantifying the potential task-based dataflow parallelism in MPI applications, in: Proceedings of the 17th International Conference on Parallel Processing - Volume Part I, Euro-Par'11, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 39–51.
[23] A. J. Peña, P. Balaji, A framework for tracking memory accesses in scientific applications, in: 43rd International Conference on Parallel Processing Workshops (ICPPW), IEEE, 2014, pp. 235–244.
[24] A. J. Peña, P. Balaji, Toward the efficient use of multiple explicitly managed memory subsystems, in: IEEE International Conference on Cluster Computing (CLUSTER), 2014, pp. 123–131.
[25] P. Cicotti, L. Carrington, ADAMANT: Tools to capture, analyze, and manage data movement, in: International Conference on Computational Science (ICCS), 2016, pp. 450–460.
[26] M. Laurenzano, M. M. Tikir, L. Carrington, A. Snavely, PEBIL: Efficient static binary instrumentation for Linux, in: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2010, pp. 175–183.
[27] Oracle Developer Studio.