An ECM-based energy-efficiency optimization approach for bandwidth-limited streaming kernels on recent Intel Xeon processors
Johannes Hofmann
Department of Computer Science, University of Erlangen-Nuremberg, Erlangen, Germany
[email protected]

Dietmar Fey
Department of Computer Science, University of Erlangen-Nuremberg, Erlangen, Germany
[email protected]
ABSTRACT
We investigate an approach that uses low-level analysis and the execution-cache-memory (ECM) performance model in combination with tuning of hardware parameters to lower the energy requirements of memory-bound applications. The ECM model is extended appropriately to deal with software optimizations such as non-temporal stores. Using incremental steps and the ECM model, we analytically quantify the impact of various single-core optimizations and pinpoint microarchitectural improvements that are relevant to energy consumption. Using a 2D Jacobi solver as an example that can serve as a blueprint for other memory-bound applications, we evaluate our approach on the four most recent Intel Xeon E5 processors (Sandy Bridge-EP, Ivy Bridge-EP, Haswell-EP, and Broadwell-EP). We find that chip energy consumption can be reduced in the range of 2.0–2.4× on the examined processors.

Keywords
ECM; 2D Jacobi; Performance Engineering; Energy Optimization
1. INTRODUCTION AND REL. WORK
For new HPC installations, the contribution of power usage to total system cost has been increasing steadily over the past years [10], and studies project this trend to continue [7]. As a consequence, energy-aware metrics have recently been gaining popularity. Energy-to-solution, i.e. the amount of energy consumed by a system to solve a given problem, is the most obvious
of these metrics and will be used as the quality measure throughout this study.

Previous works view the application code as constant and instead focus their energy-optimization efforts on parameter tuning of either the runtime environment, the hardware, or both [11]. The runtime-environment approach works by adjusting the number of active threads across different parallel regions based on the regions' computational requirements [1]. Hardware parameter tuning involves trying to identify slack, e.g. during MPI communication, and using dynamic voltage and frequency scaling (DVFS) to lower energy consumption in such sections of low computational intensity [14]; similar strategies can be applied to OpenMP barriers [2]. The parameters employed by these algorithms can be based on measurements made at runtime [17] or set statically [16]. One fact often ignored by DVFS control software, however, is that the hardware delays caused by changing frequency states can often be significant [12], which can lead to diminishing returns in the real world. Another form of hardware parameter optimization involves the tuning of hardware prefetchers [21].

In contrast to previous work, our approach focuses on increasing single-core performance, primarily through software optimization. Using the execution-cache-memory (ECM) performance model to guide optimizations such as SIMD vectorization, cache blocking, non-temporal stores, and the use of cluster-on-die (COD) mode, the bandwidth consumption of a single core is maximized. This allows the chip to saturate main memory bandwidth with the least number of active cores in the multi-core scenario, thus reducing power as well as the energy consumed by the chip. In a second step, hardware parameters such as the number of active cores and their frequencies are evaluated to further reduce energy consumption. The approach is demonstrated on a 2D Jacobi solver, which acts as a proxy for many memory-bound applications. To produce significant results, all experiments were performed on the four most recent server microarchitectures by Intel, which make up about 91% of the HPC installations of the June 2016 Top 500 list.

[Figure 1: Bandwidth and energy-to-solution for different 2D Jacobi implementations using a dataset size of 16 GB. Panel (a): memory bandwidth vs. active cores for the naive and optimized versions, with the sustained bandwidth b_s marked; panel (b): energy-to-solution for both versions.]

The paper is organized as follows. Section 2 describes the blueprint of and the reasoning behind our energy-efficiency optimization approach. Section 3 introduces the ECM performance model that we use to guide optimizations. Section 4 contains an overview of the systems used for benchmarking. Core-level improvement efforts are documented in Section 5, followed by a validation of the single-core results in Section 6. Implications for the multi-core scenario are discussed in Section 7, followed by the conclusion in Section 8.
2. OPTIMIZATION APPROACH
Per definition, the bottleneck for bandwidth-bound applications is the sustained bandwidth b_s imposed by the memory subsystem. It is a well-established fact that a single core of modern multi- and many-core systems cannot saturate main memory bandwidth [19], necessitating the use of multiple cores to reach peak performance. The number of cores n_s required to sustain main memory bandwidth depends on single-core performance: the faster a single-core implementation, the more bandwidth is consumed by a single core; and the more bandwidth is consumed by a single core, the fewer cores are required to saturate main memory bandwidth.

This relation is illustrated in Fig. 1, which depicts measurement results obtained on a single socket of a standard two-socket Broadwell-EP machine (cf. Section 4 for hardware details). Fig. 1a shows memory bandwidth as a function of active cores for two different 2D Jacobi implementations; Fig. 1b depicts the energy-to-solution required for both implementations. From this real-world example two important conclusions can be drawn:

1. Memory-bound codes should be run using n_s cores. Once the bandwidth bottleneck is hit, adding more cores no longer increases performance. Instead, using more cores increases chip power consumption, resulting in a higher energy-to-solution. This effect is visible in Fig. 1b. (In practice, performance even decreases slightly when using more than n_s cores, as the prefetchers and the memory subsystem have to deal with additional memory streams, which makes them less efficient; this effect can be observed in Fig. 1a.)

2. Early saturation can lead to lower energy-to-solution. The optimized version saturates memory bandwidth with six cores compared to the ten cores required by the naive counterpart. Six instead of ten active cores typically translates into a lower power draw for the version using fewer cores. Together with the fact that both saturated versions take the same time to solution (16 GB / b_s), this results in a better energy-to-solution for the optimized version. (In theory, an implementation that saturates memory bandwidth with fewer cores could still be less energy-efficient than one that uses more: consider an optimization that doubles single-core performance but triples the power drawn by the core; n_s is halved, but energy-to-solution is 50% higher nonetheless. In practice we have never observed such a scenario; for all optimizations described in Section 5 the increase in single-core power draw is negligible.)

After establishing these facts we propose the following approach to optimize energy consumption: (1) use the ECM model to guide performance improvements of the single-core implementation; (2) attempt to lower per-core power draw by tuning hardware parameters, e.g. core frequency or COD mode; (3) never run the code with more cores than necessary (n_s).
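The reasoning behind steps (1)–(3) can be condensed into a simple back-of-the-envelope power model; note that the symbols P_0 (package power independent of the number of active cores) and P_c (additional power per active core) are illustrative and not taken from our measurements:

    t = V / b_s                       (runtime of a saturated memory-bound kernel,
                                       fixed by the data volume V and b_s)
    E(n) = (P_0 + n · P_c) · t        (energy-to-solution with n >= n_s active cores)

Since t does not change once the bandwidth bottleneck is hit, every core beyond n_s adds P_c · t to the energy without improving performance; E(n) is therefore minimized at n = n_s, and anything that lowers n_s or the per-core power (e.g. frequency tuning) lowers E further.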
3. THE ECM PERFORMANCE MODEL
The ECM model [19, 3, 18, 4, 5] is an analytic performance model that, with the exception of the sustained memory bandwidth, works exclusively with architecture specifications as inputs. The model estimates the number of CPU cycles required to execute a number of iterations of a loop on a single core of a multi- or many-core chip. For multi-core estimates, linear scaling of single-core performance is assumed until a shared bottleneck, such as e.g. main memory bandwidth, is hit. Note that only the parts of the model relevant to this work, i.e. the single-core model, are presented here. Readers interested in the full ECM model can find the most recent version for Intel Xeon, Intel Xeon Phi, and IBM POWER8 processors in [6].

The single-core prediction is made up of contributions from the in-core execution time T_core, i.e. the time spent executing instructions in the core under the assumption that all data resides in the L1 cache, and the transfer time T_data, i.e. the time spent transferring data from its location in the cache/memory hierarchy to the L1 cache. As data transfers in the cache and memory hierarchy occur at cache-line (CL) granularity, we chose the number of loop iterations n_it to correspond to one cache line's "worth of work." On Intel architectures, where CLs are 64 B in size, n_it = 8 when using double-precision (DP) floating-point numbers, because processing eight "doubles" (8 B each) corresponds to exactly one CL worth of work.

Superscalar core designs house multiple execution units, each dedicated to performing certain work: loading, storing, multiplying, adding, etc. The in-core execution time T_core is determined by the unit that takes the longest to retire the instructions allocated to it. Other constraints for the in-core execution time may apply, e.g. the four micro-op per cycle retirement limit of Intel Xeon cores. The model differentiates between core cycles depending on whether data transfers in the cache hierarchy can overlap with in-core execution. For instance, on Intel Xeons, core cycles in which data is moved between the L1 cache and registers, i.e. cycles in which load and/or store instructions are retired, prohibit the simultaneous transfer of data between the L1 and L2 cache; these "non-overlapping" cycles contribute to T_nOL. Cycles in which no load or store instructions but other instructions, such as e.g. arithmetic instructions, retire are considered "overlapping" cycles and contribute to T_OL. The in-core runtime is the maximum of both: T_core = max(T_OL, T_nOL).

For modelling data transfers, latency effects are initially neglected, so transfer times are exclusively a function of bandwidth. Cache bandwidths are typically well documented and can be found in vendor data sheets. Depending on how many CLs have to be transferred, the contribution of each level in the memory hierarchy (T_L1L2, ..., T_L3Mem) can be determined. Special care has to be taken when dealing with main memory bandwidth, because the theoretical memory bandwidth specified in the data sheet and the sustained memory bandwidth b_s can differ greatly. Also, in practice b_s depends on the number of load and store streams. It is therefore recommended to determine b_s empirically using a kernel that resembles the memory access pattern of the benchmark to be modeled. Once determined, the time to transfer one CL between the L3 cache and main memory can be derived from the CPU frequency f as 64 B · f / b_s cycles.

Starting with the Haswell-EP (HSW) microarchitecture, an empirically determined latency penalty T_p is applied to off-core transfer times. This departure from the bandwidth-only model has been made necessary by large core counts, the dual-ring design, and separate clock frequencies for core(s) and Uncore, all of which increase latencies when accessing off-core data. The penalty is added each time the Uncore interconnect is involved in data transfers. This is the case whenever data is transferred between the L2 and L3 caches, as data is pseudo-randomly distributed between all last-level cache segments, and whenever data is transferred between the L3 cache and memory, because the memory controller is attached to the Uncore interconnect. Instruction times as well as data transfer times, e.g. T_L1L2 for the time required to transfer data between the L1 and L2 caches, can be summarized in a shorthand notation: {T_OL ‖ T_nOL | T_L1L2 | T_L2L3 + T_p | T_L3Mem + T_p}.

To arrive at a prediction, in-core execution and data transfer times are put together. Depending on whether there exist enough overlapping cycles to hide all data transfers, the runtime is given by either T_OL or the sum of the non-overlapping core cycles T_nOL plus the contributions of the data transfers T_data, whichever takes longer. T_data consists of all necessary data transfers in the cache/memory hierarchy, plus latency penalties if applicable, e.g. for data coming from the L3 cache: T_data = T_L1L2 + T_L2L3 + T_p. The prediction is thus T_ECM = max(T_OL, T_nOL + T_data). A shorthand notation also exists for the model's prediction: {T_ECM^core ⌉ T_ECM^L2 ⌉ T_ECM^L3 ⌉ T_ECM^Mem}. Converting the prediction from time (in cycles) to performance (work per second) is done by dividing the work per CL, W_CL (e.g. floating-point operations, updates, or any other relevant work metric), by the predicted runtime in cycles and multiplying with the processor frequency f, i.e. P_ECM = W_CL / T_ECM · f.

Table 1: Test machine specifications.

    Microarchitecture (Shorthand)  Sandy Bridge-EP (SNB)   Ivy Bridge-EP (IVB)     Haswell-EP (HSW)        Broadwell-EP (BDW)
    Chip Model                     Xeon E5-2680            Xeon E5-2690 v2         Xeon E5-2695 v3         Pre-release
    Release Date                   Q1/2012                 Q3/2013                 Q3/2014                 Q1/2016
    non-AVX/AVX Base Freq.         2.7 GHz/2.7 GHz         3.0 GHz/3.0 GHz         2.3 GHz/1.9 GHz         2.1 GHz/2.0 GHz
    Cores/Threads                  8/16                    10/20                   14/28                   18/36
    Latest SIMD Extensions         AVX                     AVX                     AVX2, FMA3              AVX2, FMA3
    Core-Private L1/L2 Caches      8×32 kB / 8×256 kB      10×32 kB / 10×256 kB    14×32 kB / 14×256 kB    18×32 kB / 18×256 kB
    Shared Last-Level Cache        20 MB (8×2.5 MB)        25 MB (10×2.5 MB)       35 MB (14×2.5 MB)       45 MB (18×2.5 MB)
    L1 → Reg Bandwidth             2×16 B/cy               2×16 B/cy               2×32 B/cy               2×32 B/cy
    Reg → L1 Bandwidth             1×16 B/cy               1×16 B/cy               1×32 B/cy               1×32 B/cy
    L1 ↔ L2 Bandwidth              32 B/cy (2 cy/CL)       32 B/cy (2 cy/CL)       64 B/cy (1 cy/CL)       64 B/cy (1 cy/CL)
    L2 ↔ L3 Bandwidth              32 B/cy (2 cy/CL)       32 B/cy (2 cy/CL)       32 B/cy (2 cy/CL)       32 B/cy (2 cy/CL)
    L3 ↔ Mem Bandwidth (copy)      14.5 B/cy (4.4 cy/CL)   20.0 B/cy (4.4 cy/CL)   21.7 B/cy (2.9 cy/CL)   24.6 B/cy (2.6 cy/CL)
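To make the composition of the single-core prediction concrete, the following minimal C sketch (not part of the original paper) evaluates T_ECM and P_ECM for given per-CL contributions; the numbers in main() correspond to the SNB baseline model derived in Section 5.

    #include <stdio.h>

    /* Single-core ECM prediction: T_ECM = max(T_OL, T_nOL + T_data). */
    static double t_ecm(double t_ol, double t_nol, double t_data)
    {
        double t = t_nol + t_data;   /* non-overlapping core cycles plus transfers */
        return t_ol > t ? t_ol : t;
    }

    int main(void)
    {
        /* SNB baseline from Section 5: {6 || 8 | 10 | 10 | 13.2} cy per cache line */
        double t_ol   = 6.0, t_nol = 8.0;
        double t_data = 10.0 + 10.0 + 13.2;   /* T_L1L2 + T_L2L3 + T_L3Mem */
        double f      = 2.7e9;                /* core frequency in Hz      */
        double w_cl   = 8.0;                  /* work per cache line: 8 LUP */

        double t = t_ecm(t_ol, t_nol, t_data);          /* 41.2 cy         */
        printf("T_ECM = %.1f cy, P_ECM = %.0f MLUP/s\n",
               t, w_cl / t * f / 1e6);                  /* ~524 MLUP/s     */
        return 0;
    }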
4. EXPERIMENTAL TESTBED
All measurements were performed on standard two-socket Intel Xeon servers. A summary of the key specifications of the four generations of processors can be found in Table 1. According to Intel's "tick-tock" model, where a "tick" corresponds to a shrink of the manufacturing process technology and a "tock" to a new microarchitecture, IVB and BDW are "ticks": apart from an increase in core count and a faster memory clock, no major improvements were introduced in these microarchitectures.

HSW, which is a "tock", introduced AVX2, extending the already existing 256-bit SIMD vectorization from floating-point to integer data types. Instructions introduced by the fused multiply-add (FMA) extension are handled by two new, AVX-capable execution units. The data paths between the L1 cache and registers as well as between the L1 and L2 caches were doubled. Due to the limited scalability of a single ring connecting the cores, HSW chips with more than eight cores feature a dual-ring design. HSW also introduces the AVX base and maximum AVX Turbo frequencies. The former is the minimum guaranteed frequency when running AVX code on all cores (the non-AVX base frequency, in contrast, is the guaranteed frequency when running code on all cores that does not use AVX instructions); the latter is the maximum frequency when running AVX code on all cores (cf. Table 3 in [9]). Based on the workload, the actual frequency varies between this minimum and maximum value. For a more detailed analysis of the HSW microarchitecture see [4].

5. SINGLE-CORE OPTIMIZATIONS

Figure 2 shows the source code for one 2D five-point Jacobi sweep, i.e. the complete update of all grid points (a minimal sketch of the kernel is given at the end of this introduction, after the layer-condition discussion). One grid-point update computes and stores in b the new state of a point from the values of its four neighbors in a, which holds the data from the previous iteration. For results to be representative, the dataset size per socket for all measurements is 16 GB, i.e. each of the two grids is 8 GB in size and made up of 32768 × 32768 DP floating-point numbers.

Using adequate optimization flags (-O3 -xHost -fno-alias) it is trivial to generate AVX-vectorized assembly for the code shown in Figure 2 with recent Intel compilers. This is why we decided to use an AVX-vectorized variant instead of scalar code as the baseline. With 256-bit AVX vectorization in place, one CL worth of work (eight LUPs) consists of eight AVX loads, two AVX stores, six AVX add, and two AVX multiplication instructions.

To determine the data transfers inside the cache hierarchy, we have to examine each load and store in more detail. Storing the newly computed results to array b involves the transfer of two CLs to/from main memory: because both arrays are too large to fit inside the caches, the store will miss in the L1 cache, triggering a write-allocate of the CL from main memory; after the values in the CL have been updated, the CL will eventually have to be evicted from the caches, triggering another main memory transfer. The left neighbor a[y][x-1] can always be loaded from the L1 cache, since it was used two inner iterations before as the right neighbor; a[y+1][x] must be loaded from main memory, since it was not used before within the sweep.

Based on work by Rivera and Tseng [15], Stengel et al. introduced the layer condition (LC) [18] to help determine where the data for a[y-1][x] and a[y][x+1] is coming from. The LC stipulates that three successive rows have to fit into a certain cache for accesses to these data to come from this particular cache. Assuming cache k can effectively hold data up to 50% of its nominal size C_k, the LC can be formulated as 3 · N · 8 B < 0.5 · C_k. For N = 32768, three rows take up 768 kB, so on all previously introduced machines (cf. the shared last-level cache row in Table 1) the LC holds true for the L3 cache.
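Since the listing in Figure 2 is not reproduced here, the following is a minimal C sketch consistent with the description above; the coefficient 0.25 and the exact loop bounds are assumptions, and a and b are N×N arrays of doubles.

    /* One 2D five-point Jacobi sweep: b is computed from the previous iterate a. */
    for (long y = 1; y < N - 1; ++y)
        for (long x = 1; x < N - 1; ++x)
            b[y][x] = 0.25 * ( a[y][x-1] + a[y][x+1]
                             + a[y-1][x] + a[y+1][x] );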
ECM Model for SNB and IVB

To process one CL worth of data, eight AVX load, two AVX store, six AVX addition, and two AVX multiplication instructions have to be executed. Throughput is limited by the two load units. Each load unit has a 16 B wide data path connecting registers and L1 cache, so retiring a 32 B AVX load takes two cycles. Using both load units, the eight AVX loads take T_nOL = 8 cy. Both AVX stores are retired in parallel with the eight loads, so they do not increase T_nOL. Computation throughput is limited by the single add port, which takes T_OL = 6 cy to retire all six AVX add instructions. Both AVX multiplications can be processed in parallel with two of the six AVX add instructions.

As established previously, three CLs have to be transferred between L3 and memory: write-allocating and later evicting b[y][x], and loading a[y+1][x]. On both SNB and IVB, the L3-memory transfer time is 4.4 cy/CL (cf. the last row of Table 1). This results in T_L3Mem = 13.2 cy. In addition to these three CLs, a[y-1][x] and a[y][x+1] have to be transferred from the L3 to the L2 cache. Transferring five CLs at 2 cy/CL takes T_L2L3 = 10 cy on both SNB and IVB. At an L1-L2 bandwidth of 2 cy/CL, moving these five CLs between the L2 and L1 caches takes T_L1L2 = 10 cy. Using the ECM shorthand notation to summarize the inputs yields {6 ‖ 8 | 10 | 10 | 13.2} cy for both SNB and IVB; the corresponding runtime prediction for SNB and IVB is {8 ⌉ 18 ⌉ 28 ⌉ 41.2} cy. For a 2.7 GHz SNB core the performance prediction is P_ECM^Mem = 8 LUP/CL / 41.2 cy/CL · 2.7 Gcy/s ≈ 524 MLUP/s. For IVB, the model predicts a performance of 582 MLUP/s.

ECM Model for HSW and BDW

On HSW and BDW, the address generation units (AGUs) are the bottleneck for T_nOL. Each load/store instruction accesses an AGU to compute the referenced memory address. With only two AGUs capable of performing the required addressing operations available, retiring all ten load/store instructions takes T_nOL = 5 cy. HSW and BDW possess a single AVX add unit just like SNB and IVB, so T_OL = 6 cy.

The off-core latency penalty was empirically estimated at approximately 1.6 cycles for both HSW and BDW and is applied per CL transfer that takes place over the Uncore interconnect, i.e. all transfers between the L3 and L2 caches as well as transfers between memory and the L3 cache. The effective L3-Mem transfer time is thus 2.9+1.6 cy/CL on HSW and 2.6+1.6 cy/CL on BDW; the effective L2-L3 transfer time is 2+1.6 cy/CL on both HSW and BDW. Transferring the three required CLs then results in T_L3Mem = 8.7+4.8 cy on HSW and T_L3Mem = 7.8+4.8 cy on BDW; transferring the five CLs between L3 and L2 takes T_L2L3 = 10+8 cy on both HSW and BDW. At an L1-L2 bandwidth of 1 cy/CL, moving the same five CLs takes T_L1L2 = 5 cy on both microarchitectures. The ECM inputs are thus {6 ‖ 5 | 5 | 10+8 | 8.7+4.8} cy for HSW and {6 ‖ 5 | 5 | 10+8 | 7.8+4.8} cy for BDW. The corresponding runtime predictions are {6 ⌉ 10 ⌉ 28 ⌉ 41.5} cy for HSW and {6 ⌉ 10 ⌉ 28 ⌉ 40.6} cy for BDW. The performance predictions are 443 MLUP/s for HSW and 413 MLUP/s for BDW.
One way to increase the performance of the single-core implementation is to reduce the amount of time spent transferring data inside the cache hierarchy. As previously established, a[y-1][x] and a[y][x+1] are loaded from the L3 cache, because the L1 and L2 caches are too small to fulfill the LC for N = 32768. Using cache blocking, it is possible to enforce the LC in arbitrary cache levels. This is done by partitioning the grid into stripes along the y-axis; the grid is then processed stripe by stripe. The width of the stripes, also known as the blocking factor b_x, is chosen in such a way that the LC is met for a given cache level. If L2 blocking is desired, the L2 cache size of 256 kB requires that b_x be chosen smaller than 5461.

Efficient blocking for the 32 kB L1 cache is not as straightforward. Although determining b_x < 682 is simple using the LC, naive L1 blocking in the x-direction has negative side effects. With most of the data for one CL update coming from L1, the L2 cache is less busy. This slack is detected by the hardware prefetchers, making them more aggressive and leading to data being prefetched from main memory. With b_x ≈ 680, this can be countered either by disabling the prefetchers or by additionally blocking the loop in the y-direction. We used the latter, because disabling the prefetchers might degrade performance elsewhere. To guarantee that data is used before being preempted, the size of each chunk should be chosen smaller than 50% of a single L3 segment, e.g. b_y < 0.5 · 2.5 MB / (b_x · 8 B). (In the multi-core scenario, all cores might be active and store data in the shared L3 cache; the capacity dedicated to a single core is thus that of a single L3 segment, adjusted by 50% to reflect the effective cache size.) One possible blocked loop structure is sketched at the end of this subsection.

ECM Model for SNB and IVB

Other than causing negligible loop overhead, cache blocking does not change the instructions that have to be retired to process one CL; thus T_OL and T_nOL remain unchanged. L2 blocking reduces the number of CLs transferred between L3 and L2 from five to three, lowering T_L2L3 from ten to six cycles on both SNB and IVB. The runtime prediction T_ECM^Mem for both SNB and IVB is reduced from 41.2 to 37.2 cy, leading to a performance prediction P_ECM^Mem of 580 MLUP/s for SNB and 645 MLUP/s for IVB. Similarly, L1 blocking reduces the CL traffic between the L2 and L1 caches from five to three CLs, lowering T_L1L2 from ten to six cycles on both SNB and IVB. Again, the runtime prediction T_ECM^Mem for both microarchitectures is reduced by four cycles, from 37.2 to 33.2 cy. The predicted performance P_ECM^Mem increases to 651 MLUP/s on SNB and 723 MLUP/s on IVB.

ECM Model for HSW and BDW

The effect of L2 blocking is more pronounced on HSW and BDW, because the cost of transferring a CL is higher on these microarchitectures due to the latency penalty. L2 blocking lowers T_L2L3 from 10+8 to 6+4.8 cy on HSW and BDW. In turn, the runtime prediction T_ECM^Mem is reduced from 41.5 to 34.3 cy on HSW and from 40.6 to 33.4 cy on BDW. The performance prediction P_ECM^Mem increases to 536 MLUP/s on HSW and 503 MLUP/s on BDW. Because the L1-L2 bandwidth increased from 2 cy/CL on SNB/IVB to 1 cy/CL on HSW/BDW, the performance improvement offered by L1 blocking is less pronounced than on SNB and IVB. With the number of CL transfers lowered from five to three by L1 blocking, T_L1L2 is reduced from five to three cycles. The runtime prediction T_ECM^Mem is reduced from 34.3 to 32.3 cy on HSW and from 33.4 to 31.4 cy on BDW. The predicted performance P_ECM^Mem increases to 570 MLUP/s on HSW and 535 MLUP/s on BDW.
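One possible realization of the blocked sweep described above is sketched below; the loop structure, the MIN macro, and the block-size variables bx and by are assumptions and not taken from the paper's actual implementation.

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* Jacobi sweep blocked in the x-direction (stripes of width bx to enforce
     * the layer condition, e.g. bx < 5461 for L2 or bx < 682 for L1) and
     * chunked in the y-direction (by rows per chunk, sized so that data
     * prefetched into an L3 segment is used before it is evicted).          */
    for (long xx = 1; xx < N - 1; xx += bx)
        for (long yy = 1; yy < N - 1; yy += by)
            for (long y = yy; y < MIN(yy + by, N - 1); ++y)
                for (long x = xx; x < MIN(xx + bx, N - 1); ++x)
                    b[y][x] = 0.25 * ( a[y][x-1] + a[y][x+1]
                                     + a[y-1][x] + a[y+1][x] );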
As a workaround for the limited scalability of the physical ring interconnect introduced with Westmere-EX, HSW and BDW switch to a dual-ring design. HSW uses the so-called "eight plus x" design, in which the first eight cores of a chip are attached to a primary ring and the remaining cores (six for the model introduced in Section 4) are attached to a secondary ring; BDW uses a symmetric design. Two queues enable data to pass between the rings. In the default (non-COD) mode, the physical topology is hidden from the operating system, i.e. all cores are exposed within the same non-uniform memory access (NUMA) domain.

To understand the latency problems caused by the interconnect, we examine the route data travels inside the Uncore. Using a hashing function, data is distributed across all L3 segments based on its memory address. When accessing data in the L3 cache, there is a high probability that it must be fetched from a remote L3 segment; in the worst case, this is a segment on the other ring, so a high latency may be involved. In the case of an L3 miss the situation gets worse: each physical ring has a memory controller (MC) attached to it, and the choice of which MC to use is again based on the data's address. So an L3 miss in a segment on one physical ring does not imply that this ring's MC will be used to fetch the data from memory. This leads to cases in which a large number of hops and multiple cross-ring transfers are involved when getting data from main memory.

One way to reduce these latencies is the new COD mode introduced together with the dual-ring design on HSW and BDW, in which the cores are separated into two physical clusters of equal size. The latency reduction is achieved by adapting the involved hashing functions: data requested by a core of a cluster will only be placed in that cluster's L3 segments, and all memory transfers are routed to the MC dedicated to the cluster. Thus, for NUMA-aware codes, COD mode effectively lowers the latency by reducing the diameter and the mean distance of the dual-ring interconnect, restricting each cluster to its own physical ring. For a more detailed analysis of COD mode see [4]. On the HSW chip used for benchmarks, COD mode lowers the interconnect latency penalty by 0.5 cy; on the employed BDW chip, where a single ring still has eleven cores attached to it, the penalty is only reduced by 0.3 cy.

ECM Model for HSW and BDW

With COD enabled, the per-CL Uncore latency penalty is reduced to 1.1 cy on HSW resp. 1.3 cy on BDW. This leads to T_L3Mem = 8.7+3.3 cy on HSW and 7.8+3.9 cy on BDW. To transfer the three remaining CLs between L3 and L2, 6+3.3 cy are required on HSW resp. 6+3.9 cy on BDW. The resulting ECM inputs are {6 ‖ 5 | 3 | 6+3.3 | 8.7+3.3} cy for HSW and {6 ‖ 5 | 3 | 6+3.9 | 7.8+3.9} cy for BDW. The corresponding runtime predictions are {6 ⌉ 8 ⌉ 17.3 ⌉ 29.3} cy on HSW and {6 ⌉ 8 ⌉ 17.9 ⌉ 29.6} cy on BDW. The performance predicted by the ECM model is 628 MLUP/s for HSW and 567 MLUP/s for BDW.

Streaming or non-temporal (NT) stores are special instructions that avoid write-allocates on modern Intel microarchitectures. Without NT stores, storing the newly computed result b[y][x] triggers a write-allocate: the old data is brought in from memory and travels through the whole cache hierarchy. Thus the first benefit of using NT stores is that the unnecessary transfer of b[y][x] from memory to the L1 cache no longer takes place. In addition, NT stores also strip some cycles off the time involved in getting the new result to main memory. Using regular stores, the newly computed result is written to the L1 cache, from where the data has to be evicted at some point through the whole cache hierarchy into memory. Using NT stores, CLs are sent via the L1 cache to the line fill buffers (LFBs); from there, they are transferred directly to memory and do not pass through the L2 and L3 caches. A sketch of how such stores can be issued is shown below.
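The paper does not show how its NT-store variant is implemented (with Intel compilers this can also be requested via pragmas); purely for illustration, the innermost loop could issue streaming stores through AVX intrinsics as sketched below, assuming b[y][xx] is 32-byte aligned.

    #include <immintrin.h>

    /* Innermost x-loop of the Jacobi sweep with non-temporal stores: the
     * result bypasses the cache hierarchy via the line fill buffers, so no
     * write-allocate of b[y][x] takes place.                               */
    for (long x = xx; x < xx + bx; x += 4) {
        __m256d s = _mm256_add_pd(
            _mm256_add_pd(_mm256_loadu_pd(&a[y][x-1]), _mm256_loadu_pd(&a[y][x+1])),
            _mm256_add_pd(_mm256_loadu_pd(&a[y-1][x]), _mm256_loadu_pd(&a[y+1][x])));
        _mm256_stream_pd(&b[y][x], _mm256_mul_pd(_mm256_set1_pd(0.25), s));
    }
    /* NT stores are weakly ordered; an _mm_sfence() is needed before other
     * threads may safely consume b.                                        */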
Although the benefits should apply equally to all microarchitectures, there are shortcomings in SNB and IVB that make single-core implementations using NT stores slower than their counterparts using regular stores. The positive impact of NT stores can only be leveraged in multi-core scenarios on these microarchitectures, which is why we chose to omit ECM models and measurements for the NT-store implementation on SNB and IVB.

ECM Model for HSW and BDW

Transferring a[y+1][x] between the L1 and L2 caches takes 1 cy. Although the transfer is not strictly between the L1 and L2 cache, the cycle spent transferring the CL for b[y][x] from the L1 cache to the line fill buffer (LFB) is booked in T_L1L2 as well, making for a total L1-L2 transfer time of 2 cy. Loading a[y+1][x] from the L3 to the L2 cache takes T_L2L3 = 2+1.1 cy on HSW resp. 2+1.3 cy on BDW. Loading a[y+1][x] from memory and writing b[y][x] back take 2.9+1.1 cy per CL on HSW and 2.6+1.3 cy per CL on BDW; again, although the transfer of b[y][x] is strictly not between the L3 cache and memory, the time to send the CL from the LFB to memory is booked in T_L3Mem, making for a total L3-Mem transfer time of 5.8+2.2 cy on HSW and 5.2+2.6 cy on BDW. In summary, the full ECM inputs are {6 ‖ 5 | 2 | 2+1.1 | 5.8+2.2} cy on HSW and {6 ‖ 5 | 2 | 2+1.3 | 5.2+2.6} cy on BDW; the corresponding ECM runtime predictions are {6 ⌉ 7 ⌉ 10.1 ⌉ 18.1} cy on HSW and {6 ⌉ 7 ⌉ 10.3 ⌉ 18.1} cy on BDW. The in-memory performance prediction of the ECM model is P_ECM^Mem = 1016 MLUP/s on HSW and 928 MLUP/s on BDW.

6. SINGLE-CORE RESULTS

Table 2 contains a summary of the ECM inputs and predictions discussed in Section 5, as well as measurements of performance, power, and energy consumption for one 2D Jacobi iteration using a 16 GB dataset. The model correctly predicts performance with a mean error of 3% and a maximum error of 7%, which indicates that all single-core performance engineering measures work as intended. On SNB and IVB, performance increases of around 1.3× are achieved; on HSW (2.2×) and BDW (2.1×) the increases are even more pronounced. Measurements obtained via the RAPL interface indicate that the increases in power consumption due to the optimizations are negligible (in the range of 2%). (Note that the RAPL counters cannot report the power consumption of individual cores, only that of the whole package; the reported values thus also include the power drawn by Uncore facilities, e.g. all L3 segments and the interconnect.) With power draw almost constant, the performance gains directly translate into energy improvements.

An interesting observation regarding single-core power consumption can be made when comparing the different microarchitectures, which can be explained using Intel's "tick-tock" model: power decreases with "ticks," i.e. a shrink in manufacturing size and the accompanying decrease in dynamic power, and increases with "tocks," i.e. major improvements in the microarchitecture. The "tick" from SNB (32 nm) to IVB (22 nm) corresponds to a 10% decrease in power consumption. HSW, the only "tock" in Table 2, uses the same 22 nm process as IVB but introduced major improvements in the microarchitecture (cf. Section 4), which lead to a more than 50% higher power draw. BDW, a "tick" using a 14 nm process, draws 11% less power than HSW. (Note also that despite only one core being active in these measurements, all L3 cache segments are active and draw power; a chip with more cores will thus draw more power even in single-core use cases.) While it is tempting to generalize from these results, the reported numbers are specific to the 2D Jacobi application and the chip models used. We also believe that the surge in power consumption observed with the HSW "tock" is caused by a combination of microarchitectural improvements and the higher core count.

Table 2: Summary of ECM inputs, ECM predictions, as well as measured performance, power consumption, and energy-to-solution for one 2D Jacobi iteration with a dataset size of 16 GB.

    µarch  Version        ECM input [cy]                 ECM prediction [cy]      P_ECM^Mem [MLUP/s]  Measured [MLUP/s]  Chip Power [W]  Energy-to-Solution [J]
    SNB    Baseline       {6 ‖ 8 | 10 | 10 | 13.2}       {8 ⌉ 18 ⌉ 28 ⌉ 41.2}     524                 514                35.9            75.0
           L2 blocked     {6 ‖ 8 | 10 | 6 | 13.2}        {8 ⌉ 18 ⌉ 24 ⌉ 37.2}     580                 623                36.4            62.7
           L1 blocked     {6 ‖ 8 | 6 | 6 | 13.2}         {8 ⌉ 14 ⌉ 20 ⌉ 33.2}     651                 672                36.6            58.4
    IVB    Baseline       {6 ‖ 8 | 10 | 10 | 13.2}       {8 ⌉ 18 ⌉ 28 ⌉ 41.2}     582                 539                32.7            65.3
           L2 blocked     {6 ‖ 8 | 10 | 6 | 13.2}        {8 ⌉ 18 ⌉ 24 ⌉ 37.2}     645                 651                34.1            56.1
           L1 blocked     {6 ‖ 8 | 6 | 6 | 13.2}         {8 ⌉ 14 ⌉ 20 ⌉ 33.2}     722                 714                33.0            49.6
    HSW    Baseline       {6 ‖ 5 | 5 | 10+8 | 8.7+4.8}   {6 ⌉ 10 ⌉ 28 ⌉ 41.5}     443                 435                50.9            125.6
           L2 blocked     {6 ‖ 5 | 5 | 6+4.8 | 8.7+4.8}  {6 ⌉ 10 ⌉ 20.8 ⌉ 34.3}   536                 529                51.0            103.5
           L1 blocked     {6 ‖ 5 | 3 | 6+4.8 | 8.7+4.8}  {6 ⌉ 8 ⌉ 18.8 ⌉ 32.3}    570                 579                52.1            96.6
           L1 b.+CoD      {6 ‖ 5 | 3 | 6+3.3 | 8.7+3.3}  {6 ⌉ 8 ⌉ 17.3 ⌉ 29.3}    628                 625                51.3            88.0
           L1 b.+CoD+nt   {6 ‖ 5 | 2 | 2+1.1 | 5.8+2.2}  {6 ⌉ 7 ⌉ 10.1 ⌉ 18.1}    1016
    BDW    Baseline       {6 ‖ 5 | 5 | 10+8 | 7.8+4.8}   {6 ⌉ 10 ⌉ 28 ⌉ 40.6}     413                 407                44.2            116.5
           L2 blocked     {6 ‖ 5 | 5 | 6+4.8 | 7.8+4.8}  {6 ⌉ 10 ⌉ 20.8 ⌉ 33.4}   503                 489                44.3            97.4
           L1 blocked     {6 ‖ 5 | 3 | 6+4.8 | 7.8+4.8}  {6 ⌉ 8 ⌉ 18.8 ⌉ 31.4}    535                 509                44.5            93.7
           L1 b.+CoD      {6 ‖ 5 | 3 | 6+3.9 | 7.8+3.9}  {6 ⌉ 8 ⌉ 17.9 ⌉ 29.6}    567                 561                44.6            86.0
           L1 b.+CoD+nt   {6 ‖ 5 | 2 | 2+1.3 | 5.2+2.6}  {6 ⌉ 7 ⌉ 10.3 ⌉ 18.1}    928                 862                45.4            56.6
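As a plausibility check on Table 2, energy-to-solution is simply the measured package power multiplied by the runtime of one sweep. For the SNB baseline, for example:

    t = N_LUP / P_measured ≈ 32768² LUP / 514 MLUP/s ≈ 2.09 s
    E = P_chip · t ≈ 35.9 W · 2.09 s ≈ 75 J,

which matches the 75.0 J listed in the table.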
7. CHIP-LEVEL OBSERVATIONS

Although the ECM model can be used to predict multi-core scaling behavior [6, 4, 18], due to space constraints the chip-level discussion is restricted to empirical results. The graphs in Figure 3 show the improvements discussed in Section 5 and relate measured performance and energy-to-solution for different core counts. IVB (Fig. 3a) and HSW (Fig. 3b) are chosen as representatives to demonstrate that the different optimizations have different impacts on performance and energy consumption depending on the microarchitecture. (In each graph, the leftmost measuring point corresponds to one core; following the line from one point to the next corresponds to one more core being active. For demonstration purposes, the purple graph in Fig. 3a has some core counts annotated.)

[Figure 3: Performance vs. energy-to-solution for different core counts on IVB (a) and HSW (b). Curves: Baseline, L1 blocking, L1B+CoD (HSW only), L1B(+CoD)+NT, and the best configuration at 1.2 GHz; the sustained bandwidth limits b_s for regular and NT stores are marked.]

On IVB, running on n_s = 5 instead of ten cores reduces energy consumption by 13%, from 44.4 J to 34.8 J; on HSW, running on seven instead of 14 cores amounts to a reduction of 16%, from 57.0 J to 47.6 J. The positive impact of L1 blocking can be observed for both microarchitectures, corresponding to a further decrease of energy consumption by 16% to 29.1 J on IVB resp. 10% to 43.0 J on HSW. COD mode brings energy consumption on HSW further down, by 6% to 40.5 J.

When it comes to NT stores, the differences between the microarchitectures become visible. On IVB, the per-core performance of the implementation using L1 blocking and NT stores (dark blue line in Fig. 3a) is lower than that of the version using regular stores; a version using NT stores without L1 blocking (bright blue line) does not even manage to saturate memory bandwidth. Due to this poor per-core performance, NT stores on SNB/IVB have almost no positive impact on energy consumption!
On HSW, per-core performance is, as expected, roughly 1.6× higher with NT stores, enabling a reduction in energy consumption by 33% to 27.0 J.

Another difference between the microarchitectures surfaces when examining frequency tuning for potential energy savings. Before HSW, i.e. on SNB and IVB, the chip's Uncore frequency was set to match the frequency of the fastest active core. Because the Uncore contains the L3 segments, the ring interconnect, and the memory controllers, L3 and memory bandwidth are a function of core frequency on SNB/IVB. This effect can be observed when setting the frequency to 1.2 GHz on IVB (magenta line in Fig. 3a): despite providing the best energy-to-solution, performance is hurt badly. On HSW, the Uncore is clocked independently. This leads to a situation in which, due to the frequency-induced lower per-core performance, more cores are required to saturate main memory bandwidth (cf. magenta line in Fig. 3b); however, the lower per-core power consumption translates into overall energy savings of 11%, lowering energy-to-solution to 24.0 J. These results indicate that energy consumption could be further reduced if the chips offered frequencies below their current 1.2 GHz floor.

Table 3 contains a summary of the final results pertaining to energy savings for each microarchitecture. The "reference" value corresponds to the energy consumption of the baseline implementation running on all cores clocked at the chip's nominal frequency. The energy-to-solution of the most energy-efficient version is listed in the "optimized" column, along with the number of active cores and the frequency used to obtain this result in the "configuration" column.

Table 3: Summary of chip-level benchmarks. Energy improvements are shown in parentheses.

    µarch   Reference   Optimized        Configuration
    SNB     58.7 J      28.0 J (2.1×)    7 cores, 1.2 GHz
    IVB     44.4 J      22.3 J (2.0×)    9 cores, 1.5 GHz
    HSW     57.0 J      24.0 J (2.4×)    4 cores, 1.2 GHz
    BDW     47.9 J      20.4 J (2.3×)    4 cores, 1.2 GHz

8. CONCLUSION

We have applied a new energy-optimization approach to a 2D Jacobi solver and analyzed its effects on a range of recent Intel multi-core chips. Using the execution-cache-memory (ECM) model, single-core software improvements were described and their accuracy validated by measurements. For the first time, the ECM model has been (a) extended to incorporate non-temporal (NT) stores and (b) applied to a Broadwell-EP chip. We found that energy consumption can be reduced by a factor of 2.1× on Sandy Bridge-EP, 2.0× on Ivy Bridge-EP, 2.4× on Haswell-EP, and 2.3× on Broadwell-EP. Further, we found that while NT stores can increase performance on Sandy and Ivy Bridge-based E5 processors, a direct positive impact on energy consumption could not be observed; only in combination with frequency tuning do NT stores offer a better energy-to-solution on these architectures. Measurements indicate that this problem has been solved on Haswell-EP and Broadwell-EP. Moreover, our results indicate that future microarchitectures that keep core and Uncore frequencies decoupled could offer improved energy efficiency if core frequencies below the current 1.2 GHz floor were available. Beyond these immediate results we have demonstrated the viability of our energy-optimization approach.

9. REFERENCES

[1] M. Curtis-Maury, F. Blagojevic, C. Antonopoulos, and D. Nikolopoulos. Prediction-based power-performance adaptation of multithreaded scientific codes. IEEE Transactions on Parallel and Distributed Systems, 19(10):1396–1410, Oct 2008.
[2] Y. Dong, J. Chen, X. Yang, L. Deng, and X. Zhang. Energy-oriented OpenMP parallel loop scheduling. In International Symposium on Parallel and Distributed Processing with Applications (ISPA '08), pages 162–169, Dec 2008.
[3] G. Hager, J. Treibig, J. Habich, and G. Wellein. Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency Computat.: Pract. Exper., 2013. DOI: 10.1002/cpe.3180.
[4] J. Hofmann, D. Fey, J. Eitzinger, G. Hager, and G. Wellein. Analysis of Intel's Haswell microarchitecture using the ECM model and microbenchmarks. In Architecture of Computing Systems – ARCS 2016: 29th International Conference, Nuremberg, Germany, April 4–7, 2016, Proceedings, pages 210–222. Springer International Publishing, Cham, 2016.
[5] J. Hofmann, D. Fey, M. Riedmann, J. Eitzinger, G. Hager, and G. Wellein. Performance analysis of the Kahan-enhanced scalar product on current multicore processors. In Parallel Processing and Applied Mathematics: 11th International Conference, PPAM 2015, Krakow, Poland, September 6–9, 2015, Revised Selected Papers, Part I, pages 63–73. Springer International Publishing, Cham, 2016.
[6] J. Hofmann, D. Fey, M. Riedmann, J. Eitzinger, G. Hager, and G. Wellein. Performance analysis of the Kahan-enhanced scalar product on current multi-core and many-core processors. Concurrency and Computation: Practice and Experience, 2016. DOI: 10.1002/cpe.3921.
[7] F. I. in cooperation with Fraunhofer ISI. Abschätzung des Energiebedarfs der weiteren Entwicklung der Informationsgesellschaft, 2009.
[8] Intel Corp. Intel 64 and IA-32 Architectures Software Developer's Manual, 2016. Version: April 2016.
[9] Intel Corp. Intel Xeon Processor E5 v3 Product Family: Processor Specification Update, 2016. Version: February 2016.
[10] J. Koomey. Growth in data center electricity use 2005 to 2010. Oakland, CA: Analytics Press, August 2011.
[11] D. Li, B. R. de Supinski, M. Schulz, D. S. Nikolopoulos, and K. W. Cameron. Strategies for energy-efficient resource management of hybrid programming models. IEEE Trans. Parallel Distrib. Syst., 24(1):144–157, 2013.
[12] A. Mazouz, A. Laurent, B. Pradelle, and W. Jalby. Evaluation of CPU frequency transition latency. Computer Science – Research and Development, 29(3):187–195, 2013.
[13] J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pages 19–25, Dec. 1995.
[14] A. Miyoshi, C. Lefurgy, E. Van Hensbergen, R. Rajamony, and R. Rajkumar. Critical power slope: Understanding the runtime effects of frequency scaling. In Proceedings of the 16th International Conference on Supercomputing, ICS '02, pages 35–44, New York, NY, USA, 2002. ACM.
[15] G. Rivera and C.-W. Tseng. Tiling optimizations for 3D scientific computations. In Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, SC '00, Washington, DC, USA, 2000. IEEE Computer Society.
[16] B. Rountree, D. Lowenthal, S. Funk, V. W. Freeh, B. de Supinski, and M. Schulz. Bounding energy consumption in large-scale MPI programs. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC '07, pages 1–9, Nov 2007.
[17] B. Rountree, D. K. Lowenthal, B. R. de Supinski, M. Schulz, V. W. Freeh, and T. Bletsch. Adagio: Making DVS practical for complex HPC applications. In Proceedings of the 23rd International Conference on Supercomputing, ICS '09, pages 460–469, New York, NY, USA, 2009. ACM.
[18] H. Stengel, J. Treibig, G. Hager, and G. Wellein. Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS '15, New York, NY, USA, 2015. ACM.
[19] J. Treibig and G. Hager. Introducing a performance model for bandwidth-limited loop kernels. In R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, editors, Parallel Processing and Applied Mathematics, volume 6067 of Lecture Notes in Computer Science, pages 615–624. Springer Berlin/Heidelberg, 2010.
[20] J. Treibig, G. Hager, and G. Wellein. LIKWID performance tools. In C. Bischof et al., editors, Competence in High Performance Computing 2010, pages 165–175. Springer Berlin Heidelberg, 2012.
[21] C.-J. Wu and M. Martonosi. Characterization and dynamic mitigation of intra-application cache interference.