An analysis of core- and chip-level architectural features in four generations of Intel server processors
Johannes Hofmann, Georg Hager, Gerhard Wellein, and Dietmar Fey
Computer Architecture, University of Erlangen-Nuremberg, 91058 Erlangen, Germany, [email protected], [email protected]
Erlangen Regional Computing Center (RRZE), 91058 Erlangen, Germany, [email protected], [email protected]
Abstract.
This paper presents a survey of architectural features among four generations of Intel server processors (Sandy Bridge, Ivy Bridge, Haswell, and Broadwell) with a focus on performance with floating-point workloads. Starting on the core level and going down the memory hierarchy we cover instruction throughput for floating-point instructions, L1 cache, address generation capabilities, core clock speed and its limitations, L2 and L3 cache bandwidth and latency, the impact of Cluster on Die (CoD) and cache snoop modes, and the Uncore clock speed. Using microbenchmarks we study the influence of these factors on code performance. This insight can then serve as input for analytic performance models. We show that the energy efficiency of the LINPACK and HPCG benchmarks can be improved considerably by tuning the Uncore clock speed without sacrificing performance, and that the Graph500 benchmark performance may profit from a suitable choice of cache snoop mode settings.
Keywords:
Intel architecture, performance modeling, LINPACK, HPCG, Graph500
Intel Xeon server CPUs dominate the commodity HPC market. Although the microarchitecture of those processors is ubiquitous and can also be found in mobile and desktop devices, the average developer of numerical software hardly cares about architectural details and relies on the compiler to produce "decent" code with "good" performance. If we actually want to know what "good performance" means we have to build analytic models that describe the interaction between software and hardware. Despite the necessary simplifications, such models can give useful hints towards the relevant bottlenecks of code execution and thus point to viable optimization approaches. The Roofline model [6,22] and the Execution-Cache-Memory (ECM) model [5,18] are typical examples. Analytic modeling requires simplified machine and execution models, with details about properties of execution units, caches, memory, etc. Although much of this data is provided by manufacturers, many relevant features can only be understood via microbenchmarks, either because they are not documented or because the hardware cannot leverage its full potential in practice. One simple example is the maximum memory bandwidth of a chip, which can be calculated from the number, frequency, and width of the DRAM channels but which, in practice, may be significantly lower than this absolute limit. Hence, microbenchmarks such as STREAM [13] or likwid-bench [20] are used to measure the limits achievable in practice.
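As an illustration of the memory-bandwidth example, the theoretical limit follows directly from data-sheet numbers such as those listed later in Table 1; for the Broadwell-EP system used in this paper (four DDR4-2400 channels, 8 B per channel transfer),
\[
b_\mathrm{mem} = n_\mathrm{ch} \cdot f_\mathrm{mem} \cdot 8\,\mathrm{B} = 4 \cdot 2.4\,\mathrm{GT/s} \cdot 8\,\mathrm{B} = 76.8\,\mathrm{GB/s},
\]
whereas the sustained bandwidth measured with STREAM-like kernels stays noticeably below this limit (cf. Fig. 6).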
Although there has been some convergence in processor microarchitectures for high performance computing, the latest CPU models show interesting differences in their performance-relevant features. Building good analytic performance models and, in general, making sense of performance data, requires intimate knowledge of such details. The main goal of this paper is to provide a coverage and critical discussion of those details on the latest four Intel architecture generations for server CPUs: Sandy Bridge (SNB), Ivy Bridge (IVB), Haswell (HSW), and Broadwell (BDW). The actual CPU models used for the analysis are described in Sect. 2.1 below.
Out of the many possible approaches to performance analysis and optimization (coined performance engineering [PE]) we favor concepts based on analytic performance models. For recent server multicore designs the ECM performance model allows for a very accurate description of single-core performance and scalability. In contrast to the Roofline model it drops the assumption of a single bottleneck for the steady-state execution of a loop. Instead, time contributions from in-core execution and data transfers through the memory hierarchy are calculated and then put together according to the properties of a particular processor architecture; for instance, in Intel x86 server CPUs all time contributions from data transfers, including LOADs and STOREs in the L1 cache, must be added to get a prediction of single-core data transfer time [18,8]. On the other hand, the IBM Power8 processor shows almost perfect overlap [9]. A full introduction to the ECM model would exceed the scope of this paper, so we refer to the references given above. The model has been shown to work well for the analysis of implementations of several important computational kernels [19,18,23,2,9].

In order to construct analytic models accurately, data about the capabilities of the microarchitecture and how it interacts with the code at hand is needed. For floating-point centric code in scientific computing, maximum throughput and latency numbers for arithmetic and LOAD/STORE instructions are most useful in all their vectorized and non-vectorized, single (SP) and double precision (DP) variants. On Intel multicore CPUs up to Haswell, this encompasses scalar, streaming SIMD extensions (SSE), advanced vector extensions (AVX), and AVX2 instructions. Modeling the memory hierarchy in the ECM model requires the maximum data bandwidth between adjacent cache levels (assuming that the hierarchy is inclusive) and the maximum (saturated) memory bandwidth. As for the caches it is usually sufficient to assume the maximum documented theoretical bandwidth (presupposing that all prefetchers work perfectly to hide latencies), although latency penalties might apply [9]. The main memory bandwidth and latency may depend on the cluster-on-die (CoD) mode and cache snoop mode settings. In addition, the latest Intel CPUs work with at least two clock speed domains: one for the core (or even individual cores) and one for the Uncore, which includes the L3 cache and memory controllers. Both are subject to automatic changes; in case of AVX code on Haswell and later CPUs the guaranteed baseline clock speed is lower than the standard speed rating of the chip. The performance and energy consumption of code depends crucially on the interplay between these clock speed domains. Finally, especially when it comes to power dissipation and capping, considerable variations among specimens of the same CPU model can be observed.
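As a point of reference for the discussions below, the single-core ECM prediction for data in memory on Intel server CPUs can be written compactly (following the notation of [18]; this is only a sketch of the composition rule, not a full model description):
\[
T_\mathrm{ECM}^\mathrm{Mem} = \max\!\left(T_\mathrm{OL},\; T_\mathrm{nOL} + T_\mathrm{L1L2} + T_\mathrm{L2L3} + T_\mathrm{L3Mem}\right),
\]
where \(T_\mathrm{OL}\) covers the in-core cycles that overlap with data transfers, \(T_\mathrm{nOL}\) the LOAD/STORE cycles in the L1 cache that do not, and the remaining terms the cache-line transfer times between adjacent memory hierarchy levels, which on Intel server CPUs add up instead of overlapping.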
All these intricate architectural details influence benchmark and application performance, and it is insufficient to look up the raw specs in a data sheet in order to understand this influence.
There is a large number of papers dealing with details in the architecture of CPUs and their impact on performance and energy consumption. In [1] the authors assessed the capabilities of the then-new Nehalem server processor for workloads in scientific computing and compared its capabilities with its predecessors and competing designs. In [17], tools and techniques for measuring and tuning power and energy consumption of HPC systems were discussed. The QuickPath Interconnect (QPI) snoop modes on the Haswell-EP processor were investigated in [15]. Energy efficiency features, including the AVX and Uncore clock speeds, on the same architecture were studied in [4] and [7]. Our work differs from all those by systematically investigating relevant architectural features, from the core level down to memory, via microbenchmarks, in view of analytic performance modeling as well as important benchmark workloads such as LINPACK, Graph500, and HPCG.
Apart from confirming or highlighting some documented or previously published findings, this paper makes the following new contributions:
– We present benchmark results showing the improvement in the performance of the vector gather instruction from HSW to BDW. On BDW it is now advantageous to actually use the gather instruction instead of "emulating" it.
– We fathom the capabilities of the L2 cache on all four microarchitectures and establish practical limits for L2 bandwidth that can be used in analytic ECM modeling. These limits are far below the advertised 64 B/cy on HSW and BDW.
– We study the bandwidth scalability of the L3 cache depending on the Cluster on Die (CoD) mode and show that, although the parallel efficiency for streaming code is never below 85%, CoD has a measurable advantage over non-CoD.
– We present latency data for all caches and main memory under various cache snoop modes and CoD/non-CoD. We find that although CoD is best for streaming and non-uniform memory access (NUMA) aware workloads in terms of latency and bandwidth, highly irregular, NUMA-unfriendly code such as the Graph500 benchmark benefits dramatically from non-CoD mode with Home Snoop and Opportunistic Snoop Broadcast, by as much as 50% on BDW.
– We show how the Uncore clock speed on HSW and BDW has considerable impact on the power consumption of bandwidth- and cache-bound code, opening new options for energy-efficient and power-capped execution.
Table 1.
Key test machine specifications. All reported numbers taken from data sheets.
Microarchitecture         | Sandy Bridge-EP    | Ivy Bridge-EP       | Haswell-EP          | Broadwell-EP
Shorthand                 | SNB                | IVB                 | HSW                 | BDW
Chip Model                | Xeon E5-2680       | Xeon E5-2690 v2     | Xeon E5-2695 v3     | Xeon E5-2697 v4
Release Date              | Q1/2012            | Q3/2013             | Q3/2014             | Q1/2016
Base Freq.                | 2.7 GHz            | 3.0 GHz             | 2.3 GHz             | 2.3 GHz
Max All Core Turbo Freq.  | —                  | —                   | 2.8 GHz             | 2.8 GHz
AVX Base Freq.            | —                  | —                   | 1.9 GHz             | 2.0 GHz
AVX All Core Turbo Freq.  | —                  | —                   | 2.6 GHz             | 2.7 GHz
Cores/Threads             | 8/16               | 10/20               | 14/28               | 18/36
Latest SIMD Extensions    | AVX                | AVX                 | AVX2, FMA3          | AVX2, FMA3
Memory Configuration      | 4 ch. DDR3-1600    | 4 ch. DDR3-1866     | 4 ch. DDR4-2133     | 4 ch. DDR4-2400
Theor. Mem. Bandwidth     | 51.2 GB/s          | 59.7 GB/s           | 68.2 GB/s           | 76.8 GB/s
L1 Cache Capacity         | 8 × 32 kB          | 10 × 32 kB          | 14 × 32 kB          | 18 × 32 kB
L2 Cache Capacity         | 8 × 256 kB         | 10 × 256 kB         | 14 × 256 kB         | 18 × 256 kB
L3 Cache Capacity         | 20 MB (8 × 2.5 MB) | 25 MB (10 × 2.5 MB) | 35 MB (14 × 2.5 MB) | 45 MB (18 × 2.5 MB)
L1 → Reg Bandwidth        | 2 × 16 B/cy        | 2 × 16 B/cy         | 2 × 32 B/cy         | 2 × 32 B/cy
Reg → L1 Bandwidth        | 1 × 16 B/cy        | 1 × 16 B/cy         | 1 × 32 B/cy         | 1 × 32 B/cy
L1 ↔ L2 Bandwidth         | 32 B/cy            | 32 B/cy             | 64 B/cy             | 64 B/cy
L2 ↔ L3 Bandwidth         | 32 B/cy            | 32 B/cy             | 32 B/cy             | 32 B/cy
All measurements were performed on standard two-socket Intel Xeon servers. A summary of key specifications of the four generations of processors is shown in Table 1. According to Intel's "tick-tock" model, a "tick" represents a shrink of the manufacturing process technology; however, it should be noted that "ticks" are often accompanied by minor microarchitectural improvements, while a "tock" usually involves larger changes.

SNB (a "tock") first introduced AVX, doubling the single instruction, multiple data (SIMD) width from SSE's 128 bit to 256 bit. One major shortcoming of SNB is directly related to AVX: Although the SIMD register width was doubled and a second LOAD unit was added, the data path widths between the L1 cache and the individual LOAD/STORE units were left at 16 B/cy. This leads to AVX STOREs requiring two cycles to retire on SNB, while AVX LOADs block both units. IVB, a "tick," saw an increase in core count as well as a higher memory clock; in addition, IVB brought speedups for several instructions, e.g., floating-point (FP) divide and square root; see Table 2 for details.

HSW, a "tock," introduced AVX2, extending the existing 256 bit SIMD vectorization from floating-point to integer data types. Instructions introduced by the fused multiply-add (FMA) extension are handled by two new, AVX-capable execution units. Data path widths between the L1 cache and registers as well as between the L1 and L2 caches were doubled. A vector gather instruction provides a simple means to fill SIMD registers with non-contiguous data, making it easier for the compiler to vectorize code with indirect accesses. To maintain scalability of the core interconnect, HSW chips with more than eight cores move from a single-ring core interconnect to a dual-ring design. At the same time, HSW introduced the new CoD mode, in which a chip is optionally partitioned into two equally sized NUMA domains in order to reduce latencies and increase scalability. Starting with HSW, the system's QPI snoop mode can also be configured. HSW no longer guarantees to run at the base frequency with AVX code; the guaranteed frequency when running AVX code on all cores is referred to as the "AVX base frequency," which can be significantly lower than the nominal frequency [12,14]. Also, there is now a separation of frequency domains between cores and Uncore. The Uncore clock is independent and can either be set automatically (when Uncore frequency scaling (UFS) is enabled) or manually via model specific registers (MSRs).

As a "tick," BDW, the most recent Xeon-EP processor, offers minor architectural improvements. Floating-point and gather instruction latencies and throughput have partially improved. The dual-ring design was made symmetric, and an additional QPI snoop mode is available.
All high-level language benchmarks (Graph500, HPCG) were compiled using Intel ICC 16.0.3. For Graph500 we used the reference implementation in version 2.1.4, and for LINPACK we ran the Intel-provided binary contained in MKL 2017.1.013, the most recent version available at the time of writing.

The LIKWID tool suite (http://tiny.cc/LIKWID) in its current stable version 4.1.2 was employed heavily in many of our experiments. All low-level benchmarks consisted of hand-written assembly. When available (e.g., for streaming kernels such as the STREAM triad and others) we used the assembly implementations in the likwid-bench microbenchmarking tool. Latency measurements in the memory hierarchy were done with all prefetchers turned off (via likwid-features) and a pointer chasing code that ensures consecutive cache line accesses. Energy consumption measurements were taken with the likwid-perfctr tool via the RAPL (Running Average Power Limit) interface, and the clock speed of the CPUs was controlled with likwid-setFrequencies.

Starting with HSW, Intel chips offer different base and turbo frequencies for AVX and SSE or scalar instruction mixes. This is due to the higher power requirement of using all SIMD lanes in case of AVX. To reflect this behavior, Intel introduced a new frequency nomenclature for these chips. The "base frequency," also known as the "non-AVX base frequency" or "nominal frequency," is the minimum frequency that is guaranteed when running scalar or SSE code on all cores. This is also the frequency the chip is advertised with, e.g., 2.30 GHz for the Xeon E5-2695 v3 in Table 1. The maximum frequency that can be achieved when running scalar or SSE code on all cores is called the "max all core turbo frequency." The "AVX base frequency" is the minimum frequency that is guaranteed when running AVX code on all cores and is typically significantly lower than the (non-AVX) base frequency.
Fig. 1.
Attained chip frequency during LINPACK runs on all cores on (a) BDW and (b) HSW. (c) Variation of clock speed and package power among all 1456 Xeon E5-2630 v4 CPUs in RRZE's "Meggie" cluster running LINPACK.

Analogously, the maximum frequency that can be attained when running AVX code is called the "AVX max all core turbo frequency."

On HSW, at least one core running AVX code results in a chip-wide frequency restriction to the AVX max all core turbo frequency. On BDW, cores running scalar or SSE code are allowed to float between the non-AVX base and max all core turbo frequencies even when other cores are running AVX code. All relevant values for the HSW and BDW specimens used can be found in Table 1. According to official documentation, the actually used frequency depends on the workload; more specifically, it depends on the percentage of AVX instructions in a certain instruction execution window. To get a better idea about what to expect for demanding workloads, LINPACK and FIRESTARTER [3] were selected to determine those frequencies. The maximum frequency difference between both benchmarks was 20 MHz, so Figure 1 shows only results obtained with LINPACK. Figure 1a shows that BDW can maintain a frequency well above the AVX and the non-AVX base frequency for workloads running at its TDP limit of 145 W (measured package power during stress tests was 144.8 W). HSW, shown in Figure 1b, drops below the non-AVX base frequency of 2.3 GHz, but stays well above the AVX base frequency of 1.9 GHz while consuming 119.4 W out of a 120 W TDP. When running SSE LINPACK, BDW consumes 141.8 W and manages to run at the max all core turbo frequency of 2.8 GHz. On HSW, running LINPACK with SSE instructions still keeps the chip at its TDP limit (119.7 W out of 120 W); the attained frequency of 2.6 GHz is slightly below the max all core turbo frequency of 2.7 GHz.

While it might be tempting to generalize from these results, we must emphasize that statistical variations even between specimens of the same CPU type are very common [21]. When examining all 1456 Xeon E5-2630 v4 (10-core, 2.2 GHz base frequency) chips of RRZE's new "Meggie" cluster, we found significant variations across the individual CPUs. The chip has a max all core turbo and AVX max all core turbo frequency of 2.4 GHz [14]. Figure 1c shows each chip's frequency and package power when running LINPACK with SSE or AVX on all cores. With SSE code, each chip manages to attain the max all core turbo frequency of 2.4 GHz. However, a variation in power consumption can be observed. When running AVX code, not all chips reach the defined peak frequency, but all stay well above the AVX base frequency of 1.8 GHz. Some chips do hit the frequency ceiling; for these, a strong variation can be observed in the power domain.

Accurate predictions of instruction execution (i.e., how many clock cycles it takes to execute a loop body assuming a steady-state situation with all data coming from the L1 cache) are notoriously difficult in all but the simplest cases, but they are needed as input for analytic models. As a "lowest-order" and most optimistic approximation one can assume full throughput, i.e., all instructions can be executed independently and are dynamically fed to the execution ports (and the pipelines connected to them) by the out-of-order engine. The pipeline that takes the largest number of cycles to execute all its instructions then determines the runtime. The worst-case assumption would be an execution fully determined by the critical path through the code, heeding all dependencies.
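To make the two bounds concrete, consider a reduction with a loop-carried dependence versus an independent streaming update (a generic illustration in C, not one of the kernels benchmarked in this paper):

```c
/* Throughput bound vs. critical-path (latency) bound, illustrated.
 * Hypothetical example code, not taken from the measured benchmarks. */
double reduce_sum(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; ++i)
        s += a[i];   /* loop-carried dependence on s: the critical path is
                        n times the ADD latency (3-4 cy per iteration),
                        no matter how many ADD pipelines exist */
    return s;
}

void scale(double *a, long n, double c)
{
    for (long i = 0; i < n; ++i)
        a[i] *= c;   /* iterations are independent: the out-of-order engine
                        can keep the multiply pipeline full, so throughput
                        limits (ports, AGUs, retirement) govern the runtime */
}
```

Unrolling the reduction with several partial sums is the standard way to move it from the latency-bound towards the throughput-bound regime.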
In practice, the actual runtime will be between these limits unless other bottlenecks apply that are not covered by the in-core execution, such as data transfers from beyond the L1 cache, instruction cache misses, etc. Even if a loop body contains strong dependencies, the throughput assumption may still hold if there are no loop-carried dependencies. Calculating the throughput and critical path predictions requires information about the maximum throughput and latency of all relevant instructions as well as general limits such as decoder/retirement throughput, L1I bandwidth, and the number and types of address generation units. The Intel Architecture Code Analyzer (IACA, http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/) can help with this, but it is proprietary software with an unclear future development path and it does not always yield accurate predictions. Moreover, it can only analyze object code and does not work on high-level language constructs. Thus one must often revert to manual analysis to get predictions for the best possible code, even if the compiler cannot produce it. In Table 2 we give worst-case measured latency and inverse throughput numbers for arithmetic instructions in AVX, SSE, and scalar mode. In the following we point out some notable changes over the four processor generations.

The most profound change happened in the performance of the divide units. From SNB to BDW we observe a massive decrease in latency and an almost three-fold increase in throughput for AVX and SSE instructions, in single and double precision alike. Divides are still slow compared to multiply and add instructions, of course. The fact that the divide throughput per operation is the same for AVX and SSE is well known, but with BDW we see a significant rise in scalar divide throughput, even beyond the documented limit of one instruction every five cycles. The scalar square root instruction shows a similar improvement, but is in line with the documentation.

Table 2. Measured worst-case latency and inverse throughput for floating-point arithmetic instructions. For all of these numbers, lower is better.

                    |     Latency [cy]     | Inverse throughput [cy/inst.]
µarch               | BDW   HSW   IVB  SNB | BDW   HSW   IVB   SNB
vdivpd (AVX)        |  24    35    35   45 |  16    28    28    44
divpd (SSE)         |  14    20    20   22 |   8    14    14    22
divsd (scalar)      |  14    20    20   22 |        14    14    22
vdivps (AVX)        |  17    21    21   29 |  10    14    14    28
divps (SSE)         |  11    13    13   14 |   5     7     7    14
divss (scalar)      |  11    13    13   14 |
vsqrtpd (AVX)       |  35    35    35   44 |  28    28    28    43
sqrtpd (SSE)        |  20    20    20   23 |  14    14    14    22
sqrtsd (scalar)     |  20    20    20   23 |        14    14    22
vsqrtps (AVX)       |  21    21    21   23 |  14    14    14    22
sqrtps (SSE)        |  13    13    13   15 |   7     7     7    14
sqrtss (scalar)     |  13    13    13   15 |
vrcpps (AVX)        |   7     7     7    7 |   2     2     2     2
rcpps (SSE, scalar) |   5     5     5    5 |   1     1     1     1
*add*               |   †     3     3    3 |   1     1     1     1
*mul*               |                      | 0.5   0.5     1     1
*fma*               |   ‡     §     —    — | 0.5   0.5     —     —

† SP/DP AVX addition: 3 cycles; SP/DP SSE and scalar addition: 4 cycles
‡ SP/DP AVX FMA: 5 cycles; SP/DP SSE and scalar FMA: 6 cycles
§ SP scalar FMA: 6 cycles; all other: 5 cycles

The standard multiply, add, and fused multiply-add instructions have not changed dramatically over four generations, with two exceptions: Together with the introduction of FMA instructions with HSW, it became possible to execute two plain multiply (but not add) instructions per cycle. The latency of the add instruction in scalar and SSE mode on BDW has increased from three to four cycles; this result is not documented by Intel for BDW but announced for AVX code in the upcoming Skylake architecture. The fma instruction shows the same characteristic (a latency increase from 5 to 6 cycles when using SSE or scalar mode).

One architectural feature that is not directly evident from single-instruction measurements is the number of address generation units (AGUs). Up to IVB there are two such units, each paired with a LOAD unit with which it shares a port. As a consequence, only two addresses per cycle can be generated. HSW introduced a third AGU on the new port 7, but it can only handle simple addresses for STORE instructions, which may lead to some restrictions. See Sect. 3.3 for details.
The cores of all four microarchitectures feature two LOAD units and one STORE unit. The data paths between each unit and the L1 cache are 16 B wide on SNB and IVB, and 32 B wide on HSW and BDW. The theoretical bandwidth is thus 48 B/cy on SNB and IVB and 96 B/cy on HSW and BDW; however, several restrictions apply.
Fig. 2. (a) L1 bandwidth achieved with STREAM triad and various optimizations on BDW. (b) Comparison of achieved L1 bandwidths using STREAM triad on all microarchitectures.
An AVX-vectorized STREAM triad benchmark uses two AVX LOADs, one AVX FMA, and one AVX STORE instruction to update four DP elements. On HSW and BDW, only two address generation units (AGUs) are capable of performing the full address computations (base + scaled index + offset) typically used in streaming memory accesses; HSW's newly introduced third STORE AGU can only perform offset computations. This means that only two addresses per cycle can be calculated, limiting the L1 bandwidth to 64 B/cy. STREAM triad performance using only two AGUs is shown in Figure 2a. One can make use of the new AGU by using one of the "fast LEA" units (which can perform only indexed and no offset addressing) to pre-compute an intermediate address, which is then used by the simple AGU to complete the address calculation. This way both AVX LOAD units and the AVX STORE unit can be used simultaneously. When the STORE is paired with address generation on the new STORE AGU, both micro-ops are fused into a single micro-op. This means that the constraint of retiring at most four micro-ops per cycle in the front end should not be a problem: in each cycle, two AVX LOAD instructions, the micro-op-fused AVX STORE instruction, and one AVX FMA instruction can retire. With sufficient unrolling, loop instruction overhead becomes negligible and the bandwidth should approach 96 B/cy. Figure 2a shows, however, that micro-op throughput still seems to be the bottleneck, because bandwidth can be further increased by removing the FMA instructions from the loop body.

Figure 2b compares the bandwidths achievable by the different microarchitectures (using no arithmetic instructions on HSW and BDW for the reasons described above). On SNB and IVB a regular STREAM triad code can almost reach the maximum theoretical L1 performance because it only requires half the number of address calculations per cycle, i.e., two AGUs are sufficient to generate three addresses every two cycles.
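For reference, the structure of the kernel under discussion is sketched below with compiler intrinsics rather than the hand-written assembly that was actually benchmarked; the AGU/LEA scheduling described above can only be expressed in assembly, so this sketch merely shows the instruction mix (it assumes 32-byte-aligned arrays and n divisible by 4):

```c
#include <immintrin.h>

/* STREAM triad A[i] = B[i] + s*C[i], AVX/FMA version: per iteration two
 * AVX LOADs, one FMA, and one STORE update four DP elements. Sketch only;
 * the measured kernels are hand-written assembly (likwid-bench), where the
 * "fast LEA" address pre-computation for the port-7 STORE AGU is explicit. */
void triad_avx(double *restrict a, const double *restrict b,
               const double *restrict c, double s, long n)
{
    __m256d vs = _mm256_set1_pd(s);
    for (long i = 0; i < n; i += 4) {
        __m256d vb = _mm256_load_pd(&b[i]);       /* AVX LOAD  */
        __m256d vc = _mm256_load_pd(&c[i]);       /* AVX LOAD  */
        __m256d va = _mm256_fmadd_pd(vs, vc, vb); /* AVX FMA: s*C[i] + B[i] */
        _mm256_store_pd(&a[i], va);               /* AVX STORE */
    }
}
```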
Vector gather is a microcoded solution for loading non-contiguous data into vector registers. The instruction was first implemented in Intel multicore CPUs with AVX2 on HSW.
Table 3. Time in cycles per gather instruction on HSW and BDW, depending on how the data is distributed across cache lines (CLs).

                         |       Haswell-EP        |      Broadwell-EP
Location of data         |  L1    L2    L3    Mem  |  L1    L2    L3    Mem
Distributed across 1 CL  | 12.3  12.3  12.4  15.5  |  7.3   7.3   7.7  13.3
Distributed across 2 CLs | 12.5  12.5  13.2  23.0  |  7.5   7.6  11.0  24.5
Distributed across 4 CLs | 12.5  12.7  20.6  42.7  |  7.5   9.9  20.0  47.5
Distributed across 8 CLs | 12.3  18.4  38.5  89.3  |  7.3  18.1  38.2  94.4
The first implementation offered a poor latency (i.e., a long time until all data was placed in the vector register), and using hand-written assembly to manually load distributed data into vector registers proved to be faster than using the gather instruction in some cases [10].

Table 3 shows the gather instruction latency for both HSW and BDW. The latency depends on where the data is coming from and, in case the data is not in L1, over how many cache lines (CLs) it is distributed. We find that the instruction is 40% faster on BDW when the data is in L1. When data is coming from L2 on HSW and distributed across eight CLs, the latency is dominated by the time required to transfer eight CLs from the L2 to the L1 cache. On BDW, this effect is already visible when data is coming from the L2 cache and distributed across four CLs. BDW's improvement of the instruction offers no returns when the latency is dominated by CL transfers, which is the case when loading more than four CLs from L2, two from L3, or one from memory.
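A minimal illustration of the instruction in question (AVX2 intrinsics; the index array and data layout are hypothetical, and the manual-load variant corresponds to the "emulation" referred to above):

```c
#include <immintrin.h>

/* Load four doubles from non-contiguous locations a[idx[0..3]].
 * Sketch only: indices and arrays are made up for illustration. */
__m256d gather4(const double *a, const int *idx)
{
    __m128i vidx = _mm_loadu_si128((const __m128i *)idx);
    /* vgatherdpd: a single, microcoded instruction; scale = 8 B per double */
    return _mm256_i32gather_pd(a, vidx, 8);
}

/* "Emulated" gather with scalar loads and inserts, which on HSW could
 * outperform the real gather instruction in some cases. */
__m256d gather4_manual(const double *a, const int *idx)
{
    return _mm256_set_pd(a[idx[3]], a[idx[2]], a[idx[1]], a[idx[0]]);
}
```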
According to official documentation, the L2 cache bandwidth on HSW was increased from 32 B/cy to 64 B/cy compared to IVB. To validate this expectation, knowledge about overlapping transfers in the cache hierarchy is required. The ECM model for x86 assumes that no CLs are transferred between L2 and L1 in any cycle in which a LOAD instruction retires. Hence, the maximum of 64 B/cy can never be attained by design, but an improvement may still be expected. To derive the time spent transferring data, cycles in which LOAD instructions are retired are subtracted from the overall runtime with an in-L2 working set. The resulting bandwidth should be compared with the documented theoretical maximum.
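Written as a formula (our notation, not taken from the original ECM papers): with an in-L2 working set, the effective L1–L2 bandwidth is
\[
b_\mathrm{L2}^\mathrm{eff} = \frac{V_\mathrm{L1\leftrightarrow L2}}{T_\mathrm{total} - T_\mathrm{LD}},
\]
where \(V_\mathrm{L1\leftrightarrow L2}\) is the data volume moved between the two caches, \(T_\mathrm{total}\) the measured runtime in cycles, and \(T_\mathrm{LD}\) the number of cycles in which LOAD instructions retire. This is how we read the procedure behind the numbers reported in Table 4.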
Table 4. Measured L1–L2 bandwidth in B/cy on different microarchitectures for dot product and STREAM triad access patterns.

Pattern      | Code                  | SNB | IVB | HSW | BDW
Dot product  | dot += A[i]*B[i]      |  28 |  27 |  43 |  43
STREAM triad | A[i] = B[i] + s*C[i]  |  29 |  29 |  32 |  32
Table 4 shows the measured bandwidths for a dot product (a load-only benchmark) and the STREAM triad. Both SNB and IVB operate near the specified bandwidth of 32 B/cy for both access patterns. Although HSW and BDW offer bandwidth improvements, especially in case of the dot product, the measured bandwidths are significantly below the advertised 64 B/cy.

The question arises of how this result may be incorporated into the ECM model. Preliminary experiments indicate that the ECM predictions for in-L3 data are quite accurate when assuming the theoretical L2 throughput. We could thus interpret the low L2 performance as a consequence of a latency penalty, which can be overlapped when the data is further out in the hierarchy. Further experiments are needed to substantiate this conjecture.
Together with the dual-ring interconnect, HSW introduced the CoD mode, in which a single chip can be partitioned into two equally-sized NUMA clusters. HSW features a so-called "eight plus x" design, in which the first physical ring features eight cores and the second ring contains the remaining cores (six for our HSW chip). This asymmetry leads to a scenario in which the seven cores of the first cluster domain are physically located on the first ring; the second cluster domain contains the remaining core from the first and the six cores from the second physical ring. The asymmetry was removed on BDW: here both physical rings are of equal size, so both cluster domains contain cores from dedicated rings. CoD is intended for NUMA-optimized code and impacts L3 scalability and latency and, implicitly, main memory bandwidth, because it uses a dedicated snoop mode that makes use of a directory to avoid unnecessary snoop requests (see Section 5.2 for more details).

Figure 3a shows the influence of CoD on L3 bandwidth (using the STREAM triad) for HSW and BDW. When data is distributed across both rings on HSW, the parallel efficiency of the L3 cache is 92%; it can be raised to 98% by using CoD. The higher core count of BDW results in a more pronounced effect; here, parallel efficiency is only 86% in non-CoD mode. Using CoD the efficiency goes above 95%. Figure 3b shows that HSW and BDW with CoD offer similar L3 scalability as SNB and IVB.

For an n-core chip, the topological diameter (and with it the average distance from a core to data in an L3 segment) is smaller in each of the two n/2-core cluster domains than in the full chip. Shorter paths between cores and data result in lower latencies when using CoD mode. On BDW, the L3 latency is 41 cycles with CoD and 47 cycles without (see Table 5).

Starting with HSW, the QPI snoop mode can be selected at boot time. HSW supports three snoop modes: early snoop (ES), home snoop (HS), and directory (DIR), the latter often only indirectly selectable by enabling CoD in the BIOS [15,11,16].
Fig. 3. (a) L3 scalability on HSW and BDW depending on whether CoD is used. (b) Comparison of microarchitectures regarding L3 scalability. (c) Absolute L3 bandwidth for STREAM triad as a function of the number of cores on different microarchitectures.
BDW introduced a fourth snoop mode called HS with opportunistic snoop broadcast (HS+OSB) [16]. The remainder of this section discusses the differences among the modes and their immediate impact on memory latency and bandwidth.

On an L3 miss inside a NUMA domain, in addition to fetching the CL containing the requested data from main memory, cache coherence mandates that other NUMA domains be checked for modified copies of the CL. Attached to each L3 segment is a cache agent (CA) responsible for sending and receiving snoop information. (CLs are mapped to L3 segments based on their addresses according to a hashing function; thus, each CA knows which CA in other NUMA domains is responsible for a certain CL.) In addition to multiple CAs, each NUMA domain features a home agent (HA), which plays a major role in snooping.

In ES, snoop requests are sent directly from the CA of the L3 segment in which the L3 miss occurred to the respective CAs in other NUMA domains. Queried remote CAs respond directly back to the requesting CA; in addition, they report to the HA in the requesting CA's domain, so it can resolve potential conflicts. ES involves a lot of requests and replies, but offers low latencies.

In HS, CAs forward snoop requests to their NUMA domain's HA. The HA proceeds to fetch the requested CL from memory but stalls snoop requests to remote NUMA domains until the CL is available. For each CL, so-called directory information is stored in its memory ECC bits. The bits indicate whether a copy of the CL exists in other NUMA domains. The directory bits only tell whether a CL is present or not in other NUMA domains; they do not tell which NUMA domain to query, so snoops have to be broadcast to all NUMA domains. By waiting for directory data, unnecessary snoop requests are avoided at the cost of higher latency due to delayed snoops. By reducing snoop requests, overall bandwidth can be increased. As in ES, potentially queried remote CAs respond to the initiating CA and HA, which resolves potential conflicts.
Table 5. Measured access latencies of all memory hierarchy levels in base-frequency core cycles.

µarch |  L1 |  L2 |       L3       |              MEM
SNB   |   4 |  12 |       40       |              230
IVB   |   4 |  12 |       40       |              208
HSW   |   4 |  12 |       37       |
BDW   |   4 |  12 | 47 (a), 41 (b) | 248 (c), 280 (d), 190 (e), 178 (f)

(a) CoD disabled, (b) CoD enabled, (c) ES, (d) HS, (e) HS+OSB, (f) DIR
In DIR, a two-step approach is used. Starting with HSW, each HA features a 14 kB directory cache (also called the "HitMe" cache) holding additional directory information for CLs present in remote NUMA domains. In addition to the directory information recorded in the ECC bits, the directory cache stores the particular NUMA domain in which the copy of a CL resides; this means that on a hit in the directory cache only a single snoop request has to be sent. This mechanism further reduces snoop traffic, potentially increasing bandwidth. When the directory cache is hit, latency is also improved in DIR compared to HS, because snoops are not delayed until the directory information stored in the ECC bits becomes available from main memory. In case of a directory cache miss, DIR mode proceeds similarly to HS. Note, however, that DIR mode is recommended only for NUMA-aware workloads. The directory cache can only hold data for a small number of CLs. If the number of CLs shared between both cluster domains exceeds the directory cache capacity, DIR mode degrades to HS mode, resulting in high latencies. (Investigations using the HITME* performance counter events indicate that this cache is used exclusively in DIR mode.)

BDW's new HS+OSB mode works similarly to HS. However, HAs will send opportunistic snoop requests while waiting for the directory information stored in the ECC bits under "light" traffic conditions. Latency is reduced in case the directory information indicates that snoop requests have to be sent, because they were already sent opportunistically. Redundant snoop requests are not supposed to impact performance under "light" traffic conditions.

The impact of snoop modes is largest on main memory latency. As expected, DIR produces the best results with 178 cy (see Table 5). Pointer chasing in main memory does not generate a lot of traffic on the ring interconnect, which is why HS+OSB will generate opportunistic snoops, achieving a latency of 190 cy. The difference in latency of 12 cy compared to DIR can be explained by the shorter paths inside a single cluster domain in CoD mode: we measured an L3 latency of 41 cy for CoD and 47 cy for non-CoD mode, and since memory accesses pass through the interconnect twice (once to request the CL, once to deliver it) the memory latency in non-CoD mode is expected to be higher by twice the L3 latency penalty of six cycles. In ES, the requesting CA has to wait for its HA to acknowledge that it received all snoop replies from the remote CAs, which causes a latency penalty; on BDW, the measured memory latency in ES is 248 cy. As expected, HS offers the worst latency at 280 cy, because necessary snoop broadcasts are delayed until the directory information becomes available from main memory.
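The latencies in Table 5 were obtained with a pointer-chasing benchmark as described in Sect. 2; a minimal sketch of the idea is shown below (our own illustration, not the exact likwid code; prefetchers must be disabled, and the chain touches consecutive cache lines):

```c
/* One element per 64 B cache line; each element stores the address of the
 * next one, so every access depends on the previous load. The average
 * load-to-use latency is (measured cycles) / (number of accesses). */
struct node { struct node *next; char pad[64 - sizeof(struct node *)]; };

/* Chain over consecutive cache lines, as used for the Table 5 measurements. */
void build_chain(struct node *nodes, long n)
{
    for (long i = 0; i < n; ++i)
        nodes[i].next = &nodes[(i + 1) % n];
}

/* Timing (e.g., rdtsc or likwid-perfctr) is done around the call. */
struct node *chase(struct node *chain, long accesses)
{
    volatile struct node *p = chain;
    for (long i = 0; i < accesses; ++i)
        p = p->next;               /* serialized loads, no concurrency */
    return (struct node *)p;       /* keep the chain walk alive */
}
```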
Fig. 4. (a) Graph500 performance in millions of traversed edges per second (MTEP/s) as a function of snoop mode on BDW. (b) Graph500 performance of all chips. (c) HPCG performance and performance per Watt as a function of Uncore frequency.
Fig. 5.
Sustained main memory bandwidth on BDW for various access patterns. NT = nontemporal stores.
Graph500 was chosen to evaluate the influence of snoop modes on the performance of latency-sensitive workloads. Figure 4a shows Graph500 performance for a single BDW chip. A direct correlation between latency and performance can be observed for HS, ES, and HS+OSB. DIR mode performs worst despite offering the best memory latency. This can be explained by the non-NUMA-awareness of the Graph500 benchmark: too much data is shared between both cluster domains, which means that the directory cache cannot hold information on all shared CLs. As a result, snoops are delayed until directory information from main memory becomes available. Figure 4b shows an overview of Graph500 performance on all chips and the qualitative improvement offered by the new HS+OSB snoop mode introduced with BDW.

The effect of the snoop mode on memory bandwidth for BDW is shown in Fig. 5. The data is roughly in line with the reasoning above. For NUMA-aware workloads, DIR should produce the least snoop traffic due to snoop information stored in the directory cache. This is reflected in a slightly better bandwidth compared to other snoop modes (with the exception of the non-temporal (NT) store access pattern, which seems to be a toxic case for DIR mode). DIR offers up to 10 GB/s more for load-only access patterns when compared to ES, which produces the largest amount of snoop traffic. The effect is less pronounced but still observable when comparing DIR to HS and HS+OSB.
Fig. 6.
Comparison of sustained main memory bandwidth across microarchitectures for various access patterns.

Figure 6 shows the evolution of sustained memory bandwidth for all examined microarchitectures, using the best snoop mode on HSW and BDW. The increase in bandwidth over the generations is explained by new DDR standards as well as increased memory clock speeds (see Table 1).
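For clarity, the difference between the "copy" and "copy (NT)" patterns in Figs. 5 and 6 lies only in the store instruction used; a sketch with intrinsics follows (our illustration, assuming 32-byte-aligned buffers and n divisible by 4):

```c
#include <immintrin.h>

/* Regular copy: stores go through the cache hierarchy and, on a write miss,
 * trigger a read-for-ownership of the target cache line. */
void copy(double *restrict dst, const double *restrict src, long n)
{
    for (long i = 0; i < n; i += 4)
        _mm256_store_pd(&dst[i], _mm256_load_pd(&src[i]));
}

/* NT copy: streaming stores bypass the cache and avoid the read-for-ownership,
 * which changes both the data volume and the coherence traffic that the
 * snoop modes have to handle. */
void copy_nt(double *restrict dst, const double *restrict src, long n)
{
    for (long i = 0; i < n; i += 4)
        _mm256_stream_pd(&dst[i], _mm256_load_pd(&src[i]));
    _mm_sfence();   /* make the streaming stores globally visible */
}
```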
Before HSW, the Uncore was clocked at the same frequency as the cores. Starting with HSW, the Uncore has its own clock frequency. The motivation for this lies in potential energy savings: When the cores do not require much data via the Uncore (i.e., from/to the L3 cache and main memory) the Uncore can be slowed down to save power. This mode of operation is called UFS. For our BDW chip, the Uncore frequency can vary automatically between 1.2 and 2.8 GHz, but one can also define custom minimum and maximum settings within this range via MSRs.

We examine the default UFS behavior for both extremes of the Roofline spectrum and use HPCG as a bandwidth-bound and LINPACK as a compute-bound benchmark. Our findings indicate that at both ends of the spectrum, UFS tends to select higher than necessary frequencies, pointlessly boosting power and, in the case of LINPACK, even hurting performance.

Figure 4c shows HPCG performance and energy efficiency versus Uncore frequency for a fixed core clock of 2.3 GHz on HSW. We find that the Uncore is the performance bottleneck only for Uncore frequencies below 2.0 GHz. Increasing it beyond this point does not improve performance, because main memory is now the bottleneck. Using performance counters, the Uncore frequency was determined to be at the maximum of 2.8 GHz when running HPCG in UFS mode. The energy efficiency of 64.7 MFLOP/s/W at 2.8 GHz is 26% lower than the 87.2 MFLOP/s/W observed at 2.0 GHz Uncore frequency, at almost the same performance. Energy efficiency can be increased even more by further lowering the Uncore clock; however, below 2.0 GHz performance is degraded.

For LINPACK, we observe a particularly interesting side effect of varying the Uncore frequency. Figure 7 shows LINPACK performance on BDW as a function of core and Uncore clock. Note that in Turbo mode, the performance increases when going from the highest Uncore frequencies towards 1.8 GHz.
Fig. 7. LINPACK performance on BDW as a function of core and Uncore frequency.

This effect is caused by the Uncore and the cores competing for the chip's TDP budget. When the Uncore clock speed is reduced, a larger part of the chip's power budget can be consumed by the cores, which in turn boost their frequency. The core frequency in Turbo mode is 2479 MHz when the Uncore clock is set to 2.8 GHz (the Uncore actually only achieves a clock rate of 2475 MHz) vs. 2595 MHz when the Uncore clock is set to 1.8 GHz. Below 1.8 GHz the CPU frequency increases further, e.g., to 2617 MHz at an Uncore clock of 1.7 GHz and up to 2720 MHz at an Uncore clock of 1.2 GHz. LINPACK performance starts to degrade at this point despite the increasing core frequency, because the Uncore becomes a data bottleneck. In UFS mode, the Uncore is clocked at 2489 MHz and the cores run at 2491 MHz. Compared to the optimum, UFS degrades performance by 3%. Energy efficiency is reduced by 6%, from 4.94 GFLOP/s/W at an Uncore clock of 1.8 GHz to 4.65 GFLOP/s/W in UFS. The most energy-efficient operating point for LINPACK is 5.74 GFLOP/s/W at a core clock of 1.6 GHz and an Uncore clock of 1.2 GHz.
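For completeness, a sketch of how the Uncore frequency limits can be pinned via MSRs on these chips (in our experiments this is handled by likwid-setFrequencies; the register commonly documented for HSW/BDW-EP is MSR_UNCORE_RATIO_LIMIT at address 0x620, with the maximum and minimum ratio in 100 MHz steps in bits 6:0 and 14:8 — treat the address and bit layout as an assumption to be checked against Intel's documentation):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_UNCORE_RATIO_LIMIT 0x620  /* assumed address for HSW/BDW-EP */

/* Pin the Uncore clock of socket 0 to [min_ghz, max_ghz].
 * Requires the msr kernel module and root privileges. */
int set_uncore_limits(double min_ghz, double max_ghz)
{
    uint64_t maxr = (uint64_t)(max_ghz * 10.0) & 0x7f;  /* ratio = f / 100 MHz */
    uint64_t minr = (uint64_t)(min_ghz * 10.0) & 0x7f;
    uint64_t val  = (minr << 8) | maxr;                 /* assumed bit layout */

    int fd = open("/dev/cpu/0/msr", O_WRONLY);
    if (fd < 0) { perror("open msr"); return -1; }
    if (pwrite(fd, &val, sizeof(val), MSR_UNCORE_RATIO_LIMIT) != sizeof(val)) {
        perror("pwrite msr");
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```

For example, set_uncore_limits(1.8, 1.8) would fix the Uncore clock at 1.8 GHz, the most efficient LINPACK setting found above.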
We have conducted an analysis of core- and chip-level performance features of four recent Intel server CPU architectures. Previous findings about the behavior of clock speed and its interaction with thermal design limits on Haswell and Broadwell CPUs could be confirmed. Overall, the documented instruction latency and throughput numbers fit our measurements, with slight deviations in scalar DP divide throughput and SSE/scalar add and fused multiply-add latency on Broadwell. We could also demonstrate the consequences of limited instruction throughput and the special properties of Haswell's and Broadwell's address generation units for L1 cache bandwidth.

Our microbenchmark results have unveiled that the gather instruction, which was newly introduced with the AVX2 instruction set, was finally implemented on Broadwell in a way that makes it faster than hand-crafted assembly. The L2 cache on Haswell and Broadwell does not keep its promise of doubled bandwidth to L1 but only delivers between 32 and 43 B/cy, as opposed to Sandy Bridge and Ivy Bridge, which get close to their architectural limit of 32 B/cy.

The scalable L3 cache was one of the major innovations in the Sandy Bridge architecture. On Haswell and Broadwell, the bandwidth scalability of the L3 cache is substantially improved in Cluster on Die (CoD) mode. Even without CoD the full-chip efficiency (at up to 18 cores) is never worse than 85%. In the memory domain we find, unsurprisingly, that CoD provides the lowest latency and highest memory bandwidth (except with streaming stores), but the irregular Graph500 benchmark shows a 50% speedup on Broadwell when switching to non-CoD and Home Snoop with Opportunistic Snoop Broadcast.

Finally, our analysis of core and Uncore clock speed domains has exhibited significant potential for saving energy through a sensible setting of the Uncore frequency, without sacrificing execution performance.

Future work will include a thorough evaluation of the ECM performance model on all recent Intel architectures, putting to use the insights generated in this study. Additionally, existing analytic power and energy consumption models will be extended to account for the Uncore power more accurately. Significant changes in performance and power behavior are expected for the upcoming Skylake architecture, such as (among others) an L3 victim cache and AVX-512 on selected models, and will pose challenges of their own.
References
1. Barker, K., Davis, K., Hoisie, A., Kerbyson, D.J., Lang, M., Pakin, S., Sancho, J.C.: A performance evaluation of the Nehalem quad-core processor for scientific computing. Parallel Processing Letters 18(4), 453–469 (December 2008), http://dx.doi.org/10.1142/S012962640800351X
2. Gasc, T., Vuyst, F.D., Peybernes, M., Poncet, R., Motte, R.: Building a more efficient Lagrange-remap scheme thanks to performance modeling. In: Papadrakakis, M., et al. (eds.) Proc. ECCOMAS Congress 2016, the VII. European Congress on Computational Methods in Applied Sciences and Engineering, Crete Island, Greece, 5–10 June 2016 (2016)
3. Hackenberg, D., Oldenburg, R., Molka, D., Schöne, R.: Introducing FIRESTARTER: A processor stress test utility. In: 2013 International Green Computing Conference Proceedings. pp. 1–9 (June 2013)
4. Hackenberg, D., Schöne, R., Ilsche, T., Molka, D., Schuchart, J., Geyer, R.: An energy efficiency feature survey of the Intel Haswell processor. In: 2015 IEEE International Parallel and Distributed Processing Symposium Workshop. pp. 896–904 (May 2015)
5. Hager, G., Treibig, J., Habich, J., Wellein, G.: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency Computat.: Pract. Exper. (2013), DOI: 10.1002/cpe.3180
6. Hockney, R.W., Curington, I.J.: f_{1/2}: A parameter to characterize memory and communication bottlenecks. Parallel Computing 10(3), 277–286 (1989)
7. Hofmann, J., Fey, D.: An ECM-based energy-efficiency optimization approach for bandwidth-limited streaming kernels on recent Intel Xeon processors. In: Proceedings of the 4th International Workshop on Energy Efficient Supercomputing. pp. 31–38. E2SC '16, IEEE Press, Piscataway, NJ, USA (2016), https://doi.org/10.1109/E2SC.2016.16
8. Hofmann, J., Fey, D., Eitzinger, J., Hager, G., Wellein, G.: Analysis of Intel's Haswell microarchitecture using the ECM model and microbenchmarks, pp. 210–222. Springer International Publishing, Cham (2016), http://dx.doi.org/10.1007/978-3-319-30695-7_16
9. Hofmann, J., Fey, D., Riedmann, M., Eitzinger, J., Hager, G., Wellein, G.: Performance analysis of the Kahan-enhanced scalar product on current multi-core and many-core processors. Concurrency and Computation: Practice and Experience (2016), http://dx.doi.org/10.1002/cpe.3921
10. Hofmann, J., Treibig, J., Hager, G., Wellein, G.: Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips. In: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing. pp. 57–64. WPMVP '14, ACM, New York, NY, USA (2014), http://doi.acm.org/10.1145/2568058.2568068
11. Intel Corporation: Intel Xeon Processor E5-1600, E5-2400, and E5-2600 v3 Product Families - Volume 2 of 2, Registers
12. Intel Corporation: Intel Xeon Processor E5 v3 Product Family
13. McCalpin, J.D.: Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter pp. 19–25 (Dec 1995)
14. Microway Inc.: Detailed specifications of the Intel Xeon E5-2600 v4 "Broadwell-EP" processors
15. Molka, D., Hackenberg, D., Schöne, R., Nagel, W.E.: Cache coherence protocol and memory performance of the Intel Haswell-EP architecture. In: Proceedings of the 44th International Conference on Parallel Processing (ICPP'15). IEEE (2015)
16. Kottapalli, S., Geetha, V., Neefs, H.G., Choi, Y.: Patent US20130007376 A1: Opportunistic snoop broadcast (OSB) in directory enabled home snoopy systems
17. Schöne, R., Treibig, J., Dolz, M.F., Guillen, C., Navarrete, C., Knobloch, M., Rountree, B.: Tools and methods for measuring and tuning the energy efficiency of HPC systems. Scientific Programming 22(4), 273–283 (2014), http://dx.doi.org/10.3233/SPR-140393
18. Stengel, H., Treibig, J., Hager, G., Wellein, G.: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. In: Proceedings of the 29th ACM International Conference on Supercomputing. ICS '15, ACM, New York, NY, USA (2015), http://doi.acm.org/10.1145/2751205.2751240
19. Treibig, J., Hager, G., Hofmann, H.G., Hornegger, J., Wellein, G.: Pushing the limits for medical image reconstruction on recent standard multicore processors. The International Journal of High Performance Computing Applications 27(2), 162–177 (2013), http://dx.doi.org/10.1177/1094342012442424
20. Treibig, J., Hager, G., Wellein, G.: likwid-bench: An extensible microbenchmarking platform for x86 multicore compute nodes. In: Parallel Tools Workshop. pp. 27–36 (2011)
21. Wilde, T., Auweter, A., Shoukourian, H., Bode, A.: Taking advantage of node power variation in homogenous HPC systems to save energy, pp. 376–393. Springer International Publishing, Cham (2015), http://dx.doi.org/10.1007/978-3-319-20119-1_27
22. Williams, S., Waterman, A., Patterson, D.: Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009), http://doi.acm.org/10.1145/1498765.1498785
23. Wittmann, M., Hager, G., Zeiser, T., Treibig, J., Wellein, G.: Chip-level and multi-node analysis of energy-optimized lattice Boltzmann CFD simulations. Concurrency and Computation: Practice and Experience 28(7), 2295–2315 (2016)