Performance Optimizations of Recursive Electronic Structure Solvers targeting Multi-Core Architectures (LA-UR-20-26665)
Adetokunbo A. Adedoyin, Christian F. A. Negre, Jamaludin Mohd-Yusof, Nicolas Bock, Daniel Osei-Kuffuor, Jean-Luc Fattebert, Michael E. Wall, Anders M. N. Niklasson, Susan M. Mniszewski
A Preprint
Adetokunbo A. Adedoyin, Los Alamos National Laboratory, CCS-7 Division, [email protected]
Christian F. A. Negre, Los Alamos National Laboratory, T-1 Division
Jamaludin Mohd-Yusof, Los Alamos National Laboratory, CCS-7 Division
Nicolas Bock, Los Alamos National Laboratory, CCS-3 Division
Daniel Osei-Kuffuor, Lawrence Livermore National Laboratory, CASC Division
Jean-Luc Fattebert, Oak Ridge National Laboratory, CSE Division
Michael E. Wall, Los Alamos National Laboratory, CCS-3 Division
Anders M. N. Niklasson, Los Alamos National Laboratory, T-1 Division
Susan M. Mniszewski, Los Alamos National Laboratory, CCS-3 Division, [email protected]
February 18, 2021
Abstract
As we rapidly approach the frontiers of ultra-large computing resources, software optimization is becoming of paramount interest to scientific application developers interested in efficiently leveraging all available on-node computing capabilities, thereby improving the requisite science-per-watt metric. The scientific application of interest here is the Basic Matrix Library (BML), which provides a single interface for the linear algebra operations frequently used in the Quantum Molecular Dynamics (QMD) community. The provision of a single interface indicates the presence of an abstraction layer, which in turn suggests commonalities in the code base; any optimization or tuning introduced in the core of the code base can therefore positively affect the performance of the library as a whole. With that in mind, we proceed with this investigation by surveying the entirety of the BML code base and extracting common snippets of code in the form of micro-kernels. Since the data structures of the core BML code base are well established, we pursue less invasive optimization strategies; that is, we focus our effort on optimizing at the thread level rather than modifying the data structures for performance at the Single Instruction Multiple Data (SIMD) level. We incorporate several active and passive directives/pragmas that help inform the compiler about the nature of the data structures and algorithms. Following that, we introduce several optimization strategies into these micro-kernels, including 1) strength reduction, 2) memory alignment for large arrays, 3) Non-Uniform Memory Access (NUMA) aware allocations to enforce data locality, and 4) appropriate thread affinity and bindings to enhance the overall multi-threaded performance.

After introducing these optimizations, we benchmark the micro-kernels and compare the run-times before and after optimization for several target architectures. Finally, we use the results as a guide to propagating the optimization strategies into the BML code base. As a demonstration, we test the efficacy of these optimization strategies by comparing the benchmark and optimized versions of the code using 1) a matrix-matrix multiplication in the ELLPACK format and 2) a full simulation using ExaSP2, a proxy application for linear scaling electronic structure calculations. The results of the optimization are promising and in agreement with findings in the literature.

Keywords: MULTI-THREADED OPTIMIZATIONS · STRENGTH REDUCTION · MEMORY ALIGNMENT · NON UNIFORM MEMORY ACCESS · DATA LOCALITY · THREAD AFFINITY AND BINDINGS · MULTI-THREADED PERFORMANCE
The optimization of electronic structure codes requires different efforts, including the improvement of solvers and the adaptation of basic operations to novel computer architectures. With the advent of exascale computing architectures, this adaptation requires thorough modifications in order to maximize the use of the available computational capacity [1, 2]. Furthermore, in the specific case of electronic structure calculations, different chemical systems require specific solvers as well as specific architecture adaptations, and generally, what is beneficial for some systems will not necessarily be beneficial for others. The QMD technique requires solving for the electronic structure of the physical system in order to advance the positions of the atoms at each simulation time-step. Solving for the electronic structure generally involves a high computational cost, because the number of arithmetic operations scales with the cube of the number of atoms. To alleviate this cost, various O(N) complexity algorithms have been proposed [3, 4]. These algorithms typically approximate the solution with an error that can be controlled by means of an adjustable parameter. They are typically iterative and require linear algebra operations in each iteration [5, 6]. For optimal performance, however, these linear scaling codes have to be optimized at both the vector and the thread level.

Optimizing data structures for improved parallelism is challenging and can be somewhat disruptive; therefore, we focus on optimizing at the thread level. One of the primary purposes of multi-core computer technology is latency hiding. The cost of latency hiding, however, is the inherent programmability challenge that must be overcome in order to achieve high performance gains in software applications, measured in terms of improvements in floating point operations per second (FLOPS) or run-time.

Therefore, herein, we attempt to optimize the Basic Matrix Library (BML) software for performance gains in terms of run-time. BML is a collection of matrix data formats (dense and sparse) and basic matrix operations designed to help electronic structure codes run efficiently on various platforms [7–9]. We focus on several strategies, namely (i) strength reduction (SR), (ii) memory alignment (MA) to prevent cache contention, (iii) memory initialization with an understanding of the first touch (FT) policy on Linux systems, and (iv) thread affinity and binding (A&B) optimizations that work in conjunction with all the aforementioned techniques for optimal performance.

The rest of this manuscript is organized as follows. Section 4 expands on the optimization techniques implemented in the BML software [9]. It opens with section 4.1, where we summarize the overall approach taken to optimizing BML. This is followed by section 4.2, which discusses the target computer architectures used for evaluating the performance of the BML code base before and after optimization. Sections 4.3 to 4.6 discuss and demonstrate in detail the optimization techniques introduced into BML, namely strength reduction (SR) [10–18], NUMA aware allocations to enforce data locality [12, 19–22], memory alignment (MA) [23–30] for large arrays, and appropriate thread affinity and bindings (A&B) [31–35] to enhance multi-threaded performance, respectively. Sections 4.3 to 4.6 are also accompanied by sample implementations, in the form of code snippets, demonstrating the aforementioned optimization techniques, along with individual results of the performance improvements. Section 5.2 combines all the aforementioned optimization techniques and introduces them into the BML software [9].
Following that, we generate several pseudo system matrices representing metals, semiconductors, and soft matter, ranging in size from 1,000 to 32,000, and evaluate the efficacy of these optimizations by measuring the performance differences of BML's ELLPACK matrix-matrix multiply algorithm before and after tuning [9]. Section 5.3 evaluates the performance of ExaSP2, a proxy application for performing QMD calculations that relies on the BML software [9] for its linear algebra computations. System matrices representing semiconductors and soft matter of size 32,000 are evaluated for performance. Section 6 contains a summary discussion of the findings herein.

Performance optimization of a large code base can be a daunting task; therefore, we introduce carefully designed subroutines, henceforth referred to as micro-kernels, that are representative of the methods in the un-optimized BML software [9]. For ease of evaluating the effects of the optimization techniques introduced herein, we design the micro-kernels such that complex code behaviors, e.g. cache thrashing, branching, and non-unit stride access, are avoided, though they are present in BML's ELLPACK subroutines. At a later time, we plan to address the effects of such complex behaviors as they impede overall performance. We proceed by optimizing these micro-kernels in a step-by-step manner while being mindful of the compile-time differences that may be introduced while building and linking the actual BML software [9]. The target algorithm is memory bandwidth bound, as it is a sparse matrix-matrix multiply with a low ratio of floating point operations to bytes of data requested from memory.
The target multi-core platforms experimented with herein are recent releases of Intel architectures, namely Intel's Skylake Gold, Skylake Platinum, and Cascade Lake with and without support for Intel's Optane™ DC persistent memory. Table 1 contains the specification details of these architectures.
Hardware Specification

Features                 Skylake-Gold   Skylake-Platinum   Cascade Lake   Cascade Lake w/ Optane
Model Number             6152           8176               6254           8260
Launch Date              Q3'17          Q3'17              Q2'19          Q2'19
Cores                    22             28                 18             24
Threads Per Core         2              2                  2              2
Base Frequency (GHz)     2.10           2.10               3.10           2.40
Turbo Frequency (GHz)    3.70           3.80               4.10           3.90
L3 Cache (MB)            30.25          38.50              24.75          35.75
HBM (MB)                 No             No                 No             No
TDP (W)                  140            165                200            165
Table 1: Hardware specification for Intel architectures (Ark Intel).
Strength reduction [10–18] is an optimization procedure, driven either by human intervention or by the compiler, whereby high latency (costly) operations are substituted with their lower latency (cheaper) counterparts while maintaining mathematical correctness. SR or approximate strength reduction (ASR) techniques can vary from very simple [12, 13] to more complex and involved substitutions [10, 11, 14–18]. The need for SR tuning in the BML software [9], though not prevalent, is low hanging fruit; we therefore evaluate the performance difference between typical occurrences of in-loop division and the equivalent multiplications. Figure 1 is a code snippet of the micro-kernel, typical of the occurrences in the BML software [9]. Figure 2 is an SR substitution in which we replace the divisions within the loop with multiplications by a single precomputed reciprocal.

    for (i = 0; i < A->N; i++) {
        A->index[i] = i;
        A->nnn[i] += i;
        A->value[i] = 16.0 / RAND_MAX;
    }

Figure 1: Micro-kernel without strength reduction, representative of the subroutines in BML [9] before tuning.

    double INV_RAND_MAX = 1.0 / RAND_MAX;
    for (i = 0; i < A->N; i++) {
        A->index[i] = i;
        A->nnn[i] += i;
        A->value[i] = 16.0 * INV_RAND_MAX;
    }

Figure 2: Micro-kernel with strength reduction applied.

Figure 3 is a run-time comparison of the benchmark (BHMK) vs. optimized (TUNED) micro-kernels for multiple dual-socket Intel architectures using the Intel 17 compiler. The speedup of the optimized (TUNED) micro-kernel is approximately 2X across multiple thread counts and varying architectures.
Data locality in the BML software [9] is tuned for by ensuring that 1) malloc-ed memory is initialized in a parallel region with the "First Touch" policy in mind, and 2) the affinity and binding settings for the initialization and computation loops over data used in multi-threaded regions are specified carefully, such that computational data remains NUMA-local. The "First Touch" policy, with respect to memory allocation and memory page assignment on Linux systems, associates the physical location of memory (the NUMA [36] domain) with the point at which the actual memory addresses first get modified (initialization). At the point of modification, the memory pages are assigned to, and thereafter associated with, that specific NUMA domain. The First Touch policy therefore determines memory page ownership [23]. Consideration for NUMA in order to avoid needless data movement (data locality) can be implemented at different levels of parallelism, including the SIMD, thread, and node levels. Several researchers [12, 19–22] have shown that this approach improves the overall performance of computational algorithms by avoiding needless data movement.

Prior to optimization for data locality, typical linear algebra operations performed through the BML API required three steps, namely 1) memory allocation and initialization, followed by 2) the linear algebra calculation of interest, and 3) memory de-allocation. The memory allocation and initialization step is typically achieved in BML via two calls, one to malloc() and the other to memset(); it can also be achieved via a single call to calloc(). The second step, the linear algebra operation of interest, e.g. a matrix-matrix multiply, typically comprises a multi-threaded loop in a parallel region written using the OpenMP paradigm. With the First Touch policy in mind, steps one and two are in conflict, given that they are performed in a serial and a parallel region, respectively. The first step is a serial operation with an associated performance penalty, as the memory pages become associated with the specific NUMA domain in which they were first touched. Figure 4 is a code snippet of the micro-kernel, typical of the occurrences in the BML software. Figure 5 is a NUMA aware data initialization that ensures data locality in the computational loop by first touching data in a parallel region, as opposed to using memset() to initially modify the malloc-ed data as in Figure 4. Figure 6 is a pictorial illustration of a simplified modern multi-core computer showing the computational cores and memories (RAM) associated with two NUMA domains. NUMA domain zero (Core:0 and Memory:0) and NUMA domain one (Core:1 and Memory:1) each have an associated compute core and memory, colored in red and blue respectively. To minimize obscurity, we simplified the
Figure 3: Run-time comparison of the benchmark (BHMK) vs. optimized (TUNED) micro-kernel for SR on Cascade Lake (top-left), Skylake Gold (top-right), Skylake Platinum (bottom-left), and Cascade Lake with Intel Optane technology (bottom-right).

physical description of each NUMA domain (0 and 1) and further assume that each core is actually multiple cores, each with multiple levels of associated cache hierarchy and bandwidth infrastructure. The blue banks of memory located on NUMA:0 (red) represent data initialized on Memory:0 that potentially get moved to NUMA:1 (blue) during computation. This movement leads to an additional performance penalty (higher latency) when performing computation on non-local data. This needless computational expense may also lead to cross-NUMA-domain cache conflicts. Figure 7 shows a snapshot of the preferred association of data and computational cores in a compute loop. A completely disjoint subset of data associated with each distinct NUMA domain is ideal, though in practice this may require careful software design. Figure 8 is a run-time comparison of the benchmark (BHMK) vs. optimized (TUNED) micro-kernels for multiple dual-socket Intel architectures using the Intel 17 compiler. The speedup of the optimized (TUNED) micro-kernel is at best 3X for high thread counts and 2X on average across all architectures. In conjunction with NUMA aware allocations is the careful specification of thread affinity and binding, to ensure that thread placement is consistent with data location, that the specified number of threads per core is consistent with the arithmetic intensity of the application under consideration, and that thread migration is avoided. Other authors [37–42] have demonstrated the challenges associated with non-locality of data and have proposed several solutions.
    // Data Allocation
    A->nnz   = (int *)    malloc(sizeof(int)    * N);
    A->index = (int *)    malloc(sizeof(int)    * N * M);
    A->value = (double *) malloc(sizeof(double) * N * M);

    // Non-NUMA Aware Initialization
    memset(A->nnz,   0, A->N * sizeof(int));
    memset(A->index, 0, A->N * A->M * sizeof(int));
    memset(A->value, 0, A->N * A->M * sizeof(double));

    // Computational Loop
    for (i = 0; i < A->N; i++)
        compute(A);

Figure 4: Micro-kernel without consideration for data locality, representative of the subroutines in BML [9] before tuning.

    // Data Allocation
    A->nnz   = (int *)    malloc(sizeof(int)    * N);
    A->index = (int *)    malloc(sizeof(int)    * N * M);
    A->value = (double *) malloc(sizeof(double) * N * M);

    // NUMA Aware Initialization
    for (i = 0; i < A->N * A->M; i++)
        initialize(A);

    // Computational Loop
    for (i = 0; i < A->N; i++)
        compute(A);

Figure 5: Micro-kernel with consideration for data locality, representative of the subroutines in BML [9] after tuning.

The application of memory alignment, as demonstrated by [23–30], is primarily for the purpose of preventing cache contention on multi-core computer hardware. Cache contention occurs when multiple hardware threads attempt to use the same cache line. Figure 9 is an illustration of a simplified modern multi-core computer architecture containing four cores, showing two cache lines belonging to Core:2 (purple) and Core:3 (blue). Figure 10 is a code snippet of the micro-kernel, typical of the occurrences of memory allocation in the BML software. Figure 11 is a code snippet of the micro-kernel reflecting memory alignment that ensures minimal cache contention in the computational loop: it shows aligned memory allocations using the Intel compiler [23], alignment hints to the compiler, and the complementary aligned memory de-allocation. A similar API that allows for aligned memory allocation is available with the GCC compiler [43].
Figure 12 is a comparison of the run-times of the benchmark (BHMK) vs. optimized (TUNED) micro-kernels for multiple dual-socket Intel architectures using the Intel 17 compiler. The speedup of the optimized (TUNED) micro-kernel is at best 11X for high thread counts and 5X on average across all architectures.
The purpose of rectifying the thread binding and affinity settings previously used while performing linear algebra computations in BML [9] is to prevent sub-optimal hardware utilization. Two known issues that degrade performance when using multi-core computers are 1) thread migration and 2) the utilization of non-equivalent threads during simulation. Thread migration is a process whereby operational threads migrate from core to core during a simulation. This behavior affects repeatability during performance testing or benchmarking. Non-equivalent thread utilization is a scenario whereby operational threads do not have the same associated resources (L1, L2, or L3/LLC cache). This also
Figure 6: Illustration of modern multi-core architectures with multiple NUMA domains, showing the effect of programming without consideration for the "First Touch" policy on Linux systems. During simulation, data present on NUMA:0 is fetched from NUMA:1.

Affinity and Binding Settings

    Intel's API:
      Arithmetic Intensity    KMP_HW_SUBSET     KMP_AFFINITY
      Compute Bound           All HW Threads    "compact"
      Memory Bound            HW Threads        "scatter"
      Migration Control:      set KMP_AFFINITY & KMP_HW_SUBSET

    OpenMP's API:
      Compute Bound           All HW Threads    close
      Memory Bound            HW Threads        spread
      Migration Control:      set OMP_PROC_BIND=true, OMP_NUM_THREADS & OMP_PLACES=cores
Table 2: Best practice guidelines for configuring the affinity and binding settings depending on the predominant arithmetic intensity, using Intel's [31–34] and OpenMP's [31, 32] APIs.

affects repeatability during performance testing or benchmarking. It occurs as a result of a lack of specificity in the environment variables that control the binding and affinity of threads to hardware core(s) and/or socket(s). Preventing thread migration and appropriately specifying the correct thread binding and affinity using Intel's API [31–34] requires setting two environment variables, namely
KMP_AFFINITY and KMP_HW_SUBSET. KMP_AFFINITY controls the thread placement, which is highly dependent on the predominant arithmetic intensity (AI) of the application. Figure 13 is a pictorial representation of a thread placement specified as "scatter". In addition, setting both environment variables prevents thread migration. KMP_HW_SUBSET determines the number of active threads and is set in the format <c>c,<t>t,<s>s, where c, t, and s stand for cores per socket, threads per core, and number of active sockets, respectively. As an example, for a dual-socket machine with twenty-four cores per socket and two hardware threads per core, KMP_HW_SUBSET=24c,1t,2s implies that forty-eight threads are operational on that node, at one thread per core. Similarly, KMP_HW_SUBSET=24c,2t,2s implies that forty-eight threads are operational on each socket, for a total of ninety-six threads (two threads per core). Table 2 shows the best practice guidelines for configuring the affinity and binding of an application depending on the procedure with the dominant AI, using the Intel and OpenMP APIs.
Figure 7: Illustration of modern multi-core architectures with multiple NUMA domains, showing the effect of programming with consideration for the "First Touch" policy on Linux systems. During simulation, data present on NUMA:0 is fetched from NUMA:0, and vice-versa.
In order to demonstrate the viability of the aforementioned optimizations we proceed following the three steps delineated below. First, we generate chemically relevant system matrices following the techniques discussed in Section 5.2, representing Metals, Semi Conductors and Soft Matter. Second, we evaluate the performance of the BML ELLPACK matrix-matrix multiply algorithm for each system. Third, in order to demonstrate the viability of the aforementioned optimizations in a full simulation, we use ExaSP2, a spectral projection proxy application [44] for carrying out QMD computations. Following that, we establish a benchmark run-time for ExaSP2 to determine the signature of the SP2-Basic algorithm, which informs on the conditions in which the best performance gains in run-time are achieved. After establishing the representative/dominant AI, we compare the performance of the BHMK to that of the TUNED version for those specific cases.
Chemically relevant system matrices typically fall under the following categories, namely 1.) Metals, 2.) Semi Conductors or 3.) Soft Matter. Figure 14 is a schematic representation of the model Hamiltonian matrices. These model systems are constructed by coupling arrays of two-level systems with A and B as atomic-type orbitals. The A and B orbitals have onsite energies ε_A and ε_B, respectively. A coupling between orbitals of the same type, that is, comprising only A types or only B types, is given by either δ_{A_iA_j} or δ_{B_iB_j}. Similarly, a coupling between different orbitals, that is A and B, is given by δ_{A_iB_j}. All couplings between orbitals are modulated by an exponential damping factor computed as exp(k|j − i|), where k is the decay constant, and i and j are the positions of the two orbitals. All the couplings and onsite energies are in units of electron volts (eV). We also introduced a randomization parameter that adds noise to the couplings and onsite energies. This noise is introduced as param(1 + r × RAND), where RAND is a random number chosen between -1 and 1 and param represents any of the couplings or onsite energies involved. A module that enables the generation of these model Hamiltonian matrices has recently been added to the PROGRESS [45] library. Figure 15 is a plot of the total DOS computed from the different model Hamiltonian matrices; the Fermi level of the system is set to 0.0 eV. System matrices representing Metals were generated by setting δ_{A_iA_j} = −. , δ_{B_iB_j} = − , k = −.01,
and the rest of the parameters to 0.0, resulting in a sparsity of 0.98%.

Figure 8: Performance in run-time for the benchmark (BHMK) vs. optimized (TUNED) micro-kernel for data locality for Cascade Lake (Top-Left), Sky-Lake Gold (Top-Right), Sky-Lake Platinum (Bottom-Left) and Cascade Lake with Intel Optane Technology (Bottom-Right).
For system matrices representing Semi Conductors, the parameters were set to δ_{B_iB_j} = −. , δ_{AB} = −. , k = −.01, and the rest of the parameters to 0.0, resulting in a sparsity of 94%. Soft Matter system matrices were generated by setting δ_{B_iB_j} = −. , δ_{AB} = −. , ε_A = −. , k = −. , r = 1.0, resulting in a sparsity of 82%. Figure 16 is a matrix plot, using the BML ELLPACK format, of the three systems, namely Metals, Semi Conductors and Soft Matter. It serves as a visual representation of the density of each system, with Metals being the most dense. Figures 17, 18 and 19 show the performance of the benchmark (BHMK) vs. optimized (TUNED) BML matrix-matrix multiply algorithm using the ELLPACK format for the aforementioned systems on four different architectures. Matrix-matrix multiplications with Metals (Figure 17) show a 15% to 30% improvement on average in the optimized version of the BML library for larger matrix sizes on every architecture. Semi Conductors (Figure 18) show excellent improvement in the optimized code for all matrix sizes experimented with; on average, the 32k matrices are improved by over 100-fold. Soft Matter systems (Figure 19) show similar improvements to Semi Conductors; on average, the 32k matrices are improved by over 50-fold across all architectures.
Figure 9: An illustration of the need for memory alignment on multi-core architectures, showing contention between core:2 and core:3.

// Allocating unaligned memory:
A->nnz = (int *) malloc(sizeof(int)*A->N);
A->index = (int *) malloc(sizeof(int)*A->N*A->M);
A->value = (double *) malloc(sizeof(double)*A->N*A->M);

// Computational loop:
for(i=0; i < A->N; i++)
    compute(A);

// De-allocation of unaligned memory:
free(A->nnz);
free(A->index);
free(A->value);

Figure 10: Memory allocation in BML software.

// Allocating aligned memory:
A->nnz = (int *) _mm_malloc(sizeof(int)*A->N, 64);
A->index = (int *) _mm_malloc(sizeof(int)*A->N*A->M, 64);
A->value = (double *) _mm_malloc(sizeof(double)*A->N*A->M, 64);

// Computational loop:
for(i=0; i < A->N; i++)
{
    __assume_aligned(A->nnz, 64);
    __assume_aligned(A->index, 64);
    __assume_aligned(A->value, 64);
    compute(A);
}

// De-allocation of aligned memory:
_mm_free(A->nnz);
_mm_free(A->index);
_mm_free(A->value);

Figure 11: Memory allocation with memory alignment in BML software.
Figure 12: Performance in run-time for the benchmark (BHMK) vs. optimized (TUNED) micro-kernel for memory alignment for Cascade Lake (Top-Left), Sky-Lake Gold (Top-Right), Sky-Lake Platinum (Bottom-Left) and Cascade Lake with Intel Optane Technology (Bottom-Right).
The procedures that make up the SP2-Basic algorithm can be decomposed into two types, namely the initialization and the SP2 calculation steps. The initialization step involves reading in the Hamiltonian matrix (Read Hamiltonian) and other miscellaneous initialization sub-steps (Init. Misc.). The SP2 calculation step involves a matrix-matrix multiplication (SP2 Loop X2) and a matrix norm (SP2 Loop Norm) calculation. All other steps within the SP2 calculation are categorized under (SP2 Loop Misc.). For this exercise, only Semi Conductor and Soft Matter systems were experimented with, as they are appropriate for an SP2 calculation. The ELLPACK matrix format and corresponding algorithms were used to represent both systems. In addition, since the overall performance of each system has shown repeatability across all four architectures, only Intel Cascade Lake and Skylake Platinum were used here. Figures 20, 21, 22 and 23 compare the benchmark (BHMK) vs. optimized (TUNED) ExaSP2 for both systems. On Intel Cascade Lake (Figures 20 and 21), both systems show an improvement in run-time by over fourfold (4x), while other subroutines vary from fourfold (4x) to twelvefold (12x).
We evaluated the viability of several optimization strategies, namely 1.) Strength Reduction, 2.) Memory Alignment and 3.) NUMA-aware allocations, all incorporated with the appropriate thread affinity and binding settings. Initially, we tested the viability of these performance tuning techniques in micro-kernels by comparing the run-time before and after optimization for several target architectures. For SR optimizations, the optimized
Figure 13: An illustration of a modern dual socket multi-core architecture, showing a coordination of memory initialization with thread binding and affinity with considerations for "First Touch" policies on Linux systems.

Figure 14: Two-level system model Hamiltonian used to generate representative Hamiltonian matrices for bench-marking purposes. The model has the following parameters: four coupling parameters, four onsite energies, a decaying exponential parameter, and a randomization factor.
Figure 15: Total DOS computed from the different model Hamiltonian matrices (energy in eV on the horizontal axis); the Fermi level of the system is set to 0.0 eV. a) Metals can be generated by setting δ_{A_iA_j} = −. , δ_{B_iB_j} = − , k = −.01, and the rest of the parameters to 0.0; matrices of this type are about 0.98% sparse. b) Semi Conductors can be generated by setting δ_{B_iB_j} = −. , δ_{AB} = −. , k = −.01 and the rest of the parameters to 0.0; matrices of this type are about 94% sparse. c) Soft Matter systems can be generated by setting δ_{B_iB_j} = −. , δ_{AB} = −. , ε_A = −. , k = −. , r = 1.0; matrices of this type are about 82% sparse.
Figure 16: Matrix plot of Metals (Top Left), Semi Conductors (Top Right) and Soft Matter (Bottom) generated using the techniques discussed.
Figure 17: Performance in run-time for the benchmark (BHMK) vs. optimized (TUNED) BML ELLPACK matrix-matrix multiply for Metals computed on Cascade Lake (Top-Left), Sky-Lake Gold (Top-Right), Sky-Lake Platinum (Bottom-Left) and Cascade Lake with Intel Optane Technology (Bottom-Right).

micro-kernel showed approximately a 2X speedup across multiple thread counts and varying architectures. For NUMA-aware data initialization that ensures data locality, the performance of the optimized micro-kernel is at best 3X for high thread counts and 2X on average on all architectures. For memory alignment optimizations, the optimized micro-kernel is at best 11X for high thread counts and 5X on average across all architectures. The optimizations were used in conjunction with appropriate thread binding and affinity settings such that two well-known issues that degrade performance on multi-core computers are avoided. We followed this up by generating chemically relevant system matrices representing Metals, Semi Conductors and Soft Matter to evaluate the performance of the BML ELLPACK matrix-matrix multiply algorithm for each system. The matrix-matrix multiplications with Metals showed a 15% to 30% improvement on average in the optimized version of the BML library for larger matrix sizes on every architecture. Semi Conductors showed excellent improvement in the optimized code for all matrix sizes experimented with; on average, the 32k matrices were improved by over 100-fold. Soft Matter systems showed similar improvements to Semi Conductors, where on average the 32k matrices were improved by over 50-fold across all architectures. Finally, we evaluated the performance of these optimizations using a QMD proxy application, where the ELLPACK matrix format and corresponding algorithms were used to represent both Soft Matter and Semi Conductors. Both systems on Intel Cascade Lake and Skylake Platinum showed an improvement in run-time by over fourfold (4x).
Other relevant subroutines within ExaSP2 showed improvements varying from fourfold (4x) to twelvefold (12x).
Figure 18: Performance in run-time for the benchmark (BHMK) vs. optimized (TUNED) BML ELLPACK matrix-matrix multiply for Semi Conductors computed on Cascade Lake (Top-Left), Sky-Lake Gold (Top-Right), Sky-Lake Platinum (Bottom-Left) and Cascade Lake with Intel Optane Technology (Bottom-Right).
Figure 19: Performance in run-time for the benchmark (BHMK) vs. optimized (TUNED) BML ELLPACK matrix-matrix multiply for Soft Matter computed on Cascade Lake (Top-Left), Sky-Lake Gold (Top-Right), Sky-Lake Platinum (Bottom-Left) and Cascade Lake with Intel Optane Technology (Bottom-Right).

Figure 20: Performance in run-time for the benchmark (BHMK) vs. optimized (TUNED) ExaSP2 for Semi Conductors simulated on Intel's Cascade Lake, showing the run-time of individual subroutines.
Figure 21: Performance in run-time for the benchmark (BHMK) vs. optimized (TUNED) ExaSP2 for Soft Matter simulated on Intel's Cascade Lake, showing the run-time of individual subroutines.

Figure 22: Performance in run-time for the benchmark (BHMK) vs. optimized (TUNED) ExaSP2 for Semi Conductors simulated on Intel's Skylake Platinum, showing the run-time of individual subroutines.
Figure 23: Performance in run-time for the benchmark (BHMK) vs. optimized (TUNED) ExaSP2 for Soft Matter simulated on Intel's Skylake Platinum, showing the run-time of individual subroutines.
Acknowledgments
This work was performed as part of the Co-design Center for Particle Applications, supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. DOE Office of Science and the NNSA. This work was performed under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), U.S. Government contract DE-AC52-07NA27344 for Lawrence Livermore National Laboratory (LLNL), and U.S. Government contract DE-AC05-00OR22725 for Oak Ridge National Laboratory (ORNL).
References

[1] Home page - Exascale Computing Project. Accessed: 2020-8-4.
[2] Susan M. Mniszewski, James Belak, Stuart R. Slattery, Christian F. A. Negre, Jean-Luc Fattebert, Robert F. Bird, Samuel Temple Reeve, Christoph Junghans, Damien Lebrun-Grandie, Shane Fogerty, Adetokunbo A. Adedoyin, Jamal Mohd-Yusof, Michael E. Wall, Daniel Osei-Kuffuor, Guangye Chen, Lee Ricketson, Salman Habib, Adrian Pope, C. S. Chang, Aaron Scheinberg, Steven Plimpton, and Stan Moore. Enabling particle applications for exascale computing platforms.
[3] D. R. Bowler and T. Miyazaki. O(N) methods in electronic structure calculations. Rep. Prog. Phys., 75(3):036503, February 2012.
[4] Stefan Goedecker. Linear scaling electronic structure methods. Rev. Mod. Phys., 71(4):1085–1123, July 1999.
[5] Anders M. N. Niklasson. Expansion algorithm for the density matrix. Phys. Rev. B Condens. Matter, 66(15):155115, October 2002.
[6] Susan M. Mniszewski, Romain Perriot, Emanuel H. Rubensson, Christian F. A. Negre, Marc J. Cawkwell, and Anders M. N. Niklasson. Linear scaling pseudo Fermi-operator expansion for fractional occupation. J. Chem. Theory Comput., 15(1):190–200, January 2019.
[7] Nicolas Bock, Christian F. A. Negre, Susan M. Mniszewski, Jamaludin Mohd-Yusof, Bálint Aradi, Jean-Luc Fattebert, Daniel Osei-Kuffuor, Timothy C. Germann, and Anders M. N. Niklasson. The basic matrix library (BML) for quantum chemistry. J. Supercomput., 74(11):6201–6219, November 2018.
[8] The basic matrix library (BML).
[9] Nicolas Bock, Susan Mniszewski, Bálint Aradi, Michael Wall, Christian F. A. Negre, Jamal Mohd-Yusof, and Anders M. N. Niklasson. qmmd/bml v1.2.3, February 2018.
[10] Jongsoo Park, Ping Tak Peter Tang, Mikhail Smelyanskiy, Daehyun Kim, and Thomas Benson. Efficient backprojection-based synthetic aperture radar computation with many-core processors. In SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE, 2012.
[11] Matt Godbolt. Optimizations in C++ compilers. Communications of the ACM, 63(2):41–49, 2020.
[12] Adetokunbo Adedoyin. A case study on software modernization using CoMD – a molecular dynamics proxy application. LANL Report LA-UR-17-22676, 04, 2017.
[13] Andrey Vladimirov. Fine-tuning vectorization and memory traffic on Intel Xeon Phi coprocessors: LU decomposition of small matrices. Colfax Research, pages 1–10, 2015.
[14] William M. Waite and Gerhard Goos. Compiler Construction. Springer Science & Business Media, 2012.
[15] Jeffrey Sheldon, Walter Lee, Ben Greenwald, and Saman Amarasinghe. Strength reduction of integer division and modulo operations. In International Workshop on Languages and Compilers for Parallel Computing, pages 254–273. Springer, 2001.
[16] Saman Amarasinghe, Walter Lee, and Ben Greenwald. Strength reduction of integer division and modulo operations. MIT Laboratory for Computer Science Technical Memo MIT-LCS-TM-600, 1999.
[17] John Cocke and Ken Kennedy. An algorithm for reduction of operator strength. Communications of the ACM, 20(11):850–856, 1977.
[18] Frances E. Allen, John Cocke, and Ken Kennedy. Reduction of operator strength. Program Flow Analysis, pages 79–101, 1981.
[19] Adrian Tate, Amir Kamil, Anshu Dubey, Armin Größlinger, Brad Chamberlain, Brice Goglin, Carter Edwards, Chris J. Newburn, David Padua, Didem Unat, et al. Programming abstractions for data locality. 2014.
[20] Nicolas Denoyelle, John Tramm, Kazutomo Yoshii, Swann Perarnau, and Pete Beckman. NUMA-aware data management for neutron cross section data in continuous energy Monte Carlo neutron transport simulation.
[21] Ryo Asai. Day 5: Introduction to Parallel Intel® Architectures. (April):2013–2017, 2017.
[22] Ryo Asai. Day 6: Optimization on Parallel Intel® Architectures. (April):2013–2017, 2017.
[23] Amanda K. Sharp. Memory allocation and first-touch. Published: 11/07/2013, last updated: 09/09/2014 (accessed July 1, 2020).
[24] Victor Alessandrini. Chapter 13 - Molecular dynamics example. In Victor Alessandrini, editor, Shared Memory Application Programming, pages 375–399. Morgan Kaufmann, Boston, 2016.
[25] Jim Jeffers and James Reinders. Chapter 5 - Lots of data (vectors). In Jim Jeffers and James Reinders, editors, Intel Xeon Phi Coprocessor High Performance Programming, pages 107–164. Morgan Kaufmann, Boston, 2013.
[26] Alaa Eltablawy and Andrey Vladimirov. In Intel Xeon Scalable Processors (Skylake). pages 1–52, 2017.
[27] Andrey Vladimirov. Optimization Techniques for the Intel MIC Architecture. pages 1–12, 2015.
[28] Dynamics Proxy Application. LA-UR-17-22676, 04, 2017.
[29] Ryo Asai. Day 5: Introduction to Parallel Intel® Architectures. (April):2013–2017, 2017.
[30] Andrey Vladimirov. Arithmetics on Intel's Sandy Bridge and Westmere CPUs: not all FLOPs are created equal. pages 1–12, 2012.
[31] Victor Eijkhout. Parallel Programming for Science Engineering, 2016 (accessed July 1, 2020).
[32] Victor Eijkhout. Parallel Programming in MPI and OpenMP. Lulu.com, 2017.
[33] Intel C++ Compiler 19.1 Developer Guide and Reference. Thread Affinity Interface (Linux* and Windows*), 2019 (accessed July 1, 2020).
[34] Gregg Skinner. Process and Thread Affinity for Intel® Xeon Phi™ Processors, 04/21/2016 (accessed July 1, 2020).
[35] The OpenMP API specification for parallel programming, Version 5.0, November 2018 (accessed July 1, 2020).
[36] Christoph Lameter. NUMA (non-uniform memory access): An overview. Queue, 11(7):40–51, 2013.
[37] P. Memarzia, S. Ray, and V. C. Bhavsar. The art of efficient in-memory query processing on NUMA systems: a systematic approach. pages 781–792, 2020.
[38] David Hildenbrand and Luiz Capitulino. Dynamic virtual machine memory allocation, January 30, 2020. US Patent App. 16/046,580.
[39] Tony Mason, Thaleia Dimitra Doudali, Margo Seltzer, and Ada Gavrilovska. Unexpected performance of Intel® Optane™ DC persistent memory. IEEE Computer Architecture Letters, 19(1):55–58, 2020.
[40] Isaac Sánchez Barrera, David Black-Schaffer, Marc Casas, Miquel Moretó, Anastasiia Stupnikova, and Mihail Popov. Modeling and optimizing NUMA effects and prefetching with machine learning. In Proceedings of the 34th ACM International Conference on Supercomputing, pages 1–13, 2020.
[41] Mehul Wagle, Daniel Booss, and Ivan Schreter. NUMA-aware memory allocation, February 6, 2018. US Patent 9,886,313.
[42] Wei Wang, Jack W. Davidson, and Mary Lou Soffa. Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines. pages 419–431. IEEE, 2016.
[43] Linux Manual Page. posix_memalign(3), 2020 (accessed July 1, 2020).
[44] Susan Mniszewski, Christian F. A. Negre, Robert Pavel, and Christoph Junghans. ExaSP2. https://github.com/ECP-copa/ExaSP2, 2016.