Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors
Amrita Mathuriya∗§, Ye Luo†, Anouar Benali†, Luke Shulenburger‡, and Jeongnim Kim∗
∗Intel Corporation  †Argonne National Laboratory  ‡Sandia National Laboratories
§Corresponding author: [email protected]
Abstract—B-spline based orbital representations are widely used in Quantum Monte Carlo (QMC) simulations of solids, historically taking as much as 50% of the total run time. Random accesses to a large four-dimensional array make it challenging to efficiently utilize caches and wide vector units of modern CPUs. We present node-level optimizations of B-spline evaluations on multi/many-core shared memory processors. To increase SIMD efficiency and bandwidth utilization, we first apply a data layout transformation from array-of-structures to structure-of-arrays (SoA). Then, by blocking SoA objects, we optimize cache reuse and obtain sustained throughput for a range of problem sizes. We implement efficient nested threading in the B-spline orbital evaluation kernels, paving the way towards enabling strong scaling of QMC simulations. These optimizations are portable on four distinct cache-coherent architectures and result in up to 5.6x performance enhancement on an Intel® Xeon Phi™ processor 7250P (KNL), 5.7x on an Intel® Xeon Phi™ coprocessor 7120P, 10x on an Intel® Xeon® processor E5v4 CPU and 9.5x on a BlueGene/Q processor. Our nested threading implementation shows nearly ideal parallel efficiency on KNL up to 16 threads. We employ roofline performance analysis to model the impact of our optimizations. This work, combined with our current efforts optimizing other QMC kernels, results in greater than 4.5x speedup of miniQMC on KNL.

Index Terms—QMC, B-spline, SoA, AoSoA, vectorization, cache-blocking data layouts, roofline.
I. INTRODUCTION
With advances in algorithms and growing computing power, quantum Monte Carlo (QMC) methods have become a leading contender for high-accuracy calculations of the electronic structure of realistic systems. They are general and applicable to a wide range of physical and chemical systems in any dimension and under any boundary conditions. The favorable scaling of O(N^3) for an N-electron system and ample opportunities for parallelization make QMC methods uniquely powerful tools to study the electronic structure of realistic systems on large-scale parallel computers. Despite the high computational cost of the method, this scalability has enabled calculations on a wide variety of systems, including large molecules [1], layered materials such as the graphite shown in Figure 1 [2]–[4], transition metal oxides [5], [6] and general bulk properties of solids [7].

A prime trend in HPC is the increase in parallelism available on a single SMP node. The average node count of large clusters has not increased in recent years, but the compute capacity of a node keeps growing through more cores, multiple hardware threads per core and wider SIMD units. Many-core architectures expand the parallelism on a node dramatically. For example, the second-generation Intel® Xeon Phi™ processors, formerly codenamed Knights Landing (KNL) [8], have up to 72 cores, 4 threads per core and a double-precision SIMD width of 8. Their theoretical peak is more than 10 times that of an IBM Blue Gene/Q (BG/Q) node with 16 cores, 4 threads per core and a double-precision SIMD width of 4 [9]. To take full advantage of modern systems, it is essential to utilize SIMD units and caches efficiently.

One of the main computational bottlenecks in QMC simulations is the evaluation of N single-particle orbitals (SPOs) for an N-electron system. A B-spline basis is the most efficient representation for SPOs, with a cost per orbital of O(1). Spline interpolations are widely used in many applications, including classical and quantum molecular dynamics [10], to lower the cost of complex analytic function evaluations. In QMC, B-spline based orbitals provide orders of magnitude better time-to-solution than other standard basis sets, such as Gaussian-type orbitals, even for a modest problem size at the same accuracy. This work presents processes to improve the on-node performance of the B-spline evaluation routines in QMCPACK [17].

Fig. 1. (a) DMC charge density of AB-stacked graphite [2] and (b) the ball-and-stick rendering and the 4-carbon unit cell in blue.

The random access patterns that are inherent in QMC make achieving high performance of B-spline based SPOs challenging on any architecture. The arithmetic intensity, FLOPS per byte, is low, and the performance is sensitive to the memory subsystem: bandwidths and cache levels and their sizes. We demonstrate how one can achieve efficient vectorization, optimally utilize the multi-level cache system and exploit a large number of hardware threads as a proof of concept.
SIMD and cache utilization: Extensive analysis of QMCPACK simulations reveals that SIMD efficiency is low except for B-spline evaluations in intrinsics and BLAS/LAPACK routines that are available in optimized libraries, such as Intel MKL [11]. The abstractions for 3D physics, defined in array-of-structures (AoS) formats, such as representing the positions of N particles with an R[N][3] data layout, are primarily responsible for the low SIMD efficiency. AoS representations of physical entities are logical for expressing concepts and have been widely adopted in HPC applications. However, computations using them are not efficient on modern CPUs. Implementing "hotspots" with architecture-specific intrinsics is possible, but it is neither scalable nor portable. Moreover, without cache-blocking optimizations, the working set falls out of cache for large problems, causing low cache utilization. Our goal is to transform all QMCPACK core routines to increase on-node performance and science productivity while minimizing the development cost; this work presents the first and most important step toward that goal.
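To make the contrast concrete, the sketch below (our own illustration with hypothetical names, not QMCPACK source) shows the two layouts. A loop over one coordinate strides by 3 in the AoS form, forcing gather/scatter instructions, while the SoA form gives unit-stride streams that vectorize cleanly.

```cpp
// Illustrative AoS vs. SoA particle-position layouts.
constexpr int NP = 1024;

struct PosAoS {
  float r[NP][3];            // [xyz|xyz|...]: component access strides by 3
};

struct PosSoA {
  alignas(64) float x[NP];   // each component is a contiguous,
  alignas(64) float y[NP];   // cache-line-aligned stream
  alignas(64) float z[NP];
};

void drift_x_aos(PosAoS& p, float dx) {
  for (int i = 0; i < NP; ++i)
    p.r[i][0] += dx;         // stride-3 access: gather/scatter on SIMD units
}

void drift_x_soa(PosSoA& p, float dx) {
  for (int i = 0; i < NP; ++i)
    p.x[i] += dx;            // unit stride: auto-vectorizes cleanly
}
```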
Beyond walker parallelism: The common parallelization strategy in QMC is to distribute walkers (Monte Carlo samples) among N_p tasks, where N_p is the number of parallel processing units for a simulation. The OpenMP/MPI hybrid programming model adopted in QMCPACK has proven to be very effective on clusters of multi-core shared-memory processor (SMP) nodes. It minimizes the memory footprint per MPI task and improves the overall parallel efficiency by considerably reducing the collective communication time at large task counts [12]. However, the cost of processing each walker increases as O(N^3) or higher for an N-electron system, and the overall memory usage on a node increases as O(N_w N^2) for N_w walkers. This makes it essential to parallelize the execution of each walker to reduce the total time to solution (T_tot) and the memory footprint, and hence to expand the applicability of QMC methods to larger system sizes. We implement efficient nested threading for the B-spline based SPO evaluations, paving the way towards reducing T_tot and enabling strong scaling of QMC simulations.

Contributions and results: The two main contributions of this work are (i) node-level optimizations that improve single-node efficiency in a portable fashion, without the use of processor-specific intrinsics, and (ii) parallelization of the orbital evaluations for each walker to reduce the time-to-solution in strong-scaling runs. We develop simplified QMC codes (miniQMC) to facilitate rapid prototyping and benchmarking. They capture the computational and data access patterns and have a profile similar to common QMC workloads on an SMP node. We demonstrate the impact of our optimizations on four distinct cache-coherent architectures with extensive analysis and on a wide range of problems, from 64-carbon (128 SPOs) to 2048-carbon (4096 SPOs) systems. Finally, we employ roofline model analysis to quantify the relative and absolute performance impact of the optimizations.

The key accomplishments of this work are as follows. We apply a data layout transformation from array-of-structures to structure-of-arrays (SoA) to increase SIMD efficiency and bandwidth utilization. We implement "tunable tiling", an AoSoA transformation, to achieve efficient cache utilization with sustained throughput across problem sizes. The obtained optimal tile size is independent of the problem size and can be tuned once for a specific architecture with miniQMC. These optimizations result in up to 5.6x (KNL), 5.7x (KNC), 10x (BDW) and 9.5x (BG/Q) speedups of the B-spline routines. We provide an efficient nested threading implementation for each walker (Monte Carlo unit) and demonstrate more than 14x reduction in the time-to-solution on 16 KNL nodes. The optimizations discussed in this publication are broadly applicable to many HPC applications for increasing the utilization of wide SIMD units and caches. In addition to the B-spline kernels, we implement them in miniQMC to optimize the other time-consuming QMC kernels. With the combined work, we obtain greater than 4.5x speedup of miniQMC on KNL and BDW processors.

To facilitate an easy transition of the performance gains proven in miniQMC to QMCPACK, we rely on object-oriented programming. We design and implement SoA container classes for particle abstractions in 3D and develop classes that hide the low-level computational engines for SIMD and memory/cache optimizations from the application-level classes. The optimized kernels of miniQMC will be called directly from the QMCPACK driver routines to transfer the performance gains.

II. RELATED WORK
To conduct cutting-edge research in materials science, QMCPACK has been deployed on the current generation of leadership supercomputers: Mira (IBM Blue Gene/Q) at Argonne National Laboratory and Titan (AMD Opteron CPUs and NVIDIA Tesla K20 GPUs) at Oak Ridge National Laboratory. As the most performance-critical component of QMCPACK, the 3D B-spline orbitals have been extensively optimized over the years [13], [14]. Highly optimized routines evaluating B-spline SPOs are implemented in QPX intrinsics [15] on BG/Q, SSE/SSE2 intrinsics on x86 and in CUDA [16] to maximize single-node performance. The QPX and CUDA versions adopted an SoA data layout and FMA instructions to achieve high memory bandwidth utilization and FLOP rates. Single precision was first implemented in the QMCPACK GPU port, with significant speedups and memory savings, and was later introduced to the CPU version. Multiple algorithms using loop transformations and unrolling have been developed, and the QMCPACK build system allows selecting the precision and the optimal algorithms and implementations at compile time.

The baseline of this work uses the latest optimized C/C++ implementations in the official QMCPACK distribution [15], [17]. Extensive studies of data transformations to maximize SIMD efficiency on Intel® Xeon Phi™ (co)processors and of nested parallelism are available [18]–[21].

III. QMC AND B-SPLINE-BASED SPOS

In quantum mechanics, all physically observable quantities for a system containing N_el particles can be computed from the N_el-dimensional wave function, Ψ(r_1, …, r_{N_el}). The Slater–Jastrow trial wave functions Ψ_T, commonly used in QMC applications for solving electronic structures, are the product of Slater determinants of single-particle orbitals (SPOs) and a Jastrow factor:

$$ \Psi_T = \exp(J)\, D^{\uparrow}(\{\phi\})\, D^{\downarrow}(\{\phi\}), \qquad (1) $$

with N_el = N↑ + N↓ for the ↑↓ spins. For the rest of the paper, we assume N_el = 2N and D↓ = D↑. Here, {φ} denotes a set of SPOs and

$$ D = \det[A] = \begin{vmatrix} \phi_1(\mathbf{r}_1) & \cdots & \phi_1(\mathbf{r}_N) \\ \vdots & \ddots & \vdots \\ \phi_N(\mathbf{r}_1) & \cdots & \phi_N(\mathbf{r}_N) \end{vmatrix}, \qquad (2) $$

which ensures the antisymmetric property of a Fermionic wave function upon exchange of a pair of electrons. The Jastrow factors, which are factorized for computational efficiency, describe the dynamic correlation, whereas the static correlation is described by the Slater determinants.

In the diffusion Monte Carlo (DMC) algorithm, an ensemble of walkers (a population) is propagated stochastically from generation to generation, where each walker is represented by R. In each propagation step, the walkers are moved through position space by (i) a drift-diffusion process. Once a new configuration is sampled, the physical quantities (observables) such as the kinetic energy and Coulomb potential energies are computed for each walker in (ii) a measurement stage. Finally, the ensemble averages of the observables and the trial energy E_T are computed and, depending on the energy of a walker relative to E_T, each walker may reproduce itself, be killed, or remain unchanged in (iii) a branching process.

The particle-by-particle moves we employ change only one column of the A matrices at a time, and the ratio can be computed as

$$ \frac{\det[A']}{\det[A]} = \sum_{n=1}^{N} \phi_n(\mathbf{r}_e)\, A^{-1}(n, e). \qquad (3) $$

When a move is accepted, we employ a rank-1 update of A^{-1} using the Sherman–Morrison formula. This allows the inverse to be updated in O(N^2) time rather than the O(N^3) time of a full inversion, and the ratios to be computed in O(N). The computations of the many-body gradients and Laplacian use the same ratio formula, e.g.,

$$ \frac{\nabla \det[A']}{\det[A]} = \sum_{n=1}^{N} \nabla\phi_n(\mathbf{r}_e)\, A^{-1}(n, e). \qquad (4) $$

Fig. 2. The piecewise cubic polynomial basis functions in (a) 1D and (b) 2D.
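As an illustration of Eq. (3), the ratio reduces to an O(N) dot product between the proposed orbital values and column e of the inverse. The sketch below uses our own names and a row-major layout, not QMCPACK's.

```cpp
// Acceptance ratio of Eq. (3): det[A']/det[A] for a move of electron e.
// phi_new[n] holds phi_n at the proposed position; Ainv is the N x N
// inverse of A in row-major order, so Ainv[n*N + e] is A^{-1}(n, e).
double det_ratio(int N, int e, const double* phi_new, const double* Ainv) {
  double r = 0.0;
  for (int n = 0; n < N; ++n)
    r += phi_new[n] * Ainv[n * N + e];
  return r;  // if accepted, A^{-1} is rank-1 updated in O(N^2) via
             // Sherman-Morrison instead of an O(N^3) full inversion
}
```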
The cost to compute the value of a φ scales linearly with the number of basis function evaluations, which tends to grow with the system size. This amounts to an O(N) cost for each particle move, so in total the SPO evaluations scale as O(N^2) per Monte Carlo step. For this reason, it is efficient to use a localized basis with compact support. In particular, 3D tricubic B-splines provide a basis in which only 64 elements are nonzero at any given point in space [13], [22], with a complexity of O(1) per orbital. The one-dimensional cubic B-spline is given by

$$ f(x) = \sum_{i'=i-1}^{i+2} b^{3}_{i'}(x)\, p_{i'}, \qquad (5) $$

where the b^3_i(x) are the piecewise cubic polynomial basis functions and i = floor(Δ^{-1} x) is the lower-bound index of x for a grid spacing Δ. Figure 2(a) shows the four basis functions contributing to the values within a grid interval when Δ = 1. Constructing a tensor product in each Cartesian direction, we can represent a 3D orbital as

$$ \phi_n(x, y, z) = \sum_{i'=i-1}^{i+2} b^{3,x}_{i'}(x) \sum_{j'=j-1}^{j+2} b^{3,y}_{j'}(y) \sum_{k'=k-1}^{k+2} b^{3,z}_{k'}(z)\; p_{i',j',k',n}. \qquad (6) $$

This allows for rapid evaluation of each orbital in constant time. Furthermore, the basis is systematically improvable with a single spacing parameter, so that accuracy is not compromised. The coefficients {p} are the interpolation tables for each orbital and remain constant throughout the simulations.

The use of 3D tricubic B-spline SPOs greatly improves the time-to-solution. For a typical modest example problem with 32 electrons, the speedup of B-splines is more than six-fold over an equivalent plane-wave (PW) basis set, and the advantage of the B-spline SPOs grows with the system size. This computational efficiency comes at the expense of increased memory use. To keep the memory footprint small, QMCPACK uses hybrid OpenMP/MPI parallelism, where all the threads share the read-only coefficient table.
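Returning to Eq. (5), a minimal sketch for a uniform grid follows. The four nonzero basis functions are the standard uniform cubic B-spline weights; boundary handling is omitted, and this is our illustration, not the einspline implementation.

```cpp
#include <cmath>

// Evaluate f(x) of Eq. (5): only four coefficients contribute, so the
// cost is O(1) regardless of the table size.
float bspline1d(const float* p, float x, float delta) {
  const float u = x / delta;
  const int   i = (int)std::floor(u);  // lower-bound index, floor(x/delta)
  const float t = u - (float)i;        // fractional coordinate in [0,1)
  const float t2 = t * t, t3 = t2 * t;
  const float b[4] = {
    (1.0f - t) * (1.0f - t) * (1.0f - t) / 6.0f,        // weight of p[i-1]
    (3.0f * t3 - 6.0f * t2 + 4.0f) / 6.0f,              // weight of p[i]
    (-3.0f * t3 + 3.0f * t2 + 3.0f * t + 1.0f) / 6.0f,  // weight of p[i+1]
    t3 / 6.0f                                           // weight of p[i+2]
  };
  return b[0] * p[i - 1] + b[1] * p[i] + b[2] * p[i + 1] + b[3] * p[i + 2];
}
```

The 3D kernel of Eq. (6) applies the same four weights per Cartesian direction, giving the 4 x 4 x 4 = 64 nonzero contributions quoted above.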
IV. BASELINE AND MINIQMC

We use the CORAL benchmark 4x4x1 problem [23] to establish the baseline performance. This benchmark represents typical QMC workloads on the current generation of HPC systems. It solves 256 electrons of a 64-atom AB-stacked graphite system consisting of 4 by 4 periodic images of the 4-atom unit cell, shown in blue in Fig. 1(b). It defines the grid sizes N_x = N_y = 48 and N_z = 60 for N = 128 orbitals. The configurations of the systems used are listed in Table I.

TABLE I
SYSTEM CONFIGURATIONS

|                  | BDW       | KNC   | KNL   | BG/Q       |
|------------------|-----------|-------|-------|------------|
| Processor        | E5-2697v4 | 7120P | 7250P | PowerPC A2 |
| LLC (shared)     | 45 MB     | –     | –     | 32 MB      |
| Stream BW (GB/s) | 64        | 177   | 490   | 28         |

(On KNC the L2 caches are private to each core; on KNL they are shared per tile of two cores.)
The systems are described in Section VI, which contains the experiments. The Intel® C++ compiler version 16 update 2 is used on the Intel platforms [24] and IBM XL C/C++ 12.1 is used on BG/Q [25]. We use Intel® VTune™ Amplifier 2016 (VTune) [26] advanced hotspot profiling on the Intel platforms and HPCToolkit [27] on BG/Q for run-time estimation. Hyperthreading is beneficial and is used for the runs in Table II and for the data presented in this work.

Table II shows the profile of the benchmark with the publicly released QMCPACK [17] on single nodes. The main computational groups are B-splines, distance tables, and the one-body and two-body Jastrow evaluations. Together they amount to roughly 61–82% of the run time across the platforms. The rest of the time is mostly spent on the assembly of SPOs using B-spline outputs, determinant updates and inverses. The optimization steps mentioned in Sec. II and the QPX specializations reduced the B-spline share of the profile from 22% to 11% on BG/Q. In parallel optimization efforts, we optimize the distance-table and Jastrow kernels; Table III shows the resulting miniQMC profile.

TABLE II
SINGLE-NODE RUN TIME PROFILE IN % OF THE 4X4X1 BENCHMARK FOR PUBLICLY AVAILABLE QMCPACK

|                 | BDW | KNC | KNL | BG/Q |
|-----------------|-----|-----|-----|------|
| B-splines       | 18  | 28  | 21  | 22   |
| Distance Tables | 30  | 23  | 34  | 39   |
| Jastrow         | 13  | 19  | 19  | 21   |
TABLE III
PROFILE FOR MINIQMC (EQUIVALENT TO TABLE II) WITH THE OPTIMIZED DISTANCE-TABLE AND JASTROW KERNELS

|                       | B-splines | Distance Tables | Jastrow |
|-----------------------|-----------|-----------------|---------|
| KNL                   | 68.5      | 20.3            | 11.2    |
| Intel Xeon E5-2698 v4 | 55.3      | 22.6            | 22.1    |

Figure 3 shows the pseudocode of the miniQMC driver exercising the B-spline kernels V, VGL and VGH.

```
 1  // Number of splines = N, iterations = niters
 2  // random samples = ns = 512
 3  // dimensions of 3D position grid (nx, ny, nz)
 4  using Pos3 = T[3];
 5
 6  class WalkerAoS {
 7    T v[N], g[3*N], l[N], h[9*N];
 8  };
 9
10  // Create and init read-only 4D table P once.
11  BsplineAoS bSpline(nx, ny, nz, N);
12
13  // Create and run walkers in parallel.
14  omp parallel {
15    // Contains private copy of outputs.
16    WalkerAoS w;
17    // Generate random samples for testing.
18    Pos3 vPos[ns], vglPos[ns], vghPos[ns];
19    generateRandomPos(vPos, vglPos, vghPos, ns);
20    for (int i = 0; i < niters; ++i) {
21      for (int s = 0; s < ns; ++s)
22        bSpline.V(vPos[s][0], vPos[s][1], vPos[s][2], w.v);
23      for (int s = 0; s < ns; ++s)
24        bSpline.VGL(vglPos[s][0], vglPos[s][1], vglPos[s][2], w.v, w.g, w.l);
25      for (int s = 0; s < ns; ++s)
26        bSpline.VGH(vghPos[s][0], vghPos[s][1], vghPos[s][2], w.v, w.g, w.h);
27    }
28  }
```

Fig. 3. Pseudocode of the miniQMC driver.

Each walker creates an object w of the WalkerAoS class (shown on line 16 of Fig. 3), which contains its own copy of the output arrays. With the generated random positions, each walker fills these output arrays. The iteration loop at line 20 represents working through multiple generations of the Monte Carlo simulation.

The pseudocode in Fig. 4 shows the kernel of VGH, which computes N values, gradients (real 3D vectors) and Hessians (symmetric 3D tensors). A total of 13N output components are evaluated. The input position (x, y, z) is random and provided by the miniQMC driver (Fig. 3 at lines 22, 24 and 26) to mimic QMC moves driven by the quantum forces. The outputs v, g and h, passed as parameters to the function call, are the starting addresses of the particle attributes: V[N] (values) and the AoS types G[N][3] (gradients) and H[N][3][3] (Hessians). Gradients (Hessians) are the first (second) derivatives with respect to the particle position and are used to compute the quantum forces and Laplacians. The P coefficient array is allocated as a 1D array with an aligned allocator and includes padding to ensure the alignment of P[i][j][k] to a 512-bit cache-line boundary. All computations in miniQMC are performed in single precision.

```cpp
// (a) AoS outputs
class BsplineAoS {
  T P[Nx][Ny][Nz][N]; // read only and shared among threads
  void VGH(T x, T y, T z, T* v, T* g, T* h) {
    // compute the lower-bound index i0, j0, k0
    // compute prefactors using (x-x0, y-y0, z-z0)
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j)
        for (int k = 0; k < 4; ++k) {
          const T* p = P[i+i0][j+j0][k+k0];
          for (int n = 0; n < N; ++n) {
            // v[n], g[3*n..3*n+2], h[9*n..9*n+8] += prefactor * p[n]
          }
        }
  }
};

// (b) SoA outputs
class BsplineSoA {
  T P[Nx][Ny][Nz][N]; // read only and shared among threads
  void VGH(T x, T y, T z, T* v, T* g, T* h) {
    T *gx = g, *gy = g + N, *gz = g + 2*N;
    T *hxx = h,       *hxy = h + N,   *hxz = h + 2*N,
      *hyy = h + 3*N, *hyz = h + 4*N, *hzz = h + 5*N;
    // compute the lower-bound index i0, j0, k0 and the prefactors
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j)
        for (int k = 0; k < 4; ++k) {
          const T* p = P[i+i0][j+j0][k+k0];
          #pragma omp simd
          for (int n = 0; n < N; ++n) {
            // v[n], gx[n], ..., hzz[n] += prefactor * p[n]
          }
        }
  }
};
```

Fig. 4. VGH using the gradients and Hessians (a) in AoS memory layout and (b) in SoA memory layout.

In the 4D memory space, accesses to P start at a random input grid point P[i0][j0][k0] that satisfies x_{i0} ≤ x < x_{i0} + Δx, for Δx the grid spacing in the x direction; the same conditions apply to y and z. A total of 64 input streams are issued to access the N coefficient values: 64N stride-one reads and 13N mixed-strided accumulations are executed for each random input point. For large problems, the arithmetic intensity is low, at 1 FMA per accumulation of an output value, and therefore memory bandwidth plays a critical role in deciding the throughput of the B-spline routines. The cost of computing {b} at (x, y, z) in Eq. 6 is amortized over N, which had a big impact on the performance of scalar processors.

Functions V and VGL differ from VGH in how many components are computed, while having the same computational and data access patterns. Based on the simulation cell type, either VGH or VGL is used. V is used with pseudopotentials for the local energy computation.
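The inner accumulation bodies are elided in Fig. 4. The sketch below shows, under our own naming (not QMCPACK source), how each of the 64 grid points contributes one stride-one coefficient read and one FMA per SoA output stream; pre[0..9] stand for the tensor-product basis value and its first and second derivatives at the current (i, j, k), reused for all N splines.

```cpp
template <typename T>
void accumulate(int N, const T* p, const T pre[10], T* v,
                T* gx, T* gy, T* gz,
                T* hxx, T* hxy, T* hxz, T* hyy, T* hyz, T* hzz) {
  #pragma omp simd
  for (int n = 0; n < N; ++n) {
    const T s = p[n];      // stride-one read of one coefficient
    v[n]   += pre[0] * s;  // value
    gx[n]  += pre[1] * s;  // three gradient components
    gy[n]  += pre[2] * s;
    gz[n]  += pre[3] * s;
    hxx[n] += pre[4] * s;  // six unique Hessian components
    hxy[n] += pre[5] * s;  //  (symmetry reduces 9 to 6),
    hxz[n] += pre[6] * s;  //  i.e., 10 FMAs per coefficient
    hyy[n] += pre[7] * s;
    hyz[n] += pre[8] * s;
    hzz[n] += pre[9] * s;
  }
}
```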
For the graphite systems, VGH is used during the drift-diffusion phase. Multiple versions of the B-spline implementation are available in QMCPACK, and we use the optimized CPU algorithm [15] as the baseline.

V. OPTIMIZATIONS AND PARALLELIZATION

In this section, we describe the process and techniques used to optimize the B-spline kernels with SoA and AoSoA data layout transformations. Further, we describe the nested threading implementation, which helps reduce the time to solution and the memory footprint.

Fig. 5. Data access pattern of the read-only B-spline coefficients P at a random position (x, y, z), with j = floor(y/Δy), etc.: (a) current einspline library and (b) AoSoA implementation. The outermost x dimension is not shown.

A. AoS-to-SoA transformation of outputs

Aligned data objects and contiguous accesses for reading and writing are essential for efficient vectorization. The innermost loop in Fig. 4(a) runs over the number of splines N and provides sufficient work for efficient SIMD parallelism. Vectorization of this innermost loop results in streamed, unit-stride accesses to the read-only P array. However, accesses to the g (h) array have a 3(9)-strided pattern due to the AoS layout of the particle abstractions in 3-dimensional space. For example, g assumes an internal ordering of the gradients in the sequence [xyz|xyz|..|xyz]. These strided accesses result in gather and scatter instructions, which are less efficient than contiguous loads and stores and reduce SIMD efficiency. A better way to store this data is in a structure-of-arrays (SoA) format with a separate stream for each component. It lets us align the individual output streams for efficient loading and storing and eliminates the need for gather/scatter instructions.

Figure 4(b) shows the pseudocode after the AoS-to-SoA data layout transformation of the gradient and Hessian arrays. The BsplineSoA class replaces the BsplineAoS class in this optimized version of Fig. 3. It uses three separate streams for the gradient vector and six streams for the Hessian tensor; only a few components are shown for brevity. A minor change exploits the symmetric nature of the Hessian, which results in a total of 10 (1+3+6) output streams instead of 13 for the baseline. To help the compiler auto-vectorize and ignore any assumed dependencies, we place #pragma omp simd (OpenMP 4.0) on top of the innermost loop, and we inform the compiler about the alignment of the output and input arrays using directives. These changes lead to efficient vectorization over N and ensure stride-one accesses to both P and the output arrays. In addition to the AoS-to-SoA transformation, we apply to VGL a few other optimizations that were already present in VGH, such as unrolling the loop over the z dimension and moving the allocation of temporary arrays of the baseline code out of the loops.

Data-layout transformations from AoS to SoA are commonly employed by HPC applications on Intel® Xeon Phi™ [18], [19] and have been shown to be essential on processors with wide SIMD units. The same transformation boosts the performance of the other critical computational steps of QMCPACK involving distance tables and Jastrow factors. To minimize the impact on theoretical development, it is important to be able to implement the data layout transformation with minimal source code changes.
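One way to realize this, sketched below with hypothetical names (not the QMCPACK class), is a container that stores SoA streams internally but overloads the subscript operator to hand back a position in the familiar AoS form.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Illustrative SoA particle container: hot kernels consume the
// component streams, while application-level code keeps its AoS-style
// pos[i] syntax.
class ParticlePos {
  std::vector<float> x_, y_, z_;
public:
  explicit ParticlePos(std::size_t n) : x_(n), y_(n), z_(n) {}
  // AoS-style view for non-critical code paths.
  std::array<float, 3> operator[](std::size_t i) const {
    return {x_[i], y_[i], z_[i]};
  }
  // SoA streams for vectorized kernels.
  const float* x() const { return x_.data(); }
  const float* y() const { return y_.data(); }
  const float* z() const { return z_.data(); }
};
```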
To achieve this while keeping the performance penalty low, we only modify the code in performance-critical regions to explicitly use such SoA containers for the particle abstractions, and we overload their square-bracket operators to return the particle position at an index in the current AoS format. This keeps the internal data layout in SoA format while allowing use in both AoS and SoA styles. Other HPC codes written with object-oriented programming can take advantage of this technique to gain performance with minimal programming effort.

B. AoSoA transformation ("tiling")

Efficient cache utilization is critical for high performance on large cache-based architectures. Tiling or blocking optimizations have proven to be highly effective in many grid-based applications [18], but they are also among the most difficult to implement in a way that maximizes performance. We present our method of tiling, proven to be efficient for the B-spline routines in QMC.

Simulating periodic images of a primitive unit cell for the SPOs involves keeping the grid N_g = (nx, ny, nz) constant and increasing the number of splines N when solving larger systems. Accesses to the read-only 4D coefficient array P are random, limiting cache reuse. The working set of this single-precision (SP) array grows as N_g·N, making it too big to fit into even the last-level caches for typical N_g values. The output working set also grows with N; for example, the full SP output working set of VGH grows as N·N_w for N_w walkers. For big N, these arrays fall out of the caches. We apply the AoSoA transformation, a tiling optimization, to increase performance through better cache utilization.

After the SoA transformation, the spline dimension N is the innermost and contiguous dimension of both the input and output arrays. We tile both the input 4D coefficient array P and the output arrays along the spline dimension N: we create arrays of BsplineSoA and WalkerSoA objects, effectively splitting the coefficients P and the outputs along N into M groups, and use N_b = N/M as the tile size. The tiling of the P array is shown pictorially in Fig. 5(b). Figure 6 is pseudo user code using M BsplineSoA and WalkerSoA objects with N_b as the spline dimension. Object-oriented programming in QMCPACK allows us to use the SoA data containers developed for the previous section without any modification.

The AoSoA transformation enhances spatial and temporal locality by reducing the working set size. The VGH working sets become proportional to N_g·N_b (inputs) and N_w·N_b (outputs), which allows the output working set to be kept in cache.

```
 1  int Nb = N/M; // tile size
 2  class WalkerSoA {
 3    T v[Nb], g[3*Nb], l[Nb], h[6*Nb];
 4  };
 5  // SoA object array containing P[nx][ny][nz][Nb].
 6  BsplineSoA bs[M](Nb);
 7  // Parallel region for creating walkers.
 8  omp parallel {
 9    WalkerSoA w[M](Nb);
10    omp parallel for
11    for (int t = 0; t < M; ++t) {
12      // evaluate tile t: bs[t].VGH(x, y, z, w[t].v, w[t].g, w[t].h)
13    }
14  }
```

Fig. 6. Pseudo user code with M BsplineSoA and WalkerSoA objects.

C. Nested threading

Moving forward, exposing parallelism within a walker and scaling to more nodes are becoming necessary (i) to solve problems faster and (ii) to reduce memory footprints. The AoSoA data layout transformation described in the previous subsection exposes natural parallelism in the B-spline operations. Each tile, containing BsplineSoA and WalkerSoA objects as shown in Fig. 6, can be executed independently, without any synchronization. Line 10 in Fig. 6 shows a method for exploiting this parallelism using a nested omp parallel for.
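A sketch of that nested-OpenMP route (our illustration; miniQMC's actual scheme is described next) would enable a second parallel level inside each walker:

```cpp
#include <omp.h>

void run_generation(int n_walkers, int n_th, int M) {
  omp_set_max_active_levels(2);          // allow two levels of parallelism
  #pragma omp parallel num_threads(n_walkers)
  {
    // ... per-walker setup (cf. Fig. 6) ...
    #pragma omp parallel for num_threads(n_th)
    for (int t = 0; t < M; ++t) {
      // evaluate tile t of this walker: bs[t].VGH(...) into w[t]
    }
  }
}
```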
The implementation in miniQMC instead adopts an explicit data partition scheme, assigning n_th threads to each walker and distributing the M objects among the n_th threads. This avoids any potential overhead of the nested OpenMP runtime environment and sets the maximum performance that can be obtained with our cache-and-threading optimizations. The pseudocode in Fig. 6 is representative of that shown in Fig. 3, with the tiling and threading optimizations, and omits redundant details including the calls to the VGL and V routines.

This parallelization strategy over M independent objects keeps the benefits of the smaller working sets, enabling efficient cache utilization, and the independent execution of the tiles eliminates any potential synchronization overhead. The working sets of VGH become proportional to N_g·N_b·n_th (inputs) and N_w·N_b·n_th (outputs). For the strong-scaling runs, we decrease N_w by a factor of n_th, keeping the output working set size the same. An alternate approach we explored for the nested threading is threading over the innermost N dimension without tiling; it does not reap the benefits of smaller working sets for efficient cache utilization and performs worse than the approach chosen here. We plan to extend this AoSoA design to parallelize other parts of QMCPACK and are evaluating various parallel algorithms and runtimes.

VI. EXPERIMENTS AND RESULTS

Our experiments are carried out on four different shared-memory multi/many-core processors: an 18-core single-socket Intel® Xeon® processor E5-2697v4 (BDW), the first-generation Intel® Xeon Phi™ coprocessor 7120P (KNC), the second-generation Intel® Xeon Phi™ processor 7250P (KNL), and IBM Blue Gene/Q (BG/Q). Details of these systems are shown in Table I. As will be discussed, the performance of miniQMC and QMCPACK is highly sensitive to the memory bandwidth and to the cache types and sizes, and these are the deciding factors for our optimization strategy. The computations on KNL are performed in quad cluster mode and flat memory mode. The memory footprints of all problem sizes are less than 16 GB, so we exclusively use MCDRAM by launching the executables with numactl -m 1. No code modification is made to manage allocation other than using the cache-aligned allocators.

For the analysis, we vary N, the number of splines, from 128 to 4096, spanning current-day problems to the large problems planned as grand challenges on pre-exascale systems. The grid size is kept constant at N_g = 48 × 48 × 60, simulating periodic images of the primitive unit cell for the SPOs, e.g., the blue box in Fig. 1(b). The tile size N_b is a tunable parameter on each system. We plan to provide an auto-tuning capability using miniQMC to guide production runs, similar to FFTW's solution using wisdom files [28].

We use T for the throughput per node, a QMC-specific metric (operations per second) used to compare performance across problem sizes and processors. T represents the work done on a node per second and is computed as T_X = N_w·N/t_X, where t_X is the total time for X = V, VGL or VGH. It is directly related to the efficiency of QMC simulations using B-spline SPOs. For ideal performance, T should be independent of N and of the grid sizes N_g = (nx, ny, nz). A higher T indicates both a more powerful node, e.g., KNL vs. KNC, and a higher efficiency of the implementation. The speedup is the ratio of T before and after optimizations using the same number of nodes.
As pointed out before, the QMCPACK implementation does very little communication across nodes and its parallel efficiency with MPI is excellent, so we can reliably use single-node performance for the comparisons. We choose N_w = 36 (BDW), 240 (KNC), 256 (KNL) and 64 (BG/Q), corresponding to 1 walker per thread on each system, as used in current QMC simulations.

The rest of the section presents the performance evolution under the processes detailed in Sec. V. We focus on the throughput of VGH for the analysis. Memory bandwidth usage is a key indicator of efficiency; we measure the bandwidth utilization on KNL using VTune. The performance characteristics of VGL are similar to those of VGH on each system, so only the final data for VGL are presented. Kernel V, having a single floating-point output array, does not need the SoA data layout and benefits only from the AoSoA transformation and the nested threading versions.

A. AoS-to-SoA Data Layout Transformation

Figure 7(a) presents the performance before and after the AoS-to-SoA transformation of the output arrays, showing 2-4x speedups for small to medium problem sizes on the Intel processors. We estimate the gain from vectorization on KNL by comparing T without auto-vectorization and with vectorization by the compiler; we call this ratio the vector efficiency in the following text. We use -no-vec -no-simd -no-openmp-simd to turn off auto-vectorization by the Intel compilers. For a small problem such as N = 256, the vector efficiency is low, at 1.2x, with the AoS data types. In contrast, the efficiency with the SoA objects is greater than 4x, and the read bandwidth utilization increases from 60-98 GB/s to a sustained 238 GB/s. KNC, being an in-order processor with high memory bandwidth, gets the biggest boost from the AoS-to-SoA transformation. This SoA technique was already implemented in the QPX version for BG/Q, leading to a 2.2x speedup; however, that intrinsics solution is not portable and, more importantly, is not needed with the transformation.

The AoS-to-SoA transformation achieves good speedups for small to medium problem sizes, but its effectiveness diminishes as N increases beyond N = 512. Almost no speedup is obtained on KNC and KNL at N = 2048 and 4096. The sizes of the L1/L2 caches are mainly responsible for the poor performance at large N: the output arrays fall out of the caches for large N values. This is confirmed by VTune analysis on KNL: the write bandwidth usage increases to 177 GB/s at N = 4096 from 38 GB/s at N = 256. The drop in T with increasing N is less severe on BDW with its large shared L3 cache. Nevertheless, the absolute performance goes down for the big problems on all the cache-based systems we have tested. To solve this problem, we apply an AoSoA data layout transformation.

B. AoSoA Data Layout Transformation or Tiling

The array-of-SoA data layout transformation is a cache-blocking optimization: we break up the problem into M "active" working sets to make them fit in the caches. Note that we divide the working set along the spline dimension N, which is common to both the inputs and the outputs. The output arrays get split and fit into the caches for efficient reduction operations. The input four-dimensional coefficient array P, which is shared among all the threads, also gets split along its innermost dimension, allowing it to fit in big last-level caches. As shown in Fig. 5, the accesses to the P array, although random, exhibit a small amount of spatial locality across the x, y and z dimensions.
This tiling method also shortens the stride of the outer dimensions, increasing cache reuse and possibly avoiding page faults for big grid sizes N_g and large N.

Fig. 7. VGH throughput (operations per second, higher is better) with (a) AoS-to-SoA and (b) SoA-to-AoSoA (tiling) data layout transformations. (c) Performance of AoSoA at N = 2048 as a function of the tile size N_b.

Figure 7(c) shows the VGH performance with increasing tile size for N = 2048. Starting at N_b = 16, we explore tile sizes in multiples of two up to N_b = N and call the N_b with the highest throughput the optimal tile size.

A striking feature for BDW is the peak at N_b = 64. The entire single-precision working set of roughly 28 MB, including the P array of N_g·N_b values and the output of N_w·N_b components, fits in the 45 MB L3 cache. The input working set at N_b = 128, roughly 56 MB, exceeds the L3 size, and therefore VGH becomes memory-bandwidth limited for N_b > 64. BG/Q, having a shared L2 of 32 MB, can also hold the 28 MB working set and hence also peaks at N_b = 64. Intel® Xeon® processors with a shared LLC behave similarly.

In contrast, KNC and KNL do not show the same pronounced peak. Although the L2 caches are kept fully coherent among cores, they are private to cores on KNC and to tiles of two cores on KNL. Due to the random accesses to the large P array, the private L2 caches may contain duplicate copies of elements at any point in time; the duplication reduces the effective L2 capacity. Even for the N_b = 16 tile size, the input working set is 7 MB, which does not fit in their L2 caches. For KNC and KNL, the performance peak is instead obtained at N_b = 512. The improvement comes from fitting the output arrays in cache, allowing efficient reduction operations over the 64 grid points; the performance goes up as N_b increases up to 512, reflecting the amortized cost of the redundant computations of the prefactors for the same random position.

Figure 7(b) presents the VGH performance before and after the AoSoA transformation, showing significant improvements for N = 2048 and 4096. We use the optimal N_b obtained on each platform: N_b = 64 on BDW and 512 on KNC and KNL. With tiling, we obtain sustained throughput across the problem sizes on all the cache-based architectures we have tested. On BDW and BG/Q with their shared LLC, N_b = 64 gives speedups of 1.2-2.5x (BDW) and 1.2-1.5x (BG/Q) across all the problem sizes. With the optimal N_b = 512 on KNC and KNL, the output arrays stay in the L1/L2 caches and the write bandwidth decreases from 177 GB/s to 43 GB/s even for N = 4096; the vector efficiency also increases from 2.5x to 4.2x.

Figure 8 summarizes the performance improvement on KNL for all three functions V, VGL and VGH with the AoSoA transformation. Speedups are computed with the current AoS implementation in C/C++ in the QMCPACK distribution as the reference. The AoS-to-SoA transformation does not apply to V, which only gains from tiling. The speedup for VGL includes the gains from the basic optimizations stated earlier, which come on top of the AoSoA optimization and provide a greater overall speedup. Our optimizations boost the throughput by 1.85x (V), 6.4x (VGL) and 2.5x (VGH) on a node at N = 4096.

In QMCPACK, each thread owns a ParticleSet object and can benefit from keeping it in cache during the entire run; this AoSoA approach will facilitate similar cache-blocking efficiency in the full application.
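The tile-size scan behind Fig. 7(c) is simple enough to automate; a minimal sketch follows, with benchmark_vgh() standing in for a timed miniQMC run (a hypothetical helper, not an existing API).

```cpp
// Scan power-of-two tile sizes and return the N_b with the highest
// measured VGH throughput; tuned once per architecture, then reused.
double benchmark_vgh(int Nb);  // hypothetical: runs VGH, returns throughput T

int tune_tile_size(int N) {
  int best_Nb = 16;
  double best_T = 0.0;
  for (int Nb = 16; Nb <= N; Nb *= 2) {
    const double T = benchmark_vgh(Nb);
    if (T > best_T) { best_T = T; best_Nb = Nb; }
  }
  return best_Nb;
}
```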
The optimal tile size N_b is independent of the problem size N for a given cache-based architecture and can be estimated from the cache sizes and their sharing properties. For production runs, N_b can be tuned once per architecture using miniQMC, making the new algorithm effectively cache-oblivious for any N and N_g.

C. Exploiting Parallelism within Walker Update

Extending parallelism beyond the parallelization over walkers is a necessary step to scale out to more nodes, reducing the time-to-solution and the memory footprint. We use n_th threads for each walker update and reduce the number of walkers on a node by the same factor. This allows us to scale the same problem from one node to n_th nodes, which is well justified since the MPI efficiency remains nearly perfect up to thousands of nodes on multiple HPC systems [12].

Fig. 8. Normalized speedup on KNL with the original AoS version as reference.

Performance with the nested threading again depends heavily on the cache system properties. The optimal tile sizes for the Intel® Xeon® and BG/Q processors are determined mainly by the input working set size, proportional to n_th·N_g·N_b. This grows with n_th, reducing the optimal tile size by the same factor and limiting the scalability: for N = 2048, scaling on these platforms is limited to 2 threads per walker for close to 80% parallel efficiency. For KNC and KNL, the optimal tile size is determined by the output working set size, proportional to n_th·N_w·N_b for VGH. This remains constant because N_w decreases by the same factor n_th, keeping the optimal tile size the same. It allows strong scaling with nearly ideal parallel efficiency up to n_th = N/N_b for the optimal N_b; beyond that, the scaling depends on the performance at smaller, non-optimal tile sizes. For N = 2048, scaling on KNC is limited to 8 threads per walker, corresponding to N_b = 256, for close to 80% parallel efficiency.

Figure 9 presents the speedup of the three B-spline kernels in miniQMC on KNL at N = 2048 with respect to n_th. The total number of walkers per node and the memory usage are reduced by the same factor n_th. The optimal configuration of the AoSoA version, with N_b = 512, N_w = 256 and n_th = 1, is used as the reference for computing the speedups. The parallel efficiency for n_th = 16 is greater than 90%, even though N_b = 128 is smaller than the optimal tile size. For the bigger problem size of N = 4096, we can expect the same parallel efficiency at 32 threads.

VII. PERFORMANCE SUMMARY AND ROOFLINE ANALYSIS

Our optimization processes enhance the utilization of the caches and of the memory bandwidth on all four distinct cache-based architectures discussed here. The relative impact of each optimization step varies with the memory bandwidth and cache subsystem properties of each platform, as summarized in Table IV. The AoS-to-SoA data layout conversion (Opt A) significantly raises the performance on all of them; the same is true on GPUs [16]. The cache levels, sizes and sharing properties play critical roles in the AoSoA transformation (Opt B). The performance on the systems with a shared LLC (BDW and BG/Q) is best when a subblock of the B-spline table fits into the LLC. On Intel® Xeon Phi™ (co)processors, AoSoA is an effective cache-blocking method and helps keep the output arrays in cache for the reduction operations, allowing sustained performance for all problem sizes.

Fig. 9. Scaling with respect to the number of threads per walker on KNL. The tile sizes N_b are chosen to have a sufficient number of tiles for n_th.
TABLE IV
SPEEDUPS FOR N=2048. OPTIMIZATION STEPS ARE A (AOS-TO-SOA), B (AOSOA) AND C (NESTED THREADING). THE SPEEDUPS FOR C INCLUDE THE STRONG-SCALING FACTOR n_th.

| Kernel | Opt | BDW    | KNC     | KNL      | BG/Q   |
|--------|-----|--------|---------|----------|--------|
| V      | A/B | 2.0    | 1.2     | 1.3      | 1.3    |
| V      | C   | 3.4    | 5.9     | 18.7     | 2.0    |
| VGL    | A   | 4.2    | 4.0     | 5.1      | 7.4    |
| VGL    | B   | 10.2   | 5.7     | 5.6      | 9.5    |
| VGL    | C   | 17.2   | 42.1    | 80.6     | 15.8   |
| VGH    | A   | 1.7    | 2.6     | 1.7      | 1.9    |
| VGH    | B   | 3.7    | 5.2     | 2.3      | 2.7    |
| VGH    | C   | 6.4    | 35.2    | 33.1     | 5.2    |
| n_th (N_b) for C | | 2 (32) | 8 (256) | 16 (128) | 2 (32) |

Finally, nested threading (Opt C) is used to exploit a new level of parallelism over the AoSoA objects. KNL benefits the most from Opt C and provides strong scaling of up to 14x with 16 threads on a node at roughly 90% parallel efficiency. Users can tune n_th and the number of nodes to optimize their productivity, either to maximize the number of simulations per day (throughput) or to minimize the time-to-solution.

Figure 10 presents a roofline performance analysis [29], [30] of VGH at the various optimization steps for N = 2048 splines. Intel® Advisor 2017 [31] is used to compute the GFLOPS and the cache-aware arithmetic intensity (AI), FLOPS per byte, with static and run-time analysis. The rooflines based on the measured bounds on BDW and KNL are shown for guidance. In all cases, the bytes transferred from main memory are the same, 64N reads and 10N writes, and the difference in AI reflects the SIMD efficiency and cache reuse. The AoS-to-SoA transformation increases the AI as well as the GFLOPS, as we eliminate the inefficient gather and scatter instructions for the strided accesses to the outputs. The AoSoA transformation does not affect the AIs but increases the performance at the optimal tile sizes through the increased cache locality. The higher bandwidth available with MCDRAM on KNL is critical for higher performance, as indicated by the best 150 GFLOPS obtained on DDR with the AoSoA version.

Fig. 10. VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI, and X the best performance (AoSoA) on DDR.

In addition to optimizing the B-spline orbital evaluations, we are applying the techniques described here, such as the SoA and AoSoA data layout transformations and nested threading, to the other compute-intensive kernels of miniQMC and QMCPACK. Combining these optimizations, we obtain greater than 4.5x speedup of full miniQMC for a range of problem sizes on the KNL and BDW systems. Our goal is to transfer these performance gains to QMCPACK with minimal code changes: our SoA container classes, developed in object-oriented C++, and the optimized compute kernels will be called directly from the QMCPACK driver code, and with the use of operator overloading in C++, the non-performance-critical parts of QMCPACK require minimal changes for the transition.

VIII. CONCLUSION AND FUTURE WORK

We presented node-level optimizations of the B-spline based SPO evaluations widely used in QMC on multi/many-core shared-memory processors, together with a nested parallelism implementation enabling strong scaling to reduce memory usage and time to solution. The techniques described in this work, such as the SoA data layout transformation, cache blocking and threading over the AoSoA objects, are applicable to a broad range of algorithms and applications and should be taken into account in optimization and code-modernization efforts. We are applying them in full QMCPACK to deliver the performance obtained in miniQMC and to enable breakthrough QMC studies. We demonstrated the impact of our optimization techniques on four distinct cache-coherent architectures and on a range of problem sizes, from those used in current-day simulations to highly futuristic ones.
Our methods to improve performance enable performance-portable solutions in QMCPACK and other QMC applications on current and future cache-coherent architectures.

ACKNOWLEDGMENT

We would like to thank Lawrence Meadows, John Pennycook, Jason Sewall, Harald Servat and Victor Lee for their helpful discussions and for reviewing this manuscript. This work is supported by Intel Corporation to establish the Intel Parallel Computing Center at Argonne National Laboratory. LS was supported by the Advanced Simulation and Computing - Physics and Engineering Models program at Sandia National Laboratories. AB was supported through the Predictive Theory and Modeling for Materials and Chemical Science program by the Office of Basic Energy Sciences (BES), Department of Energy (DOE). Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under Contract No. DE-AC04-94AL85000. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract No. DE-AC02-06CH11357.

REFERENCES

[1] A. Benali, L. Shulenburger, N. A. Romero, J. Kim, and O. A. von Lilienfeld, "Application of diffusion Monte Carlo to materials dominated by van der Waals interactions," Journal of Chemical Theory and Computation, vol. 10, no. 8, pp. 3417–3422, 2014.
[2] H. Shin, S. Kang, J. Koo, H. Lee, J. Kim, and Y. Kwon, "Cohesion energetics of carbon allotropes: Quantum Monte Carlo study," The Journal of Chemical Physics, vol. 140, no. 11, p. 114702, 2014.
[3] P. Ganesh, J. Kim, C. Park, M. Yoon, F. A. Reboredo, and P. R. Kent, "Binding and diffusion of lithium in graphite: quantum Monte Carlo benchmarks and validation of van der Waals density functional methods," Journal of Chemical Theory and Computation, vol. 10, no. 12, pp. 5318–5323, 2014.
[4] L. Shulenburger, A. D. Baczewski, Z. Zhu, J. Guan, and D. Tomanek, "The nature of the interlayer interaction in bulk and few-layer phosphorus," Nano Letters, vol. 15, no. 12, pp. 8170–8175, 2015.
[5] J. A. Santana, J. T. Krogel, J. Kim, P. R. Kent, and F. A. Reboredo, "Structural stability and defect energetics of ZnO from diffusion quantum Monte Carlo," The Journal of Chemical Physics, vol. 142, no. 16, p. 164705, 2015.
[6] K. Foyevtsova, J. T. Krogel, J. Kim, P. Kent, E. Dagotto, and F. A. Reboredo, "Ab initio quantum Monte Carlo calculations of spin superexchange in cuprates: The benchmarking case of Ca2CuO3," Physical Review X, vol. 4, no. 3, p. 031003, 2014.
[7] L. Shulenburger and T. R. Mattsson, "Quantum Monte Carlo applied to solids," Physical Review B, vol. 88, no. 24, p. 245117, 2013.
[8] "Knights Landing (KNL): 2nd generation Intel® Xeon Phi™ processor."
[9] "IBM Blue Gene/Q."
[10] The Journal of Chemical Physics, vol. 109, no. 18, pp. 7678–7693, 1998.
[11] "Intel® Math Kernel Library (Intel® MKL)." [Online]. Available: https://software.intel.com/en-us/intel-mkl
[12] J. Kim, K. P. Esler, J. McMinis, M. A. Morales, B. K. Clark, L. Shulenburger, and D. M. Ceperley, "Hybrid algorithms in quantum Monte Carlo," Journal of Physics: Conference Series, vol. 402, no. 1, p. 012008, 2012. [Online]. Available: http://stacks.iop.org/1742-6596/402/i=1/a=012008
[13] K. P. Esler, "Einspline B-spline library." [Online]. Available: http://einspline.sf.net
[14] Q. Niu, J. Dinan, S. Tirukkovalur, A. Benali, J. Kim, L. Mitas, L. Wagner, and P. Sadayappan, "Global-view coefficients: a data management solution for parallel quantum Monte Carlo applications," Concurrency and Computation: Practice and Experience, 2016, cpe.3748. [Online]. Available: http://dx.doi.org/10.1002/cpe.3748
[15] Y. Luo, A. Benali, and V. Morozov, "Accelerating the B-spline evaluation in quantum Monte Carlo," 2015. [Online]. Available: http://sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/tech_poster_pages/post337.html
[16] K. P. Esler, J. Kim, L. Shulenburger, and D. M. Ceperley, "Fully accelerating quantum Monte Carlo simulations of real materials on GPU clusters," Computing in Science and Engineering, 2012.
[17] "QMCPACK." [Online]. Available: http://www.qmcpack.org
[18] J. Reinders and J. Jeffers, Eds., High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches. Boston, MA, USA: Morgan Kaufmann Publishers Inc., 2015, vol. 1.
[19] J. Reinders and J. Jeffers, Eds., High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches. Boston, MA, USA: Morgan Kaufmann Publishers Inc., 2015, vol. 2.
[20] "Intel® SIMD Data Layout Templates (Intel® SDLT)." [Online]. Available: https://software.intel.com/en-us/node/600110
[21] A. Wells, "SIMD building block," unpublished.
[22] D. Alfè and M. J. Gillan, "An efficient localized basis set for quantum Monte Carlo calculations on condensed matter," Physical Review B, vol. 70, no. 16, p. 161101, 2004.
[23] "CORAL collaboration, benchmark codes." [Online]. Available: https://asc.llnl.gov/CORAL-benchmarks/
[24] Intel compiler options: -g -ip -restrict -unroll -O3 -qopenmp -std=c++11, with -march=core-avx2 (BDW), -mmic (KNC) and -xmic-avx512 (KNL).
[25] IBM compiler options: -g -O3 -qinline=auto:level=10 -qarch=qp -qsmp=omp -qthreaded -qstrict -qhot=level=1 -qtune=qp -qsimd=auto.
[26] "Intel® VTune™ Amplifier."
[27] "HPCToolkit." [Online]. Available: http://hpctoolkit.org
[28] M. Frigo and S. G. Johnson, "The design and implementation of FFTW3," Proceedings of the IEEE, vol. 93, no. 2, pp. 216–231, 2005, special issue on "Program Generation, Optimization, and Platform Adaptation."
[29] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for floating-point programs and multicore architectures," Communications of the ACM, 2009.
[30] A. Ilic, F. Pratas, and L. Sousa, "Cache-aware roofline model: Upgrading the loft," IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 21–24, 2014.
[31] "Intel® Advisor 2017."