Enhancing Scalability of a Matrix-Free Eigensolver for Studying Many-Body Localization
Roel Van Beeumen, Khaled Z. Ibrahim, Gregory D. Kahanamoku-Meyer, Norman Y. Yao, Chao Yang
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA
Department of Physics, University of California at Berkeley, Berkeley, CA
Abstract
In [Van Beeumen et al., HPC Asia 2020] a scalable and matrix-free eigensolver was proposed for studying the many-body localization (MBL) transition of two-level quantum spin chain models with nearest-neighbor XX + YY interactions plus Z terms. This type of problem is computationally challenging because the vector space dimension grows exponentially with the physical system size, and averaging over different configurations of the random disorder is needed to obtain relevant statistical behavior. For each eigenvalue problem, eigenvalues from different regions of the spectrum and their corresponding eigenvectors need to be computed. Traditionally, the interior eigenstates for a single eigenvalue problem are computed via the shift-and-invert Lanczos algorithm. Due to the extremely high memory footprint of the LU factorizations, this technique is not well suited for large numbers of spins L; e.g., one needs thousands of compute nodes on modern high performance computing infrastructures to go beyond L = 24. The matrix-free approach does not suffer from this memory bottleneck; however, its scalability is limited by a computation and communication imbalance. We present a few strategies to reduce this imbalance and to significantly enhance the scalability of the matrix-free eigensolver. To optimize the communication performance, we leverage the consistent space runtime, CSPACER, and show its efficiency in accelerating the MBL irregular communication patterns at scale compared to optimized MPI non-blocking two-sided and one-sided RMA implementation variants. The efficiency and effectiveness of the proposed algorithm is demonstrated by computing eigenstates on a massively parallel many-core high performance computer.

Keywords: Quantum many-body problem, many-body localization, eigenvalue problem, matrix-free eigensolver, LOBPCG method, preconditioner, scalability, graph partitioning, METIS, communication optimization, CSPACER.

∗ Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720. Email: [email protected]

Introduction
A fundamental assumption in the traditional theory of statistical mechanics is that an isolated system will in general reach an equilibrium state, or thermalize. As early as the mid-20th century, Anderson demonstrated that a single particle moving in a highly disordered landscape can violate this assumption [2]. While surprising, that result does not readily extend to many-particle systems that exhibit strong interactions between the constituent particles. The question of whether a similar effect could manifest in a strongly-interacting many-body system remained open for decades. This elusive phenomenon has been termed "many-body localization" (MBL).

Recently, advances in both high performance computing and experimental control of individual quantum particles have begun to yield insight into MBL. Both experimental [23, 5, 24, 17, 4, 19] and numerical [18, 13, 6, 3, 21] results have shown evidence of localization in small strongly-interacting multiparticle systems of 10-20 spins. Unfortunately, extrapolating results from these small system sizes to the infinitely-large thermodynamic limit has proven difficult. This lack of clarity has inspired a vigorous debate in the community about precisely what can be learned from small-size results. For example, it has been proposed that certain features do not actually exist at infinite system size [7], and even that MBL itself is only a finite-size effect [26, 1]!

The primary goal of most studies is to identify and characterize a localization transition. In the thermodynamic limit, as the strength of the system's disorder increases, theory predicts a sharp, sudden change from a thermal to a localized state. Unfortunately, in the small systems available for study, that sharp transition turns into a smooth crossover, leading to the confusion about what constitutes the transition itself. Numerical evidence suggests that the transition sharpens rapidly as system size increases, so accessing as large systems as possible is imperative for investigating MBL.

In pursuit of that goal, Luitz et al. used large-scale numerical linear algebra to show a localization transition for system sizes up to L = 22 [18], and a follow-up paper extracted useful data up to L = 24 [22]. In order to compute interior eigenstates for the MBL problem, the shift-and-invert Lanczos algorithm was used in combination with sparse direct solvers for solving the linear systems. One of the major disadvantages of this technique is that constructing the LU factorizations becomes extremely memory demanding, due to the so-called fill-in, for large numbers of spins L. Table 1 shows that the memory footprint of the LU factorization computed via STRUMPACK [9] grows rapidly as a function of L; see also [22]. Hence, thousands of nodes on modern high performance computing infrastructures are needed to go beyond L = 24.
Table 1: Total memory footprint as a function of the spin chain length L for LU factorizations, computed via STRUMPACK, and the matrix-free LOBPCG algorithm [27], with block size 64. The problem size is given by n.

L     n            STRUMPACK    LOBPCG(64)
16    12,870       66 MB        8 MB
18    48,620       691 MB       31 MB
20    184,756      8 GB         118 MB
22    705,432      92 GB        451 MB
24    2,704,156    1 TB         2 GB
26    10,400,600   15 TB        7 GB
To overcome the memory bottleneck that the shift-and-invert Lanczos algorithm faces, we recently proposed in [27] a matrix-free locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm. As shown in Table 1, this approach reduces the memory footprint by several orders of magnitude, e.g., from 15 TB to only 7 GB for L = 26, and enables simulating spin chains up to L = 24 on even a single node. In contrast to the shift-and-invert Lanczos algorithm, where the dominant computational cost is the construction of the LU factorization, the dominant computational cost of the LOBPCG algorithm is the (block) matrix-vector (MATVEC) product. As illustrated in [27], the scalability of this MATVEC is limited at high concurrency due to a computation and communication imbalance. In the current paper, we present different strategies to overcome this imbalance and to significantly enhance the scalability of the matrix-free eigensolver.

The paper is organized as follows. We first review the Heisenberg spin model and MBL metrics in Section 2. The multiple levels of concurrency and the matrix-free LOBPCG eigensolver are discussed in Section 3. Next, we present the balancing of computation and communication within the MATVEC in Section 4 and the optimization of the communication performance in Section 5. Then, in Section 6, we illustrate the different proposed strategies for improving the computation and communication imbalance of the matrix-free LOBPCG eigensolver for L = 24 and L = 26 problems. Finally, the main conclusions are formulated in Section 7.

In this section we briefly review the properties of the spin chain model that is most frequently studied in numerical simulations of MBL.
We consider the nearest-neighbor interacting Heisenberg spin model with random on-site fields:

    H = \sum_{\langle i,j \rangle} \vec{S}_i \cdot \vec{S}_j + \sum_i h_i S^z_i,    (1)

where the angle brackets denote nearest neighbors i and j, h_i is sampled from a uniform distribution on [-w, w] with w \in \mathbb{R}^+, and \vec{S}_i \cdot \vec{S}_j = S^x_i S^x_j + S^y_i S^y_j + S^z_i S^z_j, where S^\alpha_i = \tfrac{1}{2}\sigma^\alpha_i, with \sigma^\alpha_i the Pauli matrices operating on lattice site i and \alpha \in \{x, y, z\}. The parameter w is called the disorder strength, and is responsible for inducing the MBL transition. The values h_i are sampled randomly each time the Hamiltonian is instantiated, and the relevant physics lies in the statistical behavior of the set of all such Hamiltonians. The individual Hamiltonians H with independently sampled h_i are called disorder realizations.

Note that in (1) each term of each sum has an implied tensor product with the identity on all the sites not explicitly written. Consequently, the Hamiltonian for L spins is a symmetric matrix of dimension N = 2^L and exhibits the following tensor product structure

    H = \sum_{i=1}^{L-1} I \otimes \cdots \otimes I \otimes H_{i,i+1} \otimes I \otimes \cdots \otimes I + \sum_{i=1}^{L} I \otimes \cdots \otimes I \otimes h_i S^z_i \otimes I \otimes \cdots \otimes I,    (2)

where H_{i,i+1} = S^x_i S^x_{i+1} + S^y_i S^y_{i+1} + S^z_i S^z_{i+1} is a 4-by-4 real matrix and I is the 2-by-2 identity matrix. Remark that, by definition, all matrices H_{i,i+1} are the same and independent of the site i. For our experiments, we use open boundary conditions, meaning that the nearest-neighbor terms do not wrap around at the end of the spin chain. Open boundary conditions can be considered to yield a larger effective system size because of the reduced connectivity.

The state of each spin is described by a vector in \mathbb{C}^2, and the configuration of the entire L-spin system can be described by a vector in the tensor product space (\mathbb{C}^2)^{\otimes L}. In this specific case, however, the Hamiltonian's matrix elements happen to all be real, so we do not include an imaginary part in any of our computations. Furthermore, our Hamiltonian commutes with the total magnetization in the z direction, S^z = \sum_{i=1}^{L} S^z_i. Thus it can be block-diagonalized in sectors characterized by S^z \in \{-L/2, -L/2+1, \ldots, L/2-1, L/2\}. The vector space corresponding to each sector has dimension n = \binom{L}{S^z + L/2}, such that the largest sector's dimension is n = \frac{L!}{(L/2)!\,(L/2)!}, and this corresponds to the actual dimension of the matrices on which we operate, see Table 1. While these subspaces are smaller than the full space, their size still grows exponentially with the number of spins L. Thus, the problem becomes difficult rapidly as L increases. Furthermore, the density of eigenvalues in the middle of the spectrum increases exponentially with L. Thus the tolerance used to solve for these interior eigenvalues must be made tighter rapidly as L increases.
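The tensor product structure of (2) can be made concrete with a small, self-contained sketch. The snippet below is a minimal illustration, assuming NumPy/SciPy; it is not the matrix-free production code of [27], which never forms the matrix and works inside the largest S^z sector. It assembles one disorder realization of H for a small chain via Kronecker products, exploiting the fact that all matrix elements are real.

```python
import numpy as np
from scipy import sparse

def heisenberg_hamiltonian(L, w, rng):
    """Assemble one disorder realization of Eq. (2) for a small chain of L spins."""
    Sx  = 0.5 * np.array([[0.0, 1.0], [1.0, 0.0]])    # S^x = sigma^x / 2
    iSy = 0.5 * np.array([[0.0, 1.0], [-1.0, 0.0]])   # i*S^y, kept real on purpose
    Sz  = 0.5 * np.array([[1.0, 0.0], [0.0, -1.0]])   # S^z = sigma^z / 2
    I2  = np.eye(2)

    def site_op(op, i):
        """Embed a single-site operator at site i as I x ... x op x ... x I."""
        out = sparse.csr_matrix(np.array([[1.0]]))
        for j in range(L):
            out = sparse.kron(out, op if j == i else I2, format="csr")
        return out

    H = sparse.csr_matrix((2**L, 2**L))
    for i in range(L - 1):                              # open boundary conditions
        H += site_op(Sx, i) @ site_op(Sx, i + 1)
        H -= site_op(iSy, i) @ site_op(iSy, i + 1)      # S^y S^y = -(iS^y)(iS^y), stays real
        H += site_op(Sz, i) @ site_op(Sz, i + 1)
    h = rng.uniform(-w, w, size=L)                      # random on-site fields
    for i in range(L):
        H += h[i] * site_op(Sz, i)
    return H, h

# Example: one realization for L = 10 spins at disorder strength w = 5.
H, h = heisenberg_hamiltonian(10, 5.0, np.random.default_rng(0))
```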
With the problem's matrix clearly defined, we now review ways for quantifying localization from the eigenvalues and eigenvectors. There are multiple quantities that can be used for identifying localization.

One of the commonly used quantities is the adjacent gap ratio [20, 26, 13, 6]. This approach is based on the statistical distribution of the eigenvalues of different disorder realizations; hence, only eigenvalues need to be computed. Random matrix theory informs us that the statistical distribution of eigenvalues will differ between localizing and thermalizing Hamiltonians [20]. In particular, we expect eigenvalues of a thermal Hamiltonian to repel each other, i.e., hybridization of eigenvectors prevents them from generally coming too close to one another. The eigenvalues of a localized Hamiltonian should not display this behavior: we expect them to be Poisson distributed. Therefore, we can measure localization by comparing the relative size of gaps between the eigenvalues. Thermal Hamiltonians will generally have more consistently sized gaps due to level repulsion. However, this technique suffers from large statistical noise and thus requires many samples to be usable.

Another quantity for measuring localization is the eigenstate entanglement entropy [27], which is based on the eigenvectors of the Hamiltonians. In a thermal system, we expect quantum entanglement to be widespread, while in a localized system, the entanglement is not expected to be extensive. This idea can be quantified by choosing a cut which divides the spin chain into two pieces, and measuring the entanglement across it. Not only the value of the entropy changes during the localization transition: the statistics change as well. When compared across disorder realizations, the thermal entanglement entropy has small variance. During the transition, however, the entanglement entropy depends strongly on the specific disorder realization and thus the statistic will have a large variance. Empirically, examining the variance of the entanglement entropy is one of the best ways to identify the localization transition, and it requires fewer samples than the adjacent gap ratio approach.
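Both metrics are straightforward to evaluate once eigenvalues and eigenvectors are available. The sketch below gives the textbook definitions only, assuming NumPy; it operates on a state in the full 2^L product space, so an eigenvector computed inside an S^z sector would first have to be mapped back to that space.

```python
import numpy as np

def adjacent_gap_ratio(eigvals):
    """Mean adjacent gap ratio <r> = < min(g_i, g_{i+1}) / max(g_i, g_{i+1}) >
    over consecutive level spacings g_i of one disorder realization."""
    e = np.sort(np.asarray(eigvals, dtype=float))
    g = np.diff(e)
    r = np.minimum(g[:-1], g[1:]) / np.maximum(g[:-1], g[1:])
    return float(r.mean())

def entanglement_entropy(psi, L, cut):
    """Von Neumann entanglement entropy of a normalized state `psi` (full 2^L space)
    across a cut after the first `cut` spins, via the Schmidt (singular) values."""
    psi = np.asarray(psi).reshape(2**cut, 2**(L - cut))
    s = np.linalg.svd(psi, compute_uv=False)
    p = s**2
    p = p[p > 1e-15]                     # drop numerically zero Schmidt weights
    return float(-(p * np.log(p)).sum())
```

Across disorder realizations, the variance of entanglement_entropy evaluated at the middle cut (cut = L/2) is the statistic whose behavior is used to locate the transition.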
Massively Parallel Simulation
In order to maximally reduce the finite-size effects on the determination of the MBL transition point, we need to study spin models with as many spins as possible. Consequently, this problem is computationally demanding and requires substantial resources.
The MBL study allows for at least four levels of concurrency. The first level corresponds to the need to average over (many) different and independently sampled disorder realizations in order to obtain relevant statistical behavior. Since the disorder strength is responsible for inducing the MBL transition, we also have to vary the disorder strength, giving rise to the second level of concurrency. The third level corresponds to the eigenvalue chunks, i.e., for each (large) eigenvalue problem, originating from one disorder realization and a particular disorder strength, we have to compute eigenvalues from different regions of the spectrum and their corresponding eigenvectors.

All previous levels of concurrency are completely independent and can be implemented in a massively parallel fashion by making use of iterative eigensolvers. In this paper, we therefore only focus on the fourth level of parallelism, taking place within these eigensolvers. Although most iterative eigensolvers follow a rather sequential procedure, each of the different steps within one iteration can be implemented in parallel.
The Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) algorithm [16, 8] is a widely used eigensolver for computing the smallest or largest eigenvalues and corresponding eigenvectors of large-scale symmetric matrices. Key features of the LOBPCG algorithm are: (i) it is matrix-free, i.e., the solver does not require storing the coefficient matrix explicitly and accesses the matrix only by evaluating matrix-vector products; (ii) it is a block method, which allows for efficient matrix-matrix operations on modern computing architectures; (iii) it can take advantage of preconditioning, in contrast to, for example, the Lanczos algorithm.

The standard LOBPCG algorithm allows for computing either the lower or upper part of the spectrum. In order to compute interior eigenvalues and their corresponding eigenvectors, we make use of the so-called spectral fold transformation [28]

    (H - \sigma I)^2,

where \sigma \in \mathbb{R} is the shift around which we want to compute eigenvalues. This spectral transformation maps all eigenvalues to the positive real axis, with the ones closest to the shift \sigma mapped to the lower edge, close to 0. Hence, after applying this transformation, we can also use the LOBPCG eigensolver for computing interior eigenvalues. Because the transformed eigenvalue problem

    (H - \sigma I)^2 x = \lambda x

is symmetric positive definite, we use a diagonal (Jacobi) preconditioned conjugate gradient (PCG) method as preconditioner for the LOBPCG eigensolver. For more details on the matrix-free LOBPCG eigensolver used in the particular case of studying many-body localization, we refer to [27].
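For orientation only, the same idea can be expressed with SciPy's generic LOBPCG applied to a matrix-free LinearOperator. This is a minimal sketch of the spectral fold, not the distributed solver of [27], and it relies on SciPy's default setup instead of the Jacobi-PCG preconditioner discussed above.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lobpcg

def interior_eigenpairs(H, sigma, k=64, tol=1e-6, maxiter=500, seed=0):
    """Sketch: eigenpairs of H nearest `sigma` via the fold (H - sigma*I)^2."""
    n = H.shape[0]
    shift = lambda v: H @ v - sigma * v
    # Matrix-free folded operator; only MATVECs with H are required.
    A = LinearOperator((n, n), matvec=lambda v: shift(shift(v)), dtype=float)
    X = np.random.default_rng(seed).standard_normal((n, k))   # random initial block
    _, V = lobpcg(A, X, largest=False, tol=tol, maxiter=maxiter)
    # Map back: Rayleigh quotients of H recover the eigenvalues near sigma.
    lam = np.array([v @ (H @ v) / (v @ v) for v in V.T])
    return lam, V
```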
In contrast to the shift-and-invert Lanczos algorithm, where the dominant computational cost is the construction of the LU factorization, the dominant computational cost of the LOBPCG and PCG algorithms is the (block) MATVEC. In the remainder of the paper, we therefore mainly focus on enhancing the scalability of the MATVEC. In this section we have a closer look at the MBL (block) matrix-vector product and focus on how to enhance its scalability by reducing the computation and communication imbalance.

As a starting point, we take the hybrid MPI–OpenMP MATVEC introduced in [27]. This matrix-free MATVEC uses one MPI rank per node and OpenMP for on-node parallelism. The block of vectors to be multiplied by the Hamiltonian matrix is partitioned by rows and distributed among different MPI ranks (and nodes). Within each MPI rank, a local sparse MATVEC is performed. A subset of the rows in the local vector block needs to be sent to other MPI ranks to be multiplied and accumulated on the target MPI ranks. Each MPI rank also receives vector block contributions from other MPI ranks to be combined with the local product. It has been illustrated in [27] that the parallel MATVEC implementation using non-blocking MPI communication, in combination with overlapping communication and local computation, results in the best performance.
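The following mpi4py sketch shows the shape of such an overlapped block MATVEC: receives are posted first, shared row blocks are sent with non-blocking sends, the purely local product is computed while messages are in flight, and remote contributions are folded in as they arrive. The helper names (spmv_local, spmv_remote, send_rows, recv_counts) are placeholders for this illustration, not the authors' implementation.

```python
import numpy as np
from mpi4py import MPI

def overlapped_block_matvec(comm, X_local, spmv_local, spmv_remote,
                            send_rows, recv_counts):
    """X_local     : local rows of the block of vectors (n_local x b)
       spmv_local  : applies the purely local part of H to X_local
       spmv_remote : (source_rank, rows) -> contribution to add to the result
       send_rows   : {neighbor: local row indices that neighbor needs}
       recv_counts : {neighbor: number of rows expected from that neighbor}"""
    b = X_local.shape[1]
    recv_reqs, recv_bufs, sources = [], [], []
    for nbr, nrows in recv_counts.items():               # post all receives first
        buf = np.empty((nrows, b))
        recv_reqs.append(comm.Irecv(buf, source=nbr, tag=7))
        recv_bufs.append(buf)
        sources.append(nbr)
    send_reqs = [comm.Isend(np.ascontiguousarray(X_local[rows]), dest=nbr, tag=7)
                 for nbr, rows in send_rows.items()]      # non-blocking sends
    Y = spmv_local(X_local)                               # overlap: local product
    for _ in range(len(recv_reqs)):                       # consume data as it arrives
        idx = MPI.Request.Waitany(recv_reqs)
        Y += spmv_remote(sources[idx], recv_bufs[idx])
    MPI.Request.Waitall(send_reqs)                        # release send buffers
    return Y
```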
Figure 1: Average wall clock time as a function of the MPI rank for the different components (computation, waitany, other, barrier) of the L = 26 non-blocking MPI–OpenMP MATVEC with block size 64.

Figure 1 shows the average wall clock time as a function of the MPI rank for 100 samples of the L = 26 non-blocking MPI–OpenMP MATVEC. The different amounts of time spent in the MPI barrier clearly show the high computation and communication imbalance of this MATVEC. Note that the local computation time can be fully overlapped with the communication and that the computation time shown only corresponds to the remote computation time, which can only start once the incoming data has arrived. Figure 1 also illustrates that this MATVEC is communication dominated. The difference between computation and communication time further increases for higher concurrency. Therefore, in the remainder of this paper, we will focus on different strategies for reducing and optimizing the communication time.

The row partition of the vectors among different MPI ranks corresponds to a partition of the adjacency graph induced by the many-body Hamiltonian, with vertices mapped to rows or columns, and edges mapped to nonzero elements of the Hamiltonian. Within each partition, vertices that are not connected by edges to vertices in other partitions are strictly local. The corresponding rows do not need to be sent to other MPI ranks. The vertices that are connected to vertices in other partitions are called shared rows. They need to be sent to other MPI ranks in a parallel MATVEC implementation.
The state ordering for the hybrid MPI–OpenMP MATVEC in [27] leads to a simple communication pattern where, up to ∼50 ranks, only communication with neighboring ranks in a linear topology is required. The ordering of the states also allows for efficiently computing the off-diagonal element indices on the fly for applying the matrix-free MATVEC. However, as shown in Figure 1, the communication is largely imbalanced and far from optimal. In particular, the graph for the Hamiltonian is quite nonuniform, and vertices in the middle are much more densely interconnected than vertices on the edge.

One of the properties of the Heisenberg spin model with random on-site fields is that the sparsity pattern of (2) does not change for different disorder realizations, and neither does the corresponding matrix graph. Therefore, we can apply graph partitioning techniques in order to reduce the communication volume and better balance the communication time among the different MPI ranks. A comparison of the communication volume, as a function of the total number of MPI ranks, between the hybrid MPI–OpenMP state ordering used in [27] and the METIS k-way graph partitioning [15] reordering of the states is shown in Table 2. For the METIS graph partitioning we used the objective function for total communication volume minimization [14].

Table 2 shows that for the L = 24 MATVEC beyond 4 MPI ranks both the total communication volume and the maximum communication volume per rank can be significantly reduced by using the METIS reordering. Note that this reordering will have an effect on both the computation and communication within the MATVEC. First, although the communication volume is reduced, the METIS reordering of the states results in a more complicated communication pattern compared to the original MPI–OpenMP one in [27]. Hence, each rank needs, in general, to communicate with more than 2 other ranks. Second, the METIS reordering of the states is also less structured and therefore the remote MATVEC computation will require more complicated lookup tables. However, since the MATVEC is communication dominant and we maximally overlap computation with communication, the extra overhead from a slightly slower computational portion will be marginal.
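A sketch of such a reordering is shown below, assuming the pymetis wrapper around METIS and a SciPy CSR sparsity pattern. For simplicity it relies on pymetis' default edge-cut objective, whereas the experiments reported here used the METIS option that minimizes the total communication volume [14].

```python
import numpy as np
import pymetis                     # assumed Python wrapper around METIS
from scipy import sparse

def metis_state_reordering(H, nparts):
    """Partition the adjacency graph of H's sparsity pattern into `nparts` parts
    and return (perm, counts): a permutation grouping states of the same part
    together, and the number of states owned by each MPI rank."""
    A = sparse.csr_matrix((H != 0).astype(np.int8))
    A = A + A.T                    # symmetrize the pattern
    A.setdiag(0)                   # METIS expects a graph without self-loops
    A.eliminate_zeros()
    _, membership = pymetis.part_graph(nparts,
                                       xadj=A.indptr.tolist(),
                                       adjncy=A.indices.tolist())
    membership = np.asarray(membership)
    perm = np.argsort(membership, kind="stable")
    counts = np.bincount(membership, minlength=nparts)
    return perm, counts
```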
Table 2: Communication volume as a function of the total number of MPI ranks for the L = 26 MPI–OpenMP MATVEC with the state ordering in [27] and the METIS reordering.

(a) Total communication volume

ranks     4           8           16          32
[27]      1,…,358     2,…,781     4,…,256     9,…,…
METIS     …,736       2,…,508     4,…,179     5,…,…   (−28 %)

(b) Maximum communication volume per rank

ranks     4           8           16          32
[27]      376,007     408,962     416,484     419,…
METIS     …,086       389,561     348,559     277,…   (−28 %)
The load balancing of the computation using METIS affects the communication pattern; specifically, the number of messages per rank changes from being constant to being a function of the rank count.

Figure 2 shows the distribution of the number of neighbors a rank needs to exchange vector data with, as we increase the job size. With METIS partitioning, depending on the rank position within a job, a different number of neighbors is involved in the vector exchange. With each of these neighbors, a rank needs to communicate a portion of its assigned vector. As shown in Figure 2, for the L = 26 problem, the number of neighbors increases super-linearly with respect to log(r), where r is the rank count.

Figure 3 shows the number of rows of the partitioned block vector each rank exchanges with its neighbors as a function of the sharing level, i.e., the number of neighbors a row block is sent to. From this figure, we see that the number of row blocks not involved in data exchange decreases as we strong scale the computation, making it necessary to communicate an increasing fraction of rows. Moreover, the sharing level increases as well, making it necessary to exchange the same row with multiple ranks. Such sharing causes the volume of communicated data to decrease sub-linearly with the rank count. Overall, the total volume of communicated data increases as we scale the job, roughly proportionally to √r.

Fortunately, the sparsity pattern that determines the communication pattern remains unchanged across iterations. As such, one could classify this communication pattern as a static irregular one, and we can construct all needed information about the communication pattern before the communication starts. Although the algorithm relies on other communication primitives, such as allreduce, they do not significantly contribute to the execution time.

The discussed attributes show the challenge in strong-scaling the computation, which can be summarized as the need to process an increasingly large number of small messages.
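Because the pattern is static, all of this bookkeeping can be derived once, up front, from the partition vector and the sparsity pattern. The sketch below uses placeholder names and produces the send/receive dictionaries consumed by the earlier overlapped MATVEC sketch; it is illustrative only.

```python
import numpy as np

def build_comm_pattern(A_pattern, part, me):
    """One-time setup of the static communication pattern for rank `me`.
    A_pattern : CSR sparsity pattern of the Hamiltonian (symmetric)
    part[i]   : owning rank of state/row i (e.g., from the METIS reordering)
    Returns send_rows   {neighbor: my local row indices that neighbor needs}
            recv_counts {neighbor: number of remote rows I will receive}."""
    part = np.asarray(part)
    mine = np.flatnonzero(part == me)
    glob2loc = {g: l for l, g in enumerate(mine)}
    send_rows = {}
    for g in mine:                                    # my rows coupled to other parts
        cols = A_pattern.indices[A_pattern.indptr[g]:A_pattern.indptr[g + 1]]
        for q in np.unique(part[cols]):
            if q != me:
                send_rows.setdefault(int(q), set()).add(glob2loc[g])
    remote = np.unique(A_pattern[mine].indices)       # states my rows couple to
    remote = remote[part[remote] != me]
    owners, counts = np.unique(part[remote], return_counts=True)
    recv_counts = {int(q): int(c) for q, c in zip(owners, counts)}
    send_rows = {q: np.fromiter(sorted(s), dtype=np.intp) for q, s in send_rows.items()}
    return send_rows, recv_counts
```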
Figure 2: The distribution of the neighbor count for vector data exchange as we scale the job size, due to load balancing the L = 26 MATVEC. Not only does the number of communication messages increase, the variability also increases significantly while scaling.
Figure 3: The amount of data to be communicated (or "shared") among different ranks in the L = 26 MATVEC communication phase as we scale the computation. The sharing level increases as the number of ranks increases, making the scaling communication bound.

Structuring Communication using MPI

MPI provides multiple techniques to implement a communication pattern. While not claiming that we exhausted all possible methods, we explored a few widely used techniques that are likely to serve our communication pattern.

The most natural way to implement our matrix-vector communication pattern is to use non-blocking point-to-point transfers. We overlapped local computation with some of the communication cost by assigning a thread to progress communication in the background while performing local computations.

In addition to using point-to-point communication primitives, we explored the use of MPI-3 non-blocking alltoallv collective primitives [10] and MPI-3 RMA [11]. The collective-based implementation performs more or less like the non-blocking point-to-point one and is omitted for brevity. We found the MPI one-sided implementation efficient at small scale, but its scaling behavior was inefficient in our experience on the Cray XC40 system.
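For reference, the collective variant can be expressed in a few lines with MPI-3's non-blocking Ialltoallv, here via mpi4py as a hedged sketch; the counts and the flattened send buffer are assumed to come from the precomputed pattern, and all names are placeholders.

```python
import numpy as np
from mpi4py import MPI

def exchange_ialltoallv(comm, sendbuf, sendcounts, recvcounts, local_work):
    """Exchange shared row data with one non-blocking collective while the purely
    local part of the MATVEC runs; sendbuf is flattened in increasing-rank order."""
    sendcounts = np.asarray(sendcounts, dtype=np.int64)
    recvcounts = np.asarray(recvcounts, dtype=np.int64)
    sdispl = np.insert(np.cumsum(sendcounts), 0, 0)[:-1]
    rdispl = np.insert(np.cumsum(recvcounts), 0, 0)[:-1]
    recvbuf = np.empty(int(recvcounts.sum()), dtype=np.float64)
    req = comm.Ialltoallv([sendbuf, (sendcounts, sdispl), MPI.DOUBLE],
                          [recvbuf, (recvcounts, rdispl), MPI.DOUBLE])
    local_result = local_work()        # overlap local computation with the collective
    req.Wait()
    return local_result, recvbuf
```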
In this study, we explored the use of CSPACER [25], the Consistent SPACE Runtime, which provides a low-overhead communication abstraction for irregular communication patterns. The runtime extends the support of the consistency space abstraction [12] to Cray systems. The runtime can interoperate with the MPI runtime, allowing for incremental integration and tackling communication hotspots while retaining the bulk of MPI's communication code.

The space consistency abstraction [12] is a generalization of full-empty synchronization for distributed computing, where each memory region is associated with a counter that determines its consistency. A memory space becomes consistent, i.e., ready for consumption, when the counter matches a specific consistency tag. To construct a consistent state, a space typically receives one or multiple transfers from one or more producers. The runtime provides APIs to facilitate checking the consistency of a space for consumers, but it does not provide the functionality of tracking the completions of individual data transfers. The runtime relies on symmetric space allocations across a team of ranks, and supports communication primitives such as one-sided put and various collectives that are implemented as patterns of multiple underlying primitives. Due to its simple design, the CSPACER runtime enjoys a low injection overhead, in the range of 0.4 µs on KNL architectures, and provides good scalability, especially for irregular communication patterns.

The space consistency abstraction adopts a memory-centric approach to orchestrating communication across ranks, rather than relying on transfer-centric strategies. It supports multiple mechanisms for issuing the transfer operations that help achieve a consistent state, including concurrent threaded injection from an OpenMP parallel region, pipelined injection and progress, etc. The runtime implements these operations without significant injection overhead or serialization between concurrent threads. Moreover, successfully injected transfers progress in the background without requiring the runtime to poll for progress, or assigning the progress to a thread. While transfer injection is non-blocking by default, a transfer injection can be blocked until resources are available. This back-pressure mechanism provides some throttling to avoid congesting the interconnect.

By not providing a mechanism for tracking the completion of individual transfers or ordering constraints between transfers, the runtime can handle a large number of transfers with minimal overhead. We structured the communication such that a single space per rank receives the contributions of all producer ranks. Therefore, involving more ranks in the communication does not increase the overhead of checking the data readiness for consumption. The advantage of such an approach manifests at scale.
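The counter-based notion of consistency can be illustrated with a purely conceptual, shared-memory analogue. This is emphatically not the CSPACER API (a C runtime issuing one-sided puts over the Aries interconnect); it only sketches the idea that a receive space becomes consumable once its counter reaches the expected tag, without tracking individual transfers.

```python
import threading

class ConsistentSpace:
    """Conceptual analogue of a consistency space: a buffer plus a counter.
    Producers deposit contributions and bump the counter; the consumer never
    tracks individual transfers, it only waits for the counter to reach a tag."""

    def __init__(self, nbytes):
        self.data = bytearray(nbytes)
        self._count = 0
        self._cv = threading.Condition()

    def put(self, offset, payload):
        """Producer side: deposit a contribution and increment the counter."""
        with self._cv:
            self.data[offset:offset + len(payload)] = payload
            self._count += 1
            self._cv.notify_all()

    def wait_consistent(self, tag):
        """Consumer side: block until `tag` contributions have arrived."""
        with self._cv:
            self._cv.wait_for(lambda: self._count >= tag)
        return self.data
```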
All numerical experiments were performed on the NERSC supercomputer Cori, a Cray XC40 system powered by Intel Xeon Phi "Knights Landing" (KNL) compute nodes at 1.4 GHz, each with 68 cores, 4 hyper-threads per core, 96 GB DDR4 RAM, and 16 GB MCDRAM. The Cray XC40 nodes are connected using a Dragonfly Aries interconnect. Throughout the numerical experiments we use a fixed block size of 64 for the MATVECs and the LOBPCG eigensolver. We also use 1 MPI rank per node and OpenMP for on-node parallelism.

In a first experiment, we compare the different implementations of the MATVEC. Figure 4 shows the strong scaling results for the L = 24 and L = 26 MATVECs. In this figure, the dashed lines correspond to the state ordering of [27] with non-blocking MPI communication, and the dotted and solid lines correspond to the proposed METIS state reordering with non-blocking MPI and CSPACER communication, respectively.

Due to the large computation and communication imbalance, as shown in Figure 1, the dashed lines in Figure 4 show that the scalability of the MATVEC implementation using the original ordering of the states stops at 16 ranks. On the other hand, this figure clearly shows that using the proposed METIS reordering of the states allows the MATVEC to continue scaling at higher concurrency. Although the METIS reordering reduces both the total communication volume and the maximum communication volume per rank, …
Figure 4: Strong scaling of 1 MATVEC with block size 64. The dashed lines correspond to the state ordering used in [27], the dotted lines to the METIS ordering, and the solid lines to the METIS ordering + CSPACER.
As discussed in Section 3.1, the MBL problem exhibits multiple levels of concurrency and requires computing eigenvalues/eigenvectors from different spectral regions. Since eigenvalues in the middle of the spectrum are the hardest to compute, we focus on computing eigenvalues around the shift σ = 0.

In all remaining experiments, we use the matrix-free LOBPCG eigensolver with the METIS reordering for the MATVEC. The allreduce operations in the LOBPCG and PCG solvers use MPI, in contrast to the MATVEC, for which we compare different communication strategies. Because most of the MATVECs take place in the preconditioner, we also perform all PCG iterations in single precision and only the LOBPCG iterations in double precision. This mixed-precision approach turns out to have no effect on the overall eigenvalue accuracy or the total number of LOBPCG iterations for the MBL problem, yet it significantly reduces the wall clock time.

The L = 24 strong scaling behavior of the different communication strategies within the MATVEC is presented in Figure 5. Note that we only report the timings for 1 L = 24 LOBPCG iteration with 5,000 PCG iterations. In order to converge, the eigensolver needs a few tens of iterations. From Figure 5, we notice that the CSPACER variant outperforms the non-blocking MPI variant and that the difference grows with increasing concurrency. On the other hand, one-sided remote memory access (RMA) communication only performs well at low concurrency and even then is not competitive with the CSPACER variant. Therefore, due to the poor scalability of the RMA MPI variant, we will not consider this type of communication further.

The upper part of Figure 6 presents the strong scaling behavior for the L = 24 problem while using different thread counts per node. We notice that the need for full thread concurrency diminishes as we scale, to the extent that we start seeing performance degradation at high node concurrency. This behavior can be attributed to an increased overhead for managing thread pools, e.g., barrier synchronization for the amount of work assigned to the threads. We notice that the CSPACER variant suffers less from performance degradation than the MPI variant. We attribute that to the former using threading more efficiently in injecting and progressing multiple transfer lanes in the interconnect. The corresponding speedup factors for the CSPACER variant compared to the MPI variant are presented in Table 3. We notice that in almost all situations the CSPACER variant results in a significant speedup and, as also shown in Figures 5 and 6, the speedup factors increase, in general, with increasing concurrency.
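Returning to the mixed-precision point above, the idea is simply that the inner Jacobi-preconditioned CG runs entirely in float32 and only its result crosses back into the double-precision LOBPCG iteration. A minimal single-vector sketch follows; apply_H32 and the fixed iteration count are placeholders, not the authors' code.

```python
import numpy as np

def jacobi_pcg_float32(apply_H32, diag32, b, iters):
    """Fixed-iteration Jacobi-preconditioned CG on the folded operator, in float32.
    Intended as the LOBPCG preconditioner; only b and the result are float64."""
    b32 = b.astype(np.float32)
    x = np.zeros_like(b32)
    r = b32.copy()                       # r = b - A x with x = 0
    z = r / diag32                       # Jacobi (diagonal) preconditioning
    p = z.copy()
    rz = float(r @ z)
    for _ in range(iters):
        Hp = apply_H32(p)
        alpha = rz / float(p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        z = r / diag32
        rz_new = float(r @ z)
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x.astype(np.float64)          # hand back to double-precision LOBPCG
```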
Figure 5: Strong scaling of 1 L = 24 LOBPCG iteration with 5,000 PCG iterations in single precision (CSPACER, non-blocking MPI, and RMA MPI variants; 256 threads per node).
The lower part of Figure 6 shows the decomposition of the execution time for the best performing threading configuration for both the MPI and CSPACER variants. As shown, the MATVEC (blue box) accounts for a significant fraction of the execution time, especially at low concurrency. As we strong scale the computation, the allreduce operations (red box) start contributing significantly to the execution time. Because the number of elements in the reduction remains constant, we expect the overhead of the reduction to increase with the number of nodes. Instead, we noticed a strong correlation between the variability of the MATVEC and the allreduce phases. Given that these two phases are executed consecutively, we conducted an experiment with an extra barrier between them and found that the barrier captured most of the variability. Due to its unnecessary overhead, we omitted this extra barrier synchronization in the presented results.

In general, we note that there are multiple sources of performance variability across nodes in our code. The first is the computational load imbalance originating from the imperfect partitioning; the second is system noise through the shared interconnect; the third is the communication runtime.
Table 3: L = 24 speedups for the CSPACER variant compared to non-blocking MPI communication, for 256, 128, and 64 threads per node.
Figure 6: Strong scaling of 1 L = 24 LOBPCG iteration with 5,000 PCG iterations in single precision. Upper panels: total time for the non-blocking MPI and CSPACER variants with 64, 128, and 256 threads per node; lower panels: decomposition of the execution time into total, MATVEC, allreduce, and other.

The MATVEC overlap of computation with communication makes it difficult to isolate the impact of communication from computation imbalance. Having two runtime implementations allows for better classifying the sources of variability. For instance, the lower variability of the CSPACER variant compared to the MPI variant sheds some light on the minimum variability due to the MPI runtime. In our experiments, we found the inter-quartile range of the MATVEC variability for MPI compared to CSPACER to be between 1× and 2× at low concurrency, reaching at least 2× at 1024 nodes. We also consider the variability of the CSPACER variant as an upper limit on the computation load imbalance.

Figure 7 shows the time decomposition and scaling behavior with various thread concurrencies for the L = 26 problem. The corresponding speedup factors are given in Table 4. While for L = 24 we noticed a performance advantage for the CSPACER variant across all concurrency levels, for L = 26 the performance advantage starts at 128 nodes. This behavior is somewhat expected because the L = 26 problem is associated with a larger volume of allreduce variabilities.
Figure 7: Strong scaling of 1 L = 26 LOBPCG iteration with 20,000 PCG iterations in single precision. Upper panels: total time (in minutes) for the non-blocking MPI and CSPACER variants with 64, 128, and 256 threads per node; lower panels: decomposition into total, MATVEC, allreduce, and other.
Table 4: L = 26 speedups for the CSPACER variant compared to non-blocking MPI communication, for 256, 128, and 64 threads per node.

We notice that the optimal threading is problem dependent and is likely to change with the underlying system. For L = 24, the advantage of reducing the thread concurrency manifests at 512 nodes; for L = 26, we need to change the thread level at 1024 nodes. Currently, we rely on empirical measurements to identify the best configuration. Leveraging the iterative nature of the algorithm, we could dedicate a few iterations to finding the optimal configuration. Ideally, such a change would be made automatically. We also notice that the optimal thread choice depends on the communication runtime. For both presented problem configurations, MPI tends to require switching to lower thread concurrency at a lower node count than the CSPACER variant. The implementation of the latter leverages threads to accelerate communication progress, which is more advantageous when the number of neighboring ranks increases.

Finally, we compare against the overall wall clock time for the L = 26 problems reported in [27, Table 3]. Using the combination of (i) graph partitioning for reducing the communication volume, (ii) runtime optimization via CSPACER, and (iii) performing the MATVECs in the preconditioner only in single precision, we have been able to significantly increase the scalability of the matrix-free LOBPCG eigensolver, up to 1024 nodes. All these techniques together resulted in a speedup factor of 10×, so that the overall wall clock time for computing eigenvalues/eigenvectors in the middle of the spectrum (30 LOBPCG iterations with 20,000 PCG iterations as preconditioner) was reduced from more than 1 day to only 2.5 hours.

We have presented several strategies to significantly reduce the computation and communication imbalance within the matrix-free LOBPCG eigensolver for computing many eigenvalues and corresponding eigenvectors of large spin Hamiltonians. Using graph partitioning for reordering the states reduces both the total communication volume and the maximum communication volume per rank and enhances the scalability of the matrix-free eigensolver. Combining it with communication performance optimization using CSPACER, the Consistent SPACE Runtime, we have been able to scale the LOBPCG eigensolver up to 512 and 1024 nodes for L = 24 and 26 spins, respectively. The numerical experiments have illustrated that the proposed techniques of graph partitioning, runtime optimization, and mixed precision arithmetic reduce the overall wall clock time for the L = 26 problem, reported in [27], by a factor of 10. Because the MBL study requires solving eigenvalue problems for many instances of Hamiltonians with random disorder terms, and computing eigenvalues from different regions of the spectrum, the overall computation can scale to hundreds of thousands of computational cores.

Acknowledgements
This work is partially supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program, and by the Center for Novel Pathways to Quantum Coherence in Materials, an Energy Frontier Research Center funded by the US Department of Energy, Director, Office of Science, Office of Basic Energy Sciences under Contract No. DE-AC02-05CH11231. DKM was supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate (NDSEG) Fellowship Program.

This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
References

[1] D. A. Abanin, J. H. Bardarson, G. De Tomasi, S. Gopalakrishnan, V. Khemani, S. A. Parameswaran, F. Pollmann, A. C. Potter, M. Serbyn, and R. Vasseur, Distinguishing localization from chaos: challenges in finite-size systems, 2019, https://arxiv.org/abs/1911.04501.
[2] P. W. Anderson, Absence of diffusion in certain random lattices, Phys. Rev., 109 (1958), pp. 1492–1505, https://doi.org/10.1103/PhysRev.109.1492.
[3] B. Bauer and C. Nayak, Area laws in a many-body localized state and its implications for topological order, J. Stat. Mech. Theory Exp., 2013 (2013), p. P09005, https://doi.org/10.1088/1742-5468/2013/09/p09005.
[4] P. Bordia, H. Lüschen, S. Scherg, S. Gopalakrishnan, M. Knap, U. Schneider, and I. Bloch, Probing slow relaxation and many-body localization in two-dimensional quasiperiodic systems, Phys. Rev. X, 7 (2017), p. 41047, https://doi.org/10.1103/PhysRevX.7.041047.
[5] J.-y. Choi, S. Hild, J. Zeiher, P. Schauß, A. Rubio-Abadal, T. Yefsah, V. Khemani, D. A. Huse, I. Bloch, and C. Gross, Exploring the many-body localization transition in two dimensions, Science, 352 (2016), pp. 1547–1552, https://doi.org/10.1126/science.aaf8834.
[6] E. Cuevas, M. Feigel'man, L. Ioffe, and M. Mezard, Level statistics of disordered spin-1/2 systems and materials with localized Cooper pairs, Nat. Commun., 3 (2012), p. 1128, https://doi.org/10.1038/ncomms2115.
[7] W. De Roeck, F. Huveneers, M. Müller, and M. Schiulaz, Absence of many-body mobility edges, Phys. Rev. B, 93 (2016), p. 14203, https://doi.org/10.1103/PhysRevB.93.014203.
[8] J. A. Duersch, M. Shao, C. Yang, and M. Gu, A robust and efficient implementation of LOBPCG, SIAM J. Sci. Comput., 40 (2018), pp. C655–C676, https://doi.org/10.1137/17M1129830.
[9] P. Ghysels, X. S. Li, C. Gorman, and F.-H. Rouet, A robust parallel preconditioner for indefinite systems using hierarchical matrices and randomized sampling, in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Los Alamitos, CA, USA, 2017, IEEE Computer Society, pp. 897–906, https://doi.org/10.1109/IPDPS.2017.21.
[10] T. Hilbrich, M. Weber, J. Protze, B. R. de Supinski, and W. E. Nagel, Runtime correctness analysis of MPI-3 nonblocking collectives, in Proceedings of the 23rd European MPI Users' Group Meeting, EuroMPI 2016, New York, NY, USA, 2016, Association for Computing Machinery, pp. 188–197, https://doi.org/10.1145/2966884.2966906.
[11] T. Hoefler, J. Dinan, R. Thakur, B. Barrett, P. Balaji, W. Gropp, and K. Underwood, Remote memory access programming in MPI-3, ACM Trans. Parallel Comput., 2 (2015), https://doi.org/10.1145/2780584.
[12] K. Z. Ibrahim, Optimizing breadth-first search at scale using hardware-accelerated space consistency, in 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), Los Alamitos, CA, USA, 2019, IEEE Computer Society, pp. 23–33.
[13] S. Johri, R. Nandkishore, and R. N. Bhatt, Many-body localization in imperfectly isolated quantum systems, Phys. Rev. Lett., 114 (2015), p. 117401, https://doi.org/10.1103/PhysRevLett.114.117401.
[14] G. Karypis, METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices. Version 5.1.0, tech. report, University of Minnesota, 2013.
[15] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J. Sci. Comput., 20 (1998), pp. 359–392, https://doi.org/10.1137/S1064827595287997.
[16] A. V. Knyazev, Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method, SIAM J. Sci. Comput., 23 (2001), pp. 517–541, https://doi.org/10.1137/S1064827500366124.
[17] T. Kohlert, S. Scherg, X. Li, H. P. Lüschen, S. Das Sarma, I. Bloch, and M. Aidelsburger, Observation of many-body localization in a one-dimensional system with a single-particle mobility edge, Phys. Rev. Lett., 122 (2019), p. 170403, https://doi.org/10.1103/PhysRevLett.122.170403.
[18] D. J. Luitz, N. Laflorencie, and F. Alet, Many-body localization edge in the random-field Heisenberg chain, Phys. Rev. B, 91 (2015), p. 81103, https://doi.org/10.1103/PhysRevB.91.081103.
[19] A. Lukin, M. Rispoli, R. Schittko, M. E. Tai, A. M. Kaufman, S. Choi, V. Khemani, J. Léonard, and M. Greiner, Probing entanglement in a many-body-localized system, Science, 364 (2019), pp. 256–260, https://doi.org/10.1126/science.aau0818.
[20] V. Oganesyan and D. A. Huse, Localization of interacting fermions at high temperature, Phys. Rev. B, 75 (2007), p. 155111, https://doi.org/10.1103/PhysRevB.75.155111.
[21] A. Pal and D. A. Huse, Many-body localization phase transition, Phys. Rev. B, 82 (2010), p. 174411, https://doi.org/10.1103/PhysRevB.82.174411.
[22] F. Pietracaprina, N. Macé, D. J. Luitz, and F. Alet, Shift-invert diagonalization of large many-body localizing spin chains, SciPost Phys., 5 (2018), p. 45, https://doi.org/10.21468/SciPostPhys.5.5.045.
[23] M. Schreiber, S. S. Hodgman, P. Bordia, H. P. Lüschen, M. H. Fischer, R. Vosk, E. Altman, U. Schneider, and I. Bloch, Observation of many-body localization of interacting fermions in a quasirandom optical lattice, Science, 349 (2015), pp. 842–845, https://doi.org/10.1126/science.aaa7432.
[24] J. Smith, A. Lee, P. Richerme, B. Neyenhuis, P. W. Hess, P. Hauke, M. Heyl, D. A. Huse, and C. Monroe, Many-body localization in a quantum simulator with programmable random disorder, Nat. Phys., 12 (2016), p. 907, https://doi.org/10.1038/nphys3783.
[25] B. L. Software, CSPACER: Consistent SPACE Runtime, Sept. 2020, https://bitbucket.org/kibrahim/cspacer_xc/.
[26] J. Šuntajs, J. Bonča, T. Prosen, and L. Vidmar, Quantum chaos challenges many-body localization, 2019, https://arxiv.org/abs/1905.06345.
[27] R. Van Beeumen, G. D. Kahanamoku-Meyer, N. Y. Yao, and C. Yang, A scalable matrix-free iterative eigensolver for studying many-body localization, in Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPCAsia2020, New York, NY, USA, 2020, Association for Computing Machinery, pp. 179–187, https://doi.org/10.1145/3368474.3368497.
[28] L.-W. Wang and A. Zunger, Solving Schrödinger's equation around a desired energy: Application to silicon quantum dots, J. Chem. Phys., 100 (1994), pp. 2394–2397, https://doi.org/10.1063/1.466486.