[PDF] Solving the Parquet Equations for the Hubbard Model beyond Weak Coupling

Abstract

We find that imposing the crossing symmetry in the iteration process considerably extends the range of convergence for solutions of the parquet equations for the Hubbard model. When the crossing symmetry is not imposed, the convergence of both simple iteration and more complicated continuous loading (homotopy) methods are limited to high temperatures and weak interactions. We modify the algorithm to impose the crossing symmetry without increasing the computational complexity. We also imposed time reversal and a subset of the point group symmetries, but they did not further improve the convergence. We elaborate the details of the latency hiding scheme which can significantly improve the performance in the computational implementation. With these modifications, stable solutions for the parquet equations can be obtained by iteration more quickly even for values of the interaction that are a significant fraction of the bandwidth and for temperatures that are much smaller than the bandwidth. This may represent a crucial step towards the solution of two-particle field theories for correlated electron models.

Full PDF

aa r X i v : . [ c ond - m a t . s t r- e l ] S e p Solving the Parquet Equations for the Hubbard Model beyond Weak Coupling

Ka-Ming Tam, H. Fotso, S.-X. Yang, Tae-Woo Lee, J. Moreno,

1, 2

J. Ramanujam,

3, 2 and M. Jarrell

1, 2 Department of Physics and Astronomy, Louisiana State University, Baton Rouge, LA 70803 Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803 Department of Electrical and Computer Engineering,Louisiana State University, Baton Rouge, LA 70803 (Dated: September 18, 2018)We ﬁnd that imposing the crossing symmetry in the iteration process considerably extends therange of convergence for solutions of the parquet equations for the Hubbard model. When thecrossing symmetry is not imposed, the convergence of both simple iteration and more complicatedcontinuous loading (homotopy) methods are limited to high temperatures and weak interactions.We modify the algorithm to impose the crossing symmetry without increasing the computationalcomplexity. We also imposed time reversal and a subset of the point group symmetries, but theydid not further improve the convergence. We elaborate the details of the latency hiding schemewhich can signiﬁcantly improve the performance in the computational implementation. With thesemodiﬁcations, stable solutions for the parquet equations can be obtained by iteration more quicklyeven for values of the interaction that are a signiﬁcant fraction of the bandwidth and for temperaturesthat are much smaller than the bandwidth. This may represent a crucial step towards the solutionof two-particle ﬁeld theories for correlated electron models.

PACS numbers: 71.10.Fd,71.10.-w,71.27.+a,71.30.+h

I. INTRODUCTION

A natural step to extend most of the existing many-body single-particle self-consistent methods is to includethe full momentum and energy dependence of the ver-tex corrections. Historically, the self-consistent approachfor vertex corrections was ﬁrst considered by Landau,Abrikosov and Khalatnikov in the context of the highenergy behavior of quantum electrodynamics. The origi-nal goal was to develop a non-perturbative method whichencodes the information in terms of a system of closedintegral equations. The parquet equations, in principle,provide a framework for self-consistent determination ofthe self-energy and the vertex corrections. They wereproposed for both boson-boson scattering and fermion-fermion scattering during the 1950’s.

Methods similarto the parquet equations were ﬁrst introduced in the con-text of many-body theory by de Dominics and Martin.

One of the early practical applications was on the x-rayabsorption and emission problem by Roulet, Gavoret,and Nozi`eres. Since then, various problems have beenstudied by the parquet summation approach, most no-tably, the Fermi liquid in a strong magnetic ﬁeld, thedisordered electron gas in a strong transverse magneticﬁeld, the Anderson impurity model, random po-tential problems, the Hubbard model, Helium-4,

Helium-3, local moment formation, the vor-tex liquid model, the matrix models, and nu-clear structure calculations. While these applications ofparquet formulation provide a lot of important insights,most of the calculations are based on various approxi-mated forms of the parquet equations.It is obvious that going from a one particle to a two-particle self-consistent calculation represents a signiﬁcantincrease in the computational eﬀort, as each two-particle vertex contains three independent momentum and fre-quency indices. From the point of view of practical cal-culation, the number of elements for each index is arounda few thousands. Therefore, the number of elements forthe vertices are around tens of millions to a few billions.Moreover, all the information is encoded in integral equa-tions with complicated structure, in which simpliﬁcationdoes not seem to be immediately possible. Indeed, in thepast, the most successful application using the full par-quet equations was largely limited to the single Ander-son impurity model. With recent advances in compu-tational infrastructure where peta-scale performance hasbecome available, calculations for lattice models, such asthe Hubbard model are now feasible. For example, thesolution of the parquet equations for a 4 × U = 2 t and temperature T = 0 . t was recently obtained. However, limitations on computer performance andstorage are apparently not the sole obstacles for obtain-ing the solution of the parquet equations. Another majorbarrier is the stability of the solvers. The simple itera-tion method, which is widely adapted for the dynamicalmean ﬁeld method, often fails to provide a stable solu-tion for the parquet equations. In most cases, a damp-ing scheme has to be employed. Even with the dampingscheme, when the temperature is low or the coupling islarge, ﬁnding a stable solution still seems to be ratherdiﬃcult. Given the large number of variables and the complex-ity of the parquet equations, instabilities in their solutionmay not be unexpected. Methods based on the local gra-dient are not likely to be suitable as the Hessian cannotbe readily calculated. Most of the non-linear solvers onlyhave local convergence properties. This may not posea problem if we have a reasonable guess which is closeenough to the true solution. Unfortunately, it is not easyto obtain a good initial guess for the parquet equations.Methods that in principle allow “global” convergence, forexample, the homotopy method or continuous loadingmethod, have been proposed as possible ways to improvethe calculations. While these tend to improve conver-gence, many steps are required for the solution to movealong the homotopy path. Moreover, practical experi-ence seems to suggest that convergence may still not beachieved when the temperature is low or the couplingis strong. It is clear that a better solver is deﬁnitely re-quired for the practical application of the parquet methodwithin the context of the strongly correlated systems.One of the most prominent diﬀerences between the par-quet formulation and most of the other approximationschemes such as RPA, self-consistent spin ﬂuctua-tions approach, and ﬂuctuation exchange approach is that the so-called crossing symmetry is obeyed byconstruction of the parquet equations. The crossingsymmetry implies that a vertex in one channel can alsoproduce the vertex in all other channels by pulling orcrossing the vertex legs and multiplying by appropriateconstants. It also implies the Pauli exclusion principleis automatically satisﬁed. However, in the course of theiteration process, as long as the exact solution of the par-quet equations is not obtained, the crossing symmetry isviolated. The main point in the present paper is to high-light that the crossing symmetry is crucial for obtaininga stable solution. We devise a modiﬁed iteration schemewhich can obtain a stable solution for the parquet equa-tions at lower temperature and stronger coupling thanthat from the previous schemes. This is achieved pri-marily by restoring the crossing symmetry at each stepof the iteration.It is important to notice that because of the largenumber of vertex functions, for production runs, mas-sively parallel machines are absolutely necessary. Sincethe vertex functions in diﬀerent channels are mixed inthe parquet formulation, an eﬃcient scheme to transformthe vertex functions storage in diﬀerent nodes is criti-cal to improving the overall eﬃciency of the calculations.Some of the computational details have been explainedin the previous publication. We have further improvedthe scheme which allows us to hide the communicationlatency across diﬀerent nodes behind the local calcula-tions within the nodes, which eﬀectively further speedsup the calculations.The model used for testing the computational schemefor solving the parquet equations is the standard Hub-bard model at half-ﬁlling. The Hamiltonian is H = − t X ( c † i,σ c j,σ + H.c. ) + U X i n i, ↑ n i, ↓ , (1)where U is the on-site repulsion and t = 1 is the hoppingmatrix which establishes a unit of energy.The paper is organized as follows. In Section II, wereproduce the parquet equation which also serves to ﬁxthe notation. In Section III, we describe the iteration scheme for solving the parquet equations. In Section IV,we discuss the violation of crossing symmetry. In SectionV, we present a modiﬁed iteration scheme which explic-itly restores the crossing symmetry. We also discuss thelimitation of the modiﬁed scheme and the possible direc-tions for further developments. In Section VI, we presentthe leading eigenvalues of the antiferromagnetic channelas a function of the temperature and coupling strengthand ﬁnd that the parameter region of stable solutionsis greatly increased when the crossing symmetry is en-forced. A brief summary is contained in Section VII. Inthe Appendix, we present a latency hiding scheme whichallows substantial increase of the eﬃciency for solving theparquet equations. II. PARQUET EQUATIONS

In order for the current paper to be reasonably self-contained, we provide in this section a brief descriptionof the parquet equations, largely following Ref. 36. Thepurpose is to highlight the structure of the equations andto deﬁne the notations that will be used in the subsequentsections. We do not intend to provide a derivation of theformulation. For this we refer the readers to the literaturewhere ample discussions can be found.

Standard perturbative expansions attempt to describeall the scattering processes, at the lowest orders, assingle- or two-particle Feynman diagrams. In the single-particle formulation the self-energy describes the many-body processes that renormalize the motion of a parti-cle in the interacting background of all the other parti-cles. In the two-particle context, one is able to probe theinteractions between particles using the so-called vertexfunctions, which are matrices describing the two particlescattering processes. For example, the reducible (full)two-particle vertex F ph (12; 34) describes the amplitudeof a particle-hole pair scattered from its initial state | , i into the ﬁnal state | , i . Here, i = 1 , , , k i , theMatsubara frequency iω n i and, if needed, the spin σ i andband index m i . Since the total momentum and energyof the vertex are conserved, it is convenient to adapt thenotation F ph (2 − , for the numerical implementationof the single band Hubbard model. In general, depending on how particles or holes areinvolved in the scattering processes, one can deﬁne threediﬀerent two-particle scattering channels. These are theparticle-hole (p-h) horizontal channel, the p-h verticalchannel and the particle-particle (p-p) channel.One can further discriminate the vertices according totheir topology. Starting from the reducible vertex F in-troduced above, we may deﬁne the irreducible vertex Γcorresponding to the subclass of diagrams in F that can-not be separated into two parts by cutting two horizontalGreen function lines. Similarly, the fully irreducible ver-tex Λ corresponds to the subclass of diagrams in Γ thatcannot be split into two parts by cutting two Green func-tion lines in any channel.Furthermore, since we are mostly interested in mod-els that preserve the SU (2) spin rotation symmetry, andsince this is an exact symmetry for our two-dimensionalcalculations at non-zero temperature, it is convenient topreserve this symmetry. This is accomplished by de-composing the vertices in the so-called spin-diagonalizedrepresentation. In this representation, the spin de-grees of freedom decompose the particle-hole channel intothe density and the magnetic channels, and the particle-particle channel into the spin singlet and the spin tripletchannels which we denote as d -channel, m -channel, s -channel, and t -channel respectively. They are deﬁned asfollows, Γ d = Γ P H ↑↑ ; ↑↑ + Γ P H ↑↑ ; ↓↓ , (2)Γ m = Γ P H ↑↑ ; ↑↑ − Γ P H ↑↑ ; ↓↓ , (3)Γ s = Γ P P ↑↓ ; ↑↓ − Γ P P ↑↓ ; ↓↑ , (4)Γ t = Γ P P ↑↓ ; ↑↓ + Γ P P ↑↓ ; ↓↑ , (5)and similarly for F and Λ.We reproduce the full set of equations for the parquetformulation in the spin diagonalized representation in thefollowing. The Schwinger-Dyson equation isΣ( P ) = − U T N X P ′ ,Q { G ( P ′ ) G ( P ′ + Q ) G ( P − Q )( F d ( Q ) P − Q,P ′ − F m ( Q ) P − Q,P ′ )+ G ( − P ′ ) G ( P ′ + Q ) G ( − P + Q )( F s ( Q ) P − Q,P ′ + F t ( Q ) P − Q,P ′ ) } , (6)where G is the single-particle Green function, which itselfcan be calculated from the self-energy using the Dysonequation, G − ( P ) = G − ( P ) − Σ( P ) , (7)where G is the bare Green function. Here, the indices P , P ′ and Q combine momentum k and Matsubara fre-quency iω n , i.e. P = ( k , iω n ).The reducible and the irreducible vertices in a givenchannel are related by the Bethe-Salpeter equation, F r ( Q ) P,P ′ = Γ r ( Q ) P,P ′ + Φ r ( Q ) P,P ′ , (8) F r ′ ( Q ) P,P ′ = Γ r ′ ( Q ) P,P ′ + Ψ r ′ ( Q ) P,P ′ , (9)where r = d or m for the density and magnetic chan-nels and r ′ = s or t for the spin singlet and spin triplet channels. The vertex ladders are deﬁned asΦ r ( Q ) P,P ′ ≡ (10) X P ′′ F r ( Q ) P,P ′′ χ ph ( Q ) P ′′ Γ r ( Q ) P ′′ ,P ′ , Ψ r ′ ( Q ) P,P ′ ≡ (11) X P ′′ F r ′ ( Q ) P,P ′′ χ pp ( Q ) P ′′ Γ r ′ ( Q ) P ′′ ,P ′ , where χ is the product of two single-particle Green func-tions.The parquet equations in the spin diagonalized repre-sentation areΓ d ( Q ) P P ′ = Λ d ( Q ) P P ′ −

12 Φ d ( P ′ − P ) P,P + Q −

32 Φ m ( P ′ − P ) P,P + Q (12)+ 12 Ψ s ( P + P ′ + Q ) − P − Q, − P + 32 Ψ t ( P + P ′ + Q ) − P − Q, − P , Γ m ( Q ) P P ′ = Λ m ( Q ) P P ′ −

12 Φ d ( P ′ − P ) P,P + Q + 12 Φ m ( P ′ − P ) P,P + Q (13) −

12 Ψ s ( P + P ′ + Q ) − P − Q, − P + 12 Ψ t ( P + P ′ + Q ) − P − Q, − P , Γ s ( Q ) P P ′ = Λ s ( Q ) P P ′ + 12 Φ d ( P ′ − P ) − P ′ ,P + Q −

32 Φ m ( P ′ − P ) − P ′ ,P + Q (14)+ 12 Φ d ( P + P ′ + Q ) − P ′ , − P −

32 Φ m ( P + P ′ + Q ) − P ′ , − P , Γ t ( Q ) P P ′ = Λ t ( Q ) P P ′ + 12 Φ d ( P ′ − P ) − P ′ ,P + Q + 12 Φ m ( P ′ − P ) − P ′ ,P + Q (15) −

12 Φ d ( P + P ′ + Q ) − P ′ , − P −

12 Φ m ( P + P ′ + Q ) − P ′ , − P . It is important to note at this point that if we substi-tute the irreducible vertices Γ (Eqs. 12,13,14, and 15) intothe Bethe-Salpeter equation (Eqs. 8 and 9) the crossingsymmetry in the full vertex F is automatically satisﬁedregardless of the numerical values of the vertex ladders Φ and Ψ, assuming the fully irreducible vertices, Λ, arecrossing symmetric. We write all the full vertices explic-itly in the following using only the vertex ladders, Φ, Ψ,and the fully irreducible vertices, Λ. F d ( Q ) P,P ′ = Λ d ( Q ) P P ′ −

12 Φ d ( P ′ − P ) P,P + Q −

32 Φ m ( P ′ − P ) P,P + Q (16)+ 12 Ψ s ( P + P ′ + Q ) − P − Q, − P + 32 Ψ t ( P + P ′ + Q ) − P − Q, − P + Φ d ( Q ) P,P ′ ; F m ( Q ) P,P ′ = Λ m ( Q ) P P ′ −

12 Φ d ( P ′ − P ) P,P + Q + 12 Φ m ( P ′ − P ) P,P + Q (17) −

12 Ψ s ( P + P ′ + Q ) − P − Q, − P + 12 Ψ t ( P + P ′ + Q ) − P − Q, − P + Φ m ( Q ) P,P ′ ; F s ( Q ) P,P ′ = Λ s ( Q ) P P ′ + 12 Φ d ( P ′ − P ) − P ′ ,P + Q −

32 Φ m ( P ′ − P ) − P ′ ,P + Q (18)+ 12 Φ d ( P + P ′ + Q ) − P ′ , − P −

32 Φ m ( P + P ′ + Q ) − P ′ , − P + Ψ s ( Q ) P,P ′ ; F t ( Q ) P,P ′ = Λ t ( Q ) P P ′ + 12 Φ d ( P ′ − P ) − P ′ ,P + Q + 12 Φ m ( P ′ − P ) − P ′ ,P + Q (19) −

12 Φ d ( P + P ′ + Q ) − P ′ , − P −

12 Φ m ( P + P ′ + Q ) − P ′ , − P + Ψ t ( Q ) P,P ′ . These relations allow us to restore the crossing symmetryfor the full vertices without heavy computational over-head.The prominent technical problem at hand is whether ornot we can solve this set of equations eﬃciently withoutresorting to any approximated scheme. An obvious diﬃ-culty is to handle the large number of variables. On go-ing from the one-particle level calculation to two-particlelevel calculation, the number of variables which has to bemonitored grows as the third power of the linear dimen-sion of the system. If N t is the number of lattice sitestimes the number of discrete Matsubara frequencies, i.e., N t = N k × N ω , the largest N t that can be handled is inthe range 2000 − , i.e., the number of variables can be over one billion. One can immediately see that practicalcalculations for reasonably large system sizes pose a seri-ous problem, although not insurmountable with moderncomputational facilities where a large number of com-puter nodes are accessible. However, in addition to thelarge number of computations associated with solving theparquet equations, they also require a complex commu-nication pattern between the diﬀerent processes as wediscussed in more detail in the Appendix. Moreover, anumerical instability is not unexpected, especially whenthe system in the proximity of a phase transition. FIG. 1: (Color online) Flow diagram of the algorithm for solv-ing the parquet equations. See the text for the description ofeach step. The major computational bottleneck is in the self-consistent loop of step 2. The cross channel rotations of thevertex ladders from the form required by the Bethe-Salpeterequations to that in the parquet equations require expensivecommunications across diﬀerent nodes in the parallel imple-mentation.

III. NUMERICAL IMPLEMENTATION

The parquet formulation consists of two sets of equa-tions. The ﬁrst set, made of the parquet equations andthe Bethe-Salpeter equation, determines the full vertex F and the irreducible vertex Γ given the one-particleself-energy Σ and the fully irreducible vertex Λ as theinputs. The second set of equations determine the one-particle quantities given the full vertex F ; it includes theSchwinger-Dyson equation and the Dyson equation.Since the method is iteration based, the initial guess iscrucial for obtaining a converged solution. In principle,the initial guess can be approximated, for example, bysecond order perturbation. However, in practice, this isnot the optimal choice, especially when the self-energyfrom the second order perturbation is small. In this casethe Green function will quickly destabilize the calcula-tion. This may relate to the fact that the damping fromthe imaginary part of the self-energy is quickly reduced.Since we are supposing that we know the fully irreduciblevertex Λ , a practical scheme is to choose the irreducible,Γ and full vertices, F , equal to Λ, and a large value (afew times of the bandwidth) for the imaginary part ofthe self-energy.Fig. 1 is an illustration of the ﬂow diagram of the algo-rithm, where the fully irreducible vertices are the initialinput for the calculation. The algorithm can be describedas the following:1. Set the initial conditions for the irreducible and fullvertices, and the self-energy. 2. Update the Green’s functions and calculate thebare susceptibility, χ , which is given by the product oftwo Green’s functions. Solve the parquet and the Bethe-Salpeter equations for the irreducible vertices, Γ. Simpleiteration is used until the convergence criteria are metfor the irreducible vertices.This completes the update of the vertices. The nextstep is to use the irreducible vertices obtained from theparquet equations to construct the full vertices.3. Solve the Bethe-Salpeter equation to obtain the fullvertices, F , using the irreducible vertices from the previ-ous step; this is executed exactly by calling the LAPACKroutines for the inverse of the matrices. With the full vertices obtained, we can update the self-energy.4. Solve the Dyson-Schwinger equation to obtain theself-energy from the full vertices. Simple iteration is useduntil certain convergence criteria are met for the self-energy.5. Solve the Dyson equation for the fully dressed Greenfunction from the self-energy.This completes the iteration loop, and the procedureis repeated from step 2, until convergence is reached forboth the self-energy and the irreducible vertices.In practice, step 2 which attempts to obtain the irre-ducible vertices needs to be iterated for a few times to geta reasonable convergence, even in the case where the cou-pling is weak and the temperature is high. On the otherhand, step 4 which attempts to solve the self-energy fromthe updated full vertices is not iterated more than onetime at each loop, so as to avoid instability (we deﬁneinstability here as the failure to obtain a converged solu-tion from the iterative solver). Attempting to solve theself-energy at the early stage of the iteration procedurewhere the full vertices are not well converged can gener-ally lead to instability. Although in the present paper,we only focus on the Hubbard model, instabilities in theiteration process have also been observed in solving theparquet equations for nuclear structure calculations. A widely used method to avoid the instability in theiterative process is to introduce a damping factor in theupdates of the variables. The updates are modiﬁed asfollows. Σ = ( α )Σ new + (1 − α )Σ old ; (20)Γ = ( α )Γ new + (1 − α )Γ old . (21)With this damping scheme, the solution for the half-ﬁlledHubbard model on a 4 × U = 2 t and temperature, T = 0 . t . How-ever, in the strong coupling regime, obtaining a stablesolution still seems to be diﬃcult, even though a ratherheavy damping is employed.We will demonstrate the instability problem of the sim-ple iterative process by monitoring the leading eigenval-ues λ deﬁned as λ r φ r = Γ r ( P, P ′ ) G ( P ′ ) G ( P ′ + ( π, π )) φ r (22)for r = d and m ; similarly λ r ′ φ r ′ = ( − / r ′ ( P, P ′ ) G ( − P ′ ) G ( P ′ ) φ r ′ (23)for r ′ = s and t . In principle, these leading eigenvaluessignal a phase transition by going through 1, expressingthe divergence of the susceptibilities in the correspondingchannel.In Fig. 3, we plot the leading eigenvalues of the den-sity, magnetic, spin singlet, and spin triplet channels asa function of the number of iteration steps calculatedwith this simple iteration method (SI). The calculationis done on a 2 × T = 0 . t ; the damping parameter is α = 0 .

1. Aconverged solution is obtained for U = 2 t ; however, for U = 4 t and 6 t the iterative solutions diverge. Chang-ing the damping or the initial self-energy does not helpin obtaining a converged solution for the larger values of U . These are illustrative examples which show the prob-lem of using the simple iteration method for solving theparquet equations. For weak coupling and not too lowtemperature, converged solution can be obtained. Be-yond weak coupling the iteration becomes divergent. A. Continuous loading method

A widely used method to alleviate the divergence inthe non-linear solver is the so-called continuous loadingor homotopy method. The basis of the continuous load-ing method is to construct an auxiliary equation with atuning parameter ν , so that its solution is trivial for ν = 0but the solution of the original equation is recovered for ν = 1. Symbolically we can write down the set of equa-tions of the parquet formulation as f parquet (Σ , Γ) = 0,where f parquet is a large vector. We deﬁne the auxiliaryfunction as g = ν f parquet + (1 − ν ) f , where f is a func-tion with a trivial solution. In our study we choose f asa vector containing all the elements of Γ − Γ and Σ − Σ ,where Γ and Σ are the initial guesses for the irreduciblevertices and the self-energy respectively. The iterationmethod is used to solve the function g ( ν ), instead of the f parquet . One can readily see that the solution of g (0) istrivial while the solution of the f parquet is recovered when ν = 1. The idea is to ﬁnd a converged solution for g ( ν )with a small enough ν where a converged solution canbe readily obtained, and then gradually increase ν untilit goes to 1. Therefore, a series of ν values are needed,which we denote as ν i .We plot the leading eigenvalues from the continuousloading method in Fig. 4. The values of ν used are ν =0 . , . , . , . , . , . , . , . , . , . , . , . , .

0. 100 and 200 iterations are performed for ν ≤ .

993 and ν ≥ .

996 respectively. For U = 2 t ,converged solution is obtained for ν = 1. However, for U = 4 t and 6 t , the iterative solution diverges. Thedivergences appear before the homotopy parameter ispushed to ν = 1. These examples illustrate the genericbehavior when solving the parquet equations by the = L P + QP ′ + Q − P ′ − P R P + QP ′ + Q − P − P ′ − L P + QP ′ + QP ′ P R P ′ P ′ + QP P + Q L P + QP ′ + QP ′ P R P ′ P ′ + QP + Q P − ==CS 1 & 2CS 3 & 4CS 5 & 6 FIG. 2: (Color online) Diagrammatic representation of thesix crossing symmetry operations. Note that spin indices arehidden for the purpose of clarity. For the ﬁrst two operations(CS 1 & 2), we exchange the lower two external legs. Forthird and fourth operations (CS 3 & 4), we exchange the lowertwo legs and then the right two legs. The ﬁrst four crossingsymmetry relationships (CS 1 – 4) relate the particle-particlevertices with the particle-hole vertices. The last two (CS 5 &6) are for the particle-hole vertices only, where we exchangethe lower left and upper right legs. simple iteration method beyond weak coupling. Theyalso illustrate that the continuous loading method maynot be suﬃciently robust to solve the problem. Thedamping factor α used in these calculations is 0 . IV. CROSSING SYMMETRY VIOLATION

The exact solution of the parquet equations automat-ically satisﬁes the crossing symmetry. It is one of themost important diﬀerences between the parquet formu-lation and most other perturbative methods. However,within the iteration scheme presented in the last section,the crossing symmetry is not fulﬁlled unless the itera-tion converges to an exact solution. From the above sec-tion, we clearly ﬁnd that the iteration method, even withthe help of the continuous loading scheme, is not robustenough to obtain a converged solution beyond weak cou-pling. It is desirable to quantify the violation of crossingsymmetry. The following six equalities are the conse-quence of the crossing symmetry (see Fig. 2 for a di-agrammatic representation of these crossing symmetry(CS) operations). L ≡ F t ( Q ) P,P ′ = (24)[ − (1 / F m − (1 / F d ]( P + P ′ + Q ) − P ′ , − P ≡ R ,L ≡ F s ( Q ) P,P ′ = (25)[ − (3 / F m + (1 / F d ]( P + P ′ + Q ) − P ′ , − P ≡ R ,L ≡ F m ( Q ) P,P ′ = (26)[(1 / F t − (1 / F t ]( P + P ′ + Q ) − P − Q, − P ≡ R ,L ≡ F d ( Q ) P,P ′ = (27)[(3 / F t + (1 / F t ]( P + P ′ + Q ) − P − Q, − P ≡ R ,L ≡ F m ( Q ) P,P ′ = (28)[(1 / F m − (1 / F d ]( P ′ − P ) P,P + Q ≡ R ,L ≡ F m ( Q ) P,P ′ = (29)[ − (3 / F m + (1 / F d ]( P ′ − P ) P,P + Q ≡ R . In Fig. 5 we plot the violation of crossing symmetryversus the number of iterations. It can be seen clearlythat the crossing symmetry cannot be perfectly restored,even for the case of U = 2 t , the measures of violationof crossing show oscillatory decreasing behavior, and therate of decrease is quite slow even though the leadingeigenvalues seem to be well converged. Obviously, forthe cases of U = 4 t and 6 t the crossing symmetry isseverely violated and at the verge of the divergence thereare sharp increases in the crossing symmetry violation.This may suggest that if the crossing symmetry can berestored, the divergence may be avoided beyond weakcoupling.In Fig. 6, we plot the crossing symmetry violation asa function of iteration steps with the continuous load-ing method and the same parameters as in Fig. 4. Itis important to note that the homotopy function g ( ν )does not respect the crossing symmetry except at ν = 1.Therefore, although the solution is converged, as long as ν = 1, the crossing symmetry is violated. The data for U = 2 t, t , and 6 t are shown in the upper, the middle,and the lower panels respectively. The data for U = 2 t show a peak at the beginning of the iteration procedurewhen ν is increased and gradually converges to a ﬁnitevalue. For ν close to 1, the data show a similar oscillatorybehavior as that from the simple iteration method. Sim-ilar behaviors are also observed for U = 4 t and U = 6 t ,however, the iterations fail to converge for ν close to 1;the crossing symmetry is strongly violated. Similar tothat for the simple iteration method, at the verge of thedivergence, there are sharp increases of the violation ofcrossing symmetry. V. SYMMETRY RESTORATIONA. Crossing Symmetry

Although it does not seem to be easy to analyze all thecauses of the instability in the iteration, based on the dis-cussion in the above section, our conjecture is that oneof the possible reasons for the instability is that certain number of iterations λ λ d λ m λ s λ t SI,

U=2t number of iterations λ λ d λ m λ s λ t SI,

U=4t number of iterations λ λ d λ m λ s λ t SI,

U=6t

FIG. 3: (Color online) The leading eigenvalues of variouschannels (density (d), magnetic (m), spin singlet (s), and spintriplet (t)) as a function of the number of iteration steps calcu-lated with the simple iteration (SI) method. The calculationsare for the half-ﬁlled Hubbard model on a 2 × T = 0 . t . The initial condition for the self-energyis set at 0 + i t , and that for the irreducible vertex is setat the bare Hubbard coupling. The damping factor α is setat 0 .

1. The solution is well converged for U = 2 t . However,divergence occurs for U = 4 t at the 55th iteration step. For U = 6 t , divergence occurs at the 46th iteration step. symmetries are violated in the course of the iteration pro-cess. A possible strategy to improve the iteration schemeis to impose those symmetries explicitly into the iterationprocess, so that at each step of the iteration these sym-metries are not violated. The full vertices F obtainedby the solution of the Bethe-Salpeter equation cannotguarantee the crossing symmetry, unless the exact solu-tion is attained. Therefore, so as to preserve the crossingsymmetry, the simplest method is to use the full verticesobtained by solving the Bethe-Salpeter equation (that isthe full vertices obtained from the step 3 of the algorithm number of iterations λ λ d λ m λ s λ t CL,

U=2t number of iterations λ λ d λ m λ s λ t CL,

U=4t number of iterations λ λ d λ m λ s λ t CL,

U=6t

FIG. 4: (Color online) The leading eigenvalues obtained bythe continuous loading (CL) method. Symbolically we writethe parquet equations as f parquet (Σ , Γ) = 0 and use them todeﬁne the auxiliary function as g = ν f parquet + (1 − ν ) f ,where f is a function with a trivial solution. We choose f asa vector containing all the elements of Γ − Γ and Σ − Σ . Theiteration method is used to solve the function g ( ν ), insteadof the f parquet . The solution of the f parquet is recovered when ν = 1. A series of ν values are needed, which we denote as ν i .The function g ( ν i ) is solved by the simple iteration methodwith the initial conditions given by the converged solution ofthe function g ( ν i − ). For U = 2 t , we can push the value of ν to 1 to obtain the converged solution for g ( ν = 1), andthe solution of the parquet equations is recovered. However,the iteration procedure diverges for the cases of U = 4 t and U = 6 t ; they diverge at ν = 0 . .

999 respectively.These examples show that for the cases where simple iter-ation method diverges, the continuous loading method maynot eliminate the divergence, even though the value of ν ispushed fairy close to 1. number of iterations m ea s u r e o f c r o ss i ng v i o l a ti on E E E E E E SI,

U=2t number of iterations m ea s u r e o f c r o ss i ng v i o l a ti on E E E E E E SI,

U=4t number of iterations m ea s u r e o f c r o ss i ng v i o l a ti on E E E E E E SI,

U=6t

FIG. 5: (Color online) Crossing symmetry violation, E i , ver-sus the number of iterations for the simple iteration (SI)method with the same parameters as Fig. 3. The six mea-sures of crossing symmetry violation are deﬁned as E i = | L i − R i | / | L i + R i | , where i = 1 , , ..., L i and R i are deﬁnedrespectively as the left hand side and the right hand side ofthe Eqs. 24 – 29. The data for U = 2 t shows an oscillatorybut decreasing trend. This is expected for the case where theiteration provides a well converged solution. One should notethat although the leading eigenvalues seem to be converged,the crossing symmetry is not perfectly constructed from theiteration. For U = 4 t and U = 6 t , the iteration fails to provideconverged solutions, and the crossing symmetry is strongly vi-olated. In particular, at the verge of the divergence, there isa sharp increase of the violation of crossing symmetry. presented in the section III), and feed them back into theparquet equation to reconstruct the crossing symmetricfull vertices. Fig. 7 illustrates the ﬂow diagram for solv-ing the parquet equations with explicit restoration of thecrossing symmetry in the full vertices. Here is the algo-rithm which explicitly preserves the crossing symmetry:1. Set the initial conditions for the irreducible and full number of iterations m ea s u r e o f c r o ss i ng v i o l a ti on E E E E E E CL,

U=2t number of iterations m ea s u r e o f c r o ss i ng v i o l a ti on E E E E E E CL,

U=4t number of iterations m ea s u r e o f c r o ss i ng v i o l a ti on E E E E E E CL,

U=6t

FIG. 6: (Color online) Crossing symmetry violation, E i , ver-sus the number of iterations for the continuous loading (CL)method with the same parameters and the same deﬁnition of E i as in Fig. 5. The homotopy function g ( ν ) does not respectthe crossing symmetry except at ν = 1. Therefore, althoughthe solution may be converged for some values of ν , as longas ν = 1, the violation of crossing symmetry is non-zero. For U = 2 t the crossing symmetry violations peak near the begin-ning of the iteration procedure where ν is small and graduallyconverge to a ﬁnite value when ν = 1. Similar behaviors arealso observed for U = 4 t and U = 6 t ; however, since the iter-ation fails to converge for ν close to 1, the crossing symmetryis strongly violated. Similar to that observed in the simpleiteration method, at the verge of the divergence, there is asharp increase of the violation of crossing symmetry. vertices, and the self-energy.2. Update the Green’s functions and calculate thebare susceptibility, χ . Solve the parquet and the Bethe-Salpeter equations for the irreducible vertices, Γ. Simpleiteration is used until the convergence criteria are metfor the irreducible vertices.Since we will restore the crossing symmetry of the full FIG. 7: (Color online) Flow diagram of the algorithm for solv-ing the parquet equations with crossing symmetry restoration.See the text for the description of each step. The main dif-ference compared to the previous algorithm is in the step 3bwhere the crossing symmetry is restored explicitly in the fullvertex, F . Because of this explicit restoration of the crossingsymmetry, in practice, the step 2 is only iterated for one timeas the self-consistency is not required to generate the crossingsymmetry. vertices in the step 3b, we ﬁnd that it is not necessaryto attain the self-consistency for step 2. In practice, weiterate the parquet equations for one time only. The nextstep is to use the irreducible vertices obtained from theparquet equations to construct the full vertices.3a. Solve the Bethe-Salpeter equation to obtain thefull vertices, F , using the irreducible vertices from theprevious step. This is executed exactly by calling theLAPACK routines for the inverse of the matrices. F .With the full vertices obtained, we can update the self-energy.4. Solve the Dyson-Schwinger equation to obtain theself-energy from the full vertices. Simple iteration is useduntil the convergence criteria are met for the self-energy.5. Solve the Dyson equation for the fully dressed Greenfunction from the self-energy.This completes the iteration loop, and the procedure isrepeated from step 2 until the criteria of convergence aremet for both the self-energy and the irreducible vertices.The main diﬀerence between the current algorithm andthe previous algorithm we present in Section III is in step3 where the full vertices are constructed. In the previ-ous algorithm the Bethe-Salpeter equation is solved many0times to attain convergence. When the absolute conver-gence is attained, the crossing symmetry will be satis-ﬁed. In the current algorithm, we just explicitly solvethe Bethe-Salpeter equation in step 3a to refresh the fullvertices. Once we obtain the full vertices, in step 3b, weconstruct the new vertex ladders and the new full ver-tices from the vertex ladders using the Bethe-Salpeterequation. By doing so, the crossing symmetry of the fullvertices will be satisﬁed; see Eqs. 16, 17, 18, and 19. InFig. 8 we show the leading eigenvalues using the same setof parameters used in Fig. 3. While the simple iterationscheme without crossing symmetry fails to converge forthe case of U = 4 and 6, it provides a converged solutionwhen the crossing symmetry is explicitly restored. B. Time-reversal and Point Group Symmetry

Besides imposing the crossing symmetry on the fullvertices, some of the internal symmetries can also beimposed on the irreducible vertices and the self-energywithout much computational overhead. We illustrate thetime-reversal symmetry in the self-energyΣ( k , iω ) = Σ ∗ ( k , − iω ) (30)and the vertices (spatial reﬂection symmetry and parityinvariance are assumed), F ( Q ) P,P ′ = F ( Q ) P ′ ,P . (31)As these symmetry operations do not mix verticesacross diﬀerent values of Q , and providing that the datais distributed with one or more Q at each node, the timereversal symmetry of the vertices can be imposed withoutinvoking communications across diﬀerent nodes. There-fore, enforcing time-reversal symmetry will only cause avery minor computational overhead.Other symmetries, such as the point group symmetryfor the square lattice can be rather cumbersome. Anexpensive scheme involving heavy internode communi-cation would be required to impose the complete set ofpoint group symmetries. However, we may impose an im-portant subset of the operations R α for which R α ( Q ) = Q without expensive communications. In these cases, thevertices may be symmetrized by performing the sum F ( Q ) P,P ′ = 1 N R α ( Q )= Q X R α ( Q )= Q F ( Q ) R α ( P ) ,R α ( P ′ ) , (32)where N R α ( Q )= Q is the number of elements in this subsetof operations. For general Q in the cluster Brillouin zonethere would be no α such that R α ( Q ) = Q apart fromthe identity. However, for the points of high symmetry, R α ( Q ) = Q for all α . Generally, the instabilities ﬁrstoccur here, so imposing the point group symmetries atthese Q values should have the greatest impact.In Fig. 9 we show the leading eigenvalues of variouschannels when both crossing and time-reversal symme-tries are imposed for the same set of parameters being number of iterations λ λ d λ m λ s λ t SI with CS,

U=2t number of iterations λ λ d λ m λ s λ t SI with CS,

U=4t number of iterations λ λ d λ m λ s λ t SI with CS,

U=6t

FIG. 8: (Color online) The leading eigenvalues of variouschannels (density (d), magnetic (m), spin singlet (s), and spintriplet (t)) versus the number of iterations with the simple in-teraction (SI) method with crossing symmetry (CS). The pa-rameters used are the same as the data shown in Fig. 3. Theonly diﬀerence is that the crossing symmetry in the full ver-tex, F , is explicitly restored at each step of the iteration. Thisis easily achieved by constructing the full vertex directly fromEqs. 8 and 9. The simple iteration scheme without crossingsymmetry fails for the case of U = 4 and 6. With the cross-ing symmetry explicitly restored, converged solutions are ob-tained. used from the data in Fig. 3 and 8. Spatial reﬂectionsymmetry, parity invariance and spin rotation invarianceare assumed as appropriate for the two-dimensional Hub-bard model at non-zero temperature. We can see thatthere is only very marginal improvement for the conver-gence compared to the results without explicitly restor-ing the time-reversal symmetry (see Fig. 8). We also usethe scheme described above to partially impose the pointgroup symmetries. However, these symmetries resultedin no additional improvements and therefore no results1 number of iterations λ λ d λ m λ s λ t SI with CS & TRS,

U=2t number of iterations λ λ d λ m λ s λ t SI with CS & TRS,

U=4t number of iterations λ λ d λ m λ s λ t SI with CS and TRS,

U=6t

FIG. 9: (Color online) The leading eigenvalues of variouschannels: density (d), magnetic (m), spin singlet (s), andspin triplet (t), as a function of the iteration. The param-eters used are the same as the data shown in Fig. 3. Twosymmetries are explicitly restored at each step of the itera-tion: the crossing symmetry (CS) in the full vertex, F , andthe time-reversal symmetry (TRS) for the self-energy, Σ, andboth the irreducible vertex, Γ, and the full vertex, F , (thatis F ( Q ) P,P ′ = F ( Q ) P ′ ,P and similarly for Γ). Notice thereis no substantial gain in the convergent rate compared to thecase with only the crossing symmetry being restored. are shown. VI. LEADING EIGENVALUE OF THEANTIFERROMAGNETIC CHANNEL

With the improved scheme proposed in this paper, weare able to explore a wider range of temperature andcoupling strength for the half-ﬁlled Hubbard model. InFig. 10 we show the leading eigenvalue for the most singu-lar channel, the antiferromagnetic channel, λ m , as a func- U/t λ m T=0.15tT=0.2tT=0.3tT=0.4tT=0.8t

FIG. 10: (Color online) The leading eigenvalues of the an-tiferromagnetic magnetic channel, λ m , as a function of thecoupling, U , calculated with the simple iteration method.Diﬀerent curves correspond to diﬀerent temperatures. Thedata points enclosed in a black square correspond to the caseswhere the simple iteration without any symmetry restorationcan provide a converged solution. tion of U for a range of temperatures as low as T = 0 . t .The data points enclosed in a black square correspond tothe cases where the simple iteration without any sym-metry restoration provides a converged solution. Forall temperatures, λ m increase sharply at weak coupling( U ∼ t ), and they tend to saturate at strong coupling( U ∼ t ). They are most sensitive to temperature atthe intermediate coupling (2 t < U < t ). We emphasizethat convergency is not possible without the improvedscheme, unless a large number of iterations are used toattain the crossing symmetry. VII. SUMMARY AND DISCUSSION

We present improvements of numerical implementa-tions for solving the parquet equations for the Hubbardmodel. The main strategy is to enforce the symmetries inthe iteration process. The most prominent advantage ofthe parquet formulation, compared to most of the otherapproaches, is that the crossing symmetry is exactly ful-ﬁlled. However, in general, it is true only if the exactsolution is found. With the simple iteration method,the crossing symmetry is strongly violated prior an in-stability, suggesting that the instability is due to thesesymmetry violations.The continuous loading or homotopy method does notimprove convergence signiﬁcantly beyond the simple iter-ation method. We note that the solutions of the contin-uous loading function do not preserve crossing symme-try. This may partly explain why the continuous loadingmethod does not provide signiﬁcant improvement over2simple iteration.We present a simple method to enforce the crossingsymmetry at each step of the iteration which does notsubstantially increase the computational cost. The addi-tion of these symmetry constraints can greatly improvethe stability of the calculation, so that a wider range ofparameters can be explored by the parquet formulation.Along this line of thought, one can expect that the sta-bility may be further improved if other symmetries arealso imposed, the obvious ones being time-reversal andpoint group. However, these additional symmetries didnot improve the stability signiﬁcantly beyond that ob-tained with crossing symmetry alone.The parquet formulation still remains as one of thebest approaches for calculating the two-particle vertexfunctions in a self-consistent manner. At present, solv-ing the parquet equations for a large lattice size is still avery challenging task; however, with the continuous ad-vances of computational facilities, it should become morefeasible in the foreseeable future. A promising direction,which allows immediate application of the parquet for-mulation, is to incorporate it as part of the multi-scalemany-body approach.

Acknowledgments

We would like to acknowledge very useful discussionswith Karen Tomko, and we thank Peter Reis for his care-ful reading of the manuscript. This work was supportedin part by the DOE SciDAC grant DE-FC02-06ER25792(KMT, HF, SYZ, and MJ) and the U.S. National ScienceFoundation LA-SiGMA grant EPS-1003897 (JR, JM, andMJ). Supercomputer support was provided by the NSFTeraGrid under grant number TG-DMR100007. Thisresearch also used resources of the National Center forComputational Sciences at Oak Ridge National Labo-ratory, which is supported by the Oﬃce of Science ofthe U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Appendix A: Parallel Implementation with LatencyHiding

This appendix describes a highly eﬀective implemen-tation of the symmetry-enforcing variant of the parquetformulation described earlier in the paper. The commu-nication bottleneck in this implementation is the expen-sive tensor rotations required to rotate the vertex ladders(Eqs. 10 and 11) between the forms used in the Bethe-Salpeter equation, Eqs. 8 and 9, to those used in the par-quet equations, Eqs. 12 – 15. If we distribute the equa-tions between the processes executing on the computenodes of a parallel machine using the transfer momenta Q , the tensor rotations are done with an expensive all-to-all communication among those processes, in which everynode needs to communicate with all the other nodes. The MPI implementation of the all-to-all communication isa collective operation that is blocking, i.e., each processhas to wait until the message has been sent out. Thekey aspect of our implementation is the decomposition ofthe required communication so that non-blocking com-munication primitives can be eﬀectively utilized. Thenon-blocking communication enables latency hiding byoverlapping computations and communications.Four diﬀerent forms of tensor rotations are required:rotation 1 : Φ( Q ) P,P ′ ←− Φ( P ′ − P ) P,P + Q (A1)rotation 2 : Φ( Q ) P,P ′ ←− Φ( P ′ − P ) − P ′ ,P + Q (A2)rotation 3 : Φ( Q ) P,P ′ ←− Φ( P + P ′ + Q ) − P ′ , − P (A3)rotation 4 : Φ( Q ) P,P ′ ←− Φ( P + P ′ + Q ) − P − Q, − P . (A4)Note that the indices in subscripts and those in paren-thesis are equivalent, with the latter only distinguishedby also labeling the nodes where data is distributed. Thesize of the tensors is N t × N t × N t , where N t is the num-ber of momentum points times the number of discreteMatsubara frequencies, i.e., N t = N k × N ω . All indicesare in modulo arithmetic at each of the D + 1 dimen-sions, where D = 2 (the cluster dimension), and the “1”is for the Matsubara frequency. Because it takes manyiterations (up to a few hundred for low temperatures andstrong coupling) to obtain converged solutions, the totalnumber of tensor rotations is signiﬁcant and account fora large fraction of the computational time.We use the hybrid MPI/OpenMP model for the com-putations. The rank three tensors are decomposed andevenly distributed into N virtual nodes. Each virtualnode consists of a few cores. The size of a virtual node(i.e., the number of cores) is less than or equal to the sizeof a physical node. Speciﬁcally, we slice the rank threetensors to a set of two-dimensional arrays based on theindex in parenthesis, e.g., Q and P − P ′ for the left andright sides of Eq. A1, respectively. Then, each two di-mensional matrix is assigned to a virtual node. Since wehave N t layers of two dimensional slices, the total numberof virtual nodes also becomes N = N t . In this scenario,every rotation requires data communications among allnodes. The following describes the data access patternsfor our implementation of the tensor rotations. Step 1:

This step involves no MPI communication andis done before any data is sent between nodes. The tensorelements are locally rearranged in order to collect speciﬁcelements to be grouped and sent to designated destina-tion nodes. The index in parenthesis of the tensors onthe right of Eqs. A1–A4 represents the rank of a sendingnode in which a sliced two dimensional matrix resides.For rotations 1 and 2, rank of sending node S is S = P ′ − P. (A5)For rotations 3 and 4, S is S = P + P ′ + Q. (A6)3Using Eqs. A5 and A6, and applying these to the corre-sponding rotations, the two dimensional matrix elementsare grouped based on the rank of destination node Q from a given sending node S .rotation 1 : A P,Q = Φ( S ) P,P + Q (A7)rotation 2 : A P,Q = Φ( S ) − ( P + S ) ,P + Q (A8)rotation 3 : A P,Q = Φ( S ) P + Q − S, − P (A9)rotation 4 : A P,Q = Φ( S ) − ( P + Q ) , − P (A10)Note that, here, S is the node index (the index of thesender) and P, Q ∈ { , . . . , N t − } , so the P and Q arethe row and column indices of the matrix. We assumecolumn-major order data access in MPI data communi-cations which distribute columns of matrix A to nodesof rank Q in the next step. Step 2:

The columns of the two dimensional ma-trix A are distributed among all nodes. At the send-ing nodes, each column of A is sent to a destina-tion node labeled by Q . The standard approach is touse MPI ALLTOALL . However, as we show later, thistask can be done using diﬀerent combinations of point-to-point communications. In particular, non-blockingcommunication protocols can be applied to overlap com-munications and local computations. Overall, this pro-cedure is applied to all the tensor rotations and can bewritten as B P,S at rank Q node: ← A P,Q at rank S node . (A11)As shown in Eq. A11, the rank of destination nodes isdetermined by the column index Q of A in sending nodes.The rank of sending nodes becomes column index S of B in the receiving nodes. The rank of sending nodes S must be provided to receiving nodes in order to assignthe correct column index to the received messages. Step 3:

Once messages have arrived at the destinationnodes, the columns of the two dimensional matrix B arerearranged to complete the tensor rotations. The columnindex of the rotated received matrix is related to the rankof the sending and receiving nodes by Eqs. A5 and A6.Then, the rotations are ﬁnalized by using the followingrelationsrotation 1 , Q ) P,S + P ←− B P,S (A12)rotation 3 , Q ) P,S − ( P + Q ) ←− B P,S , (A13)where Q is the index of a given receiving node and P, S ∈{ , . . . , N t − } . This step is a local process, i.e., nointernode communication is necessary. Improving the Performance of Tensor Rotations

While steps 1 and 3 are strictly local processes, step2 is the only stage involving nonlocal MPI communi-cations. The nature of the collective communicationsamong all nodes in step 2 makes it suited to the use of

MPI ALLTOALL . In such a case, step 2 can start onlyafter the completion of step 1. Because

MPI ALLTOALL is a blocking communication, step 3 must wait to startuntil step 2 is ﬁnished. Therefore, the total elapsed timeto complete a tensor rotation is the sum of elapsed timesof the three steps. When the problem size is large, thecommunication eﬃciency of

MPI ALLTOALL is reducedsigniﬁcantly due to the increased network complexity as-sociated with the bandwidth and latency among all par-ticipating nodes. Our approach to handle these rotationsmore eﬃciently is to implement a latency hiding strategyby overlapping message communications (step 2) and lo-cal computations (steps 1 and 3).To enable this, we have developed our own version of aroutine that performs communications from all nodes toall nodes. At a basic level, the functionality of this rou-tine is identical to that of the generic

MPI ALLTOALL routine. However, our routine allows further data ma-nipulations such that local computations are embeddedbetween communications in the following way: On thesending node, the ﬁrst column of A is computed fromthe equations of step 1. Then, MPI ISEND sends out theﬁrst column of A . While this column is being sent out, thenext column of A is prepared with step 1. This procedureis repeated until all N t columns of A , the group of theselected elements from Φ , are sent out. This process over-laps steps 1 and 2. Latency hiding is also implemented inreceiving nodes. We note that the sending nodes are alsoreceiving nodes. They only diﬀer by whether they areoperating in the sending or the receiving mode . On thereceiving nodes, MPI IRECV is set to receive messagesfrom arbitrary nodes by using

MPI ANY SOURCE as atag identifying the source of the message. For eﬃciencyreasons,

MPI IRECV is posted before

MPI ISEND ofthe sending process. Then,

MPI TEST calls are usedto check the completion of the arrival of the message.Once message arrival is conﬁrmed, the rank of the nodethat sent this message can be identiﬁed by inquiring using

MPI STATUS . This provides S to assign to a correspond-ing column and to be used in step 3. Since the messagearrival is column-by-column, the processing of each col-umn of B continues to step 3 while the next column istraveling through the network. This process is repeateduntil all columns are completed. This procedure com-pletely overlaps steps 2 and 3.Depending on the size of problem, it is desirable todeﬁne a virtual node containing several cores (assumingmulticore hardware architecture) based on the memoryavailability per node. Among the cores, MPI commu-nications are assigned to one core. The other cores areutilized by implementing OpenMP that parallelizes thelocal computational tasks in a node to all cores within thenode. Thus, OpenMP thread depth is set to match withthe total number of cores per virtual node. Speciﬁcally,we applied the DO directive of OpenMP for iterations ofindex P in the column selection processes of steps 1 and3.4 Experimental Results

We test the eﬃciency of this latency hiding schemeusing a non-blocking protocol against the standard

MPI ALLTOALL . All the experimental comparisons areconducted on the Cray XT5 (Jaguar) at the NationalCenter for Computational Sciences (NCCS) at the OakRidge National Laboratory. Jaguar consists of 12 coresper node, with six cores per NUMA (Non-Uniform Mem-ory Access) node, and two NUMAs per node. First, wediscuss hardware-driven constraints in implementing la-tency hiding. The non-blocking

MPI ISEND does notcheck for the arrivals of messages. With larger tensorsize, the node usage and the size of individual columnsbecomes large. The

MPI ISEND from all participatingnodes tries to dump a large column in each iteration. Thenext iteration starts regardless of message arrivals in thereceiving nodes. As a consequence, a large amount ofdata rushes onto the network faster than the data can beabsorbed by the receiving nodes. Eventually, this causesmemory overﬂow to the system buﬀer assigned to themessage processing unit. To avoid this we have allocatedmore memory space to the system buﬀer.

768 1280 1792 2304 281660120180240300360420 N t (= number of nodes) bu ff e r s i z e ( M B ) FIG. 11: (Color online) Required minimum buﬀer size to ex-ecute our all-to-all routine; each node has 12 cores.

For simplicity, we assign one virtual node to a phys-ical node. On the Jaguar Cray XT5, this means onevirtual node containing 12 cores. To utilize all cores ina node, the value of OpenMP thread depth is set to 12.We gradually increase the problem size N t until jobs endwith error indicating buﬀer overﬂow. Then, we set ahigher buﬀer size by controlling environmental variable MPICH UNEX BUFFER SIZE . For every incidence oferror, we add 60 MB buﬀer size. The default value of

MPICH UNEX BUFFER SIZE is 60 MB on JAGUARXT5 (the total number of cores is less than 50,000). Theresults are shown in the Fig. 11. Up to N t = 1024, the60 MB default buﬀer size is enough to handle the datatraﬃc. Increasing N t further forces us to use a largerbuﬀer size. Overall, the amount of added buﬀer size in-creases for larger problem sizes. We note that the resultspresented in Fig. 11 are with the maximum number of cores per a virtual node. Smaller core usage per nodealleviates the buﬀer restriction. For example, hardwaresetup with a NUMA node per virtual node consumes lessbuﬀer memory due to the reduced total number of physi-cal nodes participating in internode communication. Wedid not observe buﬀer memory overﬂow with the genericblocking MPI ALLTOALL routine.

768 1280 1792 2304 281600.20.40.60.811.21.41.6 N t (= number of nodes) e l ap s e t i m e ( s e c ) without latency hidingwith latency hiding FIG. 12: (Color online) Time spent in data communicationas a function of the number of computer nodes (12 processorsper node). For large data sets each process sends messagesto all the others, and the communication time scales linearlywith the number of processes. Latency hiding techniques thatoverlap the interprocessor communication with local compu-tations yields a factor of two speedup when compared withthe standard

MPI ALLTOALL implementations as the num-ber of processors increases beyond 30,000.

The performance of the latency hiding approach isevaluated in terms of wallclock time spent on a singletensor rotation and compared with the case of the stan-dard

MPI ALLTOALL applied for step 2. For this, theelapsed time to complete the tensor rotation is averagedover nine independent runs. Each run contains 40 repe-titions of identical tensor rotations. At the end of eachrun, the elapsed time is also averaged for the 40 rota-tions. For all runs, we choose rotation 1 and the min-imum buﬀer sizes shown in the Fig. 11 are assumed.The comparison results are shown in Fig. 12. Exceptfor N t = 768 and 2048, latency hiding outperforms thecase without latency hiding in signiﬁcant amount. Theperformance diﬀerences are even higher for N t ≥ MPI ALLTOALL case, there is a sudden speed-up at N t = 2048. We are exploring this behavior further.We believe that it is caused by changes in the data traﬃccontrolled by the hardware.Overall, latency hiding provides a higher speed-up forlarger tensor sizes and core count. From the generaltrends, it can be expected that two-fold or more eﬃciencyimprovement for N t greater than 2816 can be obtainedby implementing latency hiding with our non-blockingadaptation of the all-to-all routine for tensor rotations.5 L. D. Landau, A. A. Abrikosov and I. M. Khalatnikov,Dokl. Akad. Nauk. , 497, 773, 1177 (1954). I. Ya. Pomeranchuk, V. V. Sudakov, and K. A. Ter-Martirosyan, Phys. Rev. , 784 (1956). K. Ter-Martirosyan, Phys. Rev. , 948 (1958). C. de Dominicis and P. C. Martin, J. Math. Phys. , 14(1964). C. de Dominicis and P. C. Martin, J. Math. Phys. , 31(1964). B. Roulet, J. Gavoret, and P. Nozi`eres, Phys. Rev. ,1072 (1969). V. M. Yakovenko, Phys. Rev. B , 8851 (1993). S. A. Brazovskii, Sov. Phys. JETP, S. A. Brazovskii, Sov. Phys. JETP,

433 (1972). P. Kleinert and H. Schlegel, Physica A, , 507 (1995). C.-X. Chen and N. E. Bickers, Solid State Commun. ,311 (1992). V Janiˇs, P. Augustinsk´y, Phys. Rev. B , 165108 (2007). V Janiˇs, P. Augustinsk´y, Phys. Rev. B , 085106 (2008). P. Augustinsk´y and V. Janiˇs, Phys. Rev. B , 035114(2011). V. Janiˇs and J. Kolorenˇc, Phys. Rev. B , 033103 (2005). V. Janiˇs, Phys. Rev. B , 115115 (2001). V. Janiˇs, J. Phys.: Condens. Matter , 485501 (2009). N. E. Bickers,

Numerical Methods for Lattice QuantumMany-Body Problems , ed. D. J. Scalapino (Addison Wes-ley, New York, 1998). N. E. Bickers and S. R. White, Phys. Rev. B , 8044(1991). D. W. Hess, J. J. Deisz, J. W. Serene, Philos. Mag. ,457 (1996). J. Luo and N. E. Bickers, Phys. Rev. B , 15983 (1993). V. Janiˇs, Phys. Rev. B , 11345 (1999). H. Kusunose, J. Phys. Soc. Jpn. , 094707 (2010). A. D. Jackson, A. Lande, R. W. Guitink, and R. A. Smith,Phys. Rev. B , 403 (1985). A. D. Jackson and R. A. Smith, Phys, Rev. A , 2517(1987). M. Pﬁtzner and P. W¨olﬂe, Phys. Rev. B , 4699 (1987). R. A. Weiner, Phys. Rev. Lett. , 1071 (1970). R. A. Weiner, Phys. Rev. B , 3165 (1971). J. Yeo and M. A. Moore, Phys. Rev. Lett. , 1142 (1996). J. Yeo and M. A. Moore, Phys. Rev. B , 4218 (1996). J. Yeo and M. A. Moore, Phys. Rev. B , 024514 (2001). J. Yeo, H. Park, and S. Yi, J. Phys.:Condens. Matter ,3607 (2006). A. Shishanin and I. Ziyatdinov JHEP , 32 (2003). I. Ya. Aref’eva and A. P. Zubarev Phys. Lett. B, , 258(1996). E. Bergli, M. Hjorth-Jensen, Annals Phys. , 1125(2011). S. X. Yang, H. Fotso, J. Liu, T. A. Maier, K. Tomko, E. F.D’Azevedo, R. T. Scalettar, T. Pruschke, and M. Jarrell,Phys. Rev. E , 046706 (2009). N. E. Bickers (unpublished). D. Bohm and D. Pines, Phys. Rev. , 609 (1953). D. Pines, Phys. Rev. , 626 (1953). T. Moriya,

Spin Fluctuations in Itinerant Electron Mag-netism (Springer, Berlin, 1985). N. E. Bickers and D. J. Scalapino, Ann. Phys. (N.Y.) ,106 (1989). S. Weinberg,

The Quantum Theory of Fields (Volumn 1) ,(Cambridge University Press, 2005). N. E. Bickers and D. J. Scalapino, Phys. Rev. B , 8050(1992). P. Kleinert, Prog. Theor. Phys. , 327 (2009). E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel,J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling,A. McKenney, and D. Sorensen,

LAPACK Users’ Guide,3rd. edition , (Society for Industrial and Applied Mathe-matics, 1999). R. E. Bank and D. J. Rose, Numer. Math. , 279 (1981). M. Jarrell, K. Tomko, Th. Maier, E. D’Azevedo, R. T.Scalettar, Z. Bai, and S. Savrasov, J. Phys.: Conf. Ser. ,012031 (2007). C. Slezak, M. Jarrell, Th. Maier and J. Deisz, J. Phys.:Condens. Matter , 435604 (2009). M. Snir, J. Dongarra, J. Kowalik, S. Hauss-lederman, S.Otto, and D. Walker,