[PDF] Accelerated impurity solver for DMFT and its diagrammatic extensions

Abstract

We present ComCTQMC, a GPU accelerated quantum impurity solver. It uses the continuous-time quantum Monte Carlo (CTQMC) algorithm wherein the partition function is expanded in terms of the hybridisation function (CT-HYB). ComCTQMC supports both partition and worm-space measurements, and it uses improved estimators and the reduced density matrix to improve observable measurements whenever possible. ComCTQMC efficiently measures all one and two-particle Green's functions, all static observables which commute with the local Hamiltonian, and the occupation of each impurity orbital. ComCTQMC can solve complex-valued impurities with crystal fields that are hybridized to both fermionic and bosonic baths. Most importantly, ComCTQMC utilizes graphical processing units (GPUs), if available, to dramatically accelerate the CTQMC algorithm when the Hilbert space is sufficiently large. We demonstrate acceleration by a factor of over 600 (100) in a simulation of δ -Pu at 600 K with (without) crystal fields. In easier problems, the GPU offers less impressive acceleration or even decelerates the CTQMC. Here we describe the theory, algorithms, and structure used by ComCTQMC in order to achieve this set of features and level of acceleration.

Full PDF

aa r X i v : . [ c ond - m a t . s t r- e l ] O c t Accelerated impurity solver for DMFT and its diagrammaticextensions

Corey Melnick b, ∗ , Patrick Sémon a , Kwangmin Yu a , Nicholas D’Imperio a , André-MarieTremblay c , Gabriel Kotliar a,d a Brookhaven National Laboratory, Computational Science Initiative, Upton, NY b Brookhaven National Laboratory, Department of Condensed Matter Physics and Materials Science,Upton, NY c Université de Sherbrooke, Département de physique and Regroupement québéquois sur les matériauxde pointe, Sherbrooke, Québec, Canada d Rutgers University, Department of Physics, Piscataway, NJ

Abstract

We present ComCTQMC, a GPU accelerated quantum impurity solver. It uses thecontinuous-time quantum Monte Carlo (CTQMC) algorithm wherein the partition func-tion is expanded in terms of the hybridisation function (CT-HYB). ComCTQMC supportsboth partition and worm-space measurements, and it uses improved estimators andthe reduced density matrix to improve observable measurements whenever possible.ComCTQMC efﬁciently measures all one and two-particle Green’s functions, all staticobservables which commute with the local Hamiltonian, and the occupation of eachimpurity orbital. ComCTQMC can solve complex-valued impurities with crystal ﬁeldsthat are hybridized to both fermionic and bosonic baths. Most importantly, ComC-TQMC utilizes graphical processing units (GPUs), if available, to dramatically acceler-ate the CTQMC algorithm when the Hilbert space is sufﬁciently large. We demonstrateacceleration by a factor of over 600 (100) in a simulation of δ -Pu at 600 K with (with-out) crystal ﬁelds. In easier problems, the GPU offers less impressive acceleration oreven decelerates the CTQMC. Here we describe the theory, algorithms, and structureused by ComCTQMC in order to achieve this set of features and level of acceleration.Keywords: Quantum Monte Carlo; Dynamical Mean Field Theory; Anderson ImpurityModel; Strongly-Correlated Materials; Quantum Impurity Solver ∗ Corresponding author.E-mail address: [email protected] submitted to Computer Physics Communications October 19, 2020

ROGRAM SUMMARY

Program Title: ComCTQMCLicensing provisions(please choose one): GPLv3Programming language: C++/CUDASupplementary material:Nature of problem: In dynamical mean-ﬁeld theory (DMFT), the computational bottleneckis the repeated solution of a quantum impurity problem [1]. The continuous-time quantumMonte-Carlo (CTQMC) algorithm has emerged as one of the most efﬁcient methods for solv-ing multiorbital impurity problems at moderate-to-high temperatures [2]. However, the low-temperature regime remains inaccessible, particularly for f -shell systems, and the measure-ment of two-particle correlation functions on an impurity adds a substantial computationalburden. The bottleneck of the CTQMC solver is itself the computation of the local trace whichincludes the multiplication of many moderate-to-large sized matrices. The efﬁcient solution ofthe impurity, measurement of the two-particle correlation functions, and acceleration of thetrace computation are therefore critical.Solution method: ComCTQMC uses the hybridisation expansion of the impurity action to ex-plore partition space [3]. It uses the worm algorithm [4] to explore the union of the partitionspace with observables spaces, e.g., the two-particle correlation functions. It uses improvedestimators to more accurately measure the one- and two-particle Green’s functions [5]. Iden-tical impurities are solved across all MPI ranks (for ideal weak scaling) and the trace compu-tations of these impurities are distributed to and accelerated by GPUs (when available). Thelazy-trace algorithm [6] is used to further reduce the burden of the local trace calculation.Additional comments including Restrictions and Unusual features: ComCTQMC solves nearlyarbitrary impurities, including those with complex valued and time-dependent interactions.However, there are two restrictions: (1) The retarded part of the interaction is described bya set of bilinears (a paired creation and annihilation operator), and these bilinears must com-mute with the local Hamiltonian and have real quantum numbers; (2) If a local Green’s functionvanishes, then the corresponding hybridisation function also vanishes. References [1] A. Georges, G. Kotliar, W. Krauth, and M.J. Rozenberg, Dynamical mean-ﬁeld theory ofstrongly correlated fermion systems and the limit of inﬁnite dimensions, Rev. Mod. Phys. , 13 (1996).[2] G. Kotliar, S. Y. Savrasov, K. Haule, V.S. Oudovenko, O. Parcollet, and C.A. Marianetti,Electronic structure calculations with dynamical mean-ﬁeld theory, Rev. Mod. Phys. ,865 (2006)[3] E. Gull, A.J. Millis, A.I. Lichtenstein, A.N. Rubtsov, M. Troyer, and P. Werner, Continuous-time Monte Carlo methods for quantum impurity models, Rev. Mod. Phys. , 349 (2011)[4] P. Gunacker, M. Wallerberger, E. Gull, A. Hausoel, G. Sangiovanni, and K. Held, ontinuous-time quantum Monte Carlo using worm sampling, Phys. Rev. B , 155102(2015)[5] H. Hafermann, K. R. Patton, and P. Werner,Improved estimators for the self-energy andvertex function in hybridization-expansion continuous-time quantum Monte Carlo simula-tions, Phys. Rev. B , 205106 (2012)[6] P. Sémon, C.-H. Yee, K. Haule, and A-MS Tremblay, Lazy skip-lists: An algorithm for fasthybridization-expansion quantum Monte Carlo, Phys. Rev. B , 075149 (2014)

1. Introduction

Here we introduce ComCTQMC, a GPU accelerated quantum impurity solver whichuses the continuous-time quantum Monte Carlo (CTQMC) algorithm wherein the impu-rity action is expanded in orders of the hybridisation function (CT-HYB). ComCTQMCcan compute the one- and two-particle vertex functions, susceptibilities, and Green’sfunctions of an impurity; it can measure the value of static observables which com-mute with the local Hamiltonian or good quantum numbers of the local Hamiltonian;and it can provide energy-dependent valence histograms for such observables. In com-bination with its post-processing tools, ComCTQMC provides the ability to carry outdynamical mean-ﬁeld theory (DMFT)[1, 2] for both models or real materials (in, e.g.,the LDA or GW approximations). Indeed, ComCTQMC has been embedded in the opensource, strongly-correlated physics packages ComSuite[3] and Portobello[4].DMFT plays a crucial role in the study of strongly-correlated materials whereindensity function theory (DFT) fails to accurately predict even the most basic materialproperties. Where DFT presumes that the electronic structure can be described viaLandau Fermi-liquid theory, i.e., as itinerant quasiparticles, DMFT is a minimal modelwhich treats both itinerant and localized electrons on an equal footing.[1, 2] In DMFT,one maps the correlated physics of a system onto a local impurity which hybridizeswith a non-interacting bath of particles. The Weiss ﬁeld acting upon the impurity dueto this bath is determined by the self-consistent solution of the DMFT equations, whichrequires the repeated solution of an Anderson impurity model. Indeed, the fast andaccurate solution of the Anderson impurity models lies at the heart of any efﬁcientDMFT implementation.To address this need, a number of impurity solvers have been developed. These in-clude the exact diagonalization (ED)[1, 5], numerical renormalization group (NRG)[6],and quantum Monte-Carlo (QMC)[7] methods. While the QMC algorithms cannot gen-erally access very low temperatures and can encounter a sign-problem after which theimpurity becomes practically unsolvable, QMC methods have become the most widelyused impurity solver largely because they are ﬂexible and numerically exact. That is, a3MC algorithm can be applied to problems with arbitrary interactions and couplings,phases, and symmetries; and the error vanishes systematically as the computationaltime is increased.[7]In the QMC algorithm, a Markov chain is generated by sampling the diagrams ofthe (imaginary time) impurity action via the Metropolis Hastings algorithm. That is,operators are stochastically placed on the imaginary time axis according to the rela-tive weight of the new diagram as compared to the current diagram. In the CTQMCalgorithm, these operators are drawn anywhere on the imaginary time axis. (Thismethod contrasts with the original Hirsch and Fry algorithm[8] which discretizes theimaginary time axis and suffers from systematic error as a result.) In order to efﬁ-ciently compute the relevant weights of each diagram in partition space, one mustexpand the action in terms of some part of the impurity Hamiltonian. Common choicesinclude the interaction (CT-INT), an auxiliary ﬁeld (CT-AUX), or the hybridisation be-tween bath and impurity (CT-HYB).[7] CT-HYB has become the most common choicefor solving the impurity model in real materials, as it is the most capable of solvinggeneral multiorbital problems at low temperatures.While CT-HYB is one of the best methods for solving multiorbital impurity prob-lems, the computational burden associated with an f -shell system or a difﬁcult cel-lular DMFT (CDMFT) problem is immense, particularly at low temperatures. Com-pounding this problem is the strong desire in the strong-coupling community to com-pute the susceptibilities of strongly correlated materials or to diagrammatically ex-tend DMFT using the two-particle response functions. Indeed, the two-particle cor-relation functions require substantially more computational resources to accuratelyestimate than the one-particle objects required by DMFT. As a result of the mas-sive computational hurdles facing DMFT and its extensions, a veritable host of al-gorithms have been developed to enhance the ability of CT-HYB to quickly sample thepartition function (lazy trace computation[9], binary trees[10], basis truncations[11],Krylov implementations[12]); to efﬁciently store two-particle objects and ﬁlter thehigh-frequency noise[13, 14]; to more efﬁciently sample many one- and two-particleobjects (improved estimators[15, 16]); to directly sample conﬁguration spaces whichare rarely or never explored by the partition function (worm algorithm[17, 16]); andto reconstruct the high-frequency regions of the self-energy and one-particle Green’sfunction[7] or full vertex[18] using their asymptotic forms. Nearly all of these afore-mentioned improvements are included in the ComCTQMC package. Where conﬂictsarise, we try to use the most recent and optimal algorithms.Other packages, e.g., TRIQS[19], iQIST[20], and ALPSCore CT-HYB[21], have im-plemented their own set of highly optimized algorithms. However, ComCTQMC canleverage a powerful and emerging feature of modern supercomputers: the graphicalprocessing unit (GPU). While essentially all CTQMC codes are massively paralleliz-4ble across computer processing units (CPU), none of the major CTQMC codes haveutilized GPUs. As we will show, this is a massive advantage when solving the mostdifﬁcult problems.A typical compute node on a modern supercomputer has O (10) CPU cores dis-tributed between a few physical CPUs. While the CPUs have access to a large amountof memory, only a very small subset of this memory is on the physical CPU. In com-parison, a GPU designed for high-performance computing (HPC) has O (1000) coresand drastically more on-device memory. The NVIDIA Volta used on the Summit super-computer at Oak Ridge National Laboratory, for example, has 2,560 cores capable ofhandling double precision computations, each of which has access to 16 GB of sharedmemory across the device and 9 MB in an L2 cache, which is orders of magnitudelarger than any CPU. While these cores are not as fast or as good at handling branch-ing logic trees as a CPU, their ability to process large volumes of data and parallelizecomputation across this data is unsurpassed. Therefore, a GPU can perform many(non-branching) operations on a large object much faster than a CPU could; and ifsmall and distinct portions of an algorithm require such a computation, these kernelscan be handed from CPU to GPU in order to accelerate the algorithm. Provided thatthese computations represent the bottleneck of the calculation, the performance gaincan be substantial.The quintessential example of a suitable kernel is in the multiplication of two rank n matrices, which requires n operations. If n is sufﬁciently large, a GPU can com-pute the result thousands of times faster than a CPU. As we will discuss, such matrixmultiplications form a bottleneck in CT-HYB, and GPUs can therefore be used to dra-matically accelerate the algorithm.In CTQMC, the bottleneck in computation occurs each time one updates conﬁgu-ration of the Markov chain (and recomputes the trace over the local impurity states).This requires the multiplication of many moderate-to-large sized matrices. As we havediscussed, this is precisely the type of problem for which a GPU was designed. In ourimplementation, however, it is not only the matrix multiplications which are handledon a GPU: our GPU kernels are used to compute the matrix norm, which is crucialin the lazy-trace algorithm[9], apply the time-evolution operators, add matrices, andcompute the ﬁnal trace. Not only does this allow the GPU to handle more computa-tions, but it also nearly eliminates the costly transfer of the matrices between CPUand GPU memory. Furthermore, we have the GPU accelerate multiple simulationssimultaneously in order to more fully utilize its capabilities. With this approach, weshow that a GPU can accelerate the time-to-solution of a single CPU by a factor of over600 on Summit! Furthermore, our implementation is ﬂexible: even for relatively smallproblems, e.g., where the largest matrices have a rank of only n max = 26 , some smallGPU acceleration is achieved (5 times), as shown in Fig. 1. Still, if the problem is5 cce l e r a ti on o f a s i ng l e C P U Rank of largest matrix, n ~ ( - ) n d -shell (Fe at 600 K)Crystal FieldsNo | Yes f -shell (Pu at 600 K) Figure 1: Acceleration of the CTQMC algorithm on a single CPU by a single GPU on Summit at ORNLfor variations in the problem size. The problem size is described here by the rank of the largest matrixhandled by the algorithm, which depends on the size and symmetries of the Hilbert space. The largerthe Hilbert space and the fewer the symmetries, the larger the matrices become, the harder the problembecomes, and the more acceleration a GPU offers. Here we show the acceleration of representative d and f -shell problems (which have Hilbert spaces of size n = 1 , and , , respectively, where n = 10 and 14 are the number of correlated orbitals) with and without crystal ﬁeld effects (which reduce thesymmetries of the problem). sufﬁciently small, GPU-CPU communication will hurt performance: the leftmost data-point in Fig. 1 is decelerated by a factor of 10. In Sec. 3.3, we will discuss the useof GPUs in much greater detail. Appendix Appendix F provides the relevant details ofour deployment on Summit and physical approximations which lead to these results.(The associated input ﬁles are included with the distribution of ComCTQMC.)Before getting into the implementation, it is useful to overview the theory behind-ComCTQMC and discuss its capabilities and limitations. In Sec. 2 we present thetheory behind the CTQMC algorithm, i.e., the quantum impurity model and its action;the algorithm used to sample partition space and the modiﬁcations used to sampleworm spaces; and the observables one can measure in these spaces and the improvedestimators used to more accurately and efﬁciently measure one- and two-particle ver-tices (e.g., the self-energy). Then, we will discuss the implementation in Sec. 3, i.e.,we will describe the structure of the code, its optimization, and its acceleration usingGPUs. Finally, we present a few results in Sec. 4, discussing the computational burdeninvolved in measuring various quantities and some best-practices for measuring them.A guide for installation, tutorials with input ﬁles and converged results, and tablesdescribing all input parameters are included with ComCTQMC. In the appendices,6e provide supporting information. In particular, we deﬁne the basis functions usedin ComCTQMC which are used for the simulation of real materials (Appendix A), adescription of the Metropolis-Hastings algorithm (Appendix B), and a discussion ofimpurities where partition space sampling fails (Appendix C). We also provide a de-scription of the optimized moves used in ComCTQMC (Appendix D), and a descriptionof the symmetries enforced on two-particle correlators in order to more quickly con-verge these observables (Appendix E).Now, let us overview the theory behind the CT-HYB algorithm and its worm-samplingextension.

2. Theory

In this section, we brieﬂy discuss the theoretical foundation of ComCTQMC. That is,we discuss the general quantum impurity model solved by ComCTQMC, the hybridi-sation expansion of the action, and the resulting CTQMC algorithm used to samplepartition space, the worm algorithm used to sample observable spaces, and the im-proved estimators used to more accurately measure variouos observables. Finally, wediscuss the various observables which can be computed by ComCTQMC.2.1. Quantum impurity modelAn impurity model consists of a small interacting system, the impurity, which hy-bridizes with baths of non-interacting particles. We consider here hybridization withboth a fermionic and a bosonic bath, and the different contributions to the impuritymodel Hamiltonian, ˆ H , are split into the purely local part, ˆ H loc ; the fermionic andbosonic parts, ˆ H bath ,f and ˆ H bath ,b ; and the hybridisation between the the baths and theimpurity, ˆ H hyb ,f and ˆ H hyb ,b . That is, ˆ H = ˆ H loc + ˆ H hyb ,f + ˆ H bath ,f + ˆ H hyb ,b + ˆ H bath ,b , (1)The local part of the impurity Hamiltonian includes both one- and two-body interac-tions. That is, ˆ H loc = X ij ˆ c † i ( t ij − µδ ij )ˆ c j + 12 X ijkl ˆ c † i ˆ c † j U ijkl ˆ c k ˆ c l , (2)where c † i creates a fermion in the generalized orbital i , t ij are hopping amplitudes, U ijkl is the interaction tensor, and µ is the chemical potential. The hybridization withthe fermionic bath is described by ˆ H hyb ,f = X iλ ˆ c † i V iλ ˆ f λ + ˆ f † λ V ∗ iλ ˆ c i and ˆ H bath ,f = X λ ǫ λ ˆ f † λ ˆ f λ , (3)7here ˆ f † λ creates a fermion in the bath orbital λ with energy ǫ λ , and the V iλ arefermionic hybridization amplitudes. The hybridization with the bosonic bath is de-scribed by ˆ H hyb ,b = X Iκ W Iκ (ˆ b κ + ˆ b † κ ) ˆ Q I and ˆ H bath ,b = X κ ω κ ˆ b † κ ˆ b κ , (4)where ˆ b † κ creates a boson in the bath orbital κ with energy ω κ , and the W Iκ are hy-bridization amplitudes between the bosonic bath and charge degrees of freedom onthe impurity, ˆ Q I = X ij h I | ij i ˆ c † i ˆ c j . (5)These particle-hole bilinears ˆ Q I are Hermitian, and the hybridization amplitudes W Iκ are real.The path integral formalism allows us to integrate out the bath degrees of freedom,and the action of the impurity model can be written as S = − X ij Z Z β c i ( τ ) G − ij ( τ − τ ′ ) c j ( τ ′ ) dτ dτ ′ + 12 X ijkl Z Z β c i ( τ + ) c j ( τ ′ + ) U ijkl ( τ − τ ′ ) c k ( τ ′ ) c l ( τ ) dτ dτ ′ . (6)The the Weiss ﬁeld G − ij ( iν n ) = ( iν n + µ ) δ ij − t ij − ∆ ij ( iν n ) (7)absorbs the fermionic bath degrees of freedom, which are encapsulated in the hy-bridization function ∆ ij ( iν n ) = X λ V iλ V ∗ jλ iν n − ǫ λ . (8)Similarly, the dynamic interaction U ijkl ( iω n ) = U ijkl + X IJ h kj | I i D IJ ( iω n ) h J | il i (9)absorbs the bosonic bath degrees of freedom, which are encapsulated in the bosonichybridisation function D IJ ( iω n ) = X κ W Iκ W Jκ ǫ κ ( iω n ) − ǫ κ . (10)Note that we denote the fermionic Matsubara frequencies by iν n and the bosonic Mat-subara frequencies by iω n . Recall as well that the hybridisation amplitudes, W Iκ , arereal. 8omCTQMC can solve most impurities which can be deﬁned within this formalism:It supports complex valued hopping amplitudes t ij , interaction tensors U ijkl , and hy-bridisation functions (i.e., impurity models where not all V iλ can be chosen real, orequivalently, if ∆ ij ( iν n ) = ∆ ji ( iν n ) ). Such complex valued impurities may arise whenapplying a complex valued unitary transformation to the one-particle basis used bydefault in CTQMC. (See Appendix A.) However, there are a few restrictions:1. The particle-hole bilinears ˆ Q I in Eq. 5 need to satisfy [ ˆ H loc , ˆ Q I ] = 0 and [ˆ c i , ˆ Q I ] = q iI ˆ c i . (11)Notice that, as a consequence, the ˆ Q ’s commute among each other and that thequantum numbers q iI are real. This is because the last equation is equivalent torequiring that ˆ Q I = P i q iI ˆ c † i ˆ c i and that the particle-hole bilinears are Hermitian.2. The second restriction concerns the block-diagonal shape of the hybridizationfunction, ∆ ij , and the local Green’s function matrix, G loc ,ij ( τ ) = −h c i ( τ ) c j i loc ,where h◦i loc denotes the thermal average with respect to the impurity Hamilto-nian ˆ H loc . The requirement is G loc ,ij ≡ ⇒ ∆ ij ≡ . (12)Equivalently, the non-zero blocks of the hybridization function must lie within thenon-zero blocks of the atomic Green function. Since the block-diagonal shape of G loc is usually a consequence of the abelian symmetries of H loc , we may alsosay that the hybridization is not allowed to break the abelian symmetries of theimpurity.Having deﬁned the impurity model and its restrictions, let us write the problem ina form amenable to Monte-Carlo sampling.2.2. ExpansionThe CT-HYB algorithm begins by expanding the partition function of the impuritymodel in powers of the fermionic and bosonic hybridization functions. Using Eq. 11 tointegrate out the bosonic part of the expansion, this yields Z = Z D [ c, c ] e − S = ∞ X k =0 Z β dτ · · · Z τ k − dτ k Z β dτ ′ · · · Z τ ′ k − dτ ′ k X i ··· i k X i ′ ··· i ′ k w ( C ) (13)where C = i τ i ′ τ ′ · · · i k τ k i ′ k τ ′ k is the current conﬁguration. The weight of this conﬁgu-ration, w ( C ) , is the product of three distinct terms: the local weight, the fermionic bathweight, and the bosonic bath weight. This decomposition, w ( C ) = w loc ( C ) w hyb ( C ) w ret ( C ) ,9ill prove convenient in the future. These weights are given by the following equa-tions: w loc ( C ) = Tr e − β ˜ H loc T τ |C| Y r =1 ˆ c i ′ r ( τ ′ r )ˆ c † i r ( τ r ) , (14) w hyb ( C ) = Det ≤ r,s ≤|C| ∆ i r i ′ s ( τ r − τ ′ s ) , (15)and w ret ( C ) = exp (cid:18) |C| X r,s =1 X IJ ˜ q rI ˜ q sJ K IJ (˜ τ r − ˜ τ s ) (cid:19) , (16)respectively, where | C | is the expansion order, i.e., the number of operator pairs in thetrace. In Eq. 14, the effective local Hamiltonian is deﬁned by ˜ H loc = ˆ H loc + 12 X IJ ˆ Q I D IJ ( iω n = 0) ˆ Q J (17)and ˆ c i ( τ ) = e τ ˜ H loc ˆ c i e − τ ˜ H loc . In Eq. 16, the imaginary times of the conﬁguration arerelabeled as ˜ τ r = τ r and ˜ τ r + |C| = τ ′ r with r = 1 , . . . , |C| , and the quantum numbers arerelabeled as ˜ q rI = q i r I and ˜ q ( r + |C| ) I = − q i ′ r I . Finally, K IJ ( τ ) = − β D IJ ( iω n = 0) | τ | ( β − | τ | ) + X n =0 D IJ ( iω n ) β ( iω n ) e − iω n τ (18)is the β -periodic second primitive of D IJ ( τ ) − D IJ ( iω n = 0) δ ( τ ) .Just as we have written an expansion for the partition function, we can also writean expansion for a local observable, O , Z D [ c, c ] e − S O = ∞ X k =0 Z β dτ · · · Z τ k − dτ k Z β dτ ′ · · · Z τ ′ k − dτ ′ k X i ··· i k X i ′ ··· i ′ k w ( C , O ) , (19)where the weights w ( C , O ) are again the product of three distinct terms. Assumingthat the observable is of the form O = O ( τ ′′ ) · · · O l ( τ ′′ l ) with [ ˆ O a , ˆ Q I ] = q ′ aI ˆ O a , (20)the three terms are w loc ( C , O ) = Tr e − β ˜ H loc T τ ˆ O ( τ ′′ ) · · · ˆ O l ( τ ′′ l ) |C| Y r =1 ˆ c i ′ n ( τ ′ r )ˆ c † i r ( τ r ) , (21) w hyb ( C , O ) = Det ≤ r,s ≤|C| ∆ i r i ′ s ( τ r − τ ′ s ) , (22)and w ret ( C , O ) = exp (cid:18) |C| + |O| X r,s =1 X IJ ˜ q rI ˜ q sJ K IJ (˜ τ r − ˜ τ s ) (cid:19) , (23)10here |O| describes the number of operators in the observable. Here ˜ τ r and ˜ q rI are asdeﬁned in Eq. 16 for r = 1 , . . . , |C| , and for r = 2 |C| +1 , . . . , |C| + |O| we deﬁne ˜ τ |C| + a = τ ′′ a and ˜ q (2 |C| + a ) I = q ′ aI . Notice that the contribution from the fermionic hybridization isthe same as for the partition function expansion; the local operators of the observabledo not hybridize with the fermionic bath.The expansions of the partition function, Eq. 13, and an observable, Eq. 19, estab-lish the prerequisites for estimating observables within CTQMC. As we will discuss inthe following sections, there are two classes of observables: observables for which weonly need to sample the partition function conﬁguration space, and observables forwhich we must sample both the partition space and the observable space.2.3. Partition-CTQMCIn this section, we consider observables which can be estimated by sampling thepartition space. The expectation value of these observables can be written as theexpectation value of a random variable over the partition function expansion, hOi = Z − X C w ( C ) o ( C , O ) = h o ( C , O ) i w ( C ) /Z , (24)where o ( C , O ) is the random variable and w ( C ) /Z is the probability distribution. Thequantity o ( C , O ) is also called the “estimator” of the observable O in the conﬁguration C . We assume that w ( C ) is a positive real number for the time being. We will comeback to the general case where w ( C ) is negative or complex in Sec. 2.3.2, but beforedoing so, let us brieﬂy discuss the sampling of the partition function expansion.2.3.1. SamplingThe probability distribution w ( C ) /Z is sampled by a Markov chain C → C → . . . in the space of conﬁgurations, characterized by the transition probability P ( C i +1 |C i ) of going from conﬁguration C i to conﬁguration C i +1 . The Markov process convergesto w ( C ) /Z if the transition probability satisﬁes the detailed balance P ( C i +1 |C i ) w ( C i ) = P ( C i |C i +1 ) w ( C i +1 ) and ergodicity conditions.The Metropolis-Hasting algorithm (Appendix B) gives a possible choice for thetransition probability. To start, a trial conﬁguration C is chosen according to a trialprobability q ( C|C i ) , and we set C i +1 := C with probability p = min (cid:18) q ( C i |C ) q ( C|C i ) × w ( C ) w ( C i ) , (cid:19) (25)and C i +1 := C i otherwise. This transition probability p · q satisﬁes detailed balance.For an ergodic sampling of the conﬁguration space, one must propose updates(with an associated transition probability) that allow the Markov chain to explore all of11onﬁguration space. A natural choice is to propose the insertion or removal of a pairof operators. Assuming that the fermionic hybridisation does not break the abeliansymmetries of the impurity, this choice is generally ergodic. If this assumption doesnot hold, one must insert four or more operators at once in order to ensure ergod-icity, as discussed in Ref. [22] (where the hybridization breaks charge conservation)and Ref. [19] (where the hybridization breaks orbital parity conservation). Additionalupdates, or moves, are also required when sampling observable spaces, as discussedin Ref. [16].By default, ComCTQMC only considers pair insertion and removal when exploringpartition space, although the user may specify that four operator insertion and removalshould also be considered. Additional moves are considered in observable spaces, aswe will discuss. For additional detail on the moves implemented in ComCTQMC, e.g.,the values of q ( C i |C ) , we direct the reader to Appendix D.2.3.2. EstimatorsThe expectation value of an observable, O , may be cast as the expectation valueof a random variable over the partition space. For example, one can use Eq. 13 andEq. 19 to write hOi = Z − X C w ( C , O ) = Z − X C w ( C ) w ( C , O ) w ( C ) , (26)which yields the estimator w ( C , O ) /w ( C ) for observable O . In Eq. 26 we assume that w ( C , O ) vanishes whenever w ( C ) vanishes. Let us start with observables for which thisgenerally holds.First, let us discuss the reduced density matrix of the impurity, ˆ ρ . This observableallows us to calculate the expectation value of any static observable ˆ O on the impurity.That is, h ˆ O i = Tr ˆ ρ ˆ O . The estimator for the reduced density matrix is (ˆ ρ ) uv = β − (cid:28)Z β w loc ( C , ˆ P vu ( τ )) w loc ( C ) dτ (cid:29) w ( C ) /Z . (27)Here ˆ P vu = | v ih u | , where | u i and | v i are states of the local (impurity) Hilbert space.In this estimator, we take advantage of time translational invariance and integrateover τ in order to reduce statistical noise. (This integration is done analytically.) Notethat in this estimator there is no contribution from the bosonic bath, or equivalently, w ret . This follows from the restriction that the hybridisation is not allowed to breakthe abelian symmetries of the impurity, such that (ˆ ρ ) uv = 0 if [ ˆ P vu , ˆ Q I ] = 0 . Noticethat without this restriction, there generally are non-zero entries in impurity reduceddensity matrix which cannot be calculated by partition function sampling.By default, ComCTQMC will use the reduced density matrix to calculate the aver-age occupation of each orbital, the average energy of the impurity, and the average12 umber of electrons on the impurity. Users may supply additional quantum numbersor observables which may be computed using the reduced density matrix. CTQMCwill check during post-processing that these observables meet the criterion: i.e., thequantum numbers are, indeed, quantum numbers of the local Hilbert space, and theobservables commute with the local Hamiltonian. It will keep those observables andquantum numbers which are acceptable and discard those which are not.Next, consider the susceptibility of a quantum number , χ IJ ( τ ) = h Q I ( τ ) Q J i −h Q I ih Q J i . If the particle-hole bilinears satisfy Eq. 11, the estimator reads χ IJ ( iω n ) = 1 β ( iω n ) (cid:28) |C| X r,s =1 X IJ ˜ q sI ˜ q rJ e iω n (˜ τ s − ˜ τ r ) (cid:29) w ( C ) /Z . (28)In ComCTQMC, we measure only the dynamical parts, iω n = 0 , as the static measure-ment is better estimated from the reduced density matrix, and the Fourier transformis taken analytically. (The ˜ q ’s and ˜ τ ’s are deﬁned as in Eq. 16.)Next, consider the Green function , G ( τ ′ − τ ) = − Z − P C w ( C , c ( τ ′ ) c ( τ )) . Here,the assumption that w ( C , c ( τ ′ ) c ( τ )) vanishes when w ( C ) vanishes is problematic dueto the Pauli exclusion principle. Fortunately, this can be remedied: Instead insertingtwo operators in order to recover the Green’s function weight, we instead remove thehybridisation lines from two operators. The resulting estimator reads G ij ( iν n ) = − ∂ ln Z∂ ∆ ji ( iν n ) = − β − (cid:28) |C| X r,s =1 e iν n ( τ ′ s − τ r ) δ ii ′ s δ ji r ( M − ) sr (cid:29) w ( C ) /Z (29)where M rs = ∆ i r i ′ s ( τ r − τ ′ s ) . This approach solves our problem. However, it introduceslimitations. Namely, only entries G ij of the Green function for which ∆ ji are non-zero can be calculated with this estimator, and this estimator may fail if the bath hasvery few orbitals, as discussed in Appendix C. DMFT calculations are in general notaffected by these limitations.The most important observable for DMFT calculations is the self-energy , whichcan be calculated with the Green function and the Dyson equation, Σ = G − − G − . Thisis numerically unfavorable, as it ampliﬁes the statistical noise in the Green functionat higher frequencies: δ Σ = G − δG ∝ ν n δG . A better method calculates the followingcorrelation function H ij ( τ ) = −h T τ [ˆ c i , ˆ U ]( τ )ˆ c † j i − Z β X IJ q iI D IJ ( τ − τ ′ ) h T τ ˆ c i ( τ ) ˆ Q J ( τ ′ )ˆ c † j i dτ ′ (30)and obtains the self-energy from Σ = HG − .[15, 16] Here ˆ U is the interacting part of ˆ H loc in Eq. 2. This reduces the noise at the higher frequencies, as δ Σ = G − HδG + G − δH , which roughly goes as ν n times the noise in G and H .13he estimator for Eq. 30, that is, the improved estimator for the self-energy ,has a contribution from the static and the retarded part of the interaction. It reads H ij ( iν n ) = − β − (cid:28) |C| X r,s =1 e iν n ( τ ′ s − τ r ) δ ii ′ s δ ji r ( M − ) sr × ( h st s + h ret s ) (cid:29) w ( C ) /Z . (31)The contribution from the static part, h st s , is h st s = w U loc ,s ( C ) w loc ( C ) , (32)where w U loc ,s ( C ) is obtained from w loc ( C ) by replacing ˆ c i ′ s ( τ ′ s ) with the Bulla operator [ˆ c i ′ s , ˆ U ]( τ ′ s ) . See, for example, Ref. [15, 16]. The contribution from the retarded part, h ret s is h ret s = X IJ q i ′ s I (cid:18) D IJ ( iω n = 0) w loc ( C , Q J ) w loc ( C ) − |C| X t =1 ˜ q tJ L IJ ( τ ′ s − ˜ τ t ) (cid:19) , (33)where L IJ ( τ ) is the ﬁrst primitive of D IJ ( τ ) , and the ˜ q ’s and ˜ τ ’s are deﬁned as inEq. 16. While it is not clear a priori that w U loc ,s ( C ) vanishes whenever w loc ( C ) vanishes,we have not encountered any problems where this assumption is violated. Noticein this respect that, at least from the point of view of their symmetries, w loc ( C ) and w U loc ,s ( C ) have the same zero’s. This is because [ˆ c i , ˆ U ] transforms as ˆ c i under the sym-metries of ˆ H loc , since we can assume that ˆ U transforms as the identity.Until now, we assumed that the weights in the partition function expansion arepositive in order to interpret w ( C ) /Z as a probability distribution. If w ( C ) becomesnegative or complex, which is generally the case since we are dealing with fermions,we rewrite Eq. 24 as hOi = | Z | Z X C | w ( C ) || Z | w ( C ) | w ( C ) | o ( C , O ) = | Z | Z (cid:28) w ( C ) | w ( C ) | o ( C , O ) (cid:29) | w ( C ) | / | Z | (34)and sample the “bosonic” partition function | Z | = P C | w ( C ) | . The estimators we dis-cussed above get a phase factor w ( C ) / | w ( C ) | , and we need to estimate the ratio Z/ | Z | ,which is also called the “sign” of the Monte-Carlo simulation. The sign is not a scalarobservable in the proper sense and depends for example on the one particle basis[23, 24].The estimator of the sign is Z | Z | = (cid:28) w ( C ) | w ( C ) | (cid:29) | w ( C ) | / | Z | . (35)14otice that while w ( C ) / | w ( C ) | can be complex, the sign is always real because thepartition function Z is real. In real materials, it can be shown that the sign becomesexponentially small upon lowering the temperature, whereas the variance of the ran-dom variable w ( C ) / | w ( C ) | gets closer and closer to one. The exponential blow up ofthe statistical noise in Eq. 34 that this entails is called the “sign problem”. That is,each measurement provides Z/ | Z | information, and so one must sample an additional ( Z/ | Z | ) − conﬁgurations in order to receive a good estimate of any observable. A signbelow 0.1 is generally considered unacceptable, as it indicates at least an order ofmagnitude more computational resources must be dedicated to the CTQMC.The approach of removing hybridisation lines as in Eq. 29 can also be used to obtainestimators for higher order correlation functions, for example for the two particleGreen function h c i c j c k c l i = − Z − ∂ Z/∂ ∆ ji ∂ ∆ lk . Similar to the one particle Greenfunction, the limitation is that both ∆ ji and ∆ kl are non-zero. As a consequence, notall entries of the two particle Green function can be accessed. Consider, for example,an impurity model with a diagonal Weiss Field and a Kanamori interaction: entrieswith i = j = k = l can be ﬁnite due to spin-ﬂip processes, while only entries with i = j and k = l (or i = l and j = k ) can be accessed since the hybridisation is diagonal.To overcome this limitation, one must sample the conﬁguration space of not only thepartition function, but also the observable. This is called worm sampling. Let usdiscuss.2.4. Worm-CTQMCLet us brieﬂy outline the theory behind worm sampling using the one particleGreen function as our observable. We begin with the observable expansion, Eq. 19,for O = c ( τ ) c ( τ ′ ) . Let us also include the Fourier transform, so that we sample in theMatsubara basis. Sampling this expansion yields an estimate of the “Green function” ˜ G ( iν n ) = − ( β | Z G | ) − X C ,ττ ′ w ( C , c ( τ ) c ( τ ′ )) e iν n ( τ − τ ′ ) = − β − (cid:28) w ( C , c ( τ ) c ( τ ′ )) | w ( C , c ( τ ) c ( τ ′ )) | e iν n ( τ − τ ′ ) (cid:29) | w ( C ,c ( τ ) c ( τ ′ )) | / | Z G | . (36)This function is not normalized by the partition function, Z , like the true Green’sfunction. Instead, it is normalized to | Z G | = P C ,ττ ′ | w ( C , c ( τ ) c ( τ ′ )) | . To get the realGreen’s function, then, we need an estimate of | Z G | /Z . To this end, we consideran extended sampling space which consists of both the Green function space S G = { ( C , τ τ ′ ) } and the partition function space S Z = {C} , with probability distribution | w ( C , c ( τ ) c ( τ ′ )) | / ( | Z G | + | Z | ) and | w ( C ) | / ( | Z G | + | Z | ) , respectively. The number of sam-ples N G and N Z taken in each space when sampling S = S G ∪ S Z is then proportional15o | Z G | and | Z | , so that G ( iν n ) = | Z | Z × | Z G || Z | × ˜ G ( iν n ) ≈ | Z | Z × N G N Z × ˜ G ( iν n ) . (37)Notice that this extended sampling simultaneously yields estimates of the “Green func-tion” in Eq. 36 (when in S G ), the sign Z/ | Z | , as well as any other observables one cansample in partition space (when in S Z ), and the relative volume | Z G | / | Z | of S G and S Z .The operators c ( τ ) c ( τ ′ ) are called “worm” operators, as they are said to worm betweenthe partition and observable spaces. Thus, we arrive at the name Worm-CTQMC.2.4.1. Correlation FunctionsIn this section, we list the correlation functions for which worm sampling is imple-mented. These are G (1) ij ( τ ) = −h T τ ˆ c i ( τ )ˆ c † j ( τ ) i ( ∗ ) (38) G (2) ijkl ( τ , τ , τ ) = −h T τ ˆ c i ( τ )ˆ c † j ( τ )ˆ c k ( τ )ˆ c † l ( τ ) i ( ∗ ) (39) G (2 ,ph ) ijkl ( τ , τ ) = −h T τ ˆ c i ( τ )ˆ c † j ( τ )ˆ n ( ph ) kl ( τ ) i ( ∗ ) (40) G (2 ,pp ) ijkl ( τ , τ ) = −h T τ ˆ c i ( τ )ˆ c j ( τ )ˆ n ( pp ) kl ( τ ) i ( ∗ ) (41) G (2 ,ph ) ijkl ( τ ) = −h T τ ˆ n ( ph ) ij ( τ )ˆ n ( ph ) lk ( τ ) i (42) G (2 ,pp ) ijkl ( τ ) = −h T τ ˆ n ( hh ) ij ( τ )ˆ n ( pp ) kl ( τ ) i (43)where ˆ n ( ph ) ij = ˆ c † i ˆ c j , ˆ n ( pp ) ij = ˆ c † i ˆ c † j , ˆ n ( hh ) ij = ˆ c i ˆ c j and τ = τ − τ . For the Green functionsmarked with an asterisk, improved estimators are implemented. The higher-order cor-relation functions are obtained by replacing the ﬁrst operator ˆ c i ( τ ) in the respectiveGreen function with the Bulla operator [ˆ c i , ˆ U ]( τ ) and adding the contribution comingfrom the retarded interaction, that is, H ( . . . ) = −h T τ [ˆ c i , ˆ U ]( τ ) · · · i − Z β X IJ q iI D IJ ( τ − τ ′ ) h T τ ˆ c i ( τ ) ˆ Q J ( τ ′ ) · · · i dτ ′ , (44)where the · · · denotes the operators other than ˆ c i ( τ ) in the Green function. For anexample, see Eq. 30, which gives the results for the one-particle Green function.2.4.2. SamplingAs outlined above for the one-particle Green function, Worm-CTQMC extends theconﬁguration space by adding additional, worm spaces to the partition space. That is,we must sample an additional worm space for each correlation function that needs tobe sampled. In general, the additional worm spaces may be much larger or smaller16han each other or than partition space. In this case, the Worm-CTQMC will spendthe vast majority of its time in the largest conﬁguration spaces. To compensate, thevolume of each space is normalized by an additional term, η X . Therefore, we samplethe union of normalized spaces S = [ X η X S X . (45)In our case, the union goes over the Green’s functions listed in Eqs. 38 to 43, and thecorresponding higher-order correlators deﬁned in Eq. 44, and the partition function X = Z . The probability distribution on η X S X is proportional to η X × | w ( C , O ) | , where η X is discussed in more detail below, and O is given as follows:( X = G ): For the Green functions, O represents the operators which deﬁne the function,e.g., O = ˆ c i ( τ )ˆ c † j ( τ )ˆ n ( ph ) kl ( τ ) for G (2 ,ph ) ( τ , τ ) in Eq. 40.( X = H ): For the improved estimators, O represents the operators which deﬁne the staticpart of Eq. 44, e.g., O = [ˆ c i , ˆ U ]( τ )ˆ c j ( τ )ˆ n ( pp ) kl ( τ ) for H (2 ,pp ) ( τ , τ ) . The dynamicpart of the improved estimator is measured in the associated Green’s functionspace.( X = Z ): For the partition function η X = 1 (by convention) and the probability distributionis proportional to | w ( C ) | .When sampling the space S with a Markov-Chain using the Metropolis-Hastingalgorithm, a new conﬁguration ( C ′ , O ′ ) is proposed according to the trial probability q ( C ′ , O ′ |C , O ) , and accepted with probability p = min (cid:18) q ( C ′ , O ′ |C , O ) q ( C , O|C ′ , O ′ ) × η X ′ η X × (cid:12)(cid:12)(cid:12)(cid:12) w ( C ′ , O ′ ) w ( C , O ) (cid:12)(cid:12)(cid:12)(cid:12) , (cid:19) . (46)The trial conﬁguration ( C ′ , O ′ ) can lie in the same space as the present conﬁguration( X = X ′ ) or not ( X ′ = X ). The present CTQMC solver implements the following moves:( X 6 = X ′ ): Insert or remove the worm to connect the worm space with the partition space,that is, ( C , O ) ↔ ( C , ∅ ) . The ﬂavors and imaginary times of the inserted wormoperators are drawn uniformly.( X 6 = X ′ ): Insert or remove a ˆ c ( τ )ˆ c † ( τ ′ ) , [ˆ c, ˆ U ]( τ )ˆ c † ( τ ′ ) or ˆ n ( ph ) ( τ ) from the worm to connectdifferent worm spaces (if compatible), e.g., ( C , O ) ↔ ( C , O + ˆ c ( τ )ˆ c † ( τ ′ )) for O ∈ G (1) or H (1) . The ﬂavors and imaginary times of the inserted operators are drawnuniformly.( X = X ′ ): Swap the imaginary time of an ˆ c ( τ ) or ˆ c † ( τ ) operator in the worm and a partitionoperator of the same ﬂavor, e.g., (ˆ c ( τ ) · · · , ˆ c ( τ ′ ) · · · ) ↔ (ˆ c ( τ ′ ) · · · , ˆ c ( τ ) · · · ) . Also17 Z G (2, ph ) susc if r < p Z G (2, pp ) susc Z G (1) Z G (2, ph )Hedin Z G (2, pp )Hedin Z H (2, ph )Hedin Z H (2, pp )Hedin Z G (2) Z H (1) Z H (2) w ( C’ , O ) w ( C ) Figure 2: The various conﬁguration spaces of ComCTQMC and their connections. The Markov chainmoves between or within a conﬁguration space by comparing the relative weight of the current andproposed conﬁguration while accounting for the detailed balance and the renormalized volume of thestarting and ending conﬁguration spaces. (See Eq. 46 and Eq. 47. The proposed move is accepted if auniform random number r ∈ [0 , < p .) implemented is a related move for the ˆ n ( ph ) ( τ ) , ˆ n ( pp ) ( τ ) and ˆ n ( hh ) ( τ ) operators ina worm. SeeAppendix D for details.( X = X ′ ): Partition space conﬁguration moves in all spaces, that is, ( C , O ) ↔ ( C ′ , O ) .For the X 6 = X ′ moves listed above, the contribution w hyb from the hybridisation tothe weight w in the Metropolis-Hasting acceptance ratio cancels and is therefore notcalculated.Figure 2 illustrates the conﬁguration spaces sampled by ComCTQMC. The spaceswhich ComCTQMC can worm between are connected by red lines. Also shown is aMarkov chain in some conﬁguration. C , proposing a move which worms between thepartition and two-particle Green’s function space. (That is, a move which inserts thefour worm operators c i c j c k c l at random points on the imaginary time axis.) Also shownis a depiction of the disparity in volume between various conﬁguration spaces. Inreality, the volume of one conﬁguration will often be orders of magnitude larger thanthe others, which is why we must scale the relative volume of each space. The scalingfactors η X in Eq. 45 are chosen such that the Markov-Chain spends approximately the18ame amount of time in each space, that is, η X ≈ | Z || Z X | , (47)where | Z X | = P C , O∈X | w ( C , O ) | . This is achieved with a Wang-Landau algorithm [25,26, 14].2.4.3. EstimatorsIn this section, the estimators for the correlation functions listed in Sec. 2.4.1 arediscussed. To this end, we start with a detailed discussion of the estimator for theone-particle Green function and the associated improved estimator.The estimator for the one-particle Green function reads G ij ( iν n ) = − β − N G h w/ | w | × δ ii ′ δ jj ′ e iν n ( τ − τ ′ ) i | w ( C ,G ) | / | Z G | . (48)The subscript denotes that this measurement is averaged across all measurementstaken in in the Green’s function conﬁguration space. Therefore, the normalizationfactor is given by N G = | Z G | Z ≈ | Z | Z × N G η G N Z . (49)Here N Z and N G are the number of samples taken in the partition space and the Greenfunction space, respectively. For the higher-order correlation function H = H st + H ret in Eq. 44, the static part is estimated from the S H space, H st ij ( iν n ) = − β − N H h w/ | w | × δ ii ′ δ jj ′ e iν n ( τ − τ ′ ) i | w ( C ,H ) | / | Z H | , (50)while the retarded part is estimated from the Green function space S G , H ret ij ( iν n ) = − β − N G h w/ | w | × δ ii ′ δ jj ′ e iν n ( τ − τ ′ ) × h i | w ( C ,G ) | / | Z G | , (51)where h = X IJ q i ′ I (cid:18) D IJ ( iω n = 0) w loc ( C , ˆ c i ′ ( τ )ˆ c † j ′ ( τ ′ ) ˆ Q J ) w loc ( C , ˆ c i ′ ( τ )ˆ c † j ′ ( τ ′ )) − |C| +2 X t =1 ˜ q tJ L IJ ( τ − ˜ τ t ) (cid:19) . (52)Here ˜ τ t and ˜ q tI are as deﬁned in Eq. 16 for t = 1 , . . . , |C| . For t = 2 |C| + 1 we set ˜ τ t = τ and ˜ q tI = q i ′ I (which comes from ˆ c i ′ ), and for t = 2 |C| + 2 we set ˜ τ t = τ ′ and ˜ q tI = − q j ′ I (which comes from ˆ c † j ′ ). 19he estimators for the other correlation functions are obtained analogously, andwe list just the estimators for the other Green functions G (2) ( iν n , iν n ′ , iω m ) = −N G (2) h w/ | w | × e iν n τ + iν n ′ τ + iω m τ i G (2) (53) G (2 ,l ) ( iν n , iω m ) = −N G (2 ,l ) (2) h w/ | w | × e iν n τ + iω m τ i G (2 ,l ) (2) (54) G (2 ,l ) ( iω m ) = −N G (2 ,l ) (1) h w/ | w | × e iω m τ i G (2 ,l ) (1) (55)where l = ph, pp and the ﬂavor indices are omitted. We store the estimators for allworm-based observables in either in this Fourier basis [17, 16, 18] or mixed Legendre-Fourier basis [13]. For the technical details of the mixed representation, we refer thereader to Ref. [13].2.4.4. SusceptibilitiesComCTQMC converts all worm-based estimators and improved estimators into sus-ceptibilities, G ijkl , H ijkl → χ ijkl , during a post-processing step. To do this, the discon-nected part of the susceptibility is computed as χ ( ph )disc ,ijkl ( iω m ) = n i n k δ ij δ kl δ m (56) χ ( pp )disc ,ijkl ( iω m ) = 0 (57) χ ( ph )disc ,ijkl ( iν n , iω m ) = G ii ( ν n )[ n l δ ij δ kl δ m − G ll ( iν n − iω m ) δ ik δ jl ] (58) χ ( pp )disc ,ijkl ( iν n , iω m ) = G ii ( iν n ) G kk ( iω m − iν n )( δ ik δ jl − δ il δ jk ) (59) χ disc ,ijkl ( iν n , iν n ′ , iω m ) = δ ij δ kl δ m G ii ( iν n ) G kk ( iν n ′ ) − δ il δ jk δ nn ′ G ii ( iν n ) G kk ( iν n − iω m ) . (60)Note that these two- and three-point deﬁnitions differ from the typical expressions,e.g., those written in Ref. [18]. This discrepancy arises because we write our ob-servables, Eqs. 38 to 43, using a different operator ordering. Our ordering allowsus to deﬁne the operators in terms of bilinears, which simpliﬁes the implementation.Note that the occupations n i are measured in the partition space, and the Green’sfunctions can be measured in either partition space or in the Green’s function wormspace. ComCTQMC will use the worm-space measurements if they are available andthe partition-space measurements if they are not.Once the disconnected parts are computed, the susceptibility is computed from theestimator or improved estimator as[16] χ ijkl = G ijkl − χ disc ,ijkl (61) χ ijkl = X m G mi ( iν n )Σ mi ( iν n ) χ disc ,ijkl − G mi ( iν n ) H ijkl δ im + G mi Σ mi ( iν n ) , (62)20 n-accelerated CPUaccelerated CPUInitialization Thermalization Measurement FinalizationRead & broadcast H Reduce & broadcast η χ Reduce & write O Figure 3: A simple illustration of the parallelism and structure of CTQMC. Each blue path represents thework of a single un-accelerated CPU, i.e., it represents a simulation. The red paths together representthe work of a single accelerated CPU (which handles, in this illustration, three simulations). The simula-tions proceed through four phases: initialization, thermalization, measurement, and ﬁnalization. Threepoints of CPU-CPU communication occur in these phases, as indicated by a dotted lines: (1) The controlparameters and impurity problem are broadcast to each CPU; (2) the scaling factors, η χ , are averagedacross all simulations and then broadcast back to all ranks; and (3) the results are gathered from allsimulations. respectively, where ( iν n ) is the ﬁrst fermionic frequency of the three- or four-pointsusceptibility χ ijkl . (We do not compute improved estimators for the two-point suscep-tibility.)With the theory established, let us discuss our implementation.

3. Implementation

In this section, we present the various techniques implemented in ComCTQMCin order to reduce or overcome the computational burden. Here we will discuss theparallelism of ComCTQMC and the GPU acceleration of the CTQMC algorithm. Detailsregarding the optimization of the Monte-Carlo moves are available in Appendix D, andthe application of symmetries which help with the time-to-solution for two-particlecorrelators are given in Appendix E. Before discussing these topics, however, it isuseful to outline the phases of the CTQMC simulation.3.1. StructureIn ComCTQMC, the CTQMC simulation is divided into four major phases: (1) Ini-tialization, (2) thermalization, (3) measurement, and (4) ﬁnalization. These phases arelisted in Fig. 3, which depicts the general parallelism scheme.(1) During initialization, ComCTQMC reads to the input ﬁles, distributes the impu-rity problem and control parameters to all CPUs, initializes the GPU, and pre-computes21hose quantities which remain static throughout a CTQMC simulation (e.g., it decom-poses the Hilbert space into invariant spaces, computes the eigenvalues and eigen-vectors of these spaces, and generates the operator matrices associated with theseeigenstates.)(2) During thermalization, the simulations (Markov chains) are run for a user spec-iﬁed number of minutes or steps, and the Wang Landau algorithm is used in order todetermine the relative size of each conﬁguration space and compute the resulting setof η χ . The goal in this phase, aside from gathering the set of η χ , is to reach a physicalregion of the conﬁguration space.(3) During measurement, the simulations are run for a user speciﬁed number ofminutes or steps, as in the thermalization phase. Instead of computing η χ , however, allof the observables are sampled. The goal in this phase is for the estimators associatedwith these oberservables to converge.(4) During ﬁnalization, the results from each simulation are gathered, and the erroris estimated (using the set of results from all Markov chains simulated or the previousset of results). Then, the results are written to a ﬁle. Additionally, the ﬁnal conﬁgura-tion of each simulation is written to ﬁle. This allows subsequent runs of ComCTQMCto eliminate or at least substantially reduce the length of the thermalization phase.Now, let us discuss the parallelism and how it relates to these phases.3.2. ParallelismOne of the major reasons to use CTQMC is its nearly ideal weak scaling. Thatis, one can simulate many independent Markov chains and then collect the resultsfrom each simulation. Very little or no communication between these simulations isrequired, as shown in Fig. 3. Therefore, a CTQMC code can simulate one Markovchain per CPU without incurring a noticeable performance loss. ComCTQMC usesthe message passing interface (MPI) to initialize and distribute these Markov chains,collect the results, and compute the average η O after thermalization. A small amountof serial code is required before or after these communication events, e.g., to computethe statistics of the results (across the many Markov chains simulated), conduct I/O,and decompose the Hilbert space of the local Hamiltonian into invariant subspaces, asillustrated in Fig. 3.However, one should note that the solution is being worked on only during themeasurement phase; therefore, the remaining phases essentially do not scale with thenumber of workers. Still, a substantial thermalization phase is only required for theﬁrst CTQMC run, and the initialization and ﬁnalization phases tend to be very quick.(The decomposition of the Hilbert space and creation of the operator matrices cantake a lot of computation in hardest, low-symmetry f -shell systems. However, thiscomputation can be parallelized across a few nodes.)22 ea k s ca li ng Nodes

10 min.100 min.1 min.δ-Pu, 600 KMeasurement time

Figure 4: The weak scaling of ComCTQMC with the number of nodes for variations in the length of themeasurement phase. The lack of communication during this phase leads to nearly ideal scaling, providedthe measurement is at least 10 minutes long. Tests were run on the supercomputer Summit at the OakRidge National Laboratory. These results assume that the simulations are already thermalized.

The structure of this parallelism is simple and effective. As no CPU-to-CPU com-munication occurs during the computationally demanding measurement or thermali-sation phases of the CTQMC, ComCTQMC scales extremely well with the number ofCPUs. Indeed, if the measurement time is long, ComCTQMC scales ideally, as shownin Fig. 4(a): Provided the user does not require measurement across thousands ofCPUs within 10 minutes, the scaling remains above 0.90 of the ideal. We reiteratethat this feature of CTQMC is one of the reasons it has risen to prominence, and is notunique to ComCTQMC.Now, let us discuss the acceleration of the CTQMC algorithm by a GPU.3.3. GPU acceleration of the local traceWhen one seeks to use GPUs to accelerate an algorithm, it is critical to identifythe computational bottleneck. Then, one must ensure that this bottleneck involves themanipulation of large matrices. In impurity solvers, one must deal with a Hilbert spacethat scales as d = 2 n , where n is the number of orbitals. In completely un-optimizedCT-HYB, computing the local trace involves multiplying k matrices of rank n , where k is the expansion order of the Markov chain, i.e., k = | C | + | O | . In an f -shell problem, n = 14 , the Hilbert space has a dimension of 16,384, and solving the impurity problembecomes impossible. (In comparison, d -shell impurities, n = 10 , have a Hilbert space ofrank 1024.) While many optimizations have been adopted by current CTQMC solverswhich allow us to solve the impurity problem [27, 28, 29, 9, 19, 30], the local traceremains the bottleneck due to this underlying scaling of the Hilbert space. This isparticularly true in d -shell and f -shell problems at moderate-to-high temperatures.Fortunately, this scaling also implies that the algorithm may be accelerated by usingGPUs to compute the local trace. 23efore discussing in detail how we use GPUs to accelerate the trace computa-tion, it will be useful to outline more concretely how this computation is handled inComCTQMC. ComCTQMC employs the lazy skip-list[9] algorithm to optimize the tracecomputation. Here we only provide a brief overview of the algorithm; for a detaileddescription of the algorithms, we refer the reader to Ref. [9].3.3.1. Computing the local impurity traceIn a naive implementation of CTQMC, the local impurity trace in Eq. 14, is evalu-ated as follows. First, one represents the set of operators in the many-bodied Hilbertspace, H , via the basis ( F i ) nm = h m | ˆ c i | n i , where | n i and | m i are states in the Hilbertspace. Then, the trace can be represented as w loc ( C ) = Tr P β − τ k F † i k P τ k − τ ′ k F i ′ k · · · F i P τ − τ ′ F † i P τ ′ , (63)where P τ = exp( − τ H loc ) is the (diagonal) projector matrix and we have already appliedthe time ordering operator. In this un-optimized implementation, the computationalburden is O [ k (2 n ) ] , where n is the number of ﬂavors in the impurity problem. (Thatis, we must compute order k matrix products for matrices of rank n .) This calculationis essentially impossible for an f -shell problem where n = 14 .Fortunately, the local Hamiltonian contains Abelian symmetries associated withsome quantum numbers, e.g., the particle number, S z , S , J z , J , etc. This allows usto decompose the Hilbert space into a series of sectors, H = M q H ( q ) , (64)where q enumerates the sectors of the Hilbert space, each with its own unique set ofquantum numbers. Moreover, it allows us to deﬁne a new basis for the creation andannihilation operators, F α ( q i ) , which uniquely maps one sector onto another sector, q i → q i +1 . (We are limited to Abelian symmetries, because we require the unique-ness of this mapping.) Therefore, we can deﬁne the operator matrices [ F α ( q i )] nm = h m ( q i +1 ) | ˆ c α | n ( q i ) i , where | m ( q i +1 ) i and | n ( q i ) i are many-bodied states in the Hilbertsubspaces (or sectors) q i +1 and q i , and α speciﬁes both the operator ﬂavor and type(annihilation or creation). In this basis, we can rewrite the local trace as a sum over aset of initial sectors, q . That is, w loc ( C ) = X q Tr P β − τ k ( q k ) F α k ( q k − ) · · · F α ( q ) P τ ( q ) . (65)Note that P τ ( q ) maps q → q , and that this matrix product involves the “string” of sec-tors q → q → · · · → q k = q , where the ﬁnal equality is a result of our construction of24he CTQMC updates. Often, this string will visit a conﬁguration which is not physical,e.g., that violates the Pauli principal. We say that these “illegal” conﬁgurations are apart of the zero sector and that any matrix product which would visit sector zero doesnot survive. Any such string does not contribute to the impurity trace and does notneed to be computed. Therefore, we also deﬁne a mapping function s α ( q ) : q → q ′ . Thisallows us to follow the string of visited sectors for every q without requiring any costlymatrix multiplications, and then compute the trace for the set of q which survive thecurrent conﬁguration of operators.In general, the operator matrices are of very different sizes and are not necessarilysquare (although the end result is always square). Still, we can bound the scaling ofthe new algorithm by assuming every operator matrix is a square matrix of rank r max ,where r max is the largest dimension of the largest subspace (and operator matrix). If n ss is the number of Hilbert subspaces and all of these subspaces are in the list ofsurviving q , the scaling is bounded by O ( kn ss r max ) . A lower bound can be deﬁnedby noting that if all of the subspaces are of the same size, then n = n ss r max , suchthat O ( kn ss r max ) < O < O ( k (2 n ) /n ss ) . Either bound is dramatically smaller than thenaive implementation which does not use Abelian symmetries. Still, the exponentialscaling will eventually preclude the evaluation of the local trace in sufﬁciently largecluster problems unless n ss = 2 n (and r max = 1 ). (This occurs for a diagonal t ij andIsing-approximated interaction.)As shown in Fig. 1 and as we will discuss, the size and number of subspaces notonly determines the speed of the CTQMC, it also has a large impact on the degree ofacceleration a GPU offers. Indeed, GPUs are particularly good at multiplying largematrices when compared to CPUs.In addition to this optimization of the local trace computation, we include anothermajor optimization in ComCTQMC. In general, only a few operators are inserted intothe local trace, which might have hundreds of operators. This means that the majorityof the subproducts in the trace are left untouched. One can store these subproducts ina data structure, e.g., a balanced binary tree[7] or a skip-list[9], to drastically reducethe number of matrix multiplications required. Both of these options offer essentiallythe same performance improvement, requiring O [log( k )] matrix multiplications ratherthan O ( k ) , but the skip-list is much easier to implement.Furthermore, one can evaluate the bounds of the local trace rather than the traceitself in order to decide whether a proposed conﬁguration will be accepted. (As al-ways, the random number of the Metropolis-Hastings algorithm is drawn ﬁrst, andthis number must be between the lower and upper bounds of the acceptance ratio.)As it is typically much faster to compute these bounds, one can drastically acceler-ate the CTQMC algorithm by relying upon these bounds to quickly reject proposedmoves. Note that if the proposed move is accepted, the local trace must be computed.25 PU accerlated CPU(2 streams) GPU

CUDA

Cores Cores M a t r i x a l g e b r a CPUCPU

MPI Parallelization (3 CPU) dynamicWeights:Markov chains(4)moveproposebathtraceMeasure Waiting for CPUtracetracetracetrace M a t r i x a l g e b r a M a t r i x a l g e b r a trace M a t r i x a l g e b r a Figure 5: Parallelism and acceleration in ComCTQMC. A single Markov chain is created on each MPIimage (CPU), with one exception: Each available GPU is paired with a CPU. This GPU accelerated CPUcreates a (user speciﬁed) number of a Markov chains. While the non-accelerated CPUs handle the localtrace computation on their own, the accelerated CPUs hand off the matrix algebra required to the GPU.While waiting for the GPU to handle this algebra, the CPU moves onto the next Markov chain which isnot waiting for the GPU to ﬁnish its work.

However, the acceptance rate is extremely low in real materials at low-temperatures.Therefore, this approach offers a substantial improvement. In the lazy-trace algo-rithm, the bounds are established by taking the submultiplicative matrix norm of thematrices in the trace. The lazy-trace algorithm utilizes the spectral norm to capturethe large variation in the magnitude of the time-evolution operators.Now, let us discuss how a GPU can be best used to accelerate these calculations.3.3.2. Saturating the GPUsAs we have discussed, the Hilbert space of an impurity quickly becomes pro-hibitively large, particularly for f -shell impurities. Furthermore, this Hilbert spacecan be decomposed into a block diagonal form according to the symmetries of the lo-cal Hamiltonian. The dimension of these blocks is much smaller than the full Hilbertspace. In a low-symmetry f -shell ( d -shell) problem, for example, the largest block hasa dimension of 858 (28). In a high-symmetry impurity, this number drops to 327 (12).These matrices are small in comparison to the capabilities of a GPU, even for the high-26ymmetry f -shell impurity. By handling the trace computations for a large number ofMarkov chains on a single GPU, however, one can still saturate a GPU. There are anumber of ways to accomplish this, most of which are not appropriate for CTQMC, aswe will discuss.Consider a compute node with 7 CPUs and 1 GPU (i.e., a sixth of a Summit node).On this node, we can easily envision each CPU sending instructions to the GPU, suchthat the GPU has a workload that is seven times larger. However, CUDA will serializethese instructions, and so the GPU will be no more saturated than in the case whereone CPU sends instructions to the GPU. The CUDA Multi-Process service (MPS) offersan easy way to parallelize the instructions from multiple CPUs, so that the GPU can si-multaneously work on the instructions from all seven CPUs. With MPS, we can indeedhave the GPU handle a much larger computational burden. However, MPS essentiallypartitions the device into seven different units, each of which is paired with a CPU.This means that, each of the seven units must store the operator matrices involved inthe local trace, and there will be remainders of each seven units which are not wellsaturated.CUDA streams provide an alternate framework through which to provide the GPUwith parallelized instructions. With streams, a CPU divides the instructions it providesto the GPU into multiple streams, and these streams are processed simultaneously bythe GPU (if it has the necessary resources). Fortunately, the concept of streams areeasily applied to CTQMC: We can associate each CUDA stream with a single Markovchain and have the accelerated CPUs handle multiple of Markov chains. In this frame-work, a CPU will advance one Markov chain until the local impurity trace must beevaluated, at which point the GPU will begin working on the trace. While the GPUworks, the CPU will move on to the next Markov chain, advancing the chain until thetrace must be evaluated, at which point the GPU will begin working on that impuritytrace (as it simultaneously continues to work on the previous trace). The CPU contin-ues to loop through all of the Markov chains it controls, completing the non-impuritytrace work for all Markov chains which are not waiting for the GPU. See a schematicof this process in Fig. 5.The global memory of a device is shared between the streams of a single CPU, sothat we do consume as much memory as we would with MPS; and we can dynamicallysaturate the GPU unit by increasing the number of Markov chains handled by eachCPU. Indeed, we can drop MPS entirely and instead have a single CPU handle a largenumber of simulations. This drastically reduces the global memory requirements. Asshown in Fig. 6, this approach incurs no penalty in scaling (indicating that the per-formance is limited by the GPU and CPU-GPU communication, and not by the CPU).Moreover, this ﬁgure shows that streams outperform the purely MPS approach, evenwhen the same number of Markov chains are simulated on the GPU.27 cce l e r a ti on o f a nod e Markov chains / GPU

CUDA MPSStreamsCUDA MPS + Streamsδ-Pu, 600 K Memory limitwith MPS C P U / G P U Figure 6: The acceleration offered by various GPU parallelism schemes for variations in the number ofMarkov chains simulated by each GPU. In Fig. 1, this plot corresponds to the smallest f -shell problem( n = 327 ). For the larger f -shell problems ( n = 858 ), MPS cannot be used without running out ofmemory. (In contrast with Fig. 1, acceleration is measured across an entire node.) Tests were run on thesupercomputer Summit at ORNL, which features nodes with 42 CPU and 6 GPU (7 CPU / GPU). Our approach is more ﬂexible, portable, and enables strictly better acceleration inall of our tests than a solution which uses MPS. By using it, we are able to achievea modest acceleration even for d -shell impurities, as shown in Fig. 1. The most im-pressive acceleration, however, is for the hardest problems. For a complex valued,low-symmetry f -shell impurity, a GPU accelerates a single CPU by over 600x on Sum-mit (the six GPUs accelerate the 42 CPU node by over 85x). Without the accelerationoffered by GPUs, these problems are prohibitively expensive. For this problem, around50 node hours are required to gather a reasonable self-energy. For an LDA+DMFT so-lution, around 1000 node hours would be required. Without using the GPUs, we wouldrequire over 600,000 node hours for a single LDA+DMFT run. If we were to try evalu-ating a more difﬁcult problem, i.e., one with multiple impurities at lower temperaturesand with an off-diagonal hybridisation, 10s of millions of node hours would be requiredfor a single LDA+DMFT calculation. If one wanted to gather the vertex functions, thecomputational cost might exceed the entirety of an INCITE award. Simply put, theGPU acceleration grants access to previously inaccessible regimes, problems, and ob-servables.This description provides only the broad structure of the GPU-CPU interface. Letus brieﬂy provide more detail on our implementation.First, let us overview the GPU computation kernels, the most critical of which arethe matrix multiplication and Frobenius norm kernels. The matrix multiplication is28andled by the CUTLASS library, from which we design an appropriate double pre-cision matrix-multiplication kernel. The performance of this kernel is very similar tothe kernels offered by the cuBLAS libraries. Moreover, it is much more ﬂexible; in thefuture we hope to achieve an even more dramatic acceleration in the hardest prob-lems by using a mixed-precision matrix-multiplication kernel. (Currently, we are usingdouble-precision throughout the CTQMC. However, the NVIDIA Volta and other HPCdevices have dedicated single and half-precision processing units which we do notcurrently utilize. CUTLASS allows us to target these cores within device code in thefuture, whereas cuBLAS does not.) Additionally, we implemented CUDA kernel func-tions for our own norm,reduction, time-evolution, trace, and matrix addition kernels.Finally, we note that our CPU-GPU structure requires that we only use asynchronousCUDA functions. As memory allocation synchronizes the device, we pre-allocate theentire GPU and develop our own memory allocator which can work asynchronouslywith that block of pre-allocated memory. One of the additional beneﬁts of our ap-proach is that this allocator catches when the device runs out of memory. When itdoes, it deletes a stream (and Markov chain) and frees the associated memory. Inthis way, ComCTQMC dynamically ﬁnds the optimal number of Markov chains, pro-vided that the number of Markov chains per device (which is speciﬁed by the user) issufﬁciently high. That is, it is large enough that the device runs out of memory. (Sub-sequent runs should use the optimized number, as some time will be wasted workingon simulations that will be discarded later when the device runs out of memory.)

4. Examples and results

In this section, we will present a results for two examples where we will discusssome best-practices and usage. These examples cover a single-band Hubbard model,which a user can simulate on a modern laptop; and high-symmetry δ -plutonium im-purity which requires access to a cluster and can be accelerated by a GPU. Here weprimarily give the highlights of the more comprehensive tutorials included with theComCTQMC distribution. Note as well that the user guide provided with ComCTQMCdescribes how to install and use ComCTQMC, and contains a comprehensive list of theinput parameters.4.1. Single-Band Hubbard modelFirst, we will show how to run a simple example: The impurity of a single-bandHubbard model. The Hamiltonian for this impurity is H = X k σ ǫ k c † k σ c k σ + ǫ f X σ f † σ f σ + U n f, ↑ n f, ↓ + X k σ V k ( c † k σ f σ + f † σ c k σ ) , (66)29here c † k σ and c k σ are the annihilation and creation operators for the non-interactingbath state with wavevector k , spin σ , and non-interacting energy ǫ k , and f † σ and f σ arethe annihilation and creation operators for the interacting impurity state with spin σ and energy ǫ f . Here solve this model with three bath states at ǫ ± ± D/ and ǫ = 0 ,such that D = 1 is the bandwidth and our energy scale, and set U = 4 D , V k = V = D ,and ǫ f = − U/ (half-ﬁlling).We choose this simple model and solve it at a moderate temperature, β = 10 , fortwo reasons: First, its possible to run the example on a laptop and replicate theseresults (aside from the four-point vertex function). Second, exact diagonalization (ED)[1] can quickly provide the exact results for this model. (Note that we get around theissues discussed in Appendix C by including a third bath state.) That is, we can com-pare ComCTQMC with the exact solution in order to highlight its various capabilities.The input ﬁles for this example are provided in ComCTQMC/examples/hubbard/ ,along with a tutorial discussing the choice of parameters and the workﬂow for run-ning the CTQMC and post-processing codes. Instead of repeating ourselves here, wewill instead focus on the results.4.1.1. ResultsTo provide context for the computational burden associated with these measure-ments, let us note that these results were computed on a MacBook Pro using 6 of12 cores on an Intel i9 processor (unless stated otherwise). No GPU was used: Fora single-band model, a GPU decelerates the computation as the Hilbert space ( d =2 = 4 ) is far too small. Each ﬁgure is produced using a different simulation limited toonly those worm spaces shown, as it takes substantially more time to gather, e.g., thetwo-particle vertex than the one-particle vertex (the self-energy).Figure 7 shows the results for the self-energy, the most difﬁcult and importantof the one-particle observables to measure. These results were collected in a single15 minute simulation. As shown, the non-improved partition space measurementsare comparable to the improved worm space counterparts, while the improved parti-tion space measurement is the most accurate. In general, as the temperature falls,the partition space measurements increasingly outperform the worm space measure-ments. As the temperature rises, the worm space measurements begin to outperformthe partition space measurements. Near the atomic limit, the partition space measure-ments become nearly impossible to converge while the worm measurements convergequickly.In light of these observations, it is typically best practice to use the partition spacemeasurement of the self-energy, unless one is near the atomic limit. Not only are thepartition space measurements better in the regimes wherein CTQMC is difﬁcult, i.e.,at low temperature, but each worm space sampled slows the convergence of all other30 υ n ( n ) I m ( Σ ) EDpartitionimp. partitionwormimp. worm

Figure 7: The local self energy in the Hubbard model at β = 10 . The exact solution is compared with themeasurements taken in both worm and partition spaces, with and without the use of improved estimators.The partition space measurements are more accurate, even at moderate expansion orders. (Here, h k i ≈ .) The improved estimators signiﬁcantly improve the results at higher frequencies. observables sampled: Consider that the CTQMC must explore N + 1 conﬁgurationspaces, where N is the number of worm spaces. A similar amount of time will bespent in each conﬁguration space due to the volume renormalization using the WangLandau algorithm and η O . (Although the user can specify the fraction of time whichshould be spent in partition function space.) The time-to-solution for any observablewill therefore increase by a factor of approximately ( N + 1) .Figure 8 shows the results for the local two-point susceptibilities, measured in a30 minute simulation. Similar to most objects, the high-frequency region is difﬁcultto resolve. Fortunately, the high-frequency region error does not diverge, as in theself-energy or four-point vertex functions, but simply becomes noisy. In contrast withmost objects, the static susceptibilities are also difﬁcult to resolve, and can take themost time to converge.As shown, the partition space measurements converge approximately as fast asthe worm space measurements. Note that in the Hubbard model all of the suscep-tibilities in the particle-hole sector can be measured in partition space. As we havediscussed, this is not true in multiorbital models: When there are multiple bands, we31 ω m ( m ) R e ( χ ) EDpartitionimp. part.worm χ ph,s χ ph,c χ pp↓↑↑↓ Figure 8: The local two-point susceptibilities in the Hubbard model at β = 10 . The exact solutionis compared with the measurements taken in both worm and partition spaces and with and withoutimproved estimators, when possible. The static two-point susceptibilities in the particle-hole sector aredifﬁcult resolve, as are the high-frequency domains of all susceptibilities. The improved estimators helpwith both regions. have many χ ijkl = 0 for i = j and k = l . However, if the hybridisation matrix is di-agonal, CTQMC can only measure susceptibilities of the form χ iikk while in partitionspace. Furthermore, ComCTQMC cannot measure any of the particle-particle suscep-tibilities in partition space, and the partition-space measurements will fail near theatomic limit. Therefore, it is best practice to use the worm space measurements tosupplement the partition space measurements (which should be used whenever possi-ble).Figures 9 shows the results for the two-particle, three point susceptibilities at afew bosonic frequencies. These results are collected over a 6 hour simulation. Notethat these two-time objects are approximately an order of magnitude more difﬁcult toresolve than the one-time Green’s functions, at least in this example. Also note thatpartition space measurements of the three-point susceptibilities are not implemented.Again we see that the improved estimators drastically improve the results, partic-ularly at high frequencies. As with the two-point susceptibilities, the low-frequencyregion, < ν < ω can also be difﬁcult to resolve, and one should test convergence atboth high and low frequencies. The high-frequency Fermionic domain can be resolved32 ( i ω m , i υ n ) iυ n ( n ) EDwormimp. worm χ ph,s χ ph,c χ pp↓↑↑↓ Figure 9: The local three-point susceptibilities at iω . Note that partition space measurements are notimiplemented for these objects. Improved estimators greatly increase the accuracy of the measurement,particularly at low ( < ν < ω ) and high frequencies. using the Legendre basis by setting basis = "legendre" in the hedin block of theparameter ﬁle. However, as with all Legendre basis measurements, one must be care-ful: The error is systematic rather than stochastic, and it is therefore not as readilyapparent. Furthermore, if one is computing the full or irreducible vertex from thesethree-point susceptibilities, small errors in the susceptibility can lead to large errorsin the vertex.Figure10 shows the results for the full four-point vertex in the charge channel. (Thefull vertex is the four point susceptibility with its legs amputated, F ijkl = χ ijkl /G i G j G k G l .)The four point vertices require by far the most computation time for a good estimate.Here we present results using improved estimators accumulated across simulationsrequiring nearly 2,000 CPU hours. Still, the error in the vertex at high frequencies isunacceptable, even in the Legendre basis. By using the vertex asymptotics, however,we can very nearly recover the exact (ED) result at arbitrary frequencies.Note that in order to compute the vertex asypmtotics, we must measure the twoand three point susceptibilities in both particle-hole and particle-particle channels.Therefore, one must sample six conﬁguration spaces (instead of just two). In somesense, this magniﬁes the computational burden. However, the beneﬁts greatly out-33 i υ ' n ( n ) iυ n ( n ) iυ n ( n ) -20 0 10-10 20(a)-1010-20 -5020 i υ ' n ( n ) Figure 10: The real part of the local full vertex at iω with the (a) Fourier and (b) mixed Legendre-Fourierbasis, and (c) using the vertex asymptotics. (d) The difference between the (c) and the exact, ED result.Improved estimators are used for every observable but the local two-point susceptibilities and results areaccumulated across 20,000 CPU hours. Even for this simple problem, simulated for a comparatively longtime and using improved estimators, the asymptotics are crucial when resolving the vertex at an highfrequencies. weigh the costs, particularly if one requires the high-frequency components, as shownin Fig. 10.4.2. δ -PlutoniumNow, let us move on to a real material: δ -Plutonium. δ -Plutonium is a high-temperatureform of Plutonium with only one atom in the unit cell. This is one of the easiest f -shellproblems for CTQMC to solve. Here we will discuss the ﬁnal CTQMC simulation of aconverged LDA+DMFT relativistic run using ComDMFT[3]. In this case, ComDMFTuses the the default one-particle spin-orbit coupled basis set described in Appendix A,and discards the off-diagonal elements of the hybridisation and one-body Hamiltonian.(This leads to many more symmetries than might be expected in a relativistic, f -shellimpurity, e.g., J z remains a good quantum number.) We use the following parameter-ization for the local two-body interaction and the nominal double counting scheme:Hubbard interaction U = 4 . , Hund interaction J = 0 . , and nominal occupancy of34 nergy (eV) -60.00.20.40.6 0 3-3 60.81.0 P r ob a b ilit y N f J z = 5/2 J z = 5/2, γ = 1 J z = 0 J z = 1 J z = 2 J z = 7/2 J z = 4 J z = 2 Figure 11: Valence histogram of δ -Pu at 600 K. Valence ﬂuctuations between the N = 5 ground state andthe N = 6 excited states help to explain the anomalous properties of δ -Pu.[32] ComCTQMC can generatehistograms for any quantum number which commutes with the local Hamiltonian, which can be crucialin efforts to describe the correlated physics of a strongly-correlated material. the impurity orbitals N = 5 . The results of the DFT+DMFT simulation are sensitiveto this parameterisation and the double counting model. See, for example, Ref. [31].However, it is not our intention to present a treatise on the most accurate simulation of δ -Plutonium within DFT+DMFT. Instead, we would like to restrict our analysis to theusage, accuracy, and computational cost of the CTQMC simulation. Therefore, we willleave a more in-depth discussion of the double counting for a later study and simplyuse a parameterization that lets us compare our results to a recent publication.4.2.1. ResultsIn Fig. 11, we present the valence histogram resulting from these calculations.As has been discussed in the literature, δ -Pu exhibits signiﬁcant valence ﬂuctuationsbetween the ground-state valence conﬁguration, N = 5 and J z = 5 / , and a num-ber of other conﬁgurations. Primarily, it ﬂuctuates to a number of higher energy, N = 6 states. This differs substantially from, e.g., curium, which contains a singlepeak in the histogram (the ground state) with a probability near unity [32]. This va-lence histograms helps to explain the anomalous behavior of Plutonium, e.g., its lackof magnetism, and the ability to quickly and easily produce such a histogram is enor-mously useful in the effort to understand various strongly correlated materials. Theuser guide for ComCTQMC goes into the calculation of this histogram or histogramslike it using ComCTQMC.In Fig. 12, we present the spectral function of δ -Pu. To produce the spectral func-35 Γ X W K Γ L U W L KU -2-101

X 015 E n e r gy ( e V ) A k (eV -1 ) Figure 12: The momentum resolved spectral function, A k , of δ -Pu at 600 K along the high-symmetry lines.The spectrum features Hubbard and quasiparticle bands, with a Kondo peak just below the Fermi level.These results compare favorably with a recently published study on δ -Pu, while requiring drastically lesscomputer time.[33] tion, we must ﬁrst analytically continue the imaginary domain self-energy produced byComCTQMC. ComCTQMC does not have a built-in analytical continuation program, asa number of well-developed, open-source analytical continuation codes have alreadybeen developed and published. Here we use the maximum entropy code bundled intothe extended-DMFT package (EDMFT)[34]. As shown in Fig. 12, the spectral functionof δ -Pu is dominated by a Kondo peak near the Fermi level (which does not appear inpure LDA or GGA calculations).[35]To get these results, the LDA+DMFT simulation was run for 20 iterations on 50nodes on Summit at ORNL. Each iteration required 10 minutes of measurement byComCTQMC, with the ﬁrst iteration given another 5 minutes of thermalisation. Anadditional 20 minutes of measurement time was allocated to the ﬁnal two iterationsfor the purpose of gathering more reﬁned results for the analytical continuation. Intotal, around 200 node hours were spent on the solution. This is a shockingly smallinvestment into a problem that has, until now, been prohibitively expensive for mostresearchers. For comparison, the ﬁrst CTQMC simulations of Pu required “massivecomputer resources", i.e., a major allocation on a leadership-class facility [36]. Sud-denly, research into actinide systems is not only possible, but it can also be inexpensive36 a x i m u m e rr o r i n s e l f e n e r gy Acceptable errorfor DMFT δ-Pu 600K C r y s t a l f i e l d s w / G P U C r y s t a l f i e l d s w / o G P U Node hours (Summit) N o c r y s t a l f i e l d s w / G P U N o c r y s t a l f i e l d s w / o G P U Figure 13: Relative error in the self-energy at 10 meV, i.e., the maximum error in the DMFT loop, as thenumber of node hours allocated to ComCTQMC increases. Without crystal ﬁelds, the GPU accelerationtransforms a DFT+DMFT simulation ( 25 times as expensive as a single ComCTQMC run) from mod-erately expensive to inexpensive. With crystal ﬁelds, the GPU acceleration transforms the DFT+DMFTsimulation from nearly impossible to easily doable. Indeed, the GPU acceleration will allow researchers toextend the boundaries in temperature and unit-cell size and simulate, e.g., not only the high-temperatureactinide systems like δ -Pu but also the low temperature systems like α -Pu. in many cases. Furthermore, those problems which were previously impossible with-out additional approximation are now possible.Consider, for example an LDA+DMFT simulation of δ -Pu with crystal ﬁeld effects.This simulation would require approximately 250,000 node hours on Summit withoutGPU acceleration, as shown in Fig. 13. While possible, this would represent a massivecomputational effort. With GPU acceleration, we reduce the computational burdenby two orders of magnitude! Suddenly, it is relatively inexpensive to compute theone-particle observables in the presence of crystal ﬁelds. These ﬁelds should havea pronounced impact on observables which exist at very low energy scales (e.g., theFermi surface) or on many observables in materials with less symmetry than δ -Pu (e.g.,the lower-temperature Plutonium allotropes), and so it is important that ComCTQMCallows us to include them without requiring exorbitant expense. As these materialsare already enormously expensive to simulate (due the large number of impurities inthe unit cell and the low temperatures at which they are stable), the GPU accelerationachieved by ComCTQMC will be necessary to accurately simulate them within DMFT. At low temperatures, CTQMC encounters the sign problem, and one cannot overcome the sign prob- χ (2 ,ph ) susc than it does to measure the self-energy (with the same de-gree of relative error). To get a good analytical continuation, it requires at leasttwo-orders of magnitude more computer time than the entire LDA+DMFT simulation.It is crucial, then, that the GPU offers one-to-two orders of magnitude reduction inthe computational burden, particularly as researchers look towards the two-particlevertex functions.

5. Summary

Here we have presented the program ComCTQMC which efﬁciently solves Ander-son impurity problems using the CTQMC algorithm with worm and/or partition spacesampling. This program utilizes GPUs extremely well in the most difﬁcult problems.Indeed, we demonstrate the acceleration of the CTQMC algorithm by a factor of over600 in the simulation of a low-symmetry f -shell impurity. The parallelisation and accel-eration schemes are extremely ﬂexible, and it can in general use 100% of the compu-tational capacity of HPC resources to solve these difﬁcult problems. The algorithm issimilarly ﬂexible: It can handle complex valued impurities with dynamical interactionsand crystal ﬁelds; and it can use worm sampling to measure any non-zero componentof a two-, three-, or four-point susceptibility in the particle-hole or particle-particlechannels. Finally, it is highly optimized, using state-of-the-art algorithm and improvedestimators to drastically reduce the time-to-solution.Additionally, equations for the improved estimators in the presence off a dynamicalinteraction are presented. Details of the implementation are discussed, and exam-ples are provided. Best-practices and convergence criteria are discussed in theseexamples. Appendices are also provided to derive equations or investigate conclu-sions discussed in ComCTQMC. A user guide is also provided with ComCTQMC whichdiscusses, among other topics, its installation and usage. lem by brute force. Still, the sign problem is not what currently prevents researchers from simulatinglower symmetry actinide systems like α − P u with DFT+DMFT, particularly if they include the crystalﬁelds. ComCTQMC estimates the absolute error if the user directs it to do so. This is done by comparingthe estimators accumulated on each MPI rank using the jackknife method. Alternately, it is done bycomparing sequential, serial runs if the user does not have MPI. . Acknowledgments This work was supported by the US Department of Energy, Ofﬁce of Basic EnergySciences as part of the Computation Material Science Program. This research usedresources of the Oak Ridge Leadership Computing Facility, which is a DOE Ofﬁce ofScience User Facility supported under Contract DE-AC05-00OR22725.

References [1] A. Georges, G. Kotliar, W. Krauth, M. Rozenberg, Dynamical mean-ﬁeld theoryof strongly correlated fermion systems and the limit of inﬁnite dimensions, Rev.Mod. Phys. 68 (1996) 13. doi:10.1103/RevModPhys.68.13 .[2] G. Kotliar, S. Y. Savrasov, K. Haule, V. S. Oudovenko, O. Parcollet, C. A. Mar-ianetti, Electronic structure calculations with dynamical mean-ﬁeld theory, Rev.Mod. Phys. 78 (2006) 865–951. doi:10.1103/RevModPhys.78.865 .URL https://link.aps.org/doi/10.1103/RevModPhys.78.865 [3] S. Choi, P. Semon, B. Kang, A. Kutepov, G. Kotliar,Comdmft: A massively parallel computer package for the electronic structure of correlated-electron systems,Computer Physics Communications 244 (2019) 277 – 294. doi:https://doi.org/10.1016/j.cpc.2019.07.003 .URL [4] R. Adler, G. Kotliar, Bringing electronic structure codes into the modern softwareecosystem, unpublished x (2020) xx.[5] M. Schüler, C. Renk, T. O. Wehling, Variational exact diagonalization method for anderson impurity models,Phys. Rev. B 91 (2015) 235142. doi:10.1103/PhysRevB.91.235142 .URL https://link.aps.org/doi/10.1103/PhysRevB.91.235142 [6] R. Bulla, T. A. Costi, T. Pruschke, Numerical renormalization group method for quantum impurity systems,Rev. Mod. Phys. 80 (2008) 395–450. doi:10.1103/RevModPhys.80.395 .URL https://link.aps.org/doi/10.1103/RevModPhys.80.395 [7] E. Gull, A. J. Millis, A. I. Lichtenstein, A. N. Rubtsov, M. Troyer, P. Werner,Continuous-time monte carlo methods for quantum impurity models, Rev. Mod.Phys. 83 (2011) 349–404. doi:10.1103/RevModPhys.83.349 .URL http://link.aps.org/doi/10.1103/RevModPhys.83.349 [8] J. E. Hirsch, R. M. Fye, Monte carlo method for magnetic impurities in metals,Phys. Rev. Lett. 56 (1986) 2521–2524. doi:10.1103/PhysRevLett.56.2521 .URL https://link.aps.org/doi/10.1103/PhysRevLett.56.2521 doi:10.3929/ethz-a-005722583 .[11] J. H. Shim, K. Haule, G. Kotliar, Modeling the localized-to-itinerant electronic transition in the heavy fermion system ceirin5,Science 318 (5856) (2007) 1615–1617. arXiv:https://science.sciencemag.org/content/318/5856/1615.full.pdf , doi:10.1126/science.1149064 .URL https://science.sciencemag.org/content/318/5856/1615 [12] A. M. Läuchli, P. Werner, Krylov implementation of the hybridization expansion impurity solver and application to 5-orbital models,Phys. Rev. B 80 (2009) 235117. doi:10.1103/PhysRevB.80.235117 .URL https://link.aps.org/doi/10.1103/PhysRevB.80.235117 [13] L. Boehnke, H. Hafermann, M. Ferrero, F. Lechermann, O. Parcol-let, Orthogonal polynomial representation of imaginary-time green’s functions,Phys. Rev. B 84 (2011) 075145. doi:10.1103/PhysRevB.84.075145 .URL https://link.aps.org/doi/10.1103/PhysRevB.84.075145 [14] H. Shinaoka, J. Otsuki, M. Ohzeki, K. Yoshimi,Compressing green’s function using intermediate representation between imaginary-time and real-frequency domains,Phys. Rev. B 96 (2017) 035147. doi:10.1103/PhysRevB.96.035147 .URL https://link.aps.org/doi/10.1103/PhysRevB.96.035147 [15] H. Hafermann, K. R. Patton, P. Werner, Improved estimators for the self-energy and vertex function in hybridization-expansion continuous-time quantum monte carlo simulations,Phys. Rev. B 85 (2012) 205106. doi:10.1103/PhysRevB.85.205106 .URL https://link.aps.org/doi/10.1103/PhysRevB.85.205106 [16] P. Gunacker, M. Wallerberger, T. Ribic, A. Hausoel, G. Sangiovanni, K. Held,Worm-improved estimators in continuous-time quantum monte carlo, Phys. Rev.B 94 (2016) 125153. doi:10.1103/PhysRevB.94.125153 .URL https://link.aps.org/doi/10.1103/PhysRevB.94.125153 [17] P. Gunacker, M. Wallerberger, E. Gull, A. Hausoel, G. Sangiovanni, K. Held,Continuous-time quantum monte carlo using worm sampling, Phys. Rev. B 92(2015) 155102. doi:10.1103/PhysRevB.92.155102 .URL https://link.aps.org/doi/10.1103/PhysRevB.92.155102 [18] J. Kaufmann, P. Gunacker, K. Held, Continuous-time quantum monte carlo calculation of multiorbital vertex asymptotics,Phys. Rev. B 96 (2017) 035114. doi:10.1103/PhysRevB.96.035114 .URL https://link.aps.org/doi/10.1103/PhysRevB.96.035114 doi:https://doi.org/10.1016/j.cpc.2015.04.020 .URL [21] H. Shinaoka, E. Gull, P. Werner, Continuous-time hybridization expansion quantum impurity solver for multi-orbital systems with complex hybridizations,Computer Physics Communications 215 (2017) 128 – 136. doi:https://doi.org/10.1016/j.cpc.2017.01.003 .URL [22] P. Sémon, G. Sordi, A.-M. S. Tremblay, Ergodicity of the hybridization-expansion monte carlo algorithm for broken-symmetry states,Phys. Rev. B 89 (2014) 165113. doi:10.1103/PhysRevB.89.165113 .URL https://link.aps.org/doi/10.1103/PhysRevB.89.165113 [23] P. Sémon, A.-M. S. Tremblay, Importance of subleading corrections for the mott critical point,Phys. Rev. B 85 (2012) 201101. doi:10.1103/PhysRevB.85.201101 .URL https://link.aps.org/doi/10.1103/PhysRevB.85.201101 [24] H. Shinaoka, Y. Nomura, S. Biermann, M. Troyer, P. Werner,Negative sign problem in continuous-time quantum monte carlo: Optimal choice of single-particle basis for impurity problems,Phys. Rev. B 92 (2015) 195126. doi:10.1103/PhysRevB.92.195126 .URL https://link.aps.org/doi/10.1103/PhysRevB.92.195126 [25] F. Wang, D. P. Landau, Efﬁcient, multiple-range random walk algorithm to calculate the density of states,Phys. Rev. Lett. 86 (2001) 2050–2053. doi:10.1103/PhysRevLett.86.2050 .URL https://link.aps.org/doi/10.1103/PhysRevLett.86.2050 [26] F. Wang, D. P. Landau, Determining the density of states for classical statistical models: A random walk algorithm to produce a ﬂat histogram,Phys. Rev. E 64 (2001) 056101. doi:10.1103/PhysRevE.64.056101 .URL https://link.aps.org/doi/10.1103/PhysRevE.64.056101 [27] K. Haule, Quantum monte carlo impurity solver for cluster dynamical mean-ﬁeldtheory and electronic structure calculations with adjustable cluster base, Physi-cal Review B 75 (15) (2007) 155113.[28] E. Gull, Continuous-time quantum monte carlo algorithms for fermions, Ph.D.thesis, ETH Zurich (2008). 4129] C.-H. Yee, Towards an ab initio description of correlated materials, Ph.D. thesis,Rutgers University-Graduate School-New Brunswick (2012).[30] H. Shinaoka, M. Dolﬁ, M. Troyer, P. Werner, Hybridization expansion monte carlosimulation of multi-orbital quantum impurity problems: matrix product formalismand improved sampling, Journal of Statistical Mechanics: Theory and Experiment2014 (6) (2014) P06012.[31] J.-X. Zhu, A. K. McMahan, M. D. Jones, T. Durakiewicz, J. J. Joyce, J. M. Wills,R. C. Albers, Spectral properties of δ -plutonium: Sensitivity to f occupancy,Phys. Rev. B 76 (2007) 245118. doi:10.1103/PhysRevB.76.245118 .URL https://link.aps.org/doi/10.1103/PhysRevB.76.245118 [32] J. H. Shim, K. Haule, G. Kotliar, Fluctuating valence in a correlated solid and the anomalous properties of δ -plutonium,Nature 446 (7135) (2007) 513–516. doi:10.1038/nature05647 .URL https://doi.org/10.1038/nature05647 [33] R. M. Tutchton, W.-t. Chiu, R. C. Albers, G. Kotliar, J.-X. Zhu,Electronic correlation induced expansion of fermi pockets in δ -plutonium, Phys.Rev. B 101 (2020) 245156. doi:10.1103/PhysRevB.101.245156 .URL https://link.aps.org/doi/10.1103/PhysRevB.101.245156 [34] K. Haule, C.-H. Yee, K. Kim, Dynamical mean-ﬁeld theory within the full-potential methods: Electronic structure of ceirin , cecoin , and cerhin ,Phys. Rev. B 81 (2010) 195107. doi:10.1103/PhysRevB.81.195107 .URL https://link.aps.org/doi/10.1103/PhysRevB.81.195107 [35] S. Y. Savrasov, G. Kotliar, E. Abrahams, Correlated electrons in δ -plutonium within a dynamical mean-ﬁeld picture,Nature 410 (6830) (2001) 793–795. doi:10.1038/35071035 .URL https://doi.org/10.1038/35071035 [36] C. A. Marianetti, K. Haule, G. Kotliar, M. J. Fluss,Electronic coherence in δ -pu: A dynamical mean-ﬁeld theory study, Phys. Rev.Lett. 101 (2008) 056403. doi:10.1103/PhysRevLett.101.056403 .URL https://link.aps.org/doi/10.1103/PhysRevLett.101.056403 Appendix A. Basis functions

In general, one does not need to know the form of the one-particle basis to run Com-CTQMC. However, ComCTQMC provides Slater-Condon and Kanamori parameterisa-tions of the static two-body interaction tensor. In order to use these parameterisations,it is necessary to know this basis set. For this purpose, ComCTQMC implements three42asis sets. A generic basis set, which is all that is required for the Kanamori interac-tion; and the product and spin-coupled basis sets, which provide the detail requiredfor the Slater-Condon interaction.The product basis set is deﬁned here as | l, m, σ i = Y l,m ⊗ σ, (A.1)where σ is the spin and Y l,m are the real spherical harmonics, which are deﬁned as Y ℓ,m :=  i √ (cid:0) Y mℓ − ( − m Y − mℓ (cid:1) if m < Y ℓ if m = 01 √ (cid:0) Y − mℓ + ( − m Y mℓ (cid:1) if m > (A.2)in terms of the complex spherical harmonics Y ml .The coupled basis set is deﬁned here as | j, m j i = X mσ Y ml ⊗ σ h ℓ, m ; 12 , σ | j, m j i , (A.3)where h ℓ, m ; , σ | j, m j i are the Clebsch-Gordan coefﬁcients.The generic, product, and coupled basis are enumerated as | , ↓i , ..., | n, ↓i , | , ↑i , ..., | n, ↑i (A.4) | ℓ, − ℓ, ↓i , | ℓ, − ℓ + 1 , ↓i , . . . , | ℓ, ℓ, ↓i , | ℓ, − ℓ, ↑i , | ℓ, − ℓ + 1 , ↑i , . . . , | ℓ, ℓ, ↑i (A.5) | ℓ − , − ℓ + 12 i , . . . , | ℓ − , ℓ − i , | ℓ + 12 , − ℓ − i , . . . , | ℓ + 12 , ℓ + 12 i . (A.6) Appendix B. Metropolis-Hasting algorithm

A Markov chain is a random walk x −→ x −→ · · · in the sample space, char-acterized by the transition probability P ( x n +1 | x n ) between two consecutive samples.Given the probability distribution p n ( x n ) at the step n , the probability distribution atstep n + 1 reads p n +1 ( x n +1 ) = X x n P ( x n +1 | x n ) p n ( x n ) , (B.1)and if the Markov chain converges, the stationary distribution p := p n ( n → ∞ ) satis-ﬁes global balance p ( y ) = X x P ( y | x ) p ( x ) . (B.2)43he goal is now to ﬁnd such a transition probability P for a given target distribution p . A sufﬁcient (but not necessary condition) for global balance is detailed balance P ( y | x ) p ( x ) = P ( x | y ) p ( y ) . (B.3)Since the probability of going from state x to any state y is one, this implies X x P ( y | x ) p ( x ) = X x P ( x | y ) p ( y ) = p ( y ) . (B.4)A general solution of detailed balance is provided by the Metropolis-Hasting algorithm.A new conﬁguration y is proposed with probability P prop ( y | x ) , and accepted with prob-ability P acc ( y | x ) . If the proposed conﬁguration y is rejected, the old conﬁguration x isused again. With y = x , the detailed balance condition for this transition probability P ( y | x ) = P acc ( y | x ) P prop ( y | x ) implies P acc ( y | x ) P acc ( x | y ) = P prop ( x | y ) p ( y ) P prop ( y | x ) p ( x ) =: R ( y , x ) , (B.5)where R ( y , x ) is the acceptance ratio, and is satisﬁed by Metropolis-Hasting choice P acc ( y | x ) = min(1 , R ( y , x )) . (B.6)For x = y , detailed balance is trivially satisﬁed. Notice that this algorithm allows toget samples of a distribution which is only known upon a constant factor. Appendix C. Limitation of the Green function estimator in partition functionspace

In this appendix, we show that the estimator for the Green function in partitionspace in Eq. 29 can have limitations if the bath has only very few orbitals. While thisis not relevant for DMFT calculations, it can be useful when benchmarking a CTQMCsolver with ED.To start with, we consider a toy impurity model with two impurity orbitals and twobath orbitals. The Hamiltonian reads H = H loc z }| {X i,j =1 , ˆ c † i t ij ˆ c j + H hyb z }| {X i =1 , ˆ f † i ˆ c i + ˆ c † i ˆ f i + H bath z }| {X i =1 , ˆ f † i ˆ f i , (C.1)where the off-diagonal elements of the hopping matrix t ij are non-zero.The partition estimator for the Green function is based on the assumption that theGreen function weight w ( C , ˆ c ( τ ′ )ˆ c † ( τ )) vanishes whenever the partition function weight44 ( C + τ τ ′ ) vanishes. This assumption is violated for the conﬁguration C = ˆ c † (˜ τ )ˆ c (˜ τ ′ ) and the Green function entry ˆ c ( τ ′ )ˆ c † ( τ ) with ˜ τ > ˜ τ ′ > τ ′ > τ . The Green functionweight w ( C , ˆ c ( τ ′ )ˆ c † ( τ )) = Z loc =0 z }| { h ˆ c † (˜ τ )ˆ c (˜ τ ′ )ˆ c ( τ ′ )ˆ c † ( τ ) i loc × =0 z }| { h ˆ f (˜ τ ) ˆ f † (˜ τ ′ ) i bath , (C.2)is non-zero. Here the orbital indices are omitted, and h◦i loc/bath = Z loc/bath Tr e − βH loc/bath ◦ . (C.3)The partition function weight w ( C + τ τ ′ ) = Z loc =0 z }| { h ˆ c † (˜ τ )ˆ c (˜ τ ′ )ˆ c ( τ ′ )ˆ c † ( τ ) i loc × =0 z }| { h ˆ f (˜ τ ) ˆ f † (˜ τ ′ ) ˆ f † ( τ ′ ) ˆ f ( τ ) i bath (C.4)however is zero. This is because two bath creation (annihilation) operators lie nextto each other and, as opposed to the fermions on the impurity, the fermions in thebath can not hop between the two orbitals. It is easily seen that contribution to theGreen function as in Eq. C.2, which can not be sampled by the partition space Greenestimator, appear at arbitrary expansion orders.In Eq. C.1 we have chosen the simplest Hamiltonian which violates the assump-tion for the Green function estimator in partition space that w ( C , ˆ c ( τ ′ )ˆ c † ( τ )) vanisheswhenever w ( C + τ τ ′ ) vanishes. The same happens if the one body part of the impurityHamiltonian H loc is diagonal, but we add a Kanamori interaction which shufﬂes thefermions on the impurity through spin-ﬂip and pair-hopping processes. Appendix D. CTQMC Move optimizations

While the trace calculation is an unavoidable bottleneck in CT-HYB calculations,a substantial amount of computation can be avoided by only proposing moves whichhave a chance to be accepted. That is, we often know a priori whether a move will berejected, without having to do an expensive local trace computation in addition to themoderately burdensome hybridisation and dynamical interaction computations.Consider the pair insertion move. In the literature, this move is typically made bygenerating two random times for the creation and annihilation operator, computingthe Metropolis-Hastings acceptance rate, and then deciding to accept or reject theproposed move[7], as shown in Fig. D.14(a). Often, however, the proposed move willviolate the Pauli-exclusion principle, and we have wasted effort computing the accep-tance rate. In ComCTQMC, we only propose moves that satisfy the Pauli exclusionprinciple. In order to accomplish this, we must take extra care with the pair insertion,45 b) r ∆ τ+ τ min r ∆ τ+ τ min τ min τ min τ max τ max ∆ τ int( nr ) int(2 kr )0 (a) r β βr β hybridisation line c i ( τ ) c † i ( τ ) Figure D.14: The (a) standard and (b) Pauli-aware pair insertion and removal moves, where r i are therandom numbers required for the move and n is the number of operators in the current conﬁguration. Inthe standard algorithm, many randomly proposed moves will be rejected because they generate unphysi-cal conﬁgurations. In the Pauli-aware algorithm, the acceptance rate is much higher, particularly at highexpansion orders. removal, and worm operator shift moves. Let us brieﬂy outline how we propose thePauli-aware moves. Pair insertion : First, one selects the ﬂavor of the creation and annihilation opera-tor, i and j . Then, we select a random operator from the list of operators of ﬂavor i and j in the current conﬁguration. Next, we generate two random times which lie on theimaginary time axis between this operator and the next operator in the time-orderedlist. (If there are no operators in the current conﬁguration of ﬂavor i and j , thenone simply chooses two random times as in the typical, Pauli-unaware algorithm.) Asin the Pauli-unaware algorithm, one must attach a weight to the Metropolis-Hastingsacceptance rate in order to maintain the detailed balance. We call that weight the con-ﬁguration weight, w conﬁg . For this Pauli-aware pair insertion move, the conﬁgurationweight is w conﬁg = ∆ τ N ( N + 2) , (D.1)where ∆ τ is the distance between the pre-existing operators which bound the ran-domly select times for the newly proposed pair, and N is the number of operators ofﬂavor i and j . Note the imaginary time axis is a circle of circumference β , so that onemust take care with the distance calculation and concept of the “next” operator. Thismove is illustrated in Fig. D.14(b). 46 τ join ∆ τ split τ r ∆ τ+ τ min c i ( τ ) c † i ( τ ) d i ( τ ) c j ( τ ) c † j ( τ ) d † i ( τ ) n ij ( τ ) ph int(2 kr ) S w a p S p lit/ J o i n Figure D.15: The algorithm for moving a local bilinear. In order to generate physical conﬁgurations,the bilinear is moved throughout the conﬁguration by swapping one of its operators with a hybridized(conﬁguration space) operator forming a new bilinear at this location, and splitting apart the old bilinear.The operators used for the split and join portions of this move are aware of the Pauli exclusion principle,greatly improving the acceptance rate at large expansion orders.

Pair removal : This move proceeds much like the pair insertion move. That is, weselect the ﬂavors of the creation and annihilation operators and then select a randomoperator from the list of operators with these ﬂavors. Then, we select the next operatorin the time-ordered list. Finally, we compute the distance between between these pairsalong the imaginary time axis, a circle. The resulting conﬁguration weight is w conﬁg = 1∆ τ NN − . (D.2)This move is illustrated in Fig. D.14(b). Bilinear swap : In general, one must swap the location of worm and partition op-erators in order for the exploration of worm conﬁguration spaces to be ergodic. Thisis a relatively simple process when one does not need to swap a bilinear operator:Pauli exclusion is never violated and one only needs to compute the change in thehybridisation.[17] When one swaps one of the operators in a bilinear, however, theother operator in that bilinear must be dealt with carefully. If one simply shifts it,Pauli exclusion will be violated quite often, which might cause ergodicity problems.Moreover, if both operators in a bilinear are swapped, a newly hybridized bilinear isleft in place. This hybridized bilinear is relatively unphysical and will therefore typi-cally be rejected by the Metropolis-Hastings algorithm. Again, ergodicity issues arise.Therefore, we must carefully design a smart bilinear swap move. In ComCTQMC, we47mplement the bilinear swap using a combination swap, join, and split moves in orderto avoid the problems outlined above. This move is illustrated in Fig. D.15.As in the typical swap move, the join-swap-split move for a bilinear n ( l ) ij begins byselecting a random hybridized operator of ﬂavor i ( j ) from the current conﬁguration.This time, τ join is where the new bilinear will be placed, and these are the operatorswhich will be swapped (hybridized ↔ local). Then, one ﬁnds the operators of ﬂavor j ( i ) which bracket the current and new bilinears. A random time is selected withinthe operators bracketing the current bilinear, τ split , and we select the operator of thesame type (creation or annihilation) from the operators bracketing the new bilinear.The hybridized operator is moved to τ split and the local operator is moved to τ join Theconﬁguration weight associated with this move is w conﬁg = ∆ τ split ∆ τ join , (D.3)where ∆ τ split and ∆ τ join are the distances between the operators bracketing the origi-nal and new bilinears, respectively. Appendix E. Symmetries of the two-particle correlators

In general, accurate measurement of the two-particle correlators is much moretime consuming than the accurate measurement of the one-particle correlators. InComCTQMC, we implement a number of symmetries during post-processing whichhelp to improve the accuracy of the measurements. The symmetries implemented inComCTQMC are the Hermetian and Fermionic symmetries, when possible. We listthese symmetries in Table E.1If a symmetric component speciﬁed above is not measured (i.e., it is outside thefrequency box deﬁned by the user supplied cutoff frequencies), that symmetry is notapplied at that frequency. That is, these symmetries are applied whenever possibleand ignored otherwise. These symmetries are also implemented for the improvedestimators.

Appendix F. Discussion of Quantitative Scaling and Acceleration Results

In general, the scaling and acceleration of ComCTQMC will depend on not onlythe parameterization of the CTQMC, but it will also depend upon the architectureof the computational resource, the compilers and libraries used, and the ﬂags setwhen compiling these libraries and ComCTQMC itself. Of particular importance is therelative power of the collection of GPUs and CPUs on a node: Summit, for example,contains a few powerful GPUs and fairly slow (but numerous) CPU cores. To someextent, this leads to the drastic acceleration of a single core shown in Fig. 1.48 able E.1: Symmetries implemented in ComCTQMC

Observable G ( ph ) ijkl ( iω ) G ( pp ) ijkl ( iω ) G ( ph ) ijkl ( iν, iω ) G ( ph ) jilk ( iω ) G ( ph ) lkji ( − iω ) G ( ph ) klij ( − iω ) G ( pp ) jikl ( iω ) G ( pp ) ijlk ( iω ) G ( pp ) jilk ( iω ) G ( ph ) jilk ( iω − iν, iω ) Observable G ( pp ) ijkl ( iν, iω ) G ( ph ) ijkl ( iν, iν ′ , iω ) G ( pp ) ijlk ( iν, iω ) G ( pp ) jikl ( iω − iν, iω ) G ( pp ) jilk ( iω − iν, iω ) G ( ph ) kjil ( iν ′ − iω, iν ′ , iν ′ − iν ) G ( ph ) ilkj ( iν − iω, iν, iν − iν ′ ) G ( ph ) lkij ( iν ′ − iω, iν − iω, iω ))