[PDF] State- and superstate-sampling in hybridization-expansion continuous-time quantum Monte Carlo

Abstract

Due to the intrinsic complexity of the quantum many-body problem, quantum Monte Carlo algorithms and their corresponding Monte Carlo configurations can be defined in various ways. Configurations corresponding to few Feynman diagrams often lead to severe sign problems. On the other hand, computing the configuration weight becomes numerically expensive in the opposite limit in which many diagrams are grouped together. Here we show that for continuous-time quantum Monte Carlo in the hybridization expansion the efficiency can be substantially improved by dividing the local impurity trace into fragments, which are then sampled individually. For this technique, which also turns out to preserve the fermionic sign, a modified update strategy is introduced in order to ensure ergodicity. Our (super)state sampling is particularly beneficial to calculations with many d -orbitals and general local interactions, such as full Coulomb interaction. For illustration, we reconsider the simple albeit well-known case of a degenerate three-orbital model at low temperatures. This allows us to quantify the coherence properties of the "spin-freezing" crossover, even close to the Mott transition.

Full PDF

aa r X i v : . [ c ond - m a t . s t r- e l ] A p r State- and superstate sampling in hybridization-expansioncontinuous-time quantum Monte Carlo

Alexander Kowalski, Andreas Hausoel, Markus Wallerberger, Patrik Gunacker, and Giorgio Sangiovanni Institut f¨ur Theoretische Physik und Astrophysik,Universit¨at W¨urzburg, Am Hubland, D-97074 W¨urzburg, Germany Department of Physics, University of Michigan, Ann Arbor, MI 48109, USA Institute of Solid State Physics, TU Wien, 1040 Vienna, Austria

Due to the intrinsic complexity of the quantum many-body problem, quantum Monte Carloalgorithms and their corresponding Monte Carlo conﬁgurations can be deﬁned in various ways.Conﬁgurations corresponding to few Feynman diagrams often lead to severe sign problems. On theother hand, computing the conﬁguration weight becomes numerically expensive in the opposite limitin which many diagrams are grouped together. Here we show that for continuous-time quantumMonte Carlo in the hybridization expansion the eﬃciency can be substantially improved by dividingthe local impurity trace into fragments, which are then sampled individually. For this technique,which also turns out to preserve the fermionic sign, a modiﬁed update strategy is introduced inorder to ensure ergodicity. Our (super)state sampling is particularly beneﬁcial to calculations withmany d -orbitals and general local interactions, such as full Coulomb interaction. For illustration, wereconsider the simple albeit well-known case of a degenerate three-orbital model at low temperatures.This allows us to quantify the coherence properties of the “spin-freezing” crossover, even close tothe Mott transition. I. INTRODUCTION

Continuous-time quantum Monte Carlo algorithms arestate-of-the-art, numerically exact methods for the solu-tion of the Anderson impurity model (AIM) . Theseare widely used for the description of the physics ofmagnetic impurities, Kondo systems, transport throughquantum junctions and are also employed as auxiliarymodels in dynamical mean ﬁeld theory (DMFT) calcu-lations for lattice models of correlated electron systems.Several high-level open source implementations of DMFTand of its merger with density functional theory havebeen recently made available .One of the most successful ﬂavors of continuous-timequantum Monte Carlo algorithms is the strong-couplinghybridization expansion (CT-HYB) . CT-HYB is themethod of choice for multiorbital impurity models withgeneral interactions because one observes only a mod-erate sign problem provided that the bath problem hassuﬃcient symmetry. This is because CT-HYB splits eachMonte Carlo conﬁguration into a noninteracting bathpart and a fully interacting impurity part, and solves theimpurity part using an exact diagonalization/FullCI-typemethod. However, the dimension of the impurity Hamil-tonian grows exponentially with the number of orbitals,and so does the computational eﬀort with it. In prac-tice, correlated d - or f -shells as well as small correlatedmolecules can be treated with CT-HYB.Yet, reaching low temperatures is still challenging dueto the quadratic scaling of the impurity problem withinverse temperature. This follows from the fact thatthe mean order of diagrammatic expansion grows lin-early with inverse temperature, and both the computa-tional cost of evaluating a single conﬁguration as wellas the observed autocorrelation time between conﬁgu-rations scale linearly with expansion order. While the exponential scaling with the number of orbitals and thequadratic scaling with the inverse temperature are in-trinsic to the local problem, potentially model-dependentimprovements to the prefactor of this overall scaling canbe achieved.Common approaches to such optimization are blockdiagonalization of the local Hamiltonian using conservedquantities and binning, tree , or equivalent algo-rithms in so-called “matrix-matrix” implementations ofCT-HYB. Additionally, with a similar motivation asfor our method, outer truncation of the local trace tothe few dominant contributions and calculation of thosewith more eﬃcient sparse-matrix methods has been ap-plied particularly to large systems at low temperatures .Other more advanced strategies are local updates inimaginary time , a fast-rejection/acceptance algorithmby calculating upper/lower boundaries of the weight ,or a partial summation of diagrams to extract more in-formation out of one Monte Carlo conﬁguration .Here we consider a matrix-vector version of the CT-HYB algorithm as implemented in the w2dynamics package and investigate the possibility of sampling thesum over the eigenstates of the local impurity in theMonte Carlo simulation. A hard outer truncation of highenergy states (so far typically used in calculations with w2dynamics ) constitutes an approximation and it is un-clear whether it retains ergodicity. The approach pro-posed here, instead, is numerically exact and furthermoreexceeds the performance beneﬁts of hard truncation sub-stantially.We formulate two versions: the “superstate”-samplingalgorithm, where states grouped by the blocks of theHamiltonian are sampled together, and the “state”-sampling algorithm, where each many-body state of theimpurity is sampled individually. Conceptually, thesemethods can be interpreted as an equivalent of thesegment-algorithm for general interactions. Our im-provements touch the core of the exponential scaling ofCT-HYB and manage to signiﬁcantly reduce the compu-tation time of a Monte Carlo weight. Furthermore, theyare in principle compatible with all of the other above-mentioned algorithmic improvements. Using a ﬁve-orbital AIM with the most general form of the electron-electron Coulomb interaction as an example, we achievespeed-up factors verging on three orders of magnitude.First, we review the basic formulas of CT-HYB inSec. II. In Secs. III and IV, the superstate and state sam-pling methods are introduced. In Sec. V, we comment onthe performance and the average sign, while in Sec. VIwe demonstrate the capabilities of the (super)state algo-rithm with a simple physical example. II. HYBRIDIZATION EXPANSION

Let us start with a brief review of the hybridization-expansion continuous-time quantum Monte Carlo algo-rithm, focusing on the formulas needed to explain ourchanges. For a complete introduction to this method,see Ref. 4. We are concerned with the solution of a mul-tiorbital Anderson impurity model, whose Hamiltoniancan be written as H = H loc [ d, d † ]+ X pλ ( V pλ f † p d λ + H . c . )+ X p ˜ E p f † p f p , (1)where d λ annihilates a fermion on the impurity, whichconsists of spin orbitals λ ∈ { , . . . , N orb } , and f p anni-hilates a fermion on the bath, where the quantum num-ber p can be continuous. V pλ and ˜ E p parametrize thehybridization and bath levels, respectively, while H loc isa generic interacting local (impurity) Hamiltonian.The expansion of the partition function Z = Tr e − βH in terms of the bath hybridization can be written as: Z = Z ∞ X k =0 X λ ,λ ′ Z β d τ Z β d τ ′ · · · X λ k ,λ ′ k Z βτ k − d τ k Z βτ ′ k − d τ ′ k × Tr " T τ e − βH loc k Y i =1 d † λ i ( τ i ) d λ ′ i ( τ ′ i ) det h ∆ λ i λ ′ j ( τ i − τ ′ j ) i ij , (2)where Z is the partition function for the bath part, β = 1 /T is the inverse temperature, T τ denotes pathordering in imaginary time, and ∆( τ ) = V † ( ∂ τ − ˜ E ) − V is the hybridization function, which encodes the total re-tardation eﬀect of the bath on the local fermions.Schematically, Eq. (2) can be written as follows: Z = X C w loc ( C ) w bath ( C ) , (3)where C := ( τ , τ ′ , λ , λ ′ , . . . , τ k , τ ′ k , λ k , λ ′ k ) denotes an inﬁnitesimal term in the expansion, i.e., X C := ∞ X k =0 X λ ,λ ′ Z β d τ Z β d τ ′ · · · X λ k ,λ ′ k Z βτ k − d τ k Z βτ ′ k − d τ ′ k . (4)In the conventional continuous-time hybridization expan-sion quantum Monte Carlo (CT-HYB) algorithm, each C is taken as a Monte Carlo conﬁguration, and the sum (4)is performed using Markov chain Monte Carlo: Z = X C |{z} QMC w loc ( C ) w bath ( C ) . (5)With w bath = det[ . . . ] we denote the bath part, corre-sponding to a determinant of noninteracting hybridiza-tion functions, which can be computed in O ( k ) and up-dated in O ( k ) time. The quantity w loc ( C ) = X s h s | ˆ C | s i (6):= X s D s (cid:12)(cid:12)(cid:12) T τ e − βH loc k Y i =1 d † λ i ( τ i ) d λ ′ i ( τ ′ i ) (cid:12)(cid:12)(cid:12) s E (7)is the local weight or local trace of a conﬁguration, where s indexes the 4 N orb many-body eigenstates of the localimpurity Hamiltonian H loc , and we have introduced theshorthand ˆ C for the sequence of local operators of thecurrent conﬁguration. A na¨ıve implementation of Eq. (7)involves k multiplications of 4 N orb × N orb matrices, whichscales as O ( k exp( αN orb )) with a constant α . Reducingthe computational impact of the calculation of w loc ( C ) isthus usually the main objective of optimizing CT-HYBcodes.In general, the local Hamiltonian H loc conserves aset of quantum numbers. Consequently, the many-bodyHilbert space can be partitioned into a set of linear sub-spaces {S} , so-called “superstates”, and the Hamilto-nian can be brought into a block-diagonal form with re-spect to these superstates. We can write Eq. (6) as: w loc ( C ) = X S w loc , S ( C ) := X S X s ∈S h s | ˆ C | s i . (8)In deﬁning the quantum numbers, we impose the addi-tional requirement that the impurity operators do nottake a state from one superstate to more than one othersuperstate, thereby possibly merging multiple blocks ofthe Hamiltonian into one superstate. This implies that w loc , S can be calculated independently for each S . Thescaling is now controlled by the size of the largest super-state (in the worst case), i.e., by something much smallerthan 4 N orb . Since the application of an impurity opera-tor corresponds to a one-to-one mapping between diﬀer-ent superstates (and giving zero if it violates the Pauliprinciple), a further optimization is possible: For each su-perstate S , one can follow the sequence of superstates byusing this mapping starting with S at τ = 0 until reach-ing τ = β . If one reaches zero at any point or ends up ina diﬀerent superstate at the other end, w loc , S is exactlyzero and does not need to be calculated using possiblymuch more costly linear algebra. We will refer to thisprocedure as quantum number checking in the followingsections.This concludes our overview of what we refer to asconventional CT-HYB method. III. SUPERSTATE SAMPLINGA. General description

The main idea of this work is to transform the deter-ministic summation over the eigenstates of the impurityin Eq. (6) into a stochastic summation. The originalMonte Carlo conﬁguration is split into many “smaller”weights. While having been proposed , it has never beenimplemented to the best of our knowledge.We focus on two strategies in particular, one that par-titions the sum into subsets by quantum numbers andone that breaks it up entirely. We call them1. “Superstate sampling”: the summation over all su-perstates S is now done by Monte Carlo sampling: Z = X C X S | {z } QMC X s ∈S h s | ˆ C | s i w bath ( C ) , (9)where each Monte Carlo conﬁguration now containsthe trace over all states s within a superstate.2. “State sampling”: the summation over all states s in Eq. (8) is now done by Monte Carlo sampling: Z = X C X S X s ∈S | {z } QMC h s | ˆ C | s i w bath ( C ) , (10)where each Monte Carlo conﬁguration now containsthe local conﬁguration evaluated for a single outerstate s .In this section, we will focus on superstate sampling,while state sampling will be discussed in Sec. IV.The fragmentation of the sum reduces the amount ofcalculations needed for one local weight and allows us tomove faster through phase space. It is thus particularlybeneﬁcial in systems with low symmetry, which can havemany superstates with small but nonzero contributionsto the local weight. On the other hand, if many quantumnumbers can be used in a calculation with the conven-tional sampling, an increase in β results in an eﬀectivereduction of the number of possible outer superstates.This “help” is a side eﬀect of the large number of opera-tors in the trace present in the low- T limit. It results in FIG. 1. Average relative contributions of outer superstatesto the local weight per conﬁguration for a typical simulationwith ﬁve orbitals and Kanamori interaction . For compara-bility, the superstates are ordered by their contribution (i.e.,absolute value of the part of the local weight sum from allstates contained in the superstate) for each individual con-ﬁguration, i.e., superstate “1” does not denote one speciﬁcconstant superstate, but always refers to the biggest contrib-utor. a very high chance of quantum number violation and ithence substantially restricts the room for maneuver forthe outer superstates. The advantage of superstate sam-pling is therefore twofold: at any temperature an easyand natural selection of the most important outer super-states and much less need for quantum number checking,particularly beneﬁcial at low T .By sampling superstates, we are sampling a sum ofterms with potentially diﬀerent signs. This may inducea sign problem, which would in general be expected toworsen exponentially with decreasing temperature. (Thisis why it is important to combine all possible bath con-ﬁgurations into a bath determinant in CT-HYB. ) Yet,we do not observe any worsening of the average sign insuperstate sampling compared to the original algorithm(cf. Sec. V). A heuristic argument for this can be sum-marized thusly: Since the mean expansion order growslinearly with the inverse temperature β , the averagenumber of superstates that violate the Pauli principleincreases, until at a certain β , we are often left with onlyone outer superstate. For example, we have observed thatfor a typical metallic systems, a temperature of the orderof 10 − of the electronic bandwidth is about the pointwhere many conﬁgurations have only a single superstatecontributing to the trace. At such low temperatures, thelocal weight in conventional sampling is the sum over onlyone outer superstate, so switching to superstate samplingshould not aﬀect the sign.When multiple superstates contribute, such as in sim-ulations at high temperatures, the switch to superstatesampling could in principle cause a diﬀerence depend-ing on the superstates’ relative weight and relative sign. A β AA B C D A A β AA B A E β EE F B A E

FIG. 2. Three conﬁgurations represented as line segmentswith calligraphic letters denoting the superstates. An “inner”(“outer”) pair move is shown in the upper (lower) part of theﬁgure. The impurity operators are represented by symbols,whose ﬁlledness tells whether they are creators or annihila-tors. The imaginary times of the operators is given by theirposition along the segment, which represents the imaginarytime axis. The letter above the τ = 0 mark is the outer super-state of the conﬁguration. To illustrate the typical superstate weight distributionin such cases, we consider relatively high-temperaturesimulations. For a ﬁve-orbital model with Kanamoriinteraction , we show the average distribution of thelocal weight of a conﬁguration onto the outer superstatesin Fig. 1. It is clear how the local weight of each conﬁg-uration is strongly dominated by the contribution of onesuperstate. Similar results can be obtained for a simplertwo-orbital model. We can therefore expect the methodto be useful for high temperatures as well, as it allows usto sample conﬁgurations with their “ideal” outer super-states. B. Sampling and ergodicity

Since we extend the conﬁgurations in the superstatesampling method by an outer superstate, the simulationmust be able to reach every conﬁguration with nonzeroweight independent of its outer superstate to preserveergodicity. a. Inner pair moves.

While the possibility to changethe outer superstate needs to be available for ergodicity,we observe that it only needs to be done comparativelyrarely in the simulation. Therefore, we do not changethe outer superstate when inserting or removing a pairof operators, which are the most common moves pro-posed. We call this variant the “inner” pair moves: Be-cause we ﬁx the outer superstate, only states that lie between the inserted or removed operators in imaginarytime can change and need to be recalculated.Consider the example of the “inner” pair insertionmove shown in the upper half in Fig. 2: The old con-ﬁguration C old is the one shown in the middle panel. Wewant to perform an “inner” pair move that inserts thetwo orange operators with random times and ﬂavors inthe top diagram, which represents the resulting conﬁgu-ration C inner . In this most commonly proposed type ofmove, new conﬁgurations C inner are only proposed withthe same outer superstate as the old conﬁguration C old .The superstate sequence between the two new orangeoperators is new whereas the part outside them remainsas in the old conﬁguration. These moves are in a sensemost closely connected to the pair moves of conventionalsampling as the outer superstates with the biggest con-tributions are not likely to change in local moves. b. Outer pair moves. In the other local pair moveswe consider, the “outer” moves, the prescription for thesuperstate sequence is the opposite. The superstates be-tween the inserted operators are to be left unchanged,and the sequence must be continued from there to τ = 0to determine the superstate that should be used as newouter superstate. An example for an outer insertion isthe move from the conﬁguration C old in the middle ofFig. 2 to C outer on the bottom, where the inserted op-erators are the same as in the inner move example foreasy comparison. Since we ﬁx the inner part of the su-perstate sequence, the entire part “outside” of the orangeoperators changes.The acceptance of outer pair moves is in practice how-ever signiﬁcantly smaller than that of the “inner” pairmoves. This can be understood thinking about the limitof local-in- τ moves: These, in order to be considered lo-cal in the “outer” case, are subject to the additional con-straint of having the two operators at opposite ends ofthe trace. In the next subsection, we discuss a more eﬃ-cient way to ensure ergodicity with respect to the outersuperstate, the so called global τ -shift move. We willalso show in Appendix A that the global τ -shift movesinduce an equivalence between inner and outer moves,which however does not imply equal acceptance rates.A ﬁnal noteworthy detail of this sampling procedureis the choice of the outer superstate for the initial con-ﬁguration at the beginning of the simulation. While itshould not inﬂuence the simulation after thermalization,for many highly excited outer superstates the local weightis close to zero. We thus select the initial outer super-states randomly with probabilities proportional to theirlocal weights.Let us note that one could think of simpler techniquesthan the presented moves to ensure ergodicity, e.g., theaddition of a move that changes only the outer super-state, or the possibility to change the outer superstaterandomly during each move. Both of these turn out tobe ineﬃcient ways compared to those we present here. A ∆ τ β AA B C B A C β CC B A B C ∆ τ β − ∆ τ FIG. 3. Diagrams representing two conﬁgurations that canbe obtained from one another using a τ -shift move. The sym-bols on the imaginary time segments represent the impurityoperators, with shapes standing in for ﬂavors and ﬁllednessfor whether they are creators or annihilators. Calligraphicletters denote superstates with the ones above the τ = 0 and τ = β mark giving the outer superstate of the conﬁguration.When a τ -shift move is performed, all operators are shiftedaccording to τ → τ − ∆ τ (wrapping around to τ → τ + β − ∆ τ if necessary). Adding such a move is, however, necessary for ergodic-ity in a simulation of a system in the atomic limit, i.e.,without hybridization, because outer (and inner) inser-tions would always be rejected, operators for outer re-movals are not present, and the move presented in thefollowing section does not change a conﬁguration withoutoperators at all. Therefore, we do occasionally proposea change of just the outer superstate in conﬁgurationswithout any operators.

C. Global τ -shift moves A global “ τ -shift move” shifts the positions of all oper-ators in imaginary time by a random ∆ τ ∈ [0 , β ], whichcan equivalently be thought of as a shift of the imaginarytime axis. At the same time, the new outer superstate isby construction chosen to be consistent with this shift ofthe origin of the imaginary time axis (see Fig. 3).Using just inner pair moves and the global τ -shiftmove, a superstate sampling simulation is ergodic if andonly if it is ergodic using conventional sampling with pairmoves. This is because if any conﬁguration can be builtup using pair moves in conventional sampling, any con-ﬁguration can be built up using inner pair moves with theouter superstate being one of the contributing ones in su-perstate sampling. A proof of ergodicity can be found inAppendix A.Let us now consider the properties of the τ -shift move.Over the course of an entire simulation, the proposalprobability for a speciﬁc outer superstate in this kindof move is proportional to the average relative amount ofimaginary time it covers. Since the superstate sequenceis cyclic and eﬀectively also just shifted along the τ -axis,there is no need to perform quantum number checking.A global move similar to our τ -shift was introduced by Shinaoka et al. for a diﬀerent technical reason.The proposal probabilities of a τ -shift move and its re-verse are equal. The acceptance probability of this moveis 1. In Appendix B we prove that the bath determinantremains unchanged, as the action of the τ -shift on thehybridization matrix move eﬀectively corresponds to anumber of permutations and multiplications of rows andcolumns. Additionally in Appendix C there is a proofthat the local trace remains unchanged under a combinedcyclic permutation of the operators and correspondingchange of the outer superstate.Let us discuss why τ -shift moves allow us to preserveergodicity. Inner moves alone cannot change the outersuperstate, but only superstates in parts not including τ = 0 = β . τ -shifts indeed move “the section with theouter superstate” away from τ = 0 = β and can henceshift the operator/superstate sequence in such a way thatthe outer superstate changes while the conﬁguration re-mains otherwise equivalent. In combination with the or-dinary inner pair moves it can hence change the super-state of any section without need for outer pair moves.Due to its favorable characteristics compared to theouter moves, we usually add just τ -shift moves to the al-ways necessary inner moves to ensure ergodicity of sim-ulations with respect to the outer superstate. We chooseto propose τ -shift moves as 0 .

5% of all moves by default,which was also the ratio used in all calculations shownlater. While a smaller ratio might improve performance,the potential speed up in usual cases would be small asthe τ -shift moves usually do not take up the majority ofthe time.Additionally, we also allow random changes of theouter superstate in our implementation during otherglobal moves that can be used in CT-HYB but whichwe do not further discuss in this paper. For such globalmoves as the ﬂavor permutations used in w2dynamics ,procedures mapping the old superstate to a new proposalbased on the speciﬁc move could actually be thought of,but since these moves only serve to go between badlyconnected areas of phase space, we prefer not to restrictthem more than necessary. IV. STATE SAMPLINGA. General description

The superstate sampling method of the last section al-ready signiﬁcantly reduces the cost of computing a localweight. Similarly to the predominance of a single su-perstate in the local weight (Fig. 1), we often ﬁnd thatwithin one superstate, the individual eigenstates show asimilar trend (Fig. 4, using the same model ): The con-tribution of one superstate S is dominated by the contri-bution of one or a small group of its eigenstates s ∈ S .This suggests trying to apply the principle of superstatesampling one level deeper in the form of an “(eigen)statesampling”.We split the local weight further into even smallerparts, where the summation over all states s within asuperstate S is also done as a Monte Carlo sum: Z = X C X S X s ∈S | {z } QMC w loc ,s ( C ) w bath ( C ) , (11)with w loc ,s ( C ) = h s | ˆ C | s i . (12)To avoid confusion, let us stress at this stage that ourmethod does not make any assumption on the contribu-tion of the superstates to the local weight. It is an exactsampling with no approximation involved.Unlike for superstate sampling, a heuristic argumentfrom conventional sampling for a sign close to 1 cannotbe given (unless a system with superstates containing asingle state each is considered). This stems from the factthat at least quantitatively, the weight in state samplingis always diﬀerent from the one in conventional sampling,as the latter sums up contributions from at least one en-tire superstate. However, as stated earlier, we empiricallyﬁnd the contribution of a superstate to the local weight isoften dominated by one of its eigenstates. This suggeststhat the sign should often not be much worse than in su-perstate sampling, as those dominating states should besampled considerably more often than other ones. Be-cause the local weight in state sampling when such adominating state s is the outer state has the same sign asthe local weight of the corresponding conﬁguration withouter superstate S ∋ s in superstate sampling, the maxi-mum deterioration of the sign as compared to superstatesampling should be related to the extent to which individ-ual states dominate the superstate weight contributions.As the time evolution further suppresses states of higherenergy at lower temperature, causing the lower energystates to dominate the superstates’ contributions moreand more, the “sign gap” between the methods shouldalso decrease with decreasing temperature.Finally, let us note that this sampling technique onlyimproves performance over superstate sampling if thesummation over the outer states is actually performedas the outermost summation in the numerical implemen-tation as well, as opposed to multiplying the operatormatrices ﬁrst. While performing the summation overouter superstates ﬁrst is common in implementations ofconventional CT-HYB sampling as this allows the useof quantum number checking for performance improve-ment, the summation over the outer states is instead of-ten performed only after the multiplication of the oper-ator matrices (per superstate), as this allows the use ofoptimizations employing tree structures. As opposed tothese matrix-matrix implementations, there are so-called“matrix-vector” ones that perform the summation overthe outer states as the outermost one. For more detailedinformation, see Appendix D. FIG. 4. Average relative contribution of outer states withinone superstate , ordered by the size of contribution per con-ﬁguration, normalized to the total weight of that superstate.This graph was obtained from a typical simulation of a ﬁve-orbital system with Kanamori interaction . B. Choice of the outer state within a superstate

The crucial point of moves in state sampling comparedto those in superstate or conventional sampling is how tochoose the outer state. This is also the only point inwhich our state sampling moves diﬀer from the ones wepropose for superstate sampling. In superstate samplingthe choice of the new outer superstate is clear from theway in which the superstate sequence is changed by themove. This is the case for both outer moves and the τ -shift move (cf. Secs. III B and III C). In state sampling,the situation is diﬀerent because a qualitative equivalentof the superstate sequence for states does not exist. At τ = 0 there is indeed an eigenstate of the Hamiltonian,but it will change into a linear combination of multipleeigenstates after the ﬁrst application of an impurity op-erator.In the inner pair moves (cf. Sec. III B, Fig. 2), we sim-ply keep the outer state ﬁxed, just as we kept the outersuperstate ﬁxed in the corresponding moves in superstatesampling.For the kinds of moves which we want to use to changethe outer state, we randomly choose the outer state tobe proposed from a suitable set (cf. Fig. 5). This in-volves ﬁrst following essentially the same procedure fora move as in superstate sampling, which allows us to ob-tain an outer superstate proposal which we could call thetarget superstate. From this target superstate, we ran-domly choose a state to propose as outer state, in ourspeciﬁc implementation according to the distribution inEq. (13). In this optimized probability distribution forthe outer state proposal, we take only the eigenenergiesof the diﬀerent possible outer states into account, as wehave found this to be a useful way to increase the sam-pling eﬃciency. FIG. 5. Left: a superstate sampling conﬁguration for a two orbital model with Kanamori interaction. Bold horizontal linesdenote a time-evolution of an eigenstate, the dotted vertical lines the operators that cause transitions between superstates.Here all superstates have size 1, except the spin-ﬂip and pair-hopping one. Right: the conﬁguration resulting from applicationof a global τ -shift by ∆ τ to all operators of the one shown on the left. τ f and τ l denote the imaginary times of the ﬁrst andlast operator after the shift. In the superstate sampling algorithm the local weight is the sum of the red and blue state as outerstate of the trace, whereas in the state sampling only one of the two is selected. Let us take a look at some detailed examples for theselatter kinds of moves: When the outer pair insertion (cf.Sec. III B) depicted by Fig. 2 is performed in state sam-pling, the initial conﬁguration depicted in the middle isof course extended by the speciﬁcation of one outer state s ∈ A , and one of the states contained in E is randomlyselected as the outer state of the proposed new conﬁgu-ration depicted below it. Similarly, in the τ -shift move(cf. Sec. III C) depicted by Fig. 3, the initial conﬁgura-tion depicted on top is extended by the speciﬁcation ofan outer state s ∈ A and one of the states contained in C is randomly chosen as the outer state of the proposedconﬁguration. In any further global moves with no par-ticular connection to the superstate sampling technique,we chose to randomly propose one of the contributingsuperstates. Therefore when they are performed in statesampling, we randomly choose one of the states containedby the contributing superstates—the only case in ourimplementation where outer states from more than onediﬀerent superstate could be proposed in one otherwiseidentical move.Proposing one of the possible states with uniform prob-ability, however, causes a lower acceptance rate comparedto the analogous moves in superstate sampling, where,e.g., the τ -shift move even has acceptance rate 1. There-fore, we employ a more eﬃcient strategy: Since close to τ = 0 there is always just the outer state s propagatingwith its eigenenergy, we can make the procedure more ef-ﬁcient by “transferring” the time evolution of this statefrom the acceptance to the proposal probability; i.e., weinclude it in the proposal probability so that in the stan-dard Metropolis acceptance probability formula, it can-cels with the equivalent factor in the weight of the conﬁg-uration. This proposal probability weighting allows us toincorporate our prior (or easier to calculate) knowledgeto avoid wasting time on proposals that would likely berejected: For a well chosen proposal probability, i.e., one close to the actual weight distribution, we raise the ac-ceptance probability for all outer states since those outerstates that would be rejected more often with uniformproposal probability are simply proposed less often. Toinclude the aforementioned time evolution close to τ = 0,we use the proposal probability p prop ( s ) = exp( − ( E s − E ) · ( τ f + β − τ l )) P k ∈T exp( − ( E k − E ) · ( τ f + β − τ l )) , (13)where T contains all states from all outer superstatesthat may be proposed and τ f and τ l are, respectively,the imaginary times of the ﬁrst and last operator afterthe move. In this way, the acceptance rate for outer statechanges is signiﬁcantly increased and, e.g., reaches about50% for the global τ -shift move in typical calculations.The best choice of proposal probability for such an opti-mization depends on the weight we expect: In this case,we essentially assume that the potentially excited stateat τ = 0 will be brought closer to the ground state bythe impurity operators, since if we expected the energyto stay at the level of the outer state, we could choosea better proposal probability assuming propagation overthe entire imaginary time with the energy of the outerstate (which corresponds to replacing τ f + β − τ l by β inthe proposal probability). V. SPEED-UP QUANTIFICATION

To check the correctness of the results, we use a two-orbital model with a Kanamori interaction, i.e., density-density, pair-hopping, and spin-ﬂip terms, and a ﬁnitenumber of bath sites for which we have a reference so-lution obtained using exact diagonalization. Both thereference self-energy as well as one calculated using ourCT-HYB solver are shown in Fig. 13 (Appendix E).In order to analyze the performance, we use a ﬁve-orbital Hamiltonian modeling a realistic transition-metalimpurity on the surface of a metal with both the full(spherically symmetric) Coulomb tensor as well as onederived from the same interaction matrix restricted toKanamori-like terms only. We perform all calculationsusing w2dynamics , with either the implementation ofconventional sampling found in older versions, or su-perstate and state sampling proposed here; other thanthe sampling method, there are no diﬀerences with signif-icant performance impact between the calculations. Toquantify the performance improvement we compare thesampling rates, i.e., the raw amount of generated (cor-related) samples per time. For the autocorrelation timewe found a minor increase of about 10 % for superstatesampling, but only about 3 % for state sampling as com-pared to the old sampling method. This can be con-sidered negligible in comparison to the speed-up factors.The mean sign is about 1 . .

98 using state sampling. For the modelwith full Coulomb interaction the sign is signiﬁcantly lessthan 1 in all cases, and it is slightly smaller with statesampling than with the other methods For the calcula-tions used to measure the speed up factors, the absolutevalues of the mean sign for all three methods can be foundin Fig. 7.Figure 6 shows the achieved speed up factors (toppanel) and the absolute sampling rates measured insimulations (bottom panel) of the impurity model withCoulomb interaction. We obtain a speed up of the MonteCarlo sampling up to a factor of about 700 in the consid-ered temperature range, depending on the used samplingmethod, temperature, and interaction. Remarkably, thespeed up of our ﬁve-orbital example was never smallerthan 100 for the most arduous case, i.e., the full inter-action. The reason why the speed up factors for theCoulomb interaction are larger under otherwise equalconditions is that a larger number of superstates con-tribute on average in this case. A general observationis that the speed up decreases with decreasing tempera-ture because the number of superstates contributing tothe trace decreases with decreasing temperature as moreoperators tend to cause more quantum number viola-tions. Therefore only the quantum number checking canbe avoided at lower temperatures, whereas at higher tem-peratures other trace contributions are present for whichmatrix-vector products need not be calculated any more.Yet, there is still a noticeable speed up even at lowertemperatures where quantum-number checking takes alarge amount of the total time . As the speed-up af-fects only the trace calculation, it will also continue todecrease with β because the computational complexityof the bath determinant (which scales with β ) is asymp-totically larger than that of the trace calculation.Another noticeable feature is the bigger advantage ofstate sampling over superstate sampling for Coulombinteraction, which is due to the larger average size of s a m p li n g s p ee d - up f a c t o r β [eV − ] Superstate/KanamoriSuperstate/CoulombState/KanamoriState/Coulomb s a m p li n g r a t e [ s − ] β [eV − ] Conventional samplingSuperstate-samplingState-sampling

FIG. 6. Comparison of Monte Carlo sampling speed, i.e.number of individual, mostly local, updates per CPU time,not including measurements, for a ﬁve-orbital system withKanamori or Coulomb interaction and cubic interpolationcurves plotted logarithmically. Top panel: speed-up factorsof the sampling methods compared to conventional samplingfor Kanamori and Coulomb interaction. Bottom panel: speedof all the sampling methods for Coulomb interaction. . . . .

81 0 50 100 150 200 a b s o l u t e m e a n s i g n β [eV − ] Superstate/KanamoriSuperstate/CoulombState/KanamoriState/CoulombConventional/KanamoriConventional/Coulomb

FIG. 7. Absolute value of the mean sign using our samplingmethods for a ﬁve-orbital system with Kanamori or Coulombinteraction with cubic interpolation. The sign obtained insimulations using conventional sampling is usually not bet-ter than with superstate sampling, which we conﬁrmed forthe case with Kanamori interaction. Note that the error ofthe sign for the model with Coulomb interaction is consid-erably larger using conventional sampling than using otheralgorithms. The graph also shows that the sign obtained instate sampling is only slightly worse that that obtained usingthe other methods. the outer superstates in that case. The amount of cal-culated matrix-vector products is reduced by approxi-mately that factor in state sampling compared to super-state sampling, as only one of the outer states is chosenin the former case. This optimization is only advanta-geous for a matrix-vector solver like ours, as additionalouter states can be included at negligible further costif the entire product of the operator matrices for a spe-ciﬁc outer superstate has been calculated, cf. Sec. IV andAppendix D. A similar optimization is also possible with-out splitting conﬁgurations in the superstate samplingmethod by using the cyclical invariance of the trace andstarting the trace calculation at a τ where the superstateof the conﬁguration is smaller than the outer superstateof the conﬁguration, but this can interfere with time sav-ings from caching of intermediate state vectors and eventhe size of the smallest superstate may be greater than1. More details on the methods can be found in Ref. 24.In conclusion, we ﬁnd that superstate sampling im-proves performance without signiﬁcant drawbacks tosuch an extent that it should always be preferable tothe conventional sampling method. Especially in simu-lations with few good quantum numbers, state samplingcan provide an additional speed-up, though it can alsoimpact the mean sign. In our examples, the speed-upin the case with full Coulomb interaction is big enoughto clearly outweigh the marginally reduced sign, but thismay depend on characteristics of the system and the im-plementation. VI. APPLICATION: THE SPIN-FREEZINGCROSSOVER

In Ref. 25 Werner, et al. applied DMFT to a modelwith three degenerate orbitals and rotationally-invariantCoulomb interaction. Upon changing the ﬁlling n , theyidentiﬁed a sharp change in the qualitative behavior ofthe local spin susceptibility. For small ﬁllings, the lat-ter becomes rapidly small at large imaginary times τ , asin a standard metal. Approaching the Mott transitionat n = 3, it starts instead to closely resemble that ofan atomic insulator, i.e., it seems to become essentiallyconstant in τ . This is surprising, as it happens for ﬁll-ings still on the metallic side, before reaching the metal-insulator Mott-Hubbard transition. The sudden loss ofcoherence was interpreted as an abrupt crossover – oreven a true quantum phase transition (“spin-freezing”) –to a bad metal characterized by violations of the Fermi-liquid properties, along a line in the zero-temperature n - U phase diagram.An independent analysis showed, however, convincingevidence that the same model remains in a metallic phaseaway from integer ﬁllings . What changes upon get-ting close to the Mott transition is the coherence of thequasiparticle excitations. The “spin-freezing” is there-fore a ﬁnite-temperature—though surprisingly rapid—crossover to a “bad-metal” rather than a T = 0 phase − . − . − . − − . − . − . − .

20 0 1 2 3 4 5 6 I m ( Σ ( i ω n )) / D iω n /D βD = 1600 µ = 5 . µ = 7 . µ = 8 . − − . − . − . − .

20 0 0 . . . . . µ = 5 . , n = 1 . µ = 7 . , n = 2 . µ = 8 . , n = 2 . I m ( Σ ( i ω n )) / D iω n /D βD = 200 βD = 800 βD = 1600 FIG. 8. Imaginary part of the self-energy on the Matsubaraaxis with

U/D = 4,

J/D = 2 /

3. Upper panel: three diﬀer-ent chemical potentials at low temperature βD = 1600. Thecorresponding ﬁllings are reported in the lower panel and rep-resent values across the spin-freezing crossover (see inset toFig. 11). A close up of the low-frequency region is shownin the lower panel together with the evolution with temper-ature. The Fermi liquid behavior for n = 2 .

63 ( µ = 8 . βD = 200. transition. As long as the DMFT self-consistence doesnot lead to a gap in the spectral function of the bath ofthe corresponding Anderson impurity model, the solutionat zero temperature must indeed be a Fermi liquid. Thisconclusion was already demonstrated more rigorously inRefs. 27 and 28.Yin, Haule, and Kotliar and, immediately afterthem, Georges, de’ Medici, and Mravlje performed CT-HYB calculations of hitherto unprecedented eﬃciency,reaching temperatures 1000-1500 times smaller than thehalf bandwidth (see Fig. 6 and Fig. 7 in the two papers,respectively). The focus was on the functional depen-dence of ImΣ( iω n ) which, even after the spin-freezingcrossover, was shown to follow a Fermi-liquid scaling atextremely low frequencies, visible at these temperatures.A recent study nailed this down using an advanced mul-tiorbital numerical renormalization group solver .Similar types of crossovers have been discussed in thepresence of spin-orbit interaction always showing the0change of behavior in the local spin susceptibility at aﬁxed temperature. Yet, one would like to unambigu-ously demonstrate that this is actually the physics of acrossover from a good to a bad metal with a coherencetemperature that becomes fairly small upon approachingthe Mott transition at half ﬁlling.Here, we consider the same model of Ref. 25 as a func-tion of doping focusing in particular on the tempera-ture dependence. Using the CT-HYB implementationof w2dynamics , featuring both superstate sampling andsliding window sampling , we are able to obtain a clearpicture of the temperature evolution of the local spinresponse, identifying a coherence scale, even deep intothe “spin-freezing region”. The quantities of interest arethe electronic self-energy Σ( iω n ) and the static local spinsusceptibility χ ω =0loc ( T ) = Z β d τ χ loc ( τ ) , (14)i.e., the ω = 0-Fourier component of the spin-spin re-sponse function χ loc ( τ ) = g P ij (cid:10) S iz ( τ ) S jz (0) (cid:11) (with i and j running over the three orbitals). The half band-width of the semicircular noninteracting density of thestates of each orbital is D (corresponding to 2 t in Ref. 25).The coupling constants of the three-orbital Kanamori in-teraction are the usual Hund- J and Hubbard- U (with V = U − J ) at ﬁxed J/U = 1 / iω n ) for three diﬀerent ﬁllings at thelowest temperature T = 1 /β = D/ n = 1 . iω n )for iω n → T doesnot change the shape of ImΣ( iω n ) but only makes theMatsubara frequencies denser, remaining on the same“straight” line. This is a manifestation of the so-called“ﬁrst-Matsubara” rule , according to which a T scattering rate characteristic of a Fermi liquid gives riseto a linear-in- T value for ImΣ( iω n =0 ).The situation is drastically diﬀerent for larger valuesof n . Note that at this U , n = 2 .

35 and 2 .

63 had beenalready assigned to the “spin-freezing region” in the orig-inal paper by Werner, et al. (see inset to Fig. 11). Atthese ﬁllings the low-frequency part of ImΣ( iω n ) is highlynonlinear and it is clear that to recover linearity one hasto consider the lowest temperatures (and probably evenlower than T = D/ n = 2 . n .An inspection of the local spin susceptibility conﬁrmsthat the physics at n = 2 .

35 and 2 .

63 is not qualitativelydiﬀerent from the good-metal ﬁllings but it is just theresult of a strong renormalization of the coherence prop-erties. The results are shown in Fig. 9. For an atom, χ ( τ ) is perfectly ﬂat independently of the temperature,so that its integral from 0 to β is proportional to β (Curielaw). For a Fermi liquid, its shape instead has to change . . . . . . χ l o c [ µ B ] τ /β µ = 5 . , n = 1 . . . τ /β µ = 7 . , n = 2 . . . τ /β µ = 8 . , n = 2 . βD = 200 βD = 800 βD = 1600 FIG. 9. Local spin susceptibility for three diﬀerent ﬁllingsacross the spin-freezing crossover. While the instantaneousmoment at τ = 0 gets larger upon increasing the ﬁlling, thevalue at τ = β/ with temperature in such a way that its integral gives aconstant Pauli susceptibility. Even though the speed ofthe decay for n = 2 .

35 and 2 .

63 is greatly reduced com-pared to n = 1 . et al. ), a pro-nounced temperature dependence of χ ( τ ) is present alsofor the larger ﬁllings, revealing the Fermi-liquid proper-ties. . . . . .

01 0 .

02 0 .

03 0 .

04 0 . χ ( τ = β ) [ µ B ] T /D h n i = 1 . h n i = 1 . h n i = 1 . h n i = 1 . h n i = 2 . h n i = 2 . . . . . . . . . . . χ ( τ = β ) [ µ B ] T /D h n i = 1 . h n i = 2 . h n i = 2 . FIG. 10. Behavior of χ ( τ = β/

2) for diﬀerent temperaturesand ﬁllings. The lower panel contains a close-up of the threeﬁllings shown in Fig. 9 across the spin-freezing crossover. τ = β/

2: Inthe Fermi-liquid case this has to go (quadratically) to 0upon reducing T . The coherence temperature can be esti-mated for instance from the inﬂection point of χ ( τ = β/ χ ( τ = β/

2) approaches 0 andhence how it makes the coherence temperature drop. Atthe same time our results reveal how the physics of thismodel is qualitatively the same at all metallic ﬁllings.The diﬀerence between the curves for diﬀerent n is in-deed only the velocity of the renormalization and thetemperature scale at which the Fermi-liquid scaling is re-covered. FIG. 11. Temperature dependence of χ ω =0loc ( T ) for diﬀerentﬁllings (indicated by square markers of the same colors in theinset taken from Ref. 25). The low- T Pauli-like behavior isvisible in the “nonfrozen” regime (lower panel). At highertemperature there is a crossover to Curie-Weiss physics. Thelatter gets more and more dominant in the “frozen” regime(upper panel; upper right region of parameter space in theinset).

The crossover from Curie to Pauli upon reducing thetemperature can also be visualized in Fig. 11. Upon ap-proaching the spin-freezing crossover the Pauli behaviorgets progressively pushed to lower and lower tempera-tures. In the curves at n = 2 .

63 for instance it is clearthat even lower temperatures would be needed to fullyresolve it. For this reason it is hard to unambiguouslyprove that deep into the “spin-freezing region” the co- herence scale is actually exponentially small. Even inthe good metal region a precise estimate of the crossovertemperature T coh is not a trivial task. First of all thereis a dependence on the observable one is focusing on.Second, even by looking at the same physical quantity,diﬀerent criteria give somewhat diﬀerent answers. . . . . . .

12 0 0 . . . T c o h / D h n i Estimated inﬂection pointFrom ﬁt to c + a T +2 T coh FIG. 12. Coherence temperature T coh determined with twodiﬀerent methods: with the inﬂection point of the curvesshown in Fig. 10 and by ﬁtting the high temperature regionwith Wilson’s formula . We have not been able to obtaingood results using the latter method at the lowest densities. In Fig. 12 we quantify T coh via two approaches basedon χ ( τ = β/ T coh give compatible results though aﬀected by siz-able errors. Furthermore, even though the crossover fromthe high-temperature ∼ /T Curie to the Pauli region be-comes relatively sharp (i.e., in principle more easily iden-tiﬁable) after crossing the “spin-freezing” crossover line,we have diﬃculties in precisely determining T coh , as thelatter is very small there and we do not have many sus-ceptibility data points at such low temperatures. Never-theless, the existence of a sudden drop of T coh approach-ing half-ﬁlling, as in fact pioneered by Werner et al. inRef. 25, is undoubted. Similar conclusions are corrobo-rated by high-precision numerical renormalization groupstudies, published in Refs. 37 and 38. The reason whythis crossover is so sharp, as well as its shape in thedoping- U diagram, are not fully clear yet .In real materials, the position of the coherence scalecan be strongly inﬂuenced by several factors, such as thenonlocal hybridization between orbitals, absent in themodel Hamiltonian studied here. One of these factorshas been also identiﬁed in the presence of sharp peaksin the noninteracting density of the states , somethingoften coexisting with the many-body physics in stronglycorrelated materials. VII. CONCLUSIONS

We have shown that the sum over all impurity eigen-states of the local problem in CT-HYB can be divided2into smaller pieces, and sampled individually. This frag-mentation leads to a remarkable gain in the algorithm’seﬃciency, to some extent against the general intuition.This is due to the exponential character of the imaginarytime evolution e − H loc τ , which very sensitively damps theamplitudes of high energy excitations. Acting on thecore of the exponential scaling in CT-HYB we manageto achieve speed-up factors of the order of 10 , with es-sentially no worsening of the average sign. Additionalwork has to be carried out in order to show whether theimpact of the exponential scaling of the local problemcan be reduced further by employing methods based onour ideas.The speed up ﬁgures have been obtained for a ﬁve-orbital model with full-Coulomb interaction, representingphysically relevant situations such as realistic transition-metal impurities deposited on metallic substrates. Wealso discussed the well-known spin-freezing crossoverobtained in three-orbital Hubbard-model calculations.Reaching very low temperature allows us to quantify thecoherence temperature and the recovery of the Fermi-liquid properties of the self-energy, even when thisphysics is pushed to very low scales by the proximityto the Mott transition. ACKNOWLEDGMENTS

Appendix A: Proof of ergodicity of τ shift moves To prove ergodicity, rigorously at least for the case ofdensity-density interaction, by connecting two conﬁgu-rations diﬀering only in their outer superstates, we ﬁrsttake an arbitrary conﬁguration C and connect it to the“empty” conﬁguration, which has no operators: For anarbitrary outer superstate, consider the occupation num-ber basis states contained by the superstates. To get fromone superstate to another, impurity operators need to beapplied such that a state from the ﬁrst superstate wouldbe transformed into a state from the second one. To dothis in a simulation, perform inner pair moves to buildup this sequence of operators right after τ ; the secondoperator from each pair can be placed anywhere result-ing in a conﬁguration of nonzero weight, with one simple possibility being β − τ if the ﬁrst is placed at τ . Now,perform one tau-shift move with ∆ τ being the point rightafter the last operator of the inserted sequence, and theouter superstate is changed from the old, “ﬁrst” one to anarbitrary “second” one. By removing all operator pairsin the opposite order, an “empty” conﬁguration with anarbitrary outer superstate is reached. We can now runthe procedure in reverse to build up a new conﬁguration C ′ . This implies ergodicity, as any two conﬁgurations C and C ′ are connected by a ﬁnite number of moves.It is simple to demonstrate that ergodicity is not lost ifouter pair moves are replaced by global τ -shift moves orvice versa: An outer move can be replaced by a sequenceof a τ -shift move, an inner move, and the τ -shift moveinverse to the ﬁrst one. In the ﬁrst move, ∆ τ just needsto be suﬃciently big to change the order of the opera-tor pair that would be aﬀected by the outer move in τ ,then it can be performed as an inner move instead be-cause the section of the trace that is changed has entirelybeen moved inside. A τ -shift move can be replaced bya sequence of outer and inner moves: Remove all oper-ators in pairs using outer and inner moves such that anoperator-free conﬁguration with the desired outer super-state is obtained, then put all operator pairs shifted by∆ τ back in using outer and inner moves.As a ﬁnal remark, it should be stressed that eventhough a τ -shift move has acceptance 1 in superstatesampling, the acceptance of the outer pair moves remainssmaller than that of the inner pair moves. This is due tothe fact that outer moves are connected to inner moves by speciﬁc τ -shifts that have a proposal probability smallerthan 1. Appendix B: Invariance of the bath weight underglobal τ -shift move For the proof of the invariance of the bath determinant(adapted from Ref. 24), let us consider the form of thehybridization matrix elements with time ordering alongboth dimensions, ∆ ij = ∆ (cid:16) τ i − τ † j (cid:17) , (B1)where ∆ ( τ ) is the hybridization function, an antiperi-odic function with period β , τ i the imaginary time of the i -th annihilator (ordered by imaginary time), and τ † j theimaginary time of the j -th creator. The number of an-nihilators and creators shifted across τ = 0 by the movewill be denoted as N A and N C in the following.Due to the τ -shift move, the imaginary time of opera-tors with τ < ∆ τ will be transformed as τ → τ + β − ∆ τ and that of other operators as τ → τ − ∆ τ . The argu-ments of the hybridization functions are only time dif-ferences in which the shift parameter ∆ τ always cancels,but in cases where exactly one of the two operators hada τ < ∆ τ , the corresponding matrix element changes itssign.3Additionally, since the ordering of the operators iscyclically permuted, the rows and columns are cyclicallypermuted such that the N A ﬁrst rows become the N A lastrows and the N C ﬁrst columns become the N C last rows,where N A is the number of annihilators with τ < ∆ τ and N C the number of creators with τ < ∆ τ . A cyclicpermutation that moves every column or row exactly oneposition toward the front (“wrapping around” from thebeginning to the end) is equivalent to swapping adjacentcolumns or rows k − k is the size of thematrix in that dimension (as the hybridization matrix isa k × k matrix for hybridization expansion order k ).Each swap causes the determinant to change its sign,and the sign change of matrix elements where only oneoperator wrapped around the end is equivalent to multi-plying all wrapped rows and columns by −

1, where everymultiplication of a column or row causes the determi-nant to change its sign as well. In total, expressed usingthe number of wrapped operators N wrap = N A + N C ,the determinant thus accumulates an additional factorof (cid:16) ( − k − (cid:17) N wrap · ( − N wrap = ( − kN wrap , where thematrix size k is the expansion order.This extra factor is compensated by the sign that isincurred due to time ordering. That the change of thisextra sign is equal to the factor acquired by the deter-minant may be proven by considering the amounts ofpermutations necessary to restore the ordering after per-forming a τ -shift move that wraps exactly one operatoraround the origin. From this, the general case follows. Appendix C: Invariance of the local weight underglobal τ -shift move To prove the invariance of the local weight in super-state sampling under a τ -shift move, we use the cyclicinvariance of the trace (adapted from Ref. 24). Due tothe way superstates are chosen by deﬁnition we knowthat if the trace is restricted in such a way that nonzerocomponents are left for only one superstate at any τ , theresult will be the same as if done so everywhere. Ourlocal weight is eﬀectively the conventional local weightwith such a restriction applied at τ = 0 and τ = β , andif projection operators P ( x ) are inserted onto the outersuperstate x at the beginning and end of the product oftime-evolution, creation and annihilation operators cor-responding to current conﬁguration, it can be written asa proper trace: w loc = X s ∈ x h s | ˆ C | s i (C1)= tr (cid:16) P ( x ) ˆ C P ( x ) (cid:17) . (C2)As the time evolution does not mix states from diﬀer-ent superstates, the superstate projection operator com-mutes with all time-evolution operators, h exp (cid:16) − τ ˆ H (cid:17) , P ( x ) i = 0 , (C3) for any superstate x and any τ . Because the creators andannihilators map each source superstate to one uniquetarget superstate and vice versa, a projector onto a su-perstate x on one side of an annihilator or creator can bereplaced by a projector onto the superstate y which theoperator maps x to on the other side of the operator: d ( y | x ) P ( x ) = P ( y ) d ( y | x ) , (C4)where the mapping of superstates relevant for the speciﬁccase is given in parentheses in superscript with the mean-ing that applying d ( y | x ) to a state vector (to the right) inthe subspace of superstate x will produce a state vectorin the subspace of superstate y .Using these two relations, we can commute the projec-tors all the way through the product to any other pointand also to any imaginary time by splitting time evolu-tion at that time into two consecutive time evolutions ifnecessary. The projectors will not necessarily be projec-tors onto the old outer superstate any more, but ontothe superstate that can be found at that point in thesequence for the current conﬁguration. After commut-ing both projectors to the position in the product cor-responding to the imaginary time ∆ τ , the product canbe cyclically permuted such that one of the projectorsends up at each end of the trace. It is then equivalent tothe local weight of the superstate sampling conﬁgurationafter a τ -shift by ∆ τ . This shows that the local weightdoes not change after a τ -shift move. Appendix D: Compatibility of state sampling withimplementations

Whether the performance of state sampling actuallyexceeds that of superstate sampling depends on the typeof CT-HYB implementation it is used with. When re-viewing standard CT-HYB in Sec. II, we simply ex-pressed the weight of a conﬁguration as the trace of theproduct of the corresponding impurity operators. Typ-ically, there are two ways to calculate it: In a matrix-matrix implementation, the matrices representing the im-purity operators are multiplied with one another and thetrace is obtained from the diagonal elements of the to-tal matrix product. In the matrix-vector ﬂavor, we in-stead explicitly perform a sum over the outer states: Foreach outer ket-vector, we repeatedly calculate the matrix-vector product with the operators, starting from the ﬁrstall the way through to the outer bra-vector, with whichwe ﬁnally compute the scalar product.By decomposing the problem into superstates (i.e.,block diagonalizing the Hamiltonian and choosing theblocks such that the operators connect them in a one-to-one way), we simplify the calculation of the trace inthat the outer states (per superstate) follow independent,noncrossing paths through the superstates. As a result,in the matrix-vector implementation one can just reducethe size of the initial operator matrices. In a matrix-matrix implementation the trace of the product can be4decomposed into the sum of traces per outer superstate,which introduces the explicit summation whose terms canbe calculated using matrices of reduced size as well.If we consider such an implementation as the start-ing point, we just have to restrict the summation to oneouter superstate in either case to implement superstatesampling. While we can obviously reduce the number ofneeded calculations by restricting the outer sum to onestate only in the matrix-vector implementation, i.e., im-plementing state sampling, there is no way to beneﬁciallyimplement this in a matrix-matrix algorithm. We coulduse only one of the diagonal elements of the resultingproducts, but this does not make the product calcula-tion simpler and therefore does not improve performancebut would only waste the other contributions that couldbe included at negligible further cost.

Appendix E: Exact diagonalization cross-check

Here we show that the results of the CT-HYB re-sults with superstate sampling agree with an exact-diagonalization benchmark.

FIG. 13. Comparison of the CT-HYB self-energy calculatedusing the superstate sampling method (points) to exact di-agonalization (lines) for a two-orbital system with Kanamoriinteraction. A. N. Rubtsov, V. V. Savkin, and A. I. Lichtenstein,Phys. Rev. B , 035122 (2005),arXiv:cond-mat/0411344 [cond-mat.str-el]. P. Werner, A. Comanac, L. de’ Medici, M. Troyer, and A. J.Millis, Phys. Rev. Lett. , 076405 (2006),arXiv:cond-mat/0512727 [cond-mat.str-el]. K. Haule, Phys. Rev. B , 155113 (2007),arXiv:cond-mat/0612172 [cond-mat.str-el]. E. Gull, A. J. Millis, A. I. Lichtenstein, A. N. Rubtsov,M. Troyer, and P. Werner, Rev. Mod. Phys. , 349 (2011),arXiv:1012.4474 [cond-mat.str-el]. H. Shinaoka, E. Gull, and P. Werner,Comput. Phys. Commun. , 128 (2017),arXiv:1609.09559 [cond-mat.str-el]. P. Seth, I. Krivenko, M. Ferrero, and O. Parcollet,Comput. Phys. Commun. , 274 (2016),arXiv:1507.00175 [cond-mat.str-el]. K. Haule, C.-H. Yee, and K. Kim,Phys. Rev. B , 195107 (2010),arXiv:0907.0195 [cond-mat.str-el]. L. Huang, Y. Wang, Z. Y. Meng, L. Du, P. Werner, andX. Dai, Comput. Phys. Commun. , 140 (2015),arXiv:1409.7573 [cond-mat.str-el]. M. Wallerberger, A. Hausoel, P. Gunacker, A. Kowalski,N. Parragh, F. Goth, K. Held, and G. Sangiovanni,Comput. Phys. Commun. , 388 (2019), arXiv:1801.10209 [cond-mat.str-el]. P. Werner and A. J. Millis, Phys. Rev. B , 155107 (2006),arXiv:cond-mat/0607136 [cond-mat.str-el]. N. Parragh, A. Toschi, K. Held, and G. Sangiovanni,Phys. Rev. B , 155158 (2012),arXiv:1209.0915 [cond-mat.str-el]. P. S´emon, C.-H. Yee, K. Haule, and A.-M. S. Tremblay,Phys. Rev. B , 075149 (2014),arXiv:1403.7214 [cond-mat.str-el]. A. M. L¨auchli and P. Werner,Phys. Rev. B , 235117 (2009),arXiv:0908.0681 [cond-mat.str-el]. H. Shinaoka, M. Dolﬁ, M. Troyer, and P. Werner,J. Stat. Mech. Theory Exp. , P06012 (2014),arXiv:1404.1259 [cond-mat.str-el]. P. Augustinsk´y and J. Kuneˇs,Comput. Phys. Commun. , 2119 (2013),arXiv:1302.4594. M. Wallerberger, w2dynamics: continuous time quantumMonte Carlo calculations of one- and two-particlepropagators , Ph.D. thesis, Technische Universit¨at Wien,Wien (2016). One may somewhat reduce the residual advantage ofsuperstate sampling visible at large β in Fig. 6 uponoptimizing the quantum number checking or using moresophisticated schemes, such as those proposed in Ref. . We used a ﬁve-orbital Anderson impurity model with a localSlater–Kanamori interaction, parameterized by U ≈ . J ≈ .

64, and U ′ = U − J . The temperature was β = 30where not stated otherwise and the chemical potential waschosen to reach a mean ﬁlling of h N i = 8 in the interactingmodel. The hybridization function was obtained from DFTfor a Cobalt impurity on Cu(001) as described in Ref. 19. M. P. Bahlke, M. Karolak, and C. Herrmann,Phys. Rev. B , 035119 (2018),arXiv:1710.07349 [cond-mat.str-el]. Note that the number of states per superstate varies and theaxis range of 10 was suﬃcient to capture the maximumnumber of contributing states over the entire course of thesimulation due to many good quantum numbers. Speciﬁcally, a matrix-vector implementation of CT-HYBwith time evolution in eigenbasis. Quantum numberchecking (cf. Sec. II) is used to avoid the calculation of zerocontributions to the local weight using a partitioning of theHilbert space into superstates using conserved quantities forKanamori-like interaction and additionally automaticpartitioning for full Coulomb interaction. We do not useouter truncation of the local trace, sliding-window-style localupdates, tree algorithms, or other optimizations even ifmentioned in the introduction unless explicitly stated. We used about the same amount of CPU time for allsampling methods for this example, so due to the worseperformance of the conventional implementation its error isconsiderably larger. While the error in this example is thussuﬃciently large to allow other conclusions about therelative sign of the methods in some temperature ranges,data from many other calculations we did with all methodsnot speciﬁcally for the purpose of this article stronglyindicate equal signs for conventional and superstatesampling and a sign closer to zero (with model-dependentextent) for state sampling. In this work we consider the computational cost ofquantum-number checking in w2dynamics with decreasingtemperature. The magnitude of this speed up might ofcourse be lower in other implementations. A. Kowalski,

Improved sampling in continuous-time quantumMonte Carlo algorithms for fermions , Master’s thesis,Julius-Maximilians-Universit¨at W¨urzburg, W¨urzburg (2017). P. Werner, E. Gull, M. Troyer, and A. J. Millis,Phys. Rev. Lett. , 166405 (2008),arXiv:0806.2621 [cond-mat.str-el]. L. de’ Medici, J. Mravlje, and A. Georges,Phys. Rev. Lett. , 256401 (2011),arXiv:1106.0815 [cond-mat.str-el]. L. De Leo,

Non-Fermi liquid behavior in multi-orbitalAnderson impurity models and possible relevance for stronglycorrelated lattice models , Ph.D. thesis, SISSA (2004). L. De Leo and M. Fabrizio, Phys. Rev. B , 245114 (2004),arXiv:cond-mat/0402121 [cond-mat.str-el]. Z. P. Yin, K. Haule, and G. Kotliar,Phys. Rev. B , 195141 (2012),arXiv:1206.0801 [cond-mat.str-el]. A. Georges, L. d. Medici, and J. Mravlje,Annu. Rev. Condens. Matter Phys. , 137 (2013),arXiv:1207.3033 [cond-mat.str-el]. K. Stadler, G. Kotliar, A. Weichselbaum, and J. von Delft,Annals of Physics (2018), 10.1016/j.aop.2018.10.017,arXiv:1808.09936 [cond-mat.str-el]. A. J. Kim, H. O. Jeschke, P. Werner, and R. Valent´ı,Phys. Rev. Lett. , 086401 (2017),arXiv:1607.05196 [cond-mat.str-el]. A. V. Chubukov and D. L. Maslov,Phys. Rev. B , 155136 (2012),arXiv:1208.3483 [cond-mat.str-el]. L. V. Pourovskii, J. Mravlje, A. Georges, S. I. Simak, andI. A. Abrikosov, New Journal of Physics , 073022 (2017),arXiv:1603.02287 [cond-mat.str-el]. A. Hausoel, M. Karolak, E. S¸a¸sıo˘glu, A. Lichtenstein,K. Held, A. Katanin, A. Toschi, and G. Sangiovanni,Nature Communications , 16062 (2017),arXiv:1707.03789 [cond-mat.str-el]. K. G. Wilson, Rev. Mod. Phys. , 773 (1975). K. M. Stadler, Z. P. Yin, J. von Delft, G. Kotliar, andA. Weichselbaum, Phys. Rev. Lett. , 136401 (2015),arXiv:1503.06467 [cond-mat.str-el]. A. Horvat, R. ˇZitko, and J. Mravlje,Phys. Rev. B , 165140 (2016),arXiv:1606.07654 [cond-mat.str-el]. L. de’ Medici, “Hund’s metals explained,” in