Automatic transformation of irreducible representations for efficient contraction of tensors with cyclic group symmetry
Yang Gao, Phillip Helms, Garnet Kin-Lic Chan, Edgar Solomonik
Abstract.
Tensor contractions are ubiquitous in computational chemistry and physics, where tensors generally represent states or operators and contractions are transformations. In this context, the states and operators often preserve physical conservation laws, which are manifested as group symmetries in the tensors. These group symmetries imply that each tensor has block sparsity and can be stored in a reduced form. For nontrivial contractions, the memory footprint and cost are lowered, respectively, by a linear and a quadratic factor in the number of symmetry sectors. State-of-the-art tensor contraction software libraries exploit this opportunity by iterating over blocks or using general block-sparse tensor representations. Both approaches entail overhead in performance and code complexity. With intuition aided by tensor diagrams, we present a technique, irreducible representation alignment, which enables efficient handling of Abelian group symmetries via only dense tensors, by using contraction-specific reduced forms. This technique yields a general algorithm for arbitrary group symmetric contractions, which we implement in Python and apply to a variety of representative contractions from quantum chemistry and tensor network methods. As a consequence of relying on only dense tensor contractions, we can easily make use of efficient batched matrix multiplication via Intel's MKL and distributed tensor contraction via the Cyclops library, achieving good efficiency and parallel scalability on up to 4096 Knights Landing cores of a supercomputer.
1. Introduction.
Tensor contractions are computational primitives found in many areas of science, mathematics, and engineering. In this work, we describe how to accelerate tensor contractions involving block sparse tensors whose structure is induced by a cyclic group symmetry or a product of cyclic group symmetries. Tensors of this kind arise frequently in many applications, for example, in quantum simulations of many-body systems. By introducing a remapping of the tensor contraction, we show how such block sparse tensor operations can be expressed almost fully in terms of dense tensor operations. This approach enables effective parallelization and makes it easier to achieve peak performance by avoiding the complications of managing block sparsity. We illustrate the performance and scalability of our approach by numerical examples drawn from the contractions used in tensor network algorithms and coupled cluster theory, two widely used methods of quantum simulation.

A tensor T is defined by a set of real or complex numbers indexed by tuples of integers (indices) i, j, k, l, ..., where the indices take integer values i ∈ {1, ..., D_i}, j ∈ {1, ..., D_j}, etc., and a single tensor element is denoted t_{ijkl...}. We refer to the number of indices of the tensor as its order and the sizes of their ranges as its dimensions (D_i × D_j × ⋯). We will call the set of indices the modes of the tensor. Tensor contractions are represented by a sum over indices of two tensors. In the case of matrices and vectors, the only possible contractions correspond to matrix and vector products. For higher order tensors, there are more possibilities; an example of a contraction of two order 4 tensors is

    w_{abij} = Σ_{k,l} u_{abkl} v_{klij}.    (1.1)

To illustrate the structure of the contraction, it is convenient to employ a graphical notation where a tensor is a vertex with each incident line representing a mode, and contracted modes are represented by lines joining vertices, as shown in Figure 1.
Tensor contractions can be reduced to matrix multiplication (or a simpler matrix/vector operation) after appropriate transposition of the data to interchange the order of modes.

∗ Division of Engineering and Applied Science, California Institute of Technology ([email protected])
† Division of Chemistry and Chemical Engineering, California Institute of Technology ([email protected])
‡ Division of Chemistry and Chemical Engineering, California Institute of Technology ([email protected])
§ Department of Computer Science, University of Illinois at Urbana-Champaign ([email protected])

Figure 1. Representation of the contraction in Eq. (1.1). Each tensor is represented by a vertex and each mode by a line; the lines joining the vertices are contracted over. Labelling the modes indexes into the tensor.
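For concreteness, the contraction in Eq. (1.1) can be written as a single NumPy einsum call and, after grouping modes, as an ordinary matrix multiplication (a minimal sketch; the mode dimension is an arbitrary illustrative choice):

```python
import numpy as np

D = 4  # dimension of every mode, chosen arbitrarily for illustration
rng = np.random.default_rng(0)
u = rng.standard_normal((D, D, D, D))
v = rng.standard_normal((D, D, D, D))

# Direct contraction of Eq. (1.1): w_abij = sum_{k,l} u_abkl * v_klij
w = np.einsum("abkl,klij->abij", u, v)

# The same contraction as a matrix multiplication after grouping
# modes (a,b) into rows and (i,j) into columns, with (k,l) contracted.
w_mat = (u.reshape(D * D, D * D) @ v.reshape(D * D, D * D)).reshape(D, D, D, D)
assert np.allclose(w, w_mat)
```

The reshape-then-multiply form is the standard reduction of a tensor contraction to matrix multiplication mentioned above.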
In many applications, the tensors of interest contain symmetries, which allow each tensor to be stored in a compressed form, referred to as its reduced form. A special type of sparsity that often appears is one associated with a cyclic group structure defined on the indices, e.g.

    t_{ijk...} = 0 if ⌊i/(D_i/G)⌋ + ⌊j/(D_j/G)⌋ + ⌊k/(D_k/G)⌋ + ⋯ ≠ 0 (mod G).    (1.2)

For a matrix, such sparsity would lead to a blocked matrix where each block is the same size, (D_1/G) × (D_2/G). The blocks of an order 3 tensor would similarly all have the same dimensions. We refer to such tensors as tensors with cyclic group symmetry, or cyclic group tensors for short. In quantum physics and chemistry these symmetries arise due to physical symmetries and conservation laws in the system. For example, when quantum states are classified by irreducible representations (irreps) of such symmetries, their numerical representation in terms of tensors can be chosen to carry this block structure. In some applications, the block sizes are non-uniform, but this can be accommodated in a cyclic group tensor by padding blocks with zeros to a fixed size. With this assumption, the block symmetry can then be expressed via an unfolding of T, so that new symmetry modes yield indices that iterate over blocks,

    t_{iI,jJ,kK,...} = 0 if I + J + K + ⋯ ≠ 0 (mod G),

where we use the convention that uppercase indices are the symmetry modes and lowercase letters index into the symmetry blocks (physical modes).

Given a number of symmetry sectors G (as in (1.2)), cyclic group symmetry can reduce tensor contraction cost by a factor of G for some simple contractions and G^2 for most contractions of interest (any contraction with a cost that is superlinear in input/output size). State-of-the-art sequential and parallel libraries for handling cyclic group symmetry, both in specific physical applications and in domain-agnostic settings, typically iterate over appropriate blocks within a block-sparse tensor format [1, 4, 13, 16, 18–21, 28, 32, 34, 39]. The use of explicit looping (over possibly small blocks) makes it difficult to reach theoretical peak compute performance. Parallelization of block-wise contractions can be done manually or via specialized software [13, 16, 19–21, 28, 32, 34]. However, such parallelization is challenging in the distributed-memory setting, where block-wise multiplication might (depending on the contraction and initial tensor data distribution) require communication/redistribution of tensor data. We introduce a general transformation of cyclic group symmetric tensors, irreducible representation alignment, which allows all contractions between such tensors to be transformed into a single large dense tensor contraction with optimal cost, in which the two input reduced forms as well as the output are indexed by a new auxiliary index. This transformation provides three advantages:
1. it avoids the need for data structures to handle block sparsity or scheduling over blocks,
2. it makes possible an efficient software abstraction to contract tensors with cyclic group symmetry,
3. it enables effective use of parallel libraries for dense tensor contraction and batched matrix multiplication.

The most closely related previous work to our approach that we are aware of is the direct product decomposition (DPD) [24, 43], which similarly seeks an aligned representation of the two tensor operands. However, the unfolded structure of cyclic group tensors described above allows for a much simpler conversion to an aligned representation, both conceptually and in terms of implementation complexity. In particular, our approach can be implemented efficiently with existing dense tensor contraction primitives.

We study the efficacy of this new method for tensor contractions with cyclic group symmetry arising in physics and chemistry applications. In particular, we consider some of the costly contractions arising in tensor network methods for quantum many-body systems and in coupled cluster theory for electronic structure calculations. We develop a software library, Symtensor, that implements the irrep alignment algorithm and contraction. We demonstrate that across a variety of tensor contractions, the library achieves orders of magnitude improvements in parallel performance and a matching sequential performance relative to the manual loop-over-blocks approach. The resulting algorithm may also be easily and automatically parallelized for distributed-memory architectures. Using the Cyclops Tensor Framework (CTF) library [40] as the contraction backend to Symtensor, we demonstrate good strong and weak scalability on up to 4096 Knights Landing cores of the Stampede2 supercomputer.
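The unfolded form and its reduced storage can be sketched as follows; the block size n and group order G are arbitrary illustrative choices, and storing the (I, J) symmetry modes explicitly while leaving K implicit is one possible convention, not a prescription from the text:

```python
import numpy as np

G, n = 3, 2  # G symmetry sectors, each block of size n (illustrative values)
rng = np.random.default_rng(1)

# Order 3 cyclic group tensor in unfolded form t[i,I,j,J,k,K]:
# blocks are nonzero only when I + J + K = 0 (mod G).
t = rng.standard_normal((n, G, n, G, n, G))
for I in range(G):
    for J in range(G):
        for K in range(G):
            if (I + J + K) % G != 0:
                t[:, I, :, J, :, K] = 0.0

# Only G^2 of the G^3 symmetry blocks are nonzero, so a reduced form
# indexed by (I, J) alone (K = -I-J mod G implicit) loses nothing:
reduced = np.array([[t[:, I, :, J, :, (-I - J) % G] for J in range(G)]
                    for I in range(G)])
assert reduced.size == n**3 * G**2  # n^3 G^2 entries instead of n^3 G^3
```

This is the linear-factor memory saving mentioned above: the reduced form has G^2 blocks rather than G^3.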
2. Applications of Cyclic Group Symmetric Tensors.
Cyclic group tensors arise in a variety of settings in computational quantum chemistry and physics methods. This is because both the quantum Hamiltonian and its eigenstates transform as irreps of the physical symmetry operators, for example, those associated with particle number, spin, and point group and lattice symmetries [6, 22, 44]. Many of these symmetries are associated with finite cyclic groups (e.g. finite Abelian point groups or finite lattice groups) or infinite cyclic groups (e.g. particle number symmetry and magnetic spin symmetry). Those associated with infinite cyclic groups give rise to conservation laws via Noether's theorem. In this section, we summarize some of the quantum chemistry and physics applications of cyclic group tensors, and related developments to handle cyclic group symmetry within them. The performance evaluation of our proposed algorithms later considers some of the computationally expensive tensor contractions arising in these applications.
In electronic structure calculations in chemistry and physics, the most complicated part of the discretized Hamiltonian can be written as an order 4 tensor, often called the two-electron integral tensor. Similarly, the many-body eigenstates and wavefunctions are represented by tensors. One common choice of approximate wavefunction, arising in so-called correlated wavefunction methods such as configuration interaction (CI) [45] and coupled cluster (CC) methods at the doubles level of approximation [3, 7, 12, 46, 47], represents an important part of the wavefunction in terms of another order 4 tensor, the doubles amplitude tensor. Higher order tensors arise in approximations that target higher accuracy [17, 29, 35, 42].

Both the two-electron integral tensor and the amplitude tensors possess the cyclic group symmetries associated with the physical symmetries of the system. In a molecule, the largest symmetry group for which cost reductions can be achieved is usually the point group symmetry, e.g. associated with a rotational group C_nv or a product group such as D_h. In crystalline materials, crystal translational symmetry is equivalent to a cyclic symmetry along each lattice direction. Typically the cyclic groups (number of k-points) are chosen to be as large as computationally feasible. Thus the savings arising from efficient use of translational symmetry are particularly important in materials simulations [11, 14, 15, 25, 27]. Many quantum chemistry implementations leverage cyclic group symmetries within various approximate quantum methods [25, 38, 43]. In addition, a number of state-of-the-art tensor software efforts aimed at quantum chemistry applications handle cyclic group symmetry in a general way via block sparse tensor formats [13, 16, 19, 21, 28, 32, 34].
As described in Section 3.1, our proposed algorithm achieves the same computational improvement via transforming the cyclic group tensor representation, while maintaining a global view of the problem as a dense tensor contraction, as opposed to a series of block-wise operations or a contraction of block-sparse tensors.

Tensor network (TN) methods represent a quantum state, and the associated operators acting on it, as a contraction of tensors on the sites of a lattice. The contraction of the site tensors yields a full state or operator tensor with a number of modes equal to the size of the lattice (i.e. an exponentially large number of entries in the size of the lattice). The simplest such tensor network is the matrix product state (MPS) [9, 51] (also referred to as a tensor train [31]), which is a 1D tensor network. It is widely used as part of the density matrix renormalization group (DMRG) method [51, 52]. DMRG uses an alternating optimization along the 1D tensor network to arrive at representations of ground or excited eigenstates of a Hamiltonian, which is often itself represented as a tensor train (a matrix product operator (MPO) [5, 8, 49]).

Some common cyclic group symmetries in this context, known as Abelian quantum numbers, are those associated with the magnetic spin quantum number and particle number symmetries (both U(1) symmetries) as well as fermion parity (Z_2). Tensor networks that go beyond a 1D contraction structure have substantially more expressive power [48, 50, 53]. The projected entangled pair states (PEPS) are one such class defined in arbitrary dimensions (reducing to MPS in 1D) and are believed to provide a compact representation of the Hamiltonian eigenstates of physical interest in higher dimensions [48]. These more complicated tensor networks exhibit the same cyclic group symmetries as found in matrix product states. Because the symmetry groups arising in tensor network simulations can be arbitrarily large (e.g. U(1)), their efficient use can lead to orders of magnitude improvement in memory footprint and time to solution [4, 26, 36, 52]. Similarly to the situation in quantum chemistry, tensor contraction libraries have been developed to handle these symmetries, where the most common strategy is to track and loop over block-sparse representations [1, 18, 30, 36]. These contractions and block-sparse representations can similarly be replaced by the cyclic group tensor representation and irrep alignment algorithms of this work. In Section 5.2, we study two representative contractions: one arising in an MPS simulation using the DMRG algorithm, and one in the contraction of a PEPS network.
3. Irreducible Representation Alignment Algorithm.
We now describe our proposed approach. We first describe the algorithm on an example contraction and provide intuition for its correctness based on conservation of flow in a tensor diagram graph. These arguments are analogous to the conservation arguments used in computations with Feynman diagrams (e.g. momentum and energy conservation) [10] or with quantum numbers in tensor networks [37], although the notation we use is slightly different. We then give a complete algebraic derivation of all steps, accompanied by more detailed tensor diagrams to explain each step.
We consider a contraction of order 4 tensors U and V into a new order 4 tensor W, where all tensors have cyclic group symmetry. We can express this cyclic group symmetric contraction as a contraction of tensors of order 8, by separating indices into intra-block (lower-case) indices and symmetry (upper-case) indices, so

    w_{aA,bB,iI,jJ} = Σ_{kK,lL} u_{aA,bB,kK,lL} v_{kK,lL,iI,jJ}.

Algorithm 3.1 Loop nest to perform the group symmetric contraction w_{aA,bB,iI,jJ} = Σ_{kK,lL} u_{aA,bB,kK,lL} v_{kK,lL,iI,jJ}.
    for A = 1, ..., G do
      for B = 1, ..., G do
        for I = 1, ..., G do
          J = A + B − I mod G
          for K = 1, ..., G do
            L = A + B − K mod G
            ∀ a, b, i, j:  w_{aA,bB,iI,jJ} ← w_{aA,bB,iI,jJ} + Σ_{k,l} u_{aA,bB,kK,lL} v_{kK,lL,iI,jJ}
          end for
        end for
      end for
    end for

Here and later we use commas to separate index groups for readability; each character is a distinct index. The input and output tensors have a group symmetry defined by a cyclic group, so that

    w_{aA,bB,iI,jJ} ≠ 0 only if A + B − I − J ≡ 0 (mod G),
    u_{aA,bB,kK,lL} ≠ 0 only if A + B − K − L ≡ 0 (mod G),
    v_{kK,lL,iI,jJ} ≠ 0 only if K + L − I − J ≡ 0 (mod G).

Ignoring the symmetry, this tensor contraction would have cost O(n^6 G^6), where n is the dimension of each symmetry sector, but with the use of group symmetry this is reduced to O(n^6 G^4). Existing approaches implement the contraction with O(G^4) block-wise operations via a manual loop nest, as shown in Algorithm 3.1. The reduced forms of W, U, and V can be written as dense tensors whose elements may be accessed via the appropriate 3 symmetry indices to perform the multiply and accumulate operation in Algorithm 3.1. Reduced forms provide an implicit representation of one of the tensor modes (see Figure 2).

Figure 2. Tensor diagrams of standard forms of reduced tensors (legs with arrows are indices/modes of the reduced tensor; legs without arrows are represented implicitly via an irrep map). The sum of the 4 arrows at the vertex (dot) satisfies A + B = I + J (mod G), making one of the arrows redundant.

However, the indirection needed to compute L and J within the innermost loops prevents expression of the contraction in terms of standard library operations for contraction of dense tensors. Figure 3 illustrates that standard reduced forms, where some choice of 3 symmetry indices of the original tensor is represented in each reduced form, cannot simply be contracted to obtain a reduced form as a result. This problem has motivated the use of sparse or block sparse representations of the tensors, which permit a work efficient approach for arbitrary group symmetries [16]. However, the need to parallelize general block-wise tensor contraction operations creates a significant software-engineering challenge and computational overhead for tensor contraction libraries.

Figure 3. These two tensor diagram equations aim to illustrate why certain reduced forms cannot be contracted directly. With the reduced forms chosen in the top case, the two symmetry relations of the implicitly stored index K mean that the contraction must be multiplied by a factor of δ(A + B, I + J). Without this, the contraction would perform too much computation. In the second case, the reduced forms cannot be contracted to produce a valid standard reduced form for the output (one needs 3 of the uncontracted indices to be represented / marked with arrows).

The main idea in the irreducible representation alignment algorithm is to first transform (reindex) the tensors using an auxiliary index which subsequently allows a dense tensor contraction to be performed without the need for any indirection or symmetry rules.
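A direct NumPy rendering of the loop nest of Algorithm 3.1 makes the indirection on J and L explicit. This is a sketch; the choice of which three symmetry modes each reduced form stores explicitly is an illustrative convention, not Symtensor's internal layout:

```python
import numpy as np

G, n = 3, 2  # illustrative group order and block size
rng = np.random.default_rng(0)

# Standard reduced forms storing 3 of the 4 symmetry modes of each order 8 tensor:
# U[A, B, K] holds the block u_{aA,bB,kK,lL} with L = A + B - K (mod G) implicit,
# V[K, L, I] holds the block v_{kK,lL,iI,jJ} with J = K + L - I (mod G) implicit.
U = rng.standard_normal((G, G, G, n, n, n, n))
V = rng.standard_normal((G, G, G, n, n, n, n))

# W[A, B, I] holds w_{aA,bB,iI,jJ} with J = A + B - I (mod G) implicit.
W = np.zeros((G, G, G, n, n, n, n))
for A in range(G):
    for B in range(G):
        for I in range(G):          # J = A + B - I (mod G) is implied
            for K in range(G):
                L = (A + B - K) % G  # fixed by u's symmetry rule
                W[A, B, I] += np.einsum("abkl,klij->abij",
                                        U[A, B, K], V[K, L, I])
```

The modular arithmetic on L (and the implicit J) is exactly the indirection that blocks the use of a single dense library contraction.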
In the above contraction, we define the auxiliary index as Q ≡ I + J ≡ A + B ≡ K + L (mod G) and thus obtain a new reduced form for each tensor,

    ŵ_{aA,b,i,jJ,Q} = w_{aA, b(Q−A mod G), i(Q−J mod G), jJ},
    û_{aA,b,k,lL,Q} = u_{aA, b(Q−A mod G), k(Q−L mod G), lL},
    v̂_{k,lL,i,jJ,Q} = v_{k(Q−L mod G), lL, i(Q−J mod G), jJ}.

This reduced form is displayed in Figure 4.

Figure 4. The symmetry aligned reduced form is defined by introducing the Q index. Each of the two vertices defines a conservation rule: A + B = Q (mod G) and Q = I + J (mod G), allowing two of the arrows to be removed in the 3rd diagram, i.e. to be represented implicitly as opposed to being part of the reduced form.

The Q index is chosen so that it can serve as part of the reduced forms of each of U, V, and W. An intuition for why this alignment is possible is given via tensor diagrams in Figure 5. The new auxiliary indices (P and Q) of the two contracted tensors satisfy a conservation law P = Q, and so can be reduced to a single index.

Figure 5. By defining conservation laws on the vertices, we see that P = K + L (mod G) and K + L = Q (mod G). Consequently, the only non-zero contributions to the contraction must have P = Q.

Figure 6. The reduced forms may be contracted efficiently to produce the output reduced form.

Ignoring intra-block indices, the resulting contraction may be performed with the einsum operation
W = einsum("AQL,LQJ->AQJ", U, V). As shown in Figure 6, given the aligned reduced forms of the two operands, we can contract them directly to obtain a reduced form for the output that also has the internal symmetry index Q. Specifically, it suffices to perform the dense tensor contraction

    ŵ_{aA,b,i,jJ,Q} = Σ_{k,lL} û_{aA,b,k,lL,Q} v̂_{k,lL,i,jJ,Q}.

This contraction can be expressed as a single einsum operation (available via NumPy, CTF, etc.) and can be done via a batched matrix multiplication (available in Intel's MKL). Once Ŵ is obtained in this reduced form, it can be remapped to any other desired reduced form.

We now provide a formal derivation of the algorithm for contractions of tensors of arbitrary order, with cyclic group symmetries described by arbitrary coefficients. Algorithm 3.2 provides a description of each computational step and the associated costs. We represent an order N complex tensor with cyclic group symmetry as in (1.2) as an order 2N tensor, T ∈ C^{n_1×H_1×⋯×n_N×H_N}, satisfying, for some implicit mode index j ∈ {1, ..., N}, modulus remainder Z ∈ {0, ..., G−1}, and coefficients c_1, ..., c_N with c_i = G/H_i or c_i = −G/H_i,

    t_{i_1I_1...i_NI_N} = r^(T)_{i_1I_1...i_{j−1}I_{j−1} i_j i_{j+1}I_{j+1}...i_NI_N} if c_1I_1 + ⋯ + c_NI_N ≡ Z (mod G), and 0 otherwise,    (3.1)

where the order 2N−1 tensor R^(T) is the reduced form of the cyclic group tensor T. Any cyclic group symmetry may be more generally expressed using an irrep map, represented as a tensor with binary values, M^(T) ∈ {0,1}^{H_1×⋯×H_N}, as

    t_{i_1I_1...i_NI_N} = r^(T)_{i_1I_1...i_{j−1}I_{j−1} i_j i_{j+1}I_{j+1}...i_NI_N} m^(T)_{I_1...I_N}.

The tensor diagram form of this equation with j = N is given in Figure 7.

Figure 7. Group symmetric tensor expressed as a product of a reduced form and an irrep map. Intra-block indices are denoted with dashed lines and symmetry indices are denoted with solid lines.
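Ignoring intra-block modes as above, the aligned contraction is literally one dense einsum, and it is also expressible as a batched matrix multiplication over the auxiliary index Q, which is the form accepted by batched GEMM routines such as MKL's (a sketch with an arbitrary group order):

```python
import numpy as np

G = 4  # illustrative group order; intra-block modes are omitted here
rng = np.random.default_rng(0)

# Aligned reduced forms: U indexed by (A, Q, L), V indexed by (L, Q, J),
# where Q = A + B = K + L = I + J (mod G) is the shared auxiliary index.
U = rng.standard_normal((G, G, G))
V = rng.standard_normal((G, G, G))

# The irrep-aligned contraction is one dense einsum, no symmetry logic:
W = np.einsum("AQL,LQJ->AQJ", U, V)

# The same operation as a batched matrix multiplication with batch index Q:
# (Q, A, L) @ (Q, L, J) -> (Q, A, J), then restore the (A, Q, J) order.
W_batched = np.matmul(U.transpose(1, 0, 2),
                      V.transpose(1, 0, 2)).transpose(1, 0, 2)
assert np.allclose(W, W_batched)
```

The batch dimension Q is what lets the library dispatch G independent G × G matrix products in one call.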
Specifically, the irrep map is defined by

    m^(T)_{I_1...I_N} = 1 if c_1I_1 + ⋯ + c_NI_N ≡ Z (mod G), and 0 otherwise.

An irrep map can be decomposed by any tree tensor network in which every bond dimension has rank G, which follows from the properties of group multiplication in a cyclic group, and is shown explicitly via the following irrep map decomposition lemma. The equation in the lemma is displayed in terms of tensor diagrams in Figure 8.

Lemma.
Any irrep map M ∈ {0,1}^{H_1×⋯×H_N} may be written as a contraction of two irrep maps M^(A) ∈ {0,1}^{H_1×⋯×H_k×G} and M^(B) ∈ {0,1}^{H_{k+1}×⋯×H_N×G},

    m_{I_1...I_N} = Σ_{J=0}^{G−1} m^(A)_{I_1...I_k J} m^(B)_{I_{k+1}...I_N J}.

Figure 8. An irrep map tensor may be decomposed with low rank (with a bond index of dimension equal to the size of the group, G, no matter the order of the irrep map tensor) as a contraction of two irrep map tensors. Each irrep map tensor holds a disjoint subset of indices of the original irrep map and a new contracted bond index.

Proof.
Let c_1, ..., c_N be the coefficients defining the irrep map M and Z be the remainder, then define

    m^(A)_{I_1...I_k J} = 1 if c_1I_1 + ⋯ + c_kI_k + J ≡ Z (mod G), and 0 otherwise,

and

    m^(B)_{I_{k+1}...I_N J} = 1 if c_{k+1}I_{k+1} + ⋯ + c_NI_N − J ≡ 0 (mod G), and 0 otherwise.

The remainder Z can alternatively be absorbed into the definition of the other factor. It then suffices to observe that contracting these tensors gives the desired result, which we can see by associating 1 or 0 with the true/false value of the conditional statements defining the sparsity:

    m_{I_1...I_N} = Σ_{J=0}^{G−1} (c_1I_1 + ⋯ + c_kI_k + J ≡ Z mod G) · (c_{k+1}I_{k+1} + ⋯ + c_NI_N − J ≡ 0 mod G)
                  = 1 if c_1I_1 + ⋯ + c_NI_N ≡ Z (mod G), and 0 otherwise.

Repeated application of this lemma implies that, for any ordering of modes, M is an MPS with all ranks at most G. Consequently, any tree tensor network of this rank is a valid decomposition [31]. The irrep alignment will make use of a tree with only two nodes, which corresponds to a low-rank matrix factorization, but other factorizations may be useful.

Given any tree tensor network with factors M^(1), ..., M^(F) and auxiliary (contracted) indices J_1, ..., J_{F−1}, we can express T using a reduced form R^(T) that contains any choice of N−1 indices among I_1, ..., I_N and J_1, ..., J_{F−1}. This representation freedom is due to the presence of a one-to-one map between the irrep I_1, ..., I_{N−1} and any other irrep consisting of N−1 indices. Such a map exists as the tensor network encodes F equations relating the N+F−1 variables/indices.

Modulo ordering of indices, which is immaterial, any contraction of two tensors with cyclic group symmetry can be written, for some s, t, v ∈ {1, 2, ...}, as

    w_{i_1I_1...i_sI_s j_1J_1...j_tJ_t} = Σ_{k_1K_1...k_vK_v} u_{i_1I_1...i_sI_s k_1K_1...k_vK_v} v_{k_1K_1...k_vK_v j_1J_1...j_tJ_t}.    (3.2)

Our tensor diagram representation of this contraction is given in Figure 9.
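The lemma can be checked numerically. The sketch below builds a small irrep map from an example coefficient vector (an illustrative assumption), factors it as in the proof with k = 2, and verifies that contracting over the bond index J recovers the original map:

```python
import numpy as np

G, N = 5, 4
c = np.array([1, 1, -1, -1])  # example coefficient vector; remainder Z = 0
Z = 0

# Full irrep map: m[I1..I4] = 1 iff c1*I1 + ... + c4*I4 = Z (mod G)
idx = np.indices((G,) * N)
M = ((np.tensordot(c, idx.reshape(N, -1), axes=1).reshape((G,) * N) - Z)
     % G == 0).astype(int)

# Factors as in the lemma with k = 2: M^(A) carries I1, I2 and the
# remainder Z; M^(B) carries I3, I4; both share a bond index J of size G.
I1, I2, J = np.indices((G, G, G))
MA = ((c[0] * I1 + c[1] * I2 + J - Z) % G == 0).astype(int)
I3, I4, J2 = np.indices((G, G, G))
MB = ((c[2] * I3 + c[3] * I4 - J2) % G == 0).astype(int)

# Contracting over the bond index J recovers the full irrep map.
M_rec = np.einsum("abJ,cdJ->abcd", MA, MB)
assert np.array_equal(M, M_rec)
```

For each assignment of I3, I4, exactly one J satisfies M^(B) = 1, which is why the sum over the size-G bond reproduces M exactly.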
We assume that the group symmetries assign the same coefficients to the indices K_1, ..., K_v (the contraction is well-defined only if they are the same or differ by a sign, and in the latter case the sign can be absorbed into the reduced form).

Figure 9. A contraction of a tensor of order s+v with a tensor of order v+t into a tensor of order s+t, where all tensors have cyclic group symmetry and are represented with tensors of twice the order.

The irrep alignment algorithm uses a tree with two nodes to represent each of the two input irrep maps, M^(U) and M^(V). Each pair of nodes separates the contracted and uncontracted indices,

    m^(U)_{I_1...I_s K_1...K_v} = Σ_P m^(U,I)_{I_1...I_s P} m^(U,K)_{K_1...K_v P},
    m^(V)_{K_1...K_v J_1...J_t} = Σ_Q m^(V,K)_{K_1...K_v Q} m^(V,J)_{J_1...J_t Q}.

In defining these reduced forms, we absorb the remainders associated with the irrep maps M^(U) and M^(V) into M^(U,I) and M^(V,J), respectively. Consequently, M^(U,K) and M^(V,K) both have a remainder of zero and are both defined by the same set of coefficients, so M^(U,K) = M^(V,K). Associated with these irrep maps, we define reduced forms that use a representation that includes the auxiliary indices (P and Q):
• R^(U) with indices I_1, ..., I_{s−1}, K_1, ..., K_{v−1}, P, and
• R^(V) with indices K_1, ..., K_{v−1}, J_1, ..., J_{t−1}, Q.
Using these irrep maps and reduced forms we can represent U as

    u_{i_1I_1...i_sI_s k_1K_1...k_vK_v} = Σ_P r^(U)_{i_1I_1...i_{s−1}I_{s−1} i_s k_1K_1...k_{v−1}K_{v−1} k_v P} m^(U,I)_{I_1...I_s P} m^(U,K)_{K_1...K_v P}.    (3.3)

This representation is depicted in Figure 10.

Figure 10. Symmetry aligned reduced form of a group symmetric tensor. The P index belongs to the reduced form as well as both group tensor factors. The I_s and J_t indices are represented implicitly via irrep maps, i.e., are not indices of the reduced form.

With V represented analogously, we can compute W as

    w_{i_1I_1...i_sI_s j_1J_1...j_tJ_t} = Σ_{k_1K_1...k_{v−1}K_{v−1} k_v P Q} [ r^(U)_{i_1I_1...i_{s−1}I_{s−1} i_s k_1K_1...k_{v−1}K_{v−1} k_v P} r^(V)_{k_1K_1...k_{v−1}K_{v−1} k_v j_1J_1...j_{t−1}J_{t−1} j_t Q} m^(U,I)_{I_1...I_s P} m^(V,J)_{J_1...J_t Q} ( Σ_{K_v} m^(U,K)_{K_1...K_v P} m^(V,K)_{K_1...K_v Q} ) ].

Consequently, since Σ_{K_v} m^(U,K)_{K_1...K_v P} m^(V,K)_{K_1...K_v Q} = δ(P, Q), we can eliminate the P and K_v indices and bring the remaining irrep maps outside the main summation, so

    w_{i_1I_1...i_sI_s j_1J_1...j_tJ_t} = ( Σ_Q m^(U,I)_{I_1...I_s Q} m^(V,J)_{J_1...J_t Q} ) · ( Σ_{k_1K_1...k_{v−1}K_{v−1} k_v} r^(U)_{i_1I_1...i_{s−1}I_{s−1} i_s k_1K_1...k_{v−1}K_{v−1} k_v Q} r^(V)_{k_1K_1...k_{v−1}K_{v−1} k_v j_1J_1...j_{t−1}J_{t−1} j_t Q} ).

Figure 11 provides tensor diagrams depicting how the above equations permit the irrep alignment algorithm to yield an efficient contraction.

Figure 11. Main contraction between aligned reduced forms. Two of the factors composing the irrep maps can be contracted away independently due to alignment of reduced forms.

Algorithm 3.2 The irrep alignment algorithm for contraction of cyclic group symmetric tensors, for a contraction defined as in (3.2).
Input: two tensors U of order s+v and V of order v+t with group symmetries described using coefficient vectors c^(U) and c^(V) and remainders Z^(U) and Z^(V) as in (3.1). Assume that these vectors share coefficients for contracted modes of the tensors, so that if c^(U) = [c^(U)_1; c^(U)_2], then c^(V) = [c^(U)_2; c^(V)_2].
Define new coefficient vectors

    c^(A) = [c^(U)_1; 1],  c^(B) = [c^(U)_2; −1],  and  c^(C) = [c^(V)_2; 1].

Define irrep maps M^(U,I), M^(U,K), and M^(V,J) based, respectively, on the coefficient vectors c^(A), c^(B), c^(C) and remainders Z^(U), 0, Z^(V).
Obtain the irrep aligned reduced forms R^(U) and R^(V) for U and V, respectively, so that (3.3) is satisfied. Assuming we are given standard reduced forms R̄^(U) and R̄^(V) for U and V that do not store the last symmetry mode (other cases are similar), we can compute R^(U) and R^(V) via the following contractions:

    r^(U)_{i_1I_1...i_{s−1}I_{s−1} i_s k_1K_1...k_{v−1}K_{v−1} k_v Q} = Σ_{I_s K_v} r̄^(U)_{i_1I_1...i_sI_s k_1K_1...k_{v−1}K_{v−1} k_v} m^(U,I)_{I_1...I_s Q} m^(U,K)_{K_1...K_v Q},
    r^(V)_{k_1K_1...k_{v−1}K_{v−1} k_v j_1J_1...j_{t−1}J_{t−1} j_t Q} = Σ_{K_v J_t} r̄^(V)_{k_1K_1...k_vK_v j_1J_1...j_{t−1}J_{t−1} j_t} m^(U,K)_{K_1...K_v Q} m^(V,J)_{J_1...J_t Q}.

▷ The above contractions can be done with constant work per element of R^(U) and R^(V), namely O(n^{s+v} G^{s+v−1}) and O(n^{v+t} G^{v+t−1}), or with a factor of O(G) more if done as dense tensor contractions that ignore the structure of M^(U,I), M^(U,K), and M^(V,J).
Compute

    r^(W)_{i_1I_1...i_{s−1}I_{s−1} i_s j_1J_1...j_{t−1}J_{t−1} j_t Q} = Σ_{k_1K_1...k_{v−1}K_{v−1} k_v} r^(U)_{i_1I_1...i_{s−1}I_{s−1} i_s k_1K_1...k_{v−1}K_{v−1} k_v Q} r^(V)_{k_1K_1...k_{v−1}K_{v−1} k_v j_1J_1...j_{t−1}J_{t−1} j_t Q}.

▷ The above contraction has cost O(n^{s+t+v} G^{s+t+v−2}).
If a standard output reduced form is desired, for example with the last mode of W stored implicitly, then compute

    r̄^(W)_{i_1I_1...i_sI_s j_1J_1...j_{t−1}J_{t−1} j_t} = Σ_Q r^(W)_{i_1I_1...i_{s−1}I_{s−1} i_s j_1J_1...j_{t−1}J_{t−1} j_t Q} m^(U,I)_{I_1...I_s Q}.

If we instead desired a reduced form with another implicit mode, it would not be implicit in R^(W), so we would need to also contract with m^(V,J)_{J_1...J_t Q} and sum over the desired implicit mode. ▷ In either case, the above contraction can be done with constant work per element of R^(W), namely O(n^{s+t} G^{s+t−1}), or with a factor of O(G) more if done as dense tensor contractions, if ignoring the structure of M^(U,I) and M^(V,J).

import numpy as np
from symtensor import array, einsum

Figure 12. Symtensor library example for contraction of two group symmetric tensors.
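Putting the pieces together for the order 4 running example, the sketch below converts standard reduced forms to aligned ones, performs the single dense einsum of the main step of Algorithm 3.2, maps the result back to a standard reduced form, and checks it against the loop nest of Algorithm 3.1. The index conventions are illustrative choices, not Symtensor's internals:

```python
import numpy as np

G, n = 3, 2
rng = np.random.default_rng(4)

# Standard reduced forms: U[A,B,K] with L = A+B-K (mod G) implicit,
# V[K,L,I] with J = K+L-I (mod G) implicit (blocks of shape n^4).
U = rng.standard_normal((G, G, G, n, n, n, n))
V = rng.standard_normal((G, G, G, n, n, n, n))

# Reference: loop-over-blocks contraction (Algorithm 3.1).
W_ref = np.zeros((G, G, G, n, n, n, n))
for A in range(G):
    for B in range(G):
        for I in range(G):
            for K in range(G):
                W_ref[A, B, I] += np.einsum(
                    "abkl,klij->abij", U[A, B, K], V[K, (A + B - K) % G, I])

# Irrep alignment: re-index to aligned forms with Q = A+B = K+L = I+J (mod G).
Uh = np.empty((G, G, G, n, n, n, n))  # Uh[A,Q,L]: B = Q-A, K = Q-L implied
Vh = np.empty((G, G, G, n, n, n, n))  # Vh[L,Q,J]: K = Q-L, I = Q-J implied
for Q in range(G):
    for A in range(G):
        for L in range(G):
            Uh[A, Q, L] = U[A, (Q - A) % G, (Q - L) % G]
    for L in range(G):
        for J in range(G):
            Vh[L, Q, J] = V[(Q - L) % G, L, (Q - J) % G]

# One dense einsum replaces the loop nest (symmetry modes: AQL,LQJ->AQJ).
Wh = np.einsum("AQLabkl,LQJklij->AQJabij", Uh, Vh)

# Map the aligned output Wh[A,Q,J] back to the standard form W[A,B,I].
W = np.empty_like(W_ref)
for A in range(G):
    for B in range(G):
        for I in range(G):
            W[A, B, I] = Wh[A, (A + B) % G, (A + B - I) % G]
assert np.allclose(W, W_ref)
```

The two re-indexing passes and the back-mapping cost O(G^3 n^4), while the single einsum carries the O(G^4 n^6) leading-order work, matching the cost analysis of Algorithm 3.2 for s = t = v = 2.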
4. Software Automation.
We implement the irrep alignment algorithm as a Python library, Symtensor, that automatically maps cyclic group symmetric tensor contractions onto a few einsum operations on dense tensors. The library is interfaced to different tensor contraction backends. Aside from the default NumPy einsum, we provide a backend that leverages MKL's batched matrix multiplication routines [2] to obtain good threaded performance, and employ Cyclops [40] for distributed-memory execution.

The Symtensor library maps cyclic group symmetric tensors into reduced representations and automatically aligns these reduced representations so that contractions can proceed with optimal efficiency. The reduced representations are stored as dense arrays, and these objects and the resulting dense contractions are handled by the array backend. The user interface mimics the NumPy API and hides any explicit reference to symmetry except at the time of array creation. This API enables users to effectively implement numerical methods with group symmetric tensors in a symmetry-oblivious manner.

In Figure 12, we provide an example of how the Symtensor library can be used to perform the contraction of two cyclic group tensors. In the code, the Symtensor library initializes an order 4 cyclic group symmetric tensor using an underlying dense reduced representation, with Z_3 (cyclic group with G = 3) symmetry for each index. Once the tensors are initialized, the subsequent einsum operation implements the contraction shown in Fig. 5 without referring to any symmetry information in its interface. While the example is based on a simple cyclic group for an order 4 tensor, the library supports arbitrary orders, as well as products of cyclic groups and infinite cyclic groups (e.g. U(1) symmetries). As introduced in Section 3, the main operations in our irrep alignment algorithm consist of transformation of the reduced form and the contraction of reduced forms.
Symtensor (https://github.com/yangcal/symtensor) chooses the least costly version of the irrep alignment algorithm from a space of variants defined by different choices of the three implicitly represented modes in the symmetry-aligned reduced form in Algorithm 3.2 (therein these are I_s, J_t, and K_v). This choice is made by enumerating all valid variants. After choosing the best reduced form, the required irrep maps, M^(U,I) and M^(V,J) in Algorithm 3.2, are generated as dense tensors. This permits both the transformations and the reduced form contraction to be done as einsum operations on dense tensors with the desired backend.
Figure 13. Goldstone diagrams for the three types of CC contractions in our example. Virtual sectors are denoted with upward arrows and occupied sectors with downward arrows. Note that this is domain-specific notation that is slightly different from that used in the earlier symmetric tensor diagrams.
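To illustrate how an irrep map can act as a dense tensor, the following NumPy sketch (a toy construction of our own, not code from the library) builds the Z_G map m_{ABC} = 1 iff A + B = C (mod G) and uses it, via einsum, to move a reduced form from one choice of implicit sector index to another:

```python
import numpy as np

G, N = 3, 2  # sectors and block size (toy values)
rng = np.random.default_rng(1)

# Irrep map for Z_G: M[A, B, C] = 1 iff A + B = C (mod G). Storing it as
# an ordinary dense tensor lets reduced-form transformations be plain
# einsum calls.
M = np.zeros((G, G, G))
for A in range(G):
    for B in range(G):
        M[A, B, (A + B) % G] = 1.0

# Reduced form of an order-3 tensor t_{aA,bB,cC} with C = A + B (mod G):
# only two sector indices are stored explicitly, the third is implicit.
r_AB = rng.standard_normal((G, G, N, N, N))   # sectors (A, B); C implicit

# Re-align to the reduced form indexed by sectors (A, C), with B implicit.
r_AC = np.einsum('ABabc,ABC->ACabc', r_AB, M)

# The transformation only relabels sector slices: r_AC[A, (A+B) % G] == r_AB[A, B]
assert np.allclose(r_AC[1, (1 + 2) % G], r_AB[1, 2])
```

Because M is a 0/1 tensor, this transformation is a permutation of sector slices, but expressing it as an einsum lets any dense backend (NumPy, batched BLAS, Cyclops) execute it without symmetry-aware code.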
5. Example Applications.
We survey a few group symmetric tensor contractions that are the costliest components of common tensor network and quantum chemistry methods. These contractions, summarized in Table 1, are evaluated as part of a benchmark suite in Section 6 (we also consider a suite of contractions with different tensor orders, i.e., different choices of s, t, v). We first investigate the performance of the irrep alignment contraction algorithm for several expensive tensor contractions arising in the doubles amplitude equation of the CC theory of periodic systems. These can be written as

w_{aA,bB,iI,jJ} = Σ_{cC,dD} u_{aA,bB,cC,dD} v_{iI,jJ,cC,dD},
x_{aA,bB,iI,cC} = Σ_{kK,lL} u_{kK,lL,aA,bB} v_{kK,lL,cC,iI},
y_{aA,bB,iI,jJ} = Σ_{kK,lL} u_{kK,lL,iI,jJ} v_{kK,lL,aA,bB}.

These contractions are displayed in Fig. 13 as diagrams in a form that is common in the quantum chemistry literature. As discussed in Section 2, in crystalline (periodic) materials, the crystal translational group is the product of cyclic groups associated with the lattice vectors of the crystal. In 3 dimensions, the order of the group, known as the total number of k-points, takes the form G = G_1 × G_2 × G_3, where G_1, G_2, G_3 are the numbers of k-points along each dimension. Each index of the tensors is associated with this group symmetry. The indices also fall into two classes, i, j, k, l (the occupied indices) and a, b, c, d (the virtual indices), associated with different dimensions N_occ = N_i = N_j = N_k = N_l and N_vir = N_a = N_b = N_c = N_d. The cost of the above contractions using the irrep alignment algorithm scales as G^4 N_occ^2 N_vir^4, G^4 N_occ^3 N_vir^3, and G^4 N_occ^4 N_vir^2, respectively. These contractions are summarized in Table 1, where, for simplicity, we use N = N_vir = N_occ. We next benchmark the performance of the irrep alignment algorithm for two tensor contractions arising in tensor network simulations, one arising in DMRG calculations with MPS and the other in a 2D PEPS contraction. The tensor contractions are illustrated in Fig.
14. The DMRG contraction considered is the absorption of the blocked environment tensor U into the MPS tensor V,

w_{iI,jJ,lL,mM} = Σ_{kK} u_{iI,jJ,kK} v_{kK,lL,mM},

which is encountered while solving the local eigenvalue problem. Here, as illustrated in Fig. 14a, two environment tensors are contracted with a local MPS and MPO tensor; we benchmark the third contraction, highlighted in blue. While in general the number of symmetry blocks is different for each of the tensor indices in a tensor network, for simplicity we keep the number of symmetry blocks constant for all legs, meaning G = G_I = G_J = G_K = G_L = G_M. The indices i, k, and m denote the auxiliary indices of the MPS with sizes N_mps = N_i = N_k = N_m, j is the auxiliary index of the MPO used for the eigenvalue problem with size N_j, and l is the physical bond dimension of size N_l.
The PEPS contraction considered is one that arises in a recently proposed variant of the boundary contraction algorithm [48], the Implicit Boundary MPS (IBMPS) [33], which provides an asymptotic speedup over the standard Boundary MPS (BMPS) method [23] for a given bond dimension by using an implicit randomized singular value decomposition (SVD) in the boundary bond dimension truncation. The IBMPS procedure approximately contracts a square PEPS lattice one layer at a time, with Fig. 14b showing an example of this; we benchmark one of the contractions with the highest scaling with the PEPS bond dimension, shown in Fig. 14c. This contraction is of tensors U and V,

w_{iI,jJ,mM,nN} = Σ_{kK,lL} u_{iI,jJ,kK,lL} v_{kK,lL,mM,nN}.

Indices i and j have block sizes N_mps = N_i = N_j, corresponding to the auxiliary bond dimension of the boundary MPS, and the remaining indices correspond to the PEPS auxiliary bond dimension, with block sizes N_peps = N_k = N_l = N_m = N_n. All indices have the same number of symmetry blocks, G. The costs of these DMRG and PEPS contractions using the irrep alignment algorithm scale as G^3 N_mps^3 N_j N_l and G^4 N_mps^2 N_peps^4, respectively; in Table 1 we set N = N_mps = N_peps for simplicity.
Figure 14. Tensor network diagrams for the (a) DMRG and (b-c) PEPS contractions considered. In (a), two environment tensors are contracted with an MPS and an MPO tensor in three steps, with the contraction considered highlighted in blue. For the example PEPS contraction, (b) illustrates how a single layer PEPS is contracted column by column using the IBMPS algorithm and (c) shows the diagram for the considered contraction, which is part of the randomized implicit SVD step.
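For reference, the block-level (single symmetry sector) versions of these two tensor network contractions reduce to ordinary einsum calls; the sketch below uses toy dimensions and omits the sector indices:

```python
import numpy as np

# Block-level versions of the DMRG and PEPS benchmark contractions,
# with toy dimensions (sector indices omitted for clarity).
N_mps, N_j, N_l, N_peps = 4, 3, 2, 2

# DMRG: w_{ijlm} = sum_k u_{ijk} v_{klm}
u = np.random.rand(N_mps, N_j, N_mps)
v = np.random.rand(N_mps, N_l, N_mps)
w_dmrg = np.einsum('ijk,klm->ijlm', u, v)
assert w_dmrg.shape == (N_mps, N_j, N_l, N_mps)

# PEPS: w_{ijmn} = sum_{kl} u_{ijkl} v_{klmn}
u = np.random.rand(N_mps, N_mps, N_peps, N_peps)
v = np.random.rand(N_peps, N_peps, N_peps, N_peps)
w_peps = np.einsum('ijkl,klmn->ijmn', u, v)
assert w_peps.shape == (N_mps, N_mps, N_peps, N_peps)
```

The per-block flop counts of these two einsums (N_mps^3 N_j N_l and N_mps^2 N_peps^4) are exactly the N-dependent factors in the cost scalings quoted above; the symmetry contributes the additional G^3 and G^4 factors.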
6. Performance Evaluation.
Performance experiments were carried out on the Stampede2 supercomputer. Each Stampede2 node is an Intel Knights Landing (KNL) processor, on which we used up to 64 of the 68 cores, by employing up to 64 threads with single-node NumPy/MKL and 64 MPI processes per node with 1 thread per process with Cyclops. We used the Symtensor library together with one of three external contraction backends: Cyclops, default NumPy, or a batched BLAS backend for NumPy arrays (this backend leverages HPTT [41] for fast tensor transposition and dispatches to Intel's MKL BLAS for batched matrix multiplication). We also compare against the loop over symmetry blocks algorithm as illustrated in Algorithm 3.1.
Figure 15. Comparison of execution times for various types of contraction using the Symtensor library with batched BLAS and using loops over blocks with NumPy as the contraction backend. Results are shown for various combinations of (s, t, v), where the number of indices shared between the first input and output is s, between the second input and output is t, and between the two inputs (contracted) is v.
We consider the performance of Symtensor on a single core and a single node of KNL relative to manually implemented looping over blocks, as well as relative to naive contractions that ignore symmetry. Our manual loop implementations of contractions store a Python list of NumPy arrays to represent the tensor blocks and invoke the NumPy einsum function to perform each blockwise contraction.
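A minimal version of this loop-over-blocks baseline, for the simplest case of a Z_G-symmetric (block-diagonal) matrix multiply, might look as follows (a sketch with toy sizes, not our actual benchmark code):

```python
import numpy as np

G, N = 3, 2  # sectors and block size (toy values)
rng = np.random.default_rng(2)

# Manual loop-over-blocks baseline: each Z_G-symmetric matrix is kept as
# a Python list of its G diagonal sector blocks, and each block pair is
# contracted with einsum.
u_blocks = [rng.standard_normal((N, N)) for _ in range(G)]
v_blocks = [rng.standard_normal((N, N)) for _ in range(G)]
w_blocks = [np.einsum('ij,jk->ik', ub, vb) for ub, vb in zip(u_blocks, v_blocks)]

# Equivalent dense computation on the full block-diagonal matrices.
def dense(blocks):
    return np.block([[blocks[s] if s == t else np.zeros((N, N))
                      for t in range(G)] for s in range(G)])

assert np.allclose(dense(w_blocks), dense(u_blocks) @ dense(v_blocks))
```

Each blockwise einsum invocation carries Python and NumPy dispatch overhead, which is what makes this baseline slow when G is large and N is small.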
We first examine the performance of the irrep alignment algorithm for generic contractions (Eq. 3.2) with equal dimensions N and G for each mode, but different (s, t, v). Figure 15 shows the speed-up in execution time obtained by Symtensor relative to the manual loop implementation. The contractions are constructed by fixing G = 4 and modifying N to attain a fixed total number of floating-point operations. All contractions are performed on both a single thread and 64 threads of a single KNL node, and timings are compared in those respective configurations. The irrep alignment algorithm achieves better parallel scalability than block-wise contraction and can also be faster sequentially. However, we also observe that in cases when one tensor is larger than the others (when s, t, v are unequal), the irrep alignment approach can incur overhead relative to the manual loop implementation. This overhead can largely be attributed to the cost of transformations between reducible representations, which are done as dense tensor contractions. An alternative transformation mechanism that forgoes this factor of O(G) overhead would likely reduce sequential efficiency for these cases, but the use of dense tensor contractions permits use of existing optimized kernels and easy parallelizability.
The results are displayed in Figure 16, with the left, center, and right plots showing the scaling for the contractions labeled MM, CC_1, and PEPS in Table 1. We compare scaling relative to two conventional approaches: a dense contraction without utilizing symmetry and loops over symmetry blocks, both using NumPy's einsum function. The sizes of the tensors considered are, for matrix multiplication, N = 500 with G ranging upward from 4; for the CC contraction, N_occ = 8, N_vir = 16, with G ranging upward from 4; and for the PEPS contraction, N_mps = 16, N_peps = 4, with G ranging upward from 2. For all but the smallest contractions, using the Symtensor implementation improves contraction performance.
A comparison of the slopes of the lines in each of the three plots demonstrates that the dense tensor contraction scheme results in a higher-order asymptotic scaling of cost in G than either of the symmetric approaches. (We used the default Intel compilers on Stampede2 with the following software versions: HPTT v1.0.0, CTF v1.5.5 compiled with optimization flags -O2 -no-ipo, and MKL v2018.0.2.)
Figure 16. Comparison of the execution times, in seconds, for contractions on a single thread using three different algorithms, namely a dense, non-symmetric contraction, loops over symmetry blocks, and our Symtensor library. From left to right, the plots show the scaling for matrix multiplication (MM), a coupled cluster contraction (CC_1), and a tensor network contraction (PEPS). The dense and loop over blocks calculations use NumPy as a contraction backend, while the Symtensor library here uses Cyclops as the contraction backend.
Table 1. Summary of coupled cluster and tensor network contractions used to benchmark the symmetric tensor contraction scheme and their costs. We include matrix multiplication (MM) as a point of reference.

Label | Contraction | Symmetric Cost | Naive Cost
MM   | w_{iI,kK} = Σ_{jJ} u_{iI,jJ} v_{jJ,kK}                        | O(G N^3)   | O(G^3 N^3)
CC_1 | w_{aA,bB,iI,jJ} = Σ_{cC,dD} u_{aA,bB,cC,dD} v_{iI,jJ,cC,dD}   | O(G^4 N^6) | O(G^6 N^6)
CC_2 | x_{aA,bB,iI,cC} = Σ_{kK,lL} u_{kK,lL,aA,bB} v_{kK,lL,cC,iI}   | O(G^4 N^6) | O(G^6 N^6)
CC_3 | y_{aA,bB,iI,jJ} = Σ_{kK,lL} u_{kK,lL,iI,jJ} v_{kK,lL,aA,bB}   | O(G^4 N^6) | O(G^6 N^6)
MPS  | w_{iI,jJ,lL,mM} = Σ_{kK} u_{iI,jJ,kK} v_{kK,lL,mM}            | O(G^3 N^5) | O(G^5 N^5)
PEPS | w_{iI,jJ,mM,nN} = Σ_{kK,lL} u_{iI,jJ,kK,lL} v_{kK,lL,mM,nN}   | O(G^4 N^6) | O(G^6 N^6)

Figure 17 provides absolute performance with 1 thread and 64 threads for all contractions in Table 1. For each contraction, we consider one instance with fewer symmetry sectors (G) and larger block size (N) (labeled with a subscript a) and another with more symmetry sectors and smaller block size (labeled with a subscript b). The specific sizes of all tensors studied are provided in Table 2. For each of these cases, we compare the execution time, in seconds, using loops over blocks dispatching to NumPy contractions, the Symtensor library with NumPy arrays and batched BLAS as the contraction backend, and the Symtensor library using Cyclops as the array and contraction backend.
A clear advantage in parallelizability of Symtensor is evident in Figure 17. With 64 threads, Symtensor outperforms manual looping for all contraction benchmarks, with the largest speed-up obtained for the CC_1a contraction. There is a significant difference between the contractions of type b (large G and small N) and type a (large N and small G): the geometric mean speedup is larger for type b on 64 threads, and this difference is again observed on a single thread, although less drastically.
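The G-scaling entries in Table 1 can be checked empirically by counting the sector-block combinations that survive the two conservation laws. The following sketch confirms the G^(s+t+v-2) count of block multiplications that the symmetric algorithms exploit, versus G^(s+t+v) for a contraction that ignores symmetry:

```python
import itertools

def count_block_multiplies(s, t, v, G):
    """Count sector-block multiplications in a loop-over-blocks contraction
    where the first input shares s modes with the output, the second shares
    t, and v modes are contracted; every mode carries a Z_G sector and each
    input obeys a sum-to-zero (mod G) conservation law."""
    count = 0
    for secs in itertools.product(range(G), repeat=s + t + v):
        a, b, c = secs[:s], secs[s:s + t], secs[s + t:]
        if sum(a + c) % G == 0 and sum(b + c) % G == 0:
            count += 1
    return count

# Nonzero block combinations scale as G^(s+t+v-2):
assert count_block_multiplies(1, 1, 1, 5) == 5        # MM:      G^1
assert count_block_multiplies(2, 2, 1, 3) == 3 ** 3   # MPS:     G^3
assert count_block_multiplies(2, 2, 2, 3) == 3 ** 4   # CC/PEPS: G^4
```

Multiplying these block counts by the per-block flop cost N^(s+t+v) reproduces the "Symmetric Cost" column of Table 1.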
Type b cases involve more symmetry blocks, amplifying the overhead of manual looping.
Figure 17. Comparison of contraction times using the Symtensor library (using Cyclops for the array storage and contraction backend, or NumPy as the array storage with a batched BLAS contraction backend) and loops over blocks using NumPy as the contraction backend. Results are shown for instances of the prototypical contractions introduced in Section 5, with details of tensor dimensions provided in Table 2. The different bars indicate both the algorithm and backend used and the number of threads used on a single node.
Table 2. Dimensions of the tensors used for contractions in Figure 17 and Figure 18.

Label  | Specifications
CC_1a  | G = 8, N_i = N_j = 16, N_a = N_b = N_c = N_d = 32
CC_2a  | G = 8, N_i = N_k = N_l = 16, N_a = N_b = N_c = 32
CC_3a  | G = 8, N_i = N_j = N_k = N_l = 16, N_a = N_b = 32
CC_1b  | G = 16, N_i = N_j = 8, N_a = N_b = N_c = N_d = 16
CC_2b  | G = 16, N_i = N_k = N_l = 8, N_a = N_b = N_c = 16
CC_3b  | G = 16, N_i = N_j = N_k = N_l = 8, N_a = N_b = 16
MM_a   | G = 2, N = 10000
MM_b   | G = 100, N = 2000
MPS_a  | G = 2, N_i = N_k = N_m = 3000, N_j = 10, N_l = 1
MPS_b  | G = 5, N_i = N_k = N_m = 700, N_j = 10, N_l = 1
PEPS_a | G = 2, N_i = N_j = 400, N_k = N_l = N_m = N_n = 20
PEPS_b | G = 10, N_i = N_j = 64, N_k = N_l = N_m = N_n = 8

We now illustrate the parallelizability of the irrep alignment algorithm by studying scalability across multiple nodes with distributed memory. All parallelization in Symtensor is handled via the Cyclops library in this case. The solid lines in Figure 18 show the strong scaling (fixed problem size) behavior of the Symtensor implementation on up to eight nodes of Stampede2. As a reference, we provide a comparison to strong scaling on a single node for the loop over blocks method using NumPy as the array and contraction backend. We again observe that the Symtensor irrep alignment implementation provides a significant speedup over the loop over blocks strategy, which is especially evident when there are many symmetry sectors in each tensor. For example, using 64 threads on a single node, Symtensor achieves significant speedups over the loop over blocks implementation for the CC_1a, CC_1b, PEPS_a, and PEPS_b contractions. We additionally see that the contraction times continue to scale with good efficiency when the contraction is spread across multiple nodes.
Finally, in Figure 19 we display weak scaling performance, where the dimensions of each tensor are scaled with the number of nodes used (starting with the problem size reported in Table 2 on 1 node) so as to fix the tensor size per node. Thus, in this experiment, we utilize all available memory and seek to maximize the performance rate. Figure 19 displays the performance rate per node, which varies somewhat across contractions and node counts, but generally does not fall off with increasing node count, demonstrating good weak scalability. When using 4096 cores, the overall performance rate approaches 4 Teraflops/s for some contractions, but is lower for other contractions that have less arithmetic intensity.
Figure 18. Strong scaling across up to 8 nodes for the CC contractions (left), labelled CC_1a (blue circles) and CC_1b (green triangles), and the PEPS contractions (right), labelled PEPS_a (blue circles) and PEPS_b (green triangles). The dashed lines correspond to calculations done using a loop over blocks algorithm with a NumPy contraction backend, while the solid lines correspond to Symtensor calculations using the irrep alignment algorithm with a Cyclops contraction backend.
Figure 19. Weak scaling (see text for details) across up to 64 nodes for CC (left) and TN (right) contractions, showing the performance, in GFlops per node, as a function of the number of nodes used. The dashed lines correspond to contractions with few symmetry irreps, previously labelled (a), while solid lines correspond to contractions with many symmetry irreps, labelled (b). The blue squares correspond to the CC_1 and matrix multiplication performance, the dark green circles correspond to the CC_2 and MPS performance, and the light green triangles correspond to the CC_3 and PEPS performance.
7. Conclusion.
The irrep alignment algorithm leverages conservation laws implicit in cyclic group symmetry to provide a contraction method that is efficient across a wide range of tensor contractions. This technique is applicable to many numerical methods for quantum-level modelling of physical systems that involve tensor contractions and tensor networks. The automatic handling of group symmetry with dense tensor contractions provided via the Symtensor library provides benefits in productivity, portability, and parallel scalability for such applications.
8. Acknowledgement.
We thank Linjian Ma for providing the batched BLAS backend used in our calculations. ES was supported by the US NSF OAC SSI program, via awards No. 1931258 and No. 1931328. YG, PH, and GKC were supported by the US NSF OAC SSI program, award No. 1931258. PH was also supported by an NSF Graduate Research Fellowship via grant DGE-1745301 and an ARCS Foundation Award. The work made use of the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by US National Science Foundation grant number ACI-1548562. We used XSEDE to employ Stampede2 at the Texas Advanced Computing Center (TACC) through allocation TG-CCR180006.
REFERENCES
[1] ITensor Library (version 2.0.11), http://itensor.org.
[2] A. Abdelfattah, A. Haidar, S. Tomov, and J. Dongarra, Performance, design, and autotuning of batched GEMM for GPUs, in International Conference on High Performance Computing, Springer, 2016, pp. 21–38.
[3] R. J. Bartlett and M. Musiał, Coupled-cluster theory in quantum chemistry, 79 (2007), pp. 291–352.
[4] G. K.-L. Chan and M. Head-Gordon, Highly correlated calculations with a polynomial cost algorithm: A study of the density matrix renormalization group, The Journal of Chemical Physics, 116 (2002), pp. 4462–4476.
[5] G. K.-L. Chan, A. Keselman, N. Nakatani, Z. Li, and S. R. White, Matrix product operators, matrix product states, and ab initio density matrix renormalization group algorithms, The Journal of Chemical Physics, 145 (2016), p. 014102.
[6] F. A. Cotton, Chemical applications of group theory, John Wiley & Sons, 2003.
[7] T. D. Crawford and H. F. Schaefer, An introduction to coupled cluster theory for computational chemists, vol. 14, VCH Publishers, New York, 2000, ch. 2, pp. 33–136.
[8] G. M. Crosswhite, A. C. Doherty, and G. Vidal, Applying matrix product operators to model systems with long-range interactions, Physical Review B, 78 (2008), p. 035116.
[9] M. Fannes, B. Nachtergaele, and R. F. Werner, Finitely correlated states on quantum spin chains, Communications in Mathematical Physics, 144 (1992), pp. 443–490.
[10] A. L. Fetter and J. D. Walecka, Quantum theory of many-particle systems, Courier Corporation, 2012.
[11] N. Flocke and R. Bartlett, Correlation energy estimates in periodic extended systems using the localized natural bond orbital coupled cluster approach, The Journal of Chemical Physics, 118 (2003), pp. 5326–5334.
[12] J. Gauss, Coupled-cluster Theory, John Wiley & Sons, Ltd, 2002, https://doi.org/10.1002/0470845015.cca058.
[13] S. Hirata, Tensor Contraction Engine: Abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories, The Journal of Physical Chemistry A, 107 (2003), pp. 9887–9897, https://doi.org/10.1021/jp034596z.
[14] S. Hirata, R. Podeszwa, M. Tobita, and R. J. Bartlett, Coupled-cluster singles and doubles for extended systems, The Journal of Chemical Physics, 120 (2004), pp. 2581–2592.
[15] F. Hummel, T. Tsatsoulis, and A. Grüneis, Low rank factorization of the Coulomb integrals for periodic coupled cluster theory, The Journal of Chemical Physics, 146 (2017), p. 124105.
[16] K. Z. Ibrahim, S. W. Williams, E. Epifanovsky, and A. I. Krylov, Analysis and tuning of libtensor framework on multicore architectures, in 2014 21st International Conference on High Performance Computing (HiPC), IEEE, 2014, pp. 1–10.
[17] M. Kállay and P. R. Surján, Higher excitations in coupled-cluster theory, 115 (2001), pp. 2945–2954, https://doi.org/10.1063/1.1383290.
[18] Y.-J. Kao, Y.-D. Hsieh, and P. Chen, Uni10: An open-source library for tensor network algorithms, in Journal of Physics: Conference Series, vol. 640, IOP Publishing, 2015, p. 012040.
[19] R. A. Kendall, E. Apra, D. E. Bernholdt, E. J. Bylaska, M. Dupuis, G. I. Fann, R. J. Harrison, J. Ju, J. A. Nichols, J. Nieplocha, T. Straatsma, T. L. Windus, and A. T. Wong, High performance computational chemistry: An overview of NWChem a distributed parallel application, Computer Physics Communications, 128 (2000), pp. 260–283.
[20] Y. Kurashige and T. Yanai, High-performance ab initio density matrix renormalization group method: Applicability to large-scale multireference problems for metal compounds, The Journal of Chemical Physics, 130 (2009), p. 234114.
[21] P.-W. Lai, K. Stock, S. Rajbhandari, S. Krishnamoorthy, and P. Sadayappan, A framework for load balancing of tensor contraction expressions via dynamic task partitioning, in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2013, pp. 1–10.
[22] M. Lax, Symmetry principles in solid state and molecular physics, Courier Corporation, 2001.
[23] M. Lubasch, J. I. Cirac, and M.-C. Bañuls, Algorithms for finite projected entangled pair states, Physical Review B, 90 (2014), https://doi.org/10.1103/physrevb.90.064425.
[24] D. A. Matthews, On extending and optimising the direct product decomposition, Molecular Physics, 117 (2019), pp. 1325–1333.
[25] J. McClain, Q. Sun, G. K.-L. Chan, and T. C. Berkelbach, Gaussian-based coupled-cluster theory for the ground-state and band structure of solids, Journal of Chemical Theory and Computation, 13 (2017), pp. 1209–1218.
[26] I. P. McCulloch and M. Gulácsi, The non-abelian density matrix renormalization group algorithm, EPL (Europhysics Letters), 57 (2002), p. 852.
[27] M. Motta, S. Zhang, and G. K.-L. Chan, Hamiltonian symmetries in auxiliary-field quantum Monte Carlo calculations for electronic structure, Physical Review B, 100 (2019), p. 045127.
[28] J. Nieplocha, R. J. Harrison, and R. J. Littlefield, Global Arrays: A nonuniform memory access programming model for high-performance computers, The Journal of Supercomputing, 10 (1996), pp. 169–189.
[29] J. Noga and R. Bartlett, The full CCSDT model for molecular electronic structure, 86 (1987), pp. 7041–7050, https://doi.org/10.1063/1.452353.
[30] R. Olivares-Amaya, W. Hu, N. Nakatani, S. Sharma, J. Yang, and G. K.-L. Chan, The ab-initio density matrix renormalization group in practice, The Journal of Chemical Physics, 142 (2015), p. 034102.
[31] I. V. Oseledets, Tensor-train decomposition, SIAM Journal on Scientific Computing, 33 (2011), pp. 2295–2317.
[32] D. Ozog, J. R. Hammond, J. Dinan, P. Balaji, S. Shende, and A. Malony, Inspector-executor load balancing algorithms for block-sparse tensor contractions, in 2013 42nd International Conference on Parallel Processing, IEEE, 2013, pp. 30–39.
[33] Y. Pang, T. Hao, A. Dugad, Y. Zhou, and E. Solomonik, Efficient 2D tensor network simulation of quantum systems, 2020, https://arxiv.org/abs/2006.15234.
[34] C. Peng, J. A. Calvin, F. Pavosevic, J. Zhang, and E. F. Valeev, Massively parallel implementation of explicitly correlated coupled-cluster singles and doubles using TiledArray framework, The Journal of Physical Chemistry A, 120 (2016), pp. 10231–10244.
[35] G. D. Purvis III and R. J. Bartlett, A full coupled-cluster singles and doubles model: The inclusion of disconnected triples, The Journal of Chemical Physics, 76 (1982), pp. 1910–1918, https://doi.org/10.1063/1.443164.
[36] P. Schmoll and R. Orus, Benchmarking global SU(2) symmetry in 2d tensor network algorithms, arXiv preprint arXiv:2005.02748, (2020).
[37] U. Schollwöck, The density-matrix renormalization group in the age of matrix product states, Annals of Physics, 326 (2011), pp. 96–192.
[38] G. E. Scuseria, A. C. Scheiner, T. J. Lee, J. E. Rice, and H. F. Schaefer III, The closed-shell coupled cluster single and double excitation (CCSD) model for the description of electron correlation. A comparison with configuration interaction (CISD) results, The Journal of Chemical Physics, 86 (1987), pp. 2881–2890.
[39] S. Sharma and G. K.-L. Chan, Spin-adapted density matrix renormalization group algorithms for quantum chemistry, The Journal of Chemical Physics, 136 (2012), p. 124121.
[40] E. Solomonik, D. Matthews, J. R. Hammond, J. F. Stanton, and J. Demmel, A massively parallel tensor contraction framework for coupled-cluster computations, Journal of Parallel and Distributed Computing, 74 (2014), pp. 3176–3190.
[41] P. Springer, T. Su, and P. Bientinesi, HPTT: A High-Performance Tensor Transposition C++ Library, in Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY 2017, New York, NY, USA, 2017, ACM, pp. 56–62, https://doi.org/10.1145/3091966.3091968.
[42] J. F. Stanton, Why CCSD(T) works: a different perspective, 281 (1997), pp. 130–134.
[43] J. F. Stanton, J. Gauss, J. D. Watts, and R. J. Bartlett, A direct product decomposition approach for symmetry exploitation in many-body methods. I. Energy calculations, The Journal of Chemical Physics, 94 (1991), pp. 4334–4345.
[44] S. Sternberg, Group theory and physics, Cambridge University Press, 1995.
[45] A. Szabo and N. S. Ostlund, Modern quantum chemistry: introduction to advanced electronic structure theory, Courier Corporation, 2012.
[46] J. Čížek, On the correlation problem in atomic and molecular systems. Calculation of wavefunction components in Ursell-type expansion using quantum-field theoretical methods, The Journal of Chemical Physics, 45 (1966), pp. 4256–4266.
[47] J. Čížek and J. Paldus, Correlation problems in atomic and molecular systems III. Rederivation of the coupled-pair many-electron theory using the traditional quantum chemical methods, International Journal of Quantum Chemistry, 5 (1971), pp. 359–379, https://doi.org/10.1002/qua.560050402.
[48] F. Verstraete and J. I. Cirac, Renormalization algorithms for quantum-many body systems in two and higher dimensions, (2004), https://arxiv.org/abs/cond-mat/0407066.
[49] F. Verstraete, J. J. Garcia-Ripoll, and J. I. Cirac, Matrix product density operators: simulation of finite-temperature and dissipative systems, Physical Review Letters, 93 (2004), p. 207204.
[50] G. Vidal, Class of quantum many-body states that can be efficiently simulated, Physical Review Letters, 101 (2008), p. 110501.
[51] S. R. White, Density matrix formulation for quantum renormalization groups, Physical Review Letters, 69 (1992), p. 2863.
[52] S. R. White, Density-matrix algorithms for quantum renormalization groups, Physical Review B, 48 (1993), p. 10345.
[53] K. Ye and