AutoHOOT: Automatic High-Order Optimization for Tensors
Linjian Ma*
Department of Computer Science, University of Illinois at Urbana-Champaign
Urbana, [email protected]

Jiayu Ye*
Google
Sunnyvale, [email protected]

Edgar Solomonik
Department of Computer Science, University of Illinois at Urbana-Champaign
Urbana, [email protected]

*Equal contribution.
Abstract—High-order optimization methods, including Newton's method and its variants as well as alternating minimization methods, dominate the optimization algorithms for tensor decompositions and tensor networks. These tensor methods are used for data analysis and simulation of quantum systems. In this work, we introduce AutoHOOT, the first automatic differentiation (AD) framework targeting high-order optimization for tensor computations. AutoHOOT takes input tensor computation expressions and generates optimized derivative expressions. In particular, AutoHOOT contains a new explicit Jacobian / Hessian expression generation kernel whose outputs maintain the input tensors' granularity and are easy to optimize. The expressions are then optimized by both traditional compiler optimization techniques and specific tensor algebra transformations. Experimental results show that AutoHOOT achieves competitive performance for both tensor decomposition and tensor network applications compared to existing AD software and other tensor computation libraries with manually written kernels, on both CPU and GPU architectures. The scalability of the generated kernels is as good as that of other well-known high-order numerical algorithms, so they can be executed efficiently on distributed parallel systems.
Index Terms—automatic differentiation, computational graph optimization, tensor computation, tensor decomposition, tensor network
I. INTRODUCTION
Tensors, represented as multidimensional arrays in computer programs, are important in both scientific computing and machine learning. Tensor decomposition [29] is a powerful tool for compressing and approximating high-dimensional data, and is used widely in numerical PDEs [41], quantum chemistry [20], [21], and statistical modeling [4], [47]. Tensor networks are also widely used in physics to approximate quantum states [36], [39] and in neural networks to form tensorized neural architectures [38]. Convolution, a basic tensor operation, is widely used in computer vision applications [31]. Tensors are also widely used in methods for electronic structure calculations in computational chemistry [18].

Derivatives, mostly in the form of gradients, are ubiquitous in optimization algorithms for tensor-related problems. For neural networks, they are used to calculate the gradients of the loss function w.r.t. the model parameters. For tensor decomposition and tensor networks, first-order and higher-order derivatives are necessary to construct the operators used in alternating optimization. Gradients of computational chemistry methods are used for optimization of the electronic geometry to identify stable states and state transitions [25]. Automatic differentiation (AD) frameworks, including popular Python tools such as PyTorch [40], JAX [9], and TensorFlow [1], can generate derivatives in all of these contexts. However, in tensor decomposition, tensor networks, and quantum chemistry, gradient calculations are most often done via manually written codes, as careful numerical and performance considerations are required in these more complex settings.

AD transforms a software or mathematical expression of a function into code for the computation of its derivatives with respect to the desired parameters. Although mathematically correct, the output programs for the derivatives may be suboptimal in computational cost, use of efficient kernels such as the BLAS, memory footprint, and numerical stability. Components of different frameworks address these problems jointly or independently. For example, transformations of the computational graph and operator fusion are used to improve computational efficiency and parallelizability [1], [24], [40]. Gradient checkpointing and garbage collection are used to address memory bottlenecks [1], [40]. For large-scale tensor computations, computational and memory demands leave little leeway for error in these aspects.

Common commercial AD frameworks such as PyTorch [40], JAX [9], and TensorFlow [1] are focused on first-order numerical optimization methods for deep learning models. In the context of tensor decompositions, tensor network optimization, and differentiation of tensor methods, three major additional challenges arise.

1) These domains predominantly employ alternating second-order optimization methods, as they provide monotonic convergence and rapid progress at almost the same per-iteration cost as first-order methods. These methods employ implicit representations of the Jacobian and Hessian to solve linear systems. Existing AD frameworks have limited logical constructs for second-order derivative information, and consequently generate code that can be sub-optimal in cost by orders of magnitude.
2) Most tensor operations involved in deep learning applications concern small tensors, while in tensor network and tensor decomposition applications, there are many tensor contractions over high-order (multidimensional) tensors with a large number of elements. Therefore, tensor network applications require better optimization algorithms to select an optimized contraction order and eliminate redundant calculations.

3) Deep learning computational graphs usually have large depth with many nonlinear operations, limiting the freedom to optimize tensor operations. On the other hand, in tensor decomposition and tensor network applications, the computational graphs are usually wide and have small depth, so there is more freedom to optimize the computation.

Although many frameworks, such as Tensorly [30], TensorNetwork [45], and Quimb [13], provide interfaces to optimize tensor decomposition / network algorithms with AD frameworks such as TensorFlow and PyTorch, the optimization algorithms are general first-order methods and their variants. These frameworks explicitly implement popular second-order methods for these problems, such as Alternating Least Squares (ALS) for tensor decompositions and the Density Matrix Renormalization Group (DMRG) for 1D tensor networks, rather than using AD. The ability to generate efficient expressions of these methods automatically via AD would accelerate the development of new variants and their deployment on shared-memory, GPU, and distributed-memory architectures.

In this paper, we propose a new AD framework for tensor computations, Automatic High-Order Optimization for Tensors (AutoHOOT). AutoHOOT encapsulates the following novel ideas and capabilities:

• a new AD module that generates more efficient representations for higher-order derivative constructs such as Jacobians and Hessians, which are needed for tensor computation applications,
• a new computational graph optimization module that extends beyond traditional compiler optimization techniques with tensor-algebra-specific transformations, such as distributivity of matrix inversion over the Kronecker product,
• portability via high-level support for different tensor contraction backends: NumPy for multi-core CPU, TensorFlow for GPUs, and Cyclops [52] for distributed-memory systems,
• substantial improvements in sequential and parallel performance for tensor network and tensor decomposition optimizations over other AD libraries, and competitive or improved performance w.r.t. manually-optimized implementations.

II. BACKGROUND
A. Notations and Definitions
For vectors, bold lowercase Roman letters are used, e.g., $\mathbf{x}$. For matrices, bold uppercase Roman letters are used, e.g., $\mathbf{X}$. For tensors, bold calligraphic uppercase letters are used, e.g., $\boldsymbol{\mathcal{X}}$. An order-$N$ tensor corresponds to an $N$-dimensional array with dimensions $s_1 \times \cdots \times s_N$. Elements of vectors, matrices, and tensors are denoted in parentheses, e.g., $\mathbf{x}(i)$ denotes the $i$th entry of a vector $\mathbf{x}$, $\mathbf{X}(i,j)$ denotes the $(i,j)$th element of a matrix $\mathbf{X}$, and $\boldsymbol{\mathcal{X}}(i,j,k,l)$ denotes the $(i,j,k,l)$th element of an order-4 tensor $\boldsymbol{\mathcal{X}}$. Subscripts are used to label different vectors, matrices, tensors and functions (e.g., $\boldsymbol{\mathcal{X}}_1$ and $\boldsymbol{\mathcal{X}}_2$, $f_1$ and $f_2$).

Matricization is the process of unfolding a tensor into a matrix. Given a tensor $\boldsymbol{\mathcal{X}}$, the mode-$n$ matricized version is denoted by $\mathbf{X}_{(n)} \in \mathbb{R}^{s_n \times K}$, where $K = \prod_{m=1, m \neq n}^{N} s_m$. We generalize this matricization definition, so that $\mathbf{X}_{(i:j)}$ means that the dimensions from the $i$th index to the $j$th index are unfolded into the column dimension of the matrix, and all the other dimensions are unfolded into the row dimension of the matrix.

For a scalar-output function $y = f(\mathbf{a}_1, \ldots, \mathbf{a}_N)$, we use $\mathbf{g}^{[f]}_{[\mathbf{a}_i]}$ and $\mathbf{H}^{[f]}_{[\mathbf{a}_i]}$ to denote the gradient vector and Hessian matrix of $f$ w.r.t. the input vector $\mathbf{a}_i$. When the inputs are tensors, the gradient and the Hessian are also tensors, denoted $\boldsymbol{\mathcal{G}}^{[f]}_{[\boldsymbol{\mathcal{A}}_i]}$ and $\boldsymbol{\mathcal{H}}^{[f]}_{[\boldsymbol{\mathcal{A}}_i]}$. For a function with non-scalar output $\mathbf{y} = f(\mathbf{a}_1, \ldots, \mathbf{a}_N)$, we use $\mathbf{J}^{[f]}_{[\mathbf{a}_i]}$ to denote the Jacobian matrix of the function $f$ w.r.t. one of the input vectors $\mathbf{a}_i$; the shape of the Jacobian matrix is $\mathbb{R}^{|\mathbf{y}| \times |\mathbf{a}_i|}$. If $\boldsymbol{\mathcal{Y}}$ is an output tensor of size $s_1 \times \cdots \times s_M$ and $\boldsymbol{\mathcal{A}}_i$ is an input tensor of size $r_1 \times \cdots \times r_K$, then the Jacobian is a tensor, denoted $\boldsymbol{\mathcal{J}}^{[f]}_{[\boldsymbol{\mathcal{A}}_i]}$, of size $s_1 \times \cdots \times s_M \times r_1 \times \cdots \times r_K$.

We also define generalized Vector-Jacobian Products (VJP), Jacobian-Vector Products (JVP) and Hessian-Vector Products (HVP). When the Jacobian and Hessian are matrices, these are matrix-vector multiplication operations. When the Jacobian and Hessian are tensors as defined above, these are tensor contractions, whose results are the same as unfolding the tensors into matrices and performing the matrix-vector product.
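As an illustrative aside (our sketch, not an AutoHOOT routine), the generalized matricization $\mathbf{X}_{(i:j)}$ can be written in NumPy as follows, using 0-based mode indices:

    import numpy as np

    def matricize(X, i, j):
        # Unfold modes i..j (inclusive, 0-based) into the column dimension
        # and all remaining modes into the row dimension.
        col_modes = list(range(i, j + 1))
        row_modes = [m for m in range(X.ndim) if m not in col_modes]
        Xp = np.transpose(X, row_modes + col_modes)
        rows = int(np.prod([X.shape[m] for m in row_modes]))
        return Xp.reshape(rows, -1)

    X = np.random.rand(2, 3, 4, 5)
    print(matricize(X, 1, 2).shape)   # (10, 12): modes 1-2 go to the columns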
B. Numerical Optimization Algorithms for Tensor Computations

We consider two tensor numerical problems: nonlinear least squares fitting and the eigenvalue problem. For both problems, we denote by $\boldsymbol{\mathcal{X}}$ the input tensor, which can be an explicit tensor or an implicit tensor network (e.g., a Matrix Product Operator [61]), by $f$ a tensor network function, and by $\boldsymbol{\mathcal{A}}_1, \ldots, \boldsymbol{\mathcal{A}}_N$ the optimization variables. The objective for the nonlinear least squares problem is defined as

$\min_{\boldsymbol{\mathcal{A}}_1, \ldots, \boldsymbol{\mathcal{A}}_N} \phi(\boldsymbol{\mathcal{A}}_1, \ldots, \boldsymbol{\mathcal{A}}_N) := \tfrac{1}{2} \|\boldsymbol{\mathcal{X}} - f(\boldsymbol{\mathcal{A}}_1, \ldots, \boldsymbol{\mathcal{A}}_N)\|_F^2, \quad (1)$

which finds a generalized low-rank approximation of the input tensor $\boldsymbol{\mathcal{X}}$. The objective for the eigenvalue problem is defined as

$\min_{\boldsymbol{\mathcal{A}}_1, \ldots, \boldsymbol{\mathcal{A}}_N} \psi(\boldsymbol{\mathcal{A}}_1, \ldots, \boldsymbol{\mathcal{A}}_N) := \frac{\mathbf{v}_{(1:N)}^T \mathbf{X}_{(1:N)} \mathbf{v}_{(1:N)}}{\|\boldsymbol{\mathcal{V}}\|_F^2}, \quad (2)$

where $\boldsymbol{\mathcal{V}} = f(\boldsymbol{\mathcal{A}}_1, \ldots, \boldsymbol{\mathcal{A}}_N)$, $\mathbf{v}_{(1:N)}$ is its matricization into a vector, and the output of $f$ serves as a generalized low-rank approximation of an eigenvector of the Hermitian matrix that is a matricization of $\boldsymbol{\mathcal{X}}$.

Three categories of algorithms are generally used to solve these problems: second-order methods, including Newton's method and its variants; alternating minimization, which updates one input / site at a time; and first-order methods, such as gradient descent and its variants.
Newton’s method and its variants.
Newton's method and its variants, such as the Gauss-Newton (GN) method, are popular methods to solve nonlinear least squares problems with the quadratic objective function defined in Equation 1. Let $\mathbf{a}$ denote the concatenation of all the vectorized sites $\mathrm{vec}(\boldsymbol{\mathcal{A}}_i)$ and $\hat{f}(\mathbf{a}) = \mathrm{vec}(f(\boldsymbol{\mathcal{A}}_1, \ldots, \boldsymbol{\mathcal{A}}_N))$, so that $\mathbf{r}(\mathbf{a}) := \mathrm{vec}(\boldsymbol{\mathcal{X}}) - \hat{f}(\mathbf{a})$ denotes the vectorized residual. Further, let $r_i(\mathbf{a})$ denote the $i$th element of the output of the function $\mathbf{r}$. The gradient and the Hessian matrix of $\phi$ can be expressed as

$\nabla \phi(\mathbf{a}) = \mathbf{J}^{[r]T}_{[\mathbf{a}]} \mathbf{r}(\mathbf{a}), \qquad \mathbf{H}^{[\phi]}_{[\mathbf{a}]} = \mathbf{J}^{[r]T}_{[\mathbf{a}]} \mathbf{J}^{[r]}_{[\mathbf{a}]} + \sum_i r_i(\mathbf{a}) \mathbf{H}^{[r_i]}_{[\mathbf{a}]}.$

The Newton iteration performs the update

$\mathbf{a}^{(k+1)} = \mathbf{a}^{(k)} - \big(\mathbf{H}^{[\phi]}_{[\mathbf{a}^{(k)}]}\big)^{-1} \mathbf{J}^{[r]T}_{[\mathbf{a}^{(k)}]} \mathbf{r}(\mathbf{a}^{(k)}),$

while the Gauss-Newton method leverages the fact that $\mathbf{H}^{[r_i]}_{[\mathbf{a}]}$ is negligible, as its norm is small when the residual is small; therefore the update can be performed as

$\mathbf{a}^{(k+1)} = \mathbf{a}^{(k)} - \big(\mathbf{J}^{[r]T}_{[\mathbf{a}^{(k)}]} \mathbf{J}^{[r]}_{[\mathbf{a}^{(k)}]}\big)^{-1} \mathbf{J}^{[r]T}_{[\mathbf{a}^{(k)}]} \mathbf{r}(\mathbf{a}^{(k)}),$

where $\mathbf{a}^{(k)}$ represents $\mathbf{a}$ at the $k$th iteration. The Gauss-Newton update can be regarded as the normal equations for a linear least squares problem. Both Newton and Gauss-Newton steps can be solved via the conjugate gradient method, with matrix-vector products performed with an implicit form of the Jacobian / Hessian to avoid costly matrix inversion [56], [48].
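To make the implicit-Jacobian idea concrete, the following is a minimal matrix-free sketch of one Gauss-Newton step (our example written with JAX's jvp/vjp, not AutoHOOT's API): the Gauss-Newton matrix $\mathbf{J}^T \mathbf{J}$ is applied inside a plain conjugate gradient loop, so the Jacobian is never materialized.

    import jax
    import jax.numpy as jnp

    def gauss_newton_step(f_hat, x_vec, a, cg_iters=20):
        # Residual r(a) = vec(X) - f_hat(a); since J[r] = -J[f_hat],
        # the Gauss-Newton update reduces to a + (J^T J)^{-1} J^T r.
        r = x_vec - f_hat(a)
        vjp = jax.vjp(f_hat, a)[1]
        def JTJv(v):
            Jv = jax.jvp(f_hat, (a,), (v,))[1]   # forward-mode: J v
            return vjp(Jv)[0]                    # reverse-mode: J^T (J v)
        g = vjp(r)[0]                            # J^T r
        # Plain conjugate gradient on (J^T J) p = J^T r, starting from p = 0.
        p, d, res = jnp.zeros_like(a), g, g
        rs = res @ res
        for _ in range(cg_iters):
            Ad = JTJv(d)
            alpha = rs / (d @ Ad)
            p, res = p + alpha * d, res - alpha * Ad
            rs_new = res @ res
            d, rs = res + (rs_new / rs) * d, rs_new
        return a + p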
Alternating minimization. For tensor numerical optimizations, in many cases both the input and output dimensions are large, and it is computationally expensive to form the explicit Hessian / Jacobian matrix w.r.t. all the variables and apply a second-order method directly. On the other hand, when optimizing a subset of variables, forming the Hessian or Jacobian with respect to those variables is affordable and effective. Most often, alternating minimization procedures update one tensor operand at a time. For Equation 1, such a subproblem can be formulated as

$\min_{\boldsymbol{\mathcal{A}}_i} \phi(\boldsymbol{\mathcal{A}}_1, \ldots, \boldsymbol{\mathcal{A}}_N). \quad (3)$

Each $\boldsymbol{\mathcal{A}}_i$ for $i \in \{1, \ldots, N\}$ is updated once via its subproblem during an optimization sweep. For tensor decompositions and tensor networks, each subproblem is often quadratic, allowing the minimum to be found directly, often at a similar cost to updating $\boldsymbol{\mathcal{A}}_i$ with a first-order method. Alternating minimization also generally provides monotonic convergence.

In each sweep, many terms necessary to form the subproblems share equivalent intermediates, and choosing the proper contraction paths to form and amortize them can greatly reduce the cost, as sketched below. This scheme, called the dimension tree algorithm, is critical to performance, and has been used to reduce cost in both tensor decompositions [43], [60], [33] and DMRG.
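As a small illustration of such intermediate reuse (our NumPy sketch, not AutoHOOT's generated code), consider one ALS sweep of an order-3 CP decomposition: the partial contraction of the input tensor with C is formed once and amortized across the MTTKRP terms for updating A and then B.

    import numpy as np

    s, R = 50, 10
    X = np.random.rand(s, s, s)
    A, B, C = (np.random.rand(s, R) for _ in range(3))

    XC = np.einsum('abc,cr->abr', X, C)         # shared intermediate
    mttkrp_A = np.einsum('abr,br->ar', XC, B)   # reused for the A-update
    mttkrp_B = np.einsum('abr,ar->br', XC, A)   # reused for the B-update
    mttkrp_C = np.einsum('abc,ar,br->cr', X, A, B, optimize=True)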
First-order methods. The efficacy of first-order methods for tensor computations depends on the application. First-order methods have been shown to be advantageous in achieving high fitting accuracy on some tensor decomposition problems [2], while they perform worse than alternating minimization in achieving high accuracy for large-scale tensor completion problems [64]. The per-iteration cost of first-order methods is comparable to that of both second-order methods and alternating minimization, due to the structured tensor networks $f$ in both Equations 1 and 2.

Traditional AD frameworks can generate efficient kernels for first-order methods, while their performance on the kernels of higher-order methods is suboptimal. In this paper, we focus on performance optimization for both second-order methods and alternating minimization, to accelerate future development of efficient high-order methods for various applications. However, we believe our graph optimization techniques also have the potential to produce efficient formulations for first-order methods based on contractions of high-order tensors, which arise in quantum chemistry methods [18].
Optimization for tensor computations requires three essential building blocks: automatic differentiation, an optimizing compiler, and a computation backend. Existing software for tensor computations, including Tensorly [30], TensorNetwork [45] and Quimb [13], adapts to many computation backends. However, these libraries also rely on the backend for optimization and AD. Therefore, AD and just-in-time compilation are only available with backends such as JAX and TensorFlow, but not NumPy.

Automatic differentiation is generally provided in one of two ways: operator overloading [40], [9], [57], [63], [34] or source code transformation (SCT) [1], [59], [23]. Operator overloading requires the user to write functions in terms of the provided library constructs and constructs the derivatives at run-time, while SCT uses precompilation to generate code for derivative computation. Operator overloading provides a similar mental programming model to normal computer programs [57], yielding code that is easier to interpret and debug than SCT. On the flip side, SCT has more potential to optimize the computational graph with global graph information. Consequently, SCT is generally the method of choice for AD libraries that aim to achieve high performance (e.g., [1]).

Our work on graph optimization builds on substantial efforts toward optimization of computational graphs of tensor operations. Tensor contraction can be optimized via parallelization [44], [27], [26], [52], efficient transposition [54], blocking [12], [32], [22], [46], exploiting symmetry [18], [52], [51], and sparsity [28], [42], [26], [35], [50]. For complicated tensor graphs, specialized compilers like XLA [55] and TVM [10] rewrite the computational graph to optimize program execution
and memory allocation on dedicated hardware. For machine-independent optimization, Grappler in TensorFlow [1] and TASO [24] use rule-based symbolic substitution to simplify the execution flow. Classical compiler optimizations, such as common subexpression elimination [3], are widely used as well [1], [15]. Previous work such as opt_einsum [49] has yielded approaches for automatically determining efficient contraction orderings and selecting the best intermediates [18], [6], [16], [17]. These approaches generally rely on heuristic or exhaustive search to select a contraction path, as finding the optimal contraction order is NP-hard [11].

Figure 1: An example of a computational graph. We use green nodes to denote input variables, purple nodes to denote output nodes, and blue nodes to denote intermediate or constant nodes.

Figure 2: System overview of AutoHOOT. The arrows show the computation flow.

III. OVERALL ARCHITECTURE
The computations in AutoHOOT are described by computational graphs, which are directed graphs revealing the data dependencies between different operations. Each node can be a source, an intermediate, or a sink. Source / sink nodes are the inputs / outputs of the graph. Sink and intermediate nodes can be any mathematical computation, while source nodes are fed by the user or are constants. An edge connecting two nodes represents the data dependency between them. An example of a computational graph is shown in Figure 1, where A, B, C are source nodes, the Einsum node is the sink, and the graph computes (A + B)C. We typically refer to a node by its type, e.g., an Einsum node, which represents tensor computations based on the Einstein summation convention. An Einsum graph is defined as a graph of nodes where all the nodes except the sources are Einsum nodes.
An Einsum tree is defined as a tree of nodes where all the nodes except the sources are Einsum nodes.

AutoHOOT has two major components: an automatic differentiation architecture for tensor computations and a tensor computational graph optimizer. Figure 2 shows the system overview. For an input computation expression, the AD module generates its tensorized differentiation expressions. Both the input expressions and the differentiation expressions are then optimized by the graph optimization module. With the optimized expressions, users can either directly run the optimized expressions using the framework backends, including NumPy, TensorFlow and Cyclops, or generate Python source code through the source generation module.

Below we show an example of performing the CP decomposition based on alternating least squares using the framework. Rather than constructing each subproblem and building the dimension-tree-based algorithm manually, we only need to construct the updates of Newton's method for each subproblem, and the optimize function will reorganize the computational graph to minimize execution time automatically.
    A, B, C, input_tensor, loss = cpd_graph(size, rank)

    def update_site(site):
        hes = ad.hessian(loss, [site])
        grad, = ad.gradients(loss, [site])
        new_site = ad.tensordot(ad.tensorinv(hes[0][0]), grad)
        return optimize(new_site)

    new_A = update_site(A)
    new_B = update_site(B)
    new_C = update_site(C)
    executor = ad.Executor([loss, new_A, new_B, new_C])

    for i in range(num_iter):
        A_val = executor.run(feed_dict={input_tensor: input_tensor_val,
                                        A: A_val, B: B_val, C: C_val},
                             out=[new_A])
        B_val = executor.run(feed_dict={input_tensor: input_tensor_val,
                                        A: A_val, B: B_val, C: C_val},
                             out=[new_B])
        C_val = executor.run(feed_dict={input_tensor: input_tensor_val,
                                        A: A_val, B: B_val, C: C_val},
                             out=[new_C])
        loss_val = executor.run(feed_dict={input_tensor: input_tensor_val,
                                           A: A_val, B: B_val, C: C_val},
                                out=[loss])

In the AD module, we implement reverse-mode AD for first-order derivatives (Jacobian, VJP and JVP), as well as for higher-order derivatives, including the Hessian and HVP. Both the Jacobian and the Hessian are formulated with a new algorithm, such that their calculations do not depend on the JVP and HVP routines, which is more amenable to parallel execution as well as graph optimizations. We describe this approach in detail in Section IV.

The graph optimizer provides optimizations for tensor computational graphs. We adopt many machine-independent optimization algorithms for common tensor computational graphs, such as selection of an optimal contraction path and common subexpression elimination. For second-order methods, the graph optimizer rewrites structured inverses, such as the inverse of a Kronecker product, so that the inverses operate on smaller tensors. For alternating methods, we developed a path selection algorithm with constraints to construct the dimension trees. We describe this algorithm in detail in Section V.
IV. COMPUTATIONAL GRAPHS FOR HIGH-ORDER DERIVATIVES
We implement reverse-mode AD based on the source code transformation (SCT) method, explicitly transforming the primal computation expression into the adjoint expression prior to execution. This allows us to flexibly perform computational graph optimization after the adjoint expression has been produced.

Our AD module supports operations that calculate the Jacobian / Hessian expressions implicitly (VJP, JVP and HVP), as well as explicit Jacobian and Hessian calculations. The implicit calculations are widely used in many other frameworks because they are computationally cheaper. For example, for a Hessian matrix of size $n \times n$, explicitly forming the matrix costs $O(n^2)$, while the HVP calculation costs only $O(n)$ by leveraging the back-propagation gradient functions. For the explicit Jacobian and Hessian calculations, we introduce a new back-propagation algorithm that produces a computational graph more amenable to parallelization and downstream optimizations. The algorithm is detailed in Section IV-B.

A. VJP, JVP, and HVP
Our implementation of VJP is similar to many other frameworks [40], [9], [1], and is based on reverse-mode AD. Consider functions involving matrix / vector operations whose inputs and outputs are both vectors, $\mathbf{x}_{i+1} = f_i(\mathbf{x}_i)$ for $i \in \{1, \ldots, N\}$, and a computational graph consisting of a chain of these functions, $\mathbf{y} = f(\mathbf{x}_1) = f_N(\cdots f_1(\mathbf{x}_1))$. The VJP adjoint of $\mathbf{x}_i$, $\mathbf{v}^T \mathbf{J}^{[f]}_{[\mathbf{x}_i]}$, is calculated from the VJP adjoint of $\mathbf{x}_{i+1}$:

$\mathrm{VJP}(\mathbf{v}, f, \mathbf{x}_i) = \mathbf{v}^T \mathbf{J}^{[f]}_{[\mathbf{x}_i]} = \big(\mathbf{v}^T \mathbf{J}^{[f]}_{[\mathbf{x}_{i+1}]}\big) \mathbf{J}^{[f_i]}_{[\mathbf{x}_i]} = \mathrm{VJP}(\mathbf{v}, f, \mathbf{x}_{i+1})\, \mathbf{J}^{[f_i]}_{[\mathbf{x}_i]}.$

Therefore, the VJPs of all the inputs / intermediates $\mathbf{x}_i$, $i \in \{1, \ldots, N\}$, are calculated with one backward propagation. This is also computationally efficient, because only a matrix-vector product is necessary for each calculation.

Note that for cases where sub-function inputs and outputs contain matrices or tensors, VJP with reverse-mode AD is still valid and efficient, since we can think of each matrix or tensor as a reshaped vector. For the case where the output is a scalar, the gradient expression is implemented based on the VJP, with the vector fixed to a length-one vector whose single element is one.

Our JVP implementation is based on the VJP function.¹ Although it is more computationally efficient to implement JVP based on forward-mode AD [8], we choose to implement it based on our reverse-mode AD module, and optimize the computational graph afterwards to achieve computationally efficient expressions. The JVP implementation calls the VJP function twice. First, we construct a function $g$ whose expression is

$g(\mathbf{u}) = \mathrm{VJP}(\mathbf{u}, f, \mathbf{x})^T = \big(\mathbf{u}^T \mathbf{J}^{[f]}_{[\mathbf{x}]}\big)^T.$

Afterwards, we perform another VJP operation on the function $g$ with respect to its input $\mathbf{u}$, which yields the JVP expression:

$\mathrm{VJP}(\mathbf{v}, g, \mathbf{u})^T = \big(\mathbf{v}^T \mathbf{J}^{[g]}_{[\mathbf{u}]}\big)^T = \big(\mathbf{v}^T \mathbf{J}^{[f]T}_{[\mathbf{x}]}\big)^T = \mathbf{J}^{[f]}_{[\mathbf{x}]} \mathbf{v} = \mathrm{JVP}(\mathbf{v}, f, \mathbf{x}).$

We also implement the HVP function based on the gradient function. We only consider the case where the function output is a scalar, as this is the general setting in which Hessian matrices are used. The HVP is formulated from two gradient calculations, because the HVP is equivalent to the gradient of the gradient-vector inner product. Since $\mathbf{v}$ does not depend on $\mathbf{x}$, so that $\partial \mathbf{v} / \partial \mathbf{x} = \mathbf{0}$,

$\mathrm{HVP}(\mathbf{v}, f, \mathbf{x}) = \mathbf{H}^{[f]}_{[\mathbf{x}]} \mathbf{v} = \frac{\partial \mathbf{g}^{[f]}_{[\mathbf{x}]}}{\partial \mathbf{x}} \mathbf{v} = \frac{\partial \mathbf{g}^{[f]}_{[\mathbf{x}]}}{\partial \mathbf{x}} \mathbf{v} + \mathbf{g}^{[f]T}_{[\mathbf{x}]} \frac{\partial \mathbf{v}}{\partial \mathbf{x}} = \frac{\partial \big(\mathbf{g}^{[f]T}_{[\mathbf{x}]} \mathbf{v}\big)}{\partial \mathbf{x}} = \mathrm{grad}\big(\mathrm{grad}(f, \mathbf{x})^T \mathbf{v},\ \mathbf{x}\big).$

¹The JVP implementation is based on the technique introduced at https://j-towns.github.io/2017/06/12/A-new-trick.html
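The double-VJP construction above can be reproduced in a few lines of JAX (a sketch of the footnoted trick, not AutoHOOT's implementation):

    import jax
    import jax.numpy as jnp

    def jvp_via_vjp(f, x, v):
        vjp_f = jax.vjp(f, x)[1]
        g = lambda u: vjp_f(u)[0]        # g(u) = (u^T J)^T = J^T u
        vjp_g = jax.vjp(g, jnp.zeros_like(f(x)))[1]
        return vjp_g(v)[0]               # (v^T J^T)^T = J v

    f = lambda x: jnp.sin(x) * x
    x, v = jnp.arange(3.0), jnp.ones(3)
    print(jvp_via_vjp(f, x, v))
    print(jax.jvp(f, (x,), (v,))[1])     # forward-mode reference: matches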
B. Explicit Jacobian and Hessian

To the best of our knowledge, all of the popular AD frameworks calculate the explicit Jacobian and Hessian based on the VJP and HVP routines [1], [9], [40]. Taking the Jacobian calculation of $f(\mathbf{x}) = \mathbf{A}_1 \mathbf{A}_2 \mathbf{x}$ as an example: when both $\mathbf{x}$ and $f(\mathbf{x})$ are of size $n$, current methods will compute the $i$th row of the Jacobian via the VJP $\mathbf{e}_i^T \mathbf{J}^{[f]}_{[\mathbf{x}]}$ for $i \in \{1, \ldots, n\}$, where $\mathbf{e}_i$ is the $i$th elementary vector. There are two major disadvantages to this approach:

• It changes BLAS-3 level matrix-matrix multiplications into multiple BLAS-2 level matrix-vector multiplications, so less flop intensity can be achieved. Although many frameworks provide a routine to compute all the matrix-vector multiplications in parallel, the parallelism is still sub-optimal and less efficient than matrix multiplications, because the flop-to-byte ratio is $O(1)$ versus $O(n)$.

• The computational graph produced is difficult to optimize. Although they have high dimensions, many Jacobians / Hessians in tensor computation operations are highly structured, and the computational cost can be greatly reduced when this structure is exploited. For the example above, when $\mathbf{A}_1 = \mathbf{B} \otimes \mathbf{C}$ and $\mathbf{A}_2 = \mathbf{D} \otimes \mathbf{E}$ and $\mathbf{B}, \mathbf{C}, \mathbf{D}, \mathbf{E}$ have sizes $\sqrt{n} \times \sqrt{n}$, performing the matrix-vector product for the Jacobian and each elementary vector costs $O(n^{3/2})$, and the overall Jacobian calculation costs $O(n^{5/2})$. However, if we calculate the Jacobian directly, we can use the mixed-product property of the Kronecker product to optimize the expression, $(\mathbf{B} \otimes \mathbf{C})(\mathbf{D} \otimes \mathbf{E}) = (\mathbf{B}\mathbf{D}) \otimes (\mathbf{C}\mathbf{E})$, reducing the overall cost to $O(n^2)$.

To alleviate these disadvantages, we produce both Jacobian and Hessian expressions in a way that is independent of the VJP and HVP routines.

For the Jacobian expression, our implementation is also based on the chain rule using back propagation, relying on the fact that

$\mathrm{Jacobian}(f, \mathbf{x}_i) = \mathbf{J}^{[f]}_{[\mathbf{x}_i]} = \mathbf{J}^{[f]}_{[\mathbf{x}_{i+1}]} \mathbf{J}^{[f_i]}_{[\mathbf{x}_i]} = \mathrm{Jacobian}(f, \mathbf{x}_{i+1})\, \mathbf{J}^{[f_i]}_{[\mathbf{x}_i]}.$

Therefore, the Jacobian of one target node is the matrix-matrix product between the Jacobian of its output node and the Jacobian of the local function. Note that when both $\mathbf{x}_i$ and the Jacobian have tensor format, the above equation still holds, with the matrix-matrix product expressed in the form of tensor contractions (Einsums).

For linear operations, such as addition, subtraction, scalar-tensor multiplication and Einsum, we formulate the Jacobian expressions as Einsums. To achieve that, we introduce the Identity node, a node that applies an identity matrix, to express the constraints in Jacobian tensors. For example, for the addition of two order-$N$ tensors, $f(\boldsymbol{\mathcal{A}}, \boldsymbol{\mathcal{B}}) = \boldsymbol{\mathcal{A}} + \boldsymbol{\mathcal{B}}$, its Jacobian is a tensor of order $2N$, where $\boldsymbol{\mathcal{J}}^{[f]}_{[\boldsymbol{\mathcal{A}}]}(x_1, \ldots, x_{2N}) = 1$ if and only if $x_i = x_{i+N}$ for $i \in \{1, \ldots, N\}$, and all other elements are 0. This constraint can be easily specified with identity nodes. For order-3 addition, the Jacobian of $\boldsymbol{\mathcal{A}}$ can be expressed as

$\boldsymbol{\mathcal{J}}^{[f]}_{[\boldsymbol{\mathcal{A}}]}(i, j, k, l, m, n) = \mathbf{I}(i, l)\, \mathbf{I}(j, m)\, \mathbf{I}(k, n).$

Similarly, we can use this method to express the Jacobians of all the other linear operations. For example, for the Einsum expression

$f(\boldsymbol{\mathcal{A}}, \boldsymbol{\mathcal{B}})(i, j, k) = \sum_l \boldsymbol{\mathcal{A}}(i, k, l)\, \boldsymbol{\mathcal{B}}(j, k, l),$

its Jacobians are written as

$\boldsymbol{\mathcal{J}}^{[f]}_{[\boldsymbol{\mathcal{A}}]}(i, j, k, m, n, o) = \mathbf{I}(i, m)\, \mathbf{I}(k, n)\, \boldsymbol{\mathcal{B}}(j, n, o), \qquad \boldsymbol{\mathcal{J}}^{[f]}_{[\boldsymbol{\mathcal{B}}]}(i, j, k, m, n, o) = \mathbf{I}(j, m)\, \mathbf{I}(k, n)\, \boldsymbol{\mathcal{A}}(i, n, o).$

Although we have introduced several identity nodes, they can easily be pruned so that only the necessary identity nodes are left, as will be described in Section V. The Hessian routines are based on the Jacobian routines: we perform Jacobian calculations twice to get the Hessian expressions.

The advantages of this Jacobian / Hessian generation method are three-fold. First, we can leverage BLAS-3 level operations to perform most of the tensor contractions, achieving higher performance. Second, the expressions are much easier to optimize, as will be shown below. Third, the source code for the Jacobian / Hessian expressions can be easily acquired, which is beneficial for both debugging and research purposes.
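As a concrete check of this construction (our NumPy sketch), the Jacobian of the Einsum example above can be materialized as a single Einsum over identity matrices and verified against directional perturbations, which are exact here because $f$ is linear in $\boldsymbol{\mathcal{A}}$:

    import numpy as np

    sA, sB, sK, sL = 2, 3, 4, 5
    A = np.random.rand(sA, sK, sL)
    B = np.random.rand(sB, sK, sL)

    # J[f][A](i,j,k,m,n,o) = I(i,m) I(k,n) B(j,n,o)
    J_A = np.einsum('im,kn,jno->ijkmno', np.eye(sA), np.eye(sK), B)

    for _ in range(5):
        m, n, o = (np.random.randint(d) for d in (sA, sK, sL))
        dA = np.zeros_like(A)
        dA[m, n, o] = 1.0
        df = np.einsum('ikl,jkl->ijk', dA, B)  # change in f per unit change in A(m,n,o)
        assert np.allclose(J_A[..., m, n, o], df)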
Algorithm 1: Graph optimization
    input : Input Graph: G
    output: Optimized Graph: OG
    G = FuseAllEinsum(Distribution(G))    ▷ Provide longer Einsums
    G = SymbolicExecution(G)              ▷ Decompose inverse / prune identity / SymPy
    G = OptContractPath(G)                ▷ Find efficient contraction ordering
    OG = CSE(G)                           ▷ Perform common subexpression elimination
    return OG

V. GRAPH OPTIMIZATIONS
We built a compiler to optimize tensor computational graphs. Our goal is to reduce the computational cost by transforming the graph into an equivalent form. Acknowledging that retrieving the optimal execution graph is NP-hard, we devised several strategies:

• Generation of longer Einsum nodes: To achieve this, we implemented two kernels, Einsum distribution and Einsum fusion.
• Symbolic rule execution: We implemented the structured inverse node decomposition and redundant node pruning kernels. In addition, we use SymPy [37] to simplify elementary algebraic operations.
• Contraction order selection: We select the contraction path on fully simplified expressions.
• Constrained contraction path construction: To accelerate alternating minimization, we provide a kernel to reuse intermediates between optimization subproblems.

Traditional compiler techniques, such as common subexpression elimination, are applied after the strategies above. The overall algorithm is described in Algorithm 1.
A. Longer Einsum Nodes Generation
We aim to transform the computational graph into Einsum nodes with as many inputs as possible. This empowers the contraction path selection with a global view and eases the discovery of optimizable patterns for downstream algorithms. Along this line, two transformation kernels were built.
Einsum distribution.
Einsum distribution recursively leverages the distributive property to generate larger Einsum graphs. Larger Einsum graphs are the prerequisite for further graph
Einsum("ik,kj->ij") C Einsum("ik,kj->ij")BAdd
A AddB CEinsum("ik,kj->ij") (a) Einsum distribution.
AEinsum("ab,ac,bd->abcd")B I0
AEinsum("ab,cd,ac,be,ef->abdf")B I0 I1 I2 (b) Identity node pruning. The nodes whose name starts from ”I” are identitynodes.
AEinsum("ik, kj->ij") BEinsum("ik, kj->ij")CEinsum("jk, ki->ji")
A Einsum("ik, kj->ij") BClone CloneCClone CloneEinsum("jk, ki->ji")Einsum("ik, kj->ij") Einsum("ik, kj->ij")
AEinsum("ba,ac,cd,de,ef->bf")B C
A Einsum("ba,ae,ed,dc,cf->bf")BClone Clone CClone Clone (c) Einsum fusion. It transforms an Einsum graph into one single Einsum node.
AEinsum("ab,cd,ef->acebdf")B CTensorInv(ind=3)
ATensorInv(ind=1) BTensorInv(ind=1) CTensorInv(ind=1)Einsum("ab,cd,ef->acebdf") (d) Optimization of tensor inversion I0 AEinsum("ac,de,cd->ae")Einsum("ab,bc->ac")BTensorInv(ind=1) (e) Inverse node pruning
AEinsum("ac,ba,bc->")BAdd
AEinsum("ac,ba,bc->") Einsum("ba,ac,bc->")BAdd (f) Common subexpression elimination
AEinsum("ak,abcd->kbcd")X BEinsum("kbcd,bk->kcd") CEinsum("kcd,ck->dk")
XEinsum("abcd,ak,bk,ck->dk")A B C (g) Optimal contraction path
XEinsum("bm,abc,cm->am") Einsum("abc,am,cm->bm")Einsum("bm,abc,am->cm")B CA
BEinsum("ab,cab->cb") Einsum("ab,acb->cb")CEinsum("ab,cda->cdb") XEinsum("acd,ab->cdb")AEinsum("acb,ab->cb") (h) Dimension tree generation
Figure 3: Visualization of different graph optimization kernels.depth reduction. We move the distributive nodes closer to thegraph sinks based on the rule below. Figure 3a illustrates theidea of an application of the algorithm while the pseudo-codecan be found in Algorithm 4 in the Appendix.
    Einsum(dist_op(g1, g2), g3) =
        dist_op(Einsum(g1, g3), Einsum(g2, g3))
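A quick numeric check of this rule (our example, not AutoHOOT's internal representation):

    import numpy as np

    A, B, C = (np.random.rand(4, 4) for _ in range(3))
    fused = np.einsum('ik,kj->ij', A + B, C)
    distributed = np.einsum('ik,kj->ij', A, C) + np.einsum('ik,kj->ij', B, C)
    assert np.allclose(fused, distributed)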
Einsum fusion.
Einsum fusion transforms an Einsum graph into several single Einsum nodes with the same sources. It is a prerequisite for downstream graph optimization steps, such as contraction path selection and identity node pruning. An example can be seen in Figure 3c.

Einsum fusion has three steps: linearization of the graph, fusion of the generated Einsum tree, and removal of the redundant clone nodes. The linearization step changes the input Einsum graph into an Einsum tree. When a source node is used in multiple Einsums, we create a clone of it for each Einsum. If an Einsum node has more than one output, we copy the subgraph defining its computation, including itself,
and repeat until all nodes have a single output, yielding a forest (a set of disconnected trees). The fusion step fuses each generated Einsum tree. It leverages a union-find data structure, which puts two dimensions from two Einsum nodes into one set if they have the same subscript in one Einsum expression. After that, each disjoint set is assigned a unique character for the generation of the subscript of the new Einsum node. Finally, the clone node removal step removes the redundant clone nodes and returns an Einsum node. We illustrate both the pseudo-code sketch of the algorithm and the union-find data structure in Algorithms 3 and 5 in Appendix C.

Figure 4: Tensor diagram of two Einsum expressions, Einsum('pb,or,ob,pr,st->srtb', B, A, A, B, I) and Einsum('eb,ed,fb,fd,ac->abcd', A, A, B, B, I), with the same tensor computations. The numbers around the input tensors denote the dimensions contracted by specific edges, and the underlined numbers denote the dimensions of the output tensor. Two Einsum expressions with the same tensor diagram express the same tensor computations.
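Returning to the fusion step, the following is a minimal sketch of the union-find merging (our simplified representation, where each dimension is a (node, axis) pair):

    # Fuse T = einsum('ik,kj->ij', A, B) followed by einsum('ij,jl->il', T, C).
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    union(('A', 1), ('B', 0))   # both labeled 'k' in the first Einsum
    union(('A', 0), ('T', 0))   # output 'i' comes from A
    union(('B', 1), ('T', 1))   # output 'j' comes from B
    union(('T', 1), ('C', 0))   # both labeled 'j' in the second Einsum

    chars, labels = iter('abcdefgh'), {}
    def label(dim):
        root = find(dim)
        if root not in labels:
            labels[root] = next(chars)   # one fresh character per disjoint set
        return labels[root]

    inputs = ','.join(''.join(label((t, ax)) for ax in (0, 1)) for t in 'ABC')
    print(inputs + '->' + label(('A', 0)) + label(('C', 1)))   # ab,bc,cd->ad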
B. Symbolic Execution
We employ several linear algebra constructs that can simplify the computational graph and reduce the computational cost.
Structured tensor inverse decomposition.
An inverse of an Einsum graph may be the bottleneck of the computational graph because of its cubic complexity. Fortunately, structured information may guide the optimization; e.g., the inverse of a Kronecker product can be decomposed into the Kronecker product of inverses through $(\mathbf{A} \otimes \mathbf{B})^{-1} = \mathbf{A}^{-1} \otimes \mathbf{B}^{-1}$. We develop an algorithm to detect and break large tensor inverses into products of smaller tensor inverses so that the computation is cheaper. To keep it simple, the algorithm limits its applicability to specific forms of tensors; further details are described in Appendix B. An illustrative example is shown in Figure 3d. We also provide a function to prune unnecessary inverse nodes, as shown in Figure 3e: when there is a matrix multiplication between an Einsum node and its corresponding inverse node, we directly return the identity node rather than performing the calculations.
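A numeric check of the decomposition rule (our example): inverting the small factors is much cheaper than inverting the Kronecker product itself.

    import numpy as np

    n = 4
    A = np.random.rand(n, n) + n * np.eye(n)   # keep the factors well-conditioned
    B = np.random.rand(n, n) + n * np.eye(n)
    lhs = np.linalg.inv(np.kron(A, B))
    rhs = np.kron(np.linalg.inv(A), np.linalg.inv(B))
    assert np.allclose(lhs, rhs)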
Redundant node pruning. We prune redundant nodes, including identity nodes and inverse nodes, to simplify the expressions. Identity nodes are essential building blocks for the explicit Jacobian and Hessian expressions, as shown in Section IV-B. During AD, redundant identity nodes are introduced to aid the construction of the graph. Hence, we implemented an algorithm to eliminate the unnecessary identity nodes afterwards for better efficiency. Identity nodes are removed unless they express necessary constraints in the output tensor structure, such as the tensor symmetry shown in the right graph of Figure 3b.
Elementary Algebraic Simplification.
For elementary operations, such as addition, subtraction and multiplication, we use the SymPy library [37] to optimize them. SymPy helps us easily simplify the expressions. For the example shown below, it reduces the expression to a single term:

    sympy_simplify((A-(((A*0.5)-(T*0.5))+((A*0.5)-(T*0.5))))) = T
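The same example can be run through SymPy directly (a sketch; AutoHOOT wraps this call in its own simplification pass):

    import sympy as sp

    A, T = sp.symbols('A T')
    expr = A - (((A * 0.5) - (T * 0.5)) + ((A * 0.5) - (T * 0.5)))
    print(sp.simplify(expr))   # 1.0*T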
C. Optimized Contraction Path Selection
We identify the optimal contraction path for an Einsum expression after all the above transformations. For an Einsum node with multiple inputs, we provide a function to decompose it into an Einsum graph with the optimized contraction path, as shown in Figure 3g. Our strategy is designed for common tensor contractions under the following two assumptions:

• For simplicity, we only discuss the case where tensors are dense; a long Einsum expression with multiple inputs is first split into multiple small Einsum expressions, each with only two inputs, and then dense tensor contractions are executed.
• The chosen contraction path is hardware oblivious. We assume the contraction time for each operation is proportional to the flop count. Other factors, such as the communication cost among different processes in parallel execution settings, are not considered.

These assumptions allow us to implement the algorithm based on an interface provided by NumPy.² Whether we find the optimal contraction path depends on the optimization algorithm, but we found that a greedy search algorithm generally provides an optimal path for most Einsum expressions in tensor computation applications. In addition, the assumptions above are not limitations of our overall approach. AutoHOOT is also capable of extracting the contraction path from other libraries, such as Cyclops [64], where hardware and tensor sparsity are considered in the algorithm.

²We extract the contraction path from NumPy Einsum: https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html
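For reference, the footnoted NumPy interface can be exercised as follows (our example, using the MTTKRP-like expression from Figure 3g with arbitrary sizes):

    import numpy as np

    X = np.random.rand(30, 30, 30, 30)
    A, B, C = (np.random.rand(30, 8) for _ in range(3))
    path, info = np.einsum_path('abcd,ak,bk,ck->dk', X, A, B, C,
                                optimize='greedy')
    print(path)   # e.g. ['einsum_path', (0, 1), (0, 1), (0, 1)]
    print(info)   # flop estimates for the chosen pairwise contraction order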
Algorithm 2: Opt contraction path w constraint
    input : Einsum Node: N, Contraction order list: L
    output: Einsum Tree: T
    n = length(L)
    T = N    ▷ Initialize tree with single Einsum node
    for i ∈ {1, . . . , n} do
        split_T = SplitEinsum(T, L[i+1:n])    ▷ Split T into an Einsum node that contracts all input nodes apart from L[i+1:n] and the subgraph induced by the remaining nodes, returning the former
        opt_contract_subtree = OptContractPath(split_T)    ▷ Unconstrained optimized contraction path
        opt_contract_subtree = Get_nearest_ancestor(opt_contract_subtree, L[i])    ▷ Get the tree whose sink is the nearest ancestor of L[i]
        T = Substitute_graph(T, opt_contract_subtree)    ▷ Return the equivalent graph of T whose inputs contain opt_contract_subtree
    return T
D. Constrained Contraction Path Construction
We provide a constrained contraction path selection routine, such that the contraction path is optimized under the constraint that the contraction order of some inputs is fixed. This routine is critical for the dimension tree construction used in the alternating minimization algorithms. Consider Equation 3 with the update sequence in each sweep starting from $\boldsymbol{\mathcal{A}}_1$ and ending at $\boldsymbol{\mathcal{A}}_N$. For the Einsum node used to update $\boldsymbol{\mathcal{A}}_i$, where $i \in \{1, \ldots, N\}$, we generate the contraction path such that it is optimized under the constraint that the contraction order of all the target sites is $\boldsymbol{\mathcal{A}}_N \prec \cdots \prec \boldsymbol{\mathcal{A}}_{i+1} \prec \boldsymbol{\mathcal{A}}_1 \prec \cdots \prec \boldsymbol{\mathcal{A}}_{i-1}$. This order ensures that the tensor updated just previously, $\boldsymbol{\mathcal{A}}_{i-1}$, affects only the last part of the contraction path, enabling the reuse of the calculations prior to it in the path as much as possible.

The constrained path selection algorithm is illustrated in Algorithm 2; it is implemented on top of the unconstrained one and uses the greedy search heuristic. We find that this heuristic works well for all the dimension tree selections in the tensor computation applications tested in Section VI. An example is shown in Figure 3h, which illustrates the dimension tree construction for the Matricized Tensor Times Khatri-Rao Product (MTTKRP) calculations of an order-3 CP decomposition. The pseudo-code is given in Algorithm 6 in the Appendix.

E. Common Subexpression Elimination (CSE)
CSE is used to remove duplicated Einsum expressions generated by the path selection above. We show one example in Figure 3f, where CSE saves one Einsum calculation. However, CSE is nontrivial for Einsum nodes, because different Einsum subscripts may represent the same computation. We show an example in Figure 4, where two Einsum nodes represent the same calculation despite different input orderings and subscripts. Hence, we translate an Einsum expression into a tensor diagram graph, and compare the graph structures of two expressions.

Moreover, two nodes in an Einsum graph may be transpositions of each other. After detecting such a condition, we replace one of the nodes with its transpose node and update its outputs' expressions accordingly. This optimization greatly reduces the computation cost when transposes of large tensors lurk in the graph.
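The two expressions from Figure 4 can be checked numerically (our example): despite different operand orders and subscripts, they compute the same tensor.

    import numpy as np

    A = np.random.rand(3, 4)
    B = np.random.rand(3, 4)
    I = np.eye(5)
    out1 = np.einsum('pb,or,ob,pr,st->srtb', B, A, A, B, I)
    out2 = np.einsum('eb,ed,fb,fd,ac->abcd', A, A, B, B, I)
    assert np.allclose(out1, out2)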
VI. BENCHMARKS
We evaluate the performance of AutoHOOT on both the Gauss-Newton method and the alternating minimization method discussed in Section II-B. The performance of the critical Gauss-Newton kernel, the Hessian-vector product, is evaluated on the CP decomposition application, where Gauss-Newton with conjugate gradient updates is commonly used to achieve high accuracy [48], [53]. The performance of the alternating minimization kernels generated by AutoHOOT is evaluated on both CP and Tucker decompositions, as well as on the DMRG algorithm in tensor network applications, used to calculate the smallest eigenvalue and eigenvector for a matrix product state.

The experiments are run on both CPUs and GPUs. On CPUs, we test the performance both on one process with the NumPy backend and on a distributed parallel system with the Cyclops backend. The results are collected on the Stampede2 supercomputer located at the University of Texas at Austin. We leverage the Knight's Landing (KNL) nodes, each of which consists of 68 cores, 96 GB of DDR RAM, and 16 GB of MCDRAM. These nodes are connected via a 100 Gb/sec fat-tree Omni-Path interconnect. We use Intel compilers and the MKL library for threaded BLAS routines for both sequential and parallel experiments. We use 16 processes per node and 16 threads per process for the Cyclops benchmark experiments. We also collected results with the TensorFlow backend on a single NVIDIA Tesla K80 GPU, which contains 4992 NVIDIA CUDA cores and has up to 8.73 teraflops single-precision performance.

The performance of the HVP kernels in the Gauss-Newton algorithm for the CP decomposition is shown in Figure 6. As can be seen, AutoHOOT achieves at least a 2X speed-up on the GPU and at least a 7X speed-up on the CPU compared to JAX for sufficiently large dimension sizes s. The speed-up also increases with the dimension size, indicating the advantage of AutoHOOT for large-scale tensor computations. In addition, AutoHOOT's performance is comparable to the manually designed algorithms of reference [48], indicating that the kernels generated by AutoHOOT reach the state-of-the-art performance boundary.

The performance of the alternating minimization kernels for both tensor decompositions and the DMRG algorithm is shown in Figure 5. For the tensor decompositions, we compare the performance of the AutoHOOT output expressions, both with and without dimension tree optimizations, to the popular tensor decomposition library Tensorly [30], with both the NumPy and TensorFlow backends, and to scikit-tensor with the NumPy backend.³

³https://github.com/mnick/scikit-tensor
Figure 5: AutoHOOT performance for kernels in the alternating minimization. (a)-(c): Results for the CP decomposition with the NumPy, TensorFlow, and Cyclops backends ((a), (b): R = s; (c): R = 400, with s scaled with the node count); the tensor order is set as N = 3 for all the experiments. (d)-(f): Results for the Tucker decomposition with the NumPy, TensorFlow, and Cyclops backends (R a fixed fraction of s in (d), (e); R = 400 in (f)); the tensor order is set as N = 3 for all the experiments. (g)-(i): Results for the DMRG experiment with the NumPy, TensorFlow, and Cyclops backends (s = R); the number of sites is set as N = 10 for the experiments with NumPy and TensorFlow, and as N = 6 for the experiments with Cyclops. For the Cyclops benchmarks, the dotted line denotes the perfect scaling curve. Each bar/dot is the average result of 10 iterations.

For the DMRG algorithm, we compare the performance to Quimb [13], which is an efficient library for tensor networks. The benchmark results for the CP decomposition with both NumPy and TensorFlow can be seen in Figures 5a and 5b. We compare the performance with different CP ranks (R) and dimension sizes (s). As can be seen, the expressions generated with the dimension tree algorithm outperform all the other implementations. Note that Tensorly's performance for the CP decomposition is worse than expected, because it slices the factor matrices over the rank mode and sums over all the MTTKRP results of the input tensor and the sliced factor matrices, which is not favorable. The weak scaling benchmark is also performed on the distributed parallel system with Cyclops, shown in Figure 5c, where we consider weak scaling with fixed input size and work per processor. The expressions generated from AutoHOOT scale well, obtaining 73% parallel scaling efficiency on 128 nodes (2048 cores).

The benchmark results for the Tucker decomposition with both NumPy and TensorFlow can be seen in Figures 5d and 5e. We compare the performance with different Tucker ranks (R) and dimension sizes (s).
Figure 6: Performance comparison among AutoHOOT, JAX, and an existing implementation for the HVP kernel in the Gauss-Newton algorithm for the CP decomposition, on (a) CPU and (b) GPU. The implementation of CPD_GN_paper comes from reference [48]. The tensor order is set as N = 3, and the CP rank is set equal to the dimension size. Each bar is the average result of 10 iterations.
Note that we are only comparing the performance of the kernel generated through AutoHOOT to the Tensor Times Matrix-chain (TTMc) implementation in other libraries, which does not contain the low-rank factorization step of splitting the factor matrix from the core tensor. The expressions generated with the dimension tree algorithm are comparable to all the other implementations. The weak scaling benchmark is shown in Figure 5f. Similar to the CP decomposition, the expressions generated from AutoHOOT scale with high efficiency.

The performance results for DMRG can be seen in Figures 5g, 5h, and 5i. We benchmark one sweep of the HVP kernels with different MPO and MPS rank sizes (R) and physical dimension sizes (s), where the Hessian denotes the local Hessian of the DMRG loss function w.r.t. each local site. In DMRG, the HVP calculations are important kernels for the sparse eigensolver: multiple HVP calculations are necessary at each site to get the local smallest eigenvalue, making them the computational bottleneck. The expressions generated with the dimension tree algorithm are comparable to the implementations in Quimb. In addition, the expressions generated from AutoHOOT scale perfectly with Cyclops up to at least 128 nodes.⁴

We also compare the performance between AutoHOOT and Quimb on full DMRG experiments. As in Quimb, we use the sparse eigensolver in SciPy [62] and set the solver parameters the same as in Quimb. The results are shown in Figure 7. We test four cases where the maximum MPS rank ranges from 10 to 40; the results show that both libraries have similar performance, while AutoHOOT has a small fixed overhead.

Figure 7: Comparison between AutoHOOT and Quimb on the full DMRG running curve (smallest eigenvalue attained versus time in seconds). The input MPO is random and symmetric, has 6 sites, its physical leg size equals 10, and its MPO rank size equals 20. We compare the performance under different largest-MPS-rank constraints (R = 10, 20, 30, 40).

Note that we did not report the ALS results of other AD libraries, because their performance is far worse than that of both AutoHOOT and other tensor computation libraries. For both CP and Tucker decompositions, existing AD libraries cannot efficiently decompose the structured inverse operations, leading to a large overhead from inverting large tensors. For the DMRG experiment, existing libraries fail to choose an optimized contraction path, and produce large intermediates which require too much memory.

⁴Note that under the constraint where the physical leg size is equal to the rank size, e.g., s = R, both the computational cost and the memory cost grow polynomially in R.

VII. CONCLUSION
AutoHOOT is the first automatic differentiation framework targeting high-order optimization for tensor computations. AutoHOOT contains a new explicit Jacobian / Hessian expression generation kernel whose outputs keep the input tensors' granularity and are easy to optimize. It also contains a new computational graph optimization module that combines traditional compiler optimization techniques with transformations based on specific tensor algebra. The optimization module generates expressions as good as manually written codes
in other frameworks for the numerical algorithms of tensor computations. AutoHOOT is compatible with other numerical computation libraries, and users can execute the generated expressions on CPU with NumPy, on GPU with TensorFlow, and on distributed parallel systems with the Cyclops Tensor Framework. Experimental results show that AutoHOOT has competitive performance on both tensor decomposition and tensor network applications compared to both existing AD software and other tensor computation libraries with manually written kernels, on both CPU and GPU architectures.
VIII. ACKNOWLEDGEMENTS
Linjian Ma and Edgar Solomonik were supported by the US NSF OAC SSI program, award No. 1931258. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. We used XSEDE to employ Stampede2 at the Texas Advanced Computing Center (TACC) through allocation TG-CCR180006.
REFERENCES

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265-283, 2016.
[2] E. Acar, D. M. Dunlavy, and T. G. Kolda. A scalable optimization approach for fitting canonical tensor decompositions. Journal of Chemometrics, 25(2):67-86, 2011.
[3] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.
[4] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773-2832, 2014.
[5] C. A. Andersson and R. Bro. Improving the speed of multi-way algorithms: Part I. Tucker3. Chemometrics and Intelligent Laboratory Systems, 42(1-2):93-103, 1998.
[6] A. A. Auer, G. Baumgartner, D. E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X. Gao, R. Harrison, S. Krishnamoorthy, S. Krishnan, et al. Automatic code generation for many-body electronic structure methods: the tensor contraction engine. Molecular Physics, 104(2):211-228, 2006.
[7] G. Ballard, N. Knight, and K. Rouse. Communication lower bounds for matricized tensor times Khatri-Rao product. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 557-567. IEEE, 2018.
[8] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research, 18(1):5595-5637, 2017.
[9] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018.
[10] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578-594, Carlsbad, CA, Oct. 2018. USENIX Association.
[11] L. Chi-Chung, P. Sadayappan, and R. Wenger. On optimizing a class of multi-dimensional loops with reduction for parallel execution. Parallel Processing Letters, 7(02):157-168, 1997.
[12] J. Choi, X. Liu, S. Smith, and T. Simon. Blocking optimization techniques for sparse tensor computation. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 568-577. IEEE, 2018.
[13] J. Gray. quimb: a Python library for quantum information and many-body calculations. Journal of Open Source Software, 3(29):819, 2018.
[14] R. A. Harshman. Foundations of the PARAFAC procedure: models and conditions for an explanatory multimodal factor analysis. 1970.
[15] A. Hartono, Q. Lu, X. Gao, S. Krishnamoorthy, M. Nooijen, G. Baumgartner, D. E. Bernholdt, V. Choppella, R. M. Pitzer, J. Ramanujam, et al. Identifying cost-effective common subexpressions to reduce operation count in tensor contraction evaluations. In International Conference on Computational Science, pages 267-275. Springer, 2006.
[16] A. Hartono, Q. Lu, T. Henretty, S. Krishnamoorthy, H. Zhang, G. Baumgartner, D. E. Bernholdt, M. Nooijen, R. Pitzer, J. Ramanujam, et al. Performance optimization of tensor contraction expressions for many-body methods in quantum chemistry. The Journal of Physical Chemistry A, 113(45):12715-12723, 2009.
[17] A. Hartono, A. Sibiryakov, M. Nooijen, G. Baumgartner, D. E. Bernholdt, S. Hirata, C.-C. Lam, R. M. Pitzer, J. Ramanujam, and P. Sadayappan. Automated operation minimization of tensor contraction expressions in electronic structure calculations. In International Conference on Computational Science, pages 155-164. Springer, 2005.
[18] S. Hirata. Tensor contraction engine: Abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories. The Journal of Physical Chemistry A, 107(46):9887-9897, 2003.
[19] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164-189, 1927.
[20] E. G. Hohenstein, R. M. Parrish, and T. J. Martinez. Tensor hypercontraction density fitting. I. Quartic scaling second- and third-order Møller-Plesset perturbation theory. The Journal of Chemical Physics, 137(4):044103, 2012.
[21] F. Hummel, T. Tsatsoulis, and A. Grüneis. Low rank factorization of the Coulomb integrals for periodic coupled cluster theory. The Journal of Chemical Physics, 146(12):124105, 2017.
[22] K. Z. Ibrahim, S. W. Williams, E. Epifanovsky, and A. I. Krylov. Analysis and tuning of libtensor framework on multicore architectures. Pages 1-10. IEEE, 2014.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675-678, 2014.
[24] Z. Jia, O. Padon, J. Thomas, T. Warszawski, M. Zaharia, and A. Aiken. TASO: optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 47-62, 2019.
[25] P. Jørgensen and J. Simons. Geometrical Derivatives of Energy Surfaces and Molecular Properties, volume 166. Springer Science & Business Media, 2012.
[26] D. Kats and F. R. Manby. Sparse tensor framework for implementation of general local correlation methods. The Journal of Chemical Physics, 138(14), 2013.
[27] R. A. Kendall, E. Aprà, D. E. Bernholdt, E. J. Bylaska, M. Dupuis, G. I. Fann, R. J. Harrison, J. Ju, J. A. Nichols, J. Nieplocha, et al. High performance computational chemistry: An overview of NWChem, a distributed parallel application. Computer Physics Communications, 128(1-2):260-283, 2000.
[28] F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amarasinghe. The tensor algebra compiler. Proceedings of the ACM on Programming Languages, 1(OOPSLA):1-29, 2017.
[29] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.
[30] J. Kossaifi, Y. Panagakis, A. Anandkumar, and M. Pantic. TensorLy: Tensor learning in Python. The Journal of Machine Learning Research, 20(1):925-930, 2019.
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[32] J. Li, J. Sun, and R. Vuduc. HiCOO: hierarchical storage of sparse tensors. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 238-252. IEEE, 2018.
[33] L. Ma and E. Solomonik. Accelerating alternating least squares for tensor decomposition by pairwise perturbation. arXiv preprint arXiv:1811.10573, 2018.
[34] D. Maclaurin, D. Duvenaud, and R. P. Adams. Autograd: Effortless gradients in NumPy.
[35] S. Manzer, E. Epifanovsky, A. I. Krylov, and M. Head-Gordon. A general sparse tensor framework for electronic structure theory. Journal of Chemical Theory and Computation, 13(3):1108-1116, 2017.
[36] I. L. Markov and Y. Shi. Simulating quantum computation by contracting tensor networks. SIAM Journal on Computing, 38(3):963-981, 2008.
[37] A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rocklin, A. Kumar, S. Ivanov, J. K. Moore, S. Singh, et al. SymPy: symbolic computing in Python. PeerJ Computer Science, 3:e103, 2017.
[38] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442-450, 2015.
[39] R. Orús. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349:117-158, 2014.
[40] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024-8035, 2019.
[41] W. Pazner and P.-O. Persson. Approximate tensor-product preconditioners for very high order discontinuous Galerkin methods. Journal of Computational Physics, 354:344-369, 2018.
[42] C. Peng, J. A. Calvin, F. Pavosevic, J. Zhang, and E. F. Valeev. Massively parallel implementation of explicitly correlated coupled-cluster singles and doubles using TiledArray framework. The Journal of Physical Chemistry A, 120(51):10231-10244, 2016.
[43] A.-H. Phan, P. Tichavský, and A. Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing, 61(19):4834-4846, 2013.
[44] S. Rajbhandari, A. Nikam, P.-W. Lai, K. Stock, S. Krishnamoorthy, and P. Sadayappan. A communication-optimal framework for contracting distributed tensors. In
SC’14: Proceedings of the International Con-ference for High Performance Computing, Networking, Storage andAnalysis , pages 375–386. IEEE, 2014.[45] C. Roberts, A. Milsted, M. Ganahl, A. Zalcman, B. Fontaine, Y. Zou,J. Hidary, G. Vidal, and S. Leichenauer. Tensornetwork: A library forphysics and machine learning. arXiv preprint arXiv:1905.01330 , 2019.[46] R. Senanayake, F. Kjolstad, C. Hong, S. Kamil, and S. Amarasinghe. Aunified iteration space transformation framework for sparse and densetensor algebra, 2019.[47] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Pa-palexakis, and C. Faloutsos. Tensor decomposition for signal process-ing and machine learning.
IEEE Transactions on Signal Processing ,65(13):3551–3582.[48] N. Singh, L. Ma, H. Yang, and E. Solomonik. Comparison of accuracyand scalability of Gauss-Newton and alternating least squares for CPdecomposition. arXiv preprint arXiv:1910.12331 , 2019.[49] D. Smith and J. Gray. opt einsum - a Python package for optimizingcontraction order for einsum-like expressions.
Journal of Open SourceSoftware , 3(26):753, 2018.[50] S. Smith and G. Karypis. Tensor-matrix products with a compressedsparse tensor. In
Proceedings of the 5th Workshop on IrregularApplications: Architectures and Algorithms , IA3 ’15, New York, NY,USA, 2015. Association for Computing Machinery.[51] E. Solomonik and J. Demmel. Fast bilinear algorithms for symmetrictensor contractions.
Computational Methods in Applied Mathematics ,2020.[52] E. Solomonik, D. Matthews, J. R. Hammond, J. F. Stanton, and J. Dem-mel. A massively parallel tensor contraction framework for coupled-cluster computations.
Journal of Parallel and Distributed Computing ,74(12):3176–3190, 2014.[53] L. Sorber, M. Van Barel, and L. De Lathauwer. Optimization-based al-gorithms for tensor decompositions: Canonical polyadic decomposition,decomposition in rank-(l r,l r,1) terms, and a new generalization.
SIAMJournal on Optimization , 23(2):695–720, 2013.[54] P. Springer, T. Su, and P. Bientinesi. HPTT: A High-Performance TensorTransposition C++ Library. In
Proceedings of the 4th ACM SIGPLANInternational Workshop on Libraries, Languages, and Compilers forArray Programming , ARRAY 2017, pages 56–62, New York, NY, USA,2017. ACM.[55] X. Team et al. XLA-TensorFlow compiled. post in the Google devel-opers blog, 2017.[56] P. Tichavsk`y, A. H. Phan, and A. Cichocki. A further improvement ofa fast damped Gauss-Newton algorithm for CANDECOMP-PARAFACtensor decomposition. In , pages 5964–5968. IEEE, 2013.[57] S. Tokui, R. Okuta, T. Akiba, Y. Niitani, T. Ogawa, S. Saito, S. Suzuki,K. Uenishi, B. Vogel, and H. Yamazaki Vincent. Chainer: A deep learning framework for accelerating the research cycle. In
Proceedingsof the 25th ACM SIGKDD International Conference on KnowledgeDiscovery & Data Mining , pages 2002–2011, 2019.[58] L. R. Tucker. Some mathematical notes on three-mode factor analysis.
Psychometrika , 31(3):279–311, 1966.[59] B. van Merrienboer, D. Moldovan, and A. Wiltschko. Tangent: Auto-matic differentiation using source-code transformation for dynamicallytyped array programming. In
Advances in Neural Information Process-ing Systems , pages 6256–6265, 2018.[60] N. Vannieuwenhoven, K. Meerbergen, and R. Vandebril. Computingthe gradient in optimization algorithms for the CP decomposition inconstant memory through tensor blocking.
SIAM Journal on ScientificComputing , 37(3):C415–C438, 2015.[61] F. Verstraete, J. J. Garcia-Ripoll, and J. I. Cirac. Matrix product den-sity operators: simulation of finite-temperature and dissipative systems.
Physical review letters , 93(20):207204, 2004.[62] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy,D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J.van der Walt, M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov,A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. Carey, ˙I. Polat, Y. Feng,E. W. Moore, J. Vand erPlas, D. Laxalde, J. Perktold, R. Cimrman,I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H.Ribeiro, F. Pedregosa, P. van Mulbregt, and Contributors. SciPy 1.0:Fundamental Algorithms for Scientific Computing in Python.
NatureMethods , 17:261–272, 2020.[63] S. F. Walter and L. Lehmann. Algorithmic differentiation in Python withAlgoPy.
Journal of Computational Science , 4(5):334–344, 2013.[64] Z. Zhang, X. Wu, N. Zhang, S. Zhang, and E. Solomonik. Enablingdistributed-memory tensor completion in Python using new sparse tensorkernels. arXiv preprint arXiv:1910.02371 , 2019. PPENDIX
A. Background of Tensor Computation Applications
In this section, we provide background on CP decomposition, Tucker decomposition, and the Density Matrix Renormalization Group (DMRG).
CP decomposition.
The CP tensor decomposition [19], [14] serves to approximate a tensor by a sum of $R$ tensor products of vectors. For an order-$N$ input tensor $\mathcal{X}$ with size $s_1 \times \cdots \times s_N$, CP decomposition compresses it into $N$ factor matrices $A_1, \ldots, A_N$, where $A_i$ has size $s_i \times R$ for $i \in \{1, \ldots, N\}$. The optimization for the CP decomposition is a least squares problem, where the element-wise expression for the output of the tensor network $f$ in Equation 1 is
$$f(A_1, \ldots, A_N)(x_1, \ldots, x_N) := \sum_{k=1}^{R} \prod_{i \in \{1, \ldots, N\}} A_i(x_i, k).$$
Both the Gauss-Newton method and the alternating minimization method, also called alternating least squares (ALS), are effective and popular for the CP decomposition. The CP-ALS method alternates among quadratic optimization problems for each of the factor matrices $A_n$; the resulting linear least squares problems for each row are often solved via the normal equations [29],
$$A_n \Gamma_n \leftarrow X_{(n)} P_n,$$
where the matrix $P_n \in \mathbb{R}^{I_n \times R}$, with $I_n = \prod_{i=1, i \neq n}^{N} s_i$, is formed by Khatri-Rao products of the other factor matrices,
$$P_n = A_1 \odot \cdots \odot A_{n-1} \odot A_{n+1} \odot \cdots \odot A_N,$$
and $\Gamma_n \in \mathbb{R}^{R \times R}$ can be computed via $\Gamma_n = S_1 * \cdots * S_{n-1} * S_{n+1} * \cdots * S_N$, with each $S_i = A_i^T A_i$.
The Matricized Tensor Times Khatri-Rao Product, or MTTKRP, computation $M_n = X_{(n)} P_n$ is the main computational bottleneck of CP-ALS [7]. Within MTTKRP, the bottleneck is the contraction between the input tensor and the first-contracted matrix. For a rank-$R$ CP decomposition, this computation has a leading cost of $2s^N R$ if $s_n = s$ for all $n \in \{1, \ldots, N\}$. The dimension-tree algorithm for ALS [43], [60] uses a fixed amortization scheme to update the MTTKRP in each ALS sweep. This scheme needs to perform only two first-contraction calculations per ALS sweep, decreasing the leading-order cost of a sweep from $2Ns^N R$ to $4s^N R$.
Another alternative is to solve the least squares problem via the Gauss-Newton method. Although directly inverting the Hessian matrix for the problem is expensive (it costs $O(N^3 s^3 R^3)$ if each dimension has size $s$), the matrices involved in the linear system are sparse and have much implicit structure. The cost of direct Hessian inversion can be reduced to $O(N^3 R^6)$ [56], and the implicit conjugate gradient method costs $O(N^2 sR^2)$ per iteration [53]. Additionally, it has been shown that higher decomposition accuracy can be reached with Gauss-Newton than with ALS [48]. We refer readers to the references for details of the Gauss-Newton implementations for CP decomposition [48], [53].
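To make the sweep structure above concrete, the following is a minimal NumPy sketch of one CP-ALS sweep for an order-3 tensor, solving the normal equations $A_n \Gamma_n = M_n$ with each MTTKRP written as an einsum. The function and variable names are ours for illustration, not AutoHOOT's API.

```python
import numpy as np

def cp_als_sweep(X, A, B, C):
    """One ALS sweep for a rank-R CP decomposition of an order-3 tensor X.

    Each update solves the normal equations A_n Gamma_n = M_n, where M_n is
    the MTTKRP for mode n and Gamma_n is a Hadamard product of Gram matrices.
    """
    # Mode 1: M_1 = X_(1) (B odot C), Gamma_1 = (B^T B) * (C^T C)
    M = np.einsum('ijk,jr,kr->ir', X, B, C)   # MTTKRP, the sweep's bottleneck
    A = np.linalg.solve(((B.T @ B) * (C.T @ C)).T, M.T).T
    # Mode 2
    M = np.einsum('ijk,ir,kr->jr', X, A, C)
    B = np.linalg.solve(((A.T @ A) * (C.T @ C)).T, M.T).T
    # Mode 3
    M = np.einsum('ijk,ir,jr->kr', X, A, B)
    C = np.linalg.solve(((A.T @ A) * (B.T @ B)).T, M.T).T
    return A, B, C
```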
Tucker decomposition.
Tucker decomposition [58] approximates a tensor by a core tensor contracted with orthogonal matrices along each mode. For an order-$N$ input tensor $\mathcal{X}$ with size $s_1 \times \cdots \times s_N$, Tucker decomposition compresses it into $N$ matrices with orthogonal columns, $A_1, \ldots, A_N$, where $A_i$ has size $s_i \times R_i$ for $i \in \{1, \ldots, N\}$, and a core tensor $\mathcal{G}$ of size $R_1 \times \cdots \times R_N$. Similar to CP decomposition, the optimization for the Tucker decomposition is a least squares problem, where the element-wise expression for the output of the tensor network $f$ in Equation 1 is
$$f(\mathcal{G}, A_1, \ldots, A_N)(x_1, \ldots, x_N) = \sum_{\{z_1, \ldots, z_N\}} \mathcal{G}(z_1, \ldots, z_N) \prod_{r \in \{1, \ldots, N\}} A_r(x_r, z_r).$$
The ALS method for Tucker decomposition [5], [29], also called the higher-order orthogonal iteration (HOOI), proceeds by fixing all except one factor matrix and computing a low-rank matrix factorization of the
Tensor Times Matrix-chain (TTMc)
$\mathcal{Y}_n$ for $n \in \{1, \ldots, N\}$, to update that factor matrix and the core tensor. $\mathcal{Y}_n$ is expressed as
$$\mathcal{Y}_n(z_1, \ldots, z_{n-1}, x_n, z_{n+1}, \ldots, z_N) = \sum_{\{x_1, \ldots, x_{n-1}, x_{n+1}, \ldots, x_N\}} \mathcal{X}(x_1, \ldots, x_N) \prod_{r \in \{1, \ldots, N\}, r \neq n} A_r(x_r, z_r).$$
Then $\mathcal{Y}_n$ is factored into a product of an orthogonal matrix $A_n$ and the core tensor $\mathcal{G}$, so that $Y_{n,(n)} \approx A_n G_{(n)}$. This factorization can be done by taking $A_n$ to be the $R_n$ leading left singular vectors of $Y_{n,(n)}$. TTMc is the computational bottleneck of Tucker-ALS. With the use of the same dimension trees as in CP-ALS, the computational complexity of a sweep of TTMc has leading order $O(4s^N R)$.
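A minimal NumPy sketch of one HOOI factor update for an order-3 tensor, forming the TTMc as an einsum and truncating its SVD. The explicit per-mode branches and names below are illustrative only, not AutoHOOT's generated code.

```python
import numpy as np

def hooi_update(X, factors, n, rank):
    """One HOOI update for an order-3 tensor X: form the TTMc Y_n with an
    einsum, then take the leading left singular vectors of its unfolding."""
    A1, A2, A3 = factors
    if n == 0:
        Y = np.einsum('ijk,jb,kc->ibc', X, A2, A3)
    elif n == 1:
        Y = np.einsum('ijk,ia,kc->ajc', X, A1, A3)
    else:
        Y = np.einsum('ijk,ia,jb->abk', X, A1, A2)
    Yn = np.moveaxis(Y, n, 0)              # mode-n unfolding Y_{n,(n)}
    Yn = Yn.reshape(Yn.shape[0], -1)
    U, _, _ = np.linalg.svd(Yn, full_matrices=False)
    return U[:, :rank]                     # R_n leading left singular vectors
```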
Density Matrix Renormalization Group (DMRG).
DMRG calculates the smallest eigenvalue of a Matrix Product Operator (MPO), with the corresponding eigenvector represented by a Matrix Product State (MPS). MPS, also called the tensor train, represents a high dimensional tensor as a linear tensor network. For an order-$N$ input tensor $\mathcal{V}$ with size $s_1 \times \cdots \times s_N$, the MPS decomposition is expressed as
$$\mathcal{V}(x_1, \ldots, x_N) = \sum_{\alpha_1, \ldots, \alpha_{N-1}} \prod_{i=1}^{N} \mathcal{A}_i(\alpha_{i-1}, x_i, \alpha_i),$$
where $\mathcal{A}_i \in \mathbb{R}^{R_{i-1} \times s_i \times R_i}$ and $R_0 = R_N = 1$. The MPO has a similar linear structure, where each site is a 4-dimensional tensor,
$$\mathcal{W}(x_1, \ldots, x_N, y_1, \ldots, y_N) = \sum_{\alpha_1, \ldots, \alpha_{N-1}} \prod_{i=1}^{N} \mathcal{A}_i(\alpha_{i-1}, x_i, y_i, \alpha_i),$$
and for each $i \in \{1, \ldots, N\}$, the $i$th and $(i+N)$th modes of $\mathcal{W}$ have the same size. The objective of DMRG is expressed as
$$\min_{\mathcal{V}} \; \psi(\mathcal{V}) := \frac{v_{(1:N)}^T W_{(1:N)} v_{(1:N)}}{\|v_{(1:N)}\|^2},$$
where we optimize $\mathcal{V}$ under the constraint that it has the MPS structure. DMRG optimizes this objective via alternating minimization, where each local step achieves the minimum of the objective w.r.t. one or two neighboring sites, and sweeps of the local steps are performed until the results converge. We refer readers to the tensornetwork website for algorithm details (https://tensornetwork.org/mps/algorithms/dmrg/).
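For illustration only, the NumPy sketch below evaluates the DMRG objective $\psi$ by fully materializing the matricized MPS vector $v_{(1:N)}$ and MPO matrix $W_{(1:N)}$. This has exponential cost and is meant purely as a correctness reference for tiny $N$; all names are our own.

```python
import numpy as np

def dmrg_objective(mps_cores, mpo_cores):
    """Evaluate psi(V) = v^T W v / ||v||^2 by materializing v_(1:N) and
    W_(1:N) in full. Exponential cost: a reference for tiny N only."""
    # Contract the MPS cores into the full vector v_(1:N).
    v = mps_cores[0]                           # shape (1, s_1, R_1)
    for core in mps_cores[1:]:
        v = np.tensordot(v, core, axes=([-1], [0]))
    v = v.reshape(-1)                          # boundary bond dims are 1
    # Contract the MPO cores; axes come out ordered x_1, y_1, ..., x_N, y_N.
    W = mpo_cores[0]                           # shape (1, s_1, s_1, R_1)
    for core in mpo_cores[1:]:
        W = np.tensordot(W, core, axes=([-1], [0]))
    W = W[0, ..., 0]                           # drop the boundary bond axes
    N = len(mpo_cores)
    perm = list(range(0, 2 * N, 2)) + list(range(1, 2 * N, 2))
    W = np.transpose(W, perm).reshape(v.size, v.size)
    return (v @ W @ v) / (v @ v)
```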
B. Proofs for Structured Inverse Node Decomposition
In our program, for an implicit tensor constructed from several input tensors and an Einsum expression, our optimization algorithm finds the form of its decomposed tensors that obey the rules in Theorem A.4, thus enabling the inverse to be computed through the decomposed parts. First, we define the terms decomposable tensor, tensor inverse, and identity tensor as follows:
Definition A.1. A tensor $\mathcal{T} \in \mathbb{R}^{s_1 \times \cdots \times s_N}$ is decomposable if it can be written as the outer product of two smaller tensors,
$$\mathcal{T}(x_1, \ldots, x_N) = \mathcal{A}(y_1, \ldots, y_M)\,\mathcal{B}(z_1, \ldots, z_K),$$
where $\{y_1, \ldots, y_M\} \cup \{z_1, \ldots, z_K\} = \{x_1, \ldots, x_N\}$ and $\{y_1, \ldots, y_M\} \cap \{z_1, \ldots, z_K\} = \emptyset$.
Definition A.2. For an even-order tensor $\mathcal{T} \in \mathbb{R}^{s_1 \times \cdots \times s_{2N}}$, let $R_1 = \prod_{i=1}^{N} s_i$ and $R_2 = \prod_{i=N+1}^{2N} s_i$. If $R_1 = R_2$, the tensor inverse $\mathcal{T}^{-1}$ is defined as the inverse of the matricized tensor $T$, where $T \in \mathbb{R}^{R_1 \times R_2}$.
Definition A.3. We use $\mathcal{I}_T$ to denote a tensor that has the same shape as $\mathcal{T}$ and whose matricization $I_T$ is an identity matrix.
Using the definitions above, we show that if a tensor meets the requirements of Theorem A.4, we can transfer the tensor inverse into the inverses of its decomposed parts.
Theorem A.4. For an even-order tensor $\mathcal{T} \in \mathbb{R}^{s_1 \times \cdots \times s_{2N}}$, if it can be decomposed into two tensors $\mathcal{A}$ and $\mathcal{B}$ as
$$\mathcal{T}(x_1, \ldots, x_{2N}) = \mathcal{A}(y_1, \ldots, y_{2M})\,\mathcal{B}(z_1, \ldots, z_{2K}),$$
the indices satisfy the following requirements:
$$\{y_1, \ldots, y_{2M}\} \cup \{z_1, \ldots, z_{2K}\} = \{x_1, \ldots, x_{2N}\}, \quad \{y_1, \ldots, y_{2M}\} \cap \{z_1, \ldots, z_{2K}\} = \emptyset,$$
$$y_{i+M} = x_{j+N} \text{ if } y_i = x_j \text{ for all } i \in \{1, \ldots, M\}, \qquad z_{i+K} = x_{j+N} \text{ if } z_i = x_j \text{ for all } i \in \{1, \ldots, K\},$$
and $\mathcal{A}$ and $\mathcal{B}$ are both invertible, then
$$\mathcal{T}^{-1}(x_1, \ldots, x_{2N}) = \mathcal{A}^{-1}(y_1, \ldots, y_{2M})\,\mathcal{B}^{-1}(z_1, \ldots, z_{2K}).$$
Proof.
Let tensor $\mathcal{C}$ be the outer product of the tensors $\mathcal{A}^{-1}$ and $\mathcal{B}^{-1}$, based on the following element-wise expression:
$$\mathcal{C}(x_1, \ldots, x_{2N}) = \mathcal{A}^{-1}(y_1, \ldots, y_{2M})\,\mathcal{B}^{-1}(z_1, \ldots, z_{2K}).$$
We can rewrite the equation above with different index notations:
$$\mathcal{C}(x_{N+1}, \ldots, x_{2N}, a_1, \ldots, a_N) = \mathcal{A}^{-1}(y_{M+1}, \ldots, y_{2M}, y_{2M+1}, \ldots, y_{3M})\,\mathcal{B}^{-1}(z_{K+1}, \ldots, z_{2K}, z_{2K+1}, \ldots, z_{3K}),$$
where
$$\{y_{2M+1}, \ldots, y_{3M}\} \cup \{z_{2K+1}, \ldots, z_{3K}\} = \{a_1, \ldots, a_N\}, \quad \{y_{2M+1}, \ldots, y_{3M}\} \cap \{z_{2K+1}, \ldots, z_{3K}\} = \emptyset,$$
and $y_{i+2M} = a_j$ if $y_i = x_j$ for all $i \in \{1, \ldots, M\}$, and $z_{i+2K} = a_j$ if $z_i = x_j$ for all $i \in \{1, \ldots, K\}$. We denote the matrix multiplication of the matricized $\mathcal{T}$ and $\mathcal{C}$ as $\mathcal{Z}$, which contracts over the shared indices $x_{N+1}, \ldots, x_{2N}$:
$$\mathcal{Z}(x_1, \ldots, x_N, a_1, \ldots, a_N) = \sum_{x_{N+1}, \ldots, x_{2N}} \mathcal{T}(x_1, \ldots, x_{2N})\,\mathcal{C}(x_{N+1}, \ldots, x_{2N}, a_1, \ldots, a_N)$$
$$= \sum_{x_{N+1}, \ldots, x_{2N}} \mathcal{A}(y_1, \ldots, y_{2M})\,\mathcal{B}(z_1, \ldots, z_{2K})\,\mathcal{A}^{-1}(y_{M+1}, \ldots, y_{2M}, y_{2M+1}, \ldots, y_{3M})\,\mathcal{B}^{-1}(z_{K+1}, \ldots, z_{2K}, z_{2K+1}, \ldots, z_{3K})$$
$$= \mathcal{I}_A(y_1, \ldots, y_M, y_{2M+1}, \ldots, y_{3M})\,\mathcal{I}_B(z_1, \ldots, z_K, z_{2K+1}, \ldots, z_{3K}) = \mathcal{I}_T(x_1, \ldots, x_N, a_1, \ldots, a_N).$$
Therefore, the theorem is proved.
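Theorem A.4 can be checked numerically for the order-4 case, where the index-pairing conditions make the matricized tensor a Kronecker product. The shapes below are arbitrary, and the random matrices are assumed invertible (true with probability 1).

```python
import numpy as np

rng = np.random.default_rng(0)
# Order-4 tensor T(x1, x2, x3, x4) = A(x1, x3) * B(x2, x4): A pairs (x1, x3)
# and B pairs (x2, x4), satisfying the index conditions of Theorem A.4, so
# the matricized T (rows x1 x2, columns x3 x4) is the Kronecker product of
# A and B.
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))
T = np.einsum('ik,jl->ijkl', A, B)
# Theorem A.4: the inverse decomposes into the inverses of the parts.
T_inv = np.einsum('ik,jl->ijkl', np.linalg.inv(A), np.linalg.inv(B))
assert np.allclose(T.reshape(12, 12) @ T_inv.reshape(12, 12), np.eye(12))
```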
C. Detailed Optimization Algorithms
In this section, we provide a detailed explanation of the union-find data structure for Einsum. In addition, we provide detailed pseudocode for the distribution, Einsum fusion, and dimension tree generation kernels discussed in Section V.
Union-find for Einsum.
The union-find (UF) representation for Einsum is a key ingredient of the optimization kernels. In the UF graph, each node represents one dimension of a specific tensor in the Einsum graph, and each edge represents a connection between two dimensions in an Einsum expression, where a connection is denoted by two dimensions sharing the same character in an Einsum string. Downstream tasks can leverage this canonical representation of the Einsum graph for analysis. The graph-building procedure is illustrated in Algorithm 3. We use Einsum subscript to denote the Einsum expression of an Einsum node, for example the string 'ij,jk->ik'.
Algorithm 3: BuildUF
input : Einsum Graph: G
output: Union-find data structure: UF
Initialize a union-find data structure UF
Initialize a map from Einsum character to tensor dimension DM
for all Einsum nodes N in G do
    for all characters C1 in N.einsum_subscript do
        for all characters C2 in N.einsum_subscript do
            if C1 == C2 then
                UF.connect(DM[C1], DM[C2])
return UF
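A Python sketch of Algorithm 3, assuming each Einsum node exposes a subscript string and a list of input tensors, and interpreting the character map DM as local to each Einsum expression (characters carry no meaning across different subscripts). The class and attribute names are ours, not AutoHOOT's.

```python
from dataclasses import dataclass, field

@dataclass
class EinsumNode:
    subscript: str                  # e.g. 'ij,jk->ik'
    inputs: list = field(default_factory=list)

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:          # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def connect(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def build_uf(einsum_nodes):
    """Algorithm 3: connect tensor dimensions that share an Einsum character."""
    uf = UnionFind()
    for node in einsum_nodes:
        in_subs, out_sub = node.subscript.split('->')
        operands = in_subs.split(',') + [out_sub]
        tensors = list(node.inputs) + [node]
        dm = {}                     # DM: character -> first dimension seen
        for tensor, subs in zip(tensors, operands):
            for axis, char in enumerate(subs):
                dim = (id(tensor), axis)       # one UF node per dimension
                if char in dm:
                    uf.connect(dm[char], dim)  # same character, same class
                else:
                    dm[char] = dim
    return uf

# The two 'j' dimensions of x and y end up in the same disjoint set:
x, y = object(), object()
uf = build_uf([EinsumNode('ij,jk->ik', [x, y])])
assert uf.find((id(x), 1)) == uf.find((id(y), 0))
```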
Algorithm 4: Distribution
input : Graph: G
output: Distributed Graph: DG
DG = G
while True do
    for all DistributeOp nodes {Ops} in DG do
        if all Einsum nodes are topologically ahead of {Ops} then
            return DG
        for Op ∈ {Ops} do
            DG = Distribute(Op, DG)   ▷ Distribute rewrites Einsum((a+b), c) → Einsum(a, c) + Einsum(b, c), where + is the Op
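The rewrite performed by Distribute can be verified numerically; a short NumPy check that distributing an einsum over an addition preserves the result:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal((2, 4, 5))
c = rng.standard_normal((5, 3))
lhs = np.einsum('ij,jk->ik', a + b, c)                             # before
rhs = np.einsum('ij,jk->ik', a, c) + np.einsum('ij,jk->ik', b, c)  # after
assert np.allclose(lhs, rhs)
```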
Algorithm 5: Einsum fusion
input : Einsum Tree: T
output: Fused Einsum Node: FN
LT = Linearize(T)
UF = BuildUF(LT)
UF.Assign()   ▷ Assign each disjoint subset a unique character.
Init FN (sink: T.root, source: T.leaves)
FN.generateSubscript()   ▷ Generate FN.subscript based on the input nodes' assigned characters.
FN = Declone(FN)
return FN
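The effect of Einsum fusion can be illustrated directly in NumPy: a two-level einsum tree collapses into a single einsum node once each connected dimension class receives one character. The subscripts below are a hand-worked example, not AutoHOOT output.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 5))
Y = rng.standard_normal((5, 6))
Z = rng.standard_normal((6, 7))
# The nested tree T: an einsum whose input is itself an einsum.
nested = np.einsum('ij,jk->ik', np.einsum('ab,bc->ac', X, Y), Z)
# The fused node FN: each chain of connected dimensions gets one character.
fused = np.einsum('ab,bc,cd->ad', X, Y, Z)
assert np.allclose(nested, fused)
```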
Algorithm 6: Dimension tree construction
input : Einsum node list: NL, Input node list: IL
output: Updated Einsum node list: UL
n = length(NL)
UL = NL
for i ∈ {1, . . . , n} do
    contract_order_list = [IL[n], . . . , IL[i+1], IL[1], . . . , IL[i-1]]
    contract_order_list = the sublist of contract_order_list whose elements are all in UL[i].inputs
    UL[i] = Opt_contraction_path_w_constraint(UL[i], contract_order_list)
return UL
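A NumPy sketch of the contraction reuse that a dimension tree encodes, for an order-3 tensor: the partial contraction with one factor is shared by two MTTKRPs, so only two contractions touch the full tensor $\mathcal{X}$ per sweep. For simplicity this evaluates all three MTTKRPs at fixed factors, whereas ALS interleaves the factor updates; the names are illustrative.

```python
import numpy as np

def mttkrps_with_dimension_tree(X, A, B, C):
    """All three MTTKRPs of an order-3 tensor using two first contractions.

    A naive sweep contracts X once per mode; here the partial contraction
    XC is reused for modes 1 and 2, so only XC and XA touch the full tensor.
    """
    XC = np.einsum('ijk,kr->ijr', X, C)    # first contraction, reused twice
    M1 = np.einsum('ijr,jr->ir', XC, B)    # MTTKRP for mode 1
    M2 = np.einsum('ijr,ir->jr', XC, A)    # MTTKRP for mode 2
    XA = np.einsum('ijk,ir->jkr', X, A)    # second first contraction
    M3 = np.einsum('jkr,jr->kr', XA, B)    # MTTKRP for mode 3
    return M1, M2, M3
```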
D. Additional Benchmark Results
We present additional benchmark results for the kernels in the alternating minimization problems in Figure 8. For both CP and Tucker decompositions, we fix the tensor size in each mode and the decomposition rank, and compare the performance of AutoHOOT against other libraries for different input tensor orders. As shown, the expressions generated with the dimension tree algorithm outperform all the other implementations. For DMRG, we fix the physical dimension sizes and the MPO/MPS ranks, and compare the performance for different numbers of sites. As shown, AutoHOOT and Quimb have comparable performance.
[Figure 8: Time for one ALS sweep (s) versus tensor order for CP and Tucker, and time for one sweep of HVP (s) versus number of sites for DMRG, comparing AutoHOOT_DT, AutoHOOT, Tensorly, scikit-tensor, and Quimb. (a) NumPy, CPD, s = R = 30; (b) NumPy, Tucker, s = 30, R = 10; (c) NumPy, DMRG, s = R = 40; (d) TensorFlow, CPD, s = R = 30; (e) TensorFlow, Tucker, s = 30, R = 10; (f) TensorFlow, DMRG, s = R = 40.]