Propagating large open quantum systems towards their steady states: cluster implementation of the time-evolving block decimation scheme
Valentin Volokitin, Ihor Vakulchyk, Evgeny Kozinov, Alexey Liniov, Iosif Meyerov, Michail Ivanchenko, Tatyana Laptyeva, Sergey Denisov
PPropagating large open quantum systems towards their steady states: clusterimplementation of the time-evolving block decimation scheme.
Valentin Volokitin , Ihor Vakulchyk , , Evgeny Kozinov , Alexey Liniov ,Iosif Meyerov , Michail Ivanchenko , Tatyana Laptyeva , Sergey Denisov Lobachevsky State University of Nizhny Novgorod, Russia Institute for Basic Science, Daejeon, Korea Korea University of Science and Technology, Daejeon, Korea Oslo Metropolitan University, N-0130 Oslo, Norway
Many-body quantum systems are subjected to the Curse of Dimensionality: The dimension of theHilbert space H , where these systems live in, grows exponentially with number of their components(’bodies’). However, with some systems it is possible to escape the curse by using low-rank tensorapproximations known as “matrix-product state/operator (MPS/O) representation” in the quantumcommunity and “tensor-train decomposition” among applied mathematicians. Motivated by recentadvances in computational quantum physics, we consider chains of N spins coupled by nearest-neighbor interactions. The spins are subjected to an action coming from the environment. Spatiallydisordered interaction and environment-induced decoherence drive systems into non-trivial asymp-totic states. The dissipative evolution is modeled with a Markovian master equation in the Lindbladform. By implementing the MPO technique and propagating system states with the time-evolvingblock decimation (TEBD) scheme (which allows keeping the length of the state descriptions fixed),it is in principle possible to reach the corresponding steady states. We propose and realize a clusterimplementation of this idea. The implementation on four nodes allowed us to resolve steady states ofthe model systems with N = 128 spins (total dimension of the Hilbert space dim H = 2 ≈ ). a r X i v : . [ c ond - m a t . s t a t - m ec h ] M a y I. INTRODUCTION
Many-body systems are at the focus of the current research in theoretical and experimental quantum physics.In addition to their fundamental importance for quantum thermodynamics and information [1], these systems areperspective from the technological point of view; e.g., all manufactured (by now) quantum computers are based onarrays of interacting superconducting qubits [2]).All real-life quantum systems are open, meaning that they interact – to a different extent – with their environments[3]. This ’action from outside’, termed “decoherence“ or ”dissipation“, works together with the unitary evolutionstemming from system’s Hamiltonians and, on large time scales, these joint efforts result in the creation of anasymptotic stationary (steady) state. The evolution of an open quantum system towards its steady states is usuallymodeled with a Markovian master equation, which describes the dynamics of the system density operator (cid:37) ( t ),˙ (cid:37) ( t ) = L (cid:37) ( t ) [3]. Formally, similar to the Schroedinger equation used to describe unitary evolution of an isolatedquantum system, this is a linear differential equation which can be solved numerically, e. g., by diagonalizing generatorof evolution L .However, computational studies of many-body quantum systems are limited by the so-called Course of Dimensional-ity: the total length L of description (number of parameters required to specify a state) of an isolated quantum systemconsisting of N components (spins, qubits, ions, etc.), each one with d degrees of freedom, scales as L ( N ) ∼ d N . Tospecify an arbitrary state of a system of 50 qubits one needs 2 ≈ complex-valued parameters. This exceeds thememory capacity of the supercomputer “Titan”[4]. In the case of open quantum systems, the complexity squares: todescribe a density operator one needs L ( N ) ∼ d N real-valued parameters.This is a famous problem in modern data science – manipulations (or even simply storing) with data tensorsbecomes impossible when the data are sorted in high-dimensional spaces. The attempts to break the curse led tothe development of a variety of low-rank tensor approximation algorithms [5]. These algorithms are used now insignal processing, computer vision, data mining, and neuroscience [6]. The most robust algorithms are based onSingular Value Decomposition (SVD), and one particularly efficient for multilinear algebra manipulations is the so-called Tensor-Train (TT) decomposition [7]. In physical literature, it is commonly referred to as Matrix Product State(MPS) [or Matrix Product Operator (MPO)] representation [8, 9]. While these two names are used simultaneously(though in different fields), the underlying mathematical structure is the same [10]. The MPS/MPO/TT approachallows to reduce descriptions of some many-body states to a linear scaling L ( N ) ∼ N [7].The MPS/MPO representation allows for effective propagation of quantum many-body systems in time by usingthe so-called Time-Evolving Block Decimation (TEBD) scheme [11]. In short, this is a procedure to reduce thedescription of the state, obtained after every propagation step, to a given fixed length L cut . The accuracy of thepropagation is controllable through L cut : If the information is thrown out after the restriction is substantial, theused TEBD propagation is bad and leads to a wrong description. Otherwise, it is good. Some many-body systems’behave’ well during the TEBD propagation and so the amount of the neglected information is tolerable (we are notgoing to discuss physical properties underlying such a ’good behavior’ and refer the reader to an extensive literatureon the subject; see. e.g., Ref. [9]). Important is that the MPO/TT-TEBD scheme can be used to propagate opensystems [12] and thus get in touch with the corresponding steady states [13, 14]. It is crucial therefore to estimatecomputational resources needed for the realization of this program. Here we report the results of our studies in thesedirections. II. THE ALGORITHMA. Tensor-Train Decomposition
Here we mainly follow works [7] and [8]; for more details, we refer the interested reader to them.We start with a N -dimensional complex-valued tensor A i ,i ...i N with i k = 1 , , . . . M . By gluing together indices i , i . . . i N we obtain a M × M N − matrix to which we apply then SVD (henceforth we use notation without Hermitianconjugation for the last matrix in the decomposition) A [ i ; i . . . i N ] = (cid:88) α ,α U ( i ; α )Λ ( α ; α ) V ( α ; i . . . i N ) , (1)where U and V are unitary matrices and Λ ≥ λ j . We assumethat singular values are sorted in descending order, λ ≥ λ ≥ . . . λ r . Using diagonal structure of Λ,Λ α ,α = λ α δ α ,α ,where λ j are singular values, and reshaping U as a 1 × d tensor indexed by i , we get A [ i , i . . . i N ] = (cid:88) α Γ [1] i α ,α λ [1] α V ( α ; i . . . i N ) . (2)Repeating the same “reshape-SVD-reshape” procedure for V and continue further iteratively, we arrive at the TTrepresentation, A [ i , i . . . i N ] = (cid:88) α ...α N − Γ [1] i α ,α λ [1] α Γ [2] i α ,α λ [2] α . . . Γ [ N − i N − α N − ,α N − λ [ N − α N − Γ [ N ] i N α N − ,α N . (3)One may interpret this structure as a “train” (see Fig. 1a) of Γ’s that encode local structure in each dimension,and λ ’s that quantify correlations between them. Each Γ [ k ] is an array of M matrices r k − × r k with restrictions r j ≤ M max ( r j − , r j +1 ) with boundary conditions r = r N = 1. Thus, the dimensions of the matrices are 1 × M, M × M , M × M . . . M × M, M ×
1, which corresponds to the full representation with M N complex parameters. WhenSVD is performed, one can keep only certain singular values based on the approximation criterion. One possibility is todiscard all values smaller than a fixed number. An alternative approach is to introduce a so-called bond dimension R , a cut-off value such that on each bound i only singular values λ [ i ] j , j ≤ R , are kept and the rest are truncated. Weuse the latter option. Each local approximation procedure on the set of singular value { λ [ i ] } introduces a truncationerror E i ( R ) = (cid:88) j>R (cid:16) λ [ i ] j (cid:17) . (4)One of the main advantages of the TT representation is the simplicity of local convolutions with other tensorstructures. Consider an operation T j acting in j -th dimension only, A (cid:48) [ i . . . i j . . . i N ] = (cid:88) i (cid:48) j ,i j T j [ i j , i (cid:48) j ] A (cid:2) i . . . i (cid:48) j . . . i N (cid:3) . (5)After substitution in Eq. (3), one could see that this convolution only affects the corresponding Γ tensor,Γ (cid:48) [ j ] i j = (cid:88) i (cid:48) j T i j ,i (cid:48) j Γ [ j ] i (cid:48) j . (6) FIG. 1. (a) Tensor train (matrix product state) de-composition and (b) Suzuki-Trotter propagation (seetext for more details).
Operations involving multiple dimensions (indices), especially distant, destroy TT structure, so additional proce-dures are required to restore it. Generally, it would imply effectively the same procedure as initial decomposition.However, in this paper, we only use convolutions involving two neighboring indices. Thus, only one local reorthogo-nalization is required.Consider an operation T j,j +1 : A (cid:48) [ i . . . i j i j +1 . . . i N ] = (cid:88) i (cid:48) j ,i j T j,j +1 [ i j , i (cid:48) j ; i j +1 , i (cid:48) j +1 ] A (cid:2) i . . . i (cid:48) j i (cid:48) j +1 . . . i N (cid:3) . (7)It affects a pair of the corresponding tensors,Γ [ j ] i j α j − α j , λ [ l ] α j , Γ [ j +1] i j +1 α j α j +1 → (cid:88) i (cid:48) j i (cid:48) j +1 ,α j T j,j +1 [ i j , i (cid:48) j ; i j +1 , i (cid:48) j +1 ]Γ [ j ] i (cid:48) j α j − α j λ [ l ] i j Γ [ j +1] i (cid:48) j +1 α j α j +1 . (8) Algorithm 1 : TEBD method implementation upload : system & method parameters (N, T j [ i j , i (cid:48) j ], T j,j +1 [ i j , i (cid:48) j , i j +1 , i (cid:48) j +1 ], dt , T max , R ), initial state (Γ [ j ] i j α j − α j , λ [ j ] α j ) for t = 0 to T max do propagate all Γ [ j ] i j α j − α j on [ t ; t + dt/ propagate Γ [ j ] i j α j − α j , λ [ j ] α j , Γ [ j +1] i j +1 α j α j +1 with odd j on [ t ; t + dt/ (cid:46) propagate Γ [ j ] i j α j − α j , λ [ j ] α j , Γ [ j +1] i j +1 α j α j +1 with even j on [ t ; t + dt ] propagate Γ [ j ] i j α j − α j , λ [ j ] α j , Γ [ j +1] i j +1 α j α j +1 with odd j on [ t + dt/ t + dt ] propagate all Γ [ j ] i j α j − α j on [ t + dt/ t + dt ] end for save results release memory To perform reorthogonalization, we introduce matrix K [ i j , i j +1 ; α j − , α j +1 ]. Next we reshape indexes, performSVD, and reshape indexes back [15]: K [ α j − , i j ; α j +1 i j +1 ] = Γ (cid:48) [ j ] i j α j − α j λ (cid:48) [ j ] α j Γ (cid:48) [ j +1] i j +1 α j α j +1 . (9)Such operation takes only O ( R ) steps and this scaling does not depend on N . Moreover, it modifies only a pairof relevant tensors so multiple pairwise operations that act in independent subspaces may be parallelized. The TTrepresentation allows also for fast realizations of other algebraic operations: calculation of partial or full traces, norm,scalar products, additions, etc [7]. B. Tensor-Train Propagation
The TT representation provides a basis for an approximate tensor propagation algorithms. Here we use Time-Evolving Block Decimation (TEBD) scheme [12, 15], which was specifically designed for quantum systems but appli-cable also in the general case. Consider a tensor flow governed by an evolution generator consisting only of operationsacting on one or two adjacent dimensions, ddt A [ i . . . i N ] = (cid:88) j (cid:88) i j T [1] ji j A [ i . . . i j . . . i N ] + (cid:88) j ,j (cid:88) i j ,i j T [2] ji j ,i j A [ i . . . i j i j . . . i N ] . (10)We use standard time discretization to iteratively integrate this equation (starting from some initial tensor). Interms of operations the solution reads A ( t + dt ) = L ( dt ) A ( t ) = = exp (cid:88) j ˆ T [1] j + ˆ T [2] j dt A ( t ) . (11)As T operators generally do not commute, we have to approximate the matrix exponents. A way to minimizethe error is to separate the operators into groups as large as possible such that all the operators belonging to onegroup commute with each other. All one-dimension operators commute by default, and two-dimension acting onodd/even pairs commute within their oddity groups; see Fig. 1b. We utilize this fact and use modified second orderSuzuki-Trotter decomposition [9]: L ( dt ) ≈ L ( dt/ L odd2 ( dt/ L even2 ( dt ) L odd2 ( dt/ L ( dt/ ,L ( dt ) = N (cid:89) j =1 L i ( dt ) , L even/odd2 ( dt ) = (cid:89) j ∈{ even/odd } L i ( dt ) , (12) L i ( dt ) = e ˆ T [1] j dt , L i ( dt ) = e ˆ T [2] j dt . (13)Note that the standard approach is to adsorb one-index operators L into L , but we find it numerically beneficialto separate them. Each factor of L and L can be calculated by using Eqs. (6) and (8) respectively. Furthermore,as they commute by construction, corresponding computation can be parallelized. Each two-index operator mayinclude a cut-off if after the reorthogonalizationm Eq. (9), the number of singular value exceeds bound dimension R . Corresponding accumulated truncation error is then calculated as a sum of local errors (4) over all the operationduring evolution up to time t , E ( t, R ) = t/dt (cid:88) j =1 (cid:88) { L i ( dt j ) } E ( j ) i ( R ) . (14)Computations are dominated by SVD, so resulting complexity is O ( N R ), where R is the bond dimension. With O ( N ) cores available, it becomes O ( R ) and thus the computational task is perfectly scalable. C. Lindblad Equation
We apply both TT and TEBD methods to evolve numerically many-body open quantum models. The state of suchsystems is described by a density matrix (cid:37) ( t ) of the size M N × M N , where N is the number of particles/spins and M is number of the local states, which we put to M = 2 for a 1 / (cid:37) ( t ) = L (cid:37) ( t ) = L H (cid:37) ( t ) + L dis (cid:37) ( t ) = − i [ H, (cid:37) ( t )] + M (cid:88) s =1 γ s (cid:20) D s (cid:37) ( t ) D † s − { D † s D s , (cid:37) ( t ) } (cid:21) , (15)where L is the Lindblad superoperator consisting of conservative L H and dissipative L dis parts, H is Hamiltonian, D s are dissipation operators, and γ s are corresponding dissipation rates. There is a stationary state solution for anyLindblad superoperator L (cid:37) ( ∞ ) = 0 which is unique (aside of special cases of symmetries which we do not addresshere).Many-body density operator (cid:37) can be represented as an 2 N -dimensional tensor (cid:37) [ i , i . . . i N ; i (cid:48) , i (cid:48) . . . i (cid:48) N ] whereevery pair of indexes i j , i (cid:48) j (each one runs from 1 to 2) correspond to the j -th qubit/spin. The models we considerinclude only one- and two-particle interactions, which thus involve up to four indexes of (cid:37) . To overcome this, we usevectorization procedure [12] and glue together indexes of each particle forming (cid:37) [ i i (cid:48) ; i i (cid:48) . . . i N i (cid:48) N ] – N -dimensiontensor with each dimension going from 1 to 4. This allows applying TEBD scheme (II B) as long as H and A s in(15) do not couple more than two particles. However, as we restrict the accessible space to the bond dimension R ,evolution can only start from initial MPO conditions belonging to this space. In the models considered further, weuse an extreme case of product initial states, r j = 1, ∀ j ; as we aim at the stationary states, we have complete freedomwhen choosing initial conditions (though other choices can be more beneficial from the relaxation-speed point of view). III. MODEL SYSTEM
As test-bed models, we use spin chains from Refs. [13, 14]. Here we only briefly described them.Both chains consist of N spins. Hamiltonian part of evolution is governed by the Heisenberg XXZ model with localdisordered potential H = (cid:88) i σ xi σ xi +1 + σ yi σ yi +1 + Jσ zi σ zi +1 + h i σ zi , (16)where σ i are Pauli matrices, J is interaction strength parameter and h i are uncorrelated random value uniformlydistributed in [ − h, h ]. Open boundary conditions are used.In Ref. [13], a disordered spin chain with next-neighbor coupling and two thermal reservoirs – each one representedby pair of Lindblad operators causing excitation (relaxation) and acting on the two end spins, i = 1 , N , – wereconsidered, D i =1 ± = (cid:112) ± µσ ± , D i = N ± = (cid:112) ∓ µσ ± N , (17)where µ is bias responsible for the formation of non-equilibrium steady state with directed non-zero current. In thelimit µ →
0, the stationary state is an infinite temperature ’(maximally mixed’) state, that is the normalized identity.By assuming µ (cid:28)
1, one could address the linear-response regime. The transport of the spin charge through a chainin the stationary regime was considered and the current scaling with N was estimated.The authors managed to achieve a model size of N = 400, which is an unprecedented size for many-body openquantum models. The complexity of the computational experiments is increased by the fact that the systems neededa considerable propagation time in order to reach the stationary state (this is because the dissipation was acting atthe chain ends only). Finally, to obtain scaling dependencies, averaging over disorder realizations was performed. Atthe same time, this work provides little detail about the resources used for numerical simulations. We considered theobtained results as a challenge and decided to reproduce them – at least for N (cid:39) D l = ( σ + l + σ + l +1 )( σ − l − σ − l +1 ) , (18)which try to make neighboring spins oscillating out of phase (‘anti-synchronization’). The maximal size of the modelsused in numerical simulations, reported in this work, was N = 32. Among other characteristics, scaling of the so-calledoperational entanglement entropy [16] was considered. We use this quantifier in our numerical experiments, in whichwe tried to reach N = 128. IV. IMPLEMENTATION
The method described in Section II is implemented as shown in Algorithm 1. The algorithm is implemented usingthe C++ programming language. We found, that the matrix operations (mainly SVD) are the most time-consumingparts of the algorithm. In this regard, we employ the Armadillo software library integrated with highly optimizedmathematical routines from the Intel Math Kernel Library to improve performance. Finally, Armadillo/MKL routinestake about 50-80% of computation time during the propagation step depending on the current system state.The algorithm assumes performing a set of integration operations for individual components of the system atevery time step. These operations are not independent but can be ordered according to their dependencies for theorganization of parallel computations. In particular, all one- and two-particle interactions can be performed completelyin parallel.
FIG. 2. Distribution of computationaland communication functions run time.64 MPI-processes were executed on fournodes of the cluster.
The cluster parallelization is done by using the MPI technology. We apply the classic master-worker scheme forparallelization of the algorithm. For that, the single managing MPI-process (master) forms separate tasks for one–and two–particle interactions, monitors their dependencies from each other and readiness, distributes tasks to all otherprocesses (workers) and accumulates the results.All computational experiments have been done on the Lobachevsky cluster with a 2 × N = 128, R = 50, T max = 50, dt = 0 .
1. Parallel codewas run on four computational nodes of the cluster (1 MPI-process per CPU core, 64 MPI-processes overall). Totalcomputation time was 143 s. The resulted diagram for the distribution of computational and communication functionsrun time is presented in Fig. 2. It is shown that the calculations are fairly well balanced, which is an undoubtedadvantage of the parallelization scheme. However, MPI communications take a significant part of the computationtime, while further increasing the number of cluster nodes used will not significantly speed up the calculations, whichis a limitation of the scheme. Computational efficiency (ratio of computation time to total execution time) was 47%.
FIG. 3. Scaling of the spin current j through a dis-ordered spin chain with N spins for different values ofdisorder strength h . Our results (big colored circles)are plotted on top of the results (lines and other sym-bols) reported in Ref. [13]). The maximal size of themodel system used in our simulations is N = 128. Forevery set of parameters, we performed averaging over20 disorder realizations. The propagation time step dt = 0 . R = 50. V. RESULTS
We find that it is possible to reproduce - with high accuracy – the results reported in Ref. [13] by using bond-dimension R = 50. On Figure. 3 we present a comparison of the results of the sampling we perform with our code(big circles; yellow, red and green) with the results by ˇZnidari˘c and his co-authors. We use propagation step dt = 0 . t = 10 , irrespectively of its size. It is not the most optimal way to sample (forexample, it would be more effective to determine arrival of the system at the asymptotic state by monitoring the valueof the spin current); however, at this stage, we tried to make the sampling procedure as simple as possible. For everyvalue of N and disorder strength h , we additionally performed averaging for 20 disorder realizations. Each realizationtook from 2 minutes to 2 hours depending on the size of the system and involved up to two nodes (for large systemsizes, N > R c = 360 constitutes a threshold value after which theasymptotic entropy does not change upon further increase of the bond dimension. The calculation time for this valueof bond dimension was four weeks of continuous propagation on four computing nodes.We analyze also the evolution of accumulated error 14 in this case. It is noteworthy that saturation of the operatorentropy, which signals the arrival to the asymtotic state, is not accompanied by the saturation of the error. Thelatter continues to grow in a power-law manner, see the dashed line in Fig. 4b. This means that MPO states – evenwith R = 480 – are different from the genuine steady state of the model [which is the zero-value eigenelement of thecorresponding Lindbladian (16 - 18)]. VI. CONCLUSIONS
We presented a parallel implementation of the MPO-TEBD algorithm to propagate many-body open quantumsystems. Parallelization is performed using the MPI technology and employs the master-worker scheme for compu-tational tasks distribution. High-performance implementations of linear algebra from the Intel MKL were used tobetter utilize computational resources of modern hardware.A series of numerical experiments was performed to determine the accuracy and limits of applicability of thedeveloped code. In particular, the effect of the number of SVD numbers kept after each propagation step (bond-dimension R ) on the accuracy of the method was investigated. We found that threshold value R c after whichsaturation of the relevant characteristics is observed and fuhrer increase of bond dimension does not change theirvalues.The performance tests on the Lobachevsky cluster demonstrated that 64 MPI processes running on four compu-tational nodes is the optimal configuration for the model systems with N = 128 spins. As a next step, we plan toexplore the possibility of further improvements of the parallelization by reducing the communications and increasingthe efficiency of using computational resources. After that, we hope to reach the limit N (cid:39)
400 with the test-bedmodels.
VII. ACKNOWLEDGMENTS
The authors acknowledge support of the Russian Foundation for Basic Research and the Government of the NizhniNovgorod region of the Russian Federation, grant - - - - t E ( R ,t ) ( b ) R = = = = = = FIG. 4. (a) Evolution of the operator entanglement entropy S for a single disorder realization for the model from Ref. [14],for different values of bond dimension R . The propagation time step dt = 0 . N = 128. Note that for R = 480 we did not reach the asymptotic ’plateau’ because it was not possible to numerically propagate system further (we hitthe two-week limit). (b) Increase of the accumulated truncation error (14) in time. Science, Project Code (IBS-R024-D1), and by the Korea University of Science and Technology Overseas Trainingprogram.
REFERENCES [1] J. Eisert, M. Friesdorf, and C. Gogolin, Nature Phys. , 124 (2015).[2] J. M. Gambetta, J. M. Chow, and M. Steffen, npj Quant. Inf. , 2 (2017).[3] Breuer, H.-P., and F. Petruccione, The Theory of Open Quantum Systems (Oxford University Press, Oxford, 2002).[4] https://en.wikipedia.org/wiki/Titan (supercomputer)[5] T. G. Kolda and B. W. Bader, SIAM Rev. , 455 (2009).[6] A. Cichocki, Era of Big Data processing: a new approach via Tensor Networks and Tensor Decompositions ,https://arxiv.org/abs/1403.2048.[7] I. V. Oseledets, SIAM J. Sci. Comput. , 2295 (2011).[8] D. Perez-Garcia, F. Verstraete, M.M. Wolf, J.I. Cirac, Quantum Inf. & Comput. , 401 (2007); U. Schollw¨ock, Ann. Phys. , 96 (2011).[9] U. Schollwoeck, Ann. of Phys. (1) (2010).[10] J. Haegeman, C. Lubich, I. Oseledets, B. Vandereycken, and F. Verstraete, Phys. Rev. B , 165116 (2016).[11] G. Vidal, Phys. Rev. Lett. , 147902 (2003). [12] M. Zwolak and G. Vidal, Phys. Rev. Lett. , 207205 (2004).[13] M. ˇZnidari˘c, A. Scardicchio, and V. K. Varma, Phys. Rev. Lett. , 040601 (2016).[14] I. Vakulchyk, I. Yusipov, M. Ivanchenko, S. Flach, and S. Denisov, Phys. Rev. B , 020202(R).[15] G. Vidal, Phys. Rev. Lett. (14) (2003)[16] T. Prosen and I. Piˇzorn, Phys. Rev. A.76