A time-optimal algorithm for solving (block-)tridiagonal linear systems of dimension N on a distributed computer of N nodes
Martin Neuenhofen

February 2, 2018
Abstract
We are concerned with the fastest possible direct numerical solution algorithm for a thin-banded or tridiagonal linear system of dimension N on a distributed computing network of N nodes that is connected in a binary communication tree. Our research is driven by the need for faster ways of numerically solving discretized systems of coupled one-dimensional black-box boundary-value problems.

Our paper presents two major results: First, we provide an algorithm that achieves the optimal parallel time complexity for solving tridiagonal and thin-banded linear systems. Second, we prove that it is impossible to improve the time complexity of this method by any polynomial degree. To solve a system of dimension m · N and bandwidth m ∈ Ω(N^{1/2}) on a two-tree of 2 · N − 1 nodes, the algorithm requires a parallel time complexity of O(log(N) · m^2 · (m · log(N) + k)), where k is the number of right-hand sides.

Motivation of the problem
Many computational engineering tasks deal with the solution of (systems of) one-dimensional differential-algebraic boundary-value problems [9]. Examples are numerical simulations of the following physical phenomena: the deformation of a clamped beam, the dynamic pressure in a gas-pipe, the trajectory of a missile, and constrained optimal control problems.

Using a numerical discretization method, a large-dimensional equation system results that is typically solved via Newton's method. The arising linear systems are thin-banded and of very large dimension.

Often the Newton system results from a minimization principle, either of an objective or of a natural model that aims for minimization of potential energy. In these cases the arising linear systems are not only thin-banded and of very large dimension, but also symmetric positive definite, which is clearly desirable for reasons of numerical stability.

Since the system matrix is thin-banded, it can be interpreted as block-tridiagonal, where the block-size is identical to the bandwidth. Thus, for ease of presentation, in the following we present an algorithm for (block-)tridiagonal systems where the block-size is m ≪ N.

Motivation of the problem statement and computing model
In consequence of the emergence of massively parallel computing systems, nowadays the problem in numerical computing is typically not to solve a given mathematical problem, but rather to solve it on a given computing system while exploiting its resources in an optimal way. Especially for parallel computing systems it is difficult to spread the computational task in a way that enables the full use of the computing system's capacity.

As for our case of solving thin-banded linear systems, it is well known that for a bandwidth bounded by m ∈ O(1) the optimal time complexity for solving a banded linear system on a serial computer is O(N), as can be achieved by use of Gaussian elimination. This means that solving the problem on a serial machine is literally as expensive as either of the following two tasks: reading the problem, or writing the problem's solution into memory.

So at first glance it seems like nothing can be done to improve: The time to communicate the problem to a solver would already dominate the time that the solver actually needs. So where is the point in trying to make the solver faster?

The answer is that problems do not need to be communicated. One can assemble the rows of a linear equation system in parallel, in a distributed way, in the memory of independent computing nodes that are connected via a network. Using the algorithm that we propose, all the computing nodes together solve the one big linear system, but each of them reads only a tiny part of the problem and writes only a tiny part of the solution vector.

Literature review
We consider the problem of solving a thin-banded linear system as a generalization of solving tridiagonal linear systems in parallel, which is why our literature review refers to parallel tridiagonal solvers. Such solvers can be applied by interpreting the thin-banded linear system as a block-tridiagonal system of dense blocks.

There are several popular methods for the solution of tridiagonal linear systems. These have in common that they use a concept described as parallel factorizations [1, 7]: The matrix is multiplied from the left with a block-diagonal matrix that decouples a portion of the unknowns through an interface system. The reduced system is solved. The reduced solution is distributed to all processors so that they can compute in parallel the formerly removed unknowns.

According to [2], the first parallel tridiagonal solver is called cyclic reduction and was presented in 1965 in [5]. The recursive doubling algorithm was introduced in 1973 in [11]. In both of these algorithms each processor holds one row of the system. Cyclic reduction works by successively expressing the odd variables of the solution vector in terms of the even ones. This can be done until finally there is a system of one variable that is solved. An implementation is provided in [4].
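To make the elimination idea concrete, here is a serial sketch of cyclic reduction in our own notation (an illustration, not code from [5] or [4]): each pass absorbs the odd-indexed unknowns into the even rows, leaving a half-size tridiagonal system.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by cyclic reduction (serial sketch).
    a: sub-diagonal (a[0] must be 0), b: diagonal,
    c: super-diagonal (c[-1] must be 0), d: right-hand side."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]
    # Each even row i absorbs its neighbours i-1 and i+1, which
    # eliminates the odd-indexed unknowns from the even rows.
    na, nb, nc, nd = [], [], [], []
    for i in range(0, n, 2):
        al = a[i] / b[i - 1] if i > 0 else 0.0
        be = c[i] / b[i + 1] if i < n - 1 else 0.0
        na.append(-al * a[i - 1] if i > 0 else 0.0)
        nb.append(b[i] - (al * c[i - 1] if i > 0 else 0.0)
                       - (be * a[i + 1] if i < n - 1 else 0.0))
        nc.append(-be * c[i + 1] if i < n - 1 else 0.0)
        nd.append(d[i] - (al * d[i - 1] if i > 0 else 0.0)
                       - (be * d[i + 1] if i < n - 1 else 0.0))
    xe = cyclic_reduction(na, nb, nc, nd)  # recurse on the half-size system
    # Back-substitute the odd-indexed unknowns from their own rows.
    x = [0.0] * n
    x[0::2] = xe
    for i in range(1, n, 2):
        rhs = d[i] - a[i] * x[i - 1]
        if i < n - 1:
            rhs -= c[i] * x[i + 1]
        x[i] = rhs / b[i]
    return x
```

In the parallel setting each processor owns one row and every reduction pass runs concurrently over the even rows, so the recursion depth, and hence the parallel time, is O(log(N)).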
Wang's method [12], introduced in 1981, is a parallel algorithm where the order of the number of processors is smaller than the order of the number of rows of the system to be solved. This algorithm has been proven to be numerically stable [13]. The idea of this method is the assembly of an interface problem whose solution can be used directly to solve for the remaining variables in a backward-substitution step.

In 1991 Bondeli introduced a divide-and-conquer algorithm for tridiagonal systems [3]. The idea of this algorithm is to solve a block-diagonal approximation of the system in parallel, where each processor holds a diagonal block. This results in a reduced interface system that is solved by cyclic reduction, cf. [2].
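For comparison, a sketch of the serial O(N) baseline mentioned in the introduction, Gaussian elimination specialized to a tridiagonal matrix, commonly known as the Thomas algorithm (our own illustration):

```python
def thomas(a, b, c, d):
    """Serial O(N) Gaussian elimination for a tridiagonal system
    (Thomas algorithm).  a: sub-diagonal (a[0] unused), b: diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side."""
    n = len(b)
    cp = [0.0] * n                     # modified super-diagonal
    dp = [0.0] * n                     # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):              # forward elimination
        den = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / den if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / den
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):     # backward substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Both sweeps are inherently sequential, which is exactly why the parallel methods above restructure the elimination order.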
Organization of the paper
In the remainder of this section we describe the parallel computing system that we use for our algorithm. We then describe the mathematical problem that the algorithm solves. It is important that the problem is given to the algorithm in a special way. In particular, the problem data must be provided in distributed memory before the algorithm is called. This is important because it would take too much time to move the data from a central storage to the distributed memory.

In Section 2 we present the algorithm. We start with an implementation and afterwards show how this algorithm can be derived from familiar matrix algorithms. At the end of the section we analyse the time complexity of the algorithm and remark on the optimality of this complexity result.

In Section 3 we give lower bounds on the time complexity for solving tridiagonal linear systems on a parallel distributed-memory machine. We show that our algorithm is able to yield optimal time complexity.

Eventually we draw conclusions in Section 4.
Computing system
We need to describe the computing system before we describe the problem statement, because otherwise we cannot describe where we presume the problem data to be placed. Above we described why this is crucial: We have to make sure that the problem is provided in the right way, because moving the problem data around would cost too much time.

For a system of dimension N · m with a bandwidth m we consider a computing system of 2 · N − 1 nodes that are connected as a binary tree, our two-tree. Each node is connected to a parent, an up-child and a down-child, cf. Figure 1. Exceptions are: There exists one node called root. This node does not have a parent. Further, there exist N nodes that are called leaves. A leaf only has a parent; it has neither an up-child nor a down-child.

The right part of the figure assigns numbers to the nodes. Each node has a processor number and a level. The levels are defined recursively: Each leaf has a level of zero, and the parent of each node has a level that is by one larger than the level of the node itself. The processor number is a value that is given by counting from 1 from the uppermost node to the lowermost node of each level.

Each node holds identifying variables like a passport: The variable my_level gives the level of this node. The variable my_proc_num is a list. The values my_proc_num(ℓ) are well-defined for ℓ ∈ {my_level, ..., d}, where d is the level of the root. The value my_proc_num(ℓ) gives the processor number of the node on level ℓ through which a signal from the root would have to travel in order to reach this node. Figure 1 gives an example for this: To reach the node on level 0 with processor number 5, a signal from the root would have to traverse node 2 of level 2 and node 3 of level 1.

We presume the nodes to be identical serial computing units.
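The routing information in my_proc_num can be computed directly from a leaf's processor number: halving (rounding up) gives the processor number one level higher. A small sketch (the helper name is ours):

```python
import math

def proc_path(leaf_num, d):
    """Return [my_proc_num(0), ..., my_proc_num(d)] for the leaf with
    1-indexed processor number leaf_num in a two-tree whose root sits
    on level d: the processor number on level l is ceil(leaf_num / 2**l)."""
    return [math.ceil(leaf_num / 2 ** l) for l in range(d + 1)]
```

For the example of Figure 1, proc_path(5, 3) reproduces the path 5, 3, 2, 1.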
As is common, we presume that the basic operations plus, minus, times, divide, and copy of scalar values each require a fixed amount of time on a respective node.

For communications over the network, we use the following message-passing interface of six commands:

• send_to_up_child(M): If this node has an up-child then it sends a copy of the matrix M to the up-child and waits until the up-child has received it.
• receive_from_up_child(M): If this node has an up-child then it waits until it receives a matrix by copy from the up-child. This node stores the matrix in its variable M and the copy used for the transmission is destroyed.
• send_to_down_child(M): analogous to above, but the data is sent to the down-child of this node.
• receive_from_down_child(M): analogous to above, but the data is received from the down-child of this node.
• send_to_parent(M): analogous to above, but the data is sent to the parent of this node.
• receive_from_parent(M): analogous to above, but the data is received from the parent of this node.

For each communication we assume a time complexity of the number of elements of M plus a constant amount of time c_Lat(N) that is due to latency. The latency accounts for the phenomenon that information travels through the cable at the speed of light, so it takes a while until the beginning of a message has moved through the cable.

Figure 1: Structure of the computing system: a two-tree of 2 · N − 1 nodes. Left: root, leaves, and the parent/up-child/down-child relations. Right: levels and processor numbers; the leaf with my_level = 0 and processor number 5 has my_proc_num(0) = 5, my_proc_num(1) = 3, my_proc_num(2) = 2, my_proc_num(3) = 1.

Problem statement

We consider the numerical solution of a banded linear system

    A · X = Y    (1)

where A ∈ R^{(N·m)×(N·m)} has bandwidth b ≤ m, and Y ∈ R^{(N·m)×k} is a dense matrix of k right-hand sides. The task is to find numerical values for X ∈ R^{(N·m)×k}. We assume N ∈ ℕ.

As formerly discussed, the time for writing the solution into memory in a sequential way would already exceed the time that is actually needed to solve the system.
This is why in the following we describe very precisely in which form the data A, Y must be provided to our computing system.

The system matrix, the right-hand sides, and the solution vectors are stored in a distributed way in the leaves of our two-tree. Figure 2 illustrates the situation for N = 8. Each leaf holds five matrices in its private storage: A, B, C ∈ R^{m×m} and X, Y ∈ R^{m×k}. Matrices of two distinct leaves can have totally different values. Comparing the upper and lower part of Figure 2, we find that the original matrices A, X, Y can be composed from the matrices A, B, C, X, Y of all leaves. There are two matrices that fall out of the pattern: C in the uppermost leaf and B in the lowermost leaf. We require that these matrices are zero-matrices.

This section is organized as follows. We first present the algorithm as a code that could be directly used for implementation in a programming language such as C++ or Fortran with MPI. Then we sketch our derivation of the algorithm, which arose from applying the divide-and-conquer paradigm to the SPIKE algorithm due to Sameh [8]. We explain how our algorithm operates and give an example to illustrate the algorithmic steps. Finally, we analyse the parallel time complexity.
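The chunking of Figure 2 can be sketched as follows (purely illustrative; the function name and the assignment of B to the lower and C to the upper coupling are our reading of the figure):

```python
def distribute_rows(A, Y, N, m):
    """Chunk an (N*m) x (N*m) banded system, given as nested lists,
    into N leaves.  Leaf i holds its diagonal block A_i, the coupling
    C_i to the row-block above, the coupling B_i to the row-block
    below, and its slice of the right-hand sides Y.  Couplings that
    would reach outside the matrix are stored as zero blocks."""
    def block(r0, c0):
        if c0 < 0 or c0 >= N * m:
            return [[0.0] * m for _ in range(m)]
        return [row[c0:c0 + m] for row in A[r0:r0 + m]]
    leaves = []
    for i in range(N):
        r0 = i * m
        leaves.append({
            "A": block(r0, r0),        # diagonal block
            "C": block(r0, r0 - m),    # coupling to the leaf above
            "B": block(r0, r0 + m),    # coupling to the leaf below
            "Y": [row[:] for row in Y[r0:r0 + m]],
        })
    return leaves
```

As required in the text, C of the uppermost leaf and B of the lowermost leaf come out as zero blocks.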
The algorithm
The following algorithm is launched on all nodes of the computing system at the same time with their respective local data.

procedure ParallelSolver(A, B, C, Y, N, d)
    // matrices: V{j} ∈ R^{m×m} for j = 1, ..., d;  V_up, V_down ∈ R^{m×m}
    // matrices (cont. 1): Z{j}_V,up, Z{j}_V, Z{j}_V,down ∈ R^{m×m} for j = 1, ..., d
    // matrices (cont. 2): Z_X,up, Z_X, Z_X,down ∈ R^{m×k}
    if my_level == 0 then
        // - - - write wings
        jB := my_proc_num(my_level);  k := N/2
        for j = d : −1 : 1 do
            if jB ≥ k then jB := jB − k end if
            if jB == 0 then V{j} := V{j} + B;  break for-loop end if
            k := k/2
        end for
        jC := my_proc_num(my_level) − 1;  k := N/2
        for j = d : −1 : 1 do
            if jC ≥ k then jC := jC − k end if
            if jC == 0 then V{j} := V{j} + C;  break for-loop end if
            k := k/2
        end for
        // - - - block-diagonal inversion
        [V{1}, ..., V{d}, X] := A \ [V{1}, ..., V{d}, Y]
    end if
    for ℓ = 1 : 1 : d do
        if my_level == 0 then
            send_to_parent(V{ℓ}, ..., V{d}, X)
        else
            receive_from_up_child(V{ℓ}_up, ..., V{d}_up, X_up)
            receive_from_down_child(V{ℓ}_down, ..., V{d}_down, X_down)
            if my_level < ℓ then
                if my_proc_num(ℓ − 1) is odd then
                    send_to_parent(V{ℓ}_down, ..., V{d}_down, X_down)
                else
                    send_to_parent(V{ℓ}_up, ..., V{d}_up, X_up)
                end if
            end if
        end if
        // above: nodes of level ℓ receive V{ℓ,...,d}_up, V{ℓ,...,d}_down, X_up, X_down
        if my_level == ℓ then
            S := [ I_{m×m}     V{ℓ}_up  ;
                   V{ℓ}_down   I_{m×m}  ]
            [ Z{ℓ+1}_V,up,   ..., Z{d}_V,up,   Z_X,up   ;
              Z{ℓ+1}_V,down, ..., Z{d}_V,down, Z_X,down ]
                := S \ [ V{ℓ+1}_up,   ..., V{d}_up,   X_up   ;
                         V{ℓ+1}_down, ..., V{d}_down, X_down ]
            send_to_up_child(Z{ℓ+1}_V,down, ..., Z{d}_V,down, Z_X,down)
            send_to_down_child(Z{ℓ+1}_V,up, ..., Z{d}_V,up, Z_X,up)
        end if
        if my_level < ℓ then
            receive_from_parent(Z{ℓ+1}_V, ..., Z{d}_V, Z_X)
            if my_level > 0 then
                send_to_up_child(Z{ℓ+1}_V, ..., Z{d}_V, Z_X)
                send_to_down_child(Z{ℓ+1}_V, ..., Z{d}_V, Z_X)
            end if
            if my_level == 0 then
                [V{ℓ+1}, ..., V{d}, X] := [V{ℓ+1}, ..., V{d}, X] − V{ℓ} · [Z{ℓ+1}_V, ..., Z{d}_V, Z_X]
            end if
        end if
    end for
end procedure

Figure 2: Distributed storage of the linear system. At the top: The linear system of bandwidth b ≤ m is chunked into row-blocks of size m. At the bottom: Each row-block is stored in one leaf using the five matrices A, B, C, X, Y.

Origins in SPIKE
We derived the above algorithm by applying the SPIKE algorithm [6] due to Sameh in a divide-and-conquer fashion. We explain the derivation with the help of Figure 3.

Initially, we are given a banded linear system A · X = Y, as is shown in part 1 of the figure. As is common for the divide-and-conquer approach, the system is split in the middle. We express the system as the composition of equally dimensioned square matrices A1, A2 and matrices X1, X2, Y1, Y2. Since A is banded, we need two additional matrices V1, V2 in order to be able to express A in blocks. V1, V2 are high and thin: They have the height of half the dimension of the original system and their breadth is equal to the bandwidth of the original system. We give names to the quantities in part 2: The system matrix we call a blade and the matrices Ṽ1, Ṽ2 we call wings. We will come back to this later.

In part 2 of the figure we see a transformed system that is obtained when multiplying the inverse of the block-diagonal matrix

    D = [ A1   0  ]
        [ 0    A2 ]

from the left onto the system. If the dimension of A1 and A2 is large then it is unlikely that D is singular. If A is symmetric positive definite then D is regular with a condition number bounded by that of A [10]. Thus, there exist conditions for A such that the system considered in part 2 of the figure is well-posed. The obtained system has an interesting structure: It is an identity matrix plus two dense sub-blocks Ṽ1, Ṽ2 that have the breadth of A's bandwidth.

In part 3 of the figure we consider a sub-system that is obtained by extracting the red portions from the matrices in part 2. This sub-system is decoupled from the total system. It has a size of twice the bandwidth and it can be solved directly (e.g. by LU-factorization followed by forward and backward substitutions) for the extracted (red-marked) portion of X. This approach would be followed in the usual SPIKE algorithm [8]. However, we use a different approach. As shown in part 3 of our figure, we solve the sub-system with system matrix S and write the solution into a matrix that we call Z. Z is composed vertically of two blocks Z_up and Z_down, each of which has the height of the bandwidth of A and the breadth of X and Y.

In part 4 we consider again a modified system. Using Z from part 3, we see that we can change the right-hand sides such that the system matrix simplifies to an identity. This transformation can be easily derived: As the first block-row of the system in part 2, we have I · X1 + [Ṽ1, 0] · X2 = Ỹ1. Replacing [Ṽ1, 0] · X2 by Ṽ1 · Z_down and moving the second term to the right-hand side yields the formula for Ŷ1. The formula for Ŷ2 can be found in an analogous way.

Parallelism, recursions, and the computing system
In Figure 3 there are two stages where parallelism can be exploited: The transformation from part 1 to part 2 involves the solution of the following banded linear systems:

    A1 · [Ṽ1, Ỹ1] = [V1, Y1]    (2a)
    A2 · [Ṽ2, Ỹ2] = [V2, Y2]    (2b)

These two systems can be solved independently of each other. The second stage where parallelism can be exploited is in the computation of the two components Ŷ1 and Ŷ2:

    Ŷ1 = Ỹ1 − Ṽ1 · Z_down    (3a)
    Ŷ2 = Ỹ2 − Ṽ2 · Z_up      (3b)

The problems in (2) and (3) can be expressed as recursions. The recursion for (2) is obvious because we develop an algorithm that solves banded linear systems and interiorly requires the solution of smaller banded linear systems. A recursion for (3) is found by distributing the computation over the rows of Ŷ1, Ŷ2. This is shown in Figure 4. The computation of each of Ŷ1 and Ŷ2 can be expressed in the form M := M − U · W, in some programming languages also written as M −= U · W. The figure shows how the computation of the update of M can be distributed through a two-tree network by dividing it in the vertical direction. All together, we can formulate our whole algorithm for solving a banded linear system in a recursive way. The following code demonstrates this:

procedure RecursiveSolver(A, Y)
    if A has small dimension then
        X := A \ Y
        return X
    end if
    Decompose the system into A1, A2, V1, V2, Y1, Y2, X1, X2.
    [Ṽ1, Ỹ1] := RecursiveSolver(A1, [V1, Y1])
    [Ṽ2, Ỹ2] := RecursiveSolver(A2, [V2, Y2])
    Compose S and solve the reduced linear system for Z_up, Z_down.
    // Ŷ1 overwrites Ỹ1, Ŷ2 overwrites Ỹ2
    Ŷ1 := RecursiveMatMul(Ỹ1, Ṽ1, Z_down)
    Ŷ2 := RecursiveMatMul(Ỹ2, Ṽ2, Z_up)
    // X is [Ŷ1; Ŷ2]
    return X
end procedure

Figure 3: Derivation of our algorithm by applying SPIKE in a divide-and-conquer fashion. The system is divided into two equally sized systems of smaller dimension. After solving these, the solutions can be put together through the solution of a decoupled system (part 3) that has a small dimension.

Figure 4: Recursive divide-and-conquer approach to compute the matrix update M −= U · W on a parallel two-tree computing system. Each node divides the computational problem vertically into two problems of smaller dimension. These are then solved recursively by its children.

The computing system has been chosen as a two-tree in order to exploit the recursive nature of the algorithm: Initially, the root has to solve the linear system. According to the above code it would call its children in lines 7–8 to solve recursively the subsystems with A1 and A2. This is supposed to be done in parallel. Afterwards, the root composes S from the small portions of Ṽ1, Ṽ2, and computes Z_up, Z_down. Finally, it calls its children again in order to compute Ŷ1, Ŷ2.

Access pattern on the distributed memory
So far our explanation helps to understand how the algorithm works and why the results for X are correct. But it is not yet easy to see how communication-intense the algorithm is and how the recursion acts on the global system. In this paragraph we give illustrations for both.

Our algorithm can be interpreted in an elegant way. To solve a system A · X = Y, the algorithm applies the iterative scheme

    A(0) := A,    Y(0) := Y,
    A(j+1) := (D(j))^{-1} · A(j),    Y(j+1) := (D(j))^{-1} · Y(j)

for j = 0, ..., d with d = log2(N), where A(d+1) = I and thus X = Y(d+1). The matrices D(j) are block-diagonal matrices whose blocks are blades. As we have seen, the inverses of blades can be computed in parallel through RecursiveMatMul. We only need one upwards communication of V1, V2 and one downwards communication of Z_down, Z_up.

Figure 5 illustrates the iteration for N = 8, d = 3. The figure shows the system matrix in the right part, starting at the top with the original matrix and ending at the bottom with an identity. The left part of the figure shows the matrices D(j) for j = 0, ..., d.

We want to explain how this interpretation of the algorithm's action can be derived from RecursiveSolver, and at the same time discuss the algorithmic steps as represented by the figure. The recursion in lines 7–8 results in a partitioning of the system matrix along the diagonal blocks. Black lines indicate the recursive diagonal partitioning. Grey coloring shows the non-zero pattern. At the bottom of the recursion the inverses of the smallest diagonal blocks are applied from the left onto the system. So the matrix D(0) consists solely of diagonal elements. The matrix A(1) has identity matrices on the diagonal blocks by construction. The sub-diagonal blocks are no longer triangular but dense. Since level 0 is the recursive bottom, we now ascend. The matrix D(1) consists of four blades. Multiplication from the left with its inverse yields a matrix A(2) that has identities on larger diagonal blocks (since these were identical to the blades on the diagonals of D(1)). However, the height of the wings in A(2) increases because there are subdiagonal blocks in A(1) that were not represented in D(1). Further ascending, the matrix D(2) consists of two diagonal blocks, which are blades of dimension 4. The two subdiagonal blocks in A(3) suffer from fill-in in the vertical direction while the diagonal blocks become identity matrices. Finally, A(3) has a blade structure, and (D(3))^{-1} · A(3) yields the identity.

From the sparsity patterns of the system matrix in each iteration we can draw conclusions on the memory that the nodes need to hold in order to be able to compute A(0), ..., A(d+1) without the need of any overhead for, e.g., dynamically changing a sparse-memory representation of A. In our algorithm ParallelSolver we store A as wing matrices V{j}, j = 1, ..., d. The top of Figure 6 depicts this: The fill-in pattern of A fits into log2(N) column vectors. For this data structure the product of A with the inverse of one of the above block-diagonal matrices D(j) can be computed efficiently, as is shown in the bottom of the figure: Say we want to compute the product with (D(j))^{-1} for j = 3. In this case, the leaves send the data for the reduced system (compare to the red-framed portions in Figure 3, part 2) to the roots of the sub-trees of level j = 3. The roots of the subtrees compute the reduced solutions Z_up, Z_down, which they afterwards send back to all leaves through their sub-trees. The leaves update the wing matrices by performing the same computations for them as for updating Y.

Figure 5: Successive multiplication of the inverse of a block-diagonal matrix from the left onto the system matrix.
The block-diagonal matrices consist of blades. Their inverse is easy to apply in parallel.
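The recursion described above can be condensed into a serial, single-process sketch for the scalar case m = 1 (our own illustration of RecursiveSolver, not the distributed implementation; the 2×2 solve plays the role of S and Z):

```python
import numpy as np

def recursive_solver(A, Y):
    """Serial sketch of the divide-and-conquer recursion for m = 1.
    A: dense N x N tridiagonal array, Y: N x k right-hand sides."""
    n = A.shape[0]
    if n <= 2:
        return np.linalg.solve(A, Y)
    mid = n // 2
    A1, A2 = A[:mid, :mid], A[mid:, mid:]
    # V1, V2 hold the single couplings between the two halves.
    V1 = np.zeros((mid, 1)); V1[-1, 0] = A[mid - 1, mid]
    V2 = np.zeros((n - mid, 1)); V2[0, 0] = A[mid, mid - 1]
    # The recursive solves produce the wings and transformed right-hand sides.
    W1 = recursive_solver(A1, np.hstack([V1, Y[:mid]]))
    W2 = recursive_solver(A2, np.hstack([V2, Y[mid:]]))
    Vt1, Yt1 = W1[:, :1], W1[:, 1:]
    Vt2, Yt2 = W2[:, :1], W2[:, 1:]
    # Decoupled 2x2 interface system S (part 3 of Figure 3).
    S = np.array([[1.0, Vt1[-1, 0]], [Vt2[0, 0], 1.0]])
    Z = np.linalg.solve(S, np.vstack([Yt1[-1:], Yt2[:1]]))
    z_up, z_down = Z[:1], Z[1:]
    # Update the right-hand sides (part 4) and assemble the solution.
    return np.vstack([Yt1 - Vt1 @ z_down, Yt2 - Vt2 @ z_up])
```

In the distributed algorithm the two recursive calls run on the two sub-trees, and the interface solve is performed by the inner node one level up.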
Figure 6: Top: The matrix is represented by d = log2(N) dense wing matrices V{j}, j = 1, ..., d, which are each distributed in row-chunks over the leaves. Bottom: The product with the inverse of a blade. The leaves send extracted matrices to the root of the sub-tree. The root computes the decoupled solution Z and sends it back through the tree to all leaves, so they can update the wing matrices.

Parallel time complexity

We derive the time complexity from the following list of axioms:

(i) According to Figure 5 the algorithm performs O(log(N)) iterations.

(ii) The communicated pieces of data per iteration are the node-local matrices [V{1}, ..., V{d}, Y] ∈ R^{m×(m·d+k)}, which consist of O(m·(m·log(N)+k)) elements. The communication cost per iteration is bounded by the time required for sending these elements from an arbitrary leaf to the root or vice versa (Z_V and Z_X have the same dimensions as [V{1}, ..., V{d}, Y]). Equivalently, the time complexity for communication per iteration is O(c_Lat(N) + m·(m·log(N)+k)), where c_Lat(N) is the physical time that is needed to send a single number from the root to a leaf that is farthest away from the root in terms of cable length.

(iii) The computational complexity per iteration per node is as follows: The nodes that are not leaves either compute nothing or they compute a decomposition of a matrix S ∈ R^{(2·m)×(2·m)} and apply it to the columns of the matrix [V{1}, ..., V{d}, Y] in order to compute Z_V, Z_X. The leaves, on the other hand, compute matrix-matrix products of an m×m matrix with Z_V and Z_X. Thus, the computational complexity per node per iteration is bounded by O(m^2·(m·log(N)+k)).

Combining the items, we find that the parallel time complexity of our method is

    O( log(N) · ( c_Lat(N) + m^2·(m·log(N)+k) ) ).    (4)

Unfortunately, for a computing system of N nodes, each of a size in Θ(1), the minimum possible value for c_Lat(N) lies in Θ(N^{1/2}).

We consider two special cases:

1. Assuming k, m ∈ O(1), the complexity result simplifies to O(log(N) · N^{1/2}).
2. Assuming k, m ∈ Ω(N^{1/2}), the time complexity is O(log(N) · m^2 · (m·log(N)+k)).

Whereas the second result is obviously optimal for solving a band matrix with a dense band of bandwidth m (since it has N diagonal blocks of size m that need to be factorized), it turns out that the first result is not optimal; it can be made optimal by fine-tuning.

Lower complexity bound for the parallel solution of tridiagonal linear systems
In this section we prove in order the following statements, where s is the physical dimension in which the computing system is built (e.g., if the computing system is built on the surface of a planet then s = 2, and if the computing system is a/the planet then s = 3):

1. The latency c_Lat(N) is bounded from below by Ω(N^{1/s}), regardless of how many cables are used.

2. The time complexity of an algorithm for solving a tridiagonal linear equation system of dimension N on a distributed-memory machine is bounded from below by Θ(N^{1/(s+1)}), no matter how many nodes and cables are used.

Latency
We discuss the latency time for a computing system of N nodes that are s-dimensional spheres, where the latency for one node to communicate with another is bounded from below by the order of their physical distance in space.

For N nodes placed on a line, the diameter is ≥ N − 1, since there exist two nodes with N − 2 ∈ Ω(N) nodes between them. For an illustration of this, consider Figure 2, which shows twice the topology of the two-tree computing system for N = 8 leaves. Since the leaves are placed on a line with unit distance, the physical distance of the root to the farthest leaf is bounded from below in Ω(N).

Now let us consider the case where we use s = 2 dimensions: We place the computing nodes into the smallest possible circle. Since the total area of N nodes is Θ(N), the radius r of the circle must be in Θ(N^{1/2}). Then the communication between two nodes requires at most a time of 2·r ∈ Θ(N^{1/2}). In s = 3 dimensions the situation is even better. Here the radius must only be of length r ∈ Θ(N^{1/3}). Given the positions of the nodes in the sphere, it is trivial to find an almost optimal communication network for them: Connecting them as a tree yields a maximum diameter of the communication tree of O(log(N) · N^{1/s}), because each cable can have length at most O(N^{1/s}) and a tree network has diameter O(log(N)).

Lower bound on time complexity for solving tridiagonal linear systems
Presume we shall solve a tridiagonal system A · x = y of dimension N. Presume that we use O(N^p) nodes, each of which holds at most O(N^q) pieces of data. It must hold that q ≥ 1 − p, because O(N) pieces of data must be stored in total.

It is known that each number of the solution vector x depends on each number of the right-hand side y and on each value of the matrix A. Thus the following hold:

(i) Each node must read at least O(N^{1−p}) pieces of its data.

(ii) Each node must communicate at least once (either directly or indirectly) with each other node.

Combining the two properties, we find a lower bound for the time complexity for a fixed value of p:

    Ω( N^{1−p} + N^{p/s} ) = O( N^ω ),

where the first term accounts for reading the problem and the second for communication. The minimum of the polynomial complexity order ω depending on s, and the optimal order p of the number of nodes, are given in Table 1.

    dimension s | order p of number of nodes | order ω of time complexity
    1           | 1/2                        | 1/2
    2           | 2/3                        | 1/3
    3           | 3/4                        | 1/4

    Table 1: Optimal order of the number of nodes and resulting order of the time complexity.

The question arises why we have not presented our algorithm with a number of nodes in O(N^{2/3}), since then our algorithm would be logarithmically close to the optimal time complexity. We did not because for our PhD thesis we will have to solve problems where m is so large that the complexity is dominated by the term with m^3. The m^3 arises from the fact that A has N dense m×m matrices on the diagonal that must be decomposed. In theory a fast matrix-multiplication algorithm could be employed to reduce the complexity order of this, but this is not practical. Thus, for the problems that we need to solve, the complexity of our algorithm is already logarithmically close to optimal or maybe even optimal.

Conclusions

We presented an algorithm for the efficient parallel solution of (block-)tridiagonal linear systems. We provided an accurate implementation of the algorithm that uses a common message-passing interface. Following our analysis, the algorithm has a parallel time complexity that could be made optimal for tridiagonal systems (by simply using fewer nodes) and that is clearly optimal for block-tridiagonal linear systems, respectively banded linear systems of small bandwidth.

Though a proper implementation has been provided for the algorithm, it will be rather difficult to utilize its full potential in practice. This is because the time that is required to send the data to the solver would already destroy the benefit. Software that uses this solver needs to be highly sophisticated.
In particular, problems must be instantiated in such a way that the linear system is already distributed in the memory of the leaves of our computing system at the time when our solver is called.

Further work will be related to an attempt at implementing an optimal control solver by direct transcription, which shall solve the linear systems within the non-linear programming solver by means of the linear-system solver that has been presented in this work.
References

[1] P. Amodio, L. Brugnano, and J. Plemmons. Parallel factorizations and parallel solvers for tridiagonal linear systems. Linear Algebra Appl., pages 347–364, 1992.

[2] Travis Austin, Markus Berndt, and David Moulton. A memory efficient parallel tridiagonal solver. Preprint LA-UR-03-4149, 2004.

[3] Stefan Bondeli. Divide and conquer: A parallel algorithm for the solution of a tridiagonal linear system of equations. Parallel Comput., 17(4-5):419–434, July 1991.

[4] Peter N. Brown, Robert D. Falgout, and Jim E. Jones. Semicoarsening multigrid on distributed memory machines. SIAM J. Scientific Computing, 21(5):1823–1834, 2000.

[5] R. W. Hockney. A fast direct solution of Poisson's equation using Fourier analysis. J. ACM, 12(1):95–113, January 1965.

[6] Murat Manguoglu, Ahmed H. Sameh, and Olaf Schenk. PSPIKE: A parallel hybrid sparse linear system solver. In Henk Sips, Dick Epema, and Hai-Xiang Lin, editors, Euro-Par 2009 Parallel Processing, pages 797–808, Berlin, Heidelberg, 2009. Springer.

[7] Nathan Mattor, Timothy J. Williams, and Dennis W. Hewett. Algorithm for solving tridiagonal matrix problems in parallel. Parallel Computing, 21(11):1769–1782, 1995.

[8] Eric Polizzi and Ahmed Sameh. SPIKE: A parallel environment for solving banded linear systems. Computers and Fluids, 36(1):113–120, 2007.

[9] R. D. Russell and L. F. Shampine. A collocation method for boundary value problems. Numerische Mathematik, 19(1):1–28, February 1972.

[10] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2003.

[11] Harold S. Stone. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. J. ACM, 20(1):27–38, January 1973.

[12] H. H. Wang. A parallel method for tridiagonal equations. ACM Trans. Math. Softw., 7(2):170–183, June 1981.

[13] Plamen Yalamov and Velisar Pavlov. On the stability of a partitioning algorithm for tridiagonal systems.