A Task-based Multi-shift QR/QZ Algorithm with Aggressive Early Deflation
MIRKO MYLLYKOSKI,
Department of Computing Science and HPC2N, Umeå University

The QR algorithm is one of the three phases in the process of computing the eigenvalues and the eigenvectors of a dense nonsymmetric matrix. This paper describes a task-based QR algorithm for reducing an upper Hessenberg matrix to real Schur form. The task-based algorithm also supports generalized eigenvalue problems (QZ algorithm), but this paper focuses more on the standard case. The task-based algorithm inherits previous algorithmic improvements, such as tightly-coupled multi-shifts and Aggressive Early Deflation (AED), and also incorporates several new ideas that significantly improve the performance. These include the elimination of several synchronization points, the dynamic merging of previously separate computational steps, the shortening and prioritization of the critical path, and the introduction of experimental GPU support. The task-based implementation is demonstrated to be significantly faster than multi-threaded LAPACK and ScaLAPACK in both single-node and multi-node configurations on two different machines based on Intel and AMD CPUs. The implementation is built on top of the StarPU runtime system and is part of the open-source StarNEig library.

CCS Concepts: • Mathematics of computing → Solvers; Computations on matrices; • Computing methodologies → Parallel algorithms;
Additional Key Words and Phrases: eigenvalue problem, real Schur form, QR algorithm, QZ algorithm, multi-shift, aggressive early deflation, task-based, StarPU, shared memory, distributed memory, MPI, GPU
ACM Reference format:
Mirko Myllykoski. 2016. A Task-based Multi-shift QR/QZ Algorithm with Aggressive Early Deflation. 1, 1, Article 1 (January 2016), 34 pages. DOI: 10.1145/nnnnnnn.nnnnnnn
Given a matrix $A \in \mathbb{R}^{n \times n}$, the standard matrix eigenvalue problem consists of finding eigenvalues $\lambda_i \in \mathbb{C}$ and corresponding eigenvectors $x_i \in \mathbb{C}^n$, $x_i \neq 0$, such that
$$A x_i = \lambda_i x_i. \quad (1)$$
If the matrix $A$ is dense and nonsymmetric, then the eigenvalue problem is usually solved in three phases: (i) reduction to Hessenberg form, (ii) reduction to (real) Schur form, and (iii) computation of the eigenvectors. The first phase reduces the dense matrix $A$ to upper Hessenberg form $H$ via an orthogonal similarity transformation and can be considered to be a preprocessing step. The second phase further reduces the upper Hessenberg matrix $H$ to (real) Schur form $S$ via the application of the QR algorithm. After this step, the eigenvalues $\lambda_i$ can be determined from the diagonal blocks of $S$. Finally, the eigenvectors $x_i$ are solved using the Schur form $S$.

This work is part of a project (NLAFET) that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671633. This work was supported by the Swedish strategic research programme eSSENCE and the Swedish Research Council (VR) under Grant E0485301.

© 2016 ACM. DOI: 10.1145/nnnnnnn.nnnnnnn

, Vol. 1, No. 1, Article 1. Publication date: January 2016.

Note that although
Hessenberg reduction is generally considered to be the most time-consuming phase, it can be accelerated with a GPU, see Tomov and Dongarra [2009] and Myllykoski and Kjelgaard Mikkelsen [2020b], and thus performed faster than the other two phases. For a more detailed explanation of the three phases, see the textbook by Golub and Van Loan [1996].

This paper focuses on the second phase and describes a task-based (see, e.g., Thibault [2018]) QR algorithm that is implemented on top of the StarPU runtime system, see Augonnet et al. [2011]. The task-based algorithm also supports generalized eigenvalue problems (QZ algorithm), but this paper mainly focuses on the standard case. The generalized case is relegated to a sequence of remarks highlighting the major differences. The task-based algorithm inherits previous algorithmic improvements, such as multi-shifts introduced by Bai and Demmel [1989], tightly-coupled multi-shifts introduced by Braman et al. [2002a], and Aggressive Early Deflation (AED) introduced by Braman et al. [2002b], from the latest LAPACK (see Byers [2007]) and ScaLAPACK (see Granat et al. [2015a, 2015b]) implementations, and also incorporates several new ideas that significantly improve the performance in both single-node and multi-node environments. In particular, the new task-based algorithm introduces the following improvements:

• Elimination of several (global) synchronization points. In particular, the algorithm does not synchronize between bulge chasing and AED steps, or when a set of tightly-coupled bulges is moved across MPI process boundaries.
• Dynamic merging of previously separate computational steps. This leads to an increased computational resource utilization.
• Shortening and prioritization of the critical path. This means that the critical path is completed sooner and lower-priority tasks are delayed until computational resources start becoming idle.
• Concurrent processing of several unreduced diagonal blocks.
• Adaptive approach for deciding when to perform a parallel AED.
• A common algorithm and implementation for single-node and multi-node environments.
• GPU acceleration.

The new StarPU-based implementation is part of the open-source StarNEig library, see Myllykoski and Kjelgaard Mikkelsen [2020a, 2020b]. The library aims to provide a complete task-based software stack for solving dense nonsymmetric standard and generalized eigenvalue problems. Both single-node and multi-node environments are targeted and some components of the library support GPUs. Currently, StarNEig implements the whole software stack for standard eigenvalue problems in single-node environments. Support for multi-node environments is currently a work in progress, but the Schur reduction phase is fully operational for both standard and generalized eigenvalue problems in both single-node and multi-node environments. The missing software components are implemented as LAPACK and ScaLAPACK wrapper functions. The development history of the StarNEig library, including the development history of the task-based QR/QZ algorithm, is documented in several NLAFET deliverable reports: Kjelgaard Mikkelsen et al. [2017], Myllykoski et al. [2018], Myllykoski et al. [2019], and Karlsson et al. [2019].

The rest of this paper is organized as follows: Section 2 summarizes the QR algorithm, including the tightly-coupled multi-shifts and AED, and discusses the related work. Section 3 describes the new task-based QR algorithm and its StarPU-based implementation. Section 4 summarizes how the software is used in single-node and multi-node environments. Section 5 demonstrates the performance of the StarPU-based implementation by comparing it against LAPACK and ScaLAPACK implementations on Intel and AMD CPUs and Nvidia GPUs. Finally, Section 6 ends the paper with final conclusions.
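To make the three-phase process concrete, the following is a minimal NumPy sketch of phase (i), reduction to upper Hessenberg form via Householder reflectors. This is an illustrative, unblocked sketch (the function name and structure are our own), not the blocked or GPU-accelerated algorithms referenced above.

```python
import numpy as np

def hessenberg(A):
    """Reduce A to upper Hessenberg form H = Q^T A Q via Householder reflectors.

    Unblocked sketch of phase (i); returns (H, Q) with Q orthogonal."""
    H = np.array(A, dtype=float)
    n = H.shape[0]
    Q = np.eye(n)
    for k in range(n - 2):
        x = H[k+1:, k].copy()
        if np.linalg.norm(x[1:]) == 0.0:
            continue  # column is already in Hessenberg shape
        v = x.copy()
        # Choose the sign that avoids cancellation.
        v[0] += (1.0 if x[0] >= 0.0 else -1.0) * np.linalg.norm(x)
        v /= np.linalg.norm(v)
        # Similarity transform with P = I - 2 v v^T embedded in rows/cols k+1:n.
        H[k+1:, k:] -= 2.0 * np.outer(v, v @ H[k+1:, k:])
        H[:, k+1:] -= 2.0 * np.outer(H[:, k+1:] @ v, v)
        Q[:, k+1:] -= 2.0 * np.outer(Q[:, k+1:] @ v, v)
    return H, Q

rng = np.random.default_rng(42)
A = rng.standard_normal((8, 8))
H, Q = hessenberg(A)
```

Phase (ii) would then apply the QR algorithm to $H$, and phase (iii) solves $(S - \lambda_i I) y_i = 0$ and back-transforms the eigenvectors as $x_i = Q y_i$.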
The overall goal of the QR algorithm is to reduce an upper Hessenberg matrix
$$H = Q_1^T A Q_1 = \begin{bmatrix} h_{1,1} & h_{1,2} & \cdots & h_{1,n} \\ h_{2,1} & h_{2,2} & \cdots & h_{2,n} \\ & \ddots & \ddots & \vdots \\ & & h_{n,n-1} & h_{n,n} \end{bmatrix} \in \mathbb{R}^{n \times n} \quad (2)$$
to real Schur form
$$S = Q_2^T Q_1^T A \underbrace{Q_1 Q_2}_{=:\,Q} = \begin{bmatrix} S_{1,1} & S_{1,2} & \cdots & S_{1,m} \\ & S_{2,2} & \cdots & S_{2,m} \\ & & \ddots & \vdots \\ & & & S_{m,m} \end{bmatrix} \in \mathbb{R}^{n \times n}, \quad (3)$$
where $Q_1, Q_2 \in \mathbb{R}^{n \times n}$ are orthogonal matrices and
$$S_{1,1}, S_{2,2}, \ldots, S_{m,m} \in \mathbb{R}^{1 \times 1} \cup \{ X \in \mathbb{R}^{2 \times 2} : \lambda(X) \in \mathbb{C} \setminus \mathbb{R} \}. \quad (4)$$
Above, $\lambda(X)$ denotes the set of all eigenvalues of a matrix $X$. Now, since the matrix $Q$ is orthogonal and the matrix $S$ is upper quasi-triangular, we have
$$\lambda(A) = \lambda(H) = \lambda(S) = \bigcup_{i=1}^{m} \lambda(S_{i,i}) \quad (5)$$
and the eigenvectors $x_i = Q y_i$ can be solved from $(S - \lambda_i I) y_i = 0$. Note that although the matrix $A$ is allowed to have complex eigenvalues and eigenvectors, all involved matrices have only real entries. If the matrix $A$ has complex entries, then the involved computational phases are otherwise identical except that the matrix $S$ is complex and triangular, and all orthogonal similarity transformations are replaced with unitary similarity transformations. Only the real case, i.e., the case where the matrix $A$ has only real entries but (possibly) complex eigenvalues and eigenvectors, is discussed in this paper.

The QR algorithm is iterative. More details will be covered in the following subsection but in summary, each iteration,
$$H \leftarrow \hat{Q}_k^T \cdots \hat{Q}_1^T H \hat{Q}_1 \cdots \hat{Q}_k, \quad Q \leftarrow Q \hat{Q}_1 \cdots \hat{Q}_k, \quad (6)$$
consists of a sequence of orthogonal transformations and produces an updated matrix $H$ that is otherwise of the form (3) but condition (4) does not necessarily hold for all diagonal blocks. If condition (4) does not hold, we say that the matrix contains unreduced blocks. Note that this is also the case with the upper Hessenberg matrix (2) since the entire matrix consists of one unreduced block.

In most cases, the matrix contains a single unreduced block that gradually shrinks as the QR algorithm deflates eigenvalues. A sub-diagonal entry can also become (numerically) zero in the middle of an unreduced block, splitting it into smaller unreduced blocks (a so-called vigilant deflation).
If an unreduced block decouples in this manner, then the QR algorithm is applied recursively to each one of the newly-created unreduced blocks. The QR algorithm repeats until a matrix that does not contain any unreduced blocks is produced or an iteration limit is reached.

Remark 1.
In the generalized case, we are given a matrix pair $(A, B) \in (\mathbb{R}^{n \times n})^2$ and we want to find generalized eigenvalues $\lambda_i \in \mathbb{C} \cup \{\infty\}$ and corresponding eigenvectors $x_i \in \mathbb{C}^n$, $x_i \neq 0$, such that $A x_i = \lambda_i B x_i$. This is done by first reducing the matrix pair $(A, B)$ to Hessenberg-triangular form $(H, R) = Q_1^T (A, B) Z_1$, where $H$ is upper Hessenberg, $R$ is upper triangular, and $Q_1, Z_1$ are orthogonal. The Hessenberg-triangular matrix pair $(H, R)$ is then further reduced to generalized Schur form
$$(S, T) = Q_2^T (H, R) Z_2 = \left( \begin{bmatrix} S_{1,1} & S_{1,2} & \cdots & S_{1,m} \\ & S_{2,2} & \cdots & S_{2,m} \\ & & \ddots & \vdots \\ & & & S_{m,m} \end{bmatrix}, \begin{bmatrix} T_{1,1} & T_{1,2} & \cdots & T_{1,m} \\ & T_{2,2} & \cdots & T_{2,m} \\ & & \ddots & \vdots \\ & & & T_{m,m} \end{bmatrix} \right) \in \left( \mathbb{R}^{n \times n} \right)^2,$$
where $Q_2, Z_2$ are orthogonal and
$$(S_{1,1}, T_{1,1}), \ldots, (S_{m,m}, T_{m,m}) \in \left( \mathbb{R}^{1 \times 1} \right)^2 \cup \left\{ (X, Y) \in \left( \mathbb{R}^{2 \times 2} \right)^2 : \hat{\lambda}(X, Y) \in \mathbb{C} \setminus \mathbb{R} \right\}.$$
Above, $\hat{\lambda}(X, Y)$ denotes the set of all generalized eigenvalues of a matrix pair $(X, Y)$. In the end, we have
$$\hat{\lambda}(A, B) = \hat{\lambda}(H, R) = \hat{\lambda}(S, T) = \bigcup_{i=1}^{m} \hat{\lambda}(S_{i,i}, T_{i,i}).$$
The used QZ algorithm is iterative and very similar to the QR algorithm:
$$H \leftarrow \hat{Q}_k^T \cdots \hat{Q}_1^T H \hat{Z}_1 \cdots \hat{Z}_k, \quad R \leftarrow \hat{Q}_k^T \cdots \hat{Q}_1^T R \hat{Z}_1 \cdots \hat{Z}_k,$$
$$Q \leftarrow Q \hat{Q}_1 \cdots \hat{Q}_k, \quad Z \leftarrow Z \hat{Z}_1 \cdots \hat{Z}_k.$$
The cases where $T_{i,i} = 0 \in \mathbb{R}^{1 \times 1}$ correspond to infinite eigenvalues. Note that the standard convention used by LAPACK is to classify a real eigenvalue as infinite if $|T_{i,i}| \leq u \|T\|_F$, where $\|\cdot\|_F$ is the Frobenius norm and $u$ is the unit roundoff. In that case, $T_{i,i} \leftarrow 0$. The task-based algorithm uses the same convention.

Before discussing the modern incarnation of the QR algorithm, we will first remind the reader about the textbook double-implicit-shift QR algorithm, see Golub and Van Loan [1996].
As illustrated in Figure 1, each iteration begins with the computation of a pair of shifts, $(\hat{\lambda}_1, \hat{\lambda}_2)$, by solving a small eigenvalue problem involving a 2-by-2 sub-matrix in the lower right corner of an unreduced block $\hat{H}$. We then compute the vector $v = (\hat{H} - \hat{\lambda}_1 I)(\hat{H} - \hat{\lambda}_2 I) e_1$, generate a 3-by-3 Householder reflector $\hat{Q}_1$ such that $\hat{Q}_1^T v$ is a multiple of $e_1$, and apply $\hat{Q}_1$ to $\hat{H}$ from both sides: $\hat{H} \leftarrow \hat{Q}_1^T \hat{H} \hat{Q}_1$.

The transformation produces a small 2-by-2 bulge in the upper left corner of the unreduced block (see Figure 1). The unreduced block is then returned back to upper Hessenberg form via a sequence of overlapping 3-by-3 Householder reflectors, $\hat{Q}_2, \ldots, \hat{Q}_k$, in a process called bulge chasing. The implicit Q theorem (see Francis [1961]) guarantees that each iteration preserves the similarity. Each iteration ends with the investigation of the sub-diagonal entries in the lower right corner of $\hat{H}$. If a sub-diagonal entry is concluded to be small enough, then it is set to zero and either a 1-by-1 or a 2-by-2 diagonal block is decoupled from $\hat{H}$, thus deflating the corresponding eigenvalue or the corresponding complex conjugate pair of eigenvalues, respectively.

By a 3-by-3 Householder reflector, we mean an identity matrix $I \in \mathbb{R}^{n \times n}$ with a Householder reflector $X \in \mathbb{R}^{3 \times 3}$ embedded in it as a diagonal block.

Fig. 1. An illustration of the first iteration of the double-implicit-shift QR algorithm: generation of the shifts, creation of the bulge, bulge chasing, and deflation check (from the top: no deflation, 1-by-1 block / real eigenvalue deflated, and 2-by-2 block / complex conjugate pair of eigenvalues deflated).
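The textbook iteration can be condensed into a short sketch: compute the two shifts from the trailing 2-by-2 block, form the first column of $(\hat{H} - \hat{\lambda}_1 I)(\hat{H} - \hat{\lambda}_2 I)$ in real arithmetic (only the sum and product of the shifts are needed), and chase the resulting bulge down the diagonal. This is a hedged, unoptimized NumPy sketch following the classic Francis step; the function names are ours and this is not StarNEig's implementation.

```python
import numpy as np

def house(v):
    """Small Householder reflector P (len(v) x len(v)) with P @ v = alpha*e1."""
    v = np.asarray(v, dtype=float)
    if np.linalg.norm(v[1:]) == 0.0:
        return np.eye(len(v))
    u = v.copy()
    u[0] += (1.0 if v[0] >= 0.0 else -1.0) * np.linalg.norm(v)
    u /= np.linalg.norm(u)
    return np.eye(len(v)) - 2.0 * np.outer(u, u)

def francis_double_shift_step(H):
    """One double-implicit-shift QR iteration on an unreduced Hessenberg H."""
    H = H.copy()
    n = H.shape[0]
    # Shifts are the eigenvalues of the trailing 2-by-2 block; only their
    # sum s and product p are needed, so the arithmetic stays real.
    s = H[n-2, n-2] + H[n-1, n-1]
    p = H[n-2, n-2]*H[n-1, n-1] - H[n-2, n-1]*H[n-1, n-2]
    # First column of (H - l1*I)(H - l2*I) = H^2 - s*H + p*I: 3 nonzero entries.
    x = H[0, 0]*H[0, 0] + H[0, 1]*H[1, 0] - s*H[0, 0] + p
    y = H[1, 0]*(H[0, 0] + H[1, 1] - s)
    z = H[1, 0]*H[2, 1]
    for k in range(n - 2):
        P = house([x, y, z])                    # 3-by-3 reflector
        r = max(0, k - 1)
        H[k:k+3, r:] = P @ H[k:k+3, r:]         # left-hand side update
        r2 = min(k + 4, n)
        H[:r2, k:k+3] = H[:r2, k:k+3] @ P.T     # right-hand side update
        if k < n - 3:
            x, y, z = H[k+1, k], H[k+2, k], H[k+3, k]
    # A final 2-by-2 reflector pushes the bulge off the bottom edge.
    P = house([H[n-2, n-3], H[n-1, n-3]])
    H[n-2:, n-3:] = P @ H[n-2:, n-3:]
    H[:, n-2:] = H[:, n-2:] @ P.T
    return H

rng = np.random.default_rng(7)
H0 = np.triu(rng.standard_normal((6, 6)), -1)   # random unreduced Hessenberg
H1 = francis_double_shift_step(H0)
```

Each step restores the Hessenberg structure and preserves the spectrum; repeated applications drive the trailing sub-diagonal entries toward zero, at which point deflation occurs.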
Remark 2.
In the generalized case, we compute a pair of shifts, $(\hat{\lambda}_1, \hat{\lambda}_2)$, by solving a small generalized eigenvalue problem involving a 2-by-2 sub-matrix pair in the lower right corner of an unreduced block $(\hat{H}, \hat{T})$. We then compute the vector $v = (\hat{H}\hat{T}^{-1} - \hat{\lambda}_1 I)(\hat{H}\hat{T}^{-1} - \hat{\lambda}_2 I) e_1$, generate a 3-by-3 Householder reflector $\hat{Q}$ such that $\hat{Q}^T v$ is a multiple of $e_1$, and apply $\hat{Q}$ to $(\hat{H}, \hat{T})$ from the left: $(\hat{H}, \hat{T}) \leftarrow \hat{Q}^T (\hat{H}, \hat{T})$. The resulting matrix pair is then returned back to Hessenberg-triangular form via an overlapping sequence of 3-by-3 Householder reflectors from the left and 3-by-3 "right-hand side" Householder reflectors from the right, see Adlerborn et al. [2014] and Watkins and Elsner [1994]. In addition to the 2-by-2 bulge being chased down the diagonal of $\hat{H}$, a 1-by-1 bulge is simultaneously chased down the diagonal of $\hat{T}$.

The modern incarnation of the QR algorithm improves upon the double-implicit-shift QR algorithm by introducing several algorithmic advances, such as tightly-coupled multi-shifts and AED. We will first discuss the tightly-coupled multi-shifts (see Braman et al. [2002a]) that are used to improve the arithmetical intensity, and thus the performance, of the bulge chasing step.

Fig. 2. An illustration of near-diagonal bulge chasing and off-diagonal updates.
Instead of introducing just a single bulge and chasing it down the diagonal, the multi-shift QR algorithm introduces a set of tightly-coupled bulges. This is done by introducing the bulges one by one and then chasing them down the diagonal just enough so that the next bulge can be introduced. The bulges are then chased down the diagonal as a single unit, in a pipelined manner, as illustrated in Figure 2. More specifically, the bulge chasing is initially performed only near the diagonal, inside a small diagonal window. The related 3-by-3 Householder reflectors are simultaneously accumulated into an accumulator matrix and only later applied to the relevant off-diagonal regions using matrix-matrix multiplications (BLAS-3). The bulge chasing step thus consists of a chain of overlapping diagonal bulge chasing windows and the corresponding accumulated left-hand and right-hand side updates. The end result is a significantly improved arithmetical intensity compared to the double-implicit-shift bulge chasing step.

Fig. 3. An illustration of multiple concurrent near-diagonal bulge chasing windows. In this example, three diagonal window chains are involved. Note that the off-diagonal updates overlap with each other.
Furthermore, the set of bulges can be divided into several sub-sets and each sub-set can be associated with a corresponding chain of overlapping diagonal windows as shown in Figure 3. This can be used to (i) increase the concurrency, by the introduction of several concurrent diagonal windows as done in ScaLAPACK, and (ii) improve the arithmetical intensity, by the reduction of the size of each diagonal window so that it fits entirely inside the L1/L2 cache of a CPU core. Note that the ScaLAPACK algorithm does not perform the near-diagonal bulge chasing and off-diagonal updates as synchronously as implied in Figure 3. However, the ScaLAPACK algorithm does use process row and column broadcasts to communicate the accumulator matrices and synchronizes when a set of bulges is moved across a process boundary.
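The accumulate-then-update idea can be sketched in a few lines: transformations generated inside a diagonal window are accumulated into a small matrix $U$, and the off-diagonal rows and columns are updated later with two large matrix products. In this sketch the rotations are arbitrary stand-ins for the 3-by-3 reflectors of the bulge chasing kernel, and the window offset and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, w0, w = 12, 4, 4        # matrix size, window offset, window size
A = rng.standard_normal((n, n))
A_ref = A.copy()

# Accumulate a few small orthogonal transformations acting inside the
# diagonal window A[w0:w0+w, w0:w0+w] (stand-ins for chasing reflectors).
U = np.eye(w)
for (i, j, theta) in [(0, 1, 0.3), (1, 2, -0.8), (2, 3, 1.1)]:
    G = np.eye(w)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = c; G[i, j] = s
    G[j, i] = -s; G[j, j] = c
    U = U @ G
    # Reference: apply every transformation to the whole matrix immediately.
    A_ref[w0:w0+w, :] = G.T @ A_ref[w0:w0+w, :]
    A_ref[:, w0:w0+w] = A_ref[:, w0:w0+w] @ G

# Deferred updates: one left-hand side and one right-hand side matrix-matrix
# product (BLAS-3) using the accumulated U.
A[w0:w0+w, :] = U.T @ A[w0:w0+w, :]
A[:, w0:w0+w] = A[:, w0:w0+w] @ U
```

The two variants produce the same matrix, but the deferred version replaces many rank-2 touches of the off-diagonal regions with two large GEMMs, which is exactly where the improved arithmetical intensity comes from.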
The tightly-coupled multi-shifts are usually combined with the AED technique, see Braman et al. [2002b]. The purpose of the AED step is twofold: (i) attempt to detect deflatable eigenvalues early in order to accelerate the convergence rate of the QR algorithm and (ii) generate the shifts that are used to introduce the next set of tightly-coupled bulges (the source of the necessary shifts is discussed in Subsection 2.3). Each AED step consists of three sub-steps as illustrated in Figure 4 and summarized below:

(1) A small diagonal window (a.k.a. AED window) in the lower right corner of an unreduced block is reduced to (real) Schur form via recursive application of the QR algorithm. The related left-hand side updates induce a spike to the left of the window and the upper Hessenberg structure of the matrix is therefore temporarily lost.

(2) The computed eigenvalue candidates (i.e., the eigenvalues of the involved AED window) are then systematically evaluated, starting from the bottommost diagonal block. If the corresponding element(s) in the same row(s) of the spike are concluded to be small enough (i.e., the so-called deflation condition is satisfied), then the element(s) in the spike are set to zero
and the eigenvalue or the complex conjugate pair of eigenvalues is thus deflated. If, on the other hand, the corresponding element(s) of the spike are concluded to be too large, then the AED window is reordered such that the diagonal block that failed the deflation check is moved above the remaining unevaluated diagonal blocks. This eigenvalue reordering procedure pushes the remaining unevaluated diagonal blocks down the diagonal. The procedure involves a set of Givens rotations and 3-by-3 Householder reflectors (see Bai and Demmel [1993] and Kågström [1993]).

(3) After all eigenvalue candidates have been evaluated, the remaining spike is eliminated by performing a small-sized Hessenberg reduction.

Fig. 4. An illustration of the second iteration of the multi-shift QR algorithm with AED.

The failed eigenvalue candidates are used as shifts during the bulge chasing step, and if the AED step did not produce enough shifts, it is repeated. The decision to repeat an AED is also sometimes made for heuristic reasons since a large number of deflated eigenvalues usually implies that the next AED is also likely to deflate a reasonable number of eigenvalues. It should be noted that the AED window is copied to a separate memory buffer and copied back only when at least one eigenvalue was successfully deflated. All orthogonal transformations are initially applied only inside the AED window and later propagated as matrix-matrix multiplications. The ScaLAPACK algorithm performs each AED in parallel using a subset of the available MPI processes.
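The spike structure that sub-step (1) creates can be demonstrated directly. In the sketch below a random orthogonal matrix stands in for the Schur vectors of the AED window (the structural argument only depends on orthogonality); all sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, w = 10, 4
H = np.triu(rng.standard_normal((n, n)), -1)   # upper Hessenberg matrix

# AED applies an orthogonal similarity V (in practice the Schur vectors of
# the AED window; here a random orthogonal stand-in) to the trailing window.
V, _ = np.linalg.qr(rng.standard_normal((w, w)))
H2 = H.copy()
H2[n-w:, :] = V.T @ H2[n-w:, :]
H2[:, n-w:] = H2[:, n-w:] @ V

# The Hessenberg structure is lost in only one column: the single sub-diagonal
# entry beta = H[n-w, n-w-1] smears into a full spike beta * V[0, :].
beta = H[n-w, n-w-1]
spike = H2[n-w:, n-w-1]
```

An eigenvalue candidate is deflatable when its spike entry is negligible (the deflation condition); otherwise it is reordered above the unevaluated blocks, and the surviving spike is eventually eliminated by the small Hessenberg reduction of sub-step (3).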
The design of a task-based algorithm begins with carefully dividing the computational work into self-contained tasks that all have well-defined inputs and outputs. The way this differs from the common practice of organizing an implementation into regular subroutines is that a task-based algorithm relies on a runtime system. The task-based implementation does not call the associated computation kernels directly; instead, it is the role of the runtime system to schedule the tasks to various computational resources, such as CPU cores and GPUs, in a sequentially consistent order as dictated by the task dependences. One of the main benefits of this approach is that as long as the algorithm is well-designed, the underlying parallelism is exposed automatically as the runtime system gradually traverses the resulting task graph.

In particular, the fact that the runtime system guarantees that the tasks are executed in a sequentially consistent order eliminates the need to explicitly synchronize the execution. Different computational steps, that were previously separated by global synchronization points, are allowed to merge and the runtime system readily exploits the available concurrency to a far higher degree than previously. Other benefits of the task-based approach include, for example, better load balancing and resource utilization due to dynamic task scheduling. This all leads to significantly more powerful algorithms and implementations that are able to dynamically adapt to different input data and ever-changing hardware configurations.
The implementation of the task-based QR algorithm is built on top of the StarPU runtime system, see Augonnet et al. [2011]. The implementation provides StarPU a set of codelets that specify how each task type is implemented. Each task type can have multiple implementations defined in the corresponding codelet, for example, one implementation for a CPU core and another implementation for a GPU. StarPU uses calibrated performance models to predict the execution and data transfer times for each computational resource, and makes the scheduling decisions based on that information. Each task is also given a priority that affects the scheduling after all task dependences have been resolved.

The input and output data are encapsulated inside data handles. In particular, the StarPU-based implementation divides the matrices into disjoint square tiles (excluding the last tile row and column) and each tile is registered with StarPU and given a data handle. In our chosen configuration, StarPU derives the task dependences from a task's input and output data handles. That is, the task graph is constructed implicitly. An application-controlled main thread inserts the tasks into StarPU and StarPU schedules the tasks to a set of worker threads. If the main thread needs the output of a task, it must acquire the corresponding data handle. This causes the main thread to wait until the task and all its dependences have been executed.

StarPU supports multi-node environments (distributed memory) through MPI. In our chosen configuration, each data handle is supplemented with a unique tag (integer identifier) and tasks that require communication are inserted in all involved MPI processes. In addition, each data handle has an owner, i.e., the MPI process where the master copy of the data resides. Other MPI processes register a placeholder data handle that may encapsulate a copy of the data during the computations.
The task graph, the tags, and the owner information provide StarPU all the necessary information for automatically performing the communications. In a sense, each MPI process has a (partial) copy of the task graph that has been supplemented with special communication tasks. A separate worker thread is dedicated to executing the communication tasks using MPI. Note that the master thread must insert explicit communication requests into StarPU in some situations, for example, when it is about to acquire a data handle.
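The implicit derivation of dependences from input and output data handles can be illustrated with a deliberately tiny toy runtime. This is not StarPU's API; the class and method names are ours, and a real runtime schedules a task only once all of its dependences have finished rather than blocking a worker.

```python
from concurrent.futures import ThreadPoolExecutor

class ToyRuntime:
    """Toy StarPU-like runtime: tasks declare which data handles they read
    and write, and dependences are derived implicitly so that execution is
    sequentially consistent with the insertion order."""

    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(workers)
        self.last_writer = {}   # handle -> future of the last writer
        self.readers = {}       # handle -> futures reading since that write

    def insert_task(self, fn, reads=(), writes=()):
        deps = []
        for h in reads:                      # read-after-write dependences
            if h in self.last_writer:
                deps.append(self.last_writer[h])
        for h in writes:                     # write-after-write / write-after-read
            if h in self.last_writer:
                deps.append(self.last_writer[h])
            deps.extend(self.readers.get(h, ()))
        fut = self.pool.submit(self._run, fn, deps)
        for h in reads:
            self.readers.setdefault(h, []).append(fut)
        for h in writes:
            self.last_writer[h] = fut
            self.readers[h] = []
        return fut

    @staticmethod
    def _run(fn, deps):
        # Simplification: block a worker until all dependences have finished.
        for d in deps:
            d.result()
        return fn()

rt = ToyRuntime()
store = {}
rt.insert_task(lambda: store.update(x=2), writes=["x"])
rt.insert_task(lambda: store.update(y=store["x"] + 3), reads=["x"], writes=["y"])
result = rt.insert_task(lambda: store["x"] * store["y"], reads=["x", "y"]).result()
rt.pool.shutdown()
```

Acquiring a result plays the role of acquiring a data handle: the caller waits until the task and all of its dependences have executed, exactly as described above for the main thread.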
The task-based algorithm is built around seven main task types:

• small Schur task reduces a (small) unreduced block to Schur form. The local transformations are accumulated into an accumulator matrix and the task returns the outcome of the reduction operation (either success or failure). The task is implemented as a call to the DHSEQR LAPACK routine (sequential multi-shift QR algorithm with AED).

• small AED task performs a sequential AED inside a diagonal window. The local transformations are accumulated into an accumulator matrix and the task returns the outcome of the AED (the number of deflated eigenvalues) and the computed shifts. The Hessenberg reduction sub-step is part of the small AED task; future work includes the separation of this sub-step into a separate task. The task implementation uses the DHSEQR, DTREXC (eigenvalue reordering), DGEHRD (Hessenberg reduction) and DORMHR (Hessenberg reduction) LAPACK routines.
• push bulges task chases a set of tightly-coupled bulges downwards the diagonal inside a diagonal window. Optionally, the task introduces and/or annihilates the bulges. The local transformations are accumulated into an accumulator matrix. Those push bulges tasks that are associated with the last set of tightly-coupled bulges finish by scanning the sub-diagonal entries in order to identify any unreduced blocks that were decoupled during the bulge chasing step. This decoupling can happen either due to vigilant deflations or entries being flushed to zero. The outcome of this sub-diagonal scan is stored to a special aftermath vector that has an entry for each row of the matrix. The task implementation chases the bulges in small sub-batches and utilizes reflector accumulation and BLAS-3 operations. The diagonal windows are always placed such that the window's lower right corner follows the edges of the underlying tiles. The size of each bulge chasing window is at most $2 b_{\text{tile}} \times 2 b_{\text{tile}}$, where $b_{\text{tile}}$ is the tile size, and the number of bulges chased by each task is bounded by the tile size.

• deflate task attempts to deflate eigenvalue candidates that fall within a diagonal window and reorders failed candidates to the upper left corner of the window. The local transformations are accumulated into an accumulator matrix and the task returns the outcome of the deflation check (the number of deflated eigenvalues and the number of successfully moved failed eigenvalue candidates). The diagonal windows are always placed such that the window's upper left corner follows the edges of the underlying tiles. The size of each deflation window is at most $2 b_{\text{tile}} \times 2 b_{\text{tile}}$ and the number of failed candidates moved by each task is bounded by the tile size. The task implementation uses the DTREXC LAPACK routine.

• small Hessenberg task reduces a diagonal window to Hessenberg form. The local transformations are accumulated into an accumulator matrix. The task is implemented as calls to the DGEHRD and DORMHR LAPACK routines.

• left update task applies an accumulator matrix from the left to a section of a matrix.

• right update task applies an accumulator matrix from the right to a section of a matrix.
Fig. 5. An illustration of how the matrix is divided into rectangular tiles and how the left-hand and right-hand side updates are cut into individual update tasks.
We will from now on refer to the left update and right update tasks as update tasks. The update tasks are implemented as a call to the dgemm BLAS routine (for CPUs) or the cublasDgemm cuBLAS routine (for GPUs). Special care has to be taken when cutting the updates into individual update tasks or we risk introducing spurious data dependences between independent tasks. For example, if two left update tasks, that were cut from the same left-hand side update, were allowed to operate on a common tile, then this overlap would induce a spurious dependence between the two tasks. The task-based algorithm avoids this problem by forming a two-dimensional stencil (see Figure 5) that separates the tiles into groups of adjacent tiles, and the updates are always cut following the edges of the stencil.

The task-based algorithm contains several auxiliary task types and the StarNEig library includes a task-based implementation of the Hessenberg reduction phase, which is used to eliminate spikes that are larger than a given threshold. For most task types, the implementation copies the contents of the intersecting tiles into a temporary buffer before the computations. The return values are stored to separate data handles and later acquired by the main thread.

Remark 3.
In the generalized case, the task-based algorithm includes one additional task type:

• push infinities task chases a set of infinite eigenvalues upwards the diagonal inside a diagonal window. The local transformations are accumulated into an accumulator matrix. The chasing is performed with a sequence of Givens rotations (see, e.g., Adlerborn et al. [2014]). The diagonal windows are always placed such that the window's upper left corner follows the edges of the underlying tiles. The size of each window is at most $b_{\text{tile}} \times b_{\text{tile}}$ and the number of infinite eigenvalues chased by each task is bounded by the tile size.

The other task types differ as follows:

• small Schur task implements a sequential version of the multi-shift QZ algorithm with AED. The near-diagonal bulge chasing is implemented on top of the DHGEQZ LAPACK routine (double-implicit-shift QZ algorithm). The implementation also uses the DTGEXC (generalized eigenvalue reordering) and DGGHRD (Hessenberg-triangular reduction) LAPACK routines.

• small AED task implementation uses the same sequential QZ algorithm as the small Schur task.

• push bulges task attempts to detect any infinite eigenvalues. Discovered infinite eigenvalues are marked in the aftermath vector.

• deflate task implementation uses the DTGEXC LAPACK routine.

• small Hessenberg task is implemented as calls to the DGGHRD LAPACK routine.
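The stencil-based cutting rule for the update tasks can be sketched in one dimension: every update is cut along a shared set of stencil edges, so any two update pieces either cover exactly the same row block or are disjoint, and no spurious partial overlaps arise. The function and the edge list below are illustrative simplifications of the two-dimensional stencil, not StarNEig code.

```python
def cut_update(rows, stencil_edges):
    """Cut a half-open row range [r0, r1) of an update into pieces along the
    stencil edges, so that pieces coming from different accumulated updates
    either coincide exactly or do not overlap at all."""
    r0, r1 = rows
    edges = sorted({r0, r1, *(e for e in stencil_edges if r0 < e < r1)})
    return list(zip(edges[:-1], edges[1:]))

# Two overlapping left-hand side updates cut along the same stencil.
stencil = [32, 64, 96]
a = cut_update((0, 64), stencil)
b = cut_update((32, 100), stencil)
```

Because both updates are cut along the same edges, a piece such as (32, 64) appears identically in both lists; the runtime then sees a genuine dependence on that shared block and nothing more.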
The QR algorithm introduces several challenges for the task-based approach, the most significantof them being the fact that the QR algorithm is iterative and contains several data-dependentbranching points. For example, it is practically impossible to predict how many eigenvalues eachAED step manages to deflate and this information is critical when deciding whether to repeat theAED step or proceed to the bulge chasing step. This means that the task-based algorithm cannotsimply insert all tasks to the runtime system and then proceed to waiting their completion. Thisa major difference to, lets say, Cholesky factorization or eigenvalue reordering (see Myllykoski[2018] and Myllykoski et al. [2017]). Instead, the task-based algorithm must observe the outcome ofcertain tasks in order make the necessary decisions and this can lead to an early exhaustion of thetask pool since no new tasks can inserted while the main thread is being blocked inside the relatedruntime system API function. This is particularly important if we wish to leverage the additionalconcurrency from processing several unreduced blocks in parallel. In that situation, the task-basedalgorithm cannot advance on any other unreduced block while the main thread is being blocked.Our solution to these problems is to adopt an event driven approach . All information on theunreduced blocks is stored into a nested list where each element corresponds to one unreducedblock. We will from now on refer to the list elements as segments . Each segment has a state and cancontain a child-segment list . Each segment list also contains a set of preset priority levels ( max prio , def prio and min prio ) that are used to assign priorities for tasks that are related to the segment , Vol. 1, No. 1, Article 1. Publication date: January 2016. Task-based Multi-shift QR/QZ Algorithm with Aggressive Early Deflation 1:11
Fig. 6. Possible segment states and associated state transitions. The child-segment lists are illustrated with double dashed lines.

The task-based algorithm repeatedly scans the list, performing an action on each segment, as illustrated in Figure 6 and summarized below:
Bootstrap state initializes the QR algorithm.

Initialization: A new segment covering the entire matrix H is created, initialized with state Bootstrap, and added to the segment list. The sub-diagonal entries of H are scanned in order to detect any preexisting unreduced blocks and this information is communicated to all MPI nodes in the form of an aftermath vector. The segment list priorities are set as follows:
• The highest task priority (max prio) is set to the maximum priority allowed by the runtime system. A larger number implies a higher priority.
• The default task priority (def prio) is set to the runtime system default priority.
• The lowest task priority (min prio) is set to the minimum priority allowed by the runtime system.
Action: Each MPI node piecewise acquires and processes the aftermath vector. (i) If a decoupled unreduced block is detected, then a new child-segment corresponding to the unreduced block is created, initialized with state New, and added to a child-segment list. At this point, the state transition function is also triggered for the new child-segment. The parent segment is marked as Decoupled. (ii) Otherwise, the segment is marked as New.

New state indicates the beginning of a new iteration.
Action: (i) If the iteration limit is reached, then the segment is marked as Failed. (ii) If the segment is small, then a small Schur task and the related update tasks are inserted, and the segment is marked as Small. (iii) If the AED window is small, then a small AED task is inserted and the segment is marked as AED small. (iv) Otherwise, the AED window is copied to a separate matrix and a new child-segment list is created for it. A new child-segment covering the entire copied matrix is created, initialized with state New, and added to the child-segment list. The parent segment is marked as AED reduce. The child-segment list priorities are set as follows:
• The highest task priority (max prio) is set to max prio_parent.
• The lowest task priority (min prio) is set to min(max prio, def prio_parent + 1).
• The default task priority (def prio) is set to ⌊(max prio + min prio)/2⌋.
See Subsection 3.5.

Small state indicates that a small Schur task has been inserted.
Action: The return status of the small Schur task is acquired. (i) If the small Schur task was executed successfully, then the segment is marked as Converged. (ii) Otherwise, the segment is marked as Failed.

Failed state indicates that the algorithm failed to reduce the segment to Schur form.
Action: The task-based algorithm exits and returns an appropriate error code.
Converged state indicates that the segment was successfully reduced to Schur form.
Action: The segment is removed from the segment list.
Decoupled state indicates that the segment has decoupled into several child-segments.
Action: The parent segment is replaced with the child-segments.
AED small state indicates that a small AED task has been inserted.

Action: The return status of the small AED task is acquired. (i) If the small AED task generated enough shifts, then the associated update and bulge chasing tasks (see Subsection 3.4) are inserted, and the segment is marked as Bulges. Communication requests for communicating the aftermath vector to all MPI nodes are inserted. (ii) Otherwise, the segment is marked as New.

AED reduce state indicates that the associated AED window is being reduced to Schur form.
Action: The state transition function is called for each segment in the child-segment list. (i) If any of the child-segments transitions to the Failed state, then the parent segment is marked as Failed. (ii) If the child-segment list is non-empty, then the parent segment retains the AED reduce state. (iii) Otherwise, the first deflate task and the related update tasks are inserted, and the parent segment is marked as AED deflate. See Subsection 3.5.
AED deflate state indicates that the associated AED window is being deflated.

Action: The return status of the deflate task is acquired. (i) If more deflate tasks are required, then the next deflate task and the related update tasks are inserted, and the segment retains the AED deflate state. The eigenvalue reordering related deflate tasks and the related update tasks are also inserted. (ii) If all eigenvalue candidates have been evaluated, then the associated update, Hessenberg reduction and bulge chasing (see Subsection 3.4) tasks are inserted, and the segment is marked as Bulges. Communication requests for communicating the aftermath vector to all MPI nodes are inserted. (iii) Otherwise, the segment is marked as New. See Subsection 3.5.
Bulges state indicates that the bulge chasing tasks have been inserted.

Action: Each MPI node piecewise acquires and processes the aftermath vector. (i) If a decoupled unreduced block is detected, then a new child-segment corresponding to the unreduced block is created, initialized with state New, and added to a child-segment list. At this point, the state transition function is also triggered for the new child-segment. The parent segment is marked as Decoupled. (ii) Otherwise, the segment is marked as New.

Fig. 7. An illustration of three unreduced diagonal blocks (segments) that are all in different states. From top to bottom: (i) 2nd AED sub-step (AED small or AED deflate), (ii) bulge chasing (Bulges), and (iii) off-diagonal right-hand side updates from an AED (Bulges, AED small or AED reduce).

As illustrated in Figure 7, the task-based algorithm allows several segments to be processed concurrently. In particular, the task-based algorithm can detect new segments even when the bulge chasing step is still active. The main thread still waits for the completion of certain tasks because the StarPU-based implementation uses the blocking variant of the data handle acquiring function. The non-blocking variant was experimented with, but the additional overhead (the main thread constantly polling the runtime system) and, more importantly, operation ordering related complications in multi-node configurations led us to abandon this approach.

Remark 4.
In the generalized case, the Bootstrap and Bulges actions also scan the aftermath vector for infinite eigenvalues. If infinite eigenvalues are discovered, they are grouped, chased upwards the diagonal and deflated as done in Adlerborn et al. [2014]. Once a group fills up (b_tile − 1 infinite eigenvalues), the related push infinities tasks and the update tasks are inserted. The task insertion is also triggered if the segment decouples into sub-segments. This approach allows the infinite eigenvalue chasing and the bulge chasing to happen simultaneously. The infinite eigenvalues are chased to the upper left corner of the unreduced block so that the involved tasks do not block the next AED.

Although the vast majority of the flops associated with the bulge chasing step are performed using BLAS-3 operations, the overall cost of the bulge chasing step is still very high in comparison to the AED step. In addition, each bulge chasing step modifies a significant fraction of the entire matrix. It is therefore important to optimize the bulge chasing step as well as possible. In particular, as demonstrated in Myllykoski [2018] and Myllykoski et al. [2017], the order in which the tasks are inserted can play a major role in two-sided matrix transformation algorithms. The order in which the tasks are inserted effectively determines the order in which the orthogonal transformations are applied.
In this section, we use G_qr to denote the sub-graph that forms the critical path of the entire QR algorithm. By critical path, we mean the longest path through the task graph when measured in terms of execution time. We say that a task feeds back to the critical path when the task is a direct or an indirect dependency of one or more tasks on the critical path. We also use G_bulges to denote the sub-graph that forms the critical path of the bulge chasing step. The sub-graphs G_qr and G_bulges are not known. However, we can postulate the following: (i) All push bulges tasks must be executed before the bulge chasing step can finish and the next AED step begin. All push bulges tasks are therefore part of both G_qr and G_bulges. (ii) All AED-related tasks that contribute to the computation of the shifts must be executed before the next bulge chasing step can begin. Many AED-related tasks are therefore part of G_qr. (iii) The sub-graph G_bulges is a sub-graph of G_qr.

Fig. 8. An illustration of a task insertion step. The task insertion order is read from left to right. All right-hand side updates above the dotted line are considered low priority as the associated right update tasks do not feed back to G_bulges. This example involves four diagonal window chains.

When the tasks are inserted, all diagonal bulge chasing window chains are processed together, in an interleaved manner, as visualized in Figure 8. During each task insertion step, the diagonal bulge chasing windows are inserted in reverse order starting from the topmost window. Both the push bulges task and the corresponding left update and right update tasks are inserted. The push bulges task is always given the highest priority (max prio) and the left update tasks are given the second highest priority (max(def prio, max prio - 1)).
The right update tasks below the dashed line are given the highest priority (max prio) and the right update tasks above the dashed line are given the default priority (def prio).

The overall goal is to always prioritize the push bulges and the left update tasks since push bulges tasks are part of G_bulges and left update tasks feed back to G_bulges. In particular, since the diagonal window chains are processed in reverse order, the left update tasks are always scheduled before the overlapping right update tasks (see Figure 8). We could, in principle, insert the tasks in an arbitrary order, in two phases: first inserting the push bulges and left update tasks, and then the right update tasks. However, the simplest approach is to process the bulge chasing windows in the described manner. (The most straightforward task insertion order would be to insert each diagonal window chain separately, starting from the first diagonal window chain. Compared to this straightforward order, the task-based algorithm inserts the diagonal window chains in an interleaved manner, in reverse order.)

Furthermore, note that all push bulges and left update tasks that are inserted during the next task insertion step are below the dashed line in Figure 8. This means that the right update tasks that update sections of the matrix above the dashed line do not feed back to G_bulges and can thus be given a lower priority, see Figure 9. The right update tasks that update sections of the matrix below the dashed line in Figure 8 do feed back to G_bulges and are therefore given the
highest priority. This encourages the runtime system to schedule these tasks as soon as possible so that they do not block the tasks that are inserted during the next task insertion step.

Fig. 9. A simplified task graph for the first diagonal window chain. The push bulges tasks are marked with W, left update tasks with L and right update tasks with R. The dashed lines show how much of the task graph feeds back to G_bulges for a given number of diagonal window chains. For example, "4 window chains" corresponds to the situation shown in Figure 8, i.e., all tasks above the third dashed line feed back to G_bulges. Note that task dependences that correspond to the other diagonal window chains are omitted.

Finally, if we had added more separation between the diagonal windows in the task insertion step, then more right update tasks would feed back to G_bulges as the distance from each diagonal window to the dashed line in Figure 8 would be longer. Therefore, given the choice between zero or some separation between the windows, the optimal configuration is the one where the diagonal windows are packed as close to each other as possible. Also, note that real-life task graphs are significantly larger than the simplified task graph presented in Figure 9. Therefore, the right update tasks that do not feed back to G_bulges form nearly half of the task graph. Furthermore, the right update tasks that compute the Q matrix form a completely separate sub-graph and constitute about one-half of the total number of update tasks. Therefore, the number of update tasks that need to be executed before the next AED can begin is reduced by almost three-quarters when compared to the ScaLAPACK algorithm.

Fig. 10. Four snapshots taken during the bulge chasing step. Note how the low-priority right update tasks are delayed until the push bulges tasks and the left update tasks can no longer saturate all workers.
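The priority rules described above can be condensed into a small selection function. This is an illustrative sketch; the names are hypothetical and the dashed-line test is abstracted into a boolean flag:

```c
/* Illustrative sketch of the bulge chasing priority rules described
   in the text: push bulges tasks get the highest priority, left
   update tasks the second highest, and right update tasks either the
   highest or the default priority depending on whether they fall
   below the dashed line (i.e., feed back to G_bulges). */
enum task_type { PUSH_BULGES, LEFT_UPDATE, RIGHT_UPDATE };

static int max_int(int a, int b) { return a > b ? a : b; }

static int task_priority(enum task_type type, int below_dashed_line,
                         int max_prio, int def_prio)
{
    switch (type) {
    case PUSH_BULGES:
        return max_prio;
    case LEFT_UPDATE:
        return max_int(def_prio, max_prio - 1);
    case RIGHT_UPDATE:
        return below_dashed_line ? max_prio : def_prio;
    }
    return def_prio;
}
```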
It is important to realize that the task insertion order simply defines the dependences between the tasks. The runtime system is allowed to choose the task scheduling order as long as it respects the dependences. The priority of a task is taken into account once all its dependences have been resolved. The outcome of all these task insertion order and task priority considerations is shown in Figure 10. Note how the low-priority right update tasks are delayed until the push bulges and left update tasks can no longer saturate all workers. In particular, the low-priority right update tasks form a wave pattern that slowly travels from the diagonal towards the upper right corner of the matrix.

Fig. 11. An illustration of the lowest priority update tasks. All updates outside the dotted triangle are considered low priority as the associated tasks do not feed back to G_qr.

The facts that the unreduced blocks shrink as the QR algorithm progresses and that the task-based algorithm can process several unreduced blocks concurrently are also taken into account. In particular, as shown in Figure 11, all update tasks that update sections of the matrix outside the unreduced blocks are given the lowest priority (min prio). These tasks do not feed back to G_qr and can be delayed.

Since each AED step is part of G_qr, it is important to prioritize all tasks that are direct or indirect dependences of the AED-related tasks. The task-based algorithm therefore calculates the size of the next AED window and effectively lifts the dashed line to the level of the upper edge of the next AED window as shown in Figure 12.
This means that the right update tasks that are direct or indirect dependences of the AED-related tasks are given higher priority and are therefore scheduled earlier. The outcome of all these additional task priority considerations can be seen in Figure 13.

Fig. 12. An illustration of the right-hand side update tasks that feed back to G_qr. All tasks that overlap with the next AED window, i.e., are below the dotted line, are given the highest priority.

Fig. 13. Four additional snapshots taken during computations. The arrows mark AEDs. From left to right: (i) an AED is being executed while other workers are still executing the low priority right update tasks, (ii) a small Hessenberg task from an AED is overlapped with the next bulge chasing step, (iii) two bulge chasing steps are overlapped with each other, and (iv) an AED is being executed while other workers are executing the lowest priority left update tasks.

Similarly to the ScaLAPACK algorithm, the task-based QR algorithm performs small AEDs sequentially and large AEDs in parallel. The decision to perform a parallel AED is made adaptively. The adaptive approach was first presented and evaluated by the author in Karlsson et al. [2019]. The basic idea is visualized in Figure 14. The task-based algorithm measures the size of the task pool (i.e., the set of all inserted tasks that are not yet scheduled to workers) at two different points in time: just after the bulge chasing tasks have been inserted and just before the task-based algorithm must decide whether to perform a sequential AED or a parallel AED. The bulge chasing tasks are consumed at a constant rate and the task-based algorithm can therefore use a linear model to predict the point in time when the task pool will be exhausted. The StarPU runtime system provides an estimate for the execution time of a sequential AED in the form of a performance model a n^b + c, where n is the size of the AED window and a, b, c are the model parameters. With all this information, the task-based algorithm can predict whether a sequential AED will finish before the task pool is exhausted. If this is not the case, then the task-based algorithm performs a parallel AED. The smallest and largest AEDs are always performed sequentially and in parallel, respectively. Note that these predictions are not exact and the adaptive approach is meant to be a heuristic.

Fig. 14. Illustration of the adaptive AED decisions. The vertical axis shows the size of the task pool and the horizontal axis shows the wall time. The left figure shows a situation where the algorithm predicts that the task pool does not get exhausted before a sequential AED finishes and the right figure shows a situation where the algorithm predicts that the task pool does get exhausted. The latter case could thus benefit from a parallel AED.

The first sub-step (recursive QR) is performed by copying the AED window to a separate matrix that has a smaller tile size (b_aed) than the main matrix. The second sub-step (deflation check) is performed by a set of deflate tasks. A deflate task has two main functions: perform deflation checks for the eigenvalue candidates that fall within a diagonal deflation window and reorder failed eigenvalue candidates to the upper left corner of the deflation window.

Fig. 15. An illustration of the second sub-step of AED with a deflation window and two reordering windows being processed in parallel. Each deflate task takes the deflation/reordering window and the corresponding section of the spike as arguments.

Once the first sub-step has finished, the first deflate task is inserted such that the corresponding deflation window is placed flush against the lower right corner of the AED window. When the state transition function returns to the corresponding segment, the main thread waits until the previous deflate task has finished. If the previous deflation window contains fewer than b_aed failed eigenvalue candidates, then the next deflate task is inserted such that the corresponding deflation window encloses the failed eigenvalue candidates. If, on the other hand, the previous deflation window contains more than b_aed failed eigenvalue candidates, then one or two eigenvalue reordering window chains are inserted. The eigenvalue reordering is performed by deflate tasks and each reordering window chain moves at most b_aed − 1 failed eigenvalue candidates. The
next deflate task is then inserted such that the lower right corner of the corresponding deflation window is placed flush against the next unevaluated eigenvalue candidate. The process repeats until all eigenvalue candidates have been evaluated.

One of the major complications here is the fact that the eigenvalue reordering can fail. That is, the kernels that are used to swap two adjacent diagonal blocks can return an error code indicating that the swapping operation is too ill-conditioned. The reason why this becomes a problem is that these failures are difficult to predict and the task-based algorithm should, in principle, investigate the outcome of each eigenvalue reordering related deflate task. Otherwise, the main thread may lose track of the locations of the failed eigenvalue candidates and cannot place the deflation and reordering windows without splitting 2-by-2 blocks. Acquiring the outcome of each deflate task would, however, limit the parallelism since the main thread would be blocked more often.

Instead, the task-based algorithm adopts a greedy approach where the task insertion phase assumes that all swaps are going to be successful and the eigenvalue reordering related deflate tasks make sure that the Schur form is retained even if the swapping kernels fail to swap one or more diagonal blocks. This approach can lead to situations where only a subset of the eigenvalue candidates are evaluated, but such situations are assumed to be quite uncommon. Therefore, an approach that favors performance is preferred.

Fig. 16. Illustrations of deflation and reordering windows. From left to right: (i) a situation where the window boundary splits two 2-by-2 blocks, (ii) a situation where the window boundary does not split any 2-by-2 blocks, (iii) matrix entries that are known to a deflate task, (iv) resized window, and (v) matrix entries that are known to the next deflate task.

The actual details of this greedy approach are visualized in Figure 16. Since a deflate task cannot distinguish between situations where the deflation/reordering window splits a 2-by-2 block (leftmost figure) and when it does not (second figure from the left), the deflate task will resize its active computational window (fourth figure from the left) when necessary. The resizing happens when the topmost diagonal block in the window is a 1-by-1 block and/or when the bottommost diagonal block in the window is a 1-by-1 block. The fact that the topmost failed eigenvalue candidate may end up being moved to the second position from the top is taken into account when the next reordering window is placed. If a failure to swap occurs, then the deflate task will simply halt the reordering. The number of failed eigenvalue candidates that were successfully moved is stored in a separate data handle and passed to the next deflate task in the eigenvalue reordering window chain.
The new StarPU-based implementation is part of the open-source StarNEig library (Myllykoski and Kjelgaard Mikkelsen [2020a, 2020b]). The details on how to use the library, and the task-based QR algorithm, are covered in the StarNEig User's Guide that is available from the StarNEig home page (https://nlafet.github.io/StarNEig/) in both PDF and HTML formats. This section briefly summarizes how to use the library in the context of the Schur reduction phase.
StarNEig implements the following interface function for the Schur reduction phase in a single-node environment:

    starneig_error_t starneig_SEP_SM_Schur(
        int n,
        double H[], int ldH,
        double Q[], int ldQ,
        double real[], double imag[]);

The following source code listing demonstrates how a dense matrix A is reduced to real Schur form:

    #include <starneig/starneig.h>

    void main()
    {
        int n = 3000, ldA = 3000, ldQ = 3000;
        double A[n*ldA], Q[n*ldQ], real[n], imag[n];

        // initialize A as a dense matrix and Q as an identity matrix
        ...

        // Initialize the StarNEig library using all CPU cores and
        // GPUs. The STARNEIG_HINT_SM flag indicates that the library
        // should initialize itself for shared memory computations.
        starneig_node_init(-1, -1, STARNEIG_HINT_SM);

        // reduce the dense matrix A to upper Hessenberg form
        starneig_SEP_SM_Hessenberg(n, A, ldA, Q, ldQ);

        // reduce the upper Hessenberg matrix A to Schur form
        starneig_SEP_SM_Schur(n, A, ldA, Q, ldQ, real, imag);

        // de-initialize the StarNEig library
        starneig_node_finalize();
    }

A user simply has to initialize the matrices (in column-major format), initialize the library, call the starneig_SEP_SM_Hessenberg interface function for reducing the matrix to upper Hessenberg form, call the starneig_SEP_SM_Schur interface function for reducing the upper Hessenberg matrix to real Schur form, and de-initialize the library. The real and imaginary components of the eigenvalues are returned as two separate arrays, real and imag, respectively.
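Since the eigenvalues are returned as separate real and imag arrays, a caller who wants C99 complex values must pair the arrays up manually. A minimal illustrative helper (not part of the StarNEig API):

```c
#include <complex.h>

/* Illustrative helper (not part of StarNEig): pack the separate
   real[] and imag[] output arrays into C99 complex eigenvalues. */
static void pack_eigenvalues(int n, const double real[],
                             const double imag[],
                             double complex lambda[])
{
    for (int i = 0; i < n; i++)
        lambda[i] = real[i] + imag[i] * I;
}
```

In a real Schur form, complex eigenvalues originate from 2-by-2 diagonal blocks, so conjugate pairs occupy adjacent positions in the two arrays.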
StarNEig implements an additional layer of structure on top of the tiling. The tiles are consolidated into disjoint rectangular blocks of uniform size (excluding the last block row and column). All tiles that belong to the same distributed block must have the same owner. Beyond that, the data distribution can be arbitrary. StarNEig implements a set of interface functions for manipulating the distributed matrices. StarNEig also implements a ScaLAPACK compatibility layer which allows a user to use BLACS contexts and descriptors (see the BLACS User's Guide by Dongarra and Whaley [1997]) with StarNEig.

StarNEig implements the following interface function for the Schur reduction phase in a multi-node environment:

    starneig_error_t starneig_SEP_DM_Schur(
        starneig_distr_matrix_t H,
        starneig_distr_matrix_t Q,
        double real[], double imag[]);

The following source code listing demonstrates how a dense matrix A is reduced to real Schur form:

    #include <mpi.h>
    #include <starneig/starneig.h>

    void main(int argc, char **argv)
    {
        int n = 3000, ldA = 3000, ldQ = 3000;
        double A[n*ldA], Q[n*ldQ], real[n], imag[n];
        starneig_distr_matrix_t lA, lQ, dA, dQ;
        int thread_support, world_rank;

        MPI_Init_thread(&argc, (char ***)&argv, MPI_THREAD_MULTIPLE,
            &thread_support);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        if (world_rank == root) {
            // the root node initializes the matrices A and Q locally
            ...
        }

        // Initialize the StarNEig library using all CPU cores and
        // GPUs. The STARNEIG_HINT_DM flag indicates that the library
        // should initialize itself for distributed memory
        // computations.
        starneig_node_init(-1, -1, STARNEIG_HINT_DM);

        // create a two-dimensional block cyclic distribution with
        // row-major ordering
        starneig_distr_t distr = starneig_distr_init_mesh(
            -1, -1, STARNEIG_ORDER_ROW_MAJOR);

        // Convert the local matrix A to a distributed matrix lA
        // that is owned by the root node. This is done in-place,
        // i.e., the matrices A and lA point to the same data.
        lA = starneig_distr_matrix_create_local(
            n, n, STARNEIG_REAL_DOUBLE, 0, A, ldA);

        // create a distributed matrix dA using the default
        // distributed block size
        dA = starneig_distr_matrix_create(
            n, n, -1, -1, STARNEIG_REAL_DOUBLE, distr);

        // copy the local matrix lA to the distributed matrix dA
        starneig_distr_matrix_copy(lA, dA);

        // scatter the matrix Q
        lQ = starneig_distr_matrix_create_local(
            n, n, STARNEIG_REAL_DOUBLE, 0, Q, ldQ);
        dQ = starneig_distr_matrix_create(
            n, n, -1, -1, STARNEIG_REAL_DOUBLE, distr);
        starneig_distr_matrix_copy(lQ, dQ);

        // reduce the full matrix dA to upper Hessenberg form
        starneig_SEP_DM_Hessenberg(dA, dQ);

        // reduce the upper Hessenberg matrix dA to Schur form
        starneig_SEP_DM_Schur(dA, dQ, real, imag);

        starneig_distr_matrix_destroy(lA);
        starneig_distr_matrix_destroy(dA);
        starneig_distr_matrix_destroy(lQ);
        starneig_distr_matrix_destroy(dQ);
        starneig_distr_destroy(distr);
        starneig_node_finalize();
        MPI_Finalize();
    }

In the above example, the root node (rank 0) initializes the matrices locally and scatters them across the nodes. The scattering approach is used here only for demonstrational purposes.
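For reference, a two-dimensional block cyclic distribution with row-major process ordering maps each distributed block to an owner roughly as follows. This is a generic sketch under the assumption of a pr-by-pc process mesh and uniform bm-by-bn distributed blocks; it is not StarNEig's actual code:

```c
/* Generic 2D block cyclic owner mapping with row-major process
   ordering: the distributed block containing element (i, j) is
   owned by the rank returned below. pr and pc are the process mesh
   dimensions; bm and bn are the block dimensions. Sketch only. */
static int block_cyclic_owner(int i, int j, int bm, int bn,
                              int pr, int pc)
{
    int bi = i / bm;  /* block row index of element (i, j) */
    int bj = j / bn;  /* block column index of element (i, j) */
    return (bi % pr) * pc + (bj % pc);
}
```

Because whole blocks map to single ranks, the rule that all tiles of a distributed block share the same owner is satisfied automatically.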
StarNEig also implements three expert interface functions for the Schur reduction phase:

    void starneig_schur_init_conf(struct starneig_schur_conf *conf);

    starneig_error_t starneig_SEP_SM_Schur_expert(
        struct starneig_schur_conf *conf,
        int n,
        double H[], int ldH,
        double Q[], int ldQ,
        double real[], double imag[]);

    starneig_error_t starneig_SEP_DM_Schur_expert(
        struct starneig_schur_conf *conf,
        starneig_distr_matrix_t H,
        starneig_distr_matrix_t Q,
        double real[], double imag[]);

The starneig_schur_conf
C structure can be used to fine-tune the behavior of the implementation.A user may, for example, select the tile size, the iteration limit, the AED windows size, the shiftcount and various thresholds. In particular, a user may choose the deflation and vigilant deflationconditions as described below.Given a sub-matrix than encloses a AED window and the corresponding spike, h i , i . . . . . . . . . . . . h i + , i . . . . . . . . . . . .... . . . . . . . . . . . . h i + k − , i h i + k − , i + k − h i + k − , i + k . . . h i + k , i h i + k , i + k − h i + k , i + k . . . . . . , (7)the deflation condition determines when the spike entries h i + k − , i and h i + k , i are small enough tobe set to zero without introducing significant perturbations. The task-based algorithm supportsthree different deflation and vigilant deflation conditions:(1) LAPACK-style conditions. If h i + k , i + k − = | h i + k , i | ≤ max ( ν ( n / u ) , u | h i + k , i + k |) , (8)where ν is the smallest floating-point number such that 1 / ν does not overflow and u is theunit roundoff, then the 1-by-1 block [ h i + k , i + k ] is deflated. If h i + k , i + k − (cid:44) (| h i + k , i | , | h i + k − , i |) ≤ max ( ν ( n / u ) , u ˆ h i + k ) , (9)where ˆ h j = (cid:112) | h j , j h j − , j − | + (cid:112) | h j − , j h j , j − | , then the 2-by-2 block (cid:20) h i + k − , i + k − h i + k − , i + k h i + k , i + k − h i + k , i + k (cid:21) is deflated. The task-based algorithm mirrors LAPACK 3.9.0 when it comes to implementa-tion of vigilant deflation checks.(2) Norm-stable conditions. If h i + k , i + k − = | h i + k , i | ≤ u (cid:107) H (cid:107) F , then the 1-by-1 block [ h i + k , i + k ] is deflated. If h i + k , i + k − (cid:44) | h i + k , i | ≤ u (cid:107) H (cid:107) F and | h i + k − , i | ≤ u (cid:107) H (cid:107) F , then the2-by-2 block (cid:20) h i + k − , i + k − h i + k − , i + k h i + k , i + k − h i + k , i + k (cid:21) is deflated (see Braman et al. [2002b]). 
Given a sub-diagonal entry $h_{i,i-1}$, a vigilant deflation occurs if $|h_{i,i-1}| \leq u \|H\|_F$.

(3) Fixed conditions. If $h_{i+k,i+k-1} = 0$ and $|h_{i+k,i}| \leq \epsilon_{\mathrm{thold}}$, where $\epsilon_{\mathrm{thold}}$ is a user-defined threshold, then the 1-by-1 block $[h_{i+k,i+k}]$ is deflated. If $h_{i+k,i+k-1} \neq 0$, $|h_{i+k,i}| \leq \epsilon_{\mathrm{thold}}$ and $|h_{i+k-1,i}| \leq \epsilon_{\mathrm{thold}}$, then the 2-by-2 block $\begin{bmatrix} h_{i+k-1,i+k-1} & h_{i+k-1,i+k} \\ h_{i+k,i+k-1} & h_{i+k,i+k} \end{bmatrix}$ is deflated. Given a sub-diagonal entry $h_{i,i-1}$, a vigilant deflation occurs if $|h_{i,i-1}| \leq \epsilon_{\mathrm{thold}}$.

More details are included in the StarNEig User's Guide.

Remark 5. For the generalized case, StarNEig implements four interface functions:

    starneig_error_t starneig_GEP_SM_Schur(int n,
        double H[], int ldH, double R[], int ldR,
        double Q[], int ldQ, double Z[], int ldZ,
        double real[], double imag[], double beta[]);

    starneig_error_t starneig_GEP_DM_Schur(
        starneig_distr_matrix_t H, starneig_distr_matrix_t R,
        starneig_distr_matrix_t Q, starneig_distr_matrix_t Z,
        double real[], double imag[], double beta[]);

    starneig_error_t starneig_GEP_SM_Schur_expert(
        struct starneig_schur_conf *conf, int n,
        double H[], int ldH, double R[], int ldR,
        double Q[], int ldQ, double Z[], int ldZ,
        double real[], double imag[], double beta[]);

    starneig_error_t starneig_GEP_DM_Schur_expert(
        struct starneig_schur_conf *conf,
        starneig_distr_matrix_t H, starneig_distr_matrix_t R,
        starneig_distr_matrix_t Q, starneig_distr_matrix_t Z,
        double real[], double imag[], double beta[]);

The generalized eigenvalues are returned as pairs of numbers $(\alpha, \beta)$ such that $\alpha/\beta$, $\beta \neq 0$, gives the generalized eigenvalue. The case $\beta = 0$ corresponds to an infinite eigenvalue and the case $\alpha = \beta = 0$ corresponds to an indefinite eigenvalue.

This section demonstrates the performance of the StarPU-based implementation (StarNEig) with sets of computational experiments. The task-based algorithm is compared against LAPACK (DHSEQR) and ScaLAPACK (PDHSEQR) in both single-node (shared/distributed memory) and multi-node (distributed memory) environments. The GPU performance and scalability of the StarPU-based implementation are also demonstrated.

Remark 6. Only the standard case is discussed in this section. See Myllykoski et al. [2019] for preliminary results relating to the computation of the generalized Schur form using the QZ algorithm. Note that the problems relating to the detection of infinite eigenvalues have been fixed in StarNEig version 0.1-beta.4.
All computational experiments were performed using two systems, Kebnekaise and Abisko, located at the High Performance Computing Center North (HPC2N), Umeå University, Sweden. Kebnekaise is a heterogeneous system consisting of many different types of Intel- and Nvidia-based nodes. The node types that are used in this paper are the following:
Broadwell node contains 28 Intel Xeon E5-2690v4 cores organized into 2 NUMA islandswith 14 cores each and 128 GB memory. The nodes are connected with FDR Infiniband. Allmulti-node experiments were performed on these nodes.
Skylake node contains 28 Intel Xeon Gold 6132 cores organized into 2 NUMA islands with14 cores each and 192 GB memory. Most single-node experiments were performed on thesenodes.
V100 GPU node contains 28 Intel Xeon Gold 6132 cores organized into 2 NUMA islands with 14 cores each and 192 GB memory. Each NUMA island is connected to a single NVIDIA Tesla V100 GPU with 16 GB memory. All GPU experiments were performed on these nodes.

Abisko is an older system based on the AMD Interlagos family of CPUs:
Abisko node contains 48 AMD Opteron 6238 cores organized into 8 NUMA islands (4 sockets)with 6 cores each and 128 GB memory. Two neighboring cores share a floating point unit(FPU), i.e., the total number of FPUs per node is 24. The nodes are connected with Mellanox4X QSFP 40 Gb/s InfiniBand.
Kebnekaise configuration.
The software was compiled with GCC 8.3.0 and linked to OpenMPI 3.1.4, OpenBLAS 0.3.7, ScaLAPACK 2.0.2, CUDA 10.1.243, StarPU 1.3.3 and StarNEig 0.1.1. For the GPU and Hessenberg reduction experiments, the software was linked to StarPU 1.2.10 (with --disable-cuda-memcpy-peer) and StarNEig 0.1.3, as the StarPU 1.3 series performs worse than the StarPU 1.2 series in these particular contexts and StarNEig 0.1.3 contains modifications to the Hessenberg reduction phase. Only 25 cores were used in most single-node experiments because this choice leads to a square 5-by-5 MPI process mesh. A square MPI process mesh usually leads to better performance with the PDHSEQR routine because for a pm-by-pn MPI process mesh, the maximum number of concurrent near-diagonal bulge chasing windows is given by min(pm, pn). If all 28 cores were used, then the maximum number of concurrent near-diagonal bulge chasing windows would have been min(4, 7) = 4.

Abisko configuration.
The software was compiled with GCC 8.2.0 and linked to OpenMPI 3.1.3, OpenBLAS 0.3.5, ScaLAPACK 2.0.2, StarPU 1.3.3 and StarNEig 0.1.1. Since two neighboring cores share an FPU, only every other core, a total of 24 cores, was used in the experiments. With PDHSEQR, the maximum number of concurrent near-diagonal bulge chasing windows was therefore min(4, 6) = 4.

LAPACK configuration.
The
DHSEQR routine was parallelized by enabling the OpenBLAS multi-threading support. This means that each individual off-diagonal BLAS-3 update is performed in parallel.
ScaLAPACK configuration.
The matrices were distributed in 160-by-160 blocks. Initial experiments indicated that a smaller block size leads to decreased performance and that a larger block size does not provide any significant performance benefit. ScaLAPACK was configured to use a one-process-per-core (1ppc) topology. Note that the version of the PDHSEQR routine that is provided with ScaLAPACK 2.0.2 is known to be buggy and StarNEig was therefore compared against an updated version of the routine, see Granat et al. [2015a, 2015b].
StarNEig configuration.
One core was allocated for inserting the tasks into StarPU, i.e., the number of StarPU workers (including the MPI worker) was set to max(1, p − 1), where p is the number of available cores on a node. In multi-node experiments, StarNEig was configured to use a library-defined default distributed block size and a one-process-per-node (1ppn) topology.

Test driver configuration.
The StarNEig test driver random number seed was set to 2020. For most experiments, only one run was performed for each measurement. This was done for the following reasons: (i) Access to the Kebnekaise and Abisko systems is limited and the number of performed experiments is quite large (and not all of them are even included in this paper). Also note that the validation of the results took in most cases far longer than the actual computation. (ii) Most measurements took a few minutes or more, and are therefore not susceptible to minor interference from the operating system and other external sources. Furthermore, all experiments were performed with exclusive access to the node. (iii) None of the conclusions of the paper depend on an individual measurement. (iv) In previous studies, the coefficient of variation has been at most a few percent, usually around 1-2%. (v) The GPU experiments are generally noisier and in those experiments the median of three runs is reported. In addition, the performance models were pre-calibrated with
STARPU_CALIBRATE=1 and STARPU_SCHED=random for more consistent performance.
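The pre-calibration step can be reproduced roughly as follows. The two StarPU environment variables come from the text; the test-driver invocation and its arguments are hypothetical:

```shell
# Pre-calibrate the StarPU performance models (hypothetical driver invocation):
export STARPU_CALIBRATE=1   # force (re)calibration of the performance models
export STARPU_SCHED=random  # random scheduler, so tasks are sampled on different workers
./starneig-test-driver --experiment schur --n 40000

# The timed runs then reuse the calibrated models with the default scheduler:
unset STARPU_CALIBRATE STARPU_SCHED
./starneig-test-driver --experiment schur --n 40000
```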
Table 1. Sparse test matrices selected from the SuiteSparse Matrix Collection.
Name | n | Nonzeros | Density | Kind
g7jac020 | 5 850 | 42 568 | 0.12 % | Economic Problem
sinc15 | 11 532 | 551 184 | 0.41 % | Materials Problem
ex11 | 16 614 | 1 096 948 | 0.40 % | Computational Fluid Dynamics Problem
ns3Da | 20 414 | 1 679 599 | 0.40 % | Computational Fluid Dynamics Problem
TSOPF_RS_b2052_c1 | 25 626 | 6 761 100 | 1.03 % | Power Network Problem
invextr1_new | 30 412 | 1 793 881 | 0.19 % | Computational Fluid Dynamics Problem
g7jac120sc | 35 550 | 412 306 | 0.03 % | Economic Problem
av41092 | 41 092 | 1 683 902 | 0.10 % | 2D/3D Problem

Three different classes of test matrices were considered:

(1)
Sparse real-world matrices.
A set of sparse nonsymmetric matrices was selected from the SuiteSparse Matrix Collection (formerly known as Tim Davis's collection), see Table 1. These test matrices were included because the behavior of the QR algorithm is sensitive to the properties of the matrix, such as the distribution of the eigenvalues. It is therefore vital that the performance of StarNEig is demonstrated with real-world matrices. Using a set of dense real-world matrices would have been a more obvious choice, but a database comparable to the SuiteSparse Matrix Collection simply does not exist for dense matrices. The sparse matrices are also more easily accessible for anyone due to their small storage footprint.

(2)
Synthetic matrices with known eigenvalues (syn n). These matrices are generated by first forming a random upper triangular matrix $\bar{T} \in \mathbb{R}^{n \times n}$ (entries uniformly distributed on the interval $[-1, 1]$). The known eigenvalues, $\lambda_1, \ldots, \lambda_n$, 50% of them being complex conjugate pairs, are then placed on the diagonal of $\bar{T}$ to form a Schur matrix $\bar{S}$. The Schur matrix $\bar{S}$ is then multiplied from both sides with a random Householder reflector $\bar{Q} \in \mathbb{R}^{n \times n}$ to generate a dense matrix $A = \bar{Q} \bar{S} \bar{Q}^T$. The known eigenvalues are initially selected to be $\pm 1, \pm 2, \pm 3, \ldots$. The complex conjugate pairs of eigenvalues, $(\lambda_i, \lambda_{i+1})$, are then randomly selected and lifted from the real line by setting
$$\lambda_i \leftarrow \mathrm{Re}(\lambda_i) + \mathrm{i}\,|\mathrm{Re}(\lambda_i)|, \qquad \lambda_{i+1} \leftarrow \mathrm{Re}(\lambda_i) - \mathrm{i}\,|\mathrm{Re}(\lambda_i)|. \tag{10}$$

(3) Hessrand matrices (hess n). Let
$N(\mu, \sigma^2)$ be the normal distribution with mean $\mu$ and variance $\sigma^2$, and let $\chi^2(\tau)$ be the chi-squared distribution with $\tau$ degrees of freedom. As done in Braman et al. [2002a], the upper Hessenberg matrix $H \in \mathbb{R}^{n \times n}$ is generated directly as follows:
$$h_{i,j} \sim N(0, 1), \quad i = 1, \ldots, j, \; j = 1, \ldots, n, \qquad h_{i+1,i} \sim \chi^2(n - i), \quad i = 1, \ldots, n - 1. \tag{11}$$

Given a computed orthogonal similarity transformation $A \approx U X U^T$, we measure the backward error with the following relative residual:
$$R_A(X, U) = \frac{\|U X U^T - A\|_F}{u \|A\|_F}, \tag{12}$$
where $\|\cdot\|_F$ is the Frobenius norm and $u$ is the unit roundoff ($u = 2^{-53} \approx 1.1 \cdot 10^{-16}$), see Higham and Higham [1998]. Similarly, we measure the loss of orthogonality with the relative residual
$$R_{\mathrm{orth}}(U) = \frac{\|U U^T - I\|_F}{u \|I\|_F}. \tag{13}$$
Note that the residuals are always computed with respect to the original matrix $A$. Finally, given a computed Schur decomposition $A \approx Q \bar{S} Q^T$, we measure for each computed eigenvalue $\bar{\lambda} \in \lambda(\bar{S})$ the relative error
$$E_\lambda(\bar{\lambda}) = \min_{\lambda \in \lambda(A)} \frac{|\bar{\lambda} - \lambda|}{u |\lambda|}. \tag{14}$$
We report both the mean error and the maximum error:
$$E_{\lambda,\mathrm{mean}} = \frac{1}{n} \sum_{\bar{\lambda} \in \lambda(\bar{S})} E_\lambda(\bar{\lambda}) \quad \text{and} \quad E_{\lambda,\mathrm{max}} = \max_{\bar{\lambda} \in \lambda(\bar{S})} E_\lambda(\bar{\lambda}). \tag{15}$$

The so-called standard algorithm by Quintana-Ortí and van de Geijn [2006] for reducing a nonsymmetric matrix $A$ to upper Hessenberg form $H$ is known to be memory bound and thus takes longer than the Schur reduction phase. However, as shown in Table 2, the Hessenberg reduction phase can be accelerated with a GPU. This means that the Hessenberg reduction phase can actually be performed faster than the Schur reduction phase and any improvements made to the Schur reduction phase are therefore meaningful. StarNEig (and MAGMA, see Tomov and Dongarra [2009]) currently support only a single GPU but multi-GPU support is being developed while this paper is being prepared.
Note that StarNEig would not benefit from the additional 14 cores from the second socket because the standard algorithm is memory bound even when the memory-bound operations are performed on a GPU.

Table 2. A comparison between a Skylake node (25 cores) and a single socket of a V100 GPU node (14 cores + GPU) when computing a Hessenberg form, and the associated residuals from the former set of experiments.
Matrix | Run time, CPU only [s] | Run time, CPU + GPU [s] | R_A(H, Q) | R_orth(Q)
g7jac020 | 10 | 2 | 11 | 10
sinc15 | 56 | 13 | 10 | 10
ex11 | 150 | 32 | 9 | 10
ns3Da | 266 | 55 | 20 | 10
TSOPF_RS.. | 507 | 105 | 10 | 11
invextr1.. | 814 | 167 | 16 | 10
g7jac120s.. | 1268 | – | 10 | 10
av41092 | 1923 | – | 25 | 11

Table 3. A comparison between the LAPACK-style deflation condition (left in each column-pair) and the norm-stable deflation condition (right in each column-pair) on a Skylake node (25 cores) when computing a Schur form from a Hessenberg form.
Matrix | Run time [s] | R_A(S, Q) | R_orth(Q) | E_λ,mean | E_λ,max
syn10000 | 14 13 | 238 267 | 157 160 | 62 65 | 505 504
syn20000 | 68 63 | 336 356 | 237 223 | 92 91 | 1488 912
syn40000 | 473 427 | 619 549 | 412 349 | 181 150 | 3462 1780
g7jac020 | 3 3 | 112 88 | 93 67 | – – | – –
sinc15 | 8 5 | 103 87 | 83 65 | – – | – –
ex11 | 43 36 | 633 612 | 145 130 | – – | – –
ns3Da | 54 44 | 231 205 | 147 130 | – – | – –
TSOPF_RS.. | 223 135 | 472 372 | 236 188 | – – | – –
invextr1.. | 410 304 | 1393 1011 | 590 352 | – – | – –
g7jac120s.. | 277 186 | 242 165 | 240 147 | – – | – –
av41092 | 478 204 | 418 277 | 231 176 | – – | – –

Before discussing the performance, we must first discuss the deflation condition and how it affects the accuracy. StarNEig defaults to the norm-stable deflation condition, but since both the
DHSEQR and
PDHSEQR routines use the LAPACK-style deflation condition, all computational experiments were conducted using the LAPACK-style deflation condition. As shown in Table 3, the choice of deflation condition impacts both the performance and the accuracy. In general, the norm-stable deflation condition leads to a shorter run time, which is not surprising since the norm-stable deflation condition is less tight than the LAPACK-style deflation condition. However, what might at first appear surprising is the fact that the norm-stable deflation condition can actually lead to a more accurate solution. The differences are small and likely attributable to the rounding errors resulting from the additional QR iterations due to the tighter deflation condition.
The first set of experiments demonstrates the performance of StarNEig in comparison to LAPACK and ScaLAPACK on a Skylake node (Table 4) and an Abisko node (Table 5). The quantities E_λ,mean and E_λ,max are listed separately in Table 6 when applicable. The results show that when compared
Table 4. A comparison between LAPACK (left in each column-triplet), ScaLAPACK (middle in each column-triplet) and StarNEig (right in each column-triplet) on a Skylake node (25 cores) when computing a Schur form from a Hessenberg form.
Matrix | Run time [s] | R_A(S, Q) | R_orth(Q)
syn10000 | 62 37 | 276 174 | 168 113
syn20000 | 325 198 | 333 289 | 202 192
syn40000 | 2052 2995 | 492 619 | 263 396
g7jac020 | 11 8 | 87 84 | 69 65
sinc15 | 27 21 | 62 58 | 89 66
ex11 | 161 104 | 155 176 | 101 105
ns3Da | 222 157 | 174 171 | 120 117
TSOPF_RS.. | 864 465 | 65 100 | 171 171
invextr1.. | 1077 644 | 443 510 | 167 174
g7jac120s.. | 784 653 | 103 117 | 123 137
av41092 | 1486 893 | 222 229 | 162 168
Table 5. A comparison between LAPACK (left in each column-triplet), ScaLAPACK (middle in each column-triplet) and StarNEig (right in each column-triplet) on an Abisko node (24 cores/FPUs) when computing a Schur form from a Hessenberg form.

Matrix | Run time [s] | R_A(S, Q) | R_orth(Q)
syn10000 | 186 108 | 283 228 | 174 144
syn20000 | 1082 650 | 340 422 | 206 295
syn40000 | 7181 5148 | 490 982 | 260 746
g7jac020 | 38 19 | 88 83 | 69 64
sinc15 | 130 51 | 57 60 | 69 71
ex11 | 563 340 | 159 122 | 100 159
ns3Da | 752 521 | 172 205 | 118 147
TSOPF_RS.. | 3108 2050 | 62 68 | 172 262
invextr1.. | 3924 2542 | 447 513 | 167 305
g7jac120s.. | 3025 3262 | 104 165 | 123 228
av41092 | 4944 2823 | 222 271 | 161 212
Table 6. An eigenvalue accuracy comparison between LAPACK (left in each column-triplet), ScaLAPACK (middle in each column-triplet) and StarNEig (right in each column-triplet) on a Skylake node (25 cores) and an Abisko node (24 cores/FPUs) when computing a Schur form from a Hessenberg form.

Matrix | Skylake node: E_λ,mean, E_λ,max | Abisko node: E_λ,mean, E_λ,max
syn10000 | 78 44, 483 397 | 78 64, 561 392
syn20000 | 89 84, 91 146 | 992 1672
syn40000 | 116 179 | 115 348

against LAPACK, StarNEig is between 2.6 and 4.8 (average 3.7) times faster on a Skylake node and between 3.8 and 6.8 (average 5.2) times faster on an Abisko node. When compared against ScaLAPACK, StarNEig is between 1.6 and 6.3 (average 2.8) times faster on a Skylake node and between 2.1 and 4.3 (average 3.2) times faster on an Abisko node. Note that although StarNEig is slightly less accurate than LAPACK and ScaLAPACK, the differences are very small, in particular considering the observations made in Subsection 5.6. The differences are likely attributable to the way the algorithms compute the shifts since, under certain conditions, the LAPACK and ScaLAPACK algorithms recompute the shifts after an AED in order to improve their accuracy. More accurate shifts can lead to fewer iterations and thus fewer roundoff errors. The same functionality has been planned as an option for StarNEig.

Fig. 17. Single-node parallel efficiency on a Skylake node (left) and an Abisko node (right) when computing a Schur form from a Hessenberg form hess n.

The second set of experiments demonstrates the scalability of StarNEig in a single-node environment, see Figure 17. The sharp dip at 4 cores in both graphs is due to the fact that StarNEig is configured to use only 3 StarPU workers (see Subsection 5.2) and the relative loss in the available performance is at its highest in that instance. If the number of StarPU workers was set to equal the number of available cores, then the dip at 4 cores would disappear, but the performance would dip at 28 cores, since in that case, the thread that inserts the tasks into StarPU would share a core with a StarPU worker thread.
A similar effect could in principle occur at 1 core, and thus affect the shapes of the graphs, but this does not seem to occur in practice. Note that both Skylake (base 2.60 GHz, turbo 3.70 GHz) and Abisko (base 2.60 GHz, turbo 3.20 GHz) nodes run at a higher clock rate under a single-core load. The exact Skylake boost table is proprietary information and has not been released by the manufacturer. Therefore, neither graph is corrected for the changing clock rate.
The fourth set of experiments demonstrates the performance of StarNEig in comparison to ScaLAPACK on multiple Broadwell nodes (Table 7) and multiple Abisko nodes (Table 8). Both ScaLAPACK and StarNEig were configured to use a square MPI process mesh and ScaLAPACK was always given the advantage in terms of the number of cores. The results show that when compared against ScaLAPACK, StarNEig is between 2.3 and 3.0 (average 2.6) times faster on the Broadwell nodes and between 1.9 and 2.3 (average 2.1) times faster on the Abisko nodes.
Table 7. A comparison between ScaLAPACK (left in each column-pair) and StarNEig (right in each column-pair) on multiple Broadwell nodes when computing a Schur form from a Hessenberg form.

syn n | Cores | Run time [s] | R_A(S, Q) | R_orth(Q) | E_λ,mean | E_λ,max
10 000 | 121 112 | 27
20 000 | 121 112 | 95
40 000 | 121 112 | 451
60 000 | 256 252 | 1004
80 000 | 256 252 | 1679
100 000 | 256 252 | 3366
120 000 | 484 448 | – – – – –
140 000 | 484 448 | – – – – –
160 000 | 484 448 | – – – – –

Table 8. A comparison between ScaLAPACK (left in each column-pair) and StarNEig (right in each column-pair) on multiple Abisko nodes when computing a Schur form from a Hessenberg form.

syn n | Cores/FPUs | Run time [s] | R_A(S, Q) | R_orth(Q) | E_λ,mean | E_λ,max
10 000 | 100 96 | 58
20 000 | 100 96 | 206
40 000 | 100 96 | 1049
60 000 | 225 216 | 2013
80 000 | 225 216 | 4091
100 000 | 225 216 | 8189
Fig. 18. Multi-node scalability on Broadwell nodes when computing a Schur form from a Hessenberg form hess n.

Fig. 19. Multi-node scalability on Abisko nodes when computing a Schur form from a Hessenberg form hess n.

The fifth set of experiments demonstrates the scalability of StarNEig on multiple nodes. The tile size was fixed to 248-by-248 and the distributed block size to 1984-by-1984. Overall, the scalability is very good when moving from a single node to four nodes, but after that the benefits from the additional nodes begin to shrink. Note that the behavior of the QR algorithm is sensitive to the properties of the matrix. In particular, some matrices require only a few bulge chasing steps and thus converge faster but offer fewer opportunities for concurrency.

Table 9. Run time (in seconds) on a V100 GPU node (28 cores + 0/1/2 GPUs) when computing a Schur form from a Hessenberg form.
Matrix | CPU only | CPU + 1 GPU | CPU + 2 GPUs
syn10000 | 14 | 17 | 16
syn20000 | 64 | 61 | 55
syn40000 | 437 | 285 | 273
g7jac020 | 3 | 4 | 4
sinc15 | 9 | 10 | 10
ex11 | 39 | 34 | 30
ns3Da | 54 | 56 | 50
TSOPF_RS.. | 215 | 169 | 159
invextr1.. | 373 | 160 | 143
g7jac120s.. | 254 | 141 | 137
av41092 | 454 | 305 | 316

The third set of experiments demonstrates StarNEig's GPU performance, see Table 9. The introduction of a single GPU leads to a speedup between 0.8 and 2.3 (average 1.3) and the introduction of two GPUs leads to a speedup between 0.8 and 2.6 (average 1.4). The speedups are computed with respect to the CPU-only results. Note that the GPU support, and the multi-GPU support in particular, are still considered experimental features. Also note that in the case of the smaller matrices and with 28 cores, the execution time is likely to be bounded from below by the CPU-only near-diagonal bulge chasing tasks, which are part of the critical path of the QR algorithm. The potential benefits GPUs could have are thus very limited.
This paper presented a task-based multi-shift QR algorithm with AED and its StarPU-based implementation. The task-based algorithm inherits previous algorithmic improvements, such as tightly-coupled multi-shifts and Aggressive Early Deflation (AED), from the latest LAPACK and ScaLAPACK algorithms. The task-based algorithm incorporates several new ideas that significantly improve the performance. These ideas include the elimination of several synchronization points, the dynamic merging of previously separate computational steps, the shortening and the prioritization of the critical path, and the introduction of experimental GPU support. The implementation was demonstrated to be significantly faster than multi-threaded LAPACK and ScaLAPACK in both single-node and multi-node environments on Intel- and AMD-based machines. Furthermore, the experimental GPU acceleration was demonstrated to be functional in a single-GPU configuration.
ACKNOWLEDGMENTS
StarNEig authors.
StarNEig has been developed by the author (Hessenberg reduction, Schur reduction, eigenvalue reordering, integration and testing), Carl Christian Kjelgaard Mikkelsen (generalized eigenvectors and theoretical contributions to the robust computation of eigenvectors), Angelika Schwarz (standard eigenvectors), Lars Karlsson (miscellaneous contributions to the library and theoretical contributions to the robust computation of eigenvectors), and Bo Kågström (coordinator and scientific director of the entire NLAFET project).
Other acknowledgments.
The author would like to separately extend his gratitude to Dr. Carl Christian Kjelgaard Mikkelsen for his valuable comments and suggestions during the preparation of this paper. The author also acknowledges the work of Mahmoud Eljammaly and Björn Adlerborn during the NLAFET project. The author thanks the High Performance Computing Center North (HPC2N) at Umeå University for providing computational resources and valuable support during test and performance runs, and the StarPU team at INRIA for rapidly providing good answers to several questions related to StarPU.
REFERENCES
B. Adlerborn, B. Kågström, and D. Kressner. 2014. A Parallel QZ Algorithm for Distributed Memory HPC Systems. SIAM J. Sci. Comput. 36, 5 (2014), C480–C503. https://doi.org/10.1137/140954817
Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2011. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Concurrency and Computation: Practice and Experience 23, 2 (Feb. 2011), 187–198. https://doi.org/10.1002/cpe.1631
Zhaojun Bai and James Demmel. 1989. On a Block Implementation of Hessenberg Multishift QR Iteration. Int. J. High Speed Comput. 1, 1 (April 1989), 97–112. https://doi.org/10.1142/S0129053389000068
Z. Bai and J. W. Demmel. 1993. On swapping diagonal blocks in real Schur form. Linear Algebra Appl. 186 (1993), 73–95.
K. Braman, R. Byers, and R. Mathias. 2002a. The multishift QR algorithm. Part I: Maintaining well-focused shifts and level 3 performance. SIAM J. Matrix Anal. Appl. 23, 4 (2002), 929–947. https://doi.org/10.1137/S0895479801384573
K. Braman, R. Byers, and R. Mathias. 2002b. The multishift QR algorithm. Part II: Aggressive early deflation. SIAM J. Matrix Anal. Appl. 23, 4 (2002), 948–973. https://doi.org/10.1137/S0895479801384585
Ralph Byers. 2007. LAPACK 3.1 xHSEQR: Tuning and implementation notes on the small bulge multi-shift QR algorithm with aggressive early deflation. (April 2007).
Jack Dongarra and R. Clint Whaley. 1997. A User's Guide to the BLACS. LAWN 94.
J. G. F. Francis. 1961. The QR Transformation: A Unitary Analogue to the LR Transformation, Part 1. Comput. J. 4, 3 (1961), 265–271. https://doi.org/10.1093/comjnl/4.3.265
G. H. Golub and C. F. Van Loan. 1996. Matrix Computations (4th ed.). Johns Hopkins University Press, Baltimore, MD.
R. Granat, B. Kågström, D. Kressner, and M. Shao. 2015a. Algorithm 953: Parallel Library Software for the Multishift QR Algorithm with Aggressive Early Deflation. ACM Trans. Math. Software 41, 4 (2015), 1–23. https://doi.org/10.1145/2699471
R. Granat, B. Kågström, D. Kressner, and M. Shao. 2015b. Algorithm 953: Parallel Library Software for the Multishift QR Algorithm with Aggressive Early Deflation, Electronic Appendix: Derivation of the Performance Model. ACM Trans. Math. Software 41, 4 (2015). https://doi.org/10.1145/2699471
Desmond J. Higham and Nicholas J. Higham. 1998. Structured Backward Error and Condition of Generalized Eigenvalue Problems. SIAM J. Matrix Anal. Appl. 20, 2 (1998), 493–512. https://doi.org/10.1137/S0895479896313188
B. Kågström. 1993. A Direct Method for Reordering Eigenvalues in the Generalized Real Schur Form of a Regular Matrix Pair (A, B). Springer Netherlands, Dordrecht, 195–218. https://doi.org/10.1007/978-94-015-8196-7_11
Lars Karlsson, Mahmoud Eljammaly, and M. Myllykoski. 2019. D6.5: Evaluation of Auto-tuning Techniques. Technical Report. Department of Computing Science, Umeå University, SE-901 87 Umeå, Sweden.
C. C. Kjelgaard Mikkelsen, M. Myllykoski, Björn Adlerborn, Lars Karlsson, and B. Kågström. 2017. D2.5: Eigenvalue Problem Solvers. Technical Report. Department of Computing Science, Umeå University, SE-901 87 Umeå, Sweden.
M. Myllykoski. 2018. A Task-Based Algorithm for Reordering the Eigenvalues of a Matrix in Real Schur Form. In Parallel Processing and Applied Mathematics, PPAM 2017 (Lecture Notes in Computer Science), Vol. 10777. Springer International Publishing, 207–216. https://doi.org/10.1007/978-3-319-78024-5_19
M. Myllykoski, Lars Karlsson, Bo Kågström, Mahmoud Eljammaly, Srikara Pranesh, and Mawussi Zounon. 2018. D2.6: Prototype Software for Eigenvalue Problem Solvers. Technical Report. Umeå University, The University of Manchester.
M. Myllykoski, C. C. Kjelgaard Mikkelsen, A. Schwarz, and B. Kågström. 2019. D2.7: Eigenvalue Solvers for Nonsymmetric Problems. Technical Report. Department of Computing Science, Umeå University, SE-901 87 Umeå, Sweden.
M. Myllykoski and C. C. Kjelgaard Mikkelsen. 2020a. Introduction to StarNEig: A Task-based Library for Solving Nonsymmetric Eigenvalue Problems. In Parallel Processing and Applied Mathematics, PPAM 2019 (Lecture Notes in Computer Science), Vol. 12043. Springer Nature Switzerland AG.
M. Myllykoski and C. C. Kjelgaard Mikkelsen. 2020b. Task-based, GPU-accelerated and Robust Library for Solving Dense Nonsymmetric Eigenvalue Problems. Concurrency and Computation: Practice and Experience (2020). https://doi.org/10.1002/cpe.5915
M. Myllykoski, C. C. Kjelgaard Mikkelsen, L. Karlsson, and B. Kågström. 2017. Task-Based Parallel Algorithms for Reordering of Matrices in Real Schur Form. NLAFET Working Note WN-11. Also as Report UMINF 17.11, Department of Computing Science, Umeå University, SE-901 87 Umeå, Sweden.
Gregorio Quintana-Ortí and Robert van de Geijn. 2006. Improving the Performance of Reduction to Hessenberg Form. ACM Trans. Math. Software 32, 2 (2006). https://doi.org/10.1145/1141885.1141887
Samuel Thibault. 2018. On Runtime Systems for Task-based Programming on Heterogeneous Platforms. Habilitation à diriger des recherches, Université de Bordeaux.
Stanimire Tomov and Jack Dongarra. 2009. Accelerating the Reduction to Upper Hessenberg Form through Hybrid GPU-based Computing. Technical Report 219, LAPACK Working Note.
David Watkins and Ludwig Elsner. 1994. Theory of Decomposition and Bulge-Chasing Algorithms for the Generalized Eigenvalue Problem.