Task-based, GPU-accelerated and Robust Library for Solving Dense Nonsymmetric Eigenvalue Problems
Mirko Myllykoski* | Carl Christian Kjelgaard Mikkelsen
Department of Computing Science and HPC2N, Umeå University, SE-901 87 Umeå, Sweden
Correspondence: *Mirko Myllykoski, Department of Computing Science and HPC2N, Umeå University, SE-901 87 Umeå, Sweden. Email: [email protected]
Summary
In this paper, we present the StarNEig library for solving dense nonsymmetric standard and generalized eigenvalue problems. The library is built on top of the StarPU runtime system and targets both shared and distributed memory machines. Some components of the library have support for GPU acceleration. The library is currently in an early beta state and supports only real matrices. Support for complex matrices is planned for a future release. This paper is aimed at potential users of the library. We describe the design choices and capabilities of the library, and contrast them to existing software such as ScaLAPACK. StarNEig implements a ScaLAPACK compatibility layer which should assist new users in the transition to StarNEig. We demonstrate the performance of the library with a sample of computational experiments.
KEYWORDS: eigenvalue problem, parallel computing, task-based, numerical library
In this paper, we present the StarNEig library for solving dense nonsymmetric standard and generalized eigenvalue problems. StarNEig differs from existing libraries such as LAPACK and ScaLAPACK by relying on a modern task-based approach, in a manner similar to the already well-established PLASMA library. Specifically, StarNEig is built on top of the StarPU runtime system. This allows StarNEig to target both shared memory and distributed memory machines. Furthermore, some components of StarNEig have support for GPU acceleration. The library is currently in an early beta state and under continuous development. Currently, StarNEig applies to real matrices with real and complex eigenvalues, and all calculations are done using real arithmetic. This paper is an extended version of a conference paper by the authors.

This paper is addressed to potential users of the StarNEig library. We hope that readers who are already familiar with ScaLAPACK will be able to decide if StarNEig is suitable for them. In particular, we wish to communicate the changes needed to integrate existing ScaLAPACK style (or LAPACK style) software with StarNEig. Central to this integration is the ScaLAPACK compatibility layer implemented in StarNEig. This compatibility layer allows users to keep their existing two-dimensional block cyclic distribution of the data and call StarNEig routines directly to perform the computations. The authors hope to start a discussion which will help guide and prioritize the future development of the library.

Dense nonsymmetric eigenvalue problems are usually solved in three phases: reduction to condensed Hessenberg form, reduction to (real) Schur form, and computation of eigenvectors. Additionally, a fourth phase, called eigenvalue reordering, can be performed to acquire an invariant subspace that is associated with a given subset of eigenvalues. All four phases are implemented in StarNEig in a task-based manner.
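As a small illustration of the kind of problem StarNEig targets, consider a real nonsymmetric matrix whose eigenvalues include a complex conjugate pair. The NumPy check below is purely illustrative (it is not StarNEig code) and simply verifies the defining eigenpair relation:

```python
import numpy as np

# A real nonsymmetric matrix with one real eigenvalue (3) and a complex
# conjugate pair (the leading 2x2 block has eigenvalues +/- i*sqrt(2)).
A = np.array([[0., -2., 0.],
              [1.,  0., 0.],
              [0.,  0., 3.]])

lam, X = np.linalg.eig(A)

# Every eigenpair satisfies A x_i = lambda_i x_i.
for i in range(3):
    assert np.allclose(A @ X[:, i], lam[i] * X[:, i])

# The complex eigenvalues of a real matrix come in conjugate pairs.
assert np.allclose(np.sort(lam.imag), [-np.sqrt(2.0), 0.0, np.sqrt(2.0)])
```

StarNEig handles such matrices entirely in real arithmetic by working with the real Schur form, where each complex conjugate pair corresponds to a 2 × 2 diagonal block.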
Performance-wise, the Hessenberg reduction phase in StarNEig is comparable to the LAPACK, ScaLAPACK and MAGMA libraries, while the Schur reduction and the reordering phases in StarNEig are significantly faster than the ScaLAPACK implementations. Moreover, StarNEig can compute the eigenvectors directly from any Schur form without suffering from floating-point overflow, i.e., the implementation is robust. This functionality simply does not exist in ScaLAPACK, and the implementation in StarNEig is significantly faster than the LAPACK implementation in a parallel setting. We refer the reader to the literature for more comprehensive performance and accuracy evaluations.

The rest of this paper is organized as follows: Section 2 provides a brief summary of the solution of dense nonsymmetric eigenvalue problems using the four phases mentioned earlier. Section 3 introduces the task-based approach and explains why the task-based approach can potentially lead to superior performance when compared to older, well-established techniques. Section 4 introduces the reader to some of the inner workings of StarNEig. In particular, the current state of the library and various limitations are explained in this section. Section 5 presents a sample of computational results which demonstrate the expected performance of StarNEig in both shared and distributed memory. Finally, Section 6 concludes the paper.

Given a matrix 𝐴 ∈ ℝ^(𝑛×𝑛), the standard eigenvalue problem consists of computing eigenvalues 𝜆ᵢ ∈ ℂ and matching eigenvectors 𝑥ᵢ ∈ ℂⁿ, 𝑥ᵢ ≠ 0, such that

𝐴𝑥ᵢ = 𝜆ᵢ𝑥ᵢ .  (1)

Similarly, given matrices 𝐴 ∈ ℝ^(𝑛×𝑛) and 𝐵 ∈ ℝ^(𝑛×𝑛), the generalized eigenvalue problem for the matrix pair (𝐴, 𝐵) consists of computing generalized eigenvalues 𝜆ᵢ ∈ ℂ ∪ {∞} and matching generalized eigenvectors 𝑥ᵢ ∈ ℂⁿ, 𝑥ᵢ ≠ 0, such that

𝐴𝑥ᵢ = 𝜆ᵢ𝐵𝑥ᵢ .
(2)

If the matrices 𝐴 and 𝐵 are sparse, then the well-known SLEPc library is one of the better tools for solving the eigenvalue problems (1) and (2). Similarly, if the matrices 𝐴 and 𝐵 are symmetric, then algorithms and software that take advantage of the symmetry are preferred. Otherwise, if the matrices are both dense and nonsymmetric, then the route of acquiring the (generalized) eigenvalues and the (generalized) eigenvectors usually includes the following three phases:

Hessenberg(-triangular) reduction:
The matrix 𝐴 or the matrix pair (𝐴, 𝐵) is reduced to Hessenberg form 𝐻 or Hessenberg-triangular form (𝐻, 𝑅) by a similarity transformation

𝐴 = 𝑄𝐻𝑄ᵀ or (𝐴, 𝐵) = 𝑄(𝐻, 𝑅)𝑍ᵀ ,  (3)

where 𝐻 is upper Hessenberg, 𝑅 is upper triangular, and 𝑄 and 𝑍 are orthogonal matrices.

Schur reduction:
The Hessenberg matrix 𝐻 or the Hessenberg-triangular matrix pair (𝐻, 𝑅) is reduced to real Schur form 𝑆 or generalized real Schur form (𝑆, 𝑇) by a similarity transformation

𝐻 = 𝑄𝑆𝑄ᵀ or (𝐻, 𝑅) = 𝑄(𝑆, 𝑇)𝑍ᵀ ,  (4)

where 𝑆 is upper quasi-triangular with 1 × 1 and 2 × 2 blocks on the diagonal, 𝑇 is upper triangular, and 𝑄 and 𝑍 are orthogonal matrices. The eigenvalues or generalized eigenvalues can be determined from the diagonal blocks of 𝑆 or (𝑆, 𝑇).

Eigenvectors:
Finally, we solve for vectors 𝑦ᵢ ∈ ℂⁿ from

(𝑆 − 𝜆ᵢ𝐼)𝑦ᵢ = 0 or (𝑆 − 𝜆ᵢ𝑇)𝑦ᵢ = 0  (5)

and backtransform to the original basis by

𝑥ᵢ = 𝑄₁𝑄₂𝑦ᵢ or 𝑥ᵢ = 𝑍₁𝑍₂𝑦ᵢ ,  (6)

where the subscripts refer to the orthogonal matrices from the two preceding reduction phases. Additionally, a fourth phase can be performed to acquire an invariant subspace of 𝐴 or (𝐴, 𝐵) that is associated with a given subset of eigenvalues or a given subset of generalized eigenvalues:

Eigenvalue reordering:
The real Schur form 𝑆 or the generalized real Schur form (𝑆, 𝑇) is reordered, such that a selected set of eigenvalues or generalized eigenvalues appears in the leading diagonal blocks of an updated real Schur form Ŝ or an updated generalized real Schur form (Ŝ, T̂), by a similarity transformation

𝑆 = 𝑄Ŝ𝑄ᵀ or (𝑆, 𝑇) = 𝑄(Ŝ, T̂)𝑍ᵀ ,  (7)

where 𝑄 and 𝑍 are orthogonal matrices. We refer the reader to the literature for a detailed explanation of the underlying mathematical theory.

A task-based algorithm functions by cutting the computational work into self-contained tasks that all have a well-defined set of inputs and outputs. In particular, StarNEig divides the matrices into disjoint (square) tiles and each task takes a set of tiles as its input and produces/modifies a set of tiles as its output. The main difference between tasks and regular function/subroutine calls is that a task-based algorithm does not call the associated computation kernels directly. Instead, the tasks are inserted into a runtime system which then derives the data dependences between the tasks from the supplied input and output information. The runtime system then schedules the tasks to computational resources, such as CPUs and GPUs, in a sequentially consistent order as dictated by the data dependences. The main benefit of this is that, as long as the cutting is done carefully, the underlying parallelism is exposed automatically as the runtime system gradually traverses the resulting task graph. In particular, the runtime system can detect and tap into previously unexplored avenues of parallelism that are hidden within the task graphs. This leads to significantly more powerful algorithms that are able to adapt to different inputs and changing hardware configurations.
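The dependence-derivation idea can be illustrated with a toy sequential-task-flow "runtime". This is a deliberately simplified model (it is not the StarPU API): tasks declare which tiles they read and write, and the dependences follow from the submission order.

```python
# Toy sequential-task-flow runtime: dependences are derived from the
# declared read/write sets in submission order, mimicking (in a very
# simplified way) what a runtime system such as StarPU does.
class ToyRuntime:
    def __init__(self):
        self.last_writer = {}   # tile -> index of the last writing task
        self.tasks = []         # list of (callback, dependences)

    def insert_task(self, func, reads, writes):
        # A task depends on the latest writer of every tile it touches.
        deps = {self.last_writer[t] for t in (reads | writes)
                if t in self.last_writer}
        self.tasks.append((func, deps))
        for t in writes:
            self.last_writer[t] = len(self.tasks) - 1

    def run(self):
        # Execute tasks in any order consistent with the dependences.
        done = set()
        while len(done) < len(self.tasks):
            for i, (func, deps) in enumerate(self.tasks):
                if i not in done and deps <= done:
                    func()
                    done.add(i)

log = []
rt = ToyRuntime()
rt.insert_task(lambda: log.append("A"), reads=set(), writes={"tile0"})
rt.insert_task(lambda: log.append("B"), reads={"tile0"}, writes={"tile1"})
rt.insert_task(lambda: log.append("C"), reads={"tile0"}, writes={"tile2"})
rt.run()
assert log[0] == "A"                # B and C both depend on A ...
assert set(log) == {"A", "B", "C"}  # ... but are mutually independent
```

Here the tasks B and C are independent of each other and could run concurrently; this mutual independence, discovered automatically from the task graph, is the hidden parallelism a real runtime system exploits.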
Other benefits of the task-based approach include, for example, better load balancing and resource utilization due to dynamic scheduling, task priorities, and implicit MPI communications that are automatically derived from the task graph.

The following subsections briefly discuss the four phases in the algorithm stack from the point of view of task parallelism. We do not attempt to explain the details of each algorithm; rather, we focus on the key steps and explain how they benefit from task parallelism. We will use the standard eigenvalue problem as an illustration.

FIGURE 1  An illustration of the standard algorithm for reducing a matrix to upper Hessenberg form: (a) panels; (b) reduce the panel; (c) update the trailing matrix; (d) update the top part from the right.

We begin by discussing the Hessenberg reduction phase. We emphasize that this phase does not benefit as much from the task-based approach as the other phases. Now, the so-called standard algorithm for reducing a nonsymmetric matrix 𝐴 to upper Hessenberg form 𝐻 first divides the matrix 𝐴 into disjoint panels as illustrated in Figure 1a. Each panel is then reduced to upper Hessenberg form as summarized below:

1. Reduce the panel.
The 𝑖th column in the panel is reduced by constructing and applying a suitable Householder reflector 𝐼 − 𝜏ᵢ𝑣ᵢ𝑣ᵢᵀ, where 𝜏ᵢ ∈ ℝ and 𝑣ᵢ ∈ ℝⁿ. All involved reflectors are initially applied only inside the current panel and accumulated into a compact WY representation 𝐼 − 𝑉ᵢ𝑇ᵢ𝑉ᵢᵀ, where 𝑇ᵢ ∈ ℝ^(𝑖×𝑖) is upper triangular and 𝑉ᵢ = [𝑣₁ 𝑣₂ … 𝑣ᵢ]. In tandem, one of the necessary intermediate results is also accumulated into a matrix 𝑌ᵢ = 𝐴𝑉ᵢ𝑇ᵢ. The construction of the matrix 𝑌ᵢ allows us to reduce the entire panel without updating the other sections of the matrix. However, the update formula includes a large matrix-vector multiplication involving the section of the matrix that trails the current column (see the shaded area in Figure 1b) and the non-zero part of the vector 𝑣ᵢ. Although these matrix-vector multiplications constitute only approximately 20% of the total number of flops, they are significantly more expensive than the remaining 80% due to the fact that matrix-vector multiplication is a memory-bound operation. This is an important factor to consider when analysing the performance.

2. Update the trailing matrix.
Update the section of the matrix that trails the current panel (see the shaded area in Figure 1c) by the update formula

𝐴 ← (𝐼 − 𝑉𝑇𝑉ᵀ)ᵀ(𝐴 − 𝑌𝑉ᵀ) ,  (8)

where 𝐼 − 𝑉𝑇𝑉ᵀ and 𝑌 are the final compact WY representation and the final intermediate result matrix from the panel reduction step, respectively.

3. Update the top part from the right.
Update the section of the matrix above the current panel (see the shaded area in Figure 1d) by the update formula

𝐴 ← 𝐴 − 𝑌𝑉ᵀ .  (9)
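The compact WY machinery behind these three steps can be checked numerically. The NumPy sketch below is illustrative only (not StarNEig code): it accumulates a few reflectors into 𝐼 − 𝑉𝑇𝑉ᵀ and verifies that the blocked trailing-matrix update (8) agrees with applying the reflectors one by one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3

# k unit vectors with a Hessenberg-like zero pattern; with tau = 2,
# I - tau*v*v^T is an exact Householder reflector.
V = np.zeros((n, k))
tau = np.full(k, 2.0)
for i in range(k):
    v = rng.standard_normal(n)
    v[:i] = 0.0
    V[:, i] = v / np.linalg.norm(v)

# Accumulate the compact WY representation Q = I - V T V^T.
T = np.zeros((k, k))
for i in range(k):
    T[:i, i] = -tau[i] * (T[:i, :i] @ (V[:, :i].T @ V[:, i]))
    T[i, i] = tau[i]

Q = np.eye(n) - V @ T @ V.T

# Reference: apply the reflectors one by one.
Q_ref = np.eye(n)
for i in range(k):
    v = V[:, i:i+1]
    Q_ref = Q_ref @ (np.eye(n) - tau[i] * (v @ v.T))
assert np.allclose(Q, Q_ref)

# Trailing-matrix update (8): Q^T (A - Y V^T) with Y = A V T equals
# the full two-sided update Q^T A Q.
A = rng.standard_normal((n, n))
Y = A @ V @ T
assert np.allclose(Q.T @ (A - Y @ V.T), Q_ref.T @ A @ Q_ref)
```

The top-part update (9), 𝐴 ← 𝐴 − 𝑌𝑉ᵀ, is the right-hand part of the same two-sided update restricted to the rows above the panel.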
FIGURE 2  An illustration of the two scheduling contexts. The critical path (the reduce the panel and update the trailing matrix tasks) is assigned to one set of CPU cores and the GPU; the remaining CPU cores handle the other tasks.

In StarNEig, each one of the aforementioned three steps is formulated as a task. In particular, for reasons related to task scheduling overhead, the entire reduce the panel step is implemented as one monolithic task. Implementing each column reduction as a set of separate tasks would lead to an excessive number of lightweight tasks and thus to an unmanageable amount of scheduling overhead. The computational resources are divided into two scheduling contexts:
Parallel scheduling context contains a subset of the available CPU cores and a GPU, if one exists in the system. Each inserted task is executed in parallel by all included CPU cores or by the GPU, depending on which resource is predicted to be the best option by the runtime system. In order to accomplish this, the runtime system uses calibrated performance models to predict the execution and data transfer times for each task. The reduce the panel and update the trailing matrix tasks form the critical path of the algorithm and are scheduled to this scheduling context, as illustrated in Figure 2. The CPU implementation of the reduce the panel task copies the trailing matrix to memory buffers that are allocated from the local NUMA islands. That is, each involved CPU core has its own local memory buffer, and the obtainable memory bandwidth is thus significantly higher compared to a situation where the CPU cores would access the memory across the NUMA islands. Note that the trailing matrix is copied only once, at the beginning of each reduce the panel task. This is another reason for implementing the reduce the panel step as one monolithic task.

Sequential scheduling context contains the remaining CPU cores and will inherit computational resources from the parallel scheduling context once the critical path has been completed. Each inserted task is executed sequentially by one of the available computational resources. The update the top part from the right tasks are scheduled to this context, as illustrated in Figure 2. Note that the update the top part from the right tasks never feed back into the critical path and can therefore be scheduled independently.
The tasks that compute the matrix 𝑄 in (3) are also scheduled to this context and given lower priority.

The main benefit that comes from the task-based approach here is that the runtime system is allowed to schedule the work to the GPU when it predicts that this will improve the performance. In particular, the runtime system usually schedules the reduce the panel tasks to the GPU because most modern GPUs have a much higher memory bandwidth than CPUs. This means that the time spent computing the large matrix-vector multiplications is significantly reduced. The runtime system will handle the necessary data transfers between main memory and GPU memory, including prefetching of the data when necessary. The end result is that the task-based implementation will naturally behave very similarly to the implementation available in the MAGMA library, but provides some additional flexibility as the scheduling decisions are made dynamically.

FIGURE 3
An illustration of the multi-shift QR algorithm with aggressive early deflation: repeated AED steps (reorder, deflate, spike elimination inside the AED window) alternate with bulge chasing steps driven by the shifts.

We will now use the Schur reduction and eigenvalue reordering phases to illustrate some of the more notable benefits of the task-based approach. The modern approach for obtaining a Schur form 𝑆 of a nonsymmetric matrix 𝐴 is to apply the multi-shift QR algorithm with Aggressive Early Deflation (AED) to the upper Hessenberg form 𝐻. The algorithm is a sequence of steps of two types, AED and bulge chasing, as illustrated in Figure 3 and summarized below:

AED step first reduces a small diagonal window (a.k.a. the AED window) to real Schur form via a recursive application of the QR algorithm. When the sections of the matrix outside the AED window are updated from the left, a spike is induced to the left of the AED window and we thus temporarily deviate from the upper Hessenberg form. All diagonal blocks in the reduced AED window are then systematically evaluated in order to identify those eigenvalues that can be safely deflated without introducing significant perturbations. If the corresponding element in the spike is found to be small enough (i.e., the so-called deflation condition is satisfied), then the element is set to zero and the eigenvalue that corresponds to the
diagonal block is deflated. On the other hand, if the corresponding entry in the spike is found to be too large, then the remaining AED window is reordered such that the diagonal block that failed the deflation check is moved to the upper left corner of the window, thus pushing the remaining unevaluated diagonal blocks downwards along the diagonal. After all diagonal blocks have been evaluated, the remaining spike is eliminated by performing a small-sized Hessenberg reduction.
Bulge chasing step chases a set of bulges down the diagonal. The eigenvalues of the diagonal blocks that failed the deflation condition, 𝜆₁, 𝜆₂, … , 𝜆ₘ, are used as shifts to generate the bulges. That is, the first column of the matrix 𝐻 is transformed to the first column of the matrix (𝐻 − 𝜆₁𝐼)(𝐻 − 𝜆₂𝐼) via a small-sized Householder reflector. When applied from both sides, the reflector creates fill-in in the form of a bulge that appears in the upper left corner of the matrix. At this point, the bulge could be eliminated by chasing it down the diagonal with a sequence of overlapping small-sized Householder reflectors, and this would complete one implicit QR iteration. However, a multi-shift QR algorithm will instead chase the bulge just enough so that a second bulge can be introduced using the shifts 𝜆₃ and 𝜆₄. The same procedure is then repeated until a total of 𝑚∕2 bulges have been introduced. The bulges are then chased in groups down the diagonal to complete one pipelined QR iteration. The bulge chasing step is then followed by a second AED step, and the same procedure is repeated until all eigenvalues have been deflated or an iteration limit is reached.

Similarly, the eigenvalue reordering phase is based on applying sequences of overlapping Givens rotations and small-sized Householder reflectors to 𝑆. The Schur form 𝑆 is essentially reordered in a bubble sort manner by using kernels that swap two adjacent diagonal blocks.
FIGURE 4  Hypothetical snapshots taken during the computations: (a) individual scalar updates; (b) localized updates inside a window; (c) BLAS level-3 off-diagonal updates; (d) concurrent windows in ScaLAPACK; (e) concurrent windows in StarNEig. The currently active regions are highlighted with a darker shade and the propagation directions of the off-diagonal updates are marked with arrows. In (a), the overlap between two overlapping transformations is highlighted with dashed lines. In (b) and (c), the overlap between two diagonal windows is highlighted with dashed lines. In (d) and (e), the dashed lines illustrate how the matrix is divided into distributed blocks.

If the Givens rotations and small-sized Householder reflectors are applied one by one, then memory is accessed as shown in Figure 4a. This is grossly inefficient for two reasons: i) the transformations are so localized that parallelizing them would not produce any significant speedup, and ii) the matrix elements are touched only once, thus leading to very low arithmetic intensity. The modern approach groups together a set of local transformations and initially applies them to a relatively small diagonal window as shown in Figure 4b. The localized transformations are accumulated into an accumulator matrix and later propagated as BLAS level-3 operations acting on the off-diagonal sections of the matrix as shown in Figure 4c. This leads to much higher arithmetic intensity and enables proper parallel implementations, as multiple diagonal windows can be processed concurrently. These are the main reasons why the multi-shift QR algorithm introduces several bulges to the diagonal. The bulges are divided into groups and each group is chased separately down the diagonal. A set, or a chain, of overlapping diagonal windows is associated with each group. Similarly, several selected diagonal blocks can be grouped and moved together in the eigenvalue reordering phase.

The Schur reduction and eigenvalue reordering phases are implemented in ScaLAPACK as
PDHSEQR and PDTRSEN subroutines, respectively. Following the ScaLAPACK convention, the matrices are distributed in a two-dimensional block cyclic fashion. The resulting memory access pattern is illustrated in Figure 4d for an MPI process mesh. In this example, three diagonal windows can be processed simultaneously. The related BLAS level-3 off-diagonal updates require careful coordination since the left and right hand side updates must be performed in a sequentially consistent order. In practice, this means (global or row/column communicator broadcast) synchronization after each set of BLAS level-3 off-diagonal updates has been applied. In addition, each AED step introduces a global synchronization point.

In StarNEig, the Schur reduction and eigenvalue reordering phases are implemented with the following task types:
Window task generates and applies a set of local transformations inside a diagonal window. Takes the intersecting tiles as input, and produces updated tiles and an accumulator matrix as output.

Right update task applies accumulated right-hand side updates using BLAS level-3 operations. Takes the intersecting tiles and an accumulator matrix as input, and produces updated tiles as output.

Left update task applies accumulated left-hand side updates using BLAS level-3 operations. Takes the intersecting tiles and an accumulator matrix as input, and produces updated tiles as output.
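This division of labour can be imitated with a small NumPy sketch (illustrative only, not StarNEig code): orthogonal transformations generated inside a diagonal window are accumulated into an accumulator matrix, and the off-diagonal slabs are updated afterwards with one matrix-matrix product per slab.

```python
import numpy as np

rng = np.random.default_rng(1)
n, w0, w1 = 10, 3, 6                  # diagonal window S[w0:w1, w0:w1]
S = np.triu(rng.standard_normal((n, n)))
S_ref = S.copy()

# "Window task": apply small orthogonal transformations (stand-ins for
# Givens rotations / small reflectors) inside the window, accumulating
# them into Q_acc instead of touching the rest of the matrix.
Q_acc = np.eye(w1 - w0)
for _ in range(4):
    G, _ = np.linalg.qr(rng.standard_normal((w1 - w0, w1 - w0)))
    S[w0:w1, w0:w1] = G.T @ S[w0:w1, w0:w1] @ G
    Q_acc = Q_acc @ G
    # Reference: apply the embedded full-size transformation directly.
    G_full = np.eye(n)
    G_full[w0:w1, w0:w1] = G
    S_ref = G_full.T @ S_ref @ G_full

# "Left update" and "right update" tasks: one BLAS level-3 style
# product per off-diagonal slab, applied after the window is finished.
S[w0:w1, w1:] = Q_acc.T @ S[w0:w1, w1:]   # left-hand side updates
S[:w0, w0:w1] = S[:w0, w0:w1] @ Q_acc     # right-hand side updates

assert np.allclose(S, S_ref)
```

Because the off-diagonal slabs only ever see the accumulated matrix, the left and right update tasks are self-contained units of work that the runtime system can schedule independently of each other.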
FIGURE 5
A hypothetical task graph arising from a situation where a set of three bulges is chased down the diagonal. We have simplified the graph by omitting dependences between the left (L) and right (R) update tasks, as these dependences are enforced through the diagonal bulge chasing tasks (W). Note that the actual bulge chasing step involves several sets of bulges and the resulting task graph is therefore significantly more complex than the simplified graph presented here.

The tasks are inserted into the runtime system in a sequentially consistent order and each chain of overlapping diagonal windows leads to a task graph like the one shown in Figure 5. Note that real-life task graphs are significantly more complex than shown here, but also enclose more opportunities for parallelism. It is also critical to realize that the runtime system guarantees that the tasks are executed in a sequentially consistent order. In particular, there is no need for synchronization, and different computational steps are allowed to overlap (see Figure 4e) as the runtime system merges the corresponding sub-graphs together. This can lead to a much higher concurrency since idle time can be reduced by delaying low priority tasks until computational resources start becoming idle.

FIGURE 6  Snapshots taken during the first two iterations of the multi-shift QR algorithm with AED: (a) the first bulge chasing step with delayed right-hand side updates; (b) overlapping AED and bulge chasing steps. The numbering matches the numbering in Figure 7. That is, (1) is the bulge chasing step from the first iteration, (2) is the small-sized Hessenberg reduction from the second AED step and (3) is the bulge chasing step from the second iteration. Note that the three steps are merged in (b).

FIGURE 7  An illustration of the first two iterations of the multi-shift QR algorithm with AED. The steps that are concurrently active in Figure 6b are highlighted in bold and numbered correspondingly.

This is exactly what can be observed from Figure 6a, where a subset of the right-hand side updates are given lower priority and are therefore scheduled only after the diagonal bulge chasing tasks and the left-hand side update tasks cannot saturate all available computational resources. This is possible because, as seen in Figure 5, most right-hand side update tasks do not feed back to the other sections of the task graph and can therefore be scheduled independently.

The AED step can also be overlapped with the bulge chasing steps, as shown in Figures 6b and 7. This improves the concurrency significantly compared to ScaLAPACK because the ScaLAPACK implementation will effectively synchronize before and after each AED step. This means that in ScaLAPACK most MPI ranks will become idle while a subset of the MPI ranks are involved in the AED step. In fact, StarNEig can even overlap two bulge chasing steps with each other, as seen in Figure 6b. We refer to the literature for further information.

Mathematically, the problem of computing a single eigenvector of, say, a quasi-triangular matrix is trivial. The standard algorithm is a variant of substitution. However, substitution is very vulnerable to floating-point overflow. In particular, there exist triangular linear systems which are well-conditioned in the sense of Skeel for which the solution grows exponentially and rapidly exceeds the representational range of any floating-point number system. We say that an algorithm or a subroutine is robust if all intermediate and final results are in the representational range, i.e., floating-point overflow is avoided. In LAPACK there exists a robust family xLATRS of subroutines for solving triangular linear systems 𝑇𝑥 = 𝑏.
They dynamically scale the entire right-hand side and return a scaling factor 𝛼 and a vector 𝑥 such that 𝑇𝑥 = 𝛼𝑏. The purpose of the scaling factor 𝛼 is to extend the floating-point representational range. In LAPACK, the solvers for computing eigenvectors from Schur forms are all descended from xLATRS. They are scalar codes which compute the eigenvectors one by one. The back-transformation to the original basis can of course be done using BLAS level-3 operations. In ScaLAPACK, there is no parallel implementation of xLATRS and the solvers for computing eigenvectors are not robust.

In contrast, StarNEig implements novel algorithms for computing eigenvectors which are tiled, parallel and robust. In StarNEig, each matrix of eigenvectors is partitioned into tiles 𝑋ᵢⱼ. Every tile 𝑋 = [𝑥₁ 𝑥₂ ⋯ 𝑥ₖ] is augmented with a vector of scaling factors 𝛼 ∈ ℝᵏ, with one scaling factor per column. The augmented tile ⟨𝛼, 𝑋⟩ represents the matrix 𝑌 = [𝑦₁ 𝑦₂ ⋯ 𝑦ₖ] given by 𝑦ⱼ = 𝑥ⱼ∕𝛼ⱼ. The StarNEig solvers for computing eigenvectors accept and produce augmented tiles. This allows StarNEig to obtain a representation of the eigenvectors without exceeding the representational range. A final post-processing step ensures that all segments of each eigenvector are consistently scaled. In reality, StarNEig is exploiting a principle which is familiar to every scientist: two results can be combined if we know how to convert between the different systems of measurement, say, the metric system and the imperial system.

The use of augmented tiles is compatible with the linear update 𝑍 ← 𝑌 − 𝑇𝑋 and the vast majority of the arithmetic operations can be completed using BLAS level-3 operations. Given a matrix 𝑇 and two augmented tiles ⟨𝛼, 𝑋⟩ and ⟨𝛽, 𝑌⟩, such that 𝑌 − 𝑇𝑋 is defined, StarNEig produces an augmented tile ⟨𝛾, 𝑍⟩ which represents the intended result, i.e.,

𝛾ⱼ⁻¹𝑧ⱼ = 𝛽ⱼ⁻¹𝑦ⱼ − 𝛼ⱼ⁻¹𝑇𝑥ⱼ .  (10)

This is achieved as follows.
A preprocessing step ensures that each pair (𝑥ⱼ, 𝑦ⱼ) of columns of 𝑋 and 𝑌 is not only consistently scaled, but that the linear update 𝑧ⱼ = 𝑦ⱼ − 𝑇𝑥ⱼ can be computed without exceeding the overflow threshold. If 𝑋, 𝑌 and 𝑇 are 𝑛 × 𝑛 matrices, then this preprocessing step requires 𝑂(𝑛²) operations. Then the linear update 𝑍 ← 𝑌 − 𝑇𝑋 is executed using a single BLAS level-3 operation. We emphasize that the cost of the preprocessing step is insignificant compared with the 𝑂(𝑛³) arithmetic operations required for the linear update. Rescaling vectors to obtain a consistent scaling requires that the scaling factors are nonzero; otherwise, a division by zero is attempted. In StarNEig, the possibility of the scaling factors underflowing is significantly reduced by using scaling factors which are integer powers of 2. Currently, StarNEig uses at least 32-bit signed integers, for which the smallest scaling factor is 𝛼 = 2^(−2³¹+1), but this can easily be extended to the point where underflow is a practical impossibility using 64-bit unsigned integers.

The main benefits that stem from the task-based approach are the merging of different computational steps and the improved load balancing. Additional information can be found in the existing literature. In particular, the fundamental principles for solving triangular linear systems in parallel without suffering from overflow are discussed in the literature, and the StarNEig solver for computing generalized eigenvectors is the subject of a separate paper.
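The consistently scaled update (10) can be sketched in NumPy as follows. The rescaling policy here (always scale down to the smaller power-of-two factor, which cannot overflow) is a simplification; it is not the actual StarNEig kernel:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 3
T = rng.standard_normal((n, n))

# Augmented tiles <alpha, X> and <beta, Y>: column j of X represents
# x_j / alpha_j, with scaling factors that are integer powers of two.
alpha = np.array([1.0, 0.5, 0.25])
beta = np.array([0.5, 1.0, 0.5])
X = rng.standard_normal((n, k))
Y = rng.standard_normal((n, k))

# Preprocessing: bring each column pair to the common scaling factor
# gamma_j = min(alpha_j, beta_j). Stored columns are only ever scaled
# *down*, so this step cannot overflow.
gamma = np.minimum(alpha, beta)
Xs = X * (gamma / alpha)            # now represents x_j / gamma_j
Ys = Y * (gamma / beta)             # now represents y_j / gamma_j

# The update itself is a single BLAS level-3 style operation;
# <gamma, Z> then represents the right-hand side of (10).
Z = Ys - T @ Xs

for j in range(k):
    direct = Y[:, j] / beta[j] - T @ (X[:, j] / alpha[j])
    assert np.allclose(Z[:, j] / gamma[j], direct)
```

A truly robust kernel must additionally shrink 𝛾ⱼ further whenever the update itself could overflow; that logic is omitted here for brevity.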
StarNEig is a C library that runs on top of the StarPU task-based runtime system. StarPU handles low-level operations such as heterogeneous scheduling, data transfers and replication between memory spaces, and MPI communication between compute nodes. In particular, StarPU is responsible for managing the various computational resources such as CPU cores and GPUs. The support for GPUs and distributed memory was the main reason why StarPU was chosen as the runtime system.

StarPU manages a set of worker threads; usually one thread per computational resource. In addition, one thread is responsible for inserting the tasks into StarPU and tracking the state of the machine. If necessary, one additional thread is allocated for MPI communication. For these reasons, StarNEig must be used in a one process per node (1ppn) configuration, i.e., several CPU cores should be allocated for each MPI process (a node can be a full node, a NUMA island or some other reasonably large collection of CPU cores).
TABLE 1
Current status of the StarNEig library.
Computational phase               | Shared memory  | Distributed memory | GPUs (CUDA)
----------------------------------|----------------|--------------------|----------------------
Hessenberg reduction              | Complete       | ScaLAPACK wrapper  | Single GPU supported
Schur reduction                   | Complete       | Complete           | Experimental
Eigenvalue reordering             | Complete       | Complete           | Experimental
Eigenvectors                      | Complete       | In progress        | —
Hessenberg-triangular reduction   | LAPACK wrapper | ScaLAPACK wrapper  | —
Generalized Schur reduction       | Complete       | Complete           | Experimental
Generalized eigenvalue reordering | Complete       | Complete           | Experimental
Generalized eigenvectors          | Complete       | In progress        | —

The current status of StarNEig is summarized in Table 1. The library is currently in an early beta state. The Experimental status indicates that the software component has not been tested as extensively as those software components that are considered Complete. In particular, the GPU functionality requires some additional involvement from the user (performance model calibration). At the time of writing this paper, only real arithmetic is supported and certain interface functions are implemented as LAPACK and ScaLAPACK wrapper functions. However, we emphasize that StarNEig supports real-valued matrices that have complex eigenvalues and eigenvectors. Additional distributed memory functionality and support for complex data types are planned for a future release.
FIGURE 8  Examples of various data distributions supported by StarNEig: (a) distributed blocks, (b) an arbitrary distribution, and (c) a two-dimensional block cyclic distribution (2D-BCD).

StarNEig distributes the matrices in rectangular blocks of uniform size (excluding the last block row and column) as illustrated in Figure 8a. The data distribution, i.e., the mapping from the distributed blocks to the MPI process rank space, can be arbitrary as illustrated in Figure 8b. A user has three options:

1. Use the default data distribution. This is recommended for most users and leads to reasonable performance in most situations.

2. Use a two-dimensional block cyclic distribution (see Figure 8c). In this case, the user may select the MPI process mesh dimensions and the rank ordering.

3. Define a data distribution function 𝑑 ∶ ℤ⁺ × ℤ⁺ → ℤ⁺ that maps the block row and column indices to the MPI rank space. For example, in Figure 8b, one of the ranks owns the blocks (0,1), (1,2), (1,5), (1,6), (2,6), (3,0) and (3,5).

The library implements distribution agnostic copy, scatter and gather operations.

Users who are familiar with ScaLAPACK are likely accustomed to using relatively small distributed block sizes (between 64 and 256). In contrast, StarNEig functions optimally only if the distributed blocks are relatively large (at least 1000 but preferably much larger). This is due to the fact that StarNEig further divides the distributed blocks into tiles, and a tiny tile size leads to excessive task scheduling overhead because the tile size is closely connected to the task granularity. Furthermore, as mentioned in the preceding section, StarNEig should be used in a 1ppn configuration as opposed to the one process per core (1ppc) configuration which is common with ScaLAPACK.

StarNEig is fully compatible with ScaLAPACK and provides a ScaLAPACK compatibility layer that encapsulates BLACS contexts and descriptors inside transparent objects, and implements a set of bidirectional conversion functions.
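Returning to option 3 above, the two-dimensional block cyclic distribution of option 2 can itself be expressed as such a distribution function. The sketch below is our own illustration, not StarNEig's API: the function name and the row-major rank ordering are assumptions.

```c
#include <assert.h>

/* Maps the block indices (i, j) to an MPI rank using a two-dimensional
 * block cyclic distribution over a pm x pn process mesh with row-major
 * rank ordering. */
int block_cyclic_owner(int i, int j, int pm, int pn)
{
    return (i % pm) * pn + (j % pn);
}
```

With a 2 × 2 mesh, for example, block (3,5) maps to rank (3 mod 2) · 2 + (5 mod 2) = 3.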
The conversions are performed in-place and do not modify any of the underlying data structures. Users can mix StarNEig interface functions with ScaLAPACK routines without intermediate conversions. The use of the ScaLAPACK compatibility layer requires the use of either the default data distribution or the two-dimensional block cyclic data distribution. Listing 1 illustrates how a distributed StarNEig matrix is converted to an equivalent BLACS context and a matching local buffer. The matrix is then reduced to upper Hessenberg form by calling the PDGEHRD routine from ScaLAPACK. This is in fact how the ScaLAPACK wrapper functions are implemented in StarNEig; see Table 1.
Computational experiments were performed on the Kebnekaise system, located at the High Performance Computing Center North (HPC2N), Umeå University. Kebnekaise is a heterogeneous system consisting of many different types of compute nodes. The compute node types relevant for this paper are:
Broadwell compute node contains 28 Intel Xeon E5-2690v4 cores organized into 2 NUMA islands with 14 cores each and 128 GB memory. The nodes are connected with FDR Infiniband. All distributed memory experiments were performed on these nodes.
Skylake compute node contains 28 Intel Xeon Gold 6132 cores organized into 2 NUMA islands with 14 cores each and 192 GB memory. All Hessenberg reduction experiments were performed on these nodes.
Large memory node contains 72 Intel Xeon E7-8860v4 cores organized into 4 NUMA islands with 18 cores each and 3072 GB memory. All eigenvector experiments were performed on these nodes.
V100 GPU node contains 28 Intel Xeon Gold 6132 cores organized into 2 NUMA islands with 14 cores each and 192 GB memory. Each NUMA island is connected to a single NVIDIA Tesla V100 GPU. All GPU experiments were performed on these nodes.

Listing 1: An example of how a distributed matrix is converted to a BLACS descriptor and a local buffer.

    // create a 2D block cyclic data distribution (pm x pn process mesh)
    starneig_distr_t distr =
        starneig_distr_init_mesh(pm, pn, STARNEIG_ORDER_DEFAULT);

    // create a n x n distributed matrix (bn x bn blocks)
    starneig_distr_matrix_t dA =
        starneig_distr_matrix_create(n, n, bn, bn, STARNEIG_REAL_DOUBLE, distr);

    ...

    // convert the data distribution to a BLACS context
    starneig_blacs_context_t ctx = starneig_distr_to_blacs_context(distr);

    // convert the distributed matrix to a BLACS descriptor and a local buffer
    starneig_blacs_descr_t descr;
    double *ptr;
    starneig_distr_matrix_to_blacs_descr(dA, ctx, &descr, (void **)&ptr);

    // ScaLAPACK routine for reducing a general distributed matrix to upper
    // Hessenberg form
    extern void pdgehrd_(int const *, int const *, int const *, double *,
        int const *, int const *, starneig_blacs_descr_t const *, double *,
        double *, int const *, int *);

    pdgehrd_(&n, &ilo, &ihi, ptr, &ia, &ja, &descr, tau, ...);

For the distributed memory and eigenvalue reordering related GPU experiments, the software was compiled with GCC 7.3.0 and linked to OpenMPI 3.1.3, OpenBLAS 0.3.2, ScaLAPACK 2.0.2, CUDA 9.2.88, and StarPU 1.2.8. Since the version of the PDHSEQR routine that exists in ScaLAPACK 2.0.2 is known to be buggy, StarNEig was actually compared against an updated version of PDHSEQR; see . The updated version places far less strict conditions on the distributed block size and thus performs better on modern hardware. For the Hessenberg reduction and eigenvector computation experiments, the software was compiled with ICC 19.0.1.144 and linked to Intel MPI 2018.4.274, Intel MKL 2019.1.144, CUDA 10.1.105 and StarPU 1.2.8. StarPU was compiled with the --disable-cuda-memcpy-peer configuration flag enabled. This significantly reduces the CUDA related overhead.

In most experiments, StarNEig version 0.1-beta.2 was used. However, the Schur reduction and eigenvalue reordering experiments in distributed memory were performed using an older and unpublished version of StarNEig. The main difference between this unpublished version and the version 0.1-beta.2 is the deflation condition used in the Schur reduction phase. The deflation condition is used to decide when an eigenvalue can be deflated and thus impacts the convergence rate of the algorithm. The unpublished version uses a deflation condition that is identical to the one used in LAPACK and PDHSEQR, whereas the version 0.1-beta.2 uses the so-called norm stable deflation condition . The latter deflation condition is less strict and could thus potentially lead to faster convergence. The latest version of StarNEig (0.1-beta.4) allows the user to choose between these two deflation conditions.

Table 2 shows how the Hessenberg reduction routine in StarNEig compares against LAPACK (with parallel BLAS), ScaLAPACK (in single node) and MAGMA . Since the implementations in all four libraries are based on the standard algorithm, the performance is limited by the throughput of the matrix-vector multiplication routine. It is therefore important that the memory access pattern is optimized. In particular, a memory access to a NUMA island that is not local to a particular core is very likely to reduce the performance since the obtainable memory bandwidth is significantly lower compared to a local access.
TABLE 2  A run time comparison (in seconds) between parallel MKL-LAPACK, MKL-ScaLAPACK, MAGMA and StarNEig when computing a Hessenberg form in shared memory. Columns 2-4 show the execution times for 28 Intel Xeon Gold 6132 cores and columns 5-6 show the execution times for 14 Intel Xeon Gold 6132 cores paired with a NVIDIA Tesla V100 GPU. The STARPU_WORKERS_CPUID environment variable was set to [...].

                      28 cores                 14 cores + V100 GPU
𝑛         LAPACK   ScaLAPACK   StarNEig      MAGMA   StarNEig
 5 000       3.8         4.2        8.4        1.3        2.2
10 000        32          29         40        7.3        8.9
15 000       113          95        117         23         24
20 000       278         223        249         49         53
25 000       561         459        454         90         97
30 000       953         728        755        144        164
40 000      2397        1722       1711          —          —

A memory access pattern that is close to optimal is easy to achieve with ScaLAPACK since each MPI process accesses its own local buffer and this buffer can be allocated from the NUMA island that is closest to the core to which the MPI process is bound. For the other CPU experiments, the memory was allocated in interleaved mode across the two NUMA islands in order to evenly distribute the load. Since the matrix-vector multiplication operation is memory bound, it is not surprising that StarNEig closely matches the performance of LAPACK, ScaLAPACK and MAGMA. It is only in the case of the smallest considered matrix where the additional task scheduling related overhead begins to negatively affect StarNEig's performance. The additional memory locality considerations in StarNEig begin to show their effect when the matrices are reasonably large, leading up to a 40% performance improvement compared to LAPACK.
TABLE 3  A run time comparison (in seconds) between ScaLAPACK and StarNEig in distributed memory. Each node contains 28 Intel Xeon E5-2690v4 cores.

               CPU cores              Schur reduction        Eigenvalue reordering
𝑛         ScaLAPACK  StarNEig    ScaLAPACK  StarNEig     ScaLAPACK  StarNEig
 10 000          36        28           38        18            12         3
 20 000          36        28          158        85            72        25
 40 000          36        28          708       431           512       180
 60 000         121       112          992       563           669       168
 80 000         121       112         1667       904          1709       391
100 000         121       112         3319      1168          3285       737
120 000         256       252         3268      1111          2902       581

Table 3 shows a comparison between ScaLAPACK and StarNEig. All distributed memory experiments were performed using a square MPI process grid. This was done because the maximum number of diagonal bulge chasing and reordering windows in the PDHSEQR and PDTRSEN routines is given by min(𝑝𝑚, 𝑝𝑛), where 𝑝𝑚 and 𝑝𝑛 are the height and width of the process mesh, respectively. A square process mesh thus maximizes the degree of parallelism in the ScaLAPACK routines. We always map each StarNEig process to a full node (28 cores) and each ScaLAPACK process to a single CPU core, the latter being done for the same reason as the choice to use a square process mesh. Since it is not straightforward to find a CPU core allocation that would lead to a square MPI process grid in both configurations, the number of CPU cores in each ScaLAPACK experiment is always equal to or larger than the number of CPU cores in the corresponding StarNEig experiment. The upper Hessenberg matrices for the Schur reduction experiments were computed from random matrices (entries uniformly distributed over the interval [−1, 1)). In the ScaLAPACK experiments, the matrices were distributed in 160 × 160 blocks, and in the StarNEig experiments, the library default block size was used. In the eigenvalue reordering experiments, 35% of the diagonal blocks were randomly selected.

From Table 3, we note that StarNEig is between 1.6 and 2.9 times faster than ScaLAPACK (PDHSEQR) when computing the Schur form and between 2.8 and 5.0 times faster than ScaLAPACK (PDTRSEN) when reordering the Schur form. The distributed memory experiments were initially reported in the technical report .
FIGURE 9  Distributed memory scalability of StarNEig when computing a Schur form (run time in seconds against matrix dimensions 20k–160k). Each node contains 28 Intel Xeon E5-2690v4 cores.
FIGURE 10  Distributed memory scalability of StarNEig when reordering a Schur form (run time in seconds against matrix dimensions 20k–160k). Each node contains 28 Intel Xeon E5-2690v4 cores.

Figures 9 and 10 give some idea of how well the library is expected to scale in distributed memory. We note that StarNEig scales reasonably when computing the Schur form and almost linearly when reordering the Schur form. The iterative nature of the QR algorithm makes the Schur reduction results less predictable because different matrices require different numbers of bulge chasing steps. That is, with some matrices, the algorithm ends up performing more consecutive AED steps between the bulge chasing steps, thus leading to a faster convergence rate but weaker scalability.

FIGURE 11  GPU performance of StarNEig when reordering a Schur form (run time in seconds against matrix dimensions 20k–80k, for 28 cores with no GPUs, 28 cores + 1 GPU, and 28 cores + 2 GPUs). Each socket (14 Intel Xeon Gold 6132 cores) is connected to one NVIDIA Tesla V100 GPU.

Figure 11 demonstrates that StarNEig can indeed take advantage of the available GPUs as long as the matrices are reasonably large. The introduction of a single V100 GPU gives a speedup of up to 1.65 and the introduction of two V100 GPUs gives a speedup of up to 2.84. In the best case, the speedup when moving from a single V100 GPU to two V100 GPUs is 1.83.

TABLE 4  A run time comparison (in seconds) between MKL-LAPACK and StarNEig when computing 35% of the eigenvectors. Run times that are longer than four hours are not tabulated.

                         LAPACK                             StarNEig
𝑛 \ cores      1     16     32     48     64        1     16     32     48     64
 10 000       95     80     80     80     80       28    2.3    1.4    1.1    1.0
 20 000      784    672    694    673    668      194     15    8.1    5.9    5.3
 40 000     6731   5826   5681   5608   5743     1364     98     53     38     30
 60 000        —      —      —      —      —     4231    315    167    116     90
 80 000        —      —      —      —      —     9221    693    359    252    194
100 000        —      —      —      —      —        —   1325    698    483    369
120 000        —      —      —      —      —        —   2226   1194    833    647

Table 4 shows how the eigenvector computation routine in StarNEig compares against LAPACK (with parallel BLAS). In all experiments, 35% of the eigenvalues were randomly selected. In the single core experiments, StarNEig is between 3.4 and 4.9 times faster than LAPACK. This demonstrates how efficient the tiled approach is compared to the scalar implementation that exists in LAPACK. Furthermore, the multi-core experiments demonstrate the poor scalability of the LAPACK implementation. In particular, the initial solve phase (5) is executed sequentially and the back transformation phase (6), which is executed in parallel, constitutes only a small fraction of the total execution time. The implementation in StarNEig, on the other hand, demonstrates very good scalability as for the larger matrices (𝑛 ≥ 40 000) the parallel efficiency stays above 70%. In the most extreme case StarNEig is over 190 times faster than LAPACK.
This paper presented a new library called StarNEig. StarNEig aims to provide a complete task-based software stack for solving dense nonsymmetric standard and generalized eigenvalue problems. StarNEig supports both shared and distributed memory machines, and some routines in the library can take advantage of the available GPUs. The paper is mainly aimed at potential users of the library. Various design choices were explained and contrasted to existing software. In particular, users who are already familiar with ScaLAPACK should know the following:

• StarNEig expects that the matrices are distributed in relatively large blocks compared to ScaLAPACK.

• StarNEig should be used in a one process per node (1ppn) configuration as opposed to the one process per core (1ppc) configuration which is common with ScaLAPACK.

• StarNEig implements a ScaLAPACK compatibility layer.
The performance of the library was demonstrated with a set of computational experiments. The presented results show the following: In the Hessenberg reduction phase, StarNEig is competitive with LAPACK, ScaLAPACK (in single node) and MAGMA (single GPU). In the Schur reduction phase, StarNEig is between 1.6 and 2.9 times faster than ScaLAPACK. In the eigenvector computation phase, StarNEig's parallel and robust implementation significantly outperforms LAPACK in both single-core and multi-core settings, with recorded speedups as large as 190. In the eigenvalue reordering phase, StarNEig is between 2.8 and 5.0 times faster than ScaLAPACK and scales nearly linearly.

The library is still incomplete. Future work with StarNEig includes the implementation and integration of the missing software components. Support for complex valued matrices is also planned. The GPU support, and the multi-GPU support in particular, is still under active development. The authors hope to start a discussion which would help guide and prioritize the future development of the library.
ACKNOWLEDGMENTS
StarNEig has been developed by Mirko Myllykoski (who has designed and implemented the Hessenberg reduction, Schur reduction and eigenvalue reordering phases), Carl Christian Kjelgaard Mikkelsen (who has designed and implemented the eigenvector computation phase for generalized eigenvectors), Angelika Schwarz (who has designed and implemented the eigenvector computation phase for standard eigenvectors), Lars Karlsson (who has worked on the underlying theory behind the robust computation of eigenvectors), and Bo Kågström (who was the coordinator and scientific director of the NLAFET project). This work is part of a project (NLAFET) that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671633. This work was supported by the Swedish strategic research programme eSSENCE and the Swedish Research Council (VR) under Grant E0485301. We thank the High Performance Computing Center North (HPC2N) at Umeå University for providing computational resources and valuable support during test and performance runs. The authors acknowledge the work of Mahmoud Eljammaly and Björn Adlerborn during the NLAFET project. The authors would also like to thank the StarPU team at INRIA for rapidly providing good answers to several questions related to StarPU.