An Adaptive Solver for Systems of Linear Equations
Conrad Sanderson †‡ and Ryan Curtin ⋄
† Data61 / CSIRO, Australia    ‡ Griffith University, Australia    ⋄ RelationalAI, USA
Abstract — Computational implementations for solving systems of linear equations often rely on a one-size-fits-all approach based on LU decomposition of dense matrices stored in column-major format. Such solvers are typically implemented with the aid of the xGESV set of functions available in the low-level LAPACK software, with the aim of reducing development time by taking advantage of well-tested routines. However, this straightforward approach does not take into account various matrix properties which can be exploited to reduce the computational effort and/or to increase numerical stability. Furthermore, direct use of LAPACK functions can be error-prone for non-expert users and results in source code that has little resemblance to the originating mathematical expressions. We describe an adaptive solver that we have implemented inside recent versions of the high-level Armadillo C++ library for linear algebra. The solver automatically detects several common properties of a given system (banded, triangular, symmetric positive definite), followed by solving the system via mapping to a set of suitable LAPACK functions best matched to each property. The solver also detects poorly conditioned systems and automatically seeks a solution via singular value decomposition as a fallback. We show that the adaptive solver leads to notable speedups, while also freeing the user from using direct calls to cumbersome LAPACK functions.
Index Terms — adaptive systems, numerical linear algebra, mapping problem, system of linear equations, computational implementation.
I. INTRODUCTION
Solving systems of linear equations is a fundamental computational task in numerous fields [10], [12], including signal processing and machine learning. The general form for a system of linear equations is expressed as:

    a_11 x_1 + a_12 x_2 + ... + a_1n x_n = b_1
    a_21 x_1 + a_22 x_2 + ... + a_2n x_n = b_2
      ...
    a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n = b_m        (1)

which can be compactly represented in matrix form as:

    A x = b        (2)

where matrix A with size m × n contains the coefficients of the system, vector x with size n × 1 represents the unknown variables to be found, and vector b with size m × 1 contains known constants.

For the common case of square-sized A, a naive approach to find x is via the matrix inverse, i.e. x = A^{-1} b. However, in practice computing the inverse is actually not necessary and can result in numerical instability, leading to inaccurate solutions [13]. An effective and general approach is to solve the system with the help of lower-upper (LU) decomposition [10], [12]. This can typically be implemented through the xGESV set of functions available in the ubiquitous and de-facto industry standard LAPACK software [1]. Several optimised versions of LAPACK are available, such as MKL [14] for the x86-64 architecture used by the widely employed Intel and AMD processors.

The task of converting an arbitrary linear algebra expression into an efficient sequence of optimally matched LAPACK function calls, either manually or automatically, is known as the linear algebra mapping problem [19]. When the conversion is done manually, the process can be laborious and error-prone; it typically requires thorough understanding of several areas: high-performance computing, numerical linear algebra, and the intricacies of LAPACK. Furthermore, source code that directly uses LAPACK functions has several downsides: it (i) has little resemblance to the originating mathematical expression, (ii) requires manual memory management, (iii) requires keeping track of many extra variables.
In turn, these downsides reduce the readability of the source code by non-expert users, while increasing the maintenance burden and risk of bugs [16], [23].

To address the above issues and hence to increase productivity, linear algebra expressions are often implemented with the aid of high-level frameworks [25], such as Matlab, Octave [15] and Armadillo [20]. While these frameworks facilitate simplified user code that focuses on high-level algorithm logic, the automatic mapping of given mathematical expressions into LAPACK functions can be suboptimal [2], [4].

Higher level frameworks default to storing dense matrices in straightforward column-major format (where all the elements in each column are stored consecutively in memory), without taking into account any special structure or properties of the matrix. This choice is made by framework maintainers to reduce the internal code complexity of the framework, and to avoid burdening users with the choice of storage types; users may not have the inclination nor expertise to understand the advantages and disadvantages of various storage formats.

(For each LAPACK function, we substitute the first letter of the function with 'x' to indicate a set of functions which differ only in the applicable element type (eg. single- or double-precision element type). For example, the set {SGESV, DGESV, CGESV, ZGESV} is represented as xGESV.)

Within the above context, the dense matrix LU-based solver is an effective and often-used one-size-fits-all approach for solving systems of linear equations. However, this straightforward approach does not take into account various matrix properties which can be exploited to reduce the computational effort and/or to increase numerical stability.
For example, when matrix A is symmetric positive definite, it is more computationally efficient to use Cholesky decomposition instead of LU decomposition [12].

In this paper we describe an adaptive solver that we have devised and implemented inside recent versions of the open-source Armadillo C++ linear algebra library. The library provides a high-level domain specific language [17], [22] embedded within the host C++ language, allowing mathematical operations with matrices to be expressed in a concise and easy-to-read manner similar to Matlab/Octave. This facilitates prototyping directly in C++ and aids the conversion of research code into production environments.

The adaptive solver is able to automatically detect several common properties of matrices, while being robust to inherent limitations in the precision of floating point representations [11], [18], and then map them to a large set of suitable LAPACK functions. We show that this leads to considerable speedups, thereby freeing the user from worrying about storage formats and using cumbersome manual calls to LAPACK functions in order to obtain good computational efficiency.

We continue the paper as follows. Section II describes several common matrix properties which can be exploited, summarises the algorithms used for detecting such properties, and lists the suitable sets of LAPACK functions tailored for each property. Section III provides an empirical evaluation showing the speedups attained by the adaptive solver. Section IV provides a brief discussion on computational complexity and runtime considerations. The salient points and avenues for further exploitation are summarised in Section V.

II. AUTOMATIC DETECTION AND MAPPING
The adaptive solver detects special matrix structures in the following order: (i) banded, (ii) upper or lower triangular, (iii) symmetric positive definite (sympd). Examples of the structures are shown in Fig. 1. A specialised solver is employed as soon as one of the structures is detected. The detection algorithms and associated LAPACK functions for solving the systems are described in Sections II-A through to II-C. If no special structure is detected, an extended form of an LU-based solver is used, described in Section II-D. A flowchart summarising the adaptive solver is given in Fig. 2.
Figure 1: Examples of matrix structures: (a) banded, (b) lower triangular, (c) symmetric positive definite.

Each of the solvers estimates the reciprocal condition number [3] of the given system, which is used to determine the quality of the solution. If any of the solvers fail to find a solution (including the solver for general matrices), or if the system is determined to be poorly conditioned, an approximate solution is attempted using a solver based on singular value decomposition (SVD), described in Section II-E.

The code for the detection and mapping is implemented as part of the solve() function in Armadillo. As of Armadillo version 9.900, the solve() function is comprised of about 3000 lines of code, not counting LAPACK code. Example usage of the solve() function is shown in Fig. 3.
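The rcond-based fallback decision described above can be illustrated with a small self-contained C++ sketch. For simplicity the sketch computes the exact reciprocal condition number of a diagonal matrix (for a diagonal matrix, every induced p-norm is the largest absolute diagonal entry, so rcond reduces to min|d_i| / max|d_i|); the function names and the threshold constant below are illustrative assumptions, not Armadillo internals.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Exact reciprocal condition number of a diagonal matrix D:
// ||D|| = max |d_i| and ||D^{-1}|| = 1 / min |d_i| in any induced p-norm,
// hence rcond(D) = min |d_i| / max |d_i|.
double rcond_diagonal(const std::vector<double>& diag)
  {
  double lo = std::numeric_limits<double>::infinity();
  double hi = 0.0;
  for (double d : diag)
    {
    lo = std::min(lo, std::fabs(d));
    hi = std::max(hi, std::fabs(d));
    }
  return (hi == 0.0) ? 0.0 : (lo / hi);
  }

// Fallback decision in the spirit of Fig. 2: when rcond is below a small
// multiple of machine epsilon, the direct solution is deemed unreliable and
// an SVD-based least-squares solver would be used instead. The factor of 100
// is an illustrative choice, not the threshold used inside Armadillo.
bool needs_svd_fallback(double rcond)
  {
  const double eps = std::numeric_limits<double>::epsilon();
  return rcond < (100.0 * eps);
  }
```

For instance, a well-conditioned diagonal system such as diag(2, 4) yields rcond = 0.5 and would be solved directly, whereas diag(1, 1e-20) has rcond = 1e-20 and would trigger the SVD-based fallback.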
A. Banded Matrices
Banded matrices contain elements on the main diagonal and typically on several more diagonals, above and/or below the main diagonal. All other elements are zero. As such, computational effort can be considerably reduced by exploiting the sparseness of the matrix. Fig. 1(a) shows an example banded matrix.

To efficiently detect a banded matrix structure within a column-major dense matrix, each column is examined rather than each possible diagonal. To determine the number of super-diagonals (diagonals above the main diagonal), the elements in each column are traversed from the top of the column to the location of the main diagonal within that column, followed by noting the difference in the locations of the first non-zero element and the main diagonal. The maximum of all noted differences is taken as the number of super-diagonals. To determine the number of sub-diagonals (diagonals below the main diagonal), the elements in each column are traversed from the location of the main diagonal within that column to the bottom of the column, followed by noting the difference in the locations of the main diagonal and the last non-zero element. The maximum of all noted differences is taken as the number of sub-diagonals.

As soon as the number of detected diagonals indicates that more than 25% of the elements within the matrix are non-zero, all further processing is stopped and the matrix is deemed to be non-banded. The threshold of 25% was determined empirically as a good trade-off between the lower computational requirements of solving a banded system, and the amount of extra processing required for both the detection and further wrangling of data in the banded structure.

If a banded matrix structure is detected, data from the diagonals must first be converted into a relatively compact storage format, where each diagonal is stored as a vector in a separate dense matrix [1].
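The per-column scan described above can be sketched in plain C++ as follows. This is a simplified stand-in for illustration (the function name and interface are assumptions, not Armadillo's internal code); it stops and reports a non-banded matrix as soon as the detected band implies that more than 25% of the elements are non-zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Detect a banded structure in an n x n matrix A stored in column-major
// format. On success, n_super and n_sub hold the number of super- and
// sub-diagonals. Returns false as soon as the detected band would cover
// more than 25% of the matrix, i.e. the matrix is deemed non-banded.
bool detect_banded(const std::vector<double>& A, int n, int& n_super, int& n_sub)
  {
  n_super = 0;
  n_sub   = 0;

  for (int j = 0; j < n; ++j)
    {
    const double* col = &A[std::size_t(j) * n];

    // first non-zero element above the main diagonal in this column
    for (int i = 0; i < j; ++i)
      {
      if (col[i] != 0.0) { n_super = std::max(n_super, j - i); break; }
      }

    // last non-zero element below the main diagonal in this column
    for (int i = n - 1; i > j; --i)
      {
      if (col[i] != 0.0) { n_sub = std::max(n_sub, i - j); break; }
      }

    // bail out once the band already implies > 25% non-zero elements
    if (4 * (n_super + n_sub + 1) > n) { return false; }
    }

  return true;
  }
```

A 20 × 20 matrix with one super-diagonal and two sub-diagonals is accepted as banded (band of 5 out of 20 columns per row, i.e. 25%), while a fully dense matrix is rejected after scanning only its first column.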
Data in the compact storage format is then used by the LAPACK function xGBTRF to compute the LU decomposition of the banded matrix, followed by the xGBTRS function to solve the system based on the LU decomposition, and finally the xGBCON function to estimate the reciprocal condition number from the LU decomposition.

Diagonal matrices are treated as banded matrices, where the number of diagonals above and below the main diagonal is zero. While it is certainly possible to have specialised handling for diagonal matrices, in practice such matrices are seldom encountered when solving systems of linear equations.

Figure 2: Flowchart for the adaptive solver. Depending on properties of the input matrix A, more efficient LAPACK solvers are used when possible. In all cases, if the reciprocal condition number (rcond) of A is too low, an approximate SVD-based solver is used as a fallback.

B. Triangular Matrices
For triangular matrices, the LU decomposition can be avoided and the solution can be obtained via either back-substitution (for upper triangular matrices) or forward-substitution (for lower triangular matrices) [12]. As such, considerable reductions in computational effort can be attained. An example of a lower triangular matrix is shown in Fig. 1(b).

    #include <armadillo>

    using namespace arma;

    int main()
      {
      // generate matrix with random values in [-0.5,+0.5] interval
      mat R = randu(100, 100) - 0.5;

      // generate symmetric matrix A via A = R'R
      mat A = R.t() * R;

      // ensure values on the main diagonal are dominant
      A.diag() += 1.0;

      // generate random column vector
      vec B(100, fill::randu);

      // solve for X in the random sympd system AX = B
      vec X = solve(A, B);

      X.print("solution:");

      return 0;
      }
Figure 3: An example C++ program for solving a random sympd system. The solve() function, as well as the mat and vec classes (for matrices and vectors), are provided by the Armadillo library [20]. Several optional arguments have been omitted for brevity. The solve() function is internally comprised of about 3000 lines of code, not counting LAPACK code. Documentation for all available classes and functions can be viewed at http://arma.sourceforge.net/docs.html.

The process of forward-substitution can be summarised as first finding the value for x_1 in Eqn. (1), where a_12 through to a_1n are known to be zero, resulting in x_1 = b_1 / a_11. The value for x_1 is then used to find x_2, where a_23 through to a_2n are zero. This iterative process can be compactly expressed as:

    x_i = ( b_i − sum_{j=1}^{i−1} a_ij · x_j ) / a_ii        (3)

The process of back-substitution follows in a similar manner, starting from the bottom of the matrix instead of the top (i.e. x_n is solved first instead of x_1).

The detection of triangular matrices is done in a straightforward manner. If all elements above the main diagonal are zero, a lower triangular matrix is present. Conversely, if all elements below the main diagonal are zero, an upper triangular matrix is present. An attempt to solve the system is made by using the xTRTRS function, and the reciprocal condition number is computed via the xTRCON function.

C. Symmetric Positive Definite Matrices
A matrix M is considered to be symmetric positive definite (sympd) if v^T M v > 0 for every non-zero column vector v, and M is symmetric (i.e. its lower triangular part is equivalent to a transposed version of its upper triangular part, excluding the main diagonal). Fig. 1(c) shows an example sympd matrix. For solving systems of linear equations, the sympd property can be exploited by using Cholesky decomposition [12] instead of LU decomposition in order to reduce computational effort.

To determine whether a given matrix is sympd, the traditional approach is to check whether all its eigenvalues are real and positive [6]. However, this raises two complications. First, the eigen-decomposition of a matrix is computationally expensive, which would defeat the aim of saving computational effort. Second, since the matrix is stored in simple column-major format, all matrix elements must be checked to ensure that symmetry is actually present. This in turn raises a further complication, as a simple check for equality between elements does not take into account minor differences that may be present due to the accumulation of rounding errors stemming from limitations in the precision of floating point representations [11], [18].

To address the above issues, we have devised a fast two-pass algorithm which aims to ensure the presence of several necessary conditions for a sympd matrix [6], [12]: (i) all diagonal elements are greater than zero, (ii) the element with the largest modulus is on the main diagonal, (iii) the matrix is diagonally dominant, and (iv) the matrix is symmetric while tolerating minor variations in a robust manner.

We note that while the conditions checked by the algorithm are necessary, they are not sufficient to absolutely guarantee the presence of a sympd matrix. However, for practical purposes, in our experience (mainly in the machine learning area) the conditions appear adequate for determining sympd presence.

The algorithm is shown in Fig. 4; it returns either true or false to indicate that a matrix is likely to be sympd. As soon as any condition is not satisfied, the algorithm aborts further processing and returns false. Lines 7 to 8 check whether all diagonal elements are greater than zero. Lines 12 to 14 check the presence of symmetry by determining whether the difference between an element and its corresponding transposed element is below a threshold; the difference is compared against the threshold in both absolute and relative terms, for robustness against precision variations between low-magnitude and high-magnitude floating point representations [11], [18]. We have empirically chosen the threshold as a small constant multiple of ε, where ε is the machine epsilon. Line 15 checks whether the element with the largest modulus is on the main diagonal, while line 16 performs a rudimentary check for diagonal dominance.

If the algorithm determines that a sympd matrix is present, an attempt to solve the system is made with a combination of the following LAPACK functions: xPOTRF, xPOTRS, and xPOCON. The xPOTRF function computes the Cholesky decomposition, xPOTRS solves the system based on the Cholesky decomposition, and finally xPOCON computes the reciprocal condition number from the Cholesky decomposition.

If the xPOTRF function fails (i.e. the Cholesky decomposition was not found), the failure is taken to indicate that the matrix is not actually sympd. In that case, the system is solved by the generic solver described in Section II-D.

D. General Matrices
For general matrices, where no special structure has been detected, a standard LU-based solver is used. Here matrix A from Eqn. (2) is expressed as A = LU, where L is a unit lower triangular matrix, and U is an upper triangular matrix. Rewriting Eqn. (2) yields:

    L (U x) = b        (4)

Solving for x is then accomplished in two steps. In the first step, Eqn. (4) is rewritten as L Y = b, and a solution for Y is found. Since L is lower triangular and b is known, Y is easily found via forward-substitution (as shown in Section II-B). In the second step, Y is expressed as Y = U x. Since U is upper triangular and Y is known, x is found via back-substitution, thereby providing the desired solution to Eqn. (2).

Rather than simply using the xGESV family of functions from LAPACK that do not provide an estimate of the reciprocal condition number, we use a combination of xGETRF, xGETRS and xGECON. The xGETRF function computes the LU decomposition, xGETRS solves a system of linear equations based on the LU decomposition, and finally xGECON computes the reciprocal condition number from the LU decomposition.

(Machine epsilon (ε) is defined as the difference between 1 and the next representable floating point value [11], [18]. For double-precision floating point numbers on x86-64 machines, ε ≈ 2.22 × 10^−16.)

E. Fallback for Poorly Conditioned Systems
If the above solvers fail, or if the estimated reciprocal condition number indicates a poorly conditioned system, an approximate solution is attempted using a solver based on SVD. A poorly conditioned system is detected when the reciprocal condition number is below a threshold; following LAPACK's example, we have set this threshold to a small multiple of ε, where ε is the machine epsilon as defined in Section II-C. The use of the SVD-based fallback solver can be disabled through an optional argument to the solve() function.

The SVD-based solver uses the xGELSD set of functions, which find a minimum-norm solution to a linear least squares problem [7], i.e. x is found via minimisation of ||b − Ax||. In brief, the solution to the system in Eqn. (2) is reformulated as x = (A^T A)^{−1} A^T b, where (A^T A)^{−1} A^T is known as the Moore-Penrose pseudo-inverse, which can be obtained via the SVD of A [12]. This reformulation is equivalent to the least squares solution.

     1: proc likely_sympd
     2:   input: A (square matrix in column-major format)
     3:   input: N (number of rows in A)
     4:   output: boolean (true or false)
     5:   tol ← c · ε  (where ε is machine epsilon and c is a small constant)
     6:   max_diag ← 0
     7:   forall j ∈ [0, N):
     8:     if A[j,j] ≤ 0 then return false
     9:     if A[j,j] > max_diag then max_diag ← A[j,j]
    10:   forall j ∈ [0, N−1):
    11:     forall i ∈ [j+1, N):
    12:       delta ← | A[i,j] − A[j,i] |
    13:       if delta > tol and delta > tol · max(|A[i,j]|, |A[j,i]|)
    14:         then return false
    15:       if |A[i,j]| ≥ max_diag then return false
    16:       if (|A[i,j]| + |A[j,i]|) ≥ (A[i,i] + A[j,j]) then return false
    17:   return true

Figure 4: An algorithm for detecting whether a given matrix A is likely to be symmetric positive definite (sympd). The matrix is assumed to have column-major storage, with its elements accessed via A[i,j], where i indicates the row and j indicates the column. Indexing starts at zero, following C++ convention.

Table I:
Comparison of time taken (in seconds) to solve random banded systems using a standard dense solver (which ignores the banded property) against the adaptive solver (which takes into account the banded property). Average wall-clock time across 1000 runs is reported.

Table II:
Comparison of time taken (in seconds) to solve random triangular systems using a standard dense solver (which ignores the triangular property) against the adaptive solver (which takes into account the triangular property). Average wall-clock time across 1000 runs is reported.

Table III:
Comparison of time taken (in seconds) to solve random symmetric positive definite (sympd) systems using a standard dense solver (which ignores the sympd property) against the adaptive solver (which takes into account the sympd property). Average wall-clock time across 1000 runs is reported.

Table IV:
Comparison of time taken (in seconds) to solve random dense systems (without any special structure) using a standard dense solver against the adaptive solver (which attempts to detect special structures). Average wall-clock time across 1000 runs is reported.

III. EMPIRICAL EVALUATION
In this section we demonstrate the speedups resulting from automatically detecting the matrix properties described in Section II and selecting the most appropriate set of LAPACK functions for solving systems of linear equations. The evaluations were done on a machine with an Intel Core i5-5200U CPU running at 2.2 GHz. Compilation was done with the GCC 10.1 C++ compiler with the following configuration options: -O3 -march=native. We used the open-source OpenBLAS 0.3.9 [26] package which provides optimised implementations of LAPACK functions.

We performed evaluations on a set of sizes for matrix A, ranging from small to large. Three matrix types were used: (i) banded (with 5 diagonals), (ii) lower triangular, (iii) symmetric positive definite (sympd). For each matrix size and type, 1000 unique random systems were solved, with each random system solved using the standard LU-based solver (as per Section II-D) and the adaptive solver. For each solver, the average wall-clock time (in seconds) across the 1000 runs is reported. The results are presented in Tables I, II and III.

The results indicate that for all matrix types, the adaptive solver reduces the wall-clock time. On average, the reduction is most notable for banded systems, closely followed by triangular systems. While the reductions for sympd systems are less pronounced, they still show useful speedups. In general, the larger the matrix size, the larger the degree of reduction in wall-clock time.

To gauge the overhead of all the detection algorithms, we compare the time taken to solve random dense systems (without any special structure) using the standard LU-based solver against the adaptive solver. The results shown in Table IV indicate that the overhead is negligible. This is due to each detection algorithm stopping as soon as it determines that a special structure is not present.

IV. RUNTIME CONSIDERATIONS
In the previous section we showed the proposed algorithm to be empirically efficient on randomly generated linear systems. Indeed, this is partly due to the fact that the checks that we have proposed are asymptotically more efficient than the factorisations we employ. To see this, observe that the banded structure check described in Section II-A can be performed in a single pass over the matrix. For a square matrix of size n × n, this is O(n²) time. Similarly, the triangular structure check described in Section II-B can also be performed in a single pass, yielding O(n²) time. Finally, the likely_sympd algorithm in Fig. 4, as written, can be performed as an O(n) diagonal pass on A followed by a single O(n²) pass. Overall, this yields a total quadratic runtime for all of the proposed checks.

In the situation where the given matrix is triangular, we can perform forward- or back-substitution in O(n²) time, yielding an asymptotic improvement over the standard LU-based solver, which takes O(n³) time [12]. Cholesky decomposition and singular value decomposition both also take O(n³) time to compute [12]. However, in practice the Cholesky decomposition tends to be faster (as only one triangular part of the symmetric matrix needs to be processed), explaining the rest of the empirical advantage that was observed in our experiments.

V. CONCLUSION
We have described an adaptive solver for systems of linear equations that is able to automatically detect several common properties of a given system (banded, triangular, and symmetric positive definite), followed by solving the system via mapping to a set of suitable LAPACK functions best matched to each property. Furthermore, the solver can detect poorly conditioned systems and automatically obtain an approximate solution via singular value decomposition as a fallback. The automatic handling facilitates simplified user code that focuses on high-level algorithm logic, freeing the user from worrying about various storage formats and using cumbersome manual calls to LAPACK functions in order to obtain good computational efficiency.

The solver is present inside recent versions of the high-level Armadillo C++ library for linear algebra [20], [21], which allows matrix operations to be expressed in an easy-to-read manner similar to Matlab/Octave. The solver is comprised of about 3000 lines of code, not counting LAPACK code; it is able to handle matrices with single- and double-precision floating point elements, in both real and complex forms. The source code for the adaptive solver is provided under the permissive Apache 2.0 license [24], allowing unencumbered use in commercial products; it can be obtained from http://arma.sourceforge.net.

The adaptive solver has been successfully used to increase efficiency in open-source toolkits such as the mlpack library for machine learning [9], and the ensmallen library for numerical optimisation [5]. Areas for further improvement include specialised handling of plain symmetric matrices (symmetric but not positive definite), and tri-diagonal matrices which are a common subtype of banded matrices [8].

ACKNOWLEDGEMENTS
We would like to thank our colleagues at Data61/CSIRO (Frank De Hoog, Mark Staples) and Griffith University (M.A. Hakim Newton, Majid Namazi) for discussions leading to the improvements of this paper.

REFERENCES

[1] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen.
LAPACK Users' Guide. SIAM, 1999.
[2] H. Barthels, C. Psarras, and P. Bientinesi. The sad truth about linear algebra software. XXI Householder Symposium on Numerical Linear Algebra, 2020.
[3] D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley-Interscience, 1980.
[4] D. Berényi, A. Leitereg, and G. Lehel. Towards scalable pattern-based optimization for dense linear algebra. Concurrency and Computation: Practice and Experience, 30(22), 2018.
[5] S. Bhardwaj, R. R. Curtin, M. Edel, Y. Mentekidis, and C. Sanderson. ensmallen: a flexible C++ library for efficient function optimization. Workshop on Systems for ML and Open Source Software at NeurIPS / NIPS, 2018.
[6] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2015.
[7] S. Boyd and L. Vandenberghe. Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares. Cambridge University Press, 2018.
[8] E. W. Cheney and D. R. Kincaid. Numerical Mathematics and Computing. Cengage Learning, 7th edition, 2012.
[9] R. R. Curtin, M. Edel, M. Lozhnikov, Y. Mentekidis, S. Ghaisas, and S. Zhang. mlpack 3: a fast, flexible machine learning library. Journal of Open Source Software, 3(26):726, 2018.
[10] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[11] D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991.
[12] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 4th edition, 2013.
[13] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2nd edition, 2002.
[14] Intel. Math Kernel Library (MKL), 2020. https://software.intel.com/mkl.
[15] S. Linge and H. P. Langtangen. Programming for Computations - MATLAB/Octave. Springer, 2016.
[16] R. Malhotra and A. Chug. Software maintainability: Systematic literature review and current trends. International Journal of Software Engineering and Knowledge Engineering, 26(8):1221–1253, 2016.
[17] M. Mernik, J. Heering, and A. M. Sloane. When and how to develop domain-specific languages. ACM Computing Surveys, 37(4):316–344, 2005.
[18] J.-M. Muller, N. Brunie, F. de Dinechin, C.-P. Jeannerod, M. Joldes, V. Lefèvre, G. Melquiond, N. Revol, and S. Torres. Handbook of Floating-Point Arithmetic. Birkhäuser, 2nd edition, 2018.
[19] C. Psarras, H. Barthels, and P. Bientinesi. The linear algebra mapping problem. arXiv:1911.09421, 2019.
[20] C. Sanderson and R. Curtin. Armadillo: a template-based C++ library for linear algebra. Journal of Open Source Software, 1:26, 2016.
[21] C. Sanderson and R. Curtin. Practical sparse matrices in C++ with hybrid storage and template-based expression optimisation. Mathematical and Computational Applications, 24(3), 2019.
[22] M. Scherr and S. Chiba. Almost first-class language embedding: taming staged embedded DSLs. In ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, pages 21–30, 2015.
[23] H. M. Sneed. A cost model for software maintenance & evolution. In IEEE International Conference on Software Maintenance, 2004.
[24] A. St. Laurent. Understanding Open Source and Free Software Licensing. O'Reilly Media, 2008.
[25] P. Viviani, M. Drocco, and M. Aldinucci. Scaling dense linear algebra on multicore and beyond: a survey. In