Good Pivots for Small Sparse Matrices
Manuel Kauers⋆ and Jakob Moosbauer⋆⋆

Institute for Algebra, Johannes Kepler University, Linz, Austria
[manuel.kauers | jakob.moosbauer]@jku.at

⋆ M.K. was supported by the Austrian FWF grants F50-04, W1214-13, and P31571.
⋆⋆ J.M. was supported by the Land Oberösterreich through the LIT-AI Lab.

Abstract.
For sparse matrices up to size 8 × 8, we determine optimal choices for pivot selection in Gaussian elimination. It turns out that they are slightly better than the pivots chosen by a popular pivot selection strategy, so there is some room for improvement. We then create a pivot selection strategy using machine learning and find that it indeed leads to a small improvement compared to the classical strategy.
Introduction

It can be cumbersome to solve a sparse linear system with Gaussian elimination, because a poor choice of a pivot can have a dramatic effect on the sparsity. In the worst case, we start with a fairly sparse matrix, and already after a small number of elimination steps we are faced with a dense matrix, for which continuing the elimination procedure may be too costly. The principal goal of a linear system solver for sparse matrices is therefore to maintain as much of the sparsity as possible, for as long as possible. To achieve this goal, we can pay special attention to the pivot selection. A popular pivot selection strategy which aims at maintaining the sparsity of a matrix is attributed to Markowitz [10,3]. It is based on the notion of "fill-in", which is defined as the number of matrix entries that get affected when a particular element is chosen as pivot. More precisely, for a matrix $A = ((a_{i,j}))_{i,j=1}^{n,m}$, let $r_i$ be the number of nonzero entries of the $i$-th row ($i = 1, \dots, n$) and $c_j$ be the number of nonzero entries of the $j$-th column ($j = 1, \dots, m$). Then the fill-in associated to the entry at position $(i,j)$ is defined as $(r_i - 1)(c_j - 1)$, the number of entries that get affected when $(i,j)$ is chosen as pivot. For example, if the pivot is the entry in the top left corner of
\[
\begin{pmatrix}
\ast & \ast & \ast \\
\ast & \bullet & \bullet \\
\ast & \bullet & \bullet
\end{pmatrix},
\]
then the entries marked $\bullet$ are the $(r_i - 1)(c_j - 1) = 4$ entries touched by the elimination step.

The selection strategy of Markowitz is to choose among the eligible candidates a pivot for which the fill-in is minimized. The strategy thus neglects that touching a cell that is already nonzero does not decrease the sparsity (it may in fact increase if we are lucky enough). Finding a pivot choice that minimizes the number of new entries introduced during the whole elimination process has been shown to be NP-complete by Yannakakis [16]. But how much does this matter? In other words: how close does the Markowitz pivot selection strategy get to the theoretical optimum? This is the first question we address in this paper, by an exhaustive search through all square matrices up to size 8 × 8. The second question, addressed afterwards, is whether machine learning can yield a better pivot selection strategy.

We will distinguish two kinds of matrix entries: 0 (zero) and ∗ (nonzero), and we ignore the possibility of accidental cancellations, so we adopt the simplifying assumption that the sum of two nonzero elements is always nonzero. As coefficient domain, we therefore take the set $S = \{0, \ast\}$ together with addition and multiplication defined as follows:
\[
\begin{array}{c|cc}
+ & 0 & \ast \\ \hline
0 & 0 & \ast \\
\ast & \ast & \ast
\end{array}
\qquad\qquad
\begin{array}{c|cc}
\cdot & 0 & \ast \\ \hline
0 & 0 & 0 \\
\ast & 0 & \ast
\end{array}
\]
The operations we count are ∗ + ∗ and ∗ · ∗, i.e., operations not involving zero. As we have primarily applications with symbolic matrices in mind (originating, e.g., from symbolic summation [9], Gröbner bases computation [5], or experimental mathematics [1,2]), we do not worry about numerical stability issues.

The exact number of operations depends not only on the choice of the pivot but also on how the elimination is performed. We consider two variants. In the first variant, we add a suitable multiple of the pivot row to all rows which have a nonzero entry in the pivot column:
\[
\begin{pmatrix} a & b & c & d \\ e & f & 0 & g \end{pmatrix}
\;\xrightarrow{\ \mathrm{row}_2 \,\leftarrow\, \mathrm{row}_2 - \frac{e}{a}\,\mathrm{row}_1\ }\;
\begin{pmatrix} a & b & c & d \\ 0 & f - \frac{eb}{a} & -\frac{ec}{a} & g - \frac{ed}{a} \end{pmatrix}.
\]
In this case, we count the following operations. First, there are $c - 1$ operations for computing the prefactors like $e/a$ in the above sketch, where $c$ is the number of nonzero elements in the pivot column. Secondly, there are $(r-1)(c-1)$ multiplications for computing the products like $eb/a$, where $r$ is the number of nonzero elements in the pivot row. Finally, for each clash of two nonzero elements in the submatrix, like $f$ and $eb/a$ in the sketch above, we count one addition.
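To make this cost model concrete, here is a small Python sketch (our own illustration, not code from the paper; all names are ours). It represents a pattern over S = {0, ∗} as a boolean array and computes the Markowitz fill-in of a candidate pivot, the number of operations of one elimination step in the first (field) variant, and the resulting pattern.

```python
import numpy as np

def fill_in(A, i, j):
    """Markowitz fill-in (r_i - 1)(c_j - 1) of the candidate pivot at (i, j)."""
    assert A[i, j], "pivot must be nonzero"
    r = int(A[i, :].sum())  # nonzero entries in the pivot row
    c = int(A[:, j].sum())  # nonzero entries in the pivot column
    return (r - 1) * (c - 1)

def field_step_cost(A, i, j):
    """Operations of one elimination step with pivot (i, j), field variant."""
    assert A[i, j]
    r = int(A[i, :].sum())
    c = int(A[:, j].sum())
    cost = (c - 1) + (r - 1) * (c - 1)       # prefactors e/a and products like eb/a
    rows = [k for k in range(A.shape[0]) if k != i and A[k, j]]
    cols = [l for l in range(A.shape[1]) if l != j and A[i, l]]
    cost += sum(int(A[k, l]) for k in rows for l in cols)  # one addition per clash
    return cost

def eliminate(A, i, j):
    """Pattern of the matrix after eliminating column j with pivot (i, j)."""
    B = A.copy()
    for k in range(A.shape[0]):
        if k != i and B[k, j]:
            B[k, :] |= A[i, :]   # no accidental cancellation: * + * = *
            B[k, j] = False      # the eliminated entry becomes 0
    return B

# tiny example: the 2 x 4 sketch from the text, with a zero at position (1, 2)
A = np.array([[1, 1, 1, 1],
              [1, 1, 0, 1]], dtype=bool)
print(fill_in(A, 0, 0), field_step_cost(A, 0, 0))  # fill-in 3, cost 1 + 3 + 2 = 6
print(eliminate(A, 0, 0).astype(int))              # second row becomes (0, *, *, *)
```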
The second variant is inspired by fraction-free elimination [6]. Here we do not compute a multiplicative inverse of the pivot. Instead, the affected rows in the submatrix are multiplied by the pivot:
\[
\begin{pmatrix} a & b & c & d \\ e & f & 0 & g \end{pmatrix}
\;\xrightarrow{\ \mathrm{row}_2 \,\leftarrow\, a\,\mathrm{row}_2 - e\,\mathrm{row}_1\ }\;
\begin{pmatrix} a & b & c & d \\ 0 & af - eb & -ec & ag - ed \end{pmatrix}.
\]
In this case, we count the following operations. First, the number of multiplications by $a$ is given by $\sum_i (r_i - 1)$, where $r_i$ is the number of nonzero entries in the $i$-th row and the summation ranges over the rows which have a nonzero entry in the pivot column, excluding the pivot row. Secondly, there are again
$(r-1)(c-1)$ multiplications to compute all the numbers corresponding to $eb$ in the above sketch, where $r$ is the number of nonzero elements in the pivot row and $c$ is the number of nonzero elements in the pivot column. Finally, for each clash of two nonzero elements in the submatrix, like $af$ and $eb$ in the sketch above, we count one addition.

In the following, we refer to the first variant as the "field case" (because it involves a division) and to the second variant as the "ring case" (because it is fraction free).

It is clear that when we distinguish two kinds of entries, 0 and ∗, there are $2^{n^2}$ different matrices of size n × n. However, for the problem under consideration, we do not need to consider all of them. Pivot search will not be affected by permuting the rows of a matrix. For two n × n matrices A, B, write A ≈ B if a suitable permutation of rows turns A into B. Then ≈ is an equivalence relation, and we get a normal form with respect to ≈ by simply sorting the rows. It suffices to consider these normal forms, which reduces the problem from $2^{n^2}$ matrices to $\binom{2^n + n - 1}{n}$ equivalence classes. The number of equivalence classes is the number of sorted n-tuples of binary numbers less than $2^n$.

As we consider a pivot search that also allows column exchanges, we can go a step further. Write A ∼ B if applying a suitable permutation to the rows and a suitable permutation to the columns turns A into B. This is also an equivalence relation, but it is less obvious how to get a normal form. It was observed by Živković [17] that deciding ∼ on binary matrices is equivalent to deciding the graph isomorphism problem for bipartite graphs. The idea is to interpret the matrix as an adjacency matrix where row indices correspond to vertices of the first kind and column indices correspond to vertices of the second kind of a bipartite graph. Permuting rows or columns then amounts to applying permutations to each of the two kinds of vertices. Graph isomorphism is a difficult problem, but it is not a big deal for the matrix sizes we consider here. We used the nauty library [11] to compute normal forms with respect to ∼, and only analyzed matrices in normal form.

There does not appear to be a simple formula for the number of equivalence classes with respect to ∼, but for small n the numbers can be determined by Pólya enumeration theory, as explained for example in [7], and they are available as A002724 in the OEIS [14]. Here are the counts for n = 1, ..., 8:

    n        |S^{n×n}|                 |S^{n×n}/≈|          |S^{n×n}/∼|
  1 × 1      2                         2                    2
  2 × 2      16                        10                   7
  3 × 3      512                       120                  36
  4 × 4      65536                     3876                 317
  5 × 5      33554432                  376992               5624
  6 × 6      68719476736               119877472            251610
  7 × 7      562949953421312           131254487936         33642660
  8 × 8      18446744073709551616      509850594887712      14685630688
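The middle column of this table can be checked by brute force with a few lines of Python (our own sketch, not the authors' code; the ∼ column requires a canonical form computed with nauty and is not reproduced here):

```python
from itertools import product
from math import comb

def normal_form_rows(matrix):
    """Normal form under row permutations: sort the rows."""
    return tuple(sorted(matrix))

def count_classes(n):
    """Number of n x n {0,*} matrices up to row permutations, by brute force."""
    classes = set()
    for rows in product(range(2 ** n), repeat=n):  # each row encoded as an n-bit number
        classes.add(normal_form_rows(rows))
    return len(classes)

for n in range(1, 5):
    assert count_classes(n) == comb(2 ** n + n - 1, n)
    print(n, count_classes(n))   # 1 2, 2 10, 3 120, 4 3876
```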
Exhaustive Search

We use exhaustive search to create databases with the following information: the matrix itself, the cost of elimination over a field and over a ring with the best pivot and what the best pivot is, the cost with the worst pivot and what the worst pivot is, the median of the elimination costs, and the same considering only the pivots which produce minimum fill-in. For matrices of size n × n we only do one elimination step and then use the data from the database of matrices of size n − 1.

We include a simplified pseudocode to show how the data is produced. By M(n) we denote a set of representatives of S^{n×n} with respect to ∼. The function Eliminate takes a matrix and a row and column index and returns the result of performing one step of Gaussian elimination. The function Cost takes the same input and returns the number of operations needed in the elimination step. The mappings costmin, costmax, and costmed are the database entries.
input:  a size n, a set M(n) of n × n matrices, and three mappings costmin, costmax, and costmed from M(n − 1) to the reals
output: three mappings costmin, costmax, and costmed from M(n) to the reals

 1  for m ∈ M(n) do
 2      if there exist i, j ∈ {1, ..., n} with Cost(m, i, j) = 0 then
 3          m′ ← Eliminate(m, i, j)
 4          costmin(m) ← costmin(m′)
 5          costmax(m) ← costmax(m′)
 6          costmed(m) ← costmed(m′)
 7      else
 8          costmin(m) ← ∞; costmax(m) ← −∞
 9          for all (i, j) ∈ {1, ..., n}² with m_{i,j} = ∗ do
10              m′ ← Eliminate(m, i, j)
11              o ← Cost(m, i, j)
12              costmin(m) ← Min(costmin(m′) + o, costmin(m))
13              costmax(m) ← Max(costmax(m′) + o, costmax(m))
14              cost_{i,j} ← costmed(m′) + o
15          costmed(m) ← Median(cost)

The median in line 15 is taken only over those pivots which have been considered. For the minimum fill-in strategy we adjust the condition in line 9 such that only pivots which produce minimum fill-in are considered; in this case, the database entries are also the ones obtained with the minimum fill-in strategy. Note that not only a pivot with no other elements in its column results in zero elimination cost (since there is nothing to eliminate), but also a pivot with no other elements in its row results in no elimination cost, since it does not change any matrix entries apart from those which become 0. In line 2 we ensure that, regardless of the strategy we are analyzing, a free pivot is always chosen. Also note that if there are several such pivots, then the order in which we choose them does not affect the total cost.

We use the function genbg from the nauty package to produce the list of n × n matrices, and we use the main function of the nauty package to compute a canonical form after the elimination step. The full source code and the databases up to size 6 × 6 are available for download.
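To make the recursion behind the pseudocode concrete, here is a small self-contained Python sketch (our own code, not the authors'; for simplicity it computes only costmin, uses the field-variant cost, and falls back to the row-sorting normal form for ≈ instead of nauty's canonical form for ∼):

```python
from functools import lru_cache

def nonzero_pivots(m, n):
    """All positions (i, j) with a nonzero entry."""
    return [(i, j) for i in range(n) for j in range(n) if m[i][j]]

def step(m, n, i, j):
    """One elimination step with pivot (i, j): returns the field-variant cost
    and the remaining (n-1) x (n-1) pattern (pivot row and column removed)."""
    col = [k for k in range(n) if k != i and m[k][j]]   # rows to eliminate
    row = [l for l in range(n) if l != j and m[i][l]]   # other entries of the pivot row
    c, r = len(col) + 1, len(row) + 1
    clashes = sum(1 for k in col for l in row if m[k][l])
    # a pivot with no other entry in its row (or column) is free
    cost = 0 if r == 1 else (c - 1) + (r - 1) * (c - 1) + clashes
    rest = tuple(tuple(m[k][l] or (m[i][l] and m[k][j])  # no accidental cancellation
                       for l in range(n) if l != j)
                 for k in range(n) if k != i)
    return cost, rest

def normal(m):
    """Cheap normal form for row permutations: sort the rows.
    The paper uses nauty canonical forms for the finer relation."""
    return tuple(sorted(m))

@lru_cache(maxsize=None)
def costmin(m):
    """Minimum number of field operations to bring the pattern m to echelon form."""
    n = len(m)
    if n <= 1:
        return 0
    pivots = nonzero_pivots(m, n)
    if not pivots:                 # zero matrix, nothing to eliminate
        return 0
    free = [p for p in pivots if step(m, n, *p)[0] == 0]
    if free:                       # line 2 of the pseudocode: always take a free pivot
        pivots = free[:1]
    best = None
    for (i, j) in pivots:
        o, rest = step(m, n, i, j)
        total = o + costmin(normal(rest))
        best = total if best is None else min(best, total)
    return best

# example: optimal elimination cost of a fully dense 3 x 3 pattern
dense3 = ((True,) * 3,) * 3
print(costmin(dense3))   # 13
```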
Machine Learning

There have been some recent advances in applying machine learning to improve selection heuristics in symbolic computation. As is the case for Gaussian elimination, many algorithms allow different choices which do not affect the correctness of the result but may have a large impact on the performance. Machine learning models have for example been applied in cylindrical algebraic decomposition [8] and in Buchberger's algorithm [13]. For more applications see [4]. Without going into too much detail on the background of machine learning, we give here a summary of our approach, mainly in order to document the computations we performed and to facilitate a proper interpretation of the experimental results reported later. For explanations of the technical terms used in this section, we refer to the literature on machine learning.

We train a reinforcement learning agent [15] to select a good pivot in Gaussian elimination. In reinforcement learning an agent interacts with an environment by choosing an action in a given state. In every time step the agent is presented with a state s_t, chooses an action a_t, and the environment returns the new state s_{t+1} and a reward r_{t+1}. In our case the state is the matrix and the action is a row and a column index. The new state is the matrix we get after performing the elimination with the chosen pivot, and the reward is minus the number of operations needed in the elimination step. An episode is a sequence of steps that transforms a matrix to row echelon form. The agent tries to maximize the return
\[
G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k+1}
\]
with the discount factor 0 < γ ≤ 1. The return measures the expected future rewards, and the discount factor makes the agent prefer earlier rewards over later rewards. Since we have a bound on the length of an episode, we can choose γ = 1; in this case the return equals minus the total number of operations needed.

A difference to the more common concept of supervised learning is that we do not need a training set with training data. Instead we can sample from all possible inputs, and the reward signal replaces the training data. This reduces the problem that an agent can produce bad answers outside of the training pool. We sample matrices uniformly from all binary matrices. During the learning process, actions are chosen using an ε-greedy policy. This means that with probability ε we choose a random action, and with probability 1 − ε we choose a greedy action, i.e., the action the agent would choose in the given situation. This policy ensures accurate predictions for the actions the agent eventually chooses and also that we do not miss good choices because we never try them.

The main component of the reinforcement learning agent is the deep Q-network, which approximates the Q-value function [12,15]. The Q-value function maps a state and an action to the expected return. We can use this information to pick a pivot that has the lowest expected cost. Since we need to fix the network size in advance, we can only consider matrices of bounded size. For the present paper we stick to small matrix sizes, since we can then evaluate the overall performance using the database and the network stays of easily manageable size. We encode the row and column of the pivot by a one-hot vector: we use 2n input nodes which each correspond to a row or a column and set those of the pivot to 1 and the others to 0. We also tried to use the fill-in as an additional input feature to see whether it improves the performance. After experimenting with different network structures we settled on a fully connected network with n(n + 2), respectively n(n + 2) + 1, input nodes and two hidden layers with n(n + 2) nodes and a ReLU activation function. We chose this architecture to fit the task at hand; for larger matrices, training and evaluating this neural network would get prohibitively expensive. Selecting appropriate features or decomposing the problem into smaller parts are possible ways around this.

The network is trained using deep Q-learning with experience replay and a target network [12]. Experience replay means that we keep a replay buffer with state-action-reward tuples we created, and in each training step we sample a batch of training data for the network to learn from. This helps the network not to "forget" what it already learned.

The learning process is controlled by several hyperparameters. They can have a huge impact on the performance and the convergence of the learning process. We did not invest a lot of time in tuning the network architecture and the hyperparameters, since our goal was to evaluate the general applicability of machine learning to this problem rather than finding best-possible results. The ε of the ε-greedy policy starts at a larger value and is decreased to ε = 0.1 during training. We use a discount factor of 1 and a fixed small learning rate.
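The following PyTorch sketch (our own illustration; the paper does not publish its network code, and all names here are ours) shows the kind of Q-network described above for n = 8: n(n + 2) inputs (the n² matrix entries plus a one-hot encoding of the pivot row and column), two hidden layers of n(n + 2) ReLU units, and a single Q-value output; a pivot is chosen greedily over all nonzero candidates.

```python
import torch
import torch.nn as nn

N = 8  # matrix size considered by the network

def make_q_network(n: int) -> nn.Module:
    width = n * (n + 2)            # n*n matrix entries + 2n one-hot pivot encoding
    return nn.Sequential(
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, 1),       # predicted return (negative elimination cost)
    )

def encode(matrix: torch.Tensor, i: int, j: int) -> torch.Tensor:
    """Concatenate the flattened 0/1 pattern with one-hot row and column of the pivot."""
    n = matrix.shape[0]
    one_hot = torch.zeros(2 * n)
    one_hot[i] = 1.0
    one_hot[n + j] = 1.0
    return torch.cat([matrix.flatten().float(), one_hot])

def greedy_pivot(q_net: nn.Module, matrix: torch.Tensor):
    """Pick the nonzero entry with the highest predicted Q-value."""
    candidates = [(i, j) for i in range(matrix.shape[0])
                  for j in range(matrix.shape[1]) if matrix[i, j] != 0]
    with torch.no_grad():
        values = [q_net(encode(matrix, i, j)) for (i, j) in candidates]
    return candidates[int(torch.stack(values).argmax())]

q_net = make_q_network(N)
example = (torch.rand(N, N) < 0.3).int()   # a random sparse 0/1 pattern
example[0, 0] = 1                          # make sure at least one pivot exists
print(greedy_pivot(q_net, example))
```

During training one would, with probability ε, replace this greedy choice by a random nonzero entry, and fit the network on replayed (state, action, reward) samples as described above.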
Results

In this section we analyze the results of the exhaustive search and the performance of the machine learning model. In the first graph below we show the saving achieved by using the minimum fill-in strategy compared to the median cost over all pivots (a random pivot choice); the second graph shows the additional saving achieved by an optimal pivot compared to the Markowitz strategy. We observe a saving for all matrix sizes up to 8 × 8, in the field case as well as in the ring case. The saving grows with the matrix size up to 7 × 7, but there is a decrease from 7 × 7 to 8 × 8. In view of this decrease, it is hard to predict how the graphs continue for matrix sizes that are currently out of the reach of exhaustive search.

[Figure: saving by Markowitz vs. random pivot (%), for matrix sizes 2 to 8, "field" and "ring". Figure: saving by optimal pivot vs. Markowitz (%), for matrix sizes 2 to 8, "field" and "ring".]
For matrices of size 6 × 6, the following graph shows the maximal possible saving as a function of the density of the matrix.

[Figure: maximal possible saving (%) against density (%), in bins 0-10, 10-20, ..., 90-100.]

Since several different pivots can have minimal fill-in, the question arises whether we just need a refined criterion to pick an optimal pivot among those that have minimal fill-in, or whether we need to do something completely different. In order to address this question, we test whether the best pivot which produces minimal fill-in is already optimal. We observe that the percentage of matrices for which Markowitz's strategy is optimal throughout the whole computation drops to 46% in the field case and 30% in the ring case for 8 × 8 matrices.

[Figure: percentage of matrices for which a minimum fill-in pivot is optimal, for matrix sizes 2 to 8, "field" and "ring".]

The table below shows the improvement achieved by the machine learning model compared to the fill-in strategy. The machine learning model is able to surpass the Markowitz strategy by a very small amount. The network was trained on 40000 to 100000 matrices, to a point where additional training did not result in further improvement.

[Table: improvement of the machine learning model over the minimum fill-in strategy.]

To conclude, we look at an example where the pivot with minimal fill-in is not optimal:
\[
\begin{pmatrix}
\ast & \ast & \ast & \ast & \ast & \ast \\
\ast & \ast & \ast & \ast & \ast & \ast \\
0 & 0 & \ast & \ast & \ast & \ast \\
0 & 0 & 0 & \ast & \ast & \ast \\
0 & 0 & 0 & 0 & \ast & \ast
\end{pmatrix}.
\]
When the rows and columns are sorted like this, it is easy to spot that if we choose a pivot in the first column, we get a row echelon form after one elimination step with 11 operations. However, the pivots in the first column produce fill-in 5, and those in the last row produce only fill-in 4. Choosing the pivot with the minimal fill-in results in four elimination steps, and in each step we have to eliminate every row, resulting in a total of 32 field operations. Over a ring the difference becomes even larger: 15 operations are needed with the best pivot and 55 if we always pick what produces minimal fill-in.

There are two takeaways from this example. First, we notice that the fill-in for the better pivot is produced by the elements in its row, while the fill-in for the other pivot is produced by its column. Especially in the ring case this leads to a large amount of extra computation, since for every element we want to eliminate, all the elements in the corresponding row have to be multiplied by the pivot. This suggests using some kind of weighted fill-in where the number of elements in the column is weighted higher than the number of elements in the row. Another heuristic criterion motivated by this example is a two-step lookahead: in the first two columns there is a block of four nonzero elements, and all other elements are 0. If we choose one of these four elements as pivot, then we only have to eliminate one row, and the elimination is completed for both rows, since the second column contains no other elements. In this particular example, the neural network finds the optimal pivot with the higher fill-in.
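As a sanity check, the fill-in values and the costs of the first elimination step for both pivot choices in this example can be reproduced with the field-case cost model from the introduction (our own self-contained Python snippet, with hypothetical helper names):

```python
import numpy as np

# the example matrix: two full rows sharing columns 1-2, the rest in echelon shape
A = np.array([[1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1, 1],
              [0, 0, 1, 1, 1, 1],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 0, 0, 1, 1]], dtype=bool)

def fill_in(A, i, j):
    return (int(A[i].sum()) - 1) * (int(A[:, j].sum()) - 1)

def field_step_cost(A, i, j):
    r, c = int(A[i].sum()), int(A[:, j].sum())
    rows = [k for k in range(A.shape[0]) if k != i and A[k, j]]
    cols = [l for l in range(A.shape[1]) if l != j and A[i, l]]
    clashes = sum(int(A[k, l]) for k in rows for l in cols)
    return (c - 1) + (r - 1) * (c - 1) + clashes

print(fill_in(A, 0, 0), fill_in(A, 4, 4))  # 5 and 4
print(field_step_cost(A, 0, 0))            # 11: one step completes the elimination
print(field_step_cost(A, 4, 4))            # 12: first of the four steps of the
                                           #     minimum fill-in choice (32 in total)
```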
References
1. Brualdi, R.A., Ryser, H.J., et al.: Combinatorial Matrix Theory, vol. 39. Springer (1991)
2. Corless, R.M., Thornton, S.E.: The Bohemian eigenvalue project. ACM Communications in Computer Algebra (4), 158–160 (February 2017). https://doi.org/10.1145/3055282.3055289
3. Duff, I.S., Erisman, A.M., Reid, J.K.: Direct Methods for Sparse Matrices. Clarendon Press (1986)
4. England, M.: Machine learning for mathematical software. In: Mathematical Software – ICMS 2018, pp. 165–174. Springer International Publishing, Cham (2018)
5. Faugère, J.C.: A new efficient algorithm for computing Gröbner bases. Journal of Pure and Applied Algebra (1–3), 61–88 (1999)
6. Geddes, K.O., Czapor, S.R., Labahn, G.: Algorithms for Computer Algebra. Kluwer (1992)
7. Harary, F., Palmer, E.M.: Graphical Enumeration. Academic Press (1973)
8. Huang, Z., England, M., Wilson, D., Davenport, J.H., Paulson, L.C., Bridge, J.: Applying machine learning to the problem of choosing a heuristic to select the variable ordering for cylindrical algebraic decomposition. In: Intelligent Computer Mathematics, pp. 92–107. Springer International Publishing, Cham (2014)
9. Koutschan, C.: Creative telescoping for holonomic functions. In: Computer Algebra in Quantum Field Theory: Integration, Summation and Special Functions, pp. 171–194. Texts and Monographs in Symbolic Computation, Springer (2013)
10. Markowitz, H.M.: The elimination form of the inverse and its application to linear programming. Management Science (3), 255–269 (1957)
11. McKay, B.D., Piperno, A.: Practical graph isomorphism, II. Journal of Symbolic Computation (0), 94–112 (2014). https://doi.org/10.1016/j.jsc.2013.09.003
12. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature (7540), 529–533 (2015)
13. Peifer, D., Stillman, M., Halpern-Leistner, D.: Learning selection strategies in Buchberger's algorithm. arXiv preprint arXiv:2005.01917 (2020)
14. Sloane, N.J.A.: The On-Line Encyclopedia of Integer Sequences (2020), https://oeis.org/
15. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
16. Yannakakis, M.: Computing the minimum fill-in is NP-complete. SIAM J. Algebraic Discrete Methods (1), 77–79 (1981). https://doi.org/10.1137/0602010
17. Živković, M.: Classification of small (0,1) matrices. Linear Algebra and its Applications 414