Parallel computation of the rank of large sparse matrices from algebraic K-theory
Jean-Guillaume Dumas, Philippe Elbaz-Vincent, Pascal Giorgi, Anna Urbańska
Jean-Guillaume Dumas
Laboratoire Jean Kuntzmann, UMR CNRS 5224, Université Joseph Fourier, B.P. 53X, 38041 Grenoble, France.
Philippe Elbaz-Vincent
Institut de Mathématiques et de Modélisation de Montpellier, UMR CNRS 5149, Université Montpellier II, CC051, Place E. Bataillon, 34095 Montpellier cedex 5, France. [email protected]
Pascal Giorgi
Laboratoire LP2A, Université de Perpignan Via Domitia, 52, avenue Paul Alduy, 66860 Perpignan, France. [email protected]
Anna Urbańska
Laboratoire Jean Kuntzmann, UMR CNRS 5224, Université Joseph Fourier, B.P. 53X, 38041 Grenoble, France.
ABSTRACT
This paper deals with the computation of the rank and of some integer Smith forms of a series of sparse matrices arising in algebraic K-theory. The number of non-zero entries in the considered matrices ranges from 8 to 37 million. The largest rank computation took more than 35 days on 50 processors. We report on the actual algorithms we used to build the matrices, their link to the motivic cohomology, and the linear algebra and parallelizations required to perform such huge computations. In particular, these results are part of the first computation of the cohomology of the linear group GL_7(Z).
1. INTRODUCTION

1.1 Motivation from K-theory
Numerous problems in modern number theory could be solved, or at least better understood, if we had a good knowledge of the algebraic K-theory (or motivic cohomology) of rings of integers of number fields, or of the cohomology of arithmetic groups (i.e. subgroups of finite index of GL_N(Z)). As a short list, we could mention:
• modular forms and special values of L-functions,
• Iwasawa theory and understanding of the "cyclotomy",
• Galois representations (or automorphic representations).
Let us explain first what algebraic K-theory is: to a commutative ring R we can associate (functorially) an infinite family of abelian groups K_n(R) which encodes a huge amount of information on its arithmetic, geometric and algebraic structures. These groups extend some classical notions and give higher dimensional analogues of some well known results. For instance, if R is a ring of integers or a polynomial ring over a field, we have:
• K_0(R) is the classical Grothendieck group (classifying finitely generated R-modules),
• K_1(R) is the group of invertibles of R,
• K_2(R) classifies the universal extensions of SL(R) and is related to the Brauer group in the case of a field.
We can give a general abstract definition of K_n for n > 0:

    K_n(R) = π_n(BGL(R)^+),

where BGL(R)^+ is the Quillen +-construction applied to the classifying space BGL(R) and π_n denotes the n-th homotopy group [23]. We can also see the K-groups as a way to understand GL(R). For instance, K_1(R) is isomorphic to GL(R) modulo elementary relations. For a more detailed background on K-theory and its applications see [23]. Fact: these groups are hard to compute.
The Vandiver conjecture plays an important role in the understanding of the "cyclotomy" in number theory. Its statement is as follows:
Conjecture of Vandiver: Let p be an odd prime, ζ_p = e^{2πi/p}, C the p-Sylow subgroup of the class group of Q(ζ_p) and C^+ the subgroup fixed by the complex conjugation. The Vandiver conjecture is the statement that C^+ = 0.
To illustrate its interplay with K-theory, we have:
Fact (Kurihara, 1992) [22]: If K_{4n}(Z) = 0 for all n > 0, then the conjecture of Vandiver is true.
Some partial results on this conjecture and its connection with the cohomology of SL_N(Z) are given in [25].
General problem: Find explicit methods for computing (co)homologies of arithmetic groups and the K-theory of number fields (or their rings of integers).
Our first task will be the computation of the (co)homologies of linear groups (mainly GL_N(Z) and SL_N(Z)). We can show [15, 13, 14] that the computation of those groups is the key point for computing K-groups with our method. We begin by recalling some facts from topology.
The notion of cellular complex is a generalization of a graph to several dimensions. We call n-cell a topological space homeomorphic to the open unit ball of R^n and such that its closure is homeomorphic to the closed unit ball. A cellular complex (or cell complex, or also cellular decomposition or cellular space) is a family of sets X_n (with n ∈ N), such that each X_n is a collection (possibly infinite) of n-cells. Usually we work with cell complexes with a finite number of cells. A classical result [26] shows that any (reasonable) topological space can be approximated by such cell complexes.

Figure: a cellular decomposition of the cube.
To a cell complex we can associate a family C_n (n ∈ N) of free Z-modules and a family of linear maps d_n : C_n → C_{n−1}. The module C_n is the free module with basis the n-cells (modulo a choice of orientation). If the complex is finite, then all the modules are of finite rank and we will denote by {b_λ^n}_{λ∈Λ_n} a basis of C_n, Λ_n being an index set for the n-cells. Then the map d_n is defined by

    d_n(b_λ^n) = Σ_µ [b_λ^n : b_µ^{n−1}] b_µ^{n−1},

and the integer [b_λ^n : b_µ^{n−1}] is called the incidence number of the cell b_µ^{n−1} in the cell b_λ^n. The relation d_n ∘ d_{n+1} = 0 (i.e., formally d² = 0) should hold for any n. If the complex is regular (i.e., there is always at most one cell of dimension n+1 between two cells of dimension n), then we can build the incidences inductively, starting from the 0-cells up to the maximal cells, using the d² = 0 condition; moreover the incidence numbers will be 0 or ±1.
The n-th homology group of the complex is defined as the quotient of Ker(d_n) by Im(d_{n+1}). This construction is functorial (in the category of cell complexes). As a consequence, we can determine the homology groups effectively by computing the Smith forms of the integral matrices of the d_n (relative to the fixed bases). Notice that the Smith form gives both the rank of the free part and the explicit description of the torsion part. In case the computation of the torsion is unnecessary (or too difficult to achieve), we can tensorize by −⊗_Z Q and the homology groups become Q-vector spaces whose dimensions are given by the ranks of the matrices of the differentials.
In general, the matrices of the differentials can be very large even for a relatively simple cell complex. However, they are also very sparse and we may look for some other favorable properties which would enable the computation despite the size of the problem.
If G is a group acting on a cell space X (i.e., G sends n-cells to n-cells), then, under some technical assumptions on X and on the action, we can show [4] that, roughly, computing the homology of G (as group homology) is the same as computing the homology of the cell space X/G. Hence, if
X/G can be calculated effectively, we can compute explicitly its homology, and from this the homology of G (similarly for the cohomology). Notice that in general the space X/G will not be regular anymore. The main difficulty is to find a cell space X such that X/G is effective. We will discuss in section 2.1 how we can construct such a cellular space for linear groups.
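To fix ideas, here is a toy sketch (in Python, with exact arithmetic over Q) of how rational homology dimensions are read off from the ranks of the differentials, as explained above; the tiny boundary matrix of a triangle is our own illustrative example, not one of the complexes considered in this paper.

    from fractions import Fraction

    def rank_q(mat):
        # Exact Gaussian elimination over Q; fine for small dense examples,
        # the matrices of this paper require the sparse methods of section 2.
        m = [[Fraction(x) for x in row] for row in mat]
        r = 0
        for c in range(len(m[0]) if m else 0):
            piv = next((i for i in range(r, len(m)) if m[i][c]), None)
            if piv is None:
                continue
            m[r], m[piv] = m[piv], m[r]
            for i in range(len(m)):
                if i != r and m[i][c]:
                    f = m[i][c] / m[r][c]
                    m[i] = [a - f * b for a, b in zip(m[i], m[r])]
            r += 1
        return r

    # Boundary of a triangle seen as a cell complex: 3 vertices, 3 edges.
    # Rows index the 0-cells, columns index the 1-cells.
    d1 = [[-1,  0,  1],
          [ 1, -1,  0],
          [ 0,  1, -1]]
    r1 = rank_q(d1)
    b0 = 3 - 0 - r1   # dim H_0 = dim C_0 - rank d_0 - rank d_1 = 1 component
    b1 = 3 - r1 - 0   # dim H_1 = dim C_1 - rank d_1 - rank d_2 = 1 cycle
    print(b0, b1)     # 1 1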
The first idea to deal with very large sparse matrices is to use them as black boxes, i.e. only through the matrix-vector product. This lets the matrix remain sparse all along the algorithm, where Gaussian elimination for instance would fill it up. To compute the rank, the fastest black-box algorithm is Wiedemann's, as shown in [10]. This algorithm computes a sequence of scalars of the form u^T A^i v (u and v are vectors) with i matrix-vector products and dot products. It has been shown to successfully deal with large sparse matrix problems from homology, see e.g. [9]. Nevertheless, when matrices are very large (e.g. millions of non-zero entries), computations would require months or years. This is due to the low practical efficiency of the computation of a sparse matrix-vector product. For instance, in our case of homology computation, one would need 300 days of CPU to compute the sequence involving the matrix GL7d19 of GL_7(Z). To achieve computations of many homologies in a realistic time we then need to parallelize the computation of the sequence. The candidate algorithm is then the block Wiedemann method, which computes a sequence X^T A^i Y, where X and Y are blocks of vectors. This step can easily be parallelized by distributing the vectors of the block Y to several processors (a minimal sketch is given after the list below). We thus have several objectives with regard to the parallelism in this paper:
• We want to solve large problems coming from homology computation.
• We want to experimentally validate our parallel implementation of the block Wiedemann rank algorithm.
• We want to show the parallel scaling of block Wiedemann approaches.
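As a minimal illustration of the black-box principle (a Python sketch, not the LinBox code actually used), the scalar sequence u^T A^i v is obtained with one sparse matrix-vector product and one dot product per term; the matrix, stored as (row, column, value) triples, is never modified:

    def apply_blackbox(triples, nrows, x, p):
        # y = A.x mod p: the only operation a black-box method needs,
        # so A stays sparse for the whole computation.
        y = [0] * nrows
        for i, j, v in triples:
            y[i] = (y[i] + v * x[j]) % p
        return y

    def scalar_sequence(triples, n, u, v, length, p):
        # Returns [u^T v, u^T A v, ..., u^T A^(length-1) v] mod p.
        # The block method replaces u, v by blocks X, Y; the columns of Y
        # are then distributed to several processors.
        seq, x = [], v[:]
        for _ in range(length):
            seq.append(sum(a * b for a, b in zip(u, x)) % p)
            x = apply_blackbox(triples, n, x, p)
        return seq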
In section 2, the algorithms and optimizations we used to generate the matrices and compute with them are discussed. Then, section 3 shows our experimental results on these large sparse matrices coming from homology.
2. ALGORITHMS AND OPTIMIZATIONS
As seen in section 1.1, we can effectively compute the homology of a cellular space and of a group which acts "nicely" on a cellular space. The main difficulty remains to find such an explicit cellular space. In section 2.1 we present the process of matrix generation and optimization. Then in the following sections we give a description of the algorithms used for the computation of the rank and the Smith form of those matrices.
2.1 Matrix generation

In the case of subgroups of GL_N(Z), we have an "obvious" action on Z^N. We can then capture the topology by regarding Z^N not as a free Z-module but as a lattice (or equivalently in terms of quadratic forms), and see if this leads to some interesting topological construction. We will describe this approach below and the results that we can get.
Let N > 1 and let C_N be the set of positive definite real quadratic forms in N variables. Given h ∈ C_N, let m(h) be the finite set of minimal vectors of h, i.e. vectors v ∈ Z^N, v ≠ 0, such that h(v) is minimal. A form h is called perfect when m(h) determines h up to scalar: if h′ ∈ C_N is such that m(h′) = m(h), then h′ is proportional to h.
Example: The form q(x, y) = x² + y² has minimum 1 and minimal vectors ±(1, 0) and ±(0, 1). Nevertheless this form is not perfect, because there is an infinite number of positive definite quadratic forms having these minimal vectors. On the other hand, the form q(x, y) = x² + xy + y² also has minimum 1 and has exactly 3 minimal vectors (up to sign), the ones above and ±(1, −1). This form is perfect, and the associated lattice is the "honeycomb lattice" (with optimal sphere packing in the plane); it is the only perfect form in two variables.
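These minimal vectors can be checked by brute-force enumeration; in the sketch below the small search box is an ad hoc choice of ours that suffices for these two binary forms:

    def minimal_vectors(q, box=3):
        # Non-zero integer vectors in [-box, box]^2 achieving the minimum of q;
        # for a positive definite binary form this small box suffices.
        pts = [(x, y) for x in range(-box, box + 1)
                      for y in range(-box, box + 1) if (x, y) != (0, 0)]
        mn = min(q(x, y) for x, y in pts)
        return mn, sorted((x, y) for x, y in pts if q(x, y) == mn)

    print(minimal_vectors(lambda x, y: x * x + y * y))
    # (1, [(-1, 0), (0, -1), (0, 1), (1, 0)])          2 pairs: not perfect
    print(minimal_vectors(lambda x, y: x * x + x * y + y * y))
    # (1, [(-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0)])  3 pairs: perfect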
Denote by C*_N the set of non-negative real quadratic forms on R^N whose kernel is spanned by a proper linear subspace of Q^N, by X*_N the quotient of C*_N by positive real homotheties, and by π : C*_N → X*_N the projection. Let X_N = π(C_N) and ∂X*_N = X*_N − X_N. Let Γ be either GL_N(Z) or SL_N(Z). The group Γ acts on C*_N and X*_N on the right by the formula h·γ = γ^t h γ, for γ ∈ Γ and h ∈ C*_N, where h is viewed as a symmetric matrix and γ^t is the transpose of the matrix γ. Voronoï proved that there are only finitely many perfect forms modulo the action of Γ and multiplication by positive real numbers ([32], Th. p. 110).
Given v ∈ Z^N − {0} we let v̂ ∈ C*_N be the form defined by v̂(x) = (v|x)², x ∈ R^N, where (v|x) is the scalar product of v and x. The convex hull of a finite subset B ⊂ Z^N − {0} is the subset of X*_N image by π of the elements Σ_j λ_j v̂_j, v_j ∈ B, λ_j > 0. For any perfect form h, we let σ(h) ⊂ X*_N be the convex hull of the set m(h) of its minimal vectors. Voronoï proved in [32] that the cells σ(h) and their intersections, as h runs over all perfect forms, define a cell decomposition of X*_N, which is invariant under the action of Γ. We endow X*_N with the corresponding CW-topology. If τ is a closed cell in X*_N and h a perfect form with τ ⊂ σ(h), we let m(τ) be the set of vectors v in m(h) such that v̂ lies in τ. Any closed cell τ is the convex hull of m(τ), and m(τ) ∩ m(τ′) = m(τ ∩ τ′).
Let d(N) = N(N + 1)/2 − 1 be the dimension of X*_N and n ≤ d(N) a natural integer. We denote by Σ*_n a set of representatives, modulo the action of Γ, of those cells of dimension n in X*_N which meet X_N, and by Σ_n ⊂ Σ*_n the cells σ such that the stabilizer Γ_σ of σ in Γ preserves its orientation. Let V_n be the free abelian group generated by Σ_n. We define as follows a map d_n : V_n → V_{n−1}.
For each closed cell σ in X*_N we fix an orientation of σ, i.e. an orientation of the real vector space R(σ) of symmetric matrices spanned by the forms v̂, v ∈ m(σ). Let σ ∈ Σ_n and let τ′ be a face of σ. Given a positive basis B′ of R(τ′) we get a basis B of R(σ) by adding after B′ a vector v̂, v ∈ m(σ) − m(τ′). We let ε(τ′, σ) = ±1 be the sign of the orientation of B in the oriented vector space R(σ) (this sign does not depend on the choice of v). Next, let τ ∈ Σ_{n−1} be the cell equivalent to τ′ = τ·γ. We define η(τ, τ′) = 1 (resp. η(τ, τ′) = −
1) when γ is compatible (resp. incompatible) with the chosen orientations of R(τ) and R(τ′). Finally we define

    d_n(σ) = Σ_{τ ∈ Σ_{n−1}} Σ_{τ′} η(τ, τ′) ε(τ′, σ) τ,   (1)

where τ′ runs through the set of faces of σ which are equivalent to τ.
It is shown in [14] that, up to p-torsion with p ≤ N + 1, the homology of this complex computes the cohomology of Γ.

Theorem (Elbaz-Vincent/Gangl/Soulé) [15, 13, 14]: The cardinalities of Σ_n and Σ*_n are as shown in Figure 1 for N = 5, 6 and in Figure 2 for N = 7.

    n              5   6   7   8   9    10   11   12   13    14
    Σ*_n(GL_5(Z))  5   10  16  23  25   23   16   9    4     3
    Σ_n(GL_5(Z))               1   7    6    1          2    3

    n              5   6   7   8   9    10   11   12   13    14    15    16   17   18  19  20
    Σ*_n(GL_6(Z))  3   10  28  71  162  329  589  874  1066  1039  775   425  181  57  18  7
    Σ_n(GL_6(Z))                   3    46   163  340  544   636   469   200  49   5
    Σ*_n(SL_6(Z))  3   10  28  71  163  347  691  1152 1532  1551  1134  585  222  62  18  7
    Σ_n(SL_6(Z))       3   10  18  43   169  460  815  1132  1270  970   434  114  27  14  7

Figure 1: Σ_n and Σ*_n for N = 5, 6 (empty slots denote zero).

    n              6      7       8      9      10      11      12
    Σ*_n(GL_7(Z))  6      28      115    467    1882    7375    26885
    Σ_n(GL_7(Z))                         1      60      1019    8899

    n              13     14      15      16      17       18       19       20       21
    Σ*_n(GL_7(Z))  87400  244029  569568
    Σ_n(GL_7(Z))   47271  171375  460261  955128  1548650  1955309  1911130  1437547  822922

    n              22      23      24     25    26   27
    Σ*_n(GL_7(Z))  374826  115411  24623  3518  352  33
    Σ_n(GL_7(Z))   349443  105054  21074  2798  305  33

Figure 2: Σ_n and Σ*_n for N = 7.

The previous result gives the precise size of the matrices involved in the computation of the homology. The main challenge was then the computation of the ranks of the matrices of the differentials (this gives the free part of the homology) and the computation of the Smith forms (which give the relevant arithmetical information on the homology), in particular for N = 7, knowing that such matrices are particularly sparse. We emphasize that what we want to detect is the "high torsion" in the homology, i.e. the prime divisors p > N + 1. All the matrices are available online in the "Sparse Integer Matrix Collection" (ljk.imag.fr/membres/Jean-Guillaume.Dumas/simc.html).

One successful approach to linear algebra computations on large sparse matrices is to rely on Lanczos/Krylov black-box methods. In particular, block versions of the Wiedemann method [33] are well suited for parallel computation. This technique was first proposed by Coppersmith in [6] for computations over GF(2), and then analyzed and proved by Kaltofen [19] and Villard [30, 31]. The idea is, for a matrix A ∈ F^{n×n}, to compute the minimal generating matrix polynomial of the matrix sequence {X A^i Y}_{i=0}^{∞} ⊂ F^{s×s}, where
X, Y are blocks of s vectors (instead of single vectors in the original Wiedemann algorithm). Therefore, an intuitive parallelization of this method is to distribute the vectors of the blocks X, Y to several processors. Thus the computation of the sequence, which is the major performance bottleneck of this method, can be done in parallel, allowing for better performance. Many implementations and practical experiments have been developed for the parallel block Wiedemann method. For instance, in 1996, Kaltofen and Lobo [20] proposed a coarse-grain implementation to solve homogeneous linear equations over GF(2). They were thus able to solve a system of 252,252 linear equations with about 11.04 million non-zero entries in about 26.5 days.
In order to reduce the number of dot products, we used a symmetric projection; in other words, we set X = Y^T in the X A^i Y sequence. The probability of success is slightly reduced in this case, but the obtained blocks are symmetric as soon as A is symmetric. This is always the case when the preconditioners of [10] are used (they are of the form A^T A). This reduces the dot product part of the computation of the sequence by a factor of two: for instance, column i of a sequence element can be deduced from its top i elements and its row i, as the sketch below illustrates. This induces some load balancing issues when one process owns the computation of one column. Note also that we use BLAS level-2 routines for the computation of these dot products; in other words we perform them by blocks.
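A sketch of this symmetry saving (plain Python; Y and Z = (A^T A)^i Y are stored column-wise; in the actual code each column is owned by one process and the products are performed by BLAS blocks):

    def symmetric_term(Y, Z, p):
        # S = Y^T Z mod p with Z = (A^T A)^i Y.  Since A^T A is symmetric,
        # S is symmetric: compute the upper triangle only and mirror it,
        # halving the number of dot products.
        s = len(Y)
        S = [[0] * s for _ in range(s)]
        for i in range(s):
            for j in range(i, s):
                S[i][j] = S[j][i] = sum(a * b for a, b in zip(Y[i], Z[j])) % p
        return S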
σ-basis computations
In order to efficiently compute σ-bases we rely on the algorithm PM-Basis of [16], which reduces this computation to polynomial matrix multiplication. One can multiply two polynomial matrices A, B ∈ F^{n×n}[x] of degree d in O(n^ω d + n² d log d) finite field operations (ω being the exponent of matrix multiplication) if d-th primitive roots of unity are available in F. Consequently, we decided for our computations to define F as a prime field with primes of the form c·2^k + 1 such that 2^k ≥ d. These primes are commonly called FFT primes since they allow the use of FFT evaluation/interpolation for polynomials of degree d. We refer the reader to [5, 3] and references therein for further information on fast polynomial matrix arithmetic. When finite fields not having d-th primitive roots of unity are used, polynomial matrix multiplication can still be done efficiently by using the Chinese Remainder Theorem with a few FFT primes. Let F be a prime field of cardinality p; then the multiplication of A, B ∈ F^{n×n}[x] of degree d can be done efficiently by using CRT with FFT primes p_i satisfying Π p_i > d × n × p². This is equivalent to performing the multiplication over the integers and then reducing the result in F. The overall performance of the multiplication, and hence of the σ-basis computation, depends on the number of FFT primes needed.

Our main interest in the block Wiedemann approach is to compute the rank of the large sparse matrices given by the homology group determination problem explained in section 1.1. Hence, we rely on the Kaltofen-Saunders rank algorithm [21] and its block version [29] to achieve efficient parallel computation. The Kaltofen-Saunders approach is based on the fact that if Ã is a suitably preconditioned version of A, then its rank is equal to the degree of its minimal polynomial minus its valuation (or co-degree) [21]. Thus, by using well chosen preconditioners and the Wiedemann algorithm, one can easily compute the rank of a sparse matrix over a finite field. The block version of this method is presented e.g. in [29].

Block Wiedemann Rank Algorithm:
1. Let A ∈ F^{n×n}; build Ã from A with good preconditioners (e.g. those of [10]).
2. Choose a random block Y ∈ F^{n×s} and compute the matrix sequence S = {Y^T Ã^i Y} for i = 0, ..., 2n/s + O(1).
3. Compute the minimal matrix generator F_{Ã,Y} ∈ F^{s×s}[x] of the matrix sequence S.
4. Return the rank r = deg(det(F_{Ã,Y})) − codeg(det(F_{Ã,Y})).

Note that if the minimal matrix generator of step 3 is in Popov form (e.g. computed using the σ-basis of [16]), then the degree of det(F_{Ã,Y}) is simply the sum of the row degrees of the matrix F_{Ã,Y}. The co-degree is zero if the determinant of the constant term of F_{Ã,Y}, seen as a matrix polynomial, is non-zero; in that case the computation of the determinant of the whole polynomial matrix can be avoided. When this fails, this determinant is computed by a massively parallel evaluation/interpolation. It could be interesting, though, to interpolate only the lower coefficients of this polynomial incrementally. This was not required for the matrices we considered and we therefore did not investigate these speed improvements further.
Note that to probabilistically compute the rank over the integers, it is sufficient to choose several primes at random and take the largest obtained value, see e.g. [9] for more details. Moreover, one can choose the primes at random among primes not dividing the determinant (and thus preserving the rank). In order to ensure this property it is sufficient to select primes not dividing the valence or the last invariant factor computed by one of the methods of the next section.
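For s = 1 the block algorithm degenerates to the scalar Kaltofen-Saunders rank, which already shows all the ingredients: a sequence of 2n + O(1) terms, a minimal generating polynomial (computed here with Berlekamp-Massey in place of a σ-basis), and rank = degree − co-degree. The sketch below works over the FFT prime 65537 and uses a random diagonal scaling as a simplified stand-in for the actual preconditioners of [10]; it is a toy Monte Carlo version, not the LinBox implementation.

    import random

    P = 65537  # FFT prime: 2^16 + 1

    def berlekamp_massey(seq, p):
        # Connection polynomial C of the sequence over GF(p):
        # sum_i C[i] * seq[n - i] = 0 for all n >= L.
        C, B, L, m, b = [1], [1], 0, 1, 1
        for n in range(len(seq)):
            d = seq[n]
            for i in range(1, L + 1):
                d = (d + C[i] * seq[n - i]) % p
            if d == 0:
                m += 1
                continue
            coef = d * pow(b, p - 2, p) % p
            C = C + [0] * max(0, len(B) + m - len(C))
            T = C[:]
            for i, bi in enumerate(B):
                C[i + m] = (C[i + m] - coef * bi) % p
            if 2 * L <= n:
                L, B, b, m = n + 1 - L, T, d, 1
            else:
                m += 1
        return C, L

    def blackbox_rank(matvec, n, p=P):
        # Monte Carlo: the minimal polynomial of the preconditioned matrix is
        # x^L * C(1/x); its degree minus its valuation, i.e. the largest index
        # of a non-zero coefficient of C, gives the rank (Kaltofen-Saunders).
        u = [random.randrange(p) for _ in range(n)]
        x = [random.randrange(p) for _ in range(n)]
        seq = []
        for _ in range(2 * n + 2):
            seq.append(sum(a * b for a, b in zip(u, x)) % p)
            x = matvec(x)
        C, _ = berlekamp_massey(seq, p)
        return max(i for i, c in enumerate(C) if c % p)

    # Toy usage: a 4x4 matrix of rank 2, given by its non-zero triples.
    triples = [(0, 0, 1), (0, 2, 2), (1, 1, 3), (2, 0, 2), (2, 2, 4)]
    d1 = [random.randrange(1, P) for _ in range(4)]
    d2 = [random.randrange(1, P) for _ in range(4)]
    def matvec(x):  # black box for the diagonally scaled matrix D1 * A * D2
        y = [0] * 4
        for i, j, v in triples:
            y[i] = (y[i] + v * d2[j] * x[j]) % P
        return [d1[i] * yi % P for i, yi in enumerate(y)]
    print(blackbox_rank(matvec, 4))  # 2 with high probability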
The computation of the Smith forms for the matrices of GL_7(Z) turned out to be a very challenging problem. Prior experience with sparse homology matrices led us to try the SmithViaValence algorithm of [9]. The idea is to compute the minimal valence (the coefficient of the smallest non-zero monomial of the minimal polynomial) of the product A^T A, to determine the primes p which divide the invariant factors of the Smith form of A. When the primes have been found, one can compute the local Smith forms of A at each p separately and return the resulting Smith form S as the product of the local Smith forms S_p over all p. The local Smith form computation can be done by a repeated Gauss elimination modulo p^e, where the exponent e is adjusted automatically during the course of the algorithm. This algorithm works very efficiently for sparse matrices, provided that the minimal polynomial of the product A^T A has a small degree. Unfortunately, the latter condition does not hold in the case of the GL_7(Z) matrices. Moreover, some early experiments with small matrices showed that many more primes occur in the computed valence than in the Smith form of the original matrix.
Thus, we decided to apply the adaptive algorithm of Saunders and Wan [24], which is a modified version of the Eberly-Giesbrecht-Villard algorithm [12]. In [12] the authors proposed a procedure
OneInvariantFactor(i, A) (OIF) which computes the i-th invariant factor of an n × n matrix A. A binary search over the distinct invariant factors then allows them to find the Smith form of A. OIF reduces the i-th factor computation to the computation of the last (n-th) invariant factor (LIF) of a preconditioned matrix A + U_i V_i, where U_i, V_i^T are random n × (n − i) matrices. In [24] the method was extended to handle the rectangular case of an m × n matrix. This is done by computing the last (i-th) invariant factor of a preconditioned matrix L_i A R_i, where L_i is an i × m and R_i an n × i matrix.
The procedure OIF is of Monte Carlo probabilistic type, where the probability of correctness is controlled by repeating the choice of preconditioners. Assuming the correctness of the LIF computation, it gives a multiple of the i-th invariant factor. In practice, LIF is also of randomized Monte Carlo type. The idea is to get a divisor of the last invariant factor by solving a linear equation Mx = b with a random right-hand side. After several solvings we get the last invariant factor with large probability. Thus, the overall situation is more complex and we cannot exclude the possibility that some primes are omitted or unnecessary in the output of OIF. However, the probability that a prime is omitted or is unnecessary in this output can be controlled for each prime separately, and is smaller for bigger primes. Therefore in [24] the authors introduce a notion of smooth and rough parts of the Smith form. The idea is to compute the local Smith forms for smaller primes by, for example, the SmithViaValence or OIF algorithm, and to recover only the large primes with the invariant factor search of [12]. When we consider large primes, a sufficient probability of correctness can be obtained with a smaller number of repetitions.
As we did not want to compute the valence, we introduced some minor changes to the algorithm, which at the end is as follows (a sketch is given after the list):
1. Compute r = rank(A);
2. For all primes 1 < p < 100, compute the local Smith form S_p of A;
3. Compute s_r(A) by the OneInvariantFactor algorithm;
4. Let P be the set of all primes p > 100 which divide s_r(A);
5. If P = ∅, return S = Π_p S_p.
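A sketch of this driver follows; rank_mod, local_smith_form and one_invariant_factor are hypothetical stand-ins for the corresponding LinBox routines (the rank algorithm above, [9, Algorithm LRE], and the OIF of [24]):

    SMALL_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
                    47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]

    def adaptive_smith(A, rank_mod, local_smith_form, one_invariant_factor):
        # Smooth/rough split of [24]: local Smith forms at the small primes,
        # then OIF certifies (Monte Carlo) that no larger prime occurs.
        r = rank_mod(A)                                              # step 1
        smooth = {p: local_smith_form(A, p) for p in SMALL_PRIMES}   # step 2
        s_r = one_invariant_factor(A, r)                             # step 3
        for p in SMALL_PRIMES:              # keep only the rough part of s_r
            while s_r % p == 0:
                s_r //= p
        if s_r == 1:                        # step 4: no prime p > 100 remains
            return smooth                   # step 5: S is the product of the S_p
        raise NotImplementedError("primes > 100 dividing s_r(A) remain: %d" % s_r)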
One advantage of this method is that we get the information on the smooth part of the Smith form very quickly. Moreover, the OIF computation acts as a certification phase which allows us to prove, with a sufficiently large probability, that no other primes are present. This probability is made explicit in the following theorem.
Theorem: The probability that there exists a prime p > P that divides the i-th invariant factor but does not divide the output of OIF, which uses M random preconditioners L_i, R_i and N random vectors b with entries in {0, 1, ..., β − 1} and β > s_i(A) in the LIF procedure, is bounded by

    M Σ_{p>P} (2/p)^N.

Proof.
As we take the gcd of the results obtained with different preconditioners L_i, R_i, it suffices that the LIF computation fails in one case to spoil the computation. We are free to choose a large bound for ||b|| such that ||b|| > s_i(A) without increasing the complexity of the LIF computation. Then the probability that a prime p < ||b|| is omitted in LIF is less than or equal to (1/β)⌈β/p⌉ ≤ 2/p, see [1]. Finally, we bound the probability that any prime p > P with p | s_i(A) is missing by taking the sum over all primes p > P.

The choice M = N = 2 suffices to obtain a small probability 0.015 of omitting an important prime, and at the same time to exclude all primes that are not in the i-th invariant factor. In our experiments, there was no need to perform the computation for any additional prime p > 100, as all the primes were excluded by the OIF computation. This is one of the most important advantages over the valence computation.
The algorithm of [24] was stated in the case of dense matrices. We slightly modified it in order to exploit the sparse structure of our matrices. In particular, we used the sparse local Smith form computation of [9, Algorithm LRE], but stuck to the dense Dixon solver [7] as long as the memory was sufficient. Any other solver, including the new sparse solver of [11], could potentially be used for larger matrices. The limits of this method are imposed by the available memory. For example, it was possible to use the dense Dixon solver only for the six smallest matrices. Furthermore, sparse elimination reached its limits for matrices of size greater than 171375 × 47271.

The encountered problems have shown the need for a more elaborate reduction algorithm. We focused our attention on the algebraic reduction algorithm for chain complexes of [18]. We implemented a simplified version of the algorithm in the language of matrices using the LinBox library. The heuristic behind this algorithm is that Gaussian elimination can propagate from one matrix of a chain complex to the next, thanks to the exactness of the differential map (i.e. the d² = 0 condition). The motivation comes from the geometric properties of homologies. By a free face we refer to a
(k − 1)-cell a which is in the differential of only one k-dimensional cell b. By removing the pair (a, b) we obtain a retract of the initial cell complex (viewed as a geometrical object), see Figure 3. The process can be repeated. From homology theory we know that the homology groups are the same for the complex and its retract. The removal of pairs leads to a reduction of the basis of the cell complex. We refer to [17, Ch. 4] for a full description of the procedure.

Figure 3: Retraction for a square.

If the differential map is represented by matrices whose rows represent the cells of dimension k − 1 and whose columns represent the cells of dimension k, the removal of a pair (a, b) can be interpreted as the removal of a row with only one non-zero entry (a 1-row) and of the column it points to. In the general case, the algebraic reduction of (a, b) such that d_k(b) = λa + u is possible iff λ is invertible in the ring of computation, i.e. Q, Z or Z_{p^k}, depending on the problem. The modified differential d̃ is given by the formula

    d̃_k(v) = d_k(v) − λ^{−1} [v : a] d_k(b).   (2)

Thus, in the basic case of free face removal, no modification is needed. In the case of matrices, the formula describes a step of Gauss elimination where the reduced row is removed and not permuted. This proves that the Smith form (or the rank) of the initial and reduced matrices will be the same, provided we add to the reduced Smith form a number of trivial invariant factors equal to the number of rows removed. The important characteristic of this method is that, thanks to the exactness of the matrix sequence, we can also remove row b and column a from the neighboring matrices. In this way, elimination in one matrix can propagate to the others.
Due to the format of the data (large files with matrix entries) we decided to implement only the simplest case of 1-row removal, which leads to entry removals but no modifications; a sketch is given below. We removed the empty rows and columns at the same time and performed the whole reduction phase at the moment of reading the files. This led to a vast matrix reduction for the GL_7(Z) matrices from the beginning of the sequence. The propagation of reductions unfortunately burned out near the GL7d14 matrix and stopped completely on GL7d19. Applying the process to the transposed sequence did not improve the solution. A next step would be to implement the propagation of the Gauss elimination steps as in Eq. (2). It would be interesting to examine whether the burn-out can be connected to the loss of regularity for X/G and/or to the huge rectangularity of the input matrices.
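The 1-row removal can be rendered as follows (a Python sketch over a dictionary of non-zero entries; the actual implementation streams the large matrix files and also drops empty rows and columns on the fly):

    def reduce_one_rows(entries):
        # entries: dict (row, col) -> value, rows indexing (k-1)-cells.
        # Repeatedly find a 1-row: a row a whose single non-zero entry is +/-1,
        # in some column b (a free face).  Clearing column b with row a touches
        # nothing else, so the pair (row a, column b) can simply be deleted;
        # each removed pair contributes one trivial invariant factor.
        removed = 0
        while True:
            cols_in_row = {}
            for (i, j) in entries:
                cols_in_row.setdefault(i, []).append(j)
            pair = next(((i, js[0]) for i, js in cols_in_row.items()
                         if len(js) == 1 and entries[(i, js[0])] in (1, -1)), None)
            if pair is None:
                return entries, removed
            i0, j0 = pair
            entries = {(i, j): v for (i, j), v in entries.items()
                       if i != i0 and j != j0}
            removed += 1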
3. EXPERIMENTAL RESULTS

3.1 Parallel implementation
For the sake of simplicity in our experimental validation, we decided to develop our parallel implementation using shell task distribution on an SMP architecture. Thus, a simple script is used to distribute the different tasks over the processors, and files are used to gather the computed results. Our parallel implementation is organized as follows (a condensed rendering is sketched below):
1. The block of vectors Y and the sparse matrix A are broadcast to every processor.
2. Each process takes one column of the black box A^T A and computes the corresponding column of the sequence, using the first i columns of Y (by symmetry of the projection). Each process writes its result in a file labeled with the corresponding column index of the sequence.
3. When all the previous processes have terminated, the σ-basis computation is performed sequentially, after loading the sequence from the generated files.
Despite the naïve approach used for the parallelization, our implementation allowed us to perform very large computations, as shown in the next section. However, our experiments show the need for at least a more robust parallel computation scheduler.
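A condensed, runnable rendering of this scheme (Python multiprocessing in place of the shell scripts; the file names and the toy matrix are ours):

    import multiprocessing as mp

    P = 65537  # the FFT prime used for the rank computations

    def matvec_AtA(triples, shape, x, p):
        # One application of the symmetrized black box A^T A mod p.
        n, m = shape
        y = [0] * n
        for i, j, v in triples:
            y[i] = (y[i] + v * x[j]) % p
        z = [0] * m
        for i, j, v in triples:
            z[j] = (z[j] + v * y[i]) % p
        return z

    def sequence_column(args):
        # Worker: column k of the sequence {Y^T (A^T A)^i Y}, written to its
        # own file so a crashed task can be re-run independently.  For
        # simplicity the symmetry saving of section 2 is not applied here.
        triples, shape, Y, k, length, p = args
        col, x = [], Y[k][:]
        for _ in range(length):
            col.append([sum(a * b for a, b in zip(y, x)) % p for y in Y])
            x = matvec_AtA(triples, shape, x, p)
        with open("seq_col_%d.txt" % k, "w") as f:
            f.write(repr(col))
        return k

    if __name__ == "__main__":
        triples = [(0, 0, 1), (1, 1, 2), (2, 0, 3)]  # toy 3x2 sparse matrix A
        Y = [[1, 0], [0, 1]]                         # block of s = 2 vectors
        jobs = [(triples, (3, 2), Y, k, 6, P) for k in range(len(Y))]
        with mp.Pool() as pool:                      # one task per column of Y
            pool.map(sequence_column, jobs)
        # the sigma-basis reconstruction then reads the seq_col_*.txt files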
All our computations have been done on a SGI Altix 3700 gathering 64 Itanium2 processors with 192GB of memory and running SuSE Linux Enterprise System 10.

In Table 1 we give the characteristics of the GL_7(Z) matrices and their sparsity. The matrices are very sparse: less than 1% of the entries are non-zero, except for the matrices GL7d10 and GL7d11, and this value drops below 0.2% for the largest matrices. Also in Table 1 we give the ranks and the computed Smith forms. For the Smith form, the full result has been obtained for the matrices 10, 11, 12, 13, 25 and 26. For the matrices 14 and 24, only the smooth part of the Smith form has been computed. For the matrices 15 and 23 we have proved the existence of a non-trivial local Smith form at 2 and 3 and the triviality of the local Smith form at 5; as the result for these matrices we give the number of invariant factors divisible by 2 and by 3. Table 2 gives the timings of the Smith form computations.
The rank computations for the GL_7(Z) matrices were performed modulo 65537. This FFT prime allowed us to use both BLAS routines and the σ-basis reconstruction with fast polynomial multiplication. Table 3 gives the CPU timings of the basic operations involved, and Table 4 summarizes the large-scale parallel rank computations.

    A        Ω         n        m        rank     ker      Smith form
    GL7d10   8         60       1        1        59       1
    GL7d11   1513      1019     60       59       960      1 (59)
    GL7d12   37519     8899     1019     960      7939     1 (958), 2 (2)
    GL7d13   356232    47271    8899     7938     39333    1 (7937), 2 (1)
    GL7d14   1831183   171375   47271    39332    132043   1 (39300), 2 (29), 4 (3)
    GL7d15   6080381   460261   171375   132043   328218   1 (131993), 2·? (46), 6·? (4) (*)
    GL7d16   14488881  955128   460261   328218   626910
    GL7d17   25978098  1548650  955128   626910   921740
    GL7d18   35590540  1955309  1548650  921740   1033569
    GL7d19   37322725  1911130  1955309  1033568  877562
    GL7d20   29893084  1437547  1911130  877562   559985
    GL7d21   18174775  822922   1437547  559985   262937
    GL7d22   8251000   349443   822922   262937   86506
    GL7d23   2695430   105054   349443   86505    18549    1 (86488), 2·? (12), 6·? (5) (*)
    GL7d24   593892    21074    105054   18549    2525     1 (18544), 2 (4), 4 (1)
    GL7d25   81671     2798     21074    2525     273      1 (2507), 2 (18)
    GL7d26   7412      305      2798     273      32       1 (258), 2 (7), 6 (7), 36 (1)

Table 1: Results of the rank and Smith form computations for the GL_7(Z) matrices A of dimension n × m with Ω non-zero entries. For (*) the information is incomplete: only divisors of the invariant factors were determined, based on the rank mod 2 and mod 3 computations.

    A        ñ       m̃      r̃     Red     RAdaptive  SmoothSF  AdaptiveSF  SFValence
    GL7d11   39      8      52     0.01s   <0.01s     0.09s     0.26s       4.84s
    GL7d12   289     58     909    0.30s   0.16s      9.75s     218.68s     4.04h
    GL7d13   7938    740    7250   3.12s   159.16s    0.76h     *           2526.65h
    GL7d14   165450  35741  4279   21.62s  *          796h      *           *
    GL7d25   2797    20990  0      1.74s   *          17.67s    4.40h       52.13h
    GL7d26   302     2748   0      0.14s   *          0.29s     26.81s      274.35s

Table 2: Times for the Smith form computations of the GL_7(Z) matrices. From left to right: the dimensions of the matrix after reduction, the rank approximation given by the reduction, the time for reading and reducing the matrix, the time of the adaptive algorithm on the reduced matrix; then, for the original matrix, the times of the smooth form computation and of the adaptive algorithm; finally the valence computation in parallel (sequential time equivalent).

    A        A^T A u [s]   U^T v [s]   30 u^T v [s]
    GL7d11   0.0002        <10^-4      <10^-4
    GL7d12   0.0038        0.0001      0.0002
    GL7d13   0.0550        0.0005      0.0036
    GL7d14   0.2677        0.0025      0.0190
    GL7d15   0.9048        0.0082      0.0708
    GL7d16   2.7724        0.0234      0.2641
    GL7d17   7.8003        0.0485      0.5052
    GL7d18   11.9457       0.0759      0.8710
    GL7d19   13.3591       0.0948      1.0710
    GL7d20   10.4056       0.0711      0.8587
    GL7d21   5.7461        0.0408      0.4604
    GL7d22   1.9919        0.0180      0.2082
    GL7d23   0.4354        0.0052      0.0459
    GL7d24   0.0843        0.0012      0.0085
    GL7d25   0.0078        0.0002      0.0008
    GL7d26   0.0007        <10^-4      <10^-4

Table 3: CPU timings (in seconds) of the different operations used in the large-scale parallel rank computation. From left to right: one matrix-vector product, one BLAS multiplication of a vector by a 30 × min(n, m) matrix U, and 30 dot products.

    A        iter[1]   est. time[1]   iter[p] [p]   real time   est. time[p]   σ-basis
    GL7d11   120       0.02s          6 [30]        0.01s       <0.01s         0.58s
    GL7d12   1922      7.53s          66 [30]       0.32s       0.26s          12.16s
    GL7d13   15878     880.28s        532 [30]      51.65s      29.49s         249.17s
    GL7d14   78666     5.90h          2625 [30]     0.56h       0.20h          0.45h
    GL7d15   264088    66.98h         8805 [30]     2.25h       2.23h          2.45h
    GL7d16   656438    509.80h        21884 [30]    27.29h      17.00h         6.03h
    GL7d17   1253822   113.90d        41796 [30]    14d         3.80d          0.57d
    GL7d18   1843482   256.50d        46089 [40]    28d         6.41d          1.00d
    GL7d19   2067138   321.89d        41345 [50]    35d         6.44d          1.56d
    GL7d20   1755126   212.82d        36568 [48]    10d         4.43d          1.41d
    GL7d21   1119972   75.01d         37335 [30]    5d          2.50d          0.55d
    GL7d22   525876    293.60h        17532 [30]    16.47h      9.79h          5.85h
    GL7d23   173012    21.18h         5769 [30]     1.17h       0.71h          1.09h
    GL7d24   37100     3172.79s       1239 [30]     188.78s     105.96s        666.83s
    GL7d25   5052      40.21s         171 [30]      1.56s       1.36s          41.47s
    GL7d26   548       0.40s          21 [30]       0.03s       0.02s          2.03s

Table 4: A summary of the large-scale parallel rank computation. From left to right: the number of iterations in the scalar case and the estimated time in this case; the number of iterations on p processors, computed as 2r/p + O(1), with p given in brackets; the average real time of the sequence computation; the estimated time on p processors; and the time of the σ-basis computation for a sequence of length iter[p] of p × p matrices.

Some failures during these long runs unfortunately required re-running parts of the computation; thus the real time given in Table 4 can noticeably exceed the estimated time on p processors.
4. CONCLUSION
Using the previous methods and computations, we get the following new result for the rational cohomology of GL_7(Z).

Theorem (Elbaz-Vincent/Gangl/Soulé) [14]: We have H^m(GL_7(Z), Q) = Q if m = 0, 5, 9, 14, 15, and H^m(GL_7(Z), Q) = 0 otherwise.

Clearly the simple parallelization we used, together with the highly optimized routines, was the key to enabling these computations. To go further and solve even larger problems, it is mandatory to improve the parallelism. On SMP machines we can split the matrices into blocks and perform the matrix-vector products with different threads. This, and the unbalanced load we had when we chose to assign one vector per process, advocates for the use of more advanced scheduling. We are experimenting with KAAPI (Kernel for Adaptive, Asynchronous Parallel and Interactive programming, kaapi.gforge.inria.fr), but it was not ready for the computation of GL_7(Z).
Other improvements are of algorithmic type. They include the use of the sparse projections of [11] for the matrix sequence. But then we lose the symmetry of the projections and therefore must pay a factor of two in the number of iterations. We could also use an early termination strategy to stop the iteration earlier, but up to now this requires giving up the fast algorithm for the σ-bases. Then, if a good structure for the sparsity of the matrices could be found, e.g. an adapted reordering technique, this would enable an efficient clustering and therefore faster and more scalable matrix-vector products.
In order to obtain the relevant part of the torsion of the integral cohomology of GL_7(Z), we would need the complete description of the Smith forms of all the matrices described above. Our experiments have shown that this can be an enormously difficult task. The computation remains to be done, but it would have applications in number theory.
5. ACKNOWLEDGMENTS
We are grateful to Arne Storjohann and to the Computer Science Computing Facilities of the University of Waterloo for letting us fill up their SMP machine to perform our parallel computations.
6. REFERENCES
[1] J. Abbott, M. Bronstein, and T. Mulders. Fast deterministic computation of determinants of dense matrices. In S. Dooley, editor, Proceedings of the 1999 International Symposium on Symbolic and Algebraic Computation, Vancouver, Canada, pages 197–204. ACM Press, New York, July 1999.
[2] B. Beckermann and G. Labahn. A uniform approach for the fast computation of matrix-type Padé approximants. SIAM Journal on Matrix Analysis and Applications, 15(3):804–823, 1994.
[3] A. Bostan and E. Schost. Polynomial evaluation and interpolation on special sets of points. J. Complex., 21(4):420–446, 2005.
[4] K. S. Brown. Cohomology of Groups. Graduate Texts in Mathematics, 87. Springer-Verlag, New York-Heidelberg-Berlin, 1982.
[5] D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Inf., 28(7):693–701, 1991.
[6] D. Coppersmith. Solving homogeneous linear equations over GF[2] via block Wiedemann algorithm. Mathematics of Computation, 62(205):333–350, Jan. 1994.
[7] J. D. Dixon. Exact solution of linear equations using p-adic expansions. Numerische Mathematik, 40(1):137–141, Feb. 1982.
[8] J.-G. Dumas, editor. ISSAC'2006. Proceedings of the 2006 International Symposium on Symbolic and Algebraic Computation, Santander, Spain. ACM Press, New York, July 2006.
[9] J.-G. Dumas, B. D. Saunders, and G. Villard. On efficient sparse integer matrix Smith normal form computations. Journal of Symbolic Computation, 32(1/2):71–99, July-Aug. 2001.
[10] J.-G. Dumas and G. Villard. Computing the rank of sparse matrices over finite fields. In V. G. Ganzha, E. W. Mayr, and E. V. Vorozhtsov, editors, Proceedings of the fifth International Workshop on Computer Algebra in Scientific Computing, Yalta, Ukraine, pages 47–62. Technische Universität München, Germany, Sept. 2002.
[11] W. Eberly, M. Giesbrecht, P. Giorgi, A. Storjohann, and G. Villard. Solving sparse rational linear systems. In Dumas [8], pages 63–70.
[12] W. Eberly, M. Giesbrecht, and G. Villard. On computing the determinant and Smith form of an integer matrix. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 675–687. IEEE Computer Society, 2000.
[13] P. Elbaz-Vincent. Perfect lattices, homology of modular groups and algebraic K-theory. Oberwolfach Reports (OWR), 2, 2005. Based on joint work with H. Gangl and C. Soulé.
[14] P. Elbaz-Vincent, H. Gangl, and C. Soulé. Perfect forms, cohomology of modular groups and K-theory of integers. In preparation.
[15] P. Elbaz-Vincent, H. Gangl, and C. Soulé. Quelques calculs de la cohomologie de GL_N(Z) et de la K-théorie de Z. C. R. Acad. Sci. Paris, Ser. I, 335:321–324, 2002.
[16] P. Giorgi, C.-P. Jeannerod, and G. Villard. On the complexity of polynomial matrix computations. In R. Sendra, editor, Proceedings of the 2003 International Symposium on Symbolic and Algebraic Computation, Philadelphia, Pennsylvania, USA, pages 135–142. ACM Press, New York, Aug. 2003.
[17] T. Kaczyński, K. Mischaikow, and M. Mrozek. Computational Homology. Springer, 2004.
[18] T. Kaczyński, M. Mrozek, and M. Ślusarek. Homology computation by reduction of chain complexes. Computers and Mathematics, 35(4):59–70, 1998.
[19] E. Kaltofen. Analysis of Coppersmith's block Wiedemann algorithm for the parallel solution of sparse linear systems. Mathematics of Computation, 64(210):777–806, Apr. 1995.
[20] E. Kaltofen and A. Lobo. Distributed matrix-free solution of large sparse linear systems over finite fields. In A. Tentner, editor, Proceedings of High Performance Computing 1996, San Diego, California. Society for Computer Simulation, Simulation Councils, Inc., Apr. 1996.
[21] E. Kaltofen and B. D. Saunders. On Wiedemann's method of solving sparse linear systems. In Applied Algebra, Algebraic Algorithms and Error-Correcting Codes (AAECC '91), volume 539 of Lecture Notes in Computer Science, pages 29–38, Oct. 1991.
[22] M. Kurihara. Some remarks on conjectures about cyclotomic fields and K-groups of Z. Compos. Math., 81(2):223–236, 1992.
[23] J. Rosenberg. Algebraic K-Theory and its Applications. Springer, 1995.
[24] B. D. Saunders and Z. Wan. Smith normal form of dense integer matrices, fast algorithms into practice. In J. Gutierrez, editor, Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation, Santander, Spain, pages 274–281. ACM Press, New York, July 2004.
[25] C. Soulé. Perfect forms and the Vandiver conjecture. J. reine angew. Math., 517:209–221, 1999.
[26] E. H. Spanier. Algebraic Topology. Springer, 1994.
[27] E. Thomé. Fast computation of linear generators for matrix sequences and application to the block Wiedemann algorithm. In International Symposium on Symbolic and Algebraic Computation, London, Ontario, pages 323–331. ACM Press, July 2001.
[28] E. Thomé. Subquadratic computation of vector generating polynomials and improvement of the block Wiedemann algorithm. Journal of Symbolic Computation, 33(5):757–775, July 2002.
[29] W. J. Turner. A block Wiedemann rank algorithm. In Dumas [8], pages 332–339.
[30] G. Villard. Further analysis of Coppersmith's block Wiedemann algorithm for the solution of sparse linear systems. In W. W. Küchlin, editor, Proceedings of the 1997 International Symposium on Symbolic and Algebraic Computation, Maui, Hawaii, pages 32–39. ACM Press, New York, July 1997.
[31] G. Villard. A study of Coppersmith's block Wiedemann algorithm using matrix polynomials. Technical Report 975-IM, LMC/IMAG, Apr. 1997.
[32] G. Voronoï. Nouvelles applications des paramètres continus à la théorie des formes quadratiques I. J. Crelle, 133:97–178, 1907.
[33] D. H. Wiedemann. Solving sparse linear equations over finite fields. IEEE Transactions on Information Theory, 32(1):54–62, Jan. 1986.