Parallel computation of the rank of large sparse matrices from algebraic K-theory
Jean-Guillaume Dumas, Philippe Elbaz-Vincent, Pascal Giorgi, Anna Urbańska
Jean-Guillaume Dumas
Laboratoire Jean Kuntzmann, UMR CNRS 5224, Université Joseph Fourier, B.P. 53X, 38041 Grenoble, France.
Philippe Elbaz-Vincent
Institut de Mathématiques et de Modélisation de Montpellier, UMR CNRS 5149, Université Montpellier II, CC051, Place E. Bataillon, 34095 Montpellier cedex 5, France. [email protected]
Pascal Giorgi
Laboratoire LP2A, Université de Perpignan Via Domitia, 52, avenue Paul Alduy, 66860 Perpignan, France. [email protected]
Anna Urbańska
Laboratoire Jean Kuntzmann, UMR CNRS 5224, Université Joseph Fourier, B.P. 53X, 38041 Grenoble, France.
ABSTRACT
This paper deals with the computation of the rank and of some integer Smith forms of a series of sparse matrices arising in algebraic K-theory. The number of non-zero entries in the considered matrices ranges from 8 to 37 million. The largest rank computation took more than 35 days on 50 processors. We report on the actual algorithms we used to build the matrices, their link to the motivic cohomology, and the linear algebra and parallelizations required to perform such huge computations. In particular, these results are part of the first computation of the cohomology of the linear group GL_7(Z).
1. INTRODUCTION

1.1 Motivation from K-theory
Numerous problems in modern number theory could be solved, or at least better understood, if we had a good knowledge of the algebraic K-theory (or motivic cohomology) of rings of integers of number fields, or of the cohomology of arithmetic groups (i.e. subgroups of finite index of GL_N(Z)). As a short list, we could mention:
• modular forms and special values of L-functions,
• Iwasawa theory and understanding of the "cyclotomy",
• Galois representations (or automorphic representations).
Let us explain first what algebraic K-theory is: to a commutative ring R we can associate (functorially) an infinite family of abelian groups K_n(R) which encodes a huge amount of information on its arithmetic, geometric and algebraic structures. These groups extend some classical notions and give higher dimensional analogues of some well known results. For instance, if R is a ring of integers or a polynomial ring over a field, we have:
• K_0(R) is the classical Grothendieck group (classifying finitely generated R-modules),
• K_1(R) is the group of invertibles of R,
• K_2(R) classifies the universal extensions of SL(R) and is related to the Brauer group in the case of a field.
We can give a general abstract definition of K_n for n > 0:

    K_n(R) = π_n(BGL(R)^+),

where BGL(R)^+ is the Quillen +-construction applied to the classifying space BGL(R) and π_n denotes the n-th homotopy group [23]. We can also see the K-groups as a way to understand GL(R). For instance, K_1(R) is isomorphic to GL(R) modulo elementary relations. For a more detailed background on K-theory and its applications see [23]. Fact: these groups are hard to compute.
The Vandiver conjecture plays an important role in the understanding of the "cyclotomy" in number theory. Its statement is as follows:
Conjecture of Vandiver: Let p be an odd prime, ζ_p = e^{2πi/p}, C the p-Sylow subgroup of the class group of Q(ζ_p) and C^+ the subgroup fixed by the complex conjugation. The Vandiver conjecture is the statement that C^+ = 0.
To illustrate its interplay with K-theory, we have:
Fact (Kurihara, 1992) [22]: If K_{4n}(Z) = 0 for all n > 0, then the conjecture of Vandiver is true.
Some partial results on this conjecture and its connection with the cohomology of SL_N(Z) are given in [25].
General problem: Find explicit methods for computing (co)homologies of arithmetic groups and the K-theory of number fields (or their rings of integers).
Our first task will be the computation of the (co)homologies of linear groups (mainly GL_N(Z) and SL_N(Z)). We can show [15, 13, 14] that the computation of those groups is the key point for computing K-groups with our method. We begin by recalling some facts from topology.
The notion of cellular complex is a generalization of a graph to several dimensions. We call n-cell a topological space homeomorphic to the open unit ball of R^n and such that its closure is homeomorphic to the closed unit ball. A cellular complex (or cell complex, or also cellular decomposition or cellular space) is a family of sets X_n (with n ∈ N), such that each X_n is a collection (possibly infinite) of n-cells. Usually we work with cell complexes with a finite number of cells. A classical result [26] shows that any (reasonable) topological space can be approximated by such cell complexes.

Figure: a cellular decomposition of the cube.
To a cell complex we can associate a family C_n (n ∈ N) of free Z-modules and a family of linear maps d_n : C_n → C_{n−1}. The module C_n is the free module with basis the n-cells (modulo a choice of orientation). If the complex is finite, then all the modules are of finite rank and we will denote by {b_λ^n}_{λ∈Λ_n} a basis of C_n, Λ_n being an index set for the n-cells. Then the map d_n is defined by

    d_n(b_λ^n) = Σ_µ [b_λ^n : b_µ^{n−1}] b_µ^{n−1},

and the integer [b_λ^n : b_µ^{n−1}] is called the incidence number of the cell b_µ^{n−1} in the cell b_λ^n. The relation d_n ∘ d_{n+1} = 0 (i.e., formally d² = 0) should hold for any n. If the complex is regular (i.e., there is always at most one cell of dimension n+1 between two cells of dimension n), then we can build the incidences inductively, starting from the 0-cells up to the maximal cells, using the d² = 0 condition; moreover the incidence numbers will be 0 or ±1.
The n-th homology group of the complex is defined as the quotient of Ker(d_n) by Im(d_{n+1}). This construction is functorial (in the category of cell complexes). As a consequence, we can determine the homology groups effectively by computing the Smith forms of the integral matrices of the d_n (relative to the fixed bases). Notice that the Smith form gives both the rank of the free part and the explicit description of the torsion part. In case the computation of the torsion is unnecessary (or too difficult to achieve), we can tensorize by −⊗_Z Q and the homology groups become Q-vector spaces whose dimensions are given by the ranks of the matrices of the differentials.
In general, the matrices of the differentials can be very large even for a relatively simple cell complex. However, they are also very sparse and we may look for some other favorable properties which would enable the computation despite the size of the problem.
If G is a group acting on a cell space X (i.e., G sends n-cells to n-cells), then, under some technical assumptions on X and on the action, we can show [4] that, roughly, computing the homology of G (as group homology) is the same as computing the homology of the cell space X/G. Hence, if
X/G can be calculated effectively, we can compute explicitly its homology, and from this the homology of G (similarly for the cohomology). Notice that in general the space X/G will not be regular anymore. The main difficulty is to find a cell space X such that X/G is effective. We will discuss in section 2.1 how we can construct such a cellular space for linear groups.
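To fix ideas, here is a toy sketch (in Python, with exact arithmetic over Q) of how rational homology dimensions are read off from the ranks of the differentials, as explained above; the tiny boundary matrix of a triangle is our own illustrative example, not one of the complexes considered in this paper.

    from fractions import Fraction

    def rank_q(mat):
        # Exact Gaussian elimination over Q; fine for small dense examples,
        # the matrices of this paper require the sparse methods of section 2.
        m = [[Fraction(x) for x in row] for row in mat]
        r = 0
        for c in range(len(m[0]) if m else 0):
            piv = next((i for i in range(r, len(m)) if m[i][c]), None)
            if piv is None:
                continue
            m[r], m[piv] = m[piv], m[r]
            for i in range(len(m)):
                if i != r and m[i][c]:
                    f = m[i][c] / m[r][c]
                    m[i] = [a - f * b for a, b in zip(m[i], m[r])]
            r += 1
        return r

    # Boundary of a triangle seen as a cell complex: 3 vertices, 3 edges.
    # Rows index the 0-cells, columns index the 1-cells.
    d1 = [[-1,  0,  1],
          [ 1, -1,  0],
          [ 0,  1, -1]]
    r1 = rank_q(d1)
    b0 = 3 - 0 - r1   # dim H_0 = dim C_0 - rank d_0 - rank d_1 = 1 component
    b1 = 3 - r1 - 0   # dim H_1 = dim C_1 - rank d_1 - rank d_2 = 1 cycle
    print(b0, b1)     # 1 1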
The first idea to deal with very large sparse matrices is to use them as black boxes, i.e. only through the matrix-vector product. This lets the matrix remain sparse all along the algorithm, where Gaussian elimination for instance would fill it up. To compute the rank, the fastest black-box algorithm is Wiedemann's, as shown in [10]. This algorithm computes a sequence of scalars of the form u^T A^i v (u and v are vectors) with i matrix-vector products and dot products. It has been shown to successfully deal with large sparse matrix problems from homology, see e.g. [9]. Nevertheless, when matrices are very large (e.g. millions of non-zero entries), computations would require months or years. This is due to the low practical efficiency of the computation of a sparse matrix-vector product. For instance, in our case of homology computation, one would need 300 days of CPU to compute the sequence involving the matrix GL7d19 of GL_7(Z). To achieve computations of many homologies in a realistic time we then need to parallelize the computation of the sequence. The candidate algorithm is then the block Wiedemann method, which computes a sequence X^T A^i Y, where X and Y are blocks of vectors. This step can easily be parallelized by distributing the vectors of the block Y to several processors (a minimal sketch is given after the list below). We thus have several objectives with regard to the parallelism in this paper:
• We want to solve large problems coming from homology computation.
• We want to experimentally validate our parallel implementation of the block Wiedemann rank algorithm.
• We want to show the parallel scaling of block Wiedemann approaches.
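As a minimal illustration of the black-box principle (a Python sketch, not the LinBox code actually used), the scalar sequence u^T A^i v is obtained with one sparse matrix-vector product and one dot product per term; the matrix, stored as (row, column, value) triples, is never modified:

    def apply_blackbox(triples, nrows, x, p):
        # y = A.x mod p: the only operation a black-box method needs,
        # so A stays sparse for the whole computation.
        y = [0] * nrows
        for i, j, v in triples:
            y[i] = (y[i] + v * x[j]) % p
        return y

    def scalar_sequence(triples, n, u, v, length, p):
        # Returns [u^T v, u^T A v, ..., u^T A^(length-1) v] mod p.
        # The block method replaces u, v by blocks X, Y; the columns of Y
        # are then distributed to several processors.
        seq, x = [], v[:]
        for _ in range(length):
            seq.append(sum(a * b for a, b in zip(u, x)) % p)
            x = apply_blackbox(triples, n, x, p)
        return seq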
In section 2, the algorithms and optimizations we used to generate the matrices and compute with them are discussed. Then, section 3 shows our experimental results on these large sparse matrices coming from homology.
2. ALGORITHMS AND OPTIMIZATIONS
As seen in section 1.1, we can effectively compute the homology of a cellular space and of a group which acts "nicely" on a cellular space. The main difficulty remains to find such an explicit cellular space. In section 2.1 we present the process of matrix generation and optimization. Then in the following sections we give a description of the algorithms used for the computation of the rank and the Smith form of those matrices.
2.1 Matrix generation

In the case of subgroups of GL_N(Z), we have an "obvious" action on Z^N. We can then capture the topology by regarding Z^N not as a free Z-module but as a lattice (or equivalently in terms of quadratic forms), and see if this leads to some interesting topological construction. We will describe this approach below and the results that we can get.
Let N > 1 and let C_N be the set of positive definite real quadratic forms in N variables. Given h ∈ C_N, let m(h) be the finite set of minimal vectors of h, i.e. vectors v ∈ Z^N, v ≠ 0, such that h(v) is minimal. A form h is called perfect when m(h) determines h up to scalar: if h′ ∈ C_N is such that m(h′) = m(h), then h′ is proportional to h.
Example: The form q(x, y) = x² + y² has minimum 1 and minimal vectors ±(1, 0) and ±(0, 1). Nevertheless this form is not perfect, because there is an infinite number of positive definite quadratic forms having these minimal vectors. On the other hand, the form q(x, y) = x² + xy + y² also has minimum 1 and has exactly 3 minimal vectors (up to sign), the ones above and ±(1, −1). This form is perfect, and the associated lattice is the "honeycomb lattice" (with optimal sphere packing in the plane); it is the only perfect form in two variables.
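These minimal vectors can be checked by brute-force enumeration; in the sketch below the small search box is an ad hoc choice of ours that suffices for these two binary forms:

    def minimal_vectors(q, box=3):
        # Non-zero integer vectors in [-box, box]^2 achieving the minimum of q;
        # for a positive definite binary form this small box suffices.
        pts = [(x, y) for x in range(-box, box + 1)
                      for y in range(-box, box + 1) if (x, y) != (0, 0)]
        mn = min(q(x, y) for x, y in pts)
        return mn, sorted((x, y) for x, y in pts if q(x, y) == mn)

    print(minimal_vectors(lambda x, y: x * x + y * y))
    # (1, [(-1, 0), (0, -1), (0, 1), (1, 0)])          2 pairs: not perfect
    print(minimal_vectors(lambda x, y: x * x + x * y + y * y))
    # (1, [(-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0)])  3 pairs: perfect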
Denote by C*_N the set of non-negative real quadratic forms on R^N whose kernel is spanned by a proper linear subspace of Q^N, by X*_N the quotient of C*_N by positive real homotheties, and by π : C*_N → X*_N the projection. Let X_N = π(C_N) and ∂X*_N = X*_N − X_N. Let Γ be either GL_N(Z) or SL_N(Z). The group Γ acts on C*_N and X*_N on the right by the formula h·γ = γ^t h γ, for γ ∈ Γ and h ∈ C*_N, where h is viewed as a symmetric matrix and γ^t is the transpose of the matrix γ. Voronoï proved that there are only finitely many perfect forms modulo the action of Γ and multiplication by positive real numbers ([32], Th. p. 110).
Given v ∈ Z^N − {0} we let v̂ ∈ C*_N be the form defined by v̂(x) = (v|x)², x ∈ R^N, where (v|x) is the scalar product of v and x. The convex hull of a finite subset B ⊂ Z^N − {0} is the subset of X*_N image by π of the elements Σ_j λ_j v̂_j, v_j ∈ B, λ_j > 0. For any perfect form h, we let σ(h) ⊂ X*_N be the convex hull of the set m(h) of its minimal vectors. Voronoï proved in [32] that the cells σ(h) and their intersections, as h runs over all perfect forms, define a cell decomposition of X*_N, which is invariant under the action of Γ. We endow X*_N with the corresponding CW-topology. If τ is a closed cell in X*_N and h a perfect form with τ ⊂ σ(h), we let m(τ) be the set of vectors v in m(h) such that v̂ lies in τ. Any closed cell τ is the convex hull of m(τ), and m(τ) ∩ m(τ′) = m(τ ∩ τ′).
Let d(N) = N(N + 1)/2 − 1 be the dimension of X*_N and n ≤ d(N) a natural integer. We denote by Σ*_n a set of representatives, modulo the action of Γ, of those cells of dimension n in X*_N which meet X_N, and by Σ_n ⊂ Σ*_n the cells σ such that the stabilizer Γ_σ of σ in Γ preserves its orientation. Let V_n be the free abelian group generated by Σ_n. We define as follows a map d_n : V_n → V_{n−1}.
For each closed cell σ in X*_N we fix an orientation of σ, i.e. an orientation of the real vector space R(σ) of symmetric matrices spanned by the forms v̂, v ∈ m(σ). Let σ ∈ Σ_n and let τ′ be a face of σ. Given a positive basis B′ of R(τ′) we get a basis B of R(σ) by adding after B′ a vector v̂, v ∈ m(σ) − m(τ′). We let ε(τ′, σ) = ±1 be the sign of the orientation of B in the oriented vector space R(σ) (this sign does not depend on the choice of v). Next, let τ ∈ Σ_{n−1} be the cell equivalent to τ′ = τ·γ. We define η(τ, τ′) = 1 (resp. η(τ, τ′) = −
1) when γ is compatible (resp. incompatible) with the chosen orientations of R(τ) and R(τ′). Finally we define

    d_n(σ) = Σ_{τ ∈ Σ_{n−1}} Σ_{τ′} η(τ, τ′) ε(τ′, σ) τ,   (1)

where τ′ runs through the set of faces of σ which are equivalent to τ.
It is shown in [14] that, up to p-torsion with p ≤ N + 1, the homology of this complex computes the cohomology of Γ.

Theorem (Elbaz-Vincent/Gangl/Soulé) [15, 13, 14]: The cardinalities of Σ_n and Σ*_n are as shown in Figure 1 for N = 5, 6 and in Figure 2 for N = 7.

    n              5   6   7   8   9    10   11   12   13    14
    Σ*_n(GL_5(Z))  5   10  16  23  25   23   16   9    4     3
    Σ_n(GL_5(Z))               1   7    6    1          2    3

    n              5   6   7   8   9    10   11   12   13    14    15    16   17   18  19  20
    Σ*_n(GL_6(Z))  3   10  28  71  162  329  589  874  1066  1039  775   425  181  57  18  7
    Σ_n(GL_6(Z))                   3    46   163  340  544   636   469   200  49   5
    Σ*_n(SL_6(Z))  3   10  28  71  163  347  691  1152 1532  1551  1134  585  222  62  18  7
    Σ_n(SL_6(Z))       3   10  18  43   169  460  815  1132  1270  970   434  114  27  14  7

Figure 1: Σ_n and Σ*_n for N = 5, 6 (empty slots denote zero).

    n              6      7       8      9      10      11      12
    Σ*_n(GL_7(Z))  6      28      115    467    1882    7375    26885
    Σ_n(GL_7(Z))                         1      60      1019    8899

    n              13     14      15      16      17       18       19       20       21
    Σ*_n(GL_7(Z))  87400  244029  569568
    Σ_n(GL_7(Z))   47271  171375  460261  955128  1548650  1955309  1911130  1437547  822922

    n              22      23      24     25    26   27
    Σ*_n(GL_7(Z))  374826  115411  24623  3518  352  33
    Σ_n(GL_7(Z))   349443  105054  21074  2798  305  33

Figure 2: Σ_n and Σ*_n for N = 7.

The previous result gives the precise size of the matrices involved in the computation of the homology. The main challenge was then the computation of the ranks of the matrices of the differentials (this gives the free part of the homology) and the computation of the Smith forms (which give the relevant arithmetical information on the homology), in particular for N = 7, knowing that such matrices are particularly sparse. We emphasize that what we want to detect is the "high torsion" in the homology, i.e. the prime divisors p > N + 1. All the matrices are available online in the "Sparse Integer Matrix Collection" (ljk.imag.fr/membres/Jean-Guillaume.Dumas/simc.html).

One successful approach to linear algebra computations on large sparse matrices is to rely on Lanczos/Krylov black-box methods. In particular, block versions of the Wiedemann method [33] are well suited for parallel computation. This technique was first proposed by Coppersmith in [6] for computations over GF(2), and then analyzed and proved by Kaltofen [19] and Villard [30, 31]. The idea is, for a matrix A ∈ F^{n×n}, to compute the minimal generating matrix polynomial of the matrix sequence {X A^i Y}_{i=0}^{∞} ⊂ F^{s×s}, where
X, Y are blocks of s vectors (instead of single vectors in the original Wiedemann algorithm). Therefore, an intuitive parallelization of this method is to distribute the vectors of the blocks X, Y to several processors. Thus the computation of the sequence, which is the major performance bottleneck of this method, can be done in parallel, allowing for better performance. Many implementations and practical experiments have been developed for the parallel block Wiedemann method. For instance, in 1996, Kaltofen and Lobo [20] proposed a coarse-grain implementation to solve homogeneous linear equations over GF(2). They were thus able to solve a system of 252,252 linear equations with about 11.04 million non-zero entries in about 26.5 days.
In order to reduce the number of dot products, we used a symmetric projection; in other words, we set X = Y^T in the X A^i Y sequence. The probability of success is slightly reduced in this case, but the obtained blocks are symmetric as soon as A is symmetric. This is always the case when the preconditioners of [10] are used (they are of the form A^T A). This reduces the dot product part of the computation of the sequence by a factor of two: for instance, column i of a sequence element can be deduced from its top i elements and its row i, as the sketch below illustrates. This induces some load balancing issues when one process owns the computation of one column. Note also that we use BLAS level-2 routines for the computation of these dot products; in other words we perform them by blocks.
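A sketch of this symmetry saving (plain Python; Y and Z = (A^T A)^i Y are stored column-wise; in the actual code each column is owned by one process and the products are performed by BLAS blocks):

    def symmetric_term(Y, Z, p):
        # S = Y^T Z mod p with Z = (A^T A)^i Y.  Since A^T A is symmetric,
        # S is symmetric: compute the upper triangle only and mirror it,
        # halving the number of dot products.
        s = len(Y)
        S = [[0] * s for _ in range(s)]
        for i in range(s):
            for j in range(i, s):
                S[i][j] = S[j][i] = sum(a * b for a, b in zip(Y[i], Z[j])) % p
        return S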
σ-basis computations
In order to efficiently compute σ-bases we rely on the algorithm PM-Basis of [16], which reduces this computation to polynomial matrix multiplication. One can multiply two polynomial matrices A, B ∈ F^{n×n}[x] of degree d in O(n^ω d + n² d log d) finite field operations (ω being the exponent of matrix multiplication) if d-th primitive roots of unity are available in F. Consequently, we decided for our computations to define F as a prime field with primes of the form c·2^k + 1 such that 2^k ≥ d. These primes are commonly called FFT primes since they allow the use of FFT evaluation/interpolation for polynomials of degree d. We refer the reader to [5, 3] and references therein for further information on fast polynomial matrix arithmetic. When finite fields not having d-th primitive roots of unity are used, polynomial matrix multiplication can still be done efficiently by using the Chinese Remainder Theorem with a few FFT primes. Let F be a prime field of cardinality p; then the multiplication of A, B ∈ F^{n×n}[x] of degree d can be done efficiently by using CRT with FFT primes p_i satisfying Π p_i > d × n × p². This is equivalent to performing the multiplication over the integers and then reducing the result in F. The overall performance of the multiplication, and hence of the σ-basis computation, depends on the number of FFT primes needed.

Our main interest in the block Wiedemann approach is to compute the rank of the large sparse matrices given by the homology group determination problem explained in section 1.1. Hence, we rely on the Kaltofen-Saunders rank algorithm [21] and its block version [29] to achieve efficient parallel computation. The Kaltofen-Saunders approach is based on the fact that if Ã is a suitably preconditioned version of A, then its rank is equal to the degree of its minimal polynomial minus its valuation (or co-degree) [21]. Thus, by using well chosen preconditioners and the Wiedemann algorithm, one can easily compute the rank of a sparse matrix over a finite field. The block version of this method is presented e.g. in [29].

Block Wiedemann Rank Algorithm:
1. Let A ∈ F^{n×n}; build Ã from A with good preconditioners (e.g. those of [10]).
2. Choose a random block Y ∈ F^{n×s} and compute the matrix sequence S = {Y^T Ã^i Y} for i = 0, ..., 2n/s + O(1).
3. Compute the minimal matrix generator F_{Ã,Y} ∈ F^{s×s}[x] of the matrix sequence S.
4. Return the rank r = deg(det(F_{Ã,Y})) − codeg(det(F_{Ã,Y})).

Note that if the minimal matrix generator of step 3 is in Popov form (e.g. computed using the σ-basis of [16]), then the degree of det(F_{Ã,Y}) is simply the sum of the row degrees of the matrix F_{Ã,Y}. The co-degree is zero if the determinant of the constant term of F_{Ã,Y}, seen as a matrix polynomial, is non-zero; in that case the computation of the determinant of the whole polynomial matrix can be avoided. When this fails, this determinant is computed by a massively parallel evaluation/interpolation. It could be interesting, though, to interpolate only the lower coefficients of this polynomial incrementally. This was not required for the matrices we considered and we therefore did not investigate these speed improvements further.
Note that to probabilistically compute the rank over the integers, it is sufficient to choose several primes at random and take the largest obtained value, see e.g. [9] for more details. Moreover, one can choose the primes at random among primes not dividing the determinant (and thus preserving the rank). In order to ensure this property it is sufficient to select primes not dividing the valence or the last invariant factor computed by one of the methods of the next section.
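For s = 1 the block algorithm degenerates to the scalar Kaltofen-Saunders rank, which already shows all the ingredients: a sequence of 2n + O(1) terms, a minimal generating polynomial (computed here with Berlekamp-Massey in place of a σ-basis), and rank = degree − co-degree. The sketch below works over the FFT prime 65537 and uses a random diagonal scaling as a simplified stand-in for the actual preconditioners of [10]; it is a toy Monte Carlo version, not the LinBox implementation.

    import random

    P = 65537  # FFT prime: 2^16 + 1

    def berlekamp_massey(seq, p):
        # Connection polynomial C of the sequence over GF(p):
        # sum_i C[i] * seq[n - i] = 0 for all n >= L.
        C, B, L, m, b = [1], [1], 0, 1, 1
        for n in range(len(seq)):
            d = seq[n]
            for i in range(1, L + 1):
                d = (d + C[i] * seq[n - i]) % p
            if d == 0:
                m += 1
                continue
            coef = d * pow(b, p - 2, p) % p
            C = C + [0] * max(0, len(B) + m - len(C))
            T = C[:]
            for i, bi in enumerate(B):
                C[i + m] = (C[i + m] - coef * bi) % p
            if 2 * L <= n:
                L, B, b, m = n + 1 - L, T, d, 1
            else:
                m += 1
        return C, L

    def blackbox_rank(matvec, n, p=P):
        # Monte Carlo: the minimal polynomial of the preconditioned matrix is
        # x^L * C(1/x); its degree minus its valuation, i.e. the largest index
        # of a non-zero coefficient of C, gives the rank (Kaltofen-Saunders).
        u = [random.randrange(p) for _ in range(n)]
        x = [random.randrange(p) for _ in range(n)]
        seq = []
        for _ in range(2 * n + 2):
            seq.append(sum(a * b for a, b in zip(u, x)) % p)
            x = matvec(x)
        C, _ = berlekamp_massey(seq, p)
        return max(i for i, c in enumerate(C) if c % p)

    # Toy usage: a 4x4 matrix of rank 2, given by its non-zero triples.
    triples = [(0, 0, 1), (0, 2, 2), (1, 1, 3), (2, 0, 2), (2, 2, 4)]
    d1 = [random.randrange(1, P) for _ in range(4)]
    d2 = [random.randrange(1, P) for _ in range(4)]
    def matvec(x):  # black box for the diagonally scaled matrix D1 * A * D2
        y = [0] * 4
        for i, j, v in triples:
            y[i] = (y[i] + v * d2[j] * x[j]) % P
        return [d1[i] * yi % P for i, yi in enumerate(y)]
    print(blackbox_rank(matvec, 4))  # 2 with high probability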
The computation of the Smith forms for the matrices of GL_7(Z) turned out to be a very challenging problem. Prior experience with sparse homology matrices led us to try the SmithViaValence algorithm of [9]. The idea is to compute the minimal valence (the coefficient of the smallest non-zero monomial of the minimal polynomial) of the product A^T A, to determine the primes p which divide the invariant factors of the Smith form of A. When the primes have been found, one can compute the local Smith forms of A at each p separately and return the resulting Smith form S as the product of the local Smith forms S_p over all p. The local Smith form computation can be done by a repeated Gauss elimination modulo p^e, where the exponent e is adjusted automatically during the course of the algorithm. This algorithm works very efficiently for sparse matrices, provided that the minimal polynomial of the product A^T A has a small degree. Unfortunately, the latter condition does not hold in the case of the GL_7(Z) matrices. Moreover, some early experiments with small matrices showed that many more primes occur in the computed valence than in the Smith form of the original matrix.
Thus, we decided to apply the adaptive algorithm of Saunders and Wan [24], which is a modified version of the Eberly-Giesbrecht-Villard algorithm [12]. In [12] the authors proposed a procedure
OneInvariantFactor(i, A) (OIF) which computes the i-th invariant factor of an n × n matrix A. A binary search over the distinct invariant factors then allows them to find the Smith form of A. OIF reduces the i-th factor computation to the computation of the last (n-th) invariant factor (LIF) of a preconditioned matrix A + U_i V_i, where U_i, V_i^T are random n × (n − i) matrices. In [24] the method was extended to handle the rectangular case of an m × n matrix. This is done by computing the last (i-th) invariant factor of a preconditioned matrix L_i A R_i, where L_i is an i × m and R_i an n × i matrix.
The procedure OIF is of Monte Carlo probabilistic type, where the probability of correctness is controlled by repeating the choice of preconditioners. Assuming the correctness of the LIF computation, it gives a multiple of the i-th invariant factor. In practice, LIF is also of randomized Monte Carlo type. The idea is to get a divisor of the last invariant factor by solving a linear equation Mx = b with a random right-hand side. After several solvings we get the last invariant factor with large probability. Thus, the overall situation is more complex and we cannot exclude the possibility that some primes are omitted or unnecessary in the output of OIF. However, the probability that a prime is omitted or is unnecessary in this output can be controlled for each prime separately, and is smaller for bigger primes. Therefore in [24] the authors introduce a notion of smooth and rough parts of the Smith form. The idea is to compute the local Smith forms for smaller primes by, for example, the SmithViaValence or OIF algorithm, and to recover only the large primes with the invariant factor search of [12]. When we consider large primes, a sufficient probability of correctness can be obtained with a smaller number of repetitions.
As we did not want to compute the valence, we introduced some minor changes to the algorithm, which at the end is as follows (a sketch is given after the list):
1. Compute r = rank(A);
2. For all primes 1 < p < 100, compute the local Smith form S_p of A;
3. Compute s_r(A) by the OneInvariantFactor algorithm;
4. Let P be the set of all primes p > 100 which divide s_r(A);
5. If P = ∅, return S = Π_p S_p.
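A sketch of this driver follows; rank_mod, local_smith_form and one_invariant_factor are hypothetical stand-ins for the corresponding LinBox routines (the rank algorithm above, [9, Algorithm LRE], and the OIF of [24]):

    SMALL_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
                    47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]

    def adaptive_smith(A, rank_mod, local_smith_form, one_invariant_factor):
        # Smooth/rough split of [24]: local Smith forms at the small primes,
        # then OIF certifies (Monte Carlo) that no larger prime occurs.
        r = rank_mod(A)                                              # step 1
        smooth = {p: local_smith_form(A, p) for p in SMALL_PRIMES}   # step 2
        s_r = one_invariant_factor(A, r)                             # step 3
        for p in SMALL_PRIMES:              # keep only the rough part of s_r
            while s_r % p == 0:
                s_r //= p
        if s_r == 1:                        # step 4: no prime p > 100 remains
            return smooth                   # step 5: S is the product of the S_p
        raise NotImplementedError("primes > 100 dividing s_r(A) remain: %d" % s_r)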
One advantage of this method is that we get the information on the smooth part of the Smith form very quickly. Moreover, the OIF computation acts as a certification phase which allows us to prove, with a sufficiently large probability, that no other primes are present. This probability is made explicit in the following theorem.
Theorem: The probability that there exists a prime p > P that divides the i-th invariant factor but does not divide the output of OIF, which uses M random preconditioners L_i, R_i and N random vectors b with entries in {0, 1, ..., β − 1} and β > s_i(A) in the LIF procedure, is bounded by

    M Σ_{p>P} (2/p)^N.

Proof.
As we take the gcd of the results obtained with different preconditioners L_i, R_i, it suffices that the LIF computation fails in one case to spoil the computation. We are free to choose a large bound for ||b|| such that ||b|| > s_i(A) without increasing the complexity of the LIF computation. Then the probability that a prime p < ||b|| is omitted in LIF is less than or equal to (1/β)⌈β/p⌉ ≤ 2/p, see [1]. Finally, we bound the probability that any prime p > P with p | s_i(A) is missing by taking the sum over all primes p > P.

The choice M = N = 2 suffices to obtain a small probability 0.015 of omitting an important prime, and at the same time to exclude all primes that are not in the i-th invariant factor. In our experiments, there was no need to perform the computation for any additional prime p > 100, as all the primes were excluded by the OIF computation. This is one of the most important advantages over the valence computation.
The algorithm of [24] was stated in the case of dense matrices. We slightly modified it in order to exploit the sparse structure of our matrices. In particular, we used the sparse local Smith form computation of [9, Algorithm LRE], but stuck to the dense Dixon solver [7] as long as the memory was sufficient. Any other solver, including the new sparse solver of [11], could potentially be used for larger matrices. The limits of this method are imposed by the available memory. For example, it was possible to use the dense Dixon solver only for the six smallest matrices. Furthermore, sparse elimination reached its limits for matrices of size greater than 171375 × 47271.

The encountered problems have shown the need for a more elaborate reduction algorithm. We focused our attention on the algebraic reduction algorithm for chain complexes of [18]. We implemented a simplified version of the algorithm in the language of matrices using the LinBox library. The heuristic behind this algorithm is that Gaussian elimination can propagate from one matrix of a chain complex to the next, thanks to the exactness of the differential map (i.e. the d² = 0 condition). The motivation comes from the geometric properties of homologies. By a free face we refer to a
(k − 1)-cell a which is in the differential of only one k-dimensional cell b. By removing the pair (a, b) we obtain a retract of the initial cell complex (viewed as a geometrical object), see Figure 3. The process can be repeated. From homology theory we know that the homology groups are the same for the complex and its retract. The removal of pairs leads to a reduction of the basis of the cell complex. We refer to [17, Ch. 4] for a full description of the procedure.

Figure 3: Retraction for a square.

If the differential map is represented by matrices whose rows represent the cells of dimension k − 1 and whose columns represent the cells of dimension k, the removal of a pair (a, b) can be interpreted as the removal of a row with only one non-zero entry (a 1-row) and of the column it points to. In the general case, the algebraic reduction of (a, b) such that d_k(b) = λa + u is possible iff λ is invertible in the ring of computation, i.e. Q, Z or Z_{p^k}, depending on the problem. The modified differential d̃ is given by the formula

    d̃_k(v) = d_k(v) − λ^{−1} [v : a] d_k(b).   (2)

Thus, in the basic case of free face removal, no modification is needed. In the case of matrices, the formula describes a step of Gauss elimination where the reduced row is removed and not permuted. This proves that the Smith form (or the rank) of the initial and reduced matrices will be the same, provided we add to the reduced Smith form a number of trivial invariant factors equal to the number of rows removed. The important characteristic of this method is that, thanks to the exactness of the matrix sequence, we can also remove row b and column a from the neighboring matrices. In this way, elimination in one matrix can propagate to the others.
Due to the format of the data (large files with matrix entries) we decided to implement only the simplest case of 1-row removal, which leads to entry removals but no modifications; a sketch is given below. We removed the empty rows and columns at the same time and performed the whole reduction phase at the moment of reading the files. This led to a vast matrix reduction for the GL_7(Z) matrices from the beginning of the sequence. The propagation of reductions unfortunately burned out near the GL7d14 matrix and stopped completely on GL7d19. Applying the process to the transposed sequence did not improve the solution. A next step would be to implement the propagation of the Gauss elimination steps as in Eq. (2). It would be interesting to examine whether the burn-out can be connected to the loss of regularity for X/G and/or to the huge rectangularity of the input matrices.
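The 1-row removal can be rendered as follows (a Python sketch over a dictionary of non-zero entries; the actual implementation streams the large matrix files and also drops empty rows and columns on the fly):

    def reduce_one_rows(entries):
        # entries: dict (row, col) -> value, rows indexing (k-1)-cells.
        # Repeatedly find a 1-row: a row a whose single non-zero entry is +/-1,
        # in some column b (a free face).  Clearing column b with row a touches
        # nothing else, so the pair (row a, column b) can simply be deleted;
        # each removed pair contributes one trivial invariant factor.
        removed = 0
        while True:
            cols_in_row = {}
            for (i, j) in entries:
                cols_in_row.setdefault(i, []).append(j)
            pair = next(((i, js[0]) for i, js in cols_in_row.items()
                         if len(js) == 1 and entries[(i, js[0])] in (1, -1)), None)
            if pair is None:
                return entries, removed
            i0, j0 = pair
            entries = {(i, j): v for (i, j), v in entries.items()
                       if i != i0 and j != j0}
            removed += 1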
3. EXPERIMENTAL RESULTS

3.1 Parallel implementation
For the sake of simplicity in our experimental validation, we decided to develop our parallel implementation using shell task distribution on an SMP architecture. Thus, a simple script is used to distribute the different tasks over the processors, and files are used to gather the computed results. Our parallel implementation is organized as follows (a condensed rendering is sketched below):
1. The block of vectors Y and the sparse matrix A are broadcast to every processor.
2. Each process takes one column of the black box A^T A and computes the corresponding column of the sequence, using the first i columns of Y (by symmetry of the projection). Each process writes its result in a file labeled with the corresponding column index of the sequence.
3. When all the previous processes have terminated, the σ-basis computation is performed sequentially, after loading the sequence from the generated files.
Despite the naïve approach used for the parallelization, our implementation allowed us to perform very large computations, as shown in the next section. However, our experiments show the need for at least a more robust parallel computation scheduler.
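A condensed, runnable rendering of this scheme (Python multiprocessing in place of the shell scripts; the file names and the toy matrix are ours):

    import multiprocessing as mp

    P = 65537  # the FFT prime used for the rank computations

    def matvec_AtA(triples, shape, x, p):
        # One application of the symmetrized black box A^T A mod p.
        n, m = shape
        y = [0] * n
        for i, j, v in triples:
            y[i] = (y[i] + v * x[j]) % p
        z = [0] * m
        for i, j, v in triples:
            z[j] = (z[j] + v * y[i]) % p
        return z

    def sequence_column(args):
        # Worker: column k of the sequence {Y^T (A^T A)^i Y}, written to its
        # own file so a crashed task can be re-run independently.  For
        # simplicity the symmetry saving of section 2 is not applied here.
        triples, shape, Y, k, length, p = args
        col, x = [], Y[k][:]
        for _ in range(length):
            col.append([sum(a * b for a, b in zip(y, x)) % p for y in Y])
            x = matvec_AtA(triples, shape, x, p)
        with open("seq_col_%d.txt" % k, "w") as f:
            f.write(repr(col))
        return k

    if __name__ == "__main__":
        triples = [(0, 0, 1), (1, 1, 2), (2, 0, 3)]  # toy 3x2 sparse matrix A
        Y = [[1, 0], [0, 1]]                         # block of s = 2 vectors
        jobs = [(triples, (3, 2), Y, k, 6, P) for k in range(len(Y))]
        with mp.Pool() as pool:                      # one task per column of Y
            pool.map(sequence_column, jobs)
        # the sigma-basis reconstruction then reads the seq_col_*.txt files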
All our computations have been done on a SGI Altix 3700 gathering 64 Itanium2 processors with 192GB of memory and running SuSE Linux Enterprise System 10.

In Table 1 we give the characteristics of the GL_7(Z) matrices and their sparsity. The matrices are very sparse: less than 1% of the entries are non-zero, except for the matrices GL7d10 and GL7d11, and this value drops below 0.2% for the largest matrices. Also in Table 1 we give the ranks and the computed Smith forms. For the Smith form, the full result has been obtained for the matrices 10, 11, 12, 13, 25 and 26. For the matrices 14 and 24, only the smooth part of the Smith form has been computed. For the matrices 15 and 23 we have proved the existence of a non-trivial local Smith form at 2 and 3 and the triviality of the local Smith form at 5; as the result for these matrices we give the number of invariant factors divisible by 2 and by 3. Table 2 gives the timings of the Smith form computations.
The rank computations for the GL_7(Z) matrices were performed modulo 65537. This FFT prime allowed us to use both BLAS routines and the σ-basis reconstruction with fast polynomial multiplication. Table 3 gives the CPU timings of the basic operations involved, and Table 4 summarizes the large-scale parallel rank computations.

    A        Ω         n        m        rank     ker      Smith form
    GL7d10   8         60       1        1        59       1
    GL7d11   1513      1019     60       59       960      1 (59)
    GL7d12   37519     8899     1019     960      7939     1 (958), 2 (2)
    GL7d13   356232    47271    8899     7938     39333    1 (7937), 2 (1)
    GL7d14   1831183   171375   47271    39332    132043   1 (39300), 2 (29), 4 (3)
    GL7d15   6080381   460261   171375   132043   328218   1 (131993), 2·? (46), 6·? (4) (*)
    GL7d16   14488881  955128   460261   328218   626910
    GL7d17   25978098  1548650  955128   626910   921740
    GL7d18   35590540  1955309  1548650  921740   1033569
    GL7d19   37322725  1911130  1955309  1033568  877562
    GL7d20   29893084  1437547  1911130  877562   559985
    GL7d21   18174775  822922   1437547  559985   262937
    GL7d22   8251000   349443   822922   262937   86506
    GL7d23   2695430   105054   349443   86505    18549    1 (86488), 2·? (12), 6·? (5) (*)
    GL7d24   593892    21074    105054   18549    2525     1 (18544), 2 (4), 4 (1)
    GL7d25   81671     2798     21074    2525     273      1 (2507), 2 (18)
    GL7d26   7412      305      2798     273      32       1 (258), 2 (7), 6 (7), 36 (1)

Table 1: Results of the rank and Smith form computations for the GL_7(Z) matrices A of dimension n × m with Ω non-zero entries. For (*) the information is incomplete: only divisors of the invariant factors were determined, based on the rank mod 2 and mod 3 computations.

    A        ñ       m̃      r̃     Red     RAdaptive  SmoothSF  AdaptiveSF  SFValence
    GL7d11   39      8      52     0.01s   <0.01s     0.09s     0.26s       4.84s
    GL7d12   289     58     909    0.30s   0.16s      9.75s     218.68s     4.04h
    GL7d13   7938    740    7250   3.12s   159.16s    0.76h     *           2526.65h
    GL7d14   165450  35741  4279   21.62s  *          796h      *           *
    GL7d25   2797    20990  0      1.74s   *          17.67s    4.40h       52.13h
    GL7d26   302     2748   0      0.14s   *          0.29s     26.81s      274.35s

Table 2: Times for the Smith form computations of the GL_7(Z) matrices. From left to right: the dimensions of the matrix after reduction, the rank approximation given by the reduction, the time for reading and reducing the matrix, the time of the adaptive algorithm on the reduced matrix; then, for the original matrix, the times of the smooth form computation and of the adaptive algorithm; finally the valence computation in parallel (sequential time equivalent).

    A        A^T A u [s]   U^T v [s]   30 u^T v [s]
    GL7d11   0.0002        <10^-4      <10^-4
    GL7d12   0.0038        0.0001      0.0002
    GL7d13   0.0550        0.0005      0.0036
    GL7d14   0.2677        0.0025      0.0190
    GL7d15   0.9048        0.0082      0.0708
    GL7d16   2.7724        0.0234      0.2641
    GL7d17   7.8003        0.0485      0.5052
    GL7d18   11.9457       0.0759      0.8710
    GL7d19   13.3591       0.0948      1.0710
    GL7d20   10.4056       0.0711      0.8587
    GL7d21   5.7461        0.0408      0.4604
    GL7d22   1.9919        0.0180      0.2082
    GL7d23   0.4354        0.0052      0.0459
    GL7d24   0.0843        0.0012      0.0085
    GL7d25   0.0078        0.0002      0.0008
    GL7d26   0.0007        <10^-4      <10^-4

Table 3: CPU timings (in seconds) of the different operations used in the large-scale parallel rank computation. From left to right: one matrix-vector product, one BLAS multiplication of a vector by a 30 × min(n, m) matrix U, and 30 dot products.

    A        iter[1]   est. time[1]   iter[p] [p]   real time   est. time[p]   σ-basis
    GL7d11   120       0.02s          6 [30]        0.01s       <0.01s         0.58s
    GL7d12   1922      7.53s          66 [30]       0.32s       0.26s          12.16s
    GL7d13   15878     880.28s        532 [30]      51.65s      29.49s         249.17s
    GL7d14   78666     5.90h          2625 [30]     0.56h       0.20h          0.45h
    GL7d15   264088    66.98h         8805 [30]     2.25h       2.23h          2.45h
    GL7d16   656438    509.80h        21884 [30]    27.29h      17.00h         6.03h
    GL7d17   1253822   113.90d        41796 [30]    14d         3.80d          0.57d
    GL7d18   1843482   256.50d        46089 [40]    28d         6.41d          1.00d
    GL7d19   2067138   321.89d        41345 [50]    35d         6.44d          1.56d
    GL7d20   1755126   212.82d        36568 [48]    10d         4.43d          1.41d
    GL7d21   1119972   75.01d         37335 [30]    5d          2.50d          0.55d
    GL7d22   525876    293.60h        17532 [30]    16.47h      9.79h          5.85h
    GL7d23   173012    21.18h         5769 [30]     1.17h       0.71h          1.09h
    GL7d24   37100     3172.79s       1239 [30]     188.78s     105.96s        666.83s
    GL7d25   5052      40.21s         171 [30]      1.56s       1.36s          41.47s
    GL7d26   548       0.40s          21 [30]       0.03s       0.02s          2.03s

Table 4: A summary of the large-scale parallel rank computation. From left to right: the number of iterations in the scalar case and the estimated time in this case; the number of iterations on p processors, computed as 2r/p + O(1), with p given in brackets; the average real time of the sequence computation; the estimated time on p processors; and the time of the σ-basis computation for a sequence of length iter[p] of p × p matrices.

Some failures during these long runs unfortunately required re-running parts of the computation; thus the real time given in Table 4 can noticeably exceed the estimated time on p processors.
4. CONCLUSION
Using the previous methods and computations, we get the following new result for the rational cohomology of GL_7(Z).

Theorem (Elbaz-Vincent/Gangl/Soulé) [14]: We have H^m(GL_7(Z), Q) = Q if m = 0, 5, 9, 14, 15, and H^m(GL_7(Z), Q) = 0 otherwise.

Clearly the simple parallelization we used, together with the highly optimized routines, was the key to enabling these computations. To go further and solve even larger problems, it is mandatory to improve the parallelism. On SMP machines we can split the matrices into blocks and perform the matrix-vector products with different threads. This, and the unbalanced load we had when we chose to assign one vector per process, advocates for the use of more advanced scheduling. We are experimenting with KAAPI (Kernel for Adaptive, Asynchronous Parallel and Interactive programming, kaapi.gforge.inria.fr), but it was not ready for the computation of GL_7(Z).
Other improvements are of algorithmic type. They include the use of the sparse projections of [11] for the matrix sequence. But then we lose the symmetry of the projections and therefore must pay a factor of two in the number of iterations. We could also use an early termination strategy to stop the iteration earlier, but up to now this requires giving up the fast algorithm for the σ-bases. Then, if a good structure for the sparsity of the matrices could be found, e.g. an adapted reordering technique, this would enable an efficient clustering and therefore faster and more scalable matrix-vector products.
In order to obtain the relevant part of the torsion of the integral cohomology of GL_7(Z), we would need the complete description of the Smith forms of all the matrices described above. Our experiments have shown that this can be an enormously difficult task. The computation remains to be done, but it would have applications in number theory.
5. ACKNOWLEDGMENTS
We are grateful to Arne Storjohann and to the Computer Science Computing Facilities of the University of Waterloo for letting us fill up their SMP machine to perform our parallel computations.
6. REFERENCES
[1] J. Abbott, M. Bronstein, and T. Mulders. Fast deterministic computation of determinants of dense matrices. In S. Dooley, editor, Proceedings of the 1999 International Symposium on Symbolic and Algebraic Computation, Vancouver, Canada, pages 197–204. ACM Press, New York, July 1999.
[2] B. Beckermann and G. Labahn. A uniform approach for the fast computation of matrix-type Padé approximants. SIAM Journal on Matrix Analysis and Applications, 15(3):804–823, 1994.
[3] A. Bostan and E. Schost. Polynomial evaluation and interpolation on special sets of points. J. Complex., 21(4):420–446, 2005.
[4] K. S. Brown. Cohomology of Groups. Graduate Texts in Mathematics, 87. Springer-Verlag, New York-Heidelberg-Berlin, 1982.
[5] D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Inf., 28(7):693–701, 1991.
[6] D. Coppersmith. Solving homogeneous linear equations over GF[2] via block Wiedemann algorithm. Mathematics of Computation, 62(205):333–350, Jan. 1994.
[7] J. D. Dixon. Exact solution of linear equations using p-adic expansions. Numerische Mathematik, 40(1):137–141, Feb. 1982.
[8] J.-G. Dumas, editor. ISSAC'2006. Proceedings of the 2006 International Symposium on Symbolic and Algebraic Computation, Santander, Spain. ACM Press, New York, July 2006.
[9] J.-G. Dumas, B. D. Saunders, and G. Villard. On efficient sparse integer matrix Smith normal form computations. Journal of Symbolic Computation, 32(1/2):71–99, July-Aug. 2001.
[10] J.-G. Dumas and G. Villard. Computing the rank of sparse matrices over finite fields. In V. G. Ganzha, E. W. Mayr, and E. V. Vorozhtsov, editors, Proceedings of the fifth International Workshop on Computer Algebra in Scientific Computing, Yalta, Ukraine, pages 47–62. Technische Universität München, Germany, Sept. 2002.
[11] W. Eberly, M. Giesbrecht, P. Giorgi, A. Storjohann, and G. Villard. Solving sparse rational linear systems. In Dumas [8], pages 63–70.
[12] W. Eberly, M. Giesbrecht, and G. Villard. On computing the determinant and Smith form of an integer matrix. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 675–687. IEEE Computer Society, 2000.
[13] P. Elbaz-Vincent. Perfect lattices, homology of modular groups and algebraic K-theory. Oberwolfach Reports (OWR), 2, 2005. Based on joint work with H. Gangl and C. Soulé.
[14] P. Elbaz-Vincent, H. Gangl, and C. Soulé. Perfect forms, cohomology of modular groups and K-theory of integers. In preparation.
[15] P. Elbaz-Vincent, H. Gangl, and C. Soulé. Quelques calculs de la cohomologie de GL_N(Z) et de la K-théorie de Z. C. R. Acad. Sci. Paris, Ser. I, 335:321–324, 2002.
[16] P. Giorgi, C.-P. Jeannerod, and G. Villard. On the complexity of polynomial matrix computations. In R. Sendra, editor, Proceedings of the 2003 International Symposium on Symbolic and Algebraic Computation, Philadelphia, Pennsylvania, USA, pages 135–142. ACM Press, New York, Aug. 2003.
[17] T. Kaczyński, K. Mischaikow, and M. Mrozek. Computational Homology. Springer, 2004.
[18] T. Kaczyński, M. Mrozek, and M. Ślusarek. Homology computation by reduction of chain complexes. Computers and Mathematics, 35(4):59–70, 1998.
[19] E. Kaltofen. Analysis of Coppersmith's block Wiedemann algorithm for the parallel solution of sparse linear systems. Mathematics of Computation, 64(210):777–806, Apr. 1995.
[20] E. Kaltofen and A. Lobo. Distributed matrix-free solution of large sparse linear systems over finite fields. In A. Tentner, editor, Proceedings of High Performance Computing 1996, San Diego, California. Society for Computer Simulation, Simulation Councils, Inc., Apr. 1996.
[21] E. Kaltofen and B. D. Saunders. On Wiedemann's method of solving sparse linear systems. In Applied Algebra, Algebraic Algorithms and Error-Correcting Codes (AAECC '91), volume 539 of Lecture Notes in Computer Science, pages 29–38, Oct. 1991.
[22] M. Kurihara. Some remarks on conjectures about cyclotomic fields and K-groups of Z. Compos. Math., 81(2):223–236, 1992.
[23] J. Rosenberg. Algebraic K-Theory and its Applications. Springer, 1995.
[24] B. D. Saunders and Z. Wan. Smith normal form of dense integer matrices, fast algorithms into practice. In J. Gutierrez, editor, Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation, Santander, Spain, pages 274–281. ACM Press, New York, July 2004.
[25] C. Soulé. Perfect forms and the Vandiver conjecture. J. reine angew. Math., 517:209–221, 1999.
[26] E. H. Spanier. Algebraic Topology. Springer, 1994.
[27] E. Thomé. Fast computation of linear generators for matrix sequences and application to the block Wiedemann algorithm. In International Symposium on Symbolic and Algebraic Computation, London, Ontario, pages 323–331. ACM Press, July 2001.
[28] E. Thomé. Subquadratic computation of vector generating polynomials and improvement of the block Wiedemann algorithm. Journal of Symbolic Computation, 33(5):757–775, July 2002.
[29] W. J. Turner. A block Wiedemann rank algorithm. In Dumas [8], pages 332–339.
[30] G. Villard. Further analysis of Coppersmith's block Wiedemann algorithm for the solution of sparse linear systems. In W. W. Küchlin, editor, Proceedings of the 1997 International Symposium on Symbolic and Algebraic Computation, Maui, Hawaii, pages 32–39. ACM Press, New York, July 1997.
[31] G. Villard. A study of Coppersmith's block Wiedemann algorithm using matrix polynomials. Technical Report 975-IM, LMC/IMAG, Apr. 1997.
[32] G. Voronoï. Nouvelles applications des paramètres continus à la théorie des formes quadratiques I. J. Crelle, 133:97–178, 1907.
[33] D. H. Wiedemann. Solving sparse linear equations over finite fields. IEEE Transactions on Information Theory, 32(1):54–62, Jan. 1986.