Computing huge Groebner basis like cyclic10 over $\Q$ with Giac
aa r X i v : . [ c s . S C ] M a r Computing huge Groebner basis like cyclic10over Q with Giac. Bernard ParisseInstitut FourierUMR 5582 du CNRSUniversité de Grenoble I2019
Abstract
We present a short description on how to fine-tune the modular algorithm im-plemented in the Giac computer algebra system to reconstruct huge Groebner basisover Q . The classical cyclic10 benchmark will serve as example. In 1999, J.-C. Faugère presented the F4 algorithm ([3]), it was illustrated with the re-construction of the Groebner basis over Q of cyclic9, a symmetric Groebner basis with9 variables, obtained in 18 days. The cyclic9 basis has 1344 elements, 1067837 mono-mials and requires 378 29-bits primes for reconstruction. Twenty years later, with morepowerful computers (more RAM) and multi-CPU architectures, the same computationtakes about 20 minutes with Giac/Xcas ([6])). The new big benchmark in this cate-gory is now cyclic10, with 10 variables instead of 9. The computation modulo a primetakes a few hours with a generic algorithm (i.e. without taking in account the sym-metries like in [8] or [2]), the basis has 5690 elements, about 20 millions monomials(19825931 or a little less for some primes, because of cancellation). The computa-tion over Q is a challenge , because most coefficients require more than 2000 29-bitsprimes for reconstruction, and there are about 20 millions of coefficients to compute,i.e. the Groebner basis itself is more than 50G of RAM. Being able to solve this exam-ple is not intrinsically interesting (especially since our algorithm does not exploit anysymmetries), but induced improvements in many areas in the Giac’s modular gbasisalgorithm implementation. I also had to implement partial computation saving.Section 2 will briefly explain the different phases of the modular reconstructionalgorithm. Section 3 will describe how to control Giac’s gbasis command and showhow we were able to compute the Groebner basis of cyclic10 on Q on a server with as far as I know, this document reports the first successful attempt to compute cyclic10 over Q . AllanSteel reported about cyclic9 over Q in Magma 15 years ago [7], more recent reports are in finite fields The modular gbasis algorithm over Q is done in a loop with three steps : gbasis com-putation modulo a prime, Chinese remaindering with previous primes and rational re-construction attempts. Rational reconstruction is not checked on each basis elementafter each prime, this would take too long. The basis elements are sorted by increasingrevlex order, then an attempt is made to reconstruct the first basis element that is not yetreconstructed and check that it fits with the last prime. If reconstruction is successful,try the next one and so on until reconstruction fails. As a side effect, basis elementsreconstruct in clusters (it is not clear if there is a better ordering strategy of basis el-ements to attempt reconstruction, I tried a few obvious strategies like age, but it wasworse than revlex order). Already reconstructed generators are checked each time anew prime gbasis is computed.For multi-CPU architectures, there are two ways to parallelize the computation : • multiple primes compute the Groebner basis modulo several primes in parallel. This gives thebest speed-up (about the same real-time/CPU time ratio than the number of si-multaneous primes), but it requires more memory (for example with Giac, about12G of RAM per prime for cyclic10, running more than 12 primes simultane-ously is not a good idea with 256G of RAM). • most tasks in a modular Groebner basis computation can be paralleled: sparselinear algebra, Gaussian reduction, inter-reduction... This does not require addi-tional memory (vs mono-threaded execution), but is not as efficient as computingmultiple primes simultaneously.Note that Chinese remaindering (and rational reconstruction) is single-threaded, be-cause it can trigger memory allocations (and locks between threads could slow downthe real-time computation). With a better knowledge of GMP memory handling, thiscould probably be paralleled.There are two classical optimizations : • learning The idea is that the F4 algorithm will follow the same path for all primes: thesame pairs will reduce to 0 and the same monomials will be involved (“symbolicpreprocessing” information), we can record them for further runs. For moredetails, see [5]. • re-injection Some of the ideal generators have smaller coefficients than other, reconstruc-tion can happen after a fraction of the total number of required primes (see also[4]). Using this additional information, we can speed up the computation for2he remaining primes. For example, partial reconstruction happens for cyclic10after 390 29-bits primes (for 20 generators), then after 794 primes (for 2164 gen-erators). Using this additional information will also save memory because thegbasis algorithm requires less intermediate computations and coefficients stor-age is optimized: for example a 400 bits reconstructed rational (with numeratorand denominator of size 200 bits) takes 3 times less memory than it’s integerrepresentant modulo a 1200 bits integer.Examples of observed real-time speeds computing cyclic10 gbasis modulo a 29 bitsprime using at most 36 CPU ( cocalc : Intel Xeon CPU E5-2687W v4 @ 3.00GHz,first part of the table) or at most 24 CPU ( ifnode2 : Intel Xeon CPU E5-2640 v3 @2.60GHz, second part). Run 1 means first prime run modulo a 29 bit prime (includesstoring learning information for further runs). Run 2 means next primes runs modulo a29 bits prime with learned information. Time does not include Chinese remaindering(varying from a few seconds at stage 1 start to almost 1 minute per prime at stage 2 end)or rational reconstruction (negligible except when clusters of generators reconstruct)Run Re-inject. simult. primes threads per prime minutes per prime1 No 1 24 2002 No 4 6 21.252 No 3 8 242 No 2 12 50 • phase 0 : 1st prime run with parallelism, collect information for further runs • phase 1 : next runs (several primes, for each prime parallelism by number ofCPU available divided by number of simultaneous prime runs). No re-injectionat this stage yet. Partial reconstructions attempts. • phase 2 : if at some point, partial reconstruction of a significant number of gen-erators is realized in phase 1, re-inject reconstructed generators, clear learnedinformation, redo phase 0 with initial generators+reconstructed generators, thenredo phase 1 until final reconstruction, or partial reconstruction of a larger partof the basis (phases 3 or more) • phase 3, 4, ...like phase 2 with a more complete partial reconstructed basis3or huge computations, it may be desirable to save partial computations, this can bedone here by archiving reconstructed generators. Further computations can start atphase 2 instead of phase 0. In addition to the threads:=n command to control the number of threads, Giac1.5.0-49 introduces some new instructions to control the gbasis implementation : • gbasis_simult_primes(n)gbasis_simult_primes(n1,p1,n2,p2,n3) This parameter controls the number of simultaneous modular prime computa-tions. With 5 arguments instead of 1, the number of simultaneous modular com-putation is defined as n1 until p1 primes are computed, then n2 until p2 primes,then n3 .The best real-time is obtained by choosing the highest number of primes com-patible with the RAM available, then divide the number of available threads bythis number of primes and give it for each 1-prime parallelism. • gbasis_reinject(ratio,speed_ratio) Change defaults re-injection parameters. If the number of reconstructed genera-tors over the number of elements of the basis is greater than ratio and the speedratio of the second run (with learning information) vs first run (without learninginformation) is greater than speed_ratio , then re-injection will happen. • gbasis_reinject(-n) With a negative integer as argument, this will stop the computation as soon as atleast n elements of the Groebner basis are reconstructed and gbasis will returna partial Groebner basis. This is used in conjunction with archive("filename",gbasis(list_generators,vars)) for a computation in several stages. Further stages will call gbasis(list_generators,vars,gbasis_reinject=unarchive("filename")) • gbasis_max_pairs(n) Maximal number of pairs that will be reduced simultaneously in the F4 algo-rithm. This can reduce time and memory for a 1-prime computation but it seemsto be less efficient on further runs.
The computation of cyclic9 can be done in 1 stage with the following script cyclic9:=[x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, x1*x2 + x2*x3+ x3*x4 + x4*x5 + x5*x6 + x6*x7 + x7*x8 + x1*x9 + x8*x9, x1*x2*x3+ x2*x3*x4 + x3*x4*x5 + x4*x5*x6 + x5*x6*x7 + x6*x7*x8 + x1*x2*x9+ x1*x8*x9 + x7*x8*x9, x1*x2*x3*x4 + x2*x3*x4*x5 + x3*x4*x5*x6 + Uncommenting gbasis_reinject(-500); will stop the computation after 500basis elements are reconstructed. To finish the computation, run gbasis(cyclic9,indets(cyclic9),gbasis_reinject=unarchive("H9")) . Q The computation of cyclic10 was done in 2 stages on cocalc by calling icas (Giactext interpreter) on the following script debug_infolevel:=1;threads:=36;gbasis_simult_primes(12,800,9,1600,2); // 12 simult up to 800 primes,then 9, then 2// gbasis_reinject(.05,.05);gbasis_reinject(-2205); // so that we reinject the 2205 first basiselements after 796 primesvars := [x0,x1,x2,x3,x4,x5,x6,x7,x8,x9];sys := [x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, x0*x1 +x0*x9 + x1*x2 + x2*x3 + x3*x4 + x4*x5 + x5*x6 + x6*x7 + x7*x8 + x8*x9,x0*x1*x2 + x0*x1*x9 + x0*x8*x9 + x1*x2*x3 + x2*x3*x4 + x3*x4*x5 +x4*x5*x6 + x5*x6*x7 + x6*x7*x8 + x7*x8*x9, x0*x1*x2*x3 + x0*x1*x2*x9+ x0*x1*x8*x9 + x0*x7*x8*x9 + x1*x2*x3*x4 + x2*x3*x4*x5 + x3*x4*x5*x6+ x4*x5*x6*x7 + x5*x6*x7*x8 + x6*x7*x8*x9, x0*x1*x2*x3*x4 + x0*x1*x2*x3*x9+ x0*x1*x2*x8*x9 + x0*x1*x7*x8*x9 + x0*x6*x7*x8*x9 + x1*x2*x3*x4*x5+ x2*x3*x4*x5*x6 + x3*x4*x5*x6*x7 + x4*x5*x6*x7*x8 + x5*x6*x7*x8*x9,x0*x1*x2*x3*x4*x5 + x0*x1*x2*x3*x4*x9 + x0*x1*x2*x3*x8*x9 + x0*x1*x2*x7*x8*x9+ x0*x1*x6*x7*x8*x9 + x0*x5*x6*x7*x8*x9 + x1*x2*x3*x4*x5*x6 + x2*x3*x4*x5*x6*x7+ x3*x4*x5*x6*x7*x8 + x4*x5*x6*x7*x8*x9, x0*x1*x2*x3*x4*x5*x6 + x0*x1*x2*x3*x4*x5*x9 x0*x1*x2*x3*x4*x8*x9 + x0*x1*x2*x3*x7*x8*x9 + x0*x1*x2*x6*x7*x8*x9+ x0*x1*x5*x6*x7*x8*x9 + x0*x4*x5*x6*x7*x8*x9 + x1*x2*x3*x4*x5*x6*x7+ x2*x3*x4*x5*x6*x7*x8 + x3*x4*x5*x6*x7*x8*x9, x0*x1*x2*x3*x4*x5*x6*x7+ x0*x1*x2*x3*x4*x5*x6*x9 + x0*x1*x2*x3*x4*x5*x8*x9 + x0*x1*x2*x3*x4*x7*x8*x9+ x0*x1*x2*x3*x6*x7*x8*x9 + x0*x1*x2*x5*x6*x7*x8*x9 + x0*x1*x4*x5*x6*x7*x8*x9+ x0*x3*x4*x5*x6*x7*x8*x9 + x1*x2*x3*x4*x5*x6*x7*x8 + x2*x3*x4*x5*x6*x7*x8*x9,x0*x1*x2*x3*x4*x5*x6*x7*x8 + x0*x1*x2*x3*x4*x5*x6*x7*x9 + x0*x1*x2*x3*x4*x5*x6*x8*x9+ x0*x1*x2*x3*x4*x5*x7*x8*x9 + x0*x1*x2*x3*x4*x6*x7*x8*x9 + x0*x1*x2*x3*x5*x6*x7*x8*x9+ x0*x1*x2*x4*x5*x6*x7*x8*x9 + x0*x1*x3*x4*x5*x6*x7*x8*x9 + x0*x2*x3*x4*x5*x6*x7*x8*x9+ x1*x2*x3*x4*x5*x6*x7*x8*x9,x0*x1*x2*x3*x4*x5*x6*x7*x8*x9-1];time(archive("H10_2205",gbasis(sys ,vars))); This computation ran in about one week on cocalc (real time on cocalc
H10_1 . I could also run with 64G of RAM in2 weeks on ifnode2 with this configuration threads:=24;gbasis_simult_primes(4,250,3,600,2);
The output (compressed in gzip format) is available here
Now we can run icas on a copy of the script above, after commenting the line // gbasis_reinject(-2205) ,uncommenting the line gbasis_reinject(.05,.05); ,and replacing the last line by time(archive("H10",gbasis(sys,vars,gbasis_reinject=unarchive("H10_2205"))));
The second stage runs in about 1.5 days and requires about 220G of memory. Makesure you have enough disk space to save the full basis (about 100G) or make a partialsave, like this : time(H:=gbasis(sys,vars,gbasis_reinject=unarchive("H10_2205")));archive("H10",eval(H,1)[0..100]):;
Stage 2 may be split in several steps to save the temporary memory required, forexample gbasis_reinject(-3263) will save the 3263 first basis elements (thisrequires much less RAM, because we compute about 1000 new basis elements insteadof about 3000). Reconstruction in cyclic10 happens after the following numbers ofprimesNumber of primes Reconstructed basis elements3 21388 41794 22051041 23511426 26361933 32632225 5690An attempt was made to re-inject the partial reconstruction of the 41 first gener-ators, but instead of speeding up the computation, it slowed it down (more precisely,6he 1st run with re-injection was about twice as fast, but further runs were slower, as iflearning was fully ineffective). We observed the same behavior trying to decrease themaximal number of pairs with gbasis_max_pairs , this make the 1st run faster,but after that learning is much less effective and next runs are slower.
In 1999, J.C. Faugère stated about the cyclic 9 computation on Q “This success is alsoa failure in some sense: the size of the output is so big that we cannot do anythingwith this result”. I don’t think this is true anymore, the output of cyclic9 can be usednowadays. Perhaps 20 years from now, the output of cyclic10 on Q will be used ? Acknowledgements
Many thanks to William Stein for letting me run most of these computations on cocalc.sagemath.org and to S. Lelièvre for his interest.
References
Proceedings of the 38th International Symposiumon Symbolic and Algebraic Computation , pages 347–354. ACM, 2013.[3] J.-C. Faugère. A new efficient algorithm for computing Gröbner bases (F4).
Jour-nal of Pure and Applied Algebra , 139(1–3):61–88, June 1999.[4] M. Monagan and R. Pearce. An algorithm for spliting polynomial systems basedon f4. In