LinBox founding scope allocation, parallel building blocks, and separate compilation
Jean-Guillaume Dumas, Thierry Gautier, Clément Pernet, B. David Saunders
October 25, 2018
Abstract
To maximize efficiency in time and space, allocations and deallocations, in the exact linear algebra library LinBox, must always occur in the founding scope. This provides a simple lightweight allocation model. We present this model and its usage for the rebinding of matrices between different coefficient domains. We also present automatic tools to speed up the compilation of template libraries, and a software abstraction layer for the introduction of transparent parallelism at the algorithmic level.
As a building block for a wide range of applications, computational exact linear algebra has to reconcile efficiency and genericity. The goal of the LinBox project is to address this problem in the design of an efficient general-purpose C++ open-source library for exact linear algebra over the integers, the rationals, and finite fields. Matrices can be either dense, sparse, or black box (i.e. viewed as a linear operator, acting on vectors only). The library proposes a set of high level linear algebra solutions, such as the rank, the determinant, the solution of a linear system, the Smith normal form, the echelon form, the characteristic polynomial, etc. Each of these solutions involves a hybrid combination of several specialized algorithms, depending on the domain and the type of matrix. Over a finite field, the building blocks are an efficient implementation of the Wiedemann and block Wiedemann algorithms combined with preconditioners [1] for black box matrices, a sparse Gaussian elimination for sparse matrices, and the BLAS based dense linear algebra techniques of the FFLAS library [4] for dense matrices. The solutions over the integers and rationals are lifted from modular computations by a Chinese remainder algorithm or p-adic lifting. The design is based on high genericity, to allow us to write efficient algorithms independent of the many representations of domains and matrices. As a middleware, the library relies on the efficiency of kernel libraries such as GMP, Givaro, NTL, ATLAS, and can be used by general purpose computer algebra systems such as Sage or Maple. We describe in this paper a selection of ideas and improvements that were recently introduced into the design of LinBox for the forthcoming 2.0 release.

∗ Laboratoire J. Kuntzmann, Université de Grenoble. 51, rue des Mathématiques, umr CNRS 5224, bp 53X, F38041 Grenoble, France, [email protected]. Part of this work was done while the first author was visiting the Claude Shannon Institute of the University College Dublin, Ireland, under a CNRS grant.
† Laboratoire LIG, Université de Grenoble. umr CNRS, F38330 Montbonnot, France. [email protected], [email protected]. Part of this work was done while the second author was visiting the ArTeCS group of the University Complutense, Madrid, Spain.
‡ University of Delaware, Computer and Information Science Department. Newark / DE / 19716, USA. [email protected].
The main objects that require memory allocation in LinBox are base field or ring elements, vectors, matrices, and polynomials. The memory management for all of these object types follows the same rules, organized to maximize efficiency in time and space, and consequently requiring some effort from the programmer: the allocations and deallocations must always occur in the founding scope. In particular, no external garbage collection mechanism is used.
The input and output types of most functions are usually template types, and can be either basic types or complicated objects. Consequently, passing arguments by value (copy) must be avoided as much as possible. Every argument is passed as a reference, including the return types. More precisely, the return value of a function is also its first argument, defined as a non-const reference.
Matrix& someFunction(Matrix& result, const XXX& args);
This convention was already presented in [2]; the case of stl-like operators is discussed in section 3.1. A consequence of the above convention is that the objects returned by a function have to be declared and initialized (in particular, memory allocated, e.g. via constructors) before the function call. By enforcing this practice, we require that the programmer keep a handle on the objects that he allocates until all uses of the object and its subobjects are completed. Moreover, he is responsible for object deallocation in the same scope where it was allocated. This restricts some convenient programming practices, but provides precise control of memory usage. This is particularly important when large, memory filling, matrices are in play. It also allows one to avoid the costs of garbage collection or reference counting. Many LinBox objects involve a handle containing a reference to the free store. Note that even though a function does not allocate the handle itself, it is in certain cases still free to resize and thus reallocate the free store memory referenced.
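As an illustration, here is a minimal sketch of the convention on a toy function (the `axpy` name and `Vector` type are ours, not LinBox code): the caller allocates the result in its own scope and passes it as the first argument, so the call itself performs no allocation.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<double> Vector;

// The return value is also the first argument, a non-const reference:
// `result` was already allocated in the caller's (founding) scope,
// and the function only fills it in.
Vector& axpy(Vector& result, double a, const Vector& x, const Vector& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        result[i] = a * x[i] + y[i];
    return result;
}
```

A caller would write `Vector r(3); axpy(r, 3.0, x, y);` and deallocate `r` in the same scope where it was declared.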
Dense Matrix allocations. The objects storing dense matrices require special care concerning their allocations. Dense matrices are represented as a one-dimensional array storing elements in the row major format: A[i,j] = *(A+i*n+j). It is important to be able to define a sub-matrix as a view on such an array, without allocating the data. For this we propose to distinguish two classes: one for allocated (via constructors) matrices and the other for sub-matrix views. The genericity of the template mechanism or inheritance allows the use of these two types in the same code, without duplication. This also allows for an automatic decision about deallocation. Other solutions include reference counting and explicit "end of use" functions.

Thus a first approach is to define a dense matrix class with a boolean alloc member, telling whether the matrix owns its data or whether it is a simple view on some other matrix's data. The destructor deallocates the data only if alloc is true. This can be viewed as a simplified reference counting mechanism, where one assumes that the matrix initially allocated is always destructed after all of its sub-matrices. This convention is consistent with the previous consideration: the allocations and deallocations must always occur in the founding scope.

To further improve the efficiency, an alternative is to distinguish the two classes mentioned above: one for allocated matrices and the other for sub-matrix views. Furthermore, thread-safety mechanisms on the alloc member are then not required anymore.

Remark that in this founding scope model, neither the alloc variable nor a two-class system is required. The programmer should know whether a matrix is created as a sub-matrix or as an allocating instance by what constructors or other initializers she uses. Thus she knows which objects require care to deallocate in the same scope. What she does not necessarily get is an automatic decision about deletion in the destructor, and she would thus have to call an explicit "end of use" function.
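A minimal sketch of the two-class alternative (the class and member names are illustrative, not the actual LinBox interface): the owning matrix deallocates in its destructor, while the view class never does, so the view must not outlive its founding matrix.

```cpp
#include <cstddef>

// Owning dense matrix: row-major one-dimensional array,
// A[i,j] = data[i*cols + j]; the destructor frees the storage.
class DenseMat {
    double* _data;
    std::size_t _rows, _cols;
public:
    DenseMat(std::size_t r, std::size_t c)
        : _data(new double[r * c]()), _rows(r), _cols(c) {}
    ~DenseMat() { delete[] _data; }            // owns: deallocates
    double& at(std::size_t i, std::size_t j) { return _data[i * _cols + j]; }
    double* data() { return _data; }
    std::size_t cols() const { return _cols; }
};

// Non-owning sub-matrix view: points into another matrix's storage,
// keeping the original row stride; it never deallocates anything.
class SubMat {
    double* _data;        // into the founding matrix's array
    std::size_t _stride;  // the founding matrix's column count
public:
    SubMat(DenseMat& A, std::size_t i0, std::size_t j0)
        : _data(A.data() + i0 * A.cols() + j0), _stride(A.cols()) {}
    double& at(std::size_t i, std::size_t j) { return _data[i * _stride + j]; }
};
```

Writing through the view modifies the founding matrix's data, and no deallocation decision has to be taken at view destruction time.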
LinBox makes use of the concept of rebinds for the mapping of data structures between different coefficient domains. For instance, in the context of the Chinese remainder algorithm, rebinds allow one to map a matrix over the integers (say, a DenseMatrix over an integer domain) to its modular counterpart (a DenseMatrix over a prime field).
In LinBox, binder adaptors are enclosed within many data structures and make use of a generic converter, named Hom and found in linbox/field/hom.h. Hom can generically use the LinBox domains' canonical conversion methods init and convert from/to the LinBox Integer type: Domain1 → Integer → Domain2. Moreover, when natural, efficient conversions exist between domains (e.g. different representations of the same field, or one ring embedded in another), the generic Hom can be directly bypassed by a specialization of Hom.

Rebind of dense matrices.
We illustrate the founding scope allocation model with the use of rebind functions, adapted from the allocators in the STL, on dense matrices.
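The shape of such a rebind can be sketched as follows. This is a hedged, self-contained approximation: the domain and matrix classes are toy stand-ins for the LinBox ones, but the rebind sub-class follows the STL allocator pattern described in the text.

```cpp
#include <cstddef>
#include <vector>

// Toy coefficient domains (stand-ins for LinBox domains).
struct IntDom {                        // plain integers
    typedef long Element;
    template <class Src>
    Element& init(Element& x, const Src& y) const { x = long(y); return x; }
};
struct Modular {                       // integers modulo p
    typedef long Element;
    long p;
    explicit Modular(long q) : p(q) {}
    template <class Src>
    Element& init(Element& x, const Src& y) const { x = long(y) % p; return x; }
};

// A toy dense matrix with an STL-allocator-style rebind.
template <class Domain>
struct DenseMat {
    typedef typename Domain::Element Element;
    Domain dom;
    std::vector<Element> data;         // row-major entries
    std::size_t rows, cols;
    DenseMat(const Domain& d, std::size_t r, std::size_t c)
        : dom(d), data(r * c), rows(r), cols(c) {}

    template <class Domain2>
    struct rebind {
        typedef DenseMat<Domain2> other;
        // The rebound matrix is passed by reference: it was allocated
        // in the caller's (founding) scope; operator() only converts.
        void operator()(other& Ap, const DenseMat<Domain>& A) const {
            for (std::size_t k = 0; k < A.data.size(); ++k)
                Ap.dom.init(Ap.data[k], A.data[k]);
        }
    };
};
```

A caller first constructs the target matrix (thereby allocating it), then invokes the rebind's `operator()` to fill it with the converted entries.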
According to the founding scope allocation model, the function operator() in charge of the initialization of the matrix cannot allocate any memory. This has to be done at the level where the rebind is called. This also requires a modification of the rebind operator interface of the STL: the new object is passed by reference.
In the case of BlackBoxes (functions providing only a matrix-vector product and not necessarily storing any data), the rebind mechanism becomes more specific. We detail in this section the solution provided in blackboxes which only store references to other blackboxes, such as Compose, Transpose, Submatrix, etc.

Indeed, to rebind a blackbox containing only references, one should allocate a new memory zone and rebind the referred blackbox there. The problem is that a caller, given a BlackBox::rebind, does not know how to allocate the referred blackboxes. In LinBox, we propose another solution: the other type not only has different elements, but also can be of a different type. For instance, the rebind other type of a blackbox containing a reference will be the same kind of blackbox, but physically storing the data (and thus owning it).

For the different blackboxes defined in LinBox which use references, we thus define similar classes called e.g. TransposeOwner, ComposeOwner, SubmatrixOwner, etc. These classes store and own their data. Then it suffices for the rebind sub-class of their reference equivalents to define its other type as the associated *Owner class. The example of Compose is given in figure 1. Note that this also fits well in the LinBox founding scope allocation, since the *Owner class will be declared (and thus allocated) by the caller of the rebind.
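A self-contained sketch of the idea on a toy Transpose (illustrative classes, not the actual LinBox ones): the reference-holding Transpose rebinds to an owning TransposeOwner, which the caller declares, and thus allocates, in its own scope.

```cpp
#include <cstddef>

// Toy leaf blackbox parameterized by its element type.
template <class T>
struct Scalar {
    T value;
    explicit Scalar(T v = T()) : value(v) {}
    template <class T2>
    struct rebind {
        typedef Scalar<T2> other;
        void operator()(other& out, const Scalar<T>& in) const {
            out.value = static_cast<T2>(in.value);
        }
    };
};

// Owning variant: physically stores (and thus owns) the inner blackbox.
template <class BB>
struct TransposeOwner {
    BB inner;
    TransposeOwner() : inner() {}
};

// Reference variant: only refers to an inner blackbox it does not own.
template <class BB>
struct Transpose {
    const BB& inner;
    explicit Transpose(const BB& a) : inner(a) {}

    template <class T2>
    struct rebind {
        // The rebound type is the *Owner class, so the caller's
        // founding scope holds (and later frees) the rebound data.
        typedef TransposeOwner<typename BB::template rebind<T2>::other> other;
        void operator()(other& out, const Transpose<BB>& in) const {
            typename BB::template rebind<T2>()(out.inner, in.inner);
        }
    };
};
```

The caller writes `Transpose<Scalar<int> >::rebind<long>::other Tp;` and then invokes the rebind's operator() on it: the allocation of the owner happens in the caller's scope, never inside the rebind.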
Remark that for a submatrix of a class storing its elements (contrary to a submatrix of a blackbox containing e.g. only references), a more efficient rebind would only rebind the elements within the boundaries of the submatrix. There we use a trait to decide whether the referred blackbox is a storing component, and in the latter case specialize e.g. the rebind of a Submatrix of a DenseMatrix accordingly.
Efficient parallel applications must take into consideration hardware characteristics (number of cores, memory hierarchy, etc.). It is time consuming, or impossible, for a single developer to program a high performance computer algebra application, with state of the art algorithms, while exploiting all the available parallelism. In order to separate the domains of expertise, we have designed a software abstraction layer between computer algebra algorithms and parallel implementations, which may employ automatic dynamic scheduling.
Computer algebra algorithms have three main characteristics: 1) they are complex and require a deep knowledge of the problem in order to obtain the most efficient sequential algorithm; 2) they may be highly irregular, which enforces a runtime use of load balancing algorithms; 3) they are generic, in the sense that they are usually designed to work over several algebraic domains.

To preserve the genericity of the LinBox algorithms, we have decided to base our software abstraction, called Parallel Building Blocks (PBB), on the principles of the STL (Standard Template Library) algorithms. Indeed, C++ data structures in LinBox let us have random access iterators over containers, which are naturally parallel. We have already defined several STL-like algorithms and the list will be extended in the near future: for_each, transform, accumulate. The PBB versions of these algorithms are similar to the STL versions, except that the involved operators (or function object classes), given as parameters, are required to have their return value reference passed as the first parameter of the function. This is in accordance with the memory model of LinBox; the STL return-by-value semantic is not appropriate.

The fundamental idea of PBB is that at the computer algebra level, the parallelization of all the loops, and more generally of all the STL-like algorithms, will already enable good performance and easy switching among multiple implementations. Regarding performance, this parallelization of the inner loops of the underlying linear algebra is sufficient in many cases. Regarding implementations, this abstraction provides for programming independent of the parallel model, with selection of the parallel environment depending on the target architecture. The parallel blocks can be implemented using many different parallel environments, such as OpenMP (openmp.org), TBB (Thread Building Blocks, threadingbuildingblocks.org) or Kaapi [6], using both static and dynamic work-stealing schedulers [9]. The current implementations are built on OpenMP and Kaapi.

To bound the complexity of many linear algebra problems, one of the key ideas is to use an accumulation with early termination. For instance, this is used in Chinese remaindering algorithms: the computation is performed modulo a sequence of (co)prime numbers and the result is built from a sequence of residues, until a condition is satisfied [3]. The termination of the algorithm depends on the accumulated result. In order to parallelize such algorithms, we proposed an extension of the STL algorithms called accumulate_until.

The algorithm takes an array v of length N, a unary operator f to be applied to each array entry, and a specific binary update operator/predicate for the accumulation. This accumulator, with a signature like bool accum(a, b), behaves like an in-place addition (a += b) and returns true to indicate sufficiently many values have been accumulated. Let S_k = f(v[0]) + ... + f(v[k-1]) for k in {0, ..., N}. The algorithm computes and returns n ≤ N and S_n such that one accumulation during the computation of S_n returned true, or n = N. In intended use, we know any additional accumulation would also return true.

This algorithm will be used for the early termination Chinese remaindering algorithms of LinBox. Though not yet using PBB and accumulate_until, a sequential version and parallel versions with OpenMP and Kaapi can be found in the LinBox distributions as linbox/algorithms/cra-domain-*.h.

Many computer algebra programs allocate dynamic memory for the intermediate computations. Several experiments with
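A sequential sketch of such an accumulate_until, written from the specification above (this is our illustration, not the PBB implementation):

```cpp
#include <cstddef>

// accum(a, b) behaves like a += b and returns true once sufficiently
// many values have been accumulated.  The function returns n <= N,
// the number of entries consumed, and leaves S_n in `result`.
template <class T, class UnaryOp, class Accum>
std::size_t accumulate_until(T& result, const T* v, std::size_t N,
                             UnaryOp f, Accum accum) {
    for (std::size_t n = 0; n < N; ++n)
        if (accum(result, f(v[n])))   // early termination reached
            return n + 1;
    return N;                         // all N values were needed
}
```

Note that, following the LinBox convention, the accumulated result is the first parameter, passed as a non-const reference; the parallel versions would additionally have to combine partial accumulations computed by different threads.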
LinBox algorithms on multicore architectures have shown that these allocations are quite often the bottleneck. An analysis of the memory pattern, and experiments with three well known memory allocators (ptmalloc, Hoard, and TCMalloc from Google Perf. Tools [7]), have been conducted. The goal was to decide whether the parallel building blocks model was suitable for high-performance exact linear algebra. We used dynamic libraries to exchange allocators for the experiments, but one can use them together in the LinBox library if needed [8].

LinBox is developed with several kinds of genericity: 1) genericity with respect to the domain of the coefficients, 2) genericity with respect to the data structure of the matrices, 3) genericity with respect to the intermediate algorithms. While this is efficient in terms of capabilities and code reusability, there is a combinatorial explosion of combinations. Consider that each of 50 arithmetic domains may be combined with each of 50 matrix representations, in each of 10 intermediate algorithm forms, for a single problem as simple as matrix rank. This lengthens the compilation time and generates large executable files.

For the management of code bloat, LinBox has used an "archetype mechanism" which enables, at the user's option, a switch to compilation against abstract classes [2], e.g. when interfacing with general purpose systems such as Sage or Maple. Our idea is to automate the technique of [5], which combines compile-time instantiation and link-time instantiation, while using template instantiation instead of void pointers. The mechanism we propose is independent of the desired generic method (the candidate for separate compilation) and is explained in algorithm 1. This algorithm is illustrated in figure 2, where the function is the rank and the template parameter is a dense matrix over GF(2).
Algorithm 1: Separate compilation of a generic function.
Input: A generic function func.
Input: Template parameters TParam for separate specialization/compilation of func.
Output: A generic function calling func with separately compiled instantiations.
1. Create a header and a body file, "func_instantiate.hpp" and "func_instantiate.cpp";
2. Add a template function func_separate, with the same specification as func, to the header; its generic default implementation is a single line calling the original function func. {This enables a unified interface, even for non-specialized classes.}
3. for each separately compiled template parameter TParam do
4.   Add a non-template specification funcTParam to the header file;
5.   Add the associated body, with a single line returning the instantiation of func on a parameter of type TParam, to the body file;
6.   Add an inline specialization body of func_separate on a parameter of type TParam, with a single line returning funcTParam, to the header file;
7. end for
8. Compile the body file "func_instantiate.cpp".
[Figure 2: Separate compilation of the rank. The user interface rank(const Mat&) calls the template rank_separate(const Mat&): its default implementation calls the generic code, while a template<> specialization in the header calls the separately compiled rankDenseMatGF2 body.]
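As a hedged, single-file sketch of what algorithm 1 generates for rank (names are illustrative and the rank computation is a placeholder; in an actual build, the interface part would live in the header and the last function in the separately compiled body file):

```cpp
#include <cstddef>

struct DenseMatGF2 { std::size_t r; };          // stand-in parameter type

// Generic (in real life, heavily templated) implementation.
template <class Mat>
std::size_t rank(const Mat& A) { return A.r; }  // placeholder computation

// --- what would go in func_instantiate.hpp ------------------------
// Unified interface: by default, a one-line call to the generic rank.
template <class Mat>
std::size_t rank_separate(const Mat& A) { return rank(A); }

std::size_t rankDenseMatGF2(const DenseMatGF2& A);   // non-template decl

template <>
inline std::size_t rank_separate<DenseMatGF2>(const DenseMatGF2& A) {
    return rankDenseMatGF2(A);       // one line: call the compiled body
}

// --- what would go in func_instantiate.cpp (compiled once) --------
std::size_t rankDenseMatGF2(const DenseMatGF2& A) {
    return rank(A);                  // the only real instantiation point
}
```

Users always call rank_separate: for most matrix types it inlines into the generic code, while for DenseMatGF2 it dispatches to the instantiation that was compiled once in the body file.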
Algorithm 1 has been simplified for the sake of clarity. To enable a more user-friendly interface, one can rename the original function and all its original specializations func_original, and then rename the new interface simply func. With the classical inline compiler optimizations, the overhead of calling func_separate is limited to a single supplementary function call. Indeed, all the one-line additional methods will be automatically inlined, except, of course, the one calling the separately compiled code. If this overhead is too expensive, it suffices to enclose all the non-generic specializations of "func_instantiate.hpp" within a macro test. At compile time, the decision whether or not to compile separately can then be taken according to the definition of this macro.

We show in tables 1 and 2 the gains in compilation time obtained on two examples from LinBox: the examples/{rank,solve}.C algorithms. Indeed, without any specification, the code has to invoke several specializations depending on run-time discovered properties of the input. For instance, solve.C requires 6 specializations for sparse matrices over the integers or over a prime field, with a sparse elimination, or an iterative method, or a dense method if the matrix is small...

[Table 1: {rank,solve}.C compilation time on an AMD Athlon 3600+, 1.9GHz, with gcc 4.5 -O2. instantiate.o corresponds to the separately compiled instantiations (e.g. rankDenseMatGF2 in figure 2); {rank,solve}.o corresponds to the user interface and generic implementation compilation; link corresponds to the linking of both .o files and the library; Full comp. corresponds to the compilation without the separate mechanism.]

[Table 2: {rank,solve}.C compilation time on an Intel Xeon E5345, 2.33GHz, with icc-11.1 -O2.]

Acknowledgment
We thank the LinBox group, and especially Brice Boyer, Pascal Giorgi, Erich Kaltofen, Dan Roche, and Brian Youse, for many useful discussions, in particular during the recent LinBox developer meetings in Delaware and Dublin.

References

[1] L. Chen, W. Eberly, E. Kaltofen, B. D. Saunders, W. J. Turner, and G. Villard. Efficient matrix preconditioners for black box linear algebra. Linear Algebra and its Applications, 343-344:119–146, 2002.
[2] J.-G. Dumas, T. Gautier, M. Giesbrecht, P. Giorgi, B. Hovinen, E. Kaltofen, B. D. Saunders, W. J. Turner, and G. Villard. LinBox: A generic library for exact linear algebra. In A. M. Cohen, X.-S. Gao, and N. Takayama, editors, Proceedings of the 2002 International Congress of Mathematical Software, Beijing, China, pages 40–50. World Scientific Pub., Aug. 2002.
[3] J.-G. Dumas, T. Gautier, and J.-L. Roch. Generic design of Chinese remaindering schemes. In M. Moreno-Maza and J.-L. Roch, editors, PASCO 2010. Université de Grenoble, France, July 2010.
[4] J.-G. Dumas, P. Giorgi, and C. Pernet. Dense linear algebra over word-size prime fields: the FFLAS and FFPACK packages. ACM Trans. Math. Softw., 35(3):1–42, 2008.
[5] U. Erlingsson, E. Kaltofen, and D. Musser. Generic Gram-Schmidt orthogonalization by exact division. In ISSAC'1996, pages 275–282, July 1996.
[6] T. Gautier, X. Besseron, and L. Pigeon. KAAPI: a thread scheduling runtime system for data flow computations on cluster of multi-processors. In PASCO'07, pages 15–23, 2007.
[7] S. Ghemawat and P. Menage. TCMalloc: Thread-caching malloc. http://goog-perftools.sourceforge.net/doc/tcmalloc.html.
[8] E. Kaltofen, D. Morozov, and G. Yuhasz. Generic matrix multiplication and memory management in LinBox. In ISSAC'2005, pages 216–223, July 2005.
[9] D. Traore, J.-L. Roch, N. Maillard, T. Gautier, and J. Bernard. Deque-free work-optimal parallel STL algorithms. In EUROPAR 2008, pages 887–897, Las Palmas, Spain, Aug. 2008.