LinBox founding scope allocation, parallel building blocks, and separate compilation
Jean-Guillaume Dumas, Thierry Gautier, Clément Pernet, B. David Saunders
October 25, 2018
Abstract
To maximize efficiency in time and space, allocations and deallocations, in the exact linear algebra library LinBox, must always occur in the founding scope. This provides a simple lightweight allocation model. We present this model and its usage for the rebinding of matrices between different coefficient domains. We also present automatic tools to speed up the compilation of template libraries, and a software abstraction layer for the introduction of transparent parallelism at the algorithmic level.
As a building block for a wide range of applications, computational exact linear algebra has to reconcile efficiency and genericity. The goal of the LinBox project is to address this problem in the design of an efficient general-purpose C++ open-source library for exact linear algebra over the integers, the rationals, and finite fields. Matrices can be either dense, sparse, or black box (i.e. viewed as a linear operator, acting on vectors only). The library proposes a set of high level linear algebra solutions, such as the rank, the determinant, the solution of a linear system, the Smith normal form, the echelon form, the characteristic polynomial, etc. Each of these solutions involves a hybrid combination of several specialized algorithms, depending on the domain and the type of matrix. Over a finite field, the building blocks are an efficient implementation of the Wiedemann and block Wiedemann algorithms combined with preconditioners [1] for black box matrices, a sparse Gaussian elimination for sparse matrices, and the BLAS based dense linear algebra techniques of the FFLAS library [4] for dense matrices. The solutions over the integers and rationals are lifted from modular computations by a Chinese remainder algorithm or p-adic lifting. The design is based on high genericity, to allow us to write efficient algorithms independent of the many representations of domains and matrices. As a middleware, the library relies on the efficiency of kernel libraries such as GMP, Givaro, NTL, ATLAS, and can be used by general purpose computer algebra systems such as Sage or Maple. We describe in this paper a selection of ideas and improvements that were recently introduced into the design of LinBox for the forthcoming 2.0 release.

∗ Laboratoire J. Kuntzmann, Université de Grenoble. 51, rue des Mathématiques, umr CNRS 5224, bp 53X, F38041 Grenoble, France, [email protected]. Part of this work was done while the first author was visiting the Claude Shannon Institute of the University College Dublin, Ireland, under a CNRS grant.
† Laboratoire LIG, Université de Grenoble. umr CNRS, F38330 Montbonnot, France. [email protected], [email protected]. Part of this work was done while the second author was visiting the ArTeCS group of the University Complutense, Madrid, Spain.
‡ University of Delaware, Computer and Information Science Department. Newark / DE / 19716, USA. [email protected].
The main objects that require memory allocation in LinBox are base field or ring elements, vectors, matrices, and polynomials. The memory management for all of these object types follows the same rules, organized to maximize efficiency in time and space, and consequently requiring some effort from the programmer: the allocations and deallocations must always occur in the founding scope. In particular, no external garbage collection mechanism is used.
The input and output types of most functions are usually template types, and can be either basic types or complicated objects. Consequently, passing arguments by value (copy) must be avoided as much as possible. Every argument is passed as a reference, including the return types. More precisely, the return value of a function is also its first argument, defined as a non-const reference.
Matrix& someFunction(Matrix& result, const XXX& args);
This convention was already presented in [2]; the case of stl-like operators is discussed in section 3.1. A consequence of the above convention is that the objects returned by a function have to be declared and initialized (in particular, memory allocated, e.g. via constructors) before the function call. By enforcing this practice, we require that the programmer keep a handle on the objects that he allocates until all uses of the object and its subobjects are completed. Moreover, he is responsible for object deallocation in the same scope where it was allocated. This restricts some convenient programming practices, but provides precise control of memory usage. This is particularly important when large, memory filling, matrices are in play. It also allows one to avoid the costs of garbage collection or reference counting. Many LinBox objects involve a handle containing a reference to the free store. Note that even though a function does not allocate the handle itself, it is in certain cases still free to resize and thus reallocate the free store memory referenced.
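As an illustration, here is a minimal sketch of the convention on a toy function (the `axpy` name and `Vector` type are ours, not LinBox code): the caller allocates the result in its own scope and passes it as the first argument, so the call itself performs no allocation.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<double> Vector;

// The return value is also the first argument, a non-const reference:
// `result` was already allocated in the caller's (founding) scope,
// and the function only fills it in.
Vector& axpy(Vector& result, double a, const Vector& x, const Vector& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        result[i] = a * x[i] + y[i];
    return result;
}
```

A caller would write `Vector r(3); axpy(r, 3.0, x, y);` and deallocate `r` in the same scope where it was declared.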
Dense Matrix allocations. The objects storing dense matrices require special care concerning their allocations. Dense matrices are represented as a one-dimensional array storing elements in the row major format: A[i,j] = *(A+i*n+j). It is important to be able to define a sub-matrix as a view on such an array, without allocating the data. For this we propose to distinguish two classes: one for allocated (via constructors) matrices and the other for sub-matrix views. The genericity of the template mechanism or inheritance allows the use of these two types in the same code, without duplication. This also allows for an automatic decision about deallocation. Other solutions include reference counting and explicit "end of use" functions.

Thus a first approach is to define a dense matrix class with a boolean alloc member, telling whether the matrix owns its data or whether it is a simple view on some other matrix's data. The destructor deallocates the data only if alloc is true. This can be viewed as a simplified reference counting mechanism, where one assumes that the matrix initially allocated is always destructed after all of its sub-matrices. This convention is consistent with the previous consideration: the allocations and deallocations must always occur in the founding scope.

To further improve the efficiency, an alternative is to distinguish the two classes mentioned above: one for allocated matrices and the other for sub-matrix views. Furthermore, thread-safety mechanisms on the alloc member are then not required anymore.

Remark that in this founding scope model, neither the alloc variable nor a two-class system is required. The programmer should know whether a matrix is created as a sub-matrix or as an allocating instance by what constructors or other initializers she uses. Thus she knows which objects require care to deallocate in the same scope. What she does not necessarily get is an automatic decision about deletion in the destructor, and she would thus have to call an explicit "end of use" function.
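A minimal sketch of the two-class alternative (the class and member names are illustrative, not the actual LinBox interface): the owning matrix deallocates in its destructor, while the view class never does, so the view must not outlive its founding matrix.

```cpp
#include <cstddef>

// Owning dense matrix: row-major one-dimensional array,
// A[i,j] = data[i*cols + j]; the destructor frees the storage.
class DenseMat {
    double* _data;
    std::size_t _rows, _cols;
public:
    DenseMat(std::size_t r, std::size_t c)
        : _data(new double[r * c]()), _rows(r), _cols(c) {}
    ~DenseMat() { delete[] _data; }            // owns: deallocates
    double& at(std::size_t i, std::size_t j) { return _data[i * _cols + j]; }
    double* data() { return _data; }
    std::size_t cols() const { return _cols; }
};

// Non-owning sub-matrix view: points into another matrix's storage,
// keeping the original row stride; it never deallocates anything.
class SubMat {
    double* _data;        // into the founding matrix's array
    std::size_t _stride;  // the founding matrix's column count
public:
    SubMat(DenseMat& A, std::size_t i0, std::size_t j0)
        : _data(A.data() + i0 * A.cols() + j0), _stride(A.cols()) {}
    double& at(std::size_t i, std::size_t j) { return _data[i * _stride + j]; }
};
```

Writing through the view modifies the founding matrix's data, and no deallocation decision has to be taken at view destruction time.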
LinBox makes use of the concept of rebinds for the mapping of data structures between different coefficient domains. For instance, in the context of the Chinese remainder algorithm, rebinds allow one to map a matrix over the integers (say, a DenseMatrix over an integer domain) to its modular counterpart (a DenseMatrix over a prime field).
In LinBox, binder adaptors are enclosed within many data structures and make use of a generic converter, named Hom and found in linbox/field/hom.h. Hom can generically use the LinBox domains' canonical conversion methods init and convert from/to the LinBox Integer type: Domain1 → Integer → Domain2. Moreover, when natural, efficient conversions exist between domains (e.g. different representations of the same field, or one ring embedded in another), the generic Hom can be directly bypassed by a specialization of Hom.

Rebind of dense matrices.
We illustrate the founding scope allocation model with the use of rebind functions, adapted from the allocators in the STL, on dense matrices.
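The shape of such a rebind can be sketched as follows. This is a hedged, self-contained approximation: the domain and matrix classes are toy stand-ins for the LinBox ones, but the rebind sub-class follows the STL allocator pattern described in the text.

```cpp
#include <cstddef>
#include <vector>

// Toy coefficient domains (stand-ins for LinBox domains).
struct IntDom {                        // plain integers
    typedef long Element;
    template <class Src>
    Element& init(Element& x, const Src& y) const { x = long(y); return x; }
};
struct Modular {                       // integers modulo p
    typedef long Element;
    long p;
    explicit Modular(long q) : p(q) {}
    template <class Src>
    Element& init(Element& x, const Src& y) const { x = long(y) % p; return x; }
};

// A toy dense matrix with an STL-allocator-style rebind.
template <class Domain>
struct DenseMat {
    typedef typename Domain::Element Element;
    Domain dom;
    std::vector<Element> data;         // row-major entries
    std::size_t rows, cols;
    DenseMat(const Domain& d, std::size_t r, std::size_t c)
        : dom(d), data(r * c), rows(r), cols(c) {}

    template <class Domain2>
    struct rebind {
        typedef DenseMat<Domain2> other;
        // The rebound matrix is passed by reference: it was allocated
        // in the caller's (founding) scope; operator() only converts.
        void operator()(other& Ap, const DenseMat<Domain>& A) const {
            for (std::size_t k = 0; k < A.data.size(); ++k)
                Ap.dom.init(Ap.data[k], A.data[k]);
        }
    };
};
```

A caller first constructs the target matrix (thereby allocating it), then invokes the rebind's `operator()` to fill it with the converted entries.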
According to the founding scope allocation model, the function operator() in charge of the initialization of the matrix cannot allocate any memory. This has to be done at the level where the rebind is called. This also requires a modification of the rebind operator interface of the STL: the new object is passed by reference.
In the case of BlackBoxes (functions providing only a matrix-vector product and not necessarily storing any data), the rebind mechanism becomes more specific. We detail in this section the solution provided in blackboxes which only store references to other blackboxes, such as Compose, Transpose, Submatrix, etc.

Indeed, to rebind a blackbox containing only references, one should allocate a new memory zone and rebind the referred blackbox there. The problem is that a caller, given a BlackBox::rebind, does not know how to allocate the referred blackboxes. In LinBox, we propose another solution: the other type not only has different elements, but also can be of a different type. For instance, the rebind other type of a blackbox containing a reference will be the same kind of blackbox, but physically storing the data (and thus owning it).

For the different blackboxes defined in LinBox which use references, we thus define similar classes called e.g. TransposeOwner, ComposeOwner, SubmatrixOwner, etc. These classes store and own their data. Then it suffices for the rebind sub-class of their reference equivalents to define its other type as the associated *Owner class. The example of Compose is given in figure 1. Note that this also fits well in the LinBox founding scope allocation, since the *Owner class will be declared (and thus allocated) by the caller of the rebind.
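A self-contained sketch of the idea on a toy Transpose (illustrative classes, not the actual LinBox ones): the reference-holding Transpose rebinds to an owning TransposeOwner, which the caller declares, and thus allocates, in its own scope.

```cpp
#include <cstddef>

// Toy leaf blackbox parameterized by its element type.
template <class T>
struct Scalar {
    T value;
    explicit Scalar(T v = T()) : value(v) {}
    template <class T2>
    struct rebind {
        typedef Scalar<T2> other;
        void operator()(other& out, const Scalar<T>& in) const {
            out.value = static_cast<T2>(in.value);
        }
    };
};

// Owning variant: physically stores (and thus owns) the inner blackbox.
template <class BB>
struct TransposeOwner {
    BB inner;
    TransposeOwner() : inner() {}
};

// Reference variant: only refers to an inner blackbox it does not own.
template <class BB>
struct Transpose {
    const BB& inner;
    explicit Transpose(const BB& a) : inner(a) {}

    template <class T2>
    struct rebind {
        // The rebound type is the *Owner class, so the caller's
        // founding scope holds (and later frees) the rebound data.
        typedef TransposeOwner<typename BB::template rebind<T2>::other> other;
        void operator()(other& out, const Transpose<BB>& in) const {
            typename BB::template rebind<T2>()(out.inner, in.inner);
        }
    };
};
```

The caller writes `Transpose<Scalar<int> >::rebind<long>::other Tp;` and then invokes the rebind's operator() on it: the allocation of the owner happens in the caller's scope, never inside the rebind.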
Remark that for a submatrix of a class storing its elements (contrary to a submatrix of a blackbox containing e.g. only references), a more efficient rebind would only rebind the elements within the boundaries of the submatrix. There we use a trait to decide whether the referred blackbox is a storing component, and in the latter case specialize e.g. the rebind of a Submatrix of a DenseMatrix accordingly.
Efficient parallel applications must take into consideration hardware characteristics (number of cores, memory hierarchy, etc.). It is time consuming, or impossible, for a single developer to program a high performance computer algebra application, with state of the art algorithms, while exploiting all the available parallelism. In order to separate the domains of expertise, we have designed a software abstraction layer between computer algebra algorithms and parallel implementations, which may employ automatic dynamic scheduling.
Computer algebra algorithms have three main characteristics: 1) they are complex and require a deep knowledge of the problem in order to obtain the most efficient sequential algorithm; 2) they may be highly irregular, which enforces a runtime use of load balancing algorithms; 3) they are generic, in the sense that they are usually designed to work over several algebraic domains.

To preserve the genericity of the LinBox algorithms, we have decided to base our software abstraction, called Parallel Building Blocks (PBB), on the principles of the STL (Standard Template Library) algorithms. Indeed, C++ data structures in LinBox let us have random access iterators over containers, which are naturally parallel. We have already defined several STL-like algorithms and the list will be extended in the near future: for_each, transform, accumulate. The PBB versions of these algorithms are similar to the STL versions, except that the involved operators (or function object classes), given as parameters, are required to have their return value reference passed as the first parameter of the function. This is in accordance with the memory model of LinBox; the STL return-by-value semantic is not appropriate.

The fundamental idea of PBB is that at the computer algebra level, the parallelization of all the loops, and more generally of all the STL-like algorithms, will already enable good performance and easy switching among multiple implementations. Regarding performance, this parallelization of the inner loops of the underlying linear algebra is sufficient in many cases. Regarding implementations, this abstraction provides for programming independent of the parallel model, with selection of the parallel environment depending on the target architecture. The parallel blocks can be implemented using many different parallel environments, such as OpenMP (openmp.org), TBB (Thread Building Blocks, threadingbuildingblocks.org) or Kaapi [6], using both static and dynamic work-stealing schedulers [9]. The current implementations are built on OpenMP and Kaapi.

To bound the complexity of many linear algebra problems, one of the key ideas is to use an accumulation with early termination. For instance, this is used in Chinese remaindering algorithms: the computation is performed modulo a sequence of (co)prime numbers and the result is built from a sequence of residues, until a condition is satisfied [3]. The termination of the algorithm depends on the accumulated result. In order to parallelize such algorithms, we proposed an extension of the STL algorithms called accumulate_until.

The algorithm takes an array v of length N, a unary operator f to be applied to each array entry, and a specific binary update operator/predicate for the accumulation. This accumulator, with a signature like bool accum(a, b), behaves like an in-place addition (a += b) and returns true to indicate sufficiently many values have been accumulated. Let S_k = f(v[0]) + ... + f(v[k-1]) for k in {0, ..., N}. The algorithm computes and returns n ≤ N and S_n such that one accumulation during the computation of S_n returned true, or n = N. In intended use, we know any additional accumulation would also return true.

This algorithm will be used for the early termination Chinese remaindering algorithms of LinBox. Though not yet using PBB and accumulate_until, a sequential version and parallel versions with OpenMP and Kaapi can be found in the LinBox distributions as linbox/algorithms/cra-domain-*.h.

Many computer algebra programs allocate dynamic memory for the intermediate computations. Several experiments with
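A sequential sketch of such an accumulate_until, written from the specification above (this is our illustration, not the PBB implementation):

```cpp
#include <cstddef>

// accum(a, b) behaves like a += b and returns true once sufficiently
// many values have been accumulated.  The function returns n <= N,
// the number of entries consumed, and leaves S_n in `result`.
template <class T, class UnaryOp, class Accum>
std::size_t accumulate_until(T& result, const T* v, std::size_t N,
                             UnaryOp f, Accum accum) {
    for (std::size_t n = 0; n < N; ++n)
        if (accum(result, f(v[n])))   // early termination reached
            return n + 1;
    return N;                         // all N values were needed
}
```

Note that, following the LinBox convention, the accumulated result is the first parameter, passed as a non-const reference; the parallel versions would additionally have to combine partial accumulations computed by different threads.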
LinBox algorithms on multicore architectures have shown that these allocations are quite often the bottleneck. An analysis of the memory pattern, and experiments with three well known memory allocators (ptmalloc, Hoard, and TCMalloc from Google Perf. Tools [7]), have been conducted. The goal was to decide whether the parallel building blocks model was suitable for high-performance exact linear algebra. We used dynamic libraries to exchange allocators for the experiments, but one can use them together in the LinBox library if needed [8].

LinBox is developed with several kinds of genericity: 1) genericity with respect to the domain of the coefficients, 2) genericity with respect to the data structure of the matrices, 3) genericity with respect to the intermediate algorithms. While this is efficient in terms of capabilities and code reusability, there is a combinatorial explosion of combinations. Consider that each of 50 arithmetic domains may be combined with each of 50 matrix representations, in each of 10 intermediate algorithm forms, for a single problem as simple as matrix rank. This lengthens the compilation time and generates large executable files.

For the management of code bloat, LinBox has used an "archetype mechanism" which enables, at the user's option, a switch to compilation against abstract classes [2], e.g. when interfacing with general purpose systems such as Sage or Maple. Our idea is to automate the technique of [5], which combines compile-time instantiation and link-time instantiation, while using template instantiation instead of void pointers. The mechanism we propose is independent of the desired generic method (the candidate for separate compilation) and is explained in algorithm 1. This algorithm is illustrated in figure 2, where the function is the rank and the template parameter is a dense matrix over GF(2).
Algorithm 1: Separate compilation of a generic function.
Input: A generic function func.
Input: Template parameters TParam for separate specialization/compilation of func.
Output: A generic function calling func with separately compiled instantiations.
1. Create a header and a body file, "func_instantiate.hpp" and "func_instantiate.cpp";
2. Add a template function func_separate, with the same specification as func, to the header; its generic default implementation is a single line calling the original function func. {This enables a unified interface, even for non-specialized classes.}
3. for each separately compiled template parameter TParam do
4.   Add a non-template specification funcTParam to the header file;
5.   Add the associated body, with a single line returning the instantiation of func on a parameter of type TParam, to the body file;
6.   Add an inline specialization body of func_separate on a parameter of type TParam, with a single line returning funcTParam, to the header file;
7. end for
8. Compile the body file "func_instantiate.cpp".
[Figure 2: Separate compilation of the rank. The user interface rank(const Mat&) calls the template rank_separate(const Mat&): its default implementation calls the generic code, while a template<> specialization in the header calls the separately compiled rankDenseMatGF2 body.]
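As a hedged, single-file sketch of what algorithm 1 generates for rank (names are illustrative and the rank computation is a placeholder; in an actual build, the interface part would live in the header and the last function in the separately compiled body file):

```cpp
#include <cstddef>

struct DenseMatGF2 { std::size_t r; };          // stand-in parameter type

// Generic (in real life, heavily templated) implementation.
template <class Mat>
std::size_t rank(const Mat& A) { return A.r; }  // placeholder computation

// --- what would go in func_instantiate.hpp ------------------------
// Unified interface: by default, a one-line call to the generic rank.
template <class Mat>
std::size_t rank_separate(const Mat& A) { return rank(A); }

std::size_t rankDenseMatGF2(const DenseMatGF2& A);   // non-template decl

template <>
inline std::size_t rank_separate<DenseMatGF2>(const DenseMatGF2& A) {
    return rankDenseMatGF2(A);       // one line: call the compiled body
}

// --- what would go in func_instantiate.cpp (compiled once) --------
std::size_t rankDenseMatGF2(const DenseMatGF2& A) {
    return rank(A);                  // the only real instantiation point
}
```

Users always call rank_separate: for most matrix types it inlines into the generic code, while for DenseMatGF2 it dispatches to the instantiation that was compiled once in the body file.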
Algorithm 1 has been simplified for the sake of clarity. To enable a more user-friendly interface, one can rename the original function and all its original specializations func_original, and then rename the new interface simply func. With the classical inline compiler optimizations, the overhead of calling func_separate is limited to a single supplementary function call. Indeed, all the one-line additional methods will be automatically inlined, except, of course, the one calling the separately compiled code. If this overhead is too expensive, it suffices to enclose all the non-generic specializations of "func_instantiate.hpp" within a macro test. At compile time, the decision whether or not to compile separately can then be taken according to the definition of this macro.

We show in tables 1 and 2 the gains in compilation time obtained on two examples from LinBox: the examples/{rank,solve}.C algorithms. Indeed, without any specification, the code has to invoke several specializations depending on run-time discovered properties of the input. For instance, solve.C requires 6 specializations for sparse matrices over the integers or over a prime field, with a sparse elimination, or an iterative method, or a dense method if the matrix is small...

[Table 1: {rank,solve}.C compilation time on an AMD Athlon 3600+, 1.9GHz, with gcc 4.5 -O2. instantiate.o corresponds to the separately compiled instantiations (e.g. rankDenseMatGF2 in figure 2); {rank,solve}.o corresponds to the user interface and generic implementation compilation; link corresponds to the linking of both .o files and the library; Full comp. corresponds to the compilation without the separate mechanism.]

[Table 2: {rank,solve}.C compilation time on an Intel Xeon E5345, 2.33GHz, with icc-11.1 -O2.]

Acknowledgment
We thank the LinBox group, and especially Brice Boyer, Pascal Giorgi, Erich Kaltofen, Dan Roche, and Brian Youse, for many useful discussions, in particular during the recent LinBox developer meetings in Delaware and Dublin.

References

[1] L. Chen, W. Eberly, E. Kaltofen, B. D. Saunders, W. J. Turner, and G. Villard. Efficient matrix preconditioners for black box linear algebra. Linear Algebra and its Applications, 343-344:119–146, 2002.
[2] J.-G. Dumas, T. Gautier, M. Giesbrecht, P. Giorgi, B. Hovinen, E. Kaltofen, B. D. Saunders, W. J. Turner, and G. Villard. LinBox: A generic library for exact linear algebra. In A. M. Cohen, X.-S. Gao, and N. Takayama, editors, Proceedings of the 2002 International Congress of Mathematical Software, Beijing, China, pages 40–50. World Scientific Pub., Aug. 2002.
[3] J.-G. Dumas, T. Gautier, and J.-L. Roch. Generic design of Chinese remaindering schemes. In M. Moreno-Maza and J.-L. Roch, editors, PASCO 2010. Université de Grenoble, France, July 2010.
[4] J.-G. Dumas, P. Giorgi, and C. Pernet. Dense linear algebra over word-size prime fields: the FFLAS and FFPACK packages. ACM Trans. Math. Softw., 35(3):1–42, 2008.
[5] U. Erlingsson, E. Kaltofen, and D. Musser. Generic Gram-Schmidt orthogonalization by exact division. In ISSAC'1996, pages 275–282, July 1996.
[6] T. Gautier, X. Besseron, and L. Pigeon. KAAPI: a thread scheduling runtime system for data flow computations on cluster of multi-processors. In PASCO'07, pages 15–23, 2007.
[7] S. Ghemawat and P. Menage. TCMalloc: Thread-caching malloc. http://goog-perftools.sourceforge.net/doc/tcmalloc.html.
[8] E. Kaltofen, D. Morozov, and G. Yuhasz. Generic matrix multiplication and memory management in LinBox. In ISSAC'2005, pages 216–223, July 2005.
[9] D. Traore, J.-L. Roch, N. Maillard, T. Gautier, and J. Bernard. Deque-free work-optimal parallel STL algorithms. In EUROPAR 2008, pages 887–897, Las Palmas, Spain, Aug. 2008.