Fast Matlab compatible sparse assembly on multicore computers
STEFAN ENGBLOM AND DIMITAR LUKARSKI
Abstract.
We develop and implement in this paper a fast sparse assembly algorithm, the fundamental operation which creates a compressed matrix from raw index data. Since it is often a quite demanding and sometimes critical operation, it is of interest to design a highly efficient implementation. We show how to do this, and moreover, we show how our implementation can be parallelized to utilize the power of modern multicore computers. Our freely available code, fully Matlab compatible, achieves about a factor of 5× in speedup on a typical 6-core machine and 10× on a dual-socket 16-core machine compared to the built-in serial implementation.

1. Introduction
The popular Matlab programming environment was originally built around the insight that most computing applications in some way or the other rely on storage and manipulations of one fundamental object: the matrix. In the early 90s an important update was made with the support of a sparse storage format as presented in [7]. In that paper the way sparse matrices are managed in an otherwise dense storage matrix environment is described, including the initial creation of a sparse matrix, some basic manipulations and operations, and fundamental matrix factorizations in sparse format. As a guiding principle the authors formulate the "time is proportional to flops"-rule [7, p. 334]:

The time required for a sparse matrix operation should be proportional to the number of arithmetic operations on nonzero quantities.

The situation is somewhat different today since flops can often be considered to be "free", while memory transfers are, in most cases, the real bottleneck of the program. With the multicore era here to stay, programs need to be threaded in order to utilize all hardware resources efficiently. This is a non-trivial task and requires some careful design [1]. In this paper we consider a sole sparse operation, namely the initial assembly operation as implemented by the Matlab function sparse:

>> S = sparse(i,j,s,m,n,nzmax);

Date: October 23, 2015.
2010 Mathematics Subject Classification. Primary: 68W10; Secondary: 65Y10.
Key words and phrases. Sparse matrix; Column compressed format; Assemble; Matlab.
Corresponding author: S. Engblom ([email protected]), telephone +46-18-471 27 54, fax +46-18-51 19 25.
After the call, S contains a sparse representation of the matrix defined by S(i_k, j_k) = s_k for k a range of indices pointing into the vectors {i, j, s}, and where repeated indices imply that the corresponding elements are to be summed together. Many applications naturally lead to substantial repetitions of indices, and the implied reduction must be detected and handled efficiently. For example, in the important case of assembly in linear finite element methods for partial differential equations, the resulting sparse matrix has a sparsity pattern which is identical to that of the matrix representation of the underlying triangular or tetrahedral mesh when viewed as a graph. The number of collisions during the assembly then corresponds exactly to the connectivity of the nodes in this graph.

Since the assembly must be performed before any other matrix operations are executed, the performance may become a bottleneck. The typical example is for dynamic nonlinear partial differential equations (PDEs), where re-assembly occurs many times as a numerical time integration proceeds, including during the iterations of the nonlinear solver. Thus, with the assembly process a quite time-consuming operation which is repeatedly performed, it cannot always be amortized over subsequent operations. Notably, in the truly large case presented in [9], the assembly phase, there called BuildMatrix, corresponds to the sparse function.

As mentioned, finite element methods naturally lead to the assembly of large sparse matrices. A stack-based representation specially designed for this application is suggested in [8], and is also implemented there using a hybrid parallel programming model on a Cray XE6.
Another approach is reported in [4], where the assembly of finite element sparse matrices in both Matlab and Octave is considered using these high-level languages directly.

Using the "time is proportional to flops"-rule as a guiding principle, but paying close attention to memory accesses, we provide a fast re-implementation of the sparse function. The resulting function fsparse is Matlab compatible, memory efficient, and parallelizes well on modern multicore computers. Moreover, it is well tested and has been freely available in the public domain for quite some time.

In §2 we develop the fast general algorithm for sparse assembly, and in §3 its parallel version is presented. Performance experiments are reported in §4, where the memory bound character of the operation is also highlighted. In general, with most sparse algorithms, there are not enough non-trivial arithmetic operations to hide the format overhead and data transfer costs [3]. A summarizing discussion around these issues is found in §5.

1.1. Availability of software. The code discussed in the paper is publicly available, and the performance experiments reported here can be repeated through the Matlab-scripts we distribute. Refer to §5.1 for details.

2. A fast general algorithm for sparse assembly
In this section we lay out a fast algorithm for assembling sparse matrices from the standard index triplet data. A description of the problem is first offered in §2.1, and the format of the input and output is settled in §2.2. The algorithm itself is developed in §2.3, and its complexity is analyzed in §2.4.

2.1. Description of the problem.
The column compressed sparse (CCS) format is the sparse matrix storage format supported by Matlab, but it has also enjoyed a widespread use in several other packages. Given a 4-by-4 matrix S defined by

S = [ 10  0  0  -2
       3  9  0   0
       0  7  8   7
       3  0  8   5 ],    (2.1)

the three required CCS arrays read as follows:

prS = [10 3 3 9 7 8 8 -2 7 5],
irS = [0 1 3 1 2 2 3 0 2 3],
jcS = [0 3 5 7 10],

where prS contains the nonzero values column-wise, irS the zero-offset row indices, and where jcS points to the columns in both prS and irS. For an M-by-N sparse matrix we always have that jcS[0] = 0 and that jcS[N] = nnz, the total number of nonzero elements.

In Matlab we may form a representation of S by

>> s = [10 3 3 9 7 8 8 -2 7 5];
>> i = [1 2 4 2 3 3 4 1 3 4];
>> j = [1 1 1 2 2 3 3 4 4 4];
>> S = sparse(i,j,s); % size(S) = [4,4] is implicit

The assembly problem, therefore, is to transform the triplet (i, j, s) into the CCS triplet (irS, jcS, prS). Generally, the difficulties lie in that (i) the data is unordered, and (ii) the values in s of equivalent pairs (i, j) are to be summed together. For example, the above matrix may also be constructed from the following triplet data.

Listing 1. Sample input (running example)
>> s = [4 4 5 7 3 5 5 4 3 4 9 7 -2];
>> i = [3 4 1 3 2 1 4 4 4 3 2 3 1];
>> j = [3 3 1 4 1 1 4 3 1 3 2 2 4];

Below we will use the sample input of Listing 1 as a running example to demonstrate the effects of the code snippets shown.

The complexity of the assembly operation can be bounded from above as follows. Consider first sorting the triplet with respect to column indices, then sorting each column with respect to rows. In a final sweep over all columns, equivalent indices

Note that the abbreviation 'CSC' is also in widespread use.
are summed together. If the initial triplet has length L, then the complexity is that of sorting and hence bounded by O(L log L). Additionally, using an in-place sorting algorithm, it is easy to see that the whole operation can in fact be done in place. In practice, sorting algorithms using auxiliary memory are generally faster and also parallelize better.

While these are generally applicable remarks, they do not take into account the fact that the indices are bounded integers and hence can be sorted more efficiently. In fact, the algorithm presented in §2.3 below relies on this observation. A further feature is supported by fsparse but not detailed here; this is an extension of the Matlab syntax which allows for row- and/or column-indices to be counted several times as dictated by the dimensions of the inputs.

One should keep in mind that repeated assembly can often be done efficiently by saving various types of information between successive calls. However, the possibility of such "quasi assembly" is clearly very strongly problem dependent. Thus in what follows, we remain in the general setting.

2.2. Format of input and output.
In the following we stepwise explain the algorithm by giving snippets of actual C-code in an imagined environment which contains an increasing number of variables. The final serial version of the code with all pieces taken together is found in Listing 15 in Appendix A.

In the example above the dimensions of the final matrix are implicitly defined as the largest row- and column index. Hence, as the first step, these arrays must be parsed for the maximum values, and we also conveniently translate them into integers (recall that an array in Matlab by default is a double array). Code for this pre-processing is found in Listing 13 in Appendix A, and the result is the equivalence of Listing 2. Note that the input index arrays remain in unit-offset.

Listing 2. Input format
const double *sr;   // pointer to values
const int *ii,*jj;  // pointers to row- and column indices (unit-offset)
int len;            // length of arrays ii, jj, sr
int M,N;            // sparse matrix dimensions
Given the input in Listing 2, the algorithm detailed in §2.3 produces the intermediate output of Listing 3. One of the two arrays there, jcS, belongs to the final output and has been discussed previously. The other one, irank, the inverse rank, contains information as to how the remaining CCS-arrays prS and irS are to be formed from the raw triplet data. The relation is that, for i = 0 ... len - 1 (zero-offset is used here),

irS[j] = ii[i] - 1, where irank[i] = j,    (2.2)
prS[j] = Σ_{i; irank[i] = j} sr[i].        (2.3)

In plain language, irank[i] points to the final position in (irS, prS) for the corresponding pair (ii[i], sr[i]). Code for finalizing the representation according to these relations is found in Listing 14 in Appendix A. Note that, as a consequence of (2.2), nnz = 1 + max_i irank[i].

Listing 3. Intermediate output format
int *irank; // inverse rank array of length len
int *jcS;   // final column pointer for sparse matrix S
jcS = calloc(N+1,sizeof(jcS[0]));
irank = malloc(len*sizeof(irank[0]));

2.3. Index-based sparse assembly.
The task at hand is now to arrive at the intermediate output of Listing 3, given the parsed input of Listing 2. We will do this incrementally in four parts, detailed in §2.3.1-§2.3.4 below. The first part counts the number of elements per row; the second part constructs a rank-array which provides the ability for a row-wise traversal. The most complex part of the algorithm is the third part, in which the unique row indices of each column are found. Finally, in the fourth part the required intermediate outputs irank and jcS can be determined.

2.3.1. Part 1.
This part builds a kind of row pointer with the same structure as jcS, but for rows instead of for columns. "Kind of" because there is no data available to actually point into, the input still being unordered. In Listing 4, note also that the resulting pointer jrS ignores collisions and hence that the estimated number of nonzeros per row is an upper bound.

Listing 4. Part 1: count rows
int *jrS; // accumulated row counter
jrS = calloc(M+1,sizeof(jrS[0]));
// count and accumulate indices to rows
for (int i = 0; i < len; i++) jrS[ii[i]]++;
for (int r = 2; r <= M; r++) jrS[r] += jrS[r-1];
Example.
Given the arrays defined in Listing 1 as inputs, Listing 4 produces the pointer to rows

jrS = [0 3 5 9 13].

That collisions are ignored at this stage can be seen from the fact that S in (2.1) has 10 nonzero elements, whereas jrS has reserved space for 13 elements. □
2.3.2. Part 2.
With an upper bound on the number of nonzeros per row available, it is now straightforward to create a rank-array rank such that i = rank[j] points to the j:th triplet (ii[i], jj[i], sr[i]) when ordered with respect to row indices (that is, with ii[rank[j]] non-decreasing in j). Hence the key feature with rank is that it allows for the data to be traversed in an ordered row-by-row fashion.

Listing 5. Part 2: build rank-array
int *rank; // rank-array for rows
rank = malloc(len*sizeof(rank[0]));
// build rank with the active use of jrS
jrS--; /* (unit-offset in ii) */
for (int i = 0; i < len; i++) rank[jrS[ii[i]]++] = i;

Example.
Continuing with the sample input from Listing 1, Listing 5 produces

rank = [2 5 12 4 10 0 3 9 11 1 6 7 8],
jrS = [* 3 5 9 13],

where the notation indicates that jrS is now in unit-offset (its leading entry is no longer valid). The defining relation is

ii[rank[·]] = [1 1 1 2 2 3 3 3 3 4 4 4 4],

such that rank indeed provides for a row-wise traversal of the data. □

2.3.3. Part 3.
In this part of the algorithm, the program loops over the input and makes each column unique with respect to row indices, building both the index array irank and the column pointer jcS at the same time. This is made feasible by the row-wise traversal of the input data and a small cache memory for column indices.

Listing 6. Part 3: uniqueness
int *hcol; // cache memory for columns
hcol = calloc(N,sizeof(hcol[0]));
hcol--; /* (unit-offset in jj) */
// loop over all row indices
for (int row = 1,i = 0; row <= M; row++)
  // loop over single row
  for ( ; i < jrS[row]; i++) {
    const int ixijs = rank[i]; // index into input data triplet (ii,jj,sr)
    const int col = jj[ixijs]; // column index
    // new element?
    if (hcol[col] < row) {
      hcol[col] = row; // remembered by the row index
      jcS[col]++;      // count it
    }
    // irank keeps track of where it should go
    irank[ixijs] = jcS[col]-1;
  }
// done: deallocate auxiliary variables
free(++hcol);
free(rank);
free(++jrS);
Example.
Our sample input in Listing 1 yields

irank = [0 1 0 1 1 0 2 1 2 0 0 1 0],
jcS = [0 3 2 2 3],

which is not very informative due to the missing final accumulation of indices. □

2.3.4. Part 4.
In the final part of the algorithm the column pointer jcS is finalized by an accumulating sum. Since there is a dependency between irank and jcS, the former must be updated analogously.

Listing 7. Part 4: finalize intermediate format
// accumulate pointer to columns
for (int c = 2; c <= N; c++) jcS[c] += jcS[c-1];
// irank must account for the previous accumulation
jcS--; /* (again, unit-offset in jj) */
for (int i = 0; i < len; i++) irank[i] += jcS[jj[i]];
jcS++;
Example.
The final part of the algorithm transforms our running example into

irank = [5 6 0 8 1 0 9 6 2 5 3 4 7],
jcS = [0 3 5 7 10].

While rank is a permutation, irank is not one-to-one and has no inverse. However, if we define JJ by executing the assignment

JJ[irank[i]] = jj[i],

from i = 0 and upwards, then

JJ = [1 1 1 2 2 3 3 4 4 4].

That is, irank has sorted the data according to columns and detected all collisions in the process. Note also that, as required, JJ is indexed by jcS. □

2.4. Complexity.
The assembly process clearly has a memory bound character, and the single most important complexity metric is therefore the number of memory accesses made. Since sparse matrices are used to avoid excessive memory use, it is also of interest to look at the amount of working memory allocated. We estimate both of these characteristics in turn.

Thanks to the deterministic character of all loops, it is straightforward to determine the number of memory accesses; this amounts to little more than just counting the pointer evaluations in Listings 4-7. The result is found in Table 2.1, and it shows that the number of indirect accesses is about 8L (with L := len for brevity), that is, the equivalence of an array of size L is looped over in random order a total of 8 times.

         accesses       indirect   indirect, size L
Part 1   2L + M         L          -
Part 2   3L             2L         L
Part 3   5L + M         4L         2L
Part 4   3L + N         L          -
Total    13L + 2M + N   8L         3L

Table 2.1. Memory access complexity in Listings 4-7 in terms of L = len. Included is the number of data accesses, the number of indirect (hence possibly non-contiguous) accesses, and the number of indirect accesses to data arrays of size L (assuming L >> M, N).

The most common case is that M and N are much smaller than L, such that indirect accesses to an array of size L are more expensive than to arrays of sizes M or N. For this situation we see that a size-L array is looped over randomly a total of 3 times only.

When it comes to allocated memory it is also easy to follow the explicit allocations made by the program. The result is that the maximal allocation will take place in one of two places. The first candidate is in Listing 6, just after the array hcol has been allocated. Here the equivalence of an integer array of size

S1 = 2N + 1 + M + 1 + 2L    (2.4)

has been allocated in total. A second candidate is when the final output has been allocated. Here only the intermediate result array irank remains allocated and, assuming sizeof(double) = 2 × sizeof(int), the effective data size is

S2 = N + 1 + 3 nnz + L,    (2.5)

where usually S2 > S1. Hence the maximal memory ever allocated by the algorithm is generally the size of the output plus a size-L integer array.

As we will see in §3.3, the memory footprint of the parallel version is somewhat larger.

3. Parallel sparse assembly in shared memory
In this section we present and analyze the threaded version of the assembly algorithm outlined in §2. The obvious approach to parallelizing the algorithm is to evenly distribute the input data among the cores and perform a local assembly, then finalize the result by summing these local matrices together. We advocate against this for two reasons: the required amount of working memory is substantially larger with this approach, and the final gather operation is a quite complicated task in itself. Our OpenMP implementation was instead developed in an incremental fashion starting from the serial version. After several design leaps along the way, the final version achieves a competitive performance as we shall see, while still requiring only a small amount of working memory.
3.1. Revisited: format of input and output. The effective input data in Listing 2 is the same in the threaded implementation, although the associated code is somewhat more involved; see Listing 16 in Appendix B.

The intermediate output format, however, differs considerably between the two versions. To begin with, for the threaded version a permuted version irankP of irank is preferred. Formally, the arrays are related through

irankP[i] = irank[rank[i]],    (3.1)

for i = 0 ... len - 1. This implies that (compare (2.2)-(2.3))

irS[j] = ii[k] - 1, where (irankP[i], rank[i]) = (j, k),    (3.2)
prS[j] = Σ_{i; (irankP[i], rank[i]) = (j,k)} sr[k].         (3.3)

As we shall see, the main benefits with this seemingly more complicated setup are that (i) it opens up for more parallelism in the post-processing part, and (ii) it simplifies the memory access pattern in Parts 3 and 4.

Code for finalizing the representation according to (3.2)-(3.3) is found in Listing 17 in Appendix B. The logic for increasing the degree of parallelism is fairly clear and builds on the fact that rank allows for a row-wise traversal; hence data can be distributed according to row indices.

Finally, two minor details present in Listing 8 deserve to be mentioned. Firstly, since rank is used in the post-processing part, we now need to store the row pointers jrS throughout the whole algorithm. Secondly, the algorithm is considerably streamlined when those pointers are kept in thread-private copies. For the same reason this design was also used for the column pointer jcS.

Listing 8. Intermediate output format, parallel version
// rank-array and inverse permuted rank-array
int *rank,*irankP;
rank = malloc(len*sizeof(rank[0]));
irankP = malloc(len*sizeof(irankP[0]));

const int nThreads = omp_get_max_threads();

int **jrS; // row counters, one per thread
jrS = malloc((nThreads+1)*sizeof(jrS[0]));
for (int k = 0; k <= nThreads; k++) {
  jrS[k] = calloc(M+1,sizeof(jrS[k][0]));
  jrS[k]--; /* (unit-offset in ii) */
}
/* (final result will appear in jrS[nThreads-1]) */

int **jcS; // column counters, one per thread
jcS = malloc((nThreads+1)*sizeof(jcS[0]));
for (int k = 0; k <= nThreads; k++)
  jcS[k] = calloc(N+1,sizeof(jcS[k][0]));
/* (final result will appear in jcS[0]) */

3.2. An index-based multithreaded algorithm.
The threaded version follows closely the pattern laid out in §2, the main difference being found in Parts 3 and 4, which are now merged into one parallel region.

3.2.1. Part 1. Counting rows in parallel is straightforward since each thread has its own local counter. Accumulation has to be done in two steps, the last of which is strictly serial, but which runs over a length-M array only (where usually M << L). A feature of the code in Listing 9 is the final block, in which the thread-private pointers jrS are finalized. The format used here allows each thread to continue to process data in a fully independent manner.

Listing 9. Part 1: count rows, parallel version

3.2.2. Part 2.
With thread-private pointers to rows available, constructing the rank-array is trivially parallel, yet follows the logic of its serial counterpart.

Listing 10. Part 2: build rank-array, parallel version
3.2.3. Parts 3 and 4. Since each thread may loop over rows independently, the double for-loop construction in the serial code in Listing 6 can still be used. Thanks to the modified intermediate output format, with a permuted version irankP replacing irank, the access pattern is actually slightly simplified. The final accumulation of jcS follows closely the logic for jrS in Listing 10.

Listing 11. Part 3+4: uniqueness and final format, parallel version
// determine a private jcS for each thread

         accesses              indirect   indirect, size L
Part 1   2L + 3Mp              L          -
Part 2   3L                    2L         L
Part 3   5L + M                3L         L
Part 4   4L + 3Np              2L         2L
Total    14L + 3(M + N)p + M   8L         4L

Table 3.1. The equivalent total number of memory accesses performed concurrently in p threads in Listings 9-11, following the notation in Table 2.1. By Part 4 is here meant the part in Listing 11 which follows after the omp barrier statement.

3.3. Parallel complexity.
The memory complexity of the parallel algorithm can be estimated as before and provides us with some insight. Under the reasonable assumption that the number of threads p is small compared to the other array sizes, for simplicity we ignore all accesses made to size-p arrays. Table 3.1 lists the total number of memory accesses performed concurrently.

Compared to Table 2.1, the total number of indirect accesses is the same in the serial and in the parallel version. However, the number of expensive indirect accesses in size-L arrays has increased from 3L to 4L, which can be attributed to the final block in Part 4, where there is now one more addressing in the rank-array than before. This is by design, as it saves an even more expensive final permutation of irankP as well as opens up for more parallelism in the final post-processing.

For large enough data sizes the maximal memory allocation occurs when the final data is being allocated and is equal to the equivalent of an integer array of size

S = N + 1 + (M + 1)(p + 1) + 3 nnz + 2L.    (3.4)

Compared to the serial memory footprint (2.5), one more size-L array is required, and we also see the extra allocation of the thread-private pointers to rows jrS.

4. Performance Experiments
In this section we present results of performance experiments with both the proposed serial and parallel algorithms. To get some baseline results for our index-based sorting algorithm, we start by profiling our serial version and briefly compare it with the built-in Matlab function sparse. The results for our threaded version are presented in §4.3.

4.1. Hardware and benchmark configuration.
We performed our experiments on two different hardware platforms. The first one is a standard workstation based on an Intel Xeon W3680 CPU with 6 cores, 24 GB of memory and 12 MB of cache, and is denoted 'C1'. The second system is a server with dual socket Intel Xeon E5-2680 CPUs, where each CPU has 8 cores and 20 MB of cache, with 64 GB of total memory, and is denoted 'C2'. On both systems we run a 64-bit Linux OS; Matlab R2014b (8.4.0.150421) on C1 and Matlab R2012b (8.0.0.783) on C2. To avoid OS noise and caching effects, all tests were performed 40 times and the average time was determined as the arithmetic mean.

There are essentially three parameters which can vary in the input data for the sparse assembly procedure: the dimensions of the output matrix, the number of nonzero elements per row of the output matrix, and the number of collisions per row. Formally, the latter two parameters may of course vary per row according to some empirical distribution. For convenience we considered constant or almost constant values only. While others have benchmarked sparse assembly algorithms using classes of matrices ranging from highly structured to highly unstructured cases [2], on balance we find it difficult to characterize this structure in a meaningful way. Some preliminary tests performed by us using matrices from the UF-collection [6] indicated that the actual structure of the final matrix does not strongly influence the algorithmic performance. We therefore ran all our tests with matrices of random structure, relying on uniform random numbers for the indices. In Listing 12 we show the generation of experimental data according to these considerations.

Listing 12. Benchmark data generator
function [ii,jj,ss,siz] = ransparse(siz,nnz_row,nrep)
% input: size, nonzeros per row, and collisions per final element
% output: row and column indices, sparse values, and size
nnz = nnz_row*siz;            % number of nonzeros
ii = repmat((1:siz)',[1 nnz_row]);
jj = ceil(rand(siz,nnz_row)*siz);
ii = repmat(ii(:),[1 nrep]);
jj = repmat(jj(:),[1 nrep]);  % (some jj's might be the same)
p = randperm(numel(ii));
ii = ii(p);
jj = jj(p);
ss = ones(size(ii));
For the experiments we focus on three different data sets. All sets consist of 2,500,000 elements of raw input. The first data set is to be accumulated into a sparse matrix of size 10,000 with 50 elements per row (yielding effectively 500,000 nonzero elements in total). The next two data sets are larger in terms of matrix size; see Table 4.1.

Table 4.1. Benchmark data sets: number of nonzeros per row and number of collisions per nonzero element.

These data could mimic different problems. For example, with finite element methods in 3D and higher-order elements, the matrices contain a relatively large number of nonzero elements per row, and the collision pattern and resulting number of nonzero entries per row will be fairly large (as in data set 1). The second data set could mimic high-dimensional problems discretized with a lower-order element, and the last data set represents problems in low spatial dimension but modeled again by higher-order elements. However, the reader could map various other scenarios from stochastic processes, data mining, and so on to similar data sets. As a concrete example, a Laplace problem in 3D with linear Lagrange elements and discretized with tetrahedron elements results in 12-48 collisions and about 7 nonzero elements per row.

4.2. Serial assembly.
In order to facilitate the understanding of the parallel behavior of the proposed index-based sorting algorithm, we start by profiling each part of the serial code. As expected, the different data sets put somewhat different loads on each section of the code; see Figure 4.1.

The behavior observed in the figure can be explained fairly intuitively as follows. If the problem contains more nonzero elements per row (relative to the matrix size), then this puts extra pressure on the post-processing procedure (Listing 17 in Appendix B), since it needs to perform more reduction operations. As a by-product, naturally, the other parts take less time in a relative sense. An important feature of the proposed parallel algorithm is that the reduction of duplicate elements can be done in a fully independent manner. This allows us to perform this computation completely in parallel without any locks or atomic operations. Counting the number of nonzero elements (Part 1) takes under 5% for all data sets and is proportional to the size of the matrix and to the number of collisions per row. The operations for achieving uniqueness (Parts 2 and 3) using the row-wise traversal via rank are more expensive for matrices with a large number of collisions per row. The performance of the serial fsparse can of course vary between CPUs; however, the relative time for each part is likely to remain fairly stable, with fluctuations mainly due to different cache configurations.

Although we do not have access to the source of Matlab's built-in sparse function, by monitoring the process we can conclude that this function is serial. Loren Shure explains in a blog post that sparse is based on quicksort [16]. She remarks that quicksort has a higher complexity compared to other algorithms such as bucket sort, but argues that it performs fairly well in practice. For all tests performed here, our proposed serial version outperforms Matlab's sparse convincingly; see Table 4.2.
Figure 4.1. Total time and load distribution for pre- and post-processing, and for Parts 1-4 (serial fsparse, hardware C2). Total times: Data 1 = 1.61 s, Data 2 = 2.95 s, Data 3 = 1.78 s.

4.3. Parallel assembly.
Due to the fact that essentially all operations in the sparse assembly process are memory bounded, it is unreasonable to expect a linear speedup even in the ideal case. The reason is that the memory bus of the CPU can be utilized efficiently already with a single core; hence the additional memory accesses associated with an increasing number of cores can generally not utilize the bandwidth to linear scaling. Following the STREAM benchmark test suite [13, 14], a very simple parallel copy function can demonstrate this phenomenon.
With N = 100,000,000 this bandwidth test shows that the OpenMP section can speed up the copy by up to about 4× on the workstation (using 6 threads/cores) and up to about 6× (using 16 threads) on the dual-socket server. However, this is only a pure streaming test which does not take the cache into account: with more cores the aggregated cache is, of course, larger.

Before starting the actual assembly, the function determines the maximum values of the index arrays and converts them to integers (Listing 16 in Appendix B). This operation contains only contiguous memory accesses and is purely parallel. Thus the speedup for this function does not depend on the data and is around 7× (on C2); see Figure 4.3. Note that Figures 4.1 and 4.2 present only fractions of the total time, while the actual measured speedup of course also takes into account the real execution time.

For both implementations, Part 1 takes below 5% of the total execution time. The parallelism in this part, and thus the total speedup, depends mainly on the size of the matrix, due to the fact that all loops are performed over rows, and the reductions are computed element-wise per thread.

Figure 4.2. Total time and load distribution (parallel fsparse, hardware C2). Total times: Data 1 = 0.34 s, Data 2 = 0.464 s, Data 3 = 0.435 s.

The situation is similar for the computation of the forward mapping, the rank-array (Part 2). The whole computation is basically a single loop over all input entries, and since the indirect mapping depends on the number of rows in the resulting matrix, the overhead of looping over a larger matrix reduces the speedup due to additional memory accesses.

In the parallel version we combine Parts 3 and 4 into one. Although the memory access pattern is complex, the number of contiguous and indirect memory accesses is proportional to the total size of the input data arrays. Thus, the speedup factors for all test cases are in fact similar, around 5×.

All speedup factors, for all parts of the code, are presented in Figure 4.3. The overall speedups comparing the serial and parallel versions are 4.74×, 6.36×, and 4.08× on C2.

The final accumulation of the results (Listing 17 in Appendix B) heavily depends on the number of nonzero elements which need to be computed. Thanks to the fact that with our approach all of the main computations can be performed in parallel, for all data sets this computation takes between 25-35% of the total time. Thus, we observe that the OpenMP implementation gives the highest speedup for the problem with the most nonzero elements per row (normalized by the matrix size).

Finally, we also briefly compare the full parallel fsparse against the built-in Matlab sparse. The results are presented in Table 4.2 for hardware C1 and C2. Our implementation outperforms the Matlab version by 4-5× on C1 and 9-10× on C2.

5. Conclusions
In this paper we devised an index-based implicit sorting algorithm for the assem-bly of sparse matrices in CCS format given raw index-triplet data. The algorithm
Figure 4.3. Parallel speedup for the different parts of fsparse (hardware C2).
Data Set   MATLAB Time   Serial Time   Serial Speedup   Parallel Time   Parallel Speedup
1 on C1    3.52          1.51          ~2×              0.65            ~5×
2 on C1    3.74          1.87          ~2×              0.83            ~4×
3 on C1    3.49          1.67          ~2×              0.76            ~4×
1 on C2    3.49          1.61          ~2×              0.33            ~10×
2 on C2    4.39          2.95          ~1×              0.46            ~9×
3 on C2    3.46          1.78          ~1×              0.43            ~9×

Table 4.2.
Overall speedup factors when comparing Matlab sparse and fsparse (serial and parallel versions).

was shown to be efficient in terms of memory accesses and to require little auxiliary memory. We also showed how the algorithm can be modified and parallelized on multicore CPUs with shared memory. The characteristic memory access pattern of our parallel version is reminiscent of the serial one and results in good overall performance. As shown by our experiments, compared to the standard serial Matlab implementation, we are able to assemble a matrix up to 10× faster on a dual-socket system and about 5× faster on a 6-core system.

The approach taken in the code is a good example of how to avoid locks by computing and storing slightly more temporary results, resulting in a more streamlined parallel implementation and higher efficiency.

5.1. Reproducibility.
Our implementation of fsparse as described in this paper is available for download via the first author's web-page, http://user.it.uu.se/~stefane/freeware. The code comes with a convenient Matlab mex-interface, and along with the code, automatic Matlab scripts that repeat the numerical experiments presented here are also distributed.

The matrix assembly functions in the PARALUTION library (ver. 0.7.0) are based on this implementation. PARALUTION is a library for iterative sparse methods targeting multicore CPUs and accelerators.

Acknowledgment
The authors would like to thank master's students Aidin Dadashzadeh and Simon Ternsjö, who programmed an early version of the parallel fsparse code [5]. Their work was performed within the project course in Computational Science at the Department of Information Technology, Uppsala University, given by Maya Neytcheva.

The research that led to this paper was supported by the Swedish Research Council and was carried out within the Linnaeus centre of excellence UPMARC, Uppsala Programming for Multicore Architectures Research Center.
References

[1] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[2] M. Aspnäs, A. Signell, and J. Westerholm. Efficient assembly of sparse matrices using hashing. In B. Kågström, E. Elmroth, J. Dongarra, and J. Wasniewski, editors, Proceedings of PARA 2006, Applied Parallel Computing, State of the Art in Scientific Computing, Lecture Notes in Computer Science, pages 900–907. Springer-Verlag, 2007.
[3] A. Buluç and J. R. Gilbert. Challenges and advances in parallel sparse matrix-matrix multiplication. In ICPP '08. 37th International Conference on Parallel Processing, pages 503–510, 2008. doi:10.1109/ICPP.2008.45.
[4] F. Cuvelier, C. Japhet, and G. Scarella. An efficient way to perform the assembly of finite element matrices in Matlab and Octave. Technical Report 8305, Université Paris 13, 2013.
[5] A. Dadashzadeh and S. Ternsjö. Parallel assembly of sparse matrices using CCS and OpenMP. Technical report, Dept of Information Technology, Uppsala University, 2013.
[6] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, 2011. doi:10.1145/2049662.2049663.
[7] J. Gilbert, C. Moler, and R. Schreiber. Sparse matrices in MATLAB: Design and implementation. SIAM J. Matrix Anal. Appl., 13(1):333–356, 1992. doi:10.1137/0613024.
[8] N. Jansson. Optimizing sparse matrix assembly in finite element solvers with one-sided communication. In M. Daydé, O. Marques, and K. Nakajima, editors, High Performance Computing for Computational Science - VECPAR 2012, volume 7851 of Lecture Notes in Computer Science, pages 128–139. Springer, Berlin, 2013. doi:10.1007/978-3-642-38718-0_15.
[9] N. Jansson, J. Hoffman, and M. Nazarov. Adaptive simulation of turbulent flow past a full car model. In State of the Practice Reports, SC '11, pages 20:1–20:8, New York, NY, 2011. ACM. doi:10.1145/2063348.2063375.
[10] J. Kepner, D. Bader, A. Buluç, J. Gilbert, T. Mattson, and H. Meyerhenke. Graphs, matrices, and the GraphBLAS: Seven good reasons. Procedia Computer Science, 51:2453–2462, 2015. doi:10.1016/j.procs.2015.05.353.
[11] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley Series in Computer Science. Addison-Wesley, Reading, MA, 1973.
[12] S. P. Marinova and S. D. Stoichev. Parallel algorithm for integer sorting with multi-core processors. Inf. Technol. Control, 3:15–27, 2009.
[13] J. D. McCalpin. STREAM: Sustainable memory bandwidth in high performance computers. Technical report, University of Virginia, Charlottesville, Virginia, 1991–2007. A continually updated report.
[14] J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pages 19–25, 1995.
[15] V. Shah and J. R. Gilbert. Sparse matrices in Matlab*P: Design and implementation. In L. Bougé and V. K. Prasanna, editors, High Performance Computing - HiPC 2004, volume 3296 of Lecture Notes in Computer Science, pages 144–155. Springer, Berlin, 2005. doi:10.1007/978-3-540-30474-6_20.
[16] L. Shure. Creating Sparse Finite-Element Matrices in MATLAB, 2007. URL http://blogs.mathworks.com/loren/2007/03/01/creating-sparse-finite-element-matrices-in-matlab/. Loren on the Art of MATLAB.
Appendix A. Additional serial codes
Listing 13. Pre-processing: input of index ii/jj

const double *ival = mxGetPr(I); // Matlab input vector I
int *ii = malloc(len*sizeof(int));
int M = 0;
for (int i = 0; i < len; i++) {
  // error: bad index
  if (ival[i] < 1.0 || ival[i] != ceil(ival[i])) return false;
  if ((ii[i] = ival[i]) > M) M = ival[i];
}

The identical code producing jj and N has been omitted for brevity.

Listing 14. Post-processing: finalize CCS format

for (int i = 0; i < len; i++) {
  irS[irank[i]] = ii[i]-1; // switch to zero-offset
  prS[irank[i]] += sr[i];
}
Listing 15. Sparse assembly, serial version

void sparse(const int *ii, const int *jj, const double *sr,
            int len, int M, int N)
{
  // output
  int *jcS;   // column pointer for sparse matrix S
  int *irank; // inverse rank array of length len
  int *jrS;   // accumulated "pessimistic" row counter
  int *rank;  // rank-array for rows
  int *hcol;  // cache memory for columns

  // Part 1: count and accumulate indices to rows
  jrS = calloc(M+1,sizeof(jrS[0]));
  for (int i = 0; i < len; i++) jrS[ii[i]]++;
  for (int r = 2; r <= M; r++) jrS[r] += jrS[r-1];

  // Part 2: build rank with the active use of jrS
  rank = malloc(len*sizeof(rank[0]));
  jrS--; /* (unit-offset in ii) */
  for (int i = 0; i < len; i++) rank[jrS[ii[i]]++] = i;

  /* Part 3: loop over input and make each column unique with respect
     to row indices, building both an index vector irank and the final
     column pointer at the same time */
  jcS = calloc(N+1,sizeof(jcS[0]));
  hcol = calloc(N,sizeof(hcol[0]));
  hcol--; /* (unit-offset in jj) */
  irank = malloc(len*sizeof(irank[0]));
  for (int row = 1, i = 0; row <= M; row++)
    for ( ; i < jrS[row]; i++) {
      const int ixijs = rank[i]; // index into input data triplet (ii,jj,sr)
      const int col = jj[ixijs]; // column index

      // new element?
      if (hcol[col] < row) {
        hcol[col] = row; // remembered by the row index
        jcS[col]++;      // count it
      }
      // irank keeps track of where it should go
      irank[ixijs] = jcS[col]-1;
    }
  free(++hcol);
  free(rank);
  free(++jrS);

  // Part 4: accumulate pointer to columns
  for (int c = 2; c <= N; c++) jcS[c] += jcS[c-1];

  // irank must account for the previous accumulation
  jcS--; /* (again, unit-offset in jj) */
  for (int i = 0; i < len; i++) irank[i] += jcS[jj[i]];
  jcS++;

  /* allocate output and insert data: code not shown */

  // deallocate intermediate format
  free(irank);
}
Appendix B. Additional parallel codes
Listing 16. Pre-processing in parallel

const double *ival = mxGetPr(I); // Matlab input vector
int *ii = malloc(len*sizeof(int));
int M = 0;
Listing 17. Post-processing in parallel

}} // end parallel
Division of Scientific Computing, Department of Information Technology, Uppsala University, SE-751 05 Uppsala, Sweden.
URL (S. Engblom): http://user.it.uu.se/~stefane
E-mail address: