A survey of sparse matrix-vector multiplication performance on large matrices
Max Grossman (Rice University, [email protected])
Christopher Thiele (Shell International Exploration & Production Inc.)
Mauricio Araya-Polo (Shell International Exploration & Production Inc.)
Florian Frank (Rice University)
Faruk O. Alpak (Shell International Exploration & Production Inc.)
Vivek Sarkar (Rice University)
Keywords
Sparse matrix computations, large datasets, iterative solvers, matrix-vector multiplication, SpMV, GPU, MKL.
1. MOTIVATION
One of the main sources of sparse matrices is the discretization of partial differential equations that govern continuum-physics phenomena such as fluid flow and transport, phase separation, mechanical deformation, electromagnetic wave propagation, and others. Recent advances in high-performance computing have enabled researchers to tackle increasingly large problems, leading to sparse linear systems with hundreds of millions to a few tens of billions of unknowns, e.g., [5, 6]. Iterative linear solvers are popular in large-scale computing because they consume less memory than direct solvers. Contrary to direct linear solvers, iterative solvers approach the solution gradually, which requires the repeated computation of sparse matrix-vector (SpMV) products. The evaluation of SpMV products can therefore emerge as a performance bottleneck in the simulation of large problems. In this work, we focus on a linear system arising from the discretization of the Cahn–Hilliard equation, a fourth-order nonlinear parabolic partial differential equation that governs the separation of a two-component mixture into phases [3]. The underlying spatial discretization is performed using the discontinuous Galerkin method (e.g., [10]), and the nonlinear systems are solved with Newton's method. A number of parallel algorithms and strategies are evaluated in this work to accelerate the evaluation of SpMV products.
2. BACKGROUND
There are many factors that significantly impact the performance of a given SpMV computation. We separate these factors into four categories: matrix characteristics, storage format, software implementation, and hardware platform.

Matrix characteristics of the input matrix include its definiteness, eigenvalues, number of non-zero values (NNZ), and imbalances in terms of NNZ. While some matrix characteristics are immutable at runtime, others can be changed to improve performance. For example, sorting the rows of a sparse matrix by their NNZ can reduce NNZ imbalance between neighboring rows.

There are many ways to store a sparse matrix, including Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), Coordinate (COO), Diagonal (DIA), Ellpack-Itpack (ELL), Sliced ELL (SELL), Hybrid (HYB), Blocked CSR (BSR), and Extended BSR (BSRX). These formats differ in their space consumption, lookup efficiency, number of redundantly stored zero values, and implementation complexity. Additionally, the best storage format for a given matrix often depends on characteristics of the matrix itself. The matrix format can almost always be freely tuned to optimize space or time efficiency.
Matrix  Dimension  % NZ       Avg NZ/row  CSR Size
A       611K       0.00172%   10.53       79.74 MB
B       1,555K     0.00068%   10.48       201.83 MB
C       4,884K     0.00023%   11.01       664.73 MB
D       12,434K    0.00009%   10.98       1.689 GB
E       39,079K    0.00003%   11.25       5.434 GB
F       99,476K    0.00001%   11.24       13.818 GB

Table 1: Intrinsic characteristics for each matrix tested.
Several optimized SpMV implementations are available. In our work, we focus on a preliminary evaluation of Intel MKL [7], the Trilinos project [8], CUSPARSE [9], and CUSP [1]. While application developers are generally free to choose whichever SpMV implementation they prefer (ignoring licensing constraints), that choice generally places hard constraints on the problems that can be solved. For example, libraries may only support SpMV on a limited number of storage formats, may only be available or optimized on a certain hardware platform, or may limit the size of the stored matrix by their choice of the type used to represent matrix dimensions and indices.

Some SpMV implementations are closely tied to the hardware they execute on (e.g., MKL, CUSPARSE). Therefore, the choice of hardware platform generally constrains the choice of software implementation, and vice versa. The characteristics of a hardware platform (e.g., DRAM bandwidth, cache hierarchy, parallelism) can have a significant impact on SpMV performance. Past work has generally found that SpMV is a memory-bound kernel whose performance is heavily tied to the cache hierarchy and memory bandwidth of the hardware platform it runs on [11].

In this work, we contribute a third-party survey of SpMV performance on industrial-strength, large matrices using:

1. The SpMV implementations in Intel MKL, the Trilinos project (Tpetra subpackage), the CUSPARSE library, and the CUSP library, each running on modern architectures.
2. NVIDIA GPUs and Intel multi-core CPUs (as supported by each software package).
3. The CSR, BSR, COO, HYB, and ELL matrix formats (as supported by each software package).
3. METHODS

3.1 Matrix characterization
The origin of the matrices used in this work is described in Section 1. We use six banded (but not purely diagonal) matrices of varying sizes and characteristics. Selected matrix characteristics are listed in Table 1. Each matrix has a mode of 10 non-zeroes per row and a maximum of 16 non-zeroes per row.
Software   Version     Proc.  Matrix Formats
MKL        11.2.2.164  CPU    CSR, COO, BSR
Trilinos   12.2.1      CPU    CSR
CUSPARSE   6.5         GPU    CSR, BSR, HYB, ELL
CUSP       0.5.1       GPU    CSR, COO, HYB, ELL

Table 2: Configuration of each software platform.

Matrix  MKL        Trilinos   CUSPARSE  CUSP
A       3.87 (8T)  3.65 (8T)  0.89      0.59
B       2.97 (8T)  3.16 (8T)  0.83      0.60
C       2.84 (8T)  3.16 (8T)  0.79      0.58
D       3.15 (8T)  2.54 (2T)  0.87      N/A
E       3.33 (8T)  4.69 (8T)  0.84      N/A
F       3.52 (8T)  4.75 (2T)  0.82      N/A

Table 3: Speedups for each software platform on each matrix, relative to single-threaded MKL, where NT indicates that the best performance was achieved with N CPU threads. Missing data for CUSP is due to the difficulty of supporting out-of-core datasets in its API.
All experimental evaluations are performed on a hardware platform containing a 12-core 2.80 GHz Intel X5660 CPU with 48 GB of system RAM, paired with an NVIDIA M2070 GPU.
4. EXPERIMENTAL EVALUATION
Here we present preliminary results evaluating the relative performance of MKL, Trilinos, CUSPARSE, and CUSP.

We start our evaluation by identifying the optimal matrix format for each software package, with varying numbers of threads for the CPU-based packages. For each combination of software package and matrix, we perform 30 runs and select the fastest format based on the median runtime. For BSR, we try all blocking factors less than or equal to 32 that evenly divide the matrix dimensions. For HYB, we try ELL widths of 10, 11, 12, 13, 14, and 15. Some formats are more space-efficient than others, so not all formats could be evaluated on all matrices due to out-of-memory errors.

We find that the most efficient format for all matrices and all thread counts running on MKL is CSR. Trilinos only supports the CSR format, so no matrix format comparison is necessary. For CUSPARSE and CUSP, the optimal matrix format for all matrices is also CSR. This result may be surprising, as past work highlights the benefits of the ELL and HYB formats on GPUs [2]. However, the small number of non-zeroes per row in our test matrices reduces memory coalescing in vectorized formats like ELL and HYB.

Next, we compare the best performance achievable on each software platform for each matrix. Table 3 lists the maximum speedup each platform achieves, relative to single-threaded MKL. These speedup numbers include copying the inputs and outputs to and from the GPU for CUSPARSE and CUSP. Note the slowdown of both GPU frameworks relative to single-threaded MKL, which contrasts with most of the related literature [2, 4]. To explain this, we compare kernel execution times in Table 4 (ignoring GPU communication). There, we see results more similar to past evaluations, with CUSPARSE achieving a ∼10× speedup relative to single-threaded MKL and an average 3.123× speedup relative to the fastest CPU implementation for each matrix.
Matrix  CUSPARSE  CUSP
A       11.20     4.11
B       10.41     3.70
C       10.47     3.76
D       10.35     N/A
E       10.76     N/A
F       10.81     N/A

Table 4: Speedups for CUSPARSE and CUSP relative to single-threaded MKL, kernel time only.

5. CONCLUSIONS

The evaluation performed in this work is significant for a number of reasons:

1. Even though MKL is widely used for sparse linear algebra and offers a low-level C API, Trilinos is able to match or beat MKL performance on several of our matrices, particularly larger ones. Therefore, Trilinos's flexible, object-oriented API becomes the preferred choice without having to sacrifice performance.

2. When only considering kernel performance, CUSPARSE demonstrates a 3.123× average speedup relative to the best CPU-based packages. However, these benefits immediately disappear when data movement is considered. Emphasis must be placed on keeping data on the device or maximizing computation-communication overlap. As CUSPARSE gives more control over communication than CUSP, it is the preferred GPU framework.

These preliminary results make it clear that we should focus our future investigation and work around Trilinos and CUSPARSE. Our next steps will focus on evaluating Trilinos's GPU support, extending this work to distributed systems, considering multi-GPU execution, experimenting with the MKL inspector-executor framework, and investigating Trilinos-CUSPARSE integration.
6. REFERENCES

[1] CUSP. https://github.com/cusplibrary/cusplibrary.
[2] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, page 18. ACM, 2009.
[3] J. W. Cahn and J. E. Hilliard. Free energy of a nonuniform system. I. Interfacial free energy. The Journal of Chemical Physics, 28:258–267, 1958.
[4] J. D. Davis and E. S. Chung. SpMV: A memory-bound application on the GPU stuck between a rock and a hard place. Microsoft Research Silicon Valley, Technical Report, September 14, 2012.
[5] A. H. Dogru, L. S. Fung, U. Middya, T. M. Al-Shaalan, J. A. Pita, K. Hemanthkumar, H. Su, J. C. Tan, H. Hoy, W. Dreiman, et al. A next-generation parallel reservoir simulator for giant reservoirs. In SPE/EAGE Reservoir Characterization & Simulation Conference, 2009.
[6] A. H. Dogru, L. S. K. Fung, U. Middya, T. M. Al-Shaalan, T. Byer, H. Hoy, W. A. Hahn, N. Al-Zamel, J. A. Pita, K. Hemanthkumar, M. Mezghani, A. Al-Mana, J. Tan, W. Dreiman, A. Fugl, and A. Al-Baiz. New frontiers in large scale reservoir simulation. Society of Petroleum Engineers, 2011.
[7] Intel. Intel Math Kernel Library. https://software.intel.com/en-us/intel-mkl.
[8] Sandia National Laboratories. The Trilinos Project. https://trilinos.org/.
[9] NVIDIA. CUSPARSE. https://developer.nvidia.com/cusparse.
[10] B. Rivière. Discontinuous Galerkin Methods for Solving Elliptic and Parabolic Equations: Theory and Implementation. Frontiers in Applied Mathematics. Society for Industrial and Applied Mathematics, 2008.
[11] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix–vector multiplication on emerging multicore platforms.