2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)

SPAGHETTI: Streaming Accelerators for Highly Sparse GEMM on FPGAs

Abstract

Generalized Sparse Matrix-Matrix Multiplication (Sparse GEMM) is widely used across multiple domains, but the computation's regularity depends on the input sparsity pattern. The majority of sparse GEMM accelerators are based on the inner product method and propose new storage formats [5], [28], [31] to regularize computation. We find that these storage formats are better suited to denser matrices. Accelerators adopting the outer product algorithm [26], [34] are more suitable for highly sparse inputs ($<1\%$ density), since they support the CSC/CSR storage formats. The current state of the art, SpArch [34], condenses the inputs to improve output reuse, but then spoils input reuse. The effectiveness of condensing varies across inputs, leading to high variance in DRAM utilization and speedup. SpArch also requires a complex memory hierarchy (e.g., prefetch caches) to recapture input reuse.

We propose Spaghetti, an open-source Chisel generator for creating FPGA-optimized outer product accelerators. The key novelty in Spaghetti is a pattern-aware software scheduler that analyzes the sparsity pattern and schedules row-column pairs of the inputs onto the fixed microarchitecture. Spaghetti exploits our observation that the rows of the input matrix lead to mutually independent rows in the final output. The scheduler can therefore partition the input into tiles that maximize reuse and eliminate re-fetching of partial matrices from DRAM. The microarchitecture template we create has the following key benefits: i) the inputs can be statically scheduled in a streaming fashion, maximizing DRAM utilization; ii) the merge phase can be parallelized to generate multiple rows of the output in parallel, fully utilizing the output DRAM bandwidth; iii) the design adapts to the varying logic resources and bandwidth across FPGA devices and attains maximal roofline performance (limited only by memory bandwidth).

We auto-generate sparse GEMM accelerators on Amazon AWS FPGAs and demonstrate performance improvements over CPUs and GPUs ranging from $1.1\times$ to $34.5\times$. Compared to SpArch [34], our design improves performance by an average of $2.6\times$ and reduces DRAM accesses by an average of $4\times$.
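To make the contrast with the inner product method concrete, the following is a minimal Python sketch of the outer-product SpGEMM formulation the abstract refers to; it is not the paper's Chisel generator, and the dictionary-based CSC/CSR encodings and function name are illustrative assumptions. It multiplies matrix A, stored column-wise, against matrix B, stored row-wise, one column-row pair at a time, and keeps one accumulator per output row, reflecting the row-independence property that Spaghetti's scheduler exploits for tiling.

```python
# Sketch of outer-product sparse GEMM (illustrative, not Spaghetti's RTL).
# C = A * B is computed as a sum of outer products: column k of A (CSC)
# times row k of B (CSR). Each pair emits a partial matrix; a merge phase
# accumulates them. Here the merge is folded into per-row accumulators.
from collections import defaultdict

def outer_product_spgemm(a_csc, b_csr, n_rows):
    """a_csc: {col: [(row, val), ...]}, b_csr: {row: [(col, val), ...]}."""
    # One accumulator per output row: rows of C are mutually independent,
    # which is the property the pattern-aware scheduler relies on.
    c = [defaultdict(float) for _ in range(n_rows)]
    for k in a_csc:                       # multiply phase: one col-row pair per k
        for i, a_ik in a_csc[k]:
            for j, b_kj in b_csr.get(k, []):
                c[i][j] += a_ik * b_kj    # merge phase: accumulate partials
    return c

# Example: A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]]
a_csc = {0: [(0, 1.0)], 1: [(1, 2.0)]}
b_csr = {0: [(1, 3.0)], 1: [(0, 4.0)]}
print([dict(r) for r in outer_product_spgemm(a_csc, b_csr, 2)])
# -> [{1: 3.0}, {0: 8.0}]
```

Because each output row is touched only through its own accumulator, disjoint groups of input rows can be scheduled as independent tiles, which is what lets the partial matrices stay on-chip instead of being re-fetched from DRAM.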

Pages 84-96
DOI 10.1109/HPCA51647.2021.00017
Language English