Automatically Harnessing Sparse Acceleration
Philip Ginsbach
School of Informatics, The University of Edinburgh, United Kingdom
[email protected]

Bruce Collie
School of Informatics, The University of Edinburgh, United Kingdom
[email protected]

Michael F. P. O'Boyle
School of Informatics, The University of Edinburgh, United Kingdom
[email protected]
Abstract
Sparse linear algebra is central to many scientific programs, yet compilers fail to optimize it well. High-performance libraries are available, but adoption costs are significant. Moreover, libraries tie programs into vendor-specific software and hardware ecosystems, creating non-portable code.

In this paper, we develop a new approach based on our specification Language for implementers of Linear Algebra Computations (LiLAC). Rather than requiring the application developer to (re)write every program for a given library, the burden is shifted to a one-off description by the library implementer. The LiLAC-enabled compiler uses this to insert appropriate library routines without source code changes. LiLAC provides automatic data marshaling, maintaining state between calls and minimizing data transfers. Appropriate places for library insertion are detected in compiler intermediate representation, independent of source languages.

We evaluated on large-scale scientific applications written in FORTRAN; standard C/C++ and FORTRAN benchmarks; and C++ graph analytics kernels. Across heterogeneous platforms, applications and data sets we show speedups of 1.1× to over 10× without user intervention.

CCS Concepts • Software and its engineering → Compilers; Specification languages.

Keywords sparse linear algebra, domain specific languages, library integration, declarative languages, data marshaling
ACM Reference Format:
Philip Ginsbach, Bruce Collie, and Michael F. P. O'Boyle. 2020. Automatically Harnessing Sparse Acceleration. In Proceedings of the 29th International Conference on Compiler Construction (CC '20), February 22–23, 2020, San Diego, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3377555.3377893
1 Introduction

Linear algebra is an important component of many applications and a prime candidate for hardware acceleration. While there has been significant compiler effort in accelerating dense algebra [23, 36, 40], there has been less success with sparse codes. This is largely due to indirect memory access, which challenges compiler analysis [32]. Sparse-based algorithms are, however, increasingly important as the basis of graph algorithms and data analytics [28].

We currently see the wide-scale provision of fast sparse libraries [2, 3, 5, 55]. They deliver excellent performance, but require significant programmer intervention and are rarely portable across platforms. Alternatives, such as the SLinGen/LGen system [45, 46], provide specialized code generators for linear algebra, but again require code modification by the programmer and focus only on dense computations. Program modification is particularly problematic when the targets are hardware accelerators that require careful data marshaling. Such modifications are often program-wide and severely reduce the portability of the program. Furthermore, they require a commitment to specific hardware vendors, resulting in codebases that quickly become obsolete. To mitigate this, many projects have to keep multiple execution paths, resulting in arcane build systems and unmaintainable code. In this time of rapid hardware innovation, such vendor lock-in is undesirable. In fact, the difficulty of efficient portable integration is a key impediment to the wider use of accelerator libraries and hardware.

In this paper, we reexamine how compilers and libraries can be used to achieve performance without programmer effort. Highly tuned and platform-specific libraries invariably remain the fastest implementations available. However, we show that we can automatically integrate these libraries without polluting the source code. This is performed as a compiler transformation step, leaving the original source code intact and portable.

To achieve this, we develop a new specification language for implementers of libraries, the specification Language for implementers of Linear Algebra Computations (LiLAC). Using LiLAC, library implementers specify with a few lines of code what a library does and how it is invoked. Our compiler then determines where the library specification matches user code and automatically transforms it to utilize the library. The language has two complementary parts.
[Figure 1 diagram: portable source code is compiled by the LiLAC-compiler (cf. Figure 2) to optimized compiler IR and an application binary; the What and How specifications produce generated harness code, compiled into a harness binary that links against Intel MKL.]

Original application code (1):

    for (cgit = 1; cgit <= cgitmax; cgit++) {
      for (j = 0; j < lastrow - firstrow + 1; j++) {
        sum = 0.0;
        for (k = rowstr[j]; k < rowstr[j+1]; k++) {
          sum = sum + a[k]*p[colidx[k]];
        }
        q[j] = sum;
      }
      d = 0.0;
      for (j = 0; j < lastcol - firstcol + 1; j++) {
        d = d + p[j]*q[j];
      }
    }

Equivalent source-level representation after harness insertion (5):

    for (cgit = 1; cgit <= cgitmax; cgit++) {
      spmv_csr_harness(lastrow - firstrow + 1,
                       rowstr, colidx, p, a, q);
      d = 0.0;
      for (j = 0; j < lastcol - firstcol + 1; j++) {
        d = d + p[j]*q[j];
      }
    }

Figure 1. LiLAC applied to NPB Conjugate Gradient: Code (1) that matches the LiLAC-What specification (cf. Figure 2) is replaced by calls to a harness (5) during compilation (2), resulting in an application binary (6) that corresponds to (hypothetical) platform-specific source code (4). The harness is generated from the LiLAC-How specification (cf. Figure 2) to utilize Intel MKL.
LiLAC-What is a high-level language to describe sparse and dense linear algebra computations. The LiLAC compiler uses it to detect such functionality in user applications at compiler intermediate representation level. It is powerful enough to formulate linear algebra routines, yet remains independent of compiler internals and is easy to understand and program.
LiLAC-How specifies how libraries can be used to perform a LiLAC-What-specified computation. Besides generating setup code and handling hardware context management, it crucially enables efficient memory synchronization. It uses memory protection mechanisms to automatically track data changes and transfers memory only when necessary.

The research contribution of this paper is a combination of three techniques for the acceleration of sparse linear algebra:

• Accelerate unchanged source code by identifying sparse linear algebra computations with backtracking search.
• Avoid vendor lock-in with an extensible specification language that adapts to new accelerator libraries.
• Achieve program-wide memory synchronization with only local transformations using memory protection.

Together, these techniques result in a system that works on existing and novel software. It offers the full performance of fast libraries, avoids vendor lock-in, and keeps the source code easy to maintain and free from pollution.
2 Overview

Figure 1 shows the LiLAC-enabled compiler from the user perspective. In the top left corner (1), we see unmodified application source code. This is conjugate gradient from the NAS-PB suite. To achieve good performance on Intel processors, the compiler (2) has been configured to offload native sparse code to Intel MKL. Using a specification of what computations MKL supports, it recognizes the highlighted loop as a suitable sparse matrix-vector product. Instead of passing it on to the compiler backend for code generation, it inserts a call to a harness function. This is performed on intermediate code (3) and results in a program (4). In the bottom left (5) is an equivalent source-level representation.

LiLAC also generates the corresponding harness code (6), which gets compiled into a shared library (7) that is linked with the application binary. This harness interfaces with the underlying library implementation, Intel MKL (8).
Figure 2 shows the internals of the LiLAC system. It is fully integrated into the build system of the established LLVM compiler framework, extending the clang compiler. On the left is the LiLAC specification, just 16 lines of code. It is independent of the user application and can be provided by the library implementer. It consists of a What and a How part. These two parts are processed by the LiLAC system and result in a runtime library and a generated detection function, which is incorporated into the clang compiler.

LiLAC-What specifies the functionality that is provided by a library, in this example spmv_csr (cf. Figure 2). From this, a function that detects the computation in normalized LLVM IR code is generated and the harness interface is determined. The detection functions are based on a backtracking search algorithm, as elaborated in section 4. The detection function is linked directly into the LiLAC-compiler, either statically or dynamically at (compiler) run time.
LiLAC-How specifies how the library, Intel MKL in this case, is invoked to perform the specified calculation. This involves boilerplate code, but also advanced features. These include efficient data synchronization and the caching of invariants. In the given example, the columns variable is such an invariant. It is required for the library call, but not statically available. Therefore, it has to be computed at runtime. Using Marshaling, LiLAC automatically generates the harness such that this is only recomputed if the values in row_ptr change. Such changes are captured with generated memory protection code using mprotect, managed by LiLAC.

On the right of the figure, we can see how the components generated from the LiLAC specification are used to build the LiLAC-compiler. The detection function is compiled and used directly by the LiLAC-compiler, linked either statically or dynamically. Interacting with the internals of LLVM, it implements a transformation pass that is executed after the normal optimization pipeline. Using the generated detection function, it finds instances of the computation and replaces them with calls to the specified harness interface.

The harness, on the other hand, is compiled into a shared library. The LiLAC-compiler dynamically links applications to this shared library whenever it inserts harness calls. When multiple LiLAC-How programs are provided, the generated harnesses are compatible and linking the user program to a different harness library at runtime is sufficient.

LiLAC-What:

    COMPUTATION spmv_csr
    forall(0 <= i < rows) {
      output[i] = sum(row_ptr[i] <= j < row_ptr[i+1])
                  val[j] * vector[col_ind[j]];
    }

LiLAC-How:

    HARNESS mkl IMPLEMENTS spmv_csr
      sparse_matrix_t A;
      mkl_sparse_d_create_csr(&A, SPARSE_INDEX_BASE_ZERO,
          rows, columns, row_ptr, row_ptr+1, col_ind, val);
      struct matrix_descr D;
      D.type = SPARSE_MATRIX_TYPE_GENERAL;
      D.mode = SPARSE_FILL_MODE_LOWER;
      D.diag = SPARSE_DIAG_NON_UNIT;
      mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE,
          1.0, A, D, vector, 0.0, output);
    Marshaling
      int columns = Maximum of col_ind[0 .. row_ptr[rows]]

Generated harness interface:

    void spmv_csr_harness(
        double* output, double* val, double* vector,
        int* row_ptr, int* col_ind, int rows);

Figure 2. Overview of LiLAC internals: On the left is the complete LiLAC program that the library implementer has to provide. At compile time of LLVM, this program is parsed and incorporated into a modified clang C++ compiler, behaving as in Figure 1. [Diagram: the LiLAC-compiler detects the computation in application code, cuts it out and inserts harness calls; the detection function becomes a LiLAC llvm pass linked with the llvm codebase, and the harness builds into a shared library.]
3 The LiLAC Language

This section describes in more detail the two components of the LiLAC language. LiLAC-What specifies the computations that a library performs; LiLAC-How describes how exactly the library should be invoked to perform these computations.
3.1 LiLAC-What

At the heart of our approach is a simple language to specify sparse and dense linear algebra operations. This serves two purposes in our LiLAC system: Firstly, it is used to generate a detection program for finding the computation in user code. Secondly, it identifies the variables that are arguments to the library, thus defining the harness interface.

    program ::= COMPUTATION ⟨name⟩ ⟨body⟩
    body    ::= ⟨forall⟩ | ⟨dotp⟩
    range   ::= ( ⟨exp⟩ <= ⟨name⟩ < ⟨exp⟩ )
    forall  ::= forall ⟨range⟩ { ⟨body⟩ }
    dotp    ::= ⟨addr⟩ = sum ⟨range⟩ ⟨addr⟩ * ⟨addr⟩ ;
    addr    ::= ⟨name⟩ { [ ⟨exp⟩ ] }
    add     ::= ⟨exp⟩ + ⟨exp⟩
    mul     ::= ⟨exp⟩ * ⟨exp⟩
    exp     ::= ⟨name⟩ | ⟨cnst⟩ | ⟨addr⟩ | ⟨add⟩ | ⟨mul⟩

Figure 3. Grammar of the LiLAC-What language

    val     = [ ... ]
    col_ind = [ ... ]
    row_ptr = [ ... ]

Figure 4. Compressed Sparse Row (CSR) representation as used by the LiLAC-What example in Figure 1 and Figure 2

    perm    = [ ... ]
    val     = [ -1 1 2 -1 2 3 1 2 1 2 ]
    col_ind = [ ... ]
    jd_ptr  = [ ... ]
    nzcnt   = [ ... ]

    COMPUTATION spmv_jds
    forall(0 <= i < rows) {
      output[perm[i]] = sum(0 <= j < nzcnt[i])
        val[jd_ptr[j]+i] * vector[col_ind[jd_ptr[j]+i]];
    }

Figure 5. Jagged Diagonal Storage (JDS) in LiLAC-What
The key design challenge was to stay simple enough to automatically generate robust detection functionality, yet to be able to capture operations in all relevant data formats. Most importantly, this includes the CSR/CSC, JDS and COO formats. CSR and JDS are part of our evaluation. Across the different formats, the control flow is rigid and easy to express. This is reflected in the grammar as shown in Figure 3.

Sparse matrices can be stored in different formats. We introduce two of them explicitly, but others are supported in the same way by LiLAC-What.
Compressed Sparse Row (CSR) [44]
All non-zero entries are stored in a flat array val. The col_ind array stores the column position for each value. Finally, the row_ptr array stores the beginning of each row of the matrix as an offset into the other two arrays. The number of rows in the matrix is given directly by the length of the row_ptr array minus one; however, the number of columns is not explicitly stored. In Figure 4, a 5x5 matrix is shown represented in this format; the LiLAC-What code is in the top left of Figure 2.
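To make the layout concrete, the following minimal C++ sketch works through a small matrix of our own (the values are illustrative, not those of Figure 4) and computes the product exactly as the spmv_csr specification describes it:

    #include <cstdio>

    // Hypothetical 4x4 matrix (not the 5x5 matrix of Figure 4), in CSR:
    //   [ 5 0 0 1 ]
    //   [ 0 8 0 0 ]
    //   [ 0 0 3 0 ]
    //   [ 0 6 0 4 ]
    double val[]  = {5, 1, 8, 3, 6, 4};  // non-zeros, row by row
    int col_ind[] = {0, 3, 1, 2, 1, 3};  // column of each non-zero
    int row_ptr[] = {0, 2, 3, 4, 6};     // row i spans val[row_ptr[i] .. row_ptr[i+1])

    int main() {
      const int rows = 4;
      double vector[] = {1, 1, 1, 1};
      double output[4];
      // The loop nest mirrors the spmv_csr LiLAC-What specification.
      for (int i = 0; i < rows; ++i) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
          sum += val[j] * vector[col_ind[j]];
        output[i] = sum;
      }
      for (int i = 0; i < rows; ++i)
        std::printf("output[%d] = %g\n", i, output[i]);
      return 0;
    }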
Jagged Diagonal Storage (JDS) [43]
The matrix rows are reordered such that the number of non-zeros per row is decreasing. The permutation is stored in a vector perm, the number of non-zeros in nzcnt. The non-zero entries are then stored in an array val in the following order: the first non-zero entry in each row, then the second non-zero entry in each row, etc. The array col_ind stores the column for each of the values, and jd_ptr stores offsets into val and col_ind. The product of a sparse matrix in JDS format with a dense vector is specified in LiLAC-What at the bottom of Figure 5.
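Again as a sketch with made-up values (not those of Figure 5), the JDS layout and the corresponding product look as follows in C++:

    // Hypothetical 3x4 matrix in JDS; rows already ordered by non-zero count:
    //   [ 4 0 5 6 ]   (3 non-zeros)
    //   [ 0 7 0 8 ]   (2 non-zeros)
    //   [ 0 0 9 0 ]   (1 non-zero)
    double val[]  = {4, 7, 9,  5, 8,  6};  // 1st non-zero of each row, then 2nd, then 3rd
    int col_ind[] = {0, 1, 2,  2, 3,  3};  // column of each entry in val
    int jd_ptr[]  = {0, 3, 5};             // start of each jagged diagonal in val
    int nzcnt[]   = {3, 2, 1};             // non-zeros per reordered row
    int perm[]    = {0, 1, 2};             // identity permutation in this example

    void spmv_jds(int rows, const double* vector, double* output) {
      // Mirrors the spmv_jds LiLAC-What specification from Figure 5.
      for (int i = 0; i < rows; ++i) {
        double sum = 0.0;
        for (int j = 0; j < nzcnt[i]; ++j)
          sum += val[jd_ptr[j] + i] * vector[col_ind[jd_ptr[j] + i]];
        output[perm[i]] = sum;
      }
    }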
Dense
Detecting dense computations is easier than detecting sparse ones, and existing literature covers it well. We fully support dense but evaluate it only briefly, for completeness.
3.2 LiLAC-How

Where LiLAC-What specifies the computations implemented by a library, LiLAC-How describes how precisely library calls can be used to perform them. The language was designed to support important existing libraries such as cuSPARSE, clBLAS, and Intel MKL. The idiosyncrasies of these libraries require LiLAC-How to capture some boilerplate C++ code that manages the construction of parameter structures, calling conventions etc. Aside from this aspect, we designed it as high-level as possible without compromising performance. In particular, LiLAC-How abstracts away memory transfers.

These considerations result in two interacting components. Firstly, a harness describes the boilerplate code for individual library invocations. Secondly, data marshaling between the core program and the library is specified, which is crucial for heterogeneous compute environments. Figure 6 shows the grammar specification of LiLAC-How.

    harness     ::= HARNESS ⟨name⟩ IMPLEMENTS ⟨name⟩ ⟨C++⟩
                    [ ⟨marshaling⟩ ] [ ⟨persistence⟩ ]
                    [ CppHeaderFiles { ⟨name⟩ } ]
    persistence ::= PersistentVariables { ⟨name⟩ ⟨name⟩ }
                    [ BeforeFirstExecution ⟨C++⟩ ]
                    [ AfterLastExecution ⟨C++⟩ ]
    marshaling  ::= Marshaling { ⟨type⟩ ⟨name⟩ =
                    ⟨name⟩ of ⟨name⟩ [ 0 .. ⟨exp⟩ ] }
    input       ::= INPUT ⟨name⟩ ⟨C++⟩
                    [ BeforeFirstExecution ⟨C++⟩ ]
                    [ AfterLastExecution ⟨C++⟩ ]
    output      ::= OUTPUT ⟨name⟩ ⟨C++⟩
                    [ BeforeFirstExecution ⟨C++⟩ ]
                    [ AfterLastExecution ⟨C++⟩ ]

Figure 6. Grammar of LiLAC-How
We need to encapsulate the boilerplate code that any given library requires, such as setup code, filling of parameter structures etc. This part of the language is straightforward.
Harness
The harness construct is the central way of telling the LiLAC system how a library can be used to perform a computation that was specified in LiLAC-What. As we can see at the top of Figure 6, a harness refers to a LiLAC-What program by name and also has a name itself. It is built around some C++ code, which can use all the variables from the LiLAC-What program to connect with the surrounding program. It also needs to specify the relevant C++ header files that the underlying library requires. Lastly, the harness can incorporate persistent state and utilize data marshaling.
Persistence
Many libraries need setup and cleanup code, which is specified with the keywords BeforeFirstExecution and AfterLastExecution. These are used in combination with PersistentVariables, allowing state to persist between harness invocations, e.g. to retain handles to hardware accelerators.
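As an illustration of the intended semantics (a sketch of ours, not the code LiLAC actually emits), persistent variables behave like a function-local static object whose constructor runs the BeforeFirstExecution code and whose destructor runs the AfterLastExecution code:

    #include <cusparse.h>

    // Sketch only: a persistent cuSPARSE handle, retained across harness calls.
    struct PersistentState {
      cusparseHandle_t handle;
      PersistentState()  { cusparseCreate(&handle); }   // BeforeFirstExecution
      ~PersistentState() { cusparseDestroy(handle); }   // AfterLastExecution
    };

    cusparseHandle_t persistent_handle() {
      static PersistentState state;  // constructed on the first harness call,
      return state.handle;           // destroyed when the program terminates
    }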
Example
In Figure 7, we see a trivial LiLAC-How program implementing spmv_csr with the Intel MKL library. The actual call to the relevant library function is in line 16. To prepare for that call, there is boilerplate code in lines 7–14 to fill parameter structures.

Critically, there is an additional parameter required by the library that is data-dependent: the number of columns, cols, in the sparse matrix. It is determined at runtime, in lines 2–5, leading to reduced performance. We will avoid this with the data marshaling constructs in the next section.
     1  HARNESS mkl IMPLEMENTS spmv_csr
     2  int cols = 0;
     3  for(int i = 1; i < rowstr[rows]; i++)
     4      cols = colidx[i]>cols?colidx[i]:cols;
     5  cols = cols+1;
     6
     7  sparse_matrix_t A;
     8  mkl_sparse_d_create_csr(&A,
     9      SPARSE_INDEX_BASE_ZERO, rows, cols,
    10      rowstr, rowstr+1, colidx, a);
    11  struct matrix_descr dscr;
    12  dscr.type = SPARSE_MATRIX_TYPE_GENERAL;
    13  dscr.mode = SPARSE_FILL_MODE_LOWER;
    14  dscr.diag = SPARSE_DIAG_NON_UNIT;
    15
    16  mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE,
    17      1.0, A, dscr, iv, 0.0, ov);
    18  PersistentVariables
    19  CppHeaderFiles "mkl.h"

Figure 7. This LiLAC-How program implements spmv_csr naïvely with Intel MKL. Performance is degraded because of lines 2–5. Figure 9 will present a solution to this bottleneck.
    INPUT CudaRead
    cudaMemcpy(out, in, sizeof(type_in)*size,
               cudaMemcpyHostToDevice);
    BeforeFirstExecution
    cudaMalloc(&out, sizeof(type_in)*size);
    AfterLastExecution
    cudaFree(out);

Figure 8. LiLAC-How code to provide efficient automatic data marshaling between the host and the CUDA accelerator.
    INPUT Maximum
    out = in[0];
    for(int i = 1; i < size; i++)
        out = in[i]>out?in[i]:out;
    out = out+1;

Figure 9. INPUT can also be used to specify data-dependent computations that are only recalculated when necessary.
     1  HARNESS cuda IMPLEMENTS spmv_csr
     2  double alpha = 1.0;
     3  double beta = 0.0;
     4  cusparseMatDescr_t descrA;
     5  cusparseCreateMatDescr(&descrA);
     6  cusparseDcsrmv(handle,
     7      CUSPARSE_OPERATION_NON_TRANSPOSE,
     8      rows, cols, ranges[rows], &alpha,
     9      descrA, d_mat, d_ranges, d_indir,
    10      d_vec, &beta, d_out);
    11  Marshaling
    12  int cols = Maximum of indir[0..ranges[rows]]
    13  double* d_mat = CudaRead of matrix[0..ranges[rows]]
    14  double* d_vec = CudaRead of vector[0..cols]
    15  int* d_ranges = CudaRead of ranges[0..rows+1]
    16  int* d_indir = CudaRead of indir[0..ranges[rows]]
    17  double* d_out = CudaWrite of output[0..rows]
    18  PersistentVariables
    19  cusparseHandle_t handle
    20  BeforeFirstExecution
    21  cusparseCreate(&handle);
    22  AfterLastExecution
    23  cusparseDestroy(handle);
    24  CppHeaderFiles
    25  "cusparse.h"

Figure 10. This LiLAC-How specification implements an efficient SPMV harness using cuSPARSE in 25 lines of code.
Heterogeneous accelerators require data transfers to keep memory consistent between host device and accelerator. To achieve the best performance, these have to be minimized. Importantly, unchanged data should never be copied again. This requires program-wide analysis that is not available statically. LiLAC-How uses memory protection to implement this at runtime with minimal overhead by capturing read and write accesses to memory ranges. The same mechanism is used to cache data-dependent invariants across several invocations, such as cols in Figure 7.

Data marshaling routines are bound to ranges of memory in the harness. In the specification, the underlying array is available using the identifiers in, size, and out.

In Figure 8, the cudaMemcpy function from NVIDIA CUDA is integrated with LiLAC-How. It is used to copy data from the host to the accelerator. For this to work, it first needs to allocate memory on the device using cudaMalloc, which is later freed with cudaFree. Minimal memory transfers are obtained by executing cudaMemcpy only when a value in the array changes.

We can use the same construct to efficiently compute values such as the cols variable in Figure 7, as shown in Figure 9. The optimized implementation is derived from Figure 7, lines 2–5. However, instead of the concrete variable names, the reserved identifiers in, size, and out are used.

Figure 10 shows an spmv_csr LiLAC-How program for the cuSPARSE library. A number of data marshaling variables are introduced in lines 12–17 that automatically optimize both memory transfers and the computation of the cols variable. The core of the harness in lines 2–10 is again nothing more than library-specific boilerplate C++ code.
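The mechanics of the memory protection approach can be pictured with the following sketch (our simplified illustration, not LiLAC's generated code): the tracked pages are write-protected, so the first write after each harness call faults, and the signal handler records the change before letting the write proceed.

    #include <csignal>
    #include <sys/mman.h>

    // Sketch of mprotect-based dirty tracking for one memory range.
    // A production version must forward unrelated faults to the default
    // handler and manage many ranges at once.
    static volatile sig_atomic_t dirty = 0;
    static char*  base = nullptr;  // page-aligned start of the tracked range
    static size_t len  = 0;        // length, rounded up to whole pages

    static void on_write(int, siginfo_t* info, void*) {
      char* p = static_cast<char*>(info->si_addr);
      if (p >= base && p < base + len) {
        dirty = 1;                                    // the data has changed
        mprotect(base, len, PROT_READ | PROT_WRITE);  // let the write proceed
      }
    }

    void track(char* page_aligned_ptr, size_t num_bytes) {
      base = page_aligned_ptr;
      len  = num_bytes;
      struct sigaction sa = {};
      sa.sa_sigaction = on_write;
      sa.sa_flags = SA_SIGINFO;
      sigaction(SIGSEGV, &sa, nullptr);
      mprotect(base, len, PROT_READ);   // arm: the next write faults
    }

    bool needs_transfer() {             // queried at each harness invocation
      bool was_dirty = dirty;
      dirty = 0;
      if (was_dirty)
        mprotect(base, len, PROT_READ); // re-arm for the next interval
      return was_dirty;
    }

The same dirty flag can guard the recomputation of cached invariants such as cols, so that the Maximum computation of Figure 9 only runs when the underlying array was actually written.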
4 Implementation

The LiLAC system, as shown in Figure 2, is entirely integrated into the LLVM build system. When LLVM is compiled, the LiLAC specification is parsed using a Python program. Based on the LiLAC-What and LiLAC-How sections, C++ code is generated that is automatically incorporated into LLVM in further stages of the build process.

The result is an LLVM optimization pass that is available when linking LLVM with the clang C/C++ compiler. This pass performs the discovery of linear algebra code and the insertion of harness calls. Furthermore, the harness libraries themselves are built at compile time of LLVM, using C++ code emitted from the LiLAC-How sections.

The two crucial implementation details are therefore the following: Firstly, how automatic detection functionality in C++ is generated from the LiLAC-What specifications. Secondly, how the LiLAC-How sections are used to generate fast C++ implementations of the specified library harnesses.
The parsed LiLAC-What sections are turned into C++ functions that recognize places for harness call insertions in an LLVM pass. This builds on previous work via a formulation in CAnDL [21]. Detection is done on optimized compiler intermediate representation. Standard -O2 optimizations, excluding loop unrolling and vectorization, normalize the intermediate code. Optimizations minimize programming language-specific artifacts and the impact of syntax-level programmer decisions.

The effect is demonstrated in Figure 11, which shows three implementations of a dot product in different languages: C, C++, and FORTRAN. After translating to LLVM IR and performing optimizations, the dot product is recognized in the LiLAC system using the same LiLAC-What specification.

The detection comprises two steps, as demonstrated in Figure 13. Firstly, the control flow skeleton is recognized. This is simple, as LiLAC-What can only express control flow in the form of loop nests of a certain depth. After candidate loop nests have been identified, the index and loop range calculations from LiLAC-What are mapped onto the LLVM IR nodes. This is done via a backtracking search procedure and allows robust detection across many syntactically different input programs, as described in [21, 22].
For detecting instances of LiLAC-What specifications in user programs, LLVM IR segments that match the control flow skeleton are identified. These control flow candidates are then processed with a backtracking search algorithm.

All ⟨exp⟩ expressions in the LiLAC-What program are identified. These have to be assigned instructions or other values from the LLVM IR segment. Those top-level ⟨exp⟩ expressions that are used as limits or iterators in ⟨range⟩ expressions are easily connected with the corresponding loop boundaries in the control flow candidates.

The remaining expressions are successively assigned by backtracking. Consider the example in Figure 12, which shows a candidate loop from the LLVM IR generated from the C++ dot product code in Figure 11. The iteration space is determined by loop analysis, and this immediately allows us to assign the iterator and range in Figure 13 on the left. The LLVM IR values that correspond to a[i], a, b[i], b, a[i]*b[i] and result are then searched for. When a partial solution fails, the algorithm backtracks. This happens in the example once, when no suitable multiplication can be found in step 5. If no complete solution can be determined, the control flow candidate is discarded.

LiLAC-What:

    COMPUTATION dotproduct
    result = sum(0 <= i < length)
             a[i] * b[i];

C:

    int i = 0;
    while (i < N) {
        x += (*(A+i))*(*(B+i));
        i += 1;
    }

C++:

    for (int i = 0; i < vec_a.size(); i++)
        x += vec_a[i]*vec_b[i];

FORTRAN:

    DO I = 1, N, 1
        X = X + A(i)*B(i)
    END DO

Figure 11. Syntactically different computations in C, C++, or FORTRAN are captured by one LiLAC-What specification.

[Figure 12: LLVM IR listing of the candidate dot product loop.]

Figure 12. LiLAC intercepts LLVM IR after optimizations. This ensures normalized and language-independent features.

    from loop analysis:     backtracking steps:
    i      ← %18            a[i]      ← %21
    length ← %14            a         ← %9
                            b[i]      ← %21, %23
                            b         ← %9, %12
                            a[i]*b[i] ← fail!, %24
                            result    ← %25

Figure 13. After finding a candidate loop and receiving some variables from loop analysis (left), the backtracking solver attempts to assign the remaining variables one by one (right).

Each loop nest that matches a LiLAC-What specification is replaced with a harness call. To minimize the invasiveness of our pass, this is performed as follows: Firstly, a harness call is inserted directly before the loop. The function call arguments are selected from the backtracking result and passed to the harness. Secondly, the LLVM instruction that stores the result of the computation or passes it out of the loop as a phi node is removed. The remainder of the loop nest is removed automatically by dead code elimination.
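The backtracking search described above can be summarized by the following C++ sketch (our simplification: the generated detection functions assign real LLVM IR values and derive their consistency checks from the LiLAC-What specification; the string-based solution map and the consistent callback here are placeholders):

    #include <cstddef>
    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    using Solution = std::map<std::string, std::string>;

    // Map each remaining LiLAC-What expression to one IR value such that
    // the partial solution stays consistent; undo and retry on failure.
    bool assign(const std::vector<std::string>& exprs, std::size_t next,
                const std::vector<std::string>& ir_values,
                const std::function<bool(const Solution&)>& consistent,
                Solution& solution) {
      if (next == exprs.size())
        return true;                    // every expression has been mapped
      for (const auto& v : ir_values) {
        solution[exprs[next]] = v;      // tentative assignment
        if (consistent(solution) &&
            assign(exprs, next + 1, ir_values, consistent, solution))
          return true;
        solution.erase(exprs[next]);    // backtrack, as in step 5 of Figure 13
      }
      return false;                     // this candidate loop nest is discarded
    }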
LiLAC-How syntax elements that take C++ code generate generic functions, and template parameter deduction inserts concrete types during the compilation process.

In Figure 14, we see the correspondence between generated C++ template functions and the specification in Figure 8. The three function bodies are directly inserted. The functions are used to specialize the ReadObject class template, which guarantees the following properties via memory protection:

construct is called before the first invocation and when in or size change between consecutive harness invocations.

update is called after construct and if any of the data in the array is changed between consecutive harness invocations.

destruct is called in between consecutive construct calls and before the program terminates.

    template <typename type_in, typename type_out>
    void CudaRead_update(type_in* in, int size, type_out& out) {
        cudaMemcpy(out, in, sizeof(type_in)*size,
                   cudaMemcpyHostToDevice);
    }
    template <typename type_in, typename type_out>
    void CudaRead_construct(int size, type_out& out) {
        cudaMalloc(&out, sizeof(type_in)*size);
    }
    template <typename type_in, typename type_out>
    void CudaRead_destruct(int size, type_out& out) {
        cudaFree(out);
    }
    template <typename type_in, typename type_out>
    using CudaRead = ReadObject

Figure 14. LiLAC uses code from Figure 8 to define three functions that specialize the ReadObject template, which uses mprotect for capturing memory accesses internally.
The LLVM frontend for FORTRAN under active development, flang, is in an unfinished state and produces unconventional LLVM IR code. Significant additional work was required to normalize the IR code. We developed normalization passes in LLVM to overcome the specific shortcomings, enabling FORTRAN programs to be managed as easily as C/C++.

The problems that we encountered included: differing indexing conventions, requiring offsetting pointer variables on a byte granularity with untyped pointers; incompatible intermediate representation types, where all parameters are passed in as i64 pointers, frequently necessitating a pointer type conversion followed by a load from memory; and obfuscated loops with an additional induction variable that counts down instead of up, such that the standard LLVM indvars pass is unable to merge the loop iterators.
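To illustrate the last of these patterns with a minimal example of our own (written in C++ for readability, not actual flang output), a loop that carries a separate down-counting induction variable expresses the same iteration count twice; normalization rewrites it so that a single canonical iterator remains:

    // Before normalization: two induction variables for one loop.
    void before(double* a, const double* b, int len) {
      int n = len;
      for (int i = 0; n > 0; ++i, --n)  // exit governed by the down-counter
        a[i] += b[i];
    }

    // After normalization: one canonical induction variable, which later
    // passes such as loop analysis and detection can reason about directly.
    void after(double* a, const double* b, int len) {
      for (int i = 0; i < len; ++i)
        a[i] += b[i];
    }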
5 Experimental Setup

We wrote short LiLAC programs for a collection of linear algebra libraries and applied our approach to a chemical simulation application, two graph analytics applications and a collection of standard benchmark suites.
Libraries
We selected four different libraries for sparse linear algebra functions. These were: Intel MKL [3], Nvidia cuSPARSE [5], clSPARSE [2] and SparseX [19]. MKL is a general-purpose mathematical library, while clSPARSE and cuSPARSE are OpenCL and CUDA implementations of sparse linear algebra designed to be executed on the GPU, and SparseX uses an auto-tuning model and code generation to optimize sparse operations on particular matrices.
    Name     Hardware                 Libraries
    Intel-0  2× Intel Xeon E5-2620    MKL, cuSPARSE,
             Nvidia Tesla K20 GPU     clSPARSE, SparseX
    Intel-1  Intel Core i7-8700K      MKL, cuSPARSE,
             Nvidia GTX 1080 GPU      clSPARSE, SparseX
    AMD      AMD A10-7850K            cuSPARSE,
             AMD Radeon R7 iGPU       clSPARSE ×2,
             Nvidia Titan X GPU       SparseX

Table 1. Evaluated platforms and library harnesses; AMD-0 supports clSPARSE on both its internal and its external GPU.
Applications
To evaluate the impact of LiLAC in a real-world context, we used the pathsample physical chemistry simulation suite, a large FORTRAN legacy application [56] consisting of over 40,000 lines of code. Recent work shows that applications in this area are amenable to acceleration using sparse linear algebra techniques [53], and pathsample provides a useful example of this. We also evaluated two modern C++ graph analytics kernels (BFS and PageRank [11, 15]). pathsample was run in two different modes and three different levels of pruning, in each case using a system of 38 atoms [18] commonly used to evaluate applications in this domain. The graph kernels were run against 10 matrices from the University of Florida sparse matrix collection [14], with sizes between 300K and 80M non-zero elements.

For completeness, and to validate that our LiLAC-generated implementations were correct, we also applied our technique to sparse programs from standard benchmark suites: CG from the NAS parallel benchmarks [9], spmv from Parboil [48] and the Netlib sparse benchmark suites [17]. Each benchmark suite was run using their supplied inputs.
Platforms
We evaluated our approach across 3 different machines with varying hardware performance and software availability. Each one was only compatible with a subset of our LiLAC-generated implementations; a summary of these machines is given in Table 1.
6 Results

We first present raw performance impact; then we analyze two intermediate metrics: reliability of linear algebra discovery and effectiveness of memory transfer optimizations.
LiLAC achieves significant speedups on real applications as well as benchmarks, as shown in Figure 15. Baselines were compiled with -O2 using the same version of clang without LiLAC extensions. Higher optimization levels (-O3) had a negligible impact on performance. Different platforms and applications profit from different libraries (subsection 6.2). Speedup ranges from 1.1× on the scientific application codes to 12× on well-known sparse benchmark programs.
[Figure 15: grouped bar charts for PFold, NGT, Parboil SPMV, BFS, NPB-CG, PageRank, Netlib C, and Netlib Fortran, each with bars for AMD, Intel-0, and Intel-1; hatchings distinguish the CL eGPU, MKL, cuSPARSE, SparseX, and Native implementations.]

Figure 15. Evaluation on real-world applications and well-known benchmarks: Bars show the geomean speedup of the best-performing LiLAC harness across the set of input examples for each program and platform. Hatchings encode the selected implementations. The baseline is the identical source code compiled with clang -O2, yielding sequential CPU-only programs.
Applications
On the pathsample applications (PFold and NGT), we measured consistent speedups of approximately 50% and 10% respectively across all 3 platforms. For large applications, Amdahl's law is a severe limitation for approaches like ours; other parts of the applications dominate execution times when linear algebra is accelerated.
Graph kernels
PageRank requires a large number of SPMV calls using the same input matrix to iterate until convergence. The GPU implementations running on AMD and Intel-1 take advantage of data remaining in memory. The larger number of CPU cores and slower GPU available on Intel-0 make MKL its best-performing implementation. CPU implementations perform best on BFS by avoiding memory copies entirely; on AMD, SparseX outperforms GPU implementations.
Benchmarks
LiLAC achieves speedups of up to 12× on standard sparse linear algebra benchmarks. The impact is independent of the source language, as the C and FORTRAN versions of the Netlib benchmark demonstrate. LiLAC is able to achieve consistent, useful speedups across a variety of hardware configurations.

Dense
We evaluated on some dense benchmarks as well. In line with the literature, dense code is very amenable to heterogeneous acceleration. We achieve a 20× speedup on Parboil sgemm by inserting LiLAC-harnessed calls into the sequential baseline. However, as impressive heterogeneous speedups on dense code are well explored in the literature, we focus on sparse.

Comparison to Expert
NPB and Parboil contain expert-written alternative versions with GPU acceleration. This allowed the evaluation of LiLAC against heterogeneous code reaching close to peak performance, shown in Figure 16. While the expert version of NPB-CG is substantially faster, this is not due to an improved sparse linear algebra operation, but a complete parallelization and rewrite of the program for the GPU. In Parboil SPMV, the expert version focuses on improved sparse linear algebra. Here the difference between an expert and LiLAC is only 1.07×.

[Figure 16: bar chart of LiLAC performance as a fraction of the expert implementations of NPB-CG and Parboil, together with the following table of modified lines of code.]

    Benchmark     Modified LoC
                  LiLAC    Expert
    NPB-CG        0 (44)   1948
    Parboil SPMV  0 (44)   261

Figure 16. LiLAC performance as fraction of expert version performance. We achieve good speedup with no application programmer effort (measured as required LoC change). The LiLAC required code, identical across programs, is in parentheses. Amdahl's Law limits our impact on NPB-CG.

Productivity
The bottom of Figure 16 shows the amount of code modified in order to add heterogeneous acceleration manually vs. with LiLAC. This demonstrates the productivity improvements for application programmers. No lines of user code need to be modified using LiLAC, while both expert versions require significant application rewrites. Only 44 lines of application-independent LiLAC code are required.
The relative performance of different accelerator libraries is highly dependent on the application, problem size, and platform, as Figure 17 shows.
                              PFold             NGT               PageRank                BFS
    Platform  Implementation  L0    L1    L2    L0    L1    L2    Erdos LJ-2008 Road      Erdos LJ-2008 Road
    AMD       cuSPARSE        1.38  1.18  0.67  0.69  0.69  0.70  -     -
    Intel-0   MKL / cuSPARSE  0.75  0.60  0.45  0.66  0.66  0.66  1.39  1.00    3.32      0.87  1.74    1.28
              clSPARSE        0.90  0.75  0.46  0.81  0.79  0.78  1.24  0.95    2.24      0.13  1.45    0.07
              SparseX         -     -     -     -     -     -     -     -       -         1.19  -       -
    Intel-1   MKL / cuSPARSE  0.48  0.41  0.30  0.68  0.69  0.68

Table 2. LiLAC speedups on each platform, across different applications and problem sizes. SparseX demonstrated promising performance on some applications, but we were unable to evaluate on every relevant instance due to instability. Implementation with best geomean speedup per benchmark and platform is bold.

[Figure 17: one column of points per platform (AMD, Intel-0, Intel-1); points show the speedups of the cuSPARSE, CL iGPU, CL eGPU, SparseX, and MKL implementations.]

Figure 17. Distribution of speedups on NPB-CG. The stacks within each of the three columns are sorted by problem size; each point shows the speedup of a specific implementation.

Table 2 has more detailed data. The best-performing implementation varies considerably, depending on characteristics of the problem in question. No accelerator library performs well reliably; each harness outperforms any other harness on some combination of data set and platform. For some small problem sizes, hardware acceleration is not profitable. Those slowdowns are due to inherent overheads, not LiLAC.
Our implementation of LiLAC relies on a non-trivial data marshaling system that prevents redundant computations and memory transfers. We present performance results that show the importance and effectiveness of this system.

We repeated our experiments, using the best-performing implementations from Figure 15. Instead of using the data marshaling scheme, we recompute and transfer memory naively for each invocation. The results are in Figure 18. Across the best AMD versions of PFold, NGT, PageRank and BFS (where accelerators are profitable with marshaling), only PageRank achieves a significant speedup naively.

For BFS, the naive approach leads to drastic performance degradation; the marshaling version is 25× faster. This is because SparseX performs an internal matrix tuning phase that is far more expensive than a memory copy. For the other three programs, there is a factor of 1.4–3.5× between the naive and the smart version.

[Figure 18: speedup over the sequential baseline for PFold (CL eGPU), NGT (CL eGPU), PageRank (cuSPARSE), and BFS (SparseX), comparing naive library calls against LiLAC.]

Figure 18. LiLAC vs. naïve library calls without marshaling optimizations, speedup over sequential baseline: Advanced marshaling features of LiLAC are critical for performance.
For performance impact, LiLAC needs to first detect linear algebra computations. Previous results already implied that this works reliably, and Table 3 reiterates this. All relevant sparse matrix-vector multiplications were recognized.

Established approaches, like the polyhedral model, are unable to model sparse linear algebra, as verified with the Polly compiler. Similarly, the Intel C/C++ and FORTRAN compilers fail to auto-parallelize, as they cannot reason about sparsity and have to assume additional dependencies.

These results show the novelty of the abilities of LiLAC rather than implementation weaknesses of Polly and ICC, as neither was designed for accelerating sparse computations.
Table 3. Sparsity does not fit the polyhedral model; Polly is not available for FORTRAN; Intel compilers fail to parallelize sparse. Only LiLAC detects sparse linear algebra reliably.

    Benchmark       LiLAC  Polly    Intel icc/ifort
    PFold           CSR    -        parallel dependence
    NGT             CSR    -        parallel dependence
    Parboil-SPMV    JDS    no SCoP  parallel dependence
    BFS             CSR    no SCoP  parallel dependence
    NPB-CG          CSR    -        parallel dependence
    PageRank        CSR    no SCoP  parallel dependence
    Netlib C        CSR    no SCoP  parallel dependence
    Netlib Fortran  CSR    -        parallel dependence
7 Related Work

Compiler centric linear algebra optimization
Compiler management of indirect memory accesses was first examined using an inspector-executor model for distributed-memory machines [10]. The location of read data was discovered at runtime and appropriate communication inserted. Later work focused on efficient runtime dependence analysis and the parallelization of more general programs [20, 38, 41, 49]. However, the performance achieved is modest due to runtime overhead and falls well short of library performance. More recent work developed equality constraints and subset relations that help reduce the runtime overhead [32].

The polyhedral model is an established compiler approach for modeling data dependencies [12, 24, 26, 42, 47]. Such an approach has been implemented in optimizing compilers, such as the Polly extensions to LLVM [16]. Recent work has extended the polyhedral model beyond affine programs to some forms of sparsity with the PENCIL extensions [8]. These can be used to model important features of sparse linear algebra, such as counted loops [58], i.e. loops with dynamic, memory-dependent bounds but statically known strides. Such loops are central to sparse linear algebra. The PPCG compiler [54] can detect relevant code regions, but it relies on well-behaved C code with all arrays declared in variable-length C99 array syntax. This excludes most real-world programs; nothing in our evaluation fits this structure.

The Apollo system [51] integrates thread level speculation with the polyhedral model, allowing its application to sparse linear algebra. However, it requires sub-parts of the computation to perform dense accesses at runtime. Similar approaches [7] also require regular sub-computations.
Compiler detection
Previous work has detected code structures in compilers using constraint programming. Early work was based on abstract computation graphs [37], but more recent approaches have used compiler intermediate code and made connections to the polyhedral model [21].

In [22], the authors implement a method that operates on SSA intermediate representation. It uses a general-purpose low-level constraint programming language aimed at compiler engineers. The paper focuses on code detection, with manual data marshaling. Recent work [13] uses type-guided program synthesis to model library routines, which are then detected by a solver. Again, data marshaling is not taken into account.

Other advanced approaches to extracting higher-level structures from assembly and well-structured FORTRAN code involve temporal logic [27, 31]. These approaches tend to focus on a more restricted set of computations (dense memory access). While this allows formal reasoning about correctness, it is too restrictive to model sparse linear algebra.
Domain-Specific Languages
There have been multiple domain-specific libraries proposed to formulate linear algebra computations. Many of these contain some degree of autotuning functionality to achieve good performance across different platforms [50]. Halide [39, 52] was designed for image processing. Its core design decision is the scheduling model that allows the separation of the computation schedule and the actual computation. There has been work on automatically tuning the schedules [35], but in general the computational burden is put on the application programmer.

The SLinGen [45] compiler takes a program expressed in the custom LA language, inspired by standard mathematical notation. It then implements custom code generation for the expressed calculations, with a focus on small, fixed-size operands. This is built on top of building blocks provided by previous work on LGen [46]. The approach outperforms libraries focused on large data sizes, but is unable to utilize heterogeneous compute and requires program rewrites.
Libraries
The most established way of encapsulating fast linear algebra routines is via numeric libraries, generally based on the BLAS interface [6]. These are generally very fast on specific hardware platforms, but require application programmer effort and offer little performance portability. Implementations of dense linear algebra are available for most suitable hardware platforms, such as cuBLAS [4] for NVIDIA GPUs, clBLAS [1] for AMD GPUs and the Intel MKL library [3] for Intel CPUs and accelerators.

Fast implementations of sparse linear algebra are fewer, but they exist for the most important platforms, including cuSPARSE [5] and clSPARSE [2]. There have been several BLAS implementations that attempt platform-independent acceleration and heterogeneous compute [33, 34, 57].
CPU-GPU data transfer optimizations
Data transfers between CPU and GPU have been studied extensively as an important bottleneck for parallelization efforts. Previous work [25, 30] established systems for automatic management of CPU-GPU communication. The authors of [29] implemented a system to move OpenMP code to GPUs, optimizing data transfers using data flow analysis. However, this approach performs a direct translation, not optimizing the code for the specific performance characteristics of GPUs.
8 Conclusion

This paper presented LiLAC, a language and compiler that enables existing codebases to exploit sparse (and dense) linear algebra accelerators. No effort is required from the application programmer. Instead, the library implementer provides a specification, which LiLAC uses to automatically and efficiently match user code to high-performance libraries.

We demonstrated this approach on C, C++, and FORTRAN benchmarks as well as legacy applications, and have shown significant performance improvements across platforms and data sets. In future work, we will investigate how our framework can be adapted to other application domains, enabling effort-free access to an even larger set of accelerator libraries.
References
[1] [n. d.]. clMathLibraries clBLAS. https://github.com/clMathLibraries/clBLAS.
[2] [n. d.]. clMathLibraries clSPARSE. https://github.com/clMathLibraries/clSPARSE.
[3] [n. d.]. Intel Math Kernel Library (MKL). https://software.intel.com/mkl.
[4] [n. d.]. NVIDIA cuBLAS. https://developer.nvidia.com/cublas.
[5] [n. d.]. NVIDIA CUDA Sparse Matrix library (cuSPARSE). https://developer.nvidia.com/cusparse.
[6] 2002. An Updated Set of Basic Linear Algebra Subprograms (BLAS). ACM Trans. Math. Softw. 28, 2 (June 2002), 135–151. https://doi.org/10.1145/567806.567807
[7] Travis Augustine, Janarthanan Sarma, Louis-Noël Pouchet, and Gabriel Rodríguez. 2019. Generating Piecewise-Regular Code from Irregular Structures. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019). Association for Computing Machinery, New York, NY, USA, 625–639. https://doi.org/10.1145/3314221.3314615
[8] R. Baghdadi, U. Beaugnon, A. Cohen, T. Grosser, M. Kruse, C. Reddy, S. Verdoolaege, A. Betts, A. F. Donaldson, J. Ketema, J. Absar, S. v. Haastregt, A. Kravets, A. Lokhmotov, R. David, and E. Hajiyev. 2015. PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming. In 2015 International Conference on Parallel Architecture and Compilation (PACT). 138–149. https://doi.org/10.1109/PACT.2015.17
[9] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks – Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91). ACM, New York, NY, USA, 158–165. https://doi.org/10.1145/125826.125925
[10] D. Baxter, R. Mirchandaney, and J. H. Saltz. 1989. Run-time Parallelization and Scheduling of Loops. In Proceedings of the First Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '89). ACM, New York, NY, USA, 303–312. https://doi.org/10.1145/72935.72967
[11] Scott Beamer, Krste Asanović, and David Patterson. 2015. The GAP Benchmark Suite. (08 2015).
[12] Lam Chi-Chung, P. Sadayappan, and Rephael Wenger. 1997. On optimizing a class of multi-dimensional loops with reduction for parallel execution. Parallel Processing Letters 7, 02 (1997), 157–168.
[13] B. Collie, P. Ginsbach, and M. F. P. O'Boyle. 2019. Type-Directed Program Synthesis and Constraint Generation for Library Portability. In 2019 International Conference on Parallel Architectures and Compilation Techniques (PACT). 55–67. https://doi.org/10.1109/PACT.2019.00013
[14] Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38, 1, Article 1 (Dec. 2011), 25 pages. https://doi.org/10.1145/2049662.2049663
[15] Camil Demetrescu, Andrew V. Goldberg, and David S. Johnson. [n. d.]. The Shortest Path Problem: Ninth DIMACS Implementation Challenge. Vol. 74. American Mathematical Soc.
[16] Johannes Doerfert, Kevin Streit, Sebastian Hack, and Zino Benaissa. 2015. Polly's Polyhedral Scheduling in the Presence of Reductions. CoRR abs/1505.07716 (2015). http://arxiv.org/abs/1505.07716
[17] Jack Dongarra, Victor Eijkhout, and Henk van der Vorst. 2001. An Iterative Solver Benchmark. Sci. Program. 9, 4 (Dec. 2001), 223–231. https://doi.org/10.1155/2001/527931
[18] Jonathan P. K. Doye, Mark A. Miller, and David J. Wales. 1999. The double-funnel energy landscape of the 38-atom Lennard-Jones cluster. The Journal of Chemical Physics. https://doi.org/10.1063/1.478595
[19] Athena Elafrou, Vasileios Karakasis, Theodoros Gkountouvas, Kornilios Kourtis, Georgios Goumas, and Nectarios Koziris. 2018. SparseX: A Library for High-Performance Sparse Matrix-Vector Multiplication on Multicore Platforms. ACM Trans. Math. Softw. 44, 3 (Jan. 2018), 26:1–26:32. https://doi.org/10.1145/3134442
[20] Allan L. Fisher and Anwar M. Ghuloum. 1994. Parallelizing complex scans and reductions. In ACM SIGPLAN Notices, Vol. 29. ACM, 135–146.
[21] Philip Ginsbach, Lewis Crawford, and Michael F. P. O'Boyle. 2018. CAnDL: A Domain Specific Language for Compiler Analysis. In Proceedings of the 27th International Conference on Compiler Construction (CC 2018). ACM, New York, NY, USA, 151–162. https://doi.org/10.1145/3178372.3179515
[22] Philip Ginsbach, Toomas Remmelg, Michel Steuwer, Bruno Bodin, Christophe Dubach, and Michael F. P. O'Boyle. 2018. Automatic Matching of Legacy Code to Heterogeneous APIs: An Idiomatic Approach. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 139–153. https://doi.org/10.1145/3173162.3173182
[23] Tobias Grosser, Armin Größlinger, and Christian Lengauer. 2012. Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation. Parallel Processing Letters 22, 4 (2012). https://doi.org/10.1142/S0129626412500107
[24] Gautam Gupta and Sanjay V. Rajopadhye. 2006. Simplifying reductions. In POPL, Vol. 6. 30–41.
[25] Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson, Stephen R. Beard, and David I. August. 2011. Automatic CPU-GPU Communication Management and Optimization. SIGPLAN Not. 46, 6 (June 2011), 142–151. https://doi.org/10.1145/1993316.1993516
[26] Pierre Jouvelot and Babak Dehbonei. 1989. A unified semantic approach for the vectorization and parallelization of generalized reductions. In Proceedings of the 3rd International Conference on Supercomputing. ACM, 186–194.
[27] Shoaib Kamil, Alvin Cheung, Shachar Itzhaky, and Armando Solar-Lezama. 2016. Verified Lifting of Stencil Computations. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '16). ACM, New York, NY, USA, 711–726. https://doi.org/10.1145/2908080.2908117
[28] Jeremy Kepner, David A. Bader, Aydın Buluç, John R. Gilbert, Timothy G. Mattson, and Henning Meyerhenke. 2015. Graphs, Matrices, and the GraphBLAS: Seven Good Reasons. In ICCS.
[29] Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. 2009. OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization. SIGPLAN Not. 44, 4 (Feb. 2009), 101–110. https://doi.org/10.1145/1594835.1504194
[30] Christos Margiolas and Michael F. P. O'Boyle. 2014. Portable and Transparent Host-Device Communication Optimization for GPGPU Environments. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14). Association for Computing Machinery, New York, NY, USA, 55–65. https://doi.org/10.1145/2544137.2544156
[31] Charith Mendis, Jeffrey Bosboom, Kevin Wu, Shoaib Kamil, Jonathan Ragan-Kelley, Sylvain Paris, Qin Zhao, and Saman Amarasinghe. 2015. Helium: Lifting High-performance Stencil Kernels from Stripped x86 Binaries to Halide DSL Code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '15). ACM, New York, NY, USA, 391–402. https://doi.org/10.1145/2737924.2737974
[32] Mahdi Soltan Mohammadi, Tomofumi Yuki, Kazem Cheshmi, Eddie C. Davis, Mary Hall, Maryam Mehri Dehnavi, Payal Nandy, Catherine Olschanowsky, Anand Venkat, and Michelle Mills Strout. 2019. Sparse Computation Data Dependence Simplification for Efficient Compiler-generated Inspectors. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019). ACM, New York, NY, USA, 594–609. https://doi.org/10.1145/3314221.3314646
[33] Ana Moreton-Fernandez, Arturo Gonzalez-Escribano, and Diego Ferraris. 2017. Multi-device Controllers: A Library to Simplify Parallel Heterogeneous Programming. (12 2017).
[34] Ana Moreton-Fernandez, Eduardo Rodriguez-Gutiez, Arturo Gonzalez-Escribano, and Diego R. Llanos. 2017. Supporting the Xeon Phi Coprocessor in a Heterogeneous Programming Model. In Euro-Par 2017: Parallel Processing, Francisco F. Rivera, Tomás F. Pena, and José C. Cabaleiro (Eds.). Springer International Publishing, Cham, 457–469.
[35] Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically Scheduling Halide Image Processing Pipelines. ACM Trans. Graph. 35, 4, Article 83 (July 2016), 11 pages. https://doi.org/10.1145/2897824.2925952
[36] Dorit Nuzman, Sergei Dyshel, Erven Rohou, Ira Rosen, Kevin Williams, David Yuste, Albert Cohen, and Ayal Zaks. 2011. Vapor SIMD: Auto-vectorize Once, Run Everywhere. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '11). IEEE Computer Society, Washington, DC, USA, 151–160. http://dl.acm.org/citation.cfm?id=2190025.2190062
[37] Shlomit S. Pinter and Ron Y. Pinter. 1994. Program optimization and parallelization using idioms. ACM Transactions on Programming Languages and Systems (TOPLAS) 16, 3 (1994), 305–327.
[38] Bill Pottenger and Rudolf Eigenmann. 1995. Idiom recognition in the Polaris parallelizing compiler. In Proceedings of the 9th International Conference on Supercomputing. ACM, 444–448.
[39] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. SIGPLAN Not. 48, 6 (June 2013), 519–530. https://doi.org/10.1145/2499370.2462176
[40] Gil Rapaport, Ayal Zaks, and Yosi Ben-Asher. 2015. Streamlining Whole Function Vectorization in C Using Higher Order Vector Semantics. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW '15). IEEE Computer Society, Washington, DC, USA, 718–727. https://doi.org/10.1109/IPDPSW.2015.37
[41] Lawrence Rauchwerger and David A. Padua. 1999. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. IEEE Transactions on Parallel and Distributed Systems 10, 2 (1999), 160–180.
[42] Xavier Redon and Paul Feautrier. 1994. Scheduling reductions. In Proceedings of the 8th International Conference on Supercomputing. ACM, 117–125.
[43] Y. Saad. 1989. Krylov Subspace Methods on Supercomputers. SIAM J. Sci. Statist. Comput. 10, 6 (1989), 1200–1232. https://doi.org/10.1137/0910073
[44] Y. Saad. 2003. Iterative Methods for Sparse Linear Systems (second ed.). Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898718003
[45] Daniele G. Spampinato, Diego Fabregat-Traver, Paolo Bientinesi, and Markus Püschel. 2018. Program Generation for Small-Scale Linear Algebra Applications. In International Symposium on Code Generation and Optimization (CGO). 327–339.
[46] Daniele G. Spampinato and Markus Püschel. 2014. A Basic Linear Algebra Compiler. In International Symposium on Code Generation and Optimization (CGO). 23–32.
[47] Kevin Stock, Martin Kong, Tobias Grosser, Louis-Noël Pouchet, Fabrice Rastello, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2014. A framework for enhancing data reuse via associative reordering. In ACM SIGPLAN Notices, Vol. 49. ACM, 65–76.
[48] John Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Daniel Geng, Wen-Mei Liu, and Wen-mei Hwu. 2018. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. (Nov. 2018).
[49] Toshio Suganuma, Hideaki Komatsu, and Toshio Nakatani. 1996. Detection and global optimization of reduction operations for distributed parallel machines. In Proceedings of the 10th International Conference on Supercomputing. ACM, 18–25.
[50] Arvind K. Sujeeth, Kevin J. Brown, Hyoukjoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. 2014. Delite: A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages. ACM Trans. Embed. Comput. Syst. 13, 4s, Article 134 (April 2014), 25 pages. https://doi.org/10.1145/2584665
[51] Aravind Sukumaran-Rajam and Philippe Clauss. 2015. The Polyhedral Model of Nonlinear Loops. ACM Trans. Archit. Code Optim. 12, 4, Article 48 (Dec. 2015), 27 pages. https://doi.org/10.1145/2838734
[52] Patricia Suriana, Andrew Adams, and Shoaib Kamil. 2017. Parallel Associative Reductions in Halide. In Proceedings of the 2017 International Symposium on Code Generation and Optimization (CGO '17). IEEE Press, Piscataway, NJ, USA, 281–291. http://dl.acm.org/citation.cfm?id=3049832.3049863
[53] Kyle H. Sutherland-Cash, Rosemary G. Mantell, and David J. Wales. 2017. Exploiting sparsity in free energy basin-hopping. Chemical Physics Letters 685 (2017), 288–293. https://doi.org/10.1016/j.cplett.2017.07.081
[54] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral Parallel Code Generation for CUDA. ACM Trans. Archit. Code Optim. 9, 4, Article 54 (Jan. 2013), 23 pages. https://doi.org/10.1145/2400682.2400713
[55] Richard Vuduc, James W. Demmel, and Katherine A. Yelick. 2005. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series 16 (Jan. 2005), 521–530. https://doi.org/10.1088/1742-6596/16/1/071
[56] David J. Wales. 2002. Discrete path sampling. Molecular Physics. https://doi.org/10.1080/00268970210162691
[57] Linnan Wang, Wei Wu, Zenglin Xu, Jianxiong Xiao, and Yi Yang. 2016. BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing. In Proceedings of the 2016 International Conference on Supercomputing (ICS '16). ACM, New York, NY, USA, Article 20, 11 pages. https://doi.org/10.1145/2925426.2926256
[58] Jie Zhao, Michael Kruse, and Albert Cohen. 2018. A Polyhedral Compilation Framework for Loops with Dynamic Data-dependent Bounds. In Proceedings of the 27th International Conference on Compiler Construction (CC 2018). ACM, New York, NY, USA, 14–24.