Automatic Generation of Efficient Sparse Tensor Format Conversion Routines
Stephen Chou
MIT CSAIL, Cambridge, MA, USA. [email protected]
Fredrik Kjolstad
Stanford University, Stanford, CA, USA. [email protected]
Saman Amarasinghe
MIT CSAIL, Cambridge, MA, USA. [email protected]
Abstract
This paper shows how to generate code that efficiently converts sparse tensors between disparate storage formats (data layouts) such as CSR, DIA, ELL, and many others. We decompose sparse tensor conversion into three logical phases: coordinate remapping, analysis, and assembly. We then develop a language that precisely describes how different formats group together and order a tensor's nonzeros in memory. This lets a compiler emit code that performs complex remappings of nonzeros when converting between formats. We also develop a query language that can extract statistics about sparse tensors, and we show how to emit efficient analysis code that computes such queries. Finally, we define an abstract interface that captures how data structures for storing a tensor can be efficiently assembled given specific statistics about the tensor. Disparate formats can implement this common interface, thus letting a compiler emit optimized sparse tensor conversion code for arbitrary combinations of many formats without hard-coding for any specific combination. Our evaluation shows that the technique generates sparse tensor conversion routines with performance between 1.00 and 2.01× that of hand-optimized versions in SPARSKIT and Intel MKL, two popular sparse linear algebra libraries. And by emitting code that avoids materializing temporaries, which both libraries need for many combinations of source and target formats, our technique outperforms those libraries by 1.78 to 4.01× for CSC/COO to DIA/ELL conversion.

CCS Concepts: • Software and its engineering → Abstraction, modeling and modularity; Source code generation; Domain specific languages; • Mathematics of computing → Mathematical software performance.

Keywords: sparse tensor conversion, sparse tensor assembly, sparse tensor algebra, sparse tensor formats, coordinate remapping notation, attribute query language
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
PLDI ’20, June 15–20, 2020, London, UK. © 2020 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-7613-6/20/06. https://doi.org/10.1145/3385412.3385963
ACM Reference Format:
Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe. 2020. Automatic Generation of Efficient Sparse Tensor Format Conversion Routines. In Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI ’20), June 15–20, 2020, London, UK. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3385412.3385963
Sparse multidimensional arrays (tensors) are suited for representing data in many domains, including data analytics [2, 6], machine learning [41, 46], and others. Countless formats for storing sparse tensors have been developed [5, 7, 8, 10, 13, 14, 23, 27, 34–37, 47, 49, 50, 53, 55, 60, 61] to accelerate kernels like sparse matrix-vector multiplication (SpMV), and new formats are constantly being proposed in recent literature. No format is universally superior in every circumstance, since the ideal format for storing a sparse tensor depends on its structure and sparsity, the operation being performed, and the available hardware. Applications typically need to perform different operations on the same tensor, and each operation may require the tensor to be stored in a distinct format for optimal performance. Importing data into a sparse tensor, for instance, can be done efficiently if the tensor is constructed in the COO format [7] or the DOK format [54], since they support efficient appends or random insertions of new nonzeros. Computing SpMV with the tensor, however, can be done more than twice as fast if the tensor is stored in CSR [55], which compresses out redundant row coordinates and thereby reduces memory traffic [17]. Alternatively, if all of the tensor's nonzeros are clustered along a few dense diagonals, then storing it in DIA [49] minimizes memory traffic even more while exposing vectorization opportunities, further improving SpMV performance by up to 22% as a result [17]. Thus, to optimize the performance of both data import and compute, an application must convert the tensor from COO (or DOK) to DIA (or CSR).
And in applications like preconditioned solvers and sparse neural network training, where a tensor might only be computed with a few times, the conversion must be efficient so that the overhead does not outweigh gains from using an optimized format [20]. General-purpose sparse linear and tensor algebra libraries like SPARSKIT [48] and Intel MKL [24] thus strive to support efficiently converting tensors between as many formats as possible.

Figure 1.
A sparse 4 × … matrix, with rows indexed by I and columns by J.

But as the number ( N ) of supported formats grows, it becomes impractical to manually implement efficient conversion routines for all Θ(N²) combinations of source and target formats. Instead, hand-optimized libraries typically only support direct conversions to and from some arbitrary canonical format (e.g., CSR with SPARSKIT). Thus, to convert a tensor from COO to DIA using SPARSKIT (or Intel MKL), an application must first convert the tensor to CSR and then to DIA. This doubles the number of conversions needed, which is inefficient when converting a tensor even once incurs significant overhead [17, 60]. Worse, this approach is not even feasible if an application uses novel formats that are not supported by libraries, such as many of the sparse matrix and tensor formats that have been proposed in recent literature. The developer must then hand-implement efficient custom conversion routines for each new format, which are typically complicated and tedious to write and optimize. This motivates a technique that can instead automatically generate such efficient format conversion routines.

Existing sparse tensor algebra compilers such as taco [17, 29, 30] are unable to generate such routines for many formats. Converting a tensor between disparate formats typically entails changing how its nonzeros are grouped and ordered in memory, potentially rearranging nonzeros into complex orders such as by diagonals or by blocks (Figure 1), or even by Morton order [13, 34]. Efficient conversion algorithms can often achieve this reordering without explicitly sorting the input tensor by first computing statistics about the tensor. These statistics are then used to coordinate the movement of nonzeros to the output tensor (in the target format) in such a way that avoids data reshuffles and memory reallocations. The taco compiler cannot generate such algorithms as it cannot express or reason about reordering nonzeros in non-lexicographic coordinate order.
It also cannot reason about computing and utilizing statistics about the input tensor to coordinate assembly of disparate tensor data structures.

We propose a technique to generate efficient sparse tensor conversion routines for a wide range of disparate formats, building on our recent works on sparse tensor algebra compilation [17, 29, 30]. We decompose a large class of tensor conversion algorithms into three logical phases (Section 3). Then, to facilitate generating code for each phase, we develop
• coordinate remapping notation, which describes how different tensor formats group together and order nonzeros in memory (Section 4);
• attribute query language, which describes what statistics about a tensor are needed so that sufficient memory can be reserved for conversion (Section 5); and
• a tensor assembly abstract interface, which exposes functions that capture how results of attribute queries are used to efficiently assemble many kinds of sparse tensor data structures (Section 6).
As we will show, the conciseness of these abstractions makes it easy to provide specifications that describe how to efficiently construct sparse tensors in many formats. Our technique can then combine these specifications with additional ones proposed by Chou et al. [17], which specify how to efficiently iterate over sparse tensors in many formats, in order to generate efficient conversion routines for arbitrary combinations of formats. In this way, users only have to provide one set of specifications for every supported format rather than every combination of source and target formats.

We have implemented a prototype of our technique in taco. Our evaluation shows that, for many combinations of source and target formats, our technique generates conversion routines with performance between 1.00 and 2.01× that of hand-optimized implementations in SPARSKIT and Intel MKL.
For conversions from CSC/COO to DIA/ELL, our technique emits code that avoids materializing temporaries and that is not implemented in either library, which lets us optimize those conversions by 1.78 to 4.01× (Section 7).

There exist a wide variety of formats for storing sparse tensors in memory. Figure 2 shows four examples of commonly used sparse tensor formats; for an overview of more formats that have been proposed, we refer readers to Section 2.1 in [17]. The COO format [7] represents a sparse tensor as a list of its nonzeros, storing the complete coordinates and value of each nonzero. COO supports efficiently appending new nonzeros, though it also wastes memory by storing redundant row coordinates. The CSR format [55] compresses out the redundant row coordinates by grouping all nonzeros in the same row together and using a pos array to map nonzeros to each row. However, inserting a nonzero at some arbitrary coordinates into CSR is expensive, as all nonzeros in subsequent rows must be shifted in memory. The DIA [49] format stores nonzeros along the same diagonal together in memory, while the ELL [27] format groups together up to one nonzero from each row. Such orderings of nonzeros expose vectorization opportunities for SpMV [19] and can also reduce memory footprint. However, DIA is only suitable for diagonal and banded matrices, while ELL is only suitable if all rows in the matrix have a similar number ( K ) of nonzeros. As these examples show, there is no universally ideal format for storing sparse matrices. The same is true of higher-order sparse tensors, for which even more formats have been, and are continuously being, proposed [8, 34, 35, 50] that use disparate data structures and ordering schemes to store nonzeros, each with distinct trade-offs.

Figure 2. The same tensor as shown in Figure 1, stored in disparate sparse tensor formats: (a) COO, (b) CSR, (c) DIA, (d) ELL.

Figure 3. Two coordinate hierarchy representations of the same tensor from Figure 1 in COO (left) and CSR (right). Their differing structures reflect how COO and CSR store nonzeros (i.e., whether duplicate row coordinates are stored).

Chou et al. [17] describe how tensors stored in disparate formats can be represented as coordinate hierarchies that have varying structures but that expose the same abstract interface. Figure 3 shows examples of coordinate hierarchies that represent a tensor in two different formats. Each level in a coordinate hierarchy encodes the nonzeros' coordinates in one dimension. Edges associate stored components with their containing subtensors. In Figure 3, for instance, each column coordinate, which is associated with a nonzero, is connected by an edge to a coordinate that identifies the nonzero's containing row. Each stored component is represented by a path from the root to a leaf, with coordinates along the path representing the component's coordinates. We refer readers to Section 3 in [17] for more details.

We can then decompose sparse tensor formats into level formats that each stores a coordinate hierarchy level, which represents a tensor dimension. CSR (Figure 2b), for instance, can be decomposed into two level formats, dense and compressed, that store the row and column levels respectively, as Figure 4 shows.
The dense level format implicitly encodes all rows using just a parameter N to store the dimension's size. By contrast, the compressed level format uses two arrays, pos and crd, to store column coordinates of nonzeros in each row. All level formats, however, expose the same static interface consisting of level functions, which describe how to access a format's data structures, and properties, which describe characteristics of the data as stored (e.g., if nonzeros are stored in order). The level function locate, for instance, describes how to efficiently random-access coordinates that are stored in a level format. Similarly, pos_bounds and pos_access describe how to efficiently iterate over coordinates, with the former specifying how to compute the bounds of iteration and the latter specifying how to access each coordinate.

Figure 4. Decomposition of CSR into level formats and corresponding level functions that describe how the associated data structures can be efficiently accessed.

Structured sparse tensor formats like DIA and ELL, which do not group nonzeros lexicographically by their coordinates, can also be decomposed into level formats by casting them as formats for tensors with additional dimensions. For example, a DIA matrix can be cast as a 3rd-order tensor where each slice contains only nonzeros that lie on the same diagonal, as shown in Figure 5. We can then decompose DIA into three level formats: one that stores the set of nonzero diagonals in a perm array of size K, another that encodes the set of rows in each diagonal, and a third that encodes the column coordinates of nonzeros. Such a decomposition lets a sparse tensor algebra compiler reason about how tensors stored in DIA and similar structured formats can be efficiently iterated, which is crucial for generating fast tensor algebra code.

The coordinate hierarchy abstraction lets a compiler generate efficient code to iterate over sparse tensors in disparate formats by simply emitting code to traverse coordinate hierarchies.
This entails recursively generating a set of nested loops that each iterates over a level in a coordinate hierarchy. The compiler generates each loop by emitting calls to level functions that describe how to efficiently access the level. Then, to obtain code that iterates over a tensor in any desired format, the level function calls are simply replaced with the desired format's implementations of those level functions. This approach lets a compiler generate efficient code for disparate formats without hard-coding for any specific format. We refer readers to Section 4.5 in [17] for more details.
Figure 5.
The matrix in Figure 1 can be transformed to a 3rd-order tensor where each slice contains all nonzeros that lie on the same diagonal in the original matrix. The lexicographic coordinate ordering of nonzeros in the resulting tensor matches the order in which nonzeros are stored in DIA (Figure 2c). Section 4 shows how this transformation is formalized by the coordinate remapping (i,j) -> (j-i,i,j).

Figure 6 shows three examples of sparse tensor conversion routines that efficiently convert tensors between different storage formats. As these examples illustrate, different combinations of source and target formats require vastly dissimilar code. It turns out, however, that efficient algorithms for converting sparse tensors between a wide range of disparate formats can all be decomposed into three logical phases: coordinate remapping, analysis, and assembly. Figure 6 identifies these phases using different background colors.

The coordinate remapping phase iterates over the input tensor and, for each nonzero, computes new coordinates as functions of its original coordinates. What additional coordinates are computed depends on the target format. For instance, the code in Figure 6a, which converts a CSR matrix to DIA, computes a new coordinate k for each nonzero as the difference between its column and row coordinates (lines 2–6 and 20–24). Coordinate remapping conceptually transforms (i.e., remaps) the input tensor to a hypersparse higher-order tensor. As Figure 5 illustrates, the lexicographic coordinate ordering of nonzeros in the remapped tensor reflects how nonzeros are grouped together and ordered in the target format (i.e., DIA), even if the format does not store nonzeros in lexicographic order by their original coordinates.

The analysis phase computes statistics about the input tensor that are later used to determine the amount of memory to allocate for storing nonzeros in the target format.
The exact statistics that are computed also depend on the target format. Figure 6a, for instance, computes the set of all nonzero diagonals in the input matrix (lines 1–8), with distinct diagonals identified by offsets ( k ) computed in the coordinate remapping phase. By contrast, Figure 6b computes the maximum number of nonzeros in any row of the input matrix (lines 1–5), while Figure 6c computes the number of nonzeros in each row of the input matrix (lines 1–6).

Finally, the assembly phase iterates over the input tensor and inserts each nonzero into the output data structures. Again, where each nonzero is inserted ( pB2 ) depends on the target format. Figure 6a computes pB2 as a function of each nonzero's row coordinate and its offset k (as computed in the coordinate remapping phase), in such a way that nonzeros with the same offset are grouped together in the output (lines 25–26). By contrast, Figure 6c simply appends each nonzero to its row's corresponding segment in the crd array (line 19).

In Sections 4 through 6, we describe how our technique generates efficient code to perform coordinate remapping (Section 4), analysis (Section 5), and assembly (Section 6). For each logical phase, we define a language that captures what needs to be performed for disparate target formats. Figure 7 demonstrates how these languages can be used to specify what needs to be performed when converting tensors to ELL. For each new target tensor format, a user must first specify
• a coordinate remapping (in green) that, when applied to the input tensor, captures how nonzeros are grouped together and ordered in the target format.
As alluded to in Section 2, the target tensor format can then be decomposed into level formats that each stores a dimension of the remapped input tensor. For each of these level formats, the user must then also specify
• what input tensor statistics to compute (in yellow) and
• how to store the coordinates of nonzeros (in blue).
To generate code that converts tensors between any two specific formats, our technique combines the aforementioned specifications for the target format with level functions that describe how to iterate over tensors in the source format. Given the specifications in Figures 4 and 7 as inputs, for instance, our technique generates code like what is shown in Figure 6b, which performs CSR to ELL conversion. Just as easily though, given the same specifications in Figure 7 but also level functions that describe how to iterate over COO tensors, our technique instead generates efficient COO to ELL conversion code. In this way, our technique can generate efficient conversion routines for many combinations of formats without needing specifications for each combination.

Having a separate language to describe each logical phase provides several benefits. First, converting to different tensor formats may require similar steps to be taken for only some of the phases, so having each phase be specified separately allows for reuse of the specifications. For instance, the COO format uses the same data structure as ELL to store column coordinates. Thus, the two formats can share specifications for assembly (i.e., level functions implemented for the singleton level format) even if they require different coordinate remappings. Second, having each logical phase be specified separately gives the compiler flexibility to generate code that fuses logically distinct phases only if it is beneficial. Our technique can thus generate code like Figure 6a, which duplicates and fuses coordinate remapping with the analysis and assembly phases to avoid materializing the offsets of nonzeros.
At the same time, for conversions to formats that store nonzeros in more complex orderings (e.g., Morton order), the compiler can emit code to perform coordinate remapping separately and materialize the additional coordinates. This eliminates the need to recompute complex remappings.

(a) CSR to DIA:

    bool nz[2 * N - 1] = {0};
    for (int i = 0; i < N; i++) {
      for (int pA2 = A_pos[i]; pA2 < A_pos[i+1]; pA2++) {
        int j = A_crd[pA2];
        int k = j - i;
        nz[k + N - 1] = true;
      }
    }
    int* B_perm = new int[2 * N - 1];
    int K = 0;
    for (int i = -N + 1; i < N; i++) {
      if (nz[i + N - 1]) B_perm[K++] = i;
    }
    double* B_vals = new double[K * N]();
    int* B_rperm = new int[2 * N - 1];
    for (int i = 0; i < K; i++) {
      B_rperm[B_perm[i] + N - 1] = i;
    }
    for (int i = 0; i < N; i++) {
      for (int pA2 = A_pos[i]; pA2 < A_pos[i+1]; pA2++) {
        int j = A_crd[pA2];
        int k = j - i;
        int pB1 = B_rperm[k + N - 1];
        int pB2 = pB1 * N + i;
        B_vals[pB2] = A_vals[pA2];
      }
    }

(b) CSR to ELL:

    int K = 0;
    for (int i = 0; i < N; i++) {
      int ncols = A_pos[i+1] - A_pos[i];
      K = max(K, ncols);
    }
    int* B_crd = new int[K * N]();
    double* B_vals = new double[K * N]();
    for (int i = 0; i < N; i++) {
      int count = 0;
      for (int pA2 = A_pos[i]; pA2 < A_pos[i+1]; pA2++) {
        int j = A_crd[pA2];
        int k = count++;
        int pB2 = k * N + i;
        B_crd[pB2] = j;
        B_vals[pB2] = A_vals[pA2];
      }
    }

(c) COO to CSR:

    int count[N] = {0};
    for (int pA1 = A_pos[0]; pA1 < A_pos[1]; pA1++) {
      int i = A1_crd[pA1];
      count[i]++;
    }
    int* B_pos = new int[N + 1];
    B_pos[0] = 0;
    for (int i = 0; i < N; i++) {
      B_pos[i + 1] = B_pos[i] + count[i];
    }
    int* B_crd = new int[B_pos[N]];
    double* B_vals = new double[B_pos[N]];
    for (int pA1 = A_pos[0]; pA1 < A_pos[1]; pA1++) {
      int i = A1_crd[pA1];
      int j = A2_crd[pA1];
      int pB2 = B_pos[i]++;
      B_crd[pB2] = j;
      B_vals[pB2] = A_vals[pA1];
    }
    for (int i = 0; i < N; i++) {
      B_pos[N - i] = B_pos[N - i - 1];
    }
    B_pos[0] = 0;

Figure 6. Code (in C++) to convert sparse tensors between different source and target formats. The background colors identify distinct logical phases of format conversion (green for coordinate remapping, yellow for analysis, and blue for assembly).

    get_size(sz): return sz;
    init_coords(sz, Q): crd = calloc(sz, int);
    get_pos(p, ..., i): return p;
    insert_coord(p, p, ..., i): crd[p] = i;

    Q := [select [] -> max(i) as max_crd]
    get_size(sz): return sz * K;
    init_coords(sz, Q): K = Q[0][].max_crd + 1;
    get_pos(p, i): return p * K + i;

    get_size(sz): return sz * N;
    get_pos(p, i, i): return p * N + i;

    ELL: (i2,i3) -> (i1=…

Figure 7. Specifications that describe what needs to be performed as part of each logical phase when converting sparse tensors to ELL, which can be decomposed into three level formats: sliced, dense, and singleton. The background colors identify the phases being specified, following the same scheme as Figure 6.

As explained in Section 3, many efficient sparse tensor conversion algorithms logically transform (i.e., remap) input tensors to higher-order tensors, such that the lexicographic coordinate ordering of nonzeros in the remapped tensors specifies how nonzeros are stored in the target formats. We propose a new language called coordinate remapping notation, which precisely describes how a tensor can be remapped so as to capture the various ways that different tensor formats group together and order nonzeros in memory. We further show how our technique generates code that applies a coordinate remapping to remap the input tensor as part of format conversion. This eliminates the need for end users to hand-implement additional code that separately performs such a remapping, which the technique of Chou et al. [17] requires for conversions to structured tensor formats.
Figure 8 shows the syntax of coordinate remapping notation. Statements in coordinate remapping notation specify how components in a canonical (non-remapped) input tensor map to components in an output tensor of equal or higher order. For instance, given a matrix A as input, the statement

    (i,j) -> (j-i,i,j)

maps every component A_ij to the corresponding component in the ( j − i )-th slice of the three-dimensional remapped tensor. Applying this remapping to any matrix, which can be stored in any format but must have canonical coordinates ( i , j ), transforms it to a 3rd-order tensor where each slice contains all nonzeros that lie on the same diagonal in the original matrix. As Figure 5 shows, the lexicographic coordinate ordering of nonzeros in the resulting tensor precisely reflects the order in which nonzeros are stored in DIA.

    remap_stmt   := src_indices '->' dst_indices
    src_indices  := '(' ivar { ',' ivar } ')'
    dst_indices  := '(' ivar_let { ',' ivar_let } ')'
    ivar_let     := { var '=' ivar_expr 'in' } ivar_expr
    ivar_expr    := ivar_xor { '|' ivar_xor }
    ivar_xor     := ivar_and { '^' ivar_and }
    ivar_and     := ivar_shift { '&' ivar_shift }
    ivar_shift   := ivar_add { ('<<' | '>>') ivar_add }
    ivar_add     := ivar_mul { ('+' | '-') ivar_mul }
    ivar_mul     := ivar_factor { ('*' | '/' | '%') ivar_factor }
    ivar_factor  := '(' ivar_expr ')' | ivar_counter | ivar | var | const
    ivar_counter := '#' { ivar }

Figure 8. The syntax of coordinate remapping notation. Expressions in braces may be repeated any number of times.

Similarly, the BCSR format partitions a matrix into fixed-sized M × N blocks and stores components of each block contiguously in memory [23]. Such grouping of nonzeros can be expressed with the remapping

    (i,j) -> (i/M,j/N,i,j)

which assigns components that lie within the same block to the same two-dimensional slice (identified by coordinates ( i/M , j/N )) in the output tensor.

Coordinate remapping notation can express complex tensor reorderings. The remapping below, for instance, groups together nonzeros that lie within the same constant-sized N × N × N block and also orders the blocks as well as the nonzeros within each block in Morton order [38]:

    (i,j,k) ->
      (r=i/N in s=j/N in t=k/N in
         (r&1) | ((s&1)<<1) | ((t&1)<<2) ...,
       i/N, j/N, k/N,
       u=i%N in v=j%N in w=k%N in
         (u&1) | ((v&1)<<1) | ((w&1)<<2) ...,
       i, j, k)

This remapping exactly describes how the HiCOO tensor format orders nonzeros in memory [34]. Nested let expressions are used to first define variables r, s, and t as the coordinates of each block and variables u, v, and w as the coordinates of each nonzero within a block. The remapping then computes the Morton code of each block and each nonzero within a block by interleaving the bits of those previously defined coordinates using bitwise operations.

Coordinate remapping notation also provides counters, denoted by "#" in Figure 8.
Counters map nonzeros that share the same specified coordinates to distinct slices in the result. For instance, as Figure 9 shows, the remapping (i,j) -> (k=#i in k,i,j) assigns the k-th nonzero that is iterated over in each row of the input matrix to the k-th slice in the output tensor, ensuring nonzeros with the same i coordinate are mapped to distinct slices. This remapping effectively groups together up to one nonzero from each row of the input matrix, accurately reflecting how formats like ELL and JAD [47] store nonzeros.

Figure 9. Result of applying (i,j) -> (#i,i,j) to the matrix in Figure 1, assuming nonzeros are iterated over during remapping in the same order as they are stored in Figure 2b.

To support generating sparse tensor conversion routines for arbitrary combinations of formats, we independently annotate each supported format with a coordinate remapping that describes how the format groups together and orders nonzeros. As the previous examples demonstrate, it is also simple for end users to add support for additional custom formats by using coordinate remapping notation to specify how the new formats lay out nonzeros in memory. Then, to support converting tensors between two specific formats, our technique emits code that iterates over the input tensor and transforms the (canonical) coordinates of each nonzero by applying the target format's coordinate remapping.

To generate a set of nested loops that efficiently iterate over an input tensor in any source format, our technique uses the method proposed by Kjolstad et al. [30] and generalized by Chou et al. [17], which is summarized at the end of Section 2. Within the generated loops, our technique then emits code that computes each remapped nonzero's additional coordinates as functions of the nonzero's original coordinates, following the target format's coordinate remapping.

To compute additional coordinates that are defined purely as arithmetic or bitwise expressions of the original coordinates, our technique simply inlines those expressions directly into the emitted code (e.g., lines 6 and 24 in Figure 6a, which compute the first coordinate in the output of the remapping (i,j) -> (j-i,i,j)).
Remappings that contain let expressions are lowered by first emitting code to initialize the local variables and then inlining the expressions that use those local variables. For example, a remapped coordinate r=i/N in (r&1) | ((r&2)<<2) would be lowered to

    int r = i/N;
    int m = (r&1) | ((r&2)<<2);
Coordinate remappings that contain counters are lowered by emitting a counter array for each distinct counter in the remapping. Each element in the counter array corresponds to a distinct set of coordinates ( i1, ..., ik ) that can be used to index into the counter, and the counter array element tracks how many input nonzeros with coordinates ( i1, ..., ik ) have been iterated over so far. Our technique additionally emits code that, for each nonzero having coordinates that correspond to counter array element c, first assigns the nonzero to the output tensor slice indexed by c and then increments c. So to apply the remapping (i,j) -> (#i,i,j) to a COO matrix, for instance, our technique emits the following code:

    int counter[N] = {0};  // counter array for …

If the coordinates used to index into a counter are iterated in order though, our technique reduces the size of the counter array in the generated code by having the counter array be reused across iterations. This can significantly reduce memory traffic. For instance, if the input matrix is instead stored in CSR, our technique infers from properties of the format (exposed through the static interface discussed in Section 2) that we can efficiently iterate over nonzeros row by row. Thus, to apply the same coordinate remapping as before, our technique emits optimized code as shown on lines 8–13 in Figure 6b, which uses the same scalar count variable to remap the nonzeros in each row of the input matrix.
As we also saw in Section 3, to avoid having to constantly reallocate and shuffle around stored nonzeros, many efficient tensor conversion algorithms instead allocate memory in one shot based on some statistics about the input tensor. Computing these statistics, however, requires very different code depending on how the input tensor is stored. For instance, to convert a matrix to ELL without dynamically resizing the crd and vals arrays, one must first determine the maximum number of nonzeros K stored in any row. If the input matrix is stored in COO, then computing K requires constructing a histogram that records the number of nonzeros in each row, which in turn requires examining all the nonzeros in the matrix. If the input is stored in CSR, however, then the number of nonzeros in each row can instead be directly computed from the pos array. Optimized code for converting CSR matrices to ELL thus does not need to make multiple passes over the input matrix's nonzeros, reducing memory traffic.

We propose a new language, the attribute query language, that describes statistics of sparse tensors as aggregations over the coordinates of their nonzeros. The attribute query language is declarative, and attribute queries are specified independently of how the input tensor is actually stored. This lets our technique lower attribute queries to equivalent sparse tensor computations and then simply leverage prior work on sparse tensor algebra compilation [17, 29, 30] to generate optimized code for computing tensor statistics.

Figure 10. Examples of attribute queries computed on the tensor shown in Figure 1:
  select [i] -> count(j) as nir
  select [i] -> min(j) as minir, max(j) as maxir
  select [j] -> id() as ne

As we show in Section 6, our technique can thus generate efficient tensor conversion routines while only requiring users to provide simple-to-specify attribute queries for each potential target format, as opposed to complicated loop nests for every combination of source and target formats.
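To make the contrast concrete, here is a sketch (with hypothetical names, not generated code) of computing the statistic K, the maximum number of nonzeros in any row, from a COO input versus a CSR input:

```c
#include <stdlib.h>

/* From COO, computing K requires a full pass over the nonzeros to build a
 * per-row histogram. */
int max_row_nnz_coo(int nrows, int nnz, const int *row) {
    int *hist = calloc(nrows, sizeof(int));
    int K = 0;
    for (int p = 0; p < nnz; p++)
        if (++hist[row[p]] > K) K = hist[row[p]];
    free(hist);
    return K;
}

/* From CSR, the same statistic falls out of the pos array directly,
 * without touching the stored nonzeros at all. */
int max_row_nnz_csr(int nrows, const int *pos) {
    int K = 0;
    for (int i = 0; i < nrows; i++) {
        int n = pos[i + 1] - pos[i];
        if (n > K) K = n;
    }
    return K;
}
```

Both functions compute the same query result; the attribute query language lets the query be stated once while the compiler picks whichever lowering the source format supports.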
The attribute query language lets users compute summaries of a tensor's sparsity structure by performing aggregations over the coordinates of the tensor's nonzeros. All queries in the attribute query language take the form

select [i_1,...,i_m] -> agg_1(...) as name_1, ..., agg_k(...) as name_k

where the right-hand side lists one or more named aggregations (such as count, min, max, or id) over coordinates not listed on the left-hand side, computed separately for each subtensor identified by coordinates (i_1, ..., i_m).
So if Q is the result of the query in Figure 10 (middle), then Q[3].minir == 1 and Q[3].maxir == 4, since all nonzeros in row 3 of the tensor in Figure 1 lie between columns 1 and 4. Finally, id simply returns 1 if a subtensor A′ contains nonzeros and 0 otherwise. So if R is the result of the query in Figure 10 (right), then R[4].ne == 1 since column 4 contains a nonzero, while R[5].ne == 0 since the last column is empty.

The attribute query language can be used with coordinate remapping notation to compute even more complex attributes of structured tensors. For example, let A be a K × I × J tensor obtained by applying the remapping (i,j) -> (j-i,i,j) to a matrix B. Since each slice of A along dimension K corresponds to a unique diagonal in B, computing

select [k] -> id() as ne

on A results in a bit set that encodes the set of all nonzero diagonals in B. This, as mentioned in Section 3, is precisely the information that would be required if one were to convert B from CSR to DIA. Furthermore, since the coordinate of each slice of A is defined to be the offset of the corresponding diagonal in B from the main diagonal, applying the query

select [] -> min(k) as lb, max(k) as ub

to A computes the lower and upper bandwidths of matrix B.

Kjolstad et al. [29] introduce concrete index notation, a language for precisely specifying tensor computations. For instance, an operation that computes the sum of every row in a matrix A can be expressed as ∀i ∀j x_i += A_ij, where each ∀ specifies iteration over a dimension of A. Kjolstad et al. show how computations expressed in concrete index notation can be lowered to efficient imperative code; we refer readers to Section 5 in [28] for more details. At a high level, this is done recursively dimension by dimension in the order specified by the ∀s. To generate code for the previous example, for instance, a compiler would first emit a loop to iterate over all rows of A. Within that loop, the compiler would then emit a second loop to iterate over all columns of A in order to compute the sum of nonzeros in row i. Again, specializing the emitted code to operands in arbitrary formats can be done in the same way described at the end of Section 2.

To generate efficient code that computes an attribute query, our technique simply reformulates the query as a sparse tensor algebra computation.
The query is first lowered to a canonical form in concrete index notation, which we extend with the ability to index into results using coordinates that are computed as arbitrary functions of index variables. The canonical form of the query is subsequently optimized by applying a set of predefined transformations to simplify the computation. Finally, the optimized query in concrete index notation is compiled to imperative code by straightforwardly leveraging the techniques of Kjolstad et al. and Chou et al. as summarized above. This approach works as long as query results are stored in a format, like dense arrays, that can itself be efficiently assembled without needing attribute queries.

More precisely, let A be an I_1 × · · · × I_r tensor obtained by applying some remapping to a J_1 × · · · × J_n tensor B. Then, to compute an attribute query of the form

select [i_1,...,i_m] -> id() as Q

on A, for instance, our technique lowers the query to its canonical form in concrete index notation as

∀j_1 · · · ∀j_n  Q_{i_1···i_m} |= map(B_{j_1···j_n}, 1),

where |= denotes Boolean OR reduction. The computation above logically iterates over every component of B, computes the coordinates (i_1, . . . , i_m) of each component B_{j_1···j_n} in the remapped tensor A, and sets the corresponding component in the Boolean result tensor Q to true (1). (All components of Q are assumed to be initialized to false.) The map operator returns the second argument if the first argument is nonzero (or true) and zero otherwise, which ensures that only the coordinates of nonzeros in B are aggregated. So if, for instance, C is a K × I × J tensor obtained by applying the remapping (i,j) -> (j-i,i,j) to a matrix D, then to compute select [k] -> id() as Q on C, our technique lowers the query to the computation ∀i ∀j Q_{j−i} |= map(D_ij, 1). For each nonzero of D, this computation computes the nonzero's offset from the main diagonal and sets the corresponding component in Q to true.
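A direct sketch of this lowered computation, together with the earlier bandwidth query, for a matrix D stored in COO (all names hypothetical; the offset k = j − i is shifted by I − 1 so the Boolean array can be indexed from 0):

```c
#include <stdbool.h>

/* Sketch: for an I x J matrix D in COO, mark Q[k] = true for every diagonal
 * k = j - i that contains a nonzero. Q has I + J - 1 entries; shifting k by
 * (I - 1) maps the lowest diagonal to Q[0]. This realizes
 *   forall i,j:  Q[j-i] |= map(D_ij, 1). */
void nonzero_diagonals(int I, int J, int nnz,
                       const int *row, const int *col, bool *Q) {
    for (int d = 0; d < I + J - 1; d++) Q[d] = false; /* init to false */
    for (int p = 0; p < nnz; p++)
        Q[col[p] - row[p] + (I - 1)] = true;
}

/* Sketch: select [] -> min(k) as lb, max(k) as ub  on the same remapped
 * tensor computes the lower and upper bandwidths of D. */
void bandwidths(int nnz, const int *row, const int *col, int *lb, int *ub) {
    for (int p = 0; p < nnz; p++) {
        int k = col[p] - row[p];
        if (p == 0 || k < *lb) *lb = k;
        if (p == 0 || k > *ub) *ub = k;
    }
}
```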
The query result Q thus exactly encodes the set of diagonals in D that contain nonzeros.

In a similar way, our technique lowers count queries

select [i_1,...,i_m] -> count(i_{m+1},...,i_l) as Q

on A to their canonical form

(∀i_1 · · · ∀i_l  Q_{i_1···i_m} += map(W_{i_1···i_l}, 1))  where  (∀j_1 · · · ∀j_n  W_{i_1···i_l} |= map(B_{j_1···j_n}, 1)).

The computation above first iterates over the nonzeros of B to compute the intermediate result W, which encodes whether each subtensor of A identified by coordinates (i_1, . . . , i_l) is nonzero. The computation then sums over dimensions I_{m+1} through I_l of W to compute the number of aforementioned subtensors that are nonzero and contained in each higher-order subtensor with coordinates (i_1, . . . , i_m).

Our technique also generates code for max queries

select [i_1,...,i_m] -> max(i_{m+1}) as Q

by lowering them to their canonical form

∀j_1 · · · ∀j_n  Q′_{i_1···i_m} max= map(B_{j_1···j_n}, i_{m+1} − s + 1),

where s denotes the smallest possible coordinate along dimension I_{m+1}. Q′ is assumed to be initialized to the zero tensor, so by mapping each input tensor component to its remapped coordinate i_{m+1} plus the constant (1 − s), we ensure that only the coordinates of nonzeros are actually aggregated. Q′ can thus be interpreted as the actual result of the original query (i.e., Q) but shifted by (1 − s); in other words, Q_{i_1···i_m} ≡ Q′_{i_1···i_m} + s −
1. Similarly, min queries

select [i_1,...,i_m] -> min(i_{m+1}) as Q

are lowered to their canonical form

∀j_1 · · · ∀j_n  Q′_{i_1···i_m} max= map(B_{j_1···j_n}, −i_{m+1} + t + 1),

where t denotes the largest possible coordinate along dimension I_{m+1} and Q′ is the query result but negated and shifted by (t + 1); in other words, Q_{i_1···i_m} ≡ −Q′_{i_1···i_m} + t + 1.

Table 1. Example transformations that our technique applies to optimize attribute queries. We augment level formats with a property that specifies if a dimension stores explicit zeros, which lets our technique determine if a tensor stores only nonzeros.

reduction-to-assign
  Definition: (∀j_1 · · · ∀j_n  A_{i_1···i_m} ⊕= expr)  ⟹  (∀j_1 · · · ∀j_n  A_{i_1···i_m} = expr)
  Preconditions and postconditions: For each j_k, there exists an i_l such that j_k ≡ i_l. ⊕ is any reduction operator. A is initialized to the zero tensor.

inline-temporary
  Definition: (∀i_1 · · · ∀i_m  A_{i_1···i_l} ⊕= f(W_{i_1···i_m})) where (∀j_1 · · · ∀j_n  W_{i_1···i_m} = expr)  ⟹  (∀j_1 · · · ∀j_n  A_{i_1···i_l} ⊕= f(expr))
  Preconditions and postconditions: f is any function that takes only W as a tensor operand. ⊕ is any reduction operator or a simple assignment.

simplify-width-count
  Definition: (∀j_1 · · · ∀j_n  A_{i_1···i_m} += map(B_{j_1···j_n}, c))  ⟹  (∀j_1 · · · ∀j_{n−1}  A_{i_1···i_m} += B′_{j_1···j_{n−1}} · c)
  Preconditions and postconditions: B stores only nonzeros, and j_n is a reduction variable that indexes into the innermost dimension of B (i.e., J_n). c is any constant. B′ is a tensor that encodes the number of nonzeros in each slice of B indexed by coordinates (j_1, ..., j_{n−1}); values of B′ are not materialized but dynamically computed with calls to level functions that define iteration over dimension J_n of B.

counter-to-histogram
  Definition: (∀j_1 · · · ∀j_n  A_{i_1···i_m} max= map(B_{j_1···j_n}, #(j_k, ..., j_l) + 1))  ⟹  (∀i_1 · · · ∀i_l  A_{i_1···i_m} max= W_{i_1···i_l}) where (∀j_1 · · · ∀j_n  W_{i_1···i_m, j_k···j_l} += map(B_{j_1···j_n}, 1))
  Preconditions and postconditions: None.

As an example, consider the query select [i] -> count(j) as Q applied to an I × J matrix B. As described before, our technique first lowers this query to its canonical form

(∀i ∀j  Q_i += map(W_ij, 1))  where  (∀i ∀j  W_ij |= map(B_ij, 1)).
Our technique then proceeds to iteratively and eagerly apply the transformations shown in Table 1 to the computation above. In particular, each iteration variable bound to a ∀ is used to independently index into a dimension of W, so the substatement that defines W satisfies the preconditions of the reduction-to-assign transformation. Our technique thus applies that transformation to the substatement that defines W to obtain

(∀i ∀j  Q_i += map(W_ij, 1))  where  (∀i ∀j  W_ij = map(B_ij, 1)).

Then, since the temporary W is no longer the result of a reduction operation, our technique eliminates it by applying the inline-temporary transformation to obtain

∀i ∀j  Q_i += map(map(B_ij, 1), 1),

which is then trivially rewritten to

∀i ∀j  Q_i += map(B_ij, 1)

by applying constant folding. If B is stored in COO, then we can directly apply the techniques of Kjolstad et al. and Chou et al. to lower this rewritten statement down to the imperative code shown on lines 1–6 in Figure 6c. However, if B is stored in CSR (with only nonzeros stored), then our technique additionally applies the simplify-width-count transformation followed by reduction-to-assign again to get the final query

∀i  Q_i = B′_i,

where each component of B′ is dynamically computed as pos[i+1] - pos[i]. The optimized query thus avoids iterating over B's nonzeros, thereby reducing memory traffic.

As explained in Section 2, a sparse tensor can be modeled as a hierarchical structure of coordinates, where each stored component is represented by a path from the root to a leaf. We can thus view any tensor format simply as a composition of level formats that each store a level of a coordinate hierarchy. This abstraction lets us reason about sparse tensor assembly as coordinate hierarchy construction. We extend the coordinate hierarchy abstraction with new primitives (level functions) that describe how each level can be efficiently constructed (assembled).
These new level functions, unlike analogous ones proposed by Chou et al. [17], describe how coordinates and edges can be efficiently inserted into a coordinate hierarchy assuming certain statistics about the input tensor have been precomputed.

Figures 7 and 11 show how level formats that use disparate data structures can implement the new assembly level functions. All these implementations expose the same static interface, which lets our code generator reason about and emit efficient code to convert tensors between a wide range of formats. And by using the same level formats to express different tensor formats that share common data structures, our technique can reuse the same level function implementations to generate conversion routines for many different target formats. For example, the column dimensions of CSR tensors and the row dimensions of COO tensors can both be stored using the same (i.e., compressed) level format. The code generator can thus use the same implementations of the new assembly level functions (for the compressed level format) to generate parts of code that convert tensors to either CSR or COO. This limits the one-time effort needed to implement our extended coordinate hierarchy abstraction.

Squeezed level format — Q_k := [select [i_k] -> id() as nz]:
  get_size(sz_{k-1}): return sz_{k-1} * K;
  init_coords(sz_{k-1}, Q_k):
    perm = malloc(N_k - M_k, int); K = 0;
    for (i = M_k; i < N_k; i++)
      if (Q_k[0][i].nz) perm[K++] = i;
  init_get_pos(sz_{k-1}):
    rperm = malloc(N_k - M_k, int);
    for (i = 0; i < K; i++) rperm[perm[i] - M_k] = i;
  get_pos(p_{k-1}, i_1, ..., i_k): return p_{k-1} * K + rperm[i_k - M_k];
  finalize_get_pos(sz_{k-1}): free(rperm);

Compressed level format — Q_k := [select [i_1, ..., i_{k-1}] -> count(i_k) as nir]:
  get_size(sz_{k-1}): return pos[sz_{k-1}];
  seq_init_edges(sz_{k-1}, Q_k): pos = malloc(sz_{k-1} + 1, int); pos[0] = 0;
  seq_insert_edges(p_{k-1}, i_1, ..., i_{k-1}, q_k):
    pos[p_{k-1} + 1] = pos[p_{k-1}] + q_k[0].nir;
  unseq_init_edges(sz_{k-1}, Q_k): pos = calloc(sz_{k-1} + 1, int);
  unseq_insert_edges(p_{k-1}, i_1, ..., i_{k-1}, q_k): pos[p_{k-1} + 1] = q_k[0].nir;
  unseq_finalize_edges(sz_{k-1}): prefix_sum(pos, sz_{k-1} + 1);
  init_coords(sz_{k-1}, Q_k): crd = malloc(pos[sz_{k-1}], int);
  yield_pos(p_{k-1}, i_1, ..., i_k): return pos[p_{k-1}]++;
  insert_coord(p_{k-1}, p_k, i_1, ..., i_k): crd[p_k] = i_k;
  finalize_yield_pos(sz_{k-1}):
    for (i = 0; i < sz_{k-1}; i++) pos[sz_{k-1} - i] = pos[sz_{k-1} - i - 1];
    pos[0] = 0;

Banded level format — Q_k := [select [i_1, ..., i_{k-1}] -> min(i_k) as w]:
  get_size(sz_{k-1}): return pos[sz_{k-1}];
  seq_init_edges(sz_{k-1}, Q_k): pos = malloc(sz_{k-1} + 1, int); pos[0] = 0;
  seq_insert_edges(p_{k-1}, i_1, ..., i_{k-1}, q_k):
    pos[p_{k-1} + 1] = pos[p_{k-1}] + max(i_{k-1} - q_k[0].w + 1, 0);
  unseq_init_edges(sz_{k-1}, Q_k): pos = calloc(sz_{k-1} + 1, int);
  unseq_insert_edges(p_{k-1}, i_1, ..., i_{k-1}, q_k):
    pos[p_{k-1} + 1] = max(i_{k-1} - q_k[0].w + 1, 0);
  unseq_finalize_edges(sz_{k-1}): prefix_sum(pos, sz_{k-1} + 1);
  get_pos(p_{k-1}, i_1, ..., i_k): return pos[p_{k-1} + 1] + i_k - i_{k-1} - 1;

Figure 11. Implementations of the assembly abstraction, including level function definitions and the required attribute queries, for three different level formats: squeezed (top), compressed (middle), and banded (bottom). Squeezed stores the dimension of offsets in (remapped) DIA tensors. Compressed stores the column dimension of CSR tensors as well as the row dimension of COO tensors. Banded stores the column dimension of tensors in the skyline format [24], which stores all components between the first nonzero and the diagonal for every row. M_k and N_k denote the lower and upper bounds of the k-th dimension.

Our extended coordinate hierarchy abstraction assumes that coordinate hierarchies can be constructed level by level from top to bottom. The assembly of each level is decomposed into two logical phases: edge insertion and coordinate insertion. The edge insertion phase, which is optional, logically bulk inserts edges into a coordinate hierarchy to connect coordinates in adjacent levels. Edge insertion models the assembly of data structures that map nonzeros to their containing subtensors. Depending on whether each position (node) in the preceding parent level can be iterated in sequence, edge insertion can be done in a sequenced or unsequenced fashion.

Unsequenced edge insertion is defined in terms of three level functions that any level format may implement:
• unseq_init_edges(sz_{k-1}, Q_k) -> void
• unseq_insert_edges(p_{k-1}, i_1, ..., i_{k-1}, q_k) -> void
• unseq_finalize_edges(sz_{k-1}) -> void

Q_k denotes the (complete) results of attribute queries that a level format specifies must be precomputed, while q_k denotes only the elements of Q_k indexed by coordinates (i_1,
..., i_{k-1}). sz_{k-1} is the size of the parent level and can be computed as a function of its own parent's size by calling the level function

get_size(sz_{k-1}) -> sz_k,

which all level formats must also implement. unseq_init_edges initializes data structures that the level format uses to logically store edges. Then, for each position p_{k-1} (which represents a subtensor with coordinates (i_1, . . . , i_{k-1})) in the parent level, unseq_insert_edges allocates some number of child coordinates to be connected to p_{k-1}. The number of child coordinates allocated can be computed as any arbitrary function of q_k. Finally, unseq_finalize_edges inserts edges into the coordinate hierarchy so that each coordinate in the parent level is connected to as many children as it was previously allocated. Figure 12 (left) shows how these level functions can logically be invoked to bulk insert edges.

// unsequenced edge insertion (left)
sz_{k-1} = get_size_{k-1}(get_size_{k-2}(· · · (1) · · ·));
unseq_init_edges(sz_{k-1}, Q_k);
for (position p_{k-1} in parent level | ∃ coords i_1, ..., i_{k-1} connecting p_{k-1} to root) {
  q_k[:] = Q_k[:][i_1, ..., i_{k-1}];
  unseq_insert_edges(p_{k-1}, i_1, ..., i_{k-1}, q_k);
}
unseq_finalize_edges(sz_{k-1});

// coordinate insertion (right)
init_coords(sz_{k-1}, Q_k);
init_{get|yield}_pos(sz_{k-1});
for (nonzero with coords i_1, ..., i_k) {
  for (j = 1; j <= k; j++) // can be unrolled
    p_j = {get|yield}_pos_j(p_{j-1}, i_1, ..., i_j);
  insert_coord(p_{k-1}, p_k, i_1, ..., i_k);
}
finalize_{get|yield}_pos(sz_{k-1});

Figure 12.
Unsequenced edge insertion (left) and coordinate insertion (right), expressed in terms of calls to level functions.

Sequenced edge insertion, by contrast, is defined in terms of just two level functions:
• seq_init_edges(sz_{k-1}, Q_k) -> void
• seq_insert_edges(p_{k-1}, i_1, ..., i_{k-1}, q_k) -> void

These level functions are analogous to unseq_init_edges and unseq_insert_edges and can be invoked in similar ways. Sequenced edge insertion, however, assumes that all positions in the parent level are iterated in order. Thus, seq_insert_edges is responsible for both allocating the appropriate number of child coordinates to each parent and actually inserting the edges, and a separate finalize function is not necessary.

The coordinate insertion phase logically iterates over the input tensor's nonzeros and inserts their coordinates into a coordinate hierarchy. This phase models the assembly of data structures that store the coordinates and values of the nonzeros, and it is defined in terms of five level functions:
• init_coords(sz_{k-1}, Q_k) -> void
• init_{get|yield}_pos(sz_{k-1}) -> void
• {get|yield}_pos(p_{k-1}, i_1, ..., i_k) -> p_k
• insert_coord(p_{k-1}, p_k, i_1, ..., i_k) -> void
• finalize_{get|yield}_pos(sz_{k-1}) -> void

init_coords allocates and initializes data structures for storing coordinates in a coordinate hierarchy level. If a level format implicitly encodes coordinates (e.g., as a fixed range) using some fixed set of parameters, then init_coords also computes those parameters as functions of the attribute query results Q_k. On the other hand, if a level format explicitly stores coordinates in memory, then the coordinates of nonzeros are inserted by invoking insert_coord for each nonzero. The position p_k at which each nonzero should be inserted is computed by invoking either get_pos or yield_pos. The former guarantees that nonzeros with the same coordinates are inserted at the same position. The latter allows duplicate coordinates to be inserted at different positions. Both functions, however, may rely on auxiliary data structures to track where to insert coordinates; init_{get|yield}_pos and finalize_{get|yield}_pos initialize and clean up those data structures. Figure 12 (right) shows how all these level functions can be invoked to perform coordinate insertion.

To generate code that converts sparse tensors between two formats, our code generator emits loops that iterate over a tensor in the source format and apply the target format's coordinate remapping to each nonzero. This is done by applying the technique described in Section 4.2. Then, within each loop nest that iterates over the (remapped) input tensor, the code generator emits code that invokes the level functions described in Section 6.1 to store each nonzero into the result. The emitted code is finally specialized to the target format by inlining its implementations of the aforementioned level functions (e.g., as shown in Figure 11). This approach enables the code generator to support disparate target (and source) formats. At the same time, it limits the complexity of the code generator, since the code generator does not need to hard-code for specific data structures but can simply reason about how to invoke a fixed set of level functions.

In order to minimize memory traffic at runtime, our technique emits code that, wherever possible, fuses the assembly of adjacent levels in the result coordinate hierarchy. Adjacent levels can be assembled together as long as only the parent level requires a separate edge insertion phase (or if none do). As an example, none of the level formats that compose DIA requires edge insertion. Thus, our technique will emit code to convert any matrix to DIA by iterating over the matrix just once and assembling all output dimensions (levels) together.

For each set of levels that can be assembled together, our technique then simply has to emit code like what is shown in Figure 12 to perform edge insertion (if required) followed by coordinate insertion. If a level format implements both variants of edge insertion, then our technique selects one based on whether the parent level can be iterated in order. The code generator infers this from properties exposed through the coordinate hierarchy abstraction that specify if the parent level stores coordinates in order. If a level format implements yield_pos but does not permit storing duplicates of the same coordinate, our technique also emits logic to perform deduplication on the fly by keeping track of inserted coordinates.

To see how our technique works, suppose we are generating code to convert COO tensors to CSR. To obtain code that assembles the column dimension of the result, the code generator first emits sequenced edge insertion code that has the same structure as Figure 12 (left), except with all level functions replaced by their sequenced counterparts. The emitted code is then specialized to CSR by replacing the level function calls with the compressed level format's implementations of those functions (Figure 11, middle). The result
is lines 7–11 in Figure 6c, which iterate over all rows of the output in order and reserve enough memory to store each row's nonzeros. In a similar way, the code generator emits the code in Figure 12 (right) to perform coordinate insertion and then specializes it to CSR, yielding lines 12–25 in Figure 6c.

Table 2. Statistics about matrices used in our experiments. Non-symmetric matrices are highlighted in gray. Columns: Matrix, Dimensions, NNZ, Nonzero Diagonals, Max. NNZ/row. [Per-matrix values not recoverable in this extraction.]

We evaluate our technique and find that it generates efficient sparse tensor conversion routines for many combinations of disparate source and target formats. The generated conversion routines have performance similar to or better than equivalent hand-implemented versions. We also find that, for combinations of source and target formats that are not directly supported by a library, our technique can further optimize conversions between those formats by emitting code that avoids materializing temporaries.
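For reference, the COO-to-CSR routine whose generation was walked through above (a variation of Gustavson's HALFPERM algorithm, per Section 7) can be sketched as follows. This simplified version, with hypothetical names, combines the count query, sequenced edge insertion, and yield_pos-based coordinate insertion; the input COO nonzeros need not be sorted.

```c
#include <stdlib.h>

/* Sketch of generated COO -> CSR conversion: a count query supplies per-row
 * sizes, edge insertion builds pos, and coordinate insertion places each
 * nonzero with yield_pos (pos[i]++), after which finalize shifts pos back
 * down by one slot. (Illustrative only; not the emitted code itself.) */
void coo_to_csr(int nrows, int nnz, const int *row, const int *col,
                const double *val, int *pos, int *crd, double *vals) {
    /* attribute query: select [i] -> count(j) as nir */
    int *nir = calloc(nrows, sizeof(int));
    for (int p = 0; p < nnz; p++) nir[row[p]]++;
    /* sequenced edge insertion: pos[i+1] = pos[i] + nir[i] */
    pos[0] = 0;
    for (int i = 0; i < nrows; i++) pos[i + 1] = pos[i] + nir[i];
    /* coordinate insertion via yield_pos: q = pos[i]++ */
    for (int p = 0; p < nnz; p++) {
        int i = row[p];
        int q = pos[i]++;
        crd[q] = col[p];
        vals[q] = val[p];
    }
    /* finalize_yield_pos: restore pos by shifting entries down one slot */
    for (int i = nrows; i > 0; i--) pos[i] = pos[i - 1];
    pos[0] = 0;
    free(nir);
}
```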
We implemented a prototype of our technique as extensions to the open-source taco tensor algebra compiler [30] (https://github.com/tensor-compiler/taco/tree/c0e93b65). To evaluate it, we compare code that our technique generates against hand-implemented versions in SPARSKIT [48], a widely used [25, 42] Fortran sparse linear algebra library, and Intel MKL [24], a C and Fortran math processing library that is optimized for Intel processors. We also evaluate our technique against taco without our extensions.

All experiments are conducted on a 2.5 GHz Intel Xeon E5-2680 v3 machine with 30 MB of L3 cache and 128 GB of main memory. The machine runs Ubuntu 18.04.3 LTS. We compile code that our technique generates using GCC 7.5.0 and build SPARSKIT from source using GFortran 7.5.0. We run each experiment 50 times under cold cache conditions and report median serial execution times.

We run our experiments with real-world matrices of varying sizes and structures as inputs. These matrices, listed in Table 2, come from applications in disparate domains and are obtained from the SuiteSparse Matrix Collection [18].

We measure the performance of sparse tensor conversion routines that our technique generates for seven distinct combinations of source and target formats:
• COO to CSR (coo_csr)
• COO to DIA (coo_dia)
• CSR to CSC (csr_csc)
• CSR to DIA (csr_dia)
• CSR to ELL (csr_ell)
• CSC to DIA (csc_dia)
• CSC to ELL (csc_ell)
where inputs and outputs in COO, CSR, or CSC are not assumed to be sorted (though nonzeros are still necessarily grouped by row/column in CSR/CSC). For each combination of formats, we also measure the performance of conversion between those formats using SPARSKIT and Intel MKL. Both libraries implement routines that directly convert matrices from COO to CSR, CSR to CSC, and CSR to DIA. Additionally, SPARSKIT supports directly converting matrices from CSR to ELL.
However, neither SPARSKIT nor Intel MKL implements routines that directly convert matrices between the remaining combinations of formats. Thus, to perform those conversions using either library, we first convert the input from its source format to a temporary in CSR and then convert the temporary from CSR to the intended target format. (If the input matrix is symmetric, though, then conversions from CSC to DIA/ELL are cast as conversions from CSR to DIA/ELL, since CSR and CSC are equivalent in that case.)

Table 3 shows the results of our experiments. As these results demonstrate, our technique outperforms or is comparable to SPARSKIT and Intel MKL on average for all combinations of source and target formats that we evaluate. On the whole, the code that our technique emits to convert matrices from COO to CSR (coo_csr) and from CSR to CSC (csr_csc) exhibits similar performance to hand-implemented routines in SPARSKIT and somewhat better performance than Intel MKL. This is unsurprising since our technique generates code that implements the same algorithms (variations of Gustavson's HALFPERM algorithm [22]) as SPARSKIT. Our technique also emits code to perform CSR to DIA conversion that is 2.01× faster than SPARSKIT and 1.80× faster than Intel MKL on average. SPARSKIT's implementation of csr_dia supports extracting a bounded number of nonzero diagonals from the input matrix and storing them in the output. However, it implements this capability using an inefficient algorithm to identify the densest diagonals, thus leading to the

Table 3.
Normalized execution times of conversion routines that are implemented or generated in SPARSKIT (skit), Intel MKL (mkl), and taco without our extensions (taco w/o ext.), relative to code that our technique generates (taco w/ ext.). The actual execution times (in milliseconds) of code that our technique generates are shown in parentheses. For CSR to CSC conversion, we only show results for nonsymmetric matrices (rows highlighted in gray) since CSR and CSC are equivalent for symmetric matrices. For symmetric matrices, we cast CSC to DIA/ELL conversion as CSR to DIA/ELL conversion and report the same results. For conversions to DIA/ELL, we also omit results for matrices that would have to be stored with more than 75% of values being zeros, since having to compute with so many zeros would likely eliminate any performance benefit of DIA/ELL. Columns highlighted in gray denote kernels that perform the conversion by first converting to CSR temporaries.
Columns, in order: Matrix; coo_csr (taco w/ ext., taco w/o ext., skit, mkl); coo_dia (taco w/ ext., skit, mkl); csr_csc (taco w/ ext., skit, mkl); csr_dia (taco w/ ext., skit, mkl); csr_ell (taco w/ ext., skit); csc_dia (taco w/ ext., skit, mkl); csc_ell (taco w/ ext., skit). Entries are absent where a column is omitted for that matrix (see caption).
pdb1HYS 1 (57.5) 15.88 1.02 1.11 1 (79.1) 1.68 1 (79.1) 1.68
jnlbrng1 1 (0.96) 31.15 0.97 1.56 1 (0.86) 3.07 3.38 1 (0.91) 1.85 1.56 1 (0.92) 0.95 1 (0.91) 1.85 1.56 1 (0.92) 0.95
obstclae 1 (0.93) 31.95 1.00 1.60 1 (0.86) 3.02 3.36 1 (0.91) 1.84 1.54 1 (0.82) 1.04 1 (0.91) 1.84 1.54 1 (0.82) 1.04
chem_master1 1 (1.06) 29.05 1.04 1.44 1 (0.88) 4.60 4.94 1 (1.10) 0.98 2.14 1 (0.93) 1.83 1.54 1 (0.91) 0.91 1 (0.96) 4.24 3.85 1 (1.24) 3.11
rma10 1 (34.0) 12.77 1.01 0.96 1 (29.1) 1.17 1.16 1 (49.2) 1.84 1 (62.8) 2.09
dixmaanl 1 (1.61) 30.64 1.02 1.42 1 (1.50) 5.03 3.11 1 (1.54) 1.88 1.57 1 (1.35) 0.97 1 (1.54) 1.88 1.57 1 (1.35) 0.97
cant 1 (27.4) 25.89 1.00 1.35 1 (45.3) 4.16 4.39 1 (44.5) 3.61 3.63 1 (59.8) 1.78 1 (44.5) 3.61 3.63 1 (59.8) 1.78
shyy161 1 (1.69) 26.92 1.00 1.50 1 (1.76) 4.98 3.05 1 (1.64) 1.07 2.36 1 (1.86) 1.85 1.54 1 (1.77) 0.94 1 (1.89) 4.68 3.44 1 (2.36) 3.00
consph 1 (64.8) 18.58 1.01 1.21 1 (88.9) 1.45 1 (88.9) 1.45
denormal 1 (5.61) 33.13 1.01 1.51 1 (5.14) 5.10 5.78 1 (4.83) 2.21 2.26 1 (5.17) 1.02 1 (4.83) 2.21 2.26 1 (5.17) 1.02
Baumann 1 (4.70) 25.21 0.99 1.49 1 (3.48) 5.23 5.48 1 (4.71) 1.03 1.89 1 (3.56) 1.95 1.70 1 (3.33) 1.01 1 (3.60) 5.07 4.10 1 (4.74) 3.16
cop20k_A 1 (63.6) 8.46 0.89 0.96 1 (34.8) 3.60 1 (34.8) 3.60
shipsec1 1 (81.6) 18.29 1.02 1.28 1 (112) 1.93 1 (112) 1.93
majorbasis 1 (12.3) 24.05 1.00 1.33 1 (10.9) 3.34 3.70 1 (12.1) 0.99 1.78 1 (9.91) 2.43 2.42 1 (8.17) 1.03 1 (10.4) 3.47 4.37 1 (20.1) 1.88
scircuit 1 (15.8) 11.54 1.00 1.10 1 (16.4) 0.95 1.09
mac_econ_fwd500 1 (11.1) 20.52 0.99 1.29 1 (11.6) 1.00 1.38
pwtk 1 (121) 19.77 1.00 1.29 1 (123) 4.09 1 (123) 4.09
Lin 1 (9.88) 30.12 0.98 1.36 1 (8.40) 4.87 5.12 1 (8.14) 1.92 1.70 1 (10.1) 0.98 1 (8.14) 1.92 1.70 1 (10.1) 0.98
ecology1 1 (42.3) 19.82 0.99 1.41 1 (36.8) 2.74 3.00 1 (35.8) 1.64 1.44 1 (37.5) 1.08 1 (35.8) 1.64 1.44 1 (37.5) 1.08
webbase-1M 1 (57.9) 10.27 1.01 0.99 1 (59.3) 1.00 1.14
atmosmodd 1 (113) 15.15 0.96 1.21 1 (64.9) 3.26 3.49 1 (113) 0.97 1.04 1 (62.2) 1.72 1.58 1 (74.1) 1.17 1 (62.9) 3.40 3.39 1 (88.7) 2.20
Geomean (normalized to taco w/ ext. = 1): coo_csr: 20.39 (taco w/o ext.), 1.00 (skit), 1.29 (mkl); coo_dia: 4.01, 3.96; csr_csc: 1.02, 1.48; csr_dia: 2.01, 1.80; csr_ell: 1.36; csc_dia: 2.75, 2.51; csc_ell: 1.78

slowdown. In addition, the csr_ell code that our technique emits is 1.36× faster than SPARSKIT on average. This is because our technique emits code that invokes calloc to both allocate and zero-initialize the output arrays. SPARSKIT, by contrast, takes user-allocated output arrays as arguments and separately initializes those arrays. Furthermore, for conversions to DIA and ELL, code that our technique emits achieves even greater speedups, between 1.78 and 4.01×, over SPARSKIT and Intel MKL when converting from COO or CSC. This is because when the input matrix is nonsymmetric or stored in COO, both libraries must incur additional memory accesses to construct and then iterate over temporary CSR matrices.

Finally, we measure and compare against the performance of taco without our extensions for COO to CSR conversion. By expressing COO to CSR conversion in index notation as a tensor assignment (i.e., A_ij = B_ij, where A and B are CSR and COO matrices respectively), the techniques of Kjolstad et al. and Chou et al. [17, 29, 30] can also generate code that performs the conversion. As Table 3 also shows, though, our technique emits code to perform COO to CSR conversion that is 20.4× faster on average. The techniques of Kjolstad et al. and Chou et al. cannot reason about generating code that inserts nonzeros into CSR data structures out of order. Thus, the generated code must sort the input before performing the actual conversion, incurring significant additional overhead. Furthermore, while sparse matrix formats like DIA or ELL can be cast as 3rd-order tensor formats, index notation (as described in [17, 29, 30]) cannot express assignment of a matrix to a 3rd-order tensor in a way that does not duplicate nonzeros along the extra dimension.
So without the extensions described in this paper, taco, which takes index notation as input, cannot emit code to perform end-to-end conversion for formats that store nonzeros in non-lexicographic coordinate order.

There is a long line of work [5, 8, 10, 13, 23, 34–36, 47, 50, 60] on developing new sparse tensor formats to accelerate SpMV, sparse matrix-dense matrix multiplication (SpDM/SpMM), matricized tensor times Khatri-Rao products (MTTKRP), and other tensor computations. These formats organize nonzeros in disparate ways to reduce memory footprint, improve cache utilization, expose parallelization opportunities, and better
LDI ’20, June 15–20, 2020, London, UK Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe exploit hardware capabilities such as SIMD vector units forperformance. All the aforementioned works rely on hand-implemented routines for converting tensors from a standardrepresentation (e.g., COO) to their proposed formats. Theseroutines are often more complex than code to perform thecomputations that the proposed formats are designed toaccelerate. In addition, many techniques [15, 21, 22, 57–59]have been proposed for accelerating transpositions of CSRmatrices, which is equivalent to CSR to CSC conversion.Existing sparse tensor and linear algebra compilers cannotgenerate efficient code to convert tensors between arbitrary,disparate formats. The taco compiler [17, 29, 30] can emitcode to convert tensors between formats that store nonzerosin lexicographic coordinate order, but cannot generate com-plete conversion routines for structured formats like DIAand ELL. Without the extensions described in this paper, tacoalso cannot emit code that computes and uses statistics aboutthe input tensor to coordinate efficient assembly of the out-put tensor. LL [3, 4] is a functional language that lets usersdefine sparse matrix formats as nestings of lists and pairsthat encode nonzeros in a dense matrix. From these specifi-cations, the LL compiler can emit code that convert densematrices to different sparse matrix formats, but not efficientcode that can directly convert between sparse matrix formats.In the context of inspector-executor approaches for sparselinear algebra, Nandy et al. [39] build on CHiLL [56] andSPF [52] to show how inspectors that convert input matricesbetween different sparse matrix formats can be generated.Their approach, however, requires specifications to be pro-vided for every combination of potential source and targetformats, since each specification is hard-coded to data struc-tures used by the source and target formats. 
SIPR [45] and techniques that Bik and Wijshoff proposed [11, 12] can also generate sparse linear algebra code that, as sub-operations, converts matrices between different formats. However, conversions in SIPR-generated code are performed by invoking hand-implemented operations that are hard-coded to specific source and target formats. Meanwhile, the techniques proposed by Bik and Wijshoff only support a fixed set of standard sparse matrix formats. Bernoulli [31, 32, 51] uses a black-box protocol that provides an abstract interface for describing how sparse matrices stored in different data structures can be efficiently accessed. However, the interface does not support assembly, so Bernoulli cannot generate code that constructs sparse matrix results.

There is also a separate line of work [16, 26, 33, 40] on generating efficient code for query languages like SQL, which our attribute query language resembles. (Attribute queries are analogous to GROUP BY queries on a table that stores the coordinates of nonzeros.) In particular, HorseIR [16] lowers SQL queries to an array-based intermediate representation that is then optimized and compiled to efficient code. EmptyHeaded [1] is a graph processing framework that generates efficient code to compute graph queries expressed in a Datalog-like language. Furthermore, our approach to optimizing attribute queries is reminiscent of query rewriting systems in certain relational database systems like Starburst [43, 44]. All these techniques are designed for queries that may perform complex joins and aggregate data of arbitrary types. By contrast, attribute queries are limited to aggregating tensor coordinates, which are integers. This lets our technique lower and optimize attribute queries in ways that would be invalid for aggregations over other data types.
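To illustrate the GROUP BY analogy, consider the statistic needed to size ELL's output arrays: the maximum number of nonzeros in any row, which corresponds to SELECT MAX(c) FROM (SELECT COUNT(*) AS c FROM coords GROUP BY row). The sketch below is a minimal hand-written version of such analysis code over COO row coordinates; the function name and signature are illustrative, not the code our compiler emits.

```c
#include <stdlib.h>

/* Analysis code for an attribute query over a COO matrix's row
 * coordinates: group nonzeros by row, count each group, and take the
 * maximum -- the statistic that determines the width of ELL arrays.
 * Illustrative sketch only. */
static int max_nnz_per_row(int nrows, int nnz, const int *row) {
  int *count = calloc((size_t)nrows, sizeof(int)); /* zeroed buckets */
  for (int p = 0; p < nnz; p++)   /* GROUP BY row; COUNT(*) per group */
    count[row[p]]++;
  int max = 0;
  for (int i = 0; i < nrows; i++) /* MAX aggregate over the groups */
    if (count[i] > max) max = count[i];
  free(count);
  return max;
}
```

Because the aggregated values are tensor coordinates (small non-negative integers), a dense array of buckets suffices where a general SQL engine would need hash-based grouping, which is one example of the integer-specific optimizations mentioned above.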
Conclusion and Future Work

We have described a technique for generating sparse tensor conversion routines that efficiently convert tensors between a wide range of formats. Our technique is extensible, so users can easily add support for new source and target formats by simply specifying how to construct and iterate over tensors in those new formats. By making it easy to work with the same data in multiple formats, each suited to a different stage of an application, our technique can greatly reduce the effort needed to optimize sparse tensor algebra applications.

That said, our technique can be further generalized in various ways to support conversions between an even wider range of formats. A limitation of our technique is that it only supports tensor formats that can be expressed in the coordinate hierarchy abstraction we proposed in [17]. This, for instance, precludes support for conversions to hybrid formats like HYB [9] that decompose a tensor into multiple subtensors and store each using a different format, which the abstraction does not support. Coordinate remapping notation as we proposed it also does not support grouping nonzeros based on statistics of the input tensor. Such a capability could again be useful for supporting conversions to hybrid tensor formats, which require determining how to divide nonzeros amongst different subtensors. Furthermore, our technique as described does not support generating fully parallelized code for CPUs or GPUs. We believe addressing these limitations would constitute valuable future work.
Acknowledgments
We thank Peter Ahrens, Rawn Henry, Ziheng Wang, Michelle Strout, and other anonymous reviewers for their helpful reviews and suggestions. This work was supported by the Application Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA; the Toyota Research Institute; the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award Numbers DE-SC0008923 and DE-SC0018121; the National Science Foundation under Grant No. CCF-1533753; and DARPA under Awards HR0011-18-3-0007 and HR0011-20-9-0017. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the aforementioned funding agencies.
References

[1] Christopher R. Aberger, Andrew Lamb, Susan Tu, Andres Nötzli, Kunle Olukotun, and Christopher Ré. 2017. EmptyHeaded: A Relational Engine for Graph Processing.
ACM Trans. Database Syst.
42, 4, Article20 (Oct. 2017), 44 pages. https://doi.org/10.1145/3129246 [2] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, andMatus Telgarsky. 2014. Tensor Decompositions for Learning LatentVariable Models.
J. Mach. Learn. Res.
15, Article 1 (Jan. 2014), 60 pages.[3] Gilad Arnold. 2011.
Data-Parallel Language for Correct and EfficientSparse Matrix Codes . Ph.D. Dissertation. University of California,Berkeley.[4] Gilad Arnold, Johannes Hölzl, Ali Sinan Köksal, Rastislav Bodík, andMooly Sagiv. 2010. Specifying and Verifying Sparse Matrix Codes.In
Proceedings of the 15th ACM SIGPLAN International Conference onFunctional Programming (Baltimore, Maryland, USA) (ICFP ’10) . ACM,New York, NY, USA, 249–260. https://doi.org/10.1145/1863543.1863581 [5] Arash Ashari, Naser Sedaghati, John Eisenlohr, and P. Sadayappan.2014. An Efficient Two-dimensional Blocking Strategy for SparseMatrix-vector Multiplication on GPUs. In
Proceedings of the 28th ACMInternational Conference on Supercomputing (Munich, Germany) (ICS’14) . ACM, New York, NY, USA, 273–282. https://doi.org/10.1145/2597652.2597678 [6] Brett W. Bader, Michael W. Berry, and Murray Browne. 2008.
DiscussionTracking in Enron Email Using PARAFAC . Springer London, 147–163.[7] Brett W Bader and Tamara G Kolda. 2007. Efficient MATLAB compu-tations with sparse and factored tensors.
SIAM Journal on ScientificComputing
30, 1 (2007), 205–231.[8] M. Baskaran, B. Meister, N. Vasilache, and R. Lethin. 2012. Efficient andscalable computations with sparse tensors. In . 1–6. https://doi.org/10.1109/HPEC.2012.6408676 [9] Nathan Bell and Michael Garland. 2008.
Efficient Sparse Matrix-VectorMultiplication on CUDA . NVIDIA Technical Report NVR-2008-004.NVIDIA Corporation.[10] Nathan Bell and Michael Garland. 2009. Implementing Sparse Matrix-vector Multiplication on Throughput-oriented Processors. In
Proceed-ings of the Conference on High Performance Computing Networking,Storage and Analysis (Portland, Oregon) (SC ’09) . ACM, New York, NY,USA, Article 18, 11 pages. https://doi.org/10.1145/1654059.1654078 [11] Aart JC Bik and Harry AG Wijshoff. 1993. Compilation techniquesfor sparse matrix computations. In
Proceedings of the 7th internationalconference on Supercomputing . ACM, 416–424.[12] Aart JC Bik and Harry AG Wijshoff. 1994. On automatic data structureselection and code generation for sparse computations. In
Languagesand Compilers for Parallel Computing . Springer, 57–75.[13] Aydin Buluç, Jeremy T Fineman, Matteo Frigo, John R Gilbert, andCharles E Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In
Proceedings of the twenty-first annual symposium on Parallelism inalgorithms and architectures . ACM, 233–244.[14] Aydin Buluç and John R. Gilbert. 2008. On the representation and mul-tiplication of hypersparse matrices. In
IEEE International Symposiumon Parallel and Distributed Processing, (IPDPS).
Advances in EngineeringSoftware
17, 1 (Jan. 1993), 49–60. https://doi.org/10.1016/0965-9978(93)90041-Q [16] Hanfeng Chen, Joseph Vinish D’silva, Hongji Chen, Bettina Kemme,and Laurie Hendren. 2018. HorseIR: Bringing Array ProgrammingLanguages Together with Database Query Processing. In
Proceedingsof the 14th ACM SIGPLAN International Symposium on Dynamic Lan-guages (Boston, MA, USA) (DLS 2018) . ACM, New York, NY, USA,37–49. https://doi.org/10.1145/3276945.3276951 [17] Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe. 2018. For-mat Abstraction for Sparse Tensor Algebra Compilers.
Proc. ACM Program. Lang.
2, OOPSLA, Article 123 (Oct. 2018), 30 pages.[18] Timothy A. Davis and Yifan Hu. 2011. The University of Florida SparseMatrix Collection.
ACM Trans. Math. Softw.
38, 1, Article 1 (Dec. 2011).[19] Eduardo F. D’Azevedo, Mark R. Fahey, and Richard T. Mills. 2005. Vec-torized Sparse Matrix Multiply for Compressed Row Storage Format.In
Proceedings of the 5th International Conference on ComputationalScience - Volume Part I (Atlanta, GA) (ICCS’05) . Springer-Verlag, Berlin,Heidelberg, 99–106. https://doi.org/10.1007/11428831_13 [20] A. Elafrou, G. Goumas, and N. Koziris. 2017. Performance Analysis andOptimization of Sparse Matrix-Vector Multiplication on Modern Multi-and Many-Core Processors. In . 292–301.[21] Miguel A. Gonzalez-Mesa, Eladio D. Gutierrez, and Oscar Plata. 2013.Parallelizing the Sparse Matrix Transposition: Reducing the Program-mer Effort Using Transactional Memory.
Procedia Computer Science
18 (2013), 501 – 510. https://doi.org/10.1016/j.procs.2013.05.214
ACM Trans. Math. Softw.
4, 3 (Sept. 1978), 250–269. https://doi.org/10.1145/355791.355796 [23] Eun-jin Im and Katherine Yelick. 1998. Model-Based Memory Hierar-chy Optimizations for Sparse Matrices. In
In Workshop on Profile andFeedback-Directed Compilation .[24] Intel. 2020. Intel Math Kernel Library Developer Reference. https://software.intel.com/sites/default/files/mkl-2020-developer-reference-c.pdf.pdf [25] Yuanlin Jiang. 2007.
Techniques for Modeling Complex Reservoirs andAdvanced Wells . Ph.D. Dissertation. Stanford University.[26] Jun Rao, H. Pirahesh, C. Mohan, and G. Lohman. 2006. Compiled QueryExecution Engine using JVM. In . 23–23. https://doi.org/10.1109/ICDE.2006.40 [27] David R. Kincaid, Thomas C. Oppe, and David M. Young. 1989.
IT-PACKV 2D User’s Guide .[28] Fredrik Kjolstad. 2020.
Sparse Tensor Algebra Compilation . Ph.D.Dissertation. Massachusetts Institute of Technology.[29] Fredrik Kjolstad, Peter Ahrens, Shoaib Kamil, and Saman Amarasinghe.2019. Tensor Algebra Compilation with Workspaces. (2019), 180–192. http://dl.acm.org/citation.cfm?id=3314872.3314894 [30] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, andSaman Amarasinghe. 2017. The Tensor Algebra Compiler.
Proc. ACMProgram. Lang.
1, OOPSLA, Article 77 (Oct. 2017), 29 pages. https://doi.org/10.1145/3133901 [31] Vladimir Kotlyar. 1999.
Relational Algebraic Techniques for the Synthesisof Sparse Matrix Programs . Ph.D. Dissertation. Cornell University.[32] Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill. 1997. A relationalapproach to the compilation of sparse matrix programs. In
Euro-Par’97Parallel Processing . Springer, 318–327.[33] K. Krikellas, S. D. Viglas, and M. Cintra. 2010. Generating code forholistic query evaluation. In . 613–624. https://doi.org/10.1109/ICDE.2010.5447892 [34] Jiajia Li, Jimeng Sun, and Richard Vuduc. 2018. HiCOO: HierarchicalStorage of Sparse Tensors. In
Proceedings of the International Conferencefor High Performance Computing, Networking, Storage, and Analysis (Dallas, Texas) (SC ’18) . IEEE Press, Piscataway, NJ, USA, Article 19,15 pages. https://doi.org/10.1109/SC.2018.00022 [35] B. Liu, C. Wen, A. D. Sarwate, and M. M. Dehnavi. 2017. A UnifiedOptimization Approach for Sparse Tensor Operations on GPUs. In .47–57. https://doi.org/10.1109/CLUSTER.2017.75 [36] Weifeng Liu and Brian Vinter. 2015. CSR5: An Efficient Storage Formatfor Cross-Platform Sparse Matrix-Vector Multiplication. In
Proceed-ings of the 29th ACM on International Conference on Supercomputing (Newport Beach, California, USA) (ICS ’15) . ACM, New York, NY, USA,
https://doi.org/10.1145/2751205.2751209 [37] Alexander Monakov, Anton Lokhmotov, and Arutyun Avetisyan. 2010. Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures. In
High Performance Embedded Architectures and Com-pilers , Yale N. Patt, Pierfrancesco Foglia, Evelyn Duesterwald, PaoloFaraboschi, and Xavier Martorell (Eds.). Springer Berlin Heidelberg,Berlin, Heidelberg, 111–125.[38] Guy M Morton. 1966.
A computer oriented geodetic data base and anew technique in file sequencing . Technical report.[39] Payal Nandy, Mary Hall, Eddie C. Davis, Catherine MillsOlschanowsky, Mahdi Soltan Mohammadi, Wei He, and Michelle MillsStrout. 2018. Abstractions for Specifying Sparse Matrix DataTransformations. In
Proceedings of Eighth InternationalWorkshop onPolyhedral Compilation Techniques (Manchester, United Kingdom) (IMPACT 2018) .[40] Thomas Neumann. 2011. Efficiently Compiling Efficient Query Plansfor Modern Hardware.
Proc. VLDB Endow.
4, 9 (June 2011), 539–550. https://doi.org/10.14778/2002938.2002940 [41] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, YiranChen, and Pradeep Dubey. 2016. Faster CNNs with Direct SparseConvolutions and Guided Pruning. arXiv:cs.CV/1608.01409[42] Andrés Peratta and Viktor Popov. 2006. A new scheme for numericalmodelling of flow and transport processes in 3D fractured porousmedia.
Advances in Water Resources
29, 1 (2006), 42 – 61. https://doi.org/10.1016/j.advwatres.2005.05.004 [43] Hamid Pirahesh, Joseph M. Hellerstein, and Waqar Hasan. 1992. Exten-sible/Rule Based Query Rewrite Optimization in Starburst. In
Proceed-ings of the 1992 ACM SIGMOD International Conference on Managementof Data (San Diego, California, USA) (SIGMOD ’92) . ACM, New York,NY, USA, 39–48. https://doi.org/10.1145/130283.130294 [44] Hamid Pirahesh, T. Y. Cliff Leung, and Waqar Hasan. 1997. A Rule En-gine for Query Transformation in Starburst and IBM DB2 C/S DBMS.In
Proceedings of the Thirteenth International Conference on Data En-gineering (ICDE ’97) . IEEE Computer Society, Washington, DC, USA,391–400. http://dl.acm.org/citation.cfm?id=645482.653436 [45] William Pugh and Tatiana Shpeisman. 1999. SIPR: A new frameworkfor generating efficient code for sparse matrix computations. In
Lan-guages and Compilers for Parallel Computing . Springer, 213–229.[46] Samyam Rajbhandari, Yuxiong He, Olatunji Ruwase, Michael Carbin,and Trishul Chilimbi. 2017. Optimizing CNNs on Multicores for Scala-bility, Performance and Goodput. In
Proceedings of the Twenty-SecondInternational Conference on Architectural Support for Programming Lan-guages and Operating Systems (Xi’an, China) (ASPLOS ’17) . ACM, NewYork, NY, USA, 267–280. https://doi.org/10.1145/3037697.3037745 [47] Youcef Saad. 1989. Krylov Subspace Methods on Supercomputers.
SIAM J. Sci. Stat. Comput.
10, 6 (Nov. 1989), 1200–1232. https://doi.org/10.1137/0910073 [48] Youcef Saad. 1994. SPARSKIT: a basic tool kit for sparse matrix com-putations - Version 2.[49] Yousef Saad. 2003.
Iterative methods for sparse linear systems . SIAM.[50] Shaden Smith and George Karypis. 2015. Tensor-matrix products witha compressed sparse tensor. In
Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms . ACM, 5.[51] Paul Stodghill. 1997.
A Relational Approach to the Automatic Gener-ation of Sequential Sparse Matrix Codes . Ph.D. Dissertation. CornellUniversity.[52] Michelle Mills Strout, Alan LaMielle, Larry Carter, Jeanne Ferrante,Barbara Kreaseck, and Catherine Olschanowsky. 2016. An approachfor code generation in the Sparse Polyhedral Framework.
ParallelComput.
53 (2016), 32 – 57. https://doi.org/10.1016/j.parco.2016.02.004 [53] Bor-Yiing Su and Kurt Keutzer. 2012. clSpMV: A Cross-PlatformOpenCL SpMV Framework on GPUs. In
Proceedings of the 26thACM International Conference on Supercomputing (San Servolo Is-land, Venice, Italy) (ICS ’12) . ACM, New York, NY, USA, 353–364. https://doi.org/10.1145/2304576.2304624 [54] The SciPy community. 2018. scipy.sparse.dok_matrix – SciPyv1.1.0 Reference Guide. https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.html .[55] William F Tinney and John W Walker. 1967. Direct solutions of sparsenetwork equations by optimally ordered triangular factorization.
Proc.IEEE
55, 11 (1967), 1801–1809.[56] Anand Venkat, Mary Hall, and Michelle Strout. 2015. Loop and DataTransformations for Sparse Matrix Code. In
Proceedings of the 36thACM SIGPLAN Conference on Programming Language Design and Im-plementation (Portland, OR, USA) (PLDI 2015) . 521–532.[57] Hao Wang, Weifeng Liu, Kaixi Hou, and Wu-chun Feng. 2016. ParallelTransposition of Sparse Data Structures. In
Proceedings of the 2016 International Conference on Supercomputing (Istanbul, Turkey) (ICS '16). Association for Computing Machinery, New York, NY, USA, Article 33, 13 pages. https://doi.org/10.1145/2925426.2926291 [58] Tien-Hsiung Weng, Delgerdalai Batjargal, Hoa Pham, Meng-Yen Hsieh, and Kuan-Ching Li. 2013. Parallel Matrix Transposition and Vector Multiplication Using OpenMP. In
Intelligent Technologies and En-gineering Systems (Lecture Notes in Electrical Engineering) , JengnanJuang and Yi-Cheng Huang (Eds.). Springer, New York, NY, 243–249. https://doi.org/10.1007/978-1-4614-6747-2_30 [59] Tien-Hsiung Weng, Hoa Pham, Hai Jiang, and Kuan-Ching Li. 2013.Designing Parallel Sparse Matrix Transposition Algorithm Using CSRfor GPUs. In
Intelligent Technologies and Engineering Systems (LectureNotes in Electrical Engineering) , Jengnan Juang and Yi-Cheng Huang(Eds.). Springer, New York, NY, 251–257. https://doi.org/10.1007/978-1-4614-6747-2_31 [60] Biwei Xie, Jianfeng Zhan, Xu Liu, Wanling Gao, Zhen Jia, Xiwen He,and Lixin Zhang. 2018. CVR: Efficient Vectorization of SpMV on x86Processors. In
Proceedings of the 2018 International Symposium on CodeGeneration and Optimization (Vienna, Austria) (CGO 2018) . ACM, NewYork, NY, USA, 149–162. https://doi.org/10.1145/3168818 [61] Albert-Jan N. Yzelman and Rob H. Bisseling. 2012. A Cache-ObliviousSparse Matrix–Vector Multiplication Scheme Based on the HilbertCurve. In