ALTO: Adaptive Linearized Storage of Sparse Tensors

AHMED E. HELAL, Intel Labs, USA
JAN LAUKEMANN, Intel Labs, USA
FABIO CHECCONI, Intel Labs, USA
JESMIN JAHAN TITHI, Intel Labs, USA
TERESA RANADIVE, Laboratory for Physical Sciences, USA
FABRIZIO PETRINI, Intel Labs, USA
JEEWHAN CHOI, University of Oregon, USA
The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process due to their irregular shapes and data distributions. We propose the Adaptive Linearized Tensor Order (ALTO) format, a novel mode-agnostic (general) representation that keeps neighboring nonzero elements in the multi-dimensional space close to each other in memory. To generate the indexing metadata, ALTO uses an adaptive bit encoding scheme that trades off index computations for lower memory usage and more effective use of memory bandwidth. Moreover, by decoupling its sparse representation from the irregular spatial distribution of nonzero elements, ALTO eliminates the workload imbalance and greatly reduces the synchronization overhead of tensor computations. As a result, the parallel performance of ALTO-based tensor operations becomes a function of their inherent data reuse. On a gamut of tensor datasets, ALTO outperforms an oracle that selects the best state-of-the-art format for each dataset, when used in key tensor decomposition operations. Specifically, ALTO achieves a geometric mean speedup of 8× over the best mode-agnostic format, while delivering a geometric mean compression ratio of more than 4× relative to the best mode-specific format.

Additional Key Words and Phrases: Sparse tensors, tensor decomposition, MTTKRP, multi-core CPU

Many critical application domains such as healthcare [Ho et al. 2014; Wang et al. 2015], cybersecurity [Fanaee-T and Gama 2016; Kobayashi et al. 2018], data mining [Kolda and Sun 2008; Papalexakis et al. 2016], and machine learning [Anandkumar et al. 2014; Sidiropoulos et al. 2017] produce and manipulate massive amounts of high-dimensional data. Such datasets can naturally be represented as sparse tensors, which store the values of the nonzero tensor elements along with indexing metadata that denote the position of each nonzero in the tensor. Therefore, a fundamental problem in sparse tensor computations involves determining how to store, group, and organize the nonzero elements to 1) reduce memory storage, 2) improve data locality, 3) increase parallelism, and 4) decrease workload imbalance and synchronization overhead. Since tensor algorithms perform computations along different mode (i.e., dimension) orientations, practical sparse tensor formats must be mode-agnostic to deliver acceptable performance and scalability across all modes. Because real-world sparse tensors are highly irregular in terms of their
shape, dimensions, and distribution of nonzero elements, achieving these (oftentimes conflicting) goals is challenging.

To tackle this problem, researchers have proposed many sparse tensor formats [Bader and Kolda 2007; Baskaran et al. 2012; Li et al. 2018; Li et al. 2019; Liu et al. 2017; Nisa et al. 2019; Nisa et al. 2019; Phipps and Kolda 2019; Smith and Karypis 2015; Smith et al. 2015], which can be classified, based on their encoding of the multi-dimensional coordinates, into list-, block-, and tree-based formats [Chou et al. 2018]. List-based tensor representations, such as the simple coordinate (COO) format, explicitly store the nonzero elements along with their coordinates (i.e., the indices of all dimensions). Therefore, they are agnostic to the different mode orientations of tensor algorithms and, as a result, they remain the de facto sparse tensor storage [Chou et al. 2018] in many libraries (e.g., Tensor Toolbox [Bader and Kolda 2007], TensorFlow [Abadi et al. 2016], and Tensorlab [Vervliet et al. 2016]). However, the list-based COO format does not impose any order on the multi-dimensional data, and it suffers from a significant synchronization overhead to resolve the update conflicts across threads [Liu et al. 2017].

Prior block-based sparse tensor representations employ multi-dimensional tiling schemes to further compress the COO format [Li et al. 2018; Li et al. 2019]. However, the efficacy of this hierarchical COO (HiCOO) storage completely depends on the characteristics of target tensors (such as their shape, density, and data distribution) as well as the format parameters (e.g., block size). In addition, the resulting parallel schedule of HiCOO blocks can suffer from limited parallelism and scalability due to conflicting updates across blocks.

Several proposals use tree-based data structures to extend traditional compressed matrix formats, such as the compressed sparse row (CSR) format, to higher-order tensors. The most popular example of these storage representations is the compressed sparse fiber (CSF) format [Smith et al. 2015], which uses multiple arrays of index pointers to compress the multi-dimensional indices of nonzero elements. While CSF-based formats can reduce memory traffic, they are mode-specific, i.e., they are oriented to favor a specific order of tensor modes. As the tensor order increases, these mode-specific formats require excessive memory to store multiple tensor copies [Smith and Karypis 2017] and/or different code implementations for computing along distinct modes [Smith and Karypis 2015]. Moreover, such a compressed, coarse-grained format can suffer from significant workload imbalance and limited scalability, especially in irregularly shaped tensors with short modes [Li et al. 2018].
Fig. 1. A box plot of the data (nonzero elements) distribution across the multi-dimensional blocks (subspaces) of the hierarchical coordinate storage. The multi-dimensional subspace size is 128^N, where N is the number of dimensions (modes), as per prior work [Li et al. 2018]. The sparse tensors are sorted in an increasing order of their size (number of nonzero elements). (Note that the implementation of HiCOO [Li et al. 2018], a block-based COO format, only supports 3D and 4D sparse tensors.)

In summary, the above block- and tree-based approaches for sparse tensor storage rely on clustering the nonzero elements based on their location in the multi-dimensional space and then partitioning this space into non-overlapping regions to generate compressed indexing metadata. Hence, they are constrained by the spatial distribution of nonzero elements. To this end, Figure 1 shows the distribution of nonzero elements in the multi-dimensional space for a set of 3D and 4D sparse tensors. Specifically, it demonstrates that the number of nonzero elements per block fluctuates widely (note the use of logarithmic scale). As the sparsity of tensors increases (e.g., the deli, nell-1, amazon, and reddit tensors), the location-based clustering fails to compress tensors and introduces a substantial memory overhead. As a result, the parallel performance and scalability of such location-based formats can be severely impacted by the irregular data distributions and unstructured sparsity patterns that typically emerge in higher-order sparse tensors.
We propose the Adaptive Linearized Tensor Order (ALTO) format, a mode-agnostic storage for sparse tensors that addresses the irregularity in sparse tensor computations and the performance bottlenecks on modern parallel architectures. ALTO organizes and stores the nonzero elements of a given tensor in a one-dimensional data structure along with compact indexing metadata, such that neighboring nonzero elements in the tensor are close to each other in memory. Most importantly, it generates the indexing metadata using an adaptive bit encoding scheme based on the characteristics of target tensors. Such an adaptive format trades off index computations (i.e., de-linearization) for lower memory usage and more efficient use of the effective memory bandwidth.

Unlike prior compressed [Li et al. 2018; Smith et al. 2015] and linearized [Harrison and Joseph 2018] sparse tensor formats, ALTO not only improves data locality across all mode orientations, but also eliminates workload imbalance and greatly reduces synchronization cost, which have traditionally limited the performance and scalability of irregular sparse tensor computations. Moreover, it generates perfectly balanced partitions for effective parallel execution. To resolve update conflicts across these potentially overlapping partitions, ALTO locates their positions in the multi-dimensional space and automatically selects the appropriate synchronization mechanism based on the data reuse of target tensors. As a result, ALTO delivers superior performance over state-of-the-art sparse tensor formats. In what follows, we summarize the contributions of this work:

• We introduce ALTO, a novel sparse tensor format for high-performance and scalable execution of tensor operations. Unlike prior location-based clustering approaches for compressed sparse tensor storage, ALTO uses a single (mode-agnostic) tensor representation that improves data locality, eliminates workload imbalance, and greatly reduces memory usage and synchronization overhead, regardless of the data distribution in the multi-dimensional space.
• We propose an adaptive synchronization mechanism to efficiently resolve conflicting updates (writes) across threads based on the average data reuse of target sparse tensors.
• We present effective parallel algorithms that perform Matricized Tensor Times Khatri-Rao Product (MTTKRP) operations, the main tensor decomposition performance bottleneck [Baskaran et al. 2012; Choi and Vishwanathan 2014; Kang et al. 2012; Smith et al. 2015], on sparse tensors stored via ALTO.
• We demonstrate that ALTO outperforms the state-of-the-art sparse tensor formats via experimental evaluation over a variety of real-world datasets. ALTO achieves a geometric mean speedup of 8× over the best mode-agnostic format and a geometric mean compression ratio of more than 4× over the best mode-specific format.

We begin with a brief overview of tensor decomposition methods and related notations. The work by Kolda and Bader [Bader and Kolda 2007; Kolda and Bader 2009] provides a more in-depth discussion of tensor algorithms and applications.
Tensors are the higher-order generalization of matrices. An N-dimensional tensor is said to have N modes and is called a mode-N tensor. The following notations are used in this paper:

(1) Scalars are denoted by lower case letters (e.g., a).
(2) Vectors are mode-1 tensors denoted by bold lower case letters (e.g., a). The i-th element of a vector a is denoted by a_i.
(3) Matrices are mode-2 tensors denoted by bold capital letters (e.g., A). If A is an I × J matrix, it can also be denoted as A ∈ ℝ^{I×J}, and its element at index (i, j) is denoted as a_{i,j}.
(4) Higher-order tensors are denoted by Euler script letters (e.g., X). A mode-N tensor whose dimensions are I_1 × I_2 × ··· × I_N can be denoted as X ∈ ℝ^{I_1×I_2×···×I_N}, and its element at index (i_1, i_2, ..., i_N) is denoted as x_{i_1,i_2,...,i_N}.
(5) Fibers are the higher-order analogue of matrix rows and columns. A mode-n fiber is defined by fixing every mode except the n-th mode. For example, a matrix column is defined by fixing the second mode, and is therefore a mode-1 fiber. A fiber is denoted by using a colon for the non-fixed mode (e.g., the j-th column of a matrix A is denoted by a_{:,j}).
(6) Slices are the lower-order components of a tensor which result from fixing all but two modes. For example, the slices of a mode-3 tensor X are matrices (such as X_{i,:,:} and X_{:,j,:}).

The Canonical Polyadic Decomposition (CPD) is a widely used type of tensor factorization, in which a mode-N tensor X is approximated by the sum of R outer products, each of N vectors. Each outer product is called a rank-1 component, while the sum of the R components is said to be a rank-R decomposition of X. The vectors forming the rank-1 outer products each correspond to a particular tensor mode. We may arrange the R vectors corresponding to each of the N modes into N different factor matrices so that the decomposition of X is expressed in terms of these matrices. For example, if X ∈ ℝ^{I×J×K}, we may write a decomposition of X in terms of factor matrices A ∈ ℝ^{I×R}, B ∈ ℝ^{J×R}, and C ∈ ℝ^{K×R}, where the columns of A (resp. B and C) are the vectors used in forming the R outer products along mode-1 (resp. 2 and 3).

The CPD-Alternating Least Squares (CPD-ALS) method is a widely used tensor decomposition algorithm. During each ALS iteration, one alternates between updating each of the individual factor matrices, updating each factor matrix to yield the best approximation of X when all other factor matrices are fixed. The most expensive part of CPD-ALS, along with many other tensor algorithms, is the matricized tensor times Khatri-Rao product (MTTKRP) operation [Smith et al. 2015]. The MTTKRP operation involves two basic subroutines:

(1) Tensor matricization: a process in which a tensor is unfolded into a matrix. The mode-n matricization of a tensor X, denoted X_(n), is obtained by laying out the mode-n fibers of X as the columns of X_(n).
(2) Khatri-Rao product: the "matching column-wise" Kronecker product between two matrices. That is, given matrices B ∈ ℝ^{J×R} and C ∈ ℝ^{K×R}, their Khatri-Rao product K = B ⊙ C is a (J·K) × R matrix defined as K = [b_1 ⊗ c_1, b_2 ⊗ c_2, ..., b_R ⊗ c_R].

For a mode-3 tensor X, the mode-1 MTTKRP operation can be expressed as X_(1)(B ⊙ C). Typically, MTTKRP operations along all modes are performed 10-100 times in one tensor decomposition calculation. Since these MTTKRP operations are similar, we only discuss mode-1 MTTKRP in this paper.
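To make the shapes concrete, here is a small worked instance of the Khatri-Rao product; the matrix values are ours, chosen purely for illustration:

```latex
% Khatri-Rao product for J = 2, K = 3, R = 2 (illustrative values).
B = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad
C = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 2 & 2 \end{pmatrix}, \quad
B \odot C = \left[\, b_1 \otimes c_1 \;\; b_2 \otimes c_2 \,\right]
= \begin{pmatrix}
0 & 2 \\ 1 & 0 \\ 2 & 4 \\ 0 & 4 \\ 3 & 0 \\ 6 & 8
\end{pmatrix} \in \mathbb{R}^{(J \cdot K) \times R}.
```

Mode-1 MTTKRP then multiplies the I × (J·K) unfolding X_(1) by this 6 × 2 matrix, yielding an I × 2 result.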
Real-world sparse tensors, which emerge in high-dimensional data analytics, are challenging to efficiently encode as they suffer from highly irregular shapes and data distributions as well as unstructured sparsity patterns. For example, one mode of a tensor may represent a massive user database while another mode represents their demographic information, their interactions, or their (potentially incomplete) consumer preferences/ratings for a set of products [Smith et al. 2017].

Thus, the proposed ALTO format uses an adaptive (data-aware) recursive partitioning of the high-dimensional space that represents a given sparse tensor to generate a mode-agnostic linearized index, which maps a point (nonzero element) in this Cartesian space to a point on a compact line. Specifically, ALTO splits every mode into multiple regions based on the mode length, such that each distinct mode has a variable number of regions to adapt to the unequal cardinalities of different modes and to minimize the storage requirements. This adaptive linearization ensures that neighboring points in a multi-dimensional space are close to each other on the resulting compact line. Moreover, the ALTO format is not only locality-friendly, but also parallelism-friendly, as it decomposes the multi-dimensional space into perfectly balanced (in terms of workload) subspaces. Further, it intelligently arranges the modes in the derived subspaces based on their characteristics to further reduce the overhead of resolving the write conflicts that typically occur in parallel sparse tensor computations. What follows is a detailed description and discussion of the ALTO format generation, as well as ALTO-based sequential and parallel algorithms and adaptive conflict resolution mechanisms for efficient execution of illustrative sparse tensor operations on parallel shared-memory platforms.

Formally, an
ALTO tensor X = {x_1, x_2, ..., x_M} is an ordered set of nonzero elements, where each element x_i = ⟨v_i, p_i⟩ is represented by a value v_i and a position p_i. The position p_i corresponds to a compact mode-agnostic encoding of the indexing metadata, which is used to quickly generate the tuple (i_1, i_2, ..., i_N) that locates a nonzero element in the multi-dimensional Cartesian space. The generation of an ALTO tensor is carried out in two stages: linearization and ordering. First, ALTO constructs the indexing metadata using a compressed encoding scheme, based on the cardinalities of tensor modes, to map each nonzero element to a position on a compact line. Second, it arranges the nonzero elements in a linearized storage according to their line positions, i.e., the values of their ALTO index. Typically, the ordering stage dominates the format conversion/generation time. However, compared to the location-based sparse tensor formats [Li et al. 2018; Li et al. 2019; Nisa et al. 2019; Nisa et al. 2019; Smith and Karypis 2015; Smith et al. 2015], ALTO requires a minimal generation time because ordering the linearized tensors incurs a fraction of the cost required to sort the multi-dimensional tensor formats (due to the reduction in the comparison operations).
Fig. 2. An example of the ALTO sparse encoding and representation for a three-dimensional 4 × 8 × 2 tensor. The ALTO positions of the shown nonzero elements are 15 (001111), 20 (010100), 25 (011001), 42 (101010), and 51 (110011).
Figure 2 provides an example of the ALTO format for a 4 × 8 × 2 tensor with nonzero elements x_{i,j,k}. The multi-dimensional indices (i, j, and k) are color coded, and the n-th bit of their binary representation is denoted i_n/j_n/k_n. Specifically, ALTO keeps the value of a nonzero element along with a linearized index, where each bit of this index is selected to partition the multi-dimensional space into two hyperplanes. For example, the ALTO encoding in Figure 2 uses a compact line of length 64 (i.e., a 6-bit linearized index) to represent the target tensor of size 4 × 8 × 2. This index consists of three groups of bits with variable sizes (resolutions) to efficiently handle high-order data of arbitrary dimensionality. Within each bit group, ALTO arranges the modes in an increasing order of their length (i.e., the shortest mode first), which is equivalent to partitioning the multi-dimensional space along the longest mode first. Such an encoding aims to generate a balanced linearization of the irregular Cartesian space, where the amount of information about the spatial position of a nonzero element decreases with every consecutive bit, starting from the most-significant bit. Therefore, the line segments encode subspaces with mode intervals of equivalent lengths, e.g., the line segments [0-31], [0-15], and [0-7] encode subspaces of 4 × 4 × 2, 4 × 2 × 2, and 2 × 2 × 2, respectively. ALTO encodes the resulting linearized index in the minimum number of bits, and it improves data locality across all modes of a given sparse tensor. Hence, a mode-N tensor, whose dimensions are I_1 × I_2 × ··· × I_N, can be efficiently represented using a single ALTO format with indexing metadata of size:

$$S_{\mathrm{ALTO}} = M \times \Big( \sum_{n=1}^{N} \lceil \log_2 I_n \rceil \Big) \ \text{bits}, \qquad (1)$$

where M is the number of nonzero elements. As a result, compared to the de facto COO format, ALTO reduces the storage requirement regardless of the tensor's characteristics.

Fig. 3. For the example in Figure 2, ALTO generates a non-fractal, yet more compact encoding compared to traditional space-filling curves, such as the Z-Morton order: the Z-curve bit mask spans 9 bits (a line of length 512), whereas the ALTO bit mask spans 6 bits (a line of length 64).
That is, the metadata compression ratio of the ALTO format relative to COO is always ≥ 1. On a hardware architecture with a word-level memory addressing mode, this compression ratio is given by:

$$\frac{S_{\mathrm{COO}}}{S_{\mathrm{ALTO}}} = \frac{\sum_{n=1}^{N} \big\lceil \lceil \log_2 I_n \rceil / W_s \big\rceil}{\Big\lceil \big( \sum_{n=1}^{N} \lceil \log_2 I_n \rceil \big) / W_s \Big\rceil}, \qquad (2)$$

where W_s is the word size in bits. For example, on an architecture with a byte-level addressing mode (i.e., W_s = 8), the example tensor in Figure 2 requires three bytes to store the multi-dimensional index of a nonzero element in the list-based COO format and a single byte to store the linearized index of the same element in the ALTO format: the metadata compression ratio of ALTO, compared to the list-based formats, is three.
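The following self-contained sketch (our illustration, not part of the ALTO library) reproduces this arithmetic for the example tensor, following Eqs. (1) and (2):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative helper: per-nonzero metadata cost of COO vs. ALTO.
int main() {
    const std::vector<uint64_t> dims = {4, 8, 2}; // I_1 x I_2 x I_3
    const unsigned Ws = 8;                        // byte-level addressing

    unsigned coo_words = 0, alto_bits = 0;
    for (uint64_t I : dims) {
        unsigned bits = 0;
        while ((1ull << bits) < I) ++bits;        // ceil(log2(I))
        coo_words += (bits + Ws - 1) / Ws;        // one padded word per mode
        alto_bits += bits;                        // mode bits packed together
    }
    const unsigned alto_words = (alto_bits + Ws - 1) / Ws;

    // Prints: COO: 3 bytes, ALTO: 1 byte -> compression ratio 3.
    std::printf("COO: %u bytes, ALTO: %u bytes\n", coo_words, alto_words);
}
```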
Most importantly, the ALTO format not only reduces the overall memory traffic of sparse tensor computations, but also decreases the number of memory transactions required to access the indexing metadata of a sparse tensor, as reading the linearized index requires fewer memory transactions compared to reading several multi-dimensional indices. In addition, this natural coalescing of the multi-dimensional indices into a single linearized index increases the memory transaction size to make more efficient use of the main memory bandwidth. It is important to note that
ALTO uses a non-fractal encoding scheme, unlike the traditional space-filling curves (SFCs) [Peano 1890]. In contrast, such SFCs (e.g., the Z-Morton order [Morton 1966]) are based on continuous self-similar (fractal) functions that can be extremely inefficient to encode the irregularly shaped multi-dimensional spaces that emerge in higher-order sparse tensor algorithms, as they require indexing metadata of size:

$$S_{\mathrm{SFC}} = M \times N \times \max_{n=1}^{N} \big( \lceil \log_2 I_n \rceil \big) \ \text{bits}. \qquad (3)$$

Therefore, the application of SFCs in sparse tensor computations has been limited to reordering the nonzero elements to improve their data locality rather than compressing the indexing metadata [Li et al. 2018]. Figure 3 shows the compact encoding generated by ALTO compared to the fractal encoding scheme based on the Z-Morton curve. In this example, the non-fractal encoding scheme used by ALTO reduces the length of the encoding line by a factor of eight, which not only decreases the overall size of the indexing metadata, but also reduces the linearization/de-linearization time required to map the multi-dimensional space to/from the compact encoding line. (A fractal pattern is a hierarchically self-similar pattern that looks the same at increasingly smaller scales.)

Fig. 4. The ALTO-based bit encoding/decoding mechanisms: (a) ALTO generates its linearized index using bit-level gather operations; (b) to generate the multi-dimensional indices, ALTO decodes the linearized indexing metadata using bit-level scatter operations.
To allow fast indexing of the linearized tensors during sparse tensor operations, the ALTO encoding is implemented using a set of N simple bit masks, where N is the number of modes, on top of common data processing primitives. Figure 4 shows the linearization and de-linearization mechanisms, which are used during the ALTO format generation and the sparse tensor computations, respectively. The linearization is implemented as a bit-level gather, while the de-linearization is performed as a bit-level scatter.
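On x86 CPUs with BMI2, one natural way to realize these bit-level primitives is with the PDEP/PEXT instructions. The sketch below is our illustration, not the library's actual code; the mask values are assumptions consistent with the positions listed in Figure 2:

```cpp
#include <cstdint>
#include <immintrin.h> // _pdep_u64 / _pext_u64 (BMI2)

// Assumed per-mode bit masks for the 4 x 8 x 2 example in Figure 2,
// with bit significance (MSB..LSB): j2 j1 i1 j0 i0 k0.
constexpr uint64_t MASK_I = 0b001010;
constexpr uint64_t MASK_J = 0b110100;
constexpr uint64_t MASK_K = 0b000001;

// Linearization: deposit each mode index's bits into its positions
// in the linearized index (the gather of Figure 4a).
inline uint64_t linearize(uint64_t i, uint64_t j, uint64_t k) {
    return _pdep_u64(i, MASK_I) | _pdep_u64(j, MASK_J) | _pdep_u64(k, MASK_K);
}

// De-linearization: extract one mode index from a linearized position
// (the scatter of Figure 4b).
inline uint64_t extract_mode(uint64_t pos, uint64_t mask) {
    return _pext_u64(pos, mask);
}
// E.g., linearize(3, 1, 1) == 15 == 0b001111, one of the positions in Figure 2.
```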
Thus, while the compressed representation of the proposed ALTO format comes at the cost of a de-linearization (decompression) overhead, such a computational overhead is minimal and can be effectively overlapped with the memory accesses of the memory-intensive sparse tensor operations.

The prior compressed sparse tensor formats, such as block- and CSF-based approaches, seek to reduce the size of the indexing metadata by clustering the nonzero elements into coarse-grained structures (e.g., tensor blocks, slices, and/or fibers), which divide the multi-dimensional space of a given tensor into non-overlapping regions. However, due to the irregular shapes and distributions of higher-order data, such coarse-grained approaches can suffer from severe workload imbalance, in terms of nonzero elements, which in turn leads to limited parallel performance and scalability. Thus, the proposed
ALTO representation works at the finest granularity level (i.e., a single nonzero element), which exposes the maximum fine-grained parallelism and allows scalable parallel execution. While a non-overlapping space partitioning of a tensor can be obtained from the ALTO encoding scheme, using a subset of the index bits, the workload balance of such a partitioning still depends on the sparsity patterns of the tensor. To decouple the performance of sparse tensor computations from the distribution of nonzero elements, ALTO eliminates the workload imbalance and generates perfectly balanced partitions in the following way. Figure 5 depicts an example of ALTO's workload decomposition when applied to the sparse tensor in Figure 2.
Fig. 5. ALTO partitioning of the example in Figure 2. The two derived subspaces span the coordinate ranges (0, 0, 0) to (3, 3, 1) and (1, 2, 0) to (3, 6, 1).

Moreover, ALTO divides the line segment containing the nonzero elements of the target tensor into smaller line segments, all of which have the same number of nonzeros, thus perfectly splitting the workload. Therefore, in Figure 5,
ALTO partitions the linearized tensor into two line segments. Although the resulting line segments have different lengths (i.e., 18 and 26), they have the same number of nonzero elements.

Once the linearized sparse tensor is divided into multiple line segments, ALTO identifies the basis mode intervals (coordinate ranges) of the multi-dimensional subspaces that correspond to these segments. For example, the two line segments correspond to three-dimensional subspaces bounded by the mode intervals {[0, 3], [0, 3], [0, 1]} and {[1, 3], [2, 6], [0, 1]}, respectively, as shown by the corner points in Figure 5. While the derived multi-dimensional subspaces of the line segments may overlap, as highlighted in yellow in Figure 5, each nonzero element is assigned to exactly one line segment. That is, ALTO imposes a partitioning on a given linearized tensor that generates a disjoint set of non-overlapping and balanced line segments, yet it does not guarantee that such a partitioning will decompose the multi-dimensional space of the tensor into non-overlapping subspaces. In contrast, the prior formats decompose the multi-dimensional space into non-overlapping (yet highly imbalanced) regions, namely, tensor slices and fibers in CSF-based formats and multi-dimensional spatial blocks in block-based formats (e.g., HiCOO).

More formally, a set of L line segments partitions a linearized ALTO tensor X, which encodes a mode-N sparse tensor, such that X = X_1 ∪ X_2 ∪ ··· ∪ X_L and X_i ∩ X_j = ∅ for all i and j, where i ≠ j. Each line segment X_l is an ordered set of nonzero elements that are bounded in an N-dimensional space by a set of N closed mode intervals P_l = {[ps_{l,1}, pe_{l,1}], [ps_{l,2}, pe_{l,2}], ..., [ps_{l,N}, pe_{l,N}]}, where each mode interval is delineated by a start ps_{l,n} and an end pe_{l,n}. The intersection of two sets of mode intervals represents the subspace overlap between their corresponding line segments (partitions).
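A minimal sketch of how such bounding intervals could be computed (ours; ALTO's actual partitioning code may derive them differently): after splitting the sorted positions into L equally sized ranges, extract each mode index with its bit mask and track the per-mode minimum and maximum.

```cpp
#include <algorithm>
#include <cstdint>
#include <immintrin.h> // _pext_u64 (BMI2)
#include <vector>

struct Interval { uint64_t start, end; };

// For the nonzero positions of one line segment [begin, end), derive the
// closed mode intervals that bound its subspace.
std::vector<Interval> mode_intervals(const std::vector<uint64_t>& pos,
                                     size_t begin, size_t end,
                                     const std::vector<uint64_t>& masks) {
    std::vector<Interval> iv(masks.size(), Interval{UINT64_MAX, 0});
    for (size_t x = begin; x < end; ++x)
        for (size_t n = 0; n < masks.size(); ++n) {
            const uint64_t idx = _pext_u64(pos[x], masks[n]);
            iv[n].start = std::min(iv[n].start, idx);
            iv[n].end   = std::max(iv[n].end, idx);
        }
    return iv; // e.g., {[0,3],[0,3],[0,1]} for the first segment in Figure 5
}
```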
Adaptive Tensor Operations

Since high-dimensional data analytics is becoming increasingly popular in rapidly evolving areas [Ho et al. 2014; Kobayashi et al. 2018; Papalexakis et al. 2016; Sidiropoulos et al. 2017], a fundamental goal of the proposed ALTO format is to deliver superior performance without compromising the productivity of end users, to allow fast development of tensor algorithms and operations. Algorithm 1 illustrates the popular MTTKRP tensor operation using the
ALTO format. First, unlike CSF-based formats, ALTO enables end users to perform tensor operations using a unified code implementation that works on a single copy of the sparse tensor, regardless of the different mode orientations of such operations. Second, by decoupling the representation of a sparse tensor from the distribution of its nonzero elements, ALTO does not require automated or user-generated tuning to select the optimal format parameters for this tensor, in contrast to block-based storage approaches such as HiCOO.
Algorithm 1: Mode-1 sequential MTTKRP-ALTO algorithm.
Input: A third-order ALTO sparse tensor X ∈ ℝ^{I×J×K} with M nonzero elements; dense factor matrices A ∈ ℝ^{I×R}, B ∈ ℝ^{J×R}, and C ∈ ℝ^{K×R}.
Output: Updated dense factor matrix Ã ∈ ℝ^{I×R}.
1: for x = 1, ..., M do
2:   i = EXTRACT(pos(x), MASK(1))  ▷ De-linearization.
3:   j = EXTRACT(pos(x), MASK(2))
4:   k = EXTRACT(pos(x), MASK(3))
5:   for r = 1, ..., R do
6:     Ã(i, r) += val(x) × B(j, r) × C(k, r)
7:   end for
8: end for
9: return Ã
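A direct C++ rendering of Algorithm 1 might look as follows; this is a sketch under our own naming and layout assumptions (row-major, R-wide factor matrices), not the library's actual implementation:

```cpp
#include <cstdint>
#include <immintrin.h> // _pext_u64 (BMI2)
#include <vector>

struct AltoTensor3D {
    std::vector<uint64_t> pos;  // linearized positions (sorted)
    std::vector<double>   val;  // nonzero values
    uint64_t mask[3];           // per-mode bit masks
};

void mttkrp_mode1(const AltoTensor3D& X,
                  std::vector<double>& A,        // I x R, updated in place
                  const std::vector<double>& B,  // J x R
                  const std::vector<double>& C,  // K x R
                  size_t R) {
    for (size_t x = 0; x < X.pos.size(); ++x) {
        const uint64_t i = _pext_u64(X.pos[x], X.mask[0]); // de-linearization
        const uint64_t j = _pext_u64(X.pos[x], X.mask[1]);
        const uint64_t k = _pext_u64(X.pos[x], X.mask[2]);
        for (size_t r = 0; r < R; ++r)
            A[i * R + r] += X.val[x] * B[j * R + r] * C[k * R + r];
    }
}
```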
Algorithm 2 describes the proposed workload distribution and scheduling scheme using a representative parallel MTTKRP operation that works on a sparse tensor stored in the ALTO format. After
ALTO imposes a partitioning on a given sparse tensor, as detailed above, ALTO uses an adaptive conflict resolution approach that dynamically selects the appropriate global synchronization technique (line 7 or line 8 in Algorithm 2) across threads based on the reuse of the target fibers. This metric represents the average number of nonzero elements per fiber (the generalization of a matrix row/column), and it is estimated by simply dividing the total number of nonzero elements by the number of fibers along the target mode. When a sparse tensor operation suffers from limited fiber reuse, ALTO resolves the conflicting updates across its line segments (partitions) using direct atomic operations. Otherwise, it uses a limited amount of temporary (local) storage to keep the local updates of different partitions (line 7) and then merges the conflicting global updates (lines 12-18) using an efficient pull-based parallel reduction, where the final results are computed by pulling the partial results from the ALTO partitions. Specifically, ALTO considers the fiber reuse large enough to use local staging memory for conflict resolution if the average number of nonzero elements per fiber is more than the cost of using this two-stage (buffered) accumulation process, which consists of: initialization (omitted for brevity), local accumulation (line 7), and global accumulation (lines 12-18).
Algorithm 2: Adaptive parallel execution of the mode-1 MTTKRP-ALTO kernel. ALTO automatically uses local storage (line 7) or atomics (line 8), based on the reuse of output fibers, to resolve the update conflicts.
Input: A third-order ALTO sparse tensor X ∈ ℝ^{I×J×K} with M nonzero elements; dense factor matrices A ∈ ℝ^{I×R}, B ∈ ℝ^{J×R}, and C ∈ ℝ^{K×R}.
Output: Updated dense factor matrix Ã ∈ ℝ^{I×R}.
1: for l = 1, ..., L in parallel do  ▷ ALTO line segments.
2:   for all x ∈ X_l do
3:     i = EXTRACT(pos(x), MASK(1))  ▷ De-linearization.
4:     j = EXTRACT(pos(x), MASK(2))
5:     k = EXTRACT(pos(x), MASK(3))
6:     for r = 1, ..., R do
7:       Temp_l(i − ps_{l,1}, r) += val(x) × B(j, r) × C(k, r)  ▷ Local storage.
8:       ATOMIC(Ã(i, r) += val(x) × B(j, r) × C(k, r))  ▷ Atomics.
9:     end for
10:  end for
11: end for
12: for i = 1, ..., I in parallel do  ▷ Pull-based accumulation.
13:   for all l where i ∈ [ps_{l,1}, pe_{l,1}] do
14:     for r = 1, ..., R do
15:       Ã(i, r) += Temp_l(i − ps_{l,1}, r)
16:     end for
17:   end for
18: end for
19: return Ã

As explained above, each line segment X_l is bounded in an N-dimensional space by a set of N closed mode intervals P_l, which is computed during the partitioning of an ALTO tensor; thus, the size of the temporary storage accessed during the accumulation of X_l's updates along a mode n is directly determined by the mode interval [ps_{l,n}, pe_{l,n}] (see lines 7 and 13). Finally, ALTO allows automated generation and use of helper flags to further reduce the conflict resolution overhead in the sparse tensors that suffer from limited fiber reuse. That is, it exploits the unused (do-not-care) bits in the linearized index to encode whether a nonzero element is a boundary or internal (exclusive) element along a mode with limited fiber reuse. Based on this information,
ALTO determines whether to execute the global update (line 8 in Algorithm 2) as an atomic operation (for boundary elements) or a normal write (for internal elements).
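The selection logic sketched below is ours; the threshold semantics and names are illustrative and the library's actual cost model may differ:

```cpp
#include <cstdint>

enum class ConflictResolution { Atomics, LocalStaging };

// Pick a resolution strategy from the average fiber reuse of the target
// mode, i.e., nonzero elements per output fiber.
ConflictResolution choose_resolution(uint64_t nnz, uint64_t fibers_along_mode,
                                     double staging_cost_per_fiber) {
    const double reuse = static_cast<double>(nnz) / fibers_along_mode;
    return reuse > staging_cost_per_fiber ? ConflictResolution::LocalStaging
                                          : ConflictResolution::Atomics;
}
```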
We implemented ALTO as a C++ library and used OpenMP [Dagum and Menon 1998] for multi-threaded execution. The implementation utilizes automatic vectorization and loop unrolling optimizations [Bik et al. 2002], and it uses templates [Vandevoorde and Josuttis 2002] for generalized support of tensors with arbitrary sizes. Specifically, we use a generic type to represent our mode-agnostic indexing metadata, which allows the same code to support linearized indices of different widths. This template-based approach avoids code duplication and reduces the time and effort required to port ALTO to other hardware platforms.

To improve the performance of sparse tensor kernels, the state-of-the-art tensor libraries specialize these kernels for different tensor orders (e.g., 3D and 4D tensors). In our library, a canonical tensor operation, such as MTTKRP, has an entry point that acts as a dispatcher, which invokes the generic implementation or more specialized versions of this implementation when available. This malleable approach leverages the compiler to transparently generate optimized code for common tensor orders and/or for typical decomposition ranks (called rank specialization). However, for a fair comparison with existing tensor libraries, we report the performance of ALTO without rank specialization and discuss the potential performance improvement of such an optimization.

We also incorporated ALTO into the popular SPLATT library [Smith et al. 2015] (i.e., replaced the CSF-based MTTKRP operation with the ALTO-based MTTKRP) to validate its usability in the CPD-ALS algorithm. Given the same initial values, our implementation calculates identical factor matrices as the original SPLATT implementation, and shows the same convergence properties (i.e., the same number of iterations to convergence and fit to the original tensor). Our implementation also shows similar fit compared to the Tensor Toolbox [Bader and Kolda 2007] CPD-ALS implementation.
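As an illustration of the template-based genericity described above (the type and function names here are hypothetical, not the library's API), one type parameter selects the linearized index width:

```cpp
#include <cstdint>
#include <vector>

// The same kernel source compiles for 64-bit and 128-bit linearized
// indices; a dispatcher selects the instantiation (and, optionally,
// a rank-specialized variant). Note that unsigned __int128 is a
// GCC/Clang compiler extension.
template <typename LIdx> // e.g., uint64_t or unsigned __int128
struct AltoTensor {
    std::vector<LIdx>   pos;   // linearized positions
    std::vector<double> val;   // nonzero values
    std::vector<LIdx>   masks; // one bit mask per mode
};

template <typename LIdx>
void mttkrp(const AltoTensor<LIdx>& X /* , factor matrices, mode, rank */);

using AltoTensor64  = AltoTensor<uint64_t>;
using AltoTensor128 = AltoTensor<unsigned __int128>;
```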
All experiments were conducted on an Intel Xeon Platinum 8280 CPU with the Cascade Lake-X (CLX) microarchitecture. It consists of two sockets, each with 28 physical cores running at a fixed frequency of 1.8 GHz for accurate measurements. The server has 384 GiB of memory and runs the CentOS 7.7 Linux distribution. The code is built using the Intel C/C++ compiler (version 19.1.3) with the optimization flags -O3 -xCORE-AVX512 -qopt-zmm-usage=high to fully utilize the vector units. Unless otherwise stated, the parallel experiments use all hardware threads (112) on the target platform. We report the performance numbers as an average over 100 iterations/runs, after a warmup iteration, as per prior tensor libraries. For performance counter measurements and thread pinning, we use the LIKWID tool suite v5.1.0 [Gruber et al. 2020].
For a comprehensive evaluation, the experiments consider a gamut of real-world tensor datasets with various characteristics. These tensors are often used in related works, and they are publicly available in the FROSTT [Smith et al. 2017] and HaTen2 [Jeon et al. 2015] repositories. Table 1 shows the detailed features of the target datasets, ordered by size, in terms of dimensions, number of nonzero elements (#NNZs), and fiber reuse.
Table 1. Characteristics of the target sparse tensors.

| Tensor  | Dimensions                         | #NNZs  | Fiber reuse |
|---------|------------------------------------|--------|-------------|
| lbnl    | 1.6K × 4.2K × 1.6K × 4.2K × 868.1K | 1.7M   | Limited     |
| nips    | 2.5K × 2.9K × 14K × 17             | 3.1M   | High        |
| uber    | 183 × 24 × 1.1K × 1.7K             | 3.3M   | High        |
| chicago | 6.2K × 24 × 77 × 32                | 5.3M   | High        |
| vast    | 165.4K × 11.4K × 2 × 100 × 89      | 26M    | High        |
| darpa   | 22.5K × 22.5K × 23.8M              | 28.4M  | Limited     |
| enron   | 6K × 5.7K × 244.3K × 1.2K          | 54.2M  | High        |
| nell-2  | 12.1K × 9.2K × 28.8K               | 76.9M  | High        |
| fb-m    | 23.3M × 23.3M × 166                | 99.6M  | Limited     |
| flickr  | 319.7K × 28.2M × 1.6M × 731        | 112.9M | Limited     |
| deli    | 532.9K × 17.3M × 2.5M × 1.4K       | 140.1M | Medium      |
| nell-1  | 2.9M × 2.1M × 25.5M                | 143.6M | Medium      |
| amazon  | 4.8M × 1.8M × 1.8M                 | 1.7B   | High        |
| patents | 46 × 239.2K × 239.2K               | 3.6B   | High        |
| reddit  | 8.2M × 177K × 8.1M                 | 4.7B   | High        |

A tensor with one or more modes of limited/medium reuse is considered to have an overall limited/medium fiber reuse.
We evaluate the proposed ALTO format compared to the mode-agnostic COO and HiCOO formats [Li et al. 2018] as well as the mode-specific CSF formats [Smith and Karypis 2015; Smith et al. 2015]. Specifically, we use the latest code of the state-of-the-art sparse tensor libraries for CPUs, namely, ParTI (available at https://github.com/hpcgarage/ParTI) and SPLATT (available at https://github.com/ShadenSmith/splatt). We report the best achieved performance across the different configurations of the COO format, that is, with or without thread privatization (which keeps local copies of the output factor matrix). For the HiCOO format, its performance and storage are highly sensitive to the block and superblock (SB) sizes, which benefit from tuning. Since the current HiCOO implementation does not auto-tune the format parameters or consider the required tuning time in the format generation cost, we use a block size of 128 (2^7) and two superblock sizes of 2^10 and 2^14 according to prior work [Sun et al. 2020], and we report the performance of each format variant ("HiCOO-SB10" and "HiCOO-SB14"). We evaluate two variants of the mode-specific formats: CSF and CSF with tensor tiling ("CSF-tile"), both of which use N representations ("SPLATT-ALL") for an order-N sparse tensor to achieve the best performance. (In the reddit dataset, SPLATT runs out of memory; while keeping fewer tensor copies is possible, it significantly degrades the performance on non-root modes due to using different recursion- and lock-based algorithmic variants.) Similar to previous studies [Choi et al. 2018; Smith and Karypis 2015; Smith et al. 2015], the experiments use double-precision arithmetic and 64-bit (native word) integers. While the target datasets require a linearized index of size between 32 and 80 bits, we configured ALTO to select the size of its linearized index to be a multiple of the native word size (i.e., 64 or 128 bits) for simplicity. We use a decomposition rank R = 16 for all experiments.
Fig. 6. The performance metrics of the ALTO format (higher is better) in comparison with an oracle that selects the best mode-agnostic or mode-specific (tensor copy per mode) format variant for the tensor datasets.
Figure 6 compares the performance characteristics of the proposed ALTO format to an oracle that selects the best format for target tensors. The oracle considers two distinct types of sparse tensor formats: 1) mode-agnostic or general formats (COO, HiCOO-SB10, and HiCOO-SB14), which use a single tensor representation, and 2) mode-specific formats (CSF and CSF-tile), which keep multiple tensor copies (one per mode) for best performance. The results show that ALTO outperforms the best mode-agnostic as well as mode-specific formats in terms of the speedup of tensor operations (MTTKRP on all modes) and the tensor storage. Specifically, ALTO achieves a geometric mean speedup of 8× compared to the best general formats, while delivering a geometric mean compression ratio of more than 4× relative to the best mode-specific formats. What follows is a detailed analysis and discussion of the performance results.

To evaluate the parallel performance and scalability of the
ALTO format, Figure 7 shows the speedup of the different parallel implementations of MTTKRP, which dominates the execution time of tensor decomposition methods. Unlike prior formats, the parallel performance of ALTO depends on the inherent data reuse of sparse tensors rather than the spatial distribution of their nonzero elements. As explained in Algorithm 2, the parallel MTTKRP-ALTO algorithm consists of two stages: 1) computing the output fibers of the target mode using the input fibers along all other modes, and 2) merging the conflicting updates of output fibers across threads.

ALTO demonstrates linear scaling for the sparse tensors with high reuse (on the left of the figure). Specifically, it achieves a geometric mean speedup of 47× on 56 cores (i.e., more than 80% parallel efficiency) by exploiting the data locality of input fibers and by locally computing the partial updates of output fibers in the higher levels of the memory system hierarchy. This way, the overhead of merging these partial updates is amortized over the large amount of output fiber reuse. The analysis of performance counters, as detailed below, shows that most of the input/output fiber data is accessed in cache, which strongly reduces the pressure on main memory. For the amazon and reddit tensors, their hyper-sparsity impacts the parallel scalability compared to other high-reuse cases, due to the reduction of useful work performed over the application's memory footprint. In addition, when sparse tensors have limited/medium data reuse, as shown on the right of Figure 7, ALTO has a sub-linear scaling (a geometric mean speedup of 16× on 56 cores) due to the large number of accesses to main memory. On our test platform, the STREAM Triad [McCalpin 1995] bandwidth is 10 GB/s and 210 GB/s for a single core and a full node of 56 cores, respectively. Hence, ALTO obtains around 76% of the maximum realizable speedup (21×) in such memory-bound cases.

On the contrary, the performance of prior sparse tensor formats varies widely due to their sensitivity to the irregular shapes and data distributions of higher-order tensors. As a result, the location-based formats (HiCOO-SB10/SB14, CSF, and CSF-tile) suffer from workload imbalance and ineffective compression depending on the distribution of the nonzero elements of each dataset. Additionally, based on the update conflicts across blocks/superblocks, the parallel schedule of block-based formats (HiCOO-SB10/SB14) can encounter serialization and/or substantial parallelization overhead. However, in a few cases, namely, the nips, amazon, and nell-1 tensors, the mode-specific CSF format shows a slightly better performance than the mode-agnostic ALTO format by keeping multiple sparse tensor representations along different mode orientations, which in turn significantly increases the storage of CSF (by more than 2× compared to ALTO, as shown in Figure 11).
Figure 8 demonstrates the run-time variations of the different formats while performing MTTKRP across tensor modes. The selected sparse tensors exhibit different characteristics in terms of shape, size, density, and data reuse. Compared to the other formats, ALTO has a relatively consistent performance regardless of the mode orientation of tensor operations. In the darpa tensor, the execution time of ALTO along mode-3 is higher than along mode-1 and mode-2 due to the limited fiber reuse of mode-3. While fibers along all modes are read during MTTKRP execution, only the output fibers along the target mode are updated. Since the memory read bandwidth is typically higher than the write bandwidth, the limited fiber reuse of mode-3 has more impact on mode-3 MTTKRP compared to other modes. In contrast, the other sparse formats suffer from significant performance variations across modes (note the use of logarithmic scale). While the execution time of the mode-specific CSF formats is expected to change based on the characteristics of the target mode, as they use a distinct tensor representation for each mode orientation, the results show that the parallel performance of block-based formats also varies widely due to the different update conflicts and parallelism degrees of tensor blocks/superblocks across modes.
To better understand the performance characteristics of MTTKRP using the ALTO format, we created a Roofline model [Williams et al. 2009] for the test platform and collected performance counters across a set of representative parallel runs. For the Roofline model, an upper performance limit Π is given by

$$\Pi = \min(\Pi_{\text{peak}},\ B_{\text{peak}} \times OI),$$

where Π_peak is the theoretical peak performance, B_peak is the peak memory bandwidth, and OI is the operational intensity (i.e., the ratio of executed floating-point operations per byte). In addition, we enhance our Roofline model to consider the bandwidth of the different cache levels.
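For reference, plugging the platform parameters from the experimental setup into this bound reproduces the 3.2 TFLOP/s double-precision ceiling that appears in Figure 9; the arithmetic below is ours:

```latex
\Pi_{\text{peak}} = 56\ \text{cores} \times 1.8\,\text{GHz} \times
\underbrace{2\,\tfrac{\text{FMA}}{\text{cycle}} \times 8\,\text{DP lanes}
\times 2\,\tfrac{\text{FLOP}}{\text{FMA}}}_{32\ \text{FLOP/cycle per core}}
\approx 3.2\ \text{TFLOP/s}
```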
Fig. 7. The overall parallel performance/speedup of MTTKRP (all modes) using the different sparse tensor formats (COO, HiCOO-SB10, HiCOO-SB14, CSF, CSF-tile, and ALTO). The speedup is reported relative to the optimized sequential MTTKRP-ALTO version to show the parallel scalability. The sparse tensors are categorized and then sorted in an increasing order of their size (number of nonzero elements).
Fig. 8. The execution time of parallel MTTKRP implementations across different modes for select tensors (nips, darpa, nell-2, and amazon) with various characteristics.
Fig. 9. The parallel performance of ALTO across MTTKRP runs: nips mode-4 without (1) and with (r) rank specialization, nell-2 mode-3 (2), amazon mode-1 (3), and darpa mode-1 (4) and mode-3 (5). The Roofline plot shows the L1/L2/L3 cache bandwidths, the peak main memory bandwidth (GB/s), and the DP peak performance (3.2 TFLOP/s); markers distinguish pull-based vs. atomic accumulation and 64-bit vs. 128-bit ALTO indices.
ALTO can still exceed the peak main memory bandwidth by exploiting the inherent data reuse and byefficiently resolving the update conflicts. The nips tensor is rela-tively small and it has high reuse of output fibers, which can be heldlocally in L1/L2 caches. The performance counters show a superiorcache hit rate while computing output fibers, which explains thehigh effective bandwidth and floating-point throughput. Mode-3 ofnell-2 has a smaller amount of fiber reuse and it accesses a largernumber of local output fibers compared to nips . Nonetheless, ALTO efficiently utilizes all threads to do useful work and amortizesthe parallelization overheads due to the larger number of nonzeroelements, and we observe a slightly higher performance than .While the amazon tensor has high fiber reuse, its hyper-sparsityand large size increase the memory pressure. In addition, althoughthe tensor can be encoded using a 72-bit linearized index, our cur-rent ALTO implementation uses a 128-bit linearized index (twicethe native word size) for simplicity. Together these factors increasethe data volume and the de-linearization overhead, which in turnreduces the effective floating-point throughput compared to othertensors with high data reuse and .In contrast, the darpa tensor has limited fiber reuse (along mode-3) which restricts the performance compared to the previous cases.During the computation of MTTKRP on mode-3 , the limited reuseof output fibers does not justify the use of local/temporary memoryto resolve the conflicting updates. As a result, ALTO automaticallychooses to use atomic updates for conflict resolution. While darpamode-1 has higher reuse of output fibers, comparable to , read-ing the input fibers along mode-3 dominates the execution timeand results in more data traffic across the memory system levelsbecause of the limited temporal locality; thus, we can observe thatits performance is bounded by the memory bandwidth. .2 2.7 HIGH REUSE LIMITED / MEDIUM REUSE S p ee d u p ( v s . A L T O ) Sparse Tensors
Fig. 10. The speedup of ALTO using rank specialization relative to the default ALTO parallel implementation. The sparse tensors are categorized and then sorted in an increasing order of their size.
The generic (template-based) implementation of ALTO discussed above and the prior experiments with ALTO do not leverage rank specialization. Here, we show the potential of this feature to further improve the performance by allowing the compiler to gain more insight about the length of rank-wise update operations. This way, the compiler can generate more optimized code, which leads to a better overall performance, as shown for nips mode-4 with rank specialization (r). The analysis of the generated assembly files with OSACA [Laukemann et al. 2019] shows that rank specialization results in better control structures with fewer branches compared to the standard implementation. Additionally, the compiler can reduce the pressure on the load units and thus decreases the execution time by more than 2× in the rank-specialized nips run (r). Figure 10 shows a comprehensive comparison between the rank-specialized ALTO version and the default parallel version used in the prior experiments. For smaller tensors with high reuse, the compiler manages to optimize the computations and to improve the performance by more than 2×. In larger tensors as well as tensors with limited reuse, we can still observe a performance benefit of more than 10%, leading to an overall geometric mean speedup of more than 1.1× across the different sparse tensors.

Figure 11 details the relative storage of the sparse tensors using the proposed
ALTO format as well as the block-based (HiCOO-SB10/SB14) and tree-based (CSF and CSF-tile) formats. Compared to the de facto COO format, ALTO always requires less storage space due to its efficient linearization of the multi-dimensional space, as explained earlier.

Figure 12 compares the cost of constructing the different sparse tensor formats from a sparse tensor in the COO format. Since ALTO works on a linearized rather than a multi-dimensional representation of tensors, the cost of sorting the nonzero elements in its linearized storage (which is the most expensive step in the format generation) is substantially reduced because of the decrease in the number of comparison operations. In addition to sorting the nonzero elements, the block-based HiCOO formats require costly clustering of these elements based on their multi-dimensional coordinates as well as scheduling of the resulting blocks/superblocks to avoid conflicts. As a result, ALTO has more than an order-of-magnitude speedup relative to HiCOO in terms of the format conversion time. Although the tree-based CSF formats are generated from presorted COO tensors, ALTO still requires less time for format construction on average. In large-scale tensors, such as amazon and patents, ALTO has a comparable conversion cost to the tree-based formats, as the tensor sorting time increases with the number of nonzero elements. Currently, the format generation of ALTO leverages the C++ Standard Template Library (STL), and a further reduction of this construction cost is possible.
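For intuition, the ordering stage amounts to a single-key sort over linearized positions, which replaces the multi-key comparisons that sorting a COO tensor needs; the snippet below is our sketch, not the library's code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Nonzero { uint64_t pos; double val; };

// Ordering stage sketch: one integer comparison per element pair,
// instead of comparing up to N mode indices as in COO sorting.
void order_linearized(std::vector<Nonzero>& nnz) {
    std::sort(nnz.begin(), nnz.end(),
              [](const Nonzero& a, const Nonzero& b) { return a.pos < b.pos; });
}
```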
Due to the rise of big data analytics, sparse tensors have become popular in the HPC community. Our study was motivated by the linearized coordinate (LCO) format [Harrison and Joseph 2018], which also flattens sparse tensors, but along a given orientation of tensor modes. Hence, LCO requires either multiple tensor copies or on-the-fly permutation of tensors for efficient computation. In addition, the authors limit their focus to sequential algorithms, and it is not clear how LCO can be used to efficiently parallelize sparse tensor computations.

Recent extensions [Liu et al. 2017; Phipps and Kolda 2019] to the list-based COO format reduce the synchronization overhead across threads using mode-specific scheduling/permutation arrays. However, their output-oriented traversal of sparse tensors adversely affects the input data locality and/or results in random accesses of the nonzero elements [Phipps and Kolda 2019]. Moreover, storing this fine-grained scheduling information for all tensor modes can more than double the memory consumption, compared to the COO format [Phipps and Kolda 2019].

The block-based formats, such as HiCOO [Li et al. 2018], are highly sensitive to the characteristics of sparse tensors as well as the block size. When the blocking ratio (i.e., the number of blocks to the number of nonzero elements) is high, HiCOO can use more storage space than the simple COO format [Li et al. 2018]. Due to the nonuniform (skewed) data distributions in the multi-dimensional space, the number of nonzero elements per block varies widely across HiCOO blocks, even after expensive mode-specific tensor permutations, which in practice further increase workload imbalance [Li et al. 2019].
Fig. 11. The sparse tensor storage using the different formats compared to COO. The tensors are sorted in an increasing order of their size.
Fig. 12. The format construction cost in seconds. The sparse tensors are sorted in an increasing order of their size.

Moreover, such block-based formats, which use small types for element indices, can end up under-utilizing the compute units and memory bandwidth because 1) mainstream architectures are word oriented, which leads to limited efficiency of sub-word operations [Abel and Reineke 2019; Mittal 2017], and 2) applications need to generate large memory transactions to efficiently utilize the bandwidth of commodity DRAM chips [Ghose et al. 2019].

The mode-specific CSF format [Smith and Karypis 2015; Smith et al. 2015] clusters the nonzero elements into coarse-grained tensor slices and fibers, which limits its scalability on massively parallel architectures. To improve the performance on GPUs, recent CSF-based formats [Nisa et al. 2019; Nisa et al. 2019] expose more balanced and fine-grained parallelism, but at the expense of substantial synchronization overheads and expensive preprocessing and format generation costs. The sparse tensor format selection (SpTFS) framework [Sun et al. 2020] leverages deep learning models [Xie et al. 2019; Zhao et al. 2018] to predict the best of the COO, HiCOO, and CSF formats to perform the MTTKRP operation on a given sparse tensor.
CONCLUSION

Motivated by the vulnerability of existing compressed formats to the irregular characteristics of higher-order sparse data, this work proposed the ALTO format to efficiently encode sparse tensors of arbitrary dimensionality and spatial distributions using a single mode-agnostic representation. An implementation of the ALTO-based tensor decomposition operations (all-modes MTTKRP) proved to outperform an oracle that selects the best state-of-the-art format, in terms of both parallel performance and tensor storage. Specifically, ALTO delivers an 8× geometric-mean speedup over the best general format and a more than 4× geometric-mean compression ratio over the best mode-specific format, owing to its superior workload balance, adaptive synchronization, and compact encoding.

The experiments also showed the potential of our template-based implementation to bring extra performance improvement (a geometric-mean speedup between 1× and 2×) by specializing the ALTO-based tensor kernels for target application features. A natural extension of this approach is to support efficient just-in-time (JIT) code generation that considers multiple application and input features, which we intend to investigate in future work. We also plan to explore other massively parallel and distributed-memory platforms, as well as common sparse computations that can benefit from the superior performance characteristics of the ALTO format.
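As a rough illustration of the specialization idea mentioned above (hypothetical code, not the ALTO kernels): fixing an application feature such as the decomposition rank at compile time via C++ templates lets the compiler fully unroll and vectorize an MTTKRP-like inner loop, while a JIT approach would generate such instances on the fly for the features observed at run time.

```cpp
// Hypothetical sketch of compile-time kernel specialization. With Rank fixed
// at compile time, the loop unrolls completely and vectorizes cleanly.
template <int Rank>
void scale_accumulate(const double* __restrict in,
                      double* __restrict out, double v) {
  for (int r = 0; r < Rank; ++r)
    out[r] += v * in[r];  // rank-sized multiply-accumulate, fully unrolled
}

// A runtime dispatcher selects a pre-instantiated specialization; a JIT
// compiler would instead emit the instance for the observed rank directly.
void scale_accumulate_dyn(const double* in, double* out, double v, int rank) {
  switch (rank) {
    case 8:  scale_accumulate<8>(in, out, v);  break;
    case 16: scale_accumulate<16>(in, out, v); break;
    default:
      for (int r = 0; r < rank; ++r) out[r] += v * in[r];  // generic fallback
  }
}
```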
ACKNOWLEDGMENTS
The authors would like to thank Dr. Tamara G. Kolda for setting us on the direction of tensor linearization and for providing valuable feedback.
REFERENCES
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (Savannah, GA, USA) (OSDI '16). USENIX Association, USA, 265–283. https://dl.acm.org/doi/10.5555/3026877.3026899
Andreas Abel and Jan Reineke. 2019. uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (Providence, RI, USA) (ASPLOS '19). ACM, New York, NY, USA, 673–686. https://doi.org/10.1145/3297858.3304062
Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. 2014. Tensor Decompositions for Learning Latent Variable Models. J. Mach. Learn. Res. 15, 1 (Jan. 2014), 2773–2832. https://dl.acm.org/doi/10.5555/2627435.2697055
Brett W. Bader and Tamara G. Kolda. 2007. Efficient MATLAB Computations with Sparse and Factored Tensors. SIAM Journal on Scientific Computing 30, 1 (2007), 205–231. https://doi.org/10.1137/060676489
M. Baskaran, B. Meister, N. Vasilache, and R. Lethin. 2012. Efficient and Scalable Computations with Sparse Tensors. In 2012 IEEE Conference on High Performance Extreme Computing (HPEC). 1–6. https://doi.org/10.1109/HPEC.2012.6408676
Aart J. C. Bik, Milind Girkar, Paul M. Grey, and Xinmin Tian. 2002. Automatic Intra-Register Vectorization for the Intel Architecture. Int. J. Parallel Program. 30, 2 (April 2002), 65–98. https://dl.acm.org/doi/10.5555/586554.586555
J. Choi, X. Liu, S. Smith, and T. Simon. 2018. Blocking Optimization Techniques for Sparse Tensor Computation. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 568–577. https://doi.org/10.1109/IPDPS.2018.00066
Joon Hee Choi and S. V. N. Vishwanathan. 2014. DFacTo: Distributed Factorization of Tensors. In Proceedings of the 27th International Conference on Neural Information Processing Systems (Montreal, Canada) (NIPS '14). MIT Press, Cambridge, MA, USA, 1296–1304. https://dl.acm.org/doi/10.5555/2968826.2968971
Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe. 2018. Format Abstraction for Sparse Tensor Algebra Compilers. Proc. ACM Program. Lang. 2, OOPSLA, Article 123 (Oct. 2018), 30 pages. https://doi.org/10.1145/3276493
L. Dagum and R. Menon. 1998. OpenMP: An Industry Standard API for Shared-Memory Programming. IEEE Computational Science and Engineering 5, 1 (1998), 46–55. https://doi.org/10.1109/99.660313
Hadi Fanaee-T and João Gama. 2016. Tensor-Based Anomaly Detection: An Interdisciplinary Survey. Knowledge-Based Systems 98 (2016), 130–147. https://doi.org/10.1016/j.knosys.2016.01.027
Saugata Ghose, Tianshi Li, Nastaran Hajinazar, Damla Senol Cali, and Onur Mutlu. 2019. Demystifying Complex Workload-DRAM Interactions: An Experimental Study. Proc. ACM Meas. Anal. Comput. Syst. 3, 3, Article 60 (Dec. 2019), 50 pages. https://doi.org/10.1145/3366708
Thomas Gruber, Jan Eitzinger, Georg Hager, and Gerhard Wellein. 2020. RRZE-HPC/likwid: likwid-5.1.0. https://doi.org/10.5281/zenodo.4282696
A. P. Harrison and D. Joseph. 2018. High Performance Rearrangement and Multiplication Routines for Sparse Tensor Arithmetic. SIAM Journal on Scientific Computing 40, 2 (2018), C258–C281. https://doi.org/10.1137/17M1115873
Joyce C. Ho, Joydeep Ghosh, and Jimeng Sun. 2014. Marble: High-Throughput Phenotyping from Electronic Health Records via Sparse Nonnegative Tensor Factorization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, New York, USA) (KDD '14). Association for Computing Machinery, New York, NY, USA, 115–124. https://doi.org/10.1145/2623330.2623658
I. Jeon, E. E. Papalexakis, U. Kang, and C. Faloutsos. 2015. HaTen2: Billion-Scale Tensor Decompositions. In 2015 IEEE 31st International Conference on Data Engineering (ICDE). 1047–1058. https://doi.org/10.1109/ICDE.2015.7113355
U. Kang, Evangelos Papalexakis, Abhay Harpale, and Christos Faloutsos. 2012. GigaTensor: Scaling Tensor Analysis up by 100 Times - Algorithms and Discoveries. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '12). Association for Computing Machinery, New York, NY, USA, 316–324. https://doi.org/10.1145/2339530.2339583
Teruyoshi Kobayashi, Anna Sapienza, and Emilio Ferrara. 2018. Extracting the Multi-Timescale Activity Patterns of Online Financial Markets. Scientific Reports 8, 1 (2018), 1–11. https://doi.org/10.1038/s41598-018-29537-w
Tamara G. Kolda and Brett W. Bader. 2009. Tensor Decompositions and Applications. SIAM Rev. 51, 3 (September 2009), 455–500. https://doi.org/10.1137/07070111X
T. G. Kolda and J. Sun. 2008. Scalable Tensor Decompositions for Multi-Aspect Data Mining. In 2008 Eighth IEEE International Conference on Data Mining (ICDM). 363–372. https://doi.org/10.1109/ICDM.2008.89
J. Laukemann, J. Hammer, G. Hager, and G. Wellein. 2019. Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels. In 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). 1–6. https://doi.org/10.1109/PMBS49563.2019.00006
J. Li, J. Sun, and R. Vuduc. 2018. HiCOO: Hierarchical Storage of Sparse Tensors. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 238–252. https://doi.org/10.1109/SC.2018.00022
Jiajia Li, Bora Uçar, Ümit V. Çatalyürek, Jimeng Sun, Kevin Barker, and Richard Vuduc. 2019. Efficient and Effective Sparse Tensor Reordering. In Proceedings of the ACM International Conference on Supercomputing (Phoenix, Arizona) (ICS '19). Association for Computing Machinery, New York, NY, USA, 227–237. https://doi.org/10.1145/3330345.3330366
Bangtian Liu, Chengyao Wen, Anand D. Sarwate, and Maryam Mehri Dehnavi. 2017. A Unified Optimization Approach for Sparse Tensor Operations on GPUs. In 2017 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 47–57.
John D. McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter (Dec. 1995), 19–25. http://tab.computer.org/tcca/NEWS/DEC95/dec95_mccalpin.ps
Sparsh Mittal. 2017. A Survey of Techniques for Designing and Managing CPU Register File. Concurrency and Computation: Practice and Experience 29, 4 (2017), e3906. https://doi.org/10.1002/cpe.3906
Guy M. Morton. 1966. A Computer Oriented Geodetic Data Base; and a New Technique in File Sequencing. Technical Report. IBM Ltd., 150 Laurier Ave., Ottawa, Ontario, Canada. https://dominoweb.draco.res.ibm.com/reports/Morton1966.pdf
Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, Prasant Singh Rawat, Sriram Krishnamoorthy, and P. Sadayappan. 2019. An Efficient Mixed-Mode Representation of Sparse Tensors. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC '19). Association for Computing Machinery, New York, NY, USA, Article 49, 25 pages. https://doi.org/10.1145/3295500.3356216
I. Nisa, J. Li, A. Sukumaran-Rajam, R. Vuduc, and P. Sadayappan. 2019. Load-Balanced Sparse MTTKRP on GPUs. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 123–133. https://doi.org/10.1109/IPDPS.2019.00023
Evangelos E. Papalexakis, Christos Faloutsos, and Nicholas D. Sidiropoulos. 2016. Tensors for Data Mining and Data Fusion: Models, Applications, and Scalable Algorithms. ACM Trans. Intell. Syst. Technol. 8, 2, Article 16 (Oct. 2016), 44 pages. https://doi.org/10.1145/2915921
Giuseppe Peano. 1890. Sur une courbe, qui remplit toute une aire plane. Math. Ann. 36 (1890), 157–160.
Eric T. Phipps and Tamara G. Kolda. 2019. Software for Sparse Tensor Decomposition on Emerging Computing Architectures. SIAM Journal on Scientific Computing 41, 3 (2019), C269–C290. https://doi.org/10.1137/18M1210691
N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos. 2017. Tensor Decomposition for Signal Processing and Machine Learning. IEEE Transactions on Signal Processing 65, 13 (2017), 3551–3582. https://doi.org/10.1109/TSP.2017.2690524
Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park, Xing Liu, and George Karypis. 2017. FROSTT: The Formidable Repository of Open Sparse Tensors and Tools. http://frostt.io/
Shaden Smith and George Karypis. 2015. Tensor-Matrix Products with a Compressed Sparse Tensor. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (Austin, Texas) (IA3 '15). Association for Computing Machinery, New York, NY, USA, Article 5, 7 pages. https://doi.org/10.1145/2833179.2833183
Shaden Smith and George Karypis. 2017. Accelerating the Tucker Decomposition with Compressed Sparse Tensors. In Euro-Par 2017: Parallel Processing, Francisco F. Rivera, Tomás F. Pena, and José C. Cabaleiro (Eds.). Springer International Publishing, Cham, 653–668. https://doi.org/10.1007/978-3-319-64203-1_47
S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis. 2015. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication. In 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 61–70. https://doi.org/10.1109/IPDPS.2015.27
Qingxiao Sun, Yi Liu, Ming Dun, Hailong Yang, Zhongzhi Luan, Lin Gan, Guangwen Yang, and Depei Qian. 2020. SpTFS: Sparse Tensor Format Selection for MTTKRP via Deep Learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Atlanta, Georgia) (SC '20). IEEE Press, Article 18, 14 pages. https://dl.acm.org/doi/abs/10.5555/3433701.3433724
David Vandevoorde and Nicolai M. Josuttis. 2002. C++ Templates: The Complete Guide. Addison-Wesley Professional. https://cds.cern.ch/record/781664/files/0201734842_TOC.pdf
N. Vervliet, O. Debals, and L. De Lathauwer. 2016. Tensorlab 3.0 – Numerical Optimization Strategies for Large-Scale Constrained and Coupled Matrix/Tensor Factorization. In 2016 50th Asilomar Conference on Signals, Systems and Computers. 1733–1738. https://doi.org/10.1109/ACSSC.2016.7869679
Yichen Wang, Robert Chen, Joydeep Ghosh, Joshua C. Denny, Abel Kho, You Chen, Bradley A. Malin, and Jimeng Sun. 2015. Rubik: Knowledge Guided Tensor Factorization and Completion for Health Data Analytics. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). Association for Computing Machinery, New York, NY, USA, 1265–1274. https://doi.org/10.1145/2783258.2783395
Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM 52, 4 (2009), 65–76. https://doi.org/10.1145/1498765.1498785
Zhen Xie, Guangming Tan, Weifeng Liu, and Ninghui Sun. 2019. IA-SpGEMM: An Input-Aware Auto-Tuning Framework for Parallel Sparse Matrix-Matrix Multiplication. In Proceedings of the ACM International Conference on Supercomputing (Phoenix, Arizona) (ICS '19). Association for Computing Machinery, New York, NY, USA, 94–105. https://doi.org/10.1145/3330345.3330354
Yue Zhao, Jiajia Li, Chunhua Liao, and Xipeng Shen. 2018. Bridging the Gap between Deep Learning and Sparse Matrix Format Selection. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Vienna, Austria) (PPoPP '18). Association for Computing Machinery, New York, NY, USA, 94–108. https://doi.org/10.1145/3178487.3178495