[PDF] PLANC: Parallel Low Rank Approximation with Non-negativity Constraints

Abstract

We consider the problem of low-rank approximation of massive dense non-negative tensor data, for example to discover latent patterns in video and imaging applications. As the size of data sets grows, single workstations are hitting bottlenecks in both computation time and available memory. We propose a distributed-memory parallel computing solution to handle massive data sets, loading the input data across the memories of multiple nodes and performing efficient and scalable parallel algorithms to compute the low-rank approximation. We present a software package called PLANC (Parallel Low Rank Approximation with Non-negativity Constraints), which implements our solution and allows for extension in terms of data (dense or sparse, matrices or tensors of any order), algorithm (e.g., from multiplicative updating techniques to alternating direction method of multipliers), and architecture (we exploit GPUs to accelerate the computation in this work).We describe our parallel distributions and algorithms, which are careful to avoid unnecessary communication and computation, show how to extend the software to include new algorithms and/or constraints, and report efficiency and scalability results for both synthetic and real-world data sets.

Full PDF

PPLANC: Parallel Low Rank Approximation with Non-negativity Constraints

SRINIVAS ESWAR ∗ , Georgia Institute of Technology

KOBY HAYASHI ∗ , Georgia Institute of Technology

GREY BALLARD,

Wake Forest University

RAMAKRISHNAN KANNAN † , Oak Ridge National Laboratory

MICHAEL A. MATHESON † , Oak Ridge National Laboratory

HAESUN PARK,

Georgia Institute of Technology

We consider the problem of low-rank approximation of massive dense non-negative tensor data, for example to discover latent patternsin video and imaging applications. As the size of data sets grows, single workstations are hitting bottlenecks in both computationtime and available memory. We propose a distributed-memory parallel computing solution to handle massive data sets, loading theinput data across the memories of multiple nodes and performing efficient and scalable parallel algorithms to compute the low-rankapproximation. We present a software package called PLANC (Parallel Low Rank Approximation with Non-negativity Constraints),which implements our solution and allows for extension in terms of data (dense or sparse, matrices or tensors of any order), algorithm(e.g., from multiplicative updating techniques to alternating direction method of multipliers), and architecture (we exploit GPUs toaccelerate the computation in this work). We describe our parallel distributions and algorithms, which are careful to avoid unnecessarycommunication and computation, show how to extend the software to include new algorithms and/or constraints, and report efficiencyand scalability results for both synthetic and real-world data sets.

ACM Reference Format:

Srinivas Eswar, Koby Hayashi, Grey Ballard, Ramakrishnan Kannan, Michael A. Matheson, and Haesun Park. 2019. PLANC: ParallelLow Rank Approximation with Non-negativity Constraints. 1, 1 (September 2019), 32 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

The CP decomposition, which is also known as CANDECOMP, PARAFAC, and canonical polyadic decomposition,approximates a tensor, or multidimensional array, by a sum of rank-one tensors. CP is typically used to identify latentfactors in data, particularly when the goal is to interpret those hidden patterns, and it is popular within the signal ∗ Eswar and Hayashi share first authorship. † This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United StatesGovernment retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive,paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United StatesGovernment purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOEPublic Access Plan (http://energy.gov/downloads/doe-public-access-plan).Authors’ addresses: Srinivas Eswar, Georgia Institute of Technology, Atlanta, GA, [email protected]; Koby Hayashi, Georgia Institute of Technology,Atlanta, GA, [email protected]; Grey Ballard, Wake Forest University, Winston-Salem, NC, [email protected]; Ramakrishnan Kannan, Oak RidgeNational Laboratory, Oak Ridge, TN, [email protected]; Michael A. Matheson, Oak Ridge National Laboratory, Oak Ridge, TN, [email protected];Haesun Park, Georgia Institute of Technology, Atlanta, GA, [email protected] to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are notmade or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for componentsof this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or toredistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].© 2019 Association for Computing Machinery.Manuscript submitted to ACMManuscript submitted to ACM a r X i v : . [ m a t h . NA ] A ug Srinivas Eswar, Koby Hayashi, Grey Ballard, Ramakrishnan Kannan, Michael A. Matheson, and Haesun Parkprocessing, machine learning, and scientific computing communities, among others [1, 15, 48]. Enforcing domain-specific constraints on the computed factors can help to identify interpretable components. We focus in this paperon non-negative dense tensors (when all tensor entries are nonnegative and nearly all of them are positive) and onconstraining solutions to have nonnegative entries. Formally, Non-negative CP (NNCP) can be defined asmin H ( i ) ⩾ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X − R (cid:213) r = H ( ) ( : , r ) ◦ · · · ◦ H ( N ) ( : , r ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (1)where H ( ) ( : , i ) ◦ · · · ◦ H ( N ) ( : , i ) is the outer product of the i th vector from all the N factors that yields a rank-one tensorand (cid:205) Rr = H ( ) ( : , r ) ◦ · · · ◦ H ( N ) ( : , r ) results in a sum of R rank-one tensors that approximate the N th order input tensor X . For example, in imaging and microscopy applications, tensor values often correspond to intensities, and NNCP canbe used to cluster and analyze the data in a lower-dimensional space [20]. In this work, we consider a brain imagingdata set that tracks calcium fluorescence within pixels of a mouse’s brain over time during a series of experimentaltrials [30].The kernel computations within standard algorithms for computing NNCP can be formulated as matrix computations,but the complicated layout of tensors in memory prevents the straightforward use of BLAS and LAPACK libraries. Inparticular, the matrix formulation of subcomputations involve different views of the tensor data, so no single layoutyields a column- or row-major matrix layout for all subcomputations. Likewise, the parallelization approach for tensormethods is not a straightforward application of parallel matrix computation algorithms.In developing an efficient parallel algorithm for computing a NNCP of a dense tensor, the key is to parallelize thebottleneck computation known as Matricized-Tensor Times Khatri-Rao Product (MTTKRP), and a different result isrequired for each mode of the tensor. The parallelization must load balance the computation, minimize communicationacross processors, and distribute the results so that the rest of the computation can be performed independently. In ouralgorithm, not only do we load balance the computation, but we also compute and store temporary values that canbe used across MTTKRPs of different modes using a technique known as dimension trees, significantly reducing thecomputational cost compared to standard approaches. Our parallelization strategy also avoids communicating tensorentries and minimizes the communication of factor matrix entries, helping the algorithm to remain computation boundand scalable to high processor counts.We employ a variety of algorithmic strategies to computing NNCP, from multiplicative updates to alternatingdirection method of multipliers. Because the bottleneck computations such as MTTKRP are shared by all updatealgorithms that compute gradient information, we separate the parallelization strategy for those computations fromthe (usually local) computations that are unique to each algorithm. In this paper, we present an open-source softwarepackage called Parallel Low-Rank Approximation with Non-negativity Constraints (PLANC) that currently includessix algorithmic options, and we describe how other algorithms can be incorporated into the framework. PLANC canalso be used for non-negative matrix factorization (NMF) with dense or sparse matrices. The software is available ishttps://github.com/ramkikannan/planc.Some of the material in this paper has appeared in a previous conference paper [2], including the parallelizationstrategy described in § 4.1 and the dimension tree optimization detailed in § 4.2. We summarize the main contributionsof this paper as follows: • presentation and description of the open-source PLANC software package, • utilization of GPUs to alleviate the MTTKRP bottleneck achieving up to 7 × speedup over CPU-only execution, Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 3 • scaling results for runs from 1 to 16384 nodes (250,000+ cores) on the Titan supercomputer, • side by side run-time and convergence results for various update algorithms, • and new results obtained by applying our code to a mouse brain imaging data set. Algorithm 1 (cid:74) H ( ) , . . . , H ( N ) (cid:75) = NNCP ( X , R ) Require: X is I × · · · × I N tensor, R is approximation rank % Initialize data for n = N do Initialize H ( n ) G ( n ) = H ( n ) T H ( n ) end for % Compute NNCP approximation while stopping criteria not satisfied do % Perform outer iteration for n = N do % Compute new factor matrix in n th mode M ( n ) = MTTKRP ( X , { H ( i ) } , n ) S ( n ) = G ( ) ∗ · · · ∗ G ( n − ) ∗ G ( n + ) ∗ · · · ∗ G ( N ) H ( n ) = NNLS-Update ( S ( n ) , M ( n ) ) G ( n ) = H ( n ) T H ( n ) end for end whileEnsure: X ≈ (cid:74) H ( ) , . . . , H ( N ) (cid:75) The CP decomposition is a low-rank approximation of a multi-dimensional array, or tensor, which generalizes matrixapproximations like the truncated singular value decomposition. As in Figure 1, CP decomposition approximates thegiven input matrix as sum of R rank-1 tensors.Algorithm 1 shows the pseudocode for an alternating-updating algorithm applied to NNCP [28]. Line 11, line 12,and line 14 compute matrices involved in the gradients of the subproblem objective functions, and line 13 uses thosematrices to update the current factor matrix.The NNLS-Update in line 13 can be implemented in different ways. In a faithful Block Coordinate Descent (BCD)algorithm, the subproblems are solved exactly; in this case, the subproblem is a nonnegative linear least squares problem,which is convex. We can use the Block Principal Pivoting (BPP) method [28, 29], which is an active-set-like method, tosolve the subproblem exactly.However, as discussed in [22] for the matrix case, there are other reasonable alternatives to updating the factormatrix without solving the N -block coordinate descent subproblem exactly. For example, we can more efficiently updateindividual columns of the factor matrix as is done in the Hierarchical Alternating Least Squares (HALS) method [9]. Inthis case, the update rule is H ( n ) ( : , r ) ← (cid:104) H ( n ) ( : , r ) + M ( n ) ( : , r ) − ( H ( n ) S ( n ) )( : , r ) (cid:105) + which involves the same matrices M ( n ) and S ( n ) as BPP. Other possible alternating-updating methods include Alternating-Optimization Alternating Direction Method of Multipliers (AO-ADMM) [18, 49] and Nesterov-based algorithms [36]. Manuscript submitted to ACM

Srinivas Eswar, Koby Hayashi, Grey Ballard, Ramakrishnan Kannan, Michael A. Matheson, and Haesun Park ≈ Rank 1 Tensor ++ Rank 1 Tensor …… Rank 1 Tensor

Sum of R Rank-1 Tensors r r 𝑰 𝑰 𝑰 𝑰 ++ Low Rank Factors

Fig. 1. CP Decomposition

The details of each of these algorithms are presented in Section 4.3. The parallel algorithm presented in this paper isgenerally agnostic to the approach used to solve the nonnegative least squares subproblems, as all these methods arebottlenecked by the subroutine they have in common, the MTTKRP.

The formulation of NNCP with least squares error and algorithms for computing it go back as far as [42, 57], developedin part as a generalization of nonnegative matrix factorization algorithms [33] to tensors. Sidiropoulos et al. [48] providea more detailed and complete survey that includes basic tensor factorization models with and without constraints,broad coverage of algorithms, and recent driving applications. The mathematical tensor operations discussed and thenotation used in this paper follow Kolda and Bader’s survey [31].Many numerical methods have been developed for the non-negative least squares (NNLS) subproblems which arisein NNCP. Broadly these methods can be divided into projection-based and active-set-based methods. Projection-basedmethods are iterative methods which consist of gradient descent and Newton-type algorithms which exploit the fact thatthe objective function is differentiable and the non-negative projection operator is easy to compute [10, 16, 27, 37, 39].Active-set-like methods explicitly partition the variables into zeros and non-zeros. Once the final partition is known theNNLS problem can be solved via a simpler unconstrained least squares problem [8, 29, 32, 54]. We direct the reader tothe survey by Kim et al [28] for a more in-depth discussion on these methods.Recently, there has been growing interest in scaling tensor operations to bigger data and more processors in boththe data mining/machine learning and the high performance computing communities. For sparse tensors, there havebeen parallelization efforts to compute CP decompositions on shared-memory platforms [34, 51], distributed-memoryplatforms [24, 26, 38, 50] and GPUs [40, 41, 52], and these approaches can be generalized to constrained problems [49].Liavas et al. [35] extend a parallel algorithm designed for sparse tensors [50] to the 3D dense case. They use the“medium-grained” dense tensor distribution and row-wise factor matrix distribution, which is exactly the same as ourdistribution strategy (see § 4.1.2), and they use a Nesterov-based algorithm to enforce the nonnegativity constraints. A

Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 5similar data distribution and parallel algorithm for computing a single dense MTTKRP computation is proposed byBallard, Knight, and Rouse [3]. Another approach to parallelizing NNCP decomposition of dense tensors is presentedby Phan and Cichocki [44], but they use a dynamic tensor factorization, which performs different, more independentcomputations across processors. Moon et al. [40] address the data locality optimizations needed during the NNLS phaseof the algorithm for both shared memory and GPU systems.The idea of using dimension trees (discussed in § 4.2) to avoid recomputation within MTTKRPs across modes isintroduced in [45] for computing the CP decomposition of dense tensors. General reuse patterns and mode splitting werepresent in earlier works on variants of the Tucker Decomposition [4, 14]. It has also been used for sparse CP [26, 34]and sparse Tucker [24].An alternate approach to speeding up CP computations is by reducing the tensor size either via sampling orcompression. A large body of work exists for randomized tensor methods [5, 11, 43, 56] which are recently beingextended to the constrained problem [12, 13]. The second approach is to first compress the tensor using a differentdecomposition, like Tucker, and then compute CP on this reduced array. This method has been discussed in furtherdetail in [7, 53], but it becomes more difficult to impose nonnegative constraints on the overall model. A separateapproach is compute the (constrained) CP decomposition of the entire approximation, rather than only the core tensor,exploiting the structure of the Tucker model to perform the optimization algorithm more efficiently [55].

The basic sequential algorithm is given in Algorithm 1, and the parallel version is given inAlgorithm 2. In Algorithm 2, we will refer to both the inner iteration, in which one factor matrix is updated (line 9 toline 19), and the outer iteration, in which all factor matrices are updated (line 8 to line 20). In the parallel algorithm,the processors are organized into a logical multidimensional grid (tensor) with as many modes as the data tensor. Thecommunication patterns used in the algorithm are MPI collectives: All-Reduce, Reduce-Scatter, and All-Gather. Theprocessor communicators (across which the collectives are performed) include the set of all processors and the sets ofprocessors within the same processor slice. Processors within a mode- n slice all have the same n th coordinate.The method of enforcing the nonnegativity constraints of the linear least squares solve or update generally affectsonly local computation because each row of a factor matrix can be updated independently. In our algorithm, eachprocessor solves the linear problem or computes the update for its subset of rows (see line 14). The most expensive(and most complicated) part of the parallel algorithm is the computation of the MTTKRP, which corresponds to line 11,line 12, and line 18.The details that are omitted from this presentation of the algorithm include the normalization of each factor matrixafter it is computed and the computation of the residual error at the end of an outer iteration. These two computationsdo involve both local computation and communication, but their costs are negligible. We discuss normalization anderror computation and give more detailed pseudocode in Algorithm 4. Given a logical processor grid of processors P × · · · × P N , we distribute the size I × · · · × I N tensor X in a block or Cartesian partition. Each processor owns a local tensor of dimensions ( I / P ) × · · · × ( I N / P N ) ,and only one copy of the tensor is stored. Locally, the tensor is stored linearly, with entries ordered in a naturalmode-descending way that generalizes column-major layout of matrices. Given a processor p = ( p , . . . , p N ) , we denoteits local tensor X p . Manuscript submitted to ACM

Srinivas Eswar, Koby Hayashi, Grey Ballard, Ramakrishnan Kannan, Michael A. Matheson, and Haesun Park

Algorithm 2 (cid:74) H ( ) , . . . , H ( N ) (cid:75) = Par-NNCP ( X , R ) Require: X is an I × · · · × I N tensor distributed across a P × · · · × P N grid of P processors, so that X p is ( I / P ) ×· · · × ( I N / P N ) and is owned by processor p = ( p , . . . , p N ) , R is rank of approximation for n = N do Initialize H ( n ) p of dimensions ( I n / P ) × R G = Local-SYRK ( H ( n ) p ) G ( n ) = All-Reduce ( G , All-Procs ) H ( n ) p n = All-Gather ( H ( n ) p , Proc-Slice ( n , p n )) end for % Compute NNCP approximation while not converged do for n = N do % Compute new factor matrix in n th mode M = Local-MTTKRP ( X p ··· p N , { H ( i ) p i } , n ) M ( n ) p = Reduce-Scatter ( M , Proc-Slice ( n , p n )) S ( n ) = G ( ) ∗ · · · ∗ G ( n − ) ∗ G ( n + ) ∗ · · · ∗ G ( N ) H ( n ) p = NNLS-Update ( S ( n ) , M ( n ) p ) % Organize data for later modes G = H ( n ) p T H ( n ) p G ( n ) = All-Reduce ( G , All-Procs ) H ( n ) p n = All-Gather ( H ( n ) p , Proc-Slice ( n , p n )) end for end whileEnsure: X ≈ (cid:74) H ( ) , . . . , H ( N ) (cid:75) Ensure:

Local matrices: H ( n ) p is ( I n / P ) × R and owned by processor p = ( p , . . . , p N ) , for 1 ⩽ n ⩽ N , λ storedredundantly on every processorEach factor matrix is distributed across processors in a block row partition, so that each processor owns a subset ofthe rows. We use the notation H ( n ) p , which has dimensions I n / P × R to denote the local part of the n th factor matrixstored on processor p . However, we also make use a redundant distribution of the factor matrices across processors,because all processors in a mode- n processor slice need access to the same entries of H ( n ) to perform their computations.The notation H ( n ) p n denotes the I n / P n × R submatrix of H ( n ) that is redundantly stored on all processors whose n thcoordinate is p n (there are P / P n such processors).Other matrices involved in the algorithm include M ( n ) p , which is the result of the MTTKRP computation and has thesame distribution scheme as H ( n ) p , and G ( n ) , which is the R × R Gram matrix of the factor matrix H ( n ) and is storedredundantly on all processors. The inner iteration is displayed graphically in Figure 2 for a 3-way example and an update of the2nd factor matrix. The main idea is that at the start of the n th inner iteration (Figure 2a), all of the data is in place foreach processor to perform a local MTTKRP computation, which can be computed using a dimension tree as describedin § 4.2. This means that all processors in a slice redundantly own the same rows of the corresponding factor matrix (forall modes except n ). After the local MTTKRP is computed (Figure 2b), each processor has computed a contribution to asubset of the rows of the global MTTKRP M ( n ) , but its contribution must be summed up with the contributions of all Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 7 H ( ) M ( ) H ( ) (a) Start n th iteration withredundant subset of rowsof each input matrix. H ( ) M ( ) H ( ) (b) Compute local MTTKRPfor contribution to outputmatrix M ( ) . H ( ) M ( ) H ( ) (c) Reduce-Scatter to com-pute and distribute rows of M ( ) . H ( ) H ( ) H ( ) (d) Compute NNLS updateto obtain H ( ) p from M ( ) p (and S ( ) ). H ( ) H ( ) H ( ) (e) All-Gather to collectrows of H ( ) needed for laterinner iterations.Fig. 2. Illustration of 2nd inner iteration of Par-NNCP algorithm for 3-way tensor on a × × processor grid, showing datadistribution, communication, and computation across steps. Highlighted areas correspond to processor ( , , ) and its processor slicewith which it communicates. The column normalization and computation of G ( ) , which involve communication across all processors,is not shown here. other processors in its mode- n slice. This summation is performed with a Reduce-Scatter collective across the mode- n processor slice that achieves a row-wise partition of the result (in Figure 2c, the light gray shading corresponds to therows of M ( ) to which processor ( , , ) contributes and the dark gray shading corresponds to the rows it receives asoutput). The output distribution of the Reduce-Scatter is designed so that afterwards, the update of the factor matrixin that mode can be performed row-wise in parallel. S ( n ) can be computed locally since the Gram matrices, G ( n ) , arestored redundantly on all processors. Along with S ( n ) each processor updates its own rows of the factor matrix givenits rows of the MTTKRP result (Figure 2d). The remainder of the inner iteration is preparing and distributing the newfactor matrix data for future inner iterations, which includes an All-Gather of the newly computed factor matrix H ( n ) across mode- n processor slices (Figure 2e) and recomputing G ( n ) = H ( n ) T H ( n ) .Table 1 provides the computation, communication, and memory costs of a single outer-iteration, computing S ( n ) and M ( n ) for each n , which is common to all NNLS algorithms. We refer the reader to [2] for the detailed analysis of thealgorithm and the derivation of these costs. An important optimization of the alternating updating algorithm for NNCP (and unconstrainedCP) is to re-use temporary values across inner iterations [23, 25, 34, 45]. To illustrate the idea, consider a 3-way tensor X approximated by (cid:74) U , V , W (cid:75) and the two MTTKRP computations M ( ) = X ( ) ( W ⊙ V ) and M ( ) = X ( ) ( W ⊙ U ) usedto update factor matrices U and V , respectively. The underlined parts of the expressions correspond to the shared Manuscript submitted to ACM

Srinivas Eswar, Koby Hayashi, Grey Ballard, Ramakrishnan Kannan, Michael A. Matheson, and Haesun Park

Computation Communication Temporary Memory O (cid:16) RP (cid:206) n I n + R P (cid:205) n I n (cid:17) O (cid:32) R (cid:213) n I n P n (cid:33) O (cid:169)(cid:173)(cid:171) R (cid:32)(cid:214) n I n P n (cid:33) / + R (cid:213) n I n P n (cid:170)(cid:174)(cid:172) Table 1. Per-outer-iteration costs in terms of computation (flops), communication (words moved), and memory (words) required tocompute S ( n ) and M ( n ) for each n , assuming the local MTTKRP uses a dimension tree [2]. These costs do not include the computation(and possibly) communication costs of the particular NNLS algorithm. dependence of the outputs on the tensor X and the third factor matrix W . Indeed, a temporary quantity, which werefer to as a partial MTTKRP , can be computed and re-used across the two MTTKRP expressions. We refer to thecomputation that combines the temporary quantity with the other factor matrix to complete the MTTKRP computationas a multi-tensor-times-vector or multi-TTV , as it consists of multiple operations that multiply a tensor times a set ofvectors, each corresponding to a different mode.To understand the steps of the partial MTTKRP and multi-TTV operations in more detail, we consider X to be I × J × K and U , V , and W to have R columns. Then m ( ) ir = (cid:213) i , j x ijk v jr w kr = (cid:213) j v jr (cid:213) k x ijk w kr = (cid:213) j v jr t ijr , where T is an I × J × R tensor that is the result of a partial MTTKRP between tensor X and the single factor matrix W .Likewise, m ( ) jr = (cid:213) i , k x ijk u ir w kr = (cid:213) i u ir (cid:213) k x ijk w kr = (cid:213) i u ir t ijr , and we see that the temporary tensor T can be re-used. From these expressions, we can also see that computing T (a partial MTTKRP) corresponds to a matrix-matrix multiplication, and computing each of M ( ) and M ( ) from T (amulti-TTV) corresponds to R independent matrix-vector multiplications. In this case, we compute M ( ) using a fullMTTKRP.For a larger number of modes, a more general approach can organize the temporary quantities to be used over amaximal number of MTTKRPs. The general approach can yield significant benefit, decreasing the computation by afactor of approximately N / N -way tensors. The idea is introduced in [45], but we adopt the terminology andnotation of dimension trees used for sparse tensors in [23, 25]. In this notation, the root node is labeled { , . . . , N } (wealso use the notation [ N ] for this set) and corresponds to the original tensor, a leaf is labeled { n } and corresponds to the n th MTTKRP result, and an internal node is labeled by a set of modes { i , . . . , j } and corresponds to a temporary tensorwhose values contribute to the MTTKRP results of modes i , . . . , j .Figure 3 illustrates a dimension tree for the case N =

5. Various shapes of binary trees are possible [23, 45]. Fordense tensors, the computational cost is dominated by the root’s branches, which correspond to partial MTTKRPcomputations. We perform the splitting of modes at the root so that modes are chosen contiguously with the respect tothe layout of the tensor entries in memory. In this way, each partial MTTKRP can be performed via BLAS’s GEMMinterface without reordering tensor entries in memory. All other edges in a tree correspond to multi-TTVs and aretypically much cheaper. By organizing the memory layout of temporary quantities, the multi-TTV operations can beperformed via a sequence of calls using BLAS’s GEMV interface. By using the BLAS in our implementation, we are ableto obtain high performance and on-node parallelism.

Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 9 { , , , , }{ , } { , , }{ } { } { } { , }{ } { } PM PMmTTV mTTV mTTV mTTVmTTV mTTV

Fig. 3. Dimension tree example for N = . The data associated with the root node is the original tensor, the data associated withthe leaf nodes are MTTKRP results, and the data associated with internal nodes are temporary tensors. Edges labeled with PMcorrespond to partial MTTKRP computations, and edges labeled with mTTV correspond to multi-TTV computations. ! " ! ! $ ! % ! & ' ( (*:,) . (*:/){/:1}' (*:,). ! " ! ! $ ! % ! & (a) Partial MTTKRP to compute node { , , } from root node { , , , , } , executed via one GEMM call. ! blocks( ) * (,){/:1} [4] 6 / (: , 4) !( ) ! (9:1) (: , 4) ( : ( ; ( : ( ; < = (b) Multi-TTV to compute node { } from node { , , } , ex-ecuted via R GEMV calls. Here T { , , }( ) [ r ] refers to the r thcontiguous block of size I × ( I I ) of T { , , }( ) .Fig. 4. Data layout and dimensions for two example computations in dimension tree shown in Figure 3. In this notation, X ( ) is thematricization of input tensor X with respect to modes 1 through 2, K ( ) = H ( ) ⊙ H ( ) , T { , , } is the temporary I × I × I × R tensor corresponding to node { , , } in the dimension tree, K ( ) = H ( ) ⊙ H ( ) , and M ( ) is the MTTKRP result for mode 3. Thearrows represent row- vs column-major ordering in memory. Figure 4 shows the data layout and dimensions of a partial MTTKRP and a multi-TTV taken from the exampledimension tree in Figure 3. Figure 4a shows a partial MTTKRP between the input tensor X and the Khatri-Rao productof the factor matrices in modes 1 and 2, which produces a temporary tensor T corresponding to the { , , } node in thedimension tree. The key to efficiency in this computation is that the matricization of X that assigns modes 1 through 2to rows and modes 3 through 5 to columns, which we denote X ( ) , is already column-major in memory. Thus, we canuse the GEMM interface and compute the temporary tensor T without reordering any tensor entries. Note that T is a4-way tensor in this case, with its last mode of dimension R , and the GEMM interface outputs the matrix T ( ) (wherethe first three modes are assigned to rows), which is column-major in memory. Figure 4b depicts a multi-TTV thatcomputes the result M ( ) from T and the factor matrices in modes 4 and 5. Here, the tensor T is matricized with respectto only its first mode (of dimension I ), but this matricization is also column-major in memory. We choose the orderingof the modes of T such that each of R contiguous blocks is used to compute one column of the output matrix via amatrix-vector operation with a corresponding column of the Khati-Rao product of the other factor matrices. Manuscript submitted to ACM IR , where I is thenumber of tensor entries and R is the rank of the CP decomposition. This is the same operation count as a full MTTKRP.The computational cost of a multi-TTV is the number of entries in the temporary tensor, which is the product of a subset of the original tensor dimensions multiplied by R . Thus, it is computationally cheaper than the partial MTTKRPs,but it is also memory bandwidth bound. The other subroutine necessary for implementing the dimension tree approachis the Khatri-Rao product of contiguous sets of factor matrices. The computational cost of this operation is also typicallylower order, but the running time in practice suffers also from being memory bandwidth bound. For a given tensor, it is possible to compute the dimension tree that minimizes overallcomputation and memory. However, for most problems, the computation (and actual running time) will be dominatedby the choice of split at the root node, and the other split choices will have negligible effect. The choice of split at theroot node has no effect on the computational cost of the two partial MTTKRPs, but it does affect the temporary memoryrequirement as well as the practical running time, as that split will determine the dimensions of the two GEMM calls.The three matrix dimensions in the calls are given by the products of the dimensions of the two subsets of modes andthe rank of the decomposition. The amount of additional memory needed is the size of the larger partial MTTKRPresult and is O ( I ) if R is less than the smallest tensor dimension.To minimize temporary memory and optimize GEMM performance, we seek to split the modes such that the productsof each subset of modes are nearly equal. To respect the memory layout of the tensor, we consider only contiguoussubsets of modes, and thus the split depends on only a single parameter S , which we refer to as the “split” mode, andsplit the root into nodes { , . . . , S } and { S + , . . . , N } . We compute S to be the smallest mode such that the product ofthe first S modes is greater than the product of the last N − S modes.Because the splits within the tree have much less effect on the running time and memory, we structure our tree inorder to simplify the software implementation. That is, we compute the factor matrices in order, from 1 to N , and forevery internal node of the tree, we split the smallest mode from all other modes. The structure of the tree we use inPLANC is shown in Figure 5, and the pseudocode for its implementation is given by Algorithm 3. Note that the structureof the main left subtree and the main right subtree are identical, and correspondingly the first half of the pseudocode(for modes 1 to S ) is nearly identical to the second half (for modes S + N ), just with different index ranges.To explain the pseudocode in more detail, we focus on the first half, or modes 1 through S . The first mode ( n = n = S ) are special cases because the first mode involves the partial MTTKRP (line 3) and the lastmode does not compute an internal node of the tree. Internal modes (1 < n < S ) involve computing an internal node ofthe tree and the MTTKRP result for that mode, both of which are computed via multi-TTVs. We use the notation K ( i : j ) to represent the reverse Khatri-Rao product of factor matrices M ( i ) through M ( j ) , which are computed in line 2, line 4,and line 8. The partial MTTKRP (line 3) is a matrix multiplication between a matricization of the tensor where thefirst S modes are mapped to rows and a partial Khatri-Rao product; the output is the temporary tensor T , which iscomputed as a matrix with R columns. Each matrix involved is either column- or row-major ordered in memory asdepicted in Figure 4a, for example, where N =

5. We use notation T { S }( S ) for this output, where the subscript definesthe matricization and the superscript labels the temporary tensor corresponding to its node in the dimension tree.The multi-TTV operations in line 5, line 7, line 9, and line 11 are a set of R matrix-vector multiplications. We useMATLAB-style notation with parentheses to index the r th column of the Khatri-Rao product matrix and the MTTKRPresult matrix. We use square-bracket notation to index contiguous column blocks of the temporary tensor. For example,in line 9, we use T { n : S }( ) [ r ] to denote the r th column block (which comprises I n + · · · I S columns) of the 1st-mode Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 11 { , . . . , N }{ , . . . , S } { S + , . . . , N }{ } { , . . . , S } { S + } { S + , . . . , N }{ S + } { N − , N }{ } { S − , S }{ S − } { S } { N − } { N } Fig. 5. Dimension tree used in PLANC for general N . Mode S is the “split” mode, chosen so that the product of dimensions in modes { , . . . , S } is approximately equal to that of modes { S + , . . . , N } . The splits below the root are not chosen for simplicity. matricization of temporary tensor T { n : S } (which has dimensions I n × · · · × I S × R ). This r th column block is the sameas the 1st-mode matricization of the r th slice of the tensor. The column blocks are colored distinctly in Figure 4b, forexample, where R = In this subsection we consider updating algorithms for the non-negative least squares (NNLS) updates of the factor ateach inner iteration of the algorithm (line 13 of Algorithm 1). The general problem to be solved in each inner iterationis a constrained least squares problem of the form X ← arg min X ≥ ∥ AX − B ∥ F . (2)All our updating methods (approximately) solve Equation (2) by first forming A T A and A T B , matrices that appear inthe gradient of the objective function. In the case of updating the factor matrix H ( n ) we need to solve Equation (2) with X = H ( n ) T , A = K ( n ) , where K ( n ) is the KRP of factor matrices leaving out the n th factor matrix and B = X ( n ) , where X ( n ) is the n th mode matricization of X . In this case we have A T A = S ( n ) and A T B = M ( n ) T , which correspond to theinputs to the NNLS-Update function in line 13 of Algorithm 1.A nice property of the Equation (2) is that it can be decoupled along the columns of X and thus parallelized as inAlgorithm 2. We use the notation X p to refer to a subset of the columns of X owned by processor p , or in the caseof line 14 of Algorithm 2, we use H ( n ) p to refer to a subset of the rows of H ( n ) = X T . The gradient for this subset ofcolumns depends on the corresponding columns of A T B = M ( n ) T , denoted by M ( n ) p , and all of A T A = S ( n ) .Our framework is capable of supporting any alternating-updating NNCP algorithm [21]. The updating algorithmsthat fit this framework and are implemented in PLANC are Multiplicative Update [33], Hierarchical Alternating LeastSquares [9, 17], Block Principal Pivoting [29], Alternating Direction Method of Multipliers [19] and Nesterov-type Manuscript submitted to ACM

Algorithm 3

MTTKRP via Dimension Tree

Require: X is original N -way tensor, T { i : j } is temporary tensor of dimension I × · · · × I i − × I j + × · · · × I N × R Require: n ∈ [ N ] is inner iteration mode (evaluated in order), S ∈ [ N ] is fixed split mode if n = then K ( S + N ) = H ( N ) ⊙ · · · ⊙ H ( S + ) % partial Khatri-Rao product T { S }( S ) = X ( S ) · K ( S + N ) % partial MTTKRP K ( S − ) = H ( S − ) ⊙ · · · ⊙ H ( ) % partial Khatri-Rao product M ( ) ( : , r ) = T { S }( ) [ r ] · K ( S − ) ( : , r ) for each r ∈ [ R ] % multi-TTV for MTTKRP result else if n < S then T { n : S }( S − n + ) ( : , r ) = T { n − S }( ) [ r ] T · H ( n − ) ( : , r ) for each r ∈ [ R ] % multi-TTV for internal node tensor K ( n + S ) = H ( S ) ⊙ · · · ⊙ H ( n + ) % partial Khatri-Rao product M ( n ) ( : , r ) = T { n : S }( ) [ r ] · K ( n + S ) ( : , r ) for each r ∈ [ R ] % multi-TTV for MTTKRP result else if n = S then M ( S ) ( : , r ) = T { S − S }( ) [ r ] · H ( S − ) ( : , r ) for each r ∈ [ R ] % multi-TTV for MTTKRP result else if n = S + then K ( S ) = H ( S ) ⊙ · · · ⊙ H ( ) % partial Khatri-Rao product T { S + N }( N − S ) = X T ( S ) · K ( S ) % partial MTTKRP K ( S + N ) = H ( N ) ⊙ · · · ⊙ H ( S + ) % partial Khatri-Rao product M ( S + ) ( : , r ) = T { S + N }( ) [ r ] · K ( S + N ) ( : , r ) for each r ∈ [ R ] % multi-TTV for MTTKRP result else if n < N then T { n : N }( N − n + ) ( : , r ) = T { n − N }( ) [ r ] T · H ( n − ) ( : , r ) for each r ∈ [ R ] % multi-TTV for internal node tensor K ( n + N ) = H ( N ) ⊙ · · · ⊙ H ( n + ) % partial Khatri-Rao product M ( n ) ( : , r ) = T { n : N }( ) [ r ] · K ( n + N ) ( : , r ) for each r ∈ [ R ] % multi-TTV for MTTKRP result else M ( N ) ( : , r ) = T { N − N }( ) [ r ] · H ( N − ) ( : , r ) for each r ∈ [ R ] % multi-TTV for MTTKRP result end if algorithm [36]. We briefly describe the different solvers below. Note that the descriptions are for the general form ofthe NNLS problem as shown in Equation (2). The MU solve is an elementwise operation [33]. The update rule for element ( i , j ) of X is X ( i , j ) ← X ( i , j ) A T B ( i , j ) (cid:0) A T AX (cid:1) ( i , j ) (3)While this rule does not solve Equation (2) to optimality it ensures a reduction in the objective value from the initialvalue of X . Note that Equation (3) breaks down if the denominator becomes zero. In practice a small value is added tothe denominator to prevent this situation. HALS updates are performed on individual rows of X [9, 17]. Theupdate rule for row i can be written in closed form as X ( i , : ) ←  X ( i , : ) + ( A T B )( i , : ) − (cid:16) A T A ( i , : ) X (cid:17) ( A T A )( i , i )  + (4)where [·] + is the projection operator onto R + . The rows of X are updated in order so that the latest values are usedin every update step. HALS has been observed to produce unbalanced results with either very large or very small Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 13values appearing in the factor matrices [17, 28]. Normalizing the rows of X after every update via Equation (4) has beenproposed to alleviate this problem [17, 28]. Within PLANC’s parallelization, this step requires explicit communicationamong processors because the rows of X (the columns of H ( n ) ) are distributed across processors. BPP is an active-set like method for solving NNLS problems. The main subroutineof BPP is the single right hand side version of Equation (2), x ← arg min x ≥ ∥ Ax − b ∥ (5)The Karash-Kuhn-Tucker (KKT) optimality conditions for Equation (5) are specified by x and y = A T Ax − A T b : x ⩾ , y ⩾ , and x ∗ y = , where ∗ is the Hadamard product. The complementary slackness criteria from the KKT conditionsforces the support sets, i.e. the non-zero elements, of x and y to be disjoint. In the optimal solution, the active-set is theset of indices where x i = In the ADMM solver [19] the optimization problem Equa-tion (2) is reformulated by introducing an auxiliary variable ˆX :min X , ˆX (cid:13)(cid:13) A ˆX − B (cid:13)(cid:13) F + r ( X ) subject to X = ˆX (6)where r (·) is the penalty function for nonnegativity. It is 0 if X ≥ ∞ otherwise. The updates for the ADMMalgorithm are given by ˆX ← (cid:16) A T A + ρ I (cid:17) − (cid:16) A T B + ρ ( X + U ) T (cid:17) X ← arg min X r ( X ) + ρ (cid:13)(cid:13) X − ˆX + U (cid:13)(cid:13) F U ← U + X − ˆX , (7)where U is the scaled version of the dual variables corresponding to the equality constraints X = ˆX and ρ is a step sizespecified by the user. U is initialized as a matrix of all zeros. The advantage of using ADMM is the clever splitting ofthe non-negativity constraints into updates of two blocks of variables X and ˆX . This allows for an unconstrained leastsquares solve for ˆX and element-wise projections onto R + for X .We can accelerate this solve by repeating the updates given by Equation (7) more than once. One important fact tonotice is that the same matrix A T B and matrix inverse (cid:16) A T A + ρ I (cid:17) − are used for all the updates. We can thereforecache A T B and the Cholesky decomposition of (cid:16) A T A + ρ I (cid:17) to save some computations during subsequent updates.We stop updating using the stopping criteria described in [19] which is based on ∥ X ∥ F , ∥ ˆX ∥ F , and ∥ U ∥ F . Computingthese norms requires communication because each of these matrices are distributed across processors. We also limitthe maximum number of acceleration steps to 5. By default, a good choice for ρ is ∥ A ∥ F / R , where R is the number Manuscript submitted to ACM A (rank of the CP decomposition) [19]. A comprehensive guide to the ADMM method, convergenceproperties and selection of optimal ρ can be found in [6]. The Nesterov-type algorithm in PLANC was introduced by Liavas et al [36]. Theirmethod solves a modified version of NNLS problem Equation (2) with the introduction of a proximal term with anauxiliary matrix X ∗ . The proximal term is useful to handle ill-conditioned instances and guarantee strong convexity.The objective function tackled is f p ( X ) : = ∥ AX − B ∥ F + λ ∥ X − X ∗ ∥ F , (8)where X is constrained to be nonnegative. The gradient of f p is given by the expression ∇ f p ( X ) = −( A T AX − A T B ) + λ ( X − X ∗ ) Updates to X are performed using the gradient of f p , ∇ f p ( Y k ) = (cid:16) A T B − λ X ∗ (cid:17) + (cid:16) λ I − A T A (cid:17) Y k X k + ← (cid:2) Y k − α ∇ f p ( Y k ) (cid:3) + Y k + ← X k + + β k + ( X k + − X k ) (9)where [·] + is the projection operator onto R + . Notice that we can update X multiple times reusing A T A and A T B . Thisis the acceleration performed for every inner iteration in line 14 of Algorithm 2. They are repeated until a terminationcriteria is triggered; different criteria are discussed in [36]. The termination criteria are bounds checks on the minimumand absolute maximum values of X and require communication because X = H ( n ) T is distributed across processors. Wealso limit the total number of inner iterations to 20.The selection of hyperparameters λ , α , and β depends on the singular values of A and is necessary for developing aNesterov-like method for solving Equation (8). The matrix X ∗ is generally X from the previous outer iteration (line 8 ofAlgorithm 2). Details of the selection procedure and different cases can be found in the original paper [36].In addition to the acceleration performed during each NNLS solve, Equation (9), we can also perform an accelerationstep for every outer iteration in the while loop (line 8 of Algorithm 2). In this step all factor matrices are updated usingthe previous outer iteration values until the objective stops decreasing. The outer acceleration step for iteration i willbe, H ( ) new ← H ( ) i + s i (cid:16) H ( ) i − H ( ) i − (cid:17) H ( ) new ← H ( ) i + s i (cid:16) H ( ) i − H ( ) i − (cid:17) ... H ( N ) new ← H ( N ) i + s i (cid:16) H ( N ) i − H ( N ) i − (cid:17) (10)The results of Equation (10) will be accepted as the next iterate only if the overall objective error with the newfactor matrices, (cid:74) H ( ) new , . . . , H ( N ) new (cid:75) , is lower than that of (cid:74) H ( ) i , . . . , H ( N ) i (cid:75) . In order to compute the relative error weneed an extra MTTKRP computation per outer acceleration. Typically s i = i / N but its value can change as the overallalgorithm progresses [36]. Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 15

TensorNCPFactors

NTFMPICommunicator

DistAUNTF

DistNTFAOADMM

DistNTFNES DistNTFMU DistNTFHALS DistNTFTimeDistNTFABPP DistNTFCPALS

NumPyArray DistNTFIO

Fig. 6. PLANC class diagram. Utility classes are at the top of the diagram, and the algorithm classes at the bottom of the diagramall derive from the abstract class

DistAUNTF in orange. The blue arrows denote an “is-a” relationship and the red diamond arrowsdenote a “has-a” relationship.

The entire PLANC package has the following modules – shared memory NMF, shared memory NTF, distributedmemory NMF and distributed memory NTF. In this section, we give a brief overview of the software package structurefor NTF and ways to extend it.

We briefly describe the overall class hierarchy of the PLANC package as illustrated in Figure 6. PLANC offers bothshared and distributed memory implementations of NTF and the classes used in each type are distinguished by theprefix

Dist in their names (eg.

DistAUNTF versus

AUNTF ). We shall cover the distributed implementation of NTF in thissection. Most of the descriptions can be directly applied to the shared memory case as well.There are broadly 2 types of classes present. Utility classes are primarily for managing data, setting up the processorgrid, and interacting with the user. Algorithm classes perform all the computations needed for NTF and implement thedifferent NNLS solvers.

Data.

The

Tensor and

NCPFactors classes contain the input tensor X and the factor matrices (cid:74) H ( ) , . . . , H ( N ) (cid:75) . The Tensor class stores the input tensor as a standard data array. The tensor X is stored as its mode-1 unfolding X ( ) incolumn major order. Each processor contains its local part of the tensor (see § 4.1.2). The NCPFactors class contains allthe factor matrices. Each factor matrix is an Armadillo matrix [47]. The matrices are usually column normalized andthe column norms are stored in the vector λ which is present as a member of this class (see Algorithm 4). The vector λ is replicated in all processors whereas the rows of the factor matrices are distributed across the processor grid (see§ 4.1.2). There is no global view of the entire input tensor or factor matrices and care must be taken to communicateparts of either among the processor grid. Manuscript submitted to ACM

Communication.

The

NTFMPICommunicator class creates the MPI processor grid for Algorithm 2. In addition to thecommunicator for the entire grid, it contains a slice communicator for each mode of the processor grid. The slicecommunicators are used in the Reduce-Scatter of line 12 and All-Gather of line 18 in Algorithm 2.

I/O.

The

DistNTFIO and

NumPyArray utility classes are used to read in the input tensor from user-specified files.

DistNTFIO also contains methods to generate random tensors and to write out the factor matrices to disk. The

ParseCommandLine class contains all the command line options available in PLANC. As the name suggests it parsesthe different combinations of user inputs to instantiate the driver class and run the NTF algorithm. Some example userinputs are the target rank of the decomposition, number of outer iterations, NNLS solver, and regularization parameters.

DistAUNTF . This is the major workhorse class of the package. It is used to implement Algorithm 2. Some of theimportant member functions are: • computeNTF : This is the outer iteration (line 8 in Algorithm 2). • distmttkrp : Computes the distributed MTTKRP in line 11 and line 12 in Algorithm 2. • gram_hadamard,update_global_gram : These functions are used to compute the Gram matrix used in the NNLSsolvers. • computeError : This function calculates the relative objective error of the factorization as described in Appen-dix B.2. Derived classes.

There exist derived classes, such as

DistNTFANLSBPP , DistNTFMU , etc., for each of the NNLS solversdescribed in § 4.3. There are two main functions which are present in the derived classes which are described below.Auxiliary variables needed to implement certain NNLS solvers like ADMM and Nesterov-type algorithm are alsomaintained in this class. • update : This function is the NNLS solve function. It returns the updated factor matrix using the current localMTTKRP result and global Gram matrix (see § 4.3). • accelerate : This implements the outer iteration acceleration (line 8 in Algorithm 2). Currently only the Nesterov-type algorithm has an outer acceleration step. Extending PLANC to include different solvers is a simple task and we list the steps to do so below.(1) Create a derived class with the newly implemented update function. This is the new NNLS method needed toupdate the factor matrices.(2) The constructor for the new class should contain information on whether the algorithm requires an outeracceleration step. If the method requires an outer acceleration step it needs to be implemented in the derivedclass.(3) Update the command line parsing class

ParseCommandLine to include additional configuration options for thealgorithm.(4) Include the new algorithm as an option in the utilities and the driver files.

Case Study.

We describe the different steps needed to extend PLANC to include the Nesterov-type algorithm.

Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 17(1) We first create the

DistNTFNES class which is derived from

DistAUNTF .(2) We implement the update and accelerate functions in the derived class. • The Nesterov NNLS updates require the previous iterate values (for the auxiliary term as described in § 4.3)which may be thought of as a persistent “state" of the algorithm. We utilize an extra

NCPFactors object tohold these variables. • The Nesterov update function needs synchronization in order to terminate its local (iterative) NNLS solve. Thisinvolves accessing the communicators found in

DistNTFMPICommunicator class for the distributed algorithm. • Finally, Nesterov-type algorithms generally involve an outer acceleration step which is also implemented inthe derived class

DistNTFNES .(3) We then update the

ParseCommandLine class to include Nesterov as an algorithm.(4) We update the driver file distntf.cpp to include the Nesterov algorithm.

The entire experimentation was performed on Titan, a supercomputer at the Oak Ridge Leadership Computing Facility.Titan is a hybrid-architecture Cray XK7 system that contains both advanced 16-core AMD Optero TM central processingunits (CPUs) and NVIDIA ® Kepler graphics processing units (GPUs). It features 299,008 CPU cores on 18,688 computenodes, a total system memory of 710 terabytes with 32GB on each node, and Cray’s high-performance Gemini network.We use Armadillo [47] for matrix representations and operations. In Armadillo, the elements of the dense matrixare stored in column major order. For dense BLAS and LAPACK operations, we linked Armadillo with the defaultLAPACK/BLAS wrappers from Cray. We use the GNU C++ Compiler (g++ (GCC) 6.3.0) and Cray’s MPI library. Thecode can also compile and run on other commodity clusters with entire open source libraries such as OpenBLAS andOpenMPI.

The “Mouse” data is a 3D dataset that images a mouse’s brain over time and over a sequence ofidentical trials [30]. Each entry of the tensor represents a measure of calcium fluorescence in a particular pixel during atime step of a single trial. The calcium imaging is performed using an epi-fluorescence macroscope viewing the brainthrough an artificial crystal skull. Each image has dimension 1040 × , , × ×

25. Every trial is performed with the same mouse and tracks the same task. The mouse ispresented with visual simulation (starting at frame 3), and after a delay is rewarded with water (starting at frame 25).

Our synthetic data sets are constructed from a CP model with an exact low rank with no additionalnoise. In this case we can confirm that the residual error of our algorithm with a random start converges to zero. For thepurposes of benchmarking, we run a fixed number of iterations of the NTF algorithms rather than using a convergencecheck.

The list below gives a brief description of all the categories shown in the breakdown plots and their role in the overallalgorithm.

Manuscript submitted to ACM R is small relative to the tensor dimensions.(2) NNLS: the cost of a non-negative least squares update can vary drastically with the algorithm used. The variouscharacteristics that may affect run time for each NNLS algorithm are discussed in § 6.4.(3) MTTKRP: the (partial) MTTKRP is a purely local computation performed on each node, and can be offloaded tothe GPU. Using the dimension tree optimization (§ 4.2), we perform 2 partial MTTKRPs for each outer iteration,regardless of the number of modes N . Both operations are cast as GEMM calls, where the dimensions are givenby the product of the first S mode dimensions ( S is the split mode), the product of the last N − S mode dimensions,and the rank R .(4) MultiTTV: the MultiTTVs are purely local computations performed on each node. Each Multi-TTV is cast as aset of R GEMV calls which are typically memory bandwidth bound.(5) ReduceScatter: the ReduceScatter collective is used to sum MTTKRP results and distribute portions of the sumappropriately across processors. It is called for each inner iteration.(6) AllGather: the AllGather collective is used to collect the updated factor matrices to each processor in the slicecorresponding to the mode being updated. It is called after each inner iteration.(7) AllReduce: the AllReduce is used to compute the Gram matrices and for computing norms and other quantitiesrequired for stopping criteria of some algorithms.

Table 2 highlights the distinct aspects of each updating algorithm that can affect performance. The rows of Table 2denote the different local update algorithms implemented in PLANC. The algorithms names and acronyms in order fromtop to bottom in Table 2 are as follows: Unconstrained CP (UCP), Multiplicative Update (MU), Hierarchical Least Squares(HALS), Block Principle Pivoting (BPP), Alternating Direction Method of Multipliers (ADMM), and Nesterov-typealgorithm (NES). The aspects of each algorithm that are displayed in Table 2 are as follows: • Communication: a check mark and description in this column indicates that the local update algorithm requiressome amount of communication. For example, the HALS algorithm requires the communication of the updatedcolumn norms. Additional communication requirements can affect performance by incurring additional latencyand bandwidth costs. These penalties become significant when the number of processors is high. • Extra MTTKRPs: the NES algorithm has an acceleration step which requires an additional MTTKRP to beperformed. This can potentially increase the run time if the acceleration step does not decrease the objectivefunction. Experimentally, on both real and synthetic data sets, we observe that NES run time is significantlyincreased by the additional MTTKRP computations. • Iteration: this column indicates the iterative nature of the local update algorithm. Note that all of the algorithmswe present here are iterative in terms of the outer iteration. UCP, MU, and HALS all have closed-form formulasfor the inner iteration, meaning the number of flops can be explicitly computed as a function of the problem size.The rest of the algorithms have flop and communication requirements dependent on the number of iterations ittakes the algorithm to converge for a particular local update.

Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 19Alg Communication Extra MTTKRPs Iterative TuningUCP × × × × MU × × × × HALS ✓ Column Norms × × ×

BPP × × ✓ × ADMM ✓ Stopping Criteria × ✓ ✓ Step SizeNES ✓ Stopping Criteria ✓ ✓ ✓

See § 4.3

Table 2. Characteristics of the various update algorithms that can potentially affect performance. The columns are as follows: 1) ifthe local update requires communication, 2) if the update requires additional MTTKRP computations, 3) if the local update itself isiterative, 4) if the algorithm’s performance are significantly impacted by parameter tuning. A ✓ corresponds to the algorithm havingthe characteristic, and a × means it does not. • Tuning: many optimization algorithms require some tunable input parameters which can impact performance.For example, setting a step size is a frequent requirement for gradient based optimization algorithms when anexact line search is too computationally expensive.

Figure 7 shows the “local update computation" time takenby the different updating algorithms for various low-rank values on a synthetic data set and the Mouse dataset. Thesynthetic tensor (Figure 7a) involves about 20 LUCs per processor per iteration whereas the Mouse data (Figure 7b)has about 20,000 updates per processor per iteration, which accounts for the difference in the scales of the time seenin the figures. MU and CP are the cheapest algorithms with NES being the most expensive. HALS, ADMM and NESalgorithms all communicate in their update steps and this significantly affects their runtimes, see Figure 13b. NES hasthe most expensive inner iteration involving an eigendecomposition of the Gram matrix and up to 20 iterations of theNNLS updater. ADMM has the second most expensive inner iteration with up to 5 iterations of the acceleration step.HALS on the other hand doesn’t have a very expensive inner iteration but needs a synchronization to normalize everyupdated column of the factor matrix before proceeding to the next column, causing a slowdown.

Figure 8 gives a processor grid comparison for a 3-D cubical tensor of size512. The distributed MTTKRP time dominates the overall run time and we observe that an even processor distributionresults in the best achieved performance for all update algorithms. This difference in run times is partially accountedfor by GEMM performance due to the different shapes of the matrices involved. Besides choosing an even processorgrid we see that configurations 1 through 6 have quite stable run times with the exception of the NES algorithm. Thevariation in the NES run times can be attributed to the variable number of MTTKRPs needed for the acceleration steps.

Figure 9 shows comparison in run timesbetween performing partial MTTKRPs on the CPU versus offloading to the GPU as rank increases. As expected theCPU run times increase linearly with R as the operation count for CP-ALS is dominated by the MTTKRP which is linearin R . In the case of the GPU execution, the tested sizes of R are never large enough to saturate the GPU, yielding flat runtimes even as the rank increases. The NES algorithm takes additional time for both the CPU and GPU executions due tothe additional MTTKRPs. In this case, for the chosen tensor size and rank, it is always beneficial to offload the GEMMcalls to the GPU, and the maximum achieved speedup with GPU offloading is about 7 × . However, we have observed in Manuscript submitted to ACM

20 30 40 50 60 70 80 90 100Low Rank (R)0.0000.0050.0100.0150.0200.0250.0300.035 L U C T i m e p e r I t e r a t i o n ( S e c s ) MUHALSBPPADMMNESUCP (a) Synthetic data: × × × tensor on 81 nodes ar-ranged in × × × grid

20 30 40 50 60 70 80 90 100Low Rank (R)0246810121416 L U C T i m e p e r I t e r a t i o n ( S e c s ) MUHALSBPPADMMNESUCP (b) Mouse data: × × tensor on 64 nodes arrangedin × × gridFig. 7. Per Iteration Local Update Computation (LUC) comparison of NNLS algorithms on 4D synthetic and and 3D Mouse tensors × × × × × × × × × × × × × × . . . . . T i m epe r It e r a t i on ( S e cs ) MU × × × × × × × × × × × × × × HALS × × × × × × × × × × × × × × BPP × × × × × × × × × × × × × × ADMM × × × × × × × × × × × × × × NES × × × × × × × × × × × × × × UCP

GramNNLSMTTKRPMultiTTVReduceScatterAllGatherAllReduce

Processor Grids

Fig. 8. Processor grid sweep of a × × synthetic 3D low rank tensor on 512 nodes with low rank 96. other experiments that NVBLAS can make the incorrect decision to offload the computation to the GPU when it isfaster to perform it on the CPU. Figures 10 and 11 show convergence comparisons (error vs time) for each of the updating algorithms on syntheticlow-rank and Mouse data sets, using two different target ranks each. Every algorithm is run for a fixed number of (30)outer iterations for fair comparison. For the Mouse data in Figure 11, we show only the first 10 seconds because nearlyall algorithms are converging within 30 iterations. The initialized random factors are the same for all algorithms in both

Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 21

20 30 40 50 60 70 80 90 100Low Rank (R)2468101214 T i m e P e r I t e r a t i o n ( S e c s ) MU CPUHALS CPUBPP CPUADMM CPUNES CPUUCP CPU MU GPUHALS GPUBPP GPUADMM GPUNES GPUUCP GPU

Fig. 9. Timing comparison of CPU and GPU offloading on 4D Synthetic Low Rank Tensor of size × × × on × × × processor grid with varying ranks R e l a t i v e E rr o r MUHALSBPPADMMNESUCP (a) Low rank R = R e l a t i v e E rr o r MUHALSBPPADMMNESUCP (b) Low rank R = Fig. 10. Relative error over time comparison of updating algorithms on 4D Synthetic Low Rank Tensor of size × × × on × × × processor grid tests, and the synthetic tensor is the same for all algorithms. In both the synthetic and real world cases BPP achievesthe lowest approximation error in the shortest amount of time. Overall results are as expected, such as MU achievingthe worst error and ADMM achieving the second best in all cases. It is also note-worthy that on the real world data setthe best algorithms, ADMM and BPP, achieve relative errors of ≈ − We performed weak scaling analysis on 2 different cubical tensors with 3 and 4modes. Figure 12 shows the time breakdown for scaling up to 64 nodes of Titan for the 3D case and 16384 nodes forthe 4D tensor. In each experiment the size of the local tensor is kept constant at dimension 128 in each mode for allthe runs. As expected, the run time is dominated by the cost to compute the MTTKRP, and the domination is more

Manuscript submitted to ACM R e l a t i v e E rr o r MUHALSBPPADMMNESUCP (a) Low rank R = R e l a t i v e E rr o r MUHALSBPPADMMNESUCP (b) Low rank k = Fig. 11. Relative error comparison of updating algorithms on 3D Realworld Low Rank Tensor of size × × on a 64 TitanNodes as × × Processor Grid for 10 seconds extreme for higher mode tensors. Moreover we see reasonable weak scaling as the figures remain relatively flat over allprocessor sizes.The variations occur mainly due to the NNLS and communication portions of the algorithm. These do matter ingeneral and especially for the 3D case where the MTTKRP cost is often comparable to NNLS times, especially forsmaller number of processors. However NNLS times scale really well since they split along processor slices rather thanfibers and soon become negligible for large processor grids. The amount of communication per processor remainsconstant but latency costs increase slowly as we scale up.

We run strong scaling experiments on two synthetic square tensors, one 3D andone 4D. Figure 13 contains these results for each of the local update algorithms ranging from 1 to 16384 processors. Wesee similar behavior for the 3D and 4D case. For the 3D tensor (Figure 13a), we observe good strong scaling up to about32 nodes and continue to see speed up through 128 nodes. Similarly for the 4D case (Figure 13b), the algorithms scalewell up to about 1024 nodes and continue to reduce time until 8192 nodes; we observe a slowdown when scaling to16384 nodes.One reason for the limit of strong scaling is the communication overheads of AllGather, AllReduce, and ReduceScatter,which become more significant for more processors. Another reason is that for a cubical tensor of odd dimension, inour case three, the dimension tree optimization is often forced to cast partial MTTKRP into a very rectangular matrixmultiplication, depending on the processor grid. Two of the local dimensions must be grouped together while the otheris left alone. This means that the largest dimension would need to be close to the product of the other two in order forthere to be an approximately square matrix multiplication.

Figure 14 show strong scaling results on the Mouse dataset. We use a 1D P × × ∼

32 nodes and still improve runtimes through 512 nodes. At 1024 nodes the NNLS algorithms, whichcommunicate during the solve steps, perform far worse and show up to 2 × slowdown. The non-communicating solversalso degrade in performance but more gracefully. Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 23 T i m e p e r I t e r a t i o n ( S e c s ) MU HALS

BPP

ADMM

NES

UCP

GramNNLSMTTKRPMultiTTVReduceScatterAllGatherAllReduce

Number of Processors (a) Synthetic 3D Low Rank T i m e p e r I t e r a t i o n ( S e c s ) MU HALS

BPP

ADMM

NES

UCP

GramNNLSMTTKRPMultiTTVReduceScatterAllGatherAllReduce

Number of Processors (b) Synthetic 4D Low RankFig. 12. Weak scaling on synthetic 3D and 4D low-rank tensors. For the 3D case, the input tensors are of size × × , × × , × × and × × on 1, 8, 27 and 64 Titan Nodes. The 4D input tensors are × × × , × × × , × × × , × × × , × × × , × × × , × × × , × × × , × × × on 1, 16, 256, 512, 1024, 2048, 4096, 8192, and 16384 Titan nodes. For all experiments, the low rank is 96. The CP decomposition of the Mouse data can be used to interpret brain patterns in response to the light stimulus andwater reward given to the mouse. For example, Figure 15 shows a visualization of the factors of the 22nd component ofthe rank-32 CP decomposition. From the time factor, we see a marked increase in the importance of the component afterthe reward time frame, which suggests the activity is a response to the reward. Because the same mouse undergoes

Manuscript submitted to ACM T i m e p e r I t e r a t i o n ( S e c s ) MU HALS

BPP

ADMM

NES

Ideal ScalingGramNNLSMTTKRPMultiTTVReduceScatterAllGatherAllReduce

Number of Processors (a) Synthetic 3D Low Rank - × × tensor T i m e p e r I t e r a t i o n ( S e c s ) MU HALS

BPP

ADMM

NES

UCP

Ideal ScalingGramNNLSMTTKRPMultiTTVReduceScatterAllGatherAllReduce

Number of Processors (b) Synthetic 4D Low Rank - × × × tensorFig. 13. Strong scaling on synthetic 3D and 4D low rank tensors with low rank 96

25 identical trials, we expect to see no pattern in the time factor of each component. We note that the factors havebeen normalized, and the absolute magnitude of the y-axis reflects this. The pixel factor has been reshaped to animage of the same dimensions of the original data. We observe higher intensity values in the somatosensory cortex(middle, left), which is associated with bodily sensation. This component, possibly representing a sensory response tothe water reward, aligns well with the findings of cell-based analysis [30, Figure 3], which also identified neurons in thesomatosensory cortex with intensities that peaked quickly after the reward time frame.

Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 25

16 32 64 1282565121024 T i m e p e r I t e r a t i o n ( S e c s ) MU

16 32 64 1282565121024

HALS

16 32 64 1282565121024

BPP

16 32 64 1282565121024

ADMM

16 32 64 1282565121024

NES

16 32 64 1282565121024

UCP

Ideal ScalingGramNNLSMTTKRPMultiTTVReduceScatterAllGatherAllReduce

Number of Processors

Fig. 14. Strong scaling on Mouse dataset

Time li gh t r e w a r d Trial (a) Time and trial factors

200 400 600 800 1000 12001002003004005006007008009001000 (b) Brain image factorFig. 15. Component 22 of rank-32 CP decomposition of Mouse data.

The full set of components for the rank-32 CP decomposition are given in Figures 16 and 17 (Appendix A). We notethat the interpretation of these computed components is useful only for exploratory analysis. Their scientific validitywould need to confirmed with tests of robustness with respect to choice of rank, random starting point, and algorithm.

In this work, we present PLANC, a software library for nonnegative low-rank factorizations that works for tensorsof any number of modes and scales to large data sets and high processor counts. The software framework can beadapted to use any NNLS algorithm within the context of alternating-updating algorithms. We use a dimension treeoptimization to avoid unnecessary recomputation within the bottleneck local MTTKRP computation, and we use anefficient parallelization algorithm that minimizes communication cost. Our performance results show the ability to

Manuscript submitted to ACM

We thank Tony Hyun Kim and Mark Schnitzer for providing the Mouse dataset and help with interpretation of the CP components.This material is based upon work supported by the National Science Foundation under Grant No. OAC-1642385 and OAC-1642410.This manuscript has been co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department ofEnergy. This project was partially funded by the Laboratory Director’s Research and Development fund. This research used resourcesof the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science ofthe U.S. Department of Energy.This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facilitysupported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

REFERENCES [1] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. 2014. Tensor Decompositions for Learning Latent VariableModels.

Journal of Machine Learning Research

15 (2014), 2773–2832. http://jmlr.org/papers/v15/anandkumar14b.html[2] Grey Ballard, Koby Hayashi, and Ramakrishnan Kannan. 2018. Parallel Nonnegative CP Decomposition of Dense Tensors. In . 22–31. https://doi.org/10.1109/HiPC.2018.00012[3] Grey Ballard, Nicholas Knight, and Kathryn Rouse. 2017.

Communication Lower Bounds for Matricized Tensor Times Khatri-Rao Product . TechnicalReport 1708.07401. arXiv.[4] Muthu Baskaran, Benoît Meister, Nicolas Vasilache, and Richard Lethin. 2012. Efficient and scalable computations with sparse tensors. In . IEEE, 1–6.[5] Casey Battaglino, Grey Ballard, and Tamara G Kolda. 2018. A practical randomized CP tensor decomposition.

SIAM J. Matrix Anal. Appl.

39, 2 (2018),876–901.[6] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. 2011. Distributed optimization and statistical learning via the alternatingdirection method of multipliers.

Foundations and Trends® in Machine learning

3, 1 (2011), 1–122.[7] Rasmus Bro and Claus A Andersson. 1998. Improving the speed of multiway algorithms: Part II: Compression.

Chemometrics and intelligentlaboratory systems

42, 1-2 (1998), 105–113.[8] Rasmus Bro and Sijmen De Jong. 1997. A fast non-negativity-constrained least squares algorithm.

Journal of Chemometrics: A Journal of theChemometrics Society

11, 5 (1997), 393–401.[9] Andrzej Cichocki and Anh-HUY Phan. 2009. Fast local algorithms for large scale nonnegative matrix and tensor factorizations.

IEICE Transactionson Fundamentals of Electronics, Communications and Computer Sciences

E92-A (2009), 708–721. Issue 3.Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 27 [10] Andrzej Cichocki, Rafal Zdunek, and Shun-ichi Amari. 2008. Nonnegative matrix and tensor factorization [lecture notes].

IEEE signal processingmagazine

25, 1 (2008), 142–145.[11] Petros Drineas, Michael W Mahoney, S Muthukrishnan, and Tamás Sarlós. 2011. Faster least squares approximation.

Numerische mathematik

PatternRecognition Letters

104 (2018), 1–7.[13] Xiao Fu, Cheng Gao, Hoi-To Wai, and Kejun Huang. 2019. Block-randomized stochastic proximal gradient for constrained low-rank tensorfactorization. In

ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 7485–7489.[14] Lars Grasedyck. 2010. Hierarchical singular value decomposition of tensors.

SIAM J. Matrix Anal. Appl.

31, 4 (2010), 2029–2054.[15] Wolfgang Hackbusch. 2014. Numerical Tensor Calculus.

Acta Numerica

23 (2014), 651–742. https://doi.org/10.1017/S0962492914000087[16] Lixing Han, Michael Neumann, and Upendra Prasad. 2009. Alternating projected Barzilai-Borwein methods for nonnegative matrix factorization.

Electron. Trans. Numer. Anal

36, 6 (2009), 54–82.[17] Ngoc-Diep Ho. 2008.

Nonnegative Matrix Factorization Algorithms and Applications . Ph.D. Dissertation. Université Catholique De Louvain.[18] Kejun Huang, Nicholas D Sidiropoulos, and Athanasios P Liavas. 2015. Efficient algorithms for universally constrained matrix and tensor factorization.In

Signal Processing Conference (EUSIPCO), 2015 23rd European . IEEE, 2521–2525.[19] Kejun Huang, Nicholas D Sidiropoulos, and Athanasios P Liavas. 2016. A flexible and efficient algorithmic framework for constrained matrix andtensor factorization.

IEEE Transactions on Signal Processing

64, 19 (2016), 5052–5065.[20] Stephen Jesse, Miaofang Chi, Albina Borisevich, Alexei Belianinov, Sergei Kalinin, Eirik Endeve, Richard K. Archibald, Christopher T. Symons, andAndrew R. Lupini. 2016. Using Multivariate Analysis of Scanning-Ronchigram Data to Reveal Material Functionality.

Microscopy and Microanalysis

22 (07 2016), 292–293.[21] Ramakrishnan Kannan, Grey Ballard, and Haesun Park. 2016. A High-performance Parallel Algorithm for Nonnegative Matrix Factorization. In

Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’16) . ACM, New York, NY, USA, Article 9,11 pages. https://doi.org/10.1145/2851141.2851152[22] Ramakrishnan Kannan, Grey Ballard, and Haesun Park. 2018. MPI-FAUN: An MPI-based Framework for Alternating-Updating Nonnegative MatrixFactorization.

IEEE Transactions on Knowledge and Data Engineering

30, 3 (2018), 544–558.[23] Oguz Kaya. 2017.

High Performance Parallel Algorithms for Tensor Decompositions . Ph.D. Dissertation. University of Lyon. https://tel.archives-ouvertes.fr/tel-01623523[24] Oguz Kaya and Bora Uçar. 2016. High Performance Parallel Algorithms for the Tucker Decomposition of Sparse Tensors. In . 103–112. https://doi.org/10.1109/ICPP.2016.19[25] Oguz Kaya and Bora Uçar. 2016.

Parallel CP decomposition of sparse tensors using dimension trees . Research Report RR-8976. Inria - Research CentreGrenoble – Rhône-Alpes. https://hal.inria.fr/hal-01397464[26] Oguz Kaya and Bora Uçar. 2018. Parallel CANDECOMP/PARAFAC Decomposition of Sparse Tensors Using Dimension Trees.

SIAM J. ScientificComputing

40, 1 (2018). https://doi.org/10.1137/16M1102744[27] Dongmin Kim, Suvrit Sra, and Inderjit S Dhillon. 2007. Fast Newton-type methods for the least squares nonnegative matrix approximation problem.In

Proceedings of the 2007 SIAM international conference on data mining . SIAM, 343–354.[28] Jingu Kim, Yunlong He, and Haesun Park. 2014. Algorithms for nonnegative matrix and tensor factorizations: a unified view based on blockcoordinate descent framework.

Journal of Global Optimization

58, 2 (01 Feb 2014), 285–319. https://doi.org/10.1007/s10898-013-0035-4[29] Jingu Kim and Haesun Park. 2011. Fast Nonnegative Matrix Factorization: An Active-Set-Like Method and Comparisons.

SIAM Journal on ScientificComputing

33, 6 (2011), 3261–3281.[30] Tony Hyun Kim, Yanping Zhang, Jérôme Lecoq, Juergen C. Jung, Jane Li, Hongkui Zeng, Cristopher M. Niell, and Mark J. Schnitzer. 2016.Long-Term Optical Access to an Estimated One Million Neurons in the Live Mouse Cortex.

Cell Reports

17, 12 (2016), 3385 – 3394. https://doi.org/10.1016/j.celrep.2016.12.004[31] Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications.

SIAM review

51, 3 (2009), 455–500.[32] Charles L Lawson and Richard J Hanson. 1974. Solving least squares problems. (1974).[33] Daniel D Lee and H Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization.

Nature

IEEE International Paralleland Distributed Processing Symposium (IPDPS) . 1048–1057. https://doi.org/10.1109/IPDPS.2017.80[35] A. P. Liavas, G. Kostoulas, G. Lourakis, K. Huang, and N. D. Sidiropoulos. 2017. Nesterov-based Alternating Optimization for Nonnegative TensorFactorization: Algorithm and Parallel Implementation.

IEEE Transactions on Signal Processing (Nov 2017). https://doi.org/10.1109/TSP.2017.2777399[36] Athanasios P Liavas, Georgios Kostoulas, Georgios Lourakis, Kejun Huang, and Nicholas D Sidiropoulos. 2017. Nesterov-based parallel algorithmfor large-scale nonnegative tensor factorization. In

Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on . IEEE,5895–5899.[37] Chih-Jen Lin. 2007. Projected gradient methods for nonnegative matrix factorization.

Neural computation

19, 10 (2007), 2756–2779.[38] Linjian Ma and Edgar Solomonik. 2018. Accelerating Alternating Least Squares for Tensor Decomposition by Pairwise Perturbation. arXiv preprintarXiv:1811.10573 (2018). Manuscript submitted to ACM [39] Michael Merritt and Yin Zhang. 2005. Interior-point gradient method for large-scale totally nonnegative least squares problems.

Journal ofoptimization theory and applications arXiv preprint arXiv:1904.07935 (2019).[41] Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, Richard Vuduc, and P Sadayappan. 2019. Load-Balanced Sparse MTTKRP on GPUs. arXiv preprintarXiv:1904.03329 (2019).[42] Pentti Paatero. 1997. A weighted non-negative least squares algorithm for three-way PARAFAC factor analysis.

Chemometrics and IntelligentLaboratory Systems

38, 2 (1997), 223 – 242. https://doi.org/10.1016/S0169-7439(97)00031-2[43] Evangelos E Papalexakis, Christos Faloutsos, and Nicholas D Sidiropoulos. 2012. Parcube: Sparse parallelizable tensor decompositions. In

JointEuropean Conference on Machine Learning and Knowledge Discovery in Databases . Springer, 521–536.[44] Anh Huy Phan and Andrzej Cichocki. 2011. PARAFAC algorithms for large-scale problems.

Neurocomputing

74, 11 (2011), 1970–1984. https://doi.org/10.1016/j.neucom.2010.06.030[45] Anh-Huy Phan, Petr Tichavsky, and Andrzej Cichocki. 2013. Fast Alternating LS Algorithms for High Order CANDECOMP/PARAFAC TensorFactorizations.

IEEE Transactions on Signal Processing

Armadillo: An Open Source C++ Linear Algebra Library for Fast Prototyping and Computationally Intensive Experiments .Technical Report. NICTA. http://arma.sourceforge.net/armadillo_nicta_2010.pdf[48] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos. 2017. Tensor Decomposition for Signal Processing andMachine Learning.

IEEE Transactions on Signal Processing

65, 13 (July 2017), 3551–3582. https://doi.org/10.1109/TSP.2017.2690524[49] S. Smith, A. Beri, and G. Karypis. 2017. Constrained Tensor Factorization with Accelerated AO-ADMM. In . 111–120. https://doi.org/10.1109/ICPP.2017.20[50] Shaden Smith and George Karypis. 2016. A Medium-Grained Algorithm for Distributed Sparse Tensor Factorization. In

IEEE 30th InternationalParallel and Distributed Processing Symposium . 902–911. https://doi.org/10.1109/IPDPS.2016.113[51] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis. 2015. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication. In . 61–70. https://doi.org/10.1109/IPDPS.2015.27[52] Bing Tang, Linyao Kang, Yanmin Xia, and Li Zhang. 2018. GPU-accelerated Large-Scale Non-negative Matrix Factorization Using Spark. In

International Conference on Collaborative Computing: Networking, Applications and Worksharing . Springer, 189–201.[53] Giorgio Tomasi and Rasmus Bro. 2006. A comparison of algorithms for fitting the PARAFAC model.

Computational Statistics & Data Analysis

50, 7(2006), 1700–1734.[54] Mark H Van Benthem and Michael R Keenan. 2004. Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems.

Journal of Chemometrics: A Journal of the Chemometrics Society

18, 10 (2004), 441–450.[55] Nico Vervliet, Otto Debals, and Lieven De Lathauwer. 2019. Exploiting Efficient Representations in Large-Scale Tensor Decompositions.

SIAMJournal on Scientific Computing

41, 2 (2019), A789–A815. https://doi.org/10.1137/17M1152371[56] Yining Wang, Hsiao-Yu Tung, Alexander J Smola, and Anima Anandkumar. 2015. Fast and guaranteed tensor decomposition via sketching. In

Advances in Neural Information Processing Systems . 991–999.[57] Max Welling and Markus Weber. 2001. Positive tensor factorization.

Pattern Recognition Letters

22, 12 (2001), 1255 – 1261. https://doi.org/10.1016/S0167-8655(01)00070-8 Selected Papers from the 11th Portuguese Conference on Pattern Recognition.Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 29

A FULL RESULTS FOR MOUSE DATA

Figures 16 and 17 show all 32 components of a CP decomposition of the Mouse data. This decomposition includes thecomponent highlighted in Figure 15. The components are ordered by their weight in the λ vector. The vertical bars inFigure 16a correspond to the frames of the light stimulus and water reward, respectively. B DETAILED PARALLEL ALGORITHMB.1 Factor Matrix Normalization

The CP decomposition has a scale indeterminacy. To prevent possible growth in factor matrix entries, each time afactor matrix is updated, each of the R columns is normalized using the 2-norm and the weights are stored in anauxiliary vector λ . In the distributed algorithm these steps can been seen on lines 21 to 23 in Algorithm 4. Note thatcommunication is required as the global factor matrix norms are computed.On an algorithmic level one can observe why this step is necessary from the objective function for updating a singlefactor matrix in the inner iteration min H ( n ) ∥ X ( n ) − H ( n ) Λ ( H ( N ) ⊙ · · · H ( n + ) ⊙ H ( n − ) · · · ⊙ H ( ) )∥ F , where Λ is thediagonal matrix with the λ vector as its diagonal values. To solve we simply collapse H ( n ) Λ together. Thus when thesolve occurs we are actually computing H ( n ) Λ which is then normalized to obtain both H ( n ) and the new λ . B.2 Relative Error Computation

Given a model M = (cid:74) H ( ) , . . . , H ( N ) (cid:75) , we compute the relative error ∥ X − M ∥/∥ X ∥ efficiently by using the identity ∥ X − M ∥ = ∥ X ∥ − ⟨ X , M ⟩ + ∥ M ∥ . The quantity ∥ X ∥ is fixed, and the other two terms can be computed cheaply giventhe temporary matrices computed during the course of the algorithm. The second term can be computed using the identity ⟨ X , M ⟩ = ⟨ M ( N ) , H ( N ) ⟩ , where M ( N ) = X ( N ) ( H ( N − ) ⊙ · · · ⊙ H ( ) ) is the MTTKRP result in the N th mode. The thirdterm can be computed using the identity ∥ M ∥ = T ( S ( N ) ∗ H ( N ) T H ( N ) ) where S ( N ) = H ( ) T H ( ) ∗ · · · ∗ H ( N − ) T H ( N − ) .Both matrices M ( N ) and S ( N ) are computed during the course of the algorithm for updating the factor matrix H ( N ) . Theextra computation involved in computing the relative error is negligible. These identities have been used previously[31, 36, 46, 50]. Manuscript submitted to ACM = 1.2e+06 = 1.1e+06 = 1.1e+06 = 1.1e+06 = 1.0e+06 = 1.0e+06 = 1.0e+06 = 9.9e+05 = 9.9e+05 = 9.9e+05 = 9.6e+05 = 9.6e+05 = 9.4e+05 = 9.4e+05 = 9.4e+05 = 9.3e+05 = 9.3e+05 = 9.2e+05 = 9.1e+05 = 9.0e+05 = 8.9e+05 = 8.9e+05 = 8.8e+05 = 8.6e+05 = 8.0e+05 = 8.0e+05 = 8.0e+05 = 7.8e+05 = 7.2e+05 = 7.1e+05 = 5.2e+05 = 3.9e+05 (a) Time = 1.2e+06 = 1.1e+06 = 1.1e+06 = 1.1e+06 = 1.0e+06 = 1.0e+06 = 1.0e+06 = 9.9e+05 = 9.9e+05 = 9.9e+05 = 9.6e+05 = 9.6e+05 = 9.4e+05 = 9.4e+05 = 9.4e+05 = 9.3e+05 = 9.3e+05 = 9.2e+05 = 9.1e+05 = 9.0e+05 = 8.9e+05 = 8.9e+05 = 8.8e+05 = 8.6e+05 = 8.0e+05 = 8.0e+05 = 8.0e+05 = 7.8e+05 = 7.2e+05 = 7.1e+05 = 5.2e+05 = 3.9e+05 (b) TrialFig. 16. Time and trial factors of rank-32 CP decomposition of Mouse data. Manuscript submitted to ACM

LANC: Parallel Low Rank Approximation with Non-negativity Constraints 31

Fig. 17. Brain factors of rank-32 CP decomposition of Mouse data.

Manuscript submitted to ACM

Algorithm 4 ( (cid:74) λ ; H ( ) , . . . , H ( N ) (cid:75) , ϵ ) = Par-NNCP ( X , R ) Require: X is an I × · · · × I N tensor distributed across a P × · · · × P N grid of P processors, so that X p is ( I / P ) ×· · · × ( I N / P N ) and is owned by processor p = ( p , . . . , p N ) , R is rank of approximation % Initialize data a = Norm-Squared ( X p ) α = All-Reduce ( a , All-Procs ) ϵ = Inf for n = N do Initialize H ( n ) p of dimensions ( I n / P ) × R G = Local-SYRK ( H ( n ) p ) G ( n ) = All-Reduce ( G , All-Procs ) H ( n ) p n = All-Gather ( H ( n ) p , Proc-Slice ( n , p n )) end for % Compute NNCP approximation while ϵ > tol do % Perform outer iteration for n = N do % Compute new factor matrix in n th mode M = Local-MTTKRP ( X p ··· p N , { H ( i ) p i } , n ) M ( n ) p = Reduce-Scatter ( M , Proc-Slice ( n , p n )) S ( n ) = G ( ) ∗ · · · ∗ G ( n − ) ∗ G ( n + ) ∗ · · · ∗ G ( N ) ˆ H ( n ) p = NNLS-Update ( S ( n ) , M ( n ) p ) % Normalize columns λ = Local-Col-Norms ( ˆ H ( n ) p ) λ = All-Reduce ( λ , All-Procs ) H ( n ) p = Local-Col-Scale ( ˆ H ( n ) p , λ ) % Organize data for later modes G = H ( n ) p T H ( n ) p G ( n ) = All-Reduce ( G , All-Procs ) H ( n ) p n = All-Gather ( H ( n ) p , Proc-Slice ( n , p n )) end for % Compute relative error ϵ from mode- N matrices β = Inner-Product ( M ( N ) p , ˆ H ( N ) p ) β = All-Reduce ( β , All-Procs ) γ = λ T ( S ( N ) ∗ G ( N ) ) λ ϵ = (cid:112) ( α − β + γ )/ α end whileEnsure: ∥ X − (cid:74) λ ; H ( ) , . . . , H ( N ) (cid:75) ∥/∥ X ∥ = ϵ Ensure:

Local matrices: H ( n ) p is ( I n / P ) × R and owned by processor p = ( p , . . . , p N ) , for 1 ⩽ n ⩽ N , λ storedredundantly on every processorstoredredundantly on every processor