On Tensors, Sparsity, and Nonnegative Factorizations
ERIC C. CHI AND TAMARA G. KOLDA

Abstract.
Tensors have found application in a variety of fields, ranging from chemometrics to signal processing and beyond. In this paper, we consider the problem of multilinear modeling of sparse count data. Our goal is to develop a descriptive tensor factorization model of such data, along with appropriate algorithms and theory. To do so, we propose that the random variation is best described via a Poisson distribution, which better describes the zeros observed in the data as compared to the typical assumption of a Gaussian distribution. Under a Poisson assumption, we fit a model to observed data using the negative log-likelihood score. We present a new algorithm for Poisson tensor factorization called CANDECOMP-PARAFAC Alternating Poisson Regression (CP-APR) that is based on a majorization-minimization approach. It can be shown that CP-APR is a generalization of the Lee-Seung multiplicative updates. We show how to prevent the algorithm from converging to non-KKT points and prove convergence of CP-APR under mild conditions. We also explain how to implement CP-APR for large-scale sparse tensors and present results on several data sets, both real and simulated.
Key words.
Nonnegative tensor factorization, nonnegative CANDECOMP-PARAFAC, Poisson tensor factorization, Lee-Seung multiplicative updates, majorization-minimization algorithms
1. Introduction.
Tensors have found application in a variety of fields, ranging from chemometrics to signal processing and beyond. In this paper, we consider the problem of multilinear modeling of sparse count data. For instance, we may consider data that encodes the number of papers published by each author at each conference per year for a given time frame [13], the number of packets sent from one IP address to another using a specific port [47], or to/from and term counts on emails [2]. Our goal is to develop a descriptive model of such data, along with appropriate algorithms and theory.

∗ The work of the first author was fully supported by the U.S. Department of Energy Computational Science Graduate Fellowship under grant number DE-FG02-97ER25308. The work of the second author was funded by the applied mathematics program at the U.S. Department of Energy and Sandia National Laboratories, a multiprogram laboratory operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
† Dept. Human Genetics, University of California, Los Angeles, CA. Email: ecchi@ucla.edu
‡ Sandia National Laboratories, Livermore, CA. Email: tgkolda@sandia.gov

Let X represent an N-way data tensor of size I_1 × I_2 × ··· × I_N. We are interested in an R-component nonnegative CANDECOMP/PARAFAC [8, 21] factor model

  M = Σ_{r=1}^{R} λ_r a_r^(1) ∘ ··· ∘ a_r^(N),   (1.1)

where ∘ represents the outer product and a_r^(n) represents the rth column of the nonnegative factor matrix A^(n) of size I_n × R. We refer to each summand as a component. Assuming each factor matrix has been column-normalized to sum to one, we refer to the nonnegative λ_r's as weights.

In many applications such as chemometrics [46], we fit the model to the data using a least squares criterion, implicitly assuming that the random variation in the tensor data follows a Gaussian distribution. In the case of sparse count data, however, the random variation is better described via a Poisson distribution [37, 44], i.e., x_i ∼ Poisson(m_i) rather than x_i ∼ N(m_i, σ_i²), where the subscript i is shorthand for the multi-index (i_1, i_2, ..., i_N). In fact, a Poisson model is a much better explanation for the zero observations that we encounter in sparse data; these zeros just correspond to events that were very unlikely to be observed. Thus, we propose that rather than using the least squares (LS) error function given by Σ_i |x_i − m_i|², for count data we should instead minimize the (generalized) Kullback-Leibler (KL) divergence

  f(M) = Σ_i m_i − x_i log m_i,   (1.2)

which equals the negative log-likelihood of the observations up to an additive constant. Unfortunately, minimizing the KL divergence is more difficult than the LS error. Although other authors have considered fitting tensor data using KL divergence [50, 9, 51], we offer the following contributions:

• We develop alternating Poisson regression for the nonnegative CP model (CP-APR). The subproblems are solved using a majorization-minimization (MM) approach. If the algorithm is restricted to a single inner iteration per subproblem, it reduces to the standard Lee-Seung multiplicative updates for KL divergence [29, 30] as extended to tensors by Welling and Weber [50]. However, using multiple inner iterations is shown to accelerate the method, similar to what has been observed for LS [19].

• It is known that the Lee-Seung multiplicative updates may converge to a non-stationary point [20], and Lin [32] has previously introduced a fix for the LS version of the Lee-Seung method. We introduce a different technique for avoiding inadmissible zeros (i.e., zeros that violate stationarity conditions) that is only a trivial change to the basic algorithm and prevents convergence to non-stationary points.
This technique is straightforward to adapt to the matrix and/or LS cases as well.

• Assuming the subproblems can be solved exactly, we prove convergence of the CP-APR algorithm. In particular, we can show convergence even for sparse input data and solutions on the boundary of the nonnegative orthant.

• We explain how to efficiently implement CP-APR for large-scale sparse data. Although it is well known how to do large-scale sparse tensor calculations for the LS fitting function [3], the Poisson likelihood fitting algorithm requires new sparse tensor kernels. To the best of our knowledge, ours is the first implementation of any KL-divergence-based method for large-scale sparse tensors.

• We present experimental results showing the effectiveness of the method on both real and simulated data. In fact, the Poisson assumption leads quite naturally to a generative model for sparse data.
Much of the past work in nonnegative matrix and tensor analysis has focused on the LS error [42, 41, 6, 20, 25, 23, 17], which corresponds to an assumption of normal independently identically distributed (i.i.d.) noise. The focus of this paper is KL divergence, which corresponds to maximum likelihood estimation under an independent Poisson assumption. Lee and Seung [29, 30] proposed multiplicative update formulas for both LS and KL divergence, resulting in a very low cost per iteration. Welling and Weber [50] were the first to generalize the Lee and Seung algorithms to nonnegative tensor factorization (NTF). Applications of NTF based on KL divergence include EEG analysis [38] and sound source separation [16]. We note that generalizations of KL divergence have also been proposed in the literature, including Bregman divergence [11, 10, 31] and beta divergence [9, 14].
2. Notation and Preliminaries.

2.1. Notation.
Throughout, scalars are denoted by lowercase letters (a), vectors by boldface lowercase letters (v), matrices by boldface capital letters (A), and higher-order tensors by boldface Euler script letters (X). We let e denote the vector of all ones and E denote the matrix of all ones. The jth column of a matrix A is denoted by a_j. We use multi-index notation so that a boldface i represents the index (i_1, ..., i_N). We use subscripts to denote the iteration index for infinite sequences; the difference between its use for an entry and its use as an iteration index should be clear from context.

The notation ‖·‖ refers to the two-norm for vectors or the Frobenius norm for matrices, i.e., the square root of the sum of the squares of the entries. The notation ‖·‖_1 refers to the one-norm, i.e., the sum of the absolute values of the entries.

The outer product is denoted by ∘. The symbols ∗ and ⊘ represent elementwise multiplication and division, respectively. The symbol ⊙ denotes the Khatri-Rao matrix product. The mode-n matricization or unfolding of a tensor X is denoted by X_(n). See Appendix A for further details on these operations.

In statistics, count data is often best described as following a Poisson distribution. For a general discussion of the Poisson distribution, see, e.g., [44]. We summarize key facts here. A random variable X is said to have a Poisson distribution with parameter µ > 0 if it takes values x = 0, 1, 2, ... with probability

  P(X = x) = e^{−µ} µ^x / x!.   (2.1)

The mean and variance of X are both µ; therefore, the variance increases along with the mean, which seems like a reasonable assumption for count data. It is also useful to note that the sum of independent Poisson random variables is also Poisson. This is important in our case since each Poisson parameter is a multilinear combination of the model parameters.
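The matrix operations defined in §2.1 can be sketched in NumPy as follows. This is a minimal illustration only; note that matricization column-ordering conventions vary in the literature, and this sketch simply uses NumPy's default row-major reshape.

```python
import numpy as np

# Khatri-Rao product (columnwise Kronecker) of A (I x R) and B (J x R),
# denoted A (*) B in the text, yielding an (IJ) x R matrix.
def khatri_rao(A, B):
    I, R = A.shape
    J, R2 = B.shape
    assert R == R2, "factor matrices must have the same number of columns"
    # column r of the result is kron(A[:, r], B[:, r])
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

# Mode-n matricization X_(n): mode n becomes the rows and the remaining
# modes are flattened into the columns.
def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
```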
We contrast Poisson and Gaussian distributions in Figure 2.1. Observe that there is good agreement between the distributions for larger values of the mean, µ. For small values of µ, however, the match is not as strong and the Gaussian random variable can take on negative values.

Fig. 2.1: Illustration of Gaussian and Poisson distributions for two parameters. For both examples, we assume that the variance of the Gaussian is equal to the mean µ.

We can determine the optimal Poisson parameters by maximizing the likelihood of the observed data. Let x be a vector of observations and let µ be the vector of Poisson parameters. (We assume that the µ_i's are not independent, else the function would entirely decouple in the parameters to be estimated.) Then the negative of the log of the likelihood function for (2.1) is the KL divergence

  Σ_i µ_i − x_i log µ_i,   (2.2)

excepting the addition of the constant term Σ_i log(x_i!), which is omitted.

Because we are working with sparse data, there are many instances for which we expect x_i = 0, which leads to some ambiguity in (2.2) if µ_i = 0. We assume throughout that 0 · log(µ) = 0 for all µ ≥ 0. This is for notational convenience; else, we would write (2.2) as Σ_i µ_i − Σ_{i : x_i ≠ 0} x_i log µ_i.
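A small sketch of (2.2) under the 0 · log(0) = 0 convention, following the rewrite just given; the function name is ours:

```python
import numpy as np

# Generalized KL divergence (2.2): sum_i mu_i - x_i log mu_i, dropping
# the constant term sum_i log(x_i!). The log term is summed only over
# entries with x_i != 0, which enforces the 0*log(mu) = 0 convention.
def poisson_neg_loglik(x, mu):
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    nz = x != 0
    if np.any(mu[nz] == 0):
        return np.inf  # a zero parameter cannot explain a positive count
    return mu.sum() - np.sum(x[nz] * np.log(mu[nz]))
```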
3. CP-APR: Alternating Poisson Regression.
In this section we introduce the CP-APR algorithm for fitting a nonnegative Poisson tensor factorization (PTF) to count data. The algorithm employs an alternating optimization scheme that sequentially optimizes one factor matrix while holding the others fixed; this is nonlinear Gauss-Seidel applied to the PTF problem. The subproblems are solved via a majorization-minimization (MM) algorithm, as described in §4.

Our optimization problem is defined as

  min f(M) ≡ Σ_i m_i − x_i log m_i   s.t. M = ⟦λ; A^(1), ..., A^(N)⟧ ∈ Ω,   (3.1)

where Ω = Ω_λ × Ω_1 × ··· × Ω_N with

  Ω_λ = [0, +∞)^R and Ω_n = { A ∈ [0, 1]^{I_n × R} | ‖a_r‖_1 = 1 for r = 1, ..., R }.   (3.2)

Here M = ⟦λ; A^(1), ..., A^(N)⟧ is shorthand notation for (1.1) [3]. Depending on context, M represents the tensor itself or its constituent parts. For example, when we say M ∈ Ω, it means that the factor matrices have stochasticity constraints on the columns.

The function f is not finite on all of Ω. For example, if there exists i such that m_i = 0 and x_i > 0, then f(M) = +∞. If m_i > 0 for all i such that x_i > 0, however, then we are guaranteed that f(M) is finite. Consequently, we will generally wish to restrict ourselves to a domain for which f(M) is finite. We define

  Ω(ζ) ≡ conv({ M ∈ Ω | f(M) ≤ ζ }),   (3.3)

where conv(·) denotes the convex hull. We observe that Ω(ζ) ⊂ Ω (strict subset) since, for example, the all-zero model is not in Ω(ζ). The following lemma states that Ω(ζ) is compact for any ζ > 0; the proof is given in Appendix B.
Lemma 3.1.
Let f be as defined in (3.1) and Ω(ζ) be as defined in (3.3). For any ζ > 0, Ω(ζ) is compact.

We solve problem (3.1) via an alternating approach, holding all factor matrices constant except one. Consider the problem for the nth factor matrix. We note that there is a scaling ambiguity that allows us to express the same M in different ways, i.e.,

  M = ⟦A^(1), ..., A^(n−1), B^(n), A^(n+1), ..., A^(N)⟧,   (3.4)

where

  B^(n) = A^(n) Λ and Λ = diag(λ).   (3.5)

The weights in (3.4) are omitted because they are absorbed into the nth mode. From [3], we can express M as M_(n) = B^(n) Π^(n), where B^(n) is defined in (3.5) and

  Π^(n) ≡ (A^(N) ⊙ ··· ⊙ A^(n+1) ⊙ A^(n−1) ⊙ ··· ⊙ A^(1))^T.   (3.6)

Thus, we can rewrite the objective function in (3.1) as

  f(M) = e^T [ B^(n) Π^(n) − X_(n) ∗ log(B^(n) Π^(n)) ] e,

where e is the vector of all ones, ∗ denotes the elementwise product, and the log function is applied elementwise. We note that it is convenient to update A^(n) and λ simultaneously since the resulting constraint on B^(n) is simply B^(n) ≥ 0. The subproblem is then f(M) restricted to the nth block, i.e.,

  B^(n) = arg min_{B ≥ 0} f_n(B) ≡ e^T [ BΠ^(n) − X_(n) ∗ log(BΠ^(n)) ] e.   (3.7)

The updates for λ and A^(n) come directly from B^(n). Note that some care must be taken if an entire column of B^(n) is zero; if the rth column is zero, then we can set λ_r = 0 and b_r^(n) to an arbitrary nonnegative vector that sums to one. The full procedure is given in Algorithm 1; this is a variant (because of the handling of λ) of nonlinear Gauss-Seidel. We note that the scaling and unscaling of the factor matrices is common in alternating algorithms, though not always explicit in the algorithm statement.
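To make the subproblem setup concrete, the following sketch forms Π^(n) from (3.6) via the Khatri-Rao product of the remaining factors and evaluates f_n(B) from (3.7). This is an illustrative dense computation (the sparse version is discussed in §5), and the helper names are our own:

```python
import numpy as np

def khatri_rao(A, B):
    # columnwise Kronecker product: column r is kron(A[:, r], B[:, r])
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

# Pi^(n) from (3.6): transpose of the Khatri-Rao product of all factor
# matrices except the n-th, taken in the order A^(N), ..., A^(n+1),
# A^(n-1), ..., A^(1).
def make_pi(factors, n):
    mats = [factors[m] for m in reversed(range(len(factors))) if m != n]
    P = mats[0]
    for M in mats[1:]:
        P = khatri_rao(P, M)
    return P.T

# f_n(B) from (3.7): e^T [ B Pi - X_(n) * log(B Pi) ] e, with the
# 0*log convention handled by summing the log term over nonzeros of X_(n).
def f_n(B, Pi, Xn):
    M = B @ Pi
    nz = Xn != 0
    return M.sum() - np.sum(Xn[nz] * np.log(M[nz]))
```

Note that when every factor matrix is column-stochastic, each row of Π^(n) sums to one, which is used repeatedly in the analysis.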
There are many variations of this basic device; for instance, in the context of the LS version of NTF, [17, Algorithm 2] collects the scaling information into an explicit scaling vector that is "amended" after each inner iteration. We defer the proof of convergence to later in this section. Let

  S_i^(n) = { j | (X_(n))_{ij} > 0 }   (3.8)

denote the set of indices of columns for which the ith row of X_(n) is nonzero. If N = 3, then X_(1)(i, :) corresponds to a vectorization of the ith horizontal slice of X, X_(2)(i, :) to a vectorization of the ith lateral slice, and X_(3)(i, :) to a vectorization of the ith frontal slice.
Algorithm 1
CP-APR Algorithm (Ideal Version)

Let X be a tensor of size I_1 × ··· × I_N. Let M = ⟦λ; A^(1), ..., A^(N)⟧ be an initial guess for an R-component model such that M ∈ Ω(ζ) for some ζ > 0.

repeat
  for n = 1, ..., N do
    Π ← (A^(N) ⊙ ··· ⊙ A^(n+1) ⊙ A^(n−1) ⊙ ··· ⊙ A^(1))^T
    B ← arg min_{B ≥ 0} e^T [ BΠ − X_(n) ∗ log(BΠ) ] e    ▷ subproblem
    λ ← e^T B
    A^(n) ← BΛ^{−1}
  end for
until convergence

More generally, we can think of vectorizing "hyperslices" with respect to each mode.

Assumption 3.2.
The rows of the submatrix Π^(n)(:, S_i^(n)) (i.e., only the columns corresponding to nonzeros in the ith row of X_(n) are considered) are linearly independent for all i = 1, ..., I_n and n = 1, ..., N.

Assumption 3.2 implies that |S_i^(n)| ≥ R for all i. Thus, we need to observe at least R · max_n I_n counts in the data tensor X, and the counts need to be sufficiently distributed across X. Consequently, the conditions appeal to our intuition that there are concrete limits on how sparse the data tensor can be with respect to how many parameters we wish to fit. If, for example, we had X_(1)(i, :) = 0, it is clear that we can remove element i from the first dimension entirely since it contributes nothing. We are making a stronger requirement: each element in each dimension must have at least R nonzeros in its corresponding hyperslice.

A potential problem is that Assumption 3.2 depends on the current iterate, which we cannot predict in advance. However, we observe that if λ > 0, the entries of the factor matrices are drawn uniformly at random from [0, 1], and R ≤ min_n ∏_{m ≠ n} I_m, then this condition is satisfied with probability one.¹ This condition can be checked as the iterates progress.

The matrix

  Φ^(n) ≡ [ X_(n) ⊘ (B^(n) Π^(n)) ] (Π^(n))^T,   (3.9)

with ⊘ denoting elementwise division, will come up repeatedly in the remainder of the paper. For instance, we observe that the partial derivative of f with respect to A^(n) is ∂f/∂A^(n) = (E − Φ^(n))Λ, where E is the matrix of all ones. Consequently, the matrix Φ^(n) plays a role in checking convergence as follows.

Theorem 3.3. If λ > 0 and M = ⟦λ; A^(1), ..., A^(N)⟧ ∈ Ω(ζ) for some ζ > 0, then M is a Karush-Kuhn-Tucker (KKT) point of (3.1) if and only if

  min(A^(n), E − Φ^(n)) = 0 for n = 1, ..., N.   (3.10)
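The stopping test suggested by Theorem 3.3 can be sketched as follows: Φ^(n) is computed from (3.9) and the residual |min(A^(n), E − Φ^(n))| is measured (variable names are illustrative):

```python
import numpy as np

# KKT residual from Theorem 3.3: with B = A diag(lambda) and the model
# matricized as M = B Pi, compute Phi = [X_(n) / M] Pi^T (eq. (3.9))
# and return max |min(A, E - Phi)|, which is zero at a KKT point.
def kkt_residual(A, B, Pi, Xn):
    M = B @ Pi                    # current model estimate, matricized
    Phi = (Xn / M) @ Pi.T         # assumes M > 0 wherever needed
    return np.abs(np.minimum(A, 1.0 - Phi)).max()
```

For example, when the model reproduces the data exactly and the rows of Π sum to one, Φ is the all-ones matrix and the residual vanishes.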
¹We can actually appeal to a weaker assumption; if the entries are drawn from any distribution that is absolutely continuous with respect to the Lebesgue measure on [0, 1], then the condition is satisfied with probability one.

Proof. Since λ > 0, we can assume that λ has been absorbed into A^(m) for some m. Thus, we can replace the constraints λ ∈ Ω_λ and A^(m) ∈ Ω_m with B^(m) ≥ 0. In this case, the partial derivatives are

  ∂f/∂B^(m) = E − Φ^(m) and ∂f/∂A^(n) = (E − Φ^(n))Λ for n ≠ m.   (3.11)

Since M ∈ Ω(ζ) for some ζ > 0, we know that not all elements of M are zero; thus, the set of active constraints is linearly independent. The following conditions define a KKT point [40]:

  E − Φ^(m) − Υ^(m) = 0,
  (E − Φ^(n))Λ − Υ^(n) − e(η^(n))^T = 0 and e^T A^(n) = e^T for n ≠ m,
  A^(n) ≥ 0, Υ^(n) ≥ 0, Υ^(n) ∗ A^(n) = 0 for n ≠ m,
  B^(m) ≥ 0, Υ^(m) ≥ 0, Υ^(m) ∗ B^(m) = 0.   (3.12)

Here the Υ^(n) are the Lagrange multipliers for the nonnegativity constraints and the η^(n) are the Lagrange multipliers for the stochasticity constraints.

If M = ⟦λ; A^(1), ..., A^(N)⟧ is a KKT point, then from (3.12), we have that Υ^(m) = E − Φ^(m) ≥ 0, B^(m) ≥ 0, and Υ^(m) ∗ B^(m) = 0. Thus, min(A^(m)Λ, E − Φ^(m)) = 0. Since λ > 0 and m is arbitrary, (3.10) follows immediately.

If, on the other hand, (3.10) is satisfied, choosing Υ^(m) = E − Φ^(m), Υ^(n) = (E − Φ^(n))Λ, and η^(n) = 0 for n ≠ m satisfies the KKT conditions in (3.12). Hence, M must be a KKT point. ∎

Observe that the condition λ > 0 renders Λ moot in the KKT conditions; this reflects the scaling ambiguity that is inherent in the model.

From Theorem 3.3 and because feasibility is always maintained, we can check for convergence by verifying |min(A^(n), E − Φ^(n))| ≤ τ for n = 1, ..., N, where τ > 0 is a small user-specified tolerance.

We require the strict convexity of f in each of the block coordinates. This is ensured under Assumption 3.2.

Lemma 3.4 (
Strict convexity of subproblem). Let f_n(·) be the function f restricted to the nth block as defined in (3.7). If Assumption 3.2 is satisfied, then f_n(B) is strictly convex over B_n = { B ∈ [0, +∞)^{I_n × R} : BΠ^(n) ≠ 0 }.

Proof. In the proof, we drop the n's for convenience. First note that B_n is convex. Let C = B^T. We can rewrite (3.7) as min f(C^T) = Σ_{ij} c_i^T π_j − x_{ij} log(c_i^T π_j) subject to C ≥ 0. Hence, it is sufficient to show that the function f̂(C) = −Σ_{ij} x_{ij} log(c_i^T π_j) is strictly convex over the convex set C = { C ∈ [0, +∞)^{R × I_n} : C^T Π ≠ 0 }. Fix C̄, Ĉ ∈ C such that C̄ ≠ Ĉ. Since the inner product is affine and log is a strictly concave function, we need only show that there exists some i and j such that x_{ij} ≠ 0 and ĉ_i^T π_j ≠ c̄_i^T π_j. We know at least one column must differ since C̄ ≠ Ĉ; let i correspond to that column and define d = ĉ_i − c̄_i ≠ 0. By Assumption 3.2, we know that Π(:, S_i) has full row rank. Thus, there exists a column j of Π such that x_{ij} ≠ 0 and d^T π_j ≠ 0. Hence, the claim. ∎

Here we state our main convergence result. Although this result assumes that the subproblems can be solved exactly (which is not the case in practice), it gives some idea as to the convergence behavior of the method. We follow the reasoning of the proof of convergence of nonlinear Gauss-Seidel [5, Proposition 3.9], adapted here for the way that λ is handled.

Theorem 3.5 (
Convergence of CP-APR). Suppose that f(M) is strictly convex with respect to each block component and that it is minimized exactly for each block component subproblem of CP-APR. Let M* be a limit point of the sequence {M_k} such that λ* > 0. Then M* is a KKT point of (3.1).

Proof. Let M_k = ⟦λ_k; A_k^(1), ..., A_k^(N)⟧ be the kth iterate produced by the outer iterations of Algorithm 1. Define Z_k^(n) to be the nth iterate in the inner loop of outer iteration k with the λ-vector absorbed into the nth factor, i.e.,

  Z_k^(n) = ⟦A_{k+1}^(1), ..., A_{k+1}^(n−1), B_{k+1}^(n), A_k^(n+1), ..., A_k^(N)⟧,

where B_{k+1}^(n) is the solution to the nth subproblem at iteration k. This defines A_{k+1}^(n) to be the column-normalized version of B_{k+1}^(n), i.e., A_{k+1}^(n) = B_{k+1}^(n) (diag(e^T B_{k+1}^(n)))^{−1}. Taking advantage of the scaling ambiguity to shift the weights between factors yields

  f(Z_k^(n)) = f(⟦A_{k+1}^(1), ..., A_{k+1}^(n−1), A_{k+1}^(n) diag(e^T B_{k+1}^(n)), A_k^(n+1), ..., A_k^(N)⟧)
             = f(⟦A_{k+1}^(1), ..., A_{k+1}^(n−1), A_{k+1}^(n), A_k^(n+1) diag(e^T B_{k+1}^(n)), ..., A_k^(N)⟧)
             ≥ f(⟦A_{k+1}^(1), ..., A_{k+1}^(n−1), A_{k+1}^(n), B_{k+1}^(n+1), ..., A_k^(N)⟧) = f(Z_k^(n+1)).

Observe that Z_k^(N) = ⟦A_{k+1}^(1), ..., A_{k+1}^(N−1), A_{k+1}^(N) diag(λ_{k+1})⟧, so there is a correspondence between Z_k^(N) and M_{k+1} such that f(Z_k^(N)) = f(M_{k+1}). For convenience, we define Z_k^(0) = ⟦A_k^(1) diag(λ_k), A_k^(2), ..., A_k^(N)⟧. Since we assume the subproblem is solved exactly at each iteration, we have

  f(M_k) ≥ f(Z_k^(1)) ≥ f(Z_k^(2)) ≥ ··· ≥ f(Z_k^(N−1)) ≥ f(M_{k+1}) for all k.   (3.13)

Recall that Ω(ζ) is compact by Lemma 3.1. Since the sequence {M_k} is contained in the set Ω(ζ), it must have a convergent subsequence. We let {k_ℓ} denote the indices of that convergent subsequence and M* = ⟦λ*; A*^(1), ..., A*^(N)⟧ denote its limit point. By continuity of f, f(M_{k_ℓ}) → f(M*).

We first show that ‖A_{k_ℓ+1}^(1) − A_{k_ℓ}^(1)‖ → 0. Assume the contrary, i.e., that it does not converge to zero. Let γ_{k_ℓ} = ‖Z_{k_ℓ}^(1) − Z_{k_ℓ}^(0)‖. By possibly restricting to a subsequence of {k_ℓ}, we may assume there exists some γ̄ > 0 such that γ_{k_ℓ} ≥ γ̄ for all ℓ. Let S_{k_ℓ}^(1) = (Z_{k_ℓ}^(1) − Z_{k_ℓ}^(0))/γ_{k_ℓ}; then Z_{k_ℓ}^(1) = Z_{k_ℓ}^(0) + γ_{k_ℓ} S_{k_ℓ}^(1), ‖S_{k_ℓ}^(1)‖ = 1, and S_{k_ℓ}^(1) differs from zero only along the first block component. Notice that {S_{k_ℓ}^(1)} belongs to a compact set and therefore has a limit point S*^(1). By restricting to a further subsequence of {k_ℓ}, we may assume that S_{k_ℓ}^(1) → S*^(1).

Let us fix some ε ∈ [0, 1]. Notice that 0 ≤ εγ̄ ≤ γ_{k_ℓ}. Therefore, Z_{k_ℓ}^(0) + εγ̄ S_{k_ℓ}^(1) lies on the line segment joining Z_{k_ℓ}^(0) and Z_{k_ℓ}^(0) + γ_{k_ℓ} S_{k_ℓ}^(1) = Z_{k_ℓ}^(1) and belongs to Ω(ζ) because Ω(ζ) is convex. Using the convexity of f w.r.t. the first block component and the fact that Z_{k_ℓ}^(1) minimizes f over all Z that differ from Z_{k_ℓ}^(0) in the first block component, we obtain

  f(Z_{k_ℓ}^(1)) = f(Z_{k_ℓ}^(0) + γ_{k_ℓ} S_{k_ℓ}^(1)) ≤ f(Z_{k_ℓ}^(0) + εγ̄ S_{k_ℓ}^(1)) ≤ f(Z_{k_ℓ}^(0)).

Since f(Z_{k_ℓ}^(0)) = f(M_{k_ℓ}) → f(M*), equation (3.13) shows that f(Z_{k_ℓ}^(1)) also converges to f(M*). Taking limits as ℓ tends to infinity, we obtain

  f(M*) ≤ f(Z*^(0) + εγ̄ S*^(1)) ≤ f(M*),

where Z*^(0) is just M* with λ* absorbed into the first component. We conclude that f(M*) = f(Z*^(0) + εγ̄ S*^(1)) for every ε ∈ [0, 1]. Since γ̄ S*^(1) ≠ 0, this contradicts the strict convexity of f as a function of the first block component. This contradiction establishes that ‖A_{k_ℓ+1}^(1) − A_{k_ℓ}^(1)‖ → 0. In particular, Z_{k_ℓ}^(1) converges to Z*^(0).

By definition of Z_{k_ℓ}^(1) and the assumption that each subproblem is solved exactly, we have

  f(Z_{k_ℓ}^(1)) ≤ f(⟦B, A_{k_ℓ}^(2), ..., A_{k_ℓ}^(N)⟧) for all B ≥ 0.

Taking limits as ℓ → ∞, we obtain

  f(M*) ≤ f(⟦B, A*^(2), ..., A*^(N)⟧) for all B ≥ 0.

In other words, B*^(1) = A*^(1) diag(λ*) is the minimizer of f with respect to the first block component with the remaining components fixed at A*^(2) through A*^(N). From the KKT conditions [40], we have that

  B*^(1) ≥ 0,  ∂f/∂B^(1)(B*^(1)) ≥ 0,  B*^(1) ∗ ∂f/∂B^(1)(B*^(1)) = 0.

In turn, since λ* > 0, we have min(A*^(1), E − Φ*^(1)) = 0. Repeating the previous argument shows that ‖A_{k_ℓ+1}^(2) − A_{k_ℓ}^(2)‖ → 0 and min(A*^(2), E − Φ*^(2)) = 0. Continuing inductively, min(A*^(n), E − Φ*^(n)) = 0 for n = 1, ..., N. Thus, by Theorem 3.3, M* is a KKT point of f(M). ∎

Before proceeding to the discussion of solving the subproblem, we point out that remarkably little is assumed about the objective function f in Theorem 3.5. The proof required only that f is differentiable, strictly convex in each of its block components, and that there is a ξ > 0 such that Ω(ξ) is compact. The upshot is that Theorem 3.5 applies equally well to other choices of f corresponding to other members in the family of beta divergences that are convex, namely the divergences that correspond to β ∈ [1, 2] [14]. In fact, it was also observed in [17] that "rescaling does not interfere with the convergence of the Gauss-Seidel iterations" (in the context of the LS formulation of NTF).
4. Solving the CP-APR Subproblem via Majorization-Minimization.
The basic idea of a majorization-minimization (MM) algorithm is to convert a hard optimization problem (e.g., non-convex and/or non-differentiable) into a series of simpler ones (e.g., smooth convex) that are easy to minimize and that majorize the original function, as follows.
Definition 4.1.
Let f and g be real-valued functions on R^n and R^n × R^n, respectively. We say that g majorizes f at x ∈ R^n if g(y, x) ≥ f(y) for all y ∈ R^n and g(x, x) = f(x).

If f(x) is the function to be optimized and g(·, x) majorizes f at x, the basic MM iteration is x_{k+1} = arg min_y g(y, x_k). It is easy to see that such iterates always take non-increasing steps with respect to f since f(x_{k+1}) ≤ g(x_{k+1}, x_k) ≤ g(x_k, x_k) = f(x_k), where x_k is the current iterate and x_{k+1} is the optimum computed at that iterate.
Algorithm 2
Subproblem Solver for Algorithm 1

B ← A^(n) Λ
repeat    ▷ subproblem loop
  Φ ← (X_(n) ⊘ (BΠ)) Π^T
  B ← B ∗ Φ
until convergence

Consider the nth subproblem in (3.7). Here we drop the n's for convenience so that (3.7) reduces to

  min_{B ≥ 0} f(B) ≡ e^T [ BΠ − X ∗ log(BΠ) ] e.   (4.1)

Recall that X is the nonnegative data tensor reshaped to a matrix of size I × J, Π is a nonnegative matrix of size R × J with rows that sum to 1, and B is a nonnegative matrix of size I × R. For clarity in the ensuing discussion, we also restate Assumption 3.2 in terms of the local variables for this section as follows:

Assumption 4.2.
The rows of the submatrix Π(:, { j | x_{ij} > 0 }) (i.e., only the columns corresponding to nonzeros in the ith row of X are considered) are linearly independent for all i = 1, ..., I.

According to Assumption 4.2, for every i there is at least one j such that x_{ij} > 0. Thus, we can assume that we have B̄ ≥ 0 such that f(B̄) is finite. We now introduce the majorization used in our subproblem solver. This majorization is also a special case of the one derived in [14] when β = 1 and has a long history in image reconstruction that predates its use in NMF [36, 45, 28]. The objective f is majorized at B̄ by the function

  g(B, B̄) = Σ_{rij} [ b_{ir} π_{rj} − α_{rij} x_{ij} log(b_{ir} π_{rj} / α_{rij}) ], where α_{rij} = b̄_{ir} π_{rj} / Σ_{r'} b̄_{ir'} π_{r'j}.   (4.2)

The proof of this fact is straightforward and thus relegated to Appendix C. The advantage of this majorization is that the problem is now completely separable in terms of b_{ir}, i.e., the individual entries of B. Moreover, g(·, B̄) has a unique global minimum with an analytic expression, given by B̄ ∗ Φ, where Φ is as defined in (3.9) and depends on B̄. A proof is provided in Appendix C. The MM algorithm iterations are then defined by

  B_{k+1} = ψ(B_k) ≡ B_k ∗ Φ(B_k), where Φ(B_k) = [ X ⊘ (B_k Π) ] Π^T,   (4.3)

and X and Π come from (4.1). If B_0 ≥ 0, clearly B_k ≥ 0 for all k. Observe that ∇f(B) = E − Φ(B). We discuss in §5 how a suitable starting point B_0 is obtained in practice.
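A compact sketch of Algorithm 2 using the MM update (4.3); the stopping test follows the subproblem check used in the detailed implementation of §5, and the tolerance and eps values here are illustrative:

```python
import numpy as np

# MM solver for the subproblem (4.1): iterate the multiplicative update
# B <- B * Phi(B) with Phi(B) = [X / (B Pi)] Pi^T, stopping when the
# KKT residual |min(B, E - Phi)| drops below tol. The eps floor guards
# against divide-by-zero, as in Algorithm 3.
def solve_subproblem(B, Pi, X, tol=1e-8, max_iter=500, eps=1e-10):
    for _ in range(max_iter):
        Phi = (X / np.maximum(B @ Pi, eps)) @ Pi.T
        if np.abs(np.minimum(B, 1.0 - Phi)).max() < tol:
            break
        B = B * Phi
    return B
```

With only one pass through the loop this is exactly one Lee-Seung-style multiplicative step, which is the connection to the special case discussed in §5.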
Convergence of MM algorithm). Let f be as defined in (4.1) and assume Assumption 4.2 holds. Let B_0 be a nonnegative matrix such that f(B_0) is finite and (B_0)_{ir} > 0 for all (i, r) such that (Φ(B*))_{ir} > 1, where B* is the global minimizer of f, and let the sequence {B_k} be defined as in (4.3). Then {B_k} converges to B*.

Note that we make a modest but very useful generalization of existing results by allowing iterates to be on (or very close to) the boundary. Prior convergence results, including [26, 15, 51], assume that all iterates are strictly positive. Though true in exact arithmetic, in numerical computations it is not uncommon for some iterates to become zero numerically. In §5 we discuss how the condition on B_0 holds in practice.
5. CP-APR Implementation Details.
The previous algorithms omit many details and numerical checks that are needed in any practical implementation. Thus, Algorithm 3 provides a detailed version that can be directly implemented. A highlight of this implementation is the "inadmissible zero" avoidance, which fixes the problem of getting stuck at a zero value with multiplicative updates.
If we take only one iteration of the subproblem loop (i.e., setting ℓ_max = 1), then CP-APR is the Lee-Seung multiplicative update algorithm for the KL divergence. Thus, we can view the Lee-Seung algorithm as a special case of our algorithm where we do not solve the subproblems exactly; quite the contrary, we take only one step towards the subproblem solution.
A well-known problem with multiplicative updates is that some elements may get "stuck" at zero; see, e.g., [20]. For example, if a_{ir}^(n) = 0, then the multiplicative updates will never change it. In many cases, a zero entry may be the correct answer, so we want to allow it. In other cases, though, the zero entry may be incorrect in the sense that it does not satisfy the KKT conditions, i.e., a_{ir}^(n) = 0 but 1 − Φ_{ir}^(n) < 0. We refer to these values as inadmissible zeros. We correct this problem before we enter the multiplicative update phase of the algorithm. In lines 4-5 of Algorithm 3, any inadmissible zeros (or near-zeros) are "scooched" away from zero and into the interior. The amount of the scooch is controlled by the user-defined parameter κ. The condition in Theorem 4.3 is exactly that the starting point should not have any zeros that are ultimately inadmissible. If we discover that a sequence of iterates leads to an inadmissible zero (or almost-zero), we restart the method with a new starting point. This adjustment prevents convergence to non-KKT points. Note that all the quantities needed to perform the check are precomputed and that there is no change to the algorithm besides adjusting a few zero entries in the current factor matrix. The fix for the inadmissible zeros is compatible with the Lee-Seung algorithm for LS error as well.

Lin [32] has made a similar observation in the LS case and applied changes to his gradient descent version of the Lee-Seung method. Our correction is different and is directly incorporated into the multiplicative update scheme rather than requiring a different update formula. Gillis and Glineur [18] proposed a more drastic fix by restricting the factor matrices to have entries in [ε, ∞) for some small positive ε. Avoiding all zeros clearly rules out the possibility of getting stuck at an inadmissible zero, but does so at the expense of eliminating any hope of obtaining sparse factor matrices, a desirable property in many applications.
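The scooch step in lines 4-5 of Algorithm 3 can be sketched as follows, with κ and κ_tol playing the roles of the corresponding user-defined parameters (the default values here are illustrative):

```python
import numpy as np

# "Scooch" inadmissible zeros away from the boundary: an entry is
# inadmissible if it is (near) zero while the KKT condition is
# violated, i.e., 1 - Phi(i, r) < 0. Such entries are nudged into the
# interior by kappa; all other entries are left unchanged.
def scooch(A, Phi, kappa=1e-2, kappa_tol=1e-10):
    inadmissible = (A < kappa_tol) & (Phi > 1.0)
    return A + kappa * inadmissible
```

Because the update only perturbs a few near-zero entries, it leaves admissible zeros alone and so does not destroy sparsity in the factors.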
Algorithm 3: Detailed CP-APR Algorithm

Let $\mathbf{X}$ be a tensor of size $I_1 \times \cdots \times I_N$. Let $\mathbf{M} = \langle \lambda; A^{(1)}, \ldots, A^{(N)} \rangle$ be an initial guess for an $R$-component model such that $\mathbf{M} \in \Omega(\zeta)$ for some $\zeta > 0$, and let $\Lambda = \operatorname{diag}(\lambda)$. The user-defined parameters are:
• $k_{\max}$ = maximum number of outer iterations
• $\ell_{\max}$ = maximum number of inner iterations (per outer iteration)
• $\tau$ = convergence tolerance on the KKT conditions (e.g., $10^{-4}$)
• $\kappa$ = inadmissible zero avoidance adjustment (e.g., $0.01$)
• $\kappa_{\text{tol}}$ = tolerance for identifying a potential inadmissible zero (e.g., $10^{-10}$)
• $\epsilon$ = minimum divisor to prevent divide-by-zero (e.g., $10^{-10}$)

for $k = 1, 2, \ldots, k_{\max}$ do
    isConverged $\gets$ true
    for $n = 1, \ldots, N$ do
        $S(i,r) \gets \kappa$ if $k > 1$, $A^{(n)}(i,r) < \kappa_{\text{tol}}$, and $\Phi^{(n)}(i,r) > 1$; otherwise $S(i,r) \gets 0$
        $B \gets (A^{(n)} + S)\,\Lambda$
        $\Pi \gets \big(A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)}\big)^T$
        for $\ell = 1, 2, \ldots, \ell_{\max}$ do    ▷ subproblem loop
            $\Phi^{(n)} \gets \big(X_{(n)} \oslash \max(B\Pi, \epsilon)\big)\,\Pi^T$
            if $|\min(B, E - \Phi^{(n)})| < \tau$ then break end if
            isConverged $\gets$ false
            $B \gets B \ast \Phi^{(n)}$
        end for
        $\lambda \gets e^T B$, $A^{(n)} \gets B\Lambda^{-1}$
    end for
    if isConverged = true then break end if
end for

zero, but does so at the expense of eliminating any hope of obtaining sparse factor matrices, a desirable property in many applications. The convergence conditions on the subproblem require that $\min(B, E - \Phi^{(n)}) = 0$. We do not require the value to be exactly zero but instead check that it is smaller in magnitude than the user-defined parameter $\tau$. We break out of the subproblem loop as soon as this condition is satisfied.

From Theorem 3.3, we can check for overall convergence by verifying (3.10).
We do not want to calculate this at the end of every $n$-loop because it is expensive. Instead, we know that the iterates will stop changing once we have converged, and so we can validate the convergence of all factor matrices by checking that no factor matrix has been modified and every subproblem has converged.

Consider a large-scale sparse tensor that is too large to be stored as a dense tensor requiring $\prod_n I_n$ memory. In this case, we can store the tensor as a sparse tensor as described in [3], requiring only $(N+1) \cdot \operatorname{nnz}(\mathbf{X})$ memory.

The elementwise division in the update of $\Phi$ requires that we divide the tensor (in matricized form) $X_{(n)}$ by the current model estimate (in matricized form) $M_{(n)} = B\Pi$. Unfortunately, we cannot afford to store $\mathbf{M}$ explicitly as a dense tensor because it is the same size as $\mathbf{X}$. In fact, we generally cannot even form $\Pi$ explicitly because it requires almost as much storage as $\mathbf{M}$. We observe, however, that we need only calculate the values of $\mathbf{M}$ that correspond to nonzeros in $\mathbf{X}$.

Let $P = \operatorname{nnz}(\mathbf{X})$. Then we can store the sparse tensor $\mathbf{X}$ as a set of values and multi-indices, $(v^{(p)}, \mathbf{i}^{(p)})$ for $p = 1, \ldots, P$. In order to avoid forming the current model estimate, $\mathbf{M}$, as a dense object, we will store only selected rows of $\Pi$, one per nonzero in $\mathbf{X}$; we denote these rows by $\mathbf{w}^{(p)}$ for $p = 1, \ldots, P$. The $p$th vector is given by the elementwise product of rows of the factor matrices, i.e.,
$$\mathbf{w}^{(p)} = A^{(1)}(i^{(p)}_1, :) \ast \cdots \ast A^{(n-1)}(i^{(p)}_{n-1}, :) \ast A^{(n+1)}(i^{(p)}_{n+1}, :) \ast \cdots \ast A^{(N)}(i^{(p)}_N, :).$$
In order to determine $\hat{\mathbf{X}} = \mathbf{X} \oslash \mathbf{M}$ in the calculation of $\Phi$, we proceed as follows. The tensor $\hat{\mathbf{X}}$ will have the same nonzero pattern as $\mathbf{X}$, and we let $\hat{v}^{(p)}$ denote its values. It can be determined that
$$\hat{v}^{(p)} = v^{(p)} \big/ \left\langle \mathbf{w}^{(p)}, B(i^{(p)}_n, :) \right\rangle.$$
To calculate $\Phi = \hat{X}_{(n)} \Pi^T$, we simply have
$$\Phi(i', r) = \sum_{p \,:\, i^{(p)}_n = i'} \hat{v}^{(p)} \, w^{(p)}(r).$$
The storage of the vectors $\mathbf{w}^{(p)}$ and the entries $\hat{v}^{(p)}$ for $p = 1, \ldots, P$ requires $(R+1)P$ additional storage.
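A minimal NumPy sketch of this sparse scheme (our own illustrative names, with 0-based indices, not the Tensor Toolbox implementation) computes the $\mathbf{w}^{(p)}$ rows and the values $\hat{v}^{(p)}$, and then accumulates $\Phi$ with a scatter-add:

```python
import numpy as np

def sparse_phi(vals, idx, factors, B, n):
    """Compute Phi for mode n touching only the P nonzeros of X.
    vals: (P,) nonzero values v^(p); idx: (P, N) multi-indices i^(p);
    factors: list of factor matrices A^(m) of shape (I_m, R);
    B: the (I_n, R) matrix A^(n) * diag(lambda) for the current mode."""
    P, N = idx.shape
    R = B.shape[1]
    # w^(p): elementwise product of the relevant rows of all factors but n
    W = np.ones((P, R))
    for m in range(N):
        if m != n:
            W *= factors[m][idx[:, m], :]
    # model values at the nonzeros and the quotient values v-hat
    m_vals = np.einsum('pr,pr->p', W, B[idx[:, n], :])
    v_hat = vals / m_vals
    # Phi(i', r) = sum over nonzeros p with i_n^(p) = i' of v_hat[p] * W[p, r]
    Phi = np.zeros_like(B)
    np.add.at(Phi, idx[:, n], v_hat[:, None] * W)
    return Phi
```

Only the $(R+1)P$ numbers in `W` and `v_hat` are ever materialized, matching the storage bound above; neither the dense model nor the full $\Pi$ is formed.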
6. Numerical Results for CP-APR.

6.1. Comparison of Objective Functions for Sparse Count Data.
We contend that, for sparse count data, KL divergence (1.2) is a better objective function. To support our claim, we consider simulated data where we know the correct answer. Specifically, we consider a 3-way tensor ($N = 3$) of size $1000 \times 800 \times 600$ and $R = 10$ factors. It is generated from a model $\mathbf{M} = \llbracket \lambda; A^{(1)}, \ldots, A^{(N)} \rrbracket$. The entries of the vector $\lambda$ are selected uniformly at random from $[0, 1]$. Each factor matrix $A^{(n)}$ is generated as follows: for each column of $A^{(n)}$, we randomly select 10% (i.e., $1/R$) of the entries and draw their values uniformly at random from the interval $[0, 1]$, leaving the remaining entries zero; each column is then normalized to sum to one. The data tensor is generated by tossing $\nu \ll \prod I_n$ balls into $\prod I_n$ empty urns, where each entry of the tensor corresponds to an urn. For each ball, we first draw a factor $r$ with probability $\lambda_r / \sum_r \lambda_r$. The indices $(i, j, k)$ are then selected randomly in proportion to $a^{(n)}_r$ for $n = 1, 2, 3$. In other words, the ball is tossed into the $(i,j,k)$th urn with probability $a^{(1)}_{ir} a^{(2)}_{jr} a^{(3)}_{kr}$. In this manner, the balls are allocated across the urns independently of each other. This procedure generates entries $x_i$ that are each distributed as Poisson($m_i$). We adjust the final $\lambda$ so that the scale matches that of
Table 6.1: Factor match scores (FMS) for methods optimizing least squares (Lee-Seung LS, CP-ALS) and KL divergence (Lee-Seung KL, CP-APR), as the number of observations varies.

$\mathbf{X}$, i.e., $\lambda \gets \nu \lambda / \|\lambda\|_1$. We generate problems where the number of observations ranges from 480,000 (0.1%) down to 24,000 (0.005%). Recall that Assumption 3.2 implies that the absolute minimum number of observations is $R \cdot \max_n I_n = 10{,}000$.
We report the factor match score (FMS), a measure in the range $[0, 1]$ of how close the computed solution is to the true solution. A value of 1 is ideal. Since the FMS measure is somewhat abstract, we also report the number of columns in the first factor matrix such that the cosine of the angle between the true solution and the computed solution is greater than 0.95. A value of 10 is ideal since we have used $R = 10$. The reported values are averages over 10 problems. See Appendix E for precise formulas for both measures. Although these problems are extremely sparse, all methods are able to correctly identify components in the data. Overall, the methods optimizing KL divergence are superior to those optimizing least squares. We also observe that CP-APR is an improvement compared to Lee-Seung KL; we provide later evidence that this improvement is more likely due to the inadmissible zero fix than to the extra inner iterations (which provide a benefit of enhanced speed rather than accuracy).

We demonstrate the effectiveness of our simple fix for avoiding inadmissible zeros, as described in §4.
We consider a dense positive matrix with 15 columns, with entries drawn independently and uniformly from $[0, 1]$. We set $\ell_{\max} = 1$, $\epsilon = 0$, and $\kappa_{\text{tol}} = 100 \cdot \epsilon_{\text{mach}}$, with a tight convergence tolerance $\tau$. We do two runs: one with $\kappa = 0$, corresponding to the standard Lee-Seung (KL version) algorithm, and the other with $\kappa = 10^{-10}$ to move away from inadmissible zeros. In both runs we use the same strictly positive initial guess. Figure 6.1 shows the magnitude of the KKT residual over the iterations. When $\kappa > 0$, the sequence clearly converges. On the other hand, when $\kappa = 0$ the iterates appear to get stuck at a non-KKT point. Closer inspection of the factor matrix iterates reveals a single offending inadmissible zero, i.e., an entry that is zero but whose partial derivative is strictly negative.

Fig. 6.1: Infinity norm of the KKT residual versus iterations for Lee-Seung permitting inadmissible zeros ($\kappa = 0$, blue solid line) and avoiding inadmissible zeros ($\kappa = 10^{-10}$, red dashed line).
Table 6.2: CP-APR with different values of $\ell_{\max}$ for sparse count data over 100 trials: (a) factor match scores; (b) number of multiplicative updates; (c) time (seconds).

Hence, we use positive values of $\kappa$ in our experiments.

We show in Table 6.2 that increasing the maximum number of inner iterations $\ell_{\max}$ can accelerate convergence. Recall that $\ell_{\max} = 1$ corresponds to the Lee-Seung algorithm [29, 50]. We consider a 3-way tensor ($N = 3$) of size $500 \times 400 \times 300$ and $R = 5$ factors. We generate 100 problem instances from 100 randomly generated models $\mathbf{M} = \llbracket \lambda; A^{(1)}, \ldots, A^{(N)} \rrbracket$ as described in §6.1. We run CP-APR with values of $\ell_{\max}$ ranging from 1 to 10, with the other parameters set as $\tau = 10^{-4}$, $\kappa = 10^{-2}$, $\kappa_{\text{tol}} = 100 \cdot \epsilon_{\text{mach}}$, and $\epsilon = 0$. We track both the number of multiplicative updates (line 8 of Algorithm 3) and the CPU time using the MATLAB command cputime. The experiments were performed on an iMac computer with a 3.4 GHz Intel Core i7 processor and 8 GB of RAM. Table 6.2a reports the FMS scores as we vary $\ell_{\max}$, and we observe that the value of $\ell_{\max}$ does not significantly impact accuracy. However, increasing $\ell_{\max}$ can decrease the overall work and runtime. Tables 6.2b and 6.2c present the average number of multiplicative updates and total run times, respectively. The distribution of updates and times was highly skewed, as some problems required a substantial number of iterations. Nonetheless, we see a monotonic decrease in the number of updates and time as $\ell_{\max}$ increases. The differences are more substantial when comparing wall-clock time. The reason for the disproportionate decrease in wall-clock time compared to the tally of updates is that the cost of the calculation of $\Pi$ (line 6 of Algorithm 3) is amortized over all the subproblem iterations.
We consider the application of CP-APR to email data from the infamous Federal Energy Regulatory Commission (FERC) investigation of Enron Corporation. We use the version of the dataset prepared by Zhou et al. [54] and further processed by Perry and Wolfe [43], which includes detailed profiles of the employees. The data is arranged as a three-way tensor $\mathbf{X}$ of sender × receiver × month, where entry $(i, j, k)$ indicates the number of messages from employee $i$ to employee $j$ in month $k$. The original data set had 38,388 messages (technically, there were only 21,635 messages, but some messages were sent to multiple recipients and so are counted multiple times) exchanged between 156 employees over 44 months (November 1998 - June 2002). We preprocessed the data, removing months that had fewer than 300 messages and removing any employees that did not send and receive an average of at least one message per month. Ultimately, our data set spanned 28 months (December 1999 - March 2002), involved 105 employees, and comprised a total of 33,079 messages. The data is arranged so that the senders are sorted by frequency (greatest to least). The tensor representation has a total of 8,540 nonzeros (many of the messages occur between the same sender/receiver pair in the same time period). The tensor is 2.7% dense.

We apply CP-APR to find a model for the data. There is no ideal method for choosing the number of components. Typically, this value is selected through trial and error, trading off accuracy (as the number of components grows) against model simplicity. Here we show results for $R = 10$ components. We use the same settings for CP-APR as specified in Appendix E.

Figure 6.2 illustrates six components in the resulting factorization; the other four are shown in Appendix F. For each component, the top two plots show the activity of senders and receivers, with the employees ordered from left to right by frequency of sending emails.
Each employee has a symbol indicating their seniority (junior or senior), gender (male or female), and department (legal, trading, other). The sender and receiver factors have been normalized to sum to one, so the height of the marker indicates each employee's relative activity within the component. The third component (in the time dimension) is scaled so that it indicates the total message volume explained by that component. The light gray line shows the total message volume. It is interesting to observe how the components break down into specific subgroups. For instance, component 1 in Figure 6.2a consists of nearly all "legal" employees and is majority female. This can be contrasted with component 5 in Figure 6.2d, which is nearly all "other" and also majority female. Component 3 in Figure 6.2b is a conversation among "senior" staff and mostly male; on the other hand, "junior" staff are more prominent in component 4 in Figure 6.2c. Component 8 in Figure 6.2e seems to be a conversation among "senior" staff after the SEC investigation began. Component 10 in Figure 6.2f indicates that a couple of "legal" staff are communicating with many "other" staff immediately after the SEC investigation is announced, perhaps advising the "other" staff on appropriate responses to investigators.

As another example, we consider five years (1999-2004) of SIAM publication metadata that has previously been used by Dunlavy et al. [12]. Here, we build a three-way sparse tensor based on title terms (ignoring common stop words), authors, and journals. The author names have been normalized to last name plus initial(s). The resulting tensor has 4,952 terms in its first mode. The highest counts are associated with SIAM Review; in fact, the next 4 highest counts correspond to the terms 'problems', 'review', 'survey', and

Fig. 6.2: Components from factorizing the Enron data: (a) component 1; (b) component 3; (c) component 4; (d) component 5; (e) component 8; (f) component 10.
Table 6.3: Highest-scoring items in a 10-component factorization of the term × author × journal tensor from five years of SIAM publication data.

'techniques', and to authors 'Flaherty J' and 'Trefethen N'.

Computing a ten-component factorization yields the results shown in Table 6.3. We use the same settings for CP-APR as specified in Appendix E. In the table, for the term and author modes, we list any entry whose factor score is greater than a small threshold proportional to $I_n$, where $I_n$ is the size of the $n$th mode; in the journal mode, we list any entry greater than 0.01. The 10th component corresponds to introductions written by section editors for SIAM Review. The 1st component shows that there is overlap in both authors and title keywords between SIAM J. Computing and SIAM J. Discrete Math. The 2nd and 3rd components have some overlap in topic and two overlapping authors, but different journals. Both components 8 and 9 correspond to
7. Conclusions & Future Work.
We have developed an alternating Poisson regression fitting algorithm, CP-APR, for PTF of sparse count data. When such data is generated via a Poisson process, we show that methods based on KL divergence such as CP-APR recover the true CP model more reliably than methods based on LS. Indeed, in classical statistics, it is well known that the randomness observed in sparse count data is better explained and analyzed by the Poisson model (KL divergence) than by a Gaussian one (LS error).

Our algorithm can be considered an extension of the Lee-Seung method for KL divergence with multiple inner iterations (similar to [19] for LS). Allowing for multiple inner iterations has the benefit of accelerating convergence. Moreover, being very similar to an existing method, CP-APR is simple to implement, with the exception of some details of the sparse implementation as described in §5. A natural direction for future work is to encourage sparser solutions via $\ell_1$-penalization for matrices [35] and tensors [39, 49, 17, 34]. Sparse factors often provide more easily interpreted models, and penalization may also accelerate convergence. While the factor matrices generated by CP-APR may be naturally sparse without imposing an $\ell_1$-penalty, the degree of sparsity is not currently tunable. One may also consider extensions of this work in the context of missing data [22, 7, 48, 1] and for alternative tensor factorizations such as Tucker [17].

Perhaps most challenging, however, are open questions related to rank and inference. Questions about how to choose the rank are not new; but given the context of sparse count data, might that structure be exploited to derive a sensible heuristic or even rigorous criterion for choosing the rank? We already see that Assumption 3.2 imposes an upper bound on the rank to ensure algorithmic convergence. Regarding inference, our focus in this work was on thoroughly developing the algorithmic groundwork for fitting a PTF model for sparse count data. CP-APR can be used to estimate latent structure.
Once an estimate is in hand, however, it is natural to ask how much uncertainty there is in that estimate. For example, is it possible to put a confidence interval around the entries in the fitted factor matrices, especially zero or near-zero entries? Given that inference for the related but simpler case of Poisson regression has been worked out, we suspect that a sensible solution is waiting to be found. The benefits of answering these questions warrant further investigation. We highlight them as important topics for future research.
Acknowledgments.
We thank our colleagues at Sandia for numerous helpful conversations in the course of this work, especially Grey Ballard and Todd Plantenga. We also thank Kenneth Lange for pointing us to relevant references on emission tomography. Finally, we thank the anonymous referees and associate editor for suggestions which greatly improved the quality of the manuscript.
Appendix A. Notation Details.
Outer product.
The outer product of $N$ vectors is an $N$-way tensor. For example, $(a \circ b \circ c)_{ijk} = a_i b_j c_k$.

Elementwise multiplication and division. Let $\mathbf{A}$ and $\mathbf{B}$ be two same-sized tensors (or matrices). Then $\mathbf{C} = \mathbf{A} \ast \mathbf{B}$ yields a tensor that is the same size as $\mathbf{A}$ (and $\mathbf{B}$) such that $c_i = a_i b_i$ for all $i$. Likewise, $\mathbf{C} = \mathbf{A} \oslash \mathbf{B}$ yields a tensor that is the same size as $\mathbf{A}$ (and $\mathbf{B}$) such that $c_i = a_i / b_i$ for all $i$.

Khatri-Rao product. Given two matrices $A$ and $B$ of sizes $I_1 \times R$ and $I_2 \times R$, then $C = A \odot B$ is a matrix of size $I_1 I_2 \times R$ such that
$$C = \begin{bmatrix} a_1 \otimes b_1 & a_2 \otimes b_2 & \cdots & a_R \otimes b_R \end{bmatrix},$$
where the Kronecker product of two vectors of sizes $I_1$ and $I_2$ is a vector of length $I_1 I_2$ given by
$$a \otimes b = \begin{bmatrix} a_1 b \\ a_2 b \\ \vdots \\ a_{I_1} b \end{bmatrix}.$$

Matricization of a tensor. The mode-$n$ matricization or unfolding of a tensor $\mathbf{X}$ is denoted by $X_{(n)}$ and is of size $I_n \times J_n$, where $J_n \equiv \prod_{m \neq n} I_m$. In this case, tensor element $\mathbf{i}$ maps to matrix element $(i, j)$, where $i = i_n$ and
$$j = 1 + \sum_{\substack{k=1 \\ k \neq n}}^{N} (i_k - 1) \prod_{\substack{m=1 \\ m \neq n}}^{k-1} I_m.$$

Appendix B. Proof of Lemma 3.1.
In this section, we provide a proof of Lemma 3.1. We first establish two useful lemmas.

Lemma B.1.
Let $\mathbf{X}$ be fixed, let $\mathbf{M} = \llbracket \lambda; A^{(1)}, \ldots, A^{(N)} \rrbracket$, and let $f(\mathbf{M})$ be the objective function as in (3.1). If $f(\mathbf{M}) \leq \zeta$ for some constant $\zeta > 0$, then there exist constants $\xi', \xi > 0$ (depending on $\mathbf{X}$ and $\zeta$) such that $e^T\lambda \in [\xi', \xi]$.

Proof. Because the factor matrices are column stochastic, we can observe that
$$f(\mathbf{M}) = e^T\lambda - \sum_i x_i \log\Big(\sum_r \lambda_r\, a^{(1)}_{i_1 r} \cdots a^{(N)}_{i_N r}\Big) \geq e^T\lambda - \vartheta \log\big(e^T\lambda\big), \quad\text{where}\quad \vartheta = \Big(\prod_{n=1}^N I_n\Big)\, \max_i x_i. \tag{B.1}$$
We have $\zeta \geq e^T\lambda - \vartheta \log(e^T\lambda)$. Let $g(\alpha) = \alpha - \vartheta\log(\alpha)$ where $\alpha > 0$. We show that $g(\alpha) \leq \zeta$ implies there exist $\xi', \xi > 0$ such that $\alpha \in [\xi', \xi]$. First assume there is no such lower bound $\xi'$. Then there is a sequence $\alpha_n$ tending to zero such that $g(\alpha_n) \leq \zeta$. But for sufficiently large $n$, we have that $-\vartheta\log(\alpha_n) > \zeta$. Since $\alpha_n > 0$, we have that for sufficiently large $n$ the function $g(\alpha_n) > \zeta$, a contradiction. Therefore, there is such a lower bound $\xi'$.

Now suppose there is no such upper bound $\xi$, so that there is an unbounded and increasing sequence $\alpha_n$ tending to infinity such that $g(\alpha_n) \leq \zeta$ for all $n$. Note that $g'(\alpha) = 1 - \vartheta/\alpha$. Since $g(\alpha)$ is convex, we have that
$$g(\alpha) \geq g(2\vartheta) + g'(2\vartheta)(\alpha - 2\vartheta) = g(2\vartheta) + \tfrac{1}{2}\alpha - \vartheta.$$
This inequality, however, indicates that for sufficiently large $n$ the right-hand side is greater than $\zeta$. Therefore, there must be an upper bound $\xi$. Substituting $\alpha = e^T\lambda$ completes the proof.

Lemma B.2.
Let $\mathbf{X}$ be fixed, and let $f(\mathbf{M})$ be the objective function as in (3.1). Let $\Omega(\zeta)$ be the convex hull of the level set of $f$ as defined in (3.3). The function $f(\mathbf{M})$ is bounded for all $\mathbf{M} \in \Omega(\zeta)$.

Proof. Let $\bar{\mathbf{M}}, \hat{\mathbf{M}} \in \{\mathbf{M} \mid f(\mathbf{M}) \leq \zeta\}$. Define $\tilde{\mathbf{M}}$ to be the convex combination
$$\tilde{\mathbf{M}} = \alpha\bar{\mathbf{M}} + (1-\alpha)\hat{\mathbf{M}} \quad\text{where}\quad \alpha \in [0.5, 1].$$
Note that the restriction on $\alpha$ is arbitrary but makes the proof simpler later on. Observe that
$$\tilde{m}_i = \sum_r \bigg\{\Big(\alpha\bar\lambda_r + (1-\alpha)\hat\lambda_r\Big) \prod_n \Big(\alpha\,\bar a^{(n)}_{i_n r} + (1-\alpha)\,\hat a^{(n)}_{i_n r}\Big)\bigg\}.$$
On the one hand, by Lemma B.1, there exists $\xi > 0$ such that
$$\tilde m_i \leq \sum_r \Big(\alpha\bar\lambda_r + (1-\alpha)\hat\lambda_r\Big) = \alpha\sum_r \bar\lambda_r + (1-\alpha)\sum_r \hat\lambda_r \leq \alpha\xi + (1-\alpha)\xi = \xi.$$
On the other hand,
$$\tilde m_i \geq \sum_r \bigg\{\alpha\bar\lambda_r \prod_n \alpha\,\bar a^{(n)}_{i_n r}\bigg\} = \alpha^{N+1}\bar m_i.$$
Thus, $\alpha^{N+1}\bar m_i \leq \tilde m_i \leq \bar m_i + \xi$. Now consider
$$\tilde m_i - x_i\log\tilde m_i \leq \bar m_i + \xi - x_i\log\big(\alpha^{N+1}\bar m_i\big) = (\bar m_i - x_i\log\bar m_i) + \xi - (N+1)x_i\log\alpha \leq (\bar m_i - x_i\log\bar m_i) + \xi + (N+1)x_i\log 2.$$
Thus,
$$f(\tilde{\mathbf{M}}) \leq f(\bar{\mathbf{M}}) + \xi\prod_n I_n + (N+1)\log 2\sum_i x_i \leq \zeta + \xi\Big(\prod_n I_n\Big) + (N+1)\log 2\sum_i x_i.$$

Given these two lemmas, we are finally ready to provide the proof of Lemma 3.1.
Proof. [of Lemma 3.1] Fix $\zeta$. If $\{\mathbf{M} \in \Omega \mid f(\mathbf{M}) \leq \zeta\}$ is empty, then $\Omega(\zeta)$ is empty and there is nothing left to do. Thus, assume $\{\mathbf{M} \in \Omega \mid f(\mathbf{M}) \leq \zeta\}$ is nonempty. Since $f$ is continuous at all $\mathbf{M} \in \Omega$ for which $f(\mathbf{M})$ is finite, $f$ is continuous on $\Omega(\zeta)$ by Lemma B.2. Since $f$ is continuous, $\{\mathbf{M} \in \Omega \mid f(\mathbf{M}) \leq \zeta\}$ is closed because it is the preimage of the closed set $(-\infty, \zeta]$ under $f$; thus, $\Omega(\zeta)$ is closed because it is a convex combination of closed sets. Consequently, we only need to show that $\Omega(\zeta)$ is bounded. Assume the contrary. Then there exists a sequence of models $\mathbf{M}_k = \llbracket \lambda_k; A^{(1)}_k, \ldots, A^{(N)}_k \rrbracket \in \Omega(\zeta)$ such that $e^T\lambda_k \to \infty$. By Lemma B.2, $f(\mathbf{M})$ is finite on $\Omega(\zeta)$, but this contradicts Lemma B.1. Hence, the claim.

Appendix C. Deriving the MM updates.
In this section we derive the MM update rules used to solve the subproblem. We first verify that (4.2) majorizes (4.1). For convenience, let $C = B^T$ so that (4.1) reduces to
$$\min_{C \geq 0}\; f(C^T) = \sum_{ij} \Big[ c_i^T\pi_j - x_{ij}\log\big(c_i^T\pi_j\big) \Big]. \tag{C.1}$$
Proofs of the next two lemmas are given by Lee and Seung in [30], but their arguments do not carefully handle boundary points. The following two lemmas and their proofs treat with more rigor the existence and value of updates when anchor points lie on admissible regions of the boundary.

Lemma C.1.
Let $x \geq 0$ be a scalar and $\pi \geq 0$, $\pi \neq 0$, be a vector of length $R$. For a vector $c \geq 0$, $c \neq 0$, of length $R$, let the function $f$ be defined by $f(c) = c^T\pi - x\log(c^T\pi)$. Then $f$ is majorized at $\bar c \geq 0$ by
$$g(c, \bar c) = c^T\pi - x\sum_{r=1}^R \alpha_r\log\Big(\frac{c_r\pi_r}{\alpha_r}\Big), \quad\text{where}\quad \alpha_r = \frac{\bar c_r\pi_r}{\bar c^T\pi}.$$

Proof. If $x = 0$, then $g(c, \bar c) = f(c)$ for all $c$, and $g$ trivially majorizes $f$ at $\bar c$. Consider the case when $x > 0$. It is immediate that $g(\bar c, \bar c) = f(\bar c)$. The majorization follows from the fact that $\log$ is strictly concave and that we can write $c^T\pi$ as a convex combination of the elements $c_r\pi_r/\alpha_r$. Note that if any elements $\bar c_r\pi_r$ are zero, they do not contribute to the sum, since we assume $0 \cdot \log(\mu) = 0$ for all $\mu \geq 0$ when $\alpha_r = 0$.

Summing the majorizations over all $i$ and $j$, we conclude that (4.2) majorizes (C.1) in $C$ as
$$g(C, \bar C) = \sum_{rij}\bigg[c_{ri}\pi_{rj} - \alpha_{rij}x_{ij}\log\Big(\frac{c_{ri}\pi_{rj}}{\alpha_{rij}}\Big)\bigg], \quad\text{where}\quad \alpha_{rij} = \frac{\bar c_{ri}\pi_{rj}}{\sum_r \bar c_{ri}\pi_{rj}}. \tag{C.2}$$

Lemma C.2.
Let $f$ and $g$ be as defined in (C.1) and (4.2), respectively. Then, for all $\bar C \geq 0$ such that $f(\bar C^T)$ is finite, the function $g(\cdot, \bar C)$ has a unique global minimum $C^*$, which is given by $(C^*)_{ri} = \sum_j \alpha_{rij}x_{ij}$ where $\alpha_{rij} = \bar c_{ri}\pi_{rj}/\bar c_i^T\pi_j$, for all $r = 1, \ldots, R$, $i = 1, \ldots, I$.

Proof. Because $g(C, \bar C)$ separates in the elements of $C$, we focus on solving each elementwise minimization problem. Dropping subscripts, the minimization problem with respect to $c_{ri}$ can be rewritten as
$$\min_{c \geq 0}\; c - \sum_j \alpha_j x_j \log\Big(\frac{c\,\pi_j}{\alpha_j}\Big), \tag{C.3}$$
where we have used the fact that $\sum_j \pi_j = 1$. It is sufficient to prove that this univariate problem has a unique global minimizer, $c^* = \sum_j \alpha_j x_j$. First, consider the case where the second term is nonzero. Inspecting the stationarity condition reveals the solution. Moreover, the function is strictly convex and so has a unique global minimum. Second, consider the case where the second term is zero. Then it is immediate that the unique global minimum is $c^* = 0$. Moreover, the second term can only vanish when $\sum_j \alpha_j x_j = 0$, and so the formula applies.

Appendix D. Proof of Theorem 4.3.
In this section, we prove that the MM algorithm in Algorithm 2 solves (4.1). We first establish the following general result for algorithm maps. Part (a) is a simple version of Zangwill's convergence theorem [52, p. 91] in the case where the objective function and algorithm map are both continuous. The proof of part (b) follows arguments of part of a proof for a different but related property on MM iterates in [27, p. 198].
Theorem D.1.
Let $f$ be a continuous function on a domain $D$, and let $\psi$ be a continuous iterative map from $D$ into $D$ such that $f(\psi(x)) < f(x)$ for all $x \in D$ with $\psi(x) \neq x$. Suppose there is an $x_0$ such that the set $L_f(x_0) \equiv \{x \in D \mid f(x) \leq f(x_0)\}$ is compact. Define $x_{k+1} = \psi(x_k)$ for $k = 0, 1, \ldots$. Then (a) the sequence of iterates $\{x_k\}$ has at least one limit point and all its limit points are fixed points of $\psi$, and (b) the distance between successive iterates converges to 0, i.e., $\|x_{k+1} - x_k\| \to 0$.

Proof. The proof of (a) follows that of Proposition 10.3.2 of [27]. First note that the sequence of iterates must be in $L_f(x_0)$ because $f(x_k) \leq f(x_0)$ for all $k$. Since $L_f(x_0)$ is compact, $\{x_k\}$ has a convergent subsequence whose limit is in $L_f(x_0)$; denote this as $x_{k_\ell} \to x^*$ as $\ell \to \infty$. Since $f$ is assumed to be continuous, $\lim f(x_{k_\ell}) = f(x^*)$. Moreover, clearly $f(x^*) \leq f(x_{k_\ell})$ for all $k_\ell$.

Note that $f(\psi(x_{k_\ell})) \leq f(x_{k_\ell})$. Taking the limit of both sides and applying the continuity of $\psi$ and $f$, we must have that $f(\psi(x^*)) \leq f(x^*)$. But we also have that
$$f(x^*) \leq f(x_{k_{\ell+1}}) \leq f(x_{k_\ell + 1}) = f(\psi(x_{k_\ell})).$$
Again taking limits, we obtain $f(x^*) \leq f(\psi(x^*))$. Therefore $f(x^*) = f(\psi(x^*))$. But by assumption, this equality implies that $x^*$ is a fixed point of $\psi$, and thus (a) is proven.
We now turn to the proof of (b), which follows the proof of Proposition 10.3.3 in [27]. Recall that $\{x_k\}$ denotes the iterate sequence. Since $f(x_k)$ is decreasing and $f$ is bounded below on $L_f(x_0)$, we can assert that $f(x_k)$ is a convergent sequence with a limit $f^*$. Assume the contrary of (b), i.e., that there exists an $\epsilon > 0$ and a subsequence $\{k_\ell\}$ of the indices such that
$$\|x_{k_\ell + 1} - x_{k_\ell}\| > \epsilon \quad\text{for all}\quad k_\ell. \tag{D.1}$$
Note that this subsequence is different from the one discussed in proving part (a). Since $x_{k_\ell} \in L_f(x_0)$, by possibly restricting $\{k_\ell\}$ to a further subsequence, we may assume that $x_{k_\ell}$ converges to a limit $u$. By possibly restricting $\{k_\ell\}$ to yet a further subsequence, we may additionally assume that $x_{k_\ell + 1}$ converges to a limit $v$. By (D.1), we can conclude $\|v - u\| \geq \epsilon$. Note that $x_{k_\ell + 1} = \psi(x_{k_\ell})$. Taking the limit of both sides and using the continuity of $\psi$, we obtain $\psi(u) = v$. Additionally, using the continuity of $f$,
$$f(u) = \lim_{\ell\to\infty} f(x_{k_\ell}) = f^* = \lim_{\ell\to\infty} f(x_{k_\ell + 1}) = f(v).$$
Since $v = \psi(u)$, we have that $f(u) = f(\psi(u))$, which by assumption occurs if and only if $u = \psi(u)$. This implies that $u = v$, and we have arrived at a contradiction.

We now prove a series of lemmas leading up to a proof of the desired convergence result.

Lemma D.2.
Let $B \geq 0$ be such that $f(B)$ is finite, and suppose $B \neq B \ast \Phi$. Then $f(B) > f(B \ast \Phi)$.

Proof. By Lemma C.2, $(B \ast \Phi)^T$ is the unique global minimizer of $g(\cdot, B^T)$, which majorizes $f$ at $B^T$. Therefore, if $B \neq B \ast \Phi$, we must have $f(B) = g(B^T, B^T) > g((B \ast \Phi)^T, B^T) \geq f(B \ast \Phi)$.

Lemma D.3.
Let $f$ be as defined in (4.1). For any nonnegative matrix $B_0$ such that $f(B_0)$ is finite, the level set $L_f(B_0) = \{B \geq 0 \mid f(B) \leq f(B_0)\}$ is compact.

Proof. The proof follows the same logic as the proof of Lemma B.1.

Lemma D.4.
Let $f$ be as defined in (4.1) and $\psi$ be as defined in (4.3), and suppose Assumption 3.2 is satisfied. For any nonnegative matrix $B_0$ such that $f(B_0)$ is finite, the sequence $B_{k+1} = \psi(B_k)$ converges.

Proof. Note that all limit points of the iterate sequence are fixed points of $\psi$ by Theorem D.1. First, we show that the set of fixed points is finite. Suppose that $B$ is a fixed point of $\psi$. Then we must have $B \ast (E - \Phi(B)) = 0$. By Lemma 3.4, it can be verified that $B$ is the unique global minimizer of
$$\min f(U) \quad\text{s.t.}\quad U \in \{U \geq 0 \mid u_{ir} = 0 \text{ if } b_{ir} = 0\},$$
where $f$ is as defined in (4.1). Therefore, any fixed point that has the same zero pattern as $B$ must be equal to $B$. Since there are only a finite number of possible zero patterns, the number of fixed points is finite.

Since every limit point is a fixed point by Theorem D.1(a), there are only finitely many limit points. Let $\{\mathcal{N}_p\}$ denote a collection of arbitrarily small neighborhoods around each fixed point, indexed by $p$. Only finitely many iterates $B_k$ are in $L_f(B_0) - \cup_p \mathcal{N}_p$, so all but finitely many iterates $B_k$ will be in $\cup_p \mathcal{N}_p$. But $\|B_{k+1} - B_k\|$ eventually becomes smaller than the smallest distance between any two neighborhoods by Theorem D.1(b). Therefore the sequence $B_k$ must belong to one of the neighborhoods for all but finitely many $k$. So any sequence of iterates must eventually converge to exactly one of the fixed points of $\psi$.

Lemma D.5.
Let $f$ be as defined in (4.1) and suppose Assumption 3.2 is satisfied. Suppose $B_k \to B^*$ is a convergent sequence of iterates defined by (4.3) with $B_0 \geq 0$ and $f(B_0)$ finite. If $(B_0)_{ir} > 0$ for all $(i,r)$ such that $(\Phi(B^*))_{ir} > 1$, then $\nabla f(B^*) \geq 0$.

Proof. We give a proof by contradiction. Suppose there exists $(i,r)$ such that $(B_0)_{ir} > 0$ but $(\nabla f(B^*))_{ir} < 0$. Since $B^*$ is a fixed point of $\psi$, we must have $[1 - (\Phi(B^*))_{ir}](B^*)_{ir} = 0$. By our assumption, however, $(\nabla f(B^*))_{ir} = 1 - (\Phi(B^*))_{ir} < 0$, so $(B^*)_{ir} = 0$. On the other hand, $(B_k)_{ir} > 0$ for all $k$ (proof left to reader). Since $\Phi(\cdot)$ is a continuous function of $B$ on $L_f(B_0)$, we know that there exists some $K$ such that $k > K$ implies $B_k$ is close enough to $B^*$ that $(\nabla f(B_k))_{ir} = 1 - (\Phi(B_k))_{ir} < 0$. Since $(B_k)_{ir} > 0$, we have $[1 - (\Phi(B_k))_{ir}](B_k)_{ir} < 0$, which implies $(B_k)_{ir} < (B_{k+1})_{ir}$ for all $k > K$. But this contradicts $\lim_{k\to\infty}(B_k)_{ir} = (B^*)_{ir} = 0$. Hence, the claim.

We now prove Theorem 4.3.

Proof. [of Theorem 4.3] By Lemma D.4, the sequence $\{B_k\}$ converges; we call the limit point $B^*$. At this limit point, we have: (a) $B^* \geq$
0, (b) $\nabla f(B^*) \geq 0$ by Lemma D.5, and (c) $B^* \ast \nabla f(B^*) = 0$ by virtue of $B^*$ being a fixed point of $\psi$. Thus, the point $B^*$ satisfies the KKT conditions with respect to (4.1). Furthermore, since $f$ is strictly convex by Lemma 3.4, we can conclude that $B^*$ is the global minimum of $f$.

Appendix E. Numerical experiment details for §6.

All implementations are from Version 2.5 of the Tensor Toolbox for MATLAB [4]. All methods use a common initial guess for the solution.
• Lee-Seung LS: Implemented in cp_nmu as described in [3]. We use the default parameters except that the maximum number of iterations (maxiters) is set to 200 and the tolerance on the change in the fit (tol) is set to $10^{-8}$.
• CP-ALS: Implemented in cp_als as described in [3]. We use the default parameter settings except that the maximum number of iterations (maxiters) is 200 and the tolerance on the change in fit (tol) is $10^{-8}$.
• Lee-Seung KL: Implemented in cp_apr as described in this paper. The parameters are set as follows: $k_{\max} = 200$ (maxiters), $\ell_{\max} = 1$ (maxinneriters), $\tau = 10^{-4}$ (tol), $\kappa = 0$ (kappa), $\epsilon = 0$ (epsilon).
• CP-APR: Implemented in cp_apr as described in this paper. The parameters are set as follows: $k_{\max} = 200$ (maxiters), $\ell_{\max} = 10$ (maxinneriters), $\tau = 10^{-4}$ (tol), $\kappa = 10^{-2}$ (kappa), $\kappa_{\text{tol}} = 10^{-10}$ (kappatol), $\epsilon = 0$ (epsilon).

We compare the methods in terms of their "factor match score" (FMS), defined as follows. Let $\mathbf{M} = \llbracket \lambda; A^{(1)}, \ldots, A^{(N)} \rrbracket$ be the true model and let $\bar{\mathbf{M}} = \llbracket \bar\lambda; \bar A^{(1)}, \ldots, \bar A^{(N)} \rrbracket$ be the computed solution. The score of $\bar{\mathbf{M}}$ is computed as
$$\operatorname{score}(\bar{\mathbf{M}}) = \frac{1}{R}\sum_r \bigg(1 - \frac{|\xi_r - \bar\xi_r|}{\max\{\xi_r, \bar\xi_r\}}\bigg) \prod_n \frac{a^{(n)T}_r \bar a^{(n)}_r}{\|a^{(n)}_r\|\,\|\bar a^{(n)}_r\|},$$
where
$$\xi_r = \lambda_r \prod_n \|a^{(n)}_r\| \quad\text{and}\quad \bar\xi_r = \bar\lambda_r \prod_n \|\bar a^{(n)}_r\|.$$
The FMS is a rather abstract measure, so we also give results for the number of columns in $A^{(1)}$ that are correctly identified. In other words, we count the number of times that the cosine of the angle between the true solution and the computed solution is greater than 0.95; mathematically, $a_r^{(1)T} \bar{a}_r^{(1)} / (\|a_r^{(1)}\| \, \|\bar{a}_r^{(1)}\|) \geq 0.95$. We use the first mode, but the results are representative of the other modes.
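The column-recovery count described above can be sketched as follows (illustrative code, not the paper's implementation; columns are assumed to be matched up under the same permutation):

```python
import numpy as np

def count_recovered_columns(A_true, A_est, threshold=0.95):
    """Count the columns of A_true whose cosine with the corresponding
    column of A_est is at least `threshold`."""
    cos = np.sum(A_true * A_est, axis=0) / (
        np.linalg.norm(A_true, axis=0) * np.linalg.norm(A_est, axis=0))
    return int(np.sum(cos >= threshold))
```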
Appendix F. Additional Enron results.
Figure F.1 illustrates the four components omitted in Figure 6.2: (a) Component 1, (b) Component 3, (c) Component 4, and (d) Component 5.
Fig. F.1: Remaining components from factorizing the Enron data.
REFERENCES

[1] E. Acar, D. M. Dunlavy, T. G. Kolda, and M. Mørup, Scalable tensor factorizations for incomplete data, Chemometrics and Intelligent Laboratory Systems, 106 (2011), pp. 41–56.
[2] B. W. Bader, M. W. Berry, and M. Browne, Discussion tracking in Enron email using PARAFAC, in Survey of Text Mining II: Clustering, Classification, and Retrieval, M. W. Berry and M. Castellanos, eds., Springer, London, 2008.
[3] B. W. Bader and T. G. Kolda, Efficient MATLAB computations with sparse and factored tensors, SIAM Journal on Scientific Computing, 30 (2007), pp. 205–231.
[4] B. W. Bader, T. G. Kolda, et al., MATLAB Tensor Toolbox version 2.5. Available online, January 2012.
[5] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice Hall, 1989.
[6] R. Bro and S. De Jong, A fast non-negativity-constrained least squares algorithm, Journal of Chemometrics, 11 (1997), pp. 393–401.
[7] A. M. Buchanan and A. W. Fitzgibbon, Damped Newton algorithms for matrix factorization with missing data, in CVPR'05: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, 2005, pp. 316–322.
[8] J. D. Carroll and J. J. Chang, Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition, Psychometrika, 35 (1970), pp. 283–319.
[9] A. Cichocki, R. Zdunek, S. Choi, R. Plemmons, and S.-I. Amari, Non-negative tensor factorization using alpha and beta divergences, in ICASSP 07: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2007.
[10] V. de Silva and L.-H. Lim, Tensor rank and the ill-posedness of the best low-rank approximation problem, SIAM J. Matrix Anal. Appl., 30 (2008), pp. 1084–1127.
[11] I. Dhillon and S. Sra, Generalized nonnegative matrix approximations with Bregman divergences, Advances in Neural Information Processing Systems (NIPS), 18 (2006), p. 283.
[12] D. M. Dunlavy, T. G. Kolda, and W. P. Kegelmeyer, Multilinear algebra for analyzing data with multiple linkages, in Graph Algorithms in the Language of Linear Algebra, J. Kepner and J. Gilbert, eds., Fundamentals of Algorithms, SIAM, Philadelphia, 2011, pp. 85–114.
[13] D. M. Dunlavy, T. G. Kolda, and E. Acar, Temporal link prediction using matrix and tensor factorizations, ACM Transactions on Knowledge Discovery from Data, 5 (2011). Article 10, 27 pages.
[14] C. Févotte and J. Idier, Algorithms for nonnegative matrix factorization with the β-divergence, Neural Computation, 23 (2011), pp. 2421–2456.
[15] L. Finesso and P. Spreij, Nonnegative matrix factorization and I-divergence alternating minimization, Linear Algebra and its Applications, 416 (2006), pp. 270–287.
[16] D. FitzGerald, M. Cranitch, and E. Coyle, Non-negative tensor factorisation for sound source separation, IEE Conference Publications, (2005), pp. 8–12.
[17] M. P. Friedlander and K. Hatz, Computing nonnegative tensor factorizations, Computational Optimization and Applications, 23 (2008), pp. 631–647.
[18] N. Gillis and F. Glineur, Nonnegative factorization and the maximum edge biclique problem. arXiv:0810.4225, Oct. 2008.
[19] N. Gillis and F. Glineur, Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization, Neural Computation, 24 (2011), pp. 1085–1105.
[20] E. F. Gonzalez and Y. Zhang, Accelerating the Lee-Seung algorithm for nonnegative matrix factorization, tech. report, Rice University, March 2005.
[21] R. A. Harshman, Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis, UCLA working papers in phonetics, 16 (1970), pp. 1–84. Available at .
[22] H. A. L. Kiers, Weighted least squares fitting using ordinary least squares algorithms, Psychometrika, 62 (1997), pp. 215–266.
[23] D. Kim, S. Sra, and I. S. Dhillon, Fast projection-based methods for the least squares nonnegative matrix approximation problem, Statistical Analysis and Data Mining, 1 (2008), pp. 38–51.
[24] D. Kim, S. Sra, and I. S. Dhillon, Tackling box-constrained optimization via a new projected quasi-Newton approach, SIAM Journal on Scientific Computing, 32 (2010), pp. 3548–3563.
[25] H. Kim and H. Park, Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method, SIAM Journal on Matrix Analysis and Applications, 30 (2008), pp. 713–730.
[26] K. Lange, Convergence of EM image reconstruction algorithms with Gibbs smoothing, IEEE Transactions on Medical Imaging, 9 (1990), pp. 439–446.
[27] K. Lange, Optimization, Springer, 2004.
[28] K. Lange and R. Carson, EM reconstruction algorithms for emission and transmission tomography, Journal of Computer Assisted Tomography, 8 (1984).
[29] D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature, 401 (1999), pp. 788–791.
[30] D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, in Advances in Neural Information Processing Systems, vol. 13, MIT Press, 2001, pp. 556–562.
[31] L.-H. Lim and P. Comon, Nonnegative approximations of nonnegative tensors, Journal of Chemometrics, 23 (2009), pp. 432–441.
[32] C.-J. Lin, On the convergence of multiplicative update algorithms for nonnegative matrix factorization, IEEE Transactions on Neural Networks, 18 (2007), pp. 1589–1596.
[33] C.-J. Lin, Projected gradient methods for nonnegative matrix factorization, Neural Computation, 19 (2007), pp. 2756–2779.
[34] J. Liu, J. Liu, P. Wonka, and J. Ye, Sparse non-negative tensor factorization using columnwise coordinate descent, Pattern Recognition, 45 (2012), pp. 649–656.
[35] W. Liu, S. Zheng, S. Jia, L. Shen, and X. Fu, Sparse nonnegative matrix factorization with the elastic net, in BIBM2010: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, Dec. 2010, pp. 265–268.
[36] L. Lucy, An iterative technique for the rectification of observed distributions, Astronomical Journal, 79 (1974), pp. 745–754.
[37] P. McCullagh and J. A. Nelder, Generalized Linear Models (Second edition), Chapman & Hall, London, 1989.
[38] M. Mørup, L. Hansen, J. Parnas, and S. M. Arnfred, Decomposing the time-frequency representation of EEG using nonnegative matrix and multi-way factorization. Available at , 2006.
[39] M. Mørup, L. K. Hansen, and S. M. Arnfred, Algorithms for sparse nonnegative Tucker decompositions, Neural Computation, 20 (2008), pp. 2112–2131.
[40] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, 1999.
[41] P. Paatero, A weighted non-negative least squares algorithm for three-way "PARAFAC" factor analysis, Chemometrics and Intelligent Laboratory Systems, 38 (1997), pp. 223–242.
[42] P. Paatero and U. Tapper, Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values, Environmetrics, 5 (1994), pp. 111–126.
[43] P. O. Perry and P. J. Wolfe, Point process modeling for directed interaction networks. arXiv:1011.1703v1, Nov. 2010.
[44] G. Rodríguez, Poisson models for count data, in Lecture Notes on Generalized Linear Models, 2007, ch. 4. Available at http://data.princeton.edu/wws509/notes/.
[45] L. A. Shepp and Y. Vardi, Maximum likelihood reconstruction for emission tomography, IEEE Transactions on Medical Imaging, 1 (1982), pp. 113–122.
[46] A. Smilde, R. Bro, and P. Geladi, Multi-Way Analysis: Applications in the Chemical Sciences, Wiley, West Sussex, England, 2004.
[47] J. Sun, D. Tao, and C. Faloutsos, Beyond streams and graphs: Dynamic tensor analysis, in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, 2006, pp. 374–383.
[48] G. Tomasi and R. Bro, PARAFAC and missing values, Chemometrics and Intelligent Laboratory Systems, 75 (2005), pp. 163–180.
[49] Z. Wang, A. Maier, N. K. Logothetis, and H. Liang, Single-trial decoding of bistable perception based on sparse nonnegative tensor decomposition, Computational Intelligence and Neuroscience, 2008 (2008).
[50] M. Welling and M. Weber, Positive tensor factorization, Pattern Recognition Letters, 22 (2001), pp. 1255–1261.
[51] S. Zafeiriou and M. Petrou, Nonnegative tensor factorization as an alternative Csiszar–Tusnady procedure: algorithms, convergence, probabilistic interpretations and novel probabilistic tensor latent variable analysis algorithms, Data Mining and Knowledge Discovery, 22 (2011), pp. 419–466.
[52] W. I. Zangwill, Nonlinear Programming: A Unified Approach, International Series in Management, Prentice-Hall, Englewood Cliffs, New Jersey, 1969.
[53] H. Zhou, D. Alexander, and K. Lange, A quasi-Newton acceleration for high-dimensional optimization algorithms, Statistics and Computing, 21 (2011), pp. 261–273.
[54]