Columnwise Element Selection for Computationally Efficient Nonnegative Coupled Matrix Tensor Factorization
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Thirunavukarasu Balasubramaniam, Richi Nayak, Chau Yuen, and Yu-Chu Tian
Abstract—Coupled Matrix Tensor Factorization (CMTF) facilitates the integration and analysis of multiple data sources and helps discover meaningful information. Nonnegative CMTF (N-CMTF) has been employed in many applications for identifying latent patterns, prediction, and recommendation. However, due to the added complexity with coupling between tensor and matrix data, existing N-CMTF algorithms exhibit poor computation efficiency. In this paper, a computationally efficient N-CMTF factorization algorithm is presented based on the column-wise element selection, preventing frequent gradient updates. Theoretical and empirical analyses show that the proposed N-CMTF factorization algorithm is not only more accurate but also more computationally efficient than existing algorithms in approximating the tensor as well as in identifying the underlying nature of factors.
Index Terms—CP decomposition, Nonnegative Coupled Matrix Tensor Factorization, Cut-off, Coordinate Descent, Element Selection, Recommender systems, Spatio-temporal patterns.
1 INTRODUCTION

With the digital advancements, the data about the same concept can be collected from multiple sources. To get meaningful information on a topic, it is important to combine and analyze relevant data accurately and efficiently from multiple sources. This is known as data fusion, a fundamental technique widely used in a wide range of domains such as recommender systems [1], pattern mining [2], metabolomics [3], chemometrics [4], sensor and signal processing [5], brain imaging [6], and bioinformatics [7]. In recommender systems, the user-to-item rating (user × item) information can be fused with the user-to-user trust (user × user) information. Similarly, the multifaceted nature of datasets facilitates the fusion of matrix and tensor data sources for discovering useful knowledge. To recommend an item to a user based on its tagging activities, the primary data source can be represented as a third-order tensor (user × item × tag) with an auxiliary trust matrix (user × user). In smart city applications, people's check-in activities can be represented as a third-order tensor (user × location × time) with a (location × category) matrix. The coupled matrix-tensor fusion models are shown to be useful in real applications [1], [2], [4].

• T. Balasubramaniam, R. Nayak, and Y.-C. Tian are with the School of Electrical Engineering and Computer Science, Queensland University of Technology, Australia, QLD, 4000. E-mail: thirunavukarasu.balas, r.nayak, [email protected]
• C. Yuen is with Singapore University of Technology and Design, Singapore. Email: [email protected]

Traditional Matrix Factorization (MF) only or Tensor Factorization (TF) only algorithms such as Alternating Least Squares (ALS) [6], Stochastic Gradient Descent (SGD) [8], Multiplicative Update Rule (MU) [9], and Coordinate Descent (CD) [10] can fail to capture the latent factors in interrelated multi-dimensional data like a coupled matrix-tensor [11]. A Coupled Matrix Tensor Factorization (CMTF) [1] and a Nonnegative CMTF (N-CMTF) [12] have been specifically proposed to jointly analyze matrices and tensors. However, their computation efficiency and convergence speed are not acceptable, particularly for sparse data [13]. While these algorithms can be implemented in parallel and distributed environments to improve the scalability, same as TF [12], [14], [15], [16], [17], the underlying computational complexity remains the same. With the rise in this type of data, there exists a need to develop a computationally efficient N-CMTF algorithm.

In this paper, we propose a column-wise element selection-based CD algorithm, Cut-off Coordinate Descent (Cut-CD), to solve N-CMTF efficiently and accurately. Cut-CD updates factor matrices one by one. While updating a factor matrix, it selects only a few important elements based on a column-wise cut-off technique, instead of updating all the elements. It calculates each element's importance based on the difference in the objective function that measures the error minimized using optimization. By using the column-wise element selection technique, Cut-CD avoids the frequent gradient updates [18], a bottleneck in traditional element selection-based methods, and speeds up the factor matrix update process. For the purpose of pattern mining, we propose to introduce a sparsity constraint to the N-CMTF objective function in Cut-CD (called Cut-CD-SC) to capture accurate factor matrices. With an efficient N-CMTF algorithm as Cut-CD, the shared and unshared latent factors can be learned accurately, which can facilitate accurate prediction, recommendation and pattern mining for large datasets.

Extensive empirical analysis is conducted to show the efficiency of Cut-CD for the tasks of tensor approximation and recommendation, and the accuracy of Cut-CD-SC for spatio-temporal pattern mining in sparse datasets. Using synthetic and real-world datasets, we demonstrate 1) the fast convergence speed of Cut-CD; 2) the efficiency of Cut-
CD as compared to other state-of-the-art algorithms for accurately generating recommendations and finding the underlying structure of spatio-temporal patterns; and 3) the accuracy of Cut-CD-SC in finding distinctive spatio-temporal patterns.

In summary, this paper presents the following:

1) Factorization Algorithm. A column-wise element selection-based Cut-CD algorithm for N-CMTF. It updates the elements and gradients selectively, leading to faster convergence without a trade-off of accuracy.
2) Theoretical analysis. Theoretical results are established for the efficiency of Cut-CD in terms of convergence, time complexity, memory requirement, and element/gradient updates.
3) Experiments. Extensive empirical analysis is conducted to show the efficiency of Cut-CD in terms of runtime and convergence speed as well as accuracy in the tasks of approximation, prediction and pattern mining.

The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 introduces preliminaries for TF and CMTF. Section 4 presents Cut-CD. Theoretical analysis and empirical analysis are discussed in Sections 5 and 6, respectively.
2 RELATED WORK
CMTF Applications:
With the wide availability of multiple data sources (e.g., mobile computing and Internet-of-Things devices), CMTF has become more effective in missing data estimation (i.e., prediction) or pattern mining in comparison to using MF or TF independently [19], [20]. Tensor modeling (location × noise category × time slot) was applied to understand the noise pollution across New York City to inform people about the better environment [20]. The results show that the traditional TF fails to predict most of the missing entries due to the data being sparse. On the other hand, a CMTF method was successfully used to recommend an activity, time and location to a user by modelling (user × activity × location × time) with additional information coupled as a matrix at each mode [19]. CMTF has also been used in location recommendation by combining a (user × location × time) tensor model with a (user × user) similarity matrix and a (location × feature) matrix [21]. The runtime of the method was improved by utilizing a threshold algorithm that separates the data into smaller regions and calculates the top recommendation for each region. Recently, N-CMTF was applied to predict spatio-temporal patterns using Greedy Coordinate Descent (GCD) [2]. The method highlighted the scalability issues with N-CMTF, particularly in the presence of highly sparse data. In general, all of these methods have been applied to small datasets only. In reality, the majority of tensor applications result in large and very sparse datasets [19], [22].

N-CMTF algorithms:
Traditionally, ALS has been used to solve Nonnegative Tensor or Matrix Factorization (NTF/NMF) problems. Because of the non-convex optimization problem underlying the objective function, only one factor matrix is updated at a time in ALS, by fixing all the other factor matrices, to solve the optimization problem. This approach of ALS has been extended to solve the N-CMTF because of its simplicity in formulation [1]. However, due to the expensive matrix multiplications involved in the update rule, ALS becomes non-scalable and prone to poor convergence [23], [24]. To overcome poor convergence, all-at-once optimization (OPT) [1] was introduced for CMTF; it updates all factor matrices together. This gradient-descent based approach has been reported unable to identify the true underlying factors of the factor matrices [1], [4].
Fast NMF/NTF algorithms:
The time complexity involved in the traditional factorization process is mainly because of the Matricized Tensor Times Khatri-Rao Product (MTTKRP) and the factor matrix update [25].

To effectively parallelize or distribute the calculation of MTTKRP, the majority of researchers rely on parallel and distributed setups [12], [14], [15], [16], [17], [26]. For instance, the parallel and distributed stochastic gradient-based FlexiFact algorithm was proposed to reduce the computational cost [14]. While these approaches may improve the performance in a distributed environment, they hardly bring an improvement on traditional non-distributed machines. The random sampling based method called Turbo-SMT was used to reduce the size of the input tensor [12]. The sampled tensor minimizes the size of the processing tensor involving MTTKRP, thus the computation cost is reduced. However, the random sampling of an original tensor to a smaller subset is prone to information loss and may result in inaccurately learned factors.

On the other hand, Coordinate Descent (CD) based NMF/NTF algorithms have been developed to reduce the time complexity through efficient factor matrix updates [27], [28]. They have shown better accuracy and convergence properties than ALS and OPT [29], [30]. The optimization is solved as a single-element sub-problem and each factor matrix is alternatively updated element-wise [27]. This minimizes the computational cost of factor matrix updates. This element-wise update leads the algorithm to choose and update important elements repeatedly, which proves to converge faster. Hsieh et al. introduced the GCD algorithm to solve the NMF problem [18], where the elements to be updated are greedily selected based on an importance measure calculated using the gradients. GCD updates a single element multiple times and, for each element update, it recalculates the gradient values of the entire row. However, this gradient recalculation significantly increases the computational cost and memory requirements. In comparison, the proposed Cut-CD selects a set of elements to be updated in each column and avoids the frequent gradient updates that are a bottleneck in GCD. We have implemented GCD, which is the only element selection-based algorithm in N-CMTF, and used it as a benchmark in experiments.

CCD++ [31] is a CD-based MF algorithm that updates the k-th column of each factor matrix without any element selection and repeats the k-th column updating for other factor matrices. This process is then done for all columns. This is considered as a single iteration, where the method requires multiple such iterations. Subset Alternating Least Squares (SALS) [31] is an intermediate version of ALS and CCD++ with an additional constraint that controls the number of columns to update in a single iteration of CCD++. However, SALS cannot be directly applied for N-CMTF due to the additional constraint it imposes.
Coordinate Descent for Tensor Factorization (CDTF) [17], [32] and Parallel Collective Matrix Factorization (PCMF) [33] are two closely related works in which the underlying factorization algorithm is CCD++ [31]. Without considering the parallelization and distribution, both CDTF and PCMF are the same, except for the fact that CDTF is the extension of CCD++ for higher orders whereas PCMF is the extension of CCD++ for collective factorization. We have implemented CCD++ in the N-CMTF environment and used it as a benchmark in experiments. The differences between the proposed Cut-CD and CCD++ based matrix/tensor factorization algorithms are three-fold.

1) Cut-CD first selects elements in a factor matrix according to element importance based on the proposed column-wise cut-off technique. All the selected elements of the factor matrix are then updated. This process is then repeated for the rest of the matrices sequentially. This forms an iteration. Whereas CCD++ and its dependent methods update the k-th column of each factor matrix without any element selection and repeat the k-th column updating for other factor matrices. This process is then done for all columns. This forms a single iteration. This frequent change in factor matrix is repeated R (number of columns) times, which involves higher communication cost, making its convergence slower. In summary, Cut-CD and CCD++ considerably vary in terms of how column elements are selected and how the sequence of columns is updated.
2) CCD++ involves multiple inner iterations during the column-wise update whereas Cut-CD does not have any inner iteration.
These inner iterations cause CCD++ to update the same column multiple times, which introduces frequent gradient updates, a time consuming task similar to GCD.
3) CCD++ relies on an accelerating technique (stopping criteria) to control the inner iterations and to avoid unnecessary repetitive updates of elements. On the other hand, the column-wise element selection makes Cut-CD converge faster without any inner iterations or repetitive updates of elements.

In summary, though many factorization algorithms have been proposed for matrix/tensor factorization, an efficient algorithm for N-CMTF is yet to be seen. The existing NMF/NTF algorithms fail to address the computational cost associated with the factor matrix update. The NMF/NTF algorithms compromise either with accuracy or need an expensive distributed setup while providing an efficient solution. Recently a family of methods has focused on combining manifold learning with NMF/NTF to improve the accuracy of the factorization process by incorporating a graph regularization term to capture the closeness information in the data [34], [35], [36]. These methods alter the objective function according to domain-specific requirements, but the underlying factorization algorithm remains similar to ALS. The process of building a nearest-neighbour graph is very expensive, thus these algorithms are limited to small data sizes. The purpose of Cut-CD is to improve the underlying factorization process as used in all matrix and tensor factorization. Cut-CD is designed for the Euclidean distance objective function that is commonly used. Therefore, the overarching aim is to utilise Cut-CD in the factorization process of these methods to achieve a computationally efficient performance. Cut-CD can be easily adapted for NMF or NTF problems in general.

TABLE 1: Notations used in this paper.

  Notation   Description
  X          tensor (Euler script letter)
  Ω          set of indices of observable entries of X
  x_jkl      (j, k, l)-th entry of X
  U          matrix (upper case, bold letter)
  u          vector (lower case, bold letter)
  u          scalar (lower case, italic letter) / element
  X_(n)      mode-n matricization of tensor
  ⊗          Kronecker product
  ⊙          Khatri-Rao product
  ∗          Hadamard product
  ∘          outer product
  ‖·‖        Frobenius norm
3 COUPLED MATRIX TENSOR FACTORIZATION
The notations used throughout the paper are summarized in Table 1. Let X ∈ R^(J×K×L) denote a third-order tensor where J, K and L represent the lengths of the dimensions (or modes) of the tensor.

Matricization of the tensor.
The process of converting a tensor into a matrix is called matricization or unfolding of the tensor [37]. For a third-order tensor, the mode-1 matricization can be denoted as X_(1) ∈ R^(J×(KL)).

Kronecker product.
For two matrices denoted as U ∈ R^(J×R) and V ∈ R^(K×R), the Kronecker product is presented as U ⊗ V. The resultant matrix of size (JK × R²) is defined as follows:

  U ⊗ V = [ u_11 V   u_12 V   …   u_1R V
             ⋮        ⋮       ⋱    ⋮
            u_J1 V   u_J2 V   …   u_JR V ]   (1)

        = [ u_1 ⊗ v_1   u_1 ⊗ v_2   …   u_R ⊗ v_R ]   (2)

where u_r and v_r are the columns of the matrices U and V, respectively.

Khatri-Rao Product.
The column-wise Kronecker product is called the Khatri-Rao product and is denoted as U ⊙ V. The resultant matrix is of size (JK × R) and is defined as:

  U ⊙ V = [ u_1 ⊗ v_1   u_2 ⊗ v_2   …   u_R ⊗ v_R ]   (3)

Hadamard Product.
When the sizes of two matrices are the same, the Hadamard product can be calculated as the element-wise matrix product. It is denoted by U ∗ V and defined as:

  U ∗ V = [ u_11 v_11   …   u_1R v_1R
             ⋮          ⋱    ⋮
            u_J1 v_J1   …   u_JR v_JR ]   (4)

Tensor Factorization is a dimensionality reduction technique that factorizes the given tensor into factor matrices that contain latent features. CANDECOMP/PARAFAC (CP) factorization and Tucker factorization are two well-known tensor factorization techniques [24]. CP factorization has been shown to be less expensive in both memory and time consumption as compared to Tucker factorization [24]. Therefore, CP factorization has been more commonly used.

Fig. 1: CP Factorization (X ≈ u_1 ∘ v_1 ∘ w_1 + u_2 ∘ v_2 ∘ w_2 + ⋯ + u_R ∘ v_R ∘ w_R).
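These product definitions can be checked numerically. The following sketch (plain NumPy, with small illustrative matrices only) builds the Kronecker, Khatri-Rao and Hadamard products of factor matrices:

```python
import numpy as np

# Illustrative factor matrices (J = 2, K = 3, R = 2).
U = np.array([[1.0, 2.0],
              [3.0, 4.0]])          # J x R
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # K x R

# Kronecker product: (JK) x R^2 block matrix of u_jr * V, as in (1).
kron = np.kron(U, V)

# Khatri-Rao product: column-wise Kronecker, (JK) x R, as in (3).
khatri_rao = np.einsum('jr,kr->jkr', U, V).reshape(-1, U.shape[1])

# Hadamard product needs equal shapes; use two J x R matrices, as in (4).
hadamard = U * U

print(kron.shape)        # (6, 4)
print(khatri_rao.shape)  # (6, 2)
```

Note that the columns of the Khatri-Rao product are exactly the "matching" columns u_r ⊗ v_r of the full Kronecker product.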
Definition 1 (CP Factorization):
For a tensor X ∈ R^(J×K×L) and rank R, the CP factorization factorizes the tensor into a sum of component rank-one tensors (Carroll and Chang, 1970) as:

  X ≈ ⟦U, V, W⟧ = Σ_{r=1}^{R} u_r ∘ v_r ∘ w_r   (5)

As shown in Fig. 1, it decomposes the tensor X into three factor matrices U, V and W with R hidden features. It is solved as the following minimization problem:

  min_{U,V,W} f(U, V, W) = ‖X − ⟦U, V, W⟧‖²   (6)

Coupled Matrix Tensor Factorization:
The objective function of CMTF optimization with a third-order tensor X ∈ R^(J×K×L) and matrix Y ∈ R^(J×M) coupled (or shared) in the first mode can be formulated [1] as:

  min_{U^(1),V,W,U^(2)} f(U^(1), V, W, U^(2)) = ‖X − ⟦U^(1), V, W⟧‖² + ‖Y − U^(1) U^(2)T‖²   (7)

where X is factorized as a three-dimensional CP model [24], [38] and Y is factorized as an NMF model [9]. The factor matrix U^(1) ∈ R^(J×R) is shared between X and Y, whereas the factor matrices V ∈ R^(K×R), W ∈ R^(L×R) and U^(2) ∈ R^(M×R) are unshared.

4 THE PROPOSED CUT-OFF COORDINATE DESCENT ALGORITHM FOR N-CMTF: CUT-CD

The overall process of Cut-CD is presented in Fig. 2.
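Before detailing Cut-CD, it helps to make the coupled objective (7) concrete. The sketch below (dense NumPy; the helper name `cmtf_objective` is illustrative, not from the paper) evaluates both terms; a tensor and matrix built exactly from known factors should give an objective of numerically zero:

```python
import numpy as np

def cmtf_objective(X, Y, U1, V, W, U2):
    """Evaluate the CMTF objective of (7):
    ||X - [[U1, V, W]]||^2 + ||Y - U1 @ U2.T||^2.
    All inputs are dense NumPy arrays; X has shape (J, K, L)."""
    # CP reconstruction: sum_r u_r o v_r o w_r, as in (5).
    X_hat = np.einsum('jr,kr,lr->jkl', U1, V, W)
    Y_hat = U1 @ U2.T
    return np.sum((X - X_hat) ** 2) + np.sum((Y - Y_hat) ** 2)

# Toy check: data generated from known rank-1 factors.
rng = np.random.default_rng(0)
U1 = rng.random((4, 1)); V = rng.random((3, 1))
W = rng.random((2, 1)); U2 = rng.random((5, 1))
X = np.einsum('jr,kr,lr->jkl', U1, V, W)
Y = U1 @ U2.T
print(cmtf_objective(X, Y, U1, V, W, U2))  # numerically zero
```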
The goal of Cut-CD is to identify the factor matrices representing each mode of the input tensor and the auxiliary matrix that can approximate the input tensor and the auxiliary matrix together. The objective of the optimization problem is to minimize the Euclidean distance between the input tensor and the approximated tensor, while also minimizing the Euclidean distance between the auxiliary matrix and the approximated matrix using the derived factor matrices. The objective function of CMTF as formulated in (7) can be represented as the sum of two objective functions representing the tensor and matrix as:

  f = f_1 + f_2   (8)

Fig. 2: Architecture of the overall process of Cut-CD.

where f_1 is solved as

  min_{U^(1),V,W ≥ 0} f_1(U^(1), V, W) = ‖X − ⟦U^(1), V, W⟧‖²   (9)

and f_2 is solved as

  min_{U^(1),U^(2) ≥ 0} f_2(U^(1), U^(2)) = ‖Y − U^(1) U^(2)T‖²   (10)

As the CMTF optimization is a non-convex optimization problem, the values of all other factor matrices need to be fixed to update one factor matrix at a time [1]. To update a factor matrix using the CD method [10], [27], we need to calculate the gradients of the factor matrices.

We first explain the process of learning the factor matrix U^(1), which is a shared factor matrix during the factorization process. The gradient G for the objective function f with respect to U^(1) is solved using partial derivatives as:

  G = ∂f/∂U^(1) = ∂f_1/∂U^(1) + ∂f_2/∂U^(1)   (11)

  ∂f_1/∂U^(1) = −X_(1)(W ⊙ V) + U^(1)(V^T V ∗ W^T W)   (12)

  ∂f_2/∂U^(1) = U^(1) U^(2)T U^(2) − Y U^(2)   (13)

where X_(1) is the mode-1 matricization of the tensor. (11) can be rewritten using (12) and (13) as follows:

  G = −X_(1)(W ⊙ V) + U^(1)(V^T V ∗ W^T W) + U^(1) U^(2)T U^(2) − Y U^(2)   (14)

For simplicity, each element of the gradient G is represented as:

  g_jr = −(X_(1)(W ⊙ V))_jr + (U^(1)(V^T V ∗ W^T W))_jr + (U^(1) U^(2)T U^(2))_jr − (Y U^(2))_jr   (15)

where g_jr is the gradient of the (j, r)-th element of U^(1). The second-order partial derivative H of the function f with respect to U^(1) is defined as:

  H = ∂²f/∂U^(1)² = (V^T V ∗ W^T W) + U^(2)T U^(2)   (16)

For simplicity, the second-order partial derivative H for each element can be represented as:

  h_rr = (V^T V ∗ W^T W)_rr + (U^(2)T U^(2))_rr   (17)

The CD update rule can be given as:

  Û^(1) = max(0, U^(1) − G/H) − U^(1)   (18)

The single-element update rule of CD is given as [27]:

  û^(1)_jr = max(0, u^(1)_jr − g_jr / h_rr) − u^(1)_jr   (19)

where û^(1)_jr indicates the computed element and u^(1)_jr indicates the (j, r)-th element of the factor matrix U^(1):

  u^(1)_jr ← u^(1)_jr + û^(1)_jr   (20)

We impose the nonnegative constraint on the element as:

  u^(1)_jr = { u^(1)_jr  for u^(1)_jr > 0 ;  0  for u^(1)_jr ≤ 0 }   (21)

The calculations of X_(1)(W ⊙ V), V^T V ∗ W^T W and U^(1) U^(2)T U^(2) for every element are expensive. The computational complexity of a factorization algorithm can be reduced if these values do not need to be calculated for every element and are only calculated for selected elements. We now present how the elements can be selected and updated using the proposed Cut-CD algorithm.

We conjecture that the N-CMTF algorithm will converge faster by updating the important elements repeatedly, instead of updating all elements sequentially. Algorithm 1 details the process. We propose to calculate the importance of an element during factorization using the gradient principle [18] as:

  e_jr = −(û^(1)_jr ∗ g_jr) − 0.5(h_rr ∗ û^(1)_jr ∗ û^(1)_jr)   (22)

where e_jr is the (j, r)-th element importance, which is the difference in the objective function. The higher the value, the higher the importance. We propose to calculate the column-wise element importance as:

  e_*r = −(û^(1)_*r ∗ g_*r) − 0.5(h_rr ∗ û^(1)_*r ∗ û^(1)_*r)   (23)

where e_*r is the vector of element importance for all the elements in the r-th column.

When an element u^(1)_jr is updated using a traditional element selection-based algorithm such as GCD [18], the gradients g_j* and the element importance e_j* of the entire row need to be calculated for each update. The calculation of the gradient and element importance for each update is an expensive task and consumes additional runtime and memory. Additionally, the gradients are calculated repeatedly for all elements that are present in the same row, where many of them are not required. For this reason, we propose to update them column-wise so that the repeated updates can be avoided and the efficiency of the factorization can be improved. Once we have the vector e_*r, the element values are normalized as:

  n_*r = (e_*r − min(e_*r)) / (max(e_*r) − min(e_*r))   (24)

where n_*r indicates the normalized column r of e_*r. Instead of updating a single element repeatedly, a set of elements in each column, chosen based on a cut-off value, can be updated only once. Most importantly, for each iteration, we propose to utilise a pre-calculated gradient that can help in identifying the importance of each element in each column. Therefore, we do not need to perform any inner gradient updates, which are the bottleneck in the row-wise element updates of a CD-based CMTF algorithm [18].

Fig. 3: (a) Row-wise (GCD) and (b) Column-wise (Cut-CD) element selection in a factor matrix; red color indicates the calculation for an element, element importance and gradient updates; blue color indicates the calculation for element importance and gradient updates; green color indicates the element update calculation alone.

As shown in Fig. 3a, if element u_11 is selected to be updated in GCD, it is essential to update the gradients of the entire row u_1* and the element importance e_1*. This is required to choose the next important element in the row for an update. The same process is repeated for every row and consequently increases the time complexity.

We propose to choose mean(n_*r) as the cut-off value for each column, as the mean value represents the best cut-off value (as shown in the experiment section) and avoids the need for parameter tuning. As shown in Fig. 3b, if element u_11 is selected to be updated in Cut-CD, it is enough to update the element alone and to avoid gradient updates. Since a distinct cut-off value can be set for each column without depending on the other columns of G, gradient updates are minimized. Additionally, since each element is updated only once in an iteration, it is not required to calculate the element importance repeatedly. This is supported by the following Lemma 4.1.
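The column-wise selection of (22)-(24) with the mean cut-off can be sketched for a single column as follows (a minimal NumPy sketch; the helper name `select_column_elements` and the toy values are illustrative, and the importance is computed from the candidate step û of (19)):

```python
import numpy as np

def select_column_elements(u_col, g_col, h_rr):
    """Column-wise element selection of Cut-CD, sketching
    (22)-(24) plus the mean cut-off. Returns the indices of
    the elements in this column that would be updated."""
    # Candidate CD step for every element of the column, as in (19).
    s = np.maximum(0.0, u_col - g_col / h_rr) - u_col
    # Importance (22)/(23): decrease of the objective for that step.
    e = -(s * g_col) - 0.5 * (h_rr * s * s)
    # Min-max normalization (24); guard the constant-column case.
    span = e.max() - e.min()
    n = (e - e.min()) / span if span > 0 else np.zeros_like(e)
    # Mean of the normalized scores as the per-column cut-off.
    return np.nonzero(n >= n.mean())[0]

u = np.array([0.5, 1.0, 0.0, 2.0])   # one column of U^(1)
g = np.array([4.0, -0.1, -3.0, 0.05])
print(select_column_elements(u, g, h_rr=2.0))  # selects indices 0 and 2
```

Only the selected entries of the column are then updated via (19)-(21); the rest are skipped for this iteration.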
Algorithm 1:
Cut-off Coordinate Descent (Cut-CD)
Input:
Tensor X; matrix Y; randomly initialized factor matrices U^(1) ∈ R^(J×R), V ∈ R^(K×R), W ∈ R^(L×R), U^(2) ∈ R^(M×R); rank R; E = ∅; N = ∅; maxiters.
Result:
Learned factor matrices U^(1), V, W, U^(2)

 1: for iters = 1 : maxiters do
 2:   Compute G, H using (14) and (16).
 3:   for r = 1 : R do
 4:     calculate e_*r using (23);
 5:     calculate n_*r using (24);
 6:     cut-off = mean(n_*r);
 7:     for j = 1 : J do
 8:       if n_jr ≥ cut-off then
 9:         update the (j, r)-th element of U^(1);
10:       end
11:     end
12:   end
13:   Repeat lines 2 to 12 with G and H calculated using (29) and (30), respectively, to update the elements of V in line 9;
14:   Repeat lines 2 to 12 with G and H calculated using (31) and (32), respectively, to update the elements of W in line 9;
15:   Repeat lines 2 to 12 with G and H calculated using (33) and (34), respectively, to update the elements of U^(2) in line 9;
16: end

Lemma 4.1. In the column-wise Cut-CD element selection, the update of the elements and element importance in the r-th column of U^(1) depends only on the gradients of the r-th column.

Proof. Let u^(1)_jr ∈ U^(1) be the (j, r)-th element of U^(1) and g_jr ∈ G represent the (j, r)-th gradient of U^(1). Ω denotes the set of indices of the observable entries of X. For all r ∈ {1, …, R}, the gradient can be calculated column-wise as:

  g_*r = −(X_(1)(W ⊙ V))_*r + (U^(1)(V^T V ∗ W^T W))_*r   (25)

The MTTKRP operation (X_(1)(W ⊙ V)) for the entire matrix is an expensive task, which can be simplified by calculating it column-wise as:

  (X_(1)(W ⊙ V))_jr = m_jr = Σ_{(j,k,l) ∈ Ω_{U^(1),j}} x_jkl (w_lr v_kr)   (26)

where Ω_{U^(1),j} indicates the subset of Ω whose mode-U^(1) index is j, and w_*r and v_*r indicate the r-th columns of the factor matrices W and V, respectively.

Substituting (26) into (25), we have:

  g_*r = −m_*r + (U^(1)(V^T V ∗ W^T W))_*r   (27)

As the calculation of the column-wise gradient is possible as per (27), the next ((r+1)-th) column's gradient is calculated as:

  g_*(r+1) = −m_*(r+1) + (U^(1)(V^T V ∗ W^T W))_*(r+1)   (28)

From (27) and (28), it can be seen that calculating the gradient of one column is independent of the gradient of the other column.
Moreover, the column-wise element importance as per (23) depends only on the r-th column of the gradient (g_*r). The set of important elements s_r ⊂ u^(1)_*r whose normalized importance meets the cut-off is selected. This subset s_r depends on the element importance calculated using (23), which in turn depends on the r-th column of G as per (25). Hence, the elements and element importance of the r-th column of U^(1) can be calculated and updated column-wise without depending on the gradients of the other columns.

If J is the total number of rows, Cut-CD selects S (where S < J) elements, based on the distinct cut-off value, to update in each column. This considerably minimizes the total number of element updates. Therefore, for each update of u^(1)_jr, it is enough to update g_jr alone, without updating g_j* for all the columns of the j-th row and the element importance e_jr.

As the matrix component Y is shared only with mode 1 (U^(1)), the partial derivative of f_2 in (11) can be set to zero for updating the tensor factor matrices V, W. The update rules for V and W can be derived from the following gradient G and second-order partial derivative H calculations:

  ∂f/∂V = G = −X_(2)(W ⊙ U^(1)) + V(U^(1)T U^(1) ∗ W^T W)   (29)

  ∂²f/∂V² = H = U^(1)T U^(1) ∗ W^T W   (30)

  ∂f/∂W = G = −X_(3)(V ⊙ U^(1)) + W(U^(1)T U^(1) ∗ V^T V)   (31)

  ∂²f/∂W² = H = U^(1)T U^(1) ∗ V^T V   (32)

When updating U^(2) in (8), the partial derivative of f_1 with respect to the tensor in (11) is set to zero; the gradient G and second-order partial derivative H for updating U^(2) become:

  ∂f/∂U^(2) = G = U^(2) U^(1)T U^(1) − Y^T U^(1)   (33)

  ∂²f/∂U^(2)² = H = U^(1)T U^(1)   (34)

The Cut-CD column-wise element selection is applied to select and update the elements of the factor matrices V, W, U^(2) using (20) in the same way as explained for U^(1). Algorithm 1 details the process.

Cut-CD is generic and can be easily derived for NMF or NTF problems by substituting zero for f_1 or f_2, respectively, in (8), or by adding another shared matrix in other tensor dimensions. The element selection method of GCD needs to know the values of the gradients of the whole matrix. This is not convenient when the algorithm is applied in a distributed or parallel environment. In contrast, the column-wise element selection in Cut-CD allows the algorithm to efficiently adapt to a distributed or parallel environment. This is due
Cut-CD is generic andcan be easily derived for NMF or NTF problems by sub-stituting zero for f and f respectively in (8) or addinganother shared matrix in other tensor dimensions.The element selection method of GCD needs to knowthe values of gradients of the whole matrix. This is notconvenient when the algorithm is applied to a distributedor parallel environment. Whereas, the column-wise elementselection as in Cut-CD allows the algorithm to efficientlyadapt to a distributed or parallel environment. This is due EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 7 to the fact that the element selection is independent of othercolumns of a factor matrix as proven in Lemma 4.1. Thecolumn-wise update sequence as proposed in PCMF andCDTF has huge advantages in a distributed and parallel en-vironment over Cut-CD. However, this column-wise updatesequence with distributed and parallelization techniques ofPCMF and CDTF can be adapted to Cut-CD to make it moreefficient in a distributed and parallel environment. This isbeyond the scope of this paper.
5 THEORETICAL ANALYSIS
In this section, we analyse Cut-CD in terms of convergence, time complexity, memory requirement and element/gradient updates. As GCD is the state-of-the-art element selection-based method, most of our analytical comparison is with GCD. We use the following symbols in the analysis: R: rank; Z: number of element updates; and J, K, L and M: mode lengths for the factor matrices U^(1), V, W and U^(2), respectively.

The class of optimization methods solving non-convex optimization by alternating between multiple sets of variables is known as the Alternating Direction Method of Multipliers (ADMM) [39]. All CD methods, including Cut-CD, employ ADMM in solving the optimization problem.
Theorem 5.1.
The ADMM method with an update sequence using the element selection based Cut-CD converges to a stationary point for the cut-off value imposed.

Proof.
The optimization function f reaches a stationary point (i.e., local, global, or saddle) only when the following inequality is satisfied:

  f(u^(i+1)) ≤ f(u^i)   (35)

where i indicates the i-th iteration. This states that the function decreases monotonically and a stationary point is reached. The ADMM based update uses the update sequence:

  f(u^(i+1)) = max(0, f(u^i) − f′(u^(i+1)) / f″(u^(i+1))),  i = 0, 1, …   (36)

The gradients for the sequence updating all the elements can be defined as:

  f′(u^i) = − Σ_{j=1}^{J} Σ_{r=1}^{R} g^i_jr   (37)

While S is the set of elements selected, based on the cut-off value (mean(n_*r)) using Cut-CD, to be updated in U^(1), the gradients for the selected elements can be defined as:

  f′_s(u^i) = − Σ_{(j,r) ∈ S} g^i_jr   (38)

Since the elements are selected such that the objective function will be minimized, the following inequality exists:

  f′_s(u^i) ≤ f′(u^i)   (39)

The second-order partial derivative of f can be defined as:

  f″(u^i) = Σ_{r=1}^{R} h^i_rr > Σ_{r ∈ S} h^i_rr = f″_s(u^i) > 0   (40)

With the calculated gradients and second-order partial derivatives, as per (36), we have:

  f(u^(i+1)) ≤ f(u^i) − f′_s(u^(i+1)) / f″_s(u^(i+1))   (41)

Iterating the above inequality gives:

  Σ_{i=0}^{∞} f′_s(u^(i+1)) / f″_s(u^(i+1)) ≤ f_s(u^0) ≤ f(u^0)   (42)

which proves that, for the given cut-off value (mean(n_*r)), f(u^(i+1)) ≤ f(u^i); hence Cut-CD will converge to a stationary point. If Cut-CD is stuck in a saddle point, a second-order derivative test can be used to identify it, and an advanced gradient descent technique such as perturbed gradient descent [40] can be used to escape.

TABLE 2: Time Complexity Analysis.

  Mode length (J)   Rank (R)   O(GCD)       O(Cut-CD)   Times Faster
  1000              10         100,000      10,000      10
  1000              20         400,000      20,000      20
  1000              30         900,000      30,000      30
  1000              50         2,500,000    50,000      50
  1000              100        10,000,000   100,000     100
The time complexity of Cut-CD is O(JKLR + 2JR + JMR + Z), compared with O(JKLR + 3JR + JMR + 3ZR) for GCD; that is, Cut-CD is O(JR + Z(3R - 1)) cheaper than GCD.

For the first iteration, both Cut-CD and GCD need the gradients G to be initialized, as shown in (14). This step includes the calculation of four terms, X(W ⊙ V), U^(1)(V^T V ∗ W^T W), U^(1)U^(2)T U^(2), and YU^(2), which incur complexities of O(JKLR), O(JR), O(JR), and O(JMR), respectively. Additionally, GCD needs the initialization of the element importance matrix E, which costs O(JR). If Z elements are updated in every inner loop, the element updates cost O(ZR). Similarly, each update of the gradients and the element importance matrix E requires O(ZR). On the other hand, as shown in Algorithm 1, Cut-CD calculates the element importance column-wise to avoid this initialization. Also, as per Lemma 4.1, Cut-CD does not need to update the gradients and the importance matrix E of the entire row, and hence its time complexity becomes O(Z). For each factor matrix update, GCD involves R inner loops; hence the inner-loop time complexities of GCD and Cut-CD are O(JR^2) and O(JR), respectively. In Table 2, we fix the value of J at 1000 and substitute different values of R into the inner-loop time complexities of GCD and Cut-CD. The results show that the computational cost of GCD grows quadratically with the tensor rank, whereas Cut-CD is efficient and grows only linearly.
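The inner-loop counts in Table 2 can be reproduced directly from the stated complexities; a sketch (the function names `gcd_cost` and `cut_cd_cost` are our own illustrative stand-ins for O(JR^2) and O(JR)):

```python
def gcd_cost(J, R):
    # GCD: each of the J*R element updates triggers R gradient updates
    return J * R * R

def cut_cd_cost(J, R):
    # Cut-CD: one gradient update per element update
    return J * R

J = 1000
for R in (10, 20, 30, 50, 100):
    speedup = gcd_cost(J, R) // cut_cd_cost(J, R)
    print(J, R, gcd_cost(J, R), cut_cd_cost(J, R), speedup)
```

Running this reproduces the five rows of Table 2, with the speedup equal to the rank R.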
The memory requirement of Cut-CD for a single element update is O(1), in comparison to O(R) for GCD.

For each single element update in a factor matrix, Cut-CD requires the following three items in memory: (1) the element, (2) its gradient, and (3) its second-order derivative value. According to Lemma 4.1, each element update in Cut-CD requires only one gradient update, and hence the memory requirement is O(1). The independent column-wise update of Cut-CD makes it easy to run efficiently if the update process is distributed column-wise across multiple machines. On the other hand, GCD needs the gradients of the entire row to be updated for each single element update. If R is the number of columns in the factor matrix, its memory requirement is O(1) + O(R). The memory requirement of GCD thus grows linearly with the rank, whereas Cut-CD retains the same memory requirement O(1) irrespective of the rank. Importantly, it should be noted that in GCD the selection of elements depends on the entire factor matrix; consequently, it is hard to process GCD in a distributed environment.

In CD-based factorization, the factor matrices are updated element-wise [6]. Using a synthetic dataset, we show the total number of elements and gradients updated in each iteration by Cut-CD and GCD in Fig. 4. As proved in Lemma 4.1, each element update in Cut-CD requires only one gradient update, so the number of gradient updates equals the number of element updates. In GCD, by contrast, the number of gradient updates exceeds the number of element updates. Fig. 4 shows that Cut-CD does not need frequent gradient updates, as GCD does, when updating the elements of a factor matrix; it is enough to update all the gradients of a factor matrix once per iteration. It also shows that the number of element updates is higher in the early iterations of GCD, whereas the element update count of Cut-CD remains constant.
The greedy search strategy of GCD can select the same element for updating multiple times within a single iteration, whereas the threshold-based search strategy of Cut-CD selects and updates a set of elements only once.

Fig. 4: A factor matrix update analysis on the Syn1 dataset (see Table 3).
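A minimal sketch of the column-wise element selection described above (our own illustration, not the authors' code; the exact importance score used by Cut-CD may differ — here the expected decrease g^2/h of a single coordinate step is used as a hypothetical score, with the column mean as the cut-off):

```python
import numpy as np

def cut_cd_pass(U, G, H):
    """One Cut-CD-style pass over a nonnegative factor matrix U.

    G holds the current gradients and H the (positive) second-order
    derivatives, assumed constant per column in this sketch. Each
    column selects elements whose importance exceeds the column mean
    and updates each selected element exactly once, with a single
    gradient read per element update.
    """
    U = U.copy()
    J, R = U.shape
    for r in range(R):
        imp = G[:, r] ** 2 / H[r]       # hypothetical importance score
        selected = imp >= imp.mean()    # column-wise mean cut-off
        step = G[selected, r] / H[r]
        U[selected, r] = np.maximum(0.0, U[selected, r] - step)
    return U

# toy usage: random factor matrix, random gradients, unit curvature
rng = np.random.default_rng(0)
U = rng.random((6, 3))
G = rng.standard_normal((6, 3))
H = np.ones(3)
U_new = cut_cd_pass(U, G, H)
assert (U_new >= 0).all()
```

Because each column is processed independently, the selection and update of one column never touches another, which is what makes the column-wise distribution discussed above possible.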
EMPIRICAL ANALYSIS
We conducted experiments with Cut-CD to answer the following questions:

TABLE 3: Dataset statistics.
Dataset   | Tensor size        | Matrix size | Density of tensor
Syn1      | 1500 x 1500 x 1500 | 1500 x 1500 | 0.00009
LastFM    | 200 x 12500 x 1719 | 200 x 200   | 0.00040
Delicious | 100 x 5010 x 5120  | 100 x 100   | 0.00070
D1        | 37 x 1295 x 7      | 37 x 37     | 0.00700
D2        | 37 x 1295 x 24     | 37 x 37     | 0.00200
D3        | 25000 x 2294 x 7   | 25000 x 385 | 0.00030
D4        | 25000 x 2294 x 24  | 25000 x 385 | 0.00010
Q1.
What is Cut-CD’s runtime performance compared tothe state-of-the-art N-CMTF algorithms?
Q2.
How accurately can Cut-CD approximate the tensor?
Q3.
How accurately can Cut-CD predict the missing data of the tensor? In other words, what is its accuracy in a recommendation or prediction task?
Q4.
How well can Cut-CD with Sparsity Constraint (Cut-CD-SC) identify factors used in pattern mining?

The datasets used for the experiments, the experimental setups, and the evaluation criteria are detailed in Sections 6.1, 6.2, and 6.3, respectively. Q1, Q2, Q3, and Q4 are addressed in Sections 6.4, 6.5, 6.6, and 6.7, respectively.
Several synthetic and openly available real-world datasets were used in the experiments (Table 3). The synthetic dataset (Syn1) was randomly generated with high sparsity. The LastFM, Delicious, and Syn1 datasets were used to test the tensor approximation performance. The LastFM dataset includes a total of 200 users, 12500 artists, and 1719 tags, represented as a tensor with a coupled matrix of 200-user relationships. The Delicious dataset includes a tensor of 100 users (tensor dimensions in Table 3) with a coupled 100 x 100 user-relationship matrix.
37 to identify temporal patterns over the 7 days of the week. We changed the time slots in D1 to 24 hours to generate dataset D2, to identify temporal patterns over 24 hours. The second smart city dataset, Tokyo city Foursquare, consists of 2294 users and their check-ins at 25000 restaurants. The dataset was augmented with auxiliary information of 385 categories for the different locations. We represent (user × location × time) as a tensor and (location × location category) as an additional matrix. We generated two datasets by defining 7 days and 24 hours as the temporal dimension of the tensor, as shown in Table 3. The source code of Cut-CD is available. All experiments were executed on an Intel(R) Xeon(R) CPU E5-2665.
Accuracy is calculated using the Normalized Residual Value (NRV) [30]:

$\text{Approximation Error (NRV)} = \frac{\|X - U^{(1)}(W \odot V)^T\|}{\|X\|}.$  (43)

Prediction accuracy is reported using the Root Mean Square Error (RMSE), and recommendation quality is evaluated using precision, recall, and F1 score:

$\text{RMSE} = \sqrt{\frac{\sum (X_t - \hat{X}_t)^2}{n}},$  (44)

where $X_t$ is the original tensor populated with the test data, $\hat{X}_t$ is the approximated tensor after the factorization process, and $n$ is the number of elements in $X_t$.

$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}.$  (45)

$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}.$  (46)

$\text{F1 Score} = 2\left(\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\right).$  (47)

Pattern Distinctiveness (PD) is used to evaluate the quality of patterns learned using the N-CMTF:

$\text{PD} = \langle t_q, t_r \rangle, \quad \forall q, r \in [1, R], \; q < r,$  (48)

where $t_q$ and $t_r$ indicate the $q$-th and $r$-th columns of the temporal factor matrix $W$. PD measures the similarity of each pattern with the other patterns: the higher the PD value, the higher the similarity between patterns. Since the objective is to identify unique patterns, a lower PD value indicates better quality of the learned patterns.

The runtime of Cut-CD and all benchmarked algorithms in factorizing a tensor is reported in Fig. 5, varying the mode length, density, and rank of the tensor. We first generate randomly initialized synthetic datasets by increasing the mode lengths (J, K, L, M) of the tensor and the matrix while fixing the rank and the tensor density. As shown in Fig. 5a, the runtime of Cut-CD is lower than that of the other algorithms. The runtime of ALS increases exponentially with the data size, whereas Cut-CD, GCD, and CCD++ show linear growth. Comparing the maximum dataset size that ALS and Turbo-SMT OPT can handle, Cut-CD can process a considerably larger dataset. Most importantly, due to high sparsity, Turbo-SMT OPT fails to run even for smaller datasets, as there are not enough observations for partitioning the dataset into multiple smaller subsets.

As discussed before, two components contribute to the time complexity of the factorization process: the MTTKRP calculation and the factor matrix update. As seen in Fig. 5b, Cut-CD is nearly 9.2 times faster than GCD in updating the factor matrix, while the time taken for the MTTKRP calculation remains the same. Fig. 5c shows the runtime on datasets of varying density, with the rank set to 10 and the mode length fixed. Cut-CD gives the best performance, being 4.2 times faster than GCD and 76 times faster than ALS.
The performance advantage of Cut-CD increases with the sparsity. For instance, it is 1.5 times faster than ALS for the denser dataset, whereas it is 76 times faster for the sparse dataset. This demonstrates the efficiency of Cut-CD for sparse datasets, which is the case in most applications. In Fig. 5d, we vary the rank while fixing the density to 0.00001 and the mode length. We choose this setting to show the performance of all algorithms, as for larger mode lengths some of the algorithms (ALS and Turbo-SMT OPT) do not work. The results show that Cut-CD outperforms the other algorithms. Even though CCD++ avoids frequent gradient updates, unlike GCD, its runtime performance is inferior to GCD. This is due to the communication cost associated with the frequent changes in the factor matrix. It is interesting to note that CCD++1 and CCD++3 show similar performance despite the increase in inner iterations, which indicates that the communication cost is the bottleneck influencing the runtime.

Fig. 6 reports the RMSE performance of all the algorithms on the synthetic datasets used in Section 6.4. It is evident that Cut-CD does not trade accuracy for its efficient runtime.

Figures 7, 8, and 9 show the comparative performance of all algorithms in terms of accuracy and convergence. On all datasets, Cut-CD and GCD converged to a better solution faster without compromising accuracy. For the sparse data (Syn1) in Fig. 9, Cut-CD converged to a better solution before the GCD and ALS algorithms, completing only a single iteration. This indicates the effectiveness of Cut-CD in sparse conditions, which are common in real-world applications where other algorithms may suffer.

Fig. 5: Runtime performance of all algorithms with varying mode length, rank, and density of the tensor. Panels: (a) running time vs mode length; (b) running time; (c) running time vs density; (d) running time vs rank. ALS and Turbo-SMT OPT did not scale to larger mode lengths; Turbo-SMT OPT fails to process the tensor for smaller mode lengths and lower densities.

Fig. 6: Accuracy performance on synthetic datasets (used for scalability evaluation). Panels: (a) RMSE vs mode length; (b) RMSE vs density; (c) RMSE vs rank.
Sensitivity to cut-off value:
Cut-CD requires a cut-off value to select the elements for updating during factorization. By normalizing the values in each column, the cut-off value can be set between 0 and 1. Experiments with all datasets, including the very sparse Syn1, reveal that the mean value consistently performs best (Figs. 7d, 8d, and 9d) and also takes the least time to converge. Therefore, the cut-off value is set to the mean, which makes Cut-CD a parameter-free method. As shown in Fig. 8a, the accuracy of Cut-CD with the mean as the cut-off value is lower than that of GCD for the Delicious dataset. With a proper cut-off value selection, its accuracy can be further improved, as supported by Fig. 8d.
N-CMTF can be considered a solution to a recommendation or prediction problem, where the estimated missing data are treated as the predictions. The factor matrices learned during factorization with Cut-CD are used to reconstruct the approximated tensor as per (5), which identifies the missing values. The goal of the recommendation task conducted with the LastFM and Delicious datasets is to predict the missing entries of the tensor as accurately as possible. These entries are then inferred as the items most likely to be recommended to users.

Figs. 10 and 11 show that Cut-CD performs best in comparison to all baseline algorithms. While the other algorithms, including Cut-CD, can handle higher ranks, ALS is not scalable to LastFM and ran out of memory for ranks greater than 75 on the Delicious dataset. With lower ranks, it failed in recommendation generation, yielding the lowest F1 score. Though Turbo-SMT OPT can handle higher ranks, due to random sampling it loses some information, leading to poor performance in prediction and recommendation generation. GCD and CCD++ show poorer accuracy than Cut-CD; moreover, GCD is 5.7 to 8.3 times slower and CCD++ is 8 times slower than Cut-CD.
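Predicting a missing entry from the learned factors amounts to reconstructing the CP approximation; a sketch (our own illustration with random data; factor names follow the paper's notation):

```python
import numpy as np

def reconstruct(U1, V, W):
    # Reconstruct the approximated tensor from CP factors
    # U1 (J x R), V (K x R), W (L x R): X_hat = sum_r u_r o v_r o w_r.
    return np.einsum('jr,kr,lr->jkl', U1, V, W)

rng = np.random.default_rng(1)
J, K, L, R = 4, 5, 6, 3
U1, V, W = rng.random((J, R)), rng.random((K, R)), rng.random((L, R))
X_hat = reconstruct(U1, V, W)

# a "missing" cell (j, k, l) is predicted by its reconstructed value
pred = X_hat[2, 3, 4]
assert np.isclose(pred, sum(U1[2, r] * V[3, r] * W[4, r] for r in range(R)))
```

Entries of X_hat at positions unobserved in the original tensor are the predicted scores used for ranking recommendations.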
The objective function (8) is designed to generate denser factor matrices so that the reconstructed tensor approximates the original tensor as closely as possible. The new or unobserved entries in the approximated tensor can then be used for making predictions or recommendations. However, for the purpose of pattern mining, it is better to generate sparse and distinct factor matrices [42]. We introduce an auxiliary term into the N-CMTF objective function (8) that controls the sparsity level of the factor matrices. We add the $L_{2,1}$ norm to the objective function (8) as

$\min_{U^{(1)}, V, W, U^{(2)} \ge 0} f(U^{(1)}, V, W, U^{(2)}) = \left\|X - [\![U^{(1)}, V, W]\!]\right\|^2 + \left\|Y - U^{(1)} U^{(2)T}\right\|^2 + \lambda \|U^{(1)}\|_{2,1} + \lambda \|V\|_{2,1} + \lambda \|W\|_{2,1} + \lambda \|U^{(2)}\|_{2,1},$  (49)

where $\|U^{(1)}\|_{2,1}$, $\|V\|_{2,1}$, $\|W\|_{2,1}$, and $\|U^{(2)}\|_{2,1}$ are $L_{2,1}$ norms applying the sparsity constraint to each factor matrix, and $\lambda$ indicates the regularization parameter.
Fig. 7: Approximation and runtime performance of all the algorithms on the LastFM dataset. Panels: (a) NRV vs running time; (b) NRV vs iterations; (c) running time vs iterations; (d) threshold sensitivity.
Fig. 8: Approximation and runtime performance of all the algorithms on the Delicious dataset. Panels: (a) NRV vs running time; (b) NRV vs iterations; (c) running time vs iterations; (d) threshold sensitivity.
Fig. 9: Approximation and runtime performance of all the algorithms on the Syn1 dataset. Panels: (a) NRV vs running time; (b) NRV vs iterations; (c) running time vs iterations; (d) threshold sensitivity.
Fig. 10: RMSE, runtime, precision, recall, and F1 score on the LastFM dataset. Panels: (a) RMSE vs rank; (b) running time vs rank; (c) precision and recall; (d) F1 score.

The $L_{2,1}$ norm of a factor matrix, say $\|U^{(1)}\|_{2,1}$, can be calculated as

$\|U^{(1)}\|_{2,1} = \sum_{j=1}^{J} \sqrt{\sum_{r=1}^{R} \left(u^{(1)}_{jr}\right)^2} = \sum_{j=1}^{J} \left\|u^{(1)}_{j*}\right\|.$  (50)

With the sparsity constraint added to the objective function, the number of non-zero elements in the factor matrices is reduced, leading to sparse factor matrices. As the sparsity constraint in (49) is introduced using a norm, it does not change the optimization process, and we can easily apply Cut-CD to N-CMTF with Sparsity Constraint (Cut-CD-SC) by repeating Sections 4.2 to 4.5 and calculating the $L_{2,1}$ norm of each factor matrix using (50).
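The $L_{2,1}$ norm in (50) is straightforward to compute; a minimal sketch:

```python
import numpy as np

def l21_norm(U):
    # sum over rows j of the Euclidean norm of row u_{j*}, as in (50)
    return float(np.sqrt((U ** 2).sum(axis=1)).sum())

U = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [0.0, 5.0]])
print(l21_norm(U))  # row norms 5 + 0 + 5 = 10.0
```

Because the inner 2-norm does not drive individual entries to zero while the outer 1-norm does so row-wise, this penalty encourages entire rows of a factor matrix to vanish, yielding the sparser, more distinct factors sought for pattern mining.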
Fig. 11: RMSE, runtime, precision, recall, and F1 score on the Delicious dataset (o.m.: out of memory). Panels: (a) RMSE vs rank; (b) running time vs rank; (c) precision and recall; (d) F1 score.
In this section, we apply Cut-CD-SC to extract spatio-temporal patterns from the smart city datasets D1 and D2. We represent the users' mobility information in the N-CMTF as (user × location × time slots) with (user × user) to identify spatio-temporal patterns over the 7 days of the week (D1) and over 24 hours (D2). Though NMF can be used to extract spatial and temporal patterns separately by representing (user × location) and (user × time slots) matrices, it cannot capture spatial and temporal patterns that exist together. With the rank R set to 4 during the factorization process, 4 distinct patterns relating to each of the temporal (W ∈ R^(L×R)) and location (V ∈ R^(K×R)) factor matrices are obtained. Each column in a factor matrix becomes a pattern/feature.

Figs. 12a and 12b show the temporal patterns generated by Cut-CD-SC on D1 and D2, where different colors indicate different patterns. The red and pink patterns in Fig. 12a show that people are more active on weekends (Sundays and Saturdays, respectively) and stay inactive over weekdays. The blue pattern indicates that some users are active on Thursdays and remain inactive during the weekend. The red and pink patterns in Fig. 12b provide the finer-grained information that people are engaged in activities between 6 and 10 pm, and in the evening at 6 pm, respectively. The blue pattern in Fig. 12b shows a peak at 6 to 9 am, distinguishing itself from the green pattern.

Using the location factors V, the spatial patterns of D1 and D2 can be identified, as shown in Figs. 13a and 13b, respectively. An interesting pattern emerges when the pink patterns from Fig. 12a and Fig. 13a are associated: people who show similar kinds of activities over all 7 days tend to be active only within their region. It should be noted that the color sizes indicate the distribution within each pattern and do not necessarily show overall dominance. For example, the green factor in Fig. 13a seems to be very active, but this means that the pattern is scattered. Connecting the color patterns of Fig. 12b and Fig. 13b, the red pattern shows 3 hot spots, which are spatio-temporal patterns indicating that people tend to visit 3 places regularly between around 8 am and 6 pm.

Figs. 12c and 12d show the temporal patterns of Cut-CD on D3 and D4, involving Tokyo city Foursquare users.

TABLE 4: Pattern distinctiveness (smaller is desirable).
Method        | D1   | D2   | D3   | D4   | Avg.
ALS           | 0.87 | 0.45 | o.m  | o.m  | -
Turbo-SMT OPT | 0.82 | 0.49 |      | o.m  | -
CCD++1        | 0.85 | 0.50 | 0.99 | 0.62 | 0.74
CCD++3        | 0.85 | 0.50 | 0.99 | 0.62 | 0.74
GCD           | 0.87 | 0.49 | 0.99 | 0.66 | 0.75
Cut-CD        | 0.64 | 0.49 | 0.99 | 0.59 | 0.67
Cut-CD-SC     |      |      |      |      |
For D3 and D4, we set the rank to 2 and 4, respectively. We set the rank to 2 for D3 as no further unique patterns were found in the dataset. Fig. 12c shows that Foursquare users are highly active on weekdays and moderately active on weekends. Fig. 13c shows the respective spatial patterns associated with the red and blue temporal patterns in Fig. 12c. Looking at Fig. 12d, the red pattern shows a peak between 8 and 10 am and another between 6 pm and 12 am. While these patterns occur only at a few hot spots, as shown in Fig. 13d, the blue and green patterns, showing peaks at 10 pm and 9 am, are widespread.
As shown in Table 4, Cut-CD-SC outperforms all the benchmarks by generating more unique and meaningful patterns. The column-wise element selection of Cut-CD-SC enables it to find unique patterns. ALS ran out of memory for D3, whereas both ALS and Turbo-SMT OPT ran out of memory for D4. Cut-CD-SC does not show a significant improvement for D3; this is due to the presence of very few unique patterns in the dataset (Fig. 12c). For datasets D2 and D4, where more patterns exist, Cut-CD-SC generated significantly more unique and meaningful patterns.

In Fig. 14, we compare the patterns derived from Cut-CD and Cut-CD-SC for D4. The red, pink, and blue patterns in Fig. 14a can be paired with the green, blue, and red patterns, respectively, in Fig. 14b. Cut-CD in Fig. 14b identifies one more pattern (pink) that is highly similar to the blue pattern. This shows the inability of Cut-CD to avoid the simultaneous elimination problem (i.e., a state where similar patterns are derived multiple times) [43]. On the other hand, Cut-CD-SC avoids generating the same patterns repeatedly due to the sparsity constraint; therefore, the green pattern in Fig. 14a is a flat line, indicating that there are no more unique patterns.
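As a sketch of how PD in (48) can be computed (assuming, since the aggregation is not spelled out in this section, that the pairwise similarities of the normalized columns are averaged):

```python
import numpy as np

def pattern_distinctiveness(W):
    # average inner product <t_q, t_r> over all column pairs q < r
    # of the column-normalized temporal factor matrix W
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    R = Wn.shape[1]
    sims = [Wn[:, q] @ Wn[:, r] for q in range(R) for r in range(q + 1, R)]
    return float(np.mean(sims))

# orthogonal (perfectly distinct) patterns give PD = 0
assert pattern_distinctiveness(np.eye(4)) == 0.0
```

Identical columns would instead give a PD of 1, so lower values correspond to more distinct patterns, consistent with Table 4.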
Fig. 12: Temporal patterns derived from the 3rd mode (time) of the tensor using Cut-CD-SC. Panels: (a) time factors on D1; (b) time factors on D2; (c) time factors on D3; (d) time factors on D4. The y-axis shows the normalized values of the elements in a column of the factor matrix.

Fig. 13: Spatial patterns derived from the 2nd mode (location) of the tensor using Cut-CD-SC. Panels: (a) spatial pattern on D1; (b) spatial pattern on D2; (c) spatial pattern on D3; (d) spatial pattern on D4.
Fig. 14: Temporal patterns derived from the 3rd mode (time) of the tensor on D4 using Cut-CD-SC and Cut-CD, showing evidence of the simultaneous elimination problem. Panels: (a) Cut-CD-SC; (b) Cut-CD.
CONCLUSION
An element-selection-based CD algorithm, Cut-CD, has been introduced to improve the efficiency of N-CMTF computation. Cut-CD facilitates efficient factorization to reveal useful information in multifaceted datasets. Cut-CD first selects elements in a factor matrix according to their importance, based on the proposed column-wise cut-off technique; all the selected elements of the factor matrix are then updated. We conducted theoretical and empirical studies to demonstrate the effectiveness of Cut-CD. The theoretical analysis shows that the computational efficiency of Cut-CD is achieved by avoiding frequent gradient updates, and proves its convergence property. The empirical analysis shows that Cut-CD outperforms existing state-of-the-art algorithms in terms of scalability and convergence speed without compromising accuracy. It performs especially well in sparse data conditions. Cut-CD has been shown to be efficient not only in missing value identification but also in factor identification problems with the incorporation of a sparsity constraint on the factor matrices. In the future, we will explore Cut-CD for reducing the computational complexity of the MTTKRP.

ACKNOWLEDGMENT
We would like to thank the Lee Kuan Yew Centre for Innovative Cities (Aging Urbanism and NSFC 61750110529) for providing us with the Singapore Elderly dataset.

REFERENCES

[1] E. Acar, T. G. Kolda, and D. M. Dunlavy, "All-at-once optimization for coupled matrix and tensor factorizations," arXiv preprint arXiv:1105.3422, 2011.
[2] T. Balasubramaniam, R. Nayak, and C. Yuen, "Nonnegative coupled matrix tensor factorization for smart city spatiotemporal pattern mining," in Machine Learning, Optimization, and Data Science. Springer, 2019, pp. 520-532.
[3] E. Acar, R. Bro, and A. K. Smilde, "Data fusion in metabolomics using coupled matrix and tensor factorizations," Proceedings of the IEEE, vol. 103, no. 9, pp. 1602-1620, 2015.
[4] E. Acar, M. A. Rasmussen, F. Savorani, T. Næs, and R. Bro, "Understanding data fusion within the framework of coupled matrix and tensor factorizations," Chemometrics and Intelligent Laboratory Systems, vol. 129, pp. 53-63, 2013.
[5] B. Hunyadi, D. Camps, L. Sorber, W. Van Paesschen, M. De Vos, S. Van Huffel, and L. De Lathauwer, "Block term decomposition for modelling epileptic seizures," EURASIP Journal on Advances in Signal Processing, vol. 2014, no. 1, p. 139, 2014.
[6] A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley & Sons, 2009.
[7] N. Iakovidou, P. Symeonidis, and Y. Manolopoulos, "Multiway spectral clustering link prediction in protein-protein interaction networks," in ITAB. IEEE, 2010, pp. 1-4.
[8] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, "Parallelized stochastic gradient descent," in NIPS, 2010, pp. 2595-2603.
[9] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in NIPS, 2001, pp. 556-562.
[10] S. J. Wright, "Coordinate descent algorithms," Mathematical Programming, vol. 151, no. 1, pp. 3-34, 2015.
[11] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE TPAMI, vol. 33, no. 8, pp. 1548-1560, 2011.
[12] E. E. Papalexakis, C. Faloutsos, T. M. Mitchell, P. P. Talukdar, N. D. Sidiropoulos, and B. Murphy, "Turbo-SMT: Accelerating coupled sparse matrix-tensor factorizations by 200x," in SDM. SIAM, 2014, pp. 118-126.
[13] B. W. Bader and T. G. Kolda, "Efficient MATLAB computations with sparse and factored tensors," SIAM Journal on Scientific Computing, vol. 30, no. 1, pp. 205-231, 2007.
[14] A. Beutel, P. P. Talukdar, A. Kumar, C. Faloutsos, E. E. Papalexakis, and E. P. Xing, "FlexiFaCT: Scalable flexible factorization of coupled tensors on Hadoop," in SDM. SIAM, 2014, pp. 109-117.
[15] E. E. Papalexakis, T. M. Mitchell, N. D. Sidiropoulos, C. Faloutsos, P. P. Talukdar, and B. Murphy, "Scoup-SMT: Scalable coupled sparse matrix-tensor factorization," arXiv preprint arXiv:1302.7043, 2013.
[16] J. Oh, K. Shin, E. E. Papalexakis, C. Faloutsos, and H. Yu, "S-HOT: Scalable high-order Tucker decomposition," in WSDM. ACM, 2017, pp. 761-770.
[17] K. Shin, L. Sael, and U. Kang, "Fully scalable methods for distributed tensor factorization," IEEE TKDE, vol. 29, no. 1, pp. 100-113, 2017.
[18] C.-J. Hsieh and I. S. Dhillon, "Fast coordinate descent methods with variable selection for non-negative matrix factorization," in SIGKDD. ACM, 2011, pp. 1064-1072.
[19] P. Bhargava, T. Phan, J. Zhou, and J. Lee, "Who, what, when, and where: Multi-dimensional collaborative recommendations using tensor factorization on sparse user-generated data," in International Conference on World Wide Web, 2015, pp. 130-140.
[20] Y. Zheng, T. Liu, Y. Wang, Y. Zhu, Y. Liu, and E. Chang, "Diagnosing New York City's noises with ubiquitous data," in UbiComp. ACM, 2014, pp. 715-725.
[21] X. Zhu and R. Hao, "Context-aware location recommendations with tensor factorization," in ICCC. IEEE, 2016, pp. 1-6.
[22] E. Acar, G. Gozde, A. R. Morten, R. Daniela, O. D. Lars, and B. Rasmus, "Coupled matrix factorization with sparse factors to identify potential biomarkers in metabolomics," IJKDB, vol. 3, no. 3, pp. 22-43, 2012.
[23] P. Comon, X. Luciani, and A. L. De Almeida, "Tensor decompositions, alternating least squares and other tales," Journal of Chemometrics, vol. 23, no. 7-8, pp. 393-405, 2009.
[24] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455-500, 2009.
[25] B. W. Bader and T. G. Kolda, "Efficient MATLAB computations with sparse and factored tensors," SIAM Journal on Scientific Computing, vol. 30, no. 1, pp. 205-231, December 2007.
[26] H. Imtiaz and A. D. Sarwate, "Distributed differentially private algorithms for matrix and tensor factorization," JSTSP, vol. 12, no. 6, pp. 1449-1464, 2018.
[27] A. Cichocki and A.-H. Phan, "Fast local algorithms for large scale nonnegative matrix and tensor factorizations," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 92, 2009.
[28] P. Richtárik and M. Takáč, "Parallel coordinate descent methods for big data optimization," Mathematical Programming, pp. 433-484, 2016.
[29] J. Kim, Y. He, and H. Park, "Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework," Journal of Global Optimization, vol. 58, no. 2, pp. 285-319, 2014.
[30] K. Kimura and M. Kudo, "Variable selection for efficient nonnegative tensor factorization," in ICDM. IEEE, 2015, pp. 805-819.
[31] H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon, "Scalable coordinate descent approaches to parallel matrix factorization for recommender systems," in ICDM. IEEE, 2012, pp. 765-774.
[32] K. Shin and U. Kang, "Distributed methods for high-dimensional and large-scale tensor factorization," in ICDM. IEEE, 2014, pp. 989-994.
[33] R. A. Rossi and R. Zhou, "Parallel collective factorization for modeling large heterogeneous networks," Social Network Analysis and Mining, vol. 6, no. 1, p. 67, 2016.
[34] M. Chen, Q. Wang, and X. Li, "Adaptive projected matrix factorization method for data clustering," Neurocomputing, vol. 306, pp. 182-188, 2018.
[35] K. Luong, T. Balasubramaniam, and R. Nayak, "A novel technique of using coupled matrix and greedy coordinate descent for multi-view data representation," in International Conference on Web Information Systems Engineering (WISE). Springer, 2018, pp. 285-300.
[36] K. Luong and R. Nayak, "Clustering multi-view data using nonnegative matrix factorization and manifold learning for effective understanding: A survey paper," in Linking and Mining Heterogeneous and Multi-view Data. Springer, 2019, pp. 201-227.
[37] L. R. Tucker, "Some mathematical notes on three-mode factor analysis," Psychometrika, vol. 31, no. 3, 1966.
[38] J. D. Carroll and J.-J. Chang, "Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart-Young decomposition," Psychometrika, vol. 35, no. 3, pp. 283-319, 1970.
[39] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2011.
[40] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, "How to escape saddle points efficiently," in ICML. JMLR.org, 2017, pp. 1724-1732.
[41] B. P. L. Lau, S. H. Marakkalage, S. K. Viswanath, T. Balasubramaniam, Y. Chau, Y. Belinda, and R. Nayak, "Extracting point of interest and classifying environment for low sampling crowd sensing smartphone sensor data," in PerCom Workshops. IEEE, 2017, pp. 201-206.
[42] T. Balasubramaniam, R. Nayak, and C. Yuen, "Understanding urban spatio-temporal usage patterns using matrix tensor factorization," in ICDMW. IEEE, 2018, pp. 1497-1498.
[43] H. Zou and M. Yuan, "The F∞-norm support vector machine," Statistica Sinica, pp. 379-398, 2008.
Thirunavukarasu Balasubramaniam received the B.E. degree from Anna University, Chennai, India, in 2015. He is currently pursuing the Ph.D. degree with the Queensland University of Technology (QUT), Brisbane, Australia. From 2015 to 2016, he was a Research Assistant with the SUTD, Singapore. His current research interests include machine learning, factorization, recommender systems, and pattern mining.
Richi Nayak received the M.E. degree from the Indian Institute of Technology, Roorkee, India, in 1995 and the PhD in Computer Science from QUT, Brisbane, Australia, in 2001. She is currently an Associate Professor of computer science at QUT. Her current research interests include machine learning, text mining, personalization, automation, and social network analysis.
Chau Yuen received the B.Eng. and Ph.D. degrees from NTU, Singapore, in 2000 and 2004, respectively. He was a Post-Doctoral Fellow with Lucent Technologies Bell Labs, NJ, USA, in 2005. He was a Visiting Assistant Professor with Hong Kong PolyU in 2008. From 2006 to 2010, he was a Senior Research Engineer with I2R, Singapore. He is currently an Associate Professor with the SUTD, Singapore. He is an Editor of the IEEE Trans. Commun. and the IEEE Trans. Veh. Technol.