Efficient Nonnegative Tensor Factorization via Saturating Coordinate Descent
THIRUNAVUKARASU BALASUBRAMANIAM,
Queensland University of Technology, Australia
RICHI NAYAK,
Queensland University of Technology, Australia
CHAU YUEN,
Singapore University of Technology and Design, Singapore

With the advancements in computing technology and web-based applications, data is increasingly generated in multi-dimensional form. This data is usually sparse due to the presence of a large number of users and fewer user interactions. To deal with this, Nonnegative Tensor Factorization (NTF) based methods have been widely used. However, existing factorization algorithms are not suitable for processing under all three conditions of size, density, and rank of the tensor. Consequently, their applicability becomes limited. In this paper, we propose a novel fast and efficient NTF algorithm using an element selection approach. We calculate the element importance using Lipschitz continuity and propose a saturation point based element selection method that chooses a set of elements column-wise for updating to solve the optimization problem. Empirical analysis reveals that the proposed algorithm is scalable in terms of tensor size, density, and rank in comparison to the relevant state-of-the-art algorithms.

CCS Concepts: •
Computing methodologies → Factorization methods; • Information systems → Recommender systems; Spatial-temporal systems; •
Theory of computation → Nonconvex optimization.

Additional Key Words and Phrases: Nonnegative Tensor Factorization, Coordinate Descent, Element Selection, Saturating Coordinate Descent, Pattern Mining, Recommender Systems
ACM Reference Format:
Thirunavukarasu Balasubramaniam, Richi Nayak, and Chau Yuen. 2020. Efficient Nonnegative Tensor Factorization via Saturating Coordinate Descent.
ACM Trans. Knowl. Discov. Data.
1, 1, Article 1 (January 2020), 28 pages. https://doi.org/10.1145/3385654
INTRODUCTION

With the advent of tagging, sensor, and Internet of Things (IoT) technologies, online interaction data can be easily generated with tagging or spatio-temporal information. Tensor models become the natural choice to represent such multi-dimensional datasets, for example as (user × image × tag) or (user × location × time) [7, 18, 25, 44]. A Tensor Factorization (TF) based method decomposes the tensor model into multiple factor matrices, where each matrix learns the latent features inherent in the usage dataset [25, 56]. These factor matrices can be used to reconstruct (or approximate) the tensor to predict missing entries. These entries can be inferred as new items, generated from correlations and dependencies in the data, that help in recommendation generation and link prediction [2, 13, 14, 36]. A learned factor matrix reveals the latent features that are able to represent hidden patterns in the dataset in that particular dimension, and can help to understand people's mobility patterns and usage [45].

A TF process is challenging for three main reasons. Firstly, a tensor model tends to grow big as the data size increases. It especially becomes a problem for web applications due to a large Internet population and fewer user interactions. Secondly, because users interact with only a few of the items, the generated data is very sparse. Learning the associations in, or analyzing, this sparse dataset is hard due to the lack of correlation patterns. Thirdly, the complexity of the factorization process increases with an increase in the rank of the tensor. Here, the rank represents the number of hidden features in the dataset; in other words, it decides the size of the factor matrices.

Table 1. Scalability in terms of size (mode length), density (percentage of observed entries in the tensor), and rank of the tensor. Algorithms are ranked as low, medium, high, and very high based on their capability to execute the process without running out of memory or out of time on the synthetic datasets used in Sections 6.2 and 6.6. SaCD and FSaCD are the proposed algorithms.

Algorithm  | Mode length | Density   | Rank
APG [54]   | Low         | Low       | Low
FMU [34]   | Medium      | Low       | Medium
FHALS [35] | Medium      | Low       | Medium
BCDP [51]  | Medium      | Medium    | High
CDTF [39]  | High        | Medium    | Low
GCD [5]    | High        | Medium    | Low
SaCD       | High        | High      | High
FSaCD      | High        | Very High | High
Computing a large, sparse tensor model for exploring hidden latent relationships in the dataset is still a challenging problem. Table 1 shows the limited scalability of existing factorization algorithms. Nonnegative Tensor Factorization (NTF), which imposes a nonnegativity constraint on the factor matrices learned during factorization, is the backbone of recommendation and pattern mining methods that aim to understand interactions between features. A fast and efficient NTF algorithm is needed for developing effective web recommendation and pattern mining methods.

We propose the Saturating Coordinate Descent (SaCD) factorization algorithm that reduces the complexity inherent in the factor matrix update by carefully selecting the important elements to be updated in each iteration. We propose a Lipschitz continuity based element importance calculation to effectively select the elements for an update, which leads to fast convergence. We also propose a fast parallelized version of SaCD (named FSaCD) to further speed up the factorization process, which otherwise suffers from Intermediate Data Explosion (IDE). IDE is caused by the materialization and storage of the intermediate data generated during factorization. The column-wise element selection and element update minimize the complexity of IDE and make parallelization easier.

Extensive experiments with recommendation and pattern mining applications show the improved scalability (i.e., size, density, and rank) of SaCD and FSaCD in comparison to all relevant baseline algorithms. The results also show that the efficiency is achieved at no cost to accuracy. In fact, SaCD achieves high accuracy when compared to other benchmarks.

The specific contribution of this paper is four-fold.

(1) We propose a novel Lipschitz continuity based element importance calculation method to select certain elements for updating in the factorization process.
(2) We propose a novel Saturating Coordinate Descent (SaCD) factorization algorithm that updates factor matrices by selecting elements using the proposed Lipschitz continuity based element importance and leads to fast convergence to a locally optimal solution.
(3) We apply SaCD to recommender systems and pattern mining effectively, with high accuracy and scalability.
(4) We design FSaCD by parallelizing SaCD on a single machine using multiple cores, which further improves the performance on big and dense datasets.

To the best of our knowledge, SaCD is the first algorithm that can efficiently process a large and sparse tensor model with large-rank factor matrices, with and without parallelization.
RELATED WORK

Tensor Factorization based recommender systems and pattern mining methods: Tensor models have been widely applied in several web-based applications and have shown superiority in generating latent patterns [14, 19, 43]. The majority of model-based Collaborative Filtering methods, including Matrix Factorization (MF), fail to utilize a context along with user and item information because of their limited capability to deal with two-dimensional data only [44]. The capability of a tensor model to represent multi-dimensional data makes it a natural choice to represent user interaction data [26]. The last decade has witnessed the rise of tensor-based recommendation methods [18, 36, 44, 46, 53, 55].

A three-dimensional tensor model represents the data as (user × item × context), and a tensor factorization (TF/NTF) algorithm can be used to capture dimensional dependencies and generate recommendations [22]. For example, a (user × item × tag) tensor model has been used in social tagging systems for representing how users have used tags with items [9, 19]. A TF method with higher order singular value decomposition (HOSVD) has been used to recommend items based on tagging behavior or to recommend tags for items [47]. Link prediction in social networks has been solved using Alternating Least Squares (ALS) based NTF by predicting the missing relations in the dataset, for example, identifying relations among users or recommending items to users by considering time as the additional context [14]. Researchers have used the p-core approach to minimize the size and sparsity of the tensor by only including the data where a (user, item) pair occurs at least p times, as the traditional TF/NTF methods are not scalable and their accuracy on sparse datasets is low [7, 19, 36]. However, the p-core approach removes any users/items with fewer than p interactions, thus leaving the recommendation process incomplete for some users.

Recent advancements in IoT allow recording additional spatio-temporal information easily in addition to the web interactions. This has led research to focus on Location-Based Social Networks (LBSNs) [53]. Researchers have successfully applied TF/NTF using ALS for the automated mining of temporal patterns [5, 29]. For example, the Foursquare dataset with user, venue, and time is represented as a tensor, and r patterns are identified by the factor matrix in the time mode, where r is the tensor rank [56]. Similarly, the temporal trajectories of online gamers have been derived using NTF [37]. These patterns allow learning the behavior of web users, which helps to personalize and improve the interactions. Tensor rank is an important parameter in the factorization process that determines how many hidden features of the dataset are learned. These hidden features represent the data effectively in a lower dimension, where the higher the number of hidden features identified, the higher the chances of learning the true representation of the dataset [44].

With the dependence of these methods on TF to learn the associations in the data, it is important that a factorization algorithm can efficiently deal with tensor models of large size, density, and rank.

Fast Tensor Factorization Algorithms: Factorization algorithms, namely Alternating Least Squares (ALS) [23], the Multiplicative Update rule (MU) [12], Gradient Descent (GD) [54], and Coordinate Descent (CD) [17, 30, 50], have been commonly used to factorize tensor models. These traditional algorithms have been extended from MF to tensors.
Researchers have explored the concepts of gradient calculation and tensor blocks to speed up these factorization algorithms. Fast Hierarchical ALS (FHALS) [35] and Fast MU (FMU) [34] are optimized variants of ALS and MU that exploit an efficient gradient calculation method for faster execution. The traditional gradient calculation requires the tensor unfolding at each iteration, which is relatively slow. FHALS and FMU use a simplified update rule with a gradient calculation that does not require the tensor unfolding. Accelerated Proximal Gradient (APG) [54] utilizes a low-rank approximation to speed up the factorization process. The Block Coordinate Descent (BCDP) [51] technique has been applied in NTF to minimize the complexity by processing the tensor in convex blocks. The element selection based CD algorithm called Greedy Coordinate Descent (GCD) [17] has shown fast convergence in the Nonnegative Matrix Factorization (NMF) process. However, the extension of GCD to NTF involves high complexity because of the frequent gradient updates [5].

A few researchers have explored parallel and distributed computational methods to minimize the time complexity involved in the Matricized Tensor Times Khatri-Rao Product (mttkrp) and the matrix updates during TF [8, 11, 21, 32, 33, 39, 41]. CCD++ [52] is a CD algorithm for MF that updates the elements in a factor matrix column-wise. Coordinate Descent for Tensor Factorization (CDTF) [39] and Subset Alternating Least Squares (SALS) [38] are two recent extensions of CCD++ for TF. While CDTF is a direct extension of CCD++ for TF, SALS is a special case with an additional constraint that controls the number of columns to update in a single iteration. The column-wise factor matrix update minimizes the complexity of mttkrp in these algorithms and introduces a huge advantage in distributed and parallel environments. However, these implementations do not alter the computational complexity inherent in the underlying factorization algorithm. We have implemented CDTF for NTF and used it as a benchmark in experiments. SALS cannot be directly extended to NTF due to the additional constraint it imposes.

Limitations: While NTF has been applied widely with tensor based methods to utilize multi-dimensional data, it suffers from a scalability issue due to the complex matrix and tensor products involved in the factorization process [39]. Existing factorization algorithms are limited to small datasets only (as listed in Table 1). While parallel and distributed computational methods improve the runtime and scalability by utilizing huge resources, their performance on a normal machine with less memory remains poor [32]. While optimizing the basic factorization process can improve the performance without depending on huge resources, not all optimized factorization algorithms are easily extendable to a parallel and distributed environment to further improve the performance [24]. Most importantly, not all factorization algorithms can effectively handle tensors of large size, sparsity, and rank [32]. We propose a factorization algorithm, SaCD, to efficiently process the large and sparse tensor model, with and without parallelization, by avoiding the frequent gradient calculations that are essential in existing algorithms [5, 17].
PRELIMINARIES

The notations used in this paper are summarized in Table 2. The process of converting a tensor into a matrix is called matricization or unfolding of the tensor [49]. The mode-1 matricization is X_(1) ∈ R^(Q×(PS)) for a third-order tensor X ∈ R^(Q×P×S), where Q, P, and S denote the mode lengths. Several matrix products are required for tensor-based processes [27]; we briefly introduce them below.

Table 2. Table of Symbols

Symbol   | Definition
X        | tensor (Euler script letter)
U, V, W  | factor matrices
Q, P, S  | length of modes U, V, W respectively
Ω        | set of indices of observable entries of X
x_qps    | (q, p, s)th entry of X
Ω_q^U    | subset of Ω whose mode-U index is q
Ω_p^V    | subset of Ω whose mode-V index is p
Ω_s^W    | subset of Ω whose mode-W index is s
U        | matrix (upper case, bold letter)
u        | vector (lower case, bold letter)
u        | scalar (lower case, italic letter) / element
X_(n)    | mode-n matricization of tensor
R        | rank of tensor
⊗        | Kronecker product
⊙        | Khatri-Rao product
∗        | Hadamard product
∘        | outer product
‖.‖      | Frobenius norm
For two matrices denoted as U ∈ R^(Q×R) and V ∈ R^(P×R), the Kronecker product is written as U ⊗ V. The resultant matrix of size (QP × R²) is defined as follows:

$$\mathbf{U} \otimes \mathbf{V} = \begin{bmatrix} u_{11}\mathbf{V} & u_{12}\mathbf{V} & \cdots & u_{1R}\mathbf{V} \\ u_{21}\mathbf{V} & u_{22}\mathbf{V} & \cdots & u_{2R}\mathbf{V} \\ \vdots & \vdots & \ddots & \vdots \\ u_{Q1}\mathbf{V} & u_{Q2}\mathbf{V} & \cdots & u_{QR}\mathbf{V} \end{bmatrix} \quad (1)$$

$$= \begin{bmatrix} \mathbf{u}_1 \otimes \mathbf{V} & \mathbf{u}_2 \otimes \mathbf{V} & \cdots & \mathbf{u}_R \otimes \mathbf{V} \end{bmatrix}, \quad (2)$$

where u_r is the rth column of the factor matrix U.

The column-wise Kronecker product, called the Khatri-Rao product, is denoted as U ⊙ V. The resultant matrix of size (QP × R) is defined as:

$$\mathbf{U} \odot \mathbf{V} = \begin{bmatrix} \mathbf{u}_1 \otimes \mathbf{v}_1 & \cdots & \mathbf{u}_R \otimes \mathbf{v}_R \end{bmatrix}, \quad (3)$$

where u_r and v_r are columns of the matrices U and V respectively.

When two matrices have the same size, the Hadamard product is their element-wise matrix product. It is denoted by U ∗ V and defined as:

$$\mathbf{U} * \mathbf{V} = \begin{bmatrix} u_{11}v_{11} & u_{12}v_{12} & \cdots & u_{1R}v_{1R} \\ \vdots & \vdots & \ddots & \vdots \\ u_{Q1}v_{Q1} & u_{Q2}v_{Q2} & \cdots & u_{QR}v_{QR} \end{bmatrix}. \quad (4)$$

[Fig. 1. CP Factorization: X ≈ u_1 ∘ v_1 ∘ w_1 + u_2 ∘ v_2 ∘ w_2 + ··· + u_R ∘ v_R ∘ w_R.]

Tensor factorization, a dimensionality reduction technique, is an extension of matrix factorization to higher orders. It factorizes a tensor into factor matrices that contain latent features. CANDECOMP/PARAFAC (CP) and Tucker are two well-known factorization techniques [25]. CP has been shown to be less expensive in both memory and time as compared to Tucker [25, 32].
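To make the notation above concrete, the following is a minimal numpy sketch (ours, not the paper's code) of the mode-1 unfolding and the three matrix products just defined; all variable names are illustrative.

```python
import numpy as np

Q, P, S, R = 4, 3, 2, 2
X = np.random.rand(Q, P, S)

# Mode-1 matricization: (Q x P x S) -> (Q x PS). The exact column-ordering
# convention varies across references; row-major flattening is used here.
X1 = X.reshape(Q, P * S)

U = np.random.rand(Q, R)
V = np.random.rand(P, R)

# Kronecker product, Equation (1): every entry u_qr scales the whole matrix V.
kron = np.kron(U, V)                                   # shape (Q*P, R*R)

# Khatri-Rao product, Equation (3): column-wise Kronecker product.
khatri_rao = (U[:, None, :] * V[None, :, :]).reshape(Q * P, R)

# Hadamard product, Equation (4): element-wise product of same-sized matrices.
hadamard = U * U

print(X1.shape, kron.shape, khatri_rao.shape, hadamard.shape)
# (4, 6) (12, 4) (12, 2) (4, 2)
```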
Definition 1 (CP Factorization):
For a tensor X ∈ R^(Q×P×S) and rank R, CP factorization factorizes the tensor into a sum of component rank-one tensors [10] (as shown in Figure 1):

$$\mathcal{X} \approx \llbracket \mathbf{U}, \mathbf{V}, \mathbf{W} \rrbracket = \sum_{r=1}^{R} \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r, \quad (5)$$

where U ∈ R^(Q×R), V ∈ R^(P×R), and W ∈ R^(S×R) are factor matrices with R hidden features, R ∈ Z⁺.

The goal of CP tensor factorization is to approximate the input tensor, and its objective function can be formulated as follows:

$$\min_{\mathbf{U}, \mathbf{V}, \mathbf{W}} f(\mathbf{U}, \mathbf{V}, \mathbf{W}) = \lVert \mathcal{X} - \llbracket \mathbf{U}, \mathbf{V}, \mathbf{W} \rrbracket \rVert^2. \quad (6)$$

The learning process is modeled as the minimization of the Euclidean loss function shown in Equation (6). The optimization process (Equation (7)) aims to learn the factors that reduce the difference between the original tensor and the approximated tensor, where the predicted value x̂_qps is calculated by the inner product of the learned factors across all the features, as formulated in Equation (8) [12]:

$$f(\mathbf{U}, \mathbf{V}, \mathbf{W}) = \sum_{q=1}^{Q} \sum_{p=1}^{P} \sum_{s=1}^{S} \lVert x_{qps} - \hat{x}_{qps} \rVert^2, \quad (7)$$

$$\hat{x}_{qps} = \sum_{r=1}^{R} u_{qr} \cdot v_{pr} \cdot w_{sr}. \quad (8)$$

The basic idea of NTF is to factorize the n-dimensional tensor into n factor matrices that satisfy the nonnegativity constraint [25]. NTF can be achieved in the traditional Tucker or CP factorization model by imposing constraints to maintain nonnegative values.

[Fig. 2. Architecture of the overall process.]

The objective function for NTF based on the CP model can be formulated as follows:

$$\min_{\mathbf{U}, \mathbf{V}, \mathbf{W} \geq 0} f(\mathbf{U}, \mathbf{V}, \mathbf{W}) = \lVert \mathcal{X} - \llbracket \mathbf{U}, \mathbf{V}, \mathbf{W} \rrbracket \rVert^2. \quad (9)$$

The approximated tensor reconstructed after factorization from the learned factor matrices is a denser model in which a large portion of the absent values are populated [31, 42]. For applications such as predictive modelling and recommender systems, these populated values can be inferred as potential predictions or recommendations [6, 18].

The goal of the optimization problem in Equation (9) is to find accurate factor matrices U, V, and W. ALS [48] is the most common algorithm used to find the factor matrices [35, 40]. It gives equal importance to all the elements and alternately updates the entire factor matrix while fixing all the other factor matrices. These unnecessary element updates can cause slow convergence and poor scalability [1, 17].

In this paper, we propose the Saturating Coordinate Descent (SaCD) algorithm to solve Equation (9) in order to reduce the complexity inherent in the factor matrix update and to overcome the poor scalability of traditional algorithms like ALS [35], APG [54], GCD [17], CDTF [39], and BCDP [51]. Figure 2 provides the detail of the overall process.
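As a concrete illustration of Equations (5)-(9), the following is a small sketch (our own, not the authors' implementation) that reconstructs a tensor from nonnegative factor matrices and evaluates the Euclidean loss.

```python
import numpy as np

Q, P, S, R = 5, 4, 3, 2
rng = np.random.default_rng(0)
U, V, W = rng.random((Q, R)), rng.random((P, R)), rng.random((S, R))

# Equations (5) and (8): x_hat[q, p, s] = sum_r u_qr * v_pr * w_sr.
X_hat = np.einsum('qr,pr,sr->qps', U, V, W)

X = rng.random((Q, P, S))          # stand-in for an observed tensor
loss = np.sum((X - X_hat) ** 2)    # Equation (6)/(7), squared Frobenius norm
print(loss)
```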
SATURATING COORDINATE DESCENT (SaCD)

We first explain the learning process for factor matrix U, which applies analogously to the other factor matrices. The factor matrix U is updated by solving Equation (10), which requires the first order and second order partial derivatives (Equations (11) and (12) respectively) of f in Equation (9):

$$\min_{\mathbf{U} \geq 0} \lVert \mathcal{X}_{(1)} - \mathbf{U}(\mathbf{W} \odot \mathbf{V})^T \rVert^2, \quad (10)$$

$$\frac{\partial f}{\partial \mathbf{U}} = \mathbf{G} = -\mathcal{X}_{(1)}(\mathbf{W} \odot \mathbf{V}) + \mathbf{U}(\mathbf{V}^T\mathbf{V} * \mathbf{W}^T\mathbf{W}), \quad (11)$$

$$\frac{\partial^2 f}{\partial \mathbf{U}^2} = \mathbf{H} = \mathbf{V}^T\mathbf{V} * \mathbf{W}^T\mathbf{W}, \quad (12)$$

where ∂f/∂U and ∂²f/∂U² are the first order and second order partial derivatives of the objective function f (Equation (9)) with respect to U; we denote them as G (gradient) and H (second order derivative) respectively.

The one-variable gradient and second order derivative are computed as:

$$g_{qr} = -(\mathcal{X}_{(1)}(\mathbf{W} \odot \mathbf{V}))_{qr} + (\mathbf{U}(\mathbf{V}^T\mathbf{V} * \mathbf{W}^T\mathbf{W}))_{qr}, \quad (13)$$

$$h_{rr} = (\mathbf{V}^T\mathbf{V} * \mathbf{W}^T\mathbf{W})_{rr}, \quad (14)$$

where g_qr and h_rr represent the (q, r)th entry of the gradient G and the (r, r)th entry of the second order derivative H respectively.

Equation (9) becomes a quadratic equation in terms of the updated parameter when all other parameters are fixed. This leads to the following closed form CD update rule (i.e., one-variable subproblem) [35] for each parameter:

$$\hat{u}_{qr} = \frac{g_{qr}}{h_{rr}}. \quad (15)$$

The one-variable subproblem can be seen as a simplified ALS update rule in which each factor matrix is updated element-wise, thus reducing the computational complexity. The nonnegativity constraint can be added to Equation (15) as:

$$\hat{u}_{qr} \leftarrow \max(0, u_{qr} - \hat{u}_{qr}) - u_{qr}, \quad (16)$$

where u_qr indicates the (q, r)th element of the factor matrix U and û_qr indicates the computed element.

With the computed element value û_qr, the element u_qr can be updated as:

$$u_{qr} \leftarrow u_{qr} + \hat{u}_{qr}. \quad (17)$$

Since the calculation of X_(1)(W ⊙ V) and V^TV ∗ W^TW for every element is expensive, it is better to calculate these values only once per iteration, instead of recalculating them for every element update, during the factorization process.

In NMF, element selection has been proven to converge faster by updating the important elements repeatedly, instead of considering all elements [17]. The traditional measurement of element importance and update is computationally inefficient for NTF due to the frequent gradient updates. Next, we show how element importance can be calculated efficiently so that only important elements are updated, avoiding frequent gradient updates.

The existing partial differential equation based optimization algorithms like APG, FHALS, FMU, and GCD are prone to slow convergence due to inconsistent gradients, i.e., gradients that shift in both directions around the global or local minima of the non-convex optimization curve [23]. Additional regularity conditions have been incorporated to speed up convergence in NMF [15]. In this paper, we propose to smooth (Lipschitz smoothness) the continuous function f with the strong condition of Lipschitz continuity for fast convergence.

Definition 2 (Lipschitz continuity): A continuously differentiable function f is called Lipschitz continuous with Lipschitz constant L if:

$$\lVert f(\mathbf{u}^k) - f(\mathbf{u}^{k+1}) \rVert \leq L \lVert \mathbf{u}^k - \mathbf{u}^{k+1} \rVert, \quad \forall \mathbf{u}^k \in \mathbf{U}^k, \; \mathbf{u}^{k+1} \in \mathbf{U}^{k+1}, \quad (18)$$

where U^k and U^(k+1) are the factor matrices at two consecutive iterations.

Definition 3 (Lipschitz smoothness):
The Lipschitz continuous function, as defined in Equation (18), can be smoothed with the upper bound value for L. This ensures strong convexity of the function and achieves faster convergence. This property is called Lipschitz smoothness, and a function f with this property is called L-smoothed. It can be defined as:

$$f(\mathbf{u}^k) \geq f(\mathbf{u}^{k+1}) + f'(\mathbf{u}^{k+1})(\mathbf{u}^k - \mathbf{u}^{k+1}) + \frac{L}{2} \lVert \mathbf{u}^k - \mathbf{u}^{k+1} \rVert^2. \quad (19)$$

The upper bound value for L should be calculated to satisfy:

$$L\mathbf{I} - f''(\mathbf{U}, \mathbf{V}, \mathbf{W}) \geq 0, \quad (20)$$

where I is an identity matrix.

Equations (11) and (12) reveal that the objective function f is differentiable with respect to the factor matrices and is non-convex. We propose to analyse these properties using Lipschitz continuity and Lipschitz smoothness and to measure the element importance in order to achieve faster convergence.

Let Z be the element importance matrix that holds the importance value for each element in a factor matrix. The importance of element u_qr can be calculated as the difference in the objective function:

$$z_{qr} = g_{qr} - \hat{g}_{qr}, \quad (21)$$

where g_qr represents the gradient of u_qr and ĝ_qr represents the new gradient value at the computed element û_qr.

With g_qr precomputed as per Equation (13), we can compute ĝ_qr using the one-variable subproblem [17] as follows:

$$\hat{g}_{qr} = g_{qr} + g_{qr}\hat{u}_{qr} + 0.5(h_{rr}\hat{u}_{qr}^2). \quad (22)$$

Substituting Equations (13) and (22) into Equation (21), we get the element importance:

$$z_{qr} = -(g_{qr}\hat{u}_{qr}) - 0.5(h_{rr}\hat{u}_{qr}^2). \quad (23)$$

In Equation (23), h_rr is the partial derivative of the gradient g_qr, as defined in Equation (14). An objective function holding Lipschitz continuity with differentiable gradients can be Lipschitz-smoothed (L-smoothed), as proven in Lemma 4.1. This property allows the function to converge faster by replacing the partial derivative of the gradient (i.e., h_rr in (23)) with the Lipschitz constant L [15]. We propose to calculate the difference in the L-smoothed objective function for each element's update during factorization using Lipschitz continuity [16], which can achieve faster convergence. The convergence of an optimization problem can be analyzed using the curvature measure C_f, which measures the deviation of the objective function f from its linear approximation. The Lipschitz smoothness of f has shown the best C_f, improving the convergence speed with respect to the number of iterations K [20]. We conjecture that the use of Lipschitz continuity in the element importance calculation will speed up convergence. For a formal analysis of the curvature measure and Lipschitz continuity, we direct the reader to [20].

By applying the upper bound value of the Lipschitz continuity in the one-variable subproblem of Equation (22), we have:

$$\hat{g}_{qr} = g_{qr} + g_{qr}\hat{u}_{qr} + 0.5L\hat{u}_{qr}^2, \quad (24)$$

where L is a positive scalar called the Lipschitz constant.

With the redefined one-variable subproblem, we can define the Lipschitz element importance by substituting Equations (13) and (24) into Equation (21):

$$z_{qr} = -(g_{qr}\hat{u}_{qr}) - 0.5L\hat{u}_{qr}^2. \quad (25)$$

Lemma 4.1. The gradient of the objective function in Equation (9) satisfies Lipschitz continuity.
Proof. Suppose we have two factor matrices U^(k−1) and U^k at two consecutive iterations. Then

$$\left\lVert \frac{\partial f}{\partial \mathbf{U}^{k-1}} - \frac{\partial f}{\partial \mathbf{U}^{k}} \right\rVert = \left\lVert -\mathcal{X}_{(1)}(\mathbf{W} \odot \mathbf{V}) + \mathbf{U}^{k-1}(\mathbf{V}^T\mathbf{V} * \mathbf{W}^T\mathbf{W}) - \left(-\mathcal{X}_{(1)}(\mathbf{W} \odot \mathbf{V}) + \mathbf{U}^{k}(\mathbf{V}^T\mathbf{V} * \mathbf{W}^T\mathbf{W})\right) \right\rVert. \quad (26)$$

Applying Definition 2 and assuming the objective function is differentiable, we obtain:

$$\left\lVert \frac{\partial f}{\partial \mathbf{U}^{k-1}} - \frac{\partial f}{\partial \mathbf{U}^{k}} \right\rVert = L \left\lVert \mathbf{U}^{k-1} - \mathbf{U}^{k} \right\rVert. \quad (27)$$

Equating Equations (26) and (27), we can identify the value of L as ‖V^TV ∗ W^TW‖. L is the single upper bound value that defines the maximum curvature the objective function is allowed to have, making the function f L-smoothed for faster convergence. Solving Equations (26) and (27) for the one-variable subproblem with u^k ∈ U^k and u^(k+1) ∈ U^(k+1) proves Lemma 4.1. □

Once we have calculated the element importance for all elements of the factor matrix U using Equation (25) and stored them in Z, we select a set of elements in every iteration using the proposed SaCD algorithm. The state-of-the-art GCD NMF algorithm uses a greedy strategy, finding and updating a single element multiple times in each row [17]. This requires very expensive frequent gradient updates. In the minimization problem formulated in Equation (9), the error is reduced in each iteration. It slowly reaches convergence, or a saturation point, beyond which updating will not further minimize the objective function f. Hence, the contribution of each element to minimizing the objective function decreases with each iteration. Moreover, not all elements effectively minimize the objective function. It is sufficient, and efficient, to update each element only until it reaches its saturation point, instead of updating it in every iteration. This avoids the expensive frequent gradient updates, as proven in Lemma 4.2.

As the element importance z_qr is the difference in the objective function as per Equation (21), we can identify the saturation point sp_qr of each element at every iteration k by keeping track of the previous value z_qr^(k−1):

$$sp_{qr} = (z_{qr}^{k} - z_{qr}^{k-1}). \quad (28)$$

Additionally, for each iteration, we measure the total importance of a factor matrix as:

$$ti^{k} = \sum_{q=1}^{Q} \sum_{r=1}^{R} z_{qr}^{k}, \quad (29)$$

$$ti^{k-1} = \sum_{q=1}^{Q} \sum_{r=1}^{R} z_{qr}^{k-1}. \quad (30)$$

If the total importance of the current iteration ti^k is more than that of the previous iteration ti^(k−1), we instead identify the saturation point sp_qr as:

$$sp_{qr} = (z_{qr}^{k-1} - z_{qr}^{k}). \quad (31)$$

While the difference in the objective function gradually decreases, it sometimes increases. This happens when the optimization is approaching a local or global minimum. To avoid getting stuck in a local minimum, this redefinition of the saturation point is needed; it allows important elements to continue being updated until the global minimum is reached. We use this saturation point to decide whether the element u_qr is to be updated as per Equation (17). If sp_qr <
0, we avoid updating that element.

We have explained the proposed algorithm by describing the learning process of factor matrix U. Next we briefly show that the process applies to the other factor matrices too.

Lemma 4.2. In the SaCD element selection, for each update of u_qr, it is not necessary to update the gradient of all the columns of the qth row, nor the element importance z_qr.

Proof. Let u_qr ∈ U be the (q, r)th element of U and g_qr ∈ G represent the (q, r)th gradient of U. For every r = 1, ..., R, the set of important elements e_r is selected based on the saturation point calculated using Equation (28) or Equation (31). The subset e_r depends only on the rth column of G. For each update of u_qr ∈ e_r, only g_qr ∈ G needs to be updated, as e_r ⊥⊥ g_y for y ≠ r and g_qr ⊥⊥ u_nr for n ≠ q, where ⊥⊥ indicates independence (e.g., e_r is independent of g_y). Therefore, for each update of u_qr, it is enough to update g_qr alone; it is not necessary to update g_q∗ across all the columns of the qth row, nor the element importance z_qr. □

Updating solution for factor matrix V: Taking the first and second order partial derivatives of the function f with respect to V, we have a new solution for the gradient G and second order derivative H:

$$\frac{\partial f}{\partial \mathbf{V}} = \mathbf{G} = -\mathcal{X}_{(2)}(\mathbf{W} \odot \mathbf{U}) + \mathbf{V}(\mathbf{U}^T\mathbf{U} * \mathbf{W}^T\mathbf{W}), \quad (32)$$

$$\frac{\partial^2 f}{\partial \mathbf{V}^2} = \mathbf{H} = \mathbf{U}^T\mathbf{U} * \mathbf{W}^T\mathbf{W}. \quad (33)$$

With the updated G and H as per Equations (32) and (33), the element importance is calculated according to Section 4.1 and the factor matrix update is performed in a similar fashion.

Updating solution for factor matrix W: Taking the first and second order partial derivatives of the function f with respect to W, we have a new solution for the gradient G and second order derivative H:

$$\frac{\partial f}{\partial \mathbf{W}} = \mathbf{G} = -\mathcal{X}_{(3)}(\mathbf{V} \odot \mathbf{U}) + \mathbf{W}(\mathbf{U}^T\mathbf{U} * \mathbf{V}^T\mathbf{V}), \quad (34)$$

$$\frac{\partial^2 f}{\partial \mathbf{W}^2} = \mathbf{H} = \mathbf{U}^T\mathbf{U} * \mathbf{V}^T\mathbf{V}. \quad (35)$$

With the updated G and H as per Equations (34) and (35), the element importance is calculated according to Section 4.1 and the factor matrix update is performed in a similar fashion.
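To make the per-element mechanics concrete, the following is an illustrative sketch (our reading of Equations (13)-(17) and (25), not the released SaCD code) of a single nonnegativity-constrained element update with its Lipschitz element importance; the function name and arguments are hypothetical.

```python
import numpy as np

def update_element(U, G, H, L, q, r):
    """One SaCD-style element update of U[q, r]; G, H are precomputed per
    Equations (11)-(12), L is the Lipschitz constant (L = ||H||)."""
    g_qr = G[q, r]                                 # one-variable gradient, Eq. (13)
    h_rr = H[r, r]                                 # second order term, Eq. (14)
    u_hat = g_qr / h_rr                            # one-variable subproblem, Eq. (15)
    u_hat = max(0.0, U[q, r] - u_hat) - U[q, r]    # nonnegative step, Eq. (16)
    z_qr = -(g_qr * u_hat) - 0.5 * L * u_hat ** 2  # element importance, Eq. (25)
    if z_qr > 0:                                   # update only important elements
        U[q, r] += u_hat                           # Eq. (17)
    return z_qr
```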
Algorithm 1 details the process.

Algorithm 1: Saturating Coordinate Descent (SaCD)
Input: tensor X; randomly initialized factor matrices U ∈ R^(Q×R), V ∈ R^(P×R), W ∈ R^(S×R); rank R; Z = ∅; number of rows in a given factor matrix, rows; maximum number of iterations, maxiters.
Output:
learned factor matrices U, V, W

1:  for k = 1 to maxiters do
2:      compute G and H using Equations (11) and (12); L = ‖H‖
3:      if k == 1 then
4:          for r = 1 to R do
5:              for q = 1 to rows do
6:                  compute the element importance z_qr using Equation (25)
7:                  store the initial element importance: z_qr^(k−1) = z_qr
8:                  if z_qr > 0 then update the element u_qr using Equation (17)
9:              end for
10:         end for
11:     else
12:         for r = 1 to R do
13:             for q = 1 to rows do
14:                 compute the element importance z_qr using Equation (25)
15:                 identify the saturation point sp_qr using Equation (28) or Equation (31)
16:                 if sp_qr > 0 then update the element u_qr using Equation (17)
17:                 update the previous element importance: z_qr^(k−1) = z_qr
18:             end for
19:         end for
20:     end if
21:     repeat lines 2 to 20 with G and H calculated using Equations (32) and (33) respectively to update the elements of V (lines 8 and 16)
22:     repeat lines 2 to 20 with G and H calculated using Equations (34) and (35) respectively to update the elements of W (lines 8 and 16)
23: end for

FAST SATURATING COORDINATE DESCENT (FSaCD)

In this section we propose the Fast SaCD (FSaCD) algorithm, which leverages the column-wise element update to speed up the factorization process of SaCD. Instead of pre-computing the gradient using Equation (11), we propose to calculate the gradient column-wise during the column-wise element update. To further improve the performance of SaCD, we utilize the multiple cores of a single machine.

The pre-calculation of G as per Equation (11) contains an expensive mttkrp operation (i.e., X_(1)(W ⊙ V)), a complex step in factorization that causes the Intermediate Data Explosion (IDE). IDE is caused by the materialization and storage of the intermediate data (W ⊙ V). The calculation of mttkrp is a rather infamous computational kernel, with a lot of related work attempting to optimize its computation [3, 11, 41]. We utilize the concept of the sparse tensor times vector product (sttvp), which redesigns the NTF algorithm to calculate mttkrp column-wise [4]. sttvp simplifies mttkrp by multiplying the sparse tensor with a vector instead of a matrix, and minimizes the IDE. We call this fast SaCD algorithm FSaCD. We propose to calculate the element importance and mttkrp column-wise and to parallelize the two calculations together.

Based on Algorithm 1, we calculate the element importance column-wise so that only the respective column of gradients is needed. This enables us to calculate the mttkrp of only one column at a time, as follows. For simplicity, let us denote the mttkrp as:

$$\mathbf{M} = \mathcal{X}_{(1)}(\mathbf{W} \odot \mathbf{V}). \quad (36)$$

The mttkrp operation is simplified using the sparseness of the tensor, and the column-wise mttkrp M can be calculated as:

$$\mathbf{m}_r = \sum_{(q,p,s) \in \Omega_q^U} x_{qps}(w_{sr} v_{pr}), \quad (37)$$

where Ω_q^U indicates the subset of Ω whose mode-U index is q, and v_r and w_r indicate the rth columns of the factor matrices V and W respectively.

Now, the column-wise gradient to solve U is calculated as:

$$\mathbf{g}_r = -\mathbf{m}_r + (\mathbf{U}(\mathbf{V}^T\mathbf{V} * \mathbf{W}^T\mathbf{W}))_r. \quad (38)$$

The column-wise element importance is calculated as:

$$\mathbf{z}_r = -(\mathbf{g}_r * \hat{\mathbf{u}}_r) - 0.5L(\hat{\mathbf{u}}_r * \hat{\mathbf{u}}_r). \quad (39)$$

With the calculated gradient (Equation (38)) and element importance (Equation (39)), the column-wise update is performed as:

$$\mathbf{u}_r \leftarrow \max\left(0, \mathbf{u}_r - \frac{\mathbf{g}_r}{h_{rr}}\right), \quad (40)$$

where u_r indicates the rth column of the factor matrix U and h_rr = (V^TV ∗ W^TW)_rr.

The above process shows the learning of factor matrix U.
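The following sketch illustrates Equations (37)-(40) under our own assumptions about the data layout: idx and vals are a hypothetical COO representation of the sparse tensor (an (nnz × 3) index array of (q, p, s) triples and the observed values x_qps); this is not the released FSaCD code.

```python
import numpy as np

def mttkrp_column(idx, vals, V, W, r, Q):
    """Column-wise mttkrp, Equation (37):
    m_r[q] = sum over (q,p,s) in Omega_q^U of x_qps * w_sr * v_pr."""
    m_r = np.zeros(Q)
    np.add.at(m_r, idx[:, 0], vals * W[idx[:, 2], r] * V[idx[:, 1], r])
    return m_r

def update_column_U(idx, vals, U, V, W, r):
    Q = U.shape[0]
    H = (V.T @ V) * (W.T @ W)                                  # Hadamard term, Eq. (12)
    g_r = -mttkrp_column(idx, vals, V, W, r, Q) + U @ H[:, r]  # Eq. (38)
    U[:, r] = np.maximum(0.0, U[:, r] - g_r / H[r, r])         # Eq. (40)
```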
Next we briefly show that the process applies to the other factor matrices V and W too.

Updating solution for factor matrix V: The mttkrp operation M = X_(2)(W ⊙ U) in Equation (32) can be calculated column-wise as:

$$\mathbf{m}_r = \sum_{(q,p,s) \in \Omega_p^V} x_{qps}(u_{qr} w_{sr}). \quad (41)$$

Now, the column-wise gradient to solve V is computed as:

$$\mathbf{g}_r = -\mathbf{m}_r + (\mathbf{V}(\mathbf{U}^T\mathbf{U} * \mathbf{W}^T\mathbf{W}))_r. \quad (42)$$

The column-wise element importance is calculated as:

$$\mathbf{z}_r = -(\mathbf{g}_r * \hat{\mathbf{v}}_r) - 0.5L(\hat{\mathbf{v}}_r * \hat{\mathbf{v}}_r). \quad (43)$$

With the calculated gradient (Equation (42)) and element importance (Equation (43)), the column-wise update is performed as:

$$\mathbf{v}_r \leftarrow \max\left(0, \mathbf{v}_r - \frac{\mathbf{g}_r}{h_{rr}}\right), \quad (44)$$

where v_r indicates the rth column of the factor matrix V and h_rr = (U^TU ∗ W^TW)_rr.

Updating solution for factor matrix W: The mttkrp operation M = X_(3)(V ⊙ U) in Equation (34) can be calculated column-wise as:

$$\mathbf{m}_r = \sum_{(q,p,s) \in \Omega_s^W} x_{qps}(v_{pr} u_{qr}). \quad (45)$$

Now, the column-wise gradient to solve W is computed as:

$$\mathbf{g}_r = -\mathbf{m}_r + (\mathbf{W}(\mathbf{U}^T\mathbf{U} * \mathbf{V}^T\mathbf{V}))_r. \quad (46)$$

The column-wise element importance is calculated as:

$$\mathbf{z}_r = -(\mathbf{g}_r * \hat{\mathbf{w}}_r) - 0.5L(\hat{\mathbf{w}}_r * \hat{\mathbf{w}}_r). \quad (47)$$

With the calculated gradient (Equation (46)) and element importance (Equation (47)), the column-wise update is performed as:

$$\mathbf{w}_r \leftarrow \max\left(0, \mathbf{w}_r - \frac{\mathbf{g}_r}{h_{rr}}\right), \quad (48)$$

where w_r indicates the rth column of the factor matrix W and h_rr = (U^TU ∗ V^TV)_rr.

With the column-wise gradients calculated as per Equations (38), (42), and (46), and the column-wise element importance calculated as per Equations (39), (43), and (47), the factor matrix update is performed in parallel, as detailed in Algorithm 2.

THEORETICAL ANALYSIS

We analyze SaCD in terms of convergence, time complexity, and memory requirement. We use the following symbols in the analysis: R (rank), K (maximum number of iterations), M (number of factor matrices), and a three-mode tensor X ∈ R^(Q×Q×Q).

In this section, we analyze the convergence of SaCD under the following assumptions.
Assumption 1. The objective function f w.r.t. each factor matrix has a gradient ∇f(u^k) that is continuous, differentiable, and Lipschitz continuous.

Assumption 2. Each element importance is calculated by Equation (25) for all K iterations, and the parameter L^(k−1) obeys l ≤ L^(k−1) ≤ L.

Lemma 5.1. Based on
Assumptions 1 and 2, for the given K iterations, Σ_{k=1}^{∞} ‖u^k − u^{k+1}‖² < ∞.

Proof. From the element selection as per Equation (28), we have the inequality

$$f(\mathbf{u}^k) - f(\mathbf{u}^{k-1}) > \frac{L_k}{2}\lVert \mathbf{u}^{k-1} - \mathbf{u}^k \rVert^2 - \frac{L_{k-1}}{2}\lVert \mathbf{u}^{k-2} - \mathbf{u}^{k-1} \rVert^2. \quad (49)$$

Summing this inequality over k from 1 to K, we have

$$f(\mathbf{u}^0) - f(\mathbf{u}^K) \geq \sum_{k=1}^{K}\left(\frac{L_k}{2}\lVert \mathbf{u}^{k-1} - \mathbf{u}^k \rVert^2 - \frac{L_{k-1}}{2}\lVert \mathbf{u}^{k-2} - \mathbf{u}^{k-1} \rVert^2\right) \geq \sum_{k=1}^{K}\frac{L_k}{2}\lVert \mathbf{u}^{k-1} - \mathbf{u}^k \rVert^2 \geq \sum_{k=1}^{K}\frac{l}{2}\lVert \mathbf{u}^{k-1} - \mathbf{u}^k \rVert^2. \quad (50)$$

As the function f is lower bounded, letting K → ∞ completes the proof. □

The update rule in Equation (22) utilizes the Newton method [28] to apply the updates to the set of important elements identified by SaCD.
Algorithm 2: Fast Saturating Coordinate Descent (FSaCD)
Input:
tensor X; randomly initialized factor matrices U ∈ R^(Q×R), V ∈ R^(P×R), W ∈ R^(S×R); rank R; Z = ∅; number of rows in a given factor matrix, rows; maximum number of iterations, maxiters.
Output:
learned factor matrices U, V, W

1:  for k = 1 to maxiters do
2:      compute H using Equation (12); L = ‖H‖
3:      if k == 1 then
4:          parallel for r = 1 to R do
5:              compute the gradient g_r using Equation (38)
6:              compute the element importance z_r using Equation (39)
7:              store the initial element importance: z_r^(k−1) ← z_r
8:              if z_r > 0 then update u_r using Equation (40)
9:          end parallel for
10:     else
11:         parallel for r = 1 to R do
12:             compute the gradient g_r using Equation (38)
13:             compute the element importance z_r using Equation (39)
14:             for q = 1 to rows do
15:                 identify the saturation point sp_qr using Equation (28) or Equation (31)
16:                 if sp_qr > 0 then update the element u_qr using Equation (17)
17:                 update the previous element importance: z_qr^(k−1) = z_qr
18:             end for
19:         end parallel for
20:     end if
21:     repeat lines 2 to 20 with H, g_r, and z_r calculated using Equations (33), (42), and (43) respectively to update the elements of V (lines 8 and 16)
22:     repeat lines 2 to 20 with H, g_r, and z_r calculated using Equations (35), (46), and (47) respectively to update the elements of W (lines 8 and 16)
23: end for
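The following is a minimal sketch of the column-parallel update in Algorithm 2 (our own illustration; the paper's FSaCD implementation may differ). Because the update of column r depends only on the fixed V and W (Lemma 4.2), the R columns can be computed concurrently from a snapshot and written back independently; idx and vals are the hypothetical COO representation used earlier.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fsacd_update_U(idx, vals, U, V, W, n_workers=4):
    Q, R = U.shape
    H = (V.T @ V) * (W.T @ W)      # shared Hadamard term, Equation (12)
    U0 = U.copy()                  # snapshot keeps the column tasks independent

    def new_column(r):
        m_r = np.zeros(Q)          # column-wise mttkrp, Equation (37)
        np.add.at(m_r, idx[:, 0], vals * W[idx[:, 2], r] * V[idx[:, 1], r])
        g_r = -m_r + U0 @ H[:, r]  # column-wise gradient, Equation (38)
        return np.maximum(0.0, U0[:, r] - g_r / H[r, r])   # Equation (40)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for r, col in enumerate(pool.map(new_column, range(R))):
            U[:, r] = col          # write back after the column is computed
```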
The Newton method update rule using SaCD converges faster to the optimal solution by reaching the saturation point.
Proof. The Newton method based update uses the update sequence

$$u^{k+1} = \max\left(0, u^k - \frac{f'(u^k)}{f''(u^k)}\right), \quad k = 0, 1, \ldots \quad (51)$$

where k indicates the kth iteration. Using SaCD, we select e_qr ∈ E, where E is the set of elements to be updated in U. The gradients and second order derivatives for the sequence can be defined as:

$$f'(u) = -\sum_{q,r \in E} g_{qr}^{k}, \quad (52)$$

$$f''(u) = \sum_{q,r} h_{rr}^{k} > \sum_{q,r \in E} h_{rr}^{k} > 0. \quad (53)$$

As per the properties of linear differential equations, for any positive f''(u), f'(b) ≤ f'(u) + (b − u)f''(u), ∀ u, b ≥
0. As we know that f''(u) >
0, and setting b = u − f'(u)/f''(u), we have

$$f'\left(u - \frac{f'(u)}{f''(u)}\right) \leq 0. \quad (54)$$

With the initialized u, the update sequence holding these properties will converge to a saturation point u_sat = lim_{k→∞} u^k, where f'(u_sat) ≤
0. For larger values of k, by continuity of the given gradients and second order derivatives,

$$\frac{f'(u^k)}{f''(u^k)} < \frac{f'(u_{sat})}{f''(u_{sat})}, \quad (55)$$

$$u_{sat} - u^k < \frac{f'(u_{sat})}{f''(u_{sat})}. \quad (56)$$

From (55) and (56), we have u_sat − u^(k+1) < u^(k+1) ≤ u_sat. Hence, it can be said that the proposed algorithm will converge faster to the optimal solution. □

Lemma 5.3.
The time complexity of SaCD is O(|X| + (M + K)(Q²R + QR² + QR + 1)).

Proof. Algorithm 1 includes five operations: initialization of the factor matrices and the respective importance matrix Z, unfolding of the tensor, gradient calculation, updating of the factor matrices, and updating of the importance matrix. The random initialization of M factor matrices and the respective importance matrix takes O(MQR). Unfolding the tensor takes O(|X|). SaCD requires the gradients G to be calculated before updating the elements, which involves the calculation of the two terms −X_(1)(W ⊙ V) and U(V^TV ∗ W^TW), as shown in Equations (11) and (13), requiring O(Q²R) and O(QR²) respectively. Let E be the total number of elements selected for update. Updating each factor matrix takes O(E) and calculating Z takes an additional O(E), where E ≤ QR. In each iteration the value of E reduces, and at some point it reaches 0. If Z is kept in memory, it can be updated while updating each entry in O(1). Thus, the time complexity of SaCD can be formulated as O(|X| + (M + K)(Q²R + QR² + QR + 1)). □

The time complexity of third-order tensor factorization algorithms, including SaCD, remains cubic (Q²R). However, the element selection in SaCD avoids frequent element updates and controls the complexity, as the number of element updates E ≤ QR.

Lemma 5.4.
The memory requirement of SaCD is O(|X| + MQ(3R + Q)).

Proof. For the factorization of an input tensor X ∈ R^(Q×Q×Q), SaCD stores the following data in memory at each iteration: the unfolded tensor; the factor matrices and their respective importance matrices; and the precomputed gradient and second order derivative. The unfolded tensor with respect to any mode requires O(|X|). The M factor matrices require O(MQR), while the respective importance matrices require the same amount of memory, O(MQR). The precomputed gradients, which are of the same size as the respective factor matrices, require another O(MQR). The Hadamard matrix, a square matrix for each factor matrix, requires an additional O(MQ²) memory. Thus, the memory requirement of SaCD is O(|X| + MQ(3R + Q)). □
EXPERIMENTS

We would like to validate that SaCD can perform the factorization process efficiently as well as accurately. Experiments were conducted to answer the following questions:

Q1. How scalable is SaCD? What is its runtime performance?
Q2. How accurately can SaCD predict missing values, and can it be used for recommendation?
Q3. How accurately does SaCD identify unique patterns, and can it be used for pattern mining?
Q4. What is the impact of parallelization on SaCD?

Several real-world and synthetic datasets were used to evaluate the performance of SaCD in comparison to the state-of-the-art algorithms. Table 3 details the four real-world datasets used. Delicious (https://del.icio.us/) consists of 1797 users' tagging behavior on 24073 URLs with 15752 tags. LastFM consists of 1583 users, 8383 artists, and 3886 tags associated with the artists. Movielens (https://movielens.org/) consists of 1129 users, 3884 movies, and 3693 tags. Gowalla, the LBSN dataset, records 1 Million users' check-in activity at 2 Million locations over the 24 hours of the day.

Table 3. Real-world Dataset Details. (M: Million)

Dataset   | Tensor Size           | Density
Delicious | 1797 × 24073 × 15752  |
LastFM    | 1583 × 8383 × 3886    |
Movielens | 1129 × 3884 × 3693    |
Gowalla   | 1M × 2M × 24          |
The source code of SaCD and its parallelized version has been made available at https://github.com/thirubs/SaCD. All single-core experiments were executed on an Intel(R) Core(TM) CPU with 16 GB RAM. The multi-core experiments in Section 6.4 were executed on an Intel(R) Xeon(R) CPU with 12 GB RAM and 12 cores. For the real-world datasets, we use 5-fold cross-validation with 80% of the data used for training and 20% for testing.

We compare SaCD with the following benchmark algorithms.

(1)
APG [54] uses gradients to accelerate convergence. The objective function is smoothed using proximal gradients, and the gradient calculation is simplified using low-rank approximations: instead of calculating the gradient using the original tensor X, a low-rank approximation of X is used.

(2) FMU [34] and
FHALS [35] are optimized variations of MU and ALS respectively. The tensor unfolding and the Kronecker product in the gradient calculations are simplified to minimize the computational cost.

(3) BCDP [51] decomposes the non-convex optimization function into multiple blocks of convex problems, and the convex blocks are solved cyclically to update the factor matrices. The convex blocks are smoothed for fast convergence.

(4)
CDTF [39] is a recent CD algorithm for TF. For better scalability, the factor matrices are updated column-wise, alternately. For a fair comparison, we use the serial version of CDTF for NTF.

(5)
GCD [5] is an element selection based CD algorithm that selects important elements and updates them repeatedly for fast convergence. The elements are selected row-wise using frequent gradient updates.
Tensor completion has special properties that distinguish it from factorization, such as the effect of the missing values on the rank/regularization selection and on the optimization method. The Root Mean Square Error (
RMSE) is a commonly used metric to evaluate the tensor approximation performance. The recommendation quality is evaluated using precision, recall, and F1 score:
$$\text{RMSE} = \sqrt{\frac{\sum(\mathcal{X}_{test} - \hat{\mathcal{X}}_{test})^2}{n}}, \quad (57)$$

where X_test is the test data, X̂_test is the approximated data, and n is the number of elements in the test data.

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}. \quad (58)$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}. \quad (59)$$

$$\text{F1 score} = 2\left(\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\right). \quad (60)$$

We propose to use Pattern Distinctiveness (PD) to evaluate the quality of the patterns learned using NTF:

$$PD = \text{Cosine}(\mathbf{w}_i, \mathbf{w}_r), \quad \forall i, r \in [1, R], \; i < r, \quad (61)$$

where Cosine(w_i, w_r) indicates the cosine similarity of the ith and rth columns of a factor matrix W. PD measures the similarity of each pattern with the other patterns, so the higher the PD value, the higher the similarity between patterns. Since the objective is to identify unique patterns, a lower PD value demonstrates better quality of the learned patterns.
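The following are minimal sketches (our own, following Equations (57) and (61)) of these two measures. Equation (61) leaves the aggregation over the column pairs implicit; averaging the pairwise similarities is our assumption.

```python
import numpy as np
from itertools import combinations

def rmse(x_test, x_pred):
    """Root mean square error over the n held-out entries, Equation (57)."""
    return np.sqrt(np.mean((np.asarray(x_test) - np.asarray(x_pred)) ** 2))

def pattern_distinctiveness(W):
    """Pairwise cosine similarity of the columns of W; lower is better."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    pairs = combinations(range(W.shape[1]), 2)
    return np.mean([Wn[:, i] @ Wn[:, r] for i, r in pairs])
```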
We evaluate the scalability of SaCD and the other algorithms, with regard to size (mode length), density, and rank of the tensor, using synthetic data of diverse characteristics. We randomly generated tensors of size ranging from 64 × 64 × 64 to 16384 × 16384 × 16384.

Mode length. We increase the mode length Q = P = S of each mode from 2^6 to 2^14, with the tensor density and rank fixed. SaCD processed all the tested sizes, whereas APG [54] ran out of memory beyond a certain mode length, and FMU [34], FHALS [35], and BCDP [51] ran out of memory beyond a larger mode length.

[Fig. 3. Scalability analysis using synthetic datasets: runtime in secs versus (a) mode length, (b) density, and (c) rank for all algorithms, and (d) time taken for the GCD and SaCD factor matrix updates and for mttkrp. FMU and FHALS show similar runtime performance and their lines overlap. o.m.: out of memory.]
Overall, SaCD can factorize 2 to 7 times larger tensors when compared to the existing algorithms. The runtime of SaCD is almost constant for smaller tensors; however, it increases linearly for large tensors. This is due to the matricized tensor times Khatri-Rao product (mttkrp), X_(1)(W ⊙ V), needed in Equation (11) for the gradient calculation. In general, SaCD yields a significant time saving in comparison to the other algorithms, especially GCD and CDTF, due to the avoidance of frequent gradient updates.

Density. In this set of experiments, we fix the tensor mode length and vary the density. As shown in Figure 3(b), SaCD is 13 to 70 times faster than the existing algorithms on the very sparse datasets. The runtime performance of SaCD improves as the sparsity increases, due to a gradual reduction in the mttkrp operation, as shown in Figure 3(d). On the other hand, the runtime performance of GCD and CDTF degrades with an increase in sparsity. In comparison to GCD (Figure 3(d)), there is a significant time saving due to the avoidance of the frequent gradient update.

Rank. Here, we fix the tensor mode length and density and increase the rank.

Tensor Approximation Performance

In addition to a scalable factorization process, it is essential that the approximated (i.e., reconstructed) tensor has good accuracy. The factor matrices learned during factorization are used to reconstruct the approximated tensor, which identifies the missing values. Figure 4 reports the RMSE performance of all the algorithms on the synthetic datasets used in the previous section to evaluate scalability. It is evident that SaCD does not compromise accuracy for better runtime performance and produces the best result, with less error, in comparison to the benchmarks.
[Fig. 4. RMSE performance of all the algorithms on the synthetic datasets used to evaluate scalability, versus (a) mode length, (b) density, and (c) rank. APG does not scale beyond a certain mode length, whereas FMU, FHALS, and BCDP do not scale beyond a larger mode length.]

The NTF problem can be considered a solution to a recommendation or prediction problem where the estimated missing data are treated as predictions. The approximated tensor reconstructed from the factor matrices can be used to infer new values based on associations. In the LastFM, Delicious, and Movielens datasets, the goal is to predict the missing entries of the tensor as accurately as possible. These entries are then inferred as "most likely items" that can be recommended to users.

Table 4. Pattern Distinctiveness (PD) and Runtime on the Gowalla dataset (lower values are better).

Algorithm | PD | Runtime in secs
FMU       | 0.69   | 633.
FHALS     | 0.41   | 623.
BCDP      | 0.63   | 1076.
CDTF      | 0.71   | 2537.
GCD       | o.o.t  | o.o.t
SaCD      |        |
It is evident from Figure 5 that SaCD does not compromise accuracy for better runtime performance; it produces the best result, with less error, in comparison to the benchmarks. Especially on the Movielens dataset, SaCD shows at least 2.
5% accuracy improvement over the other algorithms. Similar to the synthetic datasets, APG and BCDP ran out of memory (o.m.) processing the Delicious dataset, and both FMU and FHALS ran out of time (o.o.t.) as the rank increased. While all other algorithms fail to process the Gowalla dataset at higher ranks, only SaCD can successfully complete the process. The materialization of matrices in these algorithms requires large memory, making them inefficient at higher ranks. It can be noticed that the higher the rank, the higher the accuracy of SaCD. However, due to the increased memory requirements, APG, BCDP, FMU, and FHALS are not suitable for higher ranks. GCD and CDTF are able to process higher ranks; however, they are significantly slower and run out of time.

Similar performance is obtained for the measures of precision, recall, and F1 score. It can be seen in Figure 6 that SaCD significantly outperforms the other algorithms in recommendation performance on all four real-world datasets.

In addition to missing value prediction, NTF can also be used to identify patterns automatically. For LBSN datasets such as Gowalla, time is the 3rd mode. By setting the rank of the tensor to 4 in the factorization process, we identified 4 different patterns in the temporal factor matrix. Table 4 shows the PD values of all algorithms on the Gowalla dataset. SaCD outperforms all the baselines. GCD ran out of time due to the large mode length of the tensor. While FMU and FHALS can execute due to the low rank setting, they are not able to distinguish the patterns distinctly.

In Figure 7, we plot the values of the factor matrix in the "temporal mode": the x-axis shows the 24 hours of the day and the y-axis shows the normalized value of each column of the factor matrix. Figure 7(e) shows the patterns obtained by SaCD. The red pattern peaks at 1pm and 9pm, probably indicating lunch and dinner times. The pink pattern shows a very common 7am to 10pm activity. The blue pattern is unique, with a peak at 1am, indicating night-time activity. The green pattern shows activity between 1pm and 6am. With proper domain knowledge, the kinds of activities that happen in different time periods can easily be interpreted from distinct patterns. It is interesting to note that, unlike FMU and CDTF, SaCD avoids the simultaneous elimination problem (a state where similar patterns are derived multiple times) [57]. In Figure 7(a), the pink and blue patterns are highly similar, and in Figure 7(d), the red pattern is the same as the green pattern and the pink pattern is the same as the blue pattern. In comparison, the patterns derived from SaCD are highly distinctive.
[Fig. 5. Prediction (RMSE) and runtime performance versus rank on the real-world datasets: Delicious, LastFM, Movielens, and Gowalla. APG does not scale for these datasets, while GCD runs out of time for the Gowalla dataset. o.m.: out of memory; o.o.t.: out of time.]
[Fig. 6. Precision, Recall, and F1 score for varying Top-N on the real-world datasets (Delicious, LastFM, Movielens, Gowalla). BCDP runs out of memory on Delicious and GCD runs out of memory on Gowalla.]

[Fig. 7. Temporal patterns derived from the 3rd mode (time) of the tensor on the Gowalla dataset: (a) FMU, (b) FHALS, (c) BCDP, (d) CDTF, (e) SaCD.]

[Fig. 8. Scalability analysis of FSaCD and other algorithms using synthetic datasets: (a) runtime of SaCD and FSaCD versus the number of workers, and runtime of all algorithms versus mode length, density, and rank. APG, FMU, and FHALS run out of memory on the densest settings.]
Figure 3(d) shows the time taken for the MTTKRP (matricized tensor times Khatri-Rao product) and for the factor matrix update without multi-core parallelization. SaCD minimizes the time taken for the factor matrix update by avoiding frequent gradient calculations; however, the complexity of the MTTKRP remains the same as in traditional element selection-based CD algorithms. If the MTTKRP calculation is parallelized, as in the proposed algorithm FSaCD, and executed using the cores of a single machine, the time taken for the MTTKRP can be reduced, as shown in Figure 8(a). We evaluate FSaCD in terms of the mode length, density, and rank of the tensor. To evaluate the performance in terms of density, we set the rank to 100 and progressively decreased the density. As shown in Figure 8(c), FSaCD computes up to 234 times faster than GCD. It is interesting to note that SaCD and FSaCD show similar performance on sparse datasets, indicating that parallelization has no effect there. To evaluate the performance in terms of rank, we fixed the density and increased the rank from 10 to 150. FSaCD performs up to 37 times better than CDTF, as shown in Figure 8(d), which confirms that FSaCD can handle higher ranks easily. It is also interesting to note that FSaCD's runtime converges at higher ranks: increasing the rank further does not increase the runtime.

CONCLUSION

In this paper, we propose SaCD, an element selection-based coordinate descent algorithm that measures the importance of elements for optimization using Lipschitz continuity and provides a saturation point for early stopping. The proposed Lipschitz continuity-based element importance calculation introduces an additional regularity condition to the optimization process and helps to speed up convergence. SaCD can efficiently process NTF at higher mode lengths, ranks, and densities by reducing frequent gradient updates. We also extend SaCD to the parallel environment (FSaCD) to further improve performance. We conducted theoretical and empirical studies to demonstrate the efficiency of SaCD. The theoretical analysis establishes the time complexity and memory requirements, and proves the fast convergence property of SaCD. The empirical analysis shows the superiority of SaCD and FSaCD over the state-of-the-art algorithms in terms of tensor size (mode length), rank, and density, without compromising accuracy. The results demonstrate the applicability of SaCD to recommendation and pattern mining, where efficiency is achieved at no cost to accuracy. SaCD and FSaCD require the element importance matrix to be stored in memory, which makes them challenging to extend to a distributed environment. In future work, we will explore how SaCD and FSaCD can effectively handle datasets in a distributed environment.
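To make the update scheme concrete, the following is a minimal NumPy sketch of one column-wise saturating factor update for a dense 3-way tensor, written in the spirit of the element selection-based CD summarized above; it is an illustration, not the authors' implementation. Element importance is computed as the Lipschitz-bounded objective decrease of a nonnegativity-projected coordinate step, and the saturation point is approximated here by a simple importance-mass threshold `tau`, a hypothetical stand-in for the analytically derived saturation point; the function name and `tau` are assumptions.

```python
import numpy as np

def sacd_style_update(X, A, B, C, tau=0.95):
    """One illustrative column-wise saturating update of factor A for a dense
    3-way tensor X with CP model X[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r].
    Simplified sketch only; tau stands in for the derived saturation point."""
    I, R = A.shape
    # MTTKRP ingredients: Khatri-Rao product of B and C, mode-1 unfolding of X.
    KR = np.einsum('jr,kr->jkr', B, C).reshape(-1, R)   # (J*K, R)
    X1 = X.reshape(I, -1)                               # (I, J*K)
    gram = KR.T @ KR                                    # (R, R)
    G = A @ gram - X1 @ KR                              # gradient of 0.5*||X1 - A @ KR.T||^2
    L = np.diag(gram) + 1e-12                           # per-column Lipschitz constants
    # Nonnegativity-projected single-coordinate step and its objective decrease;
    # the decrease serves as the element importance used for selection.
    step = np.maximum(A - G / L, 0.0) - A
    importance = -(G * step) - 0.5 * L * step ** 2
    # Column-wise saturating selection: update elements in decreasing order of
    # importance until a tau-fraction of the column's importance mass is covered.
    for r in range(R):
        order = np.argsort(importance[:, r])[::-1]
        cum = np.cumsum(importance[order, r])
        if cum[-1] <= 0:
            continue                                    # column already saturated
        k = int(np.searchsorted(cum, tau * cum[-1])) + 1
        chosen = order[:k]
        A[chosen, r] += step[chosen, r]
    return A
```

A convenient property of this least-squares model is that coordinates in the same column of A do not interact in the quadratic approximation, so all selected elements of a column can be updated simultaneously, which is one reason column-wise selection is attractive. In the full algorithm the update cycles over all modes, and for sparse tensors the MTTKRP dominates the cost, which is precisely the part FSaCD parallelizes across cores.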
ACKNOWLEDGEMENT
We would like to express our gratitude to Dr. U Kang, Associate Professor, Seoul National University, for his comments, which greatly improved the manuscript. His willingness to give his time so generously is very much appreciated.
REFERENCES

[1] Evrim Acar, Tamara G Kolda, and Daniel M Dunlavy. 2011. All-at-once optimization for coupled matrix and tensor factorizations. arXiv preprint arXiv:1105.3422 (2011).
[2] Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge & Data Engineering.
[3] SIAM Journal on Scientific Computing.
[5] Machine Learning, Optimization and Data Science (LOD), The Fourth International Conference on. Springer.
[6] Thirunavukarasu Balasubramaniam, Richi Nayak, and Chau Yuen. 2018. People to people recommendation using coupled nonnegative Boolean matrix factorization. In Soft-Computing and Network Security (ICSNS), 2018 International Conference on. IEEE.
[7] Thirunavukarasu Balasubramaniam, Richi Nayak, and Chau Yuen. 2019. Sparsity constraint nonnegative tensor factorization for mobility pattern mining. In Pacific Rim International Conference on Artificial Intelligence. Springer, 582–594.
[8] Alex Beutel, Partha Pratim Talukdar, Abhimanu Kumar, Christos Faloutsos, Evangelos E Papalexakis, and Eric P Xing. 2014. FlexiFaCT: Scalable flexible factorization of coupled tensors on Hadoop. In Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM, 109–117.
[9] Mohamed Reda Bouadjenek, Hakim Hacid, and Mokrane Bouzeghoub. 2016. Social networks and information retrieval, how are they converging? A survey, a taxonomy and an analysis of social information retrieval approaches and platforms. Information Systems 56 (2016), 1–18.
[10] J Douglas Carroll and Jih-Jie Chang. 1970. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika 35, 3 (1970), 283–319.
[11] Joon Hee Choi and S Vishwanathan. 2014. DFacTo: Distributed factorization of tensors. In Advances in Neural Information Processing Systems. 1296–1304.
[12] Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. 2009. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley & Sons.
[13] Daniel M Dunlavy, Tamara G Kolda, and Evrim Acar. 2011. Temporal link prediction using matrix and tensor factorizations. ACM Transactions on Knowledge Discovery from Data (TKDD) 5, 2 (2011), 10.
[14] Beyza Ermiş, Evrim Acar, and A Taylan Cemgil. 2015. Link prediction in heterogeneous data via generalized coupled tensor factorization. Data Mining and Knowledge Discovery 29, 1 (2015), 203–236.
[15] Naiyang Guan, Dacheng Tao, Zhigang Luo, and Bo Yuan. 2012. NeNMF: An optimal gradient method for nonnegative matrix factorization. IEEE Transactions on Signal Processing 60, 6 (2012), 2882–2898.
[16] William W Hager. 1979. Lipschitz continuity for constrained processes. SIAM Journal on Control and Optimization.
[17] Cho-Jui Hsieh and Inderjit S Dhillon. 2011. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1064–1072.
[18] Noor Ifada and Richi Nayak. 2014. Tensor-based item recommendation using probabilistic ranking in social tagging systems. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 805–810.
[19] Noor Ifada and Richi Nayak. 2016. How relevant is the irrelevant data: Leveraging the tagging data for a learning-to-rank model. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 23–32.
[20] Martin Jaggi. 2011. Sparse Convex Optimization Methods for Machine Learning. Ph.D. Dissertation. ETH Zurich.
[21] U Kang, Evangelos Papalexakis, Abhay Harpale, and Christos Faloutsos. 2012. GigaTensor: Scaling tensor analysis up by 100 times - algorithms and discoveries. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 316–324.
[22] Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. 2010. Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 79–86.
[23] Koulik Khamaru and Martin Wainwright. 2018. Convergence guarantees for a class of non-convex and non-smooth optimization problems. In International Conference on Machine Learning. 2606–2615.
[24] Keigo Kimura and Mineichi Kudo. 2015. Variable selection for efficient nonnegative tensor factorization. In Data Mining (ICDM), 2015 IEEE International Conference on. IEEE, 805–810.
[25] Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM Review 51, 3 (2009), 455–500.
[26] Sangeetha Kutty, Lin Chen, and Richi Nayak. 2012. A people-to-people recommendation system using tensor space models. In Proceedings of the 27th Annual ACM Symposium on Applied Computing. ACM, 187–192.
[27] L De Lathauwer. 2008. Decompositions of a higher-order tensor in block terms—Part I: Lemmas for partitioned matrices. SIAM J. Matrix Anal. Appl. 30, 3 (2008), 1022–1032.
[28] Daniel D Lee and H Sebastian Seung. 2001. Algorithms for non-negative matrix factorization. In NIPS. 556–562.
[29] Bin Liu and Hui Xiong. 2013. Point-of-interest recommendation in location based social networks with topic and location awareness. In Proceedings of the 2013 SIAM International Conference on Data Mining. SIAM, 396–404.
[30] Yu Nesterov. 2012. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization 22, 2 (2012), 341–362.
[31] Duy Khuong Nguyen and Tu Bao Ho. 2016. Fast parallel randomized algorithm for nonnegative matrix factorization with KL divergence for large sparse datasets. arXiv preprint arXiv:1604.04026 (2016).
[32] Jinoh Oh, Kijung Shin, Evangelos E Papalexakis, Christos Faloutsos, and Hwanjo Yu. 2017. S-HOT: Scalable high-order Tucker decomposition. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 761–770.
[33] Namyong Park, Byungsoo Jeon, Jungwoo Lee, and U Kang. 2016. BigTensor: Mining billion-scale tensor made easy. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 2457–2460.
[34] Anh Huy Phan, Petr Tichavský, and Andrzej Cichocki. 2012. On fast computation of gradients for CANDECOMP/PARAFAC algorithms. arXiv preprint arXiv:1204.1586 (2012).
[35] Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. 2013. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing 61, 19 (2013), 4834–4846.
[36] Steffen Rendle and Lars Schmidt-Thieme. 2010. Pairwise interaction tensor factorization for personalized tag recommendation. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 81–90.
[37] Anna Sapienza, Alessandro Bessi, and Emilio Ferrara. 2018. Non-negative tensor factorization for human behavioral pattern mining in online games. Information 9, 3 (2018), 66.
[38] Kijung Shin and U Kang. 2014. Distributed methods for high-dimensional and large-scale tensor factorization. In ICDM. IEEE, 989–994.
[39] Kijung Shin, Lee Sael, and U Kang. 2017. Fully scalable methods for distributed tensor factorization. IEEE Transactions on Knowledge and Data Engineering 29, 1 (2017), 100–113.
[40] Nicholas D Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E Papalexakis, and Christos Faloutsos. 2017. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing 65, 13 (2017), 3551–3582.
[41] Shaden Smith, Niranjay Ravindran, Nicholas D Sidiropoulos, and George Karypis. 2015. SPLATT: Efficient and parallel sparse tensor-matrix multiplication. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International. IEEE, 61–70.
[42] Suvrit Sra, Dongmin Kim, and Bernhard Schölkopf. 2008. Non-monotonic Poisson likelihood maximization. Max Planck Institute for Biological Cybernetics, Tech. Rep. 170 (2008).
[43] Lijun Sun and Kay W Axhausen. 2016. Understanding urban mobility patterns with a probabilistic tensor factorization framework. Transportation Research Part B: Methodological 91 (2016), 511–524.
[44] Panagiotis Symeonidis. 2016. Matrix and tensor factorization with recommender system applications. Graph-Based Social Media Analysis 39 (2016), 187.
[45] Panagiotis Symeonidis, Antonis Krinis, and Yannis Manolopoulos. 2013. GeoSocialRec: Explaining recommendations in location-based social networks. In East European Conference on Advances in Databases and Information Systems. Springer, 84–97.
[46] Panagiotis Symeonidis, Alexandros Nanopoulos, and Yannis Manolopoulos. 2008. Tag recommendations based on tensor dimensionality reduction. In Proceedings of the 2008 ACM Conference on Recommender Systems. ACM, 43–50.
[47] Panagiotis Symeonidis, Alexandros Nanopoulos, and Yannis Manolopoulos. 2009. A unified framework for providing recommendations in social tagging systems based on ternary semantic analysis. IEEE Transactions on Knowledge and Data Engineering 22, 2 (2009), 179–192.
[48] Yoshio Takane, Forrest W Young, and Jan De Leeuw. 1977. Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features. Psychometrika 42, 1 (1977), 7–67.
[49] Ledyard R Tucker. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika 31, 3 (1966), 279–311.
[50] Stephen J Wright. 2015. Coordinate descent algorithms. Mathematical Programming.
[51] Yangyang Xu and Wotao Yin. 2013. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences 6, 3 (2013), 1758–1789.
[52] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon. 2012. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In ICDM. IEEE, 765–774.
[53] Yonghong Yu and Xingguo Chen. 2015. A survey of point-of-interest recommendation in location-based social networks. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, Vol. 130.
[54] Yu Zhang, Guoxu Zhou, Qibin Zhao, Andrzej Cichocki, and Xingyu Wang. 2016. Fast nonnegative tensor factorization based on accelerated proximal gradient and low-rank approximation. Neurocomputing 198 (2016), 148–154.
[55] Nan Zheng, Qiudan Li, Shengcai Liao, and Leiming Zhang. 2010. Which photo groups should I choose? A comparative study of recommendation algorithms in Flickr. Journal of Information Science 36, 6 (2010), 733–750.
[56] Yu Zheng, Tong Liu, Yilun Wang, Yanmin Zhu, Yanchi Liu, and Eric Chang. 2014. Diagnosing New York City's noises with ubiquitous data. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM.