Block based Singular Value Decomposition approach to matrix factorization for recommender systems

Pattern Recognition Letters
Prasad Bhavana a,∗∗, Vikas Kumar b, Vineet Padmanabhan a

a Artificial Intelligence Lab, School of Computer and Information Sciences, University of Hyderabad, Hyderabad-500046, Andhra Pradesh, India
b Central University of Rajasthan, Rajasthan, India
∗∗ Corresponding author
ABSTRACT

With the abundance of data in recent years, interesting challenges are posed in the area of recommender systems. Producing high quality recommendations with scalability and performance is the need of the hour. Singular Value Decomposition (SVD) based recommendation algorithms have been leveraged to produce better results. In this paper, we extend the SVD technique further for scalability and performance in the context of 1) multi-threading, 2) multiple computational units (with the use of Graphical Processing Units) and 3) distributed computation. We propose block based matrix factorization (BMF) paired with SVD. This enables us to take advantage of SVD over basic matrix factorization (MF) while gaining parallelism and scalability through BMF. We use the Compute Unified Device Architecture (CUDA) platform and related hardware to leverage a Graphical Processing Unit (GPU) along with block based SVD, and demonstrate the advantages in terms of performance and memory.
1. Introduction
Recommending items to a user based on his/her past preferences is a well-studied problem in the area of Machine Learning (ML). In the recent past, several techniques have been proposed to address the problem of recommendation. These techniques are primarily grouped into three major categories, namely content-based recommendation, collaborative recommendation (filtering) and hybrid recommendation. In the content-based approach, the recommendation is made by using the profile information of a user and an item. For example, in movie recommendation the movie profile can contain the genre information, and based on the user's interest in the genre, a particular movie is recommended. In collaborative filtering, an item is recommended to a user based on his/her past preferences and the preference information of other similar users. For example, in movie recommendation the rating information can be used to find other similar users. The hybrid approach can be seen as a combination of both the content-based and the collaborative model.

Content-based filtering has certain limitations and cannot be applied to situations where the item features are not meaningful, or where there is a need to capture the change in user interests over time. Collaborative filtering alleviates the above mentioned challenges as it only requires the preference (implicit or explicit) information for recommendation. There are several approaches to collaborative filtering, which can be further grouped into memory-based and model-based collaborative filtering. In memory-based collaborative filtering, the recommendation is made by finding the similarity scores between users and items; based on these scores, a list of top-K items is recommended to a user. Recommending items based on the nearest neighborhood is a typical example of memory-based collaborative filtering: given a target user, a set of k-similar users is first identified based on the observed preferences, and the model then recommends a set of items based on the likes of the similar users. In most real-world data sets the observed preferences are very sparse and there are very few items rated by a set of common users. This leads to the calculation of unreliable similarity values, and in such scenarios memory-based models perform very poorly (22). On the other hand, in model-based collaborative filtering, the goal is to learn the latent (hidden) representation of the users and the items. Based on the affinity in the latent space representation, an item is either recommended or not recommended to a user.

Model-based collaborative filtering can be visualized as a matrix completion task. Given a data set of m × n user-item ratings with m users and n items, the aim of collaborative filtering is to predict the unobserved preferences of users for items (9; 10). Matrix factorization (MF) is one of the prominent techniques for matrix completion. The objective of matrix factorization is to learn latent factors U (for users) and V (for items) simultaneously. The latent factors are used for approximation of the observed entries, so as to evaluate the model using some loss measure.
The latent factors thus derived are further used to predict the unobserved entries. With each passing year, more and more preference data gets generated and the task of recommendation becomes more challenging. With the exponential increase in preference data, the major challenge is to provide more accurate recommendations with less computational effort. Several approaches have been proposed in the literature that take advantage of the availability of high volumes of data for better and more accurate recommendation. Though there are a few important proposals, research on scalability, parallelism and distributed computation to handle large volumes of data has not attracted much attention from researchers. For instance, Mackey et al. (14) proposed a divide and conquer based approach for parallelism in matrix factorization, treating the factorization of each sub-matrix as a sub-problem and thereafter combining the results; this approach resulted in noisy factorization. Similarly, in (28) a localized factorization is proposed for recommendation on a block diagonal matrix. In (5), a divide and conquer strategy based Non-Negative Matrix Factorization (NNMF) is proposed for fast clustering and topic modeling; to make the model scalable from rank-2 to rank-k, the authors proposed to use a binary tree structure of the data items. In (2), a block kernel based approach to matrix factorization is proposed for the face recognition task. Nikulin et al. (15) proposed a fast gradient based matrix factorization algorithm for use in statistical analysis. From what has been said till now, it can be noted that the matrix factorization based approach is a popular strategy for recommendation and is still an active area of research. There are a few other notable proposals that handle large data sets by addressing either parallelism or distributed computation, but not both (21; 6; 27; 13; 29; 17; 16; 26; 3). In this paper, we propose a variant of Singular Value Decomposition (SVD) called Block based Singular Value Decomposition for the large scale recommendation task. We also demonstrate how parallelism can be achieved by employing a Graphical Processing Unit (GPU).

The rest of the paper is organized as follows. Section 2 summarizes the well-known existing Singular Value Decomposition approach. In Section 3, we briefly discuss the Block based Matrix Factorization approach and how parallelism can be achieved through the GPU. We introduce the proposed Block based variant of SVD (BSVD) in Section 4. The advantage of the proposed approach over the existing method is reported in Section 5. Finally, Section 6 concludes and indicates several issues for future work.
2. Singular Value Decomposition
Singular value decomposition (SVD) is closely related to a number of mathematical and statistical techniques that are used in a wide variety of fields, including eigenvector decomposition, spectral analysis and factor analysis. SVD is applied to a large variety of applications including dimensionality reduction (4; 1), computer vision (24) and signal processing (24; 12). One of the important applications of SVD is the matrix completion problem wherein, given a data matrix X ∈ R^{m×n} with m rows and n columns, the goal is to derive a set of uncorrelated low-dimensional factors in the "eigen rows" × "eigen columns" space. The numerical rank is much smaller than m and n, and hence factorization allows the matrix to be stored inexpensively; the original data matrix can then be recovered from these low-dimensional factor matrices. Given an m × n matrix X, the SVD of X is defined as

$$\mathrm{SVD}(X) = U S V^T \qquad (1)$$

where U, S and V are of dimensions m × m, m × n, and n × n, respectively. The matrices U and V are orthogonal matrices and S is a diagonal matrix, called the singular matrix. The diagonal entries (s_1, s_2, ..., s_m) of S are in decreasing order, i.e., s_1 ≥ s_2 ≥ ... ≥ s_m > 0. These matrices U, S, and V represent a breakdown of the original relationships into linearly independent components or factors. In the diagonal matrix S, many of the entries are very small and may be ignored, leading to an approximate model that contains many fewer dimensions. With k non-zero entries retained (the size of the reduced dimensional space, i.e., the most significant values), the effective dimensions of the three matrices U, S, and V become m × k, k × k, and n × k, respectively. We can choose a small rank k and extract a matrix of exactly that rank from the SVD; the resulting matrix will still approximate the original matrix. Decreasing the rank smooths out the entries in the recovered matrix by forcing them to be linear combinations of only a few basis vectors, while at the same time matching the sparsely observed ratings as closely as possible. The result can be represented geometrically by a spatial configuration in which the dot product or cosine between user and item vectors represents the estimated similarity of the two objects.
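As a concrete illustration of this rank-k truncation, the following minimal NumPy sketch (ours, using a toy dense matrix rather than a real sparse rating matrix) computes a full SVD and reconstructs a rank-k approximation:

```python
import numpy as np

# Toy dense rating matrix (m users x n items); real data would be sparse.
X = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.],
              [0., 1., 5., 4.]])

k = 2  # chosen rank of the reduced space

# Full SVD: U (m x m), s (singular values in decreasing order), Vt (n x n)
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Keep only the k most significant singular values/vectors.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("rank-%d approximation error: %.4f" % (k, np.linalg.norm(X - X_k)))
```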
Recommender systems are one of the prominent applications of SVD, where the aim is to find and fit a useful model of the relationship between users and items. The idea is to learn the underlying parameters of the model, including the latent factors of users and items, using the observed rating preferences. Using the learnt latent factors, we predict the association between users and items for which the preferences were unobserved. However, computing the SVD of a user-item matrix is expensive and requires a large amount of memory and computational effort; for a reasonable number of users and/or items, it may not even be possible to fit the matrix in memory to begin with. In order to compute the SVD efficiently, (20) proposed a practical approach to leveraging incremental computation of SVD for recommender systems, based on a folding-in technique for factorization. Koren et al. (8) extended the incremental computation of SVD at an element level to capture the temporal changes in user and item biases.

As given in Eq. (1), the goal of SVD computation is to learn the factor matrices U, S, and V. For the sake of simplicity and a meaningful explanation, we can consider the matrix S as an identity matrix: since it is a diagonal matrix, it simply acts as a scalar on U or V^T, and hence we can assume that the scalar factors have been merged into both the matrices U and V during the approximation. The matrix factorization then simply becomes X = U × V^T. Consider the rating value x_ui as the result of a dot product between two vectors: a vector p_u, which is a row of U and is specific to the user u, and a vector q_i, which is a column of V^T and is specific to the item i: x_ui = p_u × q_i. So the SVD of X is merely modeling the rating of user u for item i as

$$x_{ui} = \sum_{f \in \text{latent factors}} (\text{affinity of } u \text{ for } f) \times (\text{affinity of } i \text{ for } f) \qquad (2)$$

In other words, if u has a taste for factors that are endorsed by i, then the rating r_ui will be high. However, due to the elimination of S, the typical/general user and item biases represented by the singular values are eliminated from the prediction of unknown ratings. This causes a deviation in the Root Mean Square Error (RMSE) computed for unknown ratings. An alternative way to factor-in bias is proposed in (8).
The authors proposed adding the biases back into the equation as a linear combination, which is represented as

$$x_{ui} = p_u \cdot q_i^T + bu_u + bi_i \qquad (3)$$

where bu_u represents a singular value capturing the bias of user u and, similarly, bi_i represents a singular value capturing the bias of item i.

In most real-world applications the rating matrix X is partially observed, and for such matrices XX^T and X^T X do not exist, so their eigenvectors do not exist either; hence, the SVD computation is not defined. In such situations, the latent factor matrices U and V can still be learnt if we can find all the vectors p_u, q_i, bu_u and bi_i such that the p_u make up the rows of U and the q_i make up the columns of V^T. The related optimization problem can be represented as

$$\min_{p_u, q_i, bu_u, bi_i} J = \sum_{ui \in \Omega} (x_{ui} - p_u \cdot q_i^T - bu_u - bi_i)^2 \qquad (4)$$

where Ω is the set of observed entries. However, this optimization problem is not convex and hence requires an approximation technique to arrive at a solution. Stochastic Gradient Descent (SGD) is one of the techniques that can find the approximate solution; the optimal values of the latent variables can be obtained by minimizing Eq. (4). The gradients with respect to p_u and q_i (in vector notation) are given by

$$\frac{\partial J}{\partial p_u} = -2\, q_i (x_{ui} - p_u \cdot q_i^T - bu_u - bi_i) \qquad (5)$$

$$\frac{\partial J}{\partial q_i} = -2\, p_u (x_{ui} - p_u \cdot q_i^T - bu_u - bi_i) \qquad (6)$$

Similarly, the gradients with respect to bu_u and bi_i are given by

$$\frac{\partial J}{\partial bu_u} = -2\, (x_{ui} - p_u \cdot q_i^T - bu_u - bi_i) \qquad (7)$$

$$\frac{\partial J}{\partial bi_i} = -2\, (x_{ui} - p_u \cdot q_i^T - bu_u - bi_i) \qquad (8)$$

When the matrix completion problem is viewed as supervised learning with Ω as the training set, it becomes necessary to ensure that overfitting is avoided. This can be done by minimizing the regularized loss function, giving the following formulation:

$$\min_{p_u, q_i, bu_u, bi_i} J = \sum_{ui \in \Omega} (x_{ui} - p_u \cdot q_i^T - bu_u - bi_i)^2 + \beta_1 \sum_{u} \|p_u\|^2 + \beta_2 \sum_{i} \|q_i\|^2 + \beta_3 \sum_{u} bu_u^2 + \beta_4 \sum_{i} bi_i^2 \qquad (9)$$

With the inclusion of the regularization parameters, the update equations corresponding to (5), (6), (7) and (8) can be rewritten with learning coefficients α₁, α₂, α₃, α₄ and regularization coefficients β₁, β₂, β₃, β₄ as shown below, where err = x_ui − p_u · q_i^T − bu_u − bi_i:

$$p_u \leftarrow p_u + \alpha_1 (err \cdot q_i - \beta_1\, p_u) \qquad (10)$$

$$q_i \leftarrow q_i + \alpha_2 (err \cdot p_u - \beta_2\, q_i) \qquad (11)$$

$$bu_u \leftarrow bu_u + \alpha_3 (err - \beta_3\, bu_u) \qquad (12)$$

$$bi_i \leftarrow bi_i + \alpha_4 (err - \beta_4\, bi_i) \qquad (13)$$
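To make the update rules concrete, the following self-contained Python sketch (our illustration, not the authors' code) applies Eqs. (10)-(13) to a list of observed (u, i, x_ui) triples; the hyper-parameter values are placeholders:

```python
import numpy as np

def sgd_svd(ratings, m, n, k=10, steps=50,
            alpha=(0.005, 0.005, 0.005, 0.005),
            beta=(0.02, 0.02, 0.02, 0.02), seed=0):
    """SGD for biased MF; ratings is a list of (u, i, x_ui) triples."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((m, k))   # user latent factors (rows of U)
    Q = 0.1 * rng.standard_normal((n, k))   # item latent factors (rows of V)
    bu, bi = np.zeros(m), np.zeros(n)       # user and item biases
    a1, a2, a3, a4 = alpha
    b1, b2, b3, b4 = beta
    for _ in range(steps):
        for u, i, x in ratings:
            err = x - P[u] @ Q[i] - bu[u] - bi[i]   # residual for this rating
            pu = P[u].copy()                        # freeze p_u for Eq. (11)
            P[u] += a1 * (err * Q[i] - b1 * P[u])   # Eq. (10)
            Q[i] += a2 * (err * pu - b2 * Q[i])     # Eq. (11)
            bu[u] += a3 * (err - b3 * bu[u])        # Eq. (12)
            bi[i] += a4 * (err - b4 * bi[i])        # Eq. (13)
    return P, Q, bu, bi
```

A rating is then predicted as x̂_ui = P[u] · Q[i] + bu[u] + bi[i].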
3. Block based approach to Matrix Factorization
Let X ∈ R^{m×n} be a rating matrix with ratings for m users and n items. The matrix factorization (MF) approach is visualized as an estimation of the data matrix X ≈ UV^T, where the latent factor matrices U ∈ R^{m×k} and V ∈ R^{n×k} (for some chosen dimension k) are derived from the given data. The data matrix can be represented in block notation as given in (14). The representation is based on matrix formation with blocks of equal dimension; if required, zeros can be padded to the data matrix to ensure all the blocks are of equal size.

$$X = \begin{pmatrix} X_{11} & X_{12} & \dots & X_{1j} & \dots & X_{1J} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ X_{i1} & X_{i2} & \dots & X_{ij} & \dots & X_{iJ} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ X_{I1} & X_{I2} & \dots & X_{Ij} & \dots & X_{IJ} \end{pmatrix} \qquad (14)$$

The Block based approach to Matrix Factorization (BMF) considers each block as an individual matrix. It factorizes a block for one iteration and takes the latent features of each of these blocks as the starting point for approximating the latent features of the related blocks thereafter. Figure 1 demonstrates a simple example wherein X is divided into 4 blocks and each of these blocks is factorized individually, so that the results combine together to form U and V that exactly explain X.

Fig. 1. Example of Block Matrix Factorization

BMF considers each element exactly once per iteration; the only difference is the change in the sequence in which the elements are processed. As MF does not constrain the sequence in which the data elements are processed, the convergence of BMF is expected to be equivalent to that of MF. As limiting conditions, matrix factorization can be viewed as BMF where each element is a different block, or where the entire matrix is considered as a single block.
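The zero-padded block partition of Eq. (14) can be realized with a short NumPy helper, a sketch of ours assuming the equal-block convention described above:

```python
import numpy as np

def to_blocks(X, I, J):
    """Pad X with zeros so its shape divides evenly, then return an
    I x J grid of equally sized sub-matrices (Eq. 14)."""
    m, n = X.shape
    bm = -(-m // I)   # ceil(m / I): rows per block
    bn = -(-n // J)   # ceil(n / J): columns per block
    Xp = np.zeros((bm * I, bn * J), dtype=X.dtype)
    Xp[:m, :n] = X
    return [[Xp[i*bm:(i+1)*bm, j*bn:(j+1)*bn] for j in range(J)]
            for i in range(I)]

blocks = to_blocks(np.arange(30).reshape(5, 6), I=2, J=3)  # 2 x 3 grid of 3x2 blocks
```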
Parallelism: Using the matrix blocks, the U_i and V_j matrices can be derived simultaneously for multiple blocks. This is achieved by first identifying the data blocks whose latent factors do not depend on each other, and dedicating a computation unit to each such block. Figure 2 shows one such example.

Fig. 2. An example scenario of parallel BMF
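Blocks X_ij and X_i'j' share no latent factors exactly when i ≠ i' and j ≠ j', so the blocks along a wrapped diagonal of the grid can be processed concurrently. The sketch below is our illustration of that schedule for a square I × I grid; update_block is a hypothetical worker that runs one factorization iteration on block X_ij, updating U_i and V_j (pure Python threads only illustrate the ordering; real CPU parallelism needs processes or a GIL-releasing kernel):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_bmf_iteration(blocks, update_block, I):
    """Run one BMF iteration over an I x I block grid (e.g. from to_blocks).
    Blocks on the same wrapped diagonal touch disjoint U_i, V_j,
    so they can be updated in parallel without locking."""
    with ThreadPoolExecutor(max_workers=I) as pool:
        for d in range(I):  # I rounds, each with I independent blocks
            diagonal = [(i, (i + d) % I) for i in range(I)]
            futures = [pool.submit(update_block, i, j, blocks[i][j])
                       for i, j in diagonal]
            for f in futures:
                f.result()  # barrier: wait before the next diagonal
```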
Parallelism through GPU computation: In order to fully leverage parallelism, the multi-threading capabilities of the GPU can be utilized. In (23), a GPU accelerated matrix factorization is proposed for the approximate Alternating Least Squares (ALS) algorithm; the authors propose to use SGD for the optimization. GPU accelerated Non-Negative Matrix Factorization (NNMF) for Compute Unified Device Architecture (CUDA) capable hardware has been proposed in (7). Similarly, in (11), NNMF with GPU acceleration is used for text mining purposes. From the literature it can be seen that various matrix factorization based approaches have been proposed, including parallel as well as distributed frameworks for scaling up the factorization process, as mentioned in the Introduction. Our approach is to make use of the block based approach for parallelism and to combine it with GPU computation, in order to demonstrate the advantages of the combined approach.
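To indicate how this maps onto CUDA, the following PyCUDA sketch (our own simplification, not the paper's released implementation) assigns one GPU thread per block row and sweeps the wrapped diagonals of a dense I × I block grid. Biases are omitted for brevity, the geometry constants are assumed for illustration, and a CUDA capable device with pycuda installed is required:

```python
import numpy as np
import pycuda.autoinit                      # creates a CUDA context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

KERNEL = r"""
__global__ void block_sgd(float *X, float *U, float *V,
                          int I, int bm, int bn, int k,
                          int d, float alpha, float beta)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // block-row index
    if (i >= I) return;
    int j = (i + d) % I;                            // wrapped diagonal
    int n = I * bn;                                 // padded column count
    for (int r = i * bm; r < (i + 1) * bm; ++r)
        for (int c = j * bn; c < (j + 1) * bn; ++c) {
            float x = X[r * n + c];
            if (x <= 0.0f) continue;                // unobserved entry
            float err = x;
            for (int f = 0; f < k; ++f)
                err -= U[r * k + f] * V[c * k + f];
            for (int f = 0; f < k; ++f) {
                float uf = U[r * k + f], vf = V[c * k + f];
                U[r * k + f] += alpha * (err * vf - beta * uf);
                V[c * k + f] += alpha * (err * uf - beta * vf);
            }
        }
}
"""

mod = SourceModule(KERNEL)
block_sgd = mod.get_function("block_sgd")

I, bm, bn, k = 8, 32, 32, 16                 # assumed block-grid geometry
X = np.random.rand(I * bm, I * bn).astype(np.float32)
U = 0.1 * np.random.rand(I * bm, k).astype(np.float32)
V = 0.1 * np.random.rand(I * bn, k).astype(np.float32)

X_gpu, U_gpu, V_gpu = (cuda.mem_alloc(a.nbytes) for a in (X, U, V))
for a, g in ((X, X_gpu), (U, U_gpu), (V, V_gpu)):
    cuda.memcpy_htod(g, a)

for d in range(I):                           # I launches cover all I*I blocks
    block_sgd(X_gpu, U_gpu, V_gpu,
              np.int32(I), np.int32(bm), np.int32(bn), np.int32(k),
              np.int32(d), np.float32(0.005), np.float32(0.02),
              block=(I, 1, 1), grid=(1, 1))
cuda.memcpy_dtoh(U, U_gpu)
cuda.memcpy_dtoh(V, V_gpu)
```

Because all threads in one launch work on disjoint row and column ranges, no locking is needed; consecutive launches on the same stream serialize, which provides the barrier between diagonals.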
4. Block based approach to SVD
Algorithm 1 Block based approach to factorization of SVD

Require: Data matrix X ∈ R^{m×n}, number of features k
Ensure: Latent feature matrices U ∈ R^{m×k}, V ∈ R^{n×k}, initialized with random values

1: Let I, J be two constants such that X is represented by I × J sub-matrices
2: Represent the data matrix X as a block matrix with sub-matrices X_ij, where i ∈ 1..I and j ∈ 1..J. Similarly, represent the feature matrices U, V as block matrices with sub-matrices U_i, V_j, where i ∈ 1..I and j ∈ 1..J
3: Let bu ∈ R^{1×m}, bi ∈ R^{1×n} be two vectors representing the biases of users and items, respectively
4: Let STEPS be a constant representing the maximum number of iterations for factorization of SVD; let α₁, α₂, α₃, α₄ be the learning rates, β₁, β₂, β₃, β₄ the regularization factors, and δ the minimum deviation of error between iterations
5: for step = 1 to STEPS do
6:   for each block sub-matrix X_ij do
7:     U_i, V_j, bu, bi ← BLOCK_SVD(X_ij, U_i, V_j, bu, bi, k, α₁, α₂, α₃, α₄, β₁, β₂, β₃, β₄)
8:   end for
9:   Terminate if the RMSE improvement is < δ
10: end for
11: Return latent feature matrices U, V

Algorithm 2 SVD based matrix factorization for a block

1: function BLOCK_SVD(X_ij, U_i, V_j, bu, bi, k, α₁, α₂, α₃, α₄, β₁, β₂, β₃, β₄)
2:   for each row r in block X_ij do
3:     for each column c in block X_ij do
4:       if X_ij[r, c] > 0 then
5:         err ← X_ij[r, c] − bu[r] − bi[c] − U_i[r, ∗] · V_j[c, ∗]^T
6:         bu[r] ← bu[r] + α₃ (err − β₃ · bu[r])
7:         bi[c] ← bi[c] + α₄ (err − β₄ · bi[c])
8:         for each latent factor f in 1..k do
9:           U_i[r, f] ← U_i[r, f] + α₁ (err · V_j[c, f] − β₁ · U_i[r, f])
10:          V_j[c, f] ← V_j[c, f] + α₂ (err · U_i[r, f] − β₂ · V_j[c, f])
11:        end for
12:      end if
13:    end for
14:  end for
15:  Return latent feature block matrices and biases U_i, V_j, bu, bi
16: end function

The fundamental idea behind the block based SVD approach is to employ block based factorization with an SVD kernel. A broad outline of the approach is given in Algorithm 1.
The BLOCK_SVD() function in Algorithm 2 encapsulates the implementation of SVD based matrix factorization at the block level, for one iteration only. Considering the number of steps and k as constants, and since the two for-loops (over STEPS and over the sub-matrices) together visit each rating once per step, the time complexity of the algorithm remains O(n) for n ratings. Considering the block size as constant, the space complexity per block (rating block plus the corresponding latent feature slices, with block dimensions m_b × n_b) is ≈ (m_b × n_b + m_b × k + n_b × k) × c ≈ O(c), i.e., constant. The implementation of the Block based SVD algorithm is made publicly available (open source) on GitHub (https://github.com//blockgmf).
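For reference, the pseudocode above can be rendered as a compact single-threaded Python sketch (ours, consistent with Algorithms 1 and 2 but not the authors' released GitHub code); hyper-parameter defaults are placeholders:

```python
import numpy as np

def block_svd(Xij, Ui, Vj, bu, bi, r0, c0, alpha, beta):
    """Algorithm 2: one SGD sweep over one block. r0, c0 are the
    global offsets of the block, so the bias vectors index correctly."""
    a1, a2, a3, a4 = alpha
    b1, b2, b3, b4 = beta
    rows, cols = Xij.shape
    for r in range(rows):
        for c in range(cols):
            x = Xij[r, c]
            if x > 0:  # observed rating
                err = x - bu[r0 + r] - bi[c0 + c] - Ui[r] @ Vj[c]
                bu[r0 + r] += a3 * (err - b3 * bu[r0 + r])
                bi[c0 + c] += a4 * (err - b4 * bi[c0 + c])
                u_old = Ui[r].copy()
                Ui[r] += a1 * (err * Vj[c] - b1 * Ui[r])
                Vj[c] += a2 * (err * u_old - b2 * Vj[c])

def bmf_svd(X, I, J, k, steps=30, alpha=(0.005,)*4, beta=(0.02,)*4, delta=1e-4):
    """Algorithm 1: iterate block_svd over an I x J grid of sub-matrices."""
    m, n = X.shape
    bm, bn = m // I, n // J               # assumes X pre-padded to divide evenly
    rng = np.random.default_rng(0)
    U, V = 0.1 * rng.random((m, k)), 0.1 * rng.random((n, k))
    bu, bi = np.zeros(m), np.zeros(n)
    prev = np.inf
    for _ in range(steps):
        for i in range(I):
            for j in range(J):
                block_svd(X[i*bm:(i+1)*bm, j*bn:(j+1)*bn],   # views: updates
                          U[i*bm:(i+1)*bm], V[j*bn:(j+1)*bn], # write through
                          bu, bi, i * bm, j * bn, alpha, beta)
        obs = X > 0
        rmse = np.sqrt(np.mean(
            (X - (U @ V.T + bu[:, None] + bi[None, :]))[obs] ** 2))
        if prev - rmse < delta:           # early termination on RMSE improvement
            break
        prev = rmse
    return U, V, bu, bi
```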
5. Experiments
In this section, we discuss the experimental setup and report the related results. The experiments mentioned below are conducted on shared hardware with an Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz and an Nvidia Tesla M40 GPU, with 24GB of CPU memory and 8GB of GPU memory. The programming environment is Python 2.7.5, leveraging CUDA with driver version 8.0 and the PyCUDA application programming interface version 1.8. To demonstrate the scalability of the proposed approach, we carried out experiments on five publicly available real-world data sets, namely
MovieLens-100K, MovieLens-1M, MovieLens-10M, MovieLens-20M (https://grouplens.org/datasets/movielens/) and Jester. The data sets are randomly divided into a training set and a test set: we use 80% of the observed entries to train the model and the remaining 20% to test the performance. The experiments are repeated 3 times and the mean and standard deviation of the results are presented. We compare our proposed method with two well-known algorithms, Probabilistic Matrix Factorization (PMF) (19) and SVD. For a fair comparison, we adopted Stochastic Gradient Descent to optimize the objective functions of both PMF and SVD.
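The evaluation protocol is a plain random holdout. As a sketch (ours, assuming a NumPy array of (user, item, rating) triples), it amounts to:

```python
import numpy as np

def holdout_split(ratings, train_frac=0.8, seed=0):
    """Randomly split observed ratings 80/20 into train and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(ratings))
    cut = int(train_frac * len(ratings))
    return ratings[idx[:cut]], ratings[idx[cut:]]

# Repeat with three seeds and report mean +/- std of the test RMSE.
```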
Hyper-parameters: For all the experiments reported in the subsections that follow, the learning rate (α) and the regularization parameter (β = 0.01) in PMF are kept fixed. In SVD, the parameters α₁ and β₁ for the user latent feature computation and α₂ and β₂ for the item latent feature computation are likewise kept fixed. For the user bias computation β₃ = 0.019 is used, and for the item bias computation β₄ = 0.007 is used, with fixed learning rates α₃ and α₄. The minimum difference of error between steps (δ) considered for early termination is the same for all the algorithms.

Fig. 3. PMF vs SVD for MovieLens 1M data set
Figure 3 shows the comparative performance advantage of SVD over basic PMF. It can be seen that the RMSE convergence of SVD is better than that of PMF. We would like to mention that the convergence time may not be an indicator of performance, as the learning rates and the number of parameters differ for each algorithm and are defined as per the requirements of each algorithm. The time per iteration and the RMSE values of the algorithms are detailed in Table 1.
SVD vs Block based SVD: As discussed in Section 3, applying the block based approach to any MF enables us to process the data in parallel without compromising on the outcome. We have implemented a Block based CPU variant of SVD (BCSVD) in line with the update equations discussed in Section 2 and with the use of the block based approach. In order to achieve parallelism, we have implemented multiple CPU threads working in tandem: each thread is dedicated to processing one block at a time, while the data matrix is split into square block matrices. Combinations of different numbers of blocks have been experimented with. While increasing the number of blocks increases parallelism, we observed a performance drop beyond a critical point; we hypothesize that this is due to increased paging as the threads compete for resources. As our objective is to implement and analyze BMF, we limited the scope to parallelism through BMF. Hence, we configured just 8 parallel threads, in accordance with the number of CPU cores available on our hardware, and accordingly split the data matrix into an 8 × 8 block grid. The memory estimates assume a 32 bit floating point data type for the U, V matrices and a 32 bit integer data type for the ratings. For the MovieLens 1M data set, the maximum memory required for each block, including the rating data and latent feature vectors, is estimated to be ~1.513 Mega Bytes (MB). It is observed that the memory needed per block is much lower (~0.3 MB per block) due to the sparse nature of the ratings. The memory needed by the algorithm is estimated to be constant, at a maximum of ~12.1 MB; this estimate excludes the memory needed for the code, program variables, etc. Table 1 lists the comparative analysis of the time taken per iteration and the test RMSE for the different variants of MF listed above. The results confirm our hypothesis that the block based approach has no impact on the quality of the outcome, only on the performance.
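The per-block memory figure quoted above can be approximated with a back-of-the-envelope estimator. The sketch below is ours; it encodes a "dense rating block plus its slices of U and V" accounting, and the feature count k = 40 is an assumed value, not taken from the paper, which lands the bound in the same ballpark as the ~1.5 MB figure reported above:

```python
def block_memory_bytes(bm, bn, k, rating_bytes=4, float_bytes=4):
    """Worst-case memory for one block: dense bm x bn rating block
    plus the U_i (bm x k) and V_j (bn x k) latent feature slices."""
    return bm * bn * rating_bytes + (bm + bn) * k * float_bytes

# MovieLens 1M split into an 8 x 8 grid: 6040 users, 3706 rated movies.
bm, bn = -(-6040 // 8), -(-3706 // 8)      # 755 x 464 entries per block
print(block_memory_bytes(bm, bn, k=40) / 2**20, "MB")  # rough upper bound
```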
Block based GPU accelerated SVD: In order to fully leverage parallelism and to take advantage of the multi-threading capabilities of the GPU, we have implemented a GPU variant of Block based SVD (BGSVD). The implementation is specific to CUDA capable hardware supported by Nvidia. We have leveraged 3072 threads, which is the maximum supported by the hardware; accordingly, we have split the MovieLens 1M data set into a block matrix of 3072 × 3072 dimension. For this experimental setup, the entire data matrix is loaded into GPU memory and then logically split into blocks for determining the boundaries of each thread. It is important to note that no memory optimization techniques were used for GPU acceleration: we have not leveraged the shared memory, nor used different caching techniques to fine tune the performance. Hence the performance of the model is due to the block based approach alone. Figure 5 demonstrates the convergence of BGSVD. Considering that the hardware architectures of the CPU and the GPU are significantly different, the time taken by the CPU variants need not be taken as the basis for comparison against the GPU variants. Table 1 lists the comparative analysis of the time taken per iteration and the test RMSE for BGSVD.
6. Conclusions and Future work
In this work we have proposed a block based approach to SVD that can be scaled and can produce better results when applied in the domain of movie recommender systems. The SVD based kernel provides an advantage in RMSE convergence, while the block based approach enables scalability. The proposed technique does not put a limitation on the size of the data set: the size of the blocks is only limited by the available memory on the computation unit, while there is no limitation on the total number of blocks. The approach provides computational advantages with respect to both time and memory. By increasing the number of blocks, and by running in parallel exactly as many blocks as there are available cores, it is possible to
factorize data sets of any scale. The block based approach can be adapted to run block level factorization on multiple GPUs as well as on distributed systems.

SVD is also useful for incremental data, where all the data is not available initially and new data may arrive after the model building phase. In such a scenario, the models can be computed incrementally: a preliminary model is computed and the projection method is then used to build incrementally upon it. This method was used in (20) to handle dynamic data sets, where it was shown that a projection of additional data can potentially provide a fairly good approximation of the model. A theoretical basis for the incremental computation of SVD is provided in (25). In (18), incremental SVD is used for visual tracking. However, these models are based on matrix algebra that is not applicable to sparse matrices; with sparse matrices, SVD results in complex numbers that cannot be used for predictions. The block based approach to SVD could be enhanced further to accommodate incremental data: a hierarchical factorization model can be developed by considering the initial data as a block and thereafter treating the incremental data as new blocks that are independently factorized. The resultant latent factors can be merged together to arrive at a new model.

The SVD kernel could be further enhanced to factor in temporal changes in user behaviour and time based changes in the popularity of items. Temporal factors were incorporated into the MF algorithm in (8), which eventually won the Netflix prize. Such an implementation can be extended in the context of BMF to take advantage of the computational gain and memory optimization. The GPU implementation of the block based SVD can be enhanced further for memory optimization and data transfer between computational units, as proposed in (23). The approach can also be applied to various other domains like text mining (11).

Fig. 4. SVD vs Block based SVD for MovieLens 1M data set

Fig. 5. BGSVD run on MovieLens 1M data set

Table 1. Comparison of basic SGD based PMF with SVD and with block based SVD implemented both on CPU and on GPU: training time (sec/iteration) and test RMSE (mean ± standard deviation) on the MovieLens 100K, MovieLens 1M, Jester 4M, MovieLens 10M and MovieLens 20M data sets.
References

[1] M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37:573–595, 1995.
[2] W. Chen, Y. Li, B. Pan, and C. Xu. Block kernel non-negative matrix factorization and its application to face recognition. In IJCNN, pages 3446–3452, 2016.
[3] Wei-Sheng Chin, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. A learning-rate schedule for stochastic gradient methods to matrix factorization. In KDD, pages 442–455, 2015.
[4] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[5] Rundong Du, Da Kuang, Barry Drake, and Haesun Park. DC-NMF: Nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling. Journal of Global Optimization, 68(4):777–798, August 2017.
[6] Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In SIGKDD, pages 69–77, 2011.
[7] Sven Koitka and Christoph M. Friedrich. nmfgpu4R: GPU-accelerated computation of the non-negative matrix factorization using CUDA capable hardware. The R Journal, 8(2):382–392, 2016.
[8] Yehuda Koren. The BellKor solution to the Netflix Grand Prize, 2009.
[9] Vikas Kumar, Arun K. Pujari, Sandeep Kumar Sahu, Venkateswara Rao Kagita, and Vineet Padmanabhan. Collaborative filtering using multiple binary maximum margin matrix factorizations. Information Sciences, 380:1–11, 2017.
[10] Vikas Kumar, Arun K. Pujari, Sandeep Kumar Sahu, Venkateswara Rao Kagita, and Vineet Padmanabhan. Proximal maximum margin matrix factorization for collaborative filtering. Pattern Recognition Letters, 86:62–67, 2017.
[11] Volodymyr Kysenko, Karl Rupp, Oleksandr Marchenko, Siegfried Selberherr, and Anatoly Anisimov. GPU-accelerated non-negative matrix factorization for text mining. In NIPS, pages 158–163, 2012.
[12] Lieven De Lathauwer and Joos Vandewalle. Dimensionality reduction in higher-order signal processing and rank-(r1, r2, ..., rn) reduction in multilinear algebra. Linear Algebra and its Applications, 391:31–55, 2004. Special Issue on Linear Algebra in Signal and Image Processing.
[13] Boduo Li, Sandeep Tata, and Yannis Sismanis. Sparkler: supporting large-scale matrix factorization. In Joint EDBT/ICDT Conferences, pages 625–636, 2013.
[14] Lester W. Mackey, Michael I. Jordan, and Ameet Talwalkar. Divide-and-conquer matrix factorization. In NIPS, pages 1134–1142, 2011.
[15] Vladimir Nikulin, Tian-Hsiang Huang, Shu-Kay Ng, Suren I. Rathnayake, and Geoffrey J. McLachlan. A very fast algorithm for matrix factorization. Statistics & Probability Letters, 81(7):773–782, 2011. Statistics in Biological and Medical Sciences.
[16] Jinoh Oh, Wook-Shin Han, Hwanjo Yu, and Xiaoqian Jiang. Fast and robust parallel SGD matrix factorization. In KDD, pages 865–874, 2015.
[17] Benjamin Recht, Christopher Ré, Stephen J. Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693–701, 2011.
[18] David A. Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1):125–141, May 2008.
[19] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In NIPS, pages 1257–1264, 2007.
[20] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Incremental singular value decomposition algorithms for highly scalable recommender systems. In Fifth International Conference on Computer and Information Science, pages 27–28, 2002.
[21] Sebastian Schelter, Venu Satuluri, and Reza Zadeh. Factorbird - a parameter server approach to distributed matrix factorization. CoRR, abs/.
[22] Xiaoyuan Su and Taghi M. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009, 2009.
[23] Wei Tan, Shiyu Chang, Liana L. Fong, Cheng Li, Zijun Wang, and Liang Cao. Matrix factorization on GPUs with memory optimization and approximate computing. In ICPP, 2018.
[24] Tat-Jun Chin, K. Schindler, and D. Suter. Incremental kernel SVD for face recognition with image sets. In FGR, pages 461–466, April 2006.
[25] Jengnan Tzeng. Split-and-combine singular value decomposition for large-scale matrix. Journal of Applied Mathematics, vol. 2013, March 2013.
[26] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In ICDM, pages 765–774, 2012.
[27] Hyokun Yun, Hsiang-Fu Yu, Cho-Jui Hsieh, S. V. N. Vishwanathan, and Inderjit Dhillon. NOMAD: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. Proc. VLDB Endow., 7(11):975–986, 2014.
[28] Yongfeng Zhang, Min Zhang, Yiqun Liu, Shaoping Ma, and Shi Feng. Localized matrix factorization for recommendation based on matrix block diagonal forms. In WWW, pages 1511–1520. ACM, 2013.
[29] Yong Zhuang, Wei-Sheng Chin, Yu-Chin Juan, and Chih-Jen Lin. A fast parallel stochastic gradient descent for matrix factorization in shared memory systems. In RecSys, 2013.