MatRec: Matrix Factorization for Highly Skewed Dataset
Hao Wang
Huahui Changtian, Beijing, China
[email protected]

Bing Ruan
Huahui Changtian, Beijing, China
[email protected]
ABSTRACT
Recommender systems are one of the most successful AI technologies applied by internet corporations. Popular internet products such as TikTok, Amazon, and YouTube have all integrated recommender systems as a core product feature. Although recommender systems have achieved great success, it is well known that on highly skewed datasets engineers and researchers need to adjust their methods to yield good results. Inability to deal with highly skewed datasets usually creates hard computational problems for big data clusters and unsatisfactory results for customers. In this paper, we propose a new algorithm that solves the problem within the framework of matrix factorization. We model the data skewness factors in the theoretical formulation of the approach, with formulas that are easy to interpret and easy to implement. We show in experiments that our method generates comparably favorable results against popular recommender system algorithms such as Learning to Rank, Alternating Least Squares and Deep Matrix Factorization.
CCS Concepts
• Information systems ~ Information systems applications ~ Data mining
Keywords
Matrix factorization, Recommender system, Skewness
1. INTRODUCTION
Recommender systems have become overwhelmingly successful in the internet business after entrepreneurs spent many years searching for a valid business model for the technology. In its early years, the technology was mainly used as an integrated functionality of websites such as e-commerce platforms, helping companies generate higher retention rates and profits. A large group of scientists and engineers have developed hundreds of recommender system models since its invention. The earliest successful models include collaborative filtering [1] and matrix factorization [2]. The stream of small modifications and micro-inventions eventually led to technological milestones such as learning-to-rank models [3], factorization machines [4] and deep learning models such as DeepFM [5].

We categorize recommender systems into shallow models and deep models. The first class incorporates shallow machine learning technologies such as matrix factorization and learning to rank, while the second class comprises deep learning models like Wide and Deep [6]. Although a bit out of date, shallow models are still widely used in small companies and projects where agility, usability and efficiency far outweigh a boost in performance that is only economically visible for huge datasets. It has been well known since the invention of the first shallow model that data skewness and sparsity pose serious challenges for recommender system performance. The setbacks are twofold: data skewness causes problems that need special treatment in Hadoop/Spark computation, and it also deteriorates algorithm performance, since skewed datasets usually lead to poorer results than uniformly distributed ones.

For instance, suppose Peter has a special interest in history books. He reads hundreds of history books ranging from Herodotus to Niall Ferguson. However, the sci-fi series The Three Body Problem has become extremely popular in recent years, and Peter bought one of the books at the e-commerce site whose recommender system is one of his favorite book selection tools. Peter loves The Three Body Problem and gives it a five-star rating on the website. Afterwards, he discovers that the recommender system that used to give him history book recommendations now gives him recommendations from other genres. This is because The Three Body Problem is a best seller: so many users have read the book that their reading tastes are all reflected in the user similarity computation with Peter's reading taste, and they spoil the recommendation result.

The data skewness problem drew attention from researchers years ago, and recently they have started to quantify the effect of data skewness in a scientific way. Wang et al. [7] calculate analytical formulas for data skewness and sparsity effects in both user-based and item-based collaborative filtering. Cañamares et al. [8] model the data skewness problem in a probabilistic way. Although quantification of the problem has only just begun, attempts to solve it have a long history, with different scientists resorting to different heuristics and solutions. For example, Wang et al. [9] point out that by taking advantage of embedding vectors of side information, it is possible to ameliorate the data skewness problem.

Since recommender models ultimately need to be deployable and run-time efficient in real production environments, we seek to solve the data skewness problem within the framework of matrix factorization. We aim to solve difficult problems using simple and understandable methods that have been neglected by the community in the past decade. After providing technical details in the following sections, we give experimental results for our algorithm. We show that our algorithm is not only run-time efficient, but also superior in performance to many well-known algorithms such as Bayesian Personalized Ranking [3].
2. RELATED WORK
Recommender systems have a long history of evolution with many interesting and important innovations. They have developed from shallow models such as matrix factorization [2] and learning to rank [3] to more sophisticated models that rely on deep learning. The main evolutionary theme of recommender systems is clear, as optimization of evaluation metrics such as MAE, AUC and NDCG is the major driving force behind new inventions. However, major challenges such as data skewness and data sparsity remain unsolved. Techniques aiming to solve such problems usually focus on developing new models that yield better performance, with no explicit or direct effort toward, or explanation of, the resolution of the problems themselves. In other words, the data skewness and sparsity problems are usually tackled indirectly, without a good understanding of where the improvement in performance comes from.

The first efforts to resolve the problems scientifically were made by Wang et al. [7] and Cañamares et al. [8], who independently model the data skewness and sparsity problems in combinatorial and probabilistic frameworks. The quantification of the problems and their effects is a major milestone in understanding the intrinsic structural problems of recommendation algorithms. However, application of their theory remains open and challenging. With the research community's focus shifting from evaluation metric improvement for shallow models to proposing new deep learning models, direct resolution of the problems remains largely ignored by scientists and engineers. In this paper, we propose a simple model that directly addresses the data skewness and sparsity problems by integrating the corresponding factors into the problem modeling.

The framework in which we build our theory is matrix factorization [2]. Although our approach can easily be extended to more sophisticated models, we choose to illustrate our ideas in a well-understood and easy-to-implement model. Matrix factorization has been less researched in recent years due to the rising interest in deep learning models, but it is still a valuable model in industry, where not everyone has mastered deep learning and shallow models are often used as baselines for experimental comparison. Famous matrix factorization models include Alternating Least Squares [10], SVD-based recommendation [11], pLSA [12] and LDA [13]. The Factorization Machines (FM) model [4] can also be considered a generalization of matrix factorization models. Matrix factorization models can be combined with other models to yield better performance; successful combined models such as Collaborative Topic Regression [14] are deployed in commercial systems such as the New York Times news recommendation. Variants of factorization machines appear in many data science competition solutions and are among the most practical approaches in commercial environments.
3. ALGORITHM
As in social network analysis, power law distributions exist everywhere in recommender system theory and practice. The discovery of the power law distribution can be traced back to Zipf's distribution, which states that the frequency of an English word in a document corpus is inversely proportional to a power of its frequency rank:

f(r) ∝ 1 / r^s

where r is the rank of the word and s is a constant close to 1. A simple plot of Zipf's distribution shows that the most popular English words exhibit exponentially higher frequency than less popular ones: the distribution is highly skewed, with a very long tail.

Zipf's law can be considered the simplest power law distribution. General forms of the power law distribution exist in nearly all input datasets to recommender systems. According to Wang et al. [7], power law distributions in the input data structure cause power law phenomena in the output data structures, and the resultant skewness in the output can be expressed analytically as a function of the skewness in the input data structures.
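The head-heavy shape of a Zipf distribution is easy to check numerically. The following short sketch (the number of ranks and the exponent s = 1 are illustrative choices of ours) computes normalized Zipf frequencies:

```python
import numpy as np

def zipf_frequencies(n_ranks, s=1.0):
    """Normalized Zipf frequencies: f(r) proportional to 1 / r^s."""
    ranks = np.arange(1, n_ranks + 1)
    freqs = ranks.astype(float) ** (-s)
    return freqs / freqs.sum()  # normalize to a probability distribution

freqs = zipf_frequencies(1000)
# With s = 1 the rank-1 item is 10x as frequent as the rank-10 item,
# and the first few ranks carry a disproportionate share of the mass.
print(freqs[0] / freqs[9])
print(freqs[:10].sum())
```

The long tail is visible immediately: the vast majority of the 1000 ranks each carry only a tiny fraction of the total frequency mass.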
The variables used in their research to depict the power law distributions are the rank of popularity of the item (item rank) and the rank of popularity of the user (user rank).

Matrix factorization is a commonly adopted approach for many recommendation scenarios. The basic idea is to factorize the rating matrix into the dot product of a user feature vector and an item feature vector, namely:

R_{i,j} = user feature_i ⋅ item feature_j

Different matrix factorization schemes differ in how the user feature vector and item feature vector are computed. Probabilistic models such as pLSA and Latent Dirichlet Allocation can all be applied to solve the problem, and differences in problem modeling and optimization method selection can lead to observably different performance. The paradigm is in essence a dimensionality reduction approach that reduces the computation of O(n ⋅ m) unknown variables to the computation of O((n + m) ⋅ k) unknown variables, where n is the number of users, m the number of items and k the latent dimension.

To borrow the concepts of Wang et al. [7] and build the power law effect into the matrix factorization framework, the formulation of the user feature vector and item feature vector is modified in the following way:

user feature_i = user rank_i ⋅ a_i + item rank_j ⋅ b_i + u_i
item feature_j = user rank_i ⋅ c_j + item rank_j ⋅ d_j + v_j

As in all matrix factorization methods, the known values of the rating matrix serve as guidance for seeking the unknown user feature vectors and item feature vectors, whose dot products in turn determine the unknown values of the rating matrix. The loss function to be optimized is the sum of squared losses over the known rating matrix values:
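A minimal sketch of this rank-augmented feature construction follows; the latent dimension, the rank values and all variable names are illustrative choices of ours, not from a released implementation:

```python
import numpy as np

def user_feature(a_i, b_i, u_i, user_rank, item_rank):
    # user feature_i = user rank_i * a_i + item rank_j * b_i + u_i
    return user_rank * a_i + item_rank * b_i + u_i

def item_feature(c_j, d_j, v_j, user_rank, item_rank):
    # item feature_j = user rank_i * c_j + item rank_j * d_j + v_j
    return user_rank * c_j + item_rank * d_j + v_j

k = 8                                  # latent dimension (illustrative)
rng = np.random.default_rng(0)
a, b, u = (rng.normal(size=k) for _ in range(3))
c, d, v = (rng.normal(size=k) for _ in range(3))

# Suppose this user is the 3rd most popular user and the item the 7th
# most popular item; the ranks enter the features directly.
f = user_feature(a, b, u, user_rank=3, item_rank=7)
g = item_feature(c, d, v, user_rank=3, item_rank=7)
pred = f @ g                           # predicted rating R_{i,j}
```

Note that a_i, b_i, u_i, c_j, d_j, v_j are all k-dimensional vectors, while the two ranks are scalars, so the skewness information simply rescales the learned latent directions.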
Loss = ∑_{i=1}^{n} ∑_{j=1}^{m} ( R_{i,j} − user feature_i ⋅ item feature_j / (‖user feature_i‖ ⋅ ‖item feature_j‖) )²

Optimization algorithms such as stochastic gradient descent can be applied to minimize the loss function. Following common practice in the industry, we use stochastic gradient descent to compute the user feature vectors and item feature vectors. Writing f = user feature_i, g = item feature_j, x = user rank_i, y = item rank_j, s = f ⋅ g, n_f = ‖f‖, n_g = ‖g‖ and e = R_{i,j} − s / (n_f ⋅ n_g), the update rules are:

a_i = a_i + η ⋅ 2 ⋅ e ⋅ x ⋅ ( g / (n_f ⋅ n_g) − s ⋅ f / (n_f³ ⋅ n_g) )
b_i = b_i + η ⋅ 2 ⋅ e ⋅ y ⋅ ( g / (n_f ⋅ n_g) − s ⋅ f / (n_f³ ⋅ n_g) )
u_i = u_i + η ⋅ 2 ⋅ e ⋅ ( g / (n_f ⋅ n_g) − s ⋅ f / (n_f³ ⋅ n_g) )
c_j = c_j + η ⋅ 2 ⋅ e ⋅ x ⋅ ( f / (n_f ⋅ n_g) − s ⋅ g / (n_f ⋅ n_g³) )
d_j = d_j + η ⋅ 2 ⋅ e ⋅ y ⋅ ( f / (n_f ⋅ n_g) − s ⋅ g / (n_f ⋅ n_g³) )
v_j = v_j + η ⋅ 2 ⋅ e ⋅ ( f / (n_f ⋅ n_g) − s ⋅ g / (n_f ⋅ n_g³) )

The normalization of the user feature vector and item feature vector, i.e. predicting with the normalized dot product f ⋅ g / (n_f ⋅ n_g), is crucial in the optimization steps of the algorithm. Failing to normalize easily leads to gradient explosion that cannot be handled by computer systems.

The algorithm follows the standard matrix factorization protocol: in each iteration, a batch or a single sample of user-item rating tuples is selected as the input, and the user feature vectors and item feature vectors are updated according to the formulas listed above. When a prediction is needed, the normalized dot product of the user feature vector and item feature vector is computed for the unknown rating of the item by the user.
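One update step can be sketched as follows. This is our reading of the method: it assumes the predicted rating is the normalized dot product f ⋅ g / (‖f‖ ⋅ ‖g‖), as the normalization remark above implies, and the rank values, learning rate and initial vectors are illustrative choices of ours:

```python
import numpy as np

def sgd_step(a, b, u, c, d, v, x, y, R, eta=0.05):
    """One stochastic-gradient update on a single (user, item, rating) sample.
    x = user rank, y = item rank; a, b, u parametrize the user feature and
    c, d, v the item feature, as in the formulation above."""
    f = x * a + y * b + u                    # user feature vector
    g = x * c + y * d + v                    # item feature vector
    nf, ng = np.linalg.norm(f), np.linalg.norm(g)
    s = f @ g
    e = R - s / (nf * ng)                    # error of the normalized prediction
    # gradients of the normalized dot product w.r.t. f and g
    grad_f = g / (nf * ng) - s * f / (nf**3 * ng)
    grad_g = f / (nf * ng) - s * g / (nf * ng**3)
    return (a + eta * 2 * e * x * grad_f,
            b + eta * 2 * e * y * grad_f,
            u + eta * 2 * e * grad_f,
            c + eta * 2 * e * x * grad_g,
            d + eta * 2 * e * y * grad_g,
            v + eta * 2 * e * grad_g)

# Repeated updates on one sample drive the prediction toward the rating.
a = np.array([0.1, 0.3]); b = np.array([0.2, 0.1]); u = np.array([1.0, 0.0])
c = np.array([0.3, 0.1]); d = np.array([0.1, 0.2]); v = np.array([0.0, 1.0])
x, y, R = 0.5, 0.25, 0.9                     # illustrative scaled ranks and rating
for _ in range(200):
    a, b, u, c, d, v = sgd_step(a, b, u, c, d, v, x, y, R)
```

Because the normalized prediction is a cosine bounded in [−1, 1], in practice ratings would be rescaled into that range, and ranks would be scaled so the rank terms do not dominate the free vectors u and v.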
4. EXPERIMENTAL RESULTS
We set up our experiments using the lastFM [15] and MovieLens datasets. The lastFM dataset contains 1892 users and 17632 artists to be recommended to the users. The number of iterations of stochastic gradient descent is fixed at 300, while the gradient learning step is enumerated to find the optimal result. The result is plotted in Fig. 1. The lowest MAE of 0.1771 is achieved when the gradient learning step is 3 × − . Bayesian Personalized Ranking (BPR) is used in our experiments for comparison with the method. The optimal MAE of BPR is 0.23609, and its MAE stays above 0.2 with varying parameters; our algorithm is thus superior to this widely adopted learning to rank technique. Fig. 2 shows the MAE of the BPR algorithm with varying gradient learning steps; the number of iterations is 20 and the dimension of the latent factors is 20.

Alternating Least Squares is also tested in our experiments for comparison. An MAE of 0.0518 is achieved when the matrix rank is 10 and the number of iterations is 10. That performance is superior to our approach; however, this is only the case on the lastFM dataset. When tested on the MovieLens dataset, comprising 162000 users and 62000 movies, a lowest MAE of 0.8618 is achieved by our method when the gradient learning step is × − (Fig. 3). The result is superior to that of Alternating Least Squares, which has an increasing MAE starting from above 0.94 when the matrix rank is 5.

Deep Matrix Factorization [16] generates slightly better results than our method, with an MAE between 0.82 and 0.83 for selected parameters. However, Deep Matrix Factorization is much slower than our method: without the help of a GPU, it takes hours to compute the result for the same dataset on a commercial MacBook, where our method takes only tens of seconds to finish. Sacrificing speed for a slight improvement in accuracy is not always economical in real-world scenarios.
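All comparisons above are reported in mean absolute error over held-out (user, item) ratings, which for reference is simply:

```python
import numpy as np

def mae(true_ratings, predicted_ratings):
    """Mean absolute error between held-out ratings and predictions."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    predicted_ratings = np.asarray(predicted_ratings, dtype=float)
    return float(np.mean(np.abs(true_ratings - predicted_ratings)))

print(mae([5, 3, 4], [4.5, 3.5, 4.0]))  # -> 0.3333333333333333
```

MAE weights all errors linearly, which is one reason (discussed in the conclusion) why it may not be the best metric under heavy skew: errors on a few very popular items and on many tail items contribute identically.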
Fig. 1 MAE of the proposed algorithm for different parameters

Fig. 2 MAE of Bayesian Personalized Ranking with varying gradient learning steps

Fig. 3 MAE of the proposed method in this paper

5. CONCLUSION
In this paper, we proposed a new matrix factorization model that explicitly models the data skewness effect in input data structures. The algorithm is designed to ameliorate the well-known data skewness problem for recommender systems. In our experiments, we resort to conventional evaluation metrics for algorithm evaluation. Our method yields comparatively favorable results in contrast with learning to rank approaches such as BPR and matrix factorization methods like Alternating Least Squares and Deep Matrix Factorization. However, the metric used in our experiments is MAE, which might not be the most appropriate metric for evaluating recommender system performance in consideration of data skewness. In future work, we would like to create a theoretical foundation for solving the data skewness problem for recommender systems, including finding better evaluation metrics for experiments.
6. REFERENCES
[1] David Goldberg, David Nichols, Brian M. Oki, Douglas Terry, Using Collaborative Filtering to Weave an Information Tapestry, Commun. ACM, Volume 35, Issue 12.
[2] Yehuda Koren, Robert Bell, Chris Volinsky, Matrix Factorization Techniques for Recommender Systems, Computer, 42(8), (2009).
[3] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, Lars Schmidt-Thieme, BPR: Bayesian Personalized Ranking from Implicit Feedback, UAI '09: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, (June 2009), 452–461.
[4] Steffen Rendle, Factorization Machines, IEEE International Conference on Data Mining, (Dec. 2010). DOI: https://doi.org/10.1109/ICDM.2010.127.
[5] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He, DeepFM: A Factorization-Machine based Neural Network for CTR Prediction, Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, (2017).
[6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, Hemal Shah, Wide & Deep Learning for Recommender Systems, Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, (Sept. 2016), 7–10. DOI: https://doi.org/10.1145/2988450.2988454.
[7] Hao Wang, Zonghu Wang, Weishi Zhang, Quantitative Analysis of Matthew Effect and Sparsity Problem of Recommender Systems, IEEE International Conference on Cloud Computing and Big Data Analysis, (Apr. 2018). DOI: https://doi.org/10.1109/ICCCBDA.2018.8386490.
[8] Rocío Cañamares, Pablo Castells, Should I Follow the Crowd? A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems, SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, (June 2018), 415–424. DOI: https://doi.org/10.1145/3209978.3210014.
[9] Trapit Bansal, David Belanger, Andrew McCallum, Ask the GRU: Multi-Task Learning for Deep Text Recommendation, RecSys '16, (Sept. 2016). DOI: http://dx.doi.org/10.1145/2959100.2959180.
[10] Gábor Takács, Domonkos Tikk, Alternating Least Squares for Personalized Ranking, RecSys '12: Proceedings of the Sixth ACM Conference on Recommender Systems, (Sept. 2012), 83–90. DOI: https://doi.org/10.1145/2365952.2365972.
[11] Arkadiusz Paterek, Improving Regularized Singular Value Decomposition for Collaborative Filtering, KDD 2007: Netflix Competition Workshop.
[12] Thomas Hofmann, Probabilistic Latent Semantic Indexing, SIGIR '99: Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval.
[13] David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, 993–1022. DOI: https://doi.org/10.1162/jmlr.2003.3.4-5.993.
[14] Chong Wang, David M. Blei, Collaborative Topic Modeling for Recommending Scientific Articles, KDD '11: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (Aug. 2011), 448–