Collaborative Similarity Embedding for Recommender Systems

Chih-Ming Chen∗ (National Chengchi University, Taipei, Taiwan)
Chuan-Ju Wang (Academia Sinica, Taipei, Taiwan)
Ming-Feng Tsai† (National Chengchi University, Taipei, Taiwan)
Yi-Hsuan Yang (Academia Sinica, Taipei, Taiwan)

∗ Social Networks and Human-Centered Computing, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taiwan
† MOST Joint Research Center for AI Technology and All Vista Healthcare, Taiwan
ABSTRACT
We present collaborative similarity embedding (CSE), a unified framework that exploits comprehensive collaborative relations available in a user-item bipartite graph for representation learning and recommendation. In the proposed framework, we differentiate two types of proximity relations: direct proximity and k-th order neighborhood proximity. While learning from the former exploits direct user-item associations observable from the graph, learning from the latter makes use of implicit associations such as user-user similarities and item-item similarities, which can provide valuable information especially when the graph is sparse. Moreover, for improving scalability and flexibility, we propose a sampling technique that is specifically designed to capture the two types of proximity relations. Extensive experiments on eight benchmark datasets show that CSE yields significantly better performance than state-of-the-art recommendation methods.

ACM Reference Format:
Chih-Ming Chen, Chuan-Ju Wang, Ming-Feng Tsai, and Yi-Hsuan Yang. 2019. Collaborative Similarity Embedding for Recommender Systems. In Proceedings of ACM conference (WWW'19). ACM, New York, NY, USA, Article 4, 9 pages. https://doi.org/10.475/123_4
INTRODUCTION

The task of recommender systems is to produce a list of recommendation results that match user preferences given their past behavior. Collaborative filtering (CF), a common yet powerful approach, generates user recommendations by taking advantage of the collective wisdom from all users [14]. Many CF-based recommendation algorithms have been shown to work well across various domains and have been used in many real-world applications [27].

The core idea of model-based CF algorithms is to learn low-dimensional representations of users and items from either explicit user-item associations, such as user-item ratings, or implicit feedback, such as playcounts and dwell time. This can be done by training a rating-based model with matrix completion to learn from observed user-item associations (either explicit or implicit feedback) to predict associations that are unobserved [1, 5, 12, 13, 15, 16, 18, 20, 21, 26, 36]. In addition to this rating-based approach, ranking-based methods have been proposed based on optimizing a ranking loss; such ranking-based methods [21, 25, 33–35] have been found more suitable for implicit feedback. However, many existing model-based CF algorithms leverage only the user-item associations available in a given user-item bipartite graph. Thus, when the available user-item associations are sparse, these algorithms may not work well.

It has been noted that other types of collaborative relations, such as user-user similarities and item-item similarities, can be mined from a user-item bipartite graph, since users and items can be indirectly connected in the graph. Moreover, by taking random walks on the graph, it is possible to exploit higher-order proximity among users and items. Using item-item similarities in the learning process was first studied by Liang et al. [18], who propose to jointly decompose the user-item interaction matrix and the item-item co-occurrence matrix with shared item latent factors. Hsieh et al. [12] propose to learn a joint metric space to encode both user preferences and user-user and item-item similarities. A recent work presented by Yu et al. [35] shows that jointly modeling user-item, user-user, and item-item relations outperforms competing methods that consider only user-item relations. In [7, 22], higher-order proximity has been shown useful in graph embedding methods. In general, exploiting additional collaborative relations shows promise in learning better representations of vertices in an information graph.

We note that these prior works [7, 12, 18, 35] share the same core idea: using some specific method to sample auxiliary information from a graph to augment the data for representation learning. However, there is a lack of a unified and efficient model that generalizes the underlying computation and targets recommendation problems. For example, Liang et al. [18] consider only item-item similarities but no other collaborative relations, and Yu et al. [35] consider only ranking-based loss functions but not rating-based ones. Higher-order proximity is exploited in [6, 7, 22], which however address the general graph embedding problem rather than recommendation. Moreover, the model presented in [12] fails to manage large-scale user-item associations [29].

To address this discrepancy, in this paper we present collaborative similarity embedding (CSE), a unified representation learning framework. Our implementation is available at https://github.com/cnclabs/proNet-core.
[Figure 1: Overview of the proposed CSE framework. (a) The direct similarity embedding (DSEmbed) module models a target user/item and the retrieved user(s)/item(s) of direct user-item associations; (b) the neighborhood similarity embedding (NSEmbed) module models the user and item contexts retrieved by random walks, yielding the loss $\mathcal{L}_{NS}$.]
Problem Formulation. A recommender system provides a list of ranked items to users based on their historical interactions with items. Let $U$ and $I$ denote the sets of users and items, respectively. User-item associations can be represented as a bipartite graph $G = (V, E)$, where $V = \{v_1, \ldots, v_{|V|}\} = U \cup I$, and $E$ represents the set of observed user-item associations. Note that for explicit rating data, the weights of the user-item preference edges can be positive real numbers, whereas for implicit interactions, the bipartite graph becomes a binary graph. The goal of the CSE framework is to obtain an embedding matrix $\Phi \in \mathbb{R}^{|V| \times d}$ that maps each user and item into a $d$-dimensional embedding vector for item recommendation; that is, with the learned embedding matrix $\Phi$, for a user $v_i \in U$, the proposed framework generates the top-$N$ recommended items by computing the similarity between the embedding vector of the user, i.e., $\Phi_{v_i}$, and those of all items, i.e., $\Phi_{v_j}$ for all $v_j \in I$, where $\Phi_{v_x}$ denotes the row vector for vertex $v_x \in V$ in matrix $\Phi$.
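To make the retrieval step concrete, the following minimal sketch (Python with NumPy; the array names and the `top_n` helper are ours, not from the paper) scores all items for a user by dot product against the shared embedding matrix and returns the top-N:

```python
import numpy as np

def top_n(phi: np.ndarray, user_idx: int, item_indices: np.ndarray, n: int = 10) -> np.ndarray:
    """Return the indices of the N items whose embeddings score highest
    against the user's embedding under the dot-product similarity."""
    scores = phi[item_indices] @ phi[user_idx]   # one score per candidate item
    order = np.argsort(-scores)                  # sort descending by score
    return item_indices[order[:n]]

# Usage: phi plays the role of the |V| x d matrix learned by CSE; rows are
# shared between users and items because V = U ∪ I.
phi = np.random.rand(100, 16)                    # toy stand-in for a trained Φ
items = np.arange(50, 100)                       # suppose vertices 50..99 are items
print(top_n(phi, user_idx=3, item_indices=items, n=10))
```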
Framework Overview. Figure 1 provides an overview of CSE. As shown in the figure, CSE consists of two similarity embedding modules: a direct similarity embedding (DSEmbed) module to model user-item associations, and a neighborhood similarity embedding (NSEmbed) module to model user-user and item-item similarities. The DSEmbed module provides the flexibility to implement two mainstream types of modeling techniques, rating-based and ranking-based models, to preserve the direct proximity of user-item associations; NSEmbed, in turn, models user-user and item-item relations using the contexts within a $k$-step random walk, as shown in Fig. 1(b), to preserve the $k$-order neighborhood proximity between users and items. To minimize the sum of the losses from the DSEmbed and NSEmbed modules, denoted as $\mathcal{L}_{DS}$ and $\mathcal{L}_{NS}$ respectively, the objective function of the proposed framework is designed as
$$\mathcal{L} = \mathcal{L}_{DS} + \lambda \mathcal{L}_{NS},$$
where $\lambda$ controls the balance between the two losses. The rationale behind this design is that $\mathcal{L}_{DS}$ drives the optimization of the embedding vectors towards preserving direct user-item associations, while $\mathcal{L}_{NS}$ encourages users/items sharing similar neighbors to be close to one another in the learned embedding space.
Definition 2.1 (Direct Proximity). Given a bipartite graph $G = (V, E)$, the direct proximity between a user $v_i \in U$ and an item $v_j \in I$ is represented by the presence of an edge $(v_i, v_j) \in E$ between these two vertices. If there is no edge between user $v_i$ and item $v_j$, their direct proximity is defined as 0.

The DSEmbed module is designed to model the direct proximity of the user-item associations given in Definition 2.1. For a rating-based approach, the objective is to find the embedding matrix $\Phi$ that maximizes the log-likelihood of the observed user-item pairs:
$$\arg\max_{\Phi} \sum_{(v_i, v_j) \in E} \log p(v_i, v_j \,|\, \Phi) = \arg\min_{\Phi} \sum_{(v_i, v_j) \in E} -\log p(v_i, v_j \,|\, \Phi). \quad (1)$$
In contrast, a ranking-based approach cares more about whether we can predict a stronger association for a 'positive' user-item pair $(v_i, v_j) \in E$ than for a 'negative' user-item pair $(v_i, v_k) \in \bar{E}$ [25], where $\bar{E}$ denotes the set of edges for all the unobserved user-item associations. This can be approached by maximizing the log-likelihood of observed user-item pairs over unobserved user-item pairs for each user:
$$\arg\max_{\Phi} \sum_{(v_i, v_j, v_k)} \log p(v_j >_i v_k \,|\, \Phi) = \arg\min_{\Phi} \sum_{(v_i, v_j, v_k)} -\log p(v_j >_i v_k \,|\, \Phi), \quad (2)$$
where $v_i \in U$ and $v_j, v_k \in I$, and $>_i$ indicates that user $v_i$ prefers item $v_j$ over item $v_k$. In the above two equations, $p(v_i, v_j \,|\, \Phi)$ and $p(v_j >_i v_k \,|\, \Phi)$ are computed as
$$p(v_i, v_j \,|\, \Phi) = \sigma\!\left(\Phi_{v_i} \cdot \Phi_{v_j}\right) \quad \text{and} \quad p(v_j >_i v_k \,|\, \Phi) = \sigma\!\left(\Phi_{v_i} \cdot \Phi_{v_j} - \Phi_{v_i} \cdot \Phi_{v_k}\right),$$
respectively, where $\sigma(\cdot)$ denotes the sigmoid function.

Definition 2.2 ($k$-Order Neighborhood Proximity). Given a bipartite graph $G = (V, E)$ representing the observed user-item associations of the set of users and items in $V = U \cup I$, the $k$-order neighborhood proximity of a pair of users (or items) is defined as the similarity between their neighborhood network structures retrieved by $k$-step random walks. Mathematically speaking, given the $k$-order neighborhood structures of a pair of users (or items) $v_i, v_j \in U$ (or $v_i, v_j \in I$, respectively), which are denoted as two sets of neighbor nodes $N_{v_i}$ and $N_{v_j}$ with $|N_{v_i}| = |N_{v_j}| = k$, the $k$-order neighborhood proximity between $v_i$ and $v_j$ is decided by the similarity between the two sets $N_{v_i}$ and $N_{v_j}$. If there are no shared neighbors between $v_i$ and $v_j$, the neighborhood proximity between them is 0.

The NSEmbed module is designed to model $k$-order neighborhood proximity for capturing user-user and item-item similarities. Given a set of neighborhood relations for users (or items) $S_U = \{(v_i, v_j) \,|\, \forall v_i \in U, v_j \in N_{v_i}\}$ (or $S_I = \{(v_i, v_j) \,|\, \forall v_i \in I, v_j \in N_{v_i}\}$, respectively), the NSEmbed module seeks a set of embedding matrices $\Phi, \Phi^{UC}, \Phi^{IC} \in \mathbb{R}^{|V| \times d}$ that maximizes the likelihood of all pairs in $S_U$ (or $S_I$, respectively), where $\Phi$ is a vertex mapping matrix akin to that used in the DSEmbed module, and $\Phi^{UC}$ and $\Phi^{IC}$ are two context mapping matrices. Note that each vertex (representing a user or an item) plays two roles in modeling the neighborhood proximity: 1) the vertex itself and 2) the context of other vertices [3, 7, 22, 28, 37]. With this design, the embedding vectors of vertices that share similar contexts are closely located in the learned vector space.
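To make the similarity functions of the two modules concrete, here is a minimal sketch (Python/NumPy; function names are ours) of the rating-based and ranking-based scores used by DSEmbed, plus the context score used by NSEmbed (made precise in Eq. (4) below):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_rating(phi, u, i):
    """Eq. (1)'s model: p(v_i, v_j | Phi) = sigma(Phi_u . Phi_i)."""
    return sigmoid(phi[u] @ phi[i])

def p_ranking(phi, u, pos, neg):
    """Eq. (2)'s model: p(v_pos >_u v_neg | Phi), the probability that
    user u prefers the observed item over the unobserved one."""
    return sigmoid(phi[u] @ phi[pos] - phi[u] @ phi[neg])

def p_context(phi, phi_uc, phi_ic, v, ctx, v_is_user):
    """NSEmbed: score vertex v against a context row, taken from the
    user-context or item-context matrix depending on the type of v."""
    ctx_matrix = phi_uc if v_is_user else phi_ic
    return sigmoid(phi[v] @ ctx_matrix[ctx])

# Toy usage with a 6-vertex, 4-dimensional embedding.
phi, phi_uc, phi_ic = (np.random.rand(6, 4) for _ in range(3))
print(p_rating(phi, 0, 3), p_ranking(phi, 0, 3, 5))
```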
Therefore, for NSEmbed, the maximization of the likelihood function can be defined as
$$\arg\max_{\Phi, \Phi^{UC}, \Phi^{IC}} \prod_{(v_i, v_j) \in S_U} p(v_j \,|\, v_i; \Phi; \Phi^{UC}) + \prod_{(v_i, v_j) \in S_I} p(v_j \,|\, v_i; \Phi; \Phi^{IC}).$$
Similar to Eqs. (1) and (2), the above objective function becomes
$$\arg\min_{\Phi, \Phi^{UC}, \Phi^{IC}} \sum_{(v_i, v_j) \in S_U} -\log p(v_j \,|\, v_i; \Phi; \Phi^{UC}) + \sum_{(v_i, v_j) \in S_I} -\log p(v_j \,|\, v_i; \Phi; \Phi^{IC}), \quad (3)$$
where
$$p(v_j \,|\, v_i; \Theta) = \begin{cases} \sigma\!\left(\Phi_{v_i} \cdot \Phi^{UC}_{v_j}\right) & \text{if } v_i \in U, \\ \sigma\!\left(\Phi_{v_i} \cdot \Phi^{IC}_{v_j}\right) & \text{if } v_i \in I. \end{cases} \quad (4)$$
It is worth mentioning that most prior works use only one or two embedding mappings; whereas the former approach fails to consider high-order neighbors (e.g., [4, 9, 16, 28, 31]), the latter cannot model user-user, item-item, and user-item relations simultaneously (e.g., [1, 3, 7, 22, 23, 37]). Our newly designed triplet embedding solution (i.e., $\Theta = \{\Phi, \Phi^{UC}, \Phi^{IC}\}$) can model user-user and item-item clustering as well as user-item relations in a single, jointly learned model.

To minimize the above objective functions, we would need to go through all the pairs in $E$ for Eq. (1), in $E$ and $\bar{E}$ for Eq. (2), and in $S_U$ and $S_I$ for Eq. (3) to compute all the pairwise losses. This is not feasible in real-world recommendation scenarios, as the complexity is $O(|V| \times |V|)$. To address this, we propose a sampling technique that works in tandem with the above two modules to enhance CSE's scalability and flexibility in learning user and item representations from large-scale datasets.

In CSE, the DSEmbed and NSEmbed modules are fused with the shareable data sampling technique described below. For each parameter update, we first sample an observed user-item pair $(v_i, v_j) \in E$, shown as U1 and I1 in Fig. 1(b), where $v_i \in U$ and $v_j \in I$. Then, we search for the $k$-order neighborhood structures of user $v_i$ and item $v_j$ via $k$-step random walks. To improve computational efficiency, we use negative sampling [28]. Consequently, for a rating-based approach (see Eq. (1)), the expected sampled loss of the DSEmbed module can be rewritten as
$$\mathcal{L}_{DS} = \mathbb{E}_{(v_i, v_j) \sim E}\left[-\log p(v_i, v_j \,|\, \Phi)\right] + \sum^{M} \mathbb{E}_{(v_k, v_h) \sim \bar{E}}\left[\log p(v_k, v_h \,|\, \Phi)\right], \quad (5)$$
where $M$ denotes the number of negative pairs adopted. For a ranking-based approach (see Eq. (2)), the DSEmbed loss can be rewritten as
$$\mathcal{L}_{DS} = \mathbb{E}_{(v_i, v_k) \sim \bar{E}}\left[\mathbb{E}_{(v_i, v_j) \sim E}\left[-\log p(v_j >_i v_k \,|\, \Phi)\right] \,\Big|\, v_i\right]. \quad (6)$$
Note that for the ranking-based approach, there is no need to explicitly include $M$ negative sample pairs, as this kind of method naturally involves negative pairs from $\bar{E}$. Similarly, given a user or an item vertex $v_i$, its $k$-order neighborhood structure $N_{v_i}$ is composed of the nodes visited by the $k$-step random walk on $G$, $W_{v_i} = (W^1_{v_i}, W^2_{v_i}, \ldots, W^k_{v_i})$, where the vertex for $W^j_{v_i}$ is randomly chosen from the neighbors of vertex $v$ given $W^{j-1}_{v_i} = v$ and $W^0_{v_i} = v_i$. The expected sampled loss of the NSEmbed module can be rewritten as
$$\begin{aligned} \mathcal{L}_{NS} = \; & \mathbb{E}_{(v_i, v_j) \sim S_U}\left[-\log p(v_j \,|\, v_i; \Phi; \Phi^{UC})\right] + \sum^{M} \mathbb{E}_{(v_i, v_j) \sim \bar{E}}\left[\log p(v_j \,|\, v_i; \Phi; \Phi^{UC})\right] \\ + \; & \mathbb{E}_{(v_i, v_j) \sim S_I}\left[-\log p(v_j \,|\, v_i; \Phi; \Phi^{IC})\right] + \sum^{M} \mathbb{E}_{(v_i, v_j) \sim \bar{E}}\left[\log p(v_j \,|\, v_i; \Phi; \Phi^{IC})\right]. \end{aligned} \quad (7)$$
Since $\mathcal{L}_{DS}$ and $\mathcal{L}_{NS}$ are expressed in a sampling-based expectation form, CSE provides the flexibility to accommodate arbitrary distributions of positive and negative data. In the following experiments, we produce the positive data according to the edge distribution of the given user-item graph.
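Under our reading of the procedure, one round of the shareable sampling can be sketched as follows (Python; all names are ours, and the gradient update itself is elided): draw an observed edge, extend it with $k$-step random walks to obtain the neighborhood pairs, and draw $M$ negative vertices from the whole collection, per the design choice discussed next.

```python
import random

def k_step_walk(adj, start, k):
    """Collect the k-order neighborhood N_v of `start` via a k-step random walk."""
    walk, v = [], start
    for _ in range(k):
        v = random.choice(adj[v])    # W^j is a random neighbor of W^{j-1}
        walk.append(v)
    return walk

def sample_training_round(edges, adj, vertices, k=2, m=5):
    """One sampling round shared by DSEmbed and NSEmbed."""
    u, i = random.choice(edges)                              # observed pair (U1, I1)
    user_ctx = [(u, c) for c in k_step_walk(adj, u, k)]      # pairs contributed to S_U
    item_ctx = [(i, c) for c in k_step_walk(adj, i, k)]      # pairs contributed to S_I
    negatives = [random.choice(vertices) for _ in range(m)]  # M negative samples
    return (u, i), user_ctx, item_ctx, negatives
```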
As for negative sampling, we propose to directly sample the negative data from the whole data collection instead of from only the unobserved data collection.
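To illustrate this design choice, here is a minimal sketch (Python; names ours) of the two alternatives as we read them: sampling from the whole collection simply draws from all candidates in O(1) per draw, whereas restricting draws to unobserved pairs requires a rejection step.

```python
import random

def sample_negatives_whole(items, m):
    """CSE's choice under our reading: draw from the whole collection; an
    occasional observed pair may slip in, but each draw stays O(1)."""
    return [random.choice(items) for _ in range(m)]

def sample_negatives_unobserved(items, observed_of_user, user, m):
    """The conventional alternative: rejection-sample until unobserved."""
    negatives = []
    while len(negatives) < m:
        j = random.choice(items)
        if j not in observed_of_user[user]:
            negatives.append(j)
    return negatives
```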
Optimization. In the optimization stage, we use asynchronous stochastic gradient descent (ASGD) [24] to efficiently update the parameters in parallel. The model parameters comprise the three embedding matrices $\Theta = \{\Phi, \Phi^{UC}, \Phi^{IC}\}$, each of size $O(|V| d)$. They are updated with learning rate $\alpha$ according to
$$\Theta \leftarrow \Theta - \alpha \left( \frac{\partial \mathcal{L}_{DS}}{\partial \Theta} + \lambda \left( \frac{\partial \mathcal{L}_{NS}}{\partial \Theta} \right) - \lambda_V \|\Phi\| \right), \quad (8)$$
where $\lambda_V$ is a hyper-parameter for reducing the risk of overfitting.

The CSE framework not only modularizes the modeling of pairwise user-item, user-user, and item-item relations, but also integrates them into a single objective function through shared embedding vectors. Together with the DSEmbed and NSEmbed sub-modules for modeling these relations, CSE involves a novel sampling technique that improves scalability and flexibility. To give a clearer view of CSE, we further provide comparisons to general graph embedding and deep learning models below.
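A minimal sketch of the update in Eq. (8), assuming the sampled gradients of the two losses are available as inputs (Python/NumPy; the dictionary layout and argument names are ours, and we read the $\lambda_V$ term as standard L2 weight decay on $\Phi$):

```python
import numpy as np

def asgd_step(theta, grad_ds, grad_ns, alpha=0.1, lam=0.05, lam_v=0.025):
    """One (asynchronous) SGD step following Eq. (8): a gradient step on
    L_DS + lambda * L_NS, plus regularization on the vertex matrix Phi."""
    for name in theta:
        theta[name] -= alpha * (grad_ds[name] + lam * grad_ns[name])
    theta["phi"] -= alpha * lam_v * theta["phi"]  # our weight-decay reading of the lambda_V term
    return theta

# Toy usage with random gradients for a 10-vertex, 4-dimensional model.
shape = (10, 4)
theta = {k: np.random.rand(*shape) for k in ("phi", "phi_uc", "phi_ic")}
grads = lambda: {k: np.random.rand(*shape) for k in theta}
theta = asgd_step(theta, grads(), grads())
```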
Comparison with Factorization Methods. Typical factorization methods usually work on a sparse user-item matrix and do not explicitly model high-order connections from the corresponding user-item bipartite graph $G = (V, E)$. Several methods, including our CSE, propose to explicitly incorporate high-order connections for modeling user-user and item-item relations into the recommendation models to improve performance. As discussed in [17–19], such modeling can be seen as conducting matrix factorization on a $|V| \times |V|$ point-wise mutual information (PMI) matrix:
$$\mathrm{PMI}(v_i, v_j) = \log\left(\frac{p(v_i, v_j)}{p(v_i)\, p(v_j)}\right) - \log M.$$
Recall that $V = U \cup I$, and $M$ denotes the number of negative pairs adopted in Eqs. (5) and (7). However, given the high-order connections for modeling user-user and item-item relations, most entries $\mathrm{PMI}(v_i, v_j)$ for $v_i, v_j \in U$ (or $v_i, v_j \in I$) are nonzero, and the PMI matrix is thus non-sparse. Conducting matrix factorization on such a matrix is computationally expensive in both time and space, and is therefore infeasible in many large-scale recommendation scenarios.

[Table 1: Statistics of the datasets considered in our experiments; for example, Frappe contains 957 users, 4,028 items, and 96,202 click-count interactions (density 2.50%). Dataset sources include http://baltrunas.info/research-menu/frappe, http://academictorrents.com/, https://grouplens.org/, http://jmcauley.ucsd.edu/data/amazon/, and https://labrosa.ee.columbia.edu/millionsong/tasteprofile.]

To explicitly consider all the collaborative relations in our model while keeping it practical for large-scale datasets, CSE uses the sampling technique together with $k$-step random walks on $G$ to preserve the direct proximity of user-item associations as well as to harvest the high-order neighborhood structures of users and items. By doing so, we approximate factorization of the corresponding PMI matrix and thus reduce the complexity in space and the training time.

In CSE, the DSEmbed and NSEmbed modules are united with the sampling technique. In addition to improved scalability, such a sampling perspective facilitates the shaping of different relation distributions for optimization via different weighting schemes or sampling strategies. Here we use the perspective of KL divergence to explain this model characteristic. Specifically, minimizing the losses in our framework (see Eqs. (5), (6), and (7)) can be related to minimizing the KL divergence of two probability distributions [28]. Suppose there are two distributions over the space $V \times V$: $\hat{p}(\cdot,\cdot)$ and $p(\cdot,\cdot)$, denoting the empirical distribution and the target distribution, respectively; then we have
$$\arg\min_{p} \mathrm{KL}(\hat{p}(\cdot,\cdot), p(\cdot,\cdot)) = \arg\min_{p} -\sum_{(v_i, v_j) \in E} \hat{p}(v_i, v_j) \log\left(\frac{p(v_i, v_j)}{\hat{p}(v_i, v_j)}\right) \propto \arg\min_{p} -\sum_{(v_i, v_j) \in E} \hat{p}(v_i, v_j) \log p(v_i, v_j). \quad (9)$$
In Eq. (9), the empirical distribution $\hat{p}(\cdot,\cdot)$ can be treated as the probability density (mass) function of the distribution in our loss functions, from which each pair of vertices $(v_i, v_j)$ is sampled. This indicates that applying different weighting schemes or different sampling strategies (i.e., different $\hat{p}(\cdot,\cdot)$) in CSE shapes different relation distributions for learning representations.

Complexity. The time and space complexity of the proposed method depends on the implementation. The training procedure of the CSE framework involves a sampling step and an optimization step. Since all the required training pairs, including observed and unobserved associations, can be derived from the user-item bipartite graph $G$, we adopt the compressed sparse alias rows (CSAR) data structure to perform weighted edge sampling for direct similarity and weighted random walks for neighborhood similarity [2]. With the CSAR data structure, sampling an edge requires only $O(1)$, and the overall space complexity grows linearly with the number of positive edges, $O(|E|)$; note that the space for the learned embedding matrices $\Theta = \{\Phi, \Phi^{UC}, \Phi^{IC}\}$ is $O(|V|)$ and $|V| \ll |E|$. As for the optimization, the SGD-based update has a closed form, so updating the embedding of a vertex in a batch depends only on the dimension size, $O(d)$. As for the time to convergence, many studies on graph embedding, as well as our own experiments, empirically show that the total training time required for the convergence of embedding learning is also linear in $|E|$ [8].
Comparison with Graph Embedding Models. General graph embedding algorithms, such as DeepWalk [22] and node2vec [7], can be used for the task of recommendation. Yet, we do not focus on comparing the proposed CSE with general graph embedding models, and we only provide the results of DeepWalk, because many prior works on recommendation [21, 36] have shown that many of the baseline methods considered in our paper outperform these general graph embedding algorithms. The main reason for this phenomenon is that most graph embedding methods cluster vertices that have similar neighbors together, and thus push users apart from items, because user-item interactions typically form a bipartite graph.
Comparison with Deep Learning Models. Our method, and the existing methods we discuss and compare in the experiments, focus on improving the modeling quality of user and item embeddings that can be directly used later for user-item recommendation with similarity computation. Many approximation techniques, such as approximate nearest neighbor (ANN) search (see, e.g., https://github.com/erikbern/ann-benchmarks), can be applied to speed up the similarity computation between user and item embeddings, which facilitates real-time online prediction and makes the recommendation scalable to large real-world datasets. In contrast, many deep learning methods, including NCF [11] and DeepFM [10], do not learn directly comparable embeddings of users and items. There are a few deep learning methods (e.g., Collaborative Deep Embedding [32] and DropoutNet [30]) that can produce user and item embeddings, but to our knowledge, efficiency remains a major concern for these methods. Therefore, improving embedding quality is still a critical research issue in building a recommender system, especially when computation power is limited in real-world application scenarios. For readability and to maintain the focus of this work, we opt not to compare against deep learning methods in this paper. It is also worth mentioning that our solution obtains user and item embeddings within an hour for every large dataset listed in the paper; efficiency and scalability are thus among the highlights of the proposed method.

Datasets. To examine the capability and scalability of the proposed CSE framework, we conducted experiments on eight publicly available real-world datasets that vary in terms of domain, size, and density, as shown in Table 1. For each dataset, we discarded users who have fewer than ten associated interactions with items. In addition, we converted each dataset into implicit feedback:
1) for 5-star rating datasets, we transformed ratings higher than or equal to 3.5 to 1 and the rest to 0; 2) for count-based datasets, we transformed counts higher than or equal to 3 to 1 and the rest to 0; 3) for the CiteULike dataset, no transformation was conducted, as it is already a binary preference dataset. Note that in real-world scenarios, most feedback is not explicit but implicit [25]; we converted the datasets into implicit feedback because most recently developed methods focus on this type of data. However, our method is not limited to binary preferences, since the presented sampling technique has the flexibility to manage arbitrary weighted edge distributions, and rating estimation is also possible in the proposed RATE-CSE.
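A minimal sketch of this preprocessing, with the thresholds as stated above (Python; the helper name is ours):

```python
def to_implicit(value: float, kind: str) -> int:
    """Binarize feedback: ratings >= 3.5 or counts >= 3 become 1, else 0;
    already-binary datasets such as CiteULike pass through unchanged."""
    if kind == "rating":        # 5-star rating datasets
        return 1 if value >= 3.5 else 0
    if kind == "count":         # count-based datasets (e.g., playcounts)
        return 1 if value >= 3 else 0
    return int(value)           # binary preference datasets

assert to_implicit(4.0, "rating") == 1 and to_implicit(2, "count") == 0
```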
Baseline Methods and Settings. We compare the performance of our model with the following ten baseline methods: 1) POP, a naive popularity model that ranks items by their degrees; 2) DeepWalk [22], a classic network embedding algorithm; 3) WALS [13], a weighted rating-based factorization model; 4) ranking-based factorization models: BPR [25], WARP [33], and K-OS [34]; 5) BiNE [6], a network embedding model specialized for bipartite networks; and 6) recent advanced models considering user-user/item-item relations: coFactor [18], CML [12], and WalkRanker [35]. Note that except for POP, the embedding vectors for users and items learned by these competitors, as well as by our method, can be directly used for item recommendation. Additionally, while CML adopts Euclidean distance as the scoring function, all other methods, including ours, utilize the dot product to calculate the score of a pair of user-item embedding vectors. The experiments for WALS and BPR were conducted using the matrix factorization library QMF (https://github.com/quora/qmf), and those for WARP and K-OS using LightFM (https://github.com/lyst/lightfm); for coFactor, CML, and WalkRanker, we used the code provided by the respective authors. For all experiments, the dimension of the embedding vectors was fixed to 100; the values of the hyper-parameters for the compared methods were decided via a grid search over different settings, and the combination leading to the best performance was picked. The ranges of hyper-parameters searched for the compared methods are as follows:
• learning rate: [0.0025, 0.01, 0.025, 0.1]
• regularization: [0.00025, 0.001, 0.0025, 0.01, 0.025, 0.1]
• training epoch: [10, 20, 40, 80, 160]
• sampling time: [20×|E|, 40×|E|, 60×|E|, 80×|E|, 100×|E|]
• walk time: [10, 40, 80]
• walk length: [40, 60, 80]
• window size: [2, 3, 4, 5, 6, 8, 10]
• stopping probability for random walk: [0.15, 0.25, 0.5, 0.75]
• k-order: [1, 2, 3]
• rank margin: 1 (commonly used default value)
• number of negative samples: 5 (commonly used default value)
For our model, the learning rate α was set to 0.1 and λ_V to 0.025; the hyper-parameter λ was set to 0.05 and 0.1 for rating-based CSE and ranking-based CSE, respectively, and k was set to 2 as the default value. For each dataset, the sample time for convergence depends on the number of non-zero user-item interaction edges and was set to 80×|E|. Sensitivity analyses for k and λ and a convergence analysis are reported later.

Evaluation Metrics. Performance is evaluated between the recommended list $R_u$, containing the top-$N$ recommended items, and the corresponding ground-truth list $T_u$ for each user $u$. We consider the following two commonly used metrics over the $N$ recommended results:
• Recall: Denoted as Recall@$N$, this describes the fraction of the ground truth (i.e., the user's preferred items) successfully recommended by the recommendation algorithm:
$$\mathrm{Recall}@N = \frac{1}{|U|} \sum_{u \in U} \frac{\sum_{v \in R_u} \mathbb{1}(v \in T_u)}{\min(N, |T_u|)}.$$
• Mean Average Precision: Denoted as mAP@$N$, this computes the mean over users of the average precision at $k$ (AP@$k$):
$$\mathrm{mAP}@N = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP}_u@k = \frac{1}{|U|} \sum_{u \in U} \frac{\sum_{k=1}^{N} P_u(k) \times \mathbb{1}(r_k \in T_u)}{\min(N, |T_u|)}, \quad (10)$$
where $r_k$ denotes the $k$-th recommended item and $P_u(k)$ the precision at $k$ for user $u$. This is a rank-aware evaluation metric, because it considers the position of each recommended item.

For each dataset, the reported performance is averaged over 10 runs; in each run, we randomly split the data into an 80% training set and a 20% testing set.
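For reference, a small sketch of the two metrics above (Python; function names are ours), normalizing by min(N, |T_u|) as in the equations:

```python
def recall_at_n(rec, truth, n=10):
    """Recall@N for one user: fraction of ground-truth items in the top-N."""
    hits = sum(1 for v in rec[:n] if v in truth)
    return hits / min(n, len(truth))

def ap_at_n(rec, truth, n=10):
    """AP@N for one user: precision accumulated at each rank k that hits T_u."""
    hits, total = 0, 0.0
    for k, v in enumerate(rec[:n], start=1):
        if v in truth:
            hits += 1
            total += hits / k          # P_u(k) at a hit position
    return total / min(n, len(truth))

def mean_over_users(metric, rec_lists, truth_lists, n=10):
    """Average a per-user metric over all users (Recall@N or mAP@N)."""
    vals = [metric(r, t, n) for r, t in zip(rec_lists, truth_lists)]
    return sum(vals) / len(vals)
```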
Results. The results for the ten baseline methods along with those of the proposed method are listed in Table 2, where RATE-CSE and RANK-CSE denote the two versions of our method that employ rating-based and ranking-based loss functions for user-item associations, respectively. The best results are indicated in bold font; for coFactor and BiNE, we report only the experimental results on Frappe and CiteULike because of resource limitations: the memory usage of the coFactor implementation is $O(|V|^2)$, while BiNE requires extensive computational time, e.g., more than 24 hours to learn the embeddings for the large Movielens-Latest dataset.

[Table 2: Recommendation performance in terms of Recall@10 and mAP@10. The † symbol indicates the best-performing method among all baseline methods; '*' and '% Improv.' denote statistical significance under a paired t-test and the percentage improvement of the proposed method, respectively, with respect to the best-performing baseline.]

As discussed above, DeepWalk is not suitable for user-item recommendation, as it pushes users apart from items in the embedding space. In addition, observe that BiNE does not perform well in our experiments; this result is due to the fact that BiNE is a general network embedding model and thus does not incorporate a regularizer in its objective function, which is, however, an important factor for the robustness of recommendation performance. Comparing the performance of the other baseline methods, we observe that WALS, WARP, and K-OS are very competitive; these methods achieve the top performance among all the baselines on several datasets. The performance of WalkRanker and CML, on the other hand, is satisfactory only on the two rather small datasets (Frappe and CiteULike) and poorer on most of the other datasets.

We also observe that our method achieves the best results in terms of both Recall@10 and mAP@10 for most datasets. Moreover, RANK-CSE generally outperforms RATE-CSE in the experiments, re-confirming that a ranking-based loss is indeed better suited to datasets with binary implicit feedback [25, 33, 34]. Specifically, except on Frappe, RATE-CSE or RANK-CSE achieves significantly better performance than the best-performing baseline methods, with a maximum improvement of +20.7%.

Sensitivity and Convergence Analyses. [Figure 2: Sensitivity and convergence analyses: mAP@10 of RATE-CSE and RANK-CSE on the four largest datasets (Movielens-Latest, Netflix, Last.fm-360K, and Echonest) as functions of k (first row), λ (second row), and sample times in millions (third row).]

Figure 2 shows the results of the sensitivity analysis on the two hyper-parameters k and λ in the first and second rows, respectively, and those of the convergence analysis based on sample times in the third row; due to space limits, we report the results for the four largest datasets only. We first observe that increasing the order k of the modeled neighborhood proximity between users or items generally improves performance; the optimal value of k is data-dependent, however, and has to be tuned empirically considering the trade-off between accuracy and time/space complexity. In our experience, a larger k leads to better results, and the results reach a plateau when k is sufficiently large.
The second row of Figure 2 shows how the balancing parameter λ affects performance: RANK-CSE obtains better performance with a value around 0.05, while RATE-CSE performs well with a value around 0.1. Finally, we empirically show that the total number of samples required for convergence is linear with respect to |E|, as illustrated in the third row, where the vertical dashed line indicates the boundary of |E| × 80 that we applied in the previous recommendation experiments. As the training time depends linearly on the number of samples, both RATE-CSE and RANK-CSE converge within a constant multiple of |E| samples. This demonstrates the good scalability of CSE.

CONCLUSION

We present CSE, a unified representation learning framework that exploits comprehensive collaborative relations available in a user-item bipartite graph for recommender systems. Two types of proximity relations are modeled by the proposed DSEmbed and NSEmbed modules. Moreover, we propose a sampling technique to enhance the scalability and flexibility of the model. Experimental results show that CSE yields recommendation performance superior to many state-of-the-art recommendation methods over a wide range of datasets with different sizes, densities, and types.
REFERENCES
[1] Oren Barkan and Noam Koenigstein. 2016. Item2Vec: Neural item embedding for collaborative filtering. In Proc. IEEE MLSP.
[2] Chih-Ming Chen, Yi-Hsuan Yang, Yian Chen, and Ming-Feng Tsai. 2017. Vertex-Context Sampling for Weighted Network Embedding. CoRR (2017).
[3] Chih-Ming Chen, Ming-Feng Tsai, Yu-Ching Lin, and Yi-Hsuan Yang. 2016. Query-based Music Recommendations via Preference Embedding. In Proc. ACM RecSys.
[4] Wei-Sheng Chin, Bo-Wen Yuan, Meng-Yuan Yang, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2016. LIBMF: A Library for Parallel Matrix Factorization in Shared-memory Systems. Journal of Machine Learning Research (2016).
[5] Evangelia Christakopoulou and George Karypis. 2016. Local Item-Item Models for Top-N Recommendation. In Proc. ACM RecSys.
[6] Ming Gao, Leihui Chen, Xiangnan He, and Aoying Zhou. 2018. BiNE: Bipartite Network Embedding. In Proc. ACM SIGIR.
[7] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. In Proc. ACM SIGKDD.
[8] Yupeng Gu, Yizhou Sun, Yanen Li, and Yang Yang. 2018. RaRE: Social Rank Regulated Large-scale Network Embedding. In Proc. WWW.
[9] Guibing Guo, Jie Zhang, and Neil Yorke-Smith. 2015. TrustSVD: Collaborative Filtering with Both the Explicit and Implicit Influence of User Trust and of Item Ratings. In Proc. AAAI.
[10] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. CoRR (2017).
[11] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proc. WWW.
[12] Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah Estrin. 2017. Collaborative Metric Learning. In Proc. WWW.
[13] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In Proc. IEEE ICDM.
[14] Dietmar Jannach, Paul Resnick, Alexander Tuzhilin, and Markus Zanker. 2016. Recommender systems – beyond matrix completion. Commun. ACM (2016).
[15] Yehuda Koren. 2008. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In Proc. ACM KDD.
[16] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. IEEE Computer (2009).
[17] Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding as Implicit Matrix Factorization. In Proc. NIPS.
[18] Dawen Liang, Jaan Altosaar, Laurent Charlin, and David M. Blei. 2016. Factorization Meets the Item Embedding: Regularizing Matrix Factorization with Item Co-occurrence. In Proc. ACM RecSys.
[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Proc. NIPS.
[20] Xia Ning and George Karypis. 2011. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In Proc. IEEE ICDM.
[21] Enrico Palumbo, Giuseppe Rizzo, and Raphaël Troncy. 2017. entity2rec: Learning User-Item Relatedness from Knowledge Graphs for Top-N Item Recommendation. In Proc. ACM RecSys.
[22] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In Proc. ACM SIGKDD.
[23] Bryan Perozzi, Vivek Kulkarni, and Steven Skiena. 2016. Walklets: Multiscale Graph Embeddings for Interpretable Network Classification. CoRR (2016).
[24] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In Proc. NIPS.
[25] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proc. UAI.
[26] Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In Proc. NIPS.
[27] Xiaoyuan Su and Taghi M. Khoshgoftaar. 2009. A survey of collaborative filtering techniques. Advances in Artificial Intelligence (2009).
[28] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale Information Network Embedding. In Proc. WWW.
[29] Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Latent Relational Metric Learning via Memory-based Attention for Collaborative Ranking. In Proc. WWW.
[30] Maksims Volkovs, Guangwei Yu, and Tomi Poutanen. 2017. DropoutNet: Addressing Cold Start in Recommender Systems. In Proc. NIPS.
[31] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative Deep Learning for Recommender Systems. In Proc. ACM KDD.
[32] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative Deep Learning for Recommender Systems. In Proc. ACM SIGKDD.
[33] Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling Up to Large Vocabulary Image Annotation. In Proc. IJCAI.
[34] Jason Weston, Hector Yee, and Ron J. Weiss. 2013. Learning to Rank Recommendations with the K-order Statistic Loss. In Proc. ACM RecSys.
[35] Lu Yu, Chuxu Zhang, Shichao Pei, Guolei Sun, and Xiangliang Zhang. 2018. WalkRanker: A Unified Pairwise Ranking Model with Multiple Relations for Item Recommendation. In Proc. AAAI.
[36] Wayne Xin Zhao, Jin Huang, and Ji-Rong Wen. 2016. Learning Distributed Representations for Recommender Systems with a Network Embedding Approach. In Proc. AIRS.
[37] Chang Zhou, Yuqiong Liu, Xiaofei Liu, Zhongyi Liu, and Jun Gao. 2017. Scalable Graph Embedding for Asymmetric Proximity. In Proc. AAAI.