GraphSW: a training protocol based on stage-wise training for GNN-based Recommender Model
Chang-You Tai
Academia Sinica, Taipei, Taiwan
[email protected]
Meng-Ru Wu
Academia Sinica, Taipei, Taiwan
[email protected]
Yun-Wei Chu
Academia Sinica, Taipei, Taiwan
[email protected]
Shao-Yu Chu
Academia Sinica, Taipei, Taiwan
[email protected]
Lun-Wei Ku
Academia Sinica, Taipei, Taiwan
[email protected]
ABSTRACT
Recently, researchers have utilized Knowledge Graphs (KG) as side information in recommendation systems to address the cold-start and sparsity issues and improve recommendation performance. Existing KG-aware recommendation models use the features of neighboring entities and structural information to update the embedding of the currently located entity. Although this rich information is beneficial to the downstream task, the cost of exploring the entire graph is massive and impractical. To reduce the computational cost while preserving the pattern of feature extraction, KG-aware recommendation models usually use a fixed-size, randomly selected set of neighbors rather than the complete information in the KG. Nonetheless, these approaches face two critical issues. First, fixed-size and randomly selected neighbors restrict the model's view of the graph. Second, as the order of the graph feature increases, the growth of the model's parameter dimensionality may make the training process hard to converge. To address these limitations, we propose GraphSW, a strategy based on a stage-wise training framework that accesses only a subset of the entities in the KG at each stage. In the following stages, the embedding learned in previous stages is provided to the network, and the model can learn the information from the KG gradually. We apply stage-wise training to two state-of-the-art recommendation models, RippleNet and Knowledge Graph Convolutional Networks (KGCN), and evaluate performance on six real-world datasets: Last.FM 2011, Book-Crossing, MovieLens-1M, LFM-1b 2015, Amazon-book, and Yelp 2018. The results show that the proposed strategy helps both models collect more information from the KG and improves their performance. Furthermore, we observe that GraphSW helps KGCN converge effectively with high-order graph features.
INTRODUCTION

To address the cold-start problem and the sparsity of user-item interactions in CF-based recommendation models, many researchers use the Knowledge Graph (KG) as side information. Because a KG introduces semantic relatedness among items and contains rich information and connections between items, it can enhance the performance of recommendation systems [1-5]. Recent KG-aware recommendation systems can be roughly classified into three categories: embedding-based methods [6-9], path-based methods [10-14], and Graph Neural Network (GNN) based methods [15-20]. Because GNN-based recommendation systems adopt the GNN architecture and support end-to-end training to exploit high-order information in the KG, they eliminate the limitations of embedding-based and path-based methods. Many researchers have studied GNN-based recommendation models. Ying et al. [21] apply GNNs to a bipartite graph and build a recommendation model that is subsequently deployed at Pinterest. Wu et al. [22] use a multi-hop neural network structure to transform signals into user/item representations. Wang et al. propose RippleNet [15] and KGCN [16]: RippleNet is a memory-network-like model that propagates items along paths rooted at each user's potential preferences to produce user representations, while KGCN uses neighborhood aggregation to compute item representations. Neighborhood aggregation can be extended to multiple hops away, allowing the model to capture high-order, long-distance entity dependencies. Wang et al. also propose the Knowledge Graph Attention Network (KGAT) [18], which applies an attention network to the KG and exploits the user-item graph structure by recursively propagating embeddings.

However, Graph Convolutional Networks suffer from the neighbor explosion issue when aggregating neighboring nodes. In a GNN-based recommendation model, each node's representation in the current layer is aggregated from its neighbors' representations in the previous layer; as the hop number increases, the multi-hop neighborhoods consume huge computational resources. To mitigate this issue, current GNN-based recommenders such as PinSage [21], RippleNet [15], and KGCN [16] adopt a "fixed-size" strategy: in each layer, the model samples only a fixed-size set of neighbors from the previous layer instead of using the full neighborhood sets. To use more neighborhood information, PinSage resamples another fixed-size set of neighbors for each layer in every minibatch iteration. However, the original paper does not discuss the performance gain from this resampling strategy; it only discusses the trade-off between performance and runtime under different neighborhood sizes. Could this strategy be applied to different models and datasets? In addition, the statistics of the dataset are not included, so we do not know the exact number of entity-relation-entity triples used when the model achieves its best performance.
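To make the fixed-size sampling and resampling strategies concrete, the following is a minimal Python sketch under our own assumptions; the toy `kg` dictionary, the function name, and the uniform sampling rule are illustrative, not code from PinSage, RippleNet, or KGCN:

```python
import numpy as np

def sample_fixed_size_neighbors(kg, entity, k, rng):
    """Uniformly sample a fixed-size neighbor set S(v) of size k.

    `kg` maps an entity id to its (relation, tail-entity) pairs.
    Sampling is with replacement when the entity has fewer than k
    neighbors, so every entity yields exactly k sampled neighbors.
    """
    neighbors = kg[entity]
    idx = rng.choice(len(neighbors), size=k, replace=len(neighbors) < k)
    return [neighbors[i] for i in idx]

rng = np.random.default_rng(0)
kg = {0: [(1, 7), (2, 3), (1, 9)]}  # toy KG fragment: entity -> [(r, t), ...]

# Resampling strategy: draw a fresh fixed-size neighbor set at every
# minibatch iteration instead of fixing one set for the whole run.
for step in range(3):
    print(sample_fixed_size_neighbors(kg, entity=0, k=2, rng=rng))
```

Under this scheme, each layer sees only k neighbors per node, but repeated resampling gradually exposes the model to a larger share of the KG.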
In addition to the neighbor explosion issue, GNN-based recommendation models such as KGCN, whose architecture is similar to PinSage's, face the issue that the model is hard to converge as the order of the graph feature increases. Because KGCN's architecture is designed to automatically capture both high-order structural and semantic information in the KG, massive noisy entities and the growth of KGCN's parameter dimensionality make the training process hard to converge when the order of the graph feature increases [16]. To speed up convergence and exclude noisy features at the beginning of a deep neural network's training process, the strategy of stage-wise training was proposed [23]. In stage-wise training, the learning process is broken down into a number of related sub-tasks, and the training data is presented to the network gradually. In addition, the features learned in the previous stage are extracted and transferred to the next stage, so that the network gradually absorbs the knowledge shared between tasks and finally achieves better predictions. Stage-wise training has been used in different applications such as multi-task learning [24], feature extraction layers [23], and multi-modal recognition [25].

On the other hand, we found that in their original training protocols, the GNN-based recommendation models KGCN and RippleNet do not adopt the resampling strategy. These two models are therefore good choices for studying the exact performance change brought by resampling. In this paper, we aim to conduct a comprehensive study of the exact performance change from the resampling strategy on different variants, and to handle the difficult convergence of KGCN as the order of the graph feature increases. To this end, we propose GraphSW, a strategy based on a stage-wise training framework. At every stage, KGCN and RippleNet are fed with only a small subset of the entities in the KG instead of a massive number of entities, which allows the models to more easily find crucial information. In the following stages, the embedding learned in previous stages is provided to the network, and the model learns the information from the KG gradually. Empirically, we train and evaluate the two recommendation models, RippleNet and KGCN, with stage-wise learning on six real-world datasets: Last.FM 2011, Book-Crossing, MovieLens-1M, LFM-1b 2015, Amazon-book, and Yelp 2018. The results show that stage-wise learning allows KGCN and RippleNet to collect more information from the KG and improves recommendation performance on all datasets. We conduct a comprehensive study of the performance gain on different variants and, to our surprise, find that KGCN achieves its best performance when the neighbor sampling size is small, even when KGCN does not use all the information in the KG. In addition, for KGCN, we find that stage-wise training can mitigate the difficult convergence issue as the order of the graph increases. As the hop number grows to 4, the average improvement of stage-wise training on AUC is 34.8%, 17.5%, 2.3%, 2.2%, 8.2%, and 0.9% for Last.FM 2011, Book-Crossing, MovieLens-1M, LFM-1b 2015, Amazon-book, and Yelp 2018, respectively.

In summary, this work makes four contributions:
• With GraphSW, we conduct a comprehensive study of the performance gain when more KG information is used, on six real-world datasets with KGs of different sizes.
• In general, we find that GraphSW improves the performance of KGCN and RippleNet. However, to our surprise, KGCN achieves its best performance when a small neighbor sampling size is set.
• Because GraphSW helps KGCN collect useful information and exclude noisy information, the difficult convergence issue with high-order graph features can be addressed.
• We release the code of KGCN and RippleNet with stage-wise learning so that researchers can validate the reported results and conduct further research. The code and the data are available at https://github.com/mengruwu/graphsw.
PROPOSED METHOD

In this section, we introduce the proposed GraphSW, a training protocol based on stage-wise training. First, we briefly describe the problem formulation of KG-aware recommendation. Then, we present the design of the proposed stage-wise training on KGCN.
Problem Formulation

We first introduce the task formulation of the KG-aware recommendation system. The sets of users and items are denoted as $\mathcal{U} = \{u_1, u_2, \dots\}$ and $\mathcal{V} = \{v_1, v_2, \dots\}$, and the user-item interaction matrix $Y = \{y_{uv} \mid u \in \mathcal{U}, v \in \mathcal{V}\}$ is defined according to the users' implicit feedback: if user $u$ has engaged with item $v$, then $y_{uv} = 1$;
otherwise $y_{uv} = 0$. The knowledge graph $\mathcal{G}$ is comprised of entity-relation-entity triples $\{(h, r, t) \mid h, t \in \mathcal{E}, r \in \mathcal{R}\}$, where $h$, $r$, and $t$ denote the head, relation, and tail of a knowledge triple, and $\mathcal{E}$ and $\mathcal{R}$ denote the sets of entities and relations in the KG. $N(v)$ denotes the neighborhood of item $v$, and the sampled neighbor set of item $v$ is denoted $S(v) \triangleq \{e \mid e \sim N(v)\}$ with $|S(v)| = K$, where $K$ is the neighbor sampling size. The recommendation model predicts whether user $u$ is interested in items with which the user has no prior interaction; the ultimate goal of a KG-aware recommendation model is thus to learn a prediction function $\hat{y}_{uv} = \mathcal{F}(u, v; \Theta, \mathcal{G})$, where $\Theta$ denotes the model parameters and $\hat{y}_{uv}$ is the probability that user $u$ will engage with item $v$.

In stage-wise training, the training set at stage $s$ is denoted $\mathcal{T}_s = (u, v, \mathcal{G}_s, Y)$, where $s \in \{1, 2, \dots, S\}$ is the current stage, $u \in \mathcal{U}$ and $v \in \mathcal{V}$ are user-item pairs, $Y$ is the entire user-item interaction matrix, and $\mathcal{G}_s = \bigcup_{n \in \mathcal{V}} S(n)$ is the fixed-size set of sampled neighbors of all items in the knowledge graph at stage $s$. We define the learning algorithm of the KG-aware recommendation model as $\mathcal{A}(\cdot, \cdot)$ and the learned parameters as $W_s$:

$$W_s = \mathcal{A}(\mathcal{T}_s, W_s^{\mathrm{init}}) \qquad (1)$$

where $W_s^{\mathrm{init}}$ is the initial value of the parameters, and the first-stage parameters $W_1^{\mathrm{init}}$ are randomly initialized. Successive stages are connected as follows:

$$W_{s+1}^{\mathrm{init}} := W_s, \quad \forall s \in \{1, \dots, S-1\} \qquad (2)$$

where $W_s$ are the learned parameters of the whole model and $W_{s+1}^{\mathrm{init}}$ are the transferred parameters. In each training stage, we save $W_s$ and transfer part of the parameters, $W_{s+1}^{\mathrm{init}}$, from the previous stage to the next.

Stage-wise Training on KGCN

Figure 1: Schematic diagram of stage-wise training on KGCN

The whole training protocol is illustrated in Figure 1. The parameters of a GNN-based recommendation system can be roughly classified into two parts: the knowledge graph representation and the aggregator parameters. Considering the limited training performance that results from high parameter dimensionality, we train these parameters in different stages. Because the original models are designed for computational efficiency, the aggregator samples only a fixed-size set of neighbors in the KG, $\mathcal{G}_s$, so the model collects only part of the information in the KG during each training stage. We therefore first fine-tune the knowledge graph representation, accumulating entity information into the KG representation to explore the graph more comprehensively. During the following training stages, we extract all model parameters $W_s$ but transfer only the learned knowledge graph representation as $W_{s+1}^{\mathrm{init}}$ to the next stage. In addition, the model randomly samples another set of neighbors in the KG, $\mathcal{G}_{s+1}$, to collect more entity information into the KG representation in the next stage. After the knowledge graph representation is well trained, it is used to fine-tune the remaining aggregator part, and the prediction performs better because of the comprehensive view of the graph.
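To make the protocol concrete, the following minimal Python sketch implements Eqs. (1) and (2), assuming a hypothetical PyTorch-style model whose parameters are split into a `kg_embedding` group and an aggregator group; all names below are our own illustration, not from the released code:

```python
# A minimal sketch of GraphSW (Eqs. (1)-(2)). `sample_subgraph` draws a
# fresh fixed-size neighbor set G_s, and `train_one_stage` stands in for
# the learning algorithm A(T_s, W_s^init), run with per-stage early stopping.
def graphsw_train(model, kg, interactions, num_stages,
                  sample_subgraph, train_one_stage):
    transferred = None                       # W_1^init: random initialization
    for s in range(1, num_stages + 1):
        if transferred is not None:
            # Eq. (2): transfer only the learned KG representation to
            # initialize stage s; aggregator weights start fresh.
            model.kg_embedding.load_state_dict(transferred)
            model.reset_aggregator()
        g_s = sample_subgraph(kg)            # resample fixed-size G_s
        train_one_stage(model, g_s, interactions)  # Eq. (1)
        transferred = model.kg_embedding.state_dict()
    return model
```

The key design choice mirrors the text above: only the KG representation crosses stage boundaries, while the aggregator is re-initialized and $\mathcal{G}_s$ is resampled at each stage.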
EXPERIMENTS

In this section, we evaluate the recommendation performance of RippleNet and KGCN with stage-wise training. First, we introduce the datasets, the two state-of-the-art GNN-based recommender models, the model settings, and the experimental setup. Then, we present and discuss the recommendation performance.

Datasets

To evaluate the proposed strategy, we use six real-world datasets that are publicly available [15, 16, 18] and vary in size: Last.FM 2011, Book-Crossing, MovieLens-1M, LFM-1b 2015, Amazon-book, and Yelp 2018. The knowledge graph (KG) for each dataset is built in a different way: for Last.FM 2011, Book-Crossing, and MovieLens-1M, the KGs are built from Microsoft Satori with a confidence level greater than 0.9; the KGs of LFM-1b 2015 and Amazon-book are built by title matching, as described in [26]; and for Yelp 2018, the KG is built from the local business information network [18]. The statistics of the datasets are reported in Table 1. We transform each dataset into implicit feedback, where an entry is marked 1 if the item has been interacted with or rated by the user, and 0 otherwise. The rating threshold for MovieLens-1M is 4; for Book-Crossing, Last.FM 2011, LFM-1b 2015, Amazon-book, and Yelp 2018, we treat every observed user-item interaction as a positive example. To ensure the quality of the datasets, we apply a 20-core setting for Amazon-book and Yelp 2018 and a 50-core setting for LFM-1b 2015; in other words, these datasets retain only users and items with at least that number of interactions.

• MovieLens-1M is a widely used benchmark dataset for movie recommendation. It contains approximately 1 million explicit ratings (ranging from 1 to 5) on 2,445 items from 6,036 users.
• Book-Crossing is collected from the Book-Crossing community. It contains 139,746 explicit ratings (ranging from 0 to 10) on 14,967 items from 17,860 users.
• Last.FM 2011 is a music-listening dataset collected from the Last.fm online music system. It contains 42,346 explicit rating records on 1,872 items from 3,846 users.
• LFM-1b 2015 is a music dataset that records artists, albums, tracks, and users, as well as individual listening events. It contains about 3 million explicit rating records on 15,471 items from 12,134 users.
• Amazon-book records users' preferences on book products, including user, item, rating, and timestamp information. It contains about half a million explicit rating records on 9,854 items from about 7 thousand users.
• Yelp 2018 comes from the 2018 edition of the Yelp challenge and concerns local businesses. It contains about 1.2 million explicit rating records on about 14 thousand items from about 16 thousand users.

Table 1: Basic statistics of the six datasets (Last.FM 2011, Book-Crossing, MovieLens-1M, LFM-1b 2015, Amazon-book, Yelp 2018)
Models

To evaluate the performance of GNN-based recommendation models with stage-wise training, we conduct experiments on two GNN-based recommendation models, RippleNet and Knowledge Graph Convolutional Networks (KGCN):

• RippleNet [15] is a hybrid KG-aware recommendation model that combines knowledge graph embedding regularization with path-based concepts. RippleNet is a memory-network-like approach in which a user's preference embedding is aggregated from entity embeddings in the KG. For each user, RippleNet samples a fixed-size set of neighbors to predict the user's preference.
• KGCN [16] is also a hybrid KG-aware recommendation model. It maintains a preference embedding for each user, which allows the model to capture users' personalized interests in relations. In addition, we adopt label smoothness regularization on KGCN, which leads to better generalization [17]. For each user, the KGCN sampler uniformly samples a fixed-size set of neighbors for each entity to predict the user's preference (see the aggregation sketch below).
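To make KGCN's aggregation pattern concrete, here is a minimal NumPy sketch of a user-personalized sum aggregator over one sampled neighbor set. It illustrates the idea in [16] under simplifying assumptions (one hop, the learned linear transformation omitted) and is not the authors' implementation:

```python
import numpy as np

def kgcn_aggregate(item_emb, neighbor_embs, relation_embs, user_emb):
    """One simplified KGCN-style aggregation step over a sampled neighbor set.

    Shapes: item_emb (d,), neighbor_embs (K, d), relation_embs (K, d),
    user_emb (d,). User-relation scores personalize the neighborhood
    average, which is then combined with the item embedding.
    """
    scores = relation_embs @ user_emb                # (K,) user-relation affinity
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over the K neighbors
    neighborhood = weights @ neighbor_embs           # (d,) weighted neighbor mix
    return np.tanh(item_emb + neighborhood)          # sum aggregator + nonlinearity
```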
Experimental Setup

We evaluate stage-wise training in two experimental settings: (1) click-through rate (CTR) prediction and (2) top-K recommendation. In CTR prediction, the trained model predicts each interaction in the test set, and we use AUC and ACC as evaluation metrics. In top-K recommendation, the trained model selects the K items with the highest predicted click probability for each user in the test set, and Recall@K is used to evaluate the result. Each dataset is split into training, evaluation, and test sets in a 6:2:2 ratio. Each experiment is repeated at least 5 times, and the average score is reported. For both recommendation models, all trainable parameters are optimized with the Adam algorithm.
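The metrics can be computed as in the short sketch below; this is our illustration using scikit-learn for AUC and ACC, and the 0.5 threshold for ACC and the function names are assumptions rather than details stated in the paper:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

def ctr_metrics(y_true, y_score):
    """AUC and ACC for CTR prediction; scores thresholded at 0.5 for ACC."""
    y_pred = (np.asarray(y_score) >= 0.5).astype(int)
    return roc_auc_score(y_true, y_score), accuracy_score(y_true, y_pred)

def recall_at_k(ranked_items, relevant_items, k):
    """Recall@K for one user: fraction of relevant items found in the top K."""
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / max(len(relevant_items), 1)
```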
For KGCN and RippleNet, the hyper-parameters are determined by optimizing AUC on a validation set. For both models, the learning rate η is selected from [0.1, 0.02, 0.005, 0.0002, 0.0005, 0.0008], and the regularization weight is selected from a small grid of values of the form $1 \times 10^{-k}$ and $2 \times 10^{-k}$. To reduce computation, we set the embedding dimension of all entities and relations to 8 for RippleNet and 16 for KGCN. Moreover, in GraphSW, an early stopping strategy is adopted according to the performance of the current stage.
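Per-stage early stopping can be sketched as follows; `train_fn` and `validate_fn` are hypothetical hooks standing in for one training epoch and one validation pass, not APIs of RippleNet or KGCN:

```python
import copy

def fit_stage(model, train_fn, validate_fn, max_epochs=100, patience=5):
    """Early stopping within one GraphSW stage, keyed on validation AUC.

    Training stops once validation AUC has not improved for `patience`
    epochs; the best weights found are restored before the next stage.
    """
    best_auc, best_state, stalled = 0.0, None, 0
    for _ in range(max_epochs):
        train_fn(model)
        auc = validate_fn(model)
        if auc > best_auc:
            best_auc, best_state, stalled = auc, copy.deepcopy(model.state_dict()), 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return best_auc
```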
Results

The results of KGCN and RippleNet on the CTR prediction and top-K recommendation experiments are shown in Table 2 and Table 3; we discuss the performance gains from GraphSW below. In general, GraphSW improves the recommendation performance of KGCN and RippleNet on all datasets. For KGCN, sparse datasets such as Book-Crossing and Last.FM 2011 cause the model to fail to converge; with stage-wise training, however, KGCN gains 2.2% and 3.7% AUC on Last.FM 2011 and Book-Crossing, respectively. We conclude that GraphSW helps KGCN collect more information from the KG and perform well in sparse scenarios. For RippleNet, a large improvement from GraphSW is found on datasets with large KGs, such as MovieLens-1M, LFM-1b 2015, Amazon-book, and Yelp 2018, whereas the corresponding improvement for KGCN is relatively small. With GraphSW, RippleNet can collect more information from the KG; as a result, on datasets with larger KGs, the improvement from GraphSW is larger for RippleNet than for KGCN.
After comparing the improvements on different datasets, we further investigate the models' performance with GraphSW under different neighbor sampling sizes. The results are shown in Table 4 and Table 5. For both KGCN and RippleNet, the hop number is set to 1 on all datasets, except that for Yelp 2018 we set the hop number to 2 for KGCN because of its better performance. In general, GraphSW improves model performance at every sampling size. For KGCN, we surprisingly observe that with GraphSW it achieves the best result on almost every dataset when the neighbor sampling size is small. When KGCN is trained with a small set of neighbor nodes, the model can better capture the neighbors' information in the KG, and with GraphSW it can still gradually accumulate information from the KG; as a result, KGCN performs best with GraphSW at small neighbor sampling sizes.
Model        | MovieLens-1M | Book-Crossing | Last.FM 2011 | LFM-1b 2015  | Amazon-book  | Yelp 2018
             | AUC    ACC   | AUC    ACC    | AUC    ACC   | AUC    ACC   | AUC    ACC   | AUC    ACC
KGCN         | .9171  .8452 | .6750  .6208  | .7865  .7099 | .9127  .8617 | .8151  .7393 | .9049  .8399
KGCN-SW      | .9223  .8490 | .7001  .6390  | .8041  .7311 | .9194  .8670 | .8241  .7470 | .9068  .8420
RippleNet    | .9276  .8557 | .7630  .6909  | .8081  .7418 | .9361  .8847 | .8216  .7461 | .9203  .8588
RippleNet-SW | .9423  .8721 | .7666  .6929  | .8120  .7457 | .9530  .9015 | .9010  .8259 | .9481  .8920

Table 2: AUC and ACC scores in CTR prediction on all datasets (-SW denotes training with GraphSW)
Model        | MovieLens-1M | Book-Crossing | Last.FM 2011 | LFM-1b 2015  | Amazon-book  | Yelp 2018
             | R@25   R@50  | R@25   R@50   | R@25   R@50  | R@25   R@50  | R@25   R@50  | R@25   R@50
KGCN         | .1229  .2102 | .0483  .0763  | .1343  .1890 | .0067  .0121 | .0377  .0603 | .0395  .0634
KGCN-SW      | .1308  .2236 | .0478  .0785  | .1463  .2119 | .0079  .0134 | .0405  .0672 | .0412  .0679
RippleNet    | .1302  .2300 | .0482  .0792  | .1177  .1917 | .0101  .0173 | .0362  .0618 | .0400  .0692
RippleNet-SW | .1339  .2371 | .0491  .0797  | .1158  .1917 | .0123  .0182 | .0578  .0910 | .0509  .0853

Table 3: Recall@K scores in top-K recommendation on all datasets (-SW denotes training with GraphSW)

Dataset        | (increasing neighbor sampling size →)
MovieLens-1M   | .9171  .9160  .9167  .9146   –      –
MovieLens-1M*  | .9201  .9198   –     .9198  .9191  .9195
Book-Crossing  | .6689  .6694  .6635  .6750  .6472  .6418
Book-Crossing* | .7001  .6895  .6893  .6771  .6745  .6607
Last.FM 2011   | .7714  .7817  .7859  .7865  .7617  .7705
Last.FM 2011*  | .8041  .8024  .8013  .7964  .8002  .7989
LFM-1b 2015    | .9117  .9127  .9090  .9087  .9086  .9085
LFM-1b 2015*   | .9194  .9191  .9189  .9167  .9165  .9172
Amazon-book    | .8151  .8122  .8111  .8060  .8002  .8047
Amazon-book*   | .8241  .8224  .8169  .8160  .8171  .8181
Yelp 2018      | .8999  .8992  .9022  .9029  .9024  .9049
Yelp 2018*     | .9033  .9041  .9068  .9059  .9058  .9045

Table 4: AUC results of KGCN with different neighbor sampling sizes, where * denotes GraphSW ("–" marks values unrecoverable from the source)

Dataset        | (increasing memory size →)
MovieLens-1M   |   –      –      –      –      –    .9276
MovieLens-1M*  | .8996  .9092  .9186  .9292  .9357  .9423
Book-Crossing  | .7520  .7615  .7630  .7602  .7531  .6717
Book-Crossing* | .7596  .7647  .7666  .7661  .7513  .7412
Last.FM 2011   | .7926  .7997  .8053  .8081  .8044  .8005
Last.FM 2011*  | .8055  .8083  .8112  .8120  .8078  .8035
LFM-1b 2015    | .8817  .8954  .9039  .9171  .9279  .9361
LFM-1b 2015*   | .8902  .9065  .9194  .9345  .9433  .9530
Amazon-book    | .6929  .6990  .7334  .7706  .7939  .8216
Amazon-book*   | .7256  .7870  .8404  .8693  .9010  .9000
Yelp 2018      | .7987  .8650  .8955  .9064  .9149  .9203
Yelp 2018*     | .8649  .8987  .9199  .9357  .9384  .9481

Table 5: AUC results of RippleNet with different memory sizes, where * denotes GraphSW

For RippleNet, compared with KGCN, the best results with GraphSW are achieved when the neighbor sampling size is large. With the help of GraphSW, we find that performance on datasets with large KGs grows substantially.
As mentioned before, KGCN fails to converge as the hop number increases [16] due to increasing noise and the growth of parameter dimensionality. We therefore conduct experiments to investigate the improvement from GraphSW at different hop numbers. KGCN's results are shown in Table 6, and RippleNet's results are shown in the Appendix. In general, GraphSW improves both models' performance at every hop number. Moreover, we find that KGCN easily collapses when the hop number is 3 or 4, whereas RippleNet does not. With GraphSW, however, KGCN's performance improves by a large margin: at hop number 4, the average AUC improvement is 34.8%, 17.5%, 2.3%, 2.2%, 8.2%, and 0.9% for Last.FM 2011, Book-Crossing, MovieLens-1M, LFM-1b 2015, Amazon-book, and Yelp 2018, respectively.

Table 6: AUC results of KGCN with different hop numbers H, where * denotes GraphSW

Model    | MovieLens-1M | Book-Crossing | Last.FM 2011 | LFM-1b 2015 | Amazon-book | Yelp 2018
KGCN-SW* | .8868        | .6476         | .7372        | .9070       | .7780       | .8911
KGCN-SW  | .9121        | .6614         | .7821        | .9194       | .8245       | .9062

Table 7: AUC results of KGCN with hop number H = 4, where * denotes GraphSW with the whole set of training parameters transferred

CONCLUSION

In this paper, we proposed GraphSW, a training protocol based on stage-wise training for GNN-based recommendation models. With GraphSW, we conducted a comprehensive study of the performance gain when different amounts of KG information are used. In general, we find that GraphSW improves KGCN and RippleNet on every dataset, which shows that the resampling strategy should not be neglected by GNN-based recommendation models that sample fixed-size sets of neighbors. In addition, we find that KGCN and RippleNet achieve their best results in different situations because of their different aggregation methods: KGCN, whose structure is similar to PinSage's, achieves its best results on almost every dataset with stage-wise training when the neighbor sampling size is small. Stage-wise training also allows us to train the model in separate stages. Without stage-wise training, KGCN suffers from noise and fails to converge as the hop number increases because of the growth of parameter dimensionality; with stage-wise training, its performance improves by a large margin. The improvement brought by GraphSW encourages the design of GNN-based recommendation models with more complicated architectures and parameters.
REFERENCES

[1] B. Hu, C. Shi, W. X. Zhao, and P. S. Yu, "Leveraging meta-path based context for top-n recommendation with a neural co-attention model," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, 2018, pp. 1531–1540.
[2] J. Huang, W. X. Zhao, H. Dou, J.-R. Wen, and E. Y. Chang, "Improving sequential recommendation with knowledge-enhanced memory networks," in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '18). ACM, 2018, pp. 505–514.
[3] Z. Sun, J. Yang, J. Zhang, A. Bozzon, L.-K. Huang, and C. Xu, "Recurrent knowledge graph embedding for effective recommendation," in Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). ACM, 2018, pp. 297–305.
[4] F. Zhang, N. J. Yuan, D. Lian, X. Xie, and W.-Y. Ma, "Collaborative knowledge base embedding for recommender systems," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, 2016, pp. 353–362.
[5] H. Zhao, Q. Yao, J. Li, Y. Song, and D. L. Lee, "Meta-graph based recommendation fusion over heterogeneous information networks," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, 2017, pp. 635–644.
[6] Y. Cao, X. Wang, X. He, Z. Hu, and T. Chua, "Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences," CoRR, vol. abs/1902.06236, 2019.
[7] J. Huang, W. X. Zhao, H. Dou, J.-R. Wen, and E. Y. Chang, "Improving sequential recommendation with knowledge-enhanced memory networks," in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '18). ACM, 2018, pp. 505–514.
[8] F. Zhang, N. J. Yuan, D. Lian, X. Xie, and W.-Y. Ma, "Collaborative knowledge base embedding for recommender systems," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, 2016, pp. 353–362.
[9] H. Wang, F. Zhang, X. Xie, and M. Guo, "DKN: Deep knowledge-aware network for news recommendation," in Proceedings of the 2018 World Wide Web Conference (WWW '18). International World Wide Web Conferences Steering Committee, 2018, pp. 1835–1844.
[10] B. Hu, C. Shi, W. X. Zhao, and P. S. Yu, "Leveraging meta-path based context for top-n recommendation with a neural co-attention model," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, 2018, pp. 1531–1540.
[11] Z. Sun, J. Yang, J. Zhang, A. Bozzon, L.-K. Huang, and C. Xu, "Recurrent knowledge graph embedding for effective recommendation," in Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). ACM, 2018, pp. 297–305.
[12] X. Wang, D. Wang, C. Xu, X. He, Y. Cao, and T. Chua, "Explainable reasoning over knowledge graphs for recommendation," CoRR, vol. abs/1811.04540, 2018.
[13] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han, "Personalized entity recommendation: A heterogeneous information network approach," in Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM '14). ACM, 2014, pp. 283–292.
[14] H. Zhao, Q. Yao, J. Li, Y. Song, and D. L. Lee, "Meta-graph based recommendation fusion over heterogeneous information networks," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, 2017, pp. 635–644.
[15] H. Wang, F. Zhang, J. Wang, M. Zhao, W. Li, X. Xie, and M. Guo, "Ripple network: Propagating user preferences on the knowledge graph for recommender systems," CoRR, vol. abs/1803.03467, 2018.
[16] H. Wang, M. Zhao, X. Xie, W. Li, and M. Guo, "Knowledge graph convolutional networks for recommender systems," CoRR, vol. abs/1904.12575, 2019.
[17] H. Wang, F. Zhang, M. Zhang, J. Leskovec, M. Zhao, W. Li, and Z. Wang, "Knowledge graph convolutional networks for recommender systems with label smoothness regularization," CoRR, vol. abs/1905.04413, 2019.
[18] X. Wang, X. He, Y. Cao, M. Liu, and T. Chua, "KGAT: Knowledge graph attention network for recommendation," CoRR, vol. abs/1905.07854, 2019.
[19] L. Wu, P. Sun, Y. Fu, R. Hong, X. Wang, and M. Wang, "A neural influence diffusion model for social recommendation," CoRR, vol. abs/1904.10322, 2019.
[20] X. Wang, X. He, M. Wang, F. Feng, and T. Chua, "Neural graph collaborative filtering," CoRR, vol. abs/1905.08108, 2019.
[21] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, "Graph convolutional neural networks for web-scale recommender systems," CoRR, vol. abs/1806.01973, 2018.
[22] Y. Wu, H. Liu, and Y. Yang, "Graph convolutional matrix completion for bipartite edge prediction," in Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR). SciTePress, 2018, pp. 51–60.
[23] E. Barshan and P. Fieguth, "Stage-wise training: An improved feature learning strategy for deep models," in Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015, ser. Proceedings of Machine Learning Research, vol. 44. PMLR, 2015, pp. 49–59.
[24] H. Zhang and V. M. Patel, "Densely connected pyramid dehazing network," CoRR, vol. abs/1803.08396, 2018.
[25] A. Eitel, J. T. Springenberg, L. Spinello, M. A. Riedmiller, and W. Burgard, "Multimodal deep learning for robust RGB-D object recognition," CoRR, vol. abs/1507.06821, 2015.
[26] W. X. Zhao, G. He, H. Dou, J. Huang, S. Ouyang, and J. Wen, "KB4Rec: A dataset for linking knowledge bases with recommender systems."