SceneRec: Scene-Based Graph Neural Networks for Recommender Systems
Gang Wang
SKLSDE Lab, Beihang University, [email protected]
Ziyi Guo
Xiang Li
East China Normal University, [email protected]
Dawei Yin
Shuai Ma
SKLSDE Lab, Beihang University, [email protected]
ABSTRACT
Collaborative filtering has been largely used to advance modern recommender systems to predict user preference. A key component in collaborative filtering is representation learning, which aims to project users and items into a low dimensional space to capture collaborative signals. However, the scene information, which has effectively guided many recommendation tasks, is rarely considered in existing collaborative filtering methods. To bridge this gap, we focus on scene-based collaborative recommendation and propose a novel representation model SceneRec. SceneRec formally defines a scene as a set of pre-defined item categories that occur simultaneously in real-life situations, and creatively designs an item-category-scene hierarchical structure to build a scene-based graph. On the scene-based graph, we adopt graph neural networks to learn a scene-specific representation for each item node, which is further aggregated with the latent representation learned from collaborative interactions to make recommendations. We perform extensive experiments on real-world E-commerce datasets and the results demonstrate the effectiveness of the proposed method.
1 INTRODUCTION

Recommender systems have become increasingly important to address the information overload problem and have been widely applied in many different fields, such as social networks [22] and news websites [24]. To predict a user's preference, an extensive number of collaborative filtering (CF) methods have been proposed to advance recommender systems. The basic idea of CF is that user behavior tends to be similar, so a user's interest can be predicted from historical interaction data such as clicks or purchases. A key component of CF is learning latent representations, which usually project users and items into a lower dimensional space. A variety of CF models, including matrix factorization [8], deep neural networks [7] and graph convolutional networks [16], are adopted to capture collaborative signals from a user-item matrix or a user-item bipartite graph.

In the meantime, recommender systems that integrate scene information are attracting more and more attention. For example, predictive models are able to recommend substitutable or complementary items [9, 10, 13] that visually match the scene represented in an input image. The image data contains rich contextual information such as background color, location and landscape, which may be ignored by conventional CF methods. However, the input image may reveal no scene information, or may even be unavailable in many recommendation scenarios. For example, in E-commerce systems, most thumbnail images only contain product pictures embedded in a white background. In such circumstances, scene-based recommendation becomes infeasible because the scene definition is not clear.

To address this issue, this work investigates the utility of incorporating scene information into CF recommendation. This study brings two challenges. First, a formal definition of scene is essential to this problem: without image data, how to formally define a scene becomes a question in its own right. Second, how to incorporate scene information into existing CF models must also be taken into account. Keeping these two key points in mind, we propose SceneRec, a novel method for scene-based collaborative filtering. Specifically, we propose a principled item-category-scene hierarchical structure to construct the scene-based graph (Figure 1). In particular, a scene is formally defined by a set of fine-grained item categories that could simultaneously occur in real-life situations. For example, the set of item categories {Keyboard, Mouse, Mouse Pad, Battery Charger, Headset} represents the scene "Peripheral Devices". This can be naturally applied to a situation where a user has already bought a PC and many different types of supplementary devices are recommended.
Moreover, SceneRec applies graph neural networks on the scene-based graph to learn item representations based on the scene information, which are further aggregated with the latent representations learned from user-item interactions to make predictions.

To the best of our knowledge, SceneRec is among the first to study scene-based recommendation with a principled scene definition, and our main contributions are summarized as follows: (1) We study the problem of scene-based collaborative filtering for recommender systems, where a scene is formally defined as a set of item categories that could reflect a real-world situation. (2) We propose a novel recommendation model SceneRec. It leverages graph neural networks to propagate scene information and learn a scene-specific representation for each item. This representation is further incorporated with the latent representation from user-item collaborative interactions to make predictions. (3) We conduct extensive experiments to evaluate the performance of SceneRec against 9 other baseline methods, and find that SceneRec is effective: on average it improves both metrics (NDCG@10, HR@10) over the baselines.

2 RELATED WORK

Collaborative filtering has been widely applied in modern recommender systems. One class of CF methods builds explicit models on the user-item interactions. For example, matrix factorization [2, 8, 12, 14] maps the representation of each user and each item into a lower dimensional space and calculates the inner product between vector representations to make predictions. To enhance recommendation, various contextual information has been incorporated into CF, such as user reviews [21], social connections [22] and item side information [17]. Different from existing works that rely on a linear predictive function, many recent efforts apply deep learning techniques [7] to learn non-linearities between user embeddings and item embeddings.

Another line of CF methods takes user-item interactions as a bipartite graph. For example, some early efforts [5] conduct label propagation, which essentially searches neighborhoods on the graph, to capture collaborative signals. Inspired by the success of graph neural networks (GNN) [6, 11], which directly conduct convolutional operations on non-grid network data, a series of GNN-based recommendation methods have been proposed on an item-item graph [23] or a user-item graph [16] to learn a vector embedding for each item or user. The general idea is that the representation of one graph node can be aggregated and combined from the representations of its neighbor nodes. NGCF [20] extends GNN to multiple depths to capture the high-order connectivities included in user-item interactions. KGAT [19] and KGCN [18] investigate the utility of incorporating a knowledge graph (KG) into CF by projecting KG entities to item nodes.

Our work is also related to the application of scene information in recommender systems. For example, given a scene in the form of an input image, recommendation methods are capable of providing substitutable [10, 13] or supplementary [9] products that visually match the input scene. However, in these tasks, the scene is represented by image data, which is not readily available in many recommendation scenarios. In such cases, scene-based recommendation becomes difficult or even impossible because the scene has not been well defined. In this paper, we aim to integrate scene information into CF where each scene is defined by a set of fine-grained item categories.
By exploiting the scene-specific representation on top of conventional CF signals, the model can potentially improve predictions of user preference.

3 PRELIMINARIES

Definition 3.1. Scene. A scene is defined as a set of item categories that occur simultaneously and frequently in a real-life situation, denoted as $s = \{c_1, c_2, \cdots, c_{|s|} \mid c_i \in \mathcal{C}, 1 \le i \le |s|\}$, where $\mathcal{C}$ is the set of item categories and $|s| \ge 1$. The item category is one of an item's attributes and $s \subset \mathcal{C}$.
Definition 3.2. User-Item Bipartite Graph. The user-item interactions can be represented as a bipartite graph $\mathcal{G} = \{(u, x_{ui}, i) \mid u \in \mathcal{U}, i \in \mathcal{I}\}$, where $\mathcal{U}$ and $\mathcal{I}$ are the sets of users and items respectively, and the edge $x_{ui}$ indicates the occurrence or frequency with which the user $u$ has interacted with the item $i$, e.g., by clicking or purchasing.
Definition 3.3. Scene-based Graph. The scene-based graph $\mathcal{H}$ is a hierarchical network with three layers: the item layer, the category layer, and the scene layer, as shown in Figure 1. The item layer consists of items and is denoted as $\mathcal{L}_{item} = \{(i_p, y_{pq}, i_q) \mid i_p, i_q \in \mathcal{I}\}$, where the edge $y_{pq}$ represents the similarity between two items $i_p$ and $i_q$. The category layer is denoted as $\mathcal{L}_{cate} = \{(c_p, z_{pq}, c_q) \mid c_p, c_q \in \mathcal{C}\}$, where the edge $z_{pq}$ represents that the category $c_p$ is relevant to the category $c_q$. The interaction between the item layer and the category layer is described by $\mathcal{L}_{ic} = \{(i_p, a_{pq}, c_q) \mid i_p \in \mathcal{I}, c_q \in \mathcal{C}\}$, where the edge $a_{pq}$ connects an item $i_p$ to a pre-defined item category $c_q$. The scene layer is composed of scenes, where a scene $s$ is formally defined as a set of item categories $\{c_1, c_2, \cdots, c_{|s|}\}$. The relation between categories and scenes is given by $\mathcal{L}_{cs} = \{(c_p, b_{pq}, s_q) \mid c_p \in \mathcal{C}, s_q \in \mathcal{S}\}$, where the edge $b_{pq}$ indicates that a category $c_p$ belongs to a scene $s_q$ and $\mathcal{S} = \{s_1, s_2, \cdots\}$ is the set of scenes. For simplicity, we set the weight of an edge in the scene-based graph $\mathcal{H}$ to 1 if the edge exists and 0 otherwise.

Figure 1: An illustrative example of the scene-based graph, which consists of the item layer, the category layer and the scene layer. Each item is associated with a category. In the item layer and the category layer, the edges represent item-item relations and category-category relations, respectively. Connections between categories and scenes indicate that a category belongs to a scene.
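For concreteness, the layers and edge sets of Definition 3.3 could be stored as follows; this is a minimal Python sketch under our own naming (container and field names are not from the paper):

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

@dataclass
class SceneBasedGraph:
    # L_item: item-item edges y_pq (binary weights, stored as pairs)
    item_item: Set[Tuple[int, int]] = field(default_factory=set)
    # L_cate: category-category edges z_pq
    cate_cate: Set[Tuple[int, int]] = field(default_factory=set)
    # L_ic: each item is linked to exactly one pre-defined category
    item_cate: Dict[int, int] = field(default_factory=dict)
    # L_cs: category -> the set of scenes it belongs to (b_pq edges)
    cate_scenes: Dict[int, Set[int]] = field(default_factory=dict)

    def scenes_of_item(self, item: int) -> Set[int]:
        """IS(i): scenes that contain the item's category."""
        return self.cate_scenes.get(self.item_cate[item], set())
```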
Definition 3.4. Scene-based Recommendation. Given a user-item bipartite graph $\mathcal{G}$ recording interaction history, the goal of scene-based recommendation is to predict the probability $r_{ui}$ that the user $u$ has potential interest in the item $i$, with the help of scene information from a scene-based graph $\mathcal{H}$.

4 THE PROPOSED MODEL

In this section, we first give an overview of the proposed framework, and then introduce each model component in detail.
The architecture of the proposed model is shown in Figure 2. There are three components in the model: user modeling, item modeling, and rating prediction. User modeling aims to learn a latent representation for each user. To achieve this, we take user-item interactions as input and aggregate the latent representations of the items the user has interacted with to generate the user latent factor. Item modeling aims to generate the item latent factor. Since each item exists in both the user-item bipartite graph and the scene-based graph, SceneRec learns item representations in each graph space, i.e., item modeling in the user-based space and item modeling in the scene-based space. In the user-based space, we take a similar strategy and aggregate the representations of all users an item has interacted with to generate its vector embedding. In the scene-based space, we exploit the hierarchical structure of the scene-based graph, where information is propagated from the scene layer to the category layer and from the category layer to the item layer. We then concatenate the two item latent factors to form the general item representation. In the last component, we integrate item and user representations to make the rating prediction.
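To make the data flow concrete, here is a minimal PyTorch-style skeleton of the three components; this is our own simplification, not the authors' code, and the three input factors are assumed to be produced by the aggregations detailed in Eqs. (1)-(12) below:

```python
import torch
import torch.nn as nn

class SceneRecSkeleton(nn.Module):
    """Sketch of SceneRec's three components; names are ours, not the authors'."""

    def __init__(self, d: int = 64):
        super().__init__()
        # fuses the user-based and scene-based item views (cf. Eq. 13)
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        # rating prediction head over the user/item pair (cf. Eq. 14)
        self.rate = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, m_u, m_i_user, m_i_scene):
        # m_u:       user factor aggregated from interacted items (cf. Eq. 1)
        # m_i_user:  item factor aggregated from engaged users (cf. Eq. 2)
        # m_i_scene: item factor propagated on the scene-based graph (cf. Eq. 12)
        m_i = self.fuse(torch.cat([m_i_user, m_i_scene], dim=-1))
        return self.rate(torch.cat([m_u, m_i], dim=-1)).squeeze(-1)
```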
Figure 2: The illustration of the SceneRec architecture (the arrowed lines present the bottom-up information flow). The embeddings of users and items are learned by user modeling and item modeling, respectively.

In the user-item graph, a user $u_p$ is connected with a set of items, and these items directly capture the user's interests. We thus learn user $u_p$'s embedding $\mathbf{m}_{u_p}$ by aggregating the embeddings of its item neighbors, which is formulated as

$$\mathbf{m}_{u_p} = \sigma\Big(\mathbf{W}_u \cdot \sum_{i_q \in UI(u_p)} \mathbf{e}_{i_q} + \mathbf{b}_u\Big), \quad (1)$$
where $UI(u_p)$ denotes the set of items that are connected to user $u_p$, $\mathbf{e}_{i_q}$ is the embedding vector of item $i_q$, and $\sigma$ is the nonlinear activation function. $\mathbf{W}_u$ and $\mathbf{b}_u$ are the weight matrix and the bias vector to be learned.

The general representation $\mathbf{m}_{i_p}$ of item $i_p$ can be split into two parts: the embedding $\mathbf{m}^U_{i_p}$ in the user-based space and the embedding $\mathbf{m}^S_{i_p}$ in the scene-based space.

In the user-item graph, an item $i_p$ has connections with a set of users. We learn its embedding $\mathbf{m}^U_{i_p}$ by aggregating the embeddings of these engaged users:

$$\mathbf{m}^U_{i_p} = \sigma\Big(\mathbf{W}_{iu} \cdot \sum_{u_q \in IU(i_p)} \mathbf{e}_{u_q} + \mathbf{b}_{iu}\Big), \quad (2)$$

where $IU(i_p)$ denotes the set of users that are connected to item $i_p$, $\mathbf{e}_{u_q}$ is the embedding vector of user $u_q$, and $\mathbf{W}_{iu}$ and $\mathbf{b}_{iu}$ are parameters to be learned. Since $\mathbf{m}^U_{i_p}$ is aggregated from user neighbors, it represents the user-based embedding of item $i_p$.

In the scene-based graph, each item is connected both to other items and to its category. The scene-based embedding $\mathbf{m}^S_{i_p}$ of item $i_p$ is therefore composed of representations specific to its item neighbors and its category neighbor. For the category-specific representation, we first generate the latent factor of each category. Since one category node can connect to both scene nodes and other related category nodes, the category representation can be further split into two types: the scene-specific and the category-specific representation.

Given a category $c_p$, it may belong to a set of scenes, and its scene-specific embedding vector $\mathbf{h}^S_{c_p}$ is computed as follows:

$$\mathbf{h}^S_{c_p} = \sum_{s_q \in CS(c_p)} \mathbf{e}_{s_q}, \quad (3)$$

where $CS(c_p)$ is the set of scenes that category $c_p$ belongs to and $\mathbf{e}_{s_q}$ is the embedding vector of scene $s_q$.

Besides the connections between scene nodes and category nodes, our model also captures the interactions between different category nodes. Each neighbor category contributes to the category-specific representation, but categories do not always affect each other equally. Therefore, we apply an attention mechanism to learn the influence between different item categories. The category-specific representation $\mathbf{h}^C_{c_p}$ of the category $c_p$ is aggregated as follows:

$$\mathbf{h}^C_{c_p} = \sum_{c_q \in CC(c_p)} \alpha_{pq} \mathbf{e}_{c_q}, \quad (4)$$

where $CC(c_p)$ is the set of neighbor categories, $\mathbf{e}_{c_q}$ is the embedding vector of $c_q$, and $\alpha_{pq}$ is the attention weight. For a pair of categories, the more scenes they share, the higher the relevance between them. Therefore, we propose a scene-based attention function to compute $\alpha_{pq}$. Specifically, we calculate the attention score by comparing the sets of scenes that $c_p$ and $c_q$ belong to:

$$\alpha^*_{pq} = f\Big(\sum_{s_a \in CS(c_p)} \mathbf{e}_{s_a}, \sum_{s_b \in CS(c_q)} \mathbf{e}_{s_b}\Big), \quad (5)$$

where $f(\cdot)$ is an attention function that measures the similarity of its inputs. For simplicity, we use cosine similarity as $f(\cdot)$ in this work. $\alpha_{pq}$ is obtained by further normalizing $\alpha^*_{pq}$ via the softmax function:

$$\alpha_{pq} = \frac{\exp\big(\alpha^*_{pq}\big)}{\sum_{\{q \mid \forall c_q \in CC(c_p)\}} \exp\big(\alpha^*_{pq}\big)}. \quad (6)$$
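As an illustration, the scene-based attention of Eqs. (4)-(6) could be implemented as follows. This is a simplified, non-batched PyTorch sketch with hypothetical names; the item-level attention of Eqs. (9)-(11) below follows the same pattern:

```python
import torch
import torch.nn.functional as F

def scene_attention(scene_sums: torch.Tensor, p: int, neighbors: list) -> torch.Tensor:
    """Eqs. (5)-(6): cosine similarity between summed scene embeddings,
    normalized with softmax over the neighbor categories CC(c_p).
    scene_sums[c] holds the precomputed sum of e_s over s in CS(c)."""
    sims = F.cosine_similarity(scene_sums[p].unsqueeze(0),
                               scene_sums[neighbors], dim=-1)  # alpha*_pq
    return F.softmax(sims, dim=0)                              # alpha_pq

def category_specific(cat_emb: torch.Tensor, scene_sums: torch.Tensor,
                      p: int, neighbors: list) -> torch.Tensor:
    """Eq. (4): h^C_{c_p} = sum over q of alpha_pq * e_{c_q}."""
    alpha = scene_attention(scene_sums, p, neighbors)
    return (alpha.unsqueeze(-1) * cat_emb[neighbors]).sum(dim=0)
```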
Finally, we generate the overall representation $\mathbf{m}_{c_p}$ of category $c_p$ by integrating the scene-specific representation and the category-specific representation:

$$\mathbf{m}_{c_p} = \sigma\big(\mathbf{W}_{ic} \cdot [\mathbf{h}^S_{c_p} \,\|\, \mathbf{h}^C_{c_p}] + \mathbf{b}_{ic}\big), \quad (7)$$

where $\|$ denotes the concatenation operation, and $\mathbf{W}_{ic}$ and $\mathbf{b}_{ic}$ are parameters to be learned.

Since an item $i_p$ is connected to exactly one pre-defined category, its category-specific representation $\mathbf{h}^C_{i_p}$ is given by

$$\mathbf{h}^C_{i_p} = \mathbf{m}_{C(i_p)}, \quad (8)$$

where $C(i_p)$ indicates the category of $i_p$.

We continue to learn the item-specific representation $\mathbf{h}^I_{i_p}$, since there exist connections between different item nodes. As with category-category relations, items do not always affect each other equally, and we apply the attention mechanism to learn $\mathbf{h}^I_{i_p}$:

$$\mathbf{h}^I_{i_p} = \sum_{i_q \in II(i_p)} \beta_{pq} \mathbf{e}_{i_q}, \quad (9)$$

where $\beta_{pq}$ denotes the attention weight. Since items that belong to the same category share similarity, we leverage scene information to calculate $\beta_{pq}$ by comparing their categories via the scene-based attention mechanism:

$$\beta^*_{pq} = f\Big(\sum_{s_a \in IS(i_p)} \mathbf{e}_{s_a}, \sum_{s_b \in IS(i_q)} \mathbf{e}_{s_b}\Big), \quad (10)$$

$$\beta_{pq} = \frac{\exp\big(\beta^*_{pq}\big)}{\sum_{\{q \mid \forall i_q \in II(i_p)\}} \exp\big(\beta^*_{pq}\big)}, \quad (11)$$

where $IS(i_p)$ is the set of scenes that contain item $i_p$'s category.

In the end, we concatenate the category-specific representation $\mathbf{h}^C_{i_p}$ and the item-specific representation $\mathbf{h}^I_{i_p}$ to derive the overall representation $\mathbf{m}^S_{i_p}$ of the item $i_p$ in the scene-based space:

$$\mathbf{m}^S_{i_p} = \sigma\big(\mathbf{W}_{ii} \cdot [\mathbf{h}^C_{i_p} \,\|\, \mathbf{h}^I_{i_p}] + \mathbf{b}_{ii}\big), \quad (12)$$

where $\mathbf{W}_{ii}$ and $\mathbf{b}_{ii}$ are parameters to be learned.

The item embedding $\mathbf{m}^U_{i_p}$ in the user-based space learns the collaborative signals from user-item interactions, while the item embedding $\mathbf{m}^S_{i_p}$ in the scene-based space provides additional information from the scene-based graph. These two types of representations can be complementary to each other, and they are combined by a multilayer perceptron (MLP) to generate the general item embedding:

$$\mathbf{m}_{i_p} = \mathcal{F}\big(\mathbf{W}_i \cdot [\mathbf{m}^U_{i_p} \,\|\, \mathbf{m}^S_{i_p}] + \mathbf{b}_i\big), \quad (13)$$

where $\mathcal{F}(\cdot)$ is an MLP network, and $\mathbf{W}_i$ and $\mathbf{b}_i$ are parameters.

Given the representation of user $u_p$ and the general representation of item $i_q$, the user preference is obtained via an MLP network:

$$r'_{pq} = \mathcal{F}\big(\mathbf{W}_r \cdot [\mathbf{m}_{u_p} \,\|\, \mathbf{m}_{i_q}] + \mathbf{b}_r\big), \quad (14)$$

where $\mathbf{W}_r$ and $\mathbf{b}_r$ are parameters to be learned.

To optimize the model parameters, we apply the pairwise BPR loss [14], which takes into account the relative order between observed and unobserved user-item interactions and assigns higher prediction scores to observed ones. The loss function is as follows:

$$\Omega(\Theta) = \sum_{(p,x,y) \in \mathcal{O}} -\ln \sigma\big(r'_{px} - r'_{py}\big) + \lambda \|\Theta\|^2_2, \quad (15)$$

where $\mathcal{O} = \{(p, x, y) \mid (p, x) \in \mathcal{R}^+, (p, y) \in \mathcal{R}^-\}$ denotes the pairwise training data, $\mathcal{R}^+$ and $\mathcal{R}^-$ are the observed and unobserved user-item interactions, respectively, $\Theta$ denotes all trainable model parameters, and $\lambda$ controls the $\ell_2$ regularization that prevents overfitting.
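A minimal PyTorch sketch of this pairwise objective (Eq. (15)) follows; the sampling of triples from $\mathcal{R}^+$ and $\mathcal{R}^-$ and the choice of `lam` are assumptions on our part, with scores coming from Eq. (14):

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor,
             params, lam: float = 1e-4) -> torch.Tensor:
    """Eq. (15): sum of -ln sigma(r'_px - r'_py) plus l2 regularization.
    pos_scores / neg_scores are r' values for observed / sampled
    unobserved interactions of the same users; lam is illustrative."""
    rank_loss = -F.logsigmoid(pos_scores - neg_scores).sum()
    reg = sum(p.pow(2).sum() for p in params)
    return rank_loss + lam * reg
```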
To sum up, we have different entity types, i.e., user, item, category and scene, in the user-item bipartite graph and the scene-based graph. In the learning process, the user representation is learned from interactions between users and items. The item latent factor is generated from two components: the representation in the user-based space and the representation in the scene-based space. The user embedding and the item embedding are then integrated to make predictions via pairwise learning.

5 EXPERIMENTS

In this section, we evaluate SceneRec on 4 real-world E-commerce datasets and focus on the following research questions:
RQ1: How does SceneRec perform compared with state-of-the-art recommendation methods?
RQ2: How do different key components of SceneRec affect the model performance?
RQ3: How does the scene information benefit recommendation?
To the best of our knowledge, there are no public datasets that describe a scene-based graph for recommender systems. To evaluate the effectiveness of SceneRec, we construct 4 datasets, namely Baby & Toy, Electronics, Fashion, and Food & Drink, from JD.com, one of the largest B2C E-commerce platforms in China. In each dataset, we build the user-item bipartite graph and the scene-based graph from online logs and commodity information. Statistics of the datasets are shown in Table 1 and more details are discussed next.

Table 1: Statistics of JD datasets. Each relation A-B has three parts: number of A, number of B, and number of A-B edges.

Relations (A-B)     Baby & Toy                 Electronics                Fashion                    Food & Drink
User-Item           4,521-51,759 (481,831)     3,842-52,025 (539,066)     3,959-53,005 (541,238)     3,236-47,402 (463,391)
Item-Item           51,759-51,759 (3,002,806)  52,025-52,025 (2,992,333)  53,005-53,005 (2,750,495)  47,402-47,402 (2,606,003)
Item-Category       51,759-103 (51,759)        52,025-78 (52,025)         53,005-91 (53,005)         47,402-105 (47,402)
Category-Category   103-103 (1,791)            78-78 (825)                91-91 (1,058)              105-105 (1,628)
Scene-Category      323-103 (1,370)            54-78 (281)                438-91 (1,646)             136-105 (630)

We first build the user-item bipartite graph by randomly sampling a set of users and items from online logs. A user is then connected to an item if she or he clicked the item.

Next we build the scene-based graph, where three different node types, i.e., item, category and scene, are taken as input. We first consider connections between different item nodes. In E-commerce systems, users perform various behaviors such as "view" and "purchase", which can be used to construct item-item relations. In this work, we choose "view" to build the item-item connections. A view session is a sequence of items viewed by a user within a period of time, and it is intuitive that two items should be highly relevant if they are frequently co-viewed. In the item layer, two items are linked if they are co-viewed by a user within the same session, where the edge weight is the sum of co-occurrence frequency within 2 months. For each item, we rank all connected items by edge weight and preserve at most the top 300 connections. The time period and the numbers of connections are set empirically based on the trade-off between the size of the datasets and the co-view relevance between items.

We then connect each item to its pre-defined category to build the item-category relations. We also consider connections between different category nodes, as shown in the second layer of the scene-based graph. For example, in E-commerce systems, the category "Mobile Phone" is strongly related to the category "Phone Case" but has little relevance to the category "Washing Machine", so only the first two categories are linked. To achieve this, we compute the co-view frequency within six months between each pair of category nodes, and only the top 100 connections of each category are preserved. In the end, each pair is labeled as 0 or 1 via consensus decision-making by three data labeling engineers to indicate whether a relevance exists.

The last step of building the scene-based graph is to link category nodes to scene nodes. Each scene consists of a set of selected categories, which are manually coded by human experts (automatic scene mining is left as future work). This procedure consists of two steps. First, an expert team (about 10 operations staff)
edits a set of scene candidates based on the corresponding domain knowledge. Then, a data labeling team of 3 engineers refines the generated scenes based on the criterion of whether each scene reasonably reflects a real-life situation.

To sum up, the constructed E-commerce datasets contain a user-item bipartite graph and a scene-based graph with different types of nodes, i.e., user, item, category and scene. The scene-based graph presents a 3-layer hierarchical structure, and there exist multiple relations among items, categories and scenes that are derived from user behavior data, commodity information and manual labeling. Thus, the datasets have all the characteristics of the networks we want to study as described in Section 3.
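For illustration, the item-item edge construction described above might look as follows; this is a plain-Python sketch under stated assumptions (sessions are lists of item ids, edge weight is the co-occurrence count, and the helper name is ours):

```python
from collections import Counter
from itertools import combinations

def build_item_item_edges(sessions, top_k=300):
    """Link items co-viewed in the same session; keep each item's
    top-k neighbors by co-occurrence frequency (see dataset text)."""
    co_freq = Counter()
    for session in sessions:                       # each session: list of item ids
        for a, b in combinations(set(session), 2):
            co_freq[tuple(sorted((a, b)))] += 1

    # rank neighbors per item and keep at most top_k connections
    neighbors = {}
    for (a, b), w in co_freq.items():
        neighbors.setdefault(a, []).append((w, b))
        neighbors.setdefault(b, []).append((w, a))
    return {i: sorted(ns, reverse=True)[:top_k] for i, ns in neighbors.items()}
```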
SceneRec leverages scene information to learn the representation vectors of users and items for recommendation. Therefore, we compare SceneRec against various recommendation methods and network representation learning methods.

(1) BPR-MF [14] is a benchmark matrix factorization (MF) model which takes the user-item graph as input and adopts the BPR loss.
(2) NCF [7] leverages a multi-layer perceptron to learn non-linearities between user and item interactions in the traditional MF model.
(3) CMN [3] is a state-of-the-art memory-based model that captures both the global and the local neighborhood structure of latent factors.
(4) PinSAGE [23] learns node representations on a large-scale item-item network, where the representation of one item is aggregated from the representations of its neighbor nodes. Here, we directly apply PinSAGE on the input user-item bipartite graph.
(5) NGCF [20] is a state-of-the-art GNN-based recommendation method, which learns high-order connectivities based on the network structure.
(6) KGAT [19] investigates the utility of a KG in GNN-based collaborative filtering, where each item is mapped to an entity in the KG. In our experiments, we regard each scene as a special type of KG entity and link it to item nodes via the category connection. In this case, the scene-based graph degrades to one that contains only item-scene connections, with two types of relations: an item belongs to a scene, and a scene includes an item.
(7) SceneRec-noitem is a variant of SceneRec that removes the item-item interactions in the scene-based graph.
(8) SceneRec-nosce is a variant of SceneRec that removes both category and scene nodes, so that the scene-based graph only includes relations between items.
(9) SceneRec-noatt is another variant of SceneRec that removes the attention mechanism on item-item relations and category-category relations.
We evaluate the model performance using the leave-one-out strategy as in [1, 7]. For each user, we randomly hold out one positive item that the user has clicked and sample 100 unobserved items to build the validation set. Similarly, we randomly choose another positive item along with 100 negative samples to build the test set. The remaining positive items form the training set.

In our experiments, we choose Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) [15] as evaluation metrics. HR measures whether positive items are ranked within the top $K$ scores, while NDCG focuses more on hit positions by assigning higher scores to top-ranked results. For both metrics, a larger value indicates better performance. We report the average performance over all users; Table 2 reports results at $K = 10$. The $\ell_2$ regularization coefficient $\lambda$ is determined by a grid search. For fair comparisons, the embedding dimension $d$ is set to 64 for all methods except NCF; for NCF, $d$ is set to 8 due to its poor performance in higher dimensional spaces. For NGCF and KGAT, the depth $L$ is set to 4 since it shows competitive performance via high-order connectivity.
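Under this leave-one-out protocol with a single held-out positive per user, the two metrics reduce to the following sketch (the function name is ours):

```python
import math

def hr_ndcg_at_k(ranked_items, positive_item, k=10):
    """HR@K: 1 if the held-out positive is in the top-K, else 0.
    NDCG@K: discounts the hit by its position (single-positive case,
    so the ideal DCG is 1 and NDCG = 1 / log2(rank + 2))."""
    top_k = ranked_items[:k]
    if positive_item not in top_k:
        return 0.0, 0.0
    rank = top_k.index(positive_item)          # 0-based position of the hit
    return 1.0, 1.0 / math.log2(rank + 2)
```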
Table 2: Comparisons with baselines and model variants.

                  Baby & Toy          Electronics         Fashion             Food & Drink
                  NDCG@10   HR@10     NDCG@10   HR@10     NDCG@10   HR@10     NDCG@10   HR@10
BPR-MF            0.3117    0.5213    0.4005    0.6082    0.3142    0.5294    0.3663    0.5445
NCF               0.2232    0.3800    0.3324    0.5364    0.1518    0.3090    0.3068    0.4628
CMN               0.2136    0.3840    0.4447    0.6725    0.2616    0.4516    0.4028    0.5854
PinSAGE           0.2124    0.4145    0.2954    0.5200    0.1770    0.3724    0.2791    0.4798
NGCF              0.3679    0.6000    0.4308    0.6559    0.3361    0.5749    0.3487    0.5228
KGAT              0.3055    0.5421    0.3616    0.6172    0.3115    0.5580    0.3221    0.5093
SceneRec-noitem   0.3977    0.6475    0.4748    0.7007    0.3936    0.6454    0.4080    0.6029
SceneRec-nosce    0.4193    0.6617    0.4715    0.7156    0.3933    0.6499    0.4156    0.6074
SceneRec-noatt    0.3950    0.6357    0.4665    0.7053    0.3953    0.6410    0.4138    0.6154
SceneRec

Table 2 reports comparative results of SceneRec against all 6 baseline methods, and we have the following observations:

(1) In general, NGCF achieves better results than the other baseline methods that take the user-item bipartite graph as input. There are two main reasons. First, GNN can effectively capture non-linear relations from user-item collaborative behaviors via information propagation on the graph. Second, NGCF learns high-order connectivities between different types of nodes, as shown in [20].

(2) KGAT further adds KG information into the recommender system, but it does not obtain the best result. Note that KG quality is essential to model performance. In our work, there are no available KG attributes that match our datasets, so there is no additional information to describe network items. Furthermore, the simple item-scene connection loses rich relations, e.g., category-category interactions and item-item interactions, in the scene-based graph, and may not advance model prediction.

(3) The proposed framework SceneRec obtains the best overall performance under both evaluation metrics, consistently improving NDCG@10 and HR@10 over the baselines. There are several reasons. First, SceneRec learns embedding representations of users and items from the user-item bipartite graph while it learns complementary representations of items from the scene-based graph, which is not accessible to the baseline methods. Second, SceneRec designs a principled hierarchical structure in the scene-based graph, through which additional scene-guided information is propagated into collaborative filtering. Third, SceneRec leverages GNN, which captures local network structure, to learn non-linear transformations of different types of graph nodes. Fourth, SceneRec adopts an attention mechanism to attentively learn the weighting importance among item-item connections and category-category connections.
Table 2 also reports comparative results against the 3 variants, and it is observed that:

(1) SceneRec-noitem obtains better experimental results than the other baseline methods, which indicates that the hierarchical structure of the scene-based graph can effectively propagate information and generate complementary scene-based representations. Moreover, SceneRec outperforms SceneRec-noitem, which verifies the effectiveness of incorporating the item-item sub-network into the scene-based graph.

(2) SceneRec-nosce outperforms all baselines because the item-item connections provide additional knowledge to conventional collaborative filtering. Compared to SceneRec-nosce, SceneRec achieves better performance on all datasets, which indicates that, by leveraging scene information, SceneRec is capable of learning complementary representations beyond CF interactions.

(3) The prediction result of SceneRec is consistently better than that of SceneRec-noatt, which verifies that the attention mechanism does benefit recommendation by learning the weights of 1-hop neighbors for each item node and each category node.
Finally, we use a case study to show the effect of integrating scene-specific representations into collaborative filtering, as illustrated in Figure 3. From the Electronics dataset, we randomly select a user, a set of items that the user has interacted with, and a set of candidate items (whose prediction scores are given above the item nodes). We especially compute the average attention score (below the category node) between each candidate item and the items that the user has interacted with via the scene-based attention mechanism. A higher average attention score means more shared scenes between the candidate item and the user's interacted items; the candidate item is then more likely to occur in a scene derived from the user's interests, which can boost the recommendation prediction. From this case study, we see that the average attention score does relate to the prediction result. For example, the positive candidate item has both the highest prediction score and the largest average attention weight: it is recommended because its category "Keyboard" complements the categories of the user-interacted items in the same scene "Peripheral Devices". Similar results are also observed for other users.

Figure 3: A real example on the Electronics dataset.

6 CONCLUSION

In this paper, we investigate the utility of integrating scene information into recommender systems using graph neural networks, where a scene is formally defined as a set of pre-defined item categories. To integrate the scene information into graph neural networks, we design a principled 3-layer hierarchical structure to construct the scene-based graph and propose a novel method SceneRec. SceneRec learns item representations from the scene-based graph, which are further combined with the conventional latent representations learned from user-item interactions to make predictions. We conduct extensive experiments on four datasets collected from a real-world E-commerce platform. The comparative results and a case study demonstrate the rationality and effectiveness of SceneRec.
ACKNOWLEDGMENTS
This work is supported in part by National Key R&D Program of China 2018AAA0102301 and NSFC 61925203.
REFERENCES
[1] Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention. In SIGIR.
[2] Ernesto Diaz-Aviles, Mihai Georgescu, and Wolfgang Nejdl. 2012. Swarming to rank for recommender systems. In RecSys.
[3] Travis Ebesu, Bin Shen, and Yi Fang. 2018. Collaborative Memory Network for Recommendation Systems. In SIGIR.
[4] Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville. 2016. Deep Learning. MIT Press.
[5] Marco Gori and Augusto Pucci. 2007. ItemRank: A Random-Walk Based Scoring Algorithm for Recommender Engines. In IJCAI.
[6] William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS.
[7] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW.
[8] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM.
[9] Wang-Cheng Kang, Eric Kim, Jure Leskovec, Charles Rosenberg, and Julian J. McAuley. 2019. Complete the Look: Scene-Based Complementary Product Recommendation. In CVPR.
[10] M. Hadi Kiapour, Kota Yamaguchi, Alexander C. Berg, and Tamara L. Berg. 2014. Hipster Wars: Discovering Elements of Fashion Styles. In ECCV.
[11] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
[12] Ayangleima Laishram, Satya Prakash Sahu, Vineet Padmanabhan, and Siba Kumar Udgata. 2016. Collaborative Filtering, Matrix Factorization and Population Based Search: The Nexus Unveiled. In ICONIP.
[13] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. In CVPR.
[14] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI.
[15] Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor (Eds.). 2011. Recommender Systems Handbook. Springer.
[16] Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2017. Graph Convolutional Matrix Completion. CoRR (2017).
[17] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative Deep Learning for Recommender Systems. In SIGKDD.
[18] Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, and Minyi Guo. 2019. Knowledge Graph Convolutional Networks for Recommender Systems. In WWW.
[19] Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019. KGAT: Knowledge Graph Attention Network for Recommendation. In KDD.
[20] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural Graph Collaborative Filtering. In SIGIR.
[21] Yinqing Xu, Wai Lam, and Tianyi Lin. 2014. Collaborative Filtering Incorporating Review Text and Co-clusters of Hidden User Communities and Item Groups. In CIKM.
[22] Xiwang Yang, Yang Guo, Yong Liu, and Harald Steck. 2014. A survey of collaborative filtering based social recommender systems. Computer Communications (2014).
[23] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD.
[24] Hui Zhang, Xu Chen, and Shuai Ma. 2019. Dynamic News Recommendation with Hierarchical Attention Network. In