[PDF] Causal Collaborative Filtering

Abstract

Recommender systems are important and valuable tools for many personalized services. Collaborative Filtering (CF) algorithms -- among others -- are fundamental algorithms driving the underlying mechanism of personalized recommendation. Many of the traditional CF algorithms are designed based on the fundamental idea of mining or learning correlative patterns from data for matching, including memory-based methods such as user/item-based CF as well as learning-based methods such as matrix factorization and deep learning models. However, advancing from correlative learning to causal learning is an important problem, because causal/counterfactual modeling can help us to think outside of the observational data for user modeling and personalization. In this paper, we propose Causal Collaborative Filtering (CCF) -- a general framework for modeling causality in collaborative filtering and recommendation. We first provide a unified causal view of CF and mathematically show that many of the traditional CF algorithms are actually special cases of CCF under simplified causal graphs. We then propose a conditional intervention approach for do -calculus so that we can estimate the causal relations based on observational data. Finally, we further propose a general counterfactual constrained learning framework for estimating the user-item preferences. Experiments are conducted on two types of real-world datasets -- traditional and randomized trial data -- and results show that our framework can improve the recommendation performance of many CF algorithms.

Full PDF

CCausal Collaborative Filtering

Shuyuan Xu

Rutgers UniversityNew Brunswick, NJ, [email protected]

Yingqiang Ge

Rutgers UniversityNew Brunswick, NJ, [email protected]

Yunqi Li

Rutgers UniversityNew Brunswick, NJ, [email protected]

Zuohui Fu

Rutgers UniversityNew Brunswick, NJ, [email protected]

Xu Chen

Renmin University of ChinaBeijing, [email protected]

Yongfeng Zhang

Rutgers UniversityNew Brunswick, NJ, [email protected]

ABSTRACT

Recommender systems are important and valuable tools for manypersonalized services. Collaborative Filtering (CF) algorithms—amongothers—are fundamental algorithms driving the underlying mecha-nism of personalized recommendation. Many of the traditional CFalgorithms are designed based on the fundamental idea of miningor learning correlative patterns from data for matching, includ-ing memory-based methods such as user/item-based CF as well aslearning-based methods such as matrix factorization and deep learn-ing models. However, advancing from correlative learning to causallearning is an important problem, because causal/counterfactualmodeling can help us to think outside of the observational data foruser modeling and personalization. In this paper, we propose CausalCollaborative Filtering (CCF)—a general framework for modelingcausality in collaborative filtering and recommendation. We firstprovide a unified causal view of CF and mathematically show thatmany of the traditional CF algorithms are actually special cases ofCCF under simplified causal graphs. We then propose a conditionalintervention approach for 𝑑𝑜 -calculus so that we can estimate thecausal relations based on observational data. Finally, we further pro-pose a general counterfactual constrained learning framework forestimating the user-item preferences. Experiments are conductedon two types of real-world datasets—traditional and randomizedtrial data—and results show that our framework can improve therecommendation performance of many CF algorithms. KEYWORDS

Collaborative Filtering; Causal Analysis; Causal Learning; Interven-tion; Counterfactual Reasoning; Recommender Systems

Recommender systems are important and valuable tools for manyWeb-based services such as e-commerce, social networks and on-line media systems. Collaborative Filtering (CF) [8, 37] algorithms,among others, are fundamental algorithms that support the under-lying mechanism of recommender systems.

Permission to make digital or hard copies of part or all of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. Copyrights for third-party components of this work must be honored.For all other uses, contact the owner/author(s).

Figure 1: Many traditional CF models are special cases ofCCF under simplified causal graphs. In the graphs, 𝑈 is user, 𝑉 is item, 𝑋 is user interaction history, 𝑌 is preference score.(a) Causal graph for non-personalized models. (b) Causalgraph for similarity matching-based CF models. (c) Causalgraph that considers the causality from user to item [4]. (d)Causal graph used in our framework to demonstrate theidea of CCF, using user interaction history 𝑋 as a mediator. In this paper, we propose Causal Collaborative Filtering (CCF).Different from traditional collaborative filtering algorithms whoseultimate goal is to estimate the correlation 𝑃 ( 𝑦 | 𝑢, 𝑣 ) , casual collab-orative filtering aims to estimate the causal relation 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) ,where 𝑢, 𝑣 is a user-item pair and 𝑦 is the preference score to be es-timated for the pair, e.g., 𝑦 = 𝑦 = 𝑑𝑜 -calculus is used to represent the causal effect if we intervene torecommend item 𝑣 instead of passively observing item 𝑣 in trainingdata. More interestingly, we show that traditional CF models areactually special cases of CCF under simplified causal graphs (Figure1, more details later), and CCF is a general framework for casuallearning in recommender systems which can be applied over vari-ous causal graphs. We finally propose a counterfactual constrainedlearning approach to estimate the causal relation 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) .More specifically, traditional CF-based models are typically framedas predicting users’ preference scores over items. Mathematically,this can be expressed as predicting the 𝑃 ( 𝑦 | 𝑢, 𝑣 ) . For example, non-personalized popularity-based algorithms [14] assume 𝑃 ( 𝑦 | 𝑢, 𝑣 ) ∝ 𝑃 ( 𝑦 | 𝑣 ) and rank the items according to item popularity; user-basedCF [19, 34] assumes 𝑃 ( 𝑦 | 𝑢, 𝑣 ) ∝ | 𝑁 ( 𝑢 ) | (cid:205) 𝑢 ′ ∈ 𝑁 ( 𝑢 ) 𝑦 𝑢 ′ 𝑣 where 𝑁 ( 𝑢 ) are user 𝑢 ’s neighbours and 𝑦 𝑢 ′ 𝑣 are neighbours’ ratings on item 𝑣 ; item-based CF [24, 36] assumes 𝑃 ( 𝑦 | 𝑢, 𝑣 ) ∝ | 𝑁 ( 𝑣 ) | (cid:205) 𝑣 ′ ∈ 𝑁 ( 𝑣 ) 𝑦 𝑢𝑣 ′ where 𝑁 ( 𝑣 ) are item 𝑣 ’s neighbours and 𝑦 𝑢𝑣 ′ are neighbours’ rat-ings from user 𝑢 ; Matrix Factorization (MF) models such as [20]assumes 𝑃 ( 𝑦 | 𝑢, 𝑣 ) ∝ u ⊺ v or u ⊺ v + 𝑏 𝑢 + 𝑏 𝑣 + 𝑏 , where u and v areuser/item latent factors and 𝑏 ∗ are bias terms; Probabilistic MF suchas [28] assumes 𝑃 ( 𝑦 | 𝑢, 𝑣 ) ∝ N ( 𝑦 | u ⊺ v , 𝜎 ) , where N is a normaldistribution, and recent deep models assume 𝑃 ( 𝑦 | 𝑢, 𝑣 ) ∝ NN ( u , v ) , a r X i v : . [ c s . I R ] F e b onference’17, July 2017, Washington, DC, USA Shuyuan Xu, Yingqiang Ge, Yunqi Li, Zuohui Fu, Xu Chen, and Yongfeng Zhang where NN is a neural network for similarity matching. From a gen-eral perspective, these models are all trying to estimate the 𝑃 ( 𝑦 | 𝑢, 𝑣 ) under different modeling assumptions.However, the essential goal of recommendation is not only to es-timate the pre-intervention user-item associative relationships, butto estimate the post-intervention effects if we recommend/displaysomething to users. This can be interpreted as trying to answer the“what if” question: what would happen if we intervene to recom-mend a certain item to a target user.Using standard mathematical language of causal inference [30],the above “what if” question can be represented as 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) ,where the 𝑑𝑜 -operation is used to model the interventions. As a re-sult, we propose Causal Collaborative Filtering (CCF) in this paper,which aims to estimate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) instead of 𝑃 ( 𝑦 | 𝑢, 𝑣 ) for per-sonalized recommendation. Interestingly, we find that traditionalCF models are actually special cases of CCF, i.e., they are actuallyalso trying to estimate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) —but under simplified (some-times non-realistic) causal graphs. As a result, CCF mathematicallysubsumes many of the traditional CF models. For example, non-personalized models such as popularity ranking assume the causalgraph in Figure 1(a). Since user is excluded from the graph and itemis a root node in the graph, we have 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) = 𝑃 ( 𝑦 | 𝑑𝑜 ( 𝑣 )) = 𝑃 ( 𝑦 | 𝑣 ) , which naturally reduces to the estimate of popularity rank-ing. Many similarity matching-based CF models such as user/item-based CF, matrix factorization and neural ranking models assumethe collider casual graph in Figure 1(b), where the user and itemare assumed to appear independently in the observational train-ing data (more details explained in Section 4). Since both 𝑈 and 𝑉 are root nodes in the graph, we have 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) = 𝑃 ( 𝑦 | 𝑢, 𝑣 ) ,which also reduces to the associative estimates of these models.Except for the above associative matching-based models, some ex-isting causal recommendation models can also be included in the 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) framework. For example, Causal Embeddings for Rec-ommendation (CausE) [4] assumes the causal graph in Figure 1(c),which removes the independence assumption between user anditem, and adopts direct intervention through randomized treatmentto estimate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) , as will be explained in Section 4.Except for the above conceptual contribution, our work also pro-vides technical contributions. More specifically, a great challengeis how to estimate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) . Estimating 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) usingdirect intervention requires access to the recommendation platformso that we can deploy the intervention strategies such as displayingrandomized recommendations to users. However, such platformis hardly accessible to researchers. Even in industry environmentswhere the platform is fully accessible, researchers and developersare usually prevented from large-scale randomized trails becausethat will greatly hurt the user experience. In this work, we proposea conditional intervention approach to estimating 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) based on observational data. Specifically, we adopt the causal graphin Figure 1(d) for conditional intervention, which considers theuser interaction history 𝑋 for mediator analysis. Moreover, solvingthe conditional intervention requires counterfactual reasoning, andwe further propose a counterfactual constrained machine learn-ing approach for counterfactual reasoning in both discrete andcontinuous space. More details will be explained in Section 5. Fi-nally, we conduct extensive experiments on two different types of real-world datasets: type-one datasets are traditional training-validation-testing split datasets, while for the type-two datasets,the testing set is collected through randomized trials. Experimentalresults show that our CCF framework significantly improves therecommendation performance on both types of datasets.In the following parts of the paper, we will introduce the relatedwork in Section 2 as well as the notations and preliminaries inSection 3. In Section 4, we will provide a unified causal view ofCollaborative Filtering (CF) and motivate the basic idea of CausalCollaborative Filtering (CCF). Then, we will introduce the proposedCCF framework in Section 5 and provide the experimental resultsin Section 6. Finally, Section 7 concludes the work and discussesabout the potential future research directions. Collaborative Filtering (CF) [8] is one of the most fundamental ap-proaches to personalized recommendation, which has been widelyused in many real-world systems. The key idea behind many ofthe CF models is that similar users may share similar interests andsimilar items may be liked by similar users. Due to the wide scopeof literature of CF, it is hardly possible to cover all of the CF algo-rithms, so we review some representative methods in this section,and a more comprehensive review can be seen in [8, 49].Early memory-based CF models—such as user-based CF [19, 34]and item-based CF [24, 36]—take the row or column vectors ofthe user-item rating matrix as the user and item vector represen-tations, and calculate the similarity between users or items forrecommendation based on pre-defined similarity functions suchas cosine similarity and Pearson correlation coefficient. To extractlatent semantic meanings from the matrix, researchers later ex-plored learned user and item vector representations. This startedwith Latent Factor Models (LFM) such as matrix factorization [20],probabilistic matrix factorization [28], tensor factorization [18] andfactorization machines [31], which are widely adopted models inpractice. In these models, each user and item is learned as a latentrepresentation to calculate the matching score of each user-itempair, usually based on inner-product.The development of deep learning and neural networks has fur-ther extended CF. The relevant methods can be broadly classifiedinto two categories: similarity learning approach and representa-tion learning approach. The similarity learning approach adoptssimple user/item representations (such as one-hot) and learns a com-plex matching function (such as a prediction network) to calculateuser-item matching scores [7, 10, 13, 43], while the representationlearning approach learns rich user/item representations and adoptsa simple matching function (e.g., inner product) for efficient match-ing score calculation [2, 27, 47, 50, 52]. User representations canalso be directly calculated from the user’s interaction histories,such as in sequential recommendation [6, 11, 17, 33, 40]. Anotherimportant direction is learning to rank for recommendation, whichlearns the relative ordering of items instead of the absolute scores.A representative method on this direction is Bayesian PersonalizedRanking (BPR) [32], which is a pair-wise learning to rank method.Most of the existing methods learn correlative patterns fromdata for matching and recommendation based on either simple orcomplex matching functions. However, advancing from correlative ausal Collaborative Filtering Conference’17, July 2017, Washington, DC, USA learning to causal learning is an important problem [30]. The com-munity has explored causal modeling on several different perspec-tives. For example, researchers adopted causal models to generateexplanations for recommendation [9], explored fairness issues un-der counterfactual settings [21], and corrected data bias for rankingin search [16], recommendation [4, 25, 38], advertising [46] andevaluating the ranking models [44]. Many of the models are basedon Inverse Propensity Score (IPS) methods, which aim to turn theoutcomes of an observational study into pseudo-randomized trialsby re-weighting the samples. Though convenient in implementa-tion, the disadvantage is that the estimator may not properly handlelarge shifts in observational probability [4]. Besides, each model isusually specifically designed for a particular scenario or a particu-lar problem setting, while there still lacks a general framework formodeling counterfactual reasoning in collaborative filtering, whichis the fundamental problem we aim to solve in this work. In this section, we introduce the basic notations used throughoutthe paper. We also introduce some fundamental concepts of causalinference to be used in the following parts of the paper.

In this paper, we use uppercase letters such as 𝑌 to represent randomvariables. In particular, we use 𝑈 , 𝑉 , 𝑋, 𝑌 to represent user, item,history, and preference variables. We use lowercase letters such as 𝑦 to represent the specific value that a random variable can take. Inparticular, we use 𝑢, 𝑣, 𝑥, 𝑦 to represent a specific user, item, history,and preference value. Moreover, we use bold font lowercase torepresent the latent vector embedding of users and items, such as u , v ∈ R 𝐷 , where 𝐷 is the dimension of the embedding vectors.We use probability notations such as 𝑃 ( 𝑌 = 𝑦 ) and 𝑃 ( 𝑌 = 𝑦 | 𝑋 = 𝑥 ) to represent the probability or the conditional probability thata variable 𝑌 takes on a specific value 𝑦 . When the context is clear,the probability is simplified as 𝑃 ( 𝑦 ) and 𝑃 ( 𝑦 | 𝑥 ) . In this work, forsimplicity, we consider binary values for the user-item preferencevariable 𝑌 , i.e., 𝑌 can take on two values 1 and 0, denoted as 𝑃 ( 𝑌 = ) and 𝑃 ( 𝑌 = ) , corresponding to the probability of like or dislike anitem, respectively. However, the framework can be generalized tomultiple value cases.We adopt standard intervention and counterfactual notations asin Pearl et al. [30]. In particular, 𝑃 ( 𝑌 = 𝑦 | 𝑑𝑜 ( 𝑋 = 𝑥 )) —simplifiedas 𝑃 ( 𝑦 | 𝑑𝑜 ( 𝑥 )) when the context is clear—denotes the probability of 𝑌 = 𝑦 under intervention 𝑋 = 𝑥 . Moreover, 𝑃 ( 𝑌 = 𝑦 | 𝑑𝑜 ( 𝑋 = 𝑥 ) , 𝑍 = 𝑧 ) —simplified as 𝑃 ( 𝑦 | 𝑑𝑜 ( 𝑥 ) , 𝑧 ) —denotes the 𝑧 -specific intervention.We use 𝑃 ( 𝑌 𝑋 = 𝑥 ′ = 𝑦 | 𝑈 = 𝑢, 𝑋 = 𝑥 ) —simplified as 𝑃 ( 𝑌 𝑥 ′ = 𝑦 | 𝑢, 𝑥 ) or 𝑃 ( 𝑌 𝑥 ′ ( 𝑢 ) = 𝑦 | 𝑥 ) —to denote the probability of 𝑌 = 𝑦 for individual 𝑈 = 𝑢 in the counterfactual world where 𝑋 = 𝑥 ′ , given that whathappened in the real world is 𝑋 = 𝑥 . When calculating expectationis needed, we use 𝐸 [ 𝑌 𝑥 ′ ( 𝑢 )| 𝑥 ] to represent the expected value of 𝑌 in the counterfactual world. Definition 1. (Structural Causal Models) [30, p.26] A structuralcausal model (SCM) 𝑀 consists of two set of variables 𝑈 and 𝑉 , anda set of functions 𝑓 that assign a value to each variable in 𝑉 based on other variables in the model. Here 𝑈 are exogenous variables that noexplanatory mechanism is encoded. Definition 2. (Causal Graph) [30, p.35] A causal graph is a di-rected acyclic graph (DAG) G = ({ 𝑈 , 𝑉 } , 𝐸 ) , which captures therelationships among the variables in the corresponding SCM. Definition 3. (Intervention) [30, p.55] We distinguish betweencases where a variable 𝑋 takes a value 𝑥 naturally and cases wherewe fix 𝑋 = 𝑥 by denoting the later 𝑑𝑜 ( 𝑋 = 𝑥 ) . So 𝑃 ( 𝑌 = 𝑦 | 𝑋 = 𝑥 ) is the probability that 𝑌 = 𝑦 conditioned on finding 𝑋 = 𝑥 , while 𝑃 ( 𝑌 = 𝑦 | 𝑑𝑜 ( 𝑋 = 𝑥 )) is the probability that 𝑌 = 𝑦 when we interveneto make 𝑋 = 𝑥 . Similarly, we write 𝑃 ( 𝑌 = 𝑦 | 𝑑𝑜 ( 𝑋 = 𝑥 ) , 𝑍 = 𝑧 ) to denote the conditional probability of 𝑌 = 𝑦 , given 𝑍 = 𝑧 , in thedistribution created by the intervention 𝑑𝑜 ( 𝑋 = 𝑥 ) . To calculate the causal effect 𝑃 ( 𝑦 | 𝑑𝑜 ( 𝑥 )) , the most fundamental-ist approach is through causal graph manipulation. Technically, weremove all of 𝑋 ’s incoming edges from the original causal graph G to create the manipulated graph G 𝑚 . And then we have have 𝑃 ( 𝑦 | 𝑑𝑜 ( 𝑥 )) = 𝑃 𝑚 ( 𝑦 | 𝑥 ) , where 𝑃 𝑚 is the manipulated probability.However, really implementing the manipulated causal graph G 𝑚 through direct intervention to calculate 𝑃 𝑚 can be challenging oreven impossible in practice. As a result, it would be nice if we canestimate 𝑃 ( 𝑦 | 𝑑𝑜 ( 𝑥 )) from purely observational data. The followingcausal effect rule answers this question.Definition 4. (The Causal Effect Rule) [30, p.59] Given a causalgraph G in which a set of variables 𝑃𝐴 are designated as the parentsof 𝑋 , the causal effect of 𝑋 on 𝑌 is given by: 𝑃 ( 𝑌 = 𝑦 | 𝑑𝑜 ( 𝑋 = 𝑥 )) = ∑︁ 𝑧 𝑃 ( 𝑌 = 𝑦 | 𝑋 = 𝑥, 𝑃𝐴 = 𝑧 ) 𝑃 ( 𝑃𝐴 = 𝑧 ) = ∑︁ 𝑧 𝑃 ( 𝑋 = 𝑥, 𝑌 = 𝑦, 𝑃𝐴 = 𝑧 ) 𝑃 ( 𝑋 = 𝑥 | 𝑃𝐴 = 𝑧 ) (1) where 𝑧 ranges over all the combinations of values that the variablesin 𝑃𝐴 can take. The factor 𝑃 ( 𝑋 = 𝑥 | 𝑃𝐴 = 𝑧 ) is the “propensity score”. The most important benefit brought by the above rule is thatit enables us to calculate the causal effect between two variablesbased on passive observational data—we see that the right sideof the equation does not include 𝑑𝑜 -calculations any more. How-ever, enumerating all parents’ value combinations is still rathercomplicated. The following backdoor criterion solves the problem.Definition 5. (Backdoor Criterion) [30, p.61] A set of variables 𝑍 satisfies the backdoor criterion related to an ordered pair of variables ( 𝑋, 𝑌 ) in a causal graph G if 𝑍 satisfies both (1) No node in 𝑍 is adescendant of 𝑋 and (2) 𝑍 blocks every path between 𝑋 and 𝑌 thatcontains an arrow into 𝑋 . If a set of variables 𝑍 satisfies the backdoor criterion for 𝑋 and 𝑌 , then the causal effect of 𝑋 on 𝑌 is given by the formula: 𝑃 ( 𝑌 = 𝑦 | 𝑑𝑜 ( 𝑋 = 𝑥 )) = ∑︁ 𝑧 𝑃 ( 𝑌 = 𝑦 | 𝑋 = 𝑥, 𝑍 = 𝑧 ) 𝑃 ( 𝑍 = 𝑧 ) (2)through which we can also estimate 𝑃 ( 𝑦 | 𝑑𝑜 ( 𝑥 )) from observationaldata but it may only need much fewer variables.The backdoor criterion can be generalized to 𝑧 -specific causaleffect 𝑃 ( 𝑌 = 𝑦 | 𝑑𝑜 ( 𝑋 = 𝑥 ) , 𝑍 = 𝑧 ) , in which we care about thecausal effect of 𝑋 on 𝑌 under a specific value 𝑍 = 𝑧 [30, p.70]. If we onference’17, July 2017, Washington, DC, USA Shuyuan Xu, Yingqiang Ge, Yunqi Li, Zuohui Fu, Xu Chen, and Yongfeng Zhang can find a set of variables 𝑆 such that 𝑆 ∪ 𝑍 satisfies the backdoorcriterion ( 𝑆 may include 𝑍 ), then the 𝑧 -specific causal effect can beestimated from observational data: 𝑃 ( 𝑌 = 𝑦 | 𝑑𝑜 ( 𝑋 = 𝑥 ) , 𝑍 = 𝑧 ) = ∑︁ 𝑠 𝑃 ( 𝑌 = 𝑦 | 𝑋 = 𝑥, 𝑆 = 𝑠, 𝑍 = 𝑧 ) 𝑃 ( 𝑆 = 𝑠 | 𝑍 = 𝑧 ) (3)where the summation goes over all value combinations of 𝑆 .Finally, counterfactual analysis aims to answer queries that go be-yond the observational data. In notation, we use 𝑌 𝑋 = 𝑥 ( 𝑈 = 𝑢 ) = 𝑦 ,or simplified as 𝑌 𝑥 ( 𝑢 ) = 𝑦 , to represent the counterfactual sen-tence “ 𝑌 would be 𝑦 had 𝑋 been 𝑥 , in situation 𝑈 = 𝑢 ,” thoughthe observed value of 𝑋 in real world is not 𝑥 . The mathematicaldefinition of counterfactual is as follows.Definition 6. (Counterfactual) [30, p.94] Let 𝑀 be the originalstructural causal model, and 𝑀 𝑥 be the modified version of 𝑀 withthe equation of 𝑋 replaced by 𝑋 = 𝑥 . Then the formal definition ofthe counterfactual 𝑌 𝑥 ( 𝑢 ) is 𝑌 𝑥 ( 𝑢 ) = 𝑌 𝑀 𝑥 ( 𝑢 ) , i.e., the counterfactual 𝑌 𝑥 ( 𝑢 ) in model 𝑀 is defined as the solution for 𝑌 in the “surgicallymodified” submodel 𝑀 𝑥 . Further more, we use 𝑃 ( 𝑌 𝑥 = 𝑦 ) to represent the probability of 𝑌 = 𝑦 had 𝑋 been 𝑥 , and use 𝐸 [ 𝑌 𝑥 ] to represent the expected valueof 𝑌 had 𝑋 been 𝑥 . We start by providing a unified causal view of collaborative filtering(CF). Specifically, we show that the fundamental goal of many CFalgorithms for personalized recommendation is to estimate thecausal effect 𝑃 ( 𝑌 = 𝑦 | 𝑈 = 𝑢, 𝑑𝑜 ( 𝑉 = 𝑣 )) , simplified as 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) ,where 𝑈 , 𝑉 , 𝑌 represent the user, item and the user’s preference onthe item, respectively. According to the definition of causal effect[30, p.55], 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) represents the causal effect of item 𝑣 on thepreference 𝑦 conditioned on user 𝑢 , which is the probability thatuser 𝑢 ’s preference score is 𝑦 (e.g., 1 or 0) when we intervene torecommend item 𝑣 . The key difference between various CF models isthat they assume different causal graphs to calculate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) .When the causal graph is too simple or even unrealistic, the causaleffect will naturally degenerate to association relations that areconsidered in traditional CF models.Before we proceed to formally compare different CF modelsin the causal view, we first explain the intuition of 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) ,which can be appreciated from both recommendation perspectivesand causal perspectives. On the recommendation perspective, weusually consider personalized recommendations in modern recom-mender systems, and thus 𝑢 appears in the condition to enableconditional causal effect. From the causal perspective, simply esti-mating 𝑃 ( 𝑦 | 𝑢, 𝑣 ) from observational data (as many previous modelsdid) can only extract associative signals from the training dataset,which may be subject to the Simpson’s paradox [30] since a user’spersonalized preference can be overwhelmed by the populationeffect. In contrast, 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) aims to estimate the causal effectif we intervene to recommend item 𝑣 for a specific user, and thecausal effect is expected to reveal the user’s real preference on theitem without being influenced by other confounding factors.We now show how different CF models fit into the unified causalview under 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) . Non-personalized recommendation models such as most popularrecommendation [14] assumes a simple causal graph without theuser node, as shown in Figure 1(a). Since user is excluded fromconsideration and since item is a root node in the graph, we have 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) = 𝑃 ( 𝑦 | 𝑑𝑜 ( 𝑣 )) = 𝑃 ( 𝑦 | 𝑣 ) , and 𝑃 ( 𝑦 | 𝑣 ) naturally repre-sents the popularity of item 𝑣 in the data. Most CF algorithms fall into the user-item associative matching cat-egory, such as user-based [19, 34] or item-based [24, 36] CF, matrixfactorization models [20, 28], as well as many neural network-basedmatching models, including both function learning [10] and repre-sentation learning [50] approaches. These models actually assumea causal graph shown in Figure 1(b), where user node 𝑈 and itemnode 𝑉 constitute a collider to influence preference node 𝑌 . Basi-cally, these models assume that the appearance of users and itemsare independent from each other in observational data (though thismay be an unrealistic assumption), and since both 𝑈 and 𝑉 areroot nodes, we have 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) = 𝑃 ( 𝑦 | 𝑢, 𝑣 ) , which can thus beestimated from observational data using various models. This isactually what we have seen in many CF models for years.The main difference of various models is how to design thematching function to estimate 𝑃 ( 𝑦 | 𝑢, 𝑣 ) . For example, user-based CFassumes 𝑃 ( 𝑌 = | 𝑢, 𝑣 ) ∝ | 𝑁 ( 𝑢 ) | (cid:205) 𝑢 ′ ∈ 𝑁 ( 𝑢 ) 𝑦 𝑢 ′ ,𝑣 , while item-basedCF assumes 𝑃 ( 𝑌 = | 𝑢, 𝑣 ) ∝ | 𝑁 ( 𝑣 ) | (cid:205) 𝑣 ′ ∈ 𝑁 ( 𝑣 ) 𝑦 𝑢,𝑣 ′ , where 𝑁 ( 𝑢 ) and 𝑁 ( 𝑣 ) are the neighbours of user 𝑢 and item 𝑣 , respectively. Matrixfactorization (MF) models such as [20] assumes 𝑃 ( 𝑌 = | 𝑢, 𝑣 ) ∝ u ⊺ v or ∝ u ⊺ v + 𝑏 𝑢 + 𝑏 𝑣 + 𝑏 , while probabilistic matrix factorization such as[28] assumes 𝑃 ( 𝑦 | 𝑢, 𝑣 ) ∝ N ( 𝑦 | u ⊺ v , 𝜎 ) , where N is a normal distri-bution with u ⊺ v as mean. Some neural network-based models suchas [10, 43] assume 𝑃 ( 𝑌 = | 𝑢, 𝑣 ) ∝ NN ( u , v ) , where NN is a neuralnetwork for similarity matching, while representation learning-based models such as [50] assumes 𝑃 ( 𝑌 = | 𝑢, 𝑣 ) ∝ NN ( 𝑢 ) ⊺ NN ( 𝑣 ) ,where NN is a network that learns the user and item vector represen-tations, and a simple inner product is used for similarity matching.More complex deep representation learning models such as sequen-tial models [6, 12, 22, 40] and graph-based models [2, 41, 45, 48]can be represented as 𝑃 ( 𝑌 = | 𝑢, 𝑣 ) ∝ NN ( NN ( 𝑢 ) , NN ( 𝑣 )) , wherea neural similarity network NN is applied on top of the neuralrepresentation learning network NN .Though different models design different methods to estimate 𝑃 ( 𝑦 | 𝑢, 𝑣 ) , a fundamental assumption shared by these models is thatthe co-occurrence of user and item is independent in observationaldata, which is implied by the causal graph in Figure 1(b), and thus 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) can be estimated from observational data as 𝑃 ( 𝑦 | 𝑢, 𝑣 ) .However, this assumption is unrealistic because user behavioris influenced by the recommender system for various reasons. Forexample, users will more likely interact with those recommendeditems (exposure bias [4]), items ranked at top positions (positionbias [42]), or items that are more popular (popularity bias [1]). Eventhe modeling assumptions made in the recommendation algorithmitself may influence what users can see and interact with (inductivebias [3]). As a result, 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) cannot be simplified as 𝑃 ( 𝑦 | 𝑢, 𝑣 ) ,and we need to intervene on items so as to estimate the real causaleffect 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) between an item and the user preference. ausal Collaborative Filtering Conference’17, July 2017, Washington, DC, USA The 𝑢 -specific causal effect 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) by definition requiresinterventional reasoning. If we have complete control of the rec-ommendation platform, i.e., if we can freely decide what items torecommend for a user such as in industry settings, then we canconduct direct intervention based on randomized experiments forestimating 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) . However, researchers usually only have abenchmark dataset and do not have access to the real-world rec-ommendation platform. Even in industry settings, researchers areusually not allowed to make random recommendations or can onlydo randomized experiment on a small subset of users such as inA/B testing. Depending on if or not we have complete control of therecommendation platform, we have the following two approachesto estimate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) . Direct Intervention Models . If we have complete controlof the recommendation platform or have access to a randomizedtreatment dataset where user is randomly exposed to items, then thestraightforward way of estimating 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) is through directintervention, as researchers have done in [4]. Basically, the assumedcausal graph is Figure 1(c), which extends Figure 1(b) by removingthe independence assumption between user and item.We refine Figure 1(c) as Figure 2(a) to show the structural equa-tions 𝑉 = 𝑔 ( 𝑈 ) and 𝑌 = 𝑓 ( 𝑈 , 𝑉 ) , which represent the two stepsof the recommendation pipeline. 𝑉 = 𝑔 ( 𝑈 ) represents the de factorecommendation model in the system that decides what items areexposed to the user, and 𝑌 = 𝑓 ( 𝑈 , 𝑉 ) represents the user’s prefer-ence on the exposed item. Now it is clear that the observationaldata is subject to exposure bias [4], i.e., users can only reveal theirpreferences on those items that were exposed to them by the rec-ommendation model. If a user did not interact with an item, it doesnot necessarily mean that the user dislikes the item—the user justdid not have a chance to tell his/her preference on the item. As aresult, 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) can not be simplified as 𝑃 ( 𝑦 | 𝑢, 𝑣 ) to be learnedfrom offline observational data.To estimate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) , we resort to the most original defini-tion of intervention through manipulated causal graphs [30, p.54].We remove all edges directed into 𝑉 in Figure 2(a)—in this case, theedge from 𝑈 to 𝑉 —and we have the manipulated causal graph inFigure 2(b). We thus have 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) = 𝑃 𝑚 ( 𝑦 | 𝑢, 𝑣 ) , where 𝑃 𝑚 isthe probability distribution according to the manipulated causalgraph. To estimate 𝑃 𝑚 ( 𝑦 | 𝑢, 𝑣 ) , we can apply a randomized exposurepolicy by showing random items to users thus to implement theindependence between 𝑈 and 𝑉 . This randomized treatment willhelp us to collect an unbiased dataset to estimate 𝑃 𝑚 ( 𝑦 | 𝑢, 𝑣 ) . Moredetails of the procedure can be seen in [4]. Inverse Propensity Scoring (IPS) Models . In many cases,we do not have complete access to the recommendation platform.Even if we do have the access, we may be prohibited from ran-domized exposure policies since this may greatly hurt the userexperience. The basic idea of propensity scoring methods is to turnthe outcomes of an observational study into pseudo-randomizedtrials by re-weighting the samples [4], so that 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) can beestimated from the observational data [15, 23, 38].More formally, according to the recommendation pipeline shownin Figure 2(a), the observed user preference 𝑟 𝑢𝑣 is considered as 𝑈 𝑉𝑌 𝑓 𝑈 𝑉𝑌 𝑓 ✘ (a) (b) 𝑈 𝑉𝑌 𝑓 (c) 𝑋 ℎ 𝑋 𝑌𝑉

𝑈 𝑈𝑈 (d)

Figure 2: (a-b) Causal graphs before and after manipulation.(c-d) Reorganize causal graph using 𝑈 as exogenous variable. 𝑟 𝑢𝑣 ∝ 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) 𝑃 ( 𝑣 | 𝑢 ) , which is the multiplication betweenthe user’s real preference 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) and the probability thatuser 𝑢 had a chance to see the item 𝑃 ( 𝑣 | 𝑢 ) . As a result, we have 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) ∝ 𝑟 𝑢𝑣 𝑃 ( 𝑣 | 𝑢 ) . This means that each example in the ob-servational data should boost its probability by a factor equal to1 / 𝑃 ( 𝑣 | 𝑢 ) , and hence the name inverse propensity scoring (aka inverseprobability weighting [30, p.73]), which corrects the observationaldata by removing the exposure bias. In this way, we can estimate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) from the corrected observational data.The advantage of IPS-based estimator is its convenience in im-plementation, and the disadvantage is that the estimator may notproperly handle large shifts in exposure probability. For example,items with low probability of exposure 𝑃 ( 𝑣 | 𝑢 ) under the recom-mendation model will tend to have higher predicted scores [4]. As noted before, to estimate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) , direct invention ap-proaches may not be available to us, while IPS-based approachesmay be vulnerable to small probabilities. As a result, we need betterapproaches to the estimation. It is good to notice that 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) naturally involves counterfactual reasoning as part of its logic.The reason is that most recommendation models make recom-mendations based on users’ interaction history, including bothsequential and non-sequential recommendation models. For sequen-tial models, the algorithm naturally takes the user’s interaction his-tory as input when making recommendations. For non-sequentialmodels such as matrix factorization, the model usually employs amatching function over the user and item embeddings to make rec-ommendations. Even though the final recommendation step doesnot explicitly take user interaction history as input, the learning ofuser and item embeddings are based on user interaction histories.On the other hand, the user interaction history is influenced bythe recommendation model itself, because users will more likelyinteract with the recommended items. However, using user’s inter-action history for recommendation is inevitable since this is thekey idea of CF and it makes personalized recommendation possible,but because of this, it also encodes bias into the recommendationmodel since the user interaction history is already influenced byvarious factors and thus may not reveal users’ real preference.To solve the problem, we need to answer counterfactual ques-tions such as what if an item had or had not been recommended, what if we intervene to recommend an item, and what if the userhad a different interaction history. Such imaginary cases constitutethe counterfactual world , in contrast to what happened in the realworld . In the next section, we will show a flexible learning frame-work based on counterfactual constraints to answer the questionsfor causal collaborative filtering. onference’17, July 2017, Washington, DC, USA Shuyuan Xu, Yingqiang Ge, Yunqi Li, Zuohui Fu, Xu Chen, and Yongfeng Zhang To enable counterfactual reasoning, we extend the causal graphfrom Figure 2(a) to Figure 2(c) to consider user’s interaction his-tories 𝑋 . More specifically, the structural casual model includesthree structural equations: (1) 𝑋 = ℎ ( 𝑈 ) , which returns a user’sinteraction history 𝑋 . In the most simple case, it can be a databaseretrieval operation that returns a user’s interaction history from theobservational data; (2) 𝑉 = 𝑔 ( 𝑈 , 𝑋 ) , which is the already deployed(but potentially biased) recommendation algorithm of the systemthat returns the recommended item 𝑉 based on the user and theuser’s interaction history; (3) 𝑌 = 𝑓 ( 𝑈 , 𝑉 ) , which is the unbiaseduser preference function that reveals user’s real preference 𝑌 onthe item, and this is the function that we do not know but we wantto estimate. As we can see, the fundamental goal is how to trans-form an arbitrary recommendation algorithm 𝑉 = 𝑔 ( 𝑈 , 𝑋 ) into anunbiased user preference estimation 𝑌 = 𝑓 ( 𝑈 , 𝑉 ) .We should acknowledge that the proposed causal graph in Figure2(c) is not a once-and-for-all solution for recommender systems,because practical recommender systems are very complicated thatinvolve many other factors such as user and/or item content fea-tures as well as sponsored recommendations. However, we considerthe proposed causal graph in this work for two reasons: (1) the struc-tural equation 𝑉 = 𝑔 ( 𝑈 , 𝑋 ) is general enough to include a widescope of recommendation algorithms, including both sequentialand non-sequential methods, and (2) our proposed framework—tobe described in later subsections—is flexible and can be easily gen-eralized to more complex causal graphs such as incorporating userand item content features, which we will consider in the future. To estimate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) , we first identify that { 𝑈 , 𝑋 } is a set ofvariables that satisfy the backdoor criterion [30, p.61] for the casualeffect 𝑉 → 𝑌 . Since we already conditioned on 𝑈 for personaliza-tion, the only variable that leads to variations in 𝑉 is user interac-tion history 𝑋 , as a result, we adopt conditional intervention [30,p.70][29, p.113] to estimate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) .More specifically, the recommendation policy 𝑉 = 𝑔 ( 𝑈 , 𝑋 ) pro-vides recommendation 𝑉 based on the user 𝑈 and history 𝑋 , writtenas 𝑑𝑜 ( 𝑉 = 𝑔 ( 𝑈 , 𝑋 )) . To find out the distribution of the outcome 𝑌 that results from this policy, we seek to estimate 𝑃 ( 𝑌 = 𝑦 | 𝑈 = 𝑢, 𝑑𝑜 ( 𝑉 = 𝑔 ( 𝑈 , 𝑋 ))) . We will show that identifying the effect ofsuch policies is equivalent to identifying the expression for the ( 𝑢, 𝑥 ) -specific effect 𝑃 ( 𝑌 = 𝑦 | 𝑈 = 𝑢, 𝑋 = 𝑥, 𝑑𝑜 ( 𝑉 = 𝑣 )) [30, p.71]. 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) (cid:17) 𝑃 ( 𝑌 = 𝑦 | 𝑈 = 𝑢, 𝑑𝑜 ( 𝑉 = 𝑔 ( 𝑈 , 𝑋 ))) = ∑︁ 𝑥 𝑃 ( 𝑌 = 𝑦 | 𝑈 = 𝑢, 𝑑𝑜 ( 𝑉 = 𝑔 ( 𝑈 , 𝑋 )) , 𝑋 = 𝑥 ) × 𝑃 ( 𝑋 = 𝑥 | 𝑈 = 𝑢, 𝑑𝑜 ( 𝑉 = 𝑔 ( 𝑈 , 𝑋 ))) = ∑︁ 𝑥 𝑃 ( 𝑌 = 𝑦 | 𝑈 = 𝑢, 𝑋 = 𝑥, 𝑑𝑜 ( 𝑉 = 𝑔 ( 𝑢, 𝑥 ))) 𝑃 ( 𝑋 = 𝑥 | 𝑈 = 𝑢 ) = ∑︁ 𝑥 𝑃 ( 𝑌 = 𝑦 | 𝑈 = 𝑢, 𝑋 = 𝑥, 𝑑𝑜 ( 𝑉 = 𝑣 ))| 𝑣 = 𝑔 ( 𝑢,𝑥 ) 𝑃 ( 𝑋 = 𝑥 | 𝑈 = 𝑢 ) = ∑︁ 𝑥 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 )| 𝑣 = 𝑔 ( 𝑢,𝑥 ) 𝑃 ( 𝑥 | 𝑢 ) = 𝐸 𝑥 | 𝑢 [ 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 )| 𝑣 = 𝑔 ( 𝑢,𝑥 ) ] (4) In the above derivation, step 1 follows from the law of totalprobability and the fact that variables { 𝑈 , 𝑋 } satisfy the backdoorcriterion [30, p.61]; step 2 follows from the fact that 𝑋 occurs before 𝑉 and hence any control exerted on 𝑉 can have no effect on the dis-tribution of 𝑋 [30, p.71]; step 3 rewrites the equation which tells usthat the causal effect of a conditional policy 𝑑𝑜 ( 𝑉 = 𝑔 ( 𝑈 , 𝑋 )) can beevaluated from the expression of 𝑃 ( 𝑌 = 𝑦 | 𝑈 = 𝑢, 𝑋 = 𝑥, 𝑑𝑜 ( 𝑉 = 𝑣 )) by substituting 𝑔 ( 𝑢, 𝑥 ) for 𝑣 and taking the conditional expectationover 𝑋 using the conditional distribution 𝑃 ( 𝑋 = 𝑥 | 𝑈 = 𝑢 ) for per-sonalization [30, p.71]; finally, step 4 simplifies the notation intothe conditional expectation form [29, p.113].From the last step in Eq.(4) we can see that the key differencebetween the causal model 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) and traditional matching-based models 𝑃 ( 𝑦 | 𝑢, 𝑣 ) is the existence of the conditional probabilityterm 𝑃 ( 𝑥 | 𝑢 ) in the final step. In the final step, 𝑃 ( 𝑦 | 𝑢, 𝑥, ˆ 𝑣 )| 𝑣 = 𝑔 ( 𝑢,𝑥 ) stands for the preference estimation of the already deployed recom-mendation model 𝑉 = 𝑔 ( 𝑈 , 𝑋 ) . Traditional models only considerthe real world but not the counterfactual world, as a result, theconditional probability 𝑃 ( 𝑥 | 𝑢 ) = 𝑥 , while for unobserved history 𝑥 ′ , 𝑃 ( 𝑥 ′ | 𝑢 ) =

0. In this case, wecan see that the summation in the final step will only include theobserved history 𝑥 and thus 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) naturally degenerates tothe original recommendation model 𝑉 = 𝑔 ( 𝑈 , 𝑋 ) .However, just because the observed history of user 𝑢 is 𝑥 doesnot mean that the user is destined to interact with the items in 𝑥 —the user just happened to interact with 𝑥 , i.e., if the user hada chance to be recommended with different items 𝑥 ′ in the coun-terfactual world, the user may also interact with those items, andthus the probability 𝑃 ( 𝑥 ′ | 𝑢 ) is not 0. As a result, the calculationof Eq.(4) requires counterfactual reasoning in the counterfactualworld where the user history had been 𝑋 = 𝑥 ′ , which is beyondthe observational data 𝑋 = 𝑥 . Counterfactual reasoning enables more refined intervention at in-dividual level [30, p.78,93]. In this work, the individual level refersto each specific user 𝑈 = 𝑢 for personalization purpose. To bet-ter understand this, the causal graph in Figure 2(c) is equivalentlytransformed into Figure 2(d), where 𝑋, 𝑉 , 𝑌 form a chain structureand 𝑈 serves as the exogenous variable over other variables. As aresult, counterfactual reasoning is individualized on each user.To enable counterfactual reasoning for estimating Eq.(4), let usconsider a record ( 𝑢, 𝑥, 𝑣, 𝑦 ) in the observational data, which meansthat the user 𝑢 ’s real interaction history is 𝑥 , and then the systemlogged the user’s preference on item 𝑣 which is 𝑦 . For example, wecan consider binary preference values by using 𝑦 = 𝑦 = 𝑦 = 𝑓 ( 𝑢, 𝑣 ) is expressed as 𝑦 = 𝑓 ( 𝑢, 𝑣 ) ∝ 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) = ∑︁ ˜ 𝑥 𝑃 ( 𝑦 | 𝑢, ˜ 𝑥, 𝑣 )| 𝑣 = 𝑔 ( 𝑢, ˜ 𝑥 ) 𝑃 ( ˜ 𝑥 | 𝑢 ) = 𝐸 ˜ 𝑥 | 𝑢 [ 𝑃 ( 𝑦 | 𝑢, ˜ 𝑥, 𝑣 )| 𝑣 = 𝑔 ( 𝑢, ˜ 𝑥 ) ] (5)To distinguish from the single real-world history 𝑥 , we use ˜ 𝑥 torepresent any possible user history, including both the real history 𝑥 and possible counterfactual histories 𝑥 ′ . Eq.(5) means that the ausal Collaborative Filtering Conference’17, July 2017, Washington, DC, USA Table 1: Different heuristic rules to create counterfactual examples, the corresponding counterfactual question, and someintuitive toy examples. In the toy examples, the user’s real interaction history 𝑥 includes items 𝑎 𝑏 𝑐 , and items at the rightside of the arrow is the counterfactual history 𝑥 ′ . Multiple counterfactual histories can be constructed from the real history 𝑥 . Heuristic Rule Counterfactual Question Toy ExampleKeep One (K1) What if the user only interacted with one history item? 𝑎 𝑏 𝑐 → 𝑎 ; 𝑎 𝑏 𝑐 → 𝑏 ; 𝑎 𝑏 𝑐 → 𝑐 Delete One (D1) What if the user did not interact with one of the history items? 𝑎 𝑏 𝑐 → 𝑏 𝑐 ; 𝑎 𝑏 𝑐 → 𝑎 𝑐 ; 𝑎 𝑏 𝑐 → 𝑎 𝑏 Replace One (R1) What if one of the history items were different? 𝑎 𝑏 𝑐 → 𝑎 ′ 𝑏 𝑐 ; 𝑎 𝑏 𝑐 → 𝑎 𝑏 ′ 𝑐 ; 𝑎 𝑏 𝑐 → 𝑎 𝑏 𝑐 ′ estimation of 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) can be achieved by correcting the orig-inal recommendation algorithm’s estimation 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 )| 𝑣 = 𝑔 ( 𝑢,𝑥 ) using counterfactual histories 𝑥 ′ . More specifically, the estimationfor 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) is the expected estimation of 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 )| 𝑣 = 𝑔 ( 𝑢,𝑥 ) ,where the expectation is taken over all possible histories (includingreal and counterfactual histories) when item 𝑣 is recommended. Generate Counterfactual Examples . Counterfactual rea-soning requires generating counterfactual examples by makingminimal modifications in the current model [30, p.92]. In our prob-lem, we need to make minimal changes to the real history 𝑥 so asto generate counterfactual histories 𝑥 ′ . We start with a heuristic-based approach for counterfactual example generation and we willgeneralize to a learning-based approach in the next section.We adopt three heuristic rules to generate counterfactual histo-ries 𝑥 ′ by applying modifications to the real history 𝑥 (Table 1). TheKeep One (K1) rule only keeps one item of the user’s real history,the Delete One (D1) rule removes one item from the user’s real his-tory, and the Replace One (R1) rule replaces one item of the user’sreal history with another item. For the R1 rule, depending on howthe item is replaced, we have two variants: R1-random (R1r)—theitem is replaced with a random item, and R1-nearest (R1n)—theitem is replaced with its nearest neighbour based on embeddingsimilarity. We will introduce more details in the experiments. Select Counterfactual Examples . Consider the trainingexample ( 𝑢, 𝑥, 𝑣, 𝑦 ) where the user’s real history is 𝑥 , and we are ableto generate 𝑚 counterfactual histories { 𝑥 ′ , 𝑥 ′ · · · 𝑥 ′ 𝑚 } using one ofthe heuristic rules. Conditional intervention (Section 5.1) requires 𝑣 = 𝑔 ( 𝑢, ˜ 𝑥 ) , i.e., the same item 𝑣 should be recommended by therecommendation algorithm 𝑔 (· , ·) under counterfactual histories(since we are considering 𝑑𝑜 ( 𝑣 ) instead of just 𝑣 in the condition).However, not all of the counterfactual histories { 𝑥 ′ , 𝑥 ′ · · · 𝑥 ′ 𝑚 } guar-antee that item 𝑣 is recommended under the algorithm. As a result,we execute the recommendation algorithm 𝑔 (· , ·) over each coun-terfactual history 𝑥 ′ 𝑖 and obtain the top- 𝑘 recommendation list V ′ 𝑖 = 𝑔 ( 𝑢, 𝑥 ′ 𝑖 ) , where 𝑘 is a hyper-parameter to be tuned (will beintroduced in the experiments). If the target item 𝑣 ∈ V ′ 𝑖 , then wekeep the counterfactual example ( 𝑢, 𝑥 ′ 𝑖 , 𝑣, 𝑦 ) . Suppose 𝑛 of the 𝑚 counterfactual histories are eventually selected, we will have a setof counterfactual examples {( 𝑢, 𝑥 ′ 𝑖 , 𝑣, 𝑦 )} 𝑛𝑖 = .It’s interesting to see that following standard definitions of con-ditional intervention and counterfactual reasoning [30], the finalresult we derived has intuitive meanings. The intuition is that sincethe user already told us his/her real preference on item 𝑣 is 𝑦 (notedby observation ( 𝑢, 𝑥, 𝑣, 𝑦 ) ), then the preference should be unchangedunder counterfactual histories (noted by counterfactual examples ( 𝑢, 𝑥 ′ 𝑖 , 𝑣, 𝑦 ) ), though the counterfactual histories are unobserveddue to various reasons such as exposure or popularity bias. Calculate the Expectation . We then calculate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) based on the real observation ( 𝑢, 𝑥, 𝑣, 𝑦 ) and the counterfactualexamples {( 𝑢, 𝑥 ′ 𝑖 , 𝑣, 𝑦 )} 𝑛𝑖 = according to Eq.(5). For simplicity, weconsider 𝑃 ( ˜ 𝑥 | 𝑢 ) as a uniform distribution over the real and counter-factual histories, i.e., 𝑃 ( ˜ 𝑥 | 𝑢 ) = 𝑛 + , ∀ ˜ 𝑥 ∈ { 𝑥, 𝑥 ′ , 𝑥 ′ · · · 𝑥 ′ 𝑛 } . Gener-alizing to more complex distributions such as Gaussian or Gammadistribution will be considered in the future. As a result, we have: 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) = ∑︁ ˜ 𝑥 𝑃 ( 𝑦 | 𝑢, ˜ 𝑥, 𝑣 )| 𝑣 = 𝑔 ( 𝑢, ˜ 𝑥 ) 𝑃 ( ˜ 𝑥 | 𝑢 ) = 𝑛 + (cid:2) 𝑃 𝑔 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) + 𝑛 ∑︁ 𝑖 = 𝑃 𝑔 ( 𝑦 | 𝑢, 𝑥 ′ 𝑖 , 𝑣 ) (cid:3) (6)where we use 𝑃 𝑔 to represent the probability estimation of the baserecommendation algorithm 𝑣 = 𝑔 ( 𝑢, 𝑥 ) . As we noted before, given any base recommendation algorithm 𝑣 = 𝑔 ( 𝑢, 𝑥 ) , our goal is to derive the preference estimator 𝑦 = 𝑓 ( 𝑢, 𝑣 ) ∝ 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) . To achieve this goal, we propose a coun-terfactual constrained learning approach. The idea is simple—werequire the base recommendation algorithm’s preference estimation 𝑃 𝑔 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) to be equal to our wanted estimation 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) . Inthis way, we can correct the estimation of any base recommenderand use the corrected estimation to generate recommendations. Torealize this goal, we add the following counterfactual constraint tothe training objective of the base recommendation algorithm. 𝑃 𝑔 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) (cid:17) 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) (7)Combining Eq.(7) and Eq.(6), and using 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) for 𝑃 𝑔 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) for notation simplicity, we have 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) = 𝑛 + (cid:2) 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) + 𝑛 ∑︁ 𝑖 = 𝑃 ( 𝑦 | 𝑢, 𝑥 ′ 𝑖 , 𝑣 ) (cid:3) ⇔ 𝑛𝑛 + 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) = 𝑛 + 𝑛 ∑︁ 𝑖 = 𝑃 ( 𝑦 | 𝑢, 𝑥 ′ 𝑖 , 𝑣 )⇔ 𝑛 ∑︁ 𝑖 = 𝑃 ( 𝑦 | 𝑢, 𝑥 ′ 𝑖 , 𝑣 ) = 𝑛 · 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) (8)A sufficient condition for Eq.(8) is to require every counterfactualexample’s estimation to be equal to the real history’s estimation, i.e., 𝑃 ( 𝑦 | 𝑢, 𝑥 ′ 𝑖 , 𝑣 ) = 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) , ∀ ≤ 𝑖 ≤ 𝑛 (9)As a result, we apply the above counterfactual constraint over each ( 𝑥, 𝑥 ′ ) pair. In the following, we first propose a discrete version ofthe counterfactual constrained learning algorithm for any base rec-ommender, which conducts counterfactual reasoning in a discrete onference’17, July 2017, Washington, DC, USA Shuyuan Xu, Yingqiang Ge, Yunqi Li, Zuohui Fu, Xu Chen, and Yongfeng Zhang item space. We then generalize the discrete version to a contin-uous version for counterfactual reasoning in a continuous latentembedding space of the items. Counterfactual Learning in Discrete Space . Let 𝐿 ( 𝑔 ) bethe loss function of a base recommendation algorithm 𝑔 ( 𝑢, 𝑥 ) . CCFaims to learn 𝐿 ( 𝑔 ) under a discrete counterfactual constraint.minimize 𝐿 ( 𝑔 ) s.t. ∑︁ 𝑢 ∈U ∑︁ 𝑣 ∈I( 𝑢 ) ∑︁ 𝑥 ′ ∈C( 𝑢,𝑣 ) (cid:12)(cid:12) 𝑃 ( 𝑦 | 𝑢, 𝑥 ′ , 𝑣 ) − 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) (cid:12)(cid:12) ≤ 𝜖 (10)where U is the set of users, 𝑥 is the real history of user 𝑢 , I( 𝑢 ) isthe set of interacted items in the training set (excluding items in 𝑥 ), C( 𝑢, 𝑣 ) is the set of counterfactual histories of user 𝑢 under thetarget item 𝑣 (section 5.2.1 and 5.2.2), and 𝜖 is a threshold hyper-parameter controlling how rigorous is the constraint. As stated insection 5.2.2, the intuition here is that since the user already told ushis/her real preference on item 𝑣 is 𝑦 , then the preference shouldkeep unchanged under counterfactual histories, even though thecounterfactual histories are unobserved due to various reasons. Counterfactual Learning in Continuous Space . Manyrecommendation models represent users, items and histories asembedding vectors in a latent space. If a user’s history 𝑥 is rep-resented as an embedding vector x , then we can directly createlatent counterfactual histories x ′ by slightly perturbing vector x in the latent space. This will be more efficient than creating coun-terfactual examples in the original discrete item space based onheuristic rules. Similarly, let 𝐿 ( 𝑔 ) be the loss function of a baserecommendation algorithm 𝑔 ( 𝑢, 𝑥 ) . CCF in continuous space aimsto learn 𝐿 ( 𝑔 ) under a continuous counterfactual constraint.minimize 𝐿 ( 𝑔 ) s.t. ∑︁ 𝑢 ∈U ∑︁ 𝑣 ∈I( 𝑢 ) ∫ x ′ (cid:12)(cid:12) 𝑃 ( 𝑦 | 𝑢, 𝑥 ′ , 𝑣 ) − 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) (cid:12)(cid:12) ≤ 𝜖 , ∥ x ′ − x ∥ ≤ 𝜖 (11)where x is the latent vector embedding of user 𝑢 ’s real history 𝑥 in the latent space, x ′ is a latent vector selected from the small 𝜖 -neighbourhood of the embedding x , and the integration can becalculated based on Monte Carlo sampling. All other parametershave the same meaning as Eq.(10). Directly solving the above constrained optimization problem ischallenging, which requires maintaining the constraints during theoptimization. We formulate the problem as a tractable optimizationproblem by relaxing the constraints. Technically, we allow someexamples to violate the constraints but we penalize these examplesin the loss function during optimization.For counterfactual reasoning in discrete space, we convert theobjective in Eq.(10) to the following Lagrange optimization form:minimize 𝐿 ( 𝑔 ) + 𝜔𝐿 𝑐 𝐿 𝑐 = max (cid:110) , ∑︁ 𝑢 ∈U ∑︁ 𝑣 ∈I( 𝑢 ) ∑︁ 𝑥 ′ ∈C( 𝑢,𝑣 ) (cid:12)(cid:12) 𝑃 ( 𝑦 | 𝑢, 𝑥 ′ , 𝑣 ) − 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) (cid:12)(cid:12) − 𝜖 (cid:111) (12) where 𝜔 is a parameter controlling the weight of the constraint.While for reasoning in continuous space, we relax the constraintfor optimization similarly. The difference is that the parameter 𝜖 in Eq.(11) is used to restrict the distance between counterfactualhistories and the real history. This can be controlled during thesampling and selection process of the counterfactual histories in thelatent embedding space, which we will introduce in the experiments.We write the relaxed objective as:minimize 𝐿 ( 𝑔 ) + 𝜔𝐿 𝑐 s.t. ∥ x ′ − x ∥ ≤ 𝜖 𝐿 𝑐 = max (cid:110) , ∑︁ 𝑢 ∈U ∑︁ 𝑣 ∈I( 𝑢 ) ∫ x ′ (cid:12)(cid:12) 𝑃 ( 𝑦 | 𝑢, 𝑥 ′ , 𝑣 ) − 𝑃 ( 𝑦 | 𝑢, 𝑥, 𝑣 ) (cid:12)(cid:12) − 𝜖 (cid:111) (13)Still, we apply Monte Carlo integration to get the numerical inte-gration value through random samples. We conduct experiment to explore the CCF framework from dif-ferent perspectives. In particular, we aim to answer the followingresearch questions: • RQ1 : What is the overall performance of the CCF framework,can CCF improve the recommendation performance? • RQ2 : How the different heuristic rules influence the recom-mendation performance? • RQ3 : Is it necessary to select counterfactual examples afterthey are generated? • RQ4 : What is the impact of the counterfactual constraint inthe learning objective and how strict should the counterfac-tual constraints be applied?We will fist describe the datasets, baselines and implementationdetails and then provide our answers to the above questions.

Our experiments are conducted on two types of datasets. The firsttype is frequently used benchmark datasets with standard train-ing/validation/testing split to show that our framework is capableof improving the performance in a normal experimental setting.In particular, we use the

MovieLen-100k and Amazon Baby datasets—one is a movie recommendation dataset and the other isan e-commerce dataset. For the second type of datasets, to bettershow that our framework can help models to capture the real pref-erence of users, we apply our framework on the Yahoo! R3 and Coat Shopping datasets. A special property of these two datasetsis that their testing data are collected from randomized trials, i.e.,users are recommended with random items and their feedback onthese random items are collected to construct the testing dataset.For all four datasets, following common settings, we considerratings ≥ ≤ MovieLens-100k and

Amazon Baby datasets, we apply leave-one-out to split the dataset. Specifically,we chronologically sort the interactions for each user and put the https://grouplens.org/datasets/movielens/ https://nijianmo.github.io/amazon/ https://webscope.sandbox.yahoo.com/catalog.php?datatype=r ausal Collaborative Filtering Conference’17, July 2017, Washington, DC, USA latest two positive interactions of each user into the validation setand testing set, respectively, and keep the remaining data in thetraining set. For Yahoo! R3 and

Coat Shopping datasets, sincethey already have the testing set from randomized trials, we onlyneed to create the training and validation sets. For each user, werandomly select a positive interaction from the history and put intothe validation set, all other interactions are put into the trainingset. The statistics of the datasets are summarized in Table 2.

To show the effectiveness of our framework on different models,we apply our framework on six recommendation models, whichbelong to three different categories, including two matching models(BPR-MF and NCF), two sequential models (GRU4Rec and STAMP),as well as two reasoning models (NLR and NCR). • BPR-MF [32]: The Bayesian Personalized Ranking modelfor recommendation. We use Matrix Factorization (MF) [20]as the prediction function under the BPR framework, whichconsiders user, item and global bias terms for MF. • NCF [10]: A neural network-based CF method, which adoptsa non-linear prediction network for user and item matching. • GRU4Rec [11]: A session-based recommendation model,which uses recurrent neural networks—in particular, GatedRecurrent Units (GRU)—to capture sequential patterns. • STAMP [26]: The Short-Term Attention/Memory Prioritymodel, which uses the attention mechanism to model bothshort-term and long-term user preferences. • NLR [39]: A reasoning-based model, which adopts Logic-Integrated Neural Network (LINN) to integrate the power ofdeep learning and logic reasoning for recommendation. • NCR [5]: The Neural Collaborative Reasoning model, whichorganizes the logic expressions as neural networks for rea-soning and prediction in a continuous space.We also employ two different types of causal frameworks forcomparison. In particular, a direct intervention model (CausE) andan Inverse Propensity Scoring-based model (IPS). Both CausE andIPS can be applied on each of the above recommendation models. • CausE [4]: The Causal Embedding framework for recom-mendation. It splits the collected data into control data andtreatment data, and expose as uniformly as possible eachuser to each item in the treatment data to mimic direct inter-vention. It jointly learns the representations in the controland treatment data, and finally adopts the control represen-tations for recommendation. • IPS [35]: This is an IPS-based framework that uses propen-sity scores to re-weight the training samples. It uses a user-independent propensity estimator to estimate the propensityscores for each item and to re-weight each sample.Finally, we test five versions of our CCF framework (Section 5.2.1and Section 5.4). Also, each version can be applied on each of thesix recommendation models. • CCF K1 : CCF in discrete space under the Keep One heuristicrule for counterfactual example generation. • CCF D1 : CCF in discrete space under the Delete One rule. Dataset Table 2: The Statistics of the datasets • CCF

R1r : CCF in discrete space under the Replace One rule.The replacement is conducted randomly. • CCF

R1n : CCF in discrete space under the Replace One rule.Item is replaced with its nearest neighbour. • CCF C : CCF in continuous space. We use the same training, validation and testing sets for all rec-ommendation models and frameworks. The models are evaluatedon the Top-K Recommendation task with metrics nDCG@10 andHit@1 (Hit Ratio). We use the pair-wise learning strategy to trainall models. In detail, for each positive interaction in training set,we randomly sample an item as the negative sample. For each userin the validation and testing set, we randomly sample 100 negativeitems for ranking evaluation. Here the negative samples are eithernegative feedback items (items that user dislikes) or non-interacteditems. To fairly evaluate the improvement brought by our CCFframework and the CausE/IPS frameworks, we keep the basic pa-rameters (e.g. learning rate, neural network construction, weight of ℓ -regularizer) the same before and after applying the frameworkson the recommendation models. Specifically, we set the embeddingdimension as 64, the structure of neural network as two-layer MLPwith dimension 64 for all recommendation models (exepct for BPR-MF which is a shallow model). For all baseline recommendationmodels, we consider the learning rate from {0.0005,0.001,0.003,0.005},the ℓ -regularization weight is chosen from {1e-3, 1e-4, 1e-5}. Forsequential models and reasoning models, the history includes pre-viously interacted items even if the item has negative feedback, andthe maximum length of history is 10. All parameters are tuned tothe best on the validation set based on the measure [email protected] framework is applied in both discrete and continuous itemspace. For discrete version, let ( 𝑢, 𝑥, 𝑣, 𝑦 ) be a training example,when selecting counterfactual histories 𝑥 ′ 𝑖 for this training example(Section 5.2.2), we randomly sample 100 negative items plus thetarget item 𝑣 as the candidate item set (101 in total). If item 𝑣 is stillin the top- 𝑘 list under the counterfactual history 𝑥 ′ 𝑖 after recom-mendation algorithm 𝑔 (· , ·) , then we will keep the counterfactualhistory and create a counterfactual example ( 𝑢, 𝑥 ′ 𝑖 , 𝑣, 𝑦 ) . It is worthnoting that this process is not testing the model based on testing orvalidation data. The target item is not from the testing or validationset but the target item of a training example. The selection processaims to make sure that if an item is recommended in the real history,then it should be in the top- 𝑘 list under counterfactual history.For sequential models and reasoning models, counterfactual out-puts can be obtained by simply changing the input of a fixed model.For matching models, however, we need to retrain the recommenda-tion model on counterfactual training data to get the counterfactualoutputs. During the retrain process, all users’ counterfactual histo-ries are applied on the training set while the validation and testing onference’17, July 2017, Washington, DC, USA Shuyuan Xu, Yingqiang Ge, Yunqi Li, Zuohui Fu, Xu Chen, and Yongfeng Zhang Models

ML100k Amazon Baby Yahoo! R3 Coat Shopping Summary nDCG@10 Hit@1 nDCG@10 Hit@1 nDCG@10 Hit@1 nDCG@10 Hit@1 nDCG@10 Hit@1value imp value imp value imp value imp value imp value imp value imp value imp num avg_imp num avg_impBPR-MF Original 0.3647 - 0.1490 - 0.2061 - 0.0762 - 0.2252 - 0.1318 - 0.2042 - 0.1561 - - - - -CausE 0.3784

IPS 0.3696

CCF K1 CCF D1 CCF

R1r

R1n

CCF C NCF Original 0.3504 - 0.1458 - 0.1860 - 0.0660 - 0.2161 - 0.1170 - 0.1985 - 0.1519 - - - - -CausE 0.3680

IPS 0.3558

CCF K1 CCF D1 CCF

R1r

CCF

R1n

CCF C GRU4Rec Original 0.4087 - 0.1865 - 0.2473 - 0.1045 - 0.1682 - 0.0907 - 0.1147 - 0.0759 - - - - -CausE 0.4111

CCF K1 CCF D1 CCF

R1r

CCF

R1n

CCF C STAMP Original 0.4016 - 0.1758 - 0.2418 - 0.1042 - 0.1766 - 0.0899 - 0.1182 - 0.0675 - - - - -CausE 0.4202

IPS 0.4094 K1 CCF D1 CCF

R1r

CCF

R1n

CCF C NLR Original 0.4029 - 0.1886 - 0.2481 - 0.1045 - 0.1660 - 0.0784 - 0.1340 - 0.0802 - - - - -CausE 0.4147

IPS 0.4069

CCF K1 CCF D1 CCF

R1r

CCF

R1n

CCF C NCR Original 0.4227 - 0.1972 - 0.3192 - 0.2687 - 0.2003 - 0.0998 - 0.1703 - 0.1055 - - - - -CausE 0.4234

IPS 0.4237

CCF K1 CCF D1 CCF

R1r

CCF

R1n

CCF C Table 3: Performance on six recommendation models and four datasets. We evaluate on metrics nDCG@10 and Hit@1. Therelative improvement on each metric is calculated against the corresponding original performance. In the summary column,we show the number of improved datasets and the average improvement across four datasets (weighted average by the numberof testing users of each dataset). Positive improvements are highlighted in bold and the highest improvement is underlined. sets are unchanged. However, if the target items are still in the train-ing data, then all target items will for sure get higher predictionscores after the model is retrained because the loss function of 𝑔 (· , ·) will purposely optimize the scores of these items. As a result, wecannot identify those cases that the target item’s recommendationis caused by the counterfactual history. To solve the problem, weselect the target items that only appear in the original training databut not in the counterfactual training data. More specifically, thetarget items would be the deleted (for D1 and K1 rules) or replaced(for R1 rule) items. Those items will get higher scores in the originalmodel because they appear in the original training set but their scores in the retrained model are not guaranteed because they donot appear in the counterfactual training set. The counterfactualhistory selection process is controlled by parameter 𝑘 (top- 𝑘 list).The value of 𝑘 is tuned in {50,60,70,80,90,100} based on validationset, and we will study the influence of 𝑘 in the experiments.For continuous version, the counterfactual examples are gener-ated in the latent embedding space. We sample 10 counterfactualembeddings in the 𝜖 -neighbourhood to estimate the integrate inEq.(13). More specifically, to sample each counterfactual embed-ding, we first sample a noise vector 𝜃 from a Gaussian distribution,and randomly pick a number 𝛾 that satisfies 0 ≤ 𝛾 ≤ 𝜖 . To make ausal Collaborative Filtering Conference’17, July 2017, Washington, DC, USA (a) ML100k Discrete (b) Coat Shopping Discrete (c) ML100k Continuous (d) Coat Shopping Continuous Figure 3: nDCG@10 for ML100k and Coat Shopping datasets with different counterfactual selection parameters. (a) and (b) arediscrete versions with R1r heuristic rule under parameter 𝑘 . (c) and (d) are continuous versions under parameter 𝜖 . sure the constraint ∥ x ′ − x ∥ ≤ 𝜖 is satisfied, we define x ′ as x + √ 𝛾𝜃 /∥ 𝜃 ∥ . The counterfactual constraint weight 𝜔 is tuned in{0.001,0.1,0.5,1.0}, the counterfactual constraint threshold 𝜖 ( 𝜖 forcontinuous version) is in {0.1,0.5,1.0}. For continuous version, theparameter 𝜖 in Eq.(13) is selected in {0.1,1.0,2.0,5.0}. We will showthe influence of the parameters in the following experiments. The overall performance of applying causal frameworks (CausE, IPS,CCF ∗ ) on six recommendation models over four datasets are shownin Table 3, including the ranking metrics (nDCG and Hit Ratio) andrelative improvements against the original performance of the rec-ommendation model. All positive improvements are shown in boldnumbers, the best improvement for each recommendation model isunderlined. The last column of the table shows the summary overfour datasets, including on how many of the four datasets a causalframework improved the performance, and the average improve-ment over the four datasets. Average calculation is weighted by thenumber of testing users of each dataset, and negative improvementsare also counted in when calculating the averages.From the results we can see that in most cases causal frameworkscan bring positive improvement to the recommendation models. Inparticular, for most recommendation models, the CCF frameworkbrings improvements on 3 or 4 out of the four datasets. ComparingCausE, IPS and CCF, we see that for all of the six recommenda-tion models, the largest average improvement over four datasets isalways brought by our CCF framework. For matching models (BPR-MF and NCF), the average improvement brought by CCF is about2% on nDCG and 6% on Hit@1, while for sequential (GRU4Rec andSTAMP) and reasoning (NLR and NCR) models, the improvementcan be as large as about 10% on nDCG@10 and 15% on Hit@1. Aswe mentioned before, CausE improves performance by splitting theobservational training data into approximately randomized dataand IPS improves performance by re-weighting the observationaltraining data. Both of the two frameworks only consider the real-world examples though with different techniques, however, ourCCF framework not only considers real-world examples but alsoinvolves counterfactual examples, which helps to better capture theuser preference and improve the recommendation performance.Considering the two types of datasets, ML100k and

AmazonBaby are traditional training-validation-testing split datasets, while

Yahoo!R3 and

Coat Shopping are datasets whose testing sets arecollected from randomized trials. One minor but noteworthy ob-servation is that matching models (BPR-MF and NCF) get betterperformance than sequential (GRU4Rec and STAMP) and reasoning (NLR and NCR) models on

Yahoo!R3 and

Coat Shopping . The reasonis that the training set of these two datasets do not have timestampinformation, and thus the user history interactions are randomlyordered. For sequential models, this means that the model cannotleverage the item ordering information and thus results in sacri-ficed performance, which is consistent with the observations in[51]. For reasoning models—which relies on Premise ⇒ Consequentreasoning for prediction—though the model does not need timeordering information in the user interactions (i.e. the premise), butwhen creating the training examples, the target item (i.e., the conse-quent) could be an interaction that happened before the items in thepremise, which forces the model to use future events to predict pre-vious events. This violates the model assumption and thus results inscarified performance, which is consistent with the observations in[5]. On

ML100k and

Amazon Baby , where the sequential and reason-ing models can be properly executed, we see that both sequentialand reasoning models are better than matching models. However,we care more about the performance improvement in this work.On both the two types of datasets and all types of recommendationmodels, we can see that CCF can improve the recommendationperformance in most cases.Considering different versions of the CCF framework, we see thatall versions can improve the performance in most cases. Therefore,performance improvement would not be a major concern whendeciding which version to apply. Actually, even though our experi-ments contain four discrete versions and one continuous version,there still may exist other ways to apply counterfactual reasoningin discrete or continuous space, as long as the method of generatingcounterfactual examples are well defined. This shows the flexibilityof our counterfactual reasoning framework. In the next subsection,we will compare the different CCF versions in detail.

Generating and selecting counterfactual examples are essentialcomponents of the CCF framework (Section 5.2). In this section, weaim to answer the research questions

RQ2 and

RQ3 . Specifically,we will discuss the influence of the counterfactual examples in twoperspectives. We will first dig into the difference between differentheuristic rules for generating counterfactual examples in the dis-crete space. We then focus on selecting counterfactual examples toshow the necessity of the selection process after generation.

Difference between Heuristic Rules . According to theperformance shown in Table 3, the CCF framework has the abilityto improve the performance of a base recommendation model. How onference’17, July 2017, Washington, DC, USA Shuyuan Xu, Yingqiang Ge, Yunqi Li, Zuohui Fu, Xu Chen, and Yongfeng Zhang (a) ML100k Discrete (b) Coat Shopping Discrete (c) ML100k Continuous (d) Coat Shopping Continuous Figure 4: nDCG@10 for ML100k and Coat Shopping datasets with different counterfactual constraint weight 𝜔 . (a) and (b) arediscrete versions with R1r heuristic rule. (c) and (d) are continuous versions. to generate counterfactual examples plays an important role here.In this section, we focus on the discrete versions of our frameworkand discuss the effect of different heuristic rules.We have three types of heuristic rules in Table 1, including theKeep One (K1), Delete One (D1) and Replace One (R1). Besides, theR1 rule has two variations—R1-random and R1-nearest. Among theheuristic rules, the K1 and D1 rules generate much fewer numberof counterfactual histories than R1, because K1 and D1 are limitedby the number of interactions in the user’s real history, while theR1 rule can replace each interacted item with a great number ofpossible items in the space, thus creating many counterfactualhistories. As a result, it is more difficult for K1 and D1 to get qualifiedcounterfactual examples in the selection process when 𝑘 is small,where 𝑘 is the top- 𝑘 selection threshold introduced in Section 5.2.2.Considering the difficulty of generating qualified counterfactualexamples, the R1 rules are intuitively better than D1 and K1.This is consistent with the experimental results. Crossing allmodels and datasets in Table 3, we see that the R1 rule is more likelyto get better performance compared with other rules. Summarizingthe numbers in Table 3, among the 2 (measures) × × Counterfactual Example Selection . For the CCF discreteversions, the selection is based on the parameter 𝑘 , i.e., the coun-terfactual example will be selected if the target item is ranked totop- 𝑘 under the counterfactual history (Section 5.2.2). For the CCFcontinuous version, the selection is based on the parameter 𝜖 inEq.(13), i.e., a counterfactual history embedding is selected only ifit is close enough to the real history embedding.We first examine the discrete versions. We plot the nDCG@10under different 𝑘 (from 50 to 100) and under the heuristic rule R1rin Figure 3(a-b) for two types of datasets. Other rules and datasets have similar observations. Since we randomly sample 100 negativeitems for ranking during selection, so when 𝑘 is set as 100, we areactually doing no selection because the target item is almost surelyin top-100 among the 101 items. We can see that when 𝑘 is properlychosen, the selected counterfactual examples will improve the per-formance through the counterfactual constraint. However, when 𝑘 is too large—such as 𝑘 =

100 that all the generated counterfac-tual examples are selected—the counterfactual constraint will hurtthe performance. This observation is consistent with the theory ofconditional intervention (Section 5.1). When doing conditional in-tervention (Eq.(4)(5)), we need to use those counterfactual histories 𝑥 ′ that lead to the target item 𝑣 still being recommended for user 𝑢 under the recommendation algorithm 𝑣 = 𝑔 ( 𝑢, 𝑥 ′ ) , and only thesecounterfactual histories are required to reach similar predictionswith the real history in the counterfactual constraint (Eq.(10)(12)).If too many “irrelevant” counterfactual histories are forced to makethe same predictions, it will violate the assumptions of conditionalintervention and thus lead to decreased performance.For continuous version, we plot the nDCG@10 under different 𝜖 in Figure 3(c-d). When 𝜖 is very small, the generated counter-factual embedding x ′ is very close to the real history embedding x (Eq.(11)(13)), therefore, the performance after applying the CCFframework has no much difference from the original recommenda-tion performance, since the counterfactual constraint in Eq.(11)(13)is easily satisfied. In contrast, if 𝜖 is too large, the generated coun-terfactual embeddings are too far away from the real embedding,and when we force their predictions to be close, the performancewill decrease. The reason is similar to the discrete versions. As aresult, for both discrete and continuous versions, appropriate selec-tion of the counterfactual examples is important, but the selectioncan be conducted in either discrete or continuous space.As mentioned in the previous subsection, the K1 rule applies arelatively more significant change to the user history when gener-ating counterfactual histories, but CCF K1 is still able to improve theperformance in many cases (34 out of 48 cases in Table 3). This alsobenefits from the the selection process. This process will make surethat the selected counterfactual histories do not make significantchanges in terms of the recommendation output. In other words,although K1 applies a significant change in terms of the number ofhistory items, the selected counterfactual examples are still similarto the real example in terms of generating similar recommendations,and thus to satisfy the requirement of conditional intervention.Overall, counterfactual examples are fundamental elements ofthe CCF framework. The counterfactual examples go through two ausal Collaborative Filtering Conference’17, July 2017, Washington, DC, USA (a) ML100k Discrete (b) Coat Shopping Discrete (c) ML100k Continuous (d) Coat Shopping Continuous Figure 5: nDCG@10 for ML100k and Coat Shopping datasets with different counterfactual constraint threshold 𝜖 ( 𝜖 for con-tinuous version). (a) and (b) are discrete versions with R1r heuristic rule. (c) and (d) are continuous versions. steps: step one generates candidate counterfactual examples andstep two selects the qualified examples to satisfy the requirementsof conditional intervention. The generation and selection processcan be conducted in both discrete and continuous space. After the counterfactual examples are selected, we apply them intothe counterfactual constraint for counterfactual learning. There aretwo important parameters for the constraint, one is parameter 𝜔 inEq.(12) and (13) which controls the importance of the constraint inthe learning objective, and the other is the parameter 𝜖 in Eq.(12)(or 𝜖 in Eq.(13)) which controls how strict the counterfactual con-straints are. In this section, we provide the answers to RQ4 . Wewill discuss the two parameters separately in the following.

Counterfactual Constraint Weight . Our framework aimsto learn the loss function of a base recommendation algorithm un-der a counterfactual constraint (Eq.(10)(11)), but for optimization,we relax the constraint and convert the loss function into Eq.(12)and (13). The larger the counterfactual constraint weight 𝜔 , themore likely the results will follow the constraint.We tune the constraint weight 𝜔 while keeping other parametersfixed. The results of nDCG@10 are shown in Figure 4 for bothdiscrete and continuous versions on two types of datasets. We seethat in most cases the performance would first lift and then dropwith the increase of the 𝜔 . This result shows that the counterfactualconstraint is useful for improving performance, but it also requires agood balance with the recommendation loss. When the weight is toosmall, the counterfactual constraint has little effect on the total loss,leading to only slight improvement or even slight loss consideringthe larger model complexity. In contrast, when the weight is toolarge, the constraint loss will dominate the total loss, and thus therecommendation performance is significantly decreased since therecommendation loss does not take too much effect. Comparedwith an overly large weight, a smaller weight than the propervalue would have no or less hurt on the performance. In summary,counterfactual constraint will help improve the recommendationperformance, but the weight needs to be carefully specified. Counterfactual Constraint Threshold . The counterfac-tual constraint threshold (i.e. 𝜖 in Eq.(10) and 𝜖 in Eq.(11)) controlshow rigorous is the constraint. After converting the counterfactualconstraint into the optimization loss (Eq.(12)(13)), the constraintloss will not be optimized when the difference is already less thanthe threshold. Smaller threshold is supposed to result in closer prediction scores between counterfactual examples and the real ex-ample, while larger threshold will lead to a more relaxed constraint.We plot the nDCG@10 with different threshold in Figure 5. Wesee that the performance would first increase and then decrease andfinally tend to be flat when the threshold is large enough. Whenthe threshold is too small, the constraint would be too tight and itonly allows for small variations between counterfactual and realexamples’ predictions. This makes the model less capable of han-dling the potential errors in counterfactual examples and thus leadsto only slightly improved or even slightly decreased performance.When the threshold is too large, we are actually applying no coun-terfactual constraint, since the difference would always be smallerthan the threshold and thus the 𝐿 𝑐 in Eq.(12) and (13) would be 0in most cases. As a result, the performance becomes relatively flatwhen the threshold is large enough. In this paper, we proposed a Causal Collaborative Filtering (CCF)framework for personalized recommendation. We first show the 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) formulation of CCF and further show that many tradi-tional collaborative filtering algorithms are actually special cases ofCCF under simplified causal graphs. We then provided a conditionalintervention approach to estimating 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) based on obser-vational data, and further proposed a counterfactual constrainedlearning framework to make counterfactual reasoning possible un-der a standard machine learning pipeline. Experimental results onboth split and randomized trail datasets show that CCF can improvethe performance of the matching-, sequential- and reasoning-basedrecommendation models in most cases and by large margins.The CCF framework is a very flexible framework and it can beextended in various dimensions. First, we can further extend thecausal graph in Figure 1(d) into more complicated causal graphs, sothat we can estimate 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) for more complex recommenda-tion scenarios. Second, our proposed conditional intervention andcounterfactual learning approach may also be applied to other intel-ligent tasks such as vision and language learning. Third, except forthe conditional intervention and counterfactual learning approach,there may exist other approaches to estimating 𝑃 ( 𝑦 | 𝑢, 𝑑𝑜 ( 𝑣 )) basedon observational data, which we will explore in the future. Finally,we only used the user interaction information in this work, while inthe future, we can consider the rich multimodal information in rec-ommender systems for causal modeling, such as the user reviews,images and item descriptions, and this may also help us to designexplainable recommendation models based on causal learning. onference’17, July 2017, Washington, DC, USA Shuyuan Xu, Yingqiang Ge, Yunqi Li, Zuohui Fu, Xu Chen, and Yongfeng Zhang ACKNOWLEDGEMENT

This work was supported in part by NSF IIS-1910154 and IIS-2007907.Any opinions, findings, conclusions or recommendations expressedin this material are those of the authors and do not necessarilyreflect those of the sponsors.

REFERENCES [1] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Control-ling popularity bias in learning-to-rank recommendation. In

Proceedings of theEleventh ACM Conference on Recommender Systems . 42–46.[2] Qingyao Ai, Vahid Azizi, Xu Chen, and Yongfeng Zhang. 2018. Learning heteroge-neous knowledge base embeddings for explainable recommendation.

Algorithms

11, 9 (2018), 137.[3] Jonathan Baxter. 2000. A model of inductive bias learning.

Journal of artificialintelligence research

12 (2000), 149–198.[4] Stephen Bonner and Flavian Vasile. 2018. Causal embeddings for recommendation.In

Proceedings of the 12th ACM Conference on Recommender Systems . 104–112.[5] Hanxiong Chen, Shaoyun Shi, Yunqi Li, and Yongfeng Zhang. 2021. NeuralCollaborative Reasoning. In

Proceedings of the 30th Web Conference (WWW) .[6] Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, andHongyuan Zha. 2018. Sequential recommendation with user memory networks.In

Proceedings of the eleventh ACM international conference on web search anddata mining . 108–116.[7] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra,Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al.2016. Wide & deep learning for recommender systems. In

Proceedings of the 1stworkshop on deep learning for recommender systems . 7–10.[8] Michael D Ekstrand, John T Riedl, and Joseph A Konstan. 2011.

Collaborativefiltering recommender systems . Now Publishers Inc.[9] Azin Ghazimatin, Oana Balalau, Rishiraj Saha Roy, and Gerhard Weikum. 2020.PRINCE: Provider-side Interpretability with Counterfactual Explanations in Rec-ommender Systems. In

Proceedings of the 13th International Conference on WebSearch and Data Mining . 196–204.[10] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-SengChua. 2017. Neural collaborative filtering. In

Proceedings of the 26th internationalconference on world wide web . 173–182.[11] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk.2016. Session-based recommendations with recurrent neural networks. In

Inter-national Conference on Learning Representations .[12] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk.2016. Session-based recommendations with recurrent neural networks.

ICLR (2016).[13] Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, andDeborah Estrin. 2017. Collaborative metric learning. In

In WWW . 193–201.[14] Yitong Ji, Aixin Sun, Jie Zhang, and Chenliang Li. 2020. A Re-visit of the PopularityBaseline in Recommender Systems.

SIGIR (2020).[15] Thorsten Joachims, Adith Swaminathan, and Maarten de Rijke. 2018. Deeplearning with logged bandit feedback. In

International Conference on LearningRepresentations .[16] Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiasedlearning-to-rank with biased feedback. In

Proceedings of the Tenth ACM Interna-tional Conference on Web Search and Data Mining . 781–789.[17] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recom-mendation. In . IEEE.[18] Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver.2010. Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering. In

Proceedings of the fourth ACM conference onRecommender systems . 79–86.[19] Joseph A Konstan, Bradley N Miller, David Maltz, Jonathan L Herlocker, Lee RGordon, and John Riedl. 1997. GroupLens: applying collaborative filtering toUsenet news.

Commun. ACM

40, 3 (1997), 77–87.[20] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization tech-niques for recommender systems.

Computer

42, 8 (2009), 30–37.[21] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfac-tual fairness. In

Advances in Neural Information Processing Systems . 4066–4076.[22] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017.Neural attentive session-based recommendation. In

Proceedings of the 2017 ACMon Conference on Information and Knowledge Management . ACM, 1419–1428.[23] Dawen Liang, Laurent Charlin, and David M Blei. 2016. Causal inference forrecommendation. In

Causation: Foundation to Application, Workshop at UAI . AUAI.[24] Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon. com recommenda-tions: Item-to-item collaborative filtering.

IEEE Internet computing

7, 1 (2003).[25] Dugang Liu, Pengxiang Cheng, Zhenhua Dong, Xiuqiang He, Weike Pan, andZhong Ming. 2020. A general knowledge distillation framework for counterfactualrecommendation via uniform data. In

Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval . 831–840.[26] Qiao Liu, Yifu Zeng, Refuoe Mokhosi, and Haibin Zhang. 2018. STAMP: short-term attention/memory priority model for session-based recommendation. In

KDD . 1831–1839.[27] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel.2015. Image-based recommendations on styles and substitutes. In

SIGIR . ACM.[28] Andriy Mnih and Russ R Salakhutdinov. 2008. Probabilistic matrix factorization.In

Advances in neural information processing systems . 1257–1264.[29] Judea Pearl. 2000. Causality: Models, reasoning and inference.

CambridgeUniversity Press (2000).[30] Judea Pearl, Madelyn Glymour, and Nicholas P Jewell. 2016.

Causal inference instatistics: A primer . John Wiley & Sons.[31] Steffen Rendle. 2010. Factorization machines. In . IEEE, 995–1000.[32] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme.2012. BPR: Bayesian personalized ranking from implicit feedback.

UAI (2012).[33] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factor-izing personalized markov chains for next-basket recommendation. In

Proceedingsof the 19th international conference on World wide web . 811–820.[34] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and JohnRiedl. 1994. GroupLens: an open architecture for collaborative filtering of netnews.In

CSCW . 175–186.[35] Yuta Saito, Suguru Yaginuma, Yuta Nishino, Hayato Sakata, and Kazuhide Nakata.2020. Unbiased Recommender Learning from Missing-Not-At-Random ImplicitFeedback. In

Proceedings of the 13th International Conference on Web Search andData Mining . 501–509.[36] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-basedcollaborative filtering recommendation algorithms. In

WWW . 285–295.[37] J Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. 2007. Collaborativefiltering recommender systems. In

The adaptive web . Springer, 291–324.[38] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, andThorsten Joachims. 2016. Recommendations as treatments: Debiasing learningand evaluation. In

ICML .[39] Shaoyun Shi, Hanxiong Chen, Weizhi Ma, Jiaxin Mao, Min Zhang, and YongfengZhang. 2020. Neural Logic Reasoning. In

Proceedings of the 29th ACM InternationalConference on Information & Knowledge Management . 1365–1374.[40] Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommenda-tion via convolutional sequence embedding. In

Proceedings of the Eleventh ACMInternational Conference on Web Search and Data Mining . 565–573.[41] Pengfei Wang, Hanxiong Chen, Yadong Zhu, Huawei Shen, and Yongfeng Zhang.2019. Unified Collaborative Filtering over Graph Embeddings. In

Proceedings ofthe 42nd International ACM SIGIR Conference on Research and Development inInformation Retrieval . 155–164.[42] Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and MarcNajork. 2018. Position bias estimation for unbiased learning to rank in personalsearch. In

WSDM . 610–618.[43] Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen.2017. Deep Matrix Factorization Models for Recommender Systems.. In

IJCAI ,Vol. 17. Melbourne, Australia, 3203–3209.[44] Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Debo-rah Estrin. 2018. Unbiased offline recommender evaluation for missing-not-at-random implicit feedback. In

Proceedings of the 12th ACM Conference on Recom-mender Systems . 279–287.[45] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton,and Jure Leskovec. 2018. Graph convolutional neural networks for web-scalerecommender systems. In

Proceedings of the 24th ACM SIGKDD InternationalConference on Knowledge Discovery & Data Mining . 974–983.[46] Bowen Yuan, Jui-Yang Hsia, Meng-Yuan Yang, Hong Zhu, Chih-Yao Chang, Zhen-hua Dong, and Chih-Jen Lin. 2019. Improving ad click prediction by consideringnon-displayed events. In

Proceedings of the 28th ACM International Conference onInformation and Knowledge Management . 329–338.[47] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016.Collaborative Knowledge Base Embedding for Recommender Systems. In

KDD .[48] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma.2016. Collaborative knowledge base embedding for recommender systems. In

Proceedings of the 22nd ACM SIGKDD international conference on knowledgediscovery and data mining . 353–362.[49] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based rec-ommender system: A survey and new perspectives.

ACM Computing Surveys(CSUR)

52, 1 (2019), 1–38.[50] Yongfeng Zhang, Qingyao Ai, Xu Chen, and W Bruce Croft. 2017. Joint repre-sentation learning for top-n recommendation with heterogeneous informationsources. In

CIKM . 1449–1458.[51] Wayne Xin Zhao, Junhua Chen, Pengfei Wang, Qi Gu, and Ji-Rong Wen. 2020.Revisiting Alternative Experimental Settings for Evaluating Top-N Item Recom-mendation Algorithms. In

CIKM . 2329–2332.[52] Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of usersand items using reviews for recommendation. In