Clicks can be Cheating: Counterfactual Recommendation for Mitigating Clickbait Issue
Wenjie Wang, Fuli Feng, Xiangnan He, Hanwang Zhang, Tat-Seng Chua
“Click” Is Not Equal to “Like”: Counterfactual Recommendation for Mitigating Clickbait Issue
Wenjie Wang, [email protected], National University of Singapore
Fuli Feng, [email protected], National University of Singapore
Xiangnan He, [email protected], University of Science and Technology of China
Hanwang Zhang, [email protected], Nanyang Technological University
Tat-Seng Chua, [email protected], National University of Singapore
ABSTRACT
Recommendation is a prevalent and critical service in information systems. To provide personalized suggestions to users, industry players embrace machine learning, more specifically, building predictive models based on click behavior data. This is known as Click-Through Rate (CTR) prediction, which has become the gold standard for building personalized recommendation services. However, we argue that there is a significant gap between clicks and user satisfaction: it is common that a user is "cheated" into clicking an item by its attractive title/cover. This severely hurts the user's trust in the system if the user finds the actual content of the clicked item disappointing. What is even worse, optimizing CTR models on such flawed data results in the Matthew Effect, making the seemingly attractive but actually low-quality items be recommended even more frequently. In this paper, we formulate the recommendation process as a causal graph that reflects the cause-effect factors in recommendation, and address the clickbait issue by performing counterfactual inference on the causal graph. We imagine a counterfactual world where each item has only exposure features (i.e., the features that the user can see before making a click decision). By estimating the click likelihood of a user in the counterfactual world, we are able to remove the effect of exposure features and eliminate the clickbait issue. Experiments on real-world datasets demonstrate that our method significantly improves the post-click satisfaction of CTR models.
KEYWORDS
Counterfactual Recommendation, Clickbait Issue, Counterfactual Inference
INTRODUCTION
Recommender systems have been increasingly used to alleviate information overload for users in a wide spectrum of information systems such as e-commerce [30], digital streaming [25], and social networks [8]. To date, the most recognized way of training a recommender model is to optimize the Click-Through Rate (CTR), which aims to maximize the likelihood that a user clicks the recommended items. Despite the wide deployment of CTR optimization in recommender systems, we argue that the user experience may be hurt unintentionally due to the clickbait issue. That is, items with attractive exposure features (e.g., title and cover image) easily attract user clicks and thus are more likely to be recommended, but their actual content does not match the exposure features and disappoints users. Such a clickbait issue is very common, especially in the present era of self-media, posing great obstacles for platforms to provide high-quality recommendations.

To illustrate, Fig. 1 shows an example where a user clicks two recommended videos after observing their exposure features only. After watching a video, i.e., examining the video content after clicking, the user gives feedback on whether they like or dislike the recommendation.

Figure 1: (a) Illustration of the inconsistency between clicks and post-click feedback. (b) Number of clicks and likes on two items, where most clicks on item 2 do not end with likes. (c) An example of a recommendation list with clickbait issue.

To address the clickbait issue, we estimate the direct effect of exposure features on the prediction score in a counterfactual world (Fig. 2(c)), which imagines what the prediction score would be if the item had only the exposure features. During inference, we remove this direct effect from the prediction in the factual world, which presents the total effect of all item features.
Recommender training.
The target of recommender training is to learn a scoring function $s_\theta$ that predicts the preference of a user over an item. Formally,

$$Y_{u,i} = s_\theta(u, i), \quad \text{where } i = (e, t), \qquad (1)$$

where $u$ and $i$ denote user features and item features, respectively. Specifically, the item features $i$ include both exposure features $e$ and content features $t$, which are observed by users before and after clicks, respectively. $\theta$ denotes the model parameters, typically learned from historical click data $\bar{D} = \{(u, i, \bar{y}_{u,i}) \mid u \in \mathcal{U}, i \in \mathcal{I}\}$, where $\bar{y}_{u,i} \in \{0, 1\}$ denotes whether user $u$ clicks item $i$ ($\bar{y}_{u,i} = 1$) or not ($\bar{y}_{u,i} = 0$). $\mathcal{U}$ and $\mathcal{I}$ refer to the user set and item set, respectively. Formally, the recommender training is:

$$\bar{\theta} = \arg\min_\theta \mathcal{L}(\bar{D} \mid \theta) = \arg\min_\theta \sum_{(u, i, \bar{y}_{u,i}) \in \bar{D}} l\big(s_\theta(u, i), \bar{y}_{u,i}\big), \qquad (2)$$

where $l(s_\theta(u, i), \bar{y}_{u,i})$ denotes the recommendation loss, such as the cross-entropy (CE) loss [5]. During inference, the trained recommender serves each user by ranking all items according to $Y_{u,i} = s_{\bar{\theta}}(u, i)$ and recommending the top-ranked ones.

Clickbait Issue.
Due to the gap between clicks and user satisfaction, the recommender will suffer from the clickbait issue: items with attractive exposure features but disappointing content features will be frequently recommended. Assume that items $i$ and $j$ have exposure features $e_i$ and $e_j$ for user $u$, where $e_i$ is more attractive but the content features of item $i$ (i.e., $t_i$) cannot satisfy the user. The clickbait issue means that item $i$ is ranked in front of item $j$, i.e., $s_{\bar{\theta}}(u, i) > s_{\bar{\theta}}(u, j)$, where $i = (e_i, t_i)$ and $j = (e_j, t_j)$. Consequently, the recommenders are biased toward items like $i$, which hurts user experience and leads to more clicks with dislikes. Worse still, it forms a vicious spiral: in turn, such clicks aggravate the issue in future recommender training. Due to the unaffordable overhead of resolving the issue in recommender training (e.g., cleaning the clicks with clickbait issue [12]), we explore the possibility of mitigating the clickbait issue when serving users, i.e., debiasing recommender inference toward $Y_{u,i} < Y_{u,j}$. Since exposure features attract clicks while the combined item features affect user satisfaction, the core idea is to reduce the direct effect of exposure features on the prediction during inference.

Evaluation.
Our target is to build recommenders for more user satisfaction rather than higher CTR, which is more in accord with practical requirements. Distinct from the conventional recommender evaluation that treats all clicks in the testing period as positive samples [7, 25], we evaluate recommendation performance only over clicks that end with positive post-click feedback, which indicates user satisfaction with both the exposure and content features. Inevitably, the testing samples can also be biased by various factors, such as position, popularity, or attractive exposure features, because the clicks are collected from the serving period of recommenders. Despite this bias, higher satisfaction indicates that the recommender reduces the recommendation of items with clickbait issue, because such items highly dissatisfy users. Besides, we restrict the compared methods to be applied on the same recommender model, so that a better testing performance can reflect the effectiveness of mitigating the clickbait issue.

To reduce the clickbait issue, it is indispensable to access the causes of the issue and eliminate their effects. For items with clickbait issues, recommenders trained with CTR will heavily rely on the exposure features for prediction, because attractive exposure features are the causal reason for users' clicks. As such, the key to mitigating the clickbait issue is to reduce the causal effect of exposure features. According to causality theory [16, 17], the causal effect can be estimated through counterfactual inference (CI), which is a logic to answer retrospective ("What if?") questions. For instance, CI can answer the question "What would the prediction be if a cause had not happened (e.g., if a feature were removed)?" and assess the causal effect through the change of the outcome. More preliminary details and formal formulations are in SI Appendix Section A.
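To make the "What if?" logic concrete, here is a minimal numeric sketch; the model and all values are hypothetical, not from this paper. The causal effect of a feature is estimated as the change in the outcome when that feature is replaced by a reference value:

```python
# A hypothetical linear prediction model (all coefficients and feature
# values are made up for illustration): the score depends on an
# exposure feature and a content feature.
def predict(exposure, content):
    return 0.8 * exposure + 0.3 * content

factual = predict(exposure=2.0, content=0.5)

# "What would the prediction be if the exposure feature had not
# happened?" - replace it with a reference value of 0.
counterfactual = predict(exposure=0.0, content=0.5)

# The change of the outcome estimates the causal effect of exposure.
effect = factual - counterfactual
print(effect)
```

Here the attractive exposure accounts for most of the score; removing it reveals how little the content contributes.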
Inspired by the success of CI in mitigating bias for various machine learning applications such as computer vision [13, 23] and information retrieval [9], we propose counterfactual recommendation (CR) to mitigate the clickbait issue in recommender inference, which removes the direct effect of exposure features on the prediction $Y_{u,i}$. Towards this end, CR conducts counterfactual thinking to capture the direct effect of exposure features on the prediction, and excludes this direct effect to mitigate the clickbait issue. Note that although post-click feedback is sparse and insufficient for recommender training, we can still collect enough post-click feedback to cover a large group of users in the evaluation.

Capturing Direct Effect.
From the causation perspective (Fig. 2(a)), conventional recommenders ($s_\theta(\cdot)$) model the causal relationships from the user features $U$ and item features $I$ to the prediction $Y$, where $I$ is derived from two parts: exposure features $E$ and content features $T$. Formally, the inference of $Y$ can be written as:

$$Y_{u,i} = Y(U = u, I = i), \quad \text{where } i = I_{e,t} = I(E = e, T = t), \qquad (3)$$

where $Y(\cdot)$ and $I(\cdot)$ represent the scoring function (e.g., inner product) and the feature aggregation function (e.g., a multi-layer perceptron) [5], respectively. However, this inference suffers from the clickbait issue, i.e., recommending an item purely based on its attractive exposure features. Inherently, this is due to training the recommender over "rickrolled" clicks, where the user likes the exposure features but dislikes the content features. The existence of such clicks misleads the recommender to rely more on exposure features so as to achieve a small loss on these clicks.

To assess the direct effect of exposure features on the prediction, we add a direct edge from the exposure features $E$ to the prediction $Y$, yielding the causal graph in Fig. 2(b). While adding an edge is one small step for model improvement, it is one giant leap for recommendation, which completes the modeling of causal relationships from exposure features to prediction. In the new causal graph, the value of the exposure features $e$ affects the prediction $Y$ through two paths: a direct one ($E \to Y$) and an indirect one ($E \to I \to Y$). Accordingly, the inference of $Y$ is formulated as:

$$Y_{u,i,e} = Y(U = u, I = i, E = e), \quad \text{where } i = I(E = e, T = t), \qquad (4)$$

where $Y_{u,i,e}$ is the prediction in the factual world, and $Y(U = u, I = i, E = e)$ is a new scoring function with the exposure features $e$ as an additional input. It can be easily developed from the conventional scoring function $Y(U = u, I = i)$, which is detailed in the section on implementation.
It should be noted that $Y_{u,i,e}$ still suffers from the clickbait issue, since the new causal graph cannot avoid training over "rickrolled" clicks. To mitigate the clickbait issue, the key is to quantify the effect over the direct path ($E \to Y$) and assess the prediction without this direct effect.

Reducing Direct Effect.
According to causality theory [18], the causal effect of $X$ on $Y$ is the magnitude by which $Y$ is changed by a unit change in $X$. As detailed in SI Appendix Section A, the total effect (TE) of $E$ and $T$ on $Y$ under $U = u$ is formulated as:

$$\text{TE} = Y_{i,e}(u) - Y_{i^*,e^*}(u), \quad \text{where } i^* = I_{e^*,t^*} = I(E = e^*, T = t^*),$$

where $U = u$ remains unchanged, $Y_{i,e}(u) = Y_{u,i,e}$, and $Y_{i^*,e^*}(u) = Y(U = u, I = i^*, E = e^*)$ is the reference situation with $E$ and $T$ set to $e^*$ and $t^*$, respectively. In this task, the reference values are treated as the status where the features are not given, i.e., recommending items with uniform probabilities. We then conduct counterfactual inference to reduce the direct effect of exposure features and mitigate the clickbait issue in $Y_{u,i,e}$. To this end, we estimate the natural direct effect (NDE) of $E$ and subtract it from TE.

Figure 2: The causal graphs for conventional recommendation and counterfactual recommendation: (a) the conventional causal graph; (b) the proposed causal graph; (c) the counterfactual world; (d) the reference situation. $U$: user features; $I$: item features; $E$: exposure features; $T$: content features; $Y$: prediction score. $^*$ denotes the reference values.

To estimate the NDE, CR imagines, in the counterfactual world, what the prediction score would be if items had only exposure features, which measures whether users are purely attracted by the exposure features. As shown in Fig. 2(c), the effect of $E$ and $T$ on $I$ is blocked by setting $I$ to the reference situation $i^*$. Formally,

$$\text{NDE} = Y_{i^*,e}(u) - Y_{i^*,e^*}(u), \quad \text{where } Y_{i^*,e}(u) = Y(U = u, I = i^*, E = e).$$

According to the causality theory that TE can be decomposed into NDE and the total indirect effect (TIE) (see SI Appendix Section A), the formulation of TIE is:

$$\text{TIE} = \text{TE} - \text{NDE} = Y_{i,e}(u) - Y_{i^*,e}(u). \qquad (5)$$

Since the exposure features attract users' clicks while the combined item features (i.e., $I$) decide user satisfaction, we achieve debiased inference via TIE. Ranking items according to TIE removes the direct effect of exposure features that leads to the clickbait issue. Intuitively, an item with attractive exposure features will have a high prediction score in the counterfactual world, so its ranking will be largely decreased when inferring via TIE. Consequently, compared to an item with only attractive exposure features, an item with less attractive exposure features but satisfying content features has a higher chance of being recommended.

Implementation.
To enable this inference, we slightly adjust a recommender model to satisfy the proposed causal graph in Fig. 2(b), i.e., extending the scoring function from $Y(U = u, I = i)$ to $Y(U = u, I = i, E = e)$. A straightforward idea is to embed the additional input $e$ into the formulation of $Y(U = u, I = i)$. However, this solution loses generality, as it requires careful adjustment for different recommender models. According to the universal approximation theorem [3], we could also implement $Y(\cdot)$ as a multi-layer perceptron (MLP) with $u$, $e$, and $i$ as inputs. Nevertheless, it is hard to tune an MLP to achieve performance comparable to models carefully designed for the recommendation task [8]. To keep generality and leverage the advantages of existing models, the scoring function is implemented in a late-fusion manner [13]: $Y(U = u, I = i, E = e) = f(Y_{u,i}, Y_{u,e})$, where $Y_{u,i} = Y(U = u, I = i)$ and $Y_{u,e} = Y(U = u, E = e)$ are the predictions from two conventional models with different inputs, and $f(\cdot)$ is a fusion function. $Y_{u,i}$ and $Y_{u,e}$ can be instantiated by any recommenders with user and item features as inputs, such as MMGCN [25] and VBPR [7]. In this way, if a recommender already exists in a real-world scenario, the only overhead for CR is implementing a fusion strategy. Note that we can also directly estimate the natural indirect effect (NIE) for inference, which is detailed in SI Appendix Section B.

• Fusion strategy.
Inspired by [1, 13], we adopt one representative fusion strategy, Multiplication (MUL), which is formulated as:

$$Y_{u,i,e} = Y(U = u, I = i, E = e) = f(Y_{u,i}, Y_{u,e}) = Y_{u,i} * \sigma(Y_{u,e}), \qquad (6)$$

where $\sigma$ represents the non-linear sigmoid function. For sufficient representation capacity of the fusion strategy, a non-linear function (e.g., $\sigma$) is necessary (see results in Table 3). Note that CR is general to any differentiable arithmetic binary operation, and we compare more strategies in Table 3.

• Recommender training. Similar to conventional recommender training (Equation 2), the recommender is learned by minimizing the following recommendation loss:

$$\mathcal{L} = \sum_{(u, i, \bar{y}_{u,i}) \in \bar{D}} l(Y_{u,i,e}, \bar{y}_{u,i}) + \alpha * l(Y_{u,e}, \bar{y}_{u,i}), \qquad (7)$$

where $\alpha$ is a hyper-parameter tuning the relative weight of the loss terms. Note that we also optimize $Y_{u,e}$, which can be seen as the prediction based on the user features and exposure features, to facilitate the prediction in the counterfactual world.

• Inference via TIE. According to Equation 5, inference via TIE requires the predictions $Y_{i,e}(u) = f(Y_{u,i}, Y_{u,e})$ and $Y_{i^*,e}(u) = f(c_{u,i}, Y_{u,e})$, where $c_{u,i}$ refers to the expectation constant of $Y_{u,i}$:

$$c_{u,i} = \mathbb{E}(Y_{u,I}) = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} Y_{u,i}, \qquad (8)$$

which indicates that, for each user, all items share the same score $\mathbb{E}(Y_{u,I})$. Since the indirect effect in $Y_{i^*,e}(u)$ is blocked and the features of $I$ are not given, the model recommends items with the same probability for each user. In this way, TIE with the MUL strategy is calculated by (detailed calculation in SI Appendix Section B):

$$\text{TIE} = f(Y_{u,i}, Y_{u,e}) - f(c_{u,i}, Y_{u,e}) = (Y_{u,i} - c_{u,i}) * \sigma(Y_{u,e}).$$

Potential Solutions.
To mitigate the clickbait issue, we capture the direct effect of exposure features and then reduce it. Inspired by this, an intuitive idea is to directly discard the exposure features during training. Besides, incorporating sparse post-click feedback into training is also a promising direction. We apply both as baselines, which are detailed in Section 4. As to CR, we can also directly estimate the indirect effect of the exposure and content features for inference, i.e., NIE ($I \to Y$), which is further introduced in SI Appendix Section B.

Datasets.
We evaluate top-$K$ recommendation performance on two public datasets: Tiktok [25] and Adressa [6], which cover micro-video and news recommendation, respectively. As described in Section 2, clicks are utilized for recommender training, while only clicks with positive post-click feedback such as thumbs-up, favorite, and finishing are used for testing. We follow prior work [25] to split the datasets and extract user and item features (see details in SI Appendix Section C). Taking the Tiktok dataset as an example, we explore the inconsistency between user clicks and user satisfaction. Among the clicks on an item with post-click feedback, we count the number of positive feedback (i.e., likes) and calculate the ratio of likes to clicks. Fig. 3 outlines the distribution of the like/click ratio, where items are ranked and divided into 101 groups according to the ratio value. As can be seen, over 60% of items have a like/click ratio smaller than 0.5, showing the wide existence of "rickrolled" clicks. Moreover, recommending such items may lead to more clicks that fail to satisfy users and hurt user experience.

Figure 3: Click and like distributions of items in Tiktok. The grey line visualizes the cumulative proportion of item groups as the like/click ratio increases.
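The like/click-ratio analysis behind Fig. 3 can be sketched as follows (a toy click log, not the Tiktok data):

```python
from collections import Counter

# Toy click log of (item_id, liked) pairs (made-up data), mimicking
# the like/click-ratio analysis of Fig. 3.
logs = [(1, True), (1, True), (1, False),
        (2, True), (2, False), (2, False), (2, False)]

clicks, likes = Counter(), Counter()
for item, liked in logs:
    clicks[item] += 1
    likes[item] += int(liked)

# Like/click ratio per item; a low ratio flags "rickrolled" clicks.
ratio = {item: likes[item] / clicks[item] for item in clicks}

# Fraction of items whose ratio falls below 0.5.
frac_low = sum(r < 0.5 for r in ratio.values()) / len(ratio)
print(ratio, frac_low)
```

On the real data, over 60% of items fall below the 0.5 threshold computed this way.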
Evaluation Metrics.
We follow the all-ranking evaluation protocol that, for each user, ranks all items except those clicked in training, and report recommendation performance via Precision@K (P@K), Recall@K (R@K), and NDCG@K (N@K) with $K \in \{10, 20\}$, where higher values indicate better performance [25].

Compared Methods.
We incorporate recommendation methods with the potential to mitigate the clickbait issue as baselines. For a fair comparison, all methods are applied to the MMGCN model [25], which is a state-of-the-art multi-modal recommender able to thoroughly exploit the exposure and content features. Specifically:

1) NT trains MMGCN [25] via normal training (NT), i.e., Equation 2.
2) CFT only uses content features for training (CFT) to avoid the clickbait issue.
3) IPW leverages the debiasing recommender training method IPW [11], where the propensity score is estimated through item popularity. The intuition is that items attract more clicks due to attractive exposure features, so resolving the popularity bias should mitigate the clickbait issue to some extent.

Moreover, we compare three baselines that incorporate extra post-click feedback into recommender training:

4) CT is trained under a clean training (CT) setting where only post-click feedback is utilized to avoid the clickbait issue.
5) NR. Wen et al. [26] divide clicks into two categories by post-click feedback: clicks with and without likes. Negative feedback Re-weighting (NR) re-weights the non-clicked items and the clicked ones that end with dislikes.
6) RR post-processes the ranking of NT by re-ranking (RR) the top 20 items according to the like/click ratio.

Lastly, we evaluate one implementation of CR: CR-MUL-TIE, which adopts the MUL fusion strategy and uses TIE for inference.

Table 1: Top-K recommendation performance of the compared methods on Tiktok and Adressa. %Improve. denotes the relative performance improvement w.r.t. NT. The best results are highlighted in bold.

Dataset     | Tiktok                                        | Adressa
Metric      | P@10   R@10   N@10   P@20   R@20   N@20       | P@10   R@10   N@10   P@20   R@20   N@20
NT          | –      –      –      –      –      –          | –      –      –      –      –      –
CFT         | –      –      –      –      –      –          | –      –      –      –      –      –
IPW         | –      –      –      –      –      –          | –      –      –      –      –      –
CT          | –      –      –      –      –      –          | –      –      –      –      –      –
NR          | –      –      –      –      –      –          | –      –      –      –      –      –
RR          | –      –      –      –      –      –          | –      –      –      –      –      –
CR-MUL-TIE  | 0.0269 0.0393 0.0370 0.0242 0.0683 0.0476     | 0.0532 0.1045 0.0878 0.0439 0.1712 0.1133
%Improve.   | 5.08%  10.08% 11.11% 4.76%  7.56%  10.70%     | 6.19%  7.18%  7.47%  5.78%  6.20%  6.99%
We detail the settings of baselines in SI appendix Section D tofacilitate reproduction.
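For reference, the ranking metrics reported in the tables (P@K, R@K, N@K) can be computed per user as below; this is a standard formulation, and the paper's exact implementation may differ:

```python
import numpy as np

def metrics_at_k(ranked, relevant, k):
    """Precision@K, Recall@K, and NDCG@K for a single user.

    ranked: item ids ordered by predicted score (items clicked in
    training already excluded); relevant: set of test items whose
    clicks ended with positive post-click feedback."""
    hits = [1.0 if item in relevant else 0.0 for item in ranked[:k]]
    precision = sum(hits) / k
    recall = sum(hits) / max(len(relevant), 1)
    # DCG with binary relevance; IDCG places all relevant items on top.
    dcg = sum(h / np.log2(pos + 2) for pos, h in enumerate(hits))
    idcg = sum(1.0 / np.log2(pos + 2) for pos in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

print(metrics_at_k([1, 2, 3, 4], {1, 3}, k=2))
```

Per-user values are then averaged over all test users to obtain the reported numbers.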
The overall performance comparison is summarized in Table 1, from which we have the following observations:

• Debiasing Training. In most cases, CFT performs worse than NT, which is attributed to discarding the exposure features. This result overrules the option of simply discarding exposure features to mitigate the clickbait issue, since they are indispensable for user preference prediction. Moreover, the performance of IPW is inferior on both Tiktok and Adressa, showing that the clickbait issue may not be resolved by simply discouraging the recommendation of items with more clicks. In addition, the result indicates the importance of a proper assumption when mitigating a bias, which is the barrier to using IPW for handling bias caused by features with complex and changeable patterns.

• Post-click Feedback. RR, which re-ranks the recommendations of NT according to the like/click ratio, outperforms NT. This validates the effectiveness of leveraging post-click feedback to mitigate the clickbait issue and satisfy user requirements. However, CT and NR, which incorporate post-click feedback into recommender training, perform worse than NT on Tiktok; e.g., the NDCG@10 of CT decreases by 11.71% on Tiktok. We ascribe the inferior performance to the sparsity of post-click feedback, which hurts model generalization when training focuses on this minority of feedback. Moreover, we postulate the reason to be the inaccurate causal graph (Fig. 2(a)), which lacks the direct edge from the exposure features to the prediction; this is further discussed in Table 2.

• Debiasing Inference. In all cases, CR-MUL-TIE achieves significant performance gains over all baselines. In particular, CR-MUL-TIE outperforms NT w.r.t. N@10 by 11.11% and 7.47% on Tiktok and Adressa, respectively. The result validates the effectiveness of the proposed CR, which is attributed to the new causal graph and the counterfactual inference. Surprisingly, CR-MUL-TIE also outperforms RR, which additionally considers post-click feedback, further signifying the rationality of CR, i.e., eliminating the direct effect of exposure features on the prediction to mitigate the clickbait issue. As such, CR significantly helps to recommend more satisfying items, which can improve user engagement and produce greater economic benefits.
Table 2: Performance comparison between inference via TE and TIE.

Dataset     | Tiktok                  | Adressa
Metric      | P@20   R@20   N@20      | P@20   R@20   N@20
NT          | –      –      –         | –      –      –
CR-MUL-TE   | –      –      –         | –      –      –
CR-MUL-TIE  | 0.0242 0.0683 0.0476    | 0.0439 0.1712 0.1133

Figure 4: Visualization of the averaged recommendation frequency of items for NT and CR-MUL-TIE, where items are grouped by like/click ratio ([0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1]).
Effect of the Proposed Causal Graph.
To shed light on the performance gain, we further study one variant, CR-MUL-TE, which performs inference via TE; i.e., its only difference from NT is being trained over the proposed causal graph. Table 2 shows their performance cut at K = 20 (see SI Table S3 for results with K = 10). The comparison shows that removing the direct effect of exposure features, i.e., debiasing inference, indeed leads to better recommendation with more satisfaction.

We then take CR-MUL-TIE on Adressa as an example to further investigate the effectiveness of CR. Besides, we also conduct experiments on synthetic data to prove that CR can effectively reduce the effect of exposure features, as detailed in SI Appendix Section F.
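To see why inference via TE and TIE can rank so differently, consider a toy sketch under the MUL fusion; all scores are made up, and taking the mean as the exposure-branch reference is an assumption here. The TE reference term $Y_{i^*,e^*}(u)$ is the same for every item, so ranking by TE reproduces the factual ranking, while TIE removes the exposure-driven part:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy per-item predictions for one user (values made up). Item 0 is a
# clickbait item: very attractive exposure (high Y_ue) but a mediocre
# overall match; item 1 has plain exposure and a better match.
Y_ui = np.array([1.2, 1.5])        # Y(U=u, I=i)
Y_ue = np.array([4.0, 0.0])        # Y(U=u, E=e)
c_ui = Y_ui.mean()                 # reference constant c_{u,i}, Eq. (8)
c_ue = Y_ue.mean()                 # exposure-branch reference (assumed)

def fuse(a, b):                    # MUL fusion, Eq. (6)
    return a * sigmoid(b)

Y_fact = fuse(Y_ui, Y_ue)          # factual prediction Y_{u,i,e}
TE = Y_fact - fuse(c_ui, c_ue)     # reference term is item-independent
TIE = Y_fact - fuse(c_ui, Y_ue)    # Eq. (5)

print(np.argmax(TE), np.argmax(TIE))  # TE keeps the clickbait item first
```

Since the TE subtrahend does not depend on the item, argsort(TE) equals argsort of the factual scores, which matches the observation that CR-MUL-TE behaves like NT trained on the new graph.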
Visualization of Recommendation w.r.t. like/click ratio.
Recall that recommenders with the clickbait issue tend to recommend items even though their like/click ratios are low. We thus compare the recommendations of CR-MUL-TIE and NT to explore whether CR can reduce the recommendation of items with a high risk of hurting user experience. Specifically, we collect the top three items recommended to each user and count the frequency of each item. Fig. 4 outlines the recommendation frequencies of CR-MUL-TIE and NT, where items are split into five groups according to their like/click ratio for better visualization. From the figure, we can see that, compared to NT, 1) CR-MUL-TIE recommends fewer items with like/click ratios ≤ 0.6; and 2) more items with high like/click ratios, especially in [0.8, 1]. The result indicates the higher potential of CR to satisfy user preference, which is attributed to the proper modeling of the effects of exposure features.

Figure 5: Performance comparison across subsets of Adressa with different item discarding proportions. A larger proportion indicates a higher percentage of "rickrolled" clicks in the dataset.

Effect of Dataset Cleanness.
We then study how the effectiveness of CR is influenced by the "cleanness" of the click data. Specifically, we compare CR-MUL-TIE and NT over filtered datasets with different percentages of "rickrolled" clicks. We rank the items in descending order by like/click ratio and discard the top-ranked items at a certain proportion, where a larger proportion yields a dataset with a higher percentage of "rickrolled" clicks. Fig. 5 shows the performance with the discarding proportion ranging from 0 (the original dataset) to 0.8. From Fig. 5, we have the following findings: 1) CR-MUL-TIE outperforms NT in all cases, with performance gains larger than 4.14%, which further validates the effectiveness of CR. 2) The performance gains are close when the discarding proportion is ≤ 0.4, and increase dramatically under larger proportions. The result indicates that mitigating the clickbait issue is more important for recommendation scenarios with more "rickrolled" clicks.
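The dataset-filtering procedure above can be sketched as follows (toy ratios and click log; the real protocol operates on Adressa):

```python
# Toy like/click ratios per item and a toy click log (made-up data).
ratio = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1}
clicks = [("u1", "a"), ("u1", "c"), ("u2", "b"), ("u2", "d"), ("u3", "a")]

def filter_clicks(clicks, ratio, proportion):
    """Discard clicks on the top `proportion` of items ranked by
    descending like/click ratio, leaving a noisier dataset."""
    ranked = sorted(ratio, key=ratio.get, reverse=True)
    dropped = set(ranked[: int(len(ranked) * proportion)])
    return [(u, i) for u, i in clicks if i not in dropped]

print(filter_clicks(clicks, ratio, 0.5))  # drops items "a" and "b"
```

Discarding the cleanest items first is what raises the share of "rickrolled" clicks as the proportion grows.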
Effect of Fusion Strategy.
Recall that any differentiable arithmetic binary operation can serve as the fusion strategy in CR [13]. To shed light on the development of a proper fusion strategy, we investigate its essential properties, such as linearity and boundary. In addition to the MUL strategy, we further evaluate a vanilla SUM strategy with linear fusion, SUM with the sigmoid function, and SUM/MUL with $\tanh(\cdot)$ as the activation function. Formally,

$$\begin{aligned}
\text{SUM-linear:} \quad & Y_{u,i,e} = f(Y_{u,i}, Y_{u,e}) = Y_{u,i} + Y_{u,e}, \\
\text{SUM-sigmoid:} \quad & Y_{u,i,e} = f(Y_{u,i}, Y_{u,e}) = Y_{u,i} + \sigma(Y_{u,e}), \\
\text{SUM-tanh:} \quad & Y_{u,i,e} = f(Y_{u,i}, Y_{u,e}) = Y_{u,i} + \tanh(Y_{u,e}), \\
\text{MUL-tanh:} \quad & Y_{u,i,e} = f(Y_{u,i}, Y_{u,e}) = Y_{u,i} * \tanh(Y_{u,e}).
\end{aligned} \qquad (9)$$

Similar to CR-MUL-TIE, we also infer via TIE for SUM-linear, SUM-sigmoid, SUM-tanh, and MUL-tanh; detailed calculations can be found in SI Appendix Section E. The performance of the different fusion strategies is reported in Table 3.

Table 3: Performance of CR with different fusion strategies.

Metric       | P@10   R@10   N@10   P@20   R@20   N@20
SUM-linear   | –      –      –      –      –      –
SUM-sigmoid  | –      –      –      –      –      –
SUM-tanh     | 0.0537 0.1060 0.0889 0.0447 0.1744 0.1150
MUL-tanh     | –      –      –      –      –      –
MUL-sigmoid  | –      –      –      –      –      –

From Table 3, we find that: 1) non-linear fusion strategies are significantly better than linear ones, due to the better representation capacity of non-linear functions; and 2) SUM-tanh achieves the best performance over the other fusion strategies, including the proposed MUL-sigmoid strategy, which shows that a fusion function with a proper boundary can further improve the performance of CR, and that multiple fusion strategies are worth studying when CR is applied to other datasets.
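One algebraic property worth noting (a direct consequence of Eq. (5), not an empirical result from the paper): under any SUM strategy, $\text{TIE} = f(Y_{u,i}, Y_{u,e}) - f(c_{u,i}, Y_{u,e})$ collapses to $Y_{u,i} - c_{u,i}$, so the exposure branch affects the ranking only through training, whereas the MUL strategies also rescale the debiased score at inference. A toy sketch (made-up scores):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# TIE = f(Y_ui, Y_ue) - f(c_ui, Y_ue) under each fusion strategy, Eq. (9).
def tie(fuse, Y_ui, Y_ue, c_ui):
    return fuse(Y_ui, Y_ue) - fuse(c_ui, Y_ue)

strategies = {
    "SUM-linear":  lambda a, b: a + b,
    "SUM-sigmoid": lambda a, b: a + sigmoid(b),
    "SUM-tanh":    lambda a, b: a + np.tanh(b),
    "MUL-sigmoid": lambda a, b: a * sigmoid(b),
    "MUL-tanh":    lambda a, b: a * np.tanh(b),
}

Y_ui, Y_ue, c_ui = 1.5, 4.0, 0.9   # toy scores (made up)
for name, fuse in strategies.items():
    print(f"{name}: TIE = {tie(fuse, Y_ui, Y_ue, c_ui):.4f}")
```

All three SUM variants print the same TIE (the exposure term cancels), while the two MUL variants scale $Y_{u,i} - c_{u,i}$ by the activated exposure score.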
The clickbait issue widely exists in industrial recommender systems. To eliminate its effect, we propose a new recommendation framework, CR, that accounts for the causal relationships among the exposure features, content features, and prediction. By performing counterfactual inference, we estimate the direct effect of exposure features on the prediction and remove it in recommendation scoring. While we instantiate CR on a specific recommender model, MMGCN, it is model-agnostic and only requires minor adjustment (several lines of code) to adapt it to other models, enabling the wide usage of CR across recommendation scenarios and models. By mitigating the clickbait issue, CR can improve user satisfaction and engagement, further producing greater economic profits. Moreover, the idea of CR can be extended to debias click data in other information retrieval tasks such as search [9], advertising [29], and question answering [27].

More broadly, this work signifies the importance of a causal graph that accurately describes the causal relationships between features and prediction, opening the door to empowering recommender systems with causal reasoning ability. In addition, this work justifies the effectiveness of counterfactual inference in debiasing click data, and motivates further exploration of other categories of bias, such as selection bias [14] and position bias [9], in information retrieval systems.

REFERENCES
[1] Remi Cadene, Corentin Dancette, Hedi Ben-younes, Matthieu Cord, Devi Parikh, et al. 2019. RUBi: Reducing unimodal biases for visual question answering. In
Advances in Neural Information Processing Systems. 841–852.
[2] Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention. In Proceedings of the 40th International SIGIR Conference on Research and Development in Information Retrieval. ACM, 335–344.
[3] Balázs Csanád Csáji. 2001. Approximation with artificial neural networks. Faculty of Sciences, Eötvös Loránd University, Hungary 24, 48 (2001), 7.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
[5] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[6] Jon Atle Gulla, Lemei Zhang, Peng Liu, Özlem Özgöbek, and Xiaomeng Su. 2017. The Adressa Dataset for News Recommendation. In Proceedings of the International Conference on Web Intelligence. ACM, 1042–1048.
[7] Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian personalized ranking from implicit feedback. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. AAAI Press.
[8] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web. ACM, 173–182.
[9] Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In Proceedings of the 10th International Conference on Web Search and Data Mining. ACM, 781–789.
[10] Youngho Kim, Ahmed Hassan, Ryen W. White, and Imed Zitouni. 2014. Modeling dwell time to predict click-level satisfaction. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining. ACM, 193–202.
[11] Dawen Liang, Laurent Charlin, and David M. Blei. 2016. Causal inference for recommendation. In Causation: Foundation to Application, Workshop at Uncertainty in Artificial Intelligence. AUAI.
[12] Hongyu Lu, Min Zhang, and Shaoping Ma. 2018. Between Clicks and Satisfaction: Study on Multi-Phase User Preferences and Satisfaction for Online News Reading. In Proceedings of the 41st International SIGIR Conference on Research and Development in Information Retrieval. ACM, 435–444.
[13] Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. 2020. Counterfactual VQA: A Cause-Effect Look at Language Bias. arXiv:2006.04315.
[14] Zohreh Ovaisi, Ragib Ahsan, Yifan Zhang, Kathryn Vasilaky, and Elena Zheleva. 2020. Correcting for Selection Bias in Learning-to-Rank Systems. In Proceedings of The Web Conference. ACM, 1863–1873.
[15] Judea Pearl. 2001. Direct and indirect effects. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 411–420.
[16] Judea Pearl. 2009. Causality. Cambridge University Press.
[17] Judea Pearl. 2019. The seven tools of causal inference, with reflections on machine learning. Commun. ACM 62, 3 (2019), 54–60.
[18] Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect (1st ed.). Basic Books, Inc.
[19] Jiaxin Qi, Yulei Niu, Jianqiang Huang, and Hanwang Zhang. 2019. Two Causal Principles for Improving Visual Dialog. arXiv:1911.10496.
[20] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In
Proceedingsof the 25th Conference on uncertainty in artificial intelligence . AUAI Press, 452–461.[21] James M Robins. 2003. Semantics of causal dag models and the identiïňĄcationof direct and indirect effects. In
Oxford Statistical Science Series . 70–82.[22] PAUL R. ROSENBAUM and DONALD B. RUBIN. 1983. The central role of thepropensity score in observational studies for causal effects.
Biometrika
70, 1 (041983), 41–55.[23] Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. 2020.Unbiased scene graph generation from biased training. In arXiv:2002.11949 .[24] Tan Wang, Jianqiang Huang, Hanwang Zhang, and Qianru Sun. 2020. Visualcommonsense r-cnn. In
Proceedings of the IEEE/CVF Conference on ComputerVision and Pattern Recognition . 10760–10770.[25] Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, andTat-Seng Chua. 2019. MMGCN: Multi-modal Graph Convolution Networkfor Personalized Recommendation of Micro-video. In
Proceedings of the 27thInternational Conference on Multimedia . ACM, 1437–1445.[26] Hongyi Wen, Longqi Yang, and Deborah Estrin. 2019. Leveraging Post-clickFeedback for Content Recommendations. In
Proceedings of the 13th Conferenceon Recommender Systems . ACM, 278–286.[27] Rui Yan, Yiping Song, and Hua Wu. 2016. Learning to Respond with DeepNeural Networks for Retrieval-Based Human-Computer Conversation System.In
Proceedings of the 39th International Conference on Research and Developmentin Information Retrieval . ACM, 55âĂŞ64. [28] Xu Yang, Hanwang Zhang, and Jianfei Cai. 2020. Deconfounded ImageCaptioning: A Causal Retrospect. In arXiv:2003.03923 .[29] Soe-Tsyr Yuan and You Wen Tsao. 2003. A recommendation mechanism forcontextualized mobile advertising.
Expert systems with applications
24, 4 (2003),399–414.[30] Wenhao Zhang, Wentian Bao, Xiao-Yang Liu, Keping Yang, Quan Lin, HongWen, and Ramin Ramezani. 2020. Large-Scale Causal Approaches to DebiasingPost-Click Conversion Rate Estimation with Multi-Task Learning. In
Proceedingsof The Web Conference . ACM, 2775âĂŞ2781.
We introduce the concepts of counterfactual inference [15, 16, 21] used in this paper, and we encourage readers to consult the related work [13, 19, 23, 24, 28] in various applications for a more comprehensive understanding.
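Before the formal definitions, the following self-contained sketch previews the running income example of Fig. 6 and computes the effect quantities (TE, NDE, TIE, NIE, TDE) defined in the remainder of this section. All functional forms and constants are hypothetical, chosen purely for illustration.

```python
# Hypothetical structural equations for the causal graph in Fig. 6:
# income I depends on education E, age A, and skill S, while skill S
# itself depends on education E (S is a mediator between E and I).

def S(e):
    """Structural equation of skill: s = S(E=e)."""
    return 2.0 * e

def I(e, s, a):
    """Structural equation of income: I_{e,s,a} = I(E=e, S=s, A=a)."""
    return 10.0 * e + 5.0 * s + 0.5 * a

e, e_star, a = 1.0, 0.0, 24.0   # factual education e; reference e* ("no qualifications")
s, s_star = S(e), S(e_star)     # s* = S(do(E=e*)) = S(E=e*)

TE  = I(e, s, a)      - I(e_star, s_star, a)  # total effect of E on I
NDE = I(e, s_star, a) - I(e_star, s_star, a)  # natural direct effect (mediator held at s*)
TIE = TE - NDE                                # total indirect effect
NIE = I(e_star, s, a) - I(e_star, s_star, a)  # natural indirect effect (direct path blocked)
TDE = TE - NIE                                # total direct effect
```

In this additive toy model NDE equals TDE and NIE equals TIE; with non-additive structural equations the "natural" and "total" variants generally differ, which is exactly the distinction exploited below.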
Causal Graph.
A causal graph describes the causal relations between variables by a directed acyclic graph $\mathcal{G} = \{\mathcal{N}, \mathcal{E}\}$, where $\mathcal{N}$ is the set of variables (i.e., the nodes in the graph) and $\mathcal{E}$ records the causal relations (i.e., the edges in the graph). In the causal graph, capital letters (e.g., $X$) denote random variables and lowercase letters (e.g., $x$) denote their observed values. Structural equations [16] have been used to quantify the causal relations and calculate the values of variables. One example of a causal graph is shown in Fig. 6(a): the individual income ($I$) is affected by the education ($E$), age ($A$), and skill ($S$); and skill is also influenced by education, so $S$ is called a mediator between $E$ and $I$. Fig. 6(b) visualizes an instantiation of the causal graph, where each variable has an observed value. The structural equation used to calculate the income $I$ is formalized as:
$$I_{e,s,a} = I(E=e, S=s, A=a), \quad \text{where } s = S(E=e), \tag{10}$$
where $I_{e,s,a}$ denotes the income of one person who satisfies $E=e$, $S=s$, and $A=a$.
Intervention.
The causal effect of $E$ on $I$ quantifies how much the response variable $I$ changes when the controlled variable $E$ changes. To estimate the causal effect, we can conduct an external intervention, formulated via the do-operator. The do-operator (e.g., $do(E=e^*)$) cuts off all incoming edges to the controlled variable and sets the variable to a certain value. Since the causal effect is defined as the change of a response variable, its calculation relies on a reference status of the response variable (e.g., $I_{e^*,s^*}(a)$ in Fig. 6(c)). By causal intervention, we can capture the total effect (TE) of $E$ on $I$ under the situation $A=a$ by:
$$\begin{aligned}
\text{TE} &= I_{e,s}(a) - I_{e^*,s^*}(a),\\
I_{e^*,s^*}(a) &= I(do(E=e^*) \mid A=a) = I(E=e^*, S=s^*, A=a),\\
s^* &= S(do(E=e^*)) = S(E=e^*),\\
I_{e,s}(a) &= I_{e,s,a} = I(E=e, S=s, A=a),
\end{aligned} \tag{11}$$
where TE measures the change of $I$ if $E$ is changed from $e^*$ to $e$ while keeping the other irrelevant variables fixed (i.e., $A=a$). Specifically, $I_{e^*,s^*}(a)$ denotes the reference situation where $E$ is set to the reference value $e^*$ by intervention; for example, $e^*$ can be taken as no qualifications, and $s^*$ denotes the value of $S$ when $E=e^*$. Illustrations of $I_{e,s}(a)$ and $I_{e^*,s^*}(a)$ are shown in Fig. 6(b) and Fig. 6(c), respectively.
Counterfactuals.
Counterfactual thinking imagines what the outcomes would be if events had been contrary to the facts [18], which can be used to estimate the direct/indirect effects of variables.

Figure 6: (a) An example of a causal graph: the individual income ($I$) is affected by the education ($E$), age ($A$), and skill ($S$); and skill is also influenced by education. (b) An instantiation of the causal graph with specific values. (c) A causal intervention case, in which $E$ is set to $e^*$; $*$ denotes variables with the reference value, e.g., $e^*$ can be set to no qualifications. (d) One example in the counterfactual world, which blocks the indirect effect of $E$ on $I$. (e) Another example in the counterfactual world, which blocks the direct effect of $E$ on $I$.

For instance, we can use $I_{e,s^*}(a)$ (Fig. 6(d)) to imagine what the income of Joe, at the age of $a = 24$ and with a bachelor degree $E=e$, would be if he only had the skill of a person with no qualifications $E=e^*$. Counterfactuals help when we want to distinguish between the direct effect of education $E$ on income $I$ (i.e., $E \rightarrow I$) and the indirect effect of $E$ on $I$ through the mediator $S$ (i.e., $E \rightarrow S \rightarrow I$). To capture the two effects individually, we can estimate each in the counterfactual world by blocking the other:
• Blocking indirect effect.
As shown in Fig. 6(d), if we block the indirect effect by assigning the mediator $S$ its reference value $s^*$, i.e., $S(E=e^*)$, the natural direct effect (NDE) of $E$ on $I$ can be calculated by
$$\text{NDE} = I_{e,s^*}(a) - I_{e^*,s^*}(a), \quad \text{where } s^* = S(E=e^*), \tag{12}$$
where $I_{e^*,s^*}(a)$ denotes the reference situation and $I_{e,s^*}(a)$ lies in the counterfactual world. NDE estimates the change of income $I$ when only education $E$ is changed from $e^*$ to $e$ while keeping $S = s^* = S(E=e^*)$. Here "natural" means that the direct effect is calculated in a natural situation where the mediator $S$ is set to a reference value. The NDE excludes from TE the effect of $E$ on $I$ through mediators (e.g., $S$). As such, the total indirect effect (TIE) of $E$ on $I$ can be obtained by subtracting NDE from TE:
$$\text{TIE} = \text{TE} - \text{NDE} = I_{e,s}(a) - I_{e,s^*}(a), \tag{13}$$
where TIE represents the effect of $E$ on $I$ via the indirect paths (i.e., $E \rightarrow S \rightarrow I$ in this case). The indirect effect here is not named a "natural" effect because it is calculated with the direct effect present in both the factual world (i.e., $I_{e,s}(a)$) and the counterfactual world (i.e., $I_{e,s^*}(a)$) [13].
• Blocking direct effect.
As shown in Fig. 6(e), we can also first block the direct effect to capture the natural indirect effect (NIE), and then subtract NIE from TE to obtain the total direct effect (TDE). Formally,
$$\begin{cases}
\text{NIE} = I_{e^*,s}(a) - I_{e^*,s^*}(a),\\
\text{TDE} = \text{TE} - \text{NIE} = I_{e,s}(a) - I_{e^*,s}(a).
\end{cases} \tag{14}$$
Note that both NIE and TIE capture the indirect effect of $E$ on $I$; the difference is that the direct effect is not blocked in the calculation of TIE.

Similar to [13], we can also estimate NIE rather than TIE to reduce the direct effect of the exposure features $E$ during inference. As described in Section 6.1, we can estimate NIE by simply blocking the direct edge $E \rightarrow Y$, as shown in Fig. 7(e). CR then predicts $Y$ only based on the user features and the aggregated item features $I$:
$$\text{NIE} = Y_{i,e^*}(u) - Y_{i^*,e^*}(u), \tag{15}$$
which directly estimates what the prediction score would be if the user had seen both the exposure and content features. Inference via NIE will also largely reduce the prediction score of an item with clickbait content, since the direct effect of its attractive exposure features is ignored.
Comparison between TIE and NIE.
The calculation of NIE is similar to that of TIE. According to Equation (15), inference via NIE calculates the predictions $Y_{i,e^*}(u) = f(Y_{u,i}, c_e)$ and $Y_{i^*,e^*}(u) = f(c_i, c_e)$, where $c_i$ and $c_e$ refer to the expectation constants for $Y_{u,i}$ and $Y_{u,e}$. As such, TIE and NIE with the MUL strategy are formulated as:
$$\begin{aligned}
\text{TIE} &= f(Y_{u,i}, Y_{u,e}) - f(c_i, Y_{u,e}) = Y_{u,i} \cdot \sigma(Y_{u,e}) - c_i \cdot \sigma(Y_{u,e}) = (Y_{u,i} - c_i) \cdot \sigma(Y_{u,e}),\\
\text{NIE} &= f(Y_{u,i}, c_e) - f(c_i, c_e) = Y_{u,i} \cdot \sigma(c_e) - c_i \cdot \sigma(c_e) = (Y_{u,i} - c_i) \cdot \sigma(c_e) \propto Y_{u,i},
\end{aligned}$$
where NIE and TIE are not equivalent. From the perspective of ranking items, NIE is equivalent to inference with $Y_{u,i}$, since $c_i$ and $\sigma(c_e)$ are constants. TIE instead measures the indirect effect in the presence of the direct effect of exposure features, and the MUL fusion strategy controls the reduction of the direct effect of exposure features in TIE.
Results.
The results of inference via TIE and NIE are compared in Table 4, from which we find that: 1) the performance of NIE with the MUL strategy is slightly worse than CR-MUL-TIE, which shows that TIE can better control the reduction of the direct effect of exposure features via the MUL fusion strategy; and 2) CR-MUL-NIE is still significantly better than the baselines, outperforming NT by 7.56% and 5.87% in terms of Recall@10 on Tiktok and Adressa, respectively. The superiority indicates the effectiveness of inference via the indirect effect even when it is simply estimated by NIE.

Figure 7: The causal graphs for conventional recommendation and counterfactual recommendation: (a) the conventional causal graph; (b) the proposed causal graph; (c) the counterfactual world $Y_{e,i^*}(u)$; (d) the reference situation $Y_{e^*,i^*}(u)$; (e) the counterfactual world $Y_{e^*,i}(u)$. Nodes denote the exposure features ($E$), content features ($T$), item features ($I$), user features ($U$), and prediction score ($Y$).

Table 4: Performance comparison between MMGCN trained with our CR and baselines on Tiktok and Adressa. %Improvement denotes the performance improvement w.r.t. NT; the best results are highlighted in bold.

Dataset      | Tiktok                                    | Adressa
Metric       | P@10   R@10   N@10   P@20   R@20   N@20   | P@10   R@10   N@10   P@20   R@20   N@20
CR-MUL-TIE   | 0.0269 0.0393 0.0370 0.0242 0.0683 0.0476 | 0.0532 0.1045 0.0878 0.0439 0.1712 0.1133
%Improvement | 5.08%  10.08% 11.11% 4.76%  7.56%  10.70% | 6.19%  7.18%  7.47%  5.78%  6.20%  6.99%

Figure 8: The effect of α on two datasets (NDCG@20 on Adressa and Tiktok).

We evaluate our proposed method on two publicly available datasets: Tiktok [25] and Adressa [6]. For each dataset, we utilize the post-click feedback of users to evaluate recommenders. The statistics of the datasets are shown in Table 5.
• Tiktok. This is a multi-modal micro-video dataset released in the ICME Challenge 2019 (http://ai-lab-challenge.bytedance.com/tce/vc/), where a micro-video has the features of caption, audio, and video. Multi-modal item features have already been extracted by the challenge organizer for fair comparison. Among the features, we treat captions as exposure features and the remaining ones as content features. Actions of thumbs-up, favorite, or finishing a video are used as the positive post-click feedback (i.e., like).
• Adressa. This is a public news recommendation dataset (http://reclab.idi.ntnu.no/dataset/), published through the collaboration of the Norwegian University of Science and Technology (NTNU) and Adressavisen. The title and description of a news article are its exposure features, while the news contents are treated as content features. We use the state-of-the-art pre-trained Multilingual BERT [4] to extract textual features into 768-dimensional vectors. Following prior studies [10], a dwell time > 30 seconds reflects a like.

For each user, we randomly choose 10% of the clicks that end with likes to constitute a test set, and treat the remaining interactions as the training set; if fewer than 10% of a user's clicks end with likes, all such clicks are put into the test set. Besides, we ignore the potential noise in the test set, e.g., fake favorites, which will be explored in future work.

Table 5: Statistics of the two datasets.

Besides, 10% of clicks are randomly selected from the training set as the validation set, which we utilize to choose the best model for the testing phase. For each click, we randomly choose an item the user has not interacted with as the negative sample for training.
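The per-user splitting protocol above can be sketched as follows. The function and field names are our own illustration (not from the released datasets), and the handling of users with very few likes is a simplification of the rule described above.

```python
import random

def split_user_clicks(clicks, test_ratio=0.1, val_ratio=0.1, seed=0):
    """Split one user's clicks: ~10% of liked clicks form the test set,
    ~10% of the remaining clicks the validation set, and the rest the
    training set. `clicks` is a list of (item_id, liked) pairs."""
    rng = random.Random(seed)
    liked = [c for c in clicks if c[1]]
    test = set()
    if liked:
        # at least one liked click goes to the test set
        test = set(rng.sample(liked, max(1, int(len(liked) * test_ratio))))
    rest = [c for c in clicks if c not in test]
    val = set(rng.sample(rest, int(len(rest) * val_ratio)))
    train = [c for c in rest if c not in val]
    return train, sorted(val), sorted(test)
```

The three parts partition the user's clicks, and every test interaction is a liked one, matching the evaluation protocol that scores recommenders on post-click satisfaction.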
We compare the proposed CR with various recommender training methods, especially ones that can handle bias. For fair comparison, all methods are applied to MMGCN [25], the state-of-the-art multi-modal recommender, which captures modality-aware high-order user-item relationships. Specifically, CR is compared with the following baselines:
• NT.
Following [25], MMGCN is trained with the normal training (NT) strategy, where all item features are used and MMGCN is optimized on click data. We keep the same settings as [25]: the model is optimized by the BPR loss [20]; the learning rate is set to 0.001 and the size of latent features to 64; and training stops if the Recall@10 score does not increase on the validation set for 10 successive epochs.
• CFT.
Based on the analysis that exposure item features easily induce the clickbait issue, we only use content features for training (CFT). The model is also trained with all click data.
• IPW.
Liang et al. [11] tried to reduce exposure bias in clicks via causal inference with inverse propensity weighting (IPW) [22]. For fair comparison, we follow the idea of Liang et al. and implement the exposure and click models of [11] with MMGCN, since MMGCN incorporates multi-modal item features and thus achieves better performance.
Besides, considering that post-click feedback helps to indicate true user satisfaction, we also compare CR with three baselines that incorporate extra post-click feedback as inputs:
• CT.
This method is conducted in the clean training (CT) setting, in which only the clicks that end with likes are viewed as positive samples to train MMGCN.
• NR.
Wen et al. [26] adopted post-click feedback and also treated "click-skip" items as negative samples. We apply their Negative feedback Re-weighting (NR) to MMGCN. In detail, NR adjusts the weights of the two kinds of negative samples during training, i.e., "click-skip" items and "no-click" items. Following [26], the extra hyper-parameter λ_{p,n}, i.e., the ratio of the two kinds of negative samples, is tuned over a set of candidate values.
• RR.
For each user, we propose to re-rank (RR) the top 20 items recommended by NT during inference. For each item, the final ranking is calculated as the sum of its rank in NT and its rank based on the like/click ratio of items, where the like/click ratio is calculated over the whole dataset.
We omit other potential recommenders such as VBPR [7] and ACF [2], since previous work [25] has validated the superior performance of MMGCN over these multi-modal recommenders.
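The RR baseline can be sketched as a simple rank-sum re-ranking. The function and the toy inputs below are our own illustration of the described procedure, not the authors' code.

```python
def rerank_rr(nt_ranking, like_click_ratio):
    """Re-rank NT's top items by the sum of (i) their rank in NT and
    (ii) their rank by global like/click ratio (higher ratio = better rank).
    `nt_ranking` lists item ids, best first; a lower summed rank wins,
    with ties broken by the original NT order."""
    nt_rank = {item: r for r, item in enumerate(nt_ranking)}
    by_ratio = sorted(nt_ranking, key=lambda i: -like_click_ratio[i])
    ratio_rank = {item: r for r, item in enumerate(by_ratio)}
    return sorted(nt_ranking, key=lambda i: (nt_rank[i] + ratio_rank[i], nt_rank[i]))

# Item "a" tops NT but has a poor like/click ratio, so it is demoted:
top = rerank_rr(["a", "b", "c"], {"a": 0.1, "b": 0.9, "c": 0.8})  # ["b", "a", "c"]
```

This makes the intent of RR concrete: an item can only stay at the top if it is both highly ranked by the click model and historically well liked after the click.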
Parameter Settings.
We strictly follow the original implementation of MMGCN [25], including the code, parameter initialization, and hyper-parameter tuning. The weight α in the loss function is tuned over a set of candidate values, and the sensitivity of CR to α is visualized in Fig. 8. Moreover, an early stopping strategy is used for model selection, i.e., training stops if Recall@10 on the validation set does not increase for 10 successive epochs.
Inspired by [13], we explore multiple fusion strategies in addition to the MUL strategy:
$$\begin{aligned}
\text{SUM-linear:} \quad & Y(U=u, I=i, E=e) = f(Y_{u,i}, Y_{u,e}) = Y_{u,i} + Y_{u,e},\\
\text{SUM-sigmoid:} \quad & Y(U=u, I=i, E=e) = f(Y_{u,i}, Y_{u,e}) = Y_{u,i} + \sigma(Y_{u,e}),\\
\text{SUM-tanh:} \quad & Y(U=u, I=i, E=e) = f(Y_{u,i}, Y_{u,e}) = Y_{u,i} + \tanh(Y_{u,e}),\\
\text{MUL-tanh:} \quad & Y(U=u, I=i, E=e) = f(Y_{u,i}, Y_{u,e}) = Y_{u,i} \cdot \tanh(Y_{u,e}).
\end{aligned} \tag{16}$$
Similar to the prior MUL strategy, TIE for these fusion functions can be calculated by:
$$\begin{aligned}
\text{SUM-linear:} \quad & \text{TIE} = f(Y_{u,i}, Y_{u,e}) - f(c_i, Y_{u,e}) = Y_{u,i} + Y_{u,e} - (c_i + Y_{u,e}) = Y_{u,i} - c_i \propto Y_{u,i},\\
\text{SUM-sigmoid:} \quad & \text{TIE} = f(Y_{u,i}, Y_{u,e}) - f(c_i, Y_{u,e}) = Y_{u,i} + \sigma(Y_{u,e}) - (c_i + \sigma(Y_{u,e})) = Y_{u,i} - c_i \propto Y_{u,i},\\
\text{SUM-tanh:} \quad & \text{TIE} = f(Y_{u,i}, Y_{u,e}) - f(c_i, Y_{u,e}) = Y_{u,i} + \tanh(Y_{u,e}) - (c_i + \tanh(Y_{u,e})) = Y_{u,i} - c_i \propto Y_{u,i},\\
\text{MUL-tanh:} \quad & \text{TIE} = f(Y_{u,i}, Y_{u,e}) - f(c_i, Y_{u,e}) = Y_{u,i} \cdot \tanh(Y_{u,e}) - c_i \cdot \tanh(Y_{u,e}) = (Y_{u,i} - c_i) \cdot \tanh(Y_{u,e}).
\end{aligned}$$
Note that during inference, TIE for the SUM strategies with different activation functions is identical. However, these strategies capture the direct effect of exposure features differently during the training process.
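A minimal numerical sketch of these fusion strategies and the resulting TIE scores, using scalar stand-ins for $Y_{u,i}$, $Y_{u,e}$, and the expectation constant $c_i$ (the concrete values are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Fusion strategies f(Y_ui, Y_ue) from Equation (16).
FUSIONS = {
    "SUM-linear":  lambda y_ui, y_ue: y_ui + y_ue,
    "SUM-sigmoid": lambda y_ui, y_ue: y_ui + sigmoid(y_ue),
    "SUM-tanh":    lambda y_ui, y_ue: y_ui + math.tanh(y_ue),
    "MUL-tanh":    lambda y_ui, y_ue: y_ui * math.tanh(y_ue),
}

def tie(f, y_ui, y_ue, c_i):
    """TIE = f(Y_ui, Y_ue) - f(c_i, Y_ue): the factual prediction minus the
    counterfactual one in which Y_ui is replaced by its expectation c_i."""
    return f(y_ui, y_ue) - f(c_i, y_ue)

y_ui, y_ue, c_i = 2.0, 0.5, 0.3
scores = {name: tie(f, y_ui, y_ue, c_i) for name, f in FUSIONS.items()}
# All SUM variants collapse to Y_ui - c_i, independent of Y_ue;
# MUL-tanh gives (Y_ui - c_i) * tanh(Y_ue), modulated by the exposure score.
```

Running this confirms the derivation above: the three SUM variants produce the same TIE for any $Y_{u,e}$, while the MUL variant rescales it by the (activated) exposure score.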
Table 6: Performance comparison between the inference via TE and TIE.

Dataset     | Tiktok                                    | Adressa
Metric      | P@10   R@10   N@10   P@20   R@20   N@20   | P@10   R@10   N@10   P@20   R@20   N@20
CR-MUL-TIE  | 0.0269 0.0393 0.0370 0.0242 0.0683 0.0476 | 0.0532 0.1045 0.0878 0.0439 0.1712 0.1133

Figure 9: Distribution of counts w.r.t. Rank Diff. group for NT and CR-MUL-TIE.

Figure 10: Rank Diff. comparison between NT and CR (Rank Diff. of CR-MUL-TIE vs. Rank Diff. of NT).

To further evaluate the effectiveness of CR in mitigating the direct effect of exposure features, we conduct experiments on synthetic data. Specifically, during inference, we construct a fake item for each positive user-item pair in the testing data by "poisoning" the exposure features of the item. The content features of the fake item are the same as those of the real item, while its exposure features are randomly selected from the items with a like/click ratio < 0.5. Such items with low like/click ratios are more likely to suffer from the clickbait issue: their exposure features tend to be attractive but deceptive, for example, "Find UFO!". Therefore, the fake item should be ranked lower than the paired real item if the recommender reduces the effect of misleading exposure features. Besides, there is a large discrepancy between the exposure and content features of the fake items, which simulates items with the clickbait issue, where content features do not align with exposure features. Due to the discrepancy, a lower rank of the fake item (i.e., a larger gap to the rank of the real item) indicates better elimination of the direct effect of the exposure features. Accordingly, we rank all testing real items and the fake ones for each user, and define rank_diff. = rank_fake − rank_real to measure the performance of recommenders, where rank_fake and rank_real are the ranks of the paired fake and real items, respectively. Larger values indicate better performance. Lastly, we calculate the rank_diff. of each triplet