The First International Conference on AI-ML-Systems | 2021

A Hybrid Approach for Offline A/B Evaluation for Item Ranking Algorithms in Recommendation Systems


Abstract


A recommendation system generally outputs a ranked list of items that is presented to the user. Based on consumption signals from the user (such as clicks and plays) in a production environment, performance metrics such as Click-Through Rate (CTR), Play-Through Rate (PTR), and Average Consumption Time are calculated. These metrics are used to objectively evaluate the performance of the underlying algorithms (policies) via online A/B tests. However, when many such policies are in the innovation pipeline, evaluating all of them with online A/B tests places significant overhead on production systems and can cause user dissatisfaction if a poor policy is exposed to users. Pre-production testing in an offline environment has therefore become a highly researched area of significant practical value. During "offline A/B" testing, we are interested in comparing multiple policies by their potential improvement on the performance metrics most correlated with business KPIs. At the same time, user satisfaction is closely tied to the ranked placement of items: relevant items should occur towards the top of the list. Current well-known methods based on counterfactual estimation, such as importance sampling and its variants (capped importance sampling and normalized capped importance sampling), consider only a single performance metric and ignore user-satisfaction metrics, which can lead to a sub-optimal user experience. Furthermore, they do not account for the position bias that arises from the limited screen real estate of mobile devices. To address these issues, we extend importance-sampling-based methods by combining performance and user-satisfaction metrics while accounting for position bias. We also demonstrate that using such a hybrid metric during offline testing improves correlation with the desired business metrics, enabling better offline comparison of ranking algorithms.
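For intuition, below is a minimal sketch of the two ingredients the abstract names: a normalized capped importance sampling (NCIS) estimator over logged data, and a hybrid per-impression reward that mixes a performance signal (the click) with a position-discounted satisfaction signal. This is not the paper's exact formulation; the function names, the cap value, the alpha mixing weight, and the log2 position-discount model are illustrative assumptions.

    import numpy as np

    def ncis_estimate(rewards, target_probs, logging_probs, cap=10.0):
        """Normalized capped importance sampling (NCIS): estimates a target
        policy's expected reward from logs collected under a logging policy.
        Capping at `cap` limits the variance from extreme importance weights;
        normalizing by the weight sum reduces the bias the cap introduces."""
        weights = np.minimum(target_probs / logging_probs, cap)
        return np.sum(weights * rewards) / np.sum(weights)

    def hybrid_reward(clicked, rank, alpha=0.5):
        """Illustrative hybrid per-impression reward: a convex combination of
        a raw performance signal (the click) and a position-discounted
        satisfaction signal, so clicks near the top of the list count more.
        The log2 discount is the usual DCG-style position-bias model."""
        position_discount = 1.0 / np.log2(rank + 2)  # rank is 0-based
        return alpha * clicked + (1.0 - alpha) * clicked * position_discount

    # Toy usage: three logged impressions with (click, displayed rank), plus
    # the target and logging policies' propensities for each logged item.
    rewards = np.array([hybrid_reward(c, k) for c, k in [(1, 0), (0, 1), (1, 3)]])
    pi_target = np.array([0.5, 0.2, 0.4])
    pi_logging = np.array([0.3, 0.3, 0.1])
    print(ncis_estimate(rewards, pi_target, pi_logging, cap=5.0))

Under this sketch, the hybrid reward is simply substituted for the raw click in the NCIS estimator, so the same counterfactual machinery scores policies on both performance and position-aware satisfaction.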

DOI 10.1145/3486001.3486241
