Maximizing Marginal Fairness for Dynamic Learning to Rank
Tao Yang
University of Utah, Salt Lake City, USA
[email protected]
Qingyao Ai
University of Utah, Salt Lake City, USA
[email protected]
ABSTRACT
Rankings, especially those in search and recommendation systems, often determine how people access information and how information is exposed to people. Therefore, how to balance the relevance and fairness of information exposure is considered one of the key problems for modern IR systems. As conventional ranking frameworks that myopically sort documents by their relevance will inevitably introduce unfair result exposure, recent studies on ranking fairness mostly focus on dynamic ranking paradigms where result rankings can be adapted in real time to support fairness in groups (e.g., races, genders, etc.). Existing studies on fairness in dynamic learning to rank, however, often achieve the overall fairness of document exposure in ranked lists by significantly sacrificing the performance of result relevance and fairness on the top results. To address this problem, we propose a fair and unbiased ranking method named Maximal Marginal Fairness (MMF). The algorithm integrates unbiased estimators for both relevance and merit-based fairness while providing an explicit controller that balances the selection of documents to maximize the marginal relevance and fairness in top-k results. Theoretical and empirical analysis shows that, with small compromises on long-list fairness, our method achieves superior efficiency and effectiveness compared to the state-of-the-art algorithms in both relevance and fairness for top-k rankings.
CCS CONCEPTS
• Information systems → Learning to rank.

KEYWORDS
Learning to Rank, Ranking Fairness, Unbiased Learning
ACM Reference Format:
Tao Yang and Qingyao Ai. 2021. Maximizing Marginal Fairness for Dynamic Learning to Rank. In Proceedings of the Web Conference 2021 (WWW '21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3442381.3449901
1 INTRODUCTION
Fairness in ranking has drawn much attention as ranking systems, especially those in search and recommendation systems, could significantly affect how people access information and how information is exposed to users [5]. For example, job posts ranked highly
This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '21, April 19–23, 2021, Ljubljana, Slovenia. © 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-8312-7/21/04.
on LinkedIn are more likely to receive more applications; new products at the bottom of Amazon search result pages are less likely to be clicked by customers. Without proper treatment, ranking systems could introduce unintended unfairness into various aspects of people's lives, such as job opportunities, economic gain, etc. As traditional ranking algorithms that produce static ranked lists can hardly handle the unfairness of result exposure in practice, ranking paradigms that dynamically change result rankings on the fly, namely dynamic Learning-to-Rank (LTR) [20], have received more and more attention in the research community. The idea of dynamic LTR is to learn and adapt ranking models based on user feedback in real time so that past user interactions or result distributions can affect the exposure of future results. Though heavily affected by implicit user examination bias [14, 16, 33] and selection bias [21, 28], dynamic LTR allows ranking systems to produce multiple ranked lists for a single request, which makes it possible to explicitly control or balance the exposure of results in different groups (e.g., race, gender, etc.) for ranking fairness. Existing studies on ranking fairness in dynamic LTR mostly focus on developing effective algorithms to achieve merit-based fairness [5, 25]. Well-known examples include the linear programming algorithm (LinProg) [25], which determines result rankings by taking group fairness as an optimization constraint, and the FairCo algorithm [20], which manipulates ranking scores dynamically according to the current exposure of results in different groups.
Despite their solid theoretical foundations, these algorithms often sacrifice the performance of result relevance significantly on the top results of each ranked list. These trade-offs are uncontrollable as they are difficult to quantify explicitly, which makes the risk of applying fairness algorithms in practice still high today. In this paper, we present a novel fairness algorithm for dynamic LTR that considers both relevance and fairness in its ranking criterion. We believe that it is important to control and balance result relevance and fairness in online ranking systems, especially for the top-k results in ranked lists. Therefore, inspired by the studies of search diversification that try to improve novelty while preserving ranking performance (especially the maximal marginal relevance framework [6]), we propose a Maximal Marginal Fairness (MMF) algorithm that optimizes ranking performance while explicitly mitigating amortized group unfairness for selected items. MMF dynamically selects documents that are relevant and underexposed in top-k results to maximize the marginal relevance and fairness. With a small compromise at the bottom of long ranked lists, MMF achieves superior performance and significantly outperforms the state-of-the-art fairness algorithms not only in top-k relevance, but also in top-k fairness. As most people examine only the top results on a result page [8, 13, 19, 27], MMF is highly competitive and preferable in real LTR systems and applications.
From a technical viewpoint, the main contributions of this paper are two-fold. First, extending the existing definition of merit-based group fairness in ranking [5, 25], we develop a metric to measure the group fairness of exposure in top-k rankings. We show that most existing state-of-the-art methods for ranking fairness focus more on the overall fairness of document exposure while compromising a lot in the top ranks of each ranked list. Second, we propose a Maximal Marginal Fairness (MMF) algorithm that can explicitly control and balance result relevance and fairness in top-k rankings. In particular, our method uses a hyper-parameter λ to determine whether the system should select an item for relevance or an item that improves the marginal fairness of the current ranked list. We evaluate and compare our algorithm with existing state-of-the-art methods on both synthetic and real-world preference datasets using simulated user interactions. Theoretical and empirical analysis shows that our method can achieve significant efficiency and effectiveness improvements in top-k relevance and fairness.

2 RELATED WORK
Leveraging biased click data for optimizing learning-to-rank systems has been a popular approach in information retrieval [13, 14]. As click data are often noisy and biased, numerous unbiased learning to rank (ULTR) methods have been proposed based on different theoretical foundations [4], including click models [9, 12, 29], randomization [22], causal inference [1, 3, 17], etc. Those methods make it possible to achieve unbiased relevance estimation or rankings for learning to rank in noisy environments. Ranking according to intrinsic result relevance is important yet not enough today. One of the key principles [23] of ranking in Information Retrieval states that documents should be ranked in order of their probability of relevance or usefulness. Such arguments, however, only consider the responsibilities of ranking systems to users while ignoring the items being ranked.
Recently, there has been growing concern about fairness in ranking algorithms in both academia and industry [5, 34, 35]. Since rankings are the main interface through which we find content, products, music, news, etc., their ordering contributes not only to the utility of users, but also to that of information providers. To address this, some studies focus on restricting the fraction of items of each attribute in a ranking [32, 34]. Another natural way of understanding unfairness is to consider differences in exposure, which directly relates unfairness to economic or social opportunities [35]. For example, researchers have tried to achieve amortized fairness of exposure by reducing group disparate exposure for only the top position [35] or by making exposure proportional to relevance [5, 20, 25, 26]. The reason for reducing exposure disparity is that a minimal difference in relevance can result in a large difference in exposure across groups [25] if the ranking is based only on relevance, since there can be a large skew in the distribution of exposure, such as position bias [15], i.e., users observe the top ranks far more than the bottom part. To achieve this, various methods have been proposed. For example, a fairness loss based on the Softmax function is proposed by Zehlike and Castillo [35], which considers equal group exposure in the top rank. Linear programming methods are proposed in [5, 7, 25], which try to give a general framework for computing optimal probabilistic rankings for merit-based fairness.
Table 1: A summary of notations.

σ_t, x_t, c_t, o_t, p_t, r_t : The presented document ranking (σ_t), the corresponding feature vectors (x_t), the clicks on each document (c_t), the binary variables indicating the user's examination of each document (o_t), the user's propensity to examine results at certain positions (p_t, namely P(o_t = 1)), and the true (personalized) binary relevance ratings of the documents (r_t) at time step t in the dynamic LTR setting.
R(d) : The average relevance across all users for document d.
R_θ(d | x_t) : The predicted relevance of d given by the model parameterized by θ.
G, G_i, G_i^k : The set of groups to consider (G) and the i-th group (G_i). G_i^k is the priority queue of size k for G_i according to estimated relevance.

A policy learning method is adopted in [26] to maximize ranking metrics while introducing unfairness as an additional loss. In fact, the idea behind existing fairness methods is always to avoid showing items from the same class or group, which is not a new topic and has already been studied as diversity and novelty in the Information Retrieval community for decades [11, 30, 31]. Though the utility of diversified ranking is still centered on users, not providers, studies on search diversification also focus on optimizing result rankings beyond information relevance. One of the most famous approaches is the maximal marginal relevance approach (MMR) [6], which treats ranking as a Markov Decision Process by selecting documents that maximize the combination of sub-topic relevance given previously selected results. In a recent work [20], the proportional controller from control theory is applied to mitigate unfairness in dynamic LTR settings. However, existing fairness algorithms in dynamic LTR only consider overall fairness while ignoring top-k unfairness. In this paper, we first define top-k group fairness, which is often ignored by existing works.
Then, based on the top-k fairness metrics and inspired by MMR, we propose a concept named marginal fairness and directly optimize top-k relevance and fairness in a dynamic LTR environment where both the relevance and fairness modules of the algorithm are learned and adapted according to real-time user feedback. Another algorithm related to our work is FA*IR, proposed by Zehlike et al. [34]. FA*IR is a post-processing method that explicitly guarantees the exposure of documents in top-k rankings. The key difference between our algorithm MMF and FA*IR is that FA*IR focuses on a simplified offline scenario and ignores the fact that result exposure could affect relevance estimation in LTR.
3 PROBLEM FORMULATION
In this section, we introduce the problem of relevance and fairness estimation in dynamic LTR with a focus on the top-k results. A summary of the notations used in this paper is shown in Table 1.
In dynamic LTR frameworks, the most popular paradigm for relevance estimation is to infer relevance from user feedback directly. Specifically, from partial and biased feedback such as user clicks, we need to construct a model to estimate relevance. In general, traditional LTR models can be categorized as cardinal and ordinal ones. Ordinal LTR models give ordinal numbers according to output scores for items, while those scores themselves have no exact meaning. Cardinal LTR models, on the other hand, predict document ranking scores that are proportional to or directly reflect the relevance of the documents. Since exposure disparity explicitly involves relevance, similar to previous studies [20], we only focus on cardinal LTR models in this paper. Here, we adopt an LTR model R_θ(d | x_t) parameterized by θ with a least-squares loss as

  L_τ(θ) = Σ_{t=1}^{τ} Σ_d ( r_t(d) − R_θ(d | x_t) )² ≜ Σ_{t=1}^{τ} Σ_d ( R_θ(d | x_t)² − 2 · r_t(d) · R_θ(d | x_t) )    (1)

where τ is the total number of time steps so far in dynamic LTR, and ≜ denotes equality up to additive constants. As the true relevance judgments r_t are not available, based on the studies of unbiased learning to rank [17], we define an unbiased estimation of L_τ(θ) using click data c_t by applying Inverse Propensity Score (IPS) weighting as

  L̃_τ(θ) = Σ_{t=1}^{τ} Σ_d ( R_θ(d | x_t)² − 2 · (c_t(d) / p_t(d)) · R_θ(d | x_t) )    (2)

where p_t is the user's examination propensity at each result position (namely P(o_t = 1)), and we assume that

  P(c = 1) = P(o = 1) · P(r = 1)    (3)

which means that users click a search result (c = 1) only when it is both observed (o = 1) and perceived as relevant (r = 1), with o and r independent. The unbiasedness of L̃_τ(θ) can be proven as below:

  E_{o_t}[ L̃_τ(θ) ] = Σ_{t=1}^{τ} Σ_d ( R_θ(d | x_t)² − 2 · (E_{o_t}[c_t(d)] / p_t(d)) · R_θ(d | x_t) )
                     = Σ_{t=1}^{τ} Σ_d ( R_θ(d | x_t)² − 2 · (p_t(d) · r_t(d) / p_t(d)) · R_θ(d | x_t) )
                     = Σ_{t=1}^{τ} Σ_d ( R_θ(d | x_t)² − 2 · r_t(d) · R_θ(d | x_t) ) = L_τ(θ)    (4)

Similarly, we can get an unbiased estimation of the average relevance of d across all users (R(d)) as

  R_IPS(d) = (1/τ) Σ_{t=1}^{τ} c_t(d) / p_t(d)    (5)

R_IPS(d) is used both for fairness control and as a global ranking baseline without personalization (i.e., without considering document relevance with respect to each individual user). The estimation of the position bias p_t(d) can be achieved in advance through various methods [2, 3, 29], which is not in the scope of this paper.

We now introduce the definition of top-k merit-based group unfairness in dynamic LTR. First, similar to previous studies [5, 20, 25], we define the exposure of a document d as its marginal probability of being examined, p_t(d) = P(o_t(d) = 1 | σ_t, x_t, r_t), where σ_t, x_t, r_t are the presented ranking, feature vectors, and true relevance of documents at time step t in dynamic LTR. Let G = {G_1, ..., G_m} be the possible groups that each document could belong to. Suppose that we only care about exposure fairness in the top-k positions; then we can define group-based exposure in the top-k positions following the merit-based fairness definition [20] as

  Exp_t^k(G_i) = (1/|G_i|) Σ_{d ∈ G_i ∩ σ_t^k} p_t(d)    (6)

where σ_t^k is the top k documents in the presented ranking σ_t, and p_t(d) is the examination propensity on d at time step t.
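As a concrete illustration, the IPS estimator of Eq. (5) can be sketched in a few lines. The simulation below (document positions, propensities, and relevance values are all hypothetical) samples clicks according to Eq. (3) and shows that the propensity-weighted click average recovers the true average relevance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, tau = 30, 5000

# Hypothetical ground truth: per-document average relevance R(d).
true_rel = rng.uniform(0.1, 0.9, size=n_docs)

# Examination propensity p_t(d): here every document keeps a fixed
# position, with an inverse-log position bias (an assumption for this toy).
prop = 1.0 / np.log2(2.0 + np.arange(n_docs))

ips_sum = np.zeros(n_docs)
for t in range(tau):
    observed = rng.random(n_docs) < prop        # o_t ~ Bernoulli(p_t)
    relevant = rng.random(n_docs) < true_rel    # r_t ~ Bernoulli(R(d))
    c_t = observed & relevant                   # Eq. (3): click = observed AND relevant
    ips_sum += c_t / prop                       # propensity-weighted clicks

r_ips = ips_sum / tau                           # Eq. (5): unbiased estimate of R(d)
print(np.abs(r_ips - true_rel).mean())          # small, and shrinking with tau
```

Without the division by `prop`, lower-ranked documents would be systematically underestimated; the IPS weight removes exactly that position effect in expectation.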
Following previous studies, we define the merit of a group as the average relevance across all documents in the group:

  Merit(G_i) = (1/|G_i|) Σ_{d ∈ G_i} R(d)    (7)

Thus, for any two groups G_i and G_j, we can define their accumulative disparity in the top-k positions as

  Exp_cum_τ^k(G_i) = Σ_{t=1}^{τ} Exp_t^k(G_i)
  Exp_Mer_τ^k(G_i) = (1/τ) · Exp_cum_τ^k(G_i) / Merit(G_i)
  D_τ^k(G_i, G_j) = | Exp_Mer_τ^k(G_i) − Exp_Mer_τ^k(G_j) |    (8)

A larger disparity indicates a greater violation of fairness in top-k rankings. In addition to the pairwise disparity defined in Eq. (8), for rankings with more than two groups, we define the unfairness of a list as the average disparity over all pairs:

  Unfairness@k = D_τ^k = (2 / (m(m−1))) Σ_{i=1}^{m} Σ_{j=i+1}^{m} D_τ^k(G_i, G_j)    (9)

Note that the original merit-based unfairness defined in [20] can be seen as a special case of our formulation where k is the number of all possible documents. In this paper, our goal is to create rankings with high relevance r_τ while maintaining a low unfairness D_τ^k in top-k results after time step τ in dynamic LTR.

4 MAXIMAL MARGINAL FAIRNESS
In this section, we describe our approach to balancing relevance and fairness in dynamic LTR. Specifically, we first introduce the concept of maximal marginal fairness, and then we discuss our MMF algorithm.
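Before the algorithm itself, here is a minimal sketch of the Unfairness@k metric of Eqs. (8)–(9); all numbers below are toy values, not results from the paper:

```python
import numpy as np
from itertools import combinations

def unfairness_at_k(exposure_cum, merit, tau):
    """Eqs. (8)-(9): average pairwise merit-normalized exposure disparity.

    exposure_cum[i] : cumulative top-k exposure Exp_cum of group G_i after tau steps
    merit[i]        : Merit(G_i), the mean relevance of the group (Eq. 7)
    """
    exp_mer = exposure_cum / (tau * merit)                 # Exp_Mer per group
    pairs = list(combinations(range(len(merit)), 2))
    return sum(abs(exp_mer[i] - exp_mer[j]) for i, j in pairs) / len(pairs)

# Toy example: three groups after tau = 100 user interactions.
u = unfairness_at_k(np.array([60.0, 30.0, 18.0]),
                    np.array([0.5, 0.4, 0.3]), tau=100)
print(round(u, 3))  # -> 0.4
```

The 2/(m(m−1)) factor in Eq. (9) is just the reciprocal of the number of group pairs, so dividing by `len(pairs)` computes the same average.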
In this paper, we define a new concept for ranking fairness: Marginal Fairness. Ranking can be modeled as a greedy selection process where we create a ranked list by sequentially selecting documents from a candidate pool. Based on this assumption, considerable ranking algorithms have been proposed to optimize ranking utility from various perspectives such as information relevance [18], novelty [24], etc. Particularly, in the studies of search diversification, one of the most well-known algorithms is the Maximal Marginal Relevance (MMR) algorithm [6], which greedily selects documents based on their Marginal Relevance, the maximal utility of a document given the selected results in the current ranked list, to balance ranking relevance and novelty. Inspired by MMR and the concept of marginal relevance, we model ranking as a greedy selection problem and define marginal fairness as the marginal gain of fairness, or the marginal reduction of unfairness, when selecting and adding a document given the selected results in the current ranked list. Let d_τ^k be the document we select for the k-th position in the ranked list σ_τ. Formally, the marginal fairness of selecting document d_τ^k from group G is

  MF(G | σ_τ^{k−1}) = D_τ^{k−1} − D_τ^k(d_τ^k), where d_τ^k ∈ G    (10)

where D_τ^k(d_τ^k) is the unfairness after we select d_τ^k. Then, to maximize the merit-based fairness of the top-k results, a straightforward method is to maximize marginal fairness by selecting the document with maximum MF(G | σ_τ^{k−1}). This observation serves as the foundation for the construction of the MMF algorithm. The MMF algorithm includes three sub-modules: the selection of documents for maximizing marginal fairness, the selection of documents for maximizing relevance, and the controller that balances top-k relevance and fairness.
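A minimal numerical sketch of Eq. (10): for each candidate group we compute the disparity before and after adding one document, using the pairwise metric of Eqs. (8)–(9). The exposure update is deliberately simplified (a single merit-normalized increment per selection), and all numbers are toy values:

```python
import numpy as np
from itertools import combinations

def disparity(exp_mer):
    """Average pairwise disparity D of Eqs. (8)-(9) from per-group Exp_Mer values."""
    pairs = list(combinations(range(len(exp_mer)), 2))
    return sum(abs(exp_mer[i] - exp_mer[j]) for i, j in pairs) / len(pairs)

def marginal_fairness(exp_mer_before, merit, delta_exposure):
    """Eq. (10): marginal fairness of adding one document from each group.

    exp_mer_before : current Exp_Mer value of each group
    merit          : Merit(G_i) of each group
    delta_exposure : exposure contributed by the next position (simplified:
                     one increment, normalized by the chosen group's merit)
    """
    d_before = disparity(exp_mer_before)
    mf = []
    for g in range(len(merit)):
        exp_mer_after = exp_mer_before.astype(float).copy()
        exp_mer_after[g] += delta_exposure / merit[g]
        mf.append(d_before - disparity(exp_mer_after))
    return np.array(mf)

# Toy state: group 2 is the most under-exposed relative to its merit,
# so adding one of its documents yields the largest marginal fairness.
mf = marginal_fairness(np.array([1.2, 0.75, 0.6]),
                       merit=np.array([0.5, 0.4, 0.3]),
                       delta_exposure=0.05)
print(mf.argmax())  # -> 2
```

Note that the selected group is exactly the one with the lowest Exp_Mer, which is the computational shortcut MMF exploits.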
As discussed previously, MMF optimizes top-k fairness by greedily selecting documents to maximize marginal fairness. In Eq. (10), the computation of marginal fairness requires us to compute the current and updated disparities (D_τ^{k−1}(G_i, G_j) and D_τ^k(G_i, G_j)) for every pair of groups. In practice, however, we can prove that maximal marginal fairness can be achieved directly by selecting a document from the group with the lowest Exp_Mer_τ^k(G_i) in Eq. (8), given that the scale of p_τ(d) is much smaller than Exp_Mer_τ^k(G_i) (we omit the proof for simplicity). Also, as the true relevance of documents R(d) is not available, we use R_IPS(d) to estimate R(d) and get an unbiased estimation of Exp_Mer_τ^k(G_i), denoted as Êxp_Mer_τ^k(G_i). Thus, MMF computes the best group to select documents from as

  G_τ^k = argmax_G MF(G | σ_τ^{k−1}) = argmin_G Êxp_Mer_τ^k(G)    (11)

Note that every document in G_τ^k would have the maximal marginal fairness given the ranked list σ_τ^{k−1}. To maximize the performance of top-k results in terms of ranking relevance, the optimal solution is to select documents based on their estimated relevance from user interactions (i.e., R_IPS) or an LTR model parameterized by θ (R_θ) as

  d̄_τ^k = argmax_{d ∉ σ_τ^{k−1}} R_θ(d)    (12)

where we only consider relevance but not fairness in selecting the next document for the ranked list. In case we need to select documents from the group with maximal marginal fairness, we can apply a similar process and get

  d̃_τ^k = argmax_{d ∈ G_τ^k ∧ d ∉ σ_τ^{k−1}} R_θ(d)    (13)

In practice, Eq. (12) and Eq. (13) can be computed together by maintaining multiple priority queues G_i^k, one for each group.
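The selection rules of Eqs. (11)–(13), together with the probabilistic λ-controller MMF uses to choose between them, can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the group sizes, exposure updates, and position-bias model are toy assumptions.

```python
import numpy as np

def mmf_rank(scores, groups, merit, exposure_cum, tau, k, lam, prop, rng):
    """Build one top-k ranking with MMF-style greedy selection (Eqs. 11-14).

    scores[d]    : estimated relevance R_theta(d) or R_IPS(d)
    groups[d]    : group index of document d
    merit[g]     : Merit(G_g); exposure_cum[g] is the group's cumulative exposure
    prop[i]      : examination propensity of rank position i
    With probability lam, take the most relevant remaining document from the
    group with the lowest Exp_Mer (Eqs. 11 and 13); otherwise take the most
    relevant remaining document overall (Eq. 12).
    """
    n_groups = len(merit)
    # Per-group candidate lists sorted by estimated relevance (priority queues).
    queues = [sorted(np.where(groups == g)[0], key=lambda d: -scores[d])
              for g in range(n_groups)]
    group_size = np.array([(groups == g).sum() for g in range(n_groups)])
    ranking = []
    for pos in range(k):
        if rng.random() < lam:                      # fairness branch
            exp_mer = exposure_cum / (tau * merit)  # Eq. (8)
            order = np.argsort(exp_mer, kind="stable")  # lowest Exp_Mer first
            g = next(gi for gi in order if queues[gi])  # Eq. (11)
        else:                                       # relevance branch (Eq. 12)
            g = max((gi for gi in range(n_groups) if queues[gi]),
                    key=lambda gi: scores[queues[gi][0]])
        ranking.append(queues[g].pop(0))
        exposure_cum[g] += prop[pos] / group_size[g]  # update Eq. (6)
    return ranking

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
groups = np.array([0, 0, 0, 1, 1, 1])
prop = 1.0 / np.log2(2.0 + np.arange(6))
rng = np.random.default_rng(1)
# With lam = 0 this reduces to a plain relevance ranking.
print([int(d) for d in mmf_rank(scores, groups, np.array([0.8, 0.5]),
                                np.zeros(2), tau=1, k=4, lam=0.0,
                                prop=prop, rng=rng)])  # -> [0, 1, 2, 3]
```

With `lam = 1.0` the same call alternates toward whichever group has fallen behind its merit-proportional exposure, which is the fairness-only extreme of the controller.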
MMF implements a simple yet effective method to control the relevance and fairness of top-k rankings in dynamic LTR by adding a stochastic controller parameterized by λ. Intuitively, the idea of the controller is to choose the final selected item d_τ^k between the document that maximizes ranking relevance, d̄_τ^k, and the document that maximizes marginal fairness, d̃_τ^k, with probability λ. Formally, we have

  d_τ^k ∼ λ · d̃_τ^k + (1 − λ) · d̄_τ^k    (14)

The greater λ is, the fairer the ranking is and the lower the ranking relevance is. Different from the linear trade-off strategy used in other fairness algorithms [20, 25], which directly combines the fairness and relevance scores before the document selection process, we adopt a probabilistic strategy to strike the balance, which not only makes the algorithm more robust to the magnitude of the estimated relevance and fairness scores, but also provides explicit controls to balance relevance and fairness in practical applications.

The overall MMF procedure is shown in Algorithm 1. At step 1, we do initialization. At step 2, a user enters the dynamic LTR system. At steps 3–7, we dynamically learn an LTR model R_θ or directly use the averaged inverse propensity weighted clicks R_IPS to estimate the relevance of items. The LTR model has advantages over R_IPS as it can estimate personalized relevance if the features contain user information. At steps 8–10, we construct priority queues for each group with the estimated relevance from steps 3–7. In practice, we could construct the priority queues while estimating relevance at the same time. At steps 12–21, we select items according to Equations (12) to (14). At step 15, G_ind is a function that gets the group index of an item. At step 22, we collect clicks in a real-world application or sample clicks according to Eq. (3) for our experiments. At steps 23–27, we update the relevance estimation accordingly. Note that, at different time steps, users interact with the same set of item candidates.

Algorithm 1: MMF
1:  initialize λ within [0,1], k, c_t ← 0; initialize R_θ, R_IPS(d) ← 0
2:  for each user (time step τ) do
3:    if Use_LTR_model then
4:      estimate relevance for all items with model R_θ
5:    else
6:      estimate relevance for all items with R_IPS(d) according to Eq. (5)
7:    end
8:    for each group G_i do
9:      construct a priority queue G_i^k of size k for group G_i with the estimated relevance
10:   end
11:   ranking = []
12:   for each rank i do
13:     if random > λ then
14:       select d_i with Eq. (12); ranking.append(d_i)
15:       j = G_ind(d_i)  // get the group index of d_i and assign it to j
16:       G_j^k.pop()
17:     else
18:       select G_τ^k according to Eq. (11) and assign its index to j
19:       ranking.append(G_j^k.pop())
20:     end
21:   end
22:   present the ranking and collect user clicks c_τ, or sample clicks according to Eq. (3)
23:   if Use_LTR_model then
24:     train R_θ with the loss L̃_τ(θ) in Eq. (2)
25:   else
26:     update R_IPS(d) according to Eq. (5)
27:   end
28: end

To illustrate the efficiency of MMF, we conduct a complexity analysis against the state-of-the-art fairness algorithm FairCo [20]. FairCo is one of the most efficient fairness algorithms for dynamic LTR. It achieves fairness by dynamically adding perturbations to the ranking scores of documents based on the exposure of different groups. As relevance is estimated separately, we only discuss the complexity of fairness control in FairCo and MMF.

Time complexity. Let k be the number of ranks we care about, |G| be the number of groups, and n be the total number of documents to rank. As FairCo needs to add score perturbations to all documents, it needs to track the cumulative group exposure of all documents (i.e., Exp_cum_τ^n(G_i) in Eq. (8)) and do linear interpolation in O(n) time. It needs O(k · log(n)) time to select the top-k results from n documents and thus has an overall time complexity of O(k · log(n) + n). In contrast, MMF only tracks Exp_cum_τ^k(G_i) for the top-k results in O(|G| · k) time. Besides, it takes O(|G| · k · log(n)) time to construct the priority queue for each group, and O(|G| · k) time to implement Eq. (11). The overall complexity of MMF is O(k · |G| · (1 + log(n))). Therefore, MMF takes O((n − |G| · k) + k · log(n) · (1 − |G|)) less time than FairCo. Because the number of documents is usually much larger than the number of ranks and the number of groups we care about (i.e., n ≫ k and n ≫ |G|) in most ranking applications, MMF outperforms FairCo in time.

Space complexity. MMF needs to track
Exp_cum_τ^k(G_i) in Eq. (8) for all positions in top-k rankings for the computation of marginal fairness, which has a space complexity of O(|G| · k). FairCo needs to track the overall cumulative exposure of all documents, which has a space complexity of O(n). As the number of groups (e.g., race, gender) and the number of ranks we care about are usually small, MMF outperforms FairCo in space.

5 EXPERIMENTS
To evaluate our method, we conduct experiments on one dataset with simulated preference data (i.e., the News dataset [10]) and one dataset with real-world preference data (i.e., the Movie dataset [10]). All the experimental scripts and model implementations used in this paper are available online at https://github.com/Taosheng-ty/Dynamic-Fairness.git.

In this paper, we create a simulated preference dataset with the news articles in the AdFrontes Media Bias dataset, which we refer to as the News dataset. In this News dataset, each article contains a polarity value ρ_d that has been rescaled to the range between −1 and 1 (i.e., left-leaning to right-leaning). Following the methodology used by Morik et al. [20], we simulate a dynamic LTR problem on the News dataset with simulated users, assigning each user u_t a polarity preference drawn from a mixture of two Gaussian distributions, clipped to [−1, 1]:

  ρ_{u_t} ∼ clip_{[−1,1]}( p_neg · N(−μ, σ²) + (1 − p_neg) · N(μ, σ²) )    (15)

where p_neg is the probability of the user being left-leaning, and μ and σ are fixed simulation constants. In addition, we simulate each user with an openness parameter o_{u_t}, drawn uniformly from a fixed range, indicating the breadth of interest outside their polarity. With the sampled user u_t, ρ_{u_t}, and the articles' polarity annotations ρ_d, we synthesize a true binary relevance judgment following a Bernoulli distribution as

  r_t(d) ∼ Bernoulli[ p = exp( −(ρ_{u_t} − ρ_d)² / o_{u_t}² ) ]    (16)

In each experimental trial, we sample a set of 30 news articles D to recommend to users.
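This user and click simulation can be sketched as below. The exact mixture and openness constants are not reproduced here, so the values used (means of ±0.5, standard deviation 0.2, and an openness range of U(0.05, 0.55)) are stand-in assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
P_NEG = 0.5           # assumed share of left-leaning users
MU, SIGMA = 0.5, 0.2  # stand-in mixture parameters (assumptions, not the
                      # paper's exact constants)

def sample_user():
    """Eq. (15)-style user: polarity from a clipped two-Gaussian mixture,
    plus an openness parameter (stand-in range U(0.05, 0.55))."""
    mean = -MU if rng.random() < P_NEG else MU
    rho_u = float(np.clip(rng.normal(mean, SIGMA), -1.0, 1.0))
    o_u = rng.uniform(0.05, 0.55)
    return rho_u, o_u

def sample_relevance(rho_u, o_u, rho_docs):
    """Eq. (16)-style binary relevance: Bernoulli with a Gaussian-kernel
    probability in the user-document polarity gap."""
    p = np.exp(-((rho_u - rho_docs) ** 2) / (o_u ** 2))
    return (rng.random(len(rho_docs)) < p).astype(int)

rho_docs = rng.uniform(-1.0, 1.0, size=30)  # 30 simulated news articles
rho_u, o_u = sample_user()
r = sample_relevance(rho_u, o_u, rho_docs)  # binary relevance vector r_t
```

A small openness o_u makes the relevance kernel sharply peaked around the user's own polarity, so less open users click almost exclusively on articles matching their leaning.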
To investigate ranking fairness, we group articles according to their polarity, assigning articles with ρ_d ∈ [−1, 0) to group G_1 and articles with ρ_d ∈ [0, 1] to group G_2. We implement four baselines for comparison. The first one is the Naive method, which ranks documents by the sum of their observed user clicks (i.e., c_t). The second one is a simple unbiased LTR algorithm, DULTR(Glob), which ranks documents by the unbiased relevance estimation R_IPS(d) in Eq. (5). To show the effectiveness of MMF as a fairness algorithm, we also include two state-of-the-art fairness algorithms for dynamic LTR, which are LinProg [25] and
FairCo [20]. Proposed by Singh and Joachims [25], LinProg considers group fairness as an optimization constraint and implements a linear programming algorithm to maximize ranking fairness based on the relevance estimated by R_IPS(d). In theory, LinProg would produce the best ranked lists in terms of merit-based fairness. FairCo achieves group fairness by dynamically adding perturbations to the ranking scores of documents based on the exposure of different groups. It has been proven effective in optimizing the fairness of dynamic LTR and is much more efficient than LinProg. For all models in the experiments on the News dataset, we use R_IPS(d) to estimate document relevance directly as all users are simulated. This means Use_LTR_model in Algorithm 1 is set to False. LinProg and FairCo use a hyper-parameter to reduce the unfairness of exposure over all documents. For simplicity, we set it to 0.1 for LinProg and 0.01 for FairCo, as these were the tuned hyper-parameters in Morik et al. [20]. For MMF, we tune λ from 0 to 1 and show the corresponding results in Table 2a and Figure 4. Besides, we follow the experimental setting in [20] and adopt the discount function of NDCG as the user's examination probability for each position in our simulation,

  p_i = 1 / log₂(1 + i)    (17)

where p_i indicates the examination probability at rank i.

Figure 1: The absolute difference between estimated global relevance and true global relevance on News (20 trials).

Figure 2: Convergence of NDCG@10 and Unfairness@10 as the number of users increases on News (20 trials).
Fig. 1 shows the absolute difference between the estimated global relevance and the true global relevance defined in Eq. (5) after applying different ranking algorithms, including all algorithms with IPS weighting (LinProg, FairCo, D_ULTR(Glob), and MMF with λ = 0.6).

Can MMF effectively reduce unfairness while maintaining good ranking performance? Fig. 2 shows the convergence of NDCG@10 and Unfairness@10 for Naive, D_ULTR(Glob), FairCo, LinProg, and MMF (λ = 0.6).

Figure 3: Performance of NDCG (top) and Unfairness (bottom) for different prefixes on News (20 trials).

MMF achieves comparable ranking performance with D_ULTR(Glob). For the other two algorithms, LinProg and FairCo both show inferior performance compared to MMF in terms of NDCG and unfairness.
How does MMF perform at different prefixes of a ranking? Fig. 3 shows the performance of NDCG and unfairness at different prefixes for Naive, D_ULTR(Glob), FairCo, LinProg, and MMF (λ = 0.6).

How does λ control the trade-off between ranking performance and unfairness for MMF? Fig. 4 shows the performance of NDCG@10 and Unfairness@10 for MMF with different λ after 6,000 user interactions. Note that we only show LinProg and FairCo with the tuned hyper-parameters reported in [20], so their performances are constants. As we can see, the hyper-parameter λ in MMF enables us to explicitly control the trade-off between relevance and fairness. When λ =
0, MMF degenerates to D_ULTR(Glob) and thus has similar unfairness to D_ULTR(Glob), greater than that of LinProg and FairCo. As λ increases, unfairness gradually decreases and becomes lower than that of LinProg and FairCo on top results. From the figure, we can see that choosing λ from 0.4 to 0.7 for MMF achieves better performance than the baseline algorithms on both relevance and fairness.

Table 2: Comparison of MMF with different baselines on News and Movie. Significant improvements or degradations with respect to the performance of FairCo are indicated with +/− in a paired t-test with p ≤ 0.05. The best performance of fair algorithms in each column is highlighted in boldface.

(a) Performance of learning-to-rank algorithms on News data.

Algorithm      NDCG@3  NDCG@5  NDCG@10  NDCG@all  Unf.@3  Unf.@5  Unf.@10  Unf.@all
Naive          0.418   –       –        –         –       –       –        –
D_ULTR(Glob)   0.438   –       –        –         –       –       –        –
FairCo         0.434   0.443   0.483    0.705     0.036   0.037   0.049    0.015
LinProg        0.433   0.439   0.462    0.694     0.043   0.051   0.065    –
MMF            –       –       –        –         –       –       –        –

(b) Performance of learning-to-rank algorithms on Movie data.

Algorithm      NDCG@3  NDCG@5  NDCG@10  NDCG@all  Unf.@3  Unf.@5  Unf.@10  Unf.@all
Naive          0.633   –       –        –         –       –       –        –
D_ULTR(Glob)   0.671   –       –        –         –       –       –        –
D_ULTR         0.827   –       –        –         –       –       –        –
FairCo         –       –       –        –         –       –       –        –
MMF (λ = 0.1)  0.810   –       –        –         –       –       –        –

Figure 4: Performance of NDCG@10 (top) and Unfairness@10 (bottom) of MMF with different λ on News (20 trials).

To evaluate our method on real-world preference data, we use the ML20M dataset, which we refer to as the Movie dataset. Following the preprocessing method in [20], we select the five production companies with the most movies in the dataset (MGM, Warner Bros, Paramount, 20th Century Fox, Columbia). We aim to ensure fairness of exposure for films from the five production companies, which means movies from the same company belong to the same group. A set of the 300 most-rated movies from those production companies is selected. Then the 100 movies with the highest standard deviation in ratings across users are selected. For users, we select the 10 users who have rated the most of the chosen 100 movies. Finally, we get a partially filled rating matrix with 10 users and 100 movies. We use an off-the-shelf matrix factorization algorithm (SVD from the Surprise library, http://surpriselib.com/, with biased=False and D=50) to fill in the missing entries. We then normalize the ratings to [0, 1] by applying a Sigmoid function centered at a fixed rating b, so that higher ratings correspond to higher likelihoods of positive feedback, which is used to generate clicks. We use the user embeddings from the matrix factorization model as user features x_t and keep the dimension of the user features at 50. At each time step t, we sample a user x_t and the ranking algorithm presents a ranking of the 100 movies. For the user personal relevance estimation model R_θ used by FairCo, DULTR, and our method, we use a one-hidden-layer neural network that consists of D =
50 input nodes, which corresponds to userfeature dimension, then fully connected to 64 nodes in the hiddenlayer with RELU activation, which is then connected 100 outputnodes with Sigmoid to output the predicted relevance probabilityfor the 100 selected movies.Besides the baselines discussed in Section 5.1, we also includea new baseline in the experiments on the Movie dataset, whichis
D-ULTR . D-ULTR conducts unbiased learning to train the LTRmodel with click data. Different from D-ULTR(Glob) that directlyranks documents with 𝑅 𝐼𝑃𝑆 ( 𝑑 ) , D-ULTR can model personalizedrelevance by taking user features into account when building theLTR model. Thus, we expect D-ULTR to perform much better thanD-ULTR(Glob) on the Movie dataset with real user preferences.Similarly, we train LTR models with user features for both FairCoand MMF. Note that we exclude LinProg on the Movie dataset as it isdesigned to work with global relevance and is too computationallyexpensive for rankings with a large number of documents. 𝑀𝑀𝐹 effectively reduce unfairness while maintaininggood ranking performance?
We show the performance of ranking relevance and fairness for all algorithms on the Movie dataset in Fig. 5. For reference, we plot a Skyline model that trains the LTR model with ground-truth relevance judgments and ranks documents via the estimated relevance output by the LTR model.

Figure 5: Convergence of NDCG@10 and Unfairness@10 as the number of users increases on Movie (5 trials).

Figure 6: Performance of NDCG (top) and Unfairness (bottom) of MMF with different prefixes on Movie (5 trials).

First, as shown in Fig. 5, personalization does help reach better ranking performance, as [20] already reported. Ranking algorithms relying on personalized relevance (D-ULTR, FairCo, MMF, Skyline) show superior ranking performance to algorithms such as Naive and D-ULTR(Glob), where no personalization is used. Ranking algorithms involving IPS and personalization can approach the Skyline as more user interactions become available, which again verifies the effectiveness of IPS and personalization.

Having shown the effectiveness of personalization and IPS, we now turn to the overall performance shown in Table 2b. Naive and D-ULTR(Glob) show the worst performance in terms of both ranking and fairness, since neither personalization nor fairness control is involved. Compared to FairCo, our method MMF (λ = 0.1) …
Figure 7: Performance of NDCG@10 (top) and Unfairness@10 (bottom) of MMF with different λ on Movie (5 trials).

How does MMF perform at different prefixes of a ranking?
Fig. 6 shows the performance of NDCG and unfairness at different prefixes for Naive, D-ULTR(Glob), D-ULTR, FairCo, and MMF (λ = 0.1), with k from 3 to 50. While FairCo has excellent performance on unfairness@all, it significantly sacrifices both relevance and fairness on top results. It even shows more unfairness on top results than unfair algorithms such as D-ULTR.

How does λ control the trade-off between ranking performance and unfairness for MMF? Fig. 7 shows NDCG@10 and unfairness@10 for MMF with different λ after 6,000 user interactions. Again, we report FairCo with the best parameter settings here. As λ gradually increases, a trade-off can be seen in the growing fairness (i.e., decreasing unfairness) and the decreasing ranking performance. We may choose λ between 0.0 and 0.2, within which MMF has better ranking performance as well as less unfairness than the baselines. The optimal λ for the Movie data is smaller than for the News data; we attribute this to their different relevance distributions. Documents in the News dataset often have similar relevance, and in such situations we should pay more attention to fairness (a greater λ), since the relevance scores of different documents are close to each other.

CONCLUSION AND FUTURE WORK

In this work, we propose the concept of marginal fairness and a Maximal Marginal Fairness (MMF) algorithm for balancing the relevance and fairness of top-k results in dynamic learning to rank. We develop a metric to measure the group fairness of exposure in the top-k results of each ranked list and show that most existing state-of-the-art methods for ranking fairness focus on the overall fairness of document exposure while compromising heavily in the top ranks of each ranked list. In contrast, our proposed MMF algorithm explicitly maximizes the marginal fairness of top-k rankings and can produce better rankings than the state-of-the-art fairness algorithms in both top-k relevance and top-k fairness.
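The greedy top-k construction that MMF performs can be sketched in a few lines. This is a hedged illustration, not the paper's exact estimators: we assume a linear (1 − λ)/λ combination of a document's relevance and a merit-vs-exposure deficit for its group as the marginal fairness term, with a logarithmic position-bias exposure weight:

```python
import numpy as np

def mmf_rank(relevance, groups, merit, lam, k):
    """Sketch of MMF-style greedy selection: at each rank position, pick the
    remaining document maximizing (1 - lam) * relevance + lam * fairness gain,
    where the fairness gain favors the group whose exposure share most lags
    its merit share (assumed definitions, for illustration only)."""
    exposure = {g: 0.0 for g in set(groups)}
    total_merit = float(sum(merit.values()))
    ranking, remaining = [], set(range(len(relevance)))
    for pos in range(k):
        w = 1.0 / np.log2(pos + 2)  # assumed position-bias exposure weight
        total_exp = sum(exposure.values()) + w

        def gain(d):
            g = groups[d]
            # how far the group's exposure share would lag its merit share
            deficit = merit[g] / total_merit - (exposure[g] + w) / total_exp
            return (1 - lam) * relevance[d] + lam * deficit

        best = max(remaining, key=gain)
        remaining.remove(best)
        exposure[groups[best]] += w
        ranking.append(best)
    return ranking
```

With lam = 0 this degenerates to a pure relevance sort, mirroring the observation above that MMF reduces to D_ULTR(Glob); larger lam shifts exposure toward under-served groups at the top of the list.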
In the future, we will further explore the possibility of extending MMF to more general ranking scenarios or constructing new LTR models that integrate the optimization of relevance and fairness from the bottom of the model design.
ACKNOWLEDGMENTS
This work was supported in part by the School of Computing, University of Utah. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
REFERENCES

[1] Aman Agarwal, Kenta Takatsu, Ivan Zaitsev, and Thorsten Joachims. 2019. A general framework for counterfactual learning-to-rank. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 5–14.
[2] Aman Agarwal, Ivan Zaitsev, Xuanhui Wang, Cheng Li, Marc Najork, and Thorsten Joachims. 2019. Estimating position bias without intrusive interventions. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 474–482.
[3] Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W. Bruce Croft. 2018. Unbiased learning to rank with unbiased propensity estimation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 385–394.
[4] Qingyao Ai, Tao Yang, Huazheng Wang, and Jiaxin Mao. 2020. Unbiased Learning to Rank: Online or Offline? arXiv preprint arXiv:2004.13574 (2020).
[5] Asia J. Biega, Krishna P. Gummadi, and Gerhard Weikum. 2018. Equity of attention: Amortizing individual fairness in rankings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 405–414.
[6] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 335–336.
[7] L. Elisa Celis, Damian Straszak, and Nisheeth K. Vishnoi. 2018. Ranking with Fairness Constraints. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[8] Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. 2008. An experimental comparison of click position-bias models. In Proceedings of the 1st WSDM. ACM, 87–94.
[9] Georges E. Dupret and Benjamin Piwowarski. 2008. A user browsing model to predict search engine click data from past observations. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 331–338.
[10] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2015), 1–19.
[11] Zhengbao Jiang, Ji-Rong Wen, Zhicheng Dou, Wayne Xin Zhao, Jian-Yun Nie, and Ming Yue. 2017. Learning to diversify search results via subtopic attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 545–554.
[12] Jiarui Jin, Yuchen Fang, Weinan Zhang, Kan Ren, Guorui Zhou, Jian Xu, Yong Yu, Jun Wang, Xiaoqiang Zhu, and Kun Gai. 2020. A Deep Recurrent Survival Model for Unbiased Ranking. arXiv preprint arXiv:2004.14714 (2020).
[13] Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD. ACM, 133–142.
[14] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2005. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th Annual ACM SIGIR. ACM, 154–161.
[15] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2017. Accurately interpreting clickthrough data as implicit feedback. In ACM SIGIR Forum, Vol. 51. ACM, New York, NY, USA, 4–11.
[16] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. 2007. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Transactions on Information Systems 25, 2 (2007), 7.
[17] Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 781–789.
[18] Tie-Yan Liu. 2011. Learning to rank for information retrieval. Springer Science & Business Media.
[19] Stepan Malkevich, Ilya Markov, Elena Michailova, and Maarten de Rijke. 2017. Evaluating and Analyzing Click Simulation in Web Search. In Proceedings of the ACM ICTIR (Amsterdam, The Netherlands) (ICTIR '17). ACM, 281–284.
[20] Marco Morik, Ashudeep Singh, Jessica Hong, and Thorsten Joachims. 2020. Controlling Fairness and Bias in Dynamic Learning-to-Rank. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR '20). Association for Computing Machinery, New York, NY, USA, 429–438. https://doi.org/10.1145/3397271.3401100
[21] Zohreh Ovaisi, Ragib Ahsan, Yifan Zhang, Kathryn Vasilaky, and Elena Zheleva. 2020. Correcting for Selection Bias in Learning-to-rank Systems. In Proceedings of The Web Conference 2020. 1863–1873.
[22] Filip Radlinski and Thorsten Joachims. 2006. Minimally invasive randomization for collecting unbiased preferences from clickthrough logs. In Proceedings of the National Conference on Artificial Intelligence, Vol. 21. AAAI Press/MIT Press, 1406.
[23] Stephen E. Robertson. 1977. The probability ranking principle in IR. Journal of Documentation (1977).
[24] LT Rodrygo, Craig Macdonald, and Iadh Ounis. 2015. Search result diversification. Foundations and Trends in Information Retrieval 9, 1 (2015), 1–90.
[25] Ashudeep Singh and Thorsten Joachims. 2018. Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2219–2228.
[26] Ashudeep Singh and Thorsten Joachims. 2019. Policy learning for fairness in ranking. In Advances in Neural Information Processing Systems. 5426–5436.
[27] Chao Wang, Yiqun Liu, Min Zhang, Shaoping Ma, Meihong Zheng, Jing Qian, and Kuo Zhang. 2013. Incorporating vertical results into search click models. In Proceedings of the 36th ACM SIGIR. ACM, 503–512.
[28] Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. 2016. Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 115–124.
[29] Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. 2018. Position bias estimation for unbiased learning to rank in personal search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 610–618.
[30] Long Xia, Jun Xu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. 2015. Learning maximal marginal relevance model via directly optimizing diversity evaluation measures. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 113–122.
[31] Long Xia, Jun Xu, Yanyan Lan, Jiafeng Guo, Wei Zeng, and Xueqi Cheng. 2017. Adapting Markov decision process for search result diversification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 535–544.
[32] Ke Yang and Julia Stoyanovich. 2017. Measuring fairness in ranked outputs. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management. 1–6.
[33] Yisong Yue, Rajan Patel, and Hein Roehrig. 2010. Beyond position bias: Examining result attractiveness as a source of presentation bias in clickthrough data. In Proceedings of the 19th WWW. ACM, 1011–1018.
[34] Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, and Ricardo Baeza-Yates. 2017. FA*IR: A fair top-k ranking algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1569–1578.
[35] Meike Zehlike and Carlos Castillo. 2020. Reducing disparate exposure in ranking: A learning to rank approach. In