Algorithmic Monoculture and Social Welfare
Jon Kleinberg Manish Raghavan
Abstract
As algorithms are increasingly applied to screen applicants for high-stakes decisions in employment, lending, and other domains, concerns have been raised about the effects of algorithmic monoculture, in which many decision-makers all rely on the same algorithm. This concern invokes analogies to agriculture, where a monocultural system runs the risk of severe harm from unexpected shocks. Here we show that the dangers of algorithmic monoculture run much deeper, in that monocultural convergence on a single algorithm by a group of decision-making agents, even when the algorithm is more accurate for any one agent in isolation, can reduce the overall quality of the decisions being made by the full collection of agents. Unexpected shocks are therefore not needed to expose the risks of monoculture; it can hurt accuracy even under "normal" operations, and even for algorithms that are more accurate when used by only a single decision-maker. Our results rely on minimal assumptions, and involve the development of a probabilistic framework for analyzing systems that use multiple noisy estimates of a set of alternatives.
The rise of algorithms used to shape societal choices has been accompanied by concerns over monoculture — the notion that choices and preferences will become homogeneous in the face of algorithmic curation. One of many canonical articulations of this concern was expressed in the New York Times by Farhad Manjoo, who wrote, "Despite the barrage of choice, more of us are enjoying more of the same songs, movies and TV shows" [15]. Because of algorithmic curation, trained on collective social feedback [20], our choices are converging.

When we move from the influence of algorithms on media consumption and entertainment to their influence on high-stakes screening decisions about whom to offer a job or whom to offer a loan, the concerns about algorithmic monoculture become even starker. Even if algorithms are more accurate on a case-by-case basis, a world in which everyone uses the same algorithm is susceptible to correlated failures when the algorithm finds itself in adverse conditions. This type of concern invokes an analogy to agriculture, where monoculture makes crops susceptible to the attack of a single pathogen [18]; the analogy has become a mainstay of the computer security literature [3], and it has recently become a source of concern about screening decisions for jobs or loans as well. Discussing the post-recession financial system, Citron and Pasquale write, "Like monocultural-farming technology vulnerable to one unanticipated bug, the converging methods of credit assessment failed spectacularly when macroeconomic conditions changed" [6].

The narrative around algorithmic monoculture thus suggests a trade-off: in "normal" conditions, a more accurate algorithm will improve the average quality of screening decisions, but when conditions change through an unexpected shock, the results can be dramatically worse. But is this trade-off genuine?
In the absence of shocks, does monocultural convergence on a single, more accurate screening algorithm necessarily lead to better average outcomes?

In this work, we show that algorithmic monoculture poses risks even in the absence of shocks. We investigate a model involving minimal assumptions, in which two competing firms can either use their own independent heuristics to perform screening decisions or they can use a more accurate algorithm that is accessible to both of them. (Again, we think of screening job applicants or loan applicants as a motivating scenario.) We find that even though it would be rational for each firm in isolation to adopt the algorithm, it is possible for the use of the algorithm by both firms to result in decisions that are worse on average. This in turn leads, in the language of game theory, to a type of "Braess' paradox" [5] for screening algorithms: the introduction of a more accurate algorithm can drive the firms into a unique equilibrium that is worse for society than the one that was present before the algorithm existed.

Note that the harm here is to overall performance. Another common concern about algorithmic monoculture in screening decisions is the harm it can cause to specific individuals: if all employers or lenders use the same algorithm for their screening decisions, then particular applicants might find themselves locked out of the market when this shared algorithm doesn't like their application for some reason.
While this is clearly also a significant concern, our results show that it would be a mistake to view the harm to particular applicants as necessarily balanced against the gains in overall accuracy — rather, it is possible for algorithmic monoculture to cause harm not just to particular applicants but also to the average quality of decisions as well.

Our results thus have a counterintuitive flavor to them: if an algorithm is clearly more accurate than the alternatives when one entity uses it, why does the accuracy become worse than the alternatives when multiple entities use it? The analysis relies on deriving some novel probabilistic properties of rankings, establishing that when we are constructing a ranking from a probability distribution representing a "noisy" version of a true ordering, we can sometimes achieve less error through an incremental construction of the ranking — building it one element at a time — than we can by constructing it in a single draw from the distribution. We now set up the basic model, and then frame the probabilistic questions that underpin its analysis.

Figure 1: Ranking candidates by algorithmically generated scores. (Source: https://business.linkedin.com/talent-solutions/blog/recruiting-strategy/2018/the-new-way-companies-are-evaluating-candidates-soft-skills-and-discovering-high-potential-talent)

To instantiate the ideas introduced thus far, we'll focus on the case of algorithmic hiring, where recruiters make decisions based in part on scores or recommendations provided by data-driven algorithms. In this setting, we'll propose and analyze a stylized model of algorithmic hiring with which we can begin to investigate the effects of algorithmic monoculture. Informally, we can think of a simplified hiring process as follows: rank all of the candidates (see Figure 1) and select the first available one.
We suppose that each firm has two options to form this ranking: either develop their own, private ranking (which we will refer to as using a "human evaluator"), or use an algorithmically produced ranking. We assume that there is a single vendor of algorithmic rankings, so all firms choosing to use the algorithm receive the same ranking. The firms proceed sequentially, each hiring their favorite remaining candidate according to the ranking they're using — human-generated or algorithmic. Thus, we can frame the effects of monoculture as follows: are firms better off using the more accurate, common algorithm, or should they instead employ their own less accurate, but private, evaluations? In what follows, we'll introduce a formal model of evaluation and selection, using it to analyze a setting in which firms seek to hire candidates.
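As a minimal sketch of the selection mechanics just described (our illustration, not code from the paper; the candidate names are hypothetical), each firm in turn hires its top-ranked candidate who has not yet been taken:

```python
def run_hiring(rankings):
    """Simulate sequential hiring: firms, in the given order, each hire
    their top-ranked candidate who has not yet been taken."""
    taken = set()
    hires = []
    for ranking in rankings:
        for candidate in ranking:
            if candidate not in taken:
                hires.append(candidate)
                taken.add(candidate)
                break
    return hires

# If both firms use the same ranking, the second firm gets the runner-up;
# with an independent ranking, the second firm may still get its own
# first choice.
print(run_hiring([["alice", "bob", "carol"], ["alice", "bob", "carol"]]))
print(run_hiring([["alice", "bob", "carol"], ["bob", "alice", "carol"]]))
```

In the first call the second firm is bumped to its second choice; in the second call it obtains its own first choice, "bob".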
More formally, we model the n candidates as having intrinsic values x_1, . . . , x_n, where any employer would derive utility x_i from hiring candidate i. Throughout the paper, we assume without loss of generality that x_1 > x_2 > · · · > x_n. These values, however, are unknown to the employer; instead, they must use some noisy procedure to rank the candidates. We model such a procedure as a randomized mechanism R that takes in the true candidate values and draws a permutation π over those candidates from some distribution. Our main results hold for families of distributions over permutations as defined below:

Definition 1 (Noisy permutation family). A noisy permutation family F_θ is a family of distributions over permutations that satisfies the following conditions for any θ > 0 and set of candidates x:

1. (Differentiability) For any permutation π, Pr_{F_θ}[π] is continuous and differentiable in θ.
2. (Asymptotic optimality) For the true ranking π*, lim_{θ→∞} Pr_{F_θ}[π*] = 1.
3. (Monotonicity) For any (possibly empty) S ⊂ x, let π^(−S) be the partial ranking produced by removing the items in S from π. Let π_1^(−S) denote the value of the top-ranked candidate according to π^(−S). For any θ′ > θ,

    E_{F_{θ′}}[π_1^(−S)] ≥ E_{F_θ}[π_1^(−S)].   (1)

Moreover, for S = ∅, (1) holds with strict inequality.

θ serves as an "accuracy parameter": for large θ, the noisy ranking converges to the true ranking over candidates. The monotonicity condition states that a higher value of θ leads to a better first choice, even if some of the candidates are removed after ranking. Removal after ranking (as opposed to before) is important because some of the ranking models we will consider later do not satisfy Independence of Irrelevant Alternatives.
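As a concrete illustration of Definition 1 (our sketch, not the paper's code), consider a random utility model with Gaussian noise: each candidate's value is perturbed by noise with standard deviation 1/θ, and candidates are ranked by their perturbed scores. Larger θ means less noise, so the sampled permutation concentrates on the true ranking as θ → ∞, matching the asymptotic-optimality condition.

```python
import random

def sample_noisy_permutation(values, theta, rng=random):
    """Gaussian RUM as a noisy permutation family: rank candidates by
    x_i + eps_i with eps_i ~ N(0, 1/theta^2); return candidate indices,
    best first."""
    return sorted(range(len(values)),
                  key=lambda i: values[i] + rng.gauss(0.0, 1.0 / theta),
                  reverse=True)

random.seed(0)
# With a large accuracy parameter, the noisy ranking almost surely
# recovers the true ordering x_1 > x_2 > x_3.
perm = sample_noisy_permutation([3.0, 2.0, 1.0], theta=100.0)
```

With θ = 100 the noise standard deviation is 0.01, far smaller than the gaps between values, so `perm` comes out as the true ordering `[0, 1, 2]`.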
Examples of noisy permutation families include Random Utility Models [23] and the Mallows Model [14], both of which we will discuss in detail later.

As an objective function to evaluate the effects of different approaches to ranking and selection, we'll consider each individual employer's utility as well as the sum of employers' utilities. We think of this latter sum as the social welfare, since it represents the total quality of the applicants who are hired by any firm. (For example, if all firms deterministically used the correct ranking, then the top applicants would be the ones hired, leading to the highest possible social welfare.)

Each firm in our model has access to the same underlying pool of n candidates, which they rank using a randomized mechanism R to get a permutation π as described above. Then, in a random order, each firm hires the highest-ranked remaining candidate according to their ranking. Thus, if two firms both rank candidate i first, only one of them can hire i; the other must hire the next available candidate according to their ranking. In our model, candidates automatically accept the offer they get from a firm. For the sake of simplicity, throughout this paper, we restrict ourselves to the case where there are two firms hiring one candidate each, although our model readily generalizes to more complex cases.

As described earlier, each firm can choose to use either a private "human evaluator" or an algorithmically generated ranking as its randomized mechanism R. We assume that both candidate mechanisms come from a noisy permutation family F_θ, with differing values of the accuracy parameter θ: human evaluators all have the same accuracy θ_H, and the algorithm has accuracy θ_A. However, while the human evaluator produces a ranking independent of any other firm, the algorithmically generated ranking is identical for all firms who choose to use it.
In other words, if two firms choose to use the algorithmically generated ranking, they will both receive the same permutation π.

The choice of which ranking mechanism to use leads to a game-theoretic setting: both firms know the accuracy parameters of the human evaluators (θ_H) and the algorithm (θ_A), and they must decide whether to use a human evaluator or the algorithm. This choice introduces a subtlety: for many ranking models, a firm's rational behavior depends not only on the accuracy of the ranking mechanism, but also on the underlying candidate values x_1, . . . , x_n. Thus, to fully specify a firm's behavior, we assume that x_1, . . . , x_n are drawn from a known joint distribution D. Our main results will hold for any D, meaning they apply even when the candidate values (but not their identities) are deterministically known.

Our main result is a pair of intuitive conditions under which a Braess' Paradox-style result occurs — in other words, conditions under which there are accuracy parameters for which both firms rationally choose to use the algorithmic ranking, but social welfare (and each individual firm's utility) would be higher if both firms used independent human evaluators. Recall that the two firms hire in a random order. For a permutation π, let π_i denote the value of the i-th-ranked candidate according to π. We first state the two conditions, and then the theorem based on them.

Definition 2 (Preference for the first position). A candidate distribution D and noisy permutation family F_θ exhibit a preference for the first position if for all θ > 0, for π, σ ∼ F_θ,

    E[π_1 − π_2 | π_1 ≠ σ_1] > 0.

In other words, for any θ > 0, suppose we draw two permutations π and σ independently from F_θ, and suppose that the first-ranked candidates differ in π and σ. Then the expected value of the first-ranked candidate in π is strictly greater than the expected value of the second-ranked candidate in π.

Definition 3 (Preference for weaker competition). A candidate distribution D and noisy permutation family F_θ exhibit a preference for weaker competition if the following holds: for all θ_1 > θ_2, for σ ∼ F_{θ_1} and π, τ ∼ F_{θ_2},

    E[π_1^(−{σ_1})] < E[π_1^(−{τ_1})].

Intuitively, suppose we have a higher accuracy parameter θ_1 and a lower accuracy parameter θ_2 < θ_1; we draw a permutation π from F_{θ_2}; and we then derive two permutations from π: π^(−{σ_1}), obtained by deleting the first-ranked element of a permutation σ drawn from the more accurate distribution F_{θ_1}, and π^(−{τ_1}), obtained by deleting the first-ranked element of a permutation τ drawn from the less accurate distribution F_{θ_2}. Then the expected value of the first-ranked candidate in π^(−{τ_1}) is strictly greater than the expected value of the first-ranked candidate in π^(−{σ_1}) — that is, when a random candidate is removed from π, the best remaining candidate is better in expectation when the randomly removed candidate is chosen based on a noisier ranking.

Using these two conditions, we can state our theorem.

Theorem 1.
Suppose that a given candidate distribution D and noisy permutation family F_θ satisfy Definition 2 (preference for the first position) and Definition 3 (preference for weaker competition). Then, for any θ_H, there exists θ_A > θ_H such that using the algorithmic ranking is a strictly dominant strategy for both firms, but social welfare would be higher if both firms used human evaluators.

Before we prove Theorem 1, we provide some intuition for the two conditions in Definitions 2 and 3. The second condition essentially says that it is better to have a worse competitor: the firm randomly selected to hire second is better off if the firm that hires first uses a less accurate ranking (in this case, a human evaluator instead of the algorithmic ranking). The first condition states that when two identically distributed permutations disagree on their first element, the first-ranked candidate according to either permutation is still better, in expectation, than the second-ranked candidate according to either permutation. In what follows, we'll demonstrate that this condition implies that firms in our model rationally prefer to make decisions using independent (but equally accurate) rankings.

To do so, we need to introduce some notation. Recall that the two firms hire in a random order. Given a candidate distribution D, let U_1^s(θ_A, θ_H) denote the expected utility of the first firm to hire a candidate when using ranking s, where s ∈ {A, H} is either the algorithmic ranking or the ranking generated by a human evaluator respectively. Similarly, let U_2^{s_1 s_2}(θ_A, θ_H) be the expected utility of the second firm to hire given that the first firm used strategy s_1 and the second firm uses strategy s_2, where again s_1, s_2 ∈ {A, H}. Finally, let π, σ ∼ F_θ. In what follows, we will show that for any θ,

    E[π_1 − π_2 | π_1 ≠ σ_1] > 0  ⟺  U_2^{AH}(θ, θ) > U_2^{AA}(θ, θ).   (2)

In other words, whenever a ranking model meets Definition 2, the firm chosen to select second will prefer to use a ranking mechanism independent from its competitor's, given that the ranking mechanisms are equally accurate. First, we can write

    U_2^{AH}(θ_A, θ_H) = E[π_1 · 1[π_1 ≠ σ_1] + π_2 · 1[π_1 = σ_1]]
    U_2^{AA}(θ_A, θ_H) = E[σ_2] = E[σ_2 · 1[π_1 ≠ σ_1] + σ_2 · 1[π_1 = σ_1]].

Thus,

    U_2^{AH}(θ_A, θ_H) − U_2^{AA}(θ_A, θ_H) = E[(π_1 − σ_2) · 1[π_1 ≠ σ_1] + (π_2 − σ_2) · 1[π_1 = σ_1]].

Conditioned on either π_1 = σ_1 or π_1 ≠ σ_1, π and σ are identically distributed and therefore have equal expectations. As a result,

    U_2^{AH}(θ_A, θ_H) − U_2^{AA}(θ_A, θ_H) = E[(π_1 − π_2) · 1[π_1 ≠ σ_1]],   (3)

which implies (2). Thus, whenever a ranking model meets Definition 2, firms rationally prefer independent assessments, all else equal.

To provide some intuition for what this preference for independence entails, consider a setting where a hiring committee seeks to hire two candidates. They meet, produce a ranking σ, and hire σ_1 (the best candidate according to σ). Suppose they have the option to either hire σ_2 or reconvene the next day to form an independent ranking π and hire the best remaining candidate according to π; which option should they choose? It's not immediately clear why one option should be better than the other. However, whenever Definition 2 is met, the committee should prefer to reconvene and make their second hire according to a new ranking π. After proving Theorem 1, we will provide natural ranking models that meet Definition 2, implying that under these ranking models, independent re-ranking can be beneficial. With this intuition, we are ready to prove Theorem 1.
Proof of Theorem 1.
For given values of θ_A and θ_H, using the algorithmic ranking is a strictly dominant strategy as long as

    U_1^A(θ_A, θ_H) + U_2^{AA}(θ_A, θ_H) > U_1^H(θ_A, θ_H) + U_2^{AH}(θ_A, θ_H)   (4)
    U_1^A(θ_A, θ_H) + U_2^{HA}(θ_A, θ_H) > U_1^H(θ_A, θ_H) + U_2^{HH}(θ_A, θ_H).   (5)

Note that (5) is always true for θ_A > θ_H by the monotonicity assumption on F_θ: U_1^A(θ_A, θ_H) ≥ U_1^H(θ_A, θ_H) because a more accurate ranking produces a top-ranked candidate with higher expected value, and U_2^{HA}(θ_A, θ_H) ≥ U_2^{HH}(θ_A, θ_H) because this holds even conditioned on removing any candidate from the pool (in this case, the candidate randomly selected by the firm that hires first). Crucially, in (5), the first firm's random selection is independent from the second firm's selection; the same logic could not be used to argue that (4) always holds for θ_A ≥ θ_H. Moreover, when θ_A > θ_H, U_1^A(θ_A, θ_H) > U_1^H(θ_A, θ_H) strictly by the monotonicity assumption, meaning (5) holds.

Let W^{s_1 s_2}(θ_A, θ_H) denote social welfare when the two firms employ strategies s_1, s_2 ∈ {A, H}. Then, when both firms use the algorithmic ranking, social welfare is

    W^{AA}(θ_A, θ_H) = U_1^A(θ_A, θ_H) + U_2^{AA}(θ_A, θ_H).

By (2), Definition 2 implies that for any θ, U_2^{AA}(θ, θ) < U_2^{AH}(θ, θ), implying

    U_1^A(θ_H, θ_H) + U_2^{AA}(θ_H, θ_H) < U_1^H(θ_H, θ_H) + U_2^{AH}(θ_H, θ_H).

However, by the optimality assumption on F_θ in Definition 1, for sufficiently large θ̂_A,

    U_1^A(θ̂_A, θ_H) + U_2^{AA}(θ̂_A, θ_H) > U_1^H(θ̂_A, θ_H) + U_2^{AH}(θ̂_A, θ_H).

Note that U_1^s(θ_A, θ_H) and U_2^{s_1 s_2}(θ_A, θ_H) are continuous with respect to θ_A for any s, s_1, s_2 ∈ {A, H}, since they are expectations over discrete distributions with probabilities that are by assumption differentiable with respect to θ_A.
Therefore, by the Differentiability assumption on F_θ from Definition 1, there is some θ_A^* > θ_H such that

    U_1^A(θ_A^*, θ_H) + U_2^{AA}(θ_A^*, θ_H) = U_1^H(θ_A^*, θ_H) + U_2^{AH}(θ_A^*, θ_H),   (6)

i.e., given that its competitor uses the algorithmic ranking, a firm is indifferent between the two strategies. For such θ_A^*, using the algorithmic ranking is still a weakly dominant strategy. By definition of W^{AA},

    W^{AA}(θ_A^*, θ_H) = U_1^H(θ_A^*, θ_H) + U_2^{AH}(θ_A^*, θ_H).

If both firms had instead used human evaluators, social welfare would be

    W^{HH}(θ_A^*, θ_H) = U_1^H(θ_A^*, θ_H) + U_2^{HH}(θ_A^*, θ_H).

By Definition 3, for σ ∼ F_{θ_A^*} and π, τ ∼ F_{θ_H},

    E[π_1^(−{σ_1})] < E[π_1^(−{τ_1})].

Note that

    U_2^{AH}(θ_A^*, θ_H) = E[π_1^(−{σ_1})]
    U_2^{HH}(θ_A^*, θ_H) = E[π_1^(−{τ_1})].

Thus, Definition 3 implies that for θ_A^* > θ_H, U_2^{HH}(θ_A^*, θ_H) > U_2^{AH}(θ_A^*, θ_H). As a result, for θ_A^* > θ_H, using the algorithmic ranking is a weakly dominant strategy, but

    W^{HH}(θ_A^*, θ_H) = U_1^H(θ_A^*, θ_H) + U_2^{HH}(θ_A^*, θ_H)
                       > U_1^H(θ_A^*, θ_H) + U_2^{AH}(θ_A^*, θ_H)
                       = U_1^A(θ_A^*, θ_H) + U_2^{AA}(θ_A^*, θ_H)
                       = W^{AA}(θ_A^*, θ_H),

meaning social welfare would have been higher had both firms used human evaluators.

We can show that this effect persists for a value θ′_A such that using the algorithmic ranking is a strictly dominant strategy; intuitively, this works by slightly increasing θ_A^* so the algorithmic ranking becomes strictly dominant. For fixed θ_H, define

    f(θ_A) = U_1^A(θ_A, θ_H) + U_2^{AA}(θ_A, θ_H)
    g(θ_A) = U_1^H(θ_A, θ_H) + U_2^{AH}(θ_A, θ_H)
    h(θ_A) = U_1^H(θ_A, θ_H) + U_2^{HH}(θ_A, θ_H).

Because (5) always holds for θ_A > θ_H, it suffices to show that there exists θ′_A such that g(θ′_A) < f(θ′_A) < h(θ′_A); since f(θ_A^*) = g(θ_A^*) < h(θ_A^*), such a θ′_A slightly larger than θ_A^* exists by continuity.
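The preference for independence derived in (3) can also be checked numerically. The sketch below (our illustration with assumed parameters, not the paper's code) simulates the committee example under a Gaussian random utility model: it compares reusing the original ranking (hiring σ_2) against reconvening for an independent, equally accurate ranking π, and estimates the gap E[(π_1 − π_2) · 1[π_1 ≠ σ_1]].

```python
import random

def noisy_ranking(values, theta, rng):
    """Gaussian RUM: rank values by x + N(0, 1/theta^2), best first."""
    return sorted(values, key=lambda x: x + rng.gauss(0.0, 1.0 / theta),
                  reverse=True)

def second_hire_gap(values, theta, trials=50000, seed=0):
    """Monte Carlo estimate of E[(pi_1 - pi_2) * 1[pi_1 != sigma_1]],
    the gain (3) from reconvening with a fresh ranking pi over reusing
    sigma_2."""
    rng = random.Random(seed)
    reuse = reconvene = 0.0
    for _ in range(trials):
        sigma = noisy_ranking(values, theta, rng)  # day-one ranking; hire sigma[0]
        pi = noisy_ranking(values, theta, rng)     # independent day-two ranking
        reuse += sigma[1]                          # option 1: hire sigma_2
        # option 2: hire the best remaining candidate according to pi
        reconvene += next(x for x in pi if x != sigma[0])
    return (reconvene - reuse) / trials

gap = second_hire_gap([2.0, 1.0, 0.0], theta=1.0)
# Gaussian noise with three candidates satisfies Definition 2, so the
# estimated gap should come out positive: reconvening is better.
```

The same estimator can be pointed at any noisy permutation family that can be sampled, which is the general recipe for testing Definition 2 on a particular candidate distribution.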
Theorem 2. Let F_θ be the family of RUMs with either Gaussian or Laplacian noise with standard deviation 1/θ. Then, for any candidate distribution D over 3 candidates, the conditions of Theorem 1 are satisfied.

It might be tempting to generalize Theorem 2 to other distributions and more candidates; however, certain noise and candidate distributions violate the conditions of Theorem 1. Even for 3-candidate RUMs, there exist distributions for which each of the conditions is violated; we provide such examples in Appendix B. Moreover, while Gaussian and Laplacian distributions provably meet Definitions 2 and 3 with only 3 candidates, this doesn't necessarily extend to larger candidate sets. Figure 2 shows that Definition 2 can be violated under a particular candidate distribution D for Laplacian noise with 15 candidates. This challenges the intuition that independence is preferable — under some conditions, it can actually be better in expectation for a firm to use the same algorithmic ranking as its competitor, even if an independent human evaluator is equally accurate overall. Unlike Theorem 2, which applies for any candidate distribution D, certain noise models may violate Definition 2 only for particular D. It is an open question whether Theorem 2 can be extended to larger numbers of candidates under Gaussian noise.

Finally, there exist noise distributions that violate Definition 2 for any candidate distribution D. In particular, the RUM family defined by the Gumbel distribution is well-known to be equivalent to the Plackett-Luce model of ranking, which is generated by sequentially selecting candidate i with probability

    exp(θ x_i) / Σ_{j ∈ S} exp(θ x_j),   (7)

where S is the set of remaining candidates [12, 4]. Under the Plackett-Luce model, for any θ, U_2^{AH}(θ, θ) = U_2^{AA}(θ, θ). To see this, suppose the firm that hires first selects candidate i^*. Then, the firm that hires second gets each candidate i with probability given by (7) with S = {1, . . . , n} \ {i^*}.
As a result, by (3), if π, σ ∼ F_θ,

    E[π_1 − π_2 | π_1 ≠ σ_1] = 0

for any candidate distribution D, meaning the Plackett-Luce model never meets Definition 2. Thus, under the Plackett-Luce model, monoculture has no effect — the optimal strategy is always to use the best available ranking, regardless of competitors' strategies.

Given the analytic intractability of most RUMs, it might appear that testing the conditions of Theorem 1, especially for particular noise and candidate distributions, may not be possible; however, they can be efficiently tested via simulation: as long as the noise distribution E and the candidate distribution D can be sampled from, it is possible to test whether the conditions of Theorem 1 are satisfied. Thus, even if the conditions of Theorem 1 are not met for every candidate distribution D, it is possible to efficiently determine whether they are met for any particular D.

It is also interesting to ask about the magnitude of the negative impact produced by monoculture. Our model allows the qualities of candidates to be either positive or negative (capturing the fact that a worker's productivity can be either more or less than their cost to the firm in wages); using this, we can construct instances of the model in which the optimal social welfare is positive but the welfare under the (unique) monocultural equilibrium implied by Theorem 1 is negative. This is a strong type of negative result, in which sub-optimality reverses the sign of the objective function, and it means that in general we cannot compare the optimum and equilibrium by taking a ratio of two non-negative quantities, as is standard in Price of Anarchy results. However, as a future direction, it would be interesting to explore such Price of Anarchy bounds in special cases of the problem where structural assumptions on the input are sufficient to guarantee that the welfare at both the social optimum and the equilibrium are non-negative.
As one simple example, if the qualities for three candidates are drawn independently from a uniform distribution centered at 0, and the noise distribution is Gaussian, then there exist parameters θ_A > θ_H such that expected social welfare at the equilibrium where both firms use the algorithmic ranking is non-negative, and approximately 4% less than it would be had both firms used human evaluators instead.

The Mallows Model also appears frequently in the ranking literature [8, 11], and is much more analytically tractable than RUMs. Under the Mallows Model, the likelihood of a permutation is related to its distance from the true ranking π*:

    Pr[π] = (1/Z) φ^{−d(π, π*)},   (8)

where Z is a normalizing constant. In this model, φ > 1; the larger φ is, the more likely the ranking procedure is to output a ranking π that is close to the true ranking π*. To instantiate this model, we need a notion of distance d(·, ·) over permutations. For this, we'll use Kendall tau distance, another standard notion in the literature, which is simply the number of pairs of elements in π that are incorrectly ordered [10]. In Appendix D, we verify that the family of distributions F_θ given by the Mallows Model satisfies Definition 1, defining θ = φ − 1 (so that θ is well-defined on (0, ∞)). In contrast to RUMs, the Mallows Model always satisfies the conditions of Theorem 1 for any candidate distribution D, which we prove in Appendix E.

Theorem 3.
Let F_θ be the family of Mallows Model distributions with parameter θ = φ − 1. Then, for any candidate distribution D, the conditions of Theorem 1 are satisfied.

Figure 3: Regions of the (θ_H, θ_A) plane under the Mallows Model. The decrease in social welfare found in Theorem 3 is depicted by the shaded portion of the green region labeled AA, where social welfare would be higher if both firms used human evaluators.

While the result of Theorem 3 is certainly stronger than that of Theorem 2, in that it applies to all instances of the Mallows Model without restrictions, it should be interpreted with some caution. The Mallows Model does not depend on the underlying candidate values, so according to this model, monoculture can produce arbitrarily large negative effects. While insensitivity to candidate values may not necessarily be reasonable in practice, our results hold for any candidate distribution D. Thus, to the extent that the Mallows Model can reasonably approximate ranking in particular contexts, our results imply that monoculture can have negative welfare effects.

Our main focus in this work has been on models with two competing firms. However, it is also interesting to consider the case of more than two firms; we will see that the complex and sometimes counterintuitive effects that we found in the two-firm case are further enriched by additional phenomena. Primarily, we will present the results of computational experiments with the model, exposing some fundamental structural properties in the multi-firm problem for which a formal analysis remains an intriguing open problem. For concreteness, we will focus on a model in which rankings are drawn from the Mallows model. As before, each firm must choose to order candidates according to either an independent, human-produced ranking or an algorithmic ranking common to all firms who choose it. These rankings come from instances of the Mallows model with accuracy parameters φ_H and φ_A respectively, as defined in (8).

Braess' Paradox for k > 2 firms.
First, we ask whether the Braess' Paradox effect persists with k > 2 firms. It does: for example, when n = 4, k = 3, φ_A = 2, φ_H = 1.75, and candidate qualities are drawn from a uniform distribution, equilibrium social welfare is lower than it would be if all firms used human evaluators. It remains an open question whether this effect can arise for any candidate distribution D and any value of φ_H for k > 2.

Sequential decision-making.
Since the equilibrium behaviors we are studying take place in a model where firms make decisions in a random order, a crucial first step is to characterize firms' optimal behavior when making decisions sequentially — that is, when firms hire in a fixed, known order as opposed to a random order. In this context, consider the rational behavior of each firm: given a distribution over candidate values, which ranking should each firm use? Clearly, the first firm to make a selection should use the more accurate ranking mechanism; however, as shown previously, subsequent firms' decisions are less clear-cut. For a fixed number of firms, number of candidates, and distribution over candidate values, we can explore the firms' optimal strategies over the possible space of (φ_H, φ_A) values.

An optimal choice of strategies for the k firms moving sequentially can be written as a sequence of length k made up of the symbols A and H; the i-th term in the sequence is equal to A if the i-th firm to move sequentially uses the algorithm as its optimal strategy (given the choices of the previous i − 1 firms), and equal to H if the i-th firm uses an independent human evaluation. We can therefore represent the choice of optimal strategies, as the parameters (φ_H, φ_A) vary, by a labeling of the (φ_H, φ_A)-plane: we label each point (φ_H, φ_A) with the length-k sequence that specifies the optimal sequence of strategies. We can make the following initial formal observation about these optimal sequences:

Theorem 4.
When φ_H ≥ φ_A, one optimal sequence is for all firms to choose H. When φ_H > φ_A, the unique optimal sequence is for all firms to choose H.

We prove this formally in Appendix F.1, but we provide a sketch here. When φ_H ≥ φ_A, the first firm to move in sequence will simply use the more accurate strategy, and hence will choose H. Now, proceeding by induction, suppose that the first i firms have all chosen H, and consider the (i + 1)-st firm to move in sequence. Regardless of whether this firm chooses A or H, it will be making a selection that is independent of the previous i selections, and hence it is optimal for it to choose H as well. Hence, by induction, it is an optimal solution for all firms to choose H when φ_H ≥ φ_A. (This argument, slightly adapted, also directly establishes that it is uniquely optimal for all firms to choose H when φ_H > φ_A.)

Beyond this observation, if we wish to extend to the case when φ_A > φ_H, the mathematical analysis of this multi-firm model remains an open question; but it is possible to determine optimal strategies computationally for each choice of (φ_H, φ_A), and then to look at how these strategies vary over the (φ_H, φ_A)-plane. Figure 4 shows the result of doing this — producing a labeling of the (φ_H, φ_A)-plane as described above — for k = 5 firms and n = 6 candidates, with the values of the candidates drawn from a uniform distribution.

Figure 4: Regions for different optimal strategy profiles, where each strategy profile is a sequence of 'A' and 'H' representing the optimal strategies of each firm sequentially. For this plot, there are 5 firms (k = 5) and 6 candidates (n = 6) whose values are drawn from a uniform distribution. Note that when φ_A is much larger than φ_H, all firms use the algorithmic ranking, but when φ_A is only slightly larger than φ_H, only the first firm uses the algorithmic ranking.

We observe a number of interesting phenomena from this labeling of the plane.
First, the region where φ_H ≥ φ_A is labeled with the all-H sequence, reflecting the argument above; for the half-plane φ_A > φ_H, on the other hand, all optimal sequences begin with A, since it is always optimal for the first firm to use the more accurate method. The labeling of the half-plane φ_A > φ_H becomes quite complex; in principle, any sequence over the binary alphabet {A, H} that begins with A could be possible, and in fact we see that all 2^4 = 16 of these sequences appear as labels in some portion of the plane. This means that the sequential choice of optimal strategies for the firms can display arbitrary non-monotonicities in the choice of algorithmic or human decisions, with firms alternating between them; for example, even after the first firm chooses A and the second chooses H, the third may choose A or H depending on the values (φ_H, φ_A). The boundaries of the regions labeled by different optimal sequences are similarly complex; some of the regions (such as AAAHH) appear to be bounded, while others (such as AHAHA and AHHAH) appear to emerge only for sufficiently large values of φ_H.

Perhaps the most intriguing observation about the arrangement of regions is the following. Suppose we think of the sequences of symbols over {A, H} as binary representations of numbers, with A corresponding to the binary digit 1 and H corresponding to the binary digit 0. (Thus, for example, AAAHH would correspond to the number 16 + 8 + 4 = 28, while AHAHA would correspond to the number 16 + 4 + 1 = 21.) The observation is then the following: if we choose any vertical line φ_H = x (for a fixed x), and we follow it upward in the plane, we encounter regions in increasing order of the numbers corresponding to their labels in this binary representation. (First HHHHH, then AHHHH, then AHHHA, then AHHAH, and so forth.)

We do not know a proof of this fact, or how generally it holds, but we can verify it computationally for the regions of the (φ_H, φ_A)-plane mapped out in Figure 4, as well as in similar computational experiments, not shown here, for other choices of k and n. This binary-counter property suggests a rich body of additional structure to the optimal strategies in the k-firm case, and we leave it as an open question to analyze this structure mathematically.

Concerns about monoculture in the use of algorithms have focused on the danger of unexpected, correlated shocks, and on the harm to particular individuals who may fare poorly under the algorithm's decision. Our work here shows that concerns about algorithmic monoculture are in a sense more fundamental, in that it is possible for monoculture to cause decisions of globally lower average quality, even in the absence of shocks. In addition to telling us something about the pervasiveness of the phenomenon, this also suggests that it might be difficult to notice its negative effects even while they are occurring: these effects can persist at low levels even without a shock-like disruption to call our attention to them. Our results also make clear that algorithmic monoculture in decision-making does not always lead to adverse outcomes; rather, we give natural conditions under which such outcomes become possible, and show that these conditions hold in a wide range of standard models.

Our results suggest a number of natural directions for further work. To begin with, we have noted earlier in the paper that it would be interesting to give more comprehensive quantitative bounds on the magnitude of monoculture's possible negative effects in decisions such as hiring: how much worse can the quality of candidates be when selected with an equilibrium strategy involving shared algorithms than with a socially optimal one?
In formulating such questions, it will be important to take into account how the noise model for rankings relates to the numerical qualities of the candidates.

We have also focused here on the case of two firms and a single shared algorithm that is available to both. It would be natural to consider generalizations involving more firms and potentially more algorithms as well. With more algorithms, we might see solutions in which firms cluster around different algorithms of varying accuracies, as they balance the level of accuracy and the amount of correlation in their decisions. It would also be interesting to explore the ways in which correlations in firms' decisions can be decomposed into constituent parts, such as the use of standardized tests that form input features for algorithms, and how quantifying these forms of correlation might help firms assess their decisions.

Finally, it will be interesting to consider how these types of results apply to further domains. While the analysis presented here illustrates the consequences of monoculture as applied to algorithmic hiring, our findings have potential implications in a broader range of settings. Algorithmic monoculture not only leads to a lack of heterogeneity in decision-making; by allowing valuable options to slip through the cracks, be they job candidates, potential hit songs, or budding entrepreneurs, it reduces total social welfare, even when the individual decisions are more accurate on a case-by-case basis. These concerns extend beyond the use of algorithms: whenever decision-makers rely on identical or highly correlated evaluations, they miss out on hidden gems, and in this way diminish the overall quality of their decisions.
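As a toy illustration of the kind of welfare comparison raised by these open questions, one can compute expected welfare exactly in a small discrete model. Every specific number below (candidate values, noise levels, probabilities) is an illustrative assumption of this sketch, not a quantity from our analysis; the key structural feature, as in our model, is that all firms using the algorithm share one realized ranking.

```python
from itertools import product

# Illustrative toy model: three candidates with assumed values.
VALUES = (1.0, 0.6, 0.0)

def rankings(scale, p):
    """All (probability, ranking) pairs for one evaluator: candidate i is
    perceived as VALUES[i] + scale * v_i, with v_i = -1, 0, +1 having
    probabilities p, 1 - 2p, p respectively."""
    out = []
    for vs in product((-1, 0, 1), repeat=3):
        w = 1.0
        for v in vs:
            w *= p if v != 0 else 1.0 - 2.0 * p
        perceived = [x + scale * v for x, v in zip(VALUES, vs)]
        out.append((w, sorted(range(3), key=lambda i: -perceived[i])))
    return out

def welfare(profile, scale_A=0.35, scale_H=0.35, p=0.2):
    """Exact expected total value hired by firms moving in the order given
    by `profile`: all 'A'-firms share ONE realized algorithmic ranking,
    while each 'H'-firm draws an independent human ranking."""
    alg = rankings(scale_A, p) if 'A' in profile else [(1.0, None)]
    hum = rankings(scale_H, p)
    total = 0.0
    for wa, ra in alg:
        for draws in product(hum, repeat=profile.count('H')):
            w = wa
            for wh, _ in draws:
                w *= wh
            taken, h_idx, value = set(), 0, 0.0
            for s in profile:
                if s == 'A':
                    order = ra
                else:
                    order = draws[h_idx][1]
                    h_idx += 1
                pick = next(i for i in order if i not in taken)
                taken.add(pick)
                value += VALUES[pick]
            total += w * value
    return total
```

Comparing welfare('AA'), welfare('AH'), and welfare('HH') in such a model gives a feel for the size of the gaps; as Appendix B shows, the direction of these comparisons can depend on the noise model, so no particular ordering should be assumed in general.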
Acknowledgements.
This work has been supported in part by a Simons Investigator Award, a Vannevar Bush Faculty Fellowship, a MURI grant, an NSF Graduate Research Fellowship, and grants from the ARO, AFOSR, and the MacArthur Foundation.
References

[1] Hossein Azari Soufiani, Hansheng Diao, Zhenyu Lai, and David C. Parkes. Generalized random utility models with multiple types. In Advances in Neural Information Processing Systems, pages 73–81, 2013.
[2] Hossein Azari Soufiani, David C. Parkes, and Lirong Xia. Random utility theory for social choice. In Advances in Neural Information Processing Systems, pages 126–134, 2012.
[3] Kenneth P. Birman and Fred B. Schneider. The monoculture risk put into context. IEEE Security & Privacy, 7(1):14–17, 2009.
[4] H. D. Block and J. Marschak. Random orderings and stochastic theories of responses. In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, and H. B. Mann, editors, Contributions to Probability and Statistics, pages 97–132. Stanford University Press, 1960.
[5] Dietrich Braess. Über ein Paradoxon aus der Verkehrsplanung. Unternehmensforschung, 12(1):258–268, 1968.
[6] Danielle Keats Citron and Frank Pasquale. The scored society: Due process for automated predictions. Washington Law Review, 89:1, 2014.
[7] H. E. Daniels. Rank correlation and population models. Journal of the Royal Statistical Society, Series B (Methodological), 12(2):171–191, 1950.
[8] Sanmay Das and Zhuoshu Li. The role of common and private signals in two-sided matching with interviews. In International Conference on Web and Internet Economics, pages 492–497. Springer, 2014.
[9] Harry Joe. Inequalities for random utility models, with applications to ranking and subset choice data. Methodology and Computing in Applied Probability, 2(4):359–372, 2000.
[10] Maurice G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
[11] Tyler Lu and Craig Boutilier. Learning Mallows models with pairwise preferences. In International Conference on Machine Learning, 2011.
[12] R. Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. Wiley, 1959.
[13] Rahul Makhijani and Johan Ugander. Parametric models for intransitivity in pairwise rankings. In The World Wide Web Conference, pages 3056–3062, 2019.
[14] Colin L. Mallows. Non-null ranking models. I. Biometrika, 44(1/2):114–130, 1957.
[15] Farhad Manjoo. This summer stinks. But at least we've got 'Old Town Road.' New York Times Opinion, 2019.
[16] Charles F. Manski. The structure of random utility models. Theory and Decision, 8(3):229, 1977.
[17] John P. Mills. Table of the ratio: area to bounding ordinate, for any portion of normal curve. Biometrika, pages 395–400, 1926.
[18] J. F. Power and R. F. Follett. Monoculture. Scientific American, 256(3):78–87, 1987.
[19] Stephen Ragain and Johan Ugander. Pairwise choice Markov chains. In Advances in Neural Information Processing Systems, pages 3198–3206, 2016.
[20] Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311(5762):854–856, 2006.
[21] Michael R. Sampford. Some inequalities on Mill's ratio and related functions. The Annals of Mathematical Statistics, 24(1):130–132, 1953.
[22] David Strauss. Some results on random utility models. Journal of Mathematical Psychology, 20(1):35–52, 1979.
[23] Louis L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273, 1927.
[24] John I. Yellott Jr. The relationship between Luce's choice axiom, Thurstone's theory of comparative judgment, and the double exponential distribution. Journal of Mathematical Psychology, 15(2):109–144, 1977.
[25] Zhibing Zhao, Tristan Villamil, and Lirong Xia. Learning mixtures of random utility models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
A Random Utility Models satisfying Definition 1
Theorem 5.
Let f be the pdf of E. The family of RUMs F_θ given by ranking the candidates according to x_i + ε_i/θ with ε_i ∼ E satisfies the conditions of Definition 1 if:
• f is differentiable
• f has positive support on (−∞, ∞)

Proof.
We need to show that F_θ satisfies the differentiability, asymptotic optimality, and monotonicity conditions in Definition 1.

Differentiability:
The probability density of any realization of the n noise samples ε_i/θ is ∏_{i=1}^n f(ε_i/θ). Let ε = [ε_1/θ, ..., ε_n/θ] be the vector of noise values, and let M(π) ⊆ R^n be the region such that any ε ∈ M(π) produces the ranking π. The probability of any permutation π is

Pr_θ[π] = ∫_{M(π)} ∏_{i=1}^n f(ε_i/θ) d^n ε.

Because f is differentiable,

(d/dθ) f(x/θ) = f′(x/θ) · (−x/θ²).

Because Pr_θ(π) is an integral of the product of differentiable functions over a fixed region, it is differentiable.

Asymptotic optimality:
We will show that for any pair of adjacent elements and any δ > 0, there exists a sufficiently large θ such that the probability that they are incorrectly ranked is at most 2δ. We will conclude with a union bound over the n − 1 adjacent pairs, choosing θ such that the probability of outputting the correct ranking is at least 1 − 2(n − 1)δ.

Consider two candidates x_i > x_{i+1}, and let ν = x_i − x_{i+1} be the difference between them. They will be correctly ranked if ε_i/θ > −ν/2 and ε_{i+1}/θ < ν/2. Let q_1 and q_2 be the 1 − δ and δ quantiles of E respectively, and let q = max(|q_1|, |q_2|). For θ > 2q/ν,

Pr[ε_i/θ < −ν/2] = Pr[ε_i < −νθ/2] ≤ Pr[ε_i < −q] ≤ Pr[ε_i < q_2] = δ
Pr[ε_{i+1}/θ > ν/2] = Pr[ε_{i+1} > νθ/2] ≤ Pr[ε_{i+1} > q] ≤ Pr[ε_{i+1} > q_1] = δ.

For such θ, the probability that x_i and x_{i+1} are incorrectly ordered is therefore at most 2δ. Repeating this analysis for all n − 1 adjacent pairs, taking the largest of the resulting θ's, and applying a union bound yields that the probability of incorrectly ordering any adjacent pair of elements is at most 2(n − 1)δ, meaning the probability of outputting the correct ranking is at least 1 − 2(n − 1)δ. Since δ is arbitrary, this probability can be made arbitrarily close to 1, satisfying the asymptotic optimality condition.

Monotonicity:
The removal of any elements does not alter the distribution of the remaining elements, meaning that the distribution of π(−S) is equivalent to a RUM with n − |S| elements. Thus, it suffices to show that for a RUM whose noise has positive support on (−∞, ∞), the probability of ranking the best candidate first strictly increases with θ.

Recall that by definition, the candidates are ranked according to x_i + ε_i/θ; without loss of generality, let x_1 be the unique largest value. The probability that x_1 is ranked first is

Pr[x_1 + ε_1/θ > max_{2≤i≤n} x_i + ε_i/θ]
= Pr[ε_1/θ > max_{2≤i≤n} (x_i − x_1) + ε_i/θ]
= Pr[ε_1 > max_{2≤i≤n} θ(x_i − x_1) + ε_i]
= E_{ε_2,...,ε_n} Pr[ε_1 > max_{2≤i≤n} θ(x_i − x_1) + ε_i | ε_2, ..., ε_n].   (A.1)

We want to show that (A.1) is increasing in θ. Intuitively, this is because as θ increases, the right-hand side of the inequality inside the probability decreases. To prove this formally, it suffices to show that the subderivative of (A.1) with respect to θ only includes strictly positive numbers. First, we have

∂/∂θ E_{ε_2,...,ε_n} Pr[ε_1 > max_{2≤i≤n} θ(x_i − x_1) + ε_i | ε_2,...,ε_n] ⊂ R_{>0}
⟸ ∂/∂θ Pr[ε_1 > max_{2≤i≤n} θ(x_i − x_1) + ε_i | ε_2,...,ε_n] ⊂ R_{>0}.

Let F and f be the cumulative distribution function and probability density function of E respectively. Then,

Pr[ε_1 > max_{2≤i≤n} θ(x_i − x_1) + ε_i | ε_2,...,ε_n] = 1 − F(max_{2≤i≤n} θ(x_i − x_1) + ε_i).

Note that F(·) is strictly increasing (since f is assumed to have positive support on (−∞, ∞)), so it suffices to show that

∂/∂θ max_{2≤i≤n} θ(x_i − x_1) + ε_i ⊂ R_{<0}.

For any i ≥ 2,

d/dθ [θ(x_i − x_1) + ε_i] = x_i − x_1 < 0.

Thus, the subderivative of the max of such functions includes only strictly negative numbers, which completes the proof.
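The monotonicity condition just established is easy to check numerically for a concrete instance of the RUM; the candidate values, trial count, and θ values below are arbitrary choices made for illustration.

```python
import random

def p_best_first(theta, values=(3.0, 2.0, 1.0, 0.0), trials=20000, seed=1):
    """Monte Carlo estimate of the probability that the best candidate is
    ranked first when candidates are ordered by x_i + eps_i / theta with
    standard Gaussian noise eps_i (an instance of the RUM of Theorem 5)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        scores = [x + rng.gauss(0, 1) / theta for x in values]
        if max(range(len(values)), key=lambda i: scores[i]) == 0:
            hits += 1
    return hits / trials
```

With these values, the estimated probability of ranking the best candidate first increases with θ, as the monotonicity condition requires.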
B 3-candidate RUM Counterexamples
B.1 Violating Definition 2
Here, we provide a noise model E, accuracy parameter θ, and candidate distribution D such that U_AH < U_AA. Choose the noise distribution E and accuracy parameter θ such that

ε/θ = 1 w.p. δ,  0 w.p. 1 − 2δ,  −1 w.p. δ.

This noise model does not satisfy Definition 1, since it does not have positive support on (−∞, ∞); however, we can provide a "smooth" approximation to this distribution by expressing it as the sum of arbitrarily tightly concentrated Gaussians with the same results.

We choose the candidate distribution D such that x_1 − 1 > x_2 > x_3 > x_1 − 2. For example,

x_1 = 7/4,  x_2 = 1/2,  x_3 = 0.

Under this condition, assuming x_3 = 0 without loss of generality, a direct calculation expresses U_AH(θ, θ) − U_AA(θ, θ) as a polynomial in δ whose lowest-power term is negative. Therefore, for sufficiently small δ, the difference is negative; in particular, plugging in the values given above with a small positive δ yields U_AH(θ, θ) − U_AA(θ, θ) < 0.

B.2 Violating Definition 3
Next, we give a 3-candidate RUM for which U_AH < U_HH does not hold in general. Consider the following 3-candidate example:

x_1 = 3,  x_2 = 2,  x_3 = 0.

Choose E and θ such that

ε/θ = 1 w.p. (1 − δ)/2,  −1 w.p. (1 − δ)/2,  10 w.p. δ/2,  −10 w.p. δ/2.

Again, while this noise model does not satisfy Definition 1, we can approximate it arbitrarily closely with the sum of tightly concentrated Gaussians. Let θ_A = 1.1θ and θ_H = 0.9θ.

We will show that for these parameters, U_AH(θ_A, θ_H) > U_HH(θ_A, θ_H); i.e., it is somehow better to choose after a better opponent than after a worse opponent. At a high level, the reasoning is as follows:

1. When choosing first, the only difference between the algorithm and the human evaluator is that the algorithm is more likely to choose x_2 than x_3. Both strategies have identical probabilities of selecting x_1.

2. When choosing second, the human evaluator's utility is higher when x_2 has already been chosen than when x_3 has already been chosen. This is because when x_2 is unavailable, the human evaluator is almost guaranteed to get x_1; when x_3 is unavailable, the human evaluator will choose x_1 with probability ≈ 3/4.

Let τ and π be rankings generated by the algorithm and the human evaluator respectively. First, we will show that

Pr[τ_1 = x_1] = Pr[π_1 = x_1]   (B.1)
Pr[τ_1 = x_2] > Pr[π_1 = x_2]   (B.2)

To do so, consider the realizations of ε_1, ε_2, ε_3 that result in different rankings under θ_A and θ_H. In fact, the only realizations that result in different rankings are those with ε_2/θ = −1 and ε_3/θ = 1. Thus, the algorithm and the human evaluator always rank x_1 in the same position, conditioned on a realization, which proves (B.1); the only difference is that the algorithm sometimes ranks x_2 above x_3 when the human evaluator does not. Moreover, whenever ε_2/θ = −1 and ε_3/θ = 1, x_2 is strictly more likely to be ranked first under the algorithm than under the human evaluator, which proves (B.2).

Next, we must show that when choosing second, the human evaluator is better off when x_2 is unavailable than when x_3 is unavailable. This is true because, for the human evaluator,

Pr[x_1 + ε_1/θ_H > x_3 + ε_3/θ_H] ≈ 1 − O(δ)
Pr[x_1 + ε_1/θ_H > x_2 + ε_2/θ_H] ≈ 3/4.

Thus, conditioned on x_2 being unavailable, the human evaluator gets utility ≈ 3, whereas when x_3 is unavailable, the human evaluator gets utility ≈ 2.75. Let u_{−i} be the expected utility for the human evaluator when x_i is unavailable. Putting this together, we get

U_AH(θ_A, θ_H) − U_HH(θ_A, θ_H)
= Σ_{i=1}^{3} (Pr[τ_1 = x_i] − Pr[π_1 = x_i]) u_{−i}
= (Pr[τ_1 = x_2] − Pr[π_1 = x_2]) u_{−2} + (Pr[τ_1 = x_3] − Pr[π_1 = x_3]) u_{−3}   (by (B.1))
= (Pr[τ_1 = x_2] − Pr[π_1 = x_2]) (u_{−2} − u_{−3})   (since Σ_i Pr[τ_1 = x_i] = Σ_i Pr[π_1 = x_i])
> 0,

where the last step uses (B.2) and u_{−2} > u_{−3}.

C Proof of Theorem 2
C.1 Verifying Definition 2
By (2), we can equivalently show that for any θ, U_AH(θ, θ) > U_AA(θ, θ). Let τ and π be the algorithmic and human-generated rankings respectively. Note that they are identically distributed because θ_A = θ_H. Define

Y ≜ π_1 if π_1 ≠ τ_1, and π_2 otherwise.

Note that U_AH(θ, θ) = E[x_Y] and U_AA(θ, θ) = E[x_{τ_2}]. We want to show that U_AH(θ, θ) − U_AA(θ, θ) = E[x_Y − x_{τ_2}] > 0. It is sufficient to show that for any k, E[x_Y − x_{τ_2} | τ_1 = x_k] > 0. Let X_i = x_i + ε_i/θ. Note that for distinct i, j, k with x_i > x_j,

E[x_Y − x_{τ_2} | τ_1 = x_k] > 0
⟸ Pr[Y = x_i | τ_1 = x_k] / Pr[Y = x_j | τ_1 = x_k] > Pr[τ_2 = x_i | τ_1 = x_k] / Pr[τ_2 = x_j | τ_1 = x_k]
⟺ Pr[Y = x_i | τ_1 = x_k] > Pr[τ_2 = x_i | τ_1 = x_k]   (numerator and denominator sum to 1)
⟺ Pr[X_i > X_j] > Pr[X_i > X_j | X_k > X_i ∩ X_k > X_j]
⟺ Pr[X_i > X_j] > E_{X_k} [Pr[X_i > X_j | X_k = a, X_i < a, X_j < a]].

Thus, it suffices to show that for any a,

Pr[X_i > X_j] > Pr[X_i > X_j | X_i < a, X_j < a].   (C.1)

Since Pr[X_i > X_j] = lim_{a→∞} Pr[X_i > X_j | X_i < a, X_j < a], it suffices to show that for all a,

(d/da) Pr[X_i > X_j | X_i < a, X_j < a] ≥ 0,   (C.2)

and that it is strictly positive for some a. In other words, the higher a is, the more likely i and j are to be correctly ordered. In Theorems 7 and 8, we show that (C.2) holds for Laplacian and Gaussian noise respectively, which proves that RUMs based on both distributions satisfy Definition 2.

C.2 Verifying Definition 3
Next, we show that for both Laplacian and Gaussian distributions, U_AH(θ_A, θ_H) < U_HH(θ_A, θ_H) for all θ_A > θ_H. In fact, for 3-candidate RUM families, we will show that this is true for any well-ordered distribution, defined as follows.

Definition 4.
A noise model with density f(·) is well-ordered if for any a > b and c > d,

f(a − c) f(b − d) > f(a − d) f(b − c).

In other words, for a well-ordered noise model, given two realized numbers, two candidates are more likely to be correctly ordered than inverted conditioned on realizing those two numbers in some order. Lemma 1 shows that both Gaussian and Laplacian distributions are well-ordered. Thus, it suffices to show that for any 3-candidate RUM with a well-ordered noise model, U_AH(θ_A, θ_H) < U_HH(θ_A, θ_H) whenever θ_A > θ_H.

Theorem 6.
For 3 candidates with unique values x_1 > x_2 > x_3 and well-ordered i.i.d. noise with support (−∞, ∞), if θ_A > θ_H, then U_AH(θ_A, θ_H) < U_HH(θ_A, θ_H).

Proof. Define u_{−i} to be the expected utility of the maximum element of the human-generated ranking when x_i is not available. Because we are in the 3-candidate setting, we have

u_{−1} = λ_1 x_2 + (1 − λ_1) x_3
u_{−2} = λ_2 x_1 + (1 − λ_2) x_3
u_{−3} = λ_3 x_1 + (1 − λ_3) x_2,

where 1/2 < λ_i < 1. This is because the noise has support everywhere, so it is impossible to correctly rank any two candidates with probability 1, and any two candidates are more likely than not to be correctly ordered: for any gap ν > 0,

Pr[ε_i/θ − ε_j/θ > −ν] = Pr[ε_i − ε_j ≥ 0] + Pr[0 > (ε_i − ε_j)/θ > −ν] > 1/2.

Moreover, λ_2 > λ_1 and λ_2 > λ_3, since

λ_2 = Pr[ε_1 − ε_3 > −θ_H(x_1 − x_3)] > max{Pr[ε_1 − ε_2 > −θ_H(x_1 − x_2)], Pr[ε_2 − ε_3 > −θ_H(x_2 − x_3)]} = max{λ_3, λ_1}.

Let τ ∼ F_{θ_A} and π ∼ F_{θ_H}. With this, we can write

U_AH(θ_A, θ_H) = Σ_{i=1}^{3} Pr[τ_1 = x_i] u_{−i}
U_HH(θ_A, θ_H) = Σ_{i=1}^{3} Pr[π_1 = x_i] u_{−i}.

Define Δp_i = Pr[τ_1 = x_i] − Pr[π_1 = x_i]. Using Lemmas 2 and 3, we have

Δp_1 ≥ Δp_2,  Δp_3 ≤ 0,  Δp_1 + Δp_2 + Δp_3 = 0.

We must show that

U_AH(θ_A, θ_H) − U_HH(θ_A, θ_H) = Σ_{i=1}^{3} Δp_i u_{−i} < 0.

We consider 2 cases.

Case 1: Δp_2 ≤ 0. Then Δp_1 = −(Δp_2 + Δp_3) ≥ 0. This yields

Σ_{i=1}^{3} Δp_i u_{−i} = Δp_1 u_{−1} + Δp_2 u_{−2} + Δp_3 u_{−3}
≤ Δp_1 u_{−1} − Δp_1 min(u_{−2}, u_{−3})
= Δp_1 (λ_1 x_2 + (1 − λ_1) x_3 − min{λ_2 x_1 + (1 − λ_2) x_3, λ_3 x_1 + (1 − λ_3) x_2})
≤ Δp_1 (λ_1 x_2 + (1 − λ_1) x_3 − min{λ_2 x_1 + (1 − λ_2) x_3, x_2}).

We can show that this is at most 0 regardless of which term attains the minimum. Because λ_1 < λ_2,

λ_1 x_2 + (1 − λ_1) x_3 − λ_2 x_1 − (1 − λ_2) x_3 = λ_1 (x_2 − x_3) + λ_2 (x_3 − x_1)
< λ_2 (x_2 − x_3) + λ_2 (x_3 − x_1) = λ_2 (x_2 − x_1) < 0,

and

λ_1 x_2 + (1 − λ_1) x_3 − x_2 = (1 − λ_1)(x_3 − x_2) < 0.

Thus, Σ_{i=1}^{3} Δp_i u_{−i} < 0.

Case 2: Δp_2 > 0. Note that u_{−1} < x_2 < u_{−3}. Then, using Δp_3 = −(Δp_1 + Δp_2),

Σ_{i=1}^{3} Δp_i u_{−i} = Δp_1 u_{−1} + Δp_2 u_{−2} + Δp_3 u_{−3}
= Δp_1 (u_{−1} − u_{−3}) + Δp_2 (u_{−2} − u_{−3})
≤ Δp_2 (u_{−1} − u_{−3}) + Δp_2 (u_{−2} − u_{−3})   (Δp_1 ≥ Δp_2 and u_{−1} < u_{−3})
= Δp_2 (u_{−1} + u_{−2} − 2u_{−3})
≤ Δp_2 (x_2 + x_1 − 2(λ_3 x_1 + (1 − λ_3) x_2))
< Δp_2 (x_1 + x_2 − 2((1/2) x_1 + (1/2) x_2))   (λ_3 > 1/2)
= 0.

Thus, U_AH(θ_A, θ_H) < U_HH(θ_A, θ_H).

C.3 Supplementary Lemmas for Random Utility Models
Lemma 1.
Both Gaussian and Laplacian distributions are well-ordered.

Proof. The Gaussian noise model is well-ordered: for a > b and c > d,

f(a − c) f(b − d) = (1/(2πσ²)) exp(−((a − c)² + (b − d)²)/(2σ²))
= (1/(2πσ²)) exp(−((a − d)² + (b − c)² − 2(a − b)(c − d))/(2σ²))
= f(a − d) f(b − c) exp((a − b)(c − d)/σ²)
> f(a − d) f(b − c).

Laplacian noise is as well:

f(a − c) f(b − d) = (1/4) exp(−|a − c| − |b − d|)
f(a − d) f(b − c) = (1/4) exp(−|a − d| − |b − c|).

It suffices to show that for a > b and c > d, |a − c| + |b − d| < |a − d| + |b − c|. To show this, plot (a, b) and (c, d) in the (x, y) plane. Note that they are both below the y = x line, and that the ℓ_1 distance between them is |a − c| + |b − d|. Moreover, the ℓ_1 distance between any two points must be realized by some Manhattan path, which is a combination of horizontal and vertical line segments. Consider the point (b, a), which is above the y = x line. Any Manhattan path from (b, a) to (c, d) must cross the y = x line at some point (w, w). Since (b, a) and (a, b) are equidistant from (w, w), for any Manhattan path from (b, a) to (c, d), there exists a Manhattan path from (a, b) to (c, d) passing through (w, w) of the same length, meaning the ℓ_1 distance from (a, b) to (c, d) is smaller than the ℓ_1 distance from (b, a) to (c, d). As a result, |a − c| + |b − d| < |a − d| + |b − c|.

Next, we show a few basic facts. Let f_A(r) be the density function of the joint realization R = [X_1, ..., X_n] = [x_1 + ε_1/θ_A, ..., x_n + ε_n/θ_A] under the algorithmic ranking, and let f_H(r) be the similarly defined density function under the human-generated ranking. Consider the "contraction" operation r′ = cont(r) such that

r′_i = x_i + (r_i − x_i) · (θ_H/θ_A).

Essentially, the contraction defines a coupling between f_A(·) and f_H(·), since for r′ = cont(r), f_A(r′) dr′ = f_H(r) dr.
Let π(r) be the ranking induced by r. Note that contraction cannot introduce any new inversions in π(r): that is, if x_i is ranked above x_j in π(r) for i < j, then x_i is ranked above x_j in π(cont(r)). Intuitively, this is because contraction pulls values closer to their means, and can therefore only correct existing inversions, not introduce new ones. This fact will allow us to prove some useful lemmas.

Lemma 2. If F_θ is a RUM family satisfying Definition 1, then for θ_A ≥ θ_H, τ ∼ F_{θ_A}, and π ∼ F_{θ_H},

Pr[τ_1 = x_n] ≤ Pr[π_1 = x_n].

Proof.
Consider any realization r, and let r′ = cont(r). Because inversions can only be corrected, not generated, by contraction, if π(r′)_1 = x_n, then π(r)_1 = x_n. Since r′ and r have equal measure under f_A and f_H respectively, we have

Pr[π_1 = x_n] = ∫_{R^n} f_H(r) 1{π(r)_1 = x_n} dr
= ∫_{R^n} f_A(cont(r)) 1{π(r)_1 = x_n} d cont(r)
≥ ∫_{R^n} f_A(cont(r)) 1{π(cont(r))_1 = x_n} d cont(r)
= ∫_{R^n} f_A(r) 1{π(r)_1 = x_n} dr
= Pr[τ_1 = x_n].

Next, we prove the following result for well-ordered noise models.
For any i > 1, if the noise model E is well-ordered, then for θ_A ≥ θ_H, τ ∼ F_{θ_A}, and π ∼ F_{θ_H},

Pr[τ_1 = x_1] − Pr[π_1 = x_1] ≥ Pr[τ_1 = x_i] − Pr[π_1 = x_i].

Proof.
For j ≠ i, let S_{j→i} ⊆ R^n be the set of realizations r such that π(r)_1 = x_j and π(cont(r))_1 = x_i. Note that S_{j→i} = ∅ for j < i, because contraction cannot create inversions. Then, we have that

Pr[τ_1 = x_i] − Pr[π_1 = x_i] = Σ_{j>i} ∫_{R^n} f_H(r) 1{r ∈ S_{j→i}} dr − Σ_{j<i} ∫_{R^n} f_H(r) 1{r ∈ S_{i→j}} dr
≤ Σ_{j>i} ∫_{R^n} f_H(r) 1{r ∈ S_{j→i}} dr.

Define swap_i(r) = r′, where

r′_j = r_j for j ∉ {1, i},  r′_1 = r_i,  r′_i = r_1.

Intuitively, the swap_i operation simply swaps the realizations in positions 1 and i. Note that this is a bijection. Also, if r ∈ S_{j→i}, then swap_i(r) ∈ S_{j→1}, since

cont(swap_i(r))_1 ≥ cont(r)_i ≥ max_j cont(r)_j ≥ max_{j ∉ {1, i}} cont(swap_i(r))_j
cont(swap_i(r))_1 ≥ cont(r)_i ≥ cont(r)_1 ≥ cont(swap_i(r))_i.

Furthermore, for r ∈ S_{j→i}, f_H(r) ≤ f_H(swap_i(r)), since

f_H(swap_i(r)) / f_H(r) = [f(r_i − x_1) f(r_1 − x_i)] / [f(r_1 − x_1) f(r_i − x_i)] ≥ 1

by well-orderedness, because r ∈ S_{j→i} implies r_i > r_1. Thus,

Σ_{j>i} ∫_{R^n} f_H(r) 1{r ∈ S_{j→i}} dr ≤ Σ_{j>i} ∫_{R^n} f_H(swap_i(r)) 1{r ∈ S_{j→i}} dr
= Σ_{j>i} ∫_{R^n} f_H(r) 1{swap_i(r) ∈ S_{j→i}} dr
≤ Σ_{j>i} ∫_{R^n} f_H(r) 1{r ∈ S_{j→1}} dr
≤ Σ_{j>1} ∫_{R^n} f_H(r) 1{r ∈ S_{j→1}} dr
= Pr[τ_1 = x_1] − Pr[π_1 = x_1].

Finally, we show that (C.2) holds for both Laplacian and Gaussian noise.

Theorem 7.
For any a ∈ R and X_i = x_i + σε_i, where the ε_i are Laplacian with unit variance,

(d/da) Pr[X_i > X_j | X_i < a, X_j < a] ≥ 0.

Moreover, it is strictly positive for some a.

Proof. First, we must derive an expression for Pr[X_i > X_j | X_i < a, X_j < a]. Recall that the Laplace distribution parameterized by μ and λ has pdf

f(x; μ, λ) = (λ/2) exp(−λ|x − μ|)

and cdf

F(x; μ, λ) = (1/2) exp(−λ(μ − x)) for x < μ,  and  1 − (1/2) exp(−λ(x − μ)) for x ≥ μ.

Let x_i and x_j be the respective means of X_i and X_j, with x_i > x_j. Because the Laplace distribution is piecewise defined, we must consider 3 cases and show that in all 3 cases, (C.2) holds. Note that

Pr[X_i > X_j | X_i < a, X_j < a] = [∫_{−∞}^{a} f(x; x_i, λ) F(x; x_j, λ) dx] / [F(a; x_i, λ) F(a; x_j, λ)].   (C.3)

Case 1: a ≤ x_j. The numerator of (C.3) is

∫_{−∞}^{a} (λ/2) exp(−λ(x_i − x)) · (1/2) exp(−λ(x_j − x)) dx
= (λ/4) exp(−λ(x_i + x_j)) ∫_{−∞}^{a} exp(2λx) dx
= (λ/4) exp(−λ(x_i + x_j)) · (1/(2λ)) exp(2λa)
= (1/8) exp(−λ(x_i + x_j − 2a)).

The denominator is

(1/2) exp(−λ(x_i − a)) · (1/2) exp(−λ(x_j − a)) = (1/4) exp(−λ(x_i + x_j − 2a)).

Thus, Pr[X_i > X_j | X_i < a, X_j < a] = 1/2, so its derivative is trivially nonnegative.

Case 2: x_j < a ≤ x_i. The numerator of (C.3) is

∫_{−∞}^{x_j} (λ/2) exp(−λ(x_i − x)) · (1/2) exp(−λ(x_j − x)) dx + ∫_{x_j}^{a} (λ/2) exp(−λ(x_i − x)) (1 − (1/2) exp(−λ(x − x_j))) dx
= (1/8) exp(−λ(x_i − x_j)) + (1/2)(exp(−λ(x_i − a)) − exp(−λ(x_i − x_j))) − (λ/4)(a − x_j) exp(−λ(x_i − x_j))
= (1/2) exp(−λ(x_i − a)) − (3/8 + (λ/4)(a − x_j)) exp(−λ(x_i − x_j)).

The denominator is

(1/2) exp(−λ(x_i − a)) · (1 − (1/2) exp(−λ(a − x_j))) = (1/2) exp(−λ(x_i − a)) − (1/4) exp(−λ(x_i − x_j)).

We can factor exp(−λ(x_i − x_j)) out of both, so

Pr[X_i > X_j | X_i < a, X_j < a] = [2 exp(λ(a − x_j)) − (3/2 + λ(a − x_j))] / [2 exp(λ(a − x_j)) − 1]
= 1 − (1/2 + λ(a − x_j)) / (2 exp(λ(a − x_j)) − 1).

Then,

(d/da) Pr[X_i > X_j | X_i < a, X_j < a] > 0
⟺ (d/da) [(1/2 + λ(a − x_j)) / (2 exp(λ(a − x_j)) − 1)] < 0
⟺ (2 exp(λ(a − x_j)) − 1) λ < (1/2 + λ(a − x_j)) · 2λ exp(λ(a − x_j))
⟺ 2 − exp(−λ(a − x_j)) < 2 (1/2 + λ(a − x_j))
⟺ 1 − exp(−λ(a − x_j)) < 2λ(a − x_j)
⟺ exp(−λ(a − x_j)) > 1 − 2λ(a − x_j).

This is true because λ(a − x_j) > 0, and for z > 0, exp(−z) > 1 − z > 1 − 2z.

Case 3: a > x_i. The numerator of (C.3) is

∫_{−∞}^{x_j} (λ/2) exp(−λ(x_i − x)) · (1/2) exp(−λ(x_j − x)) dx
+ ∫_{x_j}^{x_i} (λ/2) exp(−λ(x_i − x)) (1 − (1/2) exp(−λ(x − x_j))) dx
+ ∫_{x_i}^{a} (λ/2) exp(−λ(x − x_i)) (1 − (1/2) exp(−λ(x − x_j))) dx
= 1/2 − (3/8 + (λ/4)(x_i − x_j)) exp(−λ(x_i − x_j)) + (1/2)(1 − exp(−λ(a − x_i))) + (1/8)(exp(−λ(2a − x_i − x_j)) − exp(−λ(x_i − x_j)))
= 1 − (1/2 + (λ/4)(x_i − x_j)) exp(−λ(x_i − x_j)) − (1/2) exp(−λ(a − x_i)) + (1/8) exp(−λ(2a − x_i − x_j)).

The denominator is

(1 − (1/2) exp(−λ(a − x_i))) (1 − (1/2) exp(−λ(a − x_j)))
= 1 − (1/2) exp(−λ(a − x_i)) − (1/2) exp(−λ(a − x_j)) + (1/4) exp(−λ(2a − x_i − x_j)).

Multiplying both by 8, we must show that the derivative of

[8 − (4 + 2λ(x_i − x_j)) exp(−λ(x_i − x_j)) − 4 exp(−λ(a − x_i)) + exp(−λ(2a − x_i − x_j))] / [8 − 4 exp(−λ(a − x_i)) − 4 exp(−λ(a − x_j)) + 2 exp(−λ(2a − x_i − x_j))]

with respect to a is positive. By the quotient rule, this amounts to an inequality between products of exponentials in a; expanding both sides, cancelling common terms, and writing z = λ(x_i − x_j) > 0 and exp(−λ(a − x_i)) < 1, the inequality reduces to the following two elementary facts. First,

(4 + 2z)(1 + exp(−z)) − 8 ≥ 0,

which holds because (2 + z)(1 + exp(−z)) ≥ 4 ⟺ z + 2 exp(−z) + z exp(−z) ≥ 2; this holds with equality at z = 0, and the left-hand side is increasing, since

(d/dz) [z + 2 exp(−z) + z exp(−z)] ≥ 0 ⟺ 1 ≥ exp(−z) + z exp(−z) ⟺ 1 + z ≤ exp(z).

Second,

3(1 − exp(−z)) > z exp(−z) ⟺ 3(exp(z) − 1) > z ⟺ exp(z) > 1 + z/3,

which is true for z > 0. This completes the proof for Case 3.

As a result, we have that (d/da) Pr[X_i > X_j | X_i < a, X_j < a] ≥ 0 for all a, with strict inequality for some a, which proves the theorem.
For any a ∈ ℝ and X_i = x_i + σε_i where ε_i ∼ N(0, 1), (d/da) Pr[X_i > X_j | X_i < a, X_j < a] > 0.

Proof. Assume σ = 1/√2. This is without loss of generality because for any instance with arbitrary σ′, rescaling all values by σ/σ′ yields an equivalent instance with σ = 1/√2. First, we have

Pr[X_i > X_j | X_i < a, X_j < a] = ∫_{−∞}^{a} Pr[X_i = x] Pr[X_j < x] dx / (Pr[X_i < a] Pr[X_j < a])
= [∫_{−∞}^{a} exp(−(x − x_i)²)/√π · (1 + erf(x − x_j))/2 dx] / [(1 + erf(a − x_i))/2 · (1 + erf(a − x_j))/2]
= (2/√π) ∫_{−∞}^{a} exp(−(x − x_i)²)(1 + erf(x − x_j)) dx / [(1 + erf(a − x_i))(1 + erf(a − x_j))].

The derivative with respect to a is positive if and only if

(1 + erf(a − x_i))(1 + erf(a − x_j))² exp(−(a − x_i)²) > ∫_{−∞}^{a} exp(−(x − x_i)²)(1 + erf(x − x_j)) dx · (2/√π)·((1 + erf(a − x_i)) exp(−(a − x_j)²) + (1 + erf(a − x_j)) exp(−(a − x_i)²)). (C.5)

Let t = a − x_i and δ = x_i − x_j. Then, using the fact that

∫_{−∞}^{a} exp(−(x − x_i)²)(1 + erf(x − x_j)) dx = ∫_{−∞}^{a − x_i} exp(−x²) dx + ∫_{−∞}^{a − x_i} exp(−x²) erf(x + δ) dx = (√π/2)(1 + erf(a − x_i)) + ∫_{−∞}^{a − x_i} exp(−x²) erf(x + δ) dx,

dividing (C.5) through by (2/√π)·((1 + erf(t)) exp(−(t + δ)²) + (1 + erf(t + δ)) exp(−t²)) gives

(√π/2) · (1 + erf(t))(1 + erf(t + δ))² exp(−t²) / [(1 + erf(t)) exp(−(t + δ)²) + (1 + erf(t + δ)) exp(−t²)] > (√π/2)(1 + erf(t)) + ∫_{−∞}^{t} exp(−x²) erf(x + δ) dx,

or equivalently, after dividing by √π/2,

f(t) := (1 + erf(t))(1 + erf(t + δ))² exp(−t²) / [(1 + erf(t)) exp(−(t + δ)²) + (1 + erf(t + δ)) exp(−t²)] − (1 + erf(t)) − (2/√π) ∫_{−∞}^{t} exp(−x²) erf(x + δ) dx > 0.

To establish f(t) > 0 for all t, it suffices to show that:

1. f(t) is continuous and differentiable everywhere.
2. lim_{t→−∞} f(t) = 0.
3. (d/dt) f(t) > 0 for all t.

For the limit, the last two terms of f vanish as t → −∞, so

lim_{t→−∞} f(t) = lim_{t→−∞} (1 + erf(t))(1 + erf(t + δ))² exp(−t²) / [(1 + erf(t)) exp(−(t + δ)²) + (1 + erf(t + δ)) exp(−t²)]. (C.7)

Observe that both the numerator and denominator of (C.7) are positive, so this limit must be at least 0. We can upper bound it by dropping the first (positive) term of the denominator:

lim_{t→−∞} (1 + erf(t))(1 + erf(t + δ))² exp(−t²) / [(1 + erf(t + δ)) exp(−t²)] = lim_{t→−∞} (1 + erf(t))(1 + erf(t + δ)) = 0.

Thus, the limit is 0. Now, we must show that the derivative is positive.
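As an aside, the theorem's overall claim can be sanity-checked numerically, independent of the proof. The sketch below (a numerical check, not part of the argument; the means x_i = 0.5, x_j = −0.5, the integration bounds, and the step count are arbitrary choices) evaluates Pr[X_i > X_j | X_i < a, X_j < a] by trapezoidal quadrature and confirms it increases in a:

```python
import math

def Phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pdf(z):
    # standard normal density
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def cond_prob(a, xi, xj, sigma=1.0, lo=-12.0, n=20000):
    """Pr[X_i > X_j | X_i < a, X_j < a] for X_k = x_k + sigma * eps_k.

    Numerator: integral over x < a of f_i(x) * F_j(x);
    denominator: F_i(a) * F_j(a). Trapezoidal rule on [lo, a].
    """
    h = (a - lo) / n
    total = 0.0
    for k in range(n + 1):
        x = lo + k * h
        w = 0.5 if k in (0, n) else 1.0  # trapezoid endpoint weights
        total += w * pdf((x - xi) / sigma) / sigma * Phi((x - xj) / sigma)
    return total * h / (Phi((a - xi) / sigma) * Phi((a - xj) / sigma))

# the conditional probability should be strictly increasing in a (Theorem 8)
vals = [cond_prob(a, xi=0.5, xj=-0.5) for a in (-1.0, 0.0, 1.0, 2.0, 5.0)]
assert all(v2 > v1 for v1, v2 in zip(vals, vals[1:]))
```

As a → ∞ the values approach the unconditioned probability Pr[X_i > X_j] = Φ((x_i − x_j)/(σ√2)) from below, consistent with monotonicity.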
The derivative is

(d/dt) f(t) = (d/dt)[(1 + erf(t))(1 + erf(t + δ))² exp(−t²) / ((1 + erf(t)) exp(−(t + δ)²) + (1 + erf(t + δ)) exp(−t²))] − (2/√π) exp(−t²) − (2/√π) exp(−t²) erf(t + δ). (C.8)

Taking this derivative and factoring out a positive common quantity, we get that (C.8) is positive if and only if

δ√π exp((t + δ)²)(1 + erf(t))(1 + erf(t + δ)) − exp(2δt + δ²)(1 + erf(t + δ)) + (1 + erf(t)) > 0
⟺ δ√π exp((t + δ)²)(1 + erf(t)) + (1 + erf(t))/(1 + erf(t + δ)) − exp(2δt + δ²) > 0
⟺ δ√π exp(t²)(1 + erf(t)) + exp(−2δt − δ²) · (1 + erf(t))/(1 + erf(t + δ)) − 1 > 0
⟺ (1 + erf(t)) [δ√π exp(t²) + exp(−2δt − δ²)/(1 + erf(t + δ))] − 1 > 0
⟺ ((1 + erf(t))/exp(−t²)) · [δ√π + exp(−(t + δ)²)/(1 + erf(t + δ))] − 1 > 0. (C.9)

Let g(t) ≜ (1 + erf(t))/exp(−t²). Then, (C.9) is

g(t) [δ√π + 1/g(t + δ)] − 1 > 0 ⟺ 1/g(t) − 1/g(t + δ) < δ√π.

By the Mean Value Theorem,

1/g(t) − 1/g(t + δ) = −δ · (d/dt)(1/g(t)) |_{t = t*}

for some t ≤ t* ≤ t + δ. Thus, it suffices to show that

(d/dt)(1/g(t)) > −√π (C.10)

for all t. To do this, consider the Mills ratio [17]

R(t) ≜ exp(t²/2) ∫_t^∞ exp(−x²/2) dx.

Note that this is quite similar in functional form to g(t), and with some manipulation, we can relate the two:

R(√2 t) = exp(t²) ∫_{√2 t}^∞ exp(−x²/2) dx = √2 exp(t²) ∫_t^∞ exp(−x²) dx = √2 exp(t²) ∫_{−∞}^{−t} exp(−x²) dx   (exp(−x²) is symmetric)
R(−√2 t) = √2 exp(t²) ∫_{−∞}^{t} exp(−x²) dx = √2 exp(t²) · (√π/2)(1 + erf(t)) = √(π/2) · (1 + erf(t))/exp(−t²)
R(−√2 t) = √(π/2) g(t).

Sampford [21, Eq. (3)] proved that (d/dt)(1/R(t)) < 1 for all t. Thus,

(d/dt)(1/g(t)) = (d/dt)[√(π/2) · 1/R(−√2 t)] = −√2 · √(π/2) · (1/R)′(−√2 t) > −√2 · √(π/2) = −√π,

which proves (C.10) and completes the proof.

D Verifying that the Mallows Model Satisfies Definition 1
Theorem 9.
The family of distributions F_θ produced by the Mallows model with Kendall tau distance and θ = φ − 1 satisfies the conditions of Definition 1.

Proof. We must show that F_θ satisfies the differentiability, asymptotic optimality, and monotonicity conditions of Definition 1.

Differentiability: Let Π be the set of all permutations on n candidates. The probability of realizing a particular permutation π under the Mallows model is

Pr_θ[π] = φ^{−d(π, π∗)} / Σ_{π′ ∈ Π} φ^{−d(π′, π∗)}.

Both the numerator and denominator are differentiable with respect to θ = φ − 1, so Pr_θ[π] is differentiable with respect to θ.

Asymptotic optimality:
For the correct ranking π∗,

Pr_θ[π∗] = 1/Z,

where the normalizing constant Z is

Z = Σ_{π ∈ Π} φ^{−d(π, π∗)}.

In the limit,

lim_{θ→∞} Z = lim_{φ→∞} Σ_{π ∈ Π} φ^{−d(π, π∗)} = 1 + Σ_{π ≠ π∗} lim_{φ→∞} φ^{−d(π, π∗)} = 1,

because for any π ≠ π∗, d(π, π∗) ≥ 1. Therefore,

lim_{θ→∞} Pr_θ[π∗] = lim_{θ→∞} 1/Z = 1.

Monotonicity:
We must show that for any S ⊂ x, if π_1^{(−S)} denotes the value of the top-ranked candidate according to π excluding candidates in S, then for θ′ > θ,

E_{F_{θ′}}[π_1^{(−S)}] ≥ E_{F_θ}[π_1^{(−S)}].

For any i ∉ S, let j be the smallest index such that j > i and j ∉ S. Consider any π such that π_1^{(−S)} = x_j. Then, swapping i and j yields a permutation π̂ such that π̂_1^{(−S)} = x_i. Moreover,

Pr[π̂] = Pr[π] · φ^{inv(π) − inv(π̂)}.

Since i < j, inv(π) − inv(π̂) ≥ 1. Finally, note that swapping i and j is a bijection between {π : π_1^{(−S)} = x_j} and {π : π_1^{(−S)} = x_i}. Thus,

Pr[π_1^{(−S)} = x_i] / Pr[π_1^{(−S)} = x_j] = Σ_{π : π_1^{(−S)} = x_j} (Pr[π] / Pr[π_1^{(−S)} = x_j]) · φ^{inv(π) − inv(π̂)}.

Note that the terms Pr[π] / Pr[π_1^{(−S)} = x_j] sum to 1, so this sum is a polynomial in φ with nonnegative weights and positive integer powers of φ. As a result, it must have a positive derivative with respect to φ, i.e., for i < j,

(d/dφ) (Pr[π_1^{(−S)} = x_i] / Pr[π_1^{(−S)} = x_j]) > 0.

Let φ′ > φ. Then,

Pr_φ[π_1^{(−S)} = x_i] / Pr_φ[π_1^{(−S)} = x_j] < Pr_{φ′}[π_1^{(−S)} = x_i] / Pr_{φ′}[π_1^{(−S)} = x_j].

Rearranging,

Pr_φ[π_1^{(−S)} = x_i] / Pr_{φ′}[π_1^{(−S)} = x_i] < Pr_φ[π_1^{(−S)} = x_j] / Pr_{φ′}[π_1^{(−S)} = x_j]. (D.1)

For θ′ = φ′ − 1 > θ = φ − 1,

E_{F_θ}[π_1^{(−S)}] = Σ_{i ∉ S} Pr_φ[π_1^{(−S)} = x_i] x_i and E_{F_{θ′}}[π_1^{(−S)}] = Σ_{i ∉ S} Pr_{φ′}[π_1^{(−S)} = x_i] x_i.

By Lemma 4, E_{F_{θ′}}[π_1^{(−S)}] > E_{F_θ}[π_1^{(−S)}], which completes the proof. Note that we apply Lemma 4 indexing backwards from n to 1, ignoring elements in S, with p_i = Pr_φ[π_1^{(−S)} = x_i] and q_i = Pr_{φ′}[π_1^{(−S)} = x_i]. (D.1) provides the condition that p_i/q_i is decreasing (as i decreases, since we are indexing backwards).

E Proof of Theorem 3
E.1 Verifying Definition 2
We must show that when π, τ ∼ F_θ independently,

E[π_1 − π_2 | π_1 ≠ τ_1] > 0. (E.1)

We begin by expanding:

E[π_1 − π_2 | π_1 ≠ τ_1] = Σ_{i=1}^n Σ_{j=1}^n (x_i − x_j) Pr[π_1 = x_i ∩ π_2 = x_j | π_1 ≠ τ_1]
= Σ_{i=1}^{n−1} Σ_{j>i} (x_i − x_j) (Pr[π_1 = x_i ∩ π_2 = x_j | π_1 ≠ τ_1] − Pr[π_1 = x_j ∩ π_2 = x_i | π_1 ≠ τ_1]).

Since x_i > x_j for i < j, it suffices to show that for all i < j,

Pr[π_1 = x_i ∩ π_2 = x_j | π_1 ≠ τ_1] ≥ Pr[π_1 = x_j ∩ π_2 = x_i | π_1 ≠ τ_1], (E.2)

and that this holds strictly for some i < j. We simplify (E.2) as follows:

Pr[π_1 = x_i ∩ π_2 = x_j | π_1 ≠ τ_1] > Pr[π_1 = x_j ∩ π_2 = x_i | π_1 ≠ τ_1]
⟺ Pr[π_1 = x_i ∩ π_2 = x_j ∩ π_1 ≠ τ_1] / Pr[π_1 ≠ τ_1] > Pr[π_1 = x_j ∩ π_2 = x_i ∩ π_1 ≠ τ_1] / Pr[π_1 ≠ τ_1]
⟺ Pr[π_1 = x_i ∩ π_2 = x_j ∩ π_1 ≠ τ_1] > Pr[π_1 = x_j ∩ π_2 = x_i ∩ π_1 ≠ τ_1]
⟺ Pr[π_1 = x_i ∩ π_2 = x_j ∩ τ_1 ≠ x_i] > Pr[π_1 = x_j ∩ π_2 = x_i ∩ τ_1 ≠ x_j]
⟺ Pr[π_1 = x_i ∩ π_2 = x_j] Pr[τ_1 ≠ x_i] > Pr[π_1 = x_j ∩ π_2 = x_i] Pr[τ_1 ≠ x_j] (E.3)

We can simplify (E.3) using Lemmas 5 and 6. Let |i − j| denote the difference in rank between x_i and x_j.

Pr[π_1 = x_i ∩ π_2 = x_j] Pr[τ_1 ≠ x_i] − Pr[π_1 = x_j ∩ π_2 = x_i] Pr[τ_1 ≠ x_j]
= Pr[π_1 = x_i ∩ π_2 = x_j](1 − Pr[τ_1 = x_i]) − φ^{−1} Pr[π_1 = x_i ∩ π_2 = x_j](1 − Pr[τ_1 = x_j])
= Pr[π_1 = x_i ∩ π_2 = x_j](1 − Pr[τ_1 = x_i]) − φ^{−1} Pr[π_1 = x_i ∩ π_2 = x_j](1 − φ^{−|i−j|} Pr[τ_1 = x_i])
= Pr[π_1 = x_i ∩ π_2 = x_j](1 − Pr[τ_1 = x_i] − φ^{−1}(1 − φ^{−|i−j|} Pr[τ_1 = x_i])).

This is positive if and only if

1 − Pr[τ_1 = x_i] − φ^{−1}(1 − φ^{−|i−j|} Pr[τ_1 = x_i]) > 0
⟺ Pr[τ_1 = x_i](1 − φ^{−|i−j|−1}) < 1 − φ^{−1}
⟺ Pr[τ_1 = x_i] < (1 − φ^{−1}) / (1 − φ^{−|i−j|−1})
⟺ (1 − φ^{−1}) / (φ^{i−1}(1 − φ^{−n})) < (1 − φ^{−1}) / (1 − φ^{−|i−j|−1})
⟺ φ^{i−1}(1 − φ^{−n}) > 1 − φ^{−|i−j|−1}.

This is weakly true for any i < j because φ^{i−1} ≥ 1 and |i − j| + 1 ≤ n, and it is strictly true for any i, j other than i = 1 and j = n.
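As an aside, (E.1) can be checked directly by exact enumeration for a small instance. The sketch below (a numerical check, not part of the proof; the choices n = 4, φ = 2, and the candidate values x are arbitrary) builds the Mallows distribution Pr[π] ∝ φ^{−d(π, π∗)} explicitly and computes E[π_1 − π_2 | π_1 ≠ τ_1] over two independent rankings:

```python
import itertools

def inversions(p):
    # Kendall tau distance from the identity ranking (0, 1, ..., n-1)
    return sum(1 for i in range(len(p)) for j in range(i + 1, len(p)) if p[i] > p[j])

def mallows(n, phi):
    # exact Mallows distribution: Pr[p] proportional to phi**(-inversions(p))
    perms = list(itertools.permutations(range(n)))
    weights = [phi ** -inversions(p) for p in perms]
    z = sum(weights)
    return [(p, w / z) for p, w in zip(perms, weights)]

n, phi = 4, 2.0
x = [3.0, 2.0, 1.0, 0.0]           # candidate values, decreasing (index 0 is best)
dist = mallows(n, phi)

num = den = 0.0
for p, wp in dist:                  # p = ranking pi; p[0], p[1] are its top two picks
    for t, wt in dist:              # t = independent ranking tau
        if p[0] != t[0]:            # condition on pi_1 != tau_1
            num += wp * wt * (x[p[0]] - x[p[1]])
            den += wp * wt

assert num / den > 0                # E[pi_1 - pi_2 | pi_1 != tau_1] > 0, matching (E.1)
```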
Thus, E[π_1 − π_2 | π_1 ≠ τ_1] > 0.

E.2 Verifying Definition 3
Recall that Definition 3 is equivalent to U_AH(θ_A, θ_H) < U_HH(θ_A, θ_H) for θ_A > θ_H. Let τ be the algorithmic ranking, and let π be a ranking from a human evaluator. Recall that U_H(θ_A, θ_H) = E[π_1]. Throughout this proof, we will drop the (θ_A, θ_H) notation and simply write U_H, U_AH, and U_HH.

U_AH = Σ_{i=1}^n (Pr[π_1 = x_i ∩ τ_1 ≠ x_i] + Pr[π_2 = x_i ∩ π_1 = τ_1]) x_i
= Σ_{i=1}^n Pr[π_1 = x_i ∩ τ_1 ≠ x_i] x_i + Σ_{i=1}^n Pr[π_2 = x_i ∩ π_1 = τ_1] x_i
= Σ_{i=1}^n (Pr[π_1 = x_i] − Pr[π_1 = x_i ∩ τ_1 = x_i]) x_i + Σ_{i=1}^n Σ_{j ≠ i} Pr[π_1 = x_j ∩ τ_1 = x_j ∩ π_2 = x_i] x_i
= U_H − Σ_{i=1}^n Pr[π_1 = x_i ∩ τ_1 = x_i] x_i + Σ_{i=1}^n Pr[π_1 = x_i ∩ τ_1 = x_i] E[π_2 | π_1 = x_i ∩ τ_1 = x_i]
= U_H + Σ_{i=1}^n Pr[π_1 = x_i] Pr[τ_1 = x_i] (E[π_2 | π_1 = x_i] − x_i),

using the independence of π and τ. Similarly, because two human evaluators are independent,

U_HH = U_H + Σ_{i=1}^n Pr[π_1 = x_i]² (E[π_2 | π_1 = x_i] − x_i).

Let V_{−i} = E[π_2 | π_1 = x_i]. Note that conditioned on π_1 = x_i, the remaining elements of π follow a Mallows model distribution over the other n − 1 candidates, so V_{−i} increases as i increases (since x_i, the value of the unavailable candidate, decreases). Moreover, x_i is strictly decreasing in i, so V_{−i} − x_i is strictly increasing in i. With this, we have

U_AH − U_H = Σ_{i=1}^n Pr[π_1 = x_i] Pr[τ_1 = x_i] (V_{−i} − x_i)
U_HH − U_H = Σ_{i=1}^n Pr[π_1 = x_i]² (V_{−i} − x_i)

Let C_A = Pr[π_1 = τ_1] = Σ_{i=1}^n Pr[π_1 = x_i] Pr[τ_1 = x_i], and similarly let C_H = Σ_{i=1}^n Pr[π_1 = x_i]². C_A > C_H by Lemma 4 with y′_i = Pr[π_1 = x_{n−i+1}], p′_i = Pr[π_1 = x_{n−i+1}], and q′_i = Pr[τ_1 = x_{n−i+1}].

Let p_i = Pr[τ_1 = x_i] Pr[π_1 = x_i] / C_A, q_i = Pr[π_1 = x_i]² / C_H, and y_i = V_{−i} − x_i. Then, we have

(U_AH − U_H) / C_A = Σ_{i=1}^n p_i y_i and (U_HH − U_H) / C_H = Σ_{i=1}^n q_i y_i.

With φ_A = θ_A + 1 and φ_H = θ_H + 1,

p_i / q_i = (C_H / C_A) · [(1 − φ_A^{−1}) / (φ_A^{i−1}(1 − φ_A^{−n}))] / [(1 − φ_H^{−1}) / (φ_H^{i−1}(1 − φ_H^{−n}))] ∝ φ_H^{i−1} / φ_A^{i−1},

which is decreasing in i since φ_H < φ_A. By Lemma 4, Σ_{i=1}^n p_i y_i < Σ_{i=1}^n q_i y_i. Finally, note that U_HH − U_H < 0 by Lemma 7. Then,

(U_AH − U_H) / C_A = Σ_{i=1}^n p_i y_i < Σ_{i=1}^n q_i y_i = (U_HH − U_H) / C_H
⟹ U_AH − U_H < (C_A / C_H)(U_HH − U_H) < U_HH − U_H,

since C_A > C_H and U_HH − U_H < 0. Therefore, U_AH < U_HH, as desired.

F Supplementary Lemmas for the Mallows Model
Lemma 4.
Let {y_i}_{i=1}^n, {p_i}_{i=1}^n, and {q_i}_{i=1}^n be sequences such that
• y_i is strictly increasing,
• Σ_{i=1}^n p_i = Σ_{i=1}^n q_i = 1,
• p_i/q_i is decreasing.
Then, Σ_{i=1}^n p_i y_i < Σ_{i=1}^n q_i y_i.

Proof. First, note that there exists j such that p_i > q_i for i < j and p_i ≤ q_i for i ≥ j. To see this, let j be the smallest index such that p_j ≤ q_j. Such a j must exist because p_i and q_i both sum to 1, so it cannot be the case that p_i > q_i for all i. This implies p_j/q_j ≤ 1, and since p_i/q_i is decreasing, p_i ≤ q_i for i ≥ j.

Next, note that

0 = Σ_{i=1}^n (p_i − q_i) = Σ_{i=1}^{j−1} (p_i − q_i) + Σ_{i=j}^n (p_i − q_i),

so

Σ_{i=1}^{j−1} (p_i − q_i) = Σ_{i=j}^n (q_i − p_i).

Using this choice of j, we can write

Σ_{i=1}^n p_i y_i − Σ_{i=1}^n q_i y_i = Σ_{i=1}^n (p_i − q_i) y_i
= Σ_{i=1}^{j−1} (p_i − q_i) y_i − Σ_{i=j}^n (q_i − p_i) y_i
≤ Σ_{i=1}^{j−1} (p_i − q_i) y_{j−1} − Σ_{i=j}^n (q_i − p_i) y_j
= Σ_{i=1}^{j−1} (p_i − q_i) y_{j−1} − Σ_{i=1}^{j−1} (p_i − q_i) y_j
= Σ_{i=1}^{j−1} (p_i − q_i)(y_{j−1} − y_j) < 0.

Lemma 5.
For x_i > x_j,

Pr[π_1 = x_i ∩ π_2 = x_j] = φ Pr[π_1 = x_j ∩ π_2 = x_i]. (F.1)

Proof. Let π_{−ij} be a permutation of all of the candidates except x_i and x_j. Then, we have

Pr[π_1 = x_i ∩ π_2 = x_j] = Σ_{π_{−ij}} Pr[π_1 = x_i ∩ π_2 = x_j | π_{−ij}] Pr[π_{−ij}]
= Σ_{π_{−ij}} φ Pr[π_1 = x_j ∩ π_2 = x_i | π_{−ij}] Pr[π_{−ij}]
= φ Pr[π_1 = x_j ∩ π_2 = x_i].

Intuitively, given that x_i and x_j are in the top two positions, x_i followed by x_j is φ times more likely than x_j followed by x_i regardless of the remainder of the permutation, and therefore, x_i followed by x_j is φ times more likely overall.

Lemma 6.
For 1 ≤ i ≤ n,

Pr[π_1 = x_i] = (1 − φ^{−1}) / (φ^{i−1}(1 − φ^{−n})). (F.2)

Proof. Let π_{−i} be a permutation over all items except i. Then,

Pr[π_1 = x_i] = Σ_{π_{−i}} Pr[π_1 = x_i | π_{−i}] Pr[π_{−i}] = Σ_{π_{−i}} φ^{−(i−1)} Pr[π_{−i}] = φ^{−(i−1)} Σ_{π_{−i}} Pr[π_{−i}].

Note that Σ_{π_{−i}} Pr[π_{−i}] does not depend on which of the n candidates is excluded. Moreover, Σ_{i=1}^n Pr[π_1 = x_i] = 1. Therefore, we have

Pr[π_1 = x_i] ∝ φ^{−(i−1)}.

Normalizing, we get

Pr[π_1 = x_i] = φ^{−(i−1)} / Σ_{j=1}^n φ^{−(j−1)} = φ^{−(i−1)} (1 − φ^{−1}) / (1 − φ^{−n}) = (1 − φ^{−1}) / (φ^{i−1}(1 − φ^{−n})).

Intuitively, any permutation over the n − 1 remaining candidates is equally compatible with each choice of top candidate, and ranking x_i first costs exactly i − 1 additional inversions.

Lemma 7.
For the Mallows model, U_H(θ_A, θ_H) > U_HH(θ_A, θ_H).

Proof. Intuitively, this is because selecting first is better than selecting second. To prove this, let π and τ be rankings generated by independent human evaluators under the Mallows model, i.e., π, τ ∼ F_{θ_H}.

U_H(θ_A, θ_H) − U_HH(θ_A, θ_H) = E[π_1] − E[τ_1 · 1[π_1 ≠ τ_1] + τ_2 · 1[π_1 = τ_1]]
= E[(τ_1 − τ_2) · 1[π_1 = τ_1]]   (since E[π_1] = E[τ_1])
= E[(π_1 − π_2) · 1[π_1 = τ_1]]   (exchanging the roles of π and τ).

For any i < j, conditioned on π_1 = τ_1, the pair is more likely to be correctly ordered than not:

E[(π_1 − π_2) · 1[π_1 = τ_1]] = Σ_{i<j} (x_i − x_j)(Pr[π_1 = x_i ∩ π_2 = x_j] Pr[τ_1 = x_i] − Pr[π_1 = x_j ∩ π_2 = x_i] Pr[τ_1 = x_j]) > 0,

since by Lemmas 5 and 6, Pr[π_1 = x_i ∩ π_2 = x_j] Pr[τ_1 = x_i] = φ_H^{1+(j−i)} Pr[π_1 = x_j ∩ π_2 = x_i] Pr[τ_1 = x_j] for i < j, and φ_H > 1.
To prove this theorem, we make use of the following lemma.
Lemma 8.
Under the Mallows model, the probability that any two items i < j are correctly ranked increases monotonically with the accuracy parameter φ.

Alternatively, we could prove this by showing that for any permutation with i in front, the permutation in which i and i − 1 are swapped is φ times more likely, and thus, i − 1 is φ times more likely to be in front than i.

Proof. Let inv(π) be the number of inversions in a permutation π. Under the Mallows model, the probability of observing π is proportional to φ^{−inv(π)}. Let S_{i≻j} (resp. S_{j≻i}) be the set of permutations where i is ranked before j (resp. j is ranked before i). Then, the probability i is ranked before j is

Pr[i ≻ j] = Σ_{π ∈ S_{i≻j}} φ^{−inv(π)} / (Σ_{π ∈ S_{i≻j}} φ^{−inv(π)} + Σ_{π ∈ S_{j≻i}} φ^{−inv(π)}).

We will show that (d/dφ) Pr[i ≻ j] > 0. Note that this is equivalent to showing (d/dφ) (Pr[i ≻ j] / Pr[j ≻ i]) > 0. Note that

Pr[i ≻ j] / Pr[j ≻ i] = Σ_{π ∈ S_{i≻j}} φ^{−inv(π)} / Σ_{π ∈ S_{j≻i}} φ^{−inv(π)}.

Let π_{i:j} be the subsequence of π containing elements i through j. Then, we have

Pr[i ≻ j] / Pr[j ≻ i] = [Σ_{π_{i:j} : π ∈ S_{i≻j}} φ^{−inv(π_{i:j})} Σ_{π′ : π′_{i:j} = π_{i:j}} φ^{inv(π_{i:j}) − inv(π′)}] / [Σ_{π_{i:j} : π ∈ S_{j≻i}} φ^{−inv(π_{i:j})} Σ_{π′ : π′_{i:j} = π_{i:j}} φ^{inv(π_{i:j}) − inv(π′)}] = Σ_{π_{i:j} : π ∈ S_{i≻j}} φ^{−inv(π_{i:j})} / Σ_{π_{i:j} : π ∈ S_{j≻i}} φ^{−inv(π_{i:j})}.

Intuitively, the term Σ_{π′ : π′_{i:j} = π_{i:j}} φ^{inv(π_{i:j}) − inv(π′)} does not depend on π_{i:j} because for any π_{i:j}, if we fix the order and positions of the remaining elements, the number of inversions involving at least one element outside of i:j (i.e., inv(π′) − inv(π_{i:j})) is a constant. For fixed π_{i:j}, there is a bijection between permutations π′ : π′_{i:j} = π_{i:j} and fixed orders and positions of the remaining elements (excluding i:j), meaning this sum does not depend on π_{i:j}. Thus, for the remainder of this proof, we can assume without loss of generality that i = 1 and j = n. The quantity of interest becomes

Pr[1 ≻ n] / Pr[n ≻ 1] = Σ_{π ∈ S_{1≻n}} φ^{−inv(π)} / Σ_{π ∈ S_{n≻1}} φ^{−inv(π)}.

Next, we observe that we can similarly ignore inversions between two elements that are neither 1 nor n. To see this, let inv_{1,n}(π) be the number of inversions involving at least one of 1 and n. Then, if we fix the order and positions of 1 and n, all possible permutations of the remaining elements 2 through n − 1 yield the same value of inv(π) − inv_{1,n}(π). More formally, let π(1) and π(n) be the respective positions of elements 1 and n. Then, we have

Σ_{π ∈ S_{1≻n}} φ^{−inv(π)} = Σ_{k<ℓ} Σ_{π : π(1)=k, π(n)=ℓ} φ^{−inv(π)} = Σ_{k<ℓ} Σ_{π : π(1)=k, π(n)=ℓ} φ^{−inv_{1,n}(π)} · φ^{inv_{1,n}(π) − inv(π)} = Σ_{k<ℓ} φ^{−(k−1)−(n−ℓ)} Σ_{π : π(1)=k, π(n)=ℓ} φ^{inv_{1,n}(π) − inv(π)}.

As noted above, Σ_{π : π(1)=k, π(n)=ℓ} φ^{inv_{1,n}(π) − inv(π)} does not depend on k or ℓ, since every permutation of the remaining elements yields the same number of inversions among them regardless of k and ℓ. A similar argument yields

Σ_{π ∈ S_{n≻1}} φ^{−inv(π)} = Σ_{k>ℓ} φ^{−(k−1)−(n−ℓ)+1} Σ_{π : π(1)=k, π(n)=ℓ} φ^{inv_{1,n}(π) − inv(π)}.

Thus,

Pr[1 ≻ n] / Pr[n ≻ 1] = Σ_{k<ℓ} φ^{−(k−1)−(n−ℓ)} / Σ_{k>ℓ} φ^{−(k−1)−(n−ℓ)+1} = (Σ_{k<ℓ} φ^{−(k−1)−(n−ℓ)} / Σ_{k>ℓ} φ^{−(k−1)−(n−ℓ)+1}) · (φ^{n−1} / φ^{n−1}) = Σ_{k<ℓ} φ^{ℓ−k} / Σ_{k>ℓ} φ^{ℓ−k+1}.

Note that each term in the numerator is strictly increasing in φ, while each term in the denominator is weakly decreasing in φ (since ℓ − k + 1 ≤ 0 when k > ℓ). As a result, (d/dφ) Pr[1 ≻ n] / Pr[n ≻ 1] > 0.
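Lemma 8 can likewise be checked by exact enumeration for a small instance. The sketch below (a numerical check, not part of the proof; n = 4 and the grid of φ values are arbitrary test choices) computes Pr[i ≻ j] under Pr[π] ∝ φ^{−inv(π)} for the extreme pair i = 1, j = n and confirms it is increasing in φ:

```python
import itertools

def inversions(p):
    # number of out-of-order pairs relative to the identity ordering
    return sum(1 for a in range(len(p)) for b in range(a + 1, len(p)) if p[a] > p[b])

def prob_ranked_before(n, phi, i, j):
    # Pr[i is ranked before j] under the Mallows model: Pr[p] proportional to phi**(-inversions(p))
    num = den = 0.0
    for p in itertools.permutations(range(n)):
        w = phi ** -inversions(p)
        den += w
        if p.index(i) < p.index(j):
            num += w
    return num / den

phis = (1.0, 1.5, 2.0, 3.0, 5.0)
vals = [prob_ranked_before(4, phi, i=0, j=3) for phi in phis]

assert abs(vals[0] - 0.5) < 1e-12                       # phi = 1 is the uniform distribution
assert all(v2 > v1 for v1, v2 in zip(vals, vals[1:]))   # increasing in phi, matching Lemma 8
```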