[PDF] An Elo-like System for Massive Multiplayer Competitions

Abstract

Rating systems play an important role in competitive sports and games. They provide a measure of player skill, which incentivizes competitive performances and enables balanced match-ups. In this paper, we present a novel Bayesian rating system for contests with many participants. It is widely applicable to competition formats with discrete ranked matches, such as online programming competitions, obstacle courses races, and some video games. The simplicity of our system allows us to prove theoretical bounds on robustness and runtime. In addition, we show that the system aligns incentives: that is, a player who seeks to maximize their rating will never want to underperform. Experimentally, the rating system rivals or surpasses existing systems in prediction accuracy, and computes faster than existing systems by up to an order of magnitude.

Full PDF

AAn Elo-like System for Massive Multiplayer Competitions

Aram Ebtekar

Vancouver, BC, [email protected]

Paul Liu

Stanford UniversityStanford, CA, [email protected]

ABSTRACT

Rating systems play an important role in competitive sports andgames. They provide a measure of player skill, which incentivizescompetitive performances and enables balanced match-ups. In thispaper, we present a novel Bayesian rating system for contests withmany participants. It is widely applicable to competition formatswith discrete ranked matches, such as online programming competi-tions, obstacle courses races, and some video games. The simplicityof our system allows us to prove theoretical bounds on robustnessand runtime. In addition, we show that the system aligns incentives :that is, a player who seeks to maximize their rating will never wantto underperform. Experimentally, the rating system rivals or sur-passes existing systems in prediction accuracy, and computes fasterthan existing systems by up to an order of magnitude.

Competitions, in the form of sports, games, and examinations, havebeen with us since antiquity. Many competitions grade perfor-mances along a numerical scale, such as a score on a test or acompletion time in a race. In the case of a college admissions examor a track race, scores are standardized so that a given score ontwo different occasions carries the same meaning. However, inevents that feature novelty, subjectivity, or close interaction, stan-dardization is difficult. The Spartan Races, completed by millionsof runners, feature a variety of obstacles placed on hiking trailsaround the world [8]. Rock climbing, a sport to be added to the2020 Olympics, likewise has routes set specifically for each com-petition. DanceSport, gymnastics, and figure skating competitionshave a panel of judges who rank contestants against one another;these subjective scores are known to be noisy [26]. Most boardgames feature considerable inter-player interaction. In all thesecases, scores can only be used to compare and rank participants atthe same event. Players, spectators, and contest organizers who areinterested in comparing players’ skill levels across different com-petitions will need to aggregate the entire history of such rankings.A strong player, then, is one who consistently wins against weakerplayers. To quantify skill, we need a rating system .Good rating systems are difficult to create, as they must bal-ance several mutually constraining objectives. First and foremost,the rating system must be accurate, in that ratings provide use-ful predictors of contest outcomes. Second, the ratings must beefficient to compute: in video game applications, rating systemsare predominantly used for matchmaking in massively multiplayeronline games (such as Halo, CounterStrike, League of Legends,etc.) [20, 24, 28]. These games have hundreds of millions of playersplaying tens of millions of games per day, necessitating certainlatency and memory requirements for the rating system [9]. Third,the rating system must align incentives. That is, players should not modify their performance to “game” the rating system. Ratingsystems that can be gamed often create disastrous consequences toplayer-base, more often than not leading to the loss of players fromthe game [3]. Finally, the ratings provided by the system must behuman-interpretable: ratings are typically represented to players asa single number encapsulating their overall skill, and many playerswant to understand and predict how their performances affect theirrating [16].Classically, rating systems were designed for two-player games.The famous Elo system [13], as well as its Bayesian successorsGlicko and Glicko-2, have been widely applied to games such asChess and Go [16–18]. Both Glicko versions model each player’sskill as a real random variable that evolves with time accordingto Brownian motion. Inference is done by entering these variablesinto the Bradley-Terry model, which predicts probabilities of gameoutcomes. Glicko-2 refines the Glicko system by adding a ratingvolatility parameter. Unfortunately, Glicko-2 is known to be flawedin practice, potentially incentivising players to lose. This was mostnotably exploited in the popular game of Pokemon Go [3]; seeSection 5.1 for a discussion of this issue.The family of Elo-like methods just described only utilize thebinary outcome of a match. In settings where a scoring systemprovides a more fine-grained measure of match performance, Ko-valchik [22] has shown variants of Elo that are able to take advan-tage of score information. For competitions consisting of several settasks, such as academic olympiads, Forišek [14] developed a modelin which each task gives a different “response” to the player: the to-tal response then predicts match outcomes. However, such systemsare often highly application-dependent and hard to calibrate.Though Elo-like systems are widely used in two-player contests,one needn’t look far to find competitions that involve much morethan two players. Aside from the aforementioned sporting examples,there are video games such as CounterStrike and Halo, as well asacademic olympiads: notably, programming contest platforms suchas Codeforces, TopCoder, and Kaggle [2, 5, 7]. In these applications,the number of contestants can easily reach into the thousands.Some more recent works present interesting methods to deal withcompetitions between two teams [11, 19, 21, 23], but they do notpresent efficient extensions for settings in which players are sortedinto more than two, let alone thousands, of distinct places.In a many-player ranked competition, it is important to note thatthe pairwise match outcomes are not independent, as they would bein a series of 1v1 matches. Thus, TrueSkill [20] and its variants [12,24, 25] model a player’s performance during each contest as a singlerandom variable. The overall rankings are assumed to reveal thetotal order among these hidden performance variables, with variousstrategies used to model ties and teams. These TrueSkill algorithmsare efficient in practice, successfully rating userbases that number a r X i v : . [ c s . I R ] J a n ell into the millions (the Halo series, for example, has over 60million sales since 2001 [4]).The main disadvantage of TrueSkill is its complexity: originallydeveloped by Microsoft for the popular Halo video game, TrueSkillperforms approximate belief propagation on a factor graph, whichis iterated until convergence [20]. Aside from being less human-interpretable, this complexity means that, to our knowledge, thereare no proofs of key properties such as runtime and incentive align-ment. Even when these properties are discussed [24], no rigorousjustification is provided. In addition, we are not aware of any workthat extends TrueSkill to non-Gaussian performance models, whichmight be desirable to limit the influence of outlier performances(see Section 5.2).It might be for these reasons that platforms such as Codeforcesand TopCoder opted for their own custom rating systems. These sys-tems are not published in academia and do not come with Bayesianjustifications. However, they retain the formulaic simplicity of Eloand Glicko, extending them to settings with much more than twoplayers. The Codeforces system includes ad hoc heuristics to dis-tinguish top players, while curbing rampant inflation. TopCoder’sformulas are more principled from a statistical perspective; how-ever, it has a volatility parameter similar to Glicko-2, and hencesuffers from similar exploits [14]. Despite their flaws, these sys-tems have been in place for over a decade, and have more recentlygained adoption by additional platforms such as CodeChef andLeetCode [1, 6]. Our contributions.

In this paper, we describe the Elo-MMR rat-ing system, obtained by a principled approximation of a Bayesianmodel very similar to TrueSkill. It is fast, embarrassingly parallel,and makes accurate predictions. Most interesting of all, its simplic-ity allows us to rigorously analyze its properties: the “MMR” inthe name stands for “Massive”, “Monotonic”, and “Robust”. “Mas-sive” means that it supports any number of players with a runtimethat scales linearly; “monotonic” means that it aligns incentives so that a rating-maximizing player always wants to perform well;“robust” means that rating changes are bounded, with the boundbeing smaller for more consistent players than for volatile play-ers. Robustness turns out to be a natural byproduct of accuratelymodeling performances with heavy-tailed distributions, such asthe logistic. TrueSkill is believed to satisfy the first two properties,albeit without proof, but fails robustness. Codeforces only satisfiesaligned incentives, and TopCoder only satisfies robustness.Experimentally, we show that Elo-MMR achieves state-of-the-artperformance in terms of both prediction accuracy and runtime. Inparticular, we process the entire Codeforces database of over 300Krated users and 1000 contests in well under a minute, beating the ex-isting Codeforces system by an order of magnitude while improvingupon its accuracy. A difficulty we faced was the scarcity of efficientopen-source rating system implementations. In an effort to aidresearchers and practitioners alike, we provide open-source imple-mentations of all rating systems, datasets, and additional processingused in our experiments at https://github.com/EbTech/EloR/.

Organization.

In Section 2, we formalize the details of our Bayesianmodel. We then show how to estimate player skill under this modelin Section 3, and develop some intuitions of the resulting formulas. As a further refinement, Section 4 models skill evolutions from play-ers training or atrophying between competitions. This modeling isquite tricky as we choose to retain players’ momentum while en-suring it cannot be exploited for incentive-misaligned rating gains.Section 5 proves that the system as a whole satisfies several salientproperties, the most critical of which is aligned incentives. Finally,we present experimental evaluations in Section 6.

We now describe the setting formally, denoting random variablesby capital letters. A series of competitive rounds , indexed by 𝑡 = , , , . . . , take place sequentially in time. Each round has a set ofparticipating players P 𝑡 , which may in general overlap betweenrounds. A player’s skill is likely to change with time, so we repre-sent the skill of player 𝑖 at time 𝑡 by a real random variable 𝑆 𝑖,𝑡 .In round 𝑡 , each player 𝑖 ∈ P 𝑡 competes at some performance level 𝑃 𝑖,𝑡 , typically close to their current skill 𝑆 𝑖,𝑡 . The deviations { 𝑃 𝑖,𝑡 − 𝑆 𝑖,𝑡 } 𝑖 ∈P 𝑡 are assumed to be i.i.d. and independent of { 𝑆 𝑖,𝑡 } 𝑖 ∈P 𝑡 .Performances are not observed directly; instead, a ranking givesthe relative order among all performances { 𝑃 𝑖,𝑡 } 𝑖 ∈P 𝑡 . In particular,ties are modelled to occur when performances are exactly equal,a zero-probability event when their distributions are continuous. This ranking constitutes the observational evidence 𝐸 𝑡 for ourBayesian updates. The rating system seeks to estimate the skill 𝑆 𝑖,𝑡 of every player at the present time 𝑡 , given the historical roundrankings 𝐸 ≤ 𝑡 : = { 𝐸 , . . . , 𝐸 𝑡 } .We overload the notation Pr for both probabilities and probabil-ity densities: the latter interpretation applies to zero-probabilityevents, such as in Pr ( 𝑆 𝑖,𝑡 = 𝑠 ) . We also use colons as shorthand forcollections of variables differing only in a subscript: for instance, 𝑃 : ,𝑡 : = { 𝑃 𝑖,𝑡 } 𝑖 ∈P 𝑡 . The joint distribution described by our Bayesianmodel factorizes as follows:Pr ( 𝑆 : , : , 𝑃 : , : , 𝐸 : ) (1) = (cid:214) 𝑖 Pr ( 𝑆 𝑖, ) (cid:214) 𝑖,𝑡 Pr ( 𝑆 𝑖,𝑡 | 𝑆 𝑖,𝑡 − ) (cid:214) 𝑖,𝑡 Pr ( 𝑃 𝑖,𝑡 | 𝑆 𝑖,𝑡 ) (cid:214) 𝑡 Pr ( 𝐸 𝑡 | 𝑃 : ,𝑡 ) , where Pr ( 𝑆 𝑖, ) is the initial skill prior,Pr ( 𝑆 𝑖,𝑡 | 𝑆 𝑖,𝑡 − ) is the skill evolution model (Section 4),Pr ( 𝑃 𝑖,𝑡 | 𝑆 𝑖,𝑡 ) is the performance model, andPr ( 𝐸 𝑡 | 𝑃 : ,𝑡 ) is the evidence model.For the first three factors, we will specify log-concave distributions(see Definition 3.1). The evidence model, on the other hand, is adeterministic indicator. It equals one when 𝐸 𝑡 is consistent withthe relative ordering among { 𝑃 𝑖,𝑡 } 𝑖 ∈P 𝑡 , and zero otherwise.Finally, our model assumes that the number of participants |P 𝑡 | islarge. The main idea behind our algorithm is that, in sufficiently mas-sive competitions, from the evidence 𝐸 𝑡 we can infer very preciseestimates for { 𝑃 𝑖,𝑡 } 𝑖 ∈P 𝑡 . Hence, we can treat these performancesas if they were observed directly.That is, suppose we have the skill prior at round 𝑡 : 𝜋 𝑖,𝑡 ( 𝑠 ) : = Pr ( 𝑆 𝑖,𝑡 = 𝑠 | 𝑃 𝑖, < 𝑡 ) . (2) The relevant limiting procedure is to treat performances within 𝜖 -width buckets asties, and letting 𝜖 → . This technicality appears in the proof of Theorem 3.2. ow, we observe 𝐸 𝑡 . By Equation (1), it is conditionally indepen-dent of 𝑆 𝑖,𝑡 , given 𝑃 𝑖, ≤ 𝑡 . By the law of total probability,Pr ( 𝑆 𝑖,𝑡 = 𝑠 | 𝑃 𝑖, < 𝑡 , 𝐸 𝑡 ) = ∫ Pr ( 𝑆 𝑖,𝑡 = 𝑠 | 𝑃 𝑖, < 𝑡 , 𝑃 𝑖,𝑡 = 𝑝 ) Pr ( 𝑃 𝑖,𝑡 = 𝑝 | 𝑃 𝑖, < 𝑡 , 𝐸 𝑡 ) d 𝑝 → Pr ( 𝑆 𝑖,𝑡 = 𝑠 | 𝑃 𝑖, ≤ 𝑡 ) almost surely as |P 𝑡 | → ∞ . The integral is intractable in general, since the performance poste-rior Pr ( 𝑃 𝑖,𝑡 = 𝑝 | 𝑃 𝑖, < 𝑡 , 𝐸 𝑡 ) depends not only on player 𝑖 , but alsoon our belief regarding the skills of all 𝑗 ∈ P 𝑡 . However, in the limitof infinite participants, Doob’s consistency theorem [15] impliesthat it concentrates at the true value 𝑃 𝑖,𝑡 . Since our posteriors arecontinuous, the convergence holds for all 𝑠 simultaneously.Indeed, we don’t even need the full evidence 𝐸 𝑡 . Let 𝐸 𝐿𝑖,𝑡 = { 𝑗 ∈P : 𝑃 𝑗,𝑡 > 𝑃 𝑖,𝑡 } be the set of players against whom 𝑖 lost, and 𝐸 𝑊𝑖,𝑡 = { 𝑗 ∈ P : 𝑃 𝑗,𝑡 < 𝑃 𝑖,𝑡 } be the set of players against whom 𝑖 won. That is, we only see who wins, draws, and loses against 𝑖 . 𝑃 𝑖,𝑡 remains identifiable using only ( 𝐸 𝐿𝑖,𝑡 , 𝐸

𝑊𝑖,𝑡 ) , which will be moreconvenient for our purposes.Passing to the limit |P 𝑡 | → ∞ serves to justify some commonsimplifications made by algorithms such as TrueSkill: since condi-tioning on 𝑃 𝑖, ≤ 𝑡 makes the skills of different players independent ofone another, it becomes accurate to model them as such. In additionto simplifying derivations, this fact ensures that a player’s poste-rior is unaffected by rounds in which they are not a participant,arguably a desirable property in its own right. Furthermore, 𝑃 𝑖, ≤ 𝑡 being a sufficient statistic for skill prediction renders any additionalinformation, such as domain-specific raw scores, redundant.Finally, a word on the rate of convergence. Suppose we wantour estimate to be within 𝜖 of 𝑃 𝑖,𝑡 , with probability at least 1 − 𝛿 .By asymptotic normality of the posterior [15], it suffices to have 𝑂 ( 𝜖 √︃ log 𝛿 ) participants.When the initial prior, performance model, and evolution modelare all Gaussian, treating 𝑃 𝑖,𝑡 as certain is the only simplifyingapproximation we will make; that is, in the limit |P 𝑡 | → ∞ , ourmethod performs exact inference on Equation (1). In the followingsections, we focus some attention on generalizing the performancemodel to non-Gaussian log-concave families, parametrized by lo-cation and scale. We will use the logistic distribution as a runningexample and see that it induces robustness; however, our frameworkis agnostic to the specific distributions used.The rating 𝜇 𝑖,𝑡 of player 𝑖 after round 𝑡 should be a statistic thatsummarizes their posterior distribution: we’ll use the maximum aposteriori (MAP) estimate, obtained by setting 𝑠 to maximize theposterior Pr ( 𝑆 𝑖,𝑡 = 𝑠 | 𝑃 𝑖, ≤ 𝑡 ) . By Bayes’ rule, 𝜇 𝑖,𝑡 : = arg max 𝑠 𝜋 𝑖,𝑡 ( 𝑠 ) Pr ( 𝑃 𝑖,𝑡 | 𝑆 𝑖,𝑡 = 𝑠 ) . (3)This objective suggests a two-phase algorithm to update each player 𝑖 ∈ P 𝑡 at round 𝑡 . In phase one, we estimate 𝑃 𝑖,𝑡 from ( 𝐸 𝐿𝑖,𝑡 , 𝐸

𝑊𝑖,𝑡 ) . ByDoob’s consistency theorem, our estimate is extremely precise when |P 𝑡 | is large, so we assume it to be exact. In phase two, we updateour posterior for 𝑆 𝑖,𝑡 and the rating 𝜇 𝑖,𝑡 according to Equation (3).We will occasionally make use of the prior rating , defined as 𝜇 𝜋𝑖,𝑡 : = arg max 𝑠 𝜋 𝑖,𝑡 ( 𝑠 ) . In this section, we describe the first phase of Elo-MMR. For nota-tional convenience, we assume all probability expressions to beconditioned on the prior context 𝑃 𝑖, < 𝑡 , and omit the subscript 𝑡 .Our prior belief on each player’s skill 𝑆 𝑖 implies a prior distribu-tion on 𝑃 𝑖 . Let’s denote its probability density function (pdf) by 𝑓 𝑖 ( 𝑝 ) : = Pr ( 𝑃 𝑖 = 𝑝 ) = ∫ 𝜋 𝑖 ( 𝑠 ) Pr ( 𝑃 𝑖 = 𝑝 | 𝑆 𝑖 = 𝑠 ) d 𝑠, (4)where 𝜋 𝑖 ( 𝑠 ) was defined in Equation (2). Let 𝐹 𝑖 ( 𝑝 ) : = Pr ( 𝑃 𝑖 ≤ 𝑝 ) = ∫ 𝑝 −∞ 𝑓 𝑖 ( 𝑥 ) d 𝑥, be the corresponding cumulative distribution function (cdf). For thepurpose of analysis, we’ll also define the following “loss”, “draw”,and “victory” functions: 𝑙 𝑖 ( 𝑝 ) : = dd 𝑝 ln ( − 𝐹 𝑖 ( 𝑝 )) = − 𝑓 𝑖 ( 𝑝 ) − 𝐹 𝑖 ( 𝑝 ) ,𝑑 𝑖 ( 𝑝 ) : = dd 𝑝 ln 𝑓 𝑖 ( 𝑝 ) = 𝑓 ′ 𝑖 ( 𝑝 ) 𝑓 𝑖 ( 𝑝 ) ,𝑣 𝑖 ( 𝑝 ) : = dd 𝑝 ln 𝐹 𝑖 ( 𝑝 ) = 𝑓 𝑖 ( 𝑝 ) 𝐹 𝑖 ( 𝑝 ) . Evidently, 𝑙 𝑖 ( 𝑝 ) < < 𝑣 𝑖 ( 𝑝 ) . Now we define what it means forthe deviation 𝑃 𝑖 − 𝑆 𝑖 to be log-concave.Definition 3.1. An absolutely continuous random variable on aconvex domain is log-concave if its probability density function 𝑓 ispositive on its domain and satisfies 𝑓 ( 𝜃𝑥 + ( − 𝜃 ) 𝑦 ) > 𝑓 ( 𝑥 ) 𝜃 𝑓 ( 𝑦 ) − 𝜃 , ∀ 𝜃 ∈ ( , ) , 𝑥 ≠ 𝑦. We note that log-concave distributions appear widely, and in-clude the Gaussian and logistic distributions used in Glicko, TrueSkill,and many others. We’ll see inductively that our prior 𝜋 𝑖 is log-concave at every round. Since log-concave densities are closedunder convolution [10], the independent sum 𝑃 𝑖 = 𝑆 𝑖 + ( 𝑃 𝑖 − 𝑆 𝑖 ) isalso log-concave. The following lemma (proved in the appendix)makes log-concavity very convenient:Lemma 3.1. If 𝑓 𝑖 is continuously differentiable and log-concave,then the functions 𝑙 𝑖 , 𝑑 𝑖 , 𝑣 𝑖 are continuous, strictly decreasing, and 𝑙 𝑖 ( 𝑝 ) < 𝑑 𝑖 ( 𝑝 ) < 𝑣 𝑖 ( 𝑝 ) for all 𝑝. For the remainder of this section, we fix the analysis with respectto some player 𝑖 . As argued in Section 2, 𝑃 𝑖 concentrates verynarrowly in the posterior. Hence, we can estimate 𝑃 𝑖 by its MAP,choosing 𝑝 so as to maximize:Pr ( 𝑃 𝑖 = 𝑝 | 𝐸 𝐿𝑖 , 𝐸 𝑊𝑖 ) ∝ 𝑓 𝑖 ( 𝑝 ) Pr ( 𝐸 𝐿𝑖 , 𝐸 𝑊𝑖 | 𝑃 𝑖 = 𝑝 ) . Define 𝑗 ≻ 𝑖 , 𝑗 ≺ 𝑖 , 𝑗 ∼ 𝑖 as shorthand for 𝑗 ∈ 𝐸 𝐿𝑖 , 𝑗 ∈ 𝐸 𝑊𝑖 , 𝑗 ∈ P \ ( 𝐸 𝐿𝑖 ∪ 𝐸 𝑊𝑖 ) (that is, 𝑃 𝑗 > 𝑃 𝑖 , 𝑃 𝑗 < 𝑃 𝑖 , 𝑃 𝑗 = 𝑃 𝑖 ), respectively.The following theorem yields our MAP estimate: L R - - - - Figure 1: 𝐿 versus 𝐿 𝑅 for typical values (left). Gaussian ver-sus logistic probability density functions (right). Theorem 3.2.

Suppose that for all 𝑗 , 𝑓 𝑗 is continuously differen-tiable and log-concave. Then the unique maximizer of Pr ( 𝑃 𝑖 = 𝑝 | 𝐸 𝐿𝑖 , 𝐸 𝑊𝑖 ) is given by the unique zero of 𝑄 𝑖 ( 𝑝 ) : = ∑︁ 𝑗 ≻ 𝑖 𝑙 𝑗 ( 𝑝 ) + ∑︁ 𝑗 ∼ 𝑖 𝑑 𝑗 ( 𝑝 ) + ∑︁ 𝑗 ≺ 𝑖 𝑣 𝑗 ( 𝑝 ) . The proof is relegated to the appendix. Intuitively, we’re sayingthat the performance is the balance point between appropriatelyweighted wins, draws, and losses. Let’s look at two specializationsof our general model, to serve as running examples in this paper.

Gaussian performance model.

If both 𝑆 𝑗 and 𝑃 𝑗 − 𝑆 𝑗 are assumedto be Gaussian with known means and variances, then their inde-pendent sum 𝑃 𝑗 will also be a known Gaussian. It is analytic andlog-concave, so Theorem 3.2 applies.We substitute the well-known Gaussian pdf and cdf for 𝑓 𝑗 and 𝐹 𝑗 ,respectively. A simple binary search, or faster numerical techniquessuch as the Illinois algorithm or Newton’s method, can be employedto solve for the maximizing 𝑝 . Logistic performance model.

Now we assume the performancedeviation 𝑃 𝑗 − 𝑆 𝑗 has a logistic distribution with mean 0 and vari-ance 𝛽 . In general, the rating system administrator is free to set 𝛽 differently for each contest. Since shorter contests tend to be morevariable, one reasonable choice might be to make 1 / 𝛽 proportionalto the contest duration.Given the mean and variance of the skill prior, the independentsum 𝑃 𝑗 = 𝑆 𝑗 + ( 𝑃 𝑗 − 𝑆 𝑗 ) would have the same mean, and a variancethat’s increased by 𝛽 . Unfortunately, we’ll see that the logisticperformance model implies a form of skill prior from which it’stough to extract a mean and variance. Even if we could, the sumdoes not yield a simple distribution.For experienced players, we expect 𝑆 𝑗 to contribute much lessvariance than 𝑃 𝑗 − 𝑆 𝑗 ; thus, in our heuristic approximation, we take 𝑃 𝑗 to have the same form of distribution as the latter. That is, wetake 𝑃 𝑗 to be logistic, centered at the prior rating 𝜇 𝜋𝑗 = arg max 𝜋 𝑗 ,with variance 𝛿 𝑗 = 𝜎 𝑗 + 𝛽 , where 𝜎 𝑗 will be given by Equation (8).This distribution is analytic and log-concave, so the same methodsbased on Theorem 3.2 apply. Define the scale parameter ¯ 𝛿 𝑗 : = √ 𝜋 𝛿 𝑗 . A logistic distribution with variance 𝛿 𝑗 has cdf and pdf: 𝐹 𝑗 ( 𝑥 ) = + 𝑒 −( 𝑥 − 𝜇 𝜋𝑗 )/ ¯ 𝛿 𝑗 = (cid:32) + tanh 𝑥 − 𝜇 𝜋𝑗 𝛿 𝑗 (cid:33) ,𝑓 𝑗 ( 𝑥 ) = 𝑒 ( 𝑥 − 𝜇 𝜋𝑗 )/ ¯ 𝛿 𝑗 ¯ 𝛿 𝑗 (cid:16) + 𝑒 ( 𝑥 − 𝜇 𝜋𝑗 )/ ¯ 𝛿 𝑗 (cid:17) =

14 ¯ 𝛿 𝑗 sech 𝑥 − 𝜇 𝜋𝑗 𝛿 𝑗 . The logistic distribution satisfies two very convenient relations: 𝐹 ′ 𝑗 ( 𝑥 ) = 𝑓 𝑗 ( 𝑥 ) = 𝐹 𝑗 ( 𝑥 )( − 𝐹 𝑗 ( 𝑥 ))/ ¯ 𝛿 𝑗 ,𝑓 ′ 𝑗 ( 𝑥 ) = 𝑓 𝑗 ( 𝑥 )( − 𝐹 𝑗 ( 𝑥 ))/ ¯ 𝛿 𝑗 , from which it follows that 𝑑 𝑗 ( 𝑝 ) = − 𝐹 𝑗 ( 𝑝 ) ¯ 𝛿 = − 𝐹 𝑗 ( 𝑝 ) ¯ 𝛿 + − 𝐹 𝑗 ( 𝑝 ) ¯ 𝛿 = 𝑙 𝑗 ( 𝑝 ) + 𝑣 𝑗 ( 𝑝 ) . In other words, a tie counts as the sum of a win and a loss. Thiscan be compared to the approach (used in Elo, Glicko, TopCoder,and Codeforces) of treating each tie as half a win plus half a loss. Finally, putting everything together: 𝑄 𝑖 ( 𝑝 ) = ∑︁ 𝑗 ⪰ 𝑖 𝑙 𝑗 ( 𝑝 ) + ∑︁ 𝑗 ⪯ 𝑖 𝑣 𝑗 ( 𝑝 ) = ∑︁ 𝑗 ⪰ 𝑖 − 𝐹 𝑗 ( 𝑝 ) ¯ 𝛿 𝑗 + ∑︁ 𝑗 ⪯ 𝑖 − 𝐹 𝑗 ( 𝑝 ) ¯ 𝛿 𝑗 . Our estimate for 𝑃 𝑖 is the zero of this expression. The terms onthe right correspond to probabilities of winning and losing againsteach player 𝑗 , weighted by 1 / ¯ 𝛿 𝑗 . Accordingly, we can interpret (cid:205) 𝑗 ∈P ( − 𝐹 𝑗 ( 𝑝 ))/ ¯ 𝛿 𝑗 as a weighted expected rank of a player whoseperformance is 𝑝 . Similar to the performance computations in Code-forces and TopCoder, 𝑃 𝑖 can thus be viewed as the performancelevel at which one’s expected rank would equal 𝑖 ’s actual rank. Having estimated 𝑃 𝑖,𝑡 in the first phase, the second phase is rathersimple. Ignoring normalizing constants, Equation (3) tells us that thepdf of the skill posterior can be obtained as the pointwise productof the pdfs of the skill prior and the performance model. When bothfactors are differentiable and log-concave, then so is their product.Its maximum is the new rating 𝜇 𝑖,𝑡 ; let’s see how to compute it forthe same two specializations of our model. Gaussian skill prior and performance model.

When the skill priorand performance model are Gaussian with known means and vari-ances, multiplying their pdfs yields another known Gaussian. Hence,the posterior is compactly represented by its mean 𝜇 𝑖,𝑡 , which coin-cides with the MAP and rating; and its variance 𝜎 𝑖,𝑡 , which is our uncertainty regarding the player’s skill. Logistic performance model.

When the performance model isnon-Gaussian, the multiplication does not simplify so easily. ByEquation (3), each round contributes an additional factor to thebelief distribution. In general, we allow it to consist of a collectionof simple log-concave factors, one for each round in which player 𝑖 has participated. Denote the participation history by H 𝑖,𝑡 : = { 𝑘 ∈ { , . . . , 𝑡 } : 𝑖 ∈ P 𝑘 } . Elo-MMR, too, can be modified to split ties into half win plus half loss. It’s easy tocheck that Lemma 3.1 still holds if 𝑑 𝑗 ( 𝑝 ) is replaced by 𝑤 𝑙 𝑙 𝑗 ( 𝑝 ) + 𝑤 𝑣 𝑣 𝑗 ( 𝑝 ) for some 𝑤 𝑙 , 𝑤 𝑣 ∈ [ , ] with | 𝑤 𝑙 − 𝑤 𝑣 | < . In particular, we can set 𝑤 𝑙 = 𝑤 𝑣 = . . Theresults in Section 5 won’t be altered by this change. ince each player can be considered in isolation, we’ll omit thesubscript 𝑖 . Specializing to the logistic setting, each 𝑘 ∈ H 𝑡 con-tributes a logistic factor to the posterior, with mean 𝑝 𝑘 and variance 𝛽 𝑘 . We still use a Gaussian initial prior, with mean and variancedenoted by 𝑝 and 𝛽 , respectively. Postponing the discussion ofskill evolution to Section 4, for the moment we assume that 𝑆 𝑘 = 𝑆 for all 𝑘 . The posterior pdf, up to normalization, is then 𝜋 ( 𝑠 ) (cid:214) 𝑘 ∈H 𝑡 Pr ( 𝑃 𝑘 = 𝑝 𝑘 | 𝑆 𝑘 = 𝑠 )∝ exp (cid:32) − ( 𝑠 − 𝑝 ) 𝛽 (cid:33) (cid:214) 𝑘 ∈H 𝑡 sech (cid:18) 𝜋 √ 𝑠 − 𝑝 𝑘 𝛽 𝑘 (cid:19) . (5)Maximizing the posterior density amounts to minimizing itsnegative logarithm. Up to a constant offset, this is given by 𝐿 ( 𝑠 ) : = 𝐿 (cid:18) 𝑠 − 𝑝 𝛽 (cid:19) + ∑︁ 𝑘 ∈H 𝑡 𝐿 𝑅 (cid:18) 𝑠 − 𝑝 𝑘 𝛽 𝑘 (cid:19) , where 𝐿 ( 𝑥 ) : = 𝑥 and 𝐿 𝑅 ( 𝑥 ) : = (cid:18) cosh 𝜋𝑥 √ (cid:19) . Thus, 𝐿 ′ ( 𝑠 ) = 𝑠 − 𝑝 𝛽 + ∑︁ 𝑘 ∈H 𝑡 𝜋𝛽 𝑘 √ ( 𝑠 − 𝑝 𝑘 ) 𝜋𝛽 𝑘 √ . (6) 𝐿 ′ is continuous and strictly increasing in 𝑠 , so its zero is unique:it is the MAP 𝜇 𝑡 . Similar to what we did in the first phase, we cansolve for 𝜇 𝑡 with either binary search or Newton’s method.We pause to make an important observation. From Equation (6),the rating carries a rather intuitive interpretation: Gaussian factorsin 𝐿 become 𝐿 penalty terms, whereas logistic factors take on amore interesting form as 𝐿 𝑅 terms. From Figure 1, we see that the 𝐿 𝑅 term behaves quadratically near the origin, but linearly at theextremities, effectively interpolating between 𝐿 and 𝐿 over a scaleof magnitude 𝛽 𝑘 It is well-known that minimizing a sum of 𝐿 terms pushes theargument towards a weighted mean, while minimizing a sum of 𝐿 terms pushes the argument towards a weighted median. With 𝐿 𝑅 terms, the net effect is that 𝜇 𝑡 acts like a robust average of thehistorical performances 𝑝 𝑘 . Specifically, one can check that 𝜇 𝑡 = (cid:205) 𝑘 𝑤 𝑘 𝑝 𝑘 (cid:205) 𝑘 𝑤 𝑘 , where 𝑤 : = 𝛽 and 𝑤 𝑘 : = 𝜋 ( 𝜇 𝑡 − 𝑝 𝑘 ) 𝛽 𝑘 √ ( 𝜇 𝑡 − 𝑝 𝑘 ) 𝜋𝛽 𝑘 √

12 for 𝑘 ∈ H 𝑡 . (7) 𝑤 𝑘 is close to 1 / 𝛽 𝑘 for typical performances, but can be up to 𝜋 / | 𝜇 𝑡 − 𝑝 𝑘 | →

0, or vanish as | 𝜇 𝑡 − 𝑝 𝑘 | → ∞ .This feature is due to the thicker tails of the logistic distribution,as compared to the Gaussian, resulting in an algorithm that resistsdrastic rating changes in the presence of a few unusually good orbad performances. We’ll formally state this robustness property inTheorem 5.7. Estimating skill uncertainty.

While there is no easy way to com-pute the variance of a posterior in the form of Equation (5), it willbe useful to have some estimate 𝜎 𝑡 of uncertainty. There is a simpleformula in the case where all factors are Gaussian. Since moment-matched logistic and normal distributions are relatively close (c.f. Figure 1), we apply the same formula:1 𝜎 𝑡 : = ∑︁ 𝑘 ∈{ }∪H 𝑡 𝛽 𝑘 . (8) Factors such as training and resting will change a player’s skillover time. If we model skill as a static variable, our system willeventually grow so confident in its estimate that it will refuse toadmit substantial changes. To remedy this, we introduce a skillevolution model, so that in general 𝑆 𝑡 ≠ 𝑆 𝑡 ′ for 𝑡 ≠ 𝑡 ′ . Now ratherthan simply being equal to the previous round’s posterior, the skillprior at round 𝑡 is given by 𝜋 𝑡 ( 𝑠 ) = ∫ Pr ( 𝑆 𝑡 = 𝑠 | 𝑆 𝑡 − = 𝑥 ) Pr ( 𝑆 𝑡 − = 𝑥 | 𝑃 < 𝑡 ) d 𝑥. (9)The factors in the integrand are the skill evolution model and theprevious round’s posterior, respectively. Following other Bayesianrating systems (e.g., Glicko, Glicko-2, and TrueSkill [17, 18, 20]),we model the skill diffusions 𝑆 𝑡 − 𝑆 𝑡 − as independent zero-meanGaussians. That is, Pr ( 𝑆 𝑡 | 𝑆 𝑡 − = 𝑥 ) is a Gaussian with mean 𝑥 andsome variance 𝛾 𝑡 . The Glicko system sets 𝛾 𝑡 proportionally to thetime elapsed since the last update, corresponding to a continuousBrownian motion. Codeforces and TopCoder simply set 𝛾 𝑡 to a con-stant when a player participates, and zero otherwise, correspondingto changes that are in proportion to how often the player competes.Now we are ready to complete the two specializations of our ratingsystem. Gaussian skill prior and performance model.

If both the priorand performance distributions at round 𝑡 − 𝑆 𝑡 − yields a Gaussian prior on 𝑆 𝑡 . Byinduction, the skill belief distribution forever remains Gaussian.This Gaussian specialization of the Elo-MMR framework lacks theR for robustness (see Theorem 5.7), so we call it Elo-MM 𝜒 . Logistic performance model.

After a player’s first contest round,the posterior in Equation (5) becomes non-Gaussian, rendering theintegral in Equation (9) intractable.A very simple approach would be to replace the full posterior inEquation (5) by a Gaussian approximation with mean 𝜇 𝑡 (equal tothe posterior MAP) and variance 𝜎 𝑡 (given by Equation (8)). As inthe previous case, applying diffusions in the Gaussian setting is asimple matter of adding means and variances.With this approximation, no memory is kept of the individualperformances 𝑃 𝑡 . Priors are simply Gaussian, while posterior den-sities are the product of two factors: the Gaussian prior, and alogistic factor corresponding to the latest performance. To ensurerobustness (see Section 5.2), 𝜇 𝑡 is computed as the argmax of thisposterior before replacement by its Gaussian approximation. Wecall the rating system that takes this approach Elo-MMR( ∞ ).As the name implies, it turns out to be a special case of Elo-MMR( 𝜌 ). In the general setting with 𝜌 ∈ [ , ∞) , we keep the fullposterior from Equation (5). Since we cannot tractably compute theeffect of a Gaussian diffusion, we seek a heuristic derivation of thenext round’s prior, retaining a form similar to Equation (5) whilesatisfying many of the same properties as the intended diffusion. .1 Desirable properties of a “pseudodiffusion” We begin by listing some properties that our skill evolution algo-rithm, henceforth called a “pseudodiffusion”, should satisfy. Thefirst two properties are natural: • Aligned incentives.

First and foremost, the pseudodiffusion mustnot break the aligned incentives property of our rating system.That is, a rating-maximizing player should never be motivatedto lose on purpose (Theorem 5.5). • Rating preservation.

The pseudodiffusion must not alter the arg maxof the belief density. That is, the rating of a player should notchange: 𝜇 𝜋𝑡 = 𝜇 𝑡 − .In addition, we borrow four properties of Gaussian diffusions: • Correct magnitude.

Pseudodiffusion with parameter 𝛾 must in-crease the skill uncertainty, as measured by Equation (8), by 𝛾 . • Composability.

Two pseudodiffusions applied in sequence, firstwith parameter 𝛾 and then with 𝛾 , must have the same effectas a single pseudodiffusion with parameter 𝛾 + 𝛾 . • Zero diffusion.

In the limit as 𝛾 →

0, the effect of pseudodiffusionmust vanish, i.e., not alter the belief distribution. • Zero uncertainty.

In the limit as 𝜎 𝑡 − → 𝜇 𝑡 − is a perfect estimate of 𝑆 𝑡 − ), our belief on 𝑆 𝑡 mustbecome Gaussian with mean 𝜇 𝑡 − and variance 𝛾 . Finer-grainedinformation regarding the prior history 𝑃 ≤ 𝑡 must be erased.In particular, Elo-MMR( ∞ ) fails the zero diffusion property becauseit simplifies the belief distribution, even when 𝛾 =

0. In the proof ofTheorem 4.1, we’ll see that Elo-MMR(0) fails the zero uncertainty property. Thus, it is in fact necessary to have 𝜌 strictly positive andfinite. In Section 5.2, we’ll come to interpret 𝜌 as a kind of inversemomentum. Each factor in the posterior (see Equation (5)) has a parameter 𝛽 𝑘 .Define a factor’s weight to be 𝑤 𝑘 : = / 𝛽 𝑘 , which by Equation (8)contributes to the total weight (cid:205) 𝑘 𝑤 𝑘 = / 𝜎 𝑡 . Here, unlike inEquation (7), 𝑤 𝑘 does not depend on | 𝜇 𝑡 − 𝑝 𝑘 | .The approximation step of Elo-MMR( ∞ ) replaces all the logisticfactors by a single Gaussian whose variance is chosen to ensurethat the total weight is preserved. In addition, its mean is chosento preserve the player’s rating, given by the unique zero of Equa-tion (6). Finally, the diffusion step of Elo-MMR( ∞ ) increases theGaussian’s variance, and hence the player’s skill uncertainty, by 𝛾 𝑡 ;this corresponds to a decay in the weight.To generalize the idea, we interleave the two steps in a contin-uous manner. The approximation step becomes a transfer step :rather than replace the logistic factors outright, we take away thesame fraction from each of their weights, and place the sum ofremoved weights onto a new Gaussian factor. The diffusion stepbecomes a decay step , reducing each factor’s weight by the samefraction, chosen such that the overall uncertainty is increased by 𝛾 𝑡 .To make the idea precise, we generalize the posterior from Equa-tion (5) with fractional multiplicities 𝜔 𝑘 , initially set to 1 for each 𝑘 ∈ { } ∪ H 𝑡 . The 𝑘 ’th factor is raised to the power 𝜔 𝑘 ; in Equa-tions (6) and (8), the corresponding term is multiplied by 𝜔 𝑘 . In Algorithm 1

Elo-MMR( 𝜌, 𝛽,𝛾 ) for all rounds 𝑡 dofor all players 𝑖 ∈ P 𝑡 in parallel doif 𝑖 has never competed before then 𝜇 𝑖 , 𝜎 𝑖 ← 𝜇 𝑛𝑒𝑤𝑐𝑜𝑚𝑒𝑟 , 𝜎 𝑛𝑒𝑤𝑐𝑜𝑚𝑒𝑟 𝑝 𝑖 , 𝑤 𝑖 ← [ 𝜇 𝑖 ] , [ / 𝜎 𝑖 ] diffuse( 𝑖, 𝛾, 𝜌 ) 𝜇 𝜋𝑖 , 𝛿 𝑖 ← 𝜇 𝑖 , √︃ 𝜎 𝑖 + 𝛽 for all 𝑖 ∈ P 𝑡 in parallel do update( 𝑖, 𝐸 𝑡 , 𝛽 ) Algorithm 2 diffuse( 𝑖,𝛾, 𝜌 ) 𝜅 ← ( + 𝛾 / 𝜎 𝑖 ) − 𝑤 𝐺 , 𝑤 𝐿 ← 𝜅 𝜌 𝑤 𝑖, , ( − 𝜅 𝜌 ) (cid:205) 𝑘 𝑤 𝑖,𝑘 𝑝 𝑖, ← ( 𝑤 𝐺 𝑝 𝑖, + 𝑤 𝐿 𝜇 𝑖 )/( 𝑤 𝐺 + 𝑤 𝐿 ) 𝑤 𝑖, ← 𝜅 ( 𝑤 𝐺 + 𝑤 𝐿 ) for all 𝑘 ≠ do 𝑤 𝑖,𝑘 ← 𝜅 + 𝜌 𝑤 𝑖,𝑘 𝜎 𝑖 ← 𝜎 𝑖 /√ 𝜅 Algorithm 3 update( 𝑖, 𝐸, 𝛽 ) 𝑝 ← zero (cid:18)(cid:205) 𝑗 ⪯ 𝑖 𝛿 𝑗 (cid:18) tanh 𝑥 − 𝜇 𝜋𝑗 𝛿 𝑗 − (cid:19) + (cid:205) 𝑗 ⪰ 𝑖 𝛿 𝑗 (cid:18) tanh 𝑥 − 𝜇 𝜋𝑗 𝛿 𝑗 + (cid:19)(cid:19) 𝑝 𝑖 .push( 𝑝 ) 𝑤 𝑖 .push(1 / 𝛽 ) 𝜇 𝑖 ← zero (cid:16) 𝑤 𝑖, ( 𝑥 − 𝑝 𝑖, ) + (cid:205) 𝑘 ≠ 𝑤 𝑖,𝑘 𝛽 ¯ 𝛽 tanh 𝑥 − 𝑝 𝑖,𝑘 𝛽 (cid:17) other words, the latter equation is replaced by1 𝜎 𝑡 : = ∑︁ 𝑘 ∈{ }∪H 𝑡 𝑤 𝑘 , where 𝑤 𝑘 : = 𝜔 𝑘 𝛽 𝑘 . For 𝜌 ∈ [ , ∞] , the Elo-MMR( 𝜌 ) algorithm continuously andsimultaneously performs transfer and decay, with transfer proceed-ing at 𝜌 times the rate of decay. Holding 𝛽 𝑘 fixed, changes to 𝜔 𝑘 can be described in terms of changes to 𝑤 𝑘 : (cid:164) 𝑤 = − 𝑟 ( 𝑡 ) 𝑤 + 𝜌𝑟 ( 𝑡 ) ∑︁ 𝑘 ∈H 𝑡 𝑤 𝑘 , (cid:164) 𝑤 𝑘 = −( + 𝜌 ) 𝑟 ( 𝑡 ) 𝑤 𝑘 for 𝑘 ∈ H 𝑡 , where the arbitrary decay rate 𝑟 ( 𝑡 ) can be eliminated by a changeof variable d 𝜏 = 𝑟 ( 𝑡 ) d 𝑡 . After some time Δ 𝜏 , the total weight willhave decayed by a factor 𝜅 : = 𝑒 − Δ 𝜏 , resulting in the new weights: 𝑤 𝑛𝑒𝑤 = 𝜅𝑤 + (cid:16) 𝜅 − 𝜅 + 𝜌 (cid:17) ∑︁ 𝑘 ∈H 𝑡 𝑤 𝑘 ,𝑤 𝑛𝑒𝑤𝑘 = 𝜅 + 𝜌 𝑤 𝑘 for 𝑘 ∈ H 𝑡 . In order for the uncertainty to increase from 𝜎 𝑡 − to 𝜎 𝑡 − + 𝛾 𝑡 , wemust solve 𝜅 / 𝜎 𝑡 − = /( 𝜎 𝑡 − + 𝛾 𝑡 ) for the decay factor: 𝜅 = (cid:32) + 𝛾 𝑡 𝜎 𝑡 − (cid:33) − . In order for this operation to preserve ratings, the transferredweight must be centered at 𝜇 𝑡 − ; see Algorithm 2 for details. lgorithm 1 details the full Elo-MMR( 𝜌 ) rating system. The mainloop runs whenever a round of competition takes place. First, newplayers are initialized with a Gaussian prior. Then, changes in playerskill are modeled by Algorithm 2. Given the round rankings 𝐸 𝑡 , thefirst phase of Algorithm 3 solves an equation to estimate 𝑃 𝑡 . Finally,the second phase solves another equation for the rating 𝜇 𝑡 .The hyperparameters 𝜌, 𝛽,𝛾 are domain-dependent, and can beset by standard hyperparameter search techniques. For convenience,we assume 𝛽 and 𝛾 are fixed and use the shorthand ¯ 𝛽 𝑘 : = √ 𝜋 𝛽 𝑘 .Theorem 4.1. Algorithm 2 with 𝜌 ∈ ( , ∞) meets all of the prop-erties listed in Section 4.1. Proof. We go through each of the six properties in order. • Aligned incentives.

This property will be stated in Theorem 5.5.To ensure that its proof carries through, the relevant facts tonote here are that the pseudodiffusion algorithm ignores theperformances 𝑝 𝑘 , and centers the transferred Gaussian weight atthe rating 𝜇 𝑡 − , which is trivially monotonic in 𝜇 𝑡 − . • Rating preservation.

Recall that the rating is the unique zero of 𝐿 ′ ,defined in Equation (6). To see that this zero is preserved, notethat the decay and transfer operations multiply 𝐿 ′ by constants( 𝜅 and 𝜅 𝜌 , respectively), before adding the new Gaussian term,whose contribution to 𝐿 ′ is zero at its center. • Correct magnitude.

Follows from our derivation for 𝜅 . • Composability.

Follows from correct magnitude and the fact thatevery pseudodiffusion follows the same differential equations. • Zero diffusion. As 𝛾 → 𝜅 →

1. Provided that 𝜌 < ∞ , we alsohave 𝜅 𝜌 →

1. Hence, for all 𝑘 ∈ { } ∪ H 𝑡 , 𝑤 𝑛𝑒𝑤𝑘 → 𝑤 𝑘 . • Zero uncertainty. As 𝜎 𝑡 − → 𝜅 →

0. The total weight decaysfrom 1 / 𝜎 𝑡 − to 𝛾 . Provided that 𝜌 >

0, we also have 𝜅 𝜌 →

0, sothese weights transfer in their entirety, leaving behind a Gaussianwith mean 𝜇 𝑡 − , variance 𝛾 , and no additional history. □ In this section, we see how the simplicity of the Elo-MMR formu-las enables us to rigorously prove that the rating system alignsincentives, is robust, and is computationally efficient.

To demonstrate the need for aligned incentives , let’s look at the con-sequences of violating this property in the TopCoder and Glicko-2rating systems. These systems track a “volatility” for each player,which estimates the variance of their performances. A player whoserecent performance history is more consistent would be assigned alower volatility score, than one with wild swings in performance.The volatility acts as a multiplier on rating changes; thus, play-ers with an extremely low or high performance will have theirsubsequent rating changes amplified.While it may seem like a good idea to boost changes for playerswhose ratings are poor predictors of their performance, this fea-ture has an exploit. By intentionally performing at a weaker level, aplayer can amplify future increases to an extent that more than com-pensates for the immediate hit to their rating. A player may even“farm” volatility by alternating between very strong and very weakperformances. After acquiring a sufficiently high volatility score,the strategic player exerts their honest maximum performance over a series of contests. The amplification eventually results in a ratingthat exceeds what would have been obtained via honest play. Thistype of exploit was discovered in both TopCoder competitions andthe Pokemon Go video game [3, 14]. For a detailed example, seeTable 5.3 of [14].Remarkably, Elo-MMR combines the best of both worlds: we’llsee in Section 5.2 that, for 𝜌 ∈ ( , ∞) , Elo-MMR( 𝜌 ) also boostschanges to inconsistent players. And yet, as we’ll prove in this sec-tion, no such strategic incentive exists in any version of Elo-MMR.Recall that, for the purposes of the algorithm, the performance 𝑝 𝑖 is defined to be the unique zero of the function 𝑄 𝑖 ( 𝑝 ) : = (cid:205) 𝑗 ≻ 𝑖 𝑙 𝑗 ( 𝑝 )+ (cid:205) 𝑗 ∼ 𝑖 𝑑 𝑗 ( 𝑝 ) + (cid:205) 𝑗 ≺ 𝑖 𝑣 𝑗 ( 𝑝 ) , whose terms 𝑙 𝑖 , 𝑑 𝑖 , 𝑣 𝑖 are contributed byopponents against whom 𝑖 lost, drew, or won, respectively. Wins(losses) are always positive (negative) contributions to a player’sperformance score:Lemma 5.1. Adding a win term to 𝑄 𝑖 (·) , or replacing a tie term bya win term, always increases its zero. Conversely, adding a loss term,or replacing a tie term by a loss term, always decreases it. Proof. By Lemma 3.1, 𝑄 𝑖 ( 𝑝 ) is decreasing in 𝑝 . Thus, adding apositive term will increase its zero whereas adding a negative termwill decrease it. The desired conclusion follows by noting that, forall 𝑗 and 𝑝 , 𝑣 𝑗 ( 𝑝 ) and 𝑣 𝑗 ( 𝑝 ) − 𝑑 𝑗 ( 𝑝 ) are positive, whereas 𝑙 𝑗 ( 𝑝 ) and 𝑙 𝑗 ( 𝑝 ) − 𝑑 𝑗 ( 𝑝 ) are negative. □ While not needed for our main result, a similar argument showsthat performance scores are monotonic across the round standings:Theorem 5.2. If 𝑖 ≻ 𝑗 (that is, player 𝑖 beats 𝑗 ) in a given round,then player 𝑖 and 𝑗 ’s performance estimates satisfy 𝑝 𝑖 > 𝑝 𝑗 . Proof. If 𝑖 ≻ 𝑗 with 𝑖, 𝑗 adjacent in the rankings, then 𝑄 𝑖 ( 𝑝 ) − 𝑄 𝑗 ( 𝑝 ) = ∑︁ 𝑘 ∼ 𝑖 ( 𝑑 𝑘 ( 𝑝 ) − 𝑙 𝑘 ( 𝑝 )) + ∑︁ 𝑘 ∼ 𝑗 ( 𝑣 𝑘 ( 𝑝 ) − 𝑑 𝑘 ( 𝑝 )) > . for all 𝑝 . Since 𝑄 𝑖 and 𝑄 𝑗 are decreasing functions, it follows that 𝑝 𝑖 > 𝑝 𝑗 . By induction, this result extends to the case where 𝑖, 𝑗 arenot adjacent in the rankings. □ What matters for incentives is that performance scores be coun-terfactually monotonic; meaning, if we were to alter the roundstandings, a strategic player will always prefer to place higher:Lemma 5.3.

In any given round, holding fixed the relative rankingof all players other than 𝑖 (and holding fixed all preceding rounds),the performance 𝑝 𝑖 is a monotonic function of player i’s prior ratingand of player 𝑖 ’s rank in this round. Proof. Monotonicity in the rating follows directly from mono-tonicity of the self-tie term 𝑑 𝑖 in 𝑄 𝑖 . Since an upward shift in therankings can only convert losses to ties to wins, monotonicity incontest rank follows from Lemma 5.1. □ Having established the relationship between round rankingsand performance scores, the next step is to prove that, even withhindsight, players will always prefer their performance scores tobe as high as possible:Lemma 5.4.

Holding fixed the set of contest rounds in which aplayer has participated, their current rating is monotonic in each oftheir past performance scores. roof. The player’s rating is given by the zero of 𝐿 ′ in Equa-tion (6). The pseudodiffusions of Section 4 modify each of the 𝛽 𝑘 ina manner that does not depend on any of the 𝑝 𝑘 , so they are fixedfor our purposes. Hence, 𝐿 ′ is monotonically increasing in 𝑠 anddecreasing in each of the 𝑝 𝑘 . Therefore, its zero is monotonicallyincreasing in each of the 𝑝 𝑘 .This is almost what we wanted to prove, except that 𝑝 is nota performance. Nonetheless, it is a function of the performances:specifically, a weighted average of historical ratings which, usingthis same lemma as an inductive hypothesis, are themselves mono-tonic in past performances. By induction, the proof is complete. □ Finally, we conclude that the player’s incentives are aligned withoptimizing round rankings, or raw scores:Theorem 5.5 (Aligned Incentives).

Holding fixed the set ofcontest rounds in which each player has participated, and the historicalratings and relative rankings of all players other than 𝑖 , player 𝑖 ’scurrent rating is monotonic in each of their past rankings. Proof. Choose any contest round in player 𝑖 ’s history, and con-sider improving player 𝑖 ’s rank in that round while holding every-thing else fixed. It suffices to show that player 𝑖 ’s current ratingwould necessarily increase as a result.In the altered round, by Lemma 5.3, 𝑝 𝑖 is increased; and byLemma 5.4, player 𝑖 ’s post-round rating is increased. By Lemma 5.3again, this increases player 𝑖 ’s performance score in the followinground. Proceeding inductively, we find that performance scores andratings from this point onward are all increased. □ In the special cases of Elo-MM 𝜒 or Elo-MMR( ∞ ), the rating sys-tem is “memoryless”: the only data retained for each player are thecurrent rating 𝜇 𝑖,𝑡 and uncertainty 𝜎 𝑖,𝑡 ; detailed performance his-tory is not saved. In this setting, we present a natural monotonicitytheorem. A similar theorem was stated for the Codeforces systemin [2], but no proofs were given.Theorem 5.6 (Memoryless Monotonicity Theorem). In ei-ther the Elo-MM 𝜒 or Elo-MMR( ∞ ) system, suppose 𝑖 and 𝑗 are twoparticipants of round 𝑡 . Suppose that the ratings and corresponding un-certainties satisfy 𝜇 𝑖,𝑡 − ≥ 𝜇 𝑗,𝑡 − , 𝜎 𝑖,𝑡 − = 𝜎 𝑗,𝑡 − . Then, 𝜎 𝑖,𝑡 = 𝜎 𝑗,𝑡 .Furthermore, if 𝑖 ≻ 𝑗 in round 𝑡 , then 𝜇 𝑖,𝑡 > 𝜇 𝑗,𝑡 . On the other hand,if 𝑗 ≻ 𝑖 in round 𝑡 , then 𝜇 𝑗,𝑡 − 𝜇 𝑗,𝑡 − > 𝜇 𝑖,𝑡 − 𝜇 𝑖,𝑡 − . Proof. The new contest round will add a rating perturbationwith variance 𝛾 𝑡 , followed by a new performance with variance 𝛽 𝑡 .As a result, 𝜎 𝑖,𝑡 = (cid:32) 𝜎 𝑖,𝑡 − + 𝛾 𝑡 + 𝛽 𝑡 (cid:33) − = (cid:32) 𝜎 𝑗,𝑡 − + 𝛾 𝑡 + 𝛽 𝑡 (cid:33) − = 𝜎 𝑗,𝑡 . The remaining conclusions are consequences of three properties:memorylessness, aligned incentives (Theorem 5.5), and translation-invariance (ratings, skills, and performances are quantified on acommon interval scale relative to one another).Since the Elo-MM 𝜒 or Elo-MMR( ∞ ) systems are memoryless, wemay replace the initial prior and performance histories of playerswith any alternate histories of our choosing, as long as our choice iscompatible with their current rating and uncertainty. For example,both 𝑖 and 𝑗 can be considered to have participated in the same set of rounds, with 𝑖 always performing at 𝜇 𝑖,𝑡 − . and 𝑗 alwaysperforming at 𝜇 𝑗,𝑡 − . Round 𝑡 is unchanged.Suppose 𝑖 ≻ 𝑗 . Since 𝑖 ’s historical performances are all equal orstronger than 𝑗 ’s, Theorem 5.5 implies 𝜇 𝑖,𝑡 > 𝜇 𝑗,𝑡 .Suppose 𝑗 ≻ 𝑖 . By translation-invariance, if we shift each of 𝑗 ’s performances, up to round 𝑡 and including the initial prior,upward by 𝜇 𝑖,𝑡 − − 𝜇 𝑗,𝑡 − , the rating changes between rounds willbe unaffected. Players 𝑖 and 𝑗 now have identical histories, exceptthat we still have 𝑗 ≻ 𝑖 at round 𝑡 . Therefore, 𝜇 𝑗,𝑡 − = 𝜇 𝑖,𝑡 − and,by Theorem 5.5, 𝜇 𝑗,𝑡 > 𝜇 𝑖,𝑡 . Subtracting the equation from theinequality proves the second conclusion. □ Another desirable property in many settings is robustness: a player’srating should not change too much in response to any one con-test, no matter how extreme their performance. The Codeforcesand TrueSkill systems lack this property, allowing for unboundedrating changes. TopCoder achieves robustness by clamping anychanges that exceed a cap, which is initially high for new playersbut decreases with experience.When 𝜌 >

0, Elo-MMR( 𝜌 ) achieves robustness in a natural,smoother manner. It comes out of the interplay between Gaussianand logistic factors in the posterior; 𝜌 > 𝜔 𝑘 from Section 4.2.Theorem 5.7. In the Elo-MMR( 𝜌 ) rating system, let Δ + : = lim 𝑝 𝑡 →+∞ 𝜇 𝑡 − 𝜇 𝑡 − , Δ − : = lim 𝑝 𝑡 →−∞ 𝜇 𝑡 − − 𝜇 𝑡 . Then, 𝜋𝛽 𝑡 √ (cid:169)(cid:173)(cid:171) 𝛽 + 𝜋 ∑︁ 𝑘 ∈H 𝑡 − 𝜔 𝑘 𝛽 𝑘 (cid:170)(cid:174)(cid:172) − ≤ Δ ± ≤ 𝜋𝛽 𝛽 𝑡 √ . Proof. Using the fact that 0 < 𝑑𝑑𝑥 tanh ( 𝑥 ) ≤

1, differentiatingEquation (6) yields1 𝛽 ≤ 𝐿 ′′ ( 𝑠 ) ≤ 𝛽 + 𝜋 ∑︁ 𝑘 ∈H 𝑡 − 𝜔 𝑘 𝛽 𝑘 . For every 𝑠 ∈ R , in the limit as 𝑝 𝑡 → ±∞ , the new term cor-responding to the performance at round 𝑡 will increase 𝐿 ′ ( 𝑠 ) by ∓ 𝜋𝛽 𝑡 √ . Since 𝜇 𝑡 − was a zero of 𝐿 ′ without this new term, we nowhave 𝐿 ′ ( 𝜇 𝑡 − ) → ∓ 𝜋𝛽 𝑡 √ . Dividing by the former inequalities yieldsthe desired result. □ The proof reveals that the magnitude of Δ ± depends inverselyon that of 𝐿 ′′ in the vicinity of the current rating, which in turnis related to the derivative of the tanh terms. If a player’s perfor-mances vary wildly, then most of the tanh terms will be in their tails,which contribute small derivatives, enabling larger rating changes.Conversely, the tanh terms of a player with a very consistent rat-ing history will contribute large derivatives, so the bound on theirrating change will be small.Thus, Elo-MMR( 𝜌 ) naturally caps the rating change of all players,and puts a smaller cap on the rating change of consistent players.The cap will increase after an extreme performance, providing a imilar “momentum” to the TopCoder and Glicko-2 systems, butwithout sacrificing aligned incentives (Theorem 5.5).By comparing against Equation (8), we see that the lower boundin Theorem 5.7 is on the order of 𝜎 𝑡 / 𝛽 𝑡 , while the upper bound ison the order of 𝛽 / 𝛽 𝑡 . As a result, the momentum effect is morepronounced when 𝛽 is much larger than 𝜎 𝑡 . Since the decay stepincreases 𝛽 while the transfer step decreases it, this occurs whenthe transfer rate 𝜌 is comparatively small. Thus, 𝜌 can be chosen ininverse proportion to the desired strength of momentum. Let’s look at the computation time needed to process a round withparticipant set P , where we again omit the round subscript. Eachplayer 𝑖 has a participation history H 𝑖 .Estimating 𝑃 𝑖 entails finding the zero of a monotonic functionwith 𝑂 (|P|) terms, and then obtaining the rating 𝜇 𝑖 entails findingthe zero of another monotonic function with 𝑂 (|H 𝑖 |) terms. Usingthe Illinois or Newton methods, solving these equations to precision 𝜖 takes 𝑂 ( log log 𝜖 ) iterations. As a result, the total runtime neededto process one round of competition is 𝑂 (cid:32)∑︁ 𝑖 ∈P (|P| + |H 𝑖 |) log log 1 𝜖 (cid:33) . This complexity is more than adequate for Codeforces-style com-petitions with thousands of contestants and history lengths up to afew hundred. Indeed, we were able to process the entire history ofCodeforces on a small laptop in less than half an hour. Nonetheless,it may be cost-prohibitive in truly massive settings, where |P| or |H 𝑖 | number in the millions. Fortunately, it turns out that bothfunctions may be compressed down to a bounded number of terms,with negligible loss of precision. Adaptive subsampling.

In Section 2, we used Doob’s consistencytheorem to argue that our estimate for 𝑃 𝑖 is consistent. Specifically,we saw that 𝑂 ( / 𝜖 ) opponents are needed to get the typical errorbelow 𝜖 . Thus, we can subsample the set of opponents to include inthe estimation, omitting the rest. Random sampling is one approach.A more efficient approach chooses a fixed number of opponentswhose ratings are closest to that of player 𝑖 , as these are more likelyto provide informative match-ups. On the other hand, if the settingrequires aligned incentives to hold exactly, then one must avoidchoosing different opponents for each player. History compression.

Similarly, it’s possible to bound the numberof stored factors in the posterior. Our skill-evolution algorithmdecays the weights of old performances at an exponential rate.Thus, the contributions of all but the most recent 𝑂 ( log 𝜖 ) termsare negligible. Rather than erase the older logistic terms outright, werecommend replacing them with moment-matched Gaussian terms,similar to the transfers in Section 4 with 𝜅 =

0. Since Gaussianscompose easily, a single term can then summarize an arbitrarilylong prefix of the history.Substituting 1 / 𝜖 and log 𝜖 for |P| and |H 𝑖 | , respectively, theruntime of Elo-MMR with both optimizations becomes 𝑂 (cid:18) |P| 𝜖 log log 1 𝜖 (cid:19) . Dataset

Codeforces 1087 2999TopCoder 2023 403Reddit 1000 20Synthetic 50 2500

Table 1: Summary of test datasets.

Finally, we note that the algorithm is embarrassingly parallel,with each player able to solve its equations independently. Thethreads can read the same global data structures, so each additionalthread only contributes 𝑂 ( ) memory overhead. In this section, we compare various rating systems on real-worlddatasets, mined from several sources that will be described in Sec-tion 6.1. The metrics are runtime and predictive accuracy, as de-scribed in Section 6.2.We compare Elo-MM 𝜒 and Elo-MMR( 𝜌 ) against the industry-tested rating systems of Codeforces and TopCoder. For a fairercomparison, we hand-coded efficient versions of all four algorithmsin the safe subset of Rust, parellelized using the Rayon crate; assuch, the Rust compiler verifies that they contain no data races [27].Our implementation of Elo-MMR( 𝜌 ) makes use of the optimizationsin Section 5.3, bounding both the number of sampled opponentsand the history length by 500. In addition, we test the improvedTrueSkill algorithm of [25], basing our code on an open-sourceimplementation of the same algorithm. The inherent seqentiality ofits message-passing procedure prevented us from parallelizing it. Hyperparameter search.

To ensure fair comparisons, we ran aseparate grid search for each triple of algorithm, dataset, and metric,over all of the algorithm’s hyperparameters. The hyperparameterset that performed best on the first 10% of the dataset, was thenused to test the algorithm on the remaining 90% of the dataset.The experiments were run on a 2.0 GHz 24-core Skylake ma-chine with 24 GB of memory. Implementations of all rating systems,hyperparameters, datasets, and additional processing used in ourexperiments can be found at https://github.com/EbTech/EloR/.

Due to the scarcity of public domain datasets for rating systems,we mined three datasets to analyze the effectiveness of our system.The datasets were mined using data from each source website’sinception up to October 12th, 2020. We also created a syntheticdataset to test our system’s performance when the data generatingprocess matches our theoretical model. Summary statistics of thedatasets are presented in Table 1.

Codeforces contest history.

This dataset contains the current en-tire history of rated contests ever run on CodeForces.com, the dom-inant platform for online programming competitions. The Code-Forces platform has over 850K users, over 300K of whom are rated,and has hosted over 1000 contests to date. Each contest has a couplethousand competitors on average. A typical contest takes 2 to 3hours and contains 5 to 8 problems. Players are ranked by totalpoints, with more points typically awarded for tougher problemsand for early solves. They may also attempt to “hack” one another’s ubmissions for bonus points, identifying test cases that break theirsolutions. TopCoder contest history.

This dataset contains the current en-tire history of algorithm contests ever run on the TopCoder.com.TopCoder is a predecessor to Codeforces, with over 1.4 milliontotal users and a long history as a pioneering platform for program-ming contests. It hosts a variety of contest types, including over2000 algorithm contests to date. The scoring system is similar toCodeforces, but its rounds are shorter: typically 75 minutes with 3problems.

SubRedditSimulator threads.

This dataset contains data scrapedfrom the current top-1000 most upvoted threads on the websitereddit.com/r/SubredditSimulator/. Reddit is a social news aggrega-tion website with over 300 million users. The site itself is brokendown into sub-sites called subreddits. Users then post and commentto the subreddits, where the posts and comments receive votes fromother users. In the subreddit SubredditSimulator, users are languagegeneration bots trained on text from other subreddits. Automatedposts are made by these bots to SubredditSimulator every 3 minutes,and real users of Reddit vote on the best bot. Each post (and its asso-ciated comments) can thus be interpreted as a round of competitionbetween the bots who commented.

Synthetic data.

This dataset contains 10K players, with skillsand performances generated according to the Gaussian generativemodel in Section 2. Players’ initial skills are drawn i.i.d. with mean1500 and variance 300. Players compete in all rounds, and are rankedaccording to independent performances with variance 200. Betweenrounds, we add i.i.d. Gaussian increments with variance 35 to eachof their skills.

To compare the different algorithms, we define two measures ofpredictive accuracy. Each metric will be defined on individual con-testants in each round, and then averaged: aggregate ( metric ) : = (cid:205) 𝑡 (cid:205) 𝑖 ∈P 𝑡 metric ( 𝑖, 𝑡 ) (cid:205) 𝑡 |P 𝑡 | . Pair inversion metric [20].

Our first metric computes the fractionof opponents against whom our ratings predict the correct pairwiseresult, defined as the higher-rated player either winning or tying: pair _ inversion ( 𝑖, 𝑡 ) : = |P 𝑡 | − × . This metric was used in the evaluation of TrueSkill [20].

Rank deviation.

Our second metric compares the rankings withthe total ordering that would be obtained by sorting players accord-ing to their prior rating. The penalty is proportional to how muchthese ranks differ for player 𝑖 : rank _ deviation ( 𝑖, 𝑡 ) : = | actual_rank − predicted_rank ||P 𝑡 | − × . In the event of ties, among the ranks within the tied range, we usethe one that comes closest to the rating-based prediction.

Recall that Elo-MM 𝜒 has a Gaussian performance model, matchingthe modeling assumptions of TopCoder and TrueSkill. Elo-MMR( 𝜌 ),on the other hand, has a logistic performance model, matchingthe modeling assumptions of Codeforces and Glicko. While 𝜌 wasincluded in the hyperparameter search, in practice we found thatall values between 0 and 1 produce very similar results.To ensure that errors due to the unknown skills of new playersdon’t dominate our metrics, we excluded players who had competedin less than 5 total contests. In most of the datasets, this reduced theperformance of our method relative to the others, as our methodseems to converge more accurately. Despite this, we see in Table 2that both versions of Elo-MMR outperform the other rating systemsin both the pairwise inversion metric and the ranking deviationmetric.We highlight a few key observations. First, significant perfor-mance gains are observed on the Codeforces and TopCoder datasets,despite these platforms’ rating systems having been designed specif-ically for their needs. Our gains are smallest on the synthetic dataset,for which all algorithms perform similarly. This might be in partdue to the close correspondence between the generative processand the assumptions of these rating systems. Furthermore, thesynthetic players compete in all rounds, enabling the system toconverge to near-optimal ratings for every player. Finally, the im-proved TrueSkill performed well below our expectations, despiteour best efforts to improve it; we suspect that the message-passingalgorithm breaks down in contests with a large number of distinctranks. To our knowledge, we are the first to present experimentswith TrueSkill on contests where the number of distinct ranks is inthe hundreds or thousands. In preliminary experiments, TrueSkilland Elo-MMR score about equally when the number of ranks is lessthan about 60.Now, we turn our attention to Table 3, which showcases the com-putational efficiency of Elo-MMR. On smaller datasets, it performscomparably to the Codeforces and TopCoder algorithms. However,the latter suffer from a quadratic time dependency on the numberof contestants; as a result, Elo-MMR outperforms them by almostan order of magnitude on the larger Codeforces dataset.Finally, in comparisons between the two Elo-MMR variants, wenote that while Elo-MMR( 𝜌 ) is more accurate, Elo-MM 𝜒 is alwaysfaster. This has to do with the skill drift modeling described inSection 4, as every update in Elo-MMR( 𝜌 ) must process 𝑂 ( log 𝜖 ) terms of a player’s competition history. This paper introduces the Elo-MMR rating system, which is inpart a generalization of the two-player Glicko system, allowing anunbounded number of players. By developing a Bayesian modeland taking the limit as the number of participants goes to infinity,we obtained simple, human-interpretable rating update formulas.Furthermore, we saw that the algorithm is asymptotically fast, em-barrassingly parallel, robust to extreme performances, and satisfiesthe important aligned incentives property. To our knowledge, oursystem is the first to rigorously prove all these properties in a set-ting with more than two individually ranked players. In terms of ataset Codeforces TopCoder TrueSkill Elo-MM 𝝌 Elo-MMR( 𝝆 ) pair inv. rank dev. pair inv. rank dev. pair inv. rank dev. pair inv. rank dev. pair inv. rank dev.Codeforces 78.3% 14.9% 78.5% 15.1% 61.7% 25.4% 78.5% 14.8% % %TopCoder 72.6% 18.5% 72.3% 18.7% 68.7% 20.9% 73.0% 18.3% % %Reddit 61.5% 27.3% 61.4% 27.4% 61.5% % 61.6% 27.3% % % 81.3% 13.1% % % 𝝌 Elo-MMR( 𝝆 ) Codeforces 212.9 72.5 67.2

Table 3: Total compute time over entire dataset, in seconds. practical performance, we saw that it outperforms existing industrysystems in both prediction accuracy and computation speed.This work can be extended in several directions. First, the choiceswe made in modeling ties, pseudodiffusions, and opponent subsam-pling are by no means the only possibilities consistent with ourBayesian model of skills and performances. Second, one may obtainbetter results by fitting the performance and skill evolution modelsto application-specific data.Another useful extension would be to team competitions. Whileit’s no longer straightforward to infer precise estimates of an indi-vidual’s performance, Elo-MM 𝜒 can simply be applied at the teamlevel. To make this useful in settings where players may form newteams in each round, we must model teams in terms of their individ-ual members. In the case where a team’s performance is modeledas the sum of its members’ independent Gaussian contributions,elementary facts about multivariate Gaussian distributions enableposterior skill inferences at the individual level. Generalizing thisapproach remains an open challenge.Over the past decade, online competition communities such asCodeforces have grown exponentially. As such, considerable workhas gone into engineering scalable and reliable rating systems.Unfortunately, many of these systems have not been rigorouslyanalyzed in the academic community. We hope that our paper andopen-source release will open new explorations in this area. ACKNOWLEDGEMENTS

The authors are indebted to Daniel Sleator and Danica J. Sutherlandfor initial discussions that helped inspire this work, and to NikitaGaevoy for the open-source improved TrueSkill upon which ourimplementation is based. Experiments in this paper were funded bya Google Cloud Research Grant. The second author is supported bya VMWare Fellowship and the Natural Sciences and EngineeringResearch Council of Canada.

APPENDIX

Lemma 3.1. If 𝑓 𝑖 is continuously differentiable and log-concave,then the functions 𝑙 𝑖 , 𝑑 𝑖 , 𝑣 𝑖 are continuous, strictly decreasing, and 𝑙 𝑖 ( 𝑝 ) < 𝑑 𝑖 ( 𝑝 ) < 𝑣 𝑖 ( 𝑝 ) for all 𝑝. Proof. Continuity of 𝐹 𝑖 , 𝑓 𝑖 , 𝑓 ′ 𝑖 implies that of 𝑙 𝑖 , 𝑑 𝑖 , 𝑣 𝑖 . It’s known [10]that log-concavity of 𝑓 𝑖 implies log-concavity of both 𝐹 𝑖 and 1 − 𝐹 𝑖 .As a result, 𝑙 𝑖 , 𝑑 𝑖 , and 𝑣 𝑖 are derivatives of strictly concave functions;therefore, they are strictly decreasing. In particular, each of 𝑣 ′ 𝑖 ( 𝑝 ) = 𝑓 ′ 𝑖 ( 𝑝 ) 𝐹 𝑖 ( 𝑝 ) − 𝑓 𝑖 ( 𝑝 ) 𝐹 𝑖 ( 𝑝 ) , 𝑙 ′ 𝑖 ( 𝑝 ) = − 𝑓 ′ 𝑖 ( 𝑝 ) − 𝐹 𝑖 ( 𝑝 ) − 𝑓 𝑖 ( 𝑝 ) ( − 𝐹 𝑖 ( 𝑝 )) , are negative for all 𝑝 , so we conclude that 𝑑 𝑖 ( 𝑝 ) − 𝑣 𝑖 ( 𝑝 ) = 𝑓 ′ 𝑖 ( 𝑝 ) 𝑓 𝑖 ( 𝑝 ) − 𝑓 𝑖 ( 𝑝 ) 𝐹 𝑖 ( 𝑝 ) = 𝐹 𝑖 ( 𝑝 ) 𝑓 𝑖 ( 𝑝 ) 𝑣 ′ 𝑖 ( 𝑝 ) < ,𝑙 𝑖 ( 𝑝 ) − 𝑑 𝑖 ( 𝑝 ) = − 𝑓 ′ 𝑖 ( 𝑝 ) 𝑓 𝑖 ( 𝑝 ) − 𝑓 𝑖 ( 𝑝 ) − 𝐹 𝑖 ( 𝑝 ) = − 𝐹 𝑖 ( 𝑝 ) 𝑓 𝑖 ( 𝑝 ) 𝑙 ′ 𝑖 ( 𝑝 ) < . □ Theorem 3.2.

Suppose that for all 𝑗 , 𝑓 𝑗 is continuously differen-tiable and log-concave. Then the unique maximizer of Pr ( 𝑃 𝑖 = 𝑝 | 𝐸 𝐿𝑖 , 𝐸 𝑊𝑖 ) is given by the unique zero of 𝑄 𝑖 ( 𝑝 ) = ∑︁ 𝑗 ≻ 𝑖 𝑙 𝑗 ( 𝑝 ) + ∑︁ 𝑗 ∼ 𝑖 𝑑 𝑗 ( 𝑝 ) + ∑︁ 𝑗 ≺ 𝑖 𝑣 𝑗 ( 𝑝 ) . Proof. First, we rank the players by their buckets according to ⌊ 𝑃 𝑗 / 𝜖 ⌋ , and take the limiting probabilities as 𝜖 → (⌊ 𝑃 𝑗 𝜖 ⌋ > ⌊ 𝑝𝜖 ⌋) = Pr ( 𝑝 𝑗 ≥ 𝜖 ⌊ 𝑝𝜖 ⌋ + 𝜖 ) = − 𝐹 𝑗 ( 𝜖 ⌊ 𝑝𝜖 ⌋ + 𝜖 ) → − 𝐹 𝑗 ( 𝑝 ) , Pr (⌊ 𝑃 𝑗 𝜖 ⌋ < ⌊ 𝑝𝜖 ⌋) = Pr ( 𝑝 𝑗 < 𝜖 ⌊ 𝑝𝜖 ⌋) = 𝐹 𝑗 ( 𝜖 ⌊ 𝑝𝜖 ⌋) → 𝐹 𝑗 ( 𝑝 ) , 𝜖 Pr (⌊ 𝑃 𝑗 𝜖 ⌋ = ⌊ 𝑝𝜖 ⌋) = 𝜖 Pr ( 𝜖 ⌊ 𝑝𝜖 ⌋ ≤ 𝑃 𝑗 < 𝜖 ⌊ 𝑝𝜖 ⌋ + 𝜖 ) = 𝜖 (cid:16) 𝐹 𝑗 ( 𝜖 ⌊ 𝑝𝜖 ⌋ + 𝜖 ) − 𝐹 𝑗 ( 𝜖 ⌊ 𝑝𝜖 ⌋) (cid:17) → 𝑓 𝑗 ( 𝑝 ) . Let 𝐿 𝜖𝑗𝑝 , 𝑊 𝜖𝑗𝑝 , and 𝐷 𝜖𝑗𝑝 be shorthand for the events ⌊ 𝑃 𝑗 𝜖 ⌋ > ⌊ 𝑝𝜖 ⌋ , ⌊ 𝑃 𝑗 𝜖 ⌋ < ⌊ 𝑝𝜖 ⌋ , and ⌊ 𝑃 𝑗 𝜖 ⌋ = ⌊ 𝑝𝜖 ⌋ . respectively. These correspond to aplayer who performs at 𝑝 losing, winning, and drawing against 𝑗 , espectively, when outcomes are determined by 𝜖 -buckets. Then,Pr ( 𝐸 𝑊𝑖 , 𝐸 𝐿𝑖 | 𝑃 𝑖 = 𝑝 ) = lim 𝜖 → (cid:214) 𝑗 ≻ 𝑖 Pr ( 𝐿 𝜖𝑗𝑝 ) (cid:214) 𝑗 ≺ 𝑖 Pr ( 𝑊 𝜖𝑗𝑝 ) (cid:214) 𝑗 ∼ 𝑖,𝑗 ≠ 𝑖 Pr ( 𝐷 𝜖𝑗𝑝 ) 𝜖 = (cid:214) 𝑗 ≻ 𝑖 ( − 𝐹 𝑗 ( 𝑝 )) (cid:214) 𝑗 ≺ 𝑖 𝐹 𝑗 ( 𝑝 ) (cid:214) 𝑗 ∼ 𝑖,𝑗 ≠ 𝑖 𝑓 𝑗 ( 𝑝 ) , Pr ( 𝑃 𝑖 = 𝑝 | 𝐸 𝐿𝑖 , 𝐸 𝑊𝑖 ) ∝ 𝑓 𝑖 ( 𝑝 ) Pr ( 𝐸 𝐿𝑖 , 𝐸 𝑊𝑖 | 𝑃 𝑖 = 𝑝 ) = (cid:214) 𝑗 ≻ 𝑖 ( − 𝐹 𝑗 ( 𝑝 )) (cid:214) 𝑗 ≺ 𝑖 𝐹 𝑗 ( 𝑝 ) (cid:214) 𝑗 ∼ 𝑖 𝑓 𝑗 ( 𝑝 ) , dd 𝑝 ln Pr ( 𝑃 𝑖 = 𝑝 | 𝐸 𝐿𝑖 ,𝐸 𝑊𝑖 ) = ∑︁ 𝑗 ≻ 𝑖 𝑙 𝑗 ( 𝑝 ) + ∑︁ 𝑗 ≺ 𝑖 𝑣 𝑗 ( 𝑝 ) + ∑︁ 𝑗 ∼ 𝑖 𝑑 𝑗 ( 𝑝 ) = 𝑄 𝑖 ( 𝑝 ) . Since Lemma 3.1 tells us that 𝑄 𝑖 is strictly decreasing, it onlyremains to show that it has a zero. If the zero exists, it must beunique and it will be the unique maximum of Pr ( 𝑃 𝑖 = 𝑝 | 𝐸 𝐿𝑖 , 𝐸 𝑊𝑖 ) .To start, we want to prove the existence of 𝑝 ∗ such that 𝑄 𝑖 ( 𝑝 ∗ ) <

0. Note that it’s not possible to have 𝑓 ′ 𝑗 ( 𝑝 ) ≥ 𝑝 , as in thatcase the density would integrate to either zero or infinity. Thus, foreach 𝑗 such that 𝑗 ∼ 𝑖 , we can choose 𝑝 𝑗 such that 𝑓 ′ 𝑗 ( 𝑝 𝑗 ) <

0, andso 𝑑 𝑗 ( 𝑝 𝑗 ) <

0. Let 𝛼 = − (cid:205) 𝑗 ∼ 𝑖 𝑑 𝑗 ( 𝑝 𝑗 ) > 𝑛 = |{ 𝑗 : 𝑗 ≺ 𝑖 }| . For each 𝑗 such that 𝑗 ≺ 𝑖 , sincelim 𝑝 →∞ 𝑣 𝑗 ( 𝑝 ) = / =

0, we can choose 𝑝 𝑗 such that 𝑣 𝑗 ( 𝑝 𝑗 ) < 𝛼 / 𝑛 .Let 𝑝 ∗ = max 𝑗 ⪯ 𝑖 𝑝 𝑗 . Then, ∑︁ 𝑗 ≻ 𝑖 𝑙 𝑗 ( 𝑝 ∗ ) ≤ , ∑︁ 𝑗 ∼ 𝑖 𝑑 𝑗 ( 𝑝 ∗ ) ≤ − 𝛼, ∑︁ 𝑗 ≺ 𝑖 𝑣 𝑗 ( 𝑝 ∗ ) < 𝛼. Therefore, 𝑄 𝑖 ( 𝑝 ∗ ) = ∑︁ 𝑗 ≻ 𝑖 𝑙 𝑗 ( 𝑝 ∗ ) + ∑︁ 𝑗 ∼ 𝑖 𝑑 𝑗 ( 𝑝 ∗ ) + ∑︁ 𝑗 ≺ 𝑖 𝑣 𝑗 ( 𝑝 ∗ ) < − 𝛼 + 𝛼 = . By a symmetric argument, there also exists some 𝑞 ∗ for which 𝑄 𝑖 ( 𝑞 ∗ ) >

0. By the intermediate value theorem with 𝑄 𝑖 continuous,there exists 𝑝 ∈ ( 𝑞 ∗ , 𝑝 ∗ ) such that 𝑄 𝑖 ( 𝑝 ) =

0, as desired. □ REFERENCES [1] CodeChef Rating System. codechef.com/ratings[2] Codeforces Rating System. codeforces.com/blog/entry/20762[3] Farming Volatility: How a major flaw in a well-known rating system takes overthe GBL leaderboard. reddit.com/r/TheSilphRoad/comments/hwff2d/farming_volatility_how_a_major_flaw_in_a/[4] Halo Xbox video game franchise: in numbers. telegraph.co.uk/technology/video-games/11223730/Halo-in-numbers.html[5] Kaggle Progression System. kaggle.com/progression[6] LeetCode Rating System. leetcode.com/discuss/general-discussion/468851/New-Contest-Rating-Algorithm-(Coming-Soon)[7] TopCoder Algorithm Rating System. topcoder.com/community/competitive-programming/how-to-compete/ratings[8] Why Are Obstacle-Course Races So Popular? theatlantic.com/health/archive/2018/07/why-are-obstacle-course-races-so-popular/565130/[9] Sharad Agarwal and Jacob R. Lorch. 2009. Matchmaking for online games andother latency-sensitive P2P systems. In

SIGCOMM 2009 . 315–326.[10] Mark Yuying An. 1997. Log-concave probability distributions: Theory and statis-tical testing.

Duke University Dept of Economics Working Paper

WSDM 2016 . 227–236.[12] Pierre Dangauthier, Ralf Herbrich, Tom Minka, and Thore Graepel. 2007. TrueSkillThrough Time: Revisiting the History of Chess. In

NeurIPS 2007 . 337–344.[13] Arpad E. Elo. 1961. New USCF rating system.

Chess Life

16 (1961), 160–161.[14] RNDr Michal Forišek. 2009. Theoretical and Practical Aspects of ProgrammingContest Ratings. (2009).[15] David A Freedman. 1963. On the asymptotic behavior of Bayes’ estimates in thediscrete case.

The Annals of Mathematical Statistics (1963), 1386–1403. [16] Mark E Glickman. 1995. A comprehensive guide to chess ratings.

American ChessJournal

3, 1 (1995), 59–102.[17] Mark E Glickman. 1999. Parameter estimation in large dynamic paired compari-son experiments.

Applied Statistics (1999), 377–394.[18] Mark E Glickman. 2012. Example of the Glicko-2 system.

Boston University (2012), 1–6.[19] Linxia Gong, Xiaochuan Feng, Dezhi Ye, Hao Li, Runze Wu, Jianrong Tao,Changjie Fan, and Peng Cui. 2020. OptMatch: Optimized Matchmaking viaModeling the High-Order Interactions on the Arena. In

KDD 2020 . 2300–2310.[20] Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkillTM: A BayesianSkill Rating System. In

NeurIPS 2006 . 569–576.[21] Tzu-Kuo Huang, Chih-Jen Lin, and Ruby C. Weng. 2006. Ranking individuals bygroup comparisons. In

ICML 2006 . ACM, 425–432.[22] Stephanie Kovalchik. 2020. Extension of the Elo rating system to margin ofvictory.

International Journal of Forecasting (2020).[23] Yao Li, Minhao Cheng, Kevin Fujii, Fushing Hsieh, and Cho-Jui Hsieh. 2018.Learning from Group Comparisons: Exploiting Higher Order Interactions. In

NeurIPS 2018 . 4986–4995.[24] Tom Minka, Ryan Cleven, and Yordan Zaykov. 2018.

TrueSkill 2: An improvedBayesian skill rating system . Technical Report MSR-TR-2018-8. Microsoft.[25] Sergey I. Nikolenko, Alexander, and V. Sirotkin. 2010. Extensions of the TrueSkillTM rating system. In

In Proceedings of the 9th International Conference on Appli-cations of Fuzzy Systems and Soft Computing . 151–160.[26] Jerneja Premelč, Goran Vučković, Nic James, and Bojan Leskošek. 2019. Reliabilityof judging in DanceSport.

Frontiers in psychology

10 (2019), 1001.[27] Josh Stone and Nicholas D Matsakis. The Rayon library (Rust Crate). crates.io/crates/rayon[28] Lin Yang, Stanko Dimitrov, and Benny Mantin. 2014. Forecasting sales of new vir-tual goods with the Elo rating system.

Journal of Revenue and Pricing Management

13, 6 (Dec. 2014), 457–469.13, 6 (Dec. 2014), 457–469.