[PDF] Modelling Time-Varying Rankings with Autoregressive and Score-Driven Dynamics

Abstract

We develop a new statistical model to analyse time-varying ranking data. The model can be used with a large number of ranked items, accommodates exogenous time-varying covariates and partial rankings, and is estimated via maximum likelihood in a straightforward manner. Rankings are modelled using the Plackett-Luce distribution with time-varying worth parameters that follow a mean-reverting time series process. To capture the dependence of the worth parameters on past rankings, we utilize the conditional score in the fashion of the generalized autoregressive score (GAS) models. Simulation experiments show that small-sample properties of the maximum-likelihood estimator improve rapidly with the length of the time series and suggest that statistical inference relying on conventional Hessian-based standard errors is usable even for medium-sized samples. As an illustration, we apply the model to the results of the Ice Hockey World Championships. We also discuss applications to rankings based on underlying indices, repeated surveys, and non-parametric efficiency analysis.

Full PDF

MModelling time-varying rankings with autoregressive and score-drivendynamics

Vladimír Holý

Prague University of Economics and BusinessWinston Churchill Square 1938/4, 130 67 Prague 3, [email protected]

Jan Zouhar

Prague University of Economics and BusinessWinston Churchill Square 1938/4, 130 67 Prague 3, [email protected]

January 12, 2021

Abstract

We develop a new statistical model to analyse time-varying ranking data. The model can be used witha large number of ranked items, accommodates exogenous time-varying covariates and partial rankings,and is estimated via maximum likelihood in a straightforward manner. Rankings are modelled using thePlackett-Luce distribution with time-varying worth parameters that follow a mean-reverting time seriesprocess. To capture the dependence of the worth parameters on past rankings, we utilize the conditionalscore in the fashion of the generalized autoregressive score (GAS) models. Simulation experiments showthat small-sample properties of the maximum-likelihood estimator improve rapidly with the length of thetime series and suggest that statistical inference relying on conventional Hessian-based standard errors isusable even for medium-sized samples. As an illustration, we apply the model to the results of the IceHockey World Championships. We also discuss applications to rankings based on underlying indices,repeated surveys, and non-parametric eﬃciency analysis.

Keywords:

Ranking data, Random permutation, Plackett-Luce distribution, Generalized autoregressive scoremodel, Ice hockey rankings.

JEL Codes:

C32, C46, L83.

Rankings of universities, scientiﬁc journals, sports teams, election candidates, top visited websites or productspreferred by customers are all examples of ranking data. Statistical models of ranking data have a long history,dating back at least to Thurstone (1927). Since then, the breadth of the statistical toolkit for ranking data hasincreased rapidly; see e.g. Marden (1995) and Alvo and Yu (2014) for an in-depth textbook overview. However,a recent survey of the ranking literature by Yu et al. (2019) draws attention to the lack of time perspective inrankings and calls for research in this particular direction. This paper aims to heed the call and contributeto the thin strand of literature on time-varying ranking data. Unlike the existing models for time variationin rankings, our approach aims to provide a ﬂexible tool for the modelling of time-varying ranking data ina similar fashion as the autoregressive moving average (ARMA) model provides for the case of continuousvariables.Our model builds upon the (static)

Plackett-Luce distribution of Luce (1959) and Plackett (1975), aconvenient and simple probability distribution on rankings utilizing a worth parameter for each item to beranked. It originates from the Luce’s choice axiom and is also related to the Thurstone’s theory of comparativejudgment (see Luce, 1977 and Yellott, 1977 for details). Albeit not without limitations, the Plackett-Lucedistribution is widely used as a base for statistical models analysing ranking data.This also holds true for the scarce literature devoted to models with time-varying ranks. Baker and Mchale(2015) utilize the Plackett-Luce model and consider individual worth parameters behind the rankings to betime-varying – but deterministically so – in an application to golf tournaments results. Glickman and Hennessy(2015) also base their model on the Plackett-Luce distribution but consider worth parameters to follow the1 a r X i v : . [ s t a t . M E ] J a n aussian random walk in an application to women’s Alpine downhill skiing results. Asfaw et al. (2017)take a diﬀerent path and include the lagged ranking as the current modal ranking in the Mallows model inan application to the academic performance of high school students. Finally, Henderson and Kirrane (2018)employ the Plackett-Luce model with observations weighted in time in an application to Formula One results.The latter three papers adopt a Bayesian approach. Generalized autoregressive score (GAS) models of Creal et al. (2013), also known as dynamic conditionalscore (DCS) models by Harvey (2013), have established themselves as a useful modern framework for timeseries modelling. The GAS models are observation-driven models allowing for any underlying probabilitydistribution with any time-varying parameters. They capture the dynamics of time-varying parameters bythe autoregressive term and the lagged score, i.e. the gradient of the log-likelihood function. The GASclass includes many well-known econometric models such as the generalized autoregressive conditionalheteroskedasticity (GARCH) model of Bollerslev (1986) based on the normal distribution with time-varyingvariance, the autoregressive conditional duration (ACD) model of Engle and Russell (1998) based on theexponential distribution with time-varying scale and the count model of Davis et al. (2003) based on thePoisson distribution with time-varying mean. The GAS models can be straightforwardly estimated by themaximum likelihood method (see e.g. Blasques et al., 2018 for details on the asymptotic theory). Generally,the GAS models perform very well when compared to alternatives (see e.g. Koopman et al., 2016 for anextensive empirical and simulation study). To this day, the website reports over 200scientiﬁc papers devoted to the GAS models.In the paper, we propose a dynamic model for rankings based on the Plackett-Luce distribution with time-varying worth parameters following the GAS score-driven dynamics. Our formulation allows for exogenouscovariates and corresponds to the setting of a panel linear regression with ﬁxed eﬀects. We also consider thecase of partial rankings. The proposed model is described in Section 2.Using simulations, we investigate the ﬁnite-sample performance of the maximum likelihood estimator ofour model. First, we demonstrate the convergence of the estimated coeﬃcients for exogenous variables and theGAS dynamics to their true values along the time dimension. Second, we show that conﬁdence intervals basedon the standard maximum likelihood asymptotics appear usable even if the dimensions of data are moderate(such as 20 items ranked in 20 time periods). The simulation study is conducted in Section 3.As an illustration of the proposed methodology, we analyse the results of the Ice Hockey World Cham-pionships from 1998 to 2019. We ﬁnd that the proposed mean-reverting model ﬁts the data better than thestatic and random walk models. The beneﬁts of our approach include compilation of the ultimate (long-term)ranking of teams, straightforward estimation of the probabilities of speciﬁc rankings (e.g. podium positions)and prediction of future rankings. The empirical study is presented in Section 4.Besides sports statistics, we discuss several other possible applications of the proposed model. Notably,we argue that our approach can be used to model rankings based on underlying indices (such as variouscountry rankings) and captures the interaction between items, which the univariate models directly for indicesdo not account for. Furthermore, we note that our model is suitable for repeated surveys that are designed asrankings. Finally, we show how our approach can be utilized for rankings of decision-making units obtainedby non-parametric eﬃciency analysis. These applications are discussed in Section 5.

Let us consider a set of 𝑁 items Y = { , . . . , 𝑁 } . Our main object of interest is a complete permutation ofthis set 𝑦 = ( 𝑦 ( ) , . . . , 𝑦 ( 𝑁 )) , known as a ranking , and its inverse 𝑦 − = (cid:0) 𝑦 − ( ) , . . . , 𝑦 − ( 𝑁 ) (cid:1) , known as an ordering . Element 𝑦 ( 𝑖 ) represents the rank given to item 𝑖 while 𝑦 − ( 𝑗 ) represents the item with rank 𝑗 .We assume that a random permutation 𝑌 follows the Plackett-Luce distribution of Luce (1959) and Plackett(1975). According to this distribution, a ranking is constructed by successively selecting the best item, thesecond best item, the third best item and so on. The probability of selecting a speciﬁc item in any stage isequal to the ratio of its worth parameter and the sum of the worth parameters of all items that have not yet2een selected. Therefore, the probability of a complete ranking 𝑦 isP [ 𝑌 = 𝑦 | 𝑓 ] = 𝑁 (cid:214) 𝑗 = exp 𝑓 𝑦 − ( 𝑗 ) (cid:205) 𝑁𝑘 = 𝑗 exp 𝑓 𝑦 − ( 𝑘 ) , (1)where 𝑓 = ( 𝑓 , . . . , 𝑓 𝑁 ) (cid:48) are the items’ worth parameters. We use a parametrization allowing for arbitraryvalues of 𝑓 𝑖 which facilitates subsequent modelling. Note that the probability mass function (1) is invariant toaddition of a constant to all parameters 𝑓 𝑖 . Therefore, we employ the standardization 𝑁 ∑︁ 𝑖 = 𝑓 𝑖 = . (2)For further details regarding the Plackett-Luce distribution, see Luce (1977); Yellott (1977); Stern (1990);Critchlow et al. (1991). The key ingredient in our dynamic model is the score, i.e. the gradient of the log-likelihood function, deﬁnedas ∇ ( 𝑦 ; 𝑓 ) = 𝜕 ln P [ 𝑌 = 𝑦 | 𝑓 ] 𝜕 𝑓 . (3)The score represents the direction for improving the ﬁt of the distribution with given 𝑓 to speciﬁc observation 𝑦 and indicates the sensitivity of the log-likelihood to parameter 𝑓 . For a complete ranking 𝑦 following thePlackett-Luce distribution, the score is given by ∇ 𝑖 ( 𝑦 ; 𝑓 ) = − 𝑦 ( 𝑖 ) ∑︁ 𝑗 = exp 𝑓 𝑖 (cid:205) 𝑁𝑘 = 𝑗 exp 𝑓 𝑦 − ( 𝑘 ) , 𝑖 = , . . . , 𝑁 . (4)In general, the score function has zero expected value and its variance is equal to the Fisher information I ( 𝑓 ) = E (cid:2) ∇ ( 𝑦 ; 𝑓 ) ∇ ( 𝑦 ; 𝑓 ) (cid:48) (cid:12)(cid:12) 𝑓 (cid:3) . (5)Although the Fisher information is available in a closed form for the Plackett-Luce distribution, it is computa-tionally very intensive for larger 𝑁 as it includes a sum over all possible permutations of Y .The score has an appealing interpretation. In essence, it reﬂects the discrepancy between the items’ worthparameters and the eventual ranking. This information can be exploited in a time-series context where theworth parameters are updated with each new observation. For example, consider the outcome of a matchbetween teams 𝐴 and 𝐵 . This is a special case of ranking with only two items and possible outcomes ( 𝐴, 𝐵 ) and ( 𝐵, 𝐴 ) . Team 𝐴 is the incumbent champion with a very high worth parameter 𝑓 𝐴 while team 𝐵 is theunderdog with a low worth parameter 𝑓 𝐵 . If the match goes as expected and 𝐴 beats 𝐵 , the score will be closeto zero for both teams. If 𝐵 manages to beat 𝐴 , however, the score for 𝐵 , ∇ 𝐵 ( 𝑦 ; 𝑓 ) , will be a large positivenumber while the score for 𝐴 , ∇ 𝐴 ( 𝑦 ; 𝑓 ) , will be negative. The score can therefore serve as a correction ofworth parameters after an observation is realized. Let us observe rankings 𝑦 𝑡 in times 𝑡 = , . . . , 𝑇 . Furthermore, let us assume that individual worth parametersevolve over time and denote them 𝑓 𝑡 = ( 𝑓 ,𝑡 , . . . , 𝑓 𝑁 ,𝑡 ) (cid:48) for 𝑡 = , . . . , 𝑇 . Speciﬁcally, let the time-varyingparameter 𝑓 𝑖,𝑡 follow the generalized autoregressive score (GAS) dynamics of Creal et al. (2013) and Harvey(2013) with score order 𝑃 and autoregressive order 𝑄 . Let it also linearly depend on exogenous covariates 𝑥 , . . . , 𝑥 𝑀 . Parameter 𝑓 𝑖,𝑡 is then given by the recursion 𝑓 𝑖,𝑡 = 𝜔 𝑖 + 𝑀 ∑︁ 𝑗 = 𝛽 𝑗 𝑥 𝑖,𝑡, 𝑗 + 𝑃 ∑︁ 𝑘 = 𝛼 𝑘 ∇ 𝑖 ( 𝑦 𝑡 − 𝑘 ; 𝑓 𝑡 − 𝑘 ) + 𝑄 ∑︁ 𝑙 = 𝜑 𝑙 𝑓 𝑖,𝑡 − 𝑙 , 𝑖 = , . . . , 𝑁, 𝑡 = , . . . , 𝑇 , (6)3here 𝜔 𝑖 is item 𝑖 ’s individual ﬁxed eﬀect, 𝛽 𝑗 is the regression parameter on 𝑥 𝑗 , 𝛼 𝑘 is the score parameterfor lag 𝑘 , 𝜑 𝑙 is the autoregressive parameter for lag 𝑙 and 𝑥 𝑖,𝑡, 𝑗 is the value of 𝑥 𝑗 for item 𝑖 at time 𝑡 . Notethat this formulation corresponds to the setting of a panel linear regression with ﬁxed eﬀects. In most of theGAS literature (and the GARCH and ACD literature as a matter of fact), only the ﬁrst lags are utilized, i.e. 𝑃 = 𝑄 = 𝑁 . Furthermore, the models based on various scaling functions are typicallyquite similar in terms of empirical performance. For these two reasons, we only consider unit scaling (i.e. noscaling).Standardization (2) cannot be enforced at each time 𝑡 without deforming the dynamics given by therecursion (6). Instead, we use the standardization 𝑁 ∑︁ 𝑖 = 𝜔 𝑖 = . (7)In the case of mean-reverting dynamics with no exogenous covariates, this corresponds to zero sum of theunconditional values, ¯ 𝑓 𝑖 = 𝜔 𝑖 − (cid:205) 𝑄𝑙 = 𝜑 𝑙 , 𝑖 = , . . . , 𝑁 . (8) For the estimation of the proposed dynamic model, we utilize the maximum likelihood estimator. Let 𝜃 = ( 𝜔 , . . . , 𝜔 𝑁 − , 𝛽 , . . . , 𝛽 𝑀 , 𝛼 , . . . , 𝛼 𝑃 , 𝜑 , . . . , 𝜑 𝑄 ) (cid:48) be the vector of the 𝑁 + 𝑀 + 𝑃 + 𝑄 − 𝜔 𝑁 being obtained from (7) as 𝜔 𝑁 = − (cid:205) 𝑁 − 𝑖 = 𝜔 𝑖 . The estimate ˆ 𝜃 is obtained from theconditional log-likelihood as ˆ 𝜃 ∈ arg max 𝜃 𝑇 ∑︁ 𝑡 = ln P [ 𝑌 𝑡 = 𝑦 𝑡 | 𝑓 𝑡 ] . (9)The recursive nature of 𝑓 𝑡 requires initialization. A reasonable approach is to set the initial conditional scores ∇( 𝑦 ; 𝑓 ) , . . . , ∇( 𝑦 − 𝑃 + ; 𝑓 − 𝑃 + ) to zero, i.e. their expected value, and the initial parameters 𝑓 , . . . , 𝑓 − 𝑄 + to the unconditional value ¯ 𝑓 given by (8). Alternatively, if additional information about the initial worthparameters is available, it can be used instead. For instance, in a related GAS-type model for binary outcomesof tennis matches, Gorgi et al. (2019) use current ranking points to initialize the worth parameters. On theother hand, they also note that other initialization methods yielded very similar parameters estimates. Theinitial worth parameters can also be considered as additional parameters to be estimated. This would, however,signiﬁcantly increase the number of variables in the maximization problem.As Hunter (2004) states, a necessary condition for the Plackett-Luce log-likelihood to have a maximum isthat in every possible partition of the items into two nonempty subsets, some item in the second set must beranked higher than some item in the ﬁrst set at least once. This assumption for example rules out that there isan item always ranked ﬁrst which would result in an inﬁnite worth parameter.From a computational perspective, it is possible to utilize any general-purpose algorithm for solvingnonlinear optimization problems. In our simulation study and empirical application, we employ the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm implemented in R package nloptr (Ypma and Johnson, 2020).Optimization performance can be improved, however, by exploiting the special structure of the problem. Con-cerning the GAS models, Creal et al. (2013) recall the work of Fiorentini et al. (1996) and suggest an algorithmthat computes the gradient of the likelihood recursively and simultaneously with the time-varying parameters.Concerning the Plackett-Luce distribution, Hunter (2004) presents an iterative minorization–maximization (MM) algorithm, which is further developed by Caron and Doucet (2012). These ideas might prove a use-ful starting point for a specialized likelihood-maximization algorithm tailored to our model; however, thedevelopment of such an algorithm is beyond the scope of this paper.4ur implementation of statistical inference tasks is based on standard maximum likelihood asympotics.Recall that under suitable regularity conditions, the maximum likelihood estimator ˆ 𝜃 is consistent and asymp-totically normal, i.e. it satisﬁes √ 𝑇 (cid:0) ˆ 𝜃 − 𝜃 (cid:1) d → N (cid:16) , − 𝐻 − (cid:17) , (10)where 𝐻 denotes the asymptotic Hessian of the log-likelihood, deﬁned as 𝐻 = plim 𝑇 →∞ 𝑇 𝑇 ∑︁ 𝑡 = 𝜕 ln P [ 𝑌 𝑡 = 𝑦 𝑡 | 𝑓 𝑡 ] 𝜕𝜃 𝜕𝜃 (cid:48) . (11)In ﬁnite samples, standard errors are often computed using the empirical Hessian of the log-likelihood evaluatedat ˆ 𝜃 and the normal c.d.f. is used for statistical inference.The truth is that establishing the asymptotic theory for GAS-type models is diﬃcult in general. Atminimum, it is necessary that the ﬁlter 𝑓 𝑡 is invertible (see e.g. Blasques et al., 2018 for more details). Theinvertibility property ensures, among other things, that the initialization does not matter in the long run. Thetheoretical derivation of the conditions restricting the parameter space in order to obtain consistency andasymptotic normality is, however, beyond the scope of this paper. Indeed, it is very challenging in general toobtain any asymptotic results for the case of multivariate variables with multiple time-varying parameters. Inthe sequel, we base our inference on the asymptotics outlined above and rely on simulations to verify theirvalidity. The distribution can be extended to the case in which the ranking of only top ˜

𝑁 < 𝑁 items is observed. Theset of ranked items is then ˜ Y = { 𝑦 − ( ) , . . . , 𝑦 − ( ˜ 𝑁 )} . We denote the partial ranking of items 𝑖 ∈ ˜ Y as ˜ 𝑦 andthe partial ordering as ˜ 𝑦 − . The probability mass function of the Plackett-Luce distribution for partial ranking˜ 𝑦 is then P (cid:2) ˜ 𝑌 = ˜ 𝑦 (cid:12)(cid:12) 𝑓 (cid:3) = ˜ 𝑁 (cid:214) 𝑗 = exp 𝑓 ˜ 𝑦 − ( 𝑗 ) (cid:205) 𝑠𝑘 = 𝑗 exp 𝑓 ˜ 𝑦 − ( 𝑘 ) + (cid:205) 𝑙 ∉ ˜ Y exp 𝑓 𝑙 . (12)The score function for partial ranking ˜ 𝑦 is ∇ 𝑖 ( ˜ 𝑦 ; 𝑓 ) =  − (cid:205) ˜ 𝑦 ( 𝑖 ) 𝑗 = 𝑓 𝑖 (cid:205) ˜ 𝑁𝑘 = 𝑗 exp 𝑓 ˜ 𝑦 − ( 𝑘 ) + (cid:205) 𝑙 ∉ ˜ Y exp 𝑓 𝑙 for 𝑖 ∈ ˜ Y , − (cid:205) ˜ 𝑁𝑗 = 𝑓 𝑖 (cid:205) ˜ 𝑁𝑘 = 𝑗 exp 𝑓 ˜ 𝑦 − ( 𝑘 ) + (cid:205) 𝑙 ∉ ˜ Y exp 𝑓 𝑙 for 𝑖 ∉ ˜ Y . (13)The probability mass function and the score function for partial rankings can be straightforwardly plugged intothe dynamics from previous sections. We conduct a simulation study in order to investigate behavior of the maximum likelihood estimator over twodimensions – the number of items 𝑁 and the time horizon 𝑇 . In particular, 𝑁 varies between 10, 20 and 30,and 𝑇 ranges from 10 to 100. For each combination of 𝑁 and 𝑇 , we conduct 100,000 replications.The simulations employ the following toy model with parameters selected to resemble those estimated inthe empirical study in Section 4. As we consider diﬀerent values of 𝑁 , the number of 𝜔 𝑖 parameters diﬀers.To somewhat standardize the item-speciﬁc ﬁxed eﬀects, we set 𝜔 𝑖 = ( 𝑖 − )/( 𝑁 − ) − 𝑖 = , . . . , 𝑁 , i.e.parameters 𝜔 𝑖 range from − 𝑁 . We include a single exogenous covariate independently generatedfrom the standard normal distribution. The regression parameter is then set to 𝛽 =

1. Finally, the order ofthe GAS model is chosen as 𝑃 = 𝑄 = 𝛼 = . 𝜑 = .

5. Suchparameter values result in the unconditional values ¯ 𝑓 𝑖 given by (8) ranging from − 𝛽 as 𝛽 , 𝛼 as 𝛼 , and 𝜑 as 𝜑 .5 .2 Simulation results The results of the simulation study are reported in Figure 1. First, we investigate the accuracy of the estimatorsˆ 𝜔 𝑖 , ˆ 𝛽 , ˆ 𝛼 and ˆ 𝜑 ; to enhance readability, the results for ˆ 𝜔 𝑖 are averaged across all items ( 𝑖 ). In the ﬁrst columnof Figure 1, we report the mean absolute errors between the estimated coeﬃcients and their true values. Allestimates converge to their true values along the time dimension. The score parameter 𝛼 proves to be hard toestimate for small 𝑇 as it has much higher mean absolute errors than the autoregressive parameter 𝜑 with acomparable nominal value. Nevertheless, already in a medium sample with 𝑁 =

20 and 𝑇 =

20, the errorsare not that substantial with values 0.22 for ˆ 𝜔 𝑖 , 0.08 for ˆ 𝛽 , 0.12 for ˆ 𝛼 and 0.05 for ˆ 𝜑 . In a large sample with 𝑁 =

30 and 𝑇 = 𝜔 𝑖 , 0.02 for ˆ 𝛽 , 0.02 for ˆ 𝛼 and 0.01 for ˆ 𝜑 .Second, we assess the usability of the maximum likelihood asymptotics for ﬁnite-sample inference.Speciﬁcally, in the second column of Figure 1, we report the fraction of the samples where the 95 percentconﬁdence intervals contained the true parameter values, i.e. we present the estimated coverage probabilities.For all parameters, the coverage probability converges to the target value of 0.95 from bellow along the timedimension. As in the case of MAE, the convergence of the coverage probability is the slowest for the scoreparameter 𝛼 . In a medium sample with 𝑁 =

20 and 𝑇 =

20, the coverage probabilities are 0.91 for ˆ 𝜔 𝑖 , 0.91for ˆ 𝛽 , 0.78 for ˆ 𝛼 and 0.92 for ˆ 𝜑 , while in a large sample with 𝑁 =

30 and 𝑇 = 𝜔 𝑖 ,0.94 for ˆ 𝛽 , 0.92 for ˆ 𝛼 and 0.94 for ˆ 𝜑 . We demonstrate the use of our model using data on the results of the Ice Hockey World Championshipsbetween the years 1998 and 2019. In 1998, the sanctioning body of the championships, the International IceHockey Federation (IIHF), increased the number of teams in the tournament from 12 to 16, and has kept it thatlevel up to date; hence the choice of the starting year. For each year, the IIHF provides a complete ranking ofall 16 participants. Over the years, 24 diﬀerent teams made it through the qualiﬁcation process, comprisingthe set of ranked items in our model.For each year, we obtained information about the host country of the championships. In order to accountfor the home-ice advantage, we included a home ice covariate, a time-varying indicator variable (= 1 for hometeams in the respective years, = 0 otherwise).

The general structure of team strength dynamics given by (6) includes an array of diﬀerent model speciﬁcationsthat can be obtained by (i) choosing the order of the GAS model (

𝑃, 𝑄 ) and (ii) imposing speciﬁc restrictionson the parameter space. As for the former, with the limited size our data set, it seems impractical to consideranything beyond the canonical 𝑃 = 𝑄 = 𝑃 = 𝑄 =

0, on the other hand, yields a static strength model, which is equivalent to the standard ranked-order logit (ROL) – a common go-to model for sports rankings. Recent applications to sport rankingsinclude e.g. Caron and Doucet (2012); Henderson and Kirrane (2018). The latter use time-weighted obser-vations to improve forecasts, but their model is intrinsically static. Both referenced studies use a Bayesianapproach to estimation of the ROL model. We estimate the static model to provide a benchmark for themodels with score-driven dynamics.In the model with 𝑃 = 𝑄 =

1, we generally expect the autoregressive parameter to lie in the ( , ) interval,implying a certain degree of persistence in team strengths with a mean-reverting tendency. In our data set, thisis indeed the result we obtain if we leave the parameters unrestricted in the likelihood-maximization procedure.We refer to this variant as the mean-reverting model . The need for this type of a sports ranking model hasrecently been recognized by Baker and Mchale (2015). In their analysis of golf tournament rankings, they notethat while their model’s deterministic dynamics does suﬃciently capture the time variation in an individualplayers’ performance, a mean-reverting random process would be more appropriate for teams where a largerstochastic component is to be expected due to ﬂuctuation of players and coaches.6 ean absolute error Coverage probability ωβαφ

10 25 40 55 70 85 100 10 25 40 55 70 85 1000.00.20.40.60.81.00.00.20.40.60.81.00.00.20.40.60.81.00.00.20.40.60.81.0 T N

10 20 30

Figure 1:

Mean absolute errors of the estimated coeﬃcients and coverage probabilities of the 95% conﬁdenceintervals. For item-speciﬁc ﬁxed eﬀects, ˆ 𝜔 𝑖 , results are averaged across all items. Dashed horizontal linesare drawn at the 0 and 0.95 vertical coordinates to show the limit values of mean absolute errors andcoverage probabilities under standard maximum-likelihood asymptotics. 𝛽 around 0.998 forback-to-back matches in most of the estimated models. The authors note that this corresponds to a yearlypersistence of about 0.90.If strong persistence is expected, it might be reasonable to restrict the autoregressive parameter to unity,making the team strengths follow a random walk pattern. A ranking model of this type was presented byGlickman and Hennessy (2015) in the context of women’s Alpine downhill skiing competitions. Their model,however, does not make use of the score-driven component and is estimated in a Bayesian framework. Ina GAS setting, a model with a random walk behavior of team strength (henceforth, random walk model )should be approached with caution. The 𝑓 𝑡 ﬁlter in this case is not invertible, making the consistency of themaximum likelihood estimator dubious. That said, in our simulation experiments the mean absolute error ofthe coeﬃcient estimates was roughly comparable to the mean-reverting model of equal sample size, providedthat the model speciﬁcations agree with the underlying data-generating processes. We also note that a GASmodel with random walk strength dynamics has recently been presented by Gorgi et al. (2019) to predictoutcomes of individual tennis matches. Gorgi et al. view mean-reverting processes as the dynamics of choicefor team sports and non-stationary dynamic processes as more appropriate for individual sports.Using the general framework developed in Section 2 we can easily estimate all three model variants (thestatic model, mean-reverting model and random walk model) and compare their ﬁt by information-theoreticcriteria. As only a subset of all teams participated in each championship, we employ the form of the likelihoodfunction for partial rankings developed in Section 2.5. The R code that replicates our results is available onlineas a Supplementary Material. In the observed period, 1998–2019, only 9 teams participated in all 22 Ice Hockey World Championships.These included all teams from the so-called Big Six (Canada, Czechia, Finland, Russia, Sweden, and UnitedStates) along with Latvia, Slovakia and Switzerland. Three teams – Great Britain, Poland, and South Korea –only appeared once. The dominance of the Big Six is evident when looking at the podium positions: out ofthe 66 medals, only six were handed out to teams outside the Big Six (four were awarded to Slovakia, two toSwitzerland). Hosting was also unevenly distributed among the countries: only 14 of the teams experienced thehome-ice advantage, with Czechia, Slovakia, and Switzerland hosting the championships twice and Germany,Finland, Russia, and Sweden three times each.Table 1 presents the results for all three estimated models. In terms of AIC, the mean-reverting modeloutperformed the remaining two by a wide margin, with Δ AIC exceeding 25 in both cases. This (i) impliesthat the introduction of strength dynamics can improve the model ﬁt dramatically and (ii) provides empiricalsupport for the conjecture of Gorgi et al. (2019) about the suitability of mean-reverting dynamics for teamsports. The 95% CI for the autoregressive parameter in the mean-reverting model, [ . , . ] , indicates thepresence of moderate persistence and leads us to reject the null hypotheses of both a random-walk behaviorand no serial dependence.Despite the diﬀerences in AIC values, for parameters that are shared across the models, the estimates arequalitatively similar. In both the mean-reverting and random walk model, the values of the score coeﬃcient,ˆ 𝛼 , are positive and signiﬁcant. This implies that the score component of our model does help in explaining theranking dynamics.In accordance with expectations, point estimates in all three models suggest the existence of home-iceadvantage ( ˆ 𝛽 > home ice statistically signiﬁcant. To assess the eﬀect size impliedby ˆ 𝛽 , we need to know the team strengths. Table 2 shows the estimates of the unconditional team strength fromthe mean-reverting model, obtained based on (8), and the ˆ 𝜔 𝑖 for the static model. The diﬀerences in successivestrength values suggests that an increase by 0 .

23 (the home-ice advantage estimate in the mean-revertingmodel) moves a team 0–2 places ahead in the ranking.For the mean-reverting and static model, the estimates of 𝜔 𝑖 can be used to provide the “ultimate” (orlong-run) ranking. Both models conﬁrm the dominance of the Big Six. Indeed, the ranking in both models8 able 1: Selected estimates for the Ice Hockey World Championships data (1998–2019).

Mean-reverting model Static model Random walk modelHome ice ( ˆ 𝛽 ) 0 .

227 0 .

171 0 . ( . ) ( . ) ( . ) Score parameter ( ˆ 𝛼 ) 0 . ∗∗∗ . ∗∗∗ ( . ) ( . ) Autoregressive parameter ( ˆ 𝜑 ) 0 . ∗∗∗ ( . ) log-likelihood − . − . − . .

391 1299 .

600 1300 . Notes : (i) Estimates of 𝜔 𝑖 omitted from the table to enhance readability. (ii) Standard errors in parentheses. (iii) ∗∗∗ 𝑝 < . ∗∗ 𝑝 < . ∗ 𝑝 < . Table 2:

Unconditional team strength estimates and ultimate ranking in the mean-reverting and static models. Teamsare sorted by the ultimate ranking obtained from the mean-reverting model.

Mean-reverting model Static modelCountry Strength Rank Strength RankCanada 3 .

72 1 3 .

72 2Finland 3 .

70 2 3 .

66 3Sweden 3 .

65 3 3 .

84 1Czechia 3 .

47 4 3 .

41 4Russia 3 .

25 5 3 .

17 5United States 1 .

83 6 2 .

18 6Switzerland 1 .

67 7 1 .

76 7Slovakia 1 .

65 8 1 .

55 8Latvia 0 .

86 9 0 .

82 9Germany 0 .

28 10 0 .

31 10Belarus 0 .

25 11 0 .

11 11Norway 0 .

03 12 − .

07 12Denmark − .

07 13 − .

17 13France − .

41 14 − .

51 14Austria − .

83 15 − .

89 15Italy − .

02 16 − .

10 16Ukraine − .

34 17 − .

52 17Slovenia − .

75 18 − .

64 18Kazakhstan − .

83 19 − .

78 19Japan − .

00 20 − .

94 20Hungary − .

28 21 − .

20 21Great Britain − .

92 22 − .

89 22Poland − .

95 23 − .

90 23South Korea − .

96 24 − .

91 249 able 3:

One-step ahead rank prediction and medal probabilities for the Big Six in the mean-reverting model. Nohome-ice advantage assumed.

Country Strength Predicted rank P[gold medal] P[podium position]Finland 3 .

974 1 0 .

235 0 . .

970 2 0 .

234 0 . .

431 3 0 .

137 0 . .

415 4 0 .

134 0 . .

400 5 0 .

133 0 . .

086 6 0 .

036 0 . 𝑓 𝑖,𝑡 (here referred to as strength) in themean-reverting model. Even though the serial dependence, given by the autoregressive parameter 𝛼 , is mild,it is clearly discernible in the plots. For teams that only appeared in a handful of championships (i.e. theweaker teams, located at the bottom of the ﬁgure), we can see prolonged periods with unchanging values ofthe strength value, which correspond to the absent observations. Similar ﬁgures for the static and random walkmodels are given in Figures 3 and 4.If the values of explanatory variables at time 𝑇 + 𝑇 +

1. These can in turn serve to make one-step-ahead predictions or estimatethe probabilities of speciﬁc rankings or ranking-based events. Applications in betting are straightforward. forinstance, one can easily obtain the probability that a particular team wins a medal or that the podium will beoccupied by a given list of teams.An example is presented in Table 3. Assuming that none of the Big Six countries host the upcomingchampionships, we calculated the future value of the team strength, 𝑓 𝑖,𝑇 + , and the associated rank predictionfor the Bix Six based on our estimates of the mean-reverting model. Even though the team strengths havethe mean-reverting tendency, short-run predictions can diﬀer from the unconditional mean substantially: eventhough the Big Six occupy the ﬁrst six places according to both the predicted rankings for 𝑇 + 𝑓 𝑖,𝑇 + into (12) yields the estimated probability of a partial ordering of interest at 𝑇 +

1. For instance, the estimated probability of the partial ordering (Finland, Canada, Russia) – the predictedpodium outcome – is 1.85 per cent. This probability is low mainly because the predicted strengths happento be quite similar across the ﬁrst ﬁve teams. Analogously, we can obtain the probability of winning a goldmedal, presented in the fourth column of Table 3. Note that the winning probabilities are markedly diﬀerentdespite the similar team strengths. In practical applications, one might be interested in general ranking-basedevents, such as the probability that a team ﬁnishes on the podium; these probabilities can easily be obtainedby combining suitable elementary events. An example is given in the last column of Table 3.

There is often an underlying index or score behind a ranking. For example, the

Times Higher Education WorldUniversity Rankings are based on the score weighted over 13 individual indicators grouped in 5 categories– industry income, international diversity, teaching, research and citations. International rankings based onvarious indices such as the

Global Competitiveness Index , Bloomberg Innovation Index , Human DevelopmentIndex , Climate Change Performance Index or Good Country Index are compiled in a similar fashion. Naturally,an analysis of these rankings and indices is a popular subject of scientiﬁc research; e.g. Saisana et al. (2005)assess robustness of country rankings, Paruolo et al. (2013) measure the importance of individual variablesin composite indicators and Varin et al. (2016) investigate the role of citation data in the ratings of scholarlyjournals. 10 ungary Great Britain Poland South KoreaUkraine Slovenia Kazakhstan JapanDenmark France Austria ItalyLatvia Germany Belarus NorwayRussia United States Switzerland SlovakiaCanada Finland Sweden Czechia2000 2005 2010 2015 2000 2005 2010 2015 2000 2005 2010 2015 2000 2005 2010 2015-2.50.02.5-2.50.02.5-2.50.02.5-2.50.02.5-2.50.02.5-2.50.02.5

Year S t r eng t h Figure 2:

Mean-reverting model – estimated team strength for all teams over the entire observed period. Teams areordered by the estimated unconditional ranking. ungary Great Britain Poland South KoreaUkraine Slovenia Kazakhstan JapanDenmark France Austria ItalyLatvia Germany Belarus NorwayRussia United States Switzerland SlovakiaSweden Canada Finland Czechia2000 2005 2010 2015 2000 2005 2010 2015 2000 2005 2010 2015 2000 2005 2010 2015-4-2024-4-2024-4-2024-4-2024-4-2024-4-2024 Year S t r eng t h Figure 3:

Static model – estimated team strength for all teams over the entire observed period. Teams are ordered bythe estimated unconditional ranking. In the static model, team strengths only vary with the home-iceadvantage, producing little bumps in the plots. ungary Poland South Korea Great BritainUkraine Japan Slovenia KazakhstanDenmark Austria France ItalyLatvia Belarus Germany NorwayRussia Slovakia United States SwitzerlandSweden Finland Czechia Canada2000 2005 2010 2015 2000 2005 2010 2015 2000 2005 2010 2015 2000 2005 2010 2015-2.50.02.5-2.50.02.5-2.50.02.5-2.50.02.5-2.50.02.5-2.50.02.5 Year S t r eng t h Figure 4:

Random walk model – estimated team strength for all teams over the entire observed period. Teams areordered by the mean strength estimate.

A common way of obtaining ranking data is through a survey in which respondents are asked to rank items.Many surveys are repeated on several occasions, forming time-series data. For the statistical methodologydealing with repeated surveys, see Scott and Smith (1974); Steel and McLaren (2009). By asking rankingquestions in repeated surveys, we arrive at a time series of rankings. For example, customers of a retail shopmay be periodically asked to rank products according to their preference. In this case, the proposed dynamicranking model could be a useful tool.

Another interesting application is modelling rankings of decision making units (DMUs) obtained by non-parametric eﬃciency analysis such as data envelopment analysis (DEA) pioneered by Charnes et al. (1978)and Banker et al. (1984). The goal of such analyses is to separate eﬃcient and ineﬃcient DMUs, assigneﬃciency scores to them and determine their ranking. Many empirical papers also study the determinantsof eﬃciency. The analysis is usually carried out by ﬁrst obtaining the eﬃciency scores and then analysingthem using regression in the second phase. Simar and Wilson (2007) (i) point out that a vast majority of theseanalyses ignore the inherent dependence between eﬃciency scores in their second phase, and (ii) developbootstrap procedures to ﬁx invalid inference.Simar and Wilson (2007) focus on the cross-sectional case where dependence only occurs between theDMUs, not over time, which also greatly facilitates bootstrapping. The extension to panel data is notstraightforward to say the least. Nevertheless, in many empirical studies, DMUs are observed annually, withan intention to both assess the way eﬃciency evolved over time and provide a list of units that proved capable ofsustaining eﬃciency over a long period. For this type of analysis, it may be beneﬁcial to model the dynamics ofDEA rankings using our model. If long-term eﬃciency is of interest, it can be measured via the unconditionalranking. A major limitation of this approach is, however, the use of the Plackett-Luce distribution, as DEArankings do not obey Luce’s choice axiom. Other, more complex distributions on rankings could prove moreappropriate here.Modelling rankings instead of eﬃciency scores may also enhance robustness with regard to methodselection. For example, a novel DEA approach utilizing the Chebyshev distance proposed by Hladík (2019)oﬀers alternative eﬃciency scores to the classical DEA models of Charnes et al. (1978) and Banker et al.(1984), but has been shown to produce the exact same ranking. Modelling rankings instead of eﬃciency scoresthus eliminates diﬀerences between the two methods.14

Conclusion

Our new modelling approach brings two main features that have not been utilized in the analysis of time-varyingrankings so far: (i) it allows a general autoregressive scheme for the process that governs the items’ worthparameters, and (ii) new observations can update the worth parameters through a score-driven mechanism.Both these features proved useful in our case study dealing with ice hockey team rankings. We believe thatempiricists in diverse application areas can beneﬁt from these features as well. These empiricists will hopefullyalso appreciate other practical merits of the model, such as the ability to include time-varying covariates orthe straightforward maximum likelihood estimation.This paper has presented the ﬁrst results of an ongoing research. Future eﬀorts should mainly cover thefollowing areas. Firstly, we hope to see more complex results regarding both ﬁnite-sample performance andlimit behavior. We doubt that comprehensive analytical treatment of the maximum likelihood asymptoticsis tractable, but aim to extend the current simulation results substantially. Secondly, for applications witha very large number of items, empiricists would surely beneﬁt from a specialized algorithm for likelihoodmaximization that exploits the speciﬁc structure of the likelihood function. As we mentioned above, Creal et al.(2013) and Caron and Doucet (2012) might be a useful inspiration in this respect. Thirdly, for applications torankings where ties are possible, the model can be extended using the approach of Firth et al. (2019) and Turneret al. (2020). Finally, potential users would surely appreciate a well-documented and ﬂexible implementationof the procedures associated with the model, e.g. in the form of a full-ﬂedged R package.

Acknowledgements

We would like to thank Michal Černý for his comments. Computational resources were supplied by the project"e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research,Development and Innovations Infrastructures.

Funding

The work on this paper was supported by the Internal Grant Agency of the University of Economics, Pragueunder project F4/27/2020 and the Czech Science Foundation under project 19-08985S.

References

Alvo, M. and Yu, P. L. H. (2014)

Statistical Methods for Ranking Data . New York: Springer.Asfaw, D., Vitelli, V., Sørensen, Ø., Arjas, E. and Frigessi, A. (2017) Time-Varying Rankings with the BayesianMallows Model.

Stat , , 14–30.Baker, R. D. and Mchale, I. G. (2015) Deterministic Evolution of Strength in Multiple Comparisons Models:Who Is the Greatest Golfer? Scandinavian Journal of Statistics , , 180–196.Banker, R. D., Charnes, A. and Cooper, W. W. (1984) Some Models for Estimating Technical and ScaleIneﬃciencies in Data Envelopment Analysis. Management Science , , 1078–1092.Blasques, F., Gorgi, P., Koopman, S. J. and Wintenberger, O. (2018) Feasible Invertibility Conditions andMaximum Likelihood Estimation for Observation-Driven Models. Electronic Journal of Statistics , ,1019–1052.Bollerslev, T. (1986) Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics , , 307–327.Caron, F. and Doucet, A. (2012) Eﬃcient Bayesian Inference for Generalized Bradley-Terry Models. Journalof Computational and Graphical Statistics , , 174–196.15harnes, A., Cooper, W. W. and Rhodes, E. (1978) Measuring the Eﬃciency of Decision Making Units. European Journal of Operational Research , , 429–444.Creal, D., Koopman, S. J. and Lucas, A. (2013) Generalized Autoregressive Score Models with Applications. Journal of Applied Econometrics , , 777–795.Critchlow, D. E., Fligner, M. A. and Verducci, J. S. (1991) Probability Models on Rankings. Journal ofMathematical Psychology , , 294–318.Davis, R. A., Dunsmuir, W. T. M. and Street, S. B. (2003) Observation-Driven Models for Poisson Counts. Biometrika , , 777–790.Engle, R. F. and Russell, J. R. (1998) Autoregressive Conditional Duration: A New Model for IrregularlySpaced Transaction Data. Econometrica , , 1127–1162.Fiorentini, G., Calzolari, G. and Panattoni, L. (1996) Analytic Derivatives and the Computation of GARCHEstimates. Journal of Applied Econometrics , , 399–417.Firth, D., Kosmidis, I. and Turner, H. L. (2019) Davidson-Luce Model for Multi-Item Choice with Ties.Glickman, M. E. and Hennessy, J. (2015) A Stochastic Rank Ordered Logit Model for Rating Multi-CompetitorGames and Sports. Journal of Quantitative Analysis in Sports , , 131–144.Gorgi, P., Koopman, S. J. and Lit, R. (2019) The Analysis and Forecasting of Tennis Matches by Using a HighDimensional Dynamic Model. Journal of the Royal Statistical Society: Series A (Statistics in Society) , ,1393–1409.Harvey, A. C. (2013) Dynamic Models for Volatility and Heavy Tails: With Applications to Financial andEconomic Time Series . New York: Cambridge University Press, ﬁrst edn.Henderson, D. A. and Kirrane, L. J. (2018) A Comparison of Truncated and Time-Weighted Plackett-LuceModels for Probabilistic Forecasting of Formula One Results.

Bayesian Analysis , , 335–358.Hladík, M. (2019) Universal Eﬃciency Scores in Data Envelopment Analysis Based on a Robust Approach. Expert Systems with Applications , , 242–252.Hunter, D. R. (2004) MM Algorithms for Generalized Bradley-Terry Models. The Annals of Statistics , ,384–406.Koopman, S. J. and Lit, R. (2019) Forecasting Football Match Results in National League Competitions UsingScore-Driven Time Series Models. International Journal of Forecasting , , 797–809.Koopman, S. J., Lucas, A. and Scharth, M. (2016) Predicting Time-Varying Parameters with Parameter-Drivenand Observation-Driven Models. Review of Economics and Statistics , , 97–110.Leckie, G. and Goldstein, H. (2009) The Limitations of Using School League Tables to Inform School Choice. Journal of the Royal Statistical Society: Series A (Statistics in Society) , , 835–851.Luce, R. D. (1959) Individual Choice Behavior: A Theoretical Analysis . New York: Wiley, ﬁrst edn.URL: https://books.google.com/books/about/Individual{_}choice{_}behavior.html?id=a80DAQAAIAAJ .— (1977) The Choice Axiom after Twenty Years.

Journal of Mathematical Psychology , , 215–233.Marden, J. I. (1995) Analyzing and Modeling Rank Data . New York: Chapman and Hall / CRC Press.Paruolo, P., Saisana, M. and Saltelli, A. (2013) Ratings and Rankings: Voodoo or Science?

Journal of theRoyal Statistical Society: Series A (Statistics in Society) , , 609–634.Plackett, R. L. (1975) The Analysis of Permutations. Journal of the Royal Statistical Society: Series C (AppliedStatistics) , , 193–202. 16aisana, M., Saltelli, A. and Tarantola, S. (2005) Uncertainty and Sensitivity Analysis Techniques as Toolsfor the Quality Assessment of Composite Indicators. Journal of the Royal Statistical Society: Series A(Statistics in Society) , , 307–323.Scott, A. J. and Smith, T. M. (1974) Analysis of Repeated Surveys Using Time Series Methods. Journal ofthe American Statistical Association , , 674–678.Simar, L. and Wilson, P. W. (2007) Estimation and Inference in Two-Stage, Semi-Parametric Models ofProduction Processes. Journal of Econometrics , , 31–64.Steel, D. and McLaren, C. (2009) Design and Analysis of Surveys Repeated over Time. In Handbook ofStatistics , vol. 29, chap. 33, 289–313. Elsevier.Stern, H. (1990) Models for Distributions on Permutations.

Journal of the American Statistical Association , , 558–564.Thurstone, L. L. (1927) A Law of Comparative Judgment. Psychological Review , , 273–286.Turner, H. L., van Etten, J., Firth, D. and Kosmidis, I. (2020) Modelling Rankings in R: The PlackettLucePackage. Computational Statistics , , 1027–1057.Varin, C., Cattelan, M. and Firth, D. (2016) Statistical Modelling of Citation Exchange Between StatisticsJournals. Journal of the Royal Statistical Society: Series A (General) , , 1–63.Yellott, J. I. (1977) The Relationship Between Luce’s Choice Axiom, Thurstone’s Theory of ComparativeJudgment, and the Double Exponential Distribution. Journal of Mathematical Psychology , , 109–144.Ypma, J. and Johnson, S. G. (2020) Package ’nloptr’. URL: https://cran.r-project.org/package=nloptr .Yu, P. L. H., Gu, J. and Xu, H. (2019) Analysis of Ranking Data. Wiley Interdisciplinary Reviews: Computa-tional Statistics ,11