[PDF] Bayesian Paired-Comparison with the bpcs Package

Abstract

This article introduces the bpcs R package (Bayesian Paired Comparison in Stan) and the statistical models implemented in the package. This package aims to facilitate the use of Bayesian models for paired comparison data in behavioral research. Bayesian analysis of paired comparison data allows parameter estimation even in conditions where the maximum likelihood does not exist, allows easy extension of paired comparison models, provide straightforward interpretation of the results with credible intervals, have better control of type I error, have more robust evidence towards the null hypothesis, allows propagation of uncertainties, includes prior information, and perform well when handling models with many parameters and latent variables. The bpcs package provides a consistent interface for R users and several functions to evaluate the posterior distribution of all parameters, to estimate the posterior distribution of any contest between items, and to obtain the posterior distribution of the ranks. Three reanalyses of recent studies that used the frequentist Bradley-Terry model are presented. These reanalyses are conducted with the Bayesian models of the bpcs package, and all the code used to fit the models, generate the figures, and the tables are available in the online appendix.

Full PDF

BBayesian Paired-Comparison with the bpcs Package

David Issa Mattos

Department of Computer Science and Engineering, Chalmers University of Technology,Gothenburg, Sweden

Érika Martins Silva Ramos

Department of Psychology, University of Gothenburg, Gothenburg, SwedenAbstractThis article introduces the bpcs

R package (Bayesian Paired Comparisonin Stan) and the statistical models implemented in the package. The goal ofthis package is to facilitate the use of Bayesian models for paired comparisondata in behavioral research. Historically, studies on preferences have reliedon Likert scale assessments and the frequentist approach to analyze thedata. As an alternative, this article proposes the use of Bayesian modelsfor forced choices assessments. The advantages of forced-choice assessmentsare that they minimize common bias that Likert scales are susceptible to,they can increase criterion validity, control for faking responses, and arequite useful to assess preferences in studies with children and non-humansprimates. Bayesian data analyses are less sensitive to optional stopping,have better control of type I error, have stronger evidence towards the nullhypothesis, allows propagation of uncertainties, includes prior information,and perform well when handling models with many parameters and latentvariables. The bpcs package provides a consistent interface for R users andseveral functions to evaluate the posterior distribution of all parameters,to estimate the posterior distribution of any contest between items, and toobtain the posterior distribution of the ranks. Three reanalyses of recentstudies that used the frequentist Bradley-Terry model are presented. Thesereanalyses are conducted with the Bayesian models of the bpcs package andall the code used to ﬁt the models, generate the ﬁgures and the tables areavailable in the online appendix.

Funding: This work was partially supported by the Wallenberg Artiﬁcial Intelligence, AutonomousSystems and Software Program (WASP) funded by Knut and Alice Wallenberg Foundation.The authors would like to thank L. Hopper for revising the results of the second reanalysis and G. Martonfor revising the results of the third reanalysis. a r X i v : . [ s t a t . M E ] J a n AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 2

Introduction

Paired comparison data analysis arises in several contexts, such as selecting prefer-ences or ranking items (Cattelan, 2012). The most common modeling technique for paired-comparison is the Bradley-Terry model (Bradley & Terry, 1952). However, in such cases,psychological research has given preference for the use of Likert scales and the Bradley-Terrymodel has been scarcely used.Likert scales are a metric assessment of discrete choices, with numerically orderedattitudinal labels. Most Likert scales have a range of 5 to 11 discreet ordered points.There are many problems in statistical analysis surrounding the use of Likert scales, mainlydue to the common practice of analyzing ordinal data as if they were metric. Among theseproblems, inﬂated Type I and Type II errors, misses, and inversions for main and interactioneﬀects are in the list of cons when opting to use Likert scales (Liddell & Kruschke, 2018).When it comes to the respondents’ perspective, there is vast documentation of dif-ﬁculties that participants may experience when rating perceptions with scales. They maystruggle to use the scale correctly, they may try to self-monitor their answers, and for spe-ciﬁc samples, the scales may not even be eﬀective (such as young children, people with lowliteracy, or when respondents are using their second language to answer the scales) (Coetzee& Taylor, 1996; Luckett, Burns, & Jenkinson, 2020; Petrou, 2003).An alternative to Likert scales is the use of forced-choice assessments. The advantageof using forced choices assessments over Likert scales is that they minimize common biasthat Likert scales are susceptible to, such as social desirability and acquiescent responding(Kreitchmann, Abad, Ponsoda, Nieto, & Morillo, 2019). Forced-choice formats can alsoincrease criterion validity and it can control, to some extent, for faking responses (Hontangaset al., 2015). In works conducted with non-human primates, the forced paired choices arequite useful to assess preferences (Hopper, Egelkamp, Fidino, & Ross, 2019; Huskisson,Jacobson, Egelkamp, Ross, & Hopper, 2020).Forced-choice assessments may be stimulus-centric or person-centric. Psychome-tric research has developed many model applications for theories of behavioral choicethat take into consideration the individual diﬀerences in psychological attributes (person-centric). These models employ item response theory (IRT) methods, such as the mul-tidimensional pairwise comparison (MPC) (Wang, Qiu, Chen, Ro, & Jin, 2017) and themulti-unidimensional pairwise preference two-parameter logistic (MUPP-2PL) model (Mo-rillo et al., 2016). This line of research focuses on person-centric assessments, which hasbeen vastly used in personality and attitudinal research (Brown, 2016).In the context of this article and the context of applied psychological and behavioralresearch, the stimulus-centric approach is the one that ﬁts better, and it is in this contextthat the Bradley-Terry model could be most useful. Examples are the preferences of naturallandscapes (Hägerhäll et al., 2018), preferences for stimulus (Chien et al., 2012), for food(Coetzee & Taylor, 1996; Luckett et al., 2020), analysis of animal dominance (Abalos, i deLanuza, Carazo, & Font, 2016; Bush, Quinn, Balreira, & Johnson, 2016; Miller et al., 2017)and in the use of pharmacological medication (Meid et al., 2016).Many models have been developed to extend the Bradley-Terry model, for example,to include ties in the comparisons (Davidson, 1970), to address the problem of orderingitems’ presentation (Davidson & Beaver, 1977), to compensate for dependency on the dataAYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 3(Böckenholt, 2001) or to add explanatory variables to the items (Springall, 1973).Although eﬀorts have been done in statistical computing to provide more accuratestandard errors and p-values estimates for the analyses with the Bradley-Terry model (Cat-telan, 2012; Turner & Firth, 2012), most of these works have been focusing on provid-ing software for the frequentist Bradley-Terry model based on the maximum likelihood(Hatzinger & Dittrich, 2012; Turner & Firth, 2012, 2020). There are no comprehensivesoftware packages that implement a Bayesian version of the Bradley-Terry model and itsextensions.Therefore, this article proposes a Bayesian perspective to work with the Bradley-Terry model. Bayesian data analysis can provide advantages to non-Bayesian methods,e.g., being less sensitive to optional stopping (Rouder, 2014), better control of type I error(Kelter, 2020), stronger evidence towards the null hypothesis (Kruschke, 2013), propaga-tion of uncertainties (McShane, Gal, Gelman, Robert, & Tackett, 2019), inclusion of priorinformation (Gelman & Hennig, 2017) and handling models with many parameters andlatent variables (Carpenter et al., 2017; Kucukelbir, Ranganath, Gelman, & Blei, 2015) forinstance.This paper introduces the bpcs

R package (Bayesian Paired Comparison in Stan) andthe statistical models implemented in the package. The goal of this package is to facilitatethe use of Bayesian models for paired comparison data in research.The bpcs package provides a consistent interface for R users to work with thesemodels and several functions for researchers to evaluate the posterior distribution of allparameters, to estimate the posterior distribution of any contest between items, and toobtain the posterior distribution of the ranks. We conclude this paper by providing aBayesian re-analysis of three recent studies that used the frequentist Bradley-Terry model.These re-analyses are conducted with the Bayesian models of the bpcs package and all thecode used to ﬁt the models, generate the ﬁgures and the tables are available in the onlineappendix . Statistical models

The bpcs package implements a general Bayesian model that can address diﬀerent ex-tensions of the Bradley-Terry model and, that can also be reduced to a simpler model. Thissection presents the simplest Bayesian Bradley-Terry model implemented in the packagefollowed by a discussion of its diﬀerent extensions.Diﬀerent research areas work with diﬀerent terminology and therefore it’s worth hav-ing further clariﬁcations:• Players or contestants are synonymous with the items being compared, i.e. the choicesof some type of stimuli such as images, sounds, or objects. The bpcs package and thisarticle utilize the term player.• Subjects, participant, or judge are synonymous for the respondents of a questionnaire,or the subject that is selecting between the paired-comparison. Apart from the termmultiple judgment sampling which refers to when a subject judges multiple items, the bpcs package and this article utilize the term subjects. https://davidissamattos.github.io/bpcs-online-appendix/ AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 4• A contest refers to a single comparison between two players made by a subject.• Ties or draws refer to the case where a subject does not express a preference fora player in a contest. For example, in a questionnaire that asks subjects to selectbetween two stimuli (the players) a tie could be an option such as “I do not have apreference”. To avoid confusion withdrawing samples of a posterior distribution, the bpcs package and this article utilize the term ties.The mathematical models utilize the following basic notation. Additional symbolsand notations are presented with the extension that introduces them.• α i >

0: a latent variable that represents the ability of player i .• λ i = log( α i ): the log of the ability of player i .• y i,j,n : the binary result of the contest n between any two players i and j . If player i wins, y i,h,n = 1. If player j wins y i,j,n = 0.• tie i,j,n : a binary variable representing if the result of a single contest (at position n )between two players ( i and j ) was a tie. It assumes tie n = 1 if it was a tie and tie k = 0if it was not a tie.• N : the number of players.• σ λ : the variance of the normal prior distribution for the variable λ . Analogously goesfor the prior distributions for the other parameters γ , ν , β and U .To simplify the notation of the presented models below the index n is omitted. The Bradley-Terry model

The Bradley-Terry model (Bradley & Terry, 1952) presents a way to calculate theprobability of one player beating the other player in a contest. This probability is repre-sented by: P [ i beats j ] = α i α i + α j (1)This model is commonly parameterized by the log of the ability variable: P [ i beats j ] = exp λ i exp λ i + exp λ j (2)This transformation have the beneﬁts of allowing the estimation of a parameter λ ∈ ( −∞ , ∞ ) and simplifying the estimation of the parameters for the frequentist setting witha generalized linear models with the logit function (Cattelan, 2012):logit( P [ i beats j ]) = λ i − λ j (3)However, the parameters λ are not uniquely identiﬁed and require another constraint,commonly:AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 5 N X i =1 λ i = 0 (4)The ﬁrst work to propose a Bayesian formulation of the Bradley-Terry model is at-tributed to Davidson and Solomon (Davidson & Solomon, 1973). They proposed a versionof the Bradley-Terry model utilizing a conjugated family of priors and estimators to cal-culate the posterior distribution of the log abilities of the parameters and the rank of theplayers. Leonard (1977) discusses the issue of the ﬂexibility an interpretation of the conju-gated priors proposed by Davidson and Solomon in the presence of additional explanatoryvariables and other extensions.Leonard (1977) suggests moving away from the convention of using conjugated priorsand utilizing normal prior distributions for the parameters. The usage of non-conjugatedprior distributions has many advantages, including adaptability to extensions, being ableto reason and fully specify the prior parameters, ability to extend to hierarchical models,and to specify other prior distribution families.The two main disadvantages of the approach proposed by Leonard (1977) is the use ofapproximation methods for the posterior distribution and the computational time. However,the advances in Bayesian computational packages and Markov Chain Monte Carlo (MCMC)samplers signiﬁcantly minimize such disadvantages. The bpcs package utilizes normal priorsfor the λ parameters and models the outcome variable y i,j of a single contest between players i and j with a Bernoulli distribution, based on the probability of winning for P [ i beats j ].Therefore, the simple Bayesian Bradley-Terry model can be represented as: P [ i beats j ] = exp λ i exp λ i + exp λ j (5) y i,j ∼ Bernoulli( P [ i beats j ]) (6) λ i ∼ N (0 , σ λ ), [Prior] (7)The mean value of the log-ability of the prior acts as a soft-constraint the constraintof where the parameters will be located but do not impact the relative diﬀerence of theprobability of winning between any two players. The standard deviation represents the spacewhere the model should look for the relative diﬀerences, smaller standard deviations indicatethat the relative preferences are close and have many overlap, higher standard deviationindicates that the sampler can look for solutions that are very far apart (probabilities closerto zero or 1). Choosing high standard deviations for the prior can increase the time to ﬁnda solution as the search space increases but very low standard deviations imply that thestrength parameters are very close to each other.The parameters λ i can be used to rank the diﬀerent players. However, in the Bayesianframework, a single measure is not obtained but rather a posterior distribution of theparameters λ i . By sampling from the posterior distribution of the log-abilities of the players,it is possible to create a posterior distribution of the ranks of the players, which helps toevaluate the uncertainty in the ranking system.AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 6 Davidson model

The ﬁrst extension to be added to the Bradley-Terry model is the ability of handlingties in the contest between two players. This approach was proposed by Davidson (1970),which adds an additional parameter ν and computes two probabilities: the probability of i beating j given that it was not a tie P [ i beats j | not tie] and the probability of the resultbeing a tie P [ i ties j ]. A Bayesian formulation of the model is represented below: P [ i beats j | not tie] = exp λ i exp λ i + exp λ j + exp ( ν + λ i + λ j ) (8) P [ i ties j ] = exp ( ν + λ i + λ j )exp λ i + exp λ j + exp ( ν + λ i + λ j ) (9) y i ∼ Bernoulli( P [ i beats j | not tie]) (10)tie i,j ∼ Bernoulli( P [ i ties j ]) (11) λ i ∼ N (0 , σ λ ), [Prior] (12) ν ∼ N (0 , σ ν ), [Prior] (13)The ν parameter in (the log scale) balances the probability of ties against the proba-bility of not having ties. If ν → −∞ than P [ i ties j ] → ν → + ∞ than P [ i ties j ] → ν = 0 than P [ i ties j ] = exp λi + λj exp λ i +exp λ j +exp λi + λj , which means thatthe probability of ties depends only on the players abilities. In this last case, if the players i and j have equal abilities, they have equal chance of having a tie, i winning or j winning.For the Bayesian version of the Davidson model, the choice of prior in the ν parameterrefers to the prior belief on how ties are aﬀected by the relative diﬀerence in players’ abilities.If the intended goal is also to investigate if the probability of wins and ties depend only onthe abilities, the mean parameter of the normal prior distribution for ν is set to zero (as inthe presented models).Although the subsequent models are presented only in the Bradley-Terry variation,they can be easily extended and implemented as a Davidson model to handle ties. Models with order eﬀect

In paired comparison problems, a problem that can arise is that the order in whichtwo players are presented can lead to a bias in the choice of the comparison. For example,when two items are presented to be chosen, subjects might have a preference for items placedon the left side. Another example that arises from sports competitions is the presence ofhome-ﬁeld advantage, athletes (the players in paired-comparison) competing in their homeﬁeld can have an advantage compared to the visitor player.The order eﬀect can be modeled as either additive or multiplicative. Davidson andBeaver (1977) discuss some of the advantages of the multiplicative model compared to theadditive. One important advantage of the Bayesian version is that the value of the ordereﬀect parameter does not depend on the abilities parameters. Therefore, setting the priorAYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 7distribution of the order-eﬀect parameter is independent of the soft-constraints applied tothe log-abilities parameter. Additionally, the multiplicative model has easily introduced toboth the Bradley-Terry and the Davidson model when estimating the parameters in the logscale. To compensate for the order eﬀect using a multiplicative model, Davidson and Beaver(1977) introduced an additional parameter γ . This multiplicative parameter becomes in thelog-scale an additive term as shown below. The Bayesian Bradley-Terry model with ordereﬀect can be represented as: P [ i beats j ] = exp λ i exp λ i + exp ( λ j + γ ) y i ∼ Bernoulli( P [ i beats j ]) λ i ∼ N (0 , σ λ ), [Prior] γ ∼ N (0 , σ γ ), [Prior]The γ parameter in (the log scale) reﬂects the impact of the order eﬀect. If γ → −∞ than P [ i beats j ] → i willhave an order eﬀect advantage and will always win the contest. Analogously if γ → + ∞ than P [ i beats j ] → i will have an order eﬀect disadvantage and will alwayslose the contest. If γ = 0 then player i will neither have an advantage or disadvantage withthe order eﬀect, and the probability of wins or ties depends only on the players’ abilitiesand the tie parameter ν . The choice of prior in the γ parameter refers to our prior belief onthe location and spread of the value of order eﬀect. If the intended goal is also to investigateif there is an order eﬀect or not, the mean parameter of the normal prior distribution for γ is set to zero (as in the presented models). Generalized models

Many research problems require the investigation of the eﬀect of players propertiesin the probability of winning a contest. The extension proposed by (Springall, 1973) isanalogue to the multiple regression case. This extension proposes the use of K predictorsthat are characteristic of the players (and not of the subjects which is discussed later). X i,k is the k predictor value of player i and β k is the coeﬃcient of the predictor that weare estimating. Note that for these generalized models the intercept is not identiﬁable andtherefore not included (Springall, 1973; Stern, 2011). The Bayesian generalized Bradley-Terry can be represented as: P [ i beats j ] = exp λ i exp λ i + exp λ j λ i = N X k X i,k β k y i ∼ Bernoulli( P [ i beats j ]) β k ∼ N (0 , σ β ), [Prior]AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 8The generalized version of the Bradley-Terry model estimates the parameters β k . Theparameter λ i is then estimated by the linear model λ i = P Nk X i,k β k . The choice of prior inthe β parameter refers to the prior belief on the value of the coeﬃcient of each predictor.The presented generalized models utilize the same prior for all β coeﬃcients. Therefore itrequires that the values for every k in the X i,k be on the same range, otherwise, the modelwould have a strong informative prior belief in some coeﬃcients and a weakly informativeprior belief in other coeﬃcients. If the predictors’ input values X i,k are normalized for every k the larger the coeﬃcient β k , the higher the inﬂuence of that predictor in the probabilityof winning. Introducing random-eﬀects to model-dependent data

It is common in the research context to have the same subject to make multiplecomparisons (multiple judgment sampling) (Cattelan, 2012). A more realistic analysis ofthe Bradley-Terry model would assume that the comparisons made by the same person aredependent. One approach to address the multiple judgment sampling problems is throughthe usage of mixed-eﬀects or hierarchical paired comparison models (Böckenholt, 2001).Böckenholt (2001) decomposed the paired comparison model into ﬁxed and a randomeﬀect components. The random eﬀects component estimates the subject variation (given S subjects) in each item, while the ﬁxed eﬀect component estimates the average log abilityof the player. The random eﬀects term is represented by U i,s , where i refers to the playerbeing judged and s to the subject. The Bayesian Bradley-Terry model with random eﬀectscan be represented as: P [ i beats j ] = exp λ i,s exp λ i,s + exp λ j,s λ i,s = λ i + U i,s y i,j ∼ Bernoulli( P [ i beats j ]) λ i ∼ N (0 , σ λ ), [Prior] U i,s ∼ N (0 , U ), [Prior] U std ∼ N (0 , σ U ), [Prior]This model aims at estimating the parameter U std that represents the standard devia-tion in the random eﬀects and the diﬀerence between subjects. In the Bayesian context, withthe Stan probabilistic programming language, it is also possible to estimate the parameters U i,s (a total of SN parameters). The choice of prior for the U std represents the prior beliefin the diﬀerence in judgment between the subjects. It can be set to be a weakly-informativeprior (with a large value for σ U ). Subject speciﬁc predictors

The last extension presented in this article, is the inclusion of subject-speciﬁc covari-ates. In many behavior research problems is desired to evaluate how characteristics of thesubject inﬂuences the choice in a contest. This extension was originally proposed by Böck-enholt (2001) models subject-speciﬁc covariates for each player utilizing a linear regression.AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 9This model utilizes the following notation: K is the number of subject-speciﬁc predictors, x i,k,s representing the observed covariate k of subject s for player i and the coeﬃcient forthe covariate k of player i represented by S i,k . The model can be represented as: P [ i beats j ] = exp λ i,s exp λ i,s + exp λ j,s λ i,s = λ i + K X k =1 x i,k,s S i,k y i,j ∼ Bernoulli( P [ i beats j ]) λ i ∼ N (0 , σ λ ), [Prior] S i,k ∼ N (0 , σ S ), [Prior]These models estimate the baseline ability of the players, λ i , and the subject-speciﬁccoeﬃcients of the covariates S i,k . These coeﬃcients represent how a change in the subjectcovariate for each player will impact the probability of selecting player i over player j . Themodel estimates one coeﬃcient for every covariate of every player, resulting in a total of K · N estimated coeﬃcients. It is worth noting that covariates coeﬃcients are speciﬁc toeach player and therefore can have a diﬀerent impact depending on how it inﬂuences theplayer log ability. These coeﬃcients can be used to investigate systematic diﬀerences in howeach player is evaluated by the subjects.It is worth noting, that again the absolute values of these coeﬃcients do not have adirect interpretation of the eﬀect it adds to the probability of one player beating another.This eﬀect depends on the relative impact of the covariate in the two players and it isbetter assessed through the actual probability of selecting one player over the other. In theBayesian context, this can be assessed through the posterior distribution of the probabilitiesand the absolute eﬀect can be measured.These models assume that the covariates x i,k,s have a similar range of values andthat are centered since they utilize the same normal priors with zero mean and constantstandard deviation. In practice, this means that the values of the coeﬃcients are more easilyestimated by the MCMC sampler if all values of x i,k,s is normalized by each covariate. Thismodel accepts both categorical predictors as well as continuous predictors. Categoricalpredictors can be added utilizing dummy-coding. Remarks

Although not presented here, all the discussed extensions can be incorporated in asingle model mathematical model, since they are all linearly added in the exponential terms.The bpcs package can handle from both the simple Bradley-Terry and the Davidson modelto any combination of these extensions to these models.Even in more complex, models the interpretation of the extension parameters remainsthe same as presented here. However, it is worth reinforcing that these parameters shouldalways be analyzed in the context of the eﬀect sizes, i.e., the actual probabilities of oneplayer beating the other given the changes in the other parameters.AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 10

The bpcs package

This section presents a short overview of the underlying implementation of the bpcs package and its main functionalities. The bpcs

R package implements the Bayesian versionof the Bradley-Terry model and its extensions, as discussed in the statistical models’ section.The package can be installed directly from the package repository or the ComprehensiveR Archive Network (CRAN).The models are coded in the Stan language which is compiled to C++ code at installa-tion time. This compilation process allows fast sampling times of the posterior distribution.Although there are many Bayesian samplers and probabilistic programming languages avail-able (Stan, JAGS, BUGS, Tensorﬂow, Pyro, Turing.jl, PyMC3) the choice for Stan wasbased on: (1) it supports the R programming language (which is widely adopted by re-searchers for statistical analysis), (2) it is an actively maintained project, (3) it providesfast sampling with the No U-Turn (NUTS) Hamiltonian Monte Carlo sampler (Hoﬀman &Gelman, 2014) and (4) it has a large number of compatible and supporting tools (such as rstantools , shinystan and bayesplot ).Stan is a probabilistic programming language for statistical modeling and that focuson high-performance computation (Stan Development Team, 2020). It has bindings formany languages (such as R, Python, Matlab, Julia, and Stata) and runs on all majorplatforms (Linux, Mac, and Windows). In recent years, Stan has gained popularity amongprobabilistic programming languages and it is shown to be faster than traditional samplerssuch as JAGS and BUGS.Stan introduces a programming language to facilitate writing Bayesian models thatresemble C++. However, it is often used through R packages that facilitate specifying gen-eralized and mixed-models such as brms (Bürkner, 2018) and rstanarm (Goodrich, Gabry,Ali, & Brilleman, 2020). However, despite the large number of packages and models thatStan-based packages cover, none implements paired comparison models such as the Bradley-Terry and the Davidson (and their extensions). Basic usage

To exemplify the basic usage of the bpcs package, Luckett et al. (2020) work is used asan example.The authors investigate the relative acceptability of food and beverage choicesusing paired preferences. One of the examples discussed is the acceptability of ﬁve brandsof four cheese frozen pizzas. The full code for this presentation of the package and there-analyses are available in the online appendix.The main function of the bpcs package is the bpc function. The bpc takes as inputarguments: a data frame, two string columns with the names of contestants, a string withthe result of the contestant (0 for player0, 1 for player1, or 2 for ties), and the model type.The model type is speciﬁed with a string. Two basic models are available, the ’ bt ’ modelfor the Bradley-Terry model (Bradley & Terry, 1952), and the ’ davidson ’ model for theDavidson model to handle ties (Davidson, 1970). Extensions for each of these base modelscan be added using a dash separator and the extension, for example ’ bt-ordereﬀect ’ speciﬁesthe Bradley-Terry model with order eﬀect; ’ davidson-generalized-U ’ speciﬁes the generalizedDavidson model including random eﬀects. All presented extensions in the statistical models’ https://github.com/davidissamattos/bpcs AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 11section can be added to both base models, including more than one extension at the sametime. Other options, such as method for handling ties, calculating the results from thescores of each player, column for clusters, speciﬁcation of the priors, number of iterations tosample, among others are described in the documentation . The call for the bpc functionis shown in the listing 1:Listing 1: The bpc function m <- bpc ( data = dpizza ,player0 = ’ Prod0 ’,player1 = ’ Prod1 ’,result _ column = ’y ’, solve _ ties = ’ none ’, model _ type = ’bt ’,iter =3000) The package also implements the S3 functions print , summary . plot and predict .The print function displays the parameters table with the High Posterior Density Intervals(Kruschke & Liddell, 2018; McElreath, 2020). summary function prints the parameters table,a table with a posterior probability of winning for all combination of players and a posteriorrank of the players including the median rank, mean rank, and the standard deviation ofthe rank. The plot function provides a caterpillar plot of the model parameters withthe correspondent HPD or credible intervals. The predict function provides a posteriordistribution of predictive results of any match between the players of the ﬁtted model.Below is the result of the summary function for the model.Listing 2: Output of the summary function Estimated baseline parameters with HPD intervals :Parameter Mean HPD _ lower

HPD _ higher n _ eff Rhat------------------- ------ ---------- ----------- ------- -----lambda [ Tombstone ] -0.27 -3.06 2.36 623.75 1lambda [ DiGiorno ] 0.19 -2.62 2.77 620.89 1lambda [ Freschetta ] 0.08 -2.77 2.65 624.07 1lambda [ Red Barron ] 0.10 -2.64 2.76 617.71 1lambda [ aKroger ] -0.49 -3.20 2.23 616.63 1NOTES : * A higher lambda indicates a higher team abilityPosterior probabilities :These probabilities are calculated from the predictive posteriordistribution for all player combinationsi j i _ beats _ j----------- ----------- ----------aKroger DiGiorno 0.33aKroger Freschetta 0.37 https://davidissamattos.github.io/bpcs/ AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 12 aKroger Red Barron 0.36aKroger Tombstone 0.44DiGiorno Freschetta 0.51DiGiorno Red Barron 0.52DiGiorno Tombstone 0.61Freschetta Red Barron 0.50Freschetta Tombstone 0.60Red Barron Tombstone 0.59The rank of the players ’ abilities :The rank is based on the posterior rank distribution of the lambdaparameterParameter MedianRank MeanRank StdRank------------------- ----------- --------- --------lambda [ DiGiorno ] 1 1.62 0.79lambda [ Freschetta ] 2 2.29 0.84lambda [ Red Barron ] 2 2.18 0.82lambda [ Tombstone ] 4 4.08 0.52lambda [ aKroger ] 5 4.82 0.40

The package also provides helper functions to create plots and, to generate formattedtables (such as the ones from the summary function) for Latex, HTML, and the Pandoc format (which in turn can be used to generate Microsoft Word tables). These functions are:• get_parameters_table This function generates a table of the parameters with sum-mary statistics of the posterior distribution. Two measures of uncertainty are avail-able, the equal-tailed intervals and Highest Posterior Density (HPD) intervals (Kr-uschke & Liddell, 2018; McElreath, 2020). The equal-tailed intervals divide the pos-terior distribution into two parts with the same probability mass, i.e. both tails havethe same probability of being selected. The HPD interval corresponds to the nar-rowest interval that contains the mode for a unimodal distribution. In the case of asymmetrical unimodal distribution (such as the normal distribution), both intervalsare equivalent. However, in the case of a non-symmetrical distribution, these intervalswill be diﬀerent and the HPD interval will be shorter.• get_probabilities_table . This function generates a table of the probabilities ofone player being chosen against another player. These probabilities are calculated bysampling the predictive posterior distribution of the results.• get_rank_of_players_table ). This function calculates the rank of each player basedon the posterior distribution of the log-abilities of the players (the λ ). By assessing theposterior distribution of the rank and looking at the standard deviation, it is possibleto assess the uncertainty on the rank estimates. Estimating the uncertainty in therank values is not available in any of the frequentist packages.• plot . This function creates a caterpillar type of plot of the log-ability parameters of https://pandoc.org/ AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 13the players with the uncertainty intervals. This function returns a ggplot2 (Wickham,2016) object which can be easily customized by the user.If the user has a higher need to customize the tables, the user can either pro-vide further customization with additional packages such as the kableExtra package(Zhu, 2020), or the utilize the function get_parameters_df , get_probabilities_df or get_rank_of_players_df ) to obtain a data frame that contains only the data of the ta-ble. The online appendix utilizes these approaches to create more complex tables for there-analyses. Model validity and model comparison

After the call of the bpc function, the bpcs package calls the Stan No U-Turn Hamilto-nian Monte Carlo sampler (Hoﬀman & Gelman, 2014) to estimate the posterior distributionof the parameters of the model. The Stan software provides many diagnostics messages thatcan help the user to identify problems when ﬁtting the model (e.g. if the number of iter-ations should be increased, if the chains have not converged, or if the number of warmupiterations should be increased). After the model ﬁt, the user needs to check if the chainsare properly mixed (McElreath, 2020) if the Gelman-Rubin convergence coeﬃcient is closeto 1 (Gelman & Rubin, 1992), if the number of eﬀective samples is suﬃcient (McElreath,2020; Team, 2016) and the posterior predictive checks (Gelman et al., 2013).Although all of these checks can be performed manually by retrieving the stanﬁtobject from the model (with the get_stanfit function), it is recommended to investi-gate these checks utilizing the shinystan package (Gabry, 2018). This package provides aweb-based graphical user interface that implements all of these checks. The interface canbe launched directly from the bpcs package with the function launch_shinystan . If themodel converges satisfactorily (by analyzing the traceplots and the Rubin-Gelman conver-gence) and provides a good predictive posterior (that can be obtained with the function( posterior_predictive ), the parameter estimates from the model can be trusted.Bayesian statistics reinforces generating several valid models that can explain the ob-tained data and comparing them (McElreath, 2020). One approach to compare these modelsis with the use of information criteria, such as the Watanabe-Akaike Information Criteria(WAIC) (Gelman, Hwang, & Vehtari, 2014) or the Leave-One-Out Cross-Validation method(LOO-CV) (Yao, Vehtari, Simpson, & Gelman, 2017). The bpcs package provides both ofthese estimates with the functions get_waic and get_loo . Note that information criteriasuch as the Akaike Information Criteria (AIC) and the Bayesian Information Criteria (BIC)should not be used since they assume models with ﬂat priors and maximum a posterioriestimates (McElreath, 2020). These assumptions are not valid in the models implementedin the bpcs package and therefore the package does not provide these estimates.

Limitations of the bpcs package

The main limitation of these Bayesian paired comparison methods is the computa-tional costs. While for most research problems a standard laptop should be able to createa posterior distribution of the parameters of the model in a few minutes, there are spe-ciﬁc problems that will require more computational power, or even a transition from theAYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 14Bayesian models towards frequentist models, if the posterior distribution is not used. Forexample, Zhang, Houpt, and Harel (2019) utilizes four Bradley-Terry models to rank a totalof 7,035 images with an order of 120,000 data points for each model. While it is still possibleto perform Bayesian inference in this problem, ﬁtting the model might take several hours.However, for many practical research problems, Bayesian Bradley-Terry models might takeonly minutes.In terms of more computational power, the bpcs package can be launched in cloudenvironments (Amazon AWS, Google Cloud, Microsoft Azure, Blue Ocean, etc), where it ispossible to select a wide range of computing power (extending in both memory and numberof cores) and pay for the service on demand. The bpcs package documentation providesinstructions on how to utilize Docker container environments to launch a visual RStudioServer interface with both the Stan software conﬁgured and the bpcs package.

Reanalyses

This section provides Bayesian reanalyses of three studies conducted with a frequentistimplementation of the Bradley-Terry model. The commented code to generate the ﬁguresand tables of these reanalyses are available in the appendix. These reanalyses provideinformation regarding what was presented in the original papers, followed by discussions ofalternative models. However, the reanalyses do not cover all possible models available inthe bpcs package, such as the models with ties and generalized models. Examples of thesemodels in other areas (e.g. sports tournaments and benchmarks in algorithm development)are provided in the package documentation . Study I: Visual Perception of Moisture Is a Pathogen Detection Mechanism ofthe Behavioral Immune System

This reanalysis is based on the study “Visual Perception of Moisture Is a PathogenDetection Mechanism of the Behavioral Immune System” (Iwasa, Komatsu, Kitamura, &Sakamoto, 2020). In this study, the authors utilized paired comparisons to rate the perceivedmoisture content based on the visual perception for high luminance areas in images. Theparticipants were asked to select the image that had the highest moisture content. Thepaired images were presented twice for each participant ﬁrst left to right and then rightto left, to control for the inﬂuence of the presentation position. The image stimuli arepresented in the supplementary material of the original article.This reanalysis replicates the results of the study with a Bayesian Bradley-Terry modeland then investigates the presence and magnitude of the order-eﬀect. The two models arecompared utilizing the Watanabe-Akaike information criteria (WAIC).For the simple Bradley-Terry model, Table 1 shows the worth value of each parameterthat indicates the moisture content, and Figure 1 shows the parameter plot with the HPDintervals. The WAIC of the model is equal to 8132.2.The second model adds an order-eﬀect term to the model. The posterior estimationof the γ parameter is close to zero (with mean 0.001, lower HPD -0.054, and upper HPD0.057), which indicates that there is no presence of order eﬀect. The WAIC of the modelis equal to 8134.2. The WAIC of the order eﬀect is higher than the WAIC of the simple https://davidissamattos.github.io/bpcs/ AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 15 −5.0−2.50.02.55.0 image1 image2 image3 image4 image5 image6 image7 image8

Images A b ili t y Estimates of the moisture content

Figure 1 . Worth values of the images in the Bradley-Terry model in terms of moisturecontent.Bradley-Terry model, which indicates that the additional parameter did not increase thepredictive values of the model. Therefore, the selected model is the Bradley-Terry modelwithout order eﬀect.Table 1

Parameters estimates for the simple Bradley-Terry model

Parameter Mean HPD lower HPD higherimage1 -4.461 -6.507 -2.307image2 -2.350 -4.459 -0.257image3 -0.038 -2.138 2.061image4 0.153 -1.933 2.273image5 0.313 -1.805 2.377image6 1.855 -0.243 3.974image7 2.032 -0.069 4.128image8 3.044 0.849 5.060Considering the estimates of the ﬁrst model in Figure 1, it is possible to identify alarge overlap in the interval between the latent worth value of some image groups (image3,image4, image5 and also image6, image7). However, to properly rank and diﬀerentiatethem, it is necessary to generate a posterior distribution of the rank of these images. Table2 ranks the images based on the posterior distribution of the ranks in terms of moisturecontent. From this table, it is possible to see that despite the large overlap in the HPDI ofthe worth values, the images diﬀerentiate themselves in distinct ranks with low variation inthe ranks.AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 16Table 2

Rank of the images based on moisture content

Image Median Rank Mean Rank Std. Rankimage8 1 1.00 0.00image7 2 2.00 0.03image6 3 3.00 0.03image5 4 4.01 0.09image4 5 4.99 0.10image3 6 6.00 0.04image2 7 7.00 0.00image1 8 8.00 0.00

Study II: Using a Touchscreen Paradigm to Evaluate Food Preferences andResponse to Novel Photographic Stimuli of Food in Three Primate Species

This reanalysis is from the article “Using a Touchscreen Paradigm to Evaluate FoodPreferences and Response to Novel Photographic Stimuli of Food in Three Primate Species(Gorilla gorilla gorilla, Pan troglodytes, and Macaca fuscata)” (Huskisson et al., 2020), anextension of the initial study with a single gorilla (Hopper et al., 2019). In this study, theauthors tested a protocol of pairwise forced choice with six stimuli of food (four familiarand two novel stimuli) for 18 subjects (six gorillas, ﬁve chimpanzees, and seven Japanesemacaques). The study evaluates the eﬃcacy of using touchscreens to test zoo-housed pri-mates food preferences and evaluate the understanding of the photographic stimuli.A frequentist Bradley-Terry model was used to analyze food preference. The modelwas ﬁtted with the prefmod (Hatzinger & Dittrich, 2012) and the gnm package (Turner &Firth, 2020) in R. The output of the analyses is the worth value of parameters. The analysesinvestigated species and subjects separately.This reanalysis investigates a simple Bayesian Bradley-Terry model for each specieswithout considering the multiple judgment sampling. Then, a simple model with randomeﬀects to model subject-speciﬁc preferences is performed. A total of six models were ﬁtted(three simple Bradley-Terry and three Bradley-Terry with random eﬀects). Table 3 showsthe WAIC values for each model. This table indicates that all models with random eﬀectsperform better than the models without random eﬀects since they have lower WAICs.Table 3

Comparison of the WAIC of the Bradley-Terry model and the Bradley-Terry model withrandom eﬀects on the subjects for each species

WAICSpecies Bradley-Terry Bradley-Terry with random eﬀectsMacaques 7101.6 6713.2Chimpanzees 7199.5 6719.7Gorillas 9767.4 8792.2Table 4 shows the obtained parameters for the random eﬀects models together withAYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 17HPD intervals. To complement, Figure 2 shows a comparison of the estimates from themodel with and without the random eﬀects. Although both models show a relatively closeestimated value of the abilities of each food, the eﬀect sizes represented by the actualprobability of choosing one food over the other for the species and the subjects can still bediﬀerent. For example, the model with random eﬀects can estimate the probability of eachsubject selecting one food over the other, while the Bradley-Terry model without randomeﬀects only estimates the species average. Additionally, the random-eﬀects model can alsocompensate for non-balanced data, if one subject or species has more trials than another.Table 4

Parameters of the model with random eﬀects

Parameter Mean HPD lower HPD upper

Macaque

Carrot 0.12 -0.81 1.02Celery -2.28 -3.16 -1.34Jungle Pellet 1.24 0.35 2.14Oats -0.90 -1.85 -0.02Green Beans -0.17 -1.04 0.77 U std Chimpanzees

Apple 0.02 -0.97 1.06Tomato 0.32 -0.71 1.30Carrot -0.22 -1.26 0.76Grape 0.60 -0.39 1.59Cucumber -0.34 -1.33 0.63Turnip -0.44 -1.44 0.56 U std Gorilla

Apple 0.03 -0.98 0.97Carrot -0.11 -1.12 0.83Grape 0.86 -0.11 1.87Tomato 0.85 -0.10 1.84Cucumber -0.71 -1.66 0.33Turnip -0.93 -1.86 0.08 U std Chimpanzees Gorilla Macaque A pp l e C a rr o t C e l e r y C u c u m be r G r ape G r een B ean s J ung l e P e ll e t O a t s T o m a t o T u r n i p A pp l e C a rr o t C e l e r y C u c u m be r G r ape G r een B ean s J ung l e P e ll e t O a t s T o m a t o T u r n i p A pp l e C a rr o t C e l e r y C u c u m be r G r ape G r een B ean s J ung l e P e ll e t O a t s T o m a t o T u r n i p −3−2−1012 Food W o r t h v a l ue Model

Simple RandomEffects

Parameters estimates with uncertainty (HPD) interval

Figure 2 . The estimated abilities of each food type for each species in both models. Fooditems that do not have an estimated ability were not fed to that particular species.jungle pellets, oats, and celery (given the low standard deviation of the rank). However,they do not have strong preferences between carrots and green beans. Chimpanzees haveless consistent ranks and with higher standard deviation. They have a higher preferencefor grapes and a lower preference for turnip. Gorillas show a stronger preference for bothgrapes and tomatoes and a lower preference for turnips. The standard deviations on theranks are smaller than with the chimpanzees that had the same food choice. It is worthnoting, that this analysis consist of observing the preferences at the species level, not atthe individual level. For example, although chimpanzees (at the species level) did not havewell-deﬁned ranks, each subject can have deﬁned preferences, if analyzed individually.The second method to assess the food preference is with the posterior probabilityof selecting a stimulus over the other. The bpcs package provides functions to calculatethese probabilities for all combinations except in the case of the random eﬀect (in which thenumber of combinations is much larger). However, the package also oﬀers the possibility tocalculate the probability for selected cases.In the case of the random-eﬀects model, it is necessary to calculate the posteriordistribution of each desired pair of stimuli for each subject. However, the package hasalso the capability to calculate the probabilities for the average of the subjects (the case inwhich random eﬀects have a null eﬀect in the probability). Table 6 shows the probabilitiesof selecting novel stimuli against the selection of old stimuli.AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 19Table 5

Ranking of the food preferences per specie with model with random eﬀects

Food Median Rank Mean Rank Std. Rank

Macaque

Peanuts 1 1.01 0.10Jungle Pellet 2 1.99 0.11Carrot 3 3.16 0.37Green Beans 4 3.85 0.38Oats 5 4.99 0.10Celery 6 6.00 0.00

Chimpanzees

Grape 1 1.53 0.85Tomato 2 2.27 1.12Apple 3 3.26 1.25Carrot 4 4.24 1.28Cucumber 5 4.70 1.18Turnip 5 5.00 1.11

Gorilla

Grape 2 1.55 0.60Tomato 2 1.57 0.61Apple 3 3.38 0.74Carrot 4 3.72 0.72Cucumber 5 5.12 0.70Turnip 6 5.66 0.56

Study III: Patients’ health locus of control and preferences about the role thatthey want to play in the medical decision-making process

This reanalysis is from the article “Patients’ health locus of control and preferencesabout the role that they want to play in the medical decision-making process” (Marton etal., 2020). In the paper, the authors investigated how Health Locus of Control (HLOC) mayinﬂuence patient’s preferences for active or passive roles regarding their medical decision-making.The study was conducted with 153 participants which responded to the Multidimen-sional Health Locus of Control Scale - form C (Ross, Ross, Short, & Cataldo, 2015) and aseries of ten paired comparisons questions. The HLOC measured four dimensions (internal,chance, doctor, and other people) using an 18-point scale. The paired comparison questionswere based on hypothetical situations of the Control Preference Scale (CPS) (Solari et al.,2013) in which the participants chose one scenario among a set of comparative scenariosfrom Active role (“I prefer to make the decision about which treatment I will receive”);Active-Collaborative role (“I prefer to make the ﬁnal decision about my treatment after se-riously considering my doctor’s opinion”); Collaborative role (“I prefer that my doctor andI share responsibility for deciding which treatment is best for me”); Passive-Collaborative(“I prefer that my doctor makes the ﬁnal decision about which treatment will be used, butAYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 20Table 6

Posterior probabilities of the novel stimuli i being selected over the trained stimuli j

Item i Item j Probability Odds Ratio

Gorilla

Apple Cucumber 0.78 3.55Apple Grape 0.35 0.54Apple Turnip 0.75 3.00Apple Carrot 0.57 1.33Tomato Cucumber 0.80 4.00Tomato Grape 0.48 0.92Tomato Turnip 0.79 3.76Tomato Carrot 0.69 2.23

Chimpanzee

Apple Cucumber 0.64 1.78Apple Grape 0.42 0.72Apple Turnip 0.55 1.22Apple Carrot 0.51 1.04Tomato Cucumber 0.54 1.17Tomato Grape 0.37 0.59Tomato Turnip 0.69 2.23Tomato Carrot 0.64 1.78

Macaque

Oats Celery 0.76 3.17Oats Jungle Pellet 0.11 0.12Oats Peanuts 0.06 0.06Oats Carrot 0.23 0.30Green Beans Celery 0.88 7.33Green Beans Jungle Pellet 0.26 0.35Green Beans Peanuts 0.10 0.11Green Beans Carrot 0.45 0.82seriously considers my opinion”); or Passive role (“I prefer to leave all decisions regardingtreatment to my doctor”).The data were analyzed with the frequentist Bradley-Terry model utilizing the prefmod package (Hatzinger & Dittrich, 2012). Four independent models were analyzedwith each dimension of HLOC as the predictor, based on median-splitting to represent highHLOC and low HLOC for each dimension. The authors opted for this approach becausethe prefmod package only supports categorical predictors. This reanalysis evaluates threemodels in increasing complexity, with the four dimensions of HLOC modeled together.The models’ ﬁts are evaluated and how it impacts the estimated coeﬃcients. The HLOCdimensions are normalized to both be presented at a comparable scale and to facilitateinference. Centering and scaling procedures such as normalizing facilitates the convergenceof predictors coeﬃcients McElreath (2020).AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 21The ﬁrst model is a simple Bradley-Terry model, to serve as a basis. This model has aWAIC of 1422.4. The second model utilizes the four dimensions of the HLOC as predictorsand has a WAIC of 1378.2. The third model introduces both random eﬀects to compensatefor individual preferences for each of the 5 roles (active to passive) and the four HLOCdimensions as subject-speciﬁc predictors. This third model has a WAIC of 924.8 indicatingthe best ﬁt out of the three models.Table 7 shows the values of the obtained λ parameters and the standard deviationof the random eﬀects. The results indicated a higher base preference for the Collaborativerole and a lower base preference for the Passive role.Table 7 Lambda parameters of the model and the random eﬀects standard deviation

Parameter Mean HPD lower HPD lowerActive -2.11 -3.04 -1.08Active-Collaborative 1.40 0.44 2.38Collaborative 3.21 2.08 4.21Passive-Collaborative 0.85 -0.07 1.84Passive -3.32 -4.43 -2.34 U std PassiveCollaborative Passive−CollaborativeActive Active−CollaborativeChance Doctors Internal Other people Chance Doctors Internal Other people−101−101−101

HLOC dimension E s t i m a t e Subject predictors estimates with uncertainty (HPD) interval

Figure 3 . The values of the subject predictors parameters with HPD intervals.AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 23Table 8

Subject predictors parameters by role

Parameter Mean HPD lower HPD lower

Active

Internal -0.09 -1.11 0.85Chance -0.11 -1.14 0.86Doctors 0.17 -1.50 0.50Other People 0.57 -1.19 0.80

Active-Collaborative

Internal -0.07 -0.97 0.98Chance -0.48 -1.10 0.86Doctors -0.74 -1.44 0.48Other People 0.10 -1.30 0.71

Collaborative

Internal -0.51 -1.03 0.93Chance -0.31 -0.90 1.09Doctors 0.03 -0.78 1.19Other People 0.14 -1.76 0.33

Passive-Collaborative

Internal -0.01 -0.88 1.14Chance 0.11 -0.84 1.15Doctors 0.27 -0.51 1.48Other People 0.69 -0.35 1.72

Passive

Internal -0.18 -0.97 0.97Chance -0.05 -1.07 0.88Doctors -0.06 -0.72 1.23Other People 0.49 -0.39 1.59the active role of 0.94, while a subject with two standard deviations above the mean forthe doctors HLOC has a probability of selecting the active role of only 0.34. The table alsoshows some comparisons for other roles and changes in the HLOC dimensions.

Conclusion

The ultimate goal of this article is to provide tools, with strong theoretical founda-tions, that empower researchers to have alternatives to the use of Likert scales and frequen-tist data analysis when analyzing paired data. Therefore, the bpcs package was introducedto facilitate the adoption of Bayesian models in paired comparison assessments. The pack-age is free to use and available at both the Comprehensive R Archive Network (CRAN) andat the oﬃcial repository.This article also presented the mathematical foundations of the diﬀerent Bayesianmodels implemented in the bpcs package and discussed the usage of this package based onAYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 24Table 9

Probabilities of selecting role i instead of j based on changes of the values of the HLOCdimensions

Roles HLOC dimensionsi j Internal Chance Doctors Other People ProbabilityActive Passive 0 0 0 0 0.87Active Passive -2 0 0 0 0.82Active Passive 2 0 0 0 0.78Active Passive 0 0 -2 0 0.94Active Passive 0 0 2 0 0.34Act.-Collab. Collab. 0 0 0 0 0.22Act.-Collab. Collab. -2 0 0 0 0.11Act.-Collab. Collab. 2 0 0 0 0.25Act.-Collab. Collab. 0 0 -2 0 0.43Act.-Collab. Collab. 0 0 2 0 0.09Collab. Pass.-Collab. 0 0 0 0 0.90Collab. Pass.-Collab. 0 -2 0 0 0.83Collab. Pass.-Collab. 0 2 0 0 0.93Collab. Pass.-Collab. 0 0 0 -2 1.00Collab. Pass.-Collab. 0 0 0 2 0.49the re-analyses of three studies in diﬀerent areas of behavior science (psychophysics, animalresearch, and health). Besides being the ﬁrst package to introduce a Bayesian Bradley-Terrymodel and many of its extensions, such as the Davidson model to handle ties, models withorder eﬀect, generalized models, models with dependent data, and models with predictorson the subject (and the diﬀerent combinations of these extensions) the package also providestools (such as assessment of uncertainty in the ranks and the posterior probabilities) notavailable in frequentist packages. All the code used to ﬁt the models, to create the tablesand the ﬁgures from the reanalyses’ section are available in the online appendix.Being able to easily extend a simple model to more complex ones, as shown in the re-analyses, allows researchers to control for bias and errors in the modeling. Future researchcould further develop the use of Bayesian cumulative models (when there is a strength scalein the assessment of two items) and models that have time dependency.References

Abalos, J., i de Lanuza, G. P., Carazo, P., & Font, E. (2016). The role of male coloration in theoutcome of staged contests in the european common wall lizard (podarcis muralis).

Behaviour , (5), 607–631.Böckenholt, U. (2001). Hierarchical modeling of paired comparison data. Psychological Methods , (1), 49.Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. the method ofpaired comparisons. Biometrika , (3/4), 324–345. AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 25

Brown, A. (2016). Item response models for forced-choice questionnaires: A common framework.

Psychometrika , (1), 135–160.Bush, J. M., Quinn, M. M., Balreira, E. C., & Johnson, M. A. (2016). How do lizards determinedominance? applying ranking algorithms to animal social behaviour. Animal Behaviour , ,65–74.Bürkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The RJournal , (1), 395–411. doi: 10.32614/RJ-2018-017Carpenter, B., Gelman, A., Hoﬀman, M. D., Lee, D., Goodrich, B., Betancourt, M., . . . Riddell, A.(2017). Stan: A probabilistic programming language. Journal of statistical software , (1).Cattelan, M. (2012). Models for paired comparison data: A review with emphasis on dependentdata. Statistical Science , 412–433.Chien, S. H.-L., Lin, Y.-L., Qian, W., Zhou, K., Lin, M.-K., & Hsu, H.-Y. (2012). With or without ahole: Young infants’ sensitivity for topological versus geometric property.

Perception , (3),305–318.Coetzee, H., & Taylor, J. (1996). The use and adaptation of the paired-comparison method inthe sensory evaluation of hamburger-type patties by illiterate/semi-literate consumers. Foodquality and preference , (2), 81–85.Davidson, R. R. (1970). On extending the bradley-terry model to accommodate ties in pairedcomparison experiments. Journal of the American Statistical Association , (329), 317–328.Davidson, R. R., & Beaver, R. J. (1977). On extending the bradley-terry model to incorporatewithin-pair order eﬀects. Biometrics , 693–702.Davidson, R. R., & Solomon, D. L. (1973). A bayesian approach to paired comparison experimen-tation.

Biometrika , (3), 477–487.Gabry, J. (2018). shinystan: Interactive visual and numerical diagnostics and posterior analysisfor bayesian models [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=shinystan (R package version 2.5.0)Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesiandata analysis . CRC press.Gelman, A., & Hennig, C. (2017). Beyond subjective and objective in statistics.

Journal of theRoyal Statistical Society: Series A (Statistics in Society) , (4), 967–1033.Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria forbayesian models. Statistics and computing , (6), 997–1016.Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical science , (4), 457–472.Goodrich, B., Gabry, J., Ali, I., & Brilleman, S. (2020). rstanarm: Bayesian applied regressionmodeling via Stan. Retrieved from https://mc-stan.org/rstanarm (R package version2.21.1)Hägerhäll, C. M., Ode Sang, Å., Englund, J.-E., Ahlner, F., Rybka, K., Huber, J., & Burenhult, N.(2018). Do humans really prefer semi-open natural landscapes? a cross-cultural reappraisal.

Frontiers in psychology , , 822.Hatzinger, R., & Dittrich, R. (2012). Prefmod: An r package for modeling preferences based onpaired comparisons, rankings, or ratings. Journal of Statistical Software , (10), 1–31.Hoﬀman, M. D., & Gelman, A. (2014). The no-u-turn sampler: adaptively setting path lengths inhamiltonian monte carlo. J. Mach. Learn. Res. , (1), 1593–1623.Hontangas, P. M., de la Torre, J., Ponsoda, V., Leenen, I., Morillo, D., & Abad, F. J. (2015). Com-paring traditional and irt scoring of forced-choice tests. Applied Psychological Measurement , (8), 598–612.Hopper, L. M., Egelkamp, C. L., Fidino, M., & Ross, S. R. (2019). An assessment of touchscreens fortesting primate food preferences and valuations. Behavior research methods , (2), 639–650.Huskisson, S. M., Jacobson, S. L., Egelkamp, C. L., Ross, S. R., & Hopper, L. M. (2020). Using atouchscreen paradigm to evaluate food preferences and response to novel photographic stimuli AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 26 of food in three primate species (gorilla gorilla gorilla, pan troglodytes, and macaca fuscata).

International Journal of Primatology , 1–19.Iwasa, K., Komatsu, T., Kitamura, A., & Sakamoto, Y. (2020). Visual perception of moisture isa pathogen detection mechanism of the behavioral immune system.

Frontiers in Psychology , , 170.Kelter, R. (2020). Analysis of type i and ii error rates of bayesian and frequentist parametricand nonparametric two-sample hypothesis tests under preliminary assessment of normality. Computational Statistics , 1–26.Kreitchmann, R. S., Abad, F. J., Ponsoda, V., Nieto, M. D., & Morillo, D. (2019). Controlling forresponse biases in self-report scales: forced-choice vs. psychometric modeling of likert items.

Frontiers in Psychology , , 2309.Kruschke, J. K. (2013). Bayesian estimation supersedes the t-test. Journal of Experimental Psy-chology: General , (2), 573.Kruschke, J. K., & Liddell, T. M. (2018). Bayesian data analysis for newcomers. Psychonomicbulletin & review , (1), 155–177.Kucukelbir, A., Ranganath, R., Gelman, A., & Blei, D. (2015). Automatic variational inference instan. Advances in neural information processing systems , , 568–576.Leonard, T. (1977). An alternative bayesian approach to the bradley-terry model for paired com-parisons. Biometrics , 121–132.Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with metric models: What couldpossibly go wrong?

Journal of Experimental Social Psychology , , 328–348.Luckett, C. R., Burns, S. L., & Jenkinson, L. (2020). Estimates of relative acceptability from pairedpreference tests. Journal of Sensory Studies , (5), e12593.Marton, G., Pizzoli, S. F. M., Vergani, L., Mazzocco, K., Monzani, D., Bailo, L., . . . Pravettoni, G.(2020). Patients’ health locus of control and preferences about the role that they want to playin the medical decision-making process. Psychology, Health & Medicine , 1–7.McElreath, R. (2020).

Statistical rethinking: A bayesian course with examples in r and stan . CRCpress.McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon statisticalsigniﬁcance.

The American Statistician , (sup1), 235–245.Meid, A. D., Quinzler, R., Groll, A., Wild, B., Saum, K.-U., Schöttker, B., . . . Haefeli, W. E. (2016).Longitudinal evaluation of medication underuse in older outpatients and its association withquality of life. European journal of clinical pharmacology , (7), 877–885.Miller, E. T., Bonter, D. N., Eldermire, C., Freeman, B. G., Greig, E. I., Harmon, L. J., . . . Stephens,D. (2017). Fighting over food unites the birds of north america in a continental dominancehierarchy. Behavioral Ecology , (6), 1454–1463.Morillo, D., Leenen, I., Abad, F. J., Hontangas, P., de la Torre, J., & Ponsoda, V. (2016). A dom-inance variant under the multi-unidimensional pairwise-preference framework: Model formu-lation and markov chain monte carlo estimation. Applied Psychological Measurement , (7),500–516.Petrou, S. (2003). Methodological issues raised by preference-based approaches to measuring thehealth status of children. Health economics , (8), 697–702.Ross, T. P., Ross, L. T., Short, S. D., & Cataldo, S. (2015). The multidimensional health locus ofcontrol scale: Psychometric properties and form equivalence. Psychological reports , (3),889–913.Rouder, J. N. (2014). Optional stopping: No problem for bayesians. Psychonomic bulletin & review , (2), 301–308.Solari, A., Giordano, A., Kasper, J., Drulovic, J., van Nunen, A., Vahter, L., . . . others (2013). Rolepreferences of people with multiple sclerosis: image-revised, computerized self-administeredversion of the control preference scale. PLoS One , (6), e66127.Springall, A. (1973). Response Surface Fitting Using a Generalization of the Bradley-Terry Paired AYESIAN PAIRED-COMPARISON WITH THE BPCS PACKAGE 27

Comparison Model.

Journal of the Royal Statistical Society: Series C (Applied Statistics) , (1), 59–68.Stan Development Team. (2020). RStan: the R interface to Stan.

Retrieved from http://mc-stan.org/ (R package version 2.19.3)Stern, S. E. (2011). Moderated paired comparisons: a generalized Bradley–Terry model for continu-ous data using a discontinuous penalized likelihood function.

Journal of the Royal StatisticalSociety: Series C (Applied Statistics) , (3), 397–415.Team, S. D. (2016). Stan modeling language users guide and reference manual. Technical report .Turner, H., & Firth, D. (2012). Bradley-terry models in r: the bradleyterry2 package.

Journal ofStatistical Software , (9).Turner, H., & Firth, D. (2020). Generalized nonlinear models in r: An overview of the gnm package[Computer software manual]. Retrieved from https://cran.r-project.org/package=gnm (R package version 1.1-1)Wang, W.-C., Qiu, X.-L., Chen, C.-W., Ro, S., & Jin, K.-Y. (2017). Item response theory modelsfor ipsative tests with multidimensional pairwise comparison items. Applied psychologicalmeasurement , (8), 600–613.Wickham, H. (2016). ggplot2: Elegant graphics for data analysis . Springer-Verlag New York.Retrieved from https://ggplot2.tidyverse.org Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2017). Using stacking to average bayesianpredictive distributions.

Bayesian Analysis . doi: 10.1214/17-BA1091Zhang, H., Houpt, J. W., & Harel, A. (2019). Establishing reference scales for scene naturalnessand openness.

Behavior research methods , (3), 1179–1186.Zhu, H. (2020). kableextra: Construct complex table with ’kable’ and pipe syntax [Computersoftware manual]. Retrieved from https://CRAN.R-project.org/package=kableExtrahttps://CRAN.R-project.org/package=kableExtra