Computational methods for Bayesian semiparametric Item Response Theory models
Sally Paganin, Christopher J. Paciorek, Claudia Wehrhahn, Abel Rodriguez, Sophia Rabe-Hesketh, Perry de Valpine
University of California, Berkeley; University of California, Santa Cruz; University of Washington, Seattle

January 28, 2021
Abstract
Item response theory (IRT) models are widely used to obtain interpretable inference when analyzing data from questionnaires, scaling binary responses into continuous constructs. Typically, these models rely on a normality assumption for the latent trait characterizing individuals in the population under study. However, this assumption can be unrealistic and lead to biased results. We relax the normality assumption by considering a flexible Dirichlet process mixture model as a nonparametric prior on the distribution of the individual latent traits. Although this approach has been considered in the literature before, there is a lack of comprehensive studies of such models or general software tools. To fill this gap, we show how the NIMBLE framework for hierarchical statistical modeling enables the use of flexible priors on the latent trait distribution, specifically illustrating the use of Dirichlet process mixtures in two-parameter logistic (2PL) IRT models. We study how different sets of constraints can lead to model identifiability and give guidance on eliciting prior distributions. Using both simulated and real-world data, we conduct an in-depth study of Markov chain Monte Carlo posterior sampling efficiency for several sampling strategies. These strategies consider various combinations of model parameterizations, identifiability constraints and sampling algorithms. In the simulated scenarios, we find that some choices of parameterization and identifiability constraints can yield order-of-magnitude differences in sampling efficiency compared to others. We also find that there is little inferential penalty in using a semiparametric model when a parametric model would be correct, supporting the benefit of greater robustness to mis-specification.

* [email protected]
We illustrate the 2PL semiparametric IRT models with two real datasets related to educational and medical assessments: the 2007 Trends in International Mathematics and Science Study (TIMSS) and the 1996 Health Survey for England. In the former case, semiparametric results are similar to those of the parametric 2PL model, while in the latter case the semiparametric model identifies distinct ability clusters missed by the parametric model. We conclude that having access to semiparametric models can be broadly useful, as it allows inference on the entire underlying ability distribution and its functionals, with NIMBLE being a flexible framework for estimation of such models.

Keywords: 2PL model, Dirichlet process mixture, MCMC strategies, NIMBLE.
Traditional approaches in item response theory (IRT) modeling rely on the assumption that subject-specific latent traits follow a normal distribution. This assumption is often adopted for computational convenience, but there are many situations in which it may be unrealistic (Samejima, 1997). For example, Micceri (1989) gives a comprehensive review of many psychometric datasets in which the distribution of the latent individual trait does not respect the normality assumption and presents instead asymmetries, heavy tails or multimodality. In addition, estimation of IRT parameters in the presence of non-normal latent traits has been shown to produce biased estimates of the parameters (see, for example, Finch & Edwards, 2016; Kirisci, Hsu, & Yu, 2001; Schmitt, Mehta, Aggen, Kubarych, & Neale, 2006; Seong, 1990).

Different proposals have been made in the general IRT literature for relaxing the normality assumption on the distribution of the latent individual trait, using either Markov chain Monte Carlo (MCMC) or marginal maximum likelihood (MML) estimation methods. One option is to rely on more general parametric assumptions. For example, Azevedo, Bolfarine, and Andrade (2011) considered a skew-normal distribution (Azzalini, 1985), while others have suggested finite mixtures of normal distributions (Bambirra Gonçalves, da Costa Campos Dias, & Machado Soares, 2018; Bolt, Cohen, & Wollack, 2001). Alternatively, one can refrain from making distributional assumptions on the latent abilities by using nonparametric maximum likelihood estimation (Laird, 1978; Mislevy, 1984), B-splines (Johnson, 2007; Woods & Thissen, 2006) or empirical histograms (Woods, 2007).

This paper considers the Bayesian nonparametric approach that uses a Dirichlet process mixture (Escobar & West, 1995; Ferguson, 1973; Lo, 1984) as a nonparametric prior on the distribution of the individual latent trait, in the context of logistic IRT models for binary responses.
Bayesian nonparametric extensions of binary IRT models have been previously considered in the literature. Such models are semiparametric because they retain other, parametric, assumptions of binomial mixed models, such as the functional form of the link function. Within this approach, the semiparametric 1PL model has been the focus of more effort as well as software (Jara, Hanson, Quintana, Müller, & Rosner, 2011,
DPpackage, no longer actively maintained). San Martín, Jara, Rolin, and Mouchart (2011) investigated a semiparametric generalization of the 1PL (Rasch-type) models from a theoretical perspective, while Finch and Edwards (2016) provided results from simulation studies. An example using the semiparametric 2PL model is given in Duncan and MacEachern (2008), but there is a lack of comprehensive studies of such models or general software tools. Other nonparametric IRT models relax assumptions about the functional form of the item response probabilities, considering a general monotonic function in place of the logistic/probit link function; these are sometimes referred to as NIRT models. Some work using the Dirichlet process falls in this class of models (Karabatsos, 2017; Miyazaki & Hoshino, 2009; Qin, 1998). While we do not pursue this direction in this paper, focusing instead on nonparametric modeling of the latent trait distribution, such an extension is relatively straightforward.

In this paper we fill three major gaps that hinder widespread application of semiparametric Bayesian item response theory models. First, we implement semiparametric 1PL and 2PL models in NIMBLE (de Valpine et al., 2017) (R package nimble, de Valpine et al. (2020)), a flexible R-based system for hierarchical modeling. In particular, NIMBLE provides functionality for fitting hierarchical models that involve Dirichlet process priors either via a Chinese restaurant process (CRP) (Blackwell, MacQueen, et al., 1973) or a truncated stick-breaking (SB) (Sethuraman, 1994) representation of the prior. Hence, NIMBLE supports a much wider class of models than those that are implemented in standard software packages. Code is provided for all examples, along with guidelines on prior elicitation. Second, we study the efficiency of several MCMC sampling strategies to estimate 2PL models in both simulated and real-data scenarios.
In particular, we use results for parametric 2PL models to understand and make recommendations for effective MCMC strategies when using semiparametric 2PL models, which are inherently slower to mix than parametric 2PL models. This approach also allows us to compare various random walk Metropolis-Hastings MCMC strategies to the Hamiltonian Monte Carlo (HMC) strategy implemented in the widely used Stan package (Stan Development Team, 2018). This is important because HMC algorithms are not readily available for Dirichlet process prior models, since HMC cannot sample discrete parameters (the component indicators), which also cannot be integrated out in infinite mixture models.

Finally, we present a comparison of inferential results for item and person parameters under parametric and semiparametric assumptions. To make these comparisons fair, we carefully elicit prior distributions for the models by matching the prior predictive distribution of the data to a common distribution. We also illustrate how to estimate the entire distribution of latent traits and its functionals under the two assumptions.

The remainder of the paper is organized as follows: in Section 2 we present the standard IRT model and the Bayesian semiparametric extension, along with considerations for identifiability. We then discuss the goals of our numerical experiments to assess MCMC strategies and the datasets used in those experiments (Section 3). We present a variety of potential sampling strategies (Section 4), give guidance on selecting prior distributions (Section 5), and then show results for MCMC efficiency and statistical inference (Sections 6 and 7).

IRT models are widely used in various social science disciplines to scale binary responses into continuous constructs. For conciseness, in this section we introduce model notation in the context of educational assessment, where typically data are answers to exam questions from a set of individuals.
In particular, let y_ij denote the answer of individual j to item i, for j = 1, ..., N and i = 1, ..., I, with y_ij = 1 when the answer is correct and 0 otherwise. Responses from different individuals are assumed to be independent, while responses from the same individual are assumed independent conditional on the latent trait (this is sometimes called the local independence assumption in the psychometric literature).

Let π_ij denote the probability that individual j answers item i correctly, given the model parameters η_j, λ_i, β_i, i.e. π_ij = Pr(y_ij = 1 | η_j, λ_i, β_i) for i = 1, ..., I and j = 1, ..., N. The parameter η_j represents the latent ability of the j-th individual, while β_i and λ_i encode the item characteristics for the i-th item. In the two-parameter logistic (2PL) model, the probability π_ij is determined using the logistic function as

logit(π_ij) = λ_i (η_j − β_i),   i = 1, ..., I,  j = 1, ..., N.   (1)

A further assumption for the latent abilities is that they are independently and identically distributed according to some distribution G,

η_j ~iid G,   j = 1, ..., N,   (2)

with G traditionally a standard normal distribution. The parameter λ_i > 0 is called the discrimination, since items with a large λ_i are better at discriminating between subjects with similar abilities, while β_i is called the difficulty because, for any fixed η_j, the probability of a correct response to item i is decreasing in β_i. When λ_i = 1 for all i = 1, ..., I, the model in (1) reduces to the one-parameter logistic (1PL) model, also known as the Rasch model (Rasch, 1990).

Often, the log-odds in (1) are reparameterized as

λ_i η_j + γ_i,   (3)

with γ_i = −λ_i β_i. These two parameterizations are sometimes referred to as the IRT parameterization (1) and the slope-intercept parameterization (3). While the slope-intercept parameterization is often adopted for computational convenience, the IRT parameterization is the most traditional in terms of interpretation.
In exploring different strategies for Bayesian estimation, we will consider both alternatives and investigate potential differences in terms of computational performance.
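To make the two parameterizations concrete, the following Python sketch (purely illustrative; the models in this paper are specified and fit in NIMBLE, and the parameter values below are arbitrary) computes a response probability under (1) and under (3) and checks that they coincide when γ_i = −λ_i β_i.

```python
import math

def expit(x):
    """Inverse logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def p_correct_irt(eta, lam, beta):
    """IRT parameterization (1): logit(pi) = lambda * (eta - beta)."""
    return expit(lam * (eta - beta))

def p_correct_slope_intercept(eta, lam, gamma):
    """Slope-intercept parameterization (3): logit(pi) = lambda * eta + gamma."""
    return expit(lam * eta + gamma)

# The two forms give the same probability when gamma = -lambda * beta.
eta, lam, beta = 0.7, 1.2, -0.4   # arbitrary illustrative values
gamma = -lam * beta
p1 = p_correct_irt(eta, lam, beta)
p2 = p_correct_slope_intercept(eta, lam, gamma)
assert abs(p1 - p2) < 1e-12
```

Because the map (λ_i, β_i) ↦ (λ_i, γ_i) is one-to-one, posterior draws obtained under one parameterization can always be transformed to the other after sampling.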
The classical formulation of IRT models assumes that the latent abilities in (2) have a normal distribution. This assumption can be relaxed by modeling the distribution of ability as a mixture of normal distributions, where the number of mixture components does not need to be specified in advance but rather is learned from the data. This can be achieved using a Dirichlet process mixture (DPM) model for the distribution of ability. In particular, the latent distribution G in (2) can be constructed as a convolution involving a DP prior, i.e.

G(·) = ∫ K(· | θ) F(dθ),   F ~ DP(α, G_0),   (4)

where K(· | θ) is a suitable probability kernel (e.g., a normal distribution) indexed by the parameter θ, while α and G_0 are, respectively, the concentration parameter and the base distribution of the Dirichlet process. A random distribution F drawn from such a Dirichlet process can be constructed as

F(·) = ∑_{k=1}^∞ w_k δ_{θ̃_k}(·),

where δ_a is the Dirac probability measure concentrated at a and θ̃_1, θ̃_2, ... is a sequence of independent draws from G_0. The weights are calculated using a stick-breaking construction: w_k = v_k ∏_{l=1}^{k−1} (1 − v_l), with v_1, v_2, ... a sequence of independent draws from a Beta(1, α) distribution (Sethuraman, 1994). This construction makes it clear that, as long as the kernel K(· | θ) is continuous, the latent ability distribution G is also continuous, but F is almost surely discrete, naturally inducing clustering via ties in the parameter indexing the distribution of ability.

The DP is often characterized in terms of the predictive distribution associated with its draws (Blackwell et al., 1973). More specifically, let θ_1, ..., θ_N be an independent sample from F. Integrating over F, one can obtain the joint prior distribution on (θ_1, ..., θ_N), which can be written as the product of a sequence of conditional distributions, where

(θ_j | θ_{j−1}, ..., θ_1) ~ α/(α + j − 1) G_0 + ∑_{l=1}^{j−1} 1/(α + j − 1) δ_{θ_l},   (5)

for j = 1, ..., N.
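The stick-breaking representation above is straightforward to simulate. The following Python sketch (an illustration only, distinct from the CRP-based NIMBLE implementation used later in the paper) draws a truncated approximation to F with normal atoms and then samples abilities from the induced mixture G; the kernel standard deviation and base-distribution parameters are arbitrary choices for this sketch.

```python
import random

def stick_breaking_weights(alpha, K, rng):
    """Truncated stick-breaking: w_k = v_k * prod_{l<k} (1 - v_l), v_l ~ Beta(1, alpha)."""
    weights, remaining = [], 1.0
    for _ in range(K):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= (1.0 - v)
    return weights  # sums to 1 - remaining; remaining shrinks geometrically in K

rng = random.Random(1)
w = stick_breaking_weights(alpha=1.0, K=200, rng=rng)
assert abs(sum(w) - 1.0) < 1e-6  # truncation error is negligible at K = 200

# Atom locations drawn from an illustrative normal base distribution G_0;
# the kernel scale below is fixed for simplicity (an assumption of this sketch).
atoms = [rng.gauss(0.0, 1.0) for _ in w]

def draw_ability():
    """Sample eta ~ G: pick a component by its weight, then draw from the kernel."""
    u, acc = rng.random(), 0.0
    for wk, mu in zip(w, atoms):
        acc += wk
        if u <= acc:
            return rng.gauss(mu, 0.5)
    return rng.gauss(atoms[-1], 0.5)  # fallback for the tiny truncated tail

etas = [draw_ability() for _ in range(1000)]
```

Although F places all its mass on the discrete atoms, the sampled abilities come from the continuous mixture G, matching the discussion above.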
Note that this construction clearly indicates that there is a positive probability of ties among the θ_j's. The distribution in (5) is often interpreted in terms of the Chinese restaurant process (CRP) (see Aldous, 1985; Pitman, 1996). The name comes from the analogy used to describe the process. Consider a Chinese restaurant with an infinite number of tables, each table serving one dish shared by all customers sitting at that table. In this metaphor, each table represents a possible cluster, while each dish represents the parameter indexing the distribution associated with the cluster. Customers entering the restaurant can seat themselves at a previously occupied table and share the same dish (with probability proportional to the number of customers already sitting at the table), or go to a new table and order another dish (with probability proportional to α).

Using z_j to denote the table chosen by the j-th customer and θ*_k to denote the dish served at table k, this can be formally stated as

p(z_j = k | z_{j−1}, ..., z_2, z_1, α) = n_k^{(j−1)} / (α + j − 1),  k = 1, ..., K^{(j−1)};   α / (α + j − 1),  k = K^{(j−1)} + 1,   (6)

and θ*_k ~ G_0, where K^{(j−1)} is the total number of tables occupied by the first j − 1 customers and n_k^{(j−1)} is the number of customers at table k among the first j − 1. The concentration parameter α controls the distribution of the number of tables (clusters), with larger values favoring more tables.

NIMBLE allows one to use either this CRP representation or a finite stick-breaking approximation for DP models. In this work we use the CRP. Using the indicators z = {z_j, j = 1, ..., N}, and denoting by z | α ~ CRP(α) the joint distribution induced by (6), a Dirichlet process mixture (DPM) model for the distribution of ability can alternatively be written as

η_j | θ, z_j ~ind K(· | θ_{z_j}),  j = 1, ..., N,
z | α ~ CRP(α),
θ*_k ~iid G_0,  k = 1, 2, ....

In the context of 1PL and 2PL models, it seems natural to employ normal kernels, K(· | θ), indexed by parameters θ = {µ, σ²}. This allows one to represent the abilities as a mixture of normal distributions, where the number of mixture components is unknown. Furthermore, under this choice, taking α → 0 collapses the mixture to a single normal component whose parameters are drawn from G_0. For computational convenience, the base distribution is typically the product of a normal distribution for µ and an inverse-gamma distribution for σ².

Without additional constraints, the parameters of the 1PL and 2PL models are not identifiable (Bafumi, Gelman, Park, & Kaplan, 2005; Geweke & Singleton, 1981) (see Appendix A for details). For example, increasing all η_j and β_i values by the same amount yields the same probabilities in (1) for all i and j. More generally, the ability parameters are known only up to a linear transformation, and constraints are needed to identify them. To address this problem, traditional work on parametric IRT models assumes that latent abilities in (2) come from a standard normal distribution, i.e. G ≡ N(0, 1), and requires λ_i for i = 1, ..., I to be positive in the 2PL model. Alternative constraints can also establish identifiability and could yield different computational performance for MCMC sampling.
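The two invariances just described can be checked numerically. In this short Python sketch (with arbitrary parameter values), shifting all abilities and difficulties by the same constant, or rescaling the abilities while inversely rescaling the discriminations, leaves the response probability in (1) unchanged.

```python
import math

def pi(lam, eta, beta):
    """2PL response probability: expit(lambda * (eta - beta))."""
    return 1.0 / (1.0 + math.exp(-lam * (eta - beta)))

lam, eta, beta = 0.8, 1.1, -0.3   # arbitrary illustrative values
c, s = 2.5, 1.7                   # arbitrary shift and scale

# Translation invariance: eta -> eta + c, beta -> beta + c
assert abs(pi(lam, eta, beta) - pi(lam, eta + c, beta + c)) < 1e-12

# Rescaling invariance: eta -> s * eta, beta -> s * beta, lambda -> lambda / s
assert abs(pi(lam, eta, beta) - pi(lam / s, s * eta, s * beta)) < 1e-12
```

Any constraint set that pins down one location and one scale (a fixed ability distribution, sum-to-zero item parameters, or fixed traits for two individuals) removes exactly these two degrees of freedom.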
A common alternative for the parametric 1PL and 2PL models considers sum-to-zero constraints for the item parameters (Fox, 2010):

∑_{i=1}^I β_i = 0  (or ∑_{i=1}^I γ_i = 0),   ∑_{i=1}^I log(λ_i) = 0.   (7)

Centering the difficulty parameters addresses the invariance to translations, while centering the discrimination parameters on the log scale, which sets their product to one, accounts for the invariance to rescalings of the latent space. Another potential set of constraints, popular in political science applications, involves fixing the value of the latent traits for two individuals (e.g., see Clinton, Jackman, & Rivers, 2004). Whatever the set of constraints, however, it is worthwhile to note that they can either be directly incorporated in the model as part of the prior (and, therefore, in the structure of the sampling algorithms), or they can be applied as a post-processing step (after running an unconstrained MCMC). This last approach is typical of parameter-expanded algorithms, which embed the targeted model in a larger specification. Parameter expansion has been proposed in the literature to accelerate EM (C. Liu, Rubin, & Wu, 1998) and Gibbs sampler (J. S. Liu & Wu, 1999) convergence, as well as a mechanism to induce new classes of priors (Gelman, 2004). Although targeting the same posterior, constrained priors and parameter expansion can lead to very different results in terms of convergence and mixing of the MCMC algorithms.

Similar arguments apply to the semiparametric 1PL and 2PL model extension using a Dirichlet process mixture prior for the distribution of ability. In that setting, one identifiability strategy may be to constrain the base distribution G_0, e.g., by letting G_0 ≡ N(0, 1). However, even if the prior mean and variance of G are zero and one, the corresponding posterior quantities can deviate substantially from these values, leading to biased inference (Yang & Dunson, 2010).
More general centering approaches have been proposed in the literature when a DP distribution is used to model random effects or latent variables in a hierarchical model (Li, Müller, & Lin, 2011; Yang & Dunson, 2010; Yang, Dunson, & Baird, 2010). These approaches rely on parameter expansion by sampling from the unconstrained DP model and then applying a post-processing procedure to the posterior samples. This post-processing procedure requires the analytical evaluation of the posterior mean and variance of the DP random measure, with Li et al. (2011) providing results under the CRP representation and Yang et al. (2010) under the stick-breaking one. Although such strategies are useful for general hierarchical models to avoid identifiability issues, for the semiparametric 1PL and 2PL models it is simpler to use the sum-to-zero constraints on the item parameters in (7), and that is the approach we adopt in this work. As in the parametric case, we can either include these constraints in the prior or use the parameter expansion approach by sampling and then centering and rescaling the posterior samples as appropriate.

Our evaluations of computational and inferential performance rely on a series of numerical experiments that use both simulated and real datasets. The specific questions we consider are:

1. For parametric 2PL models, what are efficient Metropolis-Hastings-based MCMC sampling strategies? Do different strategies work better in different scenarios? We consider different parameterizations, constraints, and samplers.

2. For the parametric 2PL models, how does the efficiency of random walk Metropolis-Hastings sampling compare to Hamiltonian Monte Carlo, as implemented in the popular Stan package? This question is of interest because HMC is not readily available for Dirichlet process prior models.

3. How does MCMC efficiency of a semiparametric 2PL model compare to that of a parametric 2PL model? Does this comparison differ when the parametric 2PL model is correct vs. incorrect?

4. To what degree does the use of a parametric model when its assumptions are violated yield bad inference? Does use of a semiparametric model change inference even when a parametric model would be valid?

5. How much do results differ between the semiparametric and parametric models for the real data examples?
We specified two different simulation scenarios for the distribution assumed for the latent abilities: a unimodal and a bimodal distribution. For both scenarios, we simulate responses from N = 2,000 individuals on I = 15 binary items. Values for the discrimination parameters {λ_i} are sampled from a Uniform(0.5, 1.5) distribution, while values for the difficulty parameters {β_i} are taken to be equally spaced in an interval symmetric about zero. In the unimodal scenario, latent abilities η_j, j = 1, ..., N, are generated from a normal distribution with mean 0, while in the bimodal scenario we used an equal-weights mixture of two normal distributions with means symmetric about zero and a common variance.

The first example is a subset of data from the 1996 Health Survey for England (Joint Health Surveys Unit of Social and Community Planning Research and University College London, 2017), a survey conducted yearly to collect information concerning health and behavior of households in England. In particular, we have data for 10 items measuring Physical Functioning (PF-10), which is a sub-scale of the SF-36 Health Survey (Ware, 2003) administered to people aged 16 and above. In this case the latent trait quantifies the physical status of a given individual (Hays, Morales, & Reise, 2000; McHorney, Haley, & Ware Jr, 1997). Participants in the survey were asked whether they perceived limitations in a variety of physical activities (e.g., running, walking, lifting heavy objects) and, if so, the degree of limitation. We list the original questions in Appendix C. Answers to items comprised three possible responses ("yes, a lot", "yes, limited a little", "no, not limited at all"); however, in our analysis we consider the dichotomous indicator for not being limited at all. The left panel of Figure 1 shows the distribution of raw scores. For simplicity, we analyzed complete-case data from 14,525 individuals out of 15,592 respondents, although the model can easily accommodate missing data.

The second example uses data from the TIMSS (Trends in International Mathematics and Science Study) survey, which is an international comparative educational survey dedicated to improving teaching and learning in mathematics and science for students around the world (http://timssandpirls.bc.edu/TIMSS2007/about.html). We used data from the 2007 eighth-grade mathematics assessment for the United States, available at https://timssandpirls.bc.edu/TIMSS2007/idb_ug.html. The dataset comprises 214 items, with 192 of them dichotomous, while the remaining 22 have three-category responses ("incorrect", "partially correct", "correct"). We dichotomized these latter questions, considering partially correct answers as incorrect. Like other large-scale assessments, TIMSS administered only a subset of the items to each participant according to a booklet design, resulting in 28-32 item responses per student. Raw scores for the data are shown in the right panel of Figure 1. In addition, participants in the survey were sampled according to a complex two-stage clustered sampling design that we did not consider in our application. In other contexts the design is typically taken into account using sampling weights for model estimation, as discussed for example by Rutkowski, Gonzalez, Joncas, and von Davier (2010).

Figure 1: Distribution of the raw scores and relative percentages for the real data examples: health data (left panel) and TIMSS data (right panel).

In this work we explore different sampling strategies for Bayesian estimation of the 2PL parametric and semiparametric models. We define a sampling strategy to include the combination of model parameterization, identifiability constraints and sampling algorithms, as summarized in Table 1.
Model constraints             Parametric                          Semiparametric
                              Slope-intercept    IRT              Slope-intercept    IRT
Constrained abilities         MH/conjugate;      MH/conjugate;    --                 --
                              Centered           HMC (Stan)
Constrained item parameters   MH/conjugate       MH/conjugate*    MH/conjugate       MH/conjugate*
Unconstrained                 MH/conjugate;      MH/conjugate     MH/conjugate;      MH/conjugate
                              Centered                            Centered

Table 1: Summary of the 14 sampling strategies considered for the parametric and semiparametric 2PL models. Each of the 14 entries is a different strategy, with "MH/conjugate", "Centered", and "HMC (Stan)" referring to three different sampling algorithms discussed below. The asterisk denotes the sampling strategies that lead directly to posterior samples following model (8).

As anticipated in Section 2, we explore both parameterizations of the 2PL model mentioned in Section 2.1: the IRT and the slope-intercept parameterization. To compare estimates obtained from different parameterizations on a common scale, we post-process posterior samples (using transformations described in Appendix A) to respect the following base parameterization:

logit(π_ij) = λ_i (η_j − β_i),   i = 1, ..., I,
∑_{i=1}^I log(λ_i) = 0,   ∑_{i=1}^I β_i = 0,
η_j ~ G,   j = 1, ..., N,   (8)

where G denotes a general distribution for the latent abilities, either parametric or nonparametric. The model in (8) follows the IRT parameterization with sum-to-zero identifiability constraints, which is typically the one targeted for inference for interpretability reasons.

Our targeted inferential model in (8) can be estimated directly, accounting for identifiability constraints. This can be achieved by introducing into the model formulation a set of auxiliary item parameters, {λ*_i, β*_i} for each i = 1, ..., I, and defining {λ_i, β_i} as

log(λ_i) = log(λ*_i) − (1/I) ∑_{i'=1}^I log(λ*_{i'}),   β_i = β*_i − (1/I) ∑_{i'=1}^I β*_{i'},   i = 1, ..., I.   (9)

In Table 1 we label this model as the constrained item parameters model. Unconstrained priors are then placed on the auxiliary parameters {λ*_i, β*_i}.
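The transformation in (9) amounts to centering the auxiliary item parameters. A minimal Python sketch of this mapping (with illustrative values, not output from a fitted model) verifies that the resulting parameters satisfy the sum-to-zero constraints in (8).

```python
import math

def center_item_parameters(lam_star, beta_star):
    """Apply the sum-to-zero transformation in (9):
    log(lam_i) = log(lam*_i) - mean(log lam*),  beta_i = beta*_i - mean(beta*)."""
    I = len(lam_star)
    mean_log_lam = sum(math.log(l) for l in lam_star) / I
    mean_beta = sum(beta_star) / I
    lam = [math.exp(math.log(l) - mean_log_lam) for l in lam_star]
    beta = [b - mean_beta for b in beta_star]
    return lam, beta

# Arbitrary unconstrained auxiliary values for illustration.
lam, beta = center_item_parameters([0.7, 1.0, 1.8, 1.2], [-1.0, 0.2, 0.5, 1.1])
assert abs(sum(math.log(l) for l in lam)) < 1e-9  # product of lam is one
assert abs(sum(beta)) < 1e-9                      # difficulties sum to zero
```

The same mapping, applied draw by draw to posterior samples, is what turns an unconstrained (parameter-expanded) run into samples from the constrained model (8).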
The same formulation applies under the slope-intercept parameterization, where the sum-to-zero constraints are placed on the pairs {log(λ_i), γ_i} for i = 1, ..., I.

Alternatively, we can consider unconstrained versions of the 2PL model, treating the unconstrained model as a parameter-expanded version of the targeted inferential model in (8), where the redundant parameters are the means of the difficulty and log-discrimination parameters. Hence we conduct MCMC sampling with known model unidentifiability and, before using the results for inference, we transform samples to follow the model in (8), as described in Appendix A.

Finally, for the parametric case only, we consider the traditional version of the 2PL model that assumes the ability parameters η_j for j = 1, ..., N follow a standard normal distribution (constrained abilities model).

All versions of the 2PL model are estimated in the Bayesian framework via MCMC methods. The NIMBLE system provides a suite of different sampling algorithms along with the possibility to code user-defined samplers. For all models we consider NIMBLE's default sampling configuration (
MH/conjugate algorithm). In addition, under the slope-intercept parameterization we propose a custom algorithm to jointly sample item parameters (centered algorithm, discussed below). In the parametric setting only, we also investigate the performance of Hamiltonian Monte Carlo as implemented in Stan (Carpenter et al., 2017) (
HMC (Stan) algorithm).

NIMBLE's MCMC uses an overall Gibbs sampling strategy, cycling over individual parameters, or parameter blocks for parameters with a multivariate prior, in each iteration. By default, specific sampler types are assigned to the parameters or parameter blocks. NIMBLE's default MCMC configuration assigns a conjugate (sometimes called "Gibbs") sampler where possible, meaning that parameters are directly sampled from the corresponding full conditional posterior distribution. For non-conjugate continuous-valued parameters, NIMBLE's default assignment is an adaptive random walk Metropolis-Hastings sampler, while for discrete-valued parameters NIMBLE assigns samplers for binary-valued or categorical parameters. For the parametric versions of the 2PL model, the default NIMBLE assignment corresponds to these conjugate and adaptive random walk Metropolis-Hastings samplers, with the latter also used for most parametric components of the semiparametric 2PL. Specialized samplers are assigned when Bayesian nonparametric priors are considered in the semiparametric 2PL.

NIMBLE offers high flexibility for customizing the algorithms, allowing inclusion of user-programmed custom samplers. In the case of the slope-intercept parameterization, we take advantage of this flexibility to propose a custom sampling strategy (centered sampler). This strategy relies on an adaptive random walk Metropolis-Hastings sampler with a joint proposal for each pair of item parameters {λ_i, γ_i} for i = 1, ..., I. The proposal is made under a reparameterization of the model that centers the abilities to have mean zero. Implementation details are provided in Appendix B.

Finally, for the parametric 2PL model we also consider a Hamiltonian Monte Carlo (HMC) algorithm, as implemented in the Stan software.
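Before turning to HMC, the basic random walk Metropolis-Hastings update that NIMBLE assigns to non-conjugate continuous parameters can be sketched in Python. This toy example is not NIMBLE's implementation: the proposal scale is fixed rather than adaptive, and all parameters other than a single difficulty are held at known values purely for illustration.

```python
import math
import random

def log_post_beta(beta, lam, etas, ys, prior_sd=3.0):
    """Log full conditional of one item's difficulty, up to a constant:
    normal prior plus the Bernoulli log-likelihood of the 2PL model."""
    lp = -0.5 * (beta / prior_sd) ** 2
    for eta, y in zip(etas, ys):
        p = 1.0 / (1.0 + math.exp(-lam * (eta - beta)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        lp += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return lp

def rw_mh(beta0, lam, etas, ys, n_iter=2000, scale=0.3, seed=3):
    """Random walk Metropolis-Hastings with a fixed normal proposal."""
    rng = random.Random(seed)
    beta, lp = beta0, log_post_beta(beta0, lam, etas, ys)
    draws = []
    for _ in range(n_iter):
        prop = beta + rng.gauss(0.0, scale)          # symmetric proposal
        lp_prop = log_post_beta(prop, lam, etas, ys)
        if math.log(rng.random()) < lp_prop - lp:    # MH accept/reject
            beta, lp = prop, lp_prop
        draws.append(beta)
    return draws

# Simulated responses with a known difficulty of 0.5 and unit discrimination.
rng = random.Random(0)
etas = [rng.gauss(0.0, 1.0) for _ in range(500)]
true_beta, lam = 0.5, 1.0
ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-lam * (e - true_beta))) else 0
      for e in etas]
draws = rw_mh(0.0, lam, etas, ys)
post_mean = sum(draws[500:]) / len(draws[500:])  # discard burn-in
```

In NIMBLE, such updates are applied parameter by parameter within the overall Gibbs cycle, with proposal scales tuned adaptively during sampling.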
Stan implements an adaptive HMC sampler (Betancourt, Byrne, Livingstone, & Girolami, 2014) based on the No-U-Turn sampler (NUTS) of Hoffman and Gelman (2014). HMC algorithms are known to produce samples that are much less autocorrelated than those of other samplers, but at greater computational cost given the need to calculate the gradient of the log-posterior. In this work, we limit the comparison to the IRT parameterization with constraints on the abilities distribution, as that is the model provided in the edstan R package (Furr, 2017).

Past research on Bayesian IRT models has warned about the use of either vague priors or highly informative priors when there is little information about the parameters (Natesan, Nandakumar, Minka, & Rubright, 2016; Sheng, 2010). In particular, Natesan et al. (2016) investigated the use of different prior choices in 1PL and 2PL models using MCMC and variational Bayes algorithms and found that the use of vague priors tends to produce biased inference or convergence issues. Similarly, it is well known that highly informative prior distributions on parameters can strongly affect model comparison procedures. In our experiments we elicit the parameters of our priors by matching the prior predictive distributions across all the models we compare. This "predictive matching" approach has been widely used to guide prior elicitation in model comparison settings (Bedrick, Christensen, & Johnson, 1996; Berger & Pericchi, 1996; Ibrahim, 1997).

In the context of 1PL and 2PL IRT models, we aim to match the prior marginal predictive distribution of a response y_ij, which in turn can be achieved by matching the induced prior distribution on the marginal prior probability of a correct response, π_ij = expit{λ_i (η_j − β_i)}. Note that all the priors discussed in this paper are separately exchangeable, which means that this prior marginal will be the same for any values of i and j. In particular, we attempt to match a Beta(0.5, 0.5) distribution, which is both the reference and the Jeffreys prior for the Bernoulli likelihood in the fully exchangeable case (Berger, Bernardo, & Sun, 2009; Bernardo, 1979). A similar approach to prior elicitation in the context of latent space models for networks can be found in Guhaniyogi and Rodriguez (2020) and Sosa and Rodríguez (2021).

Because there are no analytical expressions available for the prior distribution of π_ij, we use simulations to estimate the shape of the prior distribution and obtain an approximate match. This is facilitated by our implementation in NIMBLE. Indeed, one of the advantages of the NIMBLE system is that it provides a seamless way to simulate from the model of interest. Histograms of samples from the resulting induced priors can be seen in Figure 2 for a set of parametric and semiparametric models. Further details are presented in the following subsections.

Figure 2: Histogram of samples from the induced prior on π_ij under each of the considered models. The dashed line indicates the density function of a Beta(0.5, 0.5) distribution. Samples for the semiparametric models use a Gamma(2, 4) prior distribution for the DP concentration parameter α, but similar results are obtained under the other settings presented in Section 5.

In Bayesian IRT modeling, normal distributions are typically chosen as priors for the item parameters. This is true under both parameterizations. In addition, the discrimination parameters, {λ_i}, i = 1, ..., I, are typically assumed positive, so we consider a normal distribution on the log scale. To summarize, the priors on the item parameters are:

log λ_i ~ N(µ_λ, σ²_λ),   β_i ~ N(0, σ²_β),   γ_i ~ N(0, σ²_γ),   i = 1, ..., I.

By default, we center the difficulty parameters β_i (or the reparameterized version γ_i) on 0 for i = 1, ..., I, and we set σ²_β = σ²_γ = 3. For the discrimination parameters, we choose values of µ_λ and σ²_λ guided by the predictive matching described above.

In choosing priors for the abilities, we distinguish between the parametric and semiparametric cases. In the parametric case, excluding the strategies in which the distribution is a standard normal, we assume G ≡ N(µ_η, σ²_η). We specify hyperpriors for the unknown mean and variance, using a normal distribution centered at zero for the mean µ_η and σ²_η ~ InvGamma(2.01, 1.01), as in Paulon, De Iorio, Guglielmi, and Ieva (2018), with hyperparameter values implying an a priori marginal expected value of 1 and an a priori variance equal to 100.

In the semiparametric case, we need to specify the base distribution G_0 of the DP mixture prior along with its hyperparameters. We choose G_0 ≡ N(0, σ²_0) × InvGamma(ν_1, ν_2), where InvGamma(ν_1, ν_2) denotes an inverse-gamma distribution with shape parameter ν_1 and mean ν_2/(ν_1 − 1). To set the values of {σ²_0, ν_1, ν_2}, we first considered the concentration parameter α as fixed and evaluated the induced prior distribution on π_ij for several values of α. Recall that α controls the prior expectation and variance of the number of clusters induced by the DP, which are both of the order α log(N). We discuss prior choice for α in Section 5.3. As in the parametric case, we center the normal distribution for the mixture component means on 0 with σ²_0 = 3 and set ν_1 = 2.
01 and ν = 1 .
01 for the inverse-gamma distribution. Given these settings, we found that choosing α ∈ { . , . , . , , . , } does not have much effect on the marginal prior distribution ofthe π ij s. One may be interested in placing a prior distribution on the concentration parameter α ofthe Dirichlet Process. A typical choice for the DP concentration parameter is a Gamma( a, b ),with shape a > b >
0, due to its computational convenience (Escobar & West,1995). As previously stated, the concentration parameter controls the prior distribution ofthe number of clusters (Escobar & West, 1995; J. S. Liu, 1996). In choosing values a and b ,we considered the implied prior mean and variance of the number of clusters.Let K N denote the number of clusters for a sample of size N . Results from Antoniak(1974) and J. S. Liu (1996) show that the expected value and variance of K N given α is E ( K N | α ) = N (cid:88) i =1 αα + N − i , V ar ( K N | α ) = N (cid:88) i =1 α ( i − α + N − i ) . (10)We exploit these results to choose values a and b that lead to reasonable a priori valuesfor the moments of the number of clusters for each of our applications. For a given N andfor different values of a and b , we evaluated the marginal expectation and variance of thequantities in (10) via Monte Carlo approximation. We sample α r for r = 1 , . . . , R from itsprior and compute (cid:98) E ( K N ) = 1 R R (cid:88) r =1 E [ K N | α r ] , (cid:100) V ar ( K N ) = 1 R R (cid:88) r =1 V ar ( K N | α r ) + (cid:100) V ar ( E [ K N | α ]) , where (cid:100) V ar ( E [ K N | α ]) = R − (cid:80) Rr =1 (cid:104) E [ K N | α r ] − (cid:98) E [ K N ] (cid:105) .7 α ∼ Gamma ( a, b ) E [ α ] V ar ( α ) (cid:98) E ( K , ) (cid:100) V ar ( K , ) (cid:98) E ( K , ) (cid:100) V ar ( K , ) (cid:98) E ( K , ) (cid:100) V ar ( K , ) a = 2 , b = 4 0 . .
12 4 . . . .
12 5 . . a = 1 , b = 3 0 . .
11 3 . . . .
76 3 . . a = 1 , b = 1 1 1 7 . . . .
35 9 . . Table 2: Approximate expectation and variance of the a priori number of clusters, K N , underdifferent choices of the concentration parameter distribution, for N = { , , , , , } ,as in our data.We explored a few prior choices and tabulate approximated moments in Table 5.3, forthe values of N in our datasets. We consider the popular choice of a = 2, b = 4 for thehyperparameters as in Escobar and West (1995) along with values favoring a small numberof clusters ( a = 1 , b = 3) and values leading to a more vague prior ( a = 1 , b = 1). For ourapplications we decided to favor a relatively small number of clusters, choosing a = 2 , b = 4as hyperparameters for the simulated data, and a = 1 , b = 3 for the real-world data. To compare sampling strategies, we measure their performance in terms of MCMC efficiency,which we define for each parameter as the effective sample size (ESS) divided by compu-tation time. The effective sample size gives the equivalent number of independent samplesthat would contain the same statistical information as the actual non-independent samples.Computation time is measured for the actual MCMC run, not accounting for steps to preparefor a run, thereby focusing on the algorithms of interest rather than the software. Compar-ison between HMC and MCMC algorithms raises the question of how to fairly account forcomputation times, given that these two classes of algorithms use different tuning phases.Hence, we decided to consider different timings when using the two algorithms: (i) samplingtime , which accounts only for the time to draw the posterior samples, hence discarding thetime needed for the burn-in and warm-up phases of the two algorithms and (ii) total time comprising also the burn-in and warm-up phases. When computing efficiency based on thesampling time, we can assess pure efficiency of sampling from the posterior. 
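The efficiency metric just defined (ESS divided by computation time) can be sketched numerically. The following is a minimal, illustrative stand-in: the function names, the autocorrelation cutoff, and the summation rule are our own simplifications, not the estimator used by coda's effectiveSize (which fits a spectral model).

```python
import numpy as np

def ess(x, cutoff=0.05):
    """Rough effective sample size: n / (1 + 2 * sum of leading autocorrelations).

    A simplified stand-in for coda's effectiveSize: sample autocorrelations are
    summed until they first drop below `cutoff`.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    acov = np.correlate(xc, xc, mode="full")[n - 1:] / n  # autocovariances, lags 0..n-1
    rho = acov / acov[0]                                  # autocorrelations
    s = 0.0
    for k in range(1, n):
        if rho[k] < cutoff:
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)

def mcmc_efficiency(samples_by_param, seconds):
    """Minimum ESS per second across parameters, the summary metric used here."""
    return min(ess(draws) for draws in samples_by_param.values()) / seconds
```

For a well-mixing chain, ess(x) is close to the number of draws, while for a strongly autocorrelated chain it is much smaller; dividing by either the sampling time or the total time gives the two efficiency variants discussed above.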
Using total time accounts for potentially different times needed for warm-up/burn-in by the different algorithms, but introduces the difficulty of determining the optimal burn-in/warm-up time, which we avoided here in favor of basic defaults.

We would like to use a single metric of MCMC performance to compare different sampling strategies, in particular the minimum MCMC efficiency across all model parameters, which corresponds to the worst-mixing parameter (Nguyen et al., 2020). However, our models and sampling strategies have a variety of specifications for the ability distributions, which makes it difficult to compare performance with respect to the abilities. Therefore, we compare MCMC performance using the minimum MCMC efficiency across the item parameters, after mapping the posterior samples to follow our target inferential model (8). There are several ways to estimate ESS; we use the effectiveSize function in the R coda package (Plummer, Best, Cowles, & Vines, 2006), since this function provides stable estimates of ESS.

For all MCMCs using NIMBLE, we used a total of 50,000 iterations, with a 10% burn-in of 5,000 for all examples. We inspected traceplots to assess convergence of the chains. When running the HMC algorithm via Stan, we used a total of 8,000 iterations, with the first 4,000 iterations as warm-up steps (setting half the iterations for warm-up is the Stan default). Under these settings, the NIMBLE- and Stan-based MCMCs take approximately the same total time for a given dataset. Note that we ran the MCMC chains for many iterations in these experiments so that we could obtain reliable estimates of ESS. To ensure the chains were long enough, we used multiple runs for a portion of the experiments (not shown).
For the two simulation scenarios (unimodal and bimodal) we estimated the 2PL parametric model using the different sampling strategies summarized in Table 1. Figure 3 compares efficiency for all these strategies using the minimum ESS per second, computed with respect to the total and sampling time. Using different time baselines when computing efficiency does not affect the ranking of the algorithms, but it highlights the trade-off for the HMC algorithm between sampling efficiency and computational cost. While HMC is highly efficient in producing samples with low correlation, its warm-up steps are computationally expensive, accounting for a substantial share of the total computation time.

We can carry out similar comparisons using the real-data examples (Figure 5), noting that we excluded the constrained-items scenario given its poor performance on the simulated datasets. Here we see that HMC is best for the TIMSS data but not for the health data, with other strategies being comparable. Which parameterization and constraint scenarios are best depends on the dataset.

Figure 5: Minimum ESS per second (computed using the total time) for semiparametric 2PL models (bottom row) and their parametric versions (top row), in the health data (left column) and TIMSS data (right column) applications. Note that the scales differ between rows.

The efficiency is lower than for the simulated datasets because the real data have many more individuals or items, and therefore more parameters. Related to this, when considering the posteriors for the abilities in the semiparametric model for the health data, we saw evidence for multimodality and some difficulty moving between modes for individuals with high raw scores. The multimodality is likely related to the difficulty the semiparametric model has in identifying the exact magnitude of the ability for such individuals. The use of more informative priors, with careful elicitation of the prior distribution, may be important in such cases.
In this section we compare results for the parametric and semiparametric models in terms of statistical inference, regardless of the sampling strategy used to obtain posterior samples. All posterior samples follow the parameterization in (8), and we use samples from the most efficient sampling strategy for each dataset. We use posterior means as point estimates for the item and ability parameters. For the simulated datasets, we measure how well the models recover the (known) true value of the parameters using absolute error, e.g., |β̂_i − β_i|, and squared error, e.g., (β̂_i − β_i)².

A crucial point of this paper is to make inference on the distribution of latent abilities. An estimate of this distribution is sometimes based on the posterior means of the abilities (Bambirra Gonçalves et al., 2018; Duncan & MacEachern, 2008), and histograms or kernel density plots are reported. Such an estimate ignores uncertainty in the estimates of individual abilities. Instead, one should directly obtain the point estimate of the posterior distribution of the latent abilities p(η | Y) (for any value of η) using the posterior samples. In the parametric case, this reduces to

$$\widehat{p}(\eta \mid Y) = \frac{1}{T} \sum_{t=1}^{T} N(\eta;\, \mu^{(t)}, \sigma^{2(t)}), \qquad (11)$$

with N(·; µ, σ²) indicating the probability density function of a normal distribution with mean µ and variance σ². In the semiparametric case, an estimate of p(η | Y) is the posterior mean of the mixing measure G of the Dirichlet process. This can be obtained using posterior samples, averaging over the DP conditional distribution in (5) computed for each iteration t = 1, …, T,

$$\widehat{p}(\eta \mid Y) = \frac{1}{T} \sum_{t=1}^{T} \left[ \sum_{k=1}^{K^{(t)}} \frac{n_k^{(t)}}{\alpha^{(t)} + N}\, N(\eta;\, \mu_k^{(t)}, \sigma_k^{2(t)}) + \frac{\alpha^{(t)}}{\alpha^{(t)} + N}\, N(\eta;\, \mu_{K^{(t)}+1}, \sigma^2_{K^{(t)}+1}) \right], \qquad (12)$$

with n_k^(t) the number of observations in cluster k at iteration t, K^(t) the total number of clusters at iteration t, and µ_{K^(t)+1} and σ²_{K^(t)+1} sampled from the base distribution (conditional on the data). We graphically compare estimates for the distribution of ability resulting from (11)–(12) with the estimates obtained using the posterior means.

In the semiparametric setting, it is possible to make full inference on p(η | Y); this requires sampling from the posterior of the mixing distribution F. A computational approach to obtain the entire posterior distribution has been presented in Gelfand and Kottas (2002), a version of whose algorithm is implemented in NIMBLE in the function getSamplesDPmeasure. This function provides samples of a truncated version of the infinite mixture to a level L. The value of L varies at each iteration of the MCMC output when α is random, while it is the same at each iteration when α is fixed. In our case, for every MCMC iteration, we can obtain samples of the vector of mixture weights {w_1^(t), …, w_{L^(t)}^(t)} and of the parameters of the mixture components. We can use these samples to make inference on functionals of the distribution, such as the percentile for an individual, 100 × p_j, where p_j = ∫_{−∞}^{η_j} p(η | Y) dη, typically paired with test scores when giving results for educational assessments. For an individual j, j = 1, …, N, we estimate p_j at each MCMC iteration as

$$p_j^{(t)} = \sum_{l=1}^{L^{(t)}} w_l^{(t)}\, F_N(\eta_j^{(t)};\, \mu_l^{(t)}, \sigma_l^{2(t)}), \qquad (13)$$

where F_N denotes the distribution function of the normal distribution. For comparison, we define the parametric counterpart as p_j^(t) = F_N(η_j^(t); µ^(t), σ²^(t)).

We show results for the most efficient sampling strategies under both the parametric and semiparametric models. For the unimodal simulation, samples come from the unconstrained sampling strategy using the centered sampler (SI parameterization), while for the bimodal case we used samples from the unconstrained sampling strategy (IRT parameterization). Using the absolute error and the squared error for each parameter, we report in Table 3 the mean absolute error (MAE) and the mean squared error (MSE) across item and ability parameters.

                             Unimodal simulation              Bimodal simulation
                           Parametric    Semiparametric    Parametric    Semiparametric
                           MAE    MSE    MAE    MSE        MAE    MSE    MAE    MSE
Difficulty parameters      0.0996 0.0185 0.0988 0.0183     0.0895 0.0137 0.0673 0.0083
Discrimination parameters  0.0734 0.0069 0.0731 0.0070     0.0832 0.0105 0.0397 0.0020
Ability parameters         0.4836 0.3719 0.4836 0.3720     0.5944 0.5571 0.5501 0.4775

Table 3: MAE and MSE for the item and ability parameter estimates, under the unimodal and bimodal simulations, using samples from the most efficient sampling strategies.

Figure 6: Unimodal simulated data. Comparison of the posterior mean estimates (with 95% credible intervals) of item parameters (difficulties β and discriminations λ) for parametric and semiparametric 2PL models and true simulated values.
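As a concrete illustration of the density estimates in (11) and (12), the following sketch evaluates both from a list of posterior draws. The draw format (dictionaries with keys "n", "mu", "sd", "alpha") is our own illustrative convention for this sketch, not NIMBLE's output format.

```python
import math
import numpy as np

def norm_pdf(x, mu, sd):
    """N(x; mu, sd^2) density, vectorized over mu and sd."""
    mu, sd = np.asarray(mu, dtype=float), np.asarray(sd, dtype=float)
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def parametric_density(eta, draws):
    """Eq. (11): average of N(eta; mu^(t), sigma^2(t)) over MCMC draws."""
    return float(np.mean([norm_pdf(eta, d["mu"], d["sd"]) for d in draws]))

def dp_density(eta, draws, N):
    """Eq. (12): posterior-mean DP mixture density.

    Each draw d holds cluster sizes d["n"] (length K), component means d["mu"]
    and standard deviations d["sd"] (length K + 1, the last entry drawn from
    the base measure), and the concentration parameter d["alpha"].
    """
    out = 0.0
    for d in draws:
        # weights n_k / (alpha + N) for occupied clusters, alpha / (alpha + N) for a new one
        w = np.append(np.asarray(d["n"], dtype=float), d["alpha"]) / (d["alpha"] + N)
        out += float(np.sum(w * norm_pdf(eta, d["mu"], d["sd"])))
    return out / len(draws)
```

Evaluating either function over a grid of η values gives the density curves plotted in the figures; the parametric version uses one normal component per draw, the semiparametric version a finite mixture per draw.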
This is especially evident when comparing estimates of the discrimination parameters, in particular for larger values.

Figure 7: Bimodal simulated data. Comparison of the posterior mean estimates (with 95% credible intervals) of item parameters (difficulties β and discriminations λ) for parametric and semiparametric 2PL models and true simulated values.

Similar conclusions to those from Table 3 can be drawn about the ability parameters when estimating abilities using the posterior means of the individual abilities (Figure 8). However, results are very different when one looks at estimates of the distribution of ability (Figure 9). The normality assumption of the parametric model leads to unimodal density estimates, inconsistent with the true distribution, whereas the semiparametric model can recover it. The posterior means of individual abilities in Figure 8 are a compromise between the inferred distribution of ability and the information in the data, so with sufficient observations, one can obtain estimates of the individual abilities that are reasonable even with severe model mis-specification. In other words, for mis-specified parametric models, the in-sample predictions for observed individuals can be reasonable, while the out-of-sample predictions based on (11) for new individuals are poor. Note that inspection of the posterior means of individual abilities for the parametric model, relative to the assumed parametric distribution, can be used to assess model mis-specification, clearly indicating mis-specification in the bimodal simulation here.

Figure 8: Histogram and density estimate of individual posterior mean abilities, under the unimodal scenario (left panel) and bimodal scenario (right panel), compared with the true density (dotted line).

Figure 9: Distribution of ability estimated under the unimodal scenario (left panel) and bimodal scenario (right panel), compared with the true density (dotted line). Dashed lines indicate 95% credible intervals for the estimated distributions.

Mis-specification of the distribution of ability has limited effect when estimating individual percentiles. In Figure 10 we compare the posterior mean estimates of individual percentiles with the true percentiles under the simulation distribution, for a subset of 50 individuals. In the unimodal simulation the semiparametric and parametric models perform very similarly in ordering individuals, while there are modest differences under the bimodal one. MSE values for both models are very similar in both simulations.

Figure 10: Estimates of individual percentiles (with 95% credible intervals) for a subset of 50 individuals with varying (true) ability levels under the unimodal scenario (left panel) and the bimodal scenario (right panel). Black dots correspond to true percentiles.

For the real data examples we graphically inspect results from the parametric and semiparametric models, using samples from the most efficient strategy in Section 6. For the health data we used samples from the IRT unconstrained model, while for the TIMSS data we used samples from the SI unconstrained model using the centered sampler.

In Figure 11 we compare item parameter estimates from the two models for the health data application, while Figure 12 shows estimates for the distribution of abilities. Recall that, in this case, we interpret the latent ability as physical ability, with high values characterizing healthy individuals. As with the bimodal simulation, estimates from the parametric model of the distribution of physical ability are quite different from the distribution of individual posterior mean abilities. It is clear that the parametric model is badly mis-specified and would produce bad out-of-sample predictions. In contrast, the semiparametric model seems to nicely characterize the multi-modality in the latent distribution.
We observe in Figure 12 large credible intervals for high values of this distribution, which can be explained by the presence of many individuals with high raw scores (i.e., 9 or 10 out of 10; Figure 1), for whom the model can clearly determine that their physical abilities are high, but with the exact magnitudes being difficult to identify.

The two modeling assumptions yield different estimates of the item parameters (Figure 11), with this difference being larger for extreme values. However, the relative ranking of the items is roughly the same in both cases, with, for example, item 1 (Vigorous activities) being the most difficult item and the one with the lowest value of the discrimination parameter. According to the parametric model, the discrimination parameters for item 3 (Lift/carry) and item 10 (Bathing/dressing) should have similar values, while the semiparametric model separates them.

Figure 11: Health data. Comparison of item parameter estimates from the parametric and semiparametric models. In each panel items are ordered by increasing values of the parameter estimate under the semiparametric model.

Figure 12: Health data. Histogram and density estimate of the posterior means of the latent abilities (left panel) and estimate of the posterior distribution for the latent abilities (right panel). Dashed lines indicate 95% credible intervals for the estimated distributions.

For the TIMSS data we only compare point estimates of the item parameters (Figure 13), due to the large number of parameters. Estimates for the difficulty parameters are very similar between the two models, while estimates for the discrimination parameters differ more, especially for larger values. Figure 14 shows estimates of the distribution of ability. In this case the semiparametric model estimate shows a moderate departure from the normal parametric assumption, with the estimated distribution being right-skewed.

Figure 13: TIMSS data. Comparison of posterior estimates of the item parameters between the parametric and semiparametric models, both using the SI unconstrained centered sampling strategy.

Figure 14: TIMSS data. Histogram and density estimate of the posterior means of the latent abilities (left panel), and estimate of the posterior distribution for the latent abilities (right panel). Dashed lines indicate 95% credible intervals for the estimated distributions.

Figure 15: Estimates of individual percentiles (with 95% credible intervals) for a subset of 50 individuals, for the health data (left panel) and TIMSS data (right panel).

The effect of departures from normality in the distribution of ability is evident when estimating individual percentiles. Figure 15 compares these estimates for both the health and TIMSS data for a sample of 50 individuals, sorted according to the estimated abilities from the semiparametric model. In particular, under the multi-modality in the health data, the parametric assumption yields some differences in the estimated percentile values and in the ordering of individuals. These estimates are associated with larger intervals than in the semiparametric case. In the case of the skewness in the TIMSS data, the interval lengths are similar when comparing the parametric and semiparametric models.
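The percentile computation in (13) can be sketched as follows. The truncated mixture weights and component parameters stand in for the output of NIMBLE's getSamplesDPmeasure; the draw layout used here is again an illustrative convention of this sketch, not the package's actual output structure.

```python
import math
import numpy as np

def norm_cdf(x, mu, sd):
    """Normal distribution function F_N(x; mu, sd^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def percentile_draw(eta_j, w, mu, sd):
    """Eq. (13): p_j^(t) = sum_l w_l F_N(eta_j; mu_l, sd_l^2) for one draw."""
    return sum(wl * norm_cdf(eta_j, ml, sl) for wl, ml, sl in zip(w, mu, sd))

def posterior_percentiles(eta_draws, mix_draws):
    """Posterior mean of 100 * p_j over iterations, plus per-iteration values.

    eta_draws: length-T sequence of ability draws for one individual;
    mix_draws: length-T list of (weights, means, sds) tuples, one per iteration.
    """
    ps = [100.0 * percentile_draw(e, *m) for e, m in zip(eta_draws, mix_draws)]
    return float(np.mean(ps)), ps
```

The per-iteration values ps also yield the credible intervals for the percentiles shown in Figure 15 (e.g., via their 2.5% and 97.5% quantiles).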
In this paper, we consider a semiparametric extension of 1PL and 2PL models, using Dirichlet process mixtures as a nonparametric prior to flexibly characterize the distribution of ability. We provide an overview of these models and study how different sets of constraints can address identifiability issues and lead to different MCMC estimation strategies.

Focusing on the 2PL model, we compare efficiency and inferential results under different sampling strategies based on model parameterization, constraints and sampling algorithms. We find that MCMC performance across strategies can vary in relation to the underlying shape of the latent distribution. Hamiltonian Monte Carlo is often more efficient for the parametric 2PL model, particularly when the underlying distribution of ability is reasonably consistent with the normality assumption. However, non-HMC samplers are generally competitive in terms of performance and are needed for semiparametric inference. When moving to semiparametric modeling, the computational cost can be high for large datasets, given that sampling from the Dirichlet process requires iterating through all individuals. However, we find computational costs to be reasonable in our applications in light of the better inferential results.

In particular, under model mis-specification, inference for item parameters worsens noticeably in the parametric model compared to the semiparametric model. With sufficient data, inference for the abilities of observed individuals can be decent even under mis-specification of the distribution of ability, but inference for the unknown latent distribution (i.e., the predictive distribution for new individuals) as a whole can be quite bad.
Having access to semiparametric models can be broadly useful, as it allows inference on the entire underlying distribution of ability and its functionals, such as percentiles.

In this work we extensively use the NIMBLE software for hierarchical modeling, with code reproducing the results in the paper available at https://github.com/salleuska/IRT_nimble_code, along with a small tutorial using a simulated example. Although there are other software solutions enabling Bayesian nonparametric modeling, these are often limited in the type of algorithms or in the class of models available. NIMBLE offers a high degree of flexibility in that the models considered in this paper could be easily embedded in more complicated ones. Sampler assignment can be highly customized by the user, including user-defined sampling algorithms. This customizability makes NIMBLE a powerful platform for comparing different sampling strategies. At the same time, NIMBLE allows easy sharing of the most successful strategies as black-box implementations for end users.
A. Identifiability
The 2PL model is not identifiable based on the likelihood. Here we demonstrate the non-identifiability for the two parameterizations we consider, showing how different linear transformations lead to the same probabilities. Note that these transformations are defined for every parameter associated with each item i = 1, …, I and individual j = 1, …, N.

Under the IRT parameterization:

1. η'_j = η_j / s, λ'_i = sλ_i and β'_i = β_i / s:

$$\lambda'_i(\eta'_j - \beta'_i) = s\lambda_i(\eta_j/s - \beta_i/s) = \lambda_i \eta_j - \lambda_i \beta_i = \lambda_i(\eta_j - \beta_i);$$

2. η'_j = η_j + c and β'_i = β_i + c:

$$\lambda_i(\eta'_j - \beta'_i) = \lambda_i(\eta_j + c - (\beta_i + c)) = \lambda_i(\eta_j - \beta_i).$$

Under the slope-intercept parameterization:

1. η'_j = η_j / s and λ'_i = sλ_i:

$$\lambda'_i \eta'_j + \gamma_i = s\lambda_i \eta_j / s + \gamma_i = \lambda_i \eta_j + \gamma_i;$$

2. (λ_i η_j)' = λ_i η_j + c and γ'_i = γ_i − c, or equivalently η'_j = η_j + c and γ'_i = γ_i − λ_i c:

$$\lambda_i \eta'_j + \gamma'_i = \lambda_i(\eta_j + c) + \gamma_i - \lambda_i c = \lambda_i \eta_j + \gamma_i.$$

Post-processing to satisfy identifiability constraints
This section reports the transformations we apply to item and ability parameters in order to satisfy the identifiability constraints in our base parameterization (8). These transformations are applied to each posterior sample.

Under the IRT parameterization, the set of transformations for each posterior sample of {λ_i, β_i, η_j} for i = 1, …, I, j = 1, …, N takes the form

$$\lambda^*_i = s\lambda_i, \qquad \beta^*_i = \frac{\beta_i - b}{s}, \qquad \eta^*_j = \frac{\eta_j - b}{s},$$

subject to $\prod_{i=1}^I \lambda^*_i = 1$ and $\sum_{i=1}^I \beta^*_i = 0$. By solving the system of equations given by the transformations and the set of identifiability constraints, we obtain

$$s = \exp\left\{ -\frac{1}{I} \sum_{i=1}^I \log(\lambda_i) \right\}, \qquad b = \frac{\sum_{i=1}^I \beta_i}{I}. \qquad (A1)$$

Under the slope-intercept parameterization, the set of transformations for each posterior sample of {λ_i, γ_i, η_j} for i = 1, …, I, j = 1, …, N takes the form

$$\tilde{\lambda}_i = s\lambda_i, \qquad \tilde{\gamma}_i = \gamma_i - \lambda_i c, \qquad \tilde{\eta}_j = \frac{\eta_j + c}{s},$$

subject to $\prod_{i=1}^I \tilde{\lambda}_i = 1$ and $\sum_{i=1}^I \tilde{\gamma}_i = 0$. Similarly, by solving the system of equations given by the transformations and the set of identifiability constraints, we obtain

$$s = \exp\left\{ -\frac{1}{I} \sum_{i=1}^I \log(\lambda_i) \right\}, \qquad c = \frac{\sum_{i=1}^I \gamma_i}{\sum_{i=1}^I \lambda_i}. \qquad (A2)$$

Finally, to get from the slope-intercept parameterization to the IRT parameterization, we define $\tilde{\beta}_i := -\tilde{\gamma}_i / \tilde{\lambda}_i$ and then calculate $\beta^*_i = \tilde{\beta}_i - \sum_i \tilde{\beta}_i / I$.

Rescaling the DP density
We can obtain posterior samples from the mixing distribution F via NIMBLE's getSamplesDPmeasure function, allowing us to estimate the density of the latent ability distribution. However, when comparing these estimated densities between models, for some of the sampling strategies we need to transform the estimated density to account for the transformations of the abilities from the scale on which sampling is done to the scale in (8).

As an example, consider the IRT parameterization without constraints. From the MCMC output we can obtain p(η̃) evaluated at different values of η̃, but we want p(η̃*) with η̃* = (η̃ − b)/s. To do so we need the Jacobian of the transformation, which is simply s. Then we obtain

$$p(\tilde{\eta}^*) = p_{\tilde{\eta}}(s\tilde{\eta}^* + b) \left| \frac{\partial (s\tilde{\eta}^* + b)}{\partial \tilde{\eta}^*} \right| = p_{\tilde{\eta}}(s\tilde{\eta}^* + b)\, s. \qquad (A3)$$

B. Centered sampler
We consider a custom sampler for the 2PL model under the slope-intercept parameterization. Intuition for this sampling strategy comes from the resemblance to a linear model. In order to sample {λ_i, γ_i} efficiently, we propose centering the implied covariate, η_j, to have mean zero. This is analogous to centering covariates in a linear model, but in this case the "covariate" values are not fixed, so the centering needs to be done in each iteration. For a given item i, i = 1, …, I, we can rewrite

$$\lambda_i \eta_j + \gamma_i = \lambda_i(\eta_j - \bar{\eta}) + \lambda_i \bar{\eta} + \gamma_i = \lambda_i \eta^c_j + \gamma^c_i,$$

where η^c_j = η_j − η̄ is centered and γ^c_i = λ_i η̄ + γ_i. The idea is to propose a new value λ*_i in this new parameterization at each MCMC iteration, using a random walk on the log scale. Translating to the original parameterization, we have

$$\lambda^*_i \eta^c_j + \gamma^c_i = \lambda^*_i(\eta_j - \bar{\eta}) + \lambda_i \bar{\eta} + \gamma_i = \lambda^*_i \eta_j - \lambda^*_i \bar{\eta} + \lambda_i \bar{\eta} + \gamma_i.$$

This means that we are proposing γ*_i = γ_i + η̄(λ_i − λ*_i). Thus we have a joint proposal (λ*_i, γ*_i) that accounts for the usual correlation in a regression between intercept and slope. Apart from accounting for sampling λ_i on the log scale, the proposal is symmetric, so no Hastings correction is needed. The original sampler for γ_i can stay the same, because in the reparameterization with γ^c_i above, shifting γ_i by a certain amount is equivalent to shifting γ^c_i.

C. Health data questions
The following items are about activities you might do during a typical day. Does your health now limit you in these activities? If so, how much?

1. Vigorous activities: Vigorous activities, such as running, lifting heavy objects, participating in strenuous sports.
2. Moderate activities: Moderate activities, such as moving a table, pushing a vacuum cleaner, bowling or playing golf.
3. Lift/Carry: Lifting or carrying groceries.
4. Several stairs: Climbing several flights of stairs.
5. One flight stairs: Climbing one flight of stairs.
6. Bend/Kneel/Stoop: Bending, kneeling, or stooping.
7. Walk more mile: Walking more than a mile.
8. Walk several blocks: Walking several blocks.
9. Walk one block: Walking one block.
10. Bathing/Dressing: Bathing or dressing yourself.
References
Aldous, D. J. (1985). Exchangeability and related topics. In ´Ecole d’´et´e de probabilit´es desaint-flour xiii—1983 (pp. 1–198). Springer.Antoniak, C. E. (1974, 11). Mixtures of Dirichlet Processes with applications to Bayesiannonparametric problems.
Annals of Statistics , (6), 1152–1174. Retrieved from https://doi.org/10.1214/aos/1176342871 doi: 10.1214/aos/1176342871Azevedo, C. L., Bolfarine, H., & Andrade, D. F. (2011). Bayesian inference for a skew-normal irt model under the centred parameterization. Computational Statistics &Data Analysis , (1), 353–365.Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavianjournal of statistics , 171–178.Bafumi, J., Gelman, A., Park, D. K., & Kaplan, N. (2005). Practical issues in implementingand understanding Bayesian ideal point estimation.
Political Analysis , (2), 171–187.Bambirra Gon¸calves, F., da Costa Campos Dias, B., & Machado Soares, T. (2018). Bayesianitem response model: a generalized approach for the abilities’ distribution using mix-tures. Journal of Statistical Computation and Simulation , (5), 967–981.Bedrick, E. J., Christensen, R., & Johnson, W. (1996). A new perspective on priors forgeneralized linear models. Journal of the American Statistical Association , (436),1450–1460.Berger, J. O., Bernardo, J. M., & Sun, D. (2009). The formal definition of reference priors. The Annals of Statistics , (2), 905–938. Retrieved from Berger, J. O., & Pericchi, L. R. (1996). The intrinsic Bayes factor for model selection andprediction.
Journal of the American Statistical Association , (433), 109–122.Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference. Journal ofthe Royal Statistical Society: Series B (Methodological) , (2), 113–128.Betancourt, M. J., Byrne, S., Livingstone, S., & Girolami, M. (2014). The geometricfoundations of Hamiltonian Monte Carlo.
Blackwell, D., & MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. The Annals of Statistics, (2), 353–355.
Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2001). A mixture item response model for multiple-choice data. Journal of Educational and Behavioral Statistics, (4), 381–409.
Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., ... Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, (1), 1–32. doi: 10.18637/jss.v076.i01
Clinton, J., Jackman, S., & Rivers, D. (2004). The statistical analysis of roll call data. American Political Science Review, 355–370.
de Valpine, P., Paciorek, C. J., Turek, D., Michaud, N., Anderson-Bergman, C., Obermeyer, F., ... Paganin, S. (2020). NIMBLE: MCMC, particle filtering, and programmable hierarchical modeling. Retrieved from https://cran.r-project.org/package=nimble (R package version 0.9.1). doi: 10.5281/zenodo.1211190
de Valpine, P., Turek, D., Paciorek, C. J., Anderson-Bergman, C., Lang, D. T., & Bodik, R. (2017). Programming with models: Writing statistical algorithms for general model structures with NIMBLE. Journal of Computational and Graphical Statistics, (2), 403–413. doi: 10.1080/10618600.2016.1172487
Duncan, K. A., & MacEachern, S. N. (2008). Nonparametric Bayesian modelling for item response. Statistical Modelling, (1), 41–66.
Escobar, M. D., & West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, (430), 577–588.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, (2), 209–230. doi: 10.1214/aos/1176342360
Finch, H., & Edwards, J. M. (2016). Rasch model parameter estimation in the presence of a nonnormal latent trait using a nonparametric Bayesian approach. Educational and Psychological Measurement, (4), 662–684. doi: 10.1177/0013164415608418
Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. Springer Science & Business Media.
Furr, D. C. (2017). edstan: Stan models for item response theory [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=edstan (R package version 1.0.6)
Gelfand, A. E., & Kottas, A. (2002). A computational approach for full nonparametric Bayesian inference under Dirichlet process mixture models.
Journal of Computational and Graphical Statistics, (2), 289–305. doi: 10.1198/106186002760180518
Gelman, A. (2004). Parameterization and Bayesian modeling. Journal of the American Statistical Association, (466), 537–545.
Geweke, J. F., & Singleton, K. J. (1981). Maximum likelihood "confirmatory" factor analysis of economic time series. International Economic Review, 37–54.
Guhaniyogi, R., & Rodriguez, A. (2020). Joint modeling of longitudinal relational data and exogenous variables. Bayesian Analysis, (2), 477–503.
Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory and health outcomes measurement in the 21st century. Medical Care, (9 Suppl), II28.
Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, (1), 1593–1623.
Ibrahim, J. G. (1997). On properties of predictive priors in linear models. The American Statistician, (4), 333–337. doi: 10.1080/00031305.1997.10474408
Jara, A., Hanson, T. E., Quintana, F. A., Müller, P., & Rosner, G. L. (2011). DPpackage: Bayesian semi- and nonparametric modeling in R. Journal of Statistical Software, (5), 1.
Johnson, M. S. (2007). Modeling dichotomous item responses with free-knot splines. Computational Statistics & Data Analysis, (9), 4178–4192.
Joint Health Surveys Unit of Social and Community Planning Research and University College London. (2017). Health Survey for England, 1996 [data collection]. http://doi.org/10.5255/UKDA-SN-3886-2
Karabatsos, G. (2017). Bayesian nonparametric response models. In W. J. Van der Linden (Ed.), Handbook of item response theory, volume one: Models (pp. 323–336). CRC Press.
Kirisci, L., Hsu, T.-c., & Yu, L. (2001). Robustness of item parameter estimation programs to assumptions of unidimensionality and normality. Applied Psychological Measurement, (2), 146–162. doi: 10.1177/01466210122031975
Laird, N. M. (1978). Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association, 805–807.
Li, Y., Müller, P., & Lin, X. (2011). Center-adjusted inference for a nonparametric Bayesian random effect distribution. Statistica Sinica, (3), 1201–1223.
Liu, C., Rubin, D. B., & Wu, Y. N. (1998). Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika, (4), 755–770.
Liu, J. S. (1996). Nonparametric hierarchical Bayes via sequential imputations. The Annals of Statistics, 911–930.
Liu, J. S., & Wu, Y. (1999). Parameter expansion for data augmentation. Journal of the American Statistical Association, (448), 1264–1274.
Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics, 351–357.
McHorney, C. A., Haley, S. M., & Ware Jr, J. E. (1997). Evaluation of the MOS SF-36 physical functioning scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. Journal of Clinical Epidemiology, (4), 451–461.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, (1), 156.
Mislevy, R. J. (1984). Estimating latent distributions. Psychometrika, (3), 359–381. doi: 10.1007/BF02306026
Miyazaki, K., & Hoshino, T. (2009). A Bayesian semiparametric item response model with Dirichlet process priors. Psychometrika, (3), 375–393.
Natesan, P., Nandakumar, R., Minka, T., & Rubright, J. D. (2016). Bayesian prior choice in IRT estimation using MCMC and variational Bayes. Frontiers in Psychology, 1422.
Nguyen, D., de Valpine, P., Atchade, Y., Turek, D., Michaud, N., & Paciorek, C. J. (2020). Nested adaptation of MCMC algorithms. Bayesian Analysis, (4), 1323–1343. doi: 10.1214/19-BA1190
Paulon, G., De Iorio, M., Guglielmi, A., & Ieva, F. (2018). Joint modeling of recurrent events and survival: a Bayesian non-parametric approach. Biostatistics, (1), 1–14. doi: 10.1093/biostatistics/kxy026
Pitman, J. (1996). Some developments of the Blackwell-MacQueen urn scheme. In T. S. Ferguson, L. S. Shapley, & J. B. MacQueen (Eds.), Statistics, probability and game theory (Vol. 30, pp. 245–267). Hayward, CA: Institute of Mathematical Statistics. doi: 10.1214/lnms/1215453576
Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC.
R News, (1), 7–11. Retrieved from https://journal.r-project.org/archive/
Qin, L. (1998). Nonparametric Bayesian models for item response data (Unpublished doctoral dissertation). The Ohio State University.
Rasch, G. (1990). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
Rutkowski, L., Gonzalez, E., Joncas, M., & von Davier, M. (2010). International large-scale assessment data: Issues in secondary analysis and reporting. Educational Researcher, (2), 142–151.
Samejima, F. (1997). Departure from normal assumptions: A promise for future psychometrics with substantive mathematical modeling. Psychometrika, (4), 471–493.
San Martín, E., Jara, A., Rolin, J.-M., & Mouchart, M. (2011). On the Bayesian nonparametric generalization of IRT-type models. Psychometrika, (3), 385–409. doi: 10.1007/s11336-011-9213-9
Schmitt, J. E., Mehta, P. D., Aggen, S. H., Kubarych, T. S., & Neale, M. C. (2006). Semi-nonparametric methods for detecting latent non-normality: A fusion of latent trait and ordered latent class modeling. Multivariate Behavioral Research, (4), 427–443.
Seong, T. (1990). Sensitivity of marginal maximum likelihood estimation of item and ability parameters to the characteristics of the prior ability distributions. Applied Psychological Measurement, (3), 299–311.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, (2), 639–650.
Sheng, Y. (2010). A sensitivity analysis of Gibbs sampling for 3PNO IRT models: Effects of prior specifications on parameter estimates. Behaviormetrika, (2), 87–110.
Sosa, J., & Rodríguez, A. (2021). A latent space model for cognitive social structures data. Social Networks, 85–97.
Stan Development Team. (2018). Stan modeling language users guide and reference manual, version 2.18.0. Retrieved from http://mc-stan.org
Ware, J. E. (2003). SF-36 health survey: Manual and interpretation guide.
Woods, C. M. (2007). Empirical histograms in item response theory with ordinal data. Educational and Psychological Measurement, (1), 73–87.
Woods, C. M., & Thissen, D. (2006). Item response theory with estimation of the latent population distribution using spline-based densities. Psychometrika, (2), 281.
Yang, M., & Dunson, D. B. (2010). Bayesian semiparametric structural equation models with latent variables. Psychometrika, (4), 675–693.
Yang, M., Dunson, D. B., & Baird, D. (2010). Semiparametric Bayes hierarchical models with mean and variance constraints. Computational Statistics & Data Analysis, 54.