A Multidimensional Hierarchical Framework for Modeling Speed and Ability in Computer-based Multidimensional Tests

Abstract

In psychological and educational computer-based multidimensional tests, latent speed, defined as the rate of the amount of labor performed on the items with respect to time, may also be multidimensional. To capture this multidimensionality, this study first proposed a multidimensional log-normal response time (RT) model. Further, to simultaneously take into account the response accuracy (RA) and RTs in multidimensional tests, a multidimensional hierarchical modeling framework was proposed. The framework is an extension of the framework of van der Linden (2007; doi:10.1007/s11336-006-1478-z) and allows a "plug-and-play approach" with alternative choices of multidimensional models for RA and RT. The model parameters within the framework were estimated using the Bayesian Markov chain Monte Carlo method. The 2012 Program for International Student Assessment computer-based mathematics data were analyzed first to illustrate the implications and applications of the proposed models. The results indicated that it is appropriate to simultaneously consider the multidimensionality of latent speed and latent ability for multidimensional tests. A brief simulation study was conducted to evaluate the parameter recovery of the proposed model and the consequences of ignoring the multidimensionality of latent speed.


Peida Zhan (Zhejiang Normal University); Hong Jiao (University of Maryland, College Park); Wen-Chung Wang (The Education University of Hong Kong); Kaiwen Man (University of Maryland, College Park)

Note:

This is an arXiv preprint and may not be the final version [manuscript submitted for publication and currently under review]. For reference: Zhan, P., Jiao, H., Wang, W.-C., & Man, K. (2018). A multidimensional hierarchical framework for modeling speed and ability in computer-based multidimensional tests. arXiv preprint arXiv:1807.04003. URL https://arxiv.org/abs/1807.04003

Corresponding author: Peida Zhan, Department of Psychology, College of Teacher Education, Zhejiang Normal University, No. 688 Yingbin Road, Jinhua, Zhejiang, 321004, P. R. China. Email: [email protected]


Key words: response times; multidimensional latent speed; item response theory; hierarchical modeling framework; computer-based tests; PISA

With the popularity of computer-based tests, the collection of item response times (RTs) has become a routine activity in large- and small-scale tests. For example, the Program for International Student Assessment (PISA) has used computer-based tests and recorded RT data since 2012. In addition to response accuracy (RA), RTs provide an additional source of information about the working speed of respondents and the time cost of items (Marianti, Fox, Avetisyan, Veldkamp, & Tijmstra, 2014; van der Linden, 2006, 2007, 2009; Zhan, Jiao, & Liao, 2018; Zhan, Liao, & Bian, 2018). Before making inferences from RTs, it is necessary to specify an appropriate statistical model for them. Over the past few decades, various RT models have been presented based on cognitive/psychological theories and experimental research (for reviews, see Lee & Chen, 2011; Schnipke & Scrams, 2002; van der Linden, 2009).

Conventionally, the speed-accuracy trade-off (Luce, 1986) was the main motivation for RT modeling, as in Thissen (1983), Wang and Hanson (2005), and Ferrando and Lorenzo-Seva (2007). However, the trade-off reflects only a within-person relationship between speed and accuracy (van der Linden, 2009), which is hard to evaluate from single time-point (or cross-sectional) assessments (see Curran & Bauer, 2010; Molenaar, P. C., 2004). Typically, for a fixed set of items, once a respondent's working speed is fixed, his/her accuracy remains constant; therefore, speed and accuracy are suggested to be modeled separately, with their relationship modeled at a higher level (van der Linden, 2006, 2007, 2009). To this end, van der Linden (2007) proposed a Bayesian hierarchical modeling framework, which is one of the most flexible tools for explaining the relationship between response speed and accuracy. When comparing various RT models, Suh (2010) claimed that the Bayesian hierarchical modeling framework presents the most reasonable outcomes in both real and simulated data.
In addition, some subsequent studies followed the idea of the Bayesian hierarchical modeling framework but treated the item effects as fixed (e.g., Molenaar, D., Tuerlinckx, & van der Maas, 2015; Wang, Chang, & Douglas, 2013; Wang & Xu, 2015). For simplicity, these approaches are collectively referred to as hierarchical modeling. Hierarchical modeling is gaining more and more recognition, and it is sufficiently general to integrate available measurement models for response accuracy and RTs (Fox & Marianti, 2016; Klein Entink, Fox, & van der Linden, 2009; Klein Entink, van der Linden, & Fox, 2009; Wang et al., 2013; Wang & Xu, 2015; Zhan, Jiao et al., 2018; Zhan, Liao et al., 2018).

Currently, however, most research based on hierarchical modeling focuses only on unidimensional tests (e.g., van der Linden, 2007; Klein Entink, Fox, & van der Linden, 2009; Klein Entink, van der Linden, et al., 2009; Meng, Tao, & Chang, 2015; Wang & Xu, 2015; Molenaar, D., Oberski, Vermunt, & De Boeck, 2016; Fox, Klein Entink, & Timmers, 2014). Only a unidimensional latent ability and a unidimensional latent speed are taken into account, by using unidimensional item response theory (UIRT) models and unidimensional RT (URT) models, respectively, as shown in Figure 1(a).

In reality, respondents are likely to bring multiple latent abilities to bear when responding to items; meanwhile, items are likely to require various latent abilities to determine a correct response, especially in multidimensional tests (e.g., Reckase, 2009; Whitely, 1980; Tatsuoka, 1983). In addition, with the increasing demand for more detailed and refined feedback to test-takers, multidimensional tests have received much attention from practitioners and researchers.

In psychological and educational measurement, an appropriate notion of latent speed on test items is that of speed of labor (van der Linden, 2009).
Therefore, latent speed can be defined as the rate of the amount of labor performed on the items with respect to time (van der Linden, 2011). In fact, the definition of latent speed should be discussed within a certain dimension of latent ability, because using the required latent ability is the basis of effective labor. Because latent abilities are multidimensional in nature and different items may require different kinds of abilities, latent speed may also be a multidimensional concept, each dimension of which corresponds to a specific type of labor or latent ability. For example, the latent speed corresponding to the latent ability of decoding (or algebra problem solving) may differ from the latent speed corresponding to the latent ability of encoding (or geometry problem solving). As another example, when respondents, especially non-native English speakers, take the GRE® Subject Test (e.g., Mathematics), at least two abilities are needed: one for understanding the questions (e.g., English reading ability) and one for solving them (e.g., the subject ability). Correspondingly, two latent speeds are at work: one reflects the working speed of reading, and the other reflects the working speed of problem-solving or applying the subject ability.

Currently, although multidimensional models for response accuracy have been well developed (see Reckase, 2009), to our knowledge, there is a lack of multidimensional models for RTs that take account of the potential multidimensionality of latent speed. Recently, based on hierarchical modeling, a few studies have attempted to use multidimensional models for response accuracy to capture the multidimensional structure of latent ability when multidimensional tests are involved; but only a URT model is used to capture the potentially multidimensional latent speed, as shown in Figure 1(b).
For instance, Man, Jiao, Zhan, and Huang (2017) employed a compensatory multidimensional IRT model for response accuracy and the unidimensional log-normal RT model (van der Linden, 2006) for RTs. In addition, Zhan, Jiao et al. (2018) proposed a joint cognitive diagnosis model to simultaneously analyze RA and RTs in cognitive diagnosis. This approach was further extended to account for paired local item dependence (Zhan, Liao et al., 2018). However, in these studies, because of the lack of multidimensional RT (MRT) models, only the relationship between multiple latent abilities and one single latent speed can be evaluated. Logically, as mentioned above, different latent abilities may be associated with different latent speeds. Thus, assuming the latent speeds corresponding to different latent abilities to be identical, as done by Man et al. (2017) and Zhan, Jiao et al. (2018), may be too restrictive to describe complicated data and thus may lead to biased conclusions. It is desirable to relax this limitation to allow each latent ability to be associated with its own latent speed. As URT models may be inappropriate in practice, it is critical to develop a new RT model that considers the multidimensionality of latent speed.

To meet this demand, we first extend the most popular unidimensional log-normal RT model (ULRTM; van der Linden, 2006) to be multidimensional and call it the multidimensional log-normal RT model (MLRTM). Second, a multidimensional hierarchical modeling framework for modeling multidimensional latent speed and multidimensional latent ability is proposed, as shown in Figure 1(c). The rest of the paper starts with a brief review of the ULRTM, followed by the presentation of the proposed MLRTM. Next, the proposed modeling framework and a corresponding joint multidimensional model are presented. Model parameter estimation with the Bayesian Markov chain Monte Carlo (MCMC) method is demonstrated. Then, the PISA 2012 computer-based mathematics data are analyzed to demonstrate the advantages of the proposed joint model. After this, a brief simulation study is conducted to evaluate the parameter recovery of the proposed model. Finally, conclusions and discussions are presented.

Figure 1.

Three relationships between ability and speed in the hierarchical modeling framework.

Note. U = unidimensional, M = multidimensional, A = ability, S = speed.

Proposed Measurement Model for Response Time

Overview of the Unidimensional Log-normal RT Model

Let $T_{ni}$ be the observed RT of person $n$ ($n = 1, \ldots, N$) on item $i$ ($i = 1, \ldots, I$). In the ULRTM, the logarithm is used to transform the positively skewed distribution of RTs to a symmetric shape, and the log RT is assumed to be governed by item $i$'s time-intensity parameter $\xi_i$ and person $n$'s latent speed parameter $\tau_n$ as follows:

$$\log T_{ni} = \xi_i - \tau_n + \varepsilon_{ni}, \quad \varepsilon_{ni} \sim N(0, \omega_i^{-2}), \tag{1}$$

or equivalently,

$$\log T_{ni} \sim N(\xi_i - \tau_n, \omega_i^{-2}), \tag{2}$$

where $\xi_i$ represents the time needed to complete item $i$; $\tau_n$ represents the working speed of person $n$ on the test and is assumed to be normally distributed with mean zero and variance $\sigma_\tau^2$; $\varepsilon_{ni}$ is the normally distributed residual error term, with mean zero and variance $\omega_i^{-2}$; and $\omega_i$ is the reciprocal of the standard deviation of the error term, which can be treated as a time-kurtosis parameter.

A basic assumption of the ULRTM is that the $\log T_{ni}$ are conditionally independent given the unidimensional $\tau_n$, which is known as local RT independence (Zhan, Liao et al., 2018). In other words, local RT independence is obtained when the relationship among items is fully characterized by the URT model. However, if a single latent speed is not sufficient to account for the relationship among RTs, an MRT model is needed.

The Multidimensional Log-normal RT Model
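As a point of reference for the multidimensional extension, the unidimensional baseline can be simulated directly. The following Python sketch draws log RTs with mean $\xi_i - \tau_n$ and residual standard deviation $1/\omega_i$, as in the ULRTM. The function name `simulate_ulrtm`, the parameter values, and the use of NumPy are illustrative assumptions and are not part of the original study.

```python
import numpy as np

def simulate_ulrtm(xi, omega, tau, rng):
    """Draw log RTs with log T_ni ~ N(xi_i - tau_n, omega_i**-2)."""
    xi = np.asarray(xi, dtype=float)        # item time-intensity, shape (I,)
    omega = np.asarray(omega, dtype=float)  # reciprocal of residual SD, shape (I,)
    tau = np.asarray(tau, dtype=float)      # person latent speed, shape (N,)
    mean = xi[None, :] - tau[:, None]       # expected log RT, shape (N, I)
    return rng.normal(loc=mean, scale=1.0 / omega[None, :])

rng = np.random.default_rng(0)
xi = np.array([4.0, 4.5, 3.8])          # hypothetical time-intensities
omega = np.array([2.0, 2.0, 2.0])       # hypothetical time-kurtosis values
tau = rng.normal(0.0, 0.5, size=2000)   # hypothetical speeds with mean zero
log_t = simulate_ulrtm(xi, omega, tau, rng)
rt = np.exp(log_t)                      # RTs on the original, skewed scale
```

Exponentiating the simulated log RTs returns strictly positive, positively skewed response times, which is the shape that motivates the log transformation.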

There are two types of multidimensional tests: between-item and within-item (Adams, Wilson, & Wang, 1997). In a between-item multidimensional test, each item measures a single dimension but different items may measure different dimensions, so the multidimensionality occurs between items. In a within-item multidimensional test, an item may measure more than one dimension simultaneously. As between-item multidimensionality can be seen as a special case of within-item multidimensionality, we focus on the latter throughout this paper for generality.

For an item that measures multiple dimensions simultaneously, the ULRTM can be extended to accommodate multiple latent speeds by replacing the scalar latent speed parameter, $\tau_n$, with a vector of speed parameters, $\boldsymbol{\tau}_n$. The resulting MLRTM is defined as:

$$\log T_{ni} = \xi_i - \mathbf{q}_i'\boldsymbol{\tau}_n + \varepsilon_{ni} = \xi_i - \sum_{k=1}^{K} q_{ik}\tau_{nk} + \varepsilon_{ni}, \quad \varepsilon_{ni} \sim N(0, \omega_i^{-2}), \tag{3}$$

or equivalently,

$$\log T_{ni} \sim N\Big(\xi_i - \sum_{k=1}^{K} q_{ik}\tau_{nk},\ \omega_i^{-2}\Big), \tag{4}$$

where $\tau_{nk}$ represents the latent speed of person $n$ in dimension $k$, and $\boldsymbol{\tau}_n = (\tau_{n1}, \ldots, \tau_{nk}, \ldots, \tau_{nK})'$ is the multidimensional latent speed vector following a multivariate normal distribution, $\boldsymbol{\tau}_n \sim N(\boldsymbol{\mu}_\tau, \boldsymbol{\Sigma}_\tau)$, with mean vector $\boldsymbol{\mu}_\tau$ and variance-covariance matrix $\boldsymbol{\Sigma}_\tau$; $\boldsymbol{\mu}_\tau$ is set to the zero vector for identification; $q_{ik}$ is an element of the confirmatory Q-matrix (Tatsuoka, 1983) indicating whether dimension $k$ is required to answer item $i$ correctly, with $q_{ik} = 1$ if the dimension is required and 0 otherwise. Other parameters are the same as those in the ULRTM.

In the MLRTM, it is assumed that the $\log T_{ni}$ are conditionally independent given $\boldsymbol{\tau}_n$. In addition, for a given item $i$, if $\sum_{k=1}^{K} q_{ik}\tau_{nk}$ is set at a constant, $m$, all $\boldsymbol{\tau}$-vectors that satisfy $\sum_{k=1}^{K} q_{ik}\tau_{nk} = m$ yield the same expected log RT.
This feature suggests that a low speed on one dimension can be compensated by a high speed on another dimension, so the proposed MLRTM is a compensatory MLRTM, which is in line with the compensatory assumption of latent abilities in compensatory multidimensional item response theory (MIRT) models. Logically, within each dimension, the latent speed and the latent ability should be matched with each other (e.g., Zhan, Liao et al., 2018). Thus, when the latent abilities are compensatory, it is reasonable to assume that the corresponding latent speeds are compensatory as well. On the other hand, if the latent abilities are non-compensatory, the corresponding latent speeds should be non-compensatory as well. However, the development of a non-compensatory MLRTM is beyond the scope of this study and can be studied in the future.

Multidimensional Hierarchical Modeling Framework
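Before the RA and RT models are combined, the compensatory property of the MLRTM can be checked numerically. The sketch below computes the expected log RT, $\xi_i - \sum_k q_{ik}\tau_{nk}$, for two hypothetical speed vectors whose required components sum to the same constant on a within-item multidimensional item. The Q-matrix, the parameter values, and the function names are assumptions made for illustration only.

```python
import numpy as np

def mlrtm_mean(xi, Q, tau):
    """Expected log RT: xi_i minus the Q-weighted sum of latent speeds."""
    xi = np.asarray(xi, dtype=float)    # time-intensity, shape (I,)
    Q = np.asarray(Q, dtype=float)      # Q-matrix, shape (I, K)
    tau = np.asarray(tau, dtype=float)  # latent speeds, shape (N, K)
    return xi[None, :] - tau @ Q.T      # shape (N, I)

def simulate_mlrtm(xi, omega, Q, tau, rng):
    """Draw log RTs around the MLRTM mean with residual SD 1 / omega_i."""
    sd = 1.0 / np.asarray(omega, dtype=float)[None, :]
    return rng.normal(loc=mlrtm_mean(xi, Q, tau), scale=sd)

# Hypothetical test: item 0 requires both dimensions, items 1 and 2 one each.
Q = np.array([[1, 1], [1, 0], [0, 1]])
xi = np.array([4.0, 4.2, 3.9])
omega = np.full(3, 2.0)

tau_a = np.array([[0.5, -0.5]])   # fast on dimension 1, slow on dimension 2
tau_b = np.array([[-0.3, 0.3]])   # slow on dimension 1, fast on dimension 2
mean_a = mlrtm_mean(xi, Q, tau_a)
mean_b = mlrtm_mean(xi, Q, tau_b)

log_t = simulate_mlrtm(xi, omega, Q, tau_a, np.random.default_rng(1))
```

Both speed vectors give the same expected log RT on item 0, because their components sum to zero there, while the expected log RTs differ on the single-dimension items 1 and 2.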

Since both response accuracy and RTs contain information about items and persons, it is advantageous to analyze them simultaneously. For example, one may be interested in the relationship between the multidimensional latent ability and multidimensional latent speed of persons and in the relationship between the difficulty and time-intensity of items.

In the multidimensional hierarchical modeling framework, at the first level, an MIRT model can be used as the measurement model for response accuracy and an MRT model as the measurement model for RTs; at the second level, all latent abilities and latent speeds are assumed to follow a multivariate normal distribution, and the item RA parameters (e.g., item difficulty) and item RT parameters (e.g., item time-intensity) are likewise assumed to follow a multivariate normal distribution. Given the “plug-and-play” nature of multidimensional hierarchical modeling, various choices of MIRT models and MRT models can be adopted. For illustration purposes, in this study, the MLRTM is used as the measurement model for RTs and, in accordance with the 2012 PISA mathematics assessment framework (OECD, 2013), the multidimensional Rasch model (MRM; Adams et al., 1997) is employed as the measurement model for response accuracy.

The Multidimensional Rasch Model for RA

Let $Y_{ni}$ be the observed response of person $n$ to item $i$. The MRM can be expressed as

$$\text{logit}(P(Y_{ni} = 1)) = \mathbf{q}_i'\boldsymbol{\theta}_n + d_i = \sum_{k=1}^{K} q_{ik}\theta_{nk} + d_i, \tag{5}$$

where $\text{logit}(x) = \log(x/(1 - x))$; $P(Y_{ni} = 1)$ is the probability of a correct response by person $n$ to item $i$; $\theta_{nk}$ is the latent ability of person $n$ on dimension $k$, and $\boldsymbol{\theta}_n = (\theta_{n1}, \ldots, \theta_{nk}, \ldots, \theta_{nK})'$ is assumed to follow a multivariate normal distribution, $\boldsymbol{\theta}_n \sim N(\boldsymbol{\mu}_\theta, \boldsymbol{\Sigma}_\theta)$, with $\boldsymbol{\mu}_\theta = \mathbf{0}$ for identification; $d_i$ is the intercept or easiness of item $i$; and $q_{ik}$ is an element of the confirmatory Q-matrix indicating whether dimension $k$ is required to answer item $i$ correctly.

Further, if only one dimension is assumed to be required by all items, the MRM reduces to the unidimensional Rasch model (URM; Rasch, 1960),

$$\text{logit}(P(Y_{ni} = 1)) = \theta_n + d_i, \tag{6}$$

where $\theta_n$ is the unidimensional latent ability of person $n$ and $d_i$ is the intercept or easiness of item $i$. The URM can be identified by setting the mean of $\theta_n$ at zero.

Multidimensional Hierarchical Modeling
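The MRM that enters the joint structures can be written as a short function. The sketch below computes $P(Y_{ni} = 1)$ from the logit $\sum_k q_{ik}\theta_{nk} + d_i$ and shows that a Q-matrix with a single column of ones reproduces the URM. The function name `mrm_prob` and all numerical values are hypothetical.

```python
import numpy as np

def mrm_prob(theta, d, Q):
    """P(Y_ni = 1) with logit equal to sum_k q_ik * theta_nk + d_i."""
    theta = np.asarray(theta, dtype=float)  # latent abilities, shape (N, K)
    d = np.asarray(d, dtype=float)          # item intercepts, shape (I,)
    Q = np.asarray(Q, dtype=float)          # Q-matrix, shape (I, K)
    eta = theta @ Q.T + d[None, :]          # logits, shape (N, I)
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical two-dimensional test with one within-item multidimensional item.
Q = np.array([[1, 0], [0, 1], [1, 1]])
d = np.array([0.0, -1.0, 0.5])
theta = np.array([[0.0, 0.0], [1.0, -1.0]])
p = mrm_prob(theta, d, Q)

# With one column of ones in Q, the MRM reduces to the URM: logit = theta_n + d_i.
p_urm = mrm_prob([[0.3]], d, np.ones((3, 1)))
```

On item 2, which requires both dimensions, the two respondents have the same success probability because their abilities sum to the same value, which is the compensatory behavior mirrored by the MLRTM.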

In multidimensional tests, three possible structures/combinations of latent speed and latent ability are outlined in Figure 1: (a) unidimensional ability and unidimensional speed (UA-US), in which a single latent ability is associated with a single latent speed; (b) multidimensional ability and unidimensional speed (MA-US), in which four (in this example) latent abilities are associated with a single latent speed; and (c) multidimensional ability and multidimensional speed (MA-MS), in which four latent abilities are associated with four latent speeds. Among the three structures, the MA-MS structure is the most general and contains the other two as special cases. Although the UA-US structure is commonly used in unidimensional tests, it can still be imposed on multidimensional tests by assuming that all items require only a single dimension. The MA-US structure was proposed by Man et al. (2017) and Zhan, Jiao et al. (2018). The MA-MS structure is proposed in this study.

Different structures represent different combinations of measurement models. The UA-US represents a combination of the URM and the ULRTM, the MA-US represents a combination of the MRM and the ULRTM, and the MA-MS represents a combination of the MRM and the MLRTM. Besides the measurement models, a multivariate normal distribution is used to describe the relationship among the latent abilities and latent speeds:

$$\boldsymbol{\Omega}_n = \begin{pmatrix} \boldsymbol{\theta}_n \\ \boldsymbol{\tau}_n \end{pmatrix} \sim N_{K^*}\left( \begin{pmatrix} \boldsymbol{\mu}_\theta \\ \boldsymbol{\mu}_\tau \end{pmatrix}, \boldsymbol{\Sigma}_{\text{person}} \right), \quad \boldsymbol{\Sigma}_{\text{person}} = \begin{pmatrix} \boldsymbol{\Sigma}_\theta & \boldsymbol{\Sigma}_{\theta\tau} \\ \boldsymbol{\Sigma}_{\theta\tau}' & \boldsymbol{\Sigma}_\tau \end{pmatrix}, \tag{7}$$

where $\boldsymbol{\theta}_n$ is a vector of multiple latent abilities of person $n$; $\boldsymbol{\tau}_n$ is a vector of multiple latent speeds of person $n$; $\boldsymbol{\Sigma}_{\text{person}}$ is the variance-covariance matrix of person parameters, in which $\boldsymbol{\Sigma}_{\theta\tau}$ collects the covariances between latent abilities and latent speeds; and $K^*$ is the total number of latent abilities and latent speeds. Taking the structures in Figure 1 as examples, a bivariate normal distribution ($K^* = 2$) can be employed for the UA-US; a five-variate normal distribution ($K^* = 5$) can be employed for the MA-US; and an eight-variate normal distribution ($K^* = 8$) can be employed for the MA-MS.

For the item parameters, a bivariate normal distribution is used to describe the relationship between item difficulty and item time-intensity:

$$\boldsymbol{\Psi}_i = \begin{pmatrix} d_i \\ \xi_i \end{pmatrix} \sim N_2\left( \begin{pmatrix} \mu_d \\ \mu_\xi \end{pmatrix}, \boldsymbol{\Sigma}_{\text{item}} \right), \quad \boldsymbol{\Sigma}_{\text{item}} = \begin{pmatrix} \sigma_d^2 & \sigma_{d\xi} \\ \sigma_{d\xi} & \sigma_\xi^2 \end{pmatrix}, \tag{8}$$

where $\mu_d$ and $\mu_\xi$ are the means of item difficulty and item time-intensity, respectively, and $\boldsymbol{\Sigma}_{\text{item}}$ is the variance-covariance matrix of the item parameters. The residual error variance, $\omega_i^{-2}$, is assumed to be independently distributed (e.g., Zhan, Jiao et al., 2018); thus it is not included in $\boldsymbol{\Psi}_i$.

Bayesian parameter estimation
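To make the two-level structure concrete before turning to estimation, the following sketch generates RA and RT data under an MA-MS structure: person parameters are drawn from a multivariate normal distribution, item intercepts and time-intensities from a bivariate normal distribution, and responses and log RTs from the MRM and the MLRTM. The function name, the two-dimensional Q-matrix, and every numerical value are assumptions for illustration; this is not the OpenBUGS code in the supplemental materials.

```python
import numpy as np

def simulate_joint(N, Q, sigma_person, mu_item, sigma_item, omega, rng):
    """Generate responses and log RTs under a joint MRM-MLRTM structure."""
    Q = np.asarray(Q, dtype=float)
    I, K = Q.shape
    # Second level: (theta_n, tau_n) jointly normal with zero mean vector.
    person = rng.multivariate_normal(np.zeros(2 * K), sigma_person, size=N)
    theta, tau = person[:, :K], person[:, K:]
    # Second level: (d_i, xi_i) bivariate normal.
    item = rng.multivariate_normal(mu_item, sigma_item, size=I)
    d, xi = item[:, 0], item[:, 1]
    # First level: MRM for accuracy, MLRTM for log RT.
    p = 1.0 / (1.0 + np.exp(-(theta @ Q.T + d)))
    y = rng.binomial(1, p)
    log_t = rng.normal(xi - tau @ Q.T, 1.0 / np.asarray(omega, dtype=float))
    return y, log_t

rng = np.random.default_rng(2)
Q = np.array([[1, 0], [0, 1], [1, 1], [1, 0]])           # hypothetical Q-matrix
sigma_person = np.array([[1.0, 0.5, -0.2, -0.1],         # hypothetical values,
                         [0.5, 1.0, -0.1, -0.2],         # ordered as
                         [-0.2, -0.1, 0.3, 0.15],        # (theta1, theta2,
                         [-0.1, -0.2, 0.15, 0.3]])       #  tau1, tau2)
sigma_item = np.array([[1.0, -0.2], [-0.2, 0.25]])
y, log_t = simulate_joint(500, Q, sigma_person, [0.0, 4.0],
                          sigma_item, np.full(4, 2.0), rng)
```

Setting the off-diagonal blocks of `sigma_person` to zero would make speed and ability independent, which is one way to see what the second level adds to two separately fitted measurement models.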

Model parameters in the MRM-MLRTM can be estimated via the full Bayesian approach with the Markov chain Monte Carlo (MCMC) method. In Bayesian estimation, the prior distributions of the model parameters and the observed data likelihood produce a joint posterior distribution for the model parameters. In this study, OpenBUGS (Spiegelhalter, Thomas, Best, & Lunn, 2014) was used to estimate parameters. OpenBUGS uses the Gibbs sampler (Gelfand & Smith, 1990) by default; the code for the MRM-MLRTM is provided in the online supplemental materials (runnable source code is also provided).

Under the assumption of local independence, $Y_{ni}$ and $\log T_{ni}$ are independently distributed as

$$Y_{ni} \sim \text{Bernoulli}(P(Y_{ni} = 1)) \quad \text{and} \quad \log T_{ni} \sim N\Big(\xi_i - \sum_{k=1}^{K} q_{ik}\tau_{nk},\ \omega_i^{-2}\Big).$$

The priors of the person parameters are set as

$$\begin{pmatrix} \boldsymbol{\theta}_n \\ \boldsymbol{\tau}_n \end{pmatrix} \sim N_{K^*}\left( \begin{pmatrix} \mathbf{0} \\ \mathbf{0} \end{pmatrix}, \boldsymbol{\Sigma}_{\text{person}} \right),$$

with a hyperprior $\boldsymbol{\Sigma}_{\text{person}} \sim \text{InvWishart}(\mathbf{R}_{\text{person}}, K^*)$, where $\mathbf{R}_{\text{person}}$ is a $K^*$-dimensional identity matrix. In addition, the priors of the item parameters are set as

$$\begin{pmatrix} d_i \\ \xi_i \end{pmatrix} \sim N_2\left( \begin{pmatrix} \mu_d \\ \mu_\xi \end{pmatrix}, \boldsymbol{\Sigma}_{\text{item}} \right), \quad \omega_i^{-2} \sim \text{InvGamma}(1, 1).$$

Furthermore, the hyperpriors are specified as

$$\mu_d \sim \text{Normal}(0, 2), \quad \mu_\xi \sim \text{Normal}(4.3, 2), \quad \boldsymbol{\Sigma}_{\text{item}} \sim \text{InvWishart}(\mathbf{R}_{\text{item}}, 2),$$

where $\mathbf{R}_{\text{item}}$ is a two-dimensional identity matrix. Finally, the posterior mean is treated as the estimate of each model parameter.

Real Data Analysis

Data Description

A real data analysis was conducted using the PISA 2012 computer-based mathematics data to explore whether the MRM-MLRTM fits the data better than the URM-ULRTM (e.g., van der Linden, 2007) and the MRM-ULRTM (e.g., Man et al., 2017) when the test structure is multidimensional. It also serves as an example to illustrate the use of the proposed model.

This dataset was used by Zhan, Jiao et al. (2018). There are N = 1581 respondents and I = 10 items. The logarithms of the RTs were computed before analysis, and all zero RTs were treated as missing data. Four dimensions that belong to mathematical content knowledge were chosen in this study, namely, change and relationships ($\theta_1$), quantity ($\theta_2$), space and shape ($\theta_3$), and uncertainty and data ($\theta_4$). The Q-matrix is shown in Table 1. Because only 10 items are used to measure four dimensions and the Q-matrix is incomplete and unidentifiable (Chiu, 2013; Xu & Zhang, 2016), the estimates of the model parameters are expected to contain relatively high measurement error. Nevertheless, such a test structure was retained because the real data analysis, though not perfect, can still provide some information about the nature of the real data structure. In addition, Bayesian estimation works even when the model is not identified, and adherence to identifiability has been deemed largely superfluous (Gustafson et al., 2005; Muthén & Asparouhov, 2012). More details can be found in Zhan, Jiao et al. (2018).

Note that we also analyzed the RT data by using the MLRTM alone; the results indicated that the MLRTM fit the RT data better than the ULRTM. More details can be found in the online supplemental materials.

Table 1.

Q matrix for PISA 2012 released computer-based mathematics items.
Items θ1 θ2 θ3 θ4
CM015Q01 1
CM015Q02D 1
CM015Q03D 1
CM020Q01 1
CM020Q02 1
CM020Q03 1
CM020Q04 1
CM038Q03T 1
CM038Q05 1
CM038Q06 1

Note. Blank means zero.

Analysis and Model Selection

The three joint models in Figure 1, namely, the URM-ULRTM, the MRM-ULRTM, and the MRM-MLRTM, were fit to the data. For each model, two Markov chains with random starting points were used, and each chain ran 10,000 iterations with the first 5,000 iterations in each chain as burn-in. The remaining 10,000 iterations (5,000 per chain) were used for model parameter inference. The potential scale reduction factor (PSRF; Brooks & Gelman, 1998) was computed to assess the convergence of each parameter. A PSRF smaller than 1.1 or 1.2 indicates convergence (Brooks & Gelman, 1998; Zhan, Liao et al., 2018). Our results indicated that the PSRF was smaller than 1.05 for all parameters, suggesting good convergence.

Posterior predictive model checking (PPMC; Gelman, Carlin, Stern, Dunson, Vehtari, & Rubin, 2014) was used to evaluate model-data fit. A posterior predictive probability (ppp) value near 0.5 indicates that there are no systematic differences between the realized and predictive values, and thus an adequate fit of the model. In PPMC, the sum of the squared Pearson residuals for person $n$ and item $i$ (Yan, Mislevy, & Almond, 2003) was used as a discrepancy measure to evaluate the overall fit of the RA model, which is written as

$$D(Y_{ni}; \boldsymbol{\theta}_n, d_i) = \sum_{n=1}^{N}\sum_{i=1}^{I}\left( \frac{Y_{ni} - P(Y_{ni} = 1)}{\sqrt{P(Y_{ni} = 1)(1 - P(Y_{ni} = 1))}} \right)^2,$$

where $P(Y_{ni} = 1)$ has the same definition as that in Equation (5). On the other hand, the sum of the standardized error function of $\log T_{ni}$ for person $n$ and item $i$ was employed as a discrepancy measure for the RT model:

$$D(\log T_{ni}; \xi_i, \omega_i, \boldsymbol{\tau}_n) = \sum_{n=1}^{N}\sum_{i=1}^{I}\left( \omega_i\Big( \log T_{ni} - \Big( \xi_i - \sum_{k=1}^{K} q_{ik}\tau_{nk} \Big) \Big) \right)^2.$$

Additionally, the Akaike information criterion (AIC; Akaike, 1974), the Bayesian information criterion (BIC; Schwarz, 1978), and the deviance information criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, 2002) were computed for model selection. Specifically, $\text{DIC} = \bar{D} + p_e = \bar{D} + \text{var}(D)/2$; namely, the effective number of parameters ($p_e$) was computed by $p_e = \text{var}(D)/2$ (Gelman, Carlin, Stern, & Rubin, 2003), where $D$ is the deviance (i.e., –2 log likelihood) and $\bar{D}$ is the posterior mean of the deviance. Note that in Bayesian analysis, the AIC and BIC can be defined as $\text{AIC} = \bar{D} + p$ and $\text{BIC} = \bar{D} + (\log(N) - 1)p$ (Congdon, 2003), where $p$ is the number of estimated parameters. A smaller value of each index indicates a better model-data fit.

Results
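The fit indices reported in this section can be computed from MCMC output along the following lines. The sketch assumes the forms $\text{DIC} = \bar{D} + p_e$ with $p_e = \text{var}(D)/2$, $\text{AIC} = \bar{D} + p$, and $\text{BIC} = \bar{D} + (\log(N) - 1)p$, and it takes the ppp value to be the proportion of posterior draws in which the replicated discrepancy is at least as large as the realized one. The function names, the use of the sample variance, and the toy inputs are hypothetical.

```python
import numpy as np

def information_criteria(deviance_draws, p, N):
    """AIC, BIC, and DIC from posterior draws of the deviance."""
    dev = np.asarray(deviance_draws, dtype=float)
    d_bar = dev.mean()                 # posterior mean deviance
    p_e = dev.var(ddof=1) / 2.0        # effective number of parameters (assumed form)
    return {
        "AIC": d_bar + p,
        "BIC": d_bar + (np.log(N) - 1.0) * p,
        "DIC": d_bar + p_e,
        "p_e": p_e,
    }

def ppp_value(d_realized, d_replicated):
    """Share of draws where the replicated discrepancy >= the realized one."""
    d_realized = np.asarray(d_realized, dtype=float)
    d_replicated = np.asarray(d_replicated, dtype=float)
    return float(np.mean(d_replicated >= d_realized))

# Toy inputs: three deviance draws, p = 5 parameters, N = 1581 respondents.
ic = information_criteria([100.0, 102.0, 104.0], p=5, N=1581)
ppp = ppp_value([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

In practice the deviance draws and the realized and replicated discrepancies would come from the same post-burn-in iterations, one value per iteration.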

The AIC, BIC, and DIC all identified the MRM-MLRTM as the best-fitting model and the URM-ULRTM as the worst-fitting model, as shown in Table 2. In the MRM-MLRTM, the ppp values of the RA model and the RT model were 0.789 and 0.616, respectively, which indicated an adequate model-data fit. Thus, it may be more appropriate to simultaneously consider the multidimensionality of the latent speeds and latent abilities. The MRM-MLRTM was used for further illustration.

Table 2.

Model Fit for the PISA 2012 Computer-based Mathematics Data.
Analysis Model (For Ability, For Speed) AIC BIC DIC ppp_RA ppp_RT
MRM MLRTM

Note. MRM = multidimensional Rasch model; MLRTM = multidimensional log-normal response time model; ULRTM = unidimensional log-normal response time model; AIC = Akaike information criterion; BIC = Bayesian information criterion; DIC = deviance information criterion; ppp = posterior predictive probability value; RA = item response accuracy; RT = item response times.

To explore the relationships among the multiple latent abilities and latent speeds, Table 3 presents the estimated person variance and covariance matrix. Moderate to high positive correlations (0.70 ~ 0.92) were found among the multiple abilities, and lower correlations (0.56 ~ 0.85) were found among the multiple speeds. However, low to moderate negative correlations (–0.60 ~ –0.01) were found between the multidimensional abilities and speeds. Although these results are not consistent with the common-sense expectation that more capable respondents tend to work faster, they are consistent with previous findings (Klein Entink, Fox et al., 2009; van der Linden & Fox, 2015; Zhan, Jiao et al., 2018). As a low-stakes test, the PISA has more significant implications for countries or areas than for individual respondents. A reasonable explanation could be that low-ability respondents lacked motivation in taking the low-stakes test (Wise & Kong, 2005), which led to shorter RTs. Note that the variance of the first latent ability was quite large, indicating that respondents differ greatly on the first dimension.

Table 3.

Estimated Variance and Covariance Matrix for the Multidimensional Abilities and Speeds.
      θ1             θ2             θ3             θ4             τ1            τ2            τ3            τ4
τ1    –1.209(0.187)  –0.567(0.080)  –0.367(0.041)  –0.338(0.035)  0.308(0.017)  0.563         0.739         0.753
τ2    –0.014(0.086)  –0.099(0.049)  –0.044(0.031)  –0.040(0.027)  0.163(0.012)  0.272(0.027)  0.651         0.605
τ3    –0.404(0.093)  –0.154(0.037)  –0.190(0.028)  –0.179(0.025)  0.184(0.011)  0.152(0.010)  0.202(0.010)  0.850
τ4    –0.654(0.125)  –0.262(0.051)  –0.319(0.038)  –0.312(0.034)  0.227(0.013)  0.171(0.012)  0.207(0.011)  0.294(0.014)
Note. Standard error in parentheses.

Figure 2 presents four scatter diagrams to further depict the relationship between the multidimensional construct and the corresponding multidimensional speed. Clearly, different relationships existed in different dimensions. The relatively fuzzy relationship in the second dimension might be caused by unstable estimation based on a single item. According to the scatter diagrams, some respondents had low ability but high speed, found in the fourth quadrant, and they might demonstrate aberrant response behavior such as rapid guessing (Fox & Marianti, 2017; Wang & Xu, 2015).

Table 4 presents the estimates of the item parameters. The estimated mean item difficulty and mean time-intensity were –1.19 (SE = 0.54) and 4.29 (SE = 0.15), respectively. The estimates of the time-intensity and time-kurtosis were quite similar to those shown in Table S3 in the online supplemental materials, which were estimated from the MLRTM alone. To further explore the relationship between the item intercept and time-intensity parameters, we present the estimated item variance and covariance matrix in Table 5. The correlation was –0.43, which implied that items with a lower intercept (i.e., more difficult items) might have higher time-intensity. This result is consistent with the literature in that more difficult items often need more time to solve (Fox & Marianti, 2016; Meng et al., 2015; van der Linden, 2006, 2007).

Figure 2. Relationship between the multidimensional construct and the corresponding multidimensional speed.
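The correlations among the latent speeds reported in Table 3 follow from the corresponding variances and covariances. As a check, the sketch below converts the speed block of the estimated covariance matrix to correlations; the values are copied from Table 3, while the function name `cov_to_corr` is hypothetical. Differences in the third decimal from the tabled correlations are expected because the inputs are rounded.

```python
import numpy as np

def cov_to_corr(cov):
    """Convert a covariance matrix to a correlation matrix."""
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

# Speed block of Table 3 (variances on the diagonal, covariances elsewhere).
cov_tau = np.array([[0.308, 0.163, 0.184, 0.227],
                    [0.163, 0.272, 0.152, 0.171],
                    [0.184, 0.152, 0.202, 0.207],
                    [0.227, 0.171, 0.207, 0.294]])
corr_tau = cov_to_corr(cov_tau)
```

For instance, the correlation between the first and second speeds is 0.163 / sqrt(0.308 × 0.272), which is approximately 0.563.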

Table 4.

Estimated Item Parameters for the Released 2012 PISA Computer-based Mathematics Items.
Item   d                 ξ                 ω
       Mean     SE       Mean     SE       Mean     SE
1       0.567   0.095    4.224    0.016    2.865    0.026
2      –4.548   0.463    4.472    0.020    1.865    0.015
3      –3.415   0.358    4.631    0.019    1.995    0.015
4      –2.384   0.114    4.779    0.015    2.524    0.008
5      –0.129   0.069    3.861    0.017    1.913    0.012
6      –1.441   0.087    4.258    0.016    2.190    0.010
7      –0.419   0.072    3.739    0.017    2.097    0.010
8       0.749   0.066    4.190    0.017    2.562    0.011
9      –1.380   0.077    4.523    0.018    2.085    0.012
10     –1.438   0.077    4.380    0.021    1.689    0.016
Note. Mean = posterior mean; SE = standard deviation of posterior distribution.

Table 5.

Estimated Variance and Covariance Matrix of Item Parameters for the Released 2012 PISA Computer-based Mathematics Items.
d ξ
d
Note. Covariance in the lower triangle and correlation coefficient in the upper triangle.

A Brief Simulation Study for Parameter Recovery

Design and Data Generation

A real data study has been provided to demonstrate the application of the proposed joint multidimensional model. Further, to assess the parameter recovery of the proposed model, a brief simulation study was conducted. For simplicity, only the MRM-MLRTM was assessed, because the other two joint hierarchical models can be seen as its special cases.

Four dimensions were measured by 30 items, which means that there are four latent abilities and four corresponding latent speeds. The Q-matrix is presented in Figure 3. For item parameters, the $d$ and $\xi$ parameters were generated from a bivariate normal distribution with mean vector (0, 4) and covariance matrix [1, –0.2; –0.2, 0.25]; in this setting, $\rho_{d\xi}$ = –0.4. The $\omega$ parameters were set to 2, which is also similar to the estimates in the real data analysis. A total of 1,000 respondents were simulated. ($\boldsymbol{\Theta}$, $\boldsymbol{\mathrm{T}}$) were generated from an eight-variate normal distribution with mean vector ($\mathbf{0}$, $\mathbf{0}$) and a covariance matrix under which $\rho_{\theta\theta'}$ = 0.8, $\rho_{\tau\tau'}$ = 0.6, and $\rho_{\theta\tau}$ = –0.4. Thirty datasets were generated.

Figure 3. K-by-I Q' matrix in the simulation study.

Analysis
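The generating values in this design can be verified with a few lines of code. The sketch below recovers $\rho_{d\xi}$ = –0.4 from the stated item covariance matrix and builds an 8-by-8 person correlation matrix from the three stated correlations. Two assumptions are made here that the text does not confirm: all person parameters are given unit variance, and $\rho_{\theta\tau}$ = –0.4 is applied to every ability-speed pair.

```python
import numpy as np

# Item level: stated covariance matrix of (d, xi).
sigma_item = np.array([[1.0, -0.2], [-0.2, 0.25]])
rho_d_xi = sigma_item[0, 1] / np.sqrt(sigma_item[0, 0] * sigma_item[1, 1])

def exchangeable(rho, K):
    """K-by-K correlation matrix with ones on the diagonal and rho elsewhere."""
    return rho * np.ones((K, K)) + (1.0 - rho) * np.eye(K)

# Person level: assumed unit variances, so covariances equal correlations.
K = 4
cross = -0.4 * np.ones((K, K))                    # ability-speed block (assumed)
sigma_person = np.block([[exchangeable(0.8, K), cross],
                         [cross.T, exchangeable(0.6, K)]])

# A valid generating matrix must be positive definite.
min_eig = np.linalg.eigvalsh(sigma_person).min()

rng = np.random.default_rng(3)
person = rng.multivariate_normal(np.zeros(2 * K), sigma_person, size=1000)
theta, tau = person[:, :K], person[:, K:]
```

Under these assumptions the smallest eigenvalue is positive, so the three correlations are jointly admissible and the matrix can be used to draw the 1,000 simulated respondents.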

The MRM-MLRTM was fitted to each of the 30 replications. In each replication, the numbers of chains, burn-in iterations, and post-burn-in iterations were the same as those set in the empirical study. It appeared that convergence was well achieved. To evaluate parameter recovery, the bias and the root mean square error (RMSE) were computed as:

$$\operatorname{bias}(\hat{\upsilon}) = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{\upsilon}_r - \upsilon\right) \quad \text{and} \quad \operatorname{RMSE}(\hat{\upsilon}) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\upsilon}_r - \upsilon\right)^2},$$

where $\hat{\upsilon}_r$ and $\upsilon$ are the estimated value in replication r and the true value of a model parameter, respectively, and R is the total number of replications. The correlation between estimated and true values (Cor) was also computed.

Results
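The recovery criteria defined in the Analysis section can be sketched in a few lines. The function names and the toy numbers below are made up for illustration only. Computing Cor on the replication-averaged estimates is also an assumption, since the text does not say whether the correlation was computed per replication or on averaged estimates.

```python
import numpy as np


def recovery_summary(estimates, true_values):
    """Bias and RMSE per parameter.

    estimates:   array of shape (R, P), one row per replication.
    true_values: array of shape (P,), the generating values.
    """
    estimates = np.asarray(estimates, dtype=float)
    true_values = np.asarray(true_values, dtype=float)
    errors = estimates - true_values           # broadcast over replications
    bias = errors.mean(axis=0)                 # (1/R) * sum(est_r - true)
    rmse = np.sqrt((errors ** 2).mean(axis=0)) # sqrt((1/R) * sum((est_r - true)^2))
    return bias, rmse


def recovery_cor(estimates, true_values):
    """Correlation between replication-averaged estimates and true values.

    Averaging over replications first is an assumption (see lead-in).
    """
    mean_est = np.asarray(estimates, dtype=float).mean(axis=0)
    return np.corrcoef(mean_est, np.asarray(true_values, dtype=float))[0, 1]


# Toy data (hypothetical): R = 2 replications, P = 3 parameters.
true = np.array([1.0, 2.0, 3.0])
est = np.array([[1.1, 2.0, 2.9],
                [0.9, 2.2, 3.1]])

bias, rmse = recovery_summary(est, true)
cor = recovery_cor(est, true)
print("bias:", np.round(bias, 3))
print("RMSE:", np.round(rmse, 3))
print("Cor: ", round(float(cor), 3))
```

Note that Cor is undefined when the true values have zero variance, which is the case for a parameter fixed at a single value across items.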

Table 6 presents the bias, RMSE, and Cor of the item and person parameters. Overall, all model parameters were well recovered. Among the item parameters, time-intensity was recovered best, followed by time-kurtosis, and the item intercept was recovered worst. Among the person parameters, the latent speeds were recovered better than the latent abilities. The recovery of the variances and covariances was of more interest in this study; the estimated bias and RMSE are given in Table 7 for $\Sigma_{\text{item}}$ and in Tables 8 and 9 for $\Sigma_{\text{person}}$, respectively. In general, all of them were well recovered. Covariances were recovered better than variances, and time-related parameters (e.g., time-intensity, the covariance of item difficulty and time-intensity, latent speeds, and the covariance of latent ability and latent speed) were recovered better than time-unrelated parameters (e.g., item intercept and latent abilities). Overall, the model parameters of the MRM-MLRTM can be recovered very well via the proposed full Bayesian MCMC estimation algorithm.

Table 6.

Recovery of Item and Person Parameters in the Simulation Study.

| Parameter | Mean Bias | Mean RMSE | Cor |
|---|---|---|---|
| d | –0.001 | 0.468 | 0.886 |
| θ | –0.001 | 0.458 | 0.888 |
| θ | –0.001 | 0.460 | 0.886 |
| θ | –0.001 | 0.467 | 0.883 |

Note. Mean Bias = mean bias across all respondents; Mean RMSE = mean RMSE across all respondents; Cor = correlation between estimated and true values; Cor of ω is NA because the variance of the true ω is zero.

Table 7.

Recovery of Item Mean Vector and Item Variance and Covariance Matrix.

| Parameter | Bias | RMSE |
|---|---|---|
| Covariance (d, ξ) | –0.005 | 0.011 |

Table 8.

Bias of the Variance and Covariance Matrix of Person Parameters.

| Σ person | θ1 | θ2 | θ3 | θ4 |
|---|---|---|---|---|
| θ1 | –0.043 | | | |
| θ2 | –0.002 | –0.007 | | |
| θ3 | –0.016 | 0.000 | –0.046 | |
| θ4 | –0.011 | 0.006 | 0.003 | –0.026 |

Table 9.

RMSE of the Variance and Covariance Matrix of Person Parameters.

Conclusion and Discussion

To capture the multidimensionality of latent speed, this study proposed a multidimensional log-normal RT model and a multidimensional hierarchical modeling framework. The PISA 2012 computer-based mathematics data were analyzed to illustrate the implications and applications of the proposed models. The results indicated that it is appropriate to consider the multidimensionality of latent speed and of latent ability simultaneously in multidimensional tests when RTs are collected. A brief simulation study was also conducted to further evaluate parameter recovery. The results indicated that the model parameters could be well recovered using the Bayesian MCMC approach.

The work presented in this article is only a first attempt to deal with the multidimensionality of latent speed. Despite promising results, further exploration is encouraged. First, the proposed RT model is a multidimensional extension of the classical log-normal RT model (van der Linden, 2006); multidimensional extensions of other RT models (Fox & Marianti, 2016; Klein Entink, van der Linden, et al., 2009; Wang et al., 2013; Wang & Xu, 2015) could be explored and compared in the future. Second, in the proposed MLRTM, the latent speeds are assumed to be compensatory. As non-compensatory multidimensional models for response accuracy have become popular in recent decades (DeMars, 2016; Embretson, 1984, 2015; Jiao, Lissitz, & Zhan, 2017; Templin & Henson, 2006; Wang & Nydick, 2015), it is important to develop corresponding non-compensatory MRT models in the future. Third, in the proposed multidimensional hierarchical modeling approach, a multivariate normal distribution was used to describe the relationships among the multidimensional latent speeds and the multidimensional latent abilities. Consequently, the total number of dimensions is twice the number of dimensions measured by the test. For example, in the application example above, there were eight dimensions, which may pose a challenge for parameter estimation. If the multidimensional latent ability and the multidimensional latent speed each have a second-order (or bi-factor) structure, the parameter estimation challenge can be largely reduced, and the structures of latent ability and latent speed can also be posited and tested. Fourth, applications of the proposed RT model, such as detecting aberrant responses (e.g., rapid guessing and cheating) in multidimensional tests, need further investigation. Fifth, analyzing students' growth is an important topic in educational and psychological research (e.g., von Davier, Xu, & Carstensen, 2011; Zhan, Jiao, & Liao, 2017). How to apply the proposed multidimensional hierarchical modeling approach to longitudinal studies is also an interesting topic (e.g., Wang, Zhang, Douglas, & Culpepper, 2018).

References

Adams, R. J., Wilson, M., & Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23. http://dx.doi.org/10.1177/0146621697211001

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723. http://dx.doi.org/10.1109/TAC.1974.1100705

Boughton, K. A., Smith, J., & Ren, H. (2017). Using response time data to detect compromised items and/or people. In G. J. Cizek & J. A. Wollack (Eds.), Handbook of quantitative methods for detecting cheating on tests (pp. 177–192). New York, NY: Routledge.

Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434–455. http://dx.doi.org/10.2307/1390675

Chiu, C.-Y. (2013). Statistical refinement of the Q-matrix in cognitive diagnosis. Applied Psychological Measurement, 37, 598–618. https://doi.org/10.1177/0146621613488436

Congdon, P. (2003). Applied Bayesian modelling. New York: John Wiley.

Curran, P. J., & Bauer, D. J. (2011). The disaggregation of within-person and between-person effects in longitudinal models of change. Annual Review of Psychology, 62, 583–619. http://dx.doi.org/10.1146/annurev.psych.093008.100356

DeMars, C. E. (2016). Partially compensatory multidimensional item response theory models: Two alternate model forms. Educational and Psychological Measurement, 76, 231–257. http://dx.doi.org/10.1177/0013164415589595

Embretson, S. E. (1984). A general latent trait model for response processes. Psychometrika, 49, 175–186. http://dx.doi.org/0033-3123/84/0600-5056500.75/0

Embretson, S. E. (2015). The multicomponent latent trait model for diagnosis: Applications to heterogeneous test domains. Applied Psychological Measurement, 39, 16–30. http://dx.doi.org/10.1177/0146621614552014

Fox, J.-P., Entink, R. K., & Timmers, C. (2014). The joint multivariate modeling of multiple mixed response sources: Relating student performances with feedback behavior. Multivariate Behavioral Research, 49, 54–66. http://dx.doi.org/10.1080/00273171.2013.843441

Fox, J.-P., & Marianti, S. (2016). Joint modeling of ability and differential speed using responses and response times. Multivariate Behavioral Research, 51, 540–553. http://dx.doi.org/10.1080/00273171.2016.1171128

Fox, J.-P., & Marianti, S. (2017). Person-fit statistics for joint models for accuracy and speed. Journal of Educational Measurement, 54, 243–262. http://dx.doi.org/10.1111/jedm.12143

Ferrando, P. J., & Lorenzo-Seva, U. (2007). A measurement model for Likert responses that incorporates response time. Multivariate Behavioral Research, 42, 675–706. http://dx.doi.org/10.1080/00273170701710247

Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409. http://dx.doi.org/10.1080/01621459.1990.10476213

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis. New York: Chapman & Hall.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis (2nd ed.). London: Chapman and Hall.

Gustafson, P., Gelfand, A. E., Sahu, S. K., Johnson, W. O., Hanson, T. E., Lawrence, J., & Lee, J. (2005). On model expansion, model contraction, identifiability and prior information: Two illustrative scenarios involving mismeasured variables [with comments and rejoinder]. Statistical Science, 20, 111–140. http://dx.doi.org/10.1214/088342305000000098

Jiao, H., Lissitz, R. W., & Zhan, P. (2017). A new noncompensatory testlet model for calibrating innovative items embedded in multiple contexts. In H. Jiao & R. W. Lissitz (Eds.), Technology enhanced innovative assessment: Development, modeling, and scoring from an interdisciplinary perspective. Charlotte, NC: Information Age Publisher.

Klein Entink, R. H., Fox, J.-P., & van der Linden, W. J. (2009). A multivariate multilevel approach to the modeling of accuracy and speed of test takers. Psychometrika, 74, 21–48. http://dx.doi.org/10.1007/S11336-008-9075-Y

Klein Entink, R. H., van der Linden, W. J., & Fox, J.-P. (2009). A Box-Cox normal model for response times. British Journal of Mathematical and Statistical Psychology, 62, 621–640.

Lee, Y.-H., & Chen, H. (2011). A review of recent response-time analyses in educational testing. Psychological Test and Assessment Modeling, 53, 359–379.

Luce, R. D. (1986). Response times: Their role in inferring elementary mental organization. New York: Oxford University Press.

Man, K., Jiao, H., Zhan, P., & Huang, C.-Y. (2017, May). A conditional joint modeling approach for compensatory multidimensional item response model and response times. Paper presented at the annual meeting of the Modern Modeling Methods Conference, Storrs.

Meredith, W. (1993). Measurement invariance, factor analysis, and factorial invariance. Psychometrika, 58, 525–543.

Meng, X.-B., Tao, J., & Chang, H.-H. (2015). A conditional joint modeling approach for locally dependent item responses and response times. Journal of Educational Measurement, 52, 1–27. http://dx.doi.org/10.1111/jedm.12060

Molenaar, D., Oberski, D., Vermunt, J., & De Boeck, P. (2016). Hidden Markov item response theory models for responses and response times. Multivariate Behavioral Research, 51, 606–626. http://dx.doi.org/10.1080/00273171.2016.1192983

Molenaar, D., Tuerlinckx, F., & van der Maas, H. L. J. (2015). A generalized linear factor model approach to the hierarchical framework for responses and response times. British Journal of Mathematical and Statistical Psychology, 68, 197–219. http://dx.doi.org/10.1111/bmsp.12042

Molenaar, P. C. (2004). A manifesto on psychology as idiographic science: Bringing the person back into scientific psychology, this time forever. Measurement: Interdisciplinary Research and Perspectives, 2, 201–218. http://dx.doi.org/10.1207/s15366359mea0204_1

Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17, 313–335. http://dx.doi.org/10.1037/a0026802

OECD (2013). PISA 2012 assessment and analytical framework: Mathematics, reading, science, problem solving and financial literacy. OECD Publishing. http://dx.doi.org/10.1787/9789264190511-en

Qian, H., Staniewska, D., Reckase, M., & Woo, A. (2016). Using response time to detect item preknowledge in computer-based licensure examinations. Educational Measurement: Issues and Practice, 35, 38–47.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut.

Ranger, J. (2013). Modeling responses and response times in personality tests with rating scales. Psychological Test and Assessment Modeling, 55, 361–382.

Reckase, M. (2009). Multidimensional item response theory. New York: Springer.

Schnipke, D. L., & Scrams, D. J. (2002). Exploring issues of examinee behavior: Insights gained from response-time analyses. In C. N. Mills, M. Potenza, J. J. Fremer, & W. Ward (Eds.), Computer-based testing: Building the foundation for future assessments (pp. 237–266). Hillsdale, NJ: Lawrence Erlbaum Associates.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64, 583–639.

Spiegelhalter, D., Thomas, A., Best, N., & Lunn, D. (2014). OpenBUGS User Manual Version 3.2.3.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.

Sie, H., Finkelman, M. D., Riley, B., & Smits, N. (2015). Utilizing response times in computerized classification testing. Applied Psychological Measurement, 39, 389–405.

Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20, 345–354. http://dx.doi.org/10.1111/j.1745-3984.1983.tb00212.x

Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using cognitive diagnostic models. Psychological Methods, 11, 287–305.

Thissen, D. (1983). Timed testing: An approach using item response theory. In D. J. Weiss (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing (pp. 179–203). New York: Academic Press.

van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31, 181–204. http://dx.doi.org/10.3102/10769986031002181

van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287–308. http://dx.doi.org/10.1007/s11336-006-1478-z

van der Linden, W. J. (2009). Conceptual issues in response-time modeling. Journal of Educational Measurement, 46, 247–272. http://dx.doi.org/10.1111/j.1745-3984.2009.00080.x

van der Linden, W. J. (2011). Test design and speededness. Journal of Educational Measurement, 48, 44–60.

van der Linden, W. J., & Fox, J.-P. (2015). Joint modeling of responses and response times. In W. J. van der Linden (Ed.), Handbook of item response theory: Vol. 1. Models. Boca Raton, FL: Chapman & Hall/CRC.

van der Linden, W. J., & Guo, F. (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73, 365–384. https://doi.org/10.1007/s11336-007-9046-8

Wang, C., Chang, H., & Douglas, J. (2013). The linear transformation model with frailties for the analysis of item response times. British Journal of Mathematical and Statistical Psychology, 66, 144–168. http://dx.doi.org/10.1111/j.2044-8317.2012.02045.x

Wang, C., & Nydick, S. W. (2015). Comparing two algorithms for calibrating the restricted non-compensatory multidimensional IRT model. Applied Psychological Measurement, 39, 119–134. http://dx.doi.org/10.1177/0146621614545983

Wang, C., & Xu, G. (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68, 456–477. http://dx.doi.org/10.1111/bmsp.12054

Wang, S., Yang, Y., Culpepper, S. A., & Douglas, J. A. (2017). Tracking skill acquisition with cognitive diagnosis models: A higher-order, hidden Markov model with covariates. Journal of Educational and Behavioral Statistics. https://doi.org/10.3102/1076998617719727

Wang, S., Zhang, S., Douglas, J., & Culpepper, S. (2018). Using response times to assess learning progress: A joint model for responses and response times. Measurement: Interdisciplinary Research and Perspectives, 16, 45–58. https://doi.org/10.1080/15366367.2018.1435105

Wang, T., & Hanson, B. A. (2005). Development and calibration of an item response model that incorporates response time. Applied Psychological Measurement, 29, 323–339. http://dx.doi.org/10.1177/0146621605275984

Wang, W.-C., & Wilson, M. (2005). The Rasch testlet model. Applied Psychological Measurement, 29, 126–149. http://dx.doi.org/10.1177/0146621604271053

Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18, 163–183. http://dx.doi.org/10.1207/s15324818ame1802_2

Xu, G., & Zhang, S. (2016). Identifiability of diagnostic classification models. Psychometrika, 81, 625–649. https://doi.org/10.1007/s11336-015-9471-z

Yan, D., Mislevy, R. J., & Almond, R. G. (2003). Design and analysis in a cognitive assessment (ETS Research Report Series, RR-03-32). Princeton, NJ: ETS.

Zhan, P., Jiao, H., & Liao, D. (2018). Cognitive diagnosis modeling incorporating item response times. British Journal of Mathematical and Statistical Psychology, 71.

Zhan, P., Jiao, H., & Liao, D. (2017). A longitudinal diagnostic classification model. arXiv preprint arXiv:1709.03431. URL https://arxiv.org/abs/1709.03431

Zhan, P., Liao, M., & Bian, Y. (2018). A joint testlet cognitive diagnosis modeling for paired local item dependence in response times and response accuracy. Frontiers in Psychology, 9