A psychometric modeling approach to fuzzy rating data
Antonio Calcagnì∗, Niccolò Cao, Enrico Rubaltelli, Luigi Lombardi
University of Padova, University of Trento
∗ E-mail: [email protected]
Abstract
Modeling fuzziness and imprecision in human rating data is a crucial problem in many research areas, including applied statistics, behavioral, social, and health sciences. Because of the interplay between cognitive, affective, and contextual factors, the process of answering survey questions is a complex task, which can barely be captured by standard (crisp) rating responses. Fuzzy rating scales have progressively been adopted to overcome some of the limitations of standard rating scales, including their inability to disentangle decision uncertainty from individual responses. The aim of this article is to provide a novel fuzzy scaling procedure which uses Item Response Theory trees (IRTrees) as a psychometric model for the stage-wise latent response process. In so doing, the fuzziness of rating data is modeled using the overall rater's pattern of responses instead of being computed using a single-item based approach. This offers a consistent system for interpreting fuzziness in terms of individual-based decision uncertainty. A simulation study and two empirical applications are adopted to assess the characteristics of the proposed model and provide converging results about its effectiveness in modeling fuzziness and imprecision in rating data.

Keywords: fuzzy rating data, fuzzy rating scale, item response model, fuzzy numbers, decision uncertainty
Rating scales are the most common tools for collecting data involving the assessment of interests, motivations, attitudes, personality traits, and a wide variety of health-related and sociodemographic constructs. A typical use of rating scales is in self-report questionnaires and social surveys, where a set of questions (items) is presented individually and respondents are asked to indicate the extent of their agreement/disagreement on a scale with multiple response categories. Overall, rating scales are effective, reliable, and easy-to-use instruments [1]. However, it is widely recognized that they are not immune to problems such as response biases [2, 3], faking behaviors [4, 5], violation of rating rules [6, 7], and cultural or cognitive differences in the use of response categories [8]. In addition, rating scales do not allow for an in-depth inquiry into respondents' rating process [9]. As many studies have shown, the process of answering multiple-choice questions is a complex task since it involves both individual-dependent cognitive and affective factors as well as individual-independent contextual factors (e.g., [10, 11, 12]). For instance, when a respondent is presented with an item like "I am satisfied with my current work", which is rated on a five-point scale from "strongly disagree" to "strongly agree", he or she first retrieves long-term memory information about events, attitudes, and beliefs about his or her job. The retrieved events may activate affective components which positively or negatively influence opinion formation (for example, a recent promotion may enhance the chance of answering the item positively). Then, cognitive and affective information are integrated to activate the decision-making stage, which includes the answer editing step, where a set of candidate answers is pruned to produce the final response [13]. As a result of conflicting demands from these stages, some level of decision uncertainty can impact the final rating choice.
Consequently, final responses on questionnaires reflect only a portion of the entire response process. There have been numerous attempts to make rating scales more sensitive to components of the response process such as decision uncertainty. Generally, there are three types of solutions to this problem: a first one involving the use of additional measures like response times or latencies along with standard multiple-choice questions [14, 15, 16], a second one using extended item response theory (IRT) models on standard rating data [17, 18, 19], and a third one involving the use of alternative rating instruments, such as those based on tracing methodology [20] and fuzzy rating scales [21, 22, 23]. Since the seminal work of [23], the latter has become popular only in recent years. Typically, there are two ways to define a fuzzy rating scale, one involving fuzzy conversion systems and the other involving a direct fuzzy rating system. In the first case, a fuzzy conversion system is used to transform standard rating responses into fuzzy numbers (e.g., see [24]). In the second case, a tailor-made rating interface is instead adopted in order to map fuzzy numbers to a rating process by means of implicit [21] or explicit [22] procedures. Despite their differences, both approaches aim at modeling decision uncertainty or its counterpart, the fuzziness and imprecision of rating data, as emerging from multiple-choice rating tasks.

In this paper, we contribute to this research stream by proposing a novel method which places fuzzy rating scales in the context of Item Response Theory tree (IRTree) models [17]. The aim is to provide an approach to fuzzy rating data that incorporates a stage-wise cognitive formalization of the process that respondents use to answer survey questions.
IRTrees are a novel class of item response models aiming at representing the internal decision stages behind final rating outcomes. By adopting a sequence of linear or nested binary trees, they allow for disentangling the result of the rating process (e.g., the choice of the category "strongly disagree" on a common Likert-type scale) and the sequential steps needed by raters to reach their final outcomes. In this manner they provide an elegant way to mine information from rating data, which can be used to model the fuzziness and imprecision encapsulated in rating data.

The remainder of this article is organized as follows. Section 2 offers a review of the major literature on fuzzy rating scales. Section 3 describes our method for modeling imprecision and uncertainty in rating data using IRTrees. Section 4 reports the results of a simulation study designed to validate our proposal, whereas Section 5 describes two applications using empirical case studies. Finally, Section 6 concludes the article by providing final remarks and suggestions for future research. All the materials, such as the algorithms and datasets used throughout the paper, are available for download at https://github.com/antcalcagni/firtree/.

Fuzzy rating scales aim at quantifying the fuzziness and imprecision of human subjective responses. Typically, two approaches are known in the literature to construct a fuzzy rating instrument, namely fuzzy direct or indirect scales and fuzzy conversion scales. In fuzzy direct rating, a computerized rating scale is adopted and raters are asked to draw their responses using fuzzy sets according to their perceived uncertainty [23, 25]. This method usually requires a two-step response process. First, raters draw an interval or a point on a pseudo-continuous graphical scale which represents the set of admissible responses compatible with their assessment of the item being rated.
Then, they are asked to express their degree of confidence by drawing another interval around their previous interval or point-wise responses. Finally, the two pieces of information are combined to form triangular or trapezoidal fuzzy responses. An overview of direct fuzzy rating is described in [26]. By contrast, fuzzy indirect rating uses implicit subjective information to quantify the fuzziness of rating data. On this research line, for instance, [21] adopted a system which includes biometric measures of the cognitive response process (e.g., response time, computer-mouse trajectories) in the construction of fuzzy responses. Despite their differences, both approaches have successfully been adopted to measure psychological constructs [27], to evaluate students' perceptions and feelings [28], to measure gendered beliefs [29], to inspect the experience of perplexity [30], to evaluate the quality of linguistic descriptions [31], to explore physicians' perceptions of mental patients [32], and to evaluate service quality [33] as well as the quality of products [34].

Unlike direct or indirect fuzzy rating, fuzzy conversion scales adopt stochastic or deterministic procedures (e.g., fuzzy systems) to convert crisp rating data - usually collected by means of traditional rating tools (e.g., Likert-type scales) - into fuzzy sets with the aim of improving the scaling procedure. To this end, a number of conversion systems have been proposed, which are mainly based on expert-knowledge, empirical-based, or indirect methods [35]. Among them, expert-knowledge conversion systems use a-priori information to derive fuzzy categories through which crisp data are fuzzified. For instance, [24] proposed an improved Likert-type scale based on a deterministic Mamdani fuzzy system which includes fuzzification and defuzzification steps. On this line, [36] compared a Likert-type scale and three fuzzy conversion scales based on triangular, trapezoidal, and Gaussian fuzzy numbers, respectively.
This type of fuzzy scaling has been widely applied, for instance, in measuring user experience [37], workers' motivation [38], teachers' beliefs about mathematics [39], students' perceptions about learning through a computer algebra system [40], motivation, attention and anxiety [41], job satisfaction [42], and tourists' satisfaction [43, 44, 45], in evaluating healthcare services [46] and educational services [47, 48, 49], and in developing methodologies for service quality analysis [50, 51, 52, 53]. Instead, empirical-based fuzzy conversion methods transform crisp responses into fuzzy data using information gathered directly from the empirical sample of responses. For example, [54] developed a fuzzy system in which fuzzy categories are built from the empirical distribution of Likert-type responses. Similarly, [55] developed a fuzzy system to measure xenophobia through a pollster method and frequency-based fuzzy set assignment. Still, [56, 57] and [58] proposed to generate fuzzy categories via the Dombi intersection of sigmoid-shaped functions based on the most likely, worst, and best values assigned by raters. In a similar way, [59] derived fuzzy numbers using histograms of Likert-type responses and ideal-histogram-based distances for modeling response bias. Finally, indirect methods for fuzzy conversion scales use hybrid systems through which fuzzy data are obtained by means of statistical models which are first fit on empirical crisp data. For instance, [60] proposed an innovative method where CUB models are used as a back-end tool for quantifying the fuzziness of rating responses. Similarly, [35] used ordinal regression in order to generate well-founded fuzzy response categories. [61] proposed a statistically oriented procedure by means of which fuzzy sets are computed using non-parametric spline methods.
On the same line, [62] and [63] used an Item Response Theory model (i.e., the Partial Credit Model) to convert linguistic response categories into fuzzy numbers by means of the estimated IRT parameters.

There have been several attempts to compare fuzzy rating and conversion scales with more traditional rating methods. To this end, comparisons have been made based on hypothesis testing about means [64, 65], descriptive summary measures [25], a ratings-accordance criterion in empirical and simulated contexts [66, 67], and scale reliability [68, 69]. Other research used validated questionnaires to study the differences between traditional and fuzzy rating. For example, [70] used the WHOQOL-BREF questionnaire to compare a standard Likert-type scale, a fuzzy direct scale, and two fuzzy conversion scales. In a similar way, [71] proposed and compared four fuzzy versions of pain intensity scales, namely a fuzzy visual analogue scale, a fuzzy numerical rating scale, a fuzzy qualitative pain scale, and a fuzzy face pain scale.
An IRTree-based model for fuzzy rating
In this section we illustrate our approach to fuzzy rating scales, which is based upon the use of IRTrees as computational models of the response process [17]. In particular, we adopt a two-stage modeling strategy where IRTrees are first fit on rating data and then their estimated parameters are mapped to parametric fuzzy numbers [63]. In so doing, a psychometric model is used to model response data for each rater and item combination, which is in turn used as a building block for representing final ratings in terms of fuzzy numbers.
IRTrees are conditional linear models that represent final rating responses in terms of binary trees. They formalize the response process as a sequence of conditional stages going through the tree to end nodes. Intermediate nodes are defined such that they represent specific cognitive components of the rating process, whereas end nodes represent the possible outcomes of the decision process. Figure 1a depicts the simplest IRTree model for three-point rating scales (0: "perhaps"; 1: "no"; 2: "yes"). It contains two intermediate nodes, one representing the first stage of the response process Z_1 (e.g., answering with uncertainty vs. answering with certainty), with a single outcome (e.g., Y = 0: "perhaps"), and the other representing the second decision stage Z_2 (e.g., answering with certainty), with two possible outcomes (e.g., Y = 2: "yes" vs. Y = 1: "no"). For instance, the probability of an uncertain response (i.e., Y = 0: "perhaps") is simply given by the probability of stopping at the first stage of the decision process, i.e., P(Y = 0) = P(Z_1 = 0; θ_1). By contrast, the probability of a negative response (i.e., Y = 1: "no") is computed as P(Y = 1) = P(Z_1 = 1; θ_1)(1 − P(Z_2 = 1; θ_2)). The simplest case described by Figure 1a is paradigmatic of the cognitive modeling underlying IRTrees [72, 73]. These models assume the rater's response process to be stage-wise: raters would first decide whether or not to provide a definite response (Z_1) and, then, decide on the direction and strength of their answer (Z_2). The latent random variables Z_1 and Z_2 govern the two sub-processes of the rater's response. Similarly, Figure 1b generalizes the two-stage decision tree to the common five-point rating scale (e.g., from 1: "strongly disagree" to 5: "strongly agree").
It contains three decision nodes, one for the uncertain response category (i.e., Y = 3: "neither agree nor disagree"), a second one for the levels of disagreement (i.e., Y = 1: "strongly disagree", Y = 2: "disagree"), and a last one for the levels of agreement (i.e., Y = 4: "agree", Y = 5: "strongly agree"). Probabilities for each response are computed as before. Figures 1c-1d represent two cases of IRTrees for a six-point rating scale. The trees differ in the way they model the middle categories (i.e., Y = 3 and Y = 4). In the first schema (Figure 1c), these are represented independently from the extremes of the scale, as in the two-stage IRTree (Figure 1a). By contrast, the second schema (Figure 1d) places the middle categories in the same branches as the extremes, so as to represent a more graded decision process [74]. There are many possible ways to conceptualize decision processes in terms of IRTrees, and the choice of a particular decision schema depends primarily on research-specific hypotheses [73].

[Figure 1 appears here; panels: (a) IRTree model for three response categories; (b) IRTree model for five response categories; (c) IRTree model for six response categories (schema 1); (d) IRTree model for six response categories (schema 2).]
Figure 1. Examples of IRTree models for modeling response processes in rating scales.

By using an IRT parameterization, IRTrees allow for introducing rater-specific and item-specific components of the response process. Hence, the probability of agreeing or disagreeing with an item can be represented as a function of a rater's latent trait and the specific content of the item [17]. More formally, let i ∈ {1, ..., I} and j ∈ {1, ..., J} be the indices for raters and items, respectively. Then, the final response variable Y_ij ∈ {1, ..., m, ..., M} ⊂ N, with M being the maximum number of response categories, can be decomposed in terms of binary responses using N binary variables Z_ijn ∈ {0, 1}, where n ∈ {1, ..., N} denotes the nodes of the tree. For instance, in Figure 1a, N = 2 and the final response Y_ij = 2 corresponds to the pair (Z_ij1, Z_ij2) = (1, 1). By following the common Rasch representation [17], for a generic pair (i, j) the IRTree consists of the following equations:

η_i ∼ N_N(0, Σ_η)    (1)

π_ijn = P(Z_ijn = 1; θ_n) = exp(η_in + α_jn) / (1 + exp(η_in + α_jn))    (2)

Z_ijn ∼ Ber(π_ijn)    (3)

where θ_n = {α_jn, η_in}, with the arrays α_j ∈ R^N and η_i ∈ R^N denoting the easiness of the item and the rater's latent trait. As is usual in IRT models, the latent traits for the nodes are modeled using an N-variate centered Gaussian distribution with covariance matrix Σ_η. For instance, for the two-stage decision process in Figure 1a, α_j1 indicates the easiness of choosing the right branch of the tree for item j, whereas α_j2 denotes the easiness of providing an affirmative response (Y = 2: "yes"). Similarly, η_i1 indicates the rater's attitude to navigate through the right branch of the tree, whereas η_i2 denotes the rater's attitude to provide an affirmative response. Thus, the probabilities to activate a branch of the tree can be computed using Eq. (2) recursively.
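As a concrete illustration, the recursive use of Eq. (2) for the two-stage tree of Figure 1a can be sketched in a few lines of code (a minimal sketch with made-up parameter values; the function names are ours, not from the paper):

```python
import math

def node_prob(eta, alpha):
    """Eq. (2): probability of taking the '1' branch at a node."""
    return math.exp(eta + alpha) / (1.0 + math.exp(eta + alpha))

def category_probs(eta, alpha):
    """Category probabilities for the two-stage tree of Figure 1a.

    eta, alpha: length-2 sequences (one entry per node).
    Y = 0 'perhaps': stop at node 1 (Z1 = 0)
    Y = 1 'no':      Z1 = 1, Z2 = 0
    Y = 2 'yes':     Z1 = 1, Z2 = 1
    """
    p1 = node_prob(eta[0], alpha[0])  # P(Z1 = 1)
    p2 = node_prob(eta[1], alpha[1])  # P(Z2 = 1)
    return {0: 1.0 - p1, 1: p1 * (1.0 - p2), 2: p1 * p2}

# Hypothetical rater/item parameters:
probs = category_probs(eta=[0.3, -0.2], alpha=[0.1, 0.4])
assert abs(sum(probs.values()) - 1.0) < 1e-12
```

Since the branch probabilities multiply along the path from the root to each leaf, the three category probabilities sum to one by construction.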
For instance, in the two-stage example, the probability of an uncertain response is computed as follows:

P(Y_ij = 0) = P(Z_ij1 = 0; θ_1) = 1 − exp(η_i1 + α_j1) / (1 + exp(η_i1 + α_j1))

To generalize the single-branch probability equations, we first define an M × N Boolean matrix T indicating how each response category (in rows) is associated with each node (in columns) of the tree. With t_mn ∈ {0, 1, NA}, t_mn = 1 indicates that the m-th response category involves node n through its '1' branch, t_mn = 0 indicates that it involves node n through its '0' branch, whereas t_mn = NA indicates that the m-th response category is not connected to the n-th node at all. For instance, considering the simplest two-stage example in Figure 1a, the 3 × 2 mapping matrix T is defined as follows:

T =
  0  NA
  1  0
  1  1

Finally, the probability of a generic rating response can easily be computed as:

P(Y_ij = m) = ∏_{n=1}^{N} P(Z_ijn = t_mn; θ_n)^{δ_mn} = ∏_{n=1}^{N} [ exp(t_mn (η_in + α_jn)) / (1 + exp(η_in + α_jn)) ]^{δ_mn}    (4)

where δ_mn = 0 if t_mn = NA and δ_mn = 1 otherwise.

IRTree models can be estimated either by means of standard methods used for generalized linear mixed models, such as restricted or marginal maximum likelihood [75, 17], or using procedures for multidimensional item response theory models, such as expectation-maximization algorithms [76]. In general, these models are flexible enough to cover simple situations, such as those requiring unidimensional latent variables (a single η shared across the nodes of the tree) or common item effects (a single α shared across the nodes of the tree), as well as more complex scenarios involving multidimensional higher-order latent variables. For further details and implementations, we refer the reader to [17, 76].

A fuzzy set Ã of a universal set A is defined by means of its characteristic function ξ_Ã : A → [0, 1]. It can easily be described as a collection of crisp subsets called α-sets, i.e., Ã_α = {y ∈ A : ξ_Ã(y) ≥ α} with α ∈ (0, 1].
If the α-sets of Ã are all convex sets, then Ã is a convex fuzzy set. The support of Ã is A_0 = {y ∈ A : ξ_Ã(y) > 0} and the core is the set of all its maximal points, A_c = {y ∈ A : ξ_Ã(y) = max_{y ∈ A} ξ_Ã(y)}. If max_{y ∈ A} ξ_Ã(y) = 1, then Ã is a normal fuzzy set. If Ã is a normal and convex subset of R, then Ã is a fuzzy number. The quantity l(Ã) = max A_0 − min A_0 is the length of the support of the fuzzy set Ã. The class of all normal fuzzy numbers is denoted by F(R). Fuzzy numbers can conveniently be represented using parametric models that are indexed by some scalars, such as c (mode) and s (spread or precision). These include a number of shapes like triangular, trapezoidal, Gaussian, and exponential fuzzy sets [77]. A relevant class of parametric fuzzy numbers are the so-called LR-fuzzy numbers [78] and their generalizations, such as non-convex fuzzy numbers [79], flexible fuzzy numbers [57], and beta fuzzy numbers [80, 81, 82]. The latter represent a special class of fuzzy sets that are defined by generalizing triangular fuzzy sets. In particular, let:

ξ_Ã(y) = ((y − y_l)/(c − y_l)) · 1_(y_l,c)(y) + ((y_u − y)/(y_u − c)) · 1_(c,y_u)(y)    (5)

be a triangular fuzzy set, with y_l, y_u, c ∈ R being the lower bound, upper bound, and mode parameters, respectively. Then, a beta fuzzy set is of the form:

ξ_Ã(y) = ((y − y_l)/(c − y_l))^a · ((y_u − y)/(y_u − c))^b · 1_(y_l,y_u)(y)    (6)

c = (a y_u + b y_l)/(a + b)

where y_l, y_u, a, b ∈ R, with y_l and y_u being the lower and upper bounds of the set, and c the mode of the fuzzy set.
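Equations (5)-(6) translate directly into code; the following is a minimal sketch (function names are ours; the indicator functions are handled with explicit range checks):

```python
def triangular(y, yl, c, yu):
    """Triangular membership function, Eq. (5)."""
    if yl < y <= c:
        return (y - yl) / (c - yl)
    if c < y < yu:
        return (yu - y) / (yu - c)
    return 0.0

def beta_fuzzy(y, yl, yu, a, b):
    """Beta fuzzy membership, Eq. (6); the mode is c = (a*yu + b*yl)/(a + b)."""
    if not (yl < y < yu):
        return 0.0
    c = (a * yu + b * yl) / (a + b)
    return ((y - yl) / (c - yl)) ** a * ((yu - y) / (yu - c)) ** b

# Both memberships attain 1 exactly at the mode c:
c = (2.0 * 1.0 + 3.0 * 0.0) / (2.0 + 3.0)  # c = 0.4 for a = 2, b = 3 on [0, 1]
assert abs(beta_fuzzy(c, 0.0, 1.0, 2.0, 3.0) - 1.0) < 1e-12
assert triangular(0.4, 0.0, 0.4, 1.0) == 1.0
```

Setting the first derivative of Eq. (6) to zero confirms that the maximum is reached at c = (a y_u + b y_l)/(a + b), so both sets are normal fuzzy numbers.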
Beta fuzzy numbers can be expressed in terms of mode c ∈ R and precision s ∈ R_+ parameters, as follows (taking y_l = 0 and y_u = 1 without loss of generality):

ξ_Ã(y) = (1/C) y^(a−1) (1 − y)^(b−1)    (7)

a = 1 + cs
b = 1 + s(1 − c)

C = ((a − 1)/(a + b − 2))^(a−1) · (1 − (a − 1)/(a + b − 2))^(b−1)    (8)

with C being a normalizing constant ensuring that ξ_Ã is still a normal fuzzy set. Figure 2 shows some examples of beta fuzzy sets (dashed black curves). Because of their shape, beta-based fuzzy sets can be of particular utility in modeling bounded rating data (e.g., see [83]).

Consider the case where a respondent i is faced with an M-choice item j. In the first stage of the response process, the item content triggers memories and emotions of past personal experiences. Then, these activate the opinion formation stage, where a coherent opinion representation is formed along with a finite set of potential responses U_ij. Lastly, the final response y_ij is chosen by trimming the set of possible responses (selection stage). Decision uncertainty emerges as a result of the conflicting demands of the opinion formation stage, and it can be quantified by analysing some characteristics of U_ij. Our approach resorts to using the latter as a source for mapping fuzzy numbers to the latent rater's response process underlying y_ij. To this end, IRTrees are adopted to estimate a probabilistic model for U_ij as a function of the estimated rater's latent traits η̂_i and item contents α̂_j. In particular, for a given pair (i, j), the following procedure is used to obtain fuzzy rating data:

1. Define and fit an IRTree model to a sample of I × J responses Y and get the estimates η̂ and α̂.

2. Plug η̂ and α̂ into Eq. (4) to get the estimated probability value P̂(Y = m) for each m ∈ {1, ..., M}. This is the probabilistic model for U_ij.

3. Compute the mode of the beta fuzzy number ỹ_ij via the equality:

c_ij = Σ_{y ∈ {1,...,M}} y · P̂(Y = y)    (9)

4.
Compute the precision of the beta fuzzy number ỹ_ij via the equality:

s_ij = 1/v_ij,  with:  v_ij = Σ_{y ∈ {1,...,M}} (y − c_ij)² · P̂(Y = y)    (10)

In this context, ξ_ỹij : Ω(y) → (0, 1], with Ω(y) = (1, M) being the space of the means of Y_ij for each response value. Thus, as with latent responses in psychometric models, fuzzy rating data are continuous and bounded instead of being discrete. Note that the above procedure is quite general and can be extended to the more general case of LR-type fuzzy numbers, such as triangular and trapezoidal ones, by means of any probability-possibility transformation [78] or other general transformations preserving the original information content [84]. For instance, the easiest way to obtain triangular fuzzy numbers from P̂(Y) is to compute the core using Eq. (9), whereas the lower y_l_ij and upper y_u_ij bounds can instead be computed using quantiles, such as y_l_ij = min({y ∈ {1, ..., M} : P̂(y) ≥ τ}) and y_u_ij = max({y ∈ {1, ..., M} : P̂(y) ≥ τ}) for a small threshold τ. Another solution would be to transform beta fuzzy numbers using a kind of moment matching method [85] via the following link equations:

y_l_ij = c_ij − h_2,  y_u_ij = c_ij − h_2 + h_1    (11)

h_1 = sqrt(24 v_ij − 3 (c_ij − µ_ij)²)
h_2 = (1/2)(h_1 + 3 (c_ij − µ_ij))
µ_ij = (1 + c_ij s_ij) / (2 + s_ij)

The procedure yields regular triangular fuzzy sets defined in terms of lower bound y_l, mode c, and upper bound y_u.
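Steps 3-4 of the procedure reduce to two weighted sums over the estimated probabilities; a minimal sketch follows (the probability vectors below are hypothetical illustrations, not estimates from a fitted IRTree):

```python
def fuzzy_irt_map(probs):
    """Map estimated probabilities P(Y = y), y = 1,...,M, to the mode c
    (Eq. 9) and precision s = 1/v (Eq. 10) of a beta fuzzy number."""
    ys = range(1, len(probs) + 1)
    c = sum(y * p for y, p in zip(ys, probs))             # Eq. (9)
    v = sum((y - c) ** 2 * p for y, p in zip(ys, probs))  # Eq. (10)
    return c, 1.0 / v

certain = [0.05, 0.90, 0.05]    # one dominant response category
uncertain = [0.40, 0.35, 0.25]  # competing response categories
c1, s1 = fuzzy_irt_map(certain)
c2, s2 = fuzzy_irt_map(uncertain)
assert s1 > s2  # more decision uncertainty -> lower precision
```

The second pattern mimics a competing-responses scenario: its spread v is larger, hence its precision s is smaller.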
Figure 2. Examples of hypothetical probability distributions (black dashed vertical lines) and associated fuzzy numbers (black dashed curves) for a two-stage IRTree (M = 3 and N = 2); panels: (A) y = 1, (B) y = 3, (C) y = 2, (D) y = 2. Note that probability masses and fuzzy membership functions are overlapped on the same domain Ω(y); red and blue circles represent observed (y) as opposed to most probable responses, respectively.

Figure 2 shows some hypothetical examples of beta fuzzy numbers for a two-stage IRTree with M = 3 and N = 2. As a direct consequence of our modeling approach - which is based upon the use of heterogeneity in the rater's pattern of responses - the final response y_ij may not reflect the mode of the fuzzy response c_ij (or, similarly, other measures like the centroid). This is particularly true in high-uncertainty scenarios where two or more responses compete with each other (see Figure 2-c). Thus, decision uncertainty does not necessarily coincide with the choice of the middle or "don't know" response category of the rating scale. Rather, it arises as a result of the transition probabilities estimated by the IRTree (the easier the transition, the more certain the response). This is the case, for instance, in Figure 2-d, where the middle response category is chosen with little uncertainty.

Simulation study
The aim of this simulation study is to provide an external validity check on the ability of the fuzzy IRT-map to recover decision uncertainty from rating tasks. In particular, our model was contrasted with another IRT model for rating data that uses response times (RTs) as a source for modeling decision uncertainty [14, 86]. It is well established that RTs can be used for measuring several cognitive facets, such as item/question difficulties and participants' performance on rating and choice tasks [87]. Overall, the findings from the psychometric literature suggest that respondents who are very hesitant and uncertain about their final answers take a relatively long time to make their final choice on a rating scale [86]. Conversely, respondents who are quite sure of their responses are generally fast in providing their final choices. As such, RTs can be considered valuable indirect measures of decision uncertainty in rating tasks [88]. In this study we assessed whether the fuzzy IRT-map can retrieve decision uncertainty from rating data as accurately as response times. To this end, we first generated rating data and response times according to a dedicated IRT-RTs model, and then applied the fuzzy IRT-map to the rating data, evaluating to what extent fuzzy numbers computed via the fuzzy IRT-map predict the response times generated by the IRT-RTs model. The whole simulation study was performed on a remote HPC machine based on 16 Intel Xeon E5-2630L v3 1.80 GHz CPUs with 16x4 GB RAM, whereas computations and analyses were performed in the R framework for statistical analyses.

Data generation model. Discrete rating data Y_ij ∈ {1, ..., M} and response times R_ij ∈ (0, ∞) for respondent i ∈ {1, ..., I} and item j ∈ {1, ..., J} were generated according to the following IRT-RTs model [86]:

η_i ∼ N(0, σ²_η),  ω_i ∼ N(0, σ²_ω),  ε_ij ∼ N(0, σ²_ε)

P(Y_ij = m; θ) = exp(Σ_{k=1}^{m} (η_i − α_j)) / Σ_{h=1}^{M} exp(Σ_{k=1}^{h} (η_i − α_j))    (12)

ln r_ij = γ_j + ω_i + (Σ_{m=1}^{M} P(Y_ij = m; θ)²) β_j + ε_ij    (13)

where η_{I×1} and α_{J×1} are respondents' latent traits and item parameters, ω_{I×1} and γ_{J×1} are respondents' speeds and item times, and β_{J×1} are the time intensity parameters which relate the response data submodel in Eq. (12) to the response time submodel in Eq. (13). The term DIFF_ij = Σ_{m=1}^{M} P(Y_ij = m; θ)² in the response time submodel can be interpreted as the difficulty for respondent i to respond to item j (DIFF) and is closely related to the so-called Probability-Difficulty (PD) hypothesis in the IRT literature [14]. The DIFF-based model for RTs states that longer response times occur when DIFF is lower, which is the case where all the M alternatives for item j are equally probable. By contrast, shorter response times are expected when DIFF is higher, the limiting case being a single response category with probability equal to one [86].

Design. The design of the study involved four factors: (i) I ∈ {50, 150, 500}, (ii) J ∈ {5, 15}, (iii) M ∈ {3, 5}, and (iv) two negative levels of the time intensity parameter β. The factors were varied in a complete factorial design with a total of 3 × 2 × 2 × 2 = 24 scenarios. For each combination, B = 1000 samples were generated, which yielded 1000 × 24 = 24000 generated datasets as well as an equivalent number of parameter sets.
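Under our reading of Eqs. (12)-(13) (a node-invariant item parameter, so that Σ_{k=1}^{m}(η_i − α_j) = m(η_i − α_j), and DIFF taken as the sum of squared category probabilities), one replication for a single rater-item pair can be sketched as follows; all numeric values here are illustrative assumptions, not the values used in the study:

```python
import math
import random

def pcm_probs(eta, alpha, M):
    """Eq. (12): rating-category probabilities (partial-credit form
    with a single item parameter alpha)."""
    num = [math.exp(m * (eta - alpha)) for m in range(1, M + 1)]
    z = sum(num)
    return [n / z for n in num]

def gen_response_and_time(eta, alpha, gamma, omega, beta, sigma_eps, M, rng):
    """Draw one rating (Eq. 12) and one log-normal response time (Eq. 13)."""
    p = pcm_probs(eta, alpha, M)
    y = rng.choices(range(1, M + 1), weights=p)[0]
    diff = sum(q * q for q in p)  # DIFF_ij: high when one category dominates
    ln_r = gamma + omega + diff * beta + rng.gauss(0.0, sigma_eps)
    return y, math.exp(ln_r)

rng = random.Random(1)
y, r = gen_response_and_time(eta=0.5, alpha=0.0, gamma=9.0, omega=0.0,
                             beta=-0.5, sigma_eps=0.2, M=5, rng=rng)
assert 1 <= y <= 5 and r > 0.0
```

With a negative beta, a higher DIFF (one dominant category, low uncertainty) shortens the expected log response time, matching the PD hypothesis described above.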
Procedure. Let i_h, j_t, m_p, β_q be distinct levels of the factors I, J, M, β. Then, rating data and response times were generated according to the following procedure:

(a) Respondents' latent traits and speeds were drawn independently as η_{i_h×1} ∼ N(0_{i_h}, I_{i_h×i_h}) and ω_{i_h×1} ∼ N(0_{i_h}, I_{i_h×i_h}).

(b) Item parameters and average response times were generated independently as α_{j_t×1} ∼ N(0_{j_t}, I_{j_t×j_t}) and γ_{j_t×1} ∼ N(9_{j_t}, I_{j_t×j_t}).

(c) For i = 1, ..., i_h and j = 1, ..., j_t, probabilities for each of the m_p response categories were computed using the IRT component of the IRT-RTs model:

P(Y_ij = m) = exp(Σ_{k=1}^{m} (η_i − α_j)) / Σ_{u=1}^{m_p} exp(Σ_{k=1}^{u} (η_i − α_j))

and response data y_ij were drawn from a Multinomial distribution with probabilities equal to P(Y_ij).

(d) Time intensity parameters were generated as β_{j_t×1} ∼ N(β_q 1_{j_t}, I_{j_t×j_t}).

(e) Response times were computed using the second component of the IRT-RTs model, which equals the DIFF-based linear model:

ln r_ij = γ_j + ω_i + DIFF_ij β_j + ε_ij

where DIFF_ij = Σ_{m=1}^{m_p} P(Y_ij = m)² and ε_ij ∼ N(0, σ²_ε), for all i = 1, ..., i_h and j = 1, ..., j_t.

(f) The generated matrices of response data Y_{i_h×j_t} and times R_{i_h×j_t} were analysed using the fuzzy IRT-map. For both the M = 3 and M = 5 cases, the sequential decision tree (see Figure 1a) was adopted. Since α and η were simulated using the simplest model, in which latent traits and item parameters are invariant across nodes (e.g., see [75]), an IRTree with a common latent trait and common item parameters was defined using the IRTrees R library [75]. The glmmTMB R package [89] was used to estimate the model parameters. Once the estimates were obtained, beta fuzzy numbers were computed using the procedure described in Section 3.3, which yielded two new matrices for the modes C_{i_h×j_t} and precisions S_{i_h×j_t} of the fuzzy numbers.

Measures.
For each condition of the study, we assessed whether the rating uncertainty, as recovered by the precision of the fuzzy set, predicted the response times. Thus, response times were dichotomized into fast (r*_ij = 1) and slow (r*_ij = 0) responses by an item-wise median split [90]. Then, for each of the j_t items, a Binomial linear model with logit link was used to predict the Boolean vector r*_{i_h×1} as a function of the precision values s_{i_h×1}. Finally, the predictions of the generalized linear model r̂*_{i_h×1} were compared against the observations r*_{i_h×1} and the average Area Under the Curve (AUC) index was computed as follows:

AUC_avg = (1/j_t) Σ_{j=1}^{j_t} [ (1/B) Σ_{b=1}^{B} AUC(r*_b, r̂*_b)_j ]

The closer AUC_avg is to one, the more accurately the precisions of the fuzzy numbers resemble the response times.
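The measure can be sketched as follows. As a simplifying assumption of ours, the Binomial GLM step is replaced by scoring the fast/slow labels directly with the precision values s; because AUC is invariant under monotone transformations of the scores, this leaves the AUC unchanged up to the sign of the fitted slope:

```python
def median_split(times):
    """Dichotomize response times: 1 = fast (below the median), 0 = slow."""
    srt = sorted(times)
    n = len(srt)
    med = srt[n // 2] if n % 2 else 0.5 * (srt[n // 2 - 1] + srt[n // 2])
    return [1 if t < med else 0 for t in times]

def auc(labels, scores):
    """Rank-based AUC: probability that a random positive case outscores
    a random negative one (ties count one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 * (p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Fast responses (label 1) should go with high precision s:
r_star = median_split([420.0, 910.0, 380.0, 1250.0])  # -> [1, 0, 1, 0]
assert auc(r_star, [9.1, 1.7, 7.4, 2.2]) == 1.0
```

In this toy example the two fast responses carry the two largest precision values, so every positive case outscores every negative one and the AUC equals its maximum of one.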
Results.
Table 1 shows the average AUC index as a function of the simulation condition. Overall, AUC_avg was greater than the threshold for a random classification (AUC_avg = 0.5), which indicated that the precisions of the fuzzy numbers predicted response times better than chance. Predictions were more accurate for the cases with M = 5 response categories and a larger number of items (J = 15). The number of sample units did not affect the accuracy of prediction. As expected, the greatest accuracy was obtained for the cases with the stronger time intensity parameter β, a condition that occurs when the variation of response times is mainly due to the task (as measured by the DIFF term). By and large, these findings suggest that, compared to an RTs-based model for decision uncertainty, fuzzy numbers appropriately encode the uncertainty component associated with the choice of the final response in rating scales.

Table 1: Average AUC index over B = 1000 samples as a function of the simulation conditions (rows: M ∈ {3, 5} × I ∈ {50, 150, 500}; columns: the two levels of β × J ∈ {5, 15}).

In this section we illustrate the features of the proposed approach using two applications to real data that are based on a controlled scenario in which varying levels of decision uncertainty were experimentally controlled. The two studies offer a way to assess the empirical effectiveness of the fuzzy IRT-map in retrieving decision uncertainty from standard rating data.
The effects of faking behaviors on rating data have been widely studied in psychometrics (e.g., see [4, 5, 91, 92]). Faking is defined as a deliberate behavior through which respondents distort their responses towards ones they consider more favorable in order to give overly positive self-descriptions, to dissimulate vocational interests, to simulate physical or psychological symptoms as a way to obtain rewards, or to gain access to advantageous work positions [93]. In all these cases, faking acts as a kind of systematic error which alters the unfolding mechanism of the response process. For instance, in the case of faking-good or faking-bad response styles (i.e., the tendency to use higher or lower response categories in rating procedures, respectively), this results in reduced overall response variability and an increased number of stereotyped answers. Because of these characteristics, faking is a good candidate for studying uncertainty in the rating process. In this application, we used rating data which were collected under honest and instructed faking-good measurement conditions. The aim is to assess to what extent our approach is sensitive enough to detect variations in decision uncertainty arising from honest as opposed to faking response patterns. In particular, we expect to observe decreasing levels of decision uncertainty as response patterns vary from the honest to the faking-good condition.
Data and measures. Data were originally collected and analysed by [5, 94] and refer to a sample of n = 484 undergraduate students (79% females; ages ranged from 18 to 48, with a mean age of 20.61 and a standard deviation of 2.69) at the University of Padua (Italy). They were administered a personality questionnaire, the Perceived Empathic Self-Efficacy Scale (AEP/A) [95], with items scored on a 5-point scale where 1 denotes that she/he "Cannot do at all" and 5 denotes that she/he "Certain can do" the behavior described by the item. The questionnaire was administered in a paper-and-pencil format. Participants were randomly assigned to two groups, one (n = 237) receiving the instruction to answer the questionnaire items as honestly as possible (no faking condition), and the other (n = 247) receiving the instruction to answer using a faking-good response style. Faking-good was induced by letting participants know that a recruitment company was interested in hiring candidates for a very appealing job position and that the questionnaire would be used as a first method of selection. Following the rationale described in [5], for the current analyses we retained a subset of four items only, which guarantees representativeness of the complete item pool, a good factorial structure, and a clear difference between the two groups in response frequencies. The items were as follows:

Q1. When you meet new friends, find out quickly the things they like and those they do not like?
Q2. Recognize if a person is seriously annoyed with you?
Q3. Understand the state of mind of others when you are very involved in a discussion?
Q4. Understand when a friend needs your help, even if he/she doesn't overtly ask for it?

Data analyses and results. Table 2 shows the observed frequencies for the four items in the honest (H) and faking (F) conditions, as well as the mean response value computed over the five categories. As expected, items in the faking condition showed increased frequencies of the response categories associated with positive responses (i.e., Y ∈ {4, 5}) as compared to items in the honest condition. A typical IRTree model for 5-point rating scales was defined and adapted to both groups (see Figure 1c). In this case, the decision structure was defined using three nodes, which represent the rating situation where answering with the extreme points of the scale (Y ∈ {1, 2, 4, 5}) is contrasted with the uncertain response category (Y = 3) [17]. Thus, the IRTree model implied four item parameters and three latent traits, with the last trait being the same for lower and higher extreme responses. The model structure was defined using the IRTrees
R library, whereas item and person parameters were estimated via marginal maximum likelihood as implemented in the glmmTMB R package [89]. Overall, model fits showed good accuracy in terms of observed as opposed to predicted misclassification error (AUC) in both conditions. Tables 3-4 show the estimated model parameters for both the honest and faking conditions. As expected, the probability of activating the right branch of the nodes increased in the faking condition, especially for nodes 1 and 2. Similarly, latent traits were more strongly correlated in the faking condition than in the honest condition. Once the model parameters had been estimated, fuzzy beta numbers for both the honest and faking groups were computed using the procedure given in Section 3.3. Thus, for each of the four items, we obtained n = 484 fuzzy numbers expressed in terms of mode (m) and precision (s). Figure 3 shows an exemplary set of reconstructed fuzzy numbers. In order to compare the honest and faking conditions with regard to the decision uncertainty recovered by the fuzzy beta numbers, in addition to mode (m) and precision (s) we also computed the fuzzy cardinality |Ã| = ∫ ξ_Ã(y) dy and the fuzzy centroid (the mean of the fuzzy set). Figure 4 shows the distribution of these measures for both experimental conditions. As expected, fuzzy numbers in the faking condition showed higher precision and smaller cardinality as compared to the honest case. Similarly, modes and centroids increased in the faking condition, which is in agreement with previous results on faking experiments [5, 94]. Overall, the reconstructed fuzzy numbers behave in accordance with the faking-good manipulation, which implied a reduction of the rating uncertainty and the choice of high rating scores. This was reflected by a marked increase in precision (s) as well as a decrease in the size of the fuzzy sets (fuzzy cardinality).

Moral dilemmas are emotionally salient scenarios in which an agent ought to adopt one of two mutually exclusive alternatives that differ in terms of violation of essential moral principles.
           Y=1   Y=2   Y=3   Y=4   Y=5   mean response
item1 (H)  0.00  0.08  0.58  0.32  0.02  3.27
item1 (F)  0.00  0.03  0.48  0.44  0.04  3.50
item2 (H)  0.00  0.05  0.30  0.51  0.14  3.75
item2 (F)  0.00  0.04  0.22  0.54  0.20  3.90
item3 (H)  0.02  0.20  0.39  0.33  0.06  3.22
item3 (F)  0.01  0.13  0.36  0.38  0.11  3.45
item4 (H)  0.00  0.04  0.22  0.60  0.14  3.84
item4 (F)  0.00  0.01  0.20  0.52  0.27  4.04

Table 2: Case study 1: Observed frequency tables as a function of item number and type of group (H: honest group; F: faking group).

Table 3: Estimates (θ̂) and standard errors (σ_θ̂) for item parameters in the honest (H) and faking (F) conditions.

Table 4: Estimates and standard deviations (σ̂_η) for latent traits in the honest (H) and faking (F) conditions.

Figure 3. Case study 1: Fuzzy beta numbers (black curves) and estimated probabilities of response categories (black dashed vertical lines) for some raters of the honest (A-C panels) and faking (D-F panels) groups. Note that probability masses and fuzzy sets are overlapped on the same domain Ω(y); red and blue circles represent the observed response (y) and the fuzzy mode (m), respectively.

Figure 4. Case study 1: Distribution of summary statistics (mode, precision, cardinality, centroid) for fuzzy numbers computed for each participant in the honest (light brown) and faking (orange) conditions. Note that plots in the first row show the observed frequencies as a function of the experimental conditions.

Typical moral dilemmas include, for instance, the choice between letting one person die when that is necessary to save five others (footbridge), the choice of smothering a supposedly incurable patient with a pillow in order to get the patient's life insurance (smother for dollars), the choice of handing over one of two children to a doctor for painful experiments (Sophie's choice), and the choice of killing a healthy man to transplant his organs and save five other patients (transplant) [96]. In all these cases, the choice between the lesser of two evils involves a tangled web of cognitive and emotional reactions that result in high levels of decision uncertainty. Because of these characteristics, moral dilemmas can serve as a framework for studying how ratings behave as a function of decision uncertainty. In this application, we used two moral dilemmas, i.e., footbridge and transplant, and assessed how they impacted the intensity of raters' negative emotions towards the scenario's protagonist. In both dilemmas, the protagonist must choose between the sacrifice of one person (a stranger in the footbridge case, a victim's physician in transplant) in order to save a larger group. However, these scenarios differ because of an additional role conflict that results from the different method of killing [97]: while in footbridge the perpetrator is an anonymous pedestrian with no relationship to the victim, in transplant the perpetrator is a doctor with moral duties. As such, we expect a higher degree of uncertainty in assessing negative emotions for the transplant case as opposed to footbridge.

Data and measures. Data were originally collected by [97] in a large project assessing many aspects of moral decision making, including several cognitive scales and personality surveys. For the purposes of this study, we selected a subset of the entire dataset. The final sample consisted of n = 500 participants (54% females; ages ranged from 18 to 58, with a mean age of 25.06 and standard deviation of 3.96), mainly composed of German speakers. They read both dilemma scenarios and rated the intensity of their negative emotions toward the scenario's protagonist using a 5-point scale. A total of four emotional items was presented along with the question "When I think of the protagonist and his/her decision, I feel [disappointment, disgust, contempt, anger]". The texts used for the moral dilemmas were as follows:

Footbridge. A runaway trolley with malfunctioning brakes is heading down the tracks towards a group of five workmen. A pedestrian observes this from a footbridge. If nothing is done the trolley will overrun and kill the five workmen. The only way for the pedestrian to avoid the deaths of the five workmen is to push a large stranger who is standing next to him off the bridge onto the tracks below, where his large body will stop the trolley but which will also kill the stranger. Outcome: The pedestrian decided to push the stranger off the bridge. Due to this decision the trolley was stopped and the five workmen were saved; but the stranger was killed.
Transplant. Five patients are treated in a hospital, each of whom is in critical condition due to organ failure. A healthy man consults the head physician for a routine checkup. If nothing is done the five patients will die due to a shortage of available transplants. The only way for the head physician to save the lives of the five patients is to kill the healthy man (against his will) and to transplant his organs into the bodies of the other five patients. Outcome: The head physician decided to kill the healthy man and to transplant the organs. Due to this decision five patients were saved; but the healthy man was killed.
Data analysis and results. Two IRTree models with sequential structure (see Figure 1a) were separately defined and adapted to the footbridge (F) and transplant (T) data. The IRT models required J = 4 item parameters and N = 4 nodes and latent traits. The model structure was defined using the IRTrees R library, whereas model parameters were estimated via marginal maximum likelihood as implemented in the glmmTMB R package [89]. Tables 5-6 show the estimated model parameters for both the footbridge and transplant scenarios. Once the model parameters had been estimated, fuzzy beta numbers were computed using the procedure given in Section 3.3. The final models showed a satisfactory fit in terms of AUC for both scenarios. Finally, for each of the four items, we obtained n = 500 fuzzy numbers expressed in terms of mode (m) and precision (s). As in the first case study, footbridge and transplant were compared in terms of modes, precisions, fuzzy cardinalities, and fuzzy centroids. Figure 5 shows the distribution of these measures for both moral scenarios. As expected, unlike the footbridge scenario, ratings in transplant were characterized by higher levels of decision uncertainty. Overall, fuzzy numbers showed larger modes and centroids, precisions of the fuzzy sets were higher in median and more variable, and fuzzy cardinalities were smaller in median. Finally, Figure 6 shows a subset of estimated fuzzy beta numbers for both dilemma scenarios. We can observe that fuzzy sets for transplant showed larger support than fuzzy sets associated with footbridge. Interestingly, because of the different levels of decision uncertainty underlying the rating responses, the estimated modes often differ from the observed final responses.

Table 5: Estimates (θ̂) and standard errors (σ_θ̂) of item parameters in the footbridge (F) and transplant (T) scenarios.

Table 6: Estimates and standard deviations (σ̂_η) for latent traits in the footbridge (F) and transplant (T) scenarios.

In this paper we described a novel procedure to represent rating responses in terms of fuzzy numbers. Similarly to other types of fuzzy conversion scales, our approach follows a two-step process by means of which fuzzy numbers are computed on the basis of a previously estimated psychometric model for rating data.
To this end, Item Response Theory-based trees (IRTrees) have been used, which provide a formal representation of the stage-wise cognitive process of answering survey questions [72, 98]. Unlike traditional IRT models, IRTrees allow for a flexible modeling of the rating response, where item contents relate to latent traits by means of a priori specified response styles, which include decision nodes for the tendency to choose moderate as opposed to extreme response categories, as well as for the tendency to agree versus disagree with a given item content. As a consequence, the fuzziness of rating responses has been recovered from the characteristics of the rater's response pattern y_i, instead of being computed as a byproduct of item-based direct rating. This offered a coherent meaning system in which fuzzy responses ỹ_i can be interpreted in terms of the decision uncertainty that characterized the rater's response process. To this end, although other types of fuzzy sets have been suggested for rating data (e.g., triangular, trapezoidal [25, 69]), we resorted to two-parameter fuzzy beta numbers, since beta-like models have been proved to adequately represent the asymmetry of bounded rating data [83].

Figure 5. Case study 2: Distribution of summary statistics (mode, precision, cardinality, centroid) for fuzzy numbers computed for each participant in the footbridge (light brown) and transplant (orange) scenarios. Note that plots in the first row show the observed frequencies as a function of the dilemma scenarios.

Figure 6. Case study 2: Fuzzy beta numbers for some raters of the footbridge (light brown curves) and transplant (orange curves) scenarios along with their estimated modes (filled circles). Note that y_F and y_T indicate the observed crisp responses for footbridge and transplant, respectively.

Simulation and real case studies were adopted to evaluate the characteristics and properties of our proposal. In particular, the simulation study was designed to provide converging results about the effectiveness of our proposal in recovering rating decision uncertainty. To this purpose, a controlled scenario was used and our model was contrasted against a standard IRT-RTs model, which uses response times (RTs) to quantify decision uncertainty in rating responses [86]. The results showed the ability of the fuzzy IRT-map to detect decision uncertainty when it is present in rating data. This was also confirmed by the results of the two case studies, which involved two empirical situations characterized by ratings under uncertainty.

Some advantages of the proposed fuzzy IRT-map are as follows. First, since the procedure does not require a dedicated measurement setting, it is applicable over a wide range of survey situations, including different rating formats (e.g., Likert-type, forced-choice, funnel response-format) [73, 99]. Second, it avoids using direct rating scales, which can often provide distorted responses because of cognitive biases underlying numerical and intensity estimation [100]. Third, it uses a flexible psychometric model to represent the cognitive stages of the response process, which can each time be adapted by researchers to model specific rating situations.
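As an illustration of how such beta fuzzy numbers and their summaries (cardinality, centroid) can be computed, consider the following sketch. The mode-precision parameterization used here (a = 1 + s·m, b = 1 + s·(1 − m), membership rescaled to peak at 1) is an assumption for illustration and need not coincide with the exact definition of Section 3.3:

```python
def beta_membership(y, m, s):
    """Beta-type membership with mode m in (0, 1) and precision s > 0.
    Assumed parameterization: a = 1 + s*m, b = 1 + s*(1 - m), rescaled
    so that the membership equals 1 at the mode."""
    a, b = 1 + s * m, 1 + s * (1 - m)
    peak = m ** (a - 1) * (1 - m) ** (b - 1)
    return (y ** (a - 1) * (1 - y) ** (b - 1)) / peak

def cardinality(m, s, n=10000):
    """Fuzzy cardinality |A| = integral of the membership over (0, 1),
    approximated by the midpoint rule."""
    h = 1.0 / n
    return h * sum(beta_membership((k + 0.5) * h, m, s) for k in range(n))

def centroid(m, s, n=10000):
    """Fuzzy centroid: membership-weighted mean of the support."""
    h = 1.0 / n
    num = h * sum((k + 0.5) * h * beta_membership((k + 0.5) * h, m, s)
                  for k in range(n))
    return num / cardinality(m, s, n)
```

Consistent with the case-study findings, increasing the precision s shrinks the fuzzy set: cardinality(0.5, 20) is smaller than cardinality(0.5, 5), while for symmetric sets the centroid stays close to the mode.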
Moreover, the use of a statistical model as a first processing step allows fuzzy responses to be computed on a kind of denoised data.

However, as with other statistically based fuzzy quantification procedures, the proposed fuzzy IRT-map can potentially suffer from some limitations. For instance, as it is based on a psychometric model for rating responses, the sample size or the number of items should be large enough to provide reliable estimates θ̂ = {α̂, η̂} [101, 102]. In addition, the hypothesized IRTree rating model should also be valid for the sample being analyzed. For instance, in empirical cases for which a rating model cannot be determined in advance, it may be advisable to define and test several IRTrees, the best of which can be chosen by means of the minimum Akaike Information Criterion (AIC) [17]. Similarly, for studies involving huge samples, MCMC-based algorithms should be preferred over standard marginal maximum likelihood-based algorithms for estimating IRTrees [103]. To this end, several methods and implementations are available nowadays (e.g., see [104]).

Our proposal may be extended in several ways. For instance, IRTree models including response times in the computation of raters' decision uncertainty [14] may also be adopted, and generalized fuzzy numbers may be used accordingly [57, 105]. In conclusion, modeling uncertainty in rating data is a crucial task in all those research contexts involving human subjects as a source of information, such as social surveys, formative and teaching evaluation, decision support systems, quality control, psychological assessment, medical and health decision making, military promotion screening, etc. We believe that our proposal may offer an ecological but reliable procedure to address the problem of measuring subjective evaluations.

References

[1] Eunike Wetzel and Samuel Greiff. The world beyond rating scales.
European Journal of Psychological Assessment, 34(1):1–5, 2018.
[2] Adrian Furnham. Response bias, social desirability and dissimulation. Personality and Individual Differences, 7(3):385–400, 1986.
[3] Adam W Meade and S Bartholomew Craig. Identifying careless responses in survey data. Psychological Methods, 17(3):437, 2012.
[4] Michael Eid and Michael J Zickar. Detecting response styles and faking in personality and organizational assessments by mixed Rasch models. In Multivariate and Mixture Distribution Rasch Models, pages 255–270. Springer, 2007.
[5] Luigi Lombardi, Massimiliano Pastore, Massimo Nucci, and Andrea Bobbio. SGR modeling of correlational effects in fake good self-report measures. Methodology and Computing in Applied Probability, 17(4):1037–1055, 2015.
[6] Carolyn C Preston and Andrew M Colman. Optimal number of response categories in rating scales: reliability, validity, discriminating power, and respondent preferences. Acta Psychologica, 104(1):1–15, 2000.
[7] Jonathan Rabinowitz, Nina R Schooler, Brianne Brown, Mads Dalsgaard, Nina Engelhardt, Gretchen Friedberger, Bruce J Kinon, Daniel Lee, Felice Ockun, Atul Mahableshwarkar, et al. Consistency checks to improve measurement with the Montgomery-Asberg Depression Rating Scale (MADRS). Journal of Affective Disorders, 256:143–147, 2019.
[8] Timothy Johnson, Patrick Kulesa, Young Ik Cho, and Sharon Shavitt. The relation between culture and response styles: Evidence from 19 countries. Journal of Cross-Cultural Psychology, 36(2):264–277, 2005.
[9] Philip J Rosenbaum and Jaan Valsiner. The un-making of a method: From rating scales to the study of psychological processes. Theory & Psychology, 21(1):47–65, 2011.
[10] Ozlem Ozkok, Michael J. Zyphur, Adam P. Barsky, Max Theilacker, M. Brent Donnellan, and Frederick L. Oswald. Modeling measurement as a sequential process: Autoregressive confirmatory factor analysis (AR-CFA). Frontiers in Psychology, 10, 2019.
[11] Boaz Shulruf, John Hattie, and Robyn Dixon. Factors affecting responses to Likert type questionnaires: introduction of the ImpExp, a new comprehensive model. Social Psychology of Education, 11(1):59–78, 2008.
[12] Roger Tourangeau, Lance J Rips, and Kenneth Rasinski. The Psychology of Survey Response. Cambridge University Press, 2000.
[13] Norbert Schwarz and Daphna Oyserman. Asking questions about behavior: Cognition, communication, and questionnaire construction. The American Journal of Evaluation, 22(2):127–160, 2001.
[14] Pere J Ferrando and Urbano Lorenzo-Seva. A measurement model for Likert responses that incorporates response time. Multivariate Behavioral Research, 42(4):675–706, 2007.
[15] Kaiwen Man, Jeffery R. Harring, Yunbo Ouyang, and Sarah L. Thomas. Response time based nonparametric Kullback-Leibler divergence measure for detecting aberrant test-taking behavior. International Journal of Testing, 18(2):155–177, 2018.
[16] John Zaller and Stanley Feldman. A simple theory of the survey response: Answering questions versus revealing preferences. American Journal of Political Science, pages 579–616, 1992.
[17] Paul De Boeck and Ivailo Partchev. IRTrees: Tree-based item response models of the GLMM family.
Journal of Statistical Software, 48(Code Snippet 1), 2012.
[18] Pere J Ferrando and Cristina Anguiano-Carrasco. Assessing the impact of faking on binary personality measures: An IRT-based multiple-group factor analytic procedure. Multivariate Behavioral Research, 44(4):497–524, 2009.
[19] Cheng-Han Leng, Hung-Yu Huang, and Grace Yao. A social desirability item response theory model: Retrieve–deceive–transfer. Psychometrika, 85(1):56–74, 2019.
[20] Michael Schulte-Mecklenbeck, Anton Kühberger, and Joseph G Johnson. A Handbook of Process Tracing Methods for Decision Research: A Critical Review and User's Guide. Psychology Press, 2011.
[21] Antonio Calcagnì and L Lombardi. Dynamic fuzzy rating tracker (DYFRAT): a novel methodology for modeling real-time dynamic cognitive processes in rating scales. Applied Soft Computing, 24:948–961, 2014.
[22] Sara de la Rosa de Sáa, María Ángeles Gil, Gil Gonzalez-Rodriguez, María Teresa López, and María Asunción Lubiano. Fuzzy rating scale-based questionnaires and their statistical analysis. IEEE Transactions on Fuzzy Systems, 23(1):111–126, 2014.
[23] Tim Hesketh, Robert Pryor, and Beryl Hesketh. An application of a computerized fuzzy graphic rating scale to the psychological measurement of individual differences. International Journal of Man-Machine Studies, 29(1):21–35, 1988.
[24] Paothai Vonglao. Application of fuzzy logic to improve the Likert scale to measure latent variables. Kasetsart Journal of Social Sciences, 38(3):337–344, 2017.
[25] María Asunción Lubiano, Sara de la Rosa de Sáa, Manuel Montenegro, Beatriz Sinova, and María Ángeles Gil. Descriptive analysis of responses to items in questionnaires. Why not using a fuzzy rating scale? Information Sciences, 360:131–148, 2016.
[26] María Ángeles Gil and Gil González-Rodríguez. Fuzzy vs. Likert scale in statistics. In Combining Experimentation and Theory, pages 407–420. Springer, 2012.
[27] Concepción San Luis Costas, Pedro Prieto Maranon, and Juan A Hernandez Cabrera. Application of diffuse measurement to the evaluation of psychological structures. Quality and Quantity, 28(3):305–313, 1994.
[28] Itziar García-Honrado, Miquel Ferrer, and Angela Blanco-Fernandez. A tentative fuzzy assessment of the quality of teaching and opportunities to learn mathematics in a classroom discussion. In . Atlantis Press, 2015.
[29] Ana M Castaño, M Asunción Lubiano, and Antonio L García-Izquierdo. Gendered beliefs in STEM undergraduates: A comparative analysis of fuzzy rating versus Likert scales.
Sustainability, 12(15):6227, 2020.
[30] Inés M Gómez-Chacón. Emotions and heuristics: The state of perplexity in mathematics. ZDM, 49(3):323–338, 2017.
[31] Patricia Conde-Clemente, Jose M Alonso, Éldman O Nunes, Angel Sanchez, and Gracian Trivino. New types of computational perceptions: Linguistic descriptions in deforestation analysis. Expert Systems with Applications, 85:46–60, 2017.
[32] María Asunción Lubiano Gómez, Pilar González Gil, Helena Sánchez Pastor, Carmen Pradas, Henar Arnillas, et al. An incipient fuzzy logic-based analysis of the medical specialty influence on the perception about mental patients. The Mathematics of the Uncertain: A Tribute to Pedro Gil, 2018.
[33] Adrian Castro-Lopez and Jose M Alonso. Modeling human perceptions in e-commerce applications: A case study on business-to-consumers websites in the textile and fashion sector. In Applying Fuzzy Logic for the Digital Economy and Society, pages 115–134. Springer, 2019.
[34] Ana Belén Ramos-Guajardo, Ángela Blanco-Fernández, and Gil González-Rodríguez. Applying statistical methods with imprecise data to quality control in cheese manufacturing. In Soft Modeling in Industrial Manufacturing, pages 127–147. Springer, 2019.
[35] Qing Li. Indirect membership function assignment based on ordinal regression. Journal of Applied Statistics, 43(3):441–460, 2016.
[36] Yuan Horng Lin and Jeng Ming Yih. Comparisons on reliability of Likert scale between crisp and fuzzy data. In Applied Mechanics and Materials, volume 635, pages 874–877. Trans Tech Publ, 2014.
[37] Jyh-Rong Chou. A psychometric user experience model based on fuzzy measure approaches. Advanced Engineering Informatics, 38:794–810, 2018.
[38] Muluken Yeheyis, Bahareh Reza, Kasun Hewage, Janaka Y Ruwanpura, and Rehan Sadiq. Evaluating motivation of construction workers: A comparison of fuzzy rule-based model with the traditional expectancy theory. Journal of Civil Engineering and Management, 22(7):862–873, 2016.
[39] M.A. Lazim and M.T. Abu Osman. Measuring teachers' beliefs about mathematics: a fuzzy set approach. International Journal of Social Sciences, 4(1):39–43, 2009.
[40] M.A. Lazim, M.T. Abu Osman, and W.A. Wan Salihin. Fuzzy set conjoint model in describing students' perceptions on computer algebra system learning environment. International Journal of Computer Science Issues (IJCSI), 8(2):92, 2011.
[41] Konul Memmedova. Quantitative analysis of effect of Pilates exercises on psychological variables and academic achievement using fuzzy logic.
Quality & Quantity, 52(1):195–204, 2018.
[42] Rahib H Abiyev, Tulen Saner, Serife Eyupoglu, and Gunay Sadikoglu. Measurement of job satisfaction using fuzzy sets. Procedia Computer Science, 102:294–301, 2016.
[43] Pierpaolo D'Urso, Marta Disegna, Riccardo Massari, and Linda Osti. Fuzzy segmentation of postmodern tourists. Tourism Management, 55:297–308, 2016.
[44] Marta Disegna, Pierpaolo D'Urso, and Riccardo Massari. Analysing cluster evolution using repeated cross-sectional ordinal data. Tourism Management, 69:524–536, 2018.
[45] Pierpaolo D'Urso, Marta Disegna, and Riccardo Massari. Satisfaction and tourism expenditure behaviour. Social Indicators Research, pages 1–26, 2020.
[46] Mehmet Ozer Demir, Murat Alper Basaran, and Biagio Simonetti. Determining factors affecting healthcare service satisfaction utilizing fuzzy rule-based systems. Journal of Applied Statistics, 43(13):2474–2489, 2016.
[47] Toni Lupo. A fuzzy SERVQUAL based method for reliable measurements of education quality in the Italian higher education area. Expert Systems with Applications, 40(17):7096–7110, 2013.
[48] Dian-Fu Chang, An Chen Chiu, and Berlin Wu. Fuzzy correlation among student engagement and interpersonal interactions. ICIC Express Letters, Part B: Applications, 9(1):17–22, 2018.
[49] Shahid Hussain, Prashant K Jamwal, Muhammad T Munir, and Aigerim Zuyeva. A quasi-qualitative analysis of flipped classroom implementation in an engineering course: from theory to practice. International Journal of Educational Technology in Higher Education, 17(1):1–19, 2020.
[50] Hong Tau Lee and Sheu Hua Chen. Using Cpk index with fuzzy numbers to evaluate service quality. International Transactions in Operational Research, 9(6):719–730, 2002.
[51] Ming-Tien Tsai, Hsueh-Liang Wu, and Wen-Ko Liang. Fuzzy decision making for market positioning and developing strategy for improving service quality in department stores. Quality & Quantity, 42(3):303–319, 2008.
[52] Hung-Tso Lin. Fuzzy application in service quality analysis: An empirical study. Expert Systems with Applications, 37(1):517–526, 2010.
[53] Hsiu-Yuan Hu, Yu-Cheng Lee, and Tieh-Min Yen. Service quality gaps analysis based on fuzzy linguistic SERVQUAL with a case study in hospital out-patient services. The TQM Journal, 2010.
[54] Michele Lalla, Gisella Facchinetti, and Giovanni Mastroleo. Ordinal scales and fuzzy set systems to measure agreement: an application to the evaluation of teaching activity. Quality and Quantity, 38(5):577–601, 2005.
[55] Maria Symeonaki and Aggeliki Kazani. Developing a fuzzy Likert scale for measuring xenophobia in Greece.
ASMDA, Rome , 2011.[56] Zsuzsanna E Tóth, Gábor Árva, and Rita V Dénes. Are the ‘illnesses’ of traditional likertscales treatable?
Quality Innovation Prosperity , 24(2):120–136, 2020.[57] Zsuzsanna E Tóth, Tamás Jónás, and Rita Veronika Dénes. Applying flexible fuzzy numbersfor evaluating service features in healthcare–patients and employees in the focus.
Total QualityManagement & Business Excellence , 30(sup1):S240–S254, 2019.[58] Tamás Jónás, Zsuzsanna Eszter Tóth, and Gábor Árva. Applying a fuzzy questionnaire in apeer review process.
Total Quality Management & Business Excellence , 29(9-10):1228–1245,2018.[59] Jan Stoklasa, Tomáš Talášek, and Pasi Luukka. Fuzzified likert scales in group multiple-criteriaevaluation. In
Soft computing applications for group decision-making and consensus modeling ,pages 165–185. Springer, 2018.[60] Elvira Di Nardo and Rosaria Simone. A model-based fuzzy analysis of questionnaires.
Statis-tical Methods & Applications , 28(2):187–215, 2019.[61] Donata Marasini, Piero Quatto, and Enrico Ripamonti. Evaluating university courses: intu-itionistic fuzzy sets with spline functions modelling.
Statistica & Applicazioni , 15(1), 2017.[62] Sen-Chi Yu and Min-Ning Yu. Fuzzy partial credit scaling: A valid approach for scoringthe beck depression inventory.
Social Behavior and Personality: an international journal ,35(9):1163–1172, 2007.[63] Sen-Chi Yu and Berlin Wu. Fuzzy item response model: a new approach to generate member-ship function to score psychological measurement.
Quality and Quantity , 43(3):381, 2009.[64] María Asunción Lubiano, Manuel Montenegro, Beatriz Sinova, Sara de la Rosa de Sáa, andMaría Ángeles Gil. Hypothesis testing for means in connection with fuzzy rating scale-baseddata: algorithms and applications.
European Journal of Operational Research , 251(3):918–929,2016.[65] María Asunción Lubiano, Antonia Salas, Carlos Carleos, Sara de la Rosa de Sáa, and María Án-geles Gil. Hypothesis testing-based comparative analysis between rating scales for intrinsicallyimprecise data.
International Journal of Approximate Reasoning , 88:128–147, 2017.[66] María Asunción Lubiano, Antonia Salas, Sara de la Rosa de Sáa, Manuel Montenegro, andMaría Ángeles Gil. An empirical analysis of the coherence between fuzzy rating scale-andlikert scale-based responses to questionnaires. In
International Conference on Soft Methods inProbability and Statistics , pages 329–337. Springer, 2016.2867] Irene Arellano, Beatriz Sinova, Sara de la Rosa de Sáa, María Asunción Lubiano, andMaría Ángeles Gil. Descriptive comparison of the rating scales through different scale es-timates: Simulation-based analysis. In
International Conference Series on Soft Methods inProbability and Statistics , pages 9–16. Springer, 2018.[68] Ana Belén Ramos Guajardo, María José González López, and Ignacio González Ruiz. Analysisof the reliability of the fuzzy scale for assessing the students’ learning styles in mathematics.In , pages 727–733. Atlantis Press, 2015.[69] María Asunción Lubiano, Antonio L García-Izquierdo, and María Ángeles Gil. Fuzzy ratingscales: Does internal consistency of a measurement scale benefit from coping with imprecisionand individual differences in psychological rating?
Information Sciences , 2020.[70] Po-Yi Chen and Grace Yao. Measuring quality of life with fuzzy numbers: in the perspec-tives of reliability, validity, measurement invariance, and feasibility.
Quality of Life Research ,24(4):781–785, 2015.[71] Ernesto Araujo and Susana Abe Miyahira. Unidimensional fuzzy pain intensity scale. In , pages 185–190. IEEE, 2009.[72] Ulf Böckenholt. Modeling multiple response processes in judgment and choice.
Decision ,1(S):83–103, 2013.[73] Ulf Böckenholt. Measuring response styles in likert items.
Psychological Methods , 22(1):69–83,2017.[74] Thorsten Meiser, Hansjörg Plieninger, and Mirka Henninger. IRT ree models with ordinal andmultidimensional decision nodes for response styles and trait-based rating responses.
BritishJournal of Mathematical and Statistical Psychology , 72(3):501–516, feb 2019.[75] Paul De Boeck, Marjan Bakker, Robert Zwitser, Michel Nivard, Abe Hofman, Francis Tuer-linckx, Ivailo Partchev, et al. The estimation of item response models with the lmer functionfrom the lme4 package in r.
Journal of Statistical Software , 39(12):1–28, 2011.[76] Minjeong Jeon and Paul De Boeck. A generalized item response tree model for psychologicalassessments.
Behavior Research Methods , 48(3):1070–1085, jul 2015.[77] Kwang Hyung Lee.
First course on fuzzy theory and applications . Springer Science & BusinessMedia, 2004.[78] Didier Dubois and Henri Prade.
Fundamentals of fuzzy sets , volume 7. Springer Science &Business Media, 2012. 2979] Antonio Calcagnì, Luigi Lombardi, and Eduardo Pascali. Non-convex fuzzy data and fuzzystatistics: a first descriptive approach to data analysis.
Soft Computing , 18(8):1575–1588,2014.[80] Adel M Alimi. Beta neuro-fuzzy systems.
TASK Quarterly Journal, Special Issue on" NeuralNetworks , 7(1):23–41, 2003.[81] Nesrine Baklouti, Ajith Abraham, and Adel M Alimi. A beta basis function interval type-2 fuzzy neural network for time series applications.
Engineering Applications of ArtificialIntelligence , 71:259–274, 2018.[82] William E Stein. Fuzzy probability vectors.
Fuzzy sets and Systems , 15(3):263–267, 1985.[83] Sonia Migliorati, Agnese Maria Di Brisco, Andrea Ongaro, et al. A new regression model forbounded responses.
Bayesian Analysis , 13(3):845–872, 2018.[84] Efendi N Nasibov and Sinem Peker. On the nearest parametric approximation of a fuzzynumber.
Fuzzy Sets and Systems , 159(11):1365–1375, 2008.[85] T. M. Williams. Practical use of distributions in network analysis.
Journal of the OperationalResearch Society , 43(3):265–270, mar 1992.[86] Xiang-Bin Meng, Jian Tao, and Ning-Zhong Shi. An item response model for likert-type datathat incorporates response time in personality measurements.
Journal of Statistical Computa-tion and Simulation , 84(1):1–21, 2014.[87] Patrick C Kyllonen and Jiyun Zu. Use of response time for measuring cognitive ability.
Journalof Intelligence , 4(4):14, 2016.[88] Christopher Donkin and Scott D Brown. Response times and decision-making.
Stevens’Handbook of Experimental Psychology and Cognitive Neuroscience , 5:1–33, 2018.[89] Mollie E. Brooks, Kasper Kristensen, Koen J. van Benthem, Arni Magnusson, Casper W. Berg,Anders Nielsen, Hans J. Skaug, Martin Maechler, and Benjamin M. Bolker. glmmTMB bal-ances speed and flexibility among packages for zero-inflated generalized linear mixed modeling.
The R Journal , 9(2):378–400, 2017.[90] Dylan Molenaar and Paul de Boeck. Response mixture modeling: Accounting for heterogeneityin item characteristics across response times. psychometrika , 83(2):279–297, 2018.[91] Michael J Zickar. Modeling faking on personality tests. 2000.[92] Philseok Lee, Seang-Hwane Joo, and Shea Fyffe. Investigating faking effects on the constructvalidity through the monte carlo simulation study.
Personality and Individual Differences ,150:109491, 2019. 3093] Michael J Zickar, Robert E Gibby, and Chet Robie. Uncovering faking samples in applicant,incumbent, and experimental data sets: An application of mixed-model item response theory.
Organizational Research Methods , 7(2):168–190, 2004.[94] Massimiliano Pastore, Massimo Nucci, Andrea Bobbio, and Luigi Lombardi. Empirical sce-narios of fake data analysis: The sample generation by replacement (sgr) approach.
Frontiersin psychology , 8:482, 2017.[95] Gian Vittorio Caprara.
La valutazione dell’autoefficacia. Costrutti e strumenti . Edizioni Er-ickson, 2001.[96] Joshua D Greene, R Brian Sommerville, Leigh E Nystrom, John M Darley, and Jonathan DCohen. An fmri investigation of emotional engagement in moral judgment.
Science ,293(5537):2105–2108, 2001.[97] Alexander Behnke, Anja Strobel, and Diana Armbruster. When the killing has been done:Exploring associations of personality with third-party judgment and punishment of homicidesin moral dilemma scenarios.
Plos one , 15(6):e0235253, 2020.[98] Ulf Böckenholt. Modeling motivated misreports to sensitive survey questions.
Psychometrika ,79(3):515–537, 2014.[99] Eunike Wetzel, Susanne Frick, and Samuel Greiff. The multidimensional forced-choice formatas an alternative for rating scales.
European Journal of Psychological Assessment , 36(4):511–515, 2020.[100] Valerie F. Reyna and Charles J. Brainerd. Numeracy, ratio bias, and denominator neglectin judgments of risk and probability.
Learning and Individual Differences , 18(1):89–107, jan2008.[101] David Preinerstorfer and Anton K Formann. Parameter recovery and model selection in mixedrasch models.
British Journal of Mathematical and Statistical Psychology , 65(2):251–262, 2012.[102] Thomas R O’Neill, Justin L Gregg, and Michael R Peabody. Effect of sample size on com-mon item equating using the dichotomous rasch model.
Applied Measurement in Education ,33(1):10–23, 2020.[103] Anton A Béguin and Ceec AW Glas. Mcmc estimation and some model-fit analysis of multi-dimensional irt models.
Psychometrika , 66(4):541–561, 2001.[104] Paul-Christian Bürkner. brms: An r package for bayesian multilevel models using stan.
Journalof statistical software , 80(1):1–28, 2017.[105] József Dombi and Tamás Jónás. Approximations to the normal probability distribution func-tion using operators of continuous-valued logic.