Psychometric Analysis of Forensic Examiner Behavior
Amanda Luby† ([email protected]) · Anjali Mazumder‡ ([email protected]) · Brian Junker§ ([email protected])

October 17, 2019
Abstract
Forensic science often involves the comparison of crime-scene evidence to a known-source sample to determine if the evidence and the reference sample came from the same source. Even as forensic analysis tools become increasingly objective and automated, final source identifications are often left to individual examiners' interpretation of the evidence. Each source identification relies on judgements about the features and quality of the crime-scene evidence that may vary from one examiner to the next. The current approach to characterizing uncertainty in examiners' decision-making has largely centered around the calculation of error rates aggregated across examiners and identification tasks, without taking into account these variations in behavior. We propose a new approach using IRT and IRT-like models to account for differences among examiners and additionally account for the varying difficulty among source identification tasks. In particular, we survey some recent advances (Luby, 2019a) in the application of Bayesian psychometric models, including simple Rasch models as well as more elaborate decision tree models, to fingerprint examiner behavior.

∗ The material presented here is based upon work supported in part under Award No. 70NANB15H176 from the U.S. Department of Commerce, National Institute of Standards and Technology. Any opinions, findings, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Institute of Standards and Technology, nor the Center for Statistics and Applications in Forensic Evidence.
† Swarthmore College
‡ The Alan Turing Institute, London
§ Carnegie Mellon University

1 Introduction
Validity and reliability of the evaluation of forensic science evidence is powerful and crucial to the fact-finding mission of the courts and criminal justice system (President's Council of Advisors on Science and Technology, 2016). Common types of evidence include DNA taken from blood or tissue samples, glass fragments, shoe impressions, firearm bullets or casings, fingerprints, handwriting, and traces of online/digital behavior. Evaluating these types of evidence often involves comparing a crime scene sample, referred to in this field as a latent sample, with a sample from one or more persons of interest, referred to as reference samples; forensic scientists refer to this as an identification task. Ideally, the result of an identification task is what is referred to as an individualization, i.e. an assessment by the examiner that the latent and reference samples come from the same source, or an exclusion, i.e. an assessment that the sources for the two samples are different. For a variety of reasons, the assessments in identification tasks for some kinds of evidence can be much more accurate and precise than for others.

The evaluation and interpretation of forensic evidence often involve at least two steps: (a) comparing a latent sample to a reference sample, and (b) assessing the meaning of that reported match or non-match (Saks and Koehler, 2008). There are often additional steps taken, for example, to assess whether the latent sample is of sufficient quality for comparison. Many kinds of identification tasks, e.g. those involving fingerprint, firearms and handwriting data, require human examiners to subjectively select features to compare in the latent and reference samples. The response provided by a forensic examiner is thus more nuanced than a dichotomous match or no-match decision. Further, each of these steps introduces potential for variability and uncertainty by the forensic science examiner.
Finally, the latent samples can be of varying quality, contributing further to variability and uncertainty in completing identification tasks. Forensic examination is thus ripe for the application of item response theory (IRT) and related psychometric models, in which examiners play the role of respondents or participants, and identification tasks play the role of items (Kerkhoff et al., 2015; Luby and Kadane, 2018).

In this paper we survey recent advances in the psychometric analysis of forensic examiner behavior (Luby, 2019a). In particular we will apply IRT and related models, including Rasch models (Rasch, 1960; Fischer and Molenaar, 2012), models for collateral or covarying responses (similar to Thissen, 1983), item response trees (IRTrees, De Boeck and Partchev, 2012) and cultural consensus theory models (CCT, Batchelder and Romney, 1988), to better understand the operating characteristics of identification tasks performed by human forensic examiners. We will focus on fingerprint analysis, but the same techniques can be used to understand identification tasks for other types of forensic evidence. Understanding examiners' performance is obviously of interest to legal decision makers, for whom the frequency and types of errors in forensic testimony is important (Garrett and Mitchell, 2017; Max et al., 2019), but it can also lead to better pre-service and in-service training for examiners, to reduce erroneous or misleading testimony.

(This usage of "latent" should not be confused with its usage in psychometrics, meaning a variable related to individual differences that is unobservable. We will use the word in both senses in this paper, the meaning being clear from context.)

1.1 Fingerprint analysis

Fingerprint identification tasks in which an examiner compares a latent print to one or more reference prints involve many sources of variation and uncertainty.
The latent print may be smudged or otherwise degraded to varying degrees, making comparison with the reference print difficult or impossible. The areas of the print available in the latent image may be difficult to locate in the reference print of interest. Even if the latent print is clear and complete, the degree of similarity between the latent and reference prints varies considerably across identification tasks. See, e.g., Bécue et al. (2019) for a comprehensive review of fingerprint comparison.

Examiners also contribute variability and uncertainty to the process. Different examiners may be differentially inclined in their determinations of whether print quality is sufficient to make a comparison. They may choose different features, or minutiae, on which to base a comparison, and they may have different personal thresholds for similarity of individual minutiae, or for the number of minutiae that must match (respectively, fail to match) to declare an individualization (respectively, exclusion); see for example Ulery et al. (2014).
1.2 Proficiency testing and error rate studies

Proficiency tests do exist for examiners (President's Council of Advisors on Science and Technology, 2016), but they are typically scored with number-right or percent-correct scoring (Gardner et al., 2019). This approach does not account for differing difficulty of identification tasks across different editions of the same proficiency test, nor across tasks within a single proficiency test. Thus the same score may indicate very different levels of examiner proficiency, depending on the difficulty of the tasks on a particular edition of the test, or even on the difficulty of the particular items answered correctly and incorrectly by different examiners with the same number-correct score on the same edition of the test.

Error rate studies, which aggregate true-positive, true-negative, false-positive and false-negative rates across many examiners and identification tasks, contain unmeasured biases due to the above variations in task difficulty and examiner practice and proficiency; see for example Luby and Kadane (2018). In addition, raw sample sizes in these studies understate true standard errors, due to correlation between responses from the same examiner (Holland and Rosenbaum, 1986).
In this paper we review some recent advances (Luby, 2019a) in the application of Bayesian IRT and IRT-like models to fingerprint examiner proficiency testing and error rate data. We show the additional information that can be obtained from application of even a simple IRT model (e.g., Rasch, 1960; Fischer and Molenaar, 2012) to proficiency data, and compare that information with examiners' perceived difficulty of identification tasks. We also explore models for staged decision making and polytomous responses when there is no ground truth (answer key). In this latter situation, even though there is no answer key, we are able to extract useful diagnostic information about examiners' decision processes, relative to a widely recommended decision process (known as ACE-V; NIST, 2012), using the IRTrees framework of De Boeck and Partchev (2012). Interestingly, the latent traits or person parameters in these models no longer represent proficiencies in performing identification tasks but rather tendencies of examiners toward one decision or another. This leads to a better understanding of variation among examiners at different points in the analysis process. Finally, we compare the characteristics of IRT-like models for generating answer keys with the characteristics of cultural consensus models (Batchelder and Romney, 1988; Anders and Batchelder, 2015) applied to the same problem.
2 Data

The vast majority of forensic decision-making occurs in casework, which is not often made available to researchers due to privacy concerns or active investigation policies. Besides real-world casework, data on forensic decision-making is collected through proficiency testing and error rate studies. Proficiency tests are periodic competency exams that must be completed for forensic laboratories to maintain their accreditation, while error rate studies are research studies designed to measure casework error rates.
2.1 Proficiency tests

Proficiency tests usually involve a large number of participants (often > ), across multiple laboratories, responding to a small set of identification task items (often < ). Since every participant responds to every item, we can assess participant proficiency and item difficulty largely using the observed scores. Since proficiency exams are designed to assess basic competency, most items are relatively easy and the vast majority of participants score 100% on each test.

In the US, forensic proficiency testing companies include Collaborative Testing Services (CTS), Ron Smith and Associates (RSA), Forensic Testing Services (FTS), and Forensic Assurance (FA). Both CTS and RSA provide two tests per year in fingerprint examination, consisting of 10-12 items, and make reports of the results available. FA also provides two tests per year, but does not provide reports of results. FTS does not offer proficiency tests for fingerprint examiners but instead focuses on other forensic domains.

In a typical CTS exam, for example, 300–500 participants respond to eleven or twelve items. In a typical item, a latent print is presented (e.g. Figure 1a), and participants are asked to determine the source of the print from a pool of four known donors (e.g. Figure 1b), if any.

Proficiency tests may be used for training, known or blind proficiency testing, research and development of new techniques, etc. Even non-forensic examiners can participate in CTS exams (Max et al., 2019), and distinguishing between experts and non-experts from the response data alone is usually not feasible since most participants correctly answer every question (Luby and Kadane, 2018). Moreover, since the test environment is not controlled, it is impossible to determine whether responses correspond to an individual examiner's decision, to the consensus answer of a group of examiners working together on the exam, or some other response process.

(a) A latent fingerprint sample provided by CTS.
(b) A ten-print card reference sample provided by CTS.

Figure 1: Examples of latent and reference samples provided in CTS proficiency exams.
2.2 Error rate studies

Error rate studies typically consist of a smaller number of participants (fewer than ), but use a larger pool of items (often 100 or more). In general, the items are designed to be difficult, and not every participant responds to every item.

AAAS (2017) identified twelve existing error rate studies in the fingerprint domain, and a summary of those studies is provided here. The number of participants (N), number of items (J), false positive rate, false negative rate, and reporting strategy vary widely across the studies and are summarized in Table 1 below. For example, Evett and Williams (1996) did not report the number of inconclusive responses, making results difficult to evaluate relative to the other studies. And Tangen et al. (2011) and Kellman et al. (2014) required examiners to make a determination about the source of a latent print in only three minutes, likely leading to larger error rates. Ulery et al. (2011) is generally regarded as the most well-designed error rate study for fingerprint examiners (AAAS, 2017; President's Council of Advisors on Science and Technology, 2016). Ulery et al. (2012) tested the same examiners on 25 of the same items they were shown seven months earlier, and found that 90% of decisions for same-source pairs were repeated, and 85.9% of decisions for different-source pairs were repeated. For additional information on all twelve studies, see Luby (2019a) or AAAS (2017).

Study                     | N       | J         | False Pos | False Neg       | Inconclusive
Evett and Williams (1996) | 130     | 10        | 0         | 0.007%          | Not reported
Wertheim et al. (2006)    | 108     | 10        |           | 1.5%            |
Langenburg et al. (2009)  | 15 (43) | 6         | 2.3%      | 7%              |
Langenberg (2009)         | 6       | 120       | 0         | 0.7% / 2.2%     |
Tangen et al. (2011)      | 37 (74) | 36        | 0.0037    |                 | Not allowed
Ulery et al. (2011)       | 169     | 744 (100) | 0.17%     | 7.5%            |
Ulery et al. (2012)       | 72      | 744 (25)  | 0         | 30% of previous |
Langenburg et al. (2012)  | 159     | 12        | 2.4%      |                 |
Kellman et al. (2014)     | 56      | 200 (40)  | 3%        | 14%             | Not allowed
Pacheco et al. (2014)     | 109     | 40        | 4.2%      | 8.7%            |
Liu et al. (2015)         | 40      | 5         | 0.11%     |                 |

Table 1: Summary of existing studies that estimate error rates in fingerprint examination.

2.3 The FBI Black Box study

All analyses in this paper use results from the FBI Black Box Study and are based on practices and procedures of fingerprint examiners in the United States. The FBI Black Box study (Ulery et al., 2011; dataset available freely from the FBI) was the first large-scale study performed to assess the accuracy and reliability of fingerprint examiners' decisions. 169 fingerprint examiners were recruited for the study, and each participant was assigned roughly 100 items from a pool of 744. The items (fingerprint images) were designed to include ranges of features (e.g. minutiae, smudges, and patterns) and quality similar to those seen in casework, and to be representative of searches from an automated fingerprint identification system. The overall false positive rate in the study was 0.1% and the overall false negative rate was 7.5%. These computed quantities, however, excluded all "inconclusive" responses (i.e. neither individualizations nor exclusions).

Each row in the data file corresponds to an examiner × task response. In addition to the Examiner ID and item Pair ID (corresponding to the latent-reference pair), additional information is provided for each examiner × task interaction, as shown in Table 2.

Examiners thus made three distinct decisions when they were evaluating the latent and reference prints in each item: (1) whether or not the latent print has value for a further decision, (2) whether the latent print was determined to come from the same source as the reference print, different sources, or inconclusive, and (3) their reasoning for making an inconclusive or exclusion decision. While the main purpose of the study was to calculate casework error rates (and thus focused on the Compare Value decision), important trends in examiner behavior are also present in the other decisions, to which we return in Section 3.3.
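The error rates quoted above, computed with inconclusives excluded, can be made concrete with a minimal sketch. This is not the study's actual code, and the record format is a hypothetical one of our own:

```python
def error_rates(responses):
    """Compute (false positive rate, false negative rate) from a list of
    (decision, mates) pairs, where decision is 'Individualization',
    'Exclusion', or 'Inconclusive', and mates is True for same-source pairs.

    Inconclusive responses are excluded from numerator and denominator,
    mirroring how the Black Box study reported its error rates.
    """
    fp = fn = n_diff = n_same = 0
    for decision, mates in responses:
        if decision == "Inconclusive":
            continue  # excluded entirely from the rate calculations
        if mates:
            n_same += 1
            if decision == "Exclusion":  # same source called different
                fn += 1
        else:
            n_diff += 1
            if decision == "Individualization":  # different source called same
                fp += 1
    fp_rate = fp / n_diff if n_diff else 0.0
    fn_rate = fn / n_same if n_same else 0.0
    return fp_rate, fn_rate
```

Because the inconclusive rate varies across examiners, the effective denominators vary as well, which is one of the aggregation problems the IRT analysis below is designed to address.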
• Mating: whether the pair of prints were "Mates" (a match) or "Non-mates" (a non-match)
• Latent Value: the examiner's assessment of the value of the print (NV = No Value, VEO = Value for Exclusion Only, VID = Value for Individualization)
• Compare Value: the examiner's evaluation of whether the pair of prints is an "Exclusion", "Inconclusive" or "Individualization"
• Inconclusive Reason: If inconclusive, the reason for the inconclusive
  – "Close": The correspondence of features is supportive of the conclusion that the two impressions originated from the same source, but not to the extent sufficient for individualization.
  – "Insufficient": Potentially corresponding areas are present, but there is insufficient information present. Examiners were told to select this reason if the reference print was not of value.
  – "No Overlap": No overlapping area between the latent and reference prints
• Exclusion Reason: If exclusion, the reason for the exclusion
  – "Minutiae": The exclusion determination required the use of minutiae
  – "Pattern": The exclusion determination could be made on fingerprint pattern class and did not require the use of minutiae
• Difficulty: Reported difficulty on a five-point scale: 'A-Obvious', 'B-Easy', 'C-Medium', 'D-Difficult', 'E-Very Difficult'

Table 2: Additional information provided for each examiner × task interaction in the FBI Black Box data (Ulery et al., 2011).

3 Models and results

3.1 The Rasch model

The Rasch model (Rasch, 1960; Fischer and Molenaar, 2012) is a relatively simple, yet powerful, item response model that allows us to separate examiner proficiency from task difficulty. The probability of a correct response is modeled as a logistic function of the difference between the participant proficiency, θ_i (i = 1, ..., N), and the item difficulty, b_j (j = 1, ..., J):

P(Y_ij = 1) = 1 / (1 + exp(−(θ_i − b_j))).   (1)

In order to fit an IRT model to the Black Box study, we score responses as correct if they are true identifications or exclusions and as incorrect if they are false identifications or exclusions. For the purpose of illustration we will consider "inconclusive" responses as missing completely at random (MCAR), following Ulery et al. (2011). However, there are a large number of inconclusive answers (4907 of 17121 responses), which can be scored in a variety of ways (see Luby, 2019b, for examples), and we will return to the inconclusives in Section 3.4.

The Rasch model was fitted in a Bayesian framework, with θ_i ∼ N(0, σ_θ) and b_j ∼ N(µ_b, σ_b).

Figure 2: Estimated IRT proficiency by observed false positive rate (left panel) and false negative rate (right panel). Examiners who made at least one false positive error, i.e. the nonzero cases in the left-hand plot, are colored in purple on the right-hand plot.
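As an illustrative sketch (not the authors' code; the function names are our own), equation (1) and the "inconclusive MCAR" scoring rule might be implemented as follows:

```python
import math

def rasch_prob(theta, b):
    """Equation (1): P(Y_ij = 1) = 1 / (1 + exp(-(theta_i - b_j)))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def score_response(compare_value, mates):
    """Score a Compare Value decision against ground truth (Mating).

    Returns 1 for a true individualization or true exclusion, 0 for a
    false one, and None for inconclusives, which are treated as missing
    completely at random (MCAR) and dropped from the likelihood.
    """
    if compare_value == "Inconclusive":
        return None
    is_individualization = compare_value == "Individualization"
    return 1 if is_individualization == mates else 0
```

Scoring schemes other than MCAR simply change what `score_response` returns for inconclusives (e.g. 0 under a harsher scheme), leaving the Rasch response function untouched.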
The hyperparameters µ_b, σ_θ, and σ_b were given normal and half-Cauchy hyperpriors, respectively, and the model was fitted using Stan (Stan Development Team, 2018a,b). Figure 2 shows estimated proficiencies of examiners when responses are scored as described above, with 95% posterior intervals, plotted against the raw false positive rate (left panel) and against the raw false negative rate (right panel). Those examiners who made at least one false positive error are colored in purple in the right panel of Figure 2. One of the examiners who made a false positive error still received a relatively high proficiency estimate due to having a small false negative rate.

In the left panel of Figure 3, we see as expected a positive correlation between proficiency estimates and observed score (% correct); variation in proficiency at each observed score is due to the fact that different examiners saw subsets of items of differing difficulty.

Figure 3: The left panel shows proficiency by observed score under the "inconclusive MCAR" scoring scheme, with those examiners with scores between 94% and 96% highlighted. The right panel shows proficiency by average item difficulty, colored by percent conclusive, for the highlighted subset from the left panel. Estimated proficiency is related to observed score, item difficulty, and conclusive decision rates.
The highlighted examiners in the left panel of Figure 3 all had raw percent-correct (observed scores) between 94% and 96%, and are re-plotted in the right panel showing average question difficulty and percent of items with conclusive responses, illustrating substantial variation in both Rasch proficiency and relative frequency of conclusive responses for these examiners with similar, high observed scores.

Luby (2019b) explores other scoring schemes as well as partial credit models for this data. Treating the inconclusives as MCAR leads to both the smallest range of observed scores and the largest range of estimated proficiencies; harsher scoring methods (e.g. treating inconclusives as incorrect) generally also lead to higher estimated proficiencies, since more items are estimated to be difficult.

Results from an IRT analysis are largely consistent with conclusions from an error rate analysis (Luby, 2019b). However, IRT provides substantially more information than a more traditional analysis, specifically through accounting for the difficulty of items seen. Additionally, IRT implicitly accounts for the inconclusive rates of different examiners in its estimates of uncertainty for both examiner proficiency and item difficulty.

3.2 Collateral information: reported difficulty

As shown in Table 2, the FBI Black Box study also asked examiners to report the difficulty of each item they evaluated on a five-point scale. These reported difficulties are not the purpose of the test, but are secondary responses for each item collected at the same time as the responses and can therefore be thought of as 'collateral information'. When the additional variables are covariates describing either the items or the examiners—for instance, image quality, number of minutiae, examiner's experience, type of training—it would be natural to incorporate them as predictors for proficiency or difficulty in the IRT model (de Boeck and Wilson, 2004).
However, since reported difficulty is, in effect, a secondary response in the Black Box study, we take an approach analogous to response time modeling in IRT: in our case we have a scored task response, and a difficulty rating rather than a response time, for each person × item pair. Thissen (1983) provides an early example of this type of modeling, where the logarithm of response time is modeled as a linear function of the log-odds θ_i − b_j of a correct response, and additional latent variables for both items and participants. Ferrando and Lorenzo-Seva (2007) and van der Linden (2006) each propose various other models for modeling response time jointly with the traditional correct/incorrect IRT response. Modeling collateral information alongside responses in this way has been shown generally to improve estimates of IRT parameters through the sharing of information (van der Linden et al., 2010).

Recall from Section 2.3 (Table 2) that examiners rate the difficulty of each item on a five-point scale: 'A-Obvious', 'B-Easy', 'C-Medium', 'D-Difficult', 'E-Very Difficult'. Let Y_ij be the scored response of participant i to item j, and let X_ij be the difficulty reported by participant i to item j. Y_ij thus takes the values 0 (incorrect) or 1 (correct), and X_ij is an ordered categorical variable with five levels (A-Obvious to E-Very Difficult). Following Thissen (1983), we combine a Rasch model,

logit(P(Y_ij = 1)) = θ_i − b_j,   (2)

with a cumulative-logits ordered logistic model for the reported difficulties,

X*_ij = logit⁻¹(g · (θ_i − b_j) + h_i + f_j),   (3)

where

X_ij = A-Obvious         if X*_ij ≤ γ_1
       B-Easy            if γ_1 < X*_ij ≤ γ_2
       C-Medium          if γ_2 < X*_ij ≤ γ_3
       D-Difficult       if γ_3 < X*_ij ≤ γ_4
       E-Very Difficult  if X*_ij > γ_4.   (4)
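For intuition, the ordered five-category response can be sketched as below. This uses a standard cumulative-logits parameterization, P(X ≤ c) = logit⁻¹(γ_c − η), rather than the authors' Stan code, and the threshold values used in the usage note are our own assumptions:

```python
import math

LEVELS = ["A-Obvious", "B-Easy", "C-Medium", "D-Difficult", "E-Very Difficult"]

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def difficulty_probs(theta, b, g, h_i, f_j, gammas):
    """Category probabilities for the five reported-difficulty levels.

    eta = g*(theta - b) + h_i + f_j is the linear predictor of equation (3);
    gammas are the four ordered thresholds gamma_1 < ... < gamma_4, with
    gamma_0 = -inf and gamma_5 = +inf implied at the boundaries.
    """
    eta = g * (theta - b) + h_i + f_j
    cdf = [0.0] + [inv_logit(gc - eta) for gc in gammas] + [1.0]
    return {LEVELS[c]: cdf[c + 1] - cdf[c] for c in range(5)}
```

With illustrative thresholds such as `gammas = [-2.0, -0.5, 0.5, 2.0]`, increasing h_i shifts probability mass toward 'E-Very Difficult', matching the interpretation of positive h_i as over-reporting of difficulty.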
The additional variables h_i and f_j in equation (3) allow for the possibilities that examiners over-report (h_i > 0) or under-report (h_i < 0) item difficulty, and that item difficulty tends to be over-reported (f_j > 0) or under-reported (f_j < 0), relative to the Rasch logit (θ_i − b_j) and the reporting tendencies of other examiners. These parameters will be discussed further in Section 3.2.2 below.

We assume that each participant's responses are independent of other participants' responses, Y_i· ⊥ Y_i′·; that within-participant responses and reports are conditionally independent of one another given the latent trait(s), Y_ij ⊥ Y_ij′ | θ_i and X_ij ⊥ X_ij′ | θ_i, h_i; and that responses are conditionally independent of reported difficulty given all latent variables, X_ij ⊥ Y_ij | θ_i, b_j, g, h_i, f_j. Then the likelihood is

L(Y, X | θ, b, g, h, f) = ∏_i ∏_j P(Y_ij = 1)^{Y_ij} (1 − P(Y_ij = 1))^{1 − Y_ij} P(X_ij = x_ij)   (5)

and

P(X_ij = c) = P(logit⁻¹(g · (θ_i − b_j) + h_i + f_j) ≤ γ_c) − P(logit⁻¹(g · (θ_i − b_j) + h_i + f_j) ≤ γ_{c−1}),   (6)

where γ_0 = −∞ and γ_5 = ∞.

We chose a cumulative-logits approach because it is directly implemented in Stan and therefore runs slightly faster than adjacent-category-logits and other approaches. We have no reason to believe this choice has a practical effect on modeling outcomes, but if desired other formulations could certainly be used. Luby (2019a) compares the predictive performance and prediction error of the above model with fits of other models for X_ij and finds the above model to best fit the Black Box data.

For each examiner in the dataset, their observed score, (1/n_i) ∑_{j∈J_i} y_ij, and their predicted score under the model, (1/n_i) ∑_{j∈J_i} ŷ_ij, were calculated.
Similarly, predicted and observed average reported difficulty were calculated, where the observed average reported difficulty is (1/n_i) ∑_{j∈J_i} x_ij and the predicted average reported difficulty is (1/n_i) ∑_{j∈J_i} x̂_ij. If the model is performing well, the predicted scores should be very similar to the observed scores.

Figure 4 shows the predicted scores compared to the observed scores (left panel), and the predicted average difficulty compared to the observed average reported difficulty (right panel). Reported difficulties for inconclusive responses were also treated as MCAR under this scoring scheme. While the joint model tends to over-predict percent correct, it predicts average reported difficulty quite well.
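The per-examiner comparison behind these posterior predictive checks can be sketched as follows (a helper of our own; the ŷ_ij and x̂_ij values would come from posterior predictive draws):

```python
def per_examiner_averages(obs, pred):
    """Observed and predicted averages for one examiner:
    (1/n_i) * sum over j in J_i of the observed and predicted values.

    obs and pred are dicts mapping item id -> value; only items the
    examiner actually saw (keys present in both) are averaged.
    """
    items = sorted(set(obs) & set(pred))
    n = len(items)
    obs_avg = sum(obs[j] for j in items) / n
    pred_avg = sum(pred[j] for j in items) / n
    return obs_avg, pred_avg
```

The same helper applies to both checks: scored responses y_ij versus ŷ_ij, and reported difficulties x_ij versus x̂_ij.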
Figure 4: Posterior predictive performance of % correct (left) and average predicted difficulty (right) for the joint model. The model slightly over-predicts % correct, but performs quite well for average reported difficulty.
Figure 5: Proficiency (left) and difficulty (right) estimates under the joint model (with 95% posterior intervals) are very similar to the Rasch point estimates from the previous section.

Figure 5 (left panel) plots the proficiency estimates from the joint model against the Rasch proficiency estimates (i.e. the model for correctness from Section 3.1, without modeling reported difficulty). The proficiency estimates from the joint model do not differ substantially from the Rasch proficiency estimates, although there is a slight shrinkage towards zero of the joint model proficiency estimates. Figure 5 (right panel) plots the item difficulty estimates from the joint model against the item difficulty estimates from the Rasch model. Like the proficiency estimates, the difficulties under the joint model do not differ substantially from the Rasch difficulties. This is due to the inclusion of the h_i and f_j parameters for the reported difficulty part of the model, which sufficiently explain the variation in reported difficulty without impacting the IRT parameters.

Recall that the joint model predicts reported difficulty as g · (θ_i − b_j) + h_i + f_j. In addition to proficiency and difficulty, "reporting bias" parameters for examiners (h_i) and items (f_j) are also included. Positive h_i and f_j thus increase the expected reported difficulty, while negative h_i and f_j decrease the expected reported difficulty.

Thus, h_i can be interpreted as examiner i's tendency to over- or under-report difficulty, after accounting for the other parameters. The left panel of Figure 6 shows the h_i estimates and 95% posterior intervals compared to the proficiency (point) estimates. Since there are many examiners whose 95% posterior intervals do not overlap with zero, Figure 6 provides evidence that there exist differences among examiners in the way they report difficulty. This reporting bias does not appear to have any relationship with the model-based proficiency estimates.
That is, examiners who report items to be more difficult (positive h_i) do not perform worse than examiners who report items to be easier (negative h_i). Similarly, f_j can be interpreted as item j's tendency to be over- or under-reported, after accounting for other parameters. The right panel of Figure 6 shows the f_j estimates and 95% posterior intervals compared to the point estimates for difficulty (b_j). There are a substantial number of items whose posterior intervals do not overlap with zero; these are items that are consistently reported as more or less difficult than the number of incorrect responses for that item suggests. Additionally, there is a mild arc-shaped relationship between f_j and b_j: items with estimated difficulties near zero are most likely to have over-reported difficulty, and items with very negative or very positive estimated difficulties (corresponding to items that examiners did very poorly or very well on, respectively) tend to have under-reported difficulty.

Figure 6: Person reporting bias (h_i, left) and item reporting bias (f_j, right) with 95% posterior intervals from the Thissen model, compared to the proficiency estimates (θ_i) and difficulty estimates (b_j), respectively. Points with intervals that overlap zero are colored in gray. There is substantial variation in h_i not explained by θ_i. Items with estimated difficulties near zero are most likely to have over-reported difficulty.

Reported difficulty may provide additional information about the items beyond standard IRT estimates. For example, consider two items with identical response patterns (i.e. the same examiners answered each question correctly and incorrectly), but where one item was reported to be more difficult than the other by all examiners. It is plausible that at least some examiners struggled with that item but eventually came to the correct conclusion. Standard IRT will not detect the additional effort required for that item, compared to the less effortful item with the same response pattern.

Although the purpose of the Black Box study was to estimate false positive and false negative error rates, the recorded data also contain additional information about examiners' decision-making process. Recall from Section 2.3 that each recorded response to an item consists of three decisions:

1. Value assessment for the latent print only (No Value, Value for Exclusion Only, or Value for Individualization)
2. Source evaluation of the latent/reference print pair (i.e. Individualization [match], Exclusion [non-match], or Inconclusive)
3. (If inconclusive) Reason for the inconclusive

For our analysis, we do not distinguish between 'value for individualization' and 'value for exclusion only', and instead treat the value assessment as a binary response ('has value' vs. 'no value').
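As a concrete (and purely hypothetical) encoding of this response structure, each examiner-item record can be stored with its three decisions, with the value assessment collapsed to the binary form used in the analysis. The field names below are illustrative, not the study's actual data schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BlackBoxResponse:
    # The three recorded decisions for one examiner-item pair.
    value_assessment: str               # 'No Value', 'Value for Exclusion Only',
                                        # or 'Value for Individualization'
    source_evaluation: Optional[str]    # 'Individualization', 'Exclusion',
                                        # 'Inconclusive', or None if no value
    inconclusive_reason: Optional[str]  # 'Close', 'Insufficient', 'No Overlap', or None

    @property
    def has_value(self) -> bool:
        # Collapse the three-category value assessment to the binary response.
        return self.value_assessment != "No Value"

# One of the discrepant patterns noted in the text: 'value for exclusion only'
# followed by a further (here inconclusive) source evaluation.
r = BlackBoxResponse("Value for Exclusion Only", "Inconclusive", "No Overlap")
assert r.has_value
```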
As Haber and Haber (2014) note, only 17% of examiners reported on a post-experiment questionnaire that they used 'value for exclusion only' in their normal casework, and examiners in the Black Box study may have interpreted this decision in different ways. For example, there were 32 examiners (of 169) who reported that a latent print had 'value for exclusion only' and then proceeded to make an individualization in the second decision. These discrepancies led us to treat the value assessment as a binary response: either 'has value' or 'no value'.

Figure 7: Number of inconclusive (left) and no value (right) responses reported by each examiner.

The Item Response Trees (IRTrees; De Boeck and Partchev, 2012) framework provides a solution for modeling the sequential decisions above explicitly. IRTrees represent responses with decision trees, where branch splits represent hypothesized internal decisions, conditional on the previous decisions in the tree structure, and leaves are observed outcomes. Sequential decisions can thus be represented explicitly, and node splits need not represent scored decisions.

Fingerprint examiners have been found to vary in their tendencies to make 'no value' and 'inconclusive' decisions (Ulery et al., 2011). Figure 7 shows the distribution of the number of inconclusive and no value decisions reported by each examiner. Although most examiners report 20–40 inconclusives and 15–35 'no value' responses, some examiners report as many as 60 or as few as 5. By modeling these responses explicitly within the IRTree framework, individual differences in proficiency among examiners can be assessed alongside differences in tendency toward value assessments (vs. no-value assessments) and inconclusive responses (vs. conclusive responses).
Figure 8 depicts an IRTree based on one possible internal decision process, motivated by the ACE-V decision process (NIST, 2012). Each internal node Y*_1, . . . , Y*_5 represents a possible binary (0/1) decision that each examiner could make on each item; these decisions will be modeled with IRT models. The first node, Y*_1, represents the examiner's assessment of whether the latent print is 'of value' or 'no value'. The second node, Y*_2, represents whether the examiner found sufficient information in the (reference, latent) print pair to make a further decision. Y*_3 represents whether the pair of prints is more likely to be a match or a non-match, and Y*_4 and Y*_5 represent whether this determination is conclusive (individualization and exclusion, respectively) or inconclusive (close and no overlap, respectively). This binary decision process tree thus separates examiners' decisions into both (a) distinguishing between matches and non-matches (Y*_3) and (b) examiner 'willingness to respond with certainty' (Y*_1, Y*_2, Y*_4, Y*_5).

Since each internal node in the IRTree in Figure 8 is a binary split, we use a Rasch model to parameterize each branch in the tree. That is,

P(Y*_kij = 1) = logit^{-1}(θ_ki − b_kj),  (7)

Figure 8: The binary decision process tree.

where i indexes examiners, j indexes items, and k indexes internal nodes (sequential binary decisions). Thus, we account for examiner tendencies to choose one branch vs. the other at decision k with θ_ki, and features of the task that encourage choice of one branch vs. the other at decision k with b_kj.
Clearly other IRT models could be chosen as well; see Luby (2019a) for further discussion. The full IRTree model is

P(Y_ij = No Value) = P(Y*_1ij = 1)  (8)
P(Y_ij = Individ.) = P(Y*_1ij = 0) × P(Y*_2ij = 0) × P(Y*_3ij = 1) × P(Y*_4ij = 1)  (9)
P(Y_ij = Close) = P(Y*_1ij = 0) × P(Y*_2ij = 0) × P(Y*_3ij = 1) × P(Y*_4ij = 0)  (10)
P(Y_ij = Insufficient) = P(Y*_1ij = 0) × P(Y*_2ij = 1)  (11)
P(Y_ij = No Ov.) = P(Y*_1ij = 0) × P(Y*_2ij = 0) × P(Y*_3ij = 0) × P(Y*_5ij = 0)  (12)
P(Y_ij = Excl.) = P(Y*_1ij = 0) × P(Y*_2ij = 0) × P(Y*_3ij = 0) × P(Y*_5ij = 1).  (13)

Furthermore, an item-explanatory variable (X_j) for each item was included at all k nodes, where X_j = 1 if the latent and reference print came from the same source (i.e. a true match) and X_j = 0 if the latent and reference print came from different sources (i.e. a true non-match). Then,

b_kj = β_0k + β_1k X_j + ε_kj,  k = 1, . . . , 5,  (14)

where the b_kj are the item parameters and β_0k, β_1k are linear regression coefficients at node k. This is an instance of the Linear Logistic Test Model (Fischer, 1973) with random item effects (Janssen et al., 2004); see also de Boeck and Wilson (2004) for more elaborate models along these lines. This allows the means of the item parameters to differ depending on whether the pair of prints is a true match or not. The random effects ε_kj ∼ N(0, σ²_kb), as specified in the second line of (15) below, allow for the possibility that print pairs in an identification task may have other characteristics that impact task difficulty (e.g. image quality, number of features present), beyond whether the pair of prints is a same-source or different-source pair.
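The leaf probabilities in (8)–(13) follow directly from the five branch propensities of the tree. A minimal sketch in Python (the paper fits these models in Stan/R; the parameter values below are illustrative only):

```python
import math

def inv_logit(x):
    # logit^{-1}(x) = 1 / (1 + exp(-x)), the Rasch branch probability.
    return 1.0 / (1.0 + math.exp(-x))

def irtree_leaf_probs(theta, b):
    """Leaf probabilities for the binary decision process tree.

    theta, b: length-5 sequences of examiner tendencies (theta_ki) and item
    parameters (b_kj) for the five internal nodes, each following the Rasch
    form P(Y*_k = 1) = logit^{-1}(theta_k - b_k), as in equation (7).
    """
    p = [inv_logit(t - bk) for t, bk in zip(theta, b)]  # P(Y*_k = 1)
    q = [1.0 - pk for pk in p]                          # P(Y*_k = 0)
    return {
        "No Value":     p[0],                           # eq. (8)
        "Individ.":     q[0] * q[1] * p[2] * p[3],      # eq. (9)
        "Close":        q[0] * q[1] * p[2] * q[3],      # eq. (10)
        "Insufficient": q[0] * p[1],                    # eq. (11)
        "No Ov.":       q[0] * q[1] * q[2] * q[4],      # eq. (12)
        "Excl.":        q[0] * q[1] * q[2] * p[4],      # eq. (13)
    }

probs = irtree_leaf_probs(theta=[0.0, -1.0, 0.5, 1.0, 1.0],
                          b=[2.0, 0.0, -0.5, 0.0, 0.0])
assert abs(sum(probs.values()) - 1.0) < 1e-12  # leaves partition the outcomes
```

Because every internal split is binary, the six leaf probabilities always sum to one, which is a useful sanity check on any implementation.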
We fit this model under the Bayesian framework with Stan in R (Stan Development Team, 2018a; R Core Team, 2013), using the following prior distributions:

θ_i ~iid MVN(0, σ_θ L_θ L_θ' σ_θ)
b_j ~iid MVN(β X_j, σ_b L_b L_b' σ_b)
L_θ ∼ LKJ(4)
L_b ∼ LKJ(4)
σ_kθ ~iid Half-Cauchy(0, 2.5),  k = 1, . . . , 5
σ_kb ~iid Half-Cauchy(0, 2.5),  k = 1, . . . , 5
β_0k ~iid N(0, 5),  k = 1, . . . , 5
β_1k ~iid N(0, 5),  k = 1, . . . , 5.  (15)

Here X_j is the column vector (1, X_j)', β is the 5 × 2 matrix whose kth row is (β_0k, β_1k), and σ_b is a 5 × 5 diagonal matrix with σ_1b, . . . , σ_5b as the diagonal entries; σ_θ in the previous line is defined similarly. Multivariate normal distributions for θ_i and b_j were chosen to estimate the covariance between sequential decisions explicitly. The Stan modeling language does not rely on conjugacy, so the Cholesky factorizations (L_θ and L_b) are modeled instead of the covariance matrices for computational efficiency. The recommended priors (Stan Development Team, 2018b) for L and σ were used: an LKJ prior (Lewandowski et al., 2009; LKJ = last initials of the authors) with shape parameter 4, which yields correlation matrices that mildly concentrate around the identity matrix (LKJ(1) yields uniformly sampled correlation matrices), and weakly informative half-Cauchy priors on σ_kb and σ_kθ. Normal priors centered at zero were assigned to the linear regression coefficients (β_0k, β_1k).

There are, of course, alternative prior structures, and indeed alternate tree formulations, that could reasonably model these data. For example, Luby (2019a) constructs a novel bipolar scale, shown in Figure 9, from the possible responses, and a corresponding IRTree model. This not only provides an ordering for the responses within each sub-decision (i.e. source decision and reason for inconclusive), but allows the sub-decisions to be combined in a logical way.
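The covariance parameterization in (15), a scale vector combined with the Cholesky factor of a correlation matrix, can be checked numerically. A small pure-Python sketch (2 × 2 for brevity, with illustrative values; this is not the fitted Stan model):

```python
def matmul(A, B):
    # Plain triple-loop matrix multiply for small matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

# Cholesky factor L of a 2x2 correlation matrix with correlation rho:
rho = 0.3
L = [[1.0, 0.0],
     [rho, (1 - rho**2) ** 0.5]]
sigma = [2.0, 0.5]  # scale (standard deviation) vector

# Covariance = diag(sigma) * L * L' * diag(sigma), as in (15).
D = [[sigma[0], 0.0], [0.0, sigma[1]]]
Sigma = matmul(matmul(D, matmul(L, transpose(L))), D)

# Diagonal entries are variances; the off-diagonal is rho * sigma_1 * sigma_2.
assert abs(Sigma[0][0] - sigma[0] ** 2) < 1e-12
assert abs(Sigma[1][1] - sigma[1] ** 2) < 1e-12
assert abs(Sigma[0][1] - rho * sigma[0] * sigma[1]) < 1e-12
```

Modeling L and σ separately, rather than the covariance matrix itself, is what allows Stan to place the LKJ prior on the correlation structure and the half-Cauchy priors on the scales independently.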
This scale is also consistent with other hypothetical models for forensic decision-making (Dror and Langenburg, 2019). Based on the description of each option for an inconclusive response, the 'Close' inconclusives are more similar to an individualization than the other inconclusive reasons. The 'No Overlap' inconclusives are more similar to exclusions than the other inconclusive reasons, under the assumption that the reference prints are relatively complete. That is, if there are no overlapping areas between a latent print and a complete reference print, the two prints likely came from different sources. The 'Insufficient' inconclusives are treated as the center of the constructed match/no-match scale. For more details, and comparisons among multiple tree structures, see Luby (2019a).

Our discussion of results will focus on estimated parameters from the fitted IRTree model. For brevity, we will write θ_k = (θ_k1, . . . , θ_kN) and b_k = (b_k1, . . . , b_kJ), k = 1, . . . , 5, in equation (7) and Figure 8.

Figure 9: FBI Black Box responses as a bipolar scale, running from Individualization (match) through the Close, Insufficient, and No Overlap inconclusives to Exclusion (non-match).

The posterior medians for each examiner and item were calculated, and the distributions of examiner parameters (Figure 10) and item parameters (Figure 11) are displayed as a whole. The item parameters are generally more extreme than the person parameters corresponding to the same decision (e.g. the b_1 estimates span a substantially wider range than the θ_1 estimates). This suggests that many of the responses are governed by item effects, rather than examiner tendencies.

The greatest variation in person parameters occurs in θ_1 ('no value' tendency), θ_4 (conclusive tendency in matches) and θ_5 (conclusive tendency in non-matches). Item parameters are most extreme in b_1 (tendency towards 'has value') and b_4 (inconclusive tendency in matches). For example, the item with the most negative b_1 estimate was judged to have no value by every examiner who saw it; similarly, for the item with the largest b_4 estimate, every examiner agreed that no individualization determination could be made.

Figure 10: Distribution of θ point estimates under the binary decision process model. Greatest variation occurs in θ_1, θ_4, and θ_5, corresponding to No Value, Individualization, and Exclusion tendencies, respectively.

Using probabilities calculated from the IRTree model estimates provides a way to assess the observed decision in each examiner × item pair in light of the other decisions that examiner made, and of how other examiners evaluated that item. Inconclusives that are 'expected' under the model can then be determined, along with which examiners often come to conclusions that are consistent with the model-based predictions. For example, an examiner whose responses often match the model-based predictions may be more proficient in recognizing when there is sufficient evidence to make a conclusive decision than an examiner whose responses do not match the model-based predictions.

Figure 11: Distribution of b point estimates under the binary decision process model. Greatest variation occurs in b_1 and b_4, corresponding to Value and Close tendencies, respectively. Also note that the b values are more extreme than the θ values.

Table 3: Regression coefficients (with 90% posterior intervals) for each of the five nodes in the IRTree model.

As one example, Examiner 55 decided Item 556 was a 'Close' inconclusive, but Item 556 is a true non-match. Using posterior median estimates for θ_k,55 and b_k,556 under the binary decision process model (where k = 1, . . . , 5 indexes each split in the tree), the probability of observing each response for this observation can be calculated. The probabilities of a 'No Value', 'Individualization', or 'Insufficient' response are all negligible and, according to the model, the most likely outcome for this response is an exclusion. Since an inconclusive was observed instead, this response might be flagged as being due to examiner indecision. This process suggests a method for determining 'expected answers' for each item using an IRTree approach, which we discuss further in Section 3.4.

The estimated β_0k and β_1k, with 90% posterior intervals, are displayed in Table 3. Since the estimated β_1k's all have posterior intervals that are entirely negative (k = 1, 2, 3) or overlap zero (k = 4, 5), we can infer that the identification tasks for true matches (i.e. X_j = 1 in Equation 14) tend to have lower b_kj parameters than the true non-matches (X_j = 0), leading to matching pairs being more likely to fall along the left branches of the tree in Figure 8.

We also note that the IRTrees approach is compatible with the joint models for correctness and reported difficulty introduced in Section 3.2.1.
By replacing the Rasch model for correctness with an IRTree model, Luby (2019a) demonstrates that reported difficulty is related to the IRTree branch propensities (θ_ik − b_jk), with items tending to be rated as more difficult when the IRTree branch propensities are near zero.

Moreover, examiners are likely to use different thresholds for reporting difficulty, just as they do for coming to source evaluations (AAAS, 2017; Ulery et al., 2017); the IRTrees analysis above has been helpful in making these differing thresholds more explicit. In the same way, the IRTrees analysis of reported difficulty may lead to insights about how examiners decide how difficult an identification task is.

Generating evidence to construct test questions is both time-consuming and difficult. The methods introduced in this section provide a way to use evidence collected in non-controlled settings, for which ground truth is unknown, for testing purposes. Furthermore, examiners should receive feedback not only when they make false identifications or exclusions, but also if they make 'no value' or 'inconclusive' decisions when most examiners are able to come to a conclusive determination (or vice versa). It is therefore important to distinguish when no value, inconclusive, individualization, and exclusion responses are expected in a forensic analysis.

There are also existing methods for 'IRT without an answer key', for example the cultural consensus theory (CCT) approach (Batchelder and Romney, 1988; Oravecz et al., 2014). CCT was designed for situations in which a group of respondents shares some knowledge or beliefs in a domain area which is unknown to the researcher or administrator (similar approaches have been applied to ratings of extended response test items, e.g. Casabianca et al., 2016). CCT then estimates the expected answers to the items provided to the group.
We primarily focus on comparing the Latent Truth Rater Model (LTRM), a CCT model for ordinal categorical responses (Anders and Batchelder, 2015), to an IRTree-based approach.

Although the individualization/exclusion scale in Figure 9 could be used to generate an answer key for the source evaluations (i.e. individualization, exclusion, or inconclusive), it would not be possible to determine an answer key for the latent print value assessments (i.e. no value vs. has value). Instead, a 'conclusiveness' scale, Figure 12, can be used. This scale does not distinguish between same-source and different-source prints, but does allow for the inclusion of no value responses on the scale. Using an answer key from this scale, alongside the same-source/different-source information provided by the FBI, provides a complete picture of what the expected answers are: an answer key generated for items placed on the scale of Figure 12 identifies which items are expected to generate conclusive vs. inconclusive answers; for the conclusive items, same-source pairs should be individualizations and different-source pairs should be exclusions.
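The logic of combining a 'conclusiveness' answer key with the known same-source/different-source status can be sketched as follows (hypothetical helper names; the category strings mirror the scales in Figures 9 and 12):

```python
def conclusiveness_category(response):
    # Map a Black Box response onto the three-point 'conclusiveness' scale.
    if response == "No Value":
        return "no value"
    if response == "Inconclusive":
        return "inconclusive"
    return "conclusive"  # Individualization or Exclusion

def expected_answer(conclusiveness, same_source):
    """Combine a conclusiveness answer with ground truth (same/different source).

    Conclusive items resolve to Individualization for same-source pairs and
    Exclusion for different-source pairs; other answers pass through unchanged.
    """
    if conclusiveness != "conclusive":
        return conclusiveness
    return "Individualization" if same_source else "Exclusion"

assert conclusiveness_category("Exclusion") == "conclusive"
assert expected_answer("conclusive", same_source=False) == "Exclusion"
assert expected_answer("inconclusive", same_source=True) == "inconclusive"
```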
We fit four models to the Black Box data: (1) the LTRM (Anders and Batchelder, 2015), (2) an adapted LTRM based on a cumulative logits model (C-LTRM), (3) an adapted LTRM based on an adjacent logits model (A-LTRM), and (4) an IRTree model. Each of the four models is detailed below.

Figure 12: FBI Black Box responses on a 'conclusiveness' scale. The scale runs from 'No Value' (lack of information in the latent print), through 'Inconclusive' (lack of information in the latent/reference print pair), to 'Exclusion and Individualization' (enough information for a conclusive decision), with information present in the item increasing along the scale.
Latent Truth Rater Model
Let Y_ij = c denote examiner i's categorical response to item j, where c = 1 is the response 'No Value', c = 2 is the response 'Inconclusive', and c = 3 is the response 'Conclusive'. Key features of the LTRM in our context are T_j, the latent 'answer key' for item j, and γ_c (c = 1, 2), the category boundaries between 'No Value' vs. 'Inconclusive' and between 'Inconclusive' vs. 'Conclusive', respectively. Each examiner draws a latent appraisal of each item (Z_ij), which is assumed to follow a normal distribution with mean T_j (the 'true' location of item j) and precision τ_ij, which depends on both examiner competency (E_i) and item difficulty (λ_j) (that is, τ_ij = E_i λ_j). If every examiner uses the 'true' category boundaries, then Y_ij = 'No Value' if Z_ij ≤ γ_1, Y_ij = 'Inconclusive' if γ_1 < Z_ij ≤ γ_2, and Y_ij = 'Conclusive' if Z_ij > γ_2. Individuals, however, might use a biased form of the category thresholds, and so individual category thresholds, δ_i,c = a_i γ_c + b_i, are defined, where a_i and b_i are examiner scale and shift biasing parameters, respectively. That is, a_i shrinks or expands the category thresholds for examiner i, and b_i shifts the category thresholds to the left or right. The model is thus

P(Y_ij = No Value) = P(Z_ij ≤ δ_i,1) = P(T_j + ε_ij ≤ a_i γ_1 + b_i) = F(a_i γ_1 + b_i)  (16)
P(Y_ij = Inconclusive) = P(δ_i,1 < Z_ij ≤ δ_i,2) = P(a_i γ_1 + b_i < T_j + ε_ij ≤ a_i γ_2 + b_i)  (17)
  = F(a_i γ_2 + b_i) − F(a_i γ_1 + b_i)  (18)
P(Y_ij = Conclusive) = P(Z_ij > δ_i,2) = P(T_j + ε_ij > a_i γ_2 + b_i) = 1 − F(a_i γ_2 + b_i),  (19)

where F(u) is the CDF of a normal variable with mean T_j and precision τ_ij. The likelihood of the data under the LTRM is then

L(Y | T, a, b, γ, E, λ) = ∏_i ∏_j [F(δ_i,y_ij) − F(δ_i,y_ij−1)],  (20)

where δ_i,0 = −∞, δ_i,3 = ∞, and δ_i,c = a_i γ_c + b_i.
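For a single examiner-item pair, the LTRM category probabilities (16)–(19) can be evaluated with the normal CDF. A Python sketch (all parameter values are hypothetical; the paper fits this model in a Bayesian framework rather than evaluating it pointwise like this):

```python
import math

def norm_cdf(x, mean, precision):
    # CDF of a normal variable with the given mean and precision (1 / variance).
    sd = precision ** -0.5
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def ltrm_probs(T_j, tau_ij, gamma, a_i, b_i):
    """Category probabilities for one examiner-item pair under the LTRM.

    T_j: latent item location; tau_ij = E_i * lambda_j: appraisal precision;
    gamma: (gamma_1, gamma_2) shared category boundaries;
    a_i, b_i: examiner scale and shift biases (delta_ic = a_i * gamma_c + b_i).
    """
    d1 = a_i * gamma[0] + b_i   # delta_{i,1}
    d2 = a_i * gamma[1] + b_i   # delta_{i,2}
    F = lambda u: norm_cdf(u, T_j, tau_ij)
    return {"No Value":     F(d1),          # eq. (16)
            "Inconclusive": F(d2) - F(d1),  # eqs. (17)-(18)
            "Conclusive":   1.0 - F(d2)}    # eq. (19)

probs = ltrm_probs(T_j=0.8, tau_ij=1.5, gamma=(-1.0, 1.0), a_i=1.2, b_i=-0.1)
assert abs(sum(probs.values()) - 1.0) < 1e-12
```

With an item location T_j well above the upper threshold, most of the probability mass falls on the 'Conclusive' category, as the model intends.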
We next consider adaptations of the LTRM to a logistic modeling framework, with some simplifying assumptions on the LTRM parameters.

Adapted LTRM as a Cumulative Logits Model (C-LTRM)

The original LTRM (Equation 20) is a cumulative-probits model, and is therefore more closely related to standard IRT models than it might seem at first glance. Specifically, if (1) the latent appraisals (Z_ij) are modeled with a logistic instead of a normal distribution, (2) it is assumed that τ_ij = E_i λ_j = 1 for all i, j, and (3) it is assumed that a_i = 1 for all i, then the model collapses into a more familiar cumulative logits IRT model,

log [ P(Y_ij ≤ c) / P(Y_ij > c) ] = b_i − T_j + γ_c.  (21)

This transformed model has the same form as the Graded Response Model (Samejima, 1969). Relaxing the assumption that a_i = 1, a cumulative logits model with a scaling effect for each person on the item categories is obtained, which we call the cumulative-logits LTRM (C-LTRM),

log [ P(Y_ij ≤ c) / P(Y_ij > c) ] = b_i − T_j + a_i γ_c.  (22)

The likelihood for the data under Equation 22 is

L(Y | a, b, T, γ) = ∏_i ∏_j [ exp(b_i − T_j + a_i γ_y_ij) / (1 + exp(b_i − T_j + a_i γ_y_ij)) − exp(b_i − T_j + a_i γ_y_ij−1) / (1 + exp(b_i − T_j + a_i γ_y_ij−1)) ],  (23)

where γ_0 = −∞ and γ_C = ∞.

Adapted LTRM as an Adjacent Category Logits Model (A-LTRM)

Making the same assumptions as above, P(Y_ij = c) could instead be expressed using an adjacent-categories logit model,

log [ P(Y_ij = c) / P(Y_ij = c − 1) ] = b_i − T_j + γ_c,  (24)

which takes the same form as the Rating Scale Model (Andrich, 1978). The RSM has nice theoretical properties due to the separability of T_j and b_i in the likelihood, and re-casting the LTRM as an adjacent-categories model opens the possibility of more direct theoretical comparisons between models. Relaxing the assumption that a_i = 1, a generalized adjacent-categories logit model with a scaling effect for each person on the item categories is obtained, which we call the adjacent-logits LTRM (A-LTRM),

log [ P(Y_ij = c) / P(Y_ij = c − 1) ] = b_i − T_j + a_i γ_c.  (25)

The likelihood is then

L(Y | a, b, T, γ) = ∏_i ∏_j exp(b_i − T_j + a_i γ_y_ij) / (1 + exp(b_i − T_j + a_i γ_y_ij)).  (26)

IRTree for answer key generation
For comparison, we also consider a simplified IRTree model for answer key generation, which does not include the reason provided for inconclusive responses (as the model in Section 3.3 did). This simplification was made for two reasons: first, the simplified IRTree model allows us to make inferences on the 'conclusiveness' scale in Figure 12, facilitating comparison with the CCT models; second, the reasons provided for inconclusive responses are relatively inconsistent. Indeed, in a follow-up study done by the FBI (Ulery et al., 2012), 72 Black Box study participants were asked to re-assess 25 items. 85% of no value assessments, 90% of exclusion evaluations, 68% of inconclusive responses, and 89% of individualization evaluations were repeated, while only 44% of 'Close', 21% of 'Insufficient', and 51% of 'No Overlap' responses were repeated. Inconclusive reasoning thus varies more within examiners than the source evaluations do, and a generated answer key containing reasons for inconclusives may not be reliable or consistent across time.

The tree structure for the simplified IRTree model is shown in Figure 13. The first internal node (Y*_1) represents the value assessment, the second internal node (Y*_2) represents the conclusive decision, and the third internal node (Y*_3) represents the individualization/exclusion decision. Note that Y*_3 is not a part of the conclusiveness scale in Figure 12, and thus provides additional information beyond the 'conclusiveness' answer key.

Figure 13: The answer key IRTree.

We focus on comparing the answer keys generated by each of the models. As a simple baseline answer key, we also calculate the modal response for each item using the observed responses. Unlike the IRTree and LTRM approaches, this baseline answer key does not account for the different tendencies of the examiners who answered each item; nor does it account for items being answered by different numbers of examiners. The LTRM, A-LTRM, and C-LTRM all estimate the answer key, a combination of T_j's and γ_c's, directly. The answer for item j is 'No Value' if T_j < γ_1, 'Inconclusive' if γ_1 < T_j < γ_2, and 'Conclusive' if T_j > γ_2. For the IRTree model, an answer key was calculated based on what one would expect an 'unbiased examiner' to respond. The response of a hypothetical unbiased examiner (i.e. θ_ki = 0 for all k) to each question was predicted, using the estimated item parameters at each split.

There are thus five answer keys: (1) the modal answer key, (2) the LTRM answer key, (3) the C-LTRM answer key, (4) the A-LTRM answer key, and (5) the IRTree answer key. Each of the answer keys has three possible answers: no value, inconclusive, or conclusive. Table 4 shows the number of items (out of 744) on which the answer keys disagreed. The most similar answer keys were the A-LTRM and C-LTRM, which disagreed on only six items: three that disagreed between inconclusive/conclusive and three that disagreed between no value and inconclusive. The original LTRM model most closely matched the modal answer, with the A-LTRM model disagreeing with the modal answer most often.

         Modal   LTRM   C-LTRM   A-LTRM   IRTree
Modal      0      -       -        -        -
LTRM      12      0       -        -        -
C-LTRM    48     39       0        -        -
A-LTRM    52     43       6        0        -
IRTree    32     24      28       34        0

Table 4: The number of items whose answers disagreed among the five approaches to finding an answer key. The C-LTRM and A-LTRM most closely matched each other, and the original LTRM answer key most closely matched the modal answer.

Recall that the three possible answers were (1) 'no value', (2) 'inconclusive', or (3) 'conclusive'.
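The comparison in Table 4 amounts to counting, for each pair of answer keys, the items on which they disagree. A sketch of that tabulation (with a toy set of keys; the real keys come from the fitted models and cover 744 items):

```python
from itertools import combinations

def disagreement_counts(keys):
    """Pairwise disagreement counts among answer keys.

    keys: dict mapping a key name to a list of per-item answers
    ('no value', 'inconclusive', or 'conclusive'), all of equal length.
    """
    return {(m1, m2): sum(a != b for a, b in zip(keys[m1], keys[m2]))
            for m1, m2 in combinations(keys, 2)}

# Toy example with three items only.
keys = {"Modal":  ["conclusive", "no value",     "inconclusive"],
        "LTRM":   ["conclusive", "no value",     "conclusive"],
        "IRTree": ["conclusive", "inconclusive", "conclusive"]}
counts = disagreement_counts(keys)
assert counts[("Modal", "LTRM")] == 1
assert counts[("Modal", "IRTree")] == 2
assert counts[("LTRM", "IRTree")] == 1
```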
There were 48 items for which at least one of the models disagreed with the others. The vast majority of these disagreements were between 'no value' and 'inconclusive' or between 'inconclusive' and 'conclusive'. Of the 48 items on which the models disagreed, only five were rated conclusive by some models and no value by others. All five of these items were predicted to be 'no value' by the LTRM, 'inconclusive' by the A-LTRM and C-LTRM, and 'exclusion' by the IRTree. Table 5 shows the number of observed responses in each category for these five items and illuminates two problems with the LTRM approaches. First, the original LTRM strictly follows the modal response, even when a substantial number of examiners came to a different conclusion. In Question 665, for example, eight examiners were able to make a correct exclusion, while the LTRM still chose 'no value' as the correct response. Second, the A-LTRM and C-LTRM models may rely too much on the ordering of outcomes. Both adapted LTRM models predicted these items to be inconclusive, yet most examiners who saw these items rated them as either 'no value' or 'exclusion'.

Using a model-based framework to generate expected answers provides more robust answer keys than relying on the observed responses alone. Both IRTrees and a CCT-based approach allow for the estimation of person and item effects alongside an answer key. Furthermore, although the two approaches are formulated quite differently, they lead to similar generated answer keys in the Black Box data. This similarity is due to the conditional sufficient statistics for the item location parameters being closely related in the two models (see Luby, 2019a, for further details).

For this setting, we prefer using the IRTree framework to analyze responses because it does not require the responses to be ordered and because each decision may be modeled explicitly.
In addition, model fit comparisons using the Widely Applicable AIC index (WAIC; Vehtari et al., 2017; Watanabe, 2010), as well as in-sample prediction error, prefer the IRTree model for these data; see Table 6.

Item ID   No Value   Inconclusive   Exclusion
427          13            3            13
438          12            3             7
443           7            1             6
665           9            4             8
668          14            1            11

Table 5: The number of observed responses in each category for the five items with a disagreement between no value and conclusive.

Table 6: WAIC and in-sample prediction error for each of the four models. In order to compare the IRTree to the LTRM models – which only predict no value, inconclusive, or conclusive responses – individualizations and exclusions (i.e. Y*_3 in Figure 13) were grouped together.

Model     WAIC    SE    In-Sample Prediction Error
LTRM      40416   748   0.19
C-LTRM    13976   175   0.14
A-LTRM    14053   178   0.15
IRTree    12484   166   0.12

In this survey of recent advances in the psychometric analysis of forensic decision-making process data, we have applied a wide variety of models, including the Rasch model, Item Response Trees, and Cultural Consensus models, to identification tasks in the FBI Black Box study of error rates in fingerprint examination. Careful analysis of forensic decision-making processes unearths a series of sequential responses that to date have often been ignored, while the final decision is simply scored as either correct or incorrect. Standard IRT models applied to scored data, such as the Rasch model of Section 3.1, provide substantial improvements over current examiner error rate studies: examiner proficiencies can be justifiably compared even if the examiners did not do the same identification tasks, and the influence of the varying difficulty of identification tasks can be seen in examiner proficiency estimates.
Additional modeling techniques are needed to account for the co-varying responses present in the form of reported difficulty (Section 3.2), the sequential nature of examiner decision-making (Section 3.3), and the lack of an answer key for scoring 'no value' and 'inconclusive' responses (Section 3.4). See Luby (2019a) for further developments of all methods presented here.

In our analyses, we found a number of interesting results with important implications for subjective forensic science domains. Taken together, the results presented here demonstrate the rich possibilities in accurately modeling the complex decision-making in fingerprint identification tasks.

For instance, results from Section 3.2.2 show that there are differences among fingerprint examiners in how they report the difficulty of identification tasks, and that this behavior is not directly related to examiners' estimated proficiency. Instead, examiners tended to over-rate task difficulty when the task was of middling difficulty, and to under-rate the difficulty of tasks that were either extremely easy or extremely hard. A similar effect also holds for the intermediate decisions in an IRTree analysis (Luby, 2019a).

Furthermore, we have shown that there is substantial variability among examiners in their tendency to make no value and inconclusive decisions, even after accounting for the variation in the items they were shown (Section 3.3.2). The variation in these tendencies could lead to additional false identifications (in the case of 'no value' evidence being further analyzed), or to guilty perpetrators going free (in the case of 'valuable' evidence not being further analyzed).
To minimize the variation in examiner decisions, examiners should receive feedback not only when they make false identifications or exclusions, but also when they make mistaken no value or inconclusive decisions. Finally, in Section 3.4, we show how to use the data to infer which 'no value' or 'inconclusive' responses are likely to be mistaken.

Our analyses were somewhat limited by the available data; the Black Box study was designed to measure examiner performance without ascertaining how those decisions were made. Privacy and confidentiality considerations on behalf of the persons providing fingerprints for the study make it impossible for the FBI to share the latent and reference prints for each identification task; if they were available, we expect meaningful item covariates could be generated, perhaps through image analysis. Similar considerations on behalf of examiners preclude the possibility of demographic or background variables (e.g. nature of training, number of years in service, etc.) linked to individual examiners; auxiliary information such as examiners' annotations of selected features, or their clarity and correspondence determinations, is also not available. Each of these, if available, might help elucidate individual differences in examiner behavior and proficiency.

We anticipate future collaboration with experts in human decision-making to improve the models, and with fingerprint domain experts to determine the type and amount of data that would be needed to make precise and accurate assessments of examiner proficiency and task difficulty.

Finally, we expect a future line of work will be to consider what would be needed to connect error rates, statistical measures of uncertainty, and examiner behavior collected from standardized/idealized testing situations, such as those discussed in this paper, with task performance by examiners in authentic forensic investigations.
References
AAAS (2017). Forensic Science Assessments: A Quality and Gap Analysis - Latent Fingerprint Examination. Technical report (prepared by William Thompson, John Black, Anil Jain, and Joseph Kadane).
Anders, R. and Batchelder, W. H. (2015). Cultural consensus theory for the ordinal data case. Psychometrika, 80(1):151–181.
Andrich, D. (1978). Application of a psychometric rating model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2(4):581–594.
Batchelder, W. H. and Romney, A. K. (1988). Test theory without an answer key. Psychometrika, 53(1):71–92.
Bécue, A., Eldridge, H., and Champod, C. (2019). Fingermarks and other body impressions – a review (August 2016 – June 2019).
Casabianca, J. M., Junker, B. W., and Patz, R. J. (2016). Hierarchical rater models. In Handbook of Item Response Theory, Volume One, pages 477–494. Chapman and Hall/CRC.
De Boeck, P. and Partchev, I. (2012). IRTrees: Tree-based item response models of the GLMM family. Journal of Statistical Software, Code Snippets, 48(1):1–28.
De Boeck, P. and Wilson, M. (2004). Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. Springer, New York.
Dror, I. E. and Langenburg, G. (2019). ‘Cannot decide’: The fine line between appropriate inconclusive determinations versus unjustifiably deciding not to decide. Journal of Forensic Sciences, 64(1):10–15.
Evett, I. and Williams, R. (1996). A review of the sixteen point fingerprint standard in England and Wales. Journal of Forensic Identification, 46:49–73.
Ferrando, P. J. and Lorenzo-Seva, U. (2007). An item response theory model for incorporating response time data in binary personality items. Applied Psychological Measurement, 31(6):525–543.
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37(6):359–374.
Fischer, G. H. and Molenaar, I. W. (2012). Rasch Models: Foundations, Recent Developments, and Applications. Springer Science & Business Media, New York.
Gardner, B. O., Kelley, S., and Pan, K. D. (2019). Latent print proficiency testing: An examination of test respondents, test-taking procedures, and test characteristics. Journal of Forensic Sciences.
Garrett, B. L. and Mitchell, G. (2017). The proficiency of experts. University of Pennsylvania Law Review, 166:901.
Haber, R. N. and Haber, L. (2014). Experimental results of fingerprint comparison validity and reliability: A review and critical analysis. Science & Justice, 54(5):375–389.
Holland, P. W. and Rosenbaum, P. R. (1986). Conditional association and unidimensionality in monotone latent variable models. The Annals of Statistics, 14(4):1523–1543.
Janssen, R., Schepers, J., and Peres, D. (2004). Models with item and item group predictors. In Explanatory Item Response Models, pages 189–212. Springer.
Kellman, P. J., Mnookin, J. L., Erlikhman, G., Garrigan, P., Ghose, T., Mettler, E., Charlton, D., and Dror, I. E. (2014). Forensic comparison and matching of fingerprints: Using quantitative image measures for estimating error rates through understanding and predicting difficulty. PLoS ONE, 9(5):e94617.
Kerkhoff, W., Stoel, R., Berger, C., Mattijssen, E., Hermsen, R., Smits, N., and Hardy, H. (2015). Design and results of an exploratory double blind testing program in firearms examination. Science & Justice, 55(6):514–519.
Langenberg, G. (2009). A performance study of the ACE-V process: A pilot study to measure the accuracy, precision, reproducibility, repeatability, and biasability of conclusions resulting from the ACE-V process. Journal of Forensic Identification, 59(2):219.
Langenburg, G., Champod, C., and Genessay, T. (2012). Informing the judgments of fingerprint analysts using quality metric and statistical assessment tools. Forensic Science International, 219(1–3):183–198.
Langenburg, G., Champod, C., and Wertheim, P. (2009). Testing for potential contextual bias effects during the verification stage of the ACE-V methodology when conducting fingerprint comparisons. Journal of Forensic Sciences, 54(3):571–582.
Lewandowski, D., Kurowicka, D., and Joe, H. (2009). Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100(9):1989–2001.
Liu, S., Champod, C., Wu, J., Luo, Y., et al. (2015). Study on accuracy of judgments by Chinese fingerprint examiners.
Journal of Forensic Science and Medicine, 1(1):33.
Luby, A. (2019a). Accounting for Individual Differences among Decision-Makers with Applications in Forensic Evidence Evaluation.
Luby, A. (2019b). Open Forensic Science in R, chapter 8. rOpenSci Foundation, US.
Luby, A. S. and Kadane, J. B. (2018). Proficiency testing of fingerprint examiners with Bayesian item response theory.
Law, Probability and Risk, 17(2):111–121.
Max, B., Cavise, J., and Gutierrez, R. E. (2019). Assessing latent print proficiency tests: Lofty aims, straightforward samples, and the implications of nonexpert performance. Journal of Forensic Identification, 69(3):281–298.
Oravecz, Z., Vandekerckhove, J., and Batchelder, W. H. (2014). Bayesian cultural consensus theory. Field Methods, 26(3):207–222.
Pacheco, I., Cerchiai, B., and Stoiloff, S. (2014). Miami-Dade research study for the reliability of the ACE-V process: Accuracy & precision in latent fingerprint examinations. Unpublished report, pages 2–5.
President's Council of Advisors on Science and Technology (2016). Forensic science in criminal courts: Ensuring scientific validity of feature-comparison methods. Technical report, Executive Office of the President's Council of Advisors on Science and Technology, Washington, DC.
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. University of Chicago Press, Chicago.
Saks, M. J. and Koehler, J. J. (2008). The individualization fallacy in forensic science evidence. Vand. L. Rev., 61:199.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Page 97.
Stan Development Team (2018a). RStan: the R interface to Stan. R package version 2.18.2.
Stan Development Team (2018b). Stan Modeling Language User's Guide and Reference Manual.
Tangen, J. M., Thompson, M. B., and McCarthy, D. J. (2011). Identifying fingerprint expertise. Psychological Science, 22(8):995–997.
Taylor, M. K., Kaye, D. H., Busey, T., Gische, M., LaPorte, G., Aitken, C., Ballou, S. M., Butt, L., Champod, C., Charlton, D., et al. (2012). Latent print examination and human factors: Improving the practice through a systems approach. Report of the Expert Working Group on Human Factors in Latent Print Analysis. Technical report, U.S. Department of Commerce, National Institute of Standards and Technology (NIST).
Thissen, D. (1983). Timed testing: An approach using item response theory. In Weiss, D. J., editor, New Horizons in Testing, chapter 9, pages 179–203. Academic Press, San Diego.
Ulery, B. T., Hicklin, R. A., Buscaglia, J., and Roberts, M. A. (2011). Accuracy and reliability of forensic latent fingerprint decisions. Proceedings of the National Academy of Sciences, 108(19):7733–7738.
Ulery, B. T., Hicklin, R. A., Buscaglia, J., and Roberts, M. A. (2012). Repeatability and reproducibility of decisions by latent fingerprint examiners. PLoS ONE, 7(3):e32800.
Ulery, B. T., Hicklin, R. A., Roberts, M. A., and Buscaglia, J. (2014). Measuring what latent fingerprint examiners consider sufficient information for individualization determinations. PLoS ONE, 9(11):e110179.
Ulery, B. T., Hicklin, R. A., Roberts, M. A., and Buscaglia, J. (2017). Factors associated with latent fingerprint exclusion determinations. Forensic Science International, 275:65–75.
van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2):181–204.
van der Linden, W. J., Klein Entink, R. H., and Fox, J.-P. (2010). IRT parameter estimation with response times as collateral information. Applied Psychological Measurement, 34(5):327–347.
Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5):1413–1432.
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11(Dec):3571–3594.
Wertheim, K., Langenburg, G., and Moenssens, A. (2006). A report of latent print examiner accuracy during comparison training exercises.