Publication


Featured research published by David M. Williamson.


Speech Communication | 2009

Automatic scoring of non-native spontaneous speech in tests of spoken English

Klaus Zechner; Derrick Higgins; Xiaoming Xi; David M. Williamson

This paper presents the first version of the SpeechRater℠ system for automatically scoring non-native spontaneous high-entropy speech in the context of an online practice test for prospective takers of the Test of English as a Foreign Language® internet-based test (TOEFL® iBT). The system consists of a speech recognizer trained on non-native English speech data, a feature computation module that uses the recognizer output to compute a set of mostly fluency-based features, and a multiple regression scoring model that predicts a speaking proficiency score for every test item response using a subset of the features generated by the previous component. Experiments with classification and regression trees (CART) complement those performed with multiple regression. We evaluate the system both on TOEFL Practice Online (TPO) data and on Field Study data collected before the introduction of the TOEFL iBT. Features are selected by test development experts based both on their empirical correlations with human scores and on their coverage of the concept of communicative competence. We conclude that although the correlation between machine and human scores on TPO (0.57) still differs by 0.17 from the inter-human correlation (0.74) on complete sets of six items (Pearson r correlation coefficients), it is high enough to warrant deployment of the system in a low-stakes practice environment, given its coverage of several important aspects of communicative competence such as fluency, vocabulary diversity, grammar, and pronunciation. Deployment is further warranted because this system is the initial version of a long-term research and development program: features related to vocabulary, grammar, and content can be added at a later stage, as automatic speech recognition performance improves, without a redesign of the system. Exact agreement between our system and human scores on single TPO items was 57.8%, essentially on par with the inter-human agreement of 57.2%. Our system has been in operational use to score TOEFL Practice Online Speaking tests since the fall of 2006 and has since scored tens of thousands of tests.
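
For readers who want a concrete picture of the scoring-model stage described above, the sketch below fits a multiple regression of human proficiency scores on a few recognizer-derived fluency features and uses it to score a new response. The feature names, data, and score scale are hypothetical illustrations, not SpeechRater's actual feature set or model.

```python
# Minimal sketch of a regression-based scoring model over fluency features.
# All feature names and numbers are hypothetical, not SpeechRater's.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

# Hypothetical per-response features: words per second, silence ratio,
# mean pause length (seconds), unique-word count.
X_train = np.array([
    [2.1, 0.12, 0.40, 85],
    [1.4, 0.30, 0.95, 52],
    [2.6, 0.08, 0.30, 110],
    [1.8, 0.22, 0.70, 68],
    [2.3, 0.15, 0.50, 90],
])
y_train = np.array([3.5, 2.0, 4.0, 2.5, 3.0])   # human holistic scores (1-4 scale)

model = LinearRegression().fit(X_train, y_train)

# Score a new item response from its feature vector.
x_new = np.array([[2.0, 0.18, 0.60, 75]])
print("predicted score:", round(model.predict(x_new)[0], 2))

# Evaluation in the paper is reported as Pearson r between machine and human
# scores (on held-out data in practice; training data reused here for brevity).
print("Pearson r:", round(pearsonr(y_train, model.predict(X_train))[0], 2))
```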


Archive | 2006

Automated scoring of complex tasks in computer-based testing

David M. Williamson; Isaac I. Bejar; Robert J. Mislevy

Contents:
Preface.
D.M. Williamson, I.I. Bejar, R.J. Mislevy, Automated Scoring of Complex Tasks in Computer-Based Testing: An Introduction.
R.J. Mislevy, L.S. Steinberg, R.G. Almond, J.F. Lukas, Concepts, Terminology, and Basic Models of Evidence-Centered Design.
I.I. Bejar, D.M. Williamson, R.J. Mislevy, Human Scoring.
H. Braun, I.I. Bejar, D.M. Williamson, Rule-Based Methods for Automated Scoring: Application in a Licensing Context.
M.J. Margolis, B.E. Clauser, A Regression-Based Procedure for Automated Scoring of a Complex Medical Performance Assessment.
H. Wainer, L.M. Brown, E.T. Bradlow, X. Wang, W.P. Skorupski, J. Boulet, R.J. Mislevy, An Application of Testlet Response Theory in the Scoring of a Complex Certification Exam.
D.M. Williamson, R.G. Almond, R.J. Mislevy, R. Levy, An Application of Bayesian Networks in Automated Scoring of Computerized Simulation Tasks.
R.H. Stevens, A. Casillas, Artificial Neural Networks.
P. Deane, Strategies for Evidence Identification Through Linguistic Assessment of Textual Responses.
K. Scalise, M. Wilson, Analysis and Comparison of Automated Scoring Approaches: Addressing Evidence-Based Assessment Principles.
R.E. Bennett, Moving the Field Forward: Some Thoughts on Validity and Automated Scoring.


International Journal of Testing | 2004

Introduction to Evidence Centered Design and Lessons Learned From Its Application in a Global E-Learning Program

John T. Behrens; Robert J. Mislevy; Malcolm Bauer; David M. Williamson; Roy Levy

This article introduces the assessment and deployment contexts of the Networking Performance Skill System (NetPASS) project and the articles in this section that report on findings from this endeavor. First, the educational context of the Cisco Networking Academy Program is described. Second, the basic outline of Evidence Centered Design is presented. Third, the intersection of these two activities in the NetPASS project is described and the subsequent articles are introduced.


Journal of Educational and Behavioral Statistics | 2003

Calibrating Item Families and Summarizing the Results Using Family Expected Response Functions

Sandip Sinharay; Matthew S. Johnson; David M. Williamson

Item families, which are groups of related items, are becoming increasingly popular in complex educational assessments. For example, in automatic item generation (AIG) systems, a test may consist of multiple items generated from each of a number of item models. Item calibration or scoring for such an assessment requires fitting models that can take into account the dependence structure inherent among items that belong to the same item family. Glas and van der Linden (2001) suggest a Bayesian hierarchical model to analyze data involving item families with multiple-choice items. We fit the model using the Markov chain Monte Carlo (MCMC) algorithm, introduce the family expected response function (FERF) as a way to summarize the probability of a correct response to an item randomly generated from an item family, and suggest a way to estimate the FERFs. This work is thus a step toward creating a tool that can save a significant amount of resources in educational testing by allowing proper analysis and summarization of data from tests involving item families.
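
As a rough illustration of the FERF idea, the sketch below averages a two-parameter logistic item response function over item parameters drawn from family-level distributions. The hyperparameter values, and the use of a plain 2PL rather than the multiple-choice hierarchical model fitted in the paper, are assumptions made purely for illustration.

```python
# Monte Carlo sketch of a family expected response function (FERF): the
# probability of a correct response to an item drawn at random from an item
# family, as a function of ability theta. A 2PL response function and
# hypothetical family-level hyperparameters are used for illustration only.
import numpy as np

rng = np.random.default_rng(0)
theta = np.linspace(-3, 3, 61)                 # ability grid

# Hypothetical family-level hyperparameters (e.g., posterior means from MCMC).
mu_b, sigma_b = 0.2, 0.4                       # within-family difficulty spread
mu_log_a, sigma_log_a = 0.1, 0.2               # within-family log-discrimination spread

def item_response(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Draw many items from the family and average their response functions.
n_items = 5000
a = np.exp(rng.normal(mu_log_a, sigma_log_a, n_items))
b = rng.normal(mu_b, sigma_b, n_items)
ferf = item_response(theta[:, None], a[None, :], b[None, :]).mean(axis=1)

for t, p in zip(theta[::15], ferf[::15]):
    print(f"theta={t:+.1f}  P(correct)={p:.3f}")
```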


International Journal of Testing | 2004

Design Rationale for a Complex Performance Assessment

David M. Williamson; Malcolm Bauer; Linda S. Steinberg; Robert J. Mislevy; John T. Behrens; Sarah F. Demark

In computer-based interactive environments meant to support learning, students must bring a wide range of relevant knowledge, skills, and abilities to bear jointly as they solve meaningful problems in a learning domain. To function effectively as an assessment, a computer system must additionally be able to evoke and interpret observable evidence about targeted knowledge in a manner that is principled, defensible, and suited to the purpose at hand (e.g., licensure, achievement testing, coached practice). This article describes the foundations for the design of an interactive computer-based assessment of design, implementation, and troubleshooting in the domain of computer networking. The application is a prototype for assessing these skills as part of an instructional program, as interim practice tests, and as chapter or end-of-course assessments. An Evidence Centered Design (ECD) framework was used to guide the work. An important part of this work is a cognitive task analysis designed (a) to tap the knowledge computer network specialists and students use when they design and troubleshoot networks and (b) to elicit behaviors that manifest this knowledge. After summarizing its results, we discuss the implications of this analysis, and of information gathered through other methods of domain analysis, for designing psychometric models, automated scoring algorithms, and task frameworks, and for the capabilities required to deliver this example of a complex computer-based interactive assessment.


Applied Measurement in Education | 2004

Automated Tools for Subject Matter Expert Evaluation of Automated Scoring

David M. Williamson; Isaac I. Bejar; Anne Sax

As automated scoring of complex constructed-response examinations reaches operational status, the process of evaluating the quality of resultant scores, particularly in contrast to scores of expert human graders, becomes as complex as the data itself. Using a vignette from the Architectural Registration Examination (ARE), this article explores the potential utility of Classification and Regression Trees (CART) and Kohonen Self-Organizing Maps (SOM) as tools to facilitate subject matter expert (SME) examination of the fine-grained (feature-level) quality of automated scores for complex data, with implications for the validity of resultant scores. This article explores both supervised and unsupervised learning techniques, with the former being represented by CART (Breiman, Friedman, Olshen, & Stone, 1984) and the latter by SOM (Kohonen, 1989). Three applications comprise this investigation, the first of which suggests that CART can facilitate efficient and economical identification of specific elements of complex responses that contribute to automated and human score discrepancies. The second application builds on the first by exploring the use of CART for efficiently and accurately automating case selection for human intervention to ensure score validity. The final application explores the potential for SOM to reduce the need for SMEs in evaluating automated scoring. Although both the supervised and unsupervised methodologies examined were found to be promising tools for facilitating SME roles in maintaining and improving the quality of automated scoring, such applications remain unproven, and further studies are necessary to establish the reliability of these techniques.
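
The first CART application described above is easy to mimic in outline: train a classification tree on feature-level data to flag responses where automated and human scores are likely to disagree, then print the resulting rules for SME review. The sketch below uses synthetic features and a made-up discrepancy rule; it is not the ARE feature set or the study's actual procedure.

```python
# Sketch of using a classification tree (CART-style) to flag responses whose
# automated and human scores are likely to disagree, for SME review.
# Features, data, and the discrepancy rule are synthetic stand-ins.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))                           # feature vector per response
# Synthetic ground truth: discrepancies occur for an unusual feature pattern.
discrepant = ((X[:, 0] > 1.0) & (X[:, 2] < 0.0)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, discrepant)

# The printed rules are what SMEs would inspect: which feature ranges are
# associated with automated/human score disagreement.
print(export_text(tree, feature_names=["f0", "f1", "f2"]))

# Route flagged cases to human review, leave the rest to automated scoring.
flags = tree.predict(X)
print("cases routed to SMEs:", int(flags.sum()), "of", n)
```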


Language Testing | 2012

A Comparison of Two Scoring Methods for an Automated Speech Scoring System

Xiaoming Xi; Derrick Higgins; Klaus Zechner; David M. Williamson

This paper compares two alternative scoring methods – multiple regression and classification trees – for an automated speech scoring system used in a practice environment. The two methods were evaluated on two criteria: construct representation and empirical performance in predicting human scores. The empirical performance of the two scoring models is reported in Zechner, Higgins, Xi, & Williamson (2009), which discusses the development of the entire automated speech scoring system; the current paper shifts the focus to the comparison of the two scoring methods, elaborating both technical and substantive considerations and providing a reasoned argument for the trade-off between them. We concluded that a multiple regression model with expert weights was superior to the classification tree model. In addition to comparing the relative performance of the two models, we also evaluated the adequacy of the regression model for the intended use. In particular, the construct representation of the model was sufficiently broad to justify its use in a low-stakes application. The correlation of the model-predicted total test scores with human scores (r = 0.7) was also deemed acceptable for practice purposes.
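
To make the trade-off concrete, the sketch below scores simulated responses with a fixed, expert-weighted linear combination of features and with a fitted tree model, then compares each against simulated human scores by Pearson correlation. The weights, features, and data are hypothetical, and a regression tree stands in for the classification-tree approach for simplicity.

```python
# Sketch comparing an expert-weighted linear scoring rule with a tree-based
# model, judged by Pearson correlation with (simulated) human scores.
# Weights, features, and data are hypothetical; a regression tree stands in
# for the classification-tree method for simplicity.
import numpy as np
from scipy.stats import pearsonr
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 4))                           # standardized feature values
human = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Expert-weighted model: weights fixed by judgment, not estimated from data.
expert_weights = np.array([0.4, 0.3, 0.2, 0.1])
linear_pred = X @ expert_weights

# Tree model fit to the same features and human scores.
tree_pred = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, human).predict(X)

print("expert-weighted linear model, r =", round(pearsonr(human, linear_pred)[0], 2))
print("tree model, r =", round(pearsonr(human, tree_pred)[0], 2))
```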


International Journal of Testing | 2012

Comparison of e-rater® Automated Essay Scoring Model Calibration Methods Based on Distributional Targets

Mo Zhang; David M. Williamson; F. Jay Breyer; Catherine Trapani

This article describes two separate but related studies that provide insight into the effectiveness of e-rater score calibration methods based on different distributional targets. In the first study, we developed and evaluated a new type of e-rater scoring model that was cost-effective and applicable under conditions of absent human ratings and small candidate volumes. This new model type, called the Scale Midpoint Model, outperformed an existing e-rater scoring model that is often adopted by certain e-rater system users without modification. In the second study, we examined the impact of three distributional score calibration approaches on existing models’ performance: percentile calibrations of e-rater scores against a human rating distribution, a normal distribution, and a uniform distribution. Results indicated that these score calibration approaches did not have overall positive effects on the performance of existing e-rater scoring models.
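
For the second study's percentile calibrations, the general mechanics are simple quantile mapping: replace each machine score with the value at the same percentile rank of the target distribution. The sketch below does this against a synthetic human-rating distribution; it illustrates the general technique, not e-rater's actual calibration procedure.

```python
# Sketch of percentile (quantile) calibration: map raw machine scores onto a
# target distribution (here, synthetic human ratings) by percentile rank.
# This illustrates the general technique, not e-rater's actual procedure.
import numpy as np

rng = np.random.default_rng(3)
machine_raw = rng.normal(loc=3.2, scale=0.8, size=1000)          # raw machine scores
human_ref = rng.choice([1, 2, 3, 4, 5, 6], size=1000,            # reference human ratings
                       p=[0.05, 0.15, 0.30, 0.30, 0.15, 0.05])

def percentile_calibrate(scores, target):
    """Replace each score with the target-distribution quantile at its percentile rank."""
    ranks = (np.argsort(np.argsort(scores)) + 0.5) / len(scores)  # percentile ranks in (0, 1)
    return np.quantile(target, ranks)

calibrated = percentile_calibrate(machine_raw, human_ref)
print("raw mean/std:       ", machine_raw.mean().round(2), machine_raw.std().round(2))
print("calibrated mean/std:", calibrated.mean().round(2), calibrated.std().round(2))
```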


Archive | 2015

An Introduction to Evidence-Centered Design

Russell G. Almond; Robert J. Mislevy; Linda S. Steinberg; Duanli Yan; David M. Williamson

This chapter provides a brief introduction to evidence-centered assessment design. Although assessment design is an important part of this book, we do not tackle it in a formal way until Part III. Part I builds up a class of mathematical models for scoring an assessment, and Part II discusses how the mathematical models can be refined with data. Although throughout the book there are references to cognitive processes that the probability distributions model, the full discussion of assessment design follows the discussion of the more mathematical issues.
