Publications


Featured research published by Spiros Papageorgiou.


Language Assessment Quarterly | 2015

Developing and Validating Band Levels and Descriptors for Reporting Overall Examinee Performance

Spiros Papageorgiou; Xiaoming Xi; Rick Morgan; Youngsoon So

This study presents the development and empirical validation of score levels and descriptors specifically designed for reporting purposes to provide test takers with more than just a number on a score scale. In the context of a test primarily intended for 11- to 15-year-old students learning English as a second/foreign language, the study examined the number of band levels that could be meaningfully distinguished, the reliability of the classification of students into these band levels, and the development of overall performance descriptors that would provide meaningful information to score users. The performance data from 2,931 students who took the test were used. The band level solution was determined by balancing considerations for the reliability of classification decisions and the desire for the levels to represent meaningful performance differences. To construct meaningful descriptors for the band levels, multiple sources of information were examined, including the scoring rubrics, the characteristics of test items, typical student performance profiles, and the performance of norm groups on the test. The importance of establishing the psychometric quality of band levels and the empirical basis for performance descriptors, as well as the implications for similar efforts, are discussed.
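
To make the trade-off concrete, here is a minimal sketch (hypothetical data and a deliberately simplified check, not the study's actual procedure) of weighing classification reliability against the number of band levels: classify examinees from two half-test scores and measure how often the two placements agree for each candidate set of cut scores.

```python
# Crude split-half check of classification consistency for candidate band-level
# solutions. All data, cut scores, and function names here are hypothetical.
import numpy as np

def classify(scores: np.ndarray, cuts: np.ndarray) -> np.ndarray:
    """Assign each score to a band level given ordered cut scores."""
    return np.searchsorted(cuts, scores, side="right")

def split_half_agreement(half_a: np.ndarray, half_b: np.ndarray, cuts: np.ndarray) -> float:
    """Proportion of examinees placed in the same band by both half-test scores."""
    return float(np.mean(classify(half_a, cuts) == classify(half_b, cuts)))

# Hypothetical usage: compare candidate solutions with different numbers of levels,
# e.g. cuts_4 = np.array([20, 35, 50]) defines four levels. More levels typically
# lower agreement, so the chosen solution balances reliability against the desire
# for levels that mark meaningful performance differences.
```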


Language Assessment Quarterly | 2012

The Relative Difficulty of Dialogic and Monologic Input in a Second-Language Listening Comprehension Test

Spiros Papageorgiou; Robin Stevens; Sarah Goodwin

Listening comprehension tests typically include both monologic and dialogic input to measure listening ability. However, research as to which type of input is more challenging for examinees remains limited and has provided inconclusive results (Brindley & Slatyer, 2002; Read, 2002; Shohamy & Inbar, 1991). A better understanding of the comparative difficulty of items associated with both input types is important, as it has implications for developing test content at the desired levels of difficulty. This study explores this issue by analyzing examinee performance on test items developed to accompany three pairs of stimuli on the same topic. Each pair of stimuli consists of a monologue and a dialogue with identical content and vocabulary. The test items associated with these stimuli were embedded in 3 test forms taken by 494 examinees as a part of a routine administration of the Michigan English Test. Test results were analyzed with the Rasch computer program WINSTEPS (Linacre, 2009) to investigate the relative difficulty of the items associated with the two versions of the input and the measurement characteristics of the item options. To interpret statistical findings, a content analysis of the stimuli and items was also performed. Findings provide partial support to the hypothesis that items associated with dialogic input may be easier for examinees than the same items associated with identical monologic input. The implications of these findings for developers and users of listening comprehension tests are discussed.
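
For illustration only, a minimal sketch of the kind of difficulty comparison involved (not the WINSTEPS analysis itself, which uses joint maximum likelihood estimation) could approximate Rasch item difficulties from hypothetical 0/1 response matrices for the two input types.

```python
# Rough Rasch-style difficulty comparison for paired dialogic/monologic items.
# The response matrices and the logit-of-facility approximation are illustrative
# assumptions, not the study's data or its WINSTEPS calibration.
import numpy as np

def rasch_prob(theta: float, b: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def approx_item_difficulties(responses: np.ndarray) -> np.ndarray:
    """Crude logit-of-facility difficulty estimates, centered at 0 logits.
    (WINSTEPS estimates difficulties by joint maximum likelihood; this is
    only a rough first approximation.)"""
    p = np.clip(responses.mean(axis=0), 0.01, 0.99)  # item facility (proportion correct)
    b = np.log((1.0 - p) / p)                        # harder items -> larger logits
    return b - b.mean()                              # center the item difficulties

# Hypothetical usage with two (examinees x items) 0/1 matrices for the dialogic
# and monologic versions of the same items:
# d_dialogic  = approx_item_difficulties(responses_dialogic)
# d_monologic = approx_item_difficulties(responses_monologic)
# Positive values of (d_monologic - d_dialogic) would suggest the monologic
# versions were relatively harder, consistent with the hypothesis above.
```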


Language Assessment Quarterly | 2016

Situating Standard Setting within Argument-Based Validity

Spiros Papageorgiou; Richard J. Tannenbaum

Although there has been substantial work on argument-based approaches to validation as well as standard-setting methodologies, it might not always be clear how standard setting fits into argument-based validity. The purpose of this article is to address this lack in the literature, with a specific focus on topics related to argument-based approaches to validation in language assessment contexts. We first argue that standard setting is an essential part of test development and validation because of the important consequences cut scores might have for decision-making. We then present the Assessment Use Argument (AUA) framework and explain how evidence from standard setting can support claims about consequences, decisions, and interpretations. We finally identify several challenges in setting cut scores in relation to the levels of the Common European Framework of Reference (CEFR) and argue that despite these challenges, standard setting is a critical component of any claim focusing on the interpretation and use of test scores in relation to the CEFR levels. We conclude that standard setting should be an integral part of the validity argument supporting score use and interpretation and should not be treated as an isolated event between the completion of test development and the reporting of scores.


Language Testing | 2014

Book review: Aligning Frameworks of Reference in Language Testing: The ACTFL Proficiency Guidelines and the Common European Framework of Reference for Languages

Spiros Papageorgiou

This volume is devoted to the relationship between the Common European Framework of Reference (CEFR; Council of Europe, 2001) and the ACTFL Proficiency Guidelines (American Council on the Teaching of Foreign Languages, 2012). This is an important topic, given that the two standards (or frameworks) have exerted significant influence on various areas of language teaching, learning, and assessment, such as language policy, curriculum design, test design, and score interpretation. Although the CEFR and the ACTFL Guidelines originate from different geographic areas, they share a common element: that is, the description of language proficiency through multi-level language proficiency scales. This naturally raises the question of how they relate to each other, in particular when establishing a link between their language proficiency levels. This collection of papers is “a joint effort to define the issues in and to embark on some preliminary studies for a crosswalk between the ACTFL Guidelines and the CEFR with implications not only for assessment but also for teaching and learning, teacher education, and educational standards” (p. 9). The papers originate from the 2010 ACTFL–CEFR Alignment Conference held at the University of Leipzig. The volume also includes the opening address by the late John Trim and the opening plenary paper by David Little from a follow-up conference in Provo, Utah in 2011. The volume is organized into three parts. Part 1 contains three chapters focusing on theoretical issues related to the ACTFL–CEFR crosswalk. In the first chapter, Kenyon uses Bachman’s Assessment Use Argument (Bachman, 2005; Bachman & Palmer, 2010) as a framework for linking the ACTFL Guidelines and the CEFR from a psychometric perspective. Kenyon describes the multiple types of evidence needed for the linking, and stresses the need for clearly justified arguments to support the linking in a social context. In the second chapter, Chapelle draws upon an argument-based validity theory (Chapelle, 2008; Kane, 2006; Mislevy, Steinberg, & Almond, 2003) to address the extent to which the ACTFL Guidelines and the CEFR are similar in their theoretical approaches to language, construct definition, and language development. Chapelle stresses that theoretical approaches constitute one area that needs to be explored in relation to an ACTFL–CEFR crosswalk, and, similar to Kenyon, calls for fully developed interpretive arguments to support the interpretation and use of specific tests. In the last chapter of Part 1, Clifford considers how differences in test purpose, test type, construct and scoring may hinder ...


International Journal of Testing | 2018

Adding Value to Second-Language Listening and Reading Subscores: Using a Score Augmentation Approach

Spiros Papageorgiou; Ikkyu Choi

This study examined whether reporting subscores for groups of items within a test section assessing a second-language modality (specifically reading or listening comprehension) added value from a measurement perspective to the information already provided by the section scores. We analyzed the responses of 116,489 test takers to reading and listening items from operational administrations of two large-scale international tests of English as a foreign language. To “strengthen” the reliability of the subscores, and thus improve their added value, we applied a score augmentation method (Haberman, 2008). In doing so, our aim was to examine whether reporting augmented subscores for specific groups of reading and listening items could improve the added value of these subscores and consequently justify providing more fine-grained information about test taker performance. Our analysis indicated that, in general, there was a lack of support for reporting subscores from a psychometric perspective, and that score augmentation marginally improved the added value of the subscores. We discuss several implications of our findings for test developers wishing to report more fine-grained information about test performance. We conclude by arguing that research on how to best report such refined feedback should remain the focus of future efforts related to second-language proficiency tests.
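
As a rough sketch of the added-value check that underlies this approach (Haberman, 2008), assuming hypothetical item-score matrices rather than the study's data, one can compare the PRMSE of the observed subscore with that of the total score as predictors of the true subscore; the augmentation step itself, which combines the subscore and total score with regression-style weights, is omitted here.

```python
# Minimal added-value check for a subscore (Haberman, 2008), on assumed data.
# A subscore "adds value" if it predicts the true subscore better (higher PRMSE)
# than the total score does.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal-consistency reliability of an (examinees x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def subscore_added_value(sub_items: np.ndarray, all_items: np.ndarray) -> dict:
    """Compare PRMSE of the observed subscore vs. the total score as predictors
    of the true subscore, under classical test theory assumptions."""
    s = sub_items.sum(axis=1)          # observed subscore
    x = all_items.sum(axis=1)          # observed total score
    alpha_s = cronbach_alpha(sub_items)
    var_s, var_x = s.var(ddof=1), x.var(ddof=1)
    cov_sx = np.cov(s, x, ddof=1)[0, 1]
    # Cov(true subscore, total) = Cov(s, x) minus the subscore's error variance.
    cov_ts_x = cov_sx - (1 - alpha_s) * var_s
    prmse_sub = alpha_s                                  # PRMSE of observed subscore
    prmse_tot = cov_ts_x ** 2 / (alpha_s * var_s * var_x)  # PRMSE of total score
    return {"prmse_subscore": prmse_sub,
            "prmse_total": prmse_tot,
            "adds_value": prmse_sub > prmse_tot}
```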


International Journal of Testing | 2015

Enhancing the Interpretability of the Overall Results of an International Test of English-Language Proficiency

Spiros Papageorgiou; Rick Morgan; Valerie Becker

The purpose of this study was to enhance the meaning of the scores of an English-language test by developing performance levels and descriptors for reporting overall test performance. The levels and descriptors were intended to accompany the total scale scores of TOEFL Junior® Standard, an international test of English as a second/foreign language. The study addressed two issues: the number of performance levels that could be meaningfully reported and the information that should be included in the performance level descriptors. Data from 3,607 students who took an operational test form were used. Although our methodology built to some extent on earlier work (Papageorgiou et al., 2015), we demonstrate how content and construct differences between the tests of each study dictated use of different types of data in order to construct meaningful performance descriptors and select the cut-offs for the levels. We emphasize the importance of establishing the psychometric quality of the reported overall levels and the empirical basis for the content of their performance descriptors, and we discuss the implications of our work for similar efforts in the future.


ETS Research Report Series | 2014

A Design Framework for the ELTeach Program Assessments

John W. Young; Donald Freeman; Maurice Cogan Hauck; Pablo Garcia Gomez; Spiros Papageorgiou


ETS Research Report Series | 2017

An Exploratory Study of Teaching Tasks in English as a Foreign Language Education (ETS RR-17-56)

Sultan Turkan; Veronika Timpe-Laughlin; Spiros Papageorgiou


ETS Research Report Series | 2016

Setting Language Proficiency Score Requirements for English‐as‐a‐Second‐Language Placement Decisions in Secondary Education

Patricia A. Baron; Spiros Papageorgiou

Collaboration


Top co-author: Lin Gu (Princeton University)