Stephen G. Sireci
University of Massachusetts Amherst
Publications
Featured research published by Stephen G. Sireci.
Social Indicators Research | 1998
Stephen G. Sireci
Many behavioral scientists argue that assessments used in social indicators research must be content-valid. However, the concept of content validity has been controversial since its inception. The current unitary conceptualization of validity argues against use of the term content validity, but stresses the importance of content representation in the instrument construction and evaluation processes. However, by arguing against use of this term, the importance of demonstrating content representativeness has been severely undermined. This paper reviews the history of content validity theory to underscore its importance in evaluating construct validity. It is concluded that although measures cannot be “validated” based on content validity evidence alone, demonstration of content validity is a fundamental requirement of all assessment instruments.
Applied Measurement in Education | 2002
April L. Zenisky; Stephen G. Sireci
Computers have had a tremendous impact on assessment practices over the past half century. Advances in computer technology have substantially influenced the ways in which tests are made, administered, scored, and reported to examinees. These changes are particularly evident in computer-based testing, where the use of computers has allowed test developers to re-envision what test items look like and how they are scored. By integrating technology into assessments, it is increasingly possible to create test items that can sample as broad or as narrow a range of behaviors as needed while preserving a great deal of fidelity to the construct of interest. In this article we review and illustrate some of the current technological developments in computer-based testing, focusing on novel item formats and automated scoring methodologies. Our review indicates that a number of technological innovations in performance assessment are increasingly being researched and implemented by testing programs. In some cases, complex psychometric and operational issues have successfully been dealt with, but a variety of substantial measurement concerns associated with novel item types and other technological aspects impede more widespread use. Given emerging research, however, there appears to be vast potential for expanding the use of more computerized constructed-response type items in a variety of testing contexts.
Journal of Cross-Cultural Psychology | 2006
Stephen G. Sireci; Yongwei Yang; James Harter; Eldin J. Ehrlich
Guidelines for translating educational and psychological assessments for use across different languages and cultures have been developed by the International Test Commission and the Joint Committee on Standards for Educational and Psychological Testing. Common themes in these guidelines and standards are that, when translating items, both judgmental and statistical techniques should be used to ensure item comparability across languages, and that rigorous quality-control steps should be included in the translation process. In this study, the authors use differential item functioning (DIF) methodology to evaluate the comparability of translated items at two different points in time: after the initial translation and 4 years later, after the translations were revisited using a more rigorous translation model. The results indicated that the revised translations led to improvements in some, but not all, items. Improvements in the process of translating survey items, even when based on accepted professional standards, should therefore be statistically evaluated. The methodology illustrates how such evaluations of translated survey items can be conducted.
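A note on the analysis: the abstract does not specify which DIF procedure was used, so the following is only a minimal sketch of one common approach, logistic-regression DIF for dichotomously scored items, conditioning each item on its rest-score and testing for a language-group effect. The column names and data are hypothetical.

```python
# Minimal logistic-regression DIF sketch (illustrative only; the authors'
# actual DIF procedure is not specified in the abstract).
import numpy as np
import pandas as pd
import statsmodels.api as sm

def dif_logistic(responses: pd.DataFrame, group: pd.Series) -> pd.DataFrame:
    """Flag items whose responses depend on language group after
    conditioning on the rest-score (uniform DIF via logistic regression)."""
    total = responses.sum(axis=1)
    rows = []
    for item in responses.columns:
        rest = total - responses[item]  # matching variable: total minus studied item
        X = sm.add_constant(pd.DataFrame({"rest": rest, "group": group}))
        fit = sm.Logit(responses[item], X).fit(disp=0)
        rows.append({"item": item,
                     "group_coef": fit.params["group"],
                     "p_value": fit.pvalues["group"]})
    return pd.DataFrame(rows)

# Hypothetical usage: a 0/1 item-response matrix and a 0/1 language indicator.
rng = np.random.default_rng(0)
items = pd.DataFrame(rng.integers(0, 2, size=(500, 10)),
                     columns=[f"item{i}" for i in range(10)])
lang = pd.Series(rng.integers(0, 2, size=500), name="group")
print(dif_logistic(items, lang))
```

Items with a large, statistically significant group coefficient would be candidates for the kind of translation review described above.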
Educational Researcher | 2007
Stephen G. Sireci
Lissitz and Samuelsen (2007) propose a new framework for conceptualizing test validity that separates analysis of test properties from analysis of the construct measured. In response, the author of this article reviews fundamental characteristics of test validity, drawing largely from seminal writings as well as from the accepted standards. He argues that a serious validation endeavor requires integration of construct theory, subjective analysis of test content, and empirical analysis of item and test score data. He argues that the proposals presented by Lissitz and Samuelsen require revision or clarification to be useful to practitioners for justifying the use of a test for a particular purpose. He discusses the strengths and limitations of their proposal, as well as major tenets from other validity perspectives.
Applied Measurement in Education | 2000
Stephen G. Sireci; Giray Berberoglu
Translating and adapting tests and questionnaires across languages is a common strategy for comparing people who operate in different languages with respect to their achievement, attitude, personality, or other psychological construct. Unfortunately, when tests and questionnaires are translated from one language to another, there is no guarantee that the different language versions are equivalent. In this study, we present and evaluate a methodology for investigating the equivalence of translated-adapted items using bilingual test takers. The methodology involves applying item response theory models to data obtained from randomly equivalent groups of bilingual respondents. The technique was applied to an English-Turkish version of a course evaluation form. The results indicate that the methodology is effective for flagging items that function differentially across languages as well as for informing the test development and test adaptation processes. The utility and limitations of the procedure for evaluating translation equivalence are discussed.
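The abstract names item response theory models applied to randomly equivalent bilingual groups without specifying the model, so the sketch below substitutes a much simpler stand-in: centered logit difficulties computed separately for each language form and compared item by item. Because the groups are randomly equivalent, a large difficulty gap is a rough signal of translation DIF. The data, column names, and flagging threshold are hypothetical.

```python
# Simplified stand-in for an IRT-based comparison (illustrative only):
# compare centered logit difficulties of each item across two randomly
# equivalent bilingual groups and flag large discrepancies.
import numpy as np
import pandas as pd

def logit_difficulties(responses: pd.DataFrame) -> pd.Series:
    """Item 'difficulty' as the logit of the proportion incorrect,
    centered so the two language versions share a common scale."""
    p = responses.mean(axis=0).clip(0.01, 0.99)   # proportion correct per item
    b = np.log((1 - p) / p)                       # higher value = harder item
    return b - b.mean()

def flag_items(form_a: pd.DataFrame, form_b: pd.DataFrame,
               threshold: float = 0.5) -> pd.DataFrame:
    """Flag items whose centered difficulties diverge across forms."""
    gap = logit_difficulties(form_a) - logit_difficulties(form_b)
    return pd.DataFrame({"difficulty_gap": gap,
                         "flagged": gap.abs() > threshold})  # threshold is arbitrary

# Hypothetical usage with 0/1 scored responses, one frame per language form.
rng = np.random.default_rng(1)
cols = [f"item{i}" for i in range(8)]
eng = pd.DataFrame(rng.integers(0, 2, size=(300, 8)), columns=cols)
tur = pd.DataFrame(rng.integers(0, 2, size=(300, 8)), columns=cols)
print(flag_items(eng, tur))
```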
Educational and Psychological Measurement | 2006
Stephen G. Sireci; Eileen Talento-Miller
Admissions data and first-year grade point average (GPA) data from 11 graduate management schools were analyzed to evaluate the predictive validity of Graduate Management Admission Test® (GMAT®) scores and the extent to which predictive validity held across sex and race/ethnicity. The results indicated GMAT verbal and quantitative scores had substantial predictive validity, accounting for about 16% of the variance in graduate GPA beyond that predicted by undergraduate GPA. When these scores and undergraduate GPA were used together, they accounted for approximately 25% of the variation in first-year graduate GPA. Correcting correlations for restriction of range improved the predictive power. No statistical differences were found across examinee groups defined by race/ethnicity and sex, which suggests a lack of bias in these scores. The predictive utility of GMAT analytical writing scores was relatively low, accounting for only about 1% of the variation in graduate GPA, after accounting for undergraduate GPA and GMAT verbal and quantitative scores.
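The variance figures above come from comparing nested prediction models. The sketch below shows a minimal incremental-variance (delta R-squared) computation, assuming hypothetical column names for first-year graduate GPA, undergraduate GPA, and GMAT verbal and quantitative scores; it does not reproduce the study's corrections for restriction of range.

```python
# Minimal sketch of an incremental-variance (delta R^2) check: how much
# variance in first-year graduate GPA the GMAT scores explain beyond
# undergraduate GPA. Column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

def incremental_r2(df: pd.DataFrame) -> dict:
    y = df["grad_gpa"]
    base = sm.OLS(y, sm.add_constant(df[["ugpa"]])).fit()
    full = sm.OLS(y, sm.add_constant(df[["ugpa", "gmat_verbal",
                                         "gmat_quant"]])).fit()
    return {"r2_ugpa_only": base.rsquared,
            "r2_full": full.rsquared,
            "delta_r2_gmat": full.rsquared - base.rsquared}

# Hypothetical usage:
# df = pd.read_csv("admissions.csv")  # grad_gpa, ugpa, gmat_verbal, gmat_quant
# print(incremental_r2(df))
```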
Applied Psychological Measurement | 1992
Stephen G. Sireci; Kurt F. Geisinger
A new method for evaluating the content representation of a test is illustrated. Item similarity ratings were obtained from content domain experts in order to assess whether their ratings corresponded to item groupings specified in the test blueprint. Three expert judges rated the similarity of items on a 30-item multiple-choice test of study skills. The similarity data were analyzed using a multidimensional scaling (MDS) procedure followed by a hierarchical cluster analysis of the MDS stimulus coordinates. The results indicated a strong correspondence between the similarity data and the arrangement of items as prescribed in the test blueprint. The findings suggest that analyzing item similarity data with MDS and cluster analysis can provide substantive information pertaining to the content representation of a test. The advantages and disadvantages of using MDS and cluster analysis with item similarity data are discussed.
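The two analysis steps described above, MDS on an item dissimilarity matrix followed by hierarchical clustering of the stimulus coordinates, can be sketched as follows. The library choices (scikit-learn and SciPy), the numbers of dimensions and clusters, and the randomly generated ratings are illustrative assumptions, not the authors' specifications.

```python
# Minimal sketch of the two-step analysis: multidimensional scaling of an
# item-by-item dissimilarity matrix, then hierarchical clustering of the
# resulting stimulus coordinates. Library and parameter choices are illustrative.
import numpy as np
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_items(dissimilarity: np.ndarray, n_dims: int = 2,
                  n_clusters: int = 4) -> np.ndarray:
    """Return a cluster label per item, to be compared with the blueprint."""
    mds = MDS(n_components=n_dims, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(dissimilarity)   # MDS stimulus coordinates
    tree = linkage(coords, method="ward")       # hierarchical clustering
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Hypothetical usage: 30 items, symmetric dissimilarities in [0, 1]
# derived from averaged expert similarity ratings.
rng = np.random.default_rng(2)
sim = rng.random((30, 30))
dis = 1 - (sim + sim.T) / 2
np.fill_diagonal(dis, 0.0)
labels = cluster_items(dis)
print(labels)  # compare these item groupings with the test blueprint
```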
Educational Researcher | 2005
Stephen G. Sireci
Accommodations to standard test administrations are granted on many tests for students who have one or more disabling conditions. In some instances, students’ scores from these nonstandard administrations are “flagged” to caution those who interpret the test score that the test was not administered under typical conditions. The practice of flagging such test scores is contentious. Some argue that it essentially informs others that a student has a disability and creates the opportunity for bias against the student. Others argue that such scores must be flagged to be fair to those who took the test under standard conditions and to promote valid test score interpretations. This article reviews the psychometric issues surrounding flagging and discusses the guidance provided by the Standards for Educational and Psychological Testing. Research on flagging in college admissions testing is also reviewed, and suggestions for avoiding the flagging controversy in the future are provided. This review lends support to the recent decisions by several testing agencies to discontinue flagging practices.
Applied Measurement in Education | 1999
Stephen G. Sireci; Frederic Robin; Thanos Patelis
Setting standards on tests remains an important and pervasive problem in educational and psychological testing. Traditional standard-setting methods have been criticized due to reliance on untested subjective judgment, lack of demonstrated reliability, and lack of external validation. In this article, we present a new procedure designed to help improve previous standard-setting methods. This procedure involves cluster analyzing test takers to discover examinee groups useful for (a) envisioning marginally competent performance as required in test-centered standard-setting methods or (b) defining borderline or contrasting groups used in examinee-centered methods. We applied the procedure to a state-wide mathematics proficiency test. The standards derived from the cluster analyses were compared with those established at the local level and with those derived from a more traditional borderline and contrasting groups analysis. We observed relative congruence across the local cutscores and those derived using cluster analysis.
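As a rough illustration of the clustering idea, the sketch below groups examinees by their subscore profiles with k-means and treats the cluster just below the top-scoring one as a candidate borderline group, taking its median total score as a provisional cutscore. The algorithm, the borderline rule, and the simulated data are all assumptions; the article does not prescribe them.

```python
# Minimal sketch of clustering examinees on subscore profiles to locate a
# "borderline" group whose score distribution can inform a cutscore.
# The algorithm and cutscore rule here are illustrative choices only.
import numpy as np
from sklearn.cluster import KMeans

def borderline_cutscore(subscores: np.ndarray, n_clusters: int = 4) -> float:
    totals = subscores.sum(axis=1)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(subscores)
    # Order clusters by mean total score; treat the one just below the top
    # as a candidate "borderline" group (an illustrative rule, not the article's).
    order = np.argsort([totals[labels == k].mean() for k in range(n_clusters)])
    borderline = order[-2]
    return float(np.median(totals[labels == borderline]))

# Hypothetical usage: 1,000 examinees with 5 subscores each.
rng = np.random.default_rng(3)
scores = rng.normal(loc=10, scale=3, size=(1000, 5))
print(borderline_cutscore(scores))
```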
Psicothema | 2014
Stephen G. Sireci; Molly Faulkner-Bond
Background: Validity evidence based on test content is one of the five forms of validity evidence stipulated in the Standards for Educational and Psychological Testing developed by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. In this paper, we describe the logic and theory underlying such evidence and describe traditional and modern methods for gathering and analyzing content validity data. Method: A comprehensive review of the literature and of the aforementioned Standards is presented. Results: For educational tests and other assessments targeting knowledge and skill possessed by examinees, validity evidence based on test content is necessary for building a validity argument to support the use of a test for a particular purpose. Conclusions: By following the methods described in this article, practitioners have a wide arsenal of tools available for determining how well the content of an assessment is congruent with and appropriate for the specific testing purposes.