Christine E. DeMars
James Madison University
Publications
Featured research published by Christine E. DeMars.
Educational Assessment | 2005
Steven L. Wise; Christine E. DeMars
Student test-taking motivation in low-stakes assessment testing is examined in terms of both its relationship to test performance and the implications of low student effort for test validity. A theoretical model of test-taking motivation is presented, with a synthesis of previous research indicating that low student motivation is associated with a substantial decrease in test performance. A number of assessment practices and data analytic procedures for managing the problems posed by low student motivation are discussed.
Applied Measurement in Education | 2000
Christine E. DeMars
The effects of test consequences, response formats (multiple choice or constructed response), gender, and ethnicity were studied for the math and science sections of a high school diploma endorsement test. There was an interaction between response format and test consequences: under both response formats, students performed better under high stakes (diploma endorsement) than under low stakes (pilot test), but the difference was larger for the constructed-response items. Gender and ethnicity did not interact with test stakes; the means of all groups increased when the test had high stakes. Gender interacted with format: boys scored higher than girls on multiple-choice items, and girls scored higher than boys on constructed-response items.
Behavior Therapy | 2011
Steven W. Evans; Brandon K. Schultz; Christine E. DeMars; Heather Davis
There are no empirically supported psychosocial treatments for adolescents with attention-deficit hyperactivity disorder (ADHD). This study examined the treatment benefits of the Challenging Horizons Program (CHP), a psychosocial treatment program designed to address the impairment and symptoms associated with this disorder in young adolescents. In addition to evaluating social and academic functioning outcomes, two critical questions from previous studies pertaining to the timing, duration, and family involvement in treatment were addressed. Forty-nine students recruited in two cohorts were randomly assigned to receive either the CHP or a community care condition. Outcomes suggested that students who received the CHP improved compared to students in the control condition on measures of symptoms and impairment. Implications related to timing, duration, and family involvement are reported, as well as recommendations for future studies.
Educational Assessment | 2010
Steven L. Wise; Christine E. DeMars
Educational program assessment studies often use data from low-stakes tests to provide evidence of program quality. The validity of scores from such tests, however, is potentially threatened by examinee noneffort. This study investigated the extent to which one type of noneffort—rapid-guessing behavior—distorted the results from three types of commonly used program assessment designs. It was found that, for each design, a modest amount of rapid guessing had a pronounced effect on the results. In addition, motivation filtering was found to be successful in mitigating the effects caused by rapid guessing. It is suggested that measurement practitioners routinely apply motivation filtering whenever the data from low-stakes tests are used to support program decisions.
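As a rough illustration of the filtering idea, the sketch below flags rapid guesses with per-item response-time thresholds and drops examinees whose response time effort falls below a cutoff. The function names, thresholds, and the 0.90 cutoff are illustrative assumptions, not details taken from the study.

```python
# A minimal sketch of motivation filtering, assuming per-item response times
# and per-item rapid-guess thresholds are available. The 0.90 effort cutoff
# and all values below are illustrative, not taken from the study.
import numpy as np

def response_time_effort(rt, thresholds):
    """Proportion of items answered with solution behavior (rt >= threshold)."""
    return float(np.mean(np.asarray(rt, dtype=float) >= np.asarray(thresholds, dtype=float)))

def filter_unmotivated(response_times, item_thresholds, min_rte=0.90):
    """Return indices of examinees whose response time effort meets the cutoff."""
    rte = np.array([response_time_effort(rt, item_thresholds) for rt in response_times])
    return np.where(rte >= min_rte)[0], rte

# Example: 3 examinees by 4 items, response times and thresholds in seconds
times = [[12, 8, 15, 10],   # solution behavior throughout
         [2, 1, 3, 2],      # rapid guessing throughout
         [11, 2, 14, 9]]    # one rapid guess
keep, rte = filter_unmotivated(times, item_thresholds=[5, 4, 6, 5])
print(keep, rte)  # only the first examinee meets the 0.90 cutoff
```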
International Journal of Testing | 2013
Christine E. DeMars
This tutorial addresses possible sources of confusion in interpreting trait scores from the bifactor model. The bifactor model may be used when subscores are desired, either for formative feedback on an achievement test or for theoretically different constructs on a psychological test. The bifactor model is often chosen because it requires fewer computational resources than other models for subscores. The bifactor model yields a score on the general or primary trait measured by the test overall, as well as specific or secondary traits measured by the subscales. Interpreting the general trait score is straightforward, but the specific traits must be interpreted as residuals relative to the general trait. Trait scores on the specific factors are contrasted with trait scores on a simple-structure model with correlated factors, using example data from one TIMSS test booklet and a civic responsibility measure. The correlated factors model was used for contrast because its scores correspond to a more intuitive interpretation of subscores, and thus it helps to illustrate how the bifactor scores should not be interpreted. Estimation details are covered in an appendix.
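A minimal sketch of the two parameterizations, assuming 2PL-type item response functions, may make the residual interpretation concrete; the notation here is illustrative rather than the article's own.

```latex
% Bifactor 2PL: every item j loads on the general trait \theta_g and on one
% specific trait \theta_{s(j)}; the specific traits are orthogonal to the
% general trait, so each \theta_{s(j)} is a residual relative to \theta_g.
P\bigl(X_{ij}=1 \mid \theta_{gi}, \theta_{s(j)i}\bigr)
  = \frac{1}{1 + \exp\!\bigl[-\bigl(a_{gj}\,\theta_{gi} + a_{sj}\,\theta_{s(j)i} + d_j\bigr)\bigr]}

% Correlated-factors 2PL: each item loads on a single subscale trait \theta_k,
% and the subscale traits are free to correlate, so the subscores are ordinary
% subscale traits rather than residuals.
P\bigl(X_{ij}=1 \mid \theta_{ki}\bigr)
  = \frac{1}{1 + \exp\!\bigl[-\bigl(a_{kj}\,\theta_{ki} + d_j\bigr)\bigr]}
```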
Applied Measurement in Education | 2004
Christine E. DeMars
Three methods of detecting item drift were compared: the procedure in BILOG-MG for estimating linear trends in item difficulty, the CUSUM procedure that Veerkamp and Glas (2000) used to detect trends in difficulty or discrimination, and a modification of Kim, Cohen, and Park's (1995) χ² test for multiple-group differential item functioning (DIF), using linear contrasts on the discrimination and difficulty parameters. Data were simulated as if collected over 3, 4, or 5 time points, with parameter drift in either a gradual, linear pattern, a less linear but still monotonic pattern, or a sudden shift at the third time point. The BILOG-MG procedure and the modification of the Kim et al. procedure were more powerful than the CUSUM procedure, nearly always detecting drift. All three procedures had false alarm rates for nondrift items near the nominal alpha. The procedures were also illustrated on a real data set.
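The general idea of a linear trend test on an item's difficulty estimates can be sketched as below. The difficulty estimates and standard errors are made up, and this shows only a simple linear contrast across equally spaced occasions, not the BILOG-MG or modified Kim, Cohen, and Park procedure itself.

```python
# A sketch of a linear-contrast drift check on one item's difficulty estimates
# across equally spaced time points. All numbers are placeholders.
import numpy as np
from scipy import stats

def linear_drift_z(b_hat, se_b):
    """z statistic for a linear trend in difficulty across occasions."""
    b_hat = np.asarray(b_hat, dtype=float)
    se_b = np.asarray(se_b, dtype=float)
    t = np.arange(len(b_hat))
    c = t - t.mean()                          # centered linear contrast weights
    contrast = c @ b_hat                      # estimated linear trend
    se_contrast = np.sqrt(c**2 @ se_b**2)     # SE, treating the estimates as independent
    return contrast / se_contrast

z = linear_drift_z(b_hat=[0.10, 0.25, 0.38, 0.55], se_b=[0.08, 0.08, 0.09, 0.08])
p = 2 * stats.norm.sf(abs(z))
print(round(z, 2), round(p, 4))               # a large |z| suggests gradual drift
```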
Applied Psychological Measurement | 2003
Christine E. DeMars
In this study of polytomous items, the number of items and the number of categories per item were varied to explore the effects on estimation of item parameters in the nominal response model. De Ayala and Sava-Bolesta's (1999) work suggested that the ratio of the sample size to the total number of item parameters was a key factor. They varied the total number of item parameters by increasing the sample size or changing the number of categories per item while leaving the number of items constant. In this study, the total number of item parameters, the sample size, and the number of categories were manipulated as separate factors. Increasing the number of items had little effect on item parameter recovery, but increasing the number of categories increased the error variance of the parameter estimates. Error variance was also greater for more highly discriminating items and for skewed distributions of ability. Index terms: nominal response, parameter recovery, polytomous IRT, MULTILOG.
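For reference, a sketch of the nominal response model's category response function (in notation assumed here, not taken from the article) shows why each additional category adds item parameters.

```latex
% Bock's nominal response model: examinee i with trait \theta_i selects
% category k of item j, which has m_j categories with slope a_{jk} and
% intercept c_{jk} per category. With the usual identification constraints
% (e.g., \sum_k a_{jk} = \sum_k c_{jk} = 0), each item contributes
% 2(m_j - 1) free parameters, so adding categories adds parameters faster
% than adding items.
P\bigl(X_{ij} = k \mid \theta_i\bigr)
  = \frac{\exp\bigl(a_{jk}\,\theta_i + c_{jk}\bigr)}
         {\sum_{h=1}^{m_j} \exp\bigl(a_{jh}\,\theta_i + c_{jh}\bigr)}
```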
Educational and Psychological Measurement | 2010
Christine E. DeMars
In this brief explication, two challenges in using differential item functioning (DIF) measures when there are large group differences in true proficiency are illustrated. Each of these difficulties may lead to inflated Type I error rates, for very different reasons. One problem is that groups matched on observed score are not necessarily well matched on true proficiency, which may result in the false detection of DIF due to inaccurate matching. The other problem is that a model that does not allow for a nonzero lower asymptote can produce what seems to be DIF. These issues have previously been discussed separately in the literature; this article brings them together in a nontechnical form.
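As a sketch of the second problem, an item response function with a nonzero lower asymptote has the 3PL form below (the notation is assumed for illustration); fitting a model that fixes the asymptote at zero to such data can produce apparent group differences on the item even without true DIF.

```latex
% Three-parameter logistic (3PL) item response function, with discrimination
% a_j, difficulty b_j, and lower asymptote c_j. When responses follow this
% function but a model that fixes c_j = 0 is fit, low-proficiency examinees
% answer correctly more often than that model predicts, and the misfit can
% masquerade as DIF when groups differ in true proficiency.
P\bigl(X_{ij}=1 \mid \theta_i\bigr)
  = c_j + \frac{1 - c_j}{1 + \exp\!\bigl[-a_j(\theta_i - b_j)\bigr]}
```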
Educational Assessment | 2007
Christine E. DeMars
A series of 8 tests was administered to university students over 4 weeks for program assessment purposes. The stakes of these tests were low for students; they received course points based on test completion, not test performance. Tests were administered in a counterbalanced order across 2 administrations. Response time effort, a measure of the proportion of items on which solution behavior rather than rapid-guessing behavior was used, was higher when a test was administered in the 1st week. Test scores were also higher. Differences between Week 1 and Week 4 test scores decreased when the test was scored with an effort-moderated model that took into account whether the student used solution or rapid-guessing behavior. Differences decreased further when students who used rapid guessing on 5 or more of the 30 items were filtered from the data set.
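One way to picture the effort-moderated scoring idea is the item response function sketched below, assuming a 2PL for solution behavior; the notation and the choice of a 2PL are assumptions for illustration, not the exact model used in the study.

```latex
% Effort-moderated item response function (sketched here with a 2PL for
% solution behavior). SB_{ij} = 1 when examinee i shows solution behavior on
% item j (response time above the item's rapid-guess threshold) and 0
% otherwise; k_j is the number of response options, so rapid guesses are
% treated as chance responses that carry no information about \theta_i.
P\bigl(X_{ij}=1 \mid \theta_i\bigr)
  = SB_{ij}\,\frac{1}{1 + \exp\!\bigl[-a_j(\theta_i - b_j)\bigr]}
  + \bigl(1 - SB_{ij}\bigr)\,\frac{1}{k_j}
```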
Applied Psychological Measurement | 2012
Christine E. DeMars
A testlet is a cluster of items that share a common passage, scenario, or other context. These items might measure something in common beyond the trait measured by the test as a whole; if so, the model for the item responses should allow for this testlet trait. But modeling testlet effects that are negligible makes the model unnecessarily complicated and risks capitalization on chance, increasing the error in parameter estimates. Checking each testlet to see whether the items within the testlet share something beyond the primary trait could therefore be useful. This study included (a) a comparison between a model with no testlets and a model with testlet g, (b) a comparison between a model with all suspected testlets and a model with all suspected testlets except testlet g, and (c) a test of essential unidimensionality. Overall, comparison (b) was most useful for detecting testlet effects. Model comparisons based on information criteria, specifically the sample-size adjusted Bayesian Information Criterion (SSA-BIC) and the BIC, resulted in fewer false alarms than statistical significance tests. The test of essential unidimensionality had hit rates and false alarm rates similar to the SSA-BIC when the testlet effect was zero for all testlets except the studied testlet, but the presence of additional testlet effects in the partitioning test led to higher false alarm rates for the test of essential unidimensionality.
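A minimal sketch of the information-criterion comparison in (b), assuming the maximized log-likelihoods, parameter counts, and sample size are already in hand; the numbers below are made up for illustration.

```python
# Compare the model with all suspected testlets against the model that drops
# testlet g, using BIC and the sample-size adjusted BIC. All values are
# placeholders, not results from the study.
import math

def bic(loglik, n_params, n):
    return -2.0 * loglik + n_params * math.log(n)

def ssa_bic(loglik, n_params, n):
    # Sample-size adjusted BIC: ln(N) is replaced with ln((N + 2) / 24).
    return -2.0 * loglik + n_params * math.log((n + 2) / 24.0)

n = 2000
models = {
    "all suspected testlets":        {"loglik": -24170.0, "params": 64},
    "all testlets except testlet g": {"loglik": -24195.0, "params": 60},
}
for name, m in models.items():
    print(f"{name}: BIC = {bic(m['loglik'], m['params'], n):.1f}, "
          f"SSA-BIC = {ssa_bic(m['loglik'], m['params'], n):.1f}")
# A lower BIC/SSA-BIC for the model that keeps testlet g supports retaining it.
```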