Melissa J. Margolis
National Board of Medical Examiners
Publications
Featured research published by Melissa J. Margolis.
Academic Medicine | 2006
Melissa J. Margolis; Brian E. Clauser; Monica M. Cuddy; Andrea Ciccone; Janet Mee; Polina Harik; Richard E. Hawkins
Background Multivariate generalizability analysis was used to investigate the performance of a commonly used clinical evaluation tool. Method Practicing physicians were trained to use the mini-Clinical Evaluation Exercise (mini-CEX) rating form to rate performances from the United States Medical Licensing Examination Step 2 Clinical Skills examination. Results Differences in rater stringency made the greatest contribution to measurement error; having more raters rate each examinee, even on fewer occasions, could enhance score stability. Substantial correlated error across the competencies suggests that decisions about one scale unduly influence those on others. Conclusions Given the appearance of a halo effect across competencies, score interpretations that assume assessment of distinct dimensions of clinical performance should be made with caution. If the intention is to produce a single composite score by combining results across competencies, the presence of these effects may be less critical.
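To make the rater/occasion trade-off concrete, the sketch below projects a generalizability coefficient for a persons x raters x occasions random design under different rating plans. It is only an illustration of the decision-study logic the abstract alludes to; the variance components are hypothetical placeholders, not estimates from the study.

```python
# Minimal decision-study (D-study) sketch for a persons x raters x occasions design.
# The variance components are HYPOTHETICAL, not values reported in the study.

def g_coefficient(var_p, var_pr, var_po, var_pro, n_raters, n_occasions):
    """Generalizability coefficient for relative decisions."""
    relative_error = (var_pr / n_raters
                      + var_po / n_occasions
                      + var_pro / (n_raters * n_occasions))
    return var_p / (var_p + relative_error)

# Hypothetical variance components: person, person x rater, person x occasion, residual.
components = dict(var_p=0.30, var_pr=0.25, var_po=0.05, var_pro=0.20)

# Compare allocating more raters per examinee against more occasions per rater.
for n_raters, n_occasions in [(2, 4), (4, 2), (8, 1)]:
    g = g_coefficient(n_raters=n_raters, n_occasions=n_occasions, **components)
    print(f"raters={n_raters}, occasions={n_occasions}: G = {g:.2f}")
```

With rater-related variance dominating, as in these invented inputs, the projected coefficient rises when raters are added even as occasions are reduced, which is the pattern the abstract describes.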
Academic Medicine | 2010
Richard E. Hawkins; Melissa J. Margolis; Steven J. Durning; John J. Norcini
Purpose The mini-Clinical Evaluation Exercise (mCEX) is increasingly being used to assess the clinical skills of medical trainees. Existing mCEX research has typically focused on isolated aspects of the instrument's reliability and validity. A more thorough validity analysis is necessary to inform use of the mCEX, particularly in light of increased interest in high-stakes applications of the methodology. Method Kane's (2006) validity framework, in which a structured argument is developed to support the intended interpretation(s) of assessment results, was used to evaluate mCEX research published from 1995 to 2009. In this framework, evidence to support the argument is divided into four components (scoring, generalization, extrapolation, and interpretation/decision), each of which relates to different features of the assessment or resulting scores. The strengths and limitations of the reviewed research were identified in relation to these components, and the findings were synthesized to highlight overall strengths and weaknesses of existing mCEX research. Results The scoring component yielded the most concerns relating to the validity of mCEX score interpretations. More research is needed to determine whether scoring-related issues, such as leniency error and high interitem correlations, limit the utility of the mCEX for providing feedback to trainees. Evidence within the generalization and extrapolation components is generally supportive of the validity of mCEX score interpretations. Conclusions Careful evaluation of the circumstances of mCEX assessment will help to improve the quality of the resulting information. Future research should address issues of rater selection, training, and monitoring, which can impact rating accuracy.
Academic Medicine | 2002
Gerard F. Dillon; Stephen G. Clyman; Brian E. Clauser; Melissa J. Margolis
In the early to mid-1990s, the National Board of Medical Examiners (NBME) examinations were replaced by the United States Medical Licensing Examination (USMLE). The USMLE, which was designed to have three components or Steps, was administered as a paper-and-pencil test until the late 1990s, when it moved to a computer-based testing (CBT) format. The CBT format provided the opportunity to realize the results of simulation research and development that had occurred during the prior two decades. A milestone in this effort occurred in November 1999 when, with the implementation of the computer-delivered USMLE Step 3 examination, the Primum Computer-based Case Simulations (CCSs) were introduced. In the year preceding this introduction and in the more than two years of operational use since, numerous challenges have been addressed. Preliminary results of this initial experience have been promising. This paper introduces the relevant issues, describes some pertinent research findings, and identifies next steps for research.
Academic Medicine | 1997
A. F. De Champlain; Melissa J. Margolis; A. King; Daniel J. Klass
No abstract available.
Applied Measurement in Education | 2008
Brian E. Clauser; Polina Harik; Melissa J. Margolis; I. C. McManus; Jennifer Mollon; Liliana Chis; Simon Williams
Numerous studies have compared the Angoff standard-setting procedure to other standard-setting methods, but relatively few studies have evaluated the procedure based on internal criteria. This study uses a generalizability theory framework to evaluate the stability of the estimated cut score. To provide a measure of internal consistency, this study also compares the estimated proportion correct scores resulting from the Angoff exercise to empirical conditional proportion correct scores. In this research, judges made independent estimates of the proportion of minimally proficient candidates that would be expected to answer each item correctly; they then discussed discrepancies and revised their estimates. Discussion of discrepancies decreased the variance components associated with the judge and judge-by-item effects, indicating increased agreement between judges, but it did not improve the correspondence between the judgments and the empirical proportion correct estimates. The judges then were given examinee performance information for a subset of the items. Subsequent ratings showed a substantial increase in correspondence with the empirical conditional proportion correct estimates. Particular attention is given to examining the discrepancy between the judgments and empirical proportion correct estimates as a function of item difficulty.
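As a concrete illustration of the Angoff bookkeeping described above (not the study's data or analysis), the sketch below averages judges' per-item estimates into a cut score and correlates those estimates with empirical conditional proportion-correct values; every number is invented.

```python
# Illustrative Angoff-style calculation with INVENTED ratings and empirical values.
import numpy as np

# Rows = judges, columns = items: each entry is a judge's estimate of the proportion
# of minimally proficient candidates expected to answer that item correctly.
ratings = np.array([
    [0.60, 0.45, 0.80, 0.70],
    [0.55, 0.50, 0.75, 0.65],
    [0.65, 0.40, 0.85, 0.75],
])

item_estimates = ratings.mean(axis=0)   # panel estimate per item
cut_score = item_estimates.sum()        # expected raw score of a borderline candidate

# Hypothetical empirical conditional proportion-correct values for the same items.
empirical = np.array([0.58, 0.52, 0.78, 0.69])

correspondence = np.corrcoef(item_estimates, empirical)[0, 1]
print(f"cut score = {cut_score:.2f}, judge/empirical correlation = {correspondence:.2f}")
```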
Academic Medicine | 2002
Brian E. Clauser; Melissa J. Margolis; David B. Swanson
In November 1999 the USMLE Step 3 was implemented as a computer-administered examination that includes both MCQs and computer-based case simulations (CCSs) developed to assess physicians’ patient-management skills. The CCS format has been described in detail previously. Briefly, CCSs produce a simulation of the patient-management environment. Each case begins with an opening scenario describing the patient’s location and presentation. Using free-text entry, the examinee then orders tests, treatments, and consultations while advancing the case through simulated time. The system recognizes over 12,000 abbreviations, brand names, and other terms that represent more than 2,500 unique actions. Within the dynamic simulation framework, the patient’s condition changes based both on the actions taken by the examinee and on the underlying problem. The simulations are scored using a computer-automated algorithm that is designed to approximate the score that would have been produced if the examinee’s performance had been reviewed and rated by a group of expert clinicians. CCSs are the result of a major development effort on the part of the National Board of Medical Examiners, and operational use of the CCS represents the culmination of 30 years of research. Throughout this effort, it was believed that this format had the potential to add a new dimension to physician licensure assessment. The purpose of the present research was to begin to assess the validity of that belief by examining the relationship between the scores produced by CCSs and those produced with MCQs. Two types of results are presented. First, multivariate generalizability analysis was used to examine the generalizability of the individual components of the test (CCS and MCQ), as well as the relationship between the proficiencies measured by those components (i.e., the true-score correlations, termed universe-score correlations in generalizability theory). The resulting true-score correlation and reliability estimates were then used to examine the extent to which the addition of CCS improves measurement of the proficiency of interest, when that proficiency is defined as a weighted composite of the true scores measured by CCSs and MCQs. This latter assessment is of particular interest because, as Wainer and Thissen have pointed out, when a constructed-response item format has relatively low reliability and a reasonably high correlation with the proficiency measured by MCQs, allocating testing time to the constructed-response items may be counterproductive, even when the true score of interest is that measured by the constructed-response items. In this context, it becomes a matter of interest to know how much (if any) of the available testing time should be devoted to CCSs.
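The testing-time question in the final sentences can be made concrete with a composite-reliability calculation. The sketch below, using hypothetical reliabilities and a hypothetical universe-score correlation rather than the paper's estimates, shows how the reliability of a weighted MCQ/CCS composite varies as weight shifts toward the less reliable component.

```python
# Composite reliability of a weighted MCQ + CCS score under classical assumptions
# (uncorrelated errors). All inputs are HYPOTHETICAL, not estimates from the paper.
import math

def composite_reliability(w_mcq, w_ccs, rel_mcq, rel_ccs, true_corr,
                          sd_mcq=1.0, sd_ccs=1.0):
    # With uncorrelated errors, the observed covariance equals the true-score covariance
    # implied by the universe-score correlation and the component reliabilities.
    cov = true_corr * sd_mcq * sd_ccs * math.sqrt(rel_mcq * rel_ccs)
    true_var = (w_mcq**2 * sd_mcq**2 * rel_mcq
                + w_ccs**2 * sd_ccs**2 * rel_ccs
                + 2 * w_mcq * w_ccs * cov)
    obs_var = (w_mcq**2 * sd_mcq**2
               + w_ccs**2 * sd_ccs**2
               + 2 * w_mcq * w_ccs * cov)
    return true_var / obs_var

# A reliable MCQ section, a less reliable CCS section, and a fairly high correlation
# between the proficiencies they measure (all values illustrative).
for w_ccs in (0.0, 0.2, 0.4):
    rel = composite_reliability(w_mcq=1 - w_ccs, w_ccs=w_ccs,
                                rel_mcq=0.90, rel_ccs=0.60, true_corr=0.80)
    print(f"CCS weight {w_ccs:.1f}: composite reliability = {rel:.3f}")
```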
Academic Medicine | 2008
Kathleen M. Mazor; Colleen Canavan; Margaret Farrell; Melissa J. Margolis; Brian E. Clauser
Background This study investigated whether participants’ subjective reports of how they assigned ratings on a multisource feedback instrument provide evidence to support interpreting the resulting scores as objective, accurate measures of professional behavior. Method Twenty-six participants completed think-aloud interviews while rating students, residents, or faculty members they had worked with previously. The items rated included 15 behavioral items and one global item. Results Participants referred to generalized behaviors and global impressions six times as often as specific behaviors, rated observees in the absence of information necessary to do so, relied on indirect evidence about performance, and varied in how they interpreted items. Conclusions Behavioral change becomes difficult to address if it is unclear what behaviors raters considered when providing feedback. These findings highlight the importance of explicitly stating and empirically investigating the assumptions that underlie the use of an observational assessment tool.
Academic Medicine | 2006
Polina Harik; Brian E. Clauser; Irina Grabovsky; Melissa J. Margolis; Gerard F. Dillon; John R. Boulet
Background This research examined relationships between and among scores from the United States Medical Licensing Examination (USMLE) Step 1, Step 2 Clinical Knowledge (CK), and subcomponents of the Step 2 Clinical Skills (CS) examination. Method Correlations and failure rates were produced for first-time takers who tested during the first year of Step 2 CS Examination administration (June 2004 to July 2005). Results True-score correlations were high between patient note (PN) and data gathering (DG), moderate between communication and interpersonal skills and DG, and low between the remaining score pairs. There was little overlap between examinees failing Step 2 CK and the different components of Step 2 CS. Conclusion Results suggest that combining DG and PN scores into a single composite score is reasonable and that relatively little redundancy exists between Step 2 CK and CS scores.
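For readers unfamiliar with the term, a true-score correlation is, conceptually, an observed correlation corrected (disattenuated) for unreliability in both scores. A minimal sketch of that correction, using invented numbers rather than the study's values:

```python
# Disattenuation of an observed correlation for measurement error in both scores.
# The observed correlation and reliabilities below are ILLUSTRATIVE only.
import math

def true_score_correlation(observed_r, reliability_x, reliability_y):
    return observed_r / math.sqrt(reliability_x * reliability_y)

# e.g., a moderate observed correlation between two moderately reliable component scores
print(round(true_score_correlation(observed_r=0.45, reliability_x=0.70, reliability_y=0.65), 2))
```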
Academic Medicine | 2003
Melissa J. Margolis; Brian E. Clauser; David B. Swanson; John R. Boulet
Purpose. This work investigated the reliability of and relationships between individual case and composite scores on a standardized patient clinical skills examination. Method. Four hundred ninety-two fourth-year U.S. medical students received three scores [data gathering (DG), interpersonal skills (IPS), and written communication (WC)] for each of 10 standardized patient cases. mGENOVA software was used for all analyses. Results. Estimated generalizability coefficients were 0.69, 0.80, and 0.70 for the DG, IPS, and WC scores, respectively. The universe-score correlation between DG and WC was high (0.83); those for DG/IPS and IPS/WC were not as strong (0.51 and 0.37, respectively). Task difficulty appears to be modestly but positively related across the three scores. Correlations between the person-by-task effects for DG/IPS and DG/WC were positive yet modest. The estimated generalizability coefficient for a 10-case test using an equally weighted composite DG/WC score was 0.78. Conclusions. This work allows for interpretation of correlations between (1) proficiencies measured by multiple scores and (2) sources of error that affect those scores, as well as for estimation of the reliability of composite scores. Results have important implications for test construction and test validity.
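As a rough illustration of how a composite coefficient like the one above can be assembled (not a reproduction of the mGENOVA analysis, which also models correlated error), the sketch below combines component universe-score variances, their correlation, and relative error variances under equal weights and an uncorrelated-error assumption; all inputs are hypothetical.

```python
# Composite generalizability coefficient for an equally weighted two-score composite,
# assuming uncorrelated relative errors. All variance inputs are HYPOTHETICAL.
import math

def composite_g(w1, w2, uvar1, uvar2, ucorr, evar1, evar2):
    ucov = ucorr * math.sqrt(uvar1 * uvar2)            # universe-score covariance
    universe = w1**2 * uvar1 + w2**2 * uvar2 + 2 * w1 * w2 * ucov
    error = w1**2 * evar1 + w2**2 * evar2              # uncorrelated-error assumption
    return universe / (universe + error)

# Equal weights for two hypothetical component scores (stand-ins for DG and WC).
print(round(composite_g(0.5, 0.5, uvar1=0.9, uvar2=0.8, ucorr=0.83,
                        evar1=0.40, evar2=0.35), 2))
```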
Academic Medicine | 1996
Linette P. Ross; Brian E. Clauser; Melissa J. Margolis; N. A. Orr; Daniel J. Klass
No abstract available.