Polina Harik
National Board of Medical Examiners
Publications
Featured research published by Polina Harik.
Academic Medicine | 2006
Melissa J. Margolis; Brian E. Clauser; Monica M. Cuddy; Andrea Ciccone; Janet Mee; Polina Harik; Richard E. Hawkins
Background Multivariate generalizability analysis was used to investigate the performance of a commonly used clinical evaluation tool. Method Practicing physicians were trained to use the mini-Clinical Evaluation Exercise (mini-CEX) rating form to rate performances from the United States Medical Licensing Examination Step 2 Clinical Skills examination. Results Differences in rater stringency made the greatest contribution to measurement error; more raters rating each examinee, even on fewer occasions, could enhance score stability. Substantial correlated error across the competencies suggests that decisions about one scale unduly influence those on others. Conclusions Given the appearance of a halo effect across competencies, score interpretations that assume assessment of distinct dimensions of clinical performance should be made with caution. If the intention is to produce a single composite score by combining results across competencies, the presence of these effects may be less critical.
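To make the rater-versus-occasion tradeoff concrete, the sketch below runs a decision (D) study over hypothetical variance components; the values are illustrative assumptions, not estimates from the paper. It shows how the dependability (phi) coefficient responds when a fixed total number of ratings is allocated to more raters on fewer occasions.

```python
# Illustrative D-study for a persons x raters x occasions design.
# All variance components are hypothetical, not the paper's estimates
# (occasion and rater-by-occasion components assumed negligible).
var_p   = 0.40  # examinee (universe-score) variance
var_r   = 0.25  # rater main effect (stringency differences)
var_pr  = 0.15  # person-by-rater interaction
var_po  = 0.05  # person-by-occasion interaction
var_res = 0.20  # residual (three-way interaction plus error)

def phi(n_raters, n_occasions):
    """Dependability coefficient for n_raters and n_occasions per examinee."""
    abs_error = (var_r / n_raters
                 + var_pr / n_raters
                 + var_po / n_occasions
                 + var_res / (n_raters * n_occasions))
    return var_p / (var_p + abs_error)

# Same total of eight ratings, allocated differently: with large rater
# stringency variance, spreading ratings over more raters wins.
for n_r, n_o in [(1, 8), (2, 4), (4, 2), (8, 1)]:
    print(f"raters={n_r}, occasions={n_o}: phi={phi(n_r, n_o):.3f}")
```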
Applied Measurement in Education | 2008
Brian E. Clauser; Polina Harik; Melissa J. Margolis; I. C. McManus; Jennifer Mollon; Liliana Chis; Simon Williams
Numerous studies have compared the Angoff standard-setting procedure to other standard-setting methods, but relatively few studies have evaluated the procedure based on internal criteria. This study uses a generalizability theory framework to evaluate the stability of the estimated cut score. To provide a measure of internal consistency, this study also compares the estimated proportion correct scores resulting from the Angoff exercise to empirical conditional proportion correct scores. In this research, judges made independent estimates of the proportion of minimally proficient candidates who would be expected to answer each item correctly; they then discussed discrepancies and revised their estimates. Discussion of discrepancies decreased the variance components associated with the judge and judge-by-item effects, indicating increased agreement between judges, but it did not improve the correspondence between the judgments and the empirical proportion correct estimates. The judges then were given examinee performance information for a subset of the items. Subsequent ratings showed a substantial increase in correspondence with the empirical conditional proportion correct estimates. Particular attention is given to examining the discrepancy between the judgments and empirical proportion correct estimates as a function of item difficulty.
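As a worked illustration of the internal criteria used here, the sketch below decomposes a simulated judges-by-items Angoff rating matrix into judge, item, and residual variance components, computes the standard error of the cut score, and correlates mean judgments with empirical conditional p-values. The data and component structure are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_j, n_i = 8, 40                                  # judges, items
p_emp = rng.uniform(0.3, 0.9, n_i)                # empirical conditional p-values
# ratings[j, i]: judge j's estimate for item i, with judge stringency + noise
ratings = np.clip(p_emp + rng.normal(0, .05, (n_j, 1))
                  + rng.normal(0, .08, (n_j, n_i)), 0, 1)

# Two-way crossed design, one observation per cell.
grand = ratings.mean()
ss_j = n_i * ((ratings.mean(1) - grand) ** 2).sum()
ss_i = n_j * ((ratings.mean(0) - grand) ** 2).sum()
ss_res = ((ratings - grand) ** 2).sum() - ss_j - ss_i

ms_j = ss_j / (n_j - 1)
ms_res = ss_res / ((n_j - 1) * (n_i - 1))
var_res = ms_res                                  # judge-by-item + error
var_j = max((ms_j - ms_res) / n_i, 0.0)           # judge main effect
print(f"judge variance: {var_j:.4f}, judge-by-item/residual: {var_res:.4f}")

# Standard error of the Angoff cut score (mean over judges and items).
se_cut = (var_j / n_j + var_res / (n_j * n_i)) ** 0.5
print(f"SE of cut score: {se_cut:.4f}")

# Internal-consistency check: judged vs. empirical conditional p-values.
r = np.corrcoef(ratings.mean(0), p_emp)[0, 1]
print(f"correlation with empirical p-values: {r:.3f}")
```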
Academic Medicine | 2006
Polina Harik; Brian E. Clauser; Irina Grabovsky; Melissa J. Margolis; Gerard F. Dillon; John R. Boulet
Background This research examined relationships between and among scores from the United States Medical Licensing Examination (USMLE) Step 1, Step 2 Clinical Knowledge (CK), and subcomponents of the Step 2 Clinical Skills (CS) examination. Method Correlations and failure rates were produced for first-time takers who tested during the first year of Step 2 CS Examination administration (June 2004 to July 2005). Results True-score correlations were high between patient note (PN) and data gathering (DG), moderate between communication and interpersonal skills and DG, and low between the remaining score pairs. There was little overlap between examinees failing Step 2 CK and the different components of Step 2 CS. Conclusion Results suggest that combining DG and PN scores into a single composite score is reasonable and that relatively little redundancy exists between Step 2 CK and CS scores.
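The true-score correlations reported here are observed correlations disattenuated for the unreliability of each score; a minimal sketch of that standard correction follows, with a hypothetical observed correlation and assumed component reliabilities (none of these values come from the paper).

```python
import math

def true_score_r(r_xy, rel_x, rel_y):
    """Observed correlation disattenuated for unreliability:
    r_true = r_xy / sqrt(rel_x * rel_y)."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical example: observed r between patient note (PN) and
# data gathering (DG) scores, with assumed score reliabilities.
print(f"{true_score_r(r_xy=0.55, rel_x=0.70, rel_y=0.75):.3f}")  # ~0.76
```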
Academic Medicine | 2008
Brian E. Clauser; Polina Harik; Melissa J. Margolis; Janet Mee; Kimberly A. Swygert; Thomas Rebbecchi
Background This research examined various sources of measurement error in the documentation score component of the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills examination. Method A generalizability theory framework was employed to examine the documentation ratings for 847 examinees who completed the USMLE Step 2 Clinical Skills examination during an eight-day period in 2006. Each patient note was scored by two different raters allowing for a persons-crossed-with-raters-nested-in-cases design. Results The results suggest that inconsistent performance on the part of raters makes a substantially greater contribution to measurement error than case specificity. Double scoring the notes significantly increases precision. Conclusions The results provide guidance for improving operational scoring of the patient notes. Double scoring of the notes may produce an increase in the precision of measurement equivalent to that achieved by lengthening the test by more than 50%. The study also cautions researchers that when examining sources of measurement error, inappropriate data-collection designs may result in inaccurate inferences.
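A short sketch of the underlying D-study logic, using hypothetical variance components for a persons x (raters : cases) design rather than the paper's estimates: when the rater-within-case component dominates, double scoring each note can rival or exceed the precision gained by adding 50% more cases.

```python
# Hypothetical variance components (not the paper's estimates).
var_p   = 0.30   # examinee (universe-score) variance
var_pc  = 0.10   # person-by-case (case specificity)
var_prc = 0.45   # person-by-rater-within-case + residual (rater inconsistency)

def g_coef(n_cases, n_raters):
    """Generalizability coefficient for a p x (r:c) design."""
    rel_error = var_pc / n_cases + var_prc / (n_cases * n_raters)
    return var_p / (var_p + rel_error)

base   = g_coef(n_cases=10, n_raters=1)   # single scoring
double = g_coef(n_cases=10, n_raters=2)   # each note scored twice
longer = g_coef(n_cases=15, n_raters=1)   # 50% more cases, single scoring
print(f"single: {base:.3f}  double-scored: {double:.3f}  +50% cases: {longer:.3f}")
```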
Academic Medicine | 2009
Brian E. Clauser; Kevin P. Balog; Polina Harik; Janet Mee; Nilufer Kahraman
Background In clinical skills assessment, closely related skills are often combined to form a composite score. For example, history-taking and physical examination scores are typically combined. Interestingly, there is relatively little research to support this practice. Method Multivariate generalizability theory was employed to examine the relationship between history-taking and physical examination scores from the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills examination. These two proficiencies are currently combined into a data-gathering score. Results The physical examination score is less generalizable than the score for history taking, and there is only a modest to moderate relationship between these two proficiencies. Conclusions A decision about combining physical examination and history-taking proficiencies into one composite score, as well as the weighting of these components, should be driven by the intended use of the score. The choice of weights in combining physical examination and history taking makes a substantial difference in the precision of the resulting score.
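The dependence of composite precision on weights can be illustrated with a small multivariate G-theory calculation; the covariance and error components below are hypothetical stand-ins, chosen only so that physical examination is the less generalizable score, as the paper reports.

```python
import numpy as np

# Hypothetical universe-score covariance for (history taking, physical exam)
# and per-task relative-error variances; errors assumed uncorrelated.
sigma_univ = np.array([[0.25, 0.08],
                       [0.08, 0.20]])   # off-diagonal: modest HX-PE covariance
error_var = np.array([0.10, 0.25])      # PE noisier than HX
n_cases = 12

def composite_g(w):
    """Generalizability coefficient of the weighted composite."""
    w = np.asarray(w, float)
    univ = w @ sigma_univ @ w
    err = (w ** 2 * error_var / n_cases).sum()
    return univ / (univ + err)

# Precision shifts noticeably with the choice of weights.
for w in [(0.5, 0.5), (0.7, 0.3), (0.3, 0.7)]:
    print(w, f"{composite_g(w):.3f}")
```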
Academic Medicine | 2004
Melissa J. Margolis; Brian E. Clauser; Polina Harik
Purpose. Operational USMLE™ computer-based case simulation results were examined to determine the extent to which rater reliability and regression model performance met expectations based on preoperational data. Method. Operational data came from Step 3 examinations administered between 1999 and 2004; reliability and multiple correlation coefficients were plotted across administrations. Results. Operational testing reliabilities increased over the four years but were lower than the preoperational reliability. Multiple correlation coefficient results were somewhat superior to those reported during the preoperational period, suggesting that the operational scoring algorithms have been relatively consistent. Conclusions. Changes in the rater population, changes in the rating task, and enhancements to the training procedures are several factors that may explain the identified differences between preoperational and operational results. The present findings have important implications for test development and test validity.
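For readers unfamiliar with the two coefficients being tracked, the sketch below estimates single-rater reliability from a pair of simulated raters and a multiple correlation from an ordinary least squares fit; the data and predictor features are invented for illustration and do not reflect the operational scoring system.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200                                       # rated case performances
true = rng.normal(0, 1, n)                    # latent performance quality
r1 = true + rng.normal(0, .6, n)              # rater 1's scores
r2 = true + rng.normal(0, .6, n)              # rater 2's scores

# Single-rater reliability estimated as the inter-rater correlation.
rel_single = np.corrcoef(r1, r2)[0, 1]
mean_rating = (r1 + r2) / 2

# Multiple correlation: OLS regression of ratings on (hypothetical)
# transaction-based features, as in regression-based scoring.
X = np.column_stack([np.ones(n),
                     true + rng.normal(0, .4, n),   # informative feature
                     rng.normal(0, 1, n)])          # noise feature
beta, *_ = np.linalg.lstsq(X, mean_rating, rcond=None)
mult_R = np.corrcoef(X @ beta, mean_rating)[0, 1]
print(f"single-rater reliability ~ {rel_single:.2f}, multiple R ~ {mult_R:.2f}")
```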
Applied Psychological Measurement | 2011
Mark R. Raymond; Polina Harik; Brian E. Clauser
Prior research indicates that the overall reliability of performance ratings can be improved by using ordinary least squares (OLS) regression to adjust for rater effects. The present investigation extends previous work by evaluating the impact of OLS adjustment on standard errors of measurement (SEM) at specific score levels. In addition, a cross-validation (i.e., resampling) design was used to determine the extent to which any improvements in measurement precision would be realized for new samples of examinees. Conditional SEMs were largest for scores toward the low end of the score distribution and smallest for scores at the high end. Conditional SEMs for adjusted scores were consistently less than conditional SEMs for observed scores, although the reduction in error was not uniform throughout the distribution. The improvements in measurement precision held up for new samples of examinees at all score levels.
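A minimal sketch of the adjustment-and-resampling idea: rater stringency effects are estimated on a calibration half (here, simply as centered rater means, the OLS solution for a rater main-effects model), removed from a held-out half, and conditional SEMs are compared by score level. All data are simulated; the paper's operational design is more elaborate.

```python
import numpy as np

rng = np.random.default_rng(2)
n_raters, per_rater = 20, 60
true = rng.normal(0, 1, n_raters * per_rater)        # simulated true scores
rater_id = np.repeat(np.arange(n_raters), per_rater)
effects = rng.normal(0, 0.5, n_raters)               # rater stringency
obs = true + effects[rater_id] + rng.normal(0, 0.4, true.size)

# Calibration half estimates rater effects; they are removed from the
# held-out half (the cross-validation idea in the paper).
perm = rng.permutation(true.size)
cal, val = perm[:true.size // 2], perm[true.size // 2:]
est = np.array([obs[cal][rater_id[cal] == r].mean() for r in range(n_raters)])
est -= est.mean()                                    # center the effects
adj = obs[val] - est[rater_id[val]]

# Conditional SEM by score level: RMSE within bins of the (known,
# simulated) true score. Adjustment shrinks error throughout.
for lo, hi in [(-3, -1), (-1, 1), (1, 3)]:
    m = (true[val] >= lo) & (true[val] < hi)
    sem_raw = np.sqrt(np.mean((obs[val][m] - true[val][m]) ** 2))
    sem_adj = np.sqrt(np.mean((adj[m] - true[val][m]) ** 2))
    print(f"true in [{lo},{hi}): raw SEM {sem_raw:.2f}, adjusted {sem_adj:.2f}")
```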
Academic Medicine | 2009
Polina Harik; Monica M. Cuddy; Seosaimhin O’Donovan; Constance T. Murray; David B. Swanson; Brian E. Clauser
Background The 2000 Institute of Medicine report on patient safety brought renewed attention to the issue of preventable medical errors, and subsequently specialty boards and the National Board of Medical Examiners were encouraged to play a role in setting expectations around safety education. This paper examines potentially dangerous actions taken by examinees during the portion of the United States Medical Licensing Examination Step 3 that is particularly well suited to evaluating lapses in physician decision making, the Computer-based Case Simulation (CCS). Method Descriptive statistics and a general linear modeling approach were used to analyze dangerous actions ordered by 25,283 examinees who completed CCS for the first time between November 2006 and January 2008. Results More than 20% of examinees ordered at least one dangerous action with the potential to cause significant patient harm. The propensity to order dangerous actions may vary across clinical cases. Conclusions The CCS format may provide a means of collecting important information about the patient-care situations in which examinees are more likely to commit dangerous actions and about examinees' propensity to order dangerous tests and treatments.
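The headline statistic is straightforward to compute from an examinee-by-case indicator matrix; the sketch below uses simulated data with case-to-case variation in risk, mirroring the finding that the propensity to order dangerous actions may vary across cases. Counts and rates are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n_examinees, n_cases = 1000, 9
# Hypothetical per-case probability of a dangerous action being ordered.
case_rate = rng.uniform(0.01, 0.06, n_cases)
# danger[e, c] = True if examinee e ordered a dangerous action on case c.
danger = rng.random((n_examinees, n_cases)) < case_rate

any_danger = danger.any(axis=1).mean()
print(f"examinees with >=1 dangerous action: {any_danger:.1%}")
print("per-case rates:", np.round(danger.mean(axis=0), 3))
```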
Academic Medicine | 2007
Chaitanya Ramineni; Polina Harik; Melissa J. Margolis; Brian E. Clauser; David B. Swanson; Gerard F. Dillon
Background Systematic trends in examinee performance across the testing day (sequence effects) could indicate that artifacts of the testing situation have an impact on scores. This research investigated the presence of sequence effects for United States Medical Licensing Examination (USMLE) Step 2 clinical skills (CS) examination components. Method Data from Step 2 CS examinees were analyzed using analysis of covariance and hierarchical linear modeling procedures. Results Sequence was significant for three of the components: communication and interpersonal skills, data gathering, and documentation. A significant gender × sequence interaction was found for two components. Conclusions The presence of sequence effects suggests that scores on early cases are influenced by factors that are unrelated to the proficiencies of interest. More research is needed to fully understand these effects.
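One common way to model such sequence effects is a mixed model with a random intercept per examinee and fixed effects for case position, gender, and their interaction; the sketch below shows this on simulated data using statsmodels. The paper's exact model specification may differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_exam, n_cases = 300, 12
df = pd.DataFrame({
    "examinee": np.repeat(np.arange(n_exam), n_cases),
    "position": np.tile(np.arange(1, n_cases + 1), n_exam),
    "gender":   np.repeat(rng.integers(0, 2, n_exam), n_cases),
})
ability = np.repeat(rng.normal(0, 1, n_exam), n_cases)
# Simulated scores improve slightly across the day (a sequence effect).
df["score"] = ability + 0.03 * df["position"] + rng.normal(0, 1, len(df))

# Random intercept per examinee; fixed effects for sequence position,
# gender, and the gender x sequence interaction.
model = smf.mixedlm("score ~ position * gender", df, groups=df["examinee"])
print(model.fit().summary())
```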
Applied Psychological Measurement | 2013
Polina Harik; Peter Baldwin; Brian E. Clauser
Growing reliance on complex constructed-response items has generated considerable interest in automated scoring solutions. Many of these solutions are described in the literature; however, relatively few studies have been published that compare automated scoring strategies. Here, comparisons are made among five strategies for machine-scoring examinee performances on computer-based case simulations, a complex item format used to assess physicians' patient-management skills as part of Step 3 of the United States Medical Licensing Examination. These strategies use expert judgments to derive either (a) case-specific or (b) generic scoring algorithms. The compromises among efficiency, validity, and reliability that characterize each scoring approach are described and compared.
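The case-specific versus generic distinction can be made concrete with regression-based scoring: fit one scoring model per case, or pool all cases into a single shared model, then compare holdout agreement with expert ratings. Everything below (features, weights, data) is simulated for illustration and is not the paper's scoring system.

```python
import numpy as np

rng = np.random.default_rng(5)
n_cases, n_train, n_test, n_feat = 6, 150, 80, 4

def simulate(n, w):
    """Simulated performances: action-based features and an expert rating."""
    X = rng.normal(0, 1, (n, n_feat))
    y = X @ w + rng.normal(0, .5, n)
    return X, y

def fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

case_w = [rng.normal(0.5, 0.3, n_feat) for _ in range(n_cases)]
specific, pool_X, pool_y, tests = [], [], [], []
for w in case_w:
    Xtr, ytr = simulate(n_train, w)
    specific.append(fit(Xtr, ytr))            # one algorithm per case
    pool_X.append(Xtr)
    pool_y.append(ytr)
    tests.append(simulate(n_test, w))

# Generic strategy: a single algorithm fit to all cases pooled together.
generic = fit(np.vstack(pool_X), np.concatenate(pool_y))

for name, betas in [("case-specific", specific), ("generic", [generic] * n_cases)]:
    rs = [np.corrcoef(X @ b, y)[0, 1] for (X, y), b in zip(tests, betas)]
    print(f"{name}: mean holdout r with expert ratings = {np.mean(rs):.3f}")
```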