Mariska M.G. Leeflang
University of Amsterdam
Publications
Featured research published by Mariska M.G. Leeflang.
Annals of Internal Medicine | 2011
Penny F Whiting; Anne Wilhelmina Saskia Rutjes; Marie Westwood; Susan Mallett; Jonathan J Deeks; Johannes B. Reitsma; Mariska M.G. Leeflang; Jonathan A C Sterne; Patrick M. Bossuyt
In 2003, the QUADAS tool for systematic reviews of diagnostic accuracy studies was developed. Experience, anecdotal reports, and feedback suggested areas for improvement; therefore, QUADAS-2 was developed. This tool comprises 4 domains: patient selection, index test, reference standard, and flow and timing. Each domain is assessed in terms of risk of bias, and the first 3 domains are also assessed in terms of concerns regarding applicability. Signalling questions are included to help judge risk of bias. The QUADAS-2 tool is applied in 4 phases: summarize the review question, tailor the tool and produce review-specific guidance, construct a flow diagram for the primary study, and judge bias and applicability. This tool will allow for more transparent rating of bias and applicability of primary diagnostic accuracy studies.
Annals of Internal Medicine | 2008
Mariska M.G. Leeflang; Jonathan J Deeks; Constantine Gatsonis; Patrick M. Bossuyt
Diagnosis is a critical component of health care, and clinicians, policymakers, and patients routinely face a range of questions regarding diagnostic tests. They want to know whether testing improves outcome; what test to use, purchase, or recommend in practice guidelines; and how to interpret test results. Well-designed diagnostic test accuracy studies can help in making these decisions, provided that they transparently and fully report their participants, tests, methods, and results as facilitated, for example, by the STARD (Standards for Reporting of Diagnostic Accuracy) statement (1). That 25-item checklist was published in many journals and is now adopted by more than 200 scientific journals worldwide. As in other areas of science, systematic reviews and meta-analyses of accuracy studies can be used to obtain more precise estimates when small studies addressing the same test and patients in the same setting are available. Reviews can also be useful to establish whether and how scientific findings vary by particular subgroups, and may provide summary estimates with stronger generalizability than estimates from a single study. Systematic reviews may help identify the risk for bias that may be present in the original studies and can be used to address questions that were not directly considered in the primary studies, such as comparisons between tests. The Cochrane Collaboration is the largest international organization preparing, maintaining, and promoting systematic reviews to help people make well-informed decisions about health care (2). The Collaboration decided in 2003 to make preparations for including systematic reviews of diagnostic test accuracy in its Cochrane Database of Systematic Reviews. To enable this, a working group (Appendix) was formed to develop methodology, software, and a handbook. The first diagnostic test accuracy review was published in the Cochrane Database in October 2008.
In this paper, we review recent methodological developments concerning problem formulation, location of literature, quality assessment, and meta-analysis of diagnostic accuracy studies, drawing on our experience from the work on the Cochrane Handbook. The information presented here is based on the recent literature and updates previously published guidelines by Irwig and colleagues (3).
Definition of the Objectives of the Review
Diagnostic test accuracy refers to the ability of a test to distinguish between patients with disease (or, more generally, a specified target condition) and those without. In a study of test accuracy, the results of the test under evaluation, the index test, are compared with those of the reference standard determined in the same patients. The reference standard is an agreed-on and accurate method for identifying patients who have the target condition. Test results are typically categorized as positive or negative for the target condition. With such binary test outcomes, accuracy is most often expressed as the test's sensitivity (the proportion of patients with positive results on the reference standard who are also positive on the index test) and specificity (the proportion of patients with negative results on the reference standard who are also negative on the index test). Other measures have been proposed and are in use (4–6). It has long been recognized that test accuracy is not a fixed property of a test. It can vary between patient subgroups, with their spectrum of disease, with the clinical setting, or with the test interpreters, and may depend on the results of previous testing. For this reason, inclusion of these elements in the study question is essential. To make a policy decision to promote use of a new index test, evidence is required that using the new test increases test accuracy over other testing options, including current practice, or that the new test has equivalent accuracy but offers other advantages (7–9).
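The two definitions above can be made concrete with a minimal sketch in Python; the 2x2 cell counts below are hypothetical, chosen only to illustrate the calculation:

```python
# Illustrative sketch: sensitivity and specificity from a 2x2 table
# comparing an index test against the reference standard.

def accuracy_measures(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from 2x2 cell counts."""
    sensitivity = tp / (tp + fn)  # reference-positive patients also positive on index test
    specificity = tn / (tn + fp)  # reference-negative patients also negative on index test
    return sensitivity, specificity

# Hypothetical counts: 80 true positives, 10 false positives,
# 20 false negatives, 90 true negatives.
sens, spec = accuracy_measures(tp=80, fp=10, fn=20, tn=90)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # sensitivity=0.80, specificity=0.90
```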
As with the evaluation of interventions, systematic reviews need to include comparative analyses between alternative testing strategies and should not focus solely on evaluating the performance of a test in isolation. In relation to the existing situation, 3 possible roles for a new test can be defined: replacement, triage, and add-on (7). If a new test is to replace an existing test, then comparing the accuracy of both tests on the same population and with the same reference standard provides the most direct evidence. In triage, the new test is used before the existing test or testing pathway, and only patients with a particular result on the triage test continue the testing pathway. When a test is needed to rule out disease in patients who then need no further testing, a test that gives a minimal proportion of false-negative results and thus a relatively high sensitivity should be used. Triage tests may be less accurate than existing ones, but they have other advantages, such as simplicity or low cost. A third possible role of a new test is add-on. The new test is then positioned after the existing testing pathway to identify false-positive or false-negative results after the existing pathway. The review should provide data to assess the incremental change in accuracy made by adding the new test. An example of a replacement question can be found in a systematic review of the diagnostic accuracy of urinary markers for primary bladder cancer (10). Clinicians may use cytology to triage patients before they undergo invasive cystoscopy, the reference standard for bladder cancer. Because cytology combines high specificity with low sensitivity (11), the goal of the review was to identify a tumor marker with sufficient accuracy to either replace cytology or be used in addition to cytology. For a marker to replace cytology, it has to achieve equally high specificity with improved sensitivity.
New markers that are sensitive but not specific may have roles as adjuncts to conventional testing. The review included studies in which the test under evaluation (several different tumor markers and cytology) was evaluated against cystoscopy or histopathology. Included studies compared 1 or more of the markers, cytology only, or a combination of markers and cytology. Although information on accuracy can help clinicians make decisions about tests, good diagnostic accuracy is a desirable but not sufficient condition for the effectiveness of a test (8). To demonstrate that using a new test does more good than harm to patients tested, randomized trials of test-and-treatment strategies and reviews of such trials may be necessary. However, with the possible exception of screening, such randomized trials are in most cases not available, and systematic reviews of test accuracy may provide the most useful evidence available to guide clinical and health policy decision making and to serve as input for decision and cost-effectiveness analyses (12).
Identification and Selection of Studies
Identifying test accuracy studies is more difficult than searching for randomized trials (13). There is no clear, unequivocal keyword or indexing term for an accuracy study in literature databases comparable with the term randomized controlled trial. The Medical Subject Heading sensitivity and specificity may look suitable but is inconsistently applied in most electronic bibliographic databases. Furthermore, data on diagnostic test accuracy may be hidden in studies that did not have test accuracy estimation as their primary objective. This complicates the efficient identification of diagnostic test accuracy studies in electronic databases, such as MEDLINE. Until indexing systems properly code studies of test accuracy, searching for them will remain challenging and may require additional manual searches, such as screening reference lists.
In the development of a comprehensive search strategy, review authors can use search strings that refer to the test(s) under evaluation, the target condition, and the patient description, or a subset of these. For tests with a clear name that are used for a single purpose, searching for publications in which those tests are mentioned may suffice. For other reviews, adding the patient description may be necessary, although this is also often poorly indexed. A search strategy in MEDLINE should contain both Medical Subject Headings and free text words. A search strategy for articles about tests for bladder cancer, for example, should include as many synonyms for bladder cancer as possible in the search strategy, including neoplasm, carcinoma, transitional cell, and hematuria. Several methodological electronic search filters for diagnostic test accuracy studies have been developed, each attempting to restrict the search to articles that are most likely to be test accuracy studies (13–16). These filters rely on indexing terms for research methodology and text words used in reporting results, but they often miss relevant studies and are unlikely to decrease the number of articles one needs to screen. Therefore, they are not recommended for systematic reviews (17, 18). The incremental value of searching in languages other than English and in the gray literature has not yet been fully investigated. In systematic reviews of intervention studies, publication bias is an important and well-studied form of bias in which the decision to report and publish studies is linked to their findings. For clinical trials, the magnitude and determinants of publication bias have been identified by tracing the publication history of cohorts of trials reviewed by ethics committees and research boards (19). A consistent observation has been that studies with significant results are more likely to be published than studies with nonsignificant findings (19).
Investigating publication bias for diagnostic tests is problematic, because many studies are done without ethical review or study registration; therefore, identification of cohorts of studies from registration to final publication status i
Journal of Clinical Epidemiology | 2009
Mariska M.G. Leeflang; Patrick M. Bossuyt; Les Irwig
BACKGROUND Several studies and systematic reviews have reported results that indicate that sensitivity and specificity may vary with prevalence. STUDY DESIGN AND SETTING We identify and explore mechanisms that may be responsible for sensitivity and specificity varying with prevalence and illustrate them with examples from the literature. RESULTS Clinical and artefactual variability may be responsible for changes in prevalence and accompanying changes in sensitivity and specificity. Clinical variability refers to differences in the clinical situation that may cause sensitivity and specificity to vary with prevalence. For example, a patient population with a higher disease prevalence may include more severely diseased patients; the test therefore performs better in this population. Artefactual variability refers to effects on prevalence and accuracy associated with study design, for example, the verification of index test results by a reference standard. Changes in prevalence influence the extent of overestimation due to imperfect reference standard classification. CONCLUSIONS Sensitivity and specificity may vary in different clinical populations, and prevalence is a marker for such differences. Clinicians are advised to base their decisions on studies that most closely match their own clinical situation, using prevalence to guide the detection of differences in study population or study design.
Annals of Internal Medicine | 2012
Caroline Chartrand; Mariska M.G. Leeflang; Jessica Minion; Timothy F. Brewer; Madhukar Pai
BACKGROUND Timely diagnosis of influenza can help clinical management. PURPOSE To examine the accuracy of rapid influenza diagnostic tests (RIDTs) in adults and children with influenza-like illness and evaluate factors associated with higher accuracy. DATA SOURCES PubMed and EMBASE through December 2011; BIOSIS and Web of Science through March 2010; and citations of articles, guidelines, reviews, and manufacturers. STUDY SELECTION Studies that compared RIDTs with a reference standard of either reverse transcriptase polymerase chain reaction (first choice) or viral culture. DATA EXTRACTION Reviewers abstracted study data by using a standardized form and assessed quality by using Quality Assessment of Diagnostic Accuracy Studies criteria. DATA SYNTHESIS 159 studies evaluated 26 RIDTs, and 35% were conducted during the H1N1 pandemic. Failure to report whether results were assessed in a blinded manner and the basis for patient recruitment were important quality concerns. The pooled sensitivity and specificity were 62.3% (95% CI, 57.9% to 66.6%) and 98.2% (CI, 97.5% to 98.7%), respectively. The positive and negative likelihood ratios were 34.5 (CI, 23.8 to 45.2) and 0.38 (CI, 0.34 to 0.43), respectively. Sensitivity estimates were highly heterogeneous, which was partially explained by lower sensitivity in adults (53.9% [CI, 47.9% to 59.8%]) than in children (66.6% [CI, 61.6% to 71.7%]) and a higher sensitivity for influenza A (64.6% [CI, 59.0% to 70.1%]) than for influenza B (52.2% [CI, 45.0% to 59.3%]). LIMITATION Incomplete reporting limited the ability to assess the effect of important factors, such as specimen type and duration of influenza symptoms, on diagnostic accuracy. CONCLUSION Influenza can be ruled in but not ruled out through the use of RIDTs. Sensitivity varies across populations, but it is higher in children than in adults and for influenza A than for influenza B. PRIMARY FUNDING SOURCE Canadian Institutes of Health Research.
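The likelihood ratios reported in the abstract follow from the pooled sensitivity and specificity; a quick check in Python (small differences from the published figures are expected, since the pooled estimates are rounded):

```python
# Sketch: positive and negative likelihood ratios derived from the
# pooled sensitivity and specificity reported in the abstract.
sens = 0.623  # pooled sensitivity of RIDTs
spec = 0.982  # pooled specificity of RIDTs

lr_pos = sens / (1 - spec)   # how much a positive result raises the odds of influenza
lr_neg = (1 - sens) / spec   # how much a negative result lowers the odds

print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")  # LR+ = 34.6, LR- = 0.38
```

The very large LR+ and the unimpressive LR- are exactly what the conclusion summarizes: a positive RIDT rules influenza in, but a negative one cannot rule it out.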
Clinical Chemistry | 2008
Mariska M.G. Leeflang; Karel G.M. Moons; Johannes B. Reitsma; Aeilko H. Zwinderman
BACKGROUND Optimal cutoff values for test results involving continuous variables are often derived in a data-driven way. This approach, however, may lead to overly optimistic measures of diagnostic accuracy. We evaluated the magnitude of the bias in sensitivity and specificity associated with data-driven selection of cutoff values and examined potential solutions to reduce this bias. METHODS Different sample sizes, distributions, and prevalences were used in a simulation study. We compared data-driven estimates of accuracy based on the Youden index with the true values and calculated the median bias. Three alternative approaches (assuming a specific distribution, leave-one-out, smoothed ROC curve) were examined for their ability to reduce this bias. RESULTS The magnitude of bias caused by data-driven optimization of cutoff values was inversely related to sample size. If the true values for sensitivity and specificity are both 84%, the estimates in studies with a sample size of 40 will be approximately 90%. If the sample size increases to 200, the estimates will be 86%. The distribution of the test results had little impact on the amount of bias when sample size was held constant. More robust methods of optimizing cutoff values were less prone to bias, but the performance deteriorated if the underlying assumptions were not met. CONCLUSIONS Data-driven selection of the optimal cutoff value can lead to overly optimistic estimates of sensitivity and specificity, especially in small studies. Alternative methods can reduce this bias, but finding robust estimates for cutoff values and accuracy requires considerable sample sizes.
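The mechanism the abstract describes can be reproduced in a toy simulation. This is an illustrative sketch with assumed normal distributions and an assumed true Youden index, not the paper's exact simulation design; it shows the optimism shrinking as sample size grows:

```python
# Toy simulation: data-driven cutoff selection via the Youden index
# (J = sensitivity + specificity - 1) overestimates accuracy, more so
# in small samples. Distributions are assumed: diseased ~ N(2,1),
# healthy ~ N(0,1), so the optimal cutoff is 1.0 and true J ~ 0.683.
import random

random.seed(1)

def simulate(n_per_group, n_runs=100):
    """Mean optimism in the apparent Youden index when the cutoff is
    chosen to maximize it in the same data used to estimate accuracy."""
    true_j = 0.683
    optimism = []
    for _ in range(n_runs):
        diseased = [random.gauss(2.0, 1.0) for _ in range(n_per_group)]
        healthy = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
        best_j = -1.0
        # Data-driven cutoff: maximize J over the observed values
        for c in diseased + healthy:
            sens = sum(x >= c for x in diseased) / n_per_group
            spec = sum(x < c for x in healthy) / n_per_group
            best_j = max(best_j, sens + spec - 1)
        optimism.append(best_j - true_j)
    return sum(optimism) / n_runs

print(f"mean optimism in Youden J, n=20 per group:  {simulate(20):+.3f}")
print(f"mean optimism in Youden J, n=100 per group: {simulate(100):+.3f}")
```

The small-sample optimism is substantially larger, mirroring the 90%-at-n=40 versus 86%-at-n=200 pattern reported in the results.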
British Journal of Dermatology | 2008
Elian E.A. Brenninkmeijer; M.E. Schram; Mariska M.G. Leeflang; Jan C. van den Bos; Ph.I. Spuls
Background Atopic dermatitis (AD) has a wide spectrum of dermatological manifestations, and despite the various validated sets of diagnostic criteria developed over the past decades, there is disagreement about its definition. Nevertheless, clinical studies require valid diagnostic criteria for reliable and reproducible results.
Allergy | 2012
M.E. Schram; Ph.I. Spuls; Mariska M.G. Leeflang; R. Lindeboom; Jan D. Bos; Jochen Schmitt
To cite this article: Schram ME, Spuls PI, Leeflang MMG, Lindeboom R, Bos JD, Schmitt J. EASI, (objective) SCORAD and POEM for atopic eczema: responsiveness and minimal clinically important difference. Allergy 2012; 67: 99–106.
Canadian Medical Association Journal | 2013
Mariska M.G. Leeflang; Anne Wilhelmina Saskia Rutjes; Johannes B. Reitsma; Lotty Hooft; Patrick M. Bossuyt
Background: Anecdotal evidence suggests that the sensitivity and specificity of a diagnostic test may vary with disease prevalence. Our objective was to investigate the associations between disease prevalence and test sensitivity and specificity using studies of diagnostic accuracy. Methods: We used data from 23 meta-analyses, each of which included 10–39 studies (416 total). The median prevalence per review ranged from 1% to 77%. We evaluated the effects of prevalence on sensitivity and specificity using a bivariate random-effects model for each meta-analysis, with prevalence as a covariate. We estimated the overall effect of prevalence by pooling the effects using the inverse variance method. Results: Within a given review, a change in prevalence from the lowest to highest value resulted in a corresponding change in sensitivity or specificity from 0 to 40 percentage points. This effect was statistically significant (p < 0.05) for either sensitivity or specificity in 8 meta-analyses (35%). Overall, specificity tended to be lower with higher disease prevalence; there was no such systematic effect for sensitivity. Interpretation: The sensitivity and specificity of a test often vary with disease prevalence; this effect is likely to be the result of mechanisms, such as patient spectrum, that affect prevalence, sensitivity and specificity. Because it may be difficult to identify such mechanisms, clinicians should use prevalence as a guide when selecting studies that most closely match their situation.
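The inverse variance method named in the abstract can be sketched in a few lines; the per-review effect estimates and variances below are hypothetical values for illustration:

```python
# Sketch: inverse-variance pooling of per-review effect estimates,
# weighting each estimate by the reciprocal of its variance.

def inverse_variance_pool(effects, variances):
    """Pool effect estimates; return (pooled effect, standard error)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5
    return pooled, pooled_se

# Hypothetical prevalence effects from three reviews, with variances.
effects = [-0.20, -0.05, -0.35]
variances = [0.010, 0.020, 0.040]
pooled, se = inverse_variance_pool(effects, variances)
print(f"pooled effect = {pooled:.3f} (SE {se:.3f})")  # pooled effect = -0.179 (SE 0.076)
```

More precise reviews get proportionally more weight, so the pooled value sits closest to the estimate with the smallest variance.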
Lancet Oncology | 2013
Linda K Wanders; James E. East; Sanne E Uitentuis; Mariska M.G. Leeflang; Evelien Dekker
BACKGROUND Novel endoscopic technologies could allow optical diagnosis and resection of colonic polyps without histopathological testing. Our aim was to establish the sensitivity, specificity, and real-time negative predictive value of three types of narrowed spectrum endoscopy (narrow-band imaging [NBI], image-enhanced endoscopy [i-scan], and Fujinon intelligent chromoendoscopy [FICE]), confocal laser endomicroscopy (CLE), and autofluorescence imaging for differentiation between neoplastic and non-neoplastic colonic lesions. METHODS We identified relevant studies through a search of Medline, Embase, PubMed, and the Cochrane Library. Clinical trials and observational studies were eligible for inclusion when the diagnostic performance of NBI, i-scan, FICE, autofluorescence imaging, or CLE had been assessed for differentiation, with histopathology as the reference standard, and for which a 2 × 2 contingency table of lesion diagnosis could be constructed. We did a random-effects bivariate meta-analysis using a non-linear mixed model approach to calculate summary estimates of sensitivity and specificity, and plotted estimates in a summary receiver-operating characteristic curve. FINDINGS We included 91 studies in our analysis: 56 were of NBI, ten of i-scan, 14 of FICE, 11 of CLE, and 11 of autofluorescence imaging (more than one of the investigated modalities assessed in eight studies). For NBI, overall sensitivity was 91·0% (95% CI 88·6-93·0), specificity 85·6% (81·3-89·0), and real-time negative predictive value 82·5% (75·4-87·9). For i-scan, overall sensitivity was 89·3% (83·3-93·3), specificity 88·2% (80·3-93·2), and real-time negative predictive value 86·5% (78·0-92·1). For FICE, overall sensitivity was 91·8% (87·1-94·9), specificity 83·5% (77·2-88·3), and real-time negative predictive value 83·7% (77·5-88·4). 
For autofluorescence imaging, overall sensitivity was 86·7% (79·5-91·6), specificity 65·9% (50·9-78·2), and real-time negative predictive value 81·5% (54·0-94·3). For CLE, overall sensitivity was 93·3% (88·4-96·2), specificity 89·9% (81·8-94·6), and real-time negative predictive value 94·8% (86·6-98·1). INTERPRETATION All endoscopic imaging techniques other than autofluorescence imaging could be used by appropriately trained endoscopists to make a reliable optical diagnosis for colonic lesions in daily practice. Further research should be focused on whether training could help to improve negative predictive values. FUNDING None.
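A negative predictive value combines summary sensitivity and specificity with the prevalence of neoplastic lesions. A sketch using the NBI estimates above and an assumed 30% prevalence (an illustrative figure, not taken from the review):

```python
# Sketch: negative predictive value from sensitivity, specificity,
# and prevalence via Bayes' rule.

def npv(sens, spec, prev):
    """Probability that a lesion called negative is truly non-neoplastic."""
    true_neg = spec * (1 - prev)        # non-neoplastic and test-negative
    false_neg = (1 - sens) * prev       # neoplastic but test-negative
    return true_neg / (true_neg + false_neg)

# NBI summary estimates from the abstract; 30% prevalence is assumed.
print(f"NPV at 30% prevalence: {npv(0.910, 0.856, 0.30):.3f}")  # 0.957
```

This also shows why the real-time negative predictive values in the abstract depend on the case mix of each study, not only on the imaging technique.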
Systematic Reviews | 2013
Mariska M.G. Leeflang; Jonathan J Deeks; Yemisi Takwoingi; Petra Macaskill
In 1996, shortly after the founding of The Cochrane Collaboration, leading figures in test evaluation research established a Methods Group to focus on the relatively new and rapidly evolving methods for the systematic review of studies of diagnostic tests. Seven years later, the Collaboration decided it was time to develop a publication format and methodology for Diagnostic Test Accuracy (DTA) reviews, as well as the software needed to implement these reviews in The Cochrane Library. A meeting hosted by the German Cochrane Centre in 2004 brought together key methodologists in the area, many of whom became closely involved in the subsequent development of the methodological framework for DTA reviews. DTA reviews first appeared in The Cochrane Library in 2008 and are now an integral part of the work of the Collaboration.