Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Frank W. Samuelson is active.

Publication


Featured research published by Frank W. Samuelson.


Breast Cancer Research | 2010

Effect of training-sample size and classification difficulty on the accuracy of genomic predictors

Vlad Popovici; Weijie Chen; Brandon D. Gallas; Christos Hatzis; Weiwei Shi; Frank W. Samuelson; Yuri Nikolsky; Marina Tsyganova; Alex Ishkin; Tatiana Nikolskaya; Kenneth R. Hess; Vicente Valero; Daniel J. Booser; Mauro Delorenzi; Gabriel N. Hortobagyi; Leming Shi; W. Fraser Symmans; Lajos Pusztai

Introduction: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. Methods: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. Results: A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. Conclusions: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
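
To make the comparison of resampling estimates against an independent validation set concrete, here is a minimal, hypothetical sketch in Python. It is not the MAQC-II pipeline: the simulated data, the particular feature selector, and the classifier are all assumptions made purely for illustration.

```python
# Hypothetical comparison of cross-validation and bootstrap (out-of-bag) accuracy
# estimates with the accuracy observed on a held-out validation set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=230, n_features=2000, n_informative=20,
                           random_state=0)            # stand-in for expression data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

# One "univariate feature selection + classifier" predictor.
model = make_pipeline(SelectKBest(f_classif, k=50), LogisticRegression(max_iter=1000))

# Cross-validation estimate on the training set.
cv_acc = cross_val_score(model, X_tr, y_tr, cv=5).mean()

# Bootstrap (out-of-bag) estimate on the training set.
oob_accs = []
for _ in range(200):
    idx = rng.randint(0, len(y_tr), len(y_tr))        # sample cases with replacement
    oob = np.setdiff1d(np.arange(len(y_tr)), idx)     # cases left out of the bootstrap
    fit = model.fit(X_tr[idx], y_tr[idx])
    oob_accs.append(fit.score(X_tr[oob], y_tr[oob]))
boot_acc = np.mean(oob_accs)

# Accuracy on the independent validation set.
val_acc = model.fit(X_tr, y_tr).score(X_val, y_val)
print(cv_acc, boot_acc, val_acc)
```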


International Symposium on Biomedical Imaging | 2006

Comparing image detection algorithms using resampling

Frank W. Samuelson; Nicholas Petrick

The ability to statistically compare the performance of two computer detection (CD) or computer-aided detection (CAD) algorithms is fundamental for the development and evaluation of medical image analysis tools. Automated detection tools for medical imaging are commonly characterized using free-response receiver operating characteristic (FROC) methods. However, few statistical tools are currently available to estimate statistical significance when comparing two FROC performance curves. In this study, we introduce a permutation and a bootstrap resampling method for the nonparametric estimation of the statistical significance of performance metrics when comparing two FROC curves. We then provide an initial validation of the proposed methods using an area-under-the-FROC-curve performance metric and a simulation model for creating CD algorithm prompts. Validation is based on a comparison of the Type I error rate produced by two statistically identical CD algorithms. The results of 10^4 Monte Carlo trials show that both the permutation and bootstrap methods produced excellent estimates of the expected Type I error rate.
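
The sketch below illustrates the case-level permutation idea described above. It is an assumption-laden toy: it uses a simple JAFROC-like statistic (highest-rated lesion mark per abnormal case versus highest-rated false mark per normal case) rather than the area under the FROC curve, and the simulated mark scores are invented.

```python
# Case-level permutation test for the difference in a free-response figure of
# merit between two detection algorithms evaluated on the same cases.
import numpy as np

rng = np.random.default_rng(0)
n_abnormal, n_normal = 60, 60

def simulate_case_scores(shift):
    """Highest TP score per abnormal case and highest FP score per normal case."""
    tp = rng.normal(shift, 1.0, n_abnormal)     # best lesion-localized mark score
    fp = rng.normal(0.0, 1.0, n_normal)         # best false mark score on normals
    return tp, fp

def fom(tp, fp):
    """P(highest TP score > highest FP score), a JAFROC-like figure of merit."""
    return (tp[:, None] > fp[None, :]).mean()

# Two statistically identical algorithms (the null hypothesis is true).
tp_a, fp_a = simulate_case_scores(1.0)
tp_b, fp_b = simulate_case_scores(1.0)
observed = fom(tp_a, fp_a) - fom(tp_b, fp_b)

# Permute by case: randomly swap the two algorithms' marks on each case.
diffs = []
for _ in range(2000):
    swap_ab = rng.random(n_abnormal) < 0.5
    swap_n = rng.random(n_normal) < 0.5
    ta = np.where(swap_ab, tp_b, tp_a)
    tb = np.where(swap_ab, tp_a, tp_b)
    fa = np.where(swap_n, fp_b, fp_a)
    fb = np.where(swap_n, fp_a, fp_b)
    diffs.append(fom(ta, fa) - fom(tb, fb))

p_value = np.mean(np.abs(diffs) >= abs(observed))
print(observed, p_value)
```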


International Symposium on Biomedical Imaging | 2007

Advantages and examples of resampling for CAD evaluation

Frank W. Samuelson; Nicholas Petrick; Sophie Paquerault

Comparison of performance accuracy between different computer-aided diagnosis (CAD) devices is a challenging task. The anatomical structure of the patient and the imaging geometry introduce many possible correlations among the scores produced by a CAD device. Numerous analysis methods have been designed to account for either the correlations among CAD scores or the variable number of CAD marks, but usually not both. However, methods that resample by case incorporate both of these sources of variability while accounting for within-case correlations. In this paper we present some examples of the use of resampling on CAD score data.
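
As a rough sketch of what "resampling by case" means in practice: each bootstrap replicate redraws whole cases, so all of a case's marks (and their within-case correlations, however many marks the case has) stay together. The per-case summaries and the operating threshold below are invented for illustration; they are not data from the paper.

```python
# Bootstrap by case for one hypothetical CAD device: percentile confidence
# intervals for sensitivity and false positives per image at a fixed threshold.
import numpy as np

rng = np.random.default_rng(1)
n_cases = 80
# Hypothetical per-case summaries: best true-positive mark score and FP mark count.
cases = [{"best_tp": rng.normal(1.0, 1.0), "n_fp": rng.poisson(0.5)}
         for _ in range(n_cases)]

def sensitivity(sample, threshold=0.5):
    return np.mean([c["best_tp"] > threshold for c in sample])

def fp_per_image(sample):
    return np.mean([c["n_fp"] for c in sample])

boot_sens, boot_fppi = [], []
for _ in range(2000):
    idx = rng.integers(0, n_cases, n_cases)          # draw whole cases with replacement
    sample = [cases[i] for i in idx]
    boot_sens.append(sensitivity(sample))
    boot_fppi.append(fp_per_image(sample))

print("sensitivity:", sensitivity(cases), np.percentile(boot_sens, [2.5, 97.5]))
print("FPs/image:  ", fp_per_image(cases), np.percentile(boot_fppi, [2.5, 97.5]))
```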


BMC Medical Research Methodology | 2013

On the assessment of the added value of new predictive biomarkers

Weijie Chen; Frank W. Samuelson; Brandon D. Gallas; Le Kang; Berkman Sahiner; Nicholas Petrick

Background: The surge in biomarker development calls for research on statistical evaluation methodology to rigorously assess emerging biomarkers and classification models. Recently, several authors reported the puzzling observation that, in assessing the added value of new biomarkers to existing ones in a logistic regression model, statistical significance of new predictor variables does not necessarily translate into a statistically significant increase in the area under the ROC curve (AUC). Vickers et al. concluded that this inconsistency is because AUC “has vastly inferior statistical properties,” i.e., it is extremely conservative. This statement is based on simulations that misuse the DeLong et al. method. Our purpose is to provide a fair comparison of the likelihood ratio (LR) test and the Wald test versus diagnostic accuracy (AUC) tests. Discussion: We present a test to compare ideal AUCs of nested linear discriminant functions via an F test. We compare it with the LR test and the Wald test for the logistic regression model. The null hypotheses of these three tests are equivalent; however, the F test is an exact test whereas the LR test and the Wald test are asymptotic tests. Our simulation shows that the F test has the nominal type I error even with a small sample size. Our results also indicate that the LR test and the Wald test have inflated type I errors when the sample size is small, while the type I error converges to the nominal value asymptotically with increasing sample size as expected. We further show that the DeLong et al. method tests a different hypothesis and has the nominal type I error when it is used within its designed scope. Finally, we summarize the pros and cons of all four methods we consider in this paper. Summary: We show that there is nothing inherently less powerful or disagreeable about ROC analysis for showing the usefulness of new biomarkers or characterizing the performance of classification models. Each statistical method for assessing biomarkers and classification models has its own strengths and weaknesses. Investigators need to choose methods based on the assessment purpose, the biomarker development phase at which the assessment is being performed, the available patient data, and the validity of assumptions behind the methodologies.
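
The following is a hedged sketch of the kind of null simulation discussed above: when the new biomarker truly adds nothing, how often does the likelihood ratio (LR) test for its coefficient in a logistic regression reject at the 5% level? The sample size, coefficient, and number of simulations below are assumptions for illustration, not the settings used in the paper.

```python
# Empirical type I error of the LR test for the added value of a useless biomarker.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n, n_sims, alpha = 30, 2000, 0.05
rejections, completed = 0, 0

for _ in range(n_sims):
    x_old = rng.normal(size=n)                      # established biomarker
    x_new = rng.normal(size=n)                      # new biomarker with no added value
    p = 1.0 / (1.0 + np.exp(-0.5 * x_old))          # outcome depends on x_old only
    y = rng.binomial(1, p)
    X_reduced = sm.add_constant(x_old)
    X_full = sm.add_constant(np.column_stack([x_old, x_new]))
    try:
        llf_reduced = sm.Logit(y, X_reduced).fit(disp=0).llf
        llf_full = sm.Logit(y, X_full).fit(disp=0).llf
    except Exception:
        continue                                    # skip separable/degenerate samples
    completed += 1
    lr_stat = 2.0 * (llf_full - llf_reduced)        # chi-square with 1 df under the null
    if stats.chi2.sf(lr_stat, df=1) < alpha:
        rejections += 1

print("empirical type I error:", rejections / completed)
```

With small n the printed rate tends to sit above the nominal 0.05 and approaches it as n grows, which is the asymptotic behavior the abstract describes.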


Academic Radiology | 2011

The importance of ROC data.

Frank W. Samuelson; Brandon D. Gallas; Kyle J. Myers; Nicholas Petrick; Paul Pinsky; Berkman Sahiner; Gregory Campbell; Gene Pennello

Dear editor: The article by Gur et al. (1) presents interesting data for those who perform reader studies of radiological devices. The article reports differences between two methods of estimating a change in the probability of correct discrimination, or area under the receiver operating characteristic (ROC) curve (AUC). It uses data from one particular study (2) in which the breast cancer detection performance of full-field digital mammography (FFDM) was compared to that of FFDM plus the new investigational-use technology of digital breast tomosynthesis (DBT). Both estimates use a nonparametric empirical method, but one estimate uses multicategory or semicontinuous rating data, whereas the other uses two-category or binary data. We want to highlight three points relevant to this study and other controlled studies undertaken before a technology is in wide use in the clinical setting.


Academic Radiology | 2013

Statistical Power Considerations for a Utility Endpoint in Observer Performance Studies

Craig K. Abbey; Frank W. Samuelson; Brandon D. Gallas

RATIONALE AND OBJECTIVES The purpose of this investigation is to compare the statistical power of the most common measure of performance for observer performance studies, the area under the ROC curve (AUC), to an expected utility (EU) endpoint. MATERIALS AND METHODS We have modified a well-known simulation procedure developed by Roe and Metz for statistical power analysis in receiver operating characteristic (ROC) studies. Starting from a set of baseline simulations, we investigate the effects of three parameters that describe properties of the observers (iso-utility slope, unequal variance, and tendency to favor more aggressive or conservative actions) and three parameters that affect experimental design (number of readers, number of cases, and fraction of positive cases). RESULTS The EU endpoint generally has good statistical power relative to AUC in our simulations. Of 396 total conditions simulated, EU had higher statistical power in 377 cases (95%). In 246 of these cases, EU power was 5 percentage points or more higher than AUC power. In simulation runs evaluating the effect of the number of readers and cases on the baseline simulations, the EU measure had equivalent power to AUC with fewer readers (9% to 28%) or fewer cases (18% to 41%). CONCLUSION These simulation studies provide further motivation for considering EU in studies of screening mammography technology, and they motivate investigations of utility in other diagnostic tasks.
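
For readers unfamiliar with the EU endpoint, one common formulation (following Abbey and colleagues; the notation and the symbols p and U below are ours, and EU is often reported only up to an affine rescaling that does not affect comparisons) maximizes a utility-weighted combination of the ROC operating coordinates over the curve:

```latex
\mathrm{EU} = \max_{t}\,\bigl[\mathrm{TPF}(t) - \beta\,\mathrm{FPF}(t)\bigr],
\qquad
\beta = \frac{1-p}{p}\cdot\frac{U_{\mathrm{TN}} - U_{\mathrm{FP}}}{U_{\mathrm{TP}} - U_{\mathrm{FN}}}
```

Here t indexes operating points on the ROC curve, p is the disease prevalence, the U terms are the utilities of the four decision outcomes, and β is the iso-utility slope varied in the simulations above; AUC, by contrast, weights the entire curve regardless of prevalence or utilities.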


Proceedings of SPIE | 2013

The equivalence of a human observer and an ideal observer in binary diagnostic tasks

Xin He; Frank W. Samuelson; Brandon D. Gallas; Berkman Sahiner; Kyle J. Myers

The Ideal Observer (IO) is “ideal” for given data populations. In the image perception process, as the raw images are degraded by factors such as the display and eye optics, there is an equivalent IO (EIO). The EIO takes as its data the statistical information that survives the perceptual and cognitive degradations. We assume a human observer who has received sufficient training, e.g., a radiologist, and hypothesize that such a human observer can be modeled as an EIO. To measure the likelihood ratio (LR) distributions of an EIO, we formalize experimental design principles that encourage rationality based on von Neumann and Morgenstern's (vNM) axioms. We present examples to show that many observer study design refinements, although explicitly motivated by empirical principles, implicitly encourage rationality. Our hypothesis is supported by a recent review paper on ROC curve convexity by Pesce, Metz, and Berbaum. We also provide additional evidence based on a collection of observer studies in medical imaging. EIO theory shows that the “sub-optimal” performance of a human observer can be mathematically formalized in the form of an IO and measured through rationality encouragement.
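
For context, the ideal observer referred to above is the standard signal-detection-theory decision maker that thresholds the likelihood ratio of the data (the notation below is ours, not the paper's):

```latex
\Lambda(\mathbf{x}) = \frac{p(\mathbf{x} \mid \text{signal present})}{p(\mathbf{x} \mid \text{signal absent})}
\;\gtrless\; \lambda_c
```

Because every decision is a monotone function of Λ, the resulting ROC curve is proper (convex), which is the property that the rationality-encouraging designs described here aim to elicit from human readers.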


Academic Radiology | 2009

Investigation of Reading Mode and Relative Sensitivity as Factors That Influence Reader Performance When Using Computer-Aided Detection Software

Sophie Paquerault; Frank W. Samuelson; Nicholas Petrick; Kyle J. Myers; Robert C. Smith

RATIONALE AND OBJECTIVES The aim of this study was to investigate the effects of relative sensitivity (reader without computer-aided detection [CAD] vs stand-alone CAD) and reading mode on reader performance when using CAD software. MATERIALS AND METHODS Two sets of 100 images (low-contrast and high-contrast sets) were created by adding low-contrast or high-contrast simulated masses to random locations in 100 normal mammograms. This produced a relative sensitivity that was substantially lower for the low-contrast set and similar for the high-contrast set. Seven readers reviewed every image in each set and specified location and probability scores using three reading modes (without CAD, second read with CAD, and concurrent read with CAD). Reader detection accuracy was analyzed using areas under free-response receiver operating characteristic curves, sensitivity, and the number of false-positive findings per image. RESULTS For the low-contrast set, average differences in areas under free-response receiver operating characteristic curves, sensitivity, and false-positive findings per image without CAD were 0.02, 0.12, and 0.11, respectively, compared to second read, and 0.05, 0.17, and 0.09 (not statistically significant), respectively, compared to concurrent read. For the high-contrast set, average differences were 0.002 (not statistically significant), 0.04, and 0.05, respectively, compared to second read, and -0.004 (not statistically significant), 0.04, and 0.08 (not statistically significant), respectively, compared to concurrent read (all differences were statistically significant except as noted). Differences were greater for the low-contrast set than for the high-contrast set. Differences between second read and concurrent read were not significant. CONCLUSIONS Relative sensitivity is a critical factor that determines the incremental improvement in reader performance when using CAD and appears to be more important than reading mode. Relative sensitivity may determine the clinical usefulness of CAD in different clinical applications and for different types of users.


Proceedings of SPIE | 2013

Statistical properties of a utility measure of observer performance compared to area under the ROC curve

Craig K. Abbey; Frank W. Samuelson; Brandon D. Gallas; John M. Boone; Loren Niklason

The receiver operating characteristic (ROC) curve has become a common tool for evaluating diagnostic imaging technologies, and the primary endpoint of such evaluations is the area under the curve (AUC), which integrates sensitivity over the entire false-positive range. An alternative figure of merit for ROC studies is expected utility (EU), which focuses on the relevant region of the ROC curve as defined by disease prevalence and the relative utility of the task. However, if this measure is to be used, it must also have desirable statistical properties to keep the burden of observer performance studies as low as possible. Here, we evaluate effect size and variability for EU and AUC. We use two observer performance studies recently submitted to the FDA to compare the EU and AUC endpoints. The studies were conducted using the multi-reader multi-case methodology in which all readers score all cases in all modalities. ROC curves from the studies were used to generate both the AUC and EU values for each reader and modality. The EU measure was computed assuming an iso-utility slope of 1.03. We find mean effect sizes, the reader-averaged difference between modalities, to be roughly 2.0 times as large for EU as for AUC. The standard deviation across readers is roughly 1.4 times as large, suggesting better statistical properties for the EU endpoint. In a simple power analysis of paired comparisons across readers, the utility measure required 36% fewer readers on average to achieve 80% statistical power compared to AUC.
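
As a rough illustration of the kind of reader power analysis described in the last sentence, a paired (within-reader) comparison of two modalities can be sized with a standard one-sample t-test power calculation. The effect sizes and reader standard deviations below are invented placeholders, not the submitted studies' values.

```python
# Reader sample-size sketch for a paired comparison of two modalities.
# mean_diff is the hypothesized reader-averaged difference between modalities;
# sd_across_readers is the standard deviation of that difference across readers.
from statsmodels.stats.power import TTestPower

def readers_needed(mean_diff, sd_across_readers, alpha=0.05, power=0.80):
    d = mean_diff / sd_across_readers               # standardized paired effect size
    return TTestPower().solve_power(effect_size=d, alpha=alpha, power=power)

# A larger effect size relative to reader variability means fewer readers are needed.
print(readers_needed(0.02, 0.040))                  # hypothetical AUC-like endpoint
print(readers_needed(0.04, 0.056))                  # hypothetical EU-like endpoint
```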


Academic Radiology | 2013

Inference Based on Diagnostic Measures from Studies of New Imaging Devices

Frank W. Samuelson

RATIONALE AND OBJECTIVES Before a new diagnostic imaging device is used regularly in the clinic, it should be studied using patients and radiologists. Often such studies report diagnostic performance in terms of sensitivity, specificity, area under the receiver operating characteristic curve (AUC), or differences thereof. In this report we look at how these studies differ from actual future clinical practice and how those differences may affect reported performance measures. MATERIALS AND METHODS We review signal detection (receiver operating characteristic) theory and decision theory. We compare diagnostic measures from several published studies in medical imaging and examine how they relate to theory and each other. RESULTS We see that clinical decisions can be modeled using signal detection and decision theories. Sensitivity and specificity are inextricably linked with clinical factors, such as prevalence and costs. Imaging devices are used in many different ways in clinical practice, so that sensitivities, specificities, and AUCs measured in studies of new diagnostic imaging devices will differ from those in actual future clinical use. CONCLUSIONS Measured sensitivities, specificities, and the directions of changes thereof are not necessarily consistent or reproducible across studies of new diagnostic devices. A change in the AUC, which should be independent of clinical costs or prevalence, is a consistent measure across similar studies, and a positive change in AUC is indicative of additional diagnostic information that will be available to radiologists in a future clinical environment.
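
The dependence of sensitivity and specificity on clinical context can be made concrete with a standard decision-theoretic result (notation ours, not taken from the article): the slope of the ROC curve at the operating point that minimizes expected cost is

```latex
\left.\frac{d\,\mathrm{TPF}}{d\,\mathrm{FPF}}\right|_{\text{optimal}}
= \frac{1-p}{p}\cdot\frac{C_{\mathrm{FP}} - C_{\mathrm{TN}}}{C_{\mathrm{FN}} - C_{\mathrm{TP}}}
```

where p is the disease prevalence and the C terms are the costs of the four decision outcomes. Because the chosen operating point, and hence the measured sensitivity and specificity, shifts with prevalence and costs, those quantities vary across studies and clinical settings, whereas the AUC summarizes the entire curve and does not.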

Collaboration


Dive into Frank W. Samuelson's collaborations.

Top Co-Authors

Berkman Sahiner, Food and Drug Administration
Nicholas Petrick, Food and Drug Administration
Xin He, Food and Drug Administration
Brandon D. Gallas, Center for Devices and Radiological Health
Kyle J. Myers, Food and Drug Administration
Rongping Zeng, Food and Drug Administration
Craig K. Abbey, University of California
Aldo Badano, Food and Drug Administration
John M. Boone, University of California