Publications


Featured research published by Douglas M. Hawkins.


Journal of Chemical Information and Computer Sciences | 2004

The Problem of Overfitting

Douglas M. Hawkins

Model fitting is an important part of all sciences that use quantitative measurements. Experimenters often explore the relationships between measures. Two subclasses of relationship problems are as follows:

• Correlation problems: those in which we have a collection of measures, all of interest in their own right, and wish to see how and how strongly they are related.
• Regression problems: those in which one of the measures, the dependent variable, is of special interest, and we wish to explore its relationship with the other variables. These other variables may be called the independent variables, the predictor variables, or the covariates.

The dependent variable may be a continuous numeric measure such as a boiling point or a categorical measure such as a classification into mutagenic and nonmutagenic. We should emphasize that using the words ‘correlation problem’ and ‘regression problem’ is not meant to tie these problems to any particular statistical methodology. Having a ‘correlation problem’ does not limit us to conventional Pearson correlation coefficients. Log-linear models, for example, measure the relationship between categorical variables in multiway contingency tables. Similarly, multiple linear regression is a methodology useful for regression problems, but so also are nonlinear regression, neural nets, recursive partitioning, k-nearest neighbors, logistic regression, support vector machines and discriminant analysis, to mention a few. All of these methods aim to quantify the relationship between the predictors and the dependent variable. We will use the term ‘regression problem’ in this conceptual form and, when we want to specialize to multiple linear regression using ordinary least squares, will describe it as ‘OLS regression’. Our focus is on regression problems. We will use y as shorthand for the dependent variable and x for the collection of predictors available. There are two distinct primary settings in which we might want to do a regression study:

• Prediction problems: We may want to make predictions of y for future cases where we know x but do not know y. This, for example, is the problem faced with the Toxic Substances Control Act (TSCA) list. This list contains many tens of thousands of compounds, and there is a need to identify those on the list that are potentially harmful. Only a small fraction of the list, however, has any measured biological properties, but all of them can be characterized by chemical descriptors with relative ease. Using quantitative structure-activity relationships (QSARs) fitted to this small fraction to predict the toxicities of the much larger collection is a potentially cost-effective way to try to sort the TSCA compounds by their potential for harm. Later, we will use a data set for predicting the boiling point of a set of compounds on the TSCA list from some molecular descriptors.
• Effect quantification: We may want to gain an understanding of how the predictors enter into the relationship that predicts y. We do not necessarily have candidate future unknowns that we want to predict; we simply want to know how each predictor drives the distribution of y. This is the setting seen in drug discovery, where the biological activity y of each compound in a collection is measured, along with molecular descriptors x. Finding out which descriptors x are associated with high and which with low biological activity leads to a recipe for new compounds that are high in the features associated positively with activity and low in those associated with inactivity or with adverse side effects.

These two objectives are not always best served by the same approaches. ‘Feature selection’ (keeping those features associated with y and ignoring those not associated with y) is very commonly a part of an analysis meant for effect quantification but is not necessarily helpful if the objective is prediction of future unknowns. For prediction, methods such as partial least squares (PLS) and ridge regression (RR) that retain all features but rein in their contributions are often found to be more effective than those relying on feature selection.

What Is Overfitting? Occam’s Razor, or the principle of parsimony, calls for using models and procedures that contain all that is necessary for the modeling but nothing more. For example, if a regression model with 2 predictors is enough to explain y, then no more than these two predictors should be used. Going further, if the relationship can be captured by a linear function in these two predictors (which is described by 3 numbers: the intercept and two slopes), then using a quadratic violates parsimony. Overfitting is the use of models or procedures that violate parsimony, that is, that include more terms than are necessary or use more complicated approaches than are necessary. It is helpful to distinguish two types of overfitting:

• Using a model that is more flexible than it needs to be. For example, a neural net is able to accommodate some curvilinear relationships and so is more flexible than a simple linear regression. But if it is used on a data set that conforms to the linear model, it will add a level of complexity without any corresponding benefit.
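A minimal sketch of the kind of overfitting described above, assuming only a toy simulated data set: a straight-line relationship is fitted both with a parsimonious linear model and with an unnecessarily flexible degree-9 polynomial. The flexible fit looks better on the training sample but predicts held-out data worse. The data, polynomial degrees and sample sizes are illustrative choices, not taken from the paper.

```python
# Overfitting illustration: a flexible model fitted to data that conform to a
# straight line improves the training fit but worsens hold-out prediction.
import numpy as np

rng = np.random.default_rng(0)

# Training and hold-out samples from the same linear relationship y = 2 + 3x + noise
x_train = rng.uniform(0, 1, 20)
y_train = 2 + 3 * x_train + rng.normal(0, 0.5, 20)
x_test = rng.uniform(0, 1, 200)
y_test = 2 + 3 * x_test + rng.normal(0, 0.5, 200)

for degree in (1, 9):                      # parsimonious vs. overfitted
    coefs = np.polyfit(x_train, y_train, degree)
    train_rmse = np.sqrt(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    test_rmse = np.sqrt(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))
    print(f"degree {degree}: train RMSE {train_rmse:.2f}, test RMSE {test_rmse:.2f}")
```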


Mathematical Geosciences | 1980

Robust estimation of the variogram: I

Noel A. Cressie; Douglas M. Hawkins

It is a matter of common experience that ore values often do not follow the normal (or lognormal) distributions assumed for them, but, instead, follow some other heavier-tailed distribution. In this paper we discuss the robust estimation of the variogram when the distribution is normal-like in the central region but heavier than normal in the tails. It is shown that the use of a fourth-root transformation with or without the use of M-estimation yields stable robust estimates of the variogram.
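As a rough illustration of the fourth-root idea, the sketch below implements the robust semivariogram estimator in the form commonly attributed to Cressie and Hawkins (the average of square-rooted absolute differences, raised to the fourth power, divided by the bias correction 0.457 + 0.494/N(h)), applied to a toy one-dimensional transect. The lag handling, constants as quoted, and the simulated data are assumptions for illustration, not a reproduction of the paper's analysis.

```python
# Robust fourth-root-type semivariogram estimate on a 1-D transect (illustrative).
import numpy as np

def robust_semivariogram_1d(z, lag):
    """Robust semivariance estimate at an integer lag along a 1-D transect."""
    diffs = np.abs(z[lag:] - z[:-lag])            # |Z(s+h) - Z(s)| for all pairs at this lag
    n = diffs.size
    fourth_root_mean = np.mean(np.sqrt(diffs))    # average of |difference|^(1/2)
    # Bias-correction divisor as commonly quoted for this estimator (assumed form)
    two_gamma = fourth_root_mean ** 4 / (0.457 + 0.494 / n)
    return two_gamma / 2.0                        # semivariance gamma(h)

rng = np.random.default_rng(1)
z = np.cumsum(rng.normal(size=500))               # toy spatially correlated series
print([round(robust_semivariogram_1d(z, h), 2) for h in (1, 2, 5, 10)])
```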


Journal of Chemical Information and Computer Sciences | 2003

Assessing model fit by cross-validation

Douglas M. Hawkins; Subhash C. Basak; Denise Mills

When QSAR models are fitted, it is important to validate any fitted model, to check that it is plausible that its predictions will carry over to fresh data not used in the model fitting exercise. There are two standard ways of doing this: using a separate hold-out test sample, or the computationally much more burdensome leave-one-out cross-validation, in which the entire pool of available compounds is used both to fit the model and to assess its validity. We show by theoretical argument and empirical study of a large QSAR data set that when the available sample size is small (in the dozens or scores rather than the hundreds), holding a portion of it back for testing is wasteful, and that it is much better to use cross-validation, but ensure that this is done properly.
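The sketch below illustrates, for a generic OLS model on simulated descriptors, the leave-one-out procedure the abstract favors for small samples: each compound is predicted from a model fitted to the remaining compounds, and the cross-validated q² summarizes predictive ability. The variable names and simulated data are placeholders, not the paper's QSAR data set.

```python
# Leave-one-out cross-validation of an OLS model on simulated "descriptor" data.
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 5                                   # "dozens" of compounds, a few descriptors
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 2.0]) + rng.normal(scale=1.0, size=n)

def ols_predict(X_train, y_train, X_new):
    A = np.column_stack([np.ones(len(X_train)), X_train])     # add intercept column
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ beta

loo_pred = np.empty(n)
for i in range(n):                              # leave each compound out in turn
    mask = np.arange(n) != i
    loo_pred[i] = ols_predict(X[mask], y[mask], X[i:i + 1])[0]

press = np.sum((y - loo_pred) ** 2)             # predictive residual sum of squares
q2 = 1 - press / np.sum((y - y.mean()) ** 2)
print(f"leave-one-out q^2 = {q2:.3f}")
```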


Journal of Quality Technology | 1993

Regression Adjustment for Variables in Multivariate Quality Control

Douglas M. Hawkins

Multivariate process control problems are inherently more difficult than univariate problems. It is not always clear what type of multivariate statistic should be used, and the most statistically p...


Transplantation | 2006

Assessing relative risks of infection and rejection: A meta-analysis using an immune function assay

Richard J. Kowalski; Diane R. Post; Roslyn B. Mannon; Anthony Sebastian; Harlan Wright; Gary Sigle; James F. Burdick; Kareem Abu Elmagd; Adriana Zeevi; Mayra Lopez-Cepero; John A. Daller; H. Albin Gritsch; Elaine F. Reed; Johann Jonsson; Douglas M. Hawkins; Judith A. Britz

Background. Long-term use of immunosuppressants is associated with significant morbidity and mortality in transplant recipients. A simple whole blood assay that has U.S. Food and Drug Administration clearance directly assesses the net state of immune function of allograft recipients for better individualization of therapy. A meta-analysis of 504 solid organ transplant recipients (heart, kidney, kidney-pancreas, liver and small bowel) from 10 U.S. centers was performed using the Cylex ImmuKnow assay. Methods. Blood samples were taken from recipients at various times posttransplant and compared with clinical course (stable, rejection, infection). In this analysis, 39 biopsy-proven cellular rejections and 66 diagnosed infections occurred. Odds ratios of infection or rejection were calculated based on measured immune response values. Results. A recipient with an immune response value of 25 ng/ml adenosine triphosphate (ATP) was 12 times (95% confidence interval, 4 to 36) more likely to develop an infection than a recipient with a stronger immune response. Similarly, a recipient with an immune response of 700 ng/ml ATP was 30 times (95% confidence interval, 8 to 112) more likely to develop a cellular rejection than a recipient with a lower immune response value. Of note is the intersection of odds ratio curves for infection and rejection in the moderate immune response zone (280 ng/ml ATP). This intersection of risk curves provides an immunological target of immune function for solid organ recipients. Conclusion. These data show that the Cylex ImmuKnow assay has a high negative predictive value and provides a target immunological response zone for minimizing risk and managing patients to stability.


Journal of the American Statistical Association | 1977

Testing a Sequence of Observations for a Shift in Location

Douglas M. Hawkins

A possible alternative to the hypothesis that the sequence X₁, X₂, …, Xₙ are iid N(ξ, σ²) random variables is that at some unknown instant the expectation ξ shifts. The likelihood ratio test for the alternative of a location shift is studied and its distribution under the null hypothesis found. Tables of standard fractiles are given, along with asymptotic results.
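A hedged sketch of the underlying idea: for each candidate change instant the means before and after are compared with a standardized two-sample statistic, and the maximum over all instants serves as the test statistic. The standardization and pooled variance estimate used here are illustrative simplifications, not the exact statistic or the tabulated fractiles of the paper.

```python
# Maximal standardized mean-difference statistic over all candidate change points.
import numpy as np

def max_shift_statistic(x):
    n = len(x)
    best_stat, best_k = 0.0, None
    for k in range(1, n):                       # candidate change point after index k
        m1, m2 = x[:k].mean(), x[k:].mean()
        pooled = np.concatenate([x[:k] - m1, x[k:] - m2])
        sigma = pooled.std(ddof=2) or 1e-12     # variance pooled across both segments
        t_k = np.sqrt(k * (n - k) / n) * abs(m1 - m2) / sigma
        if t_k > best_stat:
            best_stat, best_k = t_k, k
    return best_stat, best_k

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 60), rng.normal(1.5, 1, 40)])   # shift after obs 60
print(max_shift_statistic(x))
```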


Technometrics | 1984

Location of Several Outliers in Multiple-Regression Data Using Elemental Sets

Douglas M. Hawkins; Dan Bradu; Gordon V. Kass

The outlying tendency of any case in a multiple regression of p predictors may be estimated by drawing all subsets of size p from the remaining cases and fitting the model. Each such subset yields an elemental residual for the case in question, and a suitable summary statistic of these can be used as an estimate of the case's outlying tendency. We propose two such summary statistics: an unweighted median, which is of bounded influence, and a weighted median, which is more efficient but less robust. The computational load of the procedure is reduced by using random samples in place of the full set of subsets of size p. As a byproduct, the method yields useful information on the influence (or leverage) of cases and the mutual masking of high leverage points.
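The sketch below, under simplifying assumptions (random elemental sets rather than all subsets, a toy design in which a constant column is counted among the p coefficients), mirrors the procedure described above: solve the exact fit through each elemental set, record the resulting residual for the case of interest, and summarize with the unweighted median.

```python
# Elemental-set outlyingness measure (illustrative, random subsets only).
import numpy as np

def elemental_outlyingness(X, y, case, n_subsets=2000, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = X.shape                              # X already contains a constant column
    others = np.delete(np.arange(n), case)
    residuals = []
    for _ in range(n_subsets):                  # random elemental sets instead of all C(n-1, p)
        idx = rng.choice(others, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])   # exact fit through p cases
        except np.linalg.LinAlgError:
            continue                            # skip singular (collinear) subsets
        residuals.append(y[case] - X[case] @ beta)
    return np.median(residuals)                 # bounded-influence unweighted median

rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[0] += 8.0                                     # plant one outlier
print(elemental_outlyingness(X, y, case=0, rng=rng),
      elemental_outlyingness(X, y, case=1, rng=rng))
```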


Quality and Reliability Engineering International | 2007

A change point method for linear profile data

Mahmoud A. Mahmoud; Peter A. Parker; William H. Woodall; Douglas M. Hawkins

We propose a change point approach based on the segmented regression technique for testing the constancy of the regression parameters in a linear profile data set. Each sample collected over time in the historical data set consists of several bivariate observations for which a simple linear regression model is appropriate. The change point approach is based on the likelihood ratio test for a change in one or more regression parameters. We compare the performance of this method to that of the most effective Phase I linear profile control chart approaches using a simulation study. The advantages of the change point method over the existing methods are greatly improved detection of sustained step changes in the process parameters and improved diagnostic tools to determine the sources of profile variation and the location(s) of the change point(s). Also, we give an approximation for appropriate thresholds for the test statistic. The use of the change point method is demonstrated using a data set from a calibration application at the National Aeronautics and Space Administration (NASA) Langley Research Center.
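A simplified sketch of the segmented-regression logic: for each candidate split of the ordered samples, straight lines are fitted separately to the profiles before and after the split, and the drop in residual sum of squares relative to a single common fit gives a likelihood-ratio-type statistic. The thresholds, the exact statistic and the diagnostic decomposition of the paper are not reproduced; the profiles below are simulated.

```python
# Change-point scan over linear profiles via before/after straight-line fits.
import numpy as np

def profile_sse(xs, ys):
    """Residual sum of squares of one straight-line fit to pooled (x, y) data."""
    A = np.column_stack([np.ones(xs.size), xs])
    _, res, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return res[0] if res.size else 0.0

def change_point_scan(profiles):
    """profiles: list of (x, y) arrays collected over time. Returns (stat, split)."""
    x_all = np.concatenate([p[0] for p in profiles])
    y_all = np.concatenate([p[1] for p in profiles])
    sse0, n = profile_sse(x_all, y_all), y_all.size
    best = (0.0, None)
    for k in range(1, len(profiles)):           # split: profiles[:k] vs profiles[k:]
        x1 = np.concatenate([p[0] for p in profiles[:k]])
        y1 = np.concatenate([p[1] for p in profiles[:k]])
        x2 = np.concatenate([p[0] for p in profiles[k:]])
        y2 = np.concatenate([p[1] for p in profiles[k:]])
        stat = n * np.log(sse0 / (profile_sse(x1, y1) + profile_sse(x2, y2)))
        best = max(best, (stat, k), key=lambda t: t[0])
    return best

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 10)
profiles = [(x, 1 + 2 * x + rng.normal(0, 0.1, 10)) for _ in range(15)] + \
           [(x, 1 + 3 * x + rng.normal(0, 0.1, 10)) for _ in range(10)]   # slope shift after sample 15
print(change_point_scan(profiles))
```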


Journal of Quality Technology | 2003

The changepoint model for statistical process control

Douglas M. Hawkins; Peihua Qiu; Chang Wook Kang

Statistical process control (SPC) requires statistical methodologies that detect changes in the pattern of data over time. The common methodologies, such as Shewhart, cumulative sum (cusum), and exponentially weighted moving average (EWMA) charting, require the in-control values of the process parameters, but these are rarely known accurately. When estimated parameters are used, the run length behavior changes randomly from one realization to another, making it impossible to control the run length behavior of any particular chart. A suitable methodology for detecting and diagnosing step changes based on imperfect process knowledge is the unknown-parameter changepoint formulation. Long recognized as a Phase I analysis tool, it is, we argue, also highly effective in allowing the user to progress seamlessly from the start of Phase I data gathering through Phase II SPC monitoring. Despite not requiring specification of the post-change process parameter values, its performance is never far short of that of the optimal cusum chart, which requires this knowledge, and it is far superior for shifts away from the shift size for which the cusum chart is optimal. As another benefit, while changepoint methods are designed for step changes that persist, they are also competitive with the Shewhart chart, the chart of choice for isolated non-sustained special causes.
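A conceptual sketch of running such an unknown-parameter changepoint scheme as an online monitor: after each new observation, the maximal two-sample t statistic over all possible split points is recomputed and compared with a control limit. The fixed limit below is a placeholder; the published procedure uses limits calibrated to a specified in-control run length that vary with the current sample size.

```python
# Online monitoring with a max two-sample t statistic over all split points (illustrative).
import numpy as np

def max_two_sample_t(x):
    n = len(x)
    best = 0.0
    for k in range(2, n - 1):                   # need at least 2 points on each side
        a, b = x[:k], x[k:]
        sp2 = ((k - 1) * a.var(ddof=1) + (n - k - 1) * b.var(ddof=1)) / (n - 2)
        t = abs(a.mean() - b.mean()) / np.sqrt(sp2 * (1 / k + 1 / (n - k)) + 1e-12)
        best = max(best, t)
    return best

rng = np.random.default_rng(6)
stream = np.concatenate([rng.normal(0, 1, 30), rng.normal(1.2, 1, 30)])
threshold = 4.0                                 # illustrative placeholder, not a calibrated limit
for n in range(5, len(stream) + 1):             # monitor as observations accrue
    if max_two_sample_t(stream[:n]) > threshold:
        print(f"signal at observation {n}")
        break
```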


Computational Statistics & Data Analysis | 2001

Fitting multiple change-point models to data

Douglas M. Hawkins

Change-point problems arise when different subsequences of a data series follow different statistical distributions – commonly of the same functional form but having different parameters. This paper develops an exact approach for finding maximum likelihood estimates of the change points and within-segment parameters when the functional form is within the general exponential family. The algorithm, a dynamic program, has execution time only linear in the number of segments and quadratic in the number of potential change points. The details are worked out for the normal, gamma, Poisson and binomial distributions.
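A compact sketch of the dynamic-programming recursion for the normal-mean special case: with within-segment sums of squares as segment costs, the exact optimum over all placements of the change points is found with work linear in the number of segments and quadratic in the series length. This is an illustrative re-implementation under assumed Gaussian costs, not the paper's algorithm for the full exponential family.

```python
# Exact multiple change-point fit for the normal-mean case by dynamic programming.
import numpy as np

def best_segmentation(x, n_segments):
    n = len(x)
    s1 = np.concatenate([[0.0], np.cumsum(x)])
    s2 = np.concatenate([[0.0], np.cumsum(x ** 2)])

    def cost(i, j):                              # SSE of segment x[i:j] (j exclusive)
        m = j - i
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / m

    INF = float("inf")
    F = np.full((n_segments + 1, n + 1), INF)    # F[s][j]: best cost of first j points in s segments
    back = np.zeros((n_segments + 1, n + 1), dtype=int)
    F[0][0] = 0.0
    for s in range(1, n_segments + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):            # last segment is x[i:j]
                c = F[s - 1][i] + cost(i, j)
                if c < F[s][j]:
                    F[s][j], back[s][j] = c, i

    cuts, j = [], n                              # recover the change points
    for s in range(n_segments, 0, -1):
        j = back[s][j]
        cuts.append(j)
    return sorted(cuts)[1:]                      # drop the leading 0

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(0, 1, 40), rng.normal(3, 1, 30), rng.normal(-1, 1, 30)])
print(best_segmentation(x, 3))                   # expect change points near 40 and 70
```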

Collaboration


Top co-authors of Douglas M. Hawkins:

Binbin Yue, University of Maryland
Jessica J. Kraker, University of Wisconsin–Eau Claire
Panagiotis Tsiamyrtzis, Athens University of Economics and Business
David G. Hicks, University of Rochester Medical Center