Jeffrey S. Simonoff
New York University
Publications
Featured research published by Jeffrey S. Simonoff.
Journal of the American Statistical Association | 1993
Ali S. Hadi; Jeffrey S. Simonoff
Abstract We consider the problem of identifying and testing multiple outliers in linear models. The available outlier identification methods often do not succeed in detecting multiple outliers because they are affected by the observations they are supposed to identify. We introduce two test procedures for the detection of multiple outliers that appear to be less sensitive to this problem. Both procedures attempt to separate the data into a set of “clean” data points and a set of points that contain the potential outliers. The potential outliers are then tested to see how extreme they are relative to the clean subset, using an appropriately scaled version of the prediction error. The procedures are illustrated and compared to various existing methods, using several data sets known to contain multiple outliers. Also, the performances of both procedures are investigated by a Monte Carlo study. The data sets and the Monte Carlo study indicate that both procedures are effective in the detection of multiple outliers ...
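The core idea of testing suspect points against a "clean" subset with scaled prediction errors can be sketched roughly as follows. This is an illustrative simplification, not the authors' exact procedures: the simulated data, the preliminary clean/suspect split, and the two-sided t cutoff are all assumptions.

```python
# Sketch: test candidate outliers against a clean subset via scaled prediction errors.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[:3] += 5.0                      # plant three outliers

suspect = np.arange(3)            # pretend a preliminary screen flagged these points
clean = np.setdiff1d(np.arange(n), suspect)

# fit the model on the clean subset only
Xc, yc = X[clean], y[clean]
beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
resid = yc - Xc @ beta
df = len(clean) - X.shape[1]
s2 = resid @ resid / df
XtX_inv = np.linalg.inv(Xc.T @ Xc)

for i in suspect:
    xi = X[i]
    pred_err = y[i] - xi @ beta
    # variance of a prediction error for a point outside the fitting subset
    se = np.sqrt(s2 * (1.0 + xi @ XtX_inv @ xi))
    t = pred_err / se
    print(f"obs {i}: scaled prediction error {t:.2f}, "
          f"p = {2 * stats.t.sf(abs(t), df):.4f}")
```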
Journal of Machine Learning Research | 2003
Claudia Perlich; Foster Provost; Jeffrey S. Simonoff
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on class-membership probabilities. We use a learning-curve analysis to examine the relationship of these measures to the size of the training set. The results of the study show several things. (1) Contrary to some prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (that is, the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications. Finally, (4) the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of the separability of signal from noise.
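A learning-curve comparison of this kind can be reproduced in outline with scikit-learn. The synthetic dataset, training-set sizes, and model settings below are illustrative assumptions, not the domains or implementations used in the paper.

```python
# Sketch: accuracy and AUC of logistic regression vs. a tree as training size grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for n in [100, 500, 2000, 10000]:
    Xn, yn = X_pool[:n], y_pool[:n]
    for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                        ("tree", DecisionTreeClassifier(min_samples_leaf=5, random_state=0))]:
        model.fit(Xn, yn)
        acc = accuracy_score(y_test, model.predict(X_test))
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"n={n:6d}  {name:8s}  acc={acc:.3f}  auc={auc:.3f}")
```

Plotting these numbers against n gives the learning curves whose crossing behavior the paper emphasizes.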
Chance | 2000
Jeffrey S. Simonoff; Ilana R. Sparrow
Machine Learning | 2012
Rebecca J. Sela; Jeffrey S. Simonoff
Longitudinal data refer to the situation where repeated observations are available for each sampled object. Clustered data, where observations are nested in a hierarchical structure within objects (without time necessarily being involved), represent a similar type of situation. Methodologies that take this structure into account allow for the possibilities of systematic differences between objects that are not related to attributes and autocorrelation within objects across time periods. A standard methodology in the statistics literature for this type of data is the mixed effects model, where these differences between objects are represented by so-called “random effects” that are estimated from the data (population-level relationships are termed “fixed effects,” together resulting in a mixed effects model). This paper presents a methodology that combines the structure of mixed effects models for longitudinal and clustered data with the flexibility of tree-based estimation methods. We apply the resulting estimation method, called the RE-EM tree, to pricing in online transactions, showing that the RE-EM tree is less sensitive to parametric assumptions and provides improved predictive power compared to linear models with random effects and regression trees without random effects. We also apply it to a smaller data set examining accident fatalities, and show that the RE-EM tree strongly outperforms a tree without random effects while performing comparably to a linear model with random effects. We also perform extensive simulation experiments to show that the estimator improves predictive performance relative to regression trees without random effects and is comparable or superior to using linear models with random effects in more general situations.
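The alternating scheme behind the RE-EM tree (fit a tree with the current random effects removed, then re-estimate group-level random effects from the tree residuals) can be sketched in simplified form. The simulated grouped data, the fixed number of iterations, and the crude shrinkage constant are assumptions; the paper's estimator uses a proper linear mixed model step rather than shrunken group means.

```python
# Simplified sketch of the alternating idea behind the RE-EM tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n_groups, per_group = 30, 40
groups = np.repeat(np.arange(n_groups), per_group)
X = rng.normal(size=(n_groups * per_group, 3))
true_b = rng.normal(scale=2.0, size=n_groups)
y = np.where(X[:, 0] > 0, 3.0, -3.0) + true_b[groups] + rng.normal(size=len(groups))

b = np.zeros(n_groups)                       # random-intercept estimates
tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
for _ in range(10):                          # alternate until (roughly) stable
    tree.fit(X, y - b[groups])               # tree on the response with effects removed
    resid = y - tree.predict(X)
    group_means = np.array([resid[groups == g].mean() for g in range(n_groups)])
    b = group_means * 0.9                    # crude shrinkage toward zero (assumed constant)

print("corr(true, estimated random effects):",
      np.corrcoef(true_b, b)[0, 1].round(3))
```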
Biometrics | 1986
Jeffrey S. Simonoff; Yosef Hochberg; Benjamin Reiser
SUMMARY Consider two independent random variables X and Y. The functional R = Pr(X < Y) [or Δ = Pr(X < Y) − Pr(Y < X)] is of practical importance in many situations, including clinical trials, genetics, and reliability. In this paper several approaches to estimation of Δ when X and Y are presented in discretized (categorical) form are analyzed and compared. Asymptotic formulas for the variances of the estimators are derived; use of the bootstrap to estimate variances is also discussed. Computer simulations indicate that the choice of the best estimator depends on the value of Δ, the underlying distribution, and the sparseness of the data. It is shown that the bootstrap provides a robust estimate of variance. Several examples are treated.
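A minimal sketch of estimating Pr(X < Y) from categorized data, with a bootstrap standard error, might look like the following. The category counts are made up, and the even split of ties is just one simple convention rather than one of the refined estimators compared in the paper.

```python
# Sketch: plug-in estimate of Pr(X < Y) from category counts, with bootstrap SE.
import numpy as np

rng = np.random.default_rng(2)
# observed counts for X and Y over the same ordered categories (assumed data)
x_counts = np.array([20, 35, 30, 15])
y_counts = np.array([10, 20, 40, 30])

def prob_x_less_y(xc, yc):
    px, py = xc / xc.sum(), yc / yc.sum()
    less = sum(px[i] * py[j] for i in range(len(px)) for j in range(len(py)) if i < j)
    ties = np.sum(px * py)            # probability X and Y fall in the same cell
    return less + 0.5 * ties          # split ties evenly (a simple convention)

est = prob_x_less_y(x_counts, y_counts)

# bootstrap: resample category counts from the fitted multinomials
boot = []
for _ in range(2000):
    xb = rng.multinomial(x_counts.sum(), x_counts / x_counts.sum())
    yb = rng.multinomial(y_counts.sum(), y_counts / y_counts.sum())
    boot.append(prob_x_less_y(xb, yb))

print(f"estimate {est:.3f}, bootstrap SE {np.std(boot, ddof=1):.3f}")
```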
Journal of Statistical Planning and Inference | 1995
Jeffrey S. Simonoff
Abstract Statistical analysis of categorical data (contingency tables) has a long history, and a good deal of work has been done formulating parametric models for such data. Unfortunately, such analyses are often not appropriate, due to sparseness of the table. An alternative to these parametric models is smoothing the table, by ‘borrowing’ information from neighboring cells. In this paper, various strategies that have been proposed for such smoothing are discussed. It is shown that these strategies have close ties to other areas of statistical methodology, including shrinkage estimation, Bayes methods, penalized likelihood, spline estimation, and kernel density and regression estimation. Probability estimates based on smoothing methods can outperform the unsmoothed frequency estimates when the table is sparse (often, dramatically so). Methods for one-dimensional tables are discussed, as well as generalizations to higher-dimensional tables. Attempts to use smoothed probability estimates in statistical functionals are identified. Finally, potential future work in categorical data smoothing is also mentioned.
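The idea of borrowing information from neighboring cells can be illustrated with a simple discrete kernel smoother applied to a sparse one-dimensional table. The counts, kernel weights, and boundary handling below are assumptions for illustration, not the paper's recommended methods.

```python
# Sketch: smooth a sparse frequency table by averaging each cell with its neighbors.
import numpy as np

counts = np.array([0, 3, 0, 1, 0, 0, 5, 2, 0, 1])   # sparse table (assumed data)
n, k = counts.sum(), len(counts)
raw = counts / n                                     # unsmoothed frequency estimates

weights = np.array([0.25, 0.5, 0.25])                # simple local kernel (assumed)
padded = np.pad(raw, 1, mode="edge")                 # crude boundary handling
smoothed = np.array([weights @ padded[i:i + 3] for i in range(k)])
smoothed /= smoothed.sum()                           # renormalize to a probability vector

for i, (r, s) in enumerate(zip(raw, smoothed)):
    print(f"cell {i}: raw {r:.3f}  smoothed {s:.3f}")
```

The smoothed estimates put some probability on the empty cells, which is exactly the behavior that helps in sparse tables.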
International Journal of Critical Infrastructure Protection | 2009
Carlos E. Restrepo; Jeffrey S. Simonoff; Rae Zimmerman
Abstract In this paper, the causes and consequences of accidents in US hazardous liquid pipelines that result in the unplanned release of hazardous liquids are examined. Understanding how different causes of accidents are associated with consequence measures can provide important inputs into risk management for this and other critical infrastructure systems. Data on 1582 accidents related to hazardous liquid pipelines for the period 2002–2005 are analyzed. The data were obtained from the US Department of Transportation’s Office of Pipeline Safety (OPS). Of the 25 different causes of accidents included in the data, the most common ones are equipment malfunction, corrosion, material and weld failures, and incorrect operation. This paper focuses on one type of consequence, the various costs associated with these pipeline accidents, and the causes associated with them. The following economic consequence measures related to accident cost are examined: the value of the product lost; public, private, and operator property damage; and cleanup, recovery, and other costs. Logistic regression modeling is used to determine what factors are associated with nonzero product loss cost, nonzero property damage cost, and nonzero cleanup and recovery costs. The factors examined include the system part involved in the accident, location characteristics (offshore versus onshore location, occurrence in a high consequence area), and whether there was liquid ignition, an explosion, and/or a liquid spill. For the accidents associated with nonzero values for these consequence measures, (weighted) least squares regression is used to understand the factors related to them, as well as how the different initiating causes of the accidents are associated with the consequence measures. The results of these models are then used to construct illustrative scenarios for hazardous liquid pipeline accidents. These scenarios suggest that the magnitudes of consequence measures such as the value of product lost, property damage, and cleanup and recovery costs are highly dependent on accident cause and other accident characteristics. The regression models used to construct these scenarios constitute an analytical tool that industry decision-makers can use to estimate the possible consequences of accidents in these pipeline systems by cause (and other characteristics), to allocate resources for maintenance, and to reduce risk factors in these systems.
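The two-stage modeling structure described here (a logistic regression for whether a cost is nonzero, followed by a regression on cost for the accidents with nonzero cost) can be sketched with statsmodels on synthetic data. The variable names, effect sizes, and the unweighted second-stage regression are placeholders, not the OPS accident data or the paper's fitted models.

```python
# Sketch: model the probability of a nonzero cost, then the size of nonzero costs.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({
    "cause": rng.choice(["corrosion", "equipment", "weld", "operation"], n),
    "offshore": rng.integers(0, 2, n),
    "ignition": rng.integers(0, 2, n),
})
logit_p = -1.0 + 0.8 * df["ignition"] + 0.5 * (df["cause"] == "corrosion")
nonzero = rng.random(n) < 1 / (1 + np.exp(-logit_p))
df["nonzero_cost"] = nonzero.astype(int)
df["log_cost"] = np.where(nonzero, 8 + df["ignition"] + rng.normal(size=n), np.nan)

# stage 1: which accidents have any cost at all?
stage1 = smf.logit("nonzero_cost ~ cause + offshore + ignition", data=df).fit(disp=0)
# stage 2: how large is the (log) cost, given that it is nonzero?
stage2 = smf.ols("log_cost ~ cause + offshore + ignition",
                 data=df[df["nonzero_cost"] == 1]).fit()
print(stage1.params.round(2))
print(stage2.params.round(2))
```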
Journal of Marketing Education | 1999
Priscilla A. Labarbera; Jeffrey S. Simonoff
This article reports the findings of a survey of undergraduate students designed to examine the key factors involved in selecting a marketing major. A discussion follows, dealing with the initiatives undertaken by marketing departments at various universities in an attempt to enhance the quality and quantity of marketing majors.
Biometrics | 1992
Glenn Heller; Jeffrey S. Simonoff
Although the analysis of censored survival data using the proportional hazards and linear regression models is common, there has been little work examining the ability of these estimators to predict time to failure. This is unfortunate, since a predictive plot illustrating the relationship between time to failure and a continuous covariate can be far more informative regarding the risk associated with the covariate than a Kaplan-Meier plot obtained by discretizing the variable. In this paper the predictive power of the Cox (1972, Journal of the Royal Statistical Society, Series B 34, 187-202) proportional hazards estimator and the Buckley-James (1979, Biometrika 66, 429-436) censored regression estimator is compared. Using computer simulations and heuristic arguments, it is shown that the choice of method depends on the censoring proportion, the strength of the regression, the form of the censoring distribution, and the form of the failure distribution. Several examples are provided to illustrate the usefulness of the methods.
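A predicted time to failure as a function of a continuous covariate can be obtained from a fitted Cox model, for example with the lifelines package. The simulated data below are an assumption, and the Buckley-James estimator is not sketched here.

```python
# Sketch: predicted median failure time from a Cox model, as a function of a covariate.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
true_t = rng.exponential(scale=np.exp(1.0 - 0.7 * x))   # failure times depend on x
censor = rng.exponential(scale=3.0, size=n)
df = pd.DataFrame({"x": x,
                   "time": np.minimum(true_t, censor),
                   "event": (true_t <= censor).astype(int)})

cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
grid = pd.DataFrame({"x": np.linspace(-2, 2, 5)})
print(cph.predict_median(grid))    # predicted median failure time versus x
```

Plotting these predictions against x gives the kind of predictive plot the abstract argues is more informative than a discretized Kaplan-Meier display.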
Applied Statistics | 1994
Jeffrey S. Simonoff; Chih-Ling Tsai
SUMMARY Non-constant variance (heteroscedasticity) is common in regression data, and many tests have been proposed for detecting it. This paper shows that the properties of likelihood-based tests can be improved by using the modified profile likelihood of Cox and Reid. A modified likelihood ratio test and modified score tests are derived, and both theoretical and intuitive justifications are given for the improved properties of the tests. The results of a Monte Carlo study show that, whereas the ordinary likelihood ratio test can be very anticonservative, the modified test holds its null size well and is more powerful than the other tests. For non-normal error distributions, Studentized tests hold their size well (without being overconservative), even for long-tailed error distributions. Under short-tailed error distributions, likelihood ratio or Studentized score tests are most powerful, depending on the degree of heteroscedasticity. The modified versions of the score tests consistently outperform the unmodified versions. The use of these tests is demonstrated through analysis of data on the volatility of stock prices.
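For comparison, a standard score-type test for heteroscedasticity (the Breusch-Pagan test as implemented in statsmodels) can be run as follows. This is not the modified profile likelihood test of the paper, which is not available off the shelf, and the simulated data are an assumption.

```python
# Sketch: a standard score (LM) test for heteroscedasticity on simulated data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(1, 5, n)
y = 2 + 3 * x + rng.normal(scale=0.5 * x, size=n)   # error variance grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, _, _ = het_breuschpagan(fit.resid, X)
print(f"score (LM) statistic {lm_stat:.2f}, p-value {lm_pvalue:.4f}")
```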