Publication


Featured research published by Wei-Yin Loh.


Machine Learning | 2000

A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms

Tjen-Sien Lim; Wei-Yin Loh; Yu-shan Shih

Twenty-two decision tree, nine statistical, and two neural network algorithms are compared on thirty-two datasets in terms of classification accuracy, training time, and (in the case of trees) number of leaves. Classification accuracy is measured by mean error rate and mean rank of error rate. Both criteria place a statistical, spline-based algorithm called POLYCLASS at the top, although it is not statistically significantly different from twenty other algorithms. Another statistical algorithm, logistic regression, is second with respect to the two accuracy criteria. The most accurate decision tree algorithm is QUEST with linear splits, which ranks fourth and fifth, respectively. Although spline-based statistical algorithms tend to have good accuracy, they also require relatively long training times. POLYCLASS, for example, is third from last in terms of median training time. It often requires hours of training compared to seconds for other algorithms. The QUEST and logistic regression algorithms are substantially faster. Among decision tree algorithms with univariate splits, C4.5, IND-CART, and QUEST have the best combinations of error rate and speed. But C4.5 tends to produce trees with twice as many leaves as those from IND-CART and QUEST.
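
The two accuracy criteria used in the comparison can be sketched in a few lines. This is a toy illustration with made-up error rates, not the paper's data; the helper names `mean_error_rate` and `mean_ranks` are hypothetical.

```python
# Toy illustration of the two accuracy criteria: mean error rate and
# mean rank of error rate across datasets. Error rates are invented.

def mean_error_rate(errors):
    """Average error rate of one algorithm over all datasets."""
    return sum(errors) / len(errors)

def mean_ranks(error_table):
    """error_table maps algorithm name -> list of error rates, one per
    dataset. Returns each algorithm's mean rank (1 = best). Tied ranks
    are not averaged, for simplicity."""
    algos = list(error_table)
    n_datasets = len(next(iter(error_table.values())))
    totals = {a: 0 for a in algos}
    for d in range(n_datasets):
        # Rank the algorithms by their error on dataset d.
        order = sorted(algos, key=lambda a: error_table[a][d])
        for rank, a in enumerate(order, start=1):
            totals[a] += rank
    return {a: totals[a] / n_datasets for a in algos}

table = {"A": [0.10, 0.20, 0.15], "B": [0.12, 0.18, 0.30]}
ranks = mean_ranks(table)  # A is best on datasets 1 and 3, B on dataset 2
```

An algorithm can win on one criterion and lose on the other, which is why the paper reports both.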


Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery | 2011

Classification and regression trees

Wei-Yin Loh

Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples.
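
The recursive-partitioning idea described above can be sketched as a toy one-dimensional regression tree. The exhaustive squared-error split search below is a generic CART-style device for illustration, not the specific algorithm of any one paper; `grow` and `predict` are hypothetical names.

```python
# Minimal sketch of recursive partitioning for a regression tree:
# split the data space recursively and fit a simple model (here, the
# mean of the response) in each partition.

def sse(ys):
    """Sum of squared errors around the mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def grow(xs, ys, min_leaf=2):
    """Return a nested-dict tree for a 1-D feature xs and response ys."""
    if len(ys) < 2 * min_leaf:
        return {"leaf": sum(ys) / len(ys)}
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, t)
    if best is None:
        return {"leaf": sum(ys) / len(ys)}
    t = best[1]
    L = [(x, y) for x, y in zip(xs, ys) if x < t]
    R = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return {"split": t,
            "left": grow([x for x, _ in L], [y for _, y in L], min_leaf),
            "right": grow([x for x, _ in R], [y for _, y in R], min_leaf)}

def predict(tree, x):
    """Walk the tree to the partition containing x; return its mean."""
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["split"] else tree["right"]
    return tree["leaf"]
```

For a classification tree, the leaf model would be a majority class and the split criterion a misclassification-cost measure instead of squared error.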


International Conference on Management of Data | 1999

BOAT—optimistic decision tree construction

Johannes Gehrke; Venkatesh Ganti; Raghu Ramakrishnan; Wei-Yin Loh

Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. A very popular class of classifiers is decision trees. All current algorithms to construct decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree. We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any difference with respect to the “real” tree (i.e., the tree that would be constructed by examining all the data in a traditional way) is detected and corrected. The correction step occasionally requires additional scans over subsets of the data; in practice this situation rarely arises and can be handled with little added cost. Beyond offering faster tree construction, BOAT is the first scalable algorithm able to incrementally update the tree with respect to both insertions and deletions over the dataset. This property is valuable in dynamic environments such as data warehouses, in which the training dataset changes over time. The BOAT update operation is much cheaper than completely rebuilding the tree, and the resulting tree is guaranteed to be identical to the tree that would be produced by a complete rebuild.
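
The optimistic guess-then-verify idea can be sketched in heavily simplified form, reduced to a single split. The real algorithm works on whole subtrees and uses bootstrapped confidence intervals on split points rather than the blunt full-data check below; `boat_one_split` and its sampling fraction are illustrative assumptions.

```python
# Heavily simplified sketch of BOAT's optimistic construction, for one
# split: choose a split from a small sample, then verify it against the
# full data and correct it if the full-data best split differs.
import random

def best_split(xs, ys):
    """Best threshold minimizing misclassification for binary labels
    (predict 1 when x >= t, allowing either orientation)."""
    best_t, best_errs = None, float("inf")
    for t in sorted(set(xs))[1:]:
        errs = sum(1 for x, y in zip(xs, ys) if (x >= t) != y)
        errs = min(errs, len(ys) - errs)  # flipped orientation
        if errs < best_errs:
            best_t, best_errs = t, errs
    return best_t

def boat_one_split(xs, ys, sample_frac=0.1, seed=0):
    """Optimistically pick a split on a sample, then verify on all data."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(xs)), max(2, int(sample_frac * len(xs))))
    guess = best_split([xs[i] for i in idx], [ys[i] for i in idx])
    true = best_split(xs, ys)  # verification scan over the full data
    return true if guess != true else guess
```

In BOAT itself the verification scan is cheap because it only has to confirm (or narrow down) the sampled tree's splits, rather than re-evaluate every candidate.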


Journal of the American Statistical Association | 2001

Classification Trees With Unbiased Multiway Splits

Hyunjoong Kim; Wei-Yin Loh

Two univariate split methods and one linear combination split method are proposed for the construction of classification trees with multiway splits. Examples are given where the trees are more compact and hence easier to interpret than binary trees. A major strength of the univariate split methods is that they have negligible bias in variable selection, both when the variables differ in the number of splits they offer and when they differ in the number of missing values. This is an advantage because inferences from the tree structures can be adversely affected by selection bias. The new methods are shown to be highly competitive in terms of computational speed and classification accuracy of future observations.


Journal of the American Statistical Association | 1988

Tree-Structured Classification via Generalized Discriminant Analysis

Wei-Yin Loh; Nunta Vanichsetakul

The problem of constructing classification rules that can be represented as decision trees is considered. Each object to be classified has an associated x vector containing possibly incomplete covariate information. Tree construction is based on the information provided in a “learning sample” of objects with known class identities. The x vectors in the learning sample may have missing values as well. Procedures are proposed for each of the components of classifier construction, such as split selection, tree-size determination, treatment of missing values, and ranking of variables. The main idea is recursive application of linear discriminant analysis, with the variables at each stage being appropriately chosen according to the data and the type of splits desired. Standard statistical techniques used as basic building blocks include analysis of variance, linear and canonical discriminant analysis, and principal component analysis. A new method of tree-structured classification is obtained by assem...


Journal of the American Statistical Association | 1987

Calibrating Confidence Coefficients

Wei-Yin Loh

Two approaches for dealing with the problem of poor coverage probabilities of certain standard confidence intervals are proposed. The first is a recommendation that the actual coverage be estimated directly from the data and its value reported in addition to the nominal level. This is achieved through a combination of computer simulation and density estimation. The asymptotic validity of the procedure is proved for a number of common situations. A classical example is the nonparametric estimation of the variance of a population using the normal-theory interval. Here it is shown that the estimated coverage probability consistently estimates the true coverage probability if the population distribution possesses a finite sixth moment. The second approach is more traditional. It is a procedure for modifying an interval to yield improved coverage properties. Given a confidence interval, its estimated coverage probability obtained in the first approach is used to alter the nominal level of the interval...
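
The variance example can be reproduced in miniature by simulation: generate samples from a skewed population and count how often the nominal-95% normal-theory interval covers the true variance. The exponential population, sample size, and replication count below are illustrative choices, not the paper's setup.

```python
# Sketch of estimating actual coverage by simulation: the normal-theory
# chi-square interval for a variance, applied to skewed (exponential) data.
import random

def coverage_normal_theory_var(n=10, reps=2000, seed=1):
    """Estimated coverage of the nominal-95% normal-theory interval for
    a population variance when the data are Exp(1) (true variance 1)."""
    # Chi-square 2.5% / 97.5% quantiles, hard-coded for df = n - 1 = 9.
    lo_q, hi_q = 2.700, 19.023
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.expovariate(1.0) for _ in range(n)]
        m = sum(x) / n
        s2 = sum((v - m) ** 2 for v in x) / (n - 1)
        # Interval ((n-1)s^2/chi2_hi, (n-1)s^2/chi2_lo); exact under normality.
        lo = (n - 1) * s2 / hi_q
        hi = (n - 1) * s2 / lo_q
        hits += lo <= 1.0 <= hi
    return hits / reps

cov = coverage_normal_theory_var()  # typically well below the nominal 0.95
```

Reporting this estimated coverage alongside the nominal 95% level is exactly the first approach; the second approach would then widen the nominal level until the estimated coverage reaches the target.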


Nicotine & Tobacco Research | 2010

Gender, race, and education differences in abstinence rates among participants in two randomized smoking cessation trials

Megan E. Piper; Jessica W. Cook; Tanya R. Schlam; Douglas E. Jorenby; Stevens S. Smith; Daniel M. Bolt; Wei-Yin Loh

INTRODUCTION: Smoking is the leading preventable cause of morbidity and mortality in the United States, but this burden is not distributed equally among smokers. Women, Blacks, and people with low socioeconomic status are especially vulnerable to the health risks of smoking and are less likely to quit.

METHODS: This research examined cessation rates and treatment response among 2,850 participants (57.2% women, 11.7% Blacks, and 9.0% with less than a high school education) from two large cessation trials evaluating nicotine patch, nicotine lozenge, bupropion, bupropion + lozenge, and nicotine patch + lozenge.

RESULTS: Results revealed that women, Blacks, and smokers with less education were less likely to quit smoking successfully than men, Whites, and smokers with more education, respectively. Women did not appear to benefit more from bupropion than from nicotine replacement therapy, but women and smokers with less education benefited more from combination pharmacotherapy than from monotherapy.

DISCUSSION: Women, Blacks, and smokers with less education are at elevated risk for cessation failure, and research is needed to understand this risk and develop pharmacological and psychosocial interventions to improve their long-term cessation rates.


Computational Statistics & Data Analysis | 1996

A comparison of tests of equality of variances

Tjen-Sien Lim; Wei-Yin Loh

Seven tests of equality of variances are compared in terms of robustness and power in a simulation experiment with small-to-moderate sample sizes. The data are assumed to come from a location-scale family with unknown means, variances, and density functions. The tests considered are the Levene test, the Bartlett test with and without kurtosis adjustment, the Box-Andersen test, and three jackknife tests. The bootstrap versions of these tests are also compared. It is found that the Levene test and one of the jackknife tests, as well as the bootstrap versions of the Levene test, the Bartlett test with kurtosis adjustment, and two jackknife tests, are robust. Among these, the bootstrap version of the Levene test tends to have the highest power.
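
The Levene statistic, one of the seven tests compared, is simple enough to sketch: take absolute deviations from each group's mean and feed them into a one-way ANOVA F statistic. The sample data below are illustrative, not from the paper.

```python
# Sketch of the Levene test statistic: a one-way ANOVA F computed on
# absolute deviations of each observation from its group mean.

def levene_statistic(groups):
    """groups: list of samples (lists of numbers). Returns Levene's W."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    # Absolute deviations from each group's mean.
    z = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    zbar_i = [sum(zi) / len(zi) for zi in z]          # per-group means of z
    zbar = sum(sum(zi) for zi in z) / n               # grand mean of z
    between = sum(len(g) * (zb - zbar) ** 2
                  for g, zb in zip(groups, zbar_i))
    within = sum((x - zb) ** 2
                 for zi, zb in zip(z, zbar_i) for x in zi)
    return ((n - k) / (k - 1)) * between / within
```

Large W (referred to an F distribution with k-1 and n-k degrees of freedom) signals unequal variances; a bootstrap version would recompute W on resampled, variance-aligned groups to calibrate its null distribution rather than relying on the F approximation.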


Journal of Computational and Graphical Statistics | 2004

LOTUS: An Algorithm for Building Accurate and Comprehensible Logistic Regression Trees

Kin-Yee Chan; Wei-Yin Loh

Logistic regression is a powerful technique for fitting models to data with a binary response variable, but the models are difficult to interpret if collinearity, nonlinearity, or interactions are present. Moreover, it is hard to judge model adequacy because there are few diagnostics for choosing variable transformations and no true goodness-of-fit test. To overcome these problems, this article proposes to fit a piecewise (multiple or simple) linear logistic regression model by recursively partitioning the data and fitting a different logistic regression in each partition. This allows nonlinear features of the data to be modeled without requiring variable transformations. The binary tree that results from the partitioning process is pruned to minimize a cross-validation estimate of the predicted deviance. This obviates the need for a formal goodness-of-fit test. The resulting model is especially easy to interpret if a simple linear logistic regression is fitted to each partition, because the tree structure and the set of graphs of the fitted functions in the partitions comprise a complete visual description of the model. Trend-adjusted chi-square tests are used to control bias in variable selection at the intermediate nodes. This protects the integrity of inferences drawn from the tree structure. The method is compared with standard stepwise logistic regression on 30 real datasets, with several containing tens to hundreds of thousands of observations. Averaged across the datasets, the results show that the method reduces predicted mean deviance by 9% to 16%. We use an example from the Dutch insurance industry to demonstrate how the method can identify and produce an intelligible profile of prospective customers.


Journal of Construction Engineering and Management-ASCE | 2013

Quantifying Performance for the Integrated Project Delivery System as Compared to Established Delivery Systems

Mounir El Asmar; Awad S. Hanna; Wei-Yin Loh

Integrated project delivery (IPD) is an emerging construction project delivery system that collaboratively involves key participants very early in the project timeline, often before the design is started. It is distinguished by a multiparty contractual agreement that typically allows risks and rewards to be shared among project stakeholders. Because IPD is becoming increasingly popular, various organizations are expressing interest in its benefits to the architecture/engineering/construction (AEC) industry. However, no research studies have shown statistically significant performance differences between IPD and more established delivery systems. This study fills that gap by evaluating the performance of IPD projects compared to projects delivered using the more traditional design-bid-build, design-build, and construction management at-risk systems, and showing statistically significant improvements for IPD. Relevant literature was analyzed, and a data collection instrument was developed an...

Collaboration


Dive into Wei-Yin Loh's collaboration.

Top Co-Authors

- Megan E. Piper (University of Wisconsin-Madison)
- Stevens S. Smith (University of Wisconsin-Madison)
- Timothy B. Baker (University of Wisconsin-Madison)
- Daniel M. Bolt (University of Wisconsin-Madison)
- Jessica W. Cook (University of Wisconsin-Madison)
- Tanya R. Schlam (University of Wisconsin-Madison)
- Michael C. Fiore (University of Wisconsin-Madison)
- Douglas E. Jorenby (University of Wisconsin-Madison)
- Linda M. Collins (Pennsylvania State University)
- Robin J. Mermelstein (University of Illinois at Chicago)