Chris Fraley
University of Washington
Publications
Featured research published by Chris Fraley.
Journal of the American Statistical Association | 2002
Chris Fraley; Adrian E. Raftery
Cluster analysis is the automated search for groups of related observations in a dataset. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as how many clusters there are, which clustering method should be used, and how outliers should be handled. We review a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this approach can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology and discuss recent developments in model-based clustering for non-Gaussian data, high-dimensional datasets, large datasets, and Bayesian estimation.
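As a minimal sketch of the workflow this abstract describes, the mclust R package (which grew out of this line of work) fits Gaussian mixtures over a range of cluster counts and lets BIC choose among them; the simulated data below are purely illustrative.

```r
library(mclust)

set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))

## Fit mixtures with 1-9 components across the covariance
## parameterizations mclust supports; BIC selects both the number
## of clusters and the covariance model.
fit <- Mclust(x, G = 1:9)
summary(fit)   # chosen model, number of clusters, BIC
head(fit$z)    # posterior membership probabilities per observation
```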
The Computer Journal | 1998
Chris Fraley; Adrian E. Raftery
We consider the problem of determining the structure of clustered data, without prior knowledge of the number of clusters or any other information about their composition. Data are represented by a mixture model in which each component corresponds to a different cluster. Models with varying geometric properties are obtained through Gaussian components with different parametrizations and cross-cluster constraints. Noise and outliers can be modelled by adding a Poisson process component. Partitions are determined by the expectation-maximization (EM) algorithm for maximum likelihood, with initial values from agglomerative hierarchical clustering. Models are compared using an approximation to the Bayes factor based on the Bayesian information criterion (BIC); unlike significance tests, this allows comparison of more than two models at the same time, and removes the restriction that the models compared be nested. The problems of determining the number of clusters and the clustering method are solved simultaneously by choosing the best model. Moreover, the EM result provides a measure of uncertainty about the associated classification of each data point. Examples are given, showing that this approach can give performance that is much better than standard procedures, which often fail to identify groups that are either overlapping or of varying sizes and shapes.
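A hand-rolled sketch (not the authors' code) of the machinery described here, for a univariate two-component Gaussian mixture: EM for maximum likelihood, with BIC computed as 2 log L - k log n so that larger is better. The crude sorted-data split stands in for the agglomerative hierarchical initialization.

```r
em_gmm2 <- function(x, iter = 200) {
  n <- length(x)
  ## crude initialization from a split of the sorted data
  o  <- sort(x)
  mu <- c(mean(o[1:(n %/% 2)]), mean(o[(n %/% 2 + 1):n]))
  s  <- rep(sd(x), 2)
  p  <- c(0.5, 0.5)
  for (it in 1:iter) {
    ## E-step: posterior membership probabilities
    d  <- cbind(p[1] * dnorm(x, mu[1], s[1]),
                p[2] * dnorm(x, mu[2], s[2]))
    z  <- d / rowSums(d)
    ## M-step: weighted maximum-likelihood updates
    nk <- colSums(z)
    p  <- nk / n
    mu <- colSums(z * x) / nk
    s  <- sqrt(colSums(z * (x - rep(mu, each = n))^2) / nk)
  }
  d  <- cbind(p[1] * dnorm(x, mu[1], s[1]),
              p[2] * dnorm(x, mu[2], s[2]))
  ll <- sum(log(rowSums(d)))
  ## 5 free parameters: two means, two sds, one mixing proportion
  list(mean = mu, sd = s, prop = p, loglik = ll,
       bic = 2 * ll - 5 * log(n))
}

set.seed(1)
y <- c(rnorm(150, 0, 1), rnorm(150, 4, 0.5))
em_gmm2(y)$bic   # compare against the BIC of competing models
```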
Journal of Classification | 2007
Chris Fraley; Adrian E. Raftery
Normal mixture models are widely used for statistical modeling of data, including cluster analysis. However, maximum likelihood estimation (MLE) for normal mixtures using the EM algorithm may fail as the result of singularities or degeneracies. To avoid this, we propose replacing the MLE by a maximum a posteriori (MAP) estimator, also found by the EM algorithm. For choosing the number of components and the model parameterization, we propose a modified version of BIC, where the likelihood is evaluated at the MAP instead of the MLE. We use a highly dispersed proper conjugate prior, containing a small fraction of one observation's worth of information. The resulting method avoids degeneracies and singularities, but when these are not present it gives similar results to the standard method using MLE, EM and BIC.
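In current versions of the mclust R package, this MAP approach is exposed through priorControl(); a minimal sketch on simulated data with a near-degenerate second group, where the conjugate prior guards the variance estimates:

```r
library(mclust)

set.seed(2)
## five nearly identical points form a cluster whose variance
## estimate would collapse toward zero under plain MLE
x <- c(rnorm(100), rep(3, 5) + rnorm(5, sd = 1e-4))

fitMLE <- Mclust(x, G = 1:4)                          # standard MLE/EM/BIC
fitMAP <- Mclust(x, G = 1:4, prior = priorControl())  # MAP with default conjugate prior
summary(fitMAP)
```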
Monthly Weather Review | 2007
J. Mc Lean Sloughter; Adrian E. Raftery; Tilmann Gneiting; Chris Fraley
Bayesian model averaging (BMA) is a statistical way of postprocessing forecast ensembles to create predictive probability density functions (PDFs) for weather quantities. It represents the predictive PDF as a weighted average of PDFs centered on the individual bias-corrected forecasts, where the weights are posterior probabilities of the models generating the forecasts and reflect the forecasts’ relative contributions to predictive skill over a training period. It was developed initially for quantities whose PDFs can be approximated by normal distributions, such as temperature and sea level pressure. BMA does not apply in its original form to precipitation, because the predictive PDF of precipitation is nonnormal in two major ways: it has a positive probability of being equal to zero, and it is skewed. In this study BMA is extended to probabilistic quantitative precipitation forecasting. The predictive PDF corresponding to one ensemble member is a mixture of a discrete component at zero and a gamma distribution.
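A hand-rolled sketch (not the published ensembleBMA code) of the per-member predictive distribution described here: a point mass at zero mixed with a gamma distribution for positive amounts. For simplicity the gamma is applied on the raw scale rather than to a power transformation of the amounts, and all weights and parameters below are illustrative stand-ins for values fit on training data.

```r
## CDF for one member: P(Y = 0) = p0, gamma tail above zero
member_cdf <- function(y, p0, shape, rate) {
  if (y < 0) return(rep(0, length(p0)))
  p0 + (1 - p0) * pgamma(y, shape = shape, rate = rate)
}

## BMA predictive CDF: weighted average over the K members
bma_cdf <- function(y, w, p0, shape, rate) {
  sapply(y, function(yi) sum(w * member_cdf(yi, p0, shape, rate)))
}

w     <- c(0.5, 0.3, 0.2)   # BMA weights (illustrative)
p0    <- c(0.3, 0.5, 0.4)   # per-member probability of zero precipitation
shape <- c(1.2, 0.9, 1.1)   # per-member gamma parameters
rate  <- c(0.5, 0.4, 0.6)

bma_cdf(0, w, p0, shape, rate)       # probability of no precipitation
1 - bma_cdf(5, w, p0, shape, rate)   # probability of exceeding 5 units
```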
SIAM Journal on Scientific Computing | 1998
Chris Fraley
Agglomerative hierarchical clustering methods based on Gaussian probability models have recently shown promise in a variety of applications. In this approach, a maximum-likelihood pair of clusters is chosen for merging at each stage. Unlike classical methods, model-based methods reduce to a recurrence relation only in the simplest case, which corresponds to the classical sum of squares method. We show how the structure of the Gaussian model can be exploited to yield efficient algorithms for agglomerative hierarchical clustering.
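The mclust R package exposes these model-based agglomerative methods; a minimal sketch, assuming mclust is installed, with merges chosen under the unconstrained ("VVV") Gaussian model:

```r
library(mclust)

set.seed(3)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

h  <- hc(x, modelName = "VVV")  # merge the maximum-likelihood pair at each stage
cl <- hclass(h, G = 2)          # cut the resulting tree at 2 groups
table(cl)
```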
Journal of Classification | 2003
Chris Fraley; Adrian E. Raftery
MCLUST is a software package for model-based clustering, density estimation and discriminant analysis interfaced to the S-PLUS commercial software and the R language. It implements parameterized Gaussian hierarchical clustering algorithms and the EM algorithm for parameterized Gaussian mixture models with the possible addition of a Poisson noise term. Also included are functions that combine hierarchical clustering, EM and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation, and discriminant analysis. MCLUST provides functionality for displaying and visualizing clustering and classification results. A web page with related links can be found at http://www.stat.washington.edu/mclust.
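A short tour of the three kinds of functionality the abstract lists, using the current mclust R package and the built-in iris data; the function names are from mclust's documented interface:

```r
library(mclust)

X <- iris[, 1:4]

fit <- Mclust(X)        # clustering: hierarchical init, EM, BIC model choice
summary(fit)

dens <- densityMclust(X)                   # mixture-based density estimation

da <- MclustDA(X, class = iris$Species)    # discriminant analysis
summary(da)
```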
Pattern Recognition Letters | 1997
Jonathan G. Campbell; Chris Fraley; Fionn Murtagh; Adrian E. Raftery
We combine image-processing techniques with a powerful new statistical technique to detect linear pattern production faults in woven textiles. Our approach detects a linear pattern in preprocessed images via model-based clustering. It employs an approximate Bayes factor which provides a criterion for assessing the evidence for the presence of a defect. The model used in experimentation is a (possibly highly elliptical) Gaussian cloud superimposed on Poisson clutter. Results are shown for some representative examples, and contrasted with a Hough transform. Software for the statistical modeling is available.
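A hedged sketch of the paper's idea using mclust's noise model rather than the authors' original software: compare "clutter only" (zero clusters plus noise) against "one elongated Gaussian cluster plus clutter," with BIC standing in for the approximate Bayes factor. The simulated "image" and the crude random noise initialization are illustrative assumptions; a real analysis would initialize the noise with a principled denoising step.

```r
library(mclust)

set.seed(4)
clutter <- cbind(runif(300), runif(300))                 # Poisson clutter
defect  <- cbind(runif(60), 0.5 + rnorm(60, sd = 0.01))  # faint linear feature
x <- rbind(clutter, defect)

## crude random guess at which points are noise (assumption, not the
## paper's initialization)
noiseInit <- sample(c(TRUE, FALSE), nrow(x), replace = TRUE)

fit <- Mclust(x, G = 0:1, modelNames = "VVV",
              initialization = list(noise = noiseInit))
summary(fit)  # BIC weighs "no defect" (G = 0) against "defect" (G = 1)
```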
Bioinformatics | 2005
Qunhua Li; Chris Fraley; Roger E. Bumgarner; Ka Yee Yeung; Adrian E. Raftery
Motivation: Inner holes, artifacts and blank spots are common in microarray images, but current image analysis methods do not pay them enough attention. We propose a new robust model-based method for processing microarray images so as to estimate foreground and background intensities. The method starts with a very simple but effective automatic gridding method, and then proceeds in two steps. The first step applies model-based clustering to the distribution of pixel intensities, using the Bayesian Information Criterion (BIC) to choose the number of groups up to a maximum of three. The second step is spatial, finding the large spatially connected components in each cluster of pixels. The method thus combines the strengths of the histogram-based and spatial approaches. It deals effectively with inner holes in spots and with artifacts. It also provides a formal inferential basis for deciding when the spot is blank, namely when the BIC favors one group over two or three.
Results: We apply our methods for gridding and segmentation to cDNA microarray images from an HIV infection experiment. In these experiments, our method had better stability across replicates than a fixed-circle segmentation method or the seeded region growing method in the SPOT software, without introducing noticeable bias when estimating the intensities of differentially expressed genes.
Availability: spotSegmentation, an R language package implementing both the gridding and segmentation methods, is available through the Bioconductor project (http://www.bioconductor.org). The segmentation method requires the contributed R package MCLUST for model-based clustering (http://cran.us.r-project.org).
Contact: [email protected].
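A simplified sketch (not the spotSegmentation code) of the first, histogram-based step: cluster pixel intensities into at most three groups and let BIC decide whether the spot is blank, i.e., whether one group is preferred. The pixel intensities below are simulated.

```r
library(mclust)

set.seed(5)
pix <- c(rnorm(400, mean = 200, sd = 20),   # background pixels
         rnorm(100, mean = 800, sd = 60))   # foreground (spot) pixels

fit <- Mclust(pix, G = 1:3)   # BIC chooses among 1, 2, or 3 intensity groups
fit$G                         # selected number of groups
if (fit$G == 1) message("spot judged blank")
```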
Monthly Weather Review | 2010
Chris Fraley; Adrian E. Raftery; Tilmann Gneiting
Bayesian model averaging (BMA) is a statistical postprocessing technique that generates calibrated and sharp predictive probability density functions (PDFs) from forecast ensembles. It represents the predictive PDF as a weighted average of PDFs centered on the bias-corrected ensemble members, where the weights reflect the relative skill of the individual members over a training period. This work adapts the BMA approach to situations that arise frequently in practice; namely, when one or more of the member forecasts are exchangeable, and when there are missing ensemble members. Exchangeable members differ in random perturbations only, such as the members of bred ensembles, singular vector ensembles, or ensemble Kalman filter systems. Accounting for exchangeability simplifies the BMA approach, in that the BMA weights and the parameters of the component PDFs can be assumed to be equal within each exchangeable group. With these adaptations, BMA can be applied to postprocess multimodel ensembles of any composition. In experiments with surface temperature and quantitative precipitation forecasts from the University of Washington mesoscale ensemble and ensemble Kalman filter systems over the Pacific Northwest, the proposed extensions yield good results. The BMA method is robust to exchangeability assumptions, and the BMA postprocessed combined ensemble shows better verification results than any of the individual, raw, or BMA postprocessed ensemble systems. These results suggest that statistically postprocessed multimodel ensembles can outperform individual ensemble systems, even in cases in which one of the constituent systems is superior to the others.
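A hand-rolled sketch of the exchangeability constraint itself: members within a group share one BMA weight (and, in the full method, one set of component parameters), so a fitted per-group weight is split equally across that group's members. The groups, weights, and forecasts below are illustrative assumptions.

```r
group   <- c(1, 1, 1, 2, 2)  # e.g. three exchangeable bred members + two others
w_group <- c(0.6, 0.4)       # one fitted weight per exchangeable group

## split each group's weight equally among its members
w <- (w_group / tabulate(group))[group]

forecasts <- c(10.2, 10.8, 9.9, 12.1, 11.5)  # bias-corrected member forecasts
sum(w * forecasts)                           # BMA predictive mean
```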
Journal of Computational and Graphical Statistics | 2005
Chris Fraley; Adrian E. Raftery; Ron Wehrens
Clustering is often useful for analyzing and summarizing information within large datasets. Model-based clustering methods have been found to be effective for determining the number of clusters, dealing with outliers, and selecting the best clustering method in datasets that are small to moderate in size. For large datasets, current model-based clustering methods tend to be limited by memory and time requirements and the increasing difficulty of maximum likelihood estimation. They may fit too many clusters in some portions of the data and/or miss clusters containing relatively few observations. We propose an incremental approach for data that can be processed as a whole in memory, which is relatively efficient computationally and has the ability to find small clusters in large datasets. The method starts by drawing a random sample of the data, selecting and fitting a clustering model to the sample, and extending the model to the full dataset by additional EM iterations. New clusters are then added incrementally, initialized with the observations that are poorly fit by the current model. We demonstrate the effectiveness of this method by applying it to simulated data, and to image data where its performance can be assessed visually.
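A minimal sketch of the incremental idea under stated assumptions: select and fit a model on a random sample with Mclust, extend it to the full data with further EM iterations via mclust's em() (named arguments used throughout), and flag poorly fit observations that would seed the new clusters added in later rounds.

```r
library(mclust)

set.seed(6)
x <- rbind(matrix(rnorm(20000, mean = 0),  ncol = 2),
           matrix(rnorm(20000, mean = 6),  ncol = 2),
           matrix(rnorm(60,    mean = 12), ncol = 2))  # one small cluster

samp <- x[sample(nrow(x), 500), ]
fit0 <- Mclust(samp)                       # model selected and fit on the sample

## extend the sampled model to the full dataset with additional EM iterations
ext <- em(data = x, modelName = fit0$modelName,
          parameters = fit0$parameters)

## observations poorly fit by the current model (low maximum posterior
## probability) are candidates to initialize a new cluster
poor <- which(apply(ext$z, 1, max) < 0.5)
length(poor)
```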