Hadley Wickham
Rice University
Publications
Featured research published by Hadley Wickham.
Journal of Computational and Graphical Statistics | 2010
Hadley Wickham
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics. This article builds on Wilkinson, Anand, and Grossman (2005), describing extensions and refinements developed while building an open source implementation of the grammar of graphics for R, ggplot2. The topics in this article include an introduction to the grammar by working through the process of creating a plot, and discussing the components that we need. The grammar is then presented formally and compared to Wilkinson’s grammar, highlighting the hierarchy of defaults, and the implications of embedding a graphical grammar into a programming language. The power of the grammar is illustrated with a selection of examples that explore different components and their interactions in more detail. The article concludes by discussing some perceptual issues, and thinking about how we can build on the grammar to learn how to create graphical “poems.” Supplemental materials are available online.
Philosophical Transactions of the Royal Society A | 2009
Andreas Buja; Dianne Cook; Heike Hofmann; Michael S. Lawrence; Eun-Kyung Lee; Deborah F. Swayne; Hadley Wickham
We propose to furnish visual statistical methods with an inferential framework and protocol, modelled on confirmatory statistical testing. In this framework, plots take on the role of test statistics, and human cognition the role of statistical tests. Statistical significance of ‘discoveries’ is measured by having the human viewer compare the plot of the real dataset with collections of plots of simulated datasets. A simple but rigorous protocol that provides inferential validity is modelled after the ‘lineup’ popular from criminal legal procedures. Another protocol modelled after the ‘Rorschach’ inkblot test, well known from (pop-)psychology, will help analysts acclimatize to random variability before being exposed to the plot of the real data. The proposed protocols will be useful for exploratory data analysis, with reference datasets simulated by using a null assumption that structure is absent. The framework is also useful for model diagnostics in which case reference datasets are simulated from the model in question. This latter point follows up on previous proposals. Adopting the protocols will mean an adjustment in working procedures for data analysts, adding more rigour, and teachers might find that incorporating these protocols into the curriculum improves their students’ statistical thinking.
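The lineup protocol described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `make_lineup`, the default of 20 panels, and the choice of permuting `y` to simulate the "no structure" null are all assumptions made for the example.

```python
import random

def make_lineup(x, y, n_panels=20, seed=None):
    """Hide the real (x, y) data among null datasets.

    Assumption for this sketch: the null datasets are generated by
    permuting y, which breaks any association between x and y while
    keeping both marginal distributions unchanged.
    Returns (panels, true_position); each panel is an (x, y) pair of lists.
    """
    rng = random.Random(seed)
    true_position = rng.randrange(n_panels)  # where the real data is hidden
    panels = []
    for i in range(n_panels):
        if i == true_position:
            panels.append((list(x), list(y)))
        else:
            y_null = list(y)
            rng.shuffle(y_null)  # permutation null: structure is absent
            panels.append((list(x), y_null))
    return panels, true_position

# Hypothetical usage: the viewer sees the panels without knowing true_position.
panels, true_position = make_lineup([1, 2, 3, 4, 5], [2, 4, 5, 4, 6], seed=0)
```

If the real data carries no more structure than the null datasets, a viewer who picks one panel picks the real one with probability 1/n_panels, which is what gives the visual "test" its significance level.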
Journal of Telemedicine and Telecare | 2006
Amanda Oakley; Felicity Reeves; Jane Bennett; Stephen H Holmes; Hadley Wickham
We examined whether it is possible for a dermatologist to diagnose benign and malignant skin lesions by telemedicine, given a comprehensive history and/or clinical images. A medical student recorded a standardized history and description of 109 skin lesions and took digital photographs of the presenting lesion(s) immediately prior to a normal outpatient dermatology consultation. About 52 dermatologists were invited to participate in online diagnosis. In all, 38 took part and they were provided with the text and/or the image(s) online on a secure Website. When the images and text were provided, 53% of teledermatology diagnoses were the same as the face-to-face diagnosis. When images alone were provided, 57% of diagnoses were the same. When text alone was provided, 41% of diagnoses were the same. The relatively low diagnostic concordance may have been due to the inexperience of many teledermatologists and poor quality image display systems. The teledermatologists were less confident in their diagnoses than face-to-face specialists, especially in the absence of images. The teledermatology management plan was more likely to include biopsy, excision or review than was the case at the face-to-face consultation. Teledermatology may result in an increase in follow-up appointments and surgical procedures.
IEEE Transactions on Visualization and Computer Graphics | 2011
Hadley Wickham; Heike Hofmann
We propose a new framework for visualising tables of counts, proportions and probabilities. We call our framework product plots, alluding to the computation of area as a product of height and width, and the statistical concept of generating a joint distribution from the product of conditional and marginal distributions. The framework, with extensions, is sufficient to encompass over 20 visualisations previously described in fields of statistical graphics and infovis, including bar charts, mosaic plots, treemaps, equal area plots and fluctuation diagrams.
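The "product" idea can be illustrated with a small sketch of one member of this family, a two-variable mosaic layout. The function name `mosaic_rectangles` and the input format are hypothetical; the sketch only shows how width (a marginal proportion) times height (a conditional proportion) yields an area equal to the joint proportion.

```python
def mosaic_rectangles(counts):
    """Lay out a two-way table of counts in the unit square.

    counts maps (a, b) -> count. Each cell gets a rectangle
    (x, y, width, height) where width = P(A = a) and
    height = P(B = b | A = a), so area = P(A = a, B = b).
    """
    total = sum(counts.values())
    a_levels = list(dict.fromkeys(a for a, _ in counts))  # keep input order
    b_levels = list(dict.fromkeys(b for _, b in counts))
    rects = {}
    x = 0.0
    for a in a_levels:
        a_total = sum(counts.get((a, b), 0) for b in b_levels)
        width = a_total / total  # marginal proportion of a
        y = 0.0
        for b in b_levels:
            # conditional proportion of b given a (0 if the column is empty)
            height = counts.get((a, b), 0) / a_total if a_total else 0.0
            rects[(a, b)] = (x, y, width, height)
            y += height
        x += width
    return rects

# Hypothetical example table.
table = {("a", "x"): 1, ("a", "y"): 3, ("b", "x"): 2, ("b", "y"): 2}
rects = mosaic_rectangles(table)
```

In this example the cell ("a", "y") has width 4/8 and height 3/4, so its area is 3/8, the cell's share of the table.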
Journal of Computational and Graphical Statistics | 2013
John W. Emerson; Walton A. Green; Barret Schloerke; Jason Crowley; Dianne Cook; Heike Hofmann; Hadley Wickham
This article develops a generalization of the scatterplot matrix based on the recognition that most datasets include both categorical and quantitative information. Traditional grids of scatterplots often obscure important features of the data when one or more variables are categorical but coded as numerical. The generalized pairs plot offers a range of displays of paired combinations of categorical and quantitative variables. A mosaic plot, fluctuation diagram, or faceted bar chart may be used to display two categorical variables. A side-by-side boxplot, stripplot, faceted histogram, or density plot helps visualize a categorical and a quantitative variable. A traditional scatterplot is suitable for displaying a pair of numerical variables, but options also support density contours or annotating summary statistics such as the correlation and number of missing values, for example. By combining these, the generalized pairs plot may help to reveal structure in multivariate data that otherwise might go unnoticed in the process of exploratory data analysis. Two different R packages provide implementations of the generalized pairs plot, gpairs and GGally. Supplementary materials for this article are available online on the journal web site.
Archive | 2008
Dianne Cook; Andreas Buja; Eun Kyung Lee; Hadley Wickham
How do we find structure in multidimensional data when computer screens are only two-dimensional? One approach is to project the data onto one or two dimensions. Projections are used in classical statistical methods like principal component analysis (PCA) and linear discriminant analysis. PCA (e.g., Johnson and Wichern 2002) chooses a projection to maximize the variance. Fisher’s linear discriminant (e.g., Johnson and Wichern 2002) chooses a projection that maximizes the relative separation between group means. Projection pursuit (PP) (e.g., Huber 1985) generalizes these ideas into a common strategy, where an arbitrary function on projections is optimized. The scatterplot matrix (e.g., Becker and Cleveland 1987) also can be considered to be a projection method. It shows projections of the data onto all pairs of coordinate axes, the 2-D marginal projections of the data. These projection methods choose a few select projections out of infinitely many.
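As a small illustration of "choosing a projection to maximize the variance", the sketch below finds the variance-maximizing 1-D projection of 2-D data in closed form. It is deliberately restricted to two variables so that no linear algebra library is needed; the function names are hypothetical, and general PCA would use an eigendecomposition of the covariance matrix instead.

```python
import math

def first_principal_direction(points):
    """Unit vector along which 2-D points have maximal sample variance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Variance along angle t is
    #   (sxx + syy)/2 + (sxx - syy)/2 * cos(2t) + sxy * sin(2t),
    # which is maximized at 2t = atan2(2*sxy, sxx - syy).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

def projected_variance(points, direction):
    """Sample variance of the points projected onto a unit direction."""
    ux, uy = direction
    scores = [p[0] * ux + p[1] * uy for p in points]
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)

# Hypothetical data lying close to the line y = x.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8)]
direction = first_principal_direction(data)
```

Projection pursuit, as described above, replaces the variance in `projected_variance` with an arbitrary index function and optimizes that instead.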
Visual Data Mining | 2008
Doina Caragea; Dianne Cook; Hadley Wickham; Vasant G. Honavar
Support vector machines (SVM) offer a theoretically well-founded approach to automated learning of pattern classifiers. They have been proven to give highly accurate results in complex classification problems, for example, gene expression analysis. The SVM algorithm is also quite intuitive with a few inputs to vary in the fitting process and several outputs that are interesting to study. For many data mining tasks (e.g., cancer prediction) finding classifiers with good predictive accuracy is important, but understanding the classifier is equally important. By studying the classifier outputs we may be able to produce a simpler classifier, learn which variables are the important discriminators between classes, and find the samples that are problematic to the classification. Visual methods for exploratory data analysis can help us to study the outputs and complement automated classification algorithms in data mining. We present the use of tour-based methods to plot aspects of the SVM classifier. This approach provides insights about the cluster structure in the data, the nature of boundaries between clusters, and problematic outliers. Furthermore, tours can be used to assess the variable importance. We show how visual methods can be used as a complement to cross-validation methods in order to find good SVM input parameters for a particular data set.
Ecology | 2013
Diane M. Debinski; Jennet C. Caruthers; Dianne Cook; Jason Crowley; Hadley Wickham
Ecological fingerprints of climate change are becoming increasingly evident at broad geographical scales as measured by species range shifts and changes in phenology. However, finer-scale species-level responses to environmental fluctuations may also provide an important bellwether of impending future community responses. Here we examined changes in abundance of butterfly species along a hydrological gradient of six montane meadow habitat types in response to drought. Our data collection began prior to the drought, and we were able to track changes for 11 years, of which eight were considered mild to extreme drought conditions. We separated the species into those that had an affinity for hydric vs. xeric habitats. We suspected that drought would favor species with xeric habitat affinities, but that there could be variations in species-level responses along the hydrological gradient. We also suspected that mesic meadows would be most sensitive to drought conditions. Temporal trajectories were modeled for both species groups (hydric vs. xeric affinity) and individual species. Abundances of species with affinity for xeric habitats increased in virtually all meadow types. Conversely, abundances of species with affinity for hydric habitats decreased, particularly in mesic and xeric meadows. Mesic meadows showed the most striking temporal abundance trajectory: Increasing abundances of species with xeric habitat affinity were offset by decreasing or stable abundances of species with hydric habitat affinity. The one counterintuitive finding was that, in some hydric meadows, species with affinity for hydric habitats increased. In these cases, we suspect that decreasing moisture conditions in hydric meadows actually increased habitat suitability because sites near the limit of moisture extremes for some species became more acceptable. 
Thus, species responses were relatively predictable based upon habitat affinity and habitat location along the hydrological gradient, and mesic meadows showed the highest potential for changes in community composition. The implications of these results are that longer-term changes due to drought could simplify community composition, resulting in prevalence of species tolerant to drying conditions and a loss of species associated with wetter conditions. We contend that this application of gradient analysis could be valuable in assessing species vulnerability of other taxa and ecosystems.
Journal of Computational and Graphical Statistics | 2017
Heike Hofmann; Hadley Wickham; Karen Kafadar
Boxplots are useful displays that convey rough information about the distribution of a variable. Boxplots were designed to be drawn by hand and work best for small datasets, where detailed estimates of tail behavior beyond the quartiles may not be trustworthy. Larger datasets afford more precise estimates of tail behavior, but boxplots do not take advantage of this precision, instead presenting large numbers of extreme, though not unexpected, observations. Letter-value plots address this problem by including more detailed information about the tails using “letter values,” an order statistic defined by Tukey. Boxplots display the first two letter values (the median and quartiles); letter-value plots display further letter values so far as they are reliable estimates of their corresponding quantiles. We illustrate letter-value plots with real data that demonstrate their usefulness for large datasets. All graphics are created using the R package lvplot, and code and data are available in the supplementary materials.
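Letter values can be computed with Tukey's depth recursion, sketched below. This is an illustrative implementation and not the lvplot code: the function name `letter_values` and the `max_levels` parameter are hypothetical, and the sketch simply continues down to depth 1 rather than applying the reliability-based stopping rule the abstract refers to.

```python
import math

def letter_values(data, max_levels=None):
    """Return a list of (label, depth, lower, upper) letter values.

    The median has depth (n + 1) / 2; each further depth is
    (floor(previous depth) + 1) / 2. A half-integer depth averages
    the two neighbouring order statistics.
    Assumption: stop at depth 1 or after max_levels levels.
    """
    xs = sorted(data)
    n = len(xs)
    labels = "MFEDCBAZYX"  # median, fourths, eighths, ...

    def at_depth(d):
        lo, hi = math.floor(d), math.ceil(d)
        lower = (xs[lo - 1] + xs[hi - 1]) / 2  # counted from the bottom
        upper = (xs[n - lo] + xs[n - hi]) / 2  # counted from the top
        return lower, upper

    result = []
    depth = (n + 1) / 2
    level = 0
    while True:
        label = labels[level] if level < len(labels) else f"LV{level}"
        lower, upper = at_depth(depth)
        result.append((label, depth, lower, upper))
        level += 1
        if depth <= 1 or (max_levels is not None and level >= max_levels):
            break
        depth = (math.floor(depth) + 1) / 2
    return result

# Hypothetical usage on the integers 1..9.
values = letter_values(range(1, 10))
```

For the integers 1 to 9 the first two entries are the median (5) and the fourths (3 and 7), the values a boxplot would draw; the remaining entries reach further into the tails.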
Journal of Computational and Graphical Statistics | 2015
Garrett Grolemund; Hadley Wickham
This article describes a class of graphs, embedded plots, that are particularly useful for analyzing large and complex datasets. Embedded plots organize a collection of graphs into a larger graphic, which can display more complex relationships than would otherwise be possible. This arrangement provides additional axes, prevents overplotting, and allows for multiple levels of visual summarization. Embedded plots also preprocess complex data into a form suitable for the human cognitive system, which can facilitate comprehension. We illustrate the usefulness of embedded plots with a case study, discuss the practical and cognitive advantages of embedded plots, and demonstrate how to implement embedded plots as a general class within visualization software, something currently unavailable. This article has supplementary material online.
