Publication


Featured research published by Johanna Hardin.


Journal of Computational and Graphical Statistics | 2005

The Distribution of Robust Distances

Johanna Hardin; David M. Rocke

Mahalanobis-type distances in which the shape matrix is derived from a consistent, high-breakdown robust multivariate location and scale estimator have an asymptotic chi-squared distribution, as is the case with those derived from the ordinary covariance matrix. For example, Rousseeuw's minimum covariance determinant (MCD) is a robust estimator with a high breakdown. However, even in quite large samples, the chi-squared approximation to the distances of the sample data from the MCD center with respect to the MCD shape is poor. We provide an improved F approximation that gives accurate outlier rejection points for various sample sizes.
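
As a rough illustration of the chi-squared baseline that the abstract describes, the sketch below computes squared Mahalanobis distances for two-dimensional data using the ordinary sample mean and covariance (not the MCD) and compares them with a chi-squared cutoff. The function names and the toy data are hypothetical, and the improved F approximation from the paper is not reproduced here.

```python
import math


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distances of 2-D points from their sample mean,
    using the ordinary (non-robust) sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample variances and covariance with the usual n - 1 denominator.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the closed-form inverse of a 2x2 matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def chi2_cutoff_2df(prob):
    """Chi-squared quantile for 2 degrees of freedom.
    The CDF is 1 - exp(-x / 2), so the quantile has a closed form."""
    return -2.0 * math.log(1.0 - prob)


if __name__ == "__main__":
    # Hypothetical toy data: five roughly aligned points and one stray point.
    data = [(0, 0), (1, 0.5), (2, 1.8), (3, 2.2), (4, 4.5), (10, -3)]
    cutoff = chi2_cutoff_2df(0.975)
    for point, d2 in zip(data, mahalanobis_sq_2d(data)):
        print(point, round(d2, 3), "flagged" if d2 > cutoff else "ok")
```

Because the classical mean and covariance are themselves pulled toward outliers, distances computed this way can mask the very points one wants to flag, which is the motivation for robust estimators such as the MCD.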


Computational Statistics & Data Analysis | 2004

Outlier Detection in the Multiple Cluster Setting Using the Minimum Covariance Determinant Estimator

Johanna Hardin; David M. Rocke

Mahalanobis-type distances in which the shape matrix is derived from a consistent high-breakdown robust multivariate location and scale estimator can be used to find outlying points. Hardin and Rocke ( http://www.cipic.ucdavis.edu/~dmrocke/preprints.html ) developed a new method for identifying outliers in a one-cluster setting using an F distribution. We extend the method to the multiple-cluster case, which gives a robust clustering method in conjunction with an outlier identification method. We provide results of the F distribution method for multiple clusters that have different sizes and shapes.


BMC Bioinformatics | 2007

A robust measure of correlation between two genes on a microarray

Johanna Hardin; Aya Mitani; Brian VanKoten

Background: The underlying goal of microarray experiments is to identify gene expression patterns across different experimental conditions. Genes that are contained in a particular pathway or that respond similarly to experimental conditions could be co-expressed and show similar patterns of expression on a microarray. Using any of a variety of clustering methods or gene network analyses, we can partition genes of interest into groups, clusters, or modules based on measures of similarity. Typically, Pearson correlation is used to measure distance (or similarity) before implementing a clustering algorithm. Pearson correlation is quite susceptible to outliers, however, which is an unfortunate characteristic when dealing with microarray data (well known to be typically quite noisy).

Results: We propose a resistant similarity metric based on Tukey's biweight estimate of multivariate scale and location. The resistant metric is simply the correlation obtained from a resistant covariance matrix of scale. We give results which demonstrate that our correlation metric is much more resistant than the Pearson correlation while being more efficient than other nonparametric measures of correlation (e.g., Spearman correlation). Additionally, our method gives a systematic gene flagging procedure which is useful when dealing with large amounts of noisy data.

Conclusion: When dealing with microarray data, which are known to be quite noisy, robust methods should be used. Specifically, robust distances, including the biweight correlation, should be used in clustering and gene network analysis.
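
To make the idea of a biweight-based correlation concrete, here is a minimal sketch of the biweight midcorrelation, a related one-step estimator that downweights points far from the median using Tukey's biweight function. It is a swapped-in simplification, not the iterative multivariate estimate described in the abstract. The function names, the tuning constant of 9, and the toy data are assumptions for illustration.

```python
import math
from statistics import median


def _biweight_terms(values, c=9.0):
    """Median-centered values multiplied by Tukey biweight weights."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; biweight weights are undefined")
    terms = []
    for v in values:
        u = (v - med) / (c * mad)
        # Points with |u| >= 1 get weight zero and drop out entirely.
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        terms.append((v - med) * w)
    return terms


def biweight_midcorrelation(x, y):
    """Biweight midcorrelation of two equal-length sequences."""
    a = _biweight_terms(x)
    b = _biweight_terms(y)
    num = sum(ai * bi for ai, bi in zip(a, b))
    den = math.sqrt(sum(ai * ai for ai in a)) * math.sqrt(sum(bi * bi for bi in b))
    return num / den


def pearson(x, y):
    """Ordinary Pearson correlation, for comparison."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)) * math.sqrt(
        sum((yi - my) ** 2 for yi in y)
    )
    return num / den


if __name__ == "__main__":
    # Hypothetical expression profiles: identical except for one corrupted value.
    gene_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    gene_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, -50]
    print("Pearson: ", pearson(gene_a, gene_b))
    print("Biweight:", biweight_midcorrelation(gene_a, gene_b))
```

In this toy example a single corrupted value drives the Pearson correlation negative, while the biweight version assigns that point zero weight and stays strongly positive.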


Computational Statistics & Data Analysis | 1999

Some computational issues in cluster analysis with no a priori metric

Dan Coleman; Xiaopeng Dong; Johanna Hardin; David M. Rocke; David L. Woodruff

We address the problem of computing the largest fraction of missing information for the EM algorithm and the worst linear function for data augmentation. These are the largest eigenvalue and its associated eigenvector for the Jacobian of the EM operator at a maximum likelihood estimate, which are important for assessing convergence in iterative simulation. An estimate of the largest fraction of missing information is available from the EM iterates; this is often adequate since only a few figures of accuracy are needed. In some instances the EM iteration also gives an estimate of the worst linear function. We show that improved estimates can be essential for proper inference. In order to obtain improved estimates efficiently, we use the power method for eigencomputation. Unlike eigenvalue decomposition, the power method computes only the largest eigenvalue and eigenvector of a matrix; it can take advantage of a good eigenvector estimate as an initial value, and it can be terminated after only a few figures of accuracy are achieved. Moreover, the matrix products needed in the power method can be computed by extrapolation, obviating the need to form the Jacobian of the EM operator. We give results of simulation studies on multivariate normal data showing that, as the data dimension increases, this approach becomes more efficient than methods that use a finite-difference approximation to the Jacobian, which is the only general-purpose alternative available.
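
The power method mentioned in the abstract can be sketched in a few lines. The version below is a generic textbook power iteration, not the authors' implementation. It takes a matrix-vector product as a function, which echoes the point that the matrix itself need not be formed. The names `power_method` and `matvec`, the tolerance, and the small test matrix are assumptions for illustration.

```python
import math


def power_method(matvec, start, tol=1e-10, max_iter=1000):
    """Estimate the dominant eigenvalue and eigenvector.

    matvec: function mapping a vector (list of floats) to the product A v.
    start:  initial eigenvector guess; a good guess speeds up convergence.
    """
    norm = math.sqrt(sum(c * c for c in start))
    v = [c / norm for c in start]
    eig = 0.0
    for _ in range(max_iter):
        w = matvec(v)
        # Rayleigh quotient v.(A v); v has unit length, so no division is needed.
        new_eig = sum(vi * wi for vi, wi in zip(v, w))
        w_norm = math.sqrt(sum(c * c for c in w))
        if w_norm == 0:
            return 0.0, v
        v = [c / w_norm for c in w]
        # Stop once the eigenvalue estimate has settled to the requested accuracy.
        if abs(new_eig - eig) < tol:
            return new_eig, v
        eig = new_eig
    return eig, v


if __name__ == "__main__":
    # Hypothetical 2x2 symmetric matrix with eigenvalues 3 and 1.
    matrix = [[2.0, 1.0], [1.0, 2.0]]

    def product(vec):
        return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

    value, vector = power_method(product, [1.0, 0.0])
    print(value, vector)
```

Loosening `tol` stops the iteration after only a few figures of accuracy, which is the early-termination property the abstract highlights.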


The American Statistician | 2015

Data Science in Statistics Curricula: Preparing Students to “Think with Data”

Johanna Hardin; Roger Hoerl; Nicholas J. Horton; Deborah Nolan; Benjamin Baumer; O. Hall-Holt; Paul Murrell; Roger D. Peng; P. Roback; D. Temple Lang; Mark Daniel Ward

A growing number of students are completing undergraduate degrees in statistics and entering the workforce as data analysts. In these positions, they are expected to understand how to use databases and other data warehouses, scrape data from Internet sources, program solutions to complex problems in multiple languages, and think algorithmically as well as statistically. These data science topics have not traditionally been a major component of undergraduate programs in statistics. Consequently, a curricular shift is needed to address additional learning outcomes. The goal of this article is to motivate the importance of data science proficiency and to provide examples and resources for instructors to implement data science in their own statistics curricula. We provide case studies from seven institutions. These varied approaches to teaching data science demonstrate curricular innovations to address new needs. Also included here are examples of assignments designed for courses that foster engagement of undergraduates with data and data science. [Received November 2014. Revised July 2015.]


Biostatistics | 2009

A note on oligonucleotide expression values not being normally distributed

Johanna Hardin; Jason Wilson

Novel techniques for analyzing microarray data are constantly being developed. Though many of the methods contribute to biological discoveries, inability to properly evaluate the novel techniques limits their ability to advance science. Because the underlying distribution of microarray data is unknown, novel methods are typically tested against the assumed normal distribution. However, microarray data are not, in fact, normally distributed, and assuming so can have misleading consequences. Using an Affymetrix technical replicate spike-in data set, we show that oligonucleotide expression values are not normally distributed for any of the standard methods for calculating expression values. The resulting data tend to have a large proportion of skewed and heavy-tailed genes. Additionally, we show that standard methods can give unexpected and misleading results when the data are not well approximated by the normal distribution. Robust methods are therefore recommended when analyzing microarray data. Additionally, new techniques should be evaluated with skewed and/or heavy-tailed data distributions.


The Annals of Applied Statistics | 2013

A method for generating realistic correlation matrices

Johanna Hardin; Stephan Ramon Garcia; David Golan

Simulating sample correlation matrices is important in many areas of statistics. Approaches such as generating Gaussian data and finding their sample correlation matrix or generating random uniform [-1,1] deviates as pairwise correlations both have drawbacks. We develop an algorithm for adding noise, in a highly controlled manner, to general correlation matrices. In many instances, our method yields results which are superior to those obtained by simply simulating Gaussian data. Moreover, we demonstrate how our general algorithm can be tailored to a number of different correlation models. Using our results with a few different applications, we show that simulating correlation matrices can help assess statistical methodology.


The American Statistician | 2015

Teaching the Next Generation of Statistics Students to "Think With Data": Special Issue on Statistics and the Undergraduate Curriculum

Nicholas J. Horton; Johanna Hardin


Briefings in Bioinformatics | 2018

Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions

Ciaran Evans; Johanna Hardin; Daniel M. Stoebel
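
For the abstract of "A method for generating realistic correlation matrices" above, the idea of adding controlled noise to a correlation matrix can be illustrated with one simple construction: draw a random unit vector for each variable and add a small multiple of their pairwise dot products to the off-diagonal entries. If the noise level is smaller than the smallest eigenvalue of the starting matrix, the result remains a valid positive definite correlation matrix. This is a sketch under those stated assumptions and is not claimed to be the authors' exact algorithm; all function names and parameter values are hypothetical.

```python
import math
import random


def add_noise_to_correlation(corr, eps, dim, rng):
    """Return corr + eps * (G - I), where G is the Gram matrix of random unit vectors.

    corr: symmetric correlation matrix (list of lists) with unit diagonal.
    eps:  noise level; keep it below the smallest eigenvalue of corr.
    dim:  dimension of the random unit vectors.
    rng:  a random.Random instance, so results are reproducible.
    """
    n = len(corr)
    units = []
    for _ in range(n):
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in vec))
        units.append([c / norm for c in vec])
    noisy = [row[:] for row in corr]
    for i in range(n):
        for j in range(i + 1, n):
            # Dot products of unit vectors lie in [-1, 1], so each entry moves by at most eps.
            shift = eps * sum(a * b for a, b in zip(units[i], units[j]))
            noisy[i][j] += shift
            noisy[j][i] = noisy[i][j]
    return noisy


def is_positive_definite(matrix):
    """Check positive definiteness by attempting a Cholesky factorization."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    return False
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


if __name__ == "__main__":
    # Hypothetical starting matrix: constant correlation 0.5, smallest eigenvalue 0.5.
    base = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
    noisy = add_noise_to_correlation(base, eps=0.3, dim=5, rng=random.Random(0))
    for row in noisy:
        print([round(v, 3) for v in row])
    print("positive definite:", is_positive_definite(noisy))
```

The diagonal is untouched because each unit vector has dot product 1 with itself, and the Gram matrix is positive semidefinite, so the smallest eigenvalue can drop by at most `eps`.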


Journal of Statistics Education | 2015

Network Analysis with the Enron Email Corpus

Johanna Hardin; Ghassan Sarkis; P. C. Urc

This is an exciting time to be a statistician. The contribution of the discipline of statistics to scientific knowledge is widely recognized (McNutt 2014) with increasingly positive public perception. Many feel “daunted by the challenge of extracting understanding from floods of disconnected data that threaten to swamp every discipline” (Yamamoto 2013). Demand for statisticians is strong, and as such, ‘statistician’ frequently ranks as a top job (Wasserstein 2015). The McKinsey report (Manyika et al. 2011) makes clear the need for new graduates with “deep analytical skills,” and many (most?) of these new workers will be trained at the undergraduate level. Fortunately, the recent growth of undergraduate statistics programs is impressive. While still small in absolute numbers they have nearly doubled between 2010 and 2013 (Wasserstein 2015) and are on track to continue to increase. But there are challenges as well as opportunities in this new world of data (Horton 2015; Ridgway 2015a). The traditional statistics curriculum with mathematical foundations has not kept up with pressing demands for students who can make sense of data. Calls for transformed undergraduate education have resonated nationally (Holdren and Lander 2012; Zorn et al. 2014). These pressures led ASA President Nathaniel Schenker to convene an ASA workgroup to update the association’s guidelines for undergraduate programs. The group, with broad representation from academia, industry, and government, put forward guidelines that were endorsed by the ASA Board of Directors in November 2014 (ASA 2014). Table 1 includes the full executive summary (a copy of the guidelines and related resources can be found at http://www.amstat.org/education/curriculumguidelines.cfm). Much of the statistics education literature focuses on the introductory statistics course and statistics before college. Given the relatively few decades since the establishment of undergraduate statistics programs, this is not surprising. 
While there has been impressive growth in the number of students taking introductory statistics, there has been a relative dearth of articles on the curriculum beyond the introductory course. The 2014

Collaboration


Dive into Johanna Hardin's collaborations.

Top Co-Authors

David M. Rocke

University of California

John Crowley

University of Texas MD Anderson Cancer Center

Athanasios Fassas

University of Arkansas for Medical Sciences
