Ann B. Lee
Carnegie Mellon University
Publication
Featured research published by Ann B. Lee.
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2006
Stephane Lafon; Ann B. Lee
We provide evidence that nonlinear dimensionality reduction, clustering, and data set parameterization can be solved within one and the same framework. The main idea is to define a system of coordinates with an explicit metric that reflects the connectivity of a given data set and that is robust to noise. Our construction, which is based on a Markov random walk on the data, offers a general scheme of simultaneously reorganizing and subsampling graphs and arbitrarily shaped data sets in high dimensions using intrinsic geometry. We show that clustering in embedding spaces is equivalent to compressing operators. The objective of data partitioning and clustering is to coarse-grain the random walk on the data while at the same time preserving a diffusion operator for the intrinsic geometry or connectivity of the data set up to some accuracy. We show that the quantization distortion in diffusion space bounds the error of compression of the operator, thus giving a rigorous justification for k-means clustering in diffusion space and a precise measure of the performance of general clustering algorithms.
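The idea of clustering in diffusion space can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, the bandwidth `eps`, the plain random-walk normalization, and the helper names `diffusion_map` and `kmeans` are all assumptions made for illustration.

```python
import numpy as np

def diffusion_map(X, eps, n_coords=2, t=1):
    """Diffusion coordinates from a Markov random walk on the data (sketch)."""
    # Pairwise squared Euclidean distances between all points.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / eps)                 # Gaussian affinities (assumed kernel)
    d = K.sum(axis=1)                     # node degrees
    # Symmetric conjugate of the Markov matrix P = D^{-1} K.
    A = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]        # sort eigenvalues in decreasing order
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]      # right eigenvectors of P
    # Skip the trivial constant eigenvector (eigenvalue 1).
    return (vals[1:n_coords + 1] ** t) * psi[:, 1:n_coords + 1]

def kmeans(Y, k, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Y[labels == j].mean(axis=0)
    return labels

# Toy data: two well-separated blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(3.0, 0.1, (20, 2))])
Y = diffusion_map(X, eps=1.0, n_coords=2)
labels = kmeans(Y, 2)
```

On this toy data the first diffusion coordinate is nearly constant within each blob, so k-means in the embedded space should recover the two groups.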
Journal of Mathematical Imaging and Vision | 2003
Anuj Srivastava; Ann B. Lee; Eero P. Simoncelli; Song-Chun Zhu
Statistical analysis of images reveals two interesting properties: (i) invariance of image statistics to scaling of images, and (ii) non-Gaussian behavior of image statistics, i.e. high kurtosis, heavy tails, and sharp central cusps. In this paper we review some recent results in statistical modeling of natural images that attempt to explain these patterns. Two categories of results are considered: (i) studies of probability models of images or image decompositions (such as Fourier or wavelet decompositions), and (ii) discoveries of underlying image manifolds while restricting to natural images. Applications of these models in areas such as texture analysis, image classification, compression, and denoising are also considered.
Nature Genetics | 2014
Trent Gaugler; Lambertus Klei; Stephan J. Sanders; Corneliu A. Bodea; Arthur P. Goldberg; Ann B. Lee; Milind Mahajan; Dina Manaa; Yudi Pawitan; Jennifer Reichert; Stephan Ripke; Sven Sandin; Pamela Sklar; Oscar Svantesson; Abraham Reichenberg; Christina M. Hultman; Bernie Devlin; Kathryn Roeder; Joseph D. Buxbaum
A key component of genetic architecture is the allelic spectrum influencing trait variability. For autism spectrum disorder (herein termed autism), the nature of the allelic spectrum is uncertain. Individual risk-associated genes have been identified from rare variation, especially de novo mutations. From this evidence, one might conclude that rare variation dominates the allelic spectrum in autism, yet recent studies show that common variation, individually of small effect, has substantial impact en masse. At issue is how much of an impact relative to rare variation this common variation has. Using a unique epidemiological sample from Sweden, new methods that distinguish total narrow-sense heritability from that due to common variation and synthesis of results from other studies, we reach several conclusions about autism's genetic architecture: its narrow-sense heritability is ∼52.4%, with most due to common variation, and rare de novo mutations contribute substantially to individual liability, yet their contribution to variance in liability, 2.6%, is modest compared to that for heritable variation.
International Journal of Computer Vision | 2003
Ann B. Lee; Kim Steenstrup Pedersen; David Mumford
Recently, there has been a great deal of interest in modeling the non-Gaussian structures of natural images. However, despite the many advances in the direction of sparse coding and multi-resolution analysis, the full probability distribution of pixel values in a neighborhood has not yet been described. In this study, we explore the space of data points representing the values of 3 × 3 high-contrast patches from optical and 3D range images. We find that the distribution of data is extremely “sparse” with the majority of the data points concentrated in clusters and non-linear low-dimensional manifolds. Furthermore, a detailed study of probability densities allows us to systematically distinguish between images of different modalities (optical versus range), which otherwise display similar marginal distributions. Our work indicates the importance of studying the full probability distribution of natural images, not just marginals, and the need to understand the intrinsic dimensionality and nature of the data. We believe that object-like structures in the world and the sensor properties of the probing device generate observations that are concentrated along predictable shapes in state space. Our study of natural image statistics accounts for local geometries (such as edges) in natural scenes, but does not impose such strong assumptions on the data as independent components or sparse coding by linear change of bases.
International Journal of Computer Vision | 2001
Ann B. Lee; David Mumford; Jinggang Huang
We develop a scale-invariant version of Matheron's “dead leaves model” for the statistics of natural images. The model takes occlusions into account and resembles the image formation process by randomly adding independent elementary shapes, such as disks, in layers. We compare the empirical statistics of two large databases of natural images with the statistics of the occlusion model, and find excellent qualitative and good quantitative agreement. At this point, this is the only image model which comes close to duplicating the simplest, elementary statistics of natural images, such as the scale invariance property of marginal distributions of filter responses, the full co-occurrence statistics of two pixels, and the joint statistics of pairs of Haar wavelet responses.
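A dead leaves image of the kind described above can be simulated in a few lines. This is only a sketch under stated assumptions: the abstract does not give the size distribution, so the power-law density of radii proportional to r^-3 between cutoffs `r_min` and `r_max`, the uniform gray levels, and the function name `dead_leaves` are illustrative choices. Disks are generated front to back, so each new disk only fills pixels that are still uncovered.

```python
import numpy as np

def dead_leaves(size=64, r_min=2.0, r_max=32.0, max_disks=20000, seed=0):
    """Layered occlusion image built from random disks (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    img = np.full((size, size), np.nan)          # NaN marks uncovered pixels
    yy, xx = np.mgrid[0:size, 0:size]
    a, b = r_min ** -2, r_max ** -2
    for _ in range(max_disks):
        # Inverse-CDF sample of a radius with density proportional to r^-3
        # on [r_min, r_max] (assumed size law).
        r = (a - rng.uniform() * (a - b)) ** -0.5
        # Centers are uniform on a window padded by r_max, so disks may
        # overlap the image border.
        cx, cy = rng.uniform(-r_max, size + r_max, 2)
        mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
        # Later disks lie behind earlier ones: fill only uncovered pixels.
        img[mask & np.isnan(img)] = rng.uniform()
        if not np.isnan(img).any():
            break
    return img

img = dead_leaves()
```

Generating disks front to back and stopping once every pixel is covered avoids having to choose a number of layers in advance.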
Computer Vision and Pattern Recognition | 2000
Jinggang Huang; Ann B. Lee; David Mumford
The statistics of range images from natural environments is a largely unexplored field of research. It closely relates to the statistical modeling of the scene geometry in natural environments, and the modeling of optical natural images. We have used a 3D laser range-finder to collect range images from mixed forest scenes. The images are here analyzed with respect to different statistics.
The Annals of Applied Statistics | 2008
Ann B. Lee; Boaz Nadler; Larry Wasserman
In many modern applications, including analysis of gene expression and text documents, the data are noisy, high-dimensional, and unordered, with no particular meaning to the given order of the variables. Yet, successful learning is often possible due to sparsity: the fact that the data are typically redundant with underlying structures that can be represented by only a few features. In this paper we present treelets, a novel construction of multi-scale bases that extends wavelets to nonsmooth signals. The method is fully adaptive, as it returns a hierarchical tree and an orthonormal basis which both reflect the internal structure of the data. Treelets are especially well-suited as a dimensionality reduction and feature selection tool prior to regression and classification, in situations where sample sizes are small and the data are sparse with unknown groupings of correlated or collinear variables. The method is also simple to implement and analyze theoretically. Here we describe a variety of situations where treelets perform better than principal component analysis, as well as some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.
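One level of a treelet-style construction might look like the sketch below. It is an interpretation rather than the published algorithm: the abstract does not spell out the steps, so the use of absolute correlation as the similarity measure, the Jacobi rotation on the selected pair, and the name `treelet_level` are assumptions. The rotation decorrelates the two most similar variables; the higher-variance "sum" variable stays active and the "difference" variable is retired.

```python
import numpy as np

def treelet_level(C, active):
    """Merge the most correlated pair of active variables (illustrative sketch)."""
    # Find the pair of active variables with the largest absolute correlation.
    best, pair = -1.0, None
    for pos, i in enumerate(active):
        for j in active[pos + 1:]:
            rho = abs(C[i, j]) / np.sqrt(C[i, i] * C[j, j])
            if rho > best:
                best, pair = rho, (i, j)
    a, b = pair
    # Jacobi rotation angle that zeroes the covariance between a and b.
    theta = 0.5 * np.arctan2(2.0 * C[a, b], C[a, a] - C[b, b])
    c, s = np.cos(theta), np.sin(theta)
    R = np.eye(C.shape[0])
    R[a, a], R[a, b], R[b, a], R[b, b] = c, s, -s, c
    C_new = R @ C @ R.T                    # covariance in the rotated basis
    # Variable a now carries the larger variance (sum); b is the difference.
    new_active = [k for k in active if k != b]
    return C_new, R, new_active, pair

# Blocked covariance: variables 0 and 1 are strongly correlated, 2 is independent.
C = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
C_new, R, active, pair = treelet_level(C, [0, 1, 2])
```

Repeating this step on the remaining active variables and composing the rotations would give a hierarchical tree together with an orthonormal basis, which is the structure the abstract describes.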
Genetic Epidemiology | 2009
Ann B. Lee; Diana Luca; Lambertus Klei; Bernie Devlin; Kathryn Roeder
As one approach to uncovering the genetic underpinnings of complex disease, individuals are measured at a large number of genetic variants (usually SNPs) across the genome and these SNP genotypes are assessed for association with disease status. We propose a new statistical method called Spectral‐GEM for the analysis of genome‐wide association studies; the goal of Spectral‐GEM is to quantify the ancestry of the sample from such genotypic data. Ignoring structure due to differential ancestry can lead to an excess of spurious findings and reduce power. Ancestry is commonly estimated using the eigenvectors derived from principal component analysis (PCA). To develop an alternative to PCA we draw on connections between multidimensional scaling and spectral graph theory. Our approach, based on a spectral embedding derived from the normalized Laplacian of a graph, can produce more meaningful delineation of ancestry than by using PCA. Often the results from Spectral‐GEM are straightforward to interpret and therefore useful in association analysis. We illustrate the new algorithm with an analysis of the POPRES data [Nelson et al., 2008]. Genet. Epidemiol. 34:51–59, 2010.
IEEE Transactions on Medical Imaging | 2011
Wei Wang; John A. Ozolek; Dejan Slepčev; Ann B. Lee; Cheng Chen; Gustavo K. Rohde
Nuclear morphology and structure as visualized from histopathology microscopy images can yield important diagnostic clues in some benign and malignant tissue lesions. Precise quantitative information about nuclear structure and morphology, however, is currently not available for many diagnostic challenges. This is due, in part, to the lack of methods to quantify these differences from image data. We describe a method to characterize and contrast the distribution of nuclear structure in different tissue classes (normal, benign, cancer, etc.). The approach is based on quantifying chromatin morphology in different groups of cells using the optimal transportation (Kantorovich-Wasserstein) metric in combination with the Fisher discriminant analysis and multidimensional scaling techniques. We show that the optimal transportation metric is able to measure relevant biological information as it enables automatic determination of the class (e.g., normal versus cancer) of a set of nuclei. We show that the classification accuracies obtained using this metric are, on average, as good as or better than those obtained utilizing a set of previously described numerical features. We apply our methods to two diagnostic challenges for surgical pathology: one in the liver and one in the thyroid. Results automatically computed using this technique show potentially biologically relevant differences in nuclear structure in liver and thyroid cancers.
Monthly Notices of the Royal Astronomical Society | 2009
Peter E. Freeman; Jeffrey A. Newman; Ann B. Lee; Joseph W. Richards; Chad M. Schafer
The development of fast and accurate methods of photometric redshift estimation is a vital step towards being able to fully utilize the data of next-generation surveys within precision cosmology. In this paper we apply a specific approach to spectral connectivity analysis (SCA; Lee & Wasserman 2009) called diffusion map. SCA is a class of non-linear techniques for transforming observed data (e.g., photometric colours for each galaxy, where the data lie on a complex subset of p-dimensional space) to a simpler, more natural coordinate system wherein we apply regression to make redshift predictions. As SCA relies upon eigen-decomposition, our training set size is limited to ~10,000 galaxies; we use the Nyström extension to quickly estimate diffusion coordinates for objects not in the training set. We apply our method to 350,738 SDSS main sample galaxies, 29,816 SDSS luminous red galaxies, and 5,223 galaxies from DEEP2 with CFHTLS ugriz photometry. For all three datasets, we achieve prediction accuracies on par with previous analyses, and find that use of the Nyström extension leads to a negligible loss of prediction accuracy relative to that achieved with the training sets. As in some previous analyses (e.g., Collister & Lahav 2004, Ball et al. 2008), we observe that our predictions are generally too high (low) in the low (high) redshift regimes. We demonstrate that this is a manifestation of attenuation bias, wherein measurement error (i.e., uncertainty in diffusion coordinates due to uncertainty in the measured fluxes/magnitudes) reduces the slope of the best-fit regression line. Mitigation of this bias is necessary if we are to use photometric redshift estimates produced by computationally efficient empirical methods in precision cosmology.
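The attenuation bias mentioned above can be reproduced with a small simulation. This is a generic errors-in-variables sketch on synthetic numbers, not the redshift analysis itself; the variances, sample size, and variable names are assumptions. With a true slope of 1 and equal variances for the predictor and its measurement error, the classical result predicts a fitted slope near 0.5, so predictions are pulled towards the mean (too high at the low end, too low at the high end).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x_true = rng.normal(0.0, 1.0, n)               # error-free predictor, variance 1
y = 1.0 * x_true + rng.normal(0.0, 0.1, n)     # response with true slope 1
x_obs = x_true + rng.normal(0.0, 1.0, n)       # predictor measured with error, variance 1

# Ordinary least-squares slopes from a degree-1 polynomial fit.
slope_clean = np.polyfit(x_true, y, 1)[0]
slope_noisy = np.polyfit(x_obs, y, 1)[0]

# Classical attenuation factor: var(x) / (var(x) + var(measurement error)).
expected = 1.0 * 1.0 / (1.0 + 1.0)

print(f"slope with exact predictor: {slope_clean:.3f}")
print(f"slope with noisy predictor: {slope_noisy:.3f} (theory: {expected:.3f})")
```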