Publication


Featured research published by Jon McAuliffe.


Journal of the American Statistical Association | 2006

Convexity, classification, and risk bounds

Peter L. Bartlett; Michael I. Jordan; Jon McAuliffe

Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0–1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0–1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function—that it satisfies a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise, and show that in this case, strictly convex loss functions lead to faster rates of convergence of the risk than would be implied by standard uniform convergence arguments. Finally, we present applications of our results to the estimation of convergence rates in function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.
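
As a rough illustration of the surrogate idea described in this abstract, the sketch below compares the 0-1 loss with three common convex surrogates, each written as a function of the margin m = y * f(x). The function names, the base-2 scaling of the logistic loss, and the sample margins are illustrative assumptions and are not taken from the paper.

```python
import math

def zero_one_loss(margin):
    """0-1 loss: 1 when the margin is non-positive (a misclassification), else 0."""
    return 1.0 if margin <= 0 else 0.0

def hinge_loss(margin):
    """Hinge loss, the surrogate minimized by the support vector machine."""
    return max(0.0, 1.0 - margin)

def exponential_loss(margin):
    """Exponential loss, the surrogate associated with boosting."""
    return math.exp(-margin)

def logistic_loss(margin):
    """Logistic loss with a base-2 log, so that it equals 1 at margin 0."""
    return math.log2(1.0 + math.exp(-margin))

# Hypothetical margins chosen only to show the pattern.
margins = [-2.0, -0.5, 0.0, 0.5, 2.0]
for m in margins:
    print(m, zero_one_loss(m), hinge_loss(m),
          round(exponential_loss(m), 3), round(logistic_loss(m), 3))
```

Each surrogate is convex in the margin and sits on or above the 0-1 loss at every margin, which is what makes it a computationally convenient stand-in for the 0-1 loss.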


Journal of Biological Chemistry | 2008

Candidate cell and matrix interaction domains on the collagen fibril, the predominant protein of vertebrates

Shawn M. Sweeney; Joseph P. R. O. Orgel; Andrzej Fertala; Jon McAuliffe; Kevin Turner; Gloria A. Di Lullo; Steven Chen; Olga Antipova; Shiamalee Perumal; Leena Ala-Kokko; Antonella Forlino; Wayne A. Cabral; Aileen M. Barnes; Joan C. Marini; James D. San Antonio

Type I collagen, the predominant protein of vertebrates, polymerizes with type III and V collagens and non-collagenous molecules into large cable-like fibrils, yet how the fibril interacts with cells and other binding partners remains poorly understood. To help reveal insights into the collagen structure-function relationship, a data base was assembled including hundreds of type I collagen ligand binding sites and mutations on a two-dimensional model of the fibril. Visual examination of the distribution of functional sites, and statistical analysis of mutation distributions on the fibril suggest it is organized into two domains. The “cell interaction domain” is proposed to regulate dynamic aspects of collagen biology, including integrin-mediated cell interactions and fibril remodeling. The “matrix interaction domain” may assume a structural role, mediating collagen cross-linking, proteoglycan interactions, and tissue mineralization. Molecular modeling was used to superimpose the positions of functional sites and mutations from the two-dimensional fibril map onto a three-dimensional x-ray diffraction structure of the collagen microfibril in situ, indicating the existence of domains in the native fibril. Sequence searches revealed that major fibril domain elements are conserved in type I collagens through evolution and in the type II/XI collagen fibril predominant in cartilage. Moreover, the fibril domain model provides potential insights into the genotype-phenotype relationship for several classes of human connective tissue diseases, mechanisms of integrin clustering by fibrils, the polarity of fibril assembly, heterotypic fibril function, and connective tissue pathology in diabetes and aging.


Journal of the American Statistical Association | 2017

Variational Inference: A Review for Statisticians

David M. Blei; Alp Kucukelbir; Jon McAuliffe

One of the core problems of modern statistics is to approximate difficult-to-compute probability densities. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation involving the posterior density. In this article, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization. VI has been used in many applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of densities and then to find a member of that family which is close to the target density. Closeness is measured by Kullback–Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this article is to catalyze statistical research on this class of algorithms. Supplementary materials for this article are available online.
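
The "posit a family, then pick the closest member" idea in this abstract can be sketched in a toy setting. The example below assumes a known Gaussian target and a restricted family of unit-variance Gaussians, and it searches a grid of means for the member with the smallest Kullback–Leibler divergence. The target parameters, the grid, and the function name `kl_gaussian` are assumptions made for illustration. In practical VI the target is not available in closed form, so one optimizes a tractable objective rather than evaluating the divergence directly as done here.

```python
import math

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for univariate Gaussians q = N(mu_q, sigma_q^2), p = N(mu_p, sigma_p^2)."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

# Assumed target density, chosen only for this illustration.
TARGET_MU, TARGET_SIGMA = 2.0, 1.5

# Variational family: Gaussians with unit variance and a mean on a coarse grid.
candidate_means = [i / 10 for i in range(-50, 51)]

# Pick the family member closest to the target in KL divergence.
best_mu = min(candidate_means,
              key=lambda mu: kl_gaussian(mu, 1.0, TARGET_MU, TARGET_SIGMA))
best_kl = kl_gaussian(best_mu, 1.0, TARGET_MU, TARGET_SIGMA)
print(best_mu, best_kl)
```

Because the family fixes the variance at 1 while the target has standard deviation 1.5, the best member matches the target mean but leaves a strictly positive divergence: the approximation is only as good as the family allows.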


Proceedings of the National Academy of Sciences of the United States of America | 2003

Toward a protein profile of Escherichia coli: Comparison to its transcription profile

Rebecca W. Corbin; Oleg Paliy; Feng Yang; Jeffrey Shabanowitz; Mark D. Platt; Charles E. Lyons; Karen Root; Jon McAuliffe; Michael I. Jordan; Sydney Kustu; Eric Soupene; Donald F. Hunt

High-pressure liquid chromatography–tandem mass spectrometry was used to obtain a protein profile of Escherichia coli strain MG1655 grown in minimal medium with glycerol as the carbon source. By using cell lysate from only 3 × 10^8 cells, at least four different tryptic peptides were detected for each of 404 proteins in a short 4-h experiment. At least one peptide with a high reliability score was detected for 986 proteins. Because membrane proteins were underrepresented, a second experiment was performed with a preparation enriched in membranes. An additional 161 proteins were detected, of which from half to two-thirds were membrane proteins. Overall, 1,147 different E. coli proteins were identified, almost 4 times as many as had been identified previously by using other tools. The protein list was compared with the transcription profile obtained on Affymetrix GeneChips. Expression of 1,113 (97%) of the genes whose protein products were found was detected at the mRNA level. The arithmetic mean mRNA signal intensity for these genes was 3-fold higher than that for all 4,300 protein-coding genes of E. coli. Thus, GeneChip data confirmed the high reliability of the protein list, which contains about one-fourth of the proteins of E. coli. Detection of even those membrane proteins and proteins of undefined function that are encoded by the same operons (transcriptional units) encoding proteins on the list remained low.


Journal of the American Statistical Association | 2010

Variational Inference for Large-Scale Models of Discrete Choice

Michael Braun; Jon McAuliffe

Discrete choice models are commonly used by applied statisticians in numerous fields, such as marketing, economics, finance, and operations research. When agents in discrete choice models are assumed to have differing preferences, exact inference is often intractable. Markov chain Monte Carlo techniques make approximate inference possible, but the computational cost is prohibitive on the large datasets now becoming routinely available. Variational methods provide a deterministic alternative for approximation of the posterior distribution. We derive variational procedures for empirical Bayes and fully Bayesian inference in the mixed multinomial logit model of discrete choice. The algorithms require only that we solve a sequence of unconstrained optimization problems, which are shown to be convex. One version of the procedures relies on a new approximation to the variational objective function, based on the multivariate delta method. Extensive simulations, along with an analysis of real-world data, demonstrate that variational methods achieve accuracy competitive with Markov chain Monte Carlo at a small fraction of the computational cost. Thus, variational methods permit inference on datasets that otherwise cannot be analyzed without possibly adverse simplifications of the underlying discrete choice model. Appendices C through F are available as online supplemental materials.


Statistics and Computing | 2006

Nonparametric empirical Bayes for the Dirichlet process mixture model

Jon McAuliffe; David M. Blei; Michael I. Jordan

The Dirichlet process prior allows flexible nonparametric mixture modeling. The number of mixture components is not specified in advance and can grow as new data arrive. However, analyses based on the Dirichlet process prior are sensitive to the choice of the parameters, including an infinite-dimensional distributional parameter G0. Most previous applications have either fixed G0 as a member of a parametric family or treated G0 in a Bayesian fashion, using parametric prior specifications. In contrast, we have developed an adaptive nonparametric method for constructing smooth estimates of G0. We combine this method with a technique for estimating α, the other Dirichlet process parameter, that is inspired by an existing characterization of its maximum-likelihood estimator. Together, these estimation procedures yield a flexible empirical Bayes treatment of Dirichlet process mixtures. Such a treatment is useful in situations where smooth point estimates of G0 are of intrinsic interest, or where the structure of G0 cannot be conveniently modeled with the usual parametric prior families. Analysis of simulated and real-world datasets illustrates the robustness of this approach.
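
The statement that the number of mixture components can grow as new data arrive can be illustrated with the Chinese restaurant process, a standard sequential representation of the partition induced by a Dirichlet process. The sketch below is not the empirical Bayes procedure of the paper. It only simulates cluster assignments for an assumed value of the parameter α, and the function name and seed are hypothetical.

```python
import random

def chinese_restaurant_process(n_points, alpha, seed=0):
    """Simulate cluster sizes for n_points draws under a CRP with concentration alpha."""
    rng = random.Random(seed)
    cluster_sizes = []
    for n in range(n_points):
        # With n points already seated, a new cluster opens with probability
        # alpha / (n + alpha); existing cluster k is joined with probability
        # cluster_sizes[k] / (n + alpha).
        u = rng.random() * (n + alpha)
        if not cluster_sizes or u < alpha:
            cluster_sizes.append(1)
            continue
        u -= alpha
        for k, size in enumerate(cluster_sizes):
            if u < size:
                cluster_sizes[k] += 1
                break
            u -= size
        else:
            # Guard against floating-point round-off at the upper boundary.
            cluster_sizes[-1] += 1
    return cluster_sizes

sizes = chinese_restaurant_process(n_points=200, alpha=1.0)
print(len(sizes), sizes)
```

Larger values of α make new clusters more likely, which is one reason analyses based on the Dirichlet process prior are sensitive to how α is chosen.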


Journal of Bacteriology | 2005

Sulfur and Nitrogen Limitation in Escherichia coli K-12: Specific Homeostatic Responses

Prasad Gyaneshwar; Oleg Paliy; Jon McAuliffe; David L. Popham; Michael I. Jordan; Sydney Kustu

We determined global transcriptional responses of Escherichia coli K-12 to sulfur (S)- or nitrogen (N)-limited growth in adapted batch cultures and cultures subjected to nutrient shifts. Using two limitations helped to distinguish between nutrient-specific changes in mRNA levels and common changes related to the growth rate. Both homeostatic and slow growth responses were amplified upon shifts. This made detection of these responses more reliable and increased the number of genes that were differentially expressed. We analyzed microarray data in several ways: by determining expression changes after use of a statistical normalization algorithm, by hierarchical and k-means clustering, and by visual inspection of aligned genome images. Using these tools, we confirmed known homeostatic responses to global S limitation, which are controlled by the activators CysB and Cbl, and found that S limitation propagated into methionine metabolism, synthesis of FeS clusters, and oxidative stress. In addition, we identified several open reading frames likely to respond specifically to S availability. As predicted from the fact that the ddp operon is activated by NtrC, synthesis of cross-links between diaminopimelate residues in the murein layer was increased under N-limiting conditions, as was the proportion of tripeptides. Both of these effects may allow increased scavenging of N from the dipeptide D-alanine-D-alanine, the substrate of the Ddp system.


Bioinformatics | 2004

Multiple-sequence functional annotation and the generalized hidden Markov phylogeny

Jon McAuliffe; Lior Pachter; Michael I. Jordan

Motivation: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. Results: We show how GHMPs can be used to predict complete shared gene structures in multiple primate sequences. We also describe shadower, our implementation of such a prediction system. We find that shadower outperforms previously reported ab initio gene finders, including comparative human-mouse approaches, on a small sample of diverse exonic regions. Finally, we report on an empirical analysis of shadower's performance, which reveals that as few as five well-chosen species may suffice to attain maximal sensitivity and specificity in exon demarcation. Availability: A Web server is available at http://bonaire.lbl.gov/shadower


SIGKDD Explorations | 2003

Machine learning in low-level microarray analysis

Benjamin I. P. Rubinstein; Jon McAuliffe; Simon Cawley; Marimuthu Palaniswami; Kotagiri Ramamohanarao; Terence P. Speed

Machine learning and data mining have found a multitude of successful applications in microarray analysis, with gene clustering and classification of tissue samples being widely cited examples. Low-level microarray analysis -- often associated with the pre-processing stage within the microarray life-cycle -- has increasingly become an area of active research, traditionally involving techniques from classical statistics. This paper explores opportunities for the application of machine learning and data mining methods to several important low-level microarray analysis problems: monitoring gene expression, transcript discovery, genotyping and resequencing. Relevant methods and ideas from the machine learning community include semi-supervised learning, learning from heterogeneous data, and incremental learning.


PLOS Genetics | 2010

Long- and Short-Term Selective Forces on Malaria Parasite Genomes

Sanne Nygaard; Alexander Braunstein; Gareth Malsen; Stijn van Dongen; Paul P. Gardner; Anders Krogh; Thomas D. Otto; Arnab Pain; Matthew Berriman; Jon McAuliffe; Emmanouil T. Dermitzakis; Daniel C. Jeffares

Plasmodium parasites, the causal agents of malaria, result in more than 1 million deaths annually. Plasmodium are unicellular eukaryotes with small ∼23 Mb genomes encoding ∼5200 protein-coding genes. The protein-coding genes comprise about half of these genomes. Although evolutionary processes have a significant impact on malaria control, the selective pressures within Plasmodium genomes are poorly understood, particularly in the non-protein-coding portion of the genome. We use evolutionary methods to describe selective processes in both the coding and non-coding regions of these genomes. Based on genome alignments of seven Plasmodium species, we show that protein-coding, intergenic and intronic regions are all subject to purifying selection and we identify 670 conserved non-genic elements. We then use genome-wide polymorphism data from P. falciparum to describe short-term selective processes in this species and identify some candidate genes for balancing (diversifying) selection. Our analyses suggest that there are many functional elements in the non-genic regions of these genomes and that adaptive evolution has occurred more frequently in the protein-coding regions of the genome.

Collaboration


Dive into Jon McAuliffe's collaborations.

Top Co-Authors

Jeffrey Regier

University of California

David J. Schlegel

Lawrence Berkeley National Laboratory

Prabhat

Lawrence Berkeley National Laboratory

Lior Pachter

University of California
