Asa Ben-Hur | Researchain

Publication

Featured research published by Asa Ben-Hur.

Intelligent Systems in Molecular Biology | 2005

Kernel methods for predicting protein-protein interactions

Asa Ben-Hur; William Stafford Noble

MOTIVATION: Despite advances in high-throughput methods for discovering protein-protein interactions, the interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions. RESULTS: We present a kernel method for predicting protein-protein interactions using a combination of data sources, including protein sequences, Gene Ontology annotations, local properties of the network, and homologous interactions in other species. Whereas protein kernels proposed in the literature provide a similarity between single proteins, prediction of interactions requires a kernel between pairs of proteins. We propose a pairwise kernel that converts a kernel between single proteins into a kernel between pairs of proteins, and we illustrate the kernel's effectiveness in conjunction with a support vector machine classifier. Furthermore, we obtain improved performance by combining several sequence-based kernels based on k-mer frequency, motif and domain content, and by further augmenting the pairwise sequence kernel with features that are based on other sources of data. We apply our method to predict physical interactions in yeast using data from the BIND database. At a false positive rate of 1%, the classifier retrieves close to 80% of a set of trusted interactions. We thus demonstrate the ability of our method to make accurate predictions despite the sizeable fraction of false positives that are known to exist in interaction databases. AVAILABILITY: The classification experiments were performed using PyML, available at http://pyml.sourceforge.net. Data are available at http://noble.gs.washington.edu/proj/sppi.
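
To make the idea of lifting a single-protein kernel to protein pairs concrete, here is a minimal sketch. It assumes a simple k-mer count kernel as the base kernel and one symmetrised product form for the pairwise kernel. The function names, the default k = 3, and the exact product form are illustrative assumptions and are not taken from the abstract or from PyML.

```python
from collections import Counter


def spectrum_kernel(seq_a: str, seq_b: str, k: int = 3) -> int:
    """k-mer kernel: dot product of the k-mer count vectors of two sequences."""
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


def pairwise_kernel(pair_a, pair_b, base=spectrum_kernel):
    """Kernel between two protein pairs built from a single-protein kernel.

    Assumed form: K(a1,b1)*K(a2,b2) + K(a1,b2)*K(a2,b1), which does not
    depend on the order in which the two proteins of a pair are listed.
    """
    (a1, a2), (b1, b2) = pair_a, pair_b
    return base(a1, b1) * base(a2, b2) + base(a1, b2) * base(a2, b1)
```

Summing both pairings is what makes the value independent of the order of the proteins within a pair, which matters because an interaction has no natural orientation.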

Pacific Symposium on Biocomputing | 2001

A stability based method for discovering structure in clustered data

Asa Ben-Hur; André Elisseeff; Isabelle Guyon

We present a method for visually and quantitatively assessing the presence of structure in clustered data. The method exploits measurements of the stability of clustering solutions obtained by perturbing the data set. Stability is characterized by the distribution of pairwise similarities between clusterings obtained from subsamples of the data. High pairwise similarities indicate a stable clustering pattern. The method can be used with any clustering algorithm; it provides a means of rationally defining an optimum number of clusters, and can also detect the lack of structure in data. We show results on artificial and microarray data using a hierarchical clustering algorithm.
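
The similarity between two clusterings of overlapping subsamples can be sketched as follows. This is an illustration only: it assumes the Jaccard coefficient over co-clustered point pairs as the similarity measure, and the function name and the dictionary representation (point id mapped to cluster label) are hypothetical.

```python
from itertools import combinations


def clustering_similarity(labels_a: dict, labels_b: dict) -> float:
    """Jaccard similarity of co-membership, computed on points in both subsamples.

    Each argument maps a point id to its cluster label. Label names need not
    match between the two clusterings; only co-membership is compared.
    """
    shared = sorted(set(labels_a) & set(labels_b))
    both = either = 0
    for p, q in combinations(shared, 2):
        same_a = labels_a[p] == labels_a[q]
        same_b = labels_b[p] == labels_b[q]
        both += same_a and same_b      # pair co-clustered in both clusterings
        either += same_a or same_b     # pair co-clustered in at least one
    return both / either if either else 1.0
```

Repeating this over many pairs of subsamples, for each candidate number of clusters, gives the distribution of similarities that the abstract describes; values concentrated near 1 would indicate a stable partition.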

PLOS Computational Biology | 2008

Support Vector Machines and Kernels for Computational Biology

Asa Ben-Hur; Cheng Soon Ong; Sören Sonnenburg; Bernhard Schölkopf; Gunnar Rätsch

The increasing wealth of biological data coming from a large variety of platforms and the continued development of new high-throughput methods for probing biological systems require increasingly sophisticated computational approaches. Putting all these data in simple-to-use databases is a first step; but realizing the full potential of the data requires algorithms that automatically extract regularities from the data, which can then lead to biological insight. Many of the problems in computational biology are in the form of prediction: starting from prediction of a gene's structure, prediction of its function, interactions, and role in disease. Support vector machines (SVMs) and related kernel methods are extremely good at solving such problems [1]–[3]. SVMs are widely used in computational biology due to their high accuracy, their ability to deal with high-dimensional and large datasets, and their flexibility in modeling diverse sources of data [2], [4]–[6]. The simplest form of a prediction problem is binary classification: trying to discriminate between objects that belong to one of two categories: positive (+1) or negative (−1). SVMs use two key concepts to solve this problem: large margin separation and kernel functions. The idea of large margin separation can be motivated by classification of points in two dimensions (see Figure 1). A simple way to classify the points is to draw a straight line and call points lying on one side positive and on the other side negative. If the two sets are well separated, one would intuitively draw the separating line such that it is as far as possible away from the points in both sets (see Figures 2 and 3). This intuitive choice captures the idea of large margin separation, which is mathematically formulated in the section Classification with Large Margin.

Figure 1: A linear classifier separating two classes of points (squares and circles) in two dimensions. The decision boundary divides the space into two sets depending on the sign of f(x) = ⟨w, x⟩ + b. The grayscale level represents the value of the discriminant function f(x): dark for low values and a light shade for high values.
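
The discriminant function in the caption, f(x) = ⟨w, x⟩ + b, and the notion of margin can be illustrated with a short sketch. The function names are hypothetical, and the code only evaluates a given hyperplane; it does not train an SVM or find the maximum-margin solution.

```python
import math


def discriminant(w, b, x):
    """f(x) = <w, x> + b for a weight vector w and bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Predict +1 or -1 from the sign of f(x)."""
    return 1 if discriminant(w, b, x) >= 0 else -1


def geometric_margin(w, b, points, labels):
    """Smallest signed distance y * f(x) / ||w|| over the labelled points.

    A positive value means every point lies on the correct side of the line.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * discriminant(w, b, x) / norm for x, y in zip(points, labels))
```

Among all lines that separate the two classes, large margin separation prefers the one for which this smallest distance is greatest.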

Methods of Molecular Biology | 2010

A User's Guide to Support Vector Machines

Asa Ben-Hur; Jason Weston

The Support Vector Machine (SVM) is a widely used classifier in bioinformatics. Obtaining the best results with SVMs requires an understanding of their workings and the various ways a user can influence their accuracy. We provide the user with a basic understanding of the theory behind SVMs and focus on their use in practice. We describe the effect of the SVM parameters on the resulting classifier, how to select good values for those parameters, data normalization, factors that affect training time, and software for training SVMs.

Proceedings of the National Academy of Sciences of the United States of America | 2006

Genotypic predictors of human immunodeficiency virus type 1 drug resistance

Soo-Yon Rhee; Jonathan Taylor; Gauhar Wadhera; Asa Ben-Hur; Douglas L. Brutlag; Robert W. Shafer

Understanding the genetic basis of HIV-1 drug resistance is essential to developing new antiretroviral drugs and optimizing the use of existing drugs. This understanding, however, is hampered by the large numbers of mutation patterns associated with cross-resistance within each antiretroviral drug class. We used five statistical learning methods (decision trees, neural networks, support vector regression, least-squares regression, and least angle regression) to relate HIV-1 protease and reverse transcriptase mutations to in vitro susceptibility to 16 antiretroviral drugs. Learning methods were trained and tested on a public data set of genotype–phenotype correlations by 5-fold cross-validation. For each learning method, four mutation sets were used as input features: a complete set of all mutations in ≥2 sequences in the data set, the 30 most common data set mutations, an expert panel mutation set, and a set of nonpolymorphic treatment-selected mutations from a public database linking protease and reverse transcriptase sequences to antiretroviral drug exposure. The nonpolymorphic treatment-selected mutations led to the best predictions: 80.1% accuracy at classifying sequences as susceptible, low/intermediate resistant, or highly resistant. Least angle regression predicted susceptibility significantly better than other methods when using the complete set of mutations. The three regression methods provided consistent estimates of the quantitative effect of mutations on drug susceptibility, identifying nearly all previously reported genotype–phenotype associations and providing strong statistical support for many new associations. Mutation regression coefficients showed that, within a drug class, cross-resistance patterns differ for different mutation subsets and that cross-resistance has been underestimated.
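
The 5-fold cross-validation protocol mentioned above can be sketched in a few lines. This is a generic illustration, not the study's code: the function name, the shuffling seed, and the interleaved fold assignment are assumptions.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train, test) index lists so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # fixed seed for repeatability
    folds = [indices[i::k] for i in range(k)]     # k folds of near-equal size
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Each learning method would be fitted on the `train` indices and evaluated on the held-out `test` indices, with the five test scores then averaged.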

Methods of Molecular Biology | 2003

Detecting stable clusters using principal component analysis.

Asa Ben-Hur; Isabelle Guyon

Clustering is one of the most commonly used tools in the analysis of gene expression data (1, 2). The usage in grouping genes is based on the premise that co-expression is a result of co-regulation. It is thus a preliminary step in extracting gene networks and inference of gene function (3, 4). Clustering of experiments can be used to discover novel phenotypic aspects of cells and tissues (3, 5, 6), including sensitivity to drugs (7), and can also detect artifacts of experimental conditions (8). Clustering and its applications in biology are presented in greater detail in the chapter by Zhao and Karypis (see also (9)). While we focus on gene expression data in this chapter, the methodology presented here is applicable for other types of data as well. Clustering is a form of unsupervised learning, i.e. no information on the class variable is assumed, and the objective is to find the “natural” groups in the data. However, most clustering algorithms generate a clustering even if the data has no inherent cluster structure, so external validation tools are required. Given a set of partitions of the data into an increasing number of clusters (e.g. by a hierarchical clustering algorithm, or k-means), such a validation tool will tell the user the number of clusters in the data (if any). Many methods have been proposed in the literature to address this problem (10–15). Recent studies have shown the advantages of sampling-based methods (12, 14). These methods are based on the idea that when a partition has captured the structure in the data, this partition should be stable with respect to perturbation of the data. Bittner et al. (16) used a similar approach to validate clusters representing gene expression of melanoma patients. The emergence of cluster structure depends on several choices: data representation and normalization, the choice of a similarity measure and clustering algorithm.
In this chapter we extend the stability-based validation of cluster structure, and propose stability as a figure of merit that is useful for comparing clustering solutions, thus helping in making these choices. We use this framework to demonstrate the ability of Principal Component Analysis (PCA) to extract features relevant to the cluster structure. We use stability as a tool for simultaneously choosing the number of principal components and the number of clusters; we compare the performance of different similarity measures and normalization schemes. The approach is demonstrated through a case study of yeast gene expression data from Eisen et al. (1). For yeast, a functional classification of a large number of genes is known, and we use this classification for validating the results produced by clustering. A method for comparing clustering solutions specifically applicable to gene expression data was introduced in (17). However, it cannot be used to choose the number of clusters, and is not directly applicable in choosing the number of principal components. The results of clustering are easily corrupted by the addition of noise: even a few

BMC Bioinformatics | 2006

Choosing negative examples for the prediction of protein-protein interactions

Asa Ben-Hur; William Stafford Noble

The protein-protein interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions. This need has prompted the development of a number of methods for predicting protein-protein interactions based on various sources of data and methodologies. The common method for choosing negative examples for training a predictor of protein-protein interactions is based on annotations of cellular localization, and the observation that pairs of proteins that have different localization patterns are unlikely to interact. While this method leads to high quality sets of non-interacting proteins, we find that this choice can lead to biased estimates of prediction accuracy, because the constraints placed on the distribution of the negative examples make the task easier. The effects of this bias are demonstrated in the context of both sequence-based and non-sequence based features used for predicting protein-protein interactions.
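
The localization-based selection of negative examples described above can be sketched as follows. The data layout (a dictionary from protein id to a set of compartments, and known interactions stored as frozensets) and the function name are assumptions made for illustration, and "different localization patterns" is interpreted here as having no compartment in common.

```python
from itertools import combinations


def localization_negatives(localization: dict, positives: set) -> list:
    """Return protein pairs that share no compartment and are not known to interact.

    localization: protein id -> set of annotated compartments.
    positives: set of frozenset({a, b}) for known interacting pairs.
    """
    negatives = []
    for a, b in combinations(sorted(localization), 2):
        if frozenset((a, b)) in positives:
            continue  # never use a known interaction as a negative
        if localization[a].isdisjoint(localization[b]):
            negatives.append((a, b))
    return negatives
```

Every pair this returns spans two different compartments, so a classifier could separate these negatives from the positives using localization-related signal alone; this is the kind of constraint on the negative set that the abstract warns can inflate accuracy estimates.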

International Conference on Pattern Recognition | 2000

A support vector clustering method

Asa Ben-Hur; D. Horn; Hava T. Siegelmann; Vladimir Vapnik

We present a novel kernel method for data clustering using a description of the data by support vectors. The kernel reflects a projection of the data points from data space to a high dimensional feature space. Cluster boundaries are defined as spheres in feature space, which represent complex geometric shapes in data space. We utilize this geometric representation of the data to construct a simple clustering algorithm.

Genome Biology | 2012

SpliceGrapher: detecting patterns of alternative splicing from RNA-Seq data in the context of gene models and EST data

Mark F. Rogers; Julie Thomas; Asa Ben-Hur

We propose a method for predicting splice graphs that enhances curated gene models using evidence from RNA-Seq and EST alignments. Results obtained using RNA-Seq experiments in Arabidopsis thaliana show that predictions made by our SpliceGrapher method are more consistent with current gene models than predictions made by TAU and Cufflinks. Furthermore, analysis of plant and human data indicates that the machine learning approach used by SpliceGrapher is useful for discriminating between real and spurious splice sites, and can improve the reliability of detection of alternative splicing. SpliceGrapher is available for download at http://SpliceGrapher.sf.net.

Proceedings of the National Academy of Sciences of the United States of America | 2012

De novo design of synthetic prion domains

James A. Toombs; Michelina Petri; Kacy R. Paul; Grace Y. Kan; Asa Ben-Hur; Eric D. Ross

Prions are important disease agents and epigenetic regulatory elements. Prion formation involves the structural conversion of proteins from a soluble form into an insoluble amyloid form. In many cases, this structural conversion is driven by a glutamine/asparagine (Q/N)-rich prion-forming domain. However, our understanding of the sequence requirements for prion formation and propagation by Q/N-rich domains has been insufficient for accurate prion propensity prediction or prion domain design. By focusing exclusively on amino acid composition, we have developed a prion aggregation prediction algorithm (PAPA), specifically designed to predict prion propensity of Q/N-rich proteins. Here, we show not only that this algorithm is far more effective than traditional amyloid prediction algorithms at predicting prion propensity of Q/N-rich proteins, but remarkably, also that PAPA is capable of rationally designing protein domains that function as prions in vivo.
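
As a small illustration of what working from amino acid composition alone means, the sketch below computes the composition of a sequence and its combined Q/N fraction. It is not PAPA: the per-residue propensity scores that PAPA relies on are not given in the abstract and are not reproduced here, and the function names are hypothetical.

```python
from collections import Counter


def composition(seq: str) -> dict:
    """Fraction of each amino acid in a sequence, ignoring residue order."""
    counts = Counter(seq.upper())
    return {aa: n / len(seq) for aa, n in counts.items()}


def qn_fraction(seq: str) -> float:
    """Combined fraction of glutamine (Q) and asparagine (N) residues."""
    comp = composition(seq)
    return comp.get("Q", 0.0) + comp.get("N", 0.0)
```

A composition-based predictor would combine such fractions with residue-specific weights; because order is discarded, any shuffling of a domain's residues yields the same composition.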
