Golan Yona
Cornell University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Golan Yona.
Bioinformatics | 2001
Gill Bejerano; Golan Yona
MOTIVATION We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Basic biological considerations such as amino acid background probabilities, and amino acids substitution probabilities can be incorporated to improve performance. RESULTS The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects much more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster.
Proteins | 1999
Golan Yona; Nathan Linial; Michal Linial
We investigate the space of all protein sequences in search of clusters of related proteins. Our aim is to automatically detect these sets, and thus obtain a classification of all protein sequences. Our analysis, which uses standard measures of sequence similarity as applied to an all‐vs.‐all comparison of SWISSPROT, gives a very conservative initial classification based on the highest scoring pairs. The many classes in this classification correspond to protein subfamilies. Subsequently we merge the subclasses using the weaker pairs in a two‐phase clustering algorithm. The algorithm makes use of transitivity to identify homologous proteins; however, transitivity is applied restrictively in an attempt to prevent unrelated proteins from clustering together. This process is repeated at varying levels of statistical significance. Consequently, a hierarchical organization of all proteins is obtained.
Bioinformatics | 2004
Niranjan Nagarajan; Golan Yona
MOTIVATION We describe a novel method for detecting the domain structure of a protein from sequence information alone. The method is based on analyzing multiple sequence alignments that are derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence and are combined into a single predictor using a neural network. The output is further smoothed and post-processed using a probabilistic model to predict the most likely transition positions between domains. RESULTS The method was assessed using the domain definitions in SCOP and CATH for proteins of known structure and was compared with several other existing methods. Our method performs well both in terms of accuracy and sensitivity. It improves significantly over the best methods available, even some of the semi-manual ones, while being fully automatic. Our method can also be used to suggest and verify domain partitions based on structural data. A few examples of predicted domain definitions and alternative partitions, as suggested by our method, are also discussed. AVAILABILITY An online domain-prediction server is available at http://biozon.org/tools/domains/
research in computational molecular biology | 1999
Gill Bejerano; Golan Yona
proteins which await analysis. We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Incorporating basic biological considerations such as amino acid background probabilities, and amino acids substitution probabilities can improve the performance in some cases. The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on one of the state of the art databases of protein families, namely, the Pfam database of HMMs, with satisfactory performance. Generally, the existing approaches can be divided into those based on short conserved motifs (e.g. [Bairoch et al. 1997, Attwood et al. 1998, Henikoff & Henikoff 19911) and those which are based on whole domains (e.g. [Sonnhammer & Kahn 1994, Sonnhammer et al. 19981). The manually defined patterns in PROSITE have served as an excellent seed for several such works. The methods used to represent these motifs and domains vary, and among the most popular forms are the consensus patterns [Bairoch et al. 1997, Attwood et al. 19981, the position-specific scoring matrices (profiles) [Gribskov et al. 1987, Henikoff & Henikoff 19911 and the HMMs [Krogh et al. 19961. These forms differ in their mathematical complexity, as well as in their sensitivity/selectivity.
Bioinformatics | 2006
Golan Yona; William Dirks; Shafquat Rahman; David M. Lin
It is commonly accepted that genes with similar expression profiles are functionally related. However, there are many ways one can measure the similarity of expression profiles, and it is not clear a priori what is the most effective one. Moreover, so far no clear distinction has been made as for the type of the functional link between genes as suggested by microarray data. Similarly expressed genes can be part of the same complex as interacting partners; they can participate in the same pathway without interacting directly; they can perform similar functions; or they can simply have similar regulatory sequences. Here we conduct a study of the notion of functional link as implied from expression data. We analyze different similarity measures of gene expression profiles and assess their usefulness and robustness in detecting biological relationships by comparing the similarity scores with results obtained from databases of interacting proteins, promoter signals and cellular pathways, as well as through sequence comparisons. We also introduce variations on similarity measures that are based on statistical analysis and better discriminate genes which are functionally nearby and faraway. Our tools can be used to assess other similarity measures for expression profiles, and are accessible at biozon.org/tools/expression/
Progress in Biophysics & Molecular Biology | 2000
Michal Linial; Golan Yona
As the number of complete genomes that have been sequenced keeps growing, unknown areas of the protein space are revealed and new horizons open up. Most of this information will be fully appreciated only when the structural information about the encoded proteins becomes available. The goal of structural genomics is to direct large-scale efforts of protein structure determination, so as to increase the impact of these efforts. This review focuses on current approaches in structural genomics aimed at selecting representative proteins as targets for structure determination. We will discuss the concept of representative structures/folds, the current methodologies for identifying those proteins, and computational techniques for identifying proteins which are expected to adopt new structural folds.
Nucleic Acids Research | 2006
Aaron Birkland; Golan Yona
Biological entities are strongly related and mutually dependent on each other. Therefore, there is a growing need to corroborate and integrate data from different resources and aspects of biological systems in order to analyze them effectively. Biozon is a unified biological database that integrates heterogeneous data types such as proteins, structures, domain families, protein–protein interactions and cellular pathways, and establishes the relationships between them. All data are integrated on to a single graph schema centered around the non-redundant set of biological objects that are shared by each source. This integration results in a highly connected graph structure that provides a more complete picture of the known context of a given object that cannot be determined from any one source. Currently, Biozon integrates roughly 2 million protein sequences, 42 million DNA or RNA sequences, 32 000 protein structures, 150 000 interactions and more from sources such as GenBank, UniProt, Protein Data Bank (PDB) and BIND. Biozon augments source data with locally derived data such as 5 billion pairwise protein alignments and 8 million structural alignments. The user may form complex cross-type queries on the graph structure, add similarity relations to form fuzzy queries and rank the results based on analysis of the edge structure similar to Google PageRank, online at .
Methods of Molecular Biology | 2009
Umar Syed; Golan Yona
Enzymes play central roles in metabolic pathways, and the prediction of metabolic pathways in newly sequenced genomes usually starts with the assignment of genes to enzymatic reactions. However, genes with similar catalytic activity are not necessarily similar in sequence, and therefore the traditional sequence similarity-based approach often fails to identify the relevant enzymes, thus hindering efforts to map the metabolome of an organism.Here we study the direct relationship between basic protein properties and their function. Our goal is to develop a new tool for functional prediction (e.g., prediction of Enzyme Commission number), which can be used to complement and support other techniques based on sequence or structure information. In order to define this mapping we collected a set of 453 features and properties that characterize proteins and are believed to be related to structural and functional aspects of proteins. We introduce a mixture model of stochastic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam classification of proteins, which is based on sequence, and the EC classification, which is based on enzymatic function. The model is very effective in learning highly diverged protein families or families that are not defined on the basis of sequence. The resulting tree structures highlight the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
research in computational molecular biology | 2003
Umar Syed; Golan Yona
We study the direct relationship between basic protein properties and their function. Our goal is to develop a new tool for functional prediction that can be used to complement and support other techniques based on sequence or structure information. In order to define this new measure of similarity between proteins we collected a set of 453 features and properties that characterize proteins and are believed to be correlated and related to structural and functional aspects of proteins. Among these properties are the composition and fraction of different groups of amino acids, predicted secondary structure content, molecular weight, average hydrophobicity, isoelectric point and others, as well as a set of properties that are extracted from database records of known protein sequences, such as subcellular location, tissue specificity, and others.We introduce the mixture model of probabilistic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam sequence-based classification of proteins and the EC classification of enzyme families. The model is very effective in learning highly diverged protein families or families that are not defined based on sequence. The resulting tree structure indicates the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
Machine Learning | 2002
Shlomo Dubnov; Ran El-Yaniv; Yoram Gdalyahu; Elad Schneidman; Naftali Tishby; Golan Yona
We present a novel pairwise clustering method. Given a proximity matrix of pairwise relations (i.e. pairwise similarity or dissimilarity estimates) between data points, our algorithm extracts the two most prominent clusters in the data set. The algorithm, which is completely nonparametric, iteratively employs a two-step transformation on the proximity matrix. The first step of the transformation represents each point by its relation to all other data points, and the second step re-estimates the pairwise distances using a statistically motivated proximity measure on these representations. Using this transformation, the algorithm iteratively partitions the data points, until it finally converges to two clusters. Although the algorithm is simple and intuitive, it generates a complex dynamics of the proximity matrices. Based on this bipartition procedure we devise a hierarchical clustering algorithm, which employs the basic bipartition algorithm in a straightforward divisive manner. The hierarchical clustering algorithm copes with the model validation problem using a general cross-validation approach, which may be combined with various hierarchical clustering methods.We further present an experimental study of this algorithm. We examine some of the algorithms properties and performance on some synthetic and ‘standard’ data sets. The experiments demonstrate the robustness of the algorithm and indicate that it generates a good clustering partition even when the data is noisy or corrupted.