Sorin Istrail | Researchain

Publication

Featured research published by Sorin Istrail.

Foundations of Computer Science | 1999

Algorithmic aspects of protein structure similarity

Deborah Goldman; Sorin Istrail; Christos H. Papadimitriou

We show that calculating contact map overlap (a measure of similarity of protein structures) is NP-hard, but can be solved in polynomial time for several interesting and relevant special cases. We identify an important special case of this problem corresponding to self-avoiding walks, and prove a decomposition theorem and a corollary approximation result for this special case. These are the first approximation algorithms with guaranteed error bounds, and NP-completeness results in the literature in the area of protein structure alignment/fold recognition for measures of structure similarity of practical interest.

Proceedings of the National Academy of Sciences of the United States of America | 2004

Whole-genome shotgun assembly and comparison of human genome assemblies

Sorin Istrail; Granger Sutton; Liliana Florea; Aaron L. Halpern; Clark M. Mobarry; Ross A. Lippert; Brian Walenz; Hagit Shatkay; Ian M. Dew; Jason R. Miller; Michael Flanigan; Nathan Edwards; Randall Bolanos; Daniel Fasulo; Bjarni V. Halldórsson; Sridhar Hannenhalli; Russell Turner; Shibu Yooseph; Fu Lu; Deborah Nusskern; Bixiong Shue; Xiangqun Holly Zheng; Fei Zhong; Arthur L. Delcher; Daniel H. Huson; Saul Kravitz; Laurent Mouchard; Knut Reinert; Karin A. Remington; Andrew G. Clark

We report a whole-genome shotgun assembly (called WGSA) of the human genome generated at Celera in 2001. The Celera-generated shotgun data set consisted of 27 million sequencing reads organized in pairs by virtue of end-sequencing 2-kbp, 10-kbp, and 50-kbp inserts from shotgun clone libraries. The quality-trimmed reads covered the genome 5.3 times, and the inserts from which pairs of reads were obtained covered the genome 39 times. With the nearly complete human DNA sequence [National Center for Biotechnology Information (NCBI) Build 34] now available, it is possible to directly assess the quality, accuracy, and completeness of WGSA and of the first reconstructions of the human genome reported in two landmark papers in February 2001 [Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304–1351; International Human Genome Sequencing Consortium (2001) Nature 409, 860–921]. The analysis of WGSA shows 97% order and orientation agreement with NCBI Build 34, where most of the 3% of sequence out of order is due to scaffold placement problems as opposed to assembly errors within the scaffolds themselves. In addition, WGSA fills some of the remaining gaps in NCBI Build 34. The early genome sequences all covered about the same amount of the genome, but they did so in different ways. The Celera results provide more order and orientation, and the consortium sequence provides better coverage of exact and nearly exact repeats.

Journal of Computational Biology | 2004

1001 optimal PDB structure alignments: integer programming methods for finding the maximum contact map overlap.

Alberto Caprara; Robert D. Carr; Sorin Istrail; Giuseppe Lancia; Brian Walenz

Protein structure comparison is a fundamental problem for structural genomics, with applications to drug design, fold prediction, protein clustering, and evolutionary studies. Despite its importance, there are very few rigorous methods and widely accepted similarity measures known for this problem. In this paper we describe developments over the last few years in the study of an emerging measure, the contact map overlap (CMO), for protein structure comparison. A contact map is a list of pairs of residues which lie in three-dimensional proximity in the protein's native fold. Although this measure is in principle computationally hard to optimize, we show how it can in fact be computed with great accuracy for related proteins by integer linear programming techniques. These methods have the advantage of providing certificates of near-optimality by means of upper bounds to the optimal alignment value. We also illustrate effective heuristics, such as local search and genetic algorithms. We were able to obtain for the first time optimal alignments for large similar proteins (about 1,000 residues and 2,000 contacts) and used the CMO measure to cluster proteins in families. The clusters obtained were compared to SCOP classification in order to validate the measure. Extensive computational experiments showed that alignments which are off by at most 10% from the optimal value can be computed in a short time. Further experiments showed how this measure reacts to the choice of the threshold defining a contact and how to choose this threshold in a sensible way.
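
As an illustration of the definitions in this abstract, the following minimal Python sketch builds a contact map from residue coordinates and scores a given alignment by counting shared contacts. It is not the integer programming method of the paper: it only evaluates the overlap of one fixed alignment. The function names, the `min_separation` parameter (skipping chain neighbours), and the toy coordinates are assumptions made for this example.

```python
from itertools import combinations
from math import dist


def contact_map(coords, threshold, min_separation=2):
    """Return the set of residue pairs (i, j), i < j, within `threshold`.

    `min_separation` is an assumption of this sketch: pairs closer than
    this along the chain are skipped, since chain neighbours are always near.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if j - i >= min_separation and dist(coords[i], coords[j]) <= threshold
    }


def overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A that the alignment maps onto contacts of B.

    `alignment` maps residue indices of A to residue indices of B.
    """
    shared = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            u, v = alignment[i], alignment[j]
            if (min(u, v), max(u, v)) in contacts_b:
                shared += 1
    return shared


# Toy coordinates (hypothetical): a square turn, and the same turn plus a tail.
coords_a = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
coords_b = coords_a + [(0, 2, 0)]
map_a = contact_map(coords_a, threshold=1.0)
map_b = contact_map(coords_b, threshold=1.0)
identity = {i: i for i in range(4)}
shifted = {i: i + 1 for i in range(4)}
print(overlap(map_a, map_b, identity), overlap(map_a, map_b, shifted))
```

The contact map overlap problem asks for the alignment that maximizes this count, which is the hard optimization step the abstract refers to.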

Journal of Computational Biology | 1996

Fast protein folding in the hydrophobic-hydrophilic model within three-eighths of optimal.

William E. Hart; Sorin Istrail

We present performance-guaranteed approximation algorithms for the protein folding problem in the hydrophobic-hydrophilic model (Dill, 1985). Our algorithms are the first approximation algorithms in the literature with guaranteed performance for this model (Dill, 1994). The hydrophobic-hydrophilic model abstracts the dominant force of protein folding: the hydrophobic interaction. The protein is modeled as a chain of amino acids of length n that are of two types: H (hydrophobic, i.e., nonpolar) and P (hydrophilic, i.e., polar). Although this model is a simplification of more complex protein folding models, the protein folding structure prediction problem is notoriously difficult for this model. Our algorithms have linear (3n) or quadratic time and achieve a three-dimensional protein conformation that has a guaranteed free energy no worse than three-eighths of optimal. This result answers the open problem of Ngo et al. (1994) about the possible existence of an efficient approximation algorithm with guaranteed performance for protein structure prediction in any well-studied model of protein folding. By achieving speed and near-optimality simultaneously, our algorithms rigorously capture salient features of the recently proposed framework of protein folding by Sali et al. (1994). Equally important, the final conformations of our algorithms have significant secondary structure (antiparallel sheets, beta-sheets, compact hydrophobic core). Furthermore, hypothetical folding pathways can be described for our algorithms that fit within the framework of diffusion-collision protein folding proposed by Karplus and Weaver (1979). Computational limitations of algorithms that compute the optimal conformation have restricted their applicability to short sequences (length ≤ 90). Because our algorithms trade computational accuracy for speed, they can construct near-optimal conformations in linear time for sequences of any size.
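
To make the model concrete, here is a minimal Python sketch of how a conformation can be scored in the hydrophobic-hydrophilic model. It assumes the common convention that each pair of H residues that are lattice neighbours but not consecutive in the chain contributes -1 to the energy, and it assumes a cubic lattice. It scores a given fold only and is not the approximation algorithm of the paper. The function name and example fold are hypothetical.

```python
def hp_energy(sequence, positions):
    """Energy of an HP chain folded on the cubic lattice.

    sequence:  string over {'H', 'P'}
    positions: list of integer (x, y, z) lattice points, one per residue
    Assumed convention: -1 per H-H lattice contact between residues that
    are not consecutive in the chain.
    """
    if len(sequence) != len(positions):
        raise ValueError("one lattice point is required per residue")
    if len(set(positions)) != len(positions):
        raise ValueError("the walk must be self-avoiding")
    for a, b in zip(positions, positions[1:]):
        if sum(abs(x - y) for x, y in zip(a, b)) != 1:
            raise ValueError("consecutive residues must be lattice neighbours")

    index_of = {p: i for i, p in enumerate(positions)}
    contacts = 0
    # Looking only in the three positive directions counts each pair once.
    for i, (x, y, z) in enumerate(positions):
        if sequence[i] != "H":
            continue
        for q in ((x + 1, y, z), (x, y + 1, z), (x, y, z + 1)):
            j = index_of.get(q)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


# A four-residue chain folded into a square turn brings its two ends together.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
line = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
print(hp_energy("HPPH", square), hp_energy("HPPH", line))
```

Under this convention, a guarantee of three-eighths of optimal means the fold produced has at least three-eighths as many such H-H contacts as the best possible fold.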

Research in Computational Molecular Biology | 2001

101 optimal PDB structure alignments: a branch-and-cut algorithm for the maximum contact map overlap problem

Giuseppe Lancia; Robert D. Carr; Brian Walenz; Sorin Istrail

Structure comparison is a fundamental problem for structural genomics. A variety of structure comparison methods were proposed, and several protein structure classification servers, e.g., SCOP, DALI, CATH, were designed based on them and are extensively used in practice. This area of research continues to be very active, being energized biennially by the CASP folding competitions, but despite the extraordinary international research effort devoted to it, progress is slow. A fundamental dimension of this bottleneck is the absence of rigorous algorithmic methods. A recent excellent survey on structure comparison by Taylor et al. [23] records the state of the art of the area: "In structure comparison, we do not even have an algorithm that guarantees an optimal answer for pairs of structures …" In this paper we provide the first rigorous algorithm for structure comparison. Our method is based on developing an effective integer linear programming (IP) formulation of protein structure contact map overlap (CMO), and a branch-and-cut strategy that employs lower-bounding heuristics at the branch nodes. Our algorithms identified a gallery of optimal and near-optimal structure alignments for pairs of proteins from the Protein Data Bank with up to 80 amino acids and about 150 contacts each (problems of instance size of about 300). Although these sizes also reflect our current limitations, these are the first provable optimal and near-optimal algorithms in the literature for a measure of structure similarity which sees extensive practical use. At the heart of our success in finding optimal alignments is a reduction of the CMO optimization to the maximum independent set (MIS) problem on special graphs. For CMO instances of size 300, the corresponding MIS graph instance contains about 10,000 nodes. While our algorithms are able to solve MIS problems of these sizes to optimality, the known optimal algorithms for the MIS on general graphs can at present only solve instances with up to a few hundred nodes. This is the first effective use of IP methods in protein structure comparison; the biomolecular structure literature contains only one other effective IP method devoted to RNA comparison, due to Lenhof et al. [18]. The hybrid heuristic approach that worked well for providing lower bounds in the branch-and-cut algorithm was tried on large proteins in a test set suggested by Jeffrey Skolnick. It involved 33 proteins classified into four families: Flavodoxin-like fold CheY-related, Plastocyanin, TIM Barrel, and Ferritin. Out of the set of all 528 pairwise structure alignments, we have validated the clustering with a 98.7% accuracy (1.3% false negatives and 0% false positives).

Journal of Computational Biology | 1997

Robust Proofs of NP-Hardness for Protein Folding: General Lattices and Energy Potentials

William E. Hart; Sorin Istrail

This paper addresses the robustness of intractability arguments for simplified models of protein folding that use lattices to discretize the space of conformations that a protein can assume. We present two generalized NP-hardness results. The first concerns the intractability of protein folding independent of the lattice used to define the discrete protein-folding model. We consider a previously studied model and prove that for any reasonable lattice the protein-structure prediction problem is NP-hard. The second hardness result concerns the intractability of protein folding for a class of energy formulas that contains a broad range of mean force potentials whose form is similar to commonly used pair potentials (e.g., the Lennard-Jones potential). We prove that protein-structure prediction is NP-hard for any energy formula in this class. These are the first robust intractability results that identify sources of computational complexity of protein-structure prediction that transcend particular problem formulations.

Research in Computational Molecular Biology | 2002

A Survey of Computational Methods for Determining Haplotypes

Bjarni V. Halldórsson; Vineet Bafna; Nathan Edwards; Ross A. Lippert; Shibu Yooseph; Sorin Istrail

It is widely anticipated that the study of variation in the human genome will provide a means of predicting risk of a variety of complex diseases. Single nucleotide polymorphisms (SNPs) are the most common form of genomic variation. Haplotypes have been suggested as one means for reducing the complexity of studying SNPs. In this paper we review some of the computational approaches that have been taken for determining haplotypes and suggest new approaches.

Research in Computational Molecular Biology | 2003

Haplotypes and informative SNP selection algorithms: don't block out information

Vineet Bafna; Bjarni V. Halldórsson; Russell Schwartz; Andrew G. Clark; Sorin Istrail

It is widely hoped that variation in the human genome will provide a means of predicting risk of a variety of complex, chronic diseases. A major stumbling block to the successful identification of association between human DNA polymorphisms (SNPs) and variability in risk of complex diseases is the enormous number of SNPs in the human genome (4,9). The large number of SNPs results in unacceptably high costs for exhaustive genotyping, and so there is a broad effort to determine ways to select SNPs so as to maximize the informativeness of a subset. In this paper we contrast two methods for reducing the complexity of SNP variation: haplotype tagging, i.e., typing a subset of SNPs to identify segments of the genome that appear to be nearly unrecombined (haplotype blocks), and a new block-free model that we develop in this report. We present a statistic for comparing haplotype blocks and show that while the concept of haplotype blocks is reasonably robust, there is substantial variability among block partitions. We develop a measure for selecting an informative subset of SNPs in a block-free model. We show that the general version of this problem is NP-hard and give efficient algorithms for two important special cases of this problem.

Protein Science | 2001

Frequencies of amino acid strings in globular protein sequences indicate suppression of blocks of consecutive hydrophobic residues.

Russell Schwartz; Sorin Istrail; Jonathan King

Patterns of hydrophobic and hydrophilic residues play a major role in protein folding and function. Long, predominantly hydrophobic strings of 20–22 amino acids each are associated with transmembrane helices and have been used to identify such sequences. Much less attention has been paid to hydrophobic sequences within globular proteins. In prior work on computer simulations of the competition between on‐pathway folding and off‐pathway aggregate formation, we found that long sequences of consecutive hydrophobic residues promoted aggregation within the model, even controlling for overall hydrophobic content. We report here on an analysis of the frequencies of different lengths of contiguous blocks of hydrophobic residues in a database of amino acid sequences of proteins of known structure. Sequences of three or more consecutive hydrophobic residues are found to be significantly less common in actual globular proteins than would be predicted if residues were selected independently. The result may reflect selection against long blocks of hydrophobic residues within globular proteins relative to what would be expected if residue hydrophobicities were independent of those of nearby residues in the sequence.
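
The kind of count described in this abstract can be sketched in a few lines of Python: tally the maximal runs of consecutive hydrophobic residues in a sequence and compare them with the number expected if each residue were hydrophobic independently with probability p. The hydrophobic residue set, the function names, and the simple independence model below are assumptions of this sketch and may differ from the classification and statistics used in the paper.

```python
from collections import Counter
from itertools import groupby

# Assumed hydrophobic alphabet; the paper's exact classification may differ.
HYDROPHOBIC = frozenset("AVLIMFWC")


def hydrophobic_run_lengths(sequence, hydrophobic=HYDROPHOBIC):
    """Count maximal runs of consecutive hydrophobic residues by length."""
    runs = Counter()
    for is_hydrophobic, group in groupby(sequence, key=lambda r: r in hydrophobic):
        if is_hydrophobic:
            runs[sum(1 for _ in group)] += 1
    return runs


def expected_runs(n, k, p):
    """Expected number of maximal runs of length exactly k in n residues,
    assuming each residue is hydrophobic independently with probability p."""
    if k > n:
        return 0.0
    if k == n:
        return p ** n
    # Two runs touching a sequence end need one flanking non-hydrophobic
    # residue; the n - k - 1 interior placements need two.
    return 2 * p ** k * (1 - p) + (n - k - 1) * p ** k * (1 - p) ** 2


toy = "AVLKKAIG"  # hypothetical sequence, not taken from any database
observed = hydrophobic_run_lengths(toy)
p = sum(r in HYDROPHOBIC for r in toy) / len(toy)
for k in sorted(observed):
    print(k, observed[k], round(expected_runs(len(toy), k, p), 3))
```

On a real database the observed and expected counts would be summed over many sequences before comparing them; a single short sequence like the toy one carries no statistical weight.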

Human Heredity | 2004

Optimal Selection of SNP Markers for Disease Association Studies

Bjarni V. Halldórsson; Sorin Istrail; Francisco M. De La Vega

Genetic association studies with population samples hold the promise of uncovering the susceptibility genes underlying the heritability of complex or common disease. Most association studies rely on the use of surrogate markers, single-nucleotide polymorphisms (SNPs) being the most suitable due to their abundance and ease of scoring. SNP marker selection aims to increase the chances that at least one typed SNP is in linkage disequilibrium (LD) with the disease causative variant, while at the same time controlling the cost of the study in terms of the number of markers genotyped and samples. Empirical studies reporting block-like segments in the genome with high LD and low haplotype diversity have motivated a marker selection strategy whereby subsets of SNPs that ‘tag’ the common haplotypes of a region are picked for genotyping, avoiding typing redundant SNPs. Based on these initial observations, a plethora of ‘tagging’ algorithms for selecting minimum informative subsets of SNPs has recently appeared in the literature. These differ mostly in two major aspects: the quality or correlation measure used to define tagging and the algorithm used for the minimization of the final number of tagging SNPs. In this review we describe the available tagging algorithms utilizing a 3-step unifying framework, point out their methodological and conceptual differences, and make an assessment of their assumptions, performance, and scalability.
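
The two aspects named in this abstract, a correlation measure that defines tagging and an algorithm that minimizes the number of tags, can be illustrated with a generic sketch. The Python below assumes a precomputed pairwise r² matrix, an r² threshold of 0.8, and a greedy set-cover heuristic. These are illustrative choices for this example, not a method attributed to the review, and the greedy heuristic does not guarantee a minimum tag set.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Pick tag SNPs so every SNP has r^2 >= threshold with some tag.

    r2: square matrix (list of lists) of pairwise r^2 values.
    Greedy set cover: repeatedly take the SNP that covers the most
    still-uncovered SNPs, breaking ties by lowest index.
    """
    n = len(r2)
    # Each SNP always covers itself, so the loop below always makes progress.
    covers = [
        {j for j in range(n) if i == j or r2[i][j] >= threshold}
        for i in range(n)
    ]
    uncovered = set(range(n))
    tags = []
    while uncovered:
        best = max(range(n), key=lambda i: (len(covers[i] & uncovered), -i))
        tags.append(best)
        uncovered -= covers[best]
    return tags


# Hypothetical r^2 values: SNPs 0-2 are correlated with SNP 0, SNP 3 stands alone.
r2 = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.50, 0.10],
    [0.85, 0.50, 1.00, 0.20],
    [0.10, 0.10, 0.20, 1.00],
]
print(greedy_tag_snps(r2))
```

In this toy matrix SNP 0 tags SNPs 1 and 2, so only SNPs 0 and 3 need to be genotyped.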
