Shuai Cheng Li | Researchain

Publication

Featured research published by Shuai Cheng Li.

Protein Science | 2008

Fragment-HMM: A new approach to protein structure prediction

Shuai Cheng Li; Dongbo Bu; Jinbo Xu; Ming Li

We designed a simple position‐specific hidden Markov model to predict protein structure. Our new framework naturally repeats itself to converge to a final target, conglomerating fragment assembly, clustering, target selection, refinement, and consensus, all in one process. Our initial implementation of this theory converges to within 6 Å of the native structures for 100% of decoys on all six standard benchmark proteins used in ROSETTA (discussed by Simons and colleagues in a recent paper), which achieved only 14%–94% for the same data. The qualities of the best decoys and the final decoys our theory converges to are also notably better.

Proteins | 2008

Discriminative learning for protein conformation sampling

Feng Zhao; Shuai Cheng Li; Beckett Sterner; Jinbo Xu

Protein structure prediction without using templates (i.e., ab initio folding) is one of the most challenging problems in structural biology. In particular, conformation sampling poses a major bottleneck in ab initio folding. This article presents CRFSampler, an extensible protein conformation sampler built on a probabilistic graphical model, Conditional Random Fields (CRFs). Using a discriminative learning method, CRFSampler can automatically learn more than ten thousand parameters quantifying the relationship among primary sequence, secondary structure, and (pseudo) backbone angles. Using only compactness and self‐avoiding constraints, CRFSampler can efficiently generate protein‐like conformations from primary sequence and predicted secondary structure. CRFSampler is also very flexible in that a variety of model topologies and feature sets can be defined to model the sequence‐structure relationship without worrying about parameter estimation. Our experimental results demonstrate that, using a simple set of features, CRFSampler can generate decoys of much higher quality than the most recent HMM model.

BMC Bioinformatics | 2010

Calibur: a tool for clustering large numbers of protein decoys

Shuai Cheng Li; Yen Kaow Ng

Background: Ab initio protein structure prediction methods generate numerous structural candidates, which are referred to as decoys. The decoy with the largest number of neighbors within a threshold distance is typically identified as the most representative decoy. However, the clustering of decoys needed for this criterion involves computations with runtimes that are at best quadratic in the number of decoys. As a result, there is currently no tool designed to exactly cluster very large numbers of decoys, which creates a bottleneck in the analysis.
Results: Using three strategies aimed at enhancing performance (proximate decoys organization, preliminary screening via lower and upper bounds, outliers filtering), we designed and implemented a software tool for clustering decoys called Calibur. We show empirical results indicating the effectiveness of each of the strategies employed. The strategies are further fine-tuned according to their effectiveness. Calibur demonstrated the ability to scale well with respect to increases in the number of decoys. For a sample size of approximately 30 thousand decoys, Calibur completed the analysis in one third of the time required when the strategies are not used. For practical use, Calibur is able to automatically discover a suitable threshold distance for clustering from the input decoys. Several methods for this discovery are implemented in Calibur, and by default a very fast one is used. Using the default method, Calibur reported relatively good decoys in our tests.
Conclusions: Calibur's ability to handle very large protein decoy sets makes it a useful tool for clustering decoys in ab initio protein structure prediction. As the number of decoys generated by these methods increases, we believe Calibur will become important for progress in the field.
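The selection criterion and the idea of screening pairs with lower and upper bounds can be illustrated with a minimal Python sketch. It is not Calibur's implementation: it assumes the decoy distance is a metric (plain Euclidean distance between toy points stands in for a structural distance here), uses a single reference decoy to derive triangle-inequality bounds, and the function name `most_representative` is hypothetical.

```python
from math import dist


def most_representative(decoys, threshold, distance=dist):
    """Return (index, neighbor_count, exact_calls) for the decoy with the
    most neighbors within `threshold`.

    Assumption: `distance` is a metric, so the triangle inequality gives
    cheap lower and upper bounds on d(i, j) from distances to a reference.
    """
    n = len(decoys)
    if n == 0:
        raise ValueError("no decoys given")
    # Distance from every decoy to one reference decoy (index 0).
    ref = [distance(decoys[0], d) for d in decoys]
    counts = [0] * n
    exact_calls = 0  # how many pairs needed a full distance computation
    for i in range(n):
        for j in range(i + 1, n):
            lower = abs(ref[i] - ref[j])   # d(i, j) >= |d(i, r) - d(j, r)|
            upper = ref[i] + ref[j]        # d(i, j) <= d(i, r) + d(j, r)
            if lower > threshold:
                continue                   # certainly not neighbors
            if upper <= threshold:
                near = True                # certainly neighbors
            else:
                exact_calls += 1
                near = distance(decoys[i], decoys[j]) <= threshold
            if near:
                counts[i] += 1
                counts[j] += 1
    best = max(range(n), key=counts.__getitem__)
    return best, counts[best], exact_calls
```

The pair loop is still quadratic, in line with the abstract's remark about runtime; the bounds only reduce how often the expensive distance has to be evaluated.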

Research in Computational Molecular Biology | 2011

Pedigree reconstruction using identity by descent

Bonnie Kirkpatrick; Shuai Cheng Li; Richard M. Karp; Eran Halperin

Can we find the family trees, or pedigrees, that relate the haplotypes of a group of individuals? Collecting the genealogical information for how individuals are related is a very time-consuming and expensive process. Methods for automating the construction of pedigrees could streamline this process. While constructing single-generation families is relatively easy given whole genome data, reconstructing multi-generational, possibly inbred, pedigrees is much more challenging. This paper addresses the important question of reconstructing monogamous, regular pedigrees, where pedigrees are regular when individuals mate only with other individuals at the same generation. This paper introduces two multi-generational pedigree reconstruction methods: one for inbreeding relationships and one for outbreeding relationships. In contrast to previous methods that focused on the independent estimation of relationship distances between every pair of typed individuals, here we present methods that aim at the reconstruction of the entire pedigree. We show that both our methods outperform the state of the art and that the outbreeding method is capable of reconstructing pedigrees at least six generations back in time with high accuracy. The two programs are available at http://cop.icsi.berkeley.edu/cop/

Database Systems for Advanced Applications | 2005

Indexing DNA sequences using q-grams

Xia Cao; Shuai Cheng Li; Anthony K. H. Tung

Recent years have seen a growing interest in similarity search on large collections of biological sequences. This paper presents a method for efficiently indexing DNA sequences based on q-grams, to facilitate similarity search in a DNA database and avoid a linear scan of the entire database. A two-level index, consisting of a hash table and c-trees, is proposed based on the q-grams of DNA sequences. The proposed data structures allow quick detection of sequences within a certain distance of the query sequence. Experimental results show that our method is efficient in detecting similar regions in a DNA sequence database with high sensitivity.
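As a rough illustration of the hash-table level of such an index, the following Python sketch maps each q-gram to the places where it occurs and uses shared q-gram hits to shortlist sequences for a query. It is an assumption-laden sketch, not the paper's data structure: the c-tree level is omitted, and the names `build_qgram_index`, `candidate_sequences`, and the `min_shared` cutoff are hypothetical.

```python
from collections import Counter, defaultdict


def build_qgram_index(sequences, q):
    """Map every q-gram to the list of (sequence id, position) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - q + 1):
            index[seq[pos:pos + q]].append((seq_id, pos))
    return index


def candidate_sequences(index, query, q, min_shared):
    """Return ids of sequences with at least `min_shared` q-gram hits.

    A hit is one (query position, database position) pair with equal q-grams,
    so repeated q-grams are counted once per occurrence pair.
    """
    hits = Counter()
    for pos in range(len(query) - q + 1):
        # .get() avoids inserting empty entries into the defaultdict.
        for seq_id, _ in index.get(query[pos:pos + q], ()):
            hits[seq_id] += 1
    return sorted(seq_id for seq_id, count in hits.items() if count >= min_shared)
```

Only the shortlisted sequences would then need a full alignment or distance computation, which is what lets an index of this kind avoid scanning the whole database.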

Algorithms for Molecular Biology | 2013

The difficulty of protein structure alignment under the RMSD

Shuai Cheng Li

Background: Protein structure alignment is often modeled as the largest common point set (LCP) problem based on the Root Mean Square Deviation (RMSD), a measure commonly used to evaluate structural similarity. In the problem, each residue is represented by the coordinate of the Cα atom, and a structure is modeled as a sequence of 3D points. Out of two such sequences, one is to find two equal-sized subsequences of the maximum length, and a bijection between the points of the subsequences which gives an RMSD within a given threshold. The problem is considered to be difficult in terms of time complexity, but the reasons for its difficulty are not well understood. Improving this time complexity is considered important in protein structure prediction and structural comparison, where the task of comparing very large numbers of structures is commonly encountered.
Results: To study why the LCP problem is difficult, we define a natural variant of the problem, called the minimum aligned distance (MAD). In the MAD problem, the length of the subsequences to obtain is specified in the input; and instead of fulfilling a threshold, the RMSD between the points of the two subsequences is to be minimized. Our results show that the difficulty of the two problems does not lie solely in the combinatorial complexity of finding the optimal subsequences, or in the task of superimposing the structures. By placing a limit on the distance between consecutive points, and assuming that the points are specified as integral values, we show that both problems are equally difficult, in the sense that they are reducible to each other. In this case, both problems can be exactly solved in polynomial time, although the time complexity remains high.
Conclusions: We showed insights and techniques which we hope will lead to practical algorithms for the LCP problem for protein structures. The study identified two important factors in the problem's complexity: (1) the lack of a limit on the distance between the consecutive points of a structure; (2) the arbitrariness of the precision allowed in the input values. Both issues are of little practical concern for the purpose of protein structure alignment. When these factors are removed, the LCP problem is as hard as that of minimizing the RMSD (MAD problem), and can be solved exactly in polynomial time.
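To make the MAD objective concrete, here is a small Python sketch that computes the RMSD for a fixed correspondence and then searches all pairs of length-k subsequences by brute force. It makes a strong simplifying assumption that the abstract does not: the rigid-body superposition step is skipped, so the two point sequences are compared in the coordinates given. The function names are hypothetical, and the exhaustive search is exponential, so it is only usable on toy inputs and is not the polynomial-time algorithm of the paper.

```python
from itertools import combinations
from math import dist, sqrt


def rmsd(points_a, points_b):
    """RMSD between two equal-length point lists, paired in order.

    Assumption: no superposition is applied; coordinates are used as given.
    """
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(dist(p, q) ** 2 for p, q in zip(points_a, points_b))
    return sqrt(total / len(points_a))


def brute_force_mad(seq_a, seq_b, k):
    """Try every pair of order-preserving length-k subsequences.

    Returns (best_rmsd, indices_in_a, indices_in_b), or None if k is larger
    than either sequence. Exponential time: for illustration only.
    """
    best = None
    for idx_a in combinations(range(len(seq_a)), k):
        sub_a = [seq_a[i] for i in idx_a]
        for idx_b in combinations(range(len(seq_b)), k):
            value = rmsd(sub_a, [seq_b[j] for j in idx_b])
            if best is None or value < best[0]:
                best = (value, idx_a, idx_b)
    return best
```

The LCP formulation differs only in what is fixed: there the RMSD threshold is given and the subsequence length k is maximized, whereas here k is given and the RMSD is minimized.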

Journal of Bioinformatics and Computational Biology | 2011

Error tolerant NMR backbone resonance assignment and automated structure generation

Babak Alipanahi; Xin Gao; Emre Karakoc; Shuai Cheng Li; Frank J. Balbach; Guangyu Feng; Logan W. Donaldson; Ming Li

Error tolerant backbone resonance assignment is the cornerstone of the NMR structure determination process. Although a variety of assignment approaches have been developed, none works sufficiently well on noisy, fully automatically picked peaks to enable the subsequent automatic structure determination steps. We have designed an integer linear programming (ILP) based assignment system (IPASS) that has enabled fully automatic protein structure determination for four test proteins. IPASS employs probabilistic spin system typing based on chemical shifts and secondary structure predictions. Furthermore, IPASS extracts connectivity information from the inter-residue information and the (automatically picked) 15N-edited NOESY peaks which are then used to fix reliable fragments. When applied to automatically picked peaks for real proteins, IPASS achieves an average precision and recall of 82% and 63%, respectively. In contrast, the next best method, MARS, achieves an average precision and recall of 77% and 36%, respectively. The assignments generated by IPASS are then fed into our protein structure calculation system, FALCON-NMR, to determine the 3D structures without human intervention. The final models have backbone RMSDs of 1.25 Å, 0.88 Å, 1.49 Å, and 0.67 Å to the reference native structures for proteins TM1112, CASKIN, VRAR, and HACS1, respectively. The web server is publicly available at http://monod.uwaterloo.ca/nmr/ipass.

Journal of Computational Biology | 2011

Pedigree reconstruction using identity by descent

Bonnie Kirkpatrick; Shuai Cheng Li; Richard M. Karp; Eran Halperin

Can we find the family trees, or pedigrees, that relate the haplotypes of a group of individuals? Collecting the genealogical information for how individuals are related is a very time-consuming and expensive process. Methods for automating the construction of pedigrees could streamline this process. While constructing single-generation families is relatively easy given whole genome data, reconstructing multi-generational, possibly inbred, pedigrees is much more challenging. This article addresses the important question of reconstructing monogamous, regular pedigrees, where pedigrees are regular when individuals mate only with other individuals at the same generation. This article introduces two multi-generational pedigree reconstruction methods: one for inbreeding relationships and one for outbreeding relationships. In contrast to previous methods that focused on the independent estimation of relationship distances between every pair of typed individuals, here we present methods that aim at the reconstruction of the entire pedigree. We show that both our methods outperform the state of the art and that the outbreeding method is capable of reconstructing pedigrees at least six generations back in time with high accuracy. The two programs are available at http://cop.icsi.berkeley.edu/cop/.

Journal of Bioinformatics and Computational Biology | 2010

Protein secondary structure prediction using NMR chemical shift data

Yuzhong Zhao; Babak Alipanahi; Shuai Cheng Li; Ming Li

Accurate determination of protein secondary structure from chemical shift information is a key step in NMR tertiary structure determination. Relatively little work has been done on this subject. There needs to be a systematic investigation of algorithms that are (a) robust for large datasets; (b) easily extendable to new (dynamic) databases; and (c) approaching the limit of accuracy. We introduce new approaches that use the k-nearest neighbor algorithm for the basic prediction and the BCJR algorithm to smooth the predictions and to combine predictions based on chemical shifts with predictions based on sequence information only. Our new system, SUCCES, improves on the accuracy of all existing methods on a large dataset of 805 proteins (86% Q3 accuracy, and 92.6% accuracy when the boundary residues are ignored), and it is easily extendable to any new dataset without requiring any new training. The software is publicly available at http://monod.uwaterloo.ca/nmr/succes.

Combinatorial Pattern Matching | 2008

Finding Largest Well-Predicted Subset of Protein Structure Models

Shuai Cheng Li; Dongbo Bu; Jinbo Xu; Ming Li

How to evaluate the quality of models is a basic problem in the field of protein structure prediction. Numerous evaluation criteria have been proposed, and one of the most intuitive criteria requires us to find a largest well-predicted subset, that is, a maximum subset of the model which matches the native structure [12]. The problem is solvable in O(n^7) time, which is too slow for practical use. We present a (1 + ε)d distance approximation algorithm that runs in time O(n^3 log n / ε^5) for general protein structures. In the case of globular proteins, this result can be enhanced to a randomized O(n log^2 n) time algorithm with probability at least 1 − O(1/n). In addition, we propose a (1 + ε)-approximation algorithm to compute the minimum distance to fit all the points of a model to its native structure in time O(n (log log n + log 1/ε) / ε^5). We have implemented our algorithms, and results indicate that our program finds many more matched pairs in less running time than TMScore, which is one of the most popular tools for assessing the quality of predicted models.
