Xiaoqiu Huang
Iowa State University
Publication
Featured research published by Xiaoqiu Huang.
international symposium on bioinformatics research and applications | 2008
Ankit Agrawal; Volker Brendel; Xiaoqiu Huang
An important aspect of pairwise sequence comparison is assessing the statistical significance of the alignment. Most of the currently popular alignment programs report the statistical significance of an alignment in the context of a database search. This database statistical significance depends on the database, and hence the same alignment of a pair of sequences may be assigned different statistical significance values in different databases. In this paper, we explore the use of pairwise statistical significance, which is independent of any database and can be useful when we have only a pair of sequences and want to comment on their relatedness. We compared different methods and determined that censored maximum likelihood fitting of the score distribution right of the peak is the most accurate method for estimating pairwise statistical significance. We evaluated this method in an experiment with a subset of CATH2.3, which had previously been used by other authors as a benchmark data set for protein comparison. Comparison of results with database statistical significance reported by popular programs like SSEARCH and PSI-BLAST indicates that the results of pairwise statistical significance are comparable to, and indeed sometimes significantly better than, those of database statistical significance (with SSEARCH). However, PSI-BLAST performs best, presumably due to its use of query-specific substitution matrices.
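The idea of a database-independent significance estimate can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses a toy match/mismatch scoring scheme with a linear gap penalty, builds the null distribution by shuffling one sequence, and fits a Gumbel (extreme value) distribution by the method of moments. That fit is a simpler stand-in for the censored maximum likelihood fitting the abstract identifies as most accurate; the function names `local_score` and `pairwise_p_value` are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best

def pairwise_p_value(a, b, shuffles=200, seed=0):
    """Return (observed score, estimated P-value) for a single sequence pair."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # keeps the composition of b, destroys its order
        null_scores.append(local_score(a, letters))
    mean = sum(null_scores) / shuffles
    var = sum((s - mean) ** 2 for s in null_scores) / (shuffles - 1)
    # Method-of-moments Gumbel fit (assumption: NOT the censored ML fit).
    scale = math.sqrt(6.0 * var) / math.pi
    if scale == 0.0:
        return observed, (1.0 if observed <= mean else 0.0)
    location = mean - EULER_GAMMA * scale
    z = (observed - location) / scale
    if z < -700.0:  # avoid overflow in exp for scores far below the null
        return observed, 1.0
    # P(score >= observed) = 1 - exp(-exp(-z)) under the fitted Gumbel
    return observed, -math.expm1(-math.exp(-z))

if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGGCTAACGTTAGC"
    print(pairwise_p_value(seq, seq))
```

No database is involved at any point: the estimate depends only on the two sequences and the scoring scheme, which is the property the abstract emphasizes.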
BMC Bioinformatics | 2009
Ankit Agrawal; Xiaoqiu Huang
Background: Accurate estimation of the statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance with database statistical significance was conducted. In this paper, we extend the earlier work on pairwise statistical significance by incorporating the use of multiple parameter sets. Results: Results for a knowledge discovery application of homology detection reveal that using multiple parameter sets for pairwise statistical significance estimates gives better coverage than using a single parameter set, at least at some error levels. Further, the results of pairwise statistical significance using multiple parameter sets are shown to be significantly better than database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable to, and at times significantly better than, those of SSEARCH. Using non-zero parameter set change penalty values gives better performance than a zero penalty. Conclusion: The fact that homology detection performance does not degrade when using multiple parameter sets is strong evidence for the validity of the assumption that the alignment score distribution follows an extreme value distribution even when using multiple parameter sets. Parameter set change penalty is a useful parameter for alignment using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be effectively used to determine the relatedness of one or a few pairs of sequences without performing a time-consuming database search.
International Journal of Computational Biology and Drug Design | 2008
Ankit Agrawal; Volker Brendel; Xiaoqiu Huang
We evaluate various methods for estimating the pairwise statistical significance of a pairwise local sequence alignment in terms of statistical significance accuracy, and compare pairwise statistical significance with popular database search programs in terms of retrieval accuracy on a benchmark database. Results indicate that pairwise statistical significance with standard substitution matrices is significantly better than database statistical significance reported by BLAST and PSI-BLAST, and that it is comparable to, and at times significantly better than, SSEARCH. An application of pairwise statistical significance to empirically determine effective gap opening penalties for protein local sequence alignment using the widely used BLOSUM matrices is also presented.
BMC Bioinformatics | 2005
Jianmin Wang; Xiaoqiu Huang
Background: The allele frequencies of single-nucleotide polymorphisms (SNPs) are needed to select an optimal subset of common SNPs for use in association studies. Sequence-based methods for finding SNPs with allele frequencies may need to handle thousands of sequences from the same genome location (sequences of deep coverage). Results: We describe a computational method for finding common SNPs with allele frequencies in single-pass sequences of deep coverage. The method enhances a widely used program named PolyBayes in several aspects. We present results from our method and PolyBayes on eighteen data sets of human expressed sequence tags (ESTs) with deep coverage. The results indicate that our method used almost all single-pass sequences in computation of the allele frequencies of SNPs. Conclusion: The new method is able to handle single-pass sequences of deep coverage efficiently. Our work shows that it is possible to analyze sequences of deep coverage by using pairwise alignments of the sequences with the finished genome sequence, instead of multiple sequence alignments.
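The concluding point, that per-read pairwise alignments to a finished reference are enough to tally alleles, can be illustrated with a small counting sketch. It is not the paper's method and not PolyBayes: it ignores base quality, paralogs, and indels, and it assumes each read has already been aligned without gaps and is given as a hypothetical `(reference_start, bases)` tuple. The thresholds `min_coverage` and `min_minor_freq` are illustrative values only.

```python
from collections import Counter, defaultdict

def column_counts(reads):
    """Tally bases per reference position from gap-free read placements."""
    columns = defaultdict(Counter)
    for start, bases in reads:
        for offset, base in enumerate(bases.upper()):
            if base in "ACGT":  # skip N and any other non-base symbol
                columns[start + offset][base] += 1
    return columns

def candidate_snps(reads, min_coverage=4, min_minor_freq=0.2):
    """Return {position: {base: frequency}} for columns with a common second allele."""
    snps = {}
    for pos, counts in sorted(column_counts(reads).items()):
        total = sum(counts.values())
        if total < min_coverage or len(counts) < 2:
            continue
        second = counts.most_common(2)[1][1]  # count of the second most common base
        if second / total >= min_minor_freq:
            snps[pos] = {base: n / total for base, n in counts.items()}
    return snps

if __name__ == "__main__":
    reads = [(0, "ACGT"), (0, "ACGT"), (1, "CAT"), (1, "CAT"), (0, "ACAT")]
    print(candidate_snps(reads))
```

Because every read is placed against the reference independently, the cost grows linearly with the number of reads, which is what makes deep coverage tractable without a multiple sequence alignment.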
international conference on information technology | 2008
Ankit Agrawal; Xiaoqiu Huang
Pairwise sequence alignment forms the basis of numerous other applications in bioinformatics. The quality of an alignment is gauged by statistical significance rather than by alignment score alone. Therefore, accurate estimation of the statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, it was shown that pairwise statistical significance does better in practice than database statistical significance, and also provides quicker individual pairwise estimates of statistical significance without having to perform a time-consuming database search. Under an evolutionary model, a substitution matrix can be derived using a rate matrix and a fixed distance. Although commonly used substitution matrices such as BLOSUM62 were not originally derived from a rate matrix under an evolutionary model, the corresponding rate matrices can be back-calculated. Many researchers have derived different rate matrices using different methods and data. In this paper, we show that pairwise statistical significance using rate matrices with a sequence-pair-specific distance performs significantly better than using a fixed distance. Pairwise statistical significance using sequence-pair-specific distanced substitution matrices also outperforms database statistical significance reported by BLAST.
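How a rate matrix plus a distance yields a substitution matrix is easiest to see in a model with a closed form. The sketch below uses the Jukes-Cantor nucleotide model purely as a stand-in, since the paper works with protein rate matrices, which require a matrix exponential. It estimates a pair-specific distance from the mismatch fraction of an initial alignment and converts it to log-odds scores log2(P_ij(t) / pi_j). The function names are hypothetical.

```python
import math

def jc_distance(mismatch_fraction):
    """Jukes-Cantor distance (substitutions per site) from the mismatch fraction."""
    if not 0.0 <= mismatch_fraction < 0.75:
        raise ValueError("mismatch fraction must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * mismatch_fraction / 3.0)

def jc_log_odds(distance):
    """Return (match, mismatch) log-odds scores in bits at the given distance."""
    if distance <= 0.0:
        raise ValueError("distance must be positive")
    decay = math.exp(-4.0 * distance / 3.0)
    p_same = 0.25 + 0.75 * decay   # P(base unchanged after the given distance)
    p_diff = 0.25 - 0.25 * decay   # P(change to one specific other base)
    background = 0.25              # uniform base frequencies under Jukes-Cantor
    return math.log2(p_same / background), math.log2(p_diff / background)

if __name__ == "__main__":
    # Scores tailored to a pair whose initial alignment shows 10% mismatches.
    d = jc_distance(0.10)
    print(d, jc_log_odds(d))
```

Closely related pairs get a strongly negative mismatch score, while distant pairs get scores near zero, which is the intuition behind choosing the distance per sequence pair rather than fixing it.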
bioinformatics and biomedicine | 2008
Ankit Agrawal; Xiaoqiu Huang
Estimation of the statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, it was shown that pairwise statistical significance does better in practice than database statistical significance in terms of retrieval accuracy of homologs. In this paper, we introduce the concepts of conservative, non-conservative, and average pairwise statistical significance, which can be easily derived from the original pairwise statistical significance estimates and which use more information specific to the sequence pair under consideration through multiple shuffle spaces. Experimental results for homology detection reveal that the proposed measures give comparable or significantly better retrieval accuracy than the original pairwise statistical significance and the database statistical significance reported by BLAST, PSI-BLAST, and SSEARCH. The proposed measures are further shown to be extremely useful with sequence-specific substitution matrices.
international conference on information technology | 2008
Ankit Agrawal; Xiaoqiu Huang
Pairwise DNA and protein sequence alignment is an underlying task in bioinformatics that forms the basis of many other bioinformatics applications. Protein sequence alignment is in general given more importance than DNA sequence alignment, and protein sequence alignment methods can usually be used with little modification for DNA sequences as well. However, alignment methods specific to DNA sequence alignment that use sequence-specific information are highly desirable. Most existing DNA alignment programs routinely use the common match/mismatch scoring scheme. Recently, an iterative alignment scheme using a sequence-specific transition-transversion ratio was shown to be better than a simple match/mismatch scoring scheme. In this paper, we present a modification to the iterative approach that incorporates the use of multiple parameter sets. Preliminary experiments indicate that using multiple parameter sets gives significantly better performance than using either a single parameter set or a simple match/mismatch scoring scheme. Sequence-specific scoring matrices have been shown to be highly successful for protein alignment over the last decade, and the current work should be a significant step in the direction of using sequence-specific substitution matrices for DNA sequences.
electro information technology | 2008
Ankit Agrawal; Xiaoqiu Huang
Pairwise DNA and protein sequence alignment is an important task in bioinformatics that forms the basis of many other tasks, such as multiple sequence alignment, protein structure and function prediction, and phylogenetic analysis. In general, more emphasis is given to protein sequence alignment, and the alignment methods designed for protein sequences can usually be used with little modification for DNA sequences as well. However, it is desirable to design methods specifically for DNA alignments, making use of specific DNA sequence models and, if possible, of the specific sequences being aligned. In this paper, we present an iterative method for DNA sequence alignment with a sequence-specific transition-transversion ratio. Preliminary experiments indicate that the proposed technique has significant potential. The approach is better suited to the nature of the specific DNA sequence pair, and could be a significant step in the direction of using sequence-specific substitution matrices for DNA sequences.
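The quantity at the heart of the iterative scheme, the transition-transversion ratio of a specific sequence pair, can be read directly off an alignment. The sketch below covers only that counting step, under the assumption that the alignment is given as two equal-length gapped strings. How the paper turns the ratio into scores and when it stops iterating are not stated in the abstract, so they are not modeled here; `ts_tv_counts` and `ts_tv_ratio` are hypothetical names.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def ts_tv_counts(aligned_a, aligned_b):
    """Count transitions and transversions in a gapped pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    transitions = transversions = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue  # matches, gaps, and ambiguity codes carry no information here
        pair = {x, y}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions, transversions

def ts_tv_ratio(aligned_a, aligned_b):
    """Transition/transversion ratio, or None when no transversion is observed."""
    ts, tv = ts_tv_counts(aligned_a, aligned_b)
    return ts / tv if tv else None

if __name__ == "__main__":
    print(ts_tv_counts("ACGTAC-T", "GCATTCGT"))
```

In an iterative setting, this estimate would be recomputed from each new alignment and fed back into the scoring scheme for the next round.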
Advances in Experimental Medicine and Biology | 2009
Xiaoqiu Huang; Ankit Agrawal
There has been a deluge of biological sequence data in the public domain, which makes sequence comparison one of the most fundamental computational problems in bioinformatics. Biologists routinely use pairwise alignment programs to identify similar or, more specifically, related sequences (those having a common ancestor). Almost everything in bioinformatics depends on the inter-relationship between sequence, structure, and function (all encapsulated in the term relatedness), which is far from being well understood. The potential relatedness of two sequences is better judged by the statistical significance of the alignment score than by the alignment score alone. This chapter presents a summary of recent advances in accurately estimating the statistical significance of pairwise local alignment for the purpose of identifying related sequences, by making the sequence comparison process more sequence specific. A comparison of pairwise statistical significance, used to rank database sequences, with well-known database search programs like BLAST, PSI-BLAST, and SSEARCH is also presented. As expected, the sequence-comparison performance (evaluated in terms of retrieval accuracy) improves significantly as the sequence comparison process is made more and more sequence specific. Shortcomings of currently used approaches and some potentially useful directions for future work are also presented.
international symposium on bioinformatics research and applications | 2008
Ankit Agrawal; Arka P. Ghosh; Xiaoqiu Huang
A central question in pairwise sequence comparison is assessing the statistical significance of the alignment. The alignment score distribution is known to follow an extreme value distribution with analytically calculable parameters K and λ for ungapped alignments with one substitution matrix. But no statistical theory is currently available for the gapped case and for alignments using multiple scoring matrices, although their score distribution is known to closely follow extreme value distribution and the corresponding parameters can be estimated by simulation. Ideal estimation would require simulation for each sequence pair, which is impractical. In this paper, we present a simple clustering-classification approach based on amino acid composition to estimate K and λ for a given sequence pair and scoring scheme, including using multiple parameter sets. The resulting set of K and λ for different cluster pairs has large variability even for the same scoring scheme, underscoring the heavy dependence of K and λ on the amino acid composition. The proposed approach in this paper is an attempt to separate the influence of amino acid composition in estimation of statistical significance of pairwise protein alignments. Experiments and analysis of other approaches to estimate statistical parameters also indicate that the methods used in this work estimate the statistical significance with good accuracy.
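For the ungapped case, the dependence of λ on composition can be made concrete. In the standard Karlin-Altschul theory, λ is the positive root of sum_ij p_i p_j exp(λ s_ij) = 1, and the P-value of a score S for sequences of lengths m and n is 1 - exp(-K m n exp(-λ S)). The sketch below solves for λ by bisection on a toy DNA scoring scheme. Computing K analytically is more involved, so the value of K passed in is an assumed input, and the function names are hypothetical.

```python
import math

def solve_lambda(freqs, score, tol=1e-12):
    """Positive root of sum_ij p_i p_j exp(lambda * s_ij) = 1, found by bisection."""
    pairs = [(pi * pj, score(x, y))
             for x, pi in freqs.items() for y, pj in freqs.items()]
    if sum(w * s for w, s in pairs) >= 0:
        raise ValueError("expected score must be negative")
    if not any(s > 0 for _, s in pairs):
        raise ValueError("at least one score must be positive")

    def f(lam):
        return sum(w * math.exp(lam * s) for w, s in pairs) - 1.0

    lo, hi = 1e-6, 1.0
    while f(hi) < 0:          # grow the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def evd_p_value(score, m, n, lam, K):
    """P(S >= score) = 1 - exp(-K * m * n * exp(-lam * score))."""
    expected_hits = K * m * n * math.exp(-lam * score)
    return -math.expm1(-expected_hits)

if __name__ == "__main__":
    uniform = {base: 0.25 for base in "ACGT"}
    lam = solve_lambda(uniform, lambda x, y: 1 if x == y else -1)
    # K = 0.1 is an assumed illustrative value, not a computed one.
    print(lam, evd_p_value(20, 100, 100, lam, K=0.1))
```

For uniform base frequencies with +1/-1 scoring the root is ln 3; changing the frequencies in `freqs` shifts λ, which mirrors the composition dependence the abstract reports for K and λ.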