Louxin Zhang
National University of Singapore
Publication
Featured research published by Louxin Zhang.
Information & Computation | 2003
J. Kevin Lanctot; Ming Li; Bin Ma; Shaojiu Wang; Louxin Zhang
This paper presents a collection of string algorithms that are at the core of several biological problems such as discovering potential drug targets and creating diagnostic probes, universal primers, or unbiased consensus sequences. All these problems reduce to the task of finding a pattern that, with some error, occurs in one set of strings (Closest Substring Problem) and does not occur in another set (Farthest String Problem). In this paper, we break down the problem into several subproblems and prove the following results. 1. The following are all NP-hard: the Farthest String Problem, the Closest Substring Problem, and the Closest String Problem of finding a string that is close to each string in a set. 2. There is a PTAS for the Farthest String Problem based on a linear programming relaxation technique. 3. There is a polynomial-time (4/3 + ε)-approximation algorithm for the Closest String Problem for any small constant ε > 0. Using this algorithm, we also provide an efficient heuristic algorithm for the Closest Substring Problem. 4. The problem of finding a string that is at least Hamming distance d from as many strings in a set as possible cannot be approximated within n^ε in polynomial time for some fixed constant ε unless NP = P, where n is the number of strings in the set. 5. There is a polynomial-time 2-approximation for finding a string that is both the Closest Substring to one set and the Farthest String from another set.
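To make the Closest String Problem concrete, the following is a minimal brute-force sketch in Python. It is not the approximation algorithm from the paper; it simply enumerates every candidate string over an assumed DNA alphabet and keeps the one whose largest Hamming distance to the input strings is smallest. The function names `hamming` and `closest_string` are illustrative, and the running time is exponential in the string length, which is consistent with the NP-hardness result above.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def closest_string(strings, alphabet="ACGT"):
    """Exhaustively search for a string minimizing the maximum Hamming
    distance to every input string (illustration only, exponential time)."""
    length = len(strings[0])
    best, best_radius = None, length + 1
    for candidate in product(alphabet, repeat=length):
        c = "".join(candidate)
        radius = max(hamming(c, s) for s in strings)
        if radius < best_radius:
            best, best_radius = c, radius
    return best, best_radius


if __name__ == "__main__":
    center, radius = closest_string(["AAA", "AAT", "ATA"])
    print(center, radius)
```

For inputs of realistic length this search is infeasible, which is why approximation schemes and heuristics are of interest.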
SIAM Journal on Computing | 2000
Bin Ma; Ming Li; Louxin Zhang
This paper studies various algorithmic issues in reconstructing a species tree from gene trees under the duplication and the mutation cost model. This is a fundamental problem in computational molecular biology. Our main results are as follows. A linear time algorithm is presented for computing all the losses in duplications associated with the least common ancestor mapping from a gene tree to a species tree. This answers a problem raised recently by Eulenstein, Mirkin, and Vingron [J. Comput. Bio., 5 (1998), pp. 135-148]. The complexity of finding an optimal species tree from gene trees is studied. The problem is proved to be NP-hard for the duplication cost and for the mutation cost. Further, the concept of reconciled trees was introduced by Goodman et al. and formalized by Page for visualizing the relationship between gene and species trees. We show that constructing an optimal reconciled tree for gene trees is also NP-hard. Finally, we consider a general reconstruction problem and show it to be NP-hard even for the well-known nearest neighbor interchange distance. A new and efficiently computable metric is defined based on the duplication cost. We show that the problem of finding an optimal species tree from gene trees is NP-hard under this new metric but it can be approximated within factor 2 in polynomial time. Using this approximation result, we propose a heuristic method for finding a species tree from gene trees with uniquely labeled leaves under the duplication cost. Our experimental tests demonstrate that when the number of species is larger than 15 and gene trees are close to each other, our heuristic method is significantly better than the existing program in Page's GeneTree 1.0 that starts the search from a random tree.
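The least common ancestor mapping and the duplication cost can be illustrated with a short Python sketch. It assumes binary trees written as nested tuples whose leaves are species names, and it identifies each species-tree node with its cluster (the set of species below it). This is a simple quadratic-time illustration, not the linear-time algorithm described in the abstract, and all function names are hypothetical.

```python
def leaf_set(tree):
    """Set of species labels below a node of a nested-tuple binary tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaf_set(left) | leaf_set(right)


def clusters(species_tree):
    """All clusters (leaf sets of subtrees) of the species tree."""
    if isinstance(species_tree, str):
        return [frozenset([species_tree])]
    left, right = species_tree
    return [leaf_set(species_tree)] + clusters(left) + clusters(right)


def lca_map(gene_node, species_clusters):
    """LCA mapping: the smallest species cluster containing every
    species that appears below the gene-tree node."""
    target = leaf_set(gene_node)
    return min((c for c in species_clusters if target <= c), key=len)


def duplication_cost(gene_tree, species_tree):
    """Count internal gene nodes mapped to the same species node as a child."""
    sc = clusters(species_tree)

    def count(node):
        if isinstance(node, str):
            return 0
        left, right = node
        here = lca_map(node, sc)
        dup = int(here == lca_map(left, sc) or here == lca_map(right, sc))
        return dup + count(left) + count(right)

    return count(gene_tree)


if __name__ == "__main__":
    species = (("a", "b"), "c")
    gene = (("a", "c"), "b")  # disagrees with the species tree
    print(duplication_cost(gene, species))
```

A gene tree identical to the species tree has duplication cost 0 under this definition; each disagreement that forces a gene node and one of its children onto the same species node adds one duplication.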
Journal of Computer and System Sciences | 2004
Kwok Pui Choi; Louxin Zhang
The introduction of the spaced seed idea in the filtration stage of sequence comparison by Ma et al. (Bioinformatics 18 (2002) 440) has greatly increased the sensitivity of homology search without compromising the speed of search. Finding optimal spaced seeds is of great importance both theoretically and in designing better search tools for sequence comparison. In this paper, we study the computational aspects of calculating the hitting probability of spaced seeds and, based on these results, propose an efficient algorithm for identifying optimal spaced seeds.
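As an illustration of what the hitting probability of a seed means, the sketch below computes it exactly for short seeds under a simple assumed model: a homologous region of length n in which each position matches independently with probability p. The seed is written as a string of '1' (position must match) and '0' (don't care). The dynamic program tracks the most recent len(seed) - 1 positions, so it is exponential in the seed length; it is a didactic sketch, not the algorithm proposed in the paper, and `hit_probability` is a hypothetical name.

```python
from collections import defaultdict


def hit_probability(seed: str, n: int, p: float) -> float:
    """Probability that `seed` hits somewhere in a region of length n
    whose positions match independently with probability p."""
    m = len(seed)
    required = [i for i, ch in enumerate(seed) if ch == "1"]
    # Map: recent match/mismatch bits -> probability of reaching this
    # window without any hit so far.
    no_hit = {(): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for window, prob in no_hit.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                full = window + (bit,)
                if len(full) == m and all(full[i] for i in required):
                    continue  # the seed hits here; drop this probability mass
                key = full[-(m - 1):] if m > 1 else ()
                nxt[key] += prob * q
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


if __name__ == "__main__":
    # Compare a consecutive seed with a spaced seed of the same weight.
    for s in ("111", "1101"):
        print(s, hit_probability(s, 20, 0.7))
```

Comparing seeds of equal weight (the same number of '1' positions) in this way is one simple method of ranking them by sensitivity at a given similarity level.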
Bioinformatics | 2004
Kwok Pui Choi; Fanfan Zeng; Louxin Zhang
Motivation: Filtration is an important technique used to speed up local alignment, as exemplified in the BLAST programs. Recently, Ma et al. discovered that better filtering can be achieved by spacing out the matching positions according to a certain pattern, instead of requiring contiguous positions, to trigger a local alignment in their PatternHunter program. Such a match pattern is called a spaced seed. Results: Our numerical computation shows that the ranks of spaced seeds (based on sensitivity) change with sequence similarity. Since homologous sequences may have diverse similarity, we assess the sensitivity of spaced seeds over a range of similarity levels and present a list of good spaced seeds for facilitating homology search in DNA genomic sequences. We validate that the listed spaced seeds are indeed more sensitive using three arbitrarily chosen pairs of DNA genomic sequences.
Bioinformatics | 1999
Allison Lim; Louxin Zhang
A web interface to PHYLIP (version 3.57 C) is implemented using CGI/Perl programming. It enables users to do phylogenetic analysis through the Internet.
Journal of Computational Biology | 2008
Jian Ma; Aakrosh Ratan; Brian J. Raney; Bernard B. Suh; Louxin Zhang; Webb Miller; David Haussler
Accurately reconstructing the large-scale gene order in an ancestral genome is a critical step to better understand genome evolution. In this paper, we propose a heuristic algorithm, called DUPCAR, for reconstructing ancestral genomic orders with duplications. The method starts from the order of genes in modern genomes and predicts predecessor and successor relationships in the ancestor. Then a greedy algorithm is used to reconstruct the ancestral orders by connecting genes into contiguous regions based on predicted adjacencies. Computer simulation was used to validate the algorithm. We also applied the method to reconstruct the ancestral chromosome X of placental mammals and the ancestral genomes of the ciliate Paramecium tetraurelia.
BMC Genomics | 2005
Quan Li; Bernett T. K. Lee; Louxin Zhang
Background: Genes are not randomly distributed on a chromosome, as was once thought, even after removal of tandem repeats. The positional clustering of co-expressed genes is known in prokaryotes and has recently been reported in several eukaryotic organisms such as Caenorhabditis elegans, Drosophila melanogaster, and Homo sapiens. In order to further investigate the mode of tissue-specific gene clustering in higher eukaryotes, we have performed a genome-scale analysis of positional clustering of the mouse testis-specific genes. Results: Our computational analysis shows that a large proportion of testis-specific genes are clustered in groups of 2 to 5 genes in the mouse genome. The number of clusters is much higher than expected by chance even after removal of tandem repeats. Conclusion: Our result suggests that testis-specific genes tend to cluster on the mouse chromosomes. This provides another piece of evidence for the hypothesis that clusters of tissue-specific genes do exist.
IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2011
Louxin Zhang
When gene copies are sampled from various species, the resulting gene tree might disagree with the containing species tree. The primary causes of gene tree and species tree discord include incomplete lineage sorting, horizontal gene transfer, and gene duplication and loss. Each of these events yields a different parsimony criterion for inferring the (containing) species tree from gene trees. With incomplete lineage sorting, species tree inference is to find the tree minimizing extra gene lineages that had to coexist along species lineages; with gene duplication, it is to find the tree minimizing gene duplications and/or losses. In this paper, we present the following results: 1) The deep coalescence cost is equal to the number of gene losses minus two times the gene duplication cost in the reconciliation of a uniquely leaf labeled gene tree and a species tree. The deep coalescence cost can be computed in linear time for any gene tree and species tree. 2) The deep coalescence cost is never less than the gene duplication cost in the reconciliation of an arbitrary gene tree and a species tree. 3) Species tree inference by minimizing deep coalescence events is NP-hard.
Systematic Biology | 2008
Guoliang Li; Mike Steel; Louxin Zhang
Ancestral state reconstruction is an important approach to understanding the origins and evolution of key features of different living organisms (Liberles, 2007). For example, ancestral proteins and genomic sequences have been reconstructed for investigating the origins of genes and proteins (Hillis et al., 1994; Jermann et al., 1995; Zhang and Rosenberg, 2002; Gaucher et al., 2003; Thornton et al., 2003; Blanchette et al., 2004; Cai et al., 2004; Felsenstein, 2004; Taubenberger et al., 2005). A variety of reconstruction methods, including parsimony and maximum likelihood, exist for biomolecular sequencing (Yang et al., 1995; Koshi and Goldstein, 1996; Elias and Tuller, 2007), multistate discrete data (Schultz et al., 1996; Mooers and Schluter, 1999; Pagel, 1999), and continuous data (Martins, 1999). These different reconstruction methods have been assessed by both theoretical analyses (Maddison, 1995; Yang et al., 1995) and computer simulation (Schultz et al., 1996; Zhang and Nei, 1997; Salisbury and Kim, 2001; Blanchette et al., 2004; Mooers, 2004; Williams et al., 2006). One important observation in these investigations is that the topology of the phylogenetic tree relating the extant taxa to the target ancestor has a significant influence on reconstruction accuracy. For instance, a star-like phylogeny allows the ancestral character states to be inferred more accurately than other topologies given the same number of terminal taxa under the two-state symmetric model (Schultz et al., 1996; Evans et al., 2000). For more complex models (e.g., on four-states such as DNA), the influence of topology on reconstruction accuracy is more complicated (Lucena and Haussler, 2005).
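Parsimony-based ancestral state reconstruction can be sketched with the bottom-up pass of the classical Fitch algorithm. The Python example below assumes a rooted binary tree given as nested tuples of leaf names and a dictionary of observed leaf states; the function name `fitch` and the example data are illustrative and are not taken from the paper.

```python
def fitch(tree, states):
    """Bottom-up Fitch pass on a nested-tuple binary tree.

    Returns (candidate_states_at_this_node, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_changes + right_changes + 1


if __name__ == "__main__":
    tree = (("a", "b"), ("c", "d"))
    observed = {"a": 0, "b": 0, "c": 0, "d": 1}
    root_states, changes = fitch(tree, observed)
    print(root_states, changes)
```

The set returned at the root contains the most parsimonious root states; how reliably such an estimate recovers the true ancestral state depends on the tree topology, which is the question raised in the passage above.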
Symposium on Discrete Algorithms | 2006
Ming Li; Bin Ma; Louxin Zhang
Optimal spaced seeds were introduced by the theoretical computer science community to bioinformatics to effectively increase homology search sensitivity. They are now serving thousands of homology search queries daily. While dozens of papers have been published on optimal spaced seeds since their invention, many fundamental questions remain unanswered. In this paper, we settle several open questions in this area. Specifically, we prove that when the length of a non-uniformly spaced seed is bounded by an exponential function of the seed weight, the seed strictly outperforms the traditional consecutive seed in both (i) the average number of non-overlapping hits and (ii) the asymptotic hit probability. Then, we study the computation of the hit probability of a spaced seed, solving three more open questions: (iii) hit probability computation in a uniform homologous region is NP-hard and (iv) it admits a PTAS; (v) the asymptotic hit probability is computable in exponential time in seed length, independent of the homologous region length.