Bin Ma | Researchain

Publication

Featured research published by Bin Ma.

Journal of the ACM | 2002

On the closest string and substring problems

Ming Li; Bin Ma; Lusheng Wang

The problem of finding a center string that is close to every given string arises in computational molecular biology and coding theory. This problem has two versions: the Closest String problem and the Closest Substring problem. Given a set of strings S = {s1, s2, ..., sn}, each of length m, the Closest String problem is to find the smallest d and a string s of length m which is within Hamming distance d to each si ∈ S. This problem comes from coding theory when we are looking for a code not too far away from a given set of codes. The Closest Substring problem, with an additional input integer L, asks for the smallest d and a string s, of length L, which is within Hamming distance d away from a substring, of length L, of each si. This problem is much more elusive than the Closest String problem. The Closest Substring problem is formulated from applications in finding conserved regions, identifying genetic drug targets and generating genetic probes in molecular biology. Whether there are efficient approximation algorithms for both problems are major open questions in this area. We present two polynomial-time approximation algorithms with approximation ratio 1 + ε for any small ε to settle both questions.
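To make the Closest String definition concrete, here is a minimal Python sketch. It is not the approximation algorithm from the paper: it simply tries every candidate string, which takes exponential time and is only usable on tiny inputs. The function names and the default DNA alphabet are assumptions made for illustration.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def radius(center: str, strings: list[str]) -> int:
    """Largest Hamming distance from center to any string in the set."""
    return max(hamming(center, s) for s in strings)


def closest_string_bruteforce(strings: list[str], alphabet: str = "ACGT"):
    """Exhaustive search over all strings of length m (illustration only).

    Returns a center string and the smallest d such that the center is
    within Hamming distance d of every input string.
    """
    m = len(strings[0])
    candidates = ("".join(chars) for chars in product(alphabet, repeat=m))
    best = min(candidates, key=lambda c: radius(c, strings))
    return best, radius(best, strings)
```

For the Closest Substring variant, the same `radius` idea would be applied to the best-matching length-L window of each input string rather than to the whole string.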

Bioinformatics | 2008

ZOOM! Zillions of oligos mapped

Hao Lin; Zefeng Zhang; Michael Q. Zhang; Bin Ma; Ming Li

Motivation: The next generation sequencing technologies are generating billions of short reads daily. Resequencing and personalized medicine need much faster software to map these deep sequencing reads to a reference genome, to identify SNPs or rare transcripts. Results: We present a framework for how full sensitivity mapping can be done in the most efficient way, via spaced seeds. Using the framework, we have developed software called ZOOM, which is able to map the Illumina/Solexa reads of 15x coverage of a human genome to the reference human genome in one CPU-day, allowing two mismatches, at full sensitivity. Availability: ZOOM is freely available to non-commercial users at http://www.bioinfor.com/zoom
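The idea of a spaced seed can be sketched in a few lines of Python. A seed is written here as a string of 1s and 0s: positions marked 1 must match, positions marked 0 are ignored, so a mismatch that falls on a 0 does not prevent a hit. This is only an illustrative sketch. The seed pattern, function names, and index layout are assumptions and do not reflect ZOOM's actual seeds or implementation.

```python
from collections import defaultdict


def seed_key(text: str, start: int, seed: str) -> str:
    """Characters of text at the seed's '1' positions, starting at start."""
    return "".join(text[start + i] for i, bit in enumerate(seed) if bit == "1")


def build_index(reference: str, seed: str) -> dict:
    """Map each spaced key to the reference offsets where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - len(seed) + 1):
        index[seed_key(reference, pos, seed)].append(pos)
    return index


def candidate_positions(read: str, index: dict, seed: str) -> set:
    """Reference start positions for the read suggested by any seed hit."""
    hits = set()
    for offset in range(len(read) - len(seed) + 1):
        for pos in index.get(seed_key(read, offset, seed), []):
            start = pos - offset
            if start >= 0:  # discard alignments hanging off the left end
                hits.add(start)
    return hits
```

Each candidate position would still need to be verified by counting mismatches between the read and the reference at that position. Full sensitivity for a given mismatch budget depends on choosing a suitable set of seeds, which this sketch does not attempt.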

Journal of Computational Biology | 2002

A General Edit Distance between RNA Structures

Tao Jiang; Guohui Lin; Bin Ma; Kaizhong Zhang

Arc-annotated sequences are useful in representing the structural information of RNA sequences. In general, RNA secondary and tertiary structures can be represented as a set of nested arcs and a set of crossing arcs, respectively. Since RNA functions are largely determined by molecular conformation and therefore secondary and tertiary structures, the comparison between RNA secondary and tertiary structures has received much attention recently. In this paper, we propose the notion of edit distance to measure the similarity between two RNA secondary and tertiary structures, by incorporating various edit operations performed on both bases and arcs (i.e., base-pairs). Several algorithms are presented to compute the edit distance between two RNA sequences with various arc structures and under various score schemes, either exactly or approximately, with provably good performance. Preliminary experimental tests confirm that our definition of edit distance and the computation model are among the most reasonable ones ever studied in the literature.

Information & Computation | 2003

Distinguishing string selection problems

J. Kevin Lanctot; Ming Li; Bin Ma; Shaojiu Wang; Louxin Zhang

This paper presents a collection of string algorithms that are at the core of several biological problems such as discovering potential drug targets, creating diagnostic probes, universal primers or unbiased consensus sequences. All these problems reduce to the task of finding a pattern that, with some error, occurs in one set of strings (Closest Substring Problem) and does not occur in another set (Farthest String Problem). In this paper, we break down the problem into several subproblems and prove the following results.
1. The following are all NP-Hard: the Farthest String Problem, the Closest Substring Problem, and the Closest String Problem of finding a string that is close to each string in a set.
2. There is a PTAS for the Farthest String Problem based on a linear programming relaxation technique.
3. There is a polynomial-time (4/3 + ε)-approximation algorithm for the Closest String Problem for any small constant ε > 0. Using this algorithm, we also provide an efficient heuristic algorithm for the Closest Substring Problem.
4. The problem of finding a string that is at least Hamming distance d from as many strings in a set as possible cannot be approximated within n^ε in polynomial time for some fixed constant ε unless NP = P, where n is the number of strings in the set.
5. There is a polynomial-time 2-approximation for finding a string that is both the Closest Substring to one set, and the Farthest String from another set.
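The notion of a pattern that occurs, with some error, in one set of strings and does not occur in another can be illustrated with a small checker. This is a sketch under stated assumptions, not an algorithm from the paper: it only verifies a given candidate pattern, and the names `is_distinguishing`, `d_close` and `d_far`, as well as the choice to compare against every substring of the second set, are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_substring_distance(pattern: str, text: str) -> int:
    """Smallest Hamming distance from pattern to any same-length window of text.

    Assumes len(text) >= len(pattern); otherwise there is no window to compare.
    """
    length = len(pattern)
    return min(
        hamming(pattern, text[i:i + length])
        for i in range(len(text) - length + 1)
    )


def is_distinguishing(pattern, targets, non_targets, d_close, d_far):
    """True if pattern is near some window of every target string and
    far from every window of every non-target string."""
    occurs = all(min_substring_distance(pattern, t) <= d_close for t in targets)
    absent = all(min_substring_distance(pattern, t) >= d_far for t in non_targets)
    return occurs and absent
```

Finding a pattern that passes this check is the hard part, which is what the hardness and approximation results listed above address.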

Symposium on the Theory of Computing | 1999

Finding similar regions in many strings

Ming Li; Bin Ma; Lusheng Wang

Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. We solve three main open questions in this area. Assume that we are given n DNA sequences s1, ..., sn. The Consensus Patterns problem, which has been widely studied in bioinformatics research [26, 16, 12, 25, 4, 6, 15, 22, 24, 27], in its simplest form, asks for a region of length L in each si, and a median string s of length L so that the total Hamming distance from s to these regions is minimized. We show the problem is NP-hard and give a polynomial time approximation scheme (PTAS) for it. We also give a PTAS for the problem under the original measure of [26, 16, 12, 25]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important star alignment problem allowing at most a constant number of gaps, each of arbitrary length, in each sequence. The Closest String problem [2, 3, 7, 9, 18] asks for the smallest d and a string s which is within Hamming distance d to each si. The problem is NP-hard [7, 18]. [3] gives a polynomial time algorithm for constant d. For super-logarithmic d, [2, 9] give efficient approximation algorithms using linear program relaxation techniques. The best polynomial time approximation ratio for all d is given by [18] ([9] also independently claimed the ratio but only for super-logarithmic d). We settle the problem with a PTAS. We then give the first nontrivial better-than-2 approximation for the more elusive Closest Substring problem [18]: find a string s of length L such that, for each i, s is within Hamming distance d from some substring, of length L, of si.

Journal of Bioinformatics and Computational Biology | 2005

SPIDER: Software for Protein Identification from Sequence Tags with De Novo Sequencing Error

Yonghua Han; Bin Ma; Kaizhong Zhang

For the identification of novel proteins using MS/MS, de novo sequencing software computes one or several possible amino acid sequences (called sequence tags) for each MS/MS spectrum. Those tags are then used to match, accounting for amino acid mutations, the sequences in a protein database. If the de novo sequencing gives correct tags, the homologs of the proteins can be identified by this approach and software such as MS-BLAST is available for the matching. However, de novo sequencing very often gives only partially correct tags. The most common error is that a segment of amino acids is replaced by another segment with approximately the same mass. We developed a new efficient algorithm to match sequence tags with errors to database sequences for the purpose of protein and peptide identification. A software package, SPIDER, was developed and made available on the Internet for free public use. This work describes the algorithms and features of the SPIDER software.

Combinatorial Pattern Matching | 2000

The Longest Common Subsequence Problem for Arc-Annotated Sequences

Tao Jiang; Guohui Lin; Bin Ma; Kaizhong Zhang

Arc-annotated sequences are useful in representing the structural information of RNA and protein sequences. Recently, the longest arc-preserving common subsequence problem has been introduced as a framework for studying the similarity of arc-annotated sequences. In this paper, we consider arc-annotated sequences with various arc structures and present some new algorithmic and complexity results on the longest arc-preserving common subsequence problem. Some of our results answer an open question in [6,7] and some others improve the hardness results in [6,7].

Theoretical Computer Science | 2009

On the similarity metric and the distance metric

Shihyen Chen; Bin Ma; Kaizhong Zhang

Similarity and dissimilarity measures are widely used in many research areas and applications. When a dissimilarity measure is used, it is normally required to be a distance metric. However, when a similarity measure is used, there is no formal requirement. In this article, we have three contributions. First, we give a formal definition of similarity metric. Second, we show the relationship between similarity metric and distance metric. Third, we present general solutions to normalize a given similarity metric or distance metric.

Proceedings of the National Academy of Sciences of the United States of America | 2012

Dereplicating nonribosomal peptides using an informatic search algorithm for natural products (iSNAP) discovery

Ashraf S. Ibrahim; Lian Yang; Chad W. Johnston; Xiaowen Liu; Bin Ma; Nathan A. Magarvey

SIAM Journal on Computing | 2003

Genetic Design of Drugs Without Side-Effects

Xiaotie Deng; Guojun Li; Zimao Li; Bin Ma; Lusheng Wang
