Jason Tsong-Li Wang | Researchain

Publication

Featured research published by Jason Tsong-Li Wang.

symposium on principles of database systems | 2002

Algorithmics and applications of tree and graph searching

Dennis E. Shasha; Jason Tsong-Li Wang; Rosalba Giugno

Modern search engines answer keyword-based queries extremely efficiently. The impressive speed is due to clever inverted index structures, caching, a domain-independent knowledge of strings, and thousands of machines. Several research efforts have attempted to generalize keyword search to keytree and keygraph searching, because trees and graphs have many applications in next-generation database systems. This paper surveys both algorithms and applications, giving some emphasis to our own work.

Archive | 2010

Data Mining in Bioinformatics

Jason Tsong-Li Wang; Mohammed Javeed Zaki; Hannu Toivonen; Dennis E. Shasha

Written especially for computer scientists, all necessary biology is explained. Presents new techniques on gene expression data mining, gene mapping for disease detection, and phylogenetic knowledge discovery.

systems man and cybernetics | 1994

Exact and approximate algorithms for unordered tree matching

Dennis E. Shasha; Jason Tsong-Li Wang; Kaizhong Zhang; Frank Y. Shih

We consider the problem of comparison between unordered trees, i.e., trees for which the order among siblings is unimportant. The criterion for comparison is the distance as measured by a weighted sum of the costs of deletion, insertion and relabel operations on tree nodes. Such comparisons may contribute to pattern recognition efforts in any field (e.g., genetics) where data can naturally be characterized by unordered trees. In companion work, we have shown this problem to be NP-complete. This paper presents an efficient enumerative algorithm and several heuristics leading to approximate solutions. The algorithms are based on probabilistic hill climbing and bipartite matching techniques. The paper evaluates the accuracy and time efficiency of the heuristics by applying them to a set of trees transformed from industrial parts based on a previously proposed morphological model.

International Journal of Foundations of Computer Science | 1996

ON THE EDITING DISTANCE BETWEEN UNDIRECTED ACYCLIC GRAPHS

Kaizhong Zhang; Jason Tsong-Li Wang; Dennis E. Shasha

We consider the problem of comparing CUAL graphs (Connected, Undirected, Acyclic graphs with nodes being Labeled). This problem is motivated by the study of information retrieval for bio-chemical and molecular databases. Suppose we define the distance between two CUAL graphs G1 and G2 to be the weighted number of edit operations (insert node, delete node and relabel node) to transform G1 to G2. By reduction from exact cover by 3-sets, one can show that finding the distance between two CUAL graphs is NP-complete. In view of the hardness of the problem, we propose a constrained distance metric, called the degree-2 distance, by requiring that any node to be inserted (deleted) have no more than 2 neighbors. With this metric, we present an efficient algorithm to solve the problem. The algorithm runs in time O(N1 N2 D^2) for general weighting edit operations and in time for integral weighting edit operations, where Ni, i=1, 2, is the number of nodes in Gi, D=min{d1, d2} and di is the maximum degree of Gi.

IEEE Transactions on Pattern Analysis and Machine Intelligence | 1998

An algorithm for finding the largest approximately common substructures of two trees

Jason Tsong-Li Wang; Bruce A. Shapiro; Dennis E. Shasha; Kaizhong Zhang; Kathleen M. Currey

Ordered, labeled trees are trees in which each node has a label and the left-to-right order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology and natural language processing. We consider a substructure of an ordered labeled tree T to be a connected subgraph of T. Given two ordered labeled trees T1 and T2 and an integer d, the largest approximately common substructure problem is to find a substructure U1 of T1 and a substructure U2 of T2 such that U1 is within edit distance d of U2 and where there does not exist any other substructure V1 of T1 and V2 of T2 such that V1 and V2 satisfy the distance constraint and the sum of the sizes of V1 and V2 is greater than the sum of the sizes of U1 and U2. We present a dynamic programming algorithm to solve this problem, which runs as fast as the fastest known algorithm for computing the edit distance of two trees when the distance allowed in the common substructures is a constant independent of the input trees. To demonstrate the utility of our algorithm, we discuss its application to discovering motifs in multiple RNA secondary structures (which are ordered labeled trees).

IBM Systems Journal | 2001

New techniques for extracting features from protein sequences

Jason Tsong-Li Wang; Qicheng Ma; Dennis E. Shasha; Cathy H. Wu

In this paper we propose new techniques to extract features from protein sequences. We then use the features as inputs for a Bayesian neural network (BNN) and apply the BNN to classifying protein sequences obtained from the PIR (Protein Information Resource) database maintained at the National Biomedical Research Foundation. To evaluate the performance of the proposed approach, we compare it with other protein classifiers built based on sequence alignment and machine learning methods. Experimental results show the high precision of the proposed classifier and the complementarity of the bioinformatics tools studied in the paper.

knowledge discovery and data mining | 1999

Evaluating a class of distance-mapping algorithms for data mining and clustering

Jason Tsong-Li Wang; Xiong Wang; King-Ip Lin; Dennis E. Shasha; Bruce A. Shapiro; Kaizhong Zhang

A distance-mapping algorithm takes a set of objects and a distance metric and then maps those objects to a Euclidean or pseudo-Euclidean space in such a way that the distances among objects are approximately preserved. Distance-mapping algorithms are a useful tool for clustering and visualization in data-intensive applications, because they replace expensive distance calculations by sum-of-square calculations. This can make clustering in large databases with expensive distance metrics practical. In this paper we present five distance-mapping algorithms and conduct experiments to compare their performance in data clustering applications. These include two algorithms called FastMap and MetricMap, and three hybrid heuristics that combine the two algorithms in different ways. Experimental results on both synthetic and RNA data show the superiority of the hybrid algorithms. The results imply that FastMap and MetricMap capture complementary information about distance metrics and therefore can be used together to great benefit. The net effect is that multi-day computations may be done in minutes.
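
To make the idea of distance mapping concrete, here is a minimal Python sketch in the style of FastMap as it is commonly described: pick two far-apart pivot objects, project every object onto the line through them using the cosine law, then repeat on the residual distances. This is an illustration under stated assumptions, not the implementation evaluated in the paper; the names `fastmap`, `dist`, and `k` are hypothetical, and MetricMap and the hybrid heuristics are not shown.

```python
import math

def fastmap(objects, dist, k):
    """Map each object to k coordinates so that Euclidean distances
    between coordinates approximate dist(). Illustrative sketch only."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, dim):
        # Squared distance that remains after subtracting the
        # contribution of the first `dim` coordinates already assigned.
        rest = dist(objects[i], objects[j]) ** 2
        for c in range(dim):
            rest -= (coords[i][c] - coords[j][c]) ** 2
        return max(rest, 0.0)  # clamp small negatives (non-Euclidean input, rounding)

    for dim in range(k):
        # Pivot heuristic: start at object 0, hop to the farthest object, hop once more.
        a = 0
        b = max(range(n), key=lambda j: d2(a, j, dim))
        a = max(range(n), key=lambda j: d2(b, j, dim))
        dab2 = d2(a, b, dim)
        if dab2 == 0.0:
            break  # nothing left to explain; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine-law projection of object i onto the line through a and b.
            coords[i][dim] = (d2(a, i, dim) + dab2 - d2(b, i, dim)) / (2 * dab)
    return coords
```

After the mapping, a clustering routine can compare objects with cheap sum-of-squares arithmetic on `coords` instead of calling the expensive `dist` again, which is the saving the abstract refers to.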

international conference on data engineering | 2004

Unordered tree mining with applications to phylogeny

Dennis E. Shasha; Jason Tsong-Li Wang; Sen Zhang

Frequent structure mining (FSM) aims to discover and extract patterns frequently occurring in structural data, such as trees and graphs. FSM finds many applications in bioinformatics, XML processing, Web log analysis, and so on. We present a new FSM technique for finding patterns in rooted unordered labeled trees. The patterns of interest are cousin pairs in these trees. A cousin pair is a pair of nodes sharing the same parent, the same grandparent, or the same great-grandparent, etc. Given a tree T, our algorithm finds all interesting cousin pairs of T in O(|T|^2) time where |T| is the number of nodes in T. Experimental results on synthetic data and phylogenies show the scalability and effectiveness of the proposed technique. To demonstrate the usefulness of our approach, we discuss its applications to locating co-occurring patterns in multiple evolutionary trees, evaluating the consensus of equally parsimonious trees, and finding kernel trees of groups of phylogenies. We also describe extensions of our algorithms for undirected acyclic graphs (or free trees).
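
The cousin-pair definition above can be illustrated with a small Python sketch. It assumes a single rooted tree given as a child-to-parent map and, as a simplification that may not match the paper's exact definition, considers only pairs of nodes at the same depth: siblings get generation gap 0, nodes sharing a grandparent get 1, and so on. The function name `cousin_pairs` and the input layout are hypothetical, and this naive version walks up the tree for every pair, so it does not claim the O(|T|^2) bound of the paper's algorithm.

```python
from collections import Counter
from itertools import combinations

def cousin_pairs(parent, label):
    """Count same-depth cousin pairs in one rooted tree.

    parent: dict node -> parent node (None for the root)
    label:  dict node -> label
    Returns Counter keyed by (label_a, label_b, gap), labels sorted,
    where gap is 0 for siblings, 1 for a shared grandparent, etc.
    """
    def depth(v):
        d = 0
        while parent[v] is not None:
            v = parent[v]
            d += 1
        return d

    depths = {v: depth(v) for v in parent}
    counts = Counter()
    for u, v in combinations(parent, 2):
        if depths[u] != depths[v]:
            continue  # simplification: only same-generation pairs
        a, b, steps = u, v, 0
        while a != b:
            # Climb in lockstep until both paths meet at the common ancestor.
            a, b = parent[a], parent[b]
            steps += 1
        first, second = sorted((label[u], label[v]))
        counts[(first, second, steps - 1)] += 1
    return counts

# Small example: r has children a and b; a has children c and d; b has child e.
tree = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a", "e": "b"}
names = {v: v for v in tree}
pairs = cousin_pairs(tree, names)
```

Running the same function over several trees and intersecting the resulting counters is one simple way to look for pairs that co-occur across trees, in the spirit of the applications listed in the abstract.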

combinatorial pattern matching | 1995

On the editing distance between undirected acyclic graphs and related problems

Kaizhong Zhang; Jason Tsong-Li Wang; Dennis E. Shasha

Using these simple, efficient algorithms, a user can submit a query structure and obtain those data structures approximately matching the query. To our knowledge, this work gives the first polynomial time algorithm ever presented to solve the edit distance problem between undirected acyclic graphs. We will have this algorithm implemented within a few months and will make it available to the community.

BMC Bioinformatics | 2005

A method for aligning RNA secondary structures and its application to RNA motif detection

Jianghui Liu; Jason Tsong-Li Wang; Jun Hu; Bin Tian

Background: Alignment of RNA secondary structures is important in studying functional RNA motifs. In recent years, much progress has been made in RNA motif finding and structure alignment. However, existing tools either require a large number of prealigned structures or suffer from high time complexities. This makes it difficult for the tools to process RNAs whose prealigned structures are unavailable or process very large RNA structure databases. Results: We present here an efficient tool called RSmatch for aligning RNA secondary structures and for motif detection. Motivated by widely used algorithms for RNA folding, we decompose an RNA secondary structure into a set of atomic structure components that are further organized by a tree model to capture the structural particularities. RSmatch can find the optimal global or local alignment between two RNA secondary structures using two scoring matrices, one for single-stranded regions and the other for double-stranded regions. The time complexity of RSmatch is O(mn) where m is the size of the query structure and n that of the subject structure. When applied to searching a structure database, RSmatch can find similar RNA substructures, and is capable of conducting multiple structure alignment and iterative database search. Therefore it can be used to identify functional RNA motifs. The accuracy of RSmatch is tested by experiments using a number of known RNA structures, including simple stem-loops and complex structures containing junctions. Conclusion: With respect to computing efficiency and accuracy, RSmatch compares favorably with other tools for RNA structure alignment and motif detection. This tool shall be useful to researchers interested in comparing RNA structures obtained from wet lab experiments or RNA folding programs, particularly when the size of the structure dataset is large.
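
For readers unfamiliar with why an alignment can run in O(mn) time, the following Python sketch shows a generic global-alignment dynamic program over two plain sequences. It is not RSmatch: RSmatch aligns structure components organized in a tree and uses two scoring matrices, whereas this sketch uses a single hypothetical `score` function and a flat `gap` penalty chosen purely for illustration. What it shares with the description above is the shape of the computation, a table with one cell per (query position, subject position) pair.

```python
def global_alignment_score(query, subject, score, gap=-1):
    """Best global alignment score of two sequences.

    Fills an (m+1) x (n+1) table, so time and space are O(m * n).
    `score(x, y)` and `gap` are illustrative assumptions.
    """
    m, n = len(query), len(subject)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        table[i][0] = i * gap  # query prefix aligned against nothing
    for j in range(1, n + 1):
        table[0][j] = j * gap  # subject prefix aligned against nothing
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + score(query[i - 1], subject[j - 1]),  # pair the two symbols
                table[i - 1][j] + gap,  # skip a query symbol
                table[i][j - 1] + gap,  # skip a subject symbol
            )
    return table[m][n]

def match_mismatch(x, y):
    # Toy scoring: +1 for identical symbols, -1 otherwise.
    return 1 if x == y else -1
```

For example, `global_alignment_score("GAUC", "GUC", match_mismatch)` aligns three matching symbols and one gap.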
