Is this you? Create Your Porfile

Jiong Yang

Case Western Reserve University

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Jiong Yang is active.

Explore More

Publication

Featured researches published by Jiong Yang.

international conference on data engineering | 2007

TreePi: A Novel Graph Indexing Method

Shijie Zhang; Meng Hu; Jiong Yang

Graphs are widely used to model complex structured data such as XML documents, protein networks, and chemical compounds. One of the fundamental problems in graph databases is efficient search and retrieval of graphs using indexing techniques. In this paper, we study the problem of indexing graph databases using frequent subtrees as indexing structures. Trees can be manipulated efficiently while preserving a lot of structural information of the original graphs. In our proposed method, frequent subtrees of a database are selected as the feature set. To save memory, the set of feature trees is shrunk based on a support threshold function and their discriminative power. A tree-partition based query processing scheme is proposed to perform graph queries. The concept of center distance constraints is introduced to prune the search space. Furthermore, a new algorithm which utilizes the location information of indexing structures is used to perform subgraph isomorphism tests. We apply our method on a wide range of real and synthetic data to demonstrate the usefulness and effectiveness of this approach.

BMC Bioinformatics | 2007

PathFinder: Mining signal transduction pathway segments from protein-protein interaction networks

Gurkan Bebek; Jiong Yang

BackgroundA Signal transduction pathway is the chain of processes by which a cell converts an extracellular signal into a response. In most unicellular organisms, the number of signal transduction pathways influences the number of ways the cell can react and respond to the environment. Discovering signal transduction pathways is an arduous problem, even with the use of systematic genomic, proteomic and metabolomic technologies. These techniques lead to an enormous amount of data and how to interpret and process this data becomes a challenging computational problem.ResultsIn this study we present a new framework for identifying signaling pathways in protein-protein interaction networks. Our goal is to find biologically significant pathway segments in a given interaction network. Currently, protein-protein interaction data has excessive amount of noise, e.g., false positive and false negative interactions. First, we eliminate false positives in the protein-protein interaction network by integrating the network with microarray expression profiles, protein subcellular localization and sequence information. In addition, protein families are used to repair false negative interactions. Then the characteristics of known signal transduction pathways and their functional annotations are extracted in the form of association rules.ConclusionGiven a pair of starting and ending proteins, our methodology returns candidate pathway segments between these two proteins with possible missing links (recovered false negatives). In our study, S. cerevisiae (yeast) data is used to demonstrate the effectiveness of our method.

international conference on data engineering | 1999

STING+: an approach to active spatial data mining

Wei Wang; Jiong Yang; Richard R. Muntz

Spatial data mining presents new challenges due to the large size of spatial data, the complexity of spatial data types, and the special nature of spatial access methods. Most research in this area has focused on efficient query processing of static data. This paper introduces an active spatial data mining approach which extends the current spatial data mining algorithms to efficiently support user-defined triggers on dynamically evolving spatial data. To exploit the locality of the effect of an update and the nature of spatial data, we employ a hierarchical structure with associated statistical information at the various levels of the hierarchy and decompose the user-defined trigger into a set of sub-triggers associated with cells in the hierarchy. Updates are suspended in the hierarchy until their cumulative effect might cause the trigger to fire. It is shown that this approach achieves three orders of magnitude improvement over the naive approach that re-evaluates the condition over the database for each update, while both approaches produce the same result without any delay. Moreover this scheme can support incremental query processing as well.

International Journal on Artificial Intelligence Tools | 2005

An improved biclustering method for analyzing gene expression profiles

Jiong Yang; Haixun Wang; Wei Wang; Philip S. Yu

Microarrays are one of the latest breakthroughs in experimental molecular biology, which provide a powerful tool by which the expression patterns of thousands of genes can be monitored simultaneously and are already producing huge amount of valuable data. The concept of bicluster was introduced by Cheng and Church1 to capture the coherence of a subset of genes and a subset of conditions. A set of heuristic algorithms were also designed to either find one bicluster or a set of biclusters, which consist of iterations of masking null values and discovered biclusters, coarse and fine node deletion, node addition, and the inclusion of inverted data. These heuristics inevitably suffer from some serious drawback. The masking of null values and discovered biclusters with random numbers may result in the phenomenon of random interference which in turn impacts the discovery of high quality biclusters. To address this issue and to further accelerate the biclustering process, we generalize the model of bicluster to incorporate null values and propose a probabilistic algorithm (FLOC) that can discover a set of k possibly overlapping biclusters simultaneously. Furthermore, this algorithm can easily be extended to support additional features that suit different requirements at virtually little cost. Experimental study on the yeast gene expression data2 shows that the FLOC algorithm can offer substantial improvements over the previously proposed algorithm.

international conference on data mining | 2005

Finding representative set from massive data

Feng Pan; Wei Wang; Anthony K. H. Tung; Jiong Yang

In the information age, data is pervasive. In some applications, data explosion is a significant phenomenon. The massive data volume poses challenges to both human users and computers. In this project, we propose a new model for identifying representative set from a large database. A representative set is a special subset of the original dataset, which has three main characteristics: It is significantly smaller in size compared to the original dataset. It captures the most information from the original dataset compared to other subsets of the same size. It has low redundancy among the representatives it contains. We use information-theoretic measures such as mutual information and relative entropy to measure the representativeness of the representative set. We first design a greedy algorithm and then present a heuristic algorithm that delivers much better performance. We run experiments on two real datasets and evaluate the effectiveness of our representative set in terms of coverage and accuracy. The experiments show that our representative set attains expected characteristics and captures information more efficiently.

intelligent systems in molecular biology | 2006

Annotating proteins by mining protein interaction networks

Mustafa Kirac; Gultekin Ozsoyoglu; Jiong Yang

MOTIVATIONnIn general, most accurate gene/protein annotations are provided by curators. Despite having lesser evidence strengths, it is inevitable to use computational methods for fast and a priori discovery of protein function annotations. This paper considers the problem of assigning Gene Ontology (GO) annotations to partially annotated or newly discovered proteins.nnnRESULTSnWe present a data mining technique that computes the probabilistic relationships between GO annotations of proteins on protein-protein interaction data, and assigns highly correlated GO terms of annotated proteins to non-annotated proteins in the target set. In comparison with other techniques, probabilistic suffix tree and correlation mining techniques produce the highest prediction accuracy of 81% precision with the recall at 45%.nnnAVAILABILITYnCode is available upon request. Results and used materials are available online at http://kirac.case.edu/PROTAN.

BMC Bioinformatics | 2008

Microarray data mining using landmark gene-guided clustering

Pankaj Chopra; Jaewoo Kang; Jiong Yang; HyungJun Cho; Heenam Stanley Kim; Min-Goo Lee

BackgroundClustering is a popular data exploration technique widely used in microarray data analysis. Most conventional clustering algorithms, however, generate only one set of clusters independent of the biological context of the analysis. This is often inadequate to explore data from different biological perspectives and gain new insights. We propose a new clustering model that can generate multiple versions of different clusters from a single dataset, each of which highlights a different aspect of the given dataset.ResultsBy applying our SigCalc algorithm to three yeast Saccharomyces cerevisiae datasets we show two results. First, we show that different sets of clusters can be generated from the same dataset using different sets of landmark genes. Each set of clusters groups genes differently and reveals new biological associations between genes that were not apparent from clustering the original microarray expression data. Second, we show that many of these new found biological associations are common across datasets. These results also provide strong evidence of a link between the choice of landmark genes and the new biological associations found in gene clusters.ConclusionWe have used the SigCalc algorithm to project the microarray data onto a completely new subspace whose co-ordinates are genes (called landmark genes), known to belong to a Biological Process. The projected space is not a true vector space in mathematical terms. However, we use the term subspace to refer to one of virtually infinite numbers of projected spaces that our proposed method can produce. By changing the biological process and thus the landmark genes, we can change this subspace. We have shown how clustering on this subspace reveals new, biologically meaningful clusters which were not evident in the clusters generated by conventional methods. The R scripts (source code) are freely available under the GPL license. The source code is available [see Additional File 1] as additional material, and the latest version can be obtained at http://www4.ncsu.edu/~pchopra/landmarks.html. The code is under active development to incorporate new clustering methods and analysis.

computational systems bioinformatics | 2005

Gene teams with relaxed proximity constraint

Sun Kim; Jeong Hyeon Choi; Jiong Yang

Functionally related genes co-evolve, probably due to the strong selection pressure in evolution. Thus we expect that they are present in multiple genomes. Physical proximity among genes, known as gene team, is a very useful concept to discover functionally related genes in multiple genomes. However there are also many gene sets that do not preserve physical proximity. In this paper, we generalized the gene team model, that looks for gene clusters in a physically clustered form, to multiple genome cases with relaxed constraint. We propose a novel hybrid pattern model that combines the set and the sequential pattern models. Our model searches for gene clusters with and/or without physical proximity constraint. This model is implemented and tested with 97 genomes (120 replicons). The result was analyzed to show the usefulness of our model. Especially, analysis of gene clusters that belong to B. subtilis and E. coli demonstrated that our model predicted many experimentally verified operons and functionally related clusters. Our program is fast enough to provide a service on the web at http://platcom.informatics.Indiana.edu/platcom/. Users can select any combination of 97 genomes to predict gene teams.

international conference on data mining | 2009

RING: An Integrated Method for Frequent Representative Subgraph Mining

Shijie Zhang; Jiong Yang; Shirong Li

We propose a novel representative based subgraph mining model. A series of standards and methods are proposed to select invariants. Patterns are mapped into invariant vectors in a multidimensional space. To find qualified patterns, only a subset of frequent patterns is generated as representatives, such that every frequent pattern is close to one of the representative patterns while representative patterns are distant from each other. We devise the RING algorithm, integrating the representative selection into the pattern mining process. Meanwhile, we use R-trees to assist this mining process. Last but not least, a large number of real and synthetic datasets are employed for the empirical study, which show the benefits of the representative model and the efficiency of the RING algorithm.

international conference on data engineering | 2007

Monkey: Approximate Graph Mining Based on Spanning Trees

Shijie Zhang; Jiong Yang; VenuMadhav Cheedella

In the recent past, many exact graph mining algorithms have been developed to find frequent patterns in a graph database. However, many networks or graphs generated from biological data and other applications may be incomplete or inaccurate. Hence, it is necessary to design approximate graph mining techniques. In this paper, we will study the problem of approximate graph mining and propose an optimized solution which uses frequent trees and a spanning tree based pre-verification check in the mining process.

Explore More