Jun Huan | Researchain

Publication

Featured research published by Jun Huan.

International Conference on Data Mining | 2003

Efficient mining of frequent subgraphs in the presence of isomorphism

Jun Huan; Wei Wang; Jan F. Prins

Frequent subgraph mining is an active research topic in the data mining community. A graph is a general model to represent data and has been used in many domains like cheminformatics and bioinformatics. Mining patterns from graph databases is challenging since graph-related operations, such as subgraph testing, generally have higher time complexity than the corresponding operations on itemsets, sequences, and trees, which have been studied extensively. We propose a novel frequent subgraph mining algorithm, FFSM, which employs a vertical search scheme within an algebraic graph framework we have developed to reduce the number of redundant candidates proposed. Our empirical study on synthetic and real datasets demonstrates that FFSM achieves a substantial performance gain over the current state-of-the-art subgraph mining algorithm gSpan.

Knowledge Discovery and Data Mining | 2004

SPIN: mining maximal frequent subgraphs from graph databases

Jun Huan; Wei Wang; Jan F. Prins; Jiong Yang

One fundamental challenge for mining recurring subgraphs from semi-structured data sets is the overwhelming abundance of such patterns. In large graph databases, the total number of frequent subgraphs can become too large to allow a full enumeration using reasonable computational resources. In this paper, we propose a new algorithm that mines only maximal frequent subgraphs, i.e., subgraphs that are not a part of any other frequent subgraph. This may exponentially decrease the size of the output set in the best case; in our experiments on practical data sets, mining maximal frequent subgraphs reduces the total number of mined patterns by two to three orders of magnitude. Our method first mines all frequent trees from a general graph database and then reconstructs all maximal subgraphs from the mined trees. Using two chemical structure benchmarks and a set of synthetic graph data sets, we demonstrate that, in addition to decreasing the output size, our algorithm can achieve a five-fold speedup over the current state-of-the-art subgraph mining algorithms.
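To make the notion of maximality concrete, the following is a minimal Python sketch, not the SPIN algorithm itself. It assumes, purely for illustration, that every frequent pattern is given as a set of edges over uniquely labeled vertices, so that "is a part of" reduces to a proper-subset test and no isomorphism check is needed. The function name `maximal_patterns` and the sample patterns are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple

Edge = Tuple[str, str]
Pattern = FrozenSet[Edge]


def maximal_patterns(frequent: Iterable[Pattern]) -> List[Pattern]:
    """Keep only the patterns that are not a proper subset of another pattern."""
    patterns = list(set(frequent))  # drop exact duplicates first
    return [
        p for p in patterns
        if not any(p < q for q in patterns)  # `<` is the proper-subset test
    ]


# Hypothetical frequent patterns over vertices with unique labels.
frequent = [
    frozenset({("C", "N")}),
    frozenset({("C", "O")}),
    frozenset({("C", "N"), ("C", "O")}),
    frozenset({("N", "S")}),
]
result = maximal_patterns(frequent)
```

Here the two single-edge patterns contained in the two-edge pattern are dropped, while the unrelated `("N", "S")` pattern survives. The pairwise comparison is quadratic in the number of patterns, which is acceptable only for a small illustration.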

International Conference on Data Engineering | 2007

Graph Database Indexing Using Structured Graph Decomposition

David W. Williams; Jun Huan; Wei Wang

We introduce a novel method of indexing graph databases in order to facilitate subgraph isomorphism and similarity queries. The index is comprised of two major data structures. The primary structure is a directed acyclic graph which contains a node for each of the unique, induced subgraphs of the database graphs. The secondary structure is a hash table which cross-indexes each subgraph for fast isomorphic lookup. In order to create a hash key independent of isomorphism, we utilize a code-based canonical representation of adjacency matrices, which we have further refined to improve computation speed. We validate the concept by demonstrating its effectiveness in answering queries for two practical datasets. Our experiments show that for subgraph isomorphism queries, our method outperforms existing methods by more than an order of magnitude.
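The idea of a hash key that is independent of isomorphism can be illustrated with a small Python sketch. This is a brute-force stand-in, not the refined canonical code described in the abstract: it assumes unlabeled, undirected graphs given as adjacency matrices and simply takes the smallest flattened matrix over all vertex orderings, which costs factorial time and is only feasible for tiny graphs. The name `canonical_code` and the sample graphs are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Matrix = List[List[int]]


def canonical_code(adj: Matrix) -> Tuple[int, ...]:
    """Return the smallest flattened adjacency matrix over all vertex orderings."""
    n = len(adj)
    return min(
        tuple(adj[order[i]][order[j]] for i in range(n) for j in range(n))
        for order in permutations(range(n))
    )


# The same 3-vertex path under two different vertex numberings.
path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # middle vertex is 1
path_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # middle vertex is 0
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

# Hash table keyed by canonical code: isomorphic graphs share one entry.
index: Dict[Tuple[int, ...], str] = {canonical_code(path_a): "path"}
found = index.get(canonical_code(path_b))
```

Because both numberings of the path produce the same code, the lookup for `path_b` finds the entry stored for `path_a`, while the triangle maps to a different key.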

Languages and Compilers for Parallel Computing | 2006

UTS: an unbalanced tree search benchmark

Stephen L. Olivier; Jun Huan; Jinze Liu; Jan F. Prins; James Dinan; P. Sadayappan; Chau-Wen Tseng

This paper presents an unbalanced tree search (UTS) benchmark designed to evaluate the performance and ease of programming for parallel applications requiring dynamic load balancing. We describe algorithms for building a variety of unbalanced search trees to simulate different forms of load imbalance. We created versions of UTS in two parallel languages, OpenMP and Unified Parallel C (UPC), using work stealing as the mechanism for reducing load imbalance. We benchmarked the performance of UTS on various parallel architectures, including shared-memory systems and PC clusters. We found it simple to implement UTS in both UPC and OpenMP, due to UPC's shared-memory abstractions. Results show that both UPC and OpenMP can support efficient dynamic load balancing on shared-memory architectures. However, UPC cannot alleviate the underlying communication costs of distributed-memory systems. Since dynamic load balancing requires intensive communication, performance portability remains difficult for applications such as UTS, and performance degrades on PC clusters. By varying key work stealing parameters, we expose important tradeoffs between the granularity of load balance, the degree of parallelism, and communication costs.
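Work stealing on an unbalanced tree can be sketched as a sequential, round-robin simulation in Python. This is an illustration under stated assumptions, not the OpenMP or UPC implementation from the paper: the tree generator `children` is a hypothetical rule chosen only to produce subtrees of different sizes, and it is not the UTS tree-building algorithm. Each simulated worker takes the newest node from its own queue, and an idle worker steals the oldest node from the first non-empty queue it finds.

```python
from collections import deque
from typing import Deque, List, Tuple

Node = Tuple[int, int]  # (node id, depth)


def children(node: Node, max_depth: int = 6) -> List[Node]:
    """Hypothetical unbalanced tree: the child count depends on the node id."""
    node_id, depth = node
    if depth >= max_depth:
        return []
    count = (node_id * 7 + 3) % 4  # 0 to 3 children, so subtrees differ in size
    return [(node_id * 4 + k + 1, depth + 1) for k in range(count)]


def count_sequential(root: Node) -> int:
    """Reference traversal: count all nodes with a single stack."""
    stack, total = [root], 0
    while stack:
        total += 1
        stack.extend(children(stack.pop()))
    return total


def count_with_stealing(root: Node, workers: int = 4) -> List[int]:
    """Simulate work stealing; return how many nodes each worker processed."""
    queues: List[Deque[Node]] = [deque() for _ in range(workers)]
    queues[0].append(root)
    processed = [0] * workers
    while any(queues):
        for w in range(workers):
            if queues[w]:
                node = queues[w].pop()  # owner takes its newest node
            else:
                victim = next((q for q in queues if q), None)
                if victim is None:
                    continue  # nothing left to steal in this turn
                node = victim.popleft()  # thief takes the victim's oldest node
            processed[w] += 1
            queues[w].extend(children(node))
    return processed


root = (0, 0)
per_worker = count_with_stealing(root)
total = count_sequential(root)
```

Every node is processed exactly once, so the per-worker counts add up to the sequential total. Stealing from the opposite end of the queue tends to hand over nodes closer to the root, which usually carry more work per steal.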

Research in Computational Molecular Biology | 2004

Mining protein family specific residue packing patterns from protein structure graphs

Jun Huan; Wei Wang; Deepak Bandyopadhyay; Jack Snoeyink; Jan F. Prins; Alexander Tropsha

Finding recurring residue packing patterns, or spatial motifs, that characterize protein structural families is an important problem in bioinformatics. We apply a novel frequent subgraph mining algorithm to three graph representations of protein three-dimensional (3D) structure. In each protein graph, a vertex represents an amino acid. Vertex-residues are connected by edges using three approaches: first, based on a simple distance threshold between contact residues; second, using the Delaunay tessellation from computational geometry; and third, using the recently developed almost-Delaunay tessellation approach. Applying a frequent subgraph mining algorithm to a set of graphs representing a protein family from the Structural Classification of Proteins (SCOP) database, we typically identify several hundred common subgraphs equivalent to common packing motifs found in the majority of proteins in the family. We also use the counts of motifs extracted from proteins in two different SCOP families as input variables in a binary classification experiment. The resulting models are capable of predicting the protein family association with accuracy exceeding 90 percent. Our results indicate that graphs based on both almost-Delaunay and Delaunay tessellations are sparser than the contact distance graphs; yet they are robust and efficient for mining protein spatial motifs.
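The first of the three edge definitions, a distance threshold between residues, is simple enough to sketch in Python. The sketch below is an assumption-laden illustration rather than the authors' pipeline: it represents each residue by a single hypothetical 3D point, and the residue names, coordinates, and threshold value are made up. The Delaunay and almost-Delaunay variants are not shown.

```python
from itertools import combinations
from math import dist
from typing import Dict, Set, Tuple

Point = Tuple[float, float, float]


def contact_graph(coords: Dict[str, Point], threshold: float) -> Set[Tuple[str, str]]:
    """Connect every pair of residues whose distance is at most `threshold`."""
    return {
        (a, b)
        for a, b in combinations(sorted(coords), 2)  # each unordered pair once
        if dist(coords[a], coords[b]) <= threshold
    }


# Hypothetical residue positions, one representative point per residue.
coords = {
    "ALA1": (0.0, 0.0, 0.0),
    "GLY2": (3.0, 0.0, 0.0),
    "SER3": (3.0, 4.0, 0.0),
    "HIS4": (20.0, 0.0, 0.0),
}
edges = contact_graph(coords, threshold=6.0)
```

With this threshold the three nearby residues form a triangle and the distant one stays isolated. Lowering the threshold yields a sparser graph, which is the same trade-off the tessellation-based representations address in a different way.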

Journal of Computational Biology | 2005

Comparing Graph Representations of Protein Structure for Mining Family-Specific Residue-Based Packing Motifs

Jun Huan; Deepak Bandyopadhyay; Wei Wang; Jack Snoeyink; Jan F. Prins; Alexander Tropsha

We find recurring amino-acid residue packing patterns, or spatial motifs, that are characteristic of protein structural families, by applying a novel frequent subgraph mining algorithm to graph representations of protein three-dimensional structure. Graph nodes represent amino acids, and edges are chosen in one of three ways: first, using a threshold for contact distance between residues; second, using Delaunay tessellation; and third, using the recently developed almost-Delaunay edges. For a set of graphs representing a protein family from the Structural Classification of Proteins (SCOP) database, subgraph mining typically identifies several hundred common subgraphs corresponding to spatial motifs that are frequently found in proteins in the family but rarely found outside of it. We find that some of the large motifs map onto known functional regions in two protein families explored in this study, i.e., serine proteases and kinases. We find that graphs based on almost-Delaunay edges significantly reduce the number of edges in the graph representation and hence offer a computational advantage, yet the patterns extracted from such graphs have a biological interpretation approximately equivalent to that of those extracted from distance-based graphs.

Conference on Information and Knowledge Management | 2009

Large margin transductive transfer learning

Brian Quanz; Jun Huan

Recently there has been increasing interest in the problem of transfer learning, in which the typical assumption that training and testing data are drawn from identical distributions is relaxed. We specifically address the problem of transductive transfer learning, in which we have access to labeled training data and unlabeled testing data potentially drawn from different, yet related distributions, and the goal is to leverage the labeled training data to learn a classifier to correctly predict data from the testing distribution. We have derived efficient algorithms for transductive transfer learning based on a novel viewpoint and the Support Vector Machine (SVM) paradigm of a large-margin hyperplane classifier in a feature space. We show that our method can outperform some recent state-of-the-art approaches for transfer learning on several data sets, with the added benefits of model and data separation and the potential to leverage existing work on support vector machines.

Knowledge and Information Systems | 2011

An efficient graph-mining method for complicated and noisy data with real-world applications

Yi Jia; Jintao Zhang; Jun Huan

In this paper, we present a novel graph database-mining method called APGM (APproximate Graph Mining) to mine useful patterns from noisy graph databases. In our method, we designed a general framework for modeling noisy distributions using a probability matrix and devised an efficient algorithm to identify approximately matched frequent subgraphs. We have applied APGM to both synthetic and real-world data sets on protein structure pattern identification and structure classification. Our experimental study demonstrates the efficiency and efficacy of the proposed method.

Knowledge and Information Systems | 2013

Structured feature selection and task relationship inference for multi-task learning

Hongliang Fei; Jun Huan

Multi-task learning (MTL) aims to enhance the generalization performance of supervised regression or classification by learning multiple related tasks simultaneously. In this paper, we aim to extend the current MTL techniques to high-dimensional data sets with structured input and structured output (SISO), where SI means the input features are structured and SO means the tasks are structured. We investigate a completely ignored problem in MTL with SISO data: the interplay of structured feature selection and task relationship modeling. We hypothesize that combining the structure information of features and task relationship inference enables us to build more accurate MTL models. Based on the hypothesis, we have designed an efficient learning algorithm, in which we utilize a task covariance matrix related to the model parameters to capture the task relationship. In addition, we design a regularization formulation for incorporating the structured input features in MTL. We have developed an efficient iterative optimization algorithm to solve the corresponding optimization problem. Our algorithm is based on the accelerated first-order gradient method in conjunction with the projected gradient scheme. Using two real-world data sets, we demonstrate the utility of the proposed learning methods.

IEEE Transactions on Knowledge and Data Engineering | 2012

Knowledge Transfer with Low-Quality Data: A Feature Extraction Issue

Brian Quanz; Jun Huan; Meenakshi Mishra

Effectively utilizing readily available auxiliary data to improve predictive performance on new modeling tasks is a key problem in data mining. In this research, the goal is to transfer knowledge between sources of data, particularly when ground-truth information for the new modeling task is scarce or expensive to collect, so that leveraging any auxiliary sources of data becomes a necessity. Toward seamless knowledge transfer among tasks, effective representation of the data is a critical yet not fully explored research area for the data engineer and data miner. Here, we present a technique based on the idea of sparse coding, which essentially attempts to find an embedding for the data by assigning feature values based on subspace cluster membership. We modify the idea of sparse coding by focusing on the identification of shared clusters between data when source and target data may have different distributions. In our paper, we point out cases where a direct application of sparse coding will lead to a failure of knowledge transfer. We then present the details of our extension to sparse coding, which incorporates distribution distance estimates for the embedded data, and show that the proposed algorithm can overcome the shortcomings of the sparse coding algorithm on synthetic data and achieve improved predictive performance on a real-world chemical toxicity transfer learning task.
