Aaron M. Smalter | Researchain

Publication

Featured research published by Aaron M. Smalter.

extending database technology | 2009

G-hash: towards fast kernel-based similarity search in large graph databases

Xiaohong Wang; Aaron M. Smalter; Jun Huan; Gerald H. Lushington

Structured data, including sets, sequences, trees, and graphs, pose significant challenges to fundamental aspects of data management such as efficient storage, indexing, and similarity search. With the fast accumulation of graph databases, similarity search in graph databases has emerged as an important research topic. Graph similarity search has applications in a wide range of domains including cheminformatics, bioinformatics, sensor network management, social network management, and XML documents, among others. Most current graph indexing methods focus on subgraph query processing, i.e., determining the set of database graphs that contain the query graph, and hence do not directly support similarity search. In data mining and machine learning, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models for supervised learning, graph kernel functions (i) have high computational complexity and (ii) are difficult to index in a graph database. Our objective is to bridge graph kernel functions and similarity search in graph databases by proposing (i) a novel kernel-based similarity measurement and (ii) an efficient indexing structure for graph data management. Our method of similarity measurement builds upon local features extracted from each node and its neighboring nodes in graphs. A hash table is utilized to support efficient storage and fast search of the extracted local features. Using the hash table, a graph kernel function is defined to capture the intrinsic similarity of graphs and to support fast similarity query processing. We have implemented our method, which we have named G-hash, and have demonstrated its utility on large chemical graph databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Most importantly, the new similarity measurement and the index structure are scalable to large databases, with smaller index size, faster index construction time, and faster query processing time compared to state-of-the-art indexing methods such as C-tree, gIndex, and GraphGrep.
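
The idea of hashing per-node local features and scoring graphs by shared features can be illustrated with a small sketch. This is not the G-hash implementation: the feature definition (a node's label plus the counts of its neighbors' labels), the exact-match counting kernel, and all names (node_features, LocalFeatureIndex, the toy molecules) are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def node_features(labels, adjacency):
    """One hashable local feature per node: (own label, sorted neighbor-label counts).

    Assumed feature definition, chosen only to keep the sketch small.
    """
    feats = []
    for node, label in labels.items():
        neighbor_counts = Counter(labels[n] for n in adjacency.get(node, ()))
        feats.append((label, tuple(sorted(neighbor_counts.items()))))
    return feats


class LocalFeatureIndex:
    """Hash table from local feature to the graphs (and counts) that contain it."""

    def __init__(self):
        self.table = defaultdict(Counter)  # feature -> {graph_id: count}

    def add(self, graph_id, labels, adjacency):
        for feature in node_features(labels, adjacency):
            self.table[feature][graph_id] += 1

    def query(self, labels, adjacency, k=1):
        """Return the k graphs with the highest kernel value against the query.

        Kernel (assumed): sum over shared features of count_query * count_graph.
        Only graphs sharing at least one feature with the query are touched.
        """
        scores = Counter()
        query_counts = Counter(node_features(labels, adjacency))
        for feature, q_count in query_counts.items():
            for graph_id, count in self.table.get(feature, {}).items():
                scores[graph_id] += q_count * count
        return scores.most_common(k)


# Toy labeled graphs (heavy atoms only): C-C-O and C-C-C.
ethanol = ({0: "C", 1: "C", 2: "O"}, {0: [1], 1: [0, 2], 2: [1]})
propane = ({0: "C", 1: "C", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]})

index = LocalFeatureIndex()
index.add("ethanol", *ethanol)
index.add("propane", *propane)

print(index.query(*ethanol, k=2))  # [('ethanol', 3), ('propane', 2)]
```

Because the lookup walks only the hash buckets of features present in the query, the cost of a query in this sketch depends on the number of matching entries rather than on a pairwise comparison with every database graph.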

Journal of Bioinformatics and Computational Biology | 2009

GRAPH WAVELET ALIGNMENT KERNELS FOR DRUG VIRTUAL SCREENING

Aaron M. Smalter; Jun Huan; Gerald H. Lushington

In this paper, we introduce a novel statistical modeling technique for target property prediction, with applications to virtual screening and drug design. In our method, we use graphs to model chemical structures and apply a wavelet analysis of graphs to summarize features capturing graph local topology. We design a novel graph kernel function that uses the topology features to build predictive models for chemicals via a support vector machine classifier. We call the new graph kernel a graph wavelet-alignment kernel. We have evaluated the efficacy of the wavelet-alignment kernel using a set of chemical structure-activity prediction benchmarks. Our results indicate that the kernel function yields performance profiles comparable to, and sometimes exceeding, those of existing state-of-the-art chemical classification approaches. In addition, our results show that the use of wavelet functions significantly decreases the computational cost of graph kernel computation, with a more than tenfold speedup.

bioinformatics and biomedicine | 2007

Human Disease-Gene Classification with Integrative Sequence-Based and Topological Features of Protein-Protein Interaction Networks

Aaron M. Smalter; Seak Fei Lei; Xue Wen Chen

The discovery of human genes that contribute to the appearance and growth of hereditary diseases is an important problem in bioinformatics research. Many techniques have been devised for classifying genes based on information from a variety of sources such as sequence and functional annotation. Recently, the use of topological information in protein-protein interaction networks has shown promise in classifying disease genes. In this paper, we develop a disease-gene classification system that integrates topological features of protein interaction networks with sequence-derived and other features, utilizing support vector machines for disease-gene classification. We identified several novel topological, sequence, and function-based features that can help to characterize hereditary disease genes. We also found that using a more complex classifier can contribute to disease-gene classification. We validated our methods by selecting previously unclassified genes that were predicted with high probability to be disease genes, and searching recent literature for evidence of their involvement in disease.

bioinformatics and bioengineering | 2008

GPM: A graph pattern matching kernel with diffusion for chemical compound classification

Aaron M. Smalter; Jun Huan; Gerald H. Lushington

Classifying chemical compounds is an active topic in drug design and other cheminformatics applications. Graphs are general tools for organizing information from heterogeneous sources and have been applied in modelling many kinds of biological data. With the fast accumulation of chemical structure data, building highly accurate predictive models for chemical graphs has emerged as a new challenge. In this paper, we demonstrate a novel technique called the Graph Pattern Matching kernel (GPM). Our idea is to leverage existing frequent pattern discovery methods and explore their application to kernel classifiers (e.g., support vector machines) for graph classification. In our method, we first identify all frequent patterns from a graph database. We then map subgraphs to graphs in the database and use a diffusion process to label nodes in the graphs. Finally, the kernel is computed using a set matching algorithm. We performed experiments on 16 chemical structure data sets and compared our method to other major graph kernels. The experimental results demonstrate excellent performance of our method.

international conference on data mining | 2009

Feature Selection in the Tensor Product Feature Space

Aaron M. Smalter; Jun Huan; Gerald H. Lushington

Classifying objects that are sampled jointly from two or more domains has many applications. The tensor product feature space is useful for modeling interactions between feature sets in different domains but feature selection in the tensor product feature space is challenging. Conventional feature selection methods ignore the structure of the feature space and may not provide the optimal results. In this paper we propose methods for selecting features in the original feature spaces of different domains. We obtained sparsity through two approaches, one using integer quadratic programming and another using L1-norm regularization. Experimental studies on biological data sets validate our approach.
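
The structure that makes selection in the original feature spaces attractive can be shown with a small sketch. It does not implement the integer quadratic programming or L1-norm methods mentioned in the abstract; it only illustrates, under assumed names (tensor_product, restrict, and the toy vectors), that dropping one feature from an original domain removes a whole block of product features at once.

```python
def tensor_product(x, y):
    """Flattened tensor (outer) product: entry i * len(y) + j equals x[i] * y[j]."""
    return [xi * yj for xi in x for yj in y]


def restrict(vector, keep):
    """Keep only the listed feature indices of a vector from one original domain."""
    return [vector[i] for i in keep]


# Toy feature vectors from two domains (hypothetical values).
x = [1, 2, 3]
y = [10, 20]

full = tensor_product(x, y)  # 3 * 2 = 6 interaction features

# Selecting in the original spaces: keep features 0 and 2 of x, feature 1 of y.
keep_x, keep_y = [0, 2], [1]
reduced = tensor_product(restrict(x, keep_x), restrict(y, keep_y))

# The same entries picked directly from the full product space.
picked = [full[i * len(y) + j] for i in keep_x for j in keep_y]

print(full)     # [10, 20, 20, 40, 30, 60]
print(reduced)  # [20, 60]
print(picked)   # [20, 60]
```

Selecting len(keep_x) and len(keep_y) features in the original domains leaves len(keep_x) * len(keep_y) product features, so the selection problem is posed over far fewer variables than the full product space contains.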

bioinformatics and biomedicine | 2009

CGM: A biomedical text categorization approach using concept graph mining

Said Bleik; Min Song; Aaron M. Smalter; Jun Huan; Gerald H. Lushington

Text categorization is used to organize and manage biomedical text databases that are growing at an exponential rate. Feature representations for documents are a crucial factor in the performance of text categorization. Most of the successful existing techniques use a vector representation based on key entities extracted from the text. In this paper, we investigate a new direction in which we represent a document as a graph. In this representation, we identify high-level concepts and build a rich graph structure that contains additional concepts and relationships. We then use graph kernel techniques to perform text categorization. The results show a significant improvement in accuracy when compared to categorization based on only the extracted concepts.

BMC Bioinformatics | 2010

Application of kernel functions for accurate similarity search in large chemical databases

Xiaohong Wang; Jun Huan; Aaron M. Smalter; Gerald H. Lushington

Background: Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening, among others. It is widely believed that structure-based methods provide an efficient way to do the query. Recently, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases due to their high computational complexity and the difficulty of indexing similarity search for large databases.
Results: To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed by our team, to measure the similarity of graph-represented chemicals. In our method, we utilize a hash table to support a new graph kernel function definition, efficient storage, and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with smaller index size and faster query processing time compared to state-of-the-art indexing methods such as Daylight fingerprints, C-tree, and GraphGrep.
Conclusions: Efficient similarity query processing for large chemical databases is challenging because we need to balance running-time efficiency and similarity search accuracy. Our previous similarity search method, G-hash, provides a new way to perform similarity search in chemical databases. An experimental study validates the utility of G-hash in chemical databases.

computational systems bioinformatics | 2008

Graph wavelet alignment kernels for drug virtual screening.

Aaron M. Smalter; Jun Huan; Gerald H. Lushington

In this paper, we introduce a novel graph classification algorithm and demonstrate its efficacy in drug design. In our method, we use graphs to model chemical structures and apply a wavelet analysis of graphs to create features capturing graph local topology. We design a novel graph kernel function that uses the created features to build predictive models for chemicals. We call the new graph kernel a graph wavelet-alignment kernel. We have evaluated the efficacy of the wavelet-alignment kernel using a set of chemical structure-activity prediction benchmarks. Our results indicate that the kernel function yields performance profiles comparable to, and sometimes exceeding, those of existing state-of-the-art chemical classification approaches. In addition, our results show that the use of wavelet functions significantly decreases the computational cost of graph kernel computation, with a more than tenfold speedup.

bioinformatics and biomedicine | 2009

Application of Kernel Functions for Accurate Similarity Search in Large Chemical Databases

Xiaohong Wang; Jun Huan; Aaron M. Smalter; Gerald H. Lushington

Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening, among others. It is widely believed that structure-based methods provide an efficient way to do the query. Recently, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases due to their high computational complexity and the difficulty of indexing similarity search for large databases. To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed by our team, to measure the similarity of graph-represented chemicals. In our method, we utilize a hash table to support a new graph kernel function definition, efficient storage, and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with smaller index size and faster query processing time compared to state-of-the-art indexing methods such as Daylight fingerprints, C-tree, and GraphGrep.

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2010