Keith Marsolo | Researchain

Publication

Featured research published by Keith Marsolo.

bioinformatics and bioengineering | 2005

A multi-level approach to SCOP fold recognition

Keith Marsolo; Srinivasan Parthasarathy; Chris H. Q. Ding

The classification of proteins based on their structure can play an important role in the deduction or discovery of protein function. However, the relatively low number of solved protein structures and the unknown relationship between structure and sequence require an alternative method of representation for classification to be effective. Furthermore, the large number of potential folds causes problems for many classification strategies, increasing the likelihood that the classifier will reach a local optimum while trying to distinguish between all of the possible structural categories. Here we present a hierarchical strategy for structural classification that first partitions proteins based on their SCOP class before attempting to assign a protein fold. Using a well-known dataset derived from the 27 most-populated SCOP folds and several sequence-based descriptor properties as input features, we test a number of classification methods, including Naive Bayes and Boosted C4.5. Our strategy achieves an average fold recognition of 74%, which is significantly higher than the 56-60% previously reported in the literature, indicating the effectiveness of a multi-level approach.
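The two-stage idea in this abstract (assign a SCOP class first, then choose a fold only among the folds of that class) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the nearest-centroid classifier stands in for the Naive Bayes and Boosted C4.5 classifiers named in the abstract, and the feature vectors, class names, and fold names are made-up toy values.

```python
from collections import defaultdict
import math


def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def nearest_label(centroids, x):
    """Return the label whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


def train_two_level(samples):
    """samples: list of (features, scop_class, fold) tuples.

    Returns class-level centroids and, for each class, fold-level centroids.
    """
    by_class = defaultdict(list)
    by_class_fold = defaultdict(lambda: defaultdict(list))
    for features, scop_class, fold in samples:
        by_class[scop_class].append(features)
        by_class_fold[scop_class][fold].append(features)
    class_centroids = {c: centroid(v) for c, v in by_class.items()}
    fold_centroids = {
        c: {f: centroid(v) for f, v in folds.items()}
        for c, folds in by_class_fold.items()
    }
    return class_centroids, fold_centroids


def predict_two_level(model, x):
    """Stage 1 picks the class; stage 2 picks a fold within that class only."""
    class_centroids, fold_centroids = model
    scop_class = nearest_label(class_centroids, x)
    fold = nearest_label(fold_centroids[scop_class], x)
    return scop_class, fold


# Hypothetical toy data: 2-D features, two classes with two folds each.
toy = [
    ([0.0, 0.0], "alpha", "fold_a1"), ([0.2, 0.1], "alpha", "fold_a1"),
    ([1.0, 0.0], "alpha", "fold_a2"), ([1.1, 0.2], "alpha", "fold_a2"),
    ([5.0, 5.0], "beta", "fold_b1"), ([5.1, 4.9], "beta", "fold_b1"),
    ([6.0, 5.0], "beta", "fold_b2"), ([6.2, 5.1], "beta", "fold_b2"),
]
model = train_two_level(toy)
```

The second stage only has to separate the folds of one class, which is the reduction in the number of competing categories that the abstract credits for the gain in accuracy.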

international conference of the ieee engineering in medicine and biology society | 2007

Spatial Modeling and Classification of Corneal Shape

Keith Marsolo; Michael D. Twa; Mark A. Bullimore; Srinivasan Parthasarathy

One of the most promising applications of data mining is in biomedical data used in patient diagnosis. Any method of data analysis intended to support the clinical decision-making process should meet several criteria: it should capture clinically relevant features, be computationally feasible, and provide easily interpretable results. In an initial study, we examined the feasibility of using Zernike polynomials to represent biomedical instrument data in conjunction with a decision tree classifier to distinguish between diseased and non-diseased eyes. Here, we provide a comprehensive follow-up to that work, examining a second representation, pseudo-Zernike polynomials, to determine whether they provide any increase in classification accuracy. We compare the fidelity of both methods using residual root-mean-square (rms) error and evaluate accuracy using several classifiers: neural networks, C4.5 decision trees, Voting Feature Intervals, and Naïve Bayes. We also examine the effect of several meta-learning strategies: boosting, bagging, and Random Forests (RFs). We present results comparing accuracy as it relates to dataset and transformation resolution over a larger, more challenging, multi-class dataset. They show that classification accuracy is similar for both data transformations, but differs by classifier. We find that the Zernike polynomials provide better feature representation than the pseudo-Zernikes and that the decision trees yield the best balance of classification accuracy and interpretability.
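For readers unfamiliar with the two quantities mentioned here, the sketch below evaluates the standard radial Zernike polynomial and a residual rms error between measured and fitted values. It is an illustration only: it covers the ordinary Zernike radial term (not the pseudo-Zernike variant), it omits the angular term and the fitting step, and the function names are hypothetical rather than taken from the paper.

```python
import math


def zernike_radial(n, m, rho):
    """Radial Zernike polynomial R_n^m(rho) for 0 <= m <= n with n - m even."""
    if m < 0 or m > n or (n - m) % 2 != 0:
        raise ValueError("require 0 <= m <= n and n - m even")
    total = 0.0
    for k in range((n - m) // 2 + 1):
        numerator = (-1) ** k * math.factorial(n - k)
        denominator = (
            math.factorial(k)
            * math.factorial((n + m) // 2 - k)
            * math.factorial((n - m) // 2 - k)
        )
        total += numerator / denominator * rho ** (n - 2 * k)
    return total


def rms_residual(measured, fitted):
    """Root-mean-square difference between measured and fitted values."""
    if len(measured) != len(fitted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(measured, fitted))
    return math.sqrt(squared / len(measured))
```

A lower residual rms error means the polynomial model reproduces the measured surface more faithfully, which is how the abstract compares the two representations.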

international conference on data mining | 2006

On the Use of Structure and Sequence-Based Features for Protein Classification and Retrieval

Keith Marsolo; Srinivasan Parthasarathy

The need to retrieve or classify protein molecules using structure or sequence-based similarity measures underlies a wide range of biomedical applications. In drug discovery, researchers search for proteins that share specific chemical properties as possible sources for new treatment. With folding simulations, similar intermediate structures might be indicative of a common folding pathway. To derive any type of similarity, however, one must have an effective model of the protein that allows for easy comparison. In this work, we present two normalized, stand-alone representations of proteins that enable fast and efficient object retrieval based on sequence or structure. To create our sequence-based representation, we take the frequency and scoring matrices returned by the PSI-BLAST alignment algorithm and create a normalized summary using a discrete wavelet transform. Our structural descriptor is constructed using an algorithm we developed previously. First, we transform each 3D structure into a 2D distance matrix by calculating the pair-wise distance between the amino acids of a protein. We normalize this matrix and apply a 2D wavelet decomposition to generate a set of approximation coefficients, which serve as our feature vector. We also concatenate the sequence and structural descriptors together to create a hybrid solution. We evaluate the generality of our models by using them as database indices for nearest-neighbor and range-based retrieval experiments as well as feature vectors for classification using support vector machines. We find that our methods provide excellent performance when compared with the current state-of-the-art techniques of each task. Our results show that the sequence-based representation is on par with, or out-performs, the structure-based representation. Moreover, we find that in the classification context, the hybrid strategy affords a significant improvement over sequence or structure.
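The hybrid descriptor and the two retrieval modes named in this abstract (nearest-neighbor and range-based) can be sketched as follows. This is a brute-force illustration under stated assumptions: the descriptors are taken as already-computed numeric vectors, Euclidean distance is assumed as the similarity measure, and the function names are hypothetical.

```python
import math


def hybrid_descriptor(sequence_vec, structure_vec):
    """Concatenate a sequence-based and a structure-based feature vector."""
    return list(sequence_vec) + list(structure_vec)


def nearest_neighbors(index, query, k):
    """index: dict mapping protein name -> vector. Return the k closest names."""
    ranked = sorted(index, key=lambda name: math.dist(index[name], query))
    return ranked[:k]


def range_query(index, query, radius):
    """Return, in sorted order, all names whose vector lies within radius of query."""
    return sorted(
        name for name, vec in index.items() if math.dist(vec, query) <= radius
    )


# Hypothetical toy index of three proteins with made-up descriptor values.
toy_index = {
    "p1": hybrid_descriptor([0.0, 0.0], [0.0]),
    "p2": hybrid_descriptor([1.0, 0.0], [0.0]),
    "p3": hybrid_descriptor([5.0, 5.0], [5.0]),
}
```

A linear scan like this is only practical for small collections; a spatial index would replace the `sorted` call for larger databases.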

conference on information and knowledge management | 2006

Structure-based querying of proteins using wavelets

Keith Marsolo; Srinivasan Parthasarathy; Kotagiri Ramamohanarao

The ability to retrieve molecules based on structural similarity has use in many applications, from disease diagnosis and treatment to drug discovery and design. In this paper, we present a method to represent protein molecules that allows for the fast, flexible and efficient retrieval of similar structures, based on either global or local attributes. We begin by computing the pair-wise distance between amino acids, transforming each 3D structure into a 2D distance matrix. We normalize this matrix to a specific size and apply a 2D wavelet decomposition to generate a set of approximation coefficients, which serves as our global feature vector. This transformation reduces the overall dimensionality of the data while still preserving spatial features and correlations. We test our method by running queries on three different protein data sets that have been used previously in the literature, basing our comparisons on labels taken from the SCOP database. We find that our method significantly outperforms existing approaches, in terms of retrieval accuracy, memory utilization and execution time. Specifically, using a k-d tree and running a 10-nearest-neighbor search on a dataset of 33,000 proteins against itself, we see an average accuracy of 89% at the SCOP SuperFamily level and a total query time that is up to 350 times faster than previously published techniques. In addition to processing queries based on global similarity, we also propose innovative extensions to effectively match proteins based solely on shared local substructures, allowing for a more flexible query interface.
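The pipeline described above (pair-wise distances, normalization to a fixed size, 2D wavelet decomposition, approximation coefficients as the feature vector) can be sketched in a few functions. Several choices here are assumptions, not details from the abstract: the Haar wavelet, nearest-neighbour resampling as the size normalization, the 8x8 target size, and the number of decomposition levels.

```python
import math


def distance_matrix(coords):
    """Pair-wise Euclidean distances between 3D points (one point per residue)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def resize_nearest(matrix, size):
    """Resample a square matrix to size x size by nearest-neighbour lookup.

    This resampling scheme is an assumption made for the sketch.
    """
    n = len(matrix)
    idx = [min(n - 1, (i * n) // size) for i in range(size)]
    return [[matrix[r][c] for c in idx] for r in idx]


def haar_approx_2d(matrix):
    """One level of 2D Haar decomposition, keeping approximation coefficients only.

    Each output value is the sum of a 2x2 block divided by 2 (orthonormal Haar).
    The side length of the input must be even.
    """
    n = len(matrix)
    return [
        [
            (matrix[r][c] + matrix[r][c + 1]
             + matrix[r + 1][c] + matrix[r + 1][c + 1]) / 2.0
            for c in range(0, n, 2)
        ]
        for r in range(0, n, 2)
    ]


def global_descriptor(coords, size=8, levels=2):
    """Flattened approximation coefficients of the resized distance matrix."""
    if size % (2 ** levels) != 0:
        raise ValueError("size must be divisible by 2 ** levels")
    m = resize_nearest(distance_matrix(coords), size)
    for _ in range(levels):
        m = haar_approx_2d(m)
    return [value for row in m for value in row]
```

Each decomposition level halves both sides of the matrix, so two levels on an 8x8 input leave a 2x2 block, or four coefficients. This is the dimensionality reduction the abstract refers to.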

international conference on data mining | 2005

Alternate representation of distance matrices for characterization of protein structure

Keith Marsolo; Srinivasan Parthasarathy

The most suitable method for the automated classification of protein structures remains an open problem in computational biology. In order to classify a protein structure with any accuracy, an effective representation must be chosen. Here we present two methods of representing protein structure. One involves representing the distances between the Cα atoms of a protein as a two-dimensional matrix and creating a model of the resulting surface with Zernike polynomials. The second uses a wavelet-based approach. We convert the distances between a protein's Cα atoms into a one-dimensional signal which is then decomposed using a discrete wavelet transformation. Using the Zernike coefficients and the approximation coefficients of the wavelet decomposition as feature vectors, we test the effectiveness of our representation with two different classifiers on a dataset of more than 600 proteins taken from the 27 most-populated SCOP folds. We find that the wavelet decomposition greatly outperforms the Zernike model. With the wavelet representation, we achieve an accuracy of approximately 56%, roughly 12% higher than results reported on a similar, but less-challenging dataset. In addition, we can couple our structure-based feature vectors with several sequence-based properties to increase accuracy another 5-7%. Finally, we use a multi-stage classification strategy on the combined features to increase performance to 78%, an improvement in accuracy of more than 15-20% and 34% over the highest reported sequence-based and structure-based classification results, respectively.
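The one-dimensional wavelet step can be illustrated with a multi-level Haar transform that keeps only the approximation coefficients. This is a sketch under assumptions: the abstract does not name the wavelet family or say how the distances are arranged into a signal, so the code takes an already-built 1D signal as input and uses the Haar wavelet for simplicity.

```python
import math


def haar_step(signal):
    """One Haar DWT level: return (approximation, detail) for an even-length signal."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_approximation(signal, levels):
    """Approximation coefficients after the given number of Haar levels."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2 ** levels")
    approx = list(signal)
    for _ in range(levels):
        approx, _detail = haar_step(approx)
    return approx
```

Each level halves the length of the signal, so the approximation coefficients give a compact, fixed-length feature vector once the input length is fixed.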

artificial intelligence in medicine in europe | 2005

A model-based approach to visualizing classification decisions for patient diagnosis

Keith Marsolo; Srinivasan Parthasarathy; Michael D. Twa; Mark A. Bullimore

Automated classification systems are often used for patient diagnosis. In many cases, the rationale behind a decision is as important as the decision itself. Here we detail a method of visualizing the criteria used by a decision tree classifier to provide support for clinicians interested in diagnosing corneal disease. We leverage properties of our data transformation to create surfaces highlighting the details deemed important in classification. Preliminary results indicate that the features illustrated by our visualization method are indeed the criteria that often lead to a correct diagnosis and that our system also seems to find favor with practicing clinicians.

bioinformatics and bioengineering | 2005

Classification of biomedical data through model-based spatial averaging

Keith Marsolo; P. Parthasarathy; Michael D. Twa; Mark A. Bullimore

Ensemble learning is frequently used to reduce classification error. The more popular techniques draw multiple samples from the training data and employ a voting procedure to aggregate the decisions of the classifiers constructed from those samples. In practice, such ensemble methods have been shown to work well and improve accuracy. Here we present a meta-learning strategy that combines the decisions of classifiers constructed from spatial models taken at multiple resolutions. By varying the resolution from coarse to fine-grained, we are able to partition the data on global features that describe a majority of the objects, as well as small, local features that are present in just a few problem cases. We test our technique on a biomedical dataset containing surface elevation values for diseased and nondiseased corneas. We transform these elevations into a series of coefficients using two different spatial transformations. Using these coefficients, we determine how well they distinguish between the two classes. We find our algorithm can increase the classification accuracy of a single decision tree up to 10% and can also be used in conjunction with traditional meta-learning techniques such as bagging to further improve performance. In an attempt to improve the execution time of the transformation algorithms, we have developed a distributed, grid-based implementation as well.
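The aggregation step, combining the decisions of classifiers built from models at several resolutions, can be sketched as a simple majority vote. The voting rule, the tie-breaking behaviour, and the threshold functions that stand in for decision trees are all assumptions for illustration; the abstract does not specify how the decisions are combined.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label; ties go to the label encountered first."""
    return Counter(predictions).most_common(1)[0][0]


def predict_multi_resolution(classifiers, features_by_resolution):
    """Query one classifier per resolution and aggregate their votes.

    classifiers: dict mapping resolution -> callable(features) -> label.
    features_by_resolution: dict mapping resolution -> feature vector.
    Resolutions are visited from coarse (small) to fine (large).
    """
    votes = [
        classifiers[res](features_by_resolution[res])
        for res in sorted(classifiers)
    ]
    return majority_vote(votes), votes


# Hypothetical stand-ins for trained decision trees at three resolutions.
toy_classifiers = {
    4: lambda f: "diseased" if f[0] > 0.5 else "normal",
    6: lambda f: "diseased" if f[0] > 0.5 else "normal",
    8: lambda f: "diseased" if f[0] > 0.5 else "normal",
}
toy_features = {4: [0.7], 6: [0.2], 8: [0.9]}
```

With the toy inputs above, the coarse and fine models agree and outvote the middle one, which is the kind of disagreement between global and local features that the abstract aims to exploit.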

Radiology | 2008