Publication


Featured research published by Michael W. Berry.


SIAM Review | 1995

Using linear algebra for intelligent information retrieval

Michael W. Berry; Susan T. Dumais; Gavin W. O'Brien

Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users’ requests and those in or assigned to documents in a database. ...


Computational Statistics & Data Analysis | 2007

Algorithms and applications for approximate nonnegative matrix factorization

Michael W. Berry; Murray Browne; Amy N. Langville; V. Paul Pauca; Robert J. Plemmons

The development and use of low-rank approximate nonnegative matrix factorization (NMF) algorithms for feature extraction and identification in the fields of text mining and spectral data analysis are presented. The evolution and convergence properties of hybrid methods based on both sparsity and smoothness constraints for the resulting nonnegative matrix factors are discussed. The interpretability of NMF outputs in specific contexts is discussed, along with opportunities for future work in the modification of NMF algorithms for large-scale and time-varying data sets.
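The low-rank approximation at the heart of this work factors a nonnegative matrix V into nonnegative factors W and H with V ≈ WH. A minimal sketch of the classic multiplicative-update rule (Lee–Seung style) is below; it omits the sparsity and smoothness constraints of the hybrid methods the paper actually studies, and the toy matrix is hypothetical.

```python
import numpy as np

def nmf(V, k, iters=200, eps=1e-9):
    """Minimal multiplicative-update NMF sketch (unconstrained);
    not the constrained hybrid algorithms discussed in the paper."""
    rng = np.random.default_rng(0)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        # Multiplicative updates keep W and H elementwise nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy nonnegative data with (nonnegative) rank 2, so a rank-2
# factorization should reconstruct it closely.
V = np.array([[1., 0., 2.], [0., 1., 1.], [2., 1., 5.]])
W, H = nmf(V, k=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The updates never subtract, which is what preserves nonnegativity and, in text-mining use, keeps the factors interpretable as additive topic parts.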


SIAM Review | 1999

Matrices, Vector Spaces, and Information Retrieval

Michael W. Berry; Zlatko Drmac; Elizabeth R. Jessup

The evolution of digital libraries and the Internet has dramatically transformed the processing, storage, and retrieval of information. Efforts to digitize text, images, video, and audio now consume a substantial portion of both academic and industrial activity. Even when there is no shortage of textual materials on a particular topic, procedures for indexing or extracting the knowledge or conceptual information contained in them can be lacking. Recently developed information retrieval technologies are based on the concept of a vector space. Data are modeled as a matrix, and a user's query of the database is represented as a vector. Relevant documents in the database are then identified via simple vector operations. Orthogonal factorizations of the matrix provide mechanisms for handling uncertainty in the database itself. The purpose of this paper is to show how such fundamental mathematical concepts from linear algebra can be used to manage and index large text collections.
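The vector-space model described above can be sketched in a few lines: documents are columns of a term-document matrix, a query is a vector over the same terms, and relevance is a simple cosine-similarity computation. The terms and counts here are hypothetical example data.

```python
import numpy as np

# Toy term-document matrix A: rows = terms, columns = documents.
# Term order: ["matrix", "vector", "retrieval", "seismic"] (made up).
A = np.array([
    [2, 0, 1],   # "matrix"
    [1, 1, 0],   # "vector"
    [0, 2, 1],   # "retrieval"
    [0, 0, 3],   # "seismic"
], dtype=float)

# The query "vector retrieval" as a vector over the same terms.
q = np.array([0, 1, 1, 0], dtype=float)

# Rank documents by cosine similarity between each column and the query.
sims = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
ranking = np.argsort(-sims)   # document indices, best match first
```

Document 1, which contains both query terms, ranks first; this is the "simple vector operations" step, before any orthogonal factorization is applied.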


IEEE International Conference on High Performance Computing, Data and Analytics | 1989

The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers

Michael W. Berry; Da-Ren Chen; Peter F. Koss; David J. Kuck; Sy-Shin Lo; Yingxin Pang; Lynn Pointer; R. Roloff; Ahmed H. Sameh; E. Clementi; Shaoan Chin; David J. Schneider; Geoffrey C. Fox; Paul C. Messina; David Walker; C. Hsiung; Jim Schwarzmeier; K. Lue; Steven A. Orszag; F. Seidl; O. Johnson; R. Goodrum; Joanne L. Martin

This report presents a methodology for measuring the performance of supercomputers. It includes 13 Fortran programs that total over 50,000 lines of source code. They represent applications in several areas of engineering and scientific computing, and in many cases the codes are currently being used by computational research and development groups. We also present the PERFECT Fortran standard, a set of guidelines that allow portability to several types of machines. Furthermore, we present some performance measures and a methodology for recording and sharing results among diverse users on different machines. The results presented in this paper should not be used to compare machines, except in a preliminary sense. Rather, they are presented to show how the methodology has been applied, and to encourage others to join us in this effort. The results should be regarded as the first step toward our objective, which is to develop a publicly accessible database of performance information of this type.


Information Processing and Management | 2006

Document clustering using nonnegative matrix factorization

Farial Shahnaz; Michael W. Berry; V. Paul Pauca; Robert J. Plemmons

A methodology for automatically identifying and clustering semantic features or topics in a heterogeneous text collection is presented. Textual data is encoded using a low rank nonnegative matrix factorization algorithm to retain natural data nonnegativity, thereby eliminating the need to use subtractive basis vector and encoding calculations present in other techniques such as principal component analysis for semantic feature abstraction. Existing techniques for nonnegative matrix factorization are reviewed and a new hybrid technique for nonnegative matrix factorization is proposed. Performance evaluations of the proposed method are conducted on a few benchmark text collections used in standard topic detection studies.
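Once the nonnegative factorization A ≈ WH of the term-document matrix has been computed, the clustering step described above is direct: each document is assigned to the semantic feature (row of H) on which its encoding is largest. A minimal sketch with a hypothetical 3-topic encoding of five documents:

```python
import numpy as np

# Suppose an NMF A ~= W @ H has been computed for a term-document matrix,
# with H (topics x documents) holding each document's nonnegative encoding.
# Hypothetical encoding of 5 documents over 3 topics:
H = np.array([
    [0.9, 0.8, 0.1, 0.0, 0.2],
    [0.1, 0.1, 0.7, 0.9, 0.1],
    [0.0, 0.1, 0.2, 0.1, 0.7],
])

# Assign each document (column) to its dominant topic.
clusters = H.argmax(axis=0)
```

Because the encodings are nonnegative, each coefficient reads as "how much of this topic the document contains", which is why no subtractive PCA-style basis is needed.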


IEEE International Conference on High Performance Computing, Data and Analytics | 1992

Large-Scale Sparse Singular Value Computations

Michael W. Berry

We present four numerical methods for computing the singular value decomposition (SVD) of large sparse matrices on a multiprocessor architecture. We emphasize Lanczos and subspace iteration-based methods for determining several of the largest singular triplets (singular values and corresponding left- and right-singular vectors) for sparse matrices arising from two practical applications: information retrieval and seismic reflection tomography. The target architectures for our implementations are the CRAY-2S/4-128 and Alliant FX/80. The sparse SVD problem is well motivated by recent information-retrieval techniques in which dominant singular values and their corresponding singular vectors of large sparse term-document matrices are desired, and by nonlinear inverse problems from seismic tomography applications which require approximate pseudo-inverses of large sparse Jacobian matrices. This research may help advance the development of future out-of-core sparse SVD methods, which can be used, for example, to handle extremely large sparse matrices (O(10^6) rows or columns) associated with extremely large databases in query-based information-retrieval applications.
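The problem the paper targets, computing only a few of the largest singular triplets of a large sparse matrix without densifying it, is exposed today by SciPy's iterative `svds` routine. A sketch on a random sparse stand-in matrix (the data is made up; the original codes ran Lanczos and subspace iteration on Cray and Alliant hardware):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Random sparse "term-document" stand-in for the large matrices above.
A = sparse_random(200, 100, density=0.05, random_state=0)

# Compute only the k largest singular triplets via an iterative
# (Lanczos-type) method; A is never formed as a dense matrix.
k = 5
U, s, Vt = svds(A, k=k)

# svds returns singular values in ascending order; reverse to descending.
s = s[::-1]
```

For a matrix this small we can check the triplets against a dense SVD; for the O(10^6)-row matrices the paper anticipates, the dense route is infeasible and only the iterative one survives.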


Information Sciences | 1997

Large-scale information retrieval with latent semantic indexing

Todd A. Letsche; Michael W. Berry

As the amount of electronic information increases, traditional lexical (or Boolean) information retrieval techniques will become less useful. Large, heterogeneous collections will be difficult to search since the sheer volume of unranked documents returned in response to a query will overwhelm the user. Vector-space approaches to information retrieval, on the other hand, allow the user to search for concepts rather than specific words, and rank the results of the search according to their relative similarity to the query. One vector-space approach, Latent Semantic Indexing (LSI), has achieved up to 30% better retrieval performance than lexical searching techniques by employing a reduced-rank model of the term-document space. However, the original implementation of LSI lacked the execution efficiency required to make LSI useful for large data sets. A new implementation of LSI, LSI++, seeks to make LSI efficient, extensible, portable, and maintainable. The LSI++ Application Programming Interface (API) allows applications to immediately use LSI without knowing the implementation details of the underlying system. LSI++ supports both serial and distributed searching of large data sets, providing the same programming interface regardless of the implementation actually executing. In addition, a World Wide Web interface was created to allow simple, intuitive searching of document collections using LSI++. Timing results indicate that the serial implementation of LSI++ searches up to six times faster than the original implementation of LSI, while the parallel implementation searches nearly 180 times faster on large document collections.


Bioinformatics | 2005

Gene clustering by Latent Semantic Indexing of MEDLINE abstracts

Ramin Homayouni; Kevin Heinrich; Lai Wei; Michael W. Berry

Motivation: A major challenge in the interpretation of high-throughput genomic data is understanding the functional associations between genes. Previously, several approaches have been described to extract gene relationships from various biological databases using term-matching methods. However, more flexible automated methods are needed to identify functional relationships (both explicit and implicit) between genes from the biomedical literature. In this study, we explored the utility of Latent Semantic Indexing (LSI), a vector space model for information retrieval, to automatically identify conceptual gene relationships from titles and abstracts in MEDLINE citations. Results: We found that LSI identified gene-to-gene and keyword-to-gene relationships with high average precision. In addition, LSI identified implicit gene relationships based on word usage patterns in the gene abstract documents. Finally, we demonstrate here that pairwise distances derived from the vector angles of gene abstract documents can be effectively used to functionally group genes by hierarchical clustering. Our results provide proof-of-principle that LSI is a robust automated method to elucidate both known (explicit) and unknown (implicit) gene relationships from the biomedical literature. These features make LSI particularly useful for the analysis of novel associations discovered in genomic experiments. Availability: The 50-gene document collection used in this study can be interactively queried at http://shad.cs.utk.edu/sgo/sgo.html.
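The final step described above, turning vector angles between gene documents into pairwise distances and then clustering hierarchically, can be sketched with SciPy. The six low-dimensional "gene vectors" below are synthetic stand-ins for LSI-reduced MEDLINE abstract vectors, not data from the study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical reduced-rank vectors for 6 gene documents:
# two tight groups around orthogonal directions (plus small noise),
# standing in for LSI-projected MEDLINE abstracts.
rng = np.random.default_rng(0)
group_a = rng.normal([1.0, 0.0, 0.0], 0.05, size=(3, 3))
group_b = rng.normal([0.0, 1.0, 0.0], 0.05, size=(3, 3))
X = np.vstack([group_a, group_b])

# Pairwise cosine distances (1 - cos of the vector angle), then
# average-linkage hierarchical clustering cut into two groups.
D = pdist(X, metric="cosine")
Z = linkage(D, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Cosine distance is the natural choice here because LSI document vectors carry meaning in their direction, not their length.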


Conference on High Performance Computing (Supercomputing) | 1995

Computational Methods for Intelligent Information Access

Michael W. Berry; Susan T. Dumais; Todd A. Letsche

Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users’ requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term by document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to improve users’ access to many kinds of textual materials, or to documents and services for which textual descriptions are available. A survey of the computational requirements for managing LSI-encoded databases as well as current and future applications of LSI is presented.
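The LSI pipeline the abstract outlines, truncate the SVD of the term-document matrix, fold the query into the reduced space, and match there, can be sketched on a toy matrix. The data is hypothetical, and k=2 stands in for the 200-300 singular vectors used on real collections.

```python
import numpy as np

# Toy term-document matrix (hypothetical); rows = terms, cols = documents.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Rank-k truncated SVD (real LSI uses k ~ 200-300).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T

# Fold a query into the reduced space: q_hat = q^T U_k S_k^{-1}.
q = np.array([1, 1, 0, 0, 0], dtype=float)   # query on the first two terms
q_hat = q @ Uk / sk

# Documents live at the rows of V_k S_k; rank by cosine similarity there.
docs = Vk * sk
sims = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
```

Matching in the k-dimensional subspace, rather than on raw term overlap, is what lets LSI retrieve documents that share concepts but not exact words with the query.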


Machine Learning | 1995

Neural Networks for Full-Scale Protein Sequence Classification: Sequence Encoding with Singular Value Decomposition

Cathy H. Wu; Michael W. Berry; Sailaja Shivakumar; Jerry W. McLarty

A neural network classification method has been developed as an alternative approach to the search/organization problem of protein sequence databases. The neural networks used are three-layered, feed-forward, back-propagation networks. The protein sequences are encoded into neural input vectors by a hashing method that counts occurrences of n-gram words. A new SVD (singular value decomposition) method, which compresses the long and sparse n-gram input vectors and captures semantics of n-gram words, has improved the generalization capability of the network. A full-scale protein classification system has been implemented on a Cray supercomputer to classify unknown sequences into 3311 PIR (Protein Identification Resource) superfamilies/families at a speed of less than 0.05 CPU second per sequence. The sensitivity is close to 90% overall, and approaches 100% for large superfamilies. The system could be used to reduce the database search time and is being used to help organize the PIR protein sequence database.
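The n-gram encoding step described above can be illustrated with 2-grams over the 20-letter amino-acid alphabet, which already yields 400-dimensional, mostly sparse count vectors. This is a simplified sketch of the idea, not the paper's hashing implementation, and the example sequence is made up.

```python
from collections import Counter
from itertools import product

# 20 standard amino acids; all 400 ordered 2-grams over them.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BIGRAMS = ["".join(p) for p in product(ALPHABET, repeat=2)]
INDEX = {bg: i for i, bg in enumerate(BIGRAMS)}

def encode(seq):
    """Return a 400-dimensional 2-gram count vector for a sequence."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    vec = [0] * len(BIGRAMS)
    for bg, c in counts.items():
        vec[INDEX[bg]] = c
    return vec

v = encode("ACACD")   # bigrams: AC, CA, AC, CD
```

For larger n these vectors grow exponentially (20^n dimensions) while staying sparse, which is precisely why the paper compresses them with an SVD before they reach the network inputs.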

Collaboration


Dive into Michael W. Berry's collaboration.

Top Co-Authors

Dali Wang (University of Tennessee)
Eric A. Carr (University of Tennessee)
Richard Barrett (Los Alamos National Laboratory)