
Publications

Featured research published by Jim Cowie.


Communications of the ACM | 1996

Information extraction

Jim Cowie; Wendy G. Lehnert

There may be more text data in electronic form than ever before, but much of it is ignored. No human can read, understand, and synthesize megabytes of text on an everyday basis. Missed information (and lost opportunities) has spurred researchers to explore various information management strategies to establish order in the text wilderness. The most common strategies are information retrieval (IR) and information filtering [4]. A relatively new development, information extraction (IE), is the subject of this article. We can view IR systems as combine harvesters that bring back useful material from vast fields of raw material. With large amounts of potentially useful information in hand, an IE system can then transform the raw material, refining and reducing it to a germ of the original text (see Figure 1). Suppose financial analysts are investigating production of semiconductor devices (see Figure 2). They might want to know several things.
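The IR-to-IE pipeline sketched here is easy to make concrete. Below is a minimal, hypothetical Python sketch (the documents, keyword, and extraction pattern are all invented, not taken from the paper): retrieval returns whole documents, while extraction reduces them to structured records.

```python
import re

# Toy corpus; the documents and the pattern below are invented for illustration.
DOCS = [
    "Acme Corp announced production of 4Mb DRAM chips in March.",
    "The weather in Austin was unseasonably warm.",
    "Widget Inc began production of flash memory devices this quarter.",
]

def retrieve(docs, keyword):
    """IR step: bring back whole documents that mention the keyword."""
    return [d for d in docs if keyword in d.lower()]

# IE step: reduce each retrieved document to a structured record.
PATTERN = re.compile(
    r"(?P<company>[A-Z]\w+(?: \w+)?) (?:announced|began) production of "
    r"(?P<product>.+?)(?: in| this|\.)"
)

def extract(docs):
    return [
        {"company": m.group("company"), "product": m.group("product")}
        for d in docs if (m := PATTERN.search(d))
    ]

print(extract(retrieve(DOCS, "production")))
# [{'company': 'Acme Corp', 'product': '4Mb DRAM chips'},
#  {'company': 'Widget Inc', 'product': 'flash memory devices'}]
```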


Information Processing and Management | 2007

Improving query precision using semantic expansion

Ahmed Abdelali; Jim Cowie; Hamdy S. Soliman

Query Expansion (QE) is one of the most important mechanisms in the information retrieval field. A typical short Internet query goes through a process of refinement to improve its retrieval power. Most existing QE techniques suffer from retrieval performance degradation due to an imprecise choice of the query's additive terms during expansion. In this paper, we introduce a novel automated QE mechanism. The new expansion process is guided by the semantic relations between the original query and the expanding words, in the context of the utilized corpus. Experimental results of our controlled query expansion, using the Arabic TREC-10 data, show a significant enhancement of recall and precision over existing mechanisms in the field.
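To make the mechanism concrete, here is a minimal Python sketch of corpus-guided expansion: a query term attracts corpus words that frequently co-occur with it. The corpus, co-occurrence scoring, and threshold are all invented and stand in loosely for the paper's semantic relations; the Arabic TREC-10 setup is not reproduced.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus standing in for the "utilized corpus" mentioned in the abstract.
CORPUS = [
    "desert climate research in arid regions",
    "arid desert regions and water scarcity",
    "water scarcity drives climate research",
]

# Naive co-occurrence counts: words sharing a sentence count as related.
cooc, freq = Counter(), Counter()
for sent in CORPUS:
    words = {w for w in sent.split() if len(w) > 3}  # crude stopword filter
    freq.update(words)
    for a, b in combinations(sorted(words), 2):
        cooc[(a, b)] += 1

def relatedness(a, b):
    """Normalized co-occurrence score between two words."""
    pair = (a, b) if a < b else (b, a)
    return cooc[pair] / math.sqrt(freq[a] * freq[b])

def expand(query, threshold=0.4):
    """Add corpus words sufficiently related to any original query term."""
    extra = {w for w in freq for q in query
             if w not in query and relatedness(w, q) >= threshold}
    return list(query) + sorted(extra)

print(expand(["desert"]))
```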


Human Language Technology | 1992

Lexical disambiguation using simulated annealing

Jim Cowie; Joe A Guthrie; Louise Guthrie

The resolution of lexical ambiguity is important for most natural language processing tasks, and a range of computational techniques has been proposed for its solution. None of these has yet proven effective on a large scale. In this paper, we describe a method for lexical disambiguation of text using the definitions in a machine-readable dictionary together with the technique of simulated annealing. The method operates on complete sentences and attempts to select the optimal combination of word senses for all the words in the sentence simultaneously. The words in the sentences may be any of the 28,000 headwords in the Longman Dictionary of Contemporary English (LDOCE) and are disambiguated relative to the senses given in LDOCE. Our initial results on a sample set of 50 sentences are comparable to those of other researchers, and the fully automatic method requires no hand-coding of lexical entries or hand-tagging of text.
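A rough Python sketch of the technique, with a toy three-word dictionary in place of LDOCE (all words, senses, and definitions are invented): simulated annealing searches over complete sense assignments for the sentence, scored by how many definition words the chosen senses share.

```python
import math
import random

# Toy sense inventory standing in for LDOCE definitions.
SENSES = {
    "bank": ["land beside a river", "institution that holds money"],
    "deposit": ["money placed in an institution", "layer of mud in a river"],
    "water": ["liquid found in a river or sea"],
}
SENTENCE = ["bank", "deposit", "water"]

def score(assignment):
    """Count definition words shared across the chosen senses (higher is better)."""
    defs = [set(SENSES[w][s].split()) for w, s in zip(SENTENCE, assignment)]
    return sum(len(defs[i] & defs[j])
               for i in range(len(defs)) for j in range(i + 1, len(defs)))

def anneal(steps=1000, temp=1.0, cooling=0.995):
    current = [random.randrange(len(SENSES[w])) for w in SENTENCE]
    best, best_score = current[:], score(current)
    for _ in range(steps):
        # Propose changing one randomly chosen word's sense.
        cand = current[:]
        i = random.randrange(len(SENTENCE))
        cand[i] = random.randrange(len(SENSES[SENTENCE[i]]))
        delta = score(cand) - score(current)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or random.random() < math.exp(delta / temp):
            current = cand
            if score(current) > best_score:
                best, best_score = current[:], score(current)
        temp *= cooling
    return {w: SENSES[w][s] for w, s in zip(SENTENCE, best)}

print(anneal())  # the "river" senses of bank and deposit should win
```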


MUC6 '95: Proceedings of the 6th Conference on Message Understanding | 1995

CRL/NMSU: description of the CRL/NMSU systems used for MUC-6

Jim Cowie

This paper discusses the two CRL named entity recognition systems submitted for MUC-6. The systems are based on entirely different approaches. The first is a data-intensive method that uses human-generated patterns. The second uses the training data to develop decision trees that detect the start and end points of names.
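The second, learning-based approach can be illustrated with a small hypothetical sketch using scikit-learn decision trees (the features, sentences, and labels below are invented; the abstract does not specify the actual feature set). A second tree for name end points would be trained analogously.

```python
from sklearn.tree import DecisionTreeClassifier

def features(tokens, i):
    """Token-level features for deciding whether a name starts at position i."""
    tok = tokens[i]
    prev = tokens[i - 1] if i > 0 else ""
    return [
        int(tok[0].isupper()),                 # capitalized token
        int(tok.isupper()),                    # all-caps token
        int(prev in {"Mr.", "Mrs.", "Dr."}),   # honorific before token
        int(i == 0),                           # sentence-initial position
    ]

# Invented training data: 1 = a name starts at this token, 0 = it does not.
sents = [
    ["Dr.", "Cowie", "visited", "yesterday"],
    ["The", "report", "cited", "Mr.", "Smith", "briefly"],
]
labels = [[0, 1, 0, 0], [0, 0, 0, 0, 1, 0]]

X = [features(s, i) for s in sents for i in range(len(s))]
y = [l for ls in labels for l in ls]
start_tree = DecisionTreeClassifier().fit(X, y)

test = ["Mrs.", "Jones", "arrived"]
print([int(start_tree.predict([features(test, i)])[0]) for i in range(len(test))])
# [0, 1, 0] -- a name is predicted to start at "Jones"
```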


Archive | 1999

COLLAGE: An NLP Toolset to Support Boolean Retrieval

Jim Cowie

COLLAGE is a collection of processes and methods which carry out automatic analysis of topics in a natural language form. The results of this analysis are used to determine which NLP resources should be applied to converting each part of a topic into a set of Boolean queries, and how the document lists resulting from the application of each query should be combined to give a final list of ranked documents.
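As an illustration of the general pattern (not of COLLAGE itself), the Python sketch below turns a topic into conjunctive Boolean queries and ranks documents by how many of those queries they satisfy; the tokenization, stopword list, and merging scheme are all invented.

```python
STOPWORDS = {"the", "of", "in", "on", "a", "an", "and", "for", "about"}

def topic_to_queries(topic):
    """Turn each adjacent pair of content words into one AND query."""
    terms = [w for w in topic.lower().split() if w not in STOPWORDS]
    return [(terms[i], terms[i + 1]) for i in range(len(terms) - 1)]

def run_query(docs, query):
    """Boolean AND retrieval: every term must appear in the document."""
    return {i for i, d in enumerate(docs) if all(t in d.lower() for t in query)}

def combined_ranking(docs, topic):
    """Rank documents by how many of the Boolean queries they satisfy."""
    hits = {}
    for q in topic_to_queries(topic):
        for i in run_query(docs, q):
            hits[i] = hits.get(i, 0) + 1
    return sorted(hits, key=hits.get, reverse=True)

docs = [
    "semiconductor production in japan rose sharply",
    "japan exports of rice production",
    "semiconductor firms in korea",
]
print(combined_ranking(docs, "semiconductor production in Japan"))  # [0, 1]
```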


Proceedings of the TIPSTER Text Program: Phase II | 1996

CRL's APPROACH TO MET

Jim Cowie

From February to April, CRL investigated adapting our English name recognition software, developed for MUC-6 [1], to Chinese and Spanish. In addition, a Japanese system developed under Tipster Phase I [2] was modified to comply with the MET task. Finally, learning methods developed for MUC-6 were adapted to handle Chinese. All systems performed with good levels of accuracy, and it is clear that further tuning and refinement, for which there was neither time nor resources, would lead to even higher levels of performance.


Journal of the Chicago Colloquium on Digital Humanities and Computer Science | 2009

Linguistic Dumpster Diving: Geographical Classification of Arabic Text

Ron Zacharski; Ahmed Abdelali; Stephen Helmreich; Jim Cowie

In many text analysis tasks it is common to remove frequently occurring words as part of the pre-processing step prior to analysis. Frequent words are removed for two reasons: first, because they are unlikely to contribute in any meaningful way to the results; and, second, because removing them can greatly reduce the amount of computation required for the analysis task. In the literature, such words have been called noise in the system, fluff words, and non-significant words. While the removal of frequent words is correct for many text analysis tasks, it is not correct for all of them. There are many analysis tasks where frequent words play a crucial role. To cite just one example, Mosteller and Wallace, in their seminal book on stylometrics, noted that the frequencies of various function words could distinguish the writings of Alexander Hamilton and James Madison. We use a similar frequent-word technique to geographically classify Arabic news stories. In representing a document, we throw away all content words and retain only the most frequent words; each document is thus represented by a vector of common-word frequencies. In our study we used a collection of 4,167 Arabic documents from 5 newspapers (representing Egypt, Sudan, Libya, Syria, and the U.K.). We then train on this data using a sequential minimal optimization algorithm to create a support vector machine, and evaluate the approach using 10-fold cross-validation. Depending on the number of frequent words used, classification accuracy ranges from 92% to 99.8%.
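The pipeline is straightforward to reproduce in miniature. The sketch below uses scikit-learn's SVC (an SMO-style solver) with invented English stand-ins for the Arabic documents, labels, and function words; it illustrates the approach rather than the study's actual setup.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical function-word list; content words are discarded entirely.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "was"]

def vectorize(doc):
    """Represent a document solely by its function-word frequencies."""
    words = doc.lower().split()
    return [words.count(w) / max(len(words), 1) for w in FUNCTION_WORDS]

# Invented toy documents from two hypothetical source regions.
docs = [
    "the minister said that the talks in the capital were delayed",
    "it was reported that it was the end of the conflict",
    "the council of the region met and the vote was held in march",
    "it was said that it was time to act and it was urgent",
    "the head of the bank spoke of the reforms in the economy",
    "it was clear that it was to be a long night and it was cold",
]
labels = [0, 1, 0, 1, 0, 1]

X = np.array([vectorize(d) for d in docs])
scores = cross_val_score(SVC(kernel="linear"), X, labels, cv=3)
print("mean cross-validated accuracy:", scores.mean())
```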


Archive | 2004

Application of Multidisciplinary Analysis to Gene Expression

Xuefel Wang; Huining Kang; Chris Fields; Jim Cowie; George S. Davidson; David M. Haaland; Valeriy Sibirtsev; Monica P. Mosquera-Caro; Yuexian Xu; Shawn Martin; Paul Helman; Erik Andries; Kerem Ar; Jeffrey W. Potter; Cheryl L. Willman; Maurice H. Murphy

Molecular analysis of cancer, at the genomic level, could lead to individualized patient diagnostics and treatments. The developments to follow will signal a significant paradigm shift in the clinical management of human cancer. Despite our initial hopes, however, it seems that simple analysis of microarray data cannot elucidate clinically significant gene functions and mechanisms. Extracting biological information from microarray data requires a complicated path involving multidisciplinary teams of biomedical researchers, computer scientists, mathematicians, statisticians, and computational linguists. The integration of the diverse outputs of each team is the limiting factor in the progress to discover candidate genes and pathways associated with the molecular biology of cancer. Specifically, one must deal with sets of significant genes identified by each method and extract whatever useful information may be found by comparing these different gene lists. Here we present our experience with such comparisons, and share methods developed in the analysis of an infant leukemia cohort studied on Affymetrix HG-U95A arrays. In particular, spatial gene clustering, hyper-dimensional projections, and computational linguistics were used to compare different gene lists. In spatial gene clustering, different gene lists are grouped together and visualized on a three-dimensional expression map, where genes with similar expressions are co-located. In another approach, projections from gene expression space onto a sphere clarify how groups of genes can jointly have more predictive power than groups of individually selected genes. Finally, online literature is automatically rearranged to present information about genes common to multiple groups, or to contrast the differences between the lists. The combination of these methods has improved our understanding of infant leukemia. While the complicated reality of the biology dashed our initial, optimistic hopes for simple answers from microarrays, we have made progress by combining very different analytic approaches.
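The gene-list comparison step admits a simple illustration. In the hypothetical Python sketch below, the method names and gene lists are invented; the study's actual lists came from spatial clustering, hyper-dimensional projections, and computational linguistics.

```python
from itertools import combinations

# Invented gene lists, one per analysis method.
gene_lists = {
    "clustering":  {"FLT3", "HOXA9", "MEIS1", "CD34"},
    "projection":  {"FLT3", "MEIS1", "TP53", "KRAS"},
    "linguistics": {"FLT3", "HOXA9", "TP53", "MLL"},
}

# Genes found significant by every method are the strongest candidates.
common = set.intersection(*gene_lists.values())
print("common to all methods:", common)

# Pairwise Jaccard overlap shows how much the methods agree.
for (a, sa), (b, sb) in combinations(gene_lists.items(), 2):
    print(f"{a} vs {b}: {len(sa & sb) / len(sa | sb):.2f}")
```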


Archive | 2003

High throughput instruments, methods, and informatics for systems biology.

Michael B. Sinclair; Jim Cowie; Mark Hilary Van Benthem; Brian N. Wylie; George S. Davidson; David M. Haaland; Jerilyn Ann Timlin; Anthony D. Aragon; Michael R. Keenan; Kevin W. Boyack; Edward V. Thomas; Margaret C. Werner-Washburne; Monica P. Mosquera-Caro; M. Juanita Martinez; Shawn Martin; Cheryl L. Willman

High-throughput instruments and analysis techniques are required to make good use of the genomic sequences that have recently become available for many species, including humans. These instruments and methods must work with tens of thousands of genes simultaneously, and must be able to identify the small subsets of those genes that are implicated in the observed phenotypes or, for instance, in responses to therapies. Microarrays are one such high-throughput method, and they continue to find increasingly broad application. This project has improved microarray technology in several important areas. First, we developed the hyperspectral scanner, which has discovered and diagnosed numerous flaws in techniques broadly employed by microarray researchers. Second, we used a series of statistically designed experiments to identify and correct errors in our microarray data, dramatically improving the accuracy, precision, and repeatability of the microarray gene expression data. Third, our research developed new informatics techniques to identify genes with significantly different expression levels. Finally, natural language processing techniques were applied to improve our ability to make use of online literature annotating the important genes. In combination, this research has improved the reliability and precision of laboratory methods and instruments, while also enabling substantially faster analysis and discovery.
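As a toy illustration of the third item, flagging genes with significantly different expression levels, here is a minimal Python sketch on synthetic data (group sizes, effect size, and the Bonferroni cutoff are illustrative, not the project's actual statistics).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_genes = 1000
group_a = rng.normal(0.0, 1.0, size=(n_genes, 8))  # 8 control arrays
group_b = rng.normal(0.0, 1.0, size=(n_genes, 8))  # 8 treated arrays
group_b[:25] += 4.0  # simulate 25 strongly differential genes

# Per-gene two-sample t-test across the two groups.
t, p = stats.ttest_ind(group_a, group_b, axis=1)

# Bonferroni-corrected threshold to account for testing many genes at once.
significant = np.where(p < 0.05 / n_genes)[0]
print(f"{len(significant)} genes flagged as differentially expressed")
```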


Proceedings of the TIPSTER Text Program: Phase II | 1996

CERVANTES - A SYSTEM SUPPORTING TEXT ANALYSIS

Jim Cowie

CRL is engaged in the development of document management software and user interfaces to support government analysts in their information analysis tasks. It is also continuing to develop language technologies to support document detection and information extraction in a variety of languages, and it has been responsible for the integration and delivery of both the six- and twelve-month Tipster demonstration systems and the development of the first Tipster document manager.

Collaboration

Top co-authors of Jim Cowie:

Ahmed Abdelali (New Mexico State University)
Hamdy S. Soliman (New Mexico Institute of Mining and Technology)
Joe A Guthrie (University of Pittsburgh)
Louise Guthrie (New Mexico State University)
Yorick Wilks (University of Sheffield)
David M. Haaland (Sandia National Laboratories)
George S. Davidson (Sandia National Laboratories)