James Mayfield | Researchain

Publication

Featured research published by James Mayfield.

Information Retrieval | 2004

Character N-Gram Tokenization for European Language Text Retrieval

Paul McNamee; James Mayfield

The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n = 4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.
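As a rough illustration of the overlapping character n-gram tokenization described in this abstract, the following Python sketch produces n-grams with n = 4. The function name `char_ngrams`, the lowercasing, the whitespace padding, and the choice to let n-grams span word boundaries are assumptions made for the example; the abstract does not specify these details.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return overlapping character n-grams of `text`.

    Assumptions (not taken from the abstract): text is lowercased,
    runs of whitespace are collapsed to one space, a single space is
    added at each end, and n-grams may span word boundaries.
    """
    normalized = " " + " ".join(text.lower().split()) + " "
    if len(normalized) < n:
        # Text shorter than n: keep it as a single short token.
        return [normalized]
    # Slide a window of width n one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("the cat"))
```

Each character starts its own n-gram, so an index built this way holds several terms per word rather than one, which is consistent with the increased storage and time requirements the abstract mentions.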

North American Chapter of the Association for Computational Linguistics | 2003

Named entity recognition using hundreds of thousands of features

James Mayfield; Paul McNamee; Christine D. Piatko

We present an approach to named entity recognition that uses support vector machines to capture transition probabilities in a lattice. The support vector machines are trained with hundreds of thousands of features drawn from the CoNLL-2003 Shared Task training data. Margin outputs are converted to estimated probabilities using a simple static function. Performance is evaluated using the CoNLL-2003 Shared Task test set; Test B results were F_{β=1} = 84.67 for English and F_{β=1} = 69.96 for German.

Hawaii International Conference on System Sciences | 2005

Information Retrieval and the Semantic Web

Tim Finin; James Mayfield; Anupam Joshi; R.S. Cost; Clayton Fink

Information retrieval technology has been central to the success of the Web. For semantic web documents or annotations to have an impact, they will have to be compatible with Web based indexing and retrieval technology. We discuss some of the underlying problems and issues central to extending information retrieval systems to handle annotations in semantic web languages. We also describe three prototype systems that we have implemented to explore these ideas.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2003

Single n-gram stemming

James Mayfield; Paul McNamee

Stemming can improve retrieval accuracy, but stemmers are language-specific. Character n-gram tokenization achieves many of the benefits of stemming in a language independent way, but its use incurs a performance penalty. We demonstrate that selection of a single n-gram as a pseudo-stem for a word can be an effective and efficient language-neutral approach for some languages.
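The idea of choosing one n-gram per word as a pseudo-stem can be sketched as follows. The abstract does not say how the n-gram is selected; this example assumes the n-gram with the lowest document frequency is chosen, with ties going to the leftmost candidate. The helper names (`word_ngrams`, `document_frequencies`, `pseudo_stem`) and the `_` boundary padding are hypothetical.

```python
from collections import Counter


def word_ngrams(word: str, n: int) -> list[str]:
    """Overlapping n-grams of one word, padded with '_' at both ends."""
    padded = f"_{word.lower()}_"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def document_frequencies(docs: list[str], n: int) -> Counter:
    """Count, for each n-gram, the number of documents containing it."""
    df: Counter = Counter()
    for doc in docs:
        seen: set[str] = set()
        for word in doc.split():
            seen.update(word_ngrams(word, n))
        df.update(seen)  # each n-gram counted once per document
    return df


def pseudo_stem(word: str, df: Counter, n: int) -> str:
    """Pick a single n-gram to stand in for the word.

    Assumed selection rule: lowest document frequency wins; min()
    returns the first minimum, so ties go to the leftmost n-gram.
    """
    return min(word_ngrams(word, n), key=lambda g: df[g])


if __name__ == "__main__":
    docs = ["juggling jugglers", "juggle walking", "walking talking"]
    df = document_frequencies(docs, 4)
    for w in ("juggling", "jugglers", "juggle"):
        print(w, "->", pseudo_stem(w, df, 4))
```

Indexing one term per word keeps the index close to the size of a word index, which is where the efficiency gain over full n-gram tokenization would come from under this assumption.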

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2004

Building an information retrieval test collection for spontaneous conversational speech

Douglas W. Oard; Dagobert Soergel; David S. Doermann; Xiaoli Huang; G. Craig Murray; Jianqiang Wang; Bhuvana Ramabhadran; Martin Franz; Samuel Gustman; James Mayfield; Liliya Kharevych; Stephanie M. Strassel

Test collections model use cases in ways that facilitate evaluation of information retrieval systems. This paper describes the use of search-guided relevance assessment to create a test collection for retrieval of spontaneous conversational speech. Approximately 10,000 thematically coherent segments were manually identified in 625 hours of oral history interviews with 246 individuals. Automatic speech recognition results, manually prepared summaries, controlled vocabulary indexing, and name authority control are available for every segment. Those features were leveraged by a team of four relevance assessors to identify topically relevant segments for 28 topics developed from actual user requests. Search-guided assessment yielded sufficient inter-annotator agreement to support formative evaluation during system development. Baseline results for ranked retrieval are presented to illustrate use of the collection.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2009

Addressing morphological variation in alphabetic languages

Paul McNamee; Charles K. Nicholas; James Mayfield

The selection of indexing terms for representing documents is a key decision that limits how effective subsequent retrieval can be. Often stemming algorithms are used to normalize surface forms, and thereby address the problem of not finding documents that contain words related to query terms through inflectional or derivational morphology. However, rule-based stemmers are not available for every language and it is unclear which methods for coping with morphology are most effective. In this paper we investigate an assortment of techniques for representing text and compare these approaches using data sets in eighteen languages and five different writing systems. We find character n-gram tokenization to be highly effective. In half of the languages examined n-grams outperform unnormalized words by more than 25%; in highly inflective languages relative improvements over 50% are obtained. In languages with less morphological richness the choice of tokenization is not as critical and rule-based stemming can be an attractive option, if available. We also conducted an experiment to uncover the source of n-gram power and a causal relationship between the morphological complexity of a language and n-gram effectiveness was demonstrated.

Cross-Language Evaluation Forum | 2000

A Language-Independent Approach to European Text Retrieval

Paul McNamee; James Mayfield; Christine D. Piatko

We present an approach to multilingual information retrieval that does not depend on the existence of specific linguistic resources such as stemmers or thesauri. Using the HAIRCUT system we participated in the monolingual, bilingual, and multilingual tasks of the CLEF-2000 evaluation. Our approach, based on combining the benefits of words and character n-grams, was effective for both language-independent monolingual retrieval as well as for cross-language retrieval using translated queries. After describing our monolingual retrieval approach we compare a translation method using aligned parallel corpora to commercial machine translation software.

Cross-Language Evaluation Forum | 2001

JHU/APL Experiments at CLEF: Translation Resources and Score Normalization

Paul McNamee; James Mayfield

The Johns Hopkins University Applied Physics Laboratory participated in three of the five tasks of the CLEF-2001 evaluation: monolingual retrieval, bilingual retrieval, and multilingual retrieval. In this paper we describe the fundamental methods we used and we present initial results from three experiments. The first investigation examines whether residual inverse document frequency can improve the term weighting methods used with a linguistically-motivated probabilistic model. The second experiment attempts to assess the benefit of various translation resources for cross-language retrieval. Our last effort aims to improve cross-collection score normalization, a task essential for the multilingual problem.
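For readers unfamiliar with residual inverse document frequency, the sketch below computes it using the commonly cited definition: observed IDF minus the IDF expected if term occurrences followed a Poisson distribution. Whether the paper uses exactly this form, and how the value enters the term weighting, is not stated in the abstract, so the function `residual_idf` and its parameters are illustrative assumptions.

```python
import math


def residual_idf(doc_freq: int, coll_freq: int, num_docs: int) -> float:
    """Residual IDF: observed IDF minus Poisson-expected IDF.

    doc_freq  -- number of documents containing the term
    coll_freq -- total occurrences of the term in the collection
    num_docs  -- number of documents in the collection
    """
    if not 0 < doc_freq <= num_docs or coll_freq < doc_freq:
        raise ValueError("need 0 < doc_freq <= num_docs and coll_freq >= doc_freq")
    observed_idf = -math.log2(doc_freq / num_docs)
    # Under a Poisson model with rate coll_freq / num_docs, the chance
    # that a document contains the term at least once is 1 - exp(-rate).
    rate = coll_freq / num_docs
    expected_idf = -math.log2(1.0 - math.exp(-rate))
    return observed_idf - expected_idf


if __name__ == "__main__":
    # A term concentrated in few documents (bursty) scores higher than
    # one spread evenly, even at the same document frequency.
    print(residual_idf(doc_freq=10, coll_freq=100, num_docs=1000))
    print(residual_idf(doc_freq=10, coll_freq=10, num_docs=1000))
```

Terms that cluster in a few documents deviate most from the Poisson expectation, which is why this quantity is often treated as a signal of content-bearing terms.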

Cross-Language Evaluation Forum | 2003

JHU/APL Experiments in Tokenization and Non-word Translation

Paul McNamee; James Mayfield

In the past we have conducted experiments that investigate the benefits and peculiarities attendant to alternative methods for tokenization, particularly overlapping character n-grams. This year we continued this line of work and report new findings reaffirming that the judicious use of n-grams can lead to performance surpassing that of word-based tokenization. In particular we examined: the relative performance of n-grams and a popular suffix stemmer; a novel form of n-gram indexing that approximates stemming and achieves fast run-time performance; various lengths of n-grams; and the use of n-grams for robust translation of queries using an aligned parallel text. For the CLEF 2003 evaluation we submitted monolingual and bilingual runs for all languages and language pairs and multilingual runs using English as a source language. Our key findings are that shorter n-grams (n=4 and n=5) outperform a popular stemmer in non-Romance languages, that direct translation of n-grams is feasible using an aligned corpus, that translated 5-grams yield superior performance to words, stems, or 4-grams, and that a combination of indexing methods is best of all.

Cross-Language Evaluation Forum | 2002

Scalable Multilingual Information Access

Paul McNamee; James Mayfield

The third Cross-Language Evaluation Forum workshop (CLEF-2002) provides the unprecedented opportunity to evaluate retrieval in eight different languages using a common set of topics and a uniform assessment methodology. This year the Johns Hopkins University Applied Physics Laboratory participated in the monolingual, bilingual, and multilingual retrieval tasks. We contend that information access in a plethora of languages requires approaches that are inexpensive in developer and run-time costs. In this paper we describe a simplified approach that seems suitable for retrieval in many languages; we also show how good retrieval is possible over many languages, even when translation resources are scarce, or when query-time translation is infeasible. In particular, we investigate the use of character n-grams for monolingual retrieval, CLIR between related languages using partial morphological matches, and translation of document representations to an interlingua for computationally efficient retrieval against multiple languages.
