Egidio L. Terra
University of Waterloo
Publications
Featured research published by Egidio L. Terra.
north american chapter of the association for computational linguistics | 2003
Egidio L. Terra; Charles L. A. Clarke
Statistical measures of word similarity have application in many areas of natural language processing, such as language modeling and information retrieval. We report a comparative study of two methods for estimating word co-occurrence frequencies required by word similarity measures. Our frequency estimates are generated from a terabyte-sized corpus of Web data, and we study the impact of corpus size on the effectiveness of the measures. We base the evaluation on one TOEFL question set and two practice question sets, each consisting of a number of multiple-choice questions seeking the best synonym for a given target word. For two question sets, a context for the target word is provided, and we examine a number of word similarity measures that exploit this context. Our best combination of similarity measure and frequency estimation method answers 6-8% more questions than the best results previously reported for the same question sets.
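As a rough illustration of the setup, the sketch below scores a multiple-choice synonym question with pointwise mutual information computed from co-occurrence counts. PMI is only one possible similarity measure, and the counts, words, and function names here are hypothetical rather than taken from the paper's corpus or its estimation methods.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from raw co-occurrence counts.
    Shown only as an illustrative stand-in for a word similarity measure."""
    if count_xy == 0:
        return float("-inf")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

def answer_synonym_question(target, choices, unigram, cooc, total):
    """Pick the choice whose similarity to the target word is highest."""
    def score(choice):
        pair = tuple(sorted((target, choice)))
        return pmi(cooc.get(pair, 0), unigram.get(target, 1),
                   unigram.get(choice, 1), total)
    return max(choices, key=score)

# Hypothetical counts, not from the paper's terabyte corpus.
unigram = {"enormously": 120, "tremendously": 95, "appropriately": 400,
           "uniquely": 150, "decidedly": 80}
cooc = {("enormously", "tremendously"): 12,
        ("appropriately", "enormously"): 2}
print(answer_synonym_question("enormously",
      ["appropriately", "uniquely", "tremendously", "decidedly"],
      unigram, cooc, total=10_000_000))
```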
international acm sigir conference on research and development in information retrieval | 2003
Charles L. A. Clarke; Egidio L. Terra
Question answering (QA) systems often contain an information retrieval subsystem that identifies documents or passages where the answer to a question might appear [1–3, 5, 6, 10]. The QA system generates queries from the questions and submits them to the IR subsystem. The IR subsystem returns the top-ranked documents or passages, and the QA system selects the answers from them. In many QA systems, the IR component retrieves entire documents. Then, in a post-retrieval step, the system scans the retrieved documents and locates groups of sentences that contain most or all of the question keywords [3, 10, and others]. These sentences are subjected to further analysis to select the answer. In other QA systems, a passage-retrieval technique is employed to directly identify locations within the document collection where the answer might be found, avoiding the post-retrieval step [1, 2, 5, 6, and others]. In this context, a “relevant” document or passage is one that contains an answer. We utilize this notion of relevance to evaluate an IR subsystem in isolation from the rest of its QA system by applying standard measures of IR effectiveness. By restricting our evaluation to a single subsystem we hope to gain experience that is applicable to QA systems beyond our own. An assumption inherent in this approach is that improved precision in the IR subsystem will translate to improved performance of the QA system as a whole. This assumption holds for our own system, and should (at least) hold for any system that exploits redundancy, that is, any system that takes advantage of the observation that answers tend to occur in more than one retrieved passage [1, 2, 5]. In this paper we compare a successful passage-retrieval method [1, 5] with a well-known and effective document retrieval method: Okapi BM25 [7]. Our goal is to examine
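One way to make the answer-containment notion of relevance concrete is sketched below: a retrieved passage counts as relevant if it contains a known answer string, and precision at rank k is averaged over questions. The function names and the choice of precision@k are illustrative assumptions, not the paper's exact evaluation protocol.

```python
def contains_answer(passage, answer_patterns):
    """A passage is 'relevant' here if it contains any known answer string."""
    text = passage.lower()
    return any(a.lower() in text for a in answer_patterns)

def precision_at_k(ranked_passages, answer_patterns, k=20):
    """Fraction of the top-k retrieved passages that contain an answer."""
    top = ranked_passages[:k]
    hits = sum(contains_answer(p, answer_patterns) for p in top)
    return hits / k if top else 0.0

def mean_precision_at_k(runs, k=20):
    """runs: list of (ranked_passages, answer_patterns) pairs, one per question."""
    scores = [precision_at_k(p, a, k) for p, a in runs]
    return sum(scores) / len(scores) if scores else 0.0
```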
international acm sigir conference on research and development in information retrieval | 2002
Charles L. A. Clarke; Gordon V. Cormack; M. Laszlo; Thomas R. Lynam; Egidio L. Terra
Using our question answering system, questions from the TREC 2001 evaluation were executed over a series of Web data collections, with the sizes of the collections increasing from 25 gigabytes up to nearly a terabyte.
text retrieval conference | 2008
Charles L. A. Clarke; Gordon V. Cormack; Thomas R. Lynam; Egidio L. Terra
The MultiText QA System performs question answering using a two step passage selection method. In the first step, an arbitrary passage retrieval algorithm efficiently identifies hotspots in a large target corpus where the answer might be located. In the second step, an answer selection algorithm analyzes these hotspots, considering such factors as answer type and candidate redundancy, to extract short answer snippets. This chapter describes both steps in detail, with the goal of providing sufficient information to allow independent implementation. The method is evaluated using the test collection developed for the TREC 2001 question answering track.
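A minimal sketch of the two-step structure, assuming a pluggable `retrieve` function stands in for the first-pass passage-retrieval algorithm and a regular expression stands in for answer-type filtering; the redundancy-based scoring of candidates is a simplified stand-in for the selection algorithm the chapter describes in detail.

```python
from collections import Counter
import re

def first_pass_hotspots(retrieve, query_terms, k=50):
    """Step 1: an arbitrary passage-retrieval function supplies candidate
    hotspots; `retrieve` is a placeholder for that algorithm."""
    return retrieve(query_terms, k)

def second_pass_answers(hotspots, candidate_pattern, top_n=5):
    """Step 2: extract candidate snippets matching the expected answer type
    (here a regex) and rank them by redundancy, i.e. how often a candidate
    recurs across independent hotspots."""
    counts = Counter()
    for rank, passage in enumerate(hotspots):
        for cand in re.findall(candidate_pattern, passage):
            # Reward candidates found in many passages, with a small
            # bonus for appearing in highly ranked hotspots.
            counts[cand] += 1.0 + 1.0 / (rank + 1)
    return [cand for cand, _ in counts.most_common(top_n)]
```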
conference on information and knowledge management | 2004
Egidio L. Terra; Charles L. A. Clarke
A common approach to the vocabulary mismatch problem is to augment the original query using dictionaries and other lexical resources and/or by looking at pseudo-relevant documents. Either way, terms are added to form a new query that is used to score all documents in a subsequent retrieval pass, and as a consequence the original query's focus may drift because of the newly added terms. We propose a new method to address the vocabulary mismatch problem, expanding original query terms only when necessary and complementing the user query for missing terms while scoring documents. It allows related semantic aspects to be included in a conservative and selective way, thus reducing the possibility of query drift. Our results using replacements for the <i>missing query terms</i> in modified document and passage retrieval methods show significant improvement over the original ones.
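The sketch below illustrates the general idea of complementing missing query terms at scoring time rather than expanding the query up front: a related term contributes only when the original term is absent from the document, and only at a discounted weight. The `related` mapping, `idf` weights, and discount factor are assumptions for illustration, not the paper's actual scoring formula.

```python
def score_with_replacements(doc_terms, query_terms, idf, related, discount=0.5):
    """Score a document against the query, substituting a related term only
    for query terms that are missing from the document. `related` maps a
    query term to candidate replacements drawn from a thesaurus or from
    pseudo-relevant documents (an assumption here)."""
    score = 0.0
    for q in query_terms:
        if q in doc_terms:
            score += idf.get(q, 0.0)                # term matched directly
        else:
            # Expand only when necessary: look for a replacement in the doc.
            subs = [r for r in related.get(q, []) if r in doc_terms]
            if subs:
                score += discount * max(idf.get(r, 0.0) for r in subs)
    return score
```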
international conference on computational linguistics | 2004
Egidio L. Terra; Charles L. A. Clarke
We present a framework for the fast computation of lexical affinity models. The framework is composed of a novel algorithm to efficiently compute the co-occurrence distribution between pairs of terms, an independence model, and a parametric affinity model. In comparison with previous models, which either use arbitrary windows to compute similarity between words or use lexical affinity to create sequential models, in this paper we focus on models intended to capture the co-occurrence patterns of any pair of words or phrases at any distance in the corpus. The framework is flexible, allowing fast adaptation to applications, and it is scalable. We apply it in combination with a terabyte corpus to answer natural language tests, achieving encouraging results.
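To make the contrast with an independence assumption concrete, the sketch below collects the empirical distribution of distances at which two terms co-occur and compares it to the expected counts if the terms occurred independently. It uses a simple quadratic-time pass over a token list, and the baseline is only an approximate stand-in; the paper's algorithm is far more efficient and its independence and affinity models are more carefully parameterized.

```python
from collections import Counter

def distance_distribution(tokens, w1, w2, max_dist=50):
    """Empirical counts of the distances at which w1 and w2 co-occur.
    A naive quadratic pass, for illustration only."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    dist = Counter()
    for i in pos1:
        for j in pos2:
            d = abs(i - j)
            if 0 < d <= max_dist:
                dist[d] += 1
    return dist

def independence_baseline(tokens, w1, w2, max_dist=50):
    """Rough expected counts per distance if w1 and w2 occurred independently;
    affinity can then be read as the ratio of observed to expected counts."""
    n = len(tokens)
    p1 = tokens.count(w1) / n
    p2 = tokens.count(w2) / n
    # Roughly n positions and two orderings per unit distance.
    return {d: 2 * n * p1 * p2 for d in range(1, max_dist + 1)}
```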
conference on information and knowledge management | 2004
Charles L. A. Clarke; Egidio L. Terra
We examine the problem of retrieving the top-<i>m</i> ranked items from a large collection, randomly distributed across an <i>n</i>-node system. In order to retrieve the top <i>m</i> overall, we must retrieve the top <i>m</i> from the subcollection stored on each node and merge the results. However, if we are willing to accept a small probability that one or more of the top-<i>m</i> items may be missed, it is possible to reduce computation time by retrieving only the top <i>k < m</i> from each node. In this paper, we demonstrate that this simple observation can be exploited in a realistic application to produce a substantial efficiency improvement without compromising the quality of the retrieved results. To support our claim, we present a statistical model that predicts the impact of the optimization. The paper is structured around a specific application (passage retrieval for question answering), but the primary results are more broadly applicable.
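The probability of a miss can be illustrated with a short simulation: since every global top-m item outranks all non-top-m items, a top-m item is missed exactly when more than k of the top-m items land on the same node. The Monte Carlo estimate below is a sketch of this reasoning under uniform random placement, not the paper's analytical model.

```python
import random
from collections import Counter

def miss_probability(n_nodes, m, k, trials=100_000, seed=0):
    """Estimate, by simulation, the probability that at least one of the
    global top-m items is missed when each of n randomly loaded nodes
    returns only its top k. A top-m item can only be outranked locally
    by other top-m items, so a miss occurs exactly when some node holds
    more than k of them."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        # Place each of the m top items on a node uniformly at random.
        load = Counter(rng.randrange(n_nodes) for _ in range(m))
        if max(load.values()) > k:
            misses += 1
    return misses / trials

# Example: 16 nodes, top 20 overall, retrieving only the top 5 per node.
print(miss_probability(n_nodes=16, m=20, k=5))
```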
text retrieval conference | 2002
Charles L. A. Clarke; Gordon V. Cormack; Graeme Kemkes; M. Laszlo; Thomas R. Lynam; Egidio L. Terra; Philip L. Tilker
international acm sigir conference on research and development in information retrieval | 2004
Kevyn Collins-Thompson; Jamie Callan; Egidio L. Terra; Charles L. A. Clarke
text retrieval conference | 2003
David L. Yeung; Charles L. A. Clarke; Gordon V. Cormack; Thomas R. Lynam; Egidio L. Terra