Gordon V. Cormack
University of Waterloo
Publications
Featured research published by Gordon V. Cormack.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2008
Charles L. A. Clarke; Maheedhar Kolla; Gordon V. Cormack; Olga Vechtomova; Azin Ashkan; Stefan Büttcher; Ian MacKinnon
Evaluation measures act as objective functions to be optimized by information retrieval systems. Such objective functions must accurately reflect user requirements, particularly when tuning IR systems and learning ranking functions. Ambiguity in queries and redundancy in retrieved documents are poorly reflected by current evaluation measures. In this paper, we present a framework for evaluation that systematically rewards novelty and diversity. We develop this framework into a specific evaluation measure, based on cumulative gain. We demonstrate the feasibility of our approach using a test collection based on the TREC question answering track.
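The measure developed in this paper is commonly known as α-nDCG: a document's gain from an information nugget decays geometrically each time that nugget has already appeared higher in the ranking. Below is a minimal sketch of that gain computation, assuming per-document nugget judgments; the function and variable names are illustrative, not the authors' reference implementation.

```python
import math

def alpha_dcg(ranking, judgments, alpha=0.5, depth=10):
    """Cumulative gain that rewards novelty: a nugget's contribution
    decays by a factor of (1 - alpha) each time it has already been
    seen higher in the ranking.

    ranking   -- document ids, best first
    judgments -- dict: doc id -> set of nugget ids it contains
    """
    seen = {}                       # nugget id -> times seen so far
    score = 0.0
    for rank, doc in enumerate(ranking[:depth], start=1):
        nuggets = judgments.get(doc, set())
        gain = sum((1 - alpha) ** seen.get(n, 0) for n in nuggets)
        for n in nuggets:
            seen[n] = seen.get(n, 0) + 1
        score += gain / math.log2(rank + 1)   # usual DCG discount
    return score
```

Normalizing by the score of an ideal reordering yields the nDCG form; with alpha = 0 the measure reduces to ordinary cumulative gain over nugget counts, rewarding neither novelty nor diversity.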
Information Retrieval | 2011
Gordon V. Cormack; Mark D. Smucker; Charles L. A. Clarke
The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled from the general web in early 2009. This dataset contains 1 billion web pages, a substantial fraction of which are spam—pages designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of the TREC 2009 web ad hoc and relevance feedback tasks, which used the ClueWeb09 dataset. We show that a simple content-based classifier with minimal training is efficient enough to rank the “spamminess” of every page in the dataset using a standard personal computer in 48 hours, and effective enough to yield significant and substantive improvements in the fixed-cutoff precision (estP10) as well as rank measures (estR-Precision, StatMAP, MAP) of nearly all submitted runs. Moreover, using a set of “honeypot” queries the labeling of training data may be reduced to an entirely automatic process. The results of classical information retrieval methods are particularly enhanced by filtering—from among the worst to among the best.
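The abstract characterizes the classifier only as simple and content-based. The sketch below shows one filter in that general spirit: online logistic regression over hashed, overlapping byte 4-grams, trained one labeled example at a time. The feature-space size and learning rate are illustrative assumptions, not the paper's exact configuration.

```python
import math

DIM = 1 << 20                 # hashed feature space (assumed size)
w = [0.0] * DIM               # model weights

def features(msg: bytes):
    """Overlapping byte 4-grams, hashed into a fixed-size space."""
    return {hash(msg[i:i + 4]) % DIM for i in range(len(msg) - 3)}

def spamminess(feats):
    z = sum(w[f] for f in feats)
    return 1.0 / (1.0 + math.exp(-z))      # logistic P(spam)

def train(feats, is_spam, rate=0.002):
    """One online gradient step, as each labeled example arrives."""
    g = (1.0 if is_spam else 0.0) - spamminess(feats)
    for f in feats:
        w[f] += rate * g
```

Scoring a page is a single pass over its byte 4-grams with no parsing, which is what makes ranking every page in a billion-document collection on one machine plausible.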
The Computer Journal | 1987
Gordon V. Cormack; R. N. Horspool
A method of dynamically constructing Markov chain models that describe the characteristics of binary messages is developed. Such models can be used to predict future message characters and can therefore be used as a basis for data compression. To this end, the Markov modelling technique is combined with Guazzo's arithmetic coding scheme to produce a powerful method of data compression. The method has the advantage of being adaptive: messages may be encoded or decoded with just a single pass through the data. Experimental results reported here indicate that the Markov modelling approach generally achieves much better data compression than that observed with competing methods on typical computer data.
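A rough sketch of the prediction half of such a scheme: an adaptive binary context model whose running counts supply the bit probabilities an arithmetic coder would consume. The fixed-order context below is a simplification (the paper's models are constructed dynamically), and the Guazzo-style coder itself is omitted.

```python
class BitPredictor:
    """Adaptive binary context model: counts of 0s and 1s seen after
    each k-bit context yield the probability handed to an arithmetic
    coder (the coder itself is omitted here)."""

    def __init__(self, k=8):
        self.k = k
        self.counts = {}        # context -> [count of 0s, count of 1s]
        self.ctx = 0

    def prob_one(self):
        c0, c1 = self.counts.get(self.ctx, [1, 1])  # Laplace smoothing
        return c1 / (c0 + c1)

    def update(self, bit):
        c = self.counts.setdefault(self.ctx, [1, 1])
        c[bit] += 1
        self.ctx = ((self.ctx << 1) | bit) & ((1 << self.k) - 1)
```

Because the counts are updated identically during encoding and decoding, both sides stay in sync, which is what allows the single-pass adaptive operation described above.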
International ACM SIGIR Conference on Research and Development in Information Retrieval | 1998
Gordon V. Cormack; Christopher Palmer; Charles L. A. Clarke
Test collections with a million or more documents are needed for the evaluation of modern information retrieval systems. Yet their construction requires a great deal of effort. Judgements must be rendered as to whether or not documents are relevant to each of a set of queries. Exhaustive judging, in which every document is examined and a judgement rendered, is infeasible for collections of this size. Current practice is represented by the “pooling method”, as used in the TREC conference series, in which only the first k documents from each of a number of sources are judged. We propose two methods, Interactive Searching and Judging and Move-to-Front Pooling, that yield effective test collections while requiring many fewer judgements. Interactive Searching and Judging selects documents to be judged using an interactive search system, and may be used by a small research team to develop an effective test collection using minimal resources. Move-to-Front Pooling directly improves on the standard pooling method by using a variable number of documents from each source depending on its retrieval performance. Move-to-Front Pooling would be an appropriate replacement for the standard pooling method in future collection development efforts involving many independent groups.
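A toy sketch of the Move-to-Front idea, under the simplifying assumption that a run is demoted the moment it contributes a nonrelevant document (the paper's priority scheme differs in detail); here `judge` stands in for the human assessor and `budget` caps the total number of judgements.

```python
from collections import deque

def mtf_pool(runs, judge, budget):
    """Move-to-Front pooling sketch: keep drawing documents from the
    run at the front of the queue while they are judged relevant;
    demote a run to the back when it contributes a nonrelevant one.

    runs   -- list of ranked document lists (best first)
    judge  -- callable doc -> True if relevant (a human in practice)
    budget -- total number of judgements allowed
    """
    queue = deque(deque(r) for r in runs)
    qrels, judged = {}, set()
    while queue and budget > 0:
        run = queue[0]
        while run and run[0] in judged:
            run.popleft()                 # skip already-judged docs
        if not run:
            queue.popleft()               # run exhausted
            continue
        doc = run.popleft()
        rel = judge(doc)
        qrels[doc], budget = rel, budget - 1
        judged.add(doc)
        if not rel:
            queue.rotate(-1)              # move this run to the back
    return qrels
```

The effect is the variable pool depth the abstract describes: runs that keep finding relevant documents are consulted deeper, while poor runs contribute only a few documents each.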
The Computer Journal | 1995
Charles L. A. Clarke; Gordon V. Cormack; Forbes J. Burkowski
A query algebra is presented that expresses searches on structured text. In addition to traditional full-text Boolean queries that search a pre-defined collection of documents, the algebra permits queries that harness document structure. The algebra manipulates arbitrary intervals of text, which are recognized in the text from implicit or explicit markup. The algebra has seven operators, which combine intervals to yield new ones: containing, not containing, contained in, not contained in, one of, both of, followed by. The ultimate result of a query is the set of intervals that satisfy it. An implementation framework is given based on four primitive access functions. Each access function finds the solution to a query nearest to a given position in the database. Recursive definitions for the seven operators are given in terms of these access functions. Search time is at worst proportional to the time required to evaluate the access functions for occurrences of the elementary terms in a query.
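To make the flavor of the algebra concrete, here is a toy rendering of two of the seven operators over eagerly materialized interval lists. The paper instead evaluates queries lazily through the four primitive access functions, each returning the solution nearest a given position, so `first_from` below is only a stand-in for that machinery.

```python
import bisect

def first_from(intervals, k):
    """First interval (start, end) with start >= k, or None.
    A crude stand-in for the paper's primitive access functions."""
    i = bisect.bisect_left(intervals, (k,))
    return intervals[i] if i < len(intervals) else None

def followed_by(a, b):
    """Intervals spanning an a-interval followed by a b-interval."""
    out, pos = [], 0
    while True:
        x = first_from(a, pos)
        if x is None:
            break
        y = first_from(b, x[1] + 1)
        if y is None:
            break
        out.append((x[0], y[1]))
        pos = x[0] + 1
    return out

def contained_in(a, b):
    """a-intervals lying inside some b-interval."""
    return [x for x in a
            if any(y[0] <= x[0] and x[1] <= y[1] for y in b)]
```

In the paper's framework these operators are defined recursively over the access functions, so intermediate interval sets are never built in full; the eager lists here are purely for readability.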
Information Processing and Management | 2000
Charles L. A. Clarke; Gordon V. Cormack; Elizabeth A. Tudhope
We investigate the application of a novel relevance ranking technique, cover density ranking, to the requirements of Web-based information retrieval, where a typical query consists of a few search terms and a typical result consists of a page indicating several potentially relevant documents. Traditional ranking methods for information retrieval, based on term and inverse document frequencies, have been found to work poorly in this context. Under the cover density measure, ranking is based on term proximity and co-occurrence. Experimental comparisons show performance that compares favorably with previous work.
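A sketch of the core computation, assuming a cover is a minimal interval containing at least one occurrence of every query term, and that short covers score fully while longer ones score in inverse proportion to their length; the constant `K` is a tuning parameter and the default shown is an assumption, not necessarily the paper's value.

```python
def covers(positions):
    """Minimal intervals containing every query term at least once.
    positions: dict term -> sorted list of word offsets in the doc."""
    events = sorted((p, t) for t, ps in positions.items() for p in ps)
    need, have = len(positions), {}
    out, lo = [], 0
    for p, t in events:
        have[t] = have.get(t, 0) + 1
        while len(have) == need:          # window covers all terms
            q, u = events[lo]
            if have[u] == 1:
                out.append((q, p))        # both ends critical: minimal
                del have[u]
            else:
                have[u] -= 1
            lo += 1
    return out

def cover_score(cvs, K=16):
    """Short covers count 1; longer covers count K / length."""
    return sum(min(1.0, K / (q - p + 1)) for p, q in cvs)
```

Note that nothing here depends on collection-wide statistics such as inverse document frequency; proximity and co-occurrence within the document do all the work, which is the point of the technique for short Web queries.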
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2009
Gordon V. Cormack; Charles L. A. Clarke; Stefan Büttcher
Reciprocal Rank Fusion (RRF), a simple method for combining the document rankings from multiple IR systems, consistently yields better results than any individual system, and better results than the standard method Condorcet Fuse. This result is demonstrated by using RRF to combine the results of several TREC experiments, and to build a meta-learner that ranks the LETOR 3 dataset better than any previously reported method.
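The method itself is a one-line formula: each input system contributes 1/(k + r) to a document's score, where r is the document's rank in that system and k is a constant (the paper uses k = 60). A minimal sketch:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over input rankings of
    1 / (k + rank of d in that ranking).

    rankings -- list of ranked lists of document ids (best first)
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The 1/(k + r) form decays smoothly with rank, so agreement among systems near the top of their rankings dominates the fused order, while a single outlier ranking cannot veto a document the way stricter voting rules can.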
ACM Transactions on Information Systems | 2007
Gordon V. Cormack; Thomas R. Lynam
Eleven variants of six widely used open-source spam filters are tested on a chronological sequence of 49,086 e-mail messages received by an individual from August 2003 through March 2004. Our approach differs from those previously reported in that the test set is large, comprises uncensored raw messages, and is presented to each filter sequentially with incremental feedback. Misclassification rates and Receiver Operating Characteristic curve measurements are reported, with statistical confidence intervals. Quantitative results indicate that content-based filters can eliminate 98% of spam while incurring 0.1% legitimate email loss. Qualitative results indicate that the risk of loss depends on the nature of the message, and that messages likely to be lost may be those that are less critical. More generally, our methodology has been encapsulated in a free software toolkit, which may be used to conduct similar experiments.
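The test methodology itself is easy to express: messages are replayed in their original chronological order, the filter classifies each one, and the gold label is revealed immediately afterwards. The `classify`/`train` interface below is an assumed stand-in for the toolkit's actual interface.

```python
def run_sequential_test(messages, filter_):
    """Chronological evaluation with incremental feedback: the filter
    classifies each message, then immediately receives its gold label.
    Returns (ham misclassification rate, spam misclassification rate)."""
    ham_err = spam_err = hams = spams = 0
    for text, is_spam in messages:        # chronological order
        predicted_spam = filter_.classify(text)
        filter_.train(text, is_spam)      # incremental feedback
        if is_spam:
            spams += 1
            spam_err += not predicted_spam
        else:
            hams += 1
            ham_err += predicted_spam
    return ham_err / hams, spam_err / spams
```

Keeping the two error rates separate matters because their costs differ sharply: losing legitimate mail is far more damaging than letting a spam message through, which is why the paper reports both along with ROC measurements rather than a single accuracy figure.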
Conference on Information and Knowledge Management | 2007
Gordon V. Cormack; José María Gómez Hidalgo; Enrique Puertas Sanz
We consider the problem of content-based spam filtering for short text messages that arise in three contexts: mobile (SMS) communication, blog comments, and email summary information such as might be displayed by a low-bandwidth client. Short messages often consist of only a few words, and therefore present a challenge to traditional bag-of-words based spam filters. Using three corpora of short messages and message fields derived from real SMS, blog, and spam messages, we evaluate feature-based and compression-model-based spam filters. We observe that bag-of-words filters can be improved substantially using different features, while compression-model filters perform quite well as-is. We conclude that content filtering for short messages is surprisingly effective.
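One way to see why short messages starve a bag-of-words filter, and what "different features" can buy: a five-word SMS yields at most five word features but dozens of overlapping character n-grams. A small illustrative sketch (the feature choices are assumptions, not the paper's exact feature set):

```python
def word_features(msg: str):
    """Classic bag of words: very sparse for a short message."""
    return set(msg.lower().split())

def char_ngram_features(msg: str, n=3):
    """Overlapping character n-grams give the filter far more
    evidence per message, including sub-word spelling cues."""
    s = msg.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

msg = "WIN a FREE prize now"
print(len(word_features(msg)), len(char_ngram_features(msg)))  # 5 vs 18
```

Compression-model filters get a similar benefit for free, since they model character sequences directly rather than whole tokens, which is consistent with the abstract's observation that they perform well as-is.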
Communications of the ACM | 1985
Gordon V. Cormack
A general-purpose data-compression routine, implemented on the IMS database system, makes use of context to achieve better compression than Huffman's method applied character by character. It demonstrates that a wide variety of data can be compressed effectively using a single, fixed compression routine with almost no working storage.
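The advantage of context can be made concrete with a small entropy calculation: the order-0 entropy bounds any memoryless code such as character-by-character Huffman, while conditioning on the preceding character exposes the structure a context model exploits. A toy illustration (not from the paper):

```python
import math
from collections import Counter

def order0_entropy(text):
    """Bits/char for a memoryless model: the floor for any code,
    like Huffman, applied character by character."""
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

def order1_entropy(text):
    """Bits/char when each character is coded in the context of its
    predecessor: lower whenever the data has local structure."""
    pairs = Counter(zip(text, text[1:]))
    ctx = Counter(text[:-1])
    n = len(text) - 1
    return -sum(c / n * math.log2(c / ctx[a]) for (a, _), c in pairs.items())

text = "ababababab" * 10
print(order0_entropy(text), order1_entropy(text))  # ~1.0 vs ~0.0
```

On the alternating string above, a character-by-character code needs a full bit per symbol, while a context-conditioned coder needs essentially none, which is the gap the routine described here exploits on real database records.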