Matt Crane | Researchain

Publication

Featured research published by Matt Crane.

european conference on information retrieval | 2016

Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge

Jimmy J. Lin; Matt Crane; Andrew Trotman; Jamie Callan; Ishan Chattopadhyaya; John Foley; Grant Ingersoll; Craig Macdonald; Sebastiano Vigna

The Open-Source IR Reproducibility Challenge brought together developers of open-source search engines to provide reproducible baselines of their systems in a common environment on Amazon EC2. The product is a repository that contains all code necessary to generate competitive ad hoc retrieval baselines, such that with a single script, anyone with a copy of the collection can reproduce the submitted runs. Our vision is that these results would serve as widely accessible points of comparison in future IR research. This project represents an ongoing effort, but we describe the first phase of the challenge that was organized as part of a workshop at SIGIR 2015. We have succeeded modestly so far, achieving our main goals on the Gov2 collection with seven open-source search engines. In this paper, we describe our methodology, share experimental results, and discuss lessons learned as well as next steps.

international acm sigir conference on research and development in information retrieval | 2016

Report on the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR)

Jaime Arguello; Matt Crane; Fernando Diaz; Jimmy J. Lin; Andrew Trotman

The SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR) took place on Thursday, August 13, 2015 in Santiago, Chile. The goal of the workshop was twofold. The first was to provide a venue for the publication and presentation of negative results. The second was to provide a venue through which the authors of open source search engines could compare performance of indexing and searching on the same collections and on the same machines, encouraging the sharing of ideas and discoveries in a like-to-like environment. In total, three papers were presented and seven systems participated.

conference on information and knowledge management | 2013

Maintaining discriminatory power in quantized indexes

Matt Crane; Andrew Trotman; Richard A. O'Keefe

The time cost of searching with an inverted index is directly proportional to the number of postings processed and the cost of processing each posting. Dynamic pruning reduces the number of postings examined. Pre-calculation then quantization of term / document weights reduces the cost of evaluating each posting. The effect of quantization on precision, latency, and index size is examined herein. We show empirically that there is an ideal size (in bits) for storing the quantized scores. Increasing this adversely affects index size and search latency; decreasing it adversely affects precision. We observe a relationship between the collection size and ideal quantization size, and provide a way to determine the number of bits to use from the collection size.
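As an illustration of the quantization idea in the abstract above, the following is a minimal sketch that assumes a simple uniform (linear) mapping of pre-calculated scores onto b-bit integers. The function name `quantize` and the choice of a linear mapping are assumptions made for this example; the abstract does not specify the scheme used in the paper.

```python
def quantize(scores, bits):
    """Linearly map real-valued scores onto integers in [1, 2**bits - 1].

    Hypothetical helper for illustration only; zero is left unused so that
    every posting keeps a non-zero impact score.
    """
    lo, hi = min(scores), max(scores)
    levels = (1 << bits) - 1  # largest representable quantized score
    if hi == lo:
        # All scores are identical, so they all share one level.
        return [levels for _ in scores]
    scale = (levels - 1) / (hi - lo)
    return [1 + int((s - lo) * scale) for s in scores]


scores = [0.0, 0.25, 0.5, 1.0]
coarse = quantize(scores, 2)  # few bits: distinct scores can collide
fine = quantize(scores, 8)    # more bits: distinct scores stay distinct
```

With only 2 bits, the scores 0.0 and 0.25 collapse onto the same integer, which is the kind of lost discriminatory power the abstract associates with too few bits; with 8 bits all four scores remain distinct, at the cost of a wider stored value.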

web search and data mining | 2017

A Comparison of Document-at-a-Time and Score-at-a-Time Query Evaluation

Matt Crane; J. Shane Culpepper; Jimmy J. Lin; Joel Mackenzie; Andrew Trotman

We present an empirical comparison between document-at-a-time (DaaT) and score-at-a-time (SaaT) document ranking strategies within a common framework. Although both strategies have been extensively explored, the literature lacks a fair, direct comparison: such a study has been difficult due to vastly different query evaluation mechanics and index organizations. Our work controls for score quantization, document processing, compression, implementation language, implementation effort, and a number of details, arriving at an empirical evaluation that fairly characterizes the performance of three specific techniques: WAND (DaaT), BMW (DaaT), and JASS (SaaT). Experiments reveal a number of interesting findings. The performance gap between WAND and BMW is not as clear as the literature suggests, and both methods are susceptible to tail queries that may take orders of magnitude longer than the median query to execute. Surprisingly, approximate query evaluation in WAND and BMW does not significantly reduce the risk of these tail queries. Overall, JASS is slightly slower than either WAND or BMW, but exhibits much lower variance in query latencies and is much less susceptible to tail query effects. Furthermore, JASS query latency is not particularly sensitive to the retrieval depth, making it an appealing solution for performance-sensitive applications where bounds on query latencies are desirable.

australasian document computing symposium | 2012

Effects of spam removal on search engine efficiency and effectiveness

Matt Crane; Andrew Trotman

Spam has long been identified as a problem that web search engines are required to deal with. Large collection sizes are also an increasing issue for institutions that do not have the necessary resources to process them in their entirety. In this paper we investigate the effect that withholding documents identified as spam has on the resources required to process large collections. We also investigate the resulting search effectiveness and efficiency when different amounts of spam are withheld. We find that by removing spam at indexing time we are able to decrease the index size without affecting the indexing throughput, and are able to improve search precision for some thresholds.

web search and data mining | 2018

Query Driven Algorithm Selection in Early Stage Retrieval

Joel Mackenzie; J. Shane Culpepper; Roi Blanco; Matt Crane; Charles L. A. Clarke; Jimmy J. Lin

Large-scale retrieval systems often employ cascaded ranking architectures, in which an initial set of candidate documents is iteratively refined and re-ranked by increasingly sophisticated and expensive ranking models. In this paper, we propose a unified framework for predicting a range of performance-sensitive parameters based on minimizing end-to-end effectiveness loss. The framework does not require relevance judgments for training, is amenable to predicting a wide range of parameters, allows for fine-tuned efficiency-effectiveness trade-offs, and can be easily deployed in large-scale search systems with minimal overhead. As a proof of concept, we show that the framework can accurately predict a number of performance parameters on a query-by-query basis, allowing efficient and effective retrieval, while simultaneously minimizing the tail latency of an early-stage candidate generation system. On the 50 million document ClueWeb09B collection, and across 25,000 queries, our hybrid system can achieve superior early-stage efficiency to fixed parameter systems without loss of effectiveness, and allows finer-grained efficiency-effectiveness trade-offs across the multiple stages of the retrieval system.

australasian document computing symposium | 2013

Managing short postings lists

Andrew Trotman; Xiang-Fei Jia; Matt Crane

Previous work has examined space saving and throughput increasing techniques for long postings lists in an inverted file search engine. In this contribution we show that highly sporadic terms (terms that occur in 1 or 2 documents) are a high proportion of the unique terms in the collection and that these terms are seen in queries. The previously known space saving method of storing their short postings lists in the vocabulary is compared to storing in the postings file. We quantify the saving as about 6.5%, with no loss in precision, and suggest the adoption of this technique.
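The storage choice described in the abstract above can be sketched roughly as follows. This is a toy in-memory model, not the paper's implementation: the names `build_index` and `lookup`, the tuple layout of vocabulary entries, and the `inline_limit` parameter are all assumptions made for illustration.

```python
def build_index(postings, inline_limit=2):
    """Build a toy vocabulary and postings file.

    Terms occurring in at most `inline_limit` documents keep their postings
    inside the vocabulary entry; longer lists go to the postings file and
    the vocabulary stores only an (offset, length) pointer.
    """
    vocabulary = {}
    postings_file = []
    for term, docs in postings.items():
        if len(docs) <= inline_limit:
            vocabulary[term] = ("inline", tuple(docs))
        else:
            vocabulary[term] = ("offset", len(postings_file), len(docs))
            postings_file.extend(docs)
    return vocabulary, postings_file


def lookup(term, vocabulary, postings_file):
    """Return the document ids for `term`, or an empty list if unseen."""
    entry = vocabulary.get(term)
    if entry is None:
        return []
    if entry[0] == "inline":
        return list(entry[1])
    _, start, length = entry
    return postings_file[start:start + length]


vocab, pfile = build_index({"rare": [7], "pair": [2, 9], "common": [1, 2, 3, 4]})
```

In this sketch, only the list for "common" occupies space in the postings file; the one- and two-document terms are answered directly from the vocabulary without a second lookup.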

international conference on the theory of information retrieval | 2017

An Exploration of Serverless Architectures for Information Retrieval

Matt Crane; Jimmy J. Lin

Serverless architectures represent a new approach to designing applications in the cloud without having to explicitly provision or manage servers. The developer specifies functions with well-defined entry and exit points, and the cloud provider handles all other aspects of execution. In this paper, we explore a novel application of serverless architectures to information retrieval and describe a search engine built in this manner with Amazon Web Services: postings lists are stored in the DynamoDB NoSQL store and the postings traversal algorithm for query evaluation is implemented in the Lambda service. The result is a search engine that scales elastically with a pay-per-request model, in contrast to a server-based model that requires paying for running instances even if there are no requests. We empirically assess the performance and economics of our serverless architecture. While our implementation is currently too slow for interactive searching, analysis shows that the pay-per-request model is economically compelling, and future infrastructure improvements will increase the attractiveness of serverless designs over time.

australasian document computing symposium | 2015

Improving Throughput of a Pipeline Model Indexer

Matt Crane; Andrew Trotman; David M. Eyers

There are many competing models for the indexing process of an information retrieval system, one of which is a pipeline-based model. Information retrieval is also an inherently parallel process: indexing one document is independent of indexing another. A pipeline model allows for easy experimentation on the parallelism within an indexer. In this paper we investigate areas within a pipeline where indexing throughput can be increased, as well as ways to exploit the inherent parallelism of indexing.

australasian document computing symposium | 2015

Collision Resolution in Hash Tables for Vocabulary Accumulation During Parallel Indexing

Matt Crane; Andrew Trotman

During indexing, the vocabulary of a collection needs to be built. The structure used for this needs to account for the skewed distribution of terms. Parallel indexing allows for a large reduction in the number of times the global vocabulary needs to be examined; however, this also raises a new set of challenges. In this paper we examine the structures used to resolve collisions in a hash table during parallel indexing, and find that the best structure is different from those suggested previously.