Jamie Callan | Researchain

Publication

Featured researches published by Jamie Callan.

Archive | 2002

Distributed Information Retrieval

Jamie Callan

A multi-database model of distributed information retrieval is presented, in which people are assumed to have access to many searchable text databases. In such an environment, full-text information retrieval consists of discovering database contents, ranking databases by their expected ability to satisfy the query, searching a small number of databases, and merging results returned by different databases. This paper presents algorithms for each task. It also discusses how to reorganize conventional test collections into multi-database testbeds, and evaluation methodologies for multi-database experiments. A broad and diverse group of experimental results is presented to demonstrate that the algorithms are effective, efficient, robust, and scalable.
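The four steps named in the abstract (describing database contents, ranking databases, searching a few of them, and merging results) can be illustrated with a toy pipeline. This is a minimal sketch under stated assumptions, not the algorithms from the paper: the term-count database score, the keyword-count document score, and the weighting used for merging are simple stand-ins chosen for readability, and all function names and data are hypothetical.

```python
from collections import Counter

def summarize(docs):
    """Describe a database by its term frequencies (stand-in for content discovery)."""
    return Counter(w for text in docs.values() for w in text.split())

def rank_databases(query, summaries):
    """Score each database by how often the query terms occur in its summary."""
    scores = {name: sum(tf[t] for t in query) for name, tf in summaries.items()}
    order = sorted(scores, key=scores.get, reverse=True)
    return order, scores

def search(query, docs):
    """Toy per-database search: score = number of query-term occurrences."""
    hits = []
    for doc_id, text in docs.items():
        words = text.split()
        score = sum(words.count(t) for t in query)
        if score > 0:
            hits.append((score, doc_id))
    return hits

def merge(results_by_db, db_scores):
    """Weight each document score by its database's relative score, then sort."""
    top = max(db_scores.values()) or 1
    merged = [(score * db_scores[db] / top, doc_id)
              for db, hits in results_by_db.items() for score, doc_id in hits]
    return sorted(merged, reverse=True)

# Hypothetical data: three small text databases.
databases = {
    "news": {"n1": "election results and election polls", "n2": "weather report"},
    "sports": {"s1": "football results", "s2": "tennis open"},
    "recipes": {"r1": "apple pie"},
}
query = ["election", "results"]

summaries = {name: summarize(docs) for name, docs in databases.items()}
ranked, db_scores = rank_databases(query, summaries)
selected = ranked[:2]  # search only a small number of databases
results = {db: search(query, databases[db]) for db in selected}
merged = merge(results, db_scores)
print(ranked, [doc_id for _, doc_id in merged])
```

Weighting document scores by the database score is one simple way to make scores from different databases comparable; without it, each database's best hit would look equally good.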

conference on information and knowledge management | 2005

Query expansion using random walk models

Kevyn Collins-Thompson; Jamie Callan

It has long been recognized that capturing term relationships is an important aspect of information retrieval. Even with large amounts of data, we usually only have significant evidence for a fraction of all potential term pairs. It is therefore important to consider whether multiple sources of evidence may be combined to predict term relations more accurately. This is particularly important when trying to predict the probability of relevance of a set of terms given a query, which may involve both lexical and semantic relations between the terms. We describe a Markov chain framework that combines multiple sources of knowledge on term associations. The stationary distribution of the model is used to obtain probability estimates that a potential expansion term reflects aspects of the original query. We use this model for query expansion, evaluate its effectiveness by examining the accuracy and robustness of the expansion methods, and investigate the relative effectiveness of various sources of term evidence. Statistically significant differences in accuracy were observed depending on the weighting of evidence in the random walk. For example, using co-occurrence data later in the walk was generally better than using it early, suggesting further improvements in effectiveness may be possible by learning walk behaviors.
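The idea of combining several evidence sources in one Markov chain and reading expansion weights off its stationary distribution can be sketched as follows. This is an illustrative simplification, not the model from the paper: it assumes a single fixed mixture of two hypothetical transition matrices (co-occurrence and synonymy) and a restart to the query term, whereas the abstract indicates that the weighting of evidence can vary over the course of the walk. The terms, matrices, and weights are made up.

```python
def mix_transitions(matrices, weights):
    """Convex combination of row-stochastic matrices, one per evidence source."""
    n = len(matrices[0])
    total = sum(weights)
    return [[sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
             for j in range(n)] for i in range(n)]

def stationary_with_restart(P, start, restart=0.2, iters=200):
    """Power iteration for a walk that jumps back to `start` with probability `restart`."""
    n = len(P)
    pi = start[:]
    for _ in range(iters):
        step = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        pi = [restart * start[j] + (1 - restart) * step[j] for j in range(n)]
    return pi

# Hypothetical vocabulary and evidence; each row sums to 1.
terms = ["jaguar", "car", "cat", "engine"]
cooccurrence = [
    [0.0, 0.5, 0.5, 0.0],
    [0.3, 0.0, 0.0, 0.7],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
synonymy = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]

P = mix_transitions([cooccurrence, synonymy], weights=[0.7, 0.3])
query_start = [1.0, 0.0, 0.0, 0.0]  # the walk starts at (and restarts to) "jaguar"
pi = stationary_with_restart(P, query_start)

# Terms with more stationary mass are stronger expansion candidates.
for term, prob in sorted(zip(terms, pi), key=lambda tp: tp[1], reverse=True):
    print(f"{term}: {prob:.3f}")
```

Changing `weights` shifts how much each evidence source influences the ranking, which is the kind of sensitivity the abstract reports.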

international acm sigir conference on research and development in information retrieval | 2009

Sources of evidence for vertical selection

Jaime Arguello; Fernando Diaz; Jamie Callan; Jean Francois Crespo

Web search providers often include search services for domain-specific subcollections, called verticals, such as news, images, videos, job postings, company summaries, and artist profiles. We address the problem of vertical selection, predicting relevant verticals (if any) for queries issued to the search engine's main web search page. In contrast to prior query classification and resource selection tasks, vertical selection is associated with unique resources that can inform the classification decision. We focus on three sources of evidence: (1) the query string, from which features are derived independent of external resources, (2) logs of queries previously issued directly to the vertical, and (3) corpora representative of vertical content. We focus on 18 different verticals, which differ in terms of semantics, media type, size, and level of query traffic. We compare our method to prior work in federated search and retrieval effectiveness prediction. An in-depth error analysis reveals unique challenges across different verticals and provides insight into vertical selection for future work.

international acm sigir conference on research and development in information retrieval | 2008

Retrieval and feedback models for blog feed search

Jonathan L. Elsas; Jaime Arguello; Jamie Callan; Jaime G. Carbonell

Blog feed search poses different and interesting challenges from traditional ad hoc document retrieval. The units of retrieval, the blogs, are collections of documents, the blog posts. In this work we adapt a state-of-the-art federated search model to the feed retrieval task, showing a significant improvement over algorithms based on the best performing submissions in the TREC 2007 Blog Distillation task[12]. We also show that typical query expansion techniques such as pseudo-relevance feedback using the blog corpus do not provide any significant performance improvement and in many cases dramatically hurt performance. We perform an in-depth analysis of the behavior of pseudo-relevance feedback for this task and develop a novel query expansion technique using the link structure in Wikipedia. This query expansion technique provides significant and consistent performance improvements for this task, yielding a 22% and 14% improvement in MAP over the unexpanded query for our baseline and federated algorithms respectively.

International Journal on Digital Libraries | 2005

Personalisation and recommender systems in digital libraries

Alan F. Smeaton; Jamie Callan

Widespread use of the Internet has resulted in digital libraries that are increasingly used by diverse communities of users for diverse purposes and in which sharing and collaboration have become important social elements. As such libraries become commonplace, as their contents and services become more varied, and as their patrons become more experienced with computer technology, users will expect more sophisticated services from these libraries. A simple search function, normally an integral part of any digital library, increasingly leads to user frustration as user needs become more complex and as the volume of managed information increases. Proactive digital libraries, where the library evolves from being passive and untailored, are seen as offering great potential for addressing and overcoming these issues and include techniques such as personalisation and recommender systems. In this paper, following on from the DELOS/NSF Working Group on Personalisation and Recommender Systems for Digital Libraries, which met and reported during 2003, we present some background material on the scope of personalisation and recommender systems in digital libraries. We then outline the working group’s vision for the evolution of digital libraries and the role that personalisation and recommender systems will play, and we present a series of research challenges and specific recommendations and research priorities for the field.

conference on information and knowledge management | 2006

Incremental hierarchical clustering of text documents

Nachiketa Sahoo; Jamie Callan; Ramayya Krishnan; George T. Duncan; Rema Padman

Incremental hierarchical text document clustering algorithms are important in organizing documents generated from streaming on-line sources, such as newswire and blogs. However, this is a relatively unexplored area in the text document clustering literature. Popular incremental hierarchical clustering algorithms, namely Cobweb and Classit, have not been widely used with text document data. We discuss why, in their current form, these algorithms are not suitable for text clustering and propose an alternative formulation that includes changes to the underlying distributional assumption of the algorithm in order to conform with the data. Both the original Classit algorithm and our proposed algorithm are evaluated using Reuters newswire articles and the Ohsumed dataset.

european conference on information retrieval | 2005

Federated search of text-based digital libraries in hierarchical peer-to-peer networks

Jie Lu; Jamie Callan

Peer-to-peer architectures are a potentially powerful model for developing large-scale networks of text-based digital libraries, but peer-to-peer networks have so far provided very limited support for text-based federated search of digital libraries using relevance-based ranking. This paper addresses the problems of resource representation, resource ranking and selection, and result merging for federated search of text-based digital libraries in hierarchical peer-to-peer networks. Existing approaches to text-based federated search are adapted and new methods are developed for resource representation and resource selection according to the unique characteristics of hierarchical peer-to-peer networks. Experimental results demonstrate that the proposed approaches offer a better combination of accuracy and efficiency than more common alternatives for federated search in peer-to-peer networks.

digital government research | 2006

Automatically labeling hierarchical clusters

Pucktada Treeratpituk; Jamie Callan

Government agencies must often quickly organize and analyze large amounts of textual information, for example comments received as part of notice and comment rulemaking. Hierarchical organization is popular because it represents information at different levels of detail and is convenient for interactive browsing. Good hierarchical clustering algorithms are available, but there are few good solutions for automatically labeling the nodes in a cluster hierarchy. This paper presents a simple algorithm that automatically assigns labels to hierarchical clusters. The algorithm evaluates candidate labels using information from the cluster, the parent cluster, and corpus statistics. A trainable threshold enables the algorithm to assign just a few high-quality labels to each cluster. Experiments with Open Directory Project (ODP) hierarchies indicate that the algorithm creates cluster labels that are similar to labels created by ODP editors.

international acm sigir conference on research and development in information retrieval | 2007

Estimation and use of uncertainty in pseudo-relevance feedback

Kevyn Collins-Thompson; Jamie Callan

Existing pseudo-relevance feedback methods typically perform averaging over the top-retrieved documents, but ignore an important statistical dimension: the risk or variance associated with either the individual document models, or their combination. Treating the baseline feedback method as a black box, and the output feedback model as a random variable, we estimate a posterior distribution for the feedback model by resampling a given query's top-retrieved documents, using the posterior mean or mode as the enhanced feedback model. We then perform model combination over several enhanced models, each based on a slightly modified query sampled from the original query. We find that resampling documents helps increase individual feedback model precision by removing noise terms, while sampling from the query improves robustness (worst-case performance) by emphasizing terms related to multiple query aspects. The result is a meta-feedback algorithm that is both more robust and more precise than the original strong baseline method.
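The document-resampling step described above can be sketched with a plain bootstrap. This is a minimal illustration under stated assumptions, not the authors' implementation: the black-box baseline here is a hypothetical pooled term-frequency model, documents are resampled uniformly with replacement, and only the posterior mean is computed. The query-sampling and model-combination stages are omitted, and the documents are made up.

```python
import random
from collections import Counter

def feedback_model(docs):
    """Stand-in black-box baseline: relative term frequency over the given documents."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def resampled_feedback(top_docs, n_samples=200, seed=0):
    """Bootstrap the top-retrieved documents and average the resulting feedback models."""
    rng = random.Random(seed)
    sums = Counter()
    for _ in range(n_samples):
        # Draw as many documents as were retrieved, with replacement.
        sample = [rng.choice(top_docs) for _ in top_docs]
        for term, prob in feedback_model(sample).items():
            sums[term] += prob
    # Posterior mean: terms absent from a sample contribute 0 for that sample.
    return {term: s / n_samples for term, s in sums.items()}

# Hypothetical top-retrieved documents, already tokenized.
top_docs = [
    ["federated", "search", "retrieval"],
    ["retrieval", "feedback"],
    ["query", "retrieval", "expansion"],
    ["advert", "click"],
]

model = resampled_feedback(top_docs)
for term, prob in sorted(model.items(), key=lambda tp: tp[1], reverse=True):
    print(f"{term}: {prob:.3f}")
```

Keeping the per-sample models instead of only their sum would also give a per-term variance, which is the kind of uncertainty information the abstract refers to.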

international joint conference on natural language processing | 2009

A Metric-based Framework for Automatic Taxonomy Induction

Hui Yang; Jamie Callan

This paper presents a novel metric-based framework for the task of automatic taxonomy induction. The framework incrementally clusters terms based on an ontology metric, a score indicating semantic distance, and transforms the task into a multi-criteria optimization based on minimization of taxonomy structures and modeling of term abstractness. It combines the strengths of both lexico-syntactic patterns and clustering by incorporating heterogeneous features. The flexible design of the framework allows further study of which features are best for the task under various conditions. The experiments not only show that our system achieves a higher F1-measure than other state-of-the-art systems, but also reveal the interaction between features and various types of relations, as well as the interaction between features and term abstractness.