James C. French
University of Virginia
Publication
Featured research published by James C. French.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 1999
James C. French; Allison L. Powell; James P. Callan; Charles L. Viles; Travis Emmitt; Kevin J. Prey; Yun Mou
We compare the performance of two database selection algorithms reported in the literature, using a common testbed designed specifically for database selection techniques. The testbed is a decomposition of the TREC/TIPSTER data into 236 subcollections. We present results of a recent investigation of the CORI algorithm and compare them with earlier work that examined gGlOSS. The databases from our testbed were ranked using both the gGlOSS and CORI techniques and compared to the RBR baseline, a baseline derived from TREC relevance judgements, and we examined the degree to which CORI and gGlOSS approximate this baseline. Our results confirm our earlier observation that the gGlOSS Ideal(l) ranks do not estimate relevance-based ranks well. We also find that CORI is a uniformly better estimator of relevance-based ranks than gGlOSS for the test environment used in this study. Part of the advantage of the CORI algorithm can be explained by a strong correlation between gGlOSS and a size-based baseline (SBR). We also find that CORI produces consistently accurate rankings on testbeds ranging from 100 to 921 sites. However, for a given level of recall, search effort appears to scale linearly with the number of databases.
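As a rough illustration of how such a selection algorithm scores collections, the sketch below implements a CORI-style ranking function in Python. The constants (50, 150, default belief 0.4) follow the commonly published CORI defaults, and the data structures are hypothetical; this is a minimal sketch under those assumptions, not the implementation evaluated in the paper.

import math

def cori_score(query_terms, df, cw, cf, num_colls, avg_cw, d_b=0.4):
    """CORI-style belief score of one collection for a query.

    df        : dict, term -> number of documents in this collection containing the term
    cw        : number of terms in this collection
    cf        : dict, term -> number of collections containing the term
    num_colls : total number of collections
    avg_cw    : mean collection size (in terms) across all collections
    """
    beliefs = []
    for t in query_terms:
        T = df.get(t, 0) / (df.get(t, 0) + 50 + 150 * cw / avg_cw)
        I = math.log((num_colls + 0.5) / max(cf.get(t, 0), 1)) / math.log(num_colls + 1.0)
        beliefs.append(d_b + (1 - d_b) * T * I)
    return sum(beliefs) / len(beliefs) if beliefs else 0.0

# Rank two hypothetical collections for the query "database selection".
colls = {
    "c1": {"df": {"database": 40, "selection": 5}, "cw": 10_000},
    "c2": {"df": {"database": 3, "selection": 1}, "cw": 2_000},
}
cf = {"database": 2, "selection": 2}
avg_cw = sum(c["cw"] for c in colls.values()) / len(colls)
ranking = sorted(
    colls,
    key=lambda c: cori_score(["database", "selection"], colls[c]["df"],
                             colls[c]["cw"], cf, len(colls), avg_cw),
    reverse=True,
)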
International Conference on Data Engineering | 1999
Venkatesh Ganti; Raghu Ramakrishnan; Johannes Gehrke; Allison L. Powell; James C. French
Clustering partitions a collection of objects into groups called clusters, such that similar objects fall into the same group. Similarity between objects is defined by a distance function satisfying the triangle inequality; this distance function along with the collection of objects describes a distance space. In a distance space, the only operation possible on data objects is the computation of distance between them. All scalable algorithms in the literature assume a special type of distance space, namely a k-dimensional vector space, which allows vector operations on objects. We present two scalable algorithms designed for clustering very large datasets in distance spaces. Our first algorithm BUBBLE is, to our knowledge, the first scalable clustering algorithm for data in a distance space. Our second algorithm BUBBLE-FM improves upon BUBBLE by reducing the number of calls to the distance function, which may be computationally very expensive. Both algorithms make only a single scan over the database while producing high clustering quality. In a detailed experimental evaluation, we study both algorithms in terms of scalability and quality of clustering. We also show results of applying the algorithms to a real life dataset.
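To make the distance-space constraint concrete, the following Python sketch clusters objects in a single scan while touching them only through a black-box distance function. It is a simple leader-style illustration of clustering under that constraint, not the BUBBLE or BUBBLE-FM algorithm; the string data and threshold are hypothetical.

import difflib

def string_distance(a, b):
    # A non-coordinate, non-Euclidean distance on strings (1 - similarity ratio).
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def leader_cluster(objects, dist, threshold):
    """One pass over the data; each object joins the first cluster whose
    representative is within `threshold`, otherwise it starts a new cluster."""
    reps, clusters = [], []
    for obj in objects:
        for i, rep in enumerate(reps):
            if dist(obj, rep) <= threshold:   # only distance computations are used
                clusters[i].append(obj)
                break
        else:
            reps.append(obj)
            clusters.append([obj])
    return clusters

clusters = leader_cluster(["query", "queries", "cluster", "clusters"],
                          string_distance, threshold=0.3)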
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2000
Allison L. Powell; James C. French; James P. Callan; Margaret E. Connell; Charles L. Viles
The proliferation of online information resources increases the importance of effective and efficient distributed searching. Distributed searching is cast in three parts — database selection, query processing, and results merging. In this paper we examine the effect of database selection on retrieval performance. We look at retrieval performance in three different distributed retrieval testbeds and distill some general results. First we find that good database selection can result in better retrieval effectiveness than can be achieved in a centralized database. Second we find that good performance can be achieved when only a few sites are selected and that the performance generally increases as more sites are selected. Finally we find that when database selection is employed, it is not necessary to maintain collection wide information (CWI), e.g. global idf. Local information can be used to achieve superior performance. This means that distributed systems can be engineered with more autonomy and less cooperation. This work suggests that improvements in database selection can lead to broader improvements in retrieval performance, even in centralized (i.e. single database) systems. Given a centralized database and a good selection mechanism, retrieval performance can be improved by decomposing that database conceptually and employing a selection step.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 1998
James C. French; Allison L. Powell; Charles L. Viles; Travis Emmitt; Kevin J. Prey
We describe a testbed for database selection techniques and an experiment conducted using this testbed. The testbed is a decomposition of the TREC/TIPSTER data that allows analysis of the data along multiple dimensions, including collection-based and temporal-based analysis. We characterize the subcollections in this testbed in terms of number of documents, queries against which the documents have been evaluated for relevance, and distribution of relevant documents. We then present initial results from a study conducted using this testbed that examines the effectiveness of the gGlOSS approach to database selection. The databases from our testbed were ranked using the gGlOSS techniques and compared to the gGlOSS Ideal(l) baseline and a baseline derived from TREC relevance judgements. We have examined the degree to which several gGlOSS estimate functions approximate these baselines. Our initial results confirm that the gGlOSS estimators are excellent predictors of the Ideal(l) ranks but that the Ideal(l) ranks do not estimate relevance-based ranks well.
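For context, the Ideal(l) baseline ranks databases by the total similarity mass of their documents whose similarity to the query exceeds a threshold l; the gGlOSS estimators try to approximate this from summary statistics. Below is a minimal Python sketch of the Ideal(l) goodness and ranking, assuming the per-document similarities are known; the data is hypothetical and the estimator side is omitted.

def ideal_goodness(similarities, l):
    # Goodness(l, q, db): sum of sim(q, d) over documents d in db with sim(q, d) > l.
    return sum(s for s in similarities if s > l)

def ideal_rank(db_sims, l):
    # Rank databases by decreasing Ideal(l) goodness for one query.
    return sorted(db_sims, key=lambda db: ideal_goodness(db_sims[db], l), reverse=True)

# Hypothetical per-database similarity lists for a single query.
db_sims = {"db1": [0.9, 0.4, 0.1], "db2": [0.5, 0.5, 0.5], "db3": [0.2, 0.1]}
ranking = ideal_rank(db_sims, l=0.3)   # db2 (1.5) > db1 (1.3) > db3 (0.0)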
ACM Transactions on Information Systems | 2003
Allison L. Powell; James C. French
The proliferation of online information resources increases the importance of effective and efficient information retrieval in a multicollection environment. Multicollection searching is cast in three parts: collection selection (also referred to as database selection), query processing and results merging. In this work, we focus our attention on the evaluation of the first step, collection selection. In this article, we present a detailed discussion of the methodology that we used to evaluate and compare collection selection approaches, covering both test environments and evaluation measures. We compare the CORI, CVV and gGlOSS collection selection approaches using six test environments utilizing three document testbeds. We note similar trends in performance among the collection selection approaches, but the CORI approach consistently outperforms the other approaches, suggesting that effective collection selection can be achieved using limited information about each collection. The contributions of this work are both the assembled evaluation methodology as well as the application of that methodology to compare collection selection approaches in a standardized environment.
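One evaluation measure commonly used in this line of work compares the estimated collection ranking against a relevance-based baseline ranking by asking how much of the available relevant material the top n selected collections cover. The Python sketch below shows that recall-style measure; the merit counts are hypothetical, and this is an illustration rather than the full set of measures discussed in the article.

def recall_at_n(estimated, baseline, merit, n):
    """R_n: relevant documents covered by the top n collections of the estimated
    ranking, divided by those covered by the top n of the baseline ranking."""
    top_est = sum(merit[c] for c in estimated[:n])
    top_base = sum(merit[c] for c in baseline[:n])
    return top_est / top_base if top_base else 0.0

# Hypothetical example: merit[c] = number of relevant documents in collection c.
merit = {"A": 10, "B": 7, "C": 2, "D": 0}
baseline = ["A", "B", "C", "D"]          # collections ranked by true merit
estimated = ["B", "A", "D", "C"]         # ranking produced by a selection algorithm
r2 = recall_at_n(estimated, baseline, merit, 2)   # (7 + 10) / (10 + 7) = 1.0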
International ACM SIGIR Conference on Research and Development in Information Retrieval | 1995
Charles L. Viles; James C. French
We find that dissemination of collection wide information (CWI) in a distributed collection of documents is needed to achieve retrieval effectiveness comparable to a centralized collection. Complete dissemination is unnecessary. The required dissemination level depends upon how documents are allocated among sites. Low dissemination is needed for random document allocation, but higher levels are needed when documents are allocated based on content. We define parameters to control dissemination and document allocation and present results from four test collections. We define the notion of iso-knowledge lines with respect to the number of sites and level of dissemination in the distributed archive, and show empirically that iso-knowledge lines are also iso-effectiveness lines when documents are randomly allocated.
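A concrete instance of the collection wide information at issue is the global idf used in term weighting: computing it requires document frequencies aggregated across all sites, while a purely local idf uses only one site's own statistics. The Python sketch below, with hypothetical sites and documents, shows the difference that dissemination is meant to bridge.

import math

def idf(df, n_docs):
    # Inverse document frequency; df = number of documents containing the term.
    return math.log(n_docs / df) if df else 0.0

# Hypothetical distributed archive: each site holds documents modelled as term sets.
sites = [
    [{"query", "ranking"}, {"ranking", "index"}],
    [{"query", "cluster"}, {"cluster", "index"}, {"query", "index"}],
]
term = "query"

# Local idf at site 0: no dissemination, only that site's statistics.
local_idf = idf(sum(term in d for d in sites[0]), len(sites[0]))

# Global idf: requires disseminating document frequencies and sizes across sites (CWI).
global_df = sum(term in d for site in sites for d in site)
global_idf = idf(global_df, sum(len(s) for s in sites))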
Storage and Retrieval for Image and Video Databases | 1996
Julio E. Barros; James C. French; Worthy N. Martin; Patrick M. Kelly; T. Michael Cannon
Dissimilarity measures, the basis of similarity-based retrieval, can be viewed as a distance and a similarity-based search as a nearest neighbor search. Though there has been extensive research on data structures and search methods to support nearest-neighbor searching, these indexing and dimension-reduction methods are generally not applicable to non-coordinate data and non-Euclidean distance measures. In this paper we reexamine and extend previous work of other researchers on best match searching based on the triangle inequality. These methods can be used to organize both non-coordinate data and non-Euclidean metric similarity measures. The effectiveness of the indexes depends on the actual dimensionality of the feature set, data, and similarity metric used. We show that these methods provide significant performance improvements and may be of practical value in real-world databases.
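The core pruning idea fits in a few lines: if distances from every database object to a handful of reference objects are precomputed, the triangle inequality gives a lower bound |d(q,p) - d(x,p)| <= d(q,x), so many candidates can be discarded without evaluating the real distance function. The Python sketch below is a minimal linear-scan version of that idea with an arbitrary choice of reference points; it is an illustration, not the specific index structures studied in the paper.

def best_match(query, objects, pivots, dist, pivot_dists):
    """Nearest-neighbour search with triangle-inequality pruning.

    pivot_dists[i][j] = dist(objects[i], pivots[j]), precomputed offline.
    """
    q_to_p = [dist(query, p) for p in pivots]
    best, best_d = None, float("inf")
    for i, x in enumerate(objects):
        # Lower bound on dist(query, x) from the triangle inequality.
        lb = max(abs(qp - xp) for qp, xp in zip(q_to_p, pivot_dists[i]))
        if lb >= best_d:
            continue                        # pruned without an expensive distance call
        d = dist(query, x)
        if d < best_d:
            best, best_d = x, d
    return best, best_d

How much is pruned depends on how tight the bounds are, which in turn depends on the intrinsic dimensionality of the data, the similarity metric, and the choice of reference points.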
Proceedings First International Conference on WEB Delivering of Music. WEDELMUSIC 2001 | 2001
David B. Hauver; James C. French
In recent years, the popularity of online radio has exploded. This new entertainment medium affords an opportunity not available to conventional broadcast radio: the instantaneous listening audience can be known, or what is more important, the musical tastes of the current listening audience can be known. Thus, it is possible in the new medium to tailor the playlist in real-time to the musical tastes of the listening audience. We discuss a method, termed flycasting, for using collaborative filtering techniques to generate a playlist in real-time based on the request histories of the current listening audience. We also describe a concrete implementation of the technique.
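As a rough sketch of how request histories could drive such a playlist, the Python below scores candidate songs for the current audience with a simple user-based collaborative filter: songs requested by historical listeners who resemble the current listeners score higher. The similarity measure, data, and scoring rule are illustrative assumptions, not the exact scheme described in the paper.

def jaccard(a, b):
    # Overlap between two listeners' request histories (sets of song ids).
    return len(a & b) / len(a | b) if a | b else 0.0

def next_song(audience, history, already_played):
    """audience: request histories of current listeners; history: request
    histories of past listeners; returns the highest-scoring unplayed song."""
    scores = {}
    for current in audience:
        for past in history:
            w = jaccard(current, past)
            for song in past - current - already_played:
                scores[song] = scores.get(song, 0.0) + w
    return max(scores, key=scores.get) if scores else None

song = next_song(
    audience=[{"s1", "s2"}, {"s2", "s3"}],
    history=[{"s1", "s2", "s4"}, {"s3", "s5"}],
    already_played={"s2"},
)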
Conference on Object-Oriented Programming, Systems, Languages, and Applications | 1994
John F. Karpovich; Andrew S. Grimshaw; James C. French
Scientific applications often manipulate very large sets of persistent data. Over the past decade, advances in disk storage device performance have consistently been outpaced by advances in the performance of the rest of the computer system. As a result, many scientific applications have become I/O-bound, i.e. their run-times are dominated by the time spent performing I/O operations. Consequently, the performance of I/O operations has become critical for high performance in these applications. The ELFS approach is designed to address the issue of high performance I/O by treating files as typed objects. Typed file objects can exploit knowledge about the file structure and type of data. Typed file objects can selectively apply techniques such as prefetching, parallel asynchronous file access, and caching to improve performance. Also, by typing objects, the interface to the user can be improved in two ways. First, the interface can be made easier to use by presenting file operations in a more natural manner to the user. Second, the interface can allow the user to provide an “oracle” about access patterns that the file object can use to improve performance. By combining these concepts with the object-oriented paradigm, the goal of the ELFS methodology is to create flexible, extensible file classes that are easy to use while achieving high performance. In this paper we present the ELFS approach and our experiences with the design and implementation of two file classes: a two dimensional dense matrix file class and a multidimensional range searching file class.
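As a toy illustration of the typed-file-object idea (not the ELFS classes themselves, which are written in C++ and considerably richer), the Python sketch below models a dense matrix file that accepts an access-pattern hint and prefetches the next row in the background when row-major access is declared. All names and the threading-based prefetch are illustrative assumptions.

import threading

class MatrixFile:
    """Toy typed file object: knows it stores a dense 2-D matrix of rows."""

    def __init__(self, path, n_rows, read_row):
        self.path, self.n_rows = path, n_rows
        self._read_row = read_row          # function(path, i) -> row data
        self._cache = {}
        self._pattern = None

    def hint(self, pattern):
        # The user's "oracle" about the upcoming access pattern, e.g. "row-major".
        self._pattern = pattern

    def _fetch(self, i):
        if i not in self._cache:
            self._cache[i] = self._read_row(self.path, i)

    def row(self, i):
        self._fetch(i)
        if self._pattern == "row-major" and i + 1 < self.n_rows:
            # Prefetch the next row asynchronously while the caller works on row i.
            threading.Thread(target=self._fetch, args=(i + 1,), daemon=True).start()
        return self._cache[i]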
Multimedia Tools and Applications | 2005
Xiangyu Jin; James C. French
Conventional approaches to image retrieval are based on the assumption that relevant images are physically near the query image in some feature space. This is the basis of the cluster hypothesis. However, semantically related images are often scattered across several visual clusters. Although traditional Content-based Image Retrieval (CBIR) technologies may utilize the information contained in multiple queries (obtained in one step or through a feedback process), this is often only a reformulation of the original query. As a result, most of these strategies retrieve only the images in some neighborhood of the original query. This severely restricts the system performance. Relevance feedback techniques are generally used to mitigate this problem. In this paper, we present a novel approach to relevance feedback which can return semantically related images in different visual clusters by merging the result sets of multiple queries. We also provide experimental results to demonstrate the effectiveness of our approach.
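A minimal sketch of what merging the result sets of multiple queries can look like: each query point retrieves its own neighborhood, and an image's final score is its best similarity to any query point, so relevant images from different visual clusters can all surface. The max-combination rule and the data below are illustrative assumptions, not necessarily the combination used in the paper.

def merge_results(result_sets):
    """result_sets: one dict per query point, mapping image id -> similarity.
    Returns image ids ranked by their best similarity to any query point."""
    merged = {}
    for results in result_sets:
        for img, sim in results.items():
            merged[img] = max(merged.get(img, 0.0), sim)
    return sorted(merged, key=merged.get, reverse=True)

# Two query points selected by the user during feedback (hypothetical scores).
ranked = merge_results([
    {"img1": 0.9, "img2": 0.2, "img5": 0.4},
    {"img3": 0.8, "img2": 0.6},
])
# ranked == ["img1", "img3", "img2", "img5"]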