David A. Grossman
Illinois Institute of Technology
Publications
Featured research published by David A. Grossman.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2004
Steven M. Beitzel; Eric C. Jensen; Abdur Chowdhury; David A. Grossman; Ophir Frieder
We review a query log of hundreds of millions of queries that constitute the total query traffic for an entire week of a general-purpose commercial web search service. Previously, query logs have been studied from a single, cumulative view. In contrast, our analysis shows changes in popularity and uniqueness of topically categorized queries across the hours of the day. We examine query traffic on an hourly basis by matching it against lists of queries that have been topically pre-categorized by human editors. This represents 13% of the query traffic. We show that query traffic from particular topical categories differs both from the query stream as a whole and from other categories. This analysis provides valuable insight for improving retrieval effectiveness and efficiency. It is also relevant to the development of enhanced query disambiguation, routing, and caching algorithms.
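A minimal sketch of this style of analysis, assuming a log of (timestamp, query) pairs and hand-built category lists; the category data and names below are illustrative, not the authors' code:

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical pre-categorized query lists (in the paper these come from human editors).
CATEGORY_LISTS = {
    "music": {"mp3 downloads", "song lyrics"},
    "health": {"flu symptoms", "blood pressure"},
}

def hourly_category_counts(log):
    """Count, per hour of day, how often each topical category appears.

    `log` is an iterable of (timestamp, query_string) pairs. Queries that
    match no category list are ignored, mirroring the partial (~13%)
    coverage of manual categorization reported in the paper.
    """
    counts = defaultdict(Counter)  # hour of day -> Counter of categories
    for ts, query in log:
        q = query.strip().lower()
        for category, queries in CATEGORY_LISTS.items():
            if q in queries:
                counts[ts.hour][category] += 1
    return counts

log = [(datetime(2004, 5, 3, 8, 15), "flu symptoms"),
       (datetime(2004, 5, 3, 23, 40), "song lyrics")]
for hour, ctr in sorted(hourly_category_counts(log).items()):
    print(hour, dict(ctr))
```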
ACM Transactions on Information Systems | 2002
Abdur Chowdhury; Ophir Frieder; David A. Grossman; Mary Catherine McCabe
We present a new algorithm for duplicate document detection that uses collection statistics. We compare our approach with the state-of-the-art approach using multiple collections. These collections include a 30 MB, 18,577-document web collection developed by Excite@Home and three NIST collections. The first NIST collection consists of 100 MB of 18,232 LA Times documents, which is roughly similar in the number of documents to the Excite@Home collection. The other two collections are both 2 GB: a 247,491-document web collection and the 528,023-document collection from TREC disks 4 and 5. We show that our approach, called I-Match, scales in terms of the number of documents and works well for documents of all sizes. We compared our solution to the state of the art and found that, in addition to improved accuracy of detection, our approach executed in roughly one-fifth the time.
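The published description of I-Match suggests the following shape: filter each document's terms by a collection statistic such as idf, then hash the surviving term set into a single fingerprint so that near-duplicates collide. A simplified reconstruction, not the authors' implementation; the idf thresholds are placeholders to tune:

```python
import hashlib
import math
from collections import Counter

def idf_table(docs):
    """Compute idf for every term across the collection."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc.split()))
    return {t: math.log(n / df[t]) for t in df}

def i_match_signature(doc, idf, lo=0.2, hi=1.0):
    """Fingerprint a document from its mid-idf terms.

    Very common terms (low idf) and very rare terms (high idf, often
    noise) are dropped; the survivors are sorted and hashed, so two
    near-duplicate documents map to the same signature.
    """
    kept = sorted({t for t in doc.split() if lo <= idf.get(t, 0.0) <= hi})
    return hashlib.sha1(" ".join(kept).encode()).hexdigest()

docs = ["the quick brown fox", "the quick brown fox jumps",
        "an unrelated text entirely"]
idf = idf_table(docs)
sigs = [i_match_signature(d, idf) for d in docs]
print(sigs[0] == sigs[1], sigs[0] == sigs[2])  # True False: near-duplicates collide
```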
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2006
David Lewis; Gady Agam; Shlomo Argamon; Ophir Frieder; David A. Grossman; Jefferson Heard
Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation both of end-to-end complex document information processing (CDIP) tasks (e.g., text retrieval and data mining) and of component technologies such as optical character recognition (OCR), document structure analysis, signature matching, and authorship attribution.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2005
Steven M. Beitzel; Eric C. Jensen; Ophir Frieder; David A. Grossman; David Lewis; Abdur Chowdhury; Aleksander Kolcz
Accurate topical categorization of user queries allows for increased effectiveness, efficiency, and revenue potential in general-purpose web search systems. Such categorization becomes critical if the system is to return results not just from a general web collection but from topic-specific databases as well. Maintaining sufficient categorization recall is very difficult as web queries are typically short, yielding few features per query. We examine three approaches to topical categorization of general web queries: matching against a list of manually labeled queries, supervised learning of classifiers, and mining of selectional preference rules from large unlabeled query logs. Each approach has its advantages in tackling the web query classification recall problem, and combining the three techniques allows us to classify a substantially larger proportion of queries than any of the individual techniques. We examine the performance of each approach on a real web query stream and show that our combined method accurately classifies 46% of queries, outperforming the recall of the best single approach by nearly 20%, with a 7% improvement in overall effectiveness.
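One plausible way to combine the three sources, sketched under the assumption that they are applied in decreasing order of precision (exact match first, then the learned classifier, then mined rules); the paper's actual combination strategy may differ:

```python
def classify_query(query, labeled, classifier, rules):
    """Combine three classification sources for a short web query.

    `labeled`    : dict mapping known queries to categories (manual labels)
    `classifier` : callable returning a category or None (supervised model)
    `rules`      : dict mapping query words to categories (selectional
                   preferences mined from large unlabeled query logs)
    """
    q = query.strip().lower()
    if q in labeled:                       # exact match against manual labels
        return labeled[q]
    learned = classifier(q)                # fall back to the learned model
    if learned is not None:
        return learned
    for word in q.split():                 # finally, mined word -> category rules
        if word in rules:
            return rules[word]
    return None

labeled = {"britney spears": "entertainment"}
rules = {"lyrics": "music"}
print(classify_query("britney spears", labeled, lambda q: None, rules))
print(classify_query("imagine lyrics", labeled, lambda q: None, rules))
```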
Journal of the Association for Information Science and Technology | 2004
Steven M. Beitzel; Eric C. Jensen; Abdur Chowdhury; David A. Grossman; Ophir Frieder; Nazli Goharian
Prior efforts have shown that under certain situations retrieval effectiveness may be improved via the use of data fusion techniques. Although these improvements have been observed from the fusion of result sets from several distinct information retrieval systems, it has often been thought that fusing different document retrieval strategies in a single information retrieval system will lead to similar improvements. In this study, we show that this is not the case. We hold constant systemic differences such as parsing, stemming, phrase processing, and relevance feedback, and fuse result sets generated from highly effective retrieval strategies in the same information retrieval system. From this, we show that data fusion of highly effective retrieval strategies alone shows little or no improvement in retrieval effectiveness. Furthermore, we present a detailed analysis of the performance of modern data fusion approaches, and demonstrate the reasons why they do not perform well when applied to this problem. Detailed results and analyses are included to support our conclusions.
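For context, a standard multiple-evidence fusion method of the kind analyzed here is CombMNZ (Fox and Shaw): sum each document's normalized scores and multiply by the number of result sets that retrieved it. A minimal sketch, not the authors' experimental code:

```python
from collections import defaultdict

def comb_mnz(result_sets):
    """CombMNZ fusion over result sets from different retrieval strategies.

    `result_sets` is a list of {doc_id: score} dicts, one per strategy.
    Scores are min-max normalized within each set before fusion.
    """
    fused = defaultdict(float)
    hits = defaultdict(int)
    for scores in result_sets:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        for doc, s in scores.items():
            fused[doc] += (s - lo) / span
            hits[doc] += 1
    # Final score: summed normalized score times the number of sets retrieving the doc.
    return sorted(((fused[d] * hits[d], d) for d in fused), reverse=True)

run_a = {"d1": 12.0, "d2": 7.5, "d3": 3.1}
run_b = {"d2": 0.9, "d4": 0.4}
for score, doc in comb_mnz([run_a, run_b]):
    print(doc, round(score, 3))
```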
Conference on Information and Knowledge Management | 1997
Carol Lundquist; David A. Grossman; Ophir Frieder
Since the use of relevance feedback in information retrieval to improve precision and recall was first proposed in the late 1960's, many different techniques have been used to improve the results obtained from relevance feedback. Since most information retrieval systems performing relevance feedback use combinations of several techniques, the individual contribution of each technique to the overall improvement is relatively unknown. We discuss several techniques to improve relevance feedback, including calibrating the number of top-ranked documents or feedback terms used for relevance feedback, clustering the top-ranked documents, changing the term weighting formula, and scaling the weight of the feedback terms. The impact of each technique on improving precision and recall is investigated using the Tipster document collection. We compare our work to a commonly accepted approach of using 50 words and 20 phrases for relevance feedback and show a 31% improvement in average precision over the commonly accepted approach when 10 feedback terms (either words or phrases) are used. In addition, we have identified a method which shows promise in predicting those queries which benefit from relevance feedback.
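A minimal sketch of the knobs being calibrated, assuming a simple frequency-based choice of feedback terms; the paper's actual weighting formulas are more involved:

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, n_docs=10, n_terms=10, scale=0.5):
    """Expand a query with the most frequent terms from the top-ranked docs.

    The tunable parameters mirror the techniques studied in the paper:
    how many top documents to mine (`n_docs`), how many feedback terms
    to add (`n_terms`), and how to scale their weight (`scale`).
    """
    pool = Counter()
    for doc in ranked_docs[:n_docs]:
        pool.update(t for t in doc.split() if t not in query_terms)
    weights = {t: 1.0 for t in query_terms}   # original terms at full weight
    for term, _ in pool.most_common(n_terms):
        weights[term] = scale                 # feedback terms down-weighted
    return weights

docs = ["retrieval feedback precision", "feedback terms precision recall"]
print(expand_query({"retrieval"}, docs, n_terms=2))
```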
ACM Symposium on Applied Computing | 2003
Steven M. Beitzel; Ophir Frieder; Eric C. Jensen; David A. Grossman; Abdur Chowdhury; Nazli Goharian
Many prior efforts have been devoted to the basic idea that data fusion techniques can improve retrieval effectiveness. Recent work in the area suggests that many approaches, particularly multiple-evidence combinations, can be a successful means of improving the effectiveness of a system. Unfortunately, the conditions favorable to effectiveness improvements have not been made clear. We examine popular data fusion techniques designed to achieve improvements in effectiveness and clarify the conditions required for data fusion to show improvement. We demonstrate that for fusion to improve effectiveness, the result sets being fused must contain a significant number of unique relevant documents. Furthermore, we show that for this improvement to be visible, these unique relevant documents must be highly ranked. In addition, we present a comprehensive discussion on why previous assumptions about the effectiveness of multiple-evidence techniques are misleading. Detailed empirical results and analysis are provided to support our conclusions.
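The diagnostic the paper points to can be sketched directly: for each result list, find the relevant documents that no other list retrieved, and note their ranks. Illustrative code, not the authors' analysis scripts:

```python
def unique_relevant_ranks(result_lists, relevant):
    """For each run, report the ranks of relevant documents that no other
    run retrieved; fusion helps only when such documents exist and sit
    near the top of their lists.
    """
    report = {}
    for i, run in enumerate(result_lists):
        others = set().union(*(set(r) for j, r in enumerate(result_lists) if j != i))
        report[i] = [(rank + 1, doc) for rank, doc in enumerate(run)
                     if doc in relevant and doc not in others]
    return report

run_a = ["d1", "d2", "d5"]   # ranked lists from two retrieval strategies
run_b = ["d2", "d3", "d4"]
print(unique_relevant_ranks([run_a, run_b], relevant={"d1", "d3"}))
```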
Conference on Information and Knowledge Management | 2003
Steven M. Beitzel; Eric C. Jensen; Abdur Chowdhury; David A. Grossman
Evaluation of IR systems has always been difficult because of the need for manually assessed relevance judgments. The advent of large editor-driven taxonomies on the web opens the door to a new evaluation approach. We use the ODP (Open Directory Project) taxonomy to find sets of pseudo-relevant documents via one of two assumptions: 1) taxonomy entries are relevant to a given query if their editor-entered titles exactly match the query, or 2) all entries in a leaf-level taxonomy category are relevant to a given query if the category title exactly matches the query. We compare and contrast these two methodologies by evaluating six web search engines on a sample from an America Online log of ten million web queries, using MRR measures for the first method and precision-based measures for the second. We show that this technique is stable with respect to the query set selected and correlated with a reasonably large manual evaluation.
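A minimal sketch of the MRR computation under the first assumption (a taxonomy entry is pseudo-relevant when its editor-entered title exactly matches the query); the data structures here are illustrative:

```python
def mean_reciprocal_rank(engine_results, pseudo_relevant):
    """Mean reciprocal rank over a query set.

    `engine_results`  : {query: ranked list of doc ids} from a search engine
    `pseudo_relevant` : {query: set of doc ids judged pseudo-relevant}
    Each query contributes 1/rank of its first relevant document, or 0.
    """
    total = 0.0
    for query, ranking in engine_results.items():
        for rank, doc in enumerate(ranking, start=1):
            if doc in pseudo_relevant.get(query, set()):
                total += 1.0 / rank
                break
    return total / len(engine_results)

results = {"jaguar": ["d9", "d2", "d7"], "python": ["d4", "d1"]}
judged = {"jaguar": {"d2"}, "python": {"d1"}}
print(mean_reciprocal_rank(results, judged))   # (1/2 + 1/2) / 2 = 0.5
```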
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2002
Abdur Chowdhury; M. Catherine McCabe; David A. Grossman; Ophir Frieder
Cosine Pivoted Document Length Normalization has reached a point of stability where many researchers indiscriminately apply a specific value of 0.2 regardless of the collection. Our efforts, however, demonstrate that applying this specific value without tuning for the document collection degrades average precision by as much as 20%.
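The pivoted length normalization factor in question is (1 - s) + s * (dl / avgdl), where s is the slope that is routinely fixed at 0.2. A small sketch showing how the factor moves as the slope is tuned:

```python
def pivoted_norm(doc_len, avg_doc_len, slope=0.2):
    """Pivoted length normalization factor: (1 - slope) + slope * (len / avg).

    A document's cosine score is divided by this factor. slope=0.2 is the
    value the paper reports being applied indiscriminately; the paper's
    point is that the slope should be tuned per collection.
    """
    return (1.0 - slope) + slope * (doc_len / avg_doc_len)

for slope in (0.1, 0.2, 0.4):
    print(slope, round(pivoted_norm(doc_len=3000, avg_doc_len=1000, slope=slope), 3))
```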
Conference on Information and Knowledge Management | 2003
Rebecca Cathey; Ling Ma; Nazli Goharian; David A. Grossman
We present a novel approach to detect misuse within an information retrieval system by gathering and maintaining knowledge of the behavior of the user rather than anticipating attacks by unknown assailants. Our approach is based on building and maintaining a profile of the behavior of the system user through tracking, or monitoring, of user activity within the information retrieval system. Any new activity by the user is compared to the user profile to detect potential misuse by the authorized user. We propose four different methods to detect misuse in information retrieval systems. Our experimental results on …
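A minimal sketch of profile-based misuse detection in this spirit, assuming a cosine comparison between a new query and the user's historical term distribution; the threshold and similarity measure are placeholders, not the paper's four methods:

```python
import math
from collections import Counter

class UserProfile:
    """Maintain a term-frequency profile of a user's past queries and
    flag new activity that diverges from it."""

    def __init__(self, threshold=0.1):
        self.terms = Counter()
        self.threshold = threshold   # placeholder value, assumed to need tuning

    def update(self, query):
        self.terms.update(query.lower().split())

    def is_suspicious(self, query):
        """Flag a query whose cosine similarity to the profile is too low."""
        q = Counter(query.lower().split())
        dot = sum(q[t] * self.terms[t] for t in q)
        norm = math.sqrt(sum(v * v for v in q.values())) * \
               math.sqrt(sum(v * v for v in self.terms.values()) or 1)
        return (dot / norm) < self.threshold

profile = UserProfile()
for q in ["database index tuning", "query optimizer statistics"]:
    profile.update(q)
print(profile.is_suspicious("query optimizer hints"))   # similar -> False
print(profile.is_suspicious("payroll salary export"))   # divergent -> True
```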