Andrei Z. Broder | Researchain

Publication

Featured research published by Andrei Z. Broder.

international acm sigir conference on research and development in information retrieval | 2007

Robust classification of rare queries using web knowledge

Andrei Z. Broder; Marcus Fontoura; Evgeniy Gabrilovich; Amruta Joshi; Vanja Josifovski; Tong Zhang

We propose a methodology for building a practical robust query classification system that can identify thousands of query classes with reasonable accuracy, while dealing in real-time with the query volume of a commercial web search engine. We use a blind feedback technique: given a query, we determine its topic by classifying the web search results retrieved by the query. Motivated by the needs of search advertising, we primarily focus on rare queries, which are the hardest from the point of view of machine learning, yet in aggregation account for a considerable fraction of search engine traffic. Empirical evaluation confirms that our methodology yields a considerably higher classification accuracy than previously reported. We believe that the proposed methodology will lead to better matching of online ads to rare queries and overall to a better user experience.
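The blind feedback idea in this abstract (classify the pages a query retrieves, then aggregate their labels) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `classify_query`, `toy_search`, and `toy_classify_page`, the keyword-based page classifier, and the simple score-summing vote are hypothetical stand-ins, not the system described in the paper.

```python
from collections import Counter


def classify_query(query, search, classify_page, top_k=10):
    """Rank class labels for a query by summing the class scores
    of the top_k pages that the query retrieves."""
    votes = Counter()
    for page in search(query)[:top_k]:
        for label, score in classify_page(page).items():
            votes[label] += score
    # Labels ordered from most to least voted-for.
    return [label for label, _ in votes.most_common()]


# Hypothetical toy stand-ins for a search engine and a page classifier.
TOY_RESULTS = {
    "cheap flights lisbon": [
        "airline fares to lisbon",
        "lisbon hotel deals",
        "flight booking tips",
    ],
}
KEYWORDS = {
    "travel": {"airline", "flight", "hotel", "fares"},
    "finance": {"deals", "fares"},
}


def toy_search(query):
    """Return the (toy) result pages for a query, best first."""
    return TOY_RESULTS.get(query, [])


def toy_classify_page(text):
    """Score each class by how many of its keywords occur in the page."""
    words = set(text.split())
    return {
        label: len(words & keywords)
        for label, keywords in KEYWORDS.items()
        if words & keywords
    }


if __name__ == "__main__":
    print(classify_query("cheap flights lisbon", toy_search, toy_classify_page))
```

The query itself is never classified directly; only the retrieved pages are, which is why a rare query with little history of its own can still receive a label.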

conference on learning theory | 2007

Margin based active learning

Maria-Florina Balcan; Andrei Z. Broder; Tong Zhang

We present a framework for margin based active learning of linear separators. We instantiate it for a few important cases, some of which have been previously considered in the literature. We analyze the effectiveness of our framework both in the realizable case and in a specific noisy setting related to the Tsybakov small noise condition.

international acm sigir conference on research and development in information retrieval | 2008

Optimizing relevance and revenue in ad search: a query substitution approach

Filip Radlinski; Andrei Z. Broder; Peter Ciccolo; Evgeniy Gabrilovich; Vanja Josifovski; Lance Riedel

The primary business model behind Web search is based on textual advertising, where contextually relevant ads are displayed alongside search results. We address the problem of selecting these ads so that they are both relevant to the queries and profitable to the search engine, showing that optimizing ad relevance and revenue is not equivalent. Selecting the best ads that satisfy these constraints also naturally incurs high computational costs, and time constraints can lead to reduced relevance and profitability. We propose a novel two-stage approach, which conducts most of the analysis ahead of time. An offline preprocessing phase leverages additional knowledge that is impractical to use in real time, and rewrites frequent queries in a way that subsequently facilitates fast and accurate online matching. Empirical evaluation shows that our method optimized for relevance matches a state-of-the-art method while improving expected revenue. When optimizing for revenue, we see even more substantial improvements in expected revenue.

international world wide web conferences | 2009

Online expansion of rare queries for sponsored search

Andrei Z. Broder; Peter Ciccolo; Evgeniy Gabrilovich; Vanja Josifovski; Donald Metzler; Lance Riedel; Jeffrey Yuan

Sponsored search systems are tasked with matching queries to relevant advertisements. The current state-of-the-art matching algorithms expand the user's query using a variety of external resources, such as Web search results. While these expansion-based algorithms are highly effective, they are largely inefficient and cannot be applied in real-time. In practice, such algorithms are applied offline to popular queries, with the results of the expensive operations cached for fast access at query time. In this paper, we describe an efficient and effective approach for matching ads against rare queries that were not processed offline. The approach builds an expanded query representation by leveraging offline processing done for related popular queries. Our experimental results show that our approach significantly improves the effectiveness of advertising on rare queries with only a negligible increase in computational cost.

conference on information and knowledge management | 2007

Just-in-time contextual advertising

Aris Anagnostopoulos; Andrei Z. Broder; Evgeniy Gabrilovich; Vanja Josifovski; Lance Riedel

Contextual Advertising is a type of Web advertising, which, given the URL of a Web page, aims to embed into the page (typically via JavaScript) the most relevant textual ads available. For static pages that are displayed repeatedly, the matching of ads can be based on prior analysis of their entire content; however, ads need to be matched also to new or dynamically created pages that cannot be processed ahead of time. Analyzing the entire body of such pages on-the-fly entails prohibitive communication and latency costs. To solve the three-horned dilemma of either low-relevance or high-latency or high-load, we propose to use text summarization techniques paired with external knowledge (exogenous to the page) to craft short page summaries in real time. Empirical evaluation proves that matching ads on the basis of such summaries does not sacrifice relevance, and is competitive with matching based on the entire page content. Specifically, we found that analyzing a carefully selected 5% fraction of the page text sacrifices only 1%-3% in ad relevance. Furthermore, our summaries are fully compatible with the standard JavaScript mechanisms used for ad placement: they can be produced at ad-display time by simple additions to the usual script, and they only add 500-600 bytes to the usual request.

conference on information and knowledge management | 2006

Estimating corpus size via queries

Andrei Z. Broder; Marcus Fontoura; Vanja Josifovski; Ravi Kumar; Rajeev Motwani; Shubha U. Nabar; Rina Panigrahy; Andrew Tomkins; Ying Xu

We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance. Using this basic estimator, we propose two approaches to estimating corpus size. The first approach requires a uniform random sample of documents from the corpus. The second approach avoids this notoriously difficult sample generation problem, and instead uses two fairly uncorrelated sets of terms as query pools; the accuracy of the second approach depends on the degree of correlation among the two sets of terms. Experiments on a large TREC collection and on three major search engines demonstrate the effectiveness of our algorithms.
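One way to obtain an unbiased estimator of this kind is sketched below. The abstract does not spell out the construction, so the weighting scheme is an assumption: each document is given weight 1 / (number of pool queries it matches), so that summing the weights over all pool queries counts every covered document exactly once. The names `query_weight` and `estimate_covered_docs` and the in-memory toy corpus are hypothetical; a real system would see documents only through the query interface.

```python
import random


def query_weight(query, pool_set, corpus):
    """Sum, over documents matching `query`, of 1 / (number of pool
    queries the document matches). Assumes single-term queries and a
    corpus given as {doc_id: set_of_terms}."""
    weight = 0.0
    for terms in corpus.values():
        if query in terms:
            weight += 1.0 / len(terms & pool_set)
    return weight


def estimate_covered_docs(pool, corpus, num_samples, rng):
    """Estimate how many documents match at least one pool query by
    sampling queries uniformly from the pool and averaging their weights."""
    pool_set = set(pool)
    sampled = [rng.choice(pool) for _ in range(num_samples)]
    mean_weight = sum(query_weight(q, pool_set, corpus) for q in sampled) / num_samples
    # Scaling the mean weight by the pool size makes the estimate unbiased.
    return len(pool) * mean_weight


if __name__ == "__main__":
    # Toy corpus: documents 1-3 match the pool, document 4 does not.
    corpus = {1: {"a", "b"}, 2: {"b"}, 3: {"a", "b", "c"}, 4: {"z"}}
    pool = ["a", "b", "c"]
    print(estimate_covered_docs(pool, corpus, num_samples=1000, rng=random.Random(0)))
```

In this toy corpus the per-query weights are 5/6, 11/6, and 1/3, which sum to 3, the number of documents covered by the pool. Sampling only a few queries trades that exactness for fewer calls to the query interface.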

ACM Transactions on The Web | 2009

Classifying search queries using the Web as a source of knowledge

Evgeniy Gabrilovich; Andrei Z. Broder; Marcus Fontoura; Amruta Joshi; Vanja Josifovski; Lance Riedel; Tong Zhang

We propose a methodology for building a robust query classification system that can identify thousands of query classes, while dealing in real time with the query volume of a commercial Web search engine. We use a pseudo relevance feedback technique: given a query, we determine its topic by classifying the Web search results retrieved by the query. Motivated by the needs of search advertising, we primarily focus on rare queries, which are the hardest from the point of view of machine learning, yet in aggregate account for a considerable fraction of search engine traffic. Empirical evaluation confirms that our methodology yields a considerably higher classification accuracy than previously reported. We believe that the proposed methodology will lead to better matching of online ads to rare queries and overall to a better user experience.

knowledge discovery and data mining | 2007

Estimating rates of rare events at multiple resolutions

Deepak Agarwal; Andrei Z. Broder; Deepayan Chakrabarti; Dejan Diklic; Vanja Josifovski; Mayssam Sayyadian

We consider the problem of estimating occurrence rates of rare events for extremely sparse data, using pre-existing hierarchies to perform inference at multiple resolutions. In particular, we focus on the problem of estimating click rates for (webpage, advertisement) pairs (called impressions) where both the pages and the ads are classified into hierarchies that capture broad contextual information at different levels of granularity. Typically the click rates are low and the coverage of the hierarchies is sparse. To overcome these difficulties we devise a sampling method whereby we analyze a specially chosen sample of pages in the training set, and then estimate click rates using a two-stage model. The first stage imputes the number of (webpage, ad) pairs at all resolutions of the hierarchy to adjust for the sampling bias. The second stage estimates click rates at all resolutions after incorporating correlations among sibling nodes through a tree-structured Markov model. Both models are scalable and suited to large scale data mining applications. On a real-world dataset consisting of 1/2 billion impressions, we demonstrate that even with 95% negative (non-clicked) events in the training set, our method can effectively discriminate extremely rare events in terms of their click propensity.

international world wide web conferences | 2009

Nearest-neighbor caching for content-match applications

Sandeep Pandey; Andrei Z. Broder; Flavio Chierichetti; Vanja Josifovski; Ravi Kumar; Sergei Vassilvitskii

Motivated by contextual advertising systems and other web applications involving efficiency-accuracy tradeoffs, we study similarity caching. Here, a cache hit is said to occur if the requested item is similar but not necessarily equal to some cached item. We study two objectives that dictate the efficiency-accuracy tradeoff and provide our caching policies for these objectives. By conducting extensive experiments on real data we show similarity caching can significantly improve the efficiency of contextual advertising systems, with minimal impact on accuracy. Inspired by the above, we propose a simple generative model that embodies two fundamental characteristics of page requests arriving to advertising systems, namely, long-range dependences and similarities. We provide theoretical bounds on the gains of similarity caching in this model and demonstrate these gains empirically by fitting the actual data to the model.
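The notion of a cache hit on a similar, not necessarily equal, item can be made concrete with a small sketch. The choices below are assumptions for illustration only: items are represented as numeric feature vectors, similarity is cosine similarity against a fixed threshold, and eviction is plain LRU. The class name `SimilarityCache` is hypothetical, and the caching policies studied in the paper are not reproduced here.

```python
import math
from collections import OrderedDict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class SimilarityCache:
    """LRU cache where a lookup hits if some cached key is similar enough."""

    def __init__(self, capacity, threshold):
        self.capacity = capacity
        self.threshold = threshold
        self.entries = OrderedDict()  # key vector (tuple) -> cached value

    def get(self, key):
        """Return the value of the most similar cached key, or None on a miss."""
        best, best_sim = None, self.threshold
        for cached in self.entries:
            sim = cosine(key, cached)
            if sim >= best_sim:
                best, best_sim = cached, sim
        if best is None:
            return None
        self.entries.move_to_end(best)  # mark as most recently used
        return self.entries[best]

    def put(self, key, value):
        """Insert a value, evicting the least recently used entry if full."""
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)


if __name__ == "__main__":
    cache = SimilarityCache(capacity=2, threshold=0.9)
    cache.put((1.0, 0.0), "ads-A")
    print(cache.get((0.99, 0.05)))  # similar request, served from cache
    print(cache.get((0.0, 1.0)))    # dissimilar request, a miss
```

The threshold is where the efficiency-accuracy tradeoff shows up: lowering it raises the hit rate but returns results computed for less similar items. The linear scan over cached keys keeps the sketch short; a production cache would need an index for the nearest-neighbor lookup.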

extending database technology | 2006

Indexing shared content in information retrieval systems

Andrei Z. Broder; Nadav Eiron; Marcus Fontoura; Michael Herscovici; Ronny Lempel; John McPherson; Runping Qi; Eugene J. Shekita

Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.
