Marcus Fontoura | Researchain

Publication

Featured research published by Marcus Fontoura.

international acm sigir conference on research and development in information retrieval | 2007

Robust classification of rare queries using web knowledge

Andrei Z. Broder; Marcus Fontoura; Evgeniy Gabrilovich; Amruta Joshi; Vanja Josifovski; Tong Zhang

We propose a methodology for building a practical, robust query classification system that can identify thousands of query classes with reasonable accuracy, while dealing in real time with the query volume of a commercial web search engine. We use a blind feedback technique: given a query, we determine its topic by classifying the web search results retrieved by the query. Motivated by the needs of search advertising, we primarily focus on rare queries, which are the hardest from the point of view of machine learning, yet in aggregate account for a considerable fraction of search engine traffic. Empirical evaluation confirms that our methodology yields a considerably higher classification accuracy than previously reported. We believe that the proposed methodology will lead to better matching of online ads to rare queries and overall to a better user experience.
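The blind feedback idea in this abstract (classify the retrieved pages, then infer the query's topic from them) can be illustrated with a minimal Python sketch. This is not the authors' system: the voting rule, the function names `classify_query`, `toy_search`, and `toy_classify_page`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List


def classify_query(
    query: str,
    search: Callable[[str], List[str]],
    classify_page: Callable[[str], Iterable[str]],
    top_n: int = 3,
) -> List[str]:
    """Label a query by voting over the classes of its search results.

    `search` and `classify_page` are hypothetical stand-ins for a search
    engine and a page classifier; simple majority voting is an assumption.
    """
    votes = Counter()
    for page_text in search(query):
        # Each retrieved page votes once for every class assigned to it.
        votes.update(set(classify_page(page_text)))
    return [label for label, _ in votes.most_common(top_n)]


# Toy data standing in for real search results (hypothetical).
_TOY_INDEX = {
    "sd450 battery": [
        "canon digital camera battery pack",
        "camera charger and battery for powershot",
        "battery store electronics",
    ]
}


def toy_search(query: str) -> List[str]:
    return _TOY_INDEX.get(query, [])


def toy_classify_page(text: str) -> List[str]:
    labels = []
    if "camera" in text:
        labels.append("Cameras")
    if "battery" in text or "electronics" in text:
        labels.append("Electronics")
    return labels


print(classify_query("sd450 battery", toy_search, toy_classify_page))
```

The query itself carries almost no signal for a classifier, but the pages it retrieves do, which is why rare queries benefit from this kind of indirection.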

international world wide web conferences | 2006

Using annotations in enterprise search

Pavel Dmitriev; Nadav Eiron; Marcus Fontoura; Eugene J. Shekita

A major difference between corporate intranets and the Internet is that in intranets the barrier for users to create web pages is much higher. This limits the amount and quality of anchor text, one of the major factors used by Internet search engines, making intranet search more difficult. The social phenomenon at play also means that spam is relatively rare. Both on the Internet and in intranets, users are often willing to cooperate with the search engine in improving the search experience. These characteristics naturally suggest using user feedback to improve search quality in intranets. In this paper we show how a particular form of feedback, namely user annotations, can be used to improve the quality of intranet search. An annotation is a short description of the contents of a web page, which can be considered a substitute for anchor text. We propose two ways to obtain user annotations, using explicit and implicit feedback, and show how they can be integrated into a search engine. Preliminary experiments on the IBM intranet demonstrate that using annotations improves the search quality.

ACM Transactions on The Web | 2009

Classifying search queries using the Web as a source of knowledge

Evgeniy Gabrilovich; Andrei Z. Broder; Marcus Fontoura; Amruta Joshi; Vanja Josifovski; Lance Riedel; Tong Zhang

We propose a methodology for building a robust query classification system that can identify thousands of query classes, while dealing in real time with the query volume of a commercial Web search engine. We use a pseudo relevance feedback technique: given a query, we determine its topic by classifying the Web search results retrieved by the query. Motivated by the needs of search advertising, we primarily focus on rare queries, which are the hardest from the point of view of machine learning, yet in aggregate account for a considerable fraction of search engine traffic. Empirical evaluation confirms that our methodology yields a considerably higher classification accuracy than previously reported. We believe that the proposed methodology will lead to better matching of online ads to rare queries and overall to a better user experience.

extending database technology | 2006

Indexing shared content in information retrieval systems

Andrei Z. Broder; Nadav Eiron; Marcus Fontoura; Michael Herscovici; Ronny Lempel; John McPherson; Runping Qi; Eugene J. Shekita

Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.
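A minimal Python sketch of the tree representation described in this abstract follows. It assumes that a document's effective content is its own text plus the text of its ancestors (as in a reply that quotes an email thread); that assumption, the class name `SharedContentIndex`, and the naive per-document scan in `search` are illustrative and do not reproduce the paper's index encoding or query algorithms.

```python
from collections import defaultdict


class SharedContentIndex:
    """Toy index in which shared (ancestor) content is posted only once."""

    def __init__(self):
        self.parent = {}                  # doc_id -> parent doc_id, or None
        self.postings = defaultdict(set)  # term -> docs whose OWN text has it

    def add(self, doc_id, own_text, parent=None):
        # Only the text that is new in this document is indexed.
        self.parent[doc_id] = parent
        for term in own_text.lower().split():
            self.postings[term].add(doc_id)

    def _lineage(self, doc_id):
        # Yield the document and then each of its ancestors up to the root.
        while doc_id is not None:
            yield doc_id
            doc_id = self.parent[doc_id]

    def search(self, query):
        # A document matches if every term occurs in it or in an ancestor.
        terms = query.lower().split()
        hits = []
        for doc_id in self.parent:
            lineage = set(self._lineage(doc_id))
            if all(self.postings.get(t, set()) & lineage for t in terms):
                hits.append(doc_id)
        return sorted(hits)


idx = SharedContentIndex()
idx.add("m1", "budget meeting monday")
idx.add("m2", "can we move it to tuesday", parent="m1")
idx.add("m3", "tuesday works", parent="m2")
print(idx.search("budget tuesday"))
```

The term "budget" has a single posting even though all three messages effectively contain it, which is the space saving the abstract refers to.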

very large data bases | 2013

Top-k publish-subscribe for social annotation of news

Alexander Shraer; Maxim Gurevich; Marcus Fontoura; Vanja Josifovski

Social content, such as Twitter updates, often has the quickest first-hand reports of news events, as well as numerous commentaries that are indicative of the public view of such events. As such, social updates provide a good complement to professionally written news articles. In this paper we consider the problem of automatically annotating news stories with social updates (tweets), at a news website serving a high volume of pageviews. The high rate of both the pageviews (millions to billions a day) and of the incoming tweets (more than 100 million a day) makes real-time indexing of tweets ineffective, as this requires an index that is both queried and updated extremely frequently. The rate of tweet updates makes caching techniques almost unusable since the cache would become stale very quickly. We propose a novel architecture where each story is treated as a subscription for tweets relevant to the story's content, and new algorithms that efficiently match tweets to stories, proactively maintaining the top-k tweets for each story. Such top-k pub-sub consumes only a small fraction of the resource cost of alternative solutions, and can be applicable to other large-scale content-based publish-subscribe problems. We demonstrate the effectiveness of our approach on real-world data: a corpus of news stories from Yahoo! News and a log of Twitter updates.
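The architecture in this abstract (stories as standing subscriptions, tweets as published events, a top-k list maintained per story) can be sketched in Python as below. The class name `TopKPubSub`, the term-overlap score, and the tie-breaking by tweet id are assumptions for illustration; the paper's matching algorithms and scoring are not reproduced here.

```python
import heapq
from collections import defaultdict


class TopKPubSub:
    """Toy top-k publish-subscribe: stories subscribe, tweets are published."""

    def __init__(self, k):
        self.k = k
        self.term_to_stories = defaultdict(set)  # inverted index over stories
        self.story_terms = {}                    # story_id -> set of terms
        self.top = defaultdict(list)             # story_id -> min-heap of (score, tweet_id)

    def subscribe(self, story_id, text):
        terms = set(text.lower().split())
        self.story_terms[story_id] = terms
        for term in terms:
            self.term_to_stories[term].add(story_id)

    def publish(self, tweet_id, text):
        terms = set(text.lower().split())
        # Only stories sharing at least one term with the tweet are touched.
        candidates = set()
        for term in terms:
            candidates |= self.term_to_stories.get(term, set())
        for story_id in candidates:
            # Assumed score: number of terms shared by tweet and story.
            score = len(terms & self.story_terms[story_id])
            heap = self.top[story_id]
            if len(heap) < self.k:
                heapq.heappush(heap, (score, tweet_id))
            elif (score, tweet_id) > heap[0]:
                # Evict the current weakest tweet for this story.
                heapq.heapreplace(heap, (score, tweet_id))

    def top_tweets(self, story_id):
        # Best first; ties are ordered by tweet id, descending.
        return [tid for _, tid in sorted(self.top.get(story_id, []), reverse=True)]


ps = TopKPubSub(k=2)
ps.subscribe("s1", "earthquake hits city center")
ps.publish("t1", "earthquake felt downtown")
ps.publish("t2", "earthquake hits the city hard")
ps.publish("t3", "city center closed after earthquake")
ps.publish("t4", "nice weather today")
print(ps.top_tweets("s1"))
```

Because the per-story lists are updated as tweets arrive, serving a pageview is a lookup rather than a query against a constantly changing tweet index.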

international world wide web conferences | 2010

Using landing pages for sponsored search ad selection

Yejin Choi; Marcus Fontoura; Evgeniy Gabrilovich; Vanja Josifovski; Maurício R. Mediano; Bo Pang

We explore the use of the landing page content in sponsored search ad selection. Specifically, we compare the use of the ad's intrinsic content to augmenting the ad with the whole, or parts, of the landing page. We explore two types of extractive summarization techniques to select useful regions from the landing pages: out-of-context and in-context methods. Out-of-context methods select salient regions from the landing page by analyzing the content alone, without taking into account the ad associated with the landing page. In-context methods use the ad context (including its title, creative, and bid phrases) to help identify regions of the landing page that should be used by the ad selection engine. In addition, we introduce a simple yet effective unsupervised algorithm to enrich the ad context to further improve the ad selection. Experimental evaluation confirms that the use of landing pages can significantly improve the quality of ad selection. We also find that our extractive summarization techniques reduce the size of landing pages substantially, while retaining or even improving the performance of ad retrieval over the method that utilizes the entire landing page.

Journal of Computer and System Sciences | 2007

On the memory requirements of XPath evaluation over XML streams

Ziv Bar-Yossef; Marcus Fontoura; Vanja Josifovski

The important challenge of evaluating XPath queries over XML streams has sparked much interest in the past few years. A number of algorithms have been proposed, supporting wider fragments of the query language, and exhibiting better performance and memory utilization. Nevertheless, all the algorithms known to date use a prohibitively large amount of memory for certain types of queries. A natural question then is whether this memory bottleneck is inherent or just an artifact of the proposed algorithms. In this paper we initiate the first systematic and theoretical study of lower bounds on the amount of memory required to evaluate XPath queries over XML streams. We present a general lower bound technique, which given a query, specifies the minimum amount of memory that any algorithm evaluating the query on a stream would need to incur. The lower bounds are stated in terms of new graph-theoretic properties of queries. The proofs are based on tools from communication complexity. We then exploit insights learned from the lower bounds to obtain a new algorithm for XPath evaluation on streams. The algorithm uses space close to the optimum. Our algorithm deviates from the standard paradigm of using automata or transducers, thereby avoiding the need to store large transition tables.

international conference on management of data | 2010

Efficiently evaluating complex boolean expressions

Marcus Fontoura; Suhas Sadanandan; Jayavel Shanmugasundaram; Sergei Vassilvitskii; Erik Vee; Srihari Venkatesan; Jason Zien

The problem of efficiently evaluating a large collection of complex Boolean expressions - beyond simple conjunctions and Disjunctive/Conjunctive Normal Forms (DNF/CNF) - occurs in many emerging online advertising applications such as advertising exchanges and automatic targeting. The simple solution of normalizing complex Boolean expressions to DNF or CNF, and then using existing methods for evaluating such expressions, is not always effective because of the exponential blow-up in the size of expressions due to normalization. We thus propose a novel method for evaluating complex expressions, which leverages existing techniques for evaluating leaf-level conjunctions, and then uses a bottom-up evaluation technique to only process the relevant parts of the complex expressions that contain the matching conjunctions. We develop two such bottom-up evaluation techniques, one based on Dewey IDs and another based on mapping Boolean expressions to one-dimensional intervals. Our experimental evaluation based on data obtained from an online advertising exchange shows that the proposed techniques are efficient and scalable with respect to both space usage and evaluation time.
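The bottom-up idea in this abstract can be illustrated with a short Python sketch that starts from the leaves already known to match and propagates truth upward, visiting only the ancestors of those leaves. It uses plain parent pointers and per-node counters, which is a simplification chosen for illustration; it is neither the Dewey ID technique nor the interval technique from the paper, it handles only AND and OR nodes, and the names `Node` and `evaluate_bottom_up` are hypothetical.

```python
class Node:
    """Expression tree node; op is "AND", "OR", or "LEAF"."""

    def __init__(self, op, children=()):
        self.op = op
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def evaluate_bottom_up(root, matched_leaves):
    """Return True if the expression rooted at `root` is satisfied.

    `matched_leaves` are leaf nodes of this tree already known to be true,
    e.g. as reported by some separate conjunction-matching step (assumed).
    """
    true_children = {}  # id(node) -> number of children known to be true
    satisfied = set()   # ids of nodes known to be true
    worklist = list(matched_leaves)
    while worklist:
        node = worklist.pop()
        if id(node) in satisfied:
            continue
        satisfied.add(id(node))
        parent = node.parent
        if parent is None:
            continue
        count = true_children.get(id(parent), 0) + 1
        true_children[id(parent)] = count
        # AND needs every child to be true; OR needs just one.
        needed = len(parent.children) if parent.op == "AND" else 1
        if count >= needed:
            worklist.append(parent)
    return id(root) in satisfied


# (a AND b) OR (c AND d)
a, b, c, d = (Node("LEAF") for _ in range(4))
expr = Node("OR", [Node("AND", [a, b]), Node("AND", [c, d])])
print(evaluate_bottom_up(expr, [a, b]), evaluate_bottom_up(expr, [a, c]))
```

Subtrees with no matching leaf are never visited, so the work is proportional to the matched part of the expression rather than to its full size, and no normalization to DNF or CNF is required.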

very large data bases | 2008

Relaxation in text search using taxonomies

Marcus Fontoura; Vanja Josifovski; Ravi Kumar; Christopher Olston; Andrew Tomkins; Sergei Vassilvitskii

In this paper we propose a novel document retrieval model in which text queries are augmented with multi-dimensional taxonomy restrictions. These restrictions may be relaxed at a cost to result quality. This new model may be applicable in many arenas, including multifaceted, product, and local search, where documents are augmented with hierarchical metadata such as topic or location. We present efficient algorithms for indexing and query processing in this new retrieval model. We decompose query processing into two sub-problems: first, an online search problem to determine the correct overall level of relaxation cost that must be incurred to generate the top k results; and second, a budgeted relaxation search problem in which all results at a particular relaxation cost must be produced at minimal cost. We show the latter problem is solvable exactly in two hierarchical dimensions, is NP-hard in three or more dimensions, but admits efficient approximation algorithms with provable guarantees. We present experimental results evaluating our algorithms on both synthetic and real data, showing order-of-magnitude improvements over the baseline algorithm.

international world wide web conferences | 2000

V-Market: A framework for agent e-commerce systems

Pedro S. Ripper; Marcus Fontoura; Ayrton Maia Neto; Carlos José Pereira de Lucena

Software agent technology is still an emerging technology, and as such, agent-based software design is still in its infancy. Software agents have just started to be used in the e-commerce domain, and they are already beginning to create a series of new possibilities for this arena. Agents can be used to automate, as well as to enhance, many stages of the traditional consumer-buying behavior process. This paper proposes a software engineering approach to the design of agent-mediated e-commerce systems, through the definition of an object-oriented framework. The paper presents the underlying concepts and the architecture of the environment, showing how it allows developers to customize virtual marketplaces and to define transaction categories on demand, incorporating many possible products and services that can be traded online.
