Flavio Junqueira
Microsoft
Publications
Featured research published by Flavio Junqueira.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007
Ricardo A. Baeza-Yates; Aristides Gionis; Flavio Junqueira; Vanessa Murdock; Vassilis Plachouras; Fabrizio Silvestri
In this paper we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log affect the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
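The static posting-list cache described above can be illustrated with a short sketch. The greedy frequency-per-space scoring rule and the function names below are illustrative assumptions, not the paper's actual algorithm or code.

```python
# Minimal sketch of static posting-list caching under a fixed space budget:
# rank terms by query-term frequency divided by posting-list size and fill
# the cache greedily. The scoring rule is an assumption for illustration.

def select_posting_lists(query_term_freq, posting_list_size, budget):
    """Greedily pick the terms with the best frequency-per-space ratio."""
    ranked = sorted(
        query_term_freq,
        key=lambda t: query_term_freq[t] / posting_list_size[t],
        reverse=True,
    )
    cached, used = set(), 0
    for term in ranked:
        size = posting_list_size[term]
        if used + size <= budget:
            cached.add(term)
            used += size
    return cached

# Term 'b' is queried as often as 'a' but has a much shorter posting list,
# so it is cached first under a tight budget.
freqs = {"a": 100, "b": 100, "c": 10}
sizes = {"a": 50, "b": 10, "c": 5}
print(sorted(select_posting_lists(freqs, sizes, budget=20)))  # ['b', 'c']
```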
Dependable Systems and Networks | 2011
Flavio Junqueira; Benjamin Reed; Marco Serafini
Zab is a crash-recovery atomic broadcast algorithm we designed for the ZooKeeper coordination service. ZooKeeper implements a primary-backup scheme in which a primary process executes client operations and uses Zab to propagate the corresponding incremental state changes to backup processes. Because each incremental state change depends on the sequence of changes previously generated, Zab must guarantee that if it delivers a given state change, all the changes it depends upon have been delivered first. Since primaries may crash, Zab must satisfy this requirement despite primary crashes.
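The ordering requirement is the crux of the abstract above: a state change can only be applied once everything it depends on has been applied. A minimal sketch of that delivery rule (not of the Zab protocol itself; the class and method names are hypothetical) might look like this:

```python
# Sketch of in-order application of incremental state changes at a backup:
# changes arriving ahead of their predecessors are buffered until every
# earlier change has been applied. This models only the ordering guarantee.

class Backup:
    def __init__(self):
        self.state = {}
        self.next_seq = 1   # sequence number of the next change to apply
        self.pending = {}   # out-of-order changes, keyed by sequence number

    def deliver(self, seq, key, value):
        self.pending[seq] = (key, value)
        # Apply every change whose predecessors have all been applied.
        while self.next_seq in self.pending:
            k, v = self.pending.pop(self.next_seq)
            self.state[k] = v
            self.next_seq += 1

b = Backup()
b.deliver(2, "x", "new")   # buffered: change 1 has not been applied yet
b.deliver(1, "x", "old")   # applies change 1, then the buffered change 2
print(b.state)             # {'x': 'new'}
```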
International Conference on Data Engineering | 2007
Ricardo A. Baeza-Yates; Carlos Castillo; Flavio Junqueira; Vassilis Plachouras; Fabrizio Silvestri
In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters. Web data, however, is always evolving. The number of Web sites continues to grow rapidly and there are currently more than 20 billion indexed pages. In the near future, centralized systems are likely to become ineffective against such a load, thus suggesting the need for fully distributed search engines. Such engines need to achieve the following goals: high quality answers, fast response time, high query throughput, and scalability. In this paper we survey and organize recent research results, outlining the main challenges of designing a distributed Web retrieval system.
ACM Transactions on the Web | 2008
Ricardo A. Baeza-Yates; Aristides Gionis; Flavio Junqueira; Vanessa Murdock; Vassilis Plachouras; Fabrizio Silvestri
In this article we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year, we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log influence the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
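One question both versions of this work raise is how to split a fixed static-cache budget between query answers and posting lists. The toy sweep below, with placeholder benefit estimators the reader would replace (for example, by replaying a query log), sketches that search; it is not the paper's evaluation methodology.

```python
import math

# Toy sketch: sweep candidate splits of a fixed memory budget between the
# answer cache and the posting-list cache, keeping the split with the best
# estimated benefit. The estimators here are illustrative placeholders.

def best_split(budget, answer_benefit, posting_benefit, steps=10):
    best_value, best_answer_bytes = float("-inf"), 0
    for i in range(steps + 1):
        answer_bytes = budget * i // steps
        posting_bytes = budget - answer_bytes
        value = answer_benefit(answer_bytes) + posting_benefit(posting_bytes)
        if value > best_value:
            best_value, best_answer_bytes = value, answer_bytes
    return best_answer_bytes  # bytes to devote to the answer cache

# Placeholder estimators with diminishing returns, purely for illustration.
split = best_split(1 << 20,
                   lambda b: 2.0 * math.log1p(b),
                   lambda b: math.log1p(b))
print(split)
```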
International World Wide Web Conference | 2010
Berkant Barla Cambazoglu; Flavio Junqueira; Vassilis Plachouras; Scott Alexander Banachowski; Baoqiu Cui; Swee Lim; Bill Bridge
Commercial Web search engines have to process user queries over huge Web indexes under tight latency constraints. In practice, to achieve low latency, large result caches are employed and a portion of the query traffic is served using previously computed results. Moreover, search engines need to update their indexes frequently to incorporate changes to the Web. After every index update, however, the content of cache entries may become stale, thus decreasing the freshness of served results. In this work, we first argue that the real problem in today's caching for large-scale search engines is not eviction policies, but the ability to cope with changes to the index, i.e., cache freshness. We then introduce a novel algorithm that uses a time-to-live value to set cache entries to expire and selectively refreshes cached results by issuing refresh queries to back-end search clusters. The algorithm prioritizes the entries to refresh according to a heuristic that combines the frequency of access with the age of an entry in the cache. In addition, for setting the rate at which refresh queries are issued, we present a mechanism that takes into account idle cycles of back-end servers. Evaluation using a real workload shows that our algorithm can achieve improvements in hit rate as well as reductions in average hit age. An implementation of this algorithm is currently in production use at Yahoo!.
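As a rough illustration of the refresh idea, the sketch below expires entries with a TTL and orders refresh candidates by access frequency multiplied by age, issuing at most as many refresh queries as the back end has idle capacity for. The scoring formula, class, and parameter names are assumptions, not the production algorithm.

```python
import time

# Illustrative sketch of TTL-based refresh prioritization: expired cache
# entries are refresh candidates, ordered so that popular, stale entries
# come first, and refreshes are capped by the back end's idle capacity.

class RefreshingResultCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # query -> (results, cached_at, access_count)

    def put(self, query, results, now=None):
        self.entries[query] = (results, now or time.time(), 1)

    def record_hit(self, query):
        results, cached_at, count = self.entries[query]
        self.entries[query] = (results, cached_at, count + 1)

    def refresh_candidates(self, now=None):
        """Expired entries, most valuable to refresh first."""
        now = now or time.time()
        expired = [(q, e) for q, e in self.entries.items()
                   if now - e[1] > self.ttl]
        # Score = access frequency x age, so popular, stale entries lead.
        expired.sort(key=lambda qe: qe[1][2] * (now - qe[1][1]), reverse=True)
        return [q for q, _ in expired]

    def issue_refreshes(self, idle_slots, run_query, now=None):
        """Refresh at most `idle_slots` entries using back-end idle cycles."""
        for query in self.refresh_candidates(now)[:idle_slots]:
            self.put(query, run_query(query), now)
```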
String Processing and Information Retrieval | 2007
Ricardo A. Baeza-Yates; Flavio Junqueira; Vassilis Plachouras; Hans Friedrich Witschel
This paper studies the impact of the tail of the query distribution on the caches of Web search engines, and proposes a technique for achieving higher hit ratios than traditional heuristics such as LRU. The main problem we solve is identifying infrequent queries, which reduce the hit ratio because caching them rarely leads to hits. To mitigate this problem, we introduce a cache management policy that employs an admission policy to prevent infrequent queries from taking cache space away from more frequent queries. The admission policy uses either stateless features, which depend only on the query, or stateful features based on usage information. The proposed management policy is more general than existing policies for caching search engine results, and it is fully dynamic. The evaluation results on two different query logs show that our policy achieves higher hit ratios than previously proposed cache management policies.
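A minimal sketch of an admission-controlled result cache is shown below. The two features used here (query length as a stateless signal and a seen-before count as a stateful signal) are illustrative stand-ins for the features evaluated in the paper.

```python
from collections import OrderedDict

# Sketch of a result cache guarded by an admission policy: a query's results
# are cached only if the query looks frequent enough, so rare queries do not
# evict entries that are likely to be hit again. Feature choices are assumed.

class AdmissionControlledCache:
    def __init__(self, capacity, max_terms=3, min_prior_occurrences=1):
        self.capacity = capacity
        self.max_terms = max_terms                  # stateless feature
        self.min_prior_occurrences = min_prior_occurrences
        self.seen = {}                              # stateful feature: counts
        self.cache = OrderedDict()                  # LRU over admitted entries

    def admit(self, query):
        stateless_ok = len(query.split()) <= self.max_terms
        stateful_ok = self.seen.get(query, 0) >= self.min_prior_occurrences
        return stateless_ok or stateful_ok

    def get(self, query):
        self.seen[query] = self.seen.get(query, 0) + 1
        if query in self.cache:
            self.cache.move_to_end(query)           # mark as recently used
            return self.cache[query]
        return None

    def put(self, query, results):
        if not self.admit(query):
            return                                  # likely infrequent: skip
        self.cache[query] = results
        self.cache.move_to_end(query)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)          # evict the LRU entry
```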
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2008
Gleb Skobeltsyn; Flavio Junqueira; Vassilis Plachouras; Ricardo A. Baeza-Yates
Results caching is an efficient technique for reducing the query processing load; hence, it is commonly used in real search engines. This technique, however, bounds the maximum hit rate due to the large fraction of singleton queries, which is an important limitation. In this paper we propose ResIn, an architecture that uses a combination of results caching and index pruning to overcome this limitation. We argue that results caching is an inexpensive and efficient way to reduce the query processing load and show that it is cheaper to implement than a pruned index. At the same time, we show that index pruning performance is fundamentally affected by the changes in the query traffic that the results cache induces. We experiment with real query logs and a large document collection, and show that the combination of both techniques enables efficient reduction of the query processing costs and thus is practical to use in Web search engines.
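The tiered flow the abstract describes can be sketched as follows; the objects and the safety check on the pruned index are hypothetical placeholders, not the ResIn implementation.

```python
# Sketch of a results-cache-plus-pruned-index flow: serve from the cache if
# possible, fall back to the pruned index when it can answer the query, and
# only then go to the full index. Interfaces here are assumed, not ResIn's.

def process_query(query, results_cache, pruned_index, full_index):
    cached = results_cache.get(query)
    if cached is not None:
        return cached                       # cheapest path: a cache hit

    if pruned_index.can_answer(query):      # e.g., guarantees the same top-k
        results = pruned_index.search(query)
    else:
        results = full_index.search(query)  # full processing as a fallback

    results_cache.put(query, results)
    return results
```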
Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware | 2008
Benjamin Reed; Flavio Junqueira
This is a short overview of Zab, the totally ordered broadcast protocol used by ZooKeeper. It is conceptually simple, easy to implement, and gives high performance. In this paper we present the requirements ZooKeeper makes on Zab, we show how the protocol is used, and we give an overview of how the protocol works.
Dependable Systems and Networks | 2012
Daniele Sciascia; Fernando Pedone; Flavio Junqueira
Deferred update replication is a well-known approach to building data management systems as it provides both high availability and high performance. High availability comes from the fact that any replica can execute client transactions; the crash of one or more replicas does not interrupt the system. High performance comes from the fact that only one replica executes a transaction; the others must only apply its updates. Since replicas execute transactions concurrently, transaction execution is distributed across the system. The main drawback of deferred update replication is that update transactions scale poorly with the number of replicas, although read-only transactions scale well. This paper proposes an extension to the technique that improves the scalability of update transactions. In addition to presenting a novel protocol, we detail its implementation and provide an extensive analysis of its performance.
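A rough sketch of the deferred update replication lifecycle follows: a transaction executes at one replica, and its read and write sets are then delivered to all replicas in the same total order, where each replica certifies and applies it deterministically. The certification rule and names below are illustrative, not the protocol proposed in the paper.

```python
# Sketch of certify-then-apply at each replica under deferred update
# replication. A transaction aborts if any key it read was overwritten by a
# transaction that committed after its snapshot was taken.

class Replica:
    def __init__(self):
        self.store = {}     # key -> (value, commit_version)
        self.version = 0    # number of committed update transactions

    def certify_and_apply(self, readset, writeset, snapshot_version):
        for key in readset:
            _, committed_at = self.store.get(key, (None, 0))
            if committed_at > snapshot_version:
                return False                # conflict: abort deterministically
        self.version += 1
        for key, value in writeset.items():
            self.store[key] = (value, self.version)
        return True

r = Replica()
r.certify_and_apply(readset=set(), writeset={"x": 1}, snapshot_version=0)
ok = r.certify_and_apply(readset={"x"}, writeset={"x": 2}, snapshot_version=0)
print(ok)  # False: x changed after the second transaction's snapshot
```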
International World Wide Web Conference | 2010
Roi Blanco; Edward Bortnikov; Flavio Junqueira; Ronny Lempel; Luca Telloli; Hugo Zaragoza
A Web search engine must update its index periodically to incorporate changes to the Web, and we argue in this work that index updates fundamentally impact the design of search engine result caches. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. To enable efficient invalidation of cached results, we propose a framework for developing invalidation predictors, along with several concrete predictors. Evaluation using Wikipedia documents and a query log from Yahoo! shows that selective invalidation of cached search results can lower the number of query re-evaluations by as much as 30% compared to a baseline time-to-live scheme, while returning results of similar freshness.
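To make the idea of an invalidation predictor concrete, the sketch below invalidates any cached query that shares a term with an updated document. This term-overlap rule is a deliberately crude stand-in chosen for illustration; the predictors developed in the paper are more selective.

```python
# Sketch of a simple invalidation predictor for a result cache: when a
# document changes, drop every cached query that shares a term with it.

class InvalidatingResultCache:
    def __init__(self):
        self.results = {}          # query -> cached results
        self.term_to_queries = {}  # term -> set of cached queries using it

    def put(self, query, results):
        self.results[query] = results
        for term in query.split():
            self.term_to_queries.setdefault(term, set()).add(query)

    def get(self, query):
        return self.results.get(query)

    def on_document_update(self, changed_terms):
        """Predict and invalidate the cached queries the update may affect."""
        for term in changed_terms:
            for query in self.term_to_queries.pop(term, set()):
                self.results.pop(query, None)

cache = InvalidatingResultCache()
cache.put("zookeeper tutorial", ["result-1", "result-2"])
cache.on_document_update({"zookeeper", "zab"})
print(cache.get("zookeeper tutorial"))  # None: predicted stale and dropped
```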