Ricardo A. Baeza-Yates
Pompeu Fabra University
Publications
Featured research published by Ricardo A. Baeza-Yates.
Extending Database Technology | 2004
Ricardo A. Baeza-Yates; Carlos A. Hurtado; Marcelo Mendoza
In this paper we propose a method that, given a query submitted to a search engine, suggests a list of related queries. The related queries are based on previously issued queries and can be issued by the user to the search engine to tune or redirect the search process. The proposed method is based on a query clustering process in which groups of semantically similar queries are identified. The clustering process uses the content of historical preferences of users registered in the query log of the search engine. The method not only discovers the related queries, but also ranks them according to a relevance criterion. Finally, we show the effectiveness of the method with experiments over the query log of a search engine.
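The clustering-based suggestion pipeline can be illustrated with a short sketch, assuming scikit-learn is available; the query log, the clicked-document text used to represent each query, and the similarity-based ranking below are illustrative stand-ins, not the paper's data or relevance criterion.

# A minimal sketch of query clustering for related-query suggestion.
# Assumes scikit-learn; data and ranking are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# each past query is represented by text drawn from the documents its users clicked
query_log = {
    "cheap flights": "airline tickets low cost carriers booking flights",
    "flight deals": "discount airfare booking flights promotions",
    "java tutorial": "java programming language classes objects tutorial",
    "learn python": "python programming tutorial beginners scripts",
    "laptop reviews": "notebook benchmarks battery life laptop reviews",
}

queries = list(query_log)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([query_log[q] for q in queries])

# group semantically similar queries
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

def suggest(new_query_text, top_n=2):
    """Suggest past queries from the cluster closest to the new query,
    ranked here by cosine similarity (a stand-in for a relevance criterion)."""
    v = vectorizer.transform([new_query_text])
    cluster = km.predict(v)[0]
    candidates = [i for i, c in enumerate(km.labels_) if c == cluster]
    sims = cosine_similarity(v, X[candidates]).ravel()
    ranked = sorted(zip(sims, candidates), reverse=True)
    return [queries[i] for _, i in ranked[:top_n]]

print(suggest("booking low cost airline tickets"))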
Knowledge Discovery and Data Mining | 2007
Ricardo A. Baeza-Yates; Alessandro Tiberi
In this paper we study a large query log of more than twenty million queries with the goal of extracting the semantic relations that are implicitly captured in the actions of users submitting queries and clicking answers. Previous query log analyses were mostly done with just the queries and not the actions that followed them. We first propose a novel way to represent queries in a vector space based on a graph derived from the query-click bipartite graph. We then analyze the graph produced by our query log, showing that it is less sparse than previous results suggested, and that almost all the measures of these graphs follow power laws, shedding some light on user search behavior as well as on the distribution of topics that people look for on the Web. The representation we introduce allows us to infer interesting semantic relationships between queries. Second, we provide an experimental analysis of the quality of these relations, showing that most of them are relevant. Finally, we sketch an application that detects multitopical URLs.
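A stripped-down version of the idea, representing a query by the URLs its users clicked and comparing queries by vector similarity, might look like the following sketch. The click records are toy data, and the paper's vector space is actually derived from a graph built on top of the query-click bipartite graph.

# A minimal sketch: queries as click-count vectors over URLs, compared by cosine.
from collections import defaultdict
from math import sqrt

clicks = [                       # (query, clicked URL) pairs from a query log
    ("jaguar speed", "wikipedia.org/Jaguar"),
    ("jaguar speed", "britannica.com/jaguar"),
    ("big cats", "wikipedia.org/Jaguar"),
    ("big cats", "nationalgeographic.com/cats"),
    ("jaguar car price", "jaguar.com/models"),
]

# query -> {url: click count}
vectors = defaultdict(lambda: defaultdict(int))
for q, url in clicks:
    vectors[q][url] += 1

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# queries sharing clicked URLs come out related; unrelated ones score 0
print(cosine(vectors["jaguar speed"], vectors["big cats"]))          # > 0
print(cosine(vectors["jaguar speed"], vectors["jaguar car price"]))  # 0.0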
Information & Computation | 1993
Ricardo A. Baeza-Yates; Joseph C. Culberson; Gregory J. E. Rawlins
In this paper we initiate a new area of study dealing with the best way to search a possibly unbounded region for an object. The model for our search algorithms is that we must pay costs proportional to the distance of the next probe position relative to our current position. This model is meant to give a realistic cost measure for a robot moving in the plane. We also examine the effect of decreasing the amount of a priori information given to search problems. Problems of this type are very simple analogues of non-trivial problems on searching an unbounded region, processing digitized images, and robot navigation. We show that for some simple search problems, knowing the general direction of the goal is much more informative than knowing the distance to the goal.
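For the one-dimensional case, the flavor of these results can be seen in the classic doubling strategy, sketched below under the same cost model (total distance walked). The simulation is illustrative and not a transcription of the paper.

# A minimal sketch of the doubling strategy for searching a line when neither the
# direction nor the distance of the goal is known: walk 1 step right, 2 left,
# 4 right, ..., returning to the origin between turns. The total distance walked
# is at most about 9 times the distance to the goal.
def doubling_search(goal: int) -> int:
    """Return the total distance walked before reaching `goal` (a nonzero
    integer position on the line) with the doubling zig-zag strategy."""
    walked = 0
    reach = 1
    direction = 1          # start by exploring the positive side
    while True:
        target = direction * reach
        # walk from the origin out to `target`; stop early if the goal lies on the way
        if (goal > 0) == (target > 0) and abs(goal) <= abs(target):
            return walked + abs(goal)
        walked += 2 * reach          # out to `target` and back to the origin
        reach *= 2
        direction = -direction

for g in (1, 7, -100, 1000):
    print(g, doubling_search(g), doubling_search(g) / abs(g))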
ACM Transactions on Information Systems | 2000
Edleno Silva de Moura; Gonzalo Navarro; Nivio Ziviani; Ricardo A. Baeza-Yates
We present a fast compression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half of the time of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.
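The properties that matter here, byte-aligned word-based codewords that can be scanned directly, can be illustrated with a much simpler code than the paper's byte-oriented Huffman scheme. The sketch below uses a rank-based tagged byte code, an illustrative assumption rather than the paper's construction.

# A minimal sketch of byte-aligned, word-based compression searchable without
# decompression. High bit marks the first byte of every codeword; this is a
# simplified stand-in for the semistatic byte-oriented Huffman code.
from collections import Counter

def build_codebook(words):
    """Assign shorter codewords to more frequent words: the word's frequency
    rank written in base 128, with the high bit set on the first byte."""
    codebook = {}
    for rank, (w, _) in enumerate(Counter(words).most_common()):
        digits, r = [], rank
        while True:
            digits.append(r % 128)
            r //= 128
            if r == 0:
                break
        digits.reverse()
        digits[0] |= 0x80                     # tag: this byte starts a codeword
        codebook[w] = bytes(digits)
    return codebook

def compress(words, codebook):
    return b"".join(codebook[w] for w in words)

def search_word(blob, codebook, word):
    """Exact word search directly on the compressed bytes: scan for the word's
    codeword and accept a hit only if it ends at a codeword boundary."""
    code = codebook[word]
    hits, start = [], 0
    while (i := blob.find(code, start)) >= 0:
        end = i + len(code)
        if end == len(blob) or blob[end] & 0x80:   # next byte starts a codeword
            hits.append(i)
        start = i + 1
    return hits

words = "to be or not to be that is the question".split()
cb = build_codebook(words)
blob = compress(words, cb)
print(len(blob), search_word(blob, cb, "be"))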
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007
Ricardo A. Baeza-Yates; Aristides Gionis; Flavio Junqueira; Vanessa Murdock; Vassilis Plachouras; Fabrizio Silvestri
In this paper we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year, we explore the limitations of caching and demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how changes in the query log affect the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, a memory/disk layer or a broker/remote server layer.
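A greedy static fill that ranks terms by query frequency relative to posting-list size captures the spirit of static posting-list caching discussed above. The sketch below is illustrative, with made-up inputs, and is not claimed to be the paper's exact algorithm.

# A minimal sketch of greedy static caching of posting lists: rank terms by
# query frequency divided by posting-list size and fill the cache until the
# space budget runs out. Inputs are illustrative placeholders.
def fill_static_cache(query_term_freq, posting_len, budget):
    """query_term_freq: term -> how often it occurs in the query log.
    posting_len: term -> size of its posting list.
    budget: total cache capacity in the same units as posting_len."""
    ranked = sorted(query_term_freq,
                    key=lambda t: query_term_freq[t] / posting_len[t],
                    reverse=True)
    cache, used = [], 0
    for term in ranked:
        if used + posting_len[term] <= budget:
            cache.append(term)
            used += posting_len[term]
    return cache

qtf = {"weather": 900, "python": 500, "the": 1200, "obscureterm": 3}
plen = {"weather": 200, "python": 150, "the": 5000, "obscureterm": 10}
print(fill_static_cache(qtf, plen, 400))   # favors frequent terms with short lists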
String Processing and Information Retrieval | 2006
Ricardo A. Baeza-Yates; Liliana Calderón-Benavides; Cristina N. González-Caro
Identifying a user's intention or interest from the queries they submit to a search engine can be very useful for offering them more adequate results. In this work we present a framework for identifying user interest automatically, based on the analysis of query logs. This identification is made from two perspectives: the objectives or goals of a user, and the categories in which these goals fall. A manual classification of the queries was made in order to have a reference point, and then we applied supervised and unsupervised learning techniques. The results obtained show that for a considerable number of cases supervised learning is a good option; however, through unsupervised learning we found relationships between users and behaviors that are not easy to detect from the query words alone. Also, through unsupervised learning we established that there are categories we are unable to determine, in contrast with other classes that were not considered initially but appear naturally after the clustering process. This allowed us to establish that the combination of supervised and unsupervised learning is a good alternative for finding users' goals. With supervised learning we can identify user interest given certain established goals and categories; with unsupervised learning we can validate the goals and categories used, refine them, and select the ones most appropriate to users' needs.
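Both perspectives can be sketched with off-the-shelf tools, assuming scikit-learn; the queries and goal labels below are toy placeholders, not the paper's taxonomy or data.

# A minimal sketch of the two perspectives above: supervised classification of
# queries into goals, and unsupervised clustering of the same queries.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.cluster import KMeans

queries = [
    "buy cheap laptop", "flight tickets to lima", "order pizza online",
    "what causes rain", "history of rome", "how do vaccines work",
    "facebook login", "gmail inbox", "university of chile homepage",
]
goals = [
    "transactional", "transactional", "transactional",
    "informational", "informational", "informational",
    "navigational", "navigational", "navigational",
]

vec = TfidfVectorizer()
X = vec.fit_transform(queries)

# supervised: learn the goal of a query from labeled examples
clf = LinearSVC().fit(X, goals)
print(clf.predict(vec.transform(["book hotel in madrid"])))

# unsupervised: cluster the same queries; clusters may reveal groupings
# that the predefined goals and categories do not capture
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
for q, label in zip(queries, km.labels_):
    print(label, q)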
Algorithmica | 1999
Ricardo A. Baeza-Yates; Gonzalo Navarro
We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = Ω(log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e., whenever mk = O(log n)), where m is the pattern length and k < m is the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O(log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O(mk/w n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps, and others, at essentially the same search cost. We then explore other novel techniques to cope with longer patterns. We show how to partition the pattern into short subpatterns which can be searched with fewer errors using the simple automaton, to obtain an average cost close to O(\sqrt{mk/w} n). Moreover, we allow the superimposition of many subpatterns in a single automaton, obtaining near O(\sqrt{mk/(\sigma w)} n) average complexity, where σ is the alphabet size.
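The bit-parallel simulation can be illustrated with the textbook Shift-And formulation with k errors, sketched below; this shows the general technique, not the paper's particular packing of automaton states.

# A minimal sketch of bit-parallel approximate string matching: simulate the
# NFA for the pattern with k allowed errors using one bit mask per error level.
def approx_search(pattern, text, k):
    """Return end positions in `text` where `pattern` matches with at most
    `k` errors (insertions, deletions, substitutions)."""
    m = len(pattern)
    B = {}                                    # B[c]: positions of c in the pattern
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << j)
    R = [(1 << i) - 1 for i in range(k + 1)]  # one automaton row per error level
    match_bit = 1 << (m - 1)
    hits = []
    for pos, c in enumerate(text):
        mask = B.get(c, 0)
        old = R[0]
        R[0] = ((R[0] << 1) | 1) & mask       # exact-match row
        for i in range(1, k + 1):
            tmp = R[i]
            # match | insertion | (substitution or deletion)
            R[i] = (((R[i] << 1) & mask) | old | ((old | R[i - 1]) << 1) | 1)
            old = tmp
        if R[k] & match_bit:
            hits.append(pos)
    return hits

print(approx_search("survey", "surgery and surveys", k=1))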
ACM Transactions on Information Systems | 1997
Gonzalo Navarro; Ricardo A. Baeza-Yates
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007
Omar Alonso; Michael Gertz; Ricardo A. Baeza-Yates
Combinatorial Pattern Matching | 1994
Ricardo A. Baeza-Yates; Walter Cunto; Udi Manber; Sun Wu