Tamas Sarlos | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Tamas Sarlos is active.

Explore More

Publication

Featured researches published by Tamas Sarlos.

international world wide web conferences | 2014

On estimating the average degree

Anirban Dasgupta; Ravi Kumar; Tamas Sarlos

Networks are characterized by nodes and edges. While there has been a spate of recent work on estimating the number of nodes in a network, the edge-estimation question appears to be largely unaddressed. In this work we consider the problem of estimating the average degree of a large network using efficient random sampling, where the number of nodes is not known to the algorithm. We propose a new estimator for this problem that relies on access to node samples under a prescribed distribution. Next, we show how to efficiently realize this ideal estimator in a random walk setting. Our estimator has a natural and simple implementation using random walks; we bound its performance in terms of the mixing time of the underlying graph. We then show that our estimators are both provably and practically better than many natural estimators for the problem. Our work contrasts with existing theoretical work on estimating average degree, which assume that a uniform random sample of nodes is available and the number of nodes is known.

international world wide web conferences | 2013

Optimal hashing schemes for entity matching

Nilesh N. Dalvi; Vibhor Rastogi; Anirban Dasgupta; Anish Das Sarma; Tamas Sarlos

In this paper, we consider the problem of devising blocking schemes for entity matching. There is a lot of work on blocking techniques for supporting various kinds of predicates, e.g. exact matches, fuzzy string-similarity matches, and spatial matches. However, given a complex entity matching function in the form of a Boolean expression over several such predicates, we show that it is an important and non-trivial problem to combine the individual blocking techniques into an efficient blocking scheme for the entity matching function, a problem that has not been studied previously.n In this paper, we make fundamental contributions to this problem. We consider an abstraction for modeling complex entity matching functions as well as blocking schemes. We present several results of theoretical and practical interest for the problem. We show that in general, the problem of computing the optimal blocking strategy is NP-hard in the size of the DNF formula describing the matching function. We also present several algorithms for computing the exact optimal strategies (with exponential complexity, but often feasible in practice) as well as fast approximation algorithms. We experimentally demonstrate over commercially used rule-based matching systems over real datasets at Yahoo!, as well as synthetic datasets, that our blocking strategies can be an order of magnitude faster than the baseline methods, and our algorithms can efficiently find good blocking strategies.

international world wide web conferences | 2016

On Sampling Nodes in a Network

Flavio Chiericetti; Anirban Dasgupta; Ravi Kumar; Silvio Lattanzi; Tamas Sarlos

Random walk is an important tool in many graph mining applications including estimating graph parameters, sampling portions of the graph, and extracting dense communities. In this paper we consider the problem of sampling nodes from a large graph according to a prescribed distribution by using random walk as the basic primitive. Our goal is to obtain algorithms that make a small number of queries to the graph but output a node that is sampled according to the prescribed distribution. Focusing on the uniform distribution case, we study the query complexity of three algorithms and show a near-tight bound expressed in terms of the parameters of the graph such as average degree and the mixing time. Both theoretically and empirically, we show that some algorithms are preferable in practice than the others. We also extend our study to the problem of sampling nodes according to some polynomial function of their degrees; this has implications for designing efficient algorithms for applications such as triangle counting.

international world wide web conferences | 2017

Linear Additive Markov Processes

Ravi Kumar; Maithra Raghu; Tamas Sarlos; Andrew Tomkins

We introduce LAMP: the Linear Additive Markov Process. Transitions in LAMP may be influenced by states visited in the distant history of the process, but unlike higher-order Markov processes, LAMP retains an efficient parameterization. LAMP also allows the specific dependence on history to be learned efficiently from data. We characterize some theoretical properties of LAMP, including its steady-state and mixing time. We then give an algorithm based on alternating minimization to learn LAMP models from data. Finally, we perform a series of real-world experiments to show that LAMP is more powerful than first-order Markov processes, and even holds its own against deep sequential models (LSTMs) with a negligible increase in parameter complexity.

web search and data mining | 2018

Orienteering Algorithms for Generating Travel Itineraries

Zachary Friggstad; Sreenivas Gollapudi; Kostas Kollias; Tamas Sarlos; Chaitanya Swamy; Andrew Tomkins

We study the problem of automatically and efficiently generating itineraries for users who are on vacation. We focus on the common case, wherein the trip duration is more than a single day. Previous efficient algorithms based on greedy heuristics suffer from two problems. First, the itineraries are often unbalanced, with excellent days visiting top attractions followed by days of exclusively lower-quality alternatives. Second, the trips often re-visit neighborhoods repeatedly in order to cover increasingly low-tier points of interest. Our primary technical contribution is an algorithm that addresses both these problems by maximizing the quality of the worst day. We give theoretical results showing that this algorithm»s competitive factor is within a factor two of the guarantee of the best available algorithm for a single day, across many variations of the problem. We also give detailed empirical evaluations using two distinct datasets:(a) anonymized Google historical visit data and(b) Foursquare public check-in data. We show first that the overall utility of our itineraries is almost identical to that of algorithms specifically designed to maximize total utility, while the utility of the worst day of our itineraries is roughly twice that obtained from other approaches. We then turn to evaluation based on human raters who score our itineraries only slightly below the itineraries created by human travel experts with deep knowledge of the area.

international world wide web conferences | 2017

Caching with Dual Costs

Anirban Dasgupta; Ravi Kumar; Tamas Sarlos

Caching mechanisms in distributed and social settings face the issue that the items can frequently change, requiring the cached versions to be updated to maintain coherence. There is thus a trade-off between incurring cache misses on read requests and cache hits on update requests. Motivated by this we consider the following dual cost variant of the classical caching problem: each request for an item can be either a read or a write. If the request is read and the item is not in the cache, then a read-miss cost is incurred and if the request is write and the item is in the cache, then a write-hit cost is incurred. The goal is to design a caching algorithm that minimizes the sum of read-miss and write-hit costs. We study online and offline algorithms for this problem. For the online version of the problem, we obtain an efficient algorithm whose cost is provably close to near-optimal cost. This algorithm builds on online algorithms for classical caching and metrical task systems, using them as black boxes. For the offline version, we obtain an optimal deterministic algorithm that is based on a minimum cost flow. Experiments on real and synthetic data show that our online algorithm incurs much less cost compared to natural baselines, while utilizing cache even better; furthermore, they also show that the online algorithm is close to the offline optimum.

conference on information and knowledge management | 2013

Permutation indexing: fast approximate retrieval from large corpora

Maxim Gurevich; Tamas Sarlos

Inverted indexing is a ubiquitous technique used in retrieval systems including web search. Despite its popularity, it has a drawback - query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: (a) partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids. Then, query results are retrieved by fetching few small chunks of these sequences. There is a price though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, recall, is controlled by the level of redundancy. The more space is allocated for the permutation index the higher is the recall. We analyze permutation indexing both theoretically under simplified document and query models, and empirically on a realistic document and query collections. We show that although permutation indexing can not replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.

international conference on machine learning | 2013