Publication


Featured research published by Sergei Vassilvitskii.


International World Wide Web Conference | 2011

Counting triangles and the curse of the last reducer

Siddharth Suri; Sergei Vassilvitskii

The clustering coefficient of a node in a social network is a fundamental measure that quantifies how tightly-knit the community is around the node. Its computation can be reduced to counting the number of triangles incident on the particular node in the network. In case the graph is too big to fit into memory, this is a non-trivial task, and previous researchers showed how to estimate the clustering coefficient in this scenario. A different avenue of research is to perform the computation in parallel, spreading it across many machines. In recent years MapReduce has emerged as a de facto programming paradigm for parallel computation on massive data sets. The main focus of this work is to give MapReduce algorithms for counting triangles which we use to compute clustering coefficients. Our contributions are twofold. First, we describe a sequential triangle counting algorithm and show how to adapt it to the MapReduce setting. This algorithm achieves a 10-100x speedup over the naive approach. Second, we present a new algorithm designed specifically for the MapReduce framework. A key feature of this approach is that it allows for a smooth tradeoff between the memory available on each individual machine and the total memory available to the algorithm, while keeping the total work done constant. Moreover, this algorithm can use any triangle counting algorithm as a black box and distribute the computation across many machines. We validate our algorithms on real-world datasets comprising millions of nodes and over a billion edges. Our results show both algorithms effectively deal with skew in the degree distribution and lead to dramatic speedups over the naive implementation.
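
The sequential building block the abstract refers to can be sketched in a few lines. Below is a minimal Python sketch (not the paper's MapReduce code) that counts triangles by charging each edge to its lower-degree endpoint, a standard way to cope with skewed degree distributions; the function name and graph representation are illustrative.

    from collections import defaultdict

    def count_triangles(edges):
        # Build an undirected adjacency structure.
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)

        # Rank vertices by degree so that high-degree "hub" nodes do little work;
        # this is what keeps the running time manageable under skewed degrees.
        def rank(v):
            return (len(adj[v]), v)

        triangles = 0
        for u, v in edges:
            a, b = (u, v) if rank(u) < rank(v) else (v, u)
            # Count each triangle exactly once: look only at neighbors of the
            # lower-ranked endpoint that rank above the higher-ranked endpoint.
            for w in adj[a]:
                if rank(w) > rank(b) and w in adj[b]:
                    triangles += 1
        return triangles

    print(count_triangles([(1, 2), (2, 3), (1, 3), (3, 4)]))  # prints 1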


Very Large Data Bases | 2012

Scalable k-means++

Bahman Bahmani; Benjamin Moseley; Andrea Vattani; Ravi Kumar; Sergei Vassilvitskii

Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
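
For context, the sequential k-means++ seeding that k-means|| parallelizes can be sketched as follows; this is the k-pass baseline described in the abstract, not the paper's distributed oversampling procedure, and the helper names are illustrative.

    import random

    def kmeans_pp_init(points, k):
        # D^2-weighting: each new center is sampled with probability
        # proportional to its squared distance from the nearest chosen center.
        def sq_dist(p, q):
            return sum((a - b) ** 2 for a, b in zip(p, q))

        centers = [random.choice(points)]
        while len(centers) < k:
            d2 = [min(sq_dist(p, c) for c in centers) for p in points]
            r = random.uniform(0, sum(d2))
            acc = 0.0
            for p, w in zip(points, d2):
                acc += w
                if acc >= r:
                    centers.append(p)
                    break
        return centers

    pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
    print(kmeans_pp_init(pts, 3))

Each pass through the while loop touches every point, which is the sequential bottleneck; k-means|| instead oversamples many candidate centers per round for only a few rounds and then reclusters the candidates down to k.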


Symposium on Computational Geometry | 2006

How slow is the k-means method?

David Arthur; Sergei Vassilvitskii

The k-means method is an old but popular clustering algorithm known for its observed speed and its simplicity. Until recently, however, no meaningful theoretical bounds were known on its running time. In this paper, we demonstrate that the worst-case running time of k-means is superpolynomial by improving the best known lower bound from Ω(n) iterations to 2^Ω(√n).
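
For reference, the iterations being counted are those of the standard k-means (Lloyd) loop, sketched below under the assumption that points are coordinate tuples; the paper's bound concerns how many such iterations the method can need before no assignment changes.

    def lloyd(points, centers, max_iters=100000):
        def sq_dist(p, q):
            return sum((a - b) ** 2 for a, b in zip(p, q))

        assignment = None
        for it in range(max_iters):
            # Assignment step: send each point to its nearest center.
            new_assignment = [min(range(len(centers)),
                                  key=lambda j: sq_dist(p, centers[j]))
                              for p in points]
            if new_assignment == assignment:
                return centers, it  # converged; `it` is the iteration count the bound is about
            assignment = new_assignment
            # Update step: move each center to the mean of its cluster.
            for j in range(len(centers)):
                cluster = [p for p, a in zip(points, assignment) if a == j]
                if cluster:
                    centers[j] = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
        return centers, max_iters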


International World Wide Web Conference | 2010

Generalized distances between rankings

Ravi Kumar; Sergei Vassilvitskii

Spearman's footrule and Kendall's tau are two well-established distances between rankings. They, however, fail to take into account concepts crucial to evaluating a result set in information retrieval: element relevance and positional information. That is, changing the rank of a highly relevant document should result in a higher penalty than changing the rank of an irrelevant document; a similar logic holds for the top versus the bottom of the result ordering. In this work, we extend both of these metrics to those with position and element weights, and show that a variant of the Diaconis-Graham inequality still holds: the two generalized measures remain within a constant factor of each other for all permutations. We continue by extending the element weights into a distance metric between elements. For example, in search evaluation, swapping the order of two nearly duplicate results should result in little penalty, even if these two are highly relevant and appear at the top of the list. We extend the distance measures to this more general case and show that they remain within a constant factor of each other. We conclude by conducting simple experiments on web search data with the proposed measures. Our experiments show that the weighted generalizations are more robust and consistent with each other than their unweighted counterparts.
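
To make the two baseline distances and the flavour of the weighted extension concrete, here is a small Python sketch; the element-weighted footrule shown is only an illustrative variant, not the paper's exact generalization, which also handles position weights and element similarities.

    from itertools import combinations

    def footrule(sigma, tau):
        # Spearman's footrule: total displacement of each element between rankings.
        pos_t = {x: i for i, x in enumerate(tau)}
        return sum(abs(i - pos_t[x]) for i, x in enumerate(sigma))

    def kendall_tau(sigma, tau):
        # Kendall's tau distance: number of pairs the two rankings order differently.
        pos_t = {x: i for i, x in enumerate(tau)}
        return sum(1 for x, y in combinations(sigma, 2) if pos_t[x] > pos_t[y])

    def weighted_footrule(sigma, tau, weight):
        # Illustrative element-weighted variant: moving a heavy (highly relevant)
        # element costs more than moving a light one.
        pos_t = {x: i for i, x in enumerate(tau)}
        return sum(weight.get(x, 1.0) * abs(i - pos_t[x])
                   for i, x in enumerate(sigma))

    a, b = ["d1", "d2", "d3", "d4"], ["d2", "d1", "d3", "d4"]
    print(footrule(a, b), kendall_tau(a, b))       # prints 2 1
    print(weighted_footrule(a, b, {"d1": 5.0}))    # prints 6.0: swapping the relevant doc costs more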


ACM Symposium on Parallel Algorithms and Architectures | 2011

Filtering: a method for solving graph problems in MapReduce

Silvio Lattanzi; Benjamin Moseley; Siddharth Suri; Sergei Vassilvitskii

The MapReduce framework is currently the de facto standard used throughout both industry and academia for petabyte scale data analysis. As the input to a typical MapReduce computation is large, one of the key requirements of the framework is that the input cannot be stored on a single machine and must be processed in parallel. In this paper we describe a general algorithmic design technique in the MapReduce framework called filtering. The main idea behind filtering is to reduce the size of the input in a distributed fashion so that the resulting, much smaller, problem instance can be solved on a single machine. Using this approach we give new algorithms in the MapReduce framework for a variety of fundamental graph problems for sufficiently dense graphs. Specifically, we present algorithms for minimum spanning trees, maximal matchings, approximate weighted matchings, approximate vertex and edge covers and minimum cuts. In all of these cases, we parameterize our algorithms by the amount of memory available on the machines allowing us to show tradeoffs between the memory available and the number of MapReduce rounds. For each setting we will show that even if the machines are only given substantially sublinear memory, our algorithms run in a constant number of MapReduce rounds. To demonstrate the practical viability of our algorithms we implement the maximal matching algorithm that lies at the core of our analysis and show that it achieves a significant speedup over the sequential version.
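
As a rough single-machine simulation of the filtering idea (assuming distinct edge weights and, as in the paper, per-machine memory that is large relative to the number of nodes), the sketch below repeatedly partitions the edges, keeps only each partition's spanning forest, and solves the shrunken instance directly; it illustrates the technique rather than reproducing the paper's MapReduce algorithms.

    import random

    def kruskal_forest(edges):
        # Minimum spanning forest of a weighted edge list (w, u, v) via
        # Kruskal's algorithm with union-find.
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        forest = []
        for w, u, v in sorted(edges):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                forest.append((w, u, v))
        return forest

    def filtered_mst(edges, machine_capacity):
        # Filtering round: split the edges into machine-sized chunks and keep only
        # each chunk's spanning forest. A discarded edge closes a cycle within its
        # chunk as the heaviest edge, so it can never be an MST edge.
        edges = list(edges)
        while len(edges) > machine_capacity:
            random.shuffle(edges)
            chunks = [edges[i:i + machine_capacity]
                      for i in range(0, len(edges), machine_capacity)]
            survivors = [e for chunk in chunks for e in kruskal_forest(chunk)]
            if len(survivors) >= len(edges):
                break  # memory too small relative to the node count to make progress
            edges = survivors
        return kruskal_forest(edges)

    graph = [(1, 'a', 'b'), (2, 'b', 'c'), (3, 'a', 'c'), (4, 'c', 'd'), (5, 'b', 'd')]
    print(filtered_mst(graph, machine_capacity=3))  # the MST: the edges of weight 1, 2, 4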


Very Large Data Bases | 2012

Densest subgraph in streaming and MapReduce

Bahman Bahmani; Ravi Kumar; Sergei Vassilvitskii

The problem of finding locally dense components of a graph is an important primitive in data analysis, with wide-ranging applications from community mining to spam detection and the discovery of biological network modules. In this paper we present new algorithms for finding the densest subgraph in the streaming model. For any ε > 0, our algorithms make O(log_{1+ε} n) passes over the input and find a subgraph whose density is guaranteed to be within a factor 2(1 + ε) of the optimum. Our algorithms are also easily parallelizable and we illustrate this by realizing them in the MapReduce model. In addition we perform extensive experimental evaluation on massive real-world graphs showing the performance and scalability of our algorithms in practice.
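
The pass structure can be illustrated with the following greedy peeling sketch, written in the spirit of the paper's approach: every sweep removes all nodes whose degree is at most 2(1+ε) times the current density, which is what bounds the number of passes at O(log_{1+ε} n). This is a single-machine illustration, not the streaming or MapReduce implementation itself.

    def densest_subgraph(edges, eps=0.1):
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

        def density(a):
            # Number of edges divided by number of nodes.
            return (sum(len(n) for n in a.values()) / 2) / len(a) if a else 0.0

        best_nodes, best_density = set(adj), density(adj)
        while adj:
            rho = density(adj)
            if rho > best_density:
                best_nodes, best_density = set(adj), rho
            # One pass: peel every node of degree <= 2(1+eps) * current density.
            low = {u for u, nbrs in adj.items() if len(nbrs) <= 2 * (1 + eps) * rho}
            adj = {u: nbrs - low for u, nbrs in adj.items() if u not in low}
        return best_nodes, best_density

    clique = [(i, j) for i in range(5) for j in range(i + 1, 5)]
    tail = [(0, 'x'), ('x', 'y')]
    print(densest_subgraph(clique + tail))  # the 5-clique, density 2.0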


International Conference on Robotics and Automation | 2002

A complete, local and parallel reconfiguration algorithm for cube style modular robots

Sergei Vassilvitskii; Mark H. Yim; John W. Suh

We present a complete, local, and parallel reconfiguration algorithm for metamorphic robots made up of Telecubes, six-degree-of-freedom cube-shaped modules currently being developed at PARC. We show that by using 2 × 2 × 2 meta-modules we can achieve completeness of the reconfiguration space using only local rules. Furthermore, this reconfiguration can be done in place and massively in parallel with many simultaneous module movements. Finally, we present a loose quadratic upper bound on the total number of module movements required by the algorithm.


Workshop on Internet and Network Economics | 2009

Bidding for Representative Allocations for Display Advertising

Arpita Ghosh; Preston McAfee; Kishore Papineni; Sergei Vassilvitskii

Display advertising has traditionally been sold via guaranteed contracts: a guaranteed contract is a deal between a publisher and an advertiser to allocate a certain number of impressions over a certain period, for a pre-specified price per impression. However, as spot markets for display ads, such as the RightMedia Exchange, have grown in prominence, the selection of advertisements to show on a given page is increasingly being chosen based on price, using an auction. As the number of participants in the exchange grows, the price of an impression becomes a signal of its value. This correlation between price and value means that a seller implementing the contract through bidding should offer the contract buyer a range of prices, and not just the cheapest impressions necessary to fulfill its demand. Implementing a contract using a range of prices is akin to creating a mutual fund of advertising impressions, and requires randomized bidding. We characterize which allocations can be implemented with randomized bidding, namely those where the desired share obtained at each price is a non-increasing function of price. In addition, we provide a full characterization of when a set of campaigns is compatible and how to implement them with randomized bidding strategies.
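
The characterization can be illustrated with a toy model in which a bid of b wins every impression priced below b: to win a target share s(p) of the impressions priced at p, the bid distribution must satisfy P(bid >= p) = s(p), which is only possible when s is non-increasing in price. The Python sketch below (illustrative names, simplified single-slot model assumed) samples bids accordingly.

    import random

    def make_randomized_bidder(prices, shares):
        # Given target shares s(p) on a price grid, return a bid sampler with
        # P(bid >= p) = s(p). That probability is a survival function, so the
        # shares must be non-increasing in price, matching the characterization.
        if any(s2 > s1 + 1e-12 for s1, s2 in zip(shares, shares[1:])):
            raise ValueError("target shares must be non-increasing in price")

        def sample_bid():
            u = random.random()
            # Bid high enough to cover every price p whose target share is >= u.
            covered = [p for p, s in zip(prices, shares) if s >= u]
            return max(covered) if covered else 0.0

        return sample_bid

    bidder = make_randomized_bidder(prices=[1.0, 2.0, 3.0], shares=[0.9, 0.5, 0.2])
    bids = [bidder() for _ in range(100000)]
    for p, s in zip([1.0, 2.0, 3.0], [0.9, 0.5, 0.2]):
        won = sum(b >= p for b in bids) / len(bids)
        print(f"price {p}: target share {s}, empirical share {won:.2f}")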


Electronic Commerce | 2010

Optimal online assignment with forecasts

Erik Vee; Sergei Vassilvitskii; Jayavel Shanmugasundaram

Motivated by the allocation problem facing publishers in display advertising, we formulate the online assignment with forecast problem, a version of the online allocation problem where the algorithm has access to random samples from the future set of arriving vertices. We provide a solution that allows us to serve Internet users in an online manner that is provably nearly optimal. Our technique applies to the forecast version of a large class of online assignment problems, such as online bipartite matching, allocation, and budgeted bidders, in which we wish to minimize the value of some convex objective function subject to a set of linear supply and demand constraints. Our solution utilizes a particular subspace of the dual space, allowing us to describe the optimal primal solution implicitly in space proportional to the demand side of the input graph. More importantly, it allows us to prove that representing the primal solution using such a compact allocation plan yields a robust online algorithm which makes near-optimal online decisions. Furthermore, unlike the primal solution, we show that the compact allocation plan produced by considering only a sampled version of the original problem generalizes to produce a near-optimal solution on the full problem instance.
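
As a rough illustration of what a compact allocation plan buys (one dual number per contract instead of a full primal assignment of the forecast sample), here is a toy Python sketch; the subgradient-style fitting loop and all names are stand-ins assumed for illustration, not the paper's construction.

    import random

    def serve(values, beta):
        # Online decision from the compact plan: give the impression to the
        # contract with the largest value net of its dual price, if positive.
        best = max(values, key=lambda a: values[a] - beta[a])
        return best if values[best] - beta[best] > 0 else None

    def fit_allocation_plan(sample_users, capacities, rounds=200, step=0.05):
        # Fit one dual price per contract on the forecast sample so that greedy
        # serving with those prices roughly respects the contract capacities.
        beta = {a: 0.0 for a in capacities}
        for _ in range(rounds):
            filled = {a: 0 for a in capacities}
            for values in sample_users:          # values: {contract: value for this user}
                chosen = serve(values, beta)
                if chosen is not None:
                    filled[chosen] += 1
            for a in capacities:
                # Raise the price of oversubscribed contracts, lower it otherwise.
                beta[a] = max(0.0, beta[a] + step * (filled[a] - capacities[a]) / len(sample_users))
        return beta

    sample = [{"A": random.random(), "B": random.random()} for _ in range(500)]
    plan = fit_allocation_plan(sample, capacities={"A": 100, "B": 300})
    print(plan)                                  # the whole plan: one number per contract
    print(serve({"A": 0.9, "B": 0.4}, plan))     # online decision for a newly arriving user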


International World Wide Web Conference | 2009

Nearest-neighbor caching for content-match applications

Sandeep Pandey; Andrei Z. Broder; Flavio Chierichetti; Vanja Josifovski; Ravi Kumar; Sergei Vassilvitskii

Motivated by contextual advertising systems and other web applications involving efficiency-accuracy tradeoffs, we study similarity caching. Here, a cache hit is said to occur if the requested item is similar but not necessarily equal to some cached item. We study two objectives that dictate the efficiency-accuracy tradeoff and provide our caching policies for these objectives. By conducting extensive experiments on real data we show similarity caching can significantly improve the efficiency of contextual advertising systems, with minimal impact on accuracy. Inspired by the above, we propose a simple generative model that embodies two fundamental characteristics of page requests arriving to advertising systems, namely, long-range dependences and similarities. We provide theoretical bounds on the gains of similarity caching in this model and demonstrate these gains empirically by fitting the actual data to the model.
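
A minimal illustration of the core idea (with an illustrative class name, a plain linear scan instead of a nearest-neighbor index, and simple LRU eviction standing in for the paper's policies) looks like this:

    from collections import OrderedDict

    class SimilarityCache:
        # LRU cache where a lookup "hits" if any cached key is within `threshold`
        # of the query under `dist`, not only on an exact match.

        def __init__(self, capacity, dist, threshold):
            self.capacity, self.dist, self.threshold = capacity, dist, threshold
            self.store = OrderedDict()            # cached key -> cached result

        def get(self, query):
            for key in list(self.store):          # linear scan; real systems use an ANN index
                if self.dist(query, key) <= self.threshold:
                    self.store.move_to_end(key)   # refresh LRU position
                    return self.store[key]        # approximate hit
            return None

        def put(self, query, result):
            self.store[query] = result
            self.store.move_to_end(query)
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)    # evict the least recently used entry

    cache = SimilarityCache(capacity=2, dist=lambda a, b: abs(a - b), threshold=0.1)
    cache.put(1.00, "ads for page ~1.0")
    print(cache.get(1.05))   # hit: close enough to a cached page
    print(cache.get(2.00))   # miss: nothing similar is cached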
