Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Benjamin Moseley is active.

Publications


Featured research published by Benjamin Moseley.


Very Large Data Bases | 2012

Scalable k-means++

Bahman Bahmani; Benjamin Moseley; Andrea Vattani; Ravi Kumar; Sergei Vassilvitskii

Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
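The oversampling idea behind k-means|| can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' reference implementation; the function name and the parameters `ell` (oversampling factor) and `rounds` are our own labels:

```python
import random

def kmeans_parallel_init(points, k, ell, rounds):
    # Squared distance from p to its nearest current center.
    def d2(p, centers):
        return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)

    centers = [random.choice(points)]
    for _ in range(rounds):
        cost = sum(d2(p, centers) for p in points)
        if cost == 0:
            break
        # Oversampling step: each point is kept independently with
        # probability proportional to its distance cost, so one pass
        # over the data can be split across machines.
        centers += [p for p in points
                    if random.random() < min(1.0, ell * d2(p, centers) / cost)]
    # Reduce the oversampled candidate set to exactly k centers; the paper
    # reclusters candidates by weight, a farthest-point sweep suffices here.
    final = [centers[0]]
    while len(final) < k:
        final.append(max(centers, key=lambda p: d2(p, final)))
    return final
```

The key contrast with k-means++ is that each round judges all points against a fixed center set, so the sampling pass needs no sequential dependence between points.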


ACM Symposium on Parallel Algorithms and Architectures | 2011

Filtering: a method for solving graph problems in MapReduce

Silvio Lattanzi; Benjamin Moseley; Siddharth Suri; Sergei Vassilvitskii

The MapReduce framework is currently the de facto standard used throughout both industry and academia for petabyte scale data analysis. As the input to a typical MapReduce computation is large, one of the key requirements of the framework is that the input cannot be stored on a single machine and must be processed in parallel. In this paper we describe a general algorithmic design technique in the MapReduce framework called filtering. The main idea behind filtering is to reduce the size of the input in a distributed fashion so that the resulting, much smaller, problem instance can be solved on a single machine. Using this approach we give new algorithms in the MapReduce framework for a variety of fundamental graph problems for sufficiently dense graphs. Specifically, we present algorithms for minimum spanning trees, maximal matchings, approximate weighted matchings, approximate vertex and edge covers and minimum cuts. In all of these cases, we parameterize our algorithms by the amount of memory available on the machines allowing us to show tradeoffs between the memory available and the number of MapReduce rounds. For each setting we will show that even if the machines are only given substantially sublinear memory, our algorithms run in a constant number of MapReduce rounds. To demonstrate the practical viability of our algorithms we implement the maximal matching algorithm that lies at the core of our analysis and show that it achieves a significant speedup over the sequential version.
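The filtering idea can be illustrated with the maximal-matching case that the authors implement. This is a single-machine toy of our own, with a hypothetical `memory_limit` parameter standing in for per-machine memory, not the paper's exact algorithm:

```python
import random

def filtered_maximal_matching(edges, memory_limit):
    """Repeatedly sample a subset of edges that fits on one machine,
    compute a greedy matching on the sample, and filter out every edge
    incident to a matched vertex. The surviving edge set shrinks each
    round until it fits in memory and can be finished sequentially."""
    matching, matched = [], set()
    edges = list(edges)
    while len(edges) > memory_limit:
        sample = random.sample(edges, memory_limit)
        for u, v in sample:          # greedy matching on the sample
            if u not in matched and v not in matched:
                matching.append((u, v))
                matched.update((u, v))
        # Filtering step: discard edges already resolved by the matching.
        edges = [(u, v) for u, v in edges
                 if u not in matched and v not in matched]
    for u, v in edges:               # remainder fits on one machine
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching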


ACM Symposium on Parallel Algorithms and Architectures | 2011

On scheduling in map-reduce and flow-shops

Benjamin Moseley; Anirban Dasgupta; Ravi Kumar; Tamas Sarlos

The map-reduce paradigm is now standard in industry and academia for processing large-scale data. In this work, we formalize job scheduling in map-reduce as a novel generalization of the two-stage classical flexible flow shop (FFS) problem: instead of a single task at each stage, a job now consists of a set of tasks per stage. For this generalization, we consider the problem of minimizing the total flowtime and give an efficient 12-approximation in the offline setting and an online (1+ε)-speed O(1/ε²)-competitive algorithm. Motivated by map-reduce, we revisit the two-stage flow shop problem, where we give a dynamic program for minimizing the total flowtime when all jobs arrive at the same time. If there is a fixed number of job types, the dynamic program yields a PTAS; it is also a QPTAS when the processing times of jobs are polynomially bounded. This gives the first improvement in approximation of flowtime for the two-stage flow shop problem since the trivial 2-approximation algorithm of Gonzalez and Sahni [29] in 1978, and the first known approximation for the FFS problem. We then consider the generalization of the two-stage FFS problem to the unrelated machines case, where we give an offline 6-approximation and an online (1+ε)-speed O(1/ε⁴)-competitive algorithm.


SIGACT News | 2011

A tutorial on amortized local competitiveness in online scheduling

Sungjin Im; Benjamin Moseley; Kirk Pruhs

Recently the use of potential functions to analyze online scheduling algorithms has become popular [19, 7, 29, 13, 31, 4, 30, 3, 21, 15, 14, 28, 12, 2, 5, 6, 9, 11, 23, 33, 24, 8, 17, 16, 25, 1, 20, 26, 22, 18]. These potential functions are used to show that a particular online algorithm is locally competitive in an amortized sense. Algorithm analyses using potential functions are sometimes criticized as seeming to be black magic, as the formal proofs do not require, and commonly do not contain, any discussion of the intuition behind the design of the potential function. Sometimes, as in the case of the first couple of uses of potential functions in the online scheduling literature, this is because the authors arrived at the potential function by trial and error, and there was no cohesive underlying intuition guiding the development. However, now that tens of online scheduling papers have used potential functions, one can see that a "standard" potential function has emerged that seems to be applicable to a wide range of problems. The use of this standard potential function to prove amortized local competitiveness can no longer be considered to be magical, and is a learnable technique. Our main goal here is to give a tutorial teaching this technique to readers with some modest prior knowledge of scheduling, online problems, and the concept of worst-case performance ratios. Online Scheduling: We consider online scheduling problems where jobs/tasks arrive at a server (e.g. a web server, a database server, an operating system, etc.) over time. Throughout this paper N will denote the total number of jobs and jobs are indexed J…


ACM Symposium on Parallel Algorithms and Architectures | 2010

Scheduling jobs with varying parallelizability to reduce variance

Anupam Gupta; Sungjin Im; Ravishankar Krishnaswamy; Benjamin Moseley; Kirk Pruhs

We give a (2+ε)-speed O(1)-competitive algorithm for scheduling jobs with arbitrary speed-up curves for the ℓ2 norm of flow. We give a similar result for the broadcast setting with varying page sizes.


Parallel Computing | 2015

Fast Greedy Algorithms in MapReduce and Streaming

Ravi Kumar; Benjamin Moseley; Sergei Vassilvitskii; Andrea Vattani

Greedy algorithms are practitioners’ best friends—they are intuitive, are simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging since the greedy choice is inherently sequential, and it is not clear how to take advantage of the extra processing power. Our main result is a powerful sampling technique that aids in parallelization of sequential algorithms. Armed with this primitive, we then adapt a broad class of greedy algorithms to the MapReduce paradigm; this class includes maximum cover and submodular maximization subject to p-system constraint problems. Our method yields efficient algorithms that run in a logarithmic number of rounds while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm. We begin with algorithms for modular maximization subject to a matroid constraint and then extend this approach to obtain approximation algorithms for submodular maximization subject to knapsack or p-system constraints.
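The inherently sequential primitive in question can be made concrete with maximum coverage, one of the problems the paper handles. The sketch below is our own illustration of the standard sequential greedy whose choices the paper's sampling technique reproduces approximately in parallel:

```python
def greedy_max_cover(sets, k):
    """Sequential greedy for maximum coverage: repeatedly pick the set
    covering the most still-uncovered elements. Each pick depends on all
    previous picks, which is exactly what makes naive distribution hard;
    the MapReduce adaptation replaces the exact global argmax with
    sampling while staying close to this algorithm's solution quality."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(sets, key=lambda s: len(s - covered))
        if not best - covered:       # nothing new to cover
            break
        chosen.append(best)
        covered |= best
    return covered, chosen
```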


Workshop on Approximation and Online Algorithms | 2009

Longest wait first for broadcast scheduling [extended abstract]

Chandra Chekuri; Sungjin Im; Benjamin Moseley

We consider online algorithms for broadcast scheduling. In the pull-based broadcast model there are n unit-sized pages of information at a server and requests arrive online for pages. When the server transmits a page p, all outstanding requests for that page are satisfied. There is a lower bound of Ω(n) on the competitiveness of online algorithms to minimize average flow-time; therefore we consider resource augmentation analysis in which the online algorithm is given extra speed over the adversary. The longest-wait-first (LWF) algorithm is a natural algorithm that has been shown to have good empirical performance [2]. Edmonds and Pruhs showed that LWF is 6-speed O(1)-competitive using a novel yet complex analysis; they also showed that LWF is not O(1)-competitive with less than 1.618-speed. In this paper we make two main contributions to the analysis of LWF and broadcast scheduling. We give an intuitive and easy to understand analysis of LWF which shows that it is O(1/ε²)-competitive for average flow-time with (4+ε) speed. We show that a natural extension of LWF is O(1)-speed O(1)-competitive for more general objective functions such as average delay-factor and L_k norms of delay-factor (for fixed k). These metrics generalize average flow-time and L_k norms of flow-time respectively, and ours are the first non-trivial results for these objective functions in broadcast scheduling.
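The LWF rule is simple to state, and a toy discrete-time simulation (our own illustrative code, assuming unit-sized pages and a unit-speed server) makes the broadcast model concrete:

```python
def lwf_simulate(requests, horizon):
    """Longest-Wait-First broadcast scheduling in unit time steps.
    requests: list of (arrival_time, page). At each step the server
    broadcasts the page whose outstanding requests have the largest
    total accumulated waiting time; one broadcast satisfies all of
    them at once. Returns the flow time of each satisfied request."""
    pending, flows = [], []
    for t in range(horizon):
        pending += [(a, p) for a, p in requests if a == t]
        if not pending:
            continue
        wait = {}                                # total wait per page
        for a, p in pending:
            wait[p] = wait.get(p, 0) + (t - a)
        page = max(wait, key=wait.get)           # longest-wait-first
        flows += [t + 1 - a for a, p in pending if p == page]
        pending = [(a, p) for a, p in pending if p != page]
    return flows
```

Note how a single broadcast satisfies every outstanding request for the chosen page; this many-requests-per-transmission structure is what separates broadcast scheduling from standard job scheduling.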


European Symposium on Algorithms | 2009

Minimizing Maximum Response Time and Delay Factor in Broadcast Scheduling

Chandra Chekuri; Sungjin Im; Benjamin Moseley

We consider online algorithms for pull-based broadcast scheduling. In this setting there are n pages of information at a server and requests for pages arrive online. When the server serves (broadcasts) a page p, all outstanding requests for that page are satisfied. We study two related metrics, namely maximum response time (waiting time) and maximum delay-factor, and their weighted versions. We obtain the following results in the worst-case online competitive model. We show that FIFO (first-in first-out) is 2-competitive even when the page sizes are different. Previously this was known only for unit-sized pages [10] via a delicate argument. Our proof differs from [10] and is perhaps more intuitive. We give an online algorithm for maximum delay-factor that is O(1/ε²)-competitive with (1+ε)-speed for unit-sized pages and with (2+ε)-speed for different sized pages. This improves on the algorithm in [13] which required (2+ε)-speed and (4+ε)-speed respectively. In addition we show that the algorithm and analysis can be extended to obtain the same results for maximum weighted response time and delay factor. We show that a natural greedy algorithm modeled after LWF (Longest-Wait-First) is not O(1)-competitive for maximum delay factor with any constant speed even in the setting of standard scheduling with unit-sized jobs. This complements our upper bound and demonstrates the importance of the tradeoff made in our algorithm.


SIAM Journal on Computing | 2014

Online Scheduling with General Cost Functions

Sungjin Im; Benjamin Moseley; Kirk Pruhs

We consider a general online scheduling problem on a single machine with the objective of minimizing Σ_j w_j g(F_j), where w_j is the weight/importance of job J_j, F_j is the flow time of the job in the schedule, and g is an arbitrary non-decreasing cost function. Numerous natural scheduling objectives are special cases of this general objective. We show that the scheduling algorithm Highest Density First (HDF) is (2+ε)-speed O(1)-competitive for all cost functions g simultaneously. We give lower bounds that show the HDF algorithm and this analysis are essentially optimal. Finally, we show scalable algorithms are achievable in some special cases.
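The HDF rule itself is simple enough to show as a toy single-machine simulator. This is our own sketch with discrete unit time steps and preemption allowed, not code from the paper:

```python
def hdf_flow_times(jobs):
    """jobs: list of (release, size, weight). At each unit step, run the
    released, unfinished job with the highest density weight/size --
    the Highest Density First rule. Returns each job's flow time
    (completion time minus release time)."""
    remaining = [size for _, size, _ in jobs]
    completion = [None] * len(jobs)
    t = 0
    while any(r > 0 for r in remaining):
        avail = [i for i, (rel, _, _) in enumerate(jobs)
                 if rel <= t and remaining[i] > 0]
        if avail:
            i = max(avail, key=lambda j: jobs[j][2] / jobs[j][1])
            remaining[i] -= 1                    # run job i for one step
            if remaining[i] == 0:
                completion[i] = t + 1
        t += 1
    return [completion[i] - jobs[i][0] for i in range(len(jobs))]
```

The notable property stated in the abstract is that this one priority rule is competitive for every non-decreasing cost function g at once; the rule itself never looks at g.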


Very Large Data Bases | 2017

Local search methods for k-means with outliers

Shalmoli Gupta; Ravi Kumar; Kefu Lu; Benjamin Moseley; Sergei Vassilvitskii

We study the problem of k-means clustering in the presence of outliers. The goal is to cluster a set of data points to minimize the variance of the points assigned to the same cluster, with the freedom of ignoring a small set of data points that can be labeled as outliers. Clustering with outliers has received a lot of attention in the data processing community, but practical, efficient, and provably good algorithms remain unknown for the most popular k-means objective. Our work proposes a simple local search-based algorithm for k-means clustering with outliers. We prove that this algorithm achieves constant-factor approximate solutions and can be combined with known sketching techniques to scale to large data sets. Using empirical evaluation on both synthetic and large-scale real-world data, we demonstrate that the algorithm dominates recently proposed heuristic approaches for the problem.
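A minimal swap-based local search for this objective can be sketched as follows. This is an illustrative toy of our own, not the paper's exact procedure, and the parameter names are ours:

```python
def cost_with_outliers(points, centers, z):
    """k-means cost that ignores the z farthest points (the outliers)."""
    d = sorted(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in points)
    return sum(d[:len(d) - z])

def local_search_with_outliers(points, k, z, max_iters=50):
    """Start from arbitrary centers and accept any single-center swap
    that lowers the outlier-aware cost, until no swap improves."""
    centers = list(points[:k])
    cost = cost_with_outliers(points, centers, z)
    for _ in range(max_iters):
        improved = False
        for i in range(k):
            for p in points:          # try swapping center i for point p
                cand = centers[:i] + [p] + centers[i + 1:]
                c = cost_with_outliers(points, cand, z)
                if c < cost:
                    centers, cost, improved = cand, c, True
        if not improved:
            break
    return centers, cost
```

The outlier handling lives entirely in the cost function: dropping the z largest distances means a far-away point cannot force the search to waste a center on it.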

Collaboration


Dive into Benjamin Moseley's collaborations.

Top co-authors:

- Sungjin Im (University of California)
- Kirk Pruhs (University of Pittsburgh)
- Kefu Lu (Washington University in St. Louis)
- Eric Torng (Michigan State University)
- Gustavo Malkomes (Washington University in St. Louis)
- Jing Li (Washington University in St. Louis)
- Kunal Agrawal (Washington University in St. Louis)
- Andrea Vattani (University of California)