Publication


Featured research published by Sameer Agarwal.


European Conference on Computer Systems | 2013

BlinkDB: queries with bounded errors and bounded response times on very large data

Sameer Agarwal; Barzan Mozafari; Aurojit Panda; Henry Milner; Samuel Madden; Ion Stoica

In this paper, we present BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. We evaluate BlinkDB against the well-known TPC-H benchmark and a real-world analytic workload derived from Conviva Inc., a company that manages video distribution over the Internet. Our experiments on a 100-node cluster show that BlinkDB can answer queries on up to 17 TB of data in less than 2 seconds (over 200x faster than Hive), within an error of 2-10%.
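
The dynamic sample selection idea can be illustrated with a small sketch: given a set of pre-built stratified samples of increasing size, choose the smallest one whose predicted error satisfies the user's bound. This is an illustrative approximation under a simple CLT-style error model, not BlinkDB's actual selection logic; the sample names and the `Sample`/`pick_sample` helpers are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Sample:
    name: str        # hypothetical identifier for a pre-built stratified sample
    rows: int        # number of rows in the sample
    variance: float  # estimated variance of the aggregated column

def predicted_error(sample: Sample, confidence_z: float = 1.96) -> float:
    """Rough CLT-style error estimate for a mean computed over a random sample."""
    return confidence_z * math.sqrt(sample.variance / sample.rows)

def pick_sample(samples: list[Sample], error_bound: float) -> Sample:
    """Return the smallest sample whose predicted error meets the bound,
    falling back to the largest available sample otherwise."""
    for s in sorted(samples, key=lambda s: s.rows):
        if predicted_error(s) <= error_bound:
            return s
    return max(samples, key=lambda s: s.rows)

# Example: three stratified samples of increasing size over the same column.
samples = [Sample("s_1pct", 10_000, 4.0),
           Sample("s_5pct", 50_000, 4.0),
           Sample("s_20pct", 200_000, 4.0)]
print(pick_sample(samples, error_bound=0.01).name)  # -> s_20pct
```

A latency-bounded variant of the same idea would instead pick the largest sample whose estimated processing time still fits the response-time requirement.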


European Conference on Computer Systems | 2011

Scarlett: coping with skewed content popularity in mapreduce clusters

Ganesh Ananthanarayanan; Sameer Agarwal; Srikanth Kandula; Albert G. Greenberg; Ion Stoica; Duke Harlan; Ed Harris

To improve data availability and resilience, MapReduce frameworks use file systems that replicate data uniformly. However, analysis of job logs from a large production cluster shows wide disparity in data popularity. Machines and racks storing popular content become bottlenecks, thereby increasing the completion times of jobs accessing this data even when there are machines with spare cycles in the cluster. To address this problem, we present Scarlett, a system that replicates blocks based on their popularity. By accurately predicting file popularity and working within hard bounds on additional storage, Scarlett causes minimal interference to running jobs. Trace-driven simulations and experiments in two popular MapReduce frameworks (Hadoop, Dryad) show that Scarlett effectively alleviates hotspots and can speed up jobs by 20.2%.
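
A minimal sketch of the core policy idea, popularity-proportional replication under a hard storage budget, is shown below; the budget model, the example numbers, and the `plan_replication` helper are illustrative assumptions rather than Scarlett's actual algorithm.

```python
def plan_replication(block_popularity: dict[str, float],
                     base_replicas: int = 3,
                     max_extra_blocks: int = 100) -> dict[str, int]:
    """Assign extra replicas to the most popular blocks, respecting a hard
    cap on additional storage (counted here in extra block copies)."""
    plan = {block: base_replicas for block in block_popularity}
    budget = max_extra_blocks
    # Greedily give extra replicas to the hottest blocks first.
    for block in sorted(block_popularity, key=block_popularity.get, reverse=True):
        if budget == 0:
            break
        # Assumption: roughly one extra replica per expected concurrent reader.
        extra = min(budget, int(block_popularity[block]))
        plan[block] += extra
        budget -= extra
    return plan

# Example: block "b1" is read by ~4 concurrent jobs, "b2" by ~1, "b3" rarely.
print(plan_replication({"b1": 4.0, "b2": 1.0, "b3": 0.2}))
# -> {'b1': 7, 'b2': 4, 'b3': 3}
```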


Very Large Data Bases | 2012

Blink and it's done: interactive queries on very large data

Sameer Agarwal; Anand Padmanabha Iyer; Aurojit Panda; Samuel Madden; Barzan Mozafari; Ion Stoica

In this demonstration, we present BlinkDB, a massively parallel, sampling-based approximate query processing framework for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make reasonable decisions in the absence of perfect answers. BlinkDB extends the Hive/HDFS stack and can handle the same set of SPJA (selection, projection, join and aggregate) queries as supported by these systems. BlinkDB provides real-time answers along with statistical error guarantees, and can scale to petabytes of data and thousands of machines in a fault-tolerant manner. Our experiments using the TPC-H benchmark and on an anonymized real-world video content distribution workload from Conviva Inc. show that BlinkDB can execute a wide range of queries up to 150x faster than Hive on MapReduce and 10-150x faster than Shark (Hive on Spark) over tens of terabytes of data stored across 100 machines, all with an error of 2-10%.
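
The user-facing guarantee, an answer plus an error bar computed from a sample, can be sketched as follows for a simple AVG. This is a plain CLT-based illustration under uniform sampling; it does not show BlinkDB's SQL extensions or its estimators for joins and other SPJA queries, and the `approx_avg` helper is hypothetical.

```python
import math
import random

def approx_avg(values, sample_fraction=0.01, z=1.96):
    """Approximate AVG over a uniform random sample, with a CLT-based
    error bar. `values` stands in for a column of a very large table."""
    n = max(1, int(len(values) * sample_fraction))
    sample = random.sample(values, n)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / max(1, n - 1)
    error = z * math.sqrt(var / n)
    return mean, error

random.seed(0)
data = [random.gauss(100, 15) for _ in range(1_000_000)]
estimate, err = approx_avg(data, sample_fraction=0.01)
print(f"AVG ≈ {estimate:.2f} ± {err:.2f} (95% confidence)")
```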


International Conference on Management of Data | 2015

G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data

Kai Zeng; Sameer Agarwal; Ankur Dave; Michael Armbrust; Ion Stoica

Nearly 15 years ago, Hellerstein, Haas and Wang proposed online aggregation (OLA), a technique that allows users to (1) observe the progress of a query by showing iteratively refined approximate answers, and (2) stop the query execution once its result achieves the desired accuracy. In this demonstration, we present G-OLA, a novel mini-batch execution model that generalizes OLA to support general OLAP queries with arbitrarily nested aggregates using efficient delta maintenance techniques. We have implemented G-OLA in FluoDB, a parallel online query execution framework built on top of the Spark cluster computing framework that can scale to massive data sets. We will demonstrate FluoDB on a cluster of 100 machines processing roughly 10TB of real-world session logs from a video-sharing website. Using an ad optimization scenario and an A/B testing scenario, we will enable users to perform real-time data analysis via web-based query consoles and dashboards.
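
The online-aggregation loop that G-OLA generalizes can be pictured with a simplified sketch: process the input one mini-batch at a time, refine a running estimate and its error bar, and stop once the target accuracy is reached. The `online_avg` generator below handles only a flat average under an assumed CLT error model; G-OLA's delta maintenance for arbitrarily nested aggregates is considerably more involved.

```python
import math
import random

def online_avg(stream_of_batches, target_error=0.05, z=1.96):
    """Iteratively refine an AVG estimate one mini-batch at a time and
    stop early once the CLT-based error bar falls below the target."""
    count, total, sum_sq = 0, 0.0, 0.0
    for batch in stream_of_batches:
        for x in batch:
            count += 1
            total += x
            sum_sq += x * x
        mean = total / count
        var = max(0.0, sum_sq / count - mean * mean)
        error = z * math.sqrt(var / count)
        yield mean, error            # progressively refined answer
        if error <= target_error:
            return                   # desired accuracy reached: stop early

random.seed(1)
batches = ([random.gauss(50, 10) for _ in range(20_000)] for _ in range(20))
for step, (mean, err) in enumerate(online_avg(batches), 1):
    print(f"after batch {step}: {mean:.2f} ± {err:.2f}")
```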


International Conference on Management of Data | 2012

Recurring job optimization in SCOPE

Nicolas Bruno; Sameer Agarwal; Srikanth Kandula; Bing Shi; Ming-Chuan Wu; Jingren Zhou

An increasing number of applications require distributed data storage and processing infrastructure over large clusters of commodity hardware for critical business decisions. The MapReduce programming model [2] helps programmers write distributed applications on large clusters, but requires dealing with complex implementation details (e.g., reasoning about data distribution and overall system configuration). Recent proposals, such as SCOPE [1], raise the level of abstraction by providing a declarative language that not only increases programming productivity but is also amenable to sophisticated optimization. As in traditional database systems, such optimization relies on detailed data statistics to choose the best execution plan in a cost-based fashion. However, in contrast to database systems, it is very difficult to obtain and maintain good-quality statistics in a highly distributed environment that contains tens of thousands of machines. First, it is very challenging to efficiently combine a large number of individually collected, complex local statistics (e.g., histograms, distinct values) in a statistically meaningful way. Second, calculating statistics typically requires scans over the full dataset; such an operation can be overwhelmingly expensive for terabytes of data. Third, even if we can collect statistics for base tables, the nature of user scripts, which typically rely on user-defined code, makes the problem of statistical inference beyond selection and projection even more difficult during optimization. Finally, the cost of user-defined code is another important source of information for cost-based query optimization. Such information is crucial for the optimizer to choose the optimal degree of parallelism for the final execution plan and to decide when and where to execute the user code, yet it is challenging, if not impossible, to estimate its actual cost before running the query on the real dataset. We leverage the fact that a large proportion of scripts in this environment are parametric and recur over a time series of data, with input datasets that arrive regularly.
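
One way to picture the recurring-job observation is a cache of per-operator statistics recorded from earlier runs of the same parametric script and consulted when optimizing the next run. The sketch below is purely hypothetical: the signature scheme, the `@date` parameter convention, and the statistics layout are assumptions, not SCOPE's interfaces.

```python
import hashlib
import json

# Hypothetical store of runtime statistics gathered from earlier executions
# of the same recurring script (e.g. cardinalities and CPU cost per operator).
_stats_store: dict[str, dict[str, dict[str, float]]] = {}

def script_signature(script_text: str) -> str:
    """Identify a recurring script by its text, ignoring the date parameter."""
    normalized = script_text.replace("@date", "<param>")
    return hashlib.sha256(normalized.encode()).hexdigest()

def record_run(script_text: str, observed: dict[str, dict[str, float]]) -> None:
    """Remember per-operator statistics observed while running this script."""
    _stats_store[script_signature(script_text)] = observed

def statistics_for(script_text: str) -> dict[str, dict[str, float]] | None:
    """Return statistics from a previous run of the same script, if any."""
    return _stats_store.get(script_signature(script_text))

daily_job = "SELECT region, COUNT(*) FROM clicks WHERE day = @date GROUP BY region"
record_run(daily_job, {"scan_clicks": {"rows": 2.1e9, "cpu_s": 840.0},
                       "agg_region": {"rows": 120.0, "cpu_s": 3.0}})
print(json.dumps(statistics_for(daily_job), indent=2))
```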


Knowledge Discovery and Data Mining | 2013

A general bootstrap performance diagnostic

Ariel Kleiner; Ameet Talwalkar; Sameer Agarwal; Ion Stoica; Michael I. Jordan

As datasets become larger, more complex, and more available to diverse groups of analysts, it would be quite useful to be able to automatically and generically assess the quality of estimates, much as we are able to automatically train and evaluate predictive models such as classifiers. However, despite the fundamental importance of estimator quality assessment in data analysis, this task has eluded highly automatic solutions. While the bootstrap provides perhaps the most promising step in this direction, its level of automation is limited by the difficulty of evaluating its finite-sample performance and even its asymptotic consistency. Thus, we present here a general diagnostic procedure which directly and automatically evaluates the accuracy of the bootstrap's outputs, determining whether or not the bootstrap is performing satisfactorily when applied to a given dataset and estimator. We show that our proposed diagnostic is effective via an extensive empirical evaluation on a variety of estimators and simulated and real datasets, including a real-world query workload from Conviva, Inc. involving 1.7TB of data (i.e., approximately 0.5 billion data points).
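
For context, the sketch below shows a plain bootstrap standard-error estimate together with a crude stability check across repeated bootstrap runs. This is not the diagnostic proposed in the paper, only a generic illustration of the underlying question of whether the bootstrap's output can be trusted on a given dataset and estimator; the function names and the tolerance threshold are arbitrary assumptions.

```python
import random
import statistics

def bootstrap_stderr(data, estimator, n_resamples=500, rng=random):
    """Plain bootstrap estimate of an estimator's standard error."""
    n = len(data)
    replicates = [estimator(rng.choices(data, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(replicates)

def crude_stability_check(data, estimator, repeats=5, tolerance=0.2):
    """Run the bootstrap several times and flag it as questionable if the
    standard-error estimates disagree by more than `tolerance` (relative)."""
    estimates = [bootstrap_stderr(data, estimator) for _ in range(repeats)]
    spread = (max(estimates) - min(estimates)) / statistics.mean(estimates)
    return spread <= tolerance, estimates

random.seed(2)
data = [random.gauss(0, 1) for _ in range(2_000)]
ok, estimates = crude_stability_check(data, statistics.mean)
print("bootstrap looks stable:", ok, [round(e, 4) for e in estimates])
```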


International Conference on Management of Data | 2016

iOLAP: Managing Uncertainty for Efficient Incremental OLAP

Kai Zeng; Sameer Agarwal; Ion Stoica

The size of data and the complexity of analytics continue to grow along with the need for timely and cost-effective analysis. However, the growth of computation power cannot keep up with the growth of data. This calls for a paradigm shift from the traditional batch OLAP processing model to an incremental OLAP processing model. In this paper, we propose iOLAP, an incremental OLAP query engine that provides a smooth trade-off between query accuracy and latency, and fulfills a full spectrum of user requirements, from approximate but timely query execution to more traditional accurate query execution. iOLAP enables interactive incremental query processing using a novel mini-batch execution model: given an OLAP query, iOLAP first randomly partitions the input dataset into smaller sets (mini-batches) and then incrementally processes these mini-batches by executing a delta update query on each one, where each subsequent delta update query computes an update based on the output of the previous one. The key idea behind iOLAP is a novel delta update algorithm that models delta processing as an uncertainty propagation problem and minimizes the recomputation during each subsequent delta update by minimizing the uncertainties in the partial (including intermediate) query results. We implement iOLAP on top of Apache Spark and have successfully demonstrated it at scale on over 100 machines. Extensive experiments on a multitude of queries and datasets demonstrate that iOLAP can deliver approximate query answers for complex OLAP queries orders of magnitude faster than traditional OLAP engines, while continuously delivering updates every few seconds.
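
The delta-update pattern at the heart of the mini-batch model can be sketched for a simple grouped average: each mini-batch is applied as a delta against running per-group state, so every update touches only the new rows. The class below is an illustration only and does not model iOLAP's uncertainty propagation over partial results.

```python
import random
from collections import defaultdict

class IncrementalGroupedAvg:
    """Maintain per-group running sums/counts and apply each mini-batch as a
    delta update, so each batch only touches its own rows."""
    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def apply_delta(self, mini_batch):
        """mini_batch: iterable of (group_key, value) pairs."""
        for key, value in mini_batch:
            self.sums[key] += value
            self.counts[key] += 1

    def current_answer(self):
        """Current approximate per-group average, refined after every batch."""
        return {key: self.sums[key] / self.counts[key] for key in self.sums}

random.seed(3)
state = IncrementalGroupedAvg()
for _ in range(4):  # four mini-batches drawn from a random partition of the input
    batch = [(random.choice("AB"), random.gauss(10, 2)) for _ in range(1_000)]
    state.apply_delta(batch)
    print({k: round(v, 2) for k, v in state.current_answer().items()})
```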


Networked Systems Design and Implementation | 2012

Re-optimizing data-parallel computing

Sameer Agarwal; Srikanth Kandula; Nicolas Bruno; Ming-Chuan Wu; Ion Stoica; Jingren Zhou


International Conference on Management of Data | 2014

Knowing when you're wrong: building fast and reliable approximate query processing systems

Sameer Agarwal; Henry Milner; Ariel Kleiner; Ameet Talwalkar; Michael I. Jordan; Samuel Madden; Barzan Mozafari; Ion Stoica


Archive | 2012

Parallel data computing optimization

Nicolas Bruno; Jingren Zhou; Srikanth Kandula; Sameer Agarwal; Ming-Chuan Wu

Collaboration


Dive into Sameer Agarwal's collaborations.

Top Co-Authors

Ion Stoica

University of California

Samuel Madden

Massachusetts Institute of Technology

Ariel Kleiner

University of California

Aurojit Panda

University of California
