Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Shivaram Venkataraman is active.

Publication


Featured research published by Shivaram Venkataraman.


Communications of the ACM | 2016

Apache Spark: a unified engine for big data processing

Matei Zaharia; Reynold S. Xin; Patrick Wendell; Tathagata Das; Michael Armbrust; Ankur Dave; Xiangrui Meng; Josh Rosen; Shivaram Venkataraman; Michael J. Franklin; Ali Ghodsi; Joseph E. Gonzalez; Scott Shenker; Ion Stoica

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
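
As a minimal illustration of the "unified engine" idea, the PySpark sketch below runs a batch aggregation and then queries the same data interactively with SQL through one API. The input path and column names are illustrative placeholders, not from the paper.

```python
# A minimal PySpark sketch of the "unified engine" idea: the same DataFrame
# API serves batch jobs and interactive queries. The input path and column
# names here are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-example").getOrCreate()

# Batch: load a dataset and compute an aggregate.
events = spark.read.json("events.json")          # hypothetical input file
counts = events.groupBy("country").count()
counts.show()

# Interactive: the same DataFrame can be queried ad hoc with SQL.
events.createOrReplaceTempView("events")
spark.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country").show()

spark.stop()
```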


Symposium on Cloud Computing | 2012

Cake: enabling high-level SLOs on shared storage systems

Andrew Wang; Shivaram Venkataraman; Sara Alspaugh; Randy H. Katz; Ion Stoica

Cake is a coordinated, multi-resource scheduler for shared distributed storage environments with the goal of achieving both high throughput and bounded latency. Cake uses a two-level scheduling scheme to enforce high-level service-level objectives (SLOs). First-level schedulers control consumption of resources such as disk and CPU. These schedulers (1) provide mechanisms for differentiated scheduling, (2) split large requests into smaller chunks, and (3) limit the number of outstanding device requests, which together allow for effective control over multi-resource consumption within the storage system. Cake's second-level scheduler coordinates the first-level schedulers to map high-level SLO requirements into actual scheduling parameters. These parameters are dynamically adjusted over time to enforce high-level performance specifications for changing workloads. We evaluate Cake using multiple workloads derived from real-world traces. Our results show that Cake allows application programmers to explore the latency vs. throughput trade-off by setting different high-level performance requirements on their workloads. Furthermore, we show that using Cake has concrete economic and business advantages, reducing provisioning costs by up to 50% for a consolidated workload and reducing the completion time of an analytics cycle by up to 40%.
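
Cake's code is not shown here; the sketch below only illustrates the two-level structure the abstract describes, with a feedback loop from an SLO to per-resource shares. All class, method, and parameter names are hypothetical.

```python
# Illustrative sketch (not Cake's actual code) of a two-level scheduling
# structure: first-level schedulers throttle individual resources, and a
# second-level controller adjusts their shares to chase a high-level SLO.
# All names and the feedback rule are hypothetical.

class FirstLevelScheduler:
    """Controls one resource (e.g. disk or CPU) via a proportional share."""
    def __init__(self, resource, share=0.5, max_outstanding=4):
        self.resource = resource
        self.share = share                      # fraction given to latency-sensitive traffic
        self.max_outstanding = max_outstanding  # cap on in-flight device requests

class SecondLevelScheduler:
    """Maps an SLO (target 99th-percentile latency) to first-level shares."""
    def __init__(self, slo_latency_ms, schedulers):
        self.slo_latency_ms = slo_latency_ms
        self.schedulers = schedulers

    def adjust(self, observed_p99_ms):
        # Simple feedback rule: if latency misses the SLO, give the
        # latency-sensitive class a larger share of every resource.
        error = observed_p99_ms - self.slo_latency_ms
        step = 0.05 if error > 0 else -0.05
        for s in self.schedulers:
            s.share = min(1.0, max(0.1, s.share + step))

disk = FirstLevelScheduler("disk")
cpu = FirstLevelScheduler("cpu")
controller = SecondLevelScheduler(slo_latency_ms=100, schedulers=[disk, cpu])
controller.adjust(observed_p99_ms=140)   # SLO missed: shares increase
print(disk.share, cpu.share)
```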


European Conference on Computer Systems | 2013

Presto: distributed machine learning and graph processing with sparse matrices

Shivaram Venkataraman; Erik Bodzsar; Indrajit Roy; Alvin AuYoung; Robert Schreiber

It is cumbersome to write machine learning and graph algorithms in data-parallel models such as MapReduce and Dryad. We observe that these algorithms are based on matrix computations and, hence, are inefficient to implement with the restrictive programming and communication interface of such frameworks. In this paper we show that array-based languages such as R [3] are suitable for implementing complex algorithms and can outperform current data parallel solutions. Since R is single-threaded and does not scale to large datasets, we have built Presto, a distributed system that extends R and addresses many of its limitations. Presto efficiently shares sparse structured data, can leverage multi-cores, and dynamically partitions data to mitigate load imbalance. Our results show the promise of this approach: many important machine learning and graph algorithms can be expressed in a single framework and are substantially faster than those in Hadoop and Spark.
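
The snippet below is a conceptual illustration of the partitioned sparse-matrix idea behind Presto, written in Python with SciPy rather than Presto's R interface: a sparse matrix is split into row partitions and each partition computes its slice of a matrix-vector product independently, which is the step Presto distributes across workers.

```python
# Conceptual sketch of Presto's distributed-array idea: split a sparse matrix
# into row partitions and compute each partition's slice of A @ x separately
# (here sequentially; Presto ships partitions to workers). This is a Python/
# SciPy illustration, not Presto's actual R API.
import numpy as np
from scipy import sparse

A = sparse.random(1000, 1000, density=0.01, random_state=0, format="csr")
x = np.random.default_rng(0).random(1000)

num_partitions = 4
row_splits = np.array_split(np.arange(A.shape[0]), num_partitions)

# Each partition computes its slice of A @ x independently.
partial = [A[rows, :] @ x for rows in row_splits]
y = np.concatenate(partial)

assert np.allclose(y, A @ x)
```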


Very Large Data Bases | 2014

Quantifying eventual consistency with PBS

Peter Bailis; Shivaram Venkataraman; Michael J. Franklin; Joseph M. Hellerstein; Ion Stoica

Data store replication results in a fundamental trade-off between operation latency and data consistency. At the weak end of the consistency spectrum is eventual consistency, which provides no limit to the staleness of data returned. However, anecdotally, eventual consistency is often “good enough” for practitioners given its latency and availability benefits. In this work, we explain why eventually consistent systems are regularly acceptable in practice, analyzing both the staleness of data they return and the latency benefits they offer. We introduce Probabilistically Bounded Staleness (PBS), a consistency model which provides expected bounds on data staleness with respect to both versions and wall clock time. We derive a closed-form solution for versioned staleness as well as model real-time staleness under Internet-scale production workloads for a large class of quorum-replicated, Dynamo-style stores. Using PBS, we measure the latency–consistency trade-off for partial, non-overlapping quorum systems, including limited multi-object operations. We quantitatively demonstrate how and why eventually consistent systems frequently return consistent data within tens of milliseconds while offering significant latency benefits.
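
The small calculation below is in the spirit of the versioned-staleness closed form the abstract mentions: the chance that a read quorum of size R misses the write quorums of each of the last k writes over N replicas with write quorum W, assuming independent quorum choices. The formula follows my reading of the PBS model; treat it as an approximate sketch, not a reproduction of the paper's derivation.

```python
# Sketch of a PBS-style "k-staleness" estimate: probability that a read
# quorum avoids the write quorum of each of the last k committed versions,
# assuming independent, uniformly chosen quorums (an approximation).
from math import comb

def p_k_stale(n, r, w, k):
    """Probability that a read returns a value older than the last k writes."""
    miss_one = comb(n - w, r) / comb(n, r)   # read quorum avoids one write quorum
    return miss_one ** k

# Dynamo-style defaults: N=3 replicas, R=W=1 (fast but weakly consistent).
for k in (1, 2, 3):
    print(f"N=3, R=1, W=1, k={k}: P(stale) = {p_k_stale(3, 1, 1, k):.3f}")
```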


Symposium on Cloud Computing | 2017

Occupy the cloud: distributed computing for the 99%

Eric Jonas; Qifan Pu; Shivaram Venkataraman; Ion Stoica; Benjamin Recht

Distributed computing remains inaccessible to a large number of users, in spite of many open source platforms and extensive commercial offerings. While distributed computation frameworks have moved beyond a simple map-reduce model, many users are still left to struggle with complex cluster management and configuration tools, even for running simple embarrassingly parallel jobs. We argue that stateless functions represent a viable platform for these users, eliminating cluster management overhead while fulfilling the promise of elasticity. Furthermore, using our prototype implementation, PyWren, we show that this model is general enough to implement a number of distributed computing models, such as BSP, efficiently. Extrapolating from recent trends in network bandwidth and the advent of disaggregated storage, we suggest that stateless functions are a natural fit for data processing in future computing environments.
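
The sketch below shows the stateless-function programming model the paper argues for, following the executor.map pattern PyWren's documentation uses. Exact module and method names may differ across PyWren versions and backends, so treat this as illustrative rather than canonical.

```python
# Sketch of the stateless-function model: an ordinary Python function mapped
# over inputs, with each invocation running as a serverless function (e.g.
# AWS Lambda). Names follow PyWren's documented pattern but may vary by
# version; this is an illustration, not a verified snippet.
import pywren

def my_function(x):
    # An embarrassingly parallel task: no cluster, no state, just a function.
    return x * x

pwex = pywren.default_executor()                  # serverless-backed executor
futures = pwex.map(my_function, list(range(10)))  # one invocation per input
results = [f.result() for f in futures]
print(results)
```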


Congress on Evolutionary Computation | 2010

Scaling eCGA model building via data-intensive computing

Abhishek Verma; Xavier Llorà; Shivaram Venkataraman; David E. Goldberg; Roy H. Campbell

This paper shows how the extended compact genetic algorithm can be scaled using data-intensive computing techniques such as MapReduce. Two different frameworks (Hadoop and MongoDB) are used to deploy MapReduce implementations of the compact and extended compact genetic algorithms. Results show that both are good choices to deal with large-scale problems as they can scale with the number of commodity machines, as opposed to previous efforts with other techniques that either required specialized high-performance hardware or shared memory environments.
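
To make the MapReduce decomposition concrete, the sketch below expresses one piece of GA model building, fitness evaluation over a large population, as a map step and a reduce step using only Python built-ins. The objective function and record layout are stand-ins; on Hadoop or MongoDB it is this map/reduce structure that gets distributed.

```python
# Illustrative map/reduce decomposition of fitness evaluation: the mapper
# scores individuals and the reducer aggregates population statistics.
# OneMax is used as a stand-in objective; all names are hypothetical.
from functools import reduce
import random

random.seed(0)
population = [[random.randint(0, 1) for _ in range(32)] for _ in range(1000)]

def mapper(individual):
    return {"fitness": sum(individual), "count": 1}

def reducer(acc, rec):
    return {"fitness": acc["fitness"] + rec["fitness"],
            "count": acc["count"] + rec["count"]}

stats = reduce(reducer, map(mapper, population))
print("mean fitness:", stats["fitness"] / stats["count"])
```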


Annual Computer Security Applications Conference | 2010

Forenscope: a framework for live forensics

Ellick M. Chan; Shivaram Venkataraman; Francis M. David; Amey Chaugule; Roy H. Campbell

Current post-mortem cyber-forensic techniques may cause significant disruption to the evidence gathering process by breaking active network connections and unmounting encrypted disks. Although newer live forensic analysis tools can preserve active state, they may taint evidence by leaving footprints in memory. To help address these concerns we present Forenscope, a framework that allows an investigator to examine the state of an active system without the effects of taint or forensic blurriness caused by analyzing a running system. We show how Forenscope can fit into accepted workflows to improve the evidence gathering process. Forenscope preserves the state of the running system and allows running processes, open files, encrypted filesystems and open network sockets to persist during the analysis process. Forenscope has been tested on live systems to show that it does not operationally disrupt critical processes and that it can perform an analysis in less than 15 seconds while using only 125 KB of memory. We show that Forenscope can detect stealth rootkits, neutralize threats and expedite the investigation process by finding evidence in memory.


International Conference on Data Engineering | 2017

KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics

Evan R. Sparks; Shivaram Venkataraman; Tomer Kaftan; Michael J. Franklin; Benjamin Recht

Modern advanced analytics applications make use of machine learning techniques and contain multiple steps of domain-specific and general-purpose processing with high resource requirements. We present KeystoneML, a system that captures and optimizes end-to-end large-scale machine learning applications for high-throughput training in a distributed environment with a high-level API. This approach offers increased ease of use and higher performance over existing systems for large scale learning. We demonstrate the effectiveness of KeystoneML in achieving high quality statistical accuracy and scalable training using real world datasets in several domains.
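
KeystoneML's API is in Scala and is not reproduced here; as a rough analogue of the "pipeline of featurization and learning stages" idea, the Python snippet below chains a preprocessing stage and a model into a single scikit-learn pipeline and trains it end to end. The dataset and stages are illustrative only and are not from the paper.

```python
# Analogue of an end-to-end ML pipeline (not KeystoneML's actual API):
# chain a featurization/preprocessing stage with a learning stage and
# train the whole pipeline as one unit.
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),                  # featurization stage
    ("model", LogisticRegression(max_iter=1000))  # learning stage
])

pipeline.fit(X, y)
print("training accuracy:", pipeline.score(X, y))
```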


Knowledge Discovery and Data Mining | 2016

Matrix Computations and Optimization in Apache Spark

Reza Bosagh Zadeh; Xiangrui Meng; Alexander Ulanov; Burak Yavuz; Li Pu; Shivaram Venkataraman; Evan R. Sparks; Aaron Staple; Matei Zaharia

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be run on the cluster, while keeping vector operations local to the driver. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster, while running code written decades ago for a single core. Another example is our Spark port of the popular TFOCS optimization package, originally built for MATLAB, which allows for solving linear programs as well as a variety of other convex programs. We conclude with a comprehensive set of benchmarks for hardware accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM. The contributions described in this paper are already merged into Apache Spark and available on Spark installations by default, and commercially supported by a slew of companies which provide further services.
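
The distributed SVD described above is exposed through Spark's mllib distributed-matrix API; a minimal PySpark version looks roughly like the sketch below, with tiny random in-memory data used purely for illustration.

```python
# Minimal sketch of a distributed SVD via Spark's RowMatrix: the matrix is
# stored as an RDD of rows on the cluster, while the small factors are
# returned to the driver. Data here is random and purely illustrative.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("svd-example").getOrCreate()
rows = spark.sparkContext.parallelize(np.random.rand(100, 10).tolist())

mat = RowMatrix(rows)
svd = mat.computeSVD(k=5, computeU=True)

print(svd.s)                      # singular values (local vector on the driver)
print(svd.V.numRows, svd.V.numCols)
spark.stop()
```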


International Conference on Management of Data | 2016

SparkR: Scaling R Programs with Spark

Shivaram Venkataraman; Zongheng Yang; Davies Liu; Eric Liang; Hossein Falaki; Xiangrui Meng; Reynold S. Xin; Ali Ghodsi; Michael J. Franklin; Ion Stoica; Matei Zaharia

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.
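
SparkR itself is an R package; to keep these sketches in a single language, the snippet below expresses the same DataFrame workflow through PySpark. SparkR exposes analogous DataFrame operations from the R shell, backed by the same distributed engine, so only the small aggregated result is collected locally.

```python
# Python analogue of the SparkR DataFrame workflow: build a distributed
# DataFrame, aggregate on the cluster, and collect only the small result.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sparkr-analogue").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)], ["name", "age"]
)

result = df.groupBy().agg(F.avg("age").alias("mean_age")).collect()
print(result[0]["mean_age"])
spark.stop()
```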

Collaboration


Dive into Shivaram Venkataraman's collaboration.

Top Co-Authors

Ion Stoica, University of California
Benjamin Recht, University of California
Matei Zaharia, Massachusetts Institute of Technology
Aurojit Panda, University of California
Evan R. Sparks, University of California
Reynold S. Xin, University of California
Ali Ghodsi, University of California