
Publications


Featured research published by Evan R. Sparks.


International Conference on Data Mining | 2013

MLI: An API for Distributed Machine Learning

Evan R. Sparks; Ameet Talwalkar; Virginia Smith; Jey Kottalam; Xinghao Pan; Joseph E. Gonzalez; Michael J. Franklin; Michael I. Jordan; Tim Kraska

MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
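
To make the data-centric style concrete, here is a minimal sketch, assuming Spark RDDs and the Breeze linear algebra library, of the kind of algorithm such an interface targets: batch gradient descent for logistic regression written as map/reduce operations over a distributed dataset. This illustrates the programming model only; it is not MLI's actual API.

```scala
// A minimal sketch of the data-centric programming style such an API
// targets: batch gradient descent for logistic regression, expressed as
// map/reduce operations over a distributed dataset. Illustration only,
// not MLI's actual interface.
import org.apache.spark.rdd.RDD
import breeze.linalg.DenseVector

object LogisticGradientDescent {
  // One iteration: aggregate the logistic-loss gradient across the cluster,
  // then take a gradient step locally on the driver.
  def step(data: RDD[(Double, DenseVector[Double])],
           w: DenseVector[Double],
           learningRate: Double): DenseVector[Double] = {
    val n = data.count().toDouble
    val gradient = data
      .map { case (label, features) =>
        val p = 1.0 / (1.0 + math.exp(-w.dot(features)))
        features * (p - label)          // per-example gradient contribution
      }
      .reduce(_ + _)                    // summed on the cluster
    w - gradient * (learningRate / n)
  }
}
```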


Symposium on Cloud Computing | 2015

Automating Model Search for Large Scale Machine Learning

Evan R. Sparks; Ameet Talwalkar; Daniel Haas; Michael J. Franklin; Michael I. Jordan; Tim Kraska

The proliferation of massive datasets, combined with the development of sophisticated analytical techniques, has enabled a wide variety of novel applications, such as improved product recommendations, automatic image tagging, and speech-driven interfaces. A major obstacle to supporting these predictive applications is the challenging and expensive process of identifying and training an appropriate predictive model. Recent efforts aiming to automate this process have focused on single-node implementations and have assumed that model training itself is a black box, limiting their usefulness for applications driven by large-scale datasets. In this work, we build upon these recent efforts and propose an architecture for automatic machine learning at scale comprising a cost-based cluster resource allocation estimator, advanced hyperparameter tuning techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching and optimal resource allocation. The result is TuPAQ, a component of the MLbase system that automatically finds and trains models for a user's predictive application, with quality comparable to models found using exhaustive strategies but an order of magnitude more efficiently than the standard baseline approach. TuPAQ scales to models trained on terabytes of data across hundreds of machines.
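
The bandit resource allocation idea can be illustrated with a successive-halving style loop: give every candidate configuration a small training budget, inspect partial validation quality, and reallocate resources to the most promising half. The sketch below is a hedged illustration of that idea in plain Scala; `Candidate` and the `train` callback are hypothetical placeholders, not TuPAQ's actual interfaces.

```scala
// A successive-halving style search loop: train every candidate briefly,
// keep the best half, and double the budget for the survivors. Candidate
// and the train callback are hypothetical placeholders.
case class Candidate(id: Int, hyperparams: Map[String, Double])

def modelSearch(candidates: Seq[Candidate],
                // returns validation accuracy after `budget` iterations
                train: (Candidate, Int) => Double,
                initialBudget: Int = 10): Candidate = {
  var pool = candidates
  var budget = initialBudget
  while (pool.size > 1) {
    val scored = pool.map(c => (c, train(c, budget)))
    pool = scored.sortBy { case (_, acc) => -acc }
                 .map(_._1)
                 .take(math.max(1, pool.size / 2))
    budget *= 2                         // survivors earn more resources
  }
  pool.head
}
```

In a full system, batching (evaluating several configurations in a single pass over the data) and cost-based cluster sizing would sit around a loop like this one.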


International Conference on Data Engineering | 2017

KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics

Evan R. Sparks; Shivaram Venkataraman; Tomer Kaftan; Michael J. Franklin; Benjamin Recht

Modern advanced analytics applications make use of machine learning techniques and contain multiple steps of domain-specific and general-purpose processing with high resource requirements. We present KeystoneML, a system that captures and optimizes end-to-end large-scale machine learning applications for high-throughput training in a distributed environment, exposed through a high-level API. This approach offers increased ease of use and higher performance than existing systems for large-scale learning. We demonstrate the effectiveness of KeystoneML in achieving high statistical accuracy and scalable training on real-world datasets in several domains.
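
The high-level API idea can be sketched as type-safe pipeline composition: each stage is a function from one type to another, and chaining stages lets the compiler verify that adjacent stages fit together. The trait and stages below are illustrative, assuming nothing beyond the Scala standard library; they are not KeystoneML's actual classes.

```scala
// An illustrative type-safe pipeline abstraction (not KeystoneML's actual
// classes): each stage maps A => B, and andThen chains stages so the
// compiler checks that adjacent types line up.
trait PipelineStage[A, B] { self =>
  def apply(in: A): B
  def andThen[C](next: PipelineStage[B, C]): PipelineStage[A, C] =
    new PipelineStage[A, C] { def apply(in: A): C = next(self(in)) }
}

// Example: tokenize text, then hash tokens into a fixed-size feature vector.
val tokenizer = new PipelineStage[String, Seq[String]] {
  def apply(in: String): Seq[String] = in.toLowerCase.split("\\s+").toSeq
}
val hasher = new PipelineStage[Seq[String], Array[Double]] {
  def apply(tokens: Seq[String]): Array[Double] = {
    val vec = new Array[Double](1024)
    tokens.foreach(t => vec((t.hashCode & 0x7fffffff) % 1024) += 1.0)
    vec
  }
}
val featurize = tokenizer.andThen(hasher)   // String => Array[Double]
```

Because the pipeline is declared as a whole before execution, a system built this way can inspect and optimize the end-to-end plan rather than each stage in isolation.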


Knowledge Discovery and Data Mining | 2016

Matrix Computations and Optimization in Apache Spark

Reza Bosagh Zadeh; Xiangrui Meng; Alexander Ulanov; Burak Yavuz; Li Pu; Shivaram Venkataraman; Evan R. Sparks; Aaron Staple; Matei Zaharia

We describe the matrix computations available in Apache Spark, a cluster programming framework. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When translating single-node algorithms to run on a distributed cluster, we observe that a simple idea is often enough: separate matrix operations from vector operations, ship the matrix operations to be run on the cluster, and keep vector operations local to the driver. In the case of the singular value decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster while running code written decades ago for a single core. Another example is our Spark port of the popular TFOCS optimization package, originally built for MATLAB, which can solve linear programs as well as a variety of other convex programs. We conclude with a comprehensive set of benchmarks for hardware-accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM. The contributions described in this paper are already merged into Apache Spark, available in Spark installations by default, and commercially supported by several companies that provide further services.
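
The distributed SVD described above is exposed through MLlib's `RowMatrix` API. A minimal usage sketch follows, assuming an existing `SparkContext` named `sc`; the row data stays on the cluster, while the small factors are returned to the driver.

```scala
// Distributed SVD via MLlib's RowMatrix: the rows form a distributed
// dataset, while the small factors s and V come back to the driver.
// Assumes an existing SparkContext named sc.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)
))
val mat = new RowMatrix(rows)

// Top-2 singular values and vectors; U remains a distributed RowMatrix.
val svd = mat.computeSVD(2, computeU = true)
println(svd.s)   // singular values, local to the driver
```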


International Conference on Big Data | 2015

Scientific computing meets big data technology: An astronomy use case

Zhao Zhang; K. Barbary; Frank Austin Nothaft; Evan R. Sparks; Oliver Zahn; Michael J. Franklin; David A. Patterson; S. Perlmutter

Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternative approach that uses Apache Spark, a modern big data platform, to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit built on Apache Spark. We then use the Kira toolkit to implement a Source Extractor application for astronomy images, called Kira SE. With Kira SE as the use case, we study the programming flexibility, dataflow richness, scheduling capacity, and performance of Apache Spark running on the EC2 cloud. By exploiting data locality, Kira SE achieves a 3.7× speedup over an equivalent C program when analyzing a 1 TB dataset using 512 cores on Amazon EC2. Furthermore, we show that by leveraging software originally designed for big data infrastructure, Kira SE achieves performance competitive with the C implementation running on the NERSC Edison supercomputer. Our experience with Kira indicates that emerging big data platforms such as Apache Spark are a performant alternative for many-task scientific applications.
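
The many-task pattern Kira exploits can be sketched in a few lines of Spark: treat each image file as one task, load the files as a distributed dataset, and apply an extraction function to each image in parallel. The sketch below assumes an existing `SparkContext` named `sc`; `extractSources` and the S3 paths are hypothetical stand-ins, not Kira's actual code.

```scala
// The many-task pattern: one image file per task, mapped in parallel.
// Assumes an existing SparkContext named sc; extractSources and the
// input/output paths are hypothetical stand-ins, not Kira's actual code.
import org.apache.spark.input.PortableDataStream

case class DetectedSource(x: Double, y: Double, flux: Double)

// Placeholder: decode a FITS image and detect sources in it.
def extractSources(bytes: Array[Byte]): Seq[DetectedSource] = Seq.empty

val catalogs = sc
  .binaryFiles("s3://bucket/images/*.fits")        // (path, stream) pairs
  .map { case (path, stream: PortableDataStream) =>
    path -> extractSources(stream.toArray())       // one task per image
  }
catalogs.saveAsObjectFile("s3://bucket/catalogs")  // persist the results
```

Data locality matters here: Spark schedules each map task near the block holding its image, which is where the reported speedup over the naive C dataflow comes from.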


High Performance Distributed Computing | 2017

Diagnosing Machine Learning Pipelines with Fine-grained Lineage

Zhao Zhang; Evan R. Sparks; Michael J. Franklin

We present Hippo, a system that enables the diagnosis of distributed machine learning (ML) pipelines by leveraging fine-grained data lineage. Hippo exposes a concise yet powerful API, derived from primitive lineage types, to capture fine-grained data lineage for each data transformation. It records the input datasets, the output datasets, and the cell-level mapping between them, and it collects enough information to reproduce the computation. Hippo efficiently supports common ML diagnosis operations such as code debugging, result analysis, data anomaly removal, and computation replay. By exploiting metadata separation and high-order function encoding strategies, we observe a roughly 1000× improvement in lineage storage efficiency over the baseline of recording cell-wise mappings, while maintaining lineage integrity. Hippo answers the lineage queries from our real use cases within a few seconds, which is fast enough to enable interactive diagnosis of ML pipelines.
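
A hedged sketch of the lineage-capture idea: rather than materializing a cell-by-cell input-to-output map, record a primitive lineage type per transformation (and, where needed, the function that induces the mapping), from which cell-level lineage can be re-derived on demand. The types and `LineageTracker` object below are illustrative, not Hippo's actual API.

```scala
// Illustrative primitive lineage types and a capture wrapper (not Hippo's
// actual API): instead of materializing a cell-by-cell map, record the
// transformation's lineage type, from which cell-level mappings can be
// re-derived on demand.
sealed trait Lineage
case class OneToOne(inputId: String, outputId: String) extends Lineage
case class ManyToOne(inputId: String, outputId: String,
                     groupOf: Int => Int) extends Lineage  // cell -> group

object LineageTracker {
  private val log = scala.collection.mutable.ListBuffer.empty[Lineage]

  // Wrap an element-wise map: output cell i derives from input cell i,
  // so a single OneToOne record stands in for every cell-level edge.
  def tracedMap[A, B](inId: String, outId: String)
                     (data: Vector[A])(f: A => B): Vector[B] = {
    log += OneToOne(inId, outId)
    data.map(f)
  }

  def lineage: List[Lineage] = log.toList
}
```

Storing one compact record per transformation instead of one edge per cell is what makes the orders-of-magnitude storage saving plausible.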


Journal of Machine Learning Research | 2016

MLlib: Machine Learning in Apache Spark

Xiangrui Meng; Joseph K. Bradley; Burak Yavuz; Evan R. Sparks; Shivaram Venkataraman; Davies Liu; Jeremy Freeman; D. B. Tsai; Manish Amde; Sean Owen; Doris Xin; Reynold S. Xin; Michael J. Franklin; Reza Bosagh Zadeh; Matei Zaharia; Ameet Talwalkar


arXiv: Databases | 2015

TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries

Evan R. Sparks; Ameet Talwalkar; Michael J. Franklin; Michael I. Jordan; Tim Kraska


International Conference on Learning Representations | 2017

Paleo: A Performance Model for Deep Neural Networks

Hang Qi; Evan R. Sparks; Ameet Talwalkar


IEEE Transactions on Big Data | 2016

Kira: Processing Astronomy Imagery Using Big Data Technology

Zhao Zhang; K. Barbary; Frank Austin Nothaft; Evan R. Sparks; Oliver Zahn; Michael J. Franklin; David A. Patterson; S. Perlmutter

Collaboration


Dive into Evan R. Sparks's collaborations.

Top Co-Authors

Matei Zaharia

Massachusetts Institute of Technology
