Is this you? Create Your Porfile

Michael Armbrust

University of California, Berkeley

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Michael Armbrust is active.

Explore More

Publication

Featured researches published by Michael Armbrust.

Communications of The ACM | 2010

A view of cloud computing

Michael Armbrust; Armando Fox; Rean Griffith; Anthony D. Joseph; Randy H. Katz; Andy Konwinski; Gunho Lee; David A. Patterson; Ariel Rabkin; Ion Stoica; Matei Zaharia

Clearing the clouds away from the true potential and obstacles posed by this computing capability.

international conference on management of data | 2015

Spark SQL: Relational Data Processing in Spark

Michael Armbrust; Reynold S. Xin; Cheng Lian; Yin Huai; Davies Liu; Joseph K. Bradley; Xiangrui Meng; Tomer Kaftan; Michael J. Franklin; Ali Ghodsi; Matei Zaharia

Spark SQL is a new module in Apache Spark that integrates relational processing with Sparks functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.

Communications of The ACM | 2016

Apache Spark: a unified engine for big data processing

Matei Zaharia; Reynold S. Xin; Patrick Wendell; Tathagata Das; Michael Armbrust; Ankur Dave; Xiangrui Meng; Josh Rosen; Shivaram Venkataraman; Michael J. Franklin; Ali Ghodsi; Joseph E. Gonzalez; Scott Shenker; Ion Stoica

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

very large data bases | 2015

Scaling spark in the real world: performance and usability

Michael Armbrust; Tathagata Das; Aaron Davidson; Ali Ghodsi; Andrew Or; Josh Rosen; Ion Stoica; Patrick Wendell; Reynold S. Xin; Matei Zaharia

Apache Spark is one of the most widely used open source processing engines for big data, with rich language-integrated APIs and a wide range of libraries. Over the past two years, our group has worked to deploy Spark to a wide range of organizations through consulting relationships as well as our hosted service, Databricks. We describe the main challenges and requirements that appeared in taking Spark to a wide set of users, and usability and performance improvements we have made to the engine in response.

international conference on management of data | 2015

G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data

Kai Zeng; Sameer Agarwal; Ankur Dave; Michael Armbrust; Ion Stoica

Nearly 15 years ago, Hellerstein, Haas and Wang proposed online aggregation (OLA), a technique that allows users to (1) observe the progress of a query by showing iteratively refined approximate answers, and (2) stop the query execution once its result achieves the desired accuracy. In this demonstration, we present G-OLA, a novel mini-batch execution model that generalizes OLA to support general OLAP queries with arbitrarily nested aggregates using efficient delta maintenance techniques. We have implemented G-OLA in FluoDB, a parallel online query execution framework that is built on top of the Spark cluster computing framework that can scale to massive data sets. We will demonstrate FluoDB on a cluster of 100 machines processing roughly 10TB of real-world session logs from a video-sharing website. Using an ad optimization and an A/B testing based scenario, we will enable users to perform real-time data analysis via web-based query consoles and dashboards.

international conference on management of data | 2013

Generalized scale independence through incremental precomputation

Michael Armbrust; Eric Liang; Tim Kraska; Armando Fox; Michael J. Franklin; David A. Patterson

Developers of rapidly growing applications must be able to anticipate potential scalability problems before they cause performance issues in production environments. A new type of data independence, called scale independence, seeks to address this challenge by guaranteeing a bounded amount of work is required to execute all queries in an application, independent of the size of the underlying data. While optimization strategies have been developed to provide these guarantees for the class of queries that are scale-independent when executed using simple indexes, there are important queries for which such techniques are insufficient. Executing these more complex queries scale-independently requires precomputation using incrementally-maintained materialized views. However, since this precomputation effectively shifts some of the query processing burden from execution time to insertion time, a scale-independent system must be careful to ensure that storage and maintenance costs do not threaten scalability. In this paper, we describe a scale-independent view selection and maintenance system, which uses novel static analysis techniques that ensure that created views do not themselves become scaling bottlenecks. Finally, we present an empirical analysis that includes all the queries from the TPC-W benchmark and validates our implementations ability to maintain nearly constant high-quantile query and update latency even as an application scales to hundreds of machines.

symposium on operating systems principles | 2017

Drizzle: Fast and Adaptable Stream Processing at Scale

Shivaram Venkataraman; Aurojit Panda; Kay Ousterhout; Michael Armbrust; Ali Ghodsi; Michael J. Franklin; Benjamin Recht; Ion Stoica

Large scale streaming systems aim to provide high throughput and low latency. They are often used to run mission-critical applications, and must be available 24x7. Thus such systems need to adapt to failures and inherent changes in workloads, with minimal impact on latency and throughput. Unfortunately, existing solutions require operators to choose between achieving low latency during normal operation and incurring minimal impact during adaptation. Continuous operator streaming systems, such as Naiad and Flink, provide low latency during normal execution but incur high overheads during adaptation (e.g., recovery), while micro-batch systems, such as Spark Streaming and FlumeJava, adapt rapidly at the cost of high latency during normal operations. Our key observation is that while streaming workloads require millisecond-level processing, workload and cluster properties change less frequently. Based on this, we develop Drizzle, a system that decouples the processing interval from the coordination interval used for fault tolerance and adaptability. Our experiments on a 128 node EC2 cluster show that on the Yahoo Streaming Benchmark, Drizzle can achieve end-to-end record processing latencies of less than 100ms and can get 2-3x lower latency than Spark. Drizzle also exhibits better adaptability, and can recover from failures 4x faster than Flink while having up to 13x lower latency during recovery.

international conference on management of data | 2018

Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark

Michael Armbrust; Tathagata Das; Joseph Torres; Burak Yavuz; Shixiong Zhu; Reynold S. Xin; Ali Ghodsi; Ion Stoica; Matei Zaharia

With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQLs code generation engine and can outperform Apache Flink by up to 2x and Apache Kafka Streams by 90x. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the systems design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.

Archive | 2009

Above the Clouds: A Berkeley View of Cloud Computing

Michael Armbrust; Armando Fox; Rean Griffith; Anthony D. Joseph; Randy H. Katz; Andy Konwinski; Gunho Lee; David A. Patterson; Ariel Rabkin; Ion Stoica; Matei Zaharia

conference on innovative data systems research | 2009