Publication


Featured research published by Reynold S. Xin.


International Conference on Management of Data | 2011

CrowdDB: answering queries with crowdsourcing

Michael J. Franklin; Donald Kossmann; Tim Kraska; Sukriti Ramesh; Reynold S. Xin

Some queries cannot be answered by machines only. Processing such queries requires human input for providing information that is missing from the database, for performing computationally difficult functions, and for matching, ranking, or aggregating results based on fuzzy criteria. CrowdDB uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. It uses SQL both as a language for posing complex queries and as a way to model data. While CrowdDB leverages many aspects of traditional database systems, there are also important differences. Conceptually, a major change is that the traditional closed-world assumption for query processing does not hold for human input. From an implementation perspective, human-oriented query operators are needed to solicit, integrate and cleanse crowdsourced data. Furthermore, performance and cost depend on a number of new factors including worker affinity, training, fatigue, motivation and location. We describe the design of CrowdDB, report on an initial set of experiments using Amazon Mechanical Turk, and outline important avenues for future work in the development of crowdsourced query processing systems.
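CrowdDB's actual interface extends SQL; as a rough illustration of the underlying idea, the following toy Scala sketch shows a crowd-backed "fill" operator that resolves attributes the database does not know by asking people, which is exactly why the closed-world assumption breaks down. All names here are hypothetical.

```scala
// Toy sketch (not CrowdDB's actual SQL interface): a crowd-backed operator
// that fills in values missing from the database. All names are hypothetical.
case class Company(name: String, hq: Option[String])

// Stand-in for a crowdsourcing platform such as Amazon Mechanical Turk.
trait Crowd {
  def ask(question: String): String
}

// Known attributes pass through untouched; missing ones are resolved by people.
def crowdFill(rows: Seq[Company], crowd: Crowd): Seq[Company] =
  rows.map {
    case Company(name, None) =>
      Company(name, Some(crowd.ask(s"Where is $name headquartered?")))
    case complete => complete
  }
```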


International Conference on Management of Data | 2015

Spark SQL: Relational Data Processing in Spark

Michael Armbrust; Reynold S. Xin; Cheng Lian; Yin Huai; Davies Liu; Joseph K. Bradley; Xiangrui Meng; Tomer Kaftan; Michael J. Franklin; Ali Ghodsi; Matei Zaharia

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
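A minimal sketch of the mixed declarative/procedural style the paper describes, written against the current SparkSession entry point; the input path and column names are assumptions, not anything from the paper.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("SparkSQLExample").getOrCreate()
import spark.implicits._

// Schema inference for JSON, one of the Catalyst-based features mentioned above.
val events = spark.read.json("hdfs://.../events.json")  // hypothetical path

// Declarative DataFrame operations, optimized by Catalyst...
val daily = events
  .filter($"status" === "ok")
  .groupBy($"date")
  .agg(count(lit(1)).as("n"))

// ...freely mixed with procedural Spark code operating on the same data.
daily.rdd.map(row => s"${row.get(0)} -> ${row.get(1)}").foreach(println)
```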


First International Workshop on Graph Data Management Experiences and Systems | 2013

GraphX: a resilient distributed graph system on Spark

Reynold S. Xin; Joseph E. Gonzalez; Michael J. Franklin; Ion Stoica

From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g., Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation, which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining. We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.
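A small sketch of expressing graph computation inside Spark with GraphX as it later shipped in Apache Spark: graphs are built from ordinary RDDs and graph-parallel algorithms run on the same engine, so results flow straight back into data-parallel operators. The SparkContext and toy data are assumptions.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

def example(sc: SparkContext): Unit = {
  // Vertices and edges are plain RDDs (tabular collections).
  val vertices: RDD[(Long, String)] =
    sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
  val edges: RDD[Edge[Int]] =
    sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

  // Build the graph and run a graph-parallel algorithm (PageRank) on Spark.
  val graph = Graph(vertices, edges)
  val ranks = graph.pageRank(tol = 0.001).vertices

  // Join the graph-parallel result back with data-parallel operators.
  ranks.join(vertices)
       .map { case (_, (rank, name)) => (name, rank) }
       .collect()
       .foreach(println)
}
```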


International Conference on Management of Data | 2013

Shark: SQL and rich analytics at scale

Reynold S. Xin; Joshua Rosen; Matei Zaharia; Michael J. Franklin; Scott Shenker; Ion Stoica

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g. iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100X faster than Apache Hive, and machine learning programs more than 100X faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such an engine provides. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.
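A conceptual sketch of the workflow described above: a SQL query and an iterative machine-learning step running on the same engine over the same in-memory data. SharkLikeEngine is a made-up interface standing in for Shark's actual API; the query, schema, and training loop are only illustrative.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for Shark's SQL entry point (not the real API).
trait SharkLikeEngine {
  // Run a (Hive)QL query and get the result back as an RDD of feature vectors.
  def sqlToRdd(query: String): RDD[Array[Double]]
}

def trainOnQueryResult(engine: SharkLikeEngine): Array[Double] = {
  // The relational part: selection and projection happen in the SQL engine.
  val points = engine.sqlToRdd(
    "SELECT age, income, clicked FROM users WHERE country = 'US'").cache()

  // The analytics part: a toy gradient-descent loop over the cached query
  // result, reusing cluster memory instead of exporting data to another system.
  var w = Array(0.0, 0.0)
  for (_ <- 1 to 10) {
    val grad = points.map { p =>
      val err = w(0) * p(0) + w(1) * p(1) - p(2)
      Array(err * p(0), err * p(1))
    }.reduce((a, b) => Array(a(0) + b(0), a(1) + b(1)))
    w = Array(w(0) - 0.01 * grad(0), w(1) - 0.01 * grad(1))
  }
  w
}
```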


Communications of the ACM | 2016

Apache Spark: a unified engine for big data processing

Matei Zaharia; Reynold S. Xin; Patrick Wendell; Tathagata Das; Michael Armbrust; Ankur Dave; Xiangrui Meng; Josh Rosen; Shivaram Venkataraman; Michael J. Franklin; Ali Ghodsi; Joseph E. Gonzalez; Scott Shenker; Ion Stoica

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
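To make the "unified engine" point concrete, here is a minimal Scala sketch in which the same functional word-count logic runs both as a batch job and inside a streaming job. The input paths and the socket source are assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("UnifiedExample")
val sc = new SparkContext(conf)

// Batch: word count over a static file (hypothetical path).
val batchCounts = sc.textFile("hdfs://.../corpus.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// Streaming: the same transformations applied to each micro-batch of a live stream.
val ssc = new StreamingContext(sc, Seconds(1))
val streamCounts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
streamCounts.print()
// ssc.start(); ssc.awaitTermination()  // would start the streaming computation
```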


International Conference on Management of Data | 2012

Shark: fast data analysis using coarse-grained distributed memory

Cliff Engle; Antonio Lupher; Reynold S. Xin; Matei Zaharia; Michael J. Franklin; Scott Shenker; Ion Stoica

Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data. It scales to thousands of nodes in a fault-tolerant manner. Shark can answer queries 40X faster than Apache Hive and run machine learning programs 25X faster than MapReduce programs in Apache Hadoop on large datasets.


International Conference on Management of Data | 2012

Finding related tables

Anish Das Sarma; Lujun Fang; Nitin Gupta; Alon Y. Halevy; Hongrae Lee; Fei Wu; Reynold S. Xin; Cong Yu

We consider the problem of finding related tables in a large corpus of heterogeneous tables. Detecting related tables provides users a powerful tool for enhancing their tables with additional data and enables effective reuse of available public data. Our first contribution is a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union. Our second contribution is a set of algorithms for detecting related tables that can be either unioned or joined. We describe a set of experiments that demonstrate that our algorithms produce highly related tables. We also show that we can often improve the results of table search by pulling up tables that are ranked much lower based on their relatedness to top-ranked tables. Finally, we describe how to scale up our algorithms and show the results of running them on a corpus of over a million tables extracted from Wikipedia.
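A deliberately simplified Scala sketch of one signal for "unionable" tables: overlap between column-name sets, measured by Jaccard similarity. The paper's actual algorithms use richer signals; this is only meant to make the idea of ranking union candidates concrete.

```scala
// Toy model of a table as its name plus a set of column names.
case class Table(name: String, columns: Set[String])

// Jaccard similarity between two column-name sets.
def jaccard(a: Set[String], b: Set[String]): Double =
  if (a.isEmpty && b.isEmpty) 0.0
  else (a intersect b).size.toDouble / (a union b).size

// Rank candidate tables for union with `query` by schema similarity.
def unionCandidates(query: Table,
                    corpus: Seq[Table],
                    threshold: Double = 0.5): Seq[(Table, Double)] =
  corpus.filter(_.name != query.name)
        .map(t => (t, jaccard(query.columns, t.columns)))
        .filter(_._2 >= threshold)
        .sortBy(-_._2)
```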


Very Large Data Bases | 2015

Scaling spark in the real world: performance and usability

Michael Armbrust; Tathagata Das; Aaron Davidson; Ali Ghodsi; Andrew Or; Josh Rosen; Ion Stoica; Patrick Wendell; Reynold S. Xin; Matei Zaharia

Apache Spark is one of the most widely used open source processing engines for big data, with rich language-integrated APIs and a wide range of libraries. Over the past two years, our group has worked to deploy Spark to a wide range of organizations through consulting relationships as well as our hosted service, Databricks. We describe the main challenges and requirements that arose in taking Spark to a wide set of users, and the usability and performance improvements we have made to the engine in response.


International Conference on Management of Data | 2014

Fine-grained partitioning for aggressive data skipping

Liwen Sun; Michael J. Franklin; Sanjay Krishnan; Reynold S. Xin

Modern query engines are increasingly being required to process enormous datasets in near real-time. While much can be done to speed up the data access, a promising technique is to reduce the need to access data through data skipping. By maintaining some metadata for each block of tuples, a query may skip a data block if the metadata indicates that the block does not contain relevant data. The effectiveness of data skipping, however, depends on how well the blocking scheme matches the query filters. In this paper, we propose a fine-grained blocking technique that reorganizes the data tuples into blocks with a goal of enabling queries to skip blocks aggressively. We first extract representative filters in a workload as features using frequent itemset mining. Based on these features, each data tuple can be represented as a feature vector. We then formulate the blocking problem as an optimization problem on the feature vectors, called Balanced MaxSkip Partitioning, which we prove is NP-hard. To find an approximate solution efficiently, we adopt the bottom-up clustering framework. We prototyped our blocking techniques on Shark, an open-source data warehouse system. Our experiments on TPC-H and a real-world workload show that our blocking technique leads to 2-5x improvement in query response time over traditional range-based blocking techniques.
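A small Scala sketch of the data-skipping mechanism the paper builds on: each block keeps lightweight metadata (here, the min and max of one column), and a range query scans only the blocks whose metadata could possibly match. The paper's contribution is how to form the blocks; this sketch shows only the skipping step, with made-up types.

```scala
// A block of values from one column, plus the metadata used for skipping.
case class Block(rows: Seq[Long], min: Long, max: Long)

def makeBlock(rows: Seq[Long]): Block = Block(rows, rows.min, rows.max)

// Evaluate `column >= lo AND column <= hi`, scanning only blocks that can match.
def rangeQuery(blocks: Seq[Block], lo: Long, hi: Long): Seq[Long] =
  blocks.filter(b => b.max >= lo && b.min <= hi)            // skip blocks via metadata
        .flatMap(_.rows.filter(v => v >= lo && v <= hi))    // scan only surviving blocks
```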


International Conference on Management of Data | 2016

SparkR: Scaling R Programs with Spark

Shivaram Venkataraman; Zongheng Yang; Davies Liu; Eric Liang; Hossein Falaki; Xiangrui Meng; Reynold S. Xin; Ali Ghodsi; Michael J. Franklin; Ion Stoica; Matei Zaharia

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single-threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.

Collaboration


Dive into Reynold S. Xin's collaborations.

Top Co-Authors

Matei Zaharia (Massachusetts Institute of Technology)
Ion Stoica (University of California)
Ali Ghodsi (University of California)
Ankur Dave (University of California)
Scott Shenker (University of California)