Xiangrui Meng | Researchain

Publication

Featured researches published by Xiangrui Meng.

international conference on management of data | 2015

Spark SQL: Relational Data Processing in Spark

Michael Armbrust; Reynold S. Xin; Cheng Lian; Yin Huai; Davies Liu; Joseph K. Bradley; Xiangrui Meng; Tomer Kaftan; Michael J. Franklin; Ali Ghodsi; Matei Zaharia

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
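The idea of an optimizer built from small, composable rewrite rules can be illustrated with a toy sketch. This is not Catalyst's API (Catalyst is written in Scala); every name below (`Literal`, `Attribute`, `Add`, `transform_up`, `fold_constants`, `drop_zero`, `optimize`) is hypothetical and chosen only for illustration. Each rule rewrites one tree pattern, and the driver loop applies the rules until the expression stops changing.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


# Hypothetical expression nodes; not Catalyst's actual classes.
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
Rule = Callable[[Expr], Expr]


def transform_up(expr: Expr, rule: Rule) -> Expr:
    """Rewrite the children first, then apply the rule to the node itself."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def fold_constants(expr: Expr) -> Expr:
    """Rule: replace an addition of two literals with a single literal."""
    if (isinstance(expr, Add)
            and isinstance(expr.left, Literal)
            and isinstance(expr.right, Literal)):
        return Literal(expr.left.value + expr.right.value)
    return expr


def drop_zero(expr: Expr) -> Expr:
    """Rule: simplify x + 0 and 0 + x to x."""
    if isinstance(expr, Add):
        if expr.left == Literal(0):
            return expr.right
        if expr.right == Literal(0):
            return expr.left
    return expr


def optimize(expr: Expr, rules: List[Rule]) -> Expr:
    """Apply every rule in turn until the tree reaches a fixed point."""
    while True:
        rewritten = expr
        for rule in rules:
            rewritten = transform_up(rewritten, rule)
        if rewritten == expr:
            return expr
        expr = rewritten


# x + (1 + -1): constant folding exposes a zero, which the second rule removes.
tree = Add(Attribute("x"), Add(Literal(1), Literal(-1)))
optimized = optimize(tree, [fold_constants, drop_zero])
```

Neither rule knows about the other, yet together they reduce the tree to the bare attribute `x`; that independence is what makes such rules easy to add one at a time.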

Communications of The ACM | 2016

Apache Spark: A Unified Engine for Big Data Processing

Matei Zaharia; Reynold S. Xin; Patrick Wendell; Tathagata Das; Michael Armbrust; Ankur Dave; Xiangrui Meng; Josh Rosen; Shivaram Venkataraman; Michael J. Franklin; Ali Ghodsi; Joseph E. Gonzalez; Scott Shenker; Ion Stoica

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

knowledge discovery and data mining | 2016

Matrix Computations and Optimization in Apache Spark

Reza Bosagh Zadeh; Xiangrui Meng; Alexander Ulanov; Burak Yavuz; Li Pu; Shivaram Venkataraman; Evan R. Sparks; Aaron Staple; Matei Zaharia

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be run on the cluster, while keeping vector operations local to the driver. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster, while running code written decades ago for a single core. Another example is our Spark port of the popular TFOCS optimization package, originally built for MATLAB, which allows for solving linear programs as well as a variety of other convex programs. We conclude with a comprehensive set of benchmarks for hardware-accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM. The contributions described in this paper are already merged into Apache Spark and available on Spark installations by default, and commercially supported by a slew of companies which provide further services.
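A minimal sketch of the split between matrix and vector operations, in plain Python with no Spark dependency. It assumes a row-partitioned matrix, with each partition standing in for a cluster task, and it substitutes simple power iteration for the single-core eigensolver the abstract alludes to. The function names `gramian_times_vector` and `top_singular_value` are hypothetical.

```python
import math
from typing import List

Partition = List[List[float]]  # a block of matrix rows held by one worker


def gramian_times_vector(partitions: List[Partition], v: List[float]) -> List[float]:
    """Matrix operation: compute (A^T A) v by summing per-row contributions.

    On a cluster, each partition would be processed by a separate task and
    the partial sums aggregated; here the loop runs sequentially.
    """
    total = [0.0] * len(v)
    for rows in partitions:
        for row in rows:
            dot = sum(a * b for a, b in zip(row, v))
            for j, a in enumerate(row):
                total[j] += dot * a
    return total


def top_singular_value(partitions: List[Partition], n_cols: int,
                       iterations: int = 100) -> float:
    """Vector operations (normalization, bookkeeping) stay local to the driver."""
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    eigenvalue = 0.0
    for _ in range(iterations):
        w = gramian_times_vector(partitions, v)  # the only "distributed" step
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        eigenvalue = norm  # converges to the largest eigenvalue of A^T A
    return math.sqrt(eigenvalue)


# A 2x2 diagonal matrix split into two single-row partitions.
example_partitions = [[[3.0, 0.0]], [[0.0, 1.0]]]
sigma = top_singular_value(example_partitions, n_cols=2)
```

The driver only ever holds vectors of length `n_cols`, so the local solver code never needs to see the full matrix; that is the property that lets single-core numerical code drive a cluster computation.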

international conference on management of data | 2016

SparkR: Scaling R Programs with Spark

Shivaram Venkataraman; Zongheng Yang; Davies Liu; Eric Liang; Hossein Falaki; Xiangrui Meng; Reynold S. Xin; Ali Ghodsi; Michael J. Franklin; Ion Stoica; Matei Zaharia

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single-threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large-scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation, and present some of the key details of our implementation.

conference on information and knowledge management | 2017

Collaborative Filtering as a Case-Study for Model Parallelism on Bulk Synchronous Systems

Ariyam Das; Ishan Upadhyaya; Xiangrui Meng; Ameet Talwalkar

Industrial-scale machine learning applications often train and maintain massive models that can be on the order of hundreds of millions to billions of parameters. Model parallelism thus plays a significant role in supporting these machine learning tasks. Recent work in this area has been dominated by parameter server architectures that follow an asynchronous computation model, introducing added complexity and approximation in order to scale to massive workloads. In this work, we explore model parallelism in the distributed bulk-synchronous parallel (BSP) setting, leveraging some recent progress made in the area of high performance computing, in order to address these complexity and approximation issues. Using collaborative filtering as a case study, we introduce an efficient model-parallel, industrial-scale algorithm for alternating least squares (ALS), along with a highly optimized implementation of ALS that serves as the default implementation in MLlib, Apache Spark's machine learning library. Our extensive empirical evaluation demonstrates that our implementation in MLlib compares favorably to the leading open-source parameter server framework, and our implementation scales to massive problems on the order of 50 billion ratings and close to 1 billion parameters.
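The alternating structure of ALS can be sketched in a few lines. The version below is a deliberately simplified rank-one, single-process illustration, not the MLlib implementation and not the model-parallel algorithm from the paper; the function name `als_rank_one` and its parameters are hypothetical. Each half of the loop body corresponds to one synchronous superstep: one set of factors is held fixed while the other is solved in closed form.

```python
from typing import Dict, List, Tuple


def als_rank_one(ratings: Dict[Tuple[int, int], float],
                 n_users: int, n_items: int,
                 reg: float = 0.1,
                 iterations: int = 10) -> Tuple[List[float], List[float]]:
    """Fit rating[u, i] ~ user[u] * item[i] by alternating least squares."""
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(iterations):
        # Superstep 1: item factors fixed, solve each user factor independently.
        num = [0.0] * n_users
        den = [reg] * n_users
        for (u, i), r in ratings.items():
            num[u] += r * item[i]
            den[u] += item[i] ** 2
        user = [n / d if d else 0.0 for n, d in zip(num, den)]

        # Superstep 2: user factors fixed, solve each item factor independently.
        num = [0.0] * n_items
        den = [reg] * n_items
        for (u, i), r in ratings.items():
            num[i] += r * user[u]
            den[i] += user[u] ** 2
        item = [n / d if d else 0.0 for n, d in zip(num, den)]
    return user, item


# Fully observed rank-one ratings: rating[u, i] = a[u] * b[i].
a = [1.0, 2.0]
b = [3.0, 4.0, 5.0]
example_ratings = {(u, i): a[u] * b[i] for u in range(2) for i in range(3)}
user_factors, item_factors = als_rank_one(example_ratings, 2, 3, reg=0.0, iterations=5)
```

Within a superstep every factor update depends only on the other side's factors from the previous step, so the updates can run in parallel with a barrier between the two halves. That is the bulk-synchronous pattern the abstract contrasts with asynchronous parameter servers.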

Journal of Machine Learning Research | 2016

MLlib: Machine Learning in Apache Spark

Xiangrui Meng; Joseph K. Bradley; Burak Yavuz; Evan R. Sparks; Shivaram Venkataraman; Davies Liu; Jeremy Freeman; D. B. Tsai; Manish Amde; Sean Owen; Doris Xin; Reynold S. Xin; Michael J. Franklin; Reza Bosagh Zadeh; Matei Zaharia; Ameet Talwalkar

Archive | 2015