Publications


Featured research published by Max Grossman.


European Conference on Parallel Processing | 2009

JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA

Yonghong Yan; Max Grossman; Vivek Sarkar

A recent trend in mainstream desktop systems is the use of general-purpose graphics processor units (GPGPUs) to obtain order-of-magnitude performance improvements. CUDA has emerged as a popular programming model for GPGPUs for use by C/C++ programmers. Given the widespread use of modern object-oriented languages with managed runtimes like Java and C#, it is natural to explore how CUDA-like capabilities can be made accessible to those programmers as well. In this paper, we present a programming interface called JCUDA that can be used by Java programmers to invoke CUDA kernels. Using this interface, programmers can write Java code that directly calls CUDA kernels, delegating the responsibility of generating the Java-CUDA bridge code and host-device data transfer calls to the compiler. Our preliminary performance results show that this interface can deliver significant performance improvements to Java programmers. For future work, we plan to use the JCUDA interface as a target language for supporting higher-level parallel programming languages like X10 and Habanero-Java.
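
To make the calling convention concrete, here is a minimal sketch of what a JCUDA-style kernel declaration might look like on the Java side. The annotation names and the interface are illustrative assumptions, not the paper's actual syntax; the point is that parameter directions tell the compiler which host-device transfers and JNI bridge code to generate.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical direction annotations, standing in for whatever markers the
// real JCUDA interface uses (an assumption for illustration only).
@Retention(RetentionPolicy.RUNTIME) @interface In {}
@Retention(RetentionPolicy.RUNTIME) @interface Out {}

// A Java-side declaration of a CUDA kernel. A JCUDA-style compiler would
// generate the JNI bridge code, a host-to-device copy for each @In
// parameter, and a device-to-host copy for each @Out parameter.
interface VectorKernels {
    void vectorAdd(@In float[] a, @In float[] b, @Out float[] c, int n);
}
```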


International Parallel and Distributed Processing Symposium | 2013

Integrating Asynchronous Task Parallelism with MPI

Sanjay Chatterjee; Sagnak Tasirlar; Zoran Budimlic; Vincent Cavé; Milind Chabbi; Max Grossman; Vivek Sarkar; Yonghong Yan

Effective combination of inter-node and intra-node parallelism is recognized to be a major challenge for future extreme-scale systems. Many researchers have demonstrated the potential benefits of combining both levels of parallelism, including increased communication-computation overlap, improved memory utilization, and effective use of accelerators. However, current “hybrid programming” approaches often require significant rewrites of application code and assume a high level of programmer expertise. Dynamic task parallelism has been widely regarded as a programming model that combines the best of performance and programmability for shared-memory programs. For distributed-memory programs, most users rely on efficient implementations of MPI. In this paper, we propose HCMPI (Habanero-C MPI), an integration of the Habanero-C dynamic task-parallel programming model with the widely used MPI message-passing interface. All MPI calls are treated as asynchronous tasks in this model, thereby enabling unified handling of messages and tasking constructs. For programmers unfamiliar with MPI, we introduce distributed data-driven futures (DDDFs), a new data-flow programming model that seamlessly integrates intra-node and inter-node data-flow parallelism without requiring any knowledge of MPI. Our novel runtime design for HCMPI and DDDFs uses a combination of dedicated communication-specific and computation-specific worker threads. We evaluate our approach on a set of micro-benchmarks as well as larger applications, and demonstrate better scalability compared to the most efficient MPI implementations, while offering a unified programming model to integrate asynchronous task parallelism with distributed-memory parallelism.
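
The data-flow activation rule behind DDDFs can be illustrated with standard Java futures. This is only an analogy: HCMPI itself is a C-based runtime, and CompletableFuture here stands in for a distributed future whose value would be satisfied by a message arriving from another rank.

```java
import java.util.concurrent.CompletableFuture;

public class DataDrivenSketch {
    public static void main(String[] args) {
        // Two "producer" tasks, standing in for remote ranks whose sends
        // would satisfy distributed data-driven futures in HCMPI.
        CompletableFuture<int[]> left  = CompletableFuture.supplyAsync(() -> new int[]{1, 2, 3});
        CompletableFuture<int[]> right = CompletableFuture.supplyAsync(() -> new int[]{4, 5, 6});

        // A consumer task that becomes runnable only when both of its inputs
        // are available -- the data-flow activation rule the abstract describes.
        CompletableFuture<Integer> sum = left.thenCombine(right, (a, b) -> {
            int s = 0;
            for (int i = 0; i < a.length; i++) s += a[i] + b[i];
            return s;
        });
        System.out.println(sum.join()); // prints 21
    }
}
```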


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

HadoopCL: MapReduce on Distributed Heterogeneous Platforms through Seamless Integration of Hadoop and OpenCL

Max Grossman; Mauricio Breternitz; Vivek Sarkar

As the scale of high performance computing systems grows, three main challenges arise: the programmability, reliability, and energy efficiency of those systems. Accomplishing all three without sacrificing performance requires a rethinking of legacy distributed programming models and homogeneous clusters. In this work, we integrate Hadoop MapReduce with OpenCL to enable the use of heterogeneous processors in a distributed system. We do this by exploiting the implicit data parallelism of mappers and reducers in a MapReduce system. Combining Hadoop and OpenCL provides 1) an easy-to-learn and flexible application programming interface in a high-level and popular programming language, 2) the reliability guarantees and distributed file system of Hadoop, and 3) the low power consumption and performance acceleration of heterogeneous processors. This paper presents HadoopCL: an extension to Hadoop which supports execution of user-written Java kernels on heterogeneous devices, optimizes communication through asynchronous transfers and dedicated I/O threads, automatically generates OpenCL kernels from Java bytecode using the open-source tool APARAPI, and achieves nearly 3× overall speedup and better than 55× speedup of the computational sections for example MapReduce applications, relative to Hadoop.
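
The Java-to-OpenCL translation HadoopCL relies on can be tried directly with APARAPI. The sketch below shows the core pattern: the bytecode of run() is translated into an OpenCL kernel at runtime, which is what HadoopCL does for mapper and reducer bodies. Package names follow the current open-source fork of APARAPI; the original AMD release used com.amd.aparapi.

```java
import com.aparapi.Kernel;
import com.aparapi.Range;

public class SquareKernelDemo {
    public static void main(String[] args) {
        final float[] in  = new float[1024];
        final float[] out = new float[in.length];
        for (int i = 0; i < in.length; i++) in[i] = i;

        // APARAPI translates the bytecode of run() into an OpenCL kernel and
        // handles the host-device transfers for the captured arrays.
        Kernel kernel = new Kernel() {
            @Override public void run() {
                int gid = getGlobalId();      // one work-item per element
                out[gid] = in[gid] * in[gid];
            }
        };
        kernel.execute(Range.create(in.length));
        kernel.dispose();
        System.out.println(out[3]); // 9.0
    }
}
```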


Languages and Compilers for Parallel Computing | 2011

Dynamic Task Parallelism with a GPU Work-Stealing Runtime System

Sanjay Chatterjee; Max Grossman; Alina Simion Sbîrlea; Vivek Sarkar

NVIDIA’s Compute Unified Device Architecture (CUDA) has made GPUs accessible to mainstream programming. An abundance of simple computational cores and high memory bandwidth make GPUs ideal candidates for data-parallel applications. However, their potential for executing applications that combine task and data parallelism has not been explored in detail. CUDA does not provide a viable interface for creating dynamic tasks or handling load-balancing issues; any such support has to be orchestrated entirely by the CUDA programmer today.
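
Work stealing itself is easiest to see on the CPU, where Java's ForkJoinPool implements the same deque-based load balancing that the paper's runtime provides on the GPU. A minimal sketch of the pattern (not the paper's runtime):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Dynamically created tasks are pushed onto per-worker deques; idle workers
// steal from the other end -- the behavior the paper builds for GPUs.
public class StealingSum extends RecursiveTask<Long> {
    private final long[] data; private final int lo, hi;
    StealingSum(long[] data, int lo, int hi) { this.data = data; this.lo = lo; this.hi = hi; }

    @Override protected Long compute() {
        if (hi - lo <= 1024) {                  // small task: run directly
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        StealingSum left = new StealingSum(data, lo, mid);
        left.fork();                             // push onto this worker's deque
        long right = new StealingSum(data, mid, hi).compute();
        return right + left.join();              // an idle worker may steal 'left'
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 20];
        java.util.Arrays.fill(data, 1L);
        System.out.println(new ForkJoinPool().invoke(new StealingSum(data, 0, data.length)));
    }
}
```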


Languages and Compilers for Parallel Computing | 2010

CnC-CUDA: declarative programming for GPUs

Max Grossman; Alina Simion Sbîrlea; Zoran Budimlic; Vivek Sarkar

The computer industry is at a major inflection point in its hardware roadmap due to the end of a decades-long trend of exponentially increasing clock frequencies. Instead, future computer systems are expected to be built using homogeneous and heterogeneous many-core processors with 10s to 100s of cores per chip, and complex hardware designs to address the challenges of concurrency, energy efficiency and resiliency. Unlike previous generations of hardware evolution, this shift towards many-core computing will have a profound impact on software. These software challenges are further compounded by the need to enable parallelism in workloads and application domains that did not have to worry about multiprocessor parallelism in the past. A recent trend in mainstream desktop systems is the use of graphics processor units (GPUs) to obtain order-of-magnitude performance improvements relative to general-purpose CPUs. Unfortunately, hybrid programming models that support multithreaded execution on CPUs in parallel with CUDA execution on GPUs prove to be too complex for use by mainstream programmers and domain experts, especially when targeting platforms with multiple CPU cores and multiple GPU devices. In this paper, we extend past work on Intel's Concurrent Collections (CnC) programming model to address the hybrid programming challenge using a model called CnC-CUDA. CnC is a declarative and implicitly parallel coordination language that supports flexible combinations of task and data parallelism while retaining determinism. CnC computations are built using steps that are related by data and control dependence edges, which are represented by a CnC graph. The CnC-CUDA extensions in this paper include the definition of multithreaded steps for execution on GPUs, and automatic generation of data and control flow between CPU steps and GPU steps. Experimental results show that this approach can yield significant performance benefits with both GPU execution and hybrid CPU/GPU execution.
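
The CnC vocabulary used in the abstract (item collections, steps, tags) can be rendered as a deliberately tiny Java sketch. This illustrates the model only, under assumed names; a real CnC or CnC-CUDA runtime manages the collections, the step scheduling, and the CPU/GPU data movement described in the paper.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

public class CnCSketch {
    // Item collections: single-assignment data keyed by tags.
    static final Map<Integer, Double> input  = new ConcurrentHashMap<>();
    static final Map<Integer, Double> output = new ConcurrentHashMap<>();

    // One "step": a pure function fired once per tag. It reads its input
    // item and writes its output item, inducing a data dependence edge.
    static void scaleStep(int tag) { output.put(tag, 2.0 * input.get(tag)); }

    public static void main(String[] args) {
        for (int t = 0; t < 4; t++) input.put(t, (double) t);
        // The tag collection {0..3} prescribes four step instances; here a
        // parallel stream stands in for the CnC scheduler.
        IntStream.range(0, 4).parallel().forEach(CnCSketch::scaleStep);
        System.out.println(output); // {0=0.0, 1=2.0, 2=4.0, 3=6.0}
    }
}
```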


Principles and Practice of Programming in Java | 2013

Accelerating Habanero-Java programs with OpenCL generation

Akihiro Hayashi; Max Grossman; Jisheng Zhao; Jun Shirako; Vivek Sarkar

The initial wave of programming models for general-purpose computing on GPUs, led by CUDA and OpenCL, has provided experts with low-level constructs to obtain significant performance and energy improvements on GPUs. However, these programming models are characterized by a challenging learning curve for non-experts due to their complex and low-level APIs. Looking to the future, improving the accessibility of GPUs and accelerators for mainstream software developers is crucial to bringing the benefits of these heterogeneous architectures to a broader set of application domains. A key challenge in doing so is that mainstream developers are accustomed to working with high-level managed languages, such as Java, rather than lower-level native languages such as C, CUDA, and OpenCL. The OpenCL standard enables portable execution of SIMD kernels across a wide range of platforms, including multi-core CPUs, many-core GPUs, and FPGAs. However, using OpenCL from Java to program multi-architecture systems is difficult and error-prone. Programmers are required to explicitly perform a number of low-level operations, such as (1) managing data transfers between the host system and the GPU, (2) writing kernels in the OpenCL kernel language, (3) compiling kernels and performing other OpenCL initialization, and (4) using the Java Native Interface (JNI) to access the C/C++ APIs for OpenCL. In this paper, we present compile-time and run-time techniques for accelerating programs written in Java using automatic generation of OpenCL as a foundation. Our approach includes (1) automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct (forall) available in the Habanero-Java (HJ) language, (2) leveraging HJ's array view language construct to efficiently support rectangular, multi-dimensional arrays on OpenCL devices, and (3) implementing HJ's phaser (next) construct for all-to-all barrier synchronization in automatically generated OpenCL kernels. Our approach is named HJ-OpenCL. Contrasting with past approaches to generating CUDA or OpenCL from high-level languages, the HJ-OpenCL approach helps the programmer preserve Java exception semantics by generating both exception-safe and unsafe code regions. The execution of one or the other is selected at runtime based on the safe language construct introduced in this paper. We use a set of ten Java benchmarks to evaluate our approach, and observe performance improvements due to both native OpenCL execution and parallelism. On an AMD APU, our results show speedups of up to 36.7× relative to sequential Java when executing on the host 4-core CPU, and of up to 55.0× on the integrated GPU. For a system with an Intel Xeon CPU and a discrete NVIDIA Fermi GPU, the speedups relative to sequential Java are 35.7× for the 12-core CPU and 324.0× for the GPU. Further, we find that different applications perform optimally in JVM execution, in OpenCL CPU execution, and in OpenCL GPU execution. The language features, compiler extensions, and runtime extensions included in this work enable portability, rapid prototyping, and transparent execution of JVM applications across all OpenCL platforms.
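
The safe/unsafe split can be pictured in plain Java. In the sketch below, which uses assumed names rather than HJ-OpenCL's actual generated code, one path keeps Java's exact exception semantics while the other is the parallel forall-style version the compiler would hand to OpenCL; a runtime flag stands in for the paper's safe construct.

```java
import java.util.stream.IntStream;

public class SafeUnsafeDemo {
    public static void main(String[] args) {
        float[] a = new float[1024], b = new float[1024];
        // Stand-in for the paper's 'safe' language construct: select the
        // exception-safe or the accelerated region at runtime.
        boolean safe = Boolean.getBoolean("hj.safe");

        if (safe) {
            // Exception-safe region: sequential Java, preserving bounds
            // checks and exact exception points.
            for (int i = 0; i < a.length; i++) b[i] = a[i] + 1.0f;
        } else {
            // "Unsafe" region: the forall-style data-parallel version that,
            // in HJ-OpenCL, would run as a generated OpenCL kernel once no
            // exception can escape the loop body.
            IntStream.range(0, a.length).parallel().forEach(i -> b[i] = a[i] + 1.0f);
        }
        System.out.println(b[0]);
    }
}
```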


European Conference on Parallel Processing | 2013

Compiler-Driven Data Layout Transformation for Heterogeneous Platforms

Deepak Majeti; Rajkishore Barik; Jisheng Zhao; Max Grossman; Vivek Sarkar

Modern heterogeneous systems comprise CPU cores, GPU cores, and, in some cases, accelerator cores. Each of these computational cores has a very different memory hierarchy, making it challenging to automatically and efficiently map the data structures of an application to these memory hierarchies. In this paper, we present a compiler-driven data layout transformation framework for heterogeneous platforms. We integrate our data layout framework with the data-parallel construct, forasync, of Habanero-C and enable the same source code to be compiled with different data layouts for various architectures. The programmer or an auto-tuner specifies a schema of the data layout. Our compiler infrastructure generates efficient code for different architectures based on the meta-information provided in the schema. Our experimental results show significant benefits from the compiler-driven data layout transformation, and demonstrate that the best data layout for a program varies across heterogeneous platforms.
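
The canonical instance of such a layout choice is array-of-structs versus struct-of-arrays, sketched below in Java with hypothetical types. Cache-based CPUs often favor AoS locality when a whole record is touched together, while GPUs favor SoA so that adjacent threads make coalesced, unit-stride accesses; a schema-driven compiler like the one described here can pick between them per architecture.

```java
public class LayoutDemo {
    // Array-of-structs (AoS): each particle's fields sit together.
    static final class Particle { float x, y, z; }

    // Struct-of-arrays (SoA): each field is a contiguous array.
    static final class Particles {
        final float[] x, y, z;
        Particles(int n) { x = new float[n]; y = new float[n]; z = new float[n]; }
    }

    static void shiftAoS(Particle[] ps, float dx) {
        for (Particle p : ps) p.x += dx;          // strided field accesses
    }

    static void shiftSoA(Particles ps, float dx) {
        for (int i = 0; i < ps.x.length; i++) ps.x[i] += dx; // unit-stride
    }

    public static void main(String[] args) {
        Particle[] aos = new Particle[4];
        for (int i = 0; i < aos.length; i++) aos[i] = new Particle();
        shiftAoS(aos, 1.0f);
        Particles soa = new Particles(4);
        shiftSoA(soa, 1.0f);
        System.out.println(aos[0].x + " " + soa.x[0]); // 1.0 1.0
    }
}
```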


IEEE Transactions on Parallel and Distributed Systems | 2016

HadoopCL2: Motivating the Design of a Distributed, Heterogeneous Programming System With Machine-Learning Applications

Max Grossman; Mauricio Breternitz; Vivek Sarkar

Machine learning (ML) algorithms have garnered increased interest as they demonstrate improved ability to extract meaningful trends from large, diverse, and noisy data sets. While research is advancing the state-of-the-art in ML algorithms, it is difficult to drastically improve the real-world performance of these algorithms. Porting new and existing algorithms from single-node systems to multi-node clusters, or from architecturally homogeneous systems to heterogeneous systems, is a promising optimization technique. However, performing optimized ports is challenging for domain experts who may lack experience in distributed and heterogeneous software development. This work explores how challenges in ML application development on heterogeneous, distributed systems shaped the development of the HadoopCL2 (HCL2) programming system. ML applications guide this work because they exhibit features that make application development difficult: large & diverse datasets, complex algorithms, and the need for domain-specific knowledge. The goal of this work is a general, MapReduce programming system that outperforms existing programming systems. This work evaluates the performance and portability of HCL2 against five ML applications from the Mahout ML framework on two hardware platforms. HCL2 demonstrates speedups of greater than 20× relative to Mahout for three computationally heavy algorithms and maintains minor performance improvements for two I/O bound algorithms.


Languages and Compilers for Parallel Computing | 2013

Speculative Execution of Parallel Programs with Precise Exception Semantics on GPUs

Akihiro Hayashi; Max Grossman; Jisheng Zhao; Jun Shirako; Vivek Sarkar

General purpose computing on GPUs (GPGPU) can enable significant performance and energy improvements for certain classes of applications. However, current GPGPU programming models, such as CUDA and OpenCL, are only accessible by systems experts through low-level C/C++ APIs. In contrast, large numbers of programmers use high-level languages, such as Java, due to their productivity advantages of type safety, managed runtimes and precise exception semantics. Current approaches to enabling GPGPU computing in Java and other managed languages involve low-level interfaces to native code that compromise the semantic guarantees of managed languages, and are not readily accessible to mainstream programmers.
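
One way to picture speculation with precise exceptions, independent of this paper's actual GPU mechanism, is to run the fast version into a scratch buffer and commit only on success. The Java sketch below uses assumed names; on any runtime exception, the sequential path re-executes with Java's exact semantics.

```java
import java.util.stream.IntStream;

public class SpeculativeDemo {
    public static void main(String[] args) {
        int[] idx = {0, 1, 2, 3};
        float[] src = {1f, 2f, 3f, 4f}, dst = new float[4];
        float[] scratch = new float[4];   // speculative writes land here
        try {
            // Speculative parallel run; results are not yet visible in dst.
            IntStream.range(0, idx.length).parallel()
                     .forEach(i -> scratch[i] = src[idx[i]] * 2.0f);
            System.arraycopy(scratch, 0, dst, 0, dst.length); // commit
        } catch (RuntimeException e) {
            // Precise fallback: sequential re-execution reproduces the exact
            // exception point, with dst untouched by the failed speculation.
            for (int i = 0; i < idx.length; i++) dst[i] = src[idx[i]] * 2.0f;
        }
        System.out.println(dst[1]); // 4.0
    }
}
```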


Journal of Parallel and Distributed Computing | 2017

Pedagogy and tools for teaching parallel computing at the sophomore undergraduate level

Max Grossman; Maha Aziz; Heng Chi; Anant Tibrewal; Shams Imam; Vivek Sarkar

As the need for multicore-aware programmers rises in both science and industry, Computer Science departments in universities around the USA are having to rethink their parallel computing curricula. At Rice University, this rethinking took the shape of COMP 322, an introductory parallel programming course that is required for all Bachelor's students. COMP 322 teaches students to reason about the behavior of parallel programs, educating them in both the high-level abstractions of task-parallel programming and the nitty-gritty details of working with threads in Java. In this paper, we detail the structure, principles, and experiences of COMP 322, gained from 6 years of teaching parallel programming to second-year undergraduates. We describe in detail two particularly useful tools that have been integrated into the curriculum: the HJlib parallel programming library and the Habanero Autograder for parallel programs. We present this work with the hope that it will help improve parallel computing education at other universities.

Highlights:
- An overview of parallel computing pedagogy at Rice University, including a unique approach to incrementally teaching parallel programming: from abstract parallel concepts to hands-on experience with industry-standard frameworks.
- A description of the HJlib parallel programming library and its applicability to parallel programming education.
- A description of the motivation, design, and implementation of the Habanero Autograder, a tool for providing automated and immediate feedback to students on programming assignments.
- A discussion of unexpected benefits from using the Habanero Autograder as part of Rice University's core parallel computing curriculum.

Collaboration


Dive into Max Grossman's collaborations.

Top Co-Authors

Howard Pritchard

Los Alamos National Laboratory
