Network


Latest external collaboration at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Wenjing Ma is active.

Publication


Featured research published by Wenjing Ma.


international conference on supercomputing | 2010

Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations

Vignesh T. Ravi; Wenjing Ma; David Chiu; Gagan Agrawal

A trend that has materialized, and has attracted much attention, is the increasing heterogeneity of computing platforms. Presently, it is very common for a desktop or notebook computer to come equipped with both a multi-core CPU and a GPU. Capitalizing on the full computational power of such architectures (i.e., by simultaneously exploiting both the multi-core CPU and the GPU) starting from a high-level API is a critical challenge. We believe it would be highly desirable to offer programmers a simple way to realize the full potential of today's heterogeneous machines. This paper describes a compiler and runtime framework that can map a class of applications, namely those characterized by generalized reductions, to a system with a multi-core CPU and a GPU. Starting with simple C functions with added annotations, we automatically generate middleware API code for the multi-core CPU, as well as CUDA code to exploit the GPU simultaneously. The runtime system provides efficient schemes for dynamically partitioning the work between the CPU cores and the GPU. Our experimental results from two applications, k-means clustering and Principal Component Analysis (PCA), show that, by effectively harnessing the heterogeneous architecture, we can achieve significantly higher performance than using only the GPU or only the multi-core CPU. In k-means, the heterogeneous version with 8 CPU cores and a GPU achieved a speedup of about 32.09x relative to the 1-thread CPU version; compared to the faster of the CPU-only and GPU-only executions, it achieved a performance gain of about 60%. In PCA, the heterogeneous version attained a speedup of 10.4x relative to the 1-thread CPU version; compared to the faster of the CPU-only and GPU-only versions, it achieved a performance gain of about 63.8%.
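
A minimal sketch of the core runtime idea, under our own assumptions (this is not the paper's actual middleware API; the names `sum_kernel`, `cpu_worker`, `gpu_worker`, and the chunk size are illustrative): CPU threads and a GPU driver thread pull fixed-size chunks from one shared cursor, so whichever device is faster automatically processes more of the input. The reduction here is a plain sum; a real generalized reduction would update a multi-element reduction object.

```cuda
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 1 << 24;       // total elements to reduce
constexpr int CHUNK = 1 << 20;   // work unit handed out dynamically
std::atomic<int> next_chunk{0};  // shared work-queue cursor

// Block-level tree reduction; one atomicAdd per block folds the
// block's partial sum into the chunk accumulator.
__global__ void sum_kernel(const float* x, int n, float* out) {
    __shared__ float buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(out, buf[0]);
}

// CPU workers and the GPU driver share one cursor: dynamic partitioning.
void cpu_worker(const float* x, double* partial) {
    for (int c; (c = next_chunk.fetch_add(1)) * CHUNK < N; ) {
        int lo = c * CHUNK, hi = std::min(lo + CHUNK, N);
        for (int i = lo; i < hi; ++i) *partial += x[i];
    }
}

void gpu_worker(const float* x, double* partial) {
    float *d_x, *d_out, zero = 0.0f, s;
    cudaMalloc(&d_x, CHUNK * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    for (int c; (c = next_chunk.fetch_add(1)) * CHUNK < N; ) {
        int lo = c * CHUNK, n = std::min(CHUNK, N - lo);
        cudaMemcpy(d_x, x + lo, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_out, &zero, sizeof(float), cudaMemcpyHostToDevice);
        sum_kernel<<<(n + 255) / 256, 256>>>(d_x, n, d_out);
        cudaMemcpy(&s, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        *partial += s;
    }
    cudaFree(d_x); cudaFree(d_out);
}

int main() {
    std::vector<float> x(N, 1.0f);
    double cpu_sum = 0.0, gpu_sum = 0.0;
    std::thread gpu(gpu_worker, x.data(), &gpu_sum);
    std::thread cpu(cpu_worker, x.data(), &cpu_sum);
    gpu.join(); cpu.join();
    printf("sum = %.0f (expect %d)\n", cpu_sum + gpu_sum, N);
    return 0;
}
```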


international conference on supercomputing | 2009

A translation system for enabling data mining applications on GPUs

Wenjing Ma; Gagan Agrawal

Modern GPUs offer substantial computing power at a very modest cost. Even though CUDA and other related recent developments are accelerating the use of GPUs for general-purpose applications, several challenges remain in programming them; it is therefore clearly desirable to be able to program GPUs through a higher-level interface. In this paper, we offer a solution that targets a specific class of applications: data mining and scientific data analysis. Our work is driven by the observation that a common processing structure, that of generalized reductions, fits a large number of popular data mining algorithms. In our solution, programmers simply specify the sequential reduction loop(s) along with some additional information about the parameters. We use program analysis and code generation to map the applications to a GPU, and the system performs several additional optimizations. We have evaluated our system using three popular data mining applications: k-means clustering, EM clustering, and Principal Component Analysis (PCA). The main observations from our experiments are as follows. The speedup that each of these applications achieves over a sequential CPU version ranges between 20 and 50. The automatically generated versions showed no noticeable overhead compared to hand-written code. Finally, the optimizations performed by the system resulted in significant performance improvements.
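
For illustration, this is the canonical "generalized reduction" loop shape that such a translator targets (plain C written by us; it is not the tool's actual input syntax or annotation format): each input element updates one entry of a small reduction object via an associative, commutative operator.

```cuda
// k-means assignment step as a generalized reduction (illustrative).
void kmeans_assign(const float* pts, int n, int dim,
                   const float* centers, int k,
                   float* sums /* k*dim */, int* counts /* k */) {
    for (int p = 0; p < n; ++p) {
        int best = 0; float bestd = 1e30f;
        for (int c = 0; c < k; ++c) {            // process(e): closest center
            float d = 0.0f;
            for (int j = 0; j < dim; ++j) {
                float t = pts[p * dim + j] - centers[c * dim + j];
                d += t * t;
            }
            if (d < bestd) { bestd = d; best = c; }
        }
        counts[best] += 1;                        // reduc(i) = reduc(i) op val
        for (int j = 0; j < dim; ++j)
            sums[best * dim + j] += pts[p * dim + j];
    }
}
```

Because the updates commute and associate, the loop iterations can be distributed across GPU threads as long as concurrent updates to the reduction object (`sums`, `counts`) are resolved, which is exactly what the generated code must handle.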


Journal of Chemical Theory and Computation | 2013

Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU–GPU Systems

Kiran Bhaskaran-Nair; Wenjing Ma; Sriram Krishnamoorthy; Oreste Villa; Hubertus J. J. van Dam; Edoardo Aprà; Karol Kowalski

We present a novel parallel algorithm for noniterative multireference coupled cluster (MRCC) theories that merges the recently introduced reference-level parallelism (RLP) [Bhaskaran-Nair, K.; Brabec, J.; Aprà, E.; van Dam, H. J. J.; Pittner, J.; Kowalski, K. J. Chem. Phys. 2012, 137, 094112] with the possibility of accelerating numerical calculations using graphics processing units (GPUs). We discuss the performance of this approach applied to the MRCCSD(T) method (iterative singles and doubles with perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithm is illustrated on the example of the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.


Cluster Computing | 2013

Optimizing tensor contraction expressions for hybrid CPU-GPU execution

Wenjing Ma; Sriram Krishnamoorthy; Oreste Villa; Karol Kowalski; Gagan Agrawal

Tensor contractions are generalized multidimensional matrix multiplication operations that occur widely in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension sizes that reduce thread-block utilization. Moreover, applying the same optimizations to a variety of expressions calls for a code generation tool. In this paper, we present our approach to automatically generating CUDA code for executing tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate the tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate a speedup of over 8.4x using one GPU compared to one CPU core, and over 2.6x when utilizing the entire system with a hybrid CPU+GPU solution using 2 GPUs and 5 cores per node (instead of 7 cores per node). We further investigate tensor contraction code on a newer series of GPUs, the Fermi GPUs, and provide several effective optimization algorithms. For the same CCSD(T) computation, a cluster with Fermi GPUs achieves a speedup of 3.4x over a cluster with T10 GPUs, and with a single Fermi GPU on each node we achieve a speedup of 43x over the sequential CPU version.
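
As an illustration of one sub-problem the generator must handle, here is a naive index-permutation kernel written by us (the paper's generated kernels and tiling strategy differ): a 4-D tensor is reordered, e.g. B[j,i,l,k] = A[i,j,k,l], so that a contraction can then be cast as a plain matrix multiply.

```cuda
// One thread per element; A is laid out [i][j][k][l] row-major.
__global__ void permute_ijkl_to_jilk(const double* A, double* B,
                                     int ni, int nj, int nk, int nl) {
    long idx = blockIdx.x * (long)blockDim.x + threadIdx.x;
    long total = (long)ni * nj * nk * nl;
    if (idx >= total) return;
    // Decompose the linear index of A into (i, j, k, l).
    int l = idx % nl;  long r = idx / nl;
    int k = r % nk;    r /= nk;
    int j = r % nj;
    int i = (int)(r / nj);
    // Recompose for B's [j][i][l][k] layout.
    long out = (((long)j * ni + i) * nl + l) * nk + k;
    B[out] = A[idx];
}
```

Such permutations are memory-bound, and either the reads or the writes above are uncoalesced; production code generators typically tile the permutation through shared memory so both sides stay coalesced.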


international conference on supercomputing | 2012

Data-driven fault tolerance for work stealing computations

Wenjing Ma; Sriram Krishnamoorthy

Work stealing is a promising technique for dynamically tolerating variations in the execution environment, including faults, system noise, and energy constraints. In this paper, we present fault tolerance mechanisms for task parallel computations, a popular computation idiom, employing work stealing. The computation is organized as a collection of tasks with data in a global address space. The completion of data operations, rather than the actual messages, is tracked to derive an idempotent data store. This information is also used to accurately identify the tasks to be re-executed in the presence of random work stealing. We consider three recovery schemes that present distinct trade-offs: lazy recovery with potentially increased re-execution cost, immediate collective recovery with associated synchronization overheads, and noncollective recovery enabled by additional communication. We employ distributed-memory work stealing to dynamically rebalance the tasks onto the live processes, and we evaluate the three schemes using candidate application benchmarks. We demonstrate that the space and time overheads of the fault tolerance mechanism are low, the costs incurred due to failures are small, and the overheads decrease with per-process work at scale.
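
A highly simplified sketch of the "track data-operation completion, not messages" idea, in plain C++ under our own assumptions (this is not the paper's runtime; `IdempotentStore` and `recover` are hypothetical names): each task's output write is paired with a completion flag, so re-executing a task after a failure is a harmless no-op, and only tasks whose outputs were never marked complete need to be replayed, regardless of where work stealing had moved them.

```cuda
#include <atomic>
#include <functional>
#include <vector>

struct IdempotentStore {
    std::vector<double> data;
    std::vector<std::atomic<bool>> done;   // completion flags per slot
    explicit IdempotentStore(size_t n) : data(n), done(n) {}

    // Publish a task's result exactly once; replays become no-ops.
    void put(size_t i, double v) {
        if (!done[i].load(std::memory_order_acquire)) {
            data[i] = v;
            done[i].store(true, std::memory_order_release);
        }
    }
};

// After a failure, re-execute only tasks whose outputs never completed.
void recover(IdempotentStore& s,
             const std::vector<std::function<double(size_t)>>& tasks) {
    for (size_t i = 0; i < tasks.size(); ++i)
        if (!s.done[i].load()) s.put(i, tasks[i](i));
}
```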


international conference on cluster computing | 2010

Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters

Wenjing Ma; Sriram Krishnamoorthy; Oreste Villa; Karol Kowalski

Tensor contractions are generalized multidimensional matrix multiplication operations that occur widely in quantum chemistry. Efficient execution of tensor contractions on GPUs requires several challenges to be addressed, including index permutation and small dimension sizes that reduce thread-block utilization. In this paper, we present our approach to automatically generating CUDA code for executing tensor contractions on GPUs, including management of data movement between CPU and GPU. GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. We demonstrate a speedup of over 8.4x using one GPU as compared to one CPU core, and over 2.6x when utilizing the entire system with a hybrid CPU+GPU solution using 2 GPUs and 5 cores per node. Finally, we analyze the implementation's behavior on future GPU systems.
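
The "streamed" aspect can be illustrated with a double-buffering pattern of our own (this is not the generated NWChem code; `contract_tile` and `stream_tiles` are placeholder names): two CUDA streams alternate, so the PCI-e transfer of the next tensor tile overlaps with the contraction kernel running on the current one.

```cuda
#include <cuda_runtime.h>

// Placeholder standing in for a generated contraction kernel.
// Note: atomicAdd on double requires compute capability >= 6.0.
__global__ void contract_tile(const double* tile, double* acc, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(acc, tile[i]);
}

void stream_tiles(const double* h_all, int ntiles, int tile_n, double* d_acc) {
    size_t bytes = tile_n * sizeof(double);
    double* d_tile[2];
    cudaStream_t st[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_tile[b], bytes);
        cudaStreamCreate(&st[b]);
    }
    for (int t = 0; t < ntiles; ++t) {
        int b = t & 1;  // double buffering: even tiles on stream 0, odd on 1
        // h_all must be pinned (cudaMallocHost) for this copy to truly
        // overlap with the kernel running on the other stream.
        cudaMemcpyAsync(d_tile[b], h_all + (size_t)t * tile_n, bytes,
                        cudaMemcpyHostToDevice, st[b]);
        contract_tile<<<(tile_n + 255) / 256, 256, 0, st[b]>>>(
            d_tile[b], d_acc, tile_n);
    }
    cudaDeviceSynchronize();
    for (int b = 0; b < 2; ++b) { cudaFree(d_tile[b]); cudaStreamDestroy(st[b]); }
}
```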


international conference on supercomputing | 2011

An execution strategy and optimized runtime support for parallelizing irregular reductions on modern GPUs

Xin Huo; Vignesh T. Ravi; Wenjing Ma; Gagan Agrawal

GPUs have rapidly emerged as a very significant player in high performance computing. However, despite the popularity of CUDA, significant challenges remain in porting different classes of HPC applications to modern GPUs. This paper focuses on the challenges of implementing irregular applications arising from unstructured grids on modern NVIDIA GPUs. Given the importance of irregular reductions in scientific and engineering codes, substantial effort was devoted over the previous two decades to compiler and runtime support for parallelizing and optimizing these codes, with different efforts targeting distributed memory machines, distributed shared memory machines, shared memory machines, or cache performance improvement on uniprocessors. However, there have not been any systematic studies on parallelizing these applications on modern GPUs. There are at least two significant challenges in porting this class of applications. The first is correct and efficient parallelization while using a large number of threads. The second is effective use of shared memory: since data accesses cannot be determined statically, runtime partitioning methods are needed to use shared memory effectively. This paper describes an execution methodology that addresses both challenges, along with optimized runtime modules that support it. Our approach and runtime methods have been extensively evaluated using two indirection-array-based applications.
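
The baseline problem looks roughly like the kernel below, written by us for illustration (it is not the paper's optimized scheme): an irregular reduction over an unstructured mesh, where an indirection array determines at runtime which output elements each iteration updates, so conflicts cannot be resolved statically.

```cuda
// Each edge scatters a contribution to its two endpoints; the endpoint
// indices come from the indirection array `edges`, so two threads may
// target the same node and must synchronize via atomics.
__global__ void edge_reduce(const int2* edges, const float* edge_val,
                            float* node_acc, int nedges) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nedges) return;
    int2 ep = edges[e];              // endpoints unknown until runtime
    atomicAdd(&node_acc[ep.x],  edge_val[e]);
    atomicAdd(&node_acc[ep.y], -edge_val[e]);
}
```

A runtime-partitioned scheme of the kind the paper describes would instead group edges so that each thread block's updates fall within a node range small enough to stage in shared memory, replacing most global-memory atomics with much cheaper on-chip updates.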


ieee international conference on high performance computing, data, and analytics | 2010

An integer programming framework for optimizing shared memory use on GPUs

Wenjing Ma; Gagan Agrawal

General-purpose computing on GPUs is becoming increasingly popular because of GPUs' extremely favorable performance/price ratio. Besides application development in CUDA, automatic code generation for GPUs is also receiving attention. Like standard processors, GPUs have a memory hierarchy that must be carefully optimized for in order to achieve efficient execution. Specifically, modern NVIDIA GPUs have a very small programmable cache, referred to as shared memory, accesses to which are nearly 100 to 150 times faster than accesses to the regular device memory. An automatically generated or hand-written CUDA program can explicitly control which variables and array sections are allocated in shared memory at any point during execution. This, however, leads to a difficult optimization problem. In this paper, we formulate and solve the shared memory allocation problem as an integer programming problem. We present a global (intraprocedural) framework that can model structured control flow and is not restricted to a single loop nest, and we consider allocation of scalars, arrays, and array sections in shared memory. We also briefly show how our framework can suggest useful loop transformations to further improve performance. Our experiments using several non-scientific applications show that the integer programming framework outperforms a recently published heuristic method, and that the suggested loop transformations also improve performance for many applications.
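
The core 0-1 decision can be sketched as the knapsack-like program below; this is our simplification, not the paper's actual model, which additionally captures structured control flow, live ranges, and array sections.

```latex
% x_i = 1 iff candidate i (a scalar, array, or array section) is placed
% in shared memory; s_i is its size, b_i its estimated access benefit,
% and C the shared memory capacity available to a thread block.
\[
  \max_{x \in \{0,1\}^n} \; \sum_{i=1}^{n} b_i x_i
  \quad \text{subject to} \quad \sum_{i=1}^{n} s_i x_i \le C .
\]
```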


ieee international conference on high performance computing, data, and analytics | 2010

Approaches for parallelizing reductions on modern GPUs

Xin Huo; Vignesh T. Ravi; Wenjing Ma; Gagan Agrawal

GPU hardware and software have been evolving rapidly: CUDA versions 1.1 and higher support atomic operations on device memory, and versions 1.2 and higher support atomic operations on shared memory. This paper focuses on parallelizing applications involving reductions on GPUs. Before locking support became available, these applications could only be parallelized using full replication, i.e., by creating a copy of the reduction object for each thread. From CUDA 1.1 (1.2) onwards, atomic operations on device (shared) memory are another option, though some effort is still required to support locking on floating point numbers and coarse-grained locking. Based on the trade-offs between locking and full replication, we also introduce a hybrid approach, in which a group of threads uses atomic operations to update one shared copy of the reduction object. Using three data mining algorithms that follow the reduction structure, k-means clustering, Principal Component Analysis (PCA), and k-nearest neighbor search (kNN), we evaluate the relative performance of the three approaches. We show how their relative performance can vary with the application and its parameters; the hybrid approach clearly outperforms the others in several cases.
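
The three strategies can be sketched for a histogram-style reduction as follows (our illustration; kernel names and sizes are not from the paper):

```cuda
#define NBINS 64

// (1) Full replication: each thread keeps a private copy of the
//     reduction object (e.g., hist[tid * NBINS + b] in global memory),
//     merged in a second pass -- no contention, but memory grows
//     linearly with the thread count.

// (2) Atomic/locking: all threads share one copy in device memory.
__global__ void reduce_atomic(const int* bins, int n, int* hist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&hist[bins[i]], 1);
}

// (3) Hybrid: each *block* of threads shares one copy in shared
//     memory, updated with cheap shared-memory atomics, then merged
//     into the global copy once per block.
__global__ void reduce_hybrid(const int* bins, int n, int* hist) {
    __shared__ int local[NBINS];
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x) local[b] = 0;
    __syncthreads();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&local[bins[i]], 1);
    __syncthreads();
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);
}
```

The hybrid scheme is attractive precisely when the reduction object is small enough for a block-private copy to fit in shared memory, which is the case for the histogram above.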


international parallel and distributed processing symposium | 2017

Localized Fault Recovery for Nested Fork-Join Programs

Gokcen Kestor; Sriram Krishnamoorthy; Wenjing Ma

Nested fork-join programs scheduled using work stealing can automatically balance load and adapt to changes in the execution environment. In this paper, we design an approach to efficiently recover from faults encountered by these programs. Specifically, we focus on localized recovery of the task space in the presence of fail-stop failures. We present an approach to efficiently track, under work stealing, the relationships between the work executed by various threads. This information is used to identify and schedule the tasks to be re-executed without interfering with normal task execution. The algorithm precisely computes the work lost, incurs minimal re-execution overhead, and can recover from an arbitrary number of failures. Experimental evaluation demonstrates low overheads in the absence of failures, recovery overheads on the same order as the lost work, and much lower recovery costs than alternative strategies.
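
A toy sketch of the bookkeeping idea, in plain C++ with names of our own choosing (`StealRecord`, `lost_work`; not the paper's algorithm or data structures): recording, for every steal, which frame the thief took work from lets recovery identify exactly the stolen subtrees lost with a failed thread, so only that work is rescheduled. The paper's scheme maintains this information distributedly with small per-steal state.

```cuda
#include <vector>

struct StealRecord {
    int victim_thread;   // thread that owned the parent frame
    int thief_thread;    // thread that took the stolen work
    long task_id;        // root of the stolen subtree
};

// Tasks to replay after a fail-stop failure of one thread: every
// subtree whose thief was the failed thread was lost with it.
std::vector<long> lost_work(const std::vector<StealRecord>& steals,
                            int failed_thread) {
    std::vector<long> to_replay;
    for (const auto& s : steals)
        if (s.thief_thread == failed_thread)
            to_replay.push_back(s.task_id);
    return to_replay;
}
```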

Collaboration


Dive into Wenjing Ma's collaboration.

Top Co-Authors

Sriram Krishnamoorthy, Pacific Northwest National Laboratory
Guoping Long, Chinese Academy of Sciences
Karol Kowalski, Environmental Molecular Sciences Laboratory
Yucheng Li, Chinese Academy of Sciences
Lei Yu, Chinese Academy of Sciences
Xin Huo, Ohio State University
Changying Du, Chinese Academy of Sciences