Network


Latest external collaboration on country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Chunhua Liao is active.

Publication


Featured researches published by Chunhua Liao.


international workshop on openmp | 2013

Early Experiences With The OpenMP Accelerator Model

Chunhua Liao; Yonghong Yan; Bronis R. de Supinski; Daniel J. Quinlan; Barbara M. Chapman

A recent trend in mainstream computer nodes is the combined use of general-purpose multicore processors and specialized accelerators such as GPUs and DSPs in order to achieve better performance and to reduce power consumption. To support this trend, the OpenMP Language Committee has approved a set of extensions to OpenMP (referred to as the OpenMP accelerator model). The initial version is the subject of Technical Report 1 (TR1) while OpenMP 4.0 Release Candidate 2 (RC2) further refines the extensions.


international workshop on openmp | 2010

A ROSE-Based OpenMP 3.0 research compiler supporting multiple runtime libraries

Chunhua Liao; Daniel J. Quinlan; Thomas Panas; Bronis R. de Supinski

OpenMP is a popular and evolving programming model for shared-memory platforms. It relies on compilers to target modern hardware architectures for optimal performance. A variety of extensible and robust research compilers are key to OpenMP’s sustainable success in the future. In this paper, we present our efforts to build an OpenMP 3.0 research compiler for C, C++, and Fortran using the ROSE source-to-source compiler framework. Our goal is to support OpenMP research for ourselves and others. We have extended ROSE’s internal representation to handle all OpenMP 3.0 constructs, thus facilitating experimenting with them. Since OpenMP research is often complicated by the tight coupling of the compiler translation and the runtime system, we present a set of rules to define a common OpenMP runtime library (XOMP) on top of multiple runtime libraries. These rules additionally define how to build a set of translations targeting XOMP. Our work demonstrates how to reuse OpenMP translations across different runtime libraries. This work simplifies OpenMP research by decoupling the problematic dependence between the compiler translations and the runtime libraries. We present an evaluation of our work by demonstrating an analysis tool for OpenMP correctness. We also show how XOMP can be defined using both GOMP and Omni. Our comparative performance results against other OpenMP compilers demonstrate that our flexible runtime support does not incur additional overhead.


languages and compilers for parallel computing | 2009

Effective source-to-source outlining to support whole program empirical optimization

Chunhua Liao; Daniel J. Quinlan; Richard W. Vuduc; Thomas Panas

Although automated empirical performance optimization and tuning is well-studied for kernels and domain-specific libraries, a current research grand challenge is how to extend these methodologies and tools to significantly larger sequential and parallel applications. In this context, we present the ROSE source-to-source outliner, which addresses the problem of extracting tunable kernels out of whole programs, thereby helping to convert the challenging whole-program tuning problem into a set of more manageable kernel tuning tasks. Our outliner aims to handle large scale C/C++, Fortran and OpenMP applications. A set of program analysis and transformation techniques are utilized to enhance the portability, scalability, and interoperability of source-to-source outlining. More importantly, the generated kernels preserve performance characteristics of tuning targets and can be easily handled by other tools. Preliminary evaluations have shown that the ROSE outliner serves as a key component within an end-to-end empirical optimization system and enables a wide range of sequential and parallel optimization opportunities.


dependable systems and networks | 2012

ROSE::FTTransform - A source-to-source translation framework for exascale fault-tolerance research

Jacob Lidman; Daniel J. Quinlan; Chunhua Liao; Sally A. McKee

Exascale computing systems will require sufficient resilience to tolerate numerous types of hardware faults while still assuring correct program execution. Such extreme-scale machines are expected to be dominated by processors driven at lower voltages (near the minimum 0.5 volts for current transistors). At these voltage levels, the rate of transient errors increases dramatically due to the sensitivity to transient and geographically localized voltage drops on parts of the processor chip. To achieve power efficiency, these processors are likely to be streamlined and minimal, and thus they cannot be expected to handle transient errors entirely in hardware. Here we present an open, compiler-based framework to automate the armoring of High Performance Computing (HPC) software to protect it from these types of transient processor errors. We develop an open infrastructure to support research work in this area, and we define tools that, in the future, may provide more complete automated and/or semi-automated solutions to support software resiliency on future exascale architectures. Results demonstrate that our approach is feasible, pragmatic in how it can be separated from the software development process, and reasonably efficient (0% to 30% overhead for the Jacobi iteration on common hardware; and 20%, 40%, 26%, and 2% overhead for a randomly selected subset of benchmarks from the Livermore Loops [1]).


ieee international conference on high performance computing data and analytics | 2011

Auto-tuning full applications: A case study

Ananta Tiwari; Jeffrey K. Hollingsworth; Chun Chen; Mary W. Hall; Chunhua Liao; Daniel J. Quinlan; Jacqueline Chame

In this paper, we take a concrete step towards materializing our long-term goal of providing a fully automatic end-to-end tuning infrastructure for arbitrary program components and full applications. We describe a general-purpose offline auto-tuning framework and apply it to an application benchmark, SMG2000, a semi-coarsening multigrid on structured grids. We show that the proposed system first extracts computationally intensive loop nests into separate executable functions, a code transformation called outlining. The outlined loop nests are then tuned by the framework and subsequently integrated back into the application. Each loop nest is optimized through a series of composable code transformations, with the transformations parameterized by unbound optimization parameters that are bound during the tuning process. The values for these parameters are selected using a search-based auto-tuner, which performs a parallel heuristic search for the best-performing optimized variants of the outlined loop nests. We show that our system pinpoints a code variant that performs 2.37 times faster than the original loop nest. When the full application is run using the code variant found by the system, the application’s performance improves by 27%.


international conference on parallel processing | 2013

Symbolic Analysis of Concurrency Errors in OpenMP Programs

Hongyi Ma; Steve R. Diersen; Liqiang Wang; Chunhua Liao; Daniel J. Quinlan; Zijiang Yang

In this paper we present the OpenMP Analysis Toolkit (OAT), which uses Satisfiability Modulo Theories (SMT) solver based symbolic analysis to detect data races and deadlocks in OpenMP codes. Our approach approximately simulates real executions of an OpenMP program through schedule permutation. We conducted experiments on real-world OpenMP benchmarks and student homework assignments by comparing our OAT tool with two commercial dynamic analysis tools: Intel Thread Checker and Sun Thread Analyzer, and one commercial static analysis tool: Viva64 PVS Studio. The experiments show that our symbolic analysis approach is more accurate than static analysis and more efficient and scalable than dynamic analysis tools with less false positives and negatives.


International Journal of Parallel Programming | 2010

Semantic-Aware Automatic Parallelization of Modern Applications Using High-Level Abstractions

Chunhua Liao; Daniel J. Quinlan; Jeremiah Willcock; Thomas Panas

Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. Modern applications using high-level abstractions, such as C++ STL containers and complex user-defined class types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we use a source-to-source compiler infrastructure, ROSE, to explore compiler techniques to recognize high-level abstractions and to exploit their semantics for automatic parallelization. Several representative parallelization candidate kernels are used to study semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Preliminary results have shown that semantics of abstractions can help extend the applicability of automatic parallelization to modern applications and expose more opportunities to take advantage of multicore processors.


programming models and applications for multicores and manycores | 2015

Supporting multiple accelerators in high-level programming models

Yonghong Yan; Pei-Hung Lin; Chunhua Liao; Bronis R. de Supinski; Daniel J. Quinlan

Computational accelerators, such as manycore NVIDIA GPUs, Intel Xeon Phi and FPGAs, are becoming common in work-stations, servers and supercomputers for scientific and engineering applications. Efficiently exploiting the massive parallelism these accelerators provide requires the designs and implementations of productive programming models. In this paper, we explore support of multiple accelerators in high-level programming models. We design novel language extensions to OpenMP to support offloading data and computation regions to multiple accelerators (devices). These extensions allow for distributing data and computation among a list of devices via easy-to-use interfaces, including specifying the distribution of multi-dimensional arrays and declaring shared data regions among accelerators. Computation distribution is realized by partitioning a loop iteration space among accelerators. We implement mechanisms to marshal/unmarshal and to move data of non-contiguous array subregions and shared regions between accelerators without involving CPUs. We design reduction techniques that work across multiple accelerators. Combined compiler and runtime support is designed to manage multiple GPUs using asynchronous operations and threading mechanisms. We implement our solutions for NVIDIA GPUs and demonstrate through example OpenMP codes the effectiveness of our solutions for performance improvement.


computing frontiers | 2012

Studying the impact of application-level optimizations on the power consumption of multi-core architectures

Shah Mohammad Faizur Rahman; Jichi Guo; Akshatha Bhat; Carlos D. Garcia; Majedul Haque Sujon; Qing Yi; Chunhua Liao; Daniel J. Quinlan

This paper studies the overall system power variations of two multi-core architectures, an 8-core Intel and a 32-core AMD workstation, while using these machines to execute a wide variety of sequential and multi-threaded benchmarks using varying compiler optimization settings and runtime configurations. Our extensive experimental study provides insights for answering two questions: 1) what degrees of impact can application level optimizations have on reducing the overall system power consumption of modern CMP architectures; and 2) what strategies can compilers and application developers adopt to achieve a balanced performance and power efficiency for applications from a variety of science and embedded systems domains.


international workshop on openmp | 2012

Auto-scoping for OpenMP tasks

Sara Royuela; Alejandro Duran; Chunhua Liao; Daniel J. Quinlan

Auto-scoping analysis for OpenMP must be revised owing to the introduction of asynchronous parallelism in the form of tasks. Auto-scoping is the process of automatically determine the data-sharing of variables. This process has been implemented for worksharing and parallel regions. Based on the previous work, we present an auto-scoping algorithm to work with OpenMP tasks. This is a much more complex challenge due to the uncertainty of when a task will be executed, which makes it harder to determine what parts of the program will run concurrently. We also introduce an implementation of the algorithm and results with several benchmarks showing that the algorithm is able to correctly scope a large percentage of the variables appearing in them.

Collaboration


Dive into the Chunhua Liao's collaboration.

Top Co-Authors

Avatar

Daniel J. Quinlan

Lawrence Livermore National Laboratory

View shared research outputs
Top Co-Authors

Avatar

Pei-Hung Lin

Lawrence Livermore National Laboratory

View shared research outputs
Top Co-Authors

Avatar

Thomas Panas

Lawrence Livermore National Laboratory

View shared research outputs
Top Co-Authors

Avatar

Bronis R. de Supinski

Lawrence Livermore National Laboratory

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Xipeng Shen

North Carolina State University

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Dan Quinlan

Lawrence Livermore National Laboratory

View shared research outputs
Top Co-Authors

Avatar

Hongyi Ma

University of Wyoming

View shared research outputs
Top Co-Authors

Avatar

Ian Karlin

Lawrence Livermore National Laboratory

View shared research outputs
Researchain Logo
Decentralizing Knowledge