Publication


Featured research published by Guoyang Chen.


international symposium on microarchitecture | 2014

PORPLE: An Extensible Optimizer for Portable Data Placement on GPU

Guoyang Chen; Bo Wu; Dong Li; Xipeng Shen

GPU is often equipped with complex memory systems, including global memory, texture memory, shared memory, constant memory, and various levels of cache. Where to place the data is important for the performance of a GPU program. However, the decision is difficult for a programmer to make because of architecture complexity and the sensitivity of suitable data placements to input and architecture changes. This paper presents PORPLE, a portable data placement engine that enables a new way to solve the data placement problem. PORPLE consists of a mini specification language, a source-to-source compiler, and a runtime data placer. The language allows an easy description of a memory system; the compiler transforms a GPU program into a form amenable to runtime profiling and data placement; the placer, based on the memory description and data access patterns, identifies on the fly appropriate placement schemes for data and places them accordingly. PORPLE is distinctive in being adaptive to program inputs and architecture changes, being transparent to programmers (in most cases), and being extensible to new memory architectures. Our experiments on three types of GPU systems show that PORPLE is able to consistently find optimal or near-optimal placement despite the large differences among GPU architectures and program inputs, yielding up to 2.08X (1.59X on average) speedups on a set of regular and irregular GPU benchmarks.


international conference on supercomputing | 2015

Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations

Bo Wu; Guoyang Chen; Dong Li; Xipeng Shen; Jeffrey S. Vetter

A GPU's computing power lies in its abundant memory bandwidth and massive parallelism. However, its hardware thread schedulers, despite being able to quickly distribute computation to processors, often fail to capitalize on program characteristics effectively, achieving only a fraction of the GPU's full potential. Moreover, current GPUs do not allow programmers or compilers to control this thread scheduling, forfeiting important optimization opportunities at the program level. This paper presents a transformation centered on Streaming Multiprocessors (SM); this software approach to circumventing the limitations of the hardware scheduler allows flexible program-level control of scheduling. By permitting precise control of job locality on SMs, the transformation overcomes inherent limitations in prior methods. With this technique, flexible control of GPU scheduling at the program level becomes feasible, which opens up new opportunities for GPU program optimizations. The second part of the paper explores how the new opportunities could be leveraged for GPU performance enhancement, what complexities there are, and how to address them. We show that some simple optimization techniques can enhance co-runs of multiple kernels and improve data locality of irregular applications, producing a 20-33% average improvement in performance, system throughput, and average turnaround time.


international symposium on microarchitecture | 2015

Free launch: optimizing GPU dynamic kernel launches through thread reuse

Guoyang Chen; Xipeng Shen

Supporting dynamic parallelism is important for GPU to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPU: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkernel launches. Neither is satisfactory. The former is complicated to program and is often subject to some load imbalance; the latter suffers from large runtime overhead. In this work, we propose free launch, a new software approach to overcoming the shortcomings of both methods. It allows programmers to use subkernel launches to express dynamic parallelism. It employs a novel compiler-based code transformation named subkernel launch removal to replace the subkernel launches with the reuse of parent threads. Coupled with an adaptive task assignment mechanism, the transformation reassigns the tasks in the subkernels to the parent threads with a good load balance. The technique requires no hardware extensions and is immediately deployable on existing GPUs. It keeps the programming convenience of the subkernel launch-based approach while avoiding its large runtime overhead. Meanwhile, its superior load balancing makes it outperform manual worklist-based techniques by 3X on average.
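The idea of handing subkernel tasks back to the parent threads can be illustrated with a small CPU-side sketch. It is not the paper's transformation: the abstract does not specify the adaptive assignment mechanism, so the strided scheme below, and the names `taskOffsets`, `ownerOf`, and `reassignedLoad`, are assumptions made for illustration only. Each parent thread would have launched a subkernel of `childCounts[i]` tasks; instead, all child tasks are numbered globally with a prefix sum and spread evenly over the existing parents.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: exclusive prefix sum over the per-parent child task counts.
// offsets[i] is the global id of parent i's first child task; offsets.back() is the total.
std::vector<std::size_t> taskOffsets(const std::vector<std::size_t>& childCounts) {
    std::vector<std::size_t> offsets(childCounts.size() + 1, 0);
    for (std::size_t i = 0; i < childCounts.size(); ++i)
        offsets[i + 1] = offsets[i] + childCounts[i];
    return offsets;
}

// Map a global task id (must be < offsets.back()) back to the parent that would
// have launched it and the task's index inside that parent's subkernel.
std::pair<std::size_t, std::size_t> ownerOf(const std::vector<std::size_t>& offsets,
                                            std::size_t task) {
    // upper_bound finds the first offset greater than task; the owner precedes it.
    auto it = std::upper_bound(offsets.begin(), offsets.end(), task);
    std::size_t parent = static_cast<std::size_t>(it - offsets.begin()) - 1;
    return {parent, task - offsets[parent]};
}

// Number of child tasks each parent thread executes when the tasks are
// reassigned in a strided fashion (assumed scheme, chosen for simplicity).
std::vector<std::size_t> reassignedLoad(const std::vector<std::size_t>& childCounts) {
    const std::size_t parents = childCounts.size();
    std::vector<std::size_t> load(parents, 0);
    if (parents == 0) return load;
    const std::size_t total = taskOffsets(childCounts).back();
    for (std::size_t worker = 0; worker < parents; ++worker)
        for (std::size_t task = worker; task < total; task += parents)
            ++load[worker];  // on a GPU, the worker would run the child body here
    return load;
}
```

With child counts of 8, 0, 1, and 3, the original launches are heavily skewed toward the first parent, while the strided reassignment gives every parent thread three tasks. On a real GPU the loop over `worker` would be the parent threads themselves, and `ownerOf` would recover the arguments the subkernel would have received.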


acm sigplan symposium on principles and practice of parallel programming | 2017

EffiSha: A Software Framework for Enabling Efficient Preemptive Scheduling of GPU

Guoyang Chen; Yue Zhao; Xipeng Shen; Huiyang Zhou

Modern GPUs are broadly adopted in many multitasking environments, including data centers and smartphones. However, the current support for the scheduling of multiple GPU kernels (from different applications) is limited, forming a major barrier for GPU to meet many practical needs. This work for the first time demonstrates that on existing GPUs, efficient preemptive scheduling of GPU kernels is possible even without special hardware support. Specifically, it presents EffiSha, a pure software framework that enables preemptive scheduling of GPU kernels with very low overhead. The enabled preemptive scheduler offers flexible support of kernels of different priorities, and demonstrates significant potential for reducing the average turnaround time and improving the overall system throughput of programs that time-share a modern GPU.


international conference on supercomputing | 2016

Coherence-Free Multiview: Enabling Reference-Discerning Data Placement on GPU

Guoyang Chen; Xipeng Shen

A Graphics Processing Unit (GPU) system is typically equipped with many types of memory (e.g., global, constant, texture, shared, cache). Data placement determines what data are placed on which type of memory, which is essential for GPU memory performance. Prior optimizations of data placement always require a single view of a data object on memory, which limits the optimization effectiveness. In this work, we propose coherence-free multiview, an approach that allows multiple views of a single data object to co-exist on GPU memory during a GPU kernel execution. We demonstrate that under certain conditions, the multiple views can remain incoherent while facilitating enhanced data placement. We present a theorem and some compiler support to ensure the soundness of the usage of coherence-free multiview. We further develop reference-discerning data placement, a new way to enhance data placements on GPU. It enables more flexible data placements by using coherence-free multiview to leverage the slack in the coherence requirements of some GPU programs. Experiments on three types of GPU systems show that, with less than 200KB space cost, the new data placement technique can provide a 1.6X average (up to 4.27X) speedup.


IEEE Transactions on Computers | 2017

Optimizing Data Placement on GPU Memory: A Portable Approach

Guoyang Chen; Xipeng Shen; Bo Wu; Dong Li

Modern GPUs feature complex memory system designs. One GPU may contain many types of memory of different properties. The best way to place data in memory is sensitive to many factors (e.g., program inputs, architectures), making portable optimizations of GPU data placement a difficult challenge. PORPLE is a recently proposed method that overcomes the difficulties by enabling online optimizations of data placement through a three-way synergy: a specification language for memory system description, a compiler framework for data access analysis and code staging, and a runtime library for efficiently finding and materializing data placement on the fly. This article provides a comprehensive description of this method, and presents several extensions that significantly improve the scalability of PORPLE, which include a novel algorithm design for efficiently searching for the best data placements, the use of active profiling for reducing the online-profiling overhead, and a systematic examination of a path-based performance model. By automatically tailoring data placements for each execution of a GPU program, the enhanced PORPLE brings significant speedups (1.72X on average) to many GPU kernels across GPU architectures and program inputs.


acm sigplan symposium on principles and practice of parallel programming | 2016

Data-centric combinatorial optimization of parallel code

Hao Luo; Guoyang Chen; Pengcheng Li; Chen Ding; Xipeng Shen

Memory performance is one essential factor for tapping into the full potential of the massive parallelism of GPU. It has motivated some recent efforts in GPU cache modeling. This paper presents a new data-centric way to model the performance of a system with heterogeneous memory resources. The new model is composable, meaning it can predict the performance difference due to placing data differently by profiling the execution just once.


international symposium on microarchitecture | 2017

Efficient support of position independence on non-volatile memory

Guoyang Chen; Lei Zhang; Richa Budhiraja; Xipeng Shen; Youfeng Wu

This paper explores solutions for enabling efficient support of position independence of pointer-based data structures on byte-addressable Non-Volatile Memory (NVM). When a dynamic data structure (e.g., a linked list) gets loaded from persistent storage into main memory in different executions, the locations of the elements contained in the data structure could differ in the address spaces from one run to another. As a result, some special support must be provided to ensure that the pointers contained in the data structures always point to the correct locations, which is called position independence. This paper shows the insufficiency of traditional methods in supporting position independence on NVM. It proposes a concept called implicit self-contained representations of pointers, and develops two such representations named off-holder and Region ID in Value (RIV) to materialize the concept. Experiments show that the enabled representations provide much more efficient and flexible support of position independence for dynamic data structures, alleviating a major issue for effective data reuse on NVM.
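To make position independence concrete, here is a minimal sketch of a self-relative pointer. It assumes that "off-holder" means storing the offset from the pointer field's own address (the holder) to the target, which the abstract does not spell out; the names `OffHolder`, `store`, `load`, and `sumAfterRelocation` are hypothetical, and RIV is not shown. A plain `memcpy` between two arrays stands in for the data being mapped at a different address in a later run.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed representation: the field holds (target address - its own address).
// An offset of 0 encodes null, since a link never usefully points at itself.
struct OffHolder {
    std::intptr_t offset;
};

// Record target relative to the holder's current location.
inline void store(OffHolder& holder, const void* target) {
    holder.offset = target
        ? reinterpret_cast<std::intptr_t>(target) - reinterpret_cast<std::intptr_t>(&holder)
        : 0;
}

// Rebuild an absolute pointer from wherever the holder lives right now.
template <typename T>
T* load(const OffHolder& holder) {
    if (holder.offset == 0) return nullptr;
    return reinterpret_cast<T*>(reinterpret_cast<std::intptr_t>(&holder) + holder.offset);
}

struct Node {
    int value;
    OffHolder next;  // trivially copyable, so the node can be moved byte-for-byte
};

// Build a three-node list, relocate its bytes, destroy the original,
// and walk the relocated copy without any pointer fix-up.
int sumAfterRelocation() {
    Node original[3];
    for (int i = 0; i < 3; ++i) {
        original[i].value = (i + 1) * 10;
        store(original[i].next, i + 1 < 3 ? &original[i + 1] : nullptr);
    }

    Node relocated[3];
    std::memcpy(relocated, original, sizeof original);  // simulates a new mapping address
    std::memset(original, 0, sizeof original);          // the old addresses are now invalid

    int sum = 0;
    for (const Node* n = &relocated[0]; n != nullptr; n = load<const Node>(n->next))
        sum += n->value;
    return sum;
}
```

Because every link is relative to its own holder, the list stays valid as long as the whole region moves together; links that cross regions are not covered by this sketch.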


european conference on object-oriented programming | 2016

Towards Ontology-Based Program Analysis

Yue Zhao; Guoyang Chen; Chunhua Liao; Xipeng Shen

Program analysis is fundamental for program optimizations, debugging, and many other tasks. But developing program analyses has been a challenging and error-prone process for general users. Declarative program analysis has shown the promise to dramatically improve the productivity in the development of program analyses. Current declarative program analysis is, however, subject to some major limitations in supporting cooperation among analysis tools and in guiding program optimizations, and it often requires much effort for repeated program preprocessing. In this work, we advocate the integration of ontology into declarative program analysis. As a way to standardize the definitions of concepts in a domain and the representation of the knowledge in the domain, ontology offers a promising way to address the limitations of current declarative program analysis. We develop a prototype framework named PATO for conducting program analysis upon ontology-based program representation. Experiments on six program analyses confirm the potential of ontology for complementing existing declarative program analysis. It supports multiple analyses without separate program preprocessing, promotes cooperative liveness analysis between two compilers, and effectively guides a data placement optimization for Graphics Processing Units (GPUs).


international conference on parallel architectures and compilation techniques | 2014

SM-centric transformation: circumventing hardware restrictions for flexible GPU scheduling

Bo Wu; Guoyang Chen; Dong Li; Xipeng Shen; Jeffrey S. Vetter

To circumvent the limitation from the hardware scheduler on GPU, we create an SM-centric transformation technique. This technique enables complete control of the mapping between tasks and streaming multi-processors (SMs), and enables controlling the number of active thread blocks on each SM. Results show that our approach achieves better speedup than previous ones with kernel co-run cases.
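The two controls named in this abstract, task-to-SM mapping and the number of active thread blocks per SM, can be mimicked in a sequential simulation. This is only a sketch under stated assumptions: the hardware still decides where each launched worker lands (`workerSm`), surplus workers on an SM simply exit, and the retained workers pull from that SM's own task queue. The names `Schedule` and `smCentricRun` are hypothetical and do not come from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Schedule {
    std::vector<std::size_t> activeWorkers;    // workers retained on each SM
    std::vector<std::vector<int>> executedOn;  // task ids executed on each SM, in order
};

// workerSm[i]   : SM on which the hardware scheduler placed launched worker i
// tasksForSm[s] : tasks the program wants to run on SM s
// workersPerSm  : program-chosen cap on active workers per SM
Schedule smCentricRun(const std::vector<std::size_t>& workerSm,
                      const std::vector<std::vector<int>>& tasksForSm,
                      std::size_t workersPerSm) {
    const std::size_t numSms = tasksForSm.size();
    Schedule result{std::vector<std::size_t>(numSms, 0),
                    std::vector<std::vector<int>>(numSms)};

    // Step 1: each worker checks in on its SM; workers beyond the cap exit at once.
    std::vector<std::size_t> retained;  // SM id of every worker that stays alive
    for (std::size_t sm : workerSm) {
        if (sm >= numSms || result.activeWorkers[sm] >= workersPerSm) continue;
        ++result.activeWorkers[sm];
        retained.push_back(sm);
    }

    // Step 2: retained workers repeatedly pull the next task from their own SM's queue.
    std::vector<std::size_t> nextTask(numSms, 0);
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (std::size_t sm : retained) {
            if (nextTask[sm] < tasksForSm[sm].size()) {
                result.executedOn[sm].push_back(tasksForSm[sm][nextTask[sm]++]);
                progressed = true;
            }
        }
    }
    return result;
}
```

The point of the design is that the worker, not the hardware scheduler, decides which task it runs, based on the SM it finds itself on. On an actual GPU the check-in and the queue pull would need atomic operations; the sequential loops here only model the resulting assignment.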

Collaboration


Dive into Guoyang Chen's collaborations.

Top Co-Authors

Xipeng Shen

North Carolina State University

Bo Wu

Colorado School of Mines

Dong Li

Oak Ridge National Laboratory

Jeffrey S. Vetter

Oak Ridge National Laboratory

Yue Zhao

North Carolina State University

Chen Ding

University of Rochester

Hao Luo

University of Rochester

Huiyang Zhou

North Carolina State University

Lei Zhang

North Carolina State University

Pengcheng Li

University of Rochester
