Po-Han Wang
National Taiwan University
Publications
Featured research published by Po-Han Wang.
ACM Transactions on Architecture and Code Optimization | 2011
Po-Han Wang; Chia-Lin Yang; Yen-Ming Chen; Yu-Jung Cheng
As technology continues to shrink, reducing leakage is critical to achieving energy efficiency. Previous studies on low-power GPUs (Graphics Processing Units) focused on techniques for dynamic power reduction, such as DVFS (Dynamic Voltage and Frequency Scaling) and clock gating. In this paper, we explore the potential of adopting architecture-level power gating techniques for leakage reduction on GPUs. We propose three strategies for applying power gating to different modules in GPUs. The Predictive Shader Shutdown technique exploits workload variation across frames to eliminate leakage in shader clusters. The Deferred Geometry Pipeline seeks to minimize leakage in fixed-function geometry units by exploiting the imbalance between geometry and fragment computation across batches. Finally, a simple time-out power gating method is applied to nonshader execution units to exploit finer-grained idle periods. Our results indicate that Predictive Shader Shutdown eliminates up to 60% of the leakage in shader clusters, the Deferred Geometry Pipeline removes up to 57% of the leakage in the fixed-function geometry units, and the simple time-out power gating mechanism eliminates 83.3% of the leakage in nonshader execution units on average. All three schemes incur negligible performance degradation, less than 1%.
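The time-out policy described above is simple enough to sketch. The following is a minimal illustrative model, not the paper's simulator: a unit is power-gated once it has been idle for more than `timeout` consecutive cycles, and leakage is saved only while gated (wake-up latency is not modeled here).

```python
def gated_cycles(activity, timeout):
    """Count cycles spent power-gated for a 0/1 activity trace.

    activity: sequence of cycles, 1 = unit busy, 0 = unit idle.
    timeout:  number of consecutive idle cycles before gating kicks in.
    """
    idle = 0      # consecutive idle cycles seen so far
    gated = 0     # cycles spent in the gated (leakage-saving) state
    for busy in activity:
        if busy:
            idle = 0          # activity resets the idle counter
        else:
            idle += 1
            if idle > timeout:
                gated += 1    # past the time-out: unit is gated this cycle
    return gated

# A long idle stretch lets the unit spend most of it gated.
trace = [1, 1] + [0] * 18
print(gated_cycles(trace, timeout=3))  # 15 of the 18 idle cycles are gated
```

A shorter time-out saves more leakage but risks gating a unit just before it is needed again, which is why the paper applies this policy only to nonshader units with predictable idle periods.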
IEEE Computer Architecture Letters | 2009
Po-Han Wang; Yen-Ming Chen; Chia-Lin Yang; Yu-Jung Cheng
As technology continues to shrink, reducing leakage is critical to achieving energy efficiency. Previous work on low-power GPUs (graphics processing units) focused on techniques for dynamic power reduction, such as DVFS (Dynamic Voltage/Frequency Scaling) and clock gating. In this paper, we explore the potential of adopting architecture-level power gating techniques for leakage reduction on GPUs. In particular, we focus on the most power-hungry components, the shader processors. We observe that, because scene complexity differs from frame to frame, the shader resources required to sustain the target frame rate also vary across frames. We therefore propose the predictive shader shutdown technique, which exploits this workload variation across frames to reduce leakage in the shader processors. The experimental results show that predictive shader shutdown achieves up to 46% leakage reduction on shader processors with negligible performance degradation.
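The prediction idea can be sketched in a few lines. This is a hypothetical model, not the paper's mechanism: the previous frame's workload is used to estimate how many shader clusters the next frame needs to meet a target frame time, and the rest are shut down. The linear workload model and all parameter names are illustrative assumptions.

```python
import math

def shaders_needed(workload, per_shader_throughput, target_frame_time):
    """Smallest shader count whose aggregate throughput meets the deadline."""
    required_rate = workload / target_frame_time
    return max(1, math.ceil(required_rate / per_shader_throughput))

def schedule(frame_workloads, total_shaders, per_shader_throughput,
             target_frame_time):
    """Predict each frame's active-shader count from the previous frame."""
    active = [total_shaders]  # first frame: no history, keep everything on
    for prev in frame_workloads[:-1]:
        n = shaders_needed(prev, per_shader_throughput, target_frame_time)
        active.append(min(n, total_shaders))
    return active

# Simple scenes let most clusters be shut down; a complex frame is predicted
# from the (still simple) frame before it, so mispredictions are possible.
plan = schedule([8.0, 2.0, 2.0, 12.0], total_shaders=8,
                per_shader_throughput=1.0, target_frame_time=2.0)
print(plan)  # [8, 4, 1, 1]
```

The last frame in the example is underprovisioned because its predictor saw a simple frame, which is exactly the misprediction risk a real scheme must bound to keep performance degradation negligible.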
international symposium on performance analysis of systems and software | 2012
Po-Han Wang; Chien-Wei Lo; Chia-Lin Yang; Yu-Jung Cheng
The massive parallelism provided by modern graphics processing units (GPUs) makes them attractive processors for accelerating applications with high data-level parallelism. The GPU architecture has therefore gained considerable attention in the research community. However, progress in GPU architecture research is impeded by the limited documentation released by the major GPU vendors. Furthermore, current studies on GPUs often focus only on general-purpose (GPGPU) applications. The behavior of graphics applications, which remain the major GPU workloads, is often overlooked in these studies. A GPU design that is good for GPGPU applications is not necessarily good for graphics applications. Therefore, a simulation framework that can characterize the performance of both classes of applications is essential for innovation in GPU architecture.
international symposium on vlsi design, automation and test | 2014
Po-Han Wang; Gen-Hong Liu; Jen-Chieh Yeh; Tse-Min Chen; Hsu-Yao Huang; Chia-Lin Yang; Shih-Lien Liu; James B. S. G. Greensky
The integrated CPU/GPU architecture brings performance advantages because the communication cost between the CPU and GPU is reduced, but it also imposes new challenges in processor architecture design, especially in the management of shared memory resources, e.g., the last-level cache and memory bandwidth. A micro-architecture-level simulator is therefore essential to facilitate research in this direction. In this paper, we develop the first cycle-level full-system simulation framework for CPU-GPU integration with detailed memory models. With this simulation framework, we analyze the communication cost between the CPU and GPU for GPU workloads, and characterize the memory system when CPU and GPU applications run concurrently.
international symposium on performance analysis of systems and software | 2017
Li Wang; Ren-Wei Tsai; Shao-Chung Wang; Kun-Chih Chen; Po-Han Wang; Hsiang-Yun Cheng; Yi-Chung Lee; Sheng-Jie Shu; Chun-Chieh Yang; Min-Yih Hsu; Li-Chen Kan; Chao-Lin Lee; Tzu-Chieh Yu; Rih-Ding Peng; Chia-Lin Yang; Yuan-Shin Hwang; Jenq Kuen Lee; Shiao-Li Tsao; Ming Ouhyoung
Heterogeneous CPU-GPU systems have recently emerged as an energy-efficient computing platform. A robust integrated CPU-GPU simulator is essential to facilitate research in this direction. While a few integrated CPU-GPU simulators are available, none support OpenCL 2.0, a widely used new standard with promising heterogeneous computing features. In this paper, we extend an existing integrated CPU-GPU simulator, gem5-gpu, to support OpenCL 2.0. In addition, we conduct experiments on the extended simulator to assess the impact of the new features introduced by OpenCL 2.0. Our OpenCL 2.0-compatible simulator is validated against a state-of-the-art commercial product, and is expected to help boost future studies of heterogeneous CPU-GPU systems.
design, automation, and test in europe | 2010
Yi-Jung Chen; Chia-Lin Yang; Po-Han Wang
Multi-Processor System-on-Chips (MPSoCs) exploit task-level parallelism to achieve high computation throughput, but concurrent memory accesses from multiple PEs may create a memory bottleneck. Therefore, to maximize system performance, it is important to consider the PE and on-chip memory architecture designs simultaneously. However, in a traditional MPSoC design flow, PE allocation and on-chip memory allocation are often considered independently. To tackle this problem, we propose the first PE and Memory Co-synthesis (PM-COSYN) framework for MPSoCs. One critical issue in such a memory-aware MPSoC design is how to utilize the available die area to achieve a balanced design between the memory and computation subsystems. The goal of PM-COSYN is therefore to allocate PEs and on-chip memory for MPSoCs with a Network-on-Chip (NoC) architecture such that system performance is maximized and the area constraint is met. The experimental results show that PM-COSYN can synthesize NoC resource allocation according to the needs of the target task set. Compared to a simulated-annealing method, PM-COSYN generates a comparable solution in much less CPU time.
design automation conference | 2018
Hsueh-Chun Fu; Po-Han Wang; Chia-Lin Yang
Accelerator-rich architectures employ IOMMUs to support a unified virtual address space, but research shows that IOMMUs fail to meet the performance and energy requirements of accelerators. Instead of optimizing the speed and energy of IOMMU address translation, this work tackles the issue from a new perspective: eliminating the need for translation with an active forwarding (AF) mechanism that forwards accelerator input data directly from the CPU cache to the accelerator's scratchpad memory. Results show that, on average, AF provides an 8% performance improvement over the state-of-the-art mechanism, hostPageWalk, and reduces accelerator power by 22.1%.
asia and south pacific design automation conference | 2017
Li-Wei Shieh; Kun-Chih Chen; Hsueh-Chun Fu; Po-Han Wang; Chia-Lin Yang
To provide QoS on resource-limited mobile systems, we introduce a fast preemption mechanism for GPUs. First, we introduce a dual-kernel execution model to support fine-grained preemption, together with a resource allocation policy that avoids the resource fragmentation problem. Second, we propose a preemption victim selection scheme that reduces throughput overhead while satisfying a required preemption latency. Evaluations show that our scheme comes within 2% of the ideal preemption scheme in terms of deadline violations. Furthermore, on average we improve GPU resource utilization during preemption by 2.93× over the prior technique.
IEEE Computer Architecture Letters | 2017
Li-Jhan Chen; Hsiang-Yun Cheng; Po-Han Wang; Chia-Lin Yang
Modern GPGPUs support the concurrent execution of thousands of threads to provide an energy-efficient platform. However, the massive multi-threading of GPGPUs incurs serious cache contention, as the cache lines brought in by one thread can easily be evicted by other threads in the small shared cache. In this paper, we propose a software-hardware cooperative approach that exploits the spatial locality among different thread blocks to better utilize the precious cache capacity. Through dynamic locality estimation and thread block scheduling, we can capture more performance improvement opportunities than prior work that only explores the spatial locality between consecutive thread blocks. Evaluations across diverse GPGPU applications show that, on average, our locality-aware scheduler provides 25 and 9 percent performance improvements over the commonly employed round-robin scheduler and the state-of-the-art scheduler, respectively.
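The contrast with round-robin scheduling can be sketched as follows. This is an illustrative model only (the footprint sets, greedy heuristic, and all names are assumptions, not the paper's algorithm): thread blocks whose memory footprints overlap are assigned to the same core so they can share that core's cache, instead of being spread round-robin.

```python
def round_robin(blocks, num_cores):
    """Baseline: blocks are dealt out to cores in arrival order."""
    return {i: [b for j, b in enumerate(blocks) if j % num_cores == i]
            for i in range(num_cores)}

def locality_aware(blocks, footprints, num_cores):
    """Greedy sketch: place each block on the core whose cached footprint
    it overlaps most, breaking ties toward the least-loaded core."""
    assignment = {i: [] for i in range(num_cores)}
    core_lines = {i: set() for i in range(num_cores)}  # lines cached per core
    for b in blocks:
        lines = footprints[b]
        best = max(range(num_cores),
                   key=lambda c: (len(core_lines[c] & lines),   # shared lines
                                  -len(assignment[c])))         # then load
        assignment[best].append(b)
        core_lines[best] |= lines
    return assignment

# Blocks 0/1 share cache lines {0,1}; blocks 2/3 share {8,9}.
fp = {0: {0, 1}, 1: {0, 1, 2}, 2: {8, 9}, 3: {8, 9, 10}}
print(locality_aware([0, 1, 2, 3], fp, num_cores=2))
# {0: [0, 1], 1: [2, 3]} -- sharers end up co-located
```

Round-robin would instead interleave the two sharing groups across both cores, duplicating their footprints in both caches; estimating the footprint overlap at runtime is the "dynamic locality estimation" part of the approach.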
Biosensors and Bioelectronics | 2006
Jun-Chau Chien; Ying-Chou Cheng; Po-Han Wang; Chii Rong Yang; Ping-Hei Chen
This article has been withdrawn at the request of the author(s) and/or editor. The Publisher apologizes for any inconvenience this may cause. The full Elsevier Policy on Article Withdrawal can be found at http://www.elsevier.com/locate/withdrawalpolicy.