Po-Han Wang
National Taiwan University
Publications
Featured research published by Po-Han Wang.
ACM Transactions on Architecture and Code Optimization | 2011
Po-Han Wang; Chia-Lin Yang; Yen-Ming Chen; Yu-Jung Cheng
As technology continues to shrink, reducing leakage is critical to achieving energy efficiency. Previous studies on low-power GPUs (Graphics Processing Units) focused on techniques for dynamic power reduction, such as DVFS (Dynamic Voltage and Frequency Scaling) and clock gating. In this paper, we explore the potential of adopting architecture-level power gating techniques for leakage reduction on GPUs. We propose three strategies for applying power gating to different modules in GPUs. The Predictive Shader Shutdown technique exploits workload variation across frames to eliminate leakage in shader clusters. The Deferred Geometry Pipeline seeks to minimize leakage in fixed-function geometry units by exploiting the imbalance between geometry and fragment computation across batches. Finally, a simple time-out power gating method is applied to nonshader execution units to exploit finer-grained idle periods. Our results indicate that Predictive Shader Shutdown eliminates up to 60% of the leakage in shader clusters, the Deferred Geometry Pipeline removes up to 57% of the leakage in the fixed-function geometry units, and the simple time-out power gating mechanism eliminates 83.3% of the leakage in nonshader execution units on average. All three schemes incur negligible performance degradation, less than 1%.
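The time-out policy described above is simple enough to sketch. The following is a minimal illustrative model, not the paper's simulator: a unit is power-gated once it has been idle for more than `timeout` consecutive cycles, and leakage is saved only while gated (wake-up latency is not modeled here).

```python
def gated_cycles(activity, timeout):
    """Count cycles spent power-gated for a 0/1 activity trace.

    activity: sequence of cycles, 1 = unit busy, 0 = unit idle.
    timeout:  number of consecutive idle cycles before gating kicks in.
    """
    idle = 0      # consecutive idle cycles seen so far
    gated = 0     # cycles spent in the gated (leakage-saving) state
    for busy in activity:
        if busy:
            idle = 0          # activity resets the idle counter
        else:
            idle += 1
            if idle > timeout:
                gated += 1    # past the time-out: unit is gated this cycle
    return gated

# A long idle stretch lets the unit spend most of it gated.
trace = [1, 1] + [0] * 18
print(gated_cycles(trace, timeout=3))  # 15 of the 18 idle cycles are gated
```

A shorter time-out saves more leakage but risks gating a unit just before it is needed again, which is why the paper applies this policy only to nonshader units with predictable idle periods.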
IEEE Computer Architecture Letters | 2009
Po-Han Wang; Yen-Ming Chen; Chia-Lin Yang; Yu-Jung Cheng
As technology continues to shrink, reducing leakage is critical to achieving energy efficiency. Previous work on low-power GPUs (graphics processing units) focused on techniques for dynamic power reduction, such as DVFS (Dynamic Voltage/Frequency Scaling) and clock gating. In this paper, we explore the potential of adopting architecture-level power gating techniques for leakage reduction on GPUs. In particular, we focus on the most power-hungry components, the shader processors. We observe that, because scene complexity differs from frame to frame, the shader resources required to sustain the target frame rate also vary across frames. We therefore propose the predictive shader shutdown technique, which exploits this workload variation across frames to reduce leakage in the shader processors. The experimental results show that predictive shader shutdown achieves up to 46% leakage reduction on shader processors with negligible performance degradation.
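The prediction idea can be sketched in a few lines. This is a hypothetical model, not the paper's mechanism: the previous frame's workload is used to estimate how many shader clusters the next frame needs to meet a target frame time, and the rest are shut down. The linear workload model and all parameter names are illustrative assumptions.

```python
import math

def shaders_needed(workload, per_shader_throughput, target_frame_time):
    """Smallest shader count whose aggregate throughput meets the deadline."""
    required_rate = workload / target_frame_time
    return max(1, math.ceil(required_rate / per_shader_throughput))

def schedule(frame_workloads, total_shaders, per_shader_throughput,
             target_frame_time):
    """Predict each frame's active-shader count from the previous frame."""
    active = [total_shaders]  # first frame: no history, keep everything on
    for prev in frame_workloads[:-1]:
        n = shaders_needed(prev, per_shader_throughput, target_frame_time)
        active.append(min(n, total_shaders))
    return active

# Simple scenes let most clusters be shut down; a complex frame is predicted
# from the (still simple) frame before it, so mispredictions are possible.
plan = schedule([8.0, 2.0, 2.0, 12.0], total_shaders=8,
                per_shader_throughput=1.0, target_frame_time=2.0)
print(plan)  # [8, 4, 1, 1]
```

The last frame in the example is underprovisioned because its predictor saw a simple frame, which is exactly the misprediction risk a real scheme must bound to keep performance degradation negligible.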
international symposium on performance analysis of systems and software | 2012
Po-Han Wang; Chien-Wei Lo; Chia-Lin Yang; Yu-Jung Cheng
The massive parallelism provided by modern graphics processing units (GPUs) makes them attractive processors for accelerating applications with high data-level parallelism. The GPU architecture has therefore gained considerable attention in the research community. However, progress in GPU architecture research is impeded by the limited documentation released by the major GPU vendors. Furthermore, current studies on GPUs often focus only on general-purpose (GPGPU) applications. The behavior of graphics applications, which remain the major GPU workloads, is often overlooked in these studies. A GPU design that is good for GPGPU applications is not necessarily good for graphics applications. Therefore, a simulation framework that can characterize the performance of both classes of applications is essential for innovation in GPU architecture.
international symposium on vlsi design, automation and test | 2014
Po-Han Wang; Gen-Hong Liu; Jen-Chieh Yeh; Tse-Min Chen; Hsu-Yao Huang; Chia-Lin Yang; Shih-Lien Liu; James B. S. G. Greensky
The integrated CPU/GPU architecture brings performance advantages because the communication cost between the CPU and GPU is reduced, but it also imposes new challenges in processor architecture design, especially in the management of shared memory resources, e.g., the last-level cache and memory bandwidth. A micro-architecture-level simulator is therefore essential to facilitate research in this direction. In this paper, we develop the first cycle-level full-system simulation framework for CPU-GPU integration with detailed memory models. With this simulation framework, we analyze the communication cost between the CPU and GPU for GPU workloads, and characterize the memory system when CPU and GPU applications run concurrently.
international symposium on performance analysis of systems and software | 2017
Li Wang; Ren-Wei Tsai; Shao-Chung Wang; Kun-Chih Chen; Po-Han Wang; Hsiang-Yun Cheng; Yi-Chung Lee; Sheng-Jie Shu; Chun-Chieh Yang; Min-Yih Hsu; Li-Chen Kan; Chao-Lin Lee; Tzu-Chieh Yu; Rih-Ding Peng; Chia-Lin Yang; Yuan-Shin Hwang; Jenq Kuen Lee; Shiao-Li Tsao; Ming Ouhyoung
Heterogeneous CPU-GPU systems have recently emerged as an energy-efficient computing platform. A robust integrated CPU-GPU simulator is essential to facilitate research in this direction. While a few integrated CPU-GPU simulators are available, none support OpenCL 2.0, a widely used new standard with promising heterogeneous computing features. In this paper, we extend an existing integrated CPU-GPU simulator, gem5-gpu, to support OpenCL 2.0. In addition, we conduct experiments on the extended simulator to assess the impact of the new features introduced by OpenCL 2.0. Our OpenCL 2.0-compatible simulator is validated against a state-of-the-art commercial product, and is expected to help boost future studies of heterogeneous CPU-GPU systems.
design, automation, and test in europe | 2010
Yi-Jung Chen; Chia-Lin Yang; Po-Han Wang
Multi-Processor System-on-Chips (MPSoCs) exploit task-level parallelism to achieve high computation throughput, but concurrent memory accesses from multiple PEs may create a memory bottleneck. Therefore, to maximize system performance, it is important to consider the PE and on-chip memory architecture designs simultaneously. However, in a traditional MPSoC design flow, PE allocation and on-chip memory allocation are often considered independently. To tackle this problem, we propose the first PE and Memory Co-synthesis (PM-COSYN) framework for MPSoCs. One critical issue in such a memory-aware MPSoC design is how to utilize the available die area to achieve a balanced design between the memory and computation subsystems. The goal of PM-COSYN is therefore to allocate PEs and on-chip memory for MPSoCs with a Network-on-Chip (NoC) architecture such that system performance is maximized and the area constraint is met. The experimental results show that PM-COSYN can synthesize NoC resource allocation according to the needs of the target task set. Compared to a simulated-annealing method, PM-COSYN generates a comparable solution in much less CPU time.
design automation conference | 2018
Hsueh-Chun Fu; Po-Han Wang; Chia-Lin Yang
Accelerator-rich architectures employ IOMMUs to support a unified virtual address space, but research shows that IOMMUs fail to meet the performance and energy requirements of accelerators. Instead of optimizing the speed and energy of IOMMU address translation, this work tackles the issue from a new perspective: eliminating the need for translation with an active forwarding (AF) mechanism that forwards accelerator input data directly from the CPU cache to the accelerator's scratchpad memory. Results show that, on average, AF provides an 8% performance improvement over the state-of-the-art mechanism, hostPageWalk, and reduces accelerator power by 22.1%.
asia and south pacific design automation conference | 2017
Li-Wei Shieh; Kun-Chih Chen; Hsueh-Chun Fu; Po-Han Wang; Chia-Lin Yang
To provide QoS on resource-limited mobile systems, we introduce a fast preemption mechanism for GPUs. First, we introduce a dual-kernel execution model to support fine-grained preemption, together with a resource allocation policy that avoids the resource fragmentation problem. Second, we propose a preemption victim selection scheme that reduces throughput overhead while satisfying a required preemption latency. Evaluations show that our scheme comes within 2% of the ideal preemption scheme in terms of deadline violations. Furthermore, on average we improve GPU resource utilization during preemption by 2.93× over the prior technique.
IEEE Computer Architecture Letters | 2017
Li-Jhan Chen; Hsiang-Yun Cheng; Po-Han Wang; Chia-Lin Yang
Modern GPGPUs support the concurrent execution of thousands of threads to provide an energy-efficient platform. However, the massive multi-threading of GPGPUs incurs serious cache contention, as the cache lines brought in by one thread can easily be evicted by other threads in the small shared cache. In this paper, we propose a software-hardware cooperative approach that exploits the spatial locality among different thread blocks to better utilize the precious cache capacity. Through dynamic locality estimation and thread block scheduling, we can capture more performance improvement opportunities than prior work that only explores the spatial locality between consecutive thread blocks. Evaluations across diverse GPGPU applications show that, on average, our locality-aware scheduler provides 25 and 9 percent performance improvements over the commonly employed round-robin scheduler and the state-of-the-art scheduler, respectively.
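The contrast with round-robin scheduling can be sketched as follows. This is an illustrative model only (the footprint sets, greedy heuristic, and all names are assumptions, not the paper's algorithm): thread blocks whose memory footprints overlap are assigned to the same core so they can share that core's cache, instead of being spread round-robin.

```python
def round_robin(blocks, num_cores):
    """Baseline: blocks are dealt out to cores in arrival order."""
    return {i: [b for j, b in enumerate(blocks) if j % num_cores == i]
            for i in range(num_cores)}

def locality_aware(blocks, footprints, num_cores):
    """Greedy sketch: place each block on the core whose cached footprint
    it overlaps most, breaking ties toward the least-loaded core."""
    assignment = {i: [] for i in range(num_cores)}
    core_lines = {i: set() for i in range(num_cores)}  # lines cached per core
    for b in blocks:
        lines = footprints[b]
        best = max(range(num_cores),
                   key=lambda c: (len(core_lines[c] & lines),   # shared lines
                                  -len(assignment[c])))         # then load
        assignment[best].append(b)
        core_lines[best] |= lines
    return assignment

# Blocks 0/1 share cache lines {0,1}; blocks 2/3 share {8,9}.
fp = {0: {0, 1}, 1: {0, 1, 2}, 2: {8, 9}, 3: {8, 9, 10}}
print(locality_aware([0, 1, 2, 3], fp, num_cores=2))
# {0: [0, 1], 1: [2, 3]} -- sharers end up co-located
```

Round-robin would instead interleave the two sharing groups across both cores, duplicating their footprints in both caches; estimating the footprint overlap at runtime is the "dynamic locality estimation" part of the approach.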
Biosensors and Bioelectronics | 2006
Jun-Chau Chien; Ying-Chou Cheng; Po-Han Wang; Chii Rong Yang; Ping-Hei Chen
This article has been withdrawn at the request of the author(s) and/or editor. The Publisher apologizes for any inconvenience this may cause. The full Elsevier Policy on Article Withdrawal can be found at http://www.elsevier.com/locate/withdrawalpolicy.