Network


Latest external collaborations at the country level.

Hotspot


Research topics in which Jih-Kwon Peir is active.

Publications


Featured research published by Jih-Kwon Peir.


international conference on supercomputing | 2002

Bloom filtering cache misses for accurate data speculation and prefetching

Jih-Kwon Peir; Shih-Chang Lai; Shih-Lien Lu; Jared Stark; Konrad K. Lai

A processor must know a load instruction's latency to schedule the load's dependent instructions at the correct time. Unfortunately, modern processors do not know this latency until well after the dependent instructions should have been scheduled to avoid pipeline bubbles between themselves and the load. One solution to this problem is to predict the load's latency by predicting whether the load will hit or miss in the data cache. Existing cache hit/miss predictors, however, can only correctly predict about 50% of cache misses. This paper introduces a new hit/miss predictor that uses a Bloom Filter to identify cache misses early in the pipeline. This early identification of cache misses allows the processor to more accurately schedule instructions that are dependent on loads and to more precisely prefetch data into the cache. Simulations using a modified SimpleScalar model show that the proposed Bloom Filter is nearly perfect, with a prediction accuracy greater than 99% for the SPECint2000 benchmarks. IPC (Instructions Per Cycle) performance improved by 19% over a processor that delayed the scheduling of instructions dependent on a load until the load latency was known, and by 6% and 7% over processors that always predicted a load would hit the cache and that used a counter-based hit/miss predictor, respectively. This IPC reaches 99.7% of the IPC of a processor with perfect scheduling.
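
The early-miss guarantee can be captured with a counting Bloom filter indexed by the line address: cache fills increment a bucket, evictions decrement it, and a zero bucket proves the line is absent. Below is a minimal Python sketch of that idea; the table size and single hash function are our own assumptions, not the paper's exact partitioned-address design.

```python
# Illustrative counting-Bloom hit/miss predictor (assumed parameters, not the
# paper's exact design). Each counter tracks how many cached lines map to its
# bucket; a zero counter proves the line is not cached, so a miss can be
# declared early in the pipeline.

class BloomHitMissPredictor:
    def __init__(self, num_buckets=4096, line_bytes=64):
        assert num_buckets & (num_buckets - 1) == 0, "power of two"
        self.counts = [0] * num_buckets
        self.shift = line_bytes.bit_length() - 1   # drop the line offset bits
        self.mask = num_buckets - 1

    def _bucket(self, addr):
        return (addr >> self.shift) & self.mask

    def predict_hit(self, addr):
        # Nonzero counter: some cached line maps here, so predict hit (may be
        # wrong due to aliasing). Zero counter: a guaranteed miss.
        return self.counts[self._bucket(addr)] > 0

    def on_fill(self, addr):      # called when a line enters the cache
        self.counts[self._bucket(addr)] += 1

    def on_evict(self, addr):     # called when a line leaves the cache
        self.counts[self._bucket(addr)] -= 1
```

Sizing the filter with several times more buckets than cache frames keeps aliasing, and hence false hit predictions, rare.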


architectural support for programming languages and operating systems | 1998

Capturing dynamic memory reference behavior with adaptive cache topology

Jih-Kwon Peir; Yongjoon Lee; Windsor Wee Sun Hsu

Memory references exhibit locality and are therefore not uniformly distributed across the sets of a cache. This skew reduces the effectiveness of a cache because it results in the caching of a considerable number of less-recently-used lines which are less likely to be re-referenced before they are replaced. In this paper, we describe a technique that dynamically identifies these less-recently-used lines and effectively utilizes the cache frames they occupy to more accurately approximate the global least-recently-used replacement policy while maintaining the fast access time of a direct-mapped cache. We also explore the idea of using these underutilized cache frames to reduce cache misses through data prefetching. In the proposed design, the possible locations in which a line can reside are not predetermined. Instead, the cache is dynamically partitioned into groups of cache lines. Because both the total number of groups and the individual group associativity adapt to the dynamic reference pattern, we call this design the adaptive group-associative cache. Performance evaluation using trace-driven simulations of the TPC-C benchmark and selected programs from the SPEC95 benchmark suite shows that the group-associative cache is able to achieve a hit ratio that is consistently better than that of a 4-way set-associative cache. For some of the workloads, the hit ratio approaches that of a fully-associative cache.
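
A functional model of the idea is easy to sketch: lines prefer their direct-mapped home frame, displaced lines may occupy any underutilized frame, and replacement approximates global LRU through a small directory. The Python below is our simplification of the behavior, not the paper's hardware mechanism.

```python
from collections import OrderedDict

# Functional sketch of the group-associative idea (our simplification). A real
# design keeps direct-mapped lookup speed and consults a small directory only
# for lines held out of their home position.

class GroupAssociativeCache:
    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.resident = {}          # frame -> tag
        self.location = {}          # tag -> frame (the directory)
        self.lru = OrderedDict()    # tags, least recently used first

    def access(self, tag):
        if tag in self.location:                  # hit, possibly out of position
            self.lru.move_to_end(tag)
            return True
        home = tag % self.num_frames
        if home not in self.resident:             # home frame is free
            frame = home
        elif len(self.resident) < self.num_frames:
            # Home is taken but another frame is underutilized: place the
            # line out of position instead of evicting anything.
            frame = next(f for f in range(self.num_frames)
                         if f not in self.resident)
        else:                                     # evict ~global LRU line
            victim, _ = self.lru.popitem(last=False)
            frame = self.location.pop(victim)
            del self.resident[frame]
        self.resident[frame] = tag
        self.location[tag] = frame
        self.lru[tag] = None
        return False
```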


international performance computing and communications conference | 2007

Memory Performance and Scalability of Intel's and AMD's Dual-Core Processors: A Case Study

Lu Peng; Jih-Kwon Peir; Tribuvan K. Prakash; Yen-Kuang Chen; David M. Koppelman

As the chip multiprocessor (CMP) has become the mainstream in processor architectures, Intel and AMD have introduced their dual-core processors to the PC market. In this paper, performance studies on an Intel Core 2 Duo, an Intel Pentium D and an AMD Athlon 64 X2 processor are reported. According to the design specifications, key differences exist in the critical memory hierarchy architecture among these dual-core processors. In addition to the overall execution time and throughput measurement using both multiprogrammed and multi-threaded workloads, this paper provides detailed analysis of the memory hierarchy performance and of the performance scalability between single and dual cores. Our results indicate that for the best performance and scalability, it is important to have (1) fast cache-to-cache communication, (2) a large L2 or shared capacity, (3) fast L2-to-core latency, and (4) fair cache resource sharing. The three dual-core processors that we studied have shown benefits from some of these factors, but not all of them. The Core 2 Duo has the best performance for most of the workloads because of microarchitecture features such as its shared L2 cache. The Pentium D shows the worst performance in many aspects because it is a technology remap of the Pentium 4.
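
The single- versus dual-core scalability numbers follow the usual measurement pattern: run copies of a workload alone and together, then compare throughput. A rough Python sketch of that methodology, with a stand-in compute kernel rather than the paper's benchmark suite:

```python
import time
from multiprocessing import Pool

def kernel(n):
    # CPU-bound stand-in for one benchmark program
    s = 0
    for i in range(n):
        s += i * i
    return s

def throughput(copies, work=2_000_000):
    start = time.perf_counter()
    with Pool(copies) as pool:
        pool.map(kernel, [work] * copies)
    return copies / (time.perf_counter() - start)   # jobs per second

if __name__ == "__main__":
    single, dual = throughput(1), throughput(2)
    # Perfect scaling gives 2.0; shared-cache and bus contention pull it lower.
    print(f"dual-core scalability: {dual / single:.2f}x")
```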


international conference on computer communications | 2009

Fit a Spread Estimator in Small Memory

MyungKeun Yoon; Tao Li; Shigang Chen; Jih-Kwon Peir

The spread of a source host is the number of distinct destinations to which it has sent packets during a measurement period. A spread estimator is a software/hardware module on a router that inspects arriving packets and estimates the spread of each source. It has important applications in detecting port scans and DDoS attacks, measuring the infection rate of a worm, assisting resource allocation in a server farm, and determining popular web content for caching, to name a few. The main technical challenge is to fit a spread estimator into a fast but small memory (such as SRAM) in order to operate it at line speed in a high-speed network. In this paper, we design a new spread estimator that delivers good performance in tight memory space where all existing estimators no longer work. The new estimator not only achieves space compactness but also operates more efficiently than the existing ones. Its accuracy and efficiency come from a new method for data storage, called virtual vectors, which allows us to measure and remove the errors in spread estimation. We perform experiments on real Internet traces to verify the effectiveness of the new estimator.
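
The virtual-vector idea can be sketched directly: each source is assigned a small virtual vector of bit positions drawn from one shared physical bit array, and the noise other sources write into those positions is removed with a probabilistic-counting correction of the form l·(ln V_m − ln V_s), where V_s and V_m are the fractions of zero bits in the virtual vector and in the whole array. The sizes and hashing below are illustrative assumptions, not the paper's exact construction.

```python
import math
import random
import hashlib

M = 1 << 16          # physical bit array size in bits (assumed)
L = 64               # virtual vector size per source (assumed)
B = bytearray(M // 8)

def _h(*parts):      # generic hash helper, stands in for hardware hashing
    d = hashlib.blake2b("|".join(map(str, parts)).encode(), digest_size=8)
    return int.from_bytes(d.digest(), "big")

def _vpos(src, i):   # i-th physical position of src's virtual vector
    return _h("vv", src, i) % M

def record(src, dst):
    # Each contact (src, dst) sets one pseudo-random bit of src's virtual vector.
    p = _vpos(src, _h("pick", src, dst) % L)
    B[p // 8] |= 1 << (p % 8)

def _zero_fraction(positions):
    zeros = sum(1 for p in positions if not B[p // 8] & (1 << (p % 8)))
    return zeros / len(positions)

def estimate(src):
    v_s = _zero_fraction([_vpos(src, i) for i in range(L)])
    v_m = _zero_fraction(range(M))      # noise level across the whole array
    if v_s == 0:                        # virtual vector saturated
        return float("inf")
    return L * (math.log(v_m) - math.log(v_s))

if __name__ == "__main__":
    random.seed(1)
    for d in range(50):                 # true spread of this source = 50
        record("10.0.0.1", d)
    for _ in range(20000):              # background traffic acting as noise
        record(f"10.{random.randrange(256)}.{random.randrange(256)}.1",
               random.randrange(1000))
    print(round(estimate("10.0.0.1")))  # should land near 50
```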


Journal of Systems Architecture | 2008

Memory hierarchy performance measurement of commercial dual-core desktop processors

Lu Peng; Jih-Kwon Peir; Tribuvan K. Prakash; Carl Staelin; Yen-Kuang Chen; David M. Koppelman

As the chip multiprocessor (CMP) has become the mainstream in processor architectures, Intel and AMD have introduced their dual-core processors. In this paper, performance measurements on an Intel Core 2 Duo, an Intel Pentium D and an AMD Athlon 64 X2 processor are reported. According to the design specifications, key differences exist in the critical memory hierarchy architecture among these dual-core processors. In addition to the overall execution time and throughput measurement using both multiprogrammed and multi-threaded workloads, this paper provides detailed analysis of the memory hierarchy performance and of the performance scalability between single and dual cores. Our results indicate that for better performance and scalability, it is important to have (1) fast cache-to-cache communication, (2) a large L2 or shared capacity, (3) fast L2-to-core latency, and (4) fair cache resource sharing. The three dual-core processors that we studied have shown benefits from some of these factors, but not all of them. The Core 2 Duo has the best performance for most of the workloads because of microarchitecture features such as the shared L2 cache. The Pentium D shows the worst performance in many aspects because it is a technology remap of the Pentium 4 that does not take advantage of on-chip communication.


2011 International Green Computing Conference and Workshops | 2011

Statistical GPU power analysis using tree-based methods

Jianmin Chen; Bin Li; Ying Zhang; Lu Peng; Jih-Kwon Peir

Graphics Processing Units (GPUs) have emerged as a promising platform for parallel computation. With a large number of scalar processors and abundant memory bandwidth, GPUs provide substantial computation power. While delivering high computation performance, the GPU also consumes high power and needs to be equipped with sufficient power supplies and cooling systems. Therefore, it is essential to institute an efficient mechanism for evaluating and understanding the power consumption requirement when running real applications on high-end GPUs. In this paper, we present a high-level GPU power consumption model using sophisticated tree-based random forest methods which can correlate the power consumption with a set of independent performance variables. This statistical model not only predicts the GPU runtime power consumption accurately, but more importantly, it also provides sufficient insights for understanding the dependence between the GPU runtime power consumption and the individual performance metrics. In order to gain more insights, we use a GPU simulator that can collect more runtime performance metrics than hardware counters. We measure the power consumption of a wide range of CUDA kernels on an experimental system with a GTX 280 GPU as statistical samples for our power analysis. This methodology can be applied to any other CUDA GPU.
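
The modeling step is straightforward to reproduce in outline: fit a random forest from runtime metrics to measured power, then read off variable importances. The sketch below uses synthetic placeholder data and feature names rather than the paper's measurements, with scikit-learn standing in for the authors' statistical tooling.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical performance metrics, one row per profiled kernel.
features = ["ipc", "dram_reads", "dram_writes", "sm_occupancy", "texture_hits"]
X = rng.random((200, len(features)))
# Synthetic "measured" power in watts, driven mostly by two of the metrics.
power = 80 + 60 * X[:, 1] + 30 * X[:, 0] + rng.normal(0, 2, 200)

model = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0)
model.fit(X, power)

print(f"out-of-bag R^2: {model.oob_score_:.2f}")   # held-out prediction quality
for name, imp in sorted(zip(features, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:>13}: {imp:.2f}")                # which metrics drive power
```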


IEEE ACM Transactions on Networking | 2011

Fit a compact spread estimator in small high-speed memory

MyungKeun Yoon; Tao Li; Shigang Chen; Jih-Kwon Peir

The spread of a source host is the number of distinct destinations to which it has sent packets during a measurement period. A spread estimator is a software/hardware module on a router that inspects arriving packets and estimates the spread of each source. It has important applications in detecting port scans and distributed denial-of-service (DDoS) attacks, measuring the infection rate of a worm, assisting resource allocation in a server farm, and determining popular Web content for caching, to name a few. The main technical challenge is to fit a spread estimator into a fast but small memory (such as SRAM) in order to operate it at line speed in a high-speed network. In this paper, we design a new spread estimator that delivers good performance in tight memory space where all existing estimators no longer work. The new estimator not only achieves space compactness, but also operates more efficiently than the existing ones. Its accuracy and efficiency come from a new method for data storage, called virtual vectors, which allows us to measure and remove the errors in spread estimation. We also propose several ways to enhance the range of spread values that the estimator can measure. We perform extensive experiments on real Internet traces to verify the effectiveness of the new estimator.


international conference on parallel and distributed information systems | 1991

Interconnecting shared-everything systems for efficient parallel query processing

Kien A. Hua; Chiang Lee; Jih-Kwon Peir

The most debated architectures for parallel database processing are the Shared Nothing (SN) and Shared Everything (SE) structures. Although SN is considered the most scalable, it is very sensitive to the data skew problem. On the other hand, SE allows the collaborating processors to share the workload more efficiently; it, however, suffers from the limitations of memory and disk I/O bandwidth. The authors present a hybrid architecture in which SE clusters are interconnected through a communication network to form an SN structure at the inter-cluster level. In this approach, processing elements are clustered into SE systems to minimize the skew effect. Each cluster, however, is kept small, within the limits of the memory and I/O technology, to avoid the data access bottleneck. A generalized performance model was developed to perform sensitivity analysis for the hybrid structure and to compare it against SE and SN organizations.
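
A back-of-the-envelope model shows why clustering mitigates skew: response time is set by the busiest unit, and grouping nodes into SE clusters averages the skew within each cluster. The sketch below is our own toy model, not the paper's generalized performance model.

```python
import random

def response_time(loads, nodes_per_cluster):
    # Partition the per-node loads into SE clusters; a cluster of k nodes
    # shares its work, finishing in (cluster load) / k, and the slowest
    # cluster determines the query response time.
    clusters = [loads[i:i + nodes_per_cluster]
                for i in range(0, len(loads), nodes_per_cluster)]
    return max(sum(c) / len(c) for c in clusters)

random.seed(7)
# Skewed partition sizes: a heavy-tailed draw per node.
loads = [random.paretovariate(2.0) for _ in range(64)]

for k in (1, 4, 16):   # k = 1 is pure shared-nothing; larger k = bigger SE clusters
    print(f"{k:>2} nodes/cluster -> relative response time "
          f"{response_time(loads, k):.2f}")
```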


international conference on computer design | 2011

Tree structured analysis on GPU power study

Jianmin Chen; Bin Li; Ying Zhang; Lu Peng; Jih-Kwon Peir

Graphics Processing Units (GPUs) have emerged as a promising platform for parallel computation. With a large number of processor cores and abundant memory bandwidth, GPUs deliver substantial computation power. While providing high computation performance, a GPU consumes high power and needs sufficient power supplies and cooling systems. It is essential to institute an efficient mechanism for evaluating and understanding the power consumption when running real applications on high-end GPUs. In this paper, we present a high-level GPU power consumption model using sophisticated tree-based random forest methods which correlate and predict the power consumption using a set of performance variables. We demonstrate that this statistical model not only predicts the GPU runtime power consumption more accurately than existing regression-based approaches, but more importantly, it provides sufficient insights into understanding the correlation of the GPU power consumption with individual performance metrics. We use a GPU simulator that can collect more runtime performance metrics than hardware counters. We measure the power consumption of a wide range of CUDA kernels on an experimental system with a GTX 280 GPU to collect statistical samples for power analysis. The proposed method is applicable to other GPUs as well.


ieee international symposium on workload characterization | 2011

Architecture comparisons between Nvidia and ATI GPUs: Computation parallelism and data communications

Ying Zhang; Lu Peng; Bin Li; Jih-Kwon Peir; Jianmin Chen

In recent years, modern graphics processing units have been widely adopted in high performance computing to solve large scale computation problems. The leading GPU manufacturers, Nvidia and ATI, have introduced series of products to the market. While sharing many similar design concepts, GPUs from these two manufacturers differ in several aspects of the processor cores and the memory subsystem. In this paper, we conduct a comprehensive study to characterize the architectural differences between Nvidia's Fermi and ATI's Cypress and demonstrate their impact on performance. Our results indicate that these two products have diverse advantages that are reflected in their performance on different sets of applications. In addition, we also compare the energy efficiency of the two platforms, since power/energy consumption is a major concern in high performance computing.

Collaboration


Dive into Jih-Kwon Peir's collaborations.

Top Co-Authors

Lu Peng
Louisiana State University

Feiqi Su
University of Florida

Zhen Yang
University of Florida

Bin Li
Louisiana State University