Yarsun Hsu
National Tsing Hua University
Publications
Featured research published by Yarsun Hsu.
IEEE Journal of Solid-State Circuits | 2008
Chih-Hao Liu; Shau-Wei Yen; Chih-Lung Chen; Hsie-Chia Chang; Chen-Yi Lee; Yarsun Hsu; Shyh-Jye Jou
An LDPC decoder chip fully compliant with IEEE 802.16e applications is presented. Since the parity check matrix can be decomposed into sub-matrices that are either a zero matrix or a cyclically shifted matrix, a phase-overlapping message passing scheme is applied to update messages immediately, enhancing decoding throughput. With only one shifter-based permutation structure, a self-routing switch network is proposed to merge the 19 different sub-matrix sizes defined in IEEE 802.16e and enable parallel messages to be routed without congestion. Fabricated in a 90 nm 1P9M CMOS process, this chip achieves 105 Mb/s at 20 iterations while decoding the rate-5/6 2304-bit code at a 150 MHz operating frequency. To meet the maximum data rate in IEEE 802.16e, the chip operates at 109 MHz and dissipates 186 mW at a 1.0 V supply.
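The routing idea hinges on the quasi-cyclic structure: each nonzero sub-matrix permutes its z messages by a fixed cyclic shift, so a shifter can realize the permutation. Below is a minimal Python sketch of that reduction (the function name and the z = 8 example are illustrative assumptions, not taken from the chip):

```python
# Minimal sketch (not the chip's RTL): routing the messages of one z-by-z
# cyclically shifted sub-matrix of a QC-LDPC parity-check matrix reduces to
# a cyclic shift of the z messages, which is what a shifter-based
# permutation network implements in hardware.

def cyclic_shift(messages, shift, z):
    """Route z messages according to a sub-matrix with the given shift
    amount (taken modulo z)."""
    s = shift % z
    return messages[s:] + messages[:s]

# Example: z = 8 sub-matrix with shift amount 3
msgs = [f"m{i}" for i in range(8)]
print(cyclic_shift(msgs, 3, 8))   # ['m3', 'm4', ..., 'm7', 'm0', 'm1', 'm2']
```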
Measurement and Modeling of Computer Systems | 2000
Jin-Soo Kim; Yarsun Hsu
This paper studies the memory system behavior of Java programs by analyzing memory reference traces of several SPECjvm98 applications running with a Just-In-Time (JIT) compiler. Trace information is collected by an exception-based tracing tool called JTRACE, without any instrumentation of the Java programs or the JIT compiler. First, we find that the overall cache miss ratio is increased by garbage collection, which suffers higher cache miss rates than the application itself. We also note that going beyond 2-way cache associativity improves the cache miss ratio only marginally. Second, we observe that Java programs generate a substantial number of short-lived objects. However, the size of frequently referenced, long-lived objects matters more to cache performance, because it tends to determine the application's working set size. Finally, we note that the default heap configuration, which starts from a small initial heap size, is very inefficient because it invokes the garbage collector frequently. Although the direct cost of garbage collection decreases as the available heap size increases, there exists an optimal heap size that minimizes total execution time due to the interaction with virtual memory performance.
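The associativity observation can be reproduced with an ordinary trace-driven cache model. The sketch below is a generic set-associative LRU simulator in Python (not the paper's JTRACE tool; the cache sizes and the synthetic trace are assumptions for illustration):

```python
# Minimal trace-driven cache model (a sketch, not the paper's tooling):
# counts misses for a set-associative cache with LRU replacement, so the
# effect of 1-, 2-, and 4-way associativity on the miss ratio can be compared.
from collections import OrderedDict

def miss_ratio(trace, cache_bytes=16 * 1024, line=32, ways=2):
    sets = cache_bytes // (line * ways)
    cache = [OrderedDict() for _ in range(sets)]   # per-set LRU state
    misses = 0
    for addr in trace:
        block = addr // line
        s = cache[block % sets]
        if block in s:
            s.move_to_end(block)                   # LRU update on hit
        else:
            misses += 1
            s[block] = True
            if len(s) > ways:
                s.popitem(last=False)              # evict the LRU block
    return misses / len(trace)

# Example: two blocks that conflict in a direct-mapped cache but coexist in
# one 2-way set; going from 2-way to 4-way changes nothing further.
trace = [0, 16 * 1024] * 1000
for w in (1, 2, 4):
    print(w, "way:", miss_ratio(trace, ways=w))
```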
IEEE Transactions on Circuits and Systems II: Express Briefs | 2009
Chih-Hao Liu; Chien-Ching Lin; Shau-Wei Yen; Chih-Lung Chen; Hsie-Chia Chang; Chen-Yi Lee; Yarsun Hsu; Shyh-Jye Jou
A reconfigurable message-passing network is proposed to facilitate message transport when decoding multimode quasi-cyclic low-density parity-check (QC-LDPC) codes. By exploiting the shift-routing network (SRN) features, the decoding messages are routed in parallel to fully support the 19 submatrix sizes defined in IEEE 802.16e and the 3 defined in IEEE 802.11n with less hardware complexity. A 6.22-mm2 QC-LDPC decoder with SRN is implemented in a 90-nm 1-poly 9-metal (1P9M) CMOS process. Post-layout simulation results show that the operating frequency can reach 300 MHz, which is sufficient to process the 212-Mb/s 2304-bit and 178-Mb/s 1944-bit codeword streams for IEEE 802.16e and IEEE 802.11n systems, respectively.
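As a rough illustration of why a single shifter structure can serve many configurations, the sketch below builds a cyclic rotation from log2(N) conditionally enabled stages, the way a logarithmic barrel shifter does. It assumes a power-of-two width, which the actual 802.16e/802.11n submatrix sizes are not, so it conveys only the shifter idea rather than the SRN itself:

```python
# Sketch of a logarithmic barrel shifter: a cyclic rotation of N messages is
# built from log2(N) stages, each conditionally rotating by a power of two
# according to one bit of the shift amount. (Assumes N is a power of two;
# names are illustrative, not the decoder's actual modules.)

def barrel_rotate(data, shift):
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes power-of-two width"
    stage, out = 0, list(data)
    while (1 << stage) < n:
        if (shift >> stage) & 1:                 # this stage is enabled
            k = 1 << stage
            out = out[k:] + out[:k]              # rotate by 2**stage
        stage += 1
    return out

print(barrel_rotate(list(range(8)), 5))          # [5, 6, 7, 0, 1, 2, 3, 4]
```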
IEEE Transactions on Computers | 1993
Ching-Farn Eric Wu; Yarsun Hsu; Yew-Huey Liu
Parallel access to the translation lookaside buffer (TLB) and the cache array is crucial for high-performance computer systems, and the choice of cache type is one of the most important factors affecting cache performance. The authors classify caches according to both index and tag. Since the index and the tag can each be either virtual (V) or real (R), the classification yields four combinations, or cache types. The real-address caches with virtual tags considered in this study are prediction-based, since the index bits are generated from a small array and the prediction can be false. As a result, the authors also discuss and evaluate real-address MRU caches with real tags, and propose virtually indexed MRU caches with real tags. Each of the four cache types and the MRU caches is discussed and evaluated using trace-driven simulation. The results show that a virtually indexed MRU cache with real tags is a good choice for high-performance computer systems.
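For readers unfamiliar with the V/R terminology, the sketch below models a virtually indexed, real-tagged lookup: the set index comes from the virtual address, so the cache array and the TLB can be probed in parallel, and the hit test compares real tags. The page, line, and set sizes are illustrative assumptions, not the configurations simulated in the paper:

```python
# Sketch of a virtually indexed, real-tagged (VR) cache lookup. The set
# index is taken from the virtual address, so the cache array can be read
# in parallel with the TLB, and the hit check compares the stored real
# (physical) tag against the TLB's translation.

PAGE = 4096
LINE = 32
SETS = 256                       # 256 sets x 32-byte lines, direct-mapped

tlb = {}                         # virtual page number -> real page number
cache = {}                       # set index -> real tag of the resident line

def access(vaddr):
    vset = (vaddr // LINE) % SETS            # virtual index, no translation
    rpage = tlb.get(vaddr // PAGE)           # TLB lookup happens in parallel
    if rpage is None:
        return "tlb miss"
    rtag = (rpage * PAGE + vaddr % PAGE) // LINE
    if cache.get(vset) == rtag:              # real-tag comparison
        return "hit"
    cache[vset] = rtag                       # fill on miss
    return "miss"

tlb[0x1234] = 0x0042                         # map one virtual page
print(access(0x1234000 + 64), access(0x1234000 + 64))   # miss then hit
```

With these assumed sizes the virtually indexed range (8 KB) exceeds the 4 KB page, so two virtual addresses mapping to the same real line can land in different sets; handling such synonym lines efficiently is the issue taken up by the stack simulation work listed further below.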
IEEE International Symposium on Workload Characterization | 1998
Jin-Soo Kim; Xiaohan Qin; Yarsun Hsu
This paper studies a representative of an important class of emerging applications: a parallel data mining workload. The application, extracted from the IBM Intelligent Miner, identifies groups of records that are mathematically similar based on a neural network model called a self-organizing map. We examine and compare, in detail, two implementations of the application with respect to: (1) temporal locality or working set size; (2) spatial locality and memory block utilization; (3) communication characteristics and scalability; and (4) translation lookaside buffer (TLB) performance. First, we find that the working set hierarchy of the application is governed by two parameters, namely the size of an input record and the size of the prototype array; it is independent of the number of input records. Second, the application shows good spatial locality, with the implementation optimized for sparse data sets having slightly worse spatial locality. Third, due to the batch update scheme, the application incurs very little communication. Finally, a two-way set-associative TLB may result in severely skewed TLB performance in a multiprocessor environment, caused by a large discrepancy in the number of conflict misses. Increasing the set associativity is more effective at mitigating the problem than increasing the TLB size.
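The batch update scheme mentioned above is the reason communication stays low: each processor accumulates local sums for the prototype array over its own records, and the partial sums are merged only once per pass. The sketch below shows that generic pattern (not the Intelligent Miner code; the array shapes and the 1-D neighborhood are assumptions):

```python
# Generic batch self-organizing-map step: each "processor" computes local
# numerator/denominator sums for the prototype (weight) array, and the
# partial sums are combined once per epoch before a single update.
import numpy as np

def local_batch_sums(records, prototypes, sigma=1.0):
    num = np.zeros_like(prototypes)          # weighted sum of records
    den = np.zeros(len(prototypes))          # sum of neighborhood weights
    for x in records:
        winner = np.argmin(((prototypes - x) ** 2).sum(axis=1))
        d = np.arange(len(prototypes)) - winner
        h = np.exp(-(d ** 2) / (2 * sigma ** 2))   # 1-D neighborhood weights
        num += h[:, None] * x
        den += h
    return num, den

# One epoch on two "processors": combine partial sums, then update once.
rng = np.random.default_rng(0)
protos = rng.random((16, 4))
parts = [local_batch_sums(rng.random((100, 4)), protos) for _ in range(2)]
num = sum(p[0] for p in parts)
den = sum(p[1] for p in parts)
protos = num / den[:, None]                  # single global update per epoch
```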
International Conference on Computer Design | 1989
Shiwei Wang; Yarsun Hsu; C. J. Tan
A novel VLSI message switch design for application in highly parallel architectures is presented. The prominent features of this design are message combining, a shared central queue structure with a dynamic boundary and nonpreemptive priority, and a look-ahead protocol between switch nodes in adjacent stages. These features alleviate memory contention and increase the effective network bandwidth.
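Message combining merges requests bound for the same memory location inside the switch, which is what relieves hot-spot memory contention. The sketch below illustrates the idea for fetch-and-add style requests; the data structures and request format are illustrative assumptions, not the VLSI switch's queue organization:

```python
# Sketch of message combining: when two fetch-and-add requests for the same
# address meet in a switch node's queue, they are merged into one outgoing
# request, and the switch records how to split the reply later.

def combine(queue):
    """Merge fetch-and-add requests that target the same address."""
    merged, pending = {}, {}
    for addr, inc, src in queue:             # (address, increment, requester)
        if addr in merged:
            prev_inc = merged[addr][1]
            pending.setdefault(addr, []).append((src, prev_inc))
            merged[addr] = (addr, prev_inc + inc, merged[addr][2])
        else:
            merged[addr] = (addr, inc, src)
    return list(merged.values()), pending    # forwarded requests + wait info

reqs = [(0x10, 1, "P0"), (0x20, 5, "P1"), (0x10, 2, "P2")]
forwarded, wait = combine(reqs)
print(forwarded)   # [(16, 3, 'P0'), (32, 5, 'P1')]
print(wait)        # {16: [('P2', 1)]} -> P2 receives the base value plus 1
```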
International Conference on Supercomputing | 1995
Sandra Johnson Baylor; Caroline D. Benveniste; Yarsun Hsu
Presented are the results of a study conducted to evaluate the performance of parallel I/O on a massively parallel processor (MPP). The network traversal and total processing times are calculated for I/O reads and writes while varying the I/O and non-I/O request rates and the request size. Also studied is the performance impact of I/O and non-I/O traffic on each other. The results show that the system is scalable for the I/O loads considered; however, scalability is limited by I/O node saturation or considerable network contention.
ACM SIGARCH Computer Architecture News | 1994
Sandra Johnson Baylor; Caroline D. Benveniste; Yarsun Hsu
Presented are the trace-driven simulation results of a study conducted to evaluate the performance of the internal parallel I/O subsystem of the Vulcan MPP architecture. The system sizes evaluated vary from 16 to 512 nodes. The results show that a compute-node-to-I/O-node ratio of four is the most cost-effective for all system sizes, showing high scalability. Also, processor-to-processor communication effects are negligible for small message sizes, and the greater the fraction of I/O reads, the better the I/O performance. Worst-case I/O node placement is within 13% of more efficient placement strategies. Introducing parallelism into the internal I/O subsystem improves I/O performance significantly.
International Computer Symposium | 2010
Chi-Fu Chang; Yarsun Hsu
The design of a NoC simulator significantly affects both the design space it can explore and its simulation accuracy; it is not easy to achieve wide-ranging design space exploration and detailed characterization of hardware components simultaneously. This paper presents an object-oriented approach to NoC modeling: object-oriented NoC modeling ("OONoC" for short). OONoC divides the NoC design space into design blocks and each block into several abstraction levels. OONoC can extend the NoC exploration space, study hardware characteristics, and significantly reduce the coding effort for a new NoC design.
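A minimal sketch of the object-oriented decomposition described above: one abstract base class per design block, with concrete subclasses at different abstraction levels that can be swapped without touching the rest of the model. The class names and timing model are hypothetical, not OONoC's actual API:

```python
# Sketch of object-oriented NoC modeling: the router design block is an
# abstract base class, and different abstraction levels (a zero-delay
# functional router vs. a pipelined one) are interchangeable subclasses.
from abc import ABC, abstractmethod

class Router(ABC):
    @abstractmethod
    def route(self, packet, cycle):
        """Return the cycle at which the packet leaves this router."""

class FunctionalRouter(Router):             # high abstraction: no timing detail
    def route(self, packet, cycle):
        return cycle

class PipelinedRouter(Router):              # lower abstraction: per-stage delay
    def __init__(self, stages=4):
        self.stages = stages
    def route(self, packet, cycle):
        return cycle + self.stages

def simulate(router: Router, hops=3):
    cycle = 0
    for hop in range(hops):
        cycle = router.route(packet={"dst": hop}, cycle=cycle)
    return cycle

print(simulate(FunctionalRouter()))         # 0 cycles of router delay
print(simulate(PipelinedRouter(stages=4)))  # 12 cycles over 3 hops
```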
IEEE Transactions on Computers | 1995
Ching-Farn Eric Wu; Yarsun Hsu; Yew-Huey Liu
Stack simulation is a powerful cache analysis approach that generates the numbers of misses and write-backs for various cache configurations in a single run. Unfortunately, none of the previous work on stack simulation provides an efficient stack algorithm for virtual-address caches with real tags (VIR-type caches). In this paper, we devise an efficient stack simulation algorithm for analyzing VIR-type caches. Using markers with a valid range for synonym lines, our algorithm is able to keep track of stack distances for different cache configurations. In addition to cache miss ratios and write-back ratios, our approach generates the pseudonym frequency for all cache configurations under investigation.
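For context, classic one-pass stack simulation keeps an LRU stack of blocks and records each reference's stack distance, from which the miss count for every cache size follows. The sketch below shows only that baseline algorithm; the paper's contribution, markers with a valid range to handle synonym lines in VIR-type caches, is not reproduced:

```python
# One-pass (Mattson) stack simulation in miniature: record each reference's
# LRU stack distance, then derive misses for every fully associative LRU
# cache size from the resulting distance histogram.
from collections import Counter

def stack_distances(trace, line=32):
    stack, dist = [], Counter()
    for addr in trace:
        block = addr // line
        if block in stack:
            d = stack.index(block)           # stack distance (0 = MRU)
            dist[d] += 1
            stack.remove(block)
        else:
            dist["inf"] += 1                 # cold (compulsory) miss
        stack.insert(0, block)               # move/push block to stack top
    return dist

def misses(dist, cache_lines):
    """Misses for an LRU cache that holds cache_lines blocks."""
    return dist["inf"] + sum(c for d, c in dist.items()
                             if d != "inf" and d >= cache_lines)

trace = [0, 32, 64, 0, 96, 32, 0]
d = stack_distances(trace)
print(misses(d, 2), misses(d, 4))            # larger cache, fewer misses
```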