Hwansoo Han | Researchain

Publication

Featured research published by Hwansoo Han.

IEEE Transactions on Parallel and Distributed Systems | 2006

Exploiting locality for irregular scientific codes

Hwansoo Han; Chau-Wen Tseng

Irregular scientific codes experience poor cache performance due to their irregular memory access patterns. In this paper, we present two new locality improving techniques for irregular scientific codes. Our techniques exploit geometric structures hidden in data access patterns and computation structures. Our new data reordering (GPART) finds the graph structure within data accesses and applies hierarchical clustering. Quality partitions are constructed quickly by clustering multiple neighbor nodes with priority on nodes with high degree and repeating a few passes. Overhead is kept low by clustering multiple nodes in each pass and considering only edges between partitions. Our new computation reordering (Z-SORT) treats the values of index arrays as coordinates and reorders corresponding computations in Z-curve order. Applied to dense inputs, Z-SORT achieves performance close to data reordering combined with other computation reordering but without the overhead involved in data reordering. Experiments on irregular scientific codes for a variety of meshes show locality optimization techniques are effective for both sequential and parallelized codes, improving performance by 60-87 percent. GPART achieved within 1-2 percent of the performance of more sophisticated partitioning algorithms, but with one third of the overhead. Z-SORT also yields a performance improvement of 64 percent for dense inputs, which is comparable with data reordering combined with computation reordering.
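
The idea behind Z-curve computation reordering can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 16-bit coordinate width, and the assumption that each iteration accesses exactly two index-array values (left[k], right[k]) are hypothetical choices made for illustration.

```python
def interleave_bits(x, y, bits=16):
    """Morton (Z-curve) code: interleave the low `bits` bits of x and y."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)      # x supplies the even bit positions
        code |= ((y >> b) & 1) << (2 * b + 1)  # y supplies the odd bit positions
    return code


def z_sort(left, right, bits=16):
    """Return an iteration order that visits (left[k], right[k]) in Z-curve order.

    left and right play the role of the index arrays of an irregular loop;
    their values are treated as 2-D coordinates (an assumption of this sketch).
    """
    return sorted(range(len(left)),
                  key=lambda k: interleave_bits(left[k], right[k], bits))


if __name__ == "__main__":
    left = [3, 0, 2, 1]
    right = [3, 0, 0, 1]
    print(z_sort(left, right))
```

Iterating over the loop in the returned order keeps iterations that touch nearby array elements close together in time, without moving the data itself.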

Lecture Notes in Computer Science | 2000

A Comparison of Locality Transformations for Irregular Codes

Hwansoo Han; Chau-Wen Tseng

Researchers have proposed several data and computation transformations to improve locality in irregular scientific codes. We experimentally compare their performance and present GPART, a new technique based on hierarchical clustering. Quality partitions are constructed quickly by clustering multiple neighboring nodes with priority on nodes with high degree, and repeating a few passes. Overhead is kept low by clustering multiple nodes in each pass and considering only edges between partitions. Experimental results show GPART matches the performance of more sophisticated partitioning algorithms to within 6%-8%, with a small fraction of the overhead. It is thus useful for optimizing programs whose running times are not known.
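
The clustering steps described in this abstract can be approximated with a short Python sketch. This is a simplified, degree-prioritized clustering written for illustration, not GPART itself: the function name, the cluster size limit that doubles each pass, and the tie-breaking rules are all assumptions.

```python
from collections import defaultdict


def gpart_like_order(num_nodes, edges, passes=3, max_size=2):
    """Return a node ordering that places clustered nodes next to each other."""
    members = {n: [n] for n in range(num_nodes)}  # cluster id -> nodes in it
    cluster_of = list(range(num_nodes))           # node -> cluster id
    for _ in range(passes):
        # Keep only edges between different clusters (intra-cluster edges drop out).
        adj = defaultdict(set)
        for u, v in edges:
            cu, cv = cluster_of[u], cluster_of[v]
            if cu != cv:
                adj[cu].add(cv)
                adj[cv].add(cu)
        merged = set()
        # Visit clusters with the most neighbors first.
        for c in sorted(members, key=lambda c: (-len(adj[c]), c)):
            if c in merged:
                continue
            merged.add(c)
            # Absorb several unmerged neighbors in the same pass.
            for nb in sorted(adj[c]):
                if nb in merged or len(members[c]) + len(members[nb]) > max_size:
                    continue
                merged.add(nb)
                members[c].extend(members.pop(nb))
        for c, nodes in members.items():
            for n in nodes:
                cluster_of[n] = c
        max_size *= 2  # hypothetical growth rule for the next pass
    return [n for c in sorted(members) for n in members[c]]


if __name__ == "__main__":
    # A 4-node path graph: 0 - 1 - 2 - 3
    print(gpart_like_order(4, [(0, 1), (1, 2), (2, 3)], passes=1))
```

The returned ordering would then be used to renumber the data arrays so that nodes in the same cluster sit near each other in memory.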

Languages and Compilers for Parallel Computing | 2000

Improving Locality for Adaptive Irregular Scientific Codes

Hwansoo Han; Chau-Wen Tseng

Irregular scientific codes experience poor cache performance due to their memory access patterns. In this paper, we examine two issues for locality optimizations for irregular computations. First, we experimentally find locality optimization can improve performance for parallel codes, but is dependent on the parallelization techniques used. Second, we show locality optimization may be used to improve performance even for adaptive codes. We develop a cost model which can be employed to calculate an efficient optimization frequency; it may be applied dynamically by instrumenting the program to measure execution time per time-step iteration. Our results are validated through experiments on three representative irregular scientific codes.

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2012

Efficient SIMD code generation for irregular kernels

Seonggun Kim; Hwansoo Han

Array indirection poses several challenges for compilers that utilize single instruction, multiple data (SIMD) instructions. Disjoint memory references, arbitrarily misaligned memory references, and dependence cycles in loops are the main challenges SIMD compilers must handle. Due to those challenges, existing SIMD compilers have excluded loops with array indirection from their candidate loops for SIMD vectorization. However, addressing those challenges is inevitable, since many important compute-intensive applications extensively use array indirection to reduce memory and computation requirements. In this work, we propose a method to generate efficient SIMD code for loops containing indirected memory references. We extract both inter- and intra-iteration parallelism, taking data reorganization overhead into consideration. We also optimally place data reorganization code in order to amortize the reorganization overhead through the performance gain of SIMD vectorization. Experiments on four array indirection kernels, which are extracted from real-world scientific applications, show that our proposed method effectively generates SIMD code for irregular kernels with array indirection. Compared to the existing SIMD vectorization methods, our proposed method significantly improves the performance of irregular kernels by 91%, on average.

International Conference on Parallel Architectures and Compilation Techniques | 1998

Improving compiler and run-time support for adaptive irregular codes

Hwansoo Han; Chau-Wen Tseng

Irregular reductions form the core of adaptive irregular codes. On distributed-memory multiprocessors, they are parallelized either using sophisticated run-time systems (e.g., CHAOS, PILAR) or the shared-memory interface supported by software DSMs (e.g., CVM, TreadMarks). We introduce LOCALWRITE, a new technique based on the owner-computes rule which eliminates the need for buffers or synchronized writes but may replicate computation. We evaluate its performance for irregular codes while varying connectivity, locality, and adaptivity. LOCALWRITE improves performance by 50-150% compared to using replicated buffers, and can match or exceed gather/scatter for applications with low locality or high adaptivity.

Parallel Computing | 2000

Efficient compiler and run-time support for parallel irregular reductions

Hwansoo Han; Chau-Wen Tseng

Many scientific applications are comprised of irregular reductions on large data sets. In shared-memory parallel programs, these irregular reductions are typically computed in parallel using replicated buffers, then combined using synchronization. We develop LocalWrite, a new technique which partitions irregular reductions so that each processor computes values only for locally assigned data, eliminating the need for buffers or synchronized writes. Computation is replicated if its results are needed on multiple processors. We experimentally evaluate its performance for three irregular codes on a software DSM running on a distributed-memory multiprocessor and two shared-memory multiprocessors while varying connectivity, locality, and adaptivity. Results show LocalWrite improves performance significantly compared to using replicated buffers, and can match or exceed explicit message-passing gather/scatter for applications with low locality or high adaptivity.
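
The owner-computes partitioning described in this abstract can be sketched in Python as a sequential simulation. This is an illustrative assumption rather than the authors' code: the block distribution of nodes, the edge-based reduction, the function name, and the use of the edge weight as a stand-in for the real per-edge computation are all hypothetical.

```python
def local_write_reduction(edges, weights, num_nodes, num_procs):
    """Simulate an owner-computes irregular reduction over an edge list.

    Each simulated processor p handles every edge that touches a node it
    owns, but writes only to its own nodes, so no replicated buffers or
    synchronized writes are needed. Edges that cross a partition boundary
    are computed once per owning processor (replicated computation).
    """
    block = (num_nodes + num_procs - 1) // num_procs  # block distribution (assumed)

    def owner(n):
        return n // block

    y = [0.0] * num_nodes
    computed = 0  # counts edge computations, including replicated ones
    for p in range(num_procs):  # stands in for num_procs parallel workers
        for (u, v), w in zip(edges, weights):
            if owner(u) != p and owner(v) != p:
                continue  # edge touches no locally owned data
            contrib = w  # stand-in for the real per-edge computation
            computed += 1
            if owner(u) == p:
                y[u] += contrib
            if owner(v) == p:
                y[v] -= contrib
    return y, computed


if __name__ == "__main__":
    y, computed = local_write_reduction(
        [(0, 1), (1, 2), (2, 3)], [1.0, 2.0, 4.0], num_nodes=4, num_procs=2)
    print(y, computed)
```

In this example the edge (1, 2) crosses the partition boundary, so it is computed by both simulated processors: four edge computations are performed for three edges.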

Information & Software Technology | 2010

Filtering false alarms of buffer overflow analysis using SMT solvers

Youil Kim; Jooyong Lee; Hwansoo Han; Kwang-Moo Choe

Buffer overflow detection using static analysis can provide a powerful tool for software programmers to find difficult bugs in C programs. Sound static analysis based on abstract interpretation, however, often suffers from the false alarm problem. Although more precise abstraction can reduce the number of false alarms in general, the cost to perform such analysis is often too high to be practical for large software. On the other hand, less precise abstraction is likely to be scalable in exchange for increased false alarms. In order to attain both precision and scalability, we present a method that first applies less precise abstraction to find buffer overflow alarms fast, and selectively applies a more precise analysis only to the limited areas of code around the potential false alarms. In an attempt to develop the precise analysis of alarm filtering for large C programs, we perform a symbolic execution over the potential alarms found in the previous analysis, which is based on abstract interpretation. Taking advantage of a state-of-the-art SMT solver, our precise analysis efficiently filters out a substantial number of false alarms. Our experiment with the test cases from three open source programs shows that our filtering method can eliminate about 68% of false alarms on average.

Languages and Compilers for Parallel Computing | 1998

Improving Compiler and Run-Time Support for Irregular Reductions Using Local Writes

Hwansoo Han; Chau-Wen Tseng

Current compilers for distributed-memory multiprocessors parallelize irregular reductions either by generating calls to sophisticated run-time systems (CHAOS) or by relying on replicated buffers and the shared-memory interface supported by software DSMs (TreadMarks). We introduce LocalWrite, a new technique for parallelizing irregular reductions based on the owner-computes rule. It eliminates the need for buffers or synchronized writes, but may replicate computation. We investigate the impact of connectivity (node/edge ratio), locality (accesses to local data) and adaptivity (edge modifications) on their relative performance. LocalWrite improves performance by 50-150% compared to using replicated buffers, and can match or exceed gather/scatter for applications with low locality or high adaptivity.

Languages and Compilers for Parallel Computing | 1998

Eliminating Barrier Synchronization for Compiler-Parallelized Codes on Software DSMs

Hwansoo Han; Chau-Wen Tseng; Peter J. Keleher

Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating barrier synchronization overhead in software DSMs. Our compile-time barrier elimination algorithm extends previous techniques in three ways: (1) we perform inexpensive communication analysis through local subscript analysis when using chunk iteration partitioning for parallel loops; (2) we exploit delayed updates in lazy-release-consistency DSMs to eliminate barriers guarding only anti-dependences; (3) when possible we replace barriers with customized nearest-neighbor synchronization. Experiments on an IBM SP-2 indicate these techniques can improve parallel performance by 20% on average and by up to 60% for some applications.

Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing | 1998

Compile-time synchronization optimizations for software DSMs

Hwansoo Han; Chau-Wen Tseng

Software distributed-shared-memory (DSM) systems provide a desirable target for parallelizing compilers due to their flexibility. However, studies show synchronization and load imbalance are significant sources of overhead. The authors investigate the impact of compilation techniques for eliminating synchronization overhead in software DSMs, developing new algorithms to handle situations found in practice. They evaluate the contributions of synchronization elimination algorithms based on 1) dependence analysis, 2) communication analysis, 3) exploiting coherence protocols in software DSMs, and 4) aggressive expansion of parallel SPMD regions. They also find suppressing expensive parallelism to be useful for one application. Experiments indicate these techniques eliminate almost all parallel task invocations, and reduce the number of barriers executed by 66% on average. On a 16-processor IBM SP-2, speedups are improved on average by 35%, and are tripled for some applications.
