Seunghwa Kang

Georgia Institute of Technology

Publication

Featured research published by Seunghwa Kang.

parallel computing | 2007

High performance combinatorial algorithm design on the Cell Broadband Engine processor

David A. Bader; Virat Agarwal; Kamesh Madduri; Seunghwa Kang

The Sony-Toshiba-IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell/B.E. processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this article, we present two case studies to illustrate the design and implementation of parallel combinatorial algorithms on Cell/B.E.: we discuss list ranking, a fundamental kernel for graph problems, and zlib, a data compression and decompression library. List ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures due to its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell/B.E. processor, we decompose work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to efficiently implement list ranking, and demonstrate substantial speedup in comparison to traditional cache-based microprocessors. For instance, on a 3.2GHz IBM QS20 Cell/B.E. blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation. Our second case study, zlib, is a data compression/decompression library that is extensively used in both scientific and general-purpose computing. The core kernels in the zlib library are the LZ77 longest subsequence matching algorithm and Huffman data encoding. We design efficient parallel algorithms for these combinatorial kernels, and exploit concurrency at multiple levels on the Cell/B.E. processor. We also present a Cell/B.E. optimized implementation of gzip, a popular file-compression application based on the zlib library. For our Cell/B.E. implementation of gzip, we achieve an average speedup of 2.9 in compression over current workstations.

acm sigplan symposium on principles and practice of parallel programming | 2009

An efficient transactional memory algorithm for computing minimum spanning forest of sparse graphs

Seunghwa Kang; David A. Bader

Due to the power wall, memory wall, and ILP wall, we are facing the end of ever-increasing single-threaded performance. For this reason, multicore and manycore processors are arising as a new paradigm to pursue. However, to fully exploit all the cores in a chip, parallel programming is often required, and the complexity of parallel programming raises a significant concern. Data synchronization is a major source of this programming complexity, and Transactional Memory is proposed to reduce the difficulty caused by data synchronization requirements, while providing high scalability and low performance overhead. The previous literature on Transactional Memory mostly focuses on architectural designs. Its impact on algorithms and applications has not yet been studied thoroughly. In this paper, we investigate Transactional Memory from the algorithm designer's perspective. This paper presents an algorithmic model to assist in the design of efficient Transactional Memory algorithms and a novel Transactional Memory algorithm for computing a minimum spanning forest of sparse graphs. We emphasize multiple Transactional Memory related design issues in presenting our algorithm. We also provide experimental results on an existing software Transactional Memory system. Our algorithm demonstrates excellent scalability in the experiments, but at the same time, the experimental results reveal the clear limitation of software Transactional Memory due to its high performance overhead. Based on our experience, we highlight the necessity of efficient hardware support for Transactional Memory to realize the potential of the technology.

parallel computing | 2009

Computing discrete transforms on the Cell Broadband Engine

David A. Bader; Virat Agarwal; Seunghwa Kang

Discrete transforms are of primary importance and fundamental kernels in many computationally intensive scientific applications. In this paper, we investigate the performance of two such algorithms, Fast Fourier Transform (FFT) and Discrete Wavelet Transform (DWT), on the Sony-Toshiba-IBM Cell Broadband Engine (Cell/B.E.), a heterogeneous multicore chip architected for intensive gaming applications and high performance computing. We design an efficient parallel implementation of FFT to fully exploit the architectural features of the Cell/B.E. Our FFT algorithm uses an iterative out-of-place approach and for 1K to 16K complex input samples outperforms all other parallel implementations of FFT on the Cell/B.E. including FFTW. Our FFT implementation obtains a single-precision performance of 18.6 GFLOP/s on the Cell/B.E., outperforming Intel Duo Core (Woodcrest) for inputs of greater than 2K samples. We also optimize DWT in the context of JPEG2000 for the Cell/B.E. DWT has abundant parallelism; however, due to the low temporal locality of the algorithm, memory bandwidth becomes a significant bottleneck in achieving high performance. We introduce a novel data decomposition scheme to achieve highly efficient DMA data transfer and vectorization with low programming complexity. We also merge multiple steps of the algorithm to reduce the bandwidth requirement. This leads to a significant enhancement in the scalability of the implementation. Our optimized implementation of DWT on one Cell/B.E. chip demonstrates speedups of 34 and 56 times over the baseline code for the lossless and lossy transforms, respectively. We also provide a performance comparison with the AMD Barcelona (Quad-core Opteron) processor, which the Cell/B.E. outperforms. This highlights the advantage of the Cell/B.E. over general purpose multicore processors in processing regular and bandwidth intensive scientific applications.

ieee international symposium on parallel distributed processing workshops and phd forum | 2010

Large scale complex network analysis using the hybrid combination of a MapReduce cluster and a highly multithreaded system

Seunghwa Kang; David A. Bader

Complex networks capture interactions among entities in various application areas in a graph representation. Analyzing large scale complex networks often answers important questions—e.g. estimate the spread of epidemic diseases—but also imposes computing challenges mainly due to large volumes of data and the irregular structure of the graphs. In this paper, we aim to solve such a challenge: finding relationships in a subgraph extracted from the data. We solve this problem using three different platforms: a MapReduce cluster, a highly multithreaded system, and a hybrid system of the two. The MapReduce cluster and the highly multithreaded system reveal limitations in efficiently solving this problem, whereas the hybrid system exploits the strengths of the two in a synergistic way and solves the problem at hand. In particular, once the subgraph is extracted and loaded into memory, the hybrid system analyzes the subgraph five orders of magnitude faster than the MapReduce cluster.

international conference on parallel processing | 2008

Optimizing JPEG2000 Still Image Encoding on the Cell Broadband Engine

Seunghwa Kang; David A. Bader

JPEG2000 is the latest still image coding standard from the JPEG committee, which adopts new algorithms such as embedded block coding with optimized truncation (EBCOT) and discrete wavelet transform (DWT). These algorithms enable superior coding performance over JPEG and support various new features at the cost of increased computational complexity. The Sony-Toshiba-IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture with SIMD accelerators. In this work, we optimize the computationally intensive algorithmic kernels of JPEG2000 for the Cell/B.E. and also introduce a novel data decomposition scheme to achieve high performance with low programming complexity. We compare the Cell/B.E.'s performance to the performance of the Intel Pentium IV 3.2 GHz processor. The Cell/B.E. demonstrates 3.2 times higher performance for lossless encoding and 2.7 times higher performance for lossy encoding. For the DWT, the Cell/B.E. outperforms the Pentium IV processor by 9.1 times for the lossless case and 15 times for the lossy case. We also provide experimental results on one IBM QS20 blade with two Cell/B.E. chips and a performance comparison with the existing JPEG2000 encoder for the Cell/B.E.

Information Technology | 2011

Algorithm Engineering Challenges in Multicore and Manycore Systems

Seunghwa Kang; David Ediger; David A. Bader

Modern multicore and manycore systems have strong potential to deliver both high performance and high power efficiency. The large variance in memory access latency, resource sharing, and the heterogeneity of processor architectures in modern multicore and manycore systems raise significant algorithm engineering challenges. In this article, we give an overview of important algorithm engineering issues for modern multicore and manycore systems, and we present algorithm engineering techniques to address such problems as a guideline for practitioners.

international parallel and distributed processing symposium | 2016

Algorithm and Architecture Independent Benchmarking with SEAK

Nathan R. Tallent; Joseph B. Manzano; Nitin A. Gawande; Seunghwa Kang; Darren J. Kerbyson; Adolfy Hoisie; Joseph K. Cross

Many applications of high performance embedded computing are constrained by performance or power bottlenecks. We designed a new benchmark suite, the Suite for Embedded Applications and Kernels (SEAK), (a) to capture these bottlenecks in a way that encourages creative solutions, and (b) to facilitate rigorous tradeoff evaluation for their solutions. To avoid biases toward existing solutions, both algorithms and architecture are variables. Thus, each benchmark has a mission-centric (abstracted from a particular algorithm) and goal-oriented (functional) specification. To encourage solutions that are any combination of software or hardware, we use an end-user black-box evaluation. To inform procurement decisions, evaluations capture tradeoffs between performance, power, accuracy, size, and weight. We call our benchmarks future proof because they remain useful despite shifting algorithmic/architectural preferences. To create both concise and precise mission-centric specifications, we introduce two distinct benchmark classes. This paper describes the SEAK suite and presents an evaluation of sample solutions that highlights power and performance tradeoffs.

international parallel and distributed processing symposium | 2009

Understanding the design trade-offs among current multicore systems for numerical computations

Seunghwa Kang; David A. Bader; Richard W. Vuduc

In this paper, we empirically evaluate fundamental design trade-offs among the most recent multicore processors and accelerator technologies. Our primary aim is to aid application designers in better mapping their software to the most suitable architecture, with an additional goal of influencing future computing system design. We specifically examine five architectures, based on: the Intel quad-core Harpertown processor, the AMD quad-core Barcelona processor, the Sony-Toshiba-IBM Cell Broadband Engine processors (both the first-generation chip and the second-generation PowerXCell 8i), and the NVIDIA Tesla C1060 GPU. We illustrate the software implementation process on each platform for a set of widely used kernels from computational statistics that are simple to reason about; measure and analyze the performance of each implementation; and discuss the impact of different architectural design choices on each implementation.

PLOS ONE | 2011