Neungsoo Park | Researchain

Publication

Featured research published by Neungsoo Park.

IEEE Transactions on Parallel and Distributed Systems | 2003

Tiling, block data layout, and memory hierarchy performance

Neungsoo Park; Bo Hong; Viktor K. Prasanna

Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze cache and translation look-aside buffer (TLB) performance of such alternate layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower bound on TLB performance for standard matrix access patterns, and show that block data layout and Morton layout achieve this bound. To improve cache performance, block data layout is used in concert with tiling. Based on the cache and TLB performance analysis, we propose a data block size selection algorithm that finds a tight range for optimal block size. To validate our analysis, we conducted simulations and experiments using tiled matrix multiplication, LU decomposition, and Cholesky factorization. For matrix multiplication, simulation results using UltraSparc II parameters show that tiling and block data layout, with a block size given by our block size selection algorithm, reduce TLB misses by up to 93 percent compared with other techniques. The total miss cost is reduced considerably. Experiments on several platforms show that tiling with block data layout achieves up to 50 percent performance improvement over other techniques that use conventional layouts. Morton layout is also analyzed and compared with block data layout. Experimental results show that matrix multiplication using block data layout is up to 15 percent faster than that using Morton data layout.
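
To make the two layouts named in the abstract concrete, here is a minimal sketch of how an element (i, j) might be mapped to a linear memory offset. The function names and the ordering choices (blocks stored in row-major order, row-major order inside each block, element-level bit interleaving for Morton order) are assumptions for illustration; the abstract does not specify them, and this is not the paper's block size selection algorithm.

```python
def block_layout_offset(i, j, n, b):
    """Offset of element (i, j) of an n x n matrix stored as b x b blocks.

    Assumed convention: blocks are laid out in row-major order, and the
    elements inside each block are also row-major and contiguous.
    """
    if n % b != 0:
        raise ValueError("block size must divide the matrix dimension")
    blocks_per_row = n // b
    block_index = (i // b) * blocks_per_row + (j // b)
    return block_index * b * b + (i % b) * b + (j % b)


def morton_index(i, j, bits):
    """Z-order (Morton) index: interleave the low `bits` bits of i and j.

    Bits of j occupy the even positions and bits of i the odd positions.
    """
    z = 0
    for k in range(bits):
        z |= ((j >> k) & 1) << (2 * k)
        z |= ((i >> k) & 1) << (2 * k + 1)
    return z


if __name__ == "__main__":
    # Print the offset of every element of a 4 x 4 matrix with 2 x 2 blocks.
    for i in range(4):
        print([block_layout_offset(i, j, 4, 2) for j in range(4)])
```

With either mapping, all elements of a block occupy one contiguous range of memory, which is the property the cache and TLB analysis relies on.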

IEEE Transactions on Parallel and Distributed Systems | 1999

Efficient algorithms for block-cyclic array redistribution between processor sets

Neungsoo Park; Viktor K. Prasanna; Cauligi S. Raghavendra

Run-time array redistribution is necessary to enhance the performance of parallel programs on distributed memory supercomputers. In this paper, we present an efficient algorithm for array redistribution from cyclic(x) on P processors to cyclic(Kx) on Q processors. The algorithm reduces the overall time for communication by considering the data transfer, communication schedule, and index computation costs. The proposed algorithm is based on a generalized circulant matrix formalism. Our algorithm generates a schedule that minimizes the number of communication steps and eliminates node contention in each communication step. The network bandwidth is fully utilized by ensuring that equal-sized messages are transferred in each communication step. Furthermore, the procedure to compute the schedule and the index sets is extremely fast. It takes O(max(P, Q)) time. Therefore, our proposed algorithm is suitable for run-time array redistribution. To evaluate the performance of our scheme, we have implemented the algorithm using C and MPI. The experiments were conducted on the IBM SP2. The experimental results show that the proposed algorithm outperforms well-known algorithms with respect to the total redistribution time including the data transfer and schedule and index computation times.
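
The redistribution problem itself can be illustrated with a brute-force sketch that scans every element and records which source processor must send it to which destination processor. This is only a statement of the problem under the usual block-cyclic ownership rule; it is not the circulant-matrix schedule from the paper, it runs in time proportional to the array size rather than O(max(P, Q)), and the function names are hypothetical.

```python
from collections import defaultdict


def owner(g, block, procs):
    """Processor owning global index g under a cyclic(block) distribution."""
    return (g // block) % procs


def redistribution_messages(n, x, p, k, q):
    """Group the n array elements by (source, destination) processor pair.

    Source distribution: cyclic(x) on p processors.
    Target distribution: cyclic(k * x) on q processors.
    """
    messages = defaultdict(list)
    for g in range(n):
        messages[(owner(g, x, p), owner(g, k * x, q))].append(g)
    return dict(messages)


if __name__ == "__main__":
    # cyclic(1) on 2 processors to cyclic(2) on 3 processors, 12 elements.
    for pair, indices in sorted(redistribution_messages(12, 1, 2, 2, 3).items()):
        print(pair, indices)
```

In this small case every (source, destination) pair exchanges a message of the same size, which is the kind of regularity a contention-free schedule can exploit.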

International Conference on Parallel Processing | 2002

Analysis of memory hierarchy performance of block data layout

Neungsoo Park; Bo Hong; Viktor K. Prasanna

Recently, several experimental studies have been conducted on block data layout as a data transformation technique used in conjunction with tiling to improve cache performance. We provide a theoretical analysis for the TLB and cache performance of block data layout. For standard matrix access patterns, we derive an asymptotic lower bound on the number of TLB misses for any data layout and show that block data layout achieves this bound. We show that block data layout reduces TLB misses by a factor of O(B) compared with conventional data layouts, where B is the block size of block data layout. This reduction contributes to the improvement in memory hierarchy performance. Using our TLB and cache analysis, we also discuss the impact of block size on the overall memory hierarchy performance. These results are validated through simulations and experiments on state-of-the-art platforms.
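
The O(B) factor can be made plausible with a small counting experiment: walk one column of a matrix and count how many distinct memory pages are touched under each layout. The sketch below is a simplification under stated assumptions (it counts distinct pages rather than modelling a finite TLB, the page size is given in elements, and the parameter values in the example are chosen so that one block fills exactly one page); the function name is hypothetical.

```python
def pages_touched_by_column(n, b, page_elems, layout):
    """Count distinct pages touched while reading column 0 of an n x n matrix.

    layout is "row_major" or "block" (b x b blocks stored contiguously,
    blocks in row-major order). page_elems is the page size in elements.
    """
    j = 0
    pages = set()
    for i in range(n):
        if layout == "row_major":
            offset = i * n + j
        elif layout == "block":
            block_index = (i // b) * (n // b) + (j // b)
            offset = block_index * b * b + (i % b) * b + (j % b)
        else:
            raise ValueError("unknown layout: " + layout)
        pages.add(offset // page_elems)
    return len(pages)


if __name__ == "__main__":
    n, b, page_elems = 1024, 32, 1024  # one 32 x 32 block fills one page
    print(pages_touched_by_column(n, b, page_elems, "row_major"))
    print(pages_touched_by_column(n, b, page_elems, "block"))
```

With these example parameters the column walk touches 1024 pages in row-major layout and 32 pages in block layout, a reduction by a factor equal to the block dimension.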

International Conference on Parallel Processing | 1997

Efficient algorithms for multi-dimensional block-cyclic redistribution of arrays

Young Won Lim; Neungsoo Park; Viktor K. Prasanna

We present a uniform framework for a classical problem, redistribution of a multi-dimensional array. Using a generalized circulant matrix formalism, we derive efficient direct, indirect and hybrid contention-free communication schedules. Our indirect schedule reduces the number of communication steps significantly compared with the previous approaches. Our approach exploits the regularity of the block-cyclic redistribution to minimize the index computation overheads. For the case of 2-d redistribution, when the block size increases by factors of K_1 and K_2 along each dimension and the process topology remains fixed, our indirect schedule performs the redistribution in O(log(K_1 K_2)) communication steps. When the block size is fixed and the processor topology is transposed, our indirect schedule results in O(log(L/G)) communication steps. Implementations of our algorithms on the IBM SP-2 show superior performance over previous approaches.

ACM Symposium on Applied Computing | 2007

A high performance NIDS using FPGA-based regular expression matching

Janghaeng Lee; Sung Ho Hwang; Neungsoo Park; Seong-Won Lee; Sunglk Jun; Youngsoo Kim

A Network Intrusion Detection System (NIDS) monitors all incoming packets in the network and detects packets that are malicious to the internal system. The NIDS should also be able to update its detection rules, because new attack patterns are unpredictable. Incorporating FPGAs into the NIDS is one of the best solutions, providing both high performance and high flexibility compared with other approaches such as software solutions. In this paper we propose a novel approach to designing the parallel comparator of an NIDS that not only minimizes additional resources but also maximizes processing performance. The performance and resource tradeoff due to the implementation of the parallel comparator with prefix sharing is also analyzed.

International Conference on Acoustics, Speech, and Signal Processing | 2001

Cache conscious Walsh-Hadamard transform

Neungsoo Park; N. K. Prasanna

The Walsh-Hadamard Transform (WHT) is an important algorithm in signal processing because of its simplicity. However, in computing a large WHT, non-unit stride access results in poor cache performance, leading to severe degradation in performance. This poor cache performance is also a critical problem in achieving high performance in other large signal transforms. We develop a cache-friendly technique that improves the performance of large WHTs. In our approach, data reorganization is performed between computation stages to reduce cache pollution. Furthermore, we develop an efficient search algorithm to determine the optimal factorization tree based upon problem size and stride access in the decomposition. Experimental results show that our approach achieves up to 180% performance improvement over the state-of-the-art package on Alpha 21264 and MIPS R10000. In addition, the proposed optimization is applicable to other signal transforms and is portable across various platforms.
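
For reference, a plain in-order fast WHT can be sketched as below. It is the textbook butterfly algorithm without any of the data reorganization or factorization search described in the abstract, and the function name `fwht` is an assumption. The inner loop pairs elements that are `h` apart, so the later stages access memory with large strides, which is the access pattern the abstract identifies as cache-unfriendly.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (natural/Hadamard order).

    The input length must be a power of two. Returns a new list.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Each stage combines elements that are h positions apart.
        for start in range(0, n, 2 * h):
            for k in range(start, start + h):
                a, b = out[k], out[k + h]
                out[k], out[k + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    print(fwht([1, 0, 0, 0]))
    print(fwht([1, 1, 1, 1]))
```

Applying the unnormalized transform twice returns the input scaled by its length, which gives a quick sanity check for any reorganized variant.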

International Parallel and Distributed Processing Symposium | 2000

Dynamic data layouts for cache-conscious factorization of DFT

Neungsoo Park; Dongsoo Kang; Kiran Bondalapati; Viktor K. Prasanna

Effective utilization of cache memories is a key factor in achieving high performance in computing the Discrete Fourier Transform (DFT). Most optimization techniques for computing the DFT rely on either modifying the computation and data access order or exploiting low level platform specific details, while keeping the data layout in memory static. In this paper we propose a high level optimization technique, dynamic data layout (DDL). In DDL, data reorganization is performed between computations to effectively utilize the cache. This cache-conscious factorization of the DFT including the data reorganization steps is automatically computed by using efficient techniques in our approach. An analytical model of the cache miss pattern is utilized to predict the performance and explore the search space of factorizations. Our technique results in up to a factor of 4 improvement over standard FFT implementations and up to 33% improvement over other optimization techniques such as copying on SUN UltraSPARC-II, DEC Alpha and Intel Pentium III.

IEEE High Performance Extreme Computing Conference | 2013

High throughput energy efficient parallel FFT architecture on FPGAs

Ren Chen; Neungsoo Park; Viktor K. Prasanna

Throughput is a key performance metric for streaming FFT architectures. However, increasing spatial parallelism to improve throughput introduces complex routing, thus resulting in high power consumption. In this paper, we propose a high throughput energy efficient parallel FFT architecture based on the Cooley-Tukey algorithm. Multiple pipeline FFT processors using time-multiplexing are utilized to perform FFT computation tasks in parallel. This design realizes high performance using task-level parallelism and avoids complex routing. Furthermore, to reduce the memory power consumption, a periodic memory activation (PMA) scheme is developed. By analyzing energy efficiency (defined as GOPS/Joule) asymptotically, we show that our design achieves a low energy efficiency complexity while satisfying a high-throughput requirement. For N-point FFT (64 ≤ N ≤ 4096), our proposed architecture achieves 50 to 63 GOPS/Joule, i.e., up to 78% of the Peak Energy Efficiency of FFT designs on FPGAs. Compared with a state-of-the-art design, our design improves the energy efficiency by 17% to 26% with the same throughput.

International Conference on Hybrid Information Technology | 2008

Parallel Algorithms for Steiner Tree Problem

Joon-Sang Park; Won Woo Ro; Handuck Lee; Neungsoo Park

The Steiner tree problem seeks the shortest tree connecting a given set of terminal points. This paper discusses parallelization of algorithms for the Steiner tree problem. First, a 2-approximation algorithm due to Takahashi and Matsuyama is parallelized for the PRAM (Parallel Random Access Machine) model. Then, issues in parallelizing another 2-approximation heuristic, namely the Kou, Markowsky, and Berman algorithm, and other advanced heuristics achieving lower approximation ratios are discussed.

International Conference on Intelligent Pervasive Computing | 2007

Quarter-pel Interpolation Architecture in H.264/AVC Decoder

Jongwoo Bae; Neungsoo Park; Seong-Won Lee

In this paper, to reduce the amount of computation in the quarter-pel interpolation of H.264 motion compensation, a two-step interpolation approach is proposed: the first step is the half-pel interpolation, and the second is the quarter-pel interpolation using the previous results. The quarter-pel interpolation is performed selectively according to the motion vector. The half-pel interpolation can be performed by a simple row shift operation on the data stored in the register. Therefore, the proposed approach can be implemented as a simple hardware architecture that interpolates fast enough to process high-definition videos.
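
The two-step structure can be illustrated in software with a one-dimensional sketch. The filter taps (1, -5, 20, 20, -5, 1) with rounding and a shift by 5, and the rounded average for quarter-sample positions, are taken from the H.264/AVC luma interpolation rules as an assumption; the abstract itself does not state them. The function names are hypothetical, and this models the arithmetic only, not the register-shift hardware architecture proposed in the paper.

```python
HALF_PEL_TAPS = (1, -5, 20, 20, -5, 1)  # assumed H.264 luma 6-tap filter


def clip8(value):
    """Clamp a value to the 8-bit sample range."""
    return max(0, min(255, value))


def half_pel(samples, pos):
    """Half-sample value between samples[pos] and samples[pos + 1].

    Step 1: needs two integer samples to the left of pos and three to the right.
    """
    if pos - 2 < 0 or pos + 3 >= len(samples):
        raise ValueError("not enough neighbouring samples")
    window = samples[pos - 2:pos + 4]
    acc = sum(t * s for t, s in zip(HALF_PEL_TAPS, window))
    return clip8((acc + 16) >> 5)


def quarter_pel(a, b):
    """Step 2: rounded average of two neighbouring integer or half-pel values."""
    return (a + b + 1) >> 1


if __name__ == "__main__":
    row = [0, 0, 0, 255, 255, 255]
    h = half_pel(row, 2)              # half-pel between row[2] and row[3]
    print(h, quarter_pel(row[2], h))  # quarter-pel reuses the half-pel result
```

Because the quarter-pel value is derived from an already computed half-pel value, the expensive 6-tap filter runs once and the second step reduces to an add and a shift.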
