Publication


Featured research published by Seonggun Kim.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2012

Efficient SIMD code generation for irregular kernels

Seonggun Kim; Hwansoo Han

Array indirection poses several challenges for compilers seeking to exploit single instruction, multiple data (SIMD) instructions. Disjoint memory references, arbitrarily misaligned memory references, and dependence cycles in loops are the main challenges SIMD compilers must handle. Because of these challenges, existing SIMD compilers exclude loops with array indirection from their candidate loops for SIMD vectorization. Addressing these challenges is nevertheless unavoidable, since many important compute-intensive applications use array indirection extensively to reduce memory and computation requirements. In this work, we propose a method to generate efficient SIMD code for loops containing indirectly accessed memory references. We extract both inter- and intra-iteration parallelism while taking data reorganization overhead into account, and we optimally place data reorganization code so that the reorganization overhead is amortized by the performance gain of SIMD vectorization. Experiments on four array indirection kernels extracted from real-world scientific applications show that our method effectively generates SIMD code for irregular kernels with array indirection. Compared to existing SIMD vectorization methods, it improves the performance of irregular kernels by 91% on average.
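The kind of loop the paper targets, and the reorganize-then-vectorize idea behind it, can be illustrated with a small C sketch. This is an illustration only, not the authors' code generator; the block size BLK and the sparse-update kernel are assumptions made for the example.

```c
/* Minimal sketch of an irregular kernel and a "reorganize then vectorize"
 * variant: indirectly addressed operands are packed into a contiguous
 * buffer so the arithmetic loop becomes dense and unit-stride. */
#include <stddef.h>

/* Baseline irregular kernel: x[col[i]] defeats unit-stride SIMD. */
void irregular_scalar(double *y, const double *a, const double *x,
                      const int *col, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] += a[i] * x[col[i]];
}

#define BLK 256   /* block size: a tuning assumption */

/* Reorganization-based variant: gather first (the overhead), then run a
 * dense loop that an ordinary vectorizer can handle. */
void irregular_reorganized(double *y, const double *a, const double *x,
                           const int *col, size_t n) {
    double xbuf[BLK];
    for (size_t base = 0; base < n; base += BLK) {
        size_t len = (n - base < BLK) ? n - base : BLK;
        for (size_t i = 0; i < len; i++)        /* gather (scalar) */
            xbuf[i] = x[col[base + i]];
        for (size_t i = 0; i < len; i++)        /* dense, vectorizable */
            y[base + i] += a[base + i] * xbuf[i];
    }
}
```

The gather loop is the data reorganization overhead that the paper's placement strategy tries to amortize against the SIMD speedup of the dense loop.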


IEEE Transactions on Consumer Electronics | 2009

Distributed execution for resource-constrained mobile consumer devices

Seonggun Kim; Heungsoon Rim; Hwansoo Han

Mobile consumer devices play increasingly important roles, interacting with users more closely and personally. As users grow accustomed to mobile devices, they often want the same level of computing experience they have on desktop PCs, but in small and light form factors. With current technology, the processor and memory limitations of mobile devices are still too severe to satisfy demanding users. To alleviate these resource limitations, many researchers have explored techniques to share the resources of powerful surrogate servers nearby. In this line of research, we propose slim execution as an effective mobile computing paradigm. To experimentally verify our execution model, we developed a code transformation tool, the distributed execution transformer (DiET). DiET takes original Java bytecode and replaces the bodies of heavy methods with remote procedure calls to surrogate servers. Since the modified bytecode is still legal Java bytecode, mobile devices can download and run it on standard JVMs, cooperating with surrogate servers. Our experiments with SciMark 2.0 show that our distributed execution scheme reduces execution time by up to 71%.
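A rough C sketch of the shape of this transformation follows. DiET itself rewrites Java bytecode, and surrogate_available/surrogate_call below are hypothetical placeholders for the marshalling layer, not a real API.

```c
/* Sketch: the heavy body is replaced by a stub that forwards the call to
 * a surrogate server when one is reachable, with a local fallback. */
#include <stdbool.h>
#include <stddef.h>

double heavy_kernel_body(const double *data, size_t n) {   /* original body */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += data[i] * data[i];
    return s;
}

bool   surrogate_available(void);                                  /* hypothetical */
double surrogate_call(const char *name, const double *d, size_t n); /* hypothetical */

double heavy_kernel(const double *data, size_t n) {
    if (surrogate_available())                 /* offload to the surrogate server */
        return surrogate_call("heavy_kernel", data, n);
    return heavy_kernel_body(data, n);         /* run locally on the device */
}
```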


The Journal of Supercomputing | 2011

Region-based parallelization of irregular reductions on explicitly managed memory hierarchies

Seonggun Kim; Hwansoo Han; Kwang-Moo Choe

Multicore architectures are evolving with the promise of extreme performance for classes of applications that require high performance and high memory bandwidth. Irregular reduction is an important computation pattern in many complex scientific applications, and it typically demands both. In this article, we propose region-based parallelization techniques for irregular reductions on multicore architectures with explicitly managed memory hierarchies. Managing the memory hierarchy in software requires substantial programming effort and tends to be error-prone, and the difficulty is even greater for applications with irregular data access patterns. To relieve programmers of this burden, we develop abstractions, targeted specifically at irregular reduction, for structuring parallel tasks, mapping them to processing units, and scheduling data transfers across the memory hierarchy. Our framework employs iteration reordering based on regions of data together with dynamic scheduling of parallel tasks. We experimentally evaluate the effectiveness of our techniques on irregular reduction kernels on the Cell processor in a Sony PlayStation 3. Experimental results show speedups of 8 to 14 on the six available SPEs.
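The core pattern, an irregular reduction with region-based iteration reordering, can be sketched in plain C. This is a simplified illustration under assumed names (REGION, the x[idx[i]] += w[i] kernel), not the paper's framework:

```c
/* Bucket iterations by the region of data they touch (counting sort),
 * then reduce one region at a time so that region could be staged into
 * a small local memory, e.g. an SPE local store on Cell. */
#include <stdlib.h>
#include <string.h>

#define REGION 1024        /* elements of x per region (tuning knob) */

void irregular_reduction_by_region(double *x, size_t nx,
                                   const int *idx, const double *w, size_t n) {
    size_t nregions = (nx + REGION - 1) / REGION;

    size_t *cnt   = calloc(nregions + 1, sizeof *cnt);
    size_t *order = malloc(n * sizeof *order);
    size_t *next  = malloc(nregions * sizeof *next);

    for (size_t i = 0; i < n; i++) cnt[idx[i] / REGION + 1]++;
    for (size_t r = 1; r <= nregions; r++) cnt[r] += cnt[r - 1];
    memcpy(next, cnt, nregions * sizeof *next);
    for (size_t i = 0; i < n; i++) order[next[idx[i] / REGION]++] = i;

    /* All updates to region r are now consecutive in 'order'. */
    for (size_t r = 0; r < nregions; r++)
        for (size_t k = cnt[r]; k < cnt[r + 1]; k++)
            x[idx[order[k]]] += w[order[k]];

    free(cnt); free(order); free(next);
}
```

Because each region's updates are contiguous after reordering, that slice of x can be transferred to local memory, reduced there, and written back, which is the data-transfer scheduling problem the abstractions address.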


Design, Automation and Test in Europe | 2006

Restructuring Field Layouts for Embedded Memory Systems

Keoncheol Shin; Jung-Eun Kim; Seonggun Kim; Hwansoo Han

In many computer systems with large data computations, memory access latency is one of the major performance bottlenecks. In this paper, we propose an enhanced field remapping scheme for dynamically allocated structures that provides better locality than conventional field layouts. Our scheme drastically reduces cache miss rates by aggregating and grouping fields from multiple instances of the same structure, which leads to improved performance and reduced power consumption. This methodology becomes more important in design space exploration, especially as embedded systems for data-oriented applications become prevalent. Experimental results show that average L1 and L2 data cache misses are reduced by 23% and 17%, respectively. Due to the enhanced locality, our remapping achieves 13% faster execution time on average than the original programs, and it reduces data cache power consumption by 18%.
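The underlying idea, grouping the same field from many instances so that hot fields share cache lines, can be shown with a small C sketch. The layout below is hypothetical and ignores the paper's handling of dynamically allocated instances; it only illustrates the locality effect.

```c
/* Conventional layout vs. a field-grouped ("remapped") layout. */
#include <stddef.h>

struct particle_aos {        /* conventional: hot and cold fields mixed */
    double x, y, z;          /* hot: read every step                    */
    char   name[48];         /* cold: rarely touched                    */
};

#define MAXP 4096
struct particle_pool {       /* remapped: one pool per field group      */
    double x[MAXP], y[MAXP], z[MAXP];   /* hot fields packed together   */
    char   name[MAXP][48];              /* cold fields kept out of the way */
};

double sum_x(const struct particle_pool *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];        /* consecutive x's share cache lines       */
    return s;
}
```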


Research in Applied Computation Symposium | 2011

Accelerating loops for coarse grained reconfigurable architectures using instruction extensions

Jungsik Choi; Seonggun Kim; Hwansoo Han

Aggressive embedded processors are often equipped with general-purpose cores and special-purpose acceleration logic. In this paper, we consider a reconfigurable processor that consists of very long instruction word (VLIW) cores and coarse-grained reconfigurable arrays (CGRAs). CGRAs enhance performance by exploiting loop parallelism, while VLIW cores rely on discovering instruction-level parallelism. CGRAs can accelerate time-consuming loops with powerful pipeline scheduling, but not all loops qualify: outer loops and loops containing function calls cannot be candidates for CGRA acceleration. We adopt instruction extensions to convert code fragments in outer loops and simple functions into single instructions. With these extended instructions, more loops can be accelerated on CGRAs. Our experiment with mpeg2dec from MediaBench shows a 32% performance increase.
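The following C sketch is purely conceptual, since the actual work is a compiler and ISA change rather than a source-level one: an inner function call, which would otherwise keep the loop off the CGRA, is modeled as a single extended instruction (emulated here by an inline helper) so the loop body becomes a flat operation sequence a CGRA scheduler could pipeline.

```c
/* Stand-in for a hypothetical extended instruction (saturating add). */
#include <stdint.h>
#include <stddef.h>

static inline uint8_t ext_sat_add_u8(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 255 ? 255 : s);
}

void blend_rows(uint8_t *dst, const uint8_t *src, size_t w, size_t h) {
    for (size_t y = 0; y < h; y++)         /* outer loop                      */
        for (size_t x = 0; x < w; x++)     /* inner loop: no calls remaining  */
            dst[y * w + x] = ext_sat_add_u8(dst[y * w + x], src[y * w + x]);
}
```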


IEEE Transactions on Consumer Electronics | 2010

Efficient reuse of local regions in memory-limited mobile devices

Seonggun Kim; Taein Kim; Eul Gyu Im; Hwansoo Han

Much research aims to improve memory management for performance, efficiency, ease of use, and safety. Region-based memory management, a recently investigated technique for memory-limited mobile devices, splits the heap into one global (persistent) region and multiple local regions, one local region per method invocation. Each object allocation is initially assigned to a local region and later transferred to the global region if needed. The memory allocated for a local region is implicitly reclaimed when the associated method call finishes. In this paper, we propose a technique to reduce heap memory usage in memory-limited devices by reusing early local regions in the calling sequence, as they are rarely accessed during the current method. Our experiment with SPECjvm98 shows up to a 9% reduction in heap memory usage.
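A toy C sketch of the region model described above (assumed layout and helper names, not the paper's runtime): each method invocation gets a local region that is reclaimed in O(1) at return, and an object that must outlive the call is copied to the global region.

```c
#include <stdlib.h>
#include <string.h>

enum { HEAP_SZ = 1 << 20 };
static char   heap[HEAP_SZ];
static size_t top = 0;                      /* bump pointer for local regions */

typedef struct { size_t mark; } region_t;   /* one per method invocation      */

region_t region_enter(void)      { return (region_t){ top }; }
void     region_exit(region_t r) { top = r.mark; }           /* O(1) reclaim  */

void *region_alloc(size_t sz) {
    sz = (sz + 7) & ~(size_t)7;             /* 8-byte alignment               */
    if (top + sz > HEAP_SZ) return NULL;    /* toy allocator: no growth       */
    void *p = &heap[top];
    top += sz;
    return p;
}

void *promote_to_global(const void *obj, size_t sz) {        /* escapes call  */
    void *g = malloc(sz);
    return g ? memcpy(g, obj, sz) : NULL;
}

void callee(void) {
    region_t r = region_enter();
    double *tmp = region_alloc(64 * sizeof *tmp);   /* dies with the call     */
    (void)tmp;
    region_exit(r);
}
```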


ACM Transactions on Embedded Computing Systems | 2013

Detection of harmful schizophrenic statements in Esterel

Jeong-Han Yun; Chul-Joo Kim; Seonggun Kim; Kwang-Moo Choe; Taisook Han

In imperative synchronous languages, a statement is called schizophrenic if it is executed more than once in a single clock instant. When a schizophrenic statement is translated into a circuit, the circuit can behave abnormally because of the multiple executions. To solve the problems caused by schizophrenic statements, compilers duplicate the statements to avoid multiple executions. Esterel is an imperative synchronous language; schizophrenic statements in Esterel arise from the instantaneous reentrance of local signal declarations or parallel statements. However, if the circuit corresponding to a schizophrenic statement behaves normally, the statement is harmless and curing is not necessary. In this paper, we identify the conditions under which a schizophrenic statement in an Esterel program must be cured during circuit translation. We also propose an algorithm that detects schizophrenic statements requiring a cure on the control flow graphs (CFGs) of source code. Our algorithm detects all schizophrenic statements that must be cured and produces fewer false alarms on the benchmark programs used in previous work. It is simple and based on the CFG of a program, so it can be integrated into existing compilers easily.


Journal of Information Science and Engineering | 2012

Enhancing Visual Rendering on Multicore Accelerators with Explicitly Managed Memories

Kyunghee Cho; Seonggun Kim; Hwansoo Han

Recent electronic devices are equipped with processors extended with multicore accelerators to exploit the performance of acceleration co-processors. Such high-end products are expected to run graphics-rich applications. Scalable acceleration co-processors are frequently designed as multicores with explicitly managed memories, and such architectures require sophisticated data management between the main memory and the local memories to fully exploit their potential performance. Ray tracing is a high-quality rendering algorithm in computer graphics and offers abundant parallelism to exploit. On explicitly managed memory hierarchies, however, ray tracing with complex data structures tends to suffer from irregular memory accesses and inefficient data management. Compared to other acceleration structures for ray tracing, the grid structure is simple to manage but is commonly regarded as too slow. Recent improvements to grid structures with SIMD optimizations, however, show performance comparable to the kd-tree, one of the fastest acceleration structures. We introduce a grid-based parallel ray tracer on a processor with a multicore accelerator. We adopt SIMD optimizations and double buffering to enhance the performance of the grid-based ray tracer, and we propose a macrocell structure over the grid to fully exploit the memory bandwidth. In our experiments, our ray tracing scheme shows performance comparable to a BVH-based ray tracer.
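A simplified C sketch of the macrocell idea (illustrative constants and names only, not the paper's implementation): a coarse occupancy map over MxMxM blocks of fine grid cells lets traversal skip fetching cell contents under empty macrocells, which saves DMA traffic on an explicitly managed memory hierarchy.

```c
#include <stdbool.h>

#define NX 64                /* fine grid resolution (assumed)    */
#define M  4                 /* fine cells per macrocell per axis */
#define MX (NX / M)

/* macro_occupied[mx][my][mz] is true if any fine cell under that
 * macrocell contains primitives. */
bool macro_occupied[MX][MX][MX];

void build_macrocells(const unsigned short cell_count[NX][NX][NX]) {
    for (int x = 0; x < NX; x++)
        for (int y = 0; y < NX; y++)
            for (int z = 0; z < NX; z++)
                if (cell_count[x][y][z])
                    macro_occupied[x / M][y / M][z / M] = true;
}

/* During traversal, only DMA-fetch a fine cell's primitive list if its
 * macrocell is occupied; empty macrocells are skipped entirely. */
bool should_fetch_cell(int x, int y, int z) {
    return macro_occupied[x / M][y / M][z / M];
}
```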


International Symposium on Wireless Pervasive Computing | 2006

Transparent Method Offloading for Slim Execution

Heungsoon Rim; Seonggun Kim; Youil Kim; Hwansoo Han


Archive | 2014

Apparatus and method for executing code

Jin-Seok Lee; Seonggun Kim; Dong-hoon Yoo; Seokjoong Hwang
