Yunheung Paek
Seoul National University
Publications
Featured research published by Yunheung Paek.
Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture | 2007
Lynn Choi; Yunheung Paek; Sangyeun Cho
Contents of the proceedings:
- A Compiler Framework for Supporting Speculative Multicore Processors
- Power-Efficient Heterogeneous Multicore Technology for Digital Convergence
- StarDBT: An Efficient Multi-platform Dynamic Binary Translation System
- Unbiased Branches: An Open Problem
- An Online Profile Guided Optimization Approach for Speculative Parallel Threading
- Entropy-Based Profile Characterization and Classification for Automatic Profile Management
- Laplace Transformation on the FT64 Stream Processor
- Towards Data Tiling for Whole Programs in Scratchpad Memory Allocation
- Evolution of NAND Flash Memory Interface
- FCC-SDP: A Fast Close-Coupled Shared Data Pool for Multi-core DSPs
- Exploiting Single-Usage for Effective Memory Management
- An Alternative Organization of Defect Map for Defect-Resilient Embedded On-Chip Memories
- An Effective Design of Master-Slave Operating System Architecture for Multiprocessor Embedded Systems
- Optimal Placement of Frequently Accessed IPs in Mesh NoCs
- An Efficient Link Controller for Test Access to IP Core-Based Embedded System Chips
- Performance of Keyword Connection Algorithm in Nested Mobility Networks
- Leakage Energy Reduction in Cache Memory by Software Self-invalidation
- Exploiting Task Temperature Profiling in Temperature-Aware Task Scheduling for Computational Clusters
- Runtime Performance Projection Model for Dynamic Power Management
- A Power-Aware Alternative for the Perceptron Branch Predictor
- Power Consumption and Performance Analysis of 3D NoCs
- A Design Methodology for Performance-Resource Optimization of a Generalized 2D Convolution Architecture with Quadrant Symmetric Kernels
- Bipartition Architecture for Low Power JPEG Huffman Decoder
- A SWP Specification for Sequential Image Processing Algorithms
- A Stream System-on-Chip Architecture for High Speed Target Recognition Based on Biologic Vision
- FPGA-Accelerated Active Shape Model for Real-Time People Tracking
- Performance Evaluation of Evolutionary Multi-core and Aggressively Multi-threaded Processor Architectures
- Synchronization Mechanisms on Modern Multi-core Architectures
- Concerning with On-Chip Network Features to Improve Cache Coherence Protocols for CMPs
- Generalized Wormhole Switching: A New Fault-Tolerant Mathematical Model for Adaptively Wormhole-Routed Interconnect Networks
- Open Issues in MPI Implementation
- Implicit Transactional Memory in Kilo-Instruction Multiprocessors
- Design of a Low-Power Embedded Processor Architecture Using Asynchronous Function Units
- A Bypass Mechanism to Enhance Branch Predictor for SMT Processors
- Thread Priority-Aware Random Replacement in TLBs for a High-Performance Real-Time SMT Processor
- Architectural Solution to Object-Oriented Programming
languages compilers and tools for embedded systems | 2003
Prasad A. Kulkarni; Wankang Zhao; Hwashin Moon; Kyunghwan Cho; David B. Whalley; Jack W. Davidson; Mark W. Bailey; Yunheung Paek; Kyle A. Gallivan
It has long been known that a single ordering of optimization phases will not produce the best code for every application. This phase ordering problem can be more severe when generating code for embedded systems due to the need to meet conflicting constraints on time, code size, and power consumption. Given that many embedded application developers are willing to spend time tuning an application, we believe a viable approach is to allow the developer to steer the process of optimizing a function. In this paper, we describe support in VISTA, an interactive compilation system, for finding effective sequences of optimization phases. VISTA provides the user with dynamic and static performance information that can be used during an interactive compilation session to gauge the progress of improving the code. In addition, VISTA provides support for automatically using performance information to select the best optimization sequence among several attempted. One such feature is the use of a genetic algorithm to search for the most efficient sequence based on specified fitness criteria. We have included a number of experimental results that evaluate the effectiveness of using a genetic algorithm in VISTA to find effective optimization phase sequences.
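To make the search strategy concrete, the sketch below shows how a genetic algorithm can evolve optimization phase orderings. It is a minimal illustration only: the phase names, the fitness stand-in, and the GA parameters are hypothetical, and a real fitness evaluation in a system like VISTA would compile the function and measure dynamic and static performance instead.

```python
import random

# Hypothetical optimization phases; VISTA's real phase set differs.
PHASES = ["const_prop", "dead_code_elim", "loop_unroll",
          "register_alloc", "strength_reduce", "common_subexpr"]

def evaluate(sequence):
    """Placeholder fitness: in a real system this would apply the phases
    and measure dynamic instruction count and/or code size."""
    rng = random.Random(hash(tuple(sequence)))  # deterministic stand-in
    return rng.uniform(0.5, 1.5)                # lower is better

def crossover(a, b):
    """Single-point crossover of two phase sequences."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(seq, rate=0.1):
    """Randomly replace phases with probability `rate`."""
    return [random.choice(PHASES) if random.random() < rate else p
            for p in seq]

def search(seq_len=8, pop_size=20, generations=50):
    pop = [[random.choice(PHASES) for _ in range(seq_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=evaluate)               # best (lowest cost) first
        survivors = pop[:pop_size // 2]      # truncation selection
        children = [mutate(crossover(random.choice(survivors),
                                     random.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return min(pop, key=evaluate)

if __name__ == "__main__":
    print("best sequence found:", search())
```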
ACM Transactions on Programming Languages and Systems | 2002
Yunheung Paek; Jay Hoeflinger; David A. Padua
A number of existing compiler techniques hinge on the analysis of array accesses in a program. The most important task in array access analysis is to collect the information about array accesses of interest and summarize it in some standard form. Traditional forms used in array access analysis are sensitive to the complexity of array subscripts; that is, they are usually quite accurate and efficient for simple array subscripting expressions, but lose accuracy or require potentially expensive algorithms for complex subscripts. Our study has revealed that in many programs, particularly numerical applications, many access patterns are simple in nature even when the subscripting expressions are complex. Based on this analysis, we have developed a new, general array region representational form, called the linear memory access descriptor (LMAD). The key idea of the LMAD is to relate all memory accesses to the linear machine memory rather than to the shape of the logical data structures of a programming language. This form helps us expose the simplicity of the actual patterns of array accesses in memory, which may be hidden by complex array subscript expressions. Our recent experimental studies show that our new representation simplifies array access analysis and, thus, enables efficient and accurate compiler analysis.
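The central idea, describing accesses by strides over linear memory rather than by source-level subscripts, can be pictured with the simplified sketch below. The (stride, count) tuples are a reduced reading of the LMAD, not the paper's full formalism, and the array shapes and loop bounds are made up for illustration.

```python
from itertools import product

class LMAD:
    """Simplified linear memory access descriptor: a base offset plus,
    for each loop dimension, a stride and an iteration count."""
    def __init__(self, base, dims):
        self.base = base          # starting offset in linear memory
        self.dims = dims          # list of (stride, count) pairs

    def offsets(self):
        """Enumerate every linear offset the descriptor covers."""
        ranges = [range(count) for _, count in self.dims]
        for idx in product(*ranges):
            yield self.base + sum(stride * i
                                  for (stride, _), i in zip(self.dims, idx))

# Row-major traversal of a 100x100 array A[i][j]:
# the outer loop over i contributes stride 100, the inner loop over j stride 1.
whole = LMAD(base=0, dims=[(100, 100), (1, 100)])
offs = list(whole.offsets())
print(len(offs), min(offs), max(offs))   # 10000 0 9999: one dense region

# Access A[i][5] for i in 0..99: a single-stride pattern in memory,
# despite the two-dimensional source-level subscript.
column = LMAD(base=5, dims=[(100, 100)])
print(list(column.offsets())[:3])        # [5, 105, 205]
```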
programming language design and implementation | 1998
Yunheung Paek; Jay Hoeflinger; David A. Padua
Existing array region representation techniques are sensitive to the complexity of array subscripts. In general, these techniques are very accurate and efficient for simple subscript expressions, but lose accuracy or require potentially expensive algorithms for complex subscripts. We found that in scientific applications, many access patterns are simple even when the subscript expressions are complex. In this work, we present a new, general array access representation and define operations for it. This allows us to aggregate and simplify the representation enough that precise region operations may be applied to enable compiler optimizations. Our experiments show that these techniques hold promise for speeding up applications.
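One simplification such a representation enables is coalescing adjacent dimensions when an inner dimension exactly tiles the outer stride, so that a multi-dimensional access collapses into a single dense region. The helper below is a hypothetical illustration of that idea, reusing the (stride, count) dimension lists from the previous sketch; it is not the paper's actual aggregation algorithm.

```python
def coalesce(dims):
    """Merge (stride_outer, n_outer) with (stride_inner, n_inner) whenever
    the inner dimension exactly fills one outer step, i.e.
    stride_inner * n_inner == stride_outer."""
    dims = sorted(dims, key=lambda d: d[0], reverse=True)  # outermost first
    merged = [dims[0]]
    for stride, count in dims[1:]:
        out_stride, out_count = merged[-1]
        if stride * count == out_stride:
            merged[-1] = (stride, out_count * count)  # one dense dimension
        else:
            merged.append((stride, count))
    return merged

# A[4*i + j] for i in 0..24, j in 0..3 touches offsets 0..99 contiguously,
# so the two dimensions collapse into a single stride-1 region.
print(coalesce([(4, 25), (1, 4)]))   # [(1, 100)]
```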
ACM Transactions on Design Automation of Electronic Systems | 2008
Seongnam Kwon; Yongjoo Kim; Woo-Chul Jeun; Soonhoi Ha; Yunheung Paek
As more processing elements are integrated on a single chip, embedded software design becomes more challenging: it amounts to parallel programming for nontrivial heterogeneous multiprocessors with diverse communication architectures, under design constraints such as hardware cost, power, and timeliness. In the current practice of parallel programming with MPI or OpenMP, the programmer must manually optimize the parallel code for each target architecture and set of design constraints. As a result, the cost of design-space exploration for MPSoCs (multiprocessor systems-on-chip) becomes prohibitively large as software development overhead grows drastically. To solve this problem, we develop a parallel-programming framework based on a novel programming model called common intermediate code (CIC). In CIC, the functional parallelism and data parallelism of application tasks are specified independently of the target architecture and design constraints. A CIC translator then translates the CIC into the final parallel code, taking the target architecture and design constraints into account, which makes the CIC retargetable. Experiments with preliminary examples, including an H.263 decoder, show that the proposed parallel-programming framework significantly increases the design productivity of MPSoC software.
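The separation the framework aims at, specifying a task's parallelism once and letting a translator specialize it per target, can be sketched abstractly. The task description and the two toy backends below are hypothetical illustrations of the idea; they are not the CIC syntax or the actual CIC translator.

```python
from concurrent.futures import ThreadPoolExecutor

# A target-independent task: what to compute and whether it may be split,
# with no mention of threads, cores, or message passing.
task = {
    "name": "idct_rows",                              # hypothetical task name
    "kernel": lambda chunk: [x * 2 for x in chunk],   # stand-in computation
    "data": list(range(16)),
    "splittable": True,                               # data parallelism allowed
}

def translate_shared_memory(task, num_workers):
    """Toy 'backend' for a shared-memory target: a thread pool over chunks."""
    n = num_workers if task["splittable"] else 1
    chunks = [task["data"][i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(task["kernel"], chunks))

def translate_sequential(task):
    """Toy 'backend' for a single-core target: no parallel runtime at all."""
    return [task["kernel"](task["data"])]

# The same task specification is specialized for two different targets.
print(translate_shared_memory(task, 4))
print(translate_sequential(task))
```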
international symposium on low power electronics and design | 2006
Yoon-Jin Kim; Il-hyun Park; Kiyoung Choi; Yunheung Paek
Coarse-grained reconfigurable architectures aim to achieve both performance and flexibility. However, power consumption is no less important if a reconfigurable architecture is to serve as a competitive processing core in embedded systems. In this paper, we show how power is consumed in a typical coarse-grained reconfigurable architecture. Based on the power breakdown data, we propose a power-conscious configuration cache structure and code mapping technique that reduce power consumption without performance degradation. Experimental results show that the proposed approach saves substantial power even with a reduced configuration cache size.
computer and communications security | 2012
Hyungon Moon; Hojoon Lee; Jihoon Lee; Kihwan Kim; Yunheung Paek; Brent ByungHoon Kang
In this paper, we present the Vigilare system, a kernel integrity monitor architected to snoop the bus traffic of the host system from separate, independent hardware. This snoop-based monitoring, enabled by the Vigilare system, overcomes the limitations of the snapshot-based monitoring employed in previous kernel integrity monitoring solutions. Because they inspect snapshots collected at fixed intervals, previous hardware-based monitoring solutions cannot detect transient attacks that occur between snapshots. We implemented a prototype of the Vigilare system on Gaisler's GRLIB-based system-on-a-chip (SoC) by adding a Snooper hardware module to the host system for bus snooping. To evaluate the benefit of snoop-based monitoring, we also implemented a similar SoC with a snapshot-based monitor for comparison. The Vigilare system detected all the transient attacks without performance degradation, while the snapshot-based monitor could not detect all the attacks and induced considerable performance degradation, as much as 10% in our tuned STREAM benchmark test.
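The gap between snapshot-based and snoop-based monitoring that Vigilare exploits can be modeled with a toy example: a transient attack that patches kernel memory and restores it before the next periodic snapshot is invisible to a snapshot monitor but visible to one that observes every write. The classes, timings, and byte patterns below are made up for illustration and do not model the actual hardware.

```python
# Toy model: a transient rootkit patches a kernel region and restores it
# before the next periodic snapshot, so only a write-snooping monitor sees it.

GOOD = b"\x90" * 8                      # pristine kernel text (stand-in)

class Kernel:
    def __init__(self):
        self.text = bytearray(GOOD)
        self.snoopers = []
    def write(self, offset, data):      # every bus write is observable
        for s in self.snoopers:
            s.on_write(offset, data)
        self.text[offset:offset + len(data)] = data

class SnapshotMonitor:
    def __init__(self):
        self.violations = 0
    def check(self, kernel):            # periodic whole-region comparison
        if bytes(kernel.text) != GOOD:
            self.violations += 1

class SnoopMonitor:
    def __init__(self):
        self.violations = 0
    def on_write(self, offset, data):   # sees the write itself
        if data != GOOD[offset:offset + len(data)]:
            self.violations += 1

kernel, snap, snoop = Kernel(), SnapshotMonitor(), SnoopMonitor()
kernel.snoopers.append(snoop)

snap.check(kernel)                      # snapshot before the attack: clean
kernel.write(0, b"\xeb\xfe")            # transient patch (e.g., a hook)
kernel.write(0, GOOD[:2])               # attacker restores the original bytes
snap.check(kernel)                      # snapshot after: looks clean again

print("snapshot monitor violations:", snap.violations)   # 0 -> missed
print("snoop monitor violations:   ", snoop.violations)  # 1 -> detected
```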
design, automation, and test in europe | 2006
Minwook Ahn; Jonghee W. Yoon; Yunheung Paek; Yoon-Jin Kim; Mary Kiemb; Kiyoung Choi
In this work, we investigate the problem of automatically mapping applications onto a coarse-grained reconfigurable architecture and propose an efficient algorithm to solve it. We formalize the mapping problem and show that it is NP-complete. To solve the problem within a reasonable amount of time, we divide it into three subproblems: covering, partitioning, and layout. Our empirical results demonstrate that our technique produces performance nearly as good as hand-optimized outputs for many kernels.
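The three-subproblem decomposition can be pictured as a small pipeline. The function names and the trivial heuristics below are placeholders showing how covering, partitioning, and layout hand results to one another; they are not the algorithm from the paper.

```python
def cover(dfg_ops, patterns):
    """Covering: group primitive operations into units a PE can execute.
    Here: greedily fuse adjacent op pairs listed in `patterns` (hypothetical)."""
    covered, i = [], 0
    while i < len(dfg_ops):
        pair = tuple(dfg_ops[i:i + 2])
        if pair in patterns:
            covered.append(pair); i += 2
        else:
            covered.append((dfg_ops[i],)); i += 1
    return covered

def partition(units, row_width):
    """Partitioning: split covered units into groups that fit one CGRA row."""
    return [units[i:i + row_width] for i in range(0, len(units), row_width)]

def layout(rows):
    """Layout: assign each unit in each group to a concrete (row, column) PE."""
    return {(r, c): unit
            for r, group in enumerate(rows)
            for c, unit in enumerate(group)}

ops = ["ld", "mul", "add", "st", "ld", "add"]
placement = layout(partition(cover(ops, {("mul", "add")}), row_width=2))
for pe, unit in sorted(placement.items()):
    print(pe, unit)
```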
asia and south pacific design automation conference | 2008
Jonghee W. Yoon; Aviral Shrivastava; Sang-Hyun Park; Minwook Ahn; Yunheung Paek
Recently, coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their efficiency and flexibility. While many CGRAs have demonstrated impressive performance improvements, the effectiveness of CGRA platforms ultimately hinges on the compiler. Existing CGRA compilers do not model the details of the CGRA architecture, and as a result they i) are unable to map some applications even though a mapping exists, and ii) use too many PEs to map an application. In this paper, we model several CGRA details in our compiler and develop a graph-mapping-based approach (SPKM) for mapping applications onto CGRAs. On randomly generated graphs our technique can map on average 4.5 times more applications than previous approaches, while using fewer CGRA rows 62% of the time, without any penalty in mapping time. We observe similar results on a suite of benchmarks collected from the Livermore Loops, Multimedia, and DSPStone benchmarks.
IEEE Transactions on Very Large Scale Integration Systems | 2009
Jonghee W. Yoon; Aviral Shrivastava; Sang-Hyun Park; Minwook Ahn; Yunheung Paek
Recently, coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their efficiency and flexibility. While many CGRAs have demonstrated impressive performance improvements, the effectiveness of CGRA platforms ultimately hinges on the compiler. Existing CGRA compilers do not model the details of the CGRA, and thus they i) are unable to map some applications even though a mapping exists, and ii) use too many processing elements (PEs) to map an application. In this paper, we model several CGRA details, e.g., irregular CGRA topologies, shared resources, and routing PEs, in our compiler and develop a graph-drawing-based approach, split-push kernel mapping (SPKM), for mapping applications onto CGRAs. On randomly generated graphs our technique can map on average 4.5 times more applications than the previous approach, while generating mappings of better quality in terms of utilized CGRA resources. Using fewer resources translates directly into increased opportunities for novel power and performance optimization techniques. Our technique shows lower power consumption in 71 cases and shorter execution cycles in 66 cases out of 100 synthetic applications, with minimal mapping-time overhead. We observe similar results on a suite of benchmarks collected from the Livermore Loops, Mediabench, Multimedia, Wavelet, and DSPStone benchmarks. SPKM is not an algorithm customized to a specific CGRA template, which we demonstrate by exploring various PE interconnection topologies and shared-resource configurations with SPKM.
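One architecture detail such a compiler must model, forwarding a value through intermediate PEs when two mapped PEs are not directly connected, can be sketched on a small mesh. The 4-neighbor mesh and the BFS routing below are an illustrative simplification under an assumed topology and occupancy, not SPKM itself.

```python
from collections import deque

def neighbors(pe, rows, cols):
    """4-neighbor mesh interconnect (a common, but not the only, CGRA topology)."""
    r, c = pe
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield nr, nc

def route(src, dst, busy, rows=4, cols=4):
    """Find free PEs to forward a value from src to dst (BFS over the mesh).
    Returns the chain of routing PEs, or None if no path exists."""
    queue, seen = deque([(src, [])]), {src}
    while queue:
        pe, path = queue.popleft()
        if pe == dst:
            return path[:-1] if path else []     # drop dst itself
        for nxt in neighbors(pe, rows, cols):
            if nxt in seen or (nxt in busy and nxt != dst):
                continue
            seen.add(nxt)
            queue.append((nxt, path + [nxt]))
    return None

# Two operations mapped to non-adjacent PEs; (1, 1) is already occupied,
# so the value must detour through free PEs acting as pass-throughs.
print(route(src=(0, 0), dst=(2, 2), busy={(1, 1)}))
```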