Jinpyo Park | Researchain

Publication

Featured research published by Jinpyo Park.

international conference on parallel architectures and compilation techniques | 1999

LaTTe: a Java VM just-in-time compiler with fast and efficient register allocation

Byung-Sun Yang; Soo-Mook Moon; Seong-Bae Park; Junpyo Lee; Seungil Lee; Jinpyo Park; Yoo C. Chung; Suhyun Kim; Kemal Ebcioglu; Erik R. Altman

For network computing on desktop machines, fast execution of Java bytecode programs is essential because these machines are expected to run substantial application programs written in Java. Higher Java performance can be achieved by just-in-time (JIT) compilers which translate the stack-based bytecode into register-based machine code on demand. One crucial problem in Java JIT compilation is how to map and allocate stack entries and local variables into registers efficiently and quickly, so as to improve the Java performance. This paper introduces LaTTe, a Java JIT compiler that performs fast and efficient register mapping and allocation for RISC machines. LaTTe first translates the bytecode into pseudo RISC code with symbolic registers, which is then register allocated while coalescing those copies corresponding to pushes and pops between local variables and the stack. The LaTTe JVM also includes an enhanced object model, a lightweight monitor, a fast mark-and-sweep garbage collector, and an on-demand exception handling mechanism, all of which are closely coordinated with LaTTe's JIT compilation.

international conference on hardware/software codesign and system synthesis | 2013

Learning the optimal operating point for many-core systems with extended range voltage/frequency scaling

Da Cheng Juan; Siddharth Garg; Jinpyo Park; Diana Marculescu

Near-Threshold Computing (NTC) has emerged as a solution that promises to significantly increase the energy efficiency of next-generation multi-core systems. This paper evaluates and analyzes the behavior of dynamic voltage and frequency scaling (DVFS) control algorithms for multi-core systems operating under near-threshold, nominal, or turbo-mode conditions. We adapt the model selection technique from machine learning to learn the relationship between performance and power. The theoretical results show that the resulting models satisfy convexity properties essential to efficiently determining optimal voltage/frequency operating points for minimizing energy consumption under throughput constraints or maximizing throughput under a given power budget. Our experimental results show that, compared with DVFS in the conventional operating range, extended range DVFS control including turbo-mode and near-threshold operation achieves an additional (1) 13.28% average energy reduction under iso-performance conditions, and (2) 7.54% average throughput increase under iso-power conditions.
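
The two objectives named in the abstract above (minimum energy under a throughput constraint, maximum throughput under a power budget) can be illustrated with a minimal Python sketch. It is not the paper's method: it does an exhaustive search over a small table of operating points rather than exploiting the convexity of a learned model, and the `OperatingPoint` type, the function names, and every number in `POINTS` are hypothetical values chosen only for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class OperatingPoint(NamedTuple):
    """One voltage/frequency setting (hypothetical fields and units)."""
    voltage_v: float
    frequency_ghz: float
    power_w: float
    throughput: float  # work units per second


def energy_per_work(point: OperatingPoint) -> float:
    """Energy spent per unit of work: power divided by throughput."""
    return point.power_w / point.throughput


def min_energy_point(points: Sequence[OperatingPoint],
                     min_throughput: float) -> Optional[OperatingPoint]:
    """Lowest energy-per-work point that still meets the throughput floor."""
    feasible = [p for p in points if p.throughput >= min_throughput]
    return min(feasible, key=energy_per_work, default=None)


def max_throughput_point(points: Sequence[OperatingPoint],
                         power_budget_w: float) -> Optional[OperatingPoint]:
    """Highest-throughput point whose power fits within the budget."""
    feasible = [p for p in points if p.power_w <= power_budget_w]
    return max(feasible, key=lambda p: p.throughput, default=None)


# Made-up illustrative table: near-threshold, nominal, and turbo settings.
POINTS = [
    OperatingPoint(0.5, 0.4, 0.2, 0.4),  # near-threshold
    OperatingPoint(0.9, 2.0, 2.0, 2.0),  # nominal
    OperatingPoint(1.1, 3.0, 4.5, 3.0),  # turbo
]
```

With this table, a loose throughput floor selects the near-threshold point, while a tighter floor forces the nominal or turbo point; both functions return `None` when no point is feasible.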

international conference on computer design | 2013

Dynamic thread mapping for high-performance, power-efficient heterogeneous many-core systems

Guangshuo Liu; Jinpyo Park; Diana Marculescu

This paper addresses the problem of dynamic thread mapping in heterogeneous many-core systems via an efficient algorithm that maximizes performance under power constraints. Heterogeneous many-core systems are composed of multiple core types with different power-performance characteristics. As well documented in the literature, the generic mapping problem is an NP-complete problem which can be formulated as a 0-1 integer linear program and is therefore prohibitively expensive to solve optimally in an online scenario. However, in real applications, thread mapping decisions need to be responsive to workload phase changes. This paper proposes an iterative approach bounding the runtime as O(n^2/m), for mapping multi-threaded applications on n cores comprising m core types. Compared with an optimal solution, the proposed algorithm produces results less than 0.6% away from optimum on average, with two orders of magnitude improvement in runtime. Results show that performance improvement can reach 16% under iso-power constraints compared to a random mapping. The algorithm can be brought online for hundred-core heterogeneous systems as it scales to systems comprised of 256 cores with less than one millisecond in overhead.
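
One common way to write a thread-to-core mapping problem as a 0-1 integer linear program is sketched below. This is a generic formulation under assumed notation, not necessarily the one used in the paper: $x_{ij}$ is 1 when thread $i$ runs on core $j$, $s_{ij}$ and $p_{ij}$ are the assumed performance and power of that pairing, $T$ is the number of threads, $n$ the number of cores, and $P_{\max}$ the power budget.

```latex
\begin{align}
\max_{x}\quad & \sum_{i=1}^{T}\sum_{j=1}^{n} s_{ij}\,x_{ij} \\
\text{s.t.}\quad
& \sum_{j=1}^{n} x_{ij} = 1 && \forall i \quad \text{(each thread runs on exactly one core)} \\
& \sum_{i=1}^{T} x_{ij} \le 1 && \forall j \quad \text{(each core hosts at most one thread)} \\
& \sum_{i=1}^{T}\sum_{j=1}^{n} p_{ij}\,x_{ij} \le P_{\max} && \text{(power constraint)} \\
& x_{ij} \in \{0,1\} && \forall i, j
\end{align}
```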

IEEE Transactions on Very Large Scale Integration Systems | 2012

Optimizing Video Application Design for Phase-Change RAM-Based Main Memory

Suknam Kwon; Sungjoo Yoo; Sunggu Lee; Jinpyo Park

Video applications including video codecs place a large traffic demand on main memory. Emerging memory technology, such as phase-change RAM (PRAM), tends to suffer from the write endurance problem, in which the maximum number of writes is limited. Thus, it is required to improve video application designs to adapt to the new requirements of emerging memory technology, i.e., to minimize the number of writes in terms of bit updates. In this paper, we present a way to optimize video application design for PRAM-based main memory. We propose two methods to resolve the write endurance problem: inter-block differential data encoding and inter-frame multiple experts. Experimental results show an average of 18.4% reduction in bit updates when compared to the best existing data encoding methods for PRAM.
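
The metric the abstract above optimizes, the number of bit updates per write, can be sketched in Python as the Hamming distance between a memory region's old and new contents. This only illustrates the cost measure, under the assumption that bits whose value does not change are not rewritten; it does not implement the paper's encoding methods, and the function name `count_bit_updates` is hypothetical.

```python
def count_bit_updates(old: bytes, new: bytes) -> int:
    """Count the bits that differ between old and new contents.

    Assumes a memory that only updates cells whose value changes,
    so the write cost equals the Hamming distance of the two buffers.
    """
    if len(old) != len(new):
        raise ValueError("old and new contents must have the same length")
    # XOR leaves a 1 in every bit position that changes; count those 1s.
    return sum(bin(a ^ b).count("1") for a, b in zip(old, new))
```

An encoding that makes successive writes to the same location more similar lowers this count, which is the quantity the reported 18.4% reduction refers to.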

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2015

Procrustes: Power Constrained Performance Improvement Using Extended Maximize-Then-Swap Algorithm

Guangshuo Liu; Jinpyo Park; Diana Marculescu

This paper proposes an efficient algorithm that maximizes performance under power constraints and is applicable in the general context of traditional dynamic voltage/frequency (V/F) scaling, or core heterogeneity and emerging dynamic micro-architectural adaptation. Performance maximization in these scenarios can be essentially viewed as mapping application threads to appropriate core states that have various power/performance characteristics. Such problems are formulated as a generic 0-1 integer linear program (ILP). The proposed algorithm is an iterative heuristic-based solution. Compared with an optimal solution generated by a commercial ILP solver, the proposed algorithm produces results less than 1% away from optimum on average, with more than two orders of magnitude improvement in runtime. The algorithm can be brought online for hundred-core heterogeneous systems as it scales to systems comprised of 256 cores with less than 1 ms in overhead in worst cases. The intrinsic history awareness also provides flexibility to control cost induced by switching V/F pairs, migrating threads across cores, or turning on/off micro-architectural resources.

ACM Sigarch Computer Architecture News | 1999

Lightweight monitor for Java VM

Byung-Sun Yang; Junpyo Lee; Jinpyo Park; Soo-Mook Moon; Kemal Ebcioglu; Erik R. Altman

This paper introduces the lightweight monitor in Java VM that is fast on single-threaded programs as well as on multi-threaded programs with little lock contention. A 32-bit lock is embedded into each object for efficient access while the lock queue and the wait set are managed through a hash table. The lock manipulation code is highly optimized and inlined by our Java VM JIT compiler called LaTTe wherever the lock is accessed. In most cases, only 9 SPARC instructions are spent for lock acquisition and 5 instructions for lock release. Our experimental results indicate that the lightweight monitor is faster than the monitor in the latest SUN JDK 1.2 Release Candidate 1 by up to 21 times in the absence of lock contention and by up to 7 times in the presence of lock contention.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2016

Learning-Based Power/Performance Optimization for Many-Core Systems With Extended-Range Voltage/Frequency Scaling

Ermao Cai; Da Cheng Juan; Siddharth Garg; Jinpyo Park; Diana Marculescu

Near-threshold computing has emerged as a promising solution to significantly increase the energy efficiency of next-generation multicore systems. This paper evaluates and analyzes the behavior of dynamic voltage and frequency scaling for multicore systems operating under extended range: including near-threshold, nominal, and turbo modes. We adapt the model selection technique from machine learning to determine the relationship between performance and power. The theoretical results show that the resulting models satisfy convexity, which efficiently determines the optimal voltage/frequency operating points for: 1) minimizing energy consumption under throughput constraints or 2) maximizing throughput under a given power budget. We validate our models on FinFET-based chip-multiprocessors. Considering process variations (PVs), experimental results show that at 30% PV levels, our proposed method: 1) reduces energy consumption by 31.09% at iso-performance condition and 2) increases throughput by 11.46% at iso-power when compared with variation-agnostic nominal case.

embedded and real-time computing systems and applications | 2007

Securing More Registers with Reduced Instruction Encoding Architectures

Je-Hyung Lee; Jinpyo Park; Soo-Mook Moon

One of the most serious constraints of an embedded system is its limited memory, which requires small code size for embedded software. One popular method to reduce the code size is reducing the instruction encoding, such as the ARM THUMB or the MIPS-16 architectures. They employ shorter instructions by reducing the field width, including those of register operands. This obviously leaves fewer registers available for register allocation than in the original architecture, which can lead to more register spills, negatively affecting the code size. This paper proposes a simple architectural upgrade by reconstructing the original register file into register banks and by providing a bank change instruction. This can allow all of the original registers to be available for register allocation when the bank change instructions are added appropriately. For such a banked register file, we propose an efficient, region-based register bank allocation technique where appropriate regions are chosen first for bank changes, followed by conventional global register allocation. As a case study, we apply the idea to the ARM THUMB architecture and evaluate how the upgrade affects its overall code size. We found that the upgrade results in an average of 5.0% code size reduction for some of the MiBench and MediaBench benchmarks, compared to the original THUMB code.

Microprocessors and Microsystems | 2017

Region-based dual bank register allocation for reduced instruction encoding Architectures

Je-Hyung Lee; Soo-Mook Moon; Jinpyo Park

In embedded systems, small code size is important due to memory constraints. One technique to achieve a small code size is reducing the instruction encoding from 32-bit to 16-bit, such as the ARM THUMB or MIPS-16 architectures. This half-size encoding leads to shorter register operands, making fewer registers available for register allocation and causing more spills, although invisible registers can be used as spill locations via copies. We propose reconstructing the original register file into dual banks, adding a bank toggle instruction for bank changes and inter-bank copies between the banks. We also propose an efficient dual-bank register allocation technique based on regions in the code to reduce spills. As a case study, we applied our banked register allocation model to the THUMB architecture. We found that the code size decreases by as much as 8% (5.8% on average) while the performance improves by as much as 11.1% (3.3% on average). Our results indicate that the register file of an embedded CPU that supports reduced encoding is better organized into dual banks for higher-quality register allocation, rather than using the invisible registers for spills.

Archive | 2006