Jinzhan Peng
Intel
Publication
Featured research published by Jinzhan Peng.
Languages, Compilers, and Tools for Embedded Systems | 2004
Gilberto Contreras; Margaret Martonosi; Jinzhan Peng; Roy Dz-Ching Ju; Guei-Yuan Lueh
Managing power concerns in microprocessors has become a pressing research problem across the domains of computer architecture, CAD, and compilers. As a result, several parameterized cycle-level power simulators have been introduced. While these simulators can be quite useful for microarchitectural studies, their generality limits how accurate they can be for any one chip family. Furthermore, their hardware focus means that they do not explicitly enable studying the interaction of different software layers, such as Java applications and their underlying runtime system software. This paper describes and evaluates XTREM, a power simulation tool tailored for the Intel XScale microarchitecture. In building XTREM, our goals were to develop a microarchitecture simulator that, while still offering size parameterizations for cache, TLB, etc., more accurately reflected a realistic processor pipeline. We present a detailed set of validations based on multimeter power measurements and hardware performance counter sampling. Based on these validations across a wide range of stressmarks, Java benchmarks, and non-Java benchmarks, XTREM has an average performance error of only 6.5% and an even smaller average power error of 4%. The paper goes on to present a selection of application studies enabled by the simulator. For example, by presenting power behavior vs. time for selected embedded C and Java CLDC benchmarks, we can draw power distinctions between the two programming domains as well as distinguish Java application (JITted code) power from Java runtime system power. We also study how the Intel XScale core's power consumption varies for different data activity factors, creating power swings as large as 50mW for a 200MHz core. We are planning to release XTREM for wider use, and feel that it offers a useful step forward for compiler and embedded software designers.
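The data-activity result above reflects a simple intuition: dynamic power grows with the number of bits that toggle on a bus each cycle. The C sketch below illustrates that idea with an activity-factor power estimate; the function names and coefficients are invented placeholders for illustration, not XTREM's actual power model.

    #include <stdint.h>
    #include <stdio.h>

    /* Count bits that toggle between consecutive bus values (Hamming distance). */
    static int toggled_bits(uint32_t prev, uint32_t cur)
    {
        uint32_t x = prev ^ cur;
        int n = 0;
        while (x) { n += x & 1u; x >>= 1; }
        return n;
    }

    /* Hypothetical per-cycle dynamic power estimate: a base term plus a
     * per-toggled-bit cost. Both coefficients are made-up placeholders. */
    static double cycle_power_mw(uint32_t prev_bus, uint32_t cur_bus)
    {
        const double BASE_MW = 150.0;    /* assumed static + clock power */
        const double PER_BIT_MW = 1.5;   /* assumed cost per toggled bit */
        return BASE_MW + PER_BIT_MW * toggled_bits(prev_bus, cur_bus);
    }

    int main(void)
    {
        /* Constant vs. alternating data: higher activity factor, higher power. */
        printf("low activity:  %.1f mW\n", cycle_power_mw(0x00000000u, 0x00000000u));
        printf("high activity: %.1f mW\n", cycle_power_mw(0x55555555u, 0xAAAAAAAAu));
        return 0;
    }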
Interpreters, Virtual Machines and Emulators | 2004
Jinzhan Peng; Gansha Wu; Guei-Yuan Lueh
Interpretation has the salient merits of simplicity, portability and a small footprint, but comes at the price of poor performance. Stack caching is a technique for building a high-performance interpreter by keeping the source and destination operands of instructions in registers, so as to reduce the memory accesses involved in interpretation. One drawback of stack caching is that an instruction may have multiple ways of being interpreted depending on which registers its source operands reside in, resulting in code explosion as well as deteriorating code maintainability. This paper presents a code sharing mechanism that performs as efficiently as a stack-caching interpreter while keeping the code size as compact as that of general threaded interpreters. Our results show that our approach outperforms a threaded interpreter by an average of 13.6% while the code size increases by only 1KB (~3%).
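To make the stack-caching idea concrete, here is a heavily simplified C sketch that keeps the top-of-stack value in a local variable (which a compiler can hold in a register), so an add performs one memory access instead of two. The toy opcodes and single-slot cache are illustrative assumptions; the paper's actual contribution, the shared-code dispatch mechanism, is not shown.

    #include <stdio.h>

    /* Bytecodes for a toy stack machine. */
    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    /* A minimal one-slot stack-caching interpreter: the top of stack lives
     * in the local `tos`, so OP_ADD reads one operand from memory instead
     * of two. Generic illustration only, not the paper's mechanism. */
    static void interpret(const int *code)
    {
        int stack[64];
        int *sp = stack;   /* one past the last in-memory slot */
        int tos = 0;       /* cached top-of-stack */

        for (;;) {
            switch (*code++) {
            case OP_PUSH:
                *sp++ = tos;      /* spill the old top into memory */
                tos = *code++;    /* the new top stays cached */
                break;
            case OP_ADD:
                tos += *--sp;     /* one memory read instead of two */
                break;
            case OP_PRINT:
                printf("%d\n", tos);
                break;
            case OP_HALT:
                return;
            }
        }
    }

    int main(void)
    {
        int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        interpret(program);   /* prints 5 */
        return 0;
    }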
Workshop on Memory System Performance and Correctness | 2006
Jinzhan Peng; Guei-Yuan Lueh; Gansha Wu; Xiaogang Gou; Ryan N. Rakvic
The working set size of Java applications on embedded systems has recently been increasing, causing the Translation Lookaside Buffer (TLB) to become a serious performance bottleneck. From a thorough analysis of the SPECjvm98 benchmark suite executing on a commodity embedded system, we find that TLB misses account for 24% to 50% of total execution time. We explore and evaluate a wide spectrum of TLB-enhancing techniques with different combinations of software/hardware approaches, namely superpages for reducing TLB miss rates, a two-level TLB and TLB prefetching for reducing both TLB miss rates and TLB miss latency, and even a no-TLB design for removing TLB overhead completely. We adapt, and then in a novel way extend, these approaches to fit the design space of embedded systems executing Java code. We compare these approaches, discussing their performance behavior, software/hardware complexity and constraints, and especially their design implications for the application, runtime and OS. We first conclude that even with the aggressive approaches presented, the TLB remains a performance bottleneck. Second, in addition to facing very different design considerations and constraints on embedded systems, proven hardware techniques, such as TLB prefetching, have different performance implications there. Third, the software-based solutions, the no-TLB design and superpaging, appear to be more effective in improving Java application performance on embedded systems. Finally, beyond performance, these approaches have their respective pros and cons; it is left to the system designer to make the appropriate engineering tradeoff.
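Of the software techniques above, superpaging is the easiest to illustrate: mapping a large region with one big page lets a single TLB entry cover what would otherwise need hundreds of small-page entries. The sketch below uses Linux's MAP_HUGETLB as a stand-in; this is a host-Linux illustration of the idea, assuming huge pages have been reserved by the administrator, and it is not the embedded configuration evaluated in the paper.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Back a 2MB heap region with one huge page, so a single TLB entry
     * covers the whole region instead of 512 separate 4KB entries. */
    int main(void)
    {
        size_t len = 2 * 1024 * 1024;  /* one 2MB superpage */
        void *heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (heap == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");  /* likely: no huge pages reserved */
            return 1;
        }
        memset(heap, 0, len);  /* touch the region: one TLB entry suffices */
        munmap(heap, len);
        return 0;
    }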
High Performance Embedded Architectures and Compilers | 2005
Gansha Wu; Xin Zhou; Guei-Yuan Lueh; Jesse Fang; Peng Guo; Jinzhan Peng; Victor Ying
Automatic memory management has become prevalent on memory- and computation-constrained systems. Previous research has shown strong interest in small memory footprint, garbage collection (GC) pause time and energy consumption, while performance was left out of the spotlight. This fact inspired us to design memory management techniques that deliver high performance while still keeping space consumption and response time under control. XAMM is an attempt to answer that quest. Driven by these design decisions, XAMM implements a variety of novel techniques spanning the object model, heap management, allocation and GC mechanisms. XAMM also adopts techniques that not only exploit the underlying system's capabilities, but also assist optimizations in other runtime components (e.g. the code generator). This paper describes these techniques in detail and reports our experiences with the implementation. We conclude that XAMM demonstrates the feasibility of achieving high performance without breaking memory constraints. We support our claims with evaluation results for a spectrum of real-world programs and synthetic benchmarks. For example, the heap placement optimization boosts system-wide performance by as much as 10%, and lazy, selective location-bits management reduces execution time by as much as 14% while cutting average GC pause time by as much as 25%. Together, these techniques improve system-wide performance by as much as 56%.
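As background for the allocation mechanisms mentioned above, the sketch below shows a generic bump-pointer allocator of the kind high-performance managed runtimes use for fast-path allocation: a pointer increment plus a bounds check. This is a textbook sketch under assumed types and sizes, not XAMM's actual allocator or heap layout.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* A contiguous allocation region: `cur` bumps toward `end`. */
    typedef struct {
        char *cur;
        char *end;
    } BumpHeap;

    static void *bump_alloc(BumpHeap *h, size_t size)
    {
        size = (size + 7) & ~(size_t)7;   /* keep objects 8-byte aligned */
        if (h->cur + size > h->end)
            return NULL;                  /* a real runtime would trigger GC here */
        void *obj = h->cur;
        h->cur += size;
        return obj;
    }

    int main(void)
    {
        char *space = malloc(1 << 20);    /* 1MB backing store */
        BumpHeap heap = { space, space + (1 << 20) };
        void *a = bump_alloc(&heap, 24);
        void *b = bump_alloc(&heap, 40);
        printf("a=%p b=%p\n", a, b);      /* b sits 24 bytes past a */
        free(space);
        return 0;
    }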
Archive | 2004
Zhiwei Ying; Guei-Yuan Lueh; Jinzhan Peng; Anwar M. Ghuloum; Ali-Reza Adl-Tabatabai
Archive | 2005
Shih-Wei Liao; Zhaohui Du; Gansha Wu; Guei-Yuan Lueh; Zhiwei Ying; Jinzhan Peng
Archive | 2007
Gansha Wu; Xin Zhou; Biao Chen; Jinzhan Peng; Peng Guo; Xiaogang Gou
Archive | 2006
Gansha Wu; Xin Zhou; Biao Chen; Peng Guo; Jinzhan Peng; Zhiwei Ying
Archive | 2003
Gansha Wu; Guei-Yuan Lueh; Xiaohua Shi; Jinzhan Peng
Archive | 2005
Gansha Wu; Xin Zhou; Peng Guo; Jinzhan Peng; Zhiwei Ying; Guei-Yuan Lueh