Zhenjiang Wang
Chinese Academy of Sciences
Publication
Featured research published by Zhenjiang Wang.
measurement and modeling of computer systems | 2012
Di Xu; Chenggang Wu; Pen Chung Yew; Jianjun Li; Zhenjiang Wang
Competition for shared memory resources on multiprocessors is the dominant cause of application slowdown and makes application performance vary unpredictably. It exacerbates the need for Quality of Service (QoS) on such systems. In this paper, we propose a fair-progress process scheduling (FPS) policy to improve system fairness. Its strategy is to force equally weighted applications to experience the same amount of slowdown when they run concurrently. The basic approach is to monitor the progress of all applications at runtime. When an application suffers more slowdown and accumulates less effective work than the others, we allocate it more CPU time to restore parity. Our policy also allows different weights for different threads, and provides an effective and robust tuner that lets the OS freely trade off system fairness against throughput. Evaluation results show that FPS can significantly improve system fairness, by an average of 53.5% and 65.0% on a 4-core processor with a private cache and a 4-core processor with a shared cache, respectively. The penalty is about 1.1% and 1.6% of the system throughput. For memory-intensive workloads, FPS also improves system fairness by an average of 45.2% and 21.1% on 4-core and 8-core systems, respectively, at the expense of a throughput loss of about 2%.
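The fair-progress idea in this abstract can be illustrated with a minimal sketch. It is not the FPS implementation from the paper. It assumes each application's slowdown under contention is already known, and it splits one scheduling quantum so that every application accumulates the same weighted amount of effective work. The function names `fair_progress_shares` and `effective_work` are hypothetical.

```python
def fair_progress_shares(slowdowns, weights=None, quantum=100.0):
    """Split one quantum of CPU time across applications.

    slowdowns[i] is the assumed slowdown of application i under contention
    (2.0 means it makes half the progress per unit of CPU time).
    An application that is slowed down more receives proportionally more time.
    """
    if weights is None:
        weights = [1.0] * len(slowdowns)
    if len(weights) != len(slowdowns):
        raise ValueError("weights and slowdowns must have the same length")
    # Time needed per unit of weighted progress is weight * slowdown.
    demand = [w * s for w, s in zip(weights, slowdowns)]
    total = sum(demand)
    return [quantum * d / total for d in demand]


def effective_work(shares, slowdowns):
    """Effective work done in each share: CPU time divided by slowdown."""
    return [t / s for t, s in zip(shares, slowdowns)]


if __name__ == "__main__":
    slowdowns = [1.0, 2.0]
    shares = fair_progress_shares(slowdowns, quantum=90.0)
    print(shares, effective_work(shares, slowdowns))
```

With equal weights, the application with twice the slowdown receives twice the CPU time, so both complete the same amount of effective work in the quantum.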
symposium on code generation and optimization | 2010
Zhenjiang Wang; Chenggang Wu; Pen Chung Yew
Dynamic memory allocation is widely used in modern programs. General-purpose heap allocators often focus more on reducing their run-time overhead and memory space utilization than on exploiting the characteristics of the heap objects they allocate. This paper presents a lightweight dynamic optimizer, named Dynamic Pool Allocation (DPA), which aims to exploit the affinity of the allocated heap objects and improve their layout at run-time. DPA uses an adaptive partial call chain with heuristics to aggregate affinitive heap objects into dedicated memory regions, called memory pools. We examine the factors that could affect the effectiveness of such a layout. We have implemented DPA and measured its performance on several SPEC CPU 2000 and 2006 benchmarks that make extensive use of heap objects. Evaluations show that it achieves an average speedup of 12.1% and 10.8% on two x86 commodity machines, respectively, using GCC -O3, and up to 82.2% for some benchmarks.
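As a rough illustration of pooling by call chain, the sketch below groups allocations that share the same trailing call sites into the same simulated memory chunk. It is a simplification under stated assumptions: the chain depth is fixed rather than adaptive, no heuristics are applied, and addresses are simulated integers. The class name `PoolAllocator` and its parameters are hypothetical and do not come from the paper.

```python
class PoolAllocator:
    """Bump allocator that keeps one active chunk per partial call chain."""

    def __init__(self, chain_depth=2, chunk_size=4096):
        self.chain_depth = chain_depth    # assumed fixed; DPA adapts this
        self.chunk_size = chunk_size
        self._next_chunk_base = 0         # next unused simulated address
        self._pools = {}                  # chain key -> [chunk_base, bytes_used]

    def alloc(self, call_chain, size):
        """Return a simulated address for an object allocated under call_chain."""
        if size > self.chunk_size:
            raise ValueError("object larger than a chunk")
        # Only the innermost call sites identify the pool.
        key = tuple(call_chain[-self.chain_depth:])
        pool = self._pools.get(key)
        if pool is None or pool[1] + size > self.chunk_size:
            # Start a fresh chunk for this pool.
            pool = [self._next_chunk_base, 0]
            self._next_chunk_base += self.chunk_size
            self._pools[key] = pool
        addr = pool[0] + pool[1]
        pool[1] += size
        return addr


if __name__ == "__main__":
    heap = PoolAllocator(chain_depth=1, chunk_size=64)
    print(heap.alloc(["main", "make_node"], 16))
    print(heap.alloc(["main", "make_edge"], 16))
    print(heap.alloc(["main", "make_node"], 16))
```

Objects allocated from the same call site end up adjacent even when allocations from other sites are interleaved, which is the layout property the abstract describes.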
IEEE Transactions on Parallel and Distributed Systems | 2015
Chenggang Wu; Jin Li; Di Xu; Pen Chung Yew; Jianjun Li; Zhenjiang Wang
Competition for shared memory resources on multiprocessors is the dominant cause of application slowdown and makes application performance vary unpredictably. It exacerbates the need for Quality of Service (QoS) on such systems. In this paper, we propose a fair-progress process scheduling (FPS) policy to improve system fairness. The strategy is to force equally weighted applications to bear the same amount of slowdown when they run concurrently. When an application suffers more slowdown and accumulates less effective work than the others, we allocate it more CPU time to restore parity. This policy can also be applied to threads with different weights. Evaluation results show that FPS can significantly improve system fairness at the expense of a slight loss in throughput. We can also keep the performance information of an application to guide process scheduling when it runs again later. When FPS uses such performance information from previous runs, fairness can be maintained without the overhead of the training periods otherwise required by FPS, and throughput can thus be improved.
automated software engineering | 2014
Wenwen Wang; Zhenjiang Wang; Chenggang Wu; Pen Chung Yew; Xipeng Shen; Xiang Yuan; Jianjun Li; Xiaobing Feng; Yong Guan
We propose an effective approach to automatically localize buggy shared memory accesses that trigger concurrency bugs. Compared to existing approaches, our approach has two advantages. First, as long as enough successful runs of a concurrent program are collected, our approach can localize buggy shared memory accesses even with only one single failed run captured, as opposed to the requirement of capturing multiple failed runs in existing approaches. This is a significant advantage because it is more difficult to capture the elusive failed runs than the successful runs in practice. Second, our approach exhibits more precise bug localization results because it also captures buggy shared memory accesses in those failed runs that terminate prematurely, which are often neglected in existing approaches. Based on this proposed approach, we also implement a prototype, named LOCON. Evaluation results on 16 common concurrency bugs show that all buggy shared memory accesses that trigger these bugs can be precisely localized by LOCON with only one failed run captured.
high performance embedded architectures and compilers | 2012
Zhenjiang Wang; Chenggang Wu; Pen Chung Yew; Jianjun Li; Di Xu
With the advent of multicore systems, the gap between processor speed and memory latency has grown worse because of their complex interconnect. Sophisticated techniques are needed more than ever to improve an application's spatial and temporal locality. This paper describes an optimization that aims to improve heap data layout by structure-splitting. It also provides runtime address checking by piggybacking on the existing page protection mechanism to guarantee the correctness of the optimization, which has eluded many previous attempts due to safety concerns. The technique can be applied to both sequential and parallel programs at either compile time or runtime. However, this paper focuses primarily on sequential (i.e., single-threaded) programs at runtime. Experimental results show that some benchmarks in SPEC 2000 and 2006 can achieve a speedup of up to 142.8%.
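Structure-splitting in general can be sketched as separating frequently accessed (hot) fields from rarely accessed (cold) ones so that hot data is stored contiguously. The example below shows only that data-layout transformation. It does not model the paper's runtime address checking or page protection, and the function name `split_structure` and the field names are hypothetical.

```python
from array import array


def split_structure(records, hot_fields, cold_fields):
    """Split a list of record dicts into per-field hot and cold storage.

    Hot fields are assumed to be floats and are packed into contiguous
    arrays; cold fields are kept in ordinary lists.
    """
    hot = {f: array("d", (r[f] for r in records)) for f in hot_fields}
    cold = {f: [r[f] for r in records] for f in cold_fields}
    return hot, cold


if __name__ == "__main__":
    records = [
        {"key": 1.0, "name": "a"},
        {"key": 2.0, "name": "b"},
    ]
    hot, cold = split_structure(records, ["key"], ["name"])
    # A scan over the hot field now touches only the packed array.
    print(sum(hot["key"]), cold["name"])
```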
symposium on code generation and optimization | 2014
Jianjun Li; Zhenjiang Wang; Chenggang Wu; Wei-Chung Hsu; Di Xu
Calling context has been widely used in many software development processes such as testing, event logging, and program analysis. It plays an even more important role in data race detection and performance bottleneck analysis for multi-threaded programs. This paper presents DACCE (Dynamic and Adaptive Calling Context Encoding), an efficient runtime encoding/decoding mechanism for single-threaded and multi-threaded programs that captures dynamic calling contexts. It can dynamically encode all call paths invoked at runtime, and adjust the encodings according to a program's execution behavior. In contrast to existing context encoding methods, DACCE can work on an incomplete call graph, and it requires neither source code analysis nor offline profiling to conduct context encoding. DACCE significantly expands the functionality and applicability of calling context with even lower runtime overhead. Experiments with SPEC CPU2006 and Parsec 2.1 show that DACCE is very efficient (about 2% runtime overhead) and effective for all tested benchmarks.
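To make the notion of encoding and decoding a calling context concrete, the sketch below interns each call path as an integer ID in a calling context tree and decodes an ID by walking parent links. This is a generic technique swapped in for illustration, not the DACCE encoding, and the class name `ContextEncoder` is hypothetical. Like the approach in the abstract, it discovers call paths at runtime and needs no complete call graph.

```python
class ContextEncoder:
    """Assign a compact integer ID to each distinct call path seen at runtime."""

    ROOT = 0

    def __init__(self):
        self._children = {}            # (parent_id, call_site) -> context id
        self._nodes = [(None, None)]   # context id -> (parent_id, call_site)

    def enter(self, ctx_id, call_site):
        """Return the ID of the context reached by calling call_site from ctx_id."""
        key = (ctx_id, call_site)
        child = self._children.get(key)
        if child is None:
            # First time this path is seen: create a new ID on demand.
            child = len(self._nodes)
            self._nodes.append(key)
            self._children[key] = child
        return child

    def decode(self, ctx_id):
        """Recover the list of call sites, outermost first, for a context ID."""
        path = []
        while ctx_id != self.ROOT:
            ctx_id, site = self._nodes[ctx_id]
            path.append(site)
        path.reverse()
        return path


if __name__ == "__main__":
    enc = ContextEncoder()
    main = enc.enter(ContextEncoder.ROOT, "main")
    f = enc.enter(main, "f")
    print(f, enc.decode(f))
```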
acm sigplan symposium on principles and practice of parallel programming | 2014
Wenwen Wang; Chenggang Wu; Pen Chung Yew; Xiang Yuan; Zhenjiang Wang; Jianjun Li; Xiaobing Feng
Non-determinism in concurrent programs makes debugging them much more challenging than debugging sequential programs. To mitigate such difficulties, we propose a new technique to automatically locate buggy shared memory accesses that trigger concurrency bugs. Compared to existing fault localization techniques that are based on empirical statistical approaches, this technique has two advantages. First, as long as enough successful runs of a concurrent program are collected, the proposed technique can locate buggy memory accesses to the shared data even with only a single failed run captured, as opposed to the need to capture multiple failed runs in other statistical approaches. Second, the proposed technique is more precise because it considers memory accesses in those failed runs that terminate prematurely.
design, automation, and test in europe | 2014
Hui Guo; Zhenjiang Wang; Chenggang Wu; Ruining He
Binary translation makes it convenient to emulate one instruction set on another. It is growing in popularity in various applications, especially on embedded platforms. When it comes to testing binary translators, traditional methodologies, which still rely mainly on manual unit tests, are costly, labor-intensive, and often inadequate for testing the complicated algorithms in the translators. Some standard benchmark suites, like SPEC CPU2006, are compiled with different compilation options for further tests. However, according to our experimental results, the translation modules still have over 30% of their code unexecuted after such tests. Methodologies based on randomization can generate a vast variety of tests and thus improve the code coverage in the translation system. In this paper, we propose such an approach, named EATBit. Test binaries are generated with randomly selected instructions and operands. The binaries and a large amount of input data are then refined to exclude invalid ones. Experimental results on a real binary translator demonstrate that EATBit not only improves code coverage by over 20%, but also finds new bugs in the translator.
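The generate-then-refine loop described in this abstract can be sketched for a toy instruction set. Everything here is an assumption for illustration: the four-operation ISA, the validity rule (no division by an immediate zero), and the function names are hypothetical and are not taken from EATBit.

```python
import random

# Hypothetical toy ISA: (operation, destination register, immediate operand).
OPS = ("add", "sub", "mul", "div")
REGS = ("r0", "r1", "r2", "r3")


def random_program(rng, length):
    """Build one test program from randomly selected instructions and operands."""
    return [(rng.choice(OPS), rng.choice(REGS), rng.randint(0, 3))
            for _ in range(length)]


def is_valid(program):
    """Refinement step: reject programs that would fault (division by zero)."""
    return all(not (op == "div" and imm == 0) for op, _, imm in program)


def generate_tests(seed, count, length=4):
    """Generate `count` valid random programs, reproducibly for a given seed."""
    rng = random.Random(seed)
    tests = []
    while len(tests) < count:
        program = random_program(rng, length)
        if is_valid(program):
            tests.append(program)
    return tests


if __name__ == "__main__":
    for program in generate_tests(seed=1, count=3):
        print(program)
```

Seeding the generator keeps failing tests reproducible, which matters when a randomly generated program exposes a translator bug.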
Physica E-low-dimensional Systems & Nanostructures | 2007
Liwei Shi; Y. H. Chen; B. Xu; Zhenjiang Wang; Z.G. Wang
international conference on software engineering | 2015
Xiang Yuan; Chenggang Wu; Zhenjiang Wang; Jianjun Li; Pen Chung Yew; Jeff Huang; Xiaobing Feng; Yanyan Lan; Yunji Chen; Yong Guan