Zongwei Zhu
University of Science and Technology of China
Publication
Featured research published by Zongwei Zhu.
international conference on cluster computing | 2012
Gangyong Jia; Xi Li; Chao Wang; Xuehai Zhou; Zongwei Zhu
Main memory is a major shared resource among cores in a multi-core system and is expected to grow significantly in both speed and capacity, which will lead to increasing power consumption. It is therefore critical to address the power issue in the memory subsystem without seriously degrading performance. In this paper, we first propose memory affinity, which keeps memory ranks in their active or low-power states as long as possible to avoid frequent switching between the two, and then present memory affinity aware scheduling (MAS) to balance performance, power, temperature, and fairness in multi-core systems. Experimental results demonstrate that our memory affinity aware scheduling algorithms adapt well to the system load, maximizing power savings and avoiding memory hotspots while sustaining the system's bandwidth demand and preserving fairness among threads.
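As a rough illustration of the memory-affinity idea, the sketch below picks the next runnable thread with a preference for threads whose pages live on an already-active rank, falling back to strict fairness when the fairness gap grows too large. The structures, the home-rank bookkeeping, and the fairness bound are all hypothetical; the paper's MAS additionally weighs power, thermal, and bandwidth concerns.

```c
/* Minimal sketch of memory-affinity-aware picking: among runnable
 * threads, prefer one whose pages live on a rank that is already
 * active, so other ranks can stay in low-power state longer.
 * All structures here are hypothetical illustrations. */
#include <stdio.h>
#include <stdbool.h>

#define NRANKS 4

struct thread {
    int id;
    int home_rank;   /* rank holding most of this thread's pages */
    int vruntime;    /* fairness bookkeeping, smaller = more deserving */
};

static bool rank_active[NRANKS] = { true, false, false, false };

/* Pick the most deserving thread on an already-active rank if one
 * exists within a fairness bound; otherwise fall back to strict
 * fairness and wake the needed rank. */
struct thread *pick_next(struct thread *rq, int n, int fairness_bound)
{
    struct thread *fair = NULL, *affine = NULL;
    for (int i = 0; i < n; i++) {
        if (!fair || rq[i].vruntime < fair->vruntime)
            fair = &rq[i];
        if (rank_active[rq[i].home_rank] &&
            (!affine || rq[i].vruntime < affine->vruntime))
            affine = &rq[i];
    }
    if (affine && affine->vruntime - fair->vruntime <= fairness_bound)
        return affine;                    /* keep current ranks active */
    rank_active[fair->home_rank] = true;  /* must power up a rank */
    return fair;
}

int main(void)
{
    struct thread rq[] = { {1, 2, 10}, {2, 0, 12}, {3, 1, 11} };
    struct thread *t = pick_next(rq, 3, 5);
    printf("next thread %d on rank %d\n", t->id, t->home_rank);
    return 0;
}
```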
great lakes symposium on vlsi | 2012
Xi Li; Gangyong Jia; Yun Chen; Zongwei Zhu; Xuehai Zhou
Optimizing system performance through scheduling has received a great deal of attention. However, none of the existing approaches can balance system performance improvement against a fair share of CPU time among threads. In this paper we present a share memory aware scheduler (SMAS). The key idea is to adopt thread group scheduling, which partitions threads by memory address space to reduce switching overhead while giving each thread a fair chance to occupy the CPU. There are three main contributions: 1) SMAS balances system performance against fairness among all threads; 2) to our knowledge, this is the first attempt to use a share memory aware scheduler for system performance improvement; 3) we implement SMAS on both a testbed and a simulator for evaluation. The testbed results on a 2-core processor show that our proposed scheduler improves several performance metrics with negligible loss of fairness, reducing cache miss rate by 0.128%, run time by 2.62%, DTLB misses by 13.15%, ITLB misses by 31.68%, and ITLB flushes by up to 46.15%. Furthermore, our extensive simulation results for 4 and 8 cores demonstrate that SMAS is highly scalable.
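A minimal sketch of the thread-group idea, under the assumption that threads sharing an address space (here an invented mm_id field) are run back to back so that switches between them skip the TLB flush; this is an illustrative simulation, not the paper's kernel implementation.

```c
/* Sketch of thread-group scheduling by shared address space: threads
 * of the same process (same page tables) run back to back, so context
 * switches between them skip the TLB flush and keep caches warm.
 * Hypothetical simulation, not the paper's kernel code. */
#include <stdio.h>

struct thread { int tid; int mm_id; };  /* mm_id names the address space */

int main(void)
{
    struct thread rq[] = { {1,100}, {2,200}, {3,100}, {4,200}, {5,100} };
    int n = 5, scheduled = 0, cur_mm = -1, tlb_flushes = 0;
    int done[5] = {0};

    while (scheduled < n) {
        int pick = -1;
        for (int i = 0; i < n; i++)       /* prefer the current mm */
            if (!done[i] && rq[i].mm_id == cur_mm) { pick = i; break; }
        for (int i = 0; pick < 0 && i < n; i++)
            if (!done[i]) pick = i;       /* otherwise start a new group */
        if (rq[pick].mm_id != cur_mm) {
            tlb_flushes++;                /* address-space switch */
            cur_mm = rq[pick].mm_id;
        }
        printf("run tid %d (mm %d)\n", rq[pick].tid, rq[pick].mm_id);
        done[pick] = 1;
        scheduled++;
    }
    printf("TLB flushes: %d (an alternating order would need %d)\n",
           tlb_flushes, n);
    return 0;
}
```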
modeling, analysis, and simulation on computer and telecommunication systems | 2012
Gangyong Jia; Xi Li; Chao Wang; Xuehai Zhou; Zongwei Zhu
Performance optimization and energy efficiency are the major challenges in multi-core system design. Among state-of-the-art approaches, cache affinity aware scheduling and techniques based on dynamic voltage and frequency scaling (DVFS) are widely applied to improve performance and to save energy, respectively. In modern operating systems, schedulers exploit cache affinity by placing a process on a recently used processor whenever possible; a process running on a high-affinity processor finds most of its state already in the cache and thus executes more efficiently. However, most state-of-the-art DVFS techniques do not analyze the cost of the DVFS mechanism itself. In this paper, we first propose frequency affinity, which retains the current voltage and frequency setting as long as possible to avoid frequent switching, and then present frequency affinity aware scheduling (FAS) to maximize power efficiency in multi-core systems. Experimental results demonstrate that our frequency affinity aware scheduling algorithms are considerably more power efficient than single-ISA heterogeneous multi-core processors.
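A toy illustration of the frequency-affinity idea, assuming each thread carries a (hypothetical) preferred frequency: ordering the run queue so that equal demands become adjacent means the DVFS transition cost is paid once per batch rather than once per thread. The demand values are made up for the example.

```c
/* Sketch of frequency affinity: batch runnable threads by their
 * preferred core frequency so the DVFS setting is held as long as
 * possible. Demands and the schedule are hypothetical. */
#include <stdio.h>
#include <stdlib.h>

struct thread { int tid; int freq_mhz; };  /* preferred frequency */

static int by_freq(const void *a, const void *b)
{
    return ((const struct thread *)a)->freq_mhz -
           ((const struct thread *)b)->freq_mhz;
}

int main(void)
{
    struct thread rq[] = { {1,2000}, {2,800}, {3,2000}, {4,800}, {5,1200} };
    int n = 5, transitions = 0, cur = -1;

    /* Frequency-affine order: identical demands become adjacent. */
    qsort(rq, n, sizeof rq[0], by_freq);
    for (int i = 0; i < n; i++) {
        if (rq[i].freq_mhz != cur) { transitions++; cur = rq[i].freq_mhz; }
        printf("run tid %d at %d MHz\n", rq[i].tid, rq[i].freq_mhz);
    }
    printf("DVFS transitions: %d (an unsorted order could need up to %d)\n",
           transitions, n);
    return 0;
}
```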
international conference on cluster computing | 2012
Gangyong Jia; Xi Li; Chao Wang; Xuehai Zhou; Zongwei Zhu
The last-level cache (LLC) mitigates the long latency of memory accesses in today's chip multiprocessors (CMPs). The promotion policy of the LLC largely determines cache efficiency: an inappropriate promotion policy can keep useless blocks in the cache longer than necessary, resulting in inefficiency. Current state-of-the-art promotion policies are unaware of the re-reference interval of cache accesses, so applications that exhibit long re-reference intervals perform poorly under them. In this paper, we propose a promotion policy that uses re-reference interval prediction (RRIP) information. The technique requires only minor hardware modifications to the least-recently-used (LRU) replacement policy. Our evaluation shows that RRIP improves IPCsum by 2.58%, Weighted Speedup by 3.54%, and IPCnorm_hmean by 6.2% on average over a single-step promotion policy.
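RRIP itself is a published mechanism (Jaleel et al.): each block carries a small re-reference prediction value (RRPV). The sketch below shows the standard 2-bit RRPV mechanics that the proposed promotion policy builds on; it is not the paper's exact promotion rule, and the hit-promotes-to-zero choice is one common variant.

```c
/* Sketch of a 2-bit RRIP cache set: hits promote a block to RRPV 0
 * (predicted near re-reference), new blocks enter at RRPV 2, and the
 * victim is any block whose RRPV has aged to 3. */
#include <stdio.h>

#define WAYS 4
#define MAX_RRPV 3

struct block { long tag; int rrpv; int valid; };

static void access(struct block set[WAYS], long tag)
{
    for (int i = 0; i < WAYS; i++)
        if (set[i].valid && set[i].tag == tag) {
            set[i].rrpv = 0;               /* hit: near re-reference */
            printf("hit  %ld\n", tag);
            return;
        }
    for (;;) {                             /* miss: find an RRPV==3 victim */
        for (int i = 0; i < WAYS; i++)
            if (!set[i].valid || set[i].rrpv == MAX_RRPV) {
                set[i] = (struct block){ tag, MAX_RRPV - 1, 1 };
                printf("miss %ld\n", tag);
                return;
            }
        for (int i = 0; i < WAYS; i++)     /* age everyone and retry */
            set[i].rrpv++;
    }
}

int main(void)
{
    struct block set[WAYS] = {0};
    long trace[] = {1, 2, 3, 4, 1, 5, 1, 2};
    for (int i = 0; i < 8; i++)
        access(set, trace[i]);
    return 0;
}
```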
international symposium on circuits and systems | 2013
Xi Li; Zongwei Zhu; Gangyong Jia; Xuehai Zhou
Memory is responsible for a large and increasing fraction of the energy consumed by computers. To address this challenge, memory manufacturers have developed memory devices with multiple power states. To manage these power states more effectively in the operating system, we propose a rank-sensitive buddy system (RS-Buddy) that clusters pages together to prolong the idle time of memory ranks without compromising the buddy system's anti-fragmentation properties. To reduce unnecessarily frequent mode transitions, we introduce a power-aware task group scheduler (PATGS) that groups threads accessing the same rank and schedules them together while sustaining system fairness. Finally, we integrate state-of-the-art mode control policies with RS-Buddy and PATGS; experimental results demonstrate that our algorithms improve power efficiency by 25.31% to 27.35% compared with state-of-the-art studies.
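A minimal sketch of the rank-sensitive allocation idea, assuming per-rank free pools (an invented simplification of the paper's buddy-system changes): requests are served from a rank that is already awake, so idle ranks can stay in low-power states longer.

```c
/* Sketch of rank-sensitive page allocation with hypothetical per-rank
 * free pools: prefer pages on already-awake ranks; only wake a new
 * rank when the awake ones run dry. */
#include <stdio.h>

#define NRANKS 4
#define POOL   8

static int free_pages[NRANKS] = { POOL, POOL, POOL, POOL };
static int awake[NRANKS]      = { 1, 0, 0, 0 };

static int alloc_page(void)
{
    for (int r = 0; r < NRANKS; r++)       /* prefer awake ranks */
        if (awake[r] && free_pages[r] > 0) {
            free_pages[r]--;
            return r;
        }
    for (int r = 0; r < NRANKS; r++)       /* must wake a rank */
        if (free_pages[r] > 0) {
            awake[r] = 1;
            free_pages[r]--;
            return r;
        }
    return -1;                             /* out of memory */
}

int main(void)
{
    for (int i = 0; i < 10; i++)
        printf("page %d from rank %d\n", i, alloc_page());
    return 0;
}
```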
international conference on parallel and distributed systems | 2012
Gangyong Jia; Xi Li; Chao Wang; Xuehai Zhou; Zongwei Zhu
Optimizing cache performance by improving data locality has received a great deal of attention. However, none of the existing approaches exploits each task's behavior to optimize data locality for caches. In this paper we present behavior aware data locality (BADL) to optimize cache performance. The key idea is to take each task's behavior into account when allocating memory, exploiting the tasks' different locality patterns to optimize cache performance. There are five main contributions: 1) to the best of our knowledge, this is the first attempt to improve cache performance by exploiting task behavior; 2) BADL analyzes in detail the performance loss that arises inside cache lines, a finer granularity than current state-of-the-art hardware approaches; 3) BADL optimizes cache performance by improving intra-cache-line efficiency; 4) we implement BADL for both single-threaded and multi-threaded application scenarios; 5) BADL can be combined with most existing cache optimization techniques. The experimental results show that BADL improves performance by 18.6% on average for single-threaded applications and by 20.8% on average for multi-threaded applications.
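One way to picture intra-cache-line efficiency is hot/cold field separation; the sketch below contrasts a layout where a hot counter shares its 64-byte line with cold bookkeeping against a packed layout where every fetched line is full of useful data. This is my illustration of the general principle, not the paper's allocator.

```c
/* Sketch of intra-cache-line efficiency: when a task touches only a
 * few hot fields, packing those fields densely means each fetched
 * 64-byte line carries mostly useful bytes. Layouts are illustrative. */
#include <stdio.h>

#define N 1024

/* Naive layout: one hot counter per 64-byte line (8 + 56 bytes). */
struct mixed { long hot; char cold[56]; };

/* Behavior-aware layout: hot data packed densely, cold data elsewhere. */
static long hot_only[N];
static char cold_only[N][56];

int main(void)
{
    static struct mixed naive[N];
    long sum = 0;

    for (int i = 0; i < N; i++) sum += naive[i].hot;   /* 1 line per counter   */
    for (int i = 0; i < N; i++) sum += hot_only[i];    /* 1 line per 8 counters */

    printf("sum=%ld; the packed layout touches ~8x fewer cache lines "
           "in the hot loop\n", sum);
    (void)cold_only;
    return 0;
}
```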
ubiquitous intelligence and computing | 2014
Bo Chen; Xi Li; Xuehai Zhou; Tengfu Liu; Zongwei Zhu
Reducing the energy consumption of a wireless device is an important factor in improving user experience. Several smartphones try to save energy by tearing down connections to the mobile network as soon as a data transmission completes. The side effect, however, is frequent connection re-establishment in applications that send and receive small amounts of data, leading to high energy overhead in the mobile network. This paper presents an optimization mechanism that buffers network packets. Packets are classified according to the urgency of the application and, in combination with the original Power Save (PS) mode of WiFi, data transmission is regulated dynamically: urgent application data are transmitted immediately, while other data are delayed for varying intervals. Our experiments show that the mechanism improves network response time for the foreground application and reduces the WiFi energy consumption of background applications, both during normal use and when the screen is locked.
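A small sketch of the urgency-based buffering described above: urgent packets go out at once, while background packets are queued and flushed in a single radio wake-up when the buffer fills or a deadline expires. The thresholds and the radio_send stub are hypothetical.

```c
/* Sketch of urgency-aware packet buffering: urgent traffic bypasses
 * the queue; background traffic is batched so the radio wakes up
 * once per batch instead of once per packet. */
#include <stdio.h>

#define BUF_MAX   4
#define DEADLINE 10   /* ticks a background packet may wait */

struct pkt { int id; int urgent; };

static struct pkt buf[BUF_MAX];
static int buffered = 0, oldest_tick = -1;

static void radio_send(struct pkt p) { printf("  send pkt %d\n", p.id); }

static void flush(void)
{
    printf("radio wake-up, flushing %d packets\n", buffered);
    for (int i = 0; i < buffered; i++) radio_send(buf[i]);
    buffered = 0; oldest_tick = -1;
}

static void submit(struct pkt p, int now)
{
    if (p.urgent) {                 /* foreground: no extra latency */
        printf("radio wake-up, urgent\n");
        radio_send(p);
        return;
    }
    if (buffered == 0) oldest_tick = now;
    buf[buffered++] = p;
    if (buffered == BUF_MAX || now - oldest_tick >= DEADLINE)
        flush();
}

int main(void)
{
    struct pkt trace[] = { {1,0}, {2,0}, {3,1}, {4,0}, {5,0}, {6,0} };
    for (int t = 0; t < 6; t++)
        submit(trace[t], t);
    if (buffered) flush();          /* drain at shutdown */
    return 0;
}
```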
international symposium on parallel and distributed processing and applications | 2014
Xi Li; Beilei Sun; Zongwei Zhu; Chao Wang; Xuehai Zhou
Software performance is increasingly limited by the Memory Wall rather than by the CPU. Many studies focus on hiding DRAM latency by improving the row-buffer hit rate, but most of them treat Kernel and User accesses equally. Data used by the operating system and by user applications are spread across different rows of the same bank, causing contention for the row buffer when they access the bank successively. We find that contention between Kernel and User accounts for a large proportion of all row-buffer misses. To alleviate this contention, we divide the unified DRAM memory space into a Kernel space and a User space. We propose a new page allocation system, the K/U-Aware page allocation system, to manage the Kernel and User spaces under different address mapping schemes of the DRAM memory controller. In the new system, pages are allocated from different spaces according to the requester (Kernel or User), and the sizes of the two spaces grow and shrink dynamically as required. For benchmarks in the PARSEC suite, the proposed system effectively reduces Kernel/User contention, yielding significant improvements in row-buffer hit rate. Execution time is reduced by 9.45% (max. 20.45%) and 6.51% (max. 18.05%) under two typical address mapping schemes.
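A minimal sketch of the space-splitting idea, under the simplifying assumption that kernel pages grow upward from the bottom of physical memory and user pages downward from the top, so the two sides never share rows; the paper additionally ties the split to the controller's address mapping scheme.

```c
/* Sketch of K/U-aware page allocation: kernel and user pages come
 * from disjoint ends of physical memory, so Kernel and User rows
 * never compete for the same row buffer. Frame numbering is a
 * hypothetical simplification. */
#include <stdio.h>

#define TOTAL_PAGES 64

enum who { KERNEL, USER };

static int next_k = 0;                /* kernel space grows upward   */
static int next_u = TOTAL_PAGES - 1;  /* user space grows downward   */

static int alloc_page(enum who w)
{
    if (next_k > next_u) return -1;   /* spaces met: out of memory   */
    /* The split point moves with demand, matching the abstract's
     * claim that the two spaces grow and shrink dynamically. */
    return (w == KERNEL) ? next_k++ : next_u--;
}

int main(void)
{
    printf("kernel page -> frame %d\n", alloc_page(KERNEL));
    printf("user page   -> frame %d\n", alloc_page(USER));
    printf("kernel page -> frame %d\n", alloc_page(KERNEL));
    return 0;
}
```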
international conference on engineering of complex computer systems | 2014
Zongwei Zhu; Xi Li; Hengchang Liu; Cheng Ji; Yuan Xu; Xuehai Zhou; Beilei Sun
Memory management significantly affects the overall performance of modern multi-core smartphone systems. Android, one of the most popular smartphone operating systems, adopts a global buddy system with an FCFS (first come, first served) policy for memory allocation and release requests, in order to control external fragmentation and keep memory allocation efficient. However, an extensive experimental study of thread behavior indicates that external fragmentation is no longer the crucial bottleneck for most Android applications. Instead, a thread usually allocates or releases memory in bursts, causing serious memory lock contention and inefficient memory allocation, and the pattern of these bursts varies over the life cycle of a thread. The conventional FCFS policy of the Android buddy system fails to adapt to such variation and thus suffers performance degradation. In this paper, we propose a novel memory management framework, Memory Management Based on Thread Behaviors (MMBTB), for multi-core smartphone systems. It adapts to varying thread behavior through targeted optimizations to provide efficient memory allocation. The efficiency and effectiveness of this scheme on multi-core architectures is demonstrated by a theoretical emulation model, and our experiments on a real Android system show that MMBTB improves memory allocation efficiency by 12%-20%, confirming the theoretical analysis.
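One way a burst-aware allocator can beat global FCFS is shown in the sketch below: each thread keeps a local page cache and refills it in batches sized to its recently observed burst length, so a burst of allocations costs a handful of global-lock round-trips instead of one per request. The sizes, the adaptation rule, and the lock accounting are all hypothetical, not MMBTB's actual design.

```c
/* Sketch of burst-adaptive per-thread caching: refill the local
 * cache in batches; longer observed bursts earn bigger batches. */
#include <stdio.h>

#define MAX_BATCH 16

static int global_pool = 1024;   /* pages behind the global lock */
static int lock_acquisitions = 0;

struct tcache { int pages; int burst; int batch; };

static void refill(struct tcache *tc)
{
    lock_acquisitions++;          /* one lock round-trip per batch */
    global_pool -= tc->batch;
    tc->pages += tc->batch;
}

static void thread_alloc(struct tcache *tc)
{
    if (tc->pages == 0) refill(tc);
    tc->pages--;
    /* Adapt: a longer observed burst earns a bigger refill batch. */
    if (++tc->burst > tc->batch && tc->batch < MAX_BATCH)
        tc->batch *= 2;
}

int main(void)
{
    struct tcache tc = { 0, 0, 1 };
    for (int i = 0; i < 100; i++)  /* a 100-allocation burst */
        thread_alloc(&tc);
    printf("100 allocations, %d global-lock acquisitions "
           "(per-request FCFS would take 100)\n", lock_acquisitions);
    return 0;
}
```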
international conference on engineering of complex computer systems | 2014
Beilei Sun; Xi Li; Zongwei Zhu; Xuehai Zhou
Detailed analysis of the behavior of the operating system and applications is essential for making full use of scarce hardware resources and improving performance. This paper focuses on their DRAM access behavior in terms of access proportion and row-buffer miss ratio (RBM). The access proportions of Kernel and User vary greatly across the stages of a process's lifetime, and most row-buffer misses are caused by whichever side has the higher access proportion. By analyzing the RBM series with an ARMA model, we find that the User's DRAM accesses have only short-term influence on its behavior, while the Kernel's influence persists longer. The ARMA model of the RBM series can predict future RBM values, which provides a sound basis for scheduling DRAM access commands. Gaussian fitting shows that Kernel and User DRAM accesses are tightly correlated, especially in the steady and final stages of a process's life cycle; based on this close relation, the DRAM access behavior of one side can be estimated from the known behavior of the other. System calls that markedly affect the access proportions and RBMs are also identified.
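For concreteness, a one-step ARMA forecast of the kind described above has the form x[t] = c + a1*x[t-1] + a2*x[t-2] + b1*e[t-1] + e[t], with the expected innovation E[e[t]] = 0. The sketch below computes such a forecast; the coefficients are made-up stand-ins, whereas in the paper they would be fitted to the measured RBM series.

```c
/* Sketch of one-step RBM prediction with an ARMA(2,1) model.
 * Coefficients are hypothetical, not fitted values from the paper. */
#include <stdio.h>

int main(void)
{
    /* hypothetical fitted parameters */
    const double c = 0.02, a1 = 0.6, a2 = 0.2, b1 = 0.3;
    /* recent measured RBM values and the last innovation (residual) */
    double x1 = 0.35, x2 = 0.30, e1 = 0.01;

    /* one-step-ahead forecast: the future innovation has mean zero */
    double forecast = c + a1 * x1 + a2 * x2 + b1 * e1;
    printf("predicted next RBM: %.3f\n", forecast);

    /* once the real value arrives, the new residual feeds the
     * following forecast */
    double actual = 0.33;
    double e0 = actual - forecast;
    double next = c + a1 * actual + a2 * x1 + b1 * e0;
    printf("forecast after update: %.3f\n", next);
    return 0;
}
```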