
Publication


Featured research published by Zhichun Zhu.


international symposium on microarchitecture | 2000

A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality

Zhao Zhang; Zhichun Zhu; Xiaodong Zhang

DRAM row-buffer conflicts occur when a sequence of requests to different rows goes to the same memory bank, causing much higher memory access latency than requests to the same row or to different banks. We analyze the sources of row-buffer conflicts in the context of superscalar processors and propose a permutation-based page interleaving scheme to reduce row-buffer conflicts and to exploit data access locality in the row buffer. Compared with several existing schemes, we show that the permutation-based scheme dramatically increases the hit rates on DRAM row buffers and reduces the memory stall time of the SPEC95 and TPC-C workloads. The memory stall times of the workloads are reduced by up to 68% and 50% compared with the conventional cache-line and page interleaving schemes, respectively.
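The core of the scheme can be sketched as an XOR of the conventional bank-index bits with bits drawn from higher in the address. The bit-field positions below are illustrative assumptions, not the paper's exact layout:

```python
def permuted_bank(addr, bank_bits=4, page_bits=12):
    """Map a physical address to a DRAM bank using XOR-based
    permutation interleaving (illustrative bit layout).

    Conventional page interleaving takes the bank index from the
    bits just above the page offset. The permutation scheme XORs
    those bits with bits taken from higher in the address (e.g.,
    the L2 tag region), so addresses that would conflict in the
    same bank under the conventional mapping spread across banks,
    while addresses within one DRAM page keep the same bank.
    """
    bank_mask = (1 << bank_bits) - 1
    conventional = (addr >> page_bits) & bank_mask
    tag_bits = (addr >> (page_bits + bank_bits + 8)) & bank_mask  # assumed tag field
    return conventional ^ tag_bits
```

Two addresses that share the conventional bank index but differ in the tag field land in different banks, avoiding a row-buffer conflict, while accesses within the same page still hit the same row buffer.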


international symposium on microarchitecture | 2008

Mini-rank: Adaptive DRAM architecture for improving memory power efficiency

Hongzhong Zheng; Jiang Lin; Zhao Zhang; Eugene Gorbatov; Howard S. David; Zhichun Zhu

The widespread use of multicore processors has dramatically increased the demand for high memory bandwidth and large memory capacity. As DRAM subsystem designs stretch to meet the demand, memory power consumption is now approaching that of processors. However, the conventional DRAM architecture prevents any meaningful power and performance trade-offs for memory-intensive workloads. We propose a novel idea called mini-rank for DDRx (DDR/DDR2/DDR3) DRAMs, which uses a small bridge chip on each DRAM DIMM to break a conventional DRAM rank into multiple smaller mini-ranks, reducing the number of devices involved in a single memory access. The design dramatically reduces memory power consumption with only a slight increase in memory idle latency. It does not change the DDRx bus protocol, and its configuration can be adapted for the best performance-power trade-offs. Our experimental results using four-core multiprogramming workloads show that, on average for memory-intensive workloads, x32 mini-ranks reduce memory power by 27.0% with a 2.8% performance penalty, and x16 mini-ranks reduce memory power by 44.1% with a 7.4% performance penalty.
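The device-count arithmetic behind mini-rank can be illustrated with a toy calculation; the parameters below (a 64-bit rank built from x8 DRAM chips) are assumed for illustration:

```python
def mini_rank_config(rank_width=64, device_width=8, mini_rank_width=32):
    """Illustrative mini-rank arithmetic (assumed parameters).

    A conventional rank activates rank_width // device_width devices
    per access. Breaking the rank into mini_rank_width-wide mini-ranks
    activates fewer devices, at the cost of a longer device-side burst
    (more beats) to move the same cache line through the bridge chip.
    """
    devices_full = rank_width // device_width      # chips per conventional access
    devices_mini = mini_rank_width // device_width # chips per mini-rank access
    burst_stretch = rank_width // mini_rank_width  # burst grows by this factor
    return devices_full, devices_mini, burst_stretch

# x32 mini-ranks: half the chips per access, 2x the device-side burst.
print(mini_rank_config(64, 8, 32))  # (8, 4, 2)
# x16 mini-ranks: a quarter of the chips, 4x the burst.
print(mini_rank_config(64, 8, 16))  # (8, 2, 4)
```

Fewer active devices per access is what cuts power; the stretched burst is the source of the small idle-latency increase the abstract mentions.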


high-performance computer architecture | 2005

A performance comparison of DRAM memory system optimizations for SMT processors

Zhichun Zhu; Zhao Zhang

Memory system optimizations have been well studied on single-threaded systems; however, the wide use of simultaneous multithreading (SMT) techniques raises questions over their effectiveness in the new context. In this study, we thoroughly evaluate contemporary multi-channel DDR SDRAM and Rambus DRAM systems in SMT systems, and search for new thread-aware DRAM optimization techniques. Our major findings are: (1) in general, increasing the number of threads tends to increase the memory concurrency and thus the pressure on DRAM systems, but some exceptions do exist; (2) the application performance is sensitive to memory channel organizations, e.g. independent channels may outperform ganged organizations by up to 90%; (3) the DRAM latency reduction through improving row buffer hit rates becomes less effective due to the increased bank contentions; and (4) thread-aware DRAM access scheduling schemes may improve performance by up to 30% on workload mixes of memory-intensive applications. In short, the use of SMT techniques has somewhat changed the context of DRAM optimizations but does not make them obsolete.


international symposium on computer architecture | 2009

Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices

Hongzhong Zheng; Jiang Lin; Zhao Zhang; Zhichun Zhu

The widespread use of multicore processors has dramatically increased the demands for high bandwidth and large capacity from memory systems. In a conventional DDR2/DDR3 DRAM memory system, the memory bus and DRAM devices run at the same data rate. To improve memory bandwidth, we propose a new memory system design called decoupled DIMM that allows the memory bus to operate at a data rate much higher than that of the DRAM devices. In the design, a synchronization buffer is added to relay data between the slow DRAM devices and the fast memory bus, and memory access scheduling is revised to avoid access conflicts on memory ranks. The design not only improves memory bandwidth beyond what can be supported by current memory devices, but also improves reliability, power efficiency, and cost effectiveness by using relatively slow memory devices. The idea of decoupling, precisely the decoupling of the bandwidth match between the memory bus and a single rank of devices, can also be applied to other types of memory systems, including FB-DIMM. Our experimental results show that a decoupled DIMM system with a 2667MT/s bus data rate and 1333MT/s device data rate improves the performance of memory-intensive workloads by 51% on average over a conventional memory system with a 1333MT/s data rate. Alternatively, a decoupled DIMM system with a 1600MT/s bus data rate and 800MT/s device data rate incurs only an 8% performance loss compared with a conventional system with a 1600MT/s data rate, with a 16% reduction in memory power consumption and a 9% saving in memory energy.
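The bandwidth arithmetic can be sketched with a simple model (illustrative only; it ignores protocol overhead and synchronization-buffer latency):

```python
def decoupled_dimm_bw(bus_mts, device_mts, width_bits=64):
    """Illustrative peak-bandwidth arithmetic for a decoupled DIMM.

    The bus runs faster than any single rank of devices; the
    synchronization buffer interleaves data from enough slow ranks
    to keep the fast bus busy (ceiling of the rate ratio).
    """
    bus_bw_mb_s = bus_mts * width_bits // 8      # peak bus bandwidth, MB/s
    device_bw_mb_s = device_mts * width_bits // 8
    ranks_needed = -(-bus_mts // device_mts)     # ceiling division
    return bus_bw_mb_s, device_bw_mb_s, ranks_needed

# A 1600MT/s bus over 800MT/s devices: each rank supplies half the
# bus's peak, so two interleaved ranks can saturate it.
print(decoupled_dimm_bw(1600, 800))  # (12800, 6400, 2)
```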


international symposium on computer architecture | 2007

Thermal modeling and management of DRAM memory systems

Jiang Lin; Hongzhong Zheng; Zhichun Zhu; Howard S. David; Zhao Zhang

With increasing speed and power density, high-performance memories, including FB-DIMM (Fully Buffered DIMM) and DDR2 DRAM, now begin to require dynamic thermal management (DTM), as processors and hard drives did. The DTM of memories, nevertheless, is different in that it should take processor performance and power consumption into consideration; existing schemes have ignored that. In this study, we investigate a new approach that controls memory thermal issues from the source generating memory activities: the processor. It smooths program execution compared with shutting down memory abruptly, and therefore improves overall system performance and power efficiency. For multicore systems, we propose two schemes called adaptive core gating and coordinated DVFS. The first scheme activates clock gating on selected processor cores, and the second scales down the frequency and voltage levels of processor cores when the memory is about to overheat. They can successfully control memory activities and handle thermal emergencies. More importantly, they improve performance significantly under the given thermal envelope. Our simulation results show that adaptive core gating improves performance by up to 23.3% (16.3% on average) on a four-core system with FB-DIMM when compared with DRAM thermal shutdown, and coordinated DVFS with control-theoretic methods improves performance by up to 18.5% (8.3% on average).
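A minimal sketch of the adaptive core gating idea, with hysteresis thresholds and core counts that are assumptions for illustration, not the paper's actual controller:

```python
def adaptive_core_gating(temp_c, active_cores, t_high=85.0, t_low=80.0, max_cores=4):
    """Illustrative DTM step (assumed thresholds): gate one core when
    the memory temperature estimate crosses the high threshold, and
    re-enable one when it falls below the low threshold. Gating cores
    throttles memory traffic at its source instead of shutting the
    DRAM down abruptly."""
    if temp_c > t_high and active_cores > 1:
        return active_cores - 1   # gate a core to cut memory activity
    if temp_c < t_low and active_cores < max_cores:
        return active_cores + 1   # thermal headroom: restore a core
    return active_cores
```

The real schemes select which cores to gate adaptively and, for coordinated DVFS, use control-theoretic methods rather than a fixed two-threshold rule.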


international symposium on microarchitecture | 2002

Access-mode predictions for low-power cache design

Zhichun Zhu; Xiaodong Zhang

An access-mode prediction technique based on cache hit and miss speculation achieves minimal energy consumption in cache design. Using this method, cache accesses can be adaptively switched between the way-prediction and phased accessing modes.
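The mode switch can be sketched with a saturating counter; the counter width and update policy below are assumptions for illustration, not the paper's design:

```python
class AccessModePredictor:
    """Illustrative sketch: a 2-bit saturating counter predicts hit or
    miss per access. Predicted hits use way-prediction (probe the
    predicted way first, saving energy on the other ways); predicted
    misses use phased access (read all tags first, then only the
    matching data way, saving the speculative data reads)."""

    def __init__(self):
        self.counter = 2  # start weakly predicting hit

    def choose_mode(self):
        return "way-prediction" if self.counter >= 2 else "phased"

    def update(self, was_hit):
        if was_hit:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)
```

The design choice mirrors the abstract: way-prediction wins energy on hits, phased access wins on misses, so steering each access by a hit/miss prediction captures the better of the two.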


international symposium on microarchitecture | 2001

Cached DRAM for ILP processor memory access latency reduction

Zhao Zhang; Zhichun Zhu; Xiaodong Zhang

Cached DRAM adds a small cache onto a DRAM chip to reduce average DRAM access latency. The authors compare cached DRAM with other advanced DRAM techniques for reducing memory access latency in instruction-level-parallelism processors.


measurement and modeling of computer systems | 2008

Software thermal management of dram memory for multicore systems

Jiang Lin; Hongzhong Zheng; Zhichun Zhu; Eugene Gorbatov; Howard S. David; Zhao Zhang

Thermal management of DRAM memory has become a critical issue for server systems. We have done, to the best of our knowledge, the first study of software thermal management for the memory subsystem on real machines. Two recently proposed DTM (Dynamic Thermal Management) policies have been improved, implemented in the Linux OS, and evaluated on two multicore servers: a Dell PowerEdge 1950 server and a customized Intel SR1500AL server testbed. The experimental results first confirm that a system-level memory DTM policy may significantly improve system performance and power efficiency compared with the existing memory bandwidth throttling scheme. A policy called DTM-ACG (Adaptive Core Gating) shows performance improvement comparable to that reported previously: the average improvements are 13.3% and 7.2% on the PowerEdge 1950 and the SR1500AL (vs. 16.3% from the previous simulation-based study), respectively. We also have surprising findings that reveal a weakness of the previous study: CPU heat dissipation and its impact on DRAM memories, which were ignored, are significant factors. For this reason, the second policy, called DTM-CDVFS (Coordinated Dynamic Voltage and Frequency Scaling), performs much better than previously reported: the average improvements are 10.8% and 15.3% on the two machines (vs. 3.4% from the previous study), respectively. It also significantly reduces processor power, by 15.5%, and energy, by 22.7%, on average.


IEEE Transactions on Computers | 2000

Memory hierarchy considerations for cost-effective cluster computing

Xing Du; Xiaodong Zhang; Zhichun Zhu

Using off-the-shelf commodity workstations and PCs to build a cluster for parallel computing has become a common practice. The cost-effectiveness of a cluster computing platform for a given budget and for certain types of applications is mainly determined by its memory hierarchy and the interconnection network configurations of the cluster. Finding such a cost-effective solution from exhaustive simulations would be highly time-consuming and predictions from measurements on existing clusters would be impractical. We present an analytical model for evaluating the performance impact of memory hierarchies and networks on cluster computing. The model covers the memory hierarchy of a single SMP, a cluster of workstations/PCs, or a cluster of SMPs by changing various architectural parameters. Network variations covering both bus and switch networks are also included in the analysis. Different types of applications are characterized by parameterized workloads with different computation and communication requirements. The model has been validated by simulations and measurements. The workloads used for experiments are both scientific applications and commercial workloads. Our study shows that the depth of the memory hierarchy is the most sensitive factor affecting the execution time for many types of workloads. However, the interconnection network cost of a tightly coupled system with a short depth in memory hierarchy, such as an SMP, is significantly more expensive than a normal cluster network connecting independent computer nodes. Thus, the essential issue to be considered is the trade-off between the depth of the memory hierarchy and the system cost. Based on analyses and case studies, we present our quantitative recommendations for building cost-effective clusters for different workloads.


international symposium on performance analysis of systems and software | 2007

DRAM-Level Prefetching for Fully-Buffered DIMM: Design, Performance and Power Saving

Jiang Lin; Hongzhong Zheng; Zhichun Zhu; Zhao Zhang; Howard S. David

We have studied DRAM-level prefetching for the fully buffered DIMM (FB-DIMM) designed for multi-core processors. FB-DIMM has a unique two-level interconnect structure, with FB-DIMM channels at the first level connecting the memory controller and advanced memory buffers (AMBs), and DDR2 buses at the second level connecting the AMBs with DRAM chips. We propose an AMB prefetching method that prefetches memory blocks from DRAM chips to AMBs. It utilizes the redundant bandwidth between the DRAM chips and AMBs but does not consume the crucial channel bandwidth. The proposed method fetches K memory blocks of L2 cache block size around the demanded block, where K is a small value ranging from two to eight. The method may also reduce DRAM power consumption by merging some DRAM precharges and activations. Our cycle-accurate simulation shows that the average performance improvement is 16% for single-core and multi-core workloads constructed from memory-intensive SPEC2000 programs with software cache prefetching enabled, and no workload has a negative speedup. We have found that the performance gain comes from the reduction of idle memory latency and the improvement of channel bandwidth utilization. We have also found that there is only a small overlap between the performance gains from AMB prefetching and software cache prefetching. The average estimated power saving is 15%.
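Fetching K blocks "around" the demanded block can be sketched as an aligned-window address generator; the alignment policy below is an assumption for illustration:

```python
def amb_prefetch_blocks(demand_addr, k=4, block_size=64):
    """Illustrative AMB prefetch candidates: the K cache-block-sized
    regions in the aligned K-block window containing the demanded
    address. Aligning the window keeps the group of blocks adjacent,
    so their precharges/activations can be merged on the DDR2 side
    without consuming FB-DIMM channel bandwidth."""
    window = k * block_size
    base = (demand_addr // window) * window
    return [base + i * block_size for i in range(k)]
```

For example, with K = 4 and 64-byte blocks, a demand to address 0x1234 yields the four block addresses 0x1200 through 0x12C0, which include the demanded block itself.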

Collaboration


Dive into Zhichun Zhu's collaboration.

Top Co-Authors

Hongzhong Zheng
University of Illinois at Chicago

Jiang Lin
Iowa State University

Kun Fang
University of Illinois at Chicago

Long Chen
Iowa State University

Muhammad M. Rafique
University of Illinois at Chicago