Chongmin Li
Tsinghua University
Publications
Featured research published by Chongmin Li.
Journal of Computer Science and Technology | 2010
Song-Liu Guo; Haixia Wang; Yibo Xue; Chongmin Li; Dongsheng Wang
As more processing cores are integrated into one chip and feature size continues to shrink, the average access latency for remote nodes under a directory-based coherence protocol becomes higher, which greatly impacts system performance. Previous techniques such as data replication and data migration optimize the performance of the requesting core, but offer little improvement for neighbor nodes. Other techniques such as in-transit optimization try to reduce latency at the cost of increased storage. This paper introduces a hierarchical cache directory into the CMP (chip multiprocessor), which divides CMP tiles into multiple regions hierarchically and combines this with data replication. A new directory organization is proposed to record the sharing status within a region and assist the regional home in completing operations efficiently. Simulation results show that for a 16-core CMP, compared to a traditional directory, the hierarchical cache directory reduces average access latency by 9% and on-chip network traffic by 34% on average, with less storage. Theoretical analyses show that for a 2^n × 2^n tiled CMP, the average access latency of the hierarchical cache directory asymptotically approaches a function that is independent of n, hence the architecture is highly scalable.
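To make the region idea concrete, here is a minimal Python sketch assuming a 4x4 tiled CMP split into 2x2 regions; the interleaved mapping functions and their names are illustrative assumptions, not the paper's actual directory organization.

GRID = 4            # 4x4 = 16 tiles
REGION = 2          # each region is a 2x2 block of tiles

def global_home(block_addr):
    # Simple address interleaving across all 16 tiles (assumed mapping).
    return block_addr % (GRID * GRID)

def region_of(tile):
    # Region coordinates (rx, ry) containing a given tile.
    x, y = tile % GRID, tile // GRID
    return x // REGION, y // REGION

def regional_home(block_addr, requester_tile):
    # Pick a home tile inside the requester's own region, so that most
    # requests and replica lookups stay within the level-1 region.
    rx, ry = region_of(requester_tile)
    lx = block_addr % REGION
    ly = (block_addr // REGION) % REGION
    return (ry * REGION + ly) * GRID + (rx * REGION + lx)

if __name__ == "__main__":
    addr, requester = 0x2F3, 5
    print("global home tile:", global_home(addr))
    print("regional home tile for requester 5:", regional_home(addr, requester))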
APPT'11: Proceedings of the 9th International Conference on Advanced Parallel Processing Technologies | 2011
Xi Zhang; Qian Hu; Dongsheng Wang; Chongmin Li; Haixia Wang
Scaling DRAM will be increasingly difficult due to power and cost constraints. Phase Change Memory (PCM) is an emerging memory technology that can increase main memory capacity in a cost-effective and power-efficient manner. However, PCM incurs relatively long latency, high write energy, and finite endurance. To make PCM an alternative for scalable main memory, write traffic to PCM should be reduced, and the memory replacement policy can play a vital role here. In this paper, we propose a Read-Write Aware policy (RWA) to reduce write traffic without performance degradation. RWA exploits the asymmetry between the read and write costs of PCM, and prevents dirty data lines from being evicted frequently. Simulation results on an 8-core CMP show that for memory organizations with and without a DRAM buffer, RWA achieves 33.1% and 14.2% reductions in write traffic to PCM respectively. In addition, an Improved RWA (I-RWA) is proposed that takes the write access pattern into consideration and can further improve memory efficiency. For the organization with a DRAM buffer, I-RWA provides a significant 42.8% reduction in write traffic. Furthermore, both RWA and I-RWA incur no hardware overhead and can be easily integrated into existing hardware.
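A minimal Python sketch of a read-write-aware victim choice in the spirit of RWA: prefer evicting clean lines so that dirty lines, whose eviction costs a PCM write, stay resident longer. The CacheLine fields and the LRU-age tie-break are assumptions for illustration, not the paper's implementation.

from dataclasses import dataclass

@dataclass
class CacheLine:
    tag: int
    dirty: bool
    lru_age: int          # larger value = older (closer to LRU)

def select_victim(cache_set):
    # Prefer a clean victim; fall back to dirty lines only if the set has none.
    clean = [line for line in cache_set if not line.dirty]
    candidates = clean if clean else cache_set
    return max(candidates, key=lambda line: line.lru_age)

if __name__ == "__main__":
    cache_set = [CacheLine(0x1, True, 3), CacheLine(0x2, False, 2), CacheLine(0x3, True, 5)]
    print("victim tag:", hex(select_victim(cache_set).tag))   # -> 0x2, the clean line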
Symposium on Computer Architecture and High Performance Computing | 2010
Xi Zhang; Chongmin Li; Haixia Wang; Dongsheng Wang
Previous research shows that the LRU replacement policy is not efficient when applications exhibit a distant re-reference interval. The recently proposed RRIP policy improves performance for such workloads. However, RRIP lacks access recency information, which may prevent the replacement policy from making accurate predictions. Consequently, RRIP is not robust for recency-friendly workloads. This paper proposes an Adaptive Insertion and Re-reference Prediction (AI-RRP) policy which evicts data based on both the re-reference prediction value and access recency information. To make the replacement policy more adaptive across different workloads and different phases of execution, Dynamic AI-RRP (DAI-RRP) is proposed, which adjusts the insertion position and prediction value for different access patterns. Simulation results show DAI-RRP reduces CPI over LRU and Dynamic RRIP by an average of 8.3% and 4.1% respectively on a single-core processor with a 1MB 16-way set-associative last-level cache (LLC). Evaluations on a quad-core CMP with a 4MB shared LLC show that DAI-RRP outperforms LRU and Dynamic RRIP (DRRIP) on the weighted speedup metric by an average of 13.2% and 26.7% respectively. Furthermore, compared to LRU, DAI-RRP requires similar hardware, or even less hardware for high-associativity caches.
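A sketch of a victim choice that combines an RRIP-style re-reference prediction value (RRPV) with access recency, roughly in the spirit of AI-RRP. The 2-bit RRPV width, the insertion value of 2, and the recency tie-break are assumptions for illustration only.

RRPV_MAX = 3                 # 2-bit re-reference prediction value (assumed width)

class Line:
    def __init__(self, tag, now=0):
        self.tag = tag
        self.rrpv = 2        # inserted predicting a "long" re-reference interval
        self.last_access = now

def find_victim(cache_set):
    # Evict a line predicted to be re-referenced in the distant future; among
    # equally distant lines, use recency (oldest access first) to break ties.
    while True:
        distant = [l for l in cache_set if l.rrpv == RRPV_MAX]
        if distant:
            return min(distant, key=lambda l: l.last_access)
        for l in cache_set:          # no distant line yet: age every line and retry
            l.rrpv += 1

if __name__ == "__main__":
    s = [Line(0xA, now=10), Line(0xB, now=4), Line(0xC, now=7)]
    print("victim tag:", hex(find_victim(s).tag))   # -> 0xB, oldest among distant lines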
Tsinghua Science & Technology | 2007
Haixia Wang; Dongsheng Wang; Peng Li; Jinglei Wang; Chongmin Li
Token protocol provides a new coherence framework for shared-memory multiprocessor systems. It avoids the indirections of directory protocols for common cache-to-cache transfer misses, and achieves higher interconnect bandwidth and lower interconnect latency compared with snooping protocols. However, broadcasting increases network traffic, limiting the scalability of the token protocol. This paper describes an efficient technique to reduce the token protocol's network traffic, called the sharing relation cache. This cache provides destination set information for cache-to-cache miss requests by caching directory information for recently shared data. The paper explains how to implement the technique in a token protocol. Simulations using the SPLASH-2 benchmarks show that in a 16-core chip multiprocessor system, the cache reduces network traffic by 15% on average.
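A toy Python sketch of the destination-set idea: on a cache-to-cache miss, look up the recently cached sharer set for the block and multicast only to those nodes, falling back to a broadcast on a miss. The fixed capacity and oldest-first eviction are assumptions for illustration.

from collections import OrderedDict

class SharingRelationCache:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.entries = OrderedDict()          # block address -> set of sharer node ids

    def update(self, block, sharers):
        # Record (or refresh) directory information for a recently shared block.
        self.entries[block] = set(sharers)
        self.entries.move_to_end(block)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry

    def destination_set(self, block, all_nodes):
        # Hit: multicast the request only to the cached sharers.
        # Miss: fall back to the token protocol's broadcast.
        return self.entries.get(block, set(all_nodes))

if __name__ == "__main__":
    src = SharingRelationCache()
    src.update(0x40, {3, 7})
    print(src.destination_set(0x40, range(16)))        # -> {3, 7}
    print(len(src.destination_set(0x80, range(16))))   # -> 16 (broadcast)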
Networking, Architecture and Storage | 2010
Chongmin Li; Haixia Wang; Yibo Xue; Xi Zhang; Dongsheng Wang
As more processing cores are integrated into one chip and the feature size continues to shrink, the increasing on-chip access latency complicates the design of the on-chip last-level cache for chip multiprocessors. At the same time, the overhead of maintaining an on-chip directory cannot be ignored as the number of processing cores increases. There is an urgent need for a scalable organization of the on-chip last-level cache. In this work, we propose a fast hierarchical cache directory for tiled CMPs, which divides CMP tiles into multiple regions hierarchically and combines this with data replication. A multi-level directory is used to record the sharing information within a region and assist the regional home node in completing operations efficiently. A fast directory is used to lower L2 slice access latency at the same time. Most requests to the last-level cache can be handled within the local level-1 region. Evaluation indicates this architecture is highly scalable. Simulation results show that for a 16-core CMP, the hierarchical cache directory reduces average access latency to the last-level cache by 46.35% and average on-chip network traffic by 19.25%, while system performance is improved by 20.82%.
APPT'11: Proceedings of the 9th International Conference on Advanced Parallel Processing Technologies | 2011
Chongmin Li; Dongsheng Wang; Yibo Xue; Haixia Wang; Xi Zhang
The LRU replacement policy is commonly used in the last-level caches of multiprocessors. However, LRU does not work well for memory-intensive workloads whose working sets are larger than the available cache size. When a newly arrived cache block is inserted at the MRU position, it may never be reused before being evicted from the cache, yet it occupies cache space for a long time while it moves from the MRU to the LRU position. This results in inefficient use of cache space. If we instead insert a new cache block directly at the LRU position, cache performance can be improved because some fraction of the working set is retained in the cache. In this work, we propose the Enhanced Dynamic Insertion Policy (EDIP) and the Thread-Aware Enhanced Dynamic Insertion Policy (TAEDIP), which adjust the probability of insertion at the MRU position through set dueling. Runtime information for the previous and the next BIP level is gathered and compared with the current level to choose an appropriate BIP level. At the same time, access frequency is used to choose a victim. In this way, our design achieves a lower miss rate than LRU for workloads with large working sets, while for workloads with small working sets its miss rate is close to that of LRU. Simulation results for a single-core configuration with a 1MB 16-way LLC show that EDIP reduces CPI over LRU and DIP by an average of 11.4% and 1.8% respectively. On a quad-core configuration with a 4MB 16-way LLC, TAEDIP improves performance on the weighted speedup metric by 11.2% over LRU and 3.7% over TADIP on average. On the fairness metric, TAEDIP improves performance by 11.2% over LRU and 2.6% over TADIP on average.
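A minimal sketch of the set-dueling mechanism that EDIP builds on: a few "leader" sets always use MRU insertion or BIP insertion, a saturating counter records which leader group misses less, and all "follower" sets adopt the winner. The leader-set counts, counter width, and BIP probability are assumptions for illustration, not EDIP's multi-level BIP tuning.

import random

PSEL_MAX = (1 << 10) - 1          # 10-bit saturating policy-selection counter
psel = PSEL_MAX // 2
LEADER_MRU = set(range(0, 32))    # leader sets that always insert at MRU
LEADER_BIP = set(range(32, 64))   # leader sets that always use BIP insertion
BIP_EPSILON = 1 / 32              # BIP inserts at MRU only with this probability

def on_miss(set_index):
    # A miss in a leader set is a vote against that leader's policy.
    global psel
    if set_index in LEADER_MRU:
        psel = min(PSEL_MAX, psel + 1)
    elif set_index in LEADER_BIP:
        psel = max(0, psel - 1)

def insert_at_mru(set_index):
    # Decide where a newly filled block is inserted in this set.
    if set_index in LEADER_MRU:
        return True
    if set_index in LEADER_BIP:
        return random.random() < BIP_EPSILON
    bip_is_winning = psel > PSEL_MAX // 2      # follower sets adopt the winner
    return random.random() < BIP_EPSILON if bip_is_winning else True

if __name__ == "__main__":
    for s in (5, 40, 200):                     # one leader of each kind and a follower
        print(s, insert_at_mru(s))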
Journal of Computer Science and Technology | 2010
Jinglei Wang; Yibo Xue; Haixia Wang; Chongmin Li; Dongsheng Wang
As the number of cores in chip multiprocessors (CMPs) increases, the cache coherence protocol has become a key issue in their integration. Supporting a cache coherence protocol in large chip multiprocessors still faces three hurdles: design complexity, performance, and scalability. This paper proposes Cache Coherent Network on Chip (CCNoC), a scheme that decouples cache coherency maintenance from the processors and shared L2 caches and implements it completely in the network on chip, freeing processors and shared L2 caches from the chore of maintaining coherency and thereby reducing the design complexity of CMPs. In this way, CCNoC also improves the performance of the cache coherence protocol by reducing directory access latency, and enhances scalability by avoiding massive directory overhead in the shared L2 caches. In CCNoC, coherence state caches and active directory caches are implemented in the network interface components of the network on chip to maintain cache coherence states for blocks in the L1 caches and to manage directory information for recently accessed blocks in the L2 caches, respectively. CCNoC provides a scalable CMP framework for tackling cache coherency, which is the foundation of CMPs. This paper evaluates the performance of CCNoC. Experimental results show that for a 16-core system, CCNoC improves performance by 3% on average over a conventional chip multiprocessor and by 10% at best, while reducing storage overhead by 1.8% and saving 88% of directory storage, showing good scalability.
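A toy Python model of the two per-tile structures the abstract names, a coherence-state cache and an active directory cache held in the network interface. The dictionary-based tables, sizes, and MESI-style state letters are illustrative assumptions, not CCNoC's actual hardware design.

class NetworkInterface:
    # Per-tile NoC interface holding the two CCNoC-style structures.
    def __init__(self, directory_entries=256):
        self.state_cache = {}        # L1 block address -> coherence state ("M","E","S","I")
        self.active_directory = {}   # L2 block address -> set of sharer node ids
        self.directory_entries = directory_entries

    def local_state(self, block):
        # Answer a coherence lookup in the NoC without involving the core or L2.
        return self.state_cache.get(block, "I")

    def record_sharer(self, block, node):
        self.active_directory.setdefault(block, set()).add(node)
        if len(self.active_directory) > self.directory_entries:
            # Drop the oldest tracked block (insertion order) to bound storage.
            self.active_directory.pop(next(iter(self.active_directory)))

if __name__ == "__main__":
    ni = NetworkInterface()
    ni.state_cache[0x100] = "S"
    ni.record_sharer(0x100, 3)
    print(ni.local_state(0x100), ni.active_directory[0x100])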
International Conference on Computer Design | 2013
Guohong Li; Zhenyu Liu; Sanchuan Guo; Chongmin Li; Dongsheng Wang
With the number of cores and the working sets of parallel workloads soaring, shared L2 caches exhibit fewer misses than private L2 caches by making better use of all the available cache capacity. However, shared L2 caches induce higher overall L1 miss latencies because of the longer average distance between the requestor and the home node, and potential congestion at some nodes. We observe that there is a high probability that the data requested by an L1 miss resides in a neighbor node's L1 cache. In such cases, the long-distance accesses to the home nodes can potentially be avoided. To leverage this property, we propose Bayesian-theory-oriented Optimal Data-Provider Selection (ODPS). ODPS partitions the multi-core into clusters of 2×2 nodes and introduces the Proximity Data Prober (PDP) to detect whether an L1 miss can be served by an L1 cache within the same cluster. Furthermore, we devise the Bayesian Decision Classifier (BDC) to intelligently and adaptively select a remote L2 cache or a neighboring L1 node as the data provider, according to the minimal miss cost based on Bayesian decision theory.
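A small sketch of the decision ODPS makes, reduced to expected-cost minimization: given an estimated probability that a neighbor L1 in the 2×2 cluster holds the block, pick the provider with the smaller expected miss cost. The latency numbers and the probability values are made-up placeholders, not measurements from the paper.

L1_NEIGHBOR_HIT = 8      # cycles if a neighbor L1 in the cluster has the block (assumed)
L1_NEIGHBOR_MISS = 30    # wasted probe plus fallback to the L2 home node (assumed)
L2_HOME = 24             # cycles to fetch from the remote L2 home node (assumed)

def choose_provider(p_neighbor_has_block):
    # Pick the data provider with the smaller expected miss cost.
    expected_neighbor = (p_neighbor_has_block * L1_NEIGHBOR_HIT
                         + (1 - p_neighbor_has_block) * L1_NEIGHBOR_MISS)
    return "neighbor L1" if expected_neighbor < L2_HOME else "L2 home"

if __name__ == "__main__":
    for p in (0.2, 0.5, 0.9):
        print(f"P(neighbor has block) = {p}: ask the {choose_provider(p)}")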
Symposium on Computer Architecture and High Performance Computing | 2013
Guohong Li; Olivier Temam; Zhenyu Liu; Dongsheng Wang; Sanchuan Guo; Chongmin Li
As the number of cores and the working sets of parallel workloads increase, shared L2 caches exhibit fewer misses than private L2 caches by making better use of the total available cache capacity, but they also induce higher overall L1 miss latencies because of the longer average distance between two nodes and the potential congestion at certain nodes. One of the main causes of the long L1 miss latencies is accesses to the home nodes of the directory. However, we have observed that there is a high probability that the target data of an L1 miss resides in the L1 cache of a neighbor node. In such cases, these long-distance accesses to the home nodes can potentially be avoided. We organize the multi-core into clusters of 2×2 nodes and, to leverage the aforementioned property, we introduce the Cluster Cache Monitor (CCM). The CCM is a hardware structure in charge of detecting whether an L1 miss can be served by one of the cluster L1 caches, together with two cluster-related states in the coherence protocol that avoid long-distance accesses to home nodes upon hits in the cluster L1 caches. We evaluate this approach on a 64-node multi-core using the SPLASH-2 and PARSEC benchmarks, and find that the CCM can reduce execution time by 15% and energy by 14%, while saving 28% of the directory storage area compared to a standard multi-core with a shared L2. We also show that the CCM outperforms recent mechanisms such as ASR, DCC and RNUCA.
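A toy model of what a cluster-level monitor might track: for blocks cached somewhere in the 2×2 cluster, one presence bit per cluster node, so an L1 miss can be redirected to a nearby L1 instead of the directory home node. The dictionary-based table and method names are purely illustrative, not the CCM hardware organization.

class ClusterCacheMonitor:
    NODES_PER_CLUSTER = 4                      # one monitor per 2x2 cluster

    def __init__(self):
        self.presence = {}                     # block address -> 4-bit presence vector

    def record_fill(self, block, node):
        self.presence[block] = self.presence.get(block, 0) | (1 << node)

    def record_evict(self, block, node):
        bits = self.presence.get(block, 0) & ~(1 << node)
        if bits:
            self.presence[block] = bits
        else:
            self.presence.pop(block, None)

    def local_provider(self, block, requester):
        # On an L1 miss, return a cluster node that can serve it, or None if the
        # request must travel to the directory home node as usual.
        bits = self.presence.get(block, 0) & ~(1 << requester)
        for node in range(self.NODES_PER_CLUSTER):
            if bits & (1 << node):
                return node
        return None

if __name__ == "__main__":
    ccm = ClusterCacheMonitor()
    ccm.record_fill(0x200, node=2)
    print(ccm.local_provider(0x200, requester=0))   # -> 2
    print(ccm.local_provider(0x300, requester=0))   # -> None (go to the home node)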
APPT 2013: Revised Selected Papers of the 10th International Symposium on Advanced Parallel Processing Technologies, Volume 8299 | 2013
Chongmin Li; Dongsheng Wang; Haixia Wang; Guohong Li; Yibo Xue
Chip multiprocessors (CMPs) are becoming the mainstream computing platform, and the design of an efficient on-chip memory hierarchy is one of the key challenges in computer architecture. Tiled architectures and non-uniform cache architecture (NUCA) are commonly adopted in modern CMPs. Previous efforts on cache replacement policies usually assume a unified last-level cache or multiprogrammed workloads; little research focuses on the replacement policy for a cache clustering scheme running parallel workloads. A cache clustering scheme, which is a tradeoff between a shared cache organization and a private cache organization with data replication, can improve parallel performance. In a cache clustering scheme, the blocks in the last-level cache can be subdivided into eight types. In this work we propose the Data-access-Type-Aware Replacement Policy (DTARP) for cache clustering organizations. DTARP classifies data blocks in the last-level cache by access type and designs insertion and victim-selection policies for the different access types on top of the traditional LRU policy. Globally shared data is kept in the last-level cache longer than before. Simulation results show that DTARP can improve the system performance of a clustering scheme using the LRU policy by 10.9% on average.
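A minimal sketch of a data-access-type-aware victim choice in the spirit of DTARP: classify each LLC block by how it is accessed and evict the less valuable classes first so that globally shared data stays resident longer. The three-class ranking below is a simplification of the paper's eight access types, and the tuple layout is an assumption for illustration.

# Eviction priority: lower numbers are evicted first.
EVICTION_PRIORITY = {"private": 0, "cluster_shared": 1, "global_shared": 2}

def pick_victim(cache_set):
    # cache_set: list of (tag, access_type, lru_age); among blocks of the same
    # type, evict the least recently used one (largest lru_age).
    return min(cache_set,
               key=lambda blk: (EVICTION_PRIORITY[blk[1]], -blk[2]))

if __name__ == "__main__":
    s = [(0x1, "global_shared", 7), (0x2, "private", 2), (0x3, "cluster_shared", 5)]
    print("victim tag:", hex(pick_victim(s)[0]))    # -> 0x2, the private block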