Jianliang Ma
Zhejiang University
Publications
Featured research published by Jianliang Ma.
IEEE International Conference on Dependable, Autonomic and Secure Computing | 2011
Baozhong Yu; Jianliang Ma; Tianzhou Chen; Minghui Wu
Last-level caches (LLCs) have grown large, with significant power consumption, and as LLC capacity increases they become quite inefficient: recent studies show that a large percentage of cache blocks are dead during their time in the cache. There is therefore a growing need for LLC management that reduces the number of dead blocks in the LLC, yet placing and replacing dead blocks also carries a significant power cost. In this paper, we introduce a global priority table predictor, a technique for determining a cache block's priority when it is inserted into the LLC. It is similar in spirit to previous predictors such as reuse-distance and dead-block predictors. The global priority table is indexed by a hash of the block address and stores the priority value of the associated cache block. This priority value can drive dead-block replacement and bypass optimizations, allowing a large number of dead blocks to be bypassed. The technique achieves an average reduction of 13.2% in LLC misses for twenty single-threaded workloads from the SPEC2006 suite and 29.9% for ten multi-programmed workloads. It also yields a geometric mean speedup of 8.6% for single-threaded workloads and a geometric mean normalized weighted speedup of 39.1% for multi-programmed workloads.
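To make the mechanism concrete, here is a minimal C++ sketch of a hash-indexed priority table with saturating counters. The table size, counter width, bypass threshold, and update events (`onHit`, `onDeadEviction`) are illustrative assumptions, not the paper's exact design.

```cpp
// Sketch of a global priority table predictor (parameters are hypothetical).
#include <array>
#include <cstddef>
#include <cstdint>

class GlobalPriorityTable {
    static constexpr std::size_t kEntries = 4096; // assumed table size
    static constexpr uint8_t kMax    = 7;         // 3-bit saturating counter
    static constexpr uint8_t kBypass = 1;         // priority below this => bypass LLC
    std::array<uint8_t, kEntries> table{};

    static std::size_t hash(uint64_t blockAddr) {
        // Fold the block address into a table index (illustrative hash).
        return (blockAddr ^ (blockAddr >> 12)) % kEntries;
    }

public:
    // Consulted when a block is about to be inserted into the LLC.
    bool shouldBypass(uint64_t blockAddr) const {
        return table[hash(blockAddr)] < kBypass;
    }

    // On an LLC hit the block proved useful, so raise its priority.
    void onHit(uint64_t blockAddr) {
        uint8_t& p = table[hash(blockAddr)];
        if (p < kMax) ++p;
    }

    // On eviction without reuse (a dead block), lower the priority so
    // future insertions of this block can be bypassed.
    void onDeadEviction(uint64_t blockAddr) {
        uint8_t& p = table[hash(blockAddr)];
        if (p > 0) --p;
    }
};
```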
High Performance Computing and Communications | 2013
Jun Yao; Jianliang Ma; Tianzhou Chen; Tongsen Hu
Spin-Transfer Torque RAM (STT-RAM) is a promising cache candidate that has been studied intensively in recent years. Compared with traditional SRAM, STT-RAM is more attractive for future on-chip caches due to its long endurance, low leakage, high density, and high access speed. Nevertheless, the major challenges of using STT-RAM as an L1 cache are its write energy and write latency. Using STT-RAM for L1 becomes feasible when its data retention time is reduced, and we find that most data in an L1 cache has a lifetime shorter than the STT-RAM retention time. However, a refresh scheme is then needed to guarantee data correctness, which degrades system performance and consumes additional energy. In this paper, we propose a counter-controlled scheme that avoids refreshing STT-RAM L1 cache data blocks, together with a dead-data processing strategy that handles a data block when it exceeds its retention time. Our simulation results show that an STT-RAM L1 cache coupled with our counter-controlled scheme saves up to 60% of energy consumption (44% on average) compared with an SRAM L1 cache, while achieving a slight average performance improvement over the baseline.
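A minimal C++ sketch of the counter-controlled idea follows: each block carries a retention counter, and instead of refreshing an expiring block, the scheme treats it as dead, writing back dirty data and invalidating. The counter width, tick granularity, and field names are assumptions for illustration.

```cpp
// Sketch of counter-controlled retention for an STT-RAM L1 (assumed parameters).
#include <cstddef>
#include <cstdint>
#include <vector>

struct SttRamBlock {
    bool    valid = false;
    bool    dirty = false;
    uint8_t ticks = 0;   // retention counter, advanced on a coarse clock
};

class CounterControlledL1 {
    static constexpr uint8_t kRetentionTicks = 3; // data expires after this many ticks
    std::vector<SttRamBlock> blocks;

public:
    explicit CounterControlledL1(std::size_t numBlocks) : blocks(numBlocks) {}

    // A write restores full retention, so the counter resets.
    void onWrite(std::size_t idx) {
        blocks[idx] = {true, true, 0};
    }

    // Advance all counters once per retention tick. Instead of refreshing an
    // expiring block, treat it as dead: write back dirty data, then invalidate.
    void onRetentionTick() {
        for (auto& b : blocks) {
            if (!b.valid) continue;
            if (++b.ticks >= kRetentionTicks) {
                if (b.dirty) writeBackToL2(b); // preserve correctness without refresh
                b.valid = false;               // drop the (likely dead) block
            }
        }
    }

private:
    void writeBackToL2(SttRamBlock& b) { b.dirty = false; /* model an L2 write */ }
};
```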
Information Security and Assurance | 2011
Yifan Hu; Baozhong Yu; Jianliang Ma; Tianzhou Chen
With the development of the Graphics Processing Unit (GPU) and the Compute Unified Device Architecture (CUDA) platform, researchers have shifted their attention to general-purpose computing on GPUs. In this paper, we present a novel parallel approach to running the artificial fish swarm algorithm (AFSA) on a GPU. Experiments are conducted by running AFSA on both GPU and CPU to optimize four benchmark test functions. With the same optimization quality, the GPU-based AFSA (GPU-AFSA) runs up to 30 times as fast as the CPU-based AFSA (CPU-AFSA). To the best of our knowledge, this is the first implementation of AFSA on a GPU.
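The parallelism AFSA exposes is per-fish: each fish updates its position independently, so each update maps naturally to one GPU thread. Below is a C++ sketch of the "prey" behavior that such a thread would evaluate; the 1-D position, toy fitness function, and parameter names are illustrative assumptions, not the paper's implementation.

```cpp
// Sketch of the AFSA prey step targeted by GPU parallelization
// (toy objective and parameters; each loop iteration = one GPU thread).
#include <cmath>
#include <random>
#include <vector>

struct Fish { double x; };                  // 1-D position for simplicity

double fitness(double x) { return -x * x; } // toy objective: maximize -x^2

void preyStep(std::vector<Fish>& school, double visual, double step,
              int tries, std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(-1.0, 1.0);
    // On a GPU, this outer loop becomes one thread per fish.
    for (Fish& f : school) {
        for (int t = 0; t < tries; ++t) {
            double candidate = f.x + visual * uni(rng);  // look around
            if (fitness(candidate) > fitness(f.x)) {     // better spot found
                // Move one step toward the candidate (sign of the direction).
                f.x += step * (candidate - f.x) / std::fabs(candidate - f.x);
                break;
            }
        }
        // If no better spot was found, a random move would follow (omitted).
    }
}
```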
Parallel and Distributed Computing: Applications and Technologies | 2014
Jianliang Ma; Jinglei Meng; Tianzhou Chen; Qingsong Shi; Minghui Wu; Li Liu
The shared last-level cache (SLLC) in a heterogeneous multicore system is an important memory component that is shared, and competed for, by multiple cores, so improving SLLC performance has become an important research area. Last-level cache (LLC) bypassing, which routes a portion of memory requests around the LLC, is one of the most effective techniques; the bypassed requests are sent directly to off-chip main memory (DRAM) rather than being dropped. We find that bypassed requests severely disturb the original scheduling sequence in the memory controller (MC), and that immoderate bypassing also upsets MC load balance. We therefore propose a 3-step method that adjusts the memory scheduling algorithm to optimize LLC bypassing performance. The first step adds an independent bypass stream for bypassed requests. The second step schedules the bypass stream with a smaller probability than the normal GPU stream. The third step adds a guard mechanism to the MC: by dynamically setting and revoking the guard, we avoid unbalanced bypassing. As a case study, we applied the 3-step method to two modern memory schedulers; the experimental results show that with the 3-step method, both schedulers improve system performance noticeably.
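A minimal C++ sketch of the three steps together follows: a separate bypass queue (step 1), probabilistic arbitration that favors the normal stream (step 2), and a guard that throttles bypassing when the bypass queue grows too long (step 3). The queue layout, probability, and guard threshold are illustrative assumptions.

```cpp
// Sketch of a bypass-aware memory scheduler (assumed parameters and structure).
#include <cstddef>
#include <cstdint>
#include <deque>
#include <random>

struct MemRequest { uint64_t addr; };

class BypassAwareScheduler {
    std::deque<MemRequest> normalQ;  // original request stream
    std::deque<MemRequest> bypassQ;  // step 1: independent bypass stream
    double bypassProb = 0.25;        // step 2: bypass served less often
    std::size_t guardLimit = 64;     // step 3: cap on pending bypass requests
    std::mt19937 rng{42};

public:
    // Step 3: when the guard is active (bypass queue too long), the LLC
    // should stop bypassing new requests to keep the MC balanced.
    bool guardActive() const { return bypassQ.size() >= guardLimit; }

    void enqueueNormal(MemRequest r) { normalQ.push_back(r); }
    void enqueueBypass(MemRequest r) { bypassQ.push_back(r); }

    // Pick the next request: prefer the normal stream, serving the bypass
    // stream only with probability bypassProb when both have work.
    bool next(MemRequest& out) {
        std::bernoulli_distribution pickBypass(bypassProb);
        bool useBypass = !bypassQ.empty() &&
                         (normalQ.empty() || pickBypass(rng));
        auto& q = useBypass ? bypassQ : normalQ;
        if (q.empty()) return false;
        out = q.front();
        q.pop_front();
        return true;
    }
};
```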
Future Generation Computer Systems | 2012
Jianliang Ma; Chunhao Wang; Baozhong Yu; Tianzhou Chen
Executing sequential programs on multiple cores is crucial for exploiting Instruction Level Parallelism (ILP) in Chip Multi-Processor (CMP) architectures. One widely used method for steering instructions across cores is dependency-based, but it requires a sophisticated steering mechanism and incurs considerable hardware complexity and die-area overhead. This paper presents the Global Register Alias Table (GRAT), a structure that can be used in a CMP architecture to facilitate sequential program execution across cores. The GRAT drastically reduces the area overhead and design complexity of instruction steering without introducing additional programming effort or compiler support. Dynamic reconfiguration is also implemented to support efficient parallel program execution. In our evaluation, the results show that our design performs within 5.9% of Core Fusion, a recent proposal that requires a complex steering unit.
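As a rough illustration of such a structure, here is a C++ sketch of a global alias table mapping each architectural register to the core and physical register holding its latest value. The entry layout and method names are assumptions; the paper's exact fields may differ.

```cpp
// Sketch of a Global Register Alias Table (entry layout is hypothetical).
#include <array>
#include <cstddef>
#include <cstdint>

struct GratEntry {
    uint8_t  producerCore = 0;  // core holding the latest value
    uint16_t physReg      = 0;  // physical register on that core
    bool     valid        = false;
};

class GlobalRegisterAliasTable {
    static constexpr std::size_t kArchRegs = 32;
    std::array<GratEntry, kArchRegs> table{};

public:
    // A renamed instruction on `core` produces architectural register `arch`:
    // record where the value will live so consumers on other cores can find it.
    void rename(uint8_t arch, uint8_t core, uint16_t phys) {
        table[arch] = {core, phys, true};
    }

    // A consumer looks up its source operand; if the producer is on another
    // core, the value is forwarded between cores rather than steered by a
    // complex dependence-based mechanism.
    bool lookup(uint8_t arch, GratEntry& out) const {
        if (!table[arch].valid) return false;
        out = table[arch];
        return true;
    }
};
```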
International Conference on Intelligent Computation Technology and Automation | 2011
Fuming Qiao; Baozhong Yu; Jianliang Ma; Tianzhou Chen; Tongsen Hu
The Least Recently Used (LRU) policy is commonly employed to manage the shared L2 cache in Chip Multiprocessors. However, previous studies show that LRU has some deficiencies: in particular, it may perform considerably badly when an application's working set is larger than the L2 cache, because the cache then holds a large number of less-reused lines that are never reused or reused only a few times. Cache performance can be improved significantly if, for a time quantum, we keep frequently reused lines rather than less-reused lines in the cache. This paper proposes a new architecture called the Shared Less Reused Filter (SLRF), which applies a less-reused filter that filters out not only never-reused lines but less-reused lines in general, adapted to the Chip Multiprocessor context. Our experiments on 11 SPLASH-2 multithreaded benchmarks demonstrate that augmenting a 2 MB LRU-managed L2 cache with an SLRF that has a 256 KB filter buffer improves IPC by 13.43% relative to the uniprocessor context and reduces the average MPKI by 18.20%.
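The filtering idea can be sketched in a few lines of C++: lines evicted with low reuse counts are parked in a small FIFO filter buffer instead of occupying the cache, and a line that hits in the buffer is promoted back. The reuse threshold, buffer organization, and method names are illustrative assumptions, not the paper's exact design.

```cpp
// Sketch of a less-reused filter with a FIFO filter buffer (assumed parameters).
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

class LessReusedFilter {
    static constexpr uint32_t    kReuseThreshold = 2;    // below this => "less reused"
    static constexpr std::size_t kBufferLines    = 4096; // ~256 KB of 64 B lines
    std::list<uint64_t> fifo;                     // filter buffer, FIFO order
    std::unordered_map<uint64_t, uint32_t> hits;  // membership of buffered lines

public:
    // On L2 eviction: a line reused often enough is kept in the cache; a
    // less-reused line is parked in the filter buffer instead.
    bool shouldKeepInCache(uint64_t line, uint32_t reuseCount) {
        if (reuseCount >= kReuseThreshold) return true;
        if (fifo.size() >= kBufferLines) {        // evict the oldest buffered line
            hits.erase(fifo.front());
            fifo.pop_front();
        }
        fifo.push_back(line);
        hits[line] = 0;
        return false;
    }

    // On an access that hits the filter buffer: the line proved useful after
    // all, so remove it from the buffer and promote it back into L2.
    bool promoteOnHit(uint64_t line) {
        auto it = hits.find(line);
        if (it == hits.end()) return false;
        fifo.remove(line);
        hits.erase(it);
        return true;  // caller reinserts the line into the L2 cache
    }
};
```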
Archive | 2010
Lingxiang Xiang; Tianzhou Chen; Lianghua Miao; Guanjun Jiang; Fuming Qiao; Du Chen; Jianliang Ma; Chunhao Wang; Tiefei Zhang; Man Cao
Archive | 2009
Tianzhou Chen; Tiefei Zhang; Lingxiang Xiang; Lianghua Miao; Chunhao Wang; Man Cao; Jianliang Ma; Jiangwei Huang; Fuming Qiao; Du Chen
Archive | 2012
Tianzhou Chen; Baozhong Yu; Jinming Le; Jianliang Ma; Fuming Qiao
Archive | 2011
Tianzhou Chen; Jinming Le; Jianliang Ma; Fuming Qiao; Baozhong Yu