Baozhong Yu
Zhejiang University
Publication
Featured research published by Baozhong Yu.
IEEE International Conference on Dependable, Autonomic and Secure Computing | 2011
Baozhong Yu; Jianliang Ma; Tianzhou Chen; Minghui Wu
Last-level caches (LLCs) have grown large, with significant power consumption, and as their capacity increases they become increasingly inefficient: recent studies show that a large fraction of cache blocks are dead for most of their time in the cache. There is thus a growing need for LLC management that reduces the number of dead blocks, yet handling dead blocks through placement and replacement operations itself carries a significant power cost. In this paper, we introduce the global priority table predictor, a technique for determining a cache block's priority when it is inserted into the LLC. It is similar in spirit to previous predictors, such as reuse-distance and dead-block predictors. The global priority table is indexed by a hash of the block address and stores the priority value of the associated cache block. This priority value can drive dead-block replacement and bypass optimizations, allowing a large number of dead blocks to be bypassed. The technique achieves an average reduction of 13.2% in LLC misses for twenty single-threaded workloads from the SPEC2006 suite and 29.9% for ten multi-programmed workloads. It also yields a geometric mean speedup of 8.6% for the single-threaded workloads and a geometric mean normalized weighted speedup of 39.1% for the multi-programmed workloads.
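A minimal software sketch of the idea described above, not the paper's actual hardware design: a table indexed by a hash of the block address holds a small saturating priority counter, which is consulted at LLC insertion time to decide whether a block should bypass the cache. The table size, counter width, update policy, and threshold below are all illustrative assumptions.

```python
class PriorityTablePredictor:
    """Toy model of a global priority table predictor (all parameters
    and the update policy are assumptions, not the paper's design)."""

    def __init__(self, size=4096, max_priority=7):
        self.size = size
        self.max_priority = max_priority
        self.table = [max_priority // 2] * size   # start at a neutral priority

    def _index(self, block_addr):
        return hash(block_addr) % self.size       # hash of the block address

    def on_reuse(self, block_addr):
        # Block proved live: raise its priority (saturating)
        i = self._index(block_addr)
        self.table[i] = min(self.table[i] + 1, self.max_priority)

    def on_dead_eviction(self, block_addr):
        # Block was evicted without reuse: lower its priority (saturating)
        i = self._index(block_addr)
        self.table[i] = max(self.table[i] - 1, 0)

    def should_bypass(self, block_addr, threshold=1):
        # Blocks with low predicted priority are treated as dead on
        # insertion and bypass the LLC instead of displacing live lines
        return self.table[self._index(block_addr)] < threshold
```

A block whose hash bucket has repeatedly produced dead evictions sinks below the threshold and is bypassed on its next insertion attempt; a reuse raises the bucket back up.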
Information Security and Assurance | 2011
Yifan Hu; Baozhong Yu; Jianliang Ma; Tianzhou Chen
With the development of the Graphics Processing Unit (GPU) and the Compute Unified Device Architecture (CUDA) platform, researchers have shifted their attention to general-purpose computing on the GPU. In this paper, we present a novel parallel approach for running the artificial fish swarm algorithm (AFSA) on the GPU. Experiments are conducted by running AFSA on both GPU and CPU to optimize four benchmark test functions. With the same optimization performance, the GPU-based AFSA (GPU-AFSA) runs up to 30 times faster than the CPU-based AFSA (CPU-AFSA). To the best of our knowledge, this is the first implementation of AFSA on a GPU.
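The per-fish logic that a GPU version would evaluate in parallel (one thread per fish) can be sketched on the CPU. This is a minimal sequential sketch modeling only the prey behaviour of AFSA; the swarm and follow behaviours, and all parameter values here, are omitted or assumed and do not reflect the paper's implementation.

```python
import math
import random

def sphere(x):
    """Benchmark test function: f(x) = sum of squares, minimum 0 at origin."""
    return sum(v * v for v in x)

def afsa_minimize(f, dim=2, n_fish=30, visual=1.0, step=0.3, iters=300, seed=1):
    """Minimal AFSA sketch: each fish repeatedly tries a random point
    within its visual range (prey behaviour) and takes a bounded step
    toward it when that point has a better fitness."""
    rng = random.Random(seed)
    fish = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_fish)]
    for _ in range(iters):
        for i, x in enumerate(fish):
            # Prey: sample a candidate within the fish's visual range
            cand = [v + visual * rng.uniform(-1, 1) for v in x]
            if f(cand) < f(x):
                # Move a random fraction of `step` toward the candidate
                d = math.dist(cand, x) or 1e-12
                fish[i] = [v + step * rng.random() * (c - v) / d
                           for v, c in zip(x, cand)]
    best = min(fish, key=f)
    return best, f(best)
```

On a GPU, the inner per-fish loop is the natural unit of parallelism: every fish evaluates its candidate independently, so one CUDA thread per fish suffices, which is what makes the reported speedup plausible.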
Future Generation Computer Systems | 2012
Jianliang Ma; Chunhao Wang; Baozhong Yu; Tianzhou Chen
Executing a sequential program on multiple cores is crucial for exploiting Instruction Level Parallelism (ILP) on Chip Multi-Processor (CMP) architectures. One widely used method for steering instructions across cores is based on data dependencies; however, it requires a sophisticated steering mechanism and brings considerable hardware complexity and die-area overhead. This paper presents the Global Register Alias Table (GRAT), a structure that can be used in a CMP architecture to facilitate sequential program execution across cores. The GRAT drastically reduces the area overhead and design complexity of instruction steering without introducing additional programming effort or compiler support. Dynamic reconfiguration is also implemented to support efficient parallel program execution. In our evaluation, the results show that our design performs within 5.9% of Core Fusion, a recent proposal that requires a complex steering unit.
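As a rough illustration of the concept only, not the paper's hardware design: a global register alias table can be modeled as a map from each architectural register to the core that last produced its value, so that a consuming instruction is steered to the core already holding its source operand. The steering policy and all names below are assumptions.

```python
class GlobalRegisterAliasTable:
    """Toy software model of a GRAT: record, per architectural register,
    which core produced the latest value, and steer each consumer to a
    producing core (policy here is a simple first-match assumption)."""

    def __init__(self, n_arch_regs=32):
        # None means the register's value is in committed/memory state
        self.producer = [None] * n_arch_regs

    def steer(self, srcs, dest, default_core=0):
        """Pick a core for an instruction reading `srcs`, writing `dest`."""
        core = next((self.producer[r] for r in srcs
                     if self.producer[r] is not None), default_core)
        self.producer[dest] = core  # dest's value will now live on that core
        return core
```

Because each lookup is a single indexed read of a small shared table, this avoids the pairwise dependency analysis a full steering unit would need, which is the intuition behind the area and complexity savings the abstract claims.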
International Conference on Intelligent Computation Technology and Automation | 2011
Fuming Qiao; Baozhong Yu; Jianliang Ma; Tianzhou Chen; Tongsen Hu
The Least Recently Used (LRU) policy is commonly employed to manage the shared L2 cache in Chip Multiprocessors. However, previous studies show that LRU has some deficiencies. In particular, it can perform considerably poorly when an application's working set is larger than the L2 cache, because the cache then holds a large number of less-reused lines that are never reused, or reused only a few times. Cache performance can be improved significantly if, for a time quantum, we keep well-reused lines in the cache instead of less-reused ones. This paper proposes a new architecture called the Shared Less Reused Filter (SLRF), which applies a less-reused filter that screens out less-reused lines, rather than only never-reused lines, in the context of Chip Multiprocessors. Our experiments on 11 SPLASH-2 multithreaded benchmarks demonstrate that augmenting a 2 MB LRU-managed L2 cache with an SLRF containing a 256 KB filter buffer improves IPC by 13.43% relative to the uniprocessor configuration and reduces the average MPKI by 18.20%.
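The filtering idea above can be sketched as follows. This is an illustrative model, not the paper's SLRF design: new lines first enter a small filter buffer, and only lines reused there at least a threshold number of times are promoted into the LRU-managed main cache, so one-shot streaming lines never displace well-reused ones. Sizes and the threshold are assumptions.

```python
from collections import OrderedDict

class LessReusedFilter:
    """Toy model of a reuse filter in front of an LRU cache (parameters
    and promotion policy are illustrative assumptions)."""

    def __init__(self, main_size, filter_size, threshold=1):
        self.main_size = main_size
        self.filter_size = filter_size
        self.threshold = threshold
        self.main = OrderedDict()   # LRU main cache: addr -> None
        self.filt = OrderedDict()   # filter buffer: addr -> reuse count

    def access(self, addr):
        """Return True on a hit (in either structure), False on a miss."""
        if addr in self.main:                  # main-cache hit: refresh LRU
            self.main.move_to_end(addr)
            return True
        if addr in self.filt:                  # filter hit: count the reuse
            self.filt[addr] += 1
            self.filt.move_to_end(addr)
            if self.filt[addr] >= self.threshold:
                del self.filt[addr]            # proven reuse: promote
                self.main[addr] = None
                if len(self.main) > self.main_size:
                    self.main.popitem(last=False)
            return True
        self.filt[addr] = 0                    # miss: line starts in filter
        if len(self.filt) > self.filter_size:
            self.filt.popitem(last=False)      # less-reused line falls out
        return False
```

With this split, a streaming workload larger than the cache churns only the small filter buffer, while lines that demonstrate reuse earn a place in the main cache, which is the behaviour the IPC and MPKI improvements rely on.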
Archive | 2012
Tianzhou Chen; Baozhong Yu; Jinming Le; Jianliang Ma; Fuming Qiao
Archive | 2011
Tianzhou Chen; Jinming Le; Jianliang Ma; Fuming Qiao; Baozhong Yu
Archive | 2011
Tianzhou Chen; Fuming Qiao; Jianliang Ma; Baozhong Yu; Jinming Le
Archive | 2011
Tianzhou Chen; Baozhong Yu; Jinming Le; Fuming Qiao; Jianliang Ma
Archive | 2012
Tianzhou Chen; Baozhong Yu; Jianliang Ma; Yifan Hu; Minjiao Ye
Archive | 2011
Tianzhou Chen; Baozhong Yu; Fuming Qiao; Jianliang Ma; Jinming Le