Zichao Xie
Peking University
Publications
Featured research published by Zichao Xie.
international conference on parallel architectures and compilation techniques | 2012
Lingda Li; Dong Tong; Zichao Xie; Junlin Lu; Xu Cheng
In the last-level cache, a large number of blocks have reuse distances greater than the available cache capacity. Cache performance and efficiency can be improved if some subset of these distant-reuse blocks can reside in the cache longer. The bypass technique is an effective and attractive solution that prevents the insertion of harmful blocks.
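A minimal Python sketch of the bypass idea described in the abstract: a fill whose predicted reuse distance exceeds the cache capacity is simply not inserted, so resident blocks with nearer reuse are not displaced. The predictor interface, the LRU baseline, and the example threshold are illustrative assumptions, not details from the paper.

```python
from collections import OrderedDict

class BypassingLLC:
    def __init__(self, capacity, predict_reuse_distance):
        self.capacity = capacity               # number of blocks the LLC can hold
        self.predict = predict_reuse_distance  # hypothetical reuse-distance predictor
        self.blocks = OrderedDict()            # LRU order: oldest first

    def access(self, addr):
        if addr in self.blocks:                # hit: refresh LRU position
            self.blocks.move_to_end(addr)
            return True
        # Miss: bypass the fill if the predicted reuse distance exceeds capacity.
        if self.predict(addr) > self.capacity:
            return False                       # serve the request but do not insert
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)    # evict the LRU block
        self.blocks[addr] = True
        return False

# Example: streaming blocks (addresses >= 1000) are bypassed, hot blocks are kept.
llc = BypassingLLC(capacity=4, predict_reuse_distance=lambda a: 100 if a >= 1000 else 1)
for addr in [1, 2, 1000, 1001, 1, 2]:
    llc.access(addr)
print(sorted(llc.blocks))   # -> [1, 2]: distant-reuse blocks never displaced the hot ones
```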
international conference on computer design | 2011
Zichao Xie; Dong Tong; Mingkai Huang; Xiaoyin Wang; Qinqing Shi; Xu Cheng
Indirect-branch prediction is becoming more important for modern processors as more programs are written in object-oriented languages. Previous hardware-based indirect-branch predictors generally require significant hardware storage or use aggressive algorithms that make the processor front-end more complex. In this paper, we propose a fast and cost-efficient indirect-branch prediction strategy, called Target Address Pointer (TAP) Prediction. TAP Prediction reuses the history-based branch direction predictor to detect occurrences of indirect branches, and then stores indirect-branch targets in the Branch Target Buffer (BTB). The key idea of TAP Prediction is to predict target address pointers, which generate virtual addresses that index the targets stored in the BTB, rather than to predict the indirect-branch targets directly. TAP Prediction also reuses the branch direction predictor to construct several small predictors. When fetching an indirect branch, these small predictors work in parallel to generate the target address pointer. TAP Prediction then accesses the BTB to fetch the predicted indirect-branch target using the generated virtual address. This mechanism achieves a prediction latency comparable to that of dedicated-storage predictors without requiring large amounts of additional storage. Our evaluation shows that for three representative direction predictors (Hybrid, Perceptrons, and O-GEHL), TAP schemes improve performance by 18.19%, 21.52%, and 20.59%, respectively, over the baseline processor with the most commonly used BTB prediction. Compared with previous hardware-based indirect-branch predictors, the TAP-Perceptrons scheme achieves performance improvement equivalent to that provided by a 48KB TTC predictor, and it also outperforms the VPC predictor by 14.02%.
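A sketch, in Python, of the prediction flow the abstract outlines: history-indexed small predictors vote on a target address pointer, and the pointer (not the target itself) forms the virtual address that selects the BTB entry holding the indirect-branch target. The table sizes, the hashing, the voting rule, and the pointer-allocation policy are assumptions made for the sketch.

```python
def hash_index(pc, history, table_size):
    return (pc ^ history) % table_size

class TAPPredictor:
    def __init__(self, num_small_predictors=4, entries=64):
        # Small pointer predictors carved out of the direction-predictor storage.
        self.tables = [[0] * entries for _ in range(num_small_predictors)]
        self.entries = entries
        self.btb = {}                      # virtual address -> indirect-branch target

    def _pointer(self, pc, history):
        # Each small predictor produces a candidate pointer in parallel;
        # here a simple majority vote picks the final pointer.
        votes = [t[hash_index(pc, history >> i, self.entries)]
                 for i, t in enumerate(self.tables)]
        return max(set(votes), key=votes.count)

    def predict(self, pc, history):
        ptr = self._pointer(pc, history)
        vaddr = (pc << 4) | ptr            # branch PC plus pointer form a virtual BTB index
        return self.btb.get(vaddr)         # predicted indirect-branch target (or None)

    def update(self, pc, history, actual_target):
        # Train every small predictor toward the pointer whose BTB slot
        # holds (or will hold) the actual target.
        ptr = self._pointer(pc, history)
        if self.btb.get((pc << 4) | ptr) != actual_target:
            ptr = (ptr + 1) % 16           # allocate a new pointer value on a mispredict
            self.btb[(pc << 4) | ptr] = actual_target
        for i, t in enumerate(self.tables):
            t[hash_index(pc, history >> i, self.entries)] = ptr
```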
international symposium on low power electronics and design | 2013
Zichao Xie; Dong Tong; Xu Cheng
Accurate branch prediction can improve processor performance while reducing energy waste. Although some existing branch predictors have proved effective, they usually require a large amount of storage or complicate the processor front-end. This paper proposes a novel branch prediction technique called History Artificially Selected (HAS) prediction. It is a hardware technique that builds on existing branch predictors to detect history noise and avoid its interference when predicting branches. It separates the original branch predictor into sub-predictors, each of which updates the branch history differently. With the help of history stacks, one sub-predictor saves and restores the branch history at the entrances and exits of loops and subroutines, where history noise usually arises. Using a tournament mechanism, HAS prediction selectively uses the modified branch history, eliminating noise interference while retaining useful history correlations. Our experimental results show that for three representative branch predictors, gshare, perceptron, and TAGE, it reduces the MPKI by 1.49, 2.85, and 1.10, respectively, resulting in 4.55%, 10.16%, and 4.45% performance improvement. It also reduces energy consumption by 4.02%, 7.78%, and 3.91%, respectively.
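A small sketch of the HAS idea in Python: one sub-predictor keeps the raw global history, another saves the history on entry to a loop or subroutine and restores it on exit (discarding the noise accumulated inside), and a per-branch tournament counter picks which history to trust. The table sizes, 2-bit counters, and region hooks are assumptions, not parameters from the paper.

```python
class HASPredictor:
    def __init__(self, bits=12):
        self.mask = (1 << bits) - 1
        self.raw_table = [1] * (1 << bits)     # 2-bit counters, raw history
        self.clean_table = [1] * (1 << bits)   # 2-bit counters, denoised history
        self.choice = {}                       # per-branch tournament counter
        self.raw_hist = 0
        self.clean_hist = 0
        self.stack = []                        # history stack for the denoised history

    def enter_region(self):                    # loop or subroutine entry
        self.stack.append(self.clean_hist)

    def exit_region(self):                     # loop or subroutine exit
        if self.stack:
            self.clean_hist = self.stack.pop() # discard in-region history noise

    def predict(self, pc):
        raw = self.raw_table[(pc ^ self.raw_hist) & self.mask] >= 2
        clean = self.clean_table[(pc ^ self.clean_hist) & self.mask] >= 2
        use_clean = self.choice.get(pc, 2) >= 2
        return clean if use_clean else raw

    def update(self, pc, taken):
        ri = (pc ^ self.raw_hist) & self.mask
        ci = (pc ^ self.clean_hist) & self.mask
        raw_ok = (self.raw_table[ri] >= 2) == taken
        clean_ok = (self.clean_table[ci] >= 2) == taken
        if raw_ok != clean_ok:                 # train the tournament chooser
            c = self.choice.get(pc, 2)
            self.choice[pc] = min(3, c + 1) if clean_ok else max(0, c - 1)
        for table, idx in ((self.raw_table, ri), (self.clean_table, ci)):
            table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
        self.raw_hist = ((self.raw_hist << 1) | int(taken)) & self.mask
        self.clean_hist = ((self.clean_hist << 1) | int(taken)) & self.mask
```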
design, automation, and test in europe | 2012
Mingxing Tan; Xianhua Liu; Zichao Xie; Dong Tong; Xu Cheng
Branch prediction is critical for exploiting instruction-level parallelism in modern processors. Previous aggressive branch predictors generally require a significant amount of hardware storage and complexity to pursue high prediction accuracy. This paper proposes the Compiler-guided History Stack (CHS), an energy-efficient compiler-microarchitecture cooperative technique for branch prediction. The key idea is to track very-long-distance branch correlation using a low-cost compiler-guided history stack. It relies on the compiler to identify branch correlation based on two program substructures, loops and procedures, and to feed the information to the predictor by inserting guiding instructions. At runtime, the processor dynamically saves and restores the global history using a low-cost history stack structure according to the compiler-guided information. The modification of the global history enables the predictor to track very-long-distance branch correlation and thus improves prediction accuracy. We show that CHS can be combined with most existing branch predictors and that it is especially effective with small and simple predictors. Our evaluations show that the CHS technique reduces average branch mispredictions by 28.7% over the gshare predictor, resulting in an average performance improvement of 10.4%. Furthermore, it also improves aggressive perceptron, O-GEHL, and TAGE predictors.
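An illustrative Python sketch of the runtime side of the CHS idea: compiler-inserted guiding instructions (modeled here as explicit push/pop calls) tell the core where to save and restore the global history, so a branch after a loop or call sees the same history correlation it saw before entering it. The gshare-style baseline predictor, stack depth, and hint encoding are assumptions for the sketch.

```python
class CompilerGuidedHistoryStack:
    def __init__(self, bits=14, depth=16):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)   # 2-bit saturating counters
        self.ghr = 0                     # global history register
        self.stack = []                  # low-cost history stack
        self.depth = depth

    # Executed when the core decodes a compiler-inserted guiding instruction.
    def chs_push(self):
        if len(self.stack) < self.depth:
            self.stack.append(self.ghr)

    def chs_pop(self):
        if self.stack:
            self.ghr = self.stack.pop()  # restore pre-loop / pre-call history

    def predict(self, pc):
        return self.table[(pc ^ self.ghr) & self.mask] >= 2

    def update(self, pc, taken):
        i = (pc ^ self.ghr) & self.mask
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

# Usage around a loop the compiler has marked:
p = CompilerGuidedHistoryStack()
p.chs_push()                 # guiding instruction before the loop
for bit in [1, 0, 1, 0, 1]:  # loop-internal branches pollute the history...
    p.update(pc=0x400, taken=bool(bit))
p.chs_pop()                  # ...but the post-loop branch sees the saved history
```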
international conference on computer design | 2012
Lingda Li; Dong Tong; Zichao Xie; Junlin Lu; Xu Cheng
Inclusive cache hierarchies are widely adopted in modern processors, since they simplify the implementation of cache coherence. However, they sacrifice some performance to guarantee inclusion. Many recent intelligent management policies have been proposed to improve last-level cache (LLC) performance by evicting blocks with poor locality earlier. Unfortunately, they are inapplicable to inclusive LLCs. In this paper, we propose the Two-level Eviction Priority (TEP) policy. Besides the eviction priority provided by the baseline replacement policy, TEP appends an additional, higher level of eviction priority to LLC blocks, which is decided at insertion time and cannot be changed during a block's lifetime in the LLC. When blocks with high eviction priority are no longer in the inner caches, they are evicted from the LLC preferentially. Thus, the LLC can retain more useful blocks to improve performance. TEP cooperates well with various baseline replacement policies. Our evaluation shows that TEP with NRU can improve the performance of inclusive LLCs significantly while requiring negligible extra storage. It also outperforms other recent proposals, including QBS, DIP, and DRRIP.
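A sketch of the TEP victim choice for one inclusive-LLC set, in Python: every block receives an extra, immutable eviction-priority bit at insertion time; when a victim is needed, a high-priority block that is no longer held in the inner caches is preferred, and the baseline NRU policy is only the fallback. The set layout and the inner-cache probe callback are simplified assumptions.

```python
class TEPSet:
    def __init__(self, ways, in_inner_cache):
        self.ways = ways
        self.blocks = []                 # list of dicts: addr, nru, high_prio
        self.in_inner = in_inner_cache   # callback probing the inner caches

    def insert(self, addr, high_prio):
        if len(self.blocks) >= self.ways:
            self.evict()
        # high_prio is decided once, at insertion, and never changed afterwards.
        self.blocks.append({"addr": addr, "nru": 0, "high_prio": high_prio})

    def evict(self):
        # Level 1: prefer high-priority blocks that have left the inner caches,
        # so inclusion of data still cached in the inner levels is never broken.
        for b in self.blocks:
            if b["high_prio"] and not self.in_inner(b["addr"]):
                self.blocks.remove(b)
                return b["addr"]
        # Level 2: fall back to the baseline NRU replacement.
        victim = next((b for b in self.blocks if b["nru"] == 0), self.blocks[0])
        self.blocks.remove(victim)
        return victim["addr"]

    def touch(self, addr):
        for b in self.blocks:
            if b["addr"] == addr:
                b["nru"] = 1             # mark as recently used under NRU
```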
international conference on computer design | 2009
Zichao Xie; Dong Tong; Xu Cheng
Set-associative instruction caches achieve low miss rates at the expense of significant energy dissipation. Previous energy-efficient approaches usually suffer from performance degradation and redundant extension bits. In this paper, we propose a Way History Oriented Low Energy Instruction Cache (WHOLE-Cache) design for single-issue, in-order execution processors. The WHOLE-Cache design not only achieves significant energy reduction by effectively reducing the dynamic energy dissipation of the set-associative instruction cache, but also incurs no additional cycle penalties. Tag comparison results are stored in either the Branch Target Buffer (BTB) or the Instruction Cache (I-Cache) to avoid tag checks and unnecessary way activation for subsequent accesses to visited cache lines. The extended BTB uses way history bits for branch instructions, while the I-Cache extension bits are used when fetching consecutive instructions residing in different cache lines. A valid flag is associated with each stored tag comparison result to indicate whether the instruction to be fetched resides in the recorded location. A simple invalidation scheme is implemented in the cache-miss replacement operation: whenever a cache line is replaced, the pointers to it, which reside in the BTB or other I-Cache lines, are invalidated accordingly. We model the WHOLE-Cache design in Verilog. Deriving basic parameters from TSMC 65nm technology, we use the Wattch simulator to evaluate the performance and energy reduction of the WHOLE-Cache in the instruction fetch stage. We use SPEC2000 and MediaBench as benchmarks. Compared with a conventional 4-way set-associative I-Cache, the energy consumption of the WHOLE-Cache is reduced by 65% without any performance penalty.
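A condensed Python sketch of the fetch path described above: if a previously recorded way pointer for the fetch address is still valid, only that way is activated; otherwise all ways are checked in parallel and the discovered way is recorded for the next visit, with pointers invalidated on line replacement. The BTB and I-Cache extension bits are flattened into a single dictionary here, and the geometry is an assumption.

```python
class WholeCacheFetch:
    def __init__(self, ways=4, sets=64):
        self.ways, self.sets = ways, sets
        self.lines = {}            # (set_index, way) -> tag
        self.way_hint = {}         # fetch address -> (way, valid flag)
        self.energy_way_reads = 0  # counts activated ways, as a rough energy proxy

    def _lookup_all_ways(self, set_index, tag):
        self.energy_way_reads += self.ways            # full parallel tag check
        for way in range(self.ways):
            if self.lines.get((set_index, way)) == tag:
                return way
        return None

    def fetch(self, addr):
        set_index, tag = addr % self.sets, addr // self.sets
        hint = self.way_hint.get(addr)
        if hint and hint[1]:                          # valid recorded way: single-way access
            self.energy_way_reads += 1
            return hint[0]
        way = self._lookup_all_ways(set_index, tag)   # no valid hint: conventional access
        if way is not None:
            self.way_hint[addr] = (way, True)         # record the way for the next visit
        return way

    def replace_line(self, set_index, way, new_tag):
        # Invalidate every pointer to the victim line, as in the invalidation scheme.
        for addr, (w, _) in list(self.way_hint.items()):
            if addr % self.sets == set_index and w == way:
                self.way_hint[addr] = (w, False)
        self.lines[(set_index, way)] = new_tag
```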
Journal of Computer Science and Technology | 2014
Zichao Xie; Dong Tong; Mingkai Huang
Nowadays, energy efficiency has become a primary design metric in chip development. To pursue higher energy efficiency, processor architects should reduce or eliminate unnecessary energy dissipation. Indirect-branch prediction has become a performance bottleneck, especially for applications written in object-oriented languages. Previous hardware-based indirect-branch predictors are generally inefficient, for they either require significant hardware storage or predict indirect-branch targets slowly. In this paper, we propose an energy-efficient indirect-branch prediction technique called TAP (target address pointer) prediction. Its key idea has two parts: utilizing specific hardware pointers to accelerate the indirect-branch prediction flow, and reusing existing processor components to reduce additional hardware cost and power consumption. When fetching an indirect branch, TAP prediction first obtains specific pointers, called target address pointers, from the conditional branch predictor, and then uses these pointers to generate virtual addresses that index the indirect-branch targets. This technique achieves a prediction latency similar to that of dedicated-storage techniques without requiring large amounts of additional storage. Our evaluation shows that TAP prediction with representative state-of-the-art branch predictors improves performance significantly over the baseline processor. Compared with hardware-based indirect-branch predictors, the TAP-Perceptron scheme achieves performance improvement equivalent to that provided by an 8K-entry TTC predictor, and it also outperforms the VPC predictor.
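A small Python illustration of the pointer-to-virtual-address step this abstract describes: the target address pointer obtained from the conditional branch predictor is combined with the branch PC into a virtual address, so several targets of the same indirect branch occupy distinct BTB entries. The pointer width and address layout are assumptions, not values from the paper.

```python
POINTER_BITS = 3   # assumed width of a target address pointer

def btb_virtual_address(branch_pc: int, pointer: int) -> int:
    """Distinct pointers map one indirect branch to distinct BTB slots."""
    return (branch_pc << POINTER_BITS) | (pointer & ((1 << POINTER_BITS) - 1))

# Two targets of one virtual call site coexist under different pointers.
btb = {}
pc = 0x4005D0
btb[btb_virtual_address(pc, 0)] = 0x400A10   # first observed target (hypothetical)
btb[btb_virtual_address(pc, 1)] = 0x400B20   # a second target of the same branch
assert btb[btb_virtual_address(pc, 1)] == 0x400B20
```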
asia and south pacific design automation conference | 2013
Xianglei Dang; Xiaoyin Wang; Dong Tong; Zichao Xie; Lingda Li; Keyi Wang
As data prefetching is adopted in embedded processors, reducing the energy wasted by useless prefetches is crucial for improving energy efficiency. In this paper, we propose an adaptive prefetch filtering (APF) mechanism to reduce the wasted bandwidth and energy, as well as the cache pollution, caused by useless prefetches. APF records the prefetch-victim address pairs of issued prefetches and collects information about which address in each pair is first accessed by the processor, using it to guide the filtering of newly generated useless prefetches. Meanwhile, filtered prefetches are recorded to build a feedback mechanism that avoids filtering useful prefetches. Experimental results demonstrate that APF reduces useless prefetches by an average of 53.81% with a mere 5.28% reduction in useful prefetches, thus reducing memory access bandwidth consumption by 59.92% and L2 cache energy by 6.19%. APF also improves the performance of several programs by reducing the cache pollution incurred by useless prefetches, yielding an average performance improvement of 2.12%.
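A sketch of the APF bookkeeping in Python: each issued prefetch is recorded together with the address it would evict; whichever of the pair the processor touches first votes the prefetch source's usefulness up or down, and a record of filtered prefetches feeds back so useful ones are not suppressed permanently. The confidence counters, the "region" key, and the threshold are assumptions made for the sketch.

```python
class AdaptivePrefetchFilter:
    def __init__(self, threshold=0):
        self.pending = {}        # prefetch addr -> (region, victim addr), unresolved
        self.victim_of = {}      # victim addr -> prefetch addr (reverse map)
        self.confidence = {}     # region -> signed usefulness counter
        self.filtered = set()    # filtered prefetches, kept for the feedback path
        self.threshold = threshold

    def should_issue(self, region):
        return self.confidence.get(region, 0) >= self.threshold

    def on_prefetch(self, region, pf_addr, victim_addr):
        if not self.should_issue(region):
            self.filtered.add((region, pf_addr))     # remember what was filtered
            return False
        self.pending[pf_addr] = (region, victim_addr)
        self.victim_of[victim_addr] = pf_addr
        return True

    def on_demand_access(self, addr):
        if addr in self.pending:                     # prefetch used first: useful
            region, victim = self.pending.pop(addr)
            self.victim_of.pop(victim, None)
            self.confidence[region] = self.confidence.get(region, 0) + 1
        elif addr in self.victim_of:                 # victim needed first: useless
            pf = self.victim_of.pop(addr)
            region, _ = self.pending.pop(pf)
            self.confidence[region] = self.confidence.get(region, 0) - 1
        # Feedback: a demand access to a filtered prefetch address means the
        # filter was too aggressive, so restore some confidence for that region.
        for region, pf_addr in list(self.filtered):
            if pf_addr == addr:
                self.filtered.discard((region, pf_addr))
                self.confidence[region] = self.confidence.get(region, 0) + 1
```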
Journal of Computer Science and Technology | 2012
Zichao Xie; Dong Tong; Mingkai Huang; Qinqing Shi; Xu Cheng
Predicting indirect-branch targets has become a performance bottleneck for many applications. Previous high-performance indirect-branch predictors usually require significant hardware storage or additional compiler support, which increases the complexity of the processor front-end or the compiler. This paper proposes a complexity-effective indirect-branch prediction mechanism, called Set-Way Index Pointing (SWIP) prediction. It stores multiple indirect-branch targets in different branch target buffer (BTB) entries, whose set indices and way locations are treated as set-way index pointers. These pointers are stored in the existing branch direction predictor. SWIP prediction reuses the branch direction predictor to provide such pointers, and then accesses the pointed BTB entries for the predicted indirect-branch target. Our evaluation shows that SWIP prediction achieves attractive performance improvement without requiring large dedicated storage or additional compiler support. It improves indirect-branch prediction accuracy by 36.5% compared to that of a commonly used BTB, resulting in an average performance improvement of 18.56%. Its energy consumption is also reduced by 14.34% relative to the baseline.
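An illustrative Python sketch of the SWIP lookup: the direction predictor's entry for an indirect branch holds a (set index, way) pointer rather than a taken/not-taken counter, and that pointer selects the BTB entry containing the target. The BTB geometry, the history hashing, and the allocation policy are assumptions for the sketch.

```python
BTB_SETS, BTB_WAYS = 128, 4

class SWIPPredictor:
    def __init__(self, pht_bits=12):
        self.mask = (1 << pht_bits) - 1
        self.pht = {}                                     # history-indexed set-way pointers
        self.btb = [[None] * BTB_WAYS for _ in range(BTB_SETS)]

    def predict(self, pc, history):
        ptr = self.pht.get((pc ^ history) & self.mask)
        if ptr is None:
            return None
        s, w = ptr
        return self.btb[s][w]                             # pointed BTB entry holds the target

    def update(self, pc, history, actual_target):
        idx = (pc ^ history) & self.mask
        if self.predict(pc, history) == actual_target:
            return
        # Place (or find) the target in a BTB entry, then store its set index
        # and way location as the pointer for this branch history.
        s = actual_target % BTB_SETS
        for w in range(BTB_WAYS):
            if self.btb[s][w] in (None, actual_target):
                self.btb[s][w] = actual_target
                self.pht[idx] = (s, w)
                return
        self.btb[s][0] = actual_target                    # simple replacement fallback
        self.pht[idx] = (s, 0)
```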
Archive | 2011
Xu Cheng; Mingxing Tan; Xianhua Liu; Jiyu Zhang; Zichao Xie; Dong Tong