Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Weiyu Tang is active.

Publication


Featured research published by Weiyu Tang.


International Conference on Supercomputing | 1999

Adapting cache line size to application behavior

Alexander V. Veidenbaum; Weiyu Tang; Rajesh K. Gupta; Alexandru Nicolau; Xiaomei Ji

Cache line size has a significant effect on the miss rate and memory traffic. Today's computers use a fixed line size, typically 32B, which may not be optimal for a given application, and the optimal size may also change during application execution. This paper describes a cache in which the line (fetch) size is continuously adjusted by hardware based on observed application accesses to the line. The approach can improve the miss rate, even relative to the best fixed line size, and significantly reduce memory traffic.
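The abstract does not spell out the hardware mechanism, but the core idea can be illustrated in software. The following Python sketch (an assumption-laden illustration, not the authors' design) tracks which words of each fetched block are actually used and grows or shrinks the fetch size per cache set accordingly; the 25%/75% thresholds, set count, and size bounds are made up for the example.

```python
class AdaptiveLineCache:
    """Toy model of a cache whose fetch size adapts to observed line usage."""
    MAX_LINE, MIN_LINE, WORD = 128, 8, 4     # assumed bounds and word size (bytes)

    def __init__(self, num_sets=256):
        self.num_sets = num_sets
        self.fetch_size = [32] * num_sets    # current fetch size per set, in bytes
        self.lines = [None] * num_sets       # each entry: [base_addr, size, used_word_offsets]
        self.hits = self.misses = self.traffic = 0

    def _index(self, addr):
        return (addr // self.MAX_LINE) % self.num_sets

    def access(self, addr):
        idx = self._index(addr)
        line = self.lines[idx]
        if line is not None:
            base, size, used = line
            if base <= addr < base + size:            # hit in the currently fetched block
                used.add((addr - base) // self.WORD)  # remember which word was touched
                self.hits += 1
                return
            # eviction: adapt this set's fetch size to the observed utilization
            utilization = len(used) * self.WORD / size
            if utilization > 0.75:
                self.fetch_size[idx] = min(size * 2, self.MAX_LINE)
            elif utilization < 0.25:
                self.fetch_size[idx] = max(size // 2, self.MIN_LINE)
        # miss: fetch a block of the set's current fetch size, aligned to that size
        self.misses += 1
        size = self.fetch_size[idx]
        base = addr - addr % size
        self.traffic += size
        self.lines[idx] = [base, size, {(addr - base) // self.WORD}]


if __name__ == "__main__":
    cache = AdaptiveLineCache()
    for a in range(0, 64 * 1024, 4):      # dense sequential scan: utilization stays high
        cache.access(a)
    for a in range(0, 64 * 1024, 512):    # sparse strided scan: utilization drops
        cache.access(a)
    print("hits:", cache.hits, "misses:", cache.misses, "traffic bytes:", cache.traffic)
```

A dense sequential scan drives utilization up and the fetch size toward the maximum, while sparse strided accesses drive it back down, which is the adaptive behavior the abstract describes.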


Design, Automation and Test in Europe | 2002

Power savings in embedded processors through decode filter cache

Weiyu Tang; Rajesh K. Gupta; Alexandru Nicolau

In embedded processors, instruction fetch and decode can consume more than 40% of processor power. An instruction filter cache can be placed between the CPU core and the instruction cache to service the instruction stream; power savings in instruction fetch result from accessing a small cache. In this paper, we introduce a decode filter cache that provides a decoded instruction stream. On a hit in the decode filter cache, fetching from the instruction cache and the subsequent decoding are eliminated, which saves power in both instruction fetch and instruction decode. We classify instructions as cacheable or uncacheable depending on their decoded width, and use a sectored cache design so that cacheable and uncacheable instructions can coexist in a decode filter cache sector. Finally, a prediction mechanism is presented to reduce the decode filter cache miss penalty. Experimental results show an average 34% processor power reduction with less than 1% performance degradation.
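As a rough illustration of the sectored organization (slot width, sector size, and capacity are assumptions, not the paper's parameters), the sketch below stores decoded instructions in sectors with per-slot valid bits: instructions whose decoded form is too wide to cache simply leave their slot invalid while the rest of the sector remains usable. For simplicity, pc is treated as a sequential instruction index.

```python
from dataclasses import dataclass, field

SLOT_BITS = 64          # assumed width of one decoded-instruction slot
INSNS_PER_SECTOR = 4    # assumed sector size (decoded instructions per sector)

@dataclass
class Sector:
    tag: int
    valid: list = field(default_factory=lambda: [False] * INSNS_PER_SECTOR)
    decoded: list = field(default_factory=lambda: [None] * INSNS_PER_SECTOR)

class DecodeFilterCache:
    def __init__(self, num_sectors=32):
        self.sectors = [None] * num_sectors

    def lookup(self, pc):
        """Return the decoded instruction on a hit, else None."""
        idx = (pc // INSNS_PER_SECTOR) % len(self.sectors)
        sector = self.sectors[idx]
        off = pc % INSNS_PER_SECTOR
        if sector and sector.tag == pc // INSNS_PER_SECTOR and sector.valid[off]:
            return sector.decoded[off]      # hit: I-cache fetch and decode are skipped
        return None

    def fill(self, pc, decoded_bits, decoded_insn):
        """Insert after decode; only instructions whose decoded form fits a slot
        are cacheable, but both kinds can share a sector via per-slot valid bits."""
        if decoded_bits > SLOT_BITS:
            return                           # uncacheable: leave its slot invalid
        idx = (pc // INSNS_PER_SECTOR) % len(self.sectors)
        tag = pc // INSNS_PER_SECTOR
        sector = self.sectors[idx]
        if sector is None or sector.tag != tag:
            sector = self.sectors[idx] = Sector(tag)   # allocate/replace whole sector
        sector.valid[pc % INSNS_PER_SECTOR] = True
        sector.decoded[pc % INSNS_PER_SECTOR] = decoded_insn


dfc = DecodeFilterCache()
dfc.fill(pc=100, decoded_bits=48, decoded_insn="decoded add r1, r2, r3")
assert dfc.lookup(100) is not None   # hit: instruction fetch and decode bypassed
assert dfc.lookup(101) is None       # invalid slot: fetch and decode normally
```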


International Conference on Computer Design | 2001

Design of a predictive filter cache for energy savings in high performance processor architectures

Weiyu Tang; Rajesh K. Gupta; Alexandru Nicolau

The filter cache has been proposed as an energy-saving architectural feature. A filter cache is placed between the CPU and the instruction cache (I-cache) to provide the instruction stream, and energy savings result from accessing a small cache. There is, however, a loss of performance when instructions are not found in the filter cache. The majority of the filter cache's energy savings come from the temporal reuse of instructions in small loops. We examine successive fetch addresses to dynamically predict whether the next fetch address will hit in the filter cache. When a miss is predicted, we reduce the miss penalty by accessing the I-cache directly. Experimental results show that our next-fetch prediction reduces the performance penalty by more than 91% and is more energy efficient than a conventional filter cache. Our filter cache design achieves average I-cache energy savings of 31% with around 1% performance degradation.
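One simple way to realize such a next-fetch prediction, shown below as a hedged sketch rather than the paper's exact mechanism, is a small table indexed by the current fetch address that remembers whether the following fetch hit the filter cache last time; the table size and last-outcome update rule are assumptions.

```python
class NextFetchPredictor:
    """Predicts whether the next fetch will hit the filter cache (illustrative)."""

    def __init__(self, entries=64):
        self.table = [True] * entries      # optimistic start: probe the filter cache

    def _idx(self, pc):
        return pc % len(self.table)

    def predict(self, current_pc):
        """True -> probe the filter cache for the next fetch; False -> go to I-cache."""
        return self.table[self._idx(current_pc)]

    def update(self, previous_pc, next_hit_in_filter_cache):
        """After the next fetch resolves, remember its outcome under the previous PC."""
        self.table[self._idx(previous_pc)] = next_hit_in_filter_cache
```

On a predicted miss the I-cache is accessed directly, so the cycle otherwise spent probing and missing in the filter cache is avoided, which is where the reported penalty reduction comes from.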


International Conference on ASIC | 2001

Architectural adaptation for power and performance

Weiyu Tang; Alexander V. Veidenbaum; Rajesh K. Gupta

Architectural adaptation provides an attractive means to ensure high performance and low power. Adaptable architectural components support multiple mechanisms that can be tailored to application needs. In this paper, we show the benefits of architectural adaptation for power and performance using the cache memory as an example. Dynamic L0 cache management selects either the L0 cache or the L1 cache for instruction fetch; it reduces average power consumption in the instruction cache by 29.5% with only 0.7% performance degradation. Dynamic fetch size profiling changes the cache fetch size at run time to better exploit locality; it improves average benchmark performance by 15%.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Integrated I-cache Way Predictor and Branch Target Buffer to Reduce Energy Consumption

Weiyu Tang; Alexander V. Veidenbaum; Alexandru Nicolau; Rajesh K. Gupta

In this paper, we present a Branch Target Buffer (BTB) design for energy savings in set-associative instruction caches. We extend the functionality of a BTB by caching way predictions in addition to branch target addresses; way prediction and branch target prediction are done in parallel. Instruction cache energy savings are achieved by accessing only one cache way when a way prediction is available for a fetch. To increase the number of way predictions, and thus the energy savings, we modify the BTB management policy to allocate entries for non-branch instructions. Furthermore, we propose to partition the BTB into ways for branch instructions and ways for non-branch instructions to reduce the BTB energy as well. We evaluate the effectiveness of our BTB design and management policies with SPEC95 benchmarks. The best BTB configuration shows a 74% average energy savings in a 4-way set-associative instruction cache with only 0.1% performance degradation. When the instruction cache energy and the BTB energy are considered together, the average energy-delay product reduction is 65%.
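The sketch below illustrates the idea of caching way predictions alongside branch targets and partitioning the BTB between branch and non-branch entries; the set-associative layout, entry format, and LRU-style replacement are assumptions made for the example, not the evaluated configuration.

```python
class WayPredictingBTB:
    """BTB that also caches I-cache way predictions (illustrative model)."""

    def __init__(self, sets=128, branch_ways=2, nonbranch_ways=2):
        # each set keeps separate ways for branches and for non-branch entries,
        # so non-branch way predictions cannot evict branch targets
        self.sets = sets
        self.branch = [[] for _ in range(sets)]       # entries: (tag, target, way)
        self.nonbranch = [[] for _ in range(sets)]    # entries: (tag, way)
        self.branch_ways = branch_ways
        self.nonbranch_ways = nonbranch_ways

    def lookup(self, pc):
        """Return (target_or_None, predicted_way_or_None) for a fetch at pc."""
        idx, tag = pc % self.sets, pc // self.sets
        for t, target, way in self.branch[idx]:
            if t == tag:
                return target, way        # branch: target and way predicted together
        for t, way in self.nonbranch[idx]:
            if t == tag:
                return None, way          # non-branch: way prediction only
        return None, None                 # no prediction: read all I-cache ways

    def update(self, pc, target, actual_way):
        """Train after the fetch resolves; target is None for non-branch fetches."""
        idx, tag = pc % self.sets, pc // self.sets
        if target is not None:            # taken branch: allocate in the branch partition
            ways = self.branch[idx]
            ways[:] = [e for e in ways if e[0] != tag]
            ways.insert(0, (tag, target, actual_way))
            del ways[self.branch_ways:]
        else:                             # non-branch fetch: cache its way only
            ways = self.nonbranch[idx]
            ways[:] = [e for e in ways if e[0] != tag]
            ways.insert(0, (tag, actual_way))
            del ways[self.nonbranch_ways:]
```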


International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems | 2002

Reducing power with an L0 instruction cache using history-based prediction

Weiyu Tang; Alexander V. Veidenbaum; Alexandru Nicolau

Advances in semiconductor technology have several impacts on processor design. One impact is that faster clock rates and slower wires limit the number of transistors reachable in a single cycle. Another is that power management is becoming a design constraint due to increasing power density. A small L0 cache in front of a traditional L1 cache has the advantages of shorter access time and lower power consumption; the downside of an L0 cache is possible performance loss on cache misses. In this paper, we analyze L0 instruction cache miss patterns and propose an effective L0 instruction cache management scheme based on history-based prediction. For SPEC2000 benchmarks, the prediction hit rate is as high as 99% and the average hit rate is more than 93%. Compared to other L0 instruction cache management schemes, our scheme reduces the performance degradation of L0 caches by more than 95% while maintaining the energy advantage, as shown by a lower energy-delay product.
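A minimal form of history-based prediction for choosing the fetch source, sketched below with 2-bit saturating counters indexed by the fetch address, gives the flavor of the scheme; the counter width, table size, and indexing are assumptions, not the predictor evaluated in the paper.

```python
class L0FetchPredictor:
    """History-based L0/L1 fetch-source selector (illustrative sketch)."""

    def __init__(self, entries=128):
        self.counters = [2] * entries    # 0-1: predict L0 miss, 2-3: predict L0 hit

    def _idx(self, pc):
        return pc % len(self.counters)

    def fetch_source(self, pc):
        """Return 'L0' when an L0 hit is predicted, otherwise 'L1' to avoid the
        extra cycle of probing the small cache and missing."""
        return "L0" if self.counters[self._idx(pc)] >= 2 else "L1"

    def update(self, pc, l0_hit):
        """Train with the actual outcome: saturate toward hit or miss."""
        i = self._idx(pc)
        if l0_hit:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
```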


ACM Transactions on Design Automation of Electronic Systems | 2007

A predictive decode filter cache for reducing power consumption in embedded processors

Weiyu Tang; Arun Kejariwal; Alexander V. Veidenbaum; Alexandru Nicolau

With advances in semiconductor technology, power management has become an increasingly important design constraint in processor design. In embedded processors, instruction fetch and decode consume more than 40% of processor power, which calls for power minimization techniques targeting the fetch and decode stages of the processor pipeline. To this end, the filter cache has been proposed as an architectural extension for reducing power consumption. A filter cache is placed between the CPU and the instruction cache (I-cache) to provide the instruction stream; it has the advantages of shorter access time and lower power consumption, but the downside is a possible performance loss on cache misses. In this article, we present a novel technique, the decode filter cache (DFC), for minimizing power consumption with minimal performance impact. The DFC stores decoded instructions, so a hit in the DFC eliminates the instruction fetch and its subsequent decoding; bypassing both instruction fetch and decode reduces processor power. We present a runtime approach for predicting whether the next fetch source is present in the DFC, and when a miss is predicted, we reduce the miss penalty by accessing the I-cache directly. We classify instructions as cacheable or noncacheable depending on the decode width, and for efficient use of the cache space, a sectored cache design is used for the DFC so that both cacheable and noncacheable instructions can coexist in a DFC sector. Experimental results show that the DFC reduces processor power by 34% on average, and our next-fetch prediction mechanism reduces the miss penalty by more than 91%.


Innovative Architecture for Future Generation High-Performance Processors and Systems | 2003

Dynamically adaptive fetch size prediction for data caches

Weiyu Tang; Alexander V. Veidenbaum; Alex Nicolau

Cache line size has a significant impact on cache and overall CPU performance. This size is typically fixed at design time and may not be optimal for a given program, or even throughout a single program's execution. Past attempts to achieve the effect of a dynamic line size require complex hardware fetch-size predictors. This paper proposes an adaptive fetch-size predictor based on miss-rate sampling. It requires little additional hardware and is straightforward to implement. Adaptive fetch size can be used at either the L1 or the L2 cache and achieves significant miss-rate reductions in both cases. On average, the L2 adaptive fetch-size cache yields the highest overall performance improvement: a 15% average speedup and up to a 50% speedup on individual programs.
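A miss-rate-sampling predictor can be approximated in a few lines: sample the miss rate over a fixed interval, then hill-climb toward a neighboring fetch size, reversing direction whenever the last change made things worse. The interval length, candidate sizes, and one-dimensional search policy below are assumptions for illustration, not the paper's exact predictor.

```python
SIZES = [16, 32, 64, 128]                 # candidate fetch sizes in bytes (assumed)

class FetchSizeSampler:
    """Adapts the fetch size by sampling the miss rate over fixed intervals."""

    def __init__(self, interval=10_000):
        self.interval = interval          # accesses per sampling interval
        self.size_idx = 1                 # start at 32B
        self.direction = +1               # which neighbouring size to try next
        self.prev_rate = None             # miss rate of the previous interval
        self.accesses = 0
        self.misses = 0

    @property
    def fetch_size(self):
        return SIZES[self.size_idx]

    def record(self, was_miss):
        """Call once per cache access with the hit/miss outcome; returns the
        fetch size to use for subsequent misses."""
        self.accesses += 1
        self.misses += int(was_miss)
        if self.accesses == self.interval:
            rate = self.misses / self.accesses
            if self.prev_rate is not None and rate > self.prev_rate:
                self.direction = -self.direction      # last move hurt: back off
            self.prev_rate = rate
            self.size_idx = min(max(self.size_idx + self.direction, 0), len(SIZES) - 1)
            self.accesses = self.misses = 0
        return self.fetch_size
```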


Archive | 1999

Adaptive Line Size Cache

Weiyu Tang; Alexander V. Veidenbaum; Alexandru Nicolau; Rajesh K. Gupta


Archive | 2000

Cache with Adaptive Fetch Size

Weiyu Tang; Alexander V. Veidenbaum; Alexandru Nicolau; Rajesh K. Gupta

Collaboration


Dive into Weiyu Tang's collaborations.

Top Co-Authors

Alex Nicolau

University of California


Xiaomei Ji

University of California
