Youfeng Wu | Researchain

Publication

Featured research published by Youfeng Wu.

symposium on code generation and optimization | 2006

Software-Based Transparent and Comprehensive Control-Flow Error Detection

Edson Borin; Cheng Wang; Youfeng Wu; Guido Araujo

Shrinking microprocessor feature size and growing transistor density may increase the soft-error rates to unacceptable levels in the near future. While reliable systems typically employ hardware techniques to address soft errors, software-based techniques can provide a less expensive and more flexible alternative. This paper presents a control-flow error classification and proposes two new software-based comprehensive control-flow error detection techniques. The new techniques are better than the previous ones in the sense that they detect errors in all the branch-error categories. We implemented the techniques in our dynamic binary translator so that the techniques can be applied to existing x86 binaries transparently. We compared our new techniques with the previous ones and show that our methods cover more errors while incurring similar performance overhead.

international symposium on computer architecture | 2001

Better exploration of region-level value locality with integrated computation reuse and value prediction

Youfeng Wu; Dong-Yuan Chen; Jesse Fang

Computation reuse and value prediction are two recent techniques for improving microprocessor performance by exploiting value locality. They both aim at breaking the data dependence limit in traditional processors. In this paper, we propose a speculative multithreading scheme in which the same hardware can be efficiently used for both computation reuse and value prediction. For the SpecInt95 benchmarks, our experiment shows that the integrated approach significantly outperforms either computation reuse or value prediction alone. For example, the integrated approach improves over computation reuse from a speedup of 1.25 to 1.40, and improves over value prediction from 1.28 to 1.40. In particular, the integrated approach outperforms a computation reuse configuration that has twice as many reuse buffer entries (from a speedup of 1.33 to 1.40). Furthermore, unlike the computation reuse approach, the performance of the integrated approach does not rely on a value profile during region formation, and thus our approach is more suitable for production systems.

international conference on human computer interaction | 2004

Continuous trip count profiling for loop optimization in two-phase dynamic binary translators

Youfeng Wu; Mauricio Breternitz; Tevi Devor

Most dynamic binary translators use a two-phase approach to identify and optimize frequently executed code dynamically. In the profiling phase, blocks of code are interpreted or translated without optimization to collect execution frequency information for the blocks. In the optimization phase, frequently executed blocks are grouped into regions and advanced optimizations are applied to them. This approach implicitly assumes that the initial execution of each block is representative of the block throughout its lifetime. In particular, loop optimizations may use the block frequency information to determine loop trip counts to guide their optimizations. If the trip count information is incorrect, however, a loop may be improperly optimized, and program performance suffers. In this paper we show that the initial profile is inadequate at predicting loop trip count information for several integer programs. We propose and evaluate efficient algorithms to continuously profile trip counts. Our results show that accurate trip count information may be obtained with very low overhead (about 0.5%). This enables advanced loop optimizations in dynamic binary translators.
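
To illustrate why an initial profile can mislead a loop optimizer, here is a minimal sketch of trip count estimation from two counters per loop. It is not the algorithm from the paper: the class `TripCountProfiler`, the helper `run_loop`, and the `from_back_edge` flag are hypothetical names chosen for this example, and the estimate is simply header executions divided by loop entries.

```python
from collections import defaultdict


class TripCountProfiler:
    """Hypothetical profiler: estimates average trip counts from two counters."""

    def __init__(self):
        self.entries = defaultdict(int)     # times the loop was entered from outside
        self.iterations = defaultdict(int)  # times the loop header executed

    def on_loop_header(self, loop_id, from_back_edge):
        # Every header execution is one iteration; only arrivals that do not
        # come through the back edge count as a new entry into the loop.
        self.iterations[loop_id] += 1
        if not from_back_edge:
            self.entries[loop_id] += 1

    def average_trip_count(self, loop_id):
        entries = self.entries[loop_id]
        if entries == 0:
            return 0.0
        return self.iterations[loop_id] / entries


def run_loop(profiler, loop_id, trips):
    """Simulate one entry into a loop that iterates `trips` times."""
    for i in range(trips):
        profiler.on_loop_header(loop_id, from_back_edge=(i > 0))


profiler = TripCountProfiler()

# Early behaviour: three short executions of the loop.
for _ in range(3):
    run_loop(profiler, "L1", 2)
initial_estimate = profiler.average_trip_count("L1")

# Later behaviour: the same loop starts running much longer.
for _ in range(3):
    run_loop(profiler, "L1", 100)
continuous_estimate = profiler.average_trip_count("L1")

print(initial_estimate, continuous_estimate)
```

A translator that froze its profile after the first phase would keep treating `L1` as a short loop, whereas continued counting moves the estimate toward the later behaviour.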

ACM Sigarch Computer Architecture News | 2005

Dynamic binary control-flow errors detection

Edson Borin; Cheng Wang; Youfeng Wu; Guido Araujo

Shrinking microprocessor feature size will increase the soft-error rates to unacceptable levels in the near future. While reliable systems typically employ hardware techniques to address soft errors, software techniques can provide a less expensive and more flexible alternative. This paper presents a control-flow error classification and proposes new software-based control-flow error detection techniques. The new techniques are better than the previous ones in the sense that they detect errors in all the branch-error categories. We also compare the performance of our new techniques with that of the previous ones using our dynamic binary translator.

international conference on computer design | 2006

Clustering-Based Microcode Compression

Edson Borin; Mauricio Breternitz; Youfeng Wu; Guido Araujo

Microcode enables programmability of (micro) architectural structures to enhance functionality and to apply patches to an existing design. As more features get added to a CPU core, the area and power costs associated with microcode increase. A recent Intel internal design targeted at low power and small footprint has estimated the costs of the microcode ROM to approach 20% of the total die area (and associated power consumption). Therefore, it is desirable to apply compression techniques to microcode. Microcode poses unique challenges for compression due to the long instruction format, the hand-coded nature of the programs and the stringent performance requirements that require fast decompression. This paper describes techniques for microcode compression that achieve significant area and power savings, while presenting a streamlined architecture that enables high throughput within the constraints of a high performance CPU. The paper presents results for microcode compression on several commercial CPU designs, which demonstrate compression ratios ranging from 50% to 62%.
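
As a rough illustration of how dictionary-based compression trades a table of unique words for short indexes, the sketch below compresses a list of fixed-width words and measures the result. It is a generic single-dictionary scheme, not the clustering technique from the paper; the function names, the 64-bit word width, and the definition of the ratio as compressed bits over original bits are all assumptions made for this example.

```python
import math


def dictionary_compress(words):
    """Replace each word with an index into a table of unique words."""
    dictionary = []
    positions = {}
    indexes = []
    for word in words:
        if word not in positions:
            positions[word] = len(dictionary)
            dictionary.append(word)
        indexes.append(positions[word])
    return dictionary, indexes


def decompress(dictionary, indexes):
    """Decompression is a single table lookup per word."""
    return [dictionary[i] for i in indexes]


def compressed_size_bits(dictionary, indexes, word_bits):
    if not dictionary:
        return 0
    # Each index needs enough bits to address every dictionary entry (at least 1).
    index_bits = max(1, math.ceil(math.log2(len(dictionary))))
    return len(dictionary) * word_bits + len(indexes) * index_bits


WORD_BITS = 64  # assumed width of one microcode word in this example
words = [0xAAAA, 0xBBBB, 0xAAAA, 0xAAAA, 0xBBBB, 0xAAAA, 0xBBBB, 0xBBBB]

dictionary, indexes = dictionary_compress(words)
original_bits = len(words) * WORD_BITS
compressed_bits = compressed_size_bits(dictionary, indexes, WORD_BITS)
ratio = compressed_bits / original_bits

print(original_bits, compressed_bits, ratio)
```

The lookup-only decompression path is what makes this family of schemes attractive when decompression sits on a performance-critical fetch path; real microcode has far less repetition than this toy input, so the ratio here says nothing about the figures reported above.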

international symposium on computer architecture | 2010

Trace execution automata in dynamic binary translation

João Paulo Porto; Guido Araujo; Edson Borin; Youfeng Wu

Program performance can be dynamically improved by optimizing its frequent execution traces. Once traces are collected, they can be analyzed and optimized based on the dynamic information derived from the program's previous runs. The ability to record traces is thus central to any dynamic binary translation system. Recording traces, as well as loading them for use in different runs, requires code replication to represent the trace. This paper presents a novel technique which records execution traces by using an automaton called TEA (Trace Execution Automata). Contrary to other approaches, TEA stores traces implicitly, without the need to replicate execution code. TEA can also be used to simulate the trace execution in a separate environment, to store profile information about the generated traces, as well as to instrument optimized versions of the traces. In our experiments, we showed that TEA decreases the memory needed to represent the traces (nearly 80% savings).

symposium on computer architecture and high performance computing | 2008

A Segmented Bloom Filter Algorithm for Efficient Predictors

Mauricio Breternitz; Gabriel H. Loh; Bryan Black; Jeff Rupley; Peter G. Sassone; Wesley Attrot; Youfeng Wu

Bloom Filters are a technique to reduce the effects of conflicts/interference in hash table-like structures. Conventional hash tables store information in a single location, which is susceptible to destructive interference through hash conflicts. A Bloom Filter uses multiple hash functions to store information in several locations, and recombines the information through some voting mechanism. Many microarchitectural predictors use simple single-index hash tables to make binary 0/1 predictions, and Bloom Filters help improve predictor accuracy. However, implementing a true Bloom Filter requires k hash functions, which in turn implies a k-ported hash table, or k sequential accesses. Unfortunately, the area of a hardware table increases quadratically with the port count, increasing costs of area, latency and power consumption. We propose a simple but elegant modification to the Bloom Filter algorithm that uses banking combined with special hash functions that guarantee all hash indexes fall into non-conflicting banks. We evaluate several applications of our Banked Bloom Filter (BBF) prediction in processors: BBF branch prediction, BBF load hit/miss prediction, and BBF last-tag prediction. We show that BBF predictors can provide accurate predictions with substantially less cost than previous techniques.
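
The banking idea can be sketched in software as follows. This is a minimal illustration under stated assumptions rather than the design from the paper: the class `BankedBloomPredictor` is hypothetical, hash function i is dedicated to bank i as one simple way to keep the k lookups in distinct banks, entries are 2-bit saturating counters, and the vote is a strict majority.

```python
import hashlib


class BankedBloomPredictor:
    """Hypothetical banked predictor: one hash function per single-ported bank."""

    def __init__(self, num_banks=4, entries_per_bank=64):
        self.num_banks = num_banks
        self.entries_per_bank = entries_per_bank
        # One table of 2-bit saturating counters per bank, starting at 1
        # (weakly predicting False).
        self.banks = [[1] * entries_per_bank for _ in range(num_banks)]

    def _indexes(self, key):
        # Hash i only ever indexes bank i, so the k lookups for one key can
        # never land in the same bank.
        result = []
        for bank in range(self.num_banks):
            digest = hashlib.sha256(f"{bank}:{key}".encode()).digest()
            result.append(int.from_bytes(digest[:4], "big") % self.entries_per_bank)
        return result

    def predict(self, key):
        votes = sum(
            1
            for bank, idx in enumerate(self._indexes(key))
            if self.banks[bank][idx] >= 2
        )
        return votes * 2 > self.num_banks  # strict majority of banks

    def update(self, key, outcome):
        for bank, idx in enumerate(self._indexes(key)):
            counter = self.banks[bank][idx]
            if outcome:
                self.banks[bank][idx] = min(3, counter + 1)
            else:
                self.banks[bank][idx] = max(0, counter - 1)


predictor = BankedBloomPredictor()
branch_pc = 0x400
before = predictor.predict(branch_pc)
predictor.update(branch_pc, True)
predictor.update(branch_pc, True)
after = predictor.predict(branch_pc)
print(before, after)
```

Because each bank receives exactly one access per prediction, the banks can be single-ported and read in parallel, which is the cost argument the abstract makes against a k-ported table.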

International Journal of Parallel Programming | 2014

Microcode Compression Using Structured-Constrained Clustering

Edson Borin; Guido Araujo; Mauricio Breternitz; Youfeng Wu

Modern microprocessors have used microcode as a way to implement legacy (rarely used) instructions, add new ISA features and enable patches to an existing design. As more features are added to processors (e.g. protection and virtualization), the area and power costs associated with the microcode memory have increased significantly. A recent Intel internal design targeted at low power and small footprint has estimated the costs of the microcode ROM to approach 20% of the total die area (and associated power consumption). Moreover, with the adoption of multicore architectures, the impact of microcode memory size on the chip area has become relevant, forcing industry to revisit the microcode size problem. A solution to address this problem is to store the microcode in a compressed form and decompress it at runtime. This paper describes techniques for microcode compression that achieve significant area and power savings, while proposing a streamlined architecture that enables high throughput within the constraints of a high performance CPU. The paper presents results for microcode compression on several commercial CPU designs, which demonstrate compression ratios ranging from 50% to 62%. In addition, it proposes techniques that enable the reuse of (pre-validated) hardware building blocks that can considerably reduce the cost and design time of the microcode decompression engine in real-world designs.

architectural support for programming languages and operating systems | 2013

TSO_ATOMICITY: efficient hardware primitive for TSO-preserving region optimizations

Cheng Wang; Youfeng Wu

Program optimizations based on data dependences may not preserve the memory consistency of the programs. Previous work leverages a hardware ATOMICITY primitive to restrict thread interleaving and thereby preserve sequential consistency in region optimizations. However, the ATOMICITY primitive is overly restrictive on thread interleaving when optimizing real-world applications developed for the popular Total-Store-Ordering (TSO) memory consistency model, which is weaker than sequential consistency. In this paper, we present a novel hardware TSO_ATOMICITY primitive, which places fewer restrictions on thread interleaving than the ATOMICITY primitive, permitting more efficient program execution while still preserving TSO memory consistency in all region optimizations. Furthermore, the TSO_ATOMICITY primitive requires architecture support similar to that of the ATOMICITY primitive and can be implemented with only a slight change to the existing ATOMICITY primitive implementation. Our experimental results show that, in a state-of-the-art dynamic binary optimization system on a large set of workloads, the ATOMICITY primitive improves performance by only 4% on average. The TSO_ATOMICITY primitive reduces the overhead associated with the ATOMICITY primitive and improves performance by 12% on average.

symposium on computer architecture and high performance computing | 2011

Structure-Constrained Microcode Compression

Edson Borin; Guido Araujo; Mauricio Breternitz; Youfeng Wu

Microcode enables programmability of (micro) architectural structures to enhance functionality and to apply patches to an existing design. As more features get added to a CPU core, the area and power costs associated with microcode increase. One solution to address the microcode size issue is to store the microcode in a compressed form and decompress it during execution. Furthermore, the reuse of a single hardware building block layout to implement different dictionaries in the two-level microcode compression reduces the cost and the design time of the decompression engine. However, the reuse of the hardware building block imposes structural constraints on the compression algorithm, and existing algorithms may yield poor compression. In this paper, we develop the SC2 algorithm, which considers the structural constraint in its objective function and reduces the area expansion when reusing hardware building blocks to implement different dictionaries. Our experimental results show that the SC2 algorithm is able to produce similarly sized dictionaries and achieves a compression ratio similar to that of the non-constrained algorithm.
