Publication


Featured research published by Jesse Fang.


international conference on computer languages | 1998

Path profile guided partial redundancy elimination using speculation

Rajiv Gupta; David A. Berson; Jesse Fang

While programs contain a large number of paths, a very small fraction of these paths are typically exercised during program execution. Thus, optimization algorithms should be designed to trade off the performance of less frequently executed paths in favor of more frequently executed paths. However, traditional formulations of code optimizations are incapable of performing such a trade-off. The authors present a path-profile-guided partial redundancy elimination algorithm that uses speculation to enable the removal of redundancy along more frequently executed paths at the expense of introducing additional expression evaluations along less frequently executed paths. They describe a cost-benefit data flow analysis that uses path profiling information to determine the profitability of using speculation. The cost of enabling speculation of an expression at a conditional is determined by identifying paths along which an additional evaluation of the expression is introduced. The benefit of enabling speculation is determined by identifying paths along which additional redundancy elimination is enabled by speculation. The results of this analysis are incorporated in a speculative expression hoisting framework for partial redundancy elimination.
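
The trade-off can be sketched at the source level. A minimal illustration (assumed code, not the paper's algorithm or benchmarks): hoisting x * y above the branch removes the second evaluation on the hot path and adds one on the cold path, which pays off exactly when the hot path's profiled frequency exceeds the cold path's.

```c
#include <stdio.h>

/* Before: x * y is evaluated in the hot branch and again after the
 * merge, so the hot path computes it twice. */
int before(int x, int y, int hot) {
    int a = 0;
    if (hot)
        a = x * y;      /* first evaluation on the hot path */
    return a + x * y;   /* redundant on the hot path */
}

/* After speculative hoisting: x * y is computed once, above the branch.
 * The hot path now evaluates it once; the cold path pays one extra
 * evaluation it did not have before. */
int after(int x, int y, int hot) {
    int t = x * y;      /* speculative evaluation */
    int a = hot ? t : 0;
    return a + t;
}

int main(void) {
    printf("%d %d\n", before(3, 4, 1), after(3, 4, 1));   /* 24 24 */
    return 0;
}
```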


programming language design and implementation | 2009

Programming model for a heterogeneous x86 platform

Bratin Saha; Xiaocheng Zhou; Hu Chen; Ying Gao; Shoumeng Yan; Mohan Rajagopalan; Jesse Fang; Peinan Zhang; Ronny Ronen; Avi Mendelson

The client computing platform is moving towards a heterogeneous architecture consisting of a combination of cores focused on scalar performance and a set of throughput-oriented cores. The throughput-oriented cores (e.g. a GPU) may be connected over both coherent and non-coherent interconnects, and have different ISAs. This paper describes a programming model for such heterogeneous platforms. We discuss the language constructs, runtime implementation, and memory model for such a programming environment. We implemented this programming environment in an x86 heterogeneous platform simulator, ported a number of workloads to it, and present the performance of our programming environment on these workloads.
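
To make the offload style concrete, here is a hypothetical C sketch; shared_alloc, offload, and scale_kernel are invented names (not the paper's constructs), and the runtime is stubbed as a plain call. A real runtime would map or copy the shared region across the non-coherent interconnect and dispatch the kernel to the throughput cores.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stub: one address space here; a real runtime would place this in a
 * region visible to both the scalar and the throughput cores. */
static void *shared_alloc(size_t n) { return malloc(n); }

/* Kernel intended to run on the throughput cores (e.g. a GPU). */
static void scale_kernel(float *v, int n, float s) {
    for (int i = 0; i < n; i++) v[i] *= s;
}

/* Stub: a real runtime would marshal arguments and ship the call. */
#define offload(call) (call)

int main(void) {
    int n = 4;
    float *v = shared_alloc(n * sizeof *v);  /* visible to both core types */
    for (int i = 0; i < n; i++) v[i] = (float)i;
    offload(scale_kernel(v, n, 2.0f));
    for (int i = 0; i < n; i++) printf("%.1f ", v[i]);   /* 0.0 2.0 4.0 6.0 */
    printf("\n");
    free(v);
    return 0;
}
```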


international conference on parallel architectures and compilation techniques | 1997

Path profile guided partial dead code elimination using predication

Rajiv Gupta; David A. Berson; Jesse Fang

Presents a path-profile-guided partial dead code elimination algorithm that uses predication to enable sinking for the removal of deadness along frequently executed paths at the expense of additional instructions along infrequently executed paths. Our approach to optimization is particularly suitable for VLIW architectures, since it directs the efforts of the optimizer towards aggressively enabling the generation of fast schedules along frequently executed paths by reducing their critical path lengths. The paper presents a cost-benefit data flow analysis that uses path profiling information to determine the profitability of using predication-enabled sinking. The cost of predication-enabled sinking of a statement past a merge point is determined by identifying paths along which an additional statement is introduced. The benefit is determined by identifying paths along which additional dead code elimination is achieved due to predication. The results of this analysis are incorporated in a code sinking framework in which predication-enabled sinking is allowed past merge points only if its benefit is determined to be greater than its cost. It is also demonstrated that a trade-off can be made between compile-time cost and the precision of the cost-benefit analysis.
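
A source-level sketch of the idea (assumed code, not the paper's algorithm): the assignment below is dead whenever cond holds, and sinking it past the merge under the complementary predicate removes that wasted work from the cond-true path.

```c
#include <stdio.h>

/* Before: x = a + b executes on every path but is dead (overwritten)
 * whenever cond is true. */
int before(int a, int b, int cond) {
    int x = a + b;      /* partially dead: wasted work when cond holds */
    if (cond)
        x = 0;
    return x;
}

/* After predication-enabled sinking: the assignment sinks past the
 * merge and is guarded by the complementary predicate, so the dead
 * work disappears from the cond-true path. */
int after(int a, int b, int cond) {
    int x = cond ? 0 : a + b;   /* predicated assignment */
    return x;
}

int main(void) {
    printf("%d %d\n", before(2, 3, 0), after(2, 3, 0));   /* 5 5 */
    return 0;
}
```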


international symposium on computer architecture | 2001

Better exploration of region-level value locality with integrated computation reuse and value prediction

Youfeng Wu; Dong-Yuan Chen; Jesse Fang

Computation reuse and value prediction are two recent techniques for improving microprocessor performance by exploiting value localities. Both aim at breaking the data dependence limit in traditional processors. In this paper, we propose a speculative multithreading scheme in which the same hardware can be efficiently used for both computation reuse and value prediction. For the SpecInt95 benchmarks, our experiments show that the integrated approach significantly outperforms either computation reuse or value prediction alone. For example, the integrated approach improves over computation reuse from a speedup of 1.25 to 1.40, and over value prediction from 1.28 to 1.40. In particular, the integrated approach outperforms a computation reuse configuration with twice as many reuse buffer entries (from a speedup of 1.33 to 1.40). Furthermore, unlike the computation reuse approach, the performance of the integrated approach does not rely on value profiles during region formation, making it more suitable for production systems.
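
A toy software model of the shared structure (layout and sizes assumed; the paper describes hardware): a small buffer keyed by a region's inputs serves reuse hits directly, and on a miss its last recorded output doubles as the value prediction while the region executes to verify the guess.

```c
#include <stdio.h>

#define ENTRIES 16

struct entry { int valid, input, output; };
static struct entry buf[ENTRIES];

static int region(int in) { return in * in + 1; }   /* the computation region */

int reuse_or_predict(int in) {
    int slot = (unsigned)in % ENTRIES;
    if (buf[slot].valid && buf[slot].input == in)
        return buf[slot].output;          /* computation reuse: region skipped */
    /* Miss: reuse the entry's last output as a value prediction so
     * dependent work can start, then execute the region to verify. */
    int predicted = buf[slot].valid ? buf[slot].output : 0;
    int actual = region(in);
    if (predicted != actual) {
        /* misprediction: hardware would squash dependent speculative work */
    }
    buf[slot].valid = 1; buf[slot].input = in; buf[slot].output = actual;
    return actual;
}

int main(void) {
    printf("%d ", reuse_or_predict(3));   /* miss: executes and records */
    printf("%d ", reuse_or_predict(3));   /* hit: reused, region skipped */
    printf("%d\n", reuse_or_predict(5));  /* miss: last value mispredicts */
    return 0;
}
```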


international conference on supercomputing | 2009

Dynamic parallelization of single-threaded binary programs using speculative slicing

Cheng Wang; Youfeng Wu; Edson Borin; Shiliang Hu; Wei Liu; Dave Sager; Tin-Fook Ngai; Jesse Fang

The performance of single-threaded programs and legacy binary code is of critical importance in many everyday applications. However, multi-core processors cannot directly speed up single-threaded programs, nor can automatic parallelizing compilers effectively parallelize legacy binary code and irregular applications. In this paper, we propose a framework and a set of algorithms to dynamically parallelize single-threaded binary programs. Our parallelization is based on program slicing and exploits both instruction-level parallelism (ILP) and thread-level parallelism (TLP). To significantly reduce the critical path of the parallel slices, our slicing algorithms exploit speculation to cut rare dependences and use well-designed program transformations to expose parallelism. Furthermore, because we transparently parallelize binary code at runtime, we perform slicing only on program hot regions. Our experiments demonstrate that the proposed speculative slicing approach extracts more parallelism than any known slicing-based parallelization scheme. For the SPEC2000 benchmarks, we achieve 3x parallelism with an infinite number of threads and 1.8x parallelism with 4 threads.
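
One ingredient, cutting a rare dependence speculatively, can be sketched in C (an assumed example, not from the paper): the rare saturation path below serializes the loop, and assuming it never fires leaves a plain reduction that slices running on separate threads could compute in parallel, with sequential re-execution as the recovery path.

```c
#include <stdio.h>

/* Original: the rare reset of s is a loop-carried dependence that
 * forces the iterations to run in order. */
int sum_exact(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += a[i];
        if (s < 0) s = 0;                 /* rare path, serializes the loop */
    }
    return s;
}

/* Speculative version: assume the rare path never fires, leaving a
 * reduction shape that parallel slices could split; detect violations
 * and recover by re-executing the exact loop. */
int sum_speculative(const int *a, int n) {
    int s = 0, fired = 0;
    for (int i = 0; i < n; i++) {
        s += a[i];
        if (s < 0) fired = 1;             /* misspeculation detector */
    }
    return fired ? sum_exact(a, n) : s;   /* recovery: sequential rerun */
}

int main(void) {
    int a[] = { 5, -2, 7, 1 };
    printf("%d %d\n", sum_exact(a, 4), sum_speculative(a, 4));   /* 11 11 */
    return 0;
}
```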


international symposium on microarchitecture | 2002

Compiler managed micro-cache bypassing for high performance EPIC processors

Youfeng Wu; Ryan N. Rakvic; Li-Ling Chen; Chyi-chang Miao; George Z. Chrysos; Jesse Fang

Advanced microprocessors have been increasing clock rates well beyond the gigahertz boundary. For such high performance microprocessors, a small and fast data micro-cache (ucache) is important to overall performance, and proper management of it via load bypassing has a significant performance impact. In this paper, we propose and evaluate a hardware-software collaborative technique to manage ucache bypassing for EPIC processors. The hardware supports ucache bypassing with a flag in the load instruction format, and the compiler employs static analysis and profiling to identify loads that should bypass the ucache. The collaborative method achieves a significant improvement in performance for the SpecInt2000 benchmarks. On average, about 40%, 30%, 24%, and 22% of load references are identified to bypass 256 B, 1 K, 4 K, and 8 K sized ucaches, respectively. This reduces the ucache miss rates by 39%, 32%, 28%, and 26%. The number of pipeline stalls from loads to their uses is reduced by 13%, 9%, 6%, and 5%. Meanwhile, the L1 and L2 cache misses remain largely unchanged. For the 256 B ucache, bypassing improves overall performance on average by 5%.
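
The compile-time side of the collaboration might look like the following sketch (the reuse-distance heuristic and the numbers are assumptions, not the paper's analysis): loads whose profiled reuse distance exceeds the ucache capacity gain nothing from it and are flagged to bypass.

```c
#include <stdio.h>

#define UCACHE_BYTES 256   /* capacity of the smallest ucache evaluated */

struct load_site {
    const char *name;
    int reuse_distance_bytes;   /* from profiling */
    int bypass;                 /* flag to emit in the load encoding */
};

/* Data that will not be touched again before eviction only pollutes
 * the tiny ucache, so such loads are sent straight to L1. */
void mark_bypass(struct load_site *loads, int n) {
    for (int i = 0; i < n; i++)
        loads[i].bypass = loads[i].reuse_distance_bytes > UCACHE_BYTES;
}

int main(void) {
    struct load_site sites[] = {
        { "stack_spill_reload", 32,   0 },   /* hot, short reuse: keep */
        { "streaming_array",    4096, 0 },   /* streaming: bypass */
    };
    mark_bypass(sites, 2);
    for (int i = 0; i < 2; i++)
        printf("%s: bypass=%d\n", sites[i].name, sites[i].bypass);
    return 0;
}
```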


international symposium on microarchitecture | 1997

Resource-sensitive profile-directed data flow analysis for code optimization

Rajiv Gupta; David A. Berson; Jesse Fang

Instruction schedulers employ code motion as a means of instruction reordering to enable scheduling of instructions at points where the resources required for their execution are available. In addition, driven by profiling data, schedulers take advantage of predication and speculation for aggressive code motion across conditional branches. Optimization algorithms for partial dead code elimination (PDE) and partial redundancy elimination (PRE) employ code sinking and hoisting to enable optimization. However, unlike instruction scheduling, these optimization algorithms are unaware of resource availability and are incapable of exploiting profiling information, speculation, and predication. In this paper we develop data flow algorithms for performing the above optimizations with the following characteristics: (i) opportunities for PRE and PDE enabled by hoisting and sinking are exploited; (ii) hoisting and sinking of a code statement is driven by the availability of functional unit resources; (iii) predication and speculation are incorporated to allow aggressive hoisting and sinking; and (iv) path profile information guides predication and speculation to enable optimization.
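
The profitability test shared by these profile-guided frameworks reduces to comparing path frequencies. A schematic version (structure and numbers assumed for illustration): enable speculation or predication only when the combined frequency of paths that gain outweighs that of paths that pay.

```c
#include <stdio.h>

struct path {
    long freq;   /* execution count from the path profile */
    int gains;   /* redundancy or deadness removed on this path */
    int pays;    /* extra evaluation or statement added on this path */
};

int profitable(const struct path *paths, int n) {
    long benefit = 0, cost = 0;
    for (int i = 0; i < n; i++) {
        if (paths[i].gains) benefit += paths[i].freq;
        if (paths[i].pays)  cost    += paths[i].freq;
    }
    return benefit > cost;   /* enable the transformation only if it wins */
}

int main(void) {
    struct path p[] = { { 900, 1, 0 }, { 100, 0, 1 } };  /* hot gains, cold pays */
    printf("enable: %d\n", profitable(p, 2));            /* prints 1 */
    return 0;
}
```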


IEEE Computer | 1997

Changing interaction of compiler and architecture

Sarita V. Adve; Doug Burger; Rudolf Eigenmann; Alasdair Rawsthorne; Michael D. Smith; Catherine H. Gebotys; Mahmut T. Kandemir; David J. Lilja; Alok N. Choudhary; Jesse Fang; Pen Chung Yew

Program optimizations that have been exclusively done by either the architecture or the compiler are now being done by both. This blurred distinction offers opportunities to optimize performance and redefine the compiler-architecture interface. We describe an optimization continuum with compile time and post run time as end points and show how different classes of optimizations fall within it. Most current commercial compilers are still at the compile-time end point, and only a few research efforts are venturing beyond it. As the gap between architecture and compiler closes, there are also attempts to completely redefine the architecture-compiler interface to increase both performance and architectural flexibility.


symposium on code generation and optimization | 2004

Improving 64-bit Java IPF performance by compressing heap references

Ali-Reza Adl-Tabatabai; Jay Bharadwaj; Michal Cierniak; Marsha Eng; Jesse Fang; Brian T. Lewis; Brian R. Murphy; James M. Stichnoth

64-bit processor architectures like the Intel® Itanium® processor family are designed for large applications that need large memory addresses. When running applications that fit within a 32-bit address space, 64-bit CPUs are at a disadvantage compared to 32-bit CPUs because of the larger memory footprints for their data. This results in worse cache and TLB utilization, and consequently lower performance because of increased miss ratios. This paper considers software techniques for virtual machines that allow 32-bit pointers to be used on 64-bit CPUs for managed runtime applications that do not need the full 64-bit address space. We describe our pointer compression techniques and discuss our experience implementing these for Java applications. In addition, we give performance results with our techniques for both the SPEC JVM98 and SPEC JBB2000 benchmarks. We demonstrate a 12% performance improvement on SPEC JBB2000 and a reduction in the number of garbage collections required for a given heap size.
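
The underlying base-plus-offset technique is easy to sketch in C (field layouts and names here are assumptions, not the paper's VM internals): references are stored as 32-bit offsets from the heap base and widened with a single add on use, halving the pointer footprint of heap objects.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static char *heap_base;   /* all managed objects live within 4 GB of this */

typedef uint32_t cref;    /* compressed 32-bit heap reference */

static cref  compress(void *p)   { return (cref)((char *)p - heap_base); }
static void *decompress(cref r)  { return heap_base + r; }

/* 8 bytes instead of 16: the reference field is half-width.
 * (0 doubles as null here; a real VM reserves or biases offset 0.) */
struct node { cref next; int value; };

int main(void) {
    heap_base = malloc(1 << 20);          /* stand-in for the managed heap */
    struct node *a = (struct node *)heap_base;
    struct node *b = (struct node *)(heap_base + sizeof *a);
    a->value = 1; a->next = compress(b);
    b->value = 2; b->next = 0;
    struct node *n = decompress(a->next); /* one add widens the pointer */
    printf("%d\n", n->value);             /* prints 2 */
    free(heap_base);
    return 0;
}
```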


languages and compilers for parallel computing | 1996

Compiler Algorithms on If-Conversion, Speculative Predicates Assignment and Predicated Code Optimizations

Jesse Fang

Instruction-level parallelism has become increasingly important in today's microprocessor designs. These microprocessors have multiple function units and can execute more than one instruction in the same machine cycle to enhance uniprocessor performance. Since the function units in such microprocessors are usually pipelined, the branch misprediction penalty severely degrades CPU performance. To reduce this penalty, predicated execution has been introduced as a new architectural feature that allows compilers to remove branches from programs.
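
Predication's effect is easy to show with a source-level analogy (real predication operates per instruction, and this example is assumed, not from the paper): if-conversion replaces the branch with a predicate that selects between both executed arms, leaving nothing for the branch predictor to mispredict.

```c
#include <stdio.h>

/* With a branch: a misprediction flushes the pipeline. */
int abs_branch(int x) {
    if (x < 0)
        x = -x;
    return x;
}

/* If-converted: the condition becomes a predicate, both arms execute,
 * and the predicate selects the result, so no branch is needed. */
int abs_predicated(int x) {
    int p = x < 0;       /* predicate assignment */
    int neg = -x;        /* speculatively computed arm */
    return p ? neg : x;  /* predicated select */
}

int main(void) {
    printf("%d %d\n", abs_branch(-7), abs_predicated(-7));   /* 7 7 */
    return 0;
}
```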
