
Publication


Featured research published by Gansha Wu.


Symposium on Code Generation and Optimization | 2006

Data and Computation Transformations for Brook Streaming Applications on Multiprocessors

Shih-Wei Liao; Zhaohui Du; Gansha Wu; Guei-Yuan Lueh

Multicore processors are about to become prevalent in the PC world. Meanwhile, over 90% of computing cycles are estimated to be consumed by streaming media applications (Rixner et al., 1998). Although stream programming exposes parallelism naturally, we found that achieving high performance on multiprocessors is challenging. Therefore, we develop a parallel compiler for the Brook streaming language with aggressive data and computation transformations. First, we formulate fifteen Brook stream operators in terms of systems of inequalities. Our compiler optimizes the modeled operators to reduce memory footprint and improve performance. Second, the stream computation, including both kernels and operators, is mapped to the affine partitioning model by modeling each kernel as an implicit loop nest over stream elements. Note that our general abstraction is not limited to Brook. Our modeling and transformations yield high performance on uniprocessors as well: the geometric mean of speedups is 4.7 across ten streaming applications on a Xeon. On multiprocessors, we show that exploiting the standard intra-kernel data parallelism is inferior to our general modeling. The former yields a speedup of 1.5 for ten applications on a 4-way Xeon, while the latter achieves a speedup of 6.4 over the same baseline. We show that our compiler effectively reduces memory footprint, exploits parallelism, and circumvents phase-ordering issues.


Symposium on Code Generation and Optimization | 2011

Intel's Array Building Blocks: A retargetable, dynamic compiler and embedded language

Chris J. Newburn; Byoungro So; Zhenying Liu; Michael McCool; Anwar M. Ghuloum; Stefanus Jakobus Du Toit; Zhigang Wang; Zhaohui Du; Yongjian Chen; Gansha Wu; Peng Guo; Zhanglin Liu; Dan Zhang

Our ability to create systems with large amounts of hardware parallelism is exceeding the average software developer's ability to program them effectively. This is a problem that plagues our industry. Since the vast majority of the world's software developers are not parallel programming experts, making it easy to write, port, and debug applications with sufficient core and vector parallelism is essential to enabling the use of multi- and many-core processor architectures. However, hardware architectures and vector ISAs are also shifting and diversifying quickly, making it difficult for a single binary to run well on all possible targets. Because of this, retargetability and dynamic compilation are of growing relevance. This paper introduces Intel® Array Building Blocks (ArBB), a retargetable dynamic compilation framework. This system focuses on making it easier to write and port programs so that they can harvest data and thread parallelism on both multi-core and heterogeneous many-core architectures, while staying within standard C++. ArBB interoperates with other programming models to help meet the demands we hear from customers for a solution with both greater programmer productivity and good performance. This work makes contributions in language features, compiler architecture, code transformations, and optimizations. It presents performance data from the current beta release of ArBB and quantitatively shows the impact of some key analyses, enabling transformations, and optimizations for a variety of benchmarks that are of interest to our customers.


Interpreters, Virtual Machines and Emulators | 2004

Code sharing among states for stack-caching interpreter

Jinzhan Peng; Gansha Wu; Guei-Yuan Lueh

Interpretation has the salient merits of simplicity, portability and small footprint, but comes at the price of poor performance. Stack caching is a technique for building a high-performance interpreter by keeping the source and destination operands of instructions in registers, so as to reduce the memory accesses involved in interpretation. One drawback of stack caching is that an instruction may have multiple ways to perform interpretation depending on which registers its source operands reside in, resulting in code explosion as well as deteriorating code maintainability. This paper presents a code sharing mechanism that achieves performance as efficient as a stack-caching interpreter while keeping the code size as compact as a general threaded interpreter. Our results show that our approach outperforms a threaded interpreter by an average of 13.6%, while the code size increases by only 1KB (~3%).


Workshop on Memory System Performance and Correctness | 2006

A comprehensive study of hardware/software approaches to improve TLB performance for Java applications on embedded systems

Jinzhan Peng; Guei-Yuan Lueh; Gansha Wu; Xiaogang Gou; Ryan N. Rakvic

The working set size of Java applications on embedded systems has recently been increasing, causing the Translation Lookaside Buffer (TLB) to become a serious performance bottleneck. From a thorough analysis of the SPECjvm98 benchmark suite executing on a commodity embedded system, we find that TLB misses account for 24% to 50% of the total execution time. We explore and evaluate a wide spectrum of TLB-enhancing techniques with different combinations of software/hardware approaches, namely superpages for reducing TLB miss rates, a two-level TLB and TLB prefetching for reducing both TLB miss rates and TLB miss latency, and even a no-TLB design for removing TLB overhead completely. We adapt and then, in a novel way, extend these approaches to fit the design space of embedded systems executing Java code. We compare these approaches, discussing their performance behavior, software/hardware complexity and constraints, and especially the design implications for the application, runtime and OS. We first conclude that even with the aggressive approaches presented, the TLB remains a performance bottleneck. Second, in addition to facing very different design considerations and constraints on embedded systems, proven hardware techniques such as TLB prefetching have different performance implications. Third, software-based solutions, the no-TLB design and superpaging, appear to be more effective in improving Java application performance on embedded systems. Finally, beyond performance, these approaches have their respective pros and cons; it is left to the system designer to make the appropriate engineering tradeoff.


High Performance Embedded Architectures and Compilers | 2005

XAMM: a high-performance automatic memory management system with memory-constrained designs

Gansha Wu; Xin Zhou; Guei-Yuan Lueh; Jesse Fang; Peng Guo; Jinzhan Peng; Victor Ying

Automatic memory management has become prevalent on memory- and computation-constrained systems. Previous research has shown strong interest in small memory footprint, garbage collection (GC) pause time and energy consumption, while performance was left out of the spotlight. This fact inspired us to design memory management techniques that deliver high performance while still keeping space consumption and response time under control. XAMM is an attempt to answer that quest. Driven by the design decisions above, XAMM implements a variety of novel techniques spanning the object model, heap management, allocation and GC mechanisms. XAMM also adopts techniques that not only exploit the underlying system's capabilities, but also assist optimizations by other runtime components (e.g. the code generator). This paper describes these techniques in detail and reports our implementation experiences. We conclude that XAMM demonstrates the feasibility of achieving high performance without breaking memory constraints. We support our claims with evaluation results for a spectrum of real-world programs and synthetic benchmarks. For example, the heap placement optimization boosts system-wide performance by as much as 10%; the lazy and selective location-bits management reduces execution time by as much as 14%, while reducing GC pause time by as much as 25% on average. The sum of these techniques improves system-wide performance by as much as 56%.


International Conference on Parallel and Distributed Systems | 2005

A Code Generation Algorithm for Affine Partitioning Framework

Shih-Wei Liao; Zhaohui Du; Gansha Wu; Guei-Yuan Lueh

Multiprocessors are about to become prevalent in the PC world. Major CPU vendors such as Intel and Advanced Micro Devices have recently announced their imminent migration to multicore processors. Affine partitioning provides a systematic framework for finding asymptotically optimal computation and data decompositions for multiprocessors, including multicore processors. This affine framework uniformly models a large class of high-level optimizations such as loop interchange, reversal, skewing, fusion, fission, re-indexing, scaling, and statement reordering. However, the code resulting from affine transformations tends to contain more loop levels and complex conditional expressions. This impacts performance, code readability and debuggability for both programmers and compiler developers. To facilitate the adoption of affine partitioning in industry, we address the above practical issues by proposing a two-step algorithm: coalesce and optimize. The coalescing algorithm maintains valid code throughout and improves readability and debuggability. We demonstrate with examples that the optimization algorithm simplifies the resulting loop structures, conditional expressions and array access functions, and generates efficient code.


Archive | 2005

Generating efficient parallel code using partitioning, coalescing, and degenerative loop and guard removal

Shih-Wei Liao; Zhao Hui Du; Bu Qi Cheng; Gansha Wu; Guei-Yuan Lueh


Archive | 2003

Inlining with stack trace cache-based dynamic profiling

Gansha Wu; Guei-Yuan Lueh


Archive | 2005

Data transformations for streaming applications on multiprocessors

Shih-Wei Liao; Zhaohui Du; Gansha Wu; Guei-Yuan Lueh; Zhiwei Ying; Jinzhan Peng


Archive | 2004

Apparatus and methods for restoring synchronization to object-oriented software applications in managed runtime environments

Gansha Wu; Guei-Yuan Lueh; Xiaohua Shi
