Chengyong Wu | Researchain

Publication

Featured researches published by Chengyong Wu.

asia-pacific computer systems architecture conference | 2005

A register allocation framework for banked register files with access constraints

Feng Zhou; Junchao Zhang; Chengyong Wu; Zhaoqing Zhang

Banked register files have been proposed to reduce die area, power consumption, and access time. Some embedded processors, e.g. Intel’s IXP network processors, adopt this organization. However, they expose some access constraints in the ISA, which complicates the design of register allocation. In this paper, we present a register allocation framework for banked register files with access constraints for the IXP network processors. Our approach relies on estimating the costs and benefits of assigning a virtual register to a specific bank, as well as those of splitting it into multiple banks via copy instructions. We decide between bank assignment and live range splitting based on an analysis of these costs and benefits. Compared to previous work, our framework better balances the register pressure among multiple banks and improves the performance of typical network applications.
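
The cost-based bank choice described in this abstract can be illustrated with a minimal sketch. It is not the paper's cost model: the function names, the `(allowed_banks, frequency)` representation of uses, and the `copy_cost` and `spill_cost` weights are all assumptions made for illustration. Each use of a virtual register lists the banks it may read from; placing the register in a bank costs one copy per use that cannot read that bank, plus a penalty if the bank is already full.

```python
def assignment_cost(bank, uses, pressure, capacity,
                    copy_cost=1.0, spill_cost=10.0):
    """Estimated cost of placing one virtual register in `bank`.

    uses: list of (allowed_banks, frequency) pairs (hypothetical format).
    pressure / capacity: registers in use / available, per bank.
    """
    # Every use that cannot read from `bank` needs a copy instruction.
    copies = sum(freq for allowed, freq in uses if bank not in allowed)
    # Penalise placements that would push the bank past its capacity.
    overflow = max(0, pressure[bank] + 1 - capacity[bank])
    return copies * copy_cost + overflow * spill_cost


def choose_bank(banks, uses, pressure, capacity):
    """Pick the bank with the lowest estimated cost."""
    return min(banks, key=lambda b: assignment_cost(b, uses, pressure, capacity))


uses = [({"A"}, 5), ({"A", "B"}, 3), ({"B"}, 1)]
capacity = {"A": 4, "B": 4}
print(choose_bank(["A", "B"], uses, {"A": 2, "B": 0}, capacity))
print(choose_bank(["A", "B"], uses, {"A": 4, "B": 0}, capacity))
```

In this toy setup the register goes to bank A while A has room, and moves to bank B once A is full, which is the kind of pressure balancing the abstract refers to. Live range splitting is not modelled here.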

ieee international conference on high performance computing data and analytics | 2004

An overview of the open research compiler

Chengyong Wu; Ruiqi Lian; Junchao Zhang; Roy Dz-Ching Ju; Sun Chan; Lixia Liu; Xiaobing Feng; Zhaoqing Zhang

The Open Research Compiler (ORC), jointly developed by Intel Microprocessor Technology Labs and the Institute of Computing Technology at Chinese Academy of Sciences, has become the leading open source compiler on the Itanium™ Processor Family (IPF, previously known as IA-64). Since its first release in 2002, it has been widely used in academia and industry worldwide as a compiler and architecture research infrastructure and as a code base for further development. In this paper, we present an overview of the design of the major components in ORC, especially the new features in the code generator. We discuss the development methodology that is important to achieving the objectives of ORC. Performance comparisons with other IPF compilers and a brief summary of the research work based on ORC are also presented.

international conference on parallel architectures and compilation techniques | 2003

Efficient resource management during instruction scheduling for the EPIC architectures

Dong-yuan Chen; Lixia Liu; Chen Fu; Shuxin Yang; Chengyong Wu; Roy Dz-Ching Ju

Effective and efficient modelling and management of hardware resources have always been critical to generating highly efficient code in optimizing compilers. The instruction templates and dispersal rules of the EPIC architecture add new complexity to managing resource constraints in the instruction scheduler. We extended a finite state automaton (FSA) approach to efficiently manage all key resource constraints of an EPIC architecture on the fly during instruction scheduling. We have fully integrated the FSA-based resource management into the instruction scheduler in the Open Research Compiler for the EPIC architecture. Our integrated approach shows up to 12% speedup on some SPECint2000 benchmarks and 4.5% speedup on average for all SPECint2000 benchmarks on an Itanium machine when compared to an instruction scheduler with decoupled resource management. At the same time, the instruction scheduling time of our approach is reduced by 4% on average.
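
The general idea behind FSA-based resource management can be sketched as follows. This is a toy model, not the ORC implementation: the unit kinds, capacities, and instruction classes below are hypothetical, and templates and dispersal rules are not modelled. All legal per-cycle resource states are enumerated once, so that the scheduler can answer "can this instruction still issue in the current cycle?" with a single table lookup.

```python
from collections import deque
from itertools import product

# Hypothetical machine: slots available per cycle for each unit kind.
CAPACITY = {"M": 2, "I": 2}
# Hypothetical instruction classes and the unit kinds each may use.
CLASSES = {"load": ("M",), "shift": ("I",), "add": ("M", "I")}


def feasible(issued):
    """True if every issued instruction can be given its own unit slot."""
    for choice in product(*(CLASSES[c] for c in issued)):
        if all(choice.count(u) <= cap for u, cap in CAPACITY.items()):
            return True
    return False


def build_fsa():
    """Enumerate reachable states; a state is a sorted tuple of classes."""
    start = ()
    table, seen, work = {}, {start}, deque([start])
    while work:
        state = work.popleft()
        for cls in CLASSES:
            nxt = tuple(sorted(state + (cls,)))
            if not feasible(nxt):
                table[(state, cls)] = None  # no legal transition
                continue
            table[(state, cls)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return start, table


def issue_cycle(candidates, start, table):
    """Greedily issue candidates into one cycle using only table lookups."""
    state, issued = start, []
    for cls in candidates:
        nxt = table[(state, cls)]
        if nxt is not None:
            state = nxt
            issued.append(cls)
    return issued


start, table = build_fsa()
print(issue_cycle(["add", "add", "shift", "load", "load", "shift"], start, table))
```

The feasibility search runs only while the automaton is built; during scheduling each query is a dictionary lookup, which is where the efficiency of the approach comes from.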

languages and compilers for parallel computing | 2005

Optimizing packet accesses for a domain specific language on network processors

Tao Liu; Xiao-Feng Li; Lixia Liu; Chengyong Wu; Roy Dz-Ching Ju

Programming network processors has remained a challenging task since their birth; only recently have high-level programming environments for them begun to emerge. By employing domain specific languages for packet processing, the new environments try to hide hardware details from the programmers and enhance both the programmability of the systems and the portability of the applications. A frequent obstacle to the wide adoption of the new environments is their relatively low achievable performance compared to low-level, hand-tuned programming. In this paper we present two techniques, Packet Access Combining (PAC) and Compiler-Generated Packet Caching (CGPC), to optimize packet accesses, which are shown to be the performance bottleneck in such new environments for packet processing applications. PAC merges multiple packet accesses into a single wider access; CGPC implements an automatic packet data caching mechanism without a hardware cache. Both techniques focus on reducing long memory latency and expensive memory traffic, and they also reduce instruction counts significantly. We have implemented the proposed techniques in a high-level programming environment for network processors named Shangri-La. Our evaluation with standard NPF benchmarks shows that, for the evaluated applications, the two techniques reduce memory traffic by 90% and improve packet throughput by 5.8 times, on average.
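
The idea of merging several packet accesses into one wider access can be illustrated with a small sketch. It is not the Shangri-La compiler's algorithm: the `(offset, size)` representation, the function name, and the 16-byte `max_width` limit are assumptions for illustration, and the dependence analysis a real compiler would need before merging is omitted.

```python
def combine_accesses(accesses, max_width=16):
    """Greedily merge (offset, size) byte ranges into wider accesses.

    Accesses are merged while the combined span fits in `max_width` bytes.
    Gaps between merged ranges are read as well, trading a few extra
    bytes for fewer memory operations.
    """
    combined = []  # list of (start, end) spans, end exclusive
    for offset, size in sorted(accesses):
        end = offset + size
        if combined:
            start, cur_end = combined[-1]
            new_end = max(cur_end, end)
            if new_end - start <= max_width:
                combined[-1] = (start, new_end)
                continue
        combined.append((offset, end))
    return [(start, end - start) for start, end in combined]


# Four narrow packet-field reads become two wider ones.
print(combine_accesses([(0, 2), (2, 4), (12, 4), (30, 2)]))
```

With these inputs the first three reads collapse into one 16-byte access starting at offset 0, and the read at offset 30 stays separate because including it would exceed the width limit.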

asia-pacific computer systems architecture conference | 2008

Diva: A dataflow programming model and its runtime support in Java virtual machine

Yang Chen; Bin Fan; Lujie Zhong; Chengyong Wu

Microprocessors have turned to multicore, i.e. multiple processor cores, along with some levels of on-chip caches and interconnection networks, integrated on a single chip. However, this brings challenges in how to program these processors effectively and efficiently, which is known as the “Wall”. This paper proposes a systematic approach to attack this problem. We describe an extension of the Java programming language with the dataflow paradigm and transactional memory. Our approach alleviates the difficulties of parallel programming by providing a higher level of abstraction and relieving programmers of low-level threading and locking details. We also describe the design of a runtime system to support and optimize for the extension. We have implemented a prototype based on the Apache Harmony DRL Virtual Machine. Preliminary experimental results on a 16-core SMP machine show that our approach achieves reasonable scalability and can adapt to the variance of available hardware resources.

Archive | 2002