Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Jinquan Dai is active.

Publication


Featured research published by Jinquan Dai.


Programming Language Design and Implementation (PLDI) | 2005

Automatically partitioning packet processing applications for pipelined architectures

Jinquan Dai; Bo Huang; Long Li; Luddy Harrison

Modern network processors employ parallel processing engines (PEs) to keep up with explosive Internet packet processing demands. Most network processors further allow the processing engines to be organized in a pipelined fashion for higher processing throughput and flexibility. In this paper, we present a novel program transformation technique that exploits the parallel and pipelined computing power of modern network processors. Our method automatically partitions a sequential packet processing application into coordinated, pipelined parallel subtasks that map naturally to contemporary high-performance network processors. The transformation ensures that packet processing work is balanced among the pipeline stages and that data transmission between stages is minimized. We have implemented the proposed transformation in an auto-partitioning C compiler product for Intel Network Processors. Experimental results show that our method provides impressive speedup on the commonly used NPF IPv4 forwarding and IP forwarding benchmarks. For a 9-stage pipeline, our auto-partitioning C compiler obtained more than 4x speedup for the IPv4 forwarding PPS and the IP forwarding PPS (for both IPv4 and IPv6 traffic).
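The pipelining transformation the abstract describes — splitting a sequential packet-processing loop into balanced stages connected by communication channels — can be illustrated with a minimal sketch. This is not the compiler's actual algorithm: plain Python threads stand in for the processing engines, FIFO queues stand in for the inter-stage hardware channels, and all packet fields and stage names are illustrative.

```python
import queue
import threading

SENTINEL = object()  # end-of-stream marker passed down the pipeline

def stage(inq, outq, work):
    """One pipeline stage: pull a packet, apply this stage's slice of
    the original sequential work, and push the result downstream."""
    while True:
        pkt = inq.get()
        if pkt is SENTINEL:
            outq.put(SENTINEL)
            return
        outq.put(work(pkt))

# The sequential application, split into two roughly balanced pieces
# (illustrative stand-ins for real packet-processing work):
def lookup(pkt):
    return {**pkt, "next_hop": pkt["dst"] % 4}

def decrement_ttl(pkt):
    return {**pkt, "ttl": pkt["ttl"] - 1}

def run_pipeline(packets):
    q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=stage, args=(q0, q1, lookup)),
        threading.Thread(target=stage, args=(q1, q2, decrement_ttl)),
    ]
    for t in threads:
        t.start()
    for pkt in packets:
        q0.put(pkt)
    q0.put(SENTINEL)
    out = []
    while (pkt := q2.get()) is not SENTINEL:
        out.append(pkt)
    for t in threads:
        t.join()
    return out

pkts = [{"dst": d, "ttl": 64} for d in range(8)]
result = run_pipeline(pkts)
```

Because each stage is a single thread fed by a FIFO queue, packet order is preserved end to end; only the packet itself crosses the stage boundary, mirroring the paper's goal of minimizing data transmission between stages.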


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) | 2005

Automatic multithreading and multiprocessing of C programs for IXP

Long Li; Bo Huang; Jinquan Dai; Luddy Harrison

Effective compilation of packet processing applications onto the Intel IXP network processors requires, among other things, the automatic use of multiple threads on one or more processing elements, and the automatic introduction of the synchronization required to correctly enforce dependences between those threads. We describe the program transformation used in the Intel Auto-partitioning C Compiler for IXP to automatically multithread and multiprocess a program for the IXP. The transformation consists of steps that introduce inter-thread signaling to enforce dependences, optimize the placement of such signaling, reduce the number of signals in use to the number available in hardware, and transform the initialization code for correct execution in the multithreaded version. Experimental results show that our method provides impressive speedup for six PPSes (Packet Processing Stages) in the widely used NPF IP forwarding benchmarks. For most packet processing stages, our algorithms achieve almost linear performance improvement after the automatic multithreading transformation; the automatic multiprocessing transformation helps further boost the speedup of two PPSes.
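The inter-thread signaling step the abstract mentions — a thread waits for a signal before entering a dependent section and raises the next thread's signal on exit — can be sketched as follows. This is a simplified illustration under assumptions: `threading.Event` objects stand in for the IXP's hardware inter-thread signals, and the "dependent section" is reduced to an ordered append.

```python
import threading

NUM_THREADS = 4
log = []  # records the order in which the dependent section executes

# One "signal" per thread, mirroring the hardware inter-thread signals:
# thread i waits on signal i and, when done, raises signal i+1.
signals = [threading.Event() for _ in range(NUM_THREADS)]

def worker(i):
    # Independent per-packet work could run here in any order (omitted).
    # The dependent section must execute in thread order, so wait for
    # the previous thread's signal before entering it.
    signals[i].wait()
    log.append(i)              # stand-in for the ordered shared update
    if i + 1 < NUM_THREADS:
        signals[i + 1].set()   # signal the next thread in sequence

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
signals[0].set()  # kick off the signal chain
for t in threads:
    t.join()
```

The chain guarantees the dependent sections execute in thread order regardless of how the scheduler interleaves the threads; the paper's optimizations then move these wait/set points and coalesce signals to fit the limited hardware budget.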


Symposium on Code Generation and Optimization (CGO) | 2007

Pipelined Execution of Critical Sections Using Software-Controlled Caching in Network Processors

Jinquan Dai; Long Li; Bo Huang

To keep up with explosive Internet packet processing demands, modern network processors (NPs) employ a highly parallel, multi-threaded and multi-core architecture. In such a parallel paradigm, accesses to shared variables in the external memory (and the associated memory latency) are contained in critical sections, so that they can be executed atomically and sequentially by the different threads in the network processor. In this paper, we present a novel program transformation, used in the Intel® Auto-partitioning C Compiler for IXP, that exploits the inherent finer-grained parallelism of those critical sections using the software-controlled caching mechanism available in the NPs. Consequently, the critical sections can be executed in a pipelined fashion by different threads, effectively hiding the memory latency and improving the performance of network applications. Experimental results show that the proposed transformation provides impressive speedup (up to 9.9x) and scalability (up to 80 threads) for a real-world network application (a 10 Gbps Ethernet Core/Metro Router).
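The core idea — keeping a hot shared variable in fast software-controlled local storage so the critical section no longer pays external-memory latency on every access — can be sketched in miniature. All names are illustrative, a `time.sleep` stands in for external-memory latency, and a Python lock plays the role of the critical section:

```python
import threading
import time

MEM_LATENCY = 0.005  # simulated external-memory access latency (seconds)

class ExternalMemory:
    """Stand-in for the NP's external memory: every access is slow."""
    def __init__(self):
        self.data = {"counter": 0}
    def read(self, key):
        time.sleep(MEM_LATENCY)
        return self.data[key]
    def write(self, key, value):
        time.sleep(MEM_LATENCY)
        self.data[key] = value

mem = ExternalMemory()
lock = threading.Lock()

# Software-controlled cache: after the first miss, the shared variable
# lives in fast local storage, so later critical sections never touch
# external memory and finish much sooner.
cache = {}

def update_with_cache():
    with lock:
        if "counter" not in cache:           # miss: pay memory latency once
            cache["counter"] = mem.read("counter")
        cache["counter"] += 1                # hit: fast local update only

threads = [threading.Thread(target=update_with_cache) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

mem.write("counter", cache["counter"])       # write back once at the end
```

In the baseline, every thread's critical section would contain a slow read-modify-write to external memory; with the cache, only the first thread's section does, which is what lets the sections of successive threads overlap in a pipelined fashion on the real hardware.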


International Conference on Parallel Architectures and Compilation Techniques (PACT) | 2007

Latency Hiding in Multi-Threading and Multi-Processing of Network Applications

Xiaofeng Guo; Jinquan Dai; Long Li; Zhiyuan Lv; Prashant R. Chandra

Network processors employ a multithreaded, chip-multiprocessing architecture to effectively hide memory latency and deliver high performance for packet processing applications. In such a parallel paradigm, when multiple threads modify a shared variable in the external memory, the threads must be properly synchronized so that accesses to the shared variable are protected by critical sections. Therefore, to efficiently harness the performance potential of network processors, it is critical to hide both the memory latency and the synchronization latency in multithreading and multiprocessing. In this paper, we present a novel program transformation, used in the Intel® Auto-partitioning C Compiler for IXP, that performs optimal placement of memory access instructions and synchronization instructions for effective latency hiding. Experimental results show that the transformation provides impressive speedup (up to 8.5x) and scalability (up to 72 threads) for a real-world network application (a 10 Gbps Ethernet Core/Metro Router).
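The placement idea — issue a long-latency memory access as early as possible, overlap it with independent work, and consume the result only at its last use — can be illustrated with a small sketch. This is an assumption-laden analogy, not the compiler's scheduler: a background thread pool stands in for the NP's asynchronous memory unit, and `mem_read`/`independent_work` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MEM_LATENCY = 0.05  # simulated external-memory latency (seconds)

def mem_read(addr):
    time.sleep(MEM_LATENCY)   # simulated asynchronous memory access
    return addr * 2

def independent_work():
    time.sleep(MEM_LATENCY)   # per-packet work that does not need the read
    return "header-parsed"

with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    fut = pool.submit(mem_read, 21)  # hoisted: issue the read early
    hdr = independent_work()         # overlap independent work with it
    value = fut.result()             # sink: consume the result at last use
    elapsed = time.perf_counter() - start
```

Executed sequentially, the two operations would take about twice `MEM_LATENCY`; with the read hoisted above the independent work, the total is close to a single latency, which is the effect the transformation's instruction placement achieves on the real hardware.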


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) | 2007

Latency hiding through multithreading on a network processor

Xiaofeng Guo; Jinquan Dai; Long Li; Zhiyuan Lv; Prashant R. Chandra

1. IXP Architecture

Network processors are specialized processors used to build network devices such as switches, routers, and firewalls, because of their flexibility, programmability, and ability to deliver scalable packet processing performance from a few hundred Mbps to 10 Gbps. However, because of the highly multi-threaded, chip-multiprocessor architectures of network processors, most developers find it difficult to realize their full performance potential for their target applications and often resort to programming in low-level assembly language. The resulting complexity of software development often masks the benefits of employing network processors.


Archive | 2003

Apparatus and method for an automatic thread-partition compiler

Long Li; Cotton Seed; Bo Huang; Luddy Harrison; Jinquan Dai


Archive | 2004

Automatic caching generation in network applications

Jinquan Dai; Luddy Harrison; Long Li; Bo Huang


Archive | 2010

Apparatus and method for automatically parallelizing network applications through pipelining transformation

Jinquan Dai; Luddy Harrison; Bo Huang; Cotton Seed; Long Li


Archive | 2007

Critical Section Ordering for Multiple Trace Applications

Xiaofeng Guo; Jinquan Dai; Long Li


Archive | 2003

Compiler with two phase bi-directional scheduling framework for pipelined processors

Jinquan Dai; Cotton Seed; Bo Huang; Luddy Harrison

Collaboration


Dive into Jinquan Dai's collaborations.

Top Co-Authors


Luddy Harrison

University of Illinois at Urbana–Champaign
