Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Jilong Kuang is active.

Publication


Featured research published by Jilong Kuang.


International Conference on Computer Communications | 2010

Optimizing Throughput and Latency under Given Power Budget for Network Packet Processing

Jilong Kuang; Laxmi N. Bhuyan

Current state-of-the-art task scheduling algorithms for network packet processing map a program onto a parallel-pipeline topology on network processors to maximize throughput. However, no existing work targets a power budget for packet processing on off-the-shelf multicore architectures. As energy consumption, reliability and cooling cost for packet processing systems become increasingly important, it is necessary to integrate power-awareness into the scheduler to meet the power budget. In this paper, we propose a novel scheduling algorithm that optimizes both throughput and latency under a given power budget for network packet processing on multicore architectures. The algorithm addresses the power-aware parallel-pipeline scheduling problem by applying per-core DVFS to optimally adjust the frequency of each core. We implement our algorithm on an AMD machine with two Quad-Core Opteron 2350 processors and compare the results with existing algorithms under the same power budget. For six real packet processing applications, our algorithm improves throughput and reduces latency by an average of 64.6% and 25.2%, respectively.
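The core idea of per-core DVFS under a power budget can be sketched roughly as follows. This is an illustrative toy, not the paper's algorithm: the frequency steps, the cubic power model, and the greedy "raise the bottleneck stage" rule are all assumptions made for the example.

```python
# Hypothetical sketch: greedily raise the frequency of the bottleneck
# pipeline stage while the total power stays within budget.
FREQS = [1.0, 1.4, 1.7, 2.0]  # assumed available per-core frequencies (GHz)

def power(freq):
    """Toy dynamic-power model: P ~ f^3 (arbitrary units)."""
    return freq ** 3

def assign_frequencies(stage_work, budget):
    """stage_work[i]: cycles of work on core i; returns per-core frequencies."""
    freqs = [FREQS[0]] * len(stage_work)
    while True:
        # Throughput is limited by the slowest pipeline stage.
        bottleneck = max(range(len(freqs)), key=lambda i: stage_work[i] / freqs[i])
        idx = FREQS.index(freqs[bottleneck])
        if idx + 1 == len(FREQS):
            break  # bottleneck already at maximum frequency
        trial = freqs.copy()
        trial[bottleneck] = FREQS[idx + 1]
        if sum(power(f) for f in trial) > budget:
            break  # raising it would exceed the power budget
        freqs = trial
    return freqs
```

With a generous budget the heavier stage is driven to the top frequency while the lighter one idles low; with a tight budget every core stays at the minimum.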


Architectures for Networking and Communications Systems | 2011

E-AHRW: An Energy-Efficient Adaptive Hash Scheduler for Stream Processing on Multi-core Servers

Jilong Kuang; Laxmi N. Bhuyan; Haiyong Xie; Danhua Guo

We study a streaming network application, video transcoding, to be executed on a multi-core server. It is important for the scheduler to minimize the total processing time and preserve good video quality in an energy-efficient manner. However, the performance of existing scheduling schemes is largely limited by ineffective use of multi-core architectural characteristics and by undifferentiated transcoding cost in terms of energy consumption. In this paper, we identify three key factors that collectively play important roles in transcoding performance: memory access (M), core/cache topology (C) and transcoding format cost (C), or MC^2 for short. Based on MC^2, we propose E-AHRW, an Energy-efficient Adaptive Highest Random Weight hash scheduler that extends the HRW scheduler originally proposed for packet scheduling on homogeneous multiprocessors. E-AHRW achieves stream locality and load balancing at both the stream and packet (frame) levels by adaptively adjusting the hashing decision according to the real-time weighted queue length of each processing unit (PU). Based on E-AHRW, we also design, implement and evaluate a hash-tree scheduler to further reduce computation cost and achieve more effective load balancing on multi-core architectures. Through implementation on an Intel Xeon server and evaluation on realistic workloads, we demonstrate that E-AHRW improves throughput, energy efficiency and video quality thanks to better load balancing, a lower L2 cache miss rate and negligible scheduling overhead.
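The Highest Random Weight (rendezvous) idea behind E-AHRW can be sketched in a few lines. The hash function and the queue-length weighting rule below are illustrative assumptions, not the paper's exact formulas.

```python
# Minimal sketch of queue-weighted HRW hashing: each stream gets a
# deterministic score per processing unit (PU); the score is discounted
# by the PU's current queue length so loaded PUs attract fewer streams.
import hashlib

def hrw_score(stream_key, pu_id):
    """Deterministic pseudo-random score for a (stream, PU) pair."""
    h = hashlib.sha256(f"{stream_key}:{pu_id}".encode()).hexdigest()
    return int(h, 16)

def pick_pu(stream_key, queue_lengths):
    """Pick the PU with the highest queue-weighted HRW score.

    Dividing by (1 + queue length) biases streams toward lightly loaded
    PUs, while packets of one stream keep landing on the same PU when
    loads are stable (stream locality)."""
    return max(range(len(queue_lengths)),
               key=lambda pu: hrw_score(stream_key, pu) / (1 + queue_lengths[pu]))
```

Because the score depends only on the (stream, PU) pair, rebalancing happens only when queue lengths shift, which is what gives HRW-style schedulers both locality and load balance.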


Design Automation Conference | 2010

LATA: a latency and throughput-aware packet processing system

Jilong Kuang; Laxmi N. Bhuyan

Current packet processing systems aim only at producing high throughput, without considering packet latency. For many real-time embedded network applications, it is essential that the processing time not exceed a given threshold. In this paper, we propose LATA, a LAtency and Throughput-Aware packet processing system for multicore architectures. Based on a parallel-pipeline core topology, LATA satisfies the latency constraint and produces high throughput by exploiting fine-grained task-level parallelism. We implement LATA on an Intel machine with two Quad-Core Xeon E5335 processors and compare it with four other systems (Parallel, Greedy, Random and Bipar) for six network applications. LATA exhibits an average latency reduction of 36.5%, and a maximum of 62.2% for URL over Random, with comparable throughput.
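The latency/throughput tension LATA navigates comes from a basic property of linear pipelines, sketched here as a back-of-envelope helper (illustrative, not from the paper):

```python
# In a linear pipeline, a packet traverses every stage, so latency is the
# sum of stage delays, while throughput is bounded by the slowest stage.
def pipeline_metrics(stage_times_us):
    """Return (latency in us, throughput in packets/s) for a linear pipeline."""
    latency = sum(stage_times_us)           # every stage adds to per-packet delay
    throughput = 1e6 / max(stage_times_us)  # bottleneck stage sets the rate
    return latency, throughput
```

Splitting a hot stage across cores raises throughput but adds stages (and inter-core hops) to the latency sum, which is precisely the trade-off a latency- and throughput-aware mapper must balance.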


International Conference on Multimedia and Expo | 2012

An Adaptive Dynamic Scheduling Scheme for H.264/AVC Decoding on Multicore Architecture

Dung Vu; Jilong Kuang; Laxmi N. Bhuyan

Parallelizing H.264/AVC decoding on multicore architectures is challenged by the codec's inherent structural and functional dependencies at both the frame and macro-block levels, as macro-blocks and certain frame types must be decoded in a sequential order. The dynamic scheduling scheme with recursive tail submit, one of the best existing algorithms, provides good throughput by exploiting macro-block-level parallelism and mitigating global queue contention. Nevertheless, it fails to achieve optimal performance due to 1) the use of a global queue, which incurs substantial synchronization overhead as the number of cores increases, and 2) its unawareness of cache locality with respect to the underlying hierarchical core/cache topology, which results in unnecessary latency, communication cost and load imbalance. In this paper, we propose an adaptive dynamic scheduling scheme that employs multiple local queues to reduce lock contention and assigns tasks in a cache-locality-aware, load-balancing fashion, so that neighboring macro-blocks are preferably dispatched to nearby cores. We design, implement and evaluate our scheme on a 32-core cc-NUMA SGI server. Running real benchmark applications, we observe that our scheme produces higher throughput and lower latency than existing alternatives, with a more balanced workload and less communication cost.
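The local-queue, locality-aware idea can be sketched as follows. The row-to-core mapping and nearest-neighbor stealing rule are illustrative assumptions, not the paper's implementation.

```python
# Sketch: per-core local queues replace one contended global queue.
# Neighboring macro-block rows map to neighboring cores (cache locality),
# and an idle core rebalances by taking work from its nearest neighbor.
from collections import deque

class LocalQueueScheduler:
    def __init__(self, num_cores):
        self.queues = [deque() for _ in range(num_cores)]

    def dispatch(self, mb_row, total_rows):
        """Map a macro-block row to a core so adjacent rows land on
        adjacent cores; returns the chosen core."""
        core = mb_row * len(self.queues) // total_rows
        self.queues[core].append(mb_row)
        return core

    def steal(self, idle_core):
        """An idle core takes work from the closest non-empty neighbor,
        keeping rebalanced macro-blocks near their original cache."""
        n = len(self.queues)
        for dist in range(1, n):
            for neighbor in (idle_core - dist, idle_core + dist):
                if 0 <= neighbor < n and self.queues[neighbor]:
                    return self.queues[neighbor].pop()
        return None
```

Each core normally touches only its own queue, so lock contention stays local instead of serializing all cores on one global lock.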


Design Automation Conference | 2012

Traffic-aware power optimization for network applications on multicore servers

Jilong Kuang; Laxmi N. Bhuyan; Raymond Klefstad

In this paper, we design, implement, and evaluate a traffic-aware and power-efficient multicore server system that translates the incoming traffic rate to an appropriate system operating level, which is then translated to an optimal per-core frequency configuration. As the traffic rate varies, the system adjusts the number of active cores and the per-core frequency “on-the-fly” via per-core DVFS, power gating, and power migration, based on our new power model that considers both the dynamic and static power consumption of all cores. Results on an AMD machine with two Quad-Core Opteron 2350 processors for six real network applications chosen from NetBench [19] show that our scheme reduces power consumption by an average of 41.0% compared to running at full capacity, without any reduction in throughput. It also consumes less power than three other approaches, chip-wide DVFS [22], power gating [17], and chip-wide DVFS + power gating [15], by 35.2%, 24.3%, and 10.5%, respectively.
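A traffic-to-operating-level translation can be sketched as a small search. Everything numeric here (frequency steps, per-GHz capacity, the cores-times-f-cubed power proxy) is a made-up assumption for illustration, not the paper's model.

```python
# Sketch: pick the cheapest (active cores, frequency) pair that still
# covers the measured traffic rate; gated-off cores are assumed free.
FREQS = [1.0, 1.4, 1.7, 2.0]   # assumed per-core frequencies (GHz)
CAP_PER_GHZ = 100.0            # assumed Mbps one core sustains per GHz

def operating_level(traffic_mbps, max_cores=8):
    """Return the (cores, freq) pair minimizing a crude power proxy
    (cores * f^3) subject to covering the traffic rate."""
    best = None
    for cores in range(1, max_cores + 1):
        for f in FREQS:
            if cores * f * CAP_PER_GHZ >= traffic_mbps:
                cost = cores * f ** 3
                if best is None or cost < best[0]:
                    best = (cost, cores, f)
    return (best[1], best[2]) if best else (max_cores, FREQS[-1])
```

Note the characteristic shape of the result: moderate loads favor waking an extra core at a low frequency over running fewer cores fast, because dynamic power grows superlinearly with frequency.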


Architectures for Networking and Communications Systems | 2010

Power optimization for multimedia transcoding on multicore servers

Jilong Kuang; Danhua Guo; Laxmi N. Bhuyan

We design, implement and evaluate a power-efficient and traffic-aware transcoding system on multicore servers that appropriately adjusts the processor operating level. The system configures the number of active cores and the core frequency “on-the-fly” according to the varying traffic rate. Results on an AMD machine show that our system reduces power consumption by 51.0% compared to a native system without power-saving schemes. It also outperforms three other power-aware systems, CG [2] (clock gating), C-DVFS [3] (chip-wide DVFS) and Hybrid [1] (chip-wide DVFS + power gating), with 19.5%, 10.5% and 5.5% additional reduction of power consumption, respectively.


International Conference for Internet Technology and Secured Transactions | 2013

RUFC: A flexible framework for reliable UDP with flow control

Ahmed Osama Fathy Atya; Jilong Kuang

Various reliable-UDP-with-flow-control schemes have been widely adopted to enhance the native UDP protocol. However, all existing schemes exhibit one or more drawbacks that prevent them from achieving optimal performance in practice. For example, some under-utilize the available bandwidth, some impose high-overhead logic, and some have rigid policies and parameters that do not adapt to ever-changing system runtime and network conditions. In this paper, we propose RUFC, an efficient and flexible framework for Reliable UDP with Flow Control. The framework sits between the transport layer employing UDP and the application layer. While providing common functionalities and interfaces to both layers, RUFC supports policy customization and parameter tuning to achieve optimal application performance. Extensive experimental data show that RUFC significantly outperforms the native UDP protocol as well as two state-of-the-art UDP-based protocols (UDT and Tsunami) in terms of throughput and error rate.
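The general mechanism such frameworks layer on top of UDP can be sketched as sender-side window bookkeeping. This is a generic illustration of reliability plus flow control, not RUFC's actual protocol or API, and it omits the wire format and timers entirely.

```python
# Toy sender-side state for reliable delivery over UDP with a fixed
# flow-control window: sequence numbers, an unACKed buffer, and a
# retransmission set recomputed on timeout.
class ReliableSender:
    def __init__(self, window=4):
        self.window = window    # flow control: max packets in flight
        self.next_seq = 0
        self.unacked = {}       # seq -> payload awaiting acknowledgment

    def can_send(self):
        return len(self.unacked) < self.window

    def send(self, payload):
        assert self.can_send(), "flow control: window full"
        seq = self.next_seq
        self.unacked[seq] = payload   # keep a copy for retransmission
        self.next_seq += 1
        return seq                    # would accompany the payload on the wire

    def on_ack(self, seq):
        self.unacked.pop(seq, None)   # ACKed data is safe to discard

    def on_timeout(self):
        # Everything still unACKed becomes eligible for retransmission.
        return sorted(self.unacked)
```

A framework like the one described would expose the window size and retransmission policy as tunable parameters rather than hard-coding them as above.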


International Conference on Multimedia and Expo | 2014

A scalable hash scheduler for decoding of multiple H.264/AVC streams on multi-core architecture

Dung Vu; Jilong Kuang; Laxmi N. Bhuyan

Existing scheduling schemes for decoding multiple H.264/AVC streams on multi-core architectures are largely limited by ineffective use of the architecture. Among the reasons are inefficient load balancing, in which common load metrics (e.g. tasks, frames, bytes) fail to correctly reflect the processing load at each core; poor scalability of scheduling algorithms on large-scale multi-cores; and scheduler bottlenecks in multi-stream decoding. In this paper, we propose a scalable adaptive Highest Random Weight (HA-HRW) hash scheduler for distributed shared-memory multi-core architectures that considers the following: 1) the memory access and core/cache topology of the multi-core architecture; 2) an appropriate processing-time load metric to enforce true load balancing; 3) hierarchical parallel scheduling to decode multiple streams simultaneously; 4) locality characteristics of candidate processing units, limiting the search to neighboring cores to enable scalable scheduling. We implement and evaluate our approach on a 32-core SGI server with realistic workloads. Compared with existing schemes, our scheme achieves higher throughput, better load balancing, better CPU utilization, and no jitter problems. It scales with the number of cores and streams, as its time complexity is O(1).


Computer Software and Applications Conference | 2012

A Scalable Physical Memory Allocation Scheme for L4 Microkernel

Chen Tian; Daniel G. Waddington; Jilong Kuang

The L4 microkernel family has become very successful on mobile devices. However, with the rapid shift from uniprocessor to multicore and manycore processors, many critical OS functions, including the physical memory allocator (PMA), must be re-designed to achieve better system throughput. While research and engineering efforts have been made for PMAs in monolithic kernels such as Linux, little work can be found for L4 microkernels. Due to the design difference, the PMA in L4 microkernels is part of the user-level page-fault handler (a.k.a. the pager), which is executed as a stand-alone server in the least-privileged mode. Memory allocation and free requests are handled through inter-process communication (IPC) rather than normal system or kernel function calls. In this work, we first study the scalability of the PMA implementation in L4 microkernels and propose a solution in the context of Fiasco.OC, a state-of-the-art L4 microkernel implementation. We also discuss how to leverage the design advantages of L4 microkernels to implement a PMA with more advanced features such as load balancing, customizability and NUMA-awareness. Finally, we conduct experiments on a 48-core AMD Magny-Cours server to verify the scalability of our solution.
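One standard way to make a memory allocator scale across cores is per-core free lists backed by a shared global pool, sketched below. This is a generic illustration of the scalability technique, not Fiasco.OC's actual design; the batch size and spill threshold are arbitrary.

```python
# Sketch: each core allocates frames from its own free list (uncontended
# fast path); the shared global pool is touched only in batches, so the
# contended access is amortized over many allocations.
class ScalablePMA:
    def __init__(self, num_cores, frames_per_core, batch=4):
        self.batch = batch
        self.global_pool = []                       # shared, lock-protected in reality
        self.local = [list(range(c * frames_per_core,
                                 (c + 1) * frames_per_core))
                      for c in range(num_cores)]    # per-core fast path

    def alloc(self, core):
        if not self.local[core]:
            # Slow path: refill a whole batch from the global pool.
            refill = self.global_pool[:self.batch]
            del self.global_pool[:self.batch]
            self.local[core].extend(refill)
        return self.local[core].pop() if self.local[core] else None

    def free(self, core, frame):
        # Return to the local list; spill excess back to the global pool
        # so one core cannot hoard all free frames.
        self.local[core].append(frame)
        if len(self.local[core]) > 2 * self.batch:
            self.global_pool.append(self.local[core].pop(0))
```

In an L4 setting the alloc/free entry points would be reached via IPC to the pager rather than function calls, which is exactly why keeping the fast path core-local matters.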


Archive | 2013

Quota-based adaptive resource balancing in a scalable heap allocator for multithreaded applications

Jilong Kuang; Daniel G. Waddington; Chen Tian

Collaboration


Dive into Jilong Kuang's collaborations.

Top Co-Authors

Danhua Guo, University of California
Dung Vu, University of California
Ebrahim Nemati, University of California