Publication


Featured research published by Ji Kong.


IEEE Journal of Solid-State Circuits | 2011

A 530 Mpixels/s 4096×2160@60fps H.264/AVC High Profile Video Decoder Chip

Dajiang Zhou; Jinjia Zhou; Xun He; Jiayi Zhu; Ji Kong; Peilin Liu; Satoshi Goto

The increased resolution of Quad Full High Definition (QFHD) offers a significantly enhanced visual experience. However, the corresponding data throughput of up to 530 Mpixels/s poses a major challenge for real-time video decoder VLSI design, because it places heavy demands on both DRAM bandwidth and computational power. In this work, a lossless frame recompression technique and a partial MB reordering scheme are proposed to reduce the DRAM access of a QFHD video decoder chip. In addition, pipelining and parallelization techniques such as NAL/slice-parallel entropy decoding are implemented to efficiently enhance its computational power. The chip, which supports the H.264/AVC High Profile, is fabricated in 90 nm CMOS and verified. It delivers a maximum throughput of 4096×2160@60fps, at least 4.3 times higher than the state of the art. The DRAM bandwidth requirement is typically reduced by 51%, which fits the design into a 64-bit LPDDR SDRAM interface and results in a 58% DRAM power saving. Meanwhile, pipelining and parallelization reduce the core energy by 54%.
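As a quick check on the 530 Mpixels/s figure, the throughput follows directly from the QFHD resolution and frame rate. The sketch below also derives the macroblock rate, assuming the 16×16 luma macroblocks of H.264/AVC; the helper names are illustrative only.

```python
# Hypothetical helpers: derive decoder throughput from resolution and frame rate.
WIDTH, HEIGHT, FPS = 4096, 2160, 60
MB_SIZE = 16  # H.264/AVC macroblocks cover 16x16 luma samples


def pixel_rate(width: int, height: int, fps: int) -> int:
    """Pixels per second that the decoder must output."""
    return width * height * fps


def macroblock_rate(width: int, height: int, fps: int, mb: int = MB_SIZE) -> int:
    """Macroblocks per second (assumes both dimensions are multiples of the MB size)."""
    return (width // mb) * (height // mb) * fps


if __name__ == "__main__":
    px = pixel_rate(WIDTH, HEIGHT, FPS)
    mbs = macroblock_rate(WIDTH, HEIGHT, FPS)
    print(f"{px / 1e6:.1f} Mpixels/s, {mbs:,} MB/s")  # 530.8 Mpixels/s, 2,073,600 MB/s
```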


Symposium on VLSI Circuits | 2010

A 530 Mpixels/s 4096×2160@60fps H.264/AVC high profile video decoder chip

Dajiang Zhou; Jinjia Zhou; Xun He; Ji Kong; Jiayi Zhu; Peilin Liu; Satoshi Goto

An H.264/AVC High Profile (HP) video decoder is implemented in 90 nm CMOS. Its maximum throughput reaches 4096×2160@60fps, at least 4.3x higher than the state of the art. Partial MB reordering and lossless frame recompression reduce DRAM bandwidth by 51%, resulting in a 58% DRAM power saving. Meanwhile, several efficient parallelization techniques contribute to a core energy saving of 54%.


Asian Solid-State Circuits Conference | 2007

An SoC based HW/SW co-design architecture for multi-standard audio decoding

Dajiang Zhou; Peilin Liu; Ji Kong; Yunfei Zhang; Bin He; Ning Deng

In this paper, we present an SoC-based HW/SW co-design architecture for multi-standard audio decoding. It supports the AAC LC profile, Dolby AC3, Ogg Vorbis, MPEG-1 Layer 3 (MP3), and Windows Media Audio (WMA). A reconfigurable VLSI filterbank based on the CORDIC algorithm is developed to accelerate the multi-standard decoding process. We designed and implemented an SoC platform to verify the filterbank as an IP core. Experimental results show that the architecture performs real-time audio decoding at low clock frequencies (typically 10.6 MHz for AAC and 11.3 MHz for MP3) at a low implementation cost (44.3k gates, 34k bytes of RAM, and 45k bytes of data ROM for all five audio standards). The architecture is also flexible enough to be extended to new formats and standards.
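The abstract does not say how the filterbank is mapped onto CORDIC. As a generic illustration only, the sketch below shows CORDIC in rotation mode, which rotates a 2-D vector using nothing but additions and scaling by powers of two. Rotating sample pairs by fixed angles is the kind of operation an MDCT-style filterbank needs, but that connection is an assumption here. Floating point is used for readability; hardware would use fixed-point values, right shifts, and a small table of atan(2^-i) angles.

```python
import math


def cordic_rotate(x: float, y: float, angle: float, iterations: int = 24) -> tuple[float, float]:
    """Rotate the vector (x, y) by `angle` radians using CORDIC rotation mode.

    Each micro-rotation uses only additions and multiplications by 2**-i
    (a right shift in fixed-point hardware). Converges for |angle| up to
    about 1.743 rad, the sum of all elementary angles atan(2**-i).
    """
    # Aggregate gain of the micro-rotations; compensated once at the end.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # remaining angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0      # rotate toward the remaining angle
        scale = 2.0 ** -i                  # a right shift by i in hardware
        x, y = x - d * y * scale, y + d * x * scale
        z -= d * math.atan(scale)          # elementary angle, a ROM table in hardware
    return x / gain, y / gain
```

For example, `cordic_rotate(1.0, 0.0, theta)` returns approximately `(cos(theta), sin(theta))`, and each extra iteration adds roughly one bit of precision.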


Symposium on Application Specific Processors | 2010

Next-generation consumer audio application specific embedded processor

Ji Kong; Peilin Liu; Xianmin Chen; Jin Wang; Xingguang Pan; Jun Wang; He Xiao; Zhenqi Wei; Rendong Ying

For next-generation audio applications, the dominant trends are much higher sample rates, longer word lengths, and more channels of playback audio data. Traditional DSPs and embedded processors are inefficient for such applications because of their general-purpose or limited computing capabilities and their on-chip memory architectures. In this paper, an embedded audio processor targeting next-generation audio applications is proposed. Its audio-specific instruction set architecture is based on an analysis of the requirements of next-generation audio processing. In addition, a novel tightly coupled audio memory is proposed to support extremely high audio data throughput and flexible audio data transfers with main memory. To evaluate the performance of the proposed processor, a set of benchmarks derived from an analysis of next-generation audio applications is used. The implementation and evaluation results show that the proposed audio processor is highly efficient and cost-effective for next-generation audio applications.


System on Chip Conference | 2010

A novel reconfigurable scratchpad memory for audio applications on cost-effective SoC

Ji Kong; Peilin Liu

Scratchpad memories (SPMs) are now widely used as supplements to, or even replacements for, cache memories in audio applications on cost-effective SoCs. However, traditional SPM architectures are limited by tight capacities and restricted methods of exchanging data with main memory. These limitations significantly degrade overall system performance, since most audio applications require high-capacity memory modules and flexible data transfer methods. To overcome these weaknesses, this paper proposes a novel reconfigurable SPM (RSPM). The main advantage of the RSPM is that it performs a succession of data transfers independently, which effectively improves overall system performance for most audio applications. Another attractive feature is that the hardware cost of the RSPM is independent of the scale and complexity of the target applications. Compared with a d-cache of the same capacity, the RSPM improves SoC performance on audio computing kernel benchmarks by up to 26.7%. Meanwhile, to achieve the same processing efficiency on the audio benchmarks, traditional SPMs require an area up to 7.2 times larger than that of the RSPM. In addition, explorations of two complete sample audio applications show that an SoC with a 1 KB RSPM cooperating with a 1 KB d-cache performs even better than an SoC with a pure 16 KB d-cache, which has a much larger hardware cost. These features make the RSPM more attractive than traditional SPMs and make SoC solutions cost-effective for most audio applications.


IEEE Signal Processing Letters | 2010

Split Table Extension: A Low Complexity LVQ Extension Scheme in Low Bitrate Audio Coding

Jin Wang; Peilin Liu; Ji Kong; Rendong Ying

Embedded algebraic vector quantization (EAVQ) is a fast and efficient lattice vector quantization (LVQ) scheme used in low-bitrate audio coding. However, EAVQ suffers from overload distortion, which causes unpleasant noise in coded audio. To solve this problem, base codebook extension schemes must be designed carefully. In this letter, we present a novel EAVQ codebook extension scheme, split table extension (STE), which splits a vector into two smaller vectors: one in the base codebook and the other in the split table. The base codebook and the split table are designed according to the occurrence probabilities of quantized vectors in audio segments. Experiments on encoding multiple audio and speech sequences show that, compared with the existing Voronoi extension scheme, STE greatly reduces computational complexity and storage requirements while achieving similar coding quality.


Applied Mechanics and Materials | 2014

Design of Arithmetic Operation Core in Embedded Processor for High Definition Audio Applications

Zhen Qi Wei; Pei Lin Liu; Ji Kong; Ren Dong Ying

To meet the requirements of wider data width, higher throughput, and greater flexibility, a specific arithmetic operation core (AOC) is designed for high-definition audio application-specific processors. The proposed core can process long bit-width operations as well as short bit-width operations in parallel. A six-stage pipeline is used in the AOC architecture to support a large set of DSP operations, and a novel stage-skipping technique improves the execution efficiency of instructions passing through the deep pipeline. Several DSP kernels and audio decoding applications are used to evaluate the performance of the AOC. Experimental results show that the proposed core achieves over 50% higher execution efficiency in audio applications than conventional high-performance DSPs, making it an appealing operation-core design for high-definition audio applications.
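The abstract does not give the AOC's word widths. Assuming, purely for illustration, a 32-bit datapath that can also act as two independent 16-bit lanes, the sketch below contrasts one long addition with two packed short additions on the same operands. In hardware the carry chain would simply be cut at the lane boundary; this software version emulates that with the usual SWAR (SIMD-within-a-register) masking trick. All names are hypothetical.

```python
MASK32 = 0xFFFF_FFFF
LANE_MSB = 0x8000_8000   # top bit of each 16-bit lane
LANE_LOW = 0x7FFF_7FFF   # every bit except each lane's top bit


def add32(a: int, b: int) -> int:
    """One long (32-bit) addition, wrapping modulo 2**32."""
    return (a + b) & MASK32


def add16x2(a: int, b: int) -> int:
    """Two short (16-bit) additions packed in one 32-bit word; each lane wraps independently.

    Adding only the low 15 bits of each lane guarantees that no carry crosses a
    lane boundary; each lane's top bit is then fixed up with XOR.
    """
    low = (a & LANE_LOW) + (b & LANE_LOW)
    return (low ^ ((a ^ b) & LANE_MSB)) & MASK32


# Same operands, different results: the long add propagates the carry out of
# the low half, while the packed add keeps the two lanes separate.
print(hex(add32(0x0001_FFFF, 0x0001_0001)))    # 0x30000
print(hex(add16x2(0x0001_FFFF, 0x0001_0001)))  # 0x20000
```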


International Symposium on Circuits and Systems | 2013

Optimization of ETSI DSR frontend software on a high-efficient audio DSP

Zhenqi Wei; Peilin Liu; Cun Yu; Hongbin Zhou; Ying Ye; Ji Kong; Rendong Ying

Server-terminal distributed speech recognition (DSR) applications are widely adopted on mobile devices. In this paper, we implement a power-efficient, high-performance DSR solution for real-time speech processing. The DSR front-end algorithms are carefully optimized in assembly code using acceleration techniques provided by a previously released audio DSP, such as binary scaling operations in a deep instruction pipeline, automatic memory addressing, and parallel processing of packed data. The performance of the DSR front-end software running on the DSP is greatly improved, and our implementation is the most efficient among previous solutions. The clock frequency required to process 16 kHz input streams in real time is 124.3 MHz, only about 30% of that required on a TI C64x DSP. In simulation using an SMIC 130 nm process, the power consumed by DSR front-end processing is 23 mW. The implementation is also integrated into a server-terminal demo system and has been shown to work well in real speech recognition applications.
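"Binary scaling" usually refers to fixed-point arithmetic in which the binary point is implied by a power-of-two scale factor. The abstract does not state the DSP's data format, so the sketch below assumes 16-bit Q15 operands and shows the multiply sequence such hardware typically accelerates: a double-width product, a rounding shift back to Q15, and saturation. The function names are hypothetical.

```python
Q = 15
Q15_MAX = (1 << Q) - 1   # 32767, about +0.99997
Q15_MIN = -(1 << Q)      # -32768, exactly -1.0


def saturate(v: int) -> int:
    """Clamp to the representable Q15 range."""
    return max(Q15_MIN, min(Q15_MAX, v))


def to_q15(x: float) -> int:
    """Quantize a real value in [-1, 1) to Q15 (round to nearest, then saturate)."""
    return saturate(round(x * (1 << Q)))


def q15_mul(a: int, b: int) -> int:
    """Q15 x Q15 -> Q30 product, rounded and shifted back to Q15, then saturated."""
    product = a * b                                    # Q30, fits in 32 signed bits
    return saturate((product + (1 << (Q - 1))) >> Q)   # add half an LSB, then shift


print(q15_mul(to_q15(0.5), to_q15(0.5)))  # 8192, i.e. 0.25
print(q15_mul(Q15_MIN, Q15_MIN))          # 32767: (-1) * (-1) saturates
```

Saturation matters in the second case: the exact result, +1.0, is not representable in Q15, so the value is clamped instead of wrapping to -1.0.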


International Symposium on Circuits and Systems | 2011

StreamPoP: Stream programming oriented power-efficient audio DSP

Ji Kong; Peilin Liu; Zhenqi Wei; Kun Yang; Ying Ye; Rendong Ying

In the stream programming style, computation and memory accesses are decoupled as much as possible. This style brings new benefits in both performance and power efficiency for digital signal processing. This paper presents the architecture of StreamPoP, a stream-programming-oriented, power-efficient digital signal processor for consumer audio applications. The instruction and data supply subsystem of StreamPoP is designed for flexible stream programming and high power efficiency. To further reduce power consumption, the micro-architecture of StreamPoP uses an audio-computation-specific deep instruction pipeline (DIP). A set of audio benchmarks is used to evaluate StreamPoP on consumer audio applications. The results show that StreamPoP outperforms conventional high-performance DSPs while being much more power-efficient for audio applications. The simulated power consumption of StreamPoP in a TSMC 90 nm process is 5.1 mW for real-time AAC decoding.


IEEE Computer Architecture Letters | 2012

Atomic Streaming: A Framework of On-Chip Data Supply System for Task-Parallel MPSoCs

Ji Kong; Peilin Liu; Yu Zhang

State-of-the-art fabrication technology, which integrates numerous hardware resources such as processors/DSPs and memory arrays into a single chip, has enabled the emergence of the Multiprocessor System-on-Chip (MPSoC). The stream programming paradigm on MPSoCs is highly efficient for single-functionality scenarios because of its dedicated and predictable data supply system. However, when memory traffic is heavily shared among parallel tasks in applications with multiple interrelated functionalities, performance suffers from task interference and shared-memory congestion, which lead to poor parallel speedup and poor memory bandwidth utilization. This paper proposes a framework for a stream-processing-based on-chip data supply system for task-parallel MPSoCs. In this framework, stream address generation and data computation are decoupled and parallelized to allow full utilization of on-chip resources. Task granularities are dynamically tuned to jointly optimize overall application performance. Experiments show that the proposed framework and tuning scheme are effective for joint optimization in task-parallel MPSoCs.
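To illustrate what decoupling stream address generation from data computation means, the sketch below separates an address generator, a loader, and a compute kernel that only consumes operand streams. The (base, stride, count) stream descriptor and all function names are hypothetical. Python generators run lazily in a single thread, whereas in the proposed hardware address generation and computation would proceed in parallel.

```python
from typing import Iterable, Iterator, List


def address_stream(base: int, stride: int, count: int) -> Iterator[int]:
    """Access side: expand a hypothetical (base, stride, count) descriptor into addresses."""
    for i in range(count):
        yield base + i * stride


def load_stream(memory: List[int], addresses: Iterable[int]) -> Iterator[int]:
    """Data supply: fetch words in address order, with no knowledge of the computation."""
    for addr in addresses:
        yield memory[addr]


def mac_kernel(samples: Iterable[int], coeffs: Iterable[int]) -> int:
    """Execute side: a multiply-accumulate that consumes streams and never computes addresses."""
    return sum(s * c for s, c in zip(samples, coeffs))


memory = list(range(16))
samples = load_stream(memory, address_stream(base=0, stride=2, count=4))  # words 0, 2, 4, 6
coeffs = load_stream(memory, address_stream(base=8, stride=1, count=4))   # words 8, 9, 10, 11
print(mac_kernel(samples, coeffs))  # 0*8 + 2*9 + 4*10 + 6*11 = 124
```

Changing the access pattern, such as the stride, only touches the descriptor and never the kernel, which is the property that lets the two sides be scheduled independently.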

Collaboration


Top Co-Authors

Peilin Liu, Shanghai Jiao Tong University
Rendong Ying, Shanghai Jiao Tong University
Zhenqi Wei, Shanghai Jiao Tong University
Jin Wang, Shanghai Jiao Tong University
Xianmin Chen, Shanghai Jiao Tong University
Dajiang Zhou, Shanghai Jiao Tong University
Bin He, Shanghai Jiao Tong University