Kui Dai

Huazhong University of Science and Technology

Publication

Featured research published by Kui Dai.

Asia Pacific Conference on Circuits and Systems | 2010

The parallel algorithm implementation of matrix multiplication based on ESCA

Pan Chen; Kui Dai; Dan Wu; Jinli Rao; Xuecheng Zou

Parallel computing is an important method in high performance computing. This paper briefly introduces a new SIMD architecture named ESCA (Engineering and Science Computing Accelerator). As a coprocessor, it aims to accelerate the most critical scientific workloads through its architecture and flexible parallel algorithms. Because dense matrix multiplication is a widely used operation that can be accelerated by parallel computing, we map its algorithm onto ESCA and estimate the performance. The results suggest that ESCA has some advantages and potential.

WASE International Conference on Information Engineering | 2010

Parallel Algorithms for FIR Computation Mapped to ESCA Architecture

Pan Chen; Kui Dai; Dan Wu; Jinli Rao; Xuecheng Zou

In this paper we present a parallel algorithm for FIR (Finite Impulse Response) filter computation based on the Engineering and Scientific Computation Accelerator (ESCA) system. ESCA is a heterogeneous multi-core architecture that aims to accelerate compute-intensive parallel computing in high performance applications. By taking advantage of the SIMD processing elements (PEs) and the high-bandwidth, low-latency hierarchical on-chip networks inside ESCA, we obtain good parallel performance and find a way to implement the FIR kernel. By translating the FIR computation into matrix-vector multiplication, we propose an improved implementation of the FIR algorithm that achieves higher performance.
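
The abstract does not give the mapping itself, so the following is only a minimal sketch of the general idea of recasting an FIR filter as a matrix-vector product. The function names `fir_direct` and `fir_as_matvec` are hypothetical, the signal is assumed to be zero before its first sample, and nothing here reflects ESCA's actual data layout or instruction set.

```python
def fir_direct(x, h):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n - k], with x[n] = 0 for n < 0."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def fir_as_matvec(x, h):
    """Same filter, written as a (len(x) x len(h)) matrix times the tap vector."""
    taps = len(h)
    # Row n holds the delayed samples x[n], x[n-1], ..., x[n-taps+1], zero-padded.
    rows = [
        [x[n - k] if n - k >= 0 else 0 for k in range(taps)]
        for n in range(len(x))
    ]
    # Each output sample is an independent dot product, so the rows could be
    # distributed across parallel processing elements (an assumption here).
    return [sum(s * c for s, c in zip(row, h)) for row in rows]


signal = [1, 2, 3, 4, 5]
coeffs = [1, 0, -1]
assert fir_direct(signal, coeffs) == fir_as_matvec(signal, coeffs)
```

Because every row of the matrix is processed independently, this formulation exposes the data parallelism that a SIMD array can exploit, at the cost of storing overlapping copies of the input samples.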

International Conference on Computing and Communications Technologies | 2015

A SCA-resistant processor architecture based on random delay insertion

Zhangqing He; Xingran Deng; Bangmin Yang; Kui Dai; Xuecheng Zou

Random delay insertion is a simple and efficient approach to countering side-channel attacks, but previous methods do not provide ideal protection. In this article, we propose a processor architecture, based on random delay insertion, that is resistant to side-channel attacks. It combines randomized scheduling, randomized instruction insertion and randomized pipeline delay to resist such attacks. We implemented this architecture on the basis of an ARM7 processor, and the implementation results show that it occupies approximately 24.3% more hardware area than the original ARM7 processor. The CPA attack experiment results suggest that the new secure processor has a high capacity to resist side-channel attacks and thus could be used in USB keys, smart cards and other application scenarios that require an extremely high security level.

Journal of Zhejiang University Science C | 2011

Implementation and evaluation of parallel FFT on Engineering and Scientific Computation Accelerator (ESCA) architecture

Dan Wu; Xuecheng Zou; Kui Dai; Jinli Rao; Pan Chen; Zhao-xia Zheng

The fast Fourier transform (FFT) is a fundamental kernel of many computation-intensive scientific applications. This paper deals with an implementation of the FFT on the accelerator system, a heterogeneous multicore architecture to accelerate computation-intensive parallel computing in scientific and engineering applications. The Engineering and Scientific Computation Accelerator (ESCA) consists of a control unit and a single instruction multiple data (SIMD) processing element (PE) array, in which PEs communicate with each other via a hierarchical two-level network-on-chip (NoC) with high bandwidth and low latency. We exploit the architecture features of ESCA to implement a parallel FFT algorithm efficiently. Experimental results show that both the proposed parallel FFT algorithm and the ESCA architecture are scalable. The 16-bit fixed-point parallel FFT performance of ESCA is compared with a published work to prove the superiority of the mapping algorithm and the hardware architecture. The floating-point parallel FFT performances of ESCA are evaluated and compared with those of the IBM Cell processor and GPU to demonstrate the computing power of the ESCA system for high performance applications.
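
For readers unfamiliar with the kernel being parallelized, here is a minimal sketch of a textbook recursive radix-2 FFT, checked against a direct DFT. It is not the paper's mapping algorithm and says nothing about the PE array or the NoC; the function names `fft` and `dft` are hypothetical, and the input length is assumed to be a power of two.

```python
import cmath


def fft(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # transform of even-indexed samples
    odd = fft(x[1::2])   # transform of odd-indexed samples
    half = n // 2
    out = [0j] * n
    for k in range(half):
        # Butterfly: combine the two half-size transforms with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def dft(x):
    """Direct O(n^2) DFT, used only as a reference for checking fft()."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


samples = [1, 2, 3, 4, 0, 0, 0, 0]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(samples), dft(samples)))
```

The butterflies within one stage are independent of each other, which is the property a SIMD implementation would typically exploit; data exchange between stages is where an on-chip network matters.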

Canadian Conference on Electrical and Computer Engineering | 2011

An area-efficient 5GHz/10GHz dual-mode VCO with coupled helical inductors in 0.13-µm CMOS technology

Wanghui Zou; Xiaofei Chen; Jianming Lei; Kui Dai; Xuecheng Zou

The helical inductor is a three-dimensional multi-layer inductor with a relatively higher quality factor and self-resonance frequency compared with other multi-layer inductors. A new dual-mode coupled-inductor VCO based on helical inductors is proposed, which can operate in the 5 GHz and 10 GHz bands. Because it uses helical inductors, the proposed VCO is much more area-efficient than previous coupled-inductor structures. The simulation results show that the phase noise at 1 MHz offset is −101.6 dBc/Hz and −101.3 dBc/Hz at frequencies of 4.8 GHz and 10.4 GHz, respectively. The VCO draws 4 mA from a 1.2 V supply.

International Conference on Algorithms and Architectures for Parallel Processing | 2010

A high efficient on-chip interconnection network in SIMD CMPs

Dan Wu; Kui Dai; Xuecheng Zou; Jinli Rao; Pan Chen

To improve the performance of on-chip data communication in SIMD (Single Instruction Multiple Data) architectures, we propose an efficient and modular interconnection architecture called the Broadcast and Permutation Mesh network (BP-Mesh). The BP-Mesh architecture offers not only low complexity and high bandwidth, but also good flexibility and scalability. The paper discusses the hardware implementation in detail and evaluates the proposed architecture in terms of area cost and performance.

WASE International Conference on Information Engineering | 2010

Implementation of Parallel Game Tree Search on a SIMD System

Dan Wu; Pan Chen; Kui Dai; Jinli Rao; Xuecheng Zou

The α-β algorithm is an efficient technique for searching game trees. In this paper, we present the detailed implementation of a parallel α-β algorithm on our Engineering and Scientific Computation Accelerator (ESCA) system, a heterogeneous multi-core SIMD (Single Instruction stream Multiple Data stream) architecture that accelerates compute-intensive parallel computing in high performance applications and has an enhanced control organization. The conditional execution provided by ESCA significantly benefits the implementation of α-β pruning, and the communication mechanism between the control unit and the PEs (processing elements) can accumulate information for load balancing to achieve better processor utilization. We choose synthetic game trees for the evaluation and present speedup results for varying numbers of processing elements and various tree sizes.

International Conference on Computer Engineering and Technology | 2010

Memory system design and implementation for a multiprocessor

Dan Wu; Xuecheng Zou; Kui Dai; Chengnuo Deng; Shuangxi Lin

In the era of multi-core processors, the challenge of designing a highly efficient memory system is more severe than before. This paper focuses on the design and implementation of the memory hierarchy for a multiprocessor system. Using the distributed shared memory (DSM) model, several techniques are presented to improve the performance of the traditional memory hierarchy and reduce the complexity of the cache coherence logic. Moreover, the proposed memory system favors power saving by reducing the number of accesses to the lower-level memory devices. The memory system has been implemented in a 0.18 µm CMOS process, and some experimental results are presented.

Archive | 2012

A Protected 0.35 μm CMOS Transmitter Circuit for 13.56 MHz RFID Reader SoC

Pan Feng; Jianming Lei; Kui Dai; Xuecheng Zou

An integrated transmitter for a 13.56 MHz RFID reader SoC is designed in 0.35 μm CMOS technology. The circuit consists of an RF/analog part for modulation and protection, and a digital part for controlling the transmitter functionality. It is compatible with the communication standards ISO 14443 A/B, 15693 and 18000-3. It operates at 13.56 MHz with a communication data rate from 212 kHz up to 848 kHz. In the modulation mode of operation, a solution based on configurable antenna drivers is proposed to control both the output power and the modulation index, which ranges from 0% to 100%, including OOK. Moreover, a short-circuit protection circuit for the antenna driver greatly improves its stability. The transmitter is implemented at 3.3 V. The measurement and simulation results of the implemented chip indicate that the designed transmitter operates well in multi-standard mode.

Electronics Letters | 2012