Publication


Featured research published by Xiangyu Zhang.


Japanese Journal of Applied Physics | 2016

Reconfigurable VLSI implementation for learning vector quantization with on-chip learning circuit

Xiangyu Zhang; Fengwei An; Lei Chen; Hans Jürgen Mattausch

As an alternative to conventional single-instruction-multiple-data (SIMD) mode solutions with massive parallelism for self-organizing-map (SOM) neural network models, this paper reports a memory-based proposal for learning vector quantization (LVQ), which is a variant of SOM. A dual-mode LVQ system, enabling both on-chip learning and classification, is implemented by using a reconfigurable pipeline with parallel p-word input (R-PPPI) architecture. As a consequence of reusing the R-PPPI for the most severe computational demands of both modes, power dissipation and Si-area consumption can be dramatically reduced in comparison to previous LVQ implementations. In addition, the designed LVQ ASIC has high flexibility with respect to feature-vector dimensionality and reference-vector number, allowing the execution of many different machine-learning applications. The fabricated test chip in 180 nm CMOS, with parallel 8-word inputs and 102 Kbit of on-chip memory, achieves low power consumption of 66.38 mW (at 75 MHz and 1.8 V) and a high learning speed, whose clock-cycle count per d-dimensional sample vector depends on the reference-vector number R.
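The dual-mode operation described above, nearest-reference classification plus on-chip learning, follows the classical LVQ1 scheme. A minimal software sketch in Python/NumPy of that scheme (all function names are hypothetical, and squared Euclidean distance stands in for whatever fixed-point metric the chip's pipeline actually computes):

```python
import numpy as np

def lvq1_train(samples, labels, refs, ref_labels, lr=0.05, epochs=10):
    """LVQ1 training: pull the nearest reference vector toward a sample
    of the same class, push it away otherwise."""
    refs = refs.copy()
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # Nearest-reference search: the distance computation that the
            # R-PPPI pipeline accelerates over all R reference vectors.
            d = np.sum((refs - x) ** 2, axis=1)
            w = int(np.argmin(d))                 # winning reference
            sign = 1.0 if ref_labels[w] == y else -1.0
            refs[w] += sign * lr * (x - refs[w])  # attract or repel
    return refs

def lvq1_classify(x, refs, ref_labels):
    """Classification mode: label of the nearest reference vector."""
    return ref_labels[int(np.argmin(np.sum((refs - x) ** 2, axis=1)))]
```

The hardware gain comes from evaluating the distance loop over all reference vectors in a p-word-parallel pipeline rather than sequentially as in this sketch.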


Sensors | 2017

Real-Time Straight-Line Detection for XGA-Size Videos by Hough Transform with Parallelized Voting Procedures

Jungang Guan; Fengwei An; Xiangyu Zhang; Lei Chen; Hans Jürgen Mattausch

The Hough Transform (HT) is a method for extracting straight lines from an edge image. The main limitations of the HT in practical applications are its computation time and storage requirements. This paper reports a hardware architecture for HT implementation on a Field Programmable Gate Array (FPGA) with a parallelized voting procedure. The 2-dimensional accumulator array, namely the Hough space in parametric form (ρ, θ), which accumulates the strength of each line through a voting mechanism, is mapped onto a 1-dimensional array with regular increments of θ. This Hough space is then divided into a number of parallel parts, so that the computation of (ρ, θ) for the edge pixels and the voting procedure for straight-line determination are executable in parallel. In addition, a synchronized initialization of the Hough space further increases the speed of straight-line detection, making XGA video processing possible. The designed prototype system has been synthesized on a DE4 platform with a Stratix-IV FPGA device. In the application of road-lane detection, the average processing time of this HT implementation is 5.4 ms per XGA frame at a 200 MHz working frequency.
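The partitioned voting scheme can be illustrated in software. In this hypothetical Python/NumPy sketch, the θ axis of the accumulator is split into disjoint partitions, each of which owns its own slice of the Hough space and could therefore vote independently, loosely mirroring the FPGA's parallel voting units:

```python
import numpy as np

def hough_votes(edge_points, rho_max, n_theta=180, n_parts=4):
    """Accumulate Hough votes; theta is split into n_parts partitions,
    each owning a disjoint accumulator slice (no write conflicts)."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * rho_max + 1, n_theta), dtype=np.int32)
    for part in np.array_split(np.arange(n_theta), n_parts):
        cos_t, sin_t = np.cos(thetas[part]), np.sin(thetas[part])
        for x, y in edge_points:
            # rho = x*cos(theta) + y*sin(theta), shifted to a valid index
            rho = np.round(x * cos_t + y * sin_t).astype(int) + rho_max
            acc[rho, part] += 1
    return acc, thetas

def strongest_line(acc, thetas, rho_max):
    """Return (rho, theta) of the accumulator cell with most votes."""
    r, t = np.unravel_index(np.argmax(acc), acc.shape)
    return r - rho_max, thetas[t]
```

In hardware, each partition's inner loop runs as a separate voting unit, so the edge-pixel stream is consumed in a single pass.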


IEEE Transactions on Circuits and Systems for Video Technology | 2017

A Hardware Architecture for Cell-based Feature-Extraction and Classification Using Dual-Feature Space

Fengwei An; Xiangyu Zhang; Aiwen Luo; Lei Chen; Hans Jürgen Mattausch

Many computer-vision and machine-learning applications in the robotics, mobile, wearable-device, and automotive domains are constrained by their real-time performance requirements. This paper reports a dual-feature-based object-recognition coprocessor that exploits both histogram-of-oriented-gradients (HOG) and Haar-like descriptors with a cell-based parallel sliding-window recognition mechanism. The feature-extraction circuitry for the HOG and Haar-like descriptors is implemented by a pixel-based pipelined architecture, which synchronizes to the pixel frequency of the image sensor. After extraction of each cell feature vector, a cell-based sliding-window scheme enables parallelized recognition for all windows that contain this cell. A nearest-neighbor-search classifier is applied to the HOG and Haar-like feature spaces, respectively. The complementary aspects of the two feature domains enable a hardware-friendly implementation of the binary classification for pedestrian detection with improved accuracy. A proof-of-concept prototype chip fabricated in a 65-nm SOI CMOS technology with thin gate-oxide and buried-oxide layers (SOTB CMOS), with a 3.22-mm2 core area, achieves an energy efficiency of 1.52 nJ/pixel and a processing speed of 30 fps for 1024 × 1616-pixel image frames at 200-MHz recognition working frequency and 1-V supply voltage. Furthermore, multiple chips can implement image scaling, since the designed chip has image-size flexibility attributable to the pixel-based architecture.
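As a toy illustration of the dual-feature idea (not the chip's exact fusion rule, which the abstract does not spell out), one can require the nearest-neighbor class decisions in the HOG and Haar-like feature spaces to agree before accepting a window; all names below are hypothetical:

```python
import numpy as np

def nn_label(feat, refs, ref_labels):
    """Nearest-neighbor search, applied per feature space."""
    return ref_labels[int(np.argmin(np.sum((refs - feat) ** 2, axis=1)))]

def dual_feature_detect(hog_feat, haar_feat, hog_refs, haar_refs, ref_labels):
    """Accept a window as positive (label 1) only when both complementary
    feature domains agree -- one possible fusion rule."""
    return int(nn_label(hog_feat, hog_refs, ref_labels) == 1 and
               nn_label(haar_feat, haar_refs, ref_labels) == 1)
```

Requiring agreement of two complementary descriptors is one simple way such a scheme can trade a small recall loss for fewer false positives.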


IEEE International Symposium on Circuits and Systems | 2016

Dynamically reconfigurable system for LVQ-based on-chip learning and recognition

Fengwei An; Xiangyu Zhang; Lei Chen; Hans Jürgen Mattausch



IEEE Transactions on Multi-Scale Computing Systems | 2016

A Memory-Based Modular Architecture for SOM and LVQ with Dynamic Configuration

Fengwei An; Xiangyu Zhang; Lei Chen; Hans Jürgen Mattausch



IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2018

Resource-Efficient Object-Recognition Coprocessor With Parallel Processing of Multiple Scan Windows in 65-nm CMOS

Aiwen Luo; Fengwei An; Xiangyu Zhang; Lei Chen; Hans Jürgen Mattausch

Artificial neural networks implement a simplified model of the human brain and thus specialize in pattern recognition. As an alternative to conventional single-instruction-multiple-data (SIMD) solutions with massive parallelism for self-organizing-map (SOM) neural network models, we report a resource-efficient hardware architecture for a 1-chip implementation of the learning vector quantization (LVQ) neural network algorithm, which is a variant of SOM. Dynamic configurability for two operation modes is realized through the same circuitry for recognition based on nearest-neighbor matching and for on-chip learning based on error back-propagation. Switching between learning and recognition modes is carried out by a pipeline with multiplexers and parallel p-word input (P-MPPI). The multiplexers enable data-flow-path reconfiguration, resulting in a significant reduction of area and power consumption. Thus, the P-MPPI architecture achieves time-domain multiplexing between operation modes as well as area and energy efficiency by reusing both memory arrays and arithmetic or logic units. Additionally, high flexibility in feature-vector dimension and reference-vector number allows the implementation of many different applications, including continuously adaptive neural systems, on the same hardware platform. A test chip in TSMC 65 nm CMOS has parallel 32-word inputs and 585 Kbit of on-chip memory, and achieves a high processing throughput of 76.8 Gbps and a low power consumption of 27.92 mW (at 150 MHz and a 1.0 V supply voltage).


Japanese Journal of Applied Physics | 2017

Low-power coprocessor for Haar-like feature extraction with pixel-based pipelined architecture

Aiwen Luo; Fengwei An; Yuki Fujita; Xiangyu Zhang; Lei Chen; Hans Jürgen Mattausch

Pattern matching with high computational cost is often realized in hardware by dedicated implementations, which usually have low flexibility for satisfying different target applications. The conventional single-instruction-multiple-data (SIMD) solution, which fully exploits the massive intrinsic parallelism, is therefore attracting increasing attention for artificial neural networks. In this paper, as an alternative to SIMD, we propose a memory-based reconfigurable architecture for implementing the self-organizing-map (SOM) neural network model and its supervised variant, learning vector quantization (LVQ). A reconfigurable complete-binary-adder-tree (RCBAT) architecture, which allows multiple operation modes through self-managed dynamic configuration, attains good area/power efficiency due to the reuse of arithmetic units. The implemented pipelined parallelism leads to higher throughput and processing speed. Furthermore, high flexibility in feature-vector dimensionality and number is enabled by a partial vector-component storage (PVCS) concept. The fabricated prototype chips in 65 and 180 nm CMOS technology achieve 51.2 Gbit/s and 9.6 Gbit/s throughput, respectively. The experimental results verify fast training and recognition speed, low power dissipation, and large flexibility for different applications.
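The complete-binary-adder-tree idea can be sketched in a few lines of Python. This is an illustrative model only: a pairwise reduction with log2(n) levels, where all additions on one level would run in parallel in hardware; feeding the tree with Manhattan-distance components is an assumption made here for illustration:

```python
def adder_tree_sum(values):
    """Sum by pairwise reduction, mirroring a complete binary adder tree:
    log2(n) levels, each level's additions independent of one another."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:            # pad odd levels, as hardware would
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

def manhattan_distance(x, ref):
    # Component-wise |x - r| terms feed the adder tree; this is the kind
    # of distance accumulation an RCBAT can accelerate (illustrative).
    return adder_tree_sum(abs(a - b) for a, b in zip(x, ref))
```

A sequential accumulator needs n-1 dependent additions, whereas the tree finishes in log2(n) parallel steps, which is where the throughput of the pipelined hardware comes from.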


Japanese Journal of Applied Physics | 2017

A hardware-oriented histogram of oriented gradients algorithm and its VLSI implementation

Xiangyu Zhang; Fengwei An; Ikki Nakashima; Aiwen Luo; Lei Chen; Idaku Ishii; Hans Jürgen Mattausch

Object recognition offers a general-purpose basis for vision-based applications. This paper reports a resource-efficient recognition coprocessor with an embedded cell-based simplified speeded-up-robust-features (SURF) descriptor-extraction unit and a parallel scan-window (SW) recognition engine, applicable to various mobile scenarios and image-sensor types. The feature-extraction circuitry, with its pixel-based pipelined architecture, describes the target objects among complex backgrounds, relying only on the pixel frequency from the image sensor. A cell-based SW algorithm enables parallelized recognition in multiple SWs and compatibility with different image sizes. The proposed hardware-friendly object-recognition coprocessor was implemented in 65-nm silicon-on-thin-BOX (SOTB) CMOS technology with a 1.26 mm2 core area and can operate down to a low supply voltage of 0.5 V. For video-graphics-array (VGA) image sizes, the energy efficiency is determined as 910 μJ per frame at 200 MHz and 1-V supply voltage. The coprocessor's classification performance is demonstrated for pedestrian and car detection.


IEEE Access | 2017

A Vector-Quantization Compression Circuit With On-Chip Learning Ability for High-Speed Image Sensor

Zunkai Huang; Xiangyu Zhang; Lei Chen; Yongxin Zhu; Fengwei An; Hui Wang; Songlin Feng



IEEE International Conference on Solid-State and Integrated Circuit Technology | 2016

Parallel-elementary-stream architecture for nearest-neighbor-search-based self-organizing map

Fengwei An; Xiangyu Zhang; Lei Chen; Hans Jürgen Mattausch


Collaboration


Dive into Xiangyu Zhang's collaboration.

Top Co-Authors

Lei Chen (Hiroshima University)
Zunkai Huang (Chinese Academy of Sciences)
Hui Wang (Chinese Academy of Sciences)
Songlin Feng (Chinese Academy of Sciences)
Yongxin Zhu (Chinese Academy of Sciences)