
Publication


Featured research published by Fengwei An.


Journal of Systems Architecture | 2013

K-means clustering algorithm for multimedia applications with flexible HW/SW co-design

Fengwei An; Hans Jürgen Mattausch

In this paper, we report a hardware/software (HW/SW) co-designed K-means clustering algorithm with high flexibility and high performance for machine learning, pattern recognition, and multimedia applications. The contributions of this work can be attributed to two aspects. The first is the hardware architecture for nearest-neighbor searching, which is used to overcome the main computational cost of the K-means clustering algorithm. The second is the high flexibility for different applications, which comes not only from the software but also from the hardware. Such flexibility, with respect to the number of training data samples, the dimensionality of each sample vector, the number of clusters, and the target application, is usually lacking in dedicated hardware implementations of the K-means algorithm. In particular, the HW/SW K-means algorithm is extendable to embedded systems and mobile devices. We benchmark our multi-purpose K-means system on the applications of handwritten digit recognition, face recognition, and image segmentation to demonstrate its excellent performance, high flexibility, fast clustering speed, short recognition time, good recognition rate, and versatile functionality.
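As an algorithmic point of reference (not the paper's HW/SW partitioning), the sketch below shows a plain K-means loop in Python; the nearest-centroid assignment step is the nearest-neighbor search whose cost the dedicated hardware is designed to absorb. Function and variable names are illustrative assumptions.

```python
import numpy as np

def kmeans(samples, k, iterations=20, seed=0):
    """Plain K-means; the assignment step dominates the runtime (O(n*k*d))."""
    rng = np.random.default_rng(seed)
    # Initialize cluster centers with k randomly chosen training samples.
    centers = samples[rng.choice(len(samples), size=k, replace=False)]
    for _ in range(iterations):
        # Assignment step: nearest-Euclidean-distance search per sample.
        # This is the part the paper offloads to dedicated hardware.
        dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each center as the mean of its members.
        for c in range(k):
            members = samples[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return centers, labels

# Toy usage: cluster 1000 random 64-dimensional vectors into 10 clusters.
data = np.random.default_rng(1).normal(size=(1000, 64)).astype(np.float32)
centers, labels = kmeans(data, k=10)
```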


Japanese Journal of Applied Physics | 2015

VLSI realization of learning vector quantization with hardware/software co-design for different applications

Fengwei An; Toshinobu Akazawa; Shogo Yamasaki; Lei Chen; Hans Jürgen Mattausch

This paper reports a VLSI realization of learning vector quantization (LVQ) with high flexibility for different applications. It is based on a hardware/software (HW/SW) co-design concept for on-chip learning and recognition and is designed as an SoC in 180 nm CMOS. The time-consuming nearest-Euclidean-distance search in the LVQ algorithm's competition layer is efficiently implemented as a pipeline with parallel p-word input. Since the number of neurons in the competition layer, the weight values, and the numbers of inputs and outputs are scalable, the requirements of many different applications can be satisfied without hardware changes. The number of clock cycles needed to classify a d-dimensional input vector is determined by the pipeline depth R and the number n of reference feature vectors (FVs). Adjustment of the stored reference FVs during learning is done by the embedded 32-bit RISC CPU, because this operation is not time-critical. The high flexibility is verified by a human-detection application using FVs of different dimensionality.
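In software form, the LVQ learning step that the abstract assigns to the embedded RISC CPU amounts to a small update of the winning reference vector found by the nearest-Euclidean-distance search. A minimal LVQ1-style sketch, with the learning rate and all names chosen for illustration rather than taken from the chip:

```python
import numpy as np

def lvq1_step(refs, ref_labels, x, y, lr=0.05):
    """One LVQ1 update: find the nearest reference vector (the competition-layer
    winner) and pull it toward x if its class matches, else push it away."""
    # Nearest-Euclidean-distance search -- the operation the chip pipelines.
    dists = np.linalg.norm(refs - x, axis=1)
    w = dists.argmin()
    # Learning update -- not time-critical, so handled by the embedded CPU.
    if ref_labels[w] == y:
        refs[w] += lr * (x - refs[w])
    else:
        refs[w] -= lr * (x - refs[w])
    return w

# Toy usage: 8 reference vectors of dimension 16, two classes.
rng = np.random.default_rng(0)
refs = rng.normal(size=(8, 16))
ref_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
winner = lvq1_step(refs, ref_labels, x=rng.normal(size=16), y=1)
```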


Custom Integrated Circuits Conference | 2014

A coprocessor for clock-mapping-based nearest Euclidean distance search with feature vector dimension adaptability

Fengwei An; Toshinobu Akazawa; Shogo Yamasaki; Lei Chen; Hans Jürgen Mattausch

In this paper, a coprocessor for word-parallel nearest-Euclidean-distance search is developed based on a distance-to-clock-mapping concept that results in an area-efficient architecture. The nearest-distance search is a computational bottleneck in pattern recognition: a brute-force search among n reference vectors in a d-dimensional space requires O(dn) time. To satisfy multiple applications, flexibility in the feature-vector dimensionality is achieved in the coprocessor with a Dimension Extension Circuit (DEC) for partial-distance pre-accumulation. A clock-reduction algorithm drastically reduces the worst-case number of search clocks from an exponential to only a linear increase with the vector-component bit width. The test chip in 180 nm CMOS for parallel search among 32 reference vectors with 8 bits per component achieves a low power dissipation of 5.02 mW at a 42.9 MHz clock frequency and a 1.8 V supply voltage. Applications with up to 2048-dimensional feature vectors can be handled by the designed coprocessor.
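One behavioral way to picture the distance-clock-mapping concept: each reference vector's accumulated distance is treated as a clock count, all candidates are "ticked" in parallel, and the first counter to expire identifies the nearest reference. The Python model below only illustrates that idea with small toy values; it does not model the circuit or the clock-reduction algorithm that keeps the count linear in the component bit width.

```python
import numpy as np

def nearest_by_clock_mapping(refs, query):
    """Behavioral model: map each squared Euclidean distance to a clock count
    and 'tick' all candidates in parallel until the first one reaches zero."""
    # Squared distances; in hardware these come from word-parallel
    # partial-distance pre-accumulation (the role of the DEC).
    counts = ((refs - query) ** 2).sum(axis=1)
    clock = 0
    while True:
        finished = np.flatnonzero(counts <= clock)
        if finished.size > 0:
            # The first candidate to expire is a nearest reference vector.
            return int(finished[0]), clock
        clock += 1

# Toy usage: 32 reference vectors; small component values keep the toy fast
# (the chip uses 8-bit components plus a clock-reduction algorithm instead).
rng = np.random.default_rng(0)
refs = rng.integers(0, 16, size=(32, 8))
winner, clocks_used = nearest_by_clock_mapping(refs, rng.integers(0, 16, size=8))
```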


Advanced Robotics | 2014

Memory-based hardware-accelerated system for high-speed human detection

Fengwei An; Hans Jürgen Mattausch

The real-world applicability of modern computer-vision and recognition applications is limited by their real-time performance. Hardware-based systems can provide fast solutions for real-time-constrained problems; however, hardware-friendly solutions usually lack the flexibility to handle highly complex tasks. Software-based solutions, on the other hand, are used to tackle complex tasks and allow for greater flexibility, but lack the speed that hardware systems can provide. Inspired by the function of human memory, we propose a hardware-accelerated multi-prototype and nearest-neighbor (NN) search-based learning and classification system, which overcomes these flexibility limitations. A major deficiency of NN-based implementations is the computational demand of the search and clustering processes. An FPGA-implemented coprocessor architecture for the Euclidean-distance search was designed to resolve this deficiency. We benchmarked the system on the complex application of human detection. The experimental results revealed that the system outperformed other implementations by significantly reducing training times while attaining a per-sample detection time of 2.24 s.
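The multi-prototype idea can be stated compactly in software: represent each class by several prototype vectors and classify a query by the class of its nearest prototype, which is exactly the Euclidean-distance search the FPGA coprocessor accelerates. A hedged sketch with illustrative names follows; the paper derives its prototypes with a learning procedure, whereas this toy simply samples them.

```python
import numpy as np

def build_prototypes(samples, labels, per_class=4, seed=0):
    """Pick a few prototypes per class (here simply by random sampling;
    the paper derives them with a clustering/learning procedure)."""
    rng = np.random.default_rng(seed)
    protos, proto_labels = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        chosen = rng.choice(idx, size=min(per_class, idx.size), replace=False)
        protos.append(samples[chosen])
        proto_labels.append(np.full(len(chosen), c))
    return np.vstack(protos), np.concatenate(proto_labels)

def classify(protos, proto_labels, query):
    """Nearest-prototype classification via Euclidean-distance search
    (the step accelerated by the FPGA coprocessor)."""
    dists = np.linalg.norm(protos - query, axis=1)
    return proto_labels[dists.argmin()]

# Toy usage with random 2-class data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 32))
y = (rng.random(200) > 0.5).astype(int)
P, Pl = build_prototypes(X, y)
pred = classify(P, Pl, X[0])
```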


Japanese Journal of Applied Physics | 2016

Reconfigurable VLSI implementation for learning vector quantization with on-chip learning circuit

Xiangyu Zhang; Fengwei An; Lei Chen; Hans Jürgen Mattausch

As an alternative to conventional single-instruction-multiple-data (SIMD) solutions with massive parallelism for self-organizing-map (SOM) neural network models, this paper reports a memory-based proposal for learning vector quantization (LVQ), a variant of SOM. A dual-mode LVQ system, enabling both on-chip learning and classification, is implemented using a reconfigurable pipeline with parallel p-word input (R-PPPI) architecture. As a consequence of reusing the R-PPPI for the most severe computational demands in both modes, power dissipation and Si-area consumption can be dramatically reduced in comparison to previous LVQ implementations. In addition, the designed LVQ ASIC has high flexibility with respect to feature-vector dimensionality and reference-vector number, allowing the execution of many different machine-learning applications. The fabricated test chip in 180 nm CMOS with parallel 8-word inputs and 102 Kbit of on-chip memory achieves a low power consumption of 66.38 mW (at 75 MHz and 1.8 V) and a high learning speed, with the number of clock cycles per d-dimensional sample vector determined by the reference-vector number R.
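A software analogue of the parallel p-word input is to accumulate the squared Euclidean distance in chunks of p components, so that a d-dimensional comparison takes ⌈d/p⌉ accumulation steps per reference vector. The sketch below models only this chunking; the pipelining and reconfiguration of the R-PPPI architecture are not modeled, and all names are illustrative.

```python
import numpy as np

def chunked_sq_distance(ref, query, p=8):
    """Accumulate the squared distance p components at a time, mirroring the
    parallel p-word input: ceil(d/p) accumulation steps per reference vector."""
    d = len(ref)
    acc = 0
    for start in range(0, d, p):              # ceil(d/p) iterations
        diff = ref[start:start + p] - query[start:start + p]
        acc += int((diff * diff).sum())       # one parallel p-word step
    return acc

# Toy usage: 64-dimensional vectors processed 8 words at a time (8 steps).
rng = np.random.default_rng(0)
ref, query = rng.integers(0, 256, size=64), rng.integers(0, 256, size=64)
assert chunked_sq_distance(ref, query) == int(((ref - query) ** 2).sum())
```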


Sensors | 2017

Real-Time Straight-Line Detection for XGA-Size Videos by Hough Transform with Parallelized Voting Procedures

Jungang Guan; Fengwei An; Xiangyu Zhang; Lei Chen; Hans Jürgen Mattausch

The Hough Transform (HT) is a method for extracting straight lines from an edge image. The main limitations of the HT in actual applications are its computation time and storage requirements. This paper reports a hardware architecture for HT implementation on a Field Programmable Gate Array (FPGA) with a parallelized voting procedure. The 2-dimensional accumulator array, namely the Hough space in parametric form (ρ, θ), which computes the strength of each line by a voting mechanism, is mapped onto a 1-dimensional array with regular increments of θ. This Hough space is then divided into a number of parallel parts, so that the computation of (ρ, θ) for the edge pixels and the voting procedure for straight-line determination can be executed in parallel. In addition, a synchronized initialization of the Hough space further increases the speed of straight-line detection, so that XGA video processing becomes possible. The designed prototype system has been synthesized on a DE4 platform with a Stratix-IV FPGA device. In the application of road-lane detection, the average processing speed of this HT implementation is 5.4 ms per XGA frame at a 200 MHz working frequency.
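For reference, the voting procedure itself is compact in software: every edge pixel votes for ρ = x·cosθ + y·sinθ at each quantized θ, and because the θ columns of the accumulator are independent, the θ axis can be split into the parallel parts described above. A small sketch with illustrative quantization choices:

```python
import numpy as np

def hough_votes(edge_points, n_theta=180, rho_max=None):
    """Vote edge pixels into a (rho, theta) accumulator. Each theta column is
    independent, which is what allows the parallelized voting in hardware."""
    thetas = np.deg2rad(np.arange(n_theta))
    if rho_max is None:
        rho_max = int(np.ceil(np.hypot(*edge_points.max(axis=0))))
    acc = np.zeros((2 * rho_max + 1, n_theta), dtype=np.int32)
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    for x, y in edge_points:
        # rho for every theta at once; offset by rho_max so indices are >= 0.
        rho = np.round(x * cos_t + y * sin_t).astype(int) + rho_max
        acc[rho, np.arange(n_theta)] += 1
    return acc

# Toy usage: points on the line y = x inside an XGA-sized (1024x768) frame.
pts = np.array([[i, i] for i in range(0, 700, 7)])
acc = hough_votes(pts)
rho_idx, theta_idx = np.unravel_index(acc.argmax(), acc.shape)
```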


Robotics and Biomimetics | 2012

Human recognition with a hardware-accelerated multi-prototype learning and classification system

Fengwei An; Hans Jürgen Mattausch

This paper reports a hardware-accelerated multi-prototype learning and classification system which is suitable for real-time recognition systems. The real-world applicability of robotics or surveillance systems depends on their real-time performance. Hardware-based solutions can meet the needs of real-time-constrained problems; however, hardware-friendly solutions have lacked the flexibility to handle a large range of complex tasks. Software-based solutions have been used to tackle complex tasks and allow for greater flexibility, but lack the speed that hardware systems can provide. The developed multi-prototype learning and classification system surmounts these limitations and is applied to the problem of human recognition to demonstrate its capabilities. A fully digital Euclidean-distance search circuit is developed in order to reduce the computational cost within the learning and classification process. The system outperforms other implementations by significantly reducing training times and attains a per-sample recognition time of 1.03 μs.


IEEE Transactions on Circuits and Systems for Video Technology | 2017

A Hardware Architecture for Cell-based Feature-Extraction and Classification Using Dual-Feature Space

Fengwei An; Xiangyu Zhang; Aiwen Luo; Lei Chen; Hans Jürgen Mattausch

Many computer-vision and machine-learning applications in robotics, mobile, wearable devices, and automotive domains are constrained by their real-time performance requirements. This paper reports a dual-feature-based object recognition coprocessor that exploits both histogram of oriented gradient (HOG) and Haar-like descriptors with a cell-based parallel sliding-window recognition mechanism. The feature-extraction circuitry for the HOG and Haar-like descriptors is implemented by a pixel-based pipelined architecture, which synchronizes to the pixel frequency of the image sensor. After each cell feature vector is extracted, a cell-based sliding-window scheme enables parallelized recognition for all windows that contain this cell. A nearest-neighbor-search classifier is applied to the HOG and Haar-like feature spaces, respectively. The complementary aspects of the two feature domains enable a hardware-friendly implementation of the binary classification for pedestrian detection with improved accuracy. A proof-of-concept prototype chip, fabricated in a 65-nm SOI CMOS process with thin gate oxide and buried oxide layers (SOTB CMOS) and a 3.22-mm2 core area, achieves an energy efficiency of 1.52 nJ/pixel and a processing speed of 30 fps for 1024 × 1616-pixel image frames at a 200-MHz recognition working frequency and a 1-V supply voltage. Furthermore, multiple chips can implement image scaling, since the designed chip has image-size flexibility attributable to the pixel-based architecture.
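To make the cell-based processing concrete, the sketch below computes per-cell HOG histograms from pixel gradients in plain NumPy; the cell size, bin count, and names are illustrative assumptions, and the Haar-like path, block normalization, and the hardware sliding-window parallelism are not modeled.

```python
import numpy as np

def cell_hog(image, cell=8, bins=9):
    """Per-cell histograms of oriented gradients (unsigned, 0-180 degrees)."""
    img = image.astype(np.float32)
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]      # horizontal gradient
    gy[1:-1, :] = img[2:, :] - img[:-2, :]      # vertical gradient
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    h_cells, w_cells = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((h_cells, w_cells, bins), dtype=np.float32)
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for cy in range(h_cells):
        for cx in range(w_cells):
            sl = (slice(cy * cell, (cy + 1) * cell),
                  slice(cx * cell, (cx + 1) * cell))
            # Magnitude-weighted orientation histogram for this cell.
            hist[cy, cx] = np.bincount(bin_idx[sl].ravel(),
                                       weights=mag[sl].ravel(),
                                       minlength=bins)[:bins]
    return hist

# Toy usage on a random 64x128 "pedestrian window".
feat = cell_hog(np.random.default_rng(0).integers(0, 256, size=(128, 64)))
```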


International Symposium on Circuits and Systems | 2016

Dynamically reconfigurable system for LVQ-based on-chip learning and recognition

Fengwei An; Xiangyu Zhang; Lei Chen; Hans Jürgen Mattausch



Japanese Journal of Applied Physics | 2016

Highly flexible nearest-neighbor-search associative memory with integrated k nearest neighbor classifier, configurable parallelism and dual-storage space

Fengwei An; Keisuke Mihara; Shogo Yamasaki; Lei Chen; Hans Jürgen Mattausch


Collaboration


Dive into Fengwei An's collaborations.

Top Co-Authors

Lei Chen

Hiroshima University

Zunkai Huang

Chinese Academy of Sciences
