Publication


Featured research published by Qianming Yang.


International Symposium on Microarchitecture | 2008

On-Chip Memory System Optimization Design for the FT64 Scientific Stream Accelerator

Mei Wen; Nan Wu; Chunyuan Zhang; Qianming Yang; Jun Ren; Yi He; Wei Wu; Jun Chai; Maolin Guan; Changqing Xun

With the extension of application domains, hardware-managed memory structures such as caches are drawing attention as a way to handle irregular stream applications. However, since a real application usually has both regular and irregular stream characteristics, conventional stream register files, caches, or combinations thereof have shortcomings. This article focuses on combining software- and hardware-managed memory structures and presents a new syncretic memory system based on the FT64 stream accelerator.
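
The abstract does not detail the syncretic design itself, but the general idea of combining software- and hardware-managed storage can be sketched as follows: regular stream accesses are served from a software-managed stream register file (SRF) that the compiler stages explicitly, while irregular accesses fall back to a hardware-managed cache. The model below is a minimal illustration with hypothetical names, not the FT64 memory system.

```python
# Minimal sketch of a "syncretic" on-chip memory model: software-managed SRF
# for regular streams, hardware-managed cache for irregular accesses.
class SyncreticMemory:
    def __init__(self, srf_words, cache_lines, line_words, backing):
        self.srf = {}                  # software-managed: compiler stages ranges
        self.srf_capacity = srf_words
        self.cache = {}                # hardware-managed: tag -> line data
        self.cache_lines = cache_lines
        self.line_words = line_words
        self.backing = backing         # off-chip memory, addr -> word

    def srf_alloc(self, base, length):
        """Software explicitly stages a regular stream [base, base+length)."""
        assert length <= self.srf_capacity
        for addr in range(base, base + length):
            self.srf[addr] = self.backing[addr]

    def load(self, addr):
        if addr in self.srf:           # regular access: deterministic, no miss
            return self.srf[addr]
        tag = addr // self.line_words  # irregular access: goes through cache
        if tag not in self.cache:
            if len(self.cache) >= self.cache_lines:
                self.cache.pop(next(iter(self.cache)))  # FIFO eviction
            base = tag * self.line_words
            self.cache[tag] = [self.backing[a]
                               for a in range(base, base + self.line_words)]
        return self.cache[tag][addr % self.line_words]

backing = {a: a * 3 for a in range(4096)}
mem = SyncreticMemory(srf_words=1024, cache_lines=8, line_words=16, backing=backing)
mem.srf_alloc(0, 256)           # compiler stages a regular stream
assert mem.load(10) == 30       # SRF hit
assert mem.load(3000) == 9000   # irregular access via the cache
```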


Concurrency and Computation: Practice and Experience | 2017

FPGA-accelerated deep convolutional neural networks for high throughput and energy efficiency

Yuran Qiao; Junzhong Shen; Tao Xiao; Qianming Yang; Mei Wen; Chunyuan Zhang

Recent breakthroughs in deep convolutional neural networks (CNNs) have led to great improvements in the accuracy of both vision and auditory systems. Characterized by their deep structures and large numbers of parameters, deep CNNs challenge the computational capabilities of today's hardware. Hardware specialization in the form of field-programmable gate arrays (FPGAs) offers a promising path toward major leaps in computational performance while achieving high energy efficiency.


Pacific Rim Conference on Communications, Computers and Signal Processing | 2011

An efficient parallel deblocking filter based on GPU: Implementation and optimization

Huayou Su; Chunyuan Zhang; Jun Chai; Qianming Yang

The deblocking filter is one of the most time-consuming tasks in the H.264/AVC standard. Due to its data dependencies and frequent memory accesses, mapping the algorithm efficiently onto a massively parallel architecture is an arduous challenge. In this paper, a novel GPU-based parallel deblocking filter is proposed, which weakens the dependencies between macroblocks (MBs) by rearranging the filtering order of boundaries. We implemented the proposed algorithm on a GPU and optimized the program through three strategies: kernel combination, reuse of intermediate data, and optimized data representation. Experimental results show that the proposed parallel method achieves real-time throughput for 1080p video at 450 fps. We also observed 3.78x and 16.68x speedups for the fully optimized parallel deblocking filter over a two-core processor and a state-of-the-art GPU-based implementation, respectively.
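
The abstract does not spell out the reordering, so the sketch below shows one common way to weaken inter-MB dependencies: filter all vertical edges of the frame in one pass, then all horizontal edges in a second pass, so that the edge filters within each pass touch disjoint pixels and can be mapped one-per-thread. The trivial smoothing filter and function names are hypothetical placeholders, not the H.264 strength-adaptive filter.

```python
# Hedged sketch of dependency-weakened deblocking: instead of filtering each
# macroblock's edges in raster order, process all vertical MB boundaries in
# one parallel pass and all horizontal boundaries in a second pass.
from concurrent.futures import ThreadPoolExecutor

def filter_edge(frame, x, y, vertical):
    """Placeholder smoothing across one boundary pixel pair; the real
    H.264 filter is strength-adaptive and wider."""
    if vertical:
        a, b = frame[y][x - 1], frame[y][x]
        frame[y][x - 1], frame[y][x] = (3 * a + b) // 4, (a + 3 * b) // 4
    else:
        a, b = frame[y - 1][x], frame[y][x]
        frame[y - 1][x], frame[y][x] = (3 * a + b) // 4, (a + 3 * b) // 4

def deblock(frame, mb=16):
    h, w = len(frame), len(frame[0])
    with ThreadPoolExecutor() as pool:
        # Pass 1: every vertical MB boundary touches a disjoint pixel pair, so
        # all of these edge filters are independent. list() forces completion
        # before the horizontal pass starts.
        list(pool.map(lambda xy: filter_edge(frame, xy[0], xy[1], True),
                      [(x, y) for x in range(mb, w, mb) for y in range(h)]))
        # Pass 2: all horizontal MB boundaries, again mutually independent.
        list(pool.map(lambda xy: filter_edge(frame, xy[0], xy[1], False),
                      [(x, y) for y in range(mb, h, mb) for x in range(w)]))
```

On a GPU, each pass would launch one thread per boundary pixel pair; the two-pass structure is what removes the raster-order serialization between neighboring MBs.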


IEEE International Conference on High Performance Computing, Data and Analytics | 2007

FT64: scientific computing with streams

Mei Wen; Nan Wu; Chunyuan Zhang; Wei Wu; Qianming Yang; Changqing Xun

This paper describes FT64 and Multi-FT64, single- and multi-coprocessor systems designed for high-performance scientific computing with streams. We give a detailed case study of porting the Mersenne Prime Search problem to the FT64 and Multi-FT64 systems. We discuss several special problems associated with streamizing, such as kernel processing granularity, stream organization, and workload partitioning for a multiprocessor, which are generally applicable to other scientific codes on FT64. Finally, we perform experiments with eight typical scientific applications on FT64. The results show that a 500MHz FT64 achieves over 50% of its peak performance and a 4.2x peak speedup over a 1.6GHz Itanium2. An eight-processor Multi-FT64 system achieves a 6.8x peak speedup over a single FT64.
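
As context for the case study: the core computation of a Mersenne prime search is the Lucas-Lehmer test, whose modular-squaring recurrence is exactly the kind of kernel that would be streamized on FT64. The plain-Python version below is only a reference for the arithmetic, not the streamized FT64 code.

```python
def lucas_lehmer(p):
    """Return True iff M_p = 2**p - 1 is prime, for odd prime p.

    The s = s*s - 2 (mod M_p) recurrence dominates the runtime; a stream
    implementation would express the large-integer squaring as kernels
    over streams of digits or FFT points.
    """
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

# Known Mersenne prime exponents 13 and 17; 11 is not (2047 = 23 * 89).
assert lucas_lehmer(13) and lucas_lehmer(17)
assert not lucas_lehmer(11)
```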


Annual Computer Security Applications Conference | 2008

FPGA-based Equivalent Simulation Technology (FEST) for clustered stream architecture

Yi He; Ju Ren; Qianming Yang; Mei Wen; Nan Wu; Chunyuan Zhang

Stream architecture research is often hindered by slow software simulation. FPGA-based simulators are much faster; however, larger-scale stream architecture simulation requires more FPGA resources, which means either more FPGA chips or a larger-capacity FPGA chip must be used. This not only increases design complexity but also raises the cost of research. This paper proposes an FPGA-based equivalent simulation technology (FEST) and constructs an equivalent model, the FEST model, based on it. FEST supports cluster-scalable simulation of clustered stream architectures by replacing some components with simpler structures of equivalent function. A simulator based on the FEST model (1) needs fewer FPGA resources than the original system with little effect on simulation speed, (2) is accurate to cycle-level resolution, (3) can run unmodified applications, and (4) can reproduce simulation results, including resource consumption and timing analysis, of the original system.


High Performance Embedded Architectures and Compilers | 2011

Tiled multi-core stream architecture

Nan Wu; Qianming Yang; Mei Wen; Yi He; Ju Ren; Maolin Guan; Chunyuan Zhang

Conventional stream architectures focus on exploiting ILP and DLP in applications, although the stream model also exposes abundant TLP at kernel granularity. At the same time, with the development of modern VLSI technology, increasing application demands and scalability requirements challenge conventional stream architectures. In this paper, we present a novel Tiled multi-core Stream Architecture called TiSA. TiSA introduces the tile, consisting of multiple stream cores, as a new category of architectural resource, and designs an on-chip network to support stream transfers among tiles. In TiSA, multiple levels of parallelism are exploited at different granularities of processing elements. Besides the hardware modules, this paper also discusses other key issues of the TiSA architecture, including the programming model, various execution patterns, and resource allocation. We then evaluate the hardware scalability of TiSA by scaling it from tens to thousands of ALUs and estimating its area and delay cost. We also evaluate the software scalability of TiSA by simulating six stream applications and comparing sustained performance against other stream processors, general-purpose processors, and different configurations of TiSA. A 256-ALU TiSA with 4 tiles and 4 stream cores per tile is shown to be feasible in 45-nanometer technology, sustaining 100~350 GFLOP/s on most stream benchmarks and providing roughly 10x speedup over a 16-ALU TiSA with only a 5% degradation in area per ALU. The results show that TiSA is a VLSI- and performance-efficient architecture for the billion-transistor era.
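
As a quick arithmetic cross-check of the quoted configuration (only the figures stated above are used; the efficiency estimate is an inference, not a number from the paper):

```python
# The 256-ALU configuration factors exactly into the stated hierarchy.
tiles, cores_per_tile, total_alus = 4, 4, 256
alus_per_core = total_alus // (tiles * cores_per_tile)
assert alus_per_core == 16   # each stream core matches the 16-ALU baseline

# ~10x speedup from 16x the ALUs implies roughly 62% parallel efficiency.
speedup, alu_ratio = 10, total_alus // 16
print(f"parallel efficiency at scale: ~{speedup / alu_ratio:.0%}")
```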


Computer and Information Technology | 2010

SAT: A Stream Architecture Template for Embedded Applications

Qianming Yang; Nan Wu; Mei Wen; Yi He; Huayou Su; Chunyuan Zhang

The increasing complexity of embedded applications demands hardware that is more flexible while providing higher performance. Reconfigurable architectures and stream processing have shown significant progress in supporting these applications. This paper presents SAT, a stream architecture template that combines stream processing, clustered VLIW, multi-core, and hardware reuse to achieve both high performance and high flexibility. To enhance the efficiency of the auto-generated circuit, we introduce the ideas of parameters and templates in SAT. In this paper, we also demonstrate a flow for generating the target system based on SAT. The system is implemented in Verilog HDL and consists of a configurable stream core, a RISC core, a memory core, an IO core, and an interconnect core. The stream core's diverse stream-based implementations can be configured easily according to application requirements. The system has also been synthesized onto an Altera FPGA to verify resource utilization and performance.
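
The abstract describes SAT as a parameterized template from which a concrete system is generated. A minimal flavor of that idea is sketched below; the parameter names are hypothetical, and the real SAT flow emits Verilog HDL rather than a Python description. Only the module list (stream, RISC, memory, IO, interconnect cores) comes from the abstract.

```python
# Toy parameterized-template flow: a dictionary of parameters drives
# generation of a system description.
DEFAULTS = {"stream_cores": 1, "clusters_per_core": 4, "alus_per_cluster": 4,
            "srf_kbytes": 128, "risc_core": True, "io_core": True}

def generate_system(**overrides):
    cfg = {**DEFAULTS, **overrides}
    modules = ["memory_core", "interconnect_core"]
    modules += [f"stream_core_{i}" for i in range(cfg["stream_cores"])]
    if cfg["risc_core"]:
        modules.append("risc_core")
    if cfg["io_core"]:
        modules.append("io_core")
    alus = cfg["stream_cores"] * cfg["clusters_per_core"] * cfg["alus_per_cluster"]
    return {"modules": modules, "total_alus": alus, "config": cfg}

# A wider, media-oriented instance without a separate IO core.
print(generate_system(stream_cores=2, alus_per_cluster=8, io_core=False))
```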


Network and Parallel Computing | 2017

Optimizing OpenCL Implementation of Deep Convolutional Neural Network on FPGA

Yuran Qiao; Junzhong Shen; Dafei Huang; Qianming Yang; Mei Wen; Chunyuan Zhang

Nowadays, the rapid growth of data across the Internet has provided sufficient labeled data to train deep structured artificial neural networks. While deeper structured networks bring significant precision gains in many applications, they also pose an urgent demand for higher computation capacity at the expense of power consumption. To this end, various FPGA-based deep neural network accelerators have been proposed for higher performance and lower energy consumption. However, as a dilemma, the development cycle of an FPGA application is much longer than that of a CPU or GPU application. Although FPGA vendors such as Altera and Xilinx have released OpenCL frameworks to ease programming, tuning OpenCL code for desirable performance on FPGAs is still challenging. In this paper, we look into the OpenCL implementation of a convolutional neural network (CNN) on FPGA. By analyzing the execution behavior of a CPU/GPU-oriented version on FPGA, we identify the causes of the performance difference between FPGA and CPU/GPU and locate the performance bottlenecks. Based on our analysis, we put forward a corresponding optimization method focusing on external memory transfers. We implement a prototype system on an Altera Stratix V A7 FPGA, which brings a considerable 4.76x speedup over the original version. To the best of our knowledge, this implementation outperforms most previous OpenCL implementations on FPGA by a large margin.
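
The abstract attributes the gain to restructured external memory transfers. A generic illustration of that idea (not the paper's actual scheme) is to tile the convolution so each input and weight tile is loaded into on-chip buffers once and reused across many output computations; the traffic model below compares words moved with and without tiling, under hypothetical layer dimensions.

```python
# Model of external-memory traffic for one conv layer, naive vs. tiled.
# Tiling loads each input tile (with a k-1 halo) and the weights once per
# tile, instead of refetching operands for every output element.
def conv_traffic(h, w, cin, cout, k, tile):
    naive = h * w * cout * (k * k * cin * 2)   # refetch inputs+weights per output
    tiled = 0
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            th, tw = min(tile, h - ty), min(tile, w - tx)
            tiled += (th + k - 1) * (tw + k - 1) * cin   # input tile + halo
            tiled += k * k * cin * cout                  # weights for this tile
    return naive, tiled

naive, tiled = conv_traffic(h=56, w=56, cin=64, cout=64, k=3, tile=14)
print(f"external words moved: naive={naive:,}, tiled={tiled:,} "
      f"({naive / tiled:.0f}x less traffic)")
```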


International Conference on Computer Science and Service System | 2012

An Energy-Efficient Processor Core for Massively Parallel Computing

Qianming Yang; Nan Wu; Maolin Guan; Chunyuan Zhang; Jun Cai

With the evolution of more sophisticated communication standards and algorithms, embedded applications exhibit demanding performance and efficiency requirements. Massively parallel computing based on many simple cores and a few powerful cores is becoming the mainstream method of building high-performance, low-power processors. Aimed at the design of the simple core, this paper proposes an energy-efficient processor architecture named Smart Core. Following the idea of explicitly parallel and accurate computing, Smart Core uses an exposed, shallow pipeline to eliminate pipeline registers and reduce the cost of executing instructions. A multi-level data memory organization, consisting of streaming memory, a multi-mode register file, and a fully distributed tiny operand register file, captures various forms of data reuse and locality to reduce the cost of delivering data. To reduce the cost of delivering instructions, an asymmetric, fully distributed instruction register file is used to capture the locality and reuse of instructions in a loop. Preliminary results show that Smart Core achieves an energy efficiency 25x greater than that of a traditional embedded RISC processor. When scaled to a 40nm CMOS technology, a single-chip multiprocessor consisting of many cores like Smart Core is capable of providing more than 1 TOPS of performance while achieving an efficiency of 100 GOPS/W or more.
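
The two headline projections pin down an implied power budget; a one-line check using only the figures quoted above:

```python
# Performance divided by efficiency gives the implied chip power budget.
perf_gops = 1000                 # >1 TOPS
efficiency_gops_per_w = 100      # 100 GOPS/W
print(f"implied chip power: ~{perf_gops / efficiency_gops_per_w:.0f} W")  # ~10 W
```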


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

Fully Distributed On-chip Instruction Memory Design for Stream Architecture Based on Field-Divided VLIW Compression

Yi He; Maolin Guan; Chunyuan Zhang; Tian Tian; Qianming Yang

Huge code size and poor code density have always been serious problems for VLIW processors. To deal with this problem and its effect on the instruction memory of stream architectures, this paper proposes a novel method called field-divided VLIW compression, developed by analyzing the code characteristics of stream programs across a wide range of typical stream application domains and dividing mutually unrelated parts of the instruction code into different subfields. Based on field-divided VLIW compression, this paper designs a fully distributed on-chip instruction memory (FDIM) for stream architectures. Experiments on the MASA stream processor demonstrate that field-divided VLIW compression reduces off-chip instruction code by about 38% and on-chip instruction memory demand by about 66%, with little influence on program performance. FDIM reduces the area of the on-chip instruction memory by about 37%, thereby reducing the area of the MASA stream processor by about 8.92%. In addition, the energy consumption of the instruction memory is decreased by about 61%.
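
The mechanism, as described, splits each wide instruction word into mutually unrelated subfields and compresses them separately. A toy dictionary-per-subfield version of that idea is sketched below; the field widths, encoding, and sample words are entirely hypothetical, not MASA's actual format.

```python
# Toy field-divided VLIW compression: split each instruction word into
# independent subfields, build a small dictionary per subfield, and store
# dictionary indices instead of raw bits.
def compress(words, field_bits):
    """words: iterable of ints; field_bits: subfield widths, LSB first."""
    dicts = [{} for _ in field_bits]
    encoded = []
    for w in words:
        idxs, shift = [], 0
        for f, bits in enumerate(field_bits):
            val = (w >> shift) & ((1 << bits) - 1)
            idxs.append(dicts[f].setdefault(val, len(dicts[f])))
            shift += bits
        encoded.append(tuple(idxs))
    return encoded, dicts

def compressed_bits(encoded, dicts, field_bits):
    # Index width per field depends on that field's dictionary size; the
    # dictionaries themselves are counted as on-chip tables.
    idx_bits = [max(1, (len(d) - 1).bit_length()) for d in dicts]
    table_bits = sum(len(d) * b for d, b in zip(dicts, field_bits))
    return len(encoded) * sum(idx_bits) + table_bits

# A repetitive kernel loop: few distinct values per subfield compress well.
words = [0x3F0F01, 0x3F0F02, 0x2A0F01, 0x3F1702] * 64
enc, dicts = compress(words, field_bits=[8, 8, 8])
print(compressed_bits(enc, dicts, [8, 8, 8]), "bits vs", len(words) * 24, "raw")
```

Compressing the word as a whole would need one dictionary entry per distinct 24-bit word; dividing into fields lets each small per-field dictionary cover all combinations, which is the source of the density gain.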

Collaboration


Dive into Qianming Yang's collaboration.

Top Co-Authors

Chunyuan Zhang, National University of Defense Technology
Mei Wen, National University of Defense Technology
Nan Wu, National University of Defense Technology
Yi He, National University of Defense Technology
Changqing Xun, National University of Defense Technology
Ju Ren, National University of Defense Technology
Maolin Guan, National University of Defense Technology
Wei Wu, National University of Defense Technology
Dafei Huang, National University of Defense Technology
Huayou Su, National University of Defense Technology