Xinyu Niu | Researchain

Publication

Featured research published by Xinyu Niu.

field programmable logic and applications | 2012

Exploiting run-time reconfiguration in stencil computation

Xinyu Niu; Qiwei Jin; Wayne Luk; Qiang Liu; Oliver Pell

Stencil computation is computationally intensive and required by many applications. This paper proposes an approach to exploit run-time reconfigurability of field-programmable accelerators for stencil computation. System throughput is optimized by partitioning, analysing and scheduling tasks in applications to remove idle functions. To evaluate the proposed approach, Reverse Time Migration (RTM), a high-performance application, is developed. Our optimized run-time reconfigurable solution, which targets a Virtex-6 FPGA in a Maxeler MAX3424A system, achieves an improved throughput of 102.8 GFlop/s, up to two orders of magnitude faster than the CPU reference designs, 1.59 times faster than the best published GPU and FPGA results, and 1.45 times faster than an optimized static implementation.

applied reconfigurable computing | 2013

Heterogeneous reconfigurable system for adaptive particle filters in real-time applications

Thomas C. P. Chau; Xinyu Niu; Alison Eele; Wayne Luk; Peter Y. K. Cheung; Jan M. Maciejowski

This paper presents a heterogeneous reconfigurable system for real-time applications applying particle filters. The system consists of an FPGA and a multi-threaded CPU. We propose a method to adapt the number of particles dynamically and utilise the run-time reconfigurability of the FPGA for reduced power and energy consumption. An application is developed which involves simultaneous mobile robot localisation and people tracking. It shows that the proposed adaptive particle filter can reduce computation time by up to 99%. Using run-time reconfiguration, we achieve a 34% reduction in idle power and save 26-34% of system energy. Our proposed system is up to 7.39 times faster and 3.65 times more energy efficient than the Intel Xeon X5650 CPU with 12 threads, and 1.3 times faster and 2.13 times more energy efficient than an NVIDIA Tesla C2070 GPU.

field-programmable custom computing machines | 2013

Automating Elimination of Idle Functions by Run-Time Reconfiguration

Xinyu Niu; Thomas C. P. Chau; Qiwei Jin; Wayne Luk; Qiang Liu

A design approach is proposed to automatically identify and exploit run-time reconfiguration opportunities while optimising resource utilisation. We introduce Reconfiguration Data Flow Graph, a hierarchical graph structure enabling reconfigurable designs to be synthesised in three steps: function analysis, configuration organisation, and run-time solution generation. Three applications, based on barrier option pricing, particle filter, and reverse time migration are used in evaluating the proposed approach. The run-time solutions approximate the theoretical performance by eliminating idle functions, and are 1.31 to 2.19 times faster than optimised static designs. FPGA designs developed with the proposed approach are up to 28.8 times faster than optimised CPU reference designs and 1.55 times faster than optimised GPU designs.

field-programmable custom computing machines | 2015

Architectures and Precision Analysis for Modelling Atmospheric Variables with Chaotic Behaviour

Francis P. Russell; Peter D. Düben; Xinyu Niu; Wayne Luk; T. N. Palmer

The computationally intensive nature of atmospheric modelling is an ideal target for hardware acceleration. Performance of hardware designs can be improved through the use of reduced precision arithmetic, but maintaining appropriate accuracy is essential. We explore reduced precision optimisation for simulating chaotic systems, targeting atmospheric modelling in which even minor changes in arithmetic behaviour can have a significant impact on system behaviour. Hence, standard techniques for comparing numerical accuracy are inappropriate. We use the Hellinger distance to compare statistical behaviour between reduced-precision CPU implementations to guide FPGA designs of a chaotic system, and analyse accuracy, performance and power efficiency of the resulting implementations. Our results show that with only a limited loss in accuracy corresponding to less than 10% uncertainty in input parameters, a single Xilinx Virtex 6 SXT475 FPGA can be 13 times faster and 23 times more power efficient than a 6-core Intel Xeon X5650 processor.
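
For readers unfamiliar with the metric named in this abstract, the following is a minimal sketch of the Hellinger distance between two discrete probability distributions, as one might use to compare histograms of a variable sampled from two long simulation runs. The function names `hellinger` and `histogram` and the binning scheme are illustrative assumptions; the abstract does not say how the authors built their distributions.

```python
import math


def histogram(samples, lo, hi, bins):
    """Normalised histogram of samples in [lo, hi); hypothetical helper."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in samples:
        if lo <= x < hi:
            # Clamp the index to guard against floating-point rounding at the upper edge.
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples fall inside [lo, hi)")
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)
```

Because trajectories of a chaotic system diverge under any change in arithmetic, comparing distributions of this kind rather than point-by-point values gives a measure that remains meaningful over long runs.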

field-programmable technology | 2015

Accelerated cell imaging and classification on FPGAs for quantitative-phase asymmetric-detection time-stretch optical microscopy

Junyi Xie; Xinyu Niu; Andy K. S. Lau; Kevin K. Tsia; Hayden Kwok-Hay So

With the fundamental trade-off between speed and sensitivity, existing quantitative phase imaging (QPI) systems for diagnostics and cell classification are often limited to batch processing only small amounts of offline data. While quantitative asymmetric-detection time-stretch optical microscopy (Q-ATOM) offers a unique optical platform for ultrafast and high-sensitivity quantitative phase cellular imaging, performing the computationally demanding backend QPI phase retrieval and image classification in real time remains a major technical challenge. In this paper, we propose an optimized architecture for QPI on FPGA and compare its performance against CPU and GPU implementations in terms of speed and power efficiency. Results show that our implementation on a single FPGA card demonstrates a speedup of 9.4 times over an optimized C implementation running on a 6-core CPU, and 3.47 times over the GPU implementation. It is also 24.19 and 4.88 times more power-efficient than the CPU and GPU implementations, respectively. Throughput increases linearly when four FPGA cards are used to further improve the performance. We also demonstrate an increased classification accuracy when phase images instead of single-angle ATOM images are used. Overall, one FPGA card is able to process and categorize 2497 cellular images per second, making it suitable for real-time single-cell analysis applications.

field-programmable technology | 2015

Lower precision for higher accuracy: Precision and resolution exploration for shallow water equations

James Stanley Targett; Xinyu Niu; Francis P. Russell; Wayne Luk; Stephen Jeffress; Peter D. Düben

Accurate forecasts of future climate with numerical models of atmosphere and ocean are of vital importance. However, forecast quality is often limited by the available computational power. This paper investigates the acceleration of a C-grid shallow water model through the use of reduced precision targeting FPGA technology. Using a double-gyre scenario, we show that the mantissa length of variables can be reduced to 14 bits without affecting the accuracy beyond the error inherent in the model. Our reduced precision FPGA implementation runs 5.4 times faster than a double precision FPGA implementation, and 12 times faster than a multi-threaded CPU implementation. Moreover, our reduced precision FPGA implementation uses 39 times less energy than the CPU implementation and can compute a 100×100 grid for the same energy that the CPU implementation would take for a 29×29 grid.
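
As a rough illustration of what a reduced mantissa length means, the sketch below emulates it in software by zeroing the low-order mantissa bits of an IEEE 754 double. This is an assumption-laden emulation and not the paper's FPGA number format: the helper name `truncate_mantissa` is hypothetical, the count excludes the implicit leading bit, and truncation is used in place of whatever rounding mode the authors chose.

```python
import struct


def truncate_mantissa(x, bits):
    """Keep only the top `bits` explicit mantissa bits of a float64.

    Emulation only: the discarded bits are zeroed (truncation toward zero),
    and the 11-bit exponent is left unchanged.
    """
    if not 0 <= bits <= 52:
        raise ValueError("bits must be between 0 and 52")
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    # Mask that clears the lowest (52 - bits) mantissa bits.
    mask = 0xFFFFFFFFFFFFFFFF & ~((1 << (52 - bits)) - 1)
    (y,) = struct.unpack("<d", struct.pack("<Q", raw & mask))
    return y
```

Applying such a function to every variable after each time step of a software model is one common way to estimate how few bits a simulation tolerates before committing to a hardware design.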

field programmable gate arrays | 2015

EURECA: On-Chip Configuration Generation for Effective Dynamic Data Access

Xinyu Niu; Wayne Luk; Yu Wang

This paper describes Effective Utilities for Run-timE Configuration Adaptation (EURECA), a novel memory architecture for supporting effective dynamic data access in reconfigurable devices. EURECA exploits on-chip configuration generation to reconfigure active connections in such devices cycle by cycle. When integrated into a baseline architecture based on the Virtex-6 SX475T, the EURECA memory architecture introduces small area, delay and power overhead. Three benchmark applications are developed with the proposed architecture targeting social networking (Memcached), scientific computing (sparse matrix-vector multiplication), and in-memory database (large-scale sorting). Compared with conventional static designs, up to 14.9 times reduction in area, 2.2 times reduction in critical-path delay, and 32.1 times reduction in area-delay product are achieved.

IEEE Transactions on Very Large Scale Integration Systems | 2015

Power-Adaptive Computing System Design for Solar-Energy-Powered Embedded Systems

Qiang Liu; Terrence S. T. Mak; Tao Zhang; Xinyu Niu; Wayne Luk; Alex Yakovlev

Through energy harvesting systems, new energy sources are made available immediately for many advanced applications based on environmentally embedded systems. However, the harvested power, such as solar energy, varies significantly under different ambient conditions, which in turn affects the energy conversion efficiency. In this paper, we propose an approach for designing power-adaptive computing systems to maximize the energy utilization under variable solar power supply. Using the geometric programming technique, the proposed approach can generate a customized parallel computing structure effectively. Then, based on the prediction of the solar energy in the future time slots by a multilayer perceptron neural network, a convex model-based adaptation strategy is used to modulate the power behavior of the real-time computing system. The developed power-adaptive computing system is implemented in hardware and evaluated by a solar harvesting system simulation framework for five applications. The results show that the developed power-adaptive systems can track the variable power supply better. The harvested solar energy utilization efficiency is 2.46 times better than the conventional static designs and the rule-based adaptation approaches. Taken together, the present design approach for self-powered embedded computing systems makes better use of ambient energy sources.

international symposium on parallel and distributed processing and applications | 2014

Elastic Management of Reconfigurable Accelerators

Paul Grigoras; Max Tottenham; Xinyu Niu; José Gabriel F. Coutinho; Wayne Luk

This paper presents a runtime system for reconfigurable accelerators that supports elastic management: it enables effective sharing of accelerator resources across multiple applications. For each application, this runtime system allocates an appropriate amount of resources to satisfy its quality-of-service requirements, while minimising the overall execution time for a collection of applications. The effectiveness of this runtime system is due to a set of scheduling algorithms and strategies customised for different types of workloads. We demonstrate our approach by implementing a dynamic Monte Carlo bond options pricing design.

ACM Transactions on Reconfigurable Technology and Systems | 2015