Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Miaoqing Huang is active.

Publication


Featured research published by Miaoqing Huang.


IEEE Computer | 2008

The Promise of High-Performance Reconfigurable Computing

Tarek A. El-Ghazawi; Esam El-Araby; Miaoqing Huang; Kris Gaj; Volodymyr V. Kindratenko; Duncan A. Buell

Several high-performance computers now use field-programmable gate arrays as reconfigurable coprocessors. The authors describe the two major contemporary HPRC architectures and explore the pros and cons of each using representative applications from remote sensing, molecular dynamics, bioinformatics, and cryptanalysis.


IEEE Transactions on Computers | 2011

New Hardware Architectures for Montgomery Modular Multiplication Algorithm

Miaoqing Huang; Kris Gaj; Tarek A. El-Ghazawi

Montgomery modular multiplication is one of the fundamental operations used in cryptographic algorithms, such as RSA and Elliptic Curve Cryptosystems. At CHES 1999, Tenca and Koç proposed the Multiple-Word Radix-2 Montgomery Multiplication (MWR2MM) algorithm and introduced a now-classic architecture for implementing Montgomery multiplication in hardware. With parameters optimized for minimum latency, this architecture performs a single Montgomery multiplication in approximately 2n clock cycles, where n is the size of the operands in bits. In this paper, we propose two new hardware architectures that are able to perform the same operation in approximately n clock cycles with almost the same clock period. These two architectures are based on precomputing partial results using two possible assumptions regarding the most significant bit of the previous word. They outperform the original architecture of Tenca and Koç in terms of the latency-times-area product by 23 and 50 percent, respectively, for several of the most common operand sizes used in cryptography. The radix-2 architecture can be extended to the case of radix-4, while preserving a factor-of-two speedup over the corresponding radix-4 design by Tenca, Todorov, and Koç from CHES 2001. Our optimization has been verified by modeling it in Verilog-HDL, implementing it on a Xilinx Virtex-II 6000 FPGA, and experimentally testing it on the SRC-6 reconfigurable computer.
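
The core operation these architectures implement can be sketched in software. Below is a minimal, illustrative radix-2 (bit-serial) Montgomery multiplication in Python, roughly mirroring what one clock cycle of a radix-2 datapath does per loop iteration; the function name and the toy operands are my own, not from the paper.

```python
def montgomery_mul_radix2(x, y, m, n):
    """Return x * y * 2^(-n) mod m, scanning x one bit per iteration.

    Each iteration mirrors one cycle of a radix-2 datapath: add x_i * y,
    add m if needed to make the partial sum even, then shift right by one.
    """
    assert m % 2 == 1, "Montgomery reduction requires an odd modulus"
    s = 0
    for i in range(n):
        x_i = (x >> i) & 1
        s += x_i * y
        if s & 1:          # q_i = s mod 2; adding m clears the low bit
            s += m
        s >>= 1            # exact division by 2
    if s >= m:             # final conditional subtraction
        s -= m
    return s


if __name__ == "__main__":
    m, n = 97, 7           # toy odd modulus and operand size in bits
    x, y = 45, 76
    result = montgomery_mul_radix2(x, y, m, n)
    assert result == (x * y * pow(2, -n, m)) % m   # x * y * 2^(-n) mod m
    print(result)
```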


International Conference on High Performance Computing and Simulation | 2011

Exploiting concurrent kernel execution on graphic processing units

Lingyuan Wang; Miaoqing Huang; Tarek A. El-Ghazawi

Graphics processing units (GPUs) have been accepted as a powerful and viable coprocessor solution in the high-performance computing domain. In order to maximize the benefit of GPUs for a multicore platform, a mechanism is needed for CPU threads in a parallel application to share this computing resource for efficient execution. NVIDIA's Fermi architecture pioneers the feature of concurrent kernel execution; however, only kernels of the same thread context can execute in parallel. In order to get the best use of a GPU device in a multi-threaded application environment, this paper explores techniques to effectively share a single context, i.e., context funneling, which can be done either manually at the application level or automatically by the GPU runtime starting with CUDA v4.0. For synthetic microbenchmark tests, we find that both funneling mechanisms are better able to exploit the benefit of concurrent kernel execution than traditional context switching, thereby improving overall application performance. We also find that the manual funneling mechanism provides the highest performance and more explicit control, while CUDA v4.0 provides better productivity with good performance. Finally, we assess the impact of such techniques on a compact application benchmark, SSCA #3 (SAR sensor processing).
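
For readers unfamiliar with the funneling idea, the sketch below illustrates the manual variant in plain Python: many application threads submit their work to a single owner thread that holds the one device context and issues every launch, instead of each thread switching its own context in and out. The queue, worker, and fake_kernel function are illustrative stand-ins only; they are not CUDA API calls.

```python
import queue
import threading

work_queue = queue.Queue()   # all kernel requests are funneled through this queue
results = {}

def fake_kernel(task_id, data):
    """Stand-in for a kernel launch issued from the single owning context."""
    return sum(data) * task_id

def context_owner():
    """The one thread that 'owns' the device context and drains the queue."""
    while True:
        item = work_queue.get()
        if item is None:                     # sentinel: no more work
            break
        task_id, data, done = item
        results[task_id] = fake_kernel(task_id, data)
        done.set()                           # tell the submitter its work finished

def application_thread(task_id):
    """One of many CPU threads funneling its kernels through the owner."""
    done = threading.Event()
    work_queue.put((task_id, list(range(10)), done))
    done.wait()

owner = threading.Thread(target=context_owner)
owner.start()
submitters = [threading.Thread(target=application_thread, args=(i,)) for i in range(4)]
for t in submitters:
    t.start()
for t in submitters:
    t.join()
work_queue.put(None)                         # shut down the owner thread
owner.join()
print(results)
```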


Field-Programmable Technology | 2005

Performance of sorting algorithms on the SRC 6 reconfigurable computer

John Harkins; Tarek A. El-Ghazawi; Esam El-Araby; Miaoqing Huang

The execution speed of the FPGA processing elements is compared to that of the microprocessor processing elements in the SRC 6 reconfigurable computer using the following sorting algorithms: quicksort, heapsort, radix sort, bitonic sort, and odd/even merge. The results show that, for sorting, FPGA technology may not be the best processor choice and that factors such as memory bandwidth, clock speed, an algorithm's computational density, and an algorithm's ability to be pipelined all have an impact on FPGA performance.
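
Bitonic sort is the one kernel in this comparison whose compare-exchange pattern is fixed in advance and independent of the data, which is what makes it easy to pipeline in an FPGA. A short software sketch of the network (my own illustration, not the SRC 6 implementation) follows.

```python
def bitonic_sort(a):
    """In-place bitonic sort; len(a) must be a power of two.

    The sequence of compare-exchange pairs is fixed regardless of the input,
    which is why the same network pipelines well in hardware.
    """
    n = len(a)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:                    # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:                # compare-exchange distance in this stage
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


data = [7, 3, 9, 1, 6, 2, 8, 5]
assert bitonic_sort(data) == sorted([7, 3, 9, 1, 6, 2, 8, 5])
```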


Public Key Cryptography | 2008

An optimized hardware architecture for the Montgomery multiplication algorithm

Miaoqing Huang; Kris Gaj; Soonhak Kwon; Tarek A. El-Ghazawi

Montgomery modular multiplication is one of the fundamental operations used in cryptographic algorithms, such as RSA and Elliptic Curve Cryptosystems. At CHES 1999, Tenca and Koç introduced a now-classical architecture for implementing Montgomery multiplication in hardware. With parameters optimized for minimum latency, this architecture performs a single Montgomery multiplication in approximately 2n clock cycles, where n is the size of the operands in bits. In this paper, we propose and discuss an optimized hardware architecture performing the same operation in approximately n clock cycles with almost the same clock period. Our architecture is based on precomputing partial results using two possible assumptions regarding the most significant bit of the previous word, and is only marginally more demanding in terms of circuit area. The new radix-2 architecture can be extended to the case of radix-4, while preserving a factor-of-two speedup over the corresponding radix-4 design by Tenca, Todorov, and Koç from CHES 2001. Our architecture has been verified by modeling it in Verilog-HDL, implementing it on a Xilinx Virtex-II 6000 FPGA, and experimentally testing it on the SRC-6 reconfigurable computer.


ACM Transactions on Reconfigurable Technology and Systems | 2010

Reconfiguration and Communication-Aware Task Scheduling for High-Performance Reconfigurable Computing

Miaoqing Huang; Vikram K. Narayana; Harald Simmler; Olivier Serres; Tarek A. El-Ghazawi

High-performance reconfigurable computing involves acceleration of significant portions of an application using reconfigurable hardware. When the hardware tasks of an application cannot simultaneously fit in an FPGA, the task graph needs to be partitioned and scheduled into multiple FPGA configurations in a way that minimizes the total execution time. This article proposes the Reduced Data Movement Scheduling (RDMS) algorithm, which aims to improve the overall performance of hardware tasks by taking into account the reconfiguration time, data dependencies between tasks, intertask communication, and task resource utilization. The proposed algorithm uses dynamic programming. A mathematical analysis of the algorithm shows that its execution time exceeds the optimal solution by at most a factor of about 1.6 in the worst case. Simulations on randomly generated task graphs indicate that the RDMS algorithm can reduce interconfiguration communication time by 11% and 44%, respectively, compared with two other approaches that consider data dependency and hardware resource utilization only. The practicality and efficiency of the proposed algorithm over other approaches are demonstrated by simulating a task graph from a real-life application, N-body simulation, along with bandwidth constraints and FPGA parameters from existing high-performance reconfigurable computers. Experiments on the SRC-6 are carried out to validate the approach.
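
As a much-simplified illustration of the dynamic-programming flavor of the problem (RDMS itself handles general task graphs; this toy handles only a linear chain), the sketch below partitions a chain of hardware tasks into consecutive FPGA configurations, charging one reconfiguration per configuration plus the data moved across each configuration boundary. All names and numbers are hypothetical.

```python
def partition_chain(areas, edge_bytes, fpga_area, reconfig_cost, byte_cost):
    """Minimize reconfiguration plus inter-configuration communication cost.

    areas[i]      -- FPGA resources used by task i
    edge_bytes[i] -- bytes produced by task i and consumed by task i + 1
    """
    n = len(areas)
    INF = float("inf")
    best = [INF] * (n + 1)       # best[i]: minimum cost of scheduling the first i tasks
    best[0] = 0.0
    for i in range(1, n + 1):
        used = 0
        # try placing tasks j .. i-1 together in one configuration
        for j in range(i - 1, -1, -1):
            used += areas[j]
            if used > fpga_area:             # this configuration no longer fits
                break
            boundary = edge_bytes[j - 1] * byte_cost if j > 0 else 0.0
            best[i] = min(best[i], best[j] + reconfig_cost + boundary)
    return best[n]


# toy chain of five tasks on a device with 10 area units
print(partition_chain(
    areas=[4, 5, 3, 6, 2],
    edge_bytes=[100, 10, 80, 5],   # bytes crossing each task-to-task edge
    fpga_area=10,
    reconfig_cost=1.0,
    byte_cost=0.01,
))
```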


International Journal of Geographical Information Science | 2016

A hybrid parallel cellular automata model for urban growth simulation over GPU/CPU heterogeneous architectures

Qingfeng Guan; Xuan Shi; Miaoqing Huang; Chenggang Lai

As an important spatiotemporal simulation approach and an effective tool for developing and examining spatial optimization strategies (e.g., land allocation and planning), geospatial cellular automata (CA) models often require multiple data layers and consist of complicated algorithms in order to deal with the complex dynamic processes of interest and the intricate relationships and interactions between the processes and their driving factors. Also, massive amounts of data may be used in CA simulations, as high-resolution geospatial and non-spatial data are widely available. Thus, geospatial CA models can be both computationally intensive and data intensive, demanding extensive computing time and vast memory space. Based on a hybrid parallelism that combines processes with discrete memory and threads with global memory, we developed a parallel geospatial CA model for urban growth simulation over a heterogeneous computer architecture composed of multiple central processing units (CPUs) and graphics processing units (GPUs). Experiments with datasets of California showed that the overall computing time for a 50-year simulation dropped from 13,647 seconds on a single CPU to 32 seconds using 64 GPU/CPU nodes. We conclude that the hybrid parallelism of geospatial CA over emerging heterogeneous computer architectures provides scalable solutions to enabling complex simulations and optimizations with massive amounts of data that were previously infeasible, sometimes impossible, using individual computing approaches.
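
To see why the per-cell rule parallelizes so naturally, consider a toy serial urban-growth step: each cell's next state depends only on its 3x3 neighborhood and a static suitability layer, so cells can be updated independently. The thresholds and layers below are invented for illustration and are not the model calibrated in the paper.

```python
import numpy as np

def grow_step(urban, suitability, neighbor_threshold=3, suit_threshold=0.5):
    """One toy CA step: a cell urbanizes if enough neighbors are urban
    and its suitability is high enough. urban is a 2-D 0/1 array."""
    padded = np.pad(urban, 1)
    h, w = urban.shape
    # count urban cells in the 3x3 Moore neighborhood, excluding the cell itself
    neighbors = sum(
        padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)
    )
    newly_urban = (urban == 0) & (neighbors >= neighbor_threshold) & (suitability >= suit_threshold)
    return (urban.astype(bool) | newly_urban).astype(np.uint8)


rng = np.random.default_rng(0)
urban = (rng.random((64, 64)) < 0.05).astype(np.uint8)
suitability = rng.random((64, 64))
print(grow_step(urban, suitability).sum(), "urban cells after one step")
```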


International Journal of Reconfigurable Computing | 2012

High-Performance Reconfigurable Computing

Khaled Benkrid; Esam El-Araby; Miaoqing Huang; Kentaro Sano; Thomas Steinke

Author affiliations: School of Engineering, The University of Edinburgh, Edinburgh EH9 3JL, UK; Electrical Engineering and Computer Science, The Catholic University of America, Washington, DC 20064, USA; Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72701, USA; Graduate School of Information Sciences, Tohoku University, 6-6-01 Aramaki Aza Aoba, Sendai 980-8579, Japan; Zuse-Institut Berlin (ZIB), Takustrasse 7, 14195 Berlin-Dahlem, Germany.


Computing Frontiers | 2011

Scaling scientific applications on clusters of hybrid multicore/GPU nodes

Lingyuan Wang; Miaoqing Huang; Vikram K. Narayana; Tarek A. El-Ghazawi

Rapid advances in the performance and programmability of graphics accelerators have made GPU computing a compelling solution for a wide variety of application domains. However, the increased complexity resulting from architectural heterogeneity and imbalances in hardware resources poses significant programming challenges in harnessing the performance advantages of GPU-accelerated parallel systems. Moreover, the speedup derived from GPUs often gets offset by longer communication latencies and inefficient task scheduling. To achieve the best possible performance, a suitable parallel programming model is therefore essential. In this paper, we explore a new hybrid parallel programming model that incorporates GPU acceleration with the Partitioned Global Address Space (PGAS) programming paradigm. As we demonstrate by combining Unified Parallel C (UPC) and CUDA as a case study, this hybrid model offers programmers both enhanced programmability and powerful heterogeneous execution. Two application benchmarks, namely NAS Parallel Benchmark (NPB) FT and MG, are used to show the effectiveness of the proposed hybrid approach. Experimental results indicate that both implementations achieve significantly better performance due to optimization opportunities offered by the hybrid model, such as the funneled execution mode and fine-grained overlapping of communication and computation.
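
The fine-grained overlap mentioned above can be sketched generically: while block k is being computed, block k + 1 is already being fetched. The sketch below uses plain Python threads as stand-ins; fetch_block and compute_block are hypothetical placeholders, not UPC or CUDA calls.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_block(k):
    """Stand-in for moving block k's data to the accelerator (e.g., a PGAS get)."""
    return [k] * 1000

def compute_block(data):
    """Stand-in for the accelerator computation on one block."""
    return sum(data)

def pipelined(num_blocks):
    """Overlap the fetch of block k + 1 with the computation of block k."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as fetcher:
        pending = fetcher.submit(fetch_block, 0)
        for k in range(num_blocks):
            data = pending.result()                           # wait for block k's data
            if k + 1 < num_blocks:
                pending = fetcher.submit(fetch_block, k + 1)  # start the next fetch
            results.append(compute_block(data))               # compute while it transfers
    return results

print(pipelined(4))
```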


IEEE Transactions on Computers | 2012

Efficient Mapping of Task Graphs onto Reconfigurable Hardware Using Architectural Variants

Miaoqing Huang; Vikram K. Narayana; Mohamed Bakhouya; Jaafar Gaber; Tarek A. El-Ghazawi

High-performance reconfigurable computing involves acceleration of significant portions of an application using reconfigurable hardware. Mapping application task graphs onto reconfigurable hardware has therefore attracted growing attention. In this work, we approach the mapping problem by incorporating multiple architectural variants for each hardware task; the variants reflect tradeoffs between the logic resources consumed and the task execution throughput. We propose a mapping approach based on a genetic algorithm and show its effectiveness for random task graphs as well as an N-body simulation application, demonstrating improvements of up to 78.6 percent in execution time compared with choosing a fixed implementation variant for all tasks. We then validate our methodology through experiments on real hardware, an SRC-6 reconfigurable computer.
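
A tiny genetic-algorithm sketch of the variant-selection idea is given below: one gene per task selects an (area, execution-time) implementation variant, fitness is total execution time with a penalty for exceeding the FPGA area, and the usual selection, crossover, and mutation loop searches the space. The task data are invented, and the paper's actual formulation (task graphs, scheduling, configurations) is considerably richer than this.

```python
import random

# hypothetical implementation variants per task: (area, time); more area buys speed
VARIANTS = [
    [(2, 9.0), (5, 4.0), (8, 2.5)],
    [(3, 7.0), (6, 3.5)],
    [(1, 5.0), (4, 2.0), (7, 1.2)],
    [(2, 6.0), (5, 2.8)],
]
FPGA_AREA = 16

def fitness(genome):
    """Total execution time, heavily penalized if the chosen variants exceed the FPGA area."""
    area = sum(VARIANTS[t][g][0] for t, g in enumerate(genome))
    time = sum(VARIANTS[t][g][1] for t, g in enumerate(genome))
    return time + 100.0 * max(0, area - FPGA_AREA)

def random_genome():
    return [random.randrange(len(v)) for v in VARIANTS]

def evolve(pop_size=30, generations=50, mutation_rate=0.2):
    population = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)                   # elitist selection
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(VARIANTS))
            child = a[:cut] + b[cut:]                  # one-point crossover
            if random.random() < mutation_rate:
                t = random.randrange(len(VARIANTS))
                child[t] = random.randrange(len(VARIANTS[t]))   # mutate one gene
            children.append(child)
        population = survivors + children
    return min(population, key=fitness)

best = evolve()
print("variant choice per task:", best, "fitness:", fitness(best))
```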

Collaboration


Dive into Miaoqing Huang's collaborations.

Top Co-Authors

Tarek A. El-Ghazawi (George Washington University)
Xuan Shi (University of Arkansas)
Olivier Serres (George Washington University)
Sen Ma (University of Arkansas)
Esam El-Araby (George Washington University)
Kris Gaj (George Mason University)
Haihang You (University of Tennessee)