Waqar Hussain
Tampere University of Technology
Publication
Featured research published by Waqar Hussain.
field-programmable logic and applications | 2009
Fabio Garzia; Waqar Hussain; Jari Nurmi
This paper presents CREMA, a coarse-grain reconfigurable array (CGRA) with mapping adaptiveness. Mapping adaptiveness consists of tailoring the array to the requirements of a specific application. Run-time reconfigurability allows the same processing element (PE) to be reused with different functionality and interconnections among those supported. We found this approach to be very efficient compared with a standard CGRA. In our test cases CREMA achieves a performance speed-up of 1.5X to 4X while at the same time reducing area occupation by 80%–90% in comparison with the Butter CGRA.
signal processing systems | 2012
Waqar Hussain; Fabio Garzia; Tapani Ahonen; Jari Nurmi
Designing accelerators for the real-time computation of Fast Fourier Transform (FFT) algorithms for state-of-the-art Orthogonal Frequency-Division Multiplexing (OFDM) demodulators has always been challenging. We have scaled up a template-based Coarse-Grain Reconfigurable Array device for faster FFT processing that generates special-purpose accelerators based on the user input. Using a basic and a scaled-up version, we have generated a radix-4 and a mixed-radix (2, 4) FFT accelerator to process FFT algorithms of different lengths and types. Our implementation results show that these accelerators satisfy the execution-time requirements of FFT processing not only for the Single Input Single Output (SISO) wireless standards IEEE-802.11a/g and 3GPP-LTE but also for the Multiple Input Multiple Output (MIMO) IEEE-802.11n standard.
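As an illustration of what a mixed-radix (2, 4) FFT computes, the following is a minimal software sketch, not the accelerator mapping described in the abstract. It applies radix-4 decimation-in-time stages while the length is divisible by 4 and falls back to a single radix-2 stage otherwise, so a 128-point transform uses three radix-4 stages and one radix-2 stage. The function names `naive_dft` and `fft_mixed_radix_2_4` and the stage ordering are assumptions made for this example.

```python
import cmath


def naive_dft(x):
    """Reference O(N^2) DFT, used only to check the fast version."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def fft_mixed_radix_2_4(x):
    """Recursive decimation-in-time FFT for power-of-two lengths.

    Uses a radix-4 stage whenever the length is divisible by 4 and a
    radix-2 stage otherwise (hypothetical ordering, chosen for clarity).
    """
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4 == 0:
        q = n // 4
        # Four interleaved sub-sequences x[4m + r], r = 0..3.
        f0, f1, f2, f3 = (fft_mixed_radix_2_4(x[r::4]) for r in range(4))
        out = [0j] * n
        for k in range(q):
            w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_N^k
            t0, t1, t2, t3 = f0[k], w * f1[k], w**2 * f2[k], w**3 * f3[k]
            # Radix-4 butterfly: combines the four partial results.
            out[k] = t0 + t1 + t2 + t3
            out[k + q] = t0 - 1j * t1 - t2 + 1j * t3
            out[k + 2 * q] = t0 - t1 + t2 - t3
            out[k + 3 * q] = t0 + 1j * t1 - t2 - 1j * t3
        return out
    if n % 2 == 0:
        h = n // 2
        even = fft_mixed_radix_2_4(x[0::2])
        odd = fft_mixed_radix_2_4(x[1::2])
        out = [0j] * n
        for k in range(h):
            t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
            # Radix-2 butterfly.
            out[k] = even[k] + t
            out[k + h] = even[k] - t
        return out
    raise ValueError("length must be a power of two")


if __name__ == "__main__":
    for n in (64, 128):
        data = [complex(i % 7, -(i % 3)) for i in range(n)]
        err = max(abs(a - b) for a, b in zip(fft_mixed_radix_2_4(data), naive_dft(data)))
        print(n, "points, max deviation from naive DFT:", err)
```

A 64-point input goes through radix-4 stages only, which is why the same routine covers both the radix-4 and the mixed-radix case.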
norchip | 2009
Fabio Garzia; Waqar Hussain; Roberto Airoldi; Jari Nurmi
This paper presents the mapping of SDR applications on a reconfigurable SoC, based on a run-time reconfigurable coarse-grain accelerator called CREMA. CREMA is characterized by mapping adaptiveness, meaning that its architecture is specified according to the needs of the application mapped on it. CREMA is used to accelerate two kernels used in SDR applications: correlations for synchronization purposes and FFT for the OFDM modulation/demodulation. In both cases we show that the implementation on CREMA is 4X faster than a similar implementation on a general-purpose coarse-grain accelerator, and that its resource occupation is reduced by 4.5X.
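The correlation kernel used for synchronization can be sketched in software as a sliding cross-correlation of the received samples against a known reference sequence, with the lag of the largest magnitude taken as the timing offset. This is a generic illustration under that assumption; the function names and the toy reference sequence are hypothetical and are not taken from the paper.

```python
def cross_correlate(received, reference):
    """Magnitude of sum(received[n + m] * conj(reference[m])) for each lag n."""
    span = len(received) - len(reference) + 1
    return [
        abs(sum(received[n + m] * reference[m].conjugate()
                for m in range(len(reference))))
        for n in range(span)
    ]


def find_sync_offset(received, reference):
    """Return the lag at which the correlation magnitude peaks."""
    mags = cross_correlate(received, reference)
    return max(range(len(mags)), key=mags.__getitem__)


if __name__ == "__main__":
    # Toy reference sequence (hypothetical, not a standard preamble).
    ref = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, 1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j]
    # Received stream: 5 idle samples, the reference, then 4 idle samples.
    rx = [0j] * 5 + ref + [0j] * 4
    print(find_sync_offset(rx, ref))  # prints 5
```

Each lag is an independent multiply-accumulate over the reference length, which is the kind of regular, data-parallel work that array accelerators are typically used for.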
design and diagnostics of electronic circuits and systems | 2010
Waqar Hussain; Fabio Garzia; Jari Nurmi
In this paper, we present the mapping of Radix-2 and Radix-4 FFT algorithms using CREMA, a Coarse-Grain Reconfigurable Array (CGRA) with mapping adaptiveness. CREMA is employed to generate special-purpose accelerators tailored to the specified mapping. We analyze the results for a 64-point FFT targeted at OFDM processing. Both implementations are 4–5X smaller and show a speed-up of 4X when compared with a general-purpose CGRA.
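To make the difference between the two algorithms concrete, the sketch below counts stages and butterflies for a textbook radix-r FFT: an N-point transform has log_r(N) stages of N/r butterflies each. These are algorithm-level counts only and say nothing about how the operations are scheduled on CREMA; the function name is hypothetical.

```python
def fft_stage_counts(n_points, radix):
    """Return (stages, butterflies_per_stage, total_butterflies)."""
    stages = 0
    n = n_points
    while n > 1:
        if n % radix:
            raise ValueError("n_points must be a power of the radix")
        n //= radix
        stages += 1
    per_stage = n_points // radix
    return stages, per_stage, stages * per_stage


if __name__ == "__main__":
    print(fft_stage_counts(64, 2))  # (6, 32, 192)
    print(fft_stage_counts(64, 4))  # (3, 16, 48)
```

Radix-4 halves the number of stages for a 64-point FFT, at the cost of a larger butterfly that takes four inputs instead of two.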
international symposium on system-on-chip | 2010
Waqar Hussain; Fabio Garzia; Jari Nurmi
In this paper, we present the mapping of 64- and 1024-point FFT algorithms on a Radix-4 FFT accelerator generated using a hardware template called CREMA. The accelerator is designed for an FPGA target. The execution time for the 64- and 1024-point FFT meets the 802.11a/g and 3GPP-LTE timing constraints. A speed-up of 3X to 68X was achieved when both implementations were compared with similar and general-purpose platforms.
signal processing systems | 2016
Waqar Hussain; Roberto Airoldi; Henry Hoffmann; Tapani Ahonen; Jari Nurmi
This paper presents the design, development and evaluation of an eXtra-large Scale, Homogeneous and a Heterogeneous Accelerator-Rich Platform (HARP2) for massively parallel signal processing algorithms. HARP is an integrated platform of multiple Coarse-Grained Reconfigurable Arrays (CGRAs) over a Network-on-Chip (NoC) where each CGRA is scaled and tailored for a specific application. The architecture of the NoC consists of nine nodes in a topology of 3 rows × 3 columns and acts as the backbone of communication between the different CGRAs. In this experimental work, the HARP template is used to instantiate a homogeneous (HARP-hom) and a heterogeneous (HARP-het) platform. HARP-het is generated as a proof-of-concept test to verify the design and functionality of HARP. It also provides insight into many features of the design and its evaluation in terms of different performance metrics. The other version (HARP-hom) is instantiated for a relatively realistic design problem, i.e., satisfying the execution-time constraints imposed on Fast Fourier Transform processing in IEEE-802.11n demodulators. Both versions of HARP are compared against some of the existing state-of-the-art platforms using different performance metrics. The HARP versions are designed to illustrate large-scale homogeneous/heterogeneous multicore architectures while presenting the advantages of maximizing the number of reconfigurable processing resources on a single chip.
application-specific systems, architectures, and processors | 2014
Waqar Hussain; Roberto Airoldi; Henry Hoffmann; Tapani Ahonen; Jari Nurmi
This paper presents an accelerator-rich system-on-chip (SoC) architecture integrating many heterogeneous Coarse-Grain Reconfigurable Arrays (CGRAs) connected through a Network-on-Chip (NoC). The architecture is designed to maximize the reconfigurable processing capacity for the execution of massively parallel algorithms. The central node of the NoC contains a Reduced Instruction Set Computer (RISC) core that manages distribution of computing functions and data within the SoC while the other nodes contain CGRAs of application-specific sizes. Prior approaches coupled only a few accelerators with a RISC core using special instructions and/or a direct memory access device. In contrast, our design couples a RISC core to many CGRAs through the NoC. This approach provides for independent and simultaneous execution of multiple computing kernels. Furthermore, the proposed architecture mitigates power dissipation as CGRA sizes are tailored for the individual application kernels. We present a proof-of-concept design with a total of 408 reconfigurable processing elements. This instance and its sub-systems are customized and tested for different computationally intensive signal processing algorithms. The overall single-chip computing system is synthesized for a Field Programmable Gate Array device. We present a comparison with and an evaluation against some of the existing multicore systems in terms of multiple performance metrics.
application specific systems architectures and processors | 2013
Waqar Hussain; Xiaolin Chen; Gerd Ascheid; Jari Nurmi
In this paper, we present a Reconfigurable Application-specific Instruction-set Processor (rASIP) that processes mixed-radix (2, 4) 64- and 128-point Fast Fourier Transform (FFT) algorithms while satisfying the partial execution-time requirements of the IEEE-802.11n standard. The rASIP was designed by integrating a template-based Coarse-Grain Reconfigurable Array (CGRA) in the datapath of a simple Reduced Instruction-Set Computing (RISC) processor. The instruction set of the RISC processor was extended with special instructions to enable cycle-accurate processing by the CGRA. The rASIP is synthesized for Field Programmable Gate Arrays for the measurement of resource utilization and execution time. The post-fit gate-level netlist of the rASIP was simulated to estimate the power and energy consumption. Based on our measurements and estimates, we have studied the advantages of using the rASIP in comparison with other systems.
reconfigurable communication centric systems on chip | 2012
Waqar Hussain; Tapani Ahonen; Roberto Airoldi; Jari Nurmi
In the recent past, we developed 4×8 and 4×16 processing element (PE) template-based Coarse-Grain Reconfigurable Arrays (CGRAs) and mapped Fast Fourier Transform (FFT) algorithms of different lengths and types on them. In this paper, we consider radix-4 and radix-(2, 4) FFT accelerators which were generated from the 4×8 and 4×16 PE CGRA templates, respectively. We estimated their power and energy consumption while the radix-4 accelerator was processing 64- and 1024-point FFT algorithms and the radix-(2, 4) accelerator was processing 64- and 128-point FFT algorithms. The power consumption was estimated by timing simulation of the post-fit gate-level netlists of both accelerators for a Field Programmable Gate Array (FPGA) used as the target platform. Based on the measurements, we have compared the two accelerators in terms of their power and energy consumption.
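The relation between the two reported quantities can be sketched as follows, assuming energy is taken as average power multiplied by execution time, and execution time as cycle count divided by clock frequency. The abstract does not state how energy was derived, so this relation is an assumption, and the numbers in the usage example are placeholders, not results from the paper.

```python
def execution_time_s(cycles, clock_hz):
    """Execution time in seconds for a given cycle count and clock frequency."""
    return cycles / clock_hz


def energy_joules(avg_power_w, cycles, clock_hz):
    """Energy in joules, assuming constant average power over the run."""
    return avg_power_w * execution_time_s(cycles, clock_hz)


if __name__ == "__main__":
    # Placeholder inputs for illustration only (not measured values).
    power_w = 1.0
    cycles = 1000
    clock_hz = 100e6
    print(energy_joules(power_w, cycles, clock_hz), "J")
```

Under this relation, an accelerator that draws more power can still consume less energy per transform if it finishes in proportionally fewer cycles.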
international symposium on system-on-chip | 2012
Waqar Hussain; Tapani Ahonen; Jari Nurmi
In the recent past, we scaled a 4×8 processing element (PE) template-based Coarse-Grain Reconfigurable Array (CGRA) to 4×4, 4×16 and 4×32 PE CGRAs and generated a matrix-vector multiplication (MVM) accelerator from each of them. Furthermore, MVM kernels of order N = 4, 8, 16 and 32 were mapped on each of the accelerators. In this paper, we have estimated the power and energy consumption by generating the post-fit gate-level netlist of each accelerator for a Field Programmable Gate Array as the target platform. Based on our measurements, we have studied the effects of CGRA scalability on power and energy consumption.