Muhammad Ali Shami | Researchain

Publication

Featured research published by Muhammad Ali Shami.

international conference on asic | 2009

Partially reconfigurable interconnection network for dynamically reprogrammable resource array

Muhammad Ali Shami; Ahmed Hemani

This paper describes an innovative, regular, non-blocking, low-latency interconnection network scheme that supports point-to-point and point-to-multipoint connections with sliding-window connectivity, allowing arbitrary parallelism among large sub-systems. The interconnect occupies only 30% of the chip area, compared with 80% for an FPGA. The interconnection scheme is partially and dynamically reconfigurable. Binary encoding reduces the configware by a factor of 5.6, which allows energy-efficient dynamic reconfiguration.
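As a rough illustration of why binary encoding shrinks configuration data, the sketch below compares the bits needed to select one of `n` sources with a one-hot word versus a binary-encoded index. The one-hot baseline, the function names, and the source counts are assumptions made for illustration only; the 5.6 times figure in the abstract depends on the paper's own network structure and is not reproduced here.

```python
def one_hot_bits(n_sources: int) -> int:
    """Bits to select one of n_sources with one bit per source (assumed baseline)."""
    return n_sources


def binary_bits(n_sources: int) -> int:
    """Bits to select one of n_sources with a binary-encoded index."""
    return max(1, (n_sources - 1).bit_length())


def reduction_factor(n_sources: int) -> float:
    """How many times smaller the binary-encoded selector is."""
    return one_hot_bits(n_sources) / binary_bits(n_sources)


if __name__ == "__main__":
    # Hypothetical source counts, not taken from the paper.
    for n in (4, 8, 16, 32):
        print(n, one_hot_bits(n), binary_bits(n), reduction_factor(n))
```

The saving grows with the number of selectable sources, which is why it matters most for interconnect with wide connectivity windows.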

signal processing systems | 2009

Morphable DPU: Smart and efficient data path for signal processing applications

Muhammad Ali Shami; Ahmed Hemani

A coarse-grained morphable Datapath Unit (mDPU) is proposed. The mDPU implements its multiplier in a way that lets the component adders be reused when the multiplier is not needed. A pipelined design further improves on this by creating a datapath that is balanced in the temporal sense. Together, these two features result in a design that makes optimal use of silicon and time. The mDPU enables a judicious set of coarse-grained instructions, which we show can implement typical signal processing functions. A radix-2 64-point FFT has been implemented in 90 nm technology using the proposed mDPUs; performance and energy results from the physical design phase are reported and compared with a comparable state-of-the-art design from the research community. A 4X improvement in performance and a 2.5X improvement in power-performance product are reported.

international parallel and distributed processing symposium | 2012

Classification of Massively Parallel Computer Architectures

Muhammad Ali Shami; Ahmed Hemani

Faced with slowing performance and energy benefits of technology scaling, VLSI/computer architectures have turned from parallel to massively parallel machines for personal and embedded applications in the form of multi- and many-core architectures. Additionally, in the pursuit of the sweet spot between engineering and computational efficiency, massively parallel Coarse Grain Reconfigurable Architectures (CGRAs) have been researched. While these architectures have been surveyed, they have not been rigorously classified to enable objective differentiation and comparison for performance, area and flexibility. In this paper, we extend the well-known Skillicorn taxonomy to create new classes, present a scoring system to rate these classes on flexibility, and present equations for early estimation of area and configuration overheads. Furthermore, we use this extended classification scheme to classify and compare 25 different massively parallel architectures that cover most of the reported CGRAs and other well-known multi- and many-core architectures.

application specific systems architectures and processors | 2011

Address generation scheme for a coarse grain reconfigurable architecture

Muhammad Ali Shami; Ahmed Hemani

In this paper, we describe a versatile address generation scheme for the distributed storage resources of a coarse grain Parallel Distributed Digital Signal Processing (PDDSP) reconfigurable architecture under development in our group. The scheme uses distributed address generation units (AGUs) to decouple the address generation logic from the compute logic and so exploit parallelism (ILP and TLP). The AGUs support standard DSP address generation modes (linear vectorized, circular buffer and bit-reverse addressing), all with parameterizable address range and increment/decrement offsets. The scheme is further enhanced with temporal flexibility through three dynamically programmable delays: an initial delay before the stream starts, a middle delay after every address generation for the stream, and an end delay after the stream is complete. The dynamic programmability of these delays makes streams elastic, and such streams can be chained with an interrupt mechanism to create chained-elastic streams. Our approach is compared with the traditional approaches of using VLIW and Scalar processors. It shows a 21× (Scalar) and 10× (VLIW) reduction in instructions and a 2× (Scalar) reduction in cycles for a single-thread FIR filter. When compared for Synchronous and Asynchronous scenarios of two parallel threads T1 and T2, our approach shows a 4.6× (Scalar) and 5.6× (VLIW) reduction in instructions and a 1.76× reduction in cycles for Synchronous threads, and a 4.6× (Scalar) and 15× (VLIW) reduction in instructions and a 1.76× (Scalar) reduction in cycles for Asynchronous threads.
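The addressing modes and the three delays described in this abstract can be modelled in software roughly as follows. This is a minimal behavioural sketch, not the paper's AGU design: the function names, the parameter names, and the choice to represent an idle cycle as `None` are all assumptions made for illustration.

```python
from typing import Iterator, Optional


def bit_reverse(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def address_stream(mode: str, start: int, count: int, step: int = 1,
                   size: int = 0, initial_delay: int = 0,
                   middle_delay: int = 0, end_delay: int = 0
                   ) -> Iterator[Optional[int]]:
    """Yield one item per cycle: an address, or None for an idle cycle.

    `size` is the buffer length for circular mode and the (power-of-two)
    transform length for bit-reverse mode. All names are hypothetical.
    """
    if mode not in ("linear", "circular", "bit_reverse"):
        raise ValueError(f"unknown mode: {mode}")
    if mode != "linear" and size <= 0:
        raise ValueError("size must be positive for this mode")

    for _ in range(initial_delay):      # delay before the stream starts
        yield None
    for i in range(count):
        if mode == "linear":
            yield start + i * step
        elif mode == "circular":
            yield start + (i * step) % size   # wrap inside the buffer
        else:
            yield start + bit_reverse(i, (size - 1).bit_length())
        for _ in range(middle_delay):   # delay after every address
            yield None
    for _ in range(end_delay):          # delay after the stream completes
        yield None


if __name__ == "__main__":
    print(list(address_stream("bit_reverse", 0, 8, size=8)))
    print(list(address_stream("linear", 0, 2, initial_delay=1,
                              middle_delay=1, end_delay=2)))
```

Because the delays are ordinary parameters, changing them stretches or compresses a stream in time without touching the address pattern, which is one way to picture the "elastic" streams the abstract refers to.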

international symposium on circuits and systems | 2013

39.9 GOPs/watt multi-mode CGRA accelerator for a multi-standard basestation

Nasim Farahini; Shuo Li; Muhammad Adeel Tajammul; Muhammad Ali Shami; Guo Chen; Ahmed Hemani; Wei Ye

This paper presents an industrial case study of using a Coarse Grain Reconfigurable Architecture (CGRA) as a multi-mode accelerator for two kernels, executed in a mutually exclusive manner: FFT for the LTE standard and the Correlation Pool for the UMTS standard. The CGRA multi-mode accelerator achieved a computational efficiency of 39.94 GOPS/watt (where an OP is a multiply-add) and a silicon efficiency of 56.20 GOPS/mm². By analyzing the code and inferring the unused features of the fully programmable solution, an in-house tool automatically customized the design to run just the two kernels, and the two efficiency metrics improved to 49.05 GOPS/watt and 107.57 GOPS/mm². The corresponding numbers for the ASIC implementation are 63.84 GOPS/watt and 90.91 GOPS/mm². Though the ASIC's silicon and computational efficiency numbers are slightly better, the engineering efficiency of the pre-verified/characterized CGRA solution is at least 10X better than that of the ASIC solution.
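The efficiency figures quoted in this abstract are easier to compare as ratios. The short sketch below computes them from the reported numbers only; the dictionary keys and the helper name `ratio` are illustrative choices, not terms from the paper.

```python
# (GOPS/watt, GOPS/mm^2) as reported in the abstract.
METRICS = {
    "cgra_programmable": (39.94, 56.20),
    "cgra_customized": (49.05, 107.57),
    "asic": (63.84, 90.91),
}


def ratio(design: str, baseline: str) -> tuple:
    """Return (computational, silicon) efficiency of design relative to baseline."""
    d_watt, d_area = METRICS[design]
    b_watt, b_area = METRICS[baseline]
    return d_watt / b_watt, d_area / b_area


if __name__ == "__main__":
    for design, baseline in [
        ("cgra_customized", "cgra_programmable"),
        ("asic", "cgra_programmable"),
        ("asic", "cgra_customized"),
    ]:
        per_watt, per_area = ratio(design, baseline)
        print(f"{design} vs {baseline}: "
              f"{per_watt:.2f}x GOPS/watt, {per_area:.2f}x GOPS/mm^2")
```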

international conference on vlsi design | 2011

NoC Based Distributed Partitionable Memory System for a Coarse Grain Reconfigurable Architecture

Muhammad Adeel Tajammul; Muhammad Ali Shami; Ahmed Hemani; S. Moorthi

This paper presents a Network-on-Chip based distributed partitionable memory system for a Dynamic Reconfigurable Resource Array (DRRA). The main purpose of this design is to extend the Register File (RFile) interface with additional data handling capability. The proposed interconnect, which enables interaction between the existing partition of the computation fabric and the distributed memory system, is programmable and partitionable. The system can modify its ratio of memory to computation elements at runtime. The interconnect can provide multiple interfaces, each supporting up to 8 GB/s.

symposium on computer architecture and high performance computing | 2010

Control Scheme for a CGRA

Muhammad Ali Shami; Ahmed Hemani

The ability to instantiate low-cost and agile FSMs that can implement arbitrary parallelism, and to combine such FSMs in a chain and in a hierarchy, is one of the key differentiating factors between ASICs and MPSOCs. CGRAs that have been reported in the literature, like MPSOCs, also lack this ASIC-like ability. The downside of ASICs is their lack of reuse and high engineering cost. We present a CGRA architecture that retains the programmability of a CGRA and yet has the ASIC-like ability to a) construct an arbitrarily parallel data-path/FSM combine, b) chain an arbitrary number of such FSMs and c) create a hierarchy of such chains. We present in detail the architecture of such a control scheme and illustrate its use for an example composed of FFT and FIRs. We quantify the benefits of our approach by benchmarking for energy-delay product against a) ASICs (4.8X worse), b) a state-of-the-art CGRA (4.58X better) and c) FPGAs (63.95X better).

norchip | 2010

High Level Synthesis Framework for a Coarse Grain Reconfigurable Architecture

Omer Malik; Ahmed Hemani; Muhammad Ali Shami

A High Level Synthesis (HLS) framework for mapping DSP algorithms onto a Coarse Grain Reconfigurable Architecture is presented. The behavior of the algorithm is specified in C with pragmas in comments, and the tool generates configware after performing timing and synchronization synthesis. Pragmas identify SIMD-type concurrency and sweep the architectural space with allocation and binding annotations to produce implementations ranging from fully serial to fully parallel. This allows the user to stay at the algorithmic level and guide the HLS tool to search a restricted architectural space bounded by the pragmas, making the synthesis process more efficient and predictable.

international conference on vlsi design | 2011

A Library Development Framework for a Coarse Grain Reconfigurable Architecture

Omer Malik; Ahmed Hemani; Muhammad Ali Shami

A framework is proposed for efficiently capturing the rich micro-architectural space of a substantial MATLAB-like library of DSP functions for a regular Coarse Grain Reconfigurable Architecture (CGRA) fabric. A subset of C is proposed to model the DSP functions, and an automatic tool has been developed to generate the configware for the CGRA fabric. A method to estimate the average energy of such functions is reported, with an error margin of less than 3%. Such a framework is proposed as the basis for raising the abstraction level to automate synthesis of entire physical layers.

norchip | 2010

An improved self-reconfigurable interconnection scheme for a Coarse Grain Reconfigurable Architecture

Muhammad Ali Shami; Ahmed Hemani

An improved dynamic, partial and self-reconfigurable interconnection network (Hybrid-2 network) is presented for the Dynamically Reprogrammable Resource Array (DRRA), which is a Coarse Grain Reconfigurable Architecture (CGRA). To justify the design decision, the Hybrid-2 network implementation is compared against possible implementations using a Multiplexer, a NoC, a Crossbar and the already published Hybrid-1 interconnection network. Results show that the newly presented Hybrid-2 interconnection network takes 1.08x, 0.104x, 0.212x and 0.681x the area, and 1x, 0.037x, 0.026x and 0.107x the configuration bits, of the Multiplexer, NoC, Crossbar and Hybrid-1 implementations respectively. The Hybrid-2 network is also 2.87x and 5.86x faster than the Multiplexer and Hybrid-1 networks respectively.