Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Muhammad Adeel Tajammul is active.

Publication


Featured research published by Muhammad Adeel Tajammul.


International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation | 2013

Energy-aware task parallelism for efficient dynamic voltage and frequency scaling in CGRAs

Syed Mohammad Asad Hassan Jafri; Muhammad Adeel Tajammul; Ahmed Hemani; Kolin Paul; Juha Plosila; Hannu Tenhunen

Today, coarse grained reconfigurable architectures (CGRAs) host multiple applications with arbitrary communication and computation patterns. Each application is itself composed of multiple tasks, spatially mapped to different parts of the platform. Providing a worst-case operating point to all applications leads to excessive energy and power consumption. To address this problem, dynamic voltage and frequency scaling (DVFS) is a frequently used technique. DVFS allows the voltage and/or frequency of the device to be scaled based on runtime constraints. Recent research suggests that the efficiency of DVFS can be significantly enhanced by combining dynamic parallelism with DVFS. The proposed methods exploit the speedup induced by parallelism to allow aggressive frequency and voltage scaling. These techniques employ a greedy algorithm that blindly parallelizes a task whenever the required resources are available. It is therefore likely to parallelize tasks even when doing so offers no speedup to the application, thereby undermining the effectiveness of parallelism. As a solution to this problem, we present energy-aware task parallelism. Our solution relies on resource allocation graphs and an autonomous parallelism, voltage, and frequency selection algorithm. Using the resource allocation graph as a guide, the autonomous parallelism, voltage, and frequency selection algorithm parallelizes a task only if its parallel version reduces the overall application execution time. Simulation results, using representative applications (MPEG4, WLAN), show that our solution promises better resource utilization compared to the greedy algorithm. Synthesis results (using WLAN) confirm a significant reduction in energy (up to 36%), power (up to 28%), and configuration memory requirements (up to 36%), compared to the state of the art.
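The core decision rule described above (parallelize a task only when the parallel version, including its overhead, shortens execution; then pick the cheapest voltage/frequency point that still meets the deadline) can be sketched as follows. All function names, data shapes, and numbers are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of energy-aware parallelism + DVFS point selection.
# A task is parallelized only if the parallel version (reconfiguration
# overhead included) beats the serial execution time, so the freed slack
# can be spent on voltage/frequency scaling.

def should_parallelize(serial_time, parallel_time, reconfig_overhead):
    """Parallelize only when it genuinely reduces execution time."""
    return parallel_time + reconfig_overhead < serial_time

def select_operating_point(task, deadline, vf_points):
    """Pick the lowest-energy (voltage, frequency, energy) point that
    still meets the deadline; vf_points is sorted by ascending energy."""
    for volts, freq, energy in vf_points:
        if task["cycles"] / freq <= deadline:
            return volts, freq, energy
    return vf_points[-1]  # fall back to the fastest point
```

A greedy scheme would skip the `should_parallelize` check and split the task whenever spare resources exist, even when the split yields no speedup.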


International Symposium on Circuits and Systems | 2013

39.9 GOPs/watt multi-mode CGRA accelerator for a multi-standard basestation

Nasim Farahini; Shuo Li; Muhammad Adeel Tajammul; Muhammad Ali Shami; Guo Chen; Ahmed Hemani; Wei Ye

This paper presents an industrial case study of using a Coarse Grain Reconfigurable Architecture (CGRA) as a multi-mode accelerator for two kernels: FFT for the LTE standard and the Correlation Pool for the UMTS standard, executed in a mutually exclusive manner. The CGRA multi-mode accelerator achieved a computational efficiency of 39.94 GOPS/watt (OP is multiply-add) and a silicon efficiency of 56.20 GOPS/mm2. By analyzing the code and inferring the unused features of the fully programmable solution, an in-house developed tool was used to automatically customize the design to run just the two kernels, and the two efficiency metrics improved to 49.05 GOPS/watt and 107.57 GOPS/mm2. Corresponding numbers for the ASIC implementation are 63.84 GOPS/watt and 90.91 GOPS/mm2. Though the ASIC's silicon and computational efficiency numbers are slightly better, the engineering efficiency of the pre-verified/characterized CGRA solution is at least 10X better than the ASIC solution.
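The two efficiency metrics quoted above are straightforward ratios: GOPS/watt divides sustained operations per second (in billions) by power, and GOPS/mm2 divides by silicon area. A minimal sketch, with illustrative inputs rather than the paper's measured data:

```python
# Efficiency metrics as used in CGRA/ASIC comparisons.
# ops_per_second counts multiply-add operations ("OP" above).

def gops_per_watt(ops_per_second, power_watts):
    """Computational efficiency: billions of ops per second per watt."""
    return ops_per_second / 1e9 / power_watts

def gops_per_mm2(ops_per_second, area_mm2):
    """Silicon efficiency: billions of ops per second per mm^2."""
    return ops_per_second / 1e9 / area_mm2
```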


International Conference on VLSI Design | 2011

NoC Based Distributed Partitionable Memory System for a Coarse Grain Reconfigurable Architecture

Muhammad Adeel Tajammul; Muhammad Ali Shami; Ahmed Hemani; S. Moorthi

This paper presents a Network-on-Chip based distributed partitionable memory system for a Dynamically Reconfigurable Resource Array (DRRA). The main purpose of this design is to extend the Register File (RFile) interface with additional data handling capability. The proposed interconnect, which enables interaction between existing partitions of the computation fabric and the distributed memory system, is programmable and partitionable. The system can modify its memory-to-computation-element ratio at runtime. The interconnect can provide multiple interfaces, each supporting up to 8 GB/s.


Microprocessors and Microsystems | 2014

Parallel distributed scalable runtime address generation scheme for a coarse grain reconfigurable computation and storage fabric

Nasim Farahini; Ahmed Hemani; Hassan Sohofi; Syed Mohammad Asad Hassan Jafri; Muhammad Adeel Tajammul; Kolin Paul

This paper presents a hardware-based solution for a scalable runtime address generation scheme for DSP applications mapped to a parallel distributed coarse grain reconfigurable computation and storage fabric. The scheme can also deal with non-affine functions of multiple variables that typically correspond to multiple nested loops. The key innovation is the judicious use of two categories of address generation resources. The first category is a low-cost address generation unit (AGU) that generates addresses within given bounds for affine functions of up to two variables. Such low-cost AGUs are distributed and associated with every read/write port in the distributed memory architecture. The second category is relatively more complex; it is also distributed but shared among a few storage units, and is capable of handling more complex address generation requirements, such as dynamic computation of the address bounds that are then used to configure the AGUs, and transformation of non-affine functions to affine functions by computing the affine factor outside the loop. The runtime computation of the address constraints results in negligibly small overheads in latency, area, and energy, while providing substantial reductions in program storage and energy, and improved reconfiguration agility, compared to the prevalent pre-computation of address constraints. The efficacy of the proposed method has been validated against the prevalent address generation schemes for a set of six realistic DSP functions. Compared to the pre-computation method, the proposed solution achieved 75% average code compaction, and compared to the centralized runtime address generation scheme, it achieved 32.7% average performance improvement.
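The low-cost AGU described above walks an affine function of up to two loop variables: address = base + i*stride_i + j*stride_j, with the bounds and strides supplied (possibly at runtime) by the more capable shared unit. A minimal behavioral sketch, assuming illustrative names rather than the paper's RTL:

```python
# Behavioral model of a two-variable affine AGU: the shared unit would
# compute (base, strides, bounds) at runtime and configure this walker,
# which then emits the address stream for a doubly nested loop.

def agu_affine(base, stride_i, stride_j, bound_i, bound_j):
    """Yield addresses base + i*stride_i + j*stride_j for
    0 <= i < bound_i, 0 <= j < bound_j."""
    for i in range(bound_i):
        for j in range(bound_j):
            yield base + i * stride_i + j * stride_j
```

Hoisting a non-affine factor out of the loop (the transformation mentioned above) amounts to computing it once and passing the result in as `base` or a stride.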


Application-Specific Systems, Architectures and Processors | 2013

Private configuration environments (PCE) for efficient reconfiguration in CGRAs

Muhammad Adeel Tajammul; Syed Mohammad Asad Hassan Jafri; Ahmed Hemani; Juha Plosila; Hannu Tenhunen

In this paper, we propose a polymorphic configuration architecture that can be tailored to efficiently support the reconfiguration needs of applications at runtime. Today, CGRAs host multiple applications running simultaneously on a single platform. Novel CGRAs allow each application to exploit late binding and time sharing to enhance power and area efficiency. These features require frequent reconfigurations, making reconfiguration time a bottleneck for time-critical applications. Existing solutions to this problem either employ powerful configuration architectures or hide configuration latency (using configuration caching). However, both methods incur significant costs when designed for worst-case reconfiguration needs. As an alternative to a worst-case dedicated configuration mechanism, we exploit reconfiguration to provide each application its private configuration environment (PCE). PCE relies on a morphable configuration infrastructure, a distributed memory sub-system, and a set of PCE controllers. The PCE controllers customize the morphable configuration infrastructure and reserve a portion of the distributed memory sub-system to act as a context memory for each application separately. Thereby, each application enjoys its own configuration environment, which is optimal in terms of configuration speed, memory requirements, and energy. Simulation results using representative applications (WLAN and matrix multiplication) showed that PCE offers up to 58% reduction in memory requirements compared to a dedicated, worst-case configuration architecture. Synthesis results show that the morphable reconfiguration architecture incurs negligible overheads (3% area and 4% power compared to a single processing element).
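The central idea above, reserving a per-application slice of the distributed memory as its context memory instead of provisioning one worst-case store for all, can be sketched as a simple allocator. This is an illustrative simplification, not the PCE controller itself; all names are assumptions.

```python
# Sketch of per-application context-memory reservation: each application
# gets a private partition of the distributed memory sized to its own
# configuration needs, and releases it when unmapped.

class ContextAllocator:
    def __init__(self, total_words):
        self.free = total_words
        self.envs = {}            # app name -> reserved context words

    def reserve(self, app, words):
        """Reserve a private context partition, if space allows."""
        if words > self.free:
            raise MemoryError("not enough distributed memory")
        self.free -= words
        self.envs[app] = words

    def release(self, app):
        """Return an application's partition to the free pool."""
        self.free += self.envs.pop(app)
```

A worst-case design would instead size every partition for the most demanding application, wasting the difference for all the others.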


IEEE 6th International Symposium on Embedded Multicore SoCs | 2012

Segmented Bus Based Path Setup Scheme for a Distributed Memory Architecture

Muhammad Adeel Tajammul; Muhammad Ali Shami; Ahmed Hemani

This paper proposes a composite instruction for path setup and partitioning of a network on chip using segmented buses. The network connects a distributed memory to a coarse grained reconfigurable architecture. The scheme decreases the number of partitioning and routing instructions in the sequencers for N nodes from N×3 to a single instruction. This reduction also bears a small performance benefit, as fewer instructions are scheduled onto the network. Furthermore, it is possible to optimize the system under application-specific constraints. A simple use case with experiments is defined to show the design trade-offs of these optimization decisions.
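The compaction described above amounts to packing what were separate per-node partition, route, and enable instructions into one composite word. A minimal encode/decode sketch, where the field names and bit widths are assumptions for illustration, not the paper's actual instruction format:

```python
# Illustrative composite path-setup instruction: node id (8 bits),
# partition id (4 bits), route select (4 bits), enable flag (1 bit)
# packed into a single word instead of three separate instructions.

def encode_path_setup(node_id, partition, route, enable=1):
    """Pack the per-node path-setup fields into one composite word."""
    return (node_id << 9) | (partition << 5) | (route << 1) | enable

def decode_path_setup(word):
    """Unpack (node_id, partition, route, enable) from the word."""
    return (word >> 9) & 0xFF, (word >> 5) & 0xF, (word >> 1) & 0xF, word & 1
```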


NORCHIP | 2010

A NoC based distributed memory architecture with programmable and partitionable capabilities

Muhammad Adeel Tajammul; Muhammad Ali Shami; Ahmed Hemani; S. Moorthi

The paper focuses on the design of a Network-on-Chip based programmable and partitionable distributed memory architecture which can be integrated with a Coarse Grain Reconfigurable Architecture (CGRA). The proposed interconnect enables better interaction between the computation fabric and the memory fabric. The system can modify its memory-to-computation-element ratio at runtime. The extensive capabilities of the memory system are analyzed by interfacing it with a Dynamically Reconfigurable Resource Array (DRRA), a CGRA. The interconnect can provide multiple interfaces, each supporting up to 8 GB/s.


IEEE Transactions on Very Large Scale Integration Systems | 2016

Polymorphic Configuration Architecture for CGRAs

Syed Mohammad Asad Hassan Jafri; Muhammad Adeel Tajammul; Ahmed Hemani; Kolin Paul; Juha Plosila; Peeter Ellervee; Hannu Tenhunen

In the era of platforms hosting multiple applications with arbitrary reconfiguration requirements, static configuration architectures are neither optimal nor desirable. Static configuration architectures either incur excessive overheads or cannot support advanced features (such as time-sharing and runtime parallelism). As a solution to this problem, we present a polymorphic configuration architecture (PCA) that provides each application with a configuration infrastructure tailored to its needs.


International Conference on VLSI Design | 2015

DyMeP: An Infrastructure to Support Dynamic Memory Binding for Runtime Mapping in CGRAs

Muhammad Adeel Tajammul; Syed M. A. H. Jafri; Peeter Ellervee; Ahmed Hemani; Hannu Tenhunen; Juha Plosila

Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern applications. Commonly, CGRAs are composed of a computation layer (that performs computations) and a memory layer (that provides data and configware to the computation layer). Tempted by higher platform utilization and reliability, recently proposed CGRAs offer dynamic application remapping (for the computation layer). Distributed scratchpad (compiler-programmed) memories offer high data rates, predictability, and low power consumption (compared to caches). Therefore, distributed scratchpad memories are emerging as the preferred implementation alternative for the memory layer in recent CGRAs. However, scratchpad memories are programmed at compile time and do not support dynamic application remapping. The existing solutions that allow dynamic application remapping either rely on fat binaries (which significantly increase configuration memory requirements) or assume a centralized memory. To extract the benefits of both runtime remapping and distributed scratchpad memories, we present a design framework called DyMeP. DyMeP relies on late binding and provides the architectural support to dynamically remap data in CGRAs. Compared to the state of the art, the proposed technique reduces the configuration memory requirements (needed by fat binary solutions) and supports distributed shared scratchpad memory. Synthesis and simulation results reveal that DyMeP promises a significant (up to 60%) reduction in configware size at the cost of negligible additional overheads (less than 1%).
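The late-binding idea described above can be pictured as an indirection table: configware refers to logical buffers, and the physical scratchpad location is bound only when the application is (re)mapped, so no per-placement "fat binary" copies are needed. A minimal sketch under these assumptions; names are illustrative, not DyMeP's actual interface:

```python
# Late memory binding via a remap table: logical buffer ids in the
# configware are resolved to physical (bank, offset) locations at
# (re)mapping time, so moving an application only updates this table.

class RemapTable:
    def __init__(self):
        self.table = {}           # logical buffer id -> (bank, offset)

    def bind(self, logical_id, bank, offset):
        """Install or update a binding when the application is (re)mapped."""
        self.table[logical_id] = (bank, offset)

    def resolve(self, logical_id, index):
        """Translate a logical access into a physical (bank, address)."""
        bank, offset = self.table[logical_id]
        return bank, offset + index
```

Remapping an application then costs one table update per buffer instead of storing a pre-bound binary for every possible placement.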


Power and Timing Modeling, Optimization and Simulation | 2016

TransMem: A memory architecture to support dynamic remapping and parallelism in low power high performance CGRAs

Muhammad Adeel Tajammul; Syed Mohammad Asad Hassan Jafri; Ahmed Hemani; Peeter Ellervee

In the nano-scale era, upcoming design challenges like dark silicon, the power wall, and the memory wall have prompted extensive research into architectural alternatives to the general purpose processor. Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as one of the promising alternatives. Commonly, CGRAs are composed of a computation layer and a memory layer. Tempted by higher platform utilization and energy efficiency, recently proposed CGRAs offer dynamic remapping and parallelism. However, the existing works only address the computational elements, while for many applications the bulk of energy is consumed by the memory and memory accesses. Therefore, without architectural support to optimize the memory contents according to changes in the computational layer, the benefits promised by dynamic parallelism and remapping are severely degraded. As a solution to this problem, we present TransMem, a supporting memory infrastructure that complements dynamic remapping and parallelism in the computational fabric. Simulation results reveal that the additional flexibility enhances energy efficiency by up to 85% for the tested applications, compared to the state of the art. Post-layout analysis reveals that TransMem incurs only a 4% area penalty.
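One memory-side consequence of dynamic parallelism, as argued above, is that when a task is split across N processing elements its data must be redistributed so each element streams its own slice from a nearby bank. A purely illustrative sketch of that redistribution step; this is not TransMem's actual mechanism:

```python
# When the computational fabric parallelizes a task across n_parallel
# elements, split its logical buffer into contiguous per-bank slices
# so each element accesses local data.

def redistribute(buffer, n_parallel):
    """Split a buffer into n_parallel contiguous slices (last may be short)."""
    chunk = (len(buffer) + n_parallel - 1) // n_parallel
    return [buffer[i * chunk:(i + 1) * chunk] for i in range(n_parallel)]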

Collaboration


Dive into Muhammad Adeel Tajammul's collaborations.

Top Co-Authors


Ahmed Hemani

Royal Institute of Technology


Muhammad Ali Shami

Vienna University of Technology


Juha Plosila

Information Technology University


Hannu Tenhunen

Royal Institute of Technology


Kolin Paul

Indian Institute of Technology Delhi


Nasim Farahini

Royal Institute of Technology


S. Moorthi

National Institute of Technology


Guo Chen

Royal Institute of Technology
