Ahmed Hemani | Researchain

Publication

Featured research published by Ahmed Hemani.

ieee computer society annual symposium on vlsi | 2002

A network on chip architecture and design methodology

Shashi Kumar; Axel Jantsch; Juha-Pekka Soininen; Martti Forsell; Mikael Millberg; Johnny Öberg; Kari Tiensyrjä; Ahmed Hemani

We propose a packet switched platform for single chip systems which scales well to an arbitrary number of processor like resources. The platform, which we call Network-on-Chip (NOC), includes both the architecture and the design methodology. The NOC architecture is an m×n mesh of switches, and resources are placed on the slots formed by the switches. We assume a direct layout of the 2-D mesh of switches and resources providing physical- and architectural-level design integration. Each switch is connected to one resource and four neighboring switches, and each resource is connected to one switch. A resource can be a processor core, memory, an FPGA, a custom hardware block or any other intellectual property (IP) block, which fits into the available slot and complies with the interface of the NOC. The NOC architecture is essentially the on-chip communication infrastructure comprising the physical layer, the data link layer and the network layer of the OSI protocol stack. We define the concept of a region, which occupies an area of any number of resources and switches. This concept allows the NOC to accommodate large resources such as large memory banks, FPGA areas, or special purpose computation resources such as high performance multi-processors. The NOC design methodology consists of two phases. In the first phase a concrete architecture is derived from the general NOC template. The concrete architecture defines the number of switches and shape of the network, the kind and shape of regions and the number and kind of resources. The second phase maps the application onto the concrete architecture to form a concrete product.
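
The mesh described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `build_mesh` and `attach_resources` are hypothetical, regions are not modeled, and it assumes a plain (non-wrapping) mesh, so switches on the border have fewer than four neighbors.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, column) of a switch in the mesh


def build_mesh(m: int, n: int) -> Dict[Coord, List[Coord]]:
    """Return the switch-to-switch adjacency of an m x n mesh.

    Assumption: no wrap-around links, so border switches have two or
    three neighbors and only interior switches have four.
    """
    mesh: Dict[Coord, List[Coord]] = {}
    for r in range(m):
        for c in range(n):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            mesh[(r, c)] = [
                (y, x) for (y, x) in candidates if 0 <= y < m and 0 <= x < n
            ]
    return mesh


def attach_resources(mesh: Dict[Coord, List[Coord]],
                     kinds: List[str]) -> Dict[Coord, str]:
    """Attach exactly one resource to each switch, in row-major order.

    `kinds` is a hypothetical list of resource labels (for example
    "cpu", "mem", "fpga"); it is reused cyclically to fill every slot.
    """
    if not kinds:
        raise ValueError("at least one resource kind is required")
    return {
        switch: kinds[i % len(kinds)]
        for i, switch in enumerate(sorted(mesh))
    }


mesh = build_mesh(3, 4)
resources = attach_resources(mesh, ["cpu", "mem", "fpga"])
```

Here every switch owns one resource slot, which mirrors the one-resource-per-switch rule in the abstract; deriving a concrete architecture would additionally fix the mesh shape, the regions, and the resource kinds.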

design automation conference | 1999

Lowering power consumption in clock by using globally asynchronous locally synchronous design style

Ahmed Hemani; Thomas Meincke; Shashi Kumar; Adam Postula; Thomas Olsson; Peter Nilsson; Johnny Öberg; Peeter Ellervee; Dan Lundqvist

Power consumption in clock of large high performance VLSIs can be reduced by adopting globally asynchronous, locally synchronous design style (GALS). GALS has small overheads for the global asynchronous communication and local clock generation. We propose methods to (a) evaluate the benefits of GALS and account for its overheads, which can be used as the basis for partitioning the system into optimal number/size of synchronous blocks, and (b) automate the synthesis of the global asynchronous communication. Three realistic ASICs, ranging in complexity from 1 to 3 million gates, were used to evaluate GALS benefits and overheads. The results show an average power saving of about 70% in clock with negligible overheads.

field programmable gate arrays | 1994

A case study on hardware/software partitioning

Axel Jantsch; Peeter Ellervee; Johnny Öberg; Ahmed Hemani

We present an analysis of a fully automatic method to accelerate standard software in C or C++ by use of field programmable gate arrays. Traditional compiler techniques are applied to the hardware/software partitioning problem and a compiler is linked to state of the art hardware synthesis tools. Time critical regions are identified by means of profiling and are automatically implemented in user programmable logic with high level and logic synthesis design tools. The underlying architecture is an add-on board with user programmable logic connected to a SPARC-based workstation via the system bus. We present an analysis and case study of this method. Eight programs are used as test cases and the data collected by applying this method to programs is used to discuss potentials and limitations of this and similar methods. We discuss architectural parameters, programming language properties, and analysis techniques.

international symposium on systems synthesis | 1996

Grammar-based hardware synthesis of data communication protocols

Johnny Öberg; Anshul Kumar; Ahmed Hemani

For a synthesis methodology to support implementation independent design specification, a capability for design space exploration is essential. In this paper we present such a methodology for a specific domain: data communication protocols. A natural way to specify various elements of protocols is in terms of a grammar annotated with actions. Our language for protocol specification, called PRO-GRAM, is based on this idea. The hardware specification of the protocol is done by specifying the bit-patterns of the tokens the protocol is supposed to parse together with the actual grammar to parse the input stream. By specifying constraints on the input and output stream ports, the designer is allowed to explore alternative realisations with different widths of the I/O ports. The PRO-GRAM compiler outputs VHDL-code suitable for logic synthesis.

international symposium on circuits and systems | 1999

Globally asynchronous locally synchronous architecture for large high-performance ASICs

Thomas Meincke; Ahmed Hemani; Shashi Kumar; Peeter Ellervee; Johnny Öberg; Thomas Olsson; Peter Nilsson; Dan Lindqvist; Hannu Tenhunen

Clock nets are the major source of power consumption in large, high-performance ASICs and a design bottleneck when it comes to tolerable clock skew. A way to obviate the global clock net is to partition the design into large synchronous blocks each having its own clock. Data with other blocks is exchanged asynchronously using handshake signals. Adopting such a strategy requires a methodology that supports: 1) a partitioning method dividing a design into the number of synchronous blocks such that the gain due to global clock net removal exceeds the communication overhead and 2) synthesis of handshake protocols to implement the data transfer between synchronous blocks. We describe this methodology and present results of applying it to a realistic design done in 0.25 micron, ranging in operating frequencies from 20 MHz to 1 GHz. The results show that the net power savings compared to fully synchronous designs are on an average about 30%.

Neural Networks | 1990

Cell placement by self-organisation

Ahmed Hemani; Adam Postula

Cell placement is an important step in the physical design of VLSI circuits. The problem with present day algorithms is their inherent sequential nature and inability to efficiently exploit massively parallel architectures. This paper formulates cell placement as a self-organisation problem and presents the result of such an attempt. We use an adaptation of Kohonen's algorithm for self-organisation as it is well suited for implementation on massively parallel architectures. The placement problem is simplified by assuming cells to be of uniform width and the cost to be minimised as total weighted wire length. The results are compared to a similar experiment in cell placement. The algorithm is a general optimisation technique and has been successfully applied to other areas of VLSI CAD like scheduling and state assignment. “Birds of a feather, flock together”

international conference on asic | 2009

Partially reconfigurable interconnection network for dynamically reprogrammable resource array

Muhammad Ali Shami; Ahmed Hemani

This paper describes an innovative regular non-blocking, point-to-point, point-to-multipoint, low latency interconnection network scheme with sliding window connectivity, which allows arbitrary parallelism among large sub-systems. The area overhead of the interconnect is only 30% of the chip area, which is much smaller than the 80% in the case of an FPGA. The interconnection scheme is partially and dynamically reconfigurable. The configware is reduced 5.6 times by using binary encoding, which allows energy efficient dynamic reconfiguration.

field-programmable technology | 2011

Compact generic intermediate representation (CGIR) to enable late binding in coarse grained reconfigurable architectures

Syed Mohammad Asad Hassan Jafri; Ahmed Hemani; Kolin Paul; Juha Plosila; Hannu Tenhunen

In the era of platforms hosting multiple applications, where inter-application communication and concurrency patterns are arbitrary, static compile time decision making is neither optimal nor desirable. As a part of solving this problem, we present a novel method for compactly representing multiple configuration bitstreams of a single application, with varying parallelisms, as a unique, compact, and customizable representation, called CGIR. The representation thus stored is unraveled at runtime to configure the device with the optimal (e.g. in terms of energy) implementation. Our goal was to provide optimal decision making capability to the runtime resource manager (RTM) without compromising the runtime behavior or the memory requirements of the system. The presence of multiple binaries enhances optimality by providing the RTM with multiple implementations to choose from. The CGIR ensures a minimal increase in memory requirements with the addition of each binary. The low-cost unraveling of CGIR guarantees the runtime behavior. We have chosen the dynamically reconfigurable resource array (DRRA) as a vehicle to study the feasibility of our approach. Simulation results using a 16 point decimation in time fast Fourier transform (FFT) showed massive (up to 18% for 2 versions, 33% for 3 versions) memory savings compared to the state of the art. Formal evaluation shows that the savings increase with the number of implementations stored.

international symposium on low power electronics and design | 2010

Distributed DVFS using rationally-related frequencies and discrete voltage levels

Jean-Michel Chabloz; Ahmed Hemani

We have defined a flexible latency-insensitive design style called Globally Ratiochronous Locally Synchronous (GRLS), based on quantized voltage levels and rationally-related clock frequencies. In this paper we present the infrastructure necessary to enable Distributed DVFS in such a system and analyze its overheads, quantitatively showing how, with minimal overheads, we obtain energy benefits that are close to those of a totally ideal GALS approach. The benefits that we show, coupled with the complexity and performance benefits of GRLS, which we briefly analyze, show how this approach is a strong competitor to GALS.
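
One property of rationally-related clock frequencies can be shown with a small worked sketch: because every frequency is a rational multiple of a common reference, the clock edges realign after a finite period. The code below is only an illustration of that arithmetic, not the GRLS infrastructure from the paper; the function names and the choice to express frequencies as `Fraction` multiples of a reference frequency are assumptions.

```python
from fractions import Fraction
from functools import reduce
from math import gcd
from typing import List


def _lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def realignment_period(freqs: List[Fraction]) -> Fraction:
    """Smallest time (in reference-clock periods) after which all clocks
    have completed a whole number of cycles.

    Each frequency is a positive rational multiple of the reference
    frequency. The result is the least common multiple of the clock
    periods: lcm(numerators) / gcd(denominators) of the reduced periods.
    """
    periods = [1 / f for f in freqs]  # Fraction keeps these in lowest terms
    num = reduce(_lcm, (p.numerator for p in periods))
    den = reduce(gcd, (p.denominator for p in periods))
    return Fraction(num, den)


def cycles_per_period(freqs: List[Fraction]) -> List[int]:
    """Number of whole cycles each clock completes in one realignment period."""
    period = realignment_period(freqs)
    return [int(f * period) for f in freqs]


# Two hypothetical local clocks running at 3/2 and 2 times the reference.
clocks = [Fraction(3, 2), Fraction(2)]
period = realignment_period(clocks)
cycles = cycles_per_period(clocks)
```

For the two hypothetical clocks above, the edges realign every 2 reference periods, during which the clocks complete 3 and 4 cycles respectively.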

international conference on embedded computer systems architectures modeling and simulation | 2013

Energy-aware-task-parallelism for efficient dynamic voltage, and frequency scaling, in CGRAs

Syed Mohammad Asad Hassan Jafri; Muhammad Adeel Tajammul; Ahmed Hemani; Kolin Paul; Juha Plosila; Hannu Tenhunen

Today, coarse grained reconfigurable architectures (CGRAs) host multiple applications, with arbitrary communication and computation patterns. Each application itself is composed of multiple tasks, spatially mapped to different parts of the platform. Providing a worst-case operating point to all applications leads to excessive energy and power consumption. To address this problem, dynamic voltage and frequency scaling (DVFS) is a frequently used technique. DVFS scales the voltage and/or frequency of the device based on runtime constraints. Recent research suggests that the efficiency of DVFS can be significantly enhanced by combining dynamic parallelism with DVFS. The proposed methods exploit the speedup induced by parallelism to allow aggressive frequency and voltage scaling. These techniques employ a greedy algorithm that blindly parallelizes a task whenever the required resources are available. Therefore, it is likely to parallelize a task even if doing so offers no speedup to the application, thereby undermining the effectiveness of parallelism. As a solution to this problem, we present energy aware task parallelism. Our solution relies on a resource allocation graph and an autonomous parallelism, voltage, and frequency selection algorithm. Using the resource allocation graph as a guide, the autonomous parallelism, voltage, and frequency selection algorithm parallelizes a task only if its parallel version reduces overall application execution time. Simulation results, using representative applications (MPEG4, WLAN), show that our solution promises better resource utilization compared to the greedy algorithm. Synthesis results (using WLAN) confirm a significant reduction in energy (up to 36%), power (up to 28%), and configuration memory requirements (up to 36%), compared to the state of the art.
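
The selection rule in this abstract (parallelize a task only if its parallel version reduces overall application execution time) can be sketched on a small task graph. This is a simplified illustration under stated assumptions, not the paper's algorithm: resource availability, voltage, and frequency selection are left out, tasks are visited once in name order, and `critical_path`, `select_parallel`, and the example timings are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def critical_path(times: Dict[str, int], deps: Dict[str, List[str]]) -> int:
    """Application execution time: the longest path through the task DAG.

    `deps` maps a task to the tasks that must finish before it starts.
    """
    finish: Dict[str, int] = {}

    def finish_time(task: str) -> int:
        if task not in finish:
            start = max((finish_time(p) for p in deps.get(task, [])), default=0)
            finish[task] = start + times[task]
        return finish[task]

    return max(finish_time(t) for t in times)


def select_parallel(serial: Dict[str, int], parallel: Dict[str, int],
                    deps: Dict[str, List[str]]) -> Tuple[Set[str], int]:
    """Pick the tasks whose parallel version shortens the whole application.

    A greedy scheme would parallelize every task; this one keeps a task
    serial when switching it does not reduce the critical path.
    """
    current = dict(serial)
    best = critical_path(current, deps)
    chosen: Set[str] = set()
    for task in sorted(serial):
        trial = dict(current)
        trial[task] = parallel[task]
        total = critical_path(trial, deps)
        if total < best:  # only accept a real application-level speedup
            chosen.add(task)
            current, best = trial, total
    return chosen, best


# Hypothetical application: task C waits for both A and B.
serial_times = {"A": 10, "B": 4, "C": 5}
parallel_times = {"A": 6, "B": 2, "C": 3}
dependencies = {"C": ["A", "B"]}
chosen, total_time = select_parallel(serial_times, parallel_times, dependencies)
```

In this example task B is left serial: it is off the critical path, so parallelizing it would consume resources without shortening the application, which is the case the abstract says a greedy scheme handles poorly.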
