Mahmood Ahmadi
Delft University of Technology
Publications
Featured research published by Mahmood Ahmadi.
international conference on networks | 2007
Mahmood Ahmadi; Stephan Wong
Within packet processing systems, lengthy memory accesses greatly reduce performance. To overcome this limitation, network processors utilize many different techniques, e.g., utilizing multi-level memory hierarchies, special hardware architectures, and hardware threading. In this paper, we introduce a multi-level memory hierarchy and a special hardware cache architecture for counting Bloom filters that is utilized by network processors and packet processing applications such as packet classification and distributed web caching systems. Based on the value of the counters in the counting Bloom filter, a multi-level cache architecture called the cache counting Bloom filter (CCBF) is presented and analyzed. The results show that the proposed cache architecture decreases the number of memory accesses by at least 51.3% when compared to a standard Bloom filter.
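For readers unfamiliar with the underlying data structure, a plain counting Bloom filter can be sketched as follows. This is a minimal illustration only, not the CCBF cache architecture from the paper: the class name, the salted SHA-256 hashing, and the parameter names are assumptions made for this example.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter: an array of counters instead of single bits."""

    def __init__(self, num_counters: int, num_hashes: int) -> None:
        self.m = num_counters
        self.k = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, item: str) -> list:
        # Assumption: k hash functions are emulated by salting SHA-256
        # with the hash index; the paper does not prescribe this choice.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def add(self, item: str) -> None:
        # Insertion increments every counter the item hashes to.
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item: str) -> None:
        # Only decrement when the item appears present, so that
        # counters never become negative.
        if item in self:
            for pos in self._positions(item):
                self.counters[pos] -= 1

    def __contains__(self, item: str) -> bool:
        # Present (possibly a false positive) if all counters are non-zero.
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```

Each query touches k counters, which is the memory-access cost that a cache hierarchy over the counters aims to reduce.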
global communications conference | 2008
Mahmood Ahmadi; Stephan Wong
A Bloom filter is a simple, space-efficient randomized data structure that represents a set of items in order to support membership queries. In recent years, Bloom filters have increased in popularity in database and networking applications. In this paper, we introduce a new extension that optimizes memory utilization for regular Bloom filters, called the Bloom filter with an additional hashing function (BFAH). The regular Bloom filter stores each item from a set k times, in k memory locations that are determined by the k addresses stored in the bit-array structure. Which k addresses are used is determined by the positions in the structure to which the k (regular) hashing functions point. Using the additional hashing function, only one of these k memory addresses is selected, and the item is stored only once. Consequently, there is no longer any need to store the k-1 redundant copies. We implemented our approach in a software packet classifier based on tuple space search with the H3 class of universal hashing functions. Our results show that our approach is able to reduce the number of collisions when compared to a regular Bloom filter.
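One possible reading of the selection step described in this abstract is sketched below. It is an interpretation for illustration, not the authors' implementation: the class name `SingleCopyTable`, the helper `_hash`, and the use of salted SHA-256 (instead of the H3 hash family used in the paper) are all assumptions.

```python
import hashlib


def _hash(item: str, salt: str, modulus: int) -> int:
    """Salted SHA-256 reduced to the range [0, modulus). Hypothetical helper."""
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class SingleCopyTable:
    """Stores each item once, at one of its k candidate addresses."""

    def __init__(self, num_slots: int, num_hashes: int) -> None:
        self.m = num_slots
        self.k = num_hashes
        self.slots = [[] for _ in range(num_slots)]

    def _address(self, item: str) -> int:
        # The k regular hash functions yield k candidate addresses.
        candidates = [_hash(item, f"h{i}", self.m) for i in range(self.k)]
        # The additional hash function selects exactly one of them.
        chosen = _hash(item, "extra", self.k)
        return candidates[chosen]

    def add(self, item: str) -> None:
        bucket = self.slots[self._address(item)]
        if item not in bucket:
            bucket.append(item)

    def __contains__(self, item: str) -> bool:
        # A lookup recomputes the same address and inspects one location.
        return item in self.slots[self._address(item)]

    def stored_copies(self) -> int:
        return sum(len(bucket) for bucket in self.slots)
```

Because the additional hash is deterministic, a lookup recomputes the same address, so only one location is written per item rather than k.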
ieee international conference on computer science and information technology | 2009
Ali Azarian; Mahmood Ahmadi
Reconfigurable computing has been driven largely by the development of commodity field-programmable gate arrays (FPGAs). Standard FPGAs are somewhat of a mixed blessing for this field. In this survey, we give a brief overview of programmable logic and present the configurable logic block (CLB) and the look-up table (LUT) as logic elements. We also define fine-grain and coarse-grain architectures and present some commercial examples. The survey further introduces reconfigurable computing models such as static and dynamic, single-context and multi-context, and partial reconfiguration architectures. Finally, run-time reconfigurable computing and the coupling of the reconfigurable processing unit (RPU) are delineated.
reconfigurable computing and fpgas | 2010
Faisal Nadeem; Mahmood Ahmadi; Muhammad Nadeem; Stephan Wong
Traditional grid networks employ general-purpose processors (GPPs) as their main processing elements. Incorporating reconfigurable processing elements in such networks can be a promising technology to increase their performance and flexibility. Many grid networks, such as TeraGrid, already utilize reconfigurable hardware resources as processing elements. In this paper, we propose queuing models for grid networks that incorporate the following processing elements: a GPP, a reconfigurable element (RE), and a hybrid element (combining a GPP with an RE). The proposed models are validated using the average response time as the validation metric. The comparison of experimental (simulation) and analytical results suggests that the total average error is less than 3.5%.
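The validation approach, comparing an analytical average response time against a simulated one, can be illustrated with the simplest textbook case. The sketch below assumes a single M/M/1 queue, which the abstract does not specify; the function names and parameters are hypothetical and the paper's actual models for GPP, RE, and hybrid elements may differ.

```python
import random


def mm1_analytic_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean response time of an M/M/1 queue: T = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival_rate must be < service_rate")
    return 1.0 / (service_rate - arrival_rate)


def mm1_simulated_response_time(
    arrival_rate: float, service_rate: float, num_jobs: int, seed: int = 0
) -> float:
    """Estimate the mean response time by simulating successive jobs."""
    rng = random.Random(seed)
    wait = 0.0   # time the current job spends waiting in the queue
    total = 0.0  # accumulated response time (waiting + service)
    for _ in range(num_jobs):
        service = rng.expovariate(service_rate)
        total += wait + service
        interarrival = rng.expovariate(arrival_rate)
        # Lindley recursion: the next job waits for the work still pending.
        wait = max(0.0, wait + service - interarrival)
    return total / num_jobs


def relative_error(simulated: float, analytic: float) -> float:
    return abs(simulated - analytic) / analytic
```

Calling `relative_error(mm1_simulated_response_time(0.5, 1.0, 200_000), mm1_analytic_response_time(0.5, 1.0))` gives the kind of simulation-versus-analysis error figure that the abstract reports for its own models.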
reconfigurable computing and fpgas | 2009
Asadollah Shahbahrami; Mahmood Ahmadi; Stephan Wong; Koen Bertels
The Discrete Wavelet Transform (DWT) is an important operation in applications of digital signal processing. In this paper, we review several traditional DWT implementation approaches, e.g., application-specific integrated circuits, field-programmable gate arrays, digital signal processors, general-purpose processors, and graphic processors, and discuss their limitations in terms of performance and flexibility. In order to provide both high-performance and flexibility, we propose a new approach, namely a parallel architecture exploiting the collaboration of reconfigurable processing elements in grid computing. Grid computing can exploit the task level parallelism to execute the 2D DWT. In addition, reconfigurable computing offers a flexible platform and can be used as hardware accelerators. We mapped the DWT in a grid. Our experimental results show that speedups of up to 4.1x can be achieved.
Journal of Electrical and Computer Engineering | 2011
Mahmood Ahmadi; Stephan Wong
Within packet processing systems, lengthy memory accesses greatly reduce performance. To overcome this limitation, network processors utilize many different techniques, for example, multilevel memory hierarchies, special hardware architectures, and hardware threading. In this paper, we introduce a multilevel memory architecture for counting Bloom filters. Based on the probabilities of incrementing the counters in the counting Bloom filter, a multilevel cache architecture called the cached counting Bloom filter (CCBF) is presented, where each cache level stores the items with the same counters. To test the CCBF architecture, we implement a software packet classifier that utilizes basic tuple space search using a 3-level CCBF. The results of the mathematical analysis and the implementation of the CCBF for packet classification show that the proposed cache architecture decreases the number of memory accesses when compared to a standard Bloom filter. Based on the mathematical analysis of the CCBF, the number of accesses is decreased by at least 53%. The implementation results of the software packet classifier are at most 7.8% (3.5% on average) less than the corresponding mathematical analysis results. This difference is due to some parameters in the packet classification application, such as the number of tuples, the distribution of rules through the tuples, and the hashing functions utilized.
Future Generation Computer Systems | 2011
Mahmood Ahmadi; Asadollah Shahbahrami; Stephan Wong
Traditional grid networks employ General Purpose Processors (GPPs) as their main processing elements. Incorporating reconfigurable processing elements in such networks can be a promising technology to increase their performance. In this paper, we propose and simulate collaboration of reconfigurable processors in grid computing. Collaborative Reconfigurable Grid Computing (CRGC) employs the availability of any reconfigurable processor to accelerate compute-intensive applications such as multimedia kernels. We explore the mapping of some compute-intensive multimedia kernels such as the 2D DWT and the co-occurrence matrix in the CRGC. These multimedia kernels are simulated as an independent set of gridlets submitted to a software simulator called CRGridSim. In addition, we analyze the lower and upper bounds of performance for CRGC. Our experimental results show that the CRGC approach improves performance up to 7.2x and 2.5x compared to a single GPP and the collaboration of GPPs, respectively, when assuming a speedup of 10 of the reconfigurable processors in a grid with 4 nodes.
grid and pervasive computing | 2010
Mahmood Ahmadi; Asadollah Shahbahrami; Stephan Wong
Multimedia applications are multi-standard, multi-format, and compute-intensive. These features, in addition to a large set of input and output data, mean that some architectures, such as application-specific integrated circuits and general-purpose processors, are less suitable for processing multimedia applications. Therefore, reconfigurable processors are considered as an alternative approach to developing systems that process multimedia applications efficiently. In this paper, we propose and simulate collaboration of reconfigurable processors in grid computing. Collaborative Reconfigurable Grid Computing (CRGC) employs the availability of any reconfigurable processor to accelerate compute-intensive applications such as multimedia kernels. We explore the mapping of some compute-intensive multimedia kernels, such as the 2D DWT and the co-occurrence matrix, in CRGC. These multimedia kernels are simulated as a set of gridlets submitted to a software simulator called CRGridSim. In addition, the behavior of multimedia kernels in the CRGC environment is presented. The experimental results show that the CRGC approach improves performance by up to 7.2x and 2.5x compared to a GPP and the collaboration of GPPs, respectively, when assuming a speedup of 10 for the reconfigurable processors.
conference on communication networks and services research | 2008
Mahmood Ahmadi; Stephan Wong
The increasing demand for more bandwidth and the increased application variety fuel the need for high-performance network processors. A simple but highly repetitive task performed by such processors is packet processing. Typically, a network processor consists of a parallel processor core with a number of memory interfaces and special co-processors. Recently, distributed architectures have been utilized in the design of network processors. In such environments, a challenging problem is to allocate bandwidth optimally between different network processors (NPs) to achieve higher performance. In this paper, the formulation and solution of an optimal bandwidth allocation strategy using a queuing network for NP-based architectures at system level is proposed. The solution allocates optimal bandwidth between network processors in a grid-oriented environment. It encompasses a new formula based on the optimal capacity allocation concept in queuing networks. Our simulation results show that the proposed solution is able to improve the response time in NP-based architectures when compared to the same NP-based architecture without optimal bandwidth allocation.
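The optimal capacity allocation concept mentioned here is commonly illustrated with the classical square-root capacity assignment for a network of M/M/1 links. The sketch below shows that textbook rule as a stand-in; it is not the new formula proposed in the paper, and the function and parameter names are assumptions made for this example.

```python
import math


def square_root_allocation(arrival_rates, service_rate_per_capacity, total_capacity):
    """Split total_capacity across links to minimise the mean delay.

    arrival_rates: lambda_i, jobs per second offered to each link.
    service_rate_per_capacity: mu, so that link i serves mu * C_i jobs per second.
    Each link first gets the capacity it needs to carry its load (lambda_i / mu);
    the spare capacity is then shared in proportion to sqrt(lambda_i).
    """
    minimum = [lam / service_rate_per_capacity for lam in arrival_rates]
    spare = total_capacity - sum(minimum)
    if spare <= 0:
        raise ValueError("total capacity cannot carry the offered load")
    root_sum = sum(math.sqrt(lam) for lam in arrival_rates)
    return [
        base + spare * math.sqrt(lam) / root_sum
        for base, lam in zip(minimum, arrival_rates)
    ]


def mean_delay(arrival_rates, service_rate_per_capacity, capacities):
    """Traffic-weighted mean M/M/1 delay: sum_i (lambda_i / lambda) / (mu*C_i - lambda_i)."""
    total = sum(arrival_rates)
    return sum(
        (lam / total) / (service_rate_per_capacity * cap - lam)
        for lam, cap in zip(arrival_rates, capacities)
    )
```

For example, with arrival rates `[1.0, 4.0]`, `mu = 1.0`, and a total capacity of 10, this rule yields a lower mean delay than either an equal split `[5.0, 5.0]` or a load-proportional split `[2.0, 8.0]`.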
Parallel and distributed computing and networks | 2007
Mahmood Ahmadi; Stephan Wong