Mohammad Reza Kakoee | Researchain

Publication

Featured research published by Mohammad Reza Kakoee.

Design, Automation, and Test in Europe | 2011

A fully-synthesizable single-cycle interconnection network for Shared-L1 processor clusters

Abbas Rahimi; Igor Loi; Mohammad Reza Kakoee; Luca Benini

Shared L1 memory is an interesting architectural option for building tightly-coupled multi-core processor clusters. We designed a parametric, fully combinational Mesh-of-Trees (MoT) interconnection network to support high-performance, single-cycle communication between processors and memories in L1-coupled processor clusters. Our interconnect IP is described in synthesizable RTL and is coupled with a design automation strategy that mixes advanced synthesis and physical optimization to achieve optimal delay, power, and area (DPA) under a wide range of design constraints. We explore DPA for a large set of network configurations in 65 nm technology. Post-place-and-route delay is 38 FO4 for a configuration with 8 processors and 16 32-bit memories (8×16); when the number of both processors and memories is increased by a factor of 4, the delay increases almost logarithmically, to 84 FO4, confirming scalability across a significant range of configurations. DPA tradeoff flexibility is also promising: compared with the maximum-performance 16×32 configuration, power and area can be reduced by 45% and 12% respectively, at the expense of a 30% performance degradation.

Networks on Chips | 2011

A distributed and topology-agnostic approach for on-line NoC testing

Mohammad Reza Kakoee; Valeria Bertacco; Luca Benini

A new distributed on-line test mechanism for NoCs is proposed which scales to large-scale networks with general topologies and routing algorithms. Each router and its links are tested using neighbors in different phases. Only the router under test is in test mode and all other parts of the NoC are in functional mode. Experimental results show that our on-line test approach can detect stuck-at and short-wire faults in the routers and links. Our approach achieves 100% fault coverage for the data-path and 85% for the control paths including routing logic, FIFOs control path and the arbiter of a 5×5 router. Synthesis results show that the hardware overhead of our test components with TMR (Triple Module Redundancy) support is 20% for covering both stuck-at and short-wire faults and 7% for covering only stuck-at faults in the 5×5 router. Simulation results show that our on-line testing approach has an average latency overhead of 20% and 3% in synthetic traffic and PARSEC traffic benchmarks on an 8×8 NoC, respectively.

IEEE Transactions on Computers | 2014

At-Speed Distributed Functional Testing to Detect Logic and Delay Faults in NoCs

Mohammad Reza Kakoee; Valeria Bertacco; Luca Benini

In this work, we propose a distributed functional test mechanism for NoCs which scales to large-scale networks with general topologies and routing algorithms. Each router and its links are tested using neighbors in different phases. The router under test is in test mode while all other parts of the NoC are operational. We use triple module redundancy (TMR) for the robustness of all testing components that are added into the switch. Experimental results show that our functional test approach can detect stuck-at, short and delay faults in the routers and links. Our approach achieves 100 percent stuck-at fault coverage for the data path and 85 percent for the control paths including routing logic, FIFOs control path, and the arbiter of a 5 × 5 router. We also show that our approach is able to detect delay faults in critical control and data paths. Synthesis results show that the area overhead of our test components with TMR support is 20 percent for covering stuck-at, delay, and short-wire faults and 7 percent for covering only stuck-at and delay faults in the 5 × 5 router. Simulation results show that our online testing approach has an average latency overhead of 3 percent in PARSEC traffic benchmarks on an 8 × 8 NoC.
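The abstract above relies on triple module redundancy (TMR) to harden the added test components. As a minimal sketch of the general TMR idea only (not the authors' hardware), a bitwise majority voter over three redundant copies can be written as follows; the function name `tmr_vote` is hypothetical.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three redundant copies of a value.

    Each output bit takes the value held by at least two of the three
    inputs, so a fault confined to a single copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    golden = 0b1010
    faulty = 0b0110  # one copy corrupted in two bit positions
    print(bin(tmr_vote(golden, golden, faulty)))
```

The voter masks any fault confined to one copy; faults that hit the same bit in two copies are not corrected.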

Design, Automation, and Test in Europe | 2011

ReliNoC: A reliable network for priority-based on-chip communication

Mohammad Reza Kakoee; Valeria Bertacco; Luca Benini

The reliability of networks-on-chip (NoC) is threatened by low yield and device wearout in aggressively scaled technology nodes. We propose ReliNoC, a network-on-chip architecture which can withstand failures while maintaining not only basic connectivity, but also quality-of-service support based on packet priorities. Our network leverages a dual physical channel switch architecture which removes the control overhead of virtual channels (VCs) and utilizes the inherent redundancy within the 2-channel switch to provide spares for faulty elements. Experimental results show that ReliNoC provides 1.5 to 3 times better network physical connectivity in the presence of several faults, and reduces the latency of both high- and low-priority traffic by 30 to 50%, compared to a traditional VC architecture. Moreover, it can tolerate up to 50 faults within an 8×8 mesh at only 10% and 40% latency overhead on control and data packets, respectively, for PARSEC traces [24]. Synthesis results show that our reliable architecture incurs only 13% area overhead on the baseline 2-channel switch.

IEEE Journal on Emerging and Selected Topics in Circuits and Systems | 2011

Fine-Grained Power and Body-Bias Control for Near-Threshold Deep Sub-Micron CMOS Circuits

Mohammad Reza Kakoee; Luca Benini

Lowering supply voltage is still the most effective technique to reduce dynamic power, and Vdd is being pushed toward the threshold voltage for ultra-low power applications. However, near-threshold circuit leakage power is comparable to switching power, and performance is highly sensitive to static and dynamic threshold voltage variations. This makes designing circuits for a target performance very difficult, and post-silicon tunability is required to achieve performance targets without taking huge design margins. Post-silicon tuning of voltage supply and body bias in active mode, together with power gating for idle leakage power minimization, are being investigated to tackle variability challenges in near-threshold operation. In this paper, we review and put in perspective techniques recently proposed in the literature for fine-grained post-silicon tuning and power gating. These techniques leverage the typical row-based ASIC layout style and use the row as the atomic element for circuit clustering. The results of row-based power gating on a set of benchmark circuits show that the achievable leakage savings are superior to those obtained using existing power-gating solutions, and under much tighter timing and area constraints. Benchmark results on row-based forward body biasing show large leakage power savings with respect to block-level approaches, with a maximum savings of 61% in the case of 18% compensation in 45 nm and 93% in the case of 10% compensation in 32 nm. Finally, row-based dual-Vdd can provide post-silicon speed compensation of up to 45% in the near-threshold region while achieving more than 50% lower power compared to single-Vdd.

Engineering of Computer Based Systems | 2006

Modified pseudo LRU replacement algorithm

Hassan Ghasemzadeh; Sepideh Mazrouee; Mohammad Reza Kakoee

Although the LRU replacement algorithm has been widely used in cache memory management, it is well known to be difficult to implement in hardware. Most primary caches employ a simple block replacement algorithm such as pseudo LRU to avoid the disadvantages of a complex hardware design. In this paper, we propose a novel block replacement scheme, MPLRU (modified pseudo LRU), which applies the second-chance concept to the pseudo LRU algorithm. A comprehensive comparison is made between our algorithm and both true LRU and other conventional schemes such as FIFO, random, and pseudo LRU. Experimental results show that MPLRU significantly reduces the number of cache misses compared to the other algorithms. Simulation results reveal that on average our algorithm improves the miss ratio by 8.52% compared to the pseudo LRU algorithm. Moreover, it provides 7.93% and 11.57% performance improvement compared to the FIFO and random replacement policies, respectively.
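For context, the baseline that MPLRU modifies can be sketched in a few lines. The code below is a minimal sketch of the common tree-based pseudo-LRU policy for one cache set, assuming a power-of-two number of ways; it does not include the second-chance modification, whose details the abstract does not give. The class name `TreePLRU` and its methods are hypothetical.

```python
class TreePLRU:
    """Tree-based pseudo-LRU state for a single cache set (baseline only)."""

    def __init__(self, ways: int):
        # Assumption: associativity is a power of two, as in the usual tree scheme.
        assert ways >= 2 and ways & (ways - 1) == 0
        self.ways = ways
        # One bit per internal tree node: 0 = next victim is in the left
        # half of the node's range, 1 = next victim is in the right half.
        self.bits = [0] * (ways - 1)

    def touch(self, way: int) -> None:
        """Record an access: flip bits on the path to point away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1          # accessed left, victim goes right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0          # accessed right, victim goes left
                node, lo = 2 * node + 2, mid

    def victim(self) -> int:
        """Follow the bits from the root to the way that should be replaced."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo
```

A 4-way set needs only 3 bits of state, which is why this policy is cheap in hardware. It only approximates true LRU: after touching ways 0, 1, 2, 3 and then 0 again, `victim()` returns way 2, whereas true LRU would evict way 1.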

High Level Design Validation and Test | 2007

Reliable network-on-chip based on generalized de Bruijn graph

Mohammad Hosseinabady; Mohammad Reza Kakoee; Jimson Mathew; Dhiraj K. Pradhan

In this paper, we propose the generalized de Bruijn graph as a reliable and efficient network topology for a Network-on-Chip (NoC) design. We also propose a reliable routing algorithm to detour around a problematic (i.e., faulty or congested) link. Our experimental results show that the latency and energy consumption of the generalized de Bruijn graph are much lower than those of Mesh and Torus, the two common NoC architectures in the literature. The low energy consumption of a de Bruijn graph-based NoC makes it suitable for portable devices which have to operate on limited batteries. Also, the gate-level implementation of the proposed reliable routing algorithm shows small area, power, and timing overheads.

IEEE Transactions on Very Large Scale Integration Systems | 2011

Low Latency and Energy Efficient Scalable Architecture for Massive NoCs Using Generalized de Bruijn Graph

Mohammad Hosseinabady; Mohammad Reza Kakoee; Jimson Mathew; Dhiraj K. Pradhan

Employing thousands of cores in a single chip is the natural trend to handle the ever-increasing performance requirements of complex applications such as those used in graphics and multimedia processing. System-on-chip (SoC) platforms based on networks-on-chip (NoCs) could be a viable option for the deployment of large multicore designs with thousands of cores. This paper proposes the generalized binary de Bruijn (GBDB) graph as a reliable and efficient network topology for a large NoC. We propose a reliable routing algorithm to detour around a faulty channel between two adjacent switches. In addition, using integer linear programming, we propose an optimal tile-based implementation for a GBDB-based NoC in which the number of channels is less than that of Torus which has the same number of links. Our experimental results show that the latency and energy consumption of the generalized de Bruijn graph are much lower than those of Mesh and Torus. The low energy consumption of a de Bruijn graph-based NoC makes it suitable for portable devices which have to operate on limited batteries. Also, the gate-level implementation of the proposed reliable routing algorithm shows small area, power, and timing overheads.
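To make the topology concrete, the sketch below builds the generalized binary de Bruijn digraph under its usual textbook definition, in which node i has links to nodes (2i + r) mod N for r in {0, 1}, and measures its diameter with breadth-first search. This is an illustration of the graph family only; whether the paper's NoC treats links as bidirectional, and how its routing works, are not covered here. The function names `successors` and `diameter` are hypothetical.

```python
from collections import deque


def successors(i: int, n: int, d: int = 2) -> list[int]:
    """Out-neighbours of node i in the generalized de Bruijn digraph GB(d, n)."""
    return [(d * i + r) % n for r in range(d)]


def diameter(n: int, d: int = 2) -> int:
    """Longest shortest-path hop count over all ordered node pairs (directed)."""
    worst = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in successors(u, n, d):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Every node is reachable in this graph family, so dist covers all n nodes.
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    print(successors(3, 8))  # [6, 7]
    print(diameter(8))       # 3
```

Unlike the classical de Bruijn graph, the generalized form is defined for any node count N, not only powers of two, while the hop count still grows roughly with log2(N). A 64-node mesh, by comparison, has a diameter of 14 hops.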

International Conference on Embedded Computer Systems Architectures Modeling and Simulation | 2012

A tightly-coupled multi-core cluster with shared-memory HW accelerators

Masoud Dehyadegari; Andrea Marongiu; Mohammad Reza Kakoee; Luca Benini; Siamak Mohammadi; Nasser Yazdani

Tightly coupling hardware accelerators with processors is a well-known approach for boosting the efficiency of MPSoC platforms. The key design challenges in this area are: (i) streamlining accelerator definition and instantiation and (ii) developing architectural templates and run-time techniques for minimizing the cost of communication and synchronization between processors and accelerators. In this paper we present an architecture featuring tightly-coupled processors and hardware processing units (HWPU), with zero-copy communication. We also provide a simple programming API, which simplifies the process of offloading jobs to HWPUs.

IEEE Transactions on Circuits and Systems II: Express Briefs | 2012

Variation-Tolerant Architecture for Ultra Low Power Shared-L1 Processor Clusters

Mohammad Reza Kakoee; Igor Loi; Luca Benini

In this brief, we propose a variation-tolerant architecture for shared-L1 processor clusters working at near-threshold (NT). Our variation-tolerant technique compensates for the effect of delay variations, which are exacerbated by moving to the NT region, on processor-to-memory communication by adding one or two stages of controllable pipelines. Moreover, we propose a reconfigurable address-interleaving technique, which enables us to shut down some of the memory blocks, to reduce power consumption, if they are either too slow due to variation or not needed by the application. Experimental results show that our speed adaptation approach is able to compensate for up to 90% degradation in the request path with less than 2% hardware overhead for a shared-L1 cluster with 16 processors and 32 memory banks. The configurable interleaving technique has an overhead of 10% on the request timing path of a 16 × 32 interconnection network.
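One way to picture reconfigurable address interleaving is as a mapping from a word address to a (bank, row) pair that only uses the banks currently enabled. The sketch below is an assumption-laden illustration, not the scheme from the brief: it assumes word-level interleaving, 4-byte words, and a software-visible list of active banks, and the name `map_address` is hypothetical.

```python
def map_address(byte_addr: int, active_banks: list[int], word_bytes: int = 4) -> tuple[int, int]:
    """Map a byte address to (bank id, row within bank) over the enabled banks.

    Assumption: consecutive words are spread round-robin across whichever
    banks are listed in `active_banks`; banks that are shut down are
    simply left out of the list.
    """
    word = byte_addr // word_bytes
    n = len(active_banks)
    return active_banks[word % n], word // n


if __name__ == "__main__":
    all_banks = list(range(32))
    half_banks = list(range(16))           # 16 banks shut down
    print(map_address(80, all_banks))      # (20, 0)
    print(map_address(80, half_banks))     # (4, 1)
```

Shrinking the active list keeps the address space contiguous while steering all traffic away from the disabled banks, at the cost of more words per remaining bank.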
