Ehsan Atoofian | Researchain

Publication

Featured research published by Ehsan Atoofian.

international parallel and distributed processing symposium | 2007

A Power-Aware Prediction-Based Cache Coherence Protocol for Chip Multiprocessors

Ehsan Atoofian; Amirali Baniasadi

Snoopy cache coherence protocols broadcast requests to all nodes, reducing the latency of cache-to-cache transfer misses at the expense of increased interconnect power. We propose speculative supplier identification (SSI) to reduce power dissipation in binary tree interconnects in snoopy cache coherence implementations. In SSI, instead of broadcasting a request to all processors, we send the request to the node most likely to have the missing data. We reduce power because we limit access to the interconnect components between the requester and the supplier node. We evaluate SSI using shared memory applications. We show that SSI reduces interconnect power by 23% in a 4-way multiprocessor. This comes with negligible performance cost and hardware overhead. SSI does not change existing coherence protocols and is completely transparent to software and the operating system.
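The idea of sending a request only to a likely supplier can be illustrated with a rough sketch. The abstract does not describe the predictor, so the "last supplier per requester" rule, the fall-back to a full broadcast on a misprediction, and the message-count metric below are all assumptions made for illustration; the function name `count_snoop_messages` is hypothetical.

```python
def count_snoop_messages(requests, num_nodes=4):
    """Compare message counts for plain broadcast vs. a predicted supplier.

    requests: list of (requester, actual_supplier) pairs.
    Assumed predictor: each requester remembers who supplied its last miss.
    Assumed recovery: a wrong guess is followed by a normal broadcast.
    """
    last_supplier = {}          # requester -> node that supplied last time
    broadcast_msgs = 0
    predicted_msgs = 0
    others = num_nodes - 1      # nodes reached by a full broadcast

    for requester, supplier in requests:
        broadcast_msgs += others
        guess = last_supplier.get(requester)
        if guess == supplier:
            predicted_msgs += 1             # only the predicted node is contacted
        elif guess is None:
            predicted_msgs += others        # no history yet: broadcast
        else:
            predicted_msgs += 1 + others    # wasted probe, then broadcast
        last_supplier[requester] = supplier

    return broadcast_msgs, predicted_msgs
```

With the request stream `[(0, 1), (0, 1), (0, 1), (0, 2)]` this toy model sends 12 messages under broadcast and 9 with prediction. It counts messages only and says nothing about the interconnect power figures reported in the abstract.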

computing frontiers | 2007

Computational and storage power optimizations for the O-GEHL branch predictor

Kaveh Aasaraai; Amirali Baniasadi; Ehsan Atoofian

In recent years, highly accurate branch predictors have been proposed primarily for high performance processors. Unfortunately, such predictors are extremely energy consuming and in some cases not practical, as they come with excessive prediction latency. One example of such predictors is the O-GEHL predictor. To achieve high accuracy, O-GEHL relies on large tables and extensive computations and requires high energy and long prediction delay. In this work we propose power optimization techniques that aim at reducing both computational complexity and storage size for the O-GEHL predictor. We show that by eliminating unnecessary data from computations, we can reduce both the predictor's energy consumption and delay. Moreover, we apply information theory findings to remove redundant storage without any significant accuracy penalty. We reduce the dynamic and static power dissipated in the computational parts of the predictor by up to 74% and 65% respectively. Meanwhile, we improve performance by up to 12% as we make faster prediction possible.

acm symposium on applied computing | 2008

Exploiting program cyclic behavior to reduce memory latency in embedded processors

Ehsan Atoofian; Amirali Baniasadi

In this work we modify the conventional row buffer allocation mechanism used in DDR2 SDRAM banks to improve average memory latency and overall processor performance. Our method assigns row buffers to different banks dynamically, taking into account program cyclic behavior and bank row buffer demand. As we show in this work, memory requests go through several phases. In each phase, programs tend to access a single bank most of the time. We exploit this repetitive behavior and improve the concurrency level for memory read and write operations. We do so by assigning idle row buffers to more demanding banks during specific program phases. This improves average memory latency and processor performance by 12.7% and 7.6% respectively.
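A minimal sketch of demand-driven row buffer assignment follows. The abstract does not give the allocation policy, so the greedy rule here (every active bank keeps one buffer, spare buffers go to the bank with the highest demand per buffer held) is an assumption, and `allocate_row_buffers` is a hypothetical name.

```python
def allocate_row_buffers(demand, total_buffers):
    """Distribute row buffers across banks for one program phase.

    demand: per-bank access counts observed in the phase.
    Assumed policy: each bank with nonzero demand keeps one buffer;
    buffers left idle are handed out greedily to the most demanding bank.
    """
    active = [b for b, d in enumerate(demand) if d > 0]
    if total_buffers < len(active):
        raise ValueError("not enough row buffers for the active banks")

    alloc = [1 if d > 0 else 0 for d in demand]
    spare = total_buffers - len(active)

    for _ in range(spare):
        if not active:
            break  # nothing is being accessed, so leave buffers idle
        # Pick the bank with the highest demand per buffer it already holds.
        busiest = max(active, key=lambda b: demand[b] / alloc[b])
        alloc[busiest] += 1

    return alloc
```

For a phase with demand `[90, 5, 5, 0]` and four buffers, the idle bank's buffer moves to bank 0, giving `[2, 1, 1, 0]`.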

computing frontiers | 2007

Speculative supplier identification for reducing power of interconnects in snoopy cache coherence protocols

Ehsan Atoofian; Amirali Baniasadi; Kaveh Aasaraai

In this work we reduce interconnect power dissipation in symmetric multiprocessors (SMPs). We revisit snoopy cache coherence protocols and reduce unnecessary interconnect activity by speculating which nodes are expected to provide the missing data. Conventional snoopy cache coherence protocols broadcast requests to all nodes, reducing the latency of cache-to-cache transfer misses at the expense of increased interconnect power. We show that it is possible to reduce the associated power dissipation if such requests are broadcast selectively and only to nodes more likely to provide the missing data. We reduce power because we limit access to the interconnect components between the requester and the supplier node. We evaluate our technique using shared memory applications and show that it is possible to reduce interconnect power by 21% in a 4-way multiprocessor without compromising performance. This comes with negligible hardware overhead.

custom integrated circuits conference | 2005

Low-power prediction based data transfer architecture

Maged Ghoneima; Ehsan Atoofian; Amirali Baniasadi; Yehea I. Ismail

The energy dissipation of on-chip buses is becoming one of the main bottlenecks in current integrated circuits. This paper proposes a prediction-based technique to reduce data bus power dissipation. This technique uses value prediction to speculate the next value to transfer over the data bus. Two identical predictors are placed on the two ends of the bus. If the sender predictor accurately predicts the data to be transferred, the data is not transmitted, and the bus energy dissipation is reduced. After implementing the proposed architecture in a 70 nm CMOS technology, and using a representative subset of SPEC2K benchmarks, an average of 41% of the bus transitions were eliminated, which led to a 31% average overall energy reduction.
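The two-predictor arrangement can be sketched as follows. The abstract does not say which value predictor is used, so the stride predictor, the 32-bit bus width, and the choice to ignore the cost of the hit/miss control signal are assumptions; all names are hypothetical.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit data bus


class StridePredictor:
    """Assumed predictor: next value = last value + last observed stride."""

    def __init__(self):
        self.last = 0
        self.stride = 0

    def predict(self):
        return (self.last + self.stride) & MASK

    def update(self, value):
        self.stride = (value - self.last) & MASK
        self.last = value


def bit_flips(old, new):
    """Number of bus wires that toggle when the bus goes from old to new."""
    return bin(old ^ new).count("1")


def simulate_bus(values):
    """Return (received values, flips without prediction, flips with it)."""
    sender, receiver = StridePredictor(), StridePredictor()
    plain_bus = predicted_bus = 0
    plain_flips = predicted_flips = 0
    received = []

    for value in values:
        # Baseline: every value is driven onto the bus.
        plain_flips += bit_flips(plain_bus, value)
        plain_bus = value

        if sender.predict() == value:
            # Hit: the bus keeps its old state and the receiver
            # reconstructs the value from its own identical predictor.
            delivered = receiver.predict()
        else:
            # Miss: the value is transmitted as usual.
            predicted_flips += bit_flips(predicted_bus, value)
            predicted_bus = value
            delivered = value

        sender.update(value)
        receiver.update(delivered)
        received.append(delivered)

    return received, plain_flips, predicted_flips
```

For the sequence `[10, 12, 14, 16, 18]` the receiver recovers every value while the toy bus toggles 4 wires instead of 10. Because both predictors see the same stream of values, they stay in lockstep without any extra synchronization.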

Journal of Systems Architecture | 2007

Speculative trivialization point advancing in high-performance processors

Ehsan Atoofian; Amirali Baniasadi

Trivial instructions are those instructions whose output can be determined without performing the actual computation. This is due to the fact that for these instructions the output is often either one of the source operands or zero (e.g., addition with or multiplication by zero). In this work we study trivial instructions and use our findings to improve performance in high-performance processors. In particular, we introduce speculative trivialization point advancing to detect and bypass trivial instructions as soon as possible and as early as the decode stage. Consequently, we improve performance over a conventional processor (up to 30%) and a processor that detects and bypasses trivial instructions at their conventional point of trivialization (up to 5%).
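The notion of a trivial instruction can be made concrete with a small check. This sketch covers only the two cases named in the abstract (addition with zero and multiplication by zero); the function name, the string opcodes, and the return convention are assumptions, and the speculative early-detection mechanism itself is not modeled.

```python
def bypass_trivial(op, a, b):
    """Return (True, result) if the result is known without computing it.

    Only the cases mentioned in the abstract are handled:
    addition with zero and multiplication by zero.
    """
    if op == "add":
        if a == 0:
            return True, b      # result is the other source operand
        if b == 0:
            return True, a
    elif op == "mul":
        if a == 0 or b == 0:
            return True, 0      # result is zero
    return False, None          # not trivial: must go to the functional unit


def execute(op, a, b):
    """Use the bypass when possible, otherwise do the real computation."""
    trivial, result = bypass_trivial(op, a, b)
    if trivial:
        return result
    return a + b if op == "add" else a * b
```

A call such as `execute("mul", 9, 0)` returns 0 through the bypass path, while `execute("add", 2, 3)` falls through to the ordinary addition.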

international parallel and distributed processing symposium | 2005