Alaa R. Alameldeen | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Alaa R. Alameldeen is active.

Explore More

Publication

Featured researches published by Alaa R. Alameldeen.

ACM Sigarch Computer Architecture News | 2005

Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

Milo M. K. Martin; Daniel J. Sorin; Bradford M. Beckmann; Michael R. Marty; Min Xu; Alaa R. Alameldeen; Kevin E. Moore; Mark D. Hill; David A. Wood

The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads [3]. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) Toolset, release 1.0, under GNU GPL [9].

high-performance computer architecture | 2003

Variability in architectural simulations of multi-threaded workloads

Alaa R. Alameldeen; David A. Wood

Multi-threaded commercial workloads implement many important Internet services. Consequently, these workloads are increasingly used to evaluate the performance of uniprocessor and multiprocessor system designs. This paper identifies performance variability as a potentially major challenge for architectural simulation studies using these workloads. Variability refers to the differences between multiple estimates of a workloads performance. Time variability occurs when a workload exhibits different characteristics during different phases of a single run. Space variability occurs when small variations in timing cause runs starting from the same initial condition to follow widely different execution paths. Variability is a well-known phenomenon in real systems, but is nearly universally ignored in simulation experiments. In a central result of this paper we show that variability in multi-threaded commercial workloads can lead to incorrect architectural conclusions (e.g., 31% of the time in one experiment). We propose a methodology, based on multiple simulations and standard statistical techniques, to compensate for variability. Our methodology greatly reduces the probability of reaching incorrect conclusions, while enabling simulations to finish within reasonable time limits.

international symposium on computer architecture | 2004

Adaptive Cache Compression for High-Performance Processors

Alaa R. Alameldeen; David A. Wood

Modern processors use two or more levels of cache memories to bridge the rising disparity between processor and memory speeds. Compression can improve cache performance by increasing effective cache capacity and eliminating misses. However, decompressing cache lines also increases cache access latency, potentially degrading performance. In this paper, we develop an adaptive policy that dynamically adapts to the costs and benefits of cache compression. We propose a two-level cache hierarchy where the L1 cache holds uncompressed data and the L2 cache dynamically selects between compressed and uncompressed storage. The L2 cache is 8-way set-associative with LRU replacement, where each set can store up to eight compressed lines but has space for only four uncompressed lines. On each L2 reference, the LRU stack depth and compressed size determine whether compression (could have) eliminated a miss or incurs an unnecessary decompression overhead. Based on this outcome, the adaptive policy updates a single global saturating counter, which predicts whether to allocate lines in compressed or uncompressed form. We evaluate adaptive cache compression using full-system simulation and a range of benchmarks. We show that compression can improve performance for memory-intensive commercial workloads by up to 17%. However, always using compression hurts performance for low-miss-rate benchmarks - due to unnecessary decompression overhead - degrading performance by up to 18%. By dynamically monitoring workload behavior, the adaptive policy achieves comparable benefits from compression, while never degrading performance by more than 0.4%.

international symposium on computer architecture | 2010

Reducing cache power with low-cost, multi-bit error-correcting codes

Chris Wilkerson; Alaa R. Alameldeen; Zeshan Chishti; Wei Wu; Dinesh Somasekhar; Shih-Lien Lu

Technology advancements have enabled the integration of large on-die embedded DRAM (eDRAM) caches. eDRAM is significantly denser than traditional SRAMs, but must be periodically refreshed to retain data. Like SRAM, eDRAM is susceptible to device variations, which play a role in determining refresh time for eDRAM cells. Refresh power potentially represents a large fraction of overall system power, particularly during low-power states when the CPU is idle. Future designs need to reduce cache power without incurring the high cost of flushing cache data when entering low-power states. In this paper, we show the significant impact of variations on refresh time and cache power consumption for large eDRAM caches. We propose Hi-ECC, a technique that incorporates multi-bit error-correcting codes to significantly reduce refresh rate. Multi-bit error-correcting codes usually have a complex decoder design and high storage cost. Hi-ECC avoids the decoder complexity by using strong ECC codes to identify and disable sections of the cache with multi-bit failures, while providing efficient single-bit error correction for the common case. Hi-ECC includes additional optimizations that allow us to amortize the storage cost of the code over large data words, providing the benefit of multi-bit correction at same storage cost as a single-bit error-correcting (SECDED) code (2% overhead). Our proposal achieves a 93% reduction in refresh power vs. a baseline eDRAM cache without error correcting capability, and a 66% reduction in refresh power vs. a system using SECDED codes.

IEEE Computer | 2003

Simulating a

Alaa R. Alameldeen; Milo M. K. Martin; Carl J. Mauer; Kevin E. Moore; Min Xu; Mark D. Hill; David A. Wood; Daniel J. Sorin

As dependence on database management systems and Web servers increases, so does the need for them to run reliably and efficiently-goals that rigorous simulations can help achieve. Execution-driven simulation models system hardware. These simulations capture actual program behavior and detailed system interactions. The authors have developed a simulation methodology that uses multiple simulations, pays careful attention to the effects of scaling on workload behavior, and extends Virtutech ABs Simics full system functional simulator with detailed timing models. The Wisconsin Commercial Workload Suite contains scaled and tuned benchmarks for multiprocessor servers, enabling full-system simulations to run on the PCs that are routinely available to researchers.

international symposium on microarchitecture | 2009

2M commercial server on a

Zeshan Chishti; Alaa R. Alameldeen; Chris Wilkerson; Wei Wu; Shih-Lien Lu

Voltage scaling is one of the most effective mechanisms to reduce microprocessor power consumption. However, the increased severity of manufacturing-induced parameter variations at lower voltages limits voltage scaling to a minimum voltage, Vccmin, below which a processor cannot operate reliably. Memory cell failures in large memory structures (e.g., caches) typically determine the Vccmin for the whole processor. Memory failures can be persistent (i.e., failures at time zero which cause yield loss) or non-persistent (e.g., soft errors or erratic bit failures). Both types of failures increase as supply voltage decreases and both need to be addressed to achieve reliable operation at low voltages. In this paper, we propose a novel adaptive technique to improve cache lifetime reliability and enable low voltage operation. This technique, multi-bit segmented ECC (MS-ECC) addresses both persistent and non-persistent failures. Like previous work on mitigating persistent failures, MS-ECC trades off cache capacity for lower voltages. However, unlike previous schemes, MS-ECC does not rely on testing to identify and isolate defective bits, and therefore enables error tolerance for non-persistent failures like erratic bits and soft errors at low voltages. Furthermore, MS-ECCs design can allow the operating system to adaptively change the cache size and ECC capability to adjust to system operating conditions. Compared to current designs with single-bit correction, the most aggressive implementation for MS-ECC enables a 30% reduction in supply voltage, reducing power by 71% and energy per instruction by 42%.

IEEE Micro | 2006

2K PC

Alaa R. Alameldeen; David A. Wood

Many architectural simulation studies use instructions per cycle (IPC) to analyze performance. In this article, we challenge the commonly held view that IPC accurately reflects performance - at least for multithreaded workloads running on multiprocessors. Work-related metrics, such as time per transaction, are the most accurate and reliable way to estimate multiprocessor workload performance

architectural support for programming languages and operating systems | 2000

Improving cache lifetime reliability at ultra-low voltages

Milo M. K. Martin; Daniel J. Sorin; Anastassia Ailamaki; Alaa R. Alameldeen; Ross M. Dickson; Carl J. Mauer; Kevin E. Moore; Manoj Plakal; Mark D. Hill; David H. Wood

Symmetric muultiprocessor (SMP) servers provide superior performance for the commercial workloads that dominate the Internet. Our simulation results show that over one-third of cache misses by these applications result in cache-to-cache transfers, where the data is found in another processors cache rather than in memory. SMPs are optimized for this case by using snooping protocols that broadcast address transactions to all processors. Conversely, directory-based shared-memory systems must indirectly locate the owner and sharers through a directory, resulting in larger average miss latencies.This paper proposes timestamp snooping, a technique that allows SMPs to i) utilize high-speed switched interconnection networks and ii) exploit physical locality by delivering address transactions to processors and memories without regard to order. Traditional snooping requires physical ordering of transactions. Timestamp snooping works by processing address transactions in a logical order. Logical time is maintained by adding a few bits per address transaction and having network switches perform a handshake to ensure on-time delivery. Processors and memories then reorder transactions based on their timestamps to establish a total order.We evaluate timestamp snooping with commercial workloads on a 16-processor SPARC system using the Simics full-system simulator. We simulate both an indirect (butterfly) and a direct (torus) network design. For OLTP, DSS, web serving, web searching, and one scientific application, timestamp snooping with the butterfly network runs 6-28% faster than directories, at a cost of 13-43% more link traffic. Similarly, with the torus network, timestamp snooping runs 6-29% faster for 17-37% more link traffic. Thus, timestamp snooping is worth considering when buying more interconnect bandwidth is easier than reducing interconnect latency.

high-performance computer architecture | 2014

IPC Considered Harmful for Multiprocessor Workloads

Kevin Kai-Wei Chang; Donghyuk Lee; Zeshan Chishti; Alaa R. Alameldeen; Chris Wilkerson; Yoongu Kim; Onur Mutlu

Modern DRAM cells are periodically refreshed to prevent data loss due to leakage. Commodity DDR (double data rate) DRAM refreshes cells at the rank level. This degrades performance significantly because it prevents an entire DRAM rank from serving memory requests while being refreshed. DRAM designed for mobile platforms, LPDDR (low power DDR) DRAM, supports an enhanced mode, called per-bank refresh, that refreshes cells at the bank level. This enables a bank to be accessed while another in the same rank is being refreshed, alleviating part of the negative performance impact of refreshes. Unfortunately, there are two shortcomings of per-bank refresh employed in todays systems. First, we observe that the perbank refresh scheduling scheme does not exploit the full potential of overlapping refreshes with accesses across banks because it restricts the banks to be refreshed in a sequential round-robin order. Second, accesses to a bank that is being refreshed have to wait. To mitigate the negative performance impact of DRAM refresh, we propose two complementary mechanisms, DARP (Dynamic Access Refresh Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal is to address the drawbacks of per-bank refresh by building more efficient techniques to parallelize refreshes and accesses within DRAM. First, instead of issuing per-bank refreshes in a round-robin order, as it is done today, DARP issues per-bank refreshes to idle banks in an out-of-order manner. Furthermore, DARP proactively schedules refreshes during intervals when a batch of writes are draining to DRAM. Second, SARP exploits the existence of mostly-independent subarrays within a bank. With minor modifications to DRAM organization, it allows a bank to serve memory accesses to an idle subarray while another subarray is being refreshed. Extensive evaluations on a wide variety of workloads and systems show that our mechanisms improve system performance (and energy efficiency) compared to three state-of-the-art refresh policies and the performance benefit increases as DRAM density increases.

international symposium on low power electronics and design | 2007

Timestamp snooping: an approach for extending SMPs

Keith A. Bowman; Alaa R. Alameldeen; Srikanth T. Srinivasan; Chris Wilkerson

A statistical performance simulator is developed to explore the impact of die-to-die (D2D) and within-die (WID) parameter variations on the distributions of maximum clock frequency (FMAX) and throughput for multi-core processors in a future 22nm technology. The simulator integrates a compact analytical throughput model, which captures the key dependencies of multi-core processors, into a statistical simulation framework that models the effects of D2D and WID parameter variations on critical path delays across a die. The salient contributions from this paper are: (1) Product-level variation analysis for multi-core processors must focus on throughput, rather than just FMAX, and (2) Multi-core processors are inherently more variation tolerant than single-core processors due to the larger impact of memory latency and bandwidth on overall throughput. To elucidate these two points, multi-core and single-core processors have a similar chip-level FMAX distribution (mean degradation of 9% and standard deviation of 5%) for multi-threaded applications. In contrast to single-core processors, memory latency and bandwidth constraints significantly limit the throughput dependency on FMAX in multi-core processors, thus reducing the throughput mean degradation and standard deviation by 50%. Since single-threaded applications running on a multi-core processor can execute on the fastest core, mean FMAX and throughput gains of 4% are achieved from the nominal design target.

Explore More