Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Ali G. Saidi is active.

Publication


Featured research published by Ali G. Saidi.


ACM SIGARCH Computer Architecture News | 2011

The gem5 simulator

Nathan L. Binkert; Bradford M. Beckmann; Gabriel Black; Steven K. Reinhardt; Ali G. Saidi; Arkaprava Basu; Joel Hestness; Derek R. Hower; Tushar Krishna; Somayeh Sardashti; Rathijit Sen; Korey Sewell; Muhammad Shoaib; Nilay Vaish; Mark D. Hill; David A. Wood

The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, makes gem5 a valuable full-system simulation tool.
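
For readers new to the simulator, gem5 experiments are driven by Python configuration scripts that instantiate and wire together the simulated hardware. Below is a minimal syscall-emulation sketch; object and port names follow recent gem5 releases (older versions use master/slave port names), and the workload path is only a placeholder from the gem5 test programs.

```python
# Minimal gem5 syscall-emulation (SE) configuration sketch.
# Port/object names follow recent gem5 releases (v21+); older
# versions use 'master'/'slave' in place of the *_side_ports names.
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock='1GHz',
                                   voltage_domain=VoltageDomain())
system.mem_mode = 'timing'
system.mem_ranges = [AddrRange('512MB')]

system.cpu = TimingSimpleCPU()        # simple in-order timing CPU model
system.membus = SystemXBar()          # crossbar joining CPU and memory
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports
system.cpu.createInterruptController()
# x86 builds additionally need the interrupt controller wired up:
system.cpu.interrupts[0].pio = system.membus.mem_side_ports
system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports

system.mem_ctrl = MemCtrl()           # DRAM controller + device model
system.mem_ctrl.dram = DDR3_1600_8x8()
system.mem_ctrl.dram.range = system.mem_ranges[0]
system.mem_ctrl.port = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports

process = Process()
process.cmd = ['tests/test-progs/hello/bin/x86/linux/hello']  # placeholder
system.cpu.workload = process
system.cpu.createThreads()
system.workload = SEWorkload.init_compatible(process.cmd[0])

root = Root(full_system=False, system=system)
m5.instantiate()
event = m5.simulate()
print(f'Exited @ tick {m5.curTick()}: {event.getCause()}')
```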


International Symposium on Microarchitecture | 2006

The M5 Simulator: Modeling Networked Systems

Nathan L. Binkert; Ronald G. Dreslinski; Lisa R. Hsu; Kevin T. Lim; Ali G. Saidi; Steven K. Reinhardt

The M5 simulator was developed specifically to enable research in TCP/IP networking. It provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically. M5's usefulness as a general-purpose architecture simulator and its liberal open-source license have led to its adoption by several academic and commercial groups.
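
Networked-system simulation in M5 (and in today's gem5) amounts to building two full systems and joining their NICs with a simulated link; because the link is just another timed component, the runs are deterministic. A schematic sketch is below: testsys and drivesys stand for two fully built System objects (construction omitted), the path to each NIC's interface port varies by simulated platform, and EtherLink's delay/speed parameters are real but the values here are illustrative.

```python
# Schematic: joining two simulated systems with a simulated Ethernet
# link, loosely after the dual-system scripts shipped with M5/gem5.
# 'testsys' and 'drivesys' are assumed to be fully configured
# full-system System objects; the '.ethernet.interface' attribute
# path is illustrative and platform-dependent.
from m5.objects import EtherLink, Root

link = EtherLink(delay='25us', speed='10Gbps')  # fixed-latency link model
link.int0 = testsys.ethernet.interface          # system under test
link.int1 = drivesys.ethernet.interface         # load-generator system

root = Root(full_system=True)
root.testsys, root.drivesys, root.etherlink = testsys, drivesys, link
```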


Architectural Support for Programming Languages and Operating Systems | 2006

PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor

Taeho Kgil; Shaun D'Souza; Ali G. Saidi; Nathan L. Binkert; Ronald G. Dreslinski; Trevor N. Mudge; Steven K. Reinhardt; Krisztian Flautner

In this paper, we show how 3D stacking technology can be used to implement a simple, low-power, high-performance chip multiprocessor suitable for throughput processing. Our proposed architecture, PicoServer, employs 3D technology to bond one die containing several simple, slow processing cores to multiple DRAM dies sufficient for a primary memory. The 3D technology also enables wide low-latency buses between processors and memory. These remove the need for an L2 cache, allowing its area to be re-allocated to additional simple cores. The additional cores allow the clock frequency to be lowered without impairing throughput. Lower clock frequency in turn reduces power and means that thermal constraints, a concern with 3D stacking, are easily satisfied. The PicoServer architecture specifically targets Tier 1 server applications, which exhibit a high degree of thread-level parallelism. An architecture targeted to efficient throughput is ideal for this application domain. We find that, for a similar logic die area, a 12-CPU system with 3D stacking and no L2 cache outperforms an 8-CPU system with a large on-chip L2 cache by about 14% while consuming 55% less power. In addition, we show that a PicoServer performs comparably to a Pentium 4-class machine while consuming only about 1/10 of the power, even when conservative assumptions are made about the power consumption of the PicoServer.
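
The area-for-frequency trade at the heart of PicoServer follows from first-order dynamic power scaling: with supply voltage roughly proportional to frequency, per-core power falls with f^3, so many slow cores can beat a few fast ones on power at comparable throughput. A back-of-envelope sketch with purely illustrative constants, not the paper's measured configurations:

```python
# First-order check of the "more, slower cores" trade.
# Dynamic power ~ n_cores * V^2 * f with V assumed to scale linearly
# with f, so per-core power falls cubically as frequency drops.
# Throughput ~ n_cores * f for thread-rich Tier 1 workloads.
# All constants are illustrative, not the paper's measurements.

def throughput(n_cores, f_ghz):
    return n_cores * f_ghz            # assumes near-perfect thread scaling

def power(n_cores, f_ghz):
    v = f_ghz                         # crude linear V-f scaling assumption
    return n_cores * v**2 * f_ghz     # ~ f^3 per core

for label, (n, f) in (('4 cores @ 2 GHz', (4, 2.0)),
                      ('12 cores @ 1 GHz', (12, 1.0))):
    print(f'{label}: throughput={throughput(n, f):g}, power={power(n, f):g}')
# 4 @ 2 GHz: throughput 8, power 32; 12 @ 1 GHz: throughput 12, power 12.
# More cores at half the clock deliver more aggregate throughput for far
# less dynamic power, provided the workload has enough threads.
```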


International Symposium on Computer Architecture | 2013

Thin servers with smart pipes: designing SoC accelerators for memcached

Kevin T. Lim; David Meisner; Ali G. Saidi; Parthasarathy Ranganathan; Thomas F. Wenisch

Distributed in-memory key-value stores, such as memcached, are central to the scalability of modern internet services. Current deployments use commodity servers with high-end processors. However, given the cost-sensitivity of internet services and the recent proliferation of volume low-power System-on-Chip (SoC) designs, we see an opportunity for alternative architectures. We undertake a detailed characterization of memcached to reveal performance and power inefficiencies. Our study considers both high-performance and low-power CPUs and NICs across a variety of carefully designed benchmarks that exercise the range of memcached behavior. We discover that, regardless of CPU microarchitecture, memcached execution is remarkably inefficient, saturating neither network links nor available memory bandwidth. Instead, we find performance is typically limited by per-packet processing overheads in the NIC and OS kernel: long code paths limit CPU performance due to poor branch predictability and instruction fetch bottlenecks. Our insights suggest that neither high-performance nor low-power cores provide a satisfactory power-performance trade-off, and point to a need for tighter integration of the network interface. Hence, we argue for an alternate architecture, Thin Servers with Smart Pipes (TSSP), for cost-effective high-performance memcached deployment. TSSP couples an embedded-class low-power core to a memcached accelerator that can process GET requests entirely in hardware, offloading both network handling and data lookup. We demonstrate the potential benefits of our TSSP architecture through an FPGA prototyping platform, and show the potential for a 6X-16X power-performance improvement over conventional server baselines.
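
The memcached GET path that TSSP offloads is simple at the protocol level, which is what makes a hardware fast path plausible: a request is one short command and the response is a small framed value. A minimal client for the ASCII protocol, assuming a memcached server on the standard port 11211:

```python
# Minimal memcached ASCII-protocol GET, illustrating how little
# protocol work sits on the request path that TSSP moves into
# hardware. Assumes a memcached server on localhost:11211.
import socket

def mc_get(key: str, host: str = '127.0.0.1', port: int = 11211):
    with socket.create_connection((host, port)) as s:
        s.sendall(f'get {key}\r\n'.encode())
        buf = b''
        while not buf.endswith(b'END\r\n'):   # hits and misses both end here
            chunk = s.recv(4096)
            if not chunk:
                break
            buf += chunk
    if buf.startswith(b'VALUE'):
        header, rest = buf.split(b'\r\n', 1)  # VALUE <key> <flags> <bytes>
        nbytes = int(header.split()[3])
        return rest[:nbytes]
    return None                               # cache miss

print(mc_get('some_key'))
```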


Architectural Support for Programming Languages and Operating Systems | 2016

High-Performance Transactions for Persistent Memories

Aasheesh Kolli; Steven Pelley; Ali G. Saidi; Peter M. Chen; Thomas F. Wenisch

Emerging non-volatile memory (NVRAM) technologies offer the durability of disk with the byte-addressability of DRAM. These devices will allow software to access persistent data structures directly in NVRAM using processor loads and stores; however, ensuring consistency of persistent data across power failures and crashes is difficult. Atomic, durable transactions are a widely used abstraction to enforce such consistency. Implementing transactions on NVRAM requires the ability to constrain the order of NVRAM writes, for example, to ensure that a transaction's log record is complete before it is marked committed. Since NVRAM write latencies are expected to be high, minimizing these ordering constraints is critical for achieving high performance. Recent work has proposed programming interfaces to express NVRAM write ordering constraints to hardware so that NVRAM writes may be coalesced and reordered while preserving necessary constraints. Unfortunately, a straightforward implementation of transactions under these interfaces imposes unnecessary constraints. We show how to remove these dependencies through a variety of techniques, notably, deferring commit until after locks are released. We present a comprehensive analysis contrasting two transaction designs across three NVRAM programming interfaces, demonstrating up to 2.5x speedup.
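
The ordering problem the paper targets shows up clearly in a write-ahead-logged transaction: the log entry must be durable before the mutation, and the mutation before the commit mark. A schematic sketch using hypothetical persist()/fence() primitives (stand-ins for cache-line writeback plus ordering instructions, not a real API):

```python
# Schematic undo-logged transaction on persistent memory. persist()
# and fence() are HYPOTHETICAL stand-ins for cache-line writeback and
# ordering instructions (e.g. CLWB + SFENCE on x86): the point is the
# required ordering, not the mechanism.

def persist(obj):
    """Flush the cache lines holding obj toward NVRAM (stub)."""

def fence():
    """Order earlier persists before later ones (stub)."""

def tx_update(log: dict, data: dict, key, new_value):
    log[key] = data.get(key)     # 1. record the old value in the undo log
    persist(log); fence()        #    log must be durable BEFORE the update
    data[key] = new_value        # 2. mutate the persistent structure
    persist(data); fence()       #    update durable BEFORE the commit mark
    log['committed'] = True      # 3. commit; the log entry is now dead
    persist(log); fence()        #    deferring this last ordering point past
                                 #    lock release is the paper's key lever
```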


International Conference on Parallel Architectures and Compilation Techniques | 2005

Performance analysis of system overheads in TCP/IP workloads

Nathan L. Binkert; Lisa R. Hsu; Ali G. Saidi; Ronald G. Dreslinski; Andrew L. Schultz; Steven K. Reinhardt

Current high-performance computer systems are unable to saturate the latest available high-bandwidth networks such as 10 Gigabit Ethernet. A key obstacle in achieving 10 gigabits per second is the high overhead of communication between the CPU and network interface controller (NIC), which typically resides on a standard I/O bus with high access latency. Using several network-intensive benchmarks, we investigate the impact of this overhead by analyzing the performance of hypothetical systems in which the NIC is more closely coupled to the CPU, including integration on the CPU die. We find that systems with high-latency NICs spend a significant amount of time in the device driver. NIC integration can substantially reduce this overhead, providing significant throughput benefits when other CPU processing is not a bottleneck. NIC integration also enables cache placement of DMA data. This feature has tremendous benefits when payloads are touched quickly, but potentially can harm performance in other situations due to cache pollution.
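
The pressure the paper quantifies is easy to bound: at 10 Gb/s, even maximum-size Ethernet frames arrive faster than typical I/O-bus round trips to a discrete NIC. A quick budget calculation, assuming standard Ethernet framing overheads:

```python
# Per-packet time budget at 10 Gb/s for maximum-size Ethernet frames:
# 1500 B of payload plus 38 B of framing overhead (preamble, header,
# FCS, inter-frame gap).
line_rate = 10e9                        # bits per second
frame_bits = (1500 + 38) * 8            # bits on the wire per frame

pps = line_rate / frame_bits            # ~0.81 M packets/s
print(f'{pps / 1e6:.2f} Mpps, {1e9 / pps:.0f} ns per packet')
# ~1230 ns per packet: a few high-latency register accesses across an
# I/O bus to a discrete NIC can consume most of this budget, which is
# why the paper studies moving the NIC closer to (or onto) the CPU die.
```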


Architectural Support for Programming Languages and Operating Systems | 2006

Integrated network interfaces for high-bandwidth TCP/IP

Nathan L. Binkert; Ali G. Saidi; Steven K. Reinhardt

This paper proposes new network interface controller (NIC) designs that take advantage of integration with the host CPU to provide increased flexibility for operating system kernel-based performance optimization. We believe that this approach is more likely to meet the needs of current and future high-bandwidth TCP/IP networking on end hosts than the current trend of putting more complexity in the NIC, while avoiding the need to modify applications and protocols. This paper presents two such NICs. The first, the simple integrated NIC (SINIC), is a minimally complex design that moves the responsibility for managing the network FIFOs from the NIC to the kernel. Despite this closer interaction between the kernel and the NIC, SINIC provides performance equivalent to a conventional DMA-based NIC without increasing CPU overhead. The second design, V-SINIC, adds virtual per-packet registers to SINIC, enabling parallel packet processing while maintaining a FIFO model. V-SINIC allows the kernel to decouple examining a packet's header from copying its payload to memory. We exploit this capability to implement a true zero-copy receive optimization in the Linux 2.6 kernel, providing bandwidth improvements of over 50% on unmodified sockets-based receive-intensive benchmarks.
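
V-SINIC's zero-copy receive happens inside the kernel, but the idea of landing payload bytes directly in their final buffer has a familiar application-level analogue: receiving into a preallocated buffer instead of allocating and copying per packet. A small sketch of that analogue (an illustration of the idea, not the paper's kernel mechanism):

```python
# Application-level analogue of trimming receive-path copies:
# recv_into() lands payload bytes directly in a preallocated buffer
# rather than returning fresh bytes objects that must be copied again.
# This illustrates the idea only; V-SINIC's zero-copy happens in the
# kernel, keyed off its virtual per-packet registers.
import socket

def drain(sock: socket.socket, buf: bytearray) -> int:
    """Fill buf with received data, placing bytes at their final home."""
    view = memoryview(buf)
    filled = 0
    while filled < len(buf):
        n = sock.recv_into(view[filled:])
        if n == 0:              # peer closed the connection
            break
        filled += n
    return filled
```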


ACM Journal on Emerging Technologies in Computing Systems | 2008

PicoServer: Using 3D stacking technology to build energy efficient servers

Taeho Kgil; Ali G. Saidi; Nathan L. Binkert; Steven K. Reinhardt; Krisztian Flautner; Trevor N. Mudge

This article extends our prior work to show that a straightforward use of 3D stacking technology enables the design of compact energy-efficient servers. Our proposed architecture, called PicoServer, employs 3D technology to bond one die containing several simple, slow processing cores to multiple memory dies sufficient for a primary memory. The multiple memory dies are composed of DRAM. This use of 3D stacks readily facilitates wide low-latency buses between processors and memory. These remove the need for an L2 cache, allowing its area to be re-allocated to additional simple cores. The additional cores allow the clock frequency to be lowered without impairing throughput. Lower clock frequency means that thermal constraints, a concern with 3D stacking, are easily satisfied. We extend our original analysis on PicoServer to include: (1) a wider set of server workloads, (2) the impact of multithreading, and (3) the on-chip DRAM architecture and system memory usage. PicoServer is intentionally simple, requiring only the simplest form of 3D technology, where die are stacked on top of one another. Our intent is to minimize the risk of introducing a new technology (3D) to implement a class of low-cost, low-power compact server architectures.


International Symposium on Microarchitecture | 2016

Delegated persist ordering

Aasheesh Kolli; Jeffrey Rosen; Stephan Diestelhorst; Ali G. Saidi; Steven Pelley; Sihang Liu; Peter M. Chen; Thomas F. Wenisch

Systems featuring a load-store interface to persistent memory (PM) are expected soon, making in-memory persistent data structures feasible. Ensuring persistent data structure recoverability requires constraints on the order in which PM writes become persistent, but current memory systems reorder writes and provide no such guarantees. To complement their upcoming 3D XPoint memory, Intel has announced new instructions to enable programmer control of data persistence. We describe the semantics implied by these instructions, an ordering model we call synchronous ordering. Synchronous ordering (SO) enforces order by stalling execution when PM write ordering is required, exposing PM write latency on the execution critical path. It incurs an average slowdown of 7.21x over volatile execution without ordering in PM-write-intensive benchmarks. SO tightly couples enforcing order and flushing writes to PM, but this tight coupling is unneeded in many recoverable software systems. Instead, we propose delegated ordering, wherein ordering requirements are communicated explicitly to the PM controller, fully decoupling PM write ordering from volatile execution and cache management. We demonstrate that delegated ordering can bring performance within 1.93x of volatile execution, improving over SO by 3.73x.
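
The headline numbers are internally consistent: both slowdowns are measured relative to volatile execution, so the improvement is simply their ratio.

```python
# Cross-check of the abstract's figures: both slowdowns are relative
# to volatile execution, so the improvement is their ratio.
so, delegated = 7.21, 1.93
print(f'{so / delegated:.2f}x')   # 3.74x, the quoted 3.73x up to rounding
```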


International Symposium on Microarchitecture | 2013

RDIP: return-address-stack directed instruction prefetching

Aasheesh Kolli; Ali G. Saidi; Thomas F. Wenisch

L1 instruction fetch misses remain a critical performance bottleneck, accounting for up to 40% slowdowns in server applications. Whereas instruction footprints typically fit within last-level caches, they overwhelm L1 caches, whose capacity is limited by latency constraints. Past work has shown that server application instruction miss sequences are highly repetitive. By recording, indexing, and prefetching according to these sequences, nearly all L1 instruction misses can be eliminated. However, existing schemes require impractical storage and considerable complexity to correct for minor control-flow variations that disrupt sequences. In this work, we simplify and reduce the energy requirements of accurate instruction prefetching via two observations: (1) program context as captured in the call stack correlates strongly with L1 instruction misses, and (2) the return address stack (RAS), already present in all high-performance processors, succinctly summarizes program context. We propose RAS-Directed Instruction Prefetching (RDIP), which associates prefetch operations with signatures formed from the contents of the RAS. RDIP achieves 70% of the potential speedup of an ideal L1 cache, outperforms a prefetcher-less baseline by 11.5%, and reduces energy and complexity relative to sequence-based prefetching. RDIP's performance is within 2% of the state-of-the-art Proactive Instruction Fetch, with nearly 3X reduction in storage and 1.9X reduction in energy overheads.
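
RDIP's core mechanism, forming a prefetch-table index from the return address stack, is compact enough to sketch. A toy model is below; the call/return/miss callbacks and the prefetch hook are hypothetical simulator plumbing, while the real design hashes RAS contents in hardware:

```python
# Toy model of RAS-directed prefetching: the return address stack
# doubles as a program-context signature that indexes a table of
# instruction-miss addresses seen in that context. The callbacks and
# issue_prefetch() are hypothetical simulator hooks.
from collections import defaultdict

RAS_DEPTH = 4    # top-of-stack entries folded into each signature

def issue_prefetch(addr):
    """Hypothetical hook into the simulated L1 instruction cache."""

class RDIP:
    def __init__(self):
        self.ras = []                     # return address stack
        self.table = defaultdict(set)     # signature -> miss addresses
        self.sig = 0                      # signature for current context

    def _resign(self):
        self.sig = hash(tuple(self.ras[-RAS_DEPTH:]))

    def on_call(self, return_addr):
        self.ras.append(return_addr)
        self._resign()
        self._prefetch()                  # context changed: replay its misses

    def on_return(self):
        if self.ras:
            self.ras.pop()
        self._resign()
        self._prefetch()

    def on_icache_miss(self, pc):
        self.table[self.sig].add(pc)      # train: tie the miss to this context

    def _prefetch(self):
        for addr in self.table[self.sig]:
            issue_prefetch(addr)
```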

Collaboration


Dive into Ali G. Saidi's collaborations.

Top Co-Authors

Lisa R. Hsu
University of Michigan