Ardavan Pedram
Stanford University
Publications
Featured research published by Ardavan Pedram.
IEEE Transactions on Computers | 2012
Ardavan Pedram; Robert A. van de Geijn; Andreas Gerstlauer
As technology is reaching physical limits, reducing power consumption is a key issue on our path to sustained performance. In this paper, we study fundamental tradeoffs and limits in efficiency (as measured in energy per operation) that can be achieved for an important class of kernels, namely the level-3 Basic Linear Algebra Subprograms (BLAS). It is well accepted that specialization is the key to efficiency. This paper establishes a baseline by studying GEneral Matrix-matrix Multiplication (GEMM) on a variety of custom and general-purpose CPU and GPU architectures. Our analysis shows that orders-of-magnitude improvements in efficiency are possible with relatively simple customizations and fine-tuning of memory hierarchy configurations. We argue that these customizations can be generalized to perform other representative linear algebra operations. In addition to exposing the sources of inefficiency in current CPUs and GPUs, our results show that our prototype Linear Algebra Processor (LAP) implementing double-precision GEMM (DGEMM) can achieve 600 GFLOPS while consuming less than 25 W in standard 45 nm technology, which is up to 50× more energy efficient than cutting-edge CPUs.
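For illustration, the GEMM kernel at the center of this study is, in software form, the cache-blocked triple loop below. This is a minimal sketch in C; the block size NB is an illustrative placeholder, not a parameter from the paper:

```c
#include <stddef.h>

#define NB 64  /* illustrative cache-block size, not from the paper */

static inline size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C := C + A*B for row-major n x n matrices, blocked so that NB x NB
 * tiles of A, B, and C are reused while they fit in cache. */
void dgemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i0 = 0; i0 < n; i0 += NB)
        for (size_t k0 = 0; k0 < n; k0 += NB)
            for (size_t j0 = 0; j0 < n; j0 += NB)
                for (size_t i = i0; i < min_sz(i0 + NB, n); i++)
                    for (size_t k = k0; k < min_sz(k0 + NB, n); k++) {
                        double aik = A[i*n + k];  /* reused across the j loop */
                        for (size_t j = j0; j < min_sz(j0 + NB, n); j++)
                            C[i*n + j] += aik * B[k*n + j];
                    }
}
```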
International Embedded Systems Symposium | 2009
Ardavan Pedram; David Craven; Andreas Gerstlauer
Embedded system design complexities are growing exponentially. Demand has increased for modeling techniques that can provide both accurate measurements of delay and fast simulation speed for use in design space exploration. Previous efforts have enabled designers to estimate performance with Transaction Level Modeling (TLM) of software processors, but this technique typically does not account for the effect of memory latencies. Modeling the latency effects of a cache can greatly increase the accuracy of the simulation and assist designers in choosing appropriate algorithms. In this article, we present the implementation of a cache model and its integration into a processor TLM. We demonstrate a method for extracting information about memory accesses from the final binary and abstracting them into cache-model accesses. Our methodology is tested on a common embedded processor application with two algorithms exhibiting different cache behaviors. Our experiments show that the cache model can achieve results comparable to a cycle-accurate ISS, but with very little overhead compared to native, host-compiled code execution.
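A cache model of the kind integrated into the TLM can be sketched in a few lines of C. This is a hypothetical direct-mapped model; the geometry and latencies below are placeholders, not values from the article:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS   256   /* placeholder geometry: 256 sets ...        */
#define LINE_BITS  5     /* ... of 32-byte lines, direct-mapped       */
#define HIT_CYCLES  1    /* placeholder hit/miss latencies            */
#define MISS_CYCLES 20

typedef struct {
    uint32_t tags[NUM_SETS];
    bool     valid[NUM_SETS];
    uint64_t cycles;     /* accumulated memory delay for the TLM      */
} cache_model;

/* Account for one memory access; returns the delay to add to the
 * simulated time of the host-compiled software model. */
uint32_t cache_access(cache_model *c, uint32_t addr)
{
    uint32_t set = (addr >> LINE_BITS) % NUM_SETS;
    uint32_t tag = addr >> LINE_BITS;
    if (c->valid[set] && c->tags[set] == tag) {
        c->cycles += HIT_CYCLES;
        return HIT_CYCLES;
    }
    c->valid[set] = true;    /* allocate the line on a miss */
    c->tags[set]  = tag;
    c->cycles += MISS_CYCLES;
    return MISS_CYCLES;
}
```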
Application-Specific Systems, Architectures and Processors | 2011
Ardavan Pedram; Andreas Gerstlauer; Robert A. van de Geijn
Achieving high performance while reducing power consumption is a key concern as technology scaling is reaching its limits. It is well accepted that application-specific custom hardware can achieve orders-of-magnitude improvements in efficiency. The question is whether such efficiency can be maintained while providing enough flexibility to implement a broad class of operations. In this paper, we aim to answer this question for the domain of matrix computations. We propose the design of a novel linear algebra core and demonstrate that it can achieve orders-of-magnitude improvements in efficiency for matrix-matrix multiplication, an operation that is indicative of a broad class of matrix computations. A feasibility study shows that 47 double- and 104 single-precision GFLOPS/W can be achieved at 19.5 and 15.6 GFLOPS/mm², respectively, with current components and standard 45 nm technology.
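The core's GEMM mapping can be pictured as a sequence of rank-1 updates over a 2D grid of MAC units. The sketch below is an illustrative software model; the mesh size NR and inner dimension KC are assumptions, not parameters from the paper:

```c
#define NR 4   /* illustrative 4x4 mesh of MAC processing elements */
#define KC 8   /* illustrative inner (rank) dimension               */

/* One rank-1 update: each PE (i,j) receives the broadcast pair
 * (a[i], b[j]) and fires one multiply-accumulate into its locally
 * stored C element; in hardware all NR*NR MACs run in parallel. */
static void rank1_update(double C[NR][NR],
                         const double a[NR], const double b[NR])
{
    for (int i = 0; i < NR; i++)
        for (int j = 0; j < NR; j++)
            C[i][j] += a[i] * b[j];
}

/* C := C + A*B realized as KC successive rank-1 updates: step p
 * broadcasts column p of A down the rows and row p of B across
 * the columns of the mesh. */
void gemm_by_rank1(double C[NR][NR],
                   const double A[NR][KC], const double B[KC][NR])
{
    for (int p = 0; p < KC; p++) {
        double a[NR], b[NR];
        for (int i = 0; i < NR; i++) a[i] = A[i][p];
        for (int j = 0; j < NR; j++) b[j] = B[p][j];
        rank1_update(C, a, b);
    }
}
```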
IEEE Design & Test of Computers | 2017
Ardavan Pedram; Stephen Richardson; Mark Horowitz; Sameh Galal; Shahar Kvatinsky
Unlike traditional dark silicon works that attack the computing logic, this article focuses on the memory system, which dissipates most of the energy for memory-bound CPU applications. It discusses the dark memory state, presents Pareto curves for compute units, accelerators, and on-chip memory, and motivates the need for HW/SW codesign for parallelism and locality. –Muhammad Shafique, Vienna University of Technology
Application-Specific Systems, Architectures and Processors | 2013
Ardavan Pedram; John D. McCalpin; Andreas Gerstlauer
This paper considers the modifications required to transform a highly efficient, specialized linear algebra core into an efficient engine for computing Fast Fourier Transforms (FFTs). We review the minimal changes required to support Radix-4 FFT computations and propose extensions to the micro-architecture of the baseline linear algebra core. Along the way, we study the critical differences between the two classes of algorithms. Special attention is paid to the configuration of the on-chip memory system to support high utilization. We examine design trade-offs between efficiency, specialization and flexibility, and their effects both on the core and memory hierarchy for a unified design as compared to dedicated accelerators for each application. The final design is a flexible architecture that can perform both classes of applications. Results show that the proposed hybrid FFT/Linear Algebra core can achieve 26.6 GFLOPS with a power efficiency of 40 GFLOPS/W, which is up to 100× and 40× more energy efficient than cutting-edge CPUs and GPUs, respectively.
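The building block the extended core must add for Radix-4 FFTs is the 4-point butterfly. Below is a plain-C sketch of one forward DIT butterfly; the caller is assumed to supply the twiddle factors w1..w3:

```c
#include <complex.h>

/* One radix-4 DIT butterfly for a forward FFT: twiddle the three
 * non-trivial inputs, then combine them with the 4-point DFT
 * (powers of -j). x[] is overwritten with the four outputs. */
void radix4_butterfly(double complex x[4],
                      double complex w1, double complex w2, double complex w3)
{
    double complex a = x[0];
    double complex b = w1 * x[1];
    double complex c = w2 * x[2];
    double complex d = w3 * x[3];

    x[0] = a +   b + c +   d;   /* W_4^0 row */
    x[1] = a - I*b - c + I*d;   /* W_4^1 = -j */
    x[2] = a -   b + c -   d;   /* W_4^2 = -1 */
    x[3] = a + I*b - c - I*d;   /* W_4^3 =  j */
}
```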
Symposium on Computer Architecture and High Performance Computing | 2012
Ardavan Pedram; Andreas Gerstlauer; Robert A. van de Geijn
Reducing power consumption and increasing efficiency is a key concern for many applications. How to design highly efficient computing elements while maintaining enough flexibility within a domain of applications is a fundamental question. In this paper, we present how broadcast buses can eliminate the use of power-hungry multi-ported register files in the context of data-parallel hardware accelerators for linear algebra operations. We demonstrate an algorithm/architecture co-design for the mapping of different collective communication operations, which are crucial for achieving performance and efficiency in most linear algebra routines, such as GEMM, SYRK, and matrix transposition. We compare a broadcast-bus-based architecture with conventional SIMD, 2D-SIMD, and flat register-file architectures for these operations in terms of area and energy efficiency. Results show that the fast broadcast data movement capabilities of a prototypical linear algebra core can achieve up to 75× better power efficiency and up to 10× better area efficiency than traditional SIMD architectures.
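The collective communication operations mapped onto buses here can be modeled in software as in the sketch below. The 4×4 grid, the single register shown per PE, and the diagonal bus bridge used for transposition are illustrative assumptions, not the paper's design:

```c
#define N 4  /* illustrative 4x4 grid of PEs, one register shown per PE */

/* Row broadcast over the shared row bus: the PE in column `src` of
 * each row drives the bus and every PE in the row latches the value,
 * i.e. one bus cycle per row instead of N register-file reads. */
void row_broadcast(double reg[N][N], int src)
{
    for (int i = 0; i < N; i++) {
        double bus = reg[i][src];   /* value driven on row bus i */
        for (int j = 0; j < N; j++)
            reg[i][j] = bus;        /* all PEs in the row latch it */
    }
}

/* Bus-based transpose, modeled step by step: in step k, PE (i,k)
 * drives row bus i, a diagonal PE (i,i) bridges row bus i onto
 * column bus i, and PE (k,i) latches it, so element (i,k) lands
 * at (k,i) without any multi-ported register file. */
void transpose_via_buses(const double reg[N][N], double out[N][N])
{
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            out[k][i] = reg[i][k];
}
```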
Application-Specific Systems, Architectures and Processors | 2012
Ardavan Pedram; Syed Zohaib Gilani; Nam Sung Kim; Robert A. van de Geijn; Michael J. Schulte; Andreas Gerstlauer
Reducing power consumption and increasing efficiency is a key concern for many applications. It is well accepted that specialization and heterogeneity are crucial strategies to improve both power and performance. Yet, how to design highly efficient processing elements while maintaining enough flexibility within a domain of applications is a fundamental question. In this paper, we present the design of a specialized Linear Algebra Core (LAC) for an important class of computational kernels, the level-3 Basic Linear Algebra Subprograms (BLAS). We demonstrate a detailed algorithm/architecture co-design for mapping a number of level-3 BLAS operations onto the LAC. Results show that our prototype LAC achieves a performance of around 64 GFLOPS (double precision) for these operations, while consuming less than 1.3 W in standard 45 nm CMOS technology. This is on par with a full-custom design and up to 50× and 10× better in terms of power efficiency than CPUs and GPUs, respectively.
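Among the level-3 BLAS operations mapped onto the LAC, SYRK illustrates how the GEMM machinery is reused with a triangular result. A reference sketch of the operation itself (not the LAC's actual dataflow):

```c
#include <stddef.h>

/* Symmetric rank-k update, lower triangle: C := alpha*A*A^T + beta*C
 * for row-major C (n x n) and A (n x k). Only the lower triangle of
 * C is computed, which is what distinguishes the SYRK mapping from a
 * plain GEMM over the same operands. */
void dsyrk_lower(size_t n, size_t k, double alpha,
                 const double *A, double beta, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j <= i; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i*k + p] * A[j*k + p];
            C[i*n + j] = alpha * acc + beta * C[i*n + j];
        }
}
```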
Symposium on Computer Arithmetic | 2013
Ardavan Pedram; Andreas Gerstlauer; Robert A. van de Geijn
This paper examines the mapping of algorithms encountered when solving dense linear systems and linear least-squares problems to a custom Linear Algebra Processor. Specifically, the focus is on Cholesky, LU (with partial pivoting), and QR factorizations. As part of the study, we expose the benefits of redesigning floating point units and their surrounding data-paths to support these complicated operations. We show how adding moderate complexity to the architecture greatly alleviates complexities in the algorithm. We study design trade-offs and the effectiveness of architectural modifications to demonstrate that we can improve power and performance efficiency to a level that can otherwise only be expected of full-custom ASIC designs. A feasibility study shows that our extensions to the MAC units can double the speed of required vector-norm operations while reducing energy by 60%. Similarly, up to 20% speedup with 15% savings in energy can be achieved for LU factorization. We show how such efficiency is maintained even in the complex inner kernels of these operations.
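The vector-norm operation that the extended MAC units accelerate can be written, in conventional software, as an overflow-safe scaled sum of squares in the style of BLAS dnrm2. The sketch below is the textbook algorithm, not the paper's datapath:

```c
#include <math.h>
#include <stddef.h>

/* Overflow-safe 2-norm: maintain a running (scale, ssq) pair so that
 * scale^2 * ssq equals the sum of squares seen so far. The repeated
 * square/accumulate/rescale steps are the pattern a fused MAC
 * datapath can absorb. */
double vec_norm2(size_t n, const double *x)
{
    double scale = 0.0, ssq = 1.0;
    for (size_t i = 0; i < n; i++) {
        double ax = fabs(x[i]);
        if (ax == 0.0) continue;
        if (scale < ax) {              /* rescale around the new maximum */
            double r = scale / ax;
            ssq = 1.0 + ssq * r * r;
            scale = ax;
        } else {
            double r = ax / scale;
            ssq += r * r;              /* one multiply-accumulate per element */
        }
    }
    return scale * sqrt(ssq);
}
```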
IEEE Transactions on Computers | 2014
Ardavan Pedram; Andreas Gerstlauer; Robert A. van de Geijn
This paper examines the mapping of algorithms encountered when solving dense linear systems and linear least-squares problems to a custom Linear Algebra Processor. Specifically, the focus is on Cholesky, LU (with partial pivoting), and QR factorizations and their blocked algorithms. As part of the study, we expose the benefits of redesigning floating point units and their surrounding data-paths to support these complicated operations. We show how adding moderate complexity to the architecture greatly alleviates complexities in the algorithm. We study design tradeoffs and the effectiveness of architectural modifications to demonstrate that we can improve power and performance efficiency to a level that can otherwise only be expected of full-custom ASIC designs. A feasibility study of inner kernels is extended to blocked level and shows that, at block level, the Linear Algebra Core (LAC) can achieve high efficiencies with up to 45 GFLOPS/W for both Cholesky and LU factorization, and over 35 GFLOPS/W for QR factorization. While maintaining such efficiencies, our extensions to the MAC units can achieve up to 10, 12, and 20 percent speedup for the blocked algorithms of Cholesky, LU, and QR factorization, respectively.
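The blocked algorithms evaluated here share a common structure: factor a diagonal block, solve a panel against it, and apply a rank-k update to the trailing matrix. A minimal right-looking blocked Cholesky in C illustrates this decomposition; N and NB are illustrative sizes, NB is assumed to divide N, and the input is assumed symmetric positive definite:

```c
#include <math.h>

#define N  8   /* illustrative problem size  */
#define NB 4   /* illustrative block size, assumed to divide N */

/* Unblocked Cholesky of the nb x nb diagonal block at A[r][r],
 * lower triangle. */
static void chol_unb(double A[N][N], int r, int nb)
{
    for (int j = r; j < r + nb; j++) {
        for (int p = r; p < j; p++)
            A[j][j] -= A[j][p] * A[j][p];
        A[j][j] = sqrt(A[j][j]);
        for (int i = j + 1; i < r + nb; i++) {
            for (int p = r; p < j; p++)
                A[i][j] -= A[i][p] * A[j][p];
            A[i][j] /= A[j][j];
        }
    }
}

/* Right-looking blocked Cholesky: factor the diagonal block, solve
 * the panel below it (TRSM), then apply a SYRK update to the
 * trailing matrix -- the three inner kernels the core executes. */
void chol_blocked(double A[N][N])
{
    for (int k = 0; k < N; k += NB) {
        chol_unb(A, k, NB);
        /* TRSM: A21 := A21 * L11^{-T} */
        for (int i = k + NB; i < N; i++)
            for (int j = k; j < k + NB; j++) {
                for (int p = k; p < j; p++)
                    A[i][j] -= A[i][p] * A[j][p];
                A[i][j] /= A[j][j];
            }
        /* SYRK: A22 := A22 - A21 * A21^T, lower triangle only */
        for (int i = k + NB; i < N; i++)
            for (int j = k + NB; j <= i; j++)
                for (int p = k; p < k + NB; p++)
                    A[i][j] -= A[i][p] * A[j][p];
    }
}
```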
International Symposium on Computer Architecture | 2017
Raghu Prabhakar; Yaqi Zhang; David Koeplinger; Matthew Feldman; Tian Zhao; Stefan Hadjis; Ardavan Pedram; Christos Kozyrakis; Kunle Olukotun
Reconfigurable architectures have gained popularity in recent years as they allow the design of energy-efficient accelerators. Fine-grain fabrics (e.g. FPGAs) have traditionally suffered from performance and power inefficiencies due to bit-level reconfigurable abstractions. Both fine-grain and coarse-grain architectures (e.g. CGRAs) traditionally require low-level programming and suffer from long compilation times. We address both challenges with Plasticine, a new spatially reconfigurable architecture designed to efficiently execute applications composed of parallel patterns. Parallel patterns have emerged from recent research on parallel programming as powerful, high-level abstractions that can elegantly capture data locality, memory access patterns, and parallelism across a wide range of dense and sparse applications. We motivate Plasticine by first observing key application characteristics captured by parallel patterns that are amenable to hardware acceleration, such as hierarchical parallelism, data locality, memory access patterns, and control flow. Based on these observations, we architect Plasticine as a collection of Pattern Compute Units and Pattern Memory Units. Pattern Compute Units are multi-stage pipelines of reconfigurable SIMD functional units that can efficiently execute nested patterns. Data locality is exploited in Pattern Memory Units using banked scratchpad memories and configurable address decoders. Multiple on-chip address generators and scatter-gather engines make efficient use of DRAM bandwidth by supporting a large number of outstanding memory requests, memory coalescing, and burst mode for dense accesses. Plasticine has an area footprint of 113 mm² in a 28 nm process, and consumes a maximum power of 49 W at a 1 GHz clock. Using a cycle-accurate simulator, we demonstrate that Plasticine provides an improvement of up to 76.9× in performance per watt over a conventional FPGA across a wide range of dense and sparse applications.
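The parallel patterns Plasticine targets can be illustrated with a small C sketch: map and reduce written as higher-order functions and composed into a sum of squares, the kind of pipeline a spatial compiler would lower onto Pattern Compute Units. All names here are illustrative:

```c
#include <stddef.h>

/* Two canonical parallel patterns written as higher-order functions.
 * A spatial compiler would pipeline the map across SIMD lanes and
 * fuse the reduce into a tree; here they are plain sequential loops. */
typedef double (*map_fn)(double);
typedef double (*reduce_fn)(double, double);

void map(size_t n, const double *in, double *out, map_fn f)
{
    for (size_t i = 0; i < n; i++) out[i] = f(in[i]);
}

double reduce(size_t n, const double *in, double init, reduce_fn f)
{
    double acc = init;
    for (size_t i = 0; i < n; i++) acc = f(acc, in[i]);
    return acc;
}

/* Example composition: sum of squares as reduce(+) over map(square). */
static double square(double x) { return x * x; }
static double add(double a, double b) { return a + b; }

double sum_of_squares(size_t n, const double *x, double *tmp)
{
    map(n, x, tmp, square);
    return reduce(n, tmp, 0.0, add);
}
```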