Amir Yazdanbakhsh
Georgia Institute of Technology
Publications
Featured research published by Amir Yazdanbakhsh.
International Symposium on Computer Architecture | 2014
Renée St. Amant; Amir Yazdanbakhsh; Jongse Park; Bradley Thwaites; Hadi Esmaeilzadeh; Arjang Hassibi; Luis Ceze; Doug Burger
As improvements in per-transistor speed and energy efficiency diminish, radical departures from conventional approaches are becoming critical to improving the performance and energy efficiency of general-purpose processors. We propose a solution, spanning circuit to compiler, that enables general-purpose use of limited-precision analog hardware to accelerate "approximable" code: code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an "analog" neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.7× and energy savings of 6.3× with quality loss less than 10% for all except one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial over a range of approximation-tolerant, emerging applications including financial analysis, signal processing, robotics, 3D gaming, compression, and image processing.
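The paper's central transformation, compiling an approximable code region into a limited-precision neural surrogate, can be sketched in software. Below is a minimal illustration, not the paper's compiler: a toy function stands in for the approximable region, a small MLP is trained to mimic it, and the weights are then quantized to mimic the analog substrate's limited precision. All names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an "approximable" region of code: any pure function
# whose outputs can tolerate small errors (here, a smooth 2-input kernel).
def approximable_region(x):
    return np.sin(x[:, 0]) * np.cos(x[:, 1])

X = rng.uniform(-1, 1, size=(4096, 2))
y = approximable_region(X)[:, None]

# Tiny 2-8-1 MLP; the paper's compiler chooses and trains the topology
# automatically, so this manual training loop is only illustrative.
W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)

for _ in range(2000):
    h = np.tanh(X @ W1 + b1)                  # hidden layer
    out = h @ W2 + b2                         # linear output
    err = out - y
    # Backpropagation of the mean-squared error with a fixed step size.
    gW2 = h.T @ err / len(X); gb2 = err.mean(0)
    gh = (err @ W2.T) * (1 - h ** 2)
    gW1 = X.T @ gh / len(X); gb1 = gh.mean(0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= 0.5 * g

# Mimic the analog substrate's restricted range and limited precision by
# clipping and quantizing the trained weights to a few bits.
def quantize(w, bits=6, lim=2.0):
    step = 2 * lim / (2 ** bits - 1)
    return np.clip(np.round(w / step) * step, -lim, lim)

h = np.tanh(X @ quantize(W1) + b1)
pred = h @ quantize(W2) + b2
print("mean abs error of quantized surrogate:", np.abs(pred - y).mean())
```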
Design, Automation, and Test in Europe | 2015
Amir Yazdanbakhsh; Divya Mahajan; Bradley Thwaites; Jongse Park; Anandhavel Nagendrakumar; Sindhuja Sethuraman; Kartik Ramkrishnan; Nishanthi Ravindran; Rudra Jariwala; Abbas Rahimi; Hadi Esmaeilzadeh; Kia Bazargan
Relaxing the traditional abstraction of "near-perfect" accuracy in hardware design can lead to significant gains in energy efficiency, area, and performance. To exploit this opportunity, there is a need for design abstractions that can systematically incorporate approximation in hardware design. We introduce Axilog, a set of language annotations that provides the necessary syntax and semantics for approximate hardware design and reuse in Verilog. Axilog enables the designer to relax the accuracy requirements in certain parts of the design while keeping the critical parts strictly precise. Axilog is coupled with a Relaxability Inference Analysis that automatically infers the relaxable gates and connections from the designer's annotations. The analysis provides formal safety guarantees that approximation will only affect the parts that the designer intended to approximate, referred to as relaxable elements. Finally, the paper describes a synthesis flow that approximates only the relaxable elements. Axilog enables applying approximation in the synthesis process while abstracting away the details of approximate synthesis from the designer. We evaluate Axilog, its analysis, and the synthesis flow using a diverse set of benchmark designs. The results show that the intuitive nature of the language extensions, coupled with the automated analysis, enables safe approximation of designs even with thousands of lines of code. Applying our approximate synthesis flow to these designs yields, on average, 54% energy savings and 1.9× area reduction with 10% output quality loss.
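The safety analysis lends itself to a small graph sketch. Assuming Axilog-style semantics, a gate is safely relaxable only if every primary output it can reach was annotated as relaxable by the designer; the netlist, names, and Python rendering below are illustrative (the real analysis operates on annotated Verilog).

```python
# Toy gate netlist: gate -> gates/outputs it drives.
netlist = {
    "g1": ["g3"],
    "g2": ["g3", "g4"],
    "g3": ["out_approx"],
    "g4": ["out_precise"],
}
relaxable_outputs = {"out_approx"}             # designer annotation
all_outputs = {"out_approx", "out_precise"}

def reachable_outputs(gate, netlist, outputs, seen=None):
    """All primary outputs transitively driven by `gate`."""
    seen = set() if seen is None else seen
    found = set()
    for succ in netlist.get(gate, []):
        if succ in outputs:
            found.add(succ)
        elif succ not in seen:
            seen.add(succ)
            found |= reachable_outputs(succ, netlist, outputs, seen)
    return found

# A gate is relaxable only if ALL outputs it reaches are relaxable,
# which is the safety guarantee the analysis provides.
relaxable_gates = {
    g for g in netlist
    if reachable_outputs(g, netlist, all_outputs) <= relaxable_outputs
}
print(relaxable_gates)   # {'g1', 'g3'}: g2 also feeds the precise output
```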
International Symposium on Microarchitecture | 2015
Amir Yazdanbakhsh; Jongse Park; Hardik Sharma; Pejman Lotfi-Kamran; Hadi Esmaeilzadeh
Graphics Processing Units (GPUs) can accelerate diverse classes of applications, such as recognition, gaming, data analytics, weather prediction, and multimedia. Many of these applications are amenable to approximate execution. This application characteristic provides an opportunity to improve GPU performance and efficiency. Among approximation techniques, neural accelerators have been shown to provide significant performance and efficiency gains when augmenting CPU processors. However, the integration of neural accelerators within a GPU processor has remained unexplored. GPUs are, in a sense, many-core accelerators that exploit large degrees of data-level parallelism in applications through the SIMT execution model. This paper aims to harmoniously bring neural and GPU accelerators together without hindering SIMT execution or adding excessive hardware overhead. We introduce a low-overhead neurally accelerated architecture for GPUs, called NGPU, that enables scalable integration of neural accelerators for a large number of GPU cores. This work also devises a mechanism that controls the tradeoff between the quality of results and the benefits from neural acceleration. Compared to the baseline GPU architecture, cycle-accurate simulation results for NGPU show a 2.4× average speedup and a 2.8× average energy reduction within a 10% quality-loss margin across a diverse set of benchmarks. The proposed quality control mechanism retains a 1.9× average speedup and a 2.1× energy reduction while reducing the degradation in the quality of results to 2.5%. These benefits are achieved with less than 1% area overhead.
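The quality-control idea, routing some invocations to the neural surrogate and the rest to precise execution, can be sketched as follows. The kernels, error model, and knob below are illustrative stand-ins, not NGPU's hardware mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in kernels: `precise` is the original per-thread computation,
# `neural` a cheaper surrogate with bounded error (both illustrative).
def precise(x):
    return np.sqrt(x)

def neural(x):
    return np.sqrt(x) + rng.normal(0, 0.02, x.shape)   # surrogate error

def run_kernel(x, approx_fraction):
    """Route a fraction of threads to the surrogate; the rest stay
    precise. `approx_fraction` plays the role of the quality knob."""
    mask = rng.random(x.shape) < approx_fraction
    return np.where(mask, neural(x), precise(x))

x = rng.uniform(0, 1, 10_000)
ref = precise(x)
for knob in (1.0, 0.5, 0.0):
    err = np.abs(run_kernel(x, knob) - ref).mean()
    print(f"knob={knob:.1f}  mean quality loss={err:.4f}")
```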
High-Performance Computer Architecture | 2016
Divya Mahajan; Jongse Park; Emmanuel Amaro; Hardik Sharma; Amir Yazdanbakhsh; Joon Kyung Kim; Hadi Esmaeilzadeh
A growing number of commercial and enterprise systems increasingly rely on compute-intensive Machine Learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms, representing an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for one machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as a stochastic optimization problem. Therefore, learning becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function over the training data. The gradient descent solver is fixed while the objective function changes for different learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. Therefore, a developer can specify the learning task by only expressing the gradient of the objective function in our high-level language. TABLA then automatically generates the synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use TABLA to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform. We rigorously compare the benefits of FPGA acceleration to multicore CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4× and 2.9× average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57×, 20.2×, and 33.4× higher performance-per-Watt in comparison to Tegra, GTX 650 Ti, and Tesla, respectively. These benefits are achieved while the programmer writes less than 50 lines of code.
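TABLA's programming model is concrete enough to sketch in software: the stochastic-gradient-descent solver is a fixed template, and the developer supplies only the gradient of the objective. Below is a minimal illustration with logistic regression as the example learning task; the function names and this pure-software rendering are assumptions, since the real framework emits FPGA accelerators from hand-optimized templates.

```python
import numpy as np

def sgd(grad_fn, w0, data, lr=0.1, epochs=20):
    """Fixed template: stochastic gradient descent over the training data."""
    w = w0.copy()
    for _ in range(epochs):
        for x, y in data:
            w -= lr * grad_fn(w, x, y)
    return w

# Developer-supplied piece: the gradient of one task's objective
# (here, the cross-entropy gradient for logistic regression).
def logistic_grad(w, x, y):
    return (1.0 / (1.0 + np.exp(-w @ x)) - y) * x

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
Y = (X @ true_w > 0).astype(float)

w = sgd(logistic_grad, np.zeros(2), list(zip(X, Y)))
print("learned direction:", w / np.linalg.norm(w))
```

Swapping `logistic_grad` for another gradient (linear regression, SVM, and so on) changes the learning task without touching the solver, which is the commonality TABLA's templates exploit.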
International Conference on Parallel Architectures and Compilation Techniques | 2014
Bradley Thwaites; Gennady Pekhimenko; Hadi Esmaeilzadeh; Amir Yazdanbakhsh; Jongse Park; Girish Mururu; Onur Mutlu; Todd C. Mowry
This paper demonstrates how to utilize the inherent error resilience of a wide range of applications to mitigate the memory wall, the discrepancy between core and memory speed. We define a new microarchitecturally triggered approximation technique called rollback-free value prediction. This technique predicts the value of safe-to-approximate loads when they miss in the cache without tracking mispredictions or requiring costly recovery from misspeculations. It mitigates the memory wall by allowing the core to continue computation without stalling for long-latency memory accesses. Our detailed study of the quality trade-offs shows that with a modern out-of-order processor, an average 8% (up to 19%) performance improvement is possible with 0.8% (up to 1.8%) average quality loss on an approximable subset of SPEC CPU 2000/2006.
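The mechanism can be sketched with a toy last-value predictor: on a cache miss to a safe-to-approximate load, the core consumes a predicted value and never verifies it, so no misprediction tracking or rollback machinery exists. The cache and predictor models below are illustrative, not the paper's microarchitecture.

```python
cache = {}            # address -> value (toy cache, no eviction)
last_value = {}       # load PC -> last value seen: the predictor state

def load(pc, addr, memory, safe_to_approximate):
    if addr in cache:                        # hit: normal path
        value = cache[addr]
    elif safe_to_approximate and pc in last_value:
        value = last_value[pc]               # miss: predict and continue;
                                             # never checked, never rolled back
    else:
        value = memory[addr]                 # miss: stall for memory
        cache[addr] = value
    last_value[pc] = value
    return value

# Mostly constant data, so last-value prediction is usually right.
memory = {a: 0 for a in range(16)}
memory[5] = 9                                # the one value it gets wrong
trace = [(0x40, a, True) for a in range(16)] # one approximable load PC
values = [load(pc, addr, memory, safe) for pc, addr, safe in trace]
wrong = sum(v != memory[a] for v, (_, a, _) in zip(values, trace))
print(f"{wrong}/{len(trace)} loads consumed a wrong value, with no rollback")
```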
IEEE Design & Test of Computers | 2017
Amir Yazdanbakhsh; Divya Mahajan; Hadi Esmaeilzadeh; Pejman Lotfi-Kamran
Approximate computing is claimed to be a powerful knob for alleviating peak-power and energy-efficiency issues. However, providing a consistent benchmark suite with diverse applications amenable to approximate computing is crucial to ensure fair and reproducible comparisons. This article makes an important attempt toward that goal in the form of the AxBench suite, which contains applications for CPUs, GPUs, and hardware design with the necessary annotations to mark the approximable regions and output quality metrics. —Muhammad Shafique, Vienna University of Technology
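As a flavor of the output quality metrics AxBench attaches to each benchmark, here is a generic relative-error metric comparing precise and approximate outputs. The exact metric varies per application (image difference for image benchmarks, for instance), so this form is only illustrative.

```python
def quality_loss(precise_out, approx_out):
    """Average relative error between precise and approximate outputs."""
    assert len(precise_out) == len(approx_out)
    total = sum(abs(p - a) / (abs(p) + 1e-9)       # guard against p == 0
                for p, a in zip(precise_out, approx_out))
    return total / len(precise_out)

precise = [1.00, 2.00, 3.00, 4.00]
approx  = [1.02, 1.95, 3.10, 3.90]
print(f"quality loss: {quality_loss(precise, approx):.2%}")
```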
ACM Transactions on Architecture and Code Optimization | 2016
Amir Yazdanbakhsh; Gennady Pekhimenko; Bradley Thwaites; Hadi Esmaeilzadeh; Onur Mutlu; Todd C. Mowry
This article aims to tackle two fundamental memory bottlenecks: limited off-chip bandwidth (bandwidth wall) and long access latency (memory wall). To achieve this goal, our approach exploits the inherent error resilience of a wide range of applications. We introduce an approximation technique called Rollback-Free Value Prediction (RFVP). When certain safe-to-approximate load operations miss in the cache, RFVP predicts the requested values. However, RFVP does not check for or recover from load-value mispredictions, thus avoiding the high cost of pipeline flushes and re-executions. RFVP mitigates the memory wall by enabling execution to continue without stalling for long-latency memory accesses. To mitigate the bandwidth wall, RFVP drops a fraction of load requests that miss in the cache after predicting their values. Dropping requests reduces memory bandwidth contention by removing them from the system. The drop rate is a knob to control the trade-off between performance/energy efficiency and output quality. Our extensive evaluations show that RFVP, when used in GPUs, yields significant performance improvement and energy reduction for a wide range of quality-loss levels. We also evaluate RFVP's latency benefits for a single-core CPU. The results show performance improvement and energy reduction for a wide variety of applications with less than 1% loss in quality.
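The drop-rate knob can be sketched on top of the same last-value prediction idea: a tunable fraction of predicted misses is dropped outright, so those requests never consume bandwidth, at the cost of occasional wrong values. The toy memory model below is an assumption, not the paper's GPU setup.

```python
import random

random.seed(0)

def run(trace, memory, drop_rate):
    """Replay a load trace; drop a fraction of requests after predicting
    their values with a last-value predictor. Returns (requests, errors)."""
    last, requests, errors = None, 0, 0
    for addr in trace:
        actual = memory[addr]
        if last is not None and random.random() < drop_rate:
            value = last                 # predict and DROP the request:
                                         # no bandwidth consumed at all
        else:
            value = actual               # fetch from memory
            requests += 1
        errors += value != actual
        last = value
    return requests, errors

memory = [v // 8 for v in range(64)]     # slowly varying values
trace = list(range(64))
for knob in (0.0, 0.25, 0.5, 0.75):
    reqs, errs = run(trace, memory, knob)
    print(f"drop_rate={knob:.2f}  memory requests={reqs:2d}  wrong values={errs}")
```

Raising the knob trades output quality for bandwidth, which is exactly the performance/quality control the article describes.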
IEEE Design & Test of Computers | 2016
Amir Yazdanbakhsh; Bradley Thwaites; Hadi Esmaeilzadeh; Gennady Pekhimenko; Onur Mutlu; Todd C. Mowry
Limited memory bandwidth and long access latency are serious GPU performance bottlenecks. For emerging applications that are inherently error resilient, this paper proposes predicting the values of safe-to-approximate loads and dropping a fraction of the cache misses whose values have been predicted, improving performance and energy efficiency.
Design and Diagnostics of Electronic Circuits and Systems | 2010
Ali Azarpeyvand; Mostafa E. Salehi; Farshad Firouzi; Amir Yazdanbakhsh; Sied Mehdi Fakhraie
Advances in silicon technology and the shrinking of feature sizes to the nanometer scale make the unreliability of nanoscale devices the foremost concern of fault-tolerant design. Soft-error analysis has been greatly aided by the concepts of architectural vulnerability factor (AVF) and architecturally correct execution (ACE). In this work, we exploit the techniques of AVF analysis to introduce an instruction-level vulnerability metric for software reliability analysis. The proposed metric can be used to make judgments about the reliability of different programs on different processors with regard to architectural and compiler guidelines for improving processor reliability.
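An AVF-style instruction-level metric can be sketched as the fraction of a dynamic instruction's bit-cycles that are ACE across the structures it occupies. The structures, weighting, and numbers below are illustrative, not the paper's exact formulation.

```python
def instruction_vulnerability(residency, ace_cycles):
    """residency: structure -> (bits, total cycles the value resides there)
    ace_cycles: structure -> cycles during which those bits are ACE,
    i.e., required for architecturally correct execution."""
    vulnerable = sum(bits * ace_cycles[s]
                     for s, (bits, _) in residency.items())
    total = sum(bits * cycles for bits, cycles in residency.values())
    return vulnerable / total

# Toy numbers for one ADD instruction flowing through two structures.
residency = {"issue_queue": (64, 10), "reorder_buffer": (64, 30)}
ace = {"issue_queue": 4, "reorder_buffer": 12}
print(f"ILV = {instruction_vulnerability(residency, ace):.2f}")  # 0.40
```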
International Symposium on Computer Architecture | 2016
Divya Mahajan; Amir Yazdanbakhsh; Jongse Park; Bradley Thwaites; Hadi Esmaeilzadeh
Conventionally, an approximate accelerator replaces every invocation of a frequently executed region of code without considering the final quality degradation. However, there is a vast decision space in which each invocation can either be delegated to the accelerator, improving performance and efficiency, or run on the precise core, maintaining quality. In this paper we introduce MITHRA, a co-designed hardware-software solution that navigates these tradeoffs to deliver high performance and efficiency while limiting the final quality loss. MITHRA seeks to identify whether each individual accelerator invocation will lead to an undesirable quality loss and, if so, directs the processor to run the original precise code. This identification is cast as a binary classification task that requires a cohesive co-design of hardware and software. The hardware component performs the classification at runtime and exposes a knob to the software mechanism to control quality tradeoffs. The software tunes this knob by solving a statistical optimization problem that maximizes the benefits from approximation while providing statistical guarantees that the final quality level will be met with high confidence. The software uses this knob to tune and train the hardware classifiers. We devise two distinct hardware classifiers, one table-based and one neural-network-based. To understand the efficacy of these mechanisms, we compare them with an ideal but infeasible design, the oracle. Results show that, with 95% confidence, the table-based design can restrict the final output quality loss to 5% for 90% of unseen input sets while providing 2.5× speedup and 2.6× energy efficiency. The neural design shows a similar speedup; however, it improves the efficiency by 13%. Compared to the table-based design, the oracle improves speedup by 26% and efficiency by 36%. These results show that MITHRA performs within a close range of the oracle and can effectively navigate the quality tradeoffs in approximate acceleration.
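MITHRA's flow can be sketched end to end: profile invocations offline to fill a small table-based classifier, predict each invocation's quality loss at runtime, and let software sweep the knob on held-out inputs to check that accelerated invocations stay under the quality target. The hashing scheme, features, and tuning rule below are illustrative stand-ins for the paper's co-designed mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
TABLE_BITS = 6

def table_index(feature):
    """Hash an input feature into a small table, as hardware might."""
    return hash(round(feature, 2)) % (1 << TABLE_BITS)

# Offline: profile training invocations to record the observed quality
# loss per table bucket (unseen buckets default to 0, a sketch shortcut).
train_x = rng.uniform(0, 1, 5000)
train_loss = np.abs(np.sin(8 * train_x)) * 0.1      # toy per-invocation loss
table = np.zeros(1 << TABLE_BITS)
counts = np.zeros(1 << TABLE_BITS)
for x, l in zip(train_x, train_loss):
    i = table_index(x)
    table[i] += l
    counts[i] += 1
table = np.divide(table, np.maximum(counts, 1))

def invoke(x, knob):
    """Runtime: approximate only if the predicted loss clears the knob."""
    return "accelerator" if table[table_index(x)] <= knob else "precise core"

# Software sweeps the knob on held-out inputs and checks how often the
# accelerated invocations actually stay under a 5% quality-loss target.
test_x = rng.uniform(0, 1, 2000)
test_loss = np.abs(np.sin(8 * test_x)) * 0.1
for knob in (0.02, 0.05, 0.08):
    approx = np.array([invoke(x, knob) == "accelerator" for x in test_x])
    ok = (test_loss[approx] <= 0.05).mean() if approx.any() else 1.0
    print(f"knob={knob:.2f}  accelerated={approx.mean():.0%}  under-target={ok:.0%}")
```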