Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Divya Mahajan is active.

Publication


Featured research published by Divya Mahajan.


Design, Automation & Test in Europe (DATE) | 2015

Axilog: language support for approximate hardware design

Amir Yazdanbakhsh; Divya Mahajan; Bradley Thwaites; Jongse Park; Anandhavel Nagendrakumar; Sindhuja Sethuraman; Kartik Ramkrishnan; Nishanthi Ravindran; Rudra Jariwala; Abbas Rahimi; Hadi Esmaeilzadeh; Kia Bazargan

Relaxing the traditional abstraction of “near-perfect” accuracy in hardware design can lead to significant gains in energy efficiency, area, and performance. To exploit this opportunity, there is a need for design abstractions that can systematically incorporate approximation in hardware design. We introduce Axilog, a set of language annotations that provides the necessary syntax and semantics for approximate hardware design and reuse in Verilog. Axilog enables the designer to relax the accuracy requirements in certain parts of the design while keeping the critical parts strictly precise. Axilog is coupled with a Relaxability Inference Analysis that automatically infers the relaxable gates and connections from the designer's annotations. The analysis provides formal safety guarantees that approximation will only affect the parts that the designer intended to approximate, referred to as relaxable elements. Finally, the paper describes a synthesis flow that approximates only the relaxable elements. Axilog enables applying approximation in the synthesis process while abstracting away the details of approximate synthesis from the designer. We evaluate Axilog, its analysis, and the synthesis flow using a diverse set of benchmark designs. The results show that the intuitive nature of the language extensions, coupled with the automated analysis, enables safe approximation of designs even with thousands of lines of code. Applying our approximate synthesis flow to these designs yields, on average, 54% energy savings and 1.9× area reduction with 10% output quality loss.
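
The paper's safety analysis can be pictured as a reachability check over the netlist: a gate may be approximated only if every primary output in its fan-out cone was annotated as relaxable. The Python below is a minimal sketch of that idea under simplifying assumptions (a DAG netlist given as fan-out edges); it is not Axilog's implementation, which operates on Verilog, and the netlist and names are hypothetical.

```python
# Toy relaxability inference: a gate is safely relaxable only if every
# primary output it can reach was annotated as relaxable. (Sketch only;
# Axilog's real analysis works on Verilog netlists.)

def infer_relaxable(fanout, relaxable_outputs):
    """fanout: dict mapping each gate to the nodes it feeds.
    Nodes absent from `fanout` are treated as primary outputs."""
    reach = {}

    def outputs_reached(node):
        if node not in reach:
            succs = fanout.get(node)
            if not succs:                         # a primary output
                reach[node] = {node}
            else:
                reach[node] = set().union(*(outputs_reached(s) for s in succs))
        return reach[node]

    targets = set(relaxable_outputs)
    return {g for g in fanout if outputs_reached(g) <= targets}

# Hypothetical netlist: g1 feeds both outputs, g2 only the relaxable one.
netlist = {"g1": ["out_precise", "out_relaxed"], "g2": ["out_relaxed"]}
print(infer_relaxable(netlist, ["out_relaxed"]))  # -> {'g2'}
```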


IEEE International Symposium on High-Performance Computer Architecture (HPCA) | 2016

TABLA: A unified template-based framework for accelerating statistical machine learning

Divya Mahajan; Jongse Park; Emmanuel Amaro; Hardik Sharma; Amir Yazdanbakhsh; Joon Kyung Kim; Hadi Esmaeilzadeh

A growing number of commercial and enterprise systems increasingly rely on compute-intensive Machine Learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for a single machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as a stochastic optimization problem. Therefore, learning becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function over the training data. The gradient descent solver is fixed, while the objective function changes for different learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. Therefore, a developer can specify the learning task by only expressing the gradient of the objective function using our high-level language. TABLA then automatically generates the synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use TABLA to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform. We rigorously compare the benefits of FPGA acceleration to multi-core CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4x and 2.9x average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57x, 20.2x, and 33.4x higher Performance-per-Watt in comparison to Tegra, GTX 650 Ti, and Tesla, respectively. These benefits are achieved while the programmers write fewer than 50 lines of code.
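
The division of labor TABLA describes, a fixed gradient-descent solver plus a developer-supplied gradient of the objective, can be mirrored in a few lines of Python. This is only a sketch of the programming model, not TABLA itself (whose input is a high-level DSL compiled to FPGA templates); the linear-regression task below is a hypothetical example.

```python
# Fixed solver: the part TABLA provides as a template.
def sgd(gradient, w, data, lr=0.1, epochs=200):
    """Stochastic gradient descent over (x, y) training pairs."""
    for _ in range(epochs):
        for x, y in data:
            g = gradient(w, x, y)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# What the developer writes: the gradient for one learning task
# (least-squares linear regression, as a hypothetical example).
def linreg_gradient(w, x, y):
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

data = [([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]  # learns y = 1 + 2*x
print(sgd(linreg_gradient, [0.0, 0.0], data))  # ~[1.0, 2.0]
```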


IEEE Design & Test of Computers | 2017

AxBench: A Multiplatform Benchmark Suite for Approximate Computing

Amir Yazdanbakhsh; Divya Mahajan; Hadi Esmaeilzadeh; Pejman Lotfi-Kamran

Approximate computing is claimed to be a powerful knob for alleviating peak-power and energy-efficiency issues. However, providing a consistent benchmark suite with diverse applications amenable to approximate computing is crucial to ensure fair and reproducible comparisons. This article takes an important step toward this goal in the form of the AxBench suite, which contains applications for CPUs, GPUs, and hardware design with the necessary annotations to mark the approximable regions and output quality metrics. —Muhammad Shafique, Vienna University of Technology
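
The two ingredients the suite standardizes, marked approximable regions and an application-level output quality metric, can be pictured with the toy Python below. AxBench itself provides C/C++, CUDA, and Verilog benchmarks; the decorator, kernel, and metric here are invented purely for illustration.

```python
# Hypothetical stand-in for an approximable-region annotation.
def approximable(fn):
    fn.approximable = True        # mark the region as safe to relax
    return fn

@approximable
def kernel(x):
    return x * x                  # region an approximation may relax

def quality_loss(exact, approx):
    """Output quality metric: mean relative error over the outputs."""
    return sum(abs(e - a) / max(abs(e), 1e-12)
               for e, a in zip(exact, approx)) / len(exact)

inputs = [1.5, 2.5, 3.5]
exact = [kernel(x) for x in inputs]
approx = [round(kernel(x)) for x in inputs]   # a trivial "approximation"
print(quality_loss(exact, approx))            # small but nonzero loss
```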


International Symposium on Computer Architecture (ISCA) | 2016

Towards statistical guarantees in controlling quality tradeoffs for approximate acceleration

Divya Mahajan; Amir Yazdanbakhsh; Jongse Park; Bradley Thwaites; Hadi Esmaeilzadeh

Conventionally, an approximate accelerator replaces every invocation of a frequently executed region of code without considering the final quality degradation. However, there is a vast decision space in which each invocation can either be delegated to the accelerator, improving performance and efficiency, or run on the precise core, maintaining quality. In this paper we introduce MITHRA, a co-designed hardware-software solution that navigates these tradeoffs to deliver high performance and efficiency while lowering the final quality loss. MITHRA seeks to identify whether each individual accelerator invocation will lead to an undesirable quality loss and, if so, directs the processor to run the original precise code. This identification is cast as a binary classification task that requires a cohesive co-design of hardware and software. The hardware component performs the classification at runtime and exposes a knob to the software mechanism to control quality tradeoffs. The software tunes this knob by solving a statistical optimization problem that maximizes the benefits from approximation while providing statistical guarantees that the final quality level will be met with high confidence. The software uses this knob to tune and train the hardware classifiers. We devise two distinct hardware classifiers, one table-based and one neural-network-based. To understand the efficacy of these mechanisms, we compare them with an ideal but infeasible design, the oracle. Results show that, with 95% confidence, the table-based design can restrict the final output quality loss to 5% for 90% of unseen input sets while providing 2.5× speedup and 2.6× energy efficiency. The neural design shows a similar speedup; however, it improves efficiency by 13%. Compared to the table-based design, the oracle improves speedup by 26% and efficiency by 36%. These results show that MITHRA performs within a close range of the oracle and can effectively navigate the quality tradeoffs in approximate acceleration.
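
A rough sketch of the decision mechanism is shown below, assuming a toy table indexed by a hashed input signature and a threshold knob: invocations whose predicted quality loss exceeds the knob fall back to the precise core. This is not the paper's exact classifier or its statistical tuning procedure; all names, the signature function, and the example kernels are made up for illustration.

```python
def signature(x):
    return hash(round(x, 1)) % 64            # toy input signature

class TableClassifier:
    """Table-based predictor of per-invocation quality loss."""
    def __init__(self, size=64):
        self.table = [0.0] * size
    def train(self, samples):                # (input, observed loss) pairs
        for x, loss in samples:
            i = signature(x)
            self.table[i] = max(self.table[i], loss)   # be conservative
    def predicted_loss(self, x):
        return self.table[signature(x)]

def invoke(x, clf, knob, accel, precise):
    """Delegate to the accelerator only when predicted loss <= knob."""
    return accel(x) if clf.predicted_loss(x) <= knob else precise(x)

precise = lambda x: x * x
accel = lambda x: x * x * (1.02 if x > 5 else 1.0)     # lossy for large x

clf = TableClassifier()
clf.train([(x, abs(accel(x) - precise(x)) / precise(x))
           for x in [1.0, 2.0, 6.0, 7.0]])
knob = 0.01   # in MITHRA, software tunes this for a statistical guarantee
print(invoke(2.0, clf, knob, accel, precise))   # 4.0, accelerated
print(invoke(6.0, clf, knob, accel, precise))   # 36.0, precise fallback
```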


IEEE/ACM International Symposium on Microarchitecture (MICRO) | 2017

Scale-out acceleration for machine learning

Jongse Park; Hardik Sharma; Divya Mahajan; Joon Kyung Kim; Preston Olds; Hadi Esmaeilzadeh

The growing scale and complexity of Machine Learning (ML) algorithms has resulted in the prevalent use of distributed general-purpose systems. In a rather disjoint effort, the community is focusing mostly on high-performance single-node accelerators for learning. This work bridges these two paradigms and offers CoSMIC, a full computing stack, comprising a language, compiler, system software, template architecture, and circuit generators, that enables programmable acceleration of learning at scale. CoSMIC enables programmers to exploit scale-out acceleration using FPGAs and Programmable ASICs (P-ASICs) from a high-level, mathematical Domain-Specific Language (DSL). Nonetheless, CoSMIC does not require programmers to delve into the onerous task of system software development or hardware design. CoSMIC achieves the three conflicting objectives of efficiency, automation, and programmability by integrating a novel multi-threaded template accelerator architecture and a cohesive stack that generates the hardware and software code from its high-level DSL. CoSMIC can accelerate a wide range of learning algorithms that are most commonly trained using parallel variants of gradient descent. The key is to distribute partial gradient calculations of the learning algorithms across the accelerator-augmented nodes of the scale-out system. Additionally, CoSMIC leverages the parallelizability of the algorithms to offer multi-threaded acceleration within each node. Multi-threading allows CoSMIC to efficiently exploit the numerous resources that are becoming available on modern FPGAs/P-ASICs by striking a balance between multi-threaded parallelism and single-threaded performance. CoSMIC takes advantage of algorithmic properties of ML to offer specialized system software that optimizes task allocation, role assignment, thread management, and internode communication. We evaluate the versatility and efficiency of CoSMIC for 10 different machine learning applications from various domains. On average, a 16-node CoSMIC with UltraScale+ FPGAs offers 18.8× speedup over a 16-node Spark system with Xeon processors while the programmer writes only 22–55 lines of code. CoSMIC offers higher scalability compared to the state-of-the-art Spark: scaling from 4 to 16 nodes with CoSMIC yields 2.7× improvement, whereas Spark offers 1.8×. These results confirm that the full-stack approach of CoSMIC takes an effective and vital step towards enabling scale-out acceleration for machine learning.
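
The computation CoSMIC distributes, per-node partial gradients followed by aggregation, can be sketched as below. Here the "nodes" are simulated as data shards within one process; the actual system generates FPGA/P-ASIC hardware and the accompanying system software, and the linear model is a hypothetical example.

```python
def partial_gradient(w, shard):
    """Each node computes the gradient over its own shard of the data."""
    g = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += err * xj
    return g

def distributed_step(w, shards, lr):
    partials = [partial_gradient(w, s) for s in shards]  # one per node
    total = [sum(col) for col in zip(*partials)]         # aggregate
    return [wi - lr * gi for wi, gi in zip(w, total)]

shards = [[([1.0, 1.0], 3.0)], [([1.0, 2.0], 5.0)]]      # two "nodes"
w = [0.0, 0.0]
for _ in range(300):
    w = distributed_step(w, shards, lr=0.2)
print(w)   # approaches [1.0, 2.0]
```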


Very Large Data Bases (VLDB) | 2018

In-RDBMS hardware acceleration of advanced analytics

Divya Mahajan; Joon Kyung Kim; Jacob Sacks; Adel Ardalan; Arun Kumar; Hadi Esmaeilzadeh

The data revolution is fueled by advances in several areas, including databases, high-performance computer architecture, and machine learning. Although timely, there is a void of solutions that bring these disjoint directions together. This paper sets out to be the initial step towards such a union. The aim is to devise a solution for the in-Database Acceleration of Advanced Analytics (DAnA). DAnA empowers database users to leap beyond traditional data summarization techniques and seamlessly utilize hardware-accelerated machine learning. Deploying specialized hardware, such as FPGAs, for in-database analytics currently requires hand-designing the hardware and manually routing the data. Instead, DAnA automatically maps a high-level specification of in-database analytics queries to an FPGA accelerator. The accelerator implementation is generated from a User Defined Function (UDF), expressed as part of a SQL query in a Python-embedded Domain-Specific Language (DSL). To realize efficient in-database integration, DAnA accelerators contain a novel hardware structure, Striders, that directly interfaces with the buffer pool of the database. DAnA obtains the schema and page-layout information from the database catalog to configure the Striders. In turn, the Striders extract, cleanse, and process the training data tuples, which are consumed by a multi-threaded FPGA engine that executes the analytics algorithm. We integrated DAnA with PostgreSQL to generate hardware accelerators for a range of real-world and synthetic datasets running diverse ML algorithms. Results show that DAnA-enhanced PostgreSQL provides, on average, 11.3x end-to-end speedup over MADlib and is 5.4x faster than multi-threaded MADlib running on Greenplum. DAnA provides these benefits while hiding the complexity of hardware design from data scientists and allowing them to express the algorithm in 30-60 lines of Python.
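
The programming model, a UDF written in a Python-embedded DSL and invoked from a SQL query, might look roughly like the sketch below. The dana_udf decorator, the SQL syntax in the comment, and the table and column names are all invented for illustration; DAnA's actual DSL and PostgreSQL integration differ.

```python
import math

def dana_udf(fn):
    """Hypothetical stand-in for DAnA's UDF registration hook."""
    fn.is_dana_udf = True
    return fn

@dana_udf
def logistic_gradient(w, x, y):
    """Per-tuple gradient of logistic regression: the piece the analyst
    expresses; data access (Striders) and the solver are generated."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

# The accompanying SQL might look roughly like (invented syntax):
#   SELECT * FROM dana_train('logistic_gradient', 'patients', 'label');
print(logistic_gradient([0.0, 0.0], [1.0, 2.0], 1.0))   # [-0.5, -1.0]
```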


International Symposium on Computer Architecture (ISCA) | 2018

RoboX: an end-to-end solution to accelerate autonomous control in robotics

Jacob Sacks; Divya Mahajan; Richard C. Lawson; Hadi Esmaeilzadeh

Novel algorithmic advances have paved the way for robotics to transform the dynamics of many social and enterprise applications. To achieve true autonomy, robots need to continuously process and interact with their environment through computationally intensive motion planning and control algorithms under a low power budget. Specialized architectures offer a potent choice for providing low-power, high-performance accelerators for these algorithms. Instead of taking the traditional route of profiling and mapping hot code regions to accelerators, this paper delves into the algorithmic characteristics of the application domain. We observe that many motion planning and control algorithms are formulated as constrained optimization problems solved online through Model Predictive Control (MPC). While models and objective functions differ between robotic systems and tasks, the structure of the optimization problem and solver remains fixed. Using this theoretical insight, we create RoboX, an end-to-end solution which exposes a high-level domain-specific language to roboticists. This interface allows roboticists to express the physics of the robot and its task in a form close to their concise mathematical expressions. The RoboX backend then automatically maps this high-level specification to a novel programmable architecture, which harbors a programmable memory access engine and compute-enabled interconnects. Hops in the interconnect are augmented with simple functional units that either operate on in-flight data or are bypassed according to a micro-program. Evaluations with six different robotic systems and tasks show that RoboX provides a 29.4X (7.3X) speedup and a 22.1X (79.4X) performance-per-watt improvement over an ARM Cortex A57 (Intel Xeon E3). Compared to GPUs, RoboX attains 7.8X, 65.5X, and 71.8X higher performance-per-watt relative to the Tegra X2, GTX 650 Ti, and Tesla K40, with a power envelope of only 3.4 Watts at 45 nm.
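
The fixed structure that RoboX exploits, a receding-horizon optimization loop with pluggable dynamics and cost, can be sketched as follows. The toy random-shooting solver stands in for a real constrained MPC optimizer and only illustrates the division of labor; the 1-D point-mass model and all parameters are hypothetical.

```python
import random

def mpc_step(state, dynamics, cost, horizon=10, candidates=200):
    """Return the first action of the best sampled action sequence."""
    best_seq, best_cost = None, float("inf")
    for _ in range(candidates):
        seq = [random.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for u in seq:
            s = dynamics(s, u)
            total += cost(s, u)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq[0]

# What the roboticist specifies: dynamics and cost for the robot/task
# (here, a 1-D point mass to be driven to the origin).
dynamics = lambda s, u: (s[0] + 0.1 * s[1], s[1] + 0.1 * u)
cost = lambda s, u: s[0] ** 2 + 0.1 * s[1] ** 2 + 0.01 * u ** 2

state = (1.0, 0.0)
for _ in range(50):                          # closed-loop control
    state = dynamics(state, mpc_step(state, dynamics, cost))
print(state)                                 # near the origin
```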


IEEE/ACM International Symposium on Microarchitecture (MICRO) | 2016

From high-level deep neural models to FPGAs

Hardik Sharma; Jongse Park; Divya Mahajan; Emmanuel Amaro; Joon Kyung Kim; Chenkai Shao; Asit K. Mishra; Hadi Esmaeilzadeh


Architectural Support for Programming Languages and Operating Systems (ASPLOS) | 2016

AxGames: Towards Crowdsourcing Quality Target Determination in Approximate Computing

Jongse Park; Emmanuel Amaro; Divya Mahajan; Bradley Thwaites; Hadi Esmaeilzadeh


Archive | 2016

AxBench: A Benchmark Suite for Approximate Computing Across the System Stack

Amir Yazdanbakhsh; Divya Mahajan; Pejman Lotfi-Kamran; Hadi Esmaeilzadeh

Collaboration


Dive into Divya Mahajan's collaborations.

Top Co-Authors

Hadi Esmaeilzadeh, Georgia Institute of Technology
Jongse Park, Georgia Institute of Technology
Amir Yazdanbakhsh, Georgia Institute of Technology
Bradley Thwaites, Georgia Institute of Technology
Joon Kyung Kim, Georgia Institute of Technology
Emmanuel Amaro, Georgia Institute of Technology
Hardik Sharma, Georgia Institute of Technology
Abbas Rahimi, University of California
Anandhavel Nagendrakumar, Georgia Institute of Technology
Jacob Sacks, Georgia Institute of Technology