Mohammad Ali Ghodrat
University of California, Los Angeles
Publication
Featured research published by Mohammad Ali Ghodrat.
design automation conference | 2012
Jason Cong; Mohammad Ali Ghodrat; Michael Gill; Beayna Grigorian; Glenn Reinman
This work discusses hardware architectural support for accelerator-rich CMPs (ARC). First, we present a hardware resource management scheme for accelerator sharing. This scheme supports sharing and arbitration of a common set of accelerators among multiple cores, and it uses a hardware-based arbitration mechanism to provide feedback to cores indicating the wait time before a particular resource becomes available. Second, we propose a lightweight interrupt system to reduce the OS overhead of handling interrupts, which occur frequently in an accelerator-rich platform. Third, we propose architectural support that allows us to compose a larger virtual accelerator out of multiple smaller accelerators. We have also implemented a complete simulation tool-chain to verify our ARC architecture. Experimental results show significant performance improvements (51X on average) and energy improvements (17X on average) compared to approaches using OS-based accelerator management.
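The arbitration-with-feedback idea can be illustrated with a toy software model; this is a sketch under assumed semantics (a single shared accelerator with fixed-length tasks and a FIFO queue), not the paper's actual hardware design, and all names are hypothetical:

```python
from collections import deque

class AcceleratorArbiter:
    """Toy model of ARC-style arbitration: cores request a shared
    accelerator and receive an estimated wait time as feedback."""

    def __init__(self, task_cycles):
        self.task_cycles = task_cycles  # cycles one task occupies the accelerator
        self.queue = deque()            # pending core ids, FIFO order

    def request(self, core_id):
        # Feedback to the requesting core: cycles until the accelerator frees up.
        wait = len(self.queue) * self.task_cycles
        self.queue.append(core_id)
        return wait

    def complete(self):
        # Accelerator finished one task; grant it to the next waiting core.
        return self.queue.popleft() if self.queue else None

arb = AcceleratorArbiter(task_cycles=100)
print(arb.request(0))  # 0   -> runs immediately
print(arb.request(1))  # 100 -> must wait for core 0's task
print(arb.complete())  # 0   -> core 0's task done, core 1 is next
```

In the paper this feedback lets a core decide whether to wait for the accelerator or fall back to a software path; the toy model only shows the wait-time computation.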
international symposium on low power electronics and design | 2012
Jason Cong; Mohammad Ali Ghodrat; Michael Gill; Beayna Grigorian; Glenn Reinman
This work discusses CHARM, a Composable Heterogeneous Accelerator-Rich Microprocessor design that provides scalability, flexibility, and design reuse in the space of accelerator-rich CMPs. CHARM features a hardware structure called the accelerator block composer (ABC), which can dynamically compose a set of accelerator building blocks (ABBs) into a loosely coupled accelerator (LCA) to provide orders of magnitude improvement in performance and power efficiency. Our software infrastructure provides a data flow graph to describe the composition, and our hardware components dynamically map available resources to the data flow graph to compose the accelerator from components that may be physically distributed across the CMP. Our ABC is also capable of providing load balancing among available compute resources to increase accelerator utilization. Running medical imaging benchmarks, our experimental results show an average speedup of 2.1X (best case 3.7X) compared to approaches that use LCAs together with a hardware resource manager. We also gain in terms of energy consumption (average 2.4X; best case 4.7X).
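The composition step (mapping a data flow graph onto available accelerator building blocks, with load balancing) can be sketched as a greedy assignment; this is a minimal illustration assuming a flat pool of typed ABB instances, with all names invented, not CHARM's actual ABC hardware:

```python
def compose_accelerator(dfg_nodes, abb_pool):
    """Toy model of a CHARM-style accelerator block composer (ABC):
    map each data-flow-graph node (which needs a given ABB type) onto
    an ABB instance, preferring the least-loaded one for load balancing.

    dfg_nodes: list of (node_name, abb_type)
    abb_pool:  dict abb_type -> list of instance ids
    Returns a node -> instance mapping, or None if composition fails."""
    load = {inst: 0 for insts in abb_pool.values() for inst in insts}
    mapping = {}
    for node, abb_type in dfg_nodes:
        candidates = abb_pool.get(abb_type, [])
        if not candidates:
            return None  # a required building block type is unavailable
        inst = min(candidates, key=lambda i: load[i])
        load[inst] += 1
        mapping[node] = inst
    return mapping

# A 3-node DFG for a hypothetical imaging kernel: two multiplies and an add.
dfg = [("n0", "mul"), ("n1", "mul"), ("n2", "add")]
pool = {"mul": ["mul0", "mul1"], "add": ["add0"]}
print(compose_accelerator(dfg, pool))
# {'n0': 'mul0', 'n1': 'mul1', 'n2': 'add0'}
```

The real ABC additionally handles physically distributed ABBs and dynamic arrival of requests; the sketch only shows the mapping and load-balancing logic.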
international symposium on low power electronics and design | 2013
Jason Cong; Mohammad Ali Ghodrat; Michael Gill; Beayna Grigorian; Hui Huang; Glenn Reinman
Accelerator-rich platforms demonstrate orders of magnitude improvement in performance and energy efficiency over software, yet they lack adaptivity to new algorithms and can see low accelerator utilization. To address these issues we propose CAMEL: Composable Accelerator-rich Microprocessor Enhanced for Longevity. CAMEL features programmable fabric (PF) to extend the use of ASIC composable accelerators in supporting algorithms that are beyond the scope of the baseline platform. Using a combination of hardware extensions and compiler support, we demonstrate on average 11.6X performance improvement and 13.9X energy savings across benchmarks that deviate from the original domain for our baseline platform.
international conference on computer design | 2013
Yu-Ting Chen; Jason Cong; Mohammad Ali Ghodrat; Muhuan Huang; Chunyue Liu; Bingjun Xiao; Yi Zou
Application-specific accelerators provide 10-100× improvement in power efficiency over general-purpose processors. The accelerator-rich architectures are especially promising. This work discusses a prototype of accelerator-rich CMPs (PARC). During our development of PARC in real hardware, we encountered a set of technical challenges and proposed corresponding solutions. First, we provided system IPs that serve a sea of accelerators to transfer data between userspace and accelerator memories without cache overhead. Second, we designed a dedicated interconnect between accelerators and memories to enable memory sharing. Third, we implemented an accelerator manager to virtualize accelerator resources for users. Finally, we developed an automated flow with a number of IP templates and customizable interfaces to a C-based synthesis flow to enable rapid design and update of PARC. We implemented PARC in a Virtex-6 FPGA chip with integration of platform-specific peripherals and booting of unmodified Linux. Experimental results show that PARC can fully exploit the energy benefits of accelerators at little system overhead.
international conference on big data | 2015
Katayoun Neshatpour; Maria Malik; Mohammad Ali Ghodrat; Avesta Sasan; Houman Homayoun
A recent trend for big data analytics is to provide heterogeneous architectures that support hardware specialization. Given the time required to create such hardware implementations, it is important to estimate how much benefit, in terms of speed and energy efficiency, can be gained by offloading various functions to hardware. This work analyzes data mining and machine learning algorithms, which are used extensively in big data applications, on a heterogeneous CPU+FPGA platform. We select and offload the computationally intensive kernels to the hardware accelerator to achieve the highest speedup and best energy efficiency. We use the latest Xilinx Zynq boards for implementation and result analysis. We also perform a first-order comprehensive analysis of communication and computation overheads to understand how the speedup of each application contributes to its overall execution in an end-to-end Hadoop MapReduce environment. Moreover, we study how other system parameters, such as the choice of CPU (big vs. little) and the number of mapper slots, affect the performance and power-efficiency benefits of hardware acceleration. The results show that a kernel speedup of up to 321.5× can be achieved with hardware+software co-design. This yields a 2.72× speedup, 2.13× power reduction, and 15.21× improvement in energy efficiency (EDP) in an end-to-end Hadoop MapReduce environment.
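The gap between the kernel speedup and the end-to-end speedup can be approximated with an Amdahl's-law style model. The 321.5× figure is from the abstract, but the kernel fraction and communication overhead below are invented for illustration only:

```python
def end_to_end_speedup(kernel_fraction, kernel_speedup, comm_overhead=0.0):
    """Amdahl's-law style estimate of application speedup when only a
    kernel is offloaded to an FPGA.
    kernel_fraction: share of original runtime spent in the kernel
    comm_overhead:   CPU<->FPGA transfer time as a fraction of original
                     runtime (an assumption, not a figure from the paper)."""
    new_time = (1 - kernel_fraction) + kernel_fraction / kernel_speedup + comm_overhead
    return 1.0 / new_time

# Even a 321.5x kernel speedup gives a modest end-to-end gain when the
# kernel is ~65% of runtime and transfers cost ~3% (illustrative numbers).
print(round(end_to_end_speedup(0.65, 321.5, 0.03), 2))
```

This is why the paper's end-to-end measurements (2.72× in Hadoop MapReduce) are far below the raw kernel speedup: the non-offloaded phases (shuffling, sorting, compression) and the communication overhead dominate once the kernel is fast.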
international symposium on low power electronics and design | 2012
Jason Cong; Mohammad Ali Ghodrat; Michael Gill; Chunyue Liu; Glenn Reinman
As the number of on-chip accelerators grows rapidly to improve power-efficiency, the buffer size required by accelerators drastically increases. Existing solutions allow the accelerators to share a common pool of buffers or/and allocate buffers in cache. In this paper we propose a Buffer-in-NUCA (BiN) scheme with the following contributions: (1) a dynamic interval-based global buffer allocation method to assign shared buffer spaces to accelerators that can best utilize the additional buffer space, and (2) a flexible and low-overhead paged buffer allocation method to limit the impact of buffer fragmentation in a shared buffer, especially when allocating buffers in a non-uniform cache architecture (NUCA) with distributed cache banks. Experimental results show that, when compared to two representative schemes from the prior work, BiN improves performance by 32% and 35% and reduces energy by 12% and 29%, respectively.
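The paged allocation idea (bounding fragmentation by handing out fixed-size, possibly non-contiguous pages of shared buffer space) can be sketched in software; this is a toy model with invented names and sizes, not BiN's actual NUCA hardware:

```python
class PagedBufferAllocator:
    """Toy model of BiN-style paged buffer allocation: accelerator
    buffers are carved out of shared cache capacity in fixed-size pages,
    so a request can be satisfied by non-contiguous pages and
    fragmentation stays bounded by the page size."""

    def __init__(self, total_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(total_pages))
        self.owned = {}  # accelerator id -> list of pages it holds

    def alloc(self, acc_id, nbytes):
        need = -(-nbytes // self.page_size)  # ceiling division
        if need > len(self.free_pages):
            return None  # not enough shared buffer space right now
        pages = [self.free_pages.pop() for _ in range(need)]
        self.owned.setdefault(acc_id, []).extend(pages)
        return pages

    def free(self, acc_id):
        # Return all of this accelerator's pages to the shared pool.
        self.free_pages.extend(self.owned.pop(acc_id, []))

buf = PagedBufferAllocator(total_pages=8, page_size=1024)
print(buf.alloc("acc0", 3000))  # 3 pages: ceil(3000/1024)
print(buf.alloc("acc1", 9000))  # None: only 5 pages left, 9 needed
buf.free("acc0")
print(buf.alloc("acc1", 8000))  # succeeds: all 8 pages free again
```

BiN's actual scheme adds the dynamic interval-based global allocation policy on top of this mechanism; the sketch shows only why paging limits fragmentation.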
design automation conference | 2007
Mohammad Ali Ghodrat; Kanishka Lahiri; Anand Raghunathan
Fast and accurate power analysis is a critical requirement for designing power-efficient systems-on-chip (SoCs). Current system-level power analysis tools are incapable of generating power estimates under real-life workloads within an acceptable amount of time, even for moderately complex SoCs. Our work addresses this problem by building on emulation, a widely used technique for accelerating functional verification. Unfortunately, hardware emulation of all the functions necessary for full SoC power analysis is likely to be infeasible for most systems, due to constraints on emulation capacity and the lack of emulation-ready, synthesizable models for some SoC components early in the design process. This paper describes hybrid power estimation, an approach to accelerating SoC power analysis by emulating the functional and power models of a subset of SoC components on an FPGA platform (even a low-cost, off-the-shelf FPGA board). We describe the hardware and software components of the framework, and propose techniques to overcome the challenges posed by limited host-board communication bandwidth. We have implemented a hybrid power estimation framework using a Xilinx Virtex-II Pro emulation platform and software extensions to an HDL simulator, and used it to conduct power analysis of a video decoder SoC. The results indicate 65-332X gains in analysis efficiency over simulation-based power estimation, with no loss in accuracy. Further, we show that the increase in FPGA resource requirements for hybrid power estimation over pure functional emulation is modest.
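A common way such hybrid flows combine their two halves is to convert activity counters from the emulated components into power via a per-event energy macro-model, and add power figures from the components still in HDL simulation. The sketch below illustrates only that combination step; all component names, counts, and energy values are invented, and the paper's actual models are not reproduced here:

```python
def hybrid_power_estimate(emulated_counts, simulated_power,
                          energy_per_event, window_s):
    """Sketch of the combination step in hybrid power estimation:
    emulated_counts:  event counters from FPGA-emulated components
    simulated_power:  average power (W) of components in HDL simulation
    energy_per_event: macro-model, joules per counted event
    window_s:         length of the analysis window in seconds."""
    emulated_w = sum(emulated_counts[c] * energy_per_event[c]
                     for c in emulated_counts) / window_s
    return emulated_w + sum(simulated_power.values())

# Counters collected over a 1 ms analysis window (illustrative numbers).
counts = {"idct": 50_000, "bus": 200_000}  # events seen on the FPGA
energy = {"idct": 1.2e-9, "bus": 0.3e-9}   # joules per event (assumed)
sim_pw = {"cpu": 0.150}                    # watts, from HDL simulation
print(hybrid_power_estimate(counts, sim_pw, energy, window_s=1e-3))
```

The host-board bandwidth challenge the abstract mentions arises precisely because these counters must be streamed off the FPGA frequently enough to keep the power trace accurate.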
field-programmable custom computing machines | 2015
Katayoun Neshatpour; Maria Malik; Mohammad Ali Ghodrat; Houman Homayoun
Emerging big data analytics applications require a significant amount of server computational power. As chips are hitting power limits, computing systems are moving away from general-purpose designs and toward greater specialization. Hardware acceleration through specialization has received renewed interest in recent years, mainly due to the dark silicon challenge. To address the computing requirements of big data, and based on our benchmarking and characterization results, we envision a data-driven heterogeneous architecture for next-generation big data server platforms that leverages field-programmable gate arrays (FPGAs) to build custom accelerators in a Hadoop MapReduce framework. Unlike a full and dedicated implementation of the Hadoop MapReduce algorithm on FPGA, we propose hardware/software (HW/SW) co-design of the algorithm, which trades some speedup for lower hardware cost. Considering the communication overhead with the FPGA and the other overheads in the Hadoop MapReduce environment, such as compression and decompression, shuffling, and sorting, our experimental results show significant potential for accelerating Hadoop MapReduce machine learning kernels using a HW/SW co-design methodology.
ACM Transactions on Embedded Computing Systems | 2014
Jason Cong; Mohammad Ali Ghodrat; Michael Gill; Beayna Grigorian; Glenn Reinman
This work discusses hardware architectural support for domain-specific accelerator-rich CMPs. First, we present a hardware resource management scheme for sharing loosely coupled accelerators and arbitrating among multiple requesting cores. Second, we present a mechanism for accelerator virtualization, which allows a larger virtual accelerator to be composed efficiently out of multiple smaller accelerators, and also allows accelerators to collaborate as multiple copies of a simple accelerator. All of this work is supported by a fully automated simulation tool-chain for both accelerator generation and management. We present the applicability of our approach to four different application domains: medical imaging, commercial, computer vision, and navigation. Our results demonstrate large performance improvements and energy savings over a software implementation. We also show additional improvements that result from enhanced load balancing and simplification of the communication between the cores and the arbitration mechanism.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2006
Mohammad Ali Ghodrat; Tony Givargis; Alexandru Nicolau
Arithmetic expressions are fundamental building blocks of hardware and software systems. An important problem in computational theory is deciding whether two arithmetic expressions are equivalent. However, the general equivalence-checking problem over digital computer arithmetic is NP-hard. Moreover, existing general techniques for solving this decision problem are applicable only to very simple expressions and are impractical when applied to the more complex expressions found in programs written in high-level languages. In this paper, we propose a method for solving the arithmetic expression equivalence problem using partial evaluation. In particular, our technique is specifically designed for equivalence checking of arithmetic expressions obtained from high-level language descriptions of hardware/software systems. In our method, we use interval analysis to substantially prune the domain space of arithmetic expressions and restrict the evaluation effort to a small set of subspaces. Our results show that the proposed method is fast enough to be of use in practice.
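The interval-pruning idea can be sketched for a single integer variable: evaluate the difference of the two expressions in interval arithmetic, discard any subrange where the interval provably excludes zero (or is exactly zero), and bisect otherwise until point evaluation is cheap. This is a minimal illustration of the general technique, not the paper's algorithm; the class and helpers are invented:

```python
class Iv:
    """Minimal interval arithmetic (add/sub/mul) over [lo, hi]."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, o):
        return Iv(self.lo + o.lo, self.hi + o.hi)
    def __sub__(self, o):
        return Iv(self.lo - o.hi, self.hi - o.lo)
    def __mul__(self, o):
        ps = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Iv(min(ps), max(ps))

def equivalent(e1, e2, lo, hi):
    """Decide whether e1(x) == e2(x) for every integer x in [lo, hi],
    using interval evaluation to prune whole subranges before testing."""
    iv = e1(Iv(lo, hi)) - e2(Iv(lo, hi))
    if iv.lo > 0 or iv.hi < 0:
        return False            # difference can never be zero here
    if iv.lo == 0 and iv.hi == 0:
        return True             # difference is identically zero here
    if hi - lo <= 1:            # small enough: evaluate point by point
        return all(e1(Iv(x, x)).lo == e2(Iv(x, x)).lo
                   for x in range(lo, hi + 1))
    mid = (lo + hi) // 2        # inconclusive: bisect the domain
    return equivalent(e1, e2, lo, mid) and equivalent(e1, e2, mid + 1, hi)

one, two = Iv(1, 1), Iv(2, 2)
e1 = lambda x: (x + one) * (x + one)       # (x + 1)^2
e2 = lambda x: x * x + two * x + one       # x^2 + 2x + 1
e3 = lambda x: x * x + two * x             # off by 1 everywhere
print(equivalent(e1, e2, -8, 8))  # True
print(equivalent(e1, e3, -8, 8))  # False
```

The interval bounds are conservative (multiplication widens them), so the pruning is sound but not tight; the paper's contribution is making this style of pruning scale to the large expressions produced by high-level language descriptions.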