
Publication


Featured research published by Mehrzad Samadi.


International Symposium on Microarchitecture | 2013

SAGE: self-tuning approximation for graphics engines

Mehrzad Samadi; Janghaeng Lee; D. Anoushe Jamshidi; Amir Hormati; Scott A. Mahlke

Approximate computing, where computation accuracy is traded off for better performance or higher data throughput, is one solution that can help data processing keep pace with the current and growing overabundance of information. For particular domains such as multimedia and learning algorithms, approximation is commonly used today. We consider automation to be essential to provide transparent approximation, and we show that larger benefits can be achieved by tailoring the approximation techniques to the underlying hardware. Our target platform is the GPU because of its high performance capabilities and difficult programming challenges that can be alleviated with proper automation. Our approach, SAGE, combines a static compiler that automatically generates a set of CUDA kernels with varying levels of approximation with a run-time system that iteratively selects among the available kernels to achieve speedup while adhering to a target output quality set by the user. The SAGE compiler employs three optimization techniques to generate approximate kernels that exploit the GPU microarchitecture: selective discarding of atomic operations, data packing, and thread fusion. Across a set of machine learning and image processing kernels, SAGE's approximation yields an average of 2.5× speedup with less than 10% quality loss compared to accurate execution on an NVIDIA GTX 560 GPU.
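
To make one of these techniques concrete, here is a minimal CUDA sketch of thread fusion on a simple elementwise kernel. The kernels and the expensive_op function are illustrative assumptions, not SAGE's actual generated code.

```cuda
// Minimal sketch of SAGE-style thread fusion (illustrative, not SAGE's
// generated code): each thread computes one "representative" element
// exactly and reuses that result for its neighbor, roughly halving the
// work at the cost of accuracy. expensive_op() is a hypothetical
// per-element computation.
__device__ float expensive_op(float x) {
    return sqrtf(x) * sinf(x) + cosf(x * 0.5f);
}

// Accurate baseline: one thread per output element.
__global__ void map_exact(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = expensive_op(in[i]);
}

// Approximate variant: each thread covers two adjacent elements but
// evaluates expensive_op() only once, copying the result to the neighbor.
__global__ void map_fused_approx(const float *in, float *out, int n) {
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i < n) {
        float v = expensive_op(in[i]);
        out[i] = v;
        if (i + 1 < n)
            out[i + 1] = v;   // neighbor approximated by the fused result
    }
}
```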


Architectural Support for Programming Languages and Operating Systems | 2011

Sponge: portable stream programming on graphics engines

Amir Hormati; Mehrzad Samadi; Mark Woh; Trevor N. Mudge; Scott A. Mahlke

Graphics processing units (GPUs) provide a low cost platform for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL, makes GPU programming attractive to a wide variety of programmers. However, programming GPUs is still a cumbersome task for two primary reasons: tedious performance optimizations and lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming task that requires a thorough understanding of both the algorithm and the underlying hardware. Unoptimized CUDA programs typically achieve only a small fraction of the peak GPU performance. Second, GPU code lacks efficient portability, as code written for one GPU can be inefficient when executed on another. Moving code from one GPU to another while maintaining the desired performance is a non-trivial task often requiring significant modifications to account for the hardware differences. In this work, we propose Sponge, a compilation framework for GPUs using synchronous data flow streaming languages. Sponge is capable of performing a wide variety of optimizations to generate efficient code for graphics engines. Sponge alleviates the problems associated with current GPU programming methods by providing portability across different generations of GPUs and CPUs, and a better abstraction of the hardware details, such as the memory hierarchy and threading model. Using streaming, we provide a write-once software paradigm and rely on the compiler to automatically create optimized CUDA code for a wide variety of GPU targets. Sponge's compiler optimizations improve the performance of the baseline CUDA implementations by an average of 3.2x.
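
To illustrate the target of this mapping, below is a hand-written CUDA kernel for a single synchronous data flow actor, with one actor firing per GPU thread. The 3-tap filter and its input/output rates are illustrative assumptions, not Sponge's generated output.

```cuda
// Minimal sketch of the actor-to-kernel mapping that Sponge automates
// (hand-written here for illustration). The filter below peeks a sliding
// window of 3 stream elements per firing and pushes 1 output (a 3-tap
// moving average); the input array must hold firings + 2 elements.
__global__ void moving_avg_actor(const float *in, float *out, int firings) {
    int f = blockIdx.x * blockDim.x + threadIdx.x;   // firing index
    if (f < firings) {
        // Each firing consumes its window independently, so all firings
        // can execute in parallel as GPU threads.
        out[f] = (in[f] + in[f + 1] + in[f + 2]) / 3.0f;
    }
}
```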


Architectural Support for Programming Languages and Operating Systems | 2014

Paraprox: pattern-based approximation for data parallel applications

Mehrzad Samadi; Davoud Anoushe Jamshidi; Janghaeng Lee; Scott A. Mahlke

Approximate computing is an approach where reduced accuracy of results is traded off for increased speed, throughput, or both. Loss of accuracy is not permissible in all computing domains, but there are a growing number of data-intensive domains where the output of programs need not be perfectly correct to provide useful results, and where small errors may not even be noticeable to the end user. These soft domains include multimedia processing, machine learning, and data mining/analysis. An important challenge with approximate computing is transparency, to insulate both software and hardware developers from the time, cost, and difficulty of using approximation. This paper proposes a software-only system, Paraprox, for realizing transparent approximation of data-parallel programs that operates on commodity hardware systems. Paraprox starts with a data-parallel kernel implemented using OpenCL or CUDA and creates a parameterized approximate kernel that is tuned at runtime to maximize performance subject to a target output quality (TOQ) that is supplied by the user. Approximate kernels are created by recognizing common computation idioms found in data-parallel programs (e.g., Map, Scatter/Gather, Reduction, Scan, Stencil, and Partition) and substituting approximate implementations in their place. Across a set of 13 soft data-parallel applications with at most 10% quality degradation, Paraprox yields an average performance gain of 2.7x on an NVIDIA GTX 560 GPU and 2.5x on an Intel Core i7 quad-core processor compared to accurate execution on each platform.
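
As a concrete example of one such pattern substitution, here is a CUDA sketch of an exact 1D stencil next to an approximate variant that assumes nearby values are similar. Both kernels and their weights are illustrative assumptions, not Paraprox's generated code.

```cuda
// Exact 5-point stencil: one output element per thread, five loads each.
__global__ void stencil5_exact(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 2 && i < n - 2)
        out[i] = 0.2f * (in[i-2] + in[i-1] + in[i] + in[i+1] + in[i+2]);
}

// Paraprox-style approximate substitute (illustrative): the outer
// neighbors are assumed similar to the inner ones and are not loaded,
// trading accuracy for two fewer memory accesses per output element.
__global__ void stencil5_approx(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 2 && i < n - 2)
        out[i] = 0.2f * (2.0f * in[i-1] + in[i] + 2.0f * in[i+1]);
}
```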


International Conference on Parallel Architectures and Compilation Techniques | 2013

Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems

Janghaeng Lee; Mehrzad Samadi; Yongjun Park; Scott A. Mahlke

Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code or data transfer management. Unfortunately, this work distribution can be a poor solution as it underutilizes the CPU, has difficulty generalizing beyond the single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance competitive with GPUs on many workloads, so simply partitioning work based on fixed roles may be a poor choice. In this paper, we present the single kernel multiple devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer is responsible for developing a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs together. The goal is performance improvement by maximally utilizing all available resources to execute the kernel. SKMD handles the difficult challenges of exposed data transfer costs and the performance variations GPUs have with respect to input size. On real hardware, SKMD achieves an average speedup of 29% on a system with one multicore CPU and two asymmetric GPUs compared to a fastest-device execution strategy for a set of popular OpenCL kernels.
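
The sketch below hand-rolls the kind of partition-execute-merge flow SKMD automates (SKMD itself works on OpenCL kernels and derives the split transparently). The 70/30 split, kernel, and function names are illustrative assumptions.

```cuda
// Hand-written sketch of CPU-GPU collaborative execution of one
// data-parallel kernel. SKMD derives the split ratio and merge logic
// automatically; here both are fixed assumptions for illustration.
#include <cuda_runtime.h>

__global__ void scale_gpu(const float *in, float *out, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * s;
}

void scale_collaborative(const float *in, float *out, int n, float s) {
    float gpu_share = 0.7f;              // assumed split; SKMD tunes this
    int n_gpu = (int)(n * gpu_share);

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n_gpu * sizeof(float));
    cudaMalloc(&d_out, n_gpu * sizeof(float));
    cudaMemcpyAsync(d_in, in, n_gpu * sizeof(float), cudaMemcpyHostToDevice);
    scale_gpu<<<(n_gpu + 255) / 256, 256>>>(d_in, d_out, n_gpu, s);

    // The CPU processes the remaining tail while the GPU works.
    for (int i = n_gpu; i < n; ++i)
        out[i] = in[i] * s;

    // Merge: copy the GPU's partial output back into place (synchronizes).
    cudaMemcpy(out, d_out, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
}
```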


International Symposium on Computer Architecture | 2015

Rumba: an online quality management system for approximate computing

Daya Shanker Khudia; Babak Zamirai; Mehrzad Samadi; Scott A. Mahlke

Approximate computing can be employed for an emerging class of applications from various domains such as multimedia, machine learning, and computer vision. The approximated output of such applications, even though not 100% numerically correct, is often still useful, or the difference is unnoticeable to the end user. This opens up a new design dimension to trade off application performance and energy consumption with output correctness. However, a largely unaddressed challenge is quality control: how to ensure the user experience meets a prescribed level of quality. Current approaches either do not monitor output quality or use sampling approaches to check a small subset of the output, assuming that it is representative. While these approaches have been shown to produce average errors that are acceptable, they often miss large errors without any means to take corrective actions. To overcome this challenge, we propose Rumba for online detection and correction of large approximation errors in an approximate accelerator-based computing environment. Rumba employs continuous lightweight checks in the accelerator to detect large approximation errors and then fixes these errors by exact re-computation on the host processor. Rumba employs computationally inexpensive output error prediction models for efficient detection. Computing patterns amenable to approximation (e.g., map and stencil) are usually data parallel in nature, and Rumba exploits this property for selective correction. Overall, Rumba is able to achieve a 2.1x reduction in output error for an unchecked approximation accelerator while maintaining the accelerator performance gains, at the cost of reducing the energy savings from 3.2x to 2.2x for a set of applications from different approximate computing domains.
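
A minimal sketch of the detect-and-repair loop follows. Rumba's checks run inside the accelerator; this host-side rendering is illustrative, and exact_fn, predict_fn, and the threshold are hypothetical stand-ins.

```cuda
// Rumba-style check and repair (illustrative sketch): a cheap predictor
// flags elements whose approximation error is likely large, and only
// those elements are recomputed exactly. Data parallelism makes this
// per-element repair safe.
#include <cmath>
#include <vector>

float exact_fn(float x)   { return std::sqrt(x) * std::sin(x); }
float predict_fn(float x) { return 0.8f * x; }   // cheap linear predictor

void rumba_repair(const std::vector<float>& in,
                  std::vector<float>& approx_out, float threshold) {
    for (size_t i = 0; i < in.size(); ++i) {
        // Lightweight check: compare the approximate output against a
        // computationally inexpensive prediction of what it should be.
        if (std::fabs(approx_out[i] - predict_fn(in[i])) > threshold) {
            // Likely large error: fix by exact re-computation.
            approx_out[i] = exact_fn(in[i]);
        }
    }
}
```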


High-Performance Computer Architecture | 2011

Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism

Mojtaba Mehrara; Po-Chun Hsu; Mehrzad Samadi; Scott A. Mahlke

As the web becomes the platform of choice for execution of more complex applications, a growing portion of computation is handed off by developers to the client side to reduce network traffic and improve application responsiveness. Therefore, the client-side component, often written in JavaScript, is becoming larger and more compute-intensive, increasing the demand for high performance JavaScript execution. This has led to many recent efforts to improve the performance of JavaScript engines in web browsers. Furthermore, considering the wide-spread deployment of multi-cores in today's computing systems, exploiting parallelism in these applications is a promising approach to meet their performance requirements. However, JavaScript has traditionally been treated as a sequential language with no support for multithreading, limiting its potential to make use of the extra computing power in multicore systems. In this work, to exploit hardware concurrency while retaining the traditional sequential programming model, we develop ParaScript, an automatic runtime parallelization system for JavaScript applications in the client's browser. First, we propose an optimistic runtime scheme for identifying parallelizable regions, generating the parallel code on-the-fly, and speculatively executing it. Second, we introduce an ultra-lightweight software speculation mechanism to manage parallel execution. This speculation engine consists of a selective checkpointing scheme and a novel runtime dependence detection mechanism based on reference counting and range-based array conflict detection. Our system is able to achieve an average of 2.18× speedup over the Firefox browser using 8 threads on commodity multi-core systems, while performing all required analyses and conflict detection dynamically at runtime.
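
The range-based array conflict detection can be sketched as follows. ParaScript itself operates inside a JavaScript engine; this C++ rendering is an illustrative model, not the paper's implementation.

```cuda
// Range-based array conflict detection in the spirit of ParaScript's
// speculation engine (illustrative sketch). Each speculative thread
// records the [min, max] index ranges it read and wrote; after parallel
// execution, overlapping write/read or write/write ranges across threads
// signal a conflict and trigger rollback to the checkpoint.
#include <algorithm>
#include <climits>
#include <vector>

struct AccessRange {
    int read_lo = INT_MAX, read_hi = INT_MIN;
    int write_lo = INT_MAX, write_hi = INT_MIN;
    void note_read(int i)  { read_lo  = std::min(read_lo, i);
                             read_hi  = std::max(read_hi, i); }
    void note_write(int i) { write_lo = std::min(write_lo, i);
                             write_hi = std::max(write_hi, i); }
};

static bool overlaps(int lo1, int hi1, int lo2, int hi2) {
    return lo1 <= hi2 && lo2 <= hi1;
}

// True if any thread's writes overlap another thread's reads or writes,
// i.e., the speculative parallel execution must be rolled back.
bool has_conflict(const std::vector<AccessRange>& threads) {
    for (size_t a = 0; a < threads.size(); ++a)
        for (size_t b = 0; b < threads.size(); ++b)
            if (a != b &&
                (overlaps(threads[a].write_lo, threads[a].write_hi,
                          threads[b].read_lo,  threads[b].read_hi) ||
                 overlaps(threads[a].write_lo, threads[a].write_hi,
                          threads[b].write_lo, threads[b].write_hi)))
                return true;
    return false;
}
```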


International Conference on Parallel Architectures and Compilation Techniques | 2013

APOGEE: adaptive prefetching on GPUs for energy efficiency

Ankit Sethia; Ganesh S. Dasika; Mehrzad Samadi; Scott A. Mahlke

Modern graphics processing units (GPUs) combine large amounts of parallel hardware with fast context switching among thousands of active threads to achieve high performance. However, such designs do not translate well to mobile environments where power constraints often limit the amount of hardware. In this work, we investigate the use of prefetching as a means to increase the energy efficiency of GPUs. Classically, CPU prefetching results in higher performance but worse energy efficiency due to unnecessary data being brought on chip. Our approach, called APOGEE, uses an adaptive mechanism to dynamically detect and adapt to the memory access patterns found in both graphics and scientific applications that are run on modern GPUs to achieve prefetching efficiencies of over 90%. Rather than examining threads in isolation, APOGEE uses adjacent threads to more efficiently identify address patterns and dynamically adapt the timeliness of prefetching. The net effect of APOGEE is that fewer thread contexts are necessary to hide memory latency and thus sustain performance. This reduction in thread contexts and related hardware translates to simplification of hardware and leads to a reduction in power. For graphics and GPGPU applications, APOGEE enables an 8X reduction in multi-threading hardware, while providing a performance benefit of 19%. This translates to a 52% increase in performance per watt over systems with high multi-threading and 33% over existing GPU prefetching techniques.
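
APOGEE's detector is a hardware mechanism, but the adjacent-thread address-pattern idea can be modeled in software. The function below is an illustrative sketch of stride detection across a warp's addresses, not the paper's microarchitecture.

```cuda
// Software model of adjacent-thread address-pattern detection
// (illustrative). If consecutive threads of a warp access a load with a
// fixed stride, the address for the next group of threads can be
// predicted and prefetched ahead of time.
#include <cstdint>
#include <vector>

// Returns a predicted prefetch address for the next group of threads,
// or 0 if the adjacent threads' addresses do not form a fixed stride.
uint64_t predict_prefetch(const std::vector<uint64_t>& warp_addrs) {
    if (warp_addrs.size() < 2) return 0;
    int64_t stride = (int64_t)(warp_addrs[1] - warp_addrs[0]);
    for (size_t t = 2; t < warp_addrs.size(); ++t)
        if ((int64_t)(warp_addrs[t] - warp_addrs[t - 1]) != stride)
            return 0;                      // irregular: do not prefetch
    // Fixed stride across adjacent threads: the next group is expected
    // to continue the pattern one stride further on.
    return warp_addrs.back() + (uint64_t)stride;
}
```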


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2011

Dynamic Voltage and Frequency Scheduling for Embedded Processors Considering Power/Performance Tradeoffs

Mostafa E. Salehi; Mehrzad Samadi; Mehrdad Najibi; Ali Afzali-Kusha; Massoud Pedram; Sied Mehdi Fakhraie

An adaptive method to perform dynamic voltage and frequency scheduling (DVFS) for minimizing the energy consumption of microprocessor chips is presented. Instead of using a fixed update interval, the proposed DVFS system makes use of adaptive update intervals for optimal frequency and voltage scheduling. The optimization enables the system to rapidly track the workload changes so as to meet soft real-time deadlines. The technique, which can be realized with very simple hardware, is completely transparent to the application. The results of applying the method to some real application workloads demonstrate considerable power savings and fewer frequency updates compared to DVFS systems based on fixed update intervals.
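
A minimal sketch of the adaptive-update-interval idea follows. The utilization-proportional frequency policy, the thresholds, and the interval bounds are illustrative assumptions; the paper's controller is more elaborate.

```cuda
// DVFS with an adaptive update interval (illustrative sketch): the
// interval shrinks when the workload changes quickly, so the scheduler
// tracks it closely and meets soft real-time deadlines, and grows when
// the workload is stable, reducing the number of frequency updates.
#include <algorithm>
#include <cmath>

struct DvfsState {
    double freq;          // current frequency setting
    double interval_ms;   // time until the next voltage/frequency update
};

void dvfs_update(DvfsState& s, double utilization, double prev_utilization,
                 double f_min, double f_max) {
    // Scale frequency with observed utilization (assumed policy).
    s.freq = f_min + utilization * (f_max - f_min);

    // Adapt the update interval to the rate of workload change.
    double change = std::fabs(utilization - prev_utilization);
    if (change > 0.1)
        s.interval_ms = std::max(1.0,   s.interval_ms * 0.5); // track fast
    else
        s.interval_ms = std::min(100.0, s.interval_ms * 2.0); // fewer updates
}
```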


International Conference on Parallel Architectures and Compilation Techniques | 2014

VAST: the illusion of a large memory space for GPUs

Janghaeng Lee; Mehrzad Samadi; Scott A. Mahlke

Heterogeneous systems equipped with traditional processors (CPUs) and graphics processing units (GPUs) have enabled processing large data sets. With new programming models, such as OpenCL and CUDA, programmers are encouraged to offload data-parallel workloads to GPUs as much as possible in order to fully utilize the available resources. Unfortunately, offloading work is strictly limited by the size of the physical memory on a specific GPU. In this paper, we present Virtual Address Space for Throughput processors (VAST), an automatic GPU memory management system that provides an OpenCL program with the illusion of a virtual memory space. Based on the available physical memory on the target GPU, VAST does the following: automatically partitions the data-parallel workload into chunks; efficiently extracts the precise working set required for the divided workload; rearranges the working set in contiguous memory space; and transforms the kernel to operate on the reorganized working set. With VAST, the programmer is responsible for developing a data-parallel kernel in OpenCL without concern for the physical memory space limitations of individual GPUs. VAST transparently handles code generation dealing with the constraints of the actual physical memory and improves the retargetability of OpenCL code with moderate overhead. Experiments demonstrate that a real GPU, an NVIDIA GTX 760 with 2 GB of memory, can compute data of any size without program changes, achieving a 2.6× speedup over CPU execution and making GPUs a realistic alternative for large data computation.
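
The chunked execution that VAST automates can be hand-written for a simple kernel as below. VAST derives the chunking, working-set extraction, and kernel transformation from the program itself; the kernel and chunk size here are illustrative assumptions.

```cuda
// Hand-rolled sketch of chunked GPU execution for data larger than GPU
// memory: split the workload into chunks that fit, copy each chunk's
// working set in, run the kernel on that sub-range, and copy the
// partial result back out.
#include <algorithm>
#include <cuda_runtime.h>

__global__ void square(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

void square_chunked(const float *in, float *out, long n, long chunk_elems) {
    float *d_in, *d_out;
    cudaMalloc(&d_in,  chunk_elems * sizeof(float));
    cudaMalloc(&d_out, chunk_elems * sizeof(float));

    for (long off = 0; off < n; off += chunk_elems) {
        long m = std::min(chunk_elems, n - off);
        // The working set of this chunk is the contiguous range [off, off+m).
        cudaMemcpy(d_in, in + off, m * sizeof(float), cudaMemcpyHostToDevice);
        square<<<((int)m + 255) / 256, 256>>>(d_in, d_out, (int)m);
        cudaMemcpy(out + off, d_out, m * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in); cudaFree(d_out);
}
```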


Architectural Support for Programming Languages and Operating Systems | 2012

Paragon: collaborative speculative loop execution on GPU and CPU

Mehrzad Samadi; Amir Hormati; Janghaeng Lee; Scott A. Mahlke

The rise of graphics engines as one of the main parallel platforms for general purpose computing has ignited a wide search for better programming support for GPUs. Due to their non-traditional execution model, developing applications for GPUs is usually very challenging, and as a result, these devices are left under-utilized in many commodity systems. Several languages, such as CUDA, have emerged to solve this challenge, but past research has shown that developing applications in these languages is a daunting task because of the tedious performance optimization cycle or inherent algorithmic characteristics of an application, which could make it unsuitable for GPUs. Also, previous approaches to automatically generating optimized parallel CUDA code using complex compilation techniques have failed to utilize the GPUs that are present in everyday computing devices such as laptops and mobile systems. In this work, we take a different approach. Although it is hard to generate optimized code for GPUs, their high raw performance relative to CPUs makes it beneficial to utilize them speculatively rather than leave them idle. To achieve this goal, we propose Paragon: a collaborative static/dynamic compiler platform to speculatively run possibly-data-parallel pieces of sequential applications on GPUs. Paragon utilizes the GPU in an opportunistic way for loops that are categorized as possibly-data-parallel by its loop classification phase. While running the loop speculatively, Paragon monitors the dependencies using a lightweight kernel management unit, and transfers the execution to the CPU if a conflict is detected. Paragon resumes the execution on the GPU after the dependency is executed sequentially on the CPU. Our experiments show that Paragon achieves up to 12x speedup compared to unsafe CPU execution with 4 threads.
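
The sketch below shows the speculative launch-validate-fallback shape of this approach. Paragon generates and manages this automatically; the kernel, the per-element write markers, and the fallback loop here are illustrative assumptions.

```cuda
// Speculative execution of a possibly-data-parallel loop (illustrative
// sketch): the GPU runs the loop while flagging any location written by
// more than one iteration; on a detected conflict the host discards the
// GPU result and re-runs the loop sequentially.
#include <cuda_runtime.h>

__global__ void loop_speculative(const int *idx, const float *in, float *out,
                                 int n, int *write_marks, int *conflict) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // A second writer to the same location means the loop was not
        // actually parallel: flag a conflict.
        if (atomicExch(&write_marks[idx[i]], 1) != 0)
            atomicExch(conflict, 1);
        out[idx[i]] = in[i] * 2.0f;
    }
}

void run_loop(const int *h_idx, const float *h_in, float *h_out, int n) {
    int *d_idx, *d_marks, *d_conflict; float *d_in, *d_out;
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_marks, n * sizeof(int));
    cudaMalloc(&d_conflict, sizeof(int));
    cudaMemcpy(d_idx, h_idx, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_marks, 0, n * sizeof(int));
    cudaMemset(d_conflict, 0, sizeof(int));

    loop_speculative<<<(n + 255) / 256, 256>>>(d_idx, d_in, d_out, n,
                                               d_marks, d_conflict);
    int conflict = 0;
    cudaMemcpy(&conflict, d_conflict, sizeof(int), cudaMemcpyDeviceToHost);
    if (conflict) {
        // Mis-speculation: discard GPU results, re-run sequentially on CPU.
        for (int i = 0; i < n; ++i) h_out[h_idx[i]] = h_in[i] * 2.0f;
    } else {
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_idx); cudaFree(d_in); cudaFree(d_out);
    cudaFree(d_marks); cudaFree(d_conflict);
}
```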

Collaboration


Dive into Mehrzad Samadi's collaboration.

Top Co-Authors

Jason Mars

University of Michigan

Parker Hill

University of Michigan
