Amir Kavyan Ziabari
Northeastern University
Publications
Featured research published by Amir Kavyan Ziabari.
IEEE International Symposium on Workload Characterization | 2016
Yifan Sun; Xiang Gong; Amir Kavyan Ziabari; Leiming Yu; Xiangyu Li; Saoni Mukherjee; Carter McCardwell; Alejandro Villegas; David R. Kaeli
Graphics Processing Units (GPUs) can easily outperform CPUs in processing large-scale data-parallel workloads, but are considered weak at processing serialized tasks and communicating with other devices. Pursuing a CPU-GPU collaborative computing model that takes advantage of both devices could provide an important breakthrough in realizing the full performance potential of heterogeneous computing. In recent years, platform vendors and runtime systems have added new features, such as a unified memory space and dynamic parallelism, providing a path to CPU-GPU coordination and the programming infrastructure needed to support future heterogeneous applications. As adoption of CPU-GPU collaborative computing continues to grow, it becomes increasingly important to formalize CPU-GPU collaborative programming paradigms and understand the impact of this emerging model on overall application performance. We propose Hetero-Mark, a benchmark suite, to help heterogeneous system programmers understand CPU-GPU collaborative computing and to provide guidance to computer architects in enhancing the design of the runtime and the driver. We summarize seven common CPU-GPU collaborative programming patterns and include at least one benchmark for each pattern in the suite. We also characterize the workloads in Hetero-Mark to analyze execution metrics specific to CPU-GPU collaborative computing, including CPU and GPU performance, CPU-GPU communication latency, and memory transfer latency.
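The abstract does not spell out the individual patterns; as a flavor of the idea, here is a minimal C++ sketch of one collaborative pattern (CPU-GPU pipelining), with the GPU stage modeled by a second CPU thread. Names, sizes, and the chunking scheme are illustrative assumptions, not code from Hetero-Mark.

    // A minimal sketch of one collaborative pattern (CPU-GPU pipelining),
    // with the GPU stage modeled by a second CPU thread. Names, sizes, and
    // the chunking scheme are illustrative assumptions, not Hetero-Mark code.
    #include <atomic>
    #include <thread>
    #include <vector>

    constexpr int kChunks = 8;
    constexpr int kChunkSize = 1024;

    std::vector<float> data(kChunks * kChunkSize, 1.0f);
    std::atomic<int> ready_chunks{0};  // CPU -> "GPU" handoff counter

    void cpu_producer() {  // stage 1: the CPU preprocesses each chunk
      for (int c = 0; c < kChunks; ++c) {
        for (int i = 0; i < kChunkSize; ++i)
          data[c * kChunkSize + i] *= 2.0f;  // serial preprocessing work
        ready_chunks.store(c + 1, std::memory_order_release);
      }
    }

    void gpu_consumer() {  // stage 2: stand-in for the GPU kernel
      for (int c = 0; c < kChunks; ++c) {
        while (ready_chunks.load(std::memory_order_acquire) <= c) {
        }  // spin until chunk c is handed off
        for (int i = 0; i < kChunkSize; ++i)
          data[c * kChunkSize + i] += 1.0f;  // data-parallel "kernel" work
      }
    }

    int main() {
      std::thread producer(cpu_producer), consumer(gpu_consumer);
      producer.join();
      consumer.join();
      return 0;
    }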
International Symposium on Performance Analysis of Systems and Software | 2016
Saoni Mukherjee; Yifan Sun; Paul Blinzer; Amir Kavyan Ziabari; David R. Kaeli
Heterogeneous systems, which marry CPUs and GPUs together in a range of configurations, are quickly becoming the design paradigm for today's platforms because of their impressive parallel processing capabilities. However, in many existing heterogeneous systems the GPU is treated only as an accelerator, working as a slave to the CPU master. Recently, we have started to see the introduction of a new class of devices, and changes to the system runtime model, that enable accelerators to be treated as first-class computing devices. To support the programmability and efficiency of heterogeneous programming, the HSA Foundation introduced the Heterogeneous System Architecture (HSA), which defines a platform and runtime architecture that provides rich support for OpenCL 2.0 features, including shared virtual memory, dynamic parallelism, and improved atomic operations. In this paper, we provide the first comprehensive study of OpenCL 2.0 and HSA 1.0 execution, considering OpenCL 1.2 as the baseline. For workloads, we develop a suite of OpenCL micro-benchmarks designed to highlight the features of these emerging standards, and we also use real-world applications to better understand their impact at the application level. To fully exercise the new features provided by the HSA model, we experiment with a producer-consumer algorithm and persistent kernels. We find that by using HSA signals, we can remove 92% of the overhead due to synchronous kernel launches. In our real-world applications, the OpenCL 2.0 runtime achieves up to a 1.2x speedup, while the HSA 1.0 runtime achieves a 2.7x speedup, over OpenCL 1.2.
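A conceptual sketch of the signal-based handshake that makes persistent kernels possible, modeled in plain C++ with an atomic standing in for an HSA signal. The real HSA runtime calls (signal creation, waits, stores) are not shown, and the task/-task encoding is an invented convention for illustration only.

    // A C++ atomic stands in for an HSA signal; the persistent "kernel" is
    // a resident loop on another thread, so each new task needs only a
    // signal store rather than a full synchronous kernel launch.
    #include <atomic>
    #include <thread>

    std::atomic<long> signal_value{0};  // stands in for an HSA signal

    void persistent_kernel() {  // stays resident; no per-task relaunch
      for (long task = 1; task <= 4; ++task) {
        while (signal_value.load(std::memory_order_acquire) != task) {
        }  // analogous to waiting on the signal to reach 'task'
        // ... process task here ...
        signal_value.store(-task, std::memory_order_release);  // done
      }
    }

    int main() {
      std::thread gpu(persistent_kernel);
      for (long task = 1; task <= 4; ++task) {
        signal_value.store(task, std::memory_order_release);  // "launch"
        while (signal_value.load(std::memory_order_acquire) != -task) {
        }  // host waits on the completion value, not a kernel return
      }
      gpu.join();
      return 0;
    }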
International Symposium on Performance Analysis of Systems and Software | 2013
Yash Ukidave; Amir Kavyan Ziabari; Perhaad Mistry; Gunar Schirner; David R. Kaeli
Heterogeneous computing using Graphics Processing Units (GPUs) has become an attractive computing model given the available scale of data-parallel performance and programming standards such as OpenCL. However, given the energy issues present with GPUs, some devices can exhaust their power budgets quickly. Better solutions are needed to effectively exploit the power efficiency available on heterogeneous systems. In this paper, we evaluate the power-performance trade-offs of different heterogeneous signal processing applications. More specifically, we compare the performance of 7 different implementations of the Fast Fourier Transform (FFT) algorithm. Our study covers discrete GPUs and shared-memory GPUs (APUs) from AMD (Llano APUs and the Southern Islands GPU), Nvidia (Fermi), and Intel (Ivy Bridge). For this range of platforms, we characterize the different FFTs and identify the specific architectural features that most impact power consumption. Using the 7 FFT kernels, we obtain a 48% reduction in power consumption and up to a 58% improvement in performance across these different FFT implementations. These differences are also found to be target-architecture dependent. The results of this study will help the signal processing community identify which class of FFTs is most appropriate for a given platform. More importantly, we demonstrate that different algorithms implementing the same fundamental function (FFT) can perform vastly differently depending on the target hardware and associated programming optimizations.
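The trade-off being measured reduces to energy = average power x runtime, so a higher-power implementation can still be the more energy-efficient one if it finishes sooner. A small sketch of the arithmetic, with invented numbers that are not measurements from the paper:

    // Energy = average power x runtime: here the faster FFT draws more
    // power yet consumes less total energy. Numbers are illustrative only.
    #include <cstdio>

    int main() {
      struct Fft { const char* name; double watts; double seconds; };
      const Fft variants[] = {
          {"fft-A (lower power)", 95.0, 0.40},
          {"fft-B (faster)", 120.0, 0.22},  // more power, less time
      };
      for (const Fft& f : variants)
        std::printf("%-20s %6.1f W  %5.2f s  energy = %5.1f J\n",
                    f.name, f.watts, f.seconds, f.watts * f.seconds);
      return 0;
    }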
International Symposium on Networks-on-Chip | 2015
Amir Kavyan Ziabari; José L. Abellán; Yenai Ma; Ajay Joshi; David R. Kaeli
While both Chip MultiProcessors (CMPs) and Graphics Processing Units (GPUs) are many-core systems, they exhibit different memory access patterns. CMPs execute threads in parallel, where threads communicate and synchronize through the memory hierarchy (without any coalescing). GPUs, on the other hand, execute a large number of independent thread blocks whose accesses to memory are frequent and coalesced, resulting in a completely different access pattern. NoC designs for GPUs have not been extensively explored. In this paper, we first evaluate several NoC designs for GPUs to determine the most power/performance-efficient ones. To improve NoC energy efficiency, we explore an asymmetric NoC design tailored to a GPU's memory access pattern, providing one network for L1-to-L2 communication and a second for L2-to-L1 traffic. Our analysis shows that an asymmetric multi-network Cmesh provides the most energy-efficient communication fabric for our target GPU system.
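A toy model of the asymmetric two-network idea: L1-to-L2 requests and L2-to-L1 replies travel on separate fabrics, so each can be provisioned for its own message mix (small requests versus wide cache-line replies). All types and link widths below are illustrative assumptions, not the paper's parameters.

    // Two separately provisioned fabrics for the two traffic directions.
    #include <cstdint>
    #include <queue>

    struct Packet { uint32_t src, dst, bytes; };

    struct Network {
      uint32_t link_width_bytes;  // per-cycle link bandwidth (assumed)
      std::queue<Packet> in_flight;
      void send(const Packet& p) { in_flight.push(p); }
    };

    int main() {
      Network request_net{8};   // narrow: mostly addresses and commands
      Network reply_net{32};    // wide: 64-byte cache lines dominate
      request_net.send({0, 16, 8});  // L1 -> L2 read request
      reply_net.send({16, 0, 64});   // L2 -> L1 data reply
      return 0;
    }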
International Conference on Supercomputing | 2015
Amir Kavyan Ziabari; José L. Abellán; Rafael Ubal; Chao Chen; Ajay Joshi; David R. Kaeli
Silicon-photonic link technology promises to satisfy the growing need for high-bandwidth, low-latency, and energy-efficient network-on-chip (NoC) architectures. While silicon-photonic NoC designs have been extensively studied for future many-core systems, their use in massively threaded GPUs has received little attention to date. In this paper, we first analyze an electrical NoC that connects the different cache levels (L1 to L2) in a contemporary GPU memory hierarchy. Evaluating workloads from the AMD SDK on the Multi2Sim GPU simulator, we find that, apart from limits in memory bandwidth, an electrical NoC can significantly hamper performance and impede scalability, especially as the number of compute units grows in future GPU systems. To address this issue, we advocate using silicon-photonic link technology for on-chip communication in GPUs, and we present the first GPU-specific analysis of a cost-effective hybrid photonic crossbar NoC. Our baseline is an AMD Southern Islands GPU with 32 compute units (CUs), which we compare against our proposed hybrid silicon-photonic NoC. The proposed photonic hybrid NoC increases performance by up to 6x (2.7x on average) and reduces the energy-delay-squared product (ED2P) by up to 99% (79% on average) compared to conventional electrical crossbars. For future GPU systems, we study an electrical 2D-mesh topology, since it scales better than an electrical crossbar. For a 128-CU GPU, the proposed hybrid silicon-photonic NoC can improve performance by up to 1.9x (43% on average) and achieve up to a 62% reduction in ED2P (3% on average) in comparison to the best-performing mesh design.
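For reference, ED2P denotes the energy-delay-squared product, a standard metric that weights execution time more heavily than energy; a 99% reduction therefore means the photonic design's ED2P is one hundredth of the electrical crossbar's:

    \mathrm{ED^2P} = E \cdot t^2,
    \qquad
    \text{reduction} = 1 - \frac{\mathrm{ED^2P_{photonic}}}{\mathrm{ED^2P_{electrical}}}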
ACM Transactions on Architecture and Code Optimization | 2016
Amir Kavyan Ziabari; Yifan Sun; Yenai Ma; Dana Schaa; José L. Abellán; Rafael Ubal; John Kim; Ajay Joshi; David R. Kaeli
In this article, we describe how to ease memory management between a Central Processing Unit (CPU) and one or multiple discrete Graphics Processing Units (GPUs) by architecting a novel hardware-based Unified Memory Hierarchy (UMH). Adopting UMH, a GPU accesses CPU memory only if it does not find the required data in the directories associated with its high-bandwidth memory, or if the NMOESI coherence protocol restricts access to that data. Using UMH with NMOESI improves the performance of a CPU-multiGPU system by at least 1.92x in comparison to alternative software-based approaches. It also allows the CPU to access the GPU's modified data at least 13x faster.
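A rough sketch of the cache-line states involved, based on Multi2Sim's high-level description of NMOESI as the five MOESI states plus a non-coherent state for GPU stores; the transition shown is a simplified illustration under that assumption, not the full protocol table.

    // NMOESI-style line states: MOESI plus Non-coherent (simplified).
    enum class LineState {
      N,  // Non-coherent: GPU-written data not yet made coherent
      M,  // Modified: dirty, exclusive copy
      O,  // Owned: dirty but shared; this cache services requests
      E,  // Exclusive: clean, only copy
      S,  // Shared: clean, possibly replicated
      I,  // Invalid
    };

    // Illustrative rule: a GPU store may complete locally by moving the
    // line to N, deferring coherence actions until the data is merged.
    LineState gpu_store(LineState s) {
      switch (s) {
        case LineState::M:
        case LineState::O:
          return LineState::M;  // already dirty-owned; stay coherent
        default:
          return LineState::N;  // complete locally, reconcile later
      }
    }

    int main() {
      return gpu_store(LineState::S) == LineState::N ? 0 : 1;
    }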
IEEE International Conference on High Performance Computing, Data, and Analytics | 2014
Yash Ukidave; Amir Kavyan Ziabari; Perhaad Mistry; Gunar Schirner; David R. Kaeli
Graphics processing units (GPUs) have become widely accepted as the computing platform of choice in many high performance computing domains. Programming standards such as OpenCL are used to leverage the inherent parallelism offered by GPUs. Source code optimizations such as loop unrolling and tiling, when targeted at heterogeneous applications, have yielded large gains in performance. However, given the power consumption of GPUs, platforms can exhaust their power budgets quickly. Better solutions are needed to effectively exploit the power efficiency available on heterogeneous systems. In this work, we evaluate the power/performance efficiency of different optimizations used in heterogeneous applications. We analyze the power/performance trade-off by evaluating the energy consumption of the optimizations. We compare the performance of different optimization techniques on four different fast Fourier transform (FFT) implementations. Our study covers discrete GPUs, shared-memory GPUs (APUs), and low-power system-on-chip (SoC) devices, and includes hardware from AMD (Llano APUs and the Southern Islands GPU), Nvidia (Kepler), Intel (Ivy Bridge), and Qualcomm (Snapdragon S4) as test platforms. The study identifies the architectural and algorithmic factors that most impact power consumption. We explore a range of application optimizations that increase power consumption by 27% but deliver more than a 1.8x speedup. We observe up to an 18% reduction in power consumption due to reduced kernel calls across FFT implementations, and an 11% variation in energy consumption among different optimizations. We highlight how different optimizations can improve the execution performance of a heterogeneous application while also impacting its power efficiency. More importantly, we demonstrate that different algorithms implementing the same fundamental function (FFT) can perform vastly differently depending on the target hardware and associated application design.
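As a concrete example of the kind of source-level optimization studied, here is loop unrolling in plain C++ (illustrative; not one of the paper's FFT kernels). Unrolling trades a larger instruction footprint, and often higher power draw, for fewer branches and more instruction-level parallelism.

    // Baseline loop versus a 4-way unrolled version of the same reduction.
    #include <cstddef>

    float sum_baseline(const float* x, std::size_t n) {
      float s = 0.0f;
      for (std::size_t i = 0; i < n; ++i) s += x[i];
      return s;
    }

    float sum_unrolled4(const float* x, std::size_t n) {
      float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
      std::size_t i = 0;
      for (; i + 4 <= n; i += 4) {  // 4-way unrolled body
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
      }
      for (; i < n; ++i) s0 += x[i];  // remainder loop
      return (s0 + s1) + (s2 + s3);
    }

    int main() {
      const float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
      return sum_baseline(x, 8) == sum_unrolled4(x, 8) ? 0 : 1;
    }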
Design, Automation and Test in Europe | 2017
Mohammad Khavari Tavana; Amir Kavyan Ziabari; David R. Kaeli
Block-level cooperation is an endurance management technique that operates on top of error correction mechanisms to extend memory lifetimes. Once an error recovery scheme fails to recover from faults in a data block, the entire physical page associated with that block is disabled and becomes unavailable to the physical address space. To reduce the page waste caused by early block failures, other blocks can be used to support the failed block, working cooperatively to keep it alive and extend the faulty page's lifetime. We combine the proposed technique with existing error recovery schemes, such as Error Correction Pointers (ECP) and Aegis, to increase memory lifetimes. With ECP, block cooperation is realized through metadata sharing, where one data block shares its unused metadata with another data block. When combined with Aegis, block cooperation is realized by reorganizing the data layout, where blocks possessing few faults come to the aid of failed blocks, bringing them back from the dead. Employing block cooperation at a single level (or multiple levels) on top of ECP and Aegis, we can increase memory lifetimes by 28% (37%) and 8% (14%) on average, respectively.
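A sketch of how metadata sharing over ECP might look, assuming ECP-6-style blocks with six correction pointers each (a pointer records a failed cell plus its replacement bit). Field names and the pairing policy are illustrative assumptions, not the paper's design.

    // A block that exhausts its own ECP entries borrows unused ones from a
    // partner block instead of letting the whole page be retired.
    #include <array>
    #include <cstdint>

    struct EcpEntry {
      uint16_t failed_bit = 0;   // index of the worn-out cell in the block
      bool replacement = false;  // value served in place of the dead cell
      bool used = false;
    };

    struct Block {
      std::array<EcpEntry, 6> ecp;  // ECP-6: six pointers per block

      bool correct(uint16_t bit, bool value) {
        for (auto& e : ecp)
          if (!e.used) { e = {bit, value, true}; return true; }
        return false;  // local correction capacity exhausted
      }
    };

    // Block-level cooperation: fall back to the partner's spare entries.
    bool cooperative_correct(Block& self, Block& partner,
                             uint16_t bit, bool value) {
      return self.correct(bit, value) || partner.correct(bit, value);
    }

    int main() {
      Block failing, healthy;
      for (uint16_t b = 0; b < 7; ++b)  // 7th fault spills into the partner
        cooperative_correct(failing, healthy, b, true);
      return 0;
    }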
Symposium on Code Generation and Optimization | 2017
Xiang Gong; Zhongliang Chen; Amir Kavyan Ziabari; Rafael Ubal; David R. Kaeli
As throughput-oriented accelerators, GPUs provide tremendous processing power by running a massive number of threads in parallel. However, exploiting high degrees of thread-level parallelism (TLP) does not always translate to the peak performance that GPUs can offer, often leaving GPU resources under-utilized. Compared to compute resources, memory resources can tolerate considerably lower levels of TLP due to hardware bottlenecks. Unfortunately, this tolerance is not effectively exploited by the Single Instruction Multiple Thread (SIMT) execution model employed by current GPU compute frameworks. Under the SIMT execution model, GPU applications tend to send bursts of memory requests that compete for GPU memory resources. Traditionally, hardware units, such as the wavefront scheduler, are used to manage such requests. However, the scheduler struggles when the number of computational operations is too low to effectively hide the long latency of memory operations. In this paper, we propose the Twin Kernel Multiple Thread (TKMT) execution model, a compiler-centric solution that improves hardware scheduling at compile time. TKMT better distributes bursts of memory requests across wavefronts through static instruction scheduling. Our results show that TKMT offers a 12% average improvement over the baseline SIMT implementation on a variety of benchmarks on AMD Radeon systems.
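The twin-kernel idea, illustrated in CPU code: the same work under two static schedules. If half the wavefronts ran schedule A (loads grouped) and half ran schedule B (loads spread between ALU operations), their memory bursts would no longer coincide. This is purely illustrative; TKMT itself operates on GPU binaries, not C++.

    // Identical computation, two static instruction schedules.
    #include <cstddef>

    float schedule_a(const float* x, float* y, std::size_t i) {
      float a = x[i];          // both loads issued back to back:
      float b = x[i + 1];      // a burst of memory traffic
      float t = a * a + 1.0f;  // ALU work only after the burst
      y[i] = t + b;
      return t;
    }

    float schedule_b(const float* x, float* y, std::size_t i) {
      float a = x[i];          // first load
      float t = a * a + 1.0f;  // ALU work overlaps the second load
      float b = x[i + 1];      // second load issued later
      y[i] = t + b;
      return t;
    }

    int main() {
      const float x[4] = {1, 2, 3, 4};
      float y[4] = {};
      schedule_a(x, y, 0);
      schedule_b(x, y, 2);
      return 0;
    }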
International Workshop on OpenCL | 2015
Amir Kavyan Ziabari; Rafael Ubal Tena; Dana Schaa; David R. Kaeli
Evaluating parallel and heterogeneous programs written in OpenCL can be challenging. Commonly, simulators are used to aid the programmer in this regard. One of the fundamental requirements of any simulator is to provide traces, reports, and debugging information in a coherent and unambiguous format. Although these traces and reports contain a great deal of detailed information about the logical and physical transactions within a simulated structure, they are usually extremely large and hard to analyze. What is needed is an appropriate visualization tool to accompany the simulator and make the OpenCL execution process easier to understand and analyze. In this tutorial, we present M2S-Visual, an interactive cycle-by-cycle trace-driven visualization tool and a complementary addition to Multi2Sim (M2S). M2S is an established simulator, designed with an emphasis on running OpenCL applications without any source code modifications. The simulation of a complete OpenCL application occurs seamlessly by launching vendor-compliant host and device binaries. The Multi2Sim emulator provides traces of Intel x86 CPU and AMD Southern Islands (as well as AMD Evergreen) GPU instructions, and the detailed simulator tracks execution times and the state of architectural components in both host and device. M2S-Visual complements the simulator by presenting a visual representation of running instructions and the state of the architectural components through a user-friendly GUI. During the execution of an OpenCL application, M2S-Visual captures and represents the state of CPU and GPU software entities (i.e., contexts, work-groups, wavefronts, and work-items), memory entities (i.e., accesses, sharers, and owners), and network entities (i.e., messages and packets), along with the state of CPU and GPU hardware resources (i.e., cores and compute units), the memory hierarchy (i.e., L1 cache, L2 cache, and main memory), and network resources (i.e., nodes, buses, links, and buffers). We designed the M2S-Visual tool to support the research community by providing deep analysis of the performance of OpenCL programs. We also introduce other new visualization options (through statistical graphs) in M2S that provide further details on OpenCL application characteristics and the utilization of system resources. These include plots that reveal the occupancy of compute units based on static and run-time characteristics of the executed OpenCL kernels, histograms that present the memory access patterns of OpenCL applications, plots that characterize the network traffic generated by transactions between memory modules during an OpenCL application's execution, and plots that reveal the utilization of network resources (such as links and buses) after the application's execution is complete. The tutorial is organized in two parts, covering full-system visualization of OpenCL application execution via M2S-Visual, and characterization of an OpenCL application's impact on system resources using the generated static graphs. Each section is accompanied by simulation examples using working demos. All material to reproduce these demos, as well as the tutorial slides, will be available on the tutorial website at http://www.multi2sim.org/conferences/iwocl-2015.html.