Berkin Akin
Carnegie Mellon University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Berkin Akin.
ieee international d systems integration conference | 2013
Qiuling Zhu; Berkin Akin; H. Ekin Sumbul; Fazle Sadi; James C. Hoe; Lawrence T. Pileggi; Franz Franchetti
This paper introduces a 3D-stacked logic-in-memory (LiM) system that integrates the 3D die-stacked DRAM architecture with the application-specific LiM IC to accelerate important data-intensive computing. The proposed system comprises a fine-grained rank-level 3D die-stacked DRAM device and extra LiM layers implementing logic-enhanced SRAM blocks that are dedicated to a particular application. Through silicon vias (TSVs) are used for vertical interconnections providing the required bandwidth to support the high performance LiM computing. We performed a comprehensive 3D DRAM design space exploration and exploit the efficient architectures to accelerate the computing that can balance the performance and power. Our experiments demonstrate orders of magnitude of performance and power efficiency improvements compared with the traditional multithreaded software implementation on modern CPU.
international symposium on computer architecture | 2015
Berkin Akin; Franz Franchetti; James C. Hoe
In this paper we focus on common data reorganization operations such as shuffle, pack/unpack, swap, transpose, and layout transformations. Although these operations simply relocate the data in the memory, they are costly on conventional systems mainly due to inefficient access patterns, limited data reuse and roundtrip data traversal throughout the memory hierarchy. This paper presents a two pronged approach for efficient data reorganization, which combines (i) a proposed DRAM-aware reshape accelerator integrated within 3D-stacked DRAM, and (ii) a mathematical framework that is used to represent and optimize the reorganization operations. We evaluate our proposed system through two major use cases. First, we demonstrate the reshape accelerator in performing a physical address remapping via data layout transform to utilize the internal parallelism/locality of the 3D-stacked DRAM structure more efficiently for general purpose workloads. Then, we focus on offloading and accelerating commonly used data reorganization routines selected from the Intel Math Kernel Library package. We evaluate the energy and performance benefits of our approach by comparing it against existing optimized implementations on state-of-the-art GPUs and CPUs. For the various test cases, in-memory data reorganization provides orders of magnitude performance and energy efficiency improvements via low overhead hardware.
field-programmable custom computing machines | 2012
Berkin Akin; Peter A. Milder; Franz Franchetti; James C. Hoe
Prevailing VLSI trends point to a growing gap between the scaling of on-chip processing throughput and off-chip memory bandwidth. An efficient use of memory bandwidth must become a first-class design consideration in order to fully utilize the processing capability of highly concurrent processing platforms like FPGAs. In this paper, we present key aspects of this challenge in developing FPGA-based implementations of two-dimensional fast Fourier transform (2D-FFT) where the large datasets must reside off-chip in DRAM. Our scalable implementations address the memory bandwidth bottleneck through both (1) algorithm design to enable efficient DRAM access patterns and (2) data path design to extract the maximum compute throughput for a given level of memory bandwidth. We present results for double-precision 2D-FFT up to size 2,048-by-2,048. On an Alter a DE4 platform our implementation of the 2,048-by-2,048 2D-FFT can achieve over 19.2 Gflop/s from the 12 GByte/s maximum DRAM bandwidth available. The results also show that our FPGA-based implementations of 2D-FFT are more efficient than 2D-FFT running on state-of-the-art CPUs and GPUs in terms of the bandwidth and power efficiency.
application-specific systems, architectures, and processors | 2014
Berkin Akin; Franz Franchetti; James C. Hoe
As technology scaling is reaching its limits, pointing to the well-known memory and power wall problems, achieving high-performance and energy-efficient systems is becoming a significant challenge. Especially for data-intensive computing, efficient utilization of the memory subsystem is the key to achieve high performance and energy efficiency.We address this challenge in DRAM-optimized hardware accelerators for 1D, 2D and 3D fast Fourier transforms (FFT) on large datasets. When the dataset has to be stored in external DRAM, the main challenge for FFT algorithm design lies in reshaping DRAM-unfriendly memory access patterns to eliminate excessive DRAM row buffer misses. More importantly, these algorithms need to be carefully mapped to the targeted platforms architecture, particularly the memory subsystem, to fully utilize performance and energy efficiency potentials. We use automatic design generation techniques to consider a family of DRAM-optimized FFT algorithms and their hardware implementation design space. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space given various problem (single/double precision 1D, 2D and 3D FFTs) and hardware platform (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.) parameters. We show that generated pareto-optimal designs can yield up to 5.5× energy consumption and order of magnitude memory bandwidth utilization improvements in DRAM, which lead to overall system performance and power efficiency improvements of up to 6× and 6.5× respectively over conventional row-column FFT algorithms.
ieee high performance extreme computing conference | 2014
Berkin Akin; James C. Hoe; Franz Franchetti
Memory layout transformations via data reorganization are very common operations, which occur as a part of the computation or as a performance optimization in data-intensive applications. These operations require inefficient memory access patterns and roundtrip data movement through the memory hierarchy, failing to utilize the performance and energy-efficiency potentials of the memory subsystem. This paper proposes a high-bandwidth and energy-efficient hardware accelerated memory layout transform (HAMLeT) system integrated within a 3D-stacked DRAM. HAMLeT uses a low-overhead hardware that exploits the existing infrastructure in the logic layer of 3D-stacked DRAMs, and does not require any changes to the DRAM layers, yet it can fully exploit the locality and parallelism within the stack by implementing efficient layout transform algorithms. We analyze matrix layout transform operations (such as matrix transpose, matrix blocking and 3D matrix rotation) and demonstrate that HAMLeT can achieve close to peak system utilization, offering up to an order of magnitude performance improvement compared to the CPU and GPU memory subsystems which does not employ HAMLeT.
international conference on acoustics, speech, and signal processing | 2014
Berkin Akin; Franz Franchetti; James C. Hoe
Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of the inefficient strided memory access patterns. These inefficient access patterns need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data transfers, and derive memory access pattern efficient algorithms using custom block data layouts. Using the Kronecker product formalism, we integrate our optimizations into Spiral framework. In our evaluations we demonstrate that Spiral generated hardware designs achieve close to theoretical peak performance of the targeted platform and offer significant speed-up (up to 6.5x) compared to naive baseline algorithms.
IEEE Micro | 2016
Berkin Akin; Franz Franchetti; James C. Hoe
3D-stacked integration of DRAM and logic layers using through-silicon via (TSV) technology has given rise to a new interpretation of near-data processing (NDP) concepts that were proposed decades ago. However, processing capability within the stack is limited by stringent power and thermal constraints. Simple processing mechanisms with intensive memory accesses, such as data reorganization, are an effective means of exploiting 3D stacking-based NDP. Data reorganization handled completely in memory improves the host processors memory access performance. However, in-memory data reorganization performed in parallel with host memory accesses raises issues, including interference, bandwidth allocation, and coherence. Previous work has mainly focused on performing data reorganization while blocking host accesses. This article details data reorganization performed in parallel with host memory accesses, providing mechanisms to address host/NDP interference, flexible bandwidth allocation, and in-memory coherence.
ieee high performance extreme computing conference | 2014
Fazle Sadi; Berkin Akin; Doru-Thom Popovici; James C. Hoe; Lawrence T. Pileggi; Franz Franchetti
Real-time system level implementations of complex Synthetic Aperture Radar (SAR) image reconstruction algorithms have always been challenging due to their data intensive characteristics. In this paper, we propose a basis vector transform based novel algorithm to alleviate the data intensity and a 3D-stacked logic in memory based hardware accelerator as the implementation platform. Experimental results indicate that this proposed algorithm/hardware co-optimized system can achieve an accuracy of 91 dB PSNR compared to a reference algorithm implemented in Matlab and energy efficiency of 72 GFLOPS/W for a 8k×8k SAR image reconstruction.
signal processing systems | 2016
Berkin Akin; Franz Franchetti; James C. Hoe
Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of the inefficient strided memory access patterns. These inefficient access patterns need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data transfers, and derive memory access pattern efficient algorithms using custom block data layouts. These algorithms need to be carefully mapped to the targeted platform’s architecture, particularly the memory subsystem, to fully utilize performance and energy efficiency potentials. Using the Kronecker product formalism, we integrate our optimizations into Spiral framework and evaluate a family of DRAM-optimized FFT algorithms and their hardware implementation design space via automated techniques. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space given various problem (single/double precision 1D, 2D and 3D FFTs) and hardware platform (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.) parameters. We show that Spiral generated pareto optimal designs can achieve close to theoretical peak performance of the targeted platform offering 6x and 6.5x system performance and power efficiency improvements respectively over conventional row-column FFT algorithms.
international symposium on microarchitecture | 2015
Qi Guo; Tze Meng Low; Nikolaos Alachiotis; Berkin Akin; Lawrence T. Pileggi; James C. Hoe; Franz Franchetti
Over the last decade, the looming power wall has spurred a flurry of interest in developing heterogeneous systems with hardware accelerators. The questions, then, are what and how accelerators should be designed, and what software support is required. Our accelerator design approach stems from the observation that many efficient and portable software implementations rely on high performance software libraries with well-established application programming interfaces (APIs). We propose the integration of hardware accelerators on 3D-stacked memory that explicitly targets the memory-bounded operations within high performance libraries. The fixed APIs with limited configurability simplify the design of the accelerators, while ensuring that the accelerators have wide applicability. With our software support that automatically converts library APIs to accelerator invocations, an additional advantage of our approach is that library-based legacy code automatically gains the benefit of memory-side accelerators without requiring a reimplementation. On average, the legacy code using our proposed MEmory Accelerated Library (MEALib) improves performance and energy efficiency for individual operations in Intels Math Kernel Library (MKL) by 38× and 75×, respectively. For a real-world signal processing application that employs Intel MKL, MEALib attains more than 10× better energy efficiency.