Yasuko Eckert | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Yasuko Eckert is active.

Explore More

Publication

Featured researches published by Yasuko Eckert.

high-performance computer architecture | 2014

Increasing TLB reach by exploiting clustering in page translations

Binh Pham; Abhishek Bhattacharjee; Yasuko Eckert; Gabriel H. Loh

The steadily increasing sizes of main memory capacities require corresponding increases in the processors translation lookaside buffer (TLB) resources to avoid performance bottlenecks. Large operating system page sizes can mitigate the bottleneck with a smaller TLB, but most OSs and applications do not fully utilize the large-page support in current hardware. Recent work has shown that, while not guaranteed, some virtual-to-physical page mappings exhibit “contiguous” spatial locality in which consecutive virtual pages map to consecutive physical pages. Such locality provides opportunities to coalesce “adjacent” TLB entries for increased reach. We observe that beyond simple adjacent-entry coalescing, many more translations exhibit “clustered” spatial locality in which a group or cluster of nearby virtual pages map to a similarly clustered set of physical pages. In this work, we provide a detailed characterization of the spatial locality among the virtual-to-physical translations. Based on this characterization, we present a multi-granular TLB organization that significantly increases its effective reach and reduces miss rates substantially while requiring no additional OS support. Our evaluation shows that the multi-granular design outperforms conventional TLBs and the recently proposed coalesced TLBs technique.

high-performance computer architecture | 2017

Design and Analysis of an APU for Exascale Computing

Thiruvengadam Vijayaraghavany; Yasuko Eckert; Gabriel H. Loh; Michael J. Schulte; Mike Ignatowski; Bradford M. Beckmann; William C. Brantley; Joseph L. Greathouse; Wei Huang; Arun Karunanithi; Onur Kayiran; Mitesh R. Meswani; Indrani Paul; Matthew Poremba; Steven E. Raasch; Steven K. Reinhardt; Greg Sadowski; Vilas Sridharan

The challenges to push computing to exaflop levels are difficult given desired targets for memory capacity, memory bandwidth, power efficiency, reliability, and cost. This paper presents a vision for an architecture that can be used to construct exascale systems. We describe a conceptual Exascale Node Architecture (ENA), which is the computational building block for an exascale supercomputer. The ENA consists of an Exascale Heterogeneous Processor (EHP) coupled with an advanced memory system. The EHP provides a high-performance accelerated processing unit (CPU+GPU), in-package high-bandwidth 3D memory, and aggressive use of die-stacking and chiplet technologies to meet the requirements for exascale computing in a balanced manner. We present initial experimental analysis to demonstrate the promise of our approach, and we discuss remaining open research challenges for the community.

Proceedings of the 2015 International Symposium on Memory Systems | 2015

Interconnect-Memory Challenges for Multi-chip, Silicon Interposer Systems

Gabriel H. Loh; Natalie D. Enright Jerger; Ajaykumar Kannan; Yasuko Eckert

Silicon interposer technology is promising for large-scale integration of memory within a processor package. While past work on vertical, 3D-stacked memory allows a stack of memory to be placed directly on top of a processor, the total amount of memory that could be integrated is limited by the size of the processor die. With silicon interposers, multiple memory stacks can be integrated inside the processor package, thereby increasing both the capacity and the bandwidth provided by the 3D memory. However, the full potential of all of this integrated memory may be squandered if the in-package interconnect architecture cannot keep up with the data rates provided by the multiple memory stacks. This position paper describes key issues in providing the interconnect support for aggressive interposer-based memory integration, and argues for additional research efforts to address these challenges to enable integrated memory to deliver its full value.

measurement and modeling of computer systems | 2014

A comparison of core power gating strategies implemented in modern hardware

Manish Arora; Srilatha Manne; Yasuko Eckert; Indrani Paul; Nuwan Jayasena; Dean M. Tullsen

Idle power is a significant contributor to overall energy consumption in modern multi-core processors. Cores can enter a full-sleep state, also known as C6, to reduce idle power; however, entering C6 incurs performance and power overheads. Since power gating can result in negative savings, hardware vendors implement various algorithms to manage C6 entry. In this paper, we examine state-of-the-art C6 entry algorithms and present a comparative analysis in the context of consumer and CPU-GPU benchmarks.

international symposium on low power electronics and design | 2012

Something old and something new: P-states can borrow microarchitecture techniques too

Yasuko Eckert; Srilatha Manne; Michael J. Schulte; David A. Wood

The limited utility of voltage scaling in nano-scale technologies has led high-performance processors to rely increasingly on frequency scaling for power management. However, frequency scaling provides only a linear dynamic power reduction. In this paper, we make a case for dynamically disabling performance optimizations, leveraging previously proposed low-power techniques, for more efficient power-performance trade-offs. By carefully selecting which optimizations to turn off, our lowest P-state consumes less than half the power achieved by frequency scaling, on average, for comparable performance. For all workloads, our approach performs as well or better than DVFS, demonstrating the effectiveness of our approach.

international conference on supercomputing | 2016

Prefetching Techniques for Near-memory Throughput Processors

Reena Panda; Yasuko Eckert; Nuwan Jayasena; Onur Kayiran; Michael W. Boyer; Lizy Kurian John

Near-memory processing or processing-in-memory (PIM) is regaining a lot of interest recently as a viable solution to overcome the challenges imposed by memory wall. This trend has been mainly fueled by the emergence of 3D-stacked memories. GPUs are touted as great candidates for in-memory processors due to their superior bandwidth utilization capabilities. Although putting a GPU core beneath memory exposes it to unprecedented memory bandwidth, in this paper, we demonstrate that significant opportunities still exist to improve the performance of the simpler, in-memory GPU processors (GPU-PIM) by improving their memory performance. Thus, we propose three light-weight, practical memory-side prefetchers to improve the performance of GPU-PIM systems. The proposed prefetchers exploit the patterns in individual memory accesses and synergy in the wavefront-localized memory streams, combined with a better understanding of the memory-system state, to prefetch from DRAM row buffers into on-chip prefetch buffers, thereby achieving over 75% prefetcher accuracy and 40% improvement in row buffer locality. In order to maximize utilization of prefetched data and minimize thrashing, the prefetchers also use a novel prefetch buffer management policy based on a unique dead-row prediction mechanism together with an eviction-based prefetch-trigger policy to control their aggressiveness. The proposed prefetchers improve performance by over 60% (max) and 9% on average as compared to the baseline, while achieving over 33% of the performance benefits of perfect-L2 using less than 5.6KB of additional hardware. The proposed prefetchers also outperform the state-of-the-art memory-side prefetcher, OWL by more than 20%.

ACM Transactions on Architecture and Code Optimization | 2018

CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems

Hyojong Kim; Ramyad Hadidi; Lifeng Nai; Hyesoon Kim; Nuwan Jayasena; Yasuko Eckert; Onur Kayiran; Gabriel H. Loh

To exploit parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques that have been used to hide memory latency and improve thread-level parallelism (TLP), memory interleaving, and thread block scheduling, in traditional GPU systems are at odds with efficient use of multiple GPUs. Distributing data across multiple GPUs to improve overall memory bandwidth utilization incurs high remote traffic when the data and compute are misaligned. Nondeterministic thread block scheduling to improve compute resource utilization impedes co-placement of compute and data. Our goal in this work is to enable co-placement of compute and data in the presence of fine-grained interleaved memory with a low-cost approach. To this end, we propose a mechanism that identifies exclusively accessed data and place the data along with the thread block that accesses it in the same GPU. The key ideas are (1) the amount of data exclusively used by a thread block can be estimated, and that exclusive data (of any size) can be localized to one GPU with coarse-grained interleaved pages; (2) using the affinity-based thread block scheduling policy, we can co-place compute and data together; and (3) by using dual address mode with lightweight changes to virtual to physical page mappings, we can selectively choose different interleaved memory pages for each data structure. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces 38% remote traffic over a baseline system.

Archive | 2012