
Hotspot


Dive into the research topics where Mitesh R. Meswani is active.

Publication


Featured research published by Mitesh R. Meswani.


High-Performance Computer Architecture | 2015

Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories

Mitesh R. Meswani; Sergey Blagodurov; David A. Roberts; John Slice; Mike Ignatowski; Gabriel H. Loh

Die-stacked DRAM is a technology that will soon be integrated in high-performance systems. Recent studies have focused on hardware caching techniques to make use of the stacked memory, but these approaches require complex changes to the processor and also cannot leverage the stacked memory to increase the system's overall memory capacity. In this work, we explore the challenges of exposing the stacked DRAM as part of the system's physical address space. This non-uniform memory access (NUMA)-style approach greatly simplifies the hardware and increases the physical memory capacity of the system, but pushes the burden of managing the heterogeneous memory architecture (HMA) to the software layers. We first explore simple (and somewhat impractical) schemes to manage the HMA, and then refine the mechanisms to address a variety of hardware and software implementation challenges. In the end, we present an HMA approach with low hardware and software impact that can dynamically tune itself to different application scenarios, achieving performance even better than the (impractical-to-implement) baseline approaches.
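The core software mechanism the abstract describes, periodically migrating the hottest pages into the fast tier, can be sketched roughly as follows. This is an illustrative model only, not the paper's actual design; the class name, epoch structure, and "count and pick the top-k" policy are all assumptions.

```python
# Illustrative sketch (not the paper's design): an epoch-based hot-page
# manager for a flat-address-space HMA with a small fast tier (die-stacked
# DRAM) and a large slow tier (off-package memory).

from collections import Counter

class HotPageManager:
    def __init__(self, fast_capacity_pages):
        self.fast_capacity = fast_capacity_pages
        self.fast_pages = set()      # pages currently in the fast tier
        self.counts = Counter()      # per-page access counts for this epoch

    def access(self, page):
        self.counts[page] += 1

    def end_epoch(self):
        # Migrate the most-accessed pages of the epoch into the fast tier.
        hottest = [p for p, _ in self.counts.most_common(self.fast_capacity)]
        self.fast_pages = set(hottest)
        self.counts.clear()          # restart counting for the next epoch
        return self.fast_pages
```

A real system must also bound migration bandwidth and handle ties and cold-start, which is part of why the paper calls the naive schemes impractical.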


Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness | 2013

A new perspective on processing-in-memory architecture design

Dong Ping Zhang; Nuwan Jayasena; Alexander Lyashevsky; Joseph L. Greathouse; Mitesh R. Meswani; Mark Nutter; Mike Ignatowski

As computation becomes increasingly limited by data movement and energy consumption, exploiting locality throughout the memory hierarchy becomes critical for maintaining the performance scaling that many have come to expect from the computing industry. Moving computation closer to main memory presents an opportunity to reduce the overheads associated with data movement. We explore the potential of using 3D die stacking to move memory-intensive computations closer to memory. This approach to processing-in-memory addresses some drawbacks of prior research on in-memory computing and appears commercially viable in the foreseeable future. We show promising early results from this approach and identify areas that are in need of research to unlock its full potential.


International Conference on Big Data | 2014

Efficient breadth-first search on a heterogeneous processor

Mayank Daga; Mark Nutter; Mitesh R. Meswani

Accelerating breadth-first search (BFS) can be a compelling value-add given its pervasive deployment. The current state-of-the-art hybrid BFS algorithm selects different traversal directions based on graph properties, thereby possessing heterogeneous characteristics. Related work has studied this heterogeneous BFS algorithm on homogeneous processors. In recent years, heterogeneous processors have become mainstream due to their ability to maximize performance under restrictive thermal budgets. However, current software fails to fully leverage the heterogeneous capabilities of the modern processor, lagging behind hardware advancements. We propose a “hybrid++” BFS algorithm for an accelerated processing unit (APU), a heterogeneous processor which fuses the CPU and GPU cores on a single die. Hybrid++ leverages the strength of CPUs and GPUs for serial and data-parallel execution, respectively, to carefully partition BFS by selecting the appropriate execution core and graph-traversal direction for every search iteration. Our results illustrate that on a variety of graphs ranging from social networks to road networks, hybrid++ yields a speedup of up to 2× compared to the multithreaded hybrid algorithm. Execution of hybrid++ on the APU is also 2.3× more energy efficient than that on a discrete GPU.
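The direction-switching idea behind the hybrid BFS that hybrid++ builds on can be sketched as below. This is a simplified assumption-laden model: the switch threshold is made up, and the paper's additional per-iteration choice of execution core (CPU vs. GPU) is not modeled.

```python
# Sketch of direction-switching hybrid BFS: top-down expansion while the
# frontier is small, bottom-up parent search once it grows large.
# `switch_threshold` is an illustrative assumption, not the paper's value.

def hybrid_bfs(adj, source, switch_threshold=0.1):
    n = len(adj)                     # adj: undirected adjacency lists
    dist = {source: 0}
    frontier = {source}
    level = 0
    while frontier:
        level += 1
        next_frontier = set()
        if len(frontier) < switch_threshold * n:
            # Top-down: expand each frontier vertex's neighbors.
            for u in frontier:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = level
                        next_frontier.add(v)
        else:
            # Bottom-up: each unvisited vertex looks for a frontier parent.
            for v in range(n):
                if v not in dist and any(u in frontier for u in adj[v]):
                    dist[v] = level
                    next_frontier.add(v)
        frontier = next_frontier
    return dist
```

Top-down work scales with the frontier's edges, bottom-up with the unvisited vertices, which is why the crossover point matters on social-network graphs with huge mid-search frontiers.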


High-Performance Computer Architecture | 2017

Design and Analysis of an APU for Exascale Computing

Thiruvengadam Vijayaraghavan; Yasuko Eckert; Gabriel H. Loh; Michael J. Schulte; Mike Ignatowski; Bradford M. Beckmann; William C. Brantley; Joseph L. Greathouse; Wei Huang; Arun Karunanithi; Onur Kayiran; Mitesh R. Meswani; Indrani Paul; Matthew Poremba; Steven E. Raasch; Steven K. Reinhardt; Greg Sadowski; Vilas Sridharan

Pushing computing to exaflop levels is challenging given the desired targets for memory capacity, memory bandwidth, power efficiency, reliability, and cost. This paper presents a vision for an architecture that can be used to construct exascale systems. We describe a conceptual Exascale Node Architecture (ENA), which is the computational building block for an exascale supercomputer. The ENA consists of an Exascale Heterogeneous Processor (EHP) coupled with an advanced memory system. The EHP provides a high-performance accelerated processing unit (CPU+GPU), in-package high-bandwidth 3D memory, and aggressive use of die-stacking and chiplet technologies to meet the requirements for exascale computing in a balanced manner. We present initial experimental analysis to demonstrate the promise of our approach, and we discuss remaining open research challenges for the community.


International Conference on Parallel Architectures and Compilation Techniques | 2016

POSTER: SILC-FM: Subblocked InterLeaved Cache-Like Flat Memory Organization

Jee Ho Ryoo; Mitesh R. Meswani; Reena Panda; Lizy Kurian John

With current DRAM technology reaching its limit, emerging heterogeneous memory systems have become attractive to keep memory performance scaling. This paper argues for using a small, fast memory closer to the processor as part of a flat address space where the memory system is composed of two or more memory types. OS-transparent management of such memory has been proposed in prior works such as CAMEO and Part of Memory (PoM). Data migration is typically handled either at coarse granularity with high bandwidth overheads (as in PoM) or at fine granularity with low hit rate (as in CAMEO). Prior work uses restricted address mapping from only congruence groups in order to simplify the mapping. At any time, only one page (block) from a congruence group is resident in the fast memory. In this paper, we present a flat address space organization called SILC-FM that uses large granularity but allows subblocks from two pages to coexist in an interleaved fashion in fast memory. Data movement is done at subblock granularity, avoiding fetching of useless subblocks and consuming less bandwidth compared to migrating the entire large block. SILC-FM can get more spatial locality hits than CAMEO and PoM due to page-level operation and interleaving blocks, respectively. The interleaved subblock placement improves performance by 55% on average over a static placement scheme without data migration. We also selectively lock hot blocks to prevent them from being involved in the hardware swapping operations. Additional features such as locking, associativity, and bandwidth balancing improve performance by 11%, 8%, and 8% respectively, resulting in a total of 82% performance improvement over the no-migration static placement scheme. Compared to the best state-of-the-art scheme, SILC-FM gets a performance improvement of 36% with 13% energy savings.
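The key structural idea, subblocks from different pages interleaved within one fast-memory frame, can be modeled with a tiny sketch. Everything here (class name, subblock count, eviction-on-miss policy) is a hypothetical simplification, not SILC-FM's actual hardware organization.

```python
# Illustrative model: a fast-memory frame divided into subblocks, where
# subblocks belonging to different pages of the same congruence group may
# coexist, interleaved, instead of one page owning the whole frame.

SUBBLOCKS_PER_PAGE = 4  # assumed subblocking factor

class InterleavedFrame:
    def __init__(self):
        # owner[i] = page whose i-th subblock currently sits in fast memory
        self.owner = [None] * SUBBLOCKS_PER_PAGE

    def access(self, page, subblock):
        """Return True on a fast-memory hit; on a miss, migrate only the
        requested subblock (not the whole large block)."""
        if self.owner[subblock] == page:
            return True
        self.owner[subblock] = page   # evicts whichever page held this slot
        return False
```

Because only the touched subblock moves, a miss costs one subblock of bandwidth rather than a full large-block swap, which is the bandwidth advantage the abstract claims over PoM-style coarse migration.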


Proceedings of the 2015 International Symposium on Memory Systems | 2015

Towards Workload-Aware Page Cache Replacement Policies for Hybrid Memories

Ahsen J. Uppal; Mitesh R. Meswani

Die-stacked DRAM is an emerging technology that is expected to be integrated in future systems with off-package memories, resulting in a hybrid memory system. A large body of recent research has investigated the use of die-stacked dynamic random-access memory (DRAM) as a hardware-managed last-level cache. This approach comes at the costs of managing large tag arrays, increased hit latencies, and potentially significant increases in hardware verification costs. An alternative approach is for the operating system (OS) to manage the die-stacked DRAM as a page cache for off-package memories. However, recent work in OS-managed page caches focuses on FIFO replacement and related variants as the baseline management policy. In this paper, we take a step back and investigate classical OS page replacement policies and re-evaluate them for hybrid memories. We find that when we use different die-stacked DRAM sizes, the choice of best management policy depends on cache size and application, and can result in as much as a 13X performance difference. Furthermore, within a single application run, the choice of best policy varies over time. We also evaluate co-scheduled workload pairs and find that the best policy varies by workload pair and cache configuration, and that the best-performing policy is typically the most fair. Our research motivates us to continue our investigation for developing workload-aware and cache-configuration-aware page cache management policies.
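The kind of comparison the paper performs, classical replacement policies re-evaluated on the same trace, can be illustrated with a minimal harness. This is a hypothetical sketch, not the paper's simulator; it just shows why FIFO (the common baseline) and LRU can diverge on a given access pattern.

```python
# Minimal sketch: count page-cache hits for FIFO vs. LRU replacement on
# one access trace, to illustrate workload-dependent policy choice.

from collections import OrderedDict, deque

def hits_fifo(trace, capacity):
    cache, order, hits = set(), deque(), 0
    for page in trace:
        if page in cache:
            hits += 1
        else:
            if len(cache) == capacity:
                cache.discard(order.popleft())  # evict oldest insertion
            cache.add(page)
            order.append(page)
    return hits

def hits_lru(trace, capacity):
    cache, hits = OrderedDict(), 0
    for page in trace:
        if page in cache:
            hits += 1
            cache.move_to_end(page)          # refresh recency on a hit
        else:
            if len(cache) == capacity:
                cache.popitem(last=False)    # evict least recently used
            cache[page] = True
    return hits
```

On a trace that repeatedly re-touches one hot page, LRU keeps it resident while FIFO eventually evicts it; on a streaming trace the two tie, which is the per-workload variation the paper measures.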


High-Performance Computer Architecture | 2017

MemPod: A Clustered Architecture for Efficient and Scalable Migration in Flat Address Space Multi-level Memories

Andreas Prodromou; Mitesh R. Meswani; Nuwan Jayasena; Gabriel H. Loh; Dean M. Tullsen

In the near future, die-stacked DRAM will be increasingly present in conjunction with off-chip memories in hybrid memory systems. Research on this subject revolves around using the stacked memory as a cache or as part of a flat address space. This paper proposes MemPod, a scalable and efficient memory management mechanism for flat address space hybrid memories. MemPod monitors memory activity and periodically migrates the most frequently accessed memory pages to the faster on-chip memory. MemPod's partitioned architectural organization allows for efficient scaling with memory system capabilities. Further, a big data analytics algorithm is adapted to develop an efficient, low-cost activity tracking technique. MemPod improves the average main memory access time of multi-programmed workloads by up to 29% (9% on average) compared to the state of the art, and that improvement will increase as the differential between memory speeds widens. MemPod's novel activity tracking approach leads to significant cost reduction (12800x lower storage space requirements) and improved future prediction accuracy over prior work, which maintains a separate counter per page.
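The space saving over one-counter-per-page tracking comes from a streaming frequent-items technique. As an assumption for illustration, the sketch below uses the classic Misra-Gries algorithm, which finds hot items with a fixed, small number of counters; the paper's adapted variant may differ in detail.

```python
def misra_gries(stream, k):
    """Misra-Gries frequent-items sketch: track hot items from a stream
    using at most k-1 counters instead of one counter per distinct item.
    Used here only to illustrate low-cost activity tracking."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement all counters; drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Applied to a page-access stream each epoch, the surviving counters identify migration candidates with storage proportional to k, not to the number of pages, which is the flavor of reduction the 12800x figure refers to.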


Proceedings of the Second International Symposium on Memory Systems | 2016

Prefetching as a Potentially Effective Technique for Hybrid Memory Optimization

Mahzabeen Islam; Soumik Banerjee; Mitesh R. Meswani; Krishna M. Kavi

The promise of 3D-stacked memory solving the memory wall has led to many emerging architectures that integrate 3D-stacked memory into processor memory in a variety of ways, including systems that utilize different memory technologies, with different performance and power characteristics, to comprise the system memory. It then becomes necessary to manage these memories such that we get the performance of the fastest memory while having the capacity of the slower but larger memories. Some research in industry and academia has proposed using 3D-stacked DRAM as a hardware-managed cache. More recently, particularly pushed by the demands for ever larger capacities, researchers are exploring the use of multiple memory technologies as a single main memory. The main challenge for such flat-address-space memories is the placement and migration of memory pages to increase the number of requests serviced from faster memory, as well as managing overhead due to page migrations. In this paper we ask a different question: can traditional prefetching be a viable solution for effective management of hybrid memories? We conjecture that by tuning a well-known prefetch mechanism for hybrid memories we can achieve substantial performance improvement. To test our conjecture, we compared the state-of-the-art CAMEO migration policy with a Markov-like prefetcher for a hybrid memory consisting of HBM (3D-stacked DRAM) and Phase Change Memory (PCM), using a set of SPEC CPU2006 and several HPC benchmarks. We find that CAMEO provides better performance improvement than prefetching for two-thirds of the workloads (by 59%) and prefetching is better than CAMEO for the remaining one-third (by 19%). The EDP analysis shows that the prefetching solution improves EDP over the no-prefetching baseline, whereas CAMEO does worse in terms of average EDP. These results indicate that prefetching should be reconsidered as a supplementary technique to data migration.
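The Markov-like prefetcher idea can be sketched as a first-order correlation table: remember which address tends to follow each address and prefetch the most likely successor. This is a generic, simplified assumption; the paper's tuned prefetcher is more elaborate.

```python
# Sketch of a first-order Markov-style prefetcher: a correlation table
# mapping each address to a histogram of its observed successors.

from collections import defaultdict, Counter

class MarkovPrefetcher:
    def __init__(self):
        self.successors = defaultdict(Counter)
        self.last = None

    def access(self, addr):
        # Record the (previous -> current) transition, then advance.
        if self.last is not None:
            self.successors[self.last][addr] += 1
        self.last = addr

    def predict(self, addr):
        # Prefetch candidate: the most frequently observed successor.
        table = self.successors.get(addr)
        if not table:
            return None
        return table.most_common(1)[0][0]
```

In a hybrid HBM/PCM setting, such predictions would be used to stage data from slow PCM into fast memory ahead of demand, rather than to migrate pages after the fact as CAMEO does.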


Proceedings of the Second International Symposium on Memory Systems | 2016

Reliability and Performance Trade-off Study of Heterogeneous Memories

Manish Gupta; David A. Roberts; Mitesh R. Meswani; Vilas Sridharan; Dean M. Tullsen; Rajesh K. Gupta

Heterogeneous memories, organized as die-stacked in-package and off-package memory, have been a focus of attention for computer architects seeking to improve memory bandwidth and capacity. Researchers have explored methods and organizations to optimize performance by increasing the access rate to the faster die-stacked memory. Unfortunately, the reliability of such arrangements has not been studied carefully, making them less attractive for data centers and mission-critical systems. Field studies show memory reliability depends on device physics as well as on error correction codes (ECC). Due to the capacity, latency, and energy costs of ECC, the performance-critical in-package memories may favor weaker ECC solutions than off-chip memory. Moreover, these systems are optimized to run at peak performance by increasing the access rate to high-performance in-package memory. In this paper, we use real-world DRAM failure data to conduct a trade-off study on the reliability and performance of Heterogeneous Memory Architectures (HMA). This paper illustrates the problem that an HMA system which only optimizes for performance may suffer from impaired reliability over time. This work also proposes an age-aware access rate control algorithm to ensure reliable operation of long-running systems.
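The shape of an age-aware access rate control can be illustrated with a toy model. Everything below is a hypothetical assumption (the linear aging model, the parameter names, the fault budget); it only conveys the feedback idea of trading fast-memory access rate against reliability as the device ages.

```python
# Toy sketch (assumptions only, not the paper's algorithm): cap the fraction
# of traffic steered to in-package memory as its modeled per-access fault
# probability grows with device age, so expected faults stay within budget.

def fast_memory_access_cap(age_years, fault_budget,
                           base_fault_rate=1e-9, aging_factor=0.5):
    # Modeled per-access fault probability, growing linearly with age.
    fault_rate = base_fault_rate * (1.0 + aging_factor * age_years)
    # Allowed fraction of accesses to fast memory under the fault budget.
    return min(1.0, fault_budget / fault_rate)
```

A young device runs unconstrained; an aged one is throttled, sacrificing some performance to keep long-running systems within their reliability target.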


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2018

A Case for Scoped Persist Barriers in GPUs

Dibakar Gope; Arkaprava Basu; Sooraj Puthoor; Mitesh R. Meswani

Two key trends in computing are evident: the emergence of the GPU as a first-class compute element, and the emergence of byte-addressable non-volatile memory technologies (NVRAM) as a DRAM supplement. GPUs and NVRAMs are likely to coexist in future systems. However, previous works have focused either on GPUs or on NVRAMs, in isolation. In this work, we investigate the enhancements necessary for a GPU to efficiently and correctly manipulate NVRAM-resident persistent data structures. Specifically, we find that previously proposed CPU-centric persist barriers fall short for GPUs. We thus introduce the concept of scoped persist barriers, which aligns with the hierarchical programming framework of GPUs. Scoped persist barriers enable GPU programmers to express which execution group (a.k.a. scope) a given persist barrier applies to. We demonstrate that: (1) use of a narrower scope than algorithmically required can lead to inconsistency of the persistent data structure, and (2) use of a wider scope than necessary leads to significant performance loss (e.g., 25% or more). Therefore, a future GPU can benefit from persist barriers with different scopes.

Collaboration


Dive into Mitesh R. Meswani's collaborations.

Top Co-Authors

John Slice (Advanced Micro Devices)
Mark Nutter (Advanced Micro Devices)