Publications


Featured research published by Manu Awasthi.


International Conference on Parallel Architectures and Compilation Techniques (PACT) | 2010

Handling the problems and opportunities posed by multiple on-chip memory controllers

Manu Awasthi; David W. Nellans; Kshitij Sudan; Rajeev Balasubramonian; Al Davis

Modern processors such as Tilera's Tile64, Intel's Nehalem, and AMD's Opteron are migrating memory controllers (MCs) on-chip, while maintaining a large, flat memory address space. This trend to utilize multiple MCs will likely continue and a core or socket will consequently need to route memory requests to the appropriate MC via an inter- or intra-socket interconnect fabric similar to AMD's HyperTransport™ or Intel's Quick-Path Interconnect™. Such systems are therefore subject to non-uniform memory access (NUMA) latencies because of the time spent traveling to remote MCs. Each MC will act as the gateway to a particular piece of the physical memory. Data placement will therefore become increasingly critical in minimizing memory access latencies. To date, no prior work has examined the effects of data placement among multiple MCs in such systems. Future chip-multiprocessors are likely to comprise multiple MCs and an even larger number of cores. This trend will increase the memory access latency variation in these systems. Proper allocation of workload data to the appropriate MC will be important in reducing the latency of memory service requests. The allocation strategy will need to be aware of queuing delays, on-chip latencies, and row-buffer hit-rates for each MC. In this paper, we propose dynamic mechanisms that take these factors into account when placing data in appropriate slices of the physical memory. We introduce adaptive first-touch page placement, and dynamic page-migration mechanisms to reduce DRAM access delays for multi-MC systems. These policies yield average performance improvements of 17% for adaptive first-touch page-placement, and 35% for a dynamic page-migration policy.
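
The adaptive first-touch policy weighs queuing delay, on-chip distance, and row-buffer locality when picking an MC for a newly touched page. A minimal sketch of such a cost function; the weights and data structures are illustrative assumptions, not the paper's exact formulation:

```python
from dataclasses import dataclass

@dataclass
class MemoryController:
    queue_len: int        # outstanding requests (a proxy for queuing delay)
    hop_distance: int     # on-chip hops from the requesting core
    row_hit_rate: float   # recent row-buffer hit rate, 0.0..1.0

def placement_cost(mc: MemoryController,
                   w_queue: float = 1.0,
                   w_hops: float = 1.0,
                   w_miss: float = 2.0) -> float:
    # Lower is better: penalize long queues, distant MCs, and poor locality.
    return (w_queue * mc.queue_len
            + w_hops * mc.hop_distance
            + w_miss * (1.0 - mc.row_hit_rate))

def first_touch_place(page: int, mcs: list) -> int:
    # On a page's first access, map it to the memory slice behind the
    # currently cheapest MC; the page then stays there unless migrated.
    return min(range(len(mcs)), key=lambda i: placement_cost(mcs[i]))

mcs = [MemoryController(8, 1, 0.7), MemoryController(2, 3, 0.9)]
print(first_touch_place(0x1000, mcs))  # -> 1: shorter queue, better locality
```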


International Symposium on High-Performance Computer Architecture (HPCA) | 2009

Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches

Manu Awasthi; Kshitij Sudan; Rajeev Balasubramonian; John B. Carter

In future multi-cores, large amounts of delay and power will be spent accessing data in large L2/L3 caches. It has been recently shown that OS-based page coloring allows a non-uniform cache architecture (NUCA) to provide low latencies and not be hindered by complex data search mechanisms. In this work, we extend that concept with mechanisms that dynamically move data within caches. The key innovation is the use of a shadow address space to allow hardware control of data placement in the L2 cache while being largely transparent to the user application and off-chip world. These mechanisms allow the hardware and OS to dynamically manage cache capacity per thread as well as optimize placement of data shared by multiple threads. We show an average IPC improvement of 10-20% for multi-programmed workloads with capacity allocation policies and an average IPC improvement of 8% for multi-threaded workloads with policies for shared page placement.
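
The shadow address space idea decouples the address the OS hands out from the address used to index the cache, so hardware can recolor (and thus relocate) a page within the L2 without an OS page copy. A toy sketch; page size, color count, and table layout are illustrative assumptions, not the paper's design:

```python
# Hardware-controlled page recoloring via a shadow address space: the
# OS-visible physical address never changes, while a small table rewrites
# the cache-color bits used to index the L2.

PAGE_BITS = 12                          # 4 KB pages (assumed)
COLOR_BITS = 6                          # 64 cache colors (assumed)
COLOR_MASK = (1 << COLOR_BITS) - 1
OFFSET_MASK = (1 << PAGE_BITS) - 1

recolor_table = {}                      # physical page number -> new color

def migrate(page: int, new_color: int) -> None:
    # "Move" a page to another cache slice by recoloring it in the shadow
    # space; no DRAM copy and no OS page-table change is needed.
    recolor_table[page] = new_color & COLOR_MASK

def l2_index_address(phys_addr: int) -> int:
    # Address the L2 actually indexes with: remapped color substituted in.
    page = phys_addr >> PAGE_BITS
    if page not in recolor_table:
        return phys_addr
    shadow_page = (page & ~COLOR_MASK) | recolor_table[page]
    return (shadow_page << PAGE_BITS) | (phys_addr & OFFSET_MASK)

migrate(0x12345, 3)
print(hex(l2_index_address(0x12345678)))   # -> 0x12343678, color bits = 3
```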


International Symposium on High-Performance Computer Architecture (HPCA) | 2012

Efficient scrub mechanisms for error-prone emerging memories

Manu Awasthi; Manjunath Shevgoor; Kshitij Sudan; Bipin Rajendran; Rajeev Balasubramonian; Viji Srinivasan

Many memory cell technologies are being considered as possible replacements for DRAM and Flash technologies, both of which are nearing their scaling limits. While these new cells (PCM, STT-RAM, FeRAM, etc.) promise high density, better scaling, and non-volatility, they introduce new challenges. Solutions at the architecture level can help address some of these problems; e.g., prior research has proposed wear-leveling and hard error tolerance mechanisms to overcome the limited write endurance of PCM cells. In this paper, we focus on the soft error problem in PCM, a topic that has received little attention in the architecture community. Soft errors in DRAM memories are typically addressed by having SECDED support and a scrub mechanism. The scrub mechanism scans the memory looking for a single-bit error and corrects it before the line experiences a second uncorrectable error. However, PCM (and other emerging memories) are prone to new sources of soft errors. In particular, multi-level cell (MLC) PCM devices will suffer from resistance drift, which increases the soft error rate and incurs high overheads for the scrub mechanism. This paper is the first to study the design of architectural scrub mechanisms, especially when tailored to the drift phenomenon in MLC PCM. Many of our solutions will also apply to other soft-error prone emerging memories. We first show that scrub overheads can be reduced with support for strong ECC codes and a lightweight error detection operation. We then design different scrub algorithms that can adaptively trade off soft and hard errors. Using an approach that combines all proposed solutions, our scrub mechanism yields a 96.5% reduction in uncorrectable errors, a 24.4× decrease in scrub-related writes, and a 37.8% reduction in scrub energy, relative to a basic scrub algorithm used in modern DRAM systems.
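
At its core, a scrub pass pairs a cheap detection check with an expensive decode-and-rewrite, paying for writes only on lines that actually drifted. A toy model of one pass; the error distribution and ECC strength are invented for illustration:

```python
import random

# Each entry models the number of drifted bits in a line. The zero check
# stands in for the lightweight detection step; resetting the entry stands
# in for the strong-ECC decode plus rewrite.

ECC_CORRECTABLE = 4                       # assumed strong-ECC capability

def scrub(lines: list) -> tuple:
    """One scrub pass; returns (rewrites issued, uncorrectable lines)."""
    rewrites = uncorrectable = 0
    for i, errors in enumerate(lines):
        if errors == 0:
            continue                      # cheap detection, no write issued
        if errors <= ECC_CORRECTABLE:
            lines[i] = 0                  # decode + rewrite restores the line
            rewrites += 1
        else:
            uncorrectable += 1
    return rewrites, uncorrectable

random.seed(1)
memory = [random.choices([0, 1, 5], weights=[90, 9, 1])[0] for _ in range(1000)]
print(scrub(memory))   # most lines are clean and cost no scrub write
```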


International Conference on Parallel Architectures and Compilation Techniques (PACT) | 2008

Scalable and reliable communication for hardware transactional memory

Seth H. Pugsley; Manu Awasthi; Niti Madan; Naveen Muralimanohar; Rajeev Balasubramonian

In a hardware transactional memory system with lazy versioning and lazy conflict detection, the process of transaction commit can emerge as a bottleneck. This is especially true for a large-scale distributed memory system where multiple transactions may attempt to commit simultaneously and coordination is required before allowing commits to proceed in parallel. In this paper, we propose novel algorithms to implement commit that are more scalable in terms of delay and are free of deadlocks/livelocks. We show that these algorithms have similarities with the token cache coherence concept and leverage these similarities to extend the algorithms to handle message loss and starvation scenarios. The proposed algorithms improve upon the state-of-the-art by yielding up to a 7X reduction in commit delay and up to a 48X reduction in network messages for commit. These translate into overall performance improvements of up to 66% (for synthetic workloads with average transaction length of 200 cycles), 35% (for average transaction length of 1000 cycles), and 8% (for average transaction length of 4000 cycles). For a small group of multi-threaded programs with frequent transaction commits, improvements of up to 8% were observed for a 32-node simulation.
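
Commit boils down to acquiring exclusive permission for every memory region in a transaction's write set before making lazy updates visible; acquiring in a fixed global order is one classic way to stay deadlock-free. A sketch using per-bank locks as stand-ins for the paper's hardware token exchange (bank count and API are assumptions):

```python
import threading

# Non-overlapping transactions commit in parallel because they grab
# disjoint bank locks; overlapping ones serialize at the shared bank.

NUM_BANKS = 16
bank_locks = [threading.Lock() for _ in range(NUM_BANKS)]

def commit(write_set: set, apply_writes) -> None:
    banks = sorted(write_set)             # fixed global order avoids deadlock
    for b in banks:
        bank_locks[b].acquire()
    try:
        apply_writes()                    # make the lazy version visible
    finally:
        for b in banks:
            bank_locks[b].release()

commit({3, 7}, lambda: print("tx committed"))
```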


ACM International Conference on Systems and Storage (SYSTOR) | 2015

Performance analysis of NVMe SSDs and their implication on real world databases

Qiumin Xu; Huzefa Siyamwala; Mrinmoy Ghosh; Tameesh Suri; Manu Awasthi; Zvika Guz; Anahita Shayesteh; Vijay Balakrishnan

The storage subsystem has undergone tremendous innovation in order to keep up with the ever-increasing demand for throughput. Non-Volatile Memory Express (NVMe) based solid state devices are the latest development in this domain, delivering unprecedented performance in terms of latency and peak bandwidth. NVMe drives are expected to be particularly beneficial for I/O intensive applications, with databases being one of the prominent use-cases. This paper provides the first in-depth performance analysis of NVMe drives. Combining driver instrumentation with system monitoring tools, we present a breakdown of access times for I/O requests throughout the entire system. Furthermore, we present a detailed, quantitative analysis of all the factors contributing to the low-latency, high-throughput characteristics of NVMe drives, including the system software stack. Lastly, we characterize the performance of multiple cloud databases (both relational and NoSQL) on state-of-the-art NVMe drives, and compare that to their performance on enterprise-class SATA-based SSDs. We show that NVMe-backed database applications deliver up to 8× superior client-side performance over enterprise-class, SATA-based SSDs.
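
The latency breakdown rests on timestamping each request at the boundaries between application, driver, and device. A toy post-processing step over such timestamps; the four-field layout is hypothetical, not the paper's instrumentation format:

```python
from statistics import mean

# Per-request timestamps in microseconds (hypothetical sample data):
# (submitted, dispatched_to_device, completed_by_device, returned_to_app)
requests = [
    (0.0, 2.1, 9.8, 10.4),
    (0.5, 2.0, 11.2, 12.0),
]

def breakdown(reqs):
    sw_in  = [d - s for s, d, _, _ in reqs]   # submission path (software)
    dev    = [c - d for _, d, c, _ in reqs]   # time spent on the device
    sw_out = [r - c for _, _, c, r in reqs]   # completion path (software)
    return mean(sw_in), mean(dev), mean(sw_out)

print("avg us: sw-in=%.2f device=%.2f sw-out=%.2f" % breakdown(requests))
```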


International Conference on Parallel Architectures and Compilation Techniques (PACT) | 2011

Prediction Based DRAM Row-Buffer Management in the Many-Core Era

Manu Awasthi; David W. Nellans; Rajeev Balasubramonian; Al Davis

Modern processors experience interleaved memory access streams from different threads/cores, which reduces the spatial locality seen at the memory controller and makes the combined stream appear increasingly random. Traditional methods for exploiting locality at the DRAM level, such as open-page and timer-based policies, become less effective as the number of threads accessing memory increases. Employing closed-page policies in such systems can improve performance but eliminates any possibility of exploiting locality. In this paper, we build upon the key insight that a history-based predictor that tracks the number of accesses to a given DRAM page is a much better indicator of DRAM locality than timer-based policies. We extend prior work to propose a simple Access Based Predictor (ABP) that tracks limited access history at the page level to determine page closure decisions, and does so with much smaller storage overhead than previously proposed policies. We show that ABP, with additional optimizations, can improve system throughput by 12.3% and 21.6% over open- and closed-page policies, respectively. The proposed ABP requires 20 KB of storage overhead and is outside the critical path of memory access.
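
The predictor's key mechanic: remember how many accesses a page received the last time its row was open, and close the row once that count repeats, rather than waiting on a timer. A minimal sketch in the spirit of ABP; the table structures and the "no prediction yet" default are hypothetical:

```python
from collections import defaultdict

OPEN = 1 << 30                         # no prediction yet: act like open-page
predicted = defaultdict(lambda: OPEN)  # page -> predicted accesses/activation
live = {}                              # bank -> [open page, access count]

def access(bank: int, page: int) -> str:
    row = live.get(bank)
    if row and row[0] == page:                 # row-buffer hit
        row[1] += 1
        if row[1] >= predicted[page]:
            del live[bank]                     # predicted final access: close
            return "hit, then close"
        return "hit"
    if row:                                    # row conflict: pay the precharge,
        predicted[row[0]] = row[1]             # and learn this page's count
    live[bank] = [page, 1]
    if predicted[page] <= 1:
        del live[bank]
        return "miss, close immediately"
    return "miss, keep open"

for p in [7, 7, 9, 7, 7, 7]:
    print(access(0, p))                # after one conflict, page 7's row is
                                       # closed right after its 2nd access
```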


International Performance Computing and Communications Conference (IPCCC) | 2016

Understanding performance of I/O intensive containerized applications for NVMe SSDs

Janki Bhimani; Jingpei Yang; Zhengyu Yang; Ningfang Mi; Qiumin Xu; Manu Awasthi; Rajinikanth Pandurangan; Vijay Balakrishnan

Our cloud-based IT world is founded on hypervisors and containers. Containers are becoming an increasingly important cornerstone of everyday deployments. Among the available frameworks, Docker has become one of the most widely adopted container platforms in data centers and enterprise servers, owing to its ease of deployment and scaling. Furthermore, the performance benefits of a lightweight container platform can be leveraged even more with fast back-end storage like high-performance SSDs. However, increasing the number of simultaneously operating Docker containers does not guarantee an aggregate performance improvement, due to saturation. Understanding the performance bottlenecks in a multi-tenant Docker environment is therefore critical for maintaining application-level fairness and performing better resource management. In this paper, we characterize the performance of the persistent storage option (through data volumes) for I/O-intensive, dockerized applications. Our work investigates the performance impact of an increasing number of simultaneous Docker containers in different workload environments. We provide a first-of-its-kind study of I/O-intensive containerized applications operating with NVMe SSDs. We show that 1) up to six times better application throughput can be obtained simply by a wise choice of the number of containerized instances compared to a single instance; and 2) with multiple application containers running simultaneously, an application's throughput may degrade by up to 50% compared to its stand-alone throughput if the mix of applications and workloads is poorly chosen. We then propose novel design guidelines for the optimal and fair operation of both homogeneous and heterogeneous environments mixing different applications and workloads.
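
The experimental knob here is simply the number of identical containers hammering a data volume on the same NVMe drive. A sketch of such a sweep; the image name, mount path, and fio arguments are assumptions, and running it requires Docker plus an fio container image on the host:

```python
import subprocess

def run_instances(n: int, volume: str = "/mnt/nvme0") -> None:
    # Launch n identical containers against the same data volume and wait;
    # aggregate throughput would be parsed from each instance's fio output.
    procs = [
        subprocess.Popen([
            "docker", "run", "--rm",
            "-v", f"{volume}:/data",            # persistent data volume
            "example/fio",                      # hypothetical fio image
            "fio", "--name=job", "--rw=randread", "--bs=4k",
            "--size=1G", "--runtime=60", "--filename=/data/testfile",
        ])
        for _ in range(n)
    ]
    for p in procs:
        p.wait()

for n in (1, 2, 4, 8):   # sweep instance count; throughput may saturate
    run_instances(n)
```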


IEEE International Conference on Cloud Computing Technology and Science (CloudCom) | 2016

A Fresh Perspective on Total Cost of Ownership Models for Flash Storage in Datacenters

Zhengyu Yang; Manu Awasthi; Mrinmoy Ghosh; Ningfang Mi

Recently, adoption of Flash-based devices has become increasingly common in all forms of computing devices. Flash devices have started to become more economically viable for large storage installations like datacenters, where metrics like Total Cost of Ownership (TCO) are of paramount importance. Flash devices suffer from write amplification (WA), which, if unaccounted for, can substantially increase the TCO of a storage system. In this paper, we develop a TCO model for Flash storage devices, and then plug into it a Write Amplification (WA) model of NVMe SSDs that we build from empirical data. Our new WA model accounts for workload characteristics like write rate and percentage of sequential writes. Furthermore, using both the TCO and WA models as the optimization criterion, we design new Flash resource management schemes (minTCO) to guide datacenter managers in making workload allocation decisions with TCO for SSDs taken into account. Experimental results show that minTCO can reduce the TCO while maintaining relatively high throughput and space utilization of the entire datacenter storage.
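
The core idea is that write amplification shortens device lifetime, which in turn raises the amortized cost of the drive. A toy TCO-per-year calculation; all constants and the WA curve are invented for illustration and are not the paper's fitted model:

```python
def write_amplification(write_rate_mbps: float, seq_fraction: float) -> float:
    # Illustrative WA curve: grows with the random-write share and the
    # write rate; purely sequential writes stay close to WA = 1.
    return 1.0 + 3.0 * (1.0 - seq_fraction) * min(write_rate_mbps / 500.0, 1.0)

def tco_per_year(price_usd: float, endurance_tb: float,
                 write_rate_mbps: float, seq_fraction: float,
                 opex_usd_per_year: float = 20.0) -> float:
    wa = write_amplification(write_rate_mbps, seq_fraction)
    tb_written_per_year = write_rate_mbps * 1e6 * 86400 * 365 / 1e12
    lifetime_years = endurance_tb / (wa * tb_written_per_year)
    # Amortize the purchase price over the WA-shortened lifetime.
    return price_usd / lifetime_years + opex_usd_per_year

# The same write rate costs far more per year when writes are mostly random:
print(tco_per_year(400, 1000, 100, seq_fraction=0.9))
print(tco_per_year(400, 1000, 100, seq_fraction=0.1))
```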


International Symposium on Memory Systems (MEMSYS) | 2015

Rethinking Design Metrics for Datacenter DRAM

Manu Awasthi

Over the years, the evolution of DRAM has provided little improvement in access latency; instead, devices have been optimized to deliver ever greater peak bandwidth. The combined bandwidth in a contemporary multi-socket server system runs into hundreds of GB/s. However, datacenter-scale applications running on server platforms care largely about having access to a large pool of low-latency main memory (DRAM), and even in the best case are unable to utilize more than a small fraction of the total memory bandwidth. In this extended abstract, we use measured data from state-of-the-art servers running memory-intensive datacenter workloads like Memcached to argue that main memory design should steer away from optimizing traditional DRAM design metrics like peak bandwidth, so as to cater to the growing needs of the datacenter server industry for high-density, low-latency memory with moderate bandwidth requirements.


International Journal of Parallel Programming | 2012

Managing Data Placement in Memory Systems with Multiple Memory Controllers

Manu Awasthi; David W. Nellans; Kshitij Sudan; Rajeev Balasubramonian; Al Davis

Modern processors such as Tilera’s Tile64, Intel’s Nehalem, and AMD’s Opteron are migrating memory controllers (MCs) on-chip, while maintaining a large, flat memory address space. This trend to utilize multiple MCs will likely continue and a core or socket will consequently need to route memory requests to the appropriate MC via an inter- or intra-socket interconnect fabric similar to AMD’s HyperTransport™, or Intel’s Quick-Path Interconnect™. Such systems are therefore subject to non-uniform memory access (NUMA) latencies because of the time spent traveling to remote MCs. Each MC will act as the gateway to a particular region of the physical memory. Data placement will therefore become increasingly critical in minimizing memory access latencies. Increased competition for memory resources will also increase the memory access latency variation in future systems. Proper allocation of workload data to the appropriate MC will be important in decreasing the variation and average latency when servicing memory requests. The allocation strategy will need to be aware of queuing delays, on-chip latencies, and row-buffer hit-rates for each MC. In this paper, we propose dynamic mechanisms that take these factors into account when placing data in appropriate slices of physical memory. We introduce adaptive first-touch page placement, and dynamic page-migration mechanisms to reduce DRAM access delays for multi-MC systems. We also introduce policies that can handle data placement in memory systems that have regions with heterogeneous properties. The proposed policies yield average performance improvements of 6.5% for adaptive first-touch page-placement, and 8.9% for a dynamic page-migration policy for a system with homogeneous DRAM DIMMs. We also show improvements in systems that contain DIMMs with different performance characteristics.
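
The journal version extends migration to memory regions with heterogeneous performance: conceptually, hot pages are periodically re-homed to the region with the lowest estimated service cost. A sketch of that epoch loop; the cost terms, threshold, and structures are hypothetical:

```python
from collections import Counter

access_counts = Counter()            # page -> accesses this epoch
page_region = {}                     # page -> current region index

def region_cost(region: dict) -> float:
    # Illustrative cost: queueing pressure + interconnect distance +
    # the region's intrinsic access latency (fast vs. slow DIMMs).
    return region["queue_len"] + region["hop_latency"] + region["access_latency"]

def end_of_epoch(regions: list, hot_threshold: int = 64) -> None:
    best = min(range(len(regions)), key=lambda i: region_cost(regions[i]))
    for page, count in access_counts.items():
        if count >= hot_threshold and page_region.get(page) != best:
            page_region[page] = best     # migrate: copy + remap via HW/OS
    access_counts.clear()

regions = [
    {"queue_len": 10, "hop_latency": 1, "access_latency": 50},   # fast, busy
    {"queue_len": 1,  "hop_latency": 4, "access_latency": 70},   # slower DIMM
]
access_counts[0x42] = 128
end_of_epoch(regions)
print(page_region)   # the hot page lands in the cheaper region overall
```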

Collaboration


Dive into Manu Awasthi's collaborations.

Top Co-Authors
Qiumin Xu

University of Southern California

Zhengyu Yang

Northeastern University

Mrinmoy Ghosh

Georgia Institute of Technology
