Aayush Gupta
IBM
Publications
Featured research published by Aayush Gupta.
Architectural Support for Programming Languages and Operating Systems | 2009
Aayush Gupta; Young-Jae Kim; Bhuvan Urgaonkar
Recent technological advances in the development of flash-memory based devices have consolidated their leadership position as the preferred storage media in the embedded systems market and opened new vistas for deployment in enterprise-scale storage systems. Unlike hard disks, flash devices are free from any mechanical moving parts, have no seek or rotational delays, and consume less power. However, the internal idiosyncrasies of flash technology make its performance highly dependent on workload characteristics. The poor performance of random writes has been a cause of major concern, which needs to be addressed to better utilize the potential of flash in enterprise-scale environments. We examine one of the important causes of this poor performance: the design of the Flash Translation Layer (FTL), which performs the virtual-to-physical address translations and hides the erase-before-write characteristics of flash. We propose a complete paradigm shift in the design of the core FTL engine from the existing techniques with our Demand-based Flash Translation Layer (DFTL), which selectively caches page-level address mappings. We develop a flash simulation framework called FlashSim. Our experimental evaluation with realistic enterprise-scale workloads endorses the utility of DFTL in enterprise-scale storage systems by demonstrating: (i) improved performance, (ii) reduced garbage collection overhead, and (iii) better overload behavior compared to state-of-the-art FTL schemes. For example, a predominantly random-write I/O trace from an OLTP application running at a large financial institution shows a 78% improvement in average response time (due to a 3-fold reduction in garbage collection operations), compared to a state-of-the-art FTL scheme. Even for the well-known read-dominant TPC-H benchmark, for which DFTL introduces additional overheads, we improve system response time by 56%.
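The core DFTL idea, caching only the hot page-level mappings on demand, can be pictured in a few lines. Below is an illustrative Python sketch (class and variable names are hypothetical), assuming an LRU cache in front of a full mapping table persisted on flash; the real FTL additionally manages translation pages and garbage collection.

```python
from collections import OrderedDict

class DemandMappingCache:
    """Sketch of DFTL-style selective caching of page-level mappings.

    Only hot logical-to-physical mappings live in the in-memory cache;
    misses fetch the entry from translation pages stored on flash.
    """

    def __init__(self, capacity, flash_mapping_table):
        self.capacity = capacity
        self.cache = OrderedDict()          # LRU: logical page -> physical page
        self.flash = flash_mapping_table    # full table persisted on flash

    def translate(self, logical_page):
        if logical_page in self.cache:      # cache hit: no extra flash access
            self.cache.move_to_end(logical_page)
            return self.cache[logical_page]
        physical_page = self.flash[logical_page]   # miss: read translation page
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict least-recently-used entry
        self.cache[logical_page] = physical_page
        return physical_page
```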
International Conference on Advances in System Simulation | 2009
Young Jae Kim; Brendan Tauras; Aayush Gupta; Bhuvan Urgaonkar
NAND Flash memory-based Solid-State Disks (SSDs) are becoming popular as the storage media in domains ranging from mobile laptops to enterprise-scale storage systems due to a number of benefits (e.g., lighter weight, faster access times, lower power consumption, higher resistance to vibrations) they offer over the conventionally popular Hard Disk Drives (HDDs). While a number of well-regarded simulation environments exist for HDDs, the same is not yet true for SSDs, because SSDs have been in the storage market for a relatively short time and because little information (hardware configuration and software methods) about state-of-the-art SSDs is publicly available. We describe the design and implementation of FlashSim, a simulator aimed at filling this void in the performance evaluation of emerging storage systems that employ SSDs. FlashSim is an event-driven simulator that follows the object-oriented programming paradigm for modularity. We have validated the performance of FlashSim against a number of commercial SSDs for behavioral similarity. We have also used FlashSim to compare the performance of SSD devices employing different Flash Translation Layer (FTL) schemes, and analyzed the energy consumption of different FTL schemes in the SSD. FlashSim has been written to be inter-operable with the well-regarded DiskSim simulator, thus enabling the simulation of a variety of “hybrid” storage systems employing combinations of SSDs and HDDs. Given the current interest in such hybrid systems as opposed to systems with SSDs replacing HDDs (due to the higher price of SSDs), we believe this to be an especially useful feature of FlashSim. We have made FlashSim freely available for download in the hope that it will be of use to researchers exploring the design of SSD-based systems.
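An event-driven simulator of this kind follows a standard pattern: a simulation clock plus a priority queue of timestamped events. The sketch below shows that generic pattern in Python; the names and the 25 us read latency are assumptions for illustration, not FlashSim's actual classes or parameters.

```python
import heapq

class EventSimulator:
    """Bare-bones event-driven simulation loop of the kind FlashSim uses
    (illustrative only; names do not match the FlashSim code base)."""

    def __init__(self):
        self.clock = 0.0
        self.queue = []     # min-heap of (timestamp, seq, handler, payload)
        self.seq = 0        # tie-breaker so heapq never compares handlers

    def schedule(self, delay, handler, payload=None):
        heapq.heappush(self.queue, (self.clock + delay, self.seq, handler, payload))
        self.seq += 1

    def run(self):
        while self.queue:
            self.clock, _, handler, payload = heapq.heappop(self.queue)
            handler(self, payload)   # handlers may schedule further events

# Example: model a read request hitting flash with a fixed service latency.
def on_read_complete(sim, lpn):
    print(f"t={sim.clock:.6f}s read of page {lpn} complete")

def on_read_issue(sim, lpn):
    sim.schedule(25e-6, on_read_complete, lpn)   # assume a 25 us flash read

sim = EventSimulator()
sim.schedule(0.0, on_read_issue, 42)
sim.run()
```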
Modeling, Analysis, and Simulation of Computer and Telecommunication Systems | 2011
Young-Jae Kim; Aayush Gupta; Bhuvan Urgaonkar; Piotr Berman; Anand Sivasubramaniam
Unlike the use of DRAM for caching or buffering, certain idiosyncrasies of SSDs make their integration into existing systems non-trivial. Flash memory suffers from limits on its reliability, is an order of magnitude more expensive than the HDD, and can sometimes be as slow as the HDD (due to excessive garbage collection (GC) induced by a high intensity of random writes). Given these trade-offs between HDDs and SSDs in terms of cost, performance, and lifetime, the current consensus among several storage experts is to view SSDs not as a replacement for the HDD but rather as a complementary device within the high-performance storage hierarchy. We design and evaluate such a hybrid system, called HybridStore, to provide: (a) HybridPlan: an improved capacity planning technique for administrators with the overall goal of operating within cost budgets, and (b) HybridDyn: improved performance/lifetime guarantees during episodes of deviation from expected workloads through two novel mechanisms: write regulation and fragmentation busting. As an illustrative example of HybridStore's efficacy, HybridPlan is able to find the most cost-effective storage configuration for a large-scale workload from Microsoft Research, suggesting one MLC SSD with ten 7.2K RPM HDDs instead of fourteen 7.2K RPM HDDs alone. HybridDyn is able to reduce the average response time for an enterprise-scale random-write dominant workload by about 71% as compared to an HDD-based system.
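Write regulation can be pictured as an admission policy sitting in front of the SSD. A minimal Python sketch follows; the one-second accounting window, the budget, and the routing rules are invented for illustration and are not the paper's actual mechanism.

```python
class WriteRegulator:
    """Illustrative admission policy in the spirit of HybridDyn's write
    regulation: when the recent random-write rate exceeds a budget derived
    from the SSD's lifetime target, spill writes to the HDD instead.
    (Hypothetical thresholds; not the paper's model.)"""

    def __init__(self, ssd_random_writes_per_sec):
        self.budget = ssd_random_writes_per_sec
        self.window_start = 0.0
        self.writes_in_window = 0

    def route_write(self, now, is_random):
        # Reset the one-second accounting window when it expires.
        if now - self.window_start >= 1.0:
            self.window_start, self.writes_in_window = now, 0
        # Sequential writes go to the HDD, which handles them well; random
        # writes go to the SSD only while they fit within the budget.
        if not is_random:
            return "hdd"
        self.writes_in_window += 1
        return "ssd" if self.writes_in_window <= self.budget else "hdd"
```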
European Conference on Computer Systems | 2015
Yue Cheng; Aayush Gupta; Ali Raza Butt
The extreme latency and throughput requirements of modern web applications are driving the use of distributed in-memory object caches such as Memcached. While extant caching systems scale out seamlessly, their use in the cloud --- with its unique cost and multi-tenancy dynamics --- presents unique opportunities and design challenges. In this paper, we propose MBal, a high-performance in-memory object caching framework with adaptive Multi-phase load Balancing, which supports not only horizontal (scale-out) but also vertical (scale-up) scalability. MBal is able to make efficient use of available resources in the cloud through its fine-grained, partitioned, lockless design. This design also lends itself naturally to adaptive load balancing both within a server and across the cache cluster through an event-driven, multi-phase load balancer. While individual load balancing approaches have been leveraged in in-memory caches, MBal goes beyond extant systems and offers a holistic solution wherein the load balancing model tracks hotspots and applies different strategies based on imbalance severity: key replication, server-local, or cross-server coordinated data migration. Performance evaluation on an 8-core commodity server shows that, compared to a state-of-the-art approach, MBal scales with the number of cores and executes 2.3x and 12x more queries/second for GET and SET operations, respectively.
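The severity-driven strategy selection MBal describes can be sketched as a simple decision ladder. The thresholds below are hypothetical, purely to illustrate how escalating imbalance maps to escalating (and increasingly costly) remedies; the paper's multi-phase balancer uses its own cost model and event triggers.

```python
def choose_rebalance_action(hotspot_load, avg_load):
    """Sketch of MBal-style severity-based strategy selection.
    Thresholds are made up for illustration."""
    imbalance = hotspot_load / avg_load
    if imbalance < 1.5:
        return "none"                       # mild skew: leave it alone
    if imbalance < 3.0:
        return "key_replication"            # replicate hot keys to spread reads
    if imbalance < 6.0:
        return "server_local_migration"     # move partitions across local cores
    return "cross_server_migration"         # coordinated migration across nodes
```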
High Performance Distributed Computing | 2015
Yue Cheng; M. Safdar Iqbal; Aayush Gupta; Ali Raza Butt
Enterprises are increasingly moving their big data analytics to the cloud with the goal of reducing costs without sacrificing application performance. Cloud service providers offer their tenants a myriad of storage options which, while flexible, make the choice of storage deployment non-trivial. Crafting deployment scenarios to leverage these choices in a cost-effective manner - under the unique pricing models and multi-tenancy dynamics of the cloud environment - presents unique challenges in designing cloud-based data analytics frameworks. In this paper, we propose CAST, a Cloud Analytics Storage Tiering solution that cloud tenants can use to reduce monetary cost and improve the performance of analytics workloads. The approach takes a first step towards providing storage tiering support for data analytics in the cloud. CAST performs offline workload profiling to construct job performance prediction models for different cloud storage services, and combines these models with workload specifications and high-level tenant goals to generate a cost-effective data placement and storage provisioning plan. Furthermore, we build CAST++ to enhance CAST's optimization model by incorporating data reuse patterns and cross-job interdependencies common in realistic analytics workloads. Tests with production workload traces from Facebook and a 400-core Google Cloud-based Hadoop cluster demonstrate that CAST++ achieves a 1.21x performance improvement and reduces deployment costs by 51.4% compared to a local storage configuration.
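The placement step, pairing profiled per-tier runtime models with tier prices and a tenant goal, can be illustrated with a toy chooser. Tier names, prices, and the linear runtime models below are made up; CAST's real optimization also handles provisioning and cross-job effects.

```python
def place_job(job, tiers):
    """Toy version of a CAST-style placement step: given offline-profiled
    runtime estimates per storage tier, pick the cheapest tier that still
    meets the job's deadline. Tiers and prices are hypothetical."""
    feasible = []
    for tier in tiers:
        runtime = tier["runtime_model"](job)        # seconds, from profiling
        cost = runtime / 3600.0 * tier["usd_per_hour"]
        if runtime <= job["deadline"]:
            feasible.append((cost, tier["name"]))
    if not feasible:
        raise ValueError("no tier meets the deadline")
    return min(feasible)                            # (cost, tier_name)

tiers = [
    {"name": "local_ssd",    "usd_per_hour": 0.40, "runtime_model": lambda j: j["gb"] * 2.0},
    {"name": "object_store", "usd_per_hour": 0.10, "runtime_model": lambda j: j["gb"] * 6.0},
]
print(place_job({"gb": 100, "deadline": 900}, tiers))   # -> (0.0166..., 'object_store')
```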
International Conference on Autonomic Computing | 2016
Cheng Wang; Bhuvan Urgaonkar; Aayush Gupta; Lydia Y. Chen; Robert Birke; George Kesidis
We explore the efficacy of dynamic effective capacity modulation (i.e., using virtualization techniques to offer lower resource capacity than that advertised by the cloud provider) as an explicit control knob for a cloud provider's profit maximization, complementing the more well-studied approach of dynamic pricing. Our focus is on emerging cloud ecosystems wherein we expect tenants to modify their demands strategically in response to such modulation in effective capacity and prices. We consider a simple model of a cloud provider that offers a single type of virtual machine to its tenants and devise a leader/follower game-based control framework to capture the interactions between the provider and its tenants. We assume both parties employ myopic control and short-term predictions, reflecting their operation under the high dynamism and poor predictability of such environments. Our evaluation using a combination of real-world data center traces and benchmarks hosted on a prototype OpenStack-based cloud shows a 10-30% profit improvement for a cloud provider compared with baselines that use static pricing and/or static effective capacity.
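The leader/follower structure with myopic control can be sketched as the provider (leader) enumerating (price, effective-capacity) pairs against a predicted tenant demand response for the next control interval. The linear demand model below is a stand-in, not the paper's tenant model.

```python
def provider_best_move(prices, capacities, tenant_demand, cost_per_unit):
    """Sketch of the myopic leader step in a Stackelberg-style game:
    enumerate (price, effective capacity) pairs, predict the followers'
    demand response, and pick the profit-maximizing pair."""
    best_profit, best_move = float("-inf"), None
    for p in prices:
        for c in capacities:
            d = min(tenant_demand(p, c), c)        # tenants react to price and capacity
            profit = p * d - cost_per_unit * c     # revenue minus operating cost
            if profit > best_profit:
                best_profit, best_move = profit, (p, c)
    return best_profit, best_move

# Hypothetical linear demand response: demand falls with price, rises with capacity.
demand = lambda p, c: max(0.0, 100 - 40 * p + 0.2 * c)
print(provider_best_move([0.5, 1.0, 1.5], [50, 100, 150], demand, 0.2))
```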
Petascale Data Storage Workshop | 2015
Ali Anwar; Yue Cheng; Aayush Gupta; Ali Raza Butt
Cloud object stores today are deployed using a single set of configuration parameters for all different types of applications. This homogeneous setup results in all applications experiencing the same service level (e.g., data transfer throughput). However, the vast variety of applications exposes extremely different latency and throughput requirements. To this end, we propose MOS, a Micro Object Storage architecture with independently configured microstores, each tuned dynamically for a particular type of workload. We then expose these microstores to the tenant, who can choose to place their data in the appropriate microstore according to the latency and throughput requirements of their workloads. Our evaluation shows that, compared with the default setup, MOS can improve performance by up to 200% for small objects and 28% for large objects, while providing the opportunity to trade off between the two.
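Routing data to an appropriately tuned microstore reduces, at its simplest, to a classification rule. A minimal sketch, with an assumed size cutoff and two hypothetical microstore configurations (not MOS's actual tuning):

```python
def pick_microstore(object_size_bytes, latency_sensitive):
    """Illustrative routing of tenant data to a MOS-style microstore."""
    SMALL_OBJECT_CUTOFF = 256 * 1024            # 256 KiB, an assumed cutoff
    if object_size_bytes <= SMALL_OBJECT_CUTOFF or latency_sensitive:
        return "microstore-small"   # tuned for low latency: more workers, small buffers
    return "microstore-large"       # tuned for throughput: fewer, larger transfers
```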
ACM International Conference on Systems and Storage | 2013
Rohan Gandhi; Aayush Gupta; Anna S. Povzner; Wendy Belluomini; Tim Kaldewey
While the initial wave of in-memory key-value stores has been optimized for serving relatively fixed content to a very large number of users, an emerging class of enterprise-scale data analytics workloads focuses on capturing, analyzing, and reacting to data in real time. At the same time, advances in network technologies are shifting the performance bottleneck from the network to the memory subsystem. To address these new trends, we present a bottom-up approach to building a high-performance in-memory key-value store, Mercury, for both traditional read-intensive workloads and emerging workloads with high write-to-read ratios. Mercury's architecture is based on two key design principles: (i) economizing the number of DRAM accesses per operation, and (ii) reducing synchronization overheads. We implement these principles with a simple hash table with linked-list based chaining, and provide high concurrency with a fine-grained, cache-friendly locking scheme. On a commodity single-socket server with 12 cores, Mercury scales with the number of cores and executes 14 times more queries/second than a popular hash-based key-value system, Memcached, for both read- and write-heavy workloads.
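The combination of chained hashing with fine-grained locking can be sketched as a striped-lock table: each lock guards a stripe of buckets, so concurrent operations on different stripes never contend. This Python sketch is illustrative only; Mercury's cache-friendliness is achieved at a lower level than Python can express.

```python
import threading

class StripedHashTable:
    """Sketch of the two Mercury design principles: a plain chained hash
    table (few memory touches per op) with striped locks instead of one
    global lock. Names are illustrative, not Mercury's code."""

    def __init__(self, n_buckets=1024, n_stripes=64):
        self.buckets = [[] for _ in range(n_buckets)]       # linked-list chaining
        self.locks = [threading.Lock() for _ in range(n_stripes)]
        self.n_stripes = n_stripes

    def _lock_for(self, bucket_index):
        return self.locks[bucket_index % self.n_stripes]    # one lock per stripe

    def set(self, key, value):
        h = hash(key) % len(self.buckets)
        with self._lock_for(h):
            chain = self.buckets[h]
            for i, (k, _) in enumerate(chain):
                if k == key:
                    chain[i] = (key, value)     # update existing entry in place
                    return
            chain.append((key, value))

    def get(self, key):
        h = hash(key) % len(self.buckets)
        with self._lock_for(h):
            for k, v in self.buckets[h]:
                if k == key:
                    return v
        return None
```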
Journal of Computer Science and Technology | 2013
Young-Jae Kim; Aayush Gupta; Bhuvan Urgaonkar
The poor performance of random writes has been a cause of major concern which needs to be addressed to better utilize the potential of flash in enterprise-scale environments. We examine one of the important causes of this poor performance: the design of the flash translation layer (FTL), which performs the virtual-to-physical address translations and hides the erase-before-write characteristics of flash. We propose a complete paradigm shift in the design of the core FTL engine from the existing techniques with our Demand-Based Flash Translation Layer (DFTL), which selectively caches page-level address mappings. Our experimental evaluation using FlashSim with realistic enterprise-scale workloads endorses the utility of DFTL in enterprise-scale storage systems by demonstrating: 1) improved performance, 2) reduced garbage collection overhead, and 3) better overload behavior compared with hybrid FTL schemes, which are the most popular implementation methods. For example, a predominantly random-write I/O trace from an OLTP application running at a large financial institution shows a 78% improvement in average response time (due to a 3-fold reduction in garbage collection operations), compared with the hybrid FTL scheme. Even for the well-known read-dominant TPC-H benchmark, for which DFTL introduces additional overheads, we improve system response time by 56%. Moreover, when the write-back cache on a DFTL-based SSD is enabled, DFTL even outperforms the page-based FTL scheme, improving response time by 72% for the Financial trace.
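The write-back result suggests why deferring mapping updates pays off: dirty cached mappings are flushed to their translation pages only on eviction, amortizing translation-page writes. A hedged sketch extending the earlier mapping-cache example (names again hypothetical):

```python
from collections import OrderedDict

class WriteBackMappingCache:
    """Extends the earlier DFTL sketch with dirty tracking: updated
    mappings are written back to flash only on eviction. Illustrative,
    not the paper's code."""

    def __init__(self, capacity, flash_mapping_table):
        self.capacity = capacity
        self.flash = flash_mapping_table
        self.cache = OrderedDict()   # logical page -> (physical page, dirty?)

    def update(self, lpn, new_ppn):
        if lpn in self.cache:
            self.cache.move_to_end(lpn)
        elif len(self.cache) >= self.capacity:
            self._evict()
        self.cache[lpn] = (new_ppn, True)    # mark dirty; defer the flash write

    def _evict(self):
        lpn, (ppn, dirty) = self.cache.popitem(last=False)
        if dirty:
            self.flash[lpn] = ppn            # write mapping back to its translation page
```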
International Conference on Cloud Computing | 2016
Cheng Wang; Aayush Gupta; Bhuvan Urgaonkar
Growing tenant workload needs and an increasingly competitive market will force cloud providers to operate their data centers at significantly higher utilization levels than seen today. We argue that a key enabler of such cloud ecosystems would be facilities for tenants to engage in fine-grained resource scaling in addition to those offered by current providers. The basic unit of resource scaling exposed by current cloud providers is the canonical interface of virtual machines (VMs) with relatively static resource capacities. This paper describes opportunities and challenges in augmenting this interface to also include fine-grained scaling of CPU and memory within an already procured VM. We present qualitative arguments for why this would offer cost benefits for both the provider and its tenants. We focus on the cost-effective operation of a tenant in such an environment via the design of a feedback controller. The efficacy of our ideas is illustrated by implementing a case study with a Memcached tenant workload. Our results are promising and point to an interesting and broad area for further research: with the real-world workload in our evaluation, up to 50% utility improvement can be achieved by applying memory scaling alone, and a further 66% improvement by coordinating fine-grained CPU and memory scaling.
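The feedback-controller case study can be pictured as a simple proportional loop over an observed metric. The sketch below assumes a Memcached hit-rate target; the gain and bounds are invented for illustration and do not come from the paper.

```python
def memory_controller(current_mb, hit_rate, target_hit_rate=0.95,
                      gain_mb=2048, min_mb=512, max_mb=16384):
    """Toy proportional controller for fine-grained VM memory scaling,
    in the spirit of the paper's case study: grow the Memcached VM's
    memory when the hit rate misses its target, shrink it when there
    is slack. Gain and bounds are hypothetical."""
    error = target_hit_rate - hit_rate           # positive => under-provisioned
    next_mb = current_mb + gain_mb * error       # proportional adjustment
    return int(max(min_mb, min(max_mb, next_mb)))

# Example control step: 4 GiB allocation, observed 90% hit rate.
print(memory_controller(4096, 0.90))   # -> 4198 (grow by ~100 MB)
```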