Publication


Featured research published by Bikash Sharma.


symposium on cloud computing | 2011

Modeling and synthesizing task placement constraints in Google compute clusters

Bikash Sharma; Victor Chudnovsky; Joseph L. Hellerstein; Rasekh Rifaat; Chita R. Das

Evaluating the performance of large compute clusters requires benchmarks with representative workloads. At Google, performance benchmarks are used to obtain performance metrics such as task scheduling delays and machine resource utilizations to assess changes in application codes, machine configurations, and scheduling algorithms. Existing approaches to workload characterization for high performance computing and grids focus on task resource requirements for CPU, memory, disk, I/O, network, etc. Such resource requirements address how much resource is consumed by a task. However, in addition to resource requirements, Google workloads commonly include task placement constraints that determine which machine resources are consumed by tasks. Task placement constraints arise because of task dependencies such as those related to hardware architecture and kernel version. This paper develops methodologies for incorporating task placement constraints and machine properties into performance benchmarks of large compute clusters. Our studies of Google compute clusters show that constraints increase average task scheduling delays by a factor of 2 to 6, which often results in tens of minutes of additional task wait time. To understand why, we extend the concept of resource utilization to include constraints by introducing a new metric, the Utilization Multiplier (UM). UM is the ratio of the resource utilization seen by tasks with a constraint to the average utilization of the resource. UM provides a simple model of the performance impact of constraints in that task scheduling delays increase with UM. Last, we describe how to synthesize representative task constraints and machine properties, and how to incorporate this synthesis into existing performance benchmarks. Using synthetic task constraints and machine properties generated by our methodology, we accurately reproduce performance metrics for benchmarks of Google compute clusters with a discrepancy of only 13% in task scheduling delay and 5% in resource utilization.
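
The Utilization Multiplier defined in this abstract is a simple ratio: the utilization of a resource as seen by tasks carrying a given constraint, divided by the average utilization of that resource across the cluster. A minimal sketch of that computation follows; the machine records, field names, and the constraint predicate are illustrative assumptions, not the paper's artifact.

```python
# Hypothetical sketch of the Utilization Multiplier (UM) described above:
# UM(c, r) = utilization of resource r on machines satisfying constraint c
#            / average utilization of resource r across all machines.

def utilization_multiplier(machines, resource, satisfies_constraint):
    """machines: list of dicts like {"cpu_used": .., "cpu_cap": .., "kernel": ".."}.
    resource: e.g. "cpu"; satisfies_constraint: predicate over a machine dict."""
    def utilization(ms):
        used = sum(m[f"{resource}_used"] for m in ms)
        cap = sum(m[f"{resource}_cap"] for m in ms)
        return used / cap if cap else 0.0

    eligible = [m for m in machines if satisfies_constraint(m)]
    overall = utilization(machines)
    return utilization(eligible) / overall if overall else float("inf")

machines = [
    {"cpu_used": 6.0, "cpu_cap": 8.0, "kernel": "3.10"},
    {"cpu_used": 2.0, "cpu_cap": 8.0, "kernel": "2.6"},
]
# Tasks constrained to kernel 3.10 see 75% CPU utilization vs 50% overall -> UM = 1.5,
# so their scheduling delays are expected to be larger.
print(utilization_multiplier(machines, "cpu", lambda m: m["kernel"] == "3.10"))
```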


international conference on cluster computing | 2009

MDCSim: A multi-tier data center simulation platform

Seung-Hwan Lim; Bikash Sharma; Gunwoo Nam; Eun-Kyoung Kim; Chita R. Das

Performance and power issues are becoming increasingly important in the design of large, cluster-based multi-tier data centers for supporting a multitude of services. The design and analysis of such large/complex distributed systems often suffer from the lack of availability of an adequate physical infrastructure. This paper presents a comprehensive, flexible, and scalable simulation platform for in-depth analysis of multi-tier data centers. Designed as a pluggable three-level architecture, our simulator captures all the important design specifics of the underlying communication paradigm, kernel level scheduling artifacts, and the application level interactions among the tiers of a three-tier data center. The flexibility of the simulator is attributed to its ability to experiment with different design alternatives in the three layers, and to analyze both the performance and power consumption with realistic workloads. The scalability of the simulator is demonstrated with analyses of different data center configurations. In addition, we have designed a prototype three-tier data center on an InfiniBand Architecture (IBA) connected Linux cluster to validate the simulator. Using the RUBiS benchmark workload, it is shown that the simulator is quite accurate in estimating the throughput, response time, and power consumption. We then demonstrate the applicability of the simulator in conducting three different types of studies. First, we conduct a comparative analysis of the IBA and 10 Gigabit Ethernet (10GigE) under different traffic conditions and with varying size clusters for understanding their relative merits in designing cluster-based servers. Second, we measure and characterize power consumption across the servers of a three-tier data center. Third, we perform a configuration analysis of the Web server (WS), Application Server (AS), and Database Server (DB) for performance optimization. We believe that such a comprehensive simulation infrastructure is critical for providing guidelines in designing efficient and cost-effective multi-tier data centers.
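
A minimal sketch of the pluggable three-level structure the abstract describes: a communication layer, a kernel-level scheduling layer, and an application tier model, each of which can be swapped to study design alternatives. All class names, models, and numbers below are illustrative assumptions, not MDCSim's actual API.

```python
# Illustrative skeleton of a pluggable three-level data center simulator
# (class and method names are hypothetical, not MDCSim's interface).

class CommunicationLayer:
    def transfer_latency(self, nbytes):          # IBA vs 10GigE models plug in here
        raise NotImplementedError

class InfiniBandLayer(CommunicationLayer):
    def transfer_latency(self, nbytes):
        return 1e-6 + nbytes / 4e9               # toy model: 1 us base + ~32 Gb/s link

class KernelLayer:
    def scheduling_delay(self, runnable_tasks, cores):
        return max(0, runnable_tasks - cores) * 1e-3   # toy round-robin backlog

class ApplicationTier:
    def __init__(self, name, service_time):
        self.name, self.service_time = name, service_time

def simulate_request(tiers, comm, kernel, load):
    """Walk one request through WS -> AS -> DB, summing the three layers' costs."""
    latency = 0.0
    for tier in tiers:
        latency += comm.transfer_latency(4096)          # inter-tier message
        latency += kernel.scheduling_delay(load, cores=8)
        latency += tier.service_time
    return latency

tiers = [ApplicationTier("WS", 0.002), ApplicationTier("AS", 0.005), ApplicationTier("DB", 0.010)]
print(simulate_request(tiers, InfiniBandLayer(), KernelLayer(), load=12))
```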


dependable systems and networks | 2014

Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory

Yixin Luo; Sriram Govindan; Bikash Sharma; Mark Santaniello; Justin Meza; Aman Kansal; Jie Liu; Badriddine Khessib; Kushagra Vaid; Onur Mutlu

Memory devices represent a key component of datacenter total cost of ownership (TCO), and techniques used to reduce errors that occur on these devices increase this cost. Existing approaches to providing reliability for memory devices pessimistically treat all data as equally vulnerable to memory errors. Our key insight is that there exists a diverse spectrum of tolerance to memory errors in new data-intensive applications, and that traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost. For example, we found that while traditional error protection increases memory system cost by 12.5%, some applications can achieve 99.00% availability on a single server with a large number of memory errors without any error protection. This presents an opportunity to greatly reduce server hardware cost by provisioning the right amount of memory reliability for different applications. Toward this end, in this paper, we make three main contributions to enable highly-reliable servers at low datacenter cost. First, we develop a new methodology to quantify the tolerance of applications to memory errors. Second, using our methodology, we perform a case study of three new data-intensive workloads (an interactive web search application, an in-memory key-value store, and a graph mining framework) to identify new insights into the nature of application memory error vulnerability. Third, based on our insights, we propose several new hardware/software heterogeneous-reliability memory system designs to lower datacenter cost while achieving high reliability, and discuss their trade-offs. We show that our new techniques can reduce server hardware cost by 4.7% while achieving 99.90% single server availability.
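
A hedged sketch of the kind of placement decision a heterogeneous-reliability memory design implies: data regions whose measured vulnerability is low enough for the availability target go to cheaper, less-protected memory, and only the vulnerable remainder pays the ECC premium. The 12.5% premium is quoted in the abstract; the region data, threshold rule, and function names are illustrative assumptions.

```python
# Illustrative sketch: choose a memory tier per data region based on a measured
# vulnerability score (estimated fraction of errors in that region that cause a
# visible failure). Only the 12.5% ECC premium comes from the abstract.

ECC_COST_PREMIUM = 0.125   # protected memory costs ~12.5% more (from the abstract)

def place_regions(regions, availability_target=0.999):
    """regions: list of (name, gb, vulnerability)."""
    plan = {}
    for name, gb, vulnerability in regions:
        tier = "ecc" if vulnerability > (1 - availability_target) else "non-ecc"
        plan[name] = (gb, tier)
    return plan

def relative_memory_cost(plan):
    base = sum(gb for gb, _ in plan.values())
    cost = sum(gb * ((1 + ECC_COST_PREMIUM) if tier == "ecc" else 1.0)
               for gb, tier in plan.values())
    return cost / base

regions = [("query cache", 96, 0.0002), ("index", 24, 0.02), ("heap metadata", 8, 0.3)]
plan = place_regions(regions)
print(plan, relative_memory_cost(plan))   # only the vulnerable regions pay the premium
```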


international conference on distributed computing systems | 2013

HybridMR: A Hierarchical MapReduce Scheduler for Hybrid Data Centers

Bikash Sharma; Timothy Wood; Chita R. Das

Virtualized environments are attractive because they simplify cluster management, while facilitating cost-effective workload consolidation. As a result, virtual machines in public clouds or private data centers have become the norm for running transactional applications like web services and virtual desktops. On the other hand, batch workloads like MapReduce are typically deployed in a native cluster to avoid the performance overheads of virtualization. While both these virtual and native environments have their own strengths and weaknesses, we demonstrate in this work that it is feasible to provide the best of these two computing paradigms in a hybrid platform. In this paper, we make a case for a hybrid data center consisting of native and virtual environments, and propose a 2-phase hierarchical scheduler, called HybridMR, for the effective resource management of interactive and batch workloads. In the first phase, HybridMR classifies incoming MapReduce jobs based on the expected virtualization overheads, and uses this information to automatically guide placement between physical and virtual machines. In the second phase, HybridMR manages the run-time performance of MapReduce jobs collocated with interactive applications in order to provide best-effort delivery to batch jobs, while complying with the Service Level Agreements (SLAs) of interactive applications. By consolidating batch jobs with over-provisioned foreground applications, the available unused resources are better utilized, resulting in improved application performance and energy efficiency. Evaluations on a hybrid cluster consisting of 24 physical servers and 48 virtual machines, with a diverse workload mix of interactive and batch MapReduce applications, demonstrate that HybridMR can achieve up to 40% improvement in the completion times of MapReduce jobs over the virtual-only case, while complying with the SLAs of interactive applications. Compared to the native-only cluster, at the cost of a minimal performance penalty, HybridMR boosts resource utilization by 45%, and achieves up to 43% energy savings. These results indicate that a hybrid data center with an efficient scheduling mechanism can provide a cost-effective solution for hosting both batch and interactive workloads.
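
A minimal sketch of the phase-1 decision described above: estimate how much a MapReduce job would slow down under virtualization and route it to virtual or native nodes accordingly. The overhead model, job features, and the 15% threshold are illustrative assumptions, not HybridMR's actual classifier.

```python
# Hypothetical sketch of HybridMR-style phase-1 placement.

def expected_virt_overhead(job):
    """Toy model: I/O- and network-heavy jobs suffer more virtualization overhead."""
    return 0.05 + 0.4 * job["io_intensity"] + 0.2 * job["net_intensity"]

def place(job, overhead_threshold=0.15):
    return "virtual" if expected_virt_overhead(job) <= overhead_threshold else "native"

jobs = [
    {"name": "wordcount", "io_intensity": 0.1, "net_intensity": 0.1},
    {"name": "terasort",  "io_intensity": 0.7, "net_intensity": 0.5},
]
for job in jobs:
    print(job["name"], "->", place(job))

# Phase 2 (not shown) would throttle collocated batch jobs at run time so that the
# interactive applications' SLAs are met.
```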


dependable systems and networks | 2013

CloudPD: Problem determination and diagnosis in shared dynamic clouds

Bikash Sharma; Praveen Jayachandran; Akshat Verma; Chita R. Das

In this work, we address problem determination in virtualized clouds. We show that high dynamism, resource sharing, frequent reconfiguration, high propensity to faults and automated management introduce significant new challenges towards fault diagnosis in clouds. Towards this, we propose CloudPD, a fault management framework for clouds. CloudPD leverages (i) a canonical representation of the operating environment to quantify the impact of sharing; (ii) an online learning process to tackle dynamism; (iii) correlation-based performance models for higher detection accuracy; and (iv) an integrated end-to-end feedback loop to synergize with a cloud management ecosystem. Using a prototype implementation with cloud-representative batch and transactional workloads like Hadoop, Olio and RUBiS, it is shown that CloudPD detects and diagnoses faults with low false positives (< 16%) and high accuracy of 88%, 83% and 83%, respectively. In an enterprise trace-based case study, CloudPD diagnosed anomalies within 30 seconds and with an accuracy of 77%, demonstrating its effectiveness in real-life operations.
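
A hedged sketch of the correlation-based detection idea: learn the normal correlation between metrics from healthy epochs, then flag a window whose observed correlation deviates beyond a threshold. The metric names, baseline, and threshold are assumptions for illustration, not CloudPD's actual models.

```python
# Illustrative correlation-based anomaly check (not CloudPD's actual model).

import statistics

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

def is_anomalous(window, baseline_corr, threshold=0.4):
    """window: dict of metric -> samples; flag if CPU and throughput decouple."""
    corr = pearson(window["cpu"], window["throughput"])
    return abs(corr - baseline_corr) > threshold

baseline = 0.9   # learned online from fault-free epochs (assumption)
window = {"cpu": [80, 85, 90, 95], "throughput": [120, 110, 90, 70]}  # CPU up, work down
print(is_anomalous(window, baseline))   # True -> candidate fault, hand off to diagnosis
```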


international conference on cloud computing | 2012

MROrchestrator: A Fine-Grained Resource Orchestration Framework for MapReduce Clusters

Bikash Sharma; Ramya Prabhakar; Seung-Hwan Lim; Mahmut T. Kandemir; Chita R. Das

Efficient resource management in data centers and clouds running large distributed data processing frameworks like MapReduce is crucial for enhancing the performance of hosted applications and increasing resource utilization. However, existing resource scheduling schemes in Hadoop MapReduce allocate resources at the granularity of fixed-size, static portions of nodes, called slots. In this work, we show that MapReduce jobs have widely varying demands for multiple resources, making the static and fixed-size slot-level resource allocation a poor choice both from the performance and resource utilization standpoints. Furthermore, lack of coordination in the management of multiple resources across nodes prevents dynamic slot reconfiguration, and leads to resource contention. Motivated by this, we propose MROrchestrator, a MapReduce resource Orchestrator framework, which can dynamically identify resource bottlenecks, and resolve them through fine-grained, coordinated, and on-demand resource allocations. We have implemented MROrchestrator on two 24-node native and virtualized Hadoop clusters. Experimental results with a suite of representative MapReduce benchmarks demonstrate up to 38% reduction in job completion times, and up to 25% increase in resource utilization. We further demonstrate the performance boost in existing resource managers like NGM and Mesos, when augmented with MROrchestrator.
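
A minimal sketch of the fine-grained, on-demand reallocation the abstract argues for, as opposed to fixed-size slots: detect which resource a task is bottlenecked on and shift allocation toward it from an under-used resource. The resource names and the rebalancing rule are illustrative assumptions, not MROrchestrator's algorithm.

```python
# Illustrative sketch: per-task bottleneck detection and coordinated reallocation.

def bottleneck(usage, allocation):
    """Return the resource with the highest usage/allocation ratio."""
    return max(allocation, key=lambda r: usage[r] / allocation[r])

def rebalance(usage, allocation, step=0.1):
    hot = bottleneck(usage, allocation)
    cold = min(allocation, key=lambda r: usage[r] / allocation[r])
    shift = allocation[cold] * step
    allocation[hot] += shift
    allocation[cold] -= shift
    return allocation

usage = {"cpu": 1.9, "mem_gb": 1.0}          # task is CPU-bound
allocation = {"cpu": 2.0, "mem_gb": 4.0}      # but holds a lot of idle memory
print(rebalance(usage, allocation))           # grow CPU at the expense of idle memory
```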


international symposium on performance analysis of systems and software | 2011

A dynamic energy management scheme for multi-tier data centers

Seung-Hwan Lim; Bikash Sharma; Byung Chul Tak; Chita R. Das

Multi-tier data centers have become a norm for hosting modern Internet applications because they provide a flexible, modular, scalable and high performance environment. However, these benefits come at the price of the substantial cost of powering and cooling these large hosting centers. Thus, energy efficiency has become a critical consideration in designing Internet data centers. In this paper, we propose a multifaceted approach, Hybrid, consisting of dynamic provisioning, frequency scaling and dynamic power management (DPM) schemes to reduce the energy consumption of multi-tier data centers, while meeting the Service Level Agreements (SLAs). We formulate a mathematical model of the energy and performance/SLA optimization problem, followed by a queueing-theory-based approach to develop two heuristics for solving the optimization problem. The first heuristic dynamically provisions the optimal number of servers required in each tier. The second heuristic proactively decides the CPU speed and the duration of sleep states of a server to achieve further energy savings. We evaluate our heuristics using a simulator that was validated with real measurements on a prototype three-tier data center consisting of 25 servers with two multi-tier application benchmarks. Our experimental results indicate that the proposed scheme, Hybrid, can reduce the energy consumption by 50% relative to static provisioning without CPU frequency scaling and DPM. We demonstrate that Hybrid satisfies the SLAs for dynamically varying workloads. In addition, the proposed multifaceted approach is more energy efficient than alternatives such as dynamic provisioning combined with deep sleep states.
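
A minimal sketch of the flavor of the first heuristic, under a crude queueing approximation (the paper's actual queueing model and SLA form may differ): provision the smallest number of active servers per tier that keeps the estimated mean response time within the SLA, leaving the remaining servers as candidates for the sleep states handled by the second heuristic. All parameters below are illustrative assumptions.

```python
# Hypothetical sketch of dynamic provisioning for one tier: pick the fewest active
# servers such that a simple queueing estimate of response time meets the SLA.

def mean_response_time(arrival_rate, service_rate, servers):
    """Crude approximation: treat the tier as k independent M/M/1 servers."""
    per_server_load = arrival_rate / servers
    if per_server_load >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - per_server_load)

def servers_needed(arrival_rate, service_rate, sla_seconds, max_servers):
    for k in range(1, max_servers + 1):
        if mean_response_time(arrival_rate, service_rate, k) <= sla_seconds:
            return k
    return max_servers

# 300 req/s offered to a tier whose servers each complete 50 req/s; SLA: 50 ms.
print(servers_needed(arrival_rate=300, service_rate=50, sla_seconds=0.05, max_servers=25))
```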


international conference on distributed computing systems | 2017

Phoenix: A Constraint-Aware Scheduler for Heterogeneous Datacenters

Prashanth Thinakaran; Jashwant Raj Gunasekaran; Bikash Sharma; Mahmut T. Kandemir; Chita R. Das

Today's datacenters are increasingly diverse with respect to both hardware and software architectures in order to support a myriad of applications. These applications are also heterogeneous in terms of job response times and resource requirements (e.g., number of cores, GPUs, network speed), and these requirements are expressed as task constraints. Constraints are used for ensuring task performance guarantees/Quality of Service (QoS) by enabling the application to express its specific resource requirements. While several schedulers have recently been proposed that aim to improve overall application and system performance, few of these schedulers consider resource constraints across tasks while making the scheduling decisions. Furthermore, latency-critical workloads and short-lived jobs, which typically constitute about 90% of the total jobs in a datacenter, have strict QoS requirements, which can be ensured by minimizing the tail latency through effective scheduling. In this paper, we propose Phoenix, a constraint-aware hybrid scheduler that addresses both these problems (constraint awareness and ensuring low tail latency) by minimizing the job response times at constrained workers. We use a novel Constraint Resource Vector (CRV) based scheduling, which in turn facilitates reordering of the jobs in a queue to minimize tail latency. We have used the publicly available Google traces to analyze their constraint characteristics and have embedded these constraints in the Cloudera and Yahoo cluster traces to study their impact on system performance. Experiments with the Google, Cloudera and Yahoo cluster traces on a 15,000-worker-node cluster show that Phoenix improves the 99th percentile job response times on average by 1.9× across all three traces when compared against a state-of-the-art hybrid scheduler. Further, in comparison to other distributed schedulers like Hawk, it improves the 90th and 99th percentile job response times by 4.5× and 5× respectively.
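
A hedged sketch of a Constraint Resource Vector style match: each job carries a vector of required capabilities, each worker a vector of what it offers, and queued jobs at a constrained worker are reordered so that short, eligible jobs do not wait behind long ones. The vector encoding and the ordering rule below are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative CRV-style matching and queue reordering (encoding is assumed):
# a worker can run a job only if it satisfies every constrained dimension.

CRV_DIMS = ["cores", "gpus", "net_gbps"]   # dimensions of the assumed vectors

def satisfies(worker_crv, job_crv):
    return all(w >= j for w, j in zip(worker_crv, job_crv))

def reorder_queue(queue, worker_crv):
    """queue: list of (job_id, job_crv, est_runtime). Eligible jobs first,
    shortest first, so small constrained jobs are pulled forward."""
    return sorted(queue, key=lambda job: (not satisfies(worker_crv, job[1]), job[2]))

worker = (16, 1, 10)                       # 16 cores, 1 GPU, 10 Gbps
queue = [
    ("analytics", (8, 0, 1), 600),
    ("inference", (4, 1, 1), 30),
    ("training",  (32, 4, 40), 1200),      # cannot run on this worker at all
]
print([job_id for job_id, _, _ in reorder_queue(queue, worker)])
# -> ['inference', 'analytics', 'training']
```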


international conference on distributed computing systems | 2017

Rain or Shine? — Making Sense of Cloudy Reliability Data

Iyswarya Narayanan; Bikash Sharma; Di Wang; Sriram Govindan; Laura Marie Caulfield; Anand Sivasubramaniam; Aman Kansal; Jie Liu; Badriddine Khessib; Kushagra Vaid

Cloud datacenters must ensure high availability for the hosted applications, and failures can be the bane of datacenter operators. Understanding the what, when and why of failures can help tremendously to mitigate their occurrence and impact. Failures can, however, depend on numerous spatial and temporal factors spanning hardware, workloads, support facilities, and even the environment. One has to rely on failure data from the field to quantify the influence of these factors on failures. Towards this goal, we collect failure data, along with many parameters that might influence failures, from two large production datacenters with very diverse characteristics. We show that multiple factors simultaneously affect failures, and these factors may interact in non-trivial ways. This makes conventional approaches that study aggregate characteristics or single-parameter influences rather inaccurate. Instead, we build a multi-factor analysis framework to systematically identify influencing factors, quantify their relative impact, and help in more accurate decision making for failure mitigation. We demonstrate this approach for three important decisions: spare capacity provisioning, comparing the reliability of hardware for vendor selection, and quantifying flexibility in datacenter climate control for cost-reliability trade-offs.
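
A minimal sketch of what a multi-factor analysis can look like in practice: fit one model over several spatial and temporal factors at once and rank their influence on failures, rather than studying one factor at a time. The paper's statistical framework is not necessarily this one; the feature names, synthetic data, and use of a random forest are assumptions for illustration.

```python
# Illustrative multi-factor analysis on synthetic failure data (not the paper's
# framework): rank the joint influence of several factors on failure incidence.

import random
from sklearn.ensemble import RandomForestClassifier

random.seed(0)
FEATURES = ["inlet_temp_c", "disk_age_months", "utilization", "vendor_b"]

def synth_sample():
    temp = random.uniform(18, 35)
    age = random.uniform(0, 60)
    util = random.uniform(0, 1)
    vendor_b = random.random() < 0.5
    # Failures here depend on temperature *and* age jointly (an interaction effect).
    p_fail = 0.02 + 0.004 * max(0, temp - 27) * (age / 60)
    return [temp, age, util, float(vendor_b)], int(random.random() < p_fail)

data = [synth_sample() for _ in range(5000)]
X, y = [d[0] for d in data], [d[1] for d in data]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, importance in sorted(zip(FEATURES, model.feature_importances_),
                               key=lambda t: -t[1]):
    print(f"{name:>16}: {importance:.2f}")
```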


Archive | 2012

Problem Determination and Diagnosis in Shared Dynamic Clouds

Praveen Jayachandran; Bikash Sharma; Akshat Verma

Collaboration


Dive into Bikash Sharma's collaborations.

Top Co-Authors

Chita R. Das, Pennsylvania State University
Seung-Hwan Lim, Pennsylvania State University
Gunwoo Nam, Pennsylvania State University
Yixin Luo, Carnegie Mellon University