Archana Ganapathi | Researchain

Publication

Featured research published by Archana Ganapathi.

modeling, analysis, and simulation on computer and telecommunication systems | 2011

The Case for Evaluating MapReduce Performance Using Workload Suites

Yanpei Chen; Archana Ganapathi; Rean Griffith; Randy H. Katz

MapReduce systems face enormous challenges due to increasing growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. In this paper, we build the case for going beyond benchmarks for MapReduce performance evaluations. We analyze and compare two production MapReduce traces to develop a vocabulary for describing MapReduce workloads. We show that existing benchmarks fail to capture rich workload characteristics observed in traces, and propose a framework to synthesize and execute representative workloads. We demonstrate that performance evaluations using realistic workloads give cluster operators new ways to identify workload-specific resource bottlenecks and to make workload-specific choices of MapReduce task schedulers. We expect that once available, workload suites would allow cluster operators to accomplish previously challenging tasks beyond what we can now imagine, thus serving as a useful tool to help design and manage MapReduce systems.

international conference on data engineering | 2010

Statistics-driven workload modeling for the Cloud

Archana Ganapathi; Yanpei Chen; Armando Fox; Randy H. Katz; David A. Patterson

A recent trend for data-intensive computations is to use pay-as-you-go execution environments that scale transparently to the user. However, providers of such environments must tackle the challenge of configuring their system to provide maximal performance while minimizing the cost of resources used. In this paper, we use statistical models to predict resource requirements for Cloud computing applications. Such a prediction framework can guide system design and deployment decisions such as scale, scheduling, and capacity. In addition, we present an initial design of a workload generator that can be used to evaluate alternative configurations without the overhead of reproducing a real workload. This paper focuses on statistical modeling and its application to data-intensive workloads.

international conference on data engineering | 2009

Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning

Archana Ganapathi; Harumi A. Kuno; Umeshwar Dayal; Janet L. Wiener; Armando Fox; Michael I. Jordan; David A. Patterson

One of the most challenging aspects of managing a very large data warehouse is identifying how queries will behave before they start executing. Yet knowing their performance characteristics (their runtimes and resource usage) can solve two important problems. First, every database vendor struggles with managing unexpectedly long-running queries. When these long-running queries can be identified before they start, they can be rejected or scheduled when they will not cause extreme resource contention for the other queries in the system. Second, deciding whether a system can complete a given workload in a given time period (or whether a bigger system is necessary) depends on knowing the resource requirements of the queries in that workload. We have developed a system that uses machine learning to accurately predict the performance metrics of database queries whose execution times range from milliseconds to hours. For training and testing our system, we used both real customer queries and queries generated from an extended set of TPC-DS templates. The extensions mimic queries that caused customer problems. We used these queries to compare how accurately different techniques predict metrics such as elapsed time, records used, disk I/Os, and message bytes. The most promising technique was not only the most accurate, but also predicted these metrics simultaneously, using only information available prior to query execution. We validated the accuracy of this machine learning technique on a number of HP Neoview configurations. We were able to predict individual query elapsed time within 20% of its actual time for 85% of the test queries. Most importantly, we were able to correctly identify both the short and long-running (up to two-hour) queries to inform workload management and capacity planning.
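The accuracy figure quoted in this abstract (elapsed time predicted within 20% of the actual time for 85% of test queries) is a "fraction within relative tolerance" measure. A minimal sketch of how such a figure could be computed follows. The function name `fraction_within_tolerance` and the sample timings are hypothetical and chosen for illustration; they are not taken from the paper.

```python
def fraction_within_tolerance(predicted, actual, tolerance=0.20):
    """Return the share of predictions whose error is within
    `tolerance` (relative to the actual value)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for p, a in zip(predicted, actual)
        if abs(p - a) <= tolerance * a
    )
    return hits / len(actual)


# Hypothetical elapsed times in seconds (illustrative, not from the paper),
# spanning sub-second to two-hour queries.
actual_seconds = [0.5, 12.0, 300.0, 7200.0]
predicted_seconds = [0.55, 15.0, 310.0, 6000.0]

share = fraction_within_tolerance(predicted_seconds, actual_seconds)
print(f"{share:.0%} of queries predicted within 20% of actual elapsed time")
```

With these made-up numbers, only the 12-second query misses the 20% band, since its prediction is off by 3 seconds against an allowed 2.4.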

acm special interest group on data communication | 2010

To compress or not to compress - compute vs. IO tradeoffs for mapreduce energy efficiency

Yanpei Chen; Archana Ganapathi; Randy H. Katz

Compression enables us to shift resource bottlenecks between IO and CPU. In modern datacenters, where energy efficiency is a growing concern, the benefits of using compression have not been completely exploited. As MapReduce represents a common computation framework for Internet datacenters, we develop a decision algorithm that helps MapReduce users identify when and where to use compression. For some jobs, using compression gives energy savings of up to 60%. We believe our findings will have a significant impact on improving datacenter energy efficiency.
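The compute-versus-IO tradeoff described in this abstract can be illustrated with a toy energy model. This is not the paper's decision algorithm, which the abstract does not spell out. It is a sketch under simple assumptions: energy is power multiplied by time, IO time scales with bytes moved, and compression time scales with the uncompressed input size. All parameter names and numbers are hypothetical.

```python
def should_compress(data_bytes, compressed_fraction, io_bytes_per_s,
                    compress_bytes_per_s, io_watts, cpu_watts):
    """Toy model: compare the energy of moving raw data against the energy
    of compressing it first and moving the smaller result.

    compressed_fraction is compressed size / original size (0 < f <= 1).
    Returns (compress?, energy_plain_joules, energy_compressed_joules).
    """
    # Energy to move the raw data through the IO path.
    energy_plain = io_watts * data_bytes / io_bytes_per_s
    # Energy to compress on the CPU, then move the reduced volume.
    energy_compressed = (
        cpu_watts * data_bytes / compress_bytes_per_s
        + io_watts * (data_bytes * compressed_fraction) / io_bytes_per_s
    )
    return energy_compressed < energy_plain, energy_plain, energy_compressed


# Hypothetical job: 1 GB of data that compresses to a quarter of its size.
compress, e_plain, e_comp = should_compress(
    data_bytes=1e9,
    compressed_fraction=0.25,
    io_bytes_per_s=50e6,
    compress_bytes_per_s=200e6,
    io_watts=10.0,
    cpu_watts=20.0,
)
print(compress, e_plain, e_comp)
```

In this model, compression pays off when the IO energy saved by moving fewer bytes exceeds the CPU energy spent compressing. Poorly compressible data (a fraction near 1) tips the decision the other way.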

dependable systems and networks | 2005

Crash data collection: a Windows case study

Archana Ganapathi; David A. Patterson

Reliability is a rapidly growing concern in the contemporary personal computer (PC) industry, for both computer users and product developers. To improve dependability, systems designers and programmers must consider failure and usage data for operating systems as well as applications. In this paper, we discuss our experience with crash and usage data collection for Windows machines. We analyze results based on crashes in the UC Berkeley EECS department.

dependable systems and networks | 2004

Why PCs are fragile and what we can do about it: a study of Windows registry problems

Archana Ganapathi; Yi-Min Wang; Ni Lao; Ji-Rong Wen

Software configuration problems are a major source of failures in computer systems. In this paper, we present a new framework for categorizing configuration problems. We apply this categorization to Windows registry-related problems obtained from various internal as well as external sources. Although infrequent, registry-related problems are difficult to diagnose and repair. Consequently, they frustrate users. We classify problems based on their manifestation and the scope of impact to gain useful insights into how problems affect users and why PCs are fragile. We then describe techniques to identify and eliminate such registry failures. We propose health predicate monitoring for detecting known problems, fault injection for improving application robustness, and access protection mechanisms for preventing fragility problems.

Operating Systems Review | 2009

Managing operational business intelligence workloads

Umeshwar Dayal; Harumi A. Kuno; Janet L. Wiener; Kevin Wilkinson; Archana Ganapathi; Stefan Krompass

We explore how to manage database workloads that contain a mixture of OLTP-like queries that run for milliseconds as well as business intelligence queries and maintenance tasks that last for hours. As data warehouses grow in size to petabytes and complex analytic queries play a greater role in day-to-day business operations, factors such as inaccurate cardinality estimates, data skew, and resource contention all make it notoriously difficult to predict how such queries will behave before they start executing. However, traditional workload management assumes that accurate expectations for the resource requirements and performance characteristics of a workload are available at compile-time, and relies on such information in order to make critical workload management decisions. In this paper, we describe our approach to dealing with inaccurate predictions. First, we evaluate the ability of workload management algorithms to handle workloads that include unexpectedly long-running queries. Second, we describe a new and more accurate method for predicting the resource usage of queries before runtime. We have carried out an extensive set of experiments, and report on a few of our results.

databases in networked information systems | 2010

Managing dynamic mixed workloads for operational business intelligence

Harumi A. Kuno; Umeshwar Dayal; Janet L. Wiener; Kevin Wilkinson; Archana Ganapathi; Stefan Krompass

As data warehousing technology gains a ubiquitous presence in business today, companies are becoming increasingly reliant upon the information contained in their data warehouses to inform their operational decisions. This information, known as business intelligence (BI), traditionally has taken the form of nightly or monthly reports and batched analytical queries that are run at specific times of day. However, as the time needed for data to migrate into data warehouses has decreased, and as the amount of data stored has increased, business intelligence has come to include metrics, streaming analysis, and reports with expected delivery times that are measured in hours, minutes, or seconds. The challenge is that in order to meet the necessary response times for these operational business intelligence queries, a given warehouse must be able to support at any given time multiple types of queries, possibly with different sets of performance objectives for each type. In this paper, we discuss why these dynamic mixed workloads make workload management for operational business intelligence (BI) databases so challenging, review current and proposed attempts to address these challenges, and describe our own approach. We have carried out an extensive set of experiments, and report on a few of our results.

knowledge discovery and data mining | 2013

Building blocks for exploratory data analysis tools

Sara Alspaugh; Marti A. Hearst; Archana Ganapathi; Randy H. Katz

Data exploration is largely manual and labor intensive. Although there are various tools and statistical techniques that can be applied to data sets, there is little help to identify what questions to ask of a data set, let alone what domain knowledge is useful in answering the questions. In this paper, we study user queries against production data sets in Splunk. Specifically, we characterize the interplay between data sets and the operations used to analyze them using latent semantic analysis, and discuss how this characterization serves as a building block for a data analysis recommendation system. This is a work-in-progress paper.

usenix symposium on internet technologies and systems | 2003