Publication


Featured research published by Bill Howe.


Very Large Data Bases | 2010

HaLoop: efficient iterative data processing on large clusters

Yingyi Bu; Bill Howe; Magdalena Balazinska; Michael D. Ernst

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. These platforms lack built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph analysis, and model fitting. This paper presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications, it also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms. We evaluated HaLoop on real queries and real datasets. Compared with Hadoop, on average, HaLoop reduces query runtimes by a factor of 1.85 and shuffles only 4% of the data between mappers and reducers.
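The iterative pattern HaLoop targets can be illustrated with a small sketch. This is not HaLoop's API (HaLoop extends Hadoop's Java interfaces); it is a minimal, single-machine PageRank-style fixpoint loop in which the link table is loop-invariant — exactly the data HaLoop's caching keeps resident across iterations instead of reshuffling it each round.

```python
def pagerank_step(links, ranks, damping=0.85):
    """One map/reduce-style iteration: distribute each page's rank along its links."""
    contribs = {}
    for page, outs in links.items():
        share = ranks[page] / len(outs)
        for dest in outs:
            contribs[dest] = contribs.get(dest, 0.0) + share
    return {p: (1 - damping) + damping * contribs.get(p, 0.0) for p in links}

def iterate_to_fixpoint(links, max_iters=30, eps=1e-6):
    # 'links' is the loop-invariant input: a loop-aware scheduler can cache it
    # on the nodes that hold it, rather than re-reading it every iteration.
    ranks = {p: 1.0 for p in links}
    for _ in range(max_iters):
        new_ranks = pagerank_step(links, ranks)
        if max(abs(new_ranks[p] - ranks[p]) for p in links) < eps:
            return new_ranks
        ranks = new_ranks
    return ranks
```

In a driver program written against plain Hadoop, each trip around this loop is a fresh job that rescans the link table; HaLoop's contribution is to make that reuse explicit.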


International Conference on Management of Data | 2012

SkewTune: mitigating skew in MapReduce applications

YongChul Kwon; Magdalena Balazinska; Bill Howe; Jerome Rolia

We present an automatic skew mitigation approach for user-defined MapReduce programs and SkewTune, a system that implements this approach as a drop-in replacement for an existing MapReduce implementation. There are three key challenges: the approach must (a) require no extra input from the user yet work for all MapReduce applications, (b) be completely transparent, and (c) impose minimal overhead if there is no skew. The SkewTune approach addresses these challenges and works as follows: When a node in the cluster becomes idle, SkewTune identifies the task with the greatest expected remaining processing time. The unprocessed input data of this straggling task is then proactively repartitioned in a way that fully utilizes the nodes in the cluster and preserves the ordering of the input data so that the original output can be reconstructed by concatenation. We implement SkewTune as an extension to Hadoop and evaluate its effectiveness using several real applications. The results show that SkewTune can significantly reduce job runtime in the presence of skew and adds little to no overhead in the absence of skew.
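The two core decisions described above — which straggler to help, and how to split its remaining work while preserving input order — can be sketched in a few lines. These function names and data shapes are invented for illustration; SkewTune itself operates inside Hadoop's task scheduler.

```python
def pick_straggler(tasks):
    """tasks: {task_id: (remaining_records, records_per_sec)}.
    Return the task with the greatest expected remaining processing time."""
    return max(tasks, key=lambda t: tasks[t][0] / tasks[t][1])

def repartition(unprocessed, n_idle):
    """Split a straggler's remaining input into n_idle + 1 contiguous chunks
    (one stays with the original task). Contiguity preserves input order,
    so the final output is reconstructed by simple concatenation."""
    n_parts = n_idle + 1
    size, rem = divmod(len(unprocessed), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rem else 0)
        chunks.append(unprocessed[start:end])
        start = end
    return chunks
```

The contiguous split is the key design choice: hash repartitioning would balance load equally well but would scramble the output order, breaking transparency for downstream consumers.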


Symposium on Cloud Computing | 2010

Skew-resistant parallel processing of feature-extracting scientific user-defined functions

YongChul Kwon; Magdalena Balazinska; Bill Howe; Jerome Rolia

Scientists today have the ability to generate data at an unprecedented scale and rate and, as a result, they must increasingly turn to parallel data processing engines to perform their analyses. However, the simple execution model of these engines can make it difficult to implement efficient algorithms for scientific analytics. In particular, many scientific analytics require the extraction of features from data represented as either a multidimensional array or points in a multidimensional space. These applications exhibit significant computational skew, where the runtime of different partitions depends on more than just input size and can therefore vary dramatically and unpredictably. In this paper, we present SkewReduce, a new system implemented on top of Hadoop that enables users to easily express feature extraction analyses and execute them efficiently. At the heart of the SkewReduce system is an optimizer, parameterized by user-defined cost functions, that determines how best to partition the input data to minimize computational skew. Experiments on real data from two different science domains demonstrate that our approach can improve execution times by a factor of up to 8 compared to a naive implementation.
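The optimizer described above can be caricatured as follows — a toy sketch, not SkewReduce's real interface: the user supplies a cost function estimating how long a candidate partition would take to process, and the planner recursively splits any partition whose estimated cost exceeds a budget, balancing load even when cost is not proportional to input size.

```python
def plan_partitions(points, cost_fn, budget):
    """Recursively bisect a list of (x, y) points until every partition's
    user-estimated cost fits within 'budget'.

    cost_fn: user-defined estimate of processing time for a partition;
    this is where computational skew (cost not proportional to size)
    enters the plan."""
    if cost_fn(points) <= budget or len(points) <= 1:
        return [points]
    pts = sorted(points)       # split along x for simplicity
    mid = len(pts) // 2
    return (plan_partitions(pts[:mid], cost_fn, budget)
            + plan_partitions(pts[mid:], cost_fn, budget))
```

With a superlinear cost function (say, quadratic in partition size, as in clustering-like feature extraction), this yields many small partitions in dense regions and fewer elsewhere, which is the effect the paper's optimizer achieves at scale.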


IEEE Transactions on Visualization and Computer Graphics | 2016

Voyager: Exploratory Analysis via Faceted Browsing of Visualization Recommendations

Kanit Wongsuphasawat; Dominik Moritz; Anushka Anand; Jock D. Mackinlay; Bill Howe; Jeffrey Heer

General visualization tools typically require manual specification of views: analysts must select data variables and then choose which transformations and visual encodings to apply. These decisions often involve both domain and visualization design expertise, and may impose a tedious specification process that impedes exploration. In this paper, we seek to complement manual chart construction with interactive navigation of a gallery of automatically-generated visualizations. We contribute Voyager, a mixed-initiative system that supports faceted browsing of recommended charts chosen according to statistical and perceptual measures. We describe Voyager's architecture, motivating design principles, and methods for generating and interacting with visualization recommendations. In a study comparing Voyager to a manual visualization specification tool, we find that Voyager facilitates exploration of previously unseen data and leads to increased data variable coverage. We then distill design implications for visualization tools, in particular the need to balance rapid exploration and targeted question-answering.
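The recommendation step can be sketched in miniature. The real system builds on the Vega-Lite stack and far richer scoring; here both the effectiveness scores and the function names are invented, purely to show the shape of "enumerate candidate encodings, rank by a perceptual measure":

```python
# Invented perceptual-effectiveness scores per (field type, channel) pair.
EFFECTIVENESS = {
    ('quantitative', 'position'): 1.0,
    ('quantitative', 'size'): 0.6,
    ('nominal', 'position'): 0.9,
    ('nominal', 'color'): 0.7,
}

def recommend(fields, channels=('position', 'size', 'color')):
    """fields: {field_name: data_type}. Enumerate (field, channel) candidates
    and return them best-first by effectiveness score."""
    cands = [(EFFECTIVENESS.get((ftype, ch), 0.0), name, ch)
             for name, ftype in fields.items() for ch in channels]
    return sorted(cands, reverse=True)
```

Faceted browsing then amounts to grouping this ranked list by field or by chart type rather than presenting one flat gallery.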


International Conference on Management of Data | 2015

The BigDAWG Polystore System

Jennie Duggan; Aaron J. Elmore; Michael Stonebraker; Magdalena Balazinska; Bill Howe; Jeremy Kepner; Samuel Madden; David Maier; Timothy G. Mattson; Stan Zdonik

This paper presents a new view of federated databases to address the growing need for managing information that spans multiple data models. This trend is fueled by the proliferation of storage engines and query languages based on the observation that "no one size fits all". To address this shift, we propose a polystore architecture; it is designed to unify querying over multiple data models. We consider the challenges and opportunities associated with polystores. Open questions in this space revolve around query optimization and the assignment of objects to storage engines. We introduce our approach to these topics and discuss our prototype in the context of the Intel Science and Technology Center for Big Data.
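The basic dispatch a polystore performs can be sketched in a toy form. The engine registry and function names below are invented (BigDAWG's real design groups engines into "islands" with cast operations between them); the sketch only shows the core idea of routing each sub-query to an engine matching its data model.

```python
# Illustrative registry: data model -> engine that natively supports it.
ENGINES = {
    'relational': 'postgres',
    'array': 'scidb',
    'key-value': 'accumulo',
}

def route(subqueries):
    """subqueries: [(data_model, query_text)] -> [(engine, query_text)].
    The open optimization questions the paper raises live here: which engine
    to pick when several could serve a model, and where objects should live."""
    plans = []
    for model, text in subqueries:
        if model not in ENGINES:
            raise ValueError(f'no engine registered for model {model!r}')
        plans.append((ENGINES[model], text))
    return plans
```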


Proceedings of the National Academy of Sciences of the United States of America | 2016

Deciphering ocean carbon in a changing world

Mary Ann Moran; Elizabeth B. Kujawinski; Aron Stubbins; Rob Fatland; Lihini I. Aluwihare; Alison Buchan; Byron C. Crump; Pieter C. Dorrestein; Sonya T. Dyhrman; Nancy J. Hess; Bill Howe; Krista Longnecker; Patricia M. Medeiros; Jutta Niggemann; Ingrid Obernosterer; Daniel J. Repeta; Jacob R. Waldbauer

Dissolved organic matter (DOM) in the oceans is one of the largest pools of reduced carbon on Earth, comparable in size to the atmospheric CO2 reservoir. A vast number of compounds are present in DOM, and they play important roles in all major element cycles, contribute to the storage of atmospheric CO2 in the ocean, support marine ecosystems, and facilitate interactions between organisms. At the heart of the DOM cycle lie molecular-level relationships between the individual compounds in DOM and the members of the ocean microbiome that produce and consume them. In the past, these connections have eluded clear definition because of the sheer numerical complexity of both DOM molecules and microorganisms. Emerging tools in analytical chemistry, microbiology, and informatics are breaking down the barriers to a fuller appreciation of these connections. Here we highlight questions being addressed using recent methodological and technological developments in those fields and consider how these advances are transforming our understanding of some of the most important reactions of the marine carbon cycle.


Publications of the Astronomical Society of the Pacific | 2011

Astronomy in the Cloud: Using MapReduce for Image Co-Addition

Keith Wiley; Andrew J. Connolly; Jeffrey P. Gardner; K. Simon Krughoff; Magdalena Balazinska; Bill Howe; YongChul Kwon; Yingyi Bu

In the coming decade, astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. The study of these sources will involve computational challenges such as anomaly detection, classification, and moving-object tracking. Since such studies benefit from the highest quality data, methods such as image coaddition (stacking) will be a critical preprocessing step prior to scientific investigation. With a requirement that these images be analyzed on a nightly basis to identify moving sources or transient objects, these data streams present many computational challenges. Given the quantity of data involved, the computational load of these problems can only be addressed by distributing the workload over a large number of nodes. However, the high data throughput demanded by these applications may present scalability challenges for certain storage architectures. One scalable data-processing method that has emerged in recent years is MapReduce, and in this paper we focus on its popular open-source implementation called Hadoop. In the Hadoop framework, the data is partitioned among storage attached directly to worker nodes, and the processing workload is scheduled in parallel on the nodes that contain the required input data. A further motivation for using Hadoop is that it allows us to exploit cloud computing resources, e.g., Amazon's EC2. We report on our experience implementing a scalable image-processing pipeline for the SDSS imaging database using Hadoop. This multi-terabyte imaging dataset provides a good testbed for algorithm development since its scope and structure approximate future surveys. First, we describe MapReduce and how we adapted image coaddition to the MapReduce framework. Then we describe a number of optimizations to our basic approach and report experimental results comparing their performance.
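Stripped of registration, filtering, and scale, coaddition maps naturally onto MapReduce: the map phase keys each pixel by the sky coordinate it covers, and the reduce phase averages all values that land on the same coordinate. The sketch below is an in-memory caricature of that structure (not the paper's Hadoop pipeline), with images simplified to dicts from sky coordinate to flux.

```python
from collections import defaultdict

def map_image(image):
    """Map step: image is {(sky_x, sky_y): flux}; emit (coordinate, flux) pairs."""
    for coord, flux in image.items():
        yield coord, flux

def coadd(images):
    """Shuffle + reduce step: group fluxes by sky coordinate, then average.
    Stacking N overlapping exposures this way improves signal-to-noise."""
    groups = defaultdict(list)
    for img in images:
        for coord, flux in map_image(img):
            groups[coord].append(flux)
    return {coord: sum(vals) / len(vals) for coord, vals in groups.items()}
```

In the real pipeline the interesting engineering is everything this sketch omits: locating the FITS files that overlap a query region, projecting them onto a common coordinate system, and keeping the shuffle volume manageable.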


International Conference on Management of Data | 2014

Demonstration of the Myria big data management service

Daniel Halperin; Victor Teixeira de Almeida; Lee Lee Choo; Shumo Chu; Paraschos Koutris; Dominik Moritz; Jennifer Ortiz; Vaspol Ruamviboonsuk; Jingjing Wang; Andrew Whitaker; Shengliang Xu; Magdalena Balazinska; Bill Howe; Dan Suciu

In this demonstration, we will showcase Myria, our novel cloud service for big data management and analytics designed to improve productivity. Myria's goal is for users to simply upload their data and for the system to help them be self-sufficient data science experts on their data -- self-serve analytics. Using a web browser, Myria users can upload data, author efficient queries to process and explore the data, and debug correctness and performance issues. Myria queries are executed on a scalable, parallel cluster that uses both state-of-the-art and novel methods for distributed query processing. Our interactive demonstration will guide visitors through an exploration of several key Myria features by interfacing with the live system to analyze big datasets over the web.


International Conference on Cluster Computing | 2009

Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help?

Sarah Loebman; Dylan Nunley; YongChul Kwon; Bill Howe; Magdalena Balazinska; Jeffrey P. Gardner

As the datasets used to fuel modern scientific discovery grow increasingly large, they become increasingly difficult to manage using conventional software. Parallel database management systems (DBMSs) and massive-scale data processing systems such as MapReduce hold promise to address this challenge. However, since these systems have not been expressly designed for scientific applications, their efficacy in this domain has not been thoroughly tested. In this paper, we study the performance of these engines in one specific domain: massive astrophysical simulations. We develop a use case that comprises five representative queries. We implement this use case in one distributed DBMS and in the Pig/Hadoop system. We compare the performance of the tools to each other and to hand-written IDL scripts. We find that certain representative analyses are easy to express in each engine's high-level language, and both systems provide competitive performance and improved scalability relative to current IDL-based methods.


Computing in Science and Engineering | 2012

Virtual Appliances, Cloud Computing, and Reproducible Research

Bill Howe

As science becomes increasingly computational, reproducibility has become increasingly difficult, perhaps surprisingly. In many contexts, virtualization and cloud computing can mitigate the issues involved without significant overhead to the researcher, enabling the next generation of rigorous and reproducible computational science.

Collaboration


Dive into Bill Howe's collaboration.

Top Co-Authors

David Maier, Portland State University
Dan Suciu, University of Washington
YongChul Kwon, University of Washington
Dominik Moritz, University of Washington
Paraschos Koutris, University of Wisconsin-Madison
Jeffrey Heer, University of Washington