Publication


Featured research published by Noah Watkins.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

SciHadoop: array-based query processing in Hadoop

Joe B. Buck; Noah Watkins; Jeff LeFevre; Kleoni Ioannidou; Carlos Maltzahn; Neoklis Polyzotis; Scott A. Brandt

Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte-stream data model causes inefficiencies when used to process scientific data that is commonly stored in highly structured, array-based binary file formats, resulting in limited scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin that allows scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality; and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions of I/O, both locally and over the network.
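The partition-pruning and opportunistic map-phase aggregation described above can be sketched in a few lines. This is an illustrative toy, not SciHadoop's API: the array is split into chunks (as an array-based binary format would store it), a query is a logical index range, and chunks that cannot contribute are pruned before any bytes are read; each map task emits a tiny partial aggregate instead of raw records.

```python
# Hypothetical sketch of SciHadoop-style logical querying over chunked array data.

def chunks(n, chunk_size):
    """Partition logical indices [0, n) into contiguous chunks."""
    return [(lo, min(lo + chunk_size, n)) for lo in range(0, n, chunk_size)]

def prune(chunk_list, query):
    """Keep only chunks overlapping the query's logical range (avoids block scans)."""
    qlo, qhi = query
    return [(lo, hi) for lo, hi in chunk_list if lo < qhi and hi > qlo]

def map_phase(data, chunk, query):
    """Opportunistically aggregate within the map: one partial result per chunk."""
    lo, hi = chunk
    qlo, qhi = query
    return max(data[max(lo, qlo):min(hi, qhi)])

def run_query(data, query, chunk_size=4):
    parts = [map_phase(data, c, query) for c in prune(chunks(len(data), chunk_size), query)]
    return max(parts)  # reduce phase combines small partials, not raw data

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
print(run_query(data, (2, 9)))  # max over logical indices [2, 9) -> 9
```

Because only partial aggregates cross the shuffle, total data transfer shrinks; pruning additionally avoids reading chunks outside the query's logical extent.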


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Mantle: a programmable metadata load balancer for the Ceph file system

Michael A. Sevilla; Noah Watkins; Carlos Maltzahn; Ike Nassi; Scott A. Brandt; Sage A. Weil; Greg Farnum; Sam Fineberg

Migrating resources is a useful tool for balancing load in a distributed system, but it is difficult to determine when to move resources, where to move resources, and how much of them to move. We look at resource migration for file system metadata and show how CephFS's dynamic subtree partitioning approach can exploit varying degrees of locality and balance because it can partition the namespace into variable-sized units. Unfortunately, the current metadata balancer is complicated and difficult to control because it struggles to address many of the general resource migration challenges inherent to the metadata management problem. To help decouple policy from mechanism, we introduce a programmable storage system that lets the designer inject custom balancing logic. We show the flexibility and transparency of this approach by replicating the strategy of a state-of-the-art metadata balancer, and conclude by comparing this strategy to other custom balancers on the same system.
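The policy/mechanism split at the heart of Mantle can be caricatured as follows. The real system injects Lua policies into CephFS's metadata balancer; this Python sketch is purely illustrative, with hypothetical `when` and `howmuch` callbacks standing in for injected balancing logic.

```python
# Mechanism: migrate load from the busiest to the least-busy server,
# but only when the injected policy says so and by the amount it chooses.

def migrate(loads, when, howmuch):
    loads = list(loads)
    if not when(loads):
        return loads                      # policy decided not to balance
    src = loads.index(max(loads))
    dst = loads.index(min(loads))
    amount = howmuch(loads, src)          # policy decides how much to ship
    loads[src] -= amount
    loads[dst] += amount
    return loads

# One possible injected policy: balance when the busiest server holds more
# than half the total load, and ship half of its excess over the mean.
when_policy = lambda loads: max(loads) > sum(loads) / 2
howmuch_policy = lambda loads, src: (loads[src] - sum(loads) / len(loads)) / 2

print(migrate([90, 5, 5], when_policy, howmuch_policy))
```

Swapping in a different `when`/`howmuch` pair changes the balancing strategy without touching the migration mechanism, which is the decoupling the paper argues for.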


International Parallel and Distributed Processing Symposium | 2017

The Popper Convention: Making Reproducible Systems Evaluation Practical

Ivo Jimenez; Michael A. Sevilla; Noah Watkins; Carlos Maltzahn; Jay F. Lofstead; Kathryn Mohror; Andrea C. Arpaci-Dusseau; Remzi H. Arpaci-Dusseau

Independent validation of experimental results in the field of systems research is a challenging task, mainly due to differences in software and hardware in computational environments. Recreating an environment that resembles the original is difficult and time-consuming. In this paper we introduce _Popper_, a convention based on a set of modern open source software (OSS) development principles for generating reproducible scientific publications. Concretely, we make the case for treating an article as an OSS project following a DevOps approach and applying software engineering best practices to manage its associated artifacts and maintain the reproducibility of its findings. Popper leverages existing cloud-computing infrastructure and DevOps tools to produce academic articles that are easy to validate and extend. We present a use case that illustrates the usefulness of this approach. We show how, by following the _Popper_ convention, reviewers and researchers can quickly get to the point of getting results without relying on the original authors' intervention.


Proceedings of the Second International Workshop on Data-Aware Distributed Computing | 2009

Abstract storage: moving file format-specific abstractions into petabyte-scale storage systems

Joe B. Buck; Noah Watkins; Carlos Maltzahn; Scott A. Brandt

High-end computing is increasingly I/O bound as computations become more data-intensive, and data transport technologies struggle to keep pace with the demands of large-scale, distributed computations. One approach to avoiding unnecessary I/O is to move the processing to the data, as seen in Google's successful, but relatively specialized, MapReduce system. This paper discusses our investigation towards a general solution for enabling in-situ computation in a petabyte-scale storage system. We believe our work with flexible, application-specific structured storage is the key to addressing the I/O overhead caused by data partitioning across storage nodes. To manage competing workloads on storage nodes, we leverage our research in system performance management. Our ultimate goal is a general framework for in-situ data-intensive processing, indexing, and searching, which we expect to provide orders-of-magnitude performance increases for data-intensive workloads.


European Conference on Computer Systems | 2017

Malacology: A Programmable Storage System

Michael A. Sevilla; Noah Watkins; Ivo Jimenez; Peter Alvaro; Shel Finkelstein; Jeff LeFevre; Carlos Maltzahn

Storage systems need to support high performance for special-purpose data processing applications that run on an evolving storage device technology landscape. This puts tremendous pressure on storage systems to support rapid change, both in terms of their interfaces and their performance. But adapting storage systems can be difficult because unprincipled changes might jeopardize years of code-hardening and performance optimization efforts that were necessary for users to entrust their data to the storage system. We introduce the programmable storage approach, which exposes internal services and abstractions of the storage stack as building blocks for higher-level services. We also build a prototype to explore how existing abstractions of common storage system services can be leveraged to adapt to the needs of new data processing systems and the increasing variety of storage devices. We illustrate the advantages and challenges of this approach by composing existing internal abstractions into two new higher-level services: a file system metadata load balancer and a high-performance distributed shared log. The evaluation demonstrates that our services inherit desirable qualities of the back-end storage system, including the ability to balance load, efficiently propagate service metadata, recover from failure, and navigate trade-offs between latency and throughput using leases.
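The composition idea behind one of the two services, the shared log, can be sketched by layering a log on top of an append-only object interface like the one a back-end object store already exposes. All class and method names here are hypothetical stand-ins, not Malacology's actual interfaces.

```python
# Sketch: a distributed shared log composed from a back end's append-only
# object interface plus a sequencer that hands out log positions.

class ObjectStore:
    """Stands in for the storage back end's append-only object interface."""
    def __init__(self):
        self.objects = {}
    def append(self, oid, entry):
        self.objects.setdefault(oid, []).append(entry)

class SharedLog:
    """Higher-level service: stripes sequenced entries across objects."""
    def __init__(self, store, stripe=2):
        self.store, self.stripe, self.tail = store, stripe, 0
    def append(self, entry):
        pos = self.tail                   # sequencer assigns the next position
        self.tail += 1
        self.store.append(f"log.{pos % self.stripe}", (pos, entry))
        return pos
    def read(self, pos):
        for p, entry in self.store.objects[f"log.{pos % self.stripe}"]:
            if p == pos:
                return entry

log = SharedLog(ObjectStore())
for msg in ["a", "b", "c"]:
    log.append(msg)
print(log.read(2))  # -> c
```

The higher-level log adds only sequencing and striping; durability and recovery stay with the underlying store, which is how composed services inherit the back end's hardened qualities.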


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

DataMods: Programmable File System Services

Noah Watkins; Carlos Maltzahn; Scott A. Brandt; Adam Manzanares

As applications become more complex, and the level of concurrency in systems continues to rise, developers are struggling to scale complex data models on top of a traditional byte-stream interface. Middleware tailored for specific data models is a common approach to dealing with these challenges, but middleware commonly reproduces scalable services already present in many distributed file systems. We present DataMods, an abstraction over existing services found in large-scale storage systems that allows middleware to take advantage of existing, highly tuned services. Specifically, DataMods provides an abstraction for extending storage system services in order to implement native, domain-specific data models and interfaces throughout the storage hierarchy.


Archive | 2006

The Design, Modeling, and Implementation of Group Scheduling for Isolation of Computations from Adversarial Interference

Terry Tidwell; Noah Watkins; Venkita Subramonian; Douglas Niehaus; Christopher Gill; Armando Migliaccio

To isolate computations from denial of service (DoS) attacks and other forms of adversarial interference, it is necessary to constrain the effects of interactions among computations. This paper makes four contributions to research on isolation of computations from adversarial interference: (1) it describes the design and implementation of a kernel-level scheduling policy to control the effects of adversarial attacks on computations’ execution; (2) it presents formal models of the system components that are involved in a representative DoS attack scenario; (3) it shows how model checking can be used to analyze that example scenario, under default Linux scheduling semantics and under our scheduling policy design; and (4) it presents empirical studies we have conducted to validate our scheduling policy implementation. Our results show that, with careful design, scheduling and detailed monitoring of computations’ behavior can be combined effectively to mitigate interference of attacks with computations’ execution.
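The isolation property a group scheduler provides can be illustrated with a toy model (this is not the paper's kernel implementation): the scheduler picks among groups first and only then among threads inside the chosen group, so a group that spawns many threads, as in a DoS attack, still receives only its group's share of the CPU.

```python
from itertools import cycle

def group_schedule(groups, slots):
    """Round-robin over groups, then over threads within the chosen group."""
    ran = {g: 0 for g in groups}                 # slots consumed per group
    inner = {g: cycle(ts) for g, ts in groups.items()}
    for g in cycle(groups):
        if slots == 0:
            break
        next(inner[g])                           # run one thread of this group
        ran[g] += 1
        slots -= 1
    return ran

# 100 attacker threads cannot crowd out the single victim thread:
groups = {"victim": ["v0"], "attacker": [f"a{i}" for i in range(100)]}
print(group_schedule(groups, 10))  # each group gets 5 of the 10 slots
```

Under default per-thread scheduling the attacker would receive roughly 100 of every 101 slots; scheduling by group first caps the interference regardless of thread count.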


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

SIDR: structure-aware intelligent data routing in Hadoop

Joe B. Buck; Noah Watkins; Greg Levin; Adam Crume; Kleoni Ioannidou; Scott A. Brandt; Carlos Maltzahn; Neoklis Polyzotis; Aaron Torres

The MapReduce framework is being extended to domains quite different from the web applications for which it was designed, including the processing of big structured data, e.g., scientific and financial data. Previous work using MapReduce to process scientific data ignores existing structure when assigning intermediate data and scheduling tasks. In this paper, we present a method for incorporating knowledge of the structure of scientific data and of the executing query into the MapReduce communication model. Built on SciHadoop, a version of the Hadoop MapReduce framework for scientific data, SIDR intelligently partitions and routes intermediate data, allowing it to: remove Hadoop's global barrier and execute Reduce tasks prior to all Map tasks completing; minimize intermediate key skew; and produce early, correct results. SIDR executes queries up to 2.5 times faster than Hadoop and 37% faster than SciHadoop; produces initial results with only 6% of the query completed; and produces dense, contiguous output.
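Why knowing the input structure lets SIDR drop the global barrier can be sketched briefly. This is an illustrative toy, not SIDR's implementation: with structured input, the key range each map task can emit is computable in advance, so the framework knows exactly which maps feed a given reducer and can start that reducer as soon as those maps finish.

```python
def key_range(chunk):
    """With structured input, a map task's output keys are known a priori."""
    lo, hi = chunk
    return set(range(lo, hi))

def contributing_maps(reducer_keys, chunks):
    """Only these map tasks can possibly feed this reducer."""
    return [c for c in chunks if key_range(c) & reducer_keys]

chunks = [(0, 4), (4, 8), (8, 12)]
reducer_keys = {1, 2, 3}

needed = contributing_maps(reducer_keys, chunks)
print(needed)                 # only chunk (0, 4) feeds this reducer

done = {(0, 4)}               # suppose that map task finished first
print(set(needed) <= done)    # True: reducer may run before the other maps end
```

In stock Hadoop every reducer must wait for every map; here the dependency set is a single chunk, so the global barrier collapses to a per-reducer one.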


European Conference on Parallel Processing | 2013

In-vivo Storage System Development

Noah Watkins; Carlos Maltzahn; Scott A. Brandt; Ian Pye; Adam Manzanares

The emergence of high-performance open-source storage systems is allowing application and middleware developers to consider non-standard storage system interfaces. In contrast to the practice of virtually always designing for file-like byte-stream interfaces, co-designed domain-specific storage system interfaces are becoming increasingly common. However, in order for developers to evolve interfaces in high-availability storage systems, services are needed for in-vivo interface evolution that allow the development of interfaces in the context of a live system. Current clustered storage systems that provide interface customizability expose only primitive services for managing ad-hoc interfaces. For maximum utility, the ability to create, evolve, and deploy dynamic storage interfaces is needed. However, in large-scale clusters, dynamic interface instantiation will require system-level support that ensures interface version consistency among storage nodes and client applications. We propose that storage systems should provide services that fully manage the life-cycle of dynamic interfaces, aligned with the common branch-and-merge form of software maintenance, including isolated development workspaces that can be combined into existing production views of the system.
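The version-consistency requirement argued for above can be sketched as a simple invariant. Everything here is hypothetical (node and interface names included): a dynamic interface may only be activated for clients once every storage node reports the same version of it.

```python
# Sketch: clients may bind a dynamic interface only when all storage
# nodes agree on its version, mirroring a branch-and-merge deployment.

class Cluster:
    def __init__(self, nodes):
        self.deployed = {n: {} for n in nodes}   # node -> {interface: version}
    def push(self, node, iface, version):
        """Deploy one interface version to one storage node."""
        self.deployed[node][iface] = version
    def consistent(self, iface, version):
        """The activation invariant: every node runs the same version."""
        return all(v.get(iface) == version for v in self.deployed.values())

c = Cluster(["osd0", "osd1"])
c.push("osd0", "wordcount", "v2")
print(c.consistent("wordcount", "v2"))  # False: osd1 still lacks v2
c.push("osd1", "wordcount", "v2")
print(c.consistent("wordcount", "v2"))  # True: safe to activate for clients
```

A fuller life-cycle service would also track which clients are bound to which version so old versions can be retired safely, but the gating invariant is the core of it.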


International Conference on Performance Engineering | 2018

quiho: Automated Performance Regression Testing Using Inferred Resource Utilization Profiles

Ivo Jimenez; Noah Watkins; Michael A. Sevilla; Jay F. Lofstead; Carlos Maltzahn

We introduce quiho, a framework for profiling application performance that can be used in automated performance regression tests. quiho profiles an application by applying sensitivity analysis, in particular statistical regression analysis (SRA), using application-independent performance feature vectors that characterize the performance of machines. The result of the SRA, feature importance specifically, is used as a proxy to identify hardware and low-level system software behavior. The relative importance of these features serves as a performance profile of an application (termed inferred resource utilization profile or IRUP), which is used to automatically validate performance behavior across multiple revisions of an application's code base without having to instrument code or obtain performance counters. We demonstrate that quiho can successfully discover performance regressions by showing its effectiveness in profiling application performance for synthetically introduced regressions as well as those found in real-world applications.
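The IRUP idea can be caricatured in a few lines. quiho uses statistical regression analysis and its feature importances; in this stand-alone sketch plain Pearson correlation stands in for importance, and the feature names and numbers are invented for illustration: each machine has a vector of microbenchmark scores, and the profile ranks which features best explain the application's runtime across machines.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def irup(features, runtime):
    """Rank features by |correlation| with runtime: the inferred profile."""
    return sorted(features, key=lambda f: -abs(pearson(features[f], runtime)))

# Hypothetical per-machine microbenchmark scores and app runtimes:
features = {
    "mem_bw":    [10, 20, 30, 40],   # the app is fast where bandwidth is high
    "disk_iops": [7, 6, 9, 8],
}
runtime = [40, 30, 20, 10]

print(irup(features, runtime)[0])  # -> mem_bw: the feature that explains performance
```

Two revisions of an application whose ranked profiles differ would be flagged for inspection, without instrumenting the code or reading performance counters.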

Collaboration


Dive into Noah Watkins's collaborations.

Top Co-Authors

Adam Manzanares
California State University

Ivo Jimenez
University of California

Jeff LeFevre
University of California

Jay F. Lofstead
Sandia National Laboratories

Joe B. Buck
University of California