Publication


Featured research published by Jeff LeFevre.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

SciHadoop: array-based query processing in Hadoop

Joe B. Buck; Noah Watkins; Jeff LeFevre; Kleoni Ioannidou; Carlos Maltzahn; Neoklis Polyzotis; Scott A. Brandt

Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte-stream data model causes inefficiencies when used to process scientific data that is commonly stored in highly structured, array-based binary file formats, resulting in limited scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin allowing scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality, and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions in I/O, both locally and over the network.
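The core idea, evaluating a logical query over array chunks so that only small partial aggregates cross the network, can be illustrated with a minimal sketch. This is not SciHadoop's actual NetCDF plugin API; the chunking, query range, and aggregate below are hypothetical.

```python
# A minimal sketch (not SciHadoop's API) of evaluating a logical aggregate
# query over an array split into chunks, with partial aggregation in the map
# phase so only small partial results cross the network.
import numpy as np

def map_phase(chunk, chunk_offset, query_lo, query_hi):
    """Aggregate only the cells of this chunk that fall inside the query range."""
    lo = max(query_lo - chunk_offset, 0)
    hi = min(query_hi - chunk_offset, len(chunk))
    if lo >= hi:
        return None                      # chunk can be pruned entirely
    sub = chunk[lo:hi]
    return sub.sum(), sub.size           # partial aggregate, not raw data

def reduce_phase(partials):
    """Combine per-chunk partials into the final answer (here: a mean)."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

data = np.arange(100, dtype=float)
chunks = [(data[i:i + 25], i) for i in range(0, 100, 25)]   # 4 "blocks"

partials = [p for c, off in chunks
            if (p := map_phase(c, off, query_lo=10, query_hi=60)) is not None]
print(reduce_phase(partials))            # mean of cells 10..59 -> 34.5
```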


International Conference on Management of Data | 2014

MISO: souping up big data query processing with a multistore system

Jeff LeFevre; Jagan Sankaranarayanan; Hakan Hacigümüs; Junichi Tatemura; Neoklis Polyzotis; Michael J. Carey

Multistore systems utilize multiple distinct data stores, such as Hadoop's HDFS and an RDBMS, for query processing by allowing a query to access data and computation in both stores. Current approaches to multistore query processing fail to achieve the full potential benefits of utilizing both systems due to the high cost of data movement and loading between the stores. Tuning the physical design of a multistore, i.e., deciding what data resides in which store, can reduce the amount of data movement during query processing, which is crucial for good multistore performance. In this work, we provide what we believe to be the first method to tune the physical design of a multistore system by deciding in which store to place data. Our method, called MISO for MultIStore Online tuning, is adaptive, lightweight, and works in an online fashion, utilizing only the by-products of query processing, which we term opportunistic views. We show that MISO significantly improves the performance of ad hoc big data query processing by leveraging the specific characteristics of the individual stores while incurring little additional overhead on the stores.
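As a rough illustration of the placement decision a multistore tuner faces, the sketch below greedily selects which opportunistic views to move into the RDBMS under a movement budget, ranked by estimated benefit per gigabyte. The view names, sizes, and benefit estimates are invented, and MISO's actual online algorithm is more involved than this.

```python
# A hypothetical sketch of the core multistore tuning decision: which views
# to move into the RDBMS, subject to a movement budget, by benefit per GB.
from dataclasses import dataclass

@dataclass
class View:
    name: str
    size_gb: float       # cost to move/load the view into the other store
    benefit: float       # estimated workload speedup if it lives in the RDBMS

def choose_views_to_move(views, budget_gb):
    """Greedy benefit-per-GB selection under a movement budget."""
    chosen, used = [], 0.0
    for v in sorted(views, key=lambda v: v.benefit / v.size_gb, reverse=True):
        if used + v.size_gb <= budget_gb:
            chosen.append(v.name)
            used += v.size_gb
    return chosen

views = [View("v_clicks_by_day", 4.0, 120.0),
         View("v_joined_logs", 20.0, 200.0),
         View("v_user_profiles", 2.0, 90.0)]
print(choose_views_to_move(views, budget_gb=10.0))
# ['v_user_profiles', 'v_clicks_by_day']
```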


International Conference on Management of Data | 2014

Opportunistic physical design for big data analytics

Jeff LeFevre; Jagan Sankaranarayanan; Hakan Hacigümüs; Junichi Tatemura; Neoklis Polyzotis; Michael J. Carey

Big data analytical systems, such as MapReduce, perform aggressive materialization of intermediate job results in order to support fault tolerance. When jobs correspond to exploratory queries submitted by data analysts, these materializations yield a large set of materialized views that we propose to treat as an opportunistic physical design. We present a semantic model for UDFs that enables effective reuse of views containing UDFs along with a rewrite algorithm that provably finds the minimum-cost rewrite under certain assumptions. An experimental study on real-world datasets using our prototype based on Hive shows that our approach can result in dramatic performance improvements.
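A toy version of the rewrite-selection step might look like the sketch below: a view is usable if it retains every column the query needs and applies no filter the query does not also apply, and the cheapest usable view wins. This simplifies away the paper's UDF semantic model; the view and query descriptions are hypothetical.

```python
# A toy illustration (not the paper's semantic model) of choosing the
# cheapest rewrite among candidate materialized views.
def can_answer(view, query):
    # view keeps all needed columns and is no more restrictive than the query
    return (query["columns"] <= view["columns"]
            and view["filters"] <= query["filters"])

def cheapest_rewrite(query, views):
    usable = [v for v in views if can_answer(v, query)]
    return min(usable, key=lambda v: v["scan_cost"], default=None)

views = [
    {"name": "mv1", "columns": {"user", "ts", "url"}, "filters": set(),
     "scan_cost": 100},
    {"name": "mv2", "columns": {"user", "url"}, "filters": {"ts >= '2014-01-01'"},
     "scan_cost": 20},
]
query = {"columns": {"user", "url"},
         "filters": {"ts >= '2014-01-01'", "url LIKE '%login%'"}}
print(cheapest_rewrite(query, views)["name"])   # mv2: cheaper and still valid
```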


Very Large Data Bases | 2013

Odyssey: a multistore system for evolutionary analytics

Hakan Hacigümüs; Jagan Sankaranarayanan; Junichi Tatemura; Jeff LeFevre; Neoklis Polyzotis

We present a data analytics system, Odyssey, that is being developed at NEC Labs in collaboration with NEC's commercial business units and our academic collaborators. The design principles of the system are based on business requirements identified through extensive surveys and communications with practitioners and customers. The most notable high-level requirements are:

1) The analytics system should be able to effectively use both structured and unstructured data sources.
2) Business requirements are not captured in a single simple metric but in a combination of metrics, such as value of data, performance, and monetary costs, and they are dynamic. The system should manage data by observing constantly changing metric values.
3) Time-to-insight is very important, so the system should enable immediate exploratory querying of data without heavy prerequisite processes.
4) The system should efficiently support both ad hoc queries and application workloads.
5) Very often there are already data analytics solutions or products in place (such as a traditional data warehouse), hence the system should be able to incorporate the existing settings.

Data analysis is performed in a rapidly changing way and includes the new role of the data scientist, who is tasked with finding the benefit in big data that come from disparate sources. The trend of collecting ever-growing data with unknown and unproven benefits, and the nature of the exploratory queries posed on these datasets, represent an emerging type of data analysis. The fluid nature of this analysis is that the analyst may start by posing simple questions on the data but then evolve toward more sophisticated reasoning as well as apply sophisticated techniques. The evolutionary nature of the investigation is due to the analyst, who may not initially be able to express her goals well, thus modifying her workflow slightly and iteratively refining it until achieving the intent. The evolutionary process may also require incorporating more data sources into the analysis to obtain richer and more confident answers. We call this iterative process, by which an analyst finds benefit in the data, evolutionary analytics.


Congress on Evolutionary Computation | 2004

CODEGEN: the generation and testing of DNA code words

Daniel E. Kephart; Jeff LeFevre

In this work we present algorithms to generate and test DNA code words that avoid unwanted cross-hybridizations. Methods from the theory of codes based on formal languages are employed. These algorithms are implemented in user-friendly software, CODEGEN, which contains a collection of language-theoretic objects adaptable to various related tasks. Lists of code words may be stored, viewed, altered, and retested. Implemented in Visual Basic 6.0, its interface allows lists of code words to be assembled at varying levels of acceptability from a single main window.
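A simplified generate-and-test loop in this spirit is sketched below: a candidate word is accepted only if its reverse complement shares no length-k subword with any previously accepted word. This is a crude stand-in for the formal-language tests CODEGEN actually implements, and the word pool and k are arbitrary.

```python
# Simplified generate-and-test for DNA code words: reject a candidate whose
# reverse complement shares a long subword with an already-accepted word
# (a rough proxy for unwanted cross-hybridization).
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(w):
    return w.translate(COMP)[::-1]

def kmers(w, k):
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def compatible(candidate, accepted, k=4):
    bad = kmers(revcomp(candidate), k)
    return all(bad.isdisjoint(kmers(w, k)) for w in accepted)

def generate(candidates, k=4):
    accepted = []
    for w in candidates:
        if compatible(w, accepted, k):
            accepted.append(w)
    return accepted

pool = ["ACCTGAAC", "GTTCAGGT", "ACCTGTAC", "AAGGCCTA"]
print(generate(pool))   # GTTCAGGT is rejected: it is the reverse complement of ACCTGAAC
```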


arXiv: Databases | 2013

Towards a workload for evolutionary analytics

Jeff LeFevre; Jagan Sankaranarayanan; Hakan Hacigümüs; Junichi Tatemura; Neoklis Polyzotis

Emerging data analysis involves the ingestion and exploration of new data sets, application of complex functions, and frequent query revisions based on observing prior query answers. We call this new type of analysis evolutionary analytics and identify its properties. This type of analysis is not well represented by current benchmark workloads. In this paper, we present a workload and identify several metrics to test system support for evolutionary analytics. Along with our metrics, we present methodologies for running the workload that capture this analytical scenario.


International Conference on Management of Data | 2012

Divergent physical design tuning for replicated databases

Mariano P. Consens; Kleoni Ioannidou; Jeff LeFevre; Neoklis Polyzotis

We introduce divergent designs as a novel tuning paradigm for database systems that employ replication. A divergent design installs a different physical configuration (e.g., indexes and materialized views) with each database replica, specializing replicas for different subsets of the workload. At runtime, queries are routed to the subset of the replicas configured to yield the most efficient execution plans. When compared to uniformly designed replicas, divergent replicas can potentially execute their subset of the queries significantly faster, and their physical configurations can be initialized and maintained (updated) in less time. However, the specialization of divergent replicas limits the ability to load-balance the workload at runtime. We formalize the divergent design problem, characterize the properties of good designs, and analyze the complexity of identifying the optimal divergent design. Our paradigm captures the trade-off between load balancing among all n replicas vs. load balancing among m ≤ n specialized replicas. We develop an effective algorithm (leveraging single-node tuning functionality) to compute good divergent designs for all the points of this trade-off. Experimental results validate the effectiveness of the algorithm and demonstrate that divergent designs can substantially improve workload performance.
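The runtime routing side of this idea can be pictured as follows: each query goes to the replica whose divergent configuration yields the cheapest estimated plan, subject to a per-replica load cap that preserves some balance. The cost matrix and cap below are illustrative, not the paper's algorithm.

```python
# Route each query to the cheapest replica (per its divergent design) that
# still has load headroom; costs and the cap are made-up numbers.
def route(queries, cost, n_replicas, cap):
    """cost[q][r] = estimated cost of query q on replica r."""
    load = [0.0] * n_replicas
    assignment = {}
    for q in queries:
        # replicas ordered by estimated cost; take the cheapest one with room
        for r in sorted(range(n_replicas), key=lambda r: cost[q][r]):
            if load[r] + cost[q][r] <= cap:
                assignment[q] = r
                load[r] += cost[q][r]
                break
    return assignment, load

cost = {"q1": [1, 9, 9], "q2": [1, 9, 9], "q3": [9, 2, 9], "q4": [9, 9, 3]}
print(route(["q1", "q2", "q3", "q4"], cost, n_replicas=3, cap=5.0))
# ({'q1': 0, 'q2': 0, 'q3': 1, 'q4': 2}, [2.0, 2.0, 3.0])
```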


International Conference on Management of Data | 2015

Large-scale Predictive Analytics in Vertica: Fast Data Transfer, Distributed Model Creation, and In-database Prediction

Shreya Prasad; Arash Fard; Vishrut Gupta; Jorge Martinez; Jeff LeFevre; Vincent Xu; Meichun Hsu; Indrajit Roy

A typical predictive analytics workflow will pre-process data in a database, transfer the resulting data to an external statistical tool such as R, create machine learning models in R, and then apply the model on newly arriving data. Today, this workflow is slow and cumbersome. Extracting data from databases using ODBC connectors can take hours on multi-gigabyte datasets. Building models in single-threaded R does not scale. Finally, it is nearly impossible to use R, or other common tools, to apply models on terabytes of newly arriving data. We solve all the above challenges by integrating HP Vertica with Distributed R, a distributed framework for R. This paper presents the design of a high-performance data transfer mechanism, new data structures in Distributed R that maintain data locality with database table segments, and extensions to Vertica for saving and deploying R models. Our experiments show that data transfers from Vertica are 6x faster than using ODBC connections. Even complex predictive analysis on hundreds of gigabytes of database tables can complete in minutes, and is as fast as in-memory systems like Spark running directly on a distributed file system.
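One way to picture the "in-database prediction" half of this workflow, though not the mechanism the paper builds, is to translate a trained linear model into SQL so scoring runs where the data lives instead of pulling it out. The table and feature names below are hypothetical.

```python
# Illustration only: push scoring of an externally trained linear model into
# the database by generating a SQL expression over a hypothetical table.
def linear_model_to_sql(intercept, coefficients, table):
    terms = [f"{w:+.6f} * {col}" for col, w in coefficients.items()]
    expr = f"{intercept:.6f} " + " ".join(terms)
    return f"SELECT id, {expr} AS score FROM {table};"

model = {"age": 0.031, "num_visits": 0.42, "avg_session_min": -0.07}
print(linear_model_to_sql(1.5, model, "events"))
# SELECT id, 1.500000 +0.031000 * age +0.420000 * num_visits -0.070000 * avg_session_min AS score FROM events;
```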


European Conference on Computer Systems | 2017

Malacology: A Programmable Storage System

Michael A. Sevilla; Noah Watkins; Ivo Jimenez; Peter Alvaro; Shel Finkelstein; Jeff LeFevre; Carlos Maltzahn

Storage systems need to support high performance for special-purpose data processing applications that run on an evolving storage device technology landscape. This puts tremendous pressure on storage systems to support rapid change both in terms of their interfaces and their performance. But adapting storage systems can be difficult because unprincipled changes might jeopardize years of code-hardening and performance optimization efforts that were necessary for users to entrust their data to the storage system. We introduce the programmable storage approach, which exposes internal services and abstractions of the storage stack as building blocks for higher-level services. We also build a prototype to explore how existing abstractions of common storage system services can be leveraged to adapt to the needs of new data processing systems and the increasing variety of storage devices. We illustrate the advantages and challenges of this approach by composing existing internal abstractions into two new higher-level services: a file system metadata load balancer and a high-performance distributed shared log. The evaluation demonstrates that our services inherit desirable qualities of the back-end storage system, including the ability to balance load, efficiently propagate service metadata, recover from failure, and navigate trade-offs between latency and throughput using leases.
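A toy composition in this spirit, not Malacology's real Ceph-based interfaces, is sketched below: two generic internal services, object append and lease-based exclusivity, are reused to build a higher-level shared-log service.

```python
# Toy "programmable storage" composition: reuse two generic internal services
# (object append, lease-based exclusivity) to build a higher-level shared log.
import time

class ObjectStore:                       # stand-in for a low-level append service
    def __init__(self):
        self.objects = {}
    def append(self, name, entry):
        self.objects.setdefault(name, []).append(entry)
        return len(self.objects[name]) - 1   # position of the appended entry

class LeaseManager:                      # stand-in for an exclusivity primitive
    def __init__(self, ttl=5.0):
        self.ttl, self.holder, self.expiry = ttl, None, 0.0
    def acquire(self, client):
        now = time.monotonic()
        if self.holder is None or now > self.expiry:
            self.holder, self.expiry = client, now + self.ttl
        return self.holder == client

class SharedLog:                         # higher-level service composed from the two
    def __init__(self, store, leases, name="log"):
        self.store, self.leases, self.name = store, leases, name
    def append(self, client, entry):
        if not self.leases.acquire(client):
            raise RuntimeError("another writer holds the sequencing lease")
        return self.store.append(self.name, entry)

log = SharedLog(ObjectStore(), LeaseManager())
print(log.append("client-a", b"first"), log.append("client-a", b"second"))  # 0 1
```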


International Conference on Management of Data | 2016

Building the Enterprise Fabric for Big Data with Vertica and Spark Integration

Jeff LeFevre; Rui Liu; Cornelio Inigo; Lupita Paz; Edward Ma; Malu Castellanos; Meichun Hsu

Enterprise customers increasingly require greater flexibility in the way they access and process their Big Data, while at the same time they continue to request advanced analytics and access to diverse data sources. Yet customers also still require the robustness of enterprise-class analytics for their mission-critical data. In this paper, we present our initial efforts toward a solution that satisfies the above requirements by integrating the HPE Vertica enterprise database with Apache Spark's open-source big data computation engine. In particular, the integration enables fast, reliable transfer of data between Vertica and Spark, and deployment of machine learning models created by Spark into Vertica for predictive analytics on Vertica data. This integration provides a fabric on which our customers get the best of both worlds: it extends Vertica's extensive SQL analytics capabilities with Spark's machine learning library (MLlib), giving Vertica users access to a wide range of ML functions; it also enables customers to leverage Spark as an advanced ETL engine for all data that require the guarantees offered by Vertica.
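For context, the generic JDBC path that such a purpose-built connector improves on looks roughly like the PySpark sketch below; the host, database, table, and credentials are placeholders, and the Vertica JDBC URL and driver setup should be checked against Vertica's documentation rather than taken from here.

```python
# Generic (and comparatively slow) JDBC read of a database table into Spark,
# shown only as the baseline path; connection details are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vertica-read-sketch").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:vertica://vertica-host:5433/analytics")
      .option("dbtable", "public.events")
      .option("user", "dbadmin")
      .option("password", "secret")
      .load())

df.groupBy("event_type").count().show()
```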

Collaboration


Dive into Jeff LeFevre's collaborations.

Top Co-Authors
Noah Watkins (University of California)

Ivo Jimenez (University of California)