Alan Halverson | Researchain

Publication

Featured research published by Alan Halverson.

web search and data mining | 2009

Diversifying search results

Rakesh Agrawal; Sreenivas Gollapudi; Alan Halverson; Samuel Ieong

We study the problem of answering ambiguous web queries in a setting where there exists a taxonomy of information, and where both queries and documents may belong to more than one category according to this taxonomy. We present a systematic approach to diversifying results that aims to minimize the risk of dissatisfaction of the average user. We propose an algorithm that well approximates this objective in general, and is provably optimal for a natural special case. Furthermore, we generalize several classical IR metrics, including NDCG, MRR, and MAP, to explicitly account for the value of diversification. We demonstrate empirically that our algorithm scores higher in these generalized metrics compared to results produced by commercial search engines.
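The idea of minimizing the average user's risk of dissatisfaction can be illustrated with a small greedy selection routine. This is a minimal sketch under stated assumptions, not the paper's algorithm as published: it assumes each query has a probability distribution over categories (`category_prob`), that each document has a per-category probability of satisfying the user (`doc_quality`), and that documents satisfy users independently. All function and parameter names are hypothetical.

```python
from typing import Dict, List


def dissatisfaction(category_prob: Dict[str, float],
                    doc_quality: Dict[str, Dict[str, float]],
                    selected: List[str]) -> float:
    """Probability that no selected document satisfies the user (assumed model)."""
    total = 0.0
    for category, prob in category_prob.items():
        miss = 1.0
        for doc in selected:
            miss *= 1.0 - doc_quality[doc].get(category, 0.0)
        total += prob * miss
    return total


def greedy_diversify(category_prob: Dict[str, float],
                     doc_quality: Dict[str, Dict[str, float]],
                     k: int) -> List[str]:
    """Pick up to k documents, each time taking the largest marginal gain."""
    # residual[c]: probability that the intent is c and is still unsatisfied.
    residual = dict(category_prob)
    remaining = set(doc_quality)
    selected: List[str] = []

    def gain(doc: str) -> float:
        return sum(residual.get(c, 0.0) * v for c, v in doc_quality[doc].items())

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
        # Discount categories that the chosen document already covers.
        for c, v in doc_quality[best].items():
            if c in residual:
                residual[c] *= 1.0 - v
    return selected


if __name__ == "__main__":
    probs = {"animal": 0.6, "car": 0.4}          # ambiguous query, e.g. "jaguar"
    docs = {"a1": {"animal": 0.9}, "a2": {"animal": 0.8}, "c1": {"car": 0.7}}
    print(greedy_diversify(probs, docs, k=2))
```

In this toy example, ranking by relevance to the dominant category alone would return two "animal" documents, while the greedy routine covers both intents and leaves a lower residual dissatisfaction under the assumed model.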

international conference on management of data | 2011

Turbocharging DBMS buffer pool using SSDs

Jaeyoung Do; Donghui Zhang; Jignesh M. Patel; David J. DeWitt; Jeffrey F. Naughton; Alan Halverson

Flash solid-state drives (SSDs) are changing the I/O landscape, which has largely been dominated by traditional hard disk drives (HDDs) for the last 50 years. In this paper we propose and systematically explore designs for using an SSD to improve the performance of a DBMS buffer manager. We propose three alternatives that differ mainly in the way that they deal with the dirty pages evicted from the buffer pool. We implemented these alternatives, as well as another recently proposed algorithm for this task (TAC), in SQL Server, and ran experiments using a variety of benchmarks (TPC-C, E and H) at multiple scale factors. Our empirical evaluation shows significant performance improvements of our methods over the default HDD configuration (up to 9.4X), and up to a 6.8X speedup over TAC.

international conference on management of data | 2013

Split query processing in Polybase

David J. DeWitt; Alan Halverson; Rimma V. Nehme; Srinath Shankar; Josep Aguilar-Saborit; Artin Avanes; Miro Flasza; Jim Gramling

This paper presents Polybase, a feature of SQL Server PDW V2 that allows users to manage and query data stored in a Hadoop cluster using the standard SQL query language. Unlike other database systems that provide only a relational view over HDFS-resident data through the use of an external table mechanism, Polybase employs a split query processing paradigm in which SQL operators on HDFS-resident data are translated into MapReduce jobs by the PDW query optimizer and then executed on the Hadoop cluster. The paper describes the design and implementation of Polybase along with a thorough performance evaluation that explores the benefits of employing a split query processing paradigm for executing queries that involve both structured data in a relational DBMS and unstructured data in Hadoop. Our results demonstrate that while the use of a split-based query execution paradigm can improve the performance of some queries by as much as 10X, one must employ a cost-based query optimizer that considers a broad set of factors when deciding whether or not it is advantageous to push a SQL operator to Hadoop. These factors include the selectivity factor of the predicate, the relative sizes of the two clusters, and whether or not their nodes are co-located. In addition, differences in the semantics of the Java and SQL languages must be carefully considered in order to avoid altering the expected results of a query.

web search and data mining | 2009

Generating labels from clicks

Rakesh Agrawal; Alan Halverson; Krishnaram Kenthapadi; Nina Mishra; Panayiotis Tsaparas

The ranking function used by search engines to order results is learned from labeled training data. Each training point is a (query, URL) pair that is labeled by a human judge who assigns a score of Perfect, Excellent, etc., depending on how well the URL matches the query. In this paper, we study whether clicks can be used to automatically generate good labels. Intuitively, documents that are clicked (resp., skipped) in aggregate can indicate relevance (resp., lack of relevance). We give a novel way of transforming clicks into weighted, directed graphs inspired by eye-tracking studies and then devise an objective function for finding cuts in these graphs that induce a good labeling. In its full generality, the problem is NP-hard, but we show that, in the case of two labels, an optimum labeling can be found in linear time. For the more general case, we propose heuristic solutions. Experiments on real click logs show that click-based labels align with the opinion of a panel of judges, especially as the consensus of the panel grows stronger.

international conference on management of data | 2012

Query optimization in Microsoft SQL Server PDW

Srinath Shankar; Rimma V. Nehme; Josep Aguilar-Saborit; Andrew Chung; Mostafa Elhemali; Alan Halverson; Eric R. Robinson; Mahadevan Sankara Subramanian; David J. DeWitt; Cesar A. Galindo-Legaria

In recent years, Massively Parallel Processors have increasingly been used to manage and query vast amounts of data. Dramatic performance improvements are achieved through distributed execution of queries across many nodes. Query optimization for such systems is a challenging and important problem. In this paper we describe the Query Optimizer inside the SQL Server Parallel Data Warehouse product (PDW QO). We leverage existing QO technology in Microsoft SQL Server to implement a cost-based optimizer for distributed query execution. By properly abstracting metadata we can readily reuse existing logic for query simplification, space exploration and cardinality estimation. Unlike earlier approaches that simply parallelize the best serial plan, our optimizer considers a rich space of execution alternatives, and picks one based on a cost-model for the distributed execution environment. The result is a high-quality, effective query optimizer for distributed query processing in an MPP.

web search and data mining | 2012

Of hammers and nails: an empirical comparison of three paradigms for processing large graphs

Marc Najork; Dennis Fetterly; Alan Halverson; Krishnaram Kenthapadi; Sreenivas Gollapudi

Many phenomena and artifacts such as road networks, social networks and the web can be modeled as large graphs and analyzed using graph algorithms. However, given the size of the underlying graphs, efficient implementation of basic operations such as connected component analysis, approximate shortest paths, and link-based ranking (e.g. PageRank) becomes challenging. This paper presents an empirical study of computations on such large graphs in three well-studied platform models, viz., a relational model, a data-parallel model, and a special-purpose in-memory model. We choose a prototypical member of each platform model and analyze the computational efficiencies and requirements for five basic graph operations used in the analysis of real-world graphs, viz., PageRank, SALSA, Strongly Connected Components (SCC), Weakly Connected Components (WCC), and Approximate Shortest Paths (ASP). Further, we characterize each platform in terms of these computations using model-specific implementations of these algorithms on a large web graph. Our experiments show that there is no single platform that performs best across different classes of operations on large graphs. While relational databases are powerful and flexible tools that support a wide variety of computations, there are computations that benefit from using special-purpose storage systems and others that can exploit data-parallel platforms.
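For readers unfamiliar with the link-based ranking mentioned above, the following is a minimal single-machine sketch of PageRank by power iteration. It is not one of the three platform-specific implementations studied in the paper; the damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from nodes without out-links are common textbook choices assumed here for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def pagerank(edges: Iterable[Tuple[Hashable, Hashable]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[Hashable, float]:
    """Power-iteration PageRank over a directed edge list (illustrative only)."""
    edge_list = list(edges)
    nodes = sorted({n for edge in edge_list for n in edge})
    if not nodes:
        return {}
    out_links: Dict[Hashable, List[Hashable]] = {n: [] for n in nodes}
    for src, dst in edge_list:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread uniformly (assumption).
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            if out_links[node]:
                share = damping * rank[node] / len(out_links[node])
                for dst in out_links[node]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    print(pagerank([("a", "c"), ("b", "c")]))
```

Each iteration touches every edge once, which is why the choice of platform (relational joins, data-parallel passes, or in-memory adjacency lists) matters at web scale.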

very large data bases | 2014

Indexing HDFS data in PDW: splitting the data from the index

Vinitha Reddy Gankidi; Nikhil Teletia; Jignesh M. Patel; Alan Halverson; David J. DeWitt

There is a growing interest in making relational DBMSs work synergistically with MapReduce systems. However, there are interesting technical challenges associated with figuring out the right balance between the use and co-deployment of these systems. This paper focuses on one specific aspect of this balance, namely how to leverage the superior indexing and query processing power of a relational DBMS for data that is often more cost-effectively stored in Hadoop/HDFS. We present a method to use conventional B+-tree indices in an RDBMS for data stored in HDFS and demonstrate that our approach is especially effective for highly selective queries.

international conference on data engineering | 2014

In-RDBMS inverted indexes revisited

Ian Rae; Alan Halverson; Jeffrey F. Naughton

Every major open-source and commercial RDBMS offers some form of support for full-text search using inverted indexes. When providing this support, some developers have implemented specialized indexes that adapt techniques from the Information Retrieval (IR) community to work in a database setting, while others have opted to rely on the standard relational query engine to process inverted index lookups. This choice is an important one, since the storage formats and algorithms used can vary greatly between a specialized index and a relational index, but these alternatives have not been thoroughly compared in the same system. Our work explores the differences in implementation and performance of three representative environments for an in-RDBMS inverted index: an in-RDBMS IR engine, a row-oriented relational query engine, and a column-oriented relational query engine. We found that a specialized IR engine integrated into the RDBMS can provide more than an order of magnitude speedup over both the row- and column-oriented relational query engines for conjunctive and phrase queries. For warm queries, this advantage is largely algorithmic, and we show that by using ZigZag merge join to accelerate conjunctive and phrase query processing, relational inverted indexes can provide performance comparable to a specialized in-RDBMS IR engine with no change to the underlying storage format. Compression and index format, in contrast, have more impact on cold queries, where the IR and column-oriented engines are able to outperform the row-oriented engine, even with ZigZag merge join.
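The algorithmic advantage attributed to ZigZag merge join comes from skipping ahead in each posting list instead of scanning every entry. The sketch below illustrates that idea for a conjunctive query over in-memory sorted lists of document IDs. It is an assumption-laden illustration rather than the implementation evaluated in the paper: binary search via `bisect_left` stands in for an index seek, posting lists are assumed to be strictly increasing, and the function name is hypothetical.

```python
from bisect import bisect_left
from typing import List, Sequence


def zigzag_intersect(lists: Sequence[Sequence[int]]) -> List[int]:
    """Intersect sorted, duplicate-free posting lists by seeking, not scanning."""
    n = len(lists)
    if n == 0 or any(len(plist) == 0 for plist in lists):
        return []
    positions = [0] * n          # current cursor in each posting list
    result: List[int] = []
    candidate = lists[0][0]      # document ID we are trying to confirm
    agreed = 0                   # how many lists in a row contain candidate
    i = 0                        # index of the list being probed
    while True:
        plist = lists[i]
        # Seek to the first entry >= candidate, starting from the cursor.
        pos = bisect_left(plist, candidate, positions[i])
        if pos == len(plist):
            return result        # one list is exhausted: no more matches
        positions[i] = pos
        if plist[pos] == candidate:
            agreed += 1
            if agreed == n:
                result.append(candidate)
                positions[i] = pos + 1
                if positions[i] == len(plist):
                    return result
                # Restart the round from this list's next entry.
                candidate = plist[positions[i]]
                agreed = 0
                continue
        else:
            # Overshot: the larger value becomes the new candidate.
            candidate = plist[pos]
            agreed = 1
        i = (i + 1) % n


if __name__ == "__main__":
    print(zigzag_intersect([[1, 3, 5, 7, 9], [3, 4, 5, 9], [2, 3, 5, 8, 9]]))
```

When one term is rare and another is common, most of the long list is never touched, which is consistent with the abstract's point that the gain for warm queries is largely algorithmic rather than a matter of storage format.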

international symposium on low power electronics and design | 2017

Frequency governors for cloud database OLTP workloads

Rathijit Sen; Alan Halverson

Dynamically controlling processor frequency to save power while meeting customer Service-Level Objectives (SLOs) can reduce the cost of goods sold for cloud service providers. However, resource governance for Online Transaction Processing (OLTP) workloads in the cloud is complicated by throughput constraints, latency constraints, shallow sleep states that lower processor utilization, and (often) isolation of applications from hardware resource governors. This paper demonstrates a novel frequency governor that improves upon existing Intel P-state and Cpufreq governors in saving power for a cloud OLTP benchmark on Microsoft SQL Server for Linux.

tpc technology conference | 2011