Erik Paulson

University of Wisconsin-Madison

Publication

Featured research published by Erik Paulson.

international conference on management of data | 2009

A comparison of approaches to large-scale data analysis

Andrew Pavlo; Erik Paulson; Alexander Rasin; Daniel J. Abadi; David J. DeWitt; Samuel Madden; Michael Stonebraker

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.

Communications of The ACM | 2010

MapReduce and parallel DBMSs: friends or foes?

Michael Stonebraker; Daniel J. Abadi; David J. DeWitt; Samuel Madden; Erik Paulson; Andrew Pavlo; Alexander Rasin

MapReduce complements DBMSs since databases are not designed for extract-transform-load tasks, a MapReduce specialty.

very large data bases | 2008

Clustera: an integrated computation and data management system

David J. DeWitt; Erik Paulson; Eric Robinson; Jeffrey F. Naughton; Joshua Royalty; Srinath Shankar; Andrew Krioukov

This paper introduces Clustera, an integrated computation and data management system. In contrast to traditional cluster-management systems that target specific types of workloads, Clustera is designed for extensibility, enabling the system to be easily extended to handle a wide variety of job types ranging from computationally-intensive, long-running jobs with minimal I/O requirements to complex SQL queries over massive relational tables. Another unique feature of Clustera is the way in which the system architecture exploits modern software building blocks including application servers and relational database systems in order to realize important performance, scalability, portability and usability benefits. Finally, experimental evaluation suggests that Clustera has good scale-up properties for SQL processing, that Clustera delivers performance comparable to Hadoop for MapReduce processing and that Clustera can support higher job throughput rates than previously published results for the Condor and CondorJ2 batch computing systems.

international conference on management of data | 2011

Efficient processing of data warehousing queries in a split execution environment

Kamil Bajda-Pawlikowski; Daniel J. Abadi; Avi Silberschatz; Erik Paulson

Hadapt is a start-up company currently commercializing the Yale University research project called HadoopDB. The company focuses on building a platform for Big Data analytics in the cloud by introducing a storage layer optimized for structured data and by providing a framework for executing SQL queries efficiently. This work considers processing data warehousing queries over very large datasets. Our goal is to maximize performance while, at the same time, not giving up fault tolerance and scalability. We analyze the complexity of this problem in the split execution environment of HadoopDB. Here, incoming queries are examined; parts of the query are pushed down and executed inside the higher performing database layer; and the rest of the query is processed in a more generic MapReduce framework. In this paper, we discuss in detail performance-oriented query execution strategies for data warehouse queries in split execution environments, with particular focus on join and aggregation operations. The efficiency of our techniques is demonstrated by running experiments using the TPC-H benchmark with 3TB of data. In these experiments we compare our results with a standard commercial parallel database and an open-source MapReduce implementation featuring a SQL interface (Hive). We show that HadoopDB successfully competes with other systems.

very large data bases | 2018

CloudMatcher: a hands-off cloud/crowd service for entity matching

Yash Govind; Erik Paulson; Palaniappan Nagarajan; G C Paul Suganthan; AnHai Doan; Youngchoon Park; Glenn Fung; Devin Conathan; Marshall Carter; Mingju Sun

As data science applications proliferate, more and more lay users must perform data integration (DI) tasks, which used to be done by sophisticated CS developers. Thus, it is increasingly critical that we develop hands-off DI services, which lay users can use to perform such tasks without asking for help from developers. We propose to demonstrate such a service. Specifically, we will demonstrate CloudMatcher, a hands-off cloud/crowd service for entity matching (EM). To use CloudMatcher to match two tables, a lay user only needs to upload them to CloudMatcher's Web page and then iteratively label a set of tuple pairs as match/no-match. Alternatively, the user can enlist a crowd of workers to label the pairs. In either case, the lay user can easily perform EM end-to-end without having to involve any developers. CloudMatcher has been used in several domain science projects at UW-Madison and at several organizations, and is scheduled to be deployed in a large company in Summer 2018. In the demonstration we will show how easy it is for lay users to perform EM (either via interactive labeling or crowdsourcing), how users can easily create and experiment with a range of EM workflows, and how CloudMatcher can scale to many concurrent users and large datasets.

PVLDB Reference Format: Y. Govind, E. Paulson, P. Nagarajan, Paul S. G.C., AnHai Doan, Y. Park, G. M. Fung, D. Conathan, M. Carter, M. Sun. CloudMatcher: A Hands-Off Cloud/Crowd Service for Entity Matching. PVLDB, 11 (12): 2042-2045, 2018. DOI: https://doi.org/10.14778/3229863.3236255

international conference on management of data | 2017

Human-in-the-Loop Challenges for Entity Matching: A Midterm Report

AnHai Doan; Adel Ardalan; Jeffrey R. Ballard; Sanjib Das; Yash Govind; Pradap Konda; Han Li; Sidharth Mudgal; Erik Paulson; G C Paul Suganthan; Haojun Zhang

Archive | 2018

Building management system with dynamic channel communication

Youngchoon Park; Erik Paulson

Archive | 2018

Building automation systems for online, offline, and hybrid authorization of distributed peripheral devices

Youngchoon Park; Justin Ploegert; Erik Paulson; Sudhi R. Sinha

IEEE Data(base) Engineering Bulletin | 2018

Toward a System Building Agenda for Data Integration (and Data Science).

AnHai Doan; Pradap Konda; G C Paul Suganthan; Adel Ardalan; Jeffrey R. Ballard; Sanjib Das; Yash Govind; Han Li; Philip Martinkus; Sidharth Mudgal; Erik Paulson; Haojun Zhang

arXiv: Databases | 2017