Erik Paulson
University of Wisconsin-Madison
Publication
Featured research published by Erik Paulson.
international conference on management of data | 2009
Andrew Pavlo; Erik Paulson; Alexander Rasin; Daniel J. Abadi; David J. DeWitt; Samuel Madden; Michael Stonebraker
There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
Communications of The ACM | 2010
Michael Stonebraker; Daniel J. Abadi; David J. DeWitt; Samuel Madden; Erik Paulson; Andrew Pavlo; Alexander Rasin
MapReduce complements DBMSs since databases are not designed for extract-transform-load tasks, a MapReduce specialty.
very large data bases | 2008
David J. DeWitt; Erik Paulson; Eric Robinson; Jeffrey F. Naughton; Joshua Royalty; Srinath Shankar; Andrew Krioukov
This paper introduces Clustera, an integrated computation and data management system. In contrast to traditional cluster-management systems that target specific types of workloads, Clustera is designed for extensibility, enabling the system to be easily extended to handle a wide variety of job types ranging from computationally-intensive, long-running jobs with minimal I/O requirements to complex SQL queries over massive relational tables. Another unique feature of Clustera is the way in which the system architecture exploits modern software building blocks including application servers and relational database systems in order to realize important performance, scalability, portability and usability benefits. Finally, experimental evaluation suggests that Clustera has good scale-up properties for SQL processing, that Clustera delivers performance comparable to Hadoop for MapReduce processing and that Clustera can support higher job throughput rates than previously published results for the Condor and CondorJ2 batch computing systems.
international conference on management of data | 2011
Kamil Bajda-Pawlikowski; Daniel J. Abadi; Avi Silberschatz; Erik Paulson
Hadapt is a start-up company currently commercializing the Yale University research project called HadoopDB. The company focuses on building a platform for Big Data analytics in the cloud by introducing a storage layer optimized for structured data and by providing a framework for executing SQL queries efficiently. This work considers processing data warehousing queries over very large datasets. Our goal is to maximize performance while, at the same time, not giving up fault tolerance and scalability. We analyze the complexity of this problem in the split execution environment of HadoopDB. Here, incoming queries are examined; parts of the query are pushed down and executed inside the higher performing database layer; and the rest of the query is processed in a more generic MapReduce framework. In this paper, we discuss in detail performance-oriented query execution strategies for data warehouse queries in split execution environments, with particular focus on join and aggregation operations. The efficiency of our techniques is demonstrated by running experiments using the TPC-H benchmark with 3TB of data. In these experiments we compare our results with a standard commercial parallel database and an open-source MapReduce implementation featuring a SQL interface (Hive). We show that HadoopDB successfully competes with other systems.
very large data bases | 2018
Yash Govind; Erik Paulson; Palaniappan Nagarajan; G C Paul Suganthan; AnHai Doan; Youngchoon Park; Glenn Fung; Devin Conathan; Marshall Carter; Mingju Sun
As data science applications proliferate, more and more lay users must perform data integration (DI) tasks, which used to be done by sophisticated CS developers. Thus, it is increasingly critical that we develop hands-off DI services, which lay users can use to perform such tasks without asking for help from developers. We propose to demonstrate such a service. Specifically, we will demonstrate CloudMatcher, a hands-off cloud/crowd service for entity matching (EM). To use CloudMatcher to match two tables, a lay user only needs to upload them to CloudMatcher's Web page and then iteratively label a set of tuple pairs as match/no-match. Alternatively, the user can enlist a crowd of workers to label the pairs. In either case, the lay user can easily perform EM end-to-end without having to involve any developers. CloudMatcher has been used in several domain science projects at UW-Madison and at several organizations, and is scheduled to be deployed in a large company in Summer 2018. In the demonstration we will show how easy it is for lay users to perform EM (either via interactive labeling or crowdsourcing), how users can easily create and experiment with a range of EM workflows, and how CloudMatcher can scale to many concurrent users and large datasets.
PVLDB Reference Format: Y. Govind, E. Paulson, P. Nagarajan, Paul S. G.C., AnHai Doan, Y. Park, G. M. Fung, D. Conathan, M. Carter, M. Sun. CloudMatcher: A Hands-Off Cloud/Crowd Service for Entity Matching. PVLDB, 11 (12): 2042-2045, 2018. DOI: https://doi.org/10.14778/3229863.3236255
international conference on management of data | 2017
AnHai Doan; Adel Ardalan; Jeffrey R. Ballard; Sanjib Das; Yash Govind; Pradap Konda; Han Li; Sidharth Mudgal; Erik Paulson; G C Paul Suganthan; Haojun Zhang
Archive | 2018
Youngchoon Park; Erik Paulson
Archive | 2018
Youngchoon Park; Justin Ploegert; Erik Paulson; Sudhi R. Sinha
IEEE Data(base) Engineering Bulletin | 2018
AnHai Doan; Pradap Konda; G C Paul Suganthan; Adel Ardalan; Jeffrey R. Ballard; Sanjib Das; Yash Govind; Han Li; Philip Martinkus; Sidharth Mudgal; Erik Paulson; Haojun Zhang
arXiv: Databases | 2017
AnHai Doan; Adel Ardalan; Jeffrey R. Ballard; Sanjib Das; Yash Govind; Pradap Konda; Han Li; Erik Paulson; G C Paul Suganthan; Haojun Zhang