Thorsten Papenbrock
Hasso Plattner Institute
Publications
Featured research published by Thorsten Papenbrock.
Very Large Data Bases | 2015
Thorsten Papenbrock; Jens Ehrlich; Jannik Marten; Tommy Neubert; Jan-Peer Rudolph; Martin Schönberg; Jakob Zwiener; Felix Naumann
Functional dependencies are important metadata used for schema normalization, data cleansing, and many other tasks. The efficient discovery of functional dependencies in tables is a well-known challenge in database research and has seen several approaches. Because no comprehensive comparison of these algorithms exists to date, it is hard to choose the best algorithm for a given dataset. In this experimental paper, we describe, evaluate, and compare the seven most cited and most important algorithms, all solving this same problem. First, we classify the algorithms into three different categories, explaining their commonalities. We then describe all algorithms with their main ideas. The descriptions provide additional details where the original papers were ambiguous or incomplete. Our evaluation of careful re-implementations of all algorithms spans a broad test space including synthetic and real-world data. We show that all functional dependency algorithms optimize for certain data characteristics and provide hints on when to choose which algorithm. In summary, however, all current approaches scale surprisingly poorly, showing potential for future research.
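To make the validation task concrete, here is a minimal sketch (in Python, not one of the seven surveyed algorithms) of checking a single FD candidate: X → Y holds exactly when no two rows agree on X but disagree on Y.

```python
from collections import defaultdict

def fd_holds(rows, lhs, rhs):
    """Check whether the FD lhs -> rhs holds in a table.

    rows: list of tuples; lhs/rhs: lists of column indices.
    The FD holds iff every distinct lhs-value combination
    maps to exactly one rhs-value combination.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in lhs)
        seen[key].add(tuple(row[i] for i in rhs))
        if len(seen[key]) > 1:  # two rows agree on lhs but differ on rhs
            return False
    return True

# Example: city -> zip is violated, zip -> city holds.
table = [("Berlin", "10115"), ("Berlin", "10117"), ("Potsdam", "14467")]
print(fd_holds(table, lhs=[0], rhs=[1]))  # False
print(fd_holds(table, lhs=[1], rhs=[0]))  # True
```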
IEEE Transactions on Knowledge and Data Engineering | 2015
Thorsten Papenbrock; Arvid Heise; Felix Naumann
Duplicate detection is the process of identifying multiple representations of the same real-world entities. Today, duplicate detection methods need to process ever-larger datasets in ever-shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase the efficiency of finding duplicates if the execution time is limited: they maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that our progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly improve upon related work.
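To illustrate the progressive idea, here is a heavily simplified sketch, loosely in the spirit of a progressive sorted neighborhood; the key function and the records are made up for illustration.

```python
def progressive_pairs(records, key, max_dist=3):
    """Yield candidate pairs in a progressive order: sort records
    by a similarity key, then compare neighbors at distance 1 over
    the whole table, then distance 2, and so on. Close (promising)
    pairs are emitted first, so most duplicates are reported early
    even if the time budget runs out."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    for dist in range(1, max_dist + 1):
        for pos in range(len(order) - dist):
            yield records[order[pos]], records[order[pos + dist]]

# Usage: feed the pairs to any matcher and stop whenever time is up.
people = ["Jon Smith", "John Smith", "Anna Mayer", "Ann Mayer"]
for a, b in progressive_pairs(people, key=str.lower):
    pass  # call an expensive similarity function here
```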
Very Large Data Bases | 2015
Thorsten Papenbrock; Tanja Bergmann; Moritz Finke; Jakob Zwiener; Felix Naumann
Data profiling is the discipline of discovering metadata about given datasets. These metadata serve a variety of use cases, such as data integration, data cleansing, or query optimization. Due to the importance of data profiling in practice, many tools have emerged that support data scientists and IT professionals in this task. These tools provide good support for profiling statistics that are easy to compute, but they usually lack automatic and efficient discovery of complex statistics, such as inclusion dependencies, unique column combinations, or functional dependencies. We present Metanome, an extensible profiling platform that incorporates many state-of-the-art profiling algorithms. While Metanome is able to calculate simple profiling statistics in relational data, its focus lies on the automatic discovery of complex metadata. Metanome's goal is to provide novel profiling algorithms from research, to perform comparative evaluations, and to support developers in building and testing new algorithms. In addition, Metanome is able to rank profiling results according to various metrics and to visualize the sometimes very large metadata sets.
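As an example of the complex metadata mentioned above, a unique column combination (UCC) is a set of columns whose projection contains no duplicate rows, i.e., a candidate key. The following naive check is purely illustrative and unrelated to Metanome's actual algorithms.

```python
def is_unique_column_combination(rows, cols):
    """True iff the projection of the table onto cols contains
    no duplicate value combinations (a unique column combination)."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in cols)
        if key in seen:
            return False
        seen.add(key)
    return True

table = [("Alice", 1), ("Bob", 2), ("Alice", 3)]
print(is_unique_column_combination(table, [0]))     # False: "Alice" repeats
print(is_unique_column_combination(table, [0, 1]))  # True
```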
International Conference on Management of Data | 2016
Thorsten Papenbrock; Felix Naumann
Functional dependencies are structural metadata that can be used for schema normalization, data integration, data cleansing, and many other data management tasks. Despite their importance, the functional dependencies of a specific dataset are usually unknown and almost impossible to discover manually. For this reason, database research has proposed various algorithms for functional dependency discovery. None, however, are able to process datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present a hybrid discovery algorithm called HyFD, which combines fast approximation techniques with efficient validation techniques in order to find all minimal functional dependencies in a given dataset. While operating on compact data structures, HyFD not only outperforms all existing approaches, it also scales to much larger datasets.
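A toy sketch of the hybrid principle, greatly simplified from HyFD and restricted to unary FDs: sampled record pairs cheaply refute candidates (approximation phase), and only the survivors are validated exactly (validation phase). All function names here are ours, not HyFD's.

```python
import random

def fd_holds(rows, a, b):
    """Exact check of the unary FD a -> b (column indices)."""
    mapping = {}
    for row in rows:
        if mapping.setdefault(row[a], row[b]) != row[b]:
            return False
    return True

def hybrid_unary_fds(rows, num_cols, sample_pairs=1000):
    """Discover all unary FDs a -> b, hybrid-style (toy version):
    phase 1 prunes candidates violated by sampled row pairs,
    phase 2 exactly validates only the surviving candidates."""
    cands = {(a, b) for a in range(num_cols)
                    for b in range(num_cols) if a != b}
    for _ in range(sample_pairs):                 # phase 1: cheap pruning
        r, s = random.choice(rows), random.choice(rows)
        for a, b in list(cands):
            if r[a] == s[a] and r[b] != s[b]:     # this pair violates a -> b
                cands.discard((a, b))
    return {(a, b) for a, b in cands if fd_holds(rows, a, b)}  # phase 2
```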
Very Large Data Bases | 2015
Thorsten Papenbrock; Sebastian Kruse; Jorge Arnulfo Quiané-Ruiz; Felix Naumann
The discovery of all inclusion dependencies (INDs) in a dataset is an important part of any data profiling effort. Apart from the detection of foreign key relationships, INDs can help to perform data integration, query optimization, integrity checking, or schema (re-)design. However, the detection of INDs gets harder as datasets grow in the number of tuples as well as attributes. To this end, we propose Binder, an IND detection system that is capable of detecting both unary and n-ary INDs. It is based on a divide & conquer approach that allows it to handle very large datasets, an important property in the face of the ever-increasing size of today's data. In contrast to most related work, we neither rely on existing database functionality nor assume that inspected datasets fit into main memory. This renders Binder an efficient and scalable competitor. Our exhaustive experimental evaluation shows that Binder clearly outperforms the state of the art in both unary (Spider) and n-ary (Mind) IND discovery: Binder is up to 26x faster than Spider and more than 2500x faster than Mind.
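The core test in unary IND discovery is set inclusion: column A is included in column B iff every distinct value of A also occurs in B. The following is a sketch of the divide & conquer idea, a strong simplification of Binder's actual bucketing.

```python
def hash_partition(values, num_parts):
    """Split a column's distinct values into hash buckets so that
    each bucket can be checked independently (and fit in memory)."""
    parts = [set() for _ in range(num_parts)]
    for v in values:
        parts[hash(v) % num_parts].add(v)
    return parts

def unary_ind(col_a, col_b, num_parts=4):
    """Check the inclusion dependency A ⊆ B bucket by bucket; a
    value of A can only occur in the same-numbered bucket of B,
    and a single non-included bucket refutes the IND early."""
    parts_a = hash_partition(col_a, num_parts)
    parts_b = hash_partition(col_b, num_parts)
    return all(pa <= pb for pa, pb in zip(parts_a, parts_b))

orders_customer_id = [1, 2, 2, 3]
customers_id = [1, 2, 3, 4]
print(unary_ind(orders_customer_id, customers_id))  # True: foreign-key candidate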
Conference on Information and Knowledge Management | 2011
Johannes Lorey; Felix Naumann; Benedikt Forchhammer; Andrina Mascher; Peter Retzlaff; Armin ZamaniFarahani; Soeren Discher; Cindy Faehnrich; Stefan Lemme; Thorsten Papenbrock; Robert Christoph Peschel; Stephan Richter; Thomas Stening; Sven Viehmeier
A large number of statistical indicators (GDP, life expectancy, income, etc.) collected over long periods of time as well as data on historical events (wars, earthquakes, elections, etc.) are published on the World Wide Web. By augmenting statistical outliers with relevant historical occurrences, we provide a means to observe (and predict) the influence and impact of events. The vast amount and size of available data sets enable the detection of recurring connections between classes of events and statistical outliers with the help of association rule mining. The results of this analysis are published at http://www.blackswanevents.org and can be explored interactively.
International Conference on Management of Data | 2016
Sebastian Kruse; Anja Jentzsch; Thorsten Papenbrock; Zoi Kaoudi; Jorge Arnulfo Quiané-Ruiz; Felix Naumann
Inclusion dependencies (INDs) form an important integrity constraint on relational databases, supporting data management tasks such as join path discovery and query optimization. Conditional inclusion dependencies (CINDs), which define including and included data in terms of conditions, make it possible to transfer these capabilities to RDF data. However, CIND discovery is computationally much more complex than IND discovery, and even small RDF datasets yield an intractable number of CINDs. To cope with both problems, we first introduce the notion of pertinent CINDs with an adjustable relevance criterion to filter and rank CINDs based on their extent and the implications among each other. Second, we present RDFind, a distributed system to efficiently discover all pertinent CINDs in RDF data. RDFind employs a lazy pruning strategy to drastically reduce the CIND search space. Furthermore, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state of the art while considering a more general class of CINDs. It is also capable of processing a very large dataset of billions of triples, which was entirely infeasible before.
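To illustrate what a CIND on triples expresses, consider a toy example with made-up predicates (this sketches the constraint itself, not RDFind's search procedure): the condition "predicate = capitalOf" selects the included side, and the inclusion requires every such subject to also appear as a subject of locatedIn.

```python
triples = [
    ("Berlin",  "capitalOf", "Germany"),
    ("Berlin",  "locatedIn", "Germany"),
    ("Potsdam", "locatedIn", "Germany"),
]

def cind_holds(triples, cond_pred, ref_pred):
    """Conditional IND on RDF: every subject of a triple whose
    predicate is cond_pred must also occur as the subject of a
    triple whose predicate is ref_pred."""
    included = {s for s, p, o in triples if p == cond_pred}
    referenced = {s for s, p, o in triples if p == ref_pred}
    return included <= referenced

print(cind_holds(triples, "capitalOf", "locatedIn"))  # True
```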
International Conference on Management of Data | 2017
Fabian Tschirschnitz; Thorsten Papenbrock; Felix Naumann
Detecting inclusion dependencies, the prerequisite of foreign keys, in relational data is a challenging task. Detecting them among the hundreds of thousands or even millions of tables on the Web is daunting. Still, such inclusion dependencies can help connect disparate pieces of information on the Web and reveal unknown relationships among tables. We present Many, a novel inclusion dependency detection algorithm specialized for the very many, but typically small, tables found on the Web. We use Bloom filters and indexed bit-vectors to show the feasibility of our approach. Our evaluation on two corpora of Web tables shows superior runtime over known approaches and demonstrates the algorithm's usefulness in revealing hidden structures on the Web.
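A sketch of the Bloom-filter idea, as an illustration rather than the Many implementation: each column gets a small bit signature, and if A's signature sets a bit that B's signature lacks, A ⊆ B is impossible and the candidate pair is pruned without reading the actual values.

```python
def signature(values, bits=64):
    """Tiny Bloom-filter-like bit signature of a column's values."""
    sig = 0
    for v in values:
        sig |= 1 << (hash(v) % bits)
    return sig

def may_be_included(col_a, col_b, bits=64):
    """Necessary condition for A ⊆ B: every bit set by A's values
    must also be set by B's. False prunes the candidate for sure;
    True still requires an exact check (bit positions can collide)."""
    return signature(col_a, bits) & ~signature(col_b, bits) == 0

web_col_a = ["red", "green"]
web_col_b = ["red", "green", "blue"]
if may_be_included(web_col_a, web_col_b):
    print(set(web_col_a) <= set(web_col_b))  # exact verification: True
```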
Conference on Information and Knowledge Management | 2016
Tobias Bleifuß; Susanne Bülow; Johannes Frohnhofen; Julian Risch; Georg Wiese; Sebastian Kruse; Thorsten Papenbrock; Felix Naumann
Functional dependencies (FDs) are an important prerequisite for various data management tasks, such as schema normalization, query optimization, and data cleansing. However, automatic FD discovery entails an exponentially growing search and solution space, so that even today's fastest FD discovery algorithms are limited to small datasets due to long runtimes and high memory consumption. To overcome this situation, we propose an approximate discovery strategy that trades a possibly small loss in result correctness for large performance improvements. In particular, we introduce AID-FD, an algorithm that approximately discovers FDs with runtimes up to orders of magnitude faster than state-of-the-art FD discovery algorithms. We evaluate and compare our performance results with a focus on scalability in runtime and memory, and with measures for completeness, correctness, and minimality.
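A minimal sketch of one way to trade correctness for speed; it illustrates the general approximate strategy, not AID-FD's actual technique: validate each candidate only on a random sample, accepting occasional false positives in exchange for touching far fewer rows.

```python
import random

def fd_probably_holds(rows, lhs, rhs, sample_size=1000):
    """Approximate FD check: test lhs -> rhs on a random sample.
    Much faster on large tables, but may miss violations that the
    sample does not contain (false positives are possible)."""
    sample = rows if len(rows) <= sample_size else random.sample(rows, sample_size)
    mapping = {}
    for row in sample:
        key = tuple(row[i] for i in lhs)
        val = tuple(row[i] for i in rhs)
        if mapping.setdefault(key, val) != val:
            return False  # a violation in the sample is a definite violation
    return True  # no violation seen; correct only with high probability
```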
Informatik Spektrum | 2014
Felix Naumann; Maximilian Jenders; Thorsten Papenbrock
In the summer semester of 2013, we offered the course Datenmanagement mit SQL on openHPI, the Hasso Plattner Institute's Internet education platform. Of the more than 6,000 participants, 1,641 received a certificate after seven weeks and 2,074 received a confirmation of participation. The course followed the usual structure of an introductory database lecture and covered the fundamentals of ER modeling, relational design, and relational algebra, as well as a detailed introduction to SQL. The lecture content was broken into short video units, each concluding with a small self-test. Alongside each topic block, participants had to solve homework assignments online and complete a final exam at the end of the course. We report on our experiences running this first German database MOOC. In particular, we discuss the differences from a classical lecture and describe the handling of thousands of participants, which was at times difficult for us. Our aim is to give all interested readers a look behind the scenes of a free online course and to offer practical advice to all instructors planning such a course themselves.