Arvid Heise
Hasso Plattner Institute
Publications
Featured research published by Arvid Heise.
Very Large Data Bases | 2014
Alexander Alexandrov; Rico Bergmann; Stephan Ewen; Johann Christoph Freytag; Fabian Hueske; Arvid Heise; Odej Kao; Marcus Leich; Ulf Leser; Volker Markl; Felix Naumann; Mathias Peters; Astrid Rheinländer; Matthias J. Sax; Sebastian Schelter; Mareike Höger; Kostas Tzoumas; Daniel Warneke
We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined functions as first-class citizens, automatic program parallelization and optimization, support for iterative programs, and a scalable and efficient execution engine. Stratosphere covers a variety of “Big Data” use cases, such as data warehousing, information extraction and integration, data cleansing, graph analysis, and statistical analysis applications. In this paper, we present the overall system architecture design decisions, introduce Stratosphere through example queries, and then dive into the internal workings of the system’s components that relate to extensibility, programming model, optimization, and query execution. We experimentally compare Stratosphere against popular open-source alternatives, and we conclude with a research outlook for the next years.
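To make the idea of user-defined functions as first-class citizens in a parallel dataflow more concrete, the following minimal sketch composes UDFs into a word-count pipeline. The operator names and the in-memory execution are illustrative assumptions, not Stratosphere's actual API.

```python
# Minimal sketch of a dataflow whose building blocks are user-defined
# functions. This is an illustrative toy, not Stratosphere's actual API.
from collections import defaultdict


def dataflow(source, *operators):
    """Apply a pipeline of operators to an iterable source."""
    data = source
    for op in operators:
        data = op(data)
    return list(data)


def flat_map(udf):
    return lambda records: (out for rec in records for out in udf(rec))


def reduce_by_key(udf):
    def op(records):
        groups = defaultdict(list)
        for key, value in records:
            groups[key].append(value)
        return ((key, udf(values)) for key, values in groups.items())
    return op


if __name__ == "__main__":
    lines = ["big data analysis", "parallel data analysis"]
    # Word count expressed as a composition of user-defined functions.
    result = dataflow(
        lines,
        flat_map(lambda line: [(word, 1) for word in line.split()]),
        reduce_by_key(sum),
    )
    print(sorted(result))
```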
Very Large Data Bases | 2013
Arvid Heise; Jorge Arnulfo Quiané-Ruiz; Ziawasch Abedjan; Anja Jentzsch; Felix Naumann
The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires verifying an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice with a combination of depth-first traversal and random walks. This strategy allows Ducc to typically depend on the solution set size and hence to prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just a few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several million rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc is up to more than two orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of Ducc in scaling up and out.
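The uniqueness check at the heart of this problem is easy to state; what makes Ducc necessary is avoiding the exponential number of such checks. A minimal brute-force baseline (the very enumeration Ducc's lattice pruning avoids), with made-up column names and data, might look like this:

```python
# Naive baseline for discovering unique column combinations: verify every
# combination for duplicate value tuples. Ducc avoids this exhaustive
# enumeration through lattice traversal and pruning; this sketch only
# illustrates the underlying uniqueness check.
from itertools import combinations


def is_unique(rows, columns):
    """A column combination is unique if no value tuple occurs twice."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def unique_column_combinations(rows, columns):
    uniques = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            if is_unique(rows, combo):
                uniques.append(combo)
    return uniques


if __name__ == "__main__":
    rows = [
        {"first": "Ada", "last": "Lovelace", "city": "London"},
        {"first": "Ada", "last": "Byron", "city": "London"},
        {"first": "Alan", "last": "Turing", "city": "London"},
    ]
    print(unique_column_combinations(rows, ["first", "last", "city"]))
```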
International World Wide Web Conferences | 2012
Christoph Böhm; Markus Freitag; Arvid Heise; Claudia Lehmann; Andrina Mascher; Felix Naumann; Vuk Ercegovac; Mauricio A. Hernández; Peter Haase; Michael Schmidt
Many government organizations publish a variety of data on the web to enable transparency, foster applications, and satisfy legal obligations. Data content, format, structure, and quality vary widely, even in cases where the data is published using the widespread linked data principles. Yet within this data and its integration lies much value: We demonstrate GovWILD, a web-based prototype that integrates and cleanses Open Government Data at a large scale. Apart from the web-based interface that presents a use case of the created dataset at govwild.org, we provide all integrated data as a download. This data can be used to answer questions about politicians, companies, and government funding.
IEEE Transactions on Knowledge and Data Engineering | 2015
Thorsten Papenbrock; Arvid Heise; Felix Naumann
Duplicate detection is the process of identifying multiple representations of the same real-world entities. Today, duplicate detection methods need to process ever larger datasets in ever shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase the efficiency of finding duplicates when the execution time is limited: they maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that our progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly improve upon related work.
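One way to picture a progressive strategy is a sorted-neighborhood scan that widens its comparison window step by step, so the most likely duplicates are reported first. The sketch below follows that intuition only; the similarity measure, window scheme, and data are illustrative assumptions, not the paper's algorithms.

```python
# Illustrative sketch of a progressive, sorted-neighborhood style scan:
# records sorted by a key are compared with window distance 1 first, then 2,
# and so on, so close (likely duplicate) pairs are emitted early.
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, a, b).ratio() >= threshold


def progressive_duplicates(records, key=lambda r: r, max_window=3):
    ordered = sorted(records, key=key)
    for distance in range(1, max_window + 1):      # widen the window step by step
        for i in range(len(ordered) - distance):
            a, b = ordered[i], ordered[i + distance]
            if similar(key(a), key(b)):
                yield a, b                          # reported as soon as found


if __name__ == "__main__":
    names = ["Jon Smith", "John Smith", "Jane Doe", "J. Smith", "Jane Do"]
    for pair in progressive_duplicates(names):
        print(pair)
```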
Journal of Web Semantics | 2012
Arvid Heise; Felix Naumann
Governments are increasingly publishing their data to enable organizations and citizens to browse and analyze it. However, the heterogeneity of this Open Government Data hinders meaningful search, analysis, and integration and thus limits the desired transparency. In this article, we present the newly developed data integration operators of the Stratosphere parallel data analysis framework to overcome this heterogeneity. With declaratively specified queries, we demonstrate the integration of well-known government data sources and other large open data sets at the technical, structural, and semantic levels. Furthermore, we publish the integrated data on the Web in a form that enables users to discover relationships between persons, government agencies, funds, and companies. The evaluation shows that linking person entities of different data sets achieves a precision of 98.3% and a recall of 95.2%. Moreover, the integration of large data sets scales well on up to eight machines.
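The reported precision and recall follow the standard definitions over linked entity pairs. A small sketch of such an evaluation, with entirely made-up identifiers, could be:

```python
# Sketch of the precision/recall computation behind figures like 98.3% and
# 95.2%: compare the links produced by an integration run against a set of
# known-correct links. The pairs below are made up for illustration.


def precision_recall(predicted_links, true_links):
    predicted, truth = set(predicted_links), set(true_links)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    predicted = {("person:42", "politician:7"), ("person:43", "politician:9")}
    truth = {("person:42", "politician:7"), ("person:44", "politician:3")}
    p, r = precision_recall(predicted, truth)
    print(f"precision={p:.1%} recall={r:.1%}")
```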
Information Systems | 2015
Astrid Rheinländer; Arvid Heise; Fabian Hueske; Ulf Leser; Felix Naumann
Recent years have seen an increased interest in large-scale analytical data flows on non-relational data. These data flows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such data flows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is not well taken into account for data flow optimization in current systems. Sofa is a novel and extensible optimizer for UDF-heavy data flows. It builds on a concise set of properties for describing the semantics of Map/Reduce-style UDFs and a small set of rewrite rules, which use these properties to find a much larger number of semantically equivalent plan rewrites than possible with traditional techniques. A salient feature of our approach is extensibility: we arrange user-defined operators and their properties into a subsumption hierarchy, which considerably eases integration and optimization of new operators. We evaluate Sofa on a selection of UDF-heavy data flows from different domains and compare its performance to three other algorithms for data flow optimization. Our experiments reveal that Sofa finds efficient plans, outperforming the best plans found by its competitors by a factor of up to six.
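A toy illustration of property-driven plan rewriting: a selective filter is pushed below a preceding UDF whenever annotated read/write sets show that the two operators commute. The property model and rewrite rule below are simplified assumptions, not Sofa's actual rule set.

```python
# Illustrative sketch of property-driven plan rewriting: a filter is pushed
# below a preceding UDF operator whenever the annotated read/write sets show
# the two commute, so selective operators run early.
from dataclasses import dataclass, field


@dataclass
class Operator:
    name: str
    kind: str                      # e.g. "filter" or "udf"
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def commute(a: Operator, b: Operator) -> bool:
    """Two operators commute if neither reads what the other writes."""
    return not (a.reads & b.writes) and not (b.reads & a.writes)


def push_filters_down(plan):
    """Repeatedly swap (udf, filter) pairs that commute."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            a, b = plan[i], plan[i + 1]
            if a.kind == "udf" and b.kind == "filter" and commute(a, b):
                plan[i], plan[i + 1] = b, a
                changed = True
    return plan


if __name__ == "__main__":
    plan = [
        Operator("annotate_entities", "udf", reads={"text"}, writes={"entities"}),
        Operator("filter_by_year", "filter", reads={"year"}),
    ]
    print([op.name for op in push_filters_down(plan)])
```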
Integrating Technology into Computer Science Education | 2010
Michael Haupt; Robert Hirschfeld; Tobias Pape; Gregor Gabrysiak; Stefan Marr; Arne Bergmann; Arvid Heise; Matthias Kleine; Robert Krahn
This paper introduces the SOM (Simple Object Machine) family of virtual machine (VM) implementations, a collection of VMs for the same Smalltalk dialect addressing students at different levels of expertise. Starting from a Java-based implementation, several ports of the VM to different programming languages have been developed and put to successful use in teaching at both undergraduate and graduate levels since 2006. Moreover, the VMs have been used in various research projects. The paper documents the rationale behind each of the SOM VMs and results that have been achieved in teaching and research.
Conference on Information and Knowledge Management | 2014
Arvid Heise; Gjergji Kasneci; Felix Naumann
Duplicates in a dataset are multiple representations of the same real-world entity and constitute a major data quality problem. This paper investigates the problem of estimating the number and sizes of duplicate record clusters in advance and describes a sampling-based method for solving this problem. In extensive experiments on multiple datasets, we show that the proposed method reliably estimates the number of duplicate clusters while being highly efficient. Our method can be used a) to measure the dirtiness of a dataset, b) to assess the quality of duplicate detection configurations, such as similarity measures, and c) to gather approximate statistics about the true number of entities represented in the dataset.
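A rough flavor of sampling-based estimation: draw a sample of records, cluster the duplicates found within the sample, and scale the observed statistics to the full dataset. The linear extrapolation below is a naive illustration, not the estimator proposed in the paper.

```python
# Naive sampling-based estimate of the number of duplicate clusters: cluster
# a sample with a pairwise matcher and union-find, then scale linearly.
import random
from difflib import SequenceMatcher
from itertools import combinations


def is_duplicate(a, b, threshold=0.85):
    return SequenceMatcher(None, a, b).ratio() >= threshold


def estimate_duplicate_clusters(records, sample_size=50, seed=0):
    sample = random.Random(seed).sample(records, min(sample_size, len(records)))
    parent = list(range(len(sample)))            # union-find over sample indices

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i, j in combinations(range(len(sample)), 2):
        if is_duplicate(sample[i], sample[j]):
            parent[find(i)] = find(j)

    cluster_sizes = {}
    for i in range(len(sample)):
        root = find(i)
        cluster_sizes[root] = cluster_sizes.get(root, 0) + 1

    multi_record_clusters = sum(1 for size in cluster_sizes.values() if size > 1)
    return multi_record_clusters * len(records) / len(sample)   # naive scaling


if __name__ == "__main__":
    data = ["Anna Smith", "Ana Smith", "Bob Jones", "Robert Jones",
            "Carol King", "Dave Miller", "Eve Brown", "Frank Green"]
    print(estimate_duplicate_clusters(data, sample_size=8))
```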
Journal of Data and Information Quality | 2014
Tobias Vogel; Arvid Heise; Uwe Draisbach; Dustin Lange; Felix Naumann
Duplicates in a database are one of the prime causes of poor data quality and are at the same time among the most difficult data quality problems to alleviate. To detect and remove such duplicates, many commercial and academic products and methods have been developed. The evaluation of such systems is usually in need of pre-classified results. Such gold standards are often expensive to come by (much manual classification is necessary), not representative (too small or too synthetic), and proprietary and thus preclude repetition (company-internal data). This lament has been uttered in many papers and even more paper reviews. The proposed annealing standard is a structured set of duplicate detection results, some of which are manually verified and some of which are merely validated by many classifiers. As more and more classifiers are evaluated against the annealing standard, more and more results are verified and validation becomes more and more confident. We formally define gold, silver, and the annealing standard and their maintenance. Experiments show how quickly an annealing standard converges to a gold standard. Finally, we provide an annealing standard for 750,000 CDs to the duplicate detection community.
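The gold/silver/annealing distinction can be pictured as a small bookkeeping structure in which pairs are either manually verified or accumulate validations from classifiers until they are promoted. The thresholds, promotion rule, and identifiers below are illustrative assumptions, not the paper's formal definitions.

```python
# Toy sketch of the gold / silver / annealing-standard idea: each candidate
# duplicate pair is either manually verified (gold) or merely validated by a
# growing number of classifiers (silver); once enough classifiers agree, a
# pair is treated with near-gold confidence.
from collections import defaultdict


class AnnealingStandard:
    def __init__(self, promote_after=5):
        self.verified = {}                  # pair -> True/False, set manually
        self.votes = defaultdict(int)       # pair -> number of agreeing classifiers
        self.promote_after = promote_after

    def verify(self, pair, is_duplicate):
        """Record a manual, authoritative judgment (gold)."""
        self.verified[pair] = is_duplicate

    def validate(self, pair):
        """Record that one more classifier reported this pair as a duplicate."""
        if pair not in self.verified:
            self.votes[pair] += 1

    def status(self, pair):
        if pair in self.verified:
            return "gold"
        if self.votes[pair] >= self.promote_after:
            return "annealed"
        return "silver" if self.votes[pair] > 0 else "unknown"


if __name__ == "__main__":
    standard = AnnealingStandard(promote_after=3)
    standard.verify(("cd:1", "cd:2"), True)
    for _ in range(3):
        standard.validate(("cd:7", "cd:9"))
    print(standard.status(("cd:1", "cd:2")), standard.status(("cd:7", "cd:9")))
```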
International Conference on Management of Data | 2014
Astrid Rheinländer; Martin Beckmann; Anja Kunkel; Arvid Heise; Thomas Stoltmann; Ulf Leser
Currently, we witness an increased interest in large-scale analytical data flows on non-relational data. The predominant building blocks of such data flows are user-defined functions (UDFs), a fact that is not well taken into account for data flow language design and optimization in current systems. In this demonstration, we present Meteor, a declarative data flow language, and Sofa, a logical optimizer for UDF-heavy data flows, both of which are part of the Stratosphere system. Meteor queries seamlessly combine self-descriptive, domain-specific operators with standard relational operators. Such queries are optimized by Sofa, which builds on a concise set of UDF annotations and a small set of rewrite rules to enable semantically equivalent plan rewriting of UDF-heavy data flows. A salient feature of Meteor and Sofa is extensibility: user-defined operators and their properties are arranged into a subsumption hierarchy, which considerably eases integration and optimization of new operators. In this demonstration, we will let users pose arbitrary Meteor queries and graphically showcase the versatility and extensibility of Sofa during query optimization.