Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Felix Naumann is active.

Publication


Featured research published by Felix Naumann.


ACM Computing Surveys | 2009

Data fusion

Jens Bleiholder; Felix Naumann

The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation. This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing if and how data fusion is performed in each.
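The fusion step described above maps naturally onto SQL. The following minimal sketch uses Python's sqlite3 module; the tables, columns, and the choice of MAX as a conflict-resolution function are illustrative assumptions, not taken from the article.

```python
# A minimal data fusion sketch: two sources describe overlapping persons,
# and a grouped query collapses each id into one consistent row.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE source_a (id INTEGER, name TEXT, phone TEXT);
    CREATE TABLE source_b (id INTEGER, name TEXT, phone TEXT);
    INSERT INTO source_a VALUES (1, 'Alice', NULL), (2, 'Bob', '555-0101');
    INSERT INTO source_b VALUES (1, 'Alice', '555-0199'), (3, 'Carol', NULL);
""")

# Fuse: union both sources, then collapse records with the same id.
# MAX acts as a simple conflict-resolution function; it also ignores
# NULLs, so complementing values survive the merge.
cur.execute("""
    SELECT id, MAX(name) AS name, MAX(phone) AS phone
    FROM (SELECT * FROM source_a UNION SELECT * FROM source_b)
    GROUP BY id
    ORDER BY id
""")
for row in cur.fetchall():
    print(row)  # (1, 'Alice', '555-0199'), (2, 'Bob', '555-0101'), (3, 'Carol', None)
```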


Very Large Data Bases | 2014

The Stratosphere platform for big data analytics

Alexander Alexandrov; Rico Bergmann; Stephan Ewen; Johann Christoph Freytag; Fabian Hueske; Arvid Heise; Odej Kao; Marcus Leich; Ulf Leser; Volker Markl; Felix Naumann; Mathias Peters; Astrid Rheinländer; Matthias J. Sax; Sebastian Schelter; Mareike Höger; Kostas Tzoumas; Daniel Warneke

We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined functions as first-class citizens, automatic program parallelization and optimization, support for iterative programs, and a scalable and efficient execution engine. Stratosphere covers a variety of “Big Data” use cases, such as data warehousing, information extraction and integration, data cleansing, graph analysis, and statistical analysis applications. In this paper, we present the overall system architecture and design decisions, introduce Stratosphere through example queries, and then dive into the internal workings of the system’s components that relate to extensibility, programming model, optimization, and query execution. We experimentally compare Stratosphere against popular open-source alternatives, and we conclude with a research outlook for the coming years.
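Stratosphere's actual API is not reproduced here. As a loose illustration of one listed feature, support for iterative programs with user-defined functions as first-class values, the plain-Python sketch below runs a step function to a fixpoint; all names are hypothetical, and this is not the Stratosphere programming model.

```python
# Illustrative only: the shape of an iterative program in which a
# user-defined step function is a first-class value. In Stratosphere the
# same pattern is expressed with parallel dataflow operators and
# parallelized and optimized automatically.
def iterate(dataset, step, max_rounds=100):
    """Apply a user-defined step function until a fixpoint is reached."""
    for _ in range(max_rounds):
        updated = step(dataset)
        if updated == dataset:  # fixpoint: nothing changed
            return updated
        dataset = updated
    return dataset

# Toy example: repeatedly halve values until all are at most 1.
halve_large = lambda xs: [x / 2 if x > 1 else x for x in xs]
print(iterate([8.0, 3.0, 0.5], halve_large))  # [1.0, 0.75, 0.5]
```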


Very Large Data Bases | 1999

Quality-driven Integration of Heterogeneous Information Systems

Felix Naumann; Ulf Leser; Johann Christoph Freytag

Integrated access to information that is spread over multiple, distributed, and heterogeneous sources is an important problem in many scientific and commercial domains. While much work has been done on query processing and choosing plans under cost criteria, very little is known about the important problem of incorporating the information quality aspect into query planning. In this paper we describe a framework for multidatabase query processing that fully includes the quality of information in many facets, such as completeness, timeliness, accuracy, etc. We seamlessly include information quality into a multidatabase query processor based on a view-rewriting mechanism. We model information quality at different levels to ultimately find a set of high-quality query-answering plans.
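As a rough sketch of the idea, the hypothetical Python snippet below scores candidate query plans by a weighted combination of per-source quality criteria and selects the best plan. The criteria, weights, and aggregation scheme are placeholders; the paper's model is considerably richer and is integrated with view rewriting.

```python
# A minimal sketch of quality-driven plan selection. Each source
# advertises scores for a few IQ criteria; a plan is a set of sources.
sources = {
    "s1": {"completeness": 0.9, "timeliness": 0.4, "accuracy": 0.8},
    "s2": {"completeness": 0.6, "timeliness": 0.9, "accuracy": 0.7},
    "s3": {"completeness": 0.3, "timeliness": 0.8, "accuracy": 0.9},
}
weights = {"completeness": 0.5, "timeliness": 0.2, "accuracy": 0.3}

def plan_quality(plan):
    """Score a plan by the weighted average quality of its sources."""
    per_source = [
        sum(weights[c] * sources[s][c] for c in weights) for s in plan
    ]
    return sum(per_source) / len(per_source)

candidate_plans = [("s1",), ("s2", "s3"), ("s1", "s3")]
best = max(candidate_plans, key=plan_quality)
print(best, round(plan_quality(best), 3))  # ('s1',) 0.77
```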


International Conference on Information Quality (IQ) | 2000

Assessment Methods for Information Quality Criteria

Felix Naumann; Claudia Rolker

Information quality (IQ) is one of the most important aspects of information integration on the Internet. Many projects realize and address this fact by gathering and classifying IQ criteria. Hardly ever do the projects address the immense difficulty of assessing scores for the criteria. This task must precede any usage of criteria for qualifying and integrating information. After reviewing previous attempts to classify IQ criteria, in this paper we also classify criteria, but in a new, assessment-oriented way. We identify three sources for IQ scores and thus, three IQ criterion classes, each with different general assessment possibilities. Additionally, for each criterion we give detailed assessment methods. Finally, we consider confidence measures for these methods. Confidence expresses the accuracy, lastingness, and credibility of the individual assessment methods.
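A minimal sketch of such an assessment-oriented classification might look as follows; the class names, assessment methods, and confidence values are illustrative placeholders rather than the paper's actual taxonomy.

```python
# An assessment-oriented criterion registry: each criterion records where
# its score originates, how it is assessed, and how trustworthy that
# assessment method is.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    source_class: str  # where the score comes from (user, data, process)
    assessment: str    # how a score is obtained
    confidence: float  # how trustworthy that assessment method is

registry = [
    Criterion("understandability", "user",    "user rating",       0.5),
    Criterion("completeness",      "data",    "parse and count",   0.9),
    Criterion("response time",     "process", "measure per query", 0.8),
]

# Prefer criteria whose assessment methods we can trust.
for c in sorted(registry, key=lambda c: -c.confidence):
    print(f"{c.name:18} [{c.source_class}] via {c.assessment} (conf={c.confidence})")
```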


Archive | 2002

Quality-driven query answering for integrated information systems

Felix Naumann

Contents: Querying the Web; Integrating Autonomous Information Sources; Information Quality; Information Quality Criteria; Quality Ranking Methods; Quality-Driven Query Answering; Quality-Driven Query Planning; Query Planning Revisited; Completeness of Data; Completeness-Driven Query Optimization; Discussion; Conclusion.


International Conference on Data Engineering | 2005

Schema matching using duplicates

Alexander Bilke; Felix Naumann

Most data integration applications require a matching between the schemas of the respective data sets. We show how the existence of duplicates within these data sets can be exploited to automatically identify matching attributes. We describe an algorithm that first discovers duplicates among data sets with unaligned schemas and then uses these duplicates to perform schema matching between schemas with opaque column names. Discovering duplicates among data sets with unaligned schemas is more difficult than in the usual setting, because it is not clear which fields in one object should be compared with which fields in the other. We have developed a new algorithm that efficiently finds the most likely duplicates in such a setting. Now, our schema matching algorithm is able to identify corresponding attributes by comparing data values within those duplicate records. An experimental study on real-world data shows the effectiveness of this approach.
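A hypothetical sketch of the second stage: given duplicate record pairs across two tables with opaque column names, attribute correspondences are scored by the average similarity of their values within those pairs. The data, similarity function, and greedy matching are illustrative simplifications.

```python
# Schema matching from known duplicates: columns whose values agree
# across duplicate record pairs are likely the same attribute.
from difflib import SequenceMatcher
from itertools import product

def sim(a, b):
    return SequenceMatcher(None, str(a), str(b)).ratio()

# Known duplicates: (record from table A, record from table B).
duplicates = [
    (("Alice Smith", "Berlin", "1980"), ("1980", "A. Smith", "Berlin")),
    (("Bob Jones", "Potsdam", "1975"), ("1975", "B. Jones", "Potsdam")),
]

n_a, n_b = 3, 3
# Average value similarity of column i in A versus column j in B.
scores = {
    (i, j): sum(sim(a[i], b[j]) for a, b in duplicates) / len(duplicates)
    for i, j in product(range(n_a), range(n_b))
}
for i in range(n_a):
    j = max(range(n_b), key=lambda j: scores[(i, j)])
    print(f"A.col{i} -> B.col{j} (score={scores[(i, j)]:.2f})")
```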


Synthesis Lectures on Data Management | 2010

An Introduction to Duplicate Detection

Felix Naumann; Melanie Herschel

With the ever-increasing volume of data, data quality problems abound. Multiple, yet different representations of the same real-world objects in data, duplicates, are one of the most intriguing data quality problems. The effects of such duplicates are detrimental; for instance, bank customers can obtain duplicate identities, inventory levels are monitored incorrectly, catalogs are mailed multiple times to the same household, etc. Automatically detecting duplicates is difficult: First, duplicate representations are usually not identical but differ slightly in their values. Second, in principle all pairs of records should be compared, which is infeasible for large volumes of data. This lecture closely examines the two main components used to overcome these difficulties: (i) Similarity measures are used to automatically identify duplicates when comparing two records. Well-chosen similarity measures improve the effectiveness of duplicate detection. (ii) Algorithms are developed to perform on very large volumes of data in search of duplicates. Well-designed algorithms improve the efficiency of duplicate detection. Finally, we discuss methods to evaluate the success of duplicate detection. Table of Contents: Data Cleansing: Introduction and Motivation / Problem Definition / Similarity Functions / Duplicate Detection Algorithms / Evaluating Detection Success / Conclusion and Outlook / Bibliography
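As a small illustration of both components, the sketch below combines a string similarity measure with the classic sorted-neighborhood strategy, which sorts records by a key and compares only records within a sliding window instead of all pairs. The threshold, key, and window size are illustrative.

```python
# Duplicate detection in two parts: a similarity measure (effectiveness)
# and a windowed comparison strategy (efficiency).
from difflib import SequenceMatcher

records = ["Jon Smith", "John Smith", "J. Smyth", "Mary Major", "M. Major"]

def similar(a, b, threshold=0.75):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def sorted_neighborhood(recs, key=lambda r: r, window=3):
    """Sort by a key, then compare each record only to its neighbors."""
    ordered = sorted(recs, key=key)
    pairs = []
    for i, r in enumerate(ordered):
        for s in ordered[i + 1 : i + window]:
            if similar(r, s):
                pairs.append((r, s))
    return pairs

print(sorted_neighborhood(records))
# [('John Smith', 'Jon Smith'), ('M. Major', 'Mary Major')]
```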


Databases, Information Systems, and Peer-to-Peer Computing | 2003

Semantic Overlay Clusters within Super-Peer Networks

Alexander Löser; Felix Naumann; Wolf Siberski; Wolfgang Nejdl; Uwe Thaden

When joining information-provider peers to a peer-to-peer network, an arbitrary distribution is sub-optimal. In fact, clustering peers by their characteristics enhances search and integration significantly. Current super-peer networks, such as the Edutella network, provide no sophisticated means for such a "semantic clustering" of peers. We introduce the concept of semantic overlay clusters (SOC) for super-peer networks, enabling a controlled distribution of peers to clusters. In contrast to the recently announced semantic overlay network approach, which is designed for flat, pure peer-to-peer topologies and for limited metadata sets such as simple filenames, we allow a clustering of the complex, heterogeneous schemas known from relational databases and exploit the advantages of super-peer networks, such as efficient search and broadcast of messages. Our approach is based on predefined policies defined by human experts. Based on such policies, a fully decentralized broadcast-and-matching approach distributes the peers automatically to super-peers. Thus we are able to automate the integration of information sources in super-peer networks and reduce flooding of the network with messages.
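A minimal sketch of the broadcast-and-matching idea, with policies modeled as simple predicates over peer metadata; the policy language and metadata fields are illustrative inventions, far simpler than the expert-defined policies the paper assumes.

```python
# Each super-peer publishes a policy over peer metadata; a joining peer
# is assigned to every super-peer whose policy it satisfies.
superpeer_policies = {
    "sp-bio":   lambda p: "biology" in p["topics"],
    "sp-media": lambda p: p["schema"] == "LOM" and "video" in p["formats"],
}

def assign(peer):
    """Broadcast-and-match: return super-peers whose policies the peer satisfies."""
    return [sp for sp, policy in superpeer_policies.items() if policy(peer)]

peer = {"topics": {"biology", "chemistry"}, "schema": "LOM", "formats": {"pdf"}}
print(assign(peer))  # ['sp-bio']
```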


Cooperative Information Systems | 2004

Completeness of integrated information sources

Felix Naumann; Johann Christoph Freytag; Ulf Leser

For many information domains there are numerous World Wide Web data sources. The sources vary both in their extension and their intension: they represent different real-world entities with possible overlap and provide different attributes of these entities. Mediator-based information systems allow integrated access to such sources by providing a common schema against which the user can pose queries. Given a query, the mediator must determine which participating sources to access and how to integrate the incoming results. This article describes how to support mediators in their source selection and query planning process. We propose three new merge operators, which formalize the integration of multiple source responses. A completeness model describes the usefulness of a source to answer a query. The completeness measure incorporates both the extensional value (called coverage) and the intensional value (called density) of a source. We show how to determine the completeness of single sources and of combinations of sources under the new merge operators. Finally, we show how to use the measure for source selection and query planning.
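The measure can be illustrated with toy numbers: coverage is the fraction of real-world entities a source represents, density the average fraction of attribute values it fills, and completeness combines the two multiplicatively. The data and the simple full-outer-join-style merge below are illustrative assumptions.

```python
# Completeness = coverage (extensional) * density (intensional).
WORLD = 100  # assumed number of real-world entities
ATTRS = ["name", "phone", "email"]

def density(records):
    filled = sum(v is not None for r in records.values() for v in r.values())
    return filled / (len(records) * len(ATTRS))

def completeness(records):
    coverage = len(records) / WORLD
    return coverage * density(records)

s1 = {i: {"name": f"p{i}", "phone": "x", "email": None} for i in range(60)}
s2 = {i: {"name": f"p{i}", "phone": None, "email": "y"} for i in range(40, 90)}

# Full-outer-join-style merge: union of ids; non-null values complement
# each other where the sources overlap.
merged = {
    i: {a: (s1.get(i, {}).get(a) or s2.get(i, {}).get(a)) for a in ATTRS}
    for i in set(s1) | set(s2)
}
print(round(completeness(s1), 3),      # 0.4
      round(completeness(s2), 3),      # 0.333
      round(completeness(merged), 3))  # 0.667: merging improves completeness
```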


International Conference on Management of Data | 2005

DogmatiX tracks down duplicates in XML

Melanie Weis; Felix Naumann

Duplicate detection is the problem of detecting different entries in a data source representing the same real-world entity. While research abounds in the realm of duplicate detection in relational data, there is as yet little work for duplicates in other, more complex data models, such as XML. In this paper, we present a generalized framework for duplicate detection, dividing the problem into three components: candidate definition, defining which objects are to be compared; duplicate definition, defining when two duplicate candidates are in fact duplicates; and duplicate detection, specifying how to efficiently find those duplicates. Using this framework, we propose an XML duplicate detection method, DogmatiX, which compares XML elements based not only on their direct data values, but also on the similarity of their parents, children, structure, etc. We propose heuristics to determine which of these to choose, as well as a similarity measure specifically geared towards the XML data model. An evaluation of our algorithm using several heuristics validates our approach.
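In the spirit of this method, the hypothetical sketch below describes each XML element by tokens drawn from its own text and its children's text and compares descriptions with Jaccard similarity; this is an illustrative simplification of the paper's similarity measure, not DogmatiX itself.

```python
# Compare XML elements by descriptions built from their own text plus
# their children's text, then score with a set-based similarity.
import xml.etree.ElementTree as ET

def description(elem):
    """Token set from the element's text and its children's text."""
    texts = [elem.text or ""] + [(c.text or "") for c in elem]
    return {t.lower() for chunk in texts for t in chunk.split()}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

x1 = ET.fromstring(
    "<person><name>Felix Naumann</name><city>Potsdam</city></person>")
x2 = ET.fromstring(
    "<author><name>F. Naumann</name><affiliation>Potsdam</affiliation></author>")

print(jaccard(description(x1), description(x2)))  # 0.5: likely duplicates
```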

Collaboration


Dive into Felix Naumann's collaborations.

Top Co-Authors

Ulf Leser, Technical University of Berlin
Jens Bleiholder, Humboldt University of Berlin
Dustin Lange, Hasso Plattner Institute
Melanie Weis, Humboldt University of Berlin
Armin Roth, Humboldt University of Berlin
Arvid Heise, Hasso Plattner Institute