Melanie Herschel
University of Stuttgart
Publications
Featured research published by Melanie Herschel.
Synthesis Lectures on Data Management | 2010
Felix Naumann; Melanie Herschel
With the ever-increasing volume of data, data quality problems abound. Multiple, yet different, representations of the same real-world objects in data, so-called duplicates, are one of the most intriguing data quality problems. The effects of such duplicates are detrimental; for instance, bank customers can obtain duplicate identities, inventory levels are monitored incorrectly, catalogs are mailed multiple times to the same household, etc. Automatically detecting duplicates is difficult: First, duplicate representations are usually not identical but slightly differ in their values. Second, in principle all pairs of records should be compared, which is infeasible for large volumes of data. This lecture closely examines the two main components needed to overcome these difficulties: (i) Similarity measures are used to automatically identify duplicates when comparing two records. Well-chosen similarity measures improve the effectiveness of duplicate detection. (ii) Algorithms are developed to perform on very large volumes of data in search of duplicates. Well-designed algorithms improve the efficiency of duplicate detection. Finally, we discuss methods to evaluate the success of duplicate detection. Table of Contents: Data Cleansing: Introduction and Motivation / Problem Definition / Similarity Functions / Duplicate Detection Algorithms / Evaluating Detection Success / Conclusion and Outlook / Bibliography
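To make the two components concrete, the following is a minimal, illustrative sketch and not the lecture's own algorithms: a character-bigram Jaccard similarity stands in for the similarity function, and a simple blocking key stands in for algorithms that avoid comparing all pairs of records. The record fields, blocking key, and threshold are invented for illustration.

```python
# Illustrative sketch only: a similarity measure plus blocking, not the
# lecture's actual methods. Fields and threshold are made up.

def bigrams(s: str) -> set:
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a: str, b: str) -> float:
    """Character-bigram Jaccard similarity between two strings."""
    ba, bb = bigrams(a), bigrams(b)
    return len(ba & bb) / len(ba | bb) if ba | bb else 1.0

def detect_duplicates(records, threshold=0.6):
    """Compare only records that share a blocking key (here: zip code)."""
    blocks = {}
    for r in records:
        blocks.setdefault(r["zip"], []).append(r)
    duplicates = []
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                if jaccard(block[i]["name"], block[j]["name"]) >= threshold:
                    duplicates.append((block[i]["id"], block[j]["id"]))
    return duplicates

customers = [
    {"id": 1, "name": "Jon Smith",  "zip": "70174"},
    {"id": 2, "name": "John Smith", "zip": "70174"},
    {"id": 3, "name": "Mary Major", "zip": "10115"},
]
print(detect_duplicates(customers))  # [(1, 2)]
```

A better-chosen similarity function raises detection effectiveness, while the blocking step trades a small risk of missed pairs for far fewer comparisons, which is the efficiency concern the lecture's algorithms address.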
Very Large Data Bases | 2010
Melanie Herschel; Mauricio A. Hernández
This paper addresses the problem of explaining missing answers in queries that include selection, projection, join, union, aggregation and grouping (SPJUA). Explaining missing answers of queries is useful in various scenarios, including query understanding and debugging. We present a general framework for the generation of these explanations based on source data. We describe the algorithms used to generate a correct, finite, and, when possible, minimal set of explanations. These algorithms are part of Artemis, a system that assists query developers in analyzing queries by, for instance, allowing them to ask why certain tuples are not in the query results. Experimental results demonstrate that Artemis generates explanations of missing tuples at a pace that allows developers to effectively use them for query analysis.
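As a toy illustration of the kind of question Artemis answers, and not of its actual algorithms, the following sketch runs a small selection-over-join query in Python and derives a source-based explanation for why an expected answer is missing. The table contents and predicate are invented.

```python
# Toy illustration of the "missing answer" question; not the Artemis algorithm.
orders    = [{"oid": 10, "cid": 1, "total": 45},
             {"oid": 11, "cid": 2, "total": 300}]
customers = [{"cid": 1, "name": "Alice"},
             {"cid": 2, "name": "Bob"},
             {"cid": 3, "name": "Carol"}]

def query():
    """SELECT name FROM customers JOIN orders USING (cid) WHERE total > 100"""
    return [c["name"] for c in customers for o in orders
            if c["cid"] == o["cid"] and o["total"] > 100]

def why_not(name):
    """Return a simple, source-based explanation for a missing customer name."""
    cands = [c for c in customers if c["name"] == name]
    if not cands:
        return [f"No customer named {name} exists in the source."]
    reasons = []
    for c in cands:
        matching = [o for o in orders if o["cid"] == c["cid"]]
        if not matching:
            reasons.append(f"{name} has no order to join with.")
        elif all(o["total"] <= 100 for o in matching):
            reasons.append(f"All of {name}'s orders fail the predicate total > 100.")
    return reasons or [f"{name} is in the result."]

print(query())           # ['Bob']
print(why_not("Alice"))  # her only order fails total > 100
print(why_not("Carol"))  # she has no matching order
```

Artemis generalizes this idea to SPJUA queries and guarantees correct, finite, and, where possible, minimal explanation sets; the sketch above only conveys what such an explanation looks like.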
Very Large Data Bases | 2009
Melanie Herschel; Mauricio A. Hernández; Wang Chiew Tan
A central feature of relational database management systems is the ability to define multiple different views over an underlying database schema. Views provide a means of defining access control to the underlying database, since a view exposes a part of the database and hides the rest. Views also provide logical data independence to application programs that access the database. In most cases, however, specifying the desired views in SQL is tedious and error-prone. While numerous tools exist to support developers in debugging program code, we are not aware of any tool that supports developers in verifying the correctness of their views defined in SQL.
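For readers unfamiliar with the view mechanism the paper targets, here is a minimal SQLite sketch of a view that exposes part of a table while hiding the rest; the table, column, and view names are invented and have no connection to the paper's examples.

```python
# Minimal sketch of the view concept: a view exposes only part of the
# underlying table (here, no salary column and only one department).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, dept TEXT, salary REAL)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)",
                 [(1, "Alice", "R&D", 75000.0), (2, "Bob", "Sales", 68000.0)])

# The view hides the salary column and restricts rows to one department.
conn.execute("""CREATE VIEW rd_staff AS
                SELECT id, name FROM employee WHERE dept = 'R&D'""")

print(conn.execute("SELECT * FROM rd_staff").fetchall())  # [(1, 'Alice')]
conn.close()
```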
International World Wide Web Conferences | 2014
Kostas Stefanidis; Vasilis Efthymiou; Melanie Herschel; Vassilis Christophides
This tutorial provides an overview of the key research results in the area of entity resolution that are relevant to addressing the new challenges posed by the Web of data, in which real-world entities are described by interlinked data rather than documents. Since such descriptions are usually partial, overlapping, and sometimes evolving, entity resolution emerges as a central problem, both to increase dataset linking and to search the Web of data for entities and their relations.
IEEE Transactions on Knowledge and Data Engineering | 2012
Melanie Herschel; Felix Naumann; Sascha Szott; Maik Taubert
Duplicate detection determines different representations of real-world objects in a database. Recent research has considered the use of relationships among object representations to improve duplicate detection. In the general case where relationships form a graph, research has mainly focused on duplicate detection quality/effectiveness. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. We scale up duplicate detection in graph data (DDG) to large amounts of data and pairwise comparisons, using the support of a relational database management system. To this end, we first present a framework that generalizes the DDG process. We then present algorithms to scale DDG in space (amount of data processed with bounded main memory) and in time. Finally, we extend our framework to allow batched and parallel DDG, thus further improving efficiency. Experiments on data up to two orders of magnitude larger than data considered so far in DDG show that our methods achieve the goal of scaling DDG to large volumes of data.
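The core idea of DDG, namely that a newly detected duplicate can raise the similarity of neighboring pairs and hence trigger re-comparisons, can be sketched conceptually as follows. This is not the paper's RDBMS-backed, batched, or parallel algorithm; the queue-based loop, the similarity interface, and the threshold are simplifying assumptions, and the caller supplies a relationship-aware similarity function.

```python
# Conceptual sketch of iterative duplicate detection in graph data (DDG);
# not the paper's scalable algorithms.
from collections import deque

def ddg(nodes, neighbors, similarity, threshold=0.8):
    """nodes: list of ids; neighbors: id -> iterable of ids;
    similarity(a, b, duplicates) -> float, may depend on known duplicates."""
    queue = deque(tuple(sorted((a, b)))
                  for i, a in enumerate(nodes) for b in nodes[i + 1:])
    duplicates = set()
    while queue:
        pair = queue.popleft()
        if pair in duplicates:
            continue
        a, b = pair
        if similarity(a, b, duplicates) >= threshold:
            duplicates.add(pair)
            # Neighbor pairs may now be more similar: compare them (again).
            for na in neighbors.get(a, ()):
                for nb in neighbors.get(b, ()):
                    cand = tuple(sorted((na, nb)))
                    if na != nb and cand not in duplicates:
                        queue.append(cand)
    return duplicates
```

Because every detected duplicate can re-enqueue neighbor pairs, the number of comparisons and the memory needed for intermediate state grow quickly, which is precisely the scalability problem the paper addresses with database support.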
Extending Database Technology | 2010
Jens Bleiholder; Sascha Szott; Melanie Herschel; Frank Kaufer; Felix Naumann
The goal of data fusion is to combine several representations of one real-world object into a single, consistent representation, e.g., in data integration. A very popular operator to perform data fusion is the minimum union operator. It is defined as the outer union followed by the removal of subsumed tuples. Minimum union is used in other applications as well, for instance in database query optimization to rewrite outer join queries, in the Semantic Web community to implement SPARQL's OPTIONAL operator, etc. Despite its wide applicability, there are only a few efficient implementations, and until now, minimum union has not been a relational database primitive. This paper fills this gap: we present implementations of subsumption that serve as a building block for minimum union. Furthermore, we consider this operator as a database primitive and show how to optimize query plans in the presence of subsumption and minimum union through rule-based plan transformations. Experiments on both artificial and real-world data show that our algorithms outperform existing algorithms used for subsumption in terms of runtime and scale to large volumes of data. In the context of data integration, we observe that performing data fusion calls for more than subsumption and minimum union. Therefore, another contribution of this paper is the definition of the complementation and complement union operators. Intuitively, these allow merging tuples that have complementing values, thus eliminating unnecessary null values.
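The semantics of subsumption and minimum union can be illustrated with a small Python sketch over tuples represented as dicts, with None as the null value. This conveys the operators' definitions only; it is not the paper's efficient relational implementation, the complementation operators are not sketched, and the sample relations are invented.

```python
# Semantics-only sketch of subsumption and minimum union.

def subsumes(t, u):
    """t subsumes u if t coincides with u on all of u's non-null values
    and t contains strictly fewer nulls (assumes identical schemas)."""
    return all(u[k] is None or u[k] == t[k] for k in u) and \
           sum(v is None for v in t.values()) < sum(v is None for v in u.values())

def minimum_union(r, s):
    """Outer union (pad to the union of schemas with nulls), then drop subsumed tuples."""
    columns = {k for t in r + s for k in t}
    padded = [{c: t.get(c) for c in columns} for t in r + s]
    return [t for t in padded if not any(subsumes(u, t) for u in padded)]

r = [{"name": "Alice", "phone": "123"}]                      # no city column
s = [{"name": "Alice", "phone": "123", "city": "Stuttgart"},
     {"name": "Bob",   "phone": None,  "city": "Berlin"}]

# The padded Alice tuple from r is subsumed by the Alice tuple in s and removed.
print(minimum_union(r, s))
```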
Conference on Information and Knowledge Management | 2013
Melanie Herschel
In analyzing and debugging data transformations, or more specifically relational queries, a subproblem is to understand why some data are not part of the query result. This problem has recently been addressed from different perspectives for various fragments of relational queries. The different perspectives yield different, yet complementary, explanations of such missing answers. This paper first aims at unifying the different approaches by defining a new type of explanation, called hybrid explanation, that encompasses the variety of previously defined types of explanations. This solution goes beyond simply forming the union of explanations produced by different algorithms and is shown to explain a larger set of missing answers. Second, we present Conseil, an algorithm to generate hybrid explanations. Conseil is also the first algorithm to handle non-monotonic queries. Experiments on efficiency and explanation quality show that Conseil is comparable to, and even outperforms, previous algorithms.
Conference on Information and Knowledge Management | 2008
Melanie Herschel; Felix Naumann
Duplicate detection determines different representations of real-world objects in a database. Recent research has considered the use of relationships among object representations to improve duplicate detection. In the general case where relationships form a graph, research has mainly focused on duplicate detection quality/effectiveness. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. We scale up duplicate detection in graph data (DDG) to large amounts of data using the support of a relational database system. We first generalize the process of DDG and then present how to scale DDG in space (amount of data processed with limited main memory) and in time. Finally, we explore how complex similarity computation can be performed efficiently. Experiments on data an order of magnitude larger than data considered so far in DDG clearly show that our methods scale to large amounts of data.
Soft Computing | 2010
Pável Calado; Melanie Herschel; Luís Leitão
Fuzzy duplicate detection aims at identifying multiple representations of real-world objects in a data source, and is a task of critical relevance in data cleaning, data mining, and data integration. It has a long history for relational data stored in a single table or in multiple tables with the same schema. However, algorithms for fuzzy duplicate detection in more complex structures, such as hierarchies of a data warehouse, XML data, or graph data, have only recently emerged. These algorithms use similarity measures that consider the duplicate status of their direct neighbors to improve duplicate detection effectiveness. In this chapter, we study different approaches that have been proposed for XML fuzzy duplicate detection. Our study includes a description and analysis of the different approaches, as well as a comparative experimental evaluation performed on both artificial and real-world data. The two main dimensions used for comparison are the methods' effectiveness and efficiency. Our comparison shows that the DogmatiX system [44] is the most effective overall, as it yields the highest recall and precision values for various kinds of differences between duplicates. Another system, called XMLDup [27], has similar performance, being especially effective at low recall values. Finally, the SXNM system [36] is the most efficient, as it avoids executing too many pairwise comparisons, but its effectiveness is greatly affected by errors in the data.
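The common idea behind these approaches, letting the similarity of child elements influence the similarity of their parents, can be sketched as follows. This is an illustration only, not DogmatiX, XMLDup, or SXNM; the weighting scheme, thresholds, and sample data are invented.

```python
# Illustrative hierarchical similarity: combine an element's own text
# similarity with the fraction of its children that find a similar match.

def text_sim(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def tree_sim(x, y, alpha=0.5, threshold=0.6):
    """x, y: (text, [children]) tuples; weight own text vs. matched children."""
    own = text_sim(x[0], y[0])
    cx, cy = x[1], y[1]
    if not cx and not cy:
        return own
    matched = sum(1 for c in cx
                  if any(tree_sim(c, d, alpha, threshold) >= threshold for d in cy))
    child = matched / max(len(cx), len(cy), 1)
    return alpha * own + (1 - alpha) * child

movie_a = ("Star Wars 1977", [("George Lucas", []), ("Mark Hamill", [])])
movie_b = ("Star Wars",      [("G. Lucas", []),     ("Mark Hamill", [])])
print(round(tree_sim(movie_a, movie_b), 2))  # 0.58
```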
Very Large Data Bases | 2017
Melanie Herschel; Ralf Diestelkämper; Houssem Ben Lahmar
Provenance refers to any information describing the production process of an end product, which can be anything from a piece of digital data to a physical object. While this survey focuses on the former type of end product, this definition still leaves room for many different interpretations of and approaches to provenance. These are typically motivated by different application domains for provenance (e.g., accountability, reproducibility, process debugging) and varying technical requirements such as runtime, scalability, or privacy. As a result, we observe a wide variety of provenance types and provenance-generating methods. This survey provides an overview of the research field of provenance, focusing on what provenance is used for (what for?), what types of provenance have been defined and captured for the different applications (what form?), and which resources and system requirements impact the choice of deploying a particular provenance solution (what from?). For each of these three key questions, we provide a classification and review the state of the art for each class. We conclude with a summary and possible future research challenges.