
Publications


Featured research published by Fabian Panse.


International Conference on Data Engineering | 2010

Duplicate detection in probabilistic data

Fabian Panse; Maurice van Keulen; Ander de Keijzer; Norbert Ritter

Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML); the integration of uncertain source data has not been addressed so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process and present techniques for identifying multiple probabilistic representations of the same real-world entities.
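
The paper's concrete matching techniques are not given in this abstract, but the underlying idea of comparing probabilistic representations can be sketched as an expected similarity over the tuples' possible instances. The following Python sketch is a hypothetical illustration; the Jaccard measure and the instance distributions are assumptions, not taken from the paper.

```python
from itertools import product

def jaccard(a, b):
    """Token-set Jaccard similarity of two strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def expected_similarity(t1, t2, sim=jaccard):
    """Expected similarity of two probabilistic tuples, each given as a list
    of (instance, probability) pairs over its possible instances."""
    return sum(pa * pb * sim(ia, ib)
               for (ia, pa), (ib, pb) in product(t1, t2))

# Two probabilistic representations that may describe the same person
# (instance distributions are invented for illustration).
t1 = [("john smith", 0.7), ("jon smith", 0.3)]
t2 = [("john smith", 0.6), ("john smyth", 0.4)]

score = expected_similarity(t1, t2)
print(f"expected similarity: {score:.3f}")  # declare a duplicate if above a threshold
```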


Journal of Data and Information Quality | 2013

Indeterministic Handling of Uncertain Decisions in Deduplication

Fabian Panse; Maurice van Keulen; Norbert Ritter

In current research and practice, deduplication is usually considered a deterministic approach in which database tuples are either declared to be duplicates or not. In ambiguous situations, however, it is often not completely clear which tuples represent the same real-world entity. Deterministic approaches may therefore ignore many realistic possibilities, which in turn can lead to false decisions. In this article, we present an indeterministic approach to deduplication using a probabilistic target model, including techniques for a proper probabilistic interpretation of similarity matching results. Thus, instead of deciding for one of the most likely situations, all realistic situations are modeled in the resultant data. This approach minimizes the negative impact of false decisions. Moreover, the deduplication process becomes almost fully automatic and human effort can be largely reduced. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways. We also describe a full-indeterministic method for theoretical and presentational reasons.
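
A minimal sketch of the indeterministic idea, assuming a match probability per candidate pair: confident cases are decided deterministically, while ambiguous pairs are kept as two weighted possible worlds in the result. The thresholds and the representation are illustrative, not the article's actual model; thresholding away confident cases also hints at the semi-indeterministic methods mentioned above.

```python
def decide(p_match, t_low=0.2, t_high=0.8):
    """Indeterministic handling of one candidate pair: confident cases are
    decided deterministically; ambiguous cases keep both possible worlds,
    weighted by the match probability. Thresholds are illustrative only."""
    if p_match >= t_high:
        return [("merge", 1.0)]           # confident duplicate
    if p_match <= t_low:
        return [("keep-separate", 1.0)]   # confident non-duplicate
    # ambiguous: model both realistic situations in the resultant data
    return [("merge", p_match), ("keep-separate", 1.0 - p_match)]

print(decide(0.9))   # [('merge', 1.0)]
print(decide(0.75))  # [('merge', 0.75), ('keep-separate', 0.25)]
```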


IEEE Transactions on Big Data | 2017

Large-Scale Data Pollution with Apache Spark

Kai Hildebrandt; Fabian Panse; Niklas Wilcke; Norbert Ritter

Because of the increasing volume of autonomously collected data objects, duplicate detection is an important challenge in today's data management. To evaluate the efficiency of duplicate detection algorithms with respect to big data, large test data sets are required. Existing test data generation tools, however, are either not able to produce large test data sets or are domain-dependent, which limits their usefulness to a few cases. In this paper, we describe a new framework that can be used to pollute a clean, homogeneous, and large data set from an arbitrary domain with duplicates, errors, and inhomogeneities. As a proof of concept, we implemented a prototype built upon the cluster computing framework Apache Spark and evaluated its performance in several experiments.
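
As a toy illustration of the polluting idea (not the framework itself), the following PySpark sketch emits every clean record and, with some probability, an erroneous duplicate of it. The adjacent-character-swap error model and all probabilities are invented placeholders.

```python
import random
from pyspark.sql import SparkSession

def corrupt(record, seed):
    """Toy error model: swap two adjacent characters."""
    rnd = random.Random(seed)
    chars = list(record)
    if len(chars) > 1:
        i = rnd.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def pollute(indexed_record, dup_prob=0.3):
    """Emit the clean record and, with probability dup_prob, an erroneous duplicate."""
    record, idx = indexed_record
    rnd = random.Random(idx)  # seeded per record for reproducibility
    out = [record]
    if rnd.random() < dup_prob:
        out.append(corrupt(record, idx))
    return out

spark = SparkSession.builder.appName("pollution-sketch").getOrCreate()
clean = spark.sparkContext.parallelize(["alice miller", "bob jones", "carol wu"])
polluted = clean.zipWithIndex().flatMap(pollute)  # zipWithIndex yields (record, index)
print(polluted.collect())
```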


International Conference on Conceptual Modeling | 2009

Completeness in Databases with Maybe-Tuples

Fabian Panse; Norbert Ritter

Some data models use so-called maybe tuples to express the uncertainty of whether or not a tuple belongs to a relation. In order to assess this relation's quality, the corresponding vagueness needs to be taken into account. Current metrics of quality dimensions are not designed to deal with this uncertainty and therefore need to be adapted. One major quality dimension is data completeness. In general, there are two basic ways to distinguish maybe tuples from definite tuples. First, an attribute serving as a maybe indicator (values YES or NO) can be used. Second, tuple probabilities can be specified. In this paper, the notion of data completeness is redefined with respect to both concepts. Thus, a more precise estimation of data quality in databases with maybe tuples (e.g., probabilistic databases) is enabled.
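
The abstract names the two concepts without giving the redefined metrics, so the following Python sketch shows one plausible way to estimate completeness under each: counting an indicated maybe tuple with an assumed weight of 0.5, versus summing explicit tuple probabilities. The weighting, the example relations, and the reference size are illustrative assumptions, not the paper's definitions.

```python
def expected_size_indicator(tuples):
    """Expected relation size under the maybe-indicator concept: with no
    probabilities available, each maybe tuple is counted with weight 0.5
    (an assumed weighting for illustration, not the paper's metric)."""
    return sum(0.5 if is_maybe else 1.0 for _, is_maybe in tuples)

def expected_size_probabilistic(tuples):
    """Expected relation size under the tuple-probability concept."""
    return sum(p for _, p in tuples)

rel_flags = [("t1", False), ("t2", True), ("t3", True)]   # (tuple, is_maybe)
rel_probs = [("t1", 1.0), ("t2", 0.8), ("t3", 0.4)]       # (tuple, probability)

reference_size = 4  # size of the ideal, complete relation (assumed known)
print(expected_size_indicator(rel_flags) / reference_size)      # 0.5
print(expected_size_probabilistic(rel_probs) / reference_size)  # ≈ 0.55
```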


Scalable Uncertainty Management | 2011

Incorporating domain knowledge and user expertise in probabilistic tuple merging

Fabian Panse; Norbert Ritter

Today, probabilistic databases (PDBs) are becoming helpful in several application areas. In the context of cleaning a single PDB or integrating multiple PDBs, duplicate tuples need to be merged. A basic approach for merging probabilistic tuples is simply to build the union of their sets of possible instances. In a merging process, however, additional domain knowledge or user expertise is often available. For that reason, in this paper we extend the basic approach with aggregation functions, knowledge rules, and instance weights for incorporating external knowledge into the merging process.
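
A minimal Python sketch of the basic approach (union of the possible instances, renormalized) and of one of the named extensions, instance weights. The aggregation functions and knowledge rules from the paper are not shown, and all values are invented.

```python
def merge_basic(t1, t2):
    """Basic merge: union of the two tuples' possible instances,
    renormalized into a single probability distribution."""
    combined = {}
    for inst, p in t1 + t2:
        combined[inst] = combined.get(inst, 0.0) + p
    total = sum(combined.values())
    return {inst: p / total for inst, p in combined.items()}

def merge_weighted(t1, t2, w1, w2):
    """Instance weights (e.g., reflecting source reliability or user trust)
    bias the union toward the more trusted tuple."""
    weighted = [(i, p * w1) for i, p in t1] + [(i, p * w2) for i, p in t2]
    return merge_basic(weighted, [])

# Two duplicate probabilistic tuples for a person's city (values invented).
t1 = [("Hamburg", 0.8), ("Hamburg-Altona", 0.2)]
t2 = [("Hamburg", 0.5), ("Bremen", 0.5)]

print(merge_basic(t1, t2))                     # equal trust in both tuples
print(merge_weighted(t1, t2, w1=0.9, w2=0.1))  # trust the first source more
```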


Scalable Uncertainty Management | 2012

Evaluating indeterministic duplicate detection results

Fabian Panse; Norbert Ritter

Duplicate detection is an important process for cleaning or integrating data. Since real-life data is often polluted, detecting duplicates usually comes along with uncertainty. To handle duplicate uncertainty in an appropriate way, indeterministic duplicate detection approaches, i.e., approaches in which ambiguous duplicate decisions are probabilistically modeled in the resultant data, have been developed. To rate the goodness of a duplicate detection approach, the quality of its detection results needs to be evaluated. In this paper, we propose several semantics for applying traditional quality evaluation measures to indeterministic duplicate detection results and exemplarily present an efficient evaluation for one of these semantics. Finally, we present some experimental results.
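
One conceivable semantics of this kind evaluates each possible world of an indeterministic result separately and weights precision and recall by world probability. The Python sketch below illustrates that idea only; it is not the paper's concrete semantics.

```python
def precision_recall(found, truth):
    """Precision/recall of one deterministic decision (a set of matched
    pairs) against a gold standard; an empty result counts as precision 1."""
    tp = len(found & truth)
    precision = tp / len(found) if found else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall

def expected_quality(worlds, truth):
    """One possible semantics: evaluate each possible world separately and
    weight precision and recall by the world's probability."""
    exp_p = exp_r = 0.0
    for prob, found in worlds:
        p, r = precision_recall(found, truth)
        exp_p += prob * p
        exp_r += prob * r
    return exp_p, exp_r

truth = {("a", "b")}                    # gold standard: a and b are duplicates
worlds = [(0.6, {("a", "b")}),          # world 1: pair declared a duplicate
          (0.4, set())]                 # world 2: no duplicates declared
print(expected_quality(worlds, truth))  # (1.0, 0.6)
```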


Ingénierie Des Systèmes D'information | 2010

Relational data completeness in the presence of maybe-tuples

Fabian Panse; Norbert Ritter

Some data models use so-called maybe-tuples to express the uncertainty of whether or not a tuple belongs to a relation. In order to score this relation's quality in a meaningful way, the corresponding vagueness needs to be taken into account. Current metrics of quality dimensions are not designed to deal with this uncertainty and therefore need to be adapted. One major quality dimension is data completeness. In general, there are two basic ways to distinguish maybe-tuples from definite-tuples. First, an attribute serving as a maybe indicator (values YES or NO) can be used. Second, confidence values can be specified. In this paper, the notion of data completeness is redefined with respect to both concepts. Thus, a more precise estimation of quality in databases with maybe-tuples (e.g., probabilistic databases or fuzzy databases) is enabled.


MUD | 2010

Tuple Merging in Probabilistic Databases

Fabian Panse; Norbert Ritter


CTIT Technical Report Series | 2010

Indeterministic Handling of Uncertain Decisions in Duplicate Detection

Fabian Panse; Maurice van Keulen; Norbert Ritter


Archive | 2012

Key-based blocking of duplicates in entity-independent probabilistic data (Research-in-Progress)

Fabian Panse; Wolfram Wingerath; Steffen Friedrich; Norbert Ritter
