Publications


Featured research published by Ziawasch Abedjan.


Very Large Data Bases | 2015

Profiling relational data: a survey

Ziawasch Abedjan; Lukasz Golab; Felix Naumann

Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
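The simpler single-column statistics the survey mentions (null counts, distinct counts, frequent value patterns) can be sketched in a few lines. The column values below are hypothetical, purely for illustration:

```python
from collections import Counter

# Hypothetical sample column values (None marks a null), used to
# illustrate the basic single-column profiling metadata described above.
zip_codes = ["10115", "10117", None, "10115", "101A5"]

null_count = sum(1 for v in zip_codes if v is None)
distinct_count = len({v for v in zip_codes if v is not None})

# Most frequent syntactic pattern: digits -> 'd', letters -> 'a'.
def pattern(value):
    return "".join("d" if c.isdigit() else "a" if c.isalpha() else c
                   for c in value)

top_pattern, _ = Counter(pattern(v) for v in zip_codes
                         if v is not None).most_common(1)[0]

print(null_count, distinct_count, top_pattern)  # 1 3 ddddd
```

The multi-column metadata the survey classifies (unique column combinations, functional and inclusion dependencies) generalizes these checks across column sets, which is where the computational difficulty lies.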


Very Large Data Bases | 2013

Scalable discovery of unique column combinations

Arvid Heise; Jorge Arnulfo Quiané-Ruiz; Ziawasch Abedjan; Anja Jentzsch; Felix Naumann

The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires verifying an exponential number of column combinations for uniqueness over all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a combination of depth-first and random walk strategies. This strategy allows Ducc to typically depend on the solution set size and hence to prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just a few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several million rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc is more than two orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows that Ducc scales up and out efficiently.
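The basic primitive underlying this search is checking whether a given column combination is unique, i.e. whether its projected value tuples contain no duplicates. A minimal sketch with hypothetical table and column names (not from the paper):

```python
# Check whether a column combination is unique: no two rows may agree
# on all of the given columns. Rows are modeled as dicts for simplicity.
def is_unique(rows, columns):
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False  # duplicate projection -> non-unique combination
        seen.add(key)
    return True

rows = [
    {"first": "Ann", "last": "Lee", "city": "Berlin"},
    {"first": "Ann", "last": "Kim", "city": "Berlin"},
    {"first": "Bob", "last": "Lee", "city": "Potsdam"},
]

print(is_unique(rows, ["first"]))          # False: "Ann" repeats
print(is_unique(rows, ["first", "last"]))  # True
```

The pruning the paper exploits follows from monotonicity: every superset of a unique combination is unique, and every subset of a non-unique combination is non-unique, so each verified combination rules out large parts of the lattice.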


International Conference on Data Engineering | 2010

Profiling linked open data with ProLOD

Christoph Böhm; Felix Naumann; Ziawasch Abedjan; Dandy Fenz; Toni Grütze; Daniel Hefenbrock; Matthias Pohl; David Sonnabend

Linked open data (LOD), as provided by a quickly growing number of sources, constitutes a wealth of easily accessible information. However, this data is not easy to understand. It is usually provided as a set of (RDF) triples, often in the form of enormous files covering many domains. What is more, the data usually has a loose structure when it is derived from end-user generated sources, such as Wikipedia. Finally, the quality of the actual data is also worrisome, because it may be incomplete, poorly formatted, inconsistent, etc. Traditional data profiling methods do not suffice to understand and profile such linked open data. With ProLOD, we propose a suite of methods ranging from the domain level (clustering, labeling), via the schema level (matching, disambiguation), to the data level (data type detection, pattern detection, value distribution). Packaged into an interactive, web-based tool, they allow iterative exploration and discovery of new LOD sources. Thus, users can quickly gauge the relevance of a source for the problem at hand (e.g., some integration task) and then focus on and explore the relevant subset.


Very Large Data Bases | 2016

Detecting data errors: where are we and what needs to be done?

Ziawasch Abedjan; Xu Chu; Dong Deng; Raul Fernandez; Ihab F. Ilyas; Mourad Ouzzani; Paolo Papotti; Michael Stonebraker; Nan Tang

Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and to possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same data set, we often need to run more than one kind of tool. In this paper, we investigate two pragmatic questions: (1) are these tools robust enough to capture most errors in real-world data sets? and (2) what is the best strategy to holistically run multiple tools to optimize the detection effort? To answer these two questions, we obtained multiple data cleaning tools that utilize a variety of error detection techniques. We also collected five real-world data sets, for which we could obtain both the raw data and the ground truth on existing errors. In this paper, we report our experimental findings on the errors detected by the tools we tested. First, we show that the coverage of each tool is well below 100%. Second, we show that the order in which multiple tools are run makes a big difference. Hence, we propose a holistic multi-tool strategy that orders the invocations of the available tools to maximize their benefit, while minimizing human effort in verifying results. Third, since this holistic approach still does not lead to acceptable error coverage, we discuss two simple strategies that have the potential to improve the situation, namely domain specific tools and data enrichment. We close this paper by reasoning about the errors that are not detectable by any of the tools we tested.
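The evaluation setup described here can be sketched abstractly: each tool marks a set of cells as erroneous, and coverage is measured against a ground-truth set of true errors. The cell coordinates and tool names below are illustrative, not the paper's data:

```python
# Each error is a (row, column) cell. Ground truth and tool outputs are
# toy data; one true error is deliberately missed by every tool, echoing
# the paper's finding that no tool combination reaches full coverage.
ground_truth = {(0, "zip"), (2, "age"), (3, "salary"), (5, "name"), (7, "zip")}

tool_detections = {
    "outlier_detector":   {(2, "age"), (9, "age")},   # one false positive
    "rule_checker":       {(0, "zip"), (7, "zip")},
    "duplicate_detector": {(5, "name")},
}

detected = set()
for tool, cells in tool_detections.items():
    new_hits = (cells & ground_truth) - detected
    detected |= cells & ground_truth
    print(f"{tool}: +{len(new_hits)} true errors")

coverage = len(detected) / len(ground_truth)
print(f"combined coverage: {coverage:.0%}")  # 80%: (3, 'salary') evades all tools
```

Ordering the tools by how many *new* true errors each one contributes, as in the loop above, is the intuition behind the holistic multi-tool strategy: later tools are only worth their verification cost if they add errors the earlier ones missed.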


International Conference on Data Engineering | 2014

Profiling and mining RDF data with ProLOD

Ziawasch Abedjan; Toni Gruetze; Anja Jentzsch; Felix Naumann

Before reaping the benefits of open data to add value to an organization's internal data, such new, external datasets must be analyzed and understood already at the basic level of data types, constraints, value patterns, etc. Such data profiling, already difficult for large relational data sources, is even more challenging for RDF datasets, the preferred data model for linked open data. We present ProLod++, a novel tool for various profiling and mining tasks to understand and ultimately improve open RDF data. ProLod++ comprises various traditional data profiling tasks, adapted to the RDF data model. In addition, it features many profiling results specific to open data, such as schema discovery for user-generated attributes, association rule discovery to uncover synonymous predicates, and uniqueness discovery along ontology hierarchies. ProLod++ is highly efficient, allowing interactive profiling for users interested in exploring the properties and structure of yet unknown datasets.


Datenbank-Spektrum | 2013

Improving RDF Data Through Association Rule Mining

Ziawasch Abedjan; Felix Naumann

Linked Open Data comprises many, often very large, public data sets, which are mostly presented in the RDF triple structure of subject, predicate, and object. However, the heterogeneity of available open data requires significant integration steps before it can be used in applications. A promising and novel technique to explore such data is association rule mining. We introduce “mining configurations”, which allow us to mine RDF data sets in various ways. Different configurations enable us to identify schema and value dependencies that in combination result in interesting use cases. We present rule-based approaches for predicate suggestion, data enrichment, ontology improvement, and query relaxation. On the one hand, we prevent inconsistencies in the data through predicate suggestion, enrichment with missing facts, and alignment of the corresponding ontology. On the other hand, we support users in handling inconsistencies during query formulation through predicate expansion techniques. Based on these approaches, we show that association rule mining benefits the integration and usability of RDF data.
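One such mining configuration treats each subject as a transaction and its predicates as the items, so standard support and confidence yield predicate-to-predicate rules. A minimal sketch over toy triples (the movie predicates are illustrative):

```python
from collections import defaultdict
from itertools import permutations

# Toy RDF triples (subject, predicate, object). Configuration: subjects
# are transactions, predicates are items.
triples = [
    ("m1", "director", "x"), ("m1", "starring", "y"),
    ("m2", "director", "x"), ("m2", "starring", "z"),
    ("m3", "director", "x"),
]

transactions = defaultdict(set)
for s, p, o in triples:
    transactions[s].add(p)

def confidence(antecedent, consequent):
    # Fraction of transactions containing the antecedent that also
    # contain the consequent.
    having_a = [t for t in transactions.values() if antecedent in t]
    if not having_a:
        return 0.0
    return sum(consequent in t for t in having_a) / len(having_a)

for a, b in permutations({"director", "starring"}, 2):
    print(f"{a} -> {b}: {confidence(a, b):.2f}")
```

A high-confidence rule such as `starring -> director` can then drive predicate suggestion (propose `director` while a user edits a subject that already has `starring`) or flag likely missing facts.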


Conference on Information and Knowledge Management | 2012

Discovering conditional inclusion dependencies

Jana Bauckmann; Ziawasch Abedjan; Ulf Leser; Heiko Müller; Felix Naumann

Data dependencies are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. Conditional dependencies have been introduced to analyze and improve data quality. A conditional dependency is a dependency with a limited scope defined by conditions over one or more attributes. Only the matching part of the instance must adhere to the dependency. In this paper we focus on conditional inclusion dependencies (CINDs). We generalize the definition of CINDs, distinguishing covering and completeness conditions. We present a new use case for such CINDs showing their value for solving complex data quality tasks. Further, we propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. Our algorithms choose not only the condition values but also the condition attributes automatically. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case.
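The core idea can be illustrated with a tiny example: an inclusion dependency that fails on a full table may still hold on the subset selected by a condition. The tables, attributes, and condition below are hypothetical, not from the paper:

```python
# An unconditional inclusion dependency orders.customer_id ⊆ customers.id
# fails here, but it holds on the subset selected by status = 'paid',
# making (status = 'paid') a candidate condition for a CIND.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"customer_id": 1, "status": "paid"},
    {"customer_id": 2, "status": "paid"},
    {"customer_id": 9, "status": "cart"},  # dangling reference
]

referenced = {c["id"] for c in customers}

def ind_holds(rows):
    # The inclusion dependency holds iff every referencing value appears
    # in the referenced column.
    return all(r["customer_id"] in referenced for r in rows)

print(ind_holds(orders))                                        # False
print(ind_holds([r for r in orders if r["status"] == "paid"]))  # True
```

The algorithms in the paper search over such condition attributes and values automatically, rather than requiring the analyst to guess `status = 'paid'`.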


Conference on Information and Knowledge Management | 2014

DFD: Efficient Functional Dependency Discovery

Ziawasch Abedjan; Patrick Schulze; Felix Naumann

The discovery of unknown functional dependencies in a dataset is of great importance for database redesign, anomaly detection, and data cleansing applications. However, as the nature of the problem is exponential in the number of attributes, none of the existing approaches can be applied to large datasets. We present DFD, a new algorithm for discovering all functional dependencies in a dataset that follows a depth-first traversal strategy of the attribute lattice, combining aggressive pruning with efficient result verification. Our approach scales far beyond existing algorithms, to up to 7.5 million tuples, and is up to three orders of magnitude faster than existing approaches on smaller datasets.
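The verification step at each lattice node can be sketched directly from the definition: a functional dependency X -> Y holds iff no two tuples agree on X but differ on Y. Column names below are toy data:

```python
# Verify a candidate functional dependency lhs -> rhs by hashing each
# tuple's LHS projection and checking for conflicting RHS values.
def fd_holds(rows, lhs, rhs):
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = row[rhs]
        if key in seen and seen[key] != value:
            return False  # same LHS values, different RHS -> violation
        seen[key] = value
    return True

rows = [
    {"zip": "10115", "city": "Berlin",  "street": "A"},
    {"zip": "10115", "city": "Berlin",  "street": "B"},
    {"zip": "14467", "city": "Potsdam", "street": "A"},
]

print(fd_holds(rows, ["zip"], "city"))     # True
print(fd_holds(rows, ["street"], "city"))  # False
```

As with unique column combinations, monotonicity enables pruning: if X -> Y holds, so does any superset of X -> Y, so a depth-first traversal can skip large parts of the lattice after each verified candidate.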


Conference on Information and Knowledge Management | 2012

Reconciling ontologies and the web of data

Ziawasch Abedjan; Johannes Lorey; Felix Naumann

To integrate Linked Open Data, which originates from various and heterogeneous sources, the use of well-defined ontologies is essential. However, oftentimes the utilization of these ontologies by data publishers differs from the intended application envisioned by ontology engineers. This may lead to unspecified properties being used ad-hoc as predicates in RDF triples or it may result in infrequent usage of specified properties. These mismatches impede the goals and propagation of the Web of Data as data consumers face difficulties when trying to discover and integrate domain-specific information. In this work, we identify and classify common misusage patterns by employing frequency analysis and rule mining. Based on this analysis, we introduce an algorithm to propose suggestions for a data-driven ontology re-engineering workflow, which we evaluate on two large-scale RDF datasets.
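The frequency-analysis side of this idea is easy to illustrate: compare the properties an ontology declares with the predicates the data actually uses. The ontology terms and counts below are hypothetical:

```python
from collections import Counter

# Compare declared ontology properties with predicate usage counted from
# the data, flagging the two mismatch patterns described above: ad-hoc
# (used but undeclared) and rarely used (declared but almost unused).
declared = {"name", "director", "releaseDate"}
used = Counter({"name": 500, "director": 480,
                "directedBy": 15, "releaseDate": 2})

ad_hoc = [p for p in used if p not in declared]
rare = [p for p in declared if used[p] < 10]

print(ad_hoc)  # ['directedBy'] - used as a predicate but never declared
print(rare)    # ['releaseDate'] - declared but almost never used
```

A data-driven re-engineering workflow would then suggest, for example, declaring `directedBy` (or mapping it to `director`) and reconsidering whether `releaseDate` belongs in the ontology.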


Extended Semantic Web Conference | 2013

Synonym Analysis for Predicate Expansion

Ziawasch Abedjan; Felix Naumann

Despite unified data models, such as the Resource Description Framework (RDF) on the structural level and the corresponding query language SPARQL, the integration and usage of Linked Open Data faces major heterogeneity challenges on the semantic level. Incorrect use of ontology concepts and class properties impedes the goals of machine readability and knowledge discovery. For example, users searching for movies with a certain artist cannot rely on a single given property artist, because some movies may be connected to that artist by the predicate starring. In addition, the information need of a data consumer may not always be clear, and her interpretation of given schemata may differ from the intentions of the ontology engineer or data publisher.

Collaboration


Top co-authors of Ziawasch Abedjan:

Felix Naumann (Hasso Plattner Institute)
Mourad Ouzzani (Qatar Computing Research Institute)
Michael Stonebraker (Massachusetts Institute of Technology)
Nan Tang (Qatar Computing Research Institute)
Samuel Madden (Massachusetts Institute of Technology)
Ahmed K. Elmagarmid (Qatar Computing Research Institute)
Paolo Papotti (Arizona State University)
Raul Fernandez (Massachusetts Institute of Technology)