Jaroslaw Szlichta
University of Ontario Institute of Technology
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Jaroslaw Szlichta.
international conference on data engineering | 2014
Maksims Volkovs; Fei Chiang; Jaroslaw Szlichta; Renée J. Miller
In declarative data cleaning, data semantics are encoded as constraints and errors arise when the data violates the constraints. Various forms of statistical and logical inference can be used to reason about and repair inconsistencies (errors) in data. Recently, unified approaches that repair both errors in data and errors in semantics (the constraints) have been proposed. However, both data-only approaches and unified approaches are by and large static in that they apply cleaning to a single snapshot of the data and constraints. We introduce a continuous data cleaning framework that can be applied to dynamic data and constraint environments. Our approach permits both the data and its semantics to evolve and suggests repairs based on the accumulated evidence to date. Importantly, our approach uses not only the data and constraints as evidence, but also considers the past repairs chosen and applied by a user (user repair preferences). We introduce a repair classifier that predicts the type of repair needed to resolve an inconsistency, and that learns from past user repair preferences to recommend more accurate repairs in the future. Our evaluation shows that our techniques achieve high prediction accuracy and generate high quality repairs. Of independent interest, our work makes use of a set of data statistics that are shown to be sensitive to predicting particular repair types.
very large data bases | 2012
Jaroslaw Szlichta; Parke Godfrey; Jarek Gryz
Dependencies have played a significant role in database design for many years. They have also been shown to be useful in query optimization. In this paper, we discuss dependencies between lexicographically ordered sets of tuples. We introduce formally the concept of order dependency and present a set of axioms (inference rules) for them. We show how query rewrites based on these axioms can be used for query optimization. We present several interesting theorems that can be derived using the inference rules. We prove that functional dependencies are subsumed by order dependencies and that our set of axioms for order dependencies is sound and complete.
very large data bases | 2015
Nataliya Prokoshyna; Jaroslaw Szlichta; Fei Chiang; Renée J. Miller; Divesh Srivastava
Quantitative data cleaning relies on the use of statistical methods to identify and repair data quality problems while logical data cleaning tackles the same problems using various forms of logical reasoning over declarative dependencies. Each of these approaches has its strengths: the logical approach is able to capture subtle data quality problems using sophisticated dependencies, while the quantitative approach excels at ensuring that the repaired data has desired statistical properties. We propose a novel framework within which these two approaches can be used synergistically to combine their respective strengths. We instantiate our framework using (i) metric functional dependencies, a type of dependency that generalizes functional dependencies (FDs) to identify inconsistencies in domains where only large differences in metric data are considered to be a data quality problem, and (ii) repairs that modify the inconsistent data so as to minimize statistical distortion, measured using the Earth Movers Distance. We show that the problem of computing a statistical distortion minimal repair is NP-hard. Given this complexity, we present an efficient algorithm for finding a minimal repair that has a small statistical distortion using EMD computation over semantically related attributes. To identify semantically related attributes, we present a sound and complete axiomatization and an efficient algorithm for testing implication of metric FDs. While the complexity of inference for some other FD extensions is co-NP complete, we show that the inference problem for metric FDs remains linear, as in traditional FDs. We prove that every instance that can be generated by our repair algorithm is set-minimal (with no unnecessary changes). Our experimental evaluation demonstrates that our techniques obtain a considerably lower statistical distortion than existing repair techniques, while achieving similar levels of efficiency.
very large data bases | 2013
Jaroslaw Szlichta; Parke Godfrey; Jarek Gryz; Calisto Zuzarte
Dependencies play an important role in databases. We study order dependencies (ODs)--and unidirectional order dependencies (UODs), a proper sub-class of ODs--which describe the relationships among lexicographical orderings of sets of tuples. We consider lexicographical ordering, as by the order-by operator in SQL, because this is the notion of order used in SQL and within query optimization. Our main goal is to investigate the inference problem for ODs, both in theory and in practice. We show the usefulness of ODs in query optimization. We establish the following theoretical results: (i) a hierarchy of order dependency classes; (ii) a proof of co-NP-completeness of the inference problem for the subclass of UODs (and ODs); (iii) a proof of co-NP-completeness of the inference problem of functional dependencies (FDs) from ODs in general, but demonstrate linear time complexity for the inference of FDs from UODs; (iv) a sound and complete elimination procedure for inference over ODs; and (v) a sound and complete polynomial inference algorithm for sets of UODs over restricted domains.
extending database technology | 2011
Jaroslaw Szlichta; Parke Godfrey; Jarek Gryz; Wenbin Ma; Przemyslaw Pawluk; Calisto Zuzarte
Data warehouses are repositories of electronically stored data which are designed to support reporting and analysis. The analysis of historical data often involves aggregation over time. Thus, time is critical in the design of a data warehouse. We describe novel techniques for storing date information and optimization of queries that reference the date dimension. We show how to embed intelligence into the date key and how to exploit monotonic dependencies. We present the value of these techniques for the improvement of performance when combined with partitioning and indexes. We evaluate these techniques on our prototype implemented in IBM® DB2® V9.7 over the current draft version of the TPC-DS benchmark.
international conference on data engineering | 2015
Mehdi Kargar; Aijun An; Nick Cercone; Parke Godfrey; Jaroslaw Szlichta; Xiaohui Yu
Keyword search over relational databases offers an alternative way to SQL to query and explore databases that is effective for lay users who may not be well versed in SQL or the database schema. This becomes more pertinent for databases with large and complex schemas. An answer in this context is a join tree spanning tuples containing the querys keywords. As there are potentially many answers to the query, and the user is often only interested in seeing the top-k answers, how to rank the answers based on their relevance is of paramount importance. We focus on the relevance of join as the fundamental means to rank answers. We devise means to measure relevance of relations and foreign keys in the schema over the information content of the database. This can be done offline with no need for external models. We compare the proposed measures against a gold standard we derive from a real workload over TPC-E and evaluate the effectiveness of our methods. Finally, we test the performance of our measures against existing techniques to demonstrate a marked improvement, and perform a user study to establish naturalness of the ranking of the answers.
international conference on management of data | 2014
Mehdi Kargar; Aijun An; Nick Cercone; Parke Godfrey; Jaroslaw Szlichta; Xiaohui Yu
Keyword search in relational databases was introduced in the last decade to assist users who are not familiar with a query language, the schema of the database, or the content of the data. An answer is a join tree of tuples that contains the query keywords. When searching a database with a complex schema, there are potentially many answers to the query. Therefore, ranking answers based on their relevance is crucial in this context. Prior work has addressed relevance based on the size of the answer or the IR scores of the tuples. However, this is not sufficient when searching a complex schema. We demonstrate MeanKS, a new system for meaningful keyword search over relational databases. The system first captures the users interest by determining the roles of the keywords. Then, it uses schema-based ranking to rank join trees that cover the keyword roles. This uses the relevance of relations and foreign-key relationships in the schema over the information content of the database. In the demonstration, attendees can execute queries against the TPC-E warehouse and compare the proposed measures against a gold standard derived from a real workload over TPC-E to test the effectiveness of our methods.
international conference on data engineering | 2016
Guilherme Damasio; Piotr Mierzejewski; Jaroslaw Szlichta; Calisto Zuzarte
Query performance problem determination is usually performed by analyzing query execution plans (QEPs). Analyzing complex QEPs is excessively time consuming and existing automatic problem determination tools do not provide ability to perform analysis with flexible user-defined problem patterns. We present the novel OptImatch system that allows a relatively naive user to search for patterns in QEPs and get recommendations from an expert and user customizable knowledge base. Our system transforms a QEP into an RDF graph. We provide a web graphical interface for the user to describe a pattern that is transformed with handlers into a SPARQL query. The SPARQL query is matched against the abstracted RDF graph and any matched parts of the graph are relayed back to the user. With the knowledge base the system automatically matches stored patterns to the QEPs by adapting dynamic context through developed tagging language and ranks recommendations using statistical correlation analysis.
very large data bases | 2018
Jaroslaw Szlichta; Parke Godfrey; Lukasz Golab; Mehdi Kargar; Divesh Srivastava
Integrity constraints (ICs) are useful for expressing and enforcing application semantics. Formulating ICs manually, however, requires domain expertise, is prone to human error, and can be exceedingly time-consuming. Thus, methods for automatic discovery have been developed for some classes of ICs, such as functional dependencies (FDs), and recently, order dependencies (ODs). ODs properly subsume FDs and can express business rules involving order; e.g., an employee who pays higher taxes has a higher salary than another employee. Bidirectional ODs further allow different ordering directions, ascending and descending, as in SQL’s order-by; e.g., a student with an alphabetically lower letter grade has a higher percentage grade than another student. We address the limitations of prior work on automatic OD discovery, which has factorial complexity, is incomplete, and is not concise. We present an efficient bidirectional OD discovery algorithm enabled by a novel polynomial mapping to a canonical form, and a sound and complete set of axioms for canonical bidirectional ODs to prune the search space. Our algorithm has exponential worst-case time complexity in the number of attributes and linear complexity in the number of tuples. We prove that it produces a complete and minimal set of bidirectional ODs, and we experimentally show orders of magnitude performance improvements over the prior state-of-the-art methodologies.
conference on information and knowledge management | 2017
Sridevi Baskaran; Alexander Keller; Fei Chiang; Lukasz Golab; Jaroslaw Szlichta
Functional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We enhance dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms and is-a hierarchies defined by an ontology. Our technical contributions are twofold: 1) theoretical foundations for OFDs, including a set of sound and complete axioms and a linear-time inference procedure, and 2) an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the exponential search space in the number of attributes. We demonstrate the efficiency of our techniques on real datasets, and we show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional FDs.