Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Jorge Arnulfo Quiané-Ruiz is active.

Publication


Featured research published by Jorge Arnulfo Quiané-Ruiz.


international conference on management of data | 2015

BigDansing: A System for Big Data Cleansing

Zuhair Khayyat; Ihab F. Ilyas; Alekh Jindal; Samuel Madden; Mourad Ouzzani; Paolo Papotti; Jorge Arnulfo Quiané-Ruiz; Nan Tang; Si Yin

Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system that tackles efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, without requiring any awareness of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
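
To make the rule abstraction concrete, here is a minimal, hypothetical sketch (not BigDansing's actual API) of a detect-style rule for the functional dependency zipcode -> city: blocking tuples on the left-hand side avoids enumerating all pairs, which is the kind of optimization the system applies when it maps rules onto a distributed platform.

    from collections import defaultdict
    from itertools import combinations

    # Hypothetical sketch of a detect-style rule for the FD zipcode -> city:
    # flag every pair of tuples that share a zipcode but disagree on the city.
    # Blocking on zipcode avoids enumerating all tuple pairs.
    def detect_fd_violations(tuples, lhs="zipcode", rhs="city"):
        blocks = defaultdict(list)
        for t in tuples:                       # one shared scan to build blocks
            blocks[t[lhs]].append(t)
        violations = []
        for group in blocks.values():          # compare tuples only inside a block
            for t1, t2 in combinations(group, 2):
                if t1[rhs] != t2[rhs]:
                    violations.append((t1, t2))
        return violations

    rows = [
        {"zipcode": "10115", "city": "Berlin"},
        {"zipcode": "10115", "city": "Munich"},   # violates zipcode -> city
        {"zipcode": "75001", "city": "Paris"},
    ]
    print(detect_fd_violations(rows))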


very large data bases | 2013

Scalable discovery of unique column combinations

Arvid Heise; Jorge Arnulfo Quiané-Ruiz; Ziawasch Abedjan; Anja Jentzsch; Felix Naumann

The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires verifying an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice with a combination of depth-first and random walk strategies. This strategy allows the runtime of Ducc to depend mainly on the solution set size and hence to prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just a few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several million rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc is up to more than two orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows that Ducc scales up and out efficiently.
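
As a rough illustration of the uniqueness test and the pruning it enables (Ducc's actual traversal mixes depth-first and random walks and adds row-based pruning, none of which is shown here), the sketch below walks the column-combination lattice level by level and skips every superset of a combination already found to be unique.

    # Toy sketch (not Ducc's algorithm): level-wise walk over the column lattice
    # with superset pruning. A combination is unique if no two rows agree on all
    # of its columns; supersets of a unique combination need not be checked.
    def is_unique(rows, cols):
        seen = set()
        for row in rows:
            key = tuple(row[c] for c in cols)
            if key in seen:                    # two rows agree on all columns
                return False
            seen.add(key)
        return True

    def minimal_uniques(rows, columns):
        uniques = []
        level = [(c,) for c in columns]
        while level:
            next_level = set()
            for combo in level:
                if any(set(u) <= set(combo) for u in uniques):
                    continue                   # pruned: superset of a known unique
                if is_unique(rows, combo):
                    uniques.append(combo)
                else:                          # grow only non-unique combinations
                    for c in columns:
                        if c not in combo:
                            next_level.add(tuple(sorted(combo + (c,))))
            level = sorted(next_level)
        return uniques

    rows = [
        {"first": "Ada",  "last": "Lovelace", "city": "London"},
        {"first": "Ada",  "last": "Byron",    "city": "London"},
        {"first": "Alan", "last": "Turing",   "city": "London"},
    ]
    print(minimal_uniques(rows, ["first", "last", "city"]))   # [('last',)]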


international conference on data engineering | 2015

CliqueSquare: Flat plans for massively parallel RDF queries

François Goasdoué; Zoi Kaoudi; Ioana Manolescu; Jorge Arnulfo Quiané-Ruiz; Stamatis Zampetakis

As increasing volumes of RDF data are being produced and analyzed, many massively distributed architectures have been proposed for storing and querying this data. These architectures are characterized, first, by their RDF partitioning and storage method and, second, by their approach to distributed query optimization, i.e., determining which operations to execute on each node in order to compute the query answers. We present CliqueSquare, a novel optimization approach for evaluating conjunctive RDF queries in a massively parallel environment. We focus on reducing query response time and thus seek to build flat plans, where the number of joins encountered on a root-to-leaf path in the plan is minimized. We present a family of optimization algorithms, relying on n-ary (star) equality joins to build flat plans, and compare their ability to find the flattest plans possible. We have deployed our algorithms in a MapReduce-based RDF platform and demonstrate experimentally the benefits of the flat plans built by our best algorithms.
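
A much-simplified sketch of the flat-plan idea (CliqueSquare's actual algorithms explore many clique decompositions of the query's variable graph): triple patterns that share a subject variable form a star and can be evaluated with a single n-ary join, so the plan needs only the star joins plus one more level to combine them, instead of a deep cascade of binary joins.

    from collections import defaultdict

    # Hypothetical sketch: group the triple patterns of a conjunctive query into
    # "stars" around their subject; each star becomes one n-ary join at level 1,
    # and the star results are joined on shared variables at level 2.
    def star_groups(triple_patterns):
        stars = defaultdict(list)
        for s, p, o in triple_patterns:
            stars[s].append((s, p, o))     # group by subject (variable or constant)
        return list(stars.values())

    query = [                              # strings starting with "?" are variables
        ("?p", "worksAt", "?org"),
        ("?p", "authorOf", "?paper"),
        ("?paper", "publishedIn", "?venue"),
        ("?venue", "heldIn", "?city"),
    ]

    stars = star_groups(query)
    print(len(stars), "star joins at level 1:", stars)
    # A left-deep binary plan would stack len(query) - 1 = 3 joins on one path;
    # the flat plan here needs just two join levels (the stars, then one n-ary
    # join of the star results).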


very large data bases | 2013

NADEEF: a generalized data cleaning system

Amr Ebaid; Ahmed K. Elmagarmid; Ihab F. Ilyas; Mourad Ouzzani; Jorge Arnulfo Quiané-Ruiz; Nan Tang; Si Yin

We present NADEEF, an extensible, generic and easy-to-deploy data cleaning system. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows users to specify data quality rules by writing code that implements predefined classes. These classes uniformly define what is wrong with the data and (possibly) how to fix it. We will demonstrate the following features provided by NADEEF. (1) Heterogeneity: The programming interface can be used to express many types of data quality rules beyond the well known CFDs (FDs), MDs and ETL rules. (2) Interdependency: The core algorithms can interleave multiple types of rules to detect and repair data errors. (3) Deployment and extensibility: Users can easily customize NADEEF by defining new types of rules, or by extending the core. (4) Metadata management and data custodians: We show a live data quality dashboard to effectively involve users in the data cleaning process.
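
The programming interface can be pictured as a small class contract, sketched below in Python with hypothetical names (NADEEF's own classes differ in detail): every rule states what is wrong (detect) and, possibly, how to fix it (repair), which is what lets the core interleave rules of very different types.

    from abc import ABC, abstractmethod

    # Hypothetical, simplified analogue of a NADEEF-style rule interface.
    class Rule(ABC):
        @abstractmethod
        def detect(self, table):
            """Return violations, each a set of (row_index, column) cells."""

        @abstractmethod
        def repair(self, table, violation):
            """Return proposed fixes as a dict {(row_index, column): new_value}."""

    class NonNegativeSalary(Rule):
        """Single-tuple rule: salary must not be negative."""
        def detect(self, table):
            return [{(i, "salary")} for i, row in enumerate(table) if row["salary"] < 0]

        def repair(self, table, violation):
            (i, col), = violation
            return {(i, col): None}   # mark the cell for a data custodian instead of guessing

    table = [{"name": "Ann", "salary": 52000}, {"name": "Bob", "salary": -1}]
    rule = NonNegativeSalary()
    for v in rule.detect(table):
        print(rule.repair(table, v))   # {(1, 'salary'): None}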


very large data bases | 2015

Divide & conquer-based inclusion dependency discovery

Thorsten Papenbrock; Sebastian Kruse; Jorge Arnulfo Quiané-Ruiz; Felix Naumann

The discovery of all inclusion dependencies (INDs) in a dataset is an important part of any data profiling effort. Apart from the detection of foreign key relationships, INDs can help to perform data integration, query optimization, integrity checking, or schema (re-)design. However, the detection of INDs gets harder as datasets grow larger in terms of both tuples and attributes. To this end, we propose Binder, an IND detection system that is capable of detecting both unary and n-ary INDs. It is based on a divide & conquer approach, which allows it to handle very large datasets -- an important property in the face of the ever-increasing size of today's data. In contrast to most related work, we neither rely on existing database functionality nor assume that inspected datasets fit into main memory. This renders Binder an efficient and scalable competitor. Our exhaustive experimental evaluation shows the clear superiority of Binder over the state of the art in both unary (Spider) and n-ary (Mind) IND discovery. Binder is up to 26x faster than Spider and more than 2500x faster than Mind.
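
The divide & conquer idea can be sketched for the unary case as follows (this is only an illustration, not Binder's actual validation strategy): hash-partition each column's values into buckets, so that a candidate inclusion A in B only needs to be checked bucket by bucket, and no complete column ever has to be held in memory at once.

    from collections import defaultdict

    # Illustrative sketch of divide & conquer unary IND discovery: a candidate
    # IND "A included in B" survives only if, in every hash bucket, A's values
    # are a subset of B's values. Buckets can be checked independently.
    NUM_BUCKETS = 4

    def bucketize(values):
        buckets = defaultdict(set)
        for v in values:
            buckets[hash(v) % NUM_BUCKETS].add(v)
        return buckets

    def unary_inds(columns):
        """columns: dict name -> list of values; returns (A, B) pairs with A included in B."""
        bucketed = {name: bucketize(vals) for name, vals in columns.items()}
        return [(a, b)
                for a in columns for b in columns
                if a != b and all(bucketed[a][k] <= bucketed[b].get(k, set())
                                  for k in bucketed[a])]

    data = {
        "order_customer": [1, 2, 2, 3],
        "customer_id":    [1, 2, 3, 4, 5],
        "product_id":     [7, 8, 9],
    }
    print(unary_inds(data))   # [('order_customer', 'customer_id')]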


extending database technology | 2016

Road to freedom in big data analytics

D. Agrawal; Sanjay Chawla; Ahmed K. Elmagarmid; Zoi Kaoudi; Mourad Ouzzani; Paolo Papotti; Jorge Arnulfo Quiané-Ruiz; Nan Tang; Mohammed Javeed Zaki

The world is fast moving towards a data-driven society where data is the most valuable asset. Organizations need to perform very diverse analytic tasks using various data processing platforms. In doing so, they face many challenges; chiefly, platform dependence, poor interoperability, and poor performance when using multiple platforms. We present RHEEM, our vision for big data analytics over diverse data processing platforms. RHEEM provides a three-layer data processing and storage abstraction to achieve both platform independence and interoperability across multiple platforms. In this paper, we discuss our vision as well as present multiple research challenges that we need to address to achieve it. As a case in point, we present a data cleaning application built using some of the ideas of RHEEM. We show how it achieves platform independence and the performance benefits of following such an approach.

1. WHY TIED TO ONE SINGLE SYSTEM?

Data analytic tasks may range from very simple to extremely complex pipelines, such as data extraction, transformation, and loading (ETL), online analytical processing (OLAP), graph processing, and machine learning (ML). Following the dictum "one size does not fit all" [23], academia and industry have embarked on an endless race to develop data processing platforms for supporting these different tasks, e.g., DBMSs and MapReduce-like systems. Semantic completeness, high performance, and scalability are key objectives of such platforms. While there have been major achievements in these objectives, users still face two main roadblocks.

The first roadblock is that applications are tied to a single processing platform, making the migration of an application to new and more efficient platforms a difficult and costly task. Furthermore, complex analytic tasks usually require the combined use of different processing platforms. As a result, the common practice is to develop several specialized analytic applications on top of different platforms. This requires users to manually combine the results to draw a conclusion. In addition, users may need to re-implement existing applications on top of faster processing platforms when these become available. For example, Spark SQL [3] and MLlib [2] are the Spark counterparts of Hive [24] and Mahout [1].

The second roadblock is that datasets are often produced by different sources and hence natively reside on different storage platforms. As a result, users often perform tedious, time-intensive, and costly data migration and integration tasks for further analysis.

Let us illustrate these roadblocks with an Oil & Gas industry example [13]. A single oil company can produce more than 1.5TB of diverse data per day [6]. Such data may be structured or unstructured and come from heterogeneous sources, such as sensors, GPS devices, and other measuring instruments. For instance, during the exploration phase, data has to be acquired, integrated, and analyzed in order to predict if a reservoir would be profitable. Thousands of downhole sensors in exploratory wells produce real-time seismic data for monitoring resources and environmental conditions. Users integrate these data with the physical properties of the rocks to visualize volume and surface renderings. From these visualizations, geologists and geophysicists formulate hypotheses and verify them with ML methods, such as regression and classification. Training of the models is performed with historical drilling and production data, but oftentimes users have to go over unstructured data, such as notes exchanged by email or text from drilling reports filed in a cabinet. Thus, an application supporting such a complex analytic pipeline has to access several sources for historical data (relational, but also text and semi-structured), remove the noise from the streaming data coming from the sensors, and run both traditional (such as SQL) and statistical analytics (such as ML algorithms) over different processing platforms. Similar examples can be drawn from many other domains such as healthcare: e.g., IBM reported that North York hospital needs to process 50 diverse datasets, which are on a dozen different internal systems [15]. These emerging applications clearly show the need for complex analytics coupled with a diversity of processing platforms, which raises two major research challenges.

Data Processing Challenge. Users are faced with various choices on where to process their data, each choice with possibly orders of magnitude differences in terms of performance. However, users have to be intimate with the intricacies of the processing platform to achieve high efficiency and scalability. Moreover, once a decision is taken, users may end up being tied to a particular platform. As a result, migrating the data analytics stack to a more efficient processing platform often becomes a nightmare. Thus, there is a need to build a system that offers data processing platform independence. Furthermore, complex analytic applications require executing tasks over different processing platforms to achieve high performance. For example, one may aggregate large datasets with traditional queries on top of a relational database such as PostgreSQL, but ML tasks might be much faster if executed on Spark [28]. However, this requires a considerable amount of manual work in selecting the best processing platforms, optimizing tasks for the chosen platforms, and coordinating task execution. Thus, this also calls for multi-platform task execution.

Data Storage Challenge. Data processing platforms are typically tightly coupled with a specific storage solution. Moving data from a certain storage (e.g., a relational DB) to a more suitable processing platform for the actual task (e.g., Spark on HDFS) requires shuffling data between different systems. Such shuffling may end up dominating the execution time. Moreover, different departments in the same organization may go for different storage engines due to legacy as well as performance reasons. Dealing with such heterogeneity calls for data storage independence.

To tackle these two challenges, we envision a system, called RHEEM, that provides both platform independence and interoperability (Section 2). In the following, we first discuss our vision for the data processing abstraction (Section 3), which is fully based on user-defined functions (UDFs) to provide adaptability as well as extensibility. This processing abstraction allows users to focus only on the logic of their data analytic tasks and applications to be independent from the data processing platforms. We then discuss how to divide a complex analytic task into smaller subtasks to exploit the availability of different processing platforms (Section 4). As a result, RHEEM can run a single data analytic task simultaneously over multiple processing platforms to boost performance. Next, we present our first attempt to build an instance application based on some of the ideas of RHEEM and the resulting benefits (Section 5). We then show how we push down the processing abstraction idea to the storage layer (Section 6). This storage abstraction allows users to focus on their storage needs and the processing platforms to be independent from the storage engines. Some initial efforts are also going in the direction of providing data processing platform independence [11,12,21] (Section 7). However, our vision goes beyond data processing: we envision not only a data processing abstraction but also a data storage abstraction, allowing us to consider data movement costs during task optimization. We give a research agenda highlighting the challenges that need to be tackled to build RHEEM in Section 8.
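
The platform-independence idea can be pictured as a separation between a logical plan built from UDF-based operators and the executors that translate it to concrete platforms; the sketch below uses assumed names and two toy backends, and is in no way RHEEM's actual three-layer abstraction.

    # Toy illustration (assumed names, not RHEEM's API): write the task once
    # against abstract operators, then hand the same plan to different executors.
    class Map:
        def __init__(self, udf):
            self.udf = udf

    class Filter:
        def __init__(self, udf):
            self.udf = udf

    # Logical plan: platform-independent, just operators plus user-defined functions.
    plan = [Filter(lambda r: r["temp"] is not None),
            Map(lambda r: (r["well"], r["temp"]))]

    def run_locally(plan, data):
        """Reference executor: plain Python iterators on a single machine."""
        for op in plan:
            data = filter(op.udf, data) if isinstance(op, Filter) else map(op.udf, data)
        return list(data)

    def run_on_spark(plan, rdd):
        """Hypothetical second executor: the same plan mapped onto a Spark RDD."""
        for op in plan:
            rdd = rdd.filter(op.udf) if isinstance(op, Filter) else rdd.map(op.udf)
        return rdd

    readings = [{"well": "w1", "temp": 81.2}, {"well": "w2", "temp": None}]
    print(run_locally(plan, readings))   # [('w1', 81.2)]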


very large data bases | 2015

Lightning fast and space efficient inequality joins

Zuhair Khayyat; William Lucia; Meghna Singh; Mourad Ouzzani; Paolo Papotti; Jorge Arnulfo Quiané-Ruiz; Nan Tang; Panos Kalnis

Inequality joins, which join relational tables on inequality conditions, are used in various applications. While a wide range of optimization methods exists for joins in database systems, from algorithms such as sort-merge join and band join to indices such as B+-tree, R*-tree, and Bitmap, inequality joins have received little attention, and queries containing such joins are usually very slow. In this paper, we introduce fast inequality join algorithms. We put the columns to be joined in sorted arrays and use permutation arrays to encode the positions of tuples in one sorted array with respect to the other sorted array. In contrast to sort-merge join, we use space-efficient bit arrays that enable optimizations, such as Bloom filter indices, for fast computation of the join results. We have implemented a centralized version of these algorithms on top of PostgreSQL and a distributed version on top of Spark SQL. We have compared against well-known optimization techniques for inequality joins and show that our solution is more scalable and several orders of magnitude faster.
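
The sorted-array, permutation-array, and bit-array machinery can be distilled into a short self-join sketch: sort on one attribute, record every tuple's position in that order, then scan in descending order of the other attribute while marking positions, so that the already-marked positions to the left of the current tuple are exactly its join partners. The version below assumes distinct values and a single pair of predicates; the paper's IEJoin adds bit-array indexing, handling of ties and offsets, and the distributed variant.

    # Simplified IEJoin-style self-join (assumes distinct x and y values):
    # report all pairs (r, s) with r.x < s.x and r.y > s.y.
    def ie_self_join(rows):
        n = len(rows)
        by_x = sorted(range(n), key=lambda i: rows[i]["x"])   # sorted array on x
        pos_x = {i: p for p, i in enumerate(by_x)}            # permutation array
        by_y_desc = sorted(range(n), key=lambda i: -rows[i]["y"])
        bits = [False] * n                                    # bit array over x-positions
        result = []
        for s in by_y_desc:              # every already-seen r satisfies r.y > s.y
            for p in range(pos_x[s]):    # marked positions left of s also have r.x < s.x
                if bits[p]:
                    result.append((by_x[p], s))
            bits[pos_x[s]] = True
        return result

    rows = [                             # e.g., transactions with time (x) and cost (y)
        {"x": 1, "y": 9},
        {"x": 2, "y": 5},
        {"x": 3, "y": 7},
        {"x": 4, "y": 1},
    ]
    print(ie_self_join(rows))            # pairs of row indices (r, s)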


international conference on management of data | 2016

Rheem: Enabling Multi-Platform Task Execution

D. Agrawal; Lamine Ba; Laure Berti-Equille; Sanjay Chawla; Ahmed K. Elmagarmid; Hossam M. Hammady; Yasser Idris; Zoi Kaoudi; Zuhair Khayyat; Sebastian Kruse; Mourad Ouzzani; Paolo Papotti; Jorge Arnulfo Quiané-Ruiz; Nan Tang; Mohammed Javeed Zaki

Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases Rheem, a framework that provides multi-platform task execution for such applications. It features a three-layer data processing abstraction and a new query optimization approach for multi-platform settings. We will demonstrate the strengths of Rheem by using real-world scenarios from three different applications, namely machine learning, data cleaning, and data fusion.
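
One way to picture the multi-platform optimization problem mentioned above is as a per-operator platform choice that also pays for moving data between platforms; the platform names and cost numbers below are entirely made up for illustration and are not Rheem's cost model.

    # Illustrative sketch (made-up costs, not Rheem's optimizer): pick a platform
    # per operator, charging a transfer cost whenever consecutive operators run
    # on different platforms; solved with simple dynamic programming.
    COST = {                     # estimated per-operator cost on each platform
        "parse":   {"JavaStreams": 1,  "Spark": 20},
        "join":    {"JavaStreams": 40, "Spark": 12},
        "cluster": {"JavaStreams": 55, "Spark": 9},
    }
    TRANSFER = 5                 # cost of shipping intermediate data across platforms

    def best_assignment(pipeline):
        platforms = ["JavaStreams", "Spark"]
        best = {p: (COST[pipeline[0]][p], [p]) for p in platforms}
        for op in pipeline[1:]:
            best = {p: min((cost + COST[op][p] + (TRANSFER if q != p else 0), path + [p])
                           for q, (cost, path) in best.items())
                    for p in platforms}
        return min(best.values())

    print(best_assignment(["parse", "join", "cluster"]))
    # (27, ['JavaStreams', 'Spark', 'Spark']): start on the lightweight platform,
    # then ship the data to Spark for the heavy operators.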


international conference on management of data | 2017

Generating Concise Entity Matching Rules

Rohit Singh; Venkata Vamsikrishna Meduri; Ahmed K. Elmagarmid; Samuel Madden; Paolo Papotti; Jorge Arnulfo Quiané-Ruiz; Armando Solar-Lezama; Nan Tang

Entity matching (EM) is a critical part of data integration and cleaning. In many applications, users need to understand why two entities are considered a match, which reveals the need for interpretable and concise EM rules. We model EM rules in the form of General Boolean Formulas (GBFs), which allow arbitrary attribute-matching conditions to be combined by conjunctions (∧), disjunctions (∨), and negations (¬). GBFs can generate more concise rules than traditional EM rules represented in disjunctive normal form (DNF). We use program synthesis, a powerful tool to automatically generate rules (or programs) that provably satisfy a high-level specification, to automatically synthesize EM rules in GBF format, given only positive and negative matching examples. In this demo, attendees will experience the following features: (1) Interpretability -- they can see and measure the conciseness of EM rules defined using GBFs; (2) Easy customization -- they can provide custom experiment parameters for various datasets and easily modify a rich predefined (default) synthesis grammar, using a Web interface; and (3) High performance -- they will be able to compare the generated concise rules, in terms of accuracy, with probabilistic models (e.g., machine learning methods) and hand-written EM rules provided by experts. Moreover, this system will serve as a general platform for evaluating different methods that discover EM rules, and it will be released as an open-source tool on GitHub.
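
To make the GBF idea concrete, a rule can be written directly as a Boolean formula over attribute-similarity predicates; the similarity function, thresholds, and the rule below are made up for illustration, and the rules in the paper are synthesized automatically rather than hand-written like this one.

    from difflib import SequenceMatcher

    # Toy sketch of an entity-matching rule as a General Boolean Formula over
    # attribute-similarity predicates (made-up rule and thresholds). Nesting and
    # negation keep the rule shorter than an equivalent DNF expansion would be.
    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def is_match(r1, r2):
        # (name similar AND NOT city clearly different) OR phone identical
        return ((sim(r1["name"], r2["name"]) > 0.8
                 and not sim(r1["city"], r2["city"]) < 0.3)
                or r1["phone"] == r2["phone"])

    a = {"name": "Jorge Quiane-Ruiz", "city": "Doha", "phone": "555-1234"}
    b = {"name": "Jorge Quiane Ruiz", "city": "Doha", "phone": "555-0000"}
    print(is_match(a, b))   # True: names are similar and cities do not differ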


international conference on data engineering | 2015

CliqueSquare in action: Flat plans for massively parallel RDF queries

Benjamin Djahandideh; François Goasdoué; Zoi Kaoudi; Ioana Manolescu; Jorge Arnulfo Quiané-Ruiz; Stamatis Zampetakis

RDF is an increasingly popular data model for many practical applications, leading to large volumes of RDF data; efficient RDF data management methods are crucial to allow applications to scale. We propose to demonstrate CliqueSquare, an RDF data management system built on top of a MapReduce-like infrastructure. The main technical novelty of CliqueSquare resides in its logical query optimization algorithm, which is guaranteed to find a logical plan as flat as possible for a given query, meaning a plan having the smallest possible number of join operators on top of each other. CliqueSquare's ability to build flat plans allows it to take advantage of a parallel processing framework in order to shorten response times. We demonstrate loading and querying the data, with a particular focus on query optimization and on the performance benefits of CliqueSquare's flat plans.

Collaboration


Dive into Jorge Arnulfo Quiané-Ruiz's collaborations.

Top Co-Authors

Nan Tang
Qatar Computing Research Institute

Paolo Papotti
Arizona State University

Mourad Ouzzani
Qatar Computing Research Institute

Ahmed K. Elmagarmid
Qatar Computing Research Institute

Felix Naumann
Hasso Plattner Institute

Zuhair Khayyat
King Abdullah University of Science and Technology

Samuel Madden
Massachusetts Institute of Technology

Meghna Singh
Qatar Computing Research Institute

Si Yin
Qatar Computing Research Institute