Publication


Featured research published by Mourad Ouzzani.


international conference on management of data | 2013

NADEEF: a commodity data cleaning system

Michele Dallachiesa; Amr Ebaid; Ahmed Eldawy; Ahmed K. Elmagarmid; Ihab F. Ilyas; Mourad Ouzzani; Nan Tang

Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end off-the-shelf solution to (semi-)automate the detection and repair of violations w.r.t. a set of heterogeneous and ad-hoc quality constraints. In short, there is no commodity platform similar to general-purpose DBMSs that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized, and easy-to-deploy data cleaning platform. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows users to specify multiple types of data quality rules, which uniformly define what is wrong with the data and (possibly) how to repair it by writing code that implements predefined classes. We show that the programming interface can be used to express many types of data quality rules beyond the well-known CFDs (FDs), MDs, and ETL rules. Treating user-implemented interfaces as black boxes, the core provides algorithms to detect errors and to clean data. The core is designed in a way that allows cleaning algorithms to cope with multiple rules holistically, i.e., detecting and repairing data errors without differentiating between the various types of rules. We showcase two implementations of core repairing algorithms. These two implementations demonstrate the extensibility of our core, which can also be replaced by other user-provided algorithms. Using real-life data, we experimentally verify the generality, extensibility, and effectiveness of our system.
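To make the rule abstraction concrete, the following is a minimal sketch of a NADEEF-style rule: a class that uniformly answers "what is wrong" (detect) and "how to repair it" (repair). The class and method names here are illustrative assumptions, not the actual NADEEF API, and the repair policy (keep the first-seen value) is deliberately naive.

```python
class FDRule:
    """Hypothetical rule for a functional dependency lhs -> rhs
    over a table represented as a list of row dicts."""

    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs

    def detect(self, table):
        """Return (first_row, violating_row) index pairs that break the FD."""
        first_seen = {}
        violations = []
        for i, row in enumerate(table):
            key = row[self.lhs]
            if key not in first_seen:
                first_seen[key] = i
            elif table[first_seen[key]][self.rhs] != row[self.rhs]:
                violations.append((first_seen[key], i))
        return violations

    def repair(self, table, violations):
        """Naive repair policy: overwrite with the first-seen rhs value."""
        for first, dup in violations:
            table[dup][self.rhs] = table[first][self.rhs]
        return table

# Two rows violating zip -> city; detect finds the pair, repair fixes it.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
]
rule = FDRule("zip", "city")
rule.repair(rows, rule.detect(rows))
```

A core like NADEEF's would treat such a rule as a black box, calling only `detect` and `repair`, which is what lets it handle CFDs, MDs, and custom rules holistically.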


international conference on data engineering | 2012

M3: Stream Processing on Main-Memory MapReduce

Ahmed M. Aly; Asmaa Sallam; Bala M. Gnanasekaran; Long Van Nguyen-Dinh; Walid G. Aref; Mourad Ouzzani; Arif Ghafoor

The continuous growth of social web applications, along with the development of sensor capabilities in electronic devices, is creating countless opportunities to analyze the enormous amounts of data that are continuously streaming from these applications and devices. To process large-scale data on large-scale computing clusters, MapReduce has been introduced as a framework for parallel computing. However, most current implementations of the MapReduce framework support only the execution of fixed-input jobs. This restriction makes these implementations inapplicable to most streaming applications, in which queries are continuous in nature and input data streams are continuously received at high arrival rates. In this demonstration, we showcase M3, a prototype implementation of the MapReduce framework in which continuous queries over streams of data can be efficiently answered. M3 extends Hadoop, the open-source implementation of MapReduce, bypassing the Hadoop Distributed File System (HDFS) to support main-memory-only processing. Moreover, M3 supports continuous execution of the Map and Reduce phases, where individual Mappers and Reducers never terminate.
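The key contrast with fixed-input MapReduce can be sketched in a few lines: instead of running map and reduce once over a finite input, state is kept in main memory and an updated reduction is emitted as each record arrives. This is an illustrative toy, not M3's actual Hadoop-based implementation; the function names are assumptions.

```python
from collections import defaultdict

def continuous_mapreduce(stream, map_fn, reduce_fn):
    """Toy continuous map/reduce: per-key state lives in main memory,
    and an incremental result is emitted for every mapped record."""
    state = defaultdict(list)
    for record in stream:                       # unbounded in a real deployment
        for key, value in map_fn(record):
            state[key].append(value)
            yield key, reduce_fn(state[key])    # result so far for this key

# Continuous word count over a (finite, for demonstration) stream of lines.
lines = ["to be or", "not to be"]
results = dict(continuous_mapreduce(
    lines,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=sum,
))
# dict() keeps the latest emission per key, i.e., the running count.
```

In a fixed-input job this loop would terminate and flush results once; here the mapper/reducer pair conceptually never terminates, matching M3's continuous execution of the Map and Reduce phases.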


Archive | 2011

Implementation and Experiments

Mourad Ouzzani; Athman Bouguettaya

In this chapter, we report on the implementation and experiments for our query infrastructure over Web services. The implementation is conducted in the context of WebDG, a digital government prototype that provides access to e-government databases and services related to social services. As a prototype, WebDG supports access to only a few e-government services. Thus, it cannot represent the large number of Web services available on the Web. It is then necessary to assess our approach through experiments on synthetic data. The goal of these experiments is to measure the cost of the different algorithms and the quality of the service execution plans they generate. We focus on computing the time it takes each algorithm to reach a decision. The quality of their results is simply the objective function, Fsmr, as defined for a service execution plan in the previous chapter. These results are then compared under different scenarios and constraints.


international conference on management of data | 2017

UGuide: User-Guided Discovery of FD-Detectable Errors

Saravanan Thirumuruganathan; Laure Berti-Equille; Mourad Ouzzani; Jorge-Arnulfo Quiane-Ruiz; Nan Tang

Error detection is the process of identifying problematic data cells that differ from their ground truth. Functional dependencies (FDs) have been widely studied in support of this process. Oftentimes, it is assumed that FDs are given by experts. Unfortunately, it is usually hard and expensive for experts to define such FDs. In addition, automatic data profiling over dirty data in order to find correct FDs is known to be a hard problem. In this paper, we propose an end-to-end solution to detect FD-detectable errors in dirty data. The broad intuition is that, given a dirty dataset, it is feasible to automatically find approximate FDs as well as data that is possibly erroneous. Arguably, at this point, only experts can confirm true FDs or true errors. However, in practice, experts never have enough budget to find all errors. Hence, our problem is: given a limited budget of expert time, which questions should we ask, about FDs, cells, or tuples, so that we can find as many data errors as possible? We present efficient algorithms to interact with the user. Extensive experiments demonstrate that our proposed framework is effective in detecting errors in dirty data.
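The notion of an approximate FD discovered over dirty data can be illustrated with a simple confidence score: the fraction of tuples that already agree with the majority right-hand-side value within each left-hand-side group. Minority cells are then natural candidates for the budget-limited expert questions the abstract describes. This scoring rule is a common textbook approximation, not UGuide's actual algorithm; all names below are illustrative.

```python
from collections import Counter, defaultdict

def fd_confidence(table, lhs, rhs):
    """Fraction of rows that agree with the majority rhs value
    within their lhs group; 1.0 means the FD holds exactly."""
    groups = defaultdict(Counter)
    for row in table:
        groups[row[lhs]][row[rhs]] += 1
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / len(table)

rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},           # minority cell: ask the expert
    {"zip": "90210", "city": "Beverly Hills"},
]
conf = fd_confidence(rows, "zip", "city")      # 3 of 4 rows agree
```

A high but imperfect confidence (here 0.75) suggests either a true FD over dirty data or a spurious dependency, which is exactly the ambiguity that only an expert, queried within budget, can resolve.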


international conference on management of data | 2017

A Demonstration of Lusail: Querying Linked Data at Scale

Essam Mansour; Ibrahim Abdelaziz; Mourad Ouzzani; Ashraf Aboulnaga; Panos Kalnis

There has been a proliferation of datasets available as interlinked RDF data accessible through SPARQL endpoints. This has led to the emergence of various applications in life science, distributed social networks, and the Internet of Things that need to integrate data from multiple endpoints. We will demonstrate Lusail, a system that supports the need of emerging applications to access tens to hundreds of geo-distributed datasets. Lusail is a geo-distributed graph engine for querying linked RDF data. Lusail delivers outstanding performance using (i) a novel locality-aware query decomposition technique that minimizes the intermediate data to be accessed by the subqueries, and (ii) selectivity-awareness and parallel query execution to reduce network latency and to increase parallelism. During the demo, the audience will be able to query actually deployed RDF endpoints as well as large synthetic and real benchmarks that we have deployed in the public cloud. The demo will also show that Lusail outperforms state-of-the-art systems by orders of magnitude in terms of scalability and response time.


ieee international conference on cloud engineering | 2015

Approving Updates in Collaborative Databases

Khaleel Mershad; Qutaibah M. Malluhi; Mourad Ouzzani; Mingjie Tang; Walid G. Aref

Data curation activities in collaborative databases mandate that collaborators interact until they converge and agree on the content of their data. Typically, updates by a member of the collaboration are made visible to all collaborators for comments but at the same time are pending the approval or rejection of the data custodian, e.g., the principal scientist or investigator (PI). In current database technologies, approval and authorization of updates is based solely on the identity of the user, e.g., via the SQL GRANT and REVOKE commands. However, in collaborative environments, the updated data is open for collaborators for discussion and further editing and is finally approved or rejected by the PI based on the content of the data and not on the identity of the updater. In this paper, we introduce a cloud-based collaborative database system that promotes and enables collaboration and data curation scenarios. We realize content-based update approval and history tracking of updates inside HBase, a distributed and scalable open-source cluster-based database. The design and implementation as well as a detailed performance study of several approaches for update approval are presented and contrasted in the paper.
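The content-based approval workflow described above can be sketched in a few lines: any collaborator may propose an update, proposals remain pending and visible for discussion, and the custodian (e.g., the PI) accepts or rejects each one based on its content rather than on who submitted it. This is a minimal illustration of the idea, not the paper's HBase-based implementation; the class and its accept-predicate interface are assumptions.

```python
class CollaborativeTable:
    """Toy model of content-based update approval."""

    def __init__(self):
        self.approved = {}   # custodian-approved version of the data
        self.pending = []    # (key, value) updates awaiting review

    def propose(self, key, value):
        """Any collaborator may propose an update; it stays pending
        (and visible for discussion) until the custodian reviews it."""
        self.pending.append((key, value))

    def review(self, accept):
        """The custodian applies each pending update iff accept(key, value)
        returns True; the decision depends on content, not on identity."""
        for key, value in self.pending:
            if accept(key, value):
                self.approved[key] = value
        self.pending = []

t = CollaborativeTable()
t.propose("sample_42", "expressed")
t.propose("sample_42", "contaminated?")
# PI rejects updates whose content is still marked as uncertain.
t.review(accept=lambda key, value: not value.endswith("?"))
```

Contrast this with SQL GRANT/REVOKE, where the `accept` decision would reduce to checking the updater's identity; here the predicate inspects the proposed value itself, which is the paper's central point.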


Archive | 2011

Conclusions, Open Issues, and Future Directions

Mourad Ouzzani; Athman Bouguettaya

This book addresses key issues to enable efficient access to Web databases and Web services. We described a distributed ontology that allows a meaningful organization of and efficient access to Web databases. We also presented a comprehensive query infrastructure for the emerging concept of Web services. The core of this query infrastructure relates to the efficient delivery of Web services based on the concept of Quality of Web Service.


Archive | 2011

Current Advances in Semantic Web Services and Web Databases

Mourad Ouzzani; Athman Bouguettaya

The work we presented in this book relates to several research areas, including Web database integration and efficient querying, as well as Web service querying, composition, and optimization. This chapter reviews these areas as they relate to the book.


Archive | 2011

Web Services Query Model

Mourad Ouzzani; Athman Bouguettaya

The basic use of Web services consists of invoking operations by sending and receiving messages. Their definition does not describe potential interactions between operations within the same Web service or across other Web services. However, complex applications accessing diverse Web services (e.g., benefits for senior citizens) require advanced capabilities to manipulate and deliver Web services' functionalities. In general, users have needs that cannot be fulfilled by simply invoking a single operation or several operations independently.


Archive | 2011

Ontological Organization of Web Databases

Mourad Ouzzani; Athman Bouguettaya

Organizations rely on a wide variety of databases to conduct their everyday business. Databases are usually designed from scratch if none is found to meet requirements. This has led to a proliferation of databases obeying different sets of requirements while oftentimes modeling the same situations. In many instances, because there is no organized conglomeration of databases, users create their own information even though it may already exist in current databases.

Collaboration


Dive into Mourad Ouzzani's collaboration.

Top Co-Authors

Nan Tang

Qatar Computing Research Institute


D. Agrawal

University of California
