Laure Berti-Equille
Qatar Computing Research Institute
Publications
Featured research published by Laure Berti-Equille.
Very Large Data Bases | 2009
Xin Luna Dong; Laure Berti-Equille; Divesh Srivastava
Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values and different sources can often provide conflicting values. To present quality data to users, it is critical that data integration systems can resolve conflicts and discover true values. Typically, we expect a true value to be provided by more sources than any particular false one, so we can take the value provided by the majority of the sources as the truth. Unfortunately, a false value can be spread through copying and that makes truth discovery extremely tricky. In this paper, we consider how to find true values from conflicting information when there are a large number of sources, among which some may copy from others. We present a novel approach that considers dependence between data sources in truth discovery. Intuitively, if two data sources provide a large number of common values and many of these values are rarely provided by other sources (e.g., particular false values), it is very likely that one copies from the other. We apply Bayesian analysis to decide dependence between sources and design an algorithm that iteratively detects dependence and discovers truth from conflicting information. We also extend our model by considering accuracy of data sources and similarity between values. Our experiments on synthetic data as well as real-world data show that our algorithm can significantly improve accuracy of truth discovery and is scalable when there are a large number of data sources.
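The interplay between copying detection and truth discovery described above can be illustrated with a minimal sketch (not the paper's Bayesian model): sources vote for values, source weights are re-estimated from agreement with the current truths, and sources that share many otherwise-rare values are penalized as likely copiers. The data, thresholds, and penalty factor below are illustrative assumptions.

```python
from collections import defaultdict

# Toy claims: source -> {data_item: value}.  Purely illustrative data:
# s3 appears to copy s2, so the false value "Bonn" is over-represented.
claims = {
    "s1": {"capital_FR": "Paris", "capital_DE": "Berlin"},
    "s2": {"capital_FR": "Paris", "capital_DE": "Bonn"},
    "s3": {"capital_FR": "Paris", "capital_DE": "Bonn"},
}

def shared_rare_values(a, b, claims):
    """Fraction of a's claims that b repeats while no other source does."""
    shared = rare = 0
    for item, val in claims[a].items():
        if claims[b].get(item) == val:
            shared += 1
            if not any(claims[s].get(item) == val
                       for s in claims if s not in (a, b)):
                rare += 1
    return rare / shared if shared else 0.0

def discover_truth(claims, rounds=3, copy_threshold=0.4):
    weight = {s: 1.0 for s in claims}
    truth = {}
    for _ in range(rounds):
        # 1. weighted voting per data item
        for item in {i for vals in claims.values() for i in vals}:
            votes = defaultdict(float)
            for s, vals in claims.items():
                if item in vals:
                    votes[vals[item]] += weight[s]
            truth[item] = max(votes, key=votes.get)
        # 2. re-estimate each source's accuracy against the current truths
        for s, vals in claims.items():
            weight[s] = sum(truth[i] == v for i, v in vals.items()) / len(vals)
        # 3. crude dependence penalty: sharing many otherwise-rare values
        for a in claims:
            if any(a != b and shared_rare_values(a, b, claims) > copy_threshold
                   for b in claims):
                weight[a] *= 0.2
    return truth

print(discover_truth(claims))  # converges to Paris for France, Berlin for Germany
```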
Very Large Data Bases | 2009
Xin Luna Dong; Laure Berti-Equille; Divesh Srivastava
Modern information management applications often require integrating data from a variety of data sources, some of which may copy or buy data from other sources. When these data sources model a dynamically changing world (e.g., people's contact information changes over time, restaurants open and go out of business), sources often provide out-of-date data. Errors can also creep into data when sources are updated often. Given out-of-date and erroneous data provided by different, possibly dependent, sources, it is challenging for data integration systems to provide the true values. Straightforward ways to resolve such inconsistencies (e.g., voting) may lead to noisy results, often with detrimental consequences. In this paper, we study the problem of finding true values and determining the copying relationship between sources, when the update history of the sources is known. We model the quality of sources over time by their coverage, exactness and freshness. Based on these measures, we conduct a probabilistic analysis. First, we develop a Hidden Markov Model that decides whether a source is a copier of another source and identifies the specific moments at which it copies. Second, we develop a Bayesian model that aggregates information from the sources to decide the true value for a data item, and the evolution of the true values over time. Experimental results on both real-world and synthetic data show high accuracy and scalability of our techniques.
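As a rough illustration of the three quality dimensions named above, the toy sketch below computes coverage, exactness, and freshness of one source's update history against a hypothetical real-world history. The definitions and data are simplified assumptions, not the paper's formal measures or its HMM/Bayesian machinery.

```python
from datetime import date

# Illustrative update histories: data item -> list of (date, value).
world = {   # hypothetical real-world evolution of one data item
    "alice_phone": [(date(2009, 1, 1), "555-0100"),
                    (date(2009, 6, 1), "555-0199")],
}
source = {  # what one source published over time (it never caught the change)
    "alice_phone": [(date(2009, 2, 1), "555-0100")],
}

def latest(history, as_of):
    """Value of the most recent update on or before as_of, if any."""
    updates = [(d, v) for d, v in history if d <= as_of]
    return max(updates)[1] if updates else None

def source_quality(source, world, as_of):
    """Toy coverage / exactness / freshness scores for one source."""
    covered = [item for item in world if item in source]
    coverage = len(covered) / len(world)
    exact = sum(1 for item in covered
                if latest(source[item], as_of) in {v for _, v in world[item]})
    exactness = exact / len(covered) if covered else 0.0
    fresh = sum(1 for item in covered
                if latest(source[item], as_of) == latest(world[item], as_of))
    freshness = fresh / len(covered) if covered else 0.0
    return coverage, exactness, freshness

# The source covers the item and its value did hold at some point (exact),
# but it is stale: (coverage, exactness, freshness) == (1.0, 1.0, 0.0)
print(source_quality(source, world, as_of=date(2009, 12, 31)))
```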
Very Large Data Bases | 2010
Xin Luna Dong; Laure Berti-Equille; Yifan Hu; Divesh Srivastava
Web technologies have enabled data sharing between sources but also simplified copying (and often publishing without proper attribution). The copying relationships can be complex: some sources copy from multiple sources on different subsets of data; some co-copy from the same source, and some transitively copy from another. Understanding such copying relationships is desirable both for business purposes and for improving many key components in data integration, such as resolving conflicts across various sources, reconciling distinct references to the same real-world entity, and efficiently answering queries over multiple sources. Recent works have studied how to detect copying between a pair of sources, but the techniques can fall short in the presence of complex copying relationships. In this paper we describe techniques that discover global copying relationships between a set of structured sources. Towards this goal we make two contributions. First, we propose a global detection algorithm that identifies co-copying and transitive copying, returning only source pairs with direct copying. Second, global detection requires accurate decisions on copying direction; we significantly improve over previous techniques on this by considering various types of evidence for copying and correlation of copying on different data items. Experimental results on real-world data and synthetic data show high effectiveness and efficiency of our techniques.
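A toy sketch of the co-/transitive-copying issue, under the assumption that some local detector has already produced pairwise copying scores: an edge is kept as direct copying only if it is not explained by a chain through a third source. The actual algorithm is probabilistic and also reasons about copying direction; the pruning rule, scores, and threshold here are illustrative only.

```python
# Hypothetical pairwise scores from some local copying detector:
# probability that the first source copies from the second.
scores = {
    ("s3", "s2"): 0.9,   # s3 copies s2
    ("s2", "s1"): 0.9,   # s2 copies s1
    ("s3", "s1"): 0.8,   # high score, but only because of the chain s3 -> s2 -> s1
}

def direct_copying(scores, threshold=0.7):
    """Keep an edge only if it is not explained by a two-step chain."""
    edges = {e for e, p in scores.items() if p >= threshold}
    sources = {s for e in edges for s in e}
    direct = set()
    for (a, b) in edges:
        explained = any((a, c) in edges and (c, b) in edges
                        for c in sources if c not in (a, b))
        if not explained:
            direct.add((a, b))
    return direct

print(direct_copying(scores))   # {('s3', 's2'), ('s2', 's1')}
```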
International Conference on Management of Data | 2013
Mohamed Yakout; Laure Berti-Equille; Ahmed K. Elmagarmid
Various computational procedures or constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations, including scalability and the quality of the values used to replace the errors. In this paper, we propose a new data repairing approach that is based on maximizing the likelihood of replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach combining machine learning and likelihood methods for cleaning dirty databases by value modification. We develop a quality measure of the repairing updates based on the likelihood benefit and the amount of changes applied to the database. We propose SCARE (SCalable Automatic REpairing), a systematic scalable framework that follows our approach. SCARE relies on a robust mechanism for horizontal data partitioning and a combination of machine learning techniques to predict the set of possible updates. Due to data partitioning, several updates can be predicted for a single record based on local views on each data partition. Therefore, we propose a mechanism to combine the local predictions and obtain accurate final predictions. Finally, we experimentally demonstrate the effectiveness, efficiency, and scalability of our approach on real-world datasets in comparison to recent data cleaning approaches.
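A minimal sketch of the accept-a-repair-only-if-it-pays-off idea, using a naive frequency model of city given zip code learned from a clean partition in place of SCARE's machine-learning predictors; the "likelihood benefit" and "change cost" below are crude stand-ins for the paper's measures, and all data are invented.

```python
from collections import Counter, defaultdict

# Toy relation: (zip_code, city).  One record looks erroneous.
clean = [("10001", "New York")] * 8 + [("94105", "San Francisco")] * 7
dirty_record = ("10001", "Boston")     # rare city for this zip code

# A naive predictor learned from the clean partition: counts of city per zip.
model = defaultdict(Counter)
for zip_code, city in clean:
    model[zip_code][city] += 1

def propose_repair(record, model, min_benefit=2.0, change_cost=1.0):
    """Suggest a city update only if its likelihood benefit outweighs its cost."""
    zip_code, city = record
    counts = model[zip_code]
    total = sum(counts.values()) or 1
    p_current = counts[city] / total
    best_city, best_count = counts.most_common(1)[0]
    p_best = best_count / total
    benefit = (p_best - p_current) / max(p_current, 1e-6)  # crude likelihood gain
    return (zip_code, best_city) if benefit / change_cost >= min_benefit else record

print(propose_repair(dirty_record, model))   # ('10001', 'New York')
```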
International Conference on Data Engineering | 2011
Laure Berti-Equille; Tamraparni Dasu; Divesh Srivastava
Quantitative Data Cleaning (QDC) is the use of statistical and other analytical techniques to detect, quantify, and correct data quality problems (or glitches). Current QDC approaches focus on addressing each category of data glitch individually. However, in real-world data, different types of data glitches co-occur in complex patterns. These patterns and interactions between glitches offer valuable clues for developing effective domain-specific quantitative cleaning strategies. In this paper, we address the shortcomings of the extant QDC methods by proposing a novel framework, the DEC (Detect-Explore-Clean) framework. It is a comprehensive approach for the definition, detection and cleaning of complex, multi-type data glitches. We exploit the distributions and interactions of different types of glitches to develop data-driven cleaning strategies that may offer significant advantages over blind strategies. The DEC framework is a statistically rigorous methodology for evaluating and scoring glitches and selecting the quantitative cleaning strategies that result in cleaned data sets that are statistically proximal to user specifications. We demonstrate the efficacy and scalability of the DEC framework on very large real-world and synthetic data sets.
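The "explore" step of looking at how glitches co-occur can be sketched as follows, assuming only two toy glitch types (missing values and robust-MAD outliers) on a single attribute; the statistical scoring and strategy selection of the actual DEC framework are not reproduced here.

```python
import statistics
from collections import Counter

# Toy records: (sensor_id, temperature).  None marks a missing reading.
records = [
    ("a", 21.0), ("b", 22.5), ("c", None), ("d", 250.0),   # "d" is a gross outlier
    ("e", 20.8), ("c", None), ("f", None),
]

values = [v for _, v in records if v is not None]
med = statistics.median(values)
mad = statistics.median(abs(v - med) for v in values) or 1.0   # robust spread

def glitch_signature(value):
    """Set of glitch types a record exhibits (only two toy types here)."""
    glitches = set()
    if value is None:
        glitches.add("missing")
    elif abs(value - med) / mad > 3.5:
        glitches.add("outlier")
    return frozenset(glitches)

# DEC's "explore" idea in miniature: tabulate which glitch patterns occur
# (with several attributes per record, patterns can mix glitch types),
# then choose a cleaning strategy per pattern rather than per glitch type.
print(Counter(glitch_signature(v) for _, v in records))
```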
Very Large Data Bases | 2010
Xin Luna Dong; Laure Berti-Equille; Yifan Hu; Divesh Srivastava
We live in the Information Era, with access to a huge amount of information from a variety of data sources. However, data sources are of different qualities, often providing conflicting, out-of-date and incomplete data. Data sources can also easily copy, reformat and modify data from other sources, propagating erroneous data. These issues make the identification of high quality information and sources non-trivial. We demonstrate the Solomon system, whose core is a module that detects copying between sources. We demonstrate that we can effectively detect copying relationship between data sources, leverage the results in truth discovery, and provide a user-friendly interface to facilitate users in identifying sources that best suit their information needs.

We live in the Information Era: the Web has enabled the availability of a huge amount of useful information and eased sharing of data among sources. Despite the richness of information surrounding us, an information user is often overwhelmed by the huge volume of raw, heterogeneous, and even conflicting data. Data sources can be of different qualities, providing information of different levels of accuracy, freshness, and completeness, and data can flow between data sources, being copied, reformatted, verified, and modified. There is an increasing need to help users find the information and the sources that are of highest quality and authority, to help data producers understand how their data are being used (and possibly protect their rights), and to help analysts and auditors understand how information has been disseminated and how rumors have been propagated [1].
Web-Age Information Management | 2013
Xin Luna Dong; Laure Berti-Equille; Divesh Srivastava
Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values and different sources can often provide conflicting values. To present quality data to users, it is critical to resolve conflicts and discover values that reflect the real world; this task is called data fusion. This paper describes a novel approach that finds true values from conflicting information when there are a large number of sources, among which some may copy from others. We present a case study on real-world data showing that the described algorithm can significantly improve accuracy of truth discovery and is scalable when there are a large number of data sources.
Knowledge and Information Systems | 2007
Laure Berti-Equille
The quality of discovered association rules is commonly evaluated by interestingness measures (typically support and confidence) with the purpose of supplying indicators that help the user understand and use the newly discovered knowledge. Low-quality datasets have a strongly negative impact on the quality of the discovered association rules, and one might legitimately wonder whether a so-called “interesting” rule denoted LHS → RHS is meaningful when 30% of the LHS data are no longer up-to-date, 20% of the RHS data are inaccurate, and 15% of the LHS data come from a data source known for its poor credibility. This paper presents an overview of data quality characterization and management techniques that can be advantageously employed to improve the quality awareness of the knowledge discovery and data mining processes. We propose to integrate data quality indicators for quality-aware association rule mining, and we propose a cost-based probabilistic model for selecting legitimately interesting rules. Experiments on the challenging KDD-Cup-98 datasets show that variations in data quality have a great impact on the cost and quality of discovered association rules, and they confirm the benefit of integrating data quality indicators into the KDD process to ensure the quality of data mining results.
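One way to picture the idea of folding data quality indicators into rule evaluation is to weight each transaction's contribution to support and confidence by the quality of the items involved. The aggregation below (a pessimistic min over item-level scores) and the toy data are illustrative assumptions, not the paper's cost-based probabilistic model.

```python
# Toy transactions with per-item data-quality scores in [0, 1]
# (e.g., combining freshness, accuracy, and source credibility).
transactions = [
    ({"bread", "butter"}, {"bread": 0.9, "butter": 0.8}),
    ({"bread", "butter"}, {"bread": 0.5, "butter": 0.4}),   # low-quality evidence
    ({"bread", "milk"},   {"bread": 0.9, "milk": 0.9}),
]

def q_support(itemset, transactions):
    """Support where each transaction counts in proportion to its data quality."""
    weight = sum(min(quality[i] for i in itemset)        # pessimistic aggregation
                 for items, quality in transactions if itemset <= items)
    return weight / len(transactions)

def q_confidence(lhs, rhs, transactions):
    return q_support(lhs | rhs, transactions) / q_support(lhs, transactions)

print(q_support({"bread", "butter"}, transactions))       # ~0.40 (classic: 0.67)
print(q_confidence({"bread"}, {"butter"}, transactions))   # ~0.52 (classic: 0.67)
```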
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | 2015
Andrés Troya-Galvis; Pierre Gançarski; Nicolas Passat; Laure Berti-Equille
Object-based image analysis (OBIA) has been widely adopted as a common paradigm to deal with very high-resolution remote sensing images. Nevertheless, OBIA methods strongly depend on the results of image segmentation. Many segmentation quality metrics have been proposed. Supervised metrics give accurate quality estimation but require a ground-truth segmentation as reference. Unsupervised metrics only make use of intrinsic image and segment properties; yet most of them strongly depend on the application and do not deal well with the variability of objects in remote sensing images. Furthermore, the few metrics developed in a remote sensing context mainly focus on global evaluation. In this paper, we propose a novel unsupervised metric, which evaluates local quality (per segment) by analyzing segment neighborhood, thus quantifying under- and over-segmentation given a certain homogeneity criterion. Additionally, we propose two variants of this metric, for estimating global quality of remote sensing image segmentation by the aggregation of local quality scores. Finally, we analyze the behavior of the proposed metrics and validate their applicability for finding segmentation results that offer a good tradeoff between both kinds of errors.
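A very rough sketch of a per-segment, neighborhood-based quality check, under a simple homogeneity criterion (intensity standard deviation with an arbitrary tolerance): a segment is suspected under-segmented if it is internally heterogeneous, and over-segmented if merging it with a neighbor stays homogeneous. This is not the metric proposed in the paper, only an illustration of the local evaluation idea; the image, label map, and tolerance are invented.

```python
import numpy as np

# Toy single-band image and a candidate segmentation (label map).
image = np.array([[10, 10, 50, 50],
                  [10, 10, 50, 50],
                  [10, 30, 52, 50],
                  [10, 30, 52, 50]], dtype=float)
labels = np.array([[1, 1, 2, 2],
                   [1, 1, 2, 2],
                   [1, 3, 4, 2],
                   [1, 3, 4, 2]])   # segments 2 and 4 split a near-homogeneous region

def neighbors(labels):
    """Pairs of adjacent segment labels (4-connectivity)."""
    pairs = set()
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        for x, y in zip(a.ravel(), b.ravel()):
            if x != y:
                pairs.add((min(x, y), max(x, y)))
    return pairs

def local_quality(image, labels, seg, homogeneity=np.std, tol=2.0):
    """+1 = under-segmentation suspected (segment too heterogeneous),
    -1 = over-segmentation suspected (a merge stays homogeneous), 0 = acceptable."""
    inside = image[labels == seg]
    if homogeneity(inside) > tol:
        return +1
    for a, b in neighbors(labels):
        if seg in (a, b):
            other = b if a == seg else a
            merged = image[(labels == seg) | (labels == other)]
            if homogeneity(merged) <= tol:
                return -1
    return 0

for seg in np.unique(labels):
    print(seg, local_quality(image, labels, seg))   # flags segments 2 and 4
```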
Data Integration in the Life Sciences | 2005
Emilie Guérin; Gwenaëlle Marquet; Anita Burgun; Olivier Loréal; Laure Berti-Equille; Ulf Leser; Fouzia Moussouni
Researchers at the medical research institute Inserm U522, which specializes in the liver, use high-throughput technologies to diagnose liver disease states. They seek to identify the set of dysregulated genes in different physiopathological situations, along with the molecular regulation mechanisms involved in the occurrence of these diseases, leading in the mid-term to new diagnostic and therapeutic tools. To address such a complex question, one has to consider both the data generated on these genes by in-house transcriptome experiments and the annotations extracted from the many publicly available heterogeneous resources in biomedicine. This paper presents GEDAW, a gene expression data warehouse that has been developed to assist such discovery processes. The distinctive feature of GEDAW is that it systematically integrates gene information from a multitude of structured data sources. Data sources include: i) XML records of GENBANK to annotate gene sequence features, integrated using a schema mapping approach; ii) an in-house relational database that stores detailed experimental data on the liver genes and is a permanent source for providing expression levels to the warehouse without unnecessary details on the experiments; and iii) a semi-structured data source called BioMeKE-XML that provides, for each gene, its nomenclature, its functional annotation according to the Gene Ontology, and its medical annotation according to the UMLS. Because GEDAW is a liver gene expression data warehouse, particular attention has been paid to medical knowledge, in order to correlate biological mechanisms and medical knowledge with experimental data. The paper discusses the data sources and the transformation process applied to resolve syntactic and semantic conflicts between the source formats and the GEDAW schema.
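The kind of XML-to-warehouse schema mapping described above can be hinted at with a toy sketch: a simplified, made-up XML gene record (the real GENBANK XML format and the GEDAW schema are far richer) is mapped to the columns of a small warehouse table. The tag names, table layout, and gene annotation below are assumptions made purely for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A simplified, made-up XML record standing in for a GENBANK-style gene entry.
record = """
<gene>
  <symbol>HFE</symbol>
  <organism>Homo sapiens</organism>
  <go_annotation>iron ion homeostasis</go_annotation>
</gene>
"""

def map_record(xml_text):
    """Toy schema mapping: XML elements -> columns of a warehouse table."""
    root = ET.fromstring(xml_text)
    return (root.findtext("symbol"),
            root.findtext("organism"),
            root.findtext("go_annotation"))

# Load the mapped tuple into a (toy) warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE gene (symbol TEXT, organism TEXT, go_term TEXT)")
warehouse.execute("INSERT INTO gene VALUES (?, ?, ?)", map_record(record))
print(warehouse.execute("SELECT * FROM gene").fetchall())
```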