Featured Research


Data Quality Certification using ISO/IEC 25012: Industrial Experiences

The most successful organizations in the world are data-driven businesses. Data is at the core of the business of many organizations as one of their most important assets, since the decisions they make cannot be better than the data on which they are based. For this reason, organizations need to be able to trust their data. One important activity that helps to achieve data reliability is the evaluation and certification of the quality level of organizational data repositories. This paper describes the results of applying a data quality evaluation and certification process to the repositories of three European organizations belonging to different sectors. We present findings from the point of view of both the data quality evaluation team and the organizations that underwent the evaluation process. In this respect, the organizations involved explicitly recognised several benefits after achieving data quality certification for their repositories (e.g., long-term organizational sustainability, better internal knowledge of data, and more efficient management of data quality). As a result of this experience, we have also identified a set of best practices aimed at enhancing the data quality evaluation process.

Data Series Indexing Gone Parallel

Data series similarity search is a core operation for several data series analysis applications across many different domains. However, state-of-the-art techniques fail to deliver the time performance required for interactive exploration or analysis of large data series collections. In this Ph.D. work, we present the first data series indexing solutions, for both on-disk and in-memory data, that are designed to inherently take advantage of multi-core architectures in order to accelerate similarity search processing times. Our experiments on a variety of synthetic and real data demonstrate that our approaches are up to orders of magnitude faster than the alternatives. More specifically, our on-disk solution can answer exact similarity search queries on 100GB datasets in a few seconds, and our in-memory solution in a few milliseconds, which enables real-time, interactive data exploration on very large data series collections.

Data Structure Primitives on Persistent Memory: An Evaluation

Persistent Memory (PMem), as already available, e.g., with Intel Optane DC Persistent Memory, represents a very promising, next-generation memory solution with a significant impact on database architectures. Several data structures for this new technology and its properties have already been proposed. However, mostly only complete structures are presented and evaluated, so the implications of the individual ideas and PMem features are concealed. Therefore, in this paper, we disassemble the structures presented so far, identify their underlying design primitives, and assign them to appropriate design goals regarding PMem. As a result of our comprehensive experiments on real PMem hardware, we can reveal the trade-offs of the primitives for various access patterns. This allowed us to pinpoint their best use cases as well as vulnerabilities. Besides our general insights regarding PMem-based data structure design, we also discovered new combinations not examined in the literature so far.

Data Warehouse and Decision Support on Integrated Crop Big Data

In recent years, precision agriculture has become very popular. The introduction of modern information and communication technologies for collecting and processing agricultural data is revolutionising agricultural practices. This started a while ago (early 20th century) and is driven by the low cost of collecting data about everything, from information on fields such as seed, soil, fertiliser, and pests, to weather data and drone and satellite images. In particular, agricultural data mining today is considered a Big Data application in terms of volume, variety, velocity and veracity. It therefore poses challenges in processing vast amounts of complex and diverse information to extract useful knowledge for farmers, agronomists, and other businesses. Doing so is a key foundation for establishing a crop intelligence platform, which will enable efficient resource management and high-quality agronomy decision making and recommendations. In this paper, we design and implement a continental-level agricultural data warehouse (ADW). ADW is characterised by its (1) flexible schema; (2) data integration from multiple real agricultural datasets; (3) data science and business intelligence support; (4) high performance; (5) high storage; (6) security; (7) governance and monitoring; (8) consistency, availability and partition tolerance; (9) cloud compatibility. We also evaluate the performance of ADW and present some complex queries to extract and return necessary knowledge about crop management.

Data mining and time series segmentation via extrema: preliminary investigations

Time series segmentation is one of the many data mining tools. This paper, in French, takes local extrema as perceptually interesting points (PIPs). The blurring of those PIPs by the quick fluctuations around any time series is treated via an additive decomposition theorem, due to Cartier and Perrin, and algebraic estimation techniques, which are already useful in automatic control and signal processing. Our approach is validated by several computer illustrations. They underline the importance of the choice of a threshold for the extrema detection.
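To make the role of the detection threshold concrete, here is a minimal sketch in Python. It is not the method of the paper (which relies on the Cartier-Perrin decomposition and algebraic estimation); it only assumes a naive rule in which a point counts as an extremum when it differs from both neighbours by more than a threshold. The function name `local_extrema` and the sample data are hypothetical.

```python
def local_extrema(series, threshold=0.0):
    """Return (index, kind) pairs for interior points that exceed both
    neighbours by more than `threshold` ("max") or fall below both
    neighbours by more than `threshold` ("min").

    Naive neighbour-difference rule, used for illustration only.
    """
    found = []
    for i in range(1, len(series) - 1):
        left = series[i] - series[i - 1]
        right = series[i] - series[i + 1]
        if left > threshold and right > threshold:
            found.append((i, "max"))
        elif left < -threshold and right < -threshold:
            found.append((i, "min"))
    return found


# Hypothetical series with a small wiggle around index 2 and a deep dip at index 5.
samples = [0.0, 1.0, 0.9, 1.0, 0.0, -2.0, 0.0]
all_extrema = local_extrema(samples, threshold=0.0)
large_extrema = local_extrema(samples, threshold=0.5)
```

With a zero threshold the small wiggle is reported as extrema; raising the threshold keeps only the deep dip, which illustrates why the choice of threshold matters for segmentation.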

Data provenance, curation and quality in metrology

Data metrology -- the assessment of the quality of data -- particularly in scientific and industrial settings, has emerged as an important requirement for the UK National Physical Laboratory (NPL) and other national metrology institutes. Data provenance and data curation are key components of the emerging understanding of data metrology. However, to date provenance research has had limited visibility or uptake in metrology. In this work, we summarize a scoping study carried out with NPL staff and industrial participants to understand their current and future needs for provenance, curation and data quality. We then survey provenance technology and standards that are relevant to metrology. Finally, we analyse the gaps between requirements and the current state of the art.

DataFed: Towards Reproducible Research via Federated Data Management

The increasingly collaborative, globalized nature of scientific research, combined with the need to share data and the explosion in data volumes, presents an urgent need for a scientific data management system (SDMS). An SDMS presents a logical and holistic view of data that greatly simplifies and empowers data organization, curation, searching, sharing, dissemination, etc. We present DataFed -- a lightweight, distributed SDMS that spans a federation of storage systems within a loosely-coupled network of scientific facilities. Unlike existing SDMS offerings, DataFed uses high-performance and scalable user management and data transfer technologies that simplify deployment, maintenance, and expansion of DataFed. DataFed provides web-based and command-line interfaces to manage data and integrate with complex scientific workflows. DataFed represents a step towards reproducible scientific research by enabling reliable staging of the correct data in the desired environment.

Database Optimization to Recommend Software Developers using Canonical Order Tree

Recently, frequent and sequential pattern mining algorithms have been widely used in software engineering to mine various source code or specification patterns. In practice, software evolves from one version to another in order to provide extra facilities to users. Mining in this domain is challenging because the database is usually updated in all kinds of ways, such as insertion, various modifications, and removal of sequences. If the database is optimized, the optimized information will help developers in their development process and save their valuable time as well as development expenses. Some existing algorithms can be used to optimize the database, but they do not work fast when the database is incrementally updated. To overcome this challenge, an efficient algorithm was recently introduced, called the Canonical Order Tree, which captures the content of the transactions of the database and arranges it according to a canonical order. In this paper, we propose a technique based on the Canonical Order Tree that can find frequent patterns in the incremental database quickly and efficiently. The database is thus optimized and yields useful information for recommending software developers.

Database Repairing with Soft Functional Dependencies

A common interpretation of soft constraints penalizes the database for every violation of every constraint, where the penalty is the cost (weight) of the constraint. A computational challenge is that of finding an optimal subset: a collection of database tuples that minimizes the total penalty when each tuple has a cost of being excluded. When the constraints are strict (i.e., have an infinite cost), this subset is a "cardinality repair" of an inconsistent database; in soft interpretations, this subset corresponds to a "most probable world" of a probabilistic database, a "most likely intention" of a probabilistic unclean database, and so on. Within the class of functional dependencies, the complexity of finding a cardinality repair is thoroughly understood. Yet, very little is known about the complexity of this problem in the more general soft semantics. This paper makes significant progress in this direction. In addition to general insights about the hardness and approximability of the problem, we present algorithms for two special cases: a single functional dependency, and a bipartite matching. The latter is the problem of finding an optimal "almost matching" of a bipartite graph where a penalty is paid for every lost edge and every violation of monogamy.
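The objective described above can be illustrated with a small brute-force sketch in Python. It is not one of the paper's algorithms: it enumerates every subset, so it is only usable on toy inputs. It assumes that a violation is counted once per pair of kept tuples that agree on the left-hand side of a functional dependency and disagree on the right-hand side. The names `violates`, `total_penalty`, and `best_subset`, and the sample table, are hypothetical.

```python
from itertools import combinations


def violates(t1, t2, lhs, rhs):
    """True if the two tuples agree on every lhs attribute but differ on some rhs attribute."""
    return all(t1[a] == t2[a] for a in lhs) and any(t1[b] != t2[b] for b in rhs)


def total_penalty(kept, rows, fds, delete_cost):
    """Exclusion cost of dropped tuples plus the weight of each violated FD per violating pair."""
    dropped = sum(delete_cost[i] for i in range(len(rows)) if i not in kept)
    soft = sum(
        weight
        for lhs, rhs, weight in fds
        for i, j in combinations(sorted(kept), 2)
        if violates(rows[i], rows[j], lhs, rhs)
    )
    return dropped + soft


def best_subset(rows, fds, delete_cost):
    """Exhaustively search all subsets of row indices; exponential, for illustration only."""
    best_cost, best_kept = None, None
    for k in range(len(rows) + 1):
        for kept in combinations(range(len(rows)), k):
            cost = total_penalty(set(kept), rows, fds, delete_cost)
            if best_cost is None or cost < best_cost:
                best_cost, best_kept = cost, set(kept)
    return best_cost, best_kept


# Hypothetical table: rows 0 and 1 violate the dependency emp -> dept.
rows = [
    {"emp": "a", "dept": "x"},
    {"emp": "a", "dept": "y"},
    {"emp": "b", "dept": "x"},
]
delete_cost = [1.0, 1.0, 1.0]

# Expensive constraint: dropping one conflicting tuple is cheaper than paying the penalty.
strict_cost, strict_kept = best_subset(rows, [(("emp",), ("dept",), 5.0)], delete_cost)
# Cheap constraint: keeping every tuple and paying the penalty is cheaper.
soft_cost, soft_kept = best_subset(rows, [(("emp",), ("dept",), 0.5)], delete_cost)
```

As the constraint weight grows, the optimum moves from tolerating the violation towards deleting a tuple, which matches the observation that strict constraints reduce the problem to a cardinality repair.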

Dataset Definition Standard (DDS)

This document gives a set of recommendations for building and manipulating the datasets used to develop and/or validate machine learning models such as deep neural networks. It is one of the 3 documents defined in [1] to ensure the quality of datasets. This is a work in progress, as good practices evolve along with our understanding of machine learning. The document is divided into three main parts. Section 2 addresses the data collection activity. Section 3 gives recommendations about the annotation process. Finally, Section 4 gives recommendations concerning the breakdown between train, validation, and test datasets. In each part, we first define the desired properties at stake, then explain the objectives targeted to meet those properties, and finally state the recommendations to reach these objectives.
