Publication


Featured research published by Maurice van Keulen.


International Conference on Management of Data | 2006

MonetDB/XQuery: a fast XQuery processor powered by a relational engine

Peter A. Boncz; Torsten Grust; Maurice van Keulen; Stefan Manegold; Jan Rittinger; Jens Teubner

Relational XQuery systems try to re-use mature relational data management infrastructures to create fast and scalable XML database technology. This paper describes the main features, key contributions, and lessons learned while implementing such a system. Its architecture consists of (i) a range-based encoding of XML documents into relational tables, (ii) a compilation technique that translates XQuery into a basic relational algebra, (iii) a restricted (order) property-aware peephole relational query optimization strategy, and (iv) a mapping from XML update statements into relational updates. Thus, this system implements all essential XML database functionalities (rather than a single feature) such that we can learn from the full consequences of our architectural decisions. While implementing this system, we had to extend the state of the art with a number of new technical contributions, such as loop-lifted staircase join and efficient relational query evaluation strategies for XQuery theta-joins with existential semantics. These contributions as well as the architectural lessons learned are also deemed valuable for other relational back-end engines. The performance and scalability of the resulting system are evaluated on the XMark benchmark up to data sizes of 11GB. The performance section also provides an extensive benchmark comparison of all major XMark results published previously, which confirms that the goal of purely relational XQuery processing, namely speed and scalability, was met.
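
As an illustration of ingredient (i) above, the following Python sketch (not the MonetDB/XQuery code; the column layout is chosen for exposition and may differ from the system's actual encoding) shows how a range-based encoding turns an XML document into flat tuples: each element carries its preorder rank, the size of its subtree, its depth, and its tag, so tree structure becomes plain integer data that a relational engine can store and index.

    # Hypothetical illustration of a range-based XML encoding:
    # each element node becomes one tuple (pre, size, level, tag).
    import xml.etree.ElementTree as ET

    def encode(xml_text):
        root = ET.fromstring(xml_text)
        rows = []

        def visit(node, level):
            pre = len(rows)
            rows.append(None)        # reserve the slot; size is known only after recursion
            size = 0
            for child in node:
                size += 1 + visit(child, level + 1)
            rows[pre] = (pre, size, level, node.tag)
            return size

        visit(root, 0)
        return rows

    for row in encode("<a><b><c/></b><d/></a>"):
        print(row)   # (0, 3, 0, 'a'), (1, 1, 1, 'b'), (2, 0, 2, 'c'), (3, 0, 1, 'd')

With tuples of this shape, structural predicates such as "descendant of" reduce to integer range comparisons on the pre and size columns, which is what lets a relational back-end evaluate the algebra produced in step (ii).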


Very Large Data Bases | 2003

Staircase join: teach a relational DBMS to watch its (axis) steps

Torsten Grust; Maurice van Keulen; Jens Teubner

Relational query processors derive much of their effectiveness from the awareness of specific table properties like sort order, size, or absence of duplicate tuples. This text applies (and adapts) this successful principle to database-supported XML and XPath processing: the relational system is made tree aware, i.e., tree properties like subtree size, intersection of paths, and inclusion or disjointness of subtrees are made explicit. We propose a local change to the database kernel, the staircase join, which encapsulates the tree knowledge needed to improve XPath performance. Staircase join operates on an XML encoding which makes this knowledge available at the cost of simple integer operations (e.g., +, ≤). We finally report on quite promising experiments with a staircase-join-enhanced main-memory database kernel.
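
As a minimal sketch of the pruning idea behind staircase join, assuming a (pre, size) range encoding like the one illustrated above (this is a toy illustration of the principle, not the kernel operator from the paper): context nodes are visited in document order, regions already covered by an earlier context node are skipped, and each axis window is just a pair of integer bounds.

    # Toy staircase-style descendant step over a (pre, size) encoding.
    # 'context' is a document-ordered list of (pre, size) pairs;
    # 'doc' is the full node table, e.g. rows (pre, size, level, tag).
    def staircase_descendant(context, doc):
        result, horizon = [], -1
        for pre, size in context:
            lo = max(pre + 1, horizon + 1)   # skip regions an earlier context node already covered
            hi = pre + size                  # preorder rank of this node's last descendant
            result.extend(node for node in doc if lo <= node[0] <= hi)
            horizon = max(horizon, hi)
        return result

A real implementation scans the encoded document once rather than re-filtering it per context node, but the single 'horizon' bookkeeping captures the kind of tree knowledge the operator encapsulates: no result region is produced twice, and no duplicate elimination is needed afterwards.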


ACM Transactions on Database Systems | 2004

Accelerating XPath evaluation in any RDBMS

Torsten Grust; Maurice van Keulen; Jens Teubner

This article is a proposal for a database index structure, the XPath accelerator, that has been specifically designed to support the evaluation of XPath path expressions. As such, the index is capable of supporting all XPath axes (including ancestor, following, preceding-sibling, descendant-or-self, etc.). This feature lets the index stand out among related work on XML indexing structures, which has focused on the child and descendant axes only. The index has been designed with a close eye on the XPath semantics as well as the desire to engineer its internals so that it can be supported well by existing relational database query processing technology: the index (a) permits set-oriented (or, rather, sequence-oriented) path evaluation, and (b) can be implemented and queried using well-established relational index structures, notably B-trees and R-trees. We discuss the implementation of the XPath accelerator on top of different database backends and show that the index performs well on all levels of the memory hierarchy, including disk-based and main-memory-based database systems.
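
The sketch below (Python with SQLite; the table and column names are invented for illustration and are not the article's schema) conveys the pre/post-plane idea: with a preorder and a postorder rank stored per node, each major XPath axis of a context node corresponds to a rectangular window, so an axis step becomes an ordinary range query over B-tree-indexed columns.

    # Hypothetical pre/post table for the document <a><b><c/></b><d/></a>.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accel (pre INTEGER PRIMARY KEY, post INTEGER, tag TEXT)")
    db.executemany("INSERT INTO accel VALUES (?, ?, ?)",
                   [(0, 3, "a"), (1, 1, "b"), (2, 0, "c"), (3, 2, "d")])
    db.execute("CREATE INDEX accel_post ON accel(post)")

    WINDOWS = {                          # axis -> region relative to the context node
        "descendant": "pre > ? AND post < ?",
        "ancestor":   "pre < ? AND post > ?",
        "following":  "pre > ? AND post > ?",
        "preceding":  "pre < ? AND post < ?",
    }

    def axis_step(ctx_pre, ctx_post, axis):
        sql = "SELECT pre, post, tag FROM accel WHERE " + WINDOWS[axis] + " ORDER BY pre"
        return db.execute(sql, (ctx_pre, ctx_post)).fetchall()

    print(axis_step(0, 3, "descendant"))   # [(1, 1, 'b'), (2, 0, 'c'), (3, 2, 'd')]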


Very Large Data Bases | 2009

Qualitative effects of knowledge rules and user feedback in probabilistic data integration

Maurice van Keulen; Ander de Keijzer

In data integration efforts, portal development in particular, much development time is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates or solve other semantic conflicts. It proves impossible, however, to automatically get rid of all semantic problems. An often-used rule of thumb states that about 90% of the development effort is devoted to semi-automatically resolving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that strives for a ‘good enough’ initial integration which stores any remaining semantic uncertainty and conflicts in a probabilistic database. The remaining cases are to be resolved with user feedback during query time. The main contribution of this paper is an experimental investigation of the effects and sensitivity of rule definition, threshold tuning, and user feedback on the integration quality. We claim that our approach indeed reduces development effort—and not merely shifts the effort—by showing that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ initial integration that can be meaningfully used, and that user feedback is effective in gradually improving the integration quality.
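
As a schematic illustration of the ‘good enough’ integration strategy (own toy code, not the paper's system; the similarity function and threshold values are placeholders), the sketch below shows how two rough, safe thresholds split candidate duplicate pairs into certain matches, certain non-matches, and uncertain pairs whose alternatives are stored with probabilities for later resolution through user feedback.

    from difflib import SequenceMatcher

    T_MATCH, T_NONMATCH = 0.9, 0.5        # rough, "safe" thresholds

    def resolve(record_a, record_b):
        # Returns a list of possible worlds (decision, probability).
        sim = SequenceMatcher(None, record_a, record_b).ratio()
        if sim >= T_MATCH:
            return [("same entity", 1.0)]          # confidently merged
        if sim <= T_NONMATCH:
            return [("distinct entities", 1.0)]    # confidently kept apart
        # hard case: keep both interpretations in the probabilistic database
        return [("same entity", sim), ("distinct entities", 1.0 - sim)]

    print(resolve("M. van Keulen", "Maurice van Keulen"))

Query-time user feedback then collapses the stored alternatives of a pair to a single interpretation, which is how the integration quality improves gradually without up-front manual resolution of every hard case.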


Lecture Notes in Computer Science | 2003

Tree Awareness for Relational DBMS Kernels: Staircase Join

Torsten Grust; Maurice van Keulen

Relational database management systems (RDBMSs) derive much of their efficiency from the versatility of their core data structure: tables of tuples. Such tables are simple enough to allow for an efficient representation on all levels of the memory hierarchy, yet sufficiently generic to host a wide range of data types. If one can devise mappings from a data type τ to tables and from operations on τ to relational queries, an RDBMS may be a premier implementation alternative. Temporal intervals, complex nested objects, and spatial data are sample instances of such types τ.
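
As a small illustration of this mapping principle, using temporal intervals, one of the sample types named above (the table, column names, and data are invented): an interval becomes a (lo, hi) row, and the overlaps operation on intervals becomes a plain relational predicate.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE booking (id INTEGER PRIMARY KEY, lo INTEGER, hi INTEGER)")
    db.executemany("INSERT INTO booking VALUES (?, ?, ?)",
                   [(1, 10, 20), (2, 15, 25), (3, 30, 40)])

    # overlaps(a, b)  <=>  a.lo < b.hi AND b.lo < a.hi
    print(db.execute("""
        SELECT a.id, b.id
        FROM booking a JOIN booking b ON a.id < b.id
        WHERE a.lo < b.hi AND b.lo < a.hi
    """).fetchall())   # [(1, 2)]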


Database and Expert Systems Applications | 2005

Formalizing the XML schema matching problem as a constraint optimization problem

Marko Smiljanic; Maurice van Keulen; Willem Jonker

The first step in finding an efficient way to solve any difficult problem is making a complete, possibly formal, problem specification. This paper introduces a formal specification for the problem of semantic XML schema matching. Semantic schema matching has been extensively researched, and many matching systems have been developed. However, formal specifications of problems being solved by these systems do not exist, or are partial. In this paper, we analyze the problem of semantic schema matching, identify its main components and deliver a formal specification based on the constraint optimization problem formalism. Throughout the paper, we consider the schema matching problem as encountered in the context of a large scale XML schema matching application.
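
A toy rendering of such a formalization (own sketch, not the paper's formalism; the schema element names and similarity function are placeholders): the variables are source schema elements, their domains are candidate target elements, a constraint enforces an injective mapping, and the objective maximizes aggregate name similarity.

    from difflib import SequenceMatcher
    from itertools import permutations

    source = ["author", "title", "year"]
    target = ["creator", "name", "publicationYear", "isbn"]

    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    best_score, best_match = -1.0, None
    for assignment in permutations(target, len(source)):   # constraint: injective mapping
        score = sum(sim(s, t) for s, t in zip(source, assignment))  # objective to maximize
        if score > best_score:
            best_score, best_match = score, dict(zip(source, assignment))

    print(best_match, round(best_score, 2))

Brute-force enumeration is only feasible for tiny schemas; the value of the constraint-optimization view is precisely that large-scale instances can then be handed to dedicated solvers and search heuristics.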


2nd IFIP WG 2.6, 2.12 International Symposium on Data-Driven Process Discovery and Analysis, SIMPDA 2012 | 2012

Process Prediction in Noisy Data Sets: A Case Study in a Dutch Hospital

Sjoerd van der Spoel; Maurice van Keulen; Chintan Amrit

Predicting the amount of money that can be claimed is critical to the effective running of a hospital. In this paper we describe a case study of a Dutch hospital where we use process mining to predict the hospital's cash flow. In order to predict the cost of a treatment, we use different data mining techniques to predict the sequence of treatments administered, the duration, and the final "care product" or diagnosis of the patient. While performing the data analysis we encountered three specific kinds of noise that we call sequence noise, human noise, and duration noise. Studies in the past have discussed ways to reduce the noise in process data. However, it is not very clear what effect the noise has on different kinds of process analysis. In this paper we describe the combined effect of sequence noise, human noise, and duration noise on the analysis of process data by comparing the performance of several mining techniques on the data.
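
The study compares several mining techniques; purely as an illustration of the sequence-prediction subtask on event data (the treatment codes and traces below are invented, not taken from the case study), a first-order Markov predictor of the next treatment could look like this:

    from collections import Counter, defaultdict

    logs = [                                   # toy event log: one trace per patient
        ["intake", "scan", "surgery", "checkup"],
        ["intake", "scan", "medication", "checkup"],
        ["intake", "medication", "checkup"],
    ]

    transitions = defaultdict(Counter)
    for trace in logs:
        for current, nxt in zip(trace, trace[1:]):
            transitions[current][nxt] += 1

    def predict_next(treatment):
        counts = transitions[treatment]
        return counts.most_common(1)[0][0] if counts else None

    print(predict_next("intake"))              # 'scan' (observed twice vs. 'medication' once)

Sequence noise, human noise, and duration noise all distort the transition counts such a model learns from, which is the kind of degradation the paper studies across different analysis techniques.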


Scalable Uncertainty Management | 2007

Quality Measures in Uncertain Data Management

Ander de Keijzer; Maurice van Keulen

Many applications deal with data that is uncertain. Some examples are applications dealing with sensor information, data integration applications and healthcare applications. Instead of these applications having to deal with the uncertainty, it should be the responsibility of the DBMS to manage all data including uncertain data. Several projects do research on this topic. In this paper, we introduce four measures to be used to assess and compare important characteristics of data and systems.


Very Large Data Bases | 2004

An injection with tree awareness: adding staircase join to postgreSQL

Sabine Mayer; Torsten Grust; Maurice van Keulen; Jens Teubner

The syntactic well-formedness constraints of XML (opening and closing tags nest properly) imply that XML processors face the challenge of efficiently handling data that takes the shape of ordered, unranked trees. Although RDBMSs have originally been designed to manage table-shaped data, we propose their use as XML and XPath processors. In our setup, the database system employs a relational XML document encoding, the XPath accelerator [1], which maps information about the XML node hierarchy to a table, thus making it possible to evaluate XPath expressions on SQL hosts. Conventional RDBMSs, nevertheless, remain ignorant of many interesting properties of the encoded tree data, and were thus found to make no or poor use of these properties. This is why we devised a new join algorithm, staircase join [2], which incorporates the tree-specific knowledge required for an efficient SQL-based evaluation of XPath expressions. In a sense, this demonstration delivers on the promise we made at VLDB 2003 [2]: a notion of tree awareness can be injected into a conventional disk-based RDBMS kernel in terms of staircase join. The demonstration features a side-by-side comparison of an original and a staircase-join-enhanced instance of PostgreSQL [4]. The required changes to PostgreSQL were local; the achieved effect, however, is significant: the demonstration proves that this injection of tree awareness turns PostgreSQL into a high-performance XML processor that closely adheres to the XPath semantics.


Information Technology | 2012

Managing Uncertainty: The Road Towards Better Data Interoperability

Maurice van Keulen

Data interoperability encompasses the many data management activities needed for effective information management in anyone’s or any organization’s everyday work, such as data cleaning, coupling, fusion, mapping, and information extraction. It is our conviction that a significant amount of the money and time in IT that is devoted to these activities is about dealing with one problem: “semantic uncertainty”. Sometimes data is subjective, incomplete, not current, or incorrect; sometimes it can be interpreted in different ways; etc. In our opinion, clean correct data is only a special case, hence data management technology should treat data quality problems as a fact of life, not as something to be repaired afterwards. Recent approaches treat uncertainty as an additional source of information which should be preserved to reduce its impact. We believe that the road towards better data interoperability is to be found in teaching our data processing tools and systems about all forms of doubt and how to live with them. In this paper, we show for several data interoperability use cases (deduplication, data coupling/fusion, and information extraction) how to formally model the associated data quality problems as semantic uncertainty. Furthermore, we provide an argument why our approach leads to better data interoperability in terms of natural problem exposure and risk assessment, more robustness and automation, reduced development costs, and potential for natural and effective feedback loops leveraging human attention.

Collaboration


Top co-authors of Maurice van Keulen.

Jens Teubner

Technical University of Dortmund
