Publication


Featured research published by Ander de Keijzer.


Very Large Data Bases | 2009

Qualitative effects of knowledge rules and user feedback in probabilistic data integration

Maurice van Keulen; Ander de Keijzer

In data integration efforts, portal development in particular, much development time is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates or solve other semantic conflicts. It proves impossible, however, to automatically get rid of all semantic problems. An often-used rule of thumb states that about 90% of the development effort is devoted to semi-automatically resolving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that strives for a ‘good enough’ initial integration which stores any remaining semantic uncertainty and conflicts in a probabilistic database. The remaining cases are to be resolved with user feedback during query time. The main contribution of this paper is an experimental investigation of the effects and sensitivity of rule definition, threshold tuning, and user feedback on the integration quality. We claim that our approach indeed reduces development effort—and not merely shifts the effort—by showing that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ initial integration that can be meaningfully used, and that user feedback is effective in gradually improving the integration quality.
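As a minimal, hypothetical sketch of this idea (the names, thresholds, and similarity function are illustrative assumptions, not the paper's implementation): pairs whose similarity clears a safe upper threshold are merged, pairs below a lower threshold are kept apart, and everything in between is stored as a weighted alternative that later user feedback collapses.

```python
# Hypothetical illustration of 'good enough' probabilistic integration.
# Thresholds and the similarity function are assumptions for the sketch only.
from difflib import SequenceMatcher

T_HIGH, T_LOW = 0.9, 0.5   # rough, safe thresholds

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def integrate(rec_a: str, rec_b: str):
    """Return either a certain decision or a probabilistic alternative."""
    s = similarity(rec_a, rec_b)
    if s >= T_HIGH:                      # confident enough: merge
        return [("same", 1.0)]
    if s <= T_LOW:                       # confident enough: keep apart
        return [("distinct", 1.0)]
    # uncertain case: store both possibilities, weighted by similarity
    return [("same", s), ("distinct", 1.0 - s)]

def apply_feedback(alternatives, user_says_same: bool):
    """User feedback at query time collapses the remaining uncertainty."""
    keep = "same" if user_says_same else "distinct"
    return [(world, 1.0) for world, _ in alternatives if world == keep]

if __name__ == "__main__":
    alts = integrate("A. de Keijzer", "Ander de Keijzer")
    print(alts)                                   # uncertain pair: both possibilities kept
    print(apply_feedback(alts, user_says_same=True))
```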


Scalable Uncertainty Management | 2007

Quality Measures in Uncertain Data Management

Ander de Keijzer; Maurice van Keulen

Many applications deal with data that is uncertain. Some examples are applications dealing with sensor information, data integration applications and healthcare applications. Instead of these applications having to deal with the uncertainty, it should be the responsibility of the DBMS to manage all data including uncertain data. Several projects do research on this topic. In this paper, we introduce four measures to be used to assess and compare important characteristics of data and systems.


International Conference on Data Engineering | 2010

Duplicate detection in probabilistic data

Fabian Panse; Maurice van Keulen; Ander de Keijzer; Norbert Ritter

Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities.
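As a rough illustration of the setting (not the paper's techniques): a probabilistic tuple can be modeled as a set of mutually exclusive alternatives with probabilities, and one naive way to compare two such tuples is the expected pairwise similarity over their alternatives.

```python
# Hypothetical sketch: expected similarity between two probabilistic tuples,
# each given as alternatives [(value, probability), ...] summing to 1.
# This only illustrates the problem setting, not the paper's method.
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def expected_similarity(tuple_a, tuple_b) -> float:
    """Similarity of each pair of alternatives, weighted by P(x) * P(y)."""
    return sum(pa * pb * sim(va, vb)
               for va, pa in tuple_a
               for vb, pb in tuple_b)

person_a = [("Jon Smith", 0.7), ("John Smith", 0.3)]
person_b = [("John Smith", 0.6), ("J. Smith", 0.4)]

# If the expected similarity clears a threshold, the two probabilistic
# representations are flagged as candidate duplicates of the same entity.
print(round(expected_similarity(person_a, person_b), 3))
```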


Scalable Uncertainty Management | 2009

Compression of Probabilistic XML Documents

Irma Veldman; Ander de Keijzer; Maurice van Keulen

Database techniques to store, query and manipulate data that contains uncertainty receive increasing research interest. Such uncertain DBMSs (UDBMSs) can be classified according to their underlying data model: relational, XML, or RDF. We focus on uncertain XML DBMSs, with the Probabilistic XML model (PXML) of [10,9] as a representative example. The size of a PXML document is obviously a factor in performance. There are PXML-specific techniques to reduce the size, such as a push-down mechanism that produces equivalent but more compact PXML documents. It can only be applied, however, where possibilities are dependent. For normal XML documents there also exist several techniques for compressing a document. Since Probabilistic XML is (a special form of) normal XML, it might benefit from these methods even more. In this paper, we show that existing compression mechanisms can be combined with PXML-specific compression techniques. We also show that the best compression rates are obtained by combining a PXML-specific technique with a rather simple generic DAG-compression technique.
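The generic DAG-compression technique mentioned above boils down to hash-consing: identical subtrees are stored only once and shared. A minimal, hypothetical Python sketch over plain element trees (the probability annotations of the PXML model are not modeled here):

```python
# Hypothetical sketch of generic DAG compression by sharing identical subtrees
# (hash-consing). Plain element trees only, no probability nodes.
def compress(node, pool=None):
    """node = (tag, [children]); returns a canonical, shared representation."""
    if pool is None:
        pool = {}
    tag, children = node
    shared = tuple(compress(child, pool) for child in children)
    key = (tag, shared)
    # reuse an existing identical subtree instead of storing a copy
    return pool.setdefault(key, key)

def count_stored_nodes(node, seen=None):
    """Count distinct (shared) nodes reachable from the root."""
    if seen is None:
        seen = set()
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_stored_nodes(child, seen) for child in node[1])

person1 = ("person", [("name", []), ("city", [])])
person2 = ("person", [("name", []), ("city", [])])   # equal, but a separate object
doc = ("persons", [person1, person2, person1])

dag = compress(doc)
print(count_stored_nodes(dag))   # 4 shared nodes instead of 10 tree nodes
```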


Knowledge Discovery and Data Mining | 2011

An IFS-based similarity measure to index electroencephalograms

Ghita Berrada; Ander de Keijzer

EEG is a very useful neurological diagnosis tool, inasmuch as the EEG exam is easy to perform and relatively cheap. However, it generates large amounts of data, not easily interpreted by a clinician. Several methods have been tried to automate the interpretation of EEG recordings. However, their results are hard to compare since they are tested on different datasets. This means a benchmark database of EEG data is required. However, for such a database to be useful, we have to solve the problem of retrieving information from the stored EEGs without having to tag each and every EEG sequence stored in the database (which can be a very time-consuming and error-prone process). In this paper, we present a similarity measure, based on iterated function systems, to index EEGs.
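As a very rough, hypothetical illustration of the flavor of such a measure (not the measure from the paper): each signal is summarized by the parameters of affine maps that approximate its segments from a coarse copy of the whole signal, in the spirit of the collage idea behind iterated function systems, and two signals are compared by the distance between those parameter vectors.

```python
# Hypothetical illustration of an IFS-flavored signal signature:
# fit an affine map from a coarse copy of the whole signal onto each of its
# segments and use the map parameters as a feature vector.
# Not the paper's actual similarity measure.
import numpy as np

def ifs_signature(signal: np.ndarray, n_segments: int = 8) -> np.ndarray:
    seg_len = len(signal) // n_segments
    # coarse version of the whole signal, resampled to one segment length
    coarse = np.interp(np.linspace(0, len(signal) - 1, seg_len),
                       np.arange(len(signal)), signal)
    params = []
    for i in range(n_segments):
        seg = signal[i * seg_len:(i + 1) * seg_len]
        # least-squares fit: seg ~ a * coarse + b
        a, b = np.polyfit(coarse, seg, 1)
        params.extend([a, b])
    return np.array(params)

def similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Higher is more similar; simple inverse distance on the signatures."""
    return 1.0 / (1.0 + np.linalg.norm(ifs_signature(x) - ifs_signature(y)))

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 512)
eeg_a = np.sin(3 * t) + 0.1 * rng.standard_normal(512)
eeg_b = np.sin(3 * t) + 0.1 * rng.standard_normal(512)   # similar signal
eeg_c = rng.standard_normal(512)                          # unrelated signal

print(similarity(eeg_a, eeg_b) > similarity(eeg_a, eeg_c))  # expected: True
```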


Soft Computing | 2010

Data Integration Using Uncertain XML

Ander de Keijzer

Data integration has been a challenging problem for decades. In an ambient environment, where many autonomous devices have their own information sources and network connectivity is ad hoc and peer-to-peer, it even becomes a serious bottleneck. In addition, the number of information sources, both per device and in total, keeps increasing. To enable devices to exchange information without the need for interaction with a user at data integration time and without the need for extensive semantic annotations, a probabilistic approach seems rather promising: it simply teaches the device how to cope with the uncertainty occurring during data integration. Unfortunately, without any kind of world knowledge, almost everything becomes uncertain, so maintaining all possibilities produces huge integrated information sources. Automatically integrating data sources, using very simple knowledge rules to rule out most of the nonsense possibilities, storing the remaining possibilities as uncertainty in the database, and resolving these during querying by means of user feedback, seems a promising solution. In this chapter we introduce this “good is good-enough” integration approach and explain the uncertainty model that is used to capture the remaining integration possibilities. We show that using this strategy, the time necessary to integrate documents decreases drastically, while the accuracy of the integrated document increases over time.
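A tiny hypothetical sketch of the underlying idea (records, weights, and the rule are illustrative assumptions): remaining integration alternatives are kept as weighted possibilities, a simple knowledge rule discards the nonsensical ones, and the survivors are renormalized.

```python
# Hypothetical sketch of keeping integration alternatives as weighted
# possibilities and pruning them with a simple knowledge rule.
# Records, weights, and the rule are illustrative assumptions only.

# Two sources disagree on how to merge a person element; each alternative
# is a (record, probability) pair.
alternatives = [
    ({"name": "J. Smith", "age": 34, "city": "Enschede"}, 0.4),
    ({"name": "J. Smith", "age": 340, "city": "Enschede"}, 0.3),   # typo in age
    ({"name": "J. Smith", "age": 34, "city": "Hengelo"}, 0.3),
]

def rule_plausible_age(record) -> bool:
    """Simple knowledge rule: human ages above 130 are nonsense."""
    return record["age"] <= 130

def prune(alternatives, rule):
    kept = [(rec, p) for rec, p in alternatives if rule(rec)]
    total = sum(p for _, p in kept)
    # renormalize the surviving possibilities
    return [(rec, p / total) for rec, p in kept]

for record, prob in prune(alternatives, rule_plausible_age):
    print(f"{prob:.2f}  {record}")
```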


Scalable Uncertainty Management | 2010

Probabilistic data: a tiny survey

Ander de Keijzer

In this survey, we visit existing projects and proposals for managing uncertain data, all of which support probabilistic handling of confidence scores.


SIGKDD Explorations | 2010

Summary of the first ACM SIGKDD workshop on knowledge discovery from uncertain data (U'09)

Jian Pei; Lise Getoor; Ander de Keijzer

The importance of uncertain data is growing quickly in many essential applications such as environmental monitoring, mobile object tracking and data integration. Recently, storing, collecting, processing, and analyzing uncertain data has attracted increasing attention from both academia and industry. Analyzing and mining uncertain data needs collaboration and joint effort from multiple research communities. Based on this motivation, we ran the First ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data (U’09) in conjunction with the 2009 SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09) in Paris.

The focus of this workshop was to bring together and bridge research in reasoning under uncertainty, probabilistic databases and mining uncertain data. Work in statistics and probabilistic reasoning can provide models for representing uncertainty, work in the probabilistic database community can provide methods for storing and managing uncertain data, while work on mining uncertain data can define data analysis tasks and methods. It is important to build connections among these communities to tackle the overall problem of analyzing and mining uncertain data. There are many common challenges among the communities. One is understanding the different modeling assumptions made, and how they impact the methods, both in terms of accuracy and efficiency. Different researchers hold different assumptions about the semantics of probabilistic models and uncertainty; this is one of the major obstacles in research on mining uncertain data. Another challenge is the scalability of proposed management and analysis methods. Finally, to make analysis and mining useful and practical, we need real data sets for testing. Unfortunately, uncertain data sets are often hard to get and hard to share.

The theme of this workshop was to make connections among the research areas of probabilistic databases, probabilistic reasoning, and data mining, as well as to build bridges among the aspects of models, data, applications, novel mining tasks and effective solutions. By making connections among different communities, we aim at understanding each other in terms of scientific foundation as well as commonality and differences in research methodology.

Although the workshop was allocated only half a day, we had a very dynamic and exciting program. The workshop was among the best attended of those held in conjunction with the conference; there were about 40 attendees when the workshop started. We were lucky to have two excellent invited talks. Professor Christopher Jermaine of Rice University gave a talk on “Managing and Mining Uncertain Data: What Might We Do Better?”. In this talk, he expressed a few of his strongly held opinions on the management and mining of uncertain data. He argued that those who work in the field should listen very carefully to complaints from machine learning experts, who often say, “but all of our methods were already designed to work with uncertain data, so you are wasting your time!” Furthermore, he contended that too much work aimed at managing uncertainty is tightly coupled to first-order logic and related ideas. He also argued that Bayesian approaches and Monte Carlo methods should be much more widely employed in this area. Finally, he argued that too much work in this area neglects the application domains where uncertainty is most important: “what if” analysis, risk assessment, and prediction.

In his invited talk titled “Querying and Mining Uncertain Data: Methods, Applications, and Challenges”, Dr. Matthias Renz of Ludwig-Maximilians-Universität (LMU) München summarized several very interesting projects in his group exploring various aspects of mining uncertain data, particularly from the point of view of efficiency. The efficiency concern is particularly important for modern databases, since they allow users to incorporate the uncertainty of data in the hope of increasing the quality of query results. Dr. Renz gave an overview of modeling uncertain data in feature spaces and illustrated diverse probabilistic similarity search methods, which are important tools for many mining applications. In this context, he discussed some current methods as well as the challenges in clustering uncertain data and mining probabilistic rules. The two invited talks were very successful: they led to interesting discussions among the audience and the invited speakers, and helped to highlight the interdisciplinary nature of the workshop.

The program committee accepted eight papers; four were given 15-minute presentations and the other four 10-minute presentations. In the paper titled “Efficient Algorithms for Mining Constrained Frequent Patterns from Uncertain Data”, Leung and Brajczuk argue that constrained frequent pattern mining from uncertain data is important, since constrained frequent pattern mining and mining frequent patterns from uncertain data often occur together in common applications such as analyzing medical laboratory data. They developed


CTIT Technical Report Series | 2006

Taming Data Explosion in Probabilistic Information Integration

Ander de Keijzer; Maurice van Keulen; Yiping Li


CTIT Technical Report Series | 2009

Duplicate Detection in Probabilistic Data

Fabian Panse; Maurice van Keulen; Ander de Keijzer; Norbert Ritter

Collaboration


Dive into Ander de Keijzer's collaborations.

Top Co-Authors

Lise Getoor

University of California

Jian Pei

Simon Fraser University
