
Publication


Featured research published by Marijn Schraagen.


international conference on tools with artificial intelligence | 2011

Complete Coverage for Approximate String Matching in Record Linkage Using Bit Vectors

Marijn Schraagen

Research in social history is increasingly influenced by the availability of digitized sources. Tools have to be developed to access these sources efficiently. This paper describes a tool that performs family reconstruction using record linkage: linking historical civil certificates based on record similarity. Most current approaches in record linkage apply heuristics to limit the number of similarity computations at the expense of linking coverage. This paper describes a binary-tree-based indexing approach that provides complete coverage within practical time bounds. The indexing scheme is constructed using a simulated annealing algorithm to optimize indexing efficiency. A comparison to other methods, both heuristic and complete-coverage, is provided. The method is developed for Levenshtein edit distance; however, an extension to other similarity measures is feasible. As an example, extension to Jaro distance is discussed.
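For readers unfamiliar with the similarity measure named in this abstract, the following is a minimal sketch of plain Levenshtein edit distance in Python. It is not the paper's bit-vector index or its binary tree scheme; it only illustrates the pairwise comparison that such an index is designed to avoid running on every record pair. The function names and the `max_distance` threshold of 2 are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_candidate_link(name_a: str, name_b: str, max_distance: int = 2) -> bool:
    """Hypothetical helper: treat two names as a candidate match when
    their edit distance is within an assumed threshold."""
    return levenshtein(name_a, name_b) <= max_distance
```

Comparing every pair of records this way is quadratic in the number of records, which is the cost that heuristic blocking or a complete-coverage index aims to reduce.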


Population Reconstruction | 2015

Learning Name Variants from Inexact High-Confidence Matches

Gerrit Bloothooft; Marijn Schraagen

Name variants that differ by more than a few characters can seriously hamper record linkage. A method is described by which variants of first names and surnames can be learned automatically from records that contain more information than needed for a true link decision. Post-processing and limited manual intervention (active learning) are unavoidable, however, to differentiate errors in the original and the digitised data from variants. The method is demonstrated on the basis of an analysis of 14.8 million records from the Dutch vital registration.


machine learning and data mining in pattern recognition | 2014

Record Linkage Using Graph Consistency

Marijn Schraagen; Walter A. Kosters

This paper provides a method for automated record linkage in the historical domain based on collective entity resolution. Multiple records are considered for linkage simultaneously, using plausible record sequences as a substitute for pair-wise record similarity measures such as string edit distance. The method is applied to the problem of family reconstruction from historical archives. A benchmark evaluation shows that the approach provides a computationally efficient way to produce family reconstructions which are useful in practice. Further improvements in linkage accuracy are expected by addressing data issues and linkage assumption violations.


international conference on innovative computing technology | 2012

Data-driven name reduction for record linkage

Marijn Schraagen; Walter A. Kosters

Automatic record linkage of data containing personal names is difficult in the presence of name variation and spelling errors. This paper presents a standardization procedure for personal names to address the variation problem. A classification-tree-based model is constructed using a training set of 65,002 name-variant pairs. The method provides an efficient procedure for record linkage (3500 records per second, F-measure 0.96 on a sample of Dutch historical civil records). The results include links with large edit distance between the records; however, recall is lower for this category. A bootstrapping procedure is used to improve recall.


Spyns, P.; Odijk, J. (ed.), Essential Speech and Language Technology for Dutch | 2013

Lexical Modeling for Proper name Recognition in Autonomata Too

Bert Réveil; Jean-Pierre Martens; Henk van den Heuvel; Gerrit Bloothooft; Marijn Schraagen

The research in Autonomata Too aimed at the development of new pronunciation modeling techniques that can bring the speech recognition component of a Dutch/Flemish POI (Points of Interest) information providing business service to the required level of accuracy. The automatic recognition of spoken POI is extremely difficult because of the existence of multiple pronunciations that are frequently used for the same POI and because of the presence of important cross-lingual effects one has to account for. In fact, the ASR (Automatic Speech Recognition) engine must be able to cope with pronunciations of (partly) foreign POI names spoken by native speakers and pronunciations of native POI names uttered by non-native speakers. In order to deal adequately with such pronunciations, one must model them at the level of the acoustic models as well as at the level of the recognition lexicon. This paper describes a novel lexical modeling approach that was developed and tested in the Autonomata Too project. The new method employs a G2P-P2P (grapheme-to-phoneme, phoneme-to-phoneme) tandem to generate suitable lexical pronunciation variants. It was shown to yield a significant improvement over a baseline system already embedding state-of-the-art acoustic and lexical models.


Spyns, P.; Odijk, J. (ed.), Essential Speech and Language Technology for Dutch | 2013

Resources Developed in the Autonomata Projects

H. van den Heuvel; J-P. Martens; Gerrit Bloothooft; Marijn Schraagen; N. Konings; K. D'hanens; Q. Yang

Realistic phonetic transcriptions of names are crucial for many applications. A specific problem for the automatic recognition of names is the existence of different pronunciations of the same name. These pronunciations often depend on the background (mother tongue) of the user. Typical examples are the pronunciation of foreign city names, foreign proper names, etc. The first goal of Autonomata was, therefore, to collect a large number of name pronunciations and to provide manually checked phonetic transcriptions of these name utterances. Such pronunciation data are needed both for training and evaluating name recognition systems. The second goal of the Autonomata project was to develop a tool that incorporates a state-of-the-art grapheme-to-phoneme (g2p) converter, as well as a dedicated phoneme-to-phoneme (p2p) post-processor which can automatically correct some of the mistakes made by the standard g2p. In this contribution we will describe in more detail four resources that were developed in Autonomata and in Autonomata Too:
1. The Autonomata Spoken Names Corpus (ASNC)
2. The Autonomata Transcription Toolbox
3. The Autonomata P2P converters
4. The Autonomata TOO Spoken POI Corpus


language resources and evaluation | 2010

Evaluating Repetitions, or how to Improve your Multilingual ASR System by doing Nothing.

Marijn Schraagen; Gerrit Bloothooft


Archive | 2015

Population Reconstruction

Gerrit Bloothooft; Peter Christen; Kees Mandemakers; Marijn Schraagen


computational linguistics in the netherlands | 2017

The CLIN27 Shared Task: Translating Historical Text to Contemporary Language for Improving Automatic Linguistic Annotation

Erik F. Tjong Kim Sang; Marcel Bollmann; Remko Boschker; Francisco Casacuberta; Feike Dietz; Stefanie Dipper; Miguel Domingo; Rob van der Goot; Marjo van Koppen; Nikola Ljubešić; Robert Östling; Florian Petran; Eva Pettersson; Yves Scherrer; Marijn Schraagen; Leen Sevens; Jörg Tiedemann; Tom Vanallemeersch; Kalliopi Zervanou


conference of the international speech communication association | 2011

A Qualitative Evaluation of Phoneme-to-Phoneme Technology.

Marijn Schraagen; Gerrit Bloothooft
