Publication


Featured research published by Martin Reynaert.


International Conference on Computational Linguistics | 2004

Text induced spelling correction

Martin Reynaert

We present TISC, a language-independent and context-sensitive spelling checking and correction system designed to facilitate the automatic removal of non-word spelling errors in large corpora. Its lexicon is derived from a very large corpus of raw text, without supervision, and contains word unigrams and word bigrams. It is stored in a novel representation based on a purpose-built hashing function, which provides a fast and computationally tractable way of checking whether a particular word form likely constitutes a spelling error and of retrieving correction candidates. The system employs input context and lexicon evidence to automatically propose a limited number of ranked correction candidates when insufficient information for an unambiguous decision on a single correction is available. We describe the implemented prototype and evaluate it on English and Dutch text, containing real-world errors in more or less limited contexts. The results are compared with those of the isolated word spelling checking programs ISPELL and the MICROSOFT PROOFING TOOLS (MPT).
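
The abstract above does not specify the hashing function, so the following is only a minimal sketch of how a hash-keyed unigram and bigram lexicon could support error detection and candidate retrieval. It assumes an order-insensitive numeric key (sum of character code points raised to a fixed power); the names `anagram_key`, `build_lexicon`, `is_known`, `candidates`, and the constant `POWER` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

POWER = 5  # assumed exponent; the abstract does not give the function


def anagram_key(word):
    """Order-insensitive key: anagrams of a word share the same value."""
    return sum(ord(ch) ** POWER for ch in word)


def build_lexicon(tokens):
    """Derive unigram and bigram counts from a list of raw tokens."""
    buckets = defaultdict(dict)  # key -> {word: unigram count}
    bigrams = defaultdict(int)   # (left word, right word) -> count
    for left, right in zip(tokens, tokens[1:] + [None]):
        bucket = buckets[anagram_key(left)]
        bucket[left] = bucket.get(left, 0) + 1
        if right is not None:
            bigrams[(left, right)] += 1
    return buckets, bigrams


def is_known(word, buckets):
    """A word form absent from the lexicon is treated as a likely error."""
    return word in buckets.get(anagram_key(word), {})


def within_one_edit(a, b):
    """True if b is one insertion, deletion, substitution or adjacent swap away."""
    if a == b:
        return False
    if len(a) == len(b):
        diff = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diff) == 1:
            return True
        return (len(diff) == 2 and diff[1] == diff[0] + 1
                and a[diff[0]] == b[diff[1]] and a[diff[1]] == b[diff[0]])
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def candidates(word, left, buckets, bigrams, alphabet):
    """Rank lexicon words one edit away, preferring those seen after `left`."""
    key = anagram_key(word)
    values = [ord(ch) ** POWER for ch in alphabet]
    targets = {key}                                     # adjacent swaps keep the key
    for ch in set(word):
        old = ord(ch) ** POWER
        targets.add(key - old)                          # deletion
        targets.update(key - old + v for v in values)   # substitution
    targets.update(key + v for v in values)             # insertion
    # Buckets also hold anagrams and chance collisions, so verify each word.
    found = {w: n for t in targets for w, n in buckets.get(t, {}).items()
             if within_one_edit(word, w)}
    return sorted(found, key=lambda w: (-bigrams.get((left, w), 0), -found[w], w))


tokens = "the cat sat on the mat and the cat ran".split()
buckets, bigrams = build_lexicon(tokens)
print(candidates("cta", "the", buckets, bigrams, "abcdefghijklmnopqrstuvwxyz"))  # ['cat']
```

In this sketch the left-hand context word decides the ranking first and the unigram count breaks ties, which is one simple way to combine input context with lexicon evidence; the ranking used by the actual system may differ.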


Spyns, P.; Odijk, J. (eds.), Essential Speech and Language Technology for Dutch | 2013

The Construction of a 500-Million-Word Reference Corpus of Contemporary Written Dutch

Nelleke Oostdijk; Martin Reynaert; Véronique Hoste; Ineke Schuurman

The construction of a large and richly annotated corpus of written Dutch was identified as one of the priorities of the STEVIN programme. Such a corpus, sampling texts from conventional and new media, is invaluable for scientific research and application development. The present chapter describes how the Dutch reference corpus was developed in two consecutive STEVIN-funded projects, viz. D-Coi and SoNaR. The construction of the corpus has been guided by (inter)national standards and best practices. At the same time, the achievements and experience gained in the D-Coi and SoNaR projects contributed to the further advancement and dissemination of those standards and practices.


Meeting of the Association for Computational Linguistics | 2003

Learning to Predict Pitch Accents and Prosodic Boundaries in Dutch

Erwin Marsi; Martin Reynaert; Antal van den Bosch; Walter Daelemans; Véronique Hoste

We train a decision tree inducer (CART) and a memory-based classifier (MBL) on predicting prosodic pitch accents and breaks in Dutch text, on the basis of shallow, easy-to-compute features. We train the algorithms on both tasks individually and on the two tasks simultaneously. The parameters of both algorithms and the selection of features are optimized per task with iterative deepening, an efficient wrapper procedure that uses progressive sampling of training data. Results show a consistent significant advantage of MBL over CART, and also indicate that task combination can be done at the cost of little generalization score loss. Tests on cross-validated data and on held-out data yield F-scores of MBL on accent placement of 84 and 87, respectively, and on breaks of 88 and 91, respectively. Accent placement is shown to outperform an informed baseline rule; reliably predicting breaks other than those already indicated by intra-sentential punctuation, however, appears to be more challenging.


International Conference on Computational Linguistics | 2004

Multilingual text induced spelling correction

Martin Reynaert

We present TISC, a multilingual, language-independent and context-sensitive spelling checking and correction system designed to facilitate the automatic removal of non-word spelling errors in large corpora. Its lexicon is derived from raw text corpora, without supervision, and contains word unigrams and word bigrams. The system employs input context and lexicon evidence to automatically propose a limited number of ranked correction candidates. We describe the implemented trilingual (Dutch, English, French) prototype and evaluate it on English and Dutch text, monolingual and mixed, containing real-world errors in context.


Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage | 2014

On OCR ground truths and OCR post-correction gold standards, tools and formats

Martin Reynaert

We give an overview of activities undertaken on the sidelines of our automatic OCR post-correction core business over the past few years. We present ongoing projects in the Netherlands in which Text-Induced Corpus Clean-up plays a part. We describe the infrastructure we are building to help improve the overall text quality of large digitized text collections. We provide information on the tools we develop to facilitate the process and discuss the role of FoLiA XML, which we adopted as a pivot format. Connecting the dots, we discuss the difference we perceive between OCR ground truths and OCR post-correction gold standards and their respective contributions.


International Conference on Computational Linguistics | 2008

Non-interactive OCR post-correction for giga-scale digitization projects

Martin Reynaert


Language Resources and Evaluation | 2008

From D-Coi to SoNaR: A reference corpus for Dutch

Nelleke Oostdijk; Martin Reynaert; Paola Monachesi; G.J.M. van Noord; Roeland Ordelman; I. Schuurman; Vincent Vandeghinste


International Journal on Document Analysis and Recognition | 2011

Character confusion versus focus word-based correction of spelling and OCR variants in corpora

Martin Reynaert


Computational Linguistics in the Netherlands | 2014

FoLiA: A practical XML Format for Linguistic Annotation - a descriptive and comparative study

M. van Gompel; Martin Reynaert


Language Resources and Evaluation | 2008

All, and only, the Errors: more Complete and Consistent Spelling and OCR-Error Correction Evaluation.

Martin Reynaert

Collaboration


Dive into Martin Reynaert's collaborations.

Top Co-Authors

Nelleke Oostdijk

Radboud University Nijmegen

Maarten van Gompel

Radboud University Nijmegen

Ineke Schuurman

Katholieke Universiteit Leuven

Vincent Vandeghinste

Katholieke Universiteit Leuven

Eric Sanders

Radboud University Nijmegen
