Publications


Featured research published by Walter Daelemans.


Machine Learning | 1999

Forgetting Exceptions is Harmful in Language Learning

Walter Daelemans; Antal van den Bosch; Jakub Zavrel

We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments we combine memory-based learning with training set editing techniques, in which instances are edited based on their typicality and class prediction strength. Results show that editing exceptional instances (with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms.
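The setup contrasted with editing here is plain memory-based (k-nearest-neighbour) learning over symbolic features, in which every training instance, exceptions included, is kept in memory. The sketch below is illustrative only: it is not the IB1-style configuration with feature weighting used in the paper, and the toy attachment instances are invented.

```python
# Minimal memory-based (k-NN) classifier over symbolic features.
# Illustrative sketch; not the paper's IB1/TiMBL-style setup.
from collections import Counter

def overlap_distance(a, b):
    """Number of feature positions on which two instances disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)

class MemoryBasedClassifier:
    def __init__(self, k=1):
        self.k = k
        self.memory = []  # every training instance is kept, exceptions included

    def fit(self, instances, labels):
        self.memory = list(zip(instances, labels))
        return self

    def predict(self, instance):
        # Rank stored instances by overlap distance and vote among the k nearest.
        nearest = sorted(self.memory,
                         key=lambda m: overlap_distance(instance, m[0]))[:self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy prepositional-phrase attachment instances: (verb, noun1, prep, noun2) -> V/N.
train = [(("eat", "pizza", "with", "fork"), "V"),
         (("eat", "pizza", "with", "anchovies"), "N"),
         (("see", "man", "with", "telescope"), "V")]
clf = MemoryBasedClassifier(k=1).fit([x for x, _ in train], [y for _, y in train])
print(clf.predict(("eat", "pasta", "with", "spoon")))
```

An editing variant of the kind evaluated in the paper would remove instances with low typicality or low class prediction strength from memory before prediction; the paper's finding is that doing so tends to hurt generalization accuracy.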


Computational Linguistics | 2001

Improving accuracy in word class tagging through the combination of machine learning systems

Hans van Halteren; Walter Daelemans; Jakub Zavrel

We examine how differences in language models, learned by different data-driven systems performing the same NLP task, can be exploited to yield a higher accuracy than the best individual system. We do this by means of experiments involving the task of morphosyntactic word class tagging, on the basis of three different tagged corpora. Four well-known tagger generators (hidden Markov model, memory-based, transformation rules, and maximum entropy) are trained on the same corpus data. After comparison, their outputs are combined using several voting strategies and second-stage classifiers. All combination taggers outperform their best component. The reduction in error rate varies with the material in question, but can be as high as 24.3% with the LOB corpus.
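The simplest of the combination strategies mentioned above is per-token majority voting over the component taggers' outputs. The hedged sketch below shows only that baseline; the paper also evaluates weighted voting and second-stage classifiers, and the example tags here are invented.

```python
# Toy majority-vote combiner for word-class taggers (illustrative only).
from collections import Counter

def majority_vote(tag_sequences):
    """tag_sequences: one tag list per tagger, all over the same tokens.
    Returns one combined tag per token."""
    combined = []
    for position_tags in zip(*tag_sequences):
        counts = Counter(position_tags)
        # most_common keeps insertion order on ties, so the first tagger
        # listed acts as the tie-breaker in this simple version.
        combined.append(counts.most_common(1)[0][0])
    return combined

hmm_tags  = ["DT", "NN", "VBZ"]
mbl_tags  = ["DT", "NN", "VBD"]
rule_tags = ["DT", "JJ", "VBZ"]
print(majority_vote([hmm_tags, mbl_tags, rule_tags]))  # ['DT', 'NN', 'VBZ']
```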


Artificial Intelligence Review | 1997

IGTree: using trees for compression and classification in lazy learning algorithms

Walter Daelemans; Antal van den Bosch; Ton Weijters

We describe the IGTree learning algorithm, which compresses an instance base into a tree structure. The concept of information gain is used as a heuristic function for performing this compression. IGTree produces trees that, compared to other lazy learning approaches, reduce storage requirements and the time required to compute classifications. Furthermore, we obtained similar or better generalization accuracy with IGTree when trained on two complex linguistic tasks, viz. letter–phoneme transliteration and part-of-speech tagging, when compared to alternative lazy learning and decision tree approaches (viz., IB1, information-gain-weighted IB1, and C4.5). A third experiment, with the task of word hyphenation, demonstrates that when the mutual differences in information gain of features are too small, IGTree as well as information-gain-weighted IB1 perform worse than IB1. These results indicate that IGTree is a useful algorithm for problems characterized by the availability of a large number of training instances described by symbolic features with sufficiently differing information gain values.
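The compression heuristic named in the abstract is standard information gain over symbolic features: the feature whose values reduce class entropy the most is tested closest to the root. A minimal sketch of that computation (not the IGTree tree-construction code itself) follows.

```python
# Information gain of a symbolic feature; IGTree orders features by this value.
# Sketch only, not the IGTree implementation.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(instances, labels, feature_index):
    """H(labels) minus the weighted entropy after splitting on the feature's values."""
    by_value = {}
    for inst, lab in zip(instances, labels):
        by_value.setdefault(inst[feature_index], []).append(lab)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder

# Features are then sorted by gain; an IGTree path only stores the feature values
# needed to disambiguate the stored instances, which is where the compression comes
# from. When gains barely differ (as in the hyphenation task above), this fixed
# ordering loses information relative to plain IB1.
```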


Proceedings of the 3rd international workshop on Search and mining user-generated contents | 2011

Predicting age and gender in online social networks

Claudia Peersman; Walter Daelemans; Leona Van Vaerenbergh

A common characteristic of communication on online social networks is that it happens via short messages, often using non-standard language variations. These characteristics make this type of text a challenging text genre for natural language processing. Moreover, in these digital communities it is easy to provide a false name, age, gender and location in order to hide one's true identity, providing criminals such as pedophiles with new possibilities to groom their victims. It would therefore be useful if user profiles could be checked on the basis of text analysis, and false profiles flagged for monitoring. This paper presents an exploratory study in which we apply a text categorization approach to the prediction of age and gender on a corpus of chat texts, which we collected from the Belgian social networking site Netlog. We examine which types of features are most informative for a reliable prediction of age and gender on this difficult text type and perform experiments with different data set sizes in order to acquire more insight into the minimum data size requirements for this task.
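As an illustration of the text-categorization framing described above, the hedged sketch below builds a character n-gram classifier with scikit-learn. The feature types, learner, class definitions, and Netlog chat data in the paper differ, and the inline mini-corpus here is invented.

```python
# Illustrative age-class text categorization over short chat messages.
# Not the paper's setup; requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["heyy wanna chat 2nite?", "lol omg thats so cool!!",
         "Good evening, how was your day at work?",
         "I will call you tomorrow after the meeting."]
labels = ["teen", "teen", "adult", "adult"]   # toy stand-in for age classes

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams
    LinearSVC())
model.fit(texts, labels)
print(model.predict(["omg c u 2nite lol"]))   # predicted age class for a new message
```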


Computational Linguistics | 2004

Recent Advances in Example-Based Machine Translation

Michael Carl; Andy Way; Walter Daelemans

Table of contents:
Part I, Foundations of EBMT: 1. An Overview of EBMT; 2. What is Example-Based Machine Translation?; 3. Example-Based Machine Translation in a Controlled Environment; 4. EBMT Seen as Case-based Reasoning.
Part II, Run-time Approaches to EBMT: 5. Formalizing Translation Memory; 6. EBMT Using DP-Matching Between Word Sequences; 7. A Hybrid Rule and Example-Based Method for Machine Translation; 8. EBMT of POS-Tagged Sentences via Inductive Learning.
Part III, Template-Driven EBMT: 9. Learning Translation Templates from Bilingual Translation Examples; 10. Clustered Transfer Rule Induction for Example-Based Translation; 11. Translation Patterns, Linguistic Knowledge and Complexity in EBMT; 12. Inducing Translation Grammars from Bracketed Alignments.
Part IV, EBMT and Derivation Trees: 13. Extracting Translation Knowledge from Parallel Corpora; 14. Finding Translation Patterns from Dependency Structures; 15. A Best-First Alignment Algorithm for Extraction of Transfer Mappings; 16. Translating with Examples: The LFG-DOT Models of Translation.


meeting of the association for computational linguistics | 1999

Memory-Based Morphological Analysis

Antal van den Bosch; Walter Daelemans

We present a general architecture for efficient and deterministic morphological analysis based on memory-based learning, and apply it to morphological analysis of Dutch. The system makes direct mappings from letters in context to rich categories that encode morphological boundaries, syntactic class labels, and spelling changes. Both precision and recall of labeled morphemes are over 84% on held-out dictionary test words and estimated to be over 93% in free text.
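The direct letter-to-category mapping described above rests on a simple windowing step: each letter, with a fixed amount of left and right context, becomes one classification instance. The sketch below shows only that step, with an invented window width and an example Dutch word; the paper's rich class labels (morphological boundaries plus syntactic and spelling information) and memory-based classifier are not reproduced.

```python
# Turn a word into fixed-width letter-in-context instances (sketch only).
def letter_windows(word, left=3, right=3, pad="_"):
    padded = pad * left + word + pad * right
    windows = []
    for i in range(len(word)):
        center = i + left
        windows.append(tuple(padded[center - left:center + right + 1]))
    return windows

# Each window would be paired with a class encoding, for example, "no boundary"
# or "boundary + plural suffix"; a memory-based learner maps windows to classes.
for w in letter_windows("boekjes"):
    print(w)
```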


north american chapter of the association for computational linguistics | 2009

Learning the Scope of Hedge Cues in Biomedical Texts

Roser Morante; Walter Daelemans

Identifying hedged information in biomedical literature is an important subtask in information extraction because it would be misleading to extract speculative information as factual information. In this paper we present a machine learning system that finds the scope of hedge cues in biomedical texts. The system is based on a similar system that finds the scope of negation cues. We show that the same scope finding approach can be applied to both negation and hedging. To investigate the robustness of the approach, the system is tested on the three subcorpora of the BioScope corpus that represent different text types.
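The system described here casts scope finding as token-level classification: given a hedge cue, every token in the sentence becomes an instance and the classifier decides whether it falls inside the cue's scope. The sketch below shows only that framing with a small, invented feature set; the paper's features, classifiers, and BioScope evaluation are not reproduced.

```python
# Token-level framing of hedge-scope finding (sketch, not the paper's system).
def scope_instances(tokens, cue_index):
    """One feature dict per token; a classifier would label each as in/out of scope."""
    instances = []
    for i, tok in enumerate(tokens):
        instances.append({
            "token": tok.lower(),
            "cue": tokens[cue_index].lower(),
            "distance_to_cue": i - cue_index,   # signed: left (<0) or right (>0) of cue
        })
    return instances

sentence = "These results suggest that the protein may regulate transcription".split()
for feats in scope_instances(sentence, cue_index=2):   # cue word: "suggest"
    print(feats)
```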


international conference on computational linguistics | 2008

Authorship Attribution and Verification with Many Authors and Limited Data

Kim Luyckx; Walter Daelemans

Most studies in statistical or machine-learning-based authorship attribution focus on two or a few authors. This leads to an overestimation of the importance of the features extracted from the training data and found to be discriminating for these small sets of authors. Most studies also use sizes of training data that are unrealistic for situations in which stylometry is applied (e.g., forensics), and thereby overestimate the accuracy of their approach in these situations. A more realistic interpretation of the task is as an authorship verification problem that we approximate by pooling data from many different authors as negative examples. In this paper, we show, on the basis of a new corpus with 145 authors, what the effect of many authors is on feature selection and learning, and show the robustness of a memory-based learning approach to authorship attribution and verification with many authors and limited training data when compared to eager learning methods such as SVMs and maximum entropy learning.
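A hedged sketch of the verification framing described above: the candidate author's texts form the positive class and texts pooled from many other authors form the negative class. The learner below (a scikit-learn logistic regression over character n-grams) is a stand-in, not the paper's memory-based approach, and the variable names in the usage comment are hypothetical.

```python
# Authorship verification with pooled negative examples (illustrative stand-in).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_verifier(candidate_texts, other_authors_texts):
    """other_authors_texts: texts pooled from many different authors as negatives."""
    texts = list(candidate_texts) + list(other_authors_texts)
    labels = [1] * len(candidate_texts) + [0] * len(other_authors_texts)
    model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(3, 4)),  # style-oriented features
        LogisticRegression(max_iter=1000))
    return model.fit(texts, labels)

# Hypothetical usage:
# verifier = build_verifier(texts_by_candidate, texts_pooled_from_other_authors)
# verifier.predict_proba([disputed_text])  # probability the candidate wrote the text
```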


meeting of the association for computational linguistics | 1998

Improving Data Driven Wordclass Tagging by System Combination

Hans van Halteren; Jakub Zavrel; Walter Daelemans

In this paper we examine how the differences in modelling between different data-driven systems performing the same NLP task can be exploited to yield a higher accuracy than the best individual system. We do this by means of an experiment involving the task of morpho-syntactic wordclass tagging. Four well-known tagger generators (Hidden Markov Model, Memory-Based, Transformation Rules and Maximum Entropy) are trained on the same corpus data. After comparison, their outputs are combined using several voting strategies and second-stage classifiers. All combination taggers outperform their best component, with the best combination showing a 19.1% lower error rate than the best individual tagger.


Genome Biology | 2011

BioGraph: unsupervised biomedical knowledge discovery via automated hypothesis generation

Anthony Liekens; Jeroen De Knijf; Walter Daelemans; Bart Goethals; Peter De Rijk; Jurgen Del-Favero

We present BioGraph, a data integration and data mining platform for the exploration and discovery of biomedical information. The platform offers prioritizations of putative disease genes, supported by functional hypotheses. We show that BioGraph can retrospectively confirm recently discovered disease genes and identify potential susceptibility genes, outperforming existing technologies, without requiring prior domain knowledge. Additionally, BioGraph allows for generic biomedical applications beyond gene discovery. BioGraph is accessible at http://www.biograph.be.
