Publication


Featured research published by Hans van Halteren.


Computational Linguistics | 2001

Improving accuracy in word class tagging through the combination of machine learning systems

Hans van Halteren; Walter Daelemans; Jakub Zavrel

We examine how differences in language models, learned by different data-driven systems performing the same NLP task, can be exploited to yield a higher accuracy than the best individual system. We do this by means of experiments involving the task of morphosyntactic word class tagging, on the basis of three different tagged corpora. Four well-known tagger generators (hidden Markov model, memory-based, transformation rules, and maximum entropy) are trained on the same corpus data. After comparison, their outputs are combined using several voting strategies and second-stage classifiers. All combination taggers outperform their best component. The reduction in error rate varies with the material in question, but can be as high as 24.3% with the LOB corpus.
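The simplest kind of voting strategy mentioned in the abstract can be sketched as plain majority voting over the tags proposed by each tagger. This is only an illustration: the function names, the tie-breaking policy (fall back to one designated tagger) and the example tags are assumptions, not the weighted voting schemes or second-stage classifiers evaluated in the paper.

```python
from collections import Counter


def majority_vote(tags, tie_breaker=0):
    """Return the tag proposed by most taggers for one token.

    On a tie, fall back to the tag chosen by the tagger at index
    `tie_breaker` (an assumed policy, not taken from the paper).
    """
    counts = Counter(tags)
    best = max(counts.values())
    winners = [tag for tag, count in counts.items() if count == best]
    if len(winners) == 1:
        return winners[0]
    # Tie: prefer the designated tagger's choice if it is among the winners.
    return tags[tie_breaker] if tags[tie_breaker] in winners else winners[0]


def combine(outputs, tie_breaker=0):
    """Combine equal-length tag sequences, one sequence per tagger."""
    return [majority_vote(list(token_tags), tie_breaker)
            for token_tags in zip(*outputs)]


# Hypothetical outputs of four taggers for a three-token sentence.
outputs = [
    ["DT", "NN", "VB"],
    ["DT", "VB", "VB"],
    ["DT", "NN", "NN"],
    ["PN", "NN", "VB"],
]
combined = combine(outputs)
```

A combination of this kind can only help when the component taggers make different errors, which is the property the abstract sets out to exploit.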


Meeting of the Association for Computational Linguistics | 1998

Improving Data Driven Wordclass Tagging by System Combination

Hans van Halteren; Jakub Zavrel; Walter Daelemans

In this paper we examine how the differences in modelling between different data driven systems performing the same NLP task can be exploited to yield a higher accuracy than the best individual system. We do this by means of an experiment involving the task of morpho-syntactic wordclass tagging. Four well-known tagger generators (Hidden Markov Model, Memory-Based, Transformation Rules and Maximum Entropy) are trained on the same corpus data. After comparison, their outputs are combined using several voting strategies and second stage classifiers. All combination taggers outperform their best component, with the best combination showing a 19.1% lower error rate than the best individual tagger.


Meeting of the Association for Computational Linguistics | 2004

Linguistic Profiling for Authorship Recognition and Verification

Hans van Halteren

A new technique is introduced, linguistic profiling, in which large numbers of counts of linguistic features are used as a text profile, which can then be compared to average profiles for groups of texts. The technique proves to be quite effective for authorship verification and recognition. The best parameter settings yield a False Accept Rate of 8.1% at a False Reject Rate equal to zero for the verification task on a test corpus of student essays, and a 99.4% 2-way recognition accuracy on the same corpus.


North American Chapter of the Association for Computational Linguistics | 2003

Examining the consensus between human summaries: initial experiments with factoid analysis

Hans van Halteren; Simone Teufel

We present a new approach to summary evaluation which combines two novel aspects, namely (a) content comparison between gold standard summary and system summary via factoids, a pseudo-semantic representation based on atomic information units which can be robustly marked in text, and (b) use of a gold standard consensus summary, in our case based on 50 individual summaries of one text. Even though future work on more than one source text is imperative, our experiments indicate that (1) ranking with regard to a single gold standard summary is insufficient as rankings based on any two randomly chosen summaries are very dissimilar (correlations average ρ = 0.20), (2) a stable consensus summary can only be expected if a larger number of summaries are collected (in the range of at least 30 to 40 summaries), and (3) similarity measurement using unigrams shows a similarly low ranking correlation when compared with factoid-based ranking.


Journal of Quantitative Linguistics | 2005

New Machine Learning Methods Demonstrate the Existence of a Human Stylome

Hans van Halteren; R. Harald Baayen; Fiona Tweedie; Marco Haverkort; A.H. Neijt

Earlier research has shown that established authors can be distinguished by measuring specific properties of their writings, their stylome as it were. Here, we examine writings of less experienced authors. We succeed in distinguishing between these authors with a very high probability, which implies that a stylome exists even in the general population. However, the number of traits needed for so successful a distinction is an order of magnitude larger than assumed so far. Furthermore, traits referring to syntactic patterns prove less distinctive than traits referring to vocabulary, but much more distinctive than expected on the basis of current generativist theories of language learning.


ACM Transactions on Speech and Language Processing | 2007

Author verification by linguistic profiling: An exploration of the parameter space

Hans van Halteren

This article explores the effects of parameter settings in linguistic profiling, a technique in which large numbers of counts of linguistic features are used as a text profile which can then be compared to average profiles for groups of texts. Although the technique proves to be quite effective for authorship verification, with the best overall parameter settings yielding an equal error rate of 3% on a test corpus of student essays, the optimal parameters vary greatly depending on author and evaluation criterion.
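The equal error rate reported above is the operating point at which the False Accept Rate and the False Reject Rate coincide. A minimal sketch of how such a point might be located from verification scores follows. The score lists, the acceptance rule (accept when the score is at least the threshold) and the function names are illustrative assumptions and do not reproduce the article's evaluation procedure.

```python
def far_frr(genuine, impostor, threshold):
    """Compute (FAR, FRR) when a text is accepted if its score >= threshold.

    FAR: fraction of impostor scores that are accepted.
    FRR: fraction of genuine scores that are rejected.
    """
    far = sum(score >= threshold for score in impostor) / len(impostor)
    frr = sum(score < threshold for score in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Scan all observed scores as thresholds and return the
    (threshold, FAR, FRR) triple with the smallest |FAR - FRR|."""
    def gap(threshold):
        far, frr = far_frr(genuine, impostor, threshold)
        return abs(far - frr)

    best = min(sorted(set(genuine) | set(impostor)), key=gap)
    far, frr = far_frr(genuine, impostor, best)
    return best, far, frr


# Hypothetical verification scores (higher means "more like the author").
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
threshold, far, frr = equal_error_rate(genuine_scores, impostor_scores)
```

With a finite test set the two rates rarely match exactly, so the sketch reports the closest pair rather than interpolating.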


International Conference on Computational Linguistics | 2008

Source Language Markers in EUROPARL Translations

Hans van Halteren

This paper shows that it is very often possible to identify the source language of medium-length speeches in the EUROPARL corpus on the basis of frequency counts of word n-grams (87.2% to 96.7% accuracy depending on classification method). The paper also examines in detail which positive markers are most powerful and identifies a number of linguistic aspects as well as culture- and domain-related ones.
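As an illustration of the kind of feature the abstract refers to, frequency counts of word n-grams can be collected as in the sketch below. The whitespace tokenisation, lowercasing, maximum n-gram length and function names are assumptions made for the example; the paper's own feature extraction and classifiers are not reproduced here.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text as tuples, using naive
    lowercased whitespace tokenisation (an assumption for this sketch)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=3):
    """Count all word n-grams of length 1 to max_n in a text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


# Hypothetical fragment of a translated speech.
counts = ngram_counts("the house of the people", max_n=2)
```

Counts like these would then be turned into feature vectors for whichever classification method is being compared.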


Conference on Computational Natural Language Learning | 2000

Chunking with WPDV models

Hans van Halteren

In this paper I describe the application of the WPDV algorithm to the CoNLL-2000 shared task, the identification of base chunks in English text (Tjong Kim Sang and Buchholz, 2000). For this task, I use a three-stage architecture: I first run five different base chunkers, then combine them and finally try to correct some recurring errors. Except for one base chunker, which uses the memory-based machine learning system TiMBL, all modules are based on WPDV models (van Halteren, 2000a).


Information Retrieval | 2011

Learning to rank for why-question answering

Suzan Verberne; Hans van Halteren; D.L. Theijssen; Stephan Raaijmakers; Lou Boves

In this paper, we evaluate a number of machine learning techniques for the task of ranking answers to why-questions. We use TF-IDF together with a set of 36 linguistically motivated features that characterize questions and answers. We experiment with a number of machine learning techniques (among which several classifiers and regression techniques, Ranking SVM and SVMmap) in various settings. The purpose of the experiments is to assess how the different machine learning approaches can cope with our highly imbalanced binary relevance data, with and without hyperparameter tuning. We find that with all machine learning techniques, we can obtain an MRR score that is significantly above the TF-IDF baseline of 0.25 and not significantly lower than the best score of 0.35. We provide an in-depth analysis of the effect of data imbalance and hyperparameter tuning, and we relate our findings to previous research on learning to rank for Information Retrieval.
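The MRR (mean reciprocal rank) score used in the abstract averages, over all questions, the reciprocal of the rank at which the first relevant answer appears. A minimal sketch, assuming binary relevance labels listed in ranked order and counting a question with no relevant answer as 0 (the data and function names are illustrative):

```python
def reciprocal_rank(relevance):
    """Return 1/rank of the first relevant item in a ranked list of
    0/1 relevance labels, or 0.0 if no item is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings):
    """Average the reciprocal rank over all questions."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)


# Hypothetical relevance labels for the ranked answers to four questions.
rankings = [
    [0, 1, 0],     # first relevant answer at rank 2
    [1, 0, 0],     # first relevant answer at rank 1
    [0, 0, 0],     # no relevant answer retrieved
    [0, 0, 0, 1],  # first relevant answer at rank 4
]
mrr = mean_reciprocal_rank(rankings)
```

Because only the first relevant answer counts, the measure suits data in which few candidate answers per question are relevant.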


International Conference on Computational Linguistics | 2013

N-Gram-Based recognition of threatening tweets

Nelleke Oostdijk; Hans van Halteren

In this paper, we investigate to what degree it is possible to recognize threats in Dutch tweets. We attempt threat recognition on the basis of only the single tweet (without further context) and using only very simple recognition features, namely n-grams. We present two different methods of n-gram-based recognition, one based on manually constructed n-gram patterns and the other on machine learned patterns. Our evaluation is not restricted to precision and recall scores, but also looks into the difference in yield of the two methods, considering either combination or means that may help refine both methods individually.

Collaboration


Dive into Hans van Halteren's collaborations.

Top Co-Authors

Nelleke Oostdijk
Radboud University Nijmegen

D.L. Theijssen
Radboud University Nijmegen

Suzan Verberne
Radboud University Nijmegen

Lou Boves
Radboud University Nijmegen

Pascal Aventurier
Institut national de la recherche agronomique