Publication


Featured research published by Guy De Pauw.


text speech and dialogue | 2006

Data-driven part-of-speech tagging of Kiswahili

Guy De Pauw; Gilles-Maurice de Schryver; Peter Waiganjo Wagacha

In this paper we present experiments with data-driven part-of-speech taggers trained and evaluated on the annotated Helsinki Corpus of Swahili. Using four of the current state-of-the-art data-driven taggers, TnT, MBT, SVMTool and MXPOST, we observe the latter as being the most accurate tagger for the Kiswahili dataset. We further improve on the performance of the individual taggers by combining them into a committee of taggers. We observe that the more naive combination methods, like the novel plural voting approach, outperform more elaborate schemes like cascaded classifiers and weighted voting. This paper is the first publication to present experiments on data-driven part-of-speech tagging for Kiswahili and Bantu languages in general.
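As a rough illustration of what combining a committee of taggers can look like, the sketch below applies plain plurality voting per token. It is a generic baseline, not the plural voting scheme proposed in the paper. The function names, the tag labels and the tie-breaking rule (prefer the tagger listed first) are assumptions made for this example.

```python
from collections import Counter


def plurality_vote(tags):
    """Return the most frequent tag; on a tie, the earliest-listed tagger wins."""
    counts = Counter(tags)
    top = max(counts.values())
    for tag in tags:  # tags are in tagger order, so this breaks ties by position
        if counts[tag] == top:
            return tag


def combine_taggers(outputs):
    """Combine per-tagger tag sequences (one list per tagger) token by token."""
    lengths = {len(seq) for seq in outputs}
    if len(lengths) != 1:
        raise ValueError("all taggers must tag the same number of tokens")
    return [plurality_vote(list(column)) for column in zip(*outputs)]


# Hypothetical output of four taggers for a two-token sentence.
committee = [
    ["N", "V"],
    ["N", "ADJ"],
    ["V", "ADJ"],
    ["N", "V"],
]
print(combine_taggers(committee))
```

Listing the most accurate individual tagger first makes it the tie-breaker, which is one simple way to use the accuracy ranking without introducing explicit weights.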


text speech and dialogue | 2007

Automatic diacritic restoration for resource-scarce languages

Guy De Pauw; Peter Waiganjo Wagacha; Gilles-Maurice de Schryver

The orthography of many resource-scarce languages includes diacritically marked characters. Falling outside the scope of the standard Latin encoding, these characters are often represented in digital language resources as their unmarked equivalents. This renders corpus compilation more difficult, as these languages typically do not have the benefit of large electronic dictionaries to perform diacritic restoration. This paper describes experiments with a machine learning approach that is able to automatically restore diacritics on the basis of local graphemic context. We apply the method to the African languages of Ciluba, Gikuyu, Kikamba, Maa, Sesotho sa Leboa, Tshivenda and Yoruba and contrast it with experiments on Czech, Dutch, French, German and Romanian, as well as Vietnamese and Chinese Pinyin.
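To make the idea of restoring diacritics from local graphemic context concrete, the sketch below turns a correctly spelled word into training instances: each instance pairs a window of unmarked characters with the marked character that belongs at its centre. The window size, the padding symbol and the function names are assumptions for this example, as is the simplification that every marked character is a single precomposed code point. The abstract does not specify the paper's actual feature layout or learner.

```python
import unicodedata


def strip_diacritics(ch):
    """Map a character to its unmarked equivalent (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def make_instances(word, window=2, pad="_"):
    """Build (context, target) pairs: unmarked character window -> marked character."""
    word = unicodedata.normalize("NFC", word)
    plain = [strip_diacritics(ch) for ch in word]
    padded = [pad] * window + plain + [pad] * window
    instances = []
    for i, target in enumerate(word):
        # The focus character sits in the middle of the window.
        context = tuple(padded[i:i + 2 * window + 1])
        instances.append((context, target))
    return instances


for context, target in make_instances("café", window=1):
    print(context, "->", target)
```

Instances of this shape can be passed to any classifier that accepts symbolic features. At restoration time the same windows are extracted from the unmarked text and the predicted class is written in place of the focus character.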


genetic and evolutionary computation conference | 2003

Evolutionary computing as a tool for grammar development

Guy De Pauw

In this paper, an agent-based evolutionary computing technique is introduced that is geared towards the automatic induction and optimization of grammars for natural language (GRAEL). We outline three instantiations of the GRAEL environment: the GRAEL-1 system uses large annotated corpora to bootstrap grammatical structure in a society of autonomous agents that tries to optimally redistribute grammatical information to reflect accurate probabilistic values for the task of parsing. In GRAEL-2, agents are allowed to mutate grammatical information, effectively implementing grammar rule discovery in a practical context. Finally, by employing a separate grammar induction module at the onset of the society, GRAEL-3 can be used as an unsupervised grammar induction technique.


meeting of the association for computational linguistics | 2004

A comparison of two different approaches to morphological analysis of Dutch

Guy De Pauw; Tom Laureys; Walter Daelemans; Hugo Van hamme

This paper compares two systems for computational morphological analysis of Dutch. Both systems have been independently designed as separate modules in the context of the FLaVoR project, which aims to develop a modular architecture for automatic speech recognition. The systems are trained and tested on the same Dutch morphological database (CELEX), and can thus be objectively compared as morphological analyzers in their own right.


language resources and evaluation | 2011

Exploring the SAWA corpus: collection and deployment of a parallel corpus English–Swahili

Guy De Pauw; Peter Waiganjo Wagacha; Gilles-Maurice de Schryver

Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word parallel corpus English–Swahili. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence and word-aligned and morphosyntactic information was added to both the English and Swahili portions of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English–Swahili translation dictionaries. We particularly focus on the difficulties of translating English into the morphologically more complex Bantu language of Swahili.


ACM Transactions on Intelligent Systems and Technology | 2016

Multimodular Text Normalization of Dutch User-Generated Content

Sarah Schulz; Guy De Pauw; Orphée De Clercq; Bart Desmet; Veronique Hoste; Walter Daelemans; Lieve Macken

As social media constitutes a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the nonstandard language used on social media poses problems for natural language processing (NLP) tools, as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multimodular approach to account for the diversity of normalization issues encountered in user-generated content (UGC). We consider three different types of UGC written in Dutch (SNS, SMS, and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer, and named-entity recognizer before and after normalization.


international conference on computational linguistics | 2000

Aspects of pattern-matching in Data-Oriented Parsing

Guy De Pauw

Data-Oriented Parsing (DOP) ranks among the best parsing schemes, pairing state-of-the-art parsing accuracy to the psycholinguistic insight that larger chunks of syntactic structures are relevant grammatical and probabilistic units. Parsing with the DOP-model, however, seems to involve a lot of CPU cycles and a considerable amount of double work, brought on by the concept of multiple derivations, which is necessary for probabilistic processing, but which is not convincingly related to a proper linguistic backbone. It is however possible to reinterpret the DOP-model as a pattern-matching model, which tries to maximize the size of the substructures that construct the parse, rather than the probability of the parse. By emphasizing this memory-based aspect of the DOP-model, it is possible to do away with multiple derivations, opening up possibilities for efficient Viterbi-style optimizations, while still retaining acceptable parsing accuracy through enhanced context-sensitivity.


language resources and evaluation | 2011

Introduction to the special issue on African Language Technology

Guy De Pauw; Gilles-Maurice de Schryver; Laurette Pretorius; Lori S. Levin

In today’s digital multilingual world, language technology is crucial for providing access to information and opportunities for economic development. With approximately two thousand different languages, Africa is a multilingual continent par excellence, presenting acute challenges for those seeking to promote and use African languages in the areas of business development, education and relief aid. In recent times a number of researchers and institutions, both from Africa and elsewhere, have come forward to share the common goal of developing capabilities in language technology for African languages. In 2009 and 2010, the first two workshops on African Language Technology were organized (De Pauw et al. 2009, 2010a) as a forum to bring together a wide range of researchers working in this domain.


Lexikos | 2009

A Corpus-based Survey of Four Electronic Swahili–English Bilingual Dictionaries

Guy De Pauw; Gilles-Maurice de Schryver; Peter Waiganjo Wagacha

Abstract: In this article we survey four different electronic bilingual dictionaries for the language pair Swahili–English. Aided by a data-driven morphological analyzer and part-of-speech tagger, we quantify the coverage of the dictionaries on large monolingual corpora of Swahili. In a second series of experiments, we investigate how applicable the dictionaries are as a tool in the development of a machine translation system, by evaluating bilingual coverage on the parallel SAWA corpus. At the same time we attempt to consolidate the dictionaries into a unified lexicographic database and compare the coverage to that of its composite parts.

Keywords: LEXICOGRAPHY, EVALUATION, MORPHOLOGY, LEMMATIZATION, PARALLEL CORPORA, MACHINE LEARNING, MACHINE TRANSLATION, SWAHILI (KISWAHILI), ENGLISH

Samenvatting: Een corpusgebaseerde evaluatie van vier bilinguale elektronische woordenboeken Swahili–Engels. In dit artikel evalueren we vier verschillende elektronische woordenboeken voor het talenpaar Swahili–Engels. Met behulp van automatische morfosyntactische analyse, kwantificeren we de dekking van de woordenboeken op basis van grote monolinguale corpora voor het Swahili. In een tweede reeks experimenten onderzoeken we de toepasbaarheid van de woordenboeken als hulpmiddel bij de ontwikkeling van automatische vertaalsystemen, door hun bilinguale dekking te meten op basis van het parallelle SAWA corpus. Tegelijkertijd proberen we de woordenboeken te integreren in een overkoepelende lexicografische databank en vergelijken we de dekking ervan met die van de samenstellende delen.

Sleutelwoorden: LEXICOGRAFIE, EVALUATIE, MORFOLOGIE, LEMMATISERING, PARALLELLE CORPORA, AUTOMATISCHE LEERTECHNIEKEN, AUTOMATISCH VERTALEN, SWAHILI (KISWAHILI), ENGELS
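One plausible way to quantify dictionary coverage on a monolingual corpus is to count how many lemmatized corpus tokens, and how many distinct lemmas, have a dictionary entry. The sketch below assumes the corpus has already been lemmatized and that each dictionary is available as a set of headwords. The function name and the sample data are made up for illustration, and the article's exact coverage measure may differ.

```python
def coverage(lemmas, headwords):
    """Return (token coverage, type coverage) of a headword set over a lemma list."""
    tokens = list(lemmas)
    if not tokens:
        return 0.0, 0.0
    # Token coverage: share of running words whose lemma is a headword.
    token_cov = sum(1 for lemma in tokens if lemma in headwords) / len(tokens)
    # Type coverage: share of distinct lemmas that are headwords.
    types = set(tokens)
    type_cov = len(types & set(headwords)) / len(types)
    return token_cov, type_cov


# Hypothetical lemmatized corpus and two small headword lists.
corpus = ["kitabu", "soma", "kitabu", "nyumba"]
dict_a = {"kitabu", "soma"}
dict_b = {"nyumba"}

print(coverage(corpus, dict_a))
print(coverage(corpus, dict_a | dict_b))  # coverage of the merged headword lists
```

Computing the same two figures for the union of the headword sets gives a simple comparison between a consolidated database and its component dictionaries.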


conference on computational natural language learning | 2000

The role of algorithm bias vs information source in learning algorithms for Morphosyntactic Disambiguation

Guy De Pauw; Walter Daelemans

Morphosyntactic Disambiguation (Part of Speech tagging) is a useful benchmark problem for system comparison because it is typical of a large class of Natural Language Processing (NLP) problems that can be defined as disambiguation in local context. This paper adds to the literature on the systematic and objective evaluation of different methods to automatically learn this type of disambiguation problem. We systematically compare two inductive learning approaches to tagging: MXPOST (based on maximum entropy modeling) and MBT (based on memory-based learning). We investigate the effect of different sources of information on accuracy when comparing the two approaches under the same conditions. Results indicate that earlier observed differences in accuracy can be attributed largely to differences in information sources used, rather than to algorithm bias.

Collaboration


Top co-authors of Guy De Pauw.

Hugo Van hamme

Katholieke Universiteit Leuven

Joris Driesen

Katholieke Universiteit Leuven

Jort F. Gemmeke

Katholieke Universiteit Leuven

Peter Mariën

Vrije Universiteit Brussel