Publication


Featured research published by Gilles-Maurice de Schryver.


Lexikos | 2010

Do dictionary users really look up frequent words? On the overestimation of the value of corpus-based lexicography

Gilles-Maurice de Schryver; David Joffe; Pitta Joffe; Sarah Hillewaert

Abstract: An innovative online Swahili–English dictionary project is presented. A careful study of some of the log files attached to this reference work reveals some hitherto unknown aspects of true dictionary look-up behaviour, which results in the depreciation of the importance of corpora for dictionary making. Three lexicography software modules are advanced to further enhance the success of the online dictionary.

Keywords: LEXICOGRAPHY, SOFTWARE, ONLINE, DICTIONARY, LOG FILE, CORPUS, FREQUENCY, RANK, CORRELATION, SWAHILI, ENGLISH, TSHWANELEX

Samenvatting: Zoeken woordenboekgebruikers werkelijk frequente woorden op? Over de overschatting van de waarde van corpusgebaseerde lexicografie. Een vernieuwend online Swahili–Engels woordenboekproject wordt voorgesteld. Een minutieuze studie van enkele van de log bestanden gekoppeld aan dit referentiewerk onthult tot dusver onbekende aspecten van het echte opzoekgedrag van woordenboekgebruikers, wat leidt tot een devaluatie van het belang van corpora voor het maken van woordenboeken. Drie lexicografische softwaremodules worden naar voor geschoven om het succes van het online woordenboek verder te vergroten.

Sleutelwoorden: LEXICOGRAFIE, SOFTWARE, ONLINE, WOORDENBOEK, LOG BESTAND, CORPUS, FREQUENTIE, RANG, CORRELATIE, SWAHILI, ENGELS, TSHWANELEX


South African Journal of African Languages | 2000

Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure

Gilles-Maurice de Schryver; Daniël Jacobus Prinsloo

Good modern dictionaries increasingly base the compilation of both their macro- and microstructure on electronic corpora. As the macrostructure is the subject of this article, a few typical macrostructural inconsistencies in existing African-language dictionaries, which can be rectified by the utilisation of a corpus, are discussed. It is shown that the first application of a corpus is the utilisation of word-frequency counts to compile a lemmatised frequency list. Together with data on lemma-sign distributions across sub-corpora, the lemma-sign list of a dictionary can subsequently be derived. These theoretical notions are exemplified with a thorough discussion of how an electronic corpus led to the creation of the macrostructure of a Cilubà-Dutch dictionary. In addition, explicit frequency markers are advanced to further enhance the macrostructural reference quality. The latter is illustrated with both Cilubà-Dutch and Sepedi-English dictionaries. Finally, the article concludes with a series of macrostructural improvements of corpus-aided/based dictionaries over manually compiled ones.
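The step from word-frequency counts to a lemmatised frequency list, together with a count of how many sub-corpora each lemma occurs in, can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `lemmatised_frequencies`, the `lemma_of` lookup table and the toy English data are hypothetical and are not taken from the article, which works with Cilubà and Sepedi corpora.

```python
from collections import Counter


def lemmatised_frequencies(subcorpora, lemma_of):
    """Rank lemmas by total frequency and record their spread.

    subcorpora: dict mapping a sub-corpus name to a list of tokens.
    lemma_of:   dict mapping a word form to its lemma (hypothetical
                lookup; unknown forms are treated as their own lemma).
    Returns a list of (lemma, total_count, number_of_subcorpora),
    sorted by descending count, then alphabetically.
    """
    totals = Counter()   # overall frequency per lemma
    spread = Counter()   # number of sub-corpora containing the lemma
    for tokens in subcorpora.values():
        seen_here = set()
        for token in tokens:
            form = token.lower()
            lemma = lemma_of.get(form, form)
            totals[lemma] += 1
            seen_here.add(lemma)
        spread.update(seen_here)
    ranked = sorted(totals, key=lambda lemma: (-totals[lemma], lemma))
    return [(lemma, totals[lemma], spread[lemma]) for lemma in ranked]


# Toy data, invented purely for illustration.
toy_subcorpora = {
    "news": ["walks", "walked", "dog"],
    "fiction": ["walk", "dogs", "cat"],
}
toy_lemmas = {"walks": "walk", "walked": "walk", "dogs": "dog"}
ranking = lemmatised_frequencies(toy_subcorpora, toy_lemmas)
```

A lemma-sign list could then be drawn from the top of such a ranking, with the spread figure used to filter out lemmas that are frequent in only one sub-corpus; the thresholds for doing so are a lexicographic decision that this sketch does not attempt to model.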


Text, Speech and Dialogue | 2006

Data-driven part-of-speech tagging of Kiswahili

Guy De Pauw; Gilles-Maurice de Schryver; Peter Waiganjo Wagacha

In this paper we present experiments with data-driven part-of-speech taggers trained and evaluated on the annotated Helsinki Corpus of Swahili. Using four of the current state-of-the-art data-driven taggers, TnT, MBT, SVMTool and MXPOST, we observe the latter as being the most accurate tagger for the Kiswahili dataset. We further improve on the performance of the individual taggers by combining them into a committee of taggers. We observe that the more naive combination methods, like the novel plural voting approach, outperform more elaborate schemes like cascaded classifiers and weighted voting. This paper is the first publication to present experiments on data-driven part-of-speech tagging for Kiswahili and Bantu languages in general.
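To give a feel for what combining taggers into a committee involves, here is a minimal sketch of plain majority voting over the outputs of several taggers. It is not the paper's plural voting scheme, whose details the abstract does not give; the function name `majority_vote`, the `tie_breaker` parameter and the toy tags are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(tag_sequences, tie_breaker=0):
    """Combine per-token tags from several taggers by simple majority.

    tag_sequences: list of equal-length tag lists, one list per tagger.
    tie_breaker:   index of the tagger preferred when counts are tied
                   (for example, the most accurate individual tagger).
    """
    if not tag_sequences:
        return []
    length = len(tag_sequences[0])
    if any(len(seq) != length for seq in tag_sequences):
        raise ValueError("all taggers must tag the same tokens")

    combined = []
    for position in range(length):
        votes = [seq[position] for seq in tag_sequences]
        counts = Counter(votes)
        best = max(counts.values())
        winners = [tag for tag in votes if counts[tag] == best]
        preferred = tag_sequences[tie_breaker][position]
        # Use the preferred tagger's tag if it is among the tied winners,
        # otherwise fall back to the first winning tag in tagger order.
        combined.append(preferred if preferred in winners else winners[0])
    return combined


# Four hypothetical taggers labelling the same three tokens.
outputs = [
    ["N", "V", "N"],
    ["N", "V", "ADJ"],
    ["V", "V", "ADJ"],
    ["N", "N", "N"],
]
combined_tags = majority_vote(outputs)
```

The third token is a two-against-two tie, so the result there depends on which tagger is trusted as tie-breaker; weighted schemes differ mainly in how they replace that equal-vote assumption.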


Lexikos | 2009

Improving the Computational Morphological Analysis of a Swahili Corpus for Lexicographic Purposes

Guy De Pauw; Gilles-Maurice de Schryver

Abstract: Computational morphological analysis is an important first step in the automatic treatment of natural language and a useful lexicographic tool. This article describes a corpus-based approach to the morphological analysis of Swahili. We particularly focus our discussion on its ability to retrieve lemmas for word forms and evaluate it as a tool for corpus-based dictionary compilation.

Keywords: LEXICOGRAPHY, MORPHOLOGY, CORPUS ANNOTATION, LEMMATIZATION, MACHINE LEARNING, SWAHILI (KISWAHILI)

Samenvatting: Accuratere computationele morfologische analyse van een Swahili corpus voor lexicografische doeleinden. Computationele morfologische analyse is een belangrijke eerste stap in de automatische verwerking van natuurlijke taal en een nuttig lexicografisch hulpmiddel. Dit artikel beschrijft een corpusgebaseerde aanpak voor de morfologische analyse van het Swahili. We concentreren ons hierbij vooral op de lemmatiseringseigenschappen van het ontwikkelde systeem en evalueren het als een hulpmiddel bij de corpusgebaseerde ontwikkeling van woordenboeken.

Sleutelwoorden: LEXICOGRAFIE, MORFOLOGIE, CORPUSANNOTATIE, LEMMATISERING, AUTOMATISCHE LEERTECHNIEKEN, SWAHILI (KISWAHILI)


Lexikos | 2010

Dictionary Writing System (DWS) + Corpus Query Package (CQP): The Case of "TshwaneLex"

Gilles-Maurice de Schryver; Guy De Pauw

Abstract: In this article the integrated corpus query functionality of the dictionary compilation software TshwaneLex is analysed. Attention is given to the handling of both raw corpus data and annotated corpus data. With regard to the latter it is shown how, with a minimum of human effort, machine learning techniques can be employed to obtain part-of-speech tagged corpora that can be used for lexicographic purposes. All points are illustrated with data drawn from English and Northern Sotho. The tools and techniques themselves, however, are language-independent, and as such the encouraging outcomes of this study are far-reaching.


Text, Speech and Dialogue | 2007

Automatic diacritic restoration for resource-scarce languages

Guy De Pauw; Peter Waiganjo Wagacha; Gilles-Maurice de Schryver

The orthography of many resource-scarce languages includes diacritically marked characters. Falling outside the scope of the standard Latin encoding, these characters are often represented in digital language resources as their unmarked equivalents. This renders corpus compilation more difficult, as these languages typically do not have the benefit of large electronic dictionaries to perform diacritic restoration. This paper describes experiments with a machine learning approach that is able to automatically restore diacritics on the basis of local graphemic context. We apply the method to the African languages of Ciluba, Gikuyu, Kikamba, Maa, Sesotho sa Leboa, Tshivenda and Yoruba and contrast it with experiments on Czech, Dutch, French, German and Romanian, as well as Vietnamese and Chinese Pinyin.
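The idea of restoring diacritics from local graphemic context can be illustrated with a small sketch. The code below is a simple frequency-table stand-in with back-off to narrower contexts, not the machine learning method used in the paper; the function names `clusters`, `train` and `restore`, the `window` parameter, the `_` padding symbol and the toy training string are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter, defaultdict


def clusters(text):
    """Split text into (base character, marked form) pairs."""
    pairs = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and pairs:
            base, marked = pairs[-1]
            pairs[-1] = (base, marked + ch)  # attach mark to its base
        else:
            pairs.append((ch, ch))
    return [(b, unicodedata.normalize("NFC", m)) for b, m in pairs]


def train(marked_text, window=2):
    """Count which marked form each unmarked context maps to."""
    pairs = clusters(marked_text)
    plain = "".join(base for base, _ in pairs)
    padded = "_" * window + plain + "_" * window
    model = defaultdict(Counter)
    for i, (_, marked) in enumerate(pairs):
        # Store the widest context and every narrower one for back-off.
        for w in range(window, -1, -1):
            context = padded[i + window - w: i + window + w + 1]
            model[context][marked] += 1
    return model


def restore(plain_text, model, window=2):
    """Restore diacritics on unmarked text, widest known context first."""
    padded = "_" * window + plain_text + "_" * window
    restored = []
    for i, ch in enumerate(plain_text):
        choice = ch  # unseen characters are left unchanged
        for w in range(window, -1, -1):
            context = padded[i + window - w: i + window + w + 1]
            if context in model:
                choice = model[context].most_common(1)[0][0]
                break
        restored.append(choice)
    return "".join(restored)


# Toy training text, invented for illustration only.
toy_model = train("t\u0161a tsela")
```

Because only character windows are consulted, a sketch like this needs no dictionary, which is the property that makes graphemic approaches attractive for languages without large lexical resources. A real system would need far more training text and a proper learner to generalise beyond contexts it has seen.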


Southern African Linguistics and Applied Language Studies | 2002

The Zulu locative prefix ku- revisited: a corpus-based approach

Gilles-Maurice de Schryver; Rachélle Gauton

This article re-examines the distribution of the class 17 locative prefix ku- and its variants kwi- and ko- in the locativisation of nouns in Zulu. To this end an electronic corpus of 5 million running Zulu words—the University of Pretoria Zulu Corpus (PZC)—is queried. We indicate how PZC can be used to highlight previously under-emphasised and overlooked aspects of a seemingly well-documented language feature such as the class 17 locative prefix ku-. Analysing the output proffered by PZC—an organic corpus that has been organised chronologically, and that consists of a number of sub-corpora stratified according to genre—enables us to reach conclusions regarding not only the frequencies with which the variants ku-, kwi- and ko- of the class 17 locative prefix are used, but also regarding the possible changes in use that these prefixes have undergone with time. We also show how these prefixes relate to each other in the different sub-corpora.


Lexikos | 2010

Online Dictionaries on the Internet: An Overview for the African Languages

Gilles-Maurice de Schryver

Abstract: The main purpose of this research article is rather bold, in that an attempt is made at a comprehensive overview of all currently available African-language Internet dictionaries. Quite surprisingly, a substantial number of such dictionaries is already available, for a large number of languages, with a relatively large number of users. The key characteristics of these dictionaries and various cross-language distributions are expounded on. In a second section the first South African online dictionary interface is introduced. Although compiled by just a small number of scholars, this dictionary contains a world's first in that lexicographic customisation is implemented on various levels in real time on the Internet.

Keywords: LEXICOGRAPHY, TERMINOLOGY, DICTIONARIES, INTERNET, ONLINE, LOOK-UP MODE, BROWSE MODE, AFRICAN LANGUAGES, SESOTHO SA LEBOA, SIMULTANEOUS FEEDBACK, FUZZY SF, CUSTOMISATION

Senaganwa: Dipukuntsu tsa online tse di lego mo Inthaneteng: Ponokakaretso ya maleme a Afrika. Morero wo mogolo wa taodiswana ye ya nyakisiso ke wo o tiilego ka ge teko e dirilwe ka tebelelo ya kakaretso ye e tletsego go dipukuntsu ka moka tsa Inthanete tseo di setsego di le gona mo malemeng a Afrika. Sa go makatsa ke gore go setse go na le palo ye ntsi ya dipukuntsu tse bjalo mo malemeng a mantsi gape di na le badirisi ba bantsi. Go hlaloswa dipharologantsho tse bohlokwa tsa dipukuntsu tse le ka moo diphatlalatso di dirwago ka gona gare ga maleme a mantsi a go fapana. Mo karolong ya bobedi go tsebiswa pukuntsu ya online ye e lego ya pele gape e lego ya makgonthe ya Afrika Borwa. Le ge e le gore pukuntsu ye e hlamilwe ke dirutegi di se kae, e setse e tsea sefoka lefaseng ka bophara. Se ke ka lebaka la gore pukuntsu ye e dirilwe ka tsela yeo e lego gore dilo di ka beakanywa gore di itsweletse ka botsona gomme tsa lokela batho ka moka bao ba e dirisago mo Inthaneteng ka yona nako yeo.

Mantsu a bohlokwa: TLHAMO YA DIPUKUNTSU, TLHAMO YA MAREO, DIPUKUNTSU, INTHANETE, ONLINE, MOKGWA WA GO NYAKA, MOKGWA WA GO LEKOLA, MALEME A AFRIKA, SESOTHO SA LEBOA, SIMULTANEOUS FEEDBACK, FUZZY SF, GO BEAKANYA DILO GORE DI BE KA MOKGWA WO O LEGO GORE O TLA GO LOKELA


South African Journal of African Languages | 2002

Reversing an African-language lexicon: the Northern Sotho Terminology and Orthography No. 4 as a case in point

D.J. Prinsloo; Gilles-Maurice de Schryver

The aim of this article is to give a perspective on the required strategies and the potential difficulties involved in reversing a unidirectional bilingual dictionary with an African language as the target language. This will be done by means of a full-scale case study of the (hypothetical) reversal of the Northern Sotho Terminology and Orthography No. 4. It will be pointed out that the African languages do indeed pose unique problems, difficulties that emanate directly from the structure of those languages. It will also be shown that the use of the forward slash causes particular additional complications. The discussion is preceded by an overview of the main issues involved in the compilation of (bidirectional) bilingual dictionaries with the focus on the different types of equivalence relations on the one hand and the reversibility principle on the other.


Language Resources and Evaluation | 2011

Exploring the SAWA corpus: collection and deployment of a parallel corpus English-Swahili

Guy De Pauw; Peter Waiganjo Wagacha; Gilles-Maurice de Schryver

Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word parallel corpus English-Swahili. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence and word-aligned and morphosyntactic information was added to both the English and Swahili portion of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English-Swahili translation dictionaries. We particularly focus on the difficulties of translating English into the morphologically more complex Bantu language of Swahili.

Collaboration


Dive into Gilles-Maurice de Schryver's collaborations.

Top Co-Authors


David Joffe

University of Pretoria


Els Cranshof

Université libre de Bruxelles
