Peter Waiganjo Wagacha
University of Nairobi
Publication
Featured research published by Peter Waiganjo Wagacha.
text speech and dialogue | 2006
Guy De Pauw; Gilles-Maurice de Schryver; Peter Waiganjo Wagacha
In this paper we present experiments with data-driven part-of-speech taggers trained and evaluated on the annotated Helsinki Corpus of Swahili. Using four of the current state-of-the-art data-driven taggers, TnT, MBT, SVMTool and MXPOST, we find the last of these to be the most accurate tagger for the Kiswahili dataset. We further improve on the performance of the individual taggers by combining them into a committee of taggers. We observe that the more naive combination methods, like the novel plural voting approach, outperform more elaborate schemes like cascaded classifiers and weighted voting. This paper is the first publication to present experiments on data-driven part-of-speech tagging for Kiswahili and Bantu languages in general.
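To make the committee idea concrete, here is a minimal sketch of the simplest combination scheme, plain majority voting over the per-token output of several taggers. It is not the plural voting method from the paper, whose details the abstract does not give. The function name, the tie-breaking rule and the toy tags are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(tag_sequences, tie_breaker=0):
    """Combine per-token tags from several taggers by simple majority vote.

    tag_sequences: one list of tags per tagger, all of the same length.
    tie_breaker: index of the tagger preferred when the vote is tied
    (an assumed rule, chosen only to keep the sketch deterministic).
    """
    if len({len(seq) for seq in tag_sequences}) != 1:
        raise ValueError("all taggers must tag the same number of tokens")
    combined = []
    for votes in zip(*tag_sequences):
        counts = Counter(votes)
        best = max(counts.values())
        winners = [tag for tag, n in counts.items() if n == best]
        preferred = votes[tie_breaker]
        # On a tie, follow the preferred tagger if its tag is among the winners.
        combined.append(preferred if preferred in winners else winners[0])
    return combined


# Hypothetical output of four taggers for a three-token sentence.
outputs = [
    ["PRON", "V", "N"],
    ["PRON", "V", "ADJ"],
    ["PRON", "N", "ADJ"],
    ["N", "V", "N"],
]
print(majority_vote(outputs))  # ['PRON', 'V', 'N']
```

The last token is a two-against-two tie, so the sketch falls back on the first tagger. A weighted scheme would replace the raw counts with per-tagger weights.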
text speech and dialogue | 2007
Guy De Pauw; Peter Waiganjo Wagacha; Gilles-Maurice de Schryver
The orthography of many resource-scarce languages includes diacritically marked characters. Falling outside the scope of the standard Latin encoding, these characters are often represented in digital language resources as their unmarked equivalents. This renders corpus compilation more difficult, as these languages typically do not have the benefit of large electronic dictionaries to perform diacritic restoration. This paper describes experiments with a machine learning approach that is able to automatically restore diacritics on the basis of local graphemic context. We apply the method to the African languages of Ciluba, Gikuyu, Kikamba, Maa, Sesotho sa Leboa, Tshivenda and Yoruba and contrast it with experiments on Czech, Dutch, French, German and Romanian, as well as Vietnamese and Chinese Pinyin.
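Restoration from local graphemic context can be sketched as a lookup task: for every character of the unmarked text, predict its marked form from a small window of neighbouring characters. The sketch below uses a simple frequency table rather than the machine learning method of the paper, which the abstract does not name. The window width, the function names and the one-word training text are assumptions, and the code assumes every marked letter is a single precomposed code point.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_char(ch):
    """Return the unmarked base form of a single precomposed character."""
    base = "".join(c for c in unicodedata.normalize("NFD", ch)
                   if not unicodedata.combining(c))
    return base or ch


def windows(plain, width):
    """Yield (context, index): the padded character window around each index."""
    padded = "_" * width + plain + "_" * width
    for i in range(len(plain)):
        yield padded[i:i + 2 * width + 1], i


def train(marked_text, width=2):
    """Count which marked character occurs in each unmarked context window."""
    marked = unicodedata.normalize("NFC", marked_text)
    plain = "".join(strip_char(ch) for ch in marked)
    table = defaultdict(Counter)
    for context, i in windows(plain, width):
        table[context][marked[i]] += 1
    return table


def restore(plain, table, width=2):
    """Pick the most frequent marked character per window, else keep the input."""
    out = []
    for context, i in windows(plain, width):
        seen = table.get(context)
        out.append(seen.most_common(1)[0][0] if seen else plain[i])
    return "".join(out)


# Toy training text: a single word, for illustration only.
table = train("gĩkũyũ")
print(restore("gikuyu", table))
```

Unseen windows fall back to the unmarked character, so the sketch never invents a diacritic it has not observed in that context.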
language resources and evaluation | 2011
Guy De Pauw; Peter Waiganjo Wagacha; Gilles-Maurice de Schryver
Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word English–Swahili parallel corpus. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence- and word-aligned, and morphosyntactic information was added to both the English and Swahili portions of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English–Swahili translation dictionaries. We particularly focus on the difficulties of translating English into the morphologically more complex Bantu language of Swahili.
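Annotation projection through word alignments can be illustrated with a short sketch. Each Swahili token collects the tags of the English tokens aligned to it. The function name, the tag labels and the hand-written alignment are assumptions for illustration and do not reproduce the procedure used for the corpus.

```python
def project_tags(source_tags, alignment, target_length, unaligned="UNK"):
    """Project source-side tags onto target tokens through word alignments.

    alignment: iterable of (source_index, target_index) pairs.
    Returns, per target token, the tags of all aligned source tokens in
    source order, or [unaligned] when no source token is aligned.
    """
    projected = [[] for _ in range(target_length)]
    for src, tgt in sorted(alignment):
        projected[tgt].append(source_tags[src])
    return [tags or [unaligned] for tags in projected]


# Hand-built example: one Swahili word covers three English words.
english = ["I", "love", "you"]
english_tags = ["PRON", "VERB", "PRON"]
swahili = ["ninakupenda"]
alignment = [(0, 0), (1, 0), (2, 0)]

print(project_tags(english_tags, alignment, len(swahili)))
# [['PRON', 'VERB', 'PRON']]
```

The single Swahili token receives three projected tags, a small instance of the many-to-one alignments that a morphologically complex target language produces.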
conference of the european chapter of the association for computational linguistics | 2009
Guy De Pauw; Peter Waiganjo Wagacha; Gilles-Maurice de Schryver
Research in data-driven methods for Machine Translation has greatly benefited from the increasing availability of parallel corpora. Processing the same text in two different languages yields useful information on how words and phrases are translated from a source language into a target language. To investigate this, a parallel corpus is typically aligned by linking linguistic tokens in the source language to the corresponding units in the target language. An aligned parallel corpus therefore facilitates the automatic development of a machine translation system and can also bootstrap annotation through projection. In this paper, we describe data collection and annotation efforts and preliminary experimental results with an English–Swahili parallel corpus.
Lexikos | 2009
Guy De Pauw; Gilles-Maurice de Schryver; Peter Waiganjo Wagacha
Abstract: In this article we survey four different electronic bilingual dictionaries for the language pair Swahili–English. Aided by a data-driven morphological analyzer and part-of-speech tagger, we quantify the coverage of the dictionaries on large monolingual corpora of Swahili. In a second series of experiments, we investigate how applicable the dictionaries are as a tool in the development of a machine translation system, by evaluating bilingual coverage on the parallel SAWA corpus. At the same time we attempt to consolidate the dictionaries into a unified lexicographic database and compare the coverage to that of its composite parts. Keywords: LEXICOGRAPHY, EVALUATION, MORPHOLOGY, LEMMATIZATION, PARALLEL CORPORA, MACHINE LEARNING, MACHINE TRANSLATION, SWAHILI (KISWAHILI), ENGLISH Samenvatting: Een corpusgebaseerde evaluatie van vier bilinguale elektronische woordenboeken Swahili–Engels. In dit artikel evalueren we vier verschillende elektronische woordenboeken voor het talenpaar Swahili–Engels. Met behulp van automatische morfosyntactische analyse, kwantificeren we de dekking van de woordenboeken op basis van grote monolinguale corpora voor het Swahili. In een tweede reeks experimenten onderzoeken we de toepasbaarheid van de woordenboeken als hulpmiddel bij de ontwikkeling van automatische vertaalsystemen, door hun bilinguale dekking te meten op basis van het parallelle SAWA corpus. Tegelijkertijd proberen we de woordenboeken te integreren in een overkoepelende lexicografische databank en vergelijken we de dekking ervan met die van de samenstellende delen. Sleutelwoorden: LEXICOGRAFIE, EVALUATIE, MORFOLOGIE, LEMMATISERING, PARALLELLE CORPORA, AUTOMATISCHE LEERTECHNIEKEN, AUTOMATISCH VERTALEN, SWAHILI (KISWAHILI), ENGELS
International Journal of Computer Applications | 2014
Stephen M. Kang'ethe; Peter Waiganjo Wagacha
Data mining technologies have been used extensively in the commercial retail sectors to extract data from their “big data” warehouses. In healthcare, data mining has been used as well in various aspects which we explore. The voluminous amounts of data generated by medical systems form a good basis for discovery of interesting patterns that may aid decision making and saving of lives, not to mention reduction of costs in research work and possibly reduced morbidity prevalence. It is from this that we set out to implement a concept using association rule mining technology to find any diagnostic associations that may have arisen in patients’ medical records spanning multiple contacts of care. The dataset was obtained from Practice Fusion’s open research data, which contained over 98,000 patient clinic visits from all American states. Using an implementation of the classical apriori algorithm, we were able to mine for patterns arising from medical diagnosis data. The diagnosis data was based on ICD-9 coding, and this helped limit the set of possible diagnostic groups for the analysis. We then subjected the results to domain expert opinion. The panel of experts validated some of the most common associations, which had minimum confidence levels between 56% and 76%, with a concurrence rate of 90%, whereas others elicited debate amongst the medical practitioners. The results of our research showed that association rule mining can not only be used to confirm what is already known from health data in the form of comorbidity patterns, but can also generate some very interesting disease diagnosis associations that provide a good starting point for further studies by medical researchers to explain patterns that are seemingly unknown or peculiar in the concerned populations.
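The sketch below shows how the classical apriori algorithm finds frequent sets of diagnosis codes and turns them into association rules. It is a compact illustration, not the implementation used in the study. The five visits are made-up records that merely use ICD-9 style codes, and the thresholds and function names are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))}
    return frequent


def rules(frequent, min_confidence):
    """Derive rules lhs -> rhs with confidence = support(all) / support(lhs)."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = supp / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((set(lhs), set(itemset - lhs), confidence))
    return found


# Made-up visits; each set holds the diagnosis codes recorded at one visit.
visits = [
    {"250.00", "401.9", "272.4"},
    {"250.00", "401.9"},
    {"401.9", "272.4"},
    {"250.00", "401.9"},
    {"401.9"},
]
frequent = apriori(visits, min_support=0.4)
for lhs, rhs, confidence in rules(frequent, min_confidence=0.9):
    print(sorted(lhs), "->", sorted(rhs), confidence)
```

The prune step is what keeps the search tractable: the three-code candidate is discarded without counting because one of its pairs is already infrequent.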
Acta Pharmaceutica Sinica B | 2017
Harrison Ndung'u Mwangi; Peter Waiganjo Wagacha; Peterson Mathenge; Fredrick Sijenyi; Francis Mulaa
Generation of three-dimensional structures of macromolecules using in silico structural modeling technologies such as homology and de novo modeling has improved dramatically and increased the speed by which tertiary structures of organisms can be generated. This is especially the case if a homologous crystal structure is already available. High-resolution structures can be rapidly created using only their sequence information as input, a process that has the potential to increase the speed of scientific discovery. In this study, homology modeling and structure prediction tools such as RNA123 and SWISS-MODEL were used to generate the 40S ribosomal subunit from Plasmodium falciparum. This structure was modeled using the published crystal structure from Tetrahymena thermophila, a homologous eukaryote. In the absence of the Plasmodium falciparum 40S ribosomal crystal structure, the model accurately depicts a global topology, secondary and tertiary connections, and gives an overall root mean square deviation (RMSD) value of 3.9 Å relative to the template's crystal structure. Deviations are somewhat larger in areas with no homology between the templates. These results demonstrate that this approach has the power to identify motifs of interest in RNA and identify potential drug targets for macromolecules whose crystal structures are unknown. The results also show the utility of RNA homology modeling software for structure determination and lay the groundwork for applying this approach to larger and more complex eukaryotic ribosomes and other RNA-protein complexes. Structures generated from this study can be used in in silico screening experiments and lead to the determination of structures for targets/hit complexes.
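For readers unfamiliar with the RMSD figure, the sketch below computes it for two equally long lists of atom coordinates as the square root of the mean squared distance between paired atoms. It assumes the two structures are already superimposed and that atoms are paired by position in the list. The coordinates are invented, and the study's own alignment and atom selection are not described in the abstract.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between paired 3D points, in input units."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((a - b) ** 2
                  for p, q in zip(coords_a, coords_b)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(coords_a))


# Invented coordinates (in ångström): every atom is displaced by 5 Å.
model = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
template = [(3.0, 4.0, 0.0), (4.0, 5.0, 1.0)]
print(rmsd(model, template))  # 5.0
```

Because the squared distances are averaged before the root is taken, a few badly modeled regions can dominate the overall value.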
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages | 2014
Edward Ombui; Peter Waiganjo Wagacha; Wanjiku Ng'ang'a
This paper elucidates the InterlinguaPlus design and its application in bi-directional text translation between the Ekegusii and Kiswahili languages. Unlike traditional one-to-one translation pairs, either language can serve as the source or the target language. The first section is an overview of the project, which is followed by a brief review of Machine Translation. The next section discusses the implementation of the system using Carabao’s open machine translation framework and the results obtained. So far, the translation results have been plausible, particularly for the resource-scarce local languages, and clearly affirm the morphological similarities inherent in Bantu languages.
African Journal of Science and Technology | 2005
Ea Omwenga; T.M Waema; Peter Waiganjo Wagacha
conference of the international speech communication association | 2007
P de Guy; Peter Waiganjo Wagacha