Bruno Golénia | Researchain

Publication

Featured research published by Bruno Golénia.

Spoken Language Technology Workshop | 2008

Learning the morphology of Zulu with different degrees of supervision

Sebastian Spiegler; Bruno Golénia; Ksenia Shalonova; Peter A. Flach; Roger C. F. Tucker

In this paper, we compare different levels of supervision for learning the morphology of the indigenous South African language Zulu. After a preliminary analysis of the Zulu data used for our experiments, we concentrate on supervised, semi-supervised and unsupervised approaches, comparing the strengths and weaknesses of each method. The challenges we face are limited data availability and data sparsity in connection with the morphological analysis of indigenous languages. At the end of the paper, we draw conclusions for our future work towards a morphological analyzer for Zulu.

Cross-Language Evaluation Forum | 2009

Unsupervised word decomposition with the PROMODES algorithm

Sebastian Spiegler; Bruno Golénia; Peter A. Flach

We present PROMODES, an algorithm for unsupervised word decomposition that is based on a probabilistic generative model. The model treats segment boundaries as hidden variables and includes probabilities for letter transitions within segments. For the Morpho Challenge 2009, we demonstrate three versions of PROMODES. The first uses a simple segmentation algorithm on a subset of the data and applies maximum likelihood estimates for the model parameters when decomposing words of the original language data. The second estimates its parameters through expectation maximization (EM). The third is a committee of unsupervised learners, where the learners correspond to different EM initializations. The solution is found by majority vote, which decides whether or not to segment at each word position. In this paper, we describe the probabilistic model, the parameter estimation and how the most likely decomposition of an input word is found. We have tested PROMODES on non-vowelized and vowelized Arabic as well as on English, Finnish, German and Turkish. All three methods achieved competitive results.
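The committee step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `majority_vote_segment`, the boolean vote layout, and the strict-majority threshold are assumptions made for illustration, and the votes below are invented rather than produced by EM-trained learners.

```python
from typing import List


def majority_vote_segment(word: str, votes: List[List[bool]]) -> List[str]:
    """Segment `word` by a per-position majority vote.

    votes[k][i] is True if (hypothetical) learner k places a boundary
    after word[i]; each learner votes on len(word) - 1 positions.
    """
    segments: List[str] = []
    start = 0
    for i in range(len(word) - 1):
        yes = sum(1 for learner in votes if learner[i])
        # Assumption: a boundary needs a strict majority of the committee.
        if yes * 2 > len(votes):
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


# Three invented learners voting on the five internal positions of "walked".
votes = [
    [False, False, False, True, False],
    [False, False, False, True, False],
    [False, True, False, False, False],
]
print(majority_vote_segment("walked", votes))  # ['walk', 'ed']
```

In this toy run, two of the three learners agree on a boundary after "walk", so that boundary is kept, while the single vote after "wa" is discarded.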

IEEE Transactions on Audio, Speech, and Language Processing | 2009

Towards Learning Morphology for Under-Resourced Fusional and Agglutinating Languages

Kseniya B. Shalonova; Bruno Golénia; Peter A. Flach

In this paper, we describe a novel and effective approach for automatically decomposing a word into stem and suffixes. Russian and Turkish are used as exemplars of fusional and agglutinating languages. Rather than relying on corpus counts, we use a small number of word-pairs as training data, which makes the approach particularly suited to under-resourced languages. For fusional languages, we initially learn a tree of aligned suffix rules (TASR) from word-pairs. The tree is built top-down, from general to specific rules, using suffix rule frequency and rule subsumption, and is executed bottom-up, i.e., the most specific rule that fires is chosen. TASR is used to segment a word form into a stem and a suffix sequence. For fusional languages, learning through generation (using TASR) is essential for proper stem extraction. Subsequently, an unsupervised segmentation algorithm, graph-based unsupervised suffix segmentation (GBUSS), is used to segment the suffix sequence. GBUSS employs a suffix graph in which node merging, guided by an information-theoretic measure, generates suffix sequences. The approach, experimentally validated on Russian, is shown to be highly effective. For agglutinating languages, only GBUSS is needed for word decomposition. Promising experimental results for Turkish are obtained.
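The idea that "the most specific rule that fires is chosen" can be sketched as follows. This is a simplified illustration, not TASR itself: the rules are held in a flat dictionary rather than a tree, specificity is assumed to mean the longest matching source suffix, and the function name `apply_most_specific_rule` and the English toy rules are hypothetical.

```python
from typing import Dict, Tuple


def apply_most_specific_rule(word: str, rules: Dict[str, str]) -> Tuple[str, str, str]:
    """Return (stem, matched_suffix, generated_form) for `word`.

    `rules` maps a source suffix to a target suffix, as might be aligned
    from word-pairs. Assumption: the longest matching source suffix is
    the most specific rule.
    """
    fired = [src for src in rules if src and word.endswith(src)]
    if not fired:
        # No rule fires: treat the whole word as the stem.
        return word, "", word
    src = max(fired, key=len)
    stem = word[: len(word) - len(src)]
    return stem, src, stem + rules[src]


# Hypothetical toy rules: a general rule and a more specific one.
toy_rules = {"s": "", "ies": "y"}
print(apply_most_specific_rule("cities", toy_rules))  # ('cit', 'ies', 'city')
print(apply_most_specific_rule("cats", toy_rules))    # ('cat', 's', 'cat')
```

Both rules fire on "cities", but the more specific "ies" rule wins, which yields the stem "cit" rather than "citie".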

Cross-Language Evaluation Forum | 2009

Unsupervised morpheme discovery with UNGRADE

Bruno Golénia; Sebastian Spiegler; Peter A. Flach

In this paper, we present an unsupervised algorithm for morpheme discovery called UNGRADE (UNsupervised GRAph DEcomposition). UNGRADE works in three steps and can be applied to languages whose words have the structure prefixes-stem-suffixes. In the first step, a stem is obtained for each word using a sliding window, such that the description length of the window is minimised. In the next step, prefix and suffix sequences are sought using a morpheme graph. The last step consists of combining the morphemes found in the previous steps. UNGRADE has been experimentally evaluated on five languages (English, German, Finnish, Turkish and Arabic) with encouraging results.

SIGKDD Explorations | 2010