Miikka Silfverberg
University of Helsinki
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Miikka Silfverberg.
systems and frameworks for computational morphology | 2009
Krister Lindén; Miikka Silfverberg; Tommi Pirinen
Morphological analysis of a wide range of languages can be implemented efficiently using finite-state transducer technologies. Over the last 30 years, a number of attempts have been made to create tools for computational morphologies. The two main competing approaches have been parallel vs. cascaded rule application. The parallel rule application was originally introduced by Koskenniemi [1] and implemented in tools like TwolC and LexC. Currently many applications of morphologies could use dictionaries encoding the a priori likelihoods of words and expressions as well as the likelihood of relations to other representations or languages. We have made the choice to create open-source tools and language descriptions in order to let as many as possible participate in the effort. The current article presents some of the main tools that we have created such as HFST-LexC, HFST-TwolC and HFST-Compose-Intersect. We evaluate their efficiency in comparison to some similar tools and libraries. In particular, we evaluate them using several full-fledged morphological descriptions. Our tools compare well with similar open source tools, even if we still have some challenges ahead before we can catch up with the commercial tools. We demonstrate that for various reasons a parallel rule approach still seems to be more efficient than a cascaded rule approach when developing finite-state morphologies.
systems and frameworks for computational morphology | 2011
Krister Lindén; Erik Axelson; Sam Hardwick; Tommi Pirinen; Miikka Silfverberg
HFST–Helsinki Finite-State Technology ( hfst.sf.net ) is a framework for compiling and applying linguistic descriptions with finite-state methods. HFST currently connects some of the most important finite-state tools for creating morphologies and spellers into one open-source platform and supports extending and improving the descriptions with weights to accommodate the modeling of statistical information. HFST offers a path from language descriptions to efficient language applications in key environments and operating systems. HFST also provides an opportunity to exchange transducers between different software providers in order to get the best out of each finite-state library.
language resources and evaluation | 2016
Miikka Silfverberg; Teemu Ruokolainen; Krister Lindén; Mikko Kurimo
This paper describes FinnPos, an open-source morphological tagging and lemmatization toolkit for Finnish. The morphological tagging model is based on the averaged structured perceptron classifier. Given training data, new taggers are estimated in a computationally efficient manner using a combination of beam search and model cascade. The lemmatization is performed employing a combination of a rule-based morphological analyzer, OMorFi, and a data-driven lemmatization model. The toolkit is readily applicable for tagging and lemmatization of running text with models learned from the recently published Finnish Turku Dependency Treebank and FinnTreeBank. Empirical evaluation on these corpora shows that FinnPos performs favorably compared to reference systems in terms of tagging and lemmatization accuracy. In addition, we demonstrate that our system is highly competitive with regard to computational efficiency of learning new models and assigning analyses to novel sentences.
systems and frameworks for computational morphology | 2013
Krister Lindén; Erik Axelson; Senka Drobac; Sam Hardwick; Juha Kuokkala; Jyrki Niemi; Tommi Pirinen; Miikka Silfverberg
The paper presents and evaluates various NLP tools that have been created using the open source library HFST – Helsinki Finite-State Technology and outlines the minimal extensions that this has required to a pure finite-state system. In particular, the paper describes an implementation and application of Pmatch presented by Karttunen at SFCM 2011.
Computational Linguistics - Applications | 2013
Krister Lindén; Erik Axelson; Senka Drobac; Sam Hardwick; Miikka Silfverberg; Tommi Pirinen
HFST-HelsinkiFinite-StateTechnology (http://hfst.sf.net/) is a framework for compiling and applying linguistic descriptions with finitestatemethods. HFST currently collects some of the most important finite-state tools for creatingmorphologies and spellcheckers into one open-source platform and supports extending and improving the descriptions with weights to accommodate the modeling of statistical information. HFST offers a path from language descriptions to efficient language applications. In this article, we focus on aspects of HFST that are new to the end user, i.e. new tools, new features in existing tools, or new language applications, in addition to some revised algorithms that increase performance.
meeting of the association for computational linguistics | 2014
Miikka Silfverberg; Teemu Ruokolainen; Krister Lindén; Mikko Kurimo
We discuss part-of-speech (POS) tagging in presence of large, fine-grained label sets using conditional random fields (CRFs). We propose improving tagging accuracy by utilizing dependencies within sub-components of the fine-grained labels. These sub-label dependencies are incorporated into the CRF model via a (relatively) straightforward feature extraction scheme. Experiments on five languages show that the approach can yield significant improvement in tagging accuracy in case the labels have sufficiently rich inner structure.
international conference natural language processing | 2010
Miikka Silfverberg; Krister Lindén
We use parallel weighted finite-state transducers to implement a part-of-speech tagger, which obtains state-of-the-art accuracy when used to tag the Europarl corpora for Finnish, Swedish and English. Our system consists of a weighted lexicon and a guesser combined with a bigram model factored into two weighted transducers. We use both lemmas and tag sequences in the bigram model, which guarantees reliable bigram estimates.
Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata | 2016
Miikka Silfverberg; Pekka Kauppinen; Krister Lindén
This paper presents two systems for spelling correction formulated as a sequence labeling task. One of the systems is an unstructured classifier and the other one is structured. Both systems are implemented using weighted finite-state methods. The structured system delivers stateof-the-art results on the task of tweet normalization when compared with the recent AliSeTra system introduced by Eger et al. (2016) even though the system presented in the paper is simpler than AliSeTra because it does not include a model for input segmentation. In addition to experiments on tweet normalization, we present experiments on OCR post-processing using an Early Modern Finnish corpus of OCR processed newspaper text.
Septentrio Conference Series | 2015
Miikka Silfverberg; Jack Rueter
Optical Character Recognition (OCR) can substantially improve the usability of digitized documents. Language modeling using word lists is known to improve OCR quality for English. For morphologically rich languages, however, even large word lists do not reach high coverage on unseen text. Morphological analyzers offer a more sophisticated approach, which is useful in many language processing applications. is paper investigates language modeling in the open-source OCR engine Tesseract using morphological analyzers. We present experiments on two Uralic languages Finnish and Erzya. According to our experiments, word lists may still be superior to morphological analyzers in OCR even for languages with rich morphology. Our error analysis indicates that morphological analyzers can cause a large amount of real word OCR errors.
international conference on computational linguistics | 2012
Miikka Silfverberg; Krister Lindén; Mirka Hyvärinen
Systems for predictive text entry on ambiguous keyboards typically rely on dictionaries with word frequencies which are used to suggest the most likely words matching user input. This approach is insufficient for agglutinative languages, where morphological phenomena increase the rate of out-of-vocabulary words. We propose a method for text entry, which circumvents the problem of out-of-vocabulary words, by replacing the dictionary with a Markov chain on morph sequences combined with a third order hidden Markov model (HMM) mapping key sequences to letter sequences and phonological constraints for pruning suggestion lists. We evaluate our method by constructing text entry systems for Finnish and Turkish and comparing our systems with published text entry systems and the text entry systems of three commercially available mobile phones. Measured using the keystrokes per character ratio (KPC) [8], we achieve superior results. For training, we use corpora, which are segmented using unsupervised morphological segmentation.