Milan Straka
Charles University in Prague
Publications
Featured research published by Milan Straka.
Meeting of the Association for Computational Linguistics | 2014
Jana Straková; Milan Straka; Jan Hajič
We present two recently released open-source taggers: NameTag is free software for named entity recognition (NER) which achieves state-of-the-art performance on Czech; MorphoDiTa (Morphological Dictionary and Tagger) performs morphological analysis (with lemmatization), morphological generation, tagging and tokenization with state-of-the-art results for Czech and a throughput of around 10-200K words per second. The taggers can be trained for any language for which annotated data exist, but they are specifically designed to be efficient for inflective languages. Both tools are free software under the LGPL license and are distributed along with trained linguistic models which are free for non-commercial use under the CC BY-NC-SA license. The releases include standalone tools, C++ libraries with Java, Python and Perl bindings, and web services.
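The abstract mentions Python bindings for MorphoDiTa; a minimal usage sketch (illustrative only, not taken from the paper) using the ufal.morphodita package might look as follows, with the model file name being a placeholder for whichever trained model is used:

```python
from ufal.morphodita import Tagger, Forms, TaggedLemmas, TokenRanges

# The model path is a placeholder; point it at any trained MorphoDiTa tagger model.
tagger = Tagger.load("czech-morfflex-pdt.tagger")
if tagger is None:
    raise RuntimeError("cannot load tagger model")

forms, lemmas, tokens = Forms(), TaggedLemmas(), TokenRanges()
tokenizer = tagger.newTokenizer()
tokenizer.setText("Děti pojedou k babičce.")

# Tokenize sentence by sentence, then tag and lemmatize each sentence.
while tokenizer.nextSentence(forms, tokens):
    tagger.tag(forms, lemmas)
    for i in range(len(forms)):
        print(forms[i], lemmas[i].lemma, lemmas[i].tag)
```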
Text, Speech and Dialogue | 2013
Jana Straková; Milan Straka; Jan Hajič
We present a new named entity recognizer for the Czech language. It reaches 82.82 F-measure on the Czech Named Entity Corpus 1.0 and significantly outperforms previously published Czech named entity recognizers. On the English CoNLL-2003 shared task, we achieved 89.16 F-measure, reaching results comparable to the English state of the art. The recognizer is based on a Maximum Entropy Markov Model, and a Viterbi algorithm decodes the optimal label sequence using probabilities estimated by a maximum entropy classifier. The classification features utilize morphological analysis, two-stage prediction, word clustering and gazetteers.
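As an illustration of the decoding step only (a minimal sketch, not the authors' implementation), Viterbi decoding over MEMM-style transition-conditioned probabilities can be written as follows; the log-probability tensor is assumed to come from a maximum entropy classifier:

```python
import numpy as np

def viterbi(log_probs):
    """Decode the highest-scoring label sequence.

    log_probs[t][prev][cur] is the log-probability of label `cur` at position t
    given previous label `prev`, as an MEMM-style classifier would estimate;
    position 0 uses a dummy "start" previous label 0.
    """
    T, L, _ = log_probs.shape
    score = np.full((T, L), -np.inf)
    back = np.zeros((T, L), dtype=int)
    score[0] = log_probs[0, 0]                      # start from the dummy label
    for t in range(1, T):
        for cur in range(L):
            cand = score[t - 1] + log_probs[t, :, cur]
            back[t, cur] = int(np.argmax(cand))
            score[t, cur] = cand[back[t, cur]]
    # Follow back-pointers from the best final label.
    labels = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        labels.append(int(back[t, labels[-1]]))
    return labels[::-1]
```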
European Symposium on Algorithms | 2007
Martin Mareš; Milan Straka
A lexicographic ranking function for the set of all permutations of n ordered symbols translates permutations to their ranks in the lexicographic order of all permutations. This is frequently used for indexing data structures by permutations. We present algorithms for computing both the ranking function and its inverse using O(n) arithmetic operations.
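To illustrate what the ranking function and its inverse compute, here is a straightforward sketch in the naive form using O(n^2) operations; the contribution of the paper is achieving the same with only O(n) arithmetic operations:

```python
from math import factorial

def lex_rank(perm):
    """Rank of `perm` (a permutation of 0..n-1) in lexicographic order."""
    n = len(perm)
    rank = 0
    for i, p in enumerate(perm):
        # Count symbols to the right that are smaller than the current one.
        smaller = sum(1 for q in perm[i + 1:] if q < p)
        rank += smaller * factorial(n - i - 1)
    return rank

def lex_unrank(rank, n):
    """Inverse of lex_rank: the permutation of 0..n-1 with the given rank."""
    available = list(range(n))
    perm = []
    for i in range(n):
        f = factorial(n - i - 1)
        perm.append(available.pop(rank // f))
        rank %= f
    return perm
```

For example, lex_rank([2, 1, 0]) returns 5, and lex_unrank(5, 3) recovers [2, 1, 0].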
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies | 2017
Daniel Zeman; Martin Popel; Milan Straka; Jan Hajič; Joakim Nivre; Filip Ginter; Juhani Luotolahti; Sampo Pyysalo; Slav Petrov; Martin Potthast; Francis M. Tyers; Elena Badmaeva; Memduh Gokirmak; Anna Nedoluzhko; Silvie Cinková; Jaroslava Hlaváčová; Václava Kettnerová; Zdeňka Urešová; Jenna Kanerva; Stina Ojala; Anna Missilä; Christopher D. Manning; Sebastian Schuster; Siva Reddy; Dima Taji; Nizar Habash; Herman Leung; Marie-Catherine de Marneffe; Manuela Sanguinetti; Maria Simi
The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, August 3-4, 2017, Vancouver, Canada, ISBN 978-1-945626-70-8, pp. 88-99 | 2017
Milan Straka; Jana Straková
We present an update to UDPipe 1.0 (Straka et al., 2016), a trainable pipeline which performs sentence segmentation, tokenization, POS tagging, lemmatization and dependency parsing. We provide models for all 50 languages of UD 2.0, and furthermore, the pipeline can be trained easily using data in CoNLL-U format. For the purpose of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, the updated UDPipe 1.1 was used as one of the baseline systems, finishing as the 13th system of 33 participants. A further improved UDPipe 1.2 participated in the shared task, placing as the 8th best system, while achieving low running times and moderately sized models. The tool is available under the open-source Mozilla Public License (MPL) and provides bindings for C++, Python (through the ufal.udpipe PyPI package), Perl (through the UFAL::UDPipe CPAN package), Java and C#.
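A minimal usage sketch of the ufal.udpipe Python bindings mentioned in the abstract (the model file name is a placeholder for any downloaded UD 2.0 model):

```python
from ufal.udpipe import Model, Pipeline, ProcessingError

# The model path is a placeholder; point it at any downloaded UDPipe model file.
model = Model.load("english-ud-2.0.udpipe")
if model is None:
    raise RuntimeError("cannot load model")

# Tokenize raw text, run the model's default tagger and parser, output CoNLL-U.
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
error = ProcessingError()
conllu = pipeline.process("UDPipe segments, tags and parses raw text.", error)
if error.occurred():
    raise RuntimeError(error.message)
print(conllu)
```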
Symposium/Workshop on Haskell | 2010
Milan Straka
In this paper, we perform a thorough performance analysis of the containers package, the de facto standard Haskell containers library, comparing it to most of the existing alternatives on HackageDB. We then significantly improve its performance, making it comparable to the best implementations available. Additionally, we describe a new persistent data structure based on hashing, which offers the best performance of the available data structures containing Strings and ByteStrings.
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) | 2017
Natalia Klyueva; Antoine Doucet; Milan Straka
In this paper we describe the MUMULS system that participated in the 2017 shared task on automatic identification of verbal multiword expressions (VMWEs). The MUMULS system was implemented using a supervised approach based on recurrent neural networks, using the open-source library TensorFlow. The model was trained on a data set containing annotated VMWEs as well as morphological and syntactic information. The MUMULS system performed the identification of VMWEs in 15 languages and was one of the few systems able to categorize the VMWE type in nearly all of them.
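The exact MUMULS architecture and hyperparameters are not given in the abstract; purely as an illustration of a recurrent sequence tagger of this general kind, a minimal sketch with TensorFlow's Keras API (all sizes and layer choices are assumptions) could look like this:

```python
import tensorflow as tf

# Illustrative sizes only; the actual MUMULS hyperparameters are not given here.
VOCAB, TAGS, EMB, UNITS = 20000, 5, 128, 64

# Word-id sequences in, one VMWE tag per token out (e.g. BIO-style labels).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, EMB, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(UNITS, return_sequences=True)),
    tf.keras.layers.Dense(TAGS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```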
Text, Speech and Dialogue | 2016
Jana Straková; Milan Straka; Jan Hajič
We present a completely featureless, language-agnostic named entity recognition system. Following recent advances in artificial neural network research, the recognizer employs parametric rectified linear units (PReLU), word embeddings and character-level embeddings based on gated recurrent units (GRU). Without any feature engineering, only with surface forms, lemmas and tags as input, the network achieves excellent results in Czech NER and surpasses the current state of the art of previously published Czech NER systems, which use manually designed rule-based orthographic classification features. Furthermore, the neural network achieves robust results even when only surface forms are available as input. In addition, the proposed neural network can use the manually designed rule-based orthographic classification features, and in such a combination it exceeds the current state of the art by a wide margin.
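As an illustration of combining word embeddings with character-level embeddings computed by a GRU and PReLU activations (a sketch under assumed dimensions, not the authors' exact network):

```python
import tensorflow as tf

# Illustrative sizes only; not the published configuration.
WORDS, CHARS, TAGS = 20000, 100, 9
W_EMB, C_EMB, C_UNITS, MAX_WORD_LEN = 128, 24, 32, 20

word_ids = tf.keras.Input(shape=(None,), dtype="int32")
char_ids = tf.keras.Input(shape=(None, MAX_WORD_LEN), dtype="int32")

# Word-level embeddings.
w = tf.keras.layers.Embedding(WORDS, W_EMB)(word_ids)

# Character-level word embeddings: a bidirectional GRU over each word's characters.
c = tf.keras.layers.Embedding(CHARS, C_EMB)(char_ids)
c = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(C_UNITS)))(c)

h = tf.keras.layers.Concatenate()([w, c])
h = tf.keras.layers.Dense(128)(h)
h = tf.keras.layers.PReLU(shared_axes=[1])(h)   # parametric rectified linear units
out = tf.keras.layers.Dense(TAGS, activation="softmax")(h)

model = tf.keras.Model([word_ids, char_ids], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```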
Archive | 2017
Jana Straková; Milan Straka; Magda Ševčíková; Zdeněk Žabokrtský
We present a corpus of Czech sentences with manually annotated named entities, in which a rich two-level hierarchy of named entity types was used. The corpus was the first available large Czech named entity resource, and since 2007 it has stimulated research in this field for Czech. We describe the two-level fine-grained hierarchy allowing embedded entities and the motivations leading to its design. We further discuss the data selection and the annotation process. We then show how the data can be used for training a named entity recognizer, and we perform a number of experiments to critically evaluate the impact of the decisions made during annotation on the named entity recognizer's performance. We thoroughly discuss the effects of sentence selection, corpus size, part-of-speech tagging and lemmatization, representativeness and bias of the named entity distribution, classification granularity and other corpus properties in terms of supervised machine learning.
Meeting of the Association for Computational Linguistics | 2013
David Mareček; Milan Straka