Publication


Featured research published by Lucie Skorkovská.


text, speech and dialogue | 2010

Comparison of different lemmatization approaches through the means of information retrieval performance

Jakub Kanis; Lucie Skorkovská

This paper presents a quantitative performance analysis of two different approaches to the lemmatization of Czech text data. The first is based on a manually prepared dictionary of lemmas and a set of derivation rules, while the second is based on automatic inference of the dictionary and the rules from training data. The comparison is done by evaluating the mean Generalized Average Precision (mGAP) measure of the lemmatized documents and search queries in a set of information retrieval (IR) experiments. Such a method is suitable for efficient and rather reliable comparison of lemmatization performance, since correct lemmatization has proven to be crucial for IR effectiveness in highly inflected languages. Moreover, the proposed indirect comparison of the lemmatizers circumvents the need for manually lemmatized test data, which are hard to obtain and also face the problem of incompatible sets of lemmas across different systems.


text, speech and dialogue | 2011

Automatic topic identification for large scale language modeling data filtering

Lucie Skorkovská; Pavel Ircing; Aleš Pražák; Jan Lehečka

The paper presents a module for topic identification that is embedded into a complex system for acquiring and storing large volumes of text data from the Web. The module processes each of the acquired data items and assigns keywords to them from a defined topic hierarchy that was developed for this purpose and is also described in the paper. The quality of the topic identification is evaluated in two ways: using classic precision-recall measures and also indirectly, by measuring the ASR performance of the topic-specific language models that are built using the automatically filtered data.
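The direct evaluation mentioned in this abstract can be illustrated with a minimal sketch of micro-averaged precision and recall for multi-label topic assignment. The function name, the averaging scheme, and the toy labels are assumptions made for illustration; the paper may compute its measures differently.

```python
def precision_recall(assigned, reference):
    """Micro-averaged precision and recall over a collection of documents.

    assigned, reference: lists of sets of topic labels, one set per document.
    (Hypothetical helper, not taken from the paper.)
    """
    # Labels that were both assigned by the module and present in the reference.
    true_positives = sum(len(a & r) for a, r in zip(assigned, reference))
    n_assigned = sum(len(a) for a in assigned)
    n_reference = sum(len(r) for r in reference)
    precision = true_positives / n_assigned if n_assigned else 0.0
    recall = true_positives / n_reference if n_reference else 0.0
    return precision, recall


# Toy data: two documents with automatically assigned and reference topics.
assigned = [{"sport", "hockey"}, {"politics"}]
reference = [{"sport"}, {"politics", "economy"}]
precision, recall = precision_recall(assigned, reference)
```

In this toy case two of the three assigned labels are correct and two of the three reference labels are found, so both measures equal 2/3.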


text, speech and dialogue | 2013

Dynamic Threshold Selection Method for Multi-label Newspaper Topic Identification

Lucie Skorkovská

Multi-label classification is increasingly required in modern categorization systems. It is especially essential in the task of newspaper article topic identification. This paper presents a method based on general topic model normalisation for finding a threshold defining the boundary between the “correct” and the “incorrect” topics of a newspaper article. The proposed method is used to improve the topic identification algorithm which is a part of a complex system for acquiring and storing large volumes of text data. The topic identification module uses a Naive Bayes classifier for the multiclass and multi-label classification problem and assigns to each article topics from a defined, quite extensive topic hierarchy containing about 450 topics and topic categories. The results of the experiments with the improved topic identification algorithm are presented in this paper.
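One possible reading of the normalisation idea in this abstract is sketched below: each topic's unigram log-likelihood is compared with that of a general model trained on all data, so that a single threshold becomes comparable across documents of different length and vocabulary. The smoothing, the per-token averaging, the fixed threshold of 0.0, and all names and data are assumptions for illustration, not the paper's actual algorithm. Topic priors are omitted for brevity.

```python
import math
from collections import Counter


def train_unigram(docs, vocab, alpha=1.0):
    """Laplace-smoothed unigram log-probabilities from tokenised documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / total) for w in vocab}


def log_likelihood(doc, model):
    """Sum of token log-probabilities; out-of-vocabulary tokens are skipped."""
    return sum(model[tok] for tok in doc if tok in model)


def select_topics(doc, topic_models, general_model, threshold=0.0):
    """Keep topics whose per-token log-likelihood ratio against the
    general model exceeds the threshold (hypothetical selection rule)."""
    n_tokens = max(len(doc), 1)
    base = log_likelihood(doc, general_model)
    scores = {
        topic: (log_likelihood(doc, model) - base) / n_tokens
        for topic, model in topic_models.items()
    }
    selected = {topic for topic, s in scores.items() if s > threshold}
    return selected, scores


# Toy training data (invented for this sketch).
training = {
    "sport": [["goal", "match", "team"], ["team", "win", "match"]],
    "politics": [["vote", "party", "election"], ["party", "law", "vote"]],
}
vocab = {tok for docs in training.values() for doc in docs for tok in doc}
topic_models = {t: train_unigram(docs, vocab) for t, docs in training.items()}
general_model = train_unigram(
    [doc for docs in training.values() for doc in docs], vocab
)

selected, scores = select_topics(
    ["team", "match", "win", "vote"], topic_models, general_model
)
```

For this toy article only the "sport" model explains the text better than the general model, so it is the only topic selected.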


text, speech and dialogue | 2014

Score Normalization Methods Applied to Topic Identification

Lucie Skorkovská; Zbyněk Zajíc

Multi-label classification plays a key role in modern categorization systems. Its goal is to find a set of labels belonging to each data item. In multi-label document classification, unlike in multi-class classification, where only the best topic is chosen, the classifier must decide if a document does or does not belong to each topic from the predefined topic set. We use a generative classifier to tackle this task, but the problem with this approach is that the threshold for the positive classification must be set. This threshold can vary for each document depending on the content of the document (words used, length of the document, ...). In this paper we use Unconstrained Cohort Normalization, primarily proposed for the speaker identification/verification task, for robustly finding the threshold defining the boundary between the correct and the incorrect topics of a document. In our former experiments we proposed a method for finding this threshold inspired by another normalization technique called World Model score normalization. Comparison of these normalization methods has shown that better results can be achieved with Unconstrained Cohort Normalization.
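A minimal sketch of cohort-style score normalization applied to topic scores follows. It assumes the common speaker-verification form in which a model's log-likelihood is reduced by the mean log-likelihood of its best-scoring competitors; the cohort size, the threshold of 0.0, and the toy scores are illustrative assumptions, and the paper's exact formulation may differ.

```python
def cohort_normalise(log_likelihoods, cohort_size=2):
    """Normalise each topic score by the mean score of its cohort.

    The cohort of a topic is taken here to be the `cohort_size`
    best-scoring competing topics for the same document (an assumption).
    Requires at least two topics.
    """
    normalised = {}
    for topic, score in log_likelihoods.items():
        rivals = sorted(
            (s for t, s in log_likelihoods.items() if t != topic),
            reverse=True,
        )
        cohort = rivals[:cohort_size]
        normalised[topic] = score - sum(cohort) / len(cohort)
    return normalised


# Toy log-likelihoods of one document under four topic models.
log_likelihoods = {
    "sport": -10.0,
    "hockey": -11.0,
    "politics": -30.0,
    "economy": -32.0,
}
normalised = cohort_normalise(log_likelihoods)
# Topics with a positive normalised score are accepted as "correct".
selected = {topic for topic, s in normalised.items() if s > 0.0}
```

Because the normalised scores are relative to the competing topics of the same document, the same threshold can be reused for documents whose raw likelihoods differ widely. Here "sport" and "hockey" are accepted and the other two topics are rejected.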


language resources and evaluation | 2014

General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes

Jan Švec; Jan Lehečka; Pavel Ircing; Lucie Skorkovská; Aleš Pražák; Jan Vavruška; Petr Stanislav; Jan Hoidekr

The paper describes a general framework for mining large amounts of text data from a defined set of Web pages. The acquired data are meant to constitute a corpus for training robust and reliable language models, and thus the framework also needs to incorporate algorithms for appropriate text processing and duplicate detection in order to secure the quality and consistency of the data. As we expect the resulting corpus to be very large, we have also implemented topic detection algorithms that allow us to automatically select subcorpora for domain-specific language models. The description of the framework architecture and the implemented algorithms is complemented with a detailed evaluation section. It analyses the basic properties of the gathered Czech corpus containing more than one billion text tokens collected using the described framework, shows the results of the topic detection methods and finally also describes the design and outcomes of the automatic speech recognition experiments with domain-specific language models estimated from the collected data.


international conference on speech and computer | 2014

First Experiments with Relevant Documents Selection for Blind Relevance Feedback in Spoken Document Retrieval

Lucie Skorkovská

This paper presents our first experiments aimed at the automatic selection of the relevant documents for the blind relevance feedback method in speech information retrieval. Usually, the first N retrieved documents are simply assumed to be relevant. We consider this approach insufficient, and in this paper we try to outline the possibilities of dynamically selecting the relevant documents for each query depending on the content of the retrieved documents, instead of just blindly defining in advance the number of relevant documents to be used for the blind relevance feedback. We have performed initial experiments with the application of the score normalization techniques used in the speaker identification task, which were successfully used in the multi-label classification task for finding the “correct” topics of a newspaper article in the output of a generative classifier. The experiments have shown promising results, so they will be used to define the possibilities of subsequent research in this area.
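The contrast between a fixed top-N choice and a query-dependent choice of feedback documents can be sketched as follows. The selection rule (a document is kept when its retrieval score exceeds the mean score of its best-scoring competitors) is an assumed stand-in for the score normalization techniques the abstract refers to; the function names, parameters, and scores are hypothetical.

```python
def top_n_selection(ranked, n=3):
    """Baseline: treat the first n retrieved documents as relevant."""
    return [doc_id for doc_id, _ in ranked[:n]]


def dynamic_selection(ranked, cohort_size=3, threshold=0.0):
    """Keep documents whose score, normalised by the mean score of the
    `cohort_size` best competing documents, exceeds the threshold.

    ranked: list of (doc_id, score) pairs, best first; at least two entries.
    """
    selected = []
    for doc_id, score in ranked:
        rivals = sorted(
            (s for d, s in ranked if d != doc_id), reverse=True
        )
        cohort = rivals[:cohort_size]
        if score - sum(cohort) / len(cohort) > threshold:
            selected.append(doc_id)
    return selected


# Toy retrieval result for one query (log-likelihood style scores).
ranked = [("d1", -5.0), ("d2", -5.2), ("d3", -9.0), ("d4", -9.5), ("d5", -10.0)]
fixed = top_n_selection(ranked, n=3)
dynamic = dynamic_selection(ranked)
```

With these scores the fixed rule returns three documents regardless of the score gap, while the dynamic rule keeps only the two documents that clearly stand out from the rest of the list.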


text, speech and dialogue | 2012

Application of Lemmatization and Summarization Methods in Topic Identification Module for Large Scale Language Modeling Data Filtering

Lucie Skorkovská

The paper presents experiments with the topic identification module which is a part of a complex system for acquiring and storing large volumes of text data. The topic identification module processes each acquired data item and assigns it topics from a defined topic hierarchy. The topic hierarchy is quite extensive: it contains about 450 topics and topic categories. It can easily happen that for some narrowly focused topic there is not enough data for the topic identification training. Lemmatization has been shown to improve the results when dealing with sparse data in the area of information retrieval, therefore the effect of lemmatization on topic identification results is studied in the paper. On the other hand, since the system is used for processing large amounts of data, a summarization method was implemented and the effect of using only the summary of an article on the topic identification accuracy is studied.


text, speech and dialogue | 2016

Relevant Documents Selection for Blind Relevance Feedback in Speech Information Retrieval

Lucie Skorkovská

The experiments presented in this paper were aimed at the selection of documents to be used in blind (pseudo) relevance feedback in spoken document retrieval. The previous experiments with the automatic selection of the relevant documents for the blind relevance feedback method have shown the possibilities of dynamically selecting the relevant documents for each query depending on the content of the retrieved documents, instead of just blindly defining in advance the number of relevant documents to be used. The score normalization techniques commonly used in the speaker identification task are used for the dynamic selection of the relevant documents. In the previous experiments, the language modeling information retrieval method was used. In the experiments presented in this paper, we have also derived the score normalization technique for the vector space information retrieval method. The results of our experiments show that these normalization techniques are not method-dependent and can be successfully used in several information retrieval system settings.


international conference on speech and computer | 2016

Comparison of Retrieval Approaches and Blind Relevance Feedback Methods Within the Czech Speech Information Retrieval

Lucie Skorkovská

This article has several objectives. The first is to compare the most used information retrieval methods on a single speech retrieval collection. The collection, used in the CLEF 2007 Czech task, contains automatically transcribed spontaneous interviews of Holocaust survivors and is, to our knowledge, the only Czech collection of spontaneous speech intended for speech information retrieval. Apart from the first experiments presented in the CLEF competition, no comprehensive experiments have been published on this collection to compare the different information retrieval methods. The second objective of this paper is to compare the results of using the blind relevance feedback methods with the individual retrieval methods and introduce the possibility of using the score normalization methods for the selection of documents for the blind relevance feedback. The third objective of this article is to compare different normalization methods among themselves. Exhaustive experiments were performed for each method and its settings. For all information retrieval methods used, the experimental results showed that the use of score normalization methods significantly improves the achieved retrieval score.


international symposium on signal processing and information technology | 2014

Comparison of score normalization methods applied to multi-label classification

Lucie Skorkovská; Zbyněk Zajíc; Luděk Müller

Our paper deals with the multi-label text classification of newspaper articles, where the classifier must decide if a document does or does not belong to each topic from the predefined topic set. A generative classifier is used to tackle this task, and the paper mainly addresses the problem of finding a threshold for the positive classification. This threshold can vary for each document depending on the content of the document (words used, length of the document, etc.). An extensive comparison of the score normalization methods, primarily proposed for the speaker identification/verification task, for robustly finding the threshold defining the boundary between the “correct” and the “incorrect” topics of a document is presented. Score normalization methods (based on World Model and Unconstrained Cohort Normalization) applied to the topic identification task have shown an improvement of results in our former experiments, therefore in this paper in-depth experiments with more score normalization techniques applied to the multi-label classification were performed. A thorough analysis of the effects of the various parameter settings is presented.

Collaboration


Dive into Lucie Skorkovská's collaboration.

Top Co-Authors

Pavel Ircing

University of West Bohemia

Jan Švec

University of West Bohemia

Aleš Pražák

University of West Bohemia

Daniel Soutner

University of West Bohemia

Jakub Kanis

University of West Bohemia

Jan Lehečka

University of West Bohemia

Josef Psutka

University of West Bohemia

Luděk Müller

University of West Bohemia

Petr Stanislav

University of West Bohemia