Jan Hoidekr
University of West Bohemia
Publications
Featured research published by Jan Hoidekr.
Text, Speech and Dialogue | 2011
Jan Švec; Jan Hoidekr; Daniel Soutner; Jan Vavruška
The paper describes a system for collecting a large text corpus from Internet news servers. The architecture and text preprocessing algorithms are described, along with the duplicate detection algorithm. The resulting corpus contains more than 1 billion tokens in more than 3 million articles with assigned topics and duplicates identified. Corpus statistics such as consistency and perplexity are presented.
Text, Speech and Dialogue | 2007
Pavel Ircing; Pavel Pecina; Douglas W. Oard; Jianqiang Wang; Ryen W. White; Jan Hoidekr
This paper describes the design of the first large-scale IR test collection built for the Czech language. Creating this collection was challenging because it is based on a continuous text stream from automatic transcription of spontaneous speech and thus lacks clearly defined document boundaries. All aspects of the collection building are presented, together with some general findings of initial experiments.
Language Resources and Evaluation | 2014
Jan Švec; Jan Lehečka; Pavel Ircing; Lucie Skorkovská; Aleš Pražák; Jan Vavruška; Petr Stanislav; Jan Hoidekr
The paper describes a general framework for mining large amounts of text data from a defined set of Web pages. The acquired data are meant to constitute a corpus for training robust and reliable language models, so the framework also needs to incorporate algorithms for appropriate text processing and duplicate detection to ensure the quality and consistency of the data. As we expect the resulting corpus to be very large, we have also implemented topic detection algorithms that allow us to automatically select subcorpora for domain-specific language models. The description of the framework architecture and the implemented algorithms is complemented with a detailed evaluation section. It analyses the basic properties of the gathered Czech corpus, which contains more than one billion text tokens collected using the described framework, shows the results of the topic detection methods, and finally describes the design and outcomes of the automatic speech recognition experiments with domain-specific language models estimated from the collected data.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007
Pavel Ircing; Douglas W. Oard; Jan Hoidekr
This paper reports on experiments with the first available Czech IR test collection. The collection consists of a continuous stream from automatic transcription of spontaneous speech (see [3] for details), and the task of the IR system is to identify appropriate replay points where the discussion about the queried topic starts. The collection thus lacks clearly defined document boundaries. Moreover, the accuracy of the transcription is limited (around 35% word error rate), mostly due to the nature of the speech: interviews with Holocaust survivors, which are sometimes emotional and accented and exhibit age-related speech impediments. This collection therefore offers an excellent opportunity to explore both effects present in Czech (e.g., morphology) and effects that result from processing spontaneous speech. It was also used in the CL-SR track at the CLEF 2006 evaluation campaign (http://www.clef-campaign.org/).
Lecture Notes in Artificial Intelligence | 2007
Pavel Ircing; Pavel Pecina; Douglas W. Oard; Jianqiang Wang; Ryen W. White; Jan Hoidekr
ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition | 2003
Josef Psutka; Pavel Ircing; Jan Hoidekr
Language Resources and Evaluation | 2006
Pavel Ircing; Jan Hoidekr; Josef Psutka
Language Resources and Evaluation | 2006
Jan Hoidekr; Josef V. Psutka; Aleš Pražák; Josef Psutka
Archive | 2007
Jan Hoidekr; Aleš Pražák; Josef Psutka; Z. Tychtl
Computational Intelligence | 2006
Aleš Pražák; Josef Psutka; Jan Hoidekr; Jakub Kanis; Luděk Müller