Jan Hoidekr | Researchain

Publication

Featured research published by Jan Hoidekr.

Text, Speech and Dialogue | 2011

Web text data mining for building large scale language modelling corpus

Jan Švec; Jan Hoidekr; Daniel Soutner; Jan Vavruška

The paper describes a system for collecting a large text corpus from Internet news servers. The architecture and text preprocessing algorithms are described, along with the duplicate detection algorithm used. The resulting corpus contains more than 1 billion tokens in more than 3 million articles, with topics assigned and duplicates identified. Corpus statistics such as consistency and perplexity are presented.

Text, Speech and Dialogue | 2007

Information retrieval test collection for searching spontaneous Czech speech

Pavel Ircing; Pavel Pecina; Douglas W. Oard; Jianqiang Wang; Ryen W. White; Jan Hoidekr

This paper describes the design of the first large-scale IR test collection built for the Czech language. The creation of this collection also happens to be very challenging, as it is based on a continuous text stream from automatic transcription of spontaneous speech and thus lacks clearly defined document boundaries. All aspects of the collection building are presented, together with some general findings of initial experiments.

Language Resources and Evaluation | 2014

General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes

Jan Švec; Jan Lehečka; Pavel Ircing; Lucie Skorkovská; Aleš Pražák; Jan Vavruška; Petr Stanislav; Jan Hoidekr

The paper describes a general framework for mining large amounts of text data from a defined set of Web pages. The acquired data are meant to constitute a corpus for training robust and reliable language models, so the framework also needs to incorporate algorithms for appropriate text processing and duplicate detection in order to ensure the quality and consistency of the data. As we expect the resulting corpus to be very large, we have also implemented topic detection algorithms that allow us to automatically select subcorpora for domain-specific language models. The description of the framework architecture and the implemented algorithms is complemented with a detailed evaluation section. It analyses the basic properties of the gathered Czech corpus, which contains more than one billion text tokens collected using the described framework, shows the results of the topic detection methods, and finally describes the design and outcomes of the automatic speech recognition experiments with domain-specific language models estimated from the collected data.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007

First experiments searching spontaneous Czech speech

Pavel Ircing; Douglas W. Oard; Jan Hoidekr

This paper reports on experiments with the first available Czech IR test collection. The collection consists of a continuous stream from automatic transcription of spontaneous speech (see [3] for details) and the task of the IR system is to identify appropriate replay points where the discussion about the queried topic starts. The collection thus lacks clearly defined document boundaries. Moreover, the accuracy of the transcription is limited (around 35% word error rate), mostly due to the nature of the speech—interviews with Holocaust survivors, which are sometimes emotional, accented, and exhibiting age-related speech impediments. This collection therefore offers an excellent opportunity to explore both effects present in Czech (e.g., morphology) and effects that result from processing spontaneous speech. It was also used in the CL-SR track at the CLEF 2006 evaluation campaign (http://www.clef-campaign.org/).

Lecture Notes in Artificial Intelligence | 2007