Publication


Featured research published by Jan Hoidekr.


Text, Speech and Dialogue | 2011

Web text data mining for building large scale language modelling corpus

Jan Švec; Jan Hoidekr; Daniel Soutner; Jan Vavruška

The paper describes a system for collecting a large text corpus from Internet news servers. The architecture and text preprocessing algorithms are described, along with the duplicate detection algorithm used. The resulting corpus contains more than 1 billion tokens in more than 3 million articles, with topics assigned and duplicates identified. Corpus statistics such as consistency and perplexity are presented.


Text, Speech and Dialogue | 2007

Information retrieval test collection for searching spontaneous Czech speech

Pavel Ircing; Pavel Pecina; Douglas W. Oard; Jianqiang Wang; Ryen W. White; Jan Hoidekr

This paper describes the design of the first large-scale IR test collection built for the Czech language. The creation of this collection also happens to be very challenging, as it is based on a continuous text stream from automatic transcription of spontaneous speech and thus lacks clearly defined document boundaries. All aspects of the collection building are presented, together with some general findings of initial experiments.


Language Resources and Evaluation | 2014

General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes

Jan Švec; Jan Lehečka; Pavel Ircing; Lucie Skorkovská; Aleš Pražák; Jan Vavruška; Petr Stanislav; Jan Hoidekr

The paper describes a general framework for mining large amounts of text data from a defined set of Web pages. The acquired data are meant to constitute a corpus for training robust and reliable language models, and thus the framework also needs to incorporate algorithms for appropriate text processing and duplicate detection in order to secure the quality and consistency of the data. As we expect the resulting corpus to be very large, we have also implemented topic detection algorithms that allow us to automatically select subcorpora for domain-specific language models. The description of the framework architecture and the implemented algorithms is complemented with a detailed evaluation section. It analyses the basic properties of the gathered Czech corpus, which contains more than one billion text tokens collected using the described framework, shows the results of the topic detection methods, and finally describes the design and outcomes of the automatic speech recognition experiments with domain-specific language models estimated from the collected data.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007

First experiments searching spontaneous Czech speech

Pavel Ircing; Douglas W. Oard; Jan Hoidekr

This paper reports on experiments with the first available Czech IR test collection. The collection consists of a continuous stream from automatic transcription of spontaneous speech (see [3] for details) and the task of the IR system is to identify appropriate replay points where the discussion about the queried topic starts. The collection thus lacks clearly defined document boundaries. Moreover, the accuracy of the transcription is limited (around 35% word error rate), mostly due to the nature of the speech—interviews with Holocaust survivors, which are sometimes emotional, accented, and exhibiting age-related speech impediments. This collection therefore offers an excellent opportunity to explore both effects present in Czech (e.g., morphology) and effects that result from processing spontaneous speech. It was also used in the CL-SR track at the CLEF 2006 evaluation campaign (http://www.clef-campaign.org/).


Lecture Notes in Artificial Intelligence | 2007

Information Retrieval Test Collection for Searching Spontaneous Czech Speech

Pavel Ircing; Pavel Pecina; Douglas W. Oard; Jianqiang Wang; Ryen W. White; Jan Hoidekr


ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition | 2003

Recognition of spontaneously pronounced TV ice-hockey commentary

Josef Psutka; Pavel Ircing; Jan Hoidekr


Language Resources and Evaluation | 2006

Exploiting linguistic knowledge in language modeling of Czech spontaneous speech

Pavel Ircing; Jan Hoidekr; Josef Psutka


Language Resources and Evaluation | 2006

Benefit of a class-based language model for real-time closed-captioning of TV ice-hockey commentaries

Jan Hoidekr; Josef V. Psutka; Aleš Pražák; Josef Psutka


Archive | 2007

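Systém automatického vyhledávání klíčových segmentů v rozsáhlém audiovizuálním archivu hokejových zápasů (System for automatic retrieval of key segments in a large audiovisual archive of ice-hockey matches)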

Jan Hoidekr; Aleš Pražák; Josef Psutka; Z. Tychtl


Computational Intelligence | 2006

Adaptive language model in automatic online subtitling

Aleš Pražák; Josef Psutka; Jan Hoidekr; Jakub Kanis; Luděk Müller

Collaboration


Top Co-Authors

Josef Psutka

University of West Bohemia

Pavel Ircing

University of West Bohemia

Aleš Pražák

University of West Bohemia

Josef V. Psutka

University of West Bohemia

Jakub Kanis

University of West Bohemia

Jan Vavruška

University of West Bohemia

Jan Švec

University of West Bohemia

Luděk Müller

University of West Bohemia

Pavel Pecina

Charles University in Prague
