Jindrich Zdansky
Technical University of Liberec
Publications
Featured research published by Jindrich Zdansky.
Multimedia Signal Processing | 2012
Jan Nouza; Karel Blavka; Jindrich Zdansky; Petr Cerva; Jan Silovsky; Marek Bohac; Josef Chaloupka; Michaela Kucharova; Ladislav Seps
This paper describes a complex system developed for processing, indexing and accessing data collected in large audio and audio-visual archives that form an important part of Czech cultural heritage. Currently, the system is being applied to the Czech Radio archive, namely to its oral history segment with more than 200,000 individual recordings covering almost ninety years of broadcasting in the Czech Republic and the former Czechoslovakia. The ultimate goals are a) to transcribe a significant portion of the archive, with the support of speech, speaker and language recognition technology, b) to index the transcriptions, and c) to make the audio and text files fully searchable. So far, the system has processed and indexed over 75,000 spoken documents. Most of them come from the last two decades, but the recent demo collection also includes a series of presidential speeches dating back to 1934. Full coverage of the archive should be available by the end of 2014.
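The transcribe-index-search pipeline outlined in the abstract can be pictured with a few lines of code. Below is a minimal sketch, assuming time-aligned transcripts keyed by recording ID; all names and data are illustrative, not the project's actual implementation.

```python
# Minimal sketch: inverted index over time-aligned transcripts, so a query
# word can be mapped back to the recordings (and positions) it occurs in.
# Everything here is illustrative; the archive platform itself is not shown.
from collections import defaultdict

def build_index(transcripts):
    """transcripts: {doc_id: [(word, start_time_sec), ...]}"""
    index = defaultdict(list)
    for doc_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((doc_id, start))
    return index

transcripts = {
    "rec_1934_001": [("projev", 0.4), ("prezidenta", 1.1)],
    "rec_2010_517": [("projev", 12.3), ("vlady", 13.0)],
}
index = build_index(transcripts)
print(index["projev"])  # -> [('rec_1934_001', 0.4), ('rec_2010_517', 12.3)]
```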
COST'09 Proceedings of the Second international conference on Development of Multimodal Interfaces: active Listening and Synchrony | 2009
Jan Nouza; Jindrich Zdansky; Petr Cerva; Jan Silovsky
Slavic languages pose a big challenge for researchers dealing with speech technology. They exhibit a large degree of inflection, namely declension of nouns, pronouns and adjectives, and conjugation of verbs. This has a large impact on the size of lexical inventories in these languages and significantly complicates the design of text-to-speech and, in particular, speech-to-text systems. In the paper, we demonstrate some of the typical features of the Slavic languages and show how they can be handled in the development of practical speech processing systems. We present the solutions we applied in the design of voice dictation and broadcast speech transcription systems developed for Czech. Furthermore, we demonstrate how these systems can be converted to another similar Slavic language, in our case Slovak. All the presented systems operate in real time with very large vocabularies (350K words in Czech, 170K words in Slovak), and some of them have already been deployed in practice.
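To make the inflection problem concrete, the toy sketch below expands a single Czech noun lemma into its surface forms. The paradigm is simplified and purely illustrative, not taken from the paper's lexicon.

```python
# Toy illustration of why inflection inflates Slavic lexicons: one Czech
# noun lemma ("hrad", hard masculine inanimate paradigm) yields many
# distinct surface forms. Endings are simplified for brevity.
PARADIGM_HRAD = ["", "u", "u", "", "e", "u", "em",        # 7 singular cases
                 "y", "ů", "ům", "y", "y", "ech", "y"]    # 7 plural cases

def forms(stem, paradigm=PARADIGM_HRAD):
    return sorted({stem + ending for ending in paradigm})

print(forms("hrad"))
# One lemma -> 8 distinct word-forms here; multiplied across the lexicon,
# this is the effect that drives vocabularies toward 350K items.
```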
International Workshop on Multimedia for Cultural Heritage | 2011
Jan Nouza; Karel Blavka; Marek Bohac; Petr Cerva; Jindrich Zdansky; Jan Silovsky; Jan Prazak
The Czech Radio archive of spoken documents is considered one of the gems of Czech cultural heritage. It contains the largest collection (more than 100,000 hours) of spoken documents recorded during the last 90 years. We are developing a complex platform that should automatically transcribe a significant portion of the archive, index it and eventually prepare it for full-text search. The four-year project, supported by the Czech Ministry of Culture, is challenging in that it must cope with huge volumes of data, with historical as well as contemporary language, with the rather low signal quality of old recordings, and with documents spoken not only in Czech but also in Slovak. The technology used includes speech, speaker and language recognition modules, speaker and channel adaptation components, tools for data indexing and retrieval, and a web interface that allows for public access to the archive. Currently, a demo version of the platform is available for testing and searching in some 10,000 hours of already processed data.
Mediterranean Electrotechnical Conference | 2010
Jan Nouza; Jindrich Zdansky; Petr Cerva
In this paper we describe a complex system we developed for the automatic acquisition of a large corpus of spoken Czech. The system is capable of continuously monitoring a selected Czech TV station and providing an automatic transcription of its audio track. The transcription is performed by our own speech recognition engine, which employs a vocabulary of the 350 thousand most frequent Czech words (and word-forms). Transcription accuracy is fairly good for studio speech (above 90 percent), but may drop significantly for noisy recordings and spontaneous speech. Nevertheless, the system runs without any human supervision, and during its operation in 2007 it collected, transcribed, stored and indexed more than 1800 hours of Czech spoken documents. Any word or word combination in this corpus can be easily searched via a full-text search engine accessible over the internet.
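The accuracy figure quoted above is conventionally derived from the Word Error Rate (WER), with accuracy roughly 100% minus WER. The sketch below is the textbook dynamic-programming computation, not code from the described system.

```python
# Standard WER: Levenshtein edit distance over words, divided by the
# number of reference words.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("dobry den vazeni posluchaci",
          "dobry den vazeni poslouchaci"))  # 1 substitution / 4 words = 0.25
```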
Journal of Multimedia | 2012
Jan Nouza; Karel Blavka; Petr Cerva; Jindrich Zdansky; Jan Silovsky; Marek Bohac; Jan Prazak
In this paper we describe a complex software platform that is being developed for the automatic transcription and indexing of the Czech Radio archive of spoken documents. The archive contains more than 100,000 hours of audio recordings covering almost ninety years of public broadcasting in the Czech Republic and the former Czechoslovakia. The platform is based on modern speech processing technology and includes modules for speech, speaker and language recognition, as well as tools for multimodal information retrieval. The aim of the project, supported by the Czech Ministry of Culture, is to make the archive accessible and searchable both for researchers and for the general public. After the first year of the project, the key modules have already been implemented and tested on a 27,400-hour subset of the archive. A web-based full-text search engine demonstrates the project's current state.
2015 IEEE International Workshop of Electronics, Control, Measurement, Signals and their Application to Mechatronics (ECMSM) | 2015
Lukas Mateju; Petr Cerva; Jindrich Zdansky
This paper deals with the utilization of deep neural networks (DNNs) for speech recognition. The main goal is to find the best strategy for training and using these models within the acoustic modeling module of a large vocabulary continuous speech recognition (LVCSR) system for the Czech language. For this purpose, various DNNs are trained a) using several training strategies, b) with different inner structures and c) using various kinds of features. An experimental evaluation is then performed on a large dataset including broadcast recordings, recordings of lectures, dictated judgments and a set of nonlinearly distorted utterances. The resulting recipe for training DNNs for our LVCSR system employs a) the ReLU activation function with a hidden layer width of 1024 neurons and b) filter-bank based features.
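A minimal PyTorch sketch of the kind of network the resulting recipe points to is given below: ReLU hidden layers 1024 neurons wide over filter-bank inputs. The input splicing, depth and output size are assumptions for illustration, not the paper's exact configuration.

```python
# Feed-forward DNN acoustic model: ReLU hidden layers, width 1024,
# filter-bank input. Input/output sizes (40 filter banks x 11-frame
# context, 2000 senones) and depth 5 are illustrative assumptions.
import torch
import torch.nn as nn

N_FBANK, CONTEXT, N_SENONES, WIDTH, DEPTH = 40, 11, 2000, 1024, 5

layers, in_dim = [], N_FBANK * CONTEXT
for _ in range(DEPTH):
    layers += [nn.Linear(in_dim, WIDTH), nn.ReLU()]
    in_dim = WIDTH
layers.append(nn.Linear(in_dim, N_SENONES))  # senone scores (pre-softmax)
dnn = nn.Sequential(*layers)

x = torch.randn(8, N_FBANK * CONTEXT)  # batch of spliced filter-bank frames
print(dnn(x).shape)                    # torch.Size([8, 2000])
```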
Speech Communication | 2013
Petr Cerva; Jan Silovsky; Jindrich Zdansky; Jan Nouza; Ladislav Seps
This paper deals with speaker-adaptive speech recognition for large spoken archives. The goal is to improve the recognition accuracy of an automatic speech recognition (ASR) system that is being deployed for transcription of a large archive of Czech radio. This archive represents a significant part of Czech cultural heritage, as it contains recordings covering 90years of broadcasting. A large portion of these documents (100,000h) is to be transcribed and made public for browsing. To improve the transcription results, an efficient speaker-adaptive scheme is proposed. The scheme is based on integration of speaker diarization and adaptation methods and is designed to achieve a low Real-Time Factor (RTF) of the entire adaptation process, because the archives size is enormous. It thus employs just two decoding passes, where the first one is carried out using the lexicon with a reduced number of items. Moreover, the transcripts from the first pass serve not only for adaptation, but also as the input to the speaker diarization module, which employs two-stage clustering. The output of diarization is then utilized for a cluster-based unsupervised Speaker Adaptation (SA) approach that also utilizes information based on the gender of each individual speaker. Presented experimental results on various types of programs show that our adaptation scheme yields a significant Word Error Rate (WER) reduction from 22.24% to 18.85% over the Speaker Independent (SI) system while operating at a reasonable RTF.
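The control flow of the two-pass scheme can be summarized as follows. Every function here is a hypothetical stub standing in for a real module (reduced-lexicon decoder, two-stage diarization clustering, cluster-based adaptation); only the data flow between passes is meant to be accurate.

```python
# Structural sketch of the two-pass adaptive scheme; all functions are
# placeholder stubs, shown only to make the pass-to-pass data flow explicit.
def first_pass_decode(audio):           # fast pass with a reduced lexicon
    return [("segment_1", "draft transcript")]

def diarize(audio, draft):              # two-stage clustering -> speaker clusters
    return {"spk_A": ["segment_1"]}

def adapt_and_decode(audio, clusters, draft):  # cluster-based, gender-aware SA
    return {seg: "final transcript"
            for segs in clusters.values() for seg in segs}

def transcribe(audio):
    draft = first_pass_decode(audio)            # pass 1
    clusters = diarize(audio, draft)            # reuses pass-1 output
    return adapt_and_decode(audio, clusters, draft)  # pass 2: adapted models

print(transcribe("broadcast.wav"))
```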
Multimedia Signal Processing | 2012
Jan Silovsky; Jindrich Zdansky; Jan Nouza; Petr Cerva; Jan Prazak
In this paper we study the effect of incorporating automatic transcriptions in the speaker diarization process. We aim to improve both the diarization accuracy, as evaluated by standard objective measures, and the quality of the diarization output from the user's perspective. Although the presented approach relies on the output of an automatic speech recognizer, it makes no use of lexical information. Instead, we use information about word boundaries and the classification of non-speech events occurring in the processed stream. The former information is used as a constraining condition for speaker change-point candidates, and the latter makes it possible to discard various vocal noise sounds that carry no speaker-specific information (given the representation of the signal by cepstral features) and thus harm the representation of speakers. The experimental evaluation of the presented approach was carried out using the COST278 multilingual broadcast news database. We demonstrate that the approach yields improvements in terms of both speaker diarization and segmentation performance measures. Furthermore, we show that the number of change-points detected within words (rather than at their boundaries) is significantly reduced.
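The word-boundary constraint can be pictured as follows: acoustically detected change-point candidates are kept only if they can be snapped onto a nearby ASR word boundary. The tolerance value and the data in this sketch are illustrative assumptions, not the paper's settings.

```python
# Sketch of the constraint: speaker change-point candidates are only
# allowed at ASR word boundaries; candidates far from any boundary are
# dropped as likely within-word false alarms.
def snap_to_boundaries(candidates, word_boundaries, tolerance=0.25):
    """Keep each candidate change-point (seconds) only if a word boundary
    lies within `tolerance` seconds; move it onto that boundary."""
    snapped = []
    for t in candidates:
        nearest = min(word_boundaries, key=lambda b: abs(b - t))
        if abs(nearest - t) <= tolerance:
            snapped.append(nearest)
    return sorted(set(snapped))

boundaries = [0.0, 0.8, 1.5, 2.9, 4.1]   # from the ASR word alignment
candidates = [1.42, 2.2, 4.0]            # from acoustic change detection
print(snap_to_boundaries(candidates, boundaries))  # [1.5, 4.1]
```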
Multimedia Signal Processing | 2012
Petr Cerva; Jan Silovsky; Jindrich Zdansky; Ondrej Smola; Karel Blavka; Karel Palecek; Jan Nouza; Jiri Malek
This paper presents a complex system developed to improve the quality of distance learning by allowing people to browse the content of various (academic) lectures. The system consists of several main modules. The first, an automatic speech recognition (ASR) module, is designed to cope with the inflective Czech language and provides time-aligned transcriptions of input audio-visual recordings of lectures. These transcriptions are generated off-line in two recognition passes using speaker adaptation methods and language models mixed from various text sources, including transcriptions of broadcast programs, spontaneous telephone talks, web discussions, theses, etc. Lecture recordings and their transcriptions are then indexed and stored in a database. The next module, a client-server web lecture browser, allows users to browse or play the indexed content and search in it.
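The language-model mixing mentioned above is commonly a linear interpolation of probabilities estimated on the individual text sources. The unigram simplification and the weights in this sketch are assumptions for illustration; real systems interpolate n-gram models.

```python
# Linear interpolation of word probabilities from several source-specific
# language models; weights and unigram form are illustrative only.
def mix_lm(models, weights):
    """models: list of {word: prob}; weights should sum to 1."""
    vocab = set().union(*models)
    return {w: sum(lam * m.get(w, 0.0) for lam, m in zip(weights, models))
            for w in vocab}

broadcast = {"prednaska": 0.002, "vlada": 0.010}
web_talk  = {"prednaska": 0.001, "super": 0.020}
mixed = mix_lm([broadcast, web_talk], [0.7, 0.3])
print(mixed["prednaska"])  # 0.7*0.002 + 0.3*0.001 = 0.0017
```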
Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions | 2009
Jan Silovsky; Petr Cerva; Jindrich Zdansky
This paper deals with the utilization of maximum likelihood linear regression (MLLR) adaptation transforms for speaker recognition in broadcast news streams. This task is characterized in particular by widely varying acoustic conditions, microphones, transmission channels and background noise, and by the short duration of recordings (usually in the range of 5 to 15 seconds). Features based on MLLR transforms are modeled using support vector machines (SVMs). The obtained results are compared with a GMM-based system using traditional MFCC features. The paper also deals with inter-session variability compensation techniques suitable for both systems and emphasizes the importance of feature vector scaling for the SVM-based system.
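A rough sketch of the modeling step, under the assumption that each recording is represented by a fixed-length vector built from its MLLR transform parameters: scale the vectors, then train an SVM. The synthetic data and the scikit-learn pipeline are illustrative, not the paper's own setup.

```python
# Per-recording "MLLR supervectors" (stacked transform entries) classified
# with an SVM; scaling is included because, as the abstract stresses, SVM
# performance depends strongly on feature vector scaling.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))       # 100 recordings x 500-dim vectors (synthetic)
y = rng.integers(0, 5, size=100)      # 5 hypothetical target speakers

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X, y)
print(clf.predict(X[:3]))
```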