Jindrich Matousek | Researchain



Publication

Featured research published by Jindrich Matousek.

Speech Communication | 2011

On the detection of pitch marks using a robust multi-phase algorithm

Milan Legát; Jindrich Matousek; Daniel Tihelka

A large number of methods for identifying glottal closure instants (GCIs) in voiced speech have been proposed in recent years. In this paper, we propose to take advantage of both glottal and speech signals in order to increase the accuracy of GCI detection. All aspects of this particular issue, from determining speech polarity to handling the delay between the glottal signal and the corresponding speech signal, are addressed. A robust multi-phase algorithm (MPA), which combines different methods applied to both signals in a unique way, is presented. Within the process, special attention is paid to determining the polarity of the speech waveform, as it was found to considerably influence the performance of the detection algorithms. Another feature of the proposed method is that every detected GCI is given a confidence score, which makes it possible to locate potentially inaccurate GCI subsequences. The performance of the proposed algorithm was tested and compared with other freely available GCI detection algorithms. The MPA was found to be more robust in terms of detection accuracy over various sets of sentences, languages and phone classes. Finally, some pitfalls of GCI detection are discussed.

international conference on signal processing | 2008

Towards automatic audio track generation for Czech TV broadcasting: Initial experiments with subtitles-to-speech synthesis

Zdenek Hanzlícek; Jindrich Matousek; Daniel Tihelka

In this paper, the project “Elimination of the Language Barriers Faced by the Handicapped Watchers of the Czech Television”, aimed at making Czech TV broadcasting available to a broader group of TV watchers, is introduced. More specifically, the problems of automatic audio track generation within the project are discussed. As the audio track will be produced from subtitles, text-to-speech (TTS) technology will be utilised. Several versions of a TTS system planned to produce the audio track are described. The main attention is paid to the analysis of synchronicity between subtitles and the synthetic speech. Problems with fitting synthetic speech into the predefined subtitle slots were revealed: for more than 44% of all subtitles, the synthetic speech overlapped the slots. Great care will therefore have to be taken to produce speech at faster rates when customising our TTS system for the task of generating audio tracks from subtitles.
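The synchronicity analysis described in this abstract can be illustrated with a minimal sketch. The tuple layout, the function names and all timing values are assumptions made for illustration; the abstract does not say how the authors measured overlap.

```python
from typing import List, Tuple

# (slot_start_s, slot_end_s, synthesized_duration_s); hypothetical layout.
# Every slot is assumed to have a positive length.
Subtitle = Tuple[float, float, float]


def overlap_ratio(subtitles: List[Subtitle]) -> float:
    """Fraction of subtitles whose synthetic speech is longer than its slot."""
    if not subtitles:
        return 0.0
    too_long = sum(1 for start, end, dur in subtitles if dur > end - start)
    return too_long / len(subtitles)


def required_speedup(start: float, end: float, duration: float) -> float:
    """Rate factor (>= 1.0) needed for the speech to fit into its slot."""
    return max(1.0, duration / (end - start))
```

A ratio above 0.44 on real data would correspond to the situation reported in the abstract; `required_speedup` shows how much faster a given subtitle would have to be spoken to fit.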

international conference on acoustics, speech, and signal processing | 2014

Very fast unit selection using Viterbi search with zero-concatenation-cost chains

Jiri Kala; Jindrich Matousek

This paper introduces a very fast heuristic search algorithm for unit-selection speech synthesis. The algorithm modifies the commonly used Viterbi search framework by introducing zero-concatenation-cost (ZCC) chains of unit candidates that were immediate neighbors in the source speech corpus. ZCC chains are preferred because they represent perfect speech segment concatenations (so there is no need to compute concatenation costs inside the chains), unless a so-called target specification is violated. The number of ZCC chains is reduced based on statistics calculated from the synthesis of a large number of utterances. ZCC chains are then combined with single unit candidates to fill possible gaps in the sequence of candidates. The proposed method reduces the computational load of a unit selection system by up to hundreds of times. According to listening tests, the quality of the synthetic speech did not deteriorate.
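The core idea, that units adjacent in the source corpus need no concatenation cost, can be sketched inside an ordinary Viterbi search. This is not the authors' algorithm: the statistics-based pruning of ZCC chains and the target-specification check are not modelled, and the candidate representation (corpus index plus target cost) is an assumption.

```python
from typing import Callable, List, Tuple

# A candidate is (corpus_index, target_cost); corpus_index is the unit's
# position in the source corpus (hypothetical representation).
Candidate = Tuple[int, float]


def viterbi_select(
    candidates: List[List[Candidate]],
    concat_cost: Callable[[int, int], float],
) -> Tuple[float, List[int]]:
    """Return (total cost, chosen corpus indices) of the cheapest path."""
    # best[j] holds (cost, path) of the cheapest path ending in candidate j.
    best = [(tc, [idx]) for idx, tc in candidates[0]]
    for column in candidates[1:]:
        new_best = []
        for idx, tc in column:
            options = []
            for prev_cost, prev_path in best:
                prev_idx = prev_path[-1]
                # Units adjacent in the corpus join perfectly: skip the cost.
                join = 0.0 if idx == prev_idx + 1 else concat_cost(prev_idx, idx)
                options.append((prev_cost + join + tc, prev_path + [idx]))
            new_best.append(min(options, key=lambda o: o[0]))
        best = new_best
    return min(best, key=lambda o: o[0])
```

In this sketch the saving comes only from skipping `concat_cost` for adjacent units; the speed-up reported in the abstract additionally relies on treating whole chains as single search items.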

information sciences, signal processing and their applications | 2003

Sentence boundary detection in Czech TTS system using neural networks

Jan Romportl; Daniel Tihelka; Jindrich Matousek

This paper presents results of applying a neural network to the problem of deciding whether a certain punctuation mark in Czech text is or is not the end of a sentence. It also discusses possibilities of using methods for extracting relevant parameters and compares a neural-network-based method with a Bayes classifier and a heuristic classifier.

conference of the international speech communication association | 2016

Voting Detector: A Combination of Anomaly Detectors to Reveal Annotation Errors in TTS Corpora.

Jindrich Matousek; Daniel Tihelka

Anomaly detection techniques were shown to help in detecting word-level annotation errors in read-speech corpora for text-to-speech synthesis. In this framework, correctly annotated words are considered normal examples on which the detection methods are trained. Misannotated words are then taken as anomalous examples which do not conform to the normal patterns of the trained detection models. In this paper we propose the concept of a voting detector: a combination of anomaly detectors in which each “single” detector “votes” on whether a tested word is annotated correctly or not. The final decision is then made by aggregating the votes. Our experiments show that the voting detector has the potential to outperform each of the single anomaly detectors.
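A minimal sketch of vote aggregation follows. The abstract does not specify the aggregation rule or the underlying detectors, so the `min_votes` threshold and the toy threshold-based detectors below are assumptions standing in for trained anomaly detection models.

```python
from typing import Callable, Sequence

# A detector returns True when a word looks misannotated (anomalous).
Detector = Callable[[Sequence[float]], bool]


def voting_detector(detectors: Sequence[Detector], min_votes: int) -> Detector:
    """Combine detectors: flag a word when at least min_votes of them flag it."""
    def vote(features: Sequence[float]) -> bool:
        votes = sum(1 for d in detectors if d(features))
        return votes >= min_votes
    return vote


# Three toy detectors, each thresholding hypothetical word-level features.
toy_detectors = [
    lambda f: f[0] > 1.0,
    lambda f: f[1] > 1.0,
    lambda f: f[0] + f[1] > 1.5,
]
majority = voting_detector(toy_detectors, min_votes=2)
```

Setting `min_votes` to 1 or to the number of detectors would give the "any" and "all" variants of the same scheme.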

text speech and dialogue | 1999

Statistical Approach to the Automatic Synthesis of Czech Speech

Jindrich Matousek; Josef Psutka; Zbynek Tychtl

The use of multiple Hidden Markov Models (HMMs) to construct a Czech speech segment database (SSD) and speech synthesis based on this inventory are presented in this paper. HMMs are used to model triphones. Binary decision trees are applied to automatically cluster the states of triphone HMMs. The clustered states are then employed to automatically segment the speech corpus and to create an SSD. The SSD constructed in this way is assumed to enable more precise context modeling than was previously possible. Several speech techniques for constructing a concatenation-based synthesizer are discussed. Special attention is paid to an MFCC-based pitch-synchronous residually excited approach.

international conference on signal processing | 2016

Examining the ability of one-class classifier to ensure the spectral smoothness of concatenated units

Daniel Tihelka; Martin Gruber; Jindrich Matousek; Markéta Juzová

We present initial experiments with one-class classification, aimed at replacing the “classic” heuristics-based measures used to estimate the smoothness of units concatenated together within unit selection speech synthesizers. A set of spectral feature distances was computed between neighbouring frames in natural speech recordings, i.e. those representing natural joins, from which the per-vowel classifier was trained. For the evaluation, we carried out ad-hoc listening tests collecting several examples of smooth and discontinuous joins, against which the classifier is tested. In addition, we also plugged the classifier into our TTS system to verify that the technique is capable of replacing the classic approach in the generic unit selection procedure.

international conference on telecommunications | 2015

Detection of artefacts in Czech synthetic speech based on ANOVA statistics

Jiri Pribil; Anna Pribilova; Jindrich Matousek

The paper describes an experiment using a statistical approach based on analysis of variance (ANOVA) and hypothesis tests to detect artefacts in synthetic speech produced by a Czech text-to-speech system employing the unit selection principle. In addition, the paper analyses the influence of different speech spectral features and supra-segmental parameters, as well as the length of the feature vector, on the resulting artefact detection accuracy. Other factors which can influence the stability of the artefact detection process are also analysed. The results of the experiments confirm that the chosen concept works properly and that the presented artefact detector can be used as an alternative to standard listening tests.
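For orientation, the one-way ANOVA F statistic that such an approach builds on can be computed as below. How the authors group feature values and turn the statistic into an artefact decision is not stated in the abstract; the function name and the example grouping are assumptions.

```python
from statistics import fmean
from typing import List, Sequence


def f_statistic(groups: List[Sequence[float]]) -> float:
    """One-way ANOVA F: between-group variance over within-group variance."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [fmean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F for a given feature (for instance, values measured on clean versus suspect speech segments) suggests that the group means differ by more than within-group scatter would explain; the significance level would come from the F distribution with `k - 1` and `n - k` degrees of freedom.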

2016 First International Workshop on Sensing, Processing and Learning for Intelligent Machines (SPLINE) | 2016

GMM-based speaker gender and age classification after voice conversion

Jiri Pribil; Anna Pribilova; Jindrich Matousek

This paper describes an experiment using Gaussian mixture models (GMMs) for classification of speaker gender and age and for evaluation of the success achieved in the voice conversion process. The main motivation of the work was to test whether this type of classifier can be used as an alternative to the conventional listening test in the area of speech evaluation. The proposed two-level GMM classifier was first verified for detection of four age categories (child, young, adult, senior) as well as discrimination of gender for all but children's voices in the Czech and Slovak languages. Then the classifier was applied to gender/age determination of the basic adult male/female original speech together with its conversion. The resulting classification accuracy confirms the usability of the proposed evaluation method and the effectiveness of the performed voice conversions.

international conference on signal processing | 2014

Reducing footprint of unit selection TTS system by removing linguistic segments with rarely selected units

Martin Gruber; Jindrich Matousek; Daniel Tihelka; Zdenek Hanzlícek

This paper focuses on reducing the size of speech corpora used in unit-selection-based TTS systems. The size of a speech corpus influences system requirements such as storage and memory demands and computational complexity. For high-quality speech synthesis, the speech corpus usually consists of several thousand sentences. Thus an appropriate reduction of the corpus size is likely to lead to a decrease in the system requirements. In this work, the impact on synthetic speech quality of removing specific instances of different linguistic segment types from the original corpus is compared. Removal of the following segment types is compared: whole sentences, phrases, words, and diphones. Only segments with rarely selected units are removed from the corpus, so that the resulting footprint size reaches a predefined value. Results confirm that synthetic speech generated by TTS systems using the reduced corpora is of slightly worse quality than speech produced by the system employing the original full corpus. The comparison of the reduction based on different linguistic segments is also presented here.
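The reduction step can be sketched as a greedy removal of the least frequently selected segments until a target footprint is reached. The per-segment selection count, the size unit and the function name are assumptions; the abstract does not say how selection statistics are aggregated per segment.

```python
from typing import Dict, List


def reduce_corpus(
    segment_sizes: Dict[str, int],
    selection_counts: Dict[str, int],
    target_size: int,
) -> List[str]:
    """Return IDs of segments to keep so the total size is <= target_size.

    Segments are dropped in order of increasing selection count, i.e. the
    least frequently selected ones go first.
    """
    kept = dict(segment_sizes)
    total = sum(kept.values())
    # Rarely selected segments first; ties are broken by ID for determinism.
    by_rarity = sorted(kept, key=lambda s: (selection_counts.get(s, 0), s))
    for seg in by_rarity:
        if total <= target_size:
            break
        total -= kept.pop(seg)
    return sorted(kept)
```

The same routine applies whether a segment is a sentence, phrase, word or diphone; only the granularity of the keys in `segment_sizes` changes.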
