Zbyněk Zajíc
University of West Bohemia
Publication
Featured research published by Zbyněk Zajíc.
text, speech and dialogue | 2011
Zbyněk Zajíc; Lukáš Machlica; Luděk Müller
One of the most widely used adaptation techniques is feature Maximum Likelihood Linear Regression (fMLLR). Compared with other adaptation methods, it has significantly fewer free parameters to estimate, so it is well suited to situations with a small amount of adaptation data. However, fMLLR still fails when the data set is extremely small. Such situations can be handled by properly initializing the fMLLR estimation with a-priori information. This paper proposes a novel approach to fMLLR initialization that uses statistics from speakers acoustically close to the speaker being adapted. The proposed initialization substitutes the missing adaptation data with similar data from a training database; as a result, fMLLR estimation becomes well-conditioned and the accuracy of the recognition system increases even with extremely small data sets.
international conference on speech and computer | 2016
Zbyněk Zajíc; Marie Kunešová; Vlasta Radová
The goal of this paper is to evaluate the contribution of speaker change detection (SCD) to the performance of a speaker diarization system in the telephone domain. We compare the overall performance of an i-vector based system using both SCD-based segmentation and a naive constant length segmentation with overlapping segments. The diarization system performs K-means clustering of i-vectors which represent the individual segments, followed by a resegmentation step. Experiments were done on the English part of the CallHome corpus. The final results indicate that the use of speaker change detection is beneficial, but the differences between the two segmentation approaches are diminished by the use of resegmentation.
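The naive constant-length segmentation with overlapping segments mentioned in this abstract can be illustrated with a minimal sketch. The function name, the 2-second window, and the 1-second overlap are hypothetical values chosen for illustration; the abstract does not state the settings used in the paper.

```python
def constant_length_segments(duration, seg_len=2.0, overlap=1.0):
    """Return (start, end) windows in seconds covering [0, duration].

    seg_len and overlap are illustrative defaults, not values from the paper.
    """
    if seg_len <= 0 or not 0 <= overlap < seg_len:
        raise ValueError("need seg_len > 0 and 0 <= overlap < seg_len")
    step = seg_len - overlap  # distance between consecutive window starts
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + seg_len, duration)  # clip the last window
        segments.append((start, end))
        if end >= duration:
            break
        start += step
    return segments


segments = constant_length_segments(5.0, seg_len=2.0, overlap=1.0)
print(segments)
```

In a diarization pipeline of the kind described, each such window would then be represented by an i-vector and passed to clustering; that part is not shown here.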
international conference on acoustics, speech, and signal processing | 2013
Lukáš Machlica; Zbyněk Zajíc
Probabilistic Linear Discriminant Analysis (PLDA), used particularly in image and speech processing for face and speaker recognition, respectively, is a generative model that requires a large amount of training data. This paper proposes several enhancements to the implementation of the PLDA estimation algorithm that provide substantial computational savings. First, the inversion of one huge matrix is replaced by the inversion of two significantly smaller matrices. Second, it is shown how to avoid processing the whole data set in each iteration of the estimation algorithm. Supplementary results are presented on NIST SRE 2008.
text speech and dialogue | 2010
Zbyněk Zajíc; Lukáš Machlica; Luděk Müller
This paper deals with robust estimation of the data statistics used for adaptation. The statistics are accumulated from the available adaptation data before the adaptation process. In general, only a small amount of adaptation data is assumed. These data are often corrupted by noise or channel effects and do not contain only clean speech. In addition, training Hidden Markov Models (HMMs) relies on several assumptions that may not hold in practice. We therefore describe several techniques that aim to make the adaptation as robust as possible in order to increase the accuracy of the adapted system. One of the methods initializes the adaptation statistics in order to prevent ill-conditioned transformation matrices. Another problem arises when an acoustic feature is assigned to an improper HMM state even though the reference transcription is available. Such situations can occur because of the forced alignment process used to align frames to states. It is therefore useful to accumulate the data statistics using only reliable frames (in the sense of data likelihood). We focus on Maximum Likelihood Linear Transformations, and the experiments were performed using feature Maximum Likelihood Linear Regression (fMLLR). The experiments aim to describe the behavior of the system extended by the proposed methods.
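The idea of accumulating statistics only from reliable frames can be sketched as follows. This is a simplified illustration under stated assumptions: a fixed log-likelihood threshold stands in for the reliability criterion, and only a frame count and a first-order sum are accumulated. Real fMLLR statistics are weighted by per-Gaussian occupation probabilities, and the abstract does not specify the exact selection rule, so the function name and threshold are hypothetical.

```python
def accumulate_reliable(frames, loglikes, threshold):
    """Accumulate a frame count and first-order sum from reliable frames only.

    frames    : list of feature vectors (lists of floats)
    loglikes  : per-frame log-likelihood under the aligned HMM state
    threshold : hypothetical reliability cut-off (an assumption of this sketch)
    """
    dim = len(frames[0]) if frames else 0
    count = 0
    first_order = [0.0] * dim
    for frame, loglike in zip(frames, loglikes):
        if loglike < threshold:
            continue  # frame is likely misaligned or corrupted; skip it
        count += 1
        for d, value in enumerate(frame):
            first_order[d] += value
    return count, first_order


frames = [[1.0, 2.0], [10.0, 20.0], [3.0, 4.0]]
loglikes = [-5.0, -50.0, -6.0]
count, first_order = accumulate_reliable(frames, loglikes, threshold=-10.0)
print(count, first_order)
```

The second frame has a much lower likelihood than the others and is excluded, so it cannot distort the accumulated statistics.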
text, speech and dialogue | 2014
Lucie Skorkovská; Zbyněk Zajíc
Multi-label classification plays a key role in modern categorization systems. Its goal is to find the set of labels belonging to each data item. Unlike multi-class classification, where only the best topic is chosen, in multi-label document classification the classifier must decide whether a document does or does not belong to each topic from a predefined topic set. We use a generative classifier to tackle this task, but the problem with this approach is that a threshold for positive classification must be set. This threshold can vary for each document depending on its content (words used, length of the document, ...). In this paper we use Unconstrained Cohort Normalization, primarily proposed for the speaker identification/verification task, to robustly find the threshold defining the boundary between the correct and the incorrect topics of a document. In our earlier experiments we proposed a method for finding this threshold inspired by another normalization technique called World Model score normalization. A comparison of these normalization methods has shown that better results can be achieved with Unconstrained Cohort Normalization.
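A minimal sketch of cohort-based score normalization for topic selection follows. It assumes the form commonly used in speaker verification: each topic's log-likelihood score is normalized by subtracting the mean score of the best-scoring competing topics (the cohort), and a topic is accepted when its normalized score is positive. The cohort size, the zero threshold, the function name, and the example scores are all assumptions of this sketch, not details taken from the paper.

```python
def ucn_select(scores, cohort_size=2):
    """Select topics whose cohort-normalized score is positive.

    scores      : dict mapping topic -> log-likelihood of the document
    cohort_size : number of best-scoring rival topics used for normalization
                  (hypothetical value)
    """
    selected = []
    for topic, score in scores.items():
        # Rival scores, best first; the cohort depends on this document only.
        rivals = sorted((s for t, s in scores.items() if t != topic), reverse=True)
        cohort = rivals[:cohort_size]
        if not cohort:
            selected.append(topic)  # a single topic has nothing to compete with
            continue
        normalized = score - sum(cohort) / len(cohort)
        if normalized > 0:
            selected.append(topic)
    return selected


doc_scores = {"sport": -10.0, "politics": -11.0, "culture": -30.0, "science": -32.0}
print(ucn_select(doc_scores))
```

Because the normalization is relative to the document's own score distribution, the decision boundary adapts to each document rather than relying on one global threshold.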
text speech and dialogue | 2012
Lukáš Machlica; Zbyněk Zajíc
This paper presents recent methods used in the task of speaker recognition. First, the extraction of so-called i-vectors from GMM-based supervectors is discussed. These i-vectors are of low dimension and lie in a subspace denoted as the Total Variability Space (TVS). The paper focuses on Probabilistic Linear Discriminant Analysis (PLDA), which is used as a generative model in the TVS. The influence of development data is analyzed using distinct speech corpora. It is shown that it is preferable to cluster the available speech corpora into classes, train one PLDA model for each class, and fuse the results at the end. Experiments are presented on NIST Speaker Recognition Evaluation (SRE) 2008 and NIST SRE 2010.
text speech and dialogue | 2017
Marie Kunešová; Zbyněk Zajíc; Vlasta Radová
In offline speaker diarization systems, particularly those aimed at telephone speech, the accuracy of the initial segmentation of a conversation is often a secondary concern. Imprecise segment boundaries are typically corrected during resegmentation, which is performed as the final step of the diarization process. However, such resegmentation is generally not possible in online systems, where past decisions are usually unchangeable. In such situations, correct segmentation becomes critical. In this paper, we evaluate several different segmentation approaches in the context of online diarization by comparing the overall performance of an i-vector-based diarization system set to operate in a sequential manner.
international conference on speech and computer | 2014
Zbyněk Zajíc; Jan Zelinka; Jan Vaněk; Luděk Müller
The aim of this work is to propose a refinement of the shift-MLLR (shift Maximum Likelihood Linear Regression) adaptation of an acoustic model for the case of a limited amount of adaptation data, which can lead to ill-conditioned transformation matrices. We try to suppress the influence of badly estimated transformation parameters using an Artificial Neural Network (ANN), in particular a Convolutional Neural Network (CNN) with a bottleneck layer at the end. The badly estimated shift-MLLR transformation is propagated through an ANN (suitably trained beforehand), and the output of the network is used as the new refined transformation. To train the ANN, the badly and the well conditioned shift-MLLR transformations are used as its inputs and outputs, respectively.
text speech and dialogue | 2012
Zbyněk Zajíc; Lukáš Machlica; Luděk Müller
The most serious problem in adaptation is the lack of adaptation data. This work focuses on feature Maximum Likelihood Linear Regression (fMLLR) adaptation, where the number of free parameters to be estimated is significantly lower than in other adaptation methods. However, the number of free parameters of the fMLLR transform is still too high to be estimated properly when dealing with extremely small data sets. We describe and compare various methods used to avoid this problem, namely the initialization of the fMLLR transform and a linear combination of basis matrices, with the variants differing in how the basis is estimated (eigen decomposition, factor analysis, independent component analysis, and maximum likelihood estimation). Initialization methods compensate for the absence of the test speaker's data by using other suitable data. Methods using a linear combination of basis matrices reduce the number of estimated fMLLR parameters to a smaller number of weights. The experiments compare the results of the proposed basis and initialization methods.
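The parameter reduction achieved by a linear combination of basis matrices can be sketched as below. The sketch assumes the usual fMLLR form, in which a transform W = [A, b] maps a feature o to A·o + b, and builds W as a weighted sum of fixed basis matrices. The function names, the toy basis, and the weights are hypothetical; how the basis is estimated (the subject of the paper) is not shown.

```python
def combine_basis(weights, basis):
    """Build W = sum_k weights[k] * basis[k]; all matrices share one shape."""
    rows, cols = len(basis[0]), len(basis[0][0])
    W = [[0.0] * cols for _ in range(rows)]
    for weight, B in zip(weights, basis):
        for i in range(rows):
            for j in range(cols):
                W[i][j] += weight * B[i][j]
    return W


def apply_fmllr(W, feature):
    """Apply W = [A, b] to a feature vector: returns A @ feature + b."""
    extended = list(feature) + [1.0]  # append 1 so the last column acts as bias
    return [sum(w * x for w, x in zip(row, extended)) for row in W]


# Toy 2-dimensional example: W has 2 x 3 = 6 entries, but only 2 weights
# would need to be estimated from the adaptation data.
basis = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # identity transform, zero bias
    [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]],  # unit shift of the first dimension
]
W = combine_basis([1.0, 0.5], basis)
print(apply_fmllr(W, [2.0, 3.0]))
```

With realistic feature dimensions the full transform has thousands of entries, so estimating a short weight vector instead is far better conditioned when only a few seconds of speech are available.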
text, speech and dialogue | 2018
Zbyněk Zajíc; Daniel Soutner; Marek Hrúz; Luděk Müller; Vlasta Radová
In this paper, we propose a speaker change detection system based on lexical information from the transcribed speech. For this purpose, we applied a recurrent neural network to decide whether there is an end of an utterance at the end of a spoken word. Our motivation is to use the transcription of the conversation as an additional feature for a speaker diarization system in order to refine the segmentation step and achieve better accuracy of the whole diarization system. We compare the proposed speaker change detection system based on transcription (text) with our previous system based on information from the spectrogram (audio), and we combine these two modalities to improve the diarization results. We cut the conversation into segments according to the detected changes and represent each segment by an i-vector. We conducted experiments on the English part of the CallHome corpus. The results indicate a relative improvement of 0.5% in speaker change detection and of 1% in speaker diarization when both modalities are used.