Peter Viszlay
Technical University of Košice
Publications
Featured research published by Peter Viszlay.
international conference on telecommunications | 2015
Jozef Vavrek; Peter Viszlay; Eva Kiktova; Martin Lojka; Jozef Juhár; Anton Cizmar
In this paper, we introduce a novel approach to Query-by-Example (QbE) retrieval that builds on the fundamental principles of posteriorgram-based Spoken Term Detection (STD). The proposed approach is a modification of the widely used segmental variant of the dynamic programming algorithm. Our solution is a sequential variant of the DTW algorithm employing a one-step-forward moving strategy: each DTW search is carried out sequentially, block by block, where each block is a square input distance matrix whose size equals the length of the retrieved query. We also examine how to speed up the sequential DTW algorithm without a considerable loss in retrieval performance by implementing a linear time-aligned accumulated distance. An increase in detection accuracy is ensured by a weighted cumulative distance score parameter; we therefore call this approach the Weighted Fast Sequential DTW (WFS-DTW) algorithm. A novel PCA-based silence discriminator is used along with it. The proposed algorithm is evaluated on the ParDat1 corpus using the Term Weighted Value (TWV) metric.
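The block-by-block search described above can be sketched in a few lines. This is an illustrative toy, not the authors' WFS-DTW implementation: frames are one-dimensional, the local distance is an absolute difference, and the weighting and silence-discrimination steps are omitted.

```python
# Sketch of a sequential DTW search: slide a query-sized block over the
# utterance one frame at a time and keep the best-matching start position.

def dtw_cost(query, block):
    """Classic DTW alignment cost between two equal-length frame sequences."""
    n, m = len(query), len(block)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(query[i - 1] - block[j - 1])   # local distance (1-D frames)
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m] / (n + m)                       # length-normalised cost

def sequential_search(query, utterance):
    """One-step-forward strategy: evaluate every query-length block."""
    q = len(query)
    best_start, best_cost = 0, float("inf")
    for start in range(len(utterance) - q + 1):
        cost = dtw_cost(query, utterance[start:start + q])
        if cost < best_cost:
            best_start, best_cost = start, cost
    return best_start, best_cost

utterance = [0.0] * 20 + [1.0, 2.0, 3.0, 2.0, 1.0] + [0.0] * 20
query = [1.0, 2.0, 3.0, 2.0, 1.0]
start, cost = sequential_search(query, utterance)
print(start, cost)  # the query pattern is found at frame 20 with cost 0.0
```

In the real algorithm the frames would be posteriorgram vectors and each block's accumulated distance would carry the weighted score mentioned in the abstract.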
Archive | 2012
Jozef Juhár; Peter Viszlay
The most common acoustic front-ends in automatic speech recognition (ASR) systems are based on the state-of-the-art Mel-frequency cepstral coefficients (MFCCs). Practice shows that this general technique is a good choice for obtaining a satisfactory speech representation. Over the past few decades, researchers have made a great effort to develop and apply techniques that may improve on the recognition performance of conventional MFCCs. In general, these methods were taken from mathematics and have been applied in many research areas, such as face and speech recognition, high-dimensional data and signal processing, and video and image coding, among others. One group of such methods is represented by linear transformations.
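As a minimal illustration of the "linear transformation on top of MFCCs" idea, the sketch below learns a plain PCA projection (one member of that family) from synthetic feature vectors and applies it to the frames. The data, dimensions and retained component count are invented assumptions, not from the chapter.

```python
# A linear transformation (here PCA) estimated from training frames and then
# applied to MFCC-like feature vectors. Data are synthetic for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))          # 500 frames of 13 "MFCC" coefficients
X[:, 0] *= 10.0                         # give one direction most of the variance

Xc = X - X.mean(axis=0)                 # centre the training data
cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance of the frames
eigval, eigvec = np.linalg.eigh(cov)    # eigendecomposition (ascending order)
W = eigvec[:, ::-1][:, :8]              # keep the 8 leading components

Y = Xc @ W                              # transformed, decorrelated features
print(Y.shape)                          # (500, 8)
```

Supervised transforms such as LDA follow the same pattern (estimate a matrix from training data, multiply each frame by it); only the statistics used to build `W` differ.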
international conference radioelektronika | 2016
Tomáš Koctúr; Peter Viszlay; Ján Staš; Martin Lojka; Jozef Juhár
An acoustic model is a necessary component of an automatic speech recognition (ASR) system. Acoustic models are trained on large amounts of speech recordings with transcriptions; usually, hundreds of transcribed recordings are required, and creating manual transcriptions is a very time- and resource-consuming process. Alternatively, acoustic models may be obtained automatically with unsupervised acoustic model training, which uses online speech resources. The obtained speech data are recognized with a low-resourced ASR system; unsupervised techniques then filter out the erroneous hypotheses from the result and use the rest for acoustic model training. This paper presents unsupervised methods for generating speech corpora for acoustic model training.
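The filtering step can be pictured as follows. This is a hypothetical sketch: the segment records and the confidence threshold are invented, and real systems would derive confidences from recognition lattices or posteriors rather than a stored field.

```python
# Keep only automatically recognised segments whose confidence exceeds a
# threshold; the surviving segments become acoustic-model training data.

hypotheses = [
    {"file": "talk01.wav", "text": "dobry den", "confidence": 0.93},
    {"file": "talk02.wav", "text": "dakujem",   "confidence": 0.41},
    {"file": "talk03.wav", "text": "prosim",    "confidence": 0.88},
]

THRESHOLD = 0.80  # assumed cut-off for "trusted" transcriptions

def filter_for_training(hyps, threshold=THRESHOLD):
    """Drop likely-erroneous hypotheses; return the rest for training."""
    return [h for h in hyps if h["confidence"] >= threshold]

trainable = filter_for_training(hypotheses)
print([h["file"] for h in trainable])   # ['talk01.wav', 'talk03.wav']
```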
international conference on systems signals and image processing | 2013
Peter Viszlay; Jozef Juhár; Matus Pleva
Linear discriminant analysis (LDA) is a popular supervised feature transformation applied in current automatic speech recognition (ASR). Generally, the parameters of LDA are computed from the training data partitioned into classes. If the number of classes is smaller than the dimension of the supervectors (typically in phoneme-based LDA), the between-class covariance matrix can become singular or close to singular (the singularity problem of classical LDA). In this paper, we present a modification of the standard between-class covariance matrix estimation, which represents one possible approach to solving the singularity problem. Our method works directly with the supervectors instead of the class mean vectors. The number of estimation cycles is much larger because more data are used during the computation; thus, the matrix structure can be significantly refined. This implies that longer context lengths can be used while the singularity problem is efficiently eliminated. The effectiveness of the proposed estimation is evaluated in Slovak phoneme-based and triphone-based large vocabulary continuous speech recognition (LVCSR) tasks. The method is compared to the state-of-the-art MFCCs and to LDA trained in the standard way. The experimental results confirm that the modified LDA considerably outperforms the MFCCs and consistently improves on the conventional LDA.
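The singularity issue can be demonstrated numerically. The sketch below is a hedged illustration, not the paper's estimator: with C classes, the standard between-class scatter built from class means has rank at most C-1, so it is singular whenever C-1 is smaller than the feature dimension; accumulating one outer product per supervector (as the modified method does, whose exact weighting is not reproduced here) raises the rank. The data and dimensions are synthetic assumptions.

```python
# Rank of the standard mean-based between-class scatter vs. a supervector-
# based estimate, on synthetic data where dim > number of classes.
import numpy as np

rng = np.random.default_rng(1)
dim, n_classes, per_class = 20, 3, 50            # dim > C-1 -> singular Sb
classes = [rng.normal(loc=c, size=(per_class, dim)) for c in range(n_classes)]
mu = np.concatenate(classes).mean(axis=0)        # global mean

# standard between-class scatter from the C class means
Sb = sum(len(X) * np.outer(X.mean(0) - mu, X.mean(0) - mu) for X in classes)

# modified estimate: one estimation cycle (outer product) per supervector
Sb_mod = sum(np.outer(x - mu, x - mu) for X in classes for x in X)

print(np.linalg.matrix_rank(Sb), np.linalg.matrix_rank(Sb_mod))  # 2 20
```

The mean-based matrix is rank-deficient (rank 2 for 3 classes) and cannot be inverted in the LDA eigenproblem, while the supervector-based estimate is full rank.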
Journal of Linguistics/Jazykovedný časopis | 2017
Ján Staš; Daniel Hládek; Peter Viszlay; Tomáš Koctúr
This paper describes a new corpus dedicated to Slovak speech recognition, built from TEDx talks and Jump Slovakia lectures. The proposed speech database consists of 220 talks and lectures with a total duration of about 58 hours. The annotated speech database was generated automatically, in an unsupervised manner, using acoustic speech segmentation based on principal component analysis and automatic speech transcription by two complementary speech recognition systems. Evaluation data consisting of 50 manually annotated talks and lectures, with a total duration of about 12 hours, have been created for evaluating the quality of Slovak speech recognition. By unsupervised automatic annotation of the TEDx talks and Jump Slovakia lectures, we obtained 21.26% of new speech segments with approximately 9.44% word error rate, suitable for retraining or adapting previously trained acoustic models.
international conference radioelektronika | 2016
Matus Pleva; Eva Kiktova; Peter Viszlay; Patrick Bours
This article describes a new algorithm for calibrated user authentication based on acoustic monitoring of the keyboard while a pre-defined word is typed. Hidden Markov models (HMMs) with Mel-frequency cepstral coefficient (MFCC) features were used in the setup. In the authentication task, a low Equal Error Rate (EER) between 9.4% and 14.8% was achieved using a calibration setup and three training sessions. In the identification task, an accuracy of 99.33% was achieved when testing 25% of the realizations (randomly selected from 100 recordings) and identifying among 50 users/models. Calibration was done using one user recording to calibrate the microphone and keyboard-table setup when enrolling that user's model. Genuine and impostor tests were carried out for 50 volunteers, each typing 100 words in 4 sessions.
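For readers unfamiliar with the EER metric reported above, the sketch below shows how it is read off genuine and impostor score distributions: it is the operating point where the false-accept and false-reject rates coincide. The scores are invented for illustration; real ones would come from HMM log-likelihoods of the typing sounds.

```python
# Equal Error Rate: scan candidate thresholds over all observed scores and
# find the one where false-accept rate (FAR) meets false-reject rate (FRR).

def eer(genuine, impostor):
    best = (1.0, None)                                       # (|FAR-FRR|, EER)
    for t in sorted(set(genuine + impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

genuine  = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6]   # same-user attempts
impostor = [0.4, 0.5, 0.3, 0.65, 0.2, 0.55]   # other-user attempts
print(round(eer(genuine, impostor), 3))       # 0.167
```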
international conference radioelektronika | 2015
Dávid Čonka; Peter Viszlay; Jozef Juhár
Two-dimensional linear discriminant analysis (2DLDA) is a popular feature transformation applied in current automatic speech recognition (ASR). The parameters of 2DLDA are usually computed on labelled training data partitioned into phonetic classes. It is generally known that one phonetic class contains speech data collected from different speakers, with different speech variability and context for the same phonetic unit; therefore, many clusters exist in each phonetic class. These effects are not taken into account by conventional 2DLDA. In this paper, we present an efficient improvement of 2DLDA, which involves the well-known K-means clustering technique to modify the standard class definition. The clustering algorithm is used to identify the existing clusters in the basic classes, which are then treated as the new classes for the subsequent 2DLDA estimation. The proposed method is thoroughly evaluated in a Slovak triphone-based large vocabulary continuous speech recognition (LVCSR) task. The modified 2DLDA is compared to the state-of-the-art Mel-frequency cepstral coefficients (MFCCs) and to conventional LDA. The results show that the modified 2DLDA features outperform the MFCCs and LDA, and also lead to an improvement over conventional 2DLDA.
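The class-refinement step can be sketched as follows: run K-means inside a phonetic class and treat the resulting clusters as new, finer classes for the subsequent transform estimation. A tiny 1-D K-means is coded inline; the data, K and initialisation are illustrative assumptions, not the paper's configuration.

```python
# Split one "phonetic class" into K clusters and relabel its data.

def kmeans_1d(values, k, iters=20):
    """Very small 1-D K-means; returns a cluster label per value."""
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

# one phonetic class whose data clearly contains two speaker clusters
phone_class = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
labels = kmeans_1d(phone_class, k=2)

# each discovered cluster becomes a new class for the 2DLDA statistics
new_classes = {c: [v for v, l in zip(phone_class, labels) if l == c]
               for c in set(labels)}
print(sorted(len(v) for v in new_classes.values()))  # [3, 3]
```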
network-based information systems | 2018
Martin Lojka; Peter Viszlay; Ján Staš; Daniel Hládek; Jozef Juhár
We have developed a working prototype of an automatic subtitling system for the transcription, archiving and indexing of Slovak audiovisual recordings, such as lectures, talks, discussions or broadcast news. To go further in development and research, we had to incorporate more modern speech technologies and embrace today's deep learning techniques. This paper describes the transition and the changes made to our working prototype regarding the replacement of the speech recognition core, the architecture changes and the new web-based user interface. We have used the state-of-the-art speech toolkit KALDI and a distributed architecture to achieve better responsiveness of the interface and faster processing of the audiovisual recordings. Using acoustic models based on time-delay deep neural networks, we have been able to lower the system's average word error rate from the previously reported 24% to 15% absolute.
Journal of Intelligent Information Systems | 2018
Jozef Vavrek; Peter Viszlay; Martin Lojka; Jozef Juhár; Matus Pleva
This paper examines multilingual audio Query-by-Example (QbE) retrieval, utilizing the posteriorgram-based Phonetic Unit Modelling (PUM) approach and the Weighted Fast Sequential Dynamic Time Warping (WFSDTW) algorithm. The PUM approach employs phone recognizers trained on language-specific external resources in a supervised way; thus, information about the phonetic distribution is embedded in the acoustic modelling process. The resulting acoustic models were also used for language-independent QbE retrieval. The improved WFSDTW algorithm was implemented to perform retrievals for each query (keyword) within a particular utterance file. The main interest lies in measuring the retrieval performance of the proposed WFSDTW solution, which employs posteriorgram-based keyword matching with Gaussian mixture modelling (GMM). Score normalization and the fusion of four different language-dependent sub-systems were carried out using a simple max-score merging strategy. The results show a certain predominance of the proposed WFSDTW solution over the two other evaluated techniques, namely the basic DTW and segmental DTW algorithms. The combination of multiple PUM techniques with WFSDTW has also proved to be an effective solution for the QbE task.
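The max-score merging strategy mentioned above can be sketched in a few lines: after each sub-system's scores are normalised, the fused detection score for a (query, utterance) pair is simply the maximum over the sub-systems. The scores, language names and min-max normalisation are illustrative assumptions, not the paper's exact setup.

```python
# Fuse language-dependent sub-system scores by taking the per-pair maximum.

def normalize(scores):
    """Min-max normalise one sub-system's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

subsystems = {   # raw scores on incomparable scales, per (query, utterance)
    "czech":     {("q1", "u1"): 2.0, ("q1", "u2"): 5.0, ("q1", "u3"): 3.5},
    "hungarian": {("q1", "u1"): 0.2, ("q1", "u2"): 0.1, ("q1", "u3"): 0.9},
}

normalised = {lang: normalize(s) for lang, s in subsystems.items()}
fused = {pair: max(normalised[lang][pair] for lang in normalised)
         for pair in normalised["czech"]}
print({pair: round(s, 3) for pair, s in fused.items()})
```

Normalisation is what makes the max meaningful: without it, the sub-system with the largest raw score range would dominate every fusion decision.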
international conference on systems signals and image processing | 2016
Dávid Čonka; Peter Viszlay; Jozef Juhár
This paper presents a new approach to improving the standard class definition in two-dimensional linear discriminant analysis (2DLDA). It is known that an HMM-based triphone class contains data collected from many speakers with different speech variability; thus, many clusters exist in each class, and their composition has an influence on 2DLDA. We propose to employ fuzzy C-means clustering to identify the clusters in each class and treat them as new classes. This work follows our past research focused on improving 2DLDA by K-means clustering, and we compare the previously published results with the ones evaluated in this paper. The proposed approach is evaluated in a Slovak HMM-based large vocabulary continuous speech recognition (LVCSR) task. It is shown that the proposed method outperforms the Mel-frequency cepstral coefficients (MFCCs), conventional LDA and 2DLDA, and achieves competitive results compared to the K-means algorithm evaluated in the same framework.
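Unlike K-means, fuzzy C-means gives each frame a soft membership in every cluster; a hard new-class label can then be taken as the argmax membership. The 1-D sketch below illustrates that idea with the standard FCM update equations; the data, cluster count and fuzzifier m are invented assumptions, not the paper's configuration.

```python
# Minimal 1-D fuzzy C-means: soft memberships u_ik, then hard relabelling.

def fuzzy_cmeans_1d(values, c=2, m=2.0, iters=30):
    centers = [min(values), max(values)][:c]      # simple initialisation
    U = []
    for _ in range(iters):
        # membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U = []
        for v in values:
            d = [abs(v - ck) or 1e-12 for ck in centers]   # avoid div by zero
            U.append([1.0 / sum((d[k] / d[j]) ** (2 / (m - 1))
                                for j in range(c))
                      for k in range(c)])
        # center update: membership-weighted means
        centers = [sum(u[k] ** m * v for u, v in zip(U, values)) /
                   sum(u[k] ** m for u in U) for k in range(c)]
    return centers, U

# one triphone "class" containing two obvious speaker clusters
phone_class = [1.0, 1.2, 0.8, 6.0, 6.3, 5.7]
centers, U = fuzzy_cmeans_1d(phone_class)
hard = [max(range(2), key=lambda k: u[k]) for u in U]  # argmax membership
print(hard)  # [0, 0, 0, 1, 1, 1]
```

The soft memberships (rather than the hard labels alone) are what distinguish this refinement from the K-means variant of the previous paper.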