Publication


Featured research published by György Szaszák.


Speech Communication | 2010

Using prosody to improve automatic speech recognition

Klára Vicsi; György Szaszák

In this paper acoustic processing and modelling of the supra-segmental characteristics of speech are addressed, with the aim of incorporating advanced syntactic and semantic level processing of spoken language for speech recognition/understanding tasks. The proposed modelling approach is very similar to the one used in standard speech recognition, where basic HMM units (most often acoustic phoneme models) are trained and are then connected according to the dictionary and some grammar (language model) to obtain a recognition network, along which recognition can also be interpreted as an alignment process. In this paper the HMM framework is used to model speech prosody, and to perform initial syntactic and/or semantic level processing of the input speech in parallel to standard speech recognition. As acoustic-prosodic features, fundamental frequency and energy are used. A method was implemented for syntactic level information extraction from the speech. The method was designed to work for fixed-stress languages, and it yields a segmentation of the input speech into syntactically linked word groups, or even single words corresponding to a syntactic unit (these word groups are sometimes referred to as phonological phrases in psycholinguistics, and can consist of one or more words). These so-called word-stress units are marked by prosody, and have an associated fundamental frequency and/or energy contour which allows their discovery. For this, HMMs for the different types of word-stress unit contours were trained and then used for recognition and alignment of such units from the input speech. This prosodic segmentation of the input speech also allows word-boundary recovery and can be used for N-best lattice rescoring based on prosodic information. The syntactic level input speech segmentation algorithm was evaluated for Hungarian and Finnish, both of which have fixed stress on the first syllable. (This means that if a word is stressed, stress is realized on the first syllable of the word.) The N-best rescoring based on syntactic level word-stress unit alignment was shown to increase the number of correctly recognized words. For further syntactic and semantic level processing of the input speech in ASR, clause and sentence boundary detection and modality (sentence type) recognition were implemented. Again, the classification was carried out by HMMs, which model the prosodic contour for each clause and/or sentence modality type. Clause (and hence also sentence) boundary detection was based on the excellent capacity of HMMs to dynamically align the reference prosodic structure with the utterance coming from the ASR input. This method also allows punctuation to be automatically marked. This semantic level processing of speech was investigated for Hungarian and German. The correctness of recognized modality types was 69% for Hungarian and 78% for German.
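
The idea of rescoring N-best hypotheses with prosodic information can be illustrated with a minimal sketch. This is not the authors' implementation: the additive score combination, the `weight` and `tolerance` parameters, and the function name `rescore_nbest` are all assumptions made for illustration. Each hypothesis is rewarded for every prosodically detected word-stress unit onset that coincides with one of its word start times.

```python
def rescore_nbest(hypotheses, prosodic_boundaries, weight=1.0, tolerance=0.05):
    """Pick the hypothesis whose word starts best agree with prosodic boundaries.

    hypotheses: list of (asr_score, word_start_times); a higher score is better.
    prosodic_boundaries: onset times (seconds) of detected word-stress units.
    weight, tolerance: hypothetical tuning parameters, not taken from the paper.
    """
    def matched(word_starts):
        # Count prosodic boundaries that fall close to some word start.
        return sum(
            any(abs(boundary - start) <= tolerance for start in word_starts)
            for boundary in prosodic_boundaries
        )

    combined = [score + weight * matched(starts) for score, starts in hypotheses]
    best_index = max(range(len(hypotheses)), key=lambda i: combined[i])
    return best_index, combined[best_index]


# Two competing hypotheses; the second aligns better with the prosodic cues.
nbest = [(-10.0, [0.0, 0.40, 0.90]), (-10.5, [0.0, 0.52, 0.90])]
best_index, best_score = rescore_nbest(nbest, [0.0, 0.50, 0.90])
```

In this toy input the acoustically weaker second hypothesis wins after rescoring, because all three prosodic boundaries match its word starts.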


International Journal of Speech Technology | 2005

Automatic Segmentation of Continuous Speech on Word Level Based on Supra-segmental Features

Klára Vicsi; György Szaszák

This article presents a cross-lingual study for Hungarian and Finnish about the segmentation of continuous speech on word and phrasal level by examination of supra-segmental parameters. A word level segmenter has been developed which can indicate the word boundaries with acceptable precision for both languages. The ultimate aim is to increase the robustness of speech recognition on the language modelling level by the detection of word and phrase boundaries, and thus to significantly decrease the search space during the decoding process. Search space reduction is highly important in the case of agglutinative languages. In Hungarian and in Finnish, if stress is present, it always falls on the first syllable of the stressed word. Thus if stressed syllables can be detected, they must be at the beginning of a word. We have developed different algorithms based either on a rule-based or a data-driven approach. The rule-based algorithms and HMM-based methods are compared. The best results were obtained by data-driven algorithms using the time series of fundamental frequency and energy together. Syllable length was found to be much less effective, and hence was discarded. By use of supra-segmental features, word boundaries can be marked with high accuracy, even if we are unable to find all of them. The method we evaluated is easily adaptable to other fixed-stress languages. To investigate this we adapted our data-driven method to the Finnish language and obtained similar results.
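
A rule-based detector of the kind mentioned above can be sketched as follows. The abstract does not give the actual rules, so the rule used here (a syllable counts as stressed when both its fundamental frequency and its energy rise markedly relative to the preceding syllable) and the two threshold values are assumptions for illustration only. In a language with fixed first-syllable stress, each flagged syllable is a candidate word onset.

```python
def detect_word_onsets(syllables, f0_ratio=1.1, energy_ratio=1.2):
    """Return indices of syllables flagged as stressed (candidate word onsets).

    syllables: list of (mean_f0_hz, mean_energy) tuples, one per syllable.
    f0_ratio, energy_ratio: hypothetical thresholds that would need tuning.
    The first syllable has no predecessor and is therefore never flagged.
    """
    onsets = []
    for i in range(1, len(syllables)):
        prev_f0, prev_energy = syllables[i - 1]
        f0, energy = syllables[i]
        if prev_f0 <= 0 or prev_energy <= 0:
            continue  # no reliable reference (unvoiced or silent syllable)
        if f0 >= f0_ratio * prev_f0 and energy >= energy_ratio * prev_energy:
            onsets.append(i)
    return onsets


# Toy input: syllables 2 and 5 show a joint rise in F0 and energy.
syllables = [(120, 1.0), (118, 0.9), (140, 1.3), (125, 1.0), (122, 0.95), (150, 1.4)]
onsets = detect_word_onsets(syllables)
```

Such a detector is precise only where stress is actually realized, which is consistent with the observation that not every word boundary can be found this way.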


Journal of Language Modelling | 2012

Exploiting Prosody for Syntactic Analysis in Automatic Speech Understanding

György Szaszák; András Beke

The relation between syntax and prosody is evident, even if the prosodic structure cannot be directly mapped to the syntactic one and vice versa. Syntax-to-prosody mapping is widely used in text-to-speech applications, but prosody-to-syntax mapping is mostly missing from automatic speech recognition/understanding systems. This paper presents an experiment towards filling this gap and evaluating whether an HMM-based automatic prosodic segmentation tool can be used to support the reconstruction of the syntactic structure directly from speech. Results show that up to 85% of syntactic clause boundaries and up to about 70% of embedded syntactic phrase boundaries could be identified based on the detection of phonological phrases. Recall rates do not depend further on syntactic layering, in other words, on whether the phrase is multiply embedded or not. Clause boundaries can be well assigned to intonational phrase level in read speech and can be well separated from lower level syntactic phrases based on the type of the aligned phonological phrase(s). These findings can be exploited in speech understanding systems, allowing for the recovery of the skeleton of the syntactic structure, based purely on the speech signal.


Text, Speech and Dialogue | 2012

Unsupervised Clustering of Prosodic Patterns in Spontaneous Speech

András Beke; György Szaszák

Dealing with spontaneous speech constitutes a big challenge for both linguists and speech technology engineers. For read speech, prosody was assessed in earlier experiments by automatic decomposition into phonological phrases using a supervised method (HMM). However, when trying to adapt this automatic approach to spontaneous speech, the clustering of phonological phrase types becomes problematic: it is unknown which types are characteristic and hence worth modelling. The authors decided to carry out a more flexible, unsupervised learning to cluster the data in order to evaluate and analyse whether some typical “spontaneous” patterns become selectable in spontaneous speech based on this automatic approach. This paper presents a method for clustering the typical prosody patterns of spontaneous speech based on k-means clustering.
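
As a minimal illustration of k-means clustering applied to prosodic patterns, the sketch below clusters small feature vectors. The choice of features (for example an F0 slope and a mean energy per segment), the deterministic initialisation from the first k points, and the fixed iteration count are assumptions for illustration; the abstract does not specify how the authors configured their clustering.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on equal-length numeric vectors.

    Centroids start at the first k points (a deterministic choice made only
    to keep the example reproducible). Returns (labels, centroids).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical (F0 slope, mean energy) vectors for four speech segments.
segments = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels, centroids = kmeans(segments, k=2)
```

The resulting clusters would then be inspected to see whether they correspond to interpretable prosodic pattern types.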


Text, Speech and Dialogue | 2004

Examination of Pronunciation Variation from Hand-Labelled Corpora

György Szaszák; Klára Vicsi

Pronunciation variation examinations have two aims: to extend our phonetic and linguistic knowledge, and to add variation models to automatic speech recognisers (ASR) to improve recognition accuracy. We used a data-driven approach to examine pronunciation variation in Hungarian on the corpus of the Hungarian Telephone Speech Database (MTBA), which contains semi-automatically labelled records from 500 speakers. Pronunciation matrices were constructed from the statistical analysis; word-internal and cross-word pronunciation variation were examined separately and compared. Based on the results, pronunciation variation modelling is feasible for ASRs, or automatic phonetic transcription rules can be derived. The examination was based on Hungarian speech databases, but the method is also adaptable to other languages.
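
One way to picture a pronunciation matrix is as a table of relative frequencies that relates each canonical phone to the phones actually realized in the labelled data. The sketch below assumes that canonical and realized phone sequences have already been aligned into pairs; the input format, the `"-"` symbol for deletions, and the function name are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def pronunciation_matrix(aligned_pairs):
    """Build P(realized | canonical) from aligned (canonical, realized) phones.

    aligned_pairs: iterable of (canonical_phone, realized_phone) tuples.
    Returns a nested dict: matrix[canonical][realized] = relative frequency.
    """
    counts = defaultdict(Counter)
    for canonical, realized in aligned_pairs:
        counts[canonical][realized] += 1
    return {
        canonical: {
            realized: n / sum(row.values()) for realized, n in row.items()
        }
        for canonical, row in counts.items()
    }


# Toy alignment: /t/ is voiced once and /n/ assimilates to /m/.
pairs = [("t", "t"), ("t", "d"), ("t", "t"), ("n", "m")]
matrix = pronunciation_matrix(pairs)
```

Building one such matrix from word-internal positions and another from word-boundary positions would allow the two kinds of variation to be compared.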


Conference of the International Speech Communication Association | 2016

Estimating the Sincerity of Apologies in Speech by DNN Rank Learning and Prosodic Analysis

Gábor Gosztolya; Tamás Grósz; György Szaszák; László Tóth

In the Sincerity Sub-Challenge of the Interspeech ComParE 2016 Challenge, the task is to estimate user-annotated sincerity scores for speech samples. We interpret this challenge as a rank-learning regression task, since the evaluation metric (Spearman’s correlation) is calculated from the rank of the instances. As a first approach, Deep Neural Networks are used by introducing a novel error criterion which maximizes the correlation metric directly. We obtained the best performance by combining the proposed error function with the conventional MSE error. This approach yielded results that outperform the baseline on the Challenge test set. Furthermore, we introduce a compact prosodic feature set based on a dynamic representation of F0, energy and sound duration. We extract syllable-based prosodic features which are used as the basis of another machine learning step. We show that a small set of prosodic features is capable of yielding a result very close to the baseline one and that by combining the predictions yielded by DNN and the prosodic feature set, further improvement can be reached, significantly outperforming the baseline SVR on the Challenge test set.
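
Because the evaluation metric depends only on ranks, it may help to see how Spearman's correlation is computed. The sketch below covers the metric alone, as the Pearson correlation of the two rank vectors with tied values sharing their average rank. It does not reproduce the authors' error criterion, and the helper names are illustrative.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(predictions, targets):
    """Spearman's correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(predictions), average_ranks(targets))
```

Any monotonic rescaling of the predictions leaves this value unchanged, which is why a plain MSE objective and the evaluation metric can disagree about which model is better.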


International Conference on Speech and Computer | 2015

Automatic Close Captioning for Live Hungarian Television Broadcast Speech: A Fast and Resource-Efficient Approach

Ádám Varga; Balázs Tarján; Zoltán Tobler; György Szaszák; Tibor Fegyó; Csaba Bordás; Péter Mihajlik

In this paper, the application of LVCSR (Large Vocabulary Continuous Speech Recognition) technology is investigated for real-time, resource-limited broadcast close captioning. The work focuses on transcribing live broadcast conversation speech to make such programs accessible to deaf viewers. Due to computational limitations, real time factor (RTF) and memory requirements are kept low during decoding with various models tailored for Hungarian broadcast speech recognition. Two decoders are compared on the direct transcription task of broadcast conversation recordings, and setups employing re-speakers are also tested. Moreover, the models are evaluated on a broadcast news transcription task as well, and different language models (LMs) are tested in order to demonstrate the performance of our systems in settings where low memory consumption is a less crucial factor.


International Conference on Speech and Computer | 2016

Combining Atom Decomposition of the F0 Track and HMM-based Phonological Phrase Modelling for Robust Stress Detection in Speech

György Szaszák; Máté Ákos Tündik; Branislav Gerazov; Aleksandar Gjoreski

The Weighted Correlation based Atom Decomposition (WCAD) algorithm is a technique for intonation modelling that uses a matching pursuit framework to decompose the F0 contour into a set of basic components, called atoms. The atoms attempt to model the physiological activation of the laryngeal muscles responsible for changes in F0. Recently, WCAD has been upgraded to use the orthogonal matching pursuit (OMP) algorithm, which gives qualitative improvements in the modelling of intonation. A possible exploitation of the OMP based WCAD is the automatic detection of stress in speech, which we undertake for the Hungarian language. Correlation is demonstrated between stress and atomic peaks, as well as between stress and atomic valleys on the previous syllable. The stress detection technique based on WCAD is compared to a baseline system using HMM/GMM stress/phrase models. A 7% improvement in F-measure over the baseline is observed when evaluating on a hand-made reference. Finally, we propose a hybrid approach which outperforms both individual systems (by 11% compared to the baseline).
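
The matching pursuit framework mentioned above can be sketched in a few lines. This is plain matching pursuit over a generic dictionary of unit-norm atoms, not WCAD itself: the weighted correlation measure, the physiologically motivated atom shapes and the orthogonal (OMP) variant are not reproduced here, and the function names are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(signal, atoms, n_atoms):
    """Greedily decompose `signal` into a weighted sum of dictionary atoms.

    atoms: list of unit-norm vectors with the same length as `signal`.
    Returns (picks, residual), where picks is a list of (atom_index, amplitude).
    """
    residual = list(signal)
    picks = []
    for _ in range(n_atoms):
        # Correlate the current residual with every atom.
        correlations = [dot(residual, atom) for atom in atoms]
        best = max(range(len(atoms)), key=lambda i: abs(correlations[i]))
        amplitude = correlations[best]
        # Subtract the selected atom's contribution from the residual.
        residual = [r - amplitude * a for r, a in zip(residual, atoms[best])]
        picks.append((best, amplitude))
    return picks, residual


# Toy dictionary of three orthonormal atoms and a contour built from two.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
picks, residual = matching_pursuit([3.0, 0.0, -2.0], atoms, n_atoms=2)
```

In an intonation model the sign and position of each selected atom are what carry meaning, which is how peaks and valleys of the decomposition can be related to stressed syllables.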


International Conference on Speech and Computer | 2016

Automatic Summarization of Highly Spontaneous Speech

András Beke; György Szaszák

This paper addresses speech summarization of highly spontaneous speech. Speech is converted into text using an ASR system and then segmented into tokens. Human-made and automatic, prosody-based tokenizations are compared. The obtained sentence-like units are analysed by a syntactic parser to help automatic sentence selection for the summary. The preprocessed sentences are ranked based on thematic terms and sentence position. The thematic term is expressed in two ways: TF-IDF and Latent Semantic Indexing. The sentence score is calculated as a linear combination of the thematic term score and a sentence position score. To generate the summary, the top 10 candidates for the most informative/best summarizing sentences are selected. The system performance showed comparable results (recall: 0.62, precision: 0.79 and F-measure: 0.68) with the prosody-based tokenization approach. A subjective test is also carried out on a Likert scale.
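
The sentence ranking step described above, a linear combination of a thematic term score and a position score followed by selection of the top candidates, can be sketched as follows. Only the TF-IDF variant is shown. The weighting value, the position scoring formula, and the decision to treat each sentence as its own document for the IDF term are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def tfidf_scores(sentences):
    """Mean TF-IDF of the distinct terms in each tokenized sentence.

    Each sentence is treated as one document for the IDF term (an assumption).
    """
    n = len(sentences)
    doc_freq = Counter(term for s in sentences for term in set(s))
    scores = []
    for s in sentences:
        tf = Counter(s)
        total = sum(tf[t] / len(s) * math.log(n / doc_freq[t]) for t in tf)
        scores.append(total / len(tf) if tf else 0.0)
    return scores


def position_scores(n):
    """Earlier sentences score higher, falling linearly from 1 to 1/n."""
    return [(n - i) / n for i in range(n)]


def rank_sentences(sentences, weight=0.7, top_k=10):
    """Return indices of the top_k sentences by the combined score.

    weight: hypothetical share given to the thematic score (0..1).
    """
    thematic = tfidf_scores(sentences)
    position = position_scores(len(sentences))
    combined = [weight * t + (1 - weight) * p for t, p in zip(thematic, position)]
    order = sorted(range(len(sentences)), key=lambda i: combined[i], reverse=True)
    return order[:top_k]


# Toy tokenized input; real input would be the parsed sentence-like units.
units = [["a", "b"], ["a", "c"], ["a", "d", "d"]]
summary_indices = rank_sentences(units, top_k=2)
```

Setting `weight` to 0 reduces the ranking to sentence order, while setting it to 1 ranks purely by thematic content.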


International Conference on Speech and Computer | 2016

Design of a Speech Corpus for Research on Cross-Lingual Prosody Transfer

Milan Sečujski; Branislav Gerazov; Tamás Gábor Csapó; Vlado Delić; Philip N. Garner; Aleksandar Gjoreski; David Guennec; Zoran A. Ivanovski; Aleksandar Melov; Géza Németh; Ana Stojkovic; György Szaszák

Since the prosody of a spoken utterance carries information about its discourse function, salience, and speaker attitude, prosody models and prosody generation modules have played a crucial part in text-to-speech (TTS) synthesis systems from the beginning, especially those set not only on sounding natural, but also on showing emotion or particular speaker intention. Prosody transfer within speech-to-speech translation is a recent research area with increasing importance, with one of its most important research topics being the detection and treatment of salient events, i.e. instances of prominence or focus which do not result from syntactic constraints, but are rather products of semantic or pragmatic level effects. This paper presents the design and the guidelines for the creation of a multilingual speech corpus containing prosodically rich sentences, ultimately aimed at training statistical prosody models for multilingual prosody transfer in the context of expressive speech synthesis.

Collaboration


Dive into György Szaszák's collaboration.

Top Co-Authors

András Beke, Hungarian Academy of Sciences

Klára Vicsi, Budapest University of Technology and Economics

Máté Ákos Tündik, Budapest University of Technology and Economics

Bálint Tóth, Budapest University of Technology and Economics

Balázs Tarján, Budapest University of Technology and Economics

Géza Németh, Budapest University of Technology and Economics

Anna Moró, Budapest University of Technology and Economics

Dávid Sztahó, Budapest University of Technology and Economics