
Publication


Featured research published by Klára Vicsi.


Speech Communication | 2010

Using prosody to improve automatic speech recognition

Klára Vicsi; György Szaszák

In this paper, acoustic processing and modelling of the supra-segmental characteristics of speech are addressed, with the aim of incorporating syntactic and semantic level processing of spoken language into speech recognition/understanding tasks. The proposed modelling approach is very similar to the one used in standard speech recognition, where basic HMM units (most often acoustic phoneme models) are trained and then connected according to the dictionary and some grammar (language model) to obtain a recognition network, along which recognition can also be interpreted as an alignment process. Here, the HMM framework is used to model speech prosody and to perform initial syntactic and/or semantic level processing of the input speech in parallel with standard speech recognition. Fundamental frequency and energy are used as acoustic-prosodic features.

A method was implemented for syntactic level information extraction from speech. The method was designed for fixed-stress languages, and it yields a segmentation of the input speech into syntactically linked word groups, or even single words corresponding to a syntactic unit (such word groups are sometimes referred to as phonological phrases in psycholinguistics and can consist of one or more words). These so-called word-stress units are marked by prosody and have an associated fundamental frequency and/or energy contour that allows their discovery. For this, HMMs for the different types of word-stress unit contours were trained and then used for recognition and alignment of such units in the input speech. This prosodic segmentation of the input speech also allows word-boundary recovery and can be used for N-best lattice rescoring based on prosodic information. The syntactic level segmentation algorithm was evaluated for Hungarian and Finnish, languages that have fixed stress on the first syllable (that is, if a word is stressed, stress is realized on the first syllable of the word). The N-best rescoring based on syntactic level word-stress unit alignment was shown to increase the number of correctly recognized words.

For further syntactic and semantic level processing of the input speech in ASR, clause and sentence boundary detection and modality (sentence type) recognition were implemented. Again, the classification was carried out by HMMs, which model the prosodic contour for each clause and/or sentence modality type. Clause (and hence also sentence) boundary detection was based on the HMMs' excellent capacity for dynamically aligning the reference prosodic structure to the utterance arriving at the ASR input. This method also allows punctuation to be marked automatically. This semantic level processing of speech was investigated for the Hungarian and German languages. The correctness of the recognized modality types was 69% for Hungarian and 78% for German.
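The N-best rescoring idea above can be sketched in a few lines. This is an illustrative example, not the authors' implementation: the hypotheses, scores and the interpolation weight `w` are assumptions, showing only how an ASR log-score and a prosodic alignment log-score might be combined to re-rank hypotheses.

```python
# Sketch: re-rank an N-best list by combining the ASR log-score with a
# prosodic log-score from word-stress-unit alignment, weighted by `w`.
def rescore_nbest(hypotheses, w=0.3):
    """hypotheses: list of (text, asr_score, prosody_score) tuples,
    scores as log-likelihoods. Returns hypotheses re-ranked by
    asr_score + w * prosody_score (best first)."""
    return sorted(hypotheses,
                  key=lambda h: h[1] + w * h[2],
                  reverse=True)

# Toy hypotheses: the acoustically second-best string has the correct
# word boundaries and therefore the best prosodic fit.
nbest = [
    ("a kutyau gat",  -118.0, -40.0),  # wrong word boundary, poor fit
    ("a kutya ugat",  -120.0, -15.0),  # good prosodic fit
    ("a kutya ug at", -125.0, -35.0),
]
best = rescore_nbest(nbest)[0][0]
```

With these toy numbers the prosodic term promotes the correctly segmented hypothesis over the acoustically higher-scoring one.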


International Journal of Speech Technology | 2000

A Multimedia, Multilingual Teaching and Training System for Children with Speech Disorders

Klára Vicsi; Peter Roach; Anne-Marie Öster; Zdravko Kacic; Peter Barczikay; Andras Tantos; Ferenc Csatári; Zsolt Bakcsi; Anna Sfakianaki

The development of an audiovisual pronunciation teaching and training method and software system is discussed in this article. The method is designed to help children with speech and hearing disorders gain better control over their speech production. The teaching method progresses from the preparation of individual sounds to the practice of sounds in sentences, for four languages: English, Swedish, Slovenian, and Hungarian. The system is a general, language-independent measuring tool and database editor. This database editor makes it possible to construct modules for all participant languages and for different sound groups. Two modules are under development for the system in all languages: one for teaching and training vowels to hearing-impaired children and the other for correcting misarticulated fricative sounds. In the article we present the measuring methods, the distance-score calculations used on the visualized speech spectra, and problems in the evaluation of the new multimedia tool.
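A distance score over visualized spectra, of the kind mentioned above, can be sketched as follows. The function names, the bin values and the 0-100 mapping are hypothetical; the paper's actual calculation may differ.

```python
import math

# Sketch of a spectral distance score: Euclidean distance between a
# learner's and a reference log-magnitude spectrum (same frequency
# bins), mapped to a 0-100 "closeness" score for visual feedback.
def spectral_distance(ref, test):
    """Euclidean distance between two equal-length log-spectra."""
    return math.sqrt(sum((r - t) ** 2 for r, t in zip(ref, test)))

def closeness_score(ref, test, max_dist=10.0):
    """Map the distance to a 0-100 score; 100 means identical spectra."""
    d = min(spectral_distance(ref, test), max_dist)
    return round(100.0 * (1.0 - d / max_dist))

ref  = [0.0, -3.0, -6.0, -12.0]   # reference vowel spectrum (dB, toy values)
good = [0.5, -2.5, -6.5, -11.0]   # a close learner attempt
```

A single bounded score per attempt is easy to display to a child as feedback; the choice of `max_dist` controls how forgiving the scale is.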


international conference on spoken language processing | 1996

BABEL: an Eastern European multi-language database

Peter Roach; Simon Arnfield; William J. Barry; J. Baltova; Marian Boldea; Adrian Fourcin; W. Gonet; Ryszard Gubrynowicz; E. Hallum; Lori Lamel; Krzysztof Marasek; Alain Marchal; Einar Meister; Klára Vicsi

BABEL is a joint European project under the COPERNICUS scheme (Project 1304), comprising partners from five Eastern European countries and three Western ones. The project is producing a multi-language database of five of the most widely-differing Eastern European languages (Bulgarian, Estonian, Hungarian, Romanian and Polish). The collection and formatting of the data conforms to the protocols established by the ESPRIT SAM project and the resulting EUROM databases.


Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction | 2008

Speech Emotion Perception by Human and Machine

Szabolcs Levente Tóth; Dávid Sztahó; Klára Vicsi

Human speech contains and reflects information about the emotional state of the speaker. The importance of emotion research is increasing in telematics, information technologies and even in health services. Determining the mean acoustic parameters of emotions is a very complicated task: emotions are mainly characterized by supra-segmental parameters, but other, segmental factors can contribute to the perception of emotions as well, and these parameters vary within one language, across speakers, etc. In the first part of our research work, human emotion perception was examined. The steps of creating an emotional speech database are presented. The database contains recordings of 3 Hungarian sentences with 8 basic emotions pronounced by non-professional speakers. A perception test on the database recorded with non-professional speakers showed recognition results similar to those of an earlier perception test with professional actors/actresses. It also became clear that hearing a neutral sentence by the same speaker before listening to the emotional expression does not aid the perception of the emotion to any great extent. In the second part of our research work, an automatic emotion recognition system was developed. Statistical methods (HMMs) were used to train different emotion models. The recognition was optimized by changing the acoustic pre-processing parameters and the number of states of the Markov models.


International Journal of Speech Technology | 2005

Automatic Segmentation of Continuous Speech on Word Level Based on Supra-segmental Features

Klára Vicsi; György Szaszák

This article presents a cross-lingual study for Hungarian and Finnish on the segmentation of continuous speech at the word and phrase level by examination of supra-segmental parameters. A word-level segmenter has been developed which can indicate word boundaries with acceptable precision for both languages. The ultimate aim is to increase the robustness of speech recognition at the language modelling level by detecting word and phrase boundaries, which significantly reduces the search space during the decoding process. Search space reduction is highly important in the case of agglutinative languages. In Hungarian and in Finnish, if stress is present, it always falls on the first syllable of the word. Thus, if stressed syllables can be detected, they must be at the beginning of a word. We have developed different algorithms based on either a rule-based or a data-driven approach, and the rule-based algorithms and HMM-based methods are compared. The best results were obtained by data-driven algorithms using the time series of fundamental frequency and energy together. Syllable length was found to be much less effective and was hence discarded. By use of supra-segmental features, word boundaries can be marked with high accuracy, even if we are unable to find all of them. The method we evaluated is easily adaptable to other fixed-stress languages; to investigate this, we adapted our data-driven method to Finnish and obtained similar results.
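The fixed-stress reasoning above lends itself to a toy rule-based sketch. This is an assumption for illustration, not the paper's exact algorithm: a syllable whose F0 and energy both exceed the utterance means is taken as stressed, and a word boundary is placed immediately before it.

```python
# Toy rule-based word-boundary detector for a fixed-stress language:
# stressed syllables (high F0 AND high energy) start a new word.
def word_boundaries(f0, energy):
    """f0, energy: per-syllable values. Returns indices of syllables
    that start a new word (index 0 always starts a word)."""
    mf = sum(f0) / len(f0)
    me = sum(energy) / len(energy)
    bounds = [0]
    for i in range(1, len(f0)):
        if f0[i] > mf and energy[i] > me:
            bounds.append(i)
    return bounds

# Six syllables; stress (high F0 + energy) falls on syllables 0, 2 and 4.
f0     = [180, 140, 175, 130, 185, 135]   # Hz, toy values
energy = [70,  55,  68,  50,  72,  52]    # dB, toy values
```

A real detector would, as the article describes, learn the contours from data (HMMs or other data-driven models) rather than threshold on global means, but the boundary-placement logic is the same.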


Lecture Notes in Computer Science | 2011

Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues

Anna Esposito; Alessandro Vinciarelli; Klára Vicsi; Catherine Pelachaud; Antinus Nijholt

This volume brings together the advanced research results obtained by the European COST Action 2102 "Cross Modal Analysis of Verbal and Nonverbal Communication," primarily discussed at the PINK SSPnet-COST 2102 International Conference on "Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues" held in Budapest, Hungary, September 7–10, 2010 (http://berber.tmit.bme.hu/cost2102/). The conference was jointly sponsored by COST (European Cooperation in Science and Technology, www.cost.eu) in the domain of Information and Communication Technologies (ICT), for disseminating the advances of the research activities developed within COST Action 2102: "Cross-Modal Analysis of Verbal and Nonverbal Communication" (cost2102.cs.stir.ac.uk), and by the European Network of Excellence on Social Signal Processing, SSPnet (http://sspnet.eu/).

The main focus of the conference was on methods to combine and build up knowledge through verbal and nonverbal signals enacted in an environment and in a context. In previous meetings, COST 2102 focused on the importance of uncovering and exploiting the wealth of information conveyed by multimodal signals. The next steps have been to analyze actions performed in response to multimodal signals and to study how these actions are organized in a realistic and socially believable context. The focus was on processing issues, since the new approach is computationally complex and the amount of data to be treated may be considered algorithmically infeasible. Therefore, data processing for gaining enactive knowledge must account for natural and intuitive approaches, based more on heuristics and experience than on symbols, as well as for the discovery of new processing possibilities that allow new approaches to data analysis, coordination of the data flow through synchronization, temporal organization and optimization of the extracted features.
The themes of the volume cover verbal and nonverbal information in body-to-body communication; cross-modal analysis of speech, gestures, gaze and facial expressions; socio-cultural differences and personal traits; multimodal algorithms and procedures for the automatic recognition of emotions, faces, facial expressions, and gestures; audio and video features for implementing intelligent avatars; and virtual communicative agents and interactive dialogue systems.


SMART INNOVATION, SYSTEMS AND TECHNOLOGIES | 2016

Language independent detection possibilities of depression by speech

Gábor Kiss; Miklos Gabriel Tulics; Dávid Sztahó; Anna Esposito; Klára Vicsi

In this study, acoustic-phonetic analysis of continuous speech and statistical analyses were performed in order to find parameters of depressed speech that show significant differences compared to a healthy reference group. Read speech material was gathered in Hungarian and Italian from both healthy people and patients diagnosed with different degrees of depression. Statistical examination showed that many parameters of depressed speech differ significantly from those of the healthy reference group, and that most of these parameters behave similarly in other languages, such as Italian. For classifying healthy and depressed speech, these parameters were used as input to the classifiers. Two classification methods were compared: a Support Vector Machine (SVM) and a two-layer feed-forward neural network (NN). No difference was found between the results of the two methods when trained and tested on Hungarian (both SVM and NN classification accuracy was 75%). When trained on Hungarian and tested on Italian healthy and depressed speech, both classifiers reached 77% accuracy.
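The shape of the classification task can be illustrated with a minimal stand-in classifier. This sketch uses nearest-centroid classification, not the SVM/NN of the study, and the feature names and values are invented for illustration.

```python
# Minimal stand-in for the healthy-vs-depressed classification task:
# nearest-centroid over hypothetical acoustic-phonetic feature vectors.
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(sample, healthy, depressed):
    """Label `sample` by its nearer class centroid (squared distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    ch, cd = centroid(healthy), centroid(depressed)
    return "healthy" if dist2(sample, ch) < dist2(sample, cd) else "depressed"

# Toy features: [articulation rate, pause ratio, F0 variability]
healthy   = [[4.8, 0.18, 22.0], [5.1, 0.15, 25.0]]
depressed = [[3.9, 0.35, 12.0], [3.6, 0.40, 10.0]]
```

In practice the features would be normalized and the decision boundary learned (as with the SVM and NN in the study), but the input/output contract is the same: a per-speaker feature vector in, a class label out.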


International Conference on Statistical Language and Speech Processing | 2014

Physiological and Cognitive Status Monitoring on the Base of Acoustic-Phonetic Speech Parameters

Gábor Kiss; Klára Vicsi

In this paper, the development of an online monitoring system is presented for tracking the physiological and cognitive condition of crew members of the Concordia Research Station in Antarctica, with specific regard to depression. Follow-up studies were carried out on recorded speech material: segmental and supra-segmental speech parameters were measured for individual researchers weekly, and the changes of these parameters were tracked over time. Two kinds of speech were recorded weekly by crew members in their mother tongue: a diary and a tale ("The North Wind and the Sun"). An automatic, language-independent program was used to segment the recordings at the phoneme level for the measurements; in this way the Concordia Speech Databases were constructed. The acoustic-phonetic parameters used in the follow-up study at Concordia were those statistically selected in earlier research based on the analysis of Seasonal Affective Disorder databases gathered separately in Europe.
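Longitudinal tracking of this kind can be sketched as a per-speaker baseline comparison. The thresholds, window sizes and parameter values below are illustrative assumptions, not the system's actual configuration.

```python
# Sketch of weekly follow-up tracking: compare each week's value of a
# speech parameter to the speaker's own baseline (mean/std of the first
# weeks) and flag large deviations.
def flag_deviations(weekly_values, baseline_weeks=4, z_threshold=2.0):
    """Return week indices whose value deviates from the personal
    baseline by more than z_threshold standard deviations."""
    base = weekly_values[:baseline_weeks]
    mean = sum(base) / len(base)
    var = sum((x - mean) ** 2 for x in base) / len(base)
    std = var ** 0.5 or 1e-9          # guard against a zero-variance baseline
    return [i for i, v in enumerate(weekly_values[baseline_weeks:],
                                    start=baseline_weeks)
            if abs(v - mean) > z_threshold * std]

# e.g. weekly articulation rate; a marked slowdown appears at week 6
rates = [4.9, 5.0, 5.1, 5.0, 4.95, 5.05, 4.2, 4.1]
```

Using each speaker's own recordings as the baseline sidesteps inter-speaker variability, which is why the follow-up design measures the same crew members week after week.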


COST'11 Proceedings of the 2011 international conference on Cognitive Behavioural Systems | 2011

A cross-cultural study on the perception of emotions: how Hungarian subjects evaluate American and Italian emotional expressions

Maria Teresa Riviello; Anna Esposito; Klára Vicsi

In the present work, a cross-modal evaluation of the visual and auditory channels in conveying emotional information is conducted through perceptual experiments aimed at investigating whether some of the basic emotions are perceptually privileged and whether the perceptual mode, the cultural environment and the language play a role in this preference. To this end, Hungarian subjects were asked to assess emotional stimuli extracted from Italian and American movies in single (either mute video or audio alone) and combined audio-video modes. Results showed that among the proposed emotions, anger plays a special role, and that fear, happiness and sadness are better perceived than surprise and irony in both cultural environments. The perception of emotions is affected by the communication mode, and the language influences the perceptual assessment of emotional information.


Proceedings of the Third COST 2102 international training school conference on Toward autonomous, adaptive, and context-aware multimodal interfaces: theoretical and practical issues | 2010

Problems of the automatic emotion recognitions in spontaneous speech: an example for the recognition in a dispatcher center

Klára Vicsi; Dávid Sztahó

Numerous difficulties in the examination of emotions occurring in continuous spontaneous speech are discussed in this paper; then different emotion recognition experiments are presented, using clauses as the recognition unit. In a testing experiment, it was examined what kinds of acoustic features are the most important for the characterization of emotions, using a spontaneous speech database. An SVM classifier was built for the classification of the four most frequent emotions. It was found that fundamental frequency, energy and their dynamics within a clause are the main characteristic parameters for the emotions, and that average spectral information, such as MFCCs and harmonicity, is also very important. In a real-life experiment, an automatic recognition system was prepared for a telecommunication call center. Summing up the results of these experiments, we can say that clauses can be an optimal unit for the recognition of emotions in continuous speech.
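The clause-level features named above can be sketched as a small extraction step. The feature set and the "dynamics" measure here are simplified assumptions; the paper's full feature set also includes spectral features such as MFCCs.

```python
# Sketch: simple clause-level prosodic features of the kind found
# important above: F0 mean and range, mean energy, and a crude
# dynamics measure (overall F0 slope across the clause).
def clause_features(f0, energy):
    """f0, energy: per-frame values within one clause (unvoiced frames
    already removed from f0). Returns a small feature dictionary."""
    return {
        "f0_mean":  sum(f0) / len(f0),
        "f0_range": max(f0) - min(f0),
        "en_mean":  sum(energy) / len(energy),
        "f0_slope": (f0[-1] - f0[0]) / len(f0),  # rise/fall per frame
    }

feats = clause_features([210, 230, 250, 240], [60, 66, 70, 64])
```

One such vector per clause is what a classifier like the SVM described above would consume; using the clause as the unit keeps each vector tied to a single prosodic gesture.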

Collaboration


Dive into Klára Vicsi's collaborations.

Top Co-Authors

Dávid Sztahó (Budapest University of Technology and Economics)
György Szaszák (Budapest University of Technology and Economics)
Gábor Kiss (Budapest University of Technology and Economics)
Anna Esposito (Seconda Università degli Studi di Napoli)
Catherine Pelachaud (Centre national de la recherche scientifique)
Miklos Gabriel Tulics (Budapest University of Technology and Economics)
Ferenc Csatári (Budapest University of Technology and Economics)