Publications


Featured research published by Michael Pucher.


International Conference on Acoustics, Speech, and Signal Processing | 2011

Detection of synthetic speech for the problem of imposture

Phillip L. De Leon; Inma Hernaez; Ibon Saratxaga; Michael Pucher; Junichi Yamagishi

In this paper, we present new results from our research into the vulnerability of a speaker verification (SV) system to synthetic speech. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, together with both GMM-UBM and support vector machine (SVM) SV systems. Using 283 speakers from the Wall Street Journal (WSJ) corpus, our SV systems have a 0.35% EER. When the systems are tested with synthetic speech generated from speaker models derived from the WSJ corpus, over 91% of the matched claims are accepted. We propose the use of relative phase shift (RPS) to detect synthetic speech and develop a GMM-based synthetic speech classifier (SSC). Using the SSC, we are able to correctly classify human speech in 95% of tests and synthetic speech in 88% of tests, thus significantly reducing the vulnerability.
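
As an illustration of the SSC step described above, here is a minimal sketch of a two-GMM log-likelihood-ratio classifier, assuming RPS feature vectors have already been extracted (the feature extraction itself is not shown); the function names, GMM size, and threshold are hypothetical, not the paper's configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ssc(rps_human, rps_synth, n_components=32):
    """Fit one GMM per class on (frames x dims) RPS feature matrices."""
    gmm_human = GaussianMixture(n_components, covariance_type="diag").fit(rps_human)
    gmm_synth = GaussianMixture(n_components, covariance_type="diag").fit(rps_synth)
    return gmm_human, gmm_synth

def classify_utterance(gmm_human, gmm_synth, rps_utt, threshold=0.0):
    """Average per-frame log-likelihood ratio; positive means 'human'."""
    llr = gmm_human.score_samples(rps_utt) - gmm_synth.score_samples(rps_utt)
    return "human" if llr.mean() > threshold else "synthetic"
```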


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Evaluation of Speaker Verification Security and Detection of HMM-Based Synthetic Speech

Phillip L. De Leon; Michael Pucher; Junichi Yamagishi; Inma Hernaez; Ibon Saratxaga

In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model–universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.
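
For context on the EER figures quoted here and in the previous abstract, a short sketch of how an equal error rate is commonly computed from genuine and impostor trial scores; the interface is illustrative and not taken from the paper.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Find the threshold where false-acceptance and false-rejection
    rates cross, and return the (approximate) error rate there."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false acceptances
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])    # false rejections
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```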


Speech Communication | 2010

Modeling and interpolation of Austrian German and Viennese dialect in HMM-based speech synthesis

Michael Pucher; Dietmar Schabus; Junichi Yamagishi; Friedrich Neubarth; Volker Strom

An HMM-based speech synthesis framework is applied to both standard Austrian German and a Viennese dialectal variety, and several training strategies for multi-dialect modeling, such as dialect clustering and dialect-adaptive training, are investigated. To bridge the gap between processing on the level of HMMs and on the linguistic level, we add phonological transformations to the HMM interpolation and apply them to dialect interpolation. The crucial step is to employ several formalized phonological rules between Austrian German and Viennese dialect as constraints for the HMM interpolation. We verify the effectiveness of this strategy in a number of perceptual evaluations. Since the HMM space used is acoustic rather than articulatory, evaluation results vary somewhat between the phonological rules. In general, however, we obtained good evaluation results, which show that listeners can perceive both continuous and categorical changes of dialect varieties when phonological transformations are employed as switching rules in the HMM interpolation.
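
To make the interpolation-with-switching idea concrete, here is a hypothetical sketch in which the Gaussian means of two dialect models are mixed with a weight alpha, while a phonological rule table marks phones that change categorically rather than continuously; the phone symbols and rule set are invented placeholders, not the rules from the paper.

```python
import numpy as np

# Hypothetical rule set: phones whose realization flips categorically
# once the interpolation weight passes 0.5 (placeholder symbols only).
SWITCHING_PHONES = {"a:", "ae"}

def interpolated_mean(phone, mean_at, mean_vd, alpha):
    """Blend Austrian German (at) and Viennese dialect (vd) Gaussian
    means; switching phones jump instead of blending."""
    if phone in SWITCHING_PHONES:
        # categorical change: the phonological rule acts as a switch
        return mean_vd if alpha >= 0.5 else mean_at
    # continuous change: plain linear interpolation in acoustic space
    return (1.0 - alpha) * np.asarray(mean_at) + alpha * np.asarray(mean_vd)
```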


International Conference on Acoustics, Speech, and Signal Processing | 2010

Revisiting the security of speaker verification systems against imposture using synthetic speech

Phillip L. De Leon; Vijendra Raj Apsingekar; Michael Pucher; Junichi Yamagishi

In this paper, we investigate imposture using synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both speaker verification (SV) and speech synthesis have renewed interest in it. We use an HMM-based speech synthesizer which creates synthetic speech for a targeted speaker through adaptation of a background model. We use two SV systems: a standard GMM-UBM-based system and a newer SVM-based system. Our results show that when the systems are tested with human speech, there are zero false acceptances and zero false rejections. However, when the systems are tested with synthesized speech, all claims for the targeted speaker are accepted while all other claims are rejected. We propose a two-step process for detecting synthesized speech in order to prevent this imposture. Overall, while SV systems have impressive accuracy, even with the proposed detector, high-quality synthetic speech will lead to an unacceptably high false acceptance rate.
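
A bare-bones sketch of the two-step decision flow the abstract proposes, with both classifiers as stand-in callables; the names and threshold handling are assumptions for illustration.

```python
def accept_claim(utterance, is_synthetic, sv_score, sv_threshold):
    """Step 1: reject anything the detector flags as synthetic.
    Step 2: apply the normal SV score threshold to what remains."""
    if is_synthetic(utterance):        # synthetic-speech detector gate
        return False
    return sv_score(utterance) >= sv_threshold  # standard verification
```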


IEEE Journal of Selected Topics in Signal Processing | 2014

Joint Audiovisual Hidden Semi-Markov Model-Based Speech Synthesis

Dietmar Schabus; Michael Pucher; Gregor Hofer

This paper investigates joint speaker-dependent audiovisual hidden semi-Markov models (HSMMs) in which the visual models produce a sequence of 3D motion-tracking data used to animate a talking head and the acoustic models are used for speech synthesis. Different acoustic, visual, and joint audiovisual models for four Austrian German speakers were trained, and we show that the joint models outperform the other approaches in terms of synchronization quality of the synthesized visual speech. In addition, a detailed analysis of the acoustic and visual alignment is provided for the different models. Importantly, joint audiovisual modeling does not decrease the acoustic synthetic speech quality compared to acoustic-only modeling, so there is a clear advantage in the common duration model of the joint audiovisual approach, which is used to synchronize the acoustic and visual parameter sequences. Finally, it provides a model that integrates the visual and acoustic speech dynamics.
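
The common duration model is the key synchronization device here; the following toy sketch shows one state-duration draw driving both parameter streams. All distributions are Gaussian placeholders, not the trained HSMMs from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize(states, dur_means, ac_means, vis_means):
    """For each HSMM state, draw ONE duration and emit that many frames
    in both streams, so audio and video stay aligned by construction."""
    audio, video = [], []
    for s in states:
        d = max(1, int(round(rng.normal(dur_means[s], 1.0))))  # shared duration
        audio.append(np.tile(ac_means[s], (d, 1)))   # acoustic frames
        video.append(np.tile(vis_means[s], (d, 1)))  # 3D motion frames
    return np.vstack(audio), np.vstack(video)
```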


Human Language Technology | 2001

Component-based multimodal dialog interfaces for mobile knowledge creation

Georg Niklfeld; Robert Finan; Michael Pucher

This paper addresses two related topics. First, it presents building blocks for flexible multimodal dialog interfaces based on standardized components (VoiceXML, XML), indicating that, thanks to well-supported standards, mobile multimodal interfaces to heterogeneous data sources are becoming ready for mass-market deployment, provided that adequate modularization is respected. Second, this is put in the perspective of a discussion of knowledge management in firms: the paper argues that multimodal dialog systems, and the natural mobile access to company data they offer, will trigger a new knowledge management practice of importance for knowledge-intensive companies.
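
A hypothetical sketch of the modularization argument: each modality is wrapped in a component exposing one common interface, so the dialog core never depends on a specific input channel. All class and method names are invented for illustration.

```python
from abc import ABC, abstractmethod

class ModalityComponent(ABC):
    """Common interface every input modality must implement."""
    @abstractmethod
    def interpret(self, raw_input: str) -> dict:
        """Map modality-specific input to a shared semantic frame."""

class VoiceComponent(ModalityComponent):
    def interpret(self, raw_input):
        # in a real deployment this frame would come from a VoiceXML interpreter
        return {"modality": "voice", "utterance": raw_input}

class GuiComponent(ModalityComponent):
    def interpret(self, raw_input):
        return {"modality": "gui", "selection": raw_input}

def dialog_turn(component: ModalityComponent, raw: str) -> dict:
    """The dialog core sees only the shared frame, never the modality."""
    return component.interpret(raw)
```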


International Conference on Intelligent Transportation Systems | 2010

Multimodal highway monitoring for robust incident detection

Michael Pucher; Dietmar Schabus; Peter Schallauer; Yuriy Lypetskyy; Franz Graf; Harald Rainer; Michael Stadtschnitzer; Sabine Sternig; Josef Alois Birchbauer; Wolfgang Schneider; Bernhard Schalko

We present detection and tracking methods for highway monitoring based on video and audio sensors, and on the combination of these two modalities. We evaluate the performance of the different systems on realistic data sets recorded on Austrian highways. We show that very good performance can be achieved for video-based incident detection of wrong-way drivers, stationary vehicles, and traffic jams. Algorithms for simultaneous vehicle and driving-direction detection using microphone arrays were also evaluated and showed good performance on these tasks. Robust tracking under difficult weather conditions is achieved through multimodal sensor fusion of the video and audio sensors.
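
As a toy illustration of the sensor-fusion step, a weighted late fusion of per-event confidences from the video and audio detectors, down-weighting video in bad weather; the weights and threshold are invented placeholders, not the paper's tuned values.

```python
def fuse_detections(p_video, p_audio, bad_weather=False, threshold=0.5):
    """Combine modality confidences for one candidate incident."""
    w_video = 0.4 if bad_weather else 0.7  # trust video less in fog/rain
    score = w_video * p_video + (1.0 - w_video) * p_audio
    return score >= threshold  # True -> raise an incident alert
```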


International Conference on Computer Graphics and Interactive Techniques | 2011

Simultaneous speech and animation synthesis

Dietmar Schabus; Michael Pucher; Gregor Hofer

Talking computer-animated characters are a common sight in video games and movies. Although doing the mouth animation by hand gives the best results, it is not always feasible because of cost and time constraints. Furthermore, the amount of speech in current games is ever increasing, with some games having more than 200,000 lines of dialogue. This work proposes a system that can produce speech and the corresponding lip animation simultaneously, using a statistical machine learning framework based on hidden Markov models (HMMs). The key point is that, with the developed system, never-before-seen-or-heard animated dialogues can be produced at the push of a button.


Multimodal Signals: Cognitive and Algorithmic Issues | 2009

Regionalized Text-to-Speech Systems: Persona Design and Application Scenarios

Michael Pucher; Gudrun Schuchmann; Peter Fröhlich

This paper presents results on the selection of application scenarios and on persona design for sociolect and dialect speech synthesis. These results are derived from a listening experiment and a user study. Most speech synthesis applications focus on major languages spoken by many people. We argue that localizing speech synthesis applications by using sociolects and dialects can benefit the user, since these language varieties entail specific personas and background knowledge.


International Conference on Multimodal Interfaces | 2002

Mobile multi-modal data services for GPRS phones and beyond

Georg Niklfeld; Michael Pucher; Robert Finan; Wolfgang Eckhart

The paper discusses means to build multimodal data services on existing GPRS infrastructure and puts the proposed simple solutions into the perspective of technological possibilities that will become available in public mobile communications networks over the next few years, along the progression path from 2G/GSM systems through GPRS to 3G systems like UMTS, or equivalently to 802.11 networks. Three demonstrators are presented, developed by the authors in an application-oriented research project co-financed by telecommunications companies. The first two, push-to-talk address entry for a route finder and an open-microphone map-content navigator, simulate a UMTS or WLAN scenario. The third demonstrator implements a multimodal map finder in a live public GPRS network using WAP Push. Indications of usability are given. The paper argues for the importance of open, standards-based architectures that will spur attractive multimodal services in the short term, as current economic difficulties in the telecommunications industry put support for long-term research into more advanced forms of multimodality in question.

Collaboration


Dive into Michael Pucher's collaborations.

Top Co-Authors

Markus Toman, Vienna University of Technology
Junichi Yamagishi, National Institute of Informatics
Gregor Hofer, University of Edinburgh
Friedrich Neubarth, Austrian Research Institute for Artificial Intelligence
Georg Niklfeld, Austrian Research Institute for Artificial Intelligence
Peter Fröhlich, Austrian Institute of Technology
Gudrun Schuchmann, Austrian Academy of Sciences
Volker Strom, University of Edinburgh