Beat Pfister | Researchain

Publication

Featured research published by Beat Pfister.

Conference of the International Speech Communication Association | 2016

Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Recognition.

Naoya Takahashi; Michael Gygli; Beat Pfister; Luc Van Gool

We propose a novel method for Acoustic Event Detection (AED). In contrast to speech, sounds coming from acoustic events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of a clear sub-word unit. In order to incorporate the long-time frequency structure for AED, we introduce a convolutional neural network (CNN) with a large input field. In contrast to previous work, this enables end-to-end training of audio event detection. Our architecture is inspired by the success of VGGNet and uses small, 3x3 convolutions, but more depth than previous methods in AED. In order to prevent over-fitting and to take full advantage of the modeling capabilities of our network, we further propose a novel data augmentation method to introduce data variation. Experimental results show that our CNN significantly outperforms state-of-the-art methods including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.
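
As a rough illustration of how stacked small convolutions can still cover a long stretch of input, the sketch below computes the receptive field of a layer stack along one axis. The layer configurations are illustrative assumptions and are not the architecture described in the paper.

```python
def receptive_field(layers):
    """Receptive field (along one axis) of stacked conv/pool layers.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf = 1    # before any layer, one unit sees exactly one input sample
    jump = 1  # spacing, in input samples, between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the view by (k-1) jumps
        jump *= stride             # strided layers spread units further apart
    return rf


# Hypothetical stacks, chosen only to show the effect of depth and pooling.
two_convs = receptive_field([(3, 1), (3, 1)])
three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])
with_pooling = receptive_field([(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)])
```

Two stacked 3x3 layers see as much input as a single 5x5 layer, and three see as much as a 7x7 layer, so depth and pooling together grow the input field while keeping each kernel small.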

Speech Communication | 2007

Text analysis and language identification for polyglot text-to-speech synthesis

Harald Romsdorfer; Beat Pfister

In multilingual countries, text-to-speech synthesis systems often have to deal with texts containing inclusions of multiple other languages in the form of phrases, words, or even parts of words. In such multilingual cultural settings, listeners expect a high-quality text-to-speech synthesis system to read such texts in a way that the origin of the inclusions is heard, i.e., with correct language-specific pronunciation and prosody. The challenge for a text analysis component of a text-to-speech synthesis system is to derive from mixed-lingual sentences the correct polyglot phone sequence and all information necessary to generate natural-sounding polyglot prosody. This article presents a new approach to analyzing mixed-lingual sentences. This approach centers around a modular, mixed-lingual morphological and syntactic analyzer, which additionally provides accurate language identification at the morpheme level and word and sentence boundary identification in mixed-lingual texts. This approach can also be applied to word identification in languages without a designated word boundary symbol, like Chinese or Japanese. To date, this mixed-lingual text analysis supports any mixture of English, French, German, Italian, and Spanish. Because of its modular design, it is easily extensible to additional languages.

IEEE Transactions on Pattern Analysis and Machine Intelligence | 2015

A Semidefinite Programming Based Search Strategy for Feature Selection with Mutual Information Measure

Tofigh Naghibi; Sarah Hoffmann; Beat Pfister

Feature subset selection, as a special case of the general subset selection problem, has been the topic of a considerable number of studies due to the growing importance of data-mining applications. In the feature subset selection problem there are two main issues that need to be addressed: (i) Finding an appropriate measure function that can be fairly fast and robustly computed for high-dimensional data. (ii) A search strategy to optimize the measure over the subset space in a reasonable amount of time. In this article mutual information between features and class labels is considered to be the measure function. Two series expansions for mutual information are proposed, and it is shown that most heuristic criteria suggested in the literature are truncated approximations of these expansions. It is well known that searching the whole subset space is an NP-hard problem. Here, instead of the conventional sequential search algorithms, we suggest a parallel search strategy based on semidefinite programming (SDP) that can search through the subset space in polynomial time. By exploiting the similarities between the proposed algorithm and an instance of the maximum-cut problem in graph theory, the approximation ratio of this algorithm is derived and is compared with the approximation ratio of the backward elimination method. The experiments show that it can be misleading to judge the quality of a measure solely based on the classification accuracy, without taking the effect of the non-optimum search strategy into account.
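
To make the measure function concrete, the following minimal sketch estimates the mutual information between a discrete feature and the class labels from counts, then ranks features by that score. This is only a simple relevance ranking for illustration; it is not the SDP-based search strategy of the paper, and the function names and toy data are assumptions.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Empirical mutual information I(X; Y) in bits for discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def rank_features(features, labels):
    """Sort feature names by decreasing mutual information with the labels."""
    scores = {name: mutual_information(col, labels) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): "copy" determines the label, "noise" is independent.
labels = [0, 0, 1, 1]
features = {"noise": [0, 1, 0, 1], "copy": [0, 0, 1, 1]}
ranking = rank_features(features, labels)
```

Scoring each feature in isolation ignores redundancy between features, which is one reason the choice of search strategy over subsets matters.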

Speech Communication | 2012

Syntactic language modeling with formal grammars

Tobias Kaufmann; Beat Pfister

It has repeatedly been demonstrated that automatic speech recognition can benefit from syntactic information. However, virtually all syntactic language models for large-vocabulary continuous speech recognition are based on statistical parsers. In this paper, we investigate the use of a formal grammar as a source of syntactic information. We describe a novel approach to integrating formal grammars into speech recognition and evaluate it in a series of experiments. For a German broadcast news transcription task, the approach was found to reduce the word error rate by 9.7% (relative) compared to a competitive baseline speech recognizer. We provide an extensive discussion on various aspects of the approach, including the contribution of different kinds of information, the development of a precise formal grammar and the acquisition of lexical information.

International Conference on Acoustics, Speech, and Signal Processing | 2011

Extended Viterbi algorithm for optimized word HMMs

Michael Gerber; Tobias Kaufmann; Beat Pfister

This paper deals with the problem of finding the optimal sequence of sub-word unit HMMs for a number of given utterances of a word. For this problem we present a new solution based on an extension of the Viterbi algorithm which maximizes the joint probability of the utterances and all possible sequences of sub-word units and hence guarantees to find the optimal solution. The new algorithm was applied in an isolated word recognition experiment and compared to simpler approaches to determining the sequence of sub-word units. We report a significant reduction of the recognition error rate with the new algorithm.
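
For orientation, here is a minimal sketch of the standard Viterbi algorithm for a single observation sequence. It is not the extended algorithm of the paper, which jointly handles several utterances; the toy model, state names, and probabilities are hypothetical.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its joint probability."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical two-state toy model, used only to exercise the function.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
```

Practical implementations usually work with log probabilities to avoid numerical underflow on long sequences; plain probabilities are used here to keep the sketch short.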

IEEE Automatic Speech Recognition and Understanding Workshop | 2005

Integrating a non-probabilistic grammar into large vocabulary continuous speech recognition

René Beutler; Tobias Kaufmann; Beat Pfister

We propose a method of incorporating a non-probabilistic grammar into large vocabulary continuous speech recognition (LVCSR). Our basic assumption is that the utterances to be recognized are grammatical to a sufficient degree, which enables us to decrease the word error rate by favouring grammatical phrases. We use a parser and a handcrafted grammar to identify grammatical phrases in word lattices produced by a speech recognizer. This information is then used to rescore the word lattice. We measured the benefit of our method by extending an LVCSR baseline system (based on hidden Markov models and a 4-gram language model) with our rescoring component. We achieved a statistically significant reduction in word error rate compared to the baseline system.

International Conference on Machine Learning | 2004

A mixed-lingual phonological component which drives the statistical prosody control of a polyglot TTS synthesis system

Harald Romsdorfer; Beat Pfister; René Beutler

A polyglot text-to-speech synthesis system which is able to read aloud mixed-lingual text has first of all to derive the correct pronunciation. This is achieved with an accurate morpho-syntactic analyzer that works simultaneously as language detector, followed by a phonological component which performs various phonological transformations. The result of these symbol processing steps is a complete phonological description of the speech to be synthesized. The subsequent processing step, i.e. prosody control, has to generate numerical values for the physical prosodic parameters from this description, a task that is very different from the former ones. This article shows appropriate solutions to both types of tasks, namely a particular rule-based approach for the phonological component and a statistical or machine learning approach to prosody control.

International Conference on Acoustics, Speech, and Signal Processing | 2013

Convex approximation of the NP-hard search problem in feature subset selection

Tofigh Naghibi; Sarah Hoffmann; Beat Pfister

Feature subset selection, as a special case of the general subset selection problem, has attracted a lot of research attention due to the growing importance of data-mining applications. However, since finding the optimal subset is an NP-hard problem, very different heuristic search methods have been suggested to approximate it. Here we propose a new second-order cone programming based search strategy to efficiently solve the feature subset selection for large-scale problems. Experimentally, it is shown that its performance is almost always better than that of the greedy search methods, especially when the features are strongly dependent.

Non-Linear Speech Processing | 2007

Perceptron-based class verification

Michael Gerber; Tobias Kaufmann; Beat Pfister

We present a method to use multilayer perceptrons (MLPs) for a verification task, i.e. to verify whether two vectors are from the same class or not. In tests with synthetic data we could show that the verification MLPs are almost optimal from a Bayesian point of view. With speech data we have shown that verification MLPs generalize well, such that they can also be deployed for classes that were not seen during training.

Conference of the International Speech Communication Association | 2016

The SIWIS database: a multilingual speech database with acted emphasis

Jean-Philippe Goldman; Pierre-Edouard Honnet; Robert A. J. Clark; Philip N. Garner; Maria Ivanova; Alexandros Lazaridis; Hui Liang; Tiago Macedo; Beat Pfister; Manuel Sam Ribeiro; Eric Wehrli; Junichi Yamagishi

We describe here a collection of speech data of bilingual and trilingual speakers of English, French, German and Italian. In the context of speech to speech translation (S2ST), this database is designed for several purposes and studies: training CLSA systems (cross-language speaker adaptation), conveying emphasis through S2ST systems, and evaluating TTS systems. More precisely, 36 speakers judged as accentless (22 bilingual and 14 trilingual speakers) were recorded for a set of 171 prompts in two or three languages, amounting to a total of 24 hours of speech. These sets of prompts include 100 sentences from news, 25 sentences from Europarl, the same 25 sentences with one acted emphasised word, 20 semantically unpredictable sentences, and finally a 240-word long text. All in all, it yielded 64 bilingual session pairs of the six possible combinations of the four languages. The database is freely available for non-commercial use and scientific research purposes. Index Terms: speech-to-speech translation, speech corpus, bilingual speakers, emphasis
