Ivica Rogina
Karlsruhe Institute of Technology
Publications
Featured research published by Ivica Rogina.
human language technology | 1993
Monika Woszczyna; Noah Coccaro; Andreas Eisele; Alon Lavie; Arthur E. McNair; Thomas Polzin; Ivica Rogina; Carolyn Penstein Rosé; Tilo Sloboda; Masaru Tomita; J. Tsutsumi; Naomi Aoki-Waibel; Alex Waibel; Wayne H. Ward
We present recent advances from our efforts in increasing the coverage, robustness, generality, and speed of JANUS, CMU's speech-to-speech translation system. JANUS is a speaker-independent system translating spoken utterances in English and also in German into one of German, English, or Japanese. The system has been designed around the task of conference registration (CR). It was initially built on a speech database of 12 read dialogs, encompassing a vocabulary of around 500 words. We have since been expanding the system along several dimensions to improve speed, robustness, and coverage and to move toward spontaneous input.
international conference on acoustics, speech, and signal processing | 1994
Monika Woszczyna; Naomi Aoki-Waibel; Finn Dag Buø; Noah Coccaro; Keiko Horiguchi; Thomas Kemp; Alon Lavie; Arthur E. McNair; Thomas Polzin; Ivica Rogina; Carolyn Penstein Rosé; Tanja Schultz; Bernhard Suhm; Masaru Tomita; Alex Waibel
We present first results from our efforts toward the translation of spontaneously spoken speech. Improvements include increased coverage, robustness, generality, and speed of JANUS, the speech-to-speech translation system of Carnegie Mellon and Karlsruhe University. The recognition and machine translation engines have been upgraded to deal with the requirements introduced by spontaneous human-to-human dialogs. To allow for development and evaluation of our system on adequate data, a large database of spontaneous scheduling dialogs is being gathered for English, German, and Spanish.
international conference on acoustics, speech, and signal processing | 1996
Tanja Schultz; Ivica Rogina; Alex Waibel
Automatic language identification is an important problem in building multilingual speech recognition and understanding systems. Building a language identification module for four languages, we studied the influence of applying different levels of knowledge sources in a large vocabulary continuous speech recognition (LVCSR) approach, i.e. phonetic, phonotactic, lexical, and syntactic-semantic knowledge. The resulting language identification (LID) module can identify spontaneous speech input and can be used as a front end for the multilingual speech-to-speech translation system JANUS-II. A comparison of five LID systems showed that incorporating lexical and linguistic knowledge reduces the language identification error by up to 50% in the two-language tests. Based on these results, we built a LID module for German, English, Spanish, and Japanese which yields an 84% identification rate on the spontaneous scheduling task (SST).
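As a rough illustration of the LVCSR-based approach, the sketch below (our own simplification, not the paper's implementation) runs one decoder per language and picks the language whose combined acoustic and language model score is highest; the `decode` callables and the `lm_weight` parameter are assumptions introduced for illustration.

```python
# Minimal sketch of LVCSR-based language identification: one recognizer
# per language decodes the utterance, and the language whose decoder
# yields the highest combined score wins. The per-language `decode`
# functions are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class DecodeResult:
    acoustic_score: float   # log-likelihood from the acoustic model
    lm_score: float         # log-probability from the language model

def identify_language(utterance, recognizers, lm_weight=0.7):
    """Pick the language whose full LVCSR decoder scores the utterance best.

    `recognizers` maps a language tag to a decode function returning a
    DecodeResult; combining acoustic and LM scores corresponds to adding
    lexical and syntactic-semantic knowledge on top of phonetics.
    """
    best_lang, best_score = None, float("-inf")
    for lang, decode in recognizers.items():
        result = decode(utterance)
        score = result.acoustic_score + lm_weight * result.lm_score
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```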
international conference on acoustics, speech, and signal processing | 1996
Jürgen Fritsch; Ivica Rogina
Most state-of-the-art speech recognizers today are based on hidden Markov modeling. With semi-continuous or continuous density hidden Markov models, computing emission probabilities requires the evaluation of mixture Gaussian probability density functions. Since it is very expensive to evaluate all the Gaussians of the mixture density codebook, many recognizers compute only the M most significant Gaussians (M=1,...,8). This paper presents an alternative approach to approximating mixture Gaussians with diagonal covariance matrices, based on a binary feature space partitioning tree. The proposed algorithm is evaluated experimentally in the context of large vocabulary, speaker-independent, spontaneous speech recognition using the JANUS-2 speech recognizer. For mixtures with 50 Gaussians, we achieve a speedup of 2-5 in the computation of HMM emission probabilities without affecting the accuracy of the system.
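The sketch below illustrates the general idea of a binary feature-space partitioning tree for Gaussian shortlisting, under our own simplifying assumptions; the paper's actual tree construction and leaf contents may differ, and all class and function names here are hypothetical.

```python
# A binary partitioning tree: each internal node splits on one feature
# dimension, and each leaf stores the indices of the mixture components
# worth evaluating in that region of feature space.
import numpy as np

class BBINode:
    def __init__(self, dim=None, threshold=None, left=None, right=None,
                 gaussian_ids=None):
        self.dim, self.threshold = dim, threshold   # split rule (internal node)
        self.left, self.right = left, right
        self.gaussian_ids = gaussian_ids            # shortlist (leaf node)

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def emission_log_prob(x, root, weights, means, variances):
    """Approximate log p(x) by evaluating only the leaf's shortlist."""
    node = root
    while node.gaussian_ids is None:                # descend to a leaf
        node = node.left if x[node.dim] < node.threshold else node.right
    scores = [np.log(weights[i]) + log_gauss_diag(x, means[i], variances[i])
              for i in node.gaussian_ids]
    return np.logaddexp.reduce(scores)              # log-sum-exp over shortlist
```

Because only the handful of Gaussians stored at the reached leaf are evaluated, the cost per frame drops roughly in proportion to the shortlist size, which is where the reported 2-5x speedup comes from.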
international conference on multimodal interfaces | 2002
Ivica Rogina; Thomas Schaaf
Archiving, indexing, and later browsing through stored presentations and lectures are becoming increasingly common. We have investigated the special problems and advantages of lectures and propose the design and adaptation of a speech recognizer to a lecture such that recognition accuracy can be significantly improved by prior analysis of the presented documents using a special class-based language model. We define a tracking accuracy measure that quantifies how well a system can automatically align recognized words with parts of a presentation, and show that prior exploitation of the presented documents improves tracking accuracy. The system described in this paper is part of an intelligent meeting room developed in the European Union-sponsored project FAME (Facilitating Agent for Multicultural Exchange).
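As a hedged sketch of the alignment idea, the code below assigns each window of recognized words to the slide whose vocabulary it overlaps most and scores the result; this is our own simplification for illustration, and the paper's actual tracking accuracy measure may be defined differently.

```python
# Assign each recognized-word window to the best-matching slide by
# vocabulary overlap, then score against a reference alignment.
def track_slides(recognized_words, slide_vocabs, window=10):
    """Assign each window of recognized words to the best-matching slide.

    `slide_vocabs` is a list of word sets, one per slide (hypothetical
    preprocessing of the presented documents).
    """
    assignments = []
    for start in range(0, len(recognized_words), window):
        chunk = set(w.lower() for w in recognized_words[start:start + window])
        best = max(range(len(slide_vocabs)),
                   key=lambda i: len(chunk & slide_vocabs[i]))
        assignments.append(best)
    return assignments

def tracking_accuracy(predicted, reference):
    """Fraction of windows aligned to the correct slide."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)
```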
international conference on acoustics, speech, and signal processing | 1997
Michael Finke; Ivica Rogina
Context-dependent acoustic models have been applied in speech recognition research for many years and have been shown to increase recognition accuracy significantly. The most common approach is to use triphones. Several speech recognition groups have started investigating the use of larger phonetic context windows when building acoustic models. We discuss some of the computational problems arising from wide-context modeling (polyphonic modeling) and present methods to cope with these problems. A two-stage decision-tree-based polyphonic clustering approach is described which implements a more flexible parameter tying scheme. The new clustering approach gave significant improvements across all tasks (WSJ, SWB, and the Spontaneous Scheduling Task) and across all languages involved (German, Spanish, and English). We report recognition results based on the JANUS speech recognition toolkit on two tasks, comparing acoustic context phenomena in English read versus spontaneous speech, using our WSJ 60K recognizer and the JANUS SWB 10K polyphonic recognizer.
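To make the clustering idea concrete, here is a sketch of one greedy decision-tree split over polyphonic contexts using a single-Gaussian likelihood criterion, a common simplification in state tying; the paper's two-stage procedure and its question sets are not reproduced here, and all names are illustrative.

```python
# Greedy split selection for decision-tree clustering of contexts:
# pick the question whose yes/no partition gains the most likelihood
# when each side is modeled by a single diagonal Gaussian.
import numpy as np

def gaussian_loglik(samples):
    """ML log-likelihood of samples under one diagonal Gaussian."""
    x = np.asarray(samples)
    var = x.var(axis=0) + 1e-6
    return -0.5 * len(x) * np.sum(np.log(2 * np.pi * var) + 1.0)

def best_split(contexts, questions):
    """Pick the phonetic question with the largest likelihood gain.

    `contexts` maps a context tuple to its feature vectors; `questions`
    is a list of (name, predicate-on-context) pairs.
    """
    pooled = np.vstack(list(contexts.values()))
    base = gaussian_loglik(pooled)
    best_q, best_gain = None, 0.0
    for name, pred in questions:
        yes = [v for c, v in contexts.items() if pred(c)]
        no = [v for c, v in contexts.items() if not pred(c)]
        if not yes or not no:
            continue
        gain = (gaussian_loglik(np.vstack(yes))
                + gaussian_loglik(np.vstack(no)) - base)
        if gain > best_gain:
            best_q, best_gain = name, gain
    return best_q, best_gain
```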
international conference on acoustics, speech, and signal processing | 1995
Tanja Schultz; Ivica Rogina
Several improvements to our speech-to-speech translation system JANUS for spontaneous human-to-human dialogs are presented. Common phenomena in spontaneous speech are described, followed by a classification of different types of noise. To handle the variety of spontaneous effects in human-to-human dialogs, special noise models are introduced representing both human and nonhuman noise, as well as word fragments. It is shown that both the acoustic and the language modeling of noise increase recognition performance significantly. In the experiments, a clustering of the noise classes is performed and the resulting cluster variants are compared, allowing us to determine the best tradeoff between the sensitivity and trainability of the models.
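A minimal sketch of how noise events might be represented as ordinary lexicon entries with their own acoustic models and language model mass follows; the noise tags and counts below are invented for illustration and do not come from the paper.

```python
# Noise events as regular lexicon entries: each tag maps to a dedicated
# (hypothetical) acoustic noise model, and gets unigram mass in the LM
# so the decoder can hypothesize it between words.
NOISE_LEXICON = {
    "+BREATH+": ["NOI_breath"],      # human noise with its own model
    "+LAUGH+": ["NOI_laugh"],
    "+CLICK+": ["NOI_nonhuman"],     # nonhuman noises share one model
    "+FRAGMENT+": ["NOI_fragment"],  # word fragments
}

def add_noise_to_lm(unigram_counts, noise_count=100):
    """Give each noise word LM mass so decoding can hypothesize it."""
    counts = dict(unigram_counts)
    for word in NOISE_LEXICON:
        counts[word] = counts.get(word, 0) + noise_count
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```

How many distinct noise entries to keep is exactly the sensitivity-versus-trainability tradeoff the clustering experiments explore: more classes model noise more precisely but leave fewer training samples per class.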
international conference on acoustics, speech, and signal processing | 2000
Christian Fügen; Ivica Rogina
Context decision trees are widely used in the speech recognition community. Besides questions about the phonetic classes of a phone's context, questions about its position within a word and questions about the gender of the current speaker have been used so far. In this paper we additionally incorporate questions about the current modalities of the spoken utterance, such as the speaker's dialect, the speaking rate, and the signal-to-noise ratio, the latter two of which may change within a single utterance. We present a framework that treats all these modalities in a uniform way. Experiments with the Janus speech recognizer have produced error rate reductions of up to 10% compared to systems that do not use modality questions.
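The uniform treatment can be pictured as follows: phonetic and modality questions are both predicates over a single context record, so the tree-growing algorithm needs no special cases. The field names and thresholds in this sketch are hypothetical.

```python
# Phonetic and modality questions share one interface: a predicate on a
# context record, so the same tree-growing code can use either kind.
VOWELS = {"a", "e", "i", "o", "u"}

def q_left_is_vowel(ctx):        # classic phonetic question
    return ctx["left_phone"] in VOWELS

def q_fast_speech(ctx):          # modality question: speaking rate
    return ctx["speaking_rate"] > 1.2

def q_noisy(ctx):                # modality question: signal-to-noise ratio
    return ctx["snr_db"] < 10.0

QUESTIONS = [q_left_is_vowel, q_fast_speech, q_noisy]

# Example context record for one HMM state occurrence:
ctx = {"left_phone": "a", "right_phone": "t",
       "speaking_rate": 1.4, "snr_db": 8.0}
answers = [q(ctx) for q in QUESTIONS]   # -> [True, True, True]
```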
international conference on acoustics, speech, and signal processing | 1994
Ivica Rogina; Alex Waibel
Many speech recognition systems use multiple information streams to compute HMM output probabilities (e.g. systems based on semicontinuous or discrete HMMs use one codebook for cepstral coefficients and another for delta cepstral coefficients). The final score is a weighted sum of the contributions of every stream. These weights can be found empirically, and usually the same set of weights is used for every acoustic model. There is reason to believe that some features are more important for some acoustic models than for others. In particular, one would expect the beginning and ending segments of a phoneme to be more context-dependent than the middle part, so in that case the probability estimator of the speech recognizer should put more emphasis on the delta-spectrum than on the spectrum. Experiments have shown that spectral or cepstral coefficients are more important than their derivatives and more important than power or delta-power coefficients. We propose an algorithm for learning individual stream weights for every HMM state. Since these individual weights are a superset of the stream-only dependent weights, they can reproduce the results of the stream-only dependent weights and, additionally, discriminate between HMM states. Thus, the recognition performance can only improve.
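As a sketch of the scoring rule this implies, each HMM state combines its per-stream log-probabilities with its own weight vector; uniform weights recover the shared-weight baseline. The numbers below are illustrative only.

```python
# Per-state stream weighting: each HMM state s scores a frame as the
# weighted sum of the log-probabilities of its feature streams, using
# that state's own learned weights rather than one global weight set.
import numpy as np

def state_emission_logprob(stream_logprobs, stream_weights):
    """Weighted sum of per-stream log-probabilities for one state.

    `stream_logprobs` holds log p(x_k | s) for each stream k (e.g.
    cepstra, delta cepstra, power); `stream_weights` are that state's
    weights. Identical weights across states reproduce the baseline.
    """
    return float(np.dot(stream_weights, stream_logprobs))

# Example: a state that emphasizes delta cepstra over cepstra and power.
logps = np.array([-42.0, -17.5, -3.1])   # cepstra, deltas, power
weights = np.array([0.5, 0.4, 0.1])
score = state_emission_logprob(logps, weights)
```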
international conference on machine learning | 2005
Florian Metze; Petra Gieselmann; Hartwig Holzapfel; Tobias Kluge; Ivica Rogina; Alex Waibel; Matthias Wölfel; James L. Crowley; Patrick Reignier; Dominique Vaufreydaz; François Bérard; Bérangère Cohen; Joëlle Coutaz; Sylvie Rouillard; Victoria Arranz; Manuel Bertran; Horacio Rodríguez
This paper describes the FAME multi-modal demonstrator, which integrates multiple communication modes - vision, speech and object manipulation - by combining the physical and virtual worlds to provide support for multi-cultural or multi-lingual communication and problem solving. The major challenges are automatic perception of human actions and understanding of dialogs between people from different cultural or linguistic backgrounds. The system acts as an information butler, which demonstrates context awareness using computer vision, speech and dialog modeling. The integrated computer-enhanced human-to-human communication has been publicly demonstrated at the FORUM2004 in Barcelona and at IST2004 in The Hague. Specifically, the Interactive Space described features an Augmented Table for multi-cultural interaction, which allows several users at the same time to perform multi-modal, cross-lingual document retrieval of audio-visual documents previously recorded by an Intelligent Cameraman during a week-long seminar.