Tobias Bocklet
University of Erlangen-Nuremberg
Publications
Featured research published by Tobias Bocklet.
International Conference on Acoustics, Speech, and Signal Processing | 2008
Tobias Bocklet; Andreas K. Maier; Josef Bauer; Felix Burkhardt; Elmar Nöth
This paper compares two approaches to automatic age and gender classification with seven classes. The first approach uses Gaussian mixture models (GMMs) with universal background models (UBMs), which are well known from speaker identification and verification. Training is performed by the EM algorithm or by MAP adaptation, respectively. In the second approach, a GMM is trained for each speaker of the test and training sets. The means of each model are extracted and concatenated, which results in a GMM supervector for each speaker. These supervectors are then used as input to a support vector machine (SVM). Three different kernels were employed for the SVM approach: a polynomial kernel (of different degrees), an RBF kernel, and a linear GMM distance kernel based on the KL divergence. With the SVM approach we improved the recognition rate to 74% (p < 0.001), which is in the same range as human performance.
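The supervector construction and the linear kernel can be illustrated with a minimal sketch. It assumes the form commonly used for KL-divergence-based GMM distance kernels, K(a, b) = sum_k w_k * mu_k^a * Sigma_k^-1 * mu_k^b, with diagonal covariances taken from the UBM. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import math

def supervector(means, weights, variances):
    """Stack normalized component means into one vector.

    Each mean is scaled by sqrt(w_k) / sigma_k, so a plain dot product of two
    supervectors equals sum_k w_k * mu_a^T Sigma_k^-1 mu_b.
    Assumption: diagonal covariances and weights shared with the UBM.
    """
    vec = []
    for mu, w, var in zip(means, weights, variances):
        for m, v in zip(mu, var):
            vec.append(math.sqrt(w) * m / math.sqrt(v))
    return vec

def linear_kernel(a, b):
    """Dot product of two supervectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy UBM with 2 components and 2-dimensional features (hypothetical values).
ubm_weights = [0.5, 0.5]
ubm_variances = [[1.0, 4.0], [1.0, 1.0]]

# Adapted means of two speakers (hypothetical values).
speaker_a_means = [[1.0, 2.0], [0.0, 1.0]]
speaker_b_means = [[2.0, 2.0], [1.0, 1.0]]

sv_a = supervector(speaker_a_means, ubm_weights, ubm_variances)
sv_b = supervector(speaker_b_means, ubm_weights, ubm_variances)
k_ab = linear_kernel(sv_a, sv_b)
```

Because the scaling is folded into the supervector, any standard linear SVM could in principle be trained on these vectors directly.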
IEEE Automatic Speech Recognition and Understanding Workshop | 2011
Tobias Bocklet; Elmar Nöth; Georg Stemmer; Hana Ruzickova; Jan Rusz
Between 70% and 90% of patients with Parkinson's disease (PD) show an affected voice. Various studies have revealed that changes in voice and prosody are among the earliest indicators of PD. The aim of this study is to automatically detect whether the speech or voice of a person is affected by PD. We employ acoustic features, prosodic features, and features derived from a two-mass model of the vocal folds on different kinds of speech tests: sustained phonations, syllable repetitions, read texts, and monologues. Classification is performed in each case by SVMs. A correlation-based feature selection was performed in order to identify the most important features for each of these systems. We report recognition results of 91% with prosodic modeling when differentiating between normal-speaking persons and speakers with early-stage PD. With acoustic modeling we achieved a recognition rate of 88%, and with vocal modeling 79%. After feature selection these results improved considerably, but we expect the improved results to be too optimistic. We show that read texts and monologues are the most meaningful speech tests for the automatic detection of PD based on articulation, voice, and prosodic evaluations. The most important prosodic features were based on energy, pauses, and F0. The masses and the spring compliances were found to be the most important parameters of the two-mass vocal fold model.
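A correlation-based feature selection can be sketched as follows. This is a minimal illustration that assumes the merit heuristic of Hall's CFS (a subset is good if its features correlate with the class but not with each other) combined with greedy forward search; the abstract does not state which variant the authors used. The feature names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def merit(subset, features, labels):
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    # Mean absolute feature-class correlation.
    r_cf = sum(abs(pearson(features[f], labels)) for f in subset) / k
    # Mean absolute feature-feature correlation over all pairs.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def forward_select(features, labels):
    """Greedy forward search: keep adding the feature that raises the merit most."""
    selected, best = [], 0.0
    remaining = list(features)
    while remaining:
        score, name = max((merit(selected + [f], features, labels), f) for f in remaining)
        if score <= best:
            break
        selected.append(name)
        remaining.remove(name)
        best = score
    return selected, best

# Hypothetical toy data: 0 = control speaker, 1 = speaker with PD.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "pause_ratio": [0, 0, 0, 1, 1, 1],  # perfectly informative
    "f0_range": [0, 0, 1, 1, 1, 1],     # informative but redundant
    "noise": [3, 1, 2, 3, 1, 2],        # unrelated to the class
}
selected, best = forward_select(features, labels)
```

In this toy case the redundant and the uninformative feature both lower the merit, so the search stops after the first feature.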
International Conference on Acoustics, Speech, and Signal Processing | 2009
Tobias Bocklet; Elizabeth Shriberg
We describe a new GMM-UBM speaker recognition system that uses standard cepstral features, but selects different frames of speech for different subsystems. Subsystems, or “constraints”, are based on syllable-level information and combined at the score level. Results on both the NIST 2006 and 2008 test data sets for the English telephone train and test condition reveal that a set of eight constraints performs extremely well, resulting in better performance than other commonly used cepstral models. Given the still largely unexplored world of possible constraints and combinations, it is likely that the approach can be improved even further.
Journal of Voice | 2012
Tobias Bocklet; Korbinian Riedhammer; Elmar Nöth; Ulrich Eysholdt; Tino Haderlein
Objective: One aspect of voice and speech evaluation after laryngeal cancer is acoustic analysis. Perceptual evaluation by expert raters is a standard in the clinical environment for global criteria such as overall quality or intelligibility. So far, automatic approaches evaluate acoustic properties of pathologic voices based on voiced/unvoiced distinction and fundamental frequency analysis of sustained vowels. Because of the large proportion of noisy components and the increasing aperiodicity of highly pathologic voices, a fully automatic analysis of fundamental frequency is difficult. We introduce a purely data-driven system for the acoustic analysis of pathologic voices based on recordings of a standard text. Methods: Short-time segments of the speech signal are analyzed in the spectral domain, and speaker models based on this information are built. These speaker models act as a clustered representation of the acoustic properties of a person's voice and are thus characteristic of speakers with different kinds and degrees of pathologic conditions. The system is evaluated on two different data sets with speakers reading standardized texts. One data set contains 77 speakers after laryngeal cancer treated with partial removal of the larynx. The other data set contains 54 totally laryngectomized patients equipped with a Provox shunt valve. Each speaker was rated by five expert listeners regarding three different criteria: strain, voice quality, and speech intelligibility. Results/Conclusion: We show correlations for each data set with r and ρ ≥ 0.8 between the automatic system and the mean value of the five raters. The interrater correlation of one rater to the mean value of the remaining raters is in the same range. We thus assume that for selected evaluation criteria, the system can serve as a validated objective support for acoustic voice and speech analysis.
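The two evaluation measures described above can be illustrated with a small sketch: the correlation between automatic scores and the mean of all raters, and the leave-one-out interrater correlation of each rater against the mean of the remaining raters. It assumes that r denotes Pearson's and ρ Spearman's coefficient; the ratings and scores below are hypothetical, not data from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def mean_rating(ratings, skip=None):
    """Mean score per speaker over raters, optionally leaving one rater out."""
    rows = [row for i, row in enumerate(ratings) if i != skip]
    return [sum(col) / len(col) for col in zip(*rows)]

# Hypothetical data: 3 raters x 5 speakers, plus automatic scores.
ratings = [
    [1, 2, 3, 4, 5],
    [1, 3, 3, 4, 5],
    [2, 2, 4, 4, 5],
]
automatic = [1.2, 2.1, 3.4, 3.9, 5.2]

r_auto = pearson(automatic, mean_rating(ratings))
rho_auto = spearman(automatic, mean_rating(ratings))
# Interrater agreement: each rater against the mean of the other raters.
interrater = [pearson(ratings[i], mean_rating(ratings, skip=i))
              for i in range(len(ratings))]
```

Comparing `r_auto` with the values in `interrater` mirrors the argument of the abstract: the system is considered useful if it agrees with the raters about as well as they agree with each other.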
Text, Speech and Dialogue | 2008
Tobias Bocklet; Andreas K. Maier; Elmar Nöth
This paper focuses on the automatic determination of the age of children of preschool and primary school age. For each child, a Gaussian Mixture Model (GMM) is trained using Maximum A Posteriori (MAP) adaptation. MAP derives the speaker models from a Universal Background Model (UBM) rather than estimating the parameters independently. The means of each GMM are extracted and concatenated, which results in a so-called GMM supervector. These supervectors are then used as meta features for classification with Support Vector Machines (SVM) or for Support Vector Regression (SVR). With the classification system, a precision of 83% and a recall of 66% were achieved. When the regression system was used to determine the age in years, a mean error of 0.8 years and a maximal error of 3 years were obtained. Regression at monthly resolution yielded similar results.
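The MAP step described above can be sketched as mean-only adaptation of a diagonal-covariance UBM, followed by concatenation of the adapted means. This is a minimal illustration using the widely used relevance-factor formulation; the relevance value of 16, the function names, and the toy data are assumptions and are not taken from the paper.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a UBM to the frames of one speaker."""
    K, D = len(means), len(means[0])
    n = [0.0] * K                      # soft frame count per component
    s = [[0.0] * D for _ in range(K)]  # posterior-weighted sum of frames
    for x in frames:
        logp = [math.log(w) + log_gauss_diag(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logp)                # subtract the max for numerical stability
        post = [math.exp(lp - top) for lp in logp]
        z = sum(post)
        for k in range(K):
            g = post[k] / z
            n[k] += g
            for d in range(D):
                s[k][d] += g * x[d]
    # Equivalent to alpha * E_k[x] + (1 - alpha) * mu_k with alpha = n_k / (n_k + r):
    # components that saw little data stay close to the UBM mean.
    return [[(s[k][d] + relevance * means[k][d]) / (n[k] + relevance)
             for d in range(D)] for k in range(K)]

def supervector(means):
    """Concatenate all component means into one feature vector."""
    return [value for mean in means for value in mean]

# Toy UBM: 2 components, 1-dimensional features (hypothetical values).
ubm_weights = [0.5, 0.5]
ubm_means = [[-1.0], [1.0]]
ubm_vars = [[1.0], [1.0]]
child_frames = [[3.0], [2.5], [3.5]]

adapted = map_adapt_means(child_frames, ubm_weights, ubm_means, ubm_vars)
sv = supervector(adapted)
```

Since every speaker model starts from the same UBM, the components of different supervectors correspond to each other, which is what makes the vectors usable as fixed-length input for an SVM or SVR.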
Computer Speech & Language | 2015
Björn W. Schuller; Stefan Steidl; Anton Batliner; E. Nöth; Alessandro Vinciarelli; Felix Burkhardt; R.J.J.H. van Son; Felix Weninger; Florian Eyben; Tobias Bocklet; Gelareh Mohammadi; Benjamin Weiss
The INTERSPEECH 2012 Speaker Trait Challenge aimed at a unified test-bed for perceived speaker traits – the first challenge of this kind: personality in the five OCEAN personality dimensions, likability of speakers, and intelligibility of pathologic speakers. In the present article, we give a brief overview of the state-of-the-art in these three fields of research and describe the three sub-challenges in terms of the challenge conditions, the baseline results provided by the organisers, and a new openSMILE feature set, which has been used for computing the baselines and which has been provided to the participants. Furthermore, we summarise the approaches and the results presented by the participants to show the various techniques that are currently applied to solve these classification tasks.
International Conference on Acoustics, Speech, and Signal Processing | 2012
Korbinian Riedhammer; Tobias Bocklet; Arnab Ghoshal; Daniel Povey
In the past decade, semi-continuous hidden Markov models (SCHMMs) have not attracted much attention in the speech recognition community. Growing amounts of training data and increasing sophistication of model estimation led to the impression that continuous HMMs are the best choice of acoustic model. However, recent work on recognition of under-resourced languages faces the same old problem of estimating a large number of parameters from limited amounts of transcribed speech. This has led to a renewed interest in methods of reducing the number of parameters while maintaining or extending the modeling capabilities of continuous models. In this work, we compare classic and multiple-codebook semi-continuous models using diagonal and full covariance matrices with continuous HMMs and subspace Gaussian mixture models. Experiments on the RM and WSJ corpora show that while a classical semi-continuous system does not perform as well as a continuous one, multiple-codebook semi-continuous systems can perform better, particularly when using full-covariance Gaussians.
International Journal of Pediatric Otorhinolaryngology | 2012
Maria Schuster; Andreas K. Maier; Tobias Bocklet; Emeka Nkenke; Alexandra Ioana Holst; Ulrich Eysholdt; Florian Stelzle
Author affiliations: Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital, Ludwig-Maximilians-University Munich, Marchioninistrasse 15, D-81377 Munich, Germany; Pattern Recognition Lab, Technical Faculty, Friedrich-Alexander-University Erlangen-Nuremberg, Martensstrasse 3, D-91058 Erlangen, Germany; Department of Phoniatrics and Pediatric Audiology, University Hospital Erlangen, Bohlenplatz 21, D-91054 Erlangen, Germany; Department of Oral and Maxillofacial Surgery, University Hospital Erlangen, Glückstrasse 11, D-91054 Erlangen, Germany; Clinic for Orthodontics, University Hospital Erlangen, Glückstrasse 11, D-91054 Erlangen, Germany.
Folia Phoniatrica et Logopaedica | 2009
Tobias Bocklet; Hikmet Toy; Elmar Nöth; Maria Schuster; Ulrich Eysholdt; Frank Rosanowski; Frank Gottwald; Tino Haderlein
Objective: The Hoarseness Diagram, a program for voice quality analysis used in German-speaking countries, was compared with an automatic speech recognition system with a module for prosodic analysis. The latter computed prosodic features on the basis of a text recording. We examined whether voice analysis of sustained vowels and text analysis correlate in tracheoesophageal speakers. Patients and Methods: Test speakers were 24 male laryngectomees with tracheoesophageal substitute speech, age 60.6 ± 8.9 years. Each person read the German version of the text ‘The North Wind and the Sun’. Additionally, five sustained vowels were recorded from each patient. The fundamental frequency (F₀) detected by both programs was compared for all vowels. The correlation between the measures obtained by the Hoarseness Diagram and the features from the prosody module was computed. Results: Both programs have problems in determining the F₀ of highly pathologic voices. Parameters like jitter, shimmer, F₀, and irregularity as computed by the Hoarseness Diagram from vowels show correlations of about –0.8 with prosodic features obtained from the text recordings. Conclusion: Voice properties can reliably be evaluated both on the basis of vowel and text recordings. Text analysis, however, also offers possibilities for the automatic evaluation of running speech since it realistically represents everyday speech.
Text, Speech and Dialogue | 2009
Tino Haderlein; Tobias Bocklet; Andreas K. Maier; Elmar Nöth; Christian Knipfer; Florian Stelzle
For dento-oral rehabilitation of edentulous (toothless) patients, speech intelligibility is an important criterion. Twenty-eight persons read a standardized text once with and once without wearing complete dentures. Six experienced raters evaluated the intelligibility subjectively on a 5-point scale and the voice on the 4-point Roughness-Breathiness-Hoarseness (RBH) scales. Objective evaluation was performed by Support Vector Regression (SVR) on the word accuracy (WA) and word recognition rate (WR) of a speech recognition system, and on a set of 95 word-based prosodic features. The word accuracy combined with selected prosodic features showed a correlation of up to r = 0.65 with the subjective ratings for patients with dentures and r = 0.72 for patients without dentures. For the RBH scales, however, the average correlation of the feature subsets with the subjective ratings for both types of recordings was r < 0.4.
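The two recognizer-based measures can be illustrated with a short sketch. It assumes the conventional definitions WR = (N - S - D) / N and WA = (N - S - D - I) / N, where N is the number of reference words and S, D, and I are the substitutions, deletions, and insertions of a minimum-edit alignment; the abstract itself does not spell these out. The function name and the example sentences are hypothetical.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = edit distance between reference[:i] and hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back through the table and count each error type.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and reference[i - 1] != hypothesis[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            subs += int(mismatch)
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins

# Hypothetical reference text and recognizer output.
reference = "the north wind and the sun".split()
hypothesis = "the north win and the the sun".split()

s, d, i_count = align_counts(reference, hypothesis)
n_ref = len(reference)
wr = (n_ref - s - d) / n_ref            # word recognition rate
wa = (n_ref - s - d - i_count) / n_ref  # word accuracy, also penalizes insertions
```

Because insertions are subtracted as well, WA can become negative for very poor recognition results, whereas WR stays between 0 and 1.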
