Thilo Stadelmann | Researchain

Publication

Featured research published by Thilo Stadelmann.

International Journal of Web and Grid Services | 2009

A scalable service-oriented architecture for multimedia analysis, synthesis and consumption

Steffen Heinzl; Dominik Seiler; Ernst Juhnke; Thilo Stadelmann; Ralph Ewerth; Manfred Grauer; Bernd Freisleben

Although Service-Oriented Architectures (SOAs) were not designed for multimedia processing, they speed up the development of distributed multimedia applications by allowing the composition or reconfiguration of existing services. For example, the Business Process Execution Language for Web Services (BPEL) is a powerful tool to orchestrate, model and execute workflows. However, due to its process-oriented approach, it is not directly applicable to data-intensive applications, such as those from the multimedia domain. In this paper, a comprehensive service-oriented infrastructure for multimedia applications is presented that (a) overcomes some drawbacks of BPEL for data-intensive applications and (b) provides tools that further ease the development and use of web services for a broad scope of multimedia applications covering video content analysis, audio analysis and synthesis, and multimedia consumption. The proposed service-oriented infrastructure can be easily integrated into existing business processes by using BPEL. A dynamic allocation of cloud computing resources ensures the scalability of a multimedia application. To allow efficient and flexible data transfers in BPEL workflows, an implementation of the Flexible SOAP with Attachments (Flex-SwA) architecture is used that allows data transmission in conjunction with SOAP messages. The protocol requirements of services in the case of real-time, streaming or file transfer can be described by a communication policy. Three use cases of multimedia applications are evaluated.

International Conference on Acoustics, Speech, and Signal Processing | 2006

Fast and Robust Speaker Clustering Using the Earth Mover's Distance and MIXMAX Models

Thilo Stadelmann; Bernd Freisleben

Speaker clustering is the task of assigning a unique label to all speech segments in a video uttered by the same speaker. There are two key challenges: processing speed and robustness in the presence of noise. In this paper, we present an approach to significantly improve the processing speed of a hierarchical speaker clustering algorithm by using the earth mover's distance (EMD) as the distance measure. By extending the well-known MIXMAX speaker model such that the EMD can be applied, noise robustness is achieved. Experimental results show that the proposed EMD approach reduces the runtime by more than a factor of 120 compared to a likelihood-ratio-based distance measure, while the clustering performance remains nearly the same.
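To make the distance measure named in the abstract above more concrete, here is a minimal sketch of the earth mover's distance for the simplest possible case: two weighted point sets on the real line. This is an illustration only, not the method of the paper, which applies the EMD to speaker models. The function name `emd_1d` and its arguments are hypothetical. In one dimension, the EMD equals the area between the two cumulative distribution functions, which is what the loop accumulates.

```python
def emd_1d(points_a, weights_a, points_b, weights_b):
    """Earth mover's distance between two weighted point sets on the real line.

    Simplified 1-D illustration (an assumption of this sketch, not the paper's
    setup). Weights are normalized so both sets carry a total mass of 1.
    """
    total_a = sum(weights_a)
    total_b = sum(weights_b)
    # Mass from set A counts positive, mass from set B negative; sorting by
    # position lets us track the gap between the two cumulative distributions.
    events = sorted(
        [(x, w / total_a) for x, w in zip(points_a, weights_a)]
        + [(x, -w / total_b) for x, w in zip(points_b, weights_b)]
    )
    distance = 0.0
    cdf_gap = 0.0
    prev_x = events[0][0]
    for x, signed_mass in events:
        # The current gap has to be carried across the stretch from prev_x to x.
        distance += abs(cdf_gap) * (x - prev_x)
        cdf_gap += signed_mass
        prev_x = x
    return distance


# Moving a unit of mass from position 0 to position 3 costs 3.
print(emd_1d([0.0], [1.0], [3.0], [1.0]))
```

Computing such a distance needs only the model parameters (positions and weights), not the underlying audio frames, which is one plausible reason a model-to-model distance can be cheaper than a likelihood-based one.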

International Conference on Pattern Recognition | 2010

Dimension-Decoupled Gaussian Mixture Model for Short Utterance Speaker Recognition

Thilo Stadelmann; Bernd Freisleben

The Gaussian Mixture Model (GMM) is often used in conjunction with Mel-frequency cepstral coefficient (MFCC) feature vectors for speaker recognition. A great challenge is to use these techniques in situations where only small sets of training and evaluation data are available, which typically results in poor statistical estimates and, ultimately, poor recognition scores. Based on the observation of marginal MFCC probability densities, we suggest greatly reducing the number of free parameters in the GMM by modeling the single dimensions separately after proper preprocessing. Saving about 90% of the free parameters compared to an already optimized GMM, and thus making the estimates more stable, this approach considerably improves recognition accuracy over the baseline as the utterances get shorter. It also saves a huge amount of computing time in both training and evaluation, enabling real-time performance. The approach is easy to implement and to combine with other short-utterance approaches, and it is applicable to other features as well.
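The parameter saving described in the abstract above can be illustrated by counting free parameters. The sketch below assumes one reading of "modeling the single dimensions separately": a small one-dimensional mixture per feature dimension instead of one multivariate diagonal-covariance GMM. The function names and the model sizes in the usage lines are hypothetical and are not taken from the paper, so the printed ratio should not be read as reproducing the reported figure.

```python
def diagonal_gmm_params(num_components: int, num_dims: int) -> int:
    """Free parameters of a diagonal-covariance GMM."""
    weights = num_components - 1          # mixture weights sum to 1
    means = num_components * num_dims
    variances = num_components * num_dims
    return weights + means + variances


def decoupled_params(num_dims: int, components_per_dim: int) -> int:
    """Free parameters when each dimension gets its own 1-D mixture
    (an assumed interpretation of the dimension-decoupled idea)."""
    per_dim = (components_per_dim - 1) + 2 * components_per_dim
    return num_dims * per_dim


def relative_saving(full: int, reduced: int) -> float:
    """Fraction of parameters saved by the reduced model."""
    return 1.0 - reduced / full


# Hypothetical sizes chosen only for illustration.
full = diagonal_gmm_params(num_components=32, num_dims=20)
reduced = decoupled_params(num_dims=20, components_per_dim=2)
print(full, reduced, round(relative_saving(full, reduced), 3))
```

Fewer free parameters mean each one is estimated from more data points, which is the intuition behind more stable estimates on short utterances.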

ACM Multimedia | 2009

Unfolding speaker clustering potential: a biomimetic approach

Thilo Stadelmann; Bernd Freisleben

Speaker clustering is the task of grouping a set of speech utterances into speaker-specific classes. The basic techniques for solving this task are similar to those used for speaker verification and identification. The hypothesis of this paper is that the techniques originally developed for speaker verification and identification are not sufficiently discriminative for speaker clustering. However, the processing chain for speaker clustering is quite long, so there are many potential areas for improvement. The question is: where should changes be made to improve the final result? To answer this question, this paper takes a biomimetic approach based on a study with human participants acting as an automatic speaker clustering system. Our findings are twofold: it is the stage of modeling that has the highest potential, and information with respect to the temporal succession of frames is crucially missing. Experimental results with our implementation of a speaker clustering system that incorporates these findings, applied to TIMIT data, show the validity of our approach.

International Workshop on Machine Learning for Signal Processing | 2016

Speaker identification and clustering using convolutional neural networks

Yanick X. Lukic; Carlo Vogt; Oliver Dürr; Thilo Stadelmann

Deep learning, especially in the form of convolutional neural networks (CNNs), has triggered substantial improvements in computer vision and related fields in recent years. This progress is attributed to the shift from designing features and subsequent individual sub-systems towards learning features and recognition systems end to end from nearly unprocessed data. For speaker clustering, however, it is still common to use handcrafted processing chains such as MFCC features and GMM-based models. In this paper, we use simple spectrograms as input to a CNN and study the optimal design of those networks for speaker identification and clustering. Furthermore, we elaborate on the question of how to transfer a network trained for speaker identification to speaker clustering. We demonstrate our approach on the well-known TIMIT dataset, achieving results comparable with the state of the art without the need for handcrafted features.

Information Integration and Web-based Applications & Services | 2009

LCDL: an extensible framework for wrapping legacy code

Ernst Juhnke; Dominik Seiler; Thilo Stadelmann; Tim Dörnemann; Bernd Freisleben

If legacy code has to be integrated into an application, it is often necessary to call this code from another programming language or from a remote machine, whether it is available as source code written in a particular programming language or in binary format for a particular computing platform. For this reason, wrapping code has to be developed for each source code library or binary to be integrated. This paper presents an extensible framework that supports legacy code integration by modeling legacy code not only in a way that is independent of the programming language, but also by supporting different input and output types and bindings. This aim is achieved by the use of an integrated plug-in mechanism.

International Conference on Web Services | 2009

The Web Service Browser: Automatic Client Generation and Efficient Data Transfer for Web Services

Steffen Heinzl; Markus Mathes; Thilo Stadelmann; Dominik Seiler; Marcel Diegelmann; Helmut Dohmann; Bernd Freisleben

Web services are supported by almost all major software vendors, but there is still a certain barrier that prevents a broader user community from actually using them. The barrier is the lack of appropriate clients offered in conjunction with the services. This paper presents a Web Service Browser that automatically generates a dynamic user interface when the user browses to the location of the service description and additionally handles the invocation of the service. To ease the use of the service, the browser takes care of data management by using an implementation of the Flex-SwA architecture. Results are presented to the user in a human-readable manner. When the result contains multimedia data, an audio or video player is used to present the result. Use cases demonstrate the benefits of the browser. With the Web Service Browser, web services simply become a usable component offered in the WWW.

Conference on Image and Video Retrieval | 2007

Semantic video analysis for psychological research on violence in computer games

Markus Mühling; Ralph Ewerth; Thilo Stadelmann; Bernd Freisleben; René Weber; Klaus Mathiak

In this paper, we present an automatic semantic video analysis system to support interdisciplinary research efforts in the fields of psychology and media science. The psychological research question studied is whether and how playing violent content in computer games may induce aggression. To investigate this question, the extraction of meaningful content from computer games is required to gain insights into the interrelationship of violent game events and the underlying neurophysiologic basis (brain activity) of a player. Previously, human annotators had to index game content according to the current game state, which is a very time-consuming task. The automatic annotation of a large number of computer game recordings (i.e. videos) speeds up the experimentation process and allows researchers to analyze more experimental data on an objective basis. The proposed video content analysis system for computer games extracts several audiovisual low-level as well as mid-level features and deduces semantic content via a machine learning approach. This system requires manual annotations for only a single video to facilitate the semi-supervised learning process. Finally, human experts are allowed to refine the annotation results via a graphical user interface. Experimental results demonstrate the feasibility of the proposed approach.

International Workshop on Machine Learning for Signal Processing | 2017

Learning embeddings for speaker clustering based on voice equality

Yanick X. Lukic; Carlo Vogt; Oliver Dürr; Thilo Stadelmann

Recent work has shown that convolutional neural networks (CNNs) trained in a supervised fashion for speaker identification are able to extract features from spectrograms which can be used for speaker clustering. These features are represented by the activations of a certain hidden layer and are called embeddings. However, previous approaches require plenty of additional speaker data to learn the embedding. Moreover, although the clustering results are then on par with more traditional approaches using MFCC features etc., there is room for improvement because these embeddings are trained with a surrogate task that is rather far away from segregating unknown voices, namely identifying a few specific speakers. We address both problems by training a CNN to extract embeddings that are similar for equal speakers (regardless of their specific identity) using weakly labeled data. We demonstrate our approach on the well-known TIMIT dataset that has often been used for speaker clustering experiments in the past. We exceed the clustering performance of all previous approaches, but require just 100 instead of 590 unrelated speakers to learn an embedding suited for clustering.

International Conference on Pattern Recognition | 2010

Rethinking Algorithm Design and Development in Speech Processing

Thilo Stadelmann; Yinghui Wang; Matthew Smith; Ralph Ewerth; Bernd Freisleben

Speech processing is typically based on a set of complex algorithms requiring many parameters to be specified. When parts of the speech processing chain do not behave as expected, trial and error is often the only way to investigate the reasons. In this paper, we present a research methodology to analyze unexpected algorithmic behavior by making (intermediate) results of the speech processing chain perceivable and intuitively comprehensible by humans. The workflow of the process is explicated using a real-world example leading to considerable improvements in speaker clustering. The described methodology is supported by a software toolbox available for download.
