David Suendermann-Oeft
Educational Testing Service
Publications
Featured research published by David Suendermann-Oeft.
International Conference on Multimodal Interfaces | 2015
Vikram Ramanarayanan; Chee Wee Leong; Lei Chen; Gary Feng; David Suendermann-Oeft
We analyze how fusing features obtained from different multimodal data streams such as speech, face, body movement and emotion tracks can be applied to the scoring of multimodal presentations. We compute both time-aggregated and time-series-based features from these data streams: the former are statistical functionals and other cumulative features computed over the entire time series, while the latter, dubbed histograms of co-occurrences, capture how different prototypical body posture or facial configurations co-occur within different time lags of each other over the evolution of the multimodal, multivariate time series. We examine the relative utility of these features, along with curated speech-stream features, in predicting human-rated scores of multiple aspects of presentation proficiency. We find that different modalities are useful in predicting different aspects, even outperforming a naive human inter-rater agreement baseline for a subset of the aspects analyzed.
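As a rough illustration of the histogram-of-co-occurrences idea in this abstract, the sketch below counts how often pairs of prototypical posture/face clusters co-occur at several time lags and normalizes the counts into a fixed-length vector. The cluster IDs, lag set and normalization are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch, assuming each frame of a multimodal time series has already been
# quantized into one of K prototypical posture/face clusters.
import numpy as np

def cooccurrence_histogram(cluster_ids, num_clusters, lags=(1, 5, 10)):
    """Count how often cluster i is followed by cluster j at each time lag,
    then normalize so the feature is independent of sequence length."""
    feats = []
    for lag in lags:
        counts = np.zeros((num_clusters, num_clusters))
        for a, b in zip(cluster_ids[:-lag], cluster_ids[lag:]):
            counts[a, b] += 1
        total = counts.sum()
        if total > 0:
            counts /= total
        feats.append(counts.ravel())
    return np.concatenate(feats)  # one fixed-length vector per recording

# Example: a toy sequence of 100 frames quantized into 4 prototype clusters.
rng = np.random.default_rng(0)
ids = rng.integers(0, 4, size=100)
print(cooccurrence_histogram(ids, num_clusters=4).shape)  # (3 lags * 4 * 4,) = (48,)
```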
Archive | 2017
Vikram Ramanarayanan; David Suendermann-Oeft; Patrick Lange; Robert Mundkowsky; Alexei V. Ivanov; Zhou Yu; Yao Qian; Keelan Evanini
As dialog systems become increasingly multimodal and distributed in nature with advances in technology and computing power, they become that much more complicated to design and implement. However, open industry and W3C standards provide a silver lining here, allowing the distributed design of different components that are nonetheless compliant with each other. In this chapter we examine how an open-source, modular, multimodal dialog system—HALEF—can be seamlessly assembled, much like a jigsaw puzzle, by putting together multiple distributed components that are compliant with the W3C recommendations or other open industry standards. We highlight the specific standards that HALEF currently uses along with a perspective on other useful standards that could be included in the future. HALEF has an open codebase to encourage progressive community contribution and a common standard testbed for multimodal dialog system development and benchmarking.
IWSDS | 2017
Zhou Yu; Vikram Ramanarayanan; Robert Mundkowsky; Patrick Lange; Alexei V. Ivanov; Alan W. Black; David Suendermann-Oeft
We present an open-source web-based multimodal dialog framework, “Multimodal HALEF”, that integrates video conferencing and telephony abilities into the existing HALEF cloud-based dialog framework via the FreeSWITCH video telephony server. Due to its distributed and cloud-based architecture, Multimodal HALEF allows researchers to collect video and speech data from participants interacting with the dialog system outside of traditional lab settings, thereby largely reducing the cost and labor incurred during traditional audio-visual data collection. The framework is equipped with a set of tools, including a web-based user survey template, a speech transcription, annotation and rating portal, a web-based visual processing server that performs head tracking, and a database that logs full-call audio and video recordings as well as other call-specific information. We present observations from an initial data collection based on a job interview application. Finally, we report on future plans for development of the framework.
IEEE Automatic Speech Recognition and Understanding Workshop | 2015
Zhou Yu; Vikram Ramanarayanan; David Suendermann-Oeft; Xinhao Wang; Klaus Zechner; Lei Chen; Jidong Tao; Aliaksei Ivanou; Yao Qian
We introduce a new method to grade non-native spoken language tests automatically. Traditional automated response grading approaches use manually engineered time-aggregated features (such as mean length of pauses). We propose to incorporate general time-sequence features (such as pitch), which preserve more information than time-aggregated features and do not require human effort to design. We use a type of recurrent neural network to jointly optimize the learning of high-level abstractions from time-sequence features with the time-aggregated features. We first automatically learn high-level abstractions from time-sequence features with a Bidirectional Long Short-Term Memory (BLSTM) and then combine the high-level abstractions with time-aggregated features in a Multilayer Perceptron (MLP)/Linear Regression (LR). We optimize the BLSTM and the MLP/LR jointly. We find that such models reach the best performance in terms of correlation with human raters. We also find that when there are limited time-aggregated features available, our model that incorporates time-sequence features improves performance drastically.
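The joint architecture described here can be sketched as a BLSTM over frame-level sequence features whose pooled output is concatenated with hand-crafted aggregate features and fed to an MLP regressor, with a single loss driving both parts. The feature dimensions, hidden sizes and pooling choice below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of jointly optimizing a BLSTM (time-sequence features) with an
# MLP scorer (time-aggregated features).
import torch
import torch.nn as nn

SEQ_DIM, AGG_DIM, HIDDEN = 3, 20, 32

class JointScorer(nn.Module):
    def __init__(self):
        super().__init__()
        # BLSTM learns an abstraction of the time-sequence features.
        self.blstm = nn.LSTM(SEQ_DIM, HIDDEN, batch_first=True, bidirectional=True)
        # MLP scores the concatenation of the BLSTM summary and aggregate features.
        self.mlp = nn.Sequential(
            nn.Linear(2 * HIDDEN + AGG_DIM, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, seq, agg):
        out, _ = self.blstm(seq)      # (batch, time, 2*HIDDEN)
        summary = out.mean(dim=1)     # pool over time
        return self.mlp(torch.cat([summary, agg], dim=1)).squeeze(-1)

# Both components share one loss, so they are optimized jointly.
model = JointScorer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seq = torch.randn(8, 100, SEQ_DIM)   # toy batch: 8 responses, 100 frames each
agg = torch.randn(8, AGG_DIM)        # toy hand-crafted aggregate features
scores = torch.rand(8)               # toy human proficiency scores
loss = nn.functional.mse_loss(model(seq, agg), scores)
opt.zero_grad(); loss.backward(); opt.step()
```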
Annual Meeting of the Special Interest Group on Discourse and Dialogue | 2015
Vikram Ramanarayanan; David Suendermann-Oeft; Alexei V. Ivanov; Keelan Evanini
We have previously presented HALEF, an open-source spoken dialog system that supports telephonic interfaces and has a distributed architecture. In this paper, we extend this infrastructure to be cloud-based, and thus truly distributed and scalable. This cloud-based spoken dialog system can be accessed both via telephone interfaces and through web clients with WebRTC/HTML5 integration, allowing in-browser access to potentially multimodal dialog applications. We demonstrate the versatility of the system with two conversational applications in the educational domain.
Natural Language Dialog Systems and Intelligent Assistants | 2015
David Suendermann-Oeft; Vikram Ramanarayanan; Moritz Teckenbrock; Felix Neutatz; Dennis Schmidt
We describe completed and ongoing research on HALEF, a telephony-based open-source spoken dialog system that can be used with different plug-and-play back-end modules. We present two examples of such modules: one classifies whether the person calling into the system is intoxicated, and the other is a question answering application. The system is compliant with World Wide Web Consortium and related industry standards while maintaining an open codebase to encourage progressive development and a common standard testbed for spoken dialog system development and benchmarking. The system can be deployed for a versatile range of potential applications, including intelligent tutoring, language learning, and assessment.
Workshop on Child Computer Interaction | 2016
Yao Qian; Xinhao Wang; Keelan Evanini; David Suendermann-Oeft
Acoustic models for state-of-the-art DNN-based speech recognition systems are typically trained using at least several hundred hours of task-specific training data. However, this amount of training data is not always available for some applications. In this paper, we investigate how to use an adult speech corpus to improve DNN-based automatic speech recognition for non-native children's speech. Although there are many acoustic and linguistic mismatches between the speech of adults and children, adult speech can still be used to boost the performance of a speech recognizer for children using acoustic modeling techniques based on the DNN framework. The experimental results show that the best recognition performance can be achieved by combining children's training data with adult training data of approximately the same size and initializing the DNN with the weights obtained by pre-training on the full training set of the adult corpus. This system outperforms the baseline system trained on only children's speech with an overall relative WER reduction of 11.9%. Among the three speaking tasks studied, the picture narration task shows the largest gain, with a WER reduction from 24.6% to 20.1%.
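The adaptation recipe described above amounts to pre-training a DNN acoustic model on adult data and then continuing training on pooled child and adult data from the pre-trained weights. The following sketch shows that flow under illustrative assumptions (layer sizes, senone count, and toy random data standing in for real alignments); it is not the paper's Kaldi-based setup.

```python
# Minimal sketch: pre-train on an adult corpus, then re-train on pooled data
# starting from the adult-initialized weights rather than from scratch.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

FEAT_DIM, SENONES = 40, 500

def toy_loader(n):
    # Stand-in for real acoustic features and senone alignments.
    feats = torch.randn(n, FEAT_DIM)
    targets = torch.randint(0, SENONES, (n,))
    return DataLoader(TensorDataset(feats, targets), batch_size=32)

def make_dnn():
    return nn.Sequential(
        nn.Linear(FEAT_DIM, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, SENONES),
    )

def train(model, loader, epochs=1):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, senones in loader:
            opt.zero_grad()
            loss_fn(model(feats), senones).backward()
            opt.step()

# 1) Pre-train on the (toy) adult corpus.
dnn = make_dnn()
train(dnn, toy_loader(2000))

# 2) Continue training on pooled child + adult data of roughly equal size,
#    keeping the adult-initialized weights.
train(dnn, toy_loader(2000))
```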
Annual Meeting of the Special Interest Group on Discourse and Dialogue | 2015
Alexei V. Ivanov; Vikram Ramanarayanan; David Suendermann-Oeft; Melissa Lopez; Keelan Evanini; Jidong Tao
Dialogue interaction with remote interlocutors is a difficult application area for speech recognition technology because of the limited duration of acoustic context available for adaptation, the narrow-band and compressed signal encoding used in telecommunications, the high variability of spontaneous speech, and processing time constraints. It is even more difficult in the case of interacting with non-native speakers because of broader allophonic variation, less canonical prosodic patterns, a higher rate of false starts and incomplete words, unusual word choices, and a lower probability of grammatically well-formed sentences. We present a comparative study of various approaches to speech recognition in a non-native context. Comparing systems in terms of their accuracy and real-time factor, we find that a Kaldi-based Deep Neural Network Acoustic Model (DNN-AM) system with online speaker adaptation by far outperforms the other available methods.
IWSDS | 2019
Zhou Yu; Vikram Ramanarayanan; Patrick Lange; David Suendermann-Oeft
In complex conversation tasks, people react to their interlocutor's state, such as uncertainty and engagement, to improve conversation effectiveness (Forbes-Riley and Litman, Adapting to student uncertainty improves tutoring dialogues, pp. 33–40, 2009 [2]). If a conversational system reacts to a user's state, would that lead to a better conversation experience? To test this hypothesis, we designed and implemented a dialog system that tracks and reacts to a user's state, such as engagement, in real time. We designed and implemented a conversational job interview task based on the proposed framework. The system acts as an interviewer and reacts to the user's disengagement in real time with positive feedback strategies designed to re-engage the user in the job interview process. Experiments suggest that users speak more while interacting with the engagement-coordinated version of the system than with a non-coordinated version. Users also reported that the former system was more engaging and provided a better user experience.
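An engagement-coordinated policy of the kind described above can be pictured as a simple rule: if a real-time engagement estimate falls below a threshold, prepend a positive-feedback prompt before the next interview question. The threshold, prompts, and scoring function below are illustrative placeholders, not the authors' actual strategies or tracker.

```python
# Minimal sketch of an engagement-reactive dialog policy (hypothetical values).
import random

RE_ENGAGE_PROMPTS = [
    "That's a really interesting point, thanks for sharing it.",
    "You're doing great so far. Take your time with the next one.",
]

def engagement_score(video_frame, audio_frame):
    # Stand-in for a real multimodal engagement tracker (e.g., head pose + prosody).
    return random.random()

def next_system_turn(question, video_frame, audio_frame, threshold=0.4):
    if engagement_score(video_frame, audio_frame) < threshold:
        # React to disengagement with positive feedback, then ask the question.
        return random.choice(RE_ENGAGE_PROMPTS) + " " + question
    return question

print(next_system_turn("Why are you interested in this position?", None, None))
```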
Conference of the International Speech Communication Association | 2016
Yao Qian; Jidong Tao; David Suendermann-Oeft; Keelan Evanini; Alexei V. Ivanov; Vikram Ramanarayanan
Recently, text-independent speaker recognition systems with phonetically-aware DNNs, which allow the comparison among different speakers with “soft-aligned” phonetic content, have significantly outperformed standard i-vector based systems [9–12]. However, when applied to speaker recognition on a non-native spontaneous corpus, DNN-based speaker recognition does not show its superior performance, due to the relatively lower accuracy of phonetic content recognition. In this paper, noise-aware features and multi-task learning are investigated to improve the alignment of speech feature frames into the sub-phonemic “senone” space and to “distill” the L1 (native language) information of the test takers into bottleneck features (BNFs), which we refer to as metadata-sensitive BNFs. Experimental results show that the system with metadata-sensitive BNFs can improve speaker recognition performance by a 23.9% relative reduction in equal error rate (EER) compared to the baseline i-vector system. In addition, L1 information is only used to train the BNF extractor, so it does not need to be provided as input for BNF extraction, i-vector extraction, or scoring on the enrollment and evaluation sets, which avoids relying on erroneous L1s claimed by imposters.
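The multi-task bottleneck-feature idea can be sketched as a shared network with a low-dimensional bottleneck layer trained against both a senone head and an L1 head, where only the bottleneck activations are kept as features at extraction time. The layer sizes, loss weighting, and toy data below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a multi-task bottleneck-feature (BNF) extractor.
import torch
import torch.nn as nn

FEAT_DIM, BNF_DIM, SENONES, L1_CLASSES = 40, 60, 500, 10

class MultiTaskBNF(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(FEAT_DIM, 512), nn.ReLU(),
            nn.Linear(512, BNF_DIM), nn.ReLU(),   # bottleneck layer
        )
        self.senone_head = nn.Linear(BNF_DIM, SENONES)
        self.l1_head = nn.Linear(BNF_DIM, L1_CLASSES)

    def forward(self, x):
        bnf = self.trunk(x)
        return bnf, self.senone_head(bnf), self.l1_head(bnf)

model = MultiTaskBNF()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

# Toy training step: L1 labels are needed only here, not at extraction or scoring
# time, mirroring the point made in the abstract.
x = torch.randn(32, FEAT_DIM)
senone_y = torch.randint(0, SENONES, (32,))
l1_y = torch.randint(0, L1_CLASSES, (32,))
bnf, senone_logits, l1_logits = model(x)
loss = ce(senone_logits, senone_y) + 0.3 * ce(l1_logits, l1_y)
opt.zero_grad(); loss.backward(); opt.step()

# At feature-extraction time, keep only the bottleneck activations.
with torch.no_grad():
    bnf_features, _, _ = model(torch.randn(1, FEAT_DIM))
print(bnf_features.shape)  # torch.Size([1, 60])
```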