David Suendermann-Oeft
Educational Testing Service
Publications
Featured research published by David Suendermann-Oeft.
International Conference on Multimodal Interfaces | 2015
Vikram Ramanarayanan; Chee Wee Leong; Lei Chen; Gary Feng; David Suendermann-Oeft
We analyze how fusing features obtained from different multimodal data streams such as speech, face, body movement and emotion tracks can be applied to the scoring of multimodal presentations. We compute both time-aggregated and time-series-based features from these data streams: the former are statistical functionals and other cumulative features computed over the entire time series, while the latter, dubbed histograms of co-occurrences, capture how different prototypical body posture or facial configurations co-occur within different time lags of each other over the evolution of the multimodal, multivariate time series. We examine the relative utility of these features, along with curated speech-stream features, in predicting human-rated scores of multiple aspects of presentation proficiency. We find that different modalities are useful in predicting different aspects, even outperforming a naive human inter-rater agreement baseline for a subset of the aspects analyzed.
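As a rough illustration of the histogram-of-co-occurrences idea in this abstract, the sketch below counts how often pairs of prototypical posture/face clusters co-occur at several time lags and normalizes the counts into a fixed-length vector. The cluster IDs, lag set and normalization are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch, assuming each frame of a multimodal time series has already been
# quantized into one of K prototypical posture/face clusters.
import numpy as np

def cooccurrence_histogram(cluster_ids, num_clusters, lags=(1, 5, 10)):
    """Count how often cluster i is followed by cluster j at each time lag,
    then normalize so the feature is independent of sequence length."""
    feats = []
    for lag in lags:
        counts = np.zeros((num_clusters, num_clusters))
        for a, b in zip(cluster_ids[:-lag], cluster_ids[lag:]):
            counts[a, b] += 1
        total = counts.sum()
        if total > 0:
            counts /= total
        feats.append(counts.ravel())
    return np.concatenate(feats)  # one fixed-length vector per recording

# Example: a toy sequence of 100 frames quantized into 4 prototype clusters.
rng = np.random.default_rng(0)
ids = rng.integers(0, 4, size=100)
print(cooccurrence_histogram(ids, num_clusters=4).shape)  # (3 lags * 4 * 4,) = (48,)
```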
Archive | 2017
Vikram Ramanarayanan; David Suendermann-Oeft; Patrick Lange; Robert Mundkowsky; Alexei V. Ivanov; Zhou Yu; Yao Qian; Keelan Evanini
As dialog systems become increasingly multimodal and distributed in nature with advances in technology and computing power, they become that much more complicated to design and implement. However, open industry and W3C standards provide a silver lining here, allowing the distributed design of different components that are nonetheless compliant with each other. In this chapter we examine how an open-source, modular, multimodal dialog system—HALEF—can be seamlessly assembled, much like a jigsaw puzzle, by putting together multiple distributed components that are compliant with the W3C recommendations or other open industry standards. We highlight the specific standards that HALEF currently uses along with a perspective on other useful standards that could be included in the future. HALEF has an open codebase to encourage progressive community contribution and a common standard testbed for multimodal dialog system development and benchmarking.
IWSDS | 2017
Zhou Yu; Vikram Ramanarayanan; Robert Mundkowsky; Patrick Lange; Alexei V. Ivanov; Alan W. Black; David Suendermann-Oeft
We present an open-source web-based multimodal dialog framework, “Multimodal HALEF”, that integrates video conferencing and telephony abilities into the existing HALEF cloud-based dialog framework via the FreeSWITCH video telephony server. Due to its distributed and cloud-based architecture, Multimodal HALEF allows researchers to collect video and speech data from participants interacting with the dialog system outside of traditional lab settings, thereby largely reducing the cost and labor incurred during traditional audio-visual data collection. The framework is equipped with a set of tools, including a web-based user survey template, a speech transcription, annotation and rating portal, a web-based visual processing server that performs head tracking, and a database that logs full-call audio and video recordings as well as other call-specific information. We present observations from an initial data collection based on a job interview application. Finally, we report on future plans for development of the framework.
IEEE Automatic Speech Recognition and Understanding Workshop | 2015
Zhou Yu; Vikram Ramanarayanan; David Suendermann-Oeft; Xinhao Wang; Klaus Zechner; Lei Chen; Jidong Tao; Aliaksei Ivanou; Yao Qian
We introduce a new method to grade non-native spoken language tests automatically. Traditional automated response grading approaches use manually engineered time-aggregated features (such as mean length of pauses). We propose to incorporate general time-sequence features (such as pitch), which preserve more information than time-aggregated features and do not require human effort to design. We use a type of recurrent neural network to jointly optimize the learning of high-level abstractions from time-sequence features with the time-aggregated features. We first automatically learn high-level abstractions from time-sequence features with a Bidirectional Long Short-Term Memory (BLSTM) and then combine the high-level abstractions with time-aggregated features in a Multilayer Perceptron (MLP)/Linear Regression (LR). We optimize the BLSTM and the MLP/LR jointly. We find that such models reach the best performance in terms of correlation with human raters. We also find that when there are limited time-aggregated features available, our model that incorporates time-sequence features improves performance drastically.
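The joint architecture described here can be sketched as a BLSTM over frame-level sequence features whose pooled output is concatenated with hand-crafted aggregate features and fed to an MLP regressor, with a single loss driving both parts. The feature dimensions, hidden sizes and pooling choice below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of jointly optimizing a BLSTM (time-sequence features) with an
# MLP scorer (time-aggregated features).
import torch
import torch.nn as nn

SEQ_DIM, AGG_DIM, HIDDEN = 3, 20, 32

class JointScorer(nn.Module):
    def __init__(self):
        super().__init__()
        # BLSTM learns an abstraction of the time-sequence features.
        self.blstm = nn.LSTM(SEQ_DIM, HIDDEN, batch_first=True, bidirectional=True)
        # MLP scores the concatenation of the BLSTM summary and aggregate features.
        self.mlp = nn.Sequential(
            nn.Linear(2 * HIDDEN + AGG_DIM, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, seq, agg):
        out, _ = self.blstm(seq)      # (batch, time, 2*HIDDEN)
        summary = out.mean(dim=1)     # pool over time
        return self.mlp(torch.cat([summary, agg], dim=1)).squeeze(-1)

# Both components share one loss, so they are optimized jointly.
model = JointScorer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seq = torch.randn(8, 100, SEQ_DIM)   # toy batch: 8 responses, 100 frames each
agg = torch.randn(8, AGG_DIM)        # toy hand-crafted aggregate features
scores = torch.rand(8)               # toy human proficiency scores
loss = nn.functional.mse_loss(model(seq, agg), scores)
opt.zero_grad(); loss.backward(); opt.step()
```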
Annual Meeting of the Special Interest Group on Discourse and Dialogue | 2015
Vikram Ramanarayanan; David Suendermann-Oeft; Alexei V. Ivanov; Keelan Evanini
We have previously presented HALEF, an open-source spoken dialog system that supports telephonic interfaces and has a distributed architecture. In this paper, we extend this infrastructure to be cloud-based, and thus truly distributed and scalable. This cloud-based spoken dialog system can be accessed both via telephone interfaces and through web clients with WebRTC/HTML5 integration, allowing in-browser access to potentially multimodal dialog applications. We demonstrate the versatility of the system with two conversational applications in the educational domain.
Natural Language Dialog Systems and Intelligent Assistants | 2015
David Suendermann-Oeft; Vikram Ramanarayanan; Moritz Teckenbrock; Felix Neutatz; Dennis Schmidt
We describe completed and ongoing research on HALEF, a telephony-based open-source spoken dialog system that can be used with different plug-and-play back-end modules. We present two examples of such modules: one classifies whether the person calling into the system is intoxicated, and the other is a question answering application. The system is compliant with World Wide Web Consortium and related industry standards while maintaining an open codebase to encourage progressive development and a common standard testbed for spoken dialog system development and benchmarking. The system can be deployed for a versatile range of potential applications, including intelligent tutoring, language learning, and assessment.
Workshop on Child Computer Interaction | 2016
Yao Qian; Xinhao Wang; Keelan Evanini; David Suendermann-Oeft
Acoustic models for state-of-the-art DNN-based speech recognition systems are typically trained using at least several hundred hours of task-specific training data. However, this amount of training data is not always available for some applications. In this paper, we investigate how to use an adult speech corpus to improve DNN-based automatic speech recognition for non-native children's speech. Although there are many acoustic and linguistic mismatches between the speech of adults and children, adult speech can still be used to boost the performance of a speech recognizer for children using acoustic modeling techniques based on the DNN framework. The experimental results show that the best recognition performance can be achieved by combining children's training data with adult training data of approximately the same size and initializing the DNN with the weights obtained by pre-training on the full training set of the adult corpus. This system outperforms the baseline system trained on only children's speech with an overall relative WER reduction of 11.9%. Among the three speaking tasks studied, the picture narration task shows the largest gain, with a WER reduction from 24.6% to 20.1%.
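The adaptation recipe described above amounts to pre-training a DNN acoustic model on adult data and then continuing training on pooled child and adult data from the pre-trained weights. The following sketch shows that flow under illustrative assumptions (layer sizes, senone count, and toy random data standing in for real alignments); it is not the paper's Kaldi-based setup.

```python
# Minimal sketch: pre-train on an adult corpus, then re-train on pooled data
# starting from the adult-initialized weights rather than from scratch.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

FEAT_DIM, SENONES = 40, 500

def toy_loader(n):
    # Stand-in for real acoustic features and senone alignments.
    feats = torch.randn(n, FEAT_DIM)
    targets = torch.randint(0, SENONES, (n,))
    return DataLoader(TensorDataset(feats, targets), batch_size=32)

def make_dnn():
    return nn.Sequential(
        nn.Linear(FEAT_DIM, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, SENONES),
    )

def train(model, loader, epochs=1):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, senones in loader:
            opt.zero_grad()
            loss_fn(model(feats), senones).backward()
            opt.step()

# 1) Pre-train on the (toy) adult corpus.
dnn = make_dnn()
train(dnn, toy_loader(2000))

# 2) Continue training on pooled child + adult data of roughly equal size,
#    keeping the adult-initialized weights.
train(dnn, toy_loader(2000))
```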
Annual Meeting of the Special Interest Group on Discourse and Dialogue | 2015
Alexei V. Ivanov; Vikram Ramanarayanan; David Suendermann-Oeft; Melissa Lopez; Keelan Evanini; Jidong Tao
Dialogue interaction with remote interlocutors is a difficult application area for speech recognition technology because of the limited duration of acoustic context available for adaptation, the narrow-band and compressed signal encoding used in telecommunications, the high variability of spontaneous speech, and processing time constraints. It is even more difficult in the case of interacting with non-native speakers because of broader allophonic variation, less canonical prosodic patterns, a higher rate of false starts and incomplete words, unusual word choices, and a lower probability of grammatically well-formed sentences. We present a comparative study of various approaches to speech recognition in a non-native context. Comparing systems in terms of their accuracy and real-time factor, we find that a Kaldi-based Deep Neural Network Acoustic Model (DNN-AM) system with online speaker adaptation by far outperforms the other available methods.
IWSDS | 2019
Zhou Yu; Vikram Ramanarayanan; Patrick Lange; David Suendermann-Oeft
In complex conversation tasks, people react to their interlocutor's state, such as uncertainty and engagement, to improve conversation effectiveness (Forbes-Riley and Litman, Adapting to student uncertainty improves tutoring dialogues, pp. 33–40, 2009 [2]). If a conversational system reacts to a user's state, would that lead to a better conversation experience? To test this hypothesis, we designed and implemented a dialog system that tracks and reacts to a user's state, such as engagement, in real time. We designed and implemented a conversational job interview task based on the proposed framework. The system acts as an interviewer and reacts to the user's disengagement in real time with positive feedback strategies designed to re-engage the user in the job interview process. Experiments suggest that users speak more while interacting with the engagement-coordinated version of the system than with a non-coordinated version. Users also reported that the former system was more engaging and provided a better user experience.
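An engagement-coordinated policy of the kind described above can be pictured as a simple rule: if a real-time engagement estimate falls below a threshold, prepend a positive-feedback prompt before the next interview question. The threshold, prompts, and scoring function below are illustrative placeholders, not the authors' actual strategies or tracker.

```python
# Minimal sketch of an engagement-reactive dialog policy (hypothetical values).
import random

RE_ENGAGE_PROMPTS = [
    "That's a really interesting point, thanks for sharing it.",
    "You're doing great so far. Take your time with the next one.",
]

def engagement_score(video_frame, audio_frame):
    # Stand-in for a real multimodal engagement tracker (e.g., head pose + prosody).
    return random.random()

def next_system_turn(question, video_frame, audio_frame, threshold=0.4):
    if engagement_score(video_frame, audio_frame) < threshold:
        # React to disengagement with positive feedback, then ask the question.
        return random.choice(RE_ENGAGE_PROMPTS) + " " + question
    return question

print(next_system_turn("Why are you interested in this position?", None, None))
```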
Conference of the International Speech Communication Association | 2016
Yao Qian; Jidong Tao; David Suendermann-Oeft; Keelan Evanini; Alexei V. Ivanov; Vikram Ramanarayanan
Recently, text-independent speaker recognition systems with phonetically-aware DNNs, which allow the comparison among different speakers with “soft-aligned” phonetic content, have significantly outperformed standard i-vector based systems [9–12]. However, when applied to speaker recognition on a non-native spontaneous corpus, DNN-based speaker recognition does not show its superior performance, due to the relatively lower accuracy of phonetic content recognition. In this paper, noise-aware features and multi-task learning are investigated to improve the alignment of speech feature frames into the sub-phonemic “senone” space and to “distill” the L1 (native language) information of the test takers into bottleneck features (BNFs), which we refer to as metadata-sensitive BNFs. Experimental results show that the system with metadata-sensitive BNFs can improve speaker recognition performance by a 23.9% relative reduction in equal error rate (EER) compared to the baseline i-vector system. In addition, L1 information is only used to train the BNF extractor, so it does not need to be provided as input for BNF extraction, i-vector extraction, or scoring on the enrollment and evaluation sets, which avoids relying on erroneous L1s claimed by imposters.
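The multi-task bottleneck-feature idea can be sketched as a shared network with a low-dimensional bottleneck layer trained against both a senone head and an L1 head, where only the bottleneck activations are kept as features at extraction time. The layer sizes, loss weighting, and toy data below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a multi-task bottleneck-feature (BNF) extractor.
import torch
import torch.nn as nn

FEAT_DIM, BNF_DIM, SENONES, L1_CLASSES = 40, 60, 500, 10

class MultiTaskBNF(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(FEAT_DIM, 512), nn.ReLU(),
            nn.Linear(512, BNF_DIM), nn.ReLU(),   # bottleneck layer
        )
        self.senone_head = nn.Linear(BNF_DIM, SENONES)
        self.l1_head = nn.Linear(BNF_DIM, L1_CLASSES)

    def forward(self, x):
        bnf = self.trunk(x)
        return bnf, self.senone_head(bnf), self.l1_head(bnf)

model = MultiTaskBNF()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

# Toy training step: L1 labels are needed only here, not at extraction or scoring
# time, mirroring the point made in the abstract.
x = torch.randn(32, FEAT_DIM)
senone_y = torch.randint(0, SENONES, (32,))
l1_y = torch.randint(0, L1_CLASSES, (32,))
bnf, senone_logits, l1_logits = model(x)
loss = ce(senone_logits, senone_y) + 0.3 * ce(l1_logits, l1_y)
opt.zero_grad(); loss.backward(); opt.step()

# At feature-extraction time, keep only the bottleneck activations.
with torch.no_grad():
    bnf_features, _, _ = model(torch.randn(1, FEAT_DIM))
print(bnf_features.shape)  # torch.Size([1, 60])
```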