Alexei V. Ivanov
University of Trento
Publications
Featured research published by Alexei V. Ivanov.
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2012
Alexei V. Ivanov; Giuseppe Riccardi
Automatic emotion recognition from speech is limited by the ability to discover relevant predictive features. The common approach is to extract a very large set of features over a generally long analysis window. In this paper we investigate the applicability of the two-sample Kolmogorov-Smirnov statistical test (KST) to the problem of segmental speech emotion recognition. We train emotion classifiers for each speech segment within an utterance; the segment labels are then combined to predict the dominant emotion label. Our findings show that KST can be successfully used to extract statistically relevant features. The KST criterion is used to optimize the parameters of the statistical segmental analysis, namely the window segment size and shift. We carry out seven binary emotion classification experiments on the Emo-DB corpus and evaluate the impact of the segmental analysis and emotion-specific feature selection.
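A minimal sketch of the kind of KST-based feature screening described above, assuming precomputed per-segment feature matrices for two emotion classes; the data shapes, significance threshold, and ranking step are illustrative assumptions, not the paper's actual configuration:

```python
# Sketch of KS-based feature selection over segmental features.
# Rows = speech segments, columns = features; data below is synthetic.
import numpy as np
from scipy.stats import ks_2samp

def select_features_ks(class_a, class_b, alpha=0.05):
    """Keep feature indices whose distributions differ significantly
    between the two emotion classes under the two-sample KS test."""
    selected = []
    for j in range(class_a.shape[1]):
        stat, p_value = ks_2samp(class_a[:, j], class_b[:, j])
        if p_value < alpha:
            selected.append((j, stat))
    # Rank surviving features by KS statistic, strongest first.
    return sorted(selected, key=lambda t: t[1], reverse=True)

# Toy usage: 200 "angry" and 180 "neutral" segments, 40 features each.
rng = np.random.default_rng(0)
angry = rng.normal(0.0, 1.0, size=(200, 40))
neutral = rng.normal(0.2, 1.0, size=(180, 40))
print(select_features_ks(angry, neutral)[:5])
```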
Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) | 2009
Sebastian Varges; Silvia Quarteroni; Giuseppe Riccardi; Alexei V. Ivanov; Pierluigi Roberti
We have developed a complete spoken dialogue framework that includes rule-based and trainable dialogue managers, speech recognition, spoken language understanding and generation modules, and a comprehensive web visualization interface. We present a spoken dialogue system based on Reinforcement Learning that goes beyond standard rule-based models and computes on-line decisions about the best dialogue moves. Bridging the gap between handcrafted (e.g. rule-based) and adaptive (e.g. based on Partially Observable Markov Decision Processes, POMDPs) dialogue models, this prototype is able to learn high-reward policies in a number of dialogue situations.
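As a rough illustration of what an on-line decision over dialogue moves looks like, the sketch below queries a learned state-action value table for the highest-value move; the state encoding, move names, and Q-values are hypothetical, and the actual system's learner and state space are substantially richer:

```python
# Hedged sketch: pick the highest-value dialogue move from a learned
# Q-table at run time. States, moves, and values here are invented.
Q = {
    ("slot_missing:destination", "ask_destination"): 0.8,
    ("slot_missing:destination", "confirm_destination"): 0.1,
    ("slot_filled:destination", "confirm_destination"): 0.7,
    ("slot_filled:destination", "ask_destination"): 0.2,
}

def best_move(state, moves):
    """Return the dialogue move with the highest learned Q-value."""
    return max(moves, key=lambda m: Q.get((state, m), 0.0))

print(best_move("slot_missing:destination",
                ["ask_destination", "confirm_destination"]))
```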
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2011
Sebastian Varges; Giuseppe Riccardi; Silvia Quarteroni; Alexei V. Ivanov
We address several challenges in applying statistical dialog managers based on Partially Observable Markov Decision Processes (POMDPs) to real-world problems: to deal with large numbers of concepts, we use individual POMDP policies for each concept, and to control the use of the concept policies, the dialog manager uses explicit task structures. The POMDP policies model the confusability of concepts at the value level. In contrast to previous work, we use explicit confusability statistics, including confidence scores derived from real-world data, in the POMDP models. Since data sparseness becomes a key issue when estimating these probabilities, we introduce a form of smoothing of the observation probabilities that maintains the overall concept error rate. We evaluated three POMDP-based dialog systems and a rule-based one in a phone-based user evaluation in a tourist domain. The results show that a POMDP that uses confidence scores, in combination with an improved SLU module, achieves the highest concept precision.
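The abstract does not spell out the smoothing scheme, but one way to smooth observation probabilities while preserving the overall concept error rate is to interpolate only the off-diagonal (confusion) mass of each row toward uniform, leaving the diagonal untouched; a hedged sketch under that assumption:

```python
# Hedged sketch: smooth the off-diagonal entries of an observation
# (confusion) matrix toward uniform while keeping each row's diagonal
# probability fixed, so the overall concept error rate is unchanged.
# The interpolation scheme itself is an assumption, not the paper's.
import numpy as np

def smooth_observations(conf, lam=0.5):
    conf = conf.astype(float)
    n = conf.shape[0]
    smoothed = conf.copy()
    for i in range(n):
        error_mass = 1.0 - conf[i, i]        # total confusion probability
        off = np.delete(conf[i], i)
        uniform = np.full(n - 1, error_mass / (n - 1))
        mixed = lam * off + (1.0 - lam) * uniform  # interpolate to uniform
        smoothed[i, np.arange(n) != i] = mixed
    return smoothed  # row sums and diagonal entries are preserved

conf = np.array([[0.9, 0.1, 0.0],
                 [0.05, 0.9, 0.05],
                 [0.0, 0.2, 0.8]])
print(smooth_observations(conf))
```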
Archive | 2017
Vikram Ramanarayanan; David Suendermann-Oeft; Patrick Lange; Robert Mundkowsky; Alexei V. Ivanov; Zhou Yu; Yao Qian; Keelan Evanini
As dialog systems become increasingly multimodal and distributed in nature with advances in technology and computing power, they become that much more complicated to design and implement. However, open industry and W3C standards provide a silver lining here, allowing the distributed design of different components that are nonetheless compliant with each other. In this chapter we examine how an open-source, modular, multimodal dialog system—HALEF—can be seamlessly assembled, much like a jigsaw puzzle, by putting together multiple distributed components that are compliant with the W3C recommendations or other open industry standards. We highlight the specific standards that HALEF currently uses along with a perspective on other useful standards that could be included in the future. HALEF has an open codebase to encourage progressive community contribution and a common standard testbed for multimodal dialog system development and benchmarking.
International Workshop on Spoken Dialogue Systems (IWSDS) | 2017
Zhou Yu; Vikram Ramanarayanan; Robert Mundkowsky; Patrick Lange; Alexei V. Ivanov; Alan W. Black; David Suendermann-Oeft
We present an open-source web-based multimodal dialog framework, “Multimodal HALEF”, that integrates video conferencing and telephony capabilities into the existing HALEF cloud-based dialog framework via the FreeSWITCH video telephony server. Owing to its distributed, cloud-based architecture, Multimodal HALEF allows researchers to collect video and speech data from participants interacting with the dialog system outside of traditional lab settings, greatly reducing the cost and labor incurred by the traditional audio-visual data collection process. The framework is equipped with a set of tools, including a web-based user survey template, a speech transcription, annotation and rating portal, a web visual processing server that performs head tracking, and a database that logs full-call audio and video recordings as well as other call-specific information. We present observations from an initial data collection based on a job interview application. Finally, we report on some future plans for development of the framework.
Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) | 2015
Vikram Ramanarayanan; David Suendermann-Oeft; Alexei V. Ivanov; Keelan Evanini
We have previously presented HALEF, an open-source spoken dialog system that supports telephonic interfaces and has a distributed architecture. In this paper, we extend this infrastructure to be cloud-based, and thus truly distributed and scalable. This cloud-based spoken dialog system can be accessed both via telephone interfaces and through web clients with WebRTC/HTML5 integration, allowing in-browser access to potentially multimodal dialog applications. We demonstrate the versatility of the system with two conversation applications in the educational domain.
Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) | 2015
Alexei V. Ivanov; Vikram Ramanarayanan; David Suendermann-Oeft; Melissa Lopez; Keelan Evanini; Jidong Tao
Dialogue interaction with remote interlocutors is a difficult application area for speech recognition technology because of the limited duration of acoustic context available for adaptation, the narrow-band and compressed signal encoding used in telecommunications, the high variability of spontaneous speech, and processing-time constraints. It is even more difficult when interacting with non-native speakers because of broader allophonic variation, less canonical prosodic patterns, a higher rate of false starts and incomplete words, unusual word choices, and a lower probability of grammatically well-formed sentences. We present a comparative study of various approaches to speech recognition in a non-native context. Comparing systems in terms of accuracy and real-time factor, we find that a Kaldi-based Deep Neural Network Acoustic Model (DNN-AM) system with online speaker adaptation by far outperforms the other available methods.
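A short sketch of the two metrics the comparison rests on, word error rate (WER) computed by Levenshtein alignment and real-time factor (RTF) as processing time divided by audio duration; the example strings and timings below are made up:

```python
# Sketch of the comparison metrics: WER via word-level edit distance,
# and RTF = processing time / audio duration. Example values are invented.
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

def rtf(processing_seconds, audio_seconds):
    return processing_seconds / audio_seconds

print(wer("please book a flight", "please look a flight"))  # 0.25
print(rtf(12.3, 30.0))  # 0.41: faster than real time
```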
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2009
Sebastian Varges; Giuseppe Riccardi; Silvia Quarteroni; Alexei V. Ivanov
Conversational systems use deterministic rules that trigger actions such as requests for confirmation or clarification. More recently, Reinforcement Learning and (Partially Observable) Markov Decision Processes have been proposed for this task. In this paper, we investigate action selection strategies for dialogue management, in particular the exploration/exploitation trade-off and its impact on final reward (i.e. the session reward after optimization has ended) and lifetime reward (i.e. the overall reward accumulated over the learner's lifetime). We propose to use interleaved exploitation sessions as a learning methodology to assess the reward obtained from the current policy. The experiments show a statistically significant difference in the final reward of exploitation-only sessions between a system that optimizes lifetime reward and one that maximizes the reward of the final policy.
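A toy sketch of the interleaved-exploitation methodology: epsilon-greedy learning with a greedy-only probe session every K sessions to measure the current policy's reward. The two-armed bandit stands in for a dialogue environment; the action names, payoffs, and schedule are invented:

```python
# Hedged sketch: epsilon-greedy learning with greedy-only "exploitation
# sessions" interleaved every 10th session to probe the current policy.
import random

true_reward = {"confirm": 0.7, "ask_again": 0.4}   # hidden arm payoffs
q = {a: 0.0 for a in true_reward}
counts = {a: 0 for a in true_reward}
exploit_rewards = []

for session in range(1, 2001):
    greedy = session % 10 == 0                 # every 10th: exploit only
    if greedy or random.random() > 0.2:        # epsilon = 0.2
        action = max(q, key=q.get)
    else:
        action = random.choice(list(q))
    reward = 1.0 if random.random() < true_reward[action] else 0.0
    if greedy:
        exploit_rewards.append(reward)         # probe of the greedy policy
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]  # incremental mean

print(q, sum(exploit_rewards) / len(exploit_rewards))
```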
Conference of the International Speech Communication Association (INTERSPEECH) | 2016
Yao Qian; Jidong Tao; David Suendermann-Oeft; Keelan Evanini; Alexei V. Ivanov; Vikram Ramanarayanan
Recently, text-independent speaker recognition systems with phonetically aware DNNs, which allow comparison among different speakers with “soft-aligned” phonetic content, have significantly outperformed standard i-vector based systems [9-12]. However, when applied to speaker recognition on a non-native spontaneous corpus, DNN-based speaker recognition does not show its superior performance, due to the relatively lower accuracy of phonetic content recognition. In this paper, noise-aware features and multi-task learning are investigated to improve the alignment of speech feature frames into the sub-phonemic “senone” space and to “distill” the L1 (native language) information of the test takers into bottleneck features (BNFs), which we refer to as metadata-sensitive BNFs. Experimental results show that the system with metadata-sensitive BNFs can improve speaker recognition performance by a 23.9% relative reduction in equal error rate (EER) compared to the baseline i-vector system. In addition, the L1 information is only used to train the BNF extractor, so it need not be provided as input for BNF extraction, i-vector extraction, or scoring on the enrollment and evaluation sets, which avoids relying on erroneous L1s claimed by impostors.
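A minimal sketch of the equal error rate (EER) metric the abstract reports, computed by sweeping a decision threshold over target and impostor scores; the Gaussian scores below are synthetic:

```python
# Sketch of EER: the operating point where the false-accept rate and
# false-reject rate cross. Scores here are synthetic, not system output.
import numpy as np

def eer(target_scores, impostor_scores):
    """Find the threshold where false accepts and false rejects balance."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best = (1.0, 0.0)
    for t in thresholds:
        far = np.mean(impostor_scores >= t)   # impostors wrongly accepted
        frr = np.mean(target_scores < t)      # targets wrongly rejected
        if abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2.0

rng = np.random.default_rng(1)
targets = rng.normal(2.0, 1.0, 1000)      # genuine-speaker trial scores
impostors = rng.normal(0.0, 1.0, 1000)    # impostor trial scores
print(f"EER = {eer(targets, impostors):.3f}")
```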
Engineering Interactive Computing Systems (EICS) | 2011
Bernd Ludwig; Martin Hacker; Richard Schaller; Bjoern Zenker; Alexei V. Ivanov; Giuseppe Riccardi
Providing navigation assistance to users is a complex task generally consisting of two phases: planning a tour (phase one) and supporting the user during the tour (phase two). In the first phase, users interface with databases via constrained or natural language interaction to acquire prior knowledge such as bus schedules. In the second phase, unexpected external events, such as delays or accidents, often happen, user preferences change, or new needs arise. This requires machine intelligence to support users in the real-time navigation task, updating information and replanning the trip. To provide assistance in phase two, a navigation system must monitor external events, detect anomalies between the current situation and the plan built in the first phase, and provide assistance when the plan has become infeasible. In this paper we present a prototypical mobile speech-controlled navigation system that provides assistance in both phases. The system was designed based on implications from an analysis of real user assistance needs investigated in a diary study, which underlines the vital importance of assistance in phase two.
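A hedged sketch of the phase-two monitoring idea: compare incoming delay events against the planned itinerary and flag replanning once a connection becomes unreachable. The itinerary and event formats are invented for illustration:

```python
# Hedged sketch of phase-two plan monitoring: apply reported delays to
# the itinerary and detect when a connection can no longer be made.
from dataclasses import dataclass

@dataclass
class Leg:
    line: str
    departs: int   # minutes after midnight
    arrives: int

plan = [Leg("Bus 5", departs=600, arrives=620),
        Leg("Train R2", departs=630, arrives=700)]

def plan_feasible(plan, delays):
    """The plan breaks when a delayed arrival misses the next departure."""
    prev_arrival = 0
    for leg in plan:
        depart = leg.departs + delays.get(leg.line, 0)
        if depart < prev_arrival:
            return False          # connection missed: trigger replanning
        prev_arrival = leg.arrives + delays.get(leg.line, 0)
    return True

print(plan_feasible(plan, {}))              # True: everything on time
print(plan_feasible(plan, {"Bus 5": 15}))   # False: 635 arrival, 630 train
```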