Pirros Tsiakoulis
University of Cambridge
Publications
Featured research published by Pirros Tsiakoulis.
Spoken Language Technology Workshop | 2012
Matthew Henderson; Milica Gasic; Blaise Thomson; Pirros Tsiakoulis; Kai Yu; Steve J. Young
Current commercial dialogue systems typically use hand-crafted grammars for Spoken Language Understanding (SLU) operating on the top one or two hypotheses output by the speech recogniser. These systems are expensive to develop and they suffer from significant degradation in performance when faced with recognition errors. This paper presents a robust method for SLU based on features extracted from the full posterior distribution of recognition hypotheses encoded in the form of word confusion networks. Following [1], the system uses SVM classifiers operating on n-gram features, trained on unaligned input/output pairs. Performance is evaluated on both an off-line corpus and on-line in a live user trial. It is shown that a statistical discriminative approach to SLU operating on the full posterior ASR output distribution can substantially improve performance both in terms of accuracy and overall dialogue reward. Furthermore, additional gains can be obtained by incorporating features from the previous system output.
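The distinctive step in the approach above is extracting features from the full confusion network rather than a single top hypothesis: each n-gram's feature value is its expected count under the posterior. A minimal sketch (toy data, not the authors' code; in the paper these features feed SVM classifiers):

```python
# Toy sketch: posterior-weighted unigram features from a word confusion
# network. Each slot holds (word, posterior) alternatives; a word's feature
# value is the sum of its posteriors across slots.
from collections import Counter

def cn_unigram_features(confusion_network):
    feats = Counter()
    for slot in confusion_network:
        for word, posterior in slot:
            feats[word] += posterior
    return dict(feats)

cn = [
    [("book", 0.7), ("cook", 0.3)],
    [("a", 0.9), ("the", 0.1)],
    [("table", 0.6), ("cable", 0.4)],
]
feats = cn_unigram_features(cn)
```

A 1-best system would see only "book a table" with binary counts; here the competing hypotheses contribute fractional counts, which is what gives the classifier robustness to recognition errors.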
International Conference on Acoustics, Speech, and Signal Processing | 2013
Milica Gasic; Catherine Breslin; Matthew Henderson; Dongho Kim; Martin Szummer; Blaise Thomson; Pirros Tsiakoulis; Steve J. Young
A partially observable Markov decision process has been proposed as a dialogue model that enables robustness to speech recognition errors and automatic policy optimisation using reinforcement learning (RL). However, conventional RL algorithms require a very large number of dialogues, necessitating a user simulator. Recently, Gaussian processes have been shown to substantially speed up the optimisation, making it possible to learn directly from interaction with human users. However, early studies have been limited to very low dimensional spaces and the learning has exhibited convergence problems. Here we investigate learning from human interaction using the Bayesian Update of Dialogue State system. This dynamic Bayesian network based system has an optimisation space covering more than one hundred features, allowing a wide range of behaviours to be learned. Using an improved policy model and a more robust reward function, we show that stable learning can be achieved that significantly outperforms a simulator trained policy.
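The speed-up comes from Gaussian process regression generalising observed returns across similar belief states. A minimal GP posterior-mean sketch in NumPy (illustrative only; the paper's GP-SARSA algorithm and its kernel over the hundred-feature belief space are more involved):

```python
import numpy as np

# Minimal GP regression sketch: posterior mean of a value estimate at a
# test point, given a few observed (belief-feature, return) pairs.
def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_mean(X, y, x_star, noise=0.1):
    K = rbf(X, X) + noise**2 * np.eye(len(X))
    k_star = rbf(x_star, X)
    return k_star @ np.linalg.solve(K, y)

X = np.array([[0.0], [1.0], [2.0]])   # toy belief-state features
y = np.array([0.0, 1.0, 0.0])         # toy observed returns
mu = gp_mean(X, y, np.array([[1.0]]))
```

Because nearby points share information through the kernel, far fewer dialogues are needed than with a tabular or purely parametric value estimate, which is what makes learning directly from human users feasible.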
IEEE Transactions on Consumer Electronics | 2009
Sotiris Karabetsos; Pirros Tsiakoulis; Aimilios Chalamandaris; Spyros Raptis
Nowadays, unit selection based text-to-speech technology is the mainstream approach for near natural speech synthesis systems. However, this is achieved at the expense of increased computational requirements. This work describes design and implementation approaches for the efficient integration of this technology in computational environments with limited resources, such as mobile devices, with no considerable speech quality degradation. In particular, the issues of database reduction, acoustic inventory compression and runtime computational load minimization are mainly addressed in this paper. Both objective and subjective assessments confirm the effectiveness of these approaches in terms of constructing a general purpose embedded unit selection TTS system and reducing the computational requirements while maintaining high speech quality.
International Conference on Acoustics, Speech, and Signal Processing | 2015
Milica Gasic; Dongho Kim; Pirros Tsiakoulis; Steve J. Young
Statistical dialogue systems offer the potential to reduce costs by learning policies automatically on-line, but are not designed to scale to large open-domains. This paper proposes a hierarchical distributed dialogue architecture in which policies are organised in a class hierarchy aligned to an underlying knowledge graph. This allows a system to be deployed using a modest amount of data to train a small set of generic policies. As further data is collected, generic policies can be adapted to give in-domain performance. Using Gaussian process-based reinforcement learning, it is shown that within this framework generic policies can be constructed which provide acceptable user performance, and better performance than can be obtained using under-trained domain specific policies. It is also shown that as sufficient in-domain data becomes available, it is possible to seamlessly improve performance, without subjecting users to unacceptable behaviour during the adaptation period and without limiting the final performance compared to policies trained from scratch.
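The core mechanism is a back-off through the policy class hierarchy: use the in-domain policy once enough data has been collected, otherwise fall back to a more generic parent. A toy sketch (domain names, counts, and threshold are all hypothetical):

```python
# Illustrative back-off over a policy class hierarchy (all names and
# numbers hypothetical): choose the most specific policy that has enough
# training dialogues, else climb to a generic parent policy.
HIERARCHY = {"restaurants_SF": "restaurants", "restaurants": "venue", "venue": None}
DIALOGUE_COUNTS = {"restaurants_SF": 40, "restaurants": 500, "venue": 5000}

def select_policy(domain, min_dialogues=100):
    node = domain
    while node is not None:
        if DIALOGUE_COUNTS.get(node, 0) >= min_dialogues:
            return node
        node = HIERARCHY.get(node)
    return "venue"  # root generic policy as a last resort

chosen = select_policy("restaurants_SF")
```

With only 40 in-domain dialogues the generic "restaurants" policy is used, so users never face an under-trained specific policy; as the in-domain count grows past the threshold, the system switches over seamlessly.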
International Conference on Acoustics, Speech, and Signal Processing | 2008
Pirros Tsiakoulis; Aimilios Chalamandaris; Sotiris Karabetsos; Spyros Raptis
This paper presents a new method for reducing an existing speech database so that it can be used for domain-independent embedded unit selection text-to-speech synthesis. The method relies on statistical data produced by the unit selection process on a large text corpus. It utilizes the selection frequency, as well as the actual score, of each unit. Both objective and subjective evaluations of the method are performed in comparison with existing similar techniques.
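The pruning idea can be sketched as follows (a simplified ranking, not the paper's exact formula): synthesise a large corpus, log how often each unit is selected and its average cost, then keep only the most useful units.

```python
# Toy database-reduction sketch (the combined ranking is an assumption):
# rank units by selection frequency, breaking ties by average cost
# (lower cost = better), and keep a fixed fraction.
def prune_units(unit_stats, keep_fraction=0.5):
    # unit_stats: {unit_id: (selection_count, average_cost)}
    ranked = sorted(unit_stats, key=lambda u: (-unit_stats[u][0], unit_stats[u][1]))
    keep = max(1, int(len(ranked) * keep_fraction))
    return set(ranked[:keep])

stats = {"u1": (120, 0.2), "u2": (3, 0.9), "u3": (80, 0.1), "u4": (1, 0.5)}
kept = prune_units(stats, keep_fraction=0.5)
```

Rarely selected units contribute little to synthesis quality but still occupy storage, so dropping them shrinks the database toward an embedded-device footprint.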
IEEE Signal Processing Letters | 2010
Pirros Tsiakoulis; Alexandros Potamianos; Dimitrios Dimitriadis
We propose a novel Automatic Speech Recognition (ASR) front-end that consists of the first central Spectral Moment time-frequency distribution Augmented by low-order Cepstral coefficients (SMAC). We prove that the first central spectral moment is proportional to the spectral derivative with respect to the filter's central frequency. Consequently, the spectral moment is an estimate of the frequency-domain derivative of the speech spectrum. However, information related to the entire speech spectrum, such as the energy and the spectral tilt, is not adequately modeled. We propose adding this information with a few cepstral coefficients. Furthermore, we use a mel-spaced Gabor filterbank with 70% frequency overlap in order to overcome sensitivity to pitch harmonics. The novel SMAC front-end was evaluated on the speech recognition task for a variety of recording conditions. The experimental results show that SMAC performs at least as well as the standard MFCC front-end in clean conditions, and significantly outperforms MFCCs in noisy conditions.
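In discrete form, the first central spectral moment of a band is the power-weighted mean frequency of that band minus the filter's centre frequency. A minimal sketch (assumed discretisation, toy spectrum):

```python
import numpy as np

# Sketch: discrete first central spectral moment of a band-limited
# spectrum, i.e. the power-weighted mean frequency minus the filter's
# centre frequency. Zero for a band symmetric about the centre.
def first_central_moment(freqs, power, centre):
    weighted_mean = np.sum(freqs * power) / np.sum(power)
    return weighted_mean - centre

freqs = np.array([900.0, 1000.0, 1100.0])
power = np.array([1.0, 2.0, 1.0])   # symmetric band around 1000 Hz
m1 = first_central_moment(freqs, power, centre=1000.0)
```

A positive moment indicates spectral mass above the centre frequency and a negative one mass below it, which is how the feature tracks the local slope of the spectrum.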
Spoken Language Technology Workshop | 2012
Milica Gasic; Matthew Henderson; Blaise Thomson; Pirros Tsiakoulis; Steve J. Young
The partially observable Markov decision process (POMDP) has been proposed as a dialogue model that enables automatic improvement of the dialogue policy and robustness to speech understanding errors. It requires, however, a large number of dialogues to train the dialogue policy. Gaussian processes (GP) have recently been applied to POMDP dialogue management optimisation showing an ability to substantially increase the speed of learning. Here, we investigate this further using the Bayesian Update of Dialogue State dialogue manager. We show that it is possible to apply Gaussian processes directly to the belief state, removing the need for a parametric policy representation. In addition, the resulting policy learns significantly faster while maintaining operational performance.
Spoken Language Technology Workshop | 2012
Blaise Thomson; Milica Gasic; Matthew Henderson; Pirros Tsiakoulis; Steve J. Young
A recent trend in spoken dialogue research is the use of reinforcement learning to train dialogue systems in a simulated environment. Past researchers have shown that the types of errors that are simulated can have a significant effect on simulated dialogue performance. Since modern systems typically receive an N-best list of possible user utterances, it is important to be able to simulate a full N-best list of hypotheses. This paper presents a new method for simulating such errors based on logistic regression, as well as a new method for simulating the structure of N-best lists of semantics and their probabilities, based on the Dirichlet distribution. Off-line evaluations show that the new Dirichlet model results in a much closer match to the receiver operating characteristics (ROC) of the live data. Experiments also show that the logistic model gives confusions that are closer to the type of confusions observed in live situations. The hope is that these new error models will be able to improve the resulting performance of trained dialogue systems.
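The Dirichlet part of the model can be illustrated directly: a Dirichlet draw yields a normalised confidence distribution over the N-best hypotheses, and the concentration parameter controls how peaked the simulated lists are. A toy sketch (parameter values are assumptions, not the fitted ones from the paper):

```python
import numpy as np

# Toy sketch of Dirichlet-based N-best confidence simulation: each draw
# is a normalised confidence distribution over N semantic hypotheses.
# Small concentration -> peaked lists (one dominant hypothesis);
# large concentration -> flat, ambiguous lists.
rng = np.random.default_rng(0)

def simulate_confidences(n_best=5, concentration=0.5):
    return rng.dirichlet(np.full(n_best, concentration))

conf = simulate_confidences()
```

Fitting the concentration to live data is what lets the simulator reproduce the confidence-score statistics (and hence the ROC behaviour) that the trained dialogue policy will actually face.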
International Conference on Acoustics, Speech, and Signal Processing | 2009
Pirros Tsiakoulis; Alexandros Potamianos
Several studies have been dedicated to the analysis and modeling of AM-FM modulations in speech, and different algorithms have been proposed for exploiting modulations in speech applications. This paper details a statistical analysis of amplitude modulations using a multi-band AM-FM analysis framework. The aim of this study is to analyze the phonetic and speaker dependency of modulations in the amplitude envelope of speech resonances. The analysis focuses on the dependence of such modulations on acoustic features such as fundamental frequency, formant proximity, and phone identity, as well as speaker identity and contextual features. The results show that the amplitude modulation index of a speech resonance is mainly a function of the speaker's average fundamental frequency, the phone identity, and the proximity between neighboring formant resonances. The results are especially relevant for speech and speaker recognition applications employing modulation features.
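One common definition of the amplitude modulation index (assumed here; the paper's multi-band AM-FM framework extracts the envelope per resonance) is the depth of envelope fluctuation relative to its mean level:

```python
import numpy as np

# Sketch of an amplitude modulation index, using one common definition:
# (A_max - A_min) / (A_max + A_min) over the amplitude envelope.
def modulation_index(envelope):
    a_max, a_min = envelope.max(), envelope.min()
    return (a_max - a_min) / (a_max + a_min)

t = np.linspace(0, 1, 1000, endpoint=False)
envelope = 1.0 + 0.3 * np.cos(2 * np.pi * 125 * t)  # 30% AM at an F0-rate
mi = modulation_index(envelope)
```

For a sinusoidal envelope 1 + m·cos(ωt) the index recovers m exactly, which is why it serves as a clean summary of modulation depth when comparing phones and speakers.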
IEEE Signal Processing Letters | 2010
Sotiris Karabetsos; Pirros Tsiakoulis; Aimilios Chalamandaris; Spyros Raptis
This letter introduces one-class classification as a framework for the spectral join cost calculation in unit selection speech synthesis. Instead of quantifying the spectral cost by a single distance measure, a data-driven approach is adopted which exploits the natural similarity of consecutive speech frames in the speech database. A pair of consecutive frames is jointly represented as a vector of spectral distance measures which provide training data for the one-class classifier. At synthesis runtime, speech units are selected based on the scores derived from the classifier. Experimental results provide evidence on the effectiveness of the proposed method which clearly outperforms the conventional approaches currently employed.
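The idea can be sketched with a stand-in one-class model (a diagonal Gaussian here, where the letter uses a one-class classifier; the distance features are toy values): fit on spectral-distance vectors of naturally consecutive frames, then score candidate joins by how natural they look.

```python
import numpy as np

# Minimal one-class scoring sketch: a diagonal Gaussian fitted only on
# "natural" data (distance vectors of consecutive frames in the database)
# stands in for the one-class classifier. Higher score = more natural join.
class OneClassGaussian:
    def fit(self, X):
        self.mu = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-6
        return self

    def score(self, x):
        # Negative squared Mahalanobis-style distance to the natural class.
        return float(-np.sum((x - self.mu) ** 2 / self.var))

rng = np.random.default_rng(1)
natural = rng.normal(0.1, 0.02, size=(200, 3))   # small distances: natural joins
model = OneClassGaussian().fit(natural)
good = model.score(np.array([0.1, 0.1, 0.1]))
bad = model.score(np.array([0.9, 0.9, 0.9]))     # large spectral mismatch
```

Training needs only the positive (natural) class, which is exactly what a speech database provides for free; no artificial "bad join" examples have to be constructed.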