Publications
Featured research published by Dimitrios Dimitriadis.
IEEE Signal Processing Letters | 2005
Dimitrios Dimitriadis; Petros Maragos; Alexandros Potamianos
In this letter, a nonlinear AM-FM speech model is used to extract robust features for speech recognition. The proposed features measure the amount of amplitude and frequency modulation that exists in speech resonances and attempt to model aspects of the speech acoustic information that the commonly used linear source-filter model fails to capture. The robustness and discriminability of the AM-FM features are investigated in combination with mel-frequency cepstral coefficients (MFCCs). It is shown that these hybrid features perform well in the presence of noise, both in terms of phoneme discrimination (J-measure) and in terms of speech recognition performance on several different tasks. Average relative error rate reductions of up to 11% for clean and 46% for mismatched noisy conditions are achieved when AM-FM features are combined with MFCCs.
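To make the AM-FM model concrete: each speech resonance is represented as a cosine whose amplitude and instantaneous frequency both vary slowly in time. The following is a minimal, self-contained sketch of such a signal (not the authors' code; all parameter names here are hypothetical illustrations):

```python
import math

def am_fm_signal(n_samples, fs, fc, am_depth, fm_dev, mod_freq):
    """Synthesize x[n] = a[n] * cos(phi[n]) with sinusoidal AM and FM.

    a[n]   = 1 + am_depth * cos(2*pi*mod_freq*t)   (amplitude modulation)
    f[n]   = fc + fm_dev * sin(2*pi*mod_freq*t)    (instantaneous frequency, Hz)
    phi[n] = running phase, accumulated from f[n].
    """
    x = []
    phi = 0.0
    for n in range(n_samples):
        t = n / fs
        a = 1.0 + am_depth * math.cos(2 * math.pi * mod_freq * t)
        f = fc + fm_dev * math.sin(2 * math.pi * mod_freq * t)
        phi += 2 * math.pi * f / fs   # integrate frequency into phase
        x.append(a * math.cos(phi))
    return x
```

The "amount of amplitude and frequency modulation" the features measure corresponds to the depth parameters `am_depth` and `fm_dev` of such a resonance; a plain source-filter model treats both as constant within a frame.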
Speech Communication | 2006
Dimitrios Dimitriadis; Petros Maragos
Speech resonance signals appear to contain significant amplitude and frequency modulations. An efficient demodulation approach is based on energy operators. In this paper, we develop two new robust methods for energy-based speech demodulation and compare their performance on both synthetic test signals and actual speech signals. The first method uses smoothing splines for discrete-to-continuous signal approximation. The second (and best) method uses time-derivatives of Gabor filters. Further, we apply the best demodulation method to explore the statistical distribution of speech modulation features and study their properties with regard to speech classification and recognition applications. Finally, we present some preliminary recognition results and highlight their improvements over the corresponding MFCC results.
international conference on acoustics, speech, and signal processing | 2001
Dimitrios Dimitriadis; Petros Maragos
A new algorithm is proposed for demodulating discrete-time AM-FM signals, which first interpolates the signals with smooth splines and then uses the continuous-time energy separation algorithm (ESA) based on the Teager-Kaiser energy operator. This spline-based ESA retains the excellent time resolution of the ESA based on discrete energy operators but performs better in the presence of noise. Further, its dependence on smooth splines allows some optimal trade-off between data fitting versus smoothing.
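For context, the energy separation algorithm mentioned above also has a fully discrete counterpart. The sketch below is the standard discrete DESA-2 variant built on the Teager-Kaiser operator Psi[x](n) = x(n)^2 - x(n-1)x(n+1), not the spline-based continuous-time algorithm this paper proposes:

```python
import math

def desa2(x):
    """Discrete energy separation (DESA-2 variant).

    Returns per-sample estimates of instantaneous frequency (rad/sample)
    and amplitude envelope, valid for clean narrowband signals.
    """
    freqs, amps = [], []
    for n in range(2, len(x) - 2):
        # Teager-Kaiser energy of the signal itself
        psi_x = x[n] ** 2 - x[n - 1] * x[n + 1]
        # Symmetric differences y(n) = x(n+1) - x(n-1) around n
        y0 = x[n + 1] - x[n - 1]
        y_m1 = x[n] - x[n - 2]
        y_p1 = x[n + 2] - x[n]
        # Teager-Kaiser energy of the difference signal
        psi_y = y0 ** 2 - y_m1 * y_p1
        # Frequency and amplitude estimates (argument clamped for safety)
        arg = max(-1.0, min(1.0, 1.0 - psi_y / (2.0 * psi_x)))
        freqs.append(0.5 * math.acos(arg))
        amps.append(2.0 * psi_x / math.sqrt(psi_y))
    return freqs, amps
```

For a pure sinusoid A*cos(omega*n + phi), both estimates are exact: the frequency track returns omega and the amplitude track returns A at every sample. In noise the raw estimates degrade, which is precisely the motivation for the smoothed, spline-based ESA of this paper.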
IEEE Signal Processing Letters | 2010
Pirros Tsiakoulis; Alexandros Potamianos; Dimitrios Dimitriadis
We propose a novel Automatic Speech Recognition (ASR) front-end that consists of the first central Spectral Moment time-frequency distribution Augmented by low-order Cepstral coefficients (SMAC). We prove that the first central spectral moment is proportional to the spectral derivative with respect to the filter's central frequency. Consequently, the spectral moment is an estimate of the frequency-domain derivative of the speech spectrum. However, information related to the entire speech spectrum, such as the energy and the spectral tilt, is not adequately modeled. We propose adding this information via a few cepstral coefficients. Furthermore, we use a mel-spaced Gabor filterbank with 70% frequency overlap in order to overcome the sensitivity to pitch harmonics. The novel SMAC front-end was evaluated on the speech recognition task under a variety of recording conditions. The experimental results have shown that SMAC performs at least as well as the standard MFCC front-end in clean conditions, and significantly outperforms MFCCs in noisy conditions.
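To make the moment feature concrete: for one filter with weights W(f) applied to a magnitude spectrum S(f), a normalized first spectral moment is sum(f * W * S) / sum(W * S), i.e. the amplitude-weighted mean frequency inside that subband. A toy sketch with a naive DFT and a single flat filter follows; the actual SMAC front-end uses mel-spaced Gabor filters, centering, and normalization as described in the paper:

```python
import math

def dft_mag(x):
    """Naive magnitude DFT of a real frame (bins 0..N/2)."""
    N = len(x)
    mags = []
    for k in range(N // 2 + 1):
        re = sum(x[n] * math.cos(2 * math.pi * k * n / N) for n in range(N))
        im = -sum(x[n] * math.sin(2 * math.pi * k * n / N) for n in range(N))
        mags.append(math.hypot(re, im))
    return mags

def spectral_moment(mags, weights, freqs):
    """Amplitude-weighted mean frequency within one subband filter."""
    num = sum(f * w * m for f, w, m in zip(freqs, weights, mags))
    den = sum(w * m for w, m in zip(weights, mags))
    return num / den
```

A pure tone falling inside the filter pulls the moment to the tone's frequency, which is what makes the feature track formant-like spectral concentrations.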
international conference on acoustics, speech, and signal processing | 2002
Dimitrios Dimitriadis; Petros Maragos; Alexandros Potamianos
Automatic speech recognition (ASR) systems can benefit from including, in their acoustic processing part, new features that account for various nonlinear and time-varying phenomena during speech production. In this paper, we develop robust methods to extract novel modulation-type acoustic features from speech signals, based on time-varying models for speech analysis. Further, we integrate the new speech features with the standard linear ones (mel-frequency cepstrum) to develop an augmented set of acoustic features, and demonstrate its efficacy by showing significant improvements in HMM-based word recognition on the TIMIT database.
international conference on acoustics, speech, and signal processing | 2013
Enrico Bocchieri; Dimitrios Dimitriadis
Micro-modulation components such as the formant frequencies are very important characteristics of spoken speech that have allowed great performance improvements in small-vocabulary ASR tasks, yet they have seen limited use in large-vocabulary ASR applications. To enable the successful application of these frequency measures in real-life tasks, we investigate their combination with traditional features (MFCCs and PLPs) via linear (e.g., HDA) and non-linear (bottleneck MLP) feature transforms. Our experiments show that such integration of micro-modulation and cepstral features, using non-linear MLP-based transforms, greatly improves ASR performance with respect to the cepstral features alone. We have applied this novel feature extraction scheme to two very different tasks, i.e., a clean-speech task (DARPA-WSJ) and a real-life, open-vocabulary, mobile search task (Speak4itSM), consistently observing improved performance. We report a relative error rate reduction of 15% for the Speak4itSM task, and similar improvements, up to 21%, for the WSJ task.
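The non-linear combination can be pictured as a forward pass through a bottleneck network: cepstral and micro-modulation streams are concatenated, passed through a non-linear hidden layer, and projected to a narrow bottleneck whose activations serve as the transformed features. The sketch below uses toy dimensions and untrained weights purely for illustration; the real topology and training procedure are those of the paper:

```python
import math

def affine(W, b, x):
    """Compute W @ x + b with W given as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def bottleneck_features(cepstral, micro_mod, W1, b1, W2, b2):
    """Concatenate the two feature streams, apply a tanh hidden layer,
    then a linear bottleneck projection; the bottleneck activations
    are the transformed acoustic features."""
    x = list(cepstral) + list(micro_mod)
    hidden = [math.tanh(v) for v in affine(W1, b1, x)]
    return affine(W2, b2, hidden)
```

In practice the network is trained discriminatively (e.g., on phone targets) and the bottleneck layer is deliberately narrow, so its activations form a compact non-linear fusion of both feature streams.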
international conference on acoustics, speech, and signal processing | 2011
Enrico Bocchieri; Diamantino Antonio Caseiro; Dimitrios Dimitriadis
This paper reports on the development and advances in automatic speech recognition for the AT&T Speak4it® voice-search application. With Speak4it as a real-life example, we show the effectiveness of acoustic model (AM) and language model (LM) estimation (adaptation and training) on relatively small amounts of application field-data. We then introduce algorithmic improvements concerning the use of sentence length in the LM, of non-contextual features in AM decision-trees, and of the Teager energy in the acoustic front-end. The combination of these algorithms, integrated into the AT&T Watson recognizer, yields substantial accuracy improvements. LM and AM estimation on field-data samples increases the word accuracy from 66.4% to 77.1%, a relative word error reduction of 32%. The algorithmic improvements increase the accuracy to 79.7%, an additional 11.3% relative error reduction.
international conference on acoustics, speech, and signal processing | 2011
Dimitrios Dimitriadis; Enrico Bocchieri; Diamantino Antonio Caseiro
In previously published work, we have proposed a novel feature extraction algorithm, based on the Teager-Kaiser energy estimates, that approximates human auditory characteristics and that is more robust to sub-band noise than the mean-square estimates of standard MFCCs. We refer to the novel features as Teager energy cepstrum coefficients (TECC). Herein, we study the TECC performance under additive noise and suggest how to predict the noisy TECC deviations by estimating the subband SNR values. Then, we report on the effectiveness of the TECCs when they are used in the acoustic front-end of the state-of-the-art AT&T WATSON large-vocabulary recognizer. The TECC front-end is tested in the real-life voice-search Speak4it application for mobile devices. It provides a 6% relative word error rate reduction w.r.t. the MFCC front-end, using the same high performance language model, lexicon and acoustic model training.
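The core idea can be sketched as a one-line change to an MFCC-style pipeline: replace each band's mean-square energy with a short-time average of its Teager-Kaiser energy before the log and DCT stages. The sketch below assumes the band-passed subband signals are already available (filterbank design, as in the paper, happens upstream), and is an illustration rather than the actual TECC implementation:

```python
import math

def teager_energy(x):
    """Per-sample Teager-Kaiser energy: x(n)^2 - x(n-1)*x(n+1)."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def band_log_energies(subbands, use_teager=True):
    """Log band energies: Teager-based (TECC-style) or mean-square (MFCC-style)."""
    logs = []
    for s in subbands:
        if use_teager:
            e = sum(teager_energy(s)) / (len(s) - 2)
        else:
            e = sum(v * v for v in s) / len(s)
        logs.append(math.log(e))
    return logs

def dct2(v, num_coeffs):
    """DCT-II of the log-energy vector, yielding cepstrum-like coefficients."""
    N = len(v)
    return [sum(v[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
            for k in range(num_coeffs)]
```

For a subband carrying A*cos(omega*n), the Teager average is A^2 * sin(omega)^2, a frequency-weighted energy, whereas the mean square is A^2 / 2; this frequency weighting is one intuition for the differing noise behavior of the two estimates.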
ieee automatic speech recognition and understanding workshop | 2009
Pirros Tsiakoulis; Alexandros Potamianos; Dimitrios Dimitriadis
In this paper, we investigate the performance of modulation-related features and normalized spectral moments for automatic speech recognition. We focus on the short-time averages of the amplitude-weighted instantaneous frequencies and bandwidths, computed at each subband of a mel-spaced filterbank. Similar features have been proposed in previous studies and have been successfully combined with MFCCs for speech and speaker recognition. Our goal is to investigate the stand-alone performance of these features. First, it is experimentally shown that the proposed features are only moderately correlated in the frequency domain and, unlike MFCCs, do not require a transformation to the cepstral domain. Next, the filterbank parameters (number of filters and filter overlap) are investigated for the proposed features and compared with those of MFCCs. Results show that frequency-related features perform at least as well as MFCCs in clean conditions and yield superior results in noisy conditions, with up to 50% relative error rate reduction for the AURORA3 Spanish task.
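The short-time averages above weight each instantaneous-frequency sample by the squared instantaneous amplitude of the subband. A minimal sketch, assuming the amplitude and frequency tracks are already given by a demodulation stage (common definitions of the bandwidth also include an amplitude-derivative term, omitted here for brevity):

```python
import math

def weighted_freq_and_bandwidth(amps, freqs):
    """Amplitude-weighted short-time mean frequency and a simplified
    bandwidth (weighted standard deviation of the frequency track)."""
    den = sum(a * a for a in amps)
    f_mean = sum(a * a * f for a, f in zip(amps, freqs)) / den
    b_sq = sum(a * a * (f - f_mean) ** 2 for a, f in zip(amps, freqs)) / den
    return f_mean, math.sqrt(b_sq)
```

The a^2 weighting downweights frequency estimates taken where the subband envelope is small, which is exactly where instantaneous-frequency estimates are least reliable.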
international conference on acoustics, speech, and signal processing | 2009
Dimitrios Dimitriadis; A. Metallinou; Ioannis Konstantinou; Georgios I. Goumas; Petros Maragos; Nectarios Koziris
In this paper, a distributed system for storing and retrieving Broadcast News data recorded from Greek television is presented. These multimodal data are processed in a grid computational environment interconnecting distributed data storage and processing subsystems. The innovative element of this system is the implementation of the signal processing algorithms in this grid environment, offering additional flexibility and computational power. The developed signal processing modules include: the Segmentor, cutting the original videos into shorter ones; the Classifier, recognizing whether these short videos contain speech or not; the Greek large-vocabulary speech Recognizer, transcribing speech into written text; and finally the text Search engine and the video Retriever. All the processed data are stored and retrieved in geographically distributed storage elements. A user-friendly, web-based interface is developed, facilitating the transparent import and storage of new multimodal data, their off-line processing and, finally, their search and retrieval.