Publications
Featured research published by Dimitrios Dimitriadis.
IEEE Signal Processing Letters | 2005
Dimitrios Dimitriadis; Petros Maragos; Alexandros Potamianos
In this letter, a nonlinear AM-FM speech model is used to extract robust features for speech recognition. The proposed features measure the amount of amplitude and frequency modulation that exists in speech resonances and attempt to model aspects of the speech acoustic information that the commonly used linear source-filter model fails to capture. The robustness and discriminability of the AM-FM features are investigated in combination with mel-frequency cepstral coefficients (MFCCs). It is shown that these hybrid features perform well in the presence of noise, both in terms of phoneme discrimination (J-measure) and in terms of speech recognition performance on several different tasks. Average relative error rate reductions of up to 11% for clean and 46% for mismatched noisy conditions are achieved when AM-FM features are combined with MFCCs.
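To make the AM-FM model concrete: each speech resonance is represented as a cosine whose amplitude and instantaneous frequency both vary slowly in time. The following is a minimal, self-contained sketch of such a signal (not the authors' code; all parameter names here are hypothetical illustrations):

```python
import math

def am_fm_signal(n_samples, fs, fc, am_depth, fm_dev, mod_freq):
    """Synthesize x[n] = a[n] * cos(phi[n]) with sinusoidal AM and FM.

    a[n]   = 1 + am_depth * cos(2*pi*mod_freq*t)   (amplitude modulation)
    f[n]   = fc + fm_dev * sin(2*pi*mod_freq*t)    (instantaneous frequency, Hz)
    phi[n] = running phase, accumulated from f[n].
    """
    x = []
    phi = 0.0
    for n in range(n_samples):
        t = n / fs
        a = 1.0 + am_depth * math.cos(2 * math.pi * mod_freq * t)
        f = fc + fm_dev * math.sin(2 * math.pi * mod_freq * t)
        phi += 2 * math.pi * f / fs   # integrate frequency into phase
        x.append(a * math.cos(phi))
    return x
```

The "amount of amplitude and frequency modulation" the features measure corresponds to the depth parameters `am_depth` and `fm_dev` of such a resonance; a plain source-filter model treats both as constant within a frame.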
Speech Communication | 2006
Dimitrios Dimitriadis; Petros Maragos
Speech resonance signals appear to contain significant amplitude and frequency modulations. An efficient demodulation approach is based on energy operators. In this paper, we develop two new robust methods for energy-based speech demodulation and compare their performance on both synthetic test signals and actual speech signals. The first method uses smoothing splines for discrete-to-continuous signal approximation. The second (and best) method uses time-derivatives of Gabor filters. Further, we apply the best demodulation method to explore the statistical distribution of speech modulation features and study their properties with regard to speech classification and recognition applications. Finally, we present some preliminary recognition results and highlight their improvements over the corresponding MFCC results.
international conference on acoustics, speech, and signal processing | 2001
Dimitrios Dimitriadis; Petros Maragos
A new algorithm is proposed for demodulating discrete-time AM-FM signals, which first interpolates the signals with smooth splines and then uses the continuous-time energy separation algorithm (ESA) based on the Teager-Kaiser energy operator. This spline-based ESA retains the excellent time resolution of the ESA based on discrete energy operators but performs better in the presence of noise. Further, its dependence on smooth splines allows some optimal trade-off between data fitting versus smoothing.
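For context, the energy separation algorithm mentioned above also has a fully discrete counterpart. The sketch below is the standard discrete DESA-2 variant built on the Teager-Kaiser operator Psi[x](n) = x(n)^2 - x(n-1)x(n+1), not the spline-based continuous-time algorithm this paper proposes:

```python
import math

def desa2(x):
    """Discrete energy separation (DESA-2 variant).

    Returns per-sample estimates of instantaneous frequency (rad/sample)
    and amplitude envelope, valid for clean narrowband signals.
    """
    freqs, amps = [], []
    for n in range(2, len(x) - 2):
        # Teager-Kaiser energy of the signal itself
        psi_x = x[n] ** 2 - x[n - 1] * x[n + 1]
        # Symmetric differences y(n) = x(n+1) - x(n-1) around n
        y0 = x[n + 1] - x[n - 1]
        y_m1 = x[n] - x[n - 2]
        y_p1 = x[n + 2] - x[n]
        # Teager-Kaiser energy of the difference signal
        psi_y = y0 ** 2 - y_m1 * y_p1
        # Frequency and amplitude estimates (argument clamped for safety)
        arg = max(-1.0, min(1.0, 1.0 - psi_y / (2.0 * psi_x)))
        freqs.append(0.5 * math.acos(arg))
        amps.append(2.0 * psi_x / math.sqrt(psi_y))
    return freqs, amps
```

For a pure sinusoid A*cos(omega*n + phi), both estimates are exact: the frequency track returns omega and the amplitude track returns A at every sample. In noise the raw estimates degrade, which is precisely the motivation for the smoothed, spline-based ESA of this paper.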
IEEE Signal Processing Letters | 2010
Pirros Tsiakoulis; Alexandros Potamianos; Dimitrios Dimitriadis
We propose a novel Automatic Speech Recognition (ASR) front-end that consists of the first central Spectral Moment time-frequency distribution Augmented by low-order Cepstral coefficients (SMAC). We prove that the first central spectral moment is proportional to the spectral derivative with respect to the filter's central frequency. Consequently, the spectral moment is an estimate of the frequency-domain derivative of the speech spectrum. However, information related to the entire speech spectrum, such as the energy and the spectral tilt, is not adequately modeled. We propose adding this information via a few cepstral coefficients. Furthermore, we use a mel-spaced Gabor filterbank with 70% frequency overlap in order to overcome the sensitivity to pitch harmonics. The novel SMAC front-end was evaluated on the speech recognition task under a variety of recording conditions. The experimental results have shown that SMAC performs at least as well as the standard MFCC front-end in clean conditions, and significantly outperforms MFCCs in noisy conditions.
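To make the moment feature concrete: for one filter with weights W(f) applied to a magnitude spectrum S(f), a normalized first spectral moment is sum(f * W * S) / sum(W * S), i.e. the amplitude-weighted mean frequency inside that subband. A toy sketch with a naive DFT and a single flat filter follows; the actual SMAC front-end uses mel-spaced Gabor filters, centering, and normalization as described in the paper:

```python
import math

def dft_mag(x):
    """Naive magnitude DFT of a real frame (bins 0..N/2)."""
    N = len(x)
    mags = []
    for k in range(N // 2 + 1):
        re = sum(x[n] * math.cos(2 * math.pi * k * n / N) for n in range(N))
        im = -sum(x[n] * math.sin(2 * math.pi * k * n / N) for n in range(N))
        mags.append(math.hypot(re, im))
    return mags

def spectral_moment(mags, weights, freqs):
    """Amplitude-weighted mean frequency within one subband filter."""
    num = sum(f * w * m for f, w, m in zip(freqs, weights, mags))
    den = sum(w * m for w, m in zip(weights, mags))
    return num / den
```

A pure tone falling inside the filter pulls the moment to the tone's frequency, which is what makes the feature track formant-like spectral concentrations.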
international conference on acoustics, speech, and signal processing | 2002
Dimitrios Dimitriadis; Petros Maragos; Alexandros Potamianos
Automatic speech recognition (ASR) systems can benefit from including, in their acoustic processing part, new features that account for various nonlinear and time-varying phenomena during speech production. In this paper, we develop robust methods to extract novel modulation-type acoustic features from speech signals, based on time-varying models for speech analysis. Further, we integrate the new speech features with the standard linear ones (mel-frequency cepstrum) to develop an augmented set of acoustic features, and demonstrate its efficacy by showing significant improvements in HMM-based word recognition on the TIMIT database.
international conference on acoustics, speech, and signal processing | 2013
Enrico Bocchieri; Dimitrios Dimitriadis
Micro-modulation components such as the formant frequencies are very important characteristics of spoken speech that have allowed great performance improvements in small-vocabulary ASR tasks, yet they have seen limited use in large-vocabulary ASR applications. To enable the successful application of these frequency measures in real-life tasks, we investigate their combination with traditional features (MFCCs and PLPs) via linear (e.g., HDA) and non-linear (bottleneck MLP) feature transforms. Our experiments show that such integration of micro-modulation and cepstral features, using non-linear MLP-based transforms, greatly improves ASR performance with respect to the cepstral features alone. We have applied this novel feature extraction scheme to two very different tasks, i.e., a clean-speech task (DARPA-WSJ) and a real-life, open-vocabulary, mobile search task (Speak4itSM), consistently observing improved performance. We report a relative error rate reduction of 15% for the Speak4itSM task, and similar improvements, up to 21%, for the WSJ task.
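The non-linear combination can be pictured as a forward pass through a bottleneck network: cepstral and micro-modulation streams are concatenated, passed through a non-linear hidden layer, and projected to a narrow bottleneck whose activations serve as the transformed features. The sketch below uses toy dimensions and untrained weights purely for illustration; the real topology and training procedure are those of the paper:

```python
import math

def affine(W, b, x):
    """Compute W @ x + b with W given as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def bottleneck_features(cepstral, micro_mod, W1, b1, W2, b2):
    """Concatenate the two feature streams, apply a tanh hidden layer,
    then a linear bottleneck projection; the bottleneck activations
    are the transformed acoustic features."""
    x = list(cepstral) + list(micro_mod)
    hidden = [math.tanh(v) for v in affine(W1, b1, x)]
    return affine(W2, b2, hidden)
```

In practice the network is trained discriminatively (e.g., on phone targets) and the bottleneck layer is deliberately narrow, so its activations form a compact non-linear fusion of both feature streams.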
international conference on acoustics, speech, and signal processing | 2011
Enrico Bocchieri; Diamantino Antonio Caseiro; Dimitrios Dimitriadis
This paper reports on the development and advances in automatic speech recognition for the AT&T Speak4it® voice-search application. With Speak4it as a real-life example, we show the effectiveness of acoustic model (AM) and language model (LM) estimation (adaptation and training) on relatively small amounts of application field-data. We then introduce algorithmic improvements concerning the use of sentence length in the LM, of non-contextual features in AM decision-trees, and of the Teager energy in the acoustic front-end. The combination of these algorithms, integrated into the AT&T Watson recognizer, yields substantial accuracy improvements. LM and AM estimation on field-data samples increases the word accuracy from 66.4% to 77.1%, a relative word error reduction of 32%. The algorithmic improvements increase the accuracy to 79.7%, an additional 11.3% relative error reduction.
international conference on acoustics, speech, and signal processing | 2011
Dimitrios Dimitriadis; Enrico Bocchieri; Diamantino Antonio Caseiro
In previously published work, we have proposed a novel feature extraction algorithm, based on the Teager-Kaiser energy estimates, that approximates human auditory characteristics and that is more robust to sub-band noise than the mean-square estimates of standard MFCCs. We refer to the novel features as Teager energy cepstrum coefficients (TECC). Herein, we study the TECC performance under additive noise and suggest how to predict the noisy TECC deviations by estimating the subband SNR values. Then, we report on the effectiveness of the TECCs when they are used in the acoustic front-end of the state-of-the-art AT&T WATSON large-vocabulary recognizer. The TECC front-end is tested in the real-life voice-search Speak4it application for mobile devices. It provides a 6% relative word error rate reduction w.r.t. the MFCC front-end, using the same high performance language model, lexicon and acoustic model training.
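The core idea can be sketched as a one-line change to an MFCC-style pipeline: replace each band's mean-square energy with a short-time average of its Teager-Kaiser energy before the log and DCT stages. The sketch below assumes the band-passed subband signals are already available (filterbank design, as in the paper, happens upstream), and is an illustration rather than the actual TECC implementation:

```python
import math

def teager_energy(x):
    """Per-sample Teager-Kaiser energy: x(n)^2 - x(n-1)*x(n+1)."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def band_log_energies(subbands, use_teager=True):
    """Log band energies: Teager-based (TECC-style) or mean-square (MFCC-style)."""
    logs = []
    for s in subbands:
        if use_teager:
            e = sum(teager_energy(s)) / (len(s) - 2)
        else:
            e = sum(v * v for v in s) / len(s)
        logs.append(math.log(e))
    return logs

def dct2(v, num_coeffs):
    """DCT-II of the log-energy vector, yielding cepstrum-like coefficients."""
    N = len(v)
    return [sum(v[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
            for k in range(num_coeffs)]
```

For a subband carrying A*cos(omega*n), the Teager average is A^2 * sin(omega)^2, a frequency-weighted energy, whereas the mean square is A^2 / 2; this frequency weighting is one intuition for the differing noise behavior of the two estimates.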
ieee automatic speech recognition and understanding workshop | 2009
Pirros Tsiakoulis; Alexandros Potamianos; Dimitrios Dimitriadis
In this paper, we investigate the performance of modulation-related features and normalized spectral moments for automatic speech recognition. We focus on the short-time averages of the amplitude-weighted instantaneous frequencies and bandwidths, computed at each subband of a mel-spaced filterbank. Similar features have been proposed in previous studies and have been successfully combined with MFCCs for speech and speaker recognition. Our goal is to investigate the stand-alone performance of these features. First, it is experimentally shown that the proposed features are only moderately correlated in the frequency domain and, unlike MFCCs, do not require a transformation to the cepstral domain. Next, the filterbank parameters (number of filters and filter overlap) are investigated for the proposed features and compared with those of MFCCs. Results show that frequency-related features perform at least as well as MFCCs in clean conditions and yield superior results in noisy conditions, with up to 50% relative error rate reduction for the AURORA3 Spanish task.
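The short-time averages above weight each instantaneous-frequency sample by the squared instantaneous amplitude of the subband. A minimal sketch, assuming the amplitude and frequency tracks are already given by a demodulation stage (common definitions of the bandwidth also include an amplitude-derivative term, omitted here for brevity):

```python
import math

def weighted_freq_and_bandwidth(amps, freqs):
    """Amplitude-weighted short-time mean frequency and a simplified
    bandwidth (weighted standard deviation of the frequency track)."""
    den = sum(a * a for a in amps)
    f_mean = sum(a * a * f for a, f in zip(amps, freqs)) / den
    b_sq = sum(a * a * (f - f_mean) ** 2 for a, f in zip(amps, freqs)) / den
    return f_mean, math.sqrt(b_sq)
```

The a^2 weighting downweights frequency estimates taken where the subband envelope is small, which is exactly where instantaneous-frequency estimates are least reliable.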
international conference on acoustics, speech, and signal processing | 2009
Dimitrios Dimitriadis; A. Metallinou; Ioannis Konstantinou; Georgios I. Goumas; Petros Maragos; Nectarios Koziris
In this paper, a distributed system for storing and retrieving Broadcast News data recorded from Greek television is presented. These multimodal data are processed in a grid computational environment interconnecting distributed data storage and processing subsystems. The innovative element of this system is the implementation of the signal processing algorithms in this grid environment, offering additional flexibility and computational power. The developed signal processing modules include: the Segmentor, cutting the original videos into shorter ones; the Classifier, recognizing whether these short videos contain speech or not; the Greek large-vocabulary speech Recognizer, transcribing speech into written text; and finally the text Search engine and the video Retriever. All the processed data are stored and retrieved in geographically distributed storage elements. A user-friendly, web-based interface is developed, facilitating the transparent import and storage of new multimodal data, their off-line processing and, finally, their search and retrieval.