
Publications

Featured research published by Alexandros Lazaridis.


International Conference on Biometrics: Theory, Applications and Systems | 2015

On the vulnerability of speaker verification to realistic voice spoofing

Serife Kucur Ergunay; Elie Khoury; Alexandros Lazaridis; Sébastien Marcel

Automatic speaker verification (ASV) systems are subject to various kinds of malicious attacks. Replay, voice conversion and speech synthesis attacks drastically degrade the performance of a standard ASV system by increasing its false acceptance rates. This issue has raised a high level of interest in the speech research community, where possible voice spoofing attacks and their related countermeasures have been investigated. However, much less effort has been devoted to creating realistic and diverse spoofing attack databases that allow researchers to correctly evaluate their countermeasures against attacks. The existing studies are not complete in terms of the types of attacks covered, and are often difficult to reproduce because of the unavailability of public databases. In this paper we introduce the voice spoofing dataset of AVspoof, a public audio-visual spoofing database. AVspoof includes ten realistic spoofing threats generated using replay, speech synthesis and voice conversion. In addition, we provide a set of experimental results showing the effect of such attacks on current state-of-the-art ASV systems.
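A minimal sketch of the kind of effect the paper measures: how spoofed trials inflate an ASV system's false acceptance rate (FAR) at a threshold tuned on zero-effort impostors. The score distributions below are hypothetical placeholders, not AVspoof results; real scores would come from an ASV back-end.

```python
# Hypothetical ASV scores; in practice these come from scoring genuine,
# zero-effort impostor, and spoofed (replay/TTS/VC) trials.
import numpy as np

rng = np.random.default_rng(0)
genuine_scores = rng.normal(2.0, 1.0, 1000)    # target trials
impostor_scores = rng.normal(-2.0, 1.0, 1000)  # zero-effort impostors
spoofed_scores = rng.normal(1.0, 1.0, 1000)    # spoofing attacks

# Fix the decision threshold at the zero-effort equal error rate (EER).
thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
far = np.array([(impostor_scores >= t).mean() for t in thresholds])
frr = np.array([(genuine_scores < t).mean() for t in thresholds])
threshold = thresholds[int(np.argmin(np.abs(far - frr)))]

print(f"FAR (zero-effort impostors): {(impostor_scores >= threshold).mean():.3f}")
print(f"FAR (spoofed trials):        {(spoofed_scores >= threshold).mean():.3f}")
```

With attack scores sitting close to genuine ones, the spoofed FAR is far above the zero-effort FAR, which is precisely the vulnerability the paper documents.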


International Conference on Tools with Artificial Intelligence | 2007

Segmental Duration Modeling for Greek Speech Synthesis

Alexandros Lazaridis; Panagiotis Zervas; G. Kokkinakis

In this paper we address the task of modelling phoneme duration for Greek speech synthesis. In particular, we apply well-established machine learning approaches to the WCL-1 prosodic database to predict segmental durations from shallow morphosyntactic and prosodic features. We employ decision trees, instance-based learning and linear regression. Trained on a 5500-word database, the CART and linear regression models proved the most effective for the task, with root mean square errors of 0.0252 and 0.0251 respectively.
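An illustrative sketch of the three learner families compared in the paper (CART, instance-based learning, linear regression) on a duration-regression task. The feature matrix is random stand-in data; the real system used shallow morphosyntactic and prosodic features from WCL-1.

```python
# Compare CART, k-NN (instance-based learning), and linear regression
# on a synthetic phone-duration prediction task.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 12))                        # stand-in linguistic features
y = 0.08 + 0.02 * X[:, 0] + rng.normal(0, 0.01, 5000)  # durations in seconds

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, model in [("CART", DecisionTreeRegressor(max_depth=8)),
                    ("IBL (k-NN)", KNeighborsRegressor(n_neighbors=5)),
                    ("Linear regression", LinearRegression())]:
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: RMSE = {rmse:.4f} s")
```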


Text, Speech and Dialogue | 2010

Enhancing emotion recognition from speech through feature selection

Theodoros Kostoulas; Todor Ganchev; Alexandros Lazaridis; Nikos Fakotakis

In the present work we aim at performance optimization of a speaker-independent emotion recognition system through a speech feature selection process. Specifically, relying on the speech feature set defined in the Interspeech 2009 Emotion Challenge, we studied the relative importance of the individual speech parameters and, based on their ranking, selected a subset of speech parameters that offered advantageous performance. The affect-emotion recognizer utilized here relies on a GMM-UBM-based classifier. In all experiments we followed the experimental setup defined by the Interspeech 2009 Emotion Challenge, utilizing the FAU Aibo Emotion Corpus of spontaneous, emotionally coloured speech. The experimental results indicate that the correct choice of speech parameters can lead to better performance than the baseline.
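A sketch of the rank-and-select idea: score each feature individually, rank by relevance, and evaluate subsets of increasing size. Everything here is a simplified stand-in (the paper used the 384-dimensional Challenge feature set and a GMM-UBM back-end on FAU Aibo, not logistic regression on random data).

```python
# Rank features by univariate F-score, then compare subset sizes.
import numpy as np
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 384))     # stand-in for the 384 Challenge features
y = rng.integers(0, 5, size=600)    # five emotion classes
X[:, :20] += y[:, None] * 0.5       # make a few features informative

f_scores, _ = f_classif(X, y)             # per-feature relevance scores
ranking = np.argsort(f_scores)[::-1]      # most relevant first
print("top 10 features:", ranking[:10])

for k in (10, 50, 384):                   # accuracy vs. subset size
    X_k = SelectKBest(f_classif, k=k).fit_transform(X, y)
    acc = cross_val_score(LogisticRegression(max_iter=1000), X_k, y, cv=3).mean()
    print(f"k={k}: CV accuracy = {acc:.3f}")
```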


Speech Communication | 2011

Improving phone duration modelling using support vector regression fusion

Alexandros Lazaridis; Iosif Mporas; Todor Ganchev; George K. Kokkinakis; Nikos Fakotakis

In the present work, we propose a scheme for the fusion of different phone duration models operating in parallel. Specifically, the predictions from a group of dissimilar and mutually independent duration models are fed to a machine learning algorithm, which reconciles and fuses the outputs of the individual models, yielding more precise phone duration predictions. The performance of the individual duration models and of the proposed fusion scheme is evaluated on the American-English KED TIMIT and the Greek WCL-1 databases. On both databases, the SVR-based individual model demonstrates the lowest error rate. When compared to the second-best individual algorithm, it achieves a relative reduction of the mean absolute error (MAE) and the root mean square error (RMSE) by 5.5% and 3.7% on KED TIMIT, and by 6.8% and 3.7% on WCL-1. At the fusion stage, we evaluate the performance of 12 fusion techniques. The proposed fusion scheme, when implemented with SVR-based fusion, improves the phone duration prediction accuracy over that of the best individual model by 1.9% and 2.0% in terms of relative reduction of the MAE and RMSE on KED TIMIT, and by 2.6% and 1.8% on the WCL-1 database.
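A minimal sketch of the fusion stage: predictions from several independent duration models become the input vector of an SVR that produces the fused estimate. The base models and data are simplified stand-ins, not the paper's actual model set.

```python
# Two-level scheme: base duration models -> SVR fuser over their predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 10))
y = 0.07 + 0.02 * np.tanh(X[:, 0]) + rng.normal(0, 0.008, 4000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
base_models = [DecisionTreeRegressor(max_depth=6), LinearRegression(),
               KNeighborsRegressor(n_neighbors=7)]

# First level: train the individual duration models.
preds_tr = np.column_stack([m.fit(X_tr, y_tr).predict(X_tr) for m in base_models])
preds_te = np.column_stack([m.predict(X_te) for m in base_models])

# Second level: the SVR fuser reconciles the individual predictions.
# (In practice out-of-fold predictions would be used to avoid biasing it.)
fuser = SVR(kernel="rbf", C=1.0, epsilon=0.001).fit(preds_tr, y_tr)
print(f"fused MAE: {mean_absolute_error(y_te, fuser.predict(preds_te)):.4f} s")
for m, col in zip(base_models, preds_te.T):
    print(f"{type(m).__name__} MAE: {mean_absolute_error(y_te, col):.4f} s")
```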


Computer Speech & Language | 2017

Speech vocoding for laboratory phonology

Milos Cernak; Stefan Benus; Alexandros Lazaridis

Using phonological speech vocoding, we propose a platform for exploring relations between phonology and speech processing and, in broader terms, between the abstract and physical structures of a speech signal. Our goal is to make a step towards bridging phonology and speech processing and to contribute to the program of Laboratory Phonology. We show three application examples for laboratory phonology: compositional phonological speech modelling, a comparison of phonological systems, and an experimental phonological parametric text-to-speech (TTS) system. The featural representations of three phonological systems are considered in this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English (SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded speech, we conclude that the latter achieves slightly better results than the former. However, GP, the most compact phonological speech representation, performs comparably to the systems with a higher number of phonological features. The parametric TTS based on the phonological speech representation, trained from an unlabelled audiobook in an unsupervised manner, achieves 85% of the intelligibility of state-of-the-art parametric speech synthesis. We envision that the presented approach paves the way for researchers in both fields to form meaningful hypotheses that are explicitly testable using the concepts developed and exemplified in this paper. On the one hand, laboratory phonologists might test the applied concepts of their theoretical models; on the other hand, the speech processing community may utilize the concepts developed for the theoretical phonological models to improve current state-of-the-art applications.
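A toy illustration of what a featural (SPE-style) representation looks like: each phone maps to a binary vector over a phonological feature inventory, which is the kind of abstract structure a phonological vocoder estimates and resynthesizes from. The mini-inventory below is a drastically reduced stand-in, not the paper's actual GP/SPE/eSPE feature sets.

```python
# SPE-style binary featural vectors for a hypothetical mini phone inventory.
FEATURES = ["vocalic", "consonantal", "high", "back", "voiced", "nasal"]

PHONE_TO_FEATURES = {
    "i": {"vocalic", "high", "voiced"},
    "a": {"vocalic", "voiced"},
    "u": {"vocalic", "high", "back", "voiced"},
    "m": {"consonantal", "voiced", "nasal"},
    "t": {"consonantal"},
    "d": {"consonantal", "voiced"},
}

def featural_vector(phone: str) -> list[int]:
    """Binary vector over the feature inventory for one phone."""
    active = PHONE_TO_FEATURES[phone]
    return [int(f in active) for f in FEATURES]

for p in "mad":
    print(p, featural_vector(p))
```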


Computer Speech & Language | 2012

Two-stage phone duration modelling with feature construction and feature vector extension for the needs of speech synthesis

Alexandros Lazaridis; Todor Ganchev; Iosif Mporas; Evaggelos Dermatas; Nikos Fakotakis

We propose a two-stage phone duration modelling scheme, which can be applied to improve prosody modelling in speech synthesis systems. This scheme builds on a number of independent feature constructors (FCs) employed in the first stage, and a phone duration model (PDM) which operates on an extended feature vector in the second stage. The feature vector, which acts as input to the first stage, consists of numerical and non-numerical linguistic features extracted from text. The extended feature vector is obtained by appending the phone duration predictions estimated by the FCs to the initial feature vector. Experiments on the American-English KED TIMIT and the Modern Greek WCL-1 databases validated the advantage of the proposed two-stage scheme, improving prediction accuracy over the best individual predictor and over a two-stage scheme which merely fuses the first-stage outputs. Specifically, when compared to the best individual predictor, a relative reduction in the mean absolute error and the root mean square error of 3.9% and 3.9% on KED TIMIT, and of 4.8% and 4.6% on the WCL-1 database, respectively, is observed.
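A sketch of the feature-vector extension step that distinguishes this scheme from plain fusion: the first-stage FC predictions are appended to the original linguistic features, so the second-stage model sees both. Data and models are simplified stand-ins.

```python
# Two-stage scheme: extended vector = original features + FC predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 10))
y = 0.07 + 0.02 * np.tanh(X[:, 0]) + rng.normal(0, 0.008, 4000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: feature constructors each predict a duration.
fcs = [DecisionTreeRegressor(max_depth=6), LinearRegression()]
fc_tr = np.column_stack([fc.fit(X_tr, y_tr).predict(X_tr) for fc in fcs])
fc_te = np.column_stack([fc.predict(X_te) for fc in fcs])

# Stage 2: the PDM operates on the extended vector, not the predictions alone.
X_tr_ext = np.hstack([X_tr, fc_tr])
X_te_ext = np.hstack([X_te, fc_te])

pdm = SVR(kernel="rbf").fit(X_tr_ext, y_tr)
print(f"two-stage MAE: {mean_absolute_error(y_te, pdm.predict(X_te_ext)):.4f} s")
```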


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Incremental syllable-context phonetic vocoding

Milos Cernak; Philip N. Garner; Alexandros Lazaridis; Petr Motlicek; Xingyu Na

Current very low bit rate speech coders are, due to complexity limitations, designed to work off-line. This paper investigates incremental speech coding that operates in real time and incrementally (i.e., the encoded speech depends only on already-uttered speech, without the need for future speech information). Since human speech communication is asynchronous (i.e., different information flows are processed simultaneously), we hypothesized that such an incremental speech coder should also operate asynchronously. To accomplish this task, we describe speech coding that reflects human cortical temporal sampling, which packages information into units of different temporal granularity, such as phonemes and syllables, in parallel. More specifically, we investigate a phonetic vocoder (cascaded speech recognition and synthesis systems) extended with syllable-based information transmission mechanisms. Two main aspects are evaluated in this work: synchronous and asynchronous coding. Synchronous coding refers to the case where the phonetic vocoder and the speech generation process depend on the syllable boundaries during encoding and decoding, respectively. Asynchronous coding, on the other hand, refers to the case where the phonetic encoding and speech generation processes are carried out independently of the syllable boundaries. Our experiments confirmed that asynchronous incremental speech coding performs better in terms of intelligibility and overall speech quality, mainly due to better alignment of the segmental and prosodic information. The proposed vocoding operates at an uncompressed bit rate of 213 bits/sec and achieves an average communication delay of 243 ms.
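To see why a phonetic vocoder lands in the low hundreds of bits per second, here is back-of-the-envelope arithmetic under stated assumptions. The speaking rate, phone inventory size, and per-unit prosody bits below are illustrative guesses, not the paper's actual bit budget; the paper reports 213 bits/sec uncompressed overall.

```python
# Rough bit-rate estimate for transmitting phone identities plus
# quantized duration and pitch information.
import math

phones_per_sec = 12          # assumed average speaking rate
phone_inventory = 45         # assumed phone set size
bits_per_phone_id = math.ceil(math.log2(phone_inventory))  # 6 bits
bits_per_duration = 5        # assumed quantized duration per phone
bits_per_pitch = 6           # assumed quantized F0 info per phone

rate = phones_per_sec * (bits_per_phone_id + bits_per_duration + bits_per_pitch)
print(f"approx. bit rate: {rate} bits/sec")  # ~204, same order as the reported 213
```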


Panhellenic Conference on Informatics | 2009

Using Hybrid HMM-Based Speech Segmentation to Improve Synthetic Speech Quality

Iosif Mporas; Alexandros Lazaridis; Todor Ganchev; Nikos Fakotakis

The automatic phonetic time-alignment of speech databases is essential to the development cycle of a Text-to-Speech (TTS) system, and the quality of the synthesized speech is strongly related to the precision of the produced alignment. In the present work we study the performance of a new HMM-based speech segmentation method. The method is based on hybrid embedded and isolated-unit trained models, and has been shown to improve phonetic segmentation accuracy in the multi-speaker task. Here it is employed on the single-speaker segmentation task, utilizing a Greek-speech database. The evaluation of the method showed significant improvement in phonetic segmentation accuracy, as well as in the perceptual quality of the synthetic speech, when compared to the baseline system.
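A sketch of how phonetic segmentation accuracy is commonly reported: the percentage of predicted phone boundaries falling within a tolerance (e.g., 20 ms) of the hand-labelled reference. The timestamps are illustrative, and the specific metric here is a standard convention rather than the paper's exact evaluation protocol.

```python
# Boundary accuracy within a fixed tolerance, for aligned boundary lists.
def boundary_accuracy(predicted, reference, tolerance=0.020):
    """Fraction of boundaries within `tolerance` seconds of the reference.

    Assumes the two lists are aligned one-to-one (same phone sequence).
    """
    assert len(predicted) == len(reference)
    hits = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reference))
    return hits / len(reference)

ref = [0.000, 0.085, 0.160, 0.245, 0.330]  # hand-labelled boundaries (s)
hyp = [0.002, 0.078, 0.171, 0.248, 0.352]  # HMM-produced boundaries (s)
print(f"within 20 ms: {boundary_accuracy(hyp, ref):.0%}")
```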


Text, Speech and Dialogue | 2008

Performance Evaluation for Voice Conversion Systems

Todor Ganchev; Alexandros Lazaridis; Iosif Mporas; Nikos Fakotakis

In the present work, we introduce a new performance evaluation measure for assessing the capacity of voice conversion systems to modify the speech of one speaker (the source) so that it sounds as if it were uttered by another speaker (the target). This measure relies on a GMM-UBM-based likelihood estimator that estimates the degree of proximity between an utterance of the converted voice and predefined models of the source and target voices. The proposed approach allows the formulation of an objective criterion that is applicable both for evaluating a single system and for direct comparison (benchmarking) among different voice conversion systems. To illustrate the functionality and practical usefulness of the proposed measure, we contrast it with four well-known objective evaluation criteria.
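A sketch of the underlying idea: score the converted utterance's features under models of the source and target voices and compare log-likelihoods. The paper's measure is GMM-UBM-based; this plain-GMM version and all data are simplified stand-ins.

```python
# Proximity of converted speech to target vs. source voice models.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
source_feats = rng.normal(0.0, 1.0, size=(2000, 13))  # e.g., MFCC frames
target_feats = rng.normal(1.5, 1.0, size=(2000, 13))
converted = rng.normal(1.2, 1.0, size=(400, 13))       # conversion output

gmm_src = GaussianMixture(n_components=8, random_state=0).fit(source_feats)
gmm_tgt = GaussianMixture(n_components=8, random_state=0).fit(target_feats)

# Positive value: the converted speech is closer to the target model.
proximity = gmm_tgt.score(converted) - gmm_src.score(converted)
print(f"avg. log-likelihood ratio (target vs. source): {proximity:.2f}")
```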


International Conference on Speech and Computer | 2015

DNN-Based Speech Synthesis: Importance of Input Features and Training Data

Alexandros Lazaridis; Blaise Potard; Philip N. Garner

Deep neural networks (DNNs) have recently been introduced in speech synthesis. In this paper, we investigate the importance of input features and training data in speaker-dependent (SD) DNN-based speech synthesis. Various aspects of the DNN training procedure are investigated, and training sets of several different sizes (13.5, 3.6 and 1.5 hours of speech) are evaluated.
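A minimal sketch of the general shape of a DNN acoustic model for synthesis: a feed-forward network regressing frame-level linguistic input features onto acoustic (vocoder) parameters. All sizes, data, and the sklearn MLP are illustrative stand-ins, not the paper's configuration.

```python
# Feed-forward DNN mapping linguistic features to acoustic parameters.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 300))           # frame-level linguistic features
W = rng.normal(size=(300, 60)) * 0.1
Y = np.tanh(X @ W) + rng.normal(0, 0.05, (5000, 60))  # acoustic targets

# A few iterations only, for illustration; real training runs much longer.
dnn = MLPRegressor(hidden_layer_sizes=(512, 512, 512),
                   activation="tanh", max_iter=20, random_state=0)
dnn.fit(X, Y)                              # multi-output regression
print("train MSE:", np.mean((dnn.predict(X) - Y) ** 2))
```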

Collaboration


An overview of Alexandros Lazaridis's collaborations.

Top Co-Authors

Milos Cernak

Idiap Research Institute

Junichi Yamagishi

National Institute of Informatics
