
Publication


Featured research published by Ibon Saratxaga.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2011

Detection of synthetic speech for the problem of imposture

Phillip L. De Leon; Inma Hernaez; Ibon Saratxaga; Michael Pucher; Junichi Yamagishi

In this paper, we present new results from our research into the vulnerability of a speaker verification (SV) system to synthetic speech. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, and both GMM-UBM and support vector machine (SVM) SV systems. Using 283 speakers from the Wall Street Journal (WSJ) corpus, our SV systems have a 0.35% EER. When the systems are tested with synthetic speech generated from speaker models derived from the WSJ corpus, over 91% of the matched claims are accepted. We propose the use of relative phase shift (RPS) in order to detect synthetic speech and develop a GMM-based synthetic speech classifier (SSC). Using the SSC, we are able to correctly classify human speech in 95% of tests and synthetic speech in 88% of tests, thus significantly reducing the vulnerability.
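
An SSC of this kind amounts to a two-class GMM log-likelihood-ratio test: one GMM trained on features from human speech, one on features from synthetic speech. A minimal sketch using scikit-learn, with invented stand-in feature matrices and an assumed component count (not the paper's configuration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Invented stand-in feature matrices (one feature vector per frame); the
# dimensionality and component count are assumptions, not the paper's.
rng = np.random.default_rng(0)
X_human = rng.normal(0.0, 1.0, size=(500, 20))
X_synth = rng.normal(0.5, 1.2, size=(500, 20))

gmm_human = GaussianMixture(n_components=8, covariance_type="diag").fit(X_human)
gmm_synth = GaussianMixture(n_components=8, covariance_type="diag").fit(X_synth)

def classify(X):
    # score() is the average per-frame log-likelihood, so the sign of
    # the difference gives the log-likelihood-ratio decision.
    return "human" if gmm_human.score(X) - gmm_synth.score(X) > 0 else "synthetic"

print(classify(rng.normal(0.0, 1.0, size=(100, 20))))  # likely "human"
```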


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Evaluation of Speaker Verification Security and Detection of HMM-Based Synthetic Speech

Phillip L. De Leon; Michael Pucher; Junichi Yamagishi; Inma Hernaez; Ibon Saratxaga

In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model–universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.
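
The EER quoted above is the operating point at which the false acceptance and false rejection rates coincide. A small illustrative sketch of how it can be estimated from genuine and impostor score lists (the scores below are invented):

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Estimate the EER by sweeping a threshold over all observed scores."""
    best_far, best_frr = 1.0, 0.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)   # impostor claims accepted
        frr = np.mean(genuine < t)     # genuine claims rejected
        if abs(far - frr) < abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2

rng = np.random.default_rng(1)
genuine = rng.normal(2.0, 1.0, 1000)    # invented verification scores
impostor = rng.normal(-2.0, 1.0, 1000)
print(f"EER ~ {equal_error_rate(genuine, impostor):.2%}")
```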


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2007

Evaluation of Pitch Detection Algorithms Under Real Conditions

Iker Luengo; Ibon Saratxaga; Eva Navas; Inmaculada Hernáez; Jon Sanchez; Iñaki Sainz

A novel algorithm based on classical cepstrum calculation followed by dynamic programming is presented in this paper. The algorithm has been evaluated with a 60-minute database containing 60 speakers and different recording conditions and environments. A second reference database has also been used. In addition, the performance of four popular pitch detection algorithms (PDAs) has been evaluated with the same databases. The results demonstrate the good performance of the described algorithm in noisy conditions. Furthermore, the paper is a first initiative to evaluate widely used PDAs over an extensive and realistic database.
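
As a rough sketch of the cepstral front end described above (the dynamic programming tracking stage is omitted, and the frame length and F0 search range are assumptions rather than the paper's settings):

```python
import numpy as np

def cepstral_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 of one voiced frame from the real-cepstrum peak."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-10))
    # Search only quefrencies corresponding to the allowed F0 range.
    q_lo, q_hi = int(fs / f0_max), int(fs / f0_min)
    return fs / (q_lo + np.argmax(cepstrum[q_lo:q_hi]))

fs = 16000
t = np.arange(1024) / fs
frame = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 6))
print(f"estimated F0: {cepstral_f0(frame, fs):.1f} Hz")  # ~120 Hz
```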


IEEE Transactions on Information Forensics and Security | 2015

Toward a Universal Synthetic Speech Spoofing Detection Using Phase Information

Jon Sanchez; Ibon Saratxaga; Inma Hernaez; Eva Navas; Daniel Erro; Tuomo Raitio

In the field of speaker verification (SV), it is nowadays feasible and relatively easy to create a synthetic voice to deceive a speech-driven biometric access system. This paper presents a synthetic speech detector that can be connected at the front end or at the back end of a standard SV system, and that will protect it from spoofing attacks coming from state-of-the-art statistical text-to-speech (TTS) systems. The system described is a Gaussian mixture model (GMM) based binary classifier that uses natural and copy-synthesized signals obtained from the Wall Street Journal database to train the system models. Three different state-of-the-art vocoders are chosen and modeled using two sets of acoustic parameters: 1) relative phase shift and 2) canonical Mel-frequency cepstral coefficient (MFCC) parameters, as a baseline. The vocoder dependency of the system and multi-vocoder modeling features are thoroughly studied. Additional phase-aware vocoders are also tested. Several experiments are carried out, showing that the phase-based parameters perform better and are able to cope with new unknown attacks. The final evaluations, testing synthetic TTS signals obtained from the Blizzard Challenge, validate our proposal.
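
The relative phase shift feature used above expresses each harmonic's instantaneous phase relative to that of the fundamental. A minimal sketch of the computation, assuming the harmonic phases have already been extracted by a harmonic analyser:

```python
import numpy as np

def relative_phase_shift(phases):
    """RPS: each harmonic's instantaneous phase minus k times the phase
    of the fundamental, wrapped to (-pi, pi].  phases[0] is harmonic 1."""
    phases = np.asarray(phases, dtype=float)
    k = np.arange(1, len(phases) + 1)
    return np.angle(np.exp(1j * (phases - k * phases[0])))

# Toy harmonic phases at one analysis instant (invented values).
print(relative_phase_shift([0.4, 1.1, 1.9, 2.5]))
```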


IEEE Transactions on Audio, Speech, and Language Processing | 2010

Emotion Conversion Based on Prosodic Unit Selection

Daniel Erro; Eva Navas; Inmaculada Hernáez; Ibon Saratxaga

Voice conversion has traditionally focused on the spectrum, and current systems lack a solid prosody conversion method suitable for different speaking styles. Recently, the unit selection technique has been applied to transform emotional intonation contours. This paper goes one step beyond: it explores strategies for training and configuring the selection cost function in an emotion conversion application. The proposed system, which uses accent groups as basic intonation units and also converts phoneme durations and intensity, is evaluated by means of a carefully designed subjective test involving the big six emotions. Although the expressiveness of the converted sentences is still far from that of natural emotional speech, satisfactory results are obtained when different configurations are used for different emotions.
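
Unit selection of this kind is typically a Viterbi search minimizing a weighted sum of target and concatenation costs. The sketch below illustrates that generic scheme with Euclidean costs and invented weights; it is not the paper's trained cost function:

```python
import numpy as np

def select_units(targets, candidates, w_t=1.0, w_c=1.0):
    """Viterbi search over candidate prosodic units, one slot per
    accent group.  Costs and weights are illustrative placeholders."""
    cost = [[w_t * np.linalg.norm(c - targets[0]) for c in candidates[0]]]
    back = []
    for i in range(1, len(targets)):
        prev, cur, ptr = cost[-1], [], []
        for c in candidates[i]:
            target_cost = w_t * np.linalg.norm(c - targets[i])
            # Concatenation cost: mismatch with each previous candidate.
            joins = [p + w_c * np.linalg.norm(c - q)
                     for p, q in zip(prev, candidates[i - 1])]
            j = int(np.argmin(joins))
            cur.append(target_cost + joins[j])
            ptr.append(j)
        cost.append(cur)
        back.append(ptr)
    path = [int(np.argmin(cost[-1]))]        # backtrack the cheapest path
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

targets = [np.array([1.0, 0.5]), np.array([0.2, 0.8])]
candidates = [[np.array([0.9, 0.6]), np.array([2.0, 0.0])],
              [np.array([0.3, 0.7]), np.array([1.5, 1.5])]]
print(select_units(targets, candidates))     # -> [0, 0]
```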


Speech Communication | 2016

Synthetic speech detection using phase information

Ibon Saratxaga; Jon Sanchez; Zhizheng Wu; Inma Hernaez; Eva Navas

Highlights: phase information based synthetic speech detectors (RPS, MGD) are analyzed; training using real attack samples and copy-synthesized material is evaluated; the detectors are evaluated against unknown attacks, including channel effects; the detectors work well against voice conversion and adapted synthetic speech impostors.

Taking advantage of the fact that most speech processing techniques neglect the phase information, we seek to detect phase perturbations in order to prevent synthetic impostors from attacking speaker verification systems. Two synthetic speech detection (SSD) systems that use spectral phase related information are reviewed and evaluated in this work: one based on the Modified Group Delay (MGD) and the other based on the Relative Phase Shift (RPS). A classical module-based MFCC system is also used as a baseline. Different training strategies are proposed and evaluated using both real spoofing samples and signals copy-synthesized from the natural ones, aiming to alleviate the issue of obtaining real data to train the systems. The recently published ASVspoof 2015 database is used for training and evaluation. Performance with completely unrelated data is also checked using synthetic speech from the Blizzard Challenge as evaluation material. The results prove that phase information can be successfully used for the SSD task even with unknown attacks.
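
For reference, one common formulation of the modified group delay function (after Hegde and Murthy) can be sketched as follows; the lifter length and the alpha/gamma exponents are typical values, not necessarily those used in the paper:

```python
import numpy as np

def modified_group_delay(frame, alpha=0.4, gamma=0.9, lifter=30, n_fft=1024):
    """Modified group delay function of one frame (Hegde & Murthy form)."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)            # spectrum of x[n]
    Y = np.fft.rfft(n * frame, n_fft)        # spectrum of n*x[n]
    # Cepstrally smoothed magnitude spectrum |S|.
    cep = np.fft.irfft(np.log(np.abs(X) + 1e-10))
    cep[lifter:-lifter] = 0.0                # keep low quefrencies only
    S = np.exp(np.fft.rfft(cep).real)
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-10)
    return np.sign(tau) * np.abs(tau) ** alpha

fs = 16000
t = np.arange(400) / fs
print(modified_group_delay(np.sin(2 * np.pi * 200 * t))[:5])
```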


Text, Speech and Dialogue (TSD) | 2005

Analysis of the suitability of common corpora for emotional speech modeling in standard Basque

Eva Navas; Inmaculada Hernáez; Iker Luengo; Jon Sanchez; Ibon Saratxaga

This paper presents the analysis made to assess the suitability of neutral semantic corpora for studying emotional speech. Two corpora have been used: one with neutral texts common to all emotions and the other with texts related to each emotion. Subjective and objective analyses have been performed. In the subjective test, the common corpus achieved good recognition rates, although worse than those obtained with specific texts. In the objective analysis, differences among emotions are larger for common texts than for specific texts, indicating that in the common corpus the expression of emotions was more exaggerated. This is convenient for emotional speech synthesis, but not for emotion recognition. Thus, in this case, the common corpus is suitable for the prosodic modeling of emotions for speech synthesis, whereas for emotion recognition specific texts are more convenient.


Proceedings of the COST Action 2102 International Conference on Verbal and Nonverbal Communication Behaviours (COST 2102'07) | 2007

Meaningful parameters in emotion characterisation

Eva Navas; Inmaculada Hernáez; Iker Luengo; Iñaki Sainz; Ibon Saratxaga; Jon Sanchez

In expressive speech synthesis, a method of mimicking the way a specific speaker expresses emotions is needed. In this work we have studied the suitability of long-term prosodic parameters and short-term spectral parameters for reflecting emotions in speech, by analysing the results of two automatic emotion classification systems. These systems have been trained with different emotional single-speaker databases recorded in standard Basque that include six emotions. Both of them are able to differentiate among emotions for a specific speaker with very high identification rates (above 75%), but the models are not applicable to other speakers (identification rates drop to 20%). Therefore, in the synthesis process the control of both spectral and prosodic features is essential to obtain expressive speech, and when a change of speaker is desired the values of the parameters should be re-estimated.


Conference of the International Speech Communication Association (INTERSPEECH) | 2016

ML Parameter Generation with a Reformulated MGE Training Criterion - Participation in the Voice Conversion Challenge 2016.

Daniel Erro; Agustín Alonso; Luis Serrano; David Tavarez; Igor Odriozola; Xabier Sarasola; Eder del Blanco; Jon Sanchez; Ibon Saratxaga; Eva Navas; Inma Hernaez

This paper describes our entry to the Voice Conversion Challenge 2016. Based on the maximum likelihood parameter generation algorithm, the method is a reformulation of the minimum generation error training criterion. It uses a GMM for soft classification, a Mel-cepstral vocoder for acoustic analysis, and an improved dynamic time warping procedure for source-target alignment. To compensate for the oversmoothing effect, the generated parameters are filtered through a speaker-independent postfilter implemented as a linear transform in the cepstral domain. The process is completed with mean and variance adaptation of the log-fundamental frequency and duration modification by a constant factor. The results of the evaluation show that the proposed system achieves high conversion accuracy in comparison with other systems, while its naturalness scores are intermediate.
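
The mean and variance adaptation of log-F0 mentioned above is a standard linear transform in the log domain. A minimal sketch, with invented speaker statistics:

```python
import numpy as np

def convert_logf0(f0_src, src_stats, tgt_stats):
    """Mean/variance adaptation of log-F0 (voiced frames only)."""
    mu_s, sd_s = src_stats          # source-speaker log-F0 mean/std
    mu_t, sd_t = tgt_stats          # target-speaker log-F0 mean/std
    return np.exp(mu_t + (sd_t / sd_s) * (np.log(f0_src) - mu_s))

f0 = np.array([110.0, 120.0, 130.0])                  # invented contour (Hz)
print(convert_logf0(f0, (np.log(120.0), 0.15), (np.log(200.0), 0.20)))
```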


IberSPEECH 2014 Proceedings of the Second International Conference on Advances in Speech and Language Technologies for Iberian Languages - Volume 8854 | 2014

Speech Watermarking Based on Coding of the Harmonic Phase

Inma Hernaez; Ibon Saratxaga; Jianpei Ye; Jon Sanchez; Daniel Erro; Eva Navas

This paper presents a new speech watermarking technique using harmonic modelling of the speech signal and coding of the harmonic phase. We use a representation of the instantaneous harmonic phase which allows straightforward manipulation of its values to embed the digital watermark. The technique converts each harmonic into a communication channel, whose performance is analysed in terms of distortion and BER. The tests performed show that, with a simple coding scheme, a bit rate of 300 bps can be achieved with minimal perceptual distortion and almost zero BER.
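
As a toy illustration of phase coding (one bit per harmonic, encoded in the sign of a fixed phase offset; this is not the paper's actual coding scheme):

```python
import numpy as np

def embed_bits(rps, bits, delta=np.pi / 2):
    """Toy watermark: overwrite each harmonic's phase value with
    +delta or -delta depending on the bit to embed."""
    rps = np.array(rps, dtype=float)
    for k, bit in enumerate(bits):
        rps[k] = delta if bit else -delta
    return rps

def extract_bits(rps):
    return [1 if p > 0 else 0 for p in rps]

marked = embed_bits([0.3, -1.2, 2.0, 0.7], [1, 0, 1, 1])
print(extract_bits(marked))  # -> [1, 0, 1, 1]
```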

Collaboration


Dive into Ibon Saratxaga's collaborations.

Top Co-Authors

Eva Navas (University of the Basque Country)
Jon Sanchez (University of the Basque Country)
Inma Hernaez (University of the Basque Country)
Daniel Erro (University of the Basque Country)
Iñaki Sainz (University of the Basque Country)
Iker Luengo (University of the Basque Country)
Inmaculada Hernáez (University of the Basque Country)
Igor Odriozola (University of the Basque Country)
David Tavarez (University of the Basque Country)
Michael Pucher (Austrian Academy of Sciences)