Inma Hernaez
University of the Basque Country
Publications
Featured research published by Inma Hernaez.
international conference on acoustics, speech, and signal processing | 2011
Phillip L. De Leon; Inma Hernaez; Ibon Saratxaga; Michael Pucher; Junichi Yamagishi
In this paper, we present new results from our research into the vulnerability of a speaker verification (SV) system to synthetic speech. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, and both GMM-UBM and support vector machine (SVM) SV systems. Using 283 speakers from the Wall Street Journal (WSJ) corpus, our SV systems have a 0.35% EER. When the systems are tested with synthetic speech generated from speaker models derived from the WSJ corpus, over 91% of the matched claims are accepted. We propose the use of relative phase shift (RPS) in order to detect synthetic speech and develop a GMM-based synthetic speech classifier (SSC). Using the SSC, we are able to correctly classify human speech in 95% of tests and synthetic speech in 88% of tests, thus significantly reducing the vulnerability.
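The GMM-based synthetic speech classifier described above can be sketched as a two-model log-likelihood-ratio test: one model trained on human speech, one on synthetic speech, and a decision based on which scores an utterance higher. The following is a minimal illustration in which a single diagonal-covariance Gaussian stands in for each class GMM and random vectors stand in for RPS-derived features; all names, dimensions, and values are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in feature matrices (frames x dims); in the paper these would be
# RPS-derived features from human and synthetic speech, respectively.
human_feats = rng.normal(0.0, 1.0, size=(500, 12))
synth_feats = rng.normal(0.8, 1.0, size=(500, 12))

class DiagGaussian:
    """Single diagonal-covariance Gaussian (a 1-component stand-in for a GMM)."""
    def fit(self, X):
        self.mu = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-6  # floor to avoid division by zero
        return self
    def score(self, X):
        # Mean per-frame log-likelihood over the utterance.
        ll = -0.5 * (np.log(2 * np.pi * self.var) + (X - self.mu) ** 2 / self.var)
        return ll.sum(axis=1).mean()

g_human = DiagGaussian().fit(human_feats)
g_synth = DiagGaussian().fit(synth_feats)

def classify(utterance_feats):
    # Log-likelihood ratio; positive favours the human model.
    llr = g_human.score(utterance_feats) - g_synth.score(utterance_feats)
    return "human" if llr > 0 else "synthetic"

print(classify(rng.normal(0.0, 1.0, size=(200, 12))))
```

A real system would replace the single Gaussians with multi-component GMMs and the random vectors with phase-derived features, but the decision rule is the same likelihood ratio.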
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Phillip L. De Leon; Michael Pucher; Junichi Yamagishi; Inma Hernaez; Ibon Saratxaga
In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model–universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.
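The relative phase shift (RPS) feature proposed here measures each harmonic's phase against the fundamental's phase scaled to that harmonic, psi_k = phi_k - k * phi_1, which removes the linear phase contribution of the fundamental. A minimal sketch of that computation, assuming the per-harmonic instantaneous phases have already been estimated by a harmonic analyzer:

```python
import numpy as np

def relative_phase_shift(phases):
    """Relative phase shift of a vector of harmonic phases.

    phases[k] is the instantaneous phase of harmonic k+1 (index 0 is the
    fundamental). The RPS removes the part that grows linearly with the
    harmonic number: psi_k = phi_k - k * phi_1, wrapped to (-pi, pi].
    """
    k = np.arange(1, len(phases) + 1)
    rps = phases - k * phases[0]
    return np.angle(np.exp(1j * rps))  # wrap to (-pi, pi]

# Harmonics of a perfectly phase-coherent pulse train satisfy
# phi_k = k * phi_1, so their RPS is identically zero.
phi1 = 0.3
coherent = np.arange(1, 6) * phi1
print(relative_phase_shift(coherent))
```

Because natural speech has a characteristic, stable RPS pattern while many vocoders discard or simplify phase, deviations in this representation are usable evidence of synthesis.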
IEEE Journal of Selected Topics in Signal Processing | 2014
Daniel Erro; Iñaki Sainz; Eva Navas; Inma Hernaez
This article explores the potential of the harmonics plus noise model (HNM) of speech in the development of a high-quality vocoder applicable in statistical frameworks, particularly in modern speech synthesizers. It presents an extensive explanation of all the different alternatives considered during the design of the HNM-based vocoder, together with the corresponding objective and subjective experiments, and a careful description of its implementation details. Three aspects of the analysis have been investigated: refinement of the pitch estimation using quasi-harmonic analysis, study and comparison of several spectral envelope analysis procedures, and strategies to analyze and model the maximum voiced frequency. The performance of the resulting vocoder is shown to be similar to that of state-of-the-art vocoders in synthesis tasks.
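In a harmonics-plus-noise vocoder, each frame below the maximum voiced frequency is resynthesized as a sum of harmonics of f0, with a stochastic component covering the band above it. A toy single-frame sketch of that synthesis step (unshaped white noise stands in for the modeled noise part, and all parameter values are illustrative, not taken from the article):

```python
import numpy as np

def hnm_frame(f0, amps, phases, mvf, fs, n):
    """Synthesize one frame under a harmonics-plus-noise model (sketch).

    Harmonics of f0 up to the maximum voiced frequency (mvf, in Hz) form
    the periodic part; white noise stands in for the stochastic part
    (a real vocoder would shape the noise spectrally and temporally).
    """
    t = np.arange(n) / fs
    voiced = np.zeros(n)
    for k, (a, p) in enumerate(zip(amps, phases), start=1):
        if k * f0 > mvf:
            break  # harmonics above the maximum voiced frequency are dropped
        voiced += a * np.cos(2 * np.pi * k * f0 * t + p)
    noise = 0.01 * np.random.default_rng(0).standard_normal(n)
    return voiced + noise

# One 20 ms frame at 16 kHz with f0 = 100 Hz; mvf = 350 Hz keeps 3 harmonics.
frame = hnm_frame(f0=100.0, amps=[1.0, 0.5, 0.25, 0.1],
                  phases=[0.0, 0.3, 0.6, 0.9], mvf=350.0, fs=16000, n=320)
```

The design questions studied in the article (pitch refinement, envelope estimation, maximum voiced frequency modeling) all feed the parameters this synthesis step consumes.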
IEEE Transactions on Audio, Speech, and Language Processing | 2013
Daniel Erro; Eva Navas; Inma Hernaez
Voice conversion methods based on frequency warping followed by amplitude scaling have been recently proposed. These methods modify the frequency axis of the source spectrum in such a manner that some significant parts of it, usually the formants, are moved towards their image in the target speaker's spectrum. Amplitude scaling is then applied to compensate for the differences between warped source spectra and target spectra. This article presents a fully parametric formulation of a frequency warping plus amplitude scaling method in which bilinear frequency warping functions are used. Introducing this constraint allows the conversion error to be described in the cepstral domain and minimized with respect to the parameters of the transformation through an iterative algorithm, even when multiple overlapping conversion classes are considered. The paper explores the advantages and limitations of this approach when applied to a cepstral representation of speech. We show that it achieves significant improvements in quality with respect to traditional methods based on Gaussian mixture models, with no loss in average conversion accuracy. Despite its relative simplicity, it achieves similar performance scores to state-of-the-art statistical methods involving dynamic features and global variance.
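The bilinear warping functions used here are the frequency maps of a first-order all-pass filter, which is what makes the warp expressible as a linear operation on cepstra. A sketch of the warping curve itself, with a single warping parameter alpha (the value used below is illustrative):

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """First-order all-pass (bilinear) frequency warping.

    Maps normalized frequency omega in [0, pi] to
    omega + 2*arctan(alpha*sin(omega) / (1 - alpha*cos(omega))).
    For |alpha| < 1 the map is monotonic and fixes 0 and pi;
    alpha = 0 is the identity, alpha > 0 stretches low frequencies.
    """
    return omega + 2 * np.arctan(
        alpha * np.sin(omega) / (1 - alpha * np.cos(omega)))

omega = np.linspace(0, np.pi, 5)
print(bilinear_warp(omega, 0.42))  # warped grid; endpoints stay at 0 and pi
```

Because this family is parameterized by a single alpha per conversion class, the cepstral-domain error can be minimized over the warping and scaling parameters jointly, as the abstract describes.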
IEEE Transactions on Information Forensics and Security | 2015
Jon Sanchez; Ibon Saratxaga; Inma Hernaez; Eva Navas; Daniel Erro; Tuomo Raitio
In the field of speaker verification (SV), it is nowadays feasible and relatively easy to create a synthetic voice to deceive a speech-driven biometric access system. This paper presents a synthetic speech detector that can be connected at the front-end or at the back-end of a standard SV system, and that will protect it from spoofing attacks coming from state-of-the-art statistical Text-to-Speech (TTS) systems. The system described is a Gaussian Mixture Model (GMM) based binary classifier that uses natural and copy-synthesized signals obtained from the Wall Street Journal database to train the system models. Three different state-of-the-art vocoders are chosen and modeled using two sets of acoustic parameters: 1) relative phase shift and 2) canonical Mel Frequency Cepstral Coefficient (MFCC) parameters, as baseline. The vocoder dependency of the system and multivocoder modeling features are thoroughly studied. Additional phase-aware vocoders are also tested. Several experiments are carried out, showing that the phase-based parameters perform better and are able to cope with new unknown attacks. The final evaluations, testing synthetic TTS signals obtained from the Blizzard Challenge, validate our proposal.
Lecture Notes in Computer Science | 2004
Juan J. Igarza; Lorea Gómez; Inma Hernaez; Iñaki Goirizelaia
The handwritten signature is the expression of the will and consent in daily operations such as banking transactions, access control, contracts, etc. However, since signing is a behavioral feature it is not invariant; we do not always sign at the same speed, in the same position or at the same orientation. In order to reduce the errors caused in verification by these differences between original signatures, we have introduced a new concept of reference system for the (x, y) coordinates, derived from experimental results. The basis of this concept lies in using global references (centre of mass and principal axes of inertia) instead of local references (initial point and initial angle) for a recognition system based on local parameters. The system is based on the hypothesis that signing is a feedback process, in which we react to our own signature while writing it, following patterns stored in our brain.
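The global reference system described above (centre of mass plus principal axes of inertia) amounts to translating the coordinate samples to their centroid and rotating them onto the eigenvectors of their covariance matrix, so the representation no longer depends on where or at what angle the signature was written. A minimal sketch with hypothetical sample points; note that eigenvector sign choices leave a residual reflection ambiguity that a full system would also have to resolve:

```python
import numpy as np

def normalize_signature(points):
    """Re-reference (x, y) samples to global properties of the trace.

    Translate to the centre of mass, then rotate onto the principal
    axes of inertia (eigenvectors of the coordinate covariance).
    """
    centred = points - points.mean(axis=0)
    cov = np.cov(centred.T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Columns of eigvecs are the principal axes (ascending eigenvalue),
    # so the projection decorrelates the two coordinates.
    return centred @ eigvecs

# Hypothetical pen-position samples along a diagonal stroke.
sig = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
norm = normalize_signature(sig)
```

After this step, two copies of the same signature written at different positions and orientations map to (nearly) the same coordinate sequence, which is the point of using global rather than local references.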
international conference on acoustics, speech, and signal processing | 2011
Daniel Erro; Iñaki Sainz; Eva Navas; Inma Hernaez
Currently, the statistical framework based on Hidden Markov Models (HMMs) plays a relevant role in speech synthesis, while voice conversion systems based on Gaussian Mixture Models (GMMs) are almost standard. In both cases, statistical modeling is applied to learn distributions of acoustic vectors extracted from speech signals, each vector containing a suitable parametric representation of one speech frame. The overall performance of the systems is often limited by the accuracy of the underlying speech parameterization and reconstruction method. The method presented in this paper allows accurate MFCC extraction and high-quality reconstruction of speech signals assuming a Harmonics plus Noise Model (HNM). Its suitability for high-quality HMM-based speech synthesis is shown through subjective tests.
text speech and dialogue | 1999
Gorka Elordieta; Iñaki Gaminde; Inma Hernaez; Jasone Salaberria; Igor Martin de Vidales
In this paper the basic features of the intonational structure of Bermeo Basque (BB) are analyzed. In BB there exists a lexical distinction between accented and unaccented words. Accented words are always stressed, and unaccented words (not containing any accented morphemes) only receive stress on their final syllable when they immediately precede the verb; otherwise they surface stressless. BB shows a hierarchically organized intonational structure. Accentual Phrases (APs) are identified by an initial boundary tone, a phrasal H- tone associated to the second syllable, and a nuclear H* pitch accent. The H- tone spreads rightwards to other syllables until an H* pitch accent is met. An H* pitch accent triggers downstep on the following H*. An Intermediate Phrase contains one or more APs, and is the domain of downstep. Finally, an Intonational Phrase is signalled by a final L% in declaratives.
IberSPEECH | 2012
Tudor-Cătălin Zorilă; Daniel Erro; Inma Hernaez
This paper presents a new method to train traditional voice conversion functions based on Gaussian mixture models, linear transforms and cepstral parameterization. Instead of using statistical criteria, this method calculates a set of linear transforms that represent physically meaningful spectral modifications such as frequency warping and amplitude scaling. Our experiments indicate that the proposed training method leads to significant improvements in the average quality of the converted speech with respect to traditional statistical methods. This is achieved without modifying the input/output parameters or the shape of the conversion function.
Speech Communication | 2016
Ibon Saratxaga; Jon Sanchez; Zhizheng Wu; Inma Hernaez; Eva Navas
Highlights: phase-information-based synthetic speech detectors (RPS, MGD) are analyzed; training using real attack samples and copy-synthesized material is evaluated; the detectors are evaluated against unknown attacks, including channel effects; the detectors work well against voice conversion and adapted synthetic speech impostors.
Taking advantage of the fact that most speech processing techniques neglect the phase information, we seek to detect phase perturbations in order to prevent synthetic impostors from attacking Speaker Verification systems. Two Synthetic Speech Detection (SSD) systems that use spectral phase related information are reviewed and evaluated in this work: one based on the Modified Group Delay (MGD) and the other based on the Relative Phase Shift (RPS). A classical magnitude-based MFCC system is also used as a baseline. Different training strategies are proposed and evaluated using both real spoofing samples and signals copy-synthesized from the natural ones, aiming to alleviate the issue of obtaining real data to train the systems. The recently published ASVspoof 2015 database is used for training and evaluation. Performance with completely unrelated data is also checked using synthetic speech from the Blizzard Challenge as evaluation material. The results prove that phase information can be successfully used for the SSD task even with unknown attacks.
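The modified group delay (MGD) feature used by one of the detectors is derived from the standard group delay tau(w) = (Xr*Yr + Xi*Yi)/|X(w)|^2, where Y is the DFT of n*x[n], with the denominator replaced by a smoothed spectrum raised to 2*gamma and the result compressed by an exponent alpha. A rough sketch under those definitions; a moving-average smoother stands in for the usual cepstral smoothing, and the gamma and alpha values are illustrative, not taken from the paper:

```python
import numpy as np

def modified_group_delay(x, gamma=0.9, alpha=0.4, eps=1e-8):
    """Modified group delay function of one windowed frame (sketch).

    Standard group delay is (Xr*Yr + Xi*Yi)/|X|^2 with Y = DFT(n*x[n]);
    the modified variant divides by a smoothed magnitude spectrum raised
    to 2*gamma and compresses the result with exponent alpha.
    """
    n = np.arange(len(x))
    X = np.fft.rfft(x)
    Y = np.fft.rfft(n * x)
    # Crude moving-average smoothing in place of cepstral smoothing.
    S = np.convolve(np.abs(X), np.ones(5) / 5, mode="same")
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + eps)
    return np.sign(tau) * np.abs(tau) ** alpha

# One Hann-windowed sinusoidal frame as stand-in input.
frame = np.hanning(256) * np.sin(2 * np.pi * 0.05 * np.arange(256))
mgd = modified_group_delay(frame)
```

As with RPS, the point is that this representation is sensitive to the phase structure that many synthesis pipelines discard, which is what makes it usable for spoofing detection.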