Publications


Featured research published by Roberto Barra-Chicote.


Speech Communication | 2010

Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech

Roberto Barra-Chicote; Junichi Yamagishi; Simon King; Juan Manuel Montero; Javier Macias-Guarasa

We have applied two state-of-the-art speech synthesis techniques (unit selection and HMM-based synthesis) to the synthesis of emotional speech. A series of carefully designed perceptual tests to evaluate speech quality, emotion identification rates, and emotional strength were used for the six emotions we recorded: happiness, sadness, anger, surprise, fear, and disgust. For the HMM-based method, we evaluated spectral and source components separately and identified which components contribute to which emotion. Our analysis shows that, although the HMM method produces significantly better neutral speech, the two methods produce emotional speech of similar quality, except for emotions having context-dependent prosodic patterns. Whilst synthetic speech produced using the unit selection method has better emotional strength scores than the HMM-based method, the HMM-based method has the ability to manipulate the emotional strength. For emotions characterized by both spectral and prosodic components, synthetic speech produced using unit selection was more accurately identified by listeners; for emotions mainly characterized by prosodic components, HMM-based synthetic speech was more accurately identified. This finding differs from previous results regarding listener judgements of speaker similarity for neutral speech. We conclude that unit selection methods require improvements to prosodic modeling and that HMM-based methods require improvements to spectral modeling for emotional speech. Certain emotions cannot be reproduced well by either method.
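
A compact illustration of how per-emotion identification rates from such a forced-choice perceptual test can be tabulated. This is a generic sketch, not the authors' evaluation code, and the listener responses in the demo are hypothetical.

```python
# Minimal sketch (not the authors' evaluation code): tabulating per-emotion
# identification rates from forced-choice listener responses.
from collections import Counter

EMOTIONS = ["happiness", "sadness", "anger", "surprise", "fear", "disgust"]

def identification_rates(responses):
    """responses: iterable of (intended, perceived) emotion pairs.
    Returns {emotion: fraction of stimuli correctly identified}."""
    totals, correct = Counter(), Counter()
    for intended, perceived in responses:
        totals[intended] += 1
        if perceived == intended:
            correct[intended] += 1
    return {e: correct[e] / totals[e] for e in EMOTIONS if totals[e]}

# Hypothetical listener judgements for three stimuli:
demo = [("anger", "anger"), ("fear", "surprise"), ("fear", "fear")]
print(identification_rates(demo))  # {'anger': 1.0, 'fear': 0.5}
```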


Signal Processing | 2016

Feature extraction from smartphone inertial signals for human activity segmentation

Rubén San-Segundo; Juan Manuel Montero; Roberto Barra-Chicote; Fernando Fernández; José Manuel Pardo

This paper proposes the adaptation to inertial signals of well-known strategies successfully used in speech processing: Mel Frequency Cepstral Coefficients (MFCCs) and Perceptual Linear Prediction (PLP) coefficients. Additionally, characteristics such as RASTA filtering and delta coefficients are also considered and evaluated for inertial signal processing. These adaptations have been incorporated into a Human Activity Recognition and Segmentation (HARS) system based on Hidden Markov Models (HMMs) for recognizing and segmenting six different physical activities: walking, walking upstairs, walking downstairs, sitting, standing, and lying. All experiments were done using a publicly available dataset, UCI Human Activity Recognition Using Smartphones, which includes several sessions with physical activity sequences from 30 volunteers. This dataset was randomly divided into six subsets for a six-fold cross-validation procedure, and average values over the six folds are reported for every experiment. The results presented in this paper significantly outperform the baseline error rates, constituting a relevant contribution to the field. Adapted MFCC and PLP coefficients improve human activity recognition and segmentation accuracies while considerably reducing the feature vector size. RASTA filtering and delta coefficients contribute significantly to reducing the segmentation error rate, yielding the best results: an Activity Segmentation Error Rate lower than 0.5%. Highlights: human activity segmentation using Hidden Markov Models; frequency-based feature extraction from inertial signals; RASTA filtering analysis and delta coefficients; important dimensionality reduction.
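
The feature-extraction idea above can be sketched in a few lines. The following is a minimal, illustrative adaptation of cepstral analysis to a one-axis inertial signal (framing, windowing, log filterbank energies, DCT, delta coefficients); the filterbank design and all parameters are assumptions for illustration, not the paper's configuration.

```python
# Illustrative adaptation of cepstral analysis to a one-axis inertial signal.
# Filterbank design and all parameters are assumptions, not the paper's setup.
import numpy as np

def frame_signal(x, frame_len=64, hop=32):
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def cepstral_features(x, n_filters=12, n_ceps=6):
    frames = frame_signal(x)
    frames = frames * np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular filterbank with linearly spaced centers (a simplification of
    # the Mel/PLP-style filterbanks adapted in the paper).
    n_bins = power.shape[1]
    centers = np.linspace(0, n_bins - 1, n_filters + 2)
    k = np.arange(n_bins)
    fb = np.zeros((n_filters, n_bins))
    for m in range(1, n_filters + 1):
        l, c, r = centers[m - 1], centers[m], centers[m + 1]
        fb[m - 1] = np.clip(np.minimum((k - l) / (c - l), (r - k) / (r - c)), 0, None)
    log_energies = np.log(power @ fb.T + 1e-10)
    # DCT-II to decorrelate the log filterbank energies.
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n_filters) + 0.5) / n_filters)
    ceps = log_energies @ dct.T
    deltas = np.gradient(ceps, axis=0)  # simple delta coefficients
    return np.hstack([ceps, deltas])

print(cepstral_features(np.random.randn(500)).shape)  # (frames, 2 * n_ceps)
```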


Interacting with Computers | 2010

Spoken Spanish generation from sign language

Rubén San-Segundo; José Manuel Pardo; Javier Ferreiros; V. Sama; Roberto Barra-Chicote; J.M. Lucas; D. Sánchez; A. García

This paper describes the development of a spoken Spanish generator from sign-writing. The sign language considered was Spanish Sign Language (LSE: Lengua de Signos Española). The system consists of an advanced visual interface (where a deaf person can specify a sequence of signs in sign-writing), a language translator (for generating the sequence of words in Spanish), and finally a text-to-speech converter. The visual interface allows a sign sequence to be defined using several sign-writing alternatives. The paper details the process of designing the visual interface, proposing solutions for HCI-specific challenges that arise when working with the Deaf (e.g., significant difficulties in writing Spanish, or limited sign coverage for describing abstract or conceptual ideas). Three strategies were developed and combined for language translation to implement the final version of the language translator module. The summative evaluation, carried out with Deaf participants from Madrid and Toledo, includes objective measurements from the system and subjective information from questionnaires. The paper also describes the first Spanish-LSE parallel corpus for language processing research focused on specific domains. This corpus includes more than 4,000 Spanish sentences translated into LSE, focused on two restricted domains: the renewal of the identity document and the driver's license. The corpus also contains all sign descriptions in several sign-writing specifications, generated with a new version of the eSign Editor. This new version includes a grapheme-to-phoneme system for Spanish and a SEA-HamNoSys converter.
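
A toy sketch of the three-stage architecture described above (visual interface output, language translator, text-to-speech). The function names and the tiny gloss lexicon are hypothetical stand-ins, not the system's actual API.

```python
# Toy sketch of the three-stage pipeline; `translate` and `synthesize` are
# hypothetical stand-ins for the system's translator and TTS modules.
def sign_to_speech(sign_sequence, translate, synthesize):
    """sign_sequence: sign glosses produced by the visual interface."""
    spanish_text = translate(sign_sequence)   # sign glosses -> Spanish words
    return synthesize(spanish_text)           # Spanish text -> audio

# Hypothetical gloss lexicon and dummy TTS for demonstration:
lexicon = {"RENOVAR": "renovar", "DNI": "el documento de identidad"}
audio = sign_to_speech(
    ["RENOVAR", "DNI"],
    translate=lambda signs: " ".join(lexicon.get(s, s.lower()) for s in signs),
    synthesize=lambda text: f"<waveform for: {text}>",
)
print(audio)
```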


IEEE Transactions on Audio, Speech, and Language Processing | 2011

Speaker Diarization Based on Intensity Channel Contribution

Roberto Barra-Chicote; José Manuel Pardo; Javier Ferreiros; Juan Manuel Montero

The time delay of arrival (TDOA) between multiple microphones has been used since 2006 as a source of information (localization) to complement the spectral features for speaker diarization. In this paper, we propose a new localization feature, the intensity channel contribution (ICC), based on the relative energy of the signal arriving at each channel compared to the sum of the energy of all the channels. We demonstrate that combining the ICC features with the TDOA features improves the robustness of the localization features and reduces the diarization error rate (DER) of the complete system (using localization and spectral features). Using this new localization feature, we achieve a 5.2% relative DER improvement on our development data, a 3.6% relative DER improvement on the RT07 evaluation data, and a 7.9% relative DER improvement on last year's RT09 evaluation data.
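
A plausible reading of the ICC definition above in code: per analysis frame, each channel's energy divided by the summed energy over all channels. This is an illustrative sketch, not the paper's implementation.

```python
# Illustrative ICC computation (not the paper's code): each channel's frame
# energy normalized by the summed frame energy over all channels.
import numpy as np

def intensity_channel_contribution(frames):
    """frames: array of shape (n_channels, n_frames, frame_len).
    Returns ICC of shape (n_frames, n_channels); each row sums to 1."""
    energy = np.sum(frames ** 2, axis=2)               # (n_channels, n_frames)
    icc = energy / (energy.sum(axis=0, keepdims=True) + 1e-12)
    return icc.T

mics = np.random.randn(4, 100, 400)  # e.g. 4 distant microphones, 100 frames
print(intensity_channel_contribution(mics)[0])  # contributions for frame 0
```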


Springer International Publishing | 2014

Towards Cross-Lingual Emotion Transplantation

Jaime Lorenzo-Trueba; Roberto Barra-Chicote; Junichi Yamagishi; Juan Manuel Montero

In this paper we introduce the idea of cross-lingual emotion transplantation. The aim is to learn the nuances of emotional speech in a source language for which we have enough data to adapt an emotional model of acceptable quality by means of CSMAPLR adaptation, and then to convert the adaptation function so that it can be applied to a different target speaker in a target language, maintaining the speaker's identity while adding emotional information. The conversion between languages is done at the state level by measuring the Kullback-Leibler divergence (KLD) between the Gaussian distributions of all the states and linking the closest ones. Finally, as the cross-lingual transplantation of spectral emotions (mainly anger) was found to introduce significant amounts of spectral noise, we show the results of applying three different techniques, related to the adaptation parameters, that can be used to reduce the noise. The results are measured objectively by means of a two-dimensional PCA projection of the KLD distances between the considered models: the neutral models of both languages, the reference emotion models of both languages, and the transplanted emotional model for the target language.
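
The state-linking step lends itself to a short sketch. Assuming diagonal-covariance Gaussians per state and a symmetrized KL divergence (the symmetrization is an assumption; the paper may link states differently), a closest-state mapping might look like this:

```python
# Sketch of closest-state linking via symmetrized KL divergence between
# diagonal-covariance Gaussians (the symmetrization is an assumption).
import numpy as np

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(p || q) for diagonal-covariance Gaussians."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def link_states(source_states, target_states):
    """States are (mean, variance) pairs of 1-D arrays. Returns, for every
    source state, the index of the closest target state."""
    links = []
    for mu_s, var_s in source_states:
        dists = [kl_diag_gauss(mu_s, var_s, mu_t, var_t)
                 + kl_diag_gauss(mu_t, var_t, mu_s, var_s)
                 for mu_t, var_t in target_states]
        links.append(int(np.argmin(dists)))
    return links

rng = np.random.default_rng(0)
make_state = lambda: (rng.normal(size=5), rng.uniform(0.5, 2.0, size=5))
print(link_states([make_state() for _ in range(3)],
                  [make_state() for _ in range(4)]))
```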


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Speaker Diarization Features: The UPM Contribution to the RT09 Evaluation

José Manuel Pardo; Roberto Barra-Chicote; Rubén San-Segundo; R. de Cordoba; B. Martinez-Gonzalez

Two new features have been proposed and used in the Rich Transcription Evaluation 2009 by the Universidad Politécnica de Madrid, outperforming the results of the baseline system. One of the features is the intensity channel contribution, a feature related to the location of the speaker. The second feature is the logarithm of the interpolated fundamental frequency. This is the first time that both features have been applied to the clustering stage of diarization for meetings recorded with multiple distant microphones. We show that including both features improves the baseline results by 15.36% and 16.71% relative on the development set and the RT09 set, respectively. Considering speaker errors only, the relative improvement is 23% on the development set and 32.83% on the RT09 set.
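
A small sketch of the second feature: a log F0 track made continuous by filling unvoiced frames (F0 = 0) via linear interpolation, so the feature is defined on every frame handed to the clustering stage. The interpolation scheme is an assumption for illustration; the paper does not necessarily use this exact method.

```python
# Sketch of a log interpolated-F0 feature: unvoiced frames (F0 == 0) are
# filled by linear interpolation (an assumed scheme) before taking the log.
import numpy as np

def log_interpolated_f0(f0):
    """f0: 1-D array of per-frame F0 in Hz, with 0 marking unvoiced frames."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("no voiced frames to interpolate from")
    idx = np.arange(len(f0))
    filled = np.interp(idx, idx[voiced], f0[voiced])  # flat at the edges
    return np.log(filled)

print(log_interpolated_f0([0, 120, 0, 0, 180, 0]).round(3))
```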


Expert Systems With Applications | 2013

LSESpeak: A spoken language generator for Deaf people

Verónica López-Ludeña; Roberto Barra-Chicote; Syaheerah Lebai Lutfi; Juan Manuel Montero; Rubén San-Segundo

This paper describes the development of LSESpeak, a spoken Spanish generator for Deaf people. The system integrates two main tools: a sign-language-to-speech translation system and an SMS (Short Message Service)-to-speech translation system. The first tool is made up of three modules: an advanced visual interface (where a deaf person can specify a sequence of signs), a language translator (for generating the sequence of words in Spanish), and finally an emotional text-to-speech (TTS) converter to generate spoken Spanish. The visual interface allows a sign sequence to be defined using several utilities. The emotional TTS converter is based on Hidden Semi-Markov Models (HSMMs), permitting voice gender, type of emotion, and emotional strength to be controlled. The second tool is made up of an SMS message editor, a language translator, and the same emotional text-to-speech converter. Both translation tools use a phrase-based translation strategy in which translation and target language models are trained from parallel corpora. In the experiments carried out to evaluate translation performance, the sign-language-to-speech translation system reported a 96.45 BLEU and the SMS-to-speech system a 44.36 BLEU in a specific domain: the renewal of the identity document and the driver's license. In the evaluation of the emotional TTS, it is important to highlight the improvement in naturalness thanks to the morpho-syntactic features, and the high flexibility provided by HSMMs when generating different emotional strengths.
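
For reference, the BLEU scores above (on a 0-100 scale) come from a standard corpus-level metric. Below is a compact, generic BLEU sketch with clipped n-gram precisions up to 4-grams and a brevity penalty; it is not the exact evaluation toolkit used in the paper, and the demo tokens are hypothetical.

```python
# Generic corpus-level BLEU sketch (0-100 scale), not the paper's toolkit.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    """hypotheses, references: parallel lists of token lists (one reference
    per hypothesis)."""
    match, total = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            match[n - 1] += sum((h & r).values())  # clipped matches
            total[n - 1] += max(sum(h.values()), 1)
    if min(match) == 0:
        return 0.0  # some precision is zero; smoothed variants exist
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * brevity * math.exp(log_prec)

hyp = [["quiero", "renovar", "el", "documento", "de", "identidad"]]
print(round(bleu(hyp, hyp), 2))  # identical hypothesis and reference -> 100.0
```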


Journal of the Acoustical Society of America | 2016

Improving Spanish speech synthesis intelligibility under noisy environments

Jaime Lorenzo-Trueba; Roberto Barra-Chicote; Junichi Yamagishi; Juan Manuel Montero

In this paper, we evaluate a newly recorded Spanish Lombard speech database. This database was recorded with expressive speech synthesis in mind, and more particularly with adaptability to the environment. Real stationary noise recorded inside a car was played to the speaker through headphones to elicit the Lombard response. Four different noise levels, in steps of 5 dB, were used, plus clean speech to set the clean-speech baseline. Finally, a pair of intelligibility evaluations was carried out. The first, with natural speech, validates the recorded database by showing a 37% absolute increase in intelligibility in a -10 dB SNR condition compared to non-Lombard speech. The second, with synthetic speech, showed a 10% absolute increase in intelligibility for both the -10 and -15 dB SNR conditions.
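
A minimal sketch of how speech and noise might be mixed at a target SNR (e.g., -10 dB) for such an intelligibility test. The power-based scaling convention here is a common choice, assumed for illustration rather than taken from the paper.

```python
# Sketch of mixing speech with car noise at a target SNR; the power-based
# scaling convention is assumed for illustration.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10 * log10(P_speech / P_noise) == snr_db."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(1)
speech = rng.standard_normal(16000)   # stand-in for a 1 s utterance at 16 kHz
noise = rng.standard_normal(16000)    # stand-in for stationary car noise
print(mix_at_snr(speech, noise, -10.0).shape)
```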


Conference of the International Speech Communication Association | 2009

Acoustic Emotion Recognition using Dynamic Bayesian Networks and Multi-Space Distributions

Roberto Barra-Chicote; Fernando Fernández-Martínez; Syaheerah Lebai Lutfi; Juan Manuel Lucas-Cuesta; Javier Macías Guarasa; Juan Manuel Montero; Rubén San Segundo; José Manuel Pardo


Conference of the International Speech Communication Association | 2006

A Spanish speech to sign language translation system for assisting deaf-mute people

Rubén San Segundo; Roberto Barra-Chicote; Luis Fernando D'Haro; Juan Manuel Montero; Ricardo de Córdoba; Javier Ferreiros

Collaboration


Dive into Roberto Barra-Chicote's collaborations.

Top Co-Authors

Juan Manuel Montero (Technical University of Madrid)
Junichi Yamagishi (National Institute of Informatics)
Jaime Lorenzo-Trueba (Technical University of Madrid)
Javier Ferreiros (Technical University of Madrid)
José Manuel Pardo (Technical University of Madrid)
Rubén San-Segundo (Technical University of Madrid)
Oliver Watts (University of Edinburgh)
Javier Macías Guarasa (Technical University of Madrid)
Luis Fernando D'Haro (Technical University of Madrid)