João P. Cabral
University College Dublin
Publications
Featured research published by João P. Cabral.
international conference on acoustics, speech, and signal processing | 2011
João P. Cabral; Steve Renals; Junichi Yamagishi; Korin Richmond
A major factor which causes a deterioration in speech quality in HMM-based speech synthesis is the use of a simple delta pulse signal to generate the excitation of voiced speech. This paper sets out a new approach to using an acoustic glottal source model in HMM-based synthesisers instead of the traditional pulse signal. The goal is to improve speech quality and to better model and transform voice characteristics. We have found that the new method decreases buzziness and also improves prosodic modelling. A perceptual evaluation supported this finding, showing a 55.6% preference for the new system over the baseline. This improvement, while not as large as we had initially expected, encourages us to develop the proposed speech synthesiser further.
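The key change described above is replacing the spectrally flat delta-pulse excitation of voiced frames with pulses generated by an acoustic glottal source model. Below is a minimal Python sketch of that contrast; it uses a simplified Rosenberg-style pulse as a stand-in for the LF-model actually used in the paper, and the sampling rate, F0 and pulse-shape parameters are illustrative assumptions rather than values from the system.

```python
import numpy as np

def impulse_train(n_samples, f0, fs):
    """Baseline excitation: a delta pulse at every pitch period."""
    period = int(fs / f0)
    excitation = np.zeros(n_samples)
    excitation[::period] = 1.0
    return excitation

def rosenberg_pulse(period, open_quotient=0.6, speed_quotient=2.0):
    """Simplified glottal pulse (Rosenberg model) standing in for the LF-model;
    the shape parameters here are illustrative."""
    n_open = int(open_quotient * period)
    n_rise = int(n_open * speed_quotient / (1.0 + speed_quotient))
    n_fall = n_open - n_rise
    rise = 0.5 * (1.0 - np.cos(np.pi * np.arange(n_rise) / max(n_rise, 1)))
    fall = np.cos(0.5 * np.pi * np.arange(n_fall) / max(n_fall, 1))
    closed = np.zeros(period - n_open)
    return np.concatenate([rise, fall, closed])

def glottal_excitation(n_samples, f0, fs):
    """Glottal-source-style excitation: one glottal pulse per pitch period."""
    period = int(fs / f0)
    pulse = rosenberg_pulse(period)
    n_periods = n_samples // period + 1
    return np.tile(pulse, n_periods)[:n_samples]

fs, f0 = 16000, 120.0
baseline_excitation = impulse_train(fs, f0, fs)   # 1 s of delta-pulse excitation
glottal_based = glottal_excitation(fs, f0, fs)    # 1 s of glottal-pulse excitation
```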
IEEE Journal of Selected Topics in Signal Processing | 2014
João P. Cabral; Korin Richmond; Junichi Yamagishi; Steve Renals
This paper proposes an analysis method, called Glottal Spectral Separation (GSS), to separate the glottal source and vocal tract components of speech. This method can produce high-quality synthetic speech using an acoustic glottal source model. In the source-filter models commonly used in speech technology applications, it is assumed that the source is a spectrally flat excitation signal and that the vocal tract filter can be represented by the spectral envelope of speech. Although this model can produce high-quality speech, it has limitations for voice transformation because it does not allow control over glottal parameters, which are correlated with voice quality. The main problem with using a speech model that better represents the glottal source and the vocal tract filter is that current analysis methods for separating these components are not robust enough to produce the same speech quality as a model based on the spectral envelope of speech. The proposed GSS method is an attempt to overcome this problem, and consists of three steps. First, the glottal source signal is estimated from the speech signal. Then, the speech spectrum is divided by the spectral envelope of the glottal source signal in order to remove the glottal source effects from the speech signal. Finally, the vocal tract transfer function is obtained by computing the spectral envelope of the resulting signal. In this work, the glottal source signal is represented using the Liljencrants-Fant model (LF-model). The experiments presented here show that the analysis-synthesis technique based on GSS can produce speech comparable to that of a high-quality vocoder based on the spectral envelope representation. However, it also permits control over voice quality, namely transforming a modal voice into breathy and tense voices, by modifying the glottal parameters.
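The three GSS steps translate directly into a per-frame frequency-domain procedure. The sketch below assumes step one (estimating the glottal source, e.g. by LF-model fitting) has already been done and its result is passed in; the cepstral-smoothing envelope is a generic stand-in for whatever envelope estimator the paper actually uses, and the FFT size and cepstral order are illustrative.

```python
import numpy as np

def cepstral_envelope(mag_spec, n_ceps=30):
    """Generic cepstral-smoothing spectral envelope (a stand-in for the
    envelope estimator used in the paper)."""
    n_fft = 2 * (len(mag_spec) - 1)
    ceps = np.fft.irfft(np.log(mag_spec + 1e-12), n_fft)
    ceps[n_ceps:-n_ceps] = 0.0                    # keep only low quefrencies
    return np.exp(np.fft.rfft(ceps, n_fft).real)

def glottal_spectral_separation(speech_frame, glottal_frame, n_fft=1024):
    """GSS on one analysis frame; `glottal_frame` is the estimated glottal
    source signal for the same frame (step 1, assumed already computed)."""
    speech_spec = np.abs(np.fft.rfft(speech_frame, n_fft))
    glottal_spec = np.abs(np.fft.rfft(glottal_frame, n_fft))
    # Step 2: divide out the spectral envelope of the glottal source.
    residual_spec = speech_spec / cepstral_envelope(glottal_spec)
    # Step 3: the vocal tract transfer function is the spectral envelope
    # of the resulting signal.
    return cepstral_envelope(residual_spec)

# Usage on random placeholder frames (stand-ins for real windowed signals):
speech = np.random.randn(1024)
glottal = np.random.randn(1024)
vocal_tract_envelope = glottal_spectral_separation(speech, glottal)
```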
non-linear speech processing | 2013
João P. Cabral; Julie Carson-Berndsen
The control over aspects of the glottal source signal is fundamental to correctly modifying relevant voice characteristics, such as breathiness. This voice quality is strongly related to the characteristics of the glottal source signal produced at the glottis, mainly the shape of the glottal pulse and the aspiration noise. This type of noise results from the turbulence of air passing through the glottis and can be represented by amplitude-modulated Gaussian noise, which depends on the glottal volume velocity and glottal area. However, the dependency between the glottal signal and the noise component is usually not taken into account when transforming breathiness. In this paper, we propose a method for modelling the aspiration noise which allows it to be adapted to the shape of the glottal pulse, taking this dependency into account while producing high-quality speech. The envelope of the amplitude-modulated noise is estimated pitch-synchronously from the speech signal and then parameterised using a non-linear polynomial fitting algorithm. Finally, an asymmetric triangular window is derived from the non-linear polynomial representation so that the energy envelope of the noise more closely follows that of the glottal source. In the experiments on voice transformation, both the proposed aspiration noise model and an acoustic glottal source model are used to transform a modal voice into a breathy voice. Results show that the aspiration noise model improves the voice quality transformation compared with an excitation using only the glottal model and with an excitation that combines the glottal source model and a spectral representation of the noise component.
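The processing chain, as described, is: pitch-synchronous envelope estimation, polynomial parameterisation, derivation of an asymmetric triangular window, and amplitude modulation of Gaussian noise. The sketch below follows that chain under stated assumptions: the envelope estimator, polynomial degree and the rule placing the window peak at the envelope maximum are all illustrative simplifications, not the paper's exact algorithm.

```python
import numpy as np

def noise_envelope(noise_period, smooth=8):
    """Amplitude envelope of the aspiration noise within one pitch period
    (simple magnitude + moving-average estimate; the paper's estimator may differ)."""
    kernel = np.ones(smooth) / smooth
    return np.convolve(np.abs(noise_period), kernel, mode="same")

def fit_polynomial_envelope(envelope, degree=6):
    """Parameterise the per-period envelope with a polynomial fit
    (degree is an illustrative choice)."""
    t = np.linspace(0.0, 1.0, len(envelope))
    coeffs = np.polyfit(t, envelope, degree)
    return np.polyval(coeffs, t)

def asymmetric_triangle(length, peak_position):
    """Asymmetric triangular window whose peak location is taken from the
    polynomial envelope."""
    peak = int(peak_position * (length - 1))
    up = np.linspace(0.0, 1.0, peak + 1)
    down = np.linspace(1.0, 0.0, length - peak)
    return np.concatenate([up, down[1:]])

def modulated_aspiration_noise(period_len, poly_env, rng):
    """Amplitude-modulated Gaussian noise for one pitch period."""
    peak_position = np.argmax(poly_env) / max(len(poly_env) - 1, 1)
    window = asymmetric_triangle(period_len, peak_position)
    return window * rng.standard_normal(period_len)

# Usage on a placeholder noise segment (one 160-sample pitch period):
rng = np.random.default_rng(0)
noisy_period = rng.standard_normal(160) * np.hanning(160)
poly_env = fit_polynomial_envelope(noise_envelope(noisy_period))
aspiration = modulated_aspiration_noise(160, poly_env, rng)
```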
information sciences, signal processing and their applications | 2012
Udochukwu Ogbureke; João P. Cabral; Julie Berndsen
This paper presents a novel approach to explicit duration modelling for HMM-based speech synthesis. The proposed approach is a two-step process. The first step is state-level phone alignment and the conversion of phone durations into numbers of frames. In the second step, a hidden Markov model (HMM) is trained in which the observation is the number of frames in each state and the hidden state is the phone. Finally, the duration of each state (the number of frames) is generated from the trained HMM. The hidden semi-Markov model (HSMM) is the baseline for explicit duration modelling in HMM-based speech synthesis. Both objective and perceptual evaluations on a held-out test set showed results comparable with the HSMM-based baseline. This duration modelling approach is computationally simpler than the HSMM and produces comparable quality of synthetic speech.
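A minimal sketch of step two is given below, assuming the state-level alignment has already produced per-state frame counts. It uses hmmlearn's GaussianHMM purely as a generic HMM toolkit; the toy data, the number of hidden states and the Gaussian output distribution are assumptions for illustration, not the configuration used in the paper.

```python
import numpy as np
from hmmlearn import hmm  # generic HMM toolkit, used here as a stand-in

# Step 1 (assumed done): forced alignment produced, for each utterance, a
# sequence of per-state frame counts. Toy placeholder data:
rng = np.random.default_rng(0)
utterances = [rng.integers(2, 20, size=n).astype(float) for n in (12, 9, 15)]

# hmmlearn expects one concatenated observation matrix plus per-sequence lengths.
X = np.concatenate(utterances).reshape(-1, 1)
lengths = [len(u) for u in utterances]

# Step 2: train an HMM whose observations are frame counts and whose hidden
# states stand in for phone classes (n_components is an illustrative choice).
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
model.fit(X, lengths)

# At synthesis time, state durations (frame counts) are generated from the model.
durations, _ = model.sample(10)
print(np.round(durations.ravel()).astype(int))
```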
information sciences, signal processing and their applications | 2012
Udochukwu Ogbureke; João P. Cabral; Julie Berndsen
Fundamental frequency (F0) modelling is important for speech processing applications such as text-to-speech (TTS) synthesis. The most common method for modelling F0 in HMM-based speech synthesis is to use a mixture of discrete and continuous distributions for unvoiced and voiced speech, respectively. The reason for using this type of model is that most F0 detection algorithms require a voiced/unvoiced (V/U) decision, and F0 is set to a constant value in the unvoiced regions of speech (F0 is not defined in these regions). However, errors in voicing detection degrade speech quality. The effect of voicing decision errors can be reduced by modelling F0 using continuous HMMs. This approach requires a voicing strength parameter to be estimated, which is used to decide whether a speech frame is voiced or unvoiced when generating the speech waveform from the speech parameters. This paper proposes a method for voicing strength estimation based on a multilayer perceptron (MLP) and compares it with a baseline method based on signal processing. Results showed that the MLP method obtained a lower mean V/U error rate than the baseline.
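The proposal reduces to regressing a per-frame voicing strength from acoustic features and thresholding it for the V/U decision at generation time. The sketch below uses scikit-learn's MLPRegressor as a generic MLP; the feature dimensionality, network size, random training data and the 0.5 threshold are all illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor  # generic MLP, a stand-in

rng = np.random.default_rng(0)

# Placeholder per-frame acoustic features and reference voicing strengths in
# [0, 1] from a signal-processing front end; the real feature set may differ.
features = rng.standard_normal((2000, 13))
voicing_strength = rng.uniform(0.0, 1.0, size=2000)

mlp = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
mlp.fit(features, voicing_strength)

# At generation time, the predicted strength is thresholded to obtain the
# voiced/unvoiced decision for each frame (0.5 is an illustrative threshold).
predicted = mlp.predict(features[:10])
is_voiced = predicted > 0.5
```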
Speech Communication | 2014
Éva Székely; Zeeshan Ahmed; Shannon Hennig; João P. Cabral; Julie Carson-Berndsen
The ability to efficiently facilitate social interaction and emotional expression is an important, yet unmet requirement for speech generating devices aimed at individuals with speech impairment. Using gestures such as facial expressions to control aspects of expressive synthetic speech could contribute to an improved communication experience for both the user of the device and the conversation partner. For this purpose, a mapping model between facial expressions and speech is needed that is high level (utterance-based), versatile and personalisable. In the mapping developed in this work, the visual and auditory modalities are connected based on the intended emotional salience of a message: the intensity of the user's facial expressions is mapped to the emotional intensity of the synthetic speech. The mapping model has been implemented in a system called WinkTalk that uses estimated facial expression categories and their intensity values to automatically select between three expressive synthetic voices reflecting three degrees of emotional intensity. An evaluation was conducted through an interactive experiment using simulated augmented conversations. The results show that automatic control of synthetic speech through facial expressions is fast, non-intrusive, sufficiently accurate and supports the user in feeling more involved in the conversation. It can be concluded that the system has the potential to facilitate a more efficient communication process between user and listener.
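At the utterance level, the mapping amounts to a small selection rule from estimated expression intensity to one of three expressive voices. The sketch below illustrates that rule; the voice names, thresholds and data structure are hypothetical, not those of WinkTalk.

```python
from dataclasses import dataclass

# Three expressive synthetic voices of increasing emotional intensity
# (names and thresholds below are illustrative, not WinkTalk's).
VOICES = ("low_intensity_voice", "medium_intensity_voice", "high_intensity_voice")

@dataclass
class ExpressionEstimate:
    category: str      # e.g. "smile", as produced by a face tracker
    intensity: float   # estimated intensity in [0, 1]

def select_voice(estimate: ExpressionEstimate, thresholds=(0.33, 0.66)) -> str:
    """Map the intensity of the user's facial expression to the synthetic
    voice with the corresponding degree of emotional intensity."""
    if estimate.intensity < thresholds[0]:
        return VOICES[0]
    if estimate.intensity < thresholds[1]:
        return VOICES[1]
    return VOICES[2]

print(select_voice(ExpressionEstimate(category="smile", intensity=0.8)))
```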
conference of the international speech communication association | 2017
João P. Cabral; Benjamin R. Cowan; Katja Zibrek; Rachel McDonnell
This paper presents a new speaker change detection system based on Long Short-Term Memory (LSTM) neural networks using acoustic data and linguistic content. Language modelling is combined with two different Joint Factor Analysis (JFA) acoustic approaches: i-vectors and speaker factors. Both of them are compared with a baseline algorithm that uses cosine distance to detect speaker turn changes. LSTM neural networks with both linguistic and acoustic features have been able to produce a robust speaker segmentation. The experimental results show that our proposal clearly outperforms the baseline system.
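A minimal PyTorch sketch of the kind of model described is shown below: an LSTM over concatenated acoustic (i-vector or speaker-factor) and linguistic features with a per-step speaker-change output. The feature dimensions, hidden size and single-layer architecture are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SpeakerChangeLSTM(nn.Module):
    """LSTM over concatenated acoustic and linguistic features with a
    per-step binary speaker-change output (dimensions are illustrative)."""
    def __init__(self, acoustic_dim=100, linguistic_dim=50, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(acoustic_dim + linguistic_dim, hidden_dim,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, acoustic, linguistic):
        x = torch.cat([acoustic, linguistic], dim=-1)
        hidden, _ = self.lstm(x)
        return torch.sigmoid(self.classifier(hidden)).squeeze(-1)

# Toy batch: 4 utterances of 200 steps, with i-vector-like acoustic features
# and word-embedding-like linguistic features.
model = SpeakerChangeLSTM()
acoustic = torch.randn(4, 200, 100)
linguistic = torch.randn(4, 200, 50)
change_probs = model(acoustic, linguistic)   # shape (4, 200), one probability per step
```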
applications of natural language to data bases | 2015
Séamus Lawless; Peter Lavin; Mostafa Bayomi; João P. Cabral; M. Rami Ghorab
In today’s fast-paced world, users face the challenge of having to consume a lot of content in a short time. This situation is exacerbated by the fact that content is scattered in a range of different languages and locations. This research addresses these challenges using a number of natural language processing techniques: adapting content using automatic text summarization; enhancing content accessibility through machine translation; and altering the delivery modality through speech synthesis. This paper introduces Lean-back Learning (LbL), an information system that delivers automatically generated audio presentations for consumption in a “lean-back” fashion, i.e. hands-busy, eyes-busy situations. These presentations are personalized and are generated using multilingual multi-document text summarization. The paper discusses the system’s components and algorithms, in addition to initial system evaluations.
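The Lean-back Learning chain composes three stages: multilingual multi-document summarization, machine translation, and speech synthesis. The sketch below only illustrates that composition; all three stage functions are hypothetical stubs standing in for the system's actual components.

```python
# Minimal sketch of the Lean-back Learning processing chain. The stage
# functions are hypothetical stubs, not the components used in the system.

def summarize(documents: list[str], max_sentences: int = 5) -> str:
    """Stub: multilingual multi-document extractive summarization."""
    sentences = [s for doc in documents for s in doc.split(". ")]
    return ". ".join(sentences[:max_sentences])

def translate(text: str, target_language: str = "en") -> str:
    """Stub: machine translation into the user's preferred language."""
    return text  # a real system would call an MT engine here

def synthesize(text: str) -> bytes:
    """Stub: text-to-speech synthesis returning audio data."""
    return text.encode("utf-8")  # placeholder for a waveform

def build_audio_presentation(documents: list[str]) -> bytes:
    """Summarize scattered content, translate it, and deliver it as audio
    for hands-busy, eyes-busy consumption."""
    return synthesize(translate(summarize(documents)))
```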
9th International Conference on Speech Prosody 2018 | 2018
Beatriz Medeiros; João P. Cabral
In this paper we study how spoken and sung versions of the same text differ in terms of the variability in duration and pitch. These two modalities are usually studied separately, and few works in the literature report comparisons of their acoustic properties. In this work, recordings were made of both spoken and sung versions of Brazilian Portuguese popular songs. The variability was then measured by statistical analysis of the fundamental frequency and speech rate, specifically their mean and variance. In a first study this was done at the syllable and sentence levels, and later at the phone level for further analysis. In general, the results show that speech and singing variability cannot be differentiated in terms of the variance. We expected different results, because singing is more constrained than speech both in terms of pitch (small variation within the note) and duration (metrical constraints). It seems that the results of higher pitch stability for singing reported in the literature cannot be generalised, particularly for the popular genre, in which there is a prosodic proximity between singing and speech. These findings also motivate the analysis of other aspects of pitch and duration dynamics to better understand the prosodic differences between the two modalities.
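The core measurement is simply the mean and variance of F0 and of rate computed per unit (syllable, sentence or phone) for each modality. A small sketch with placeholder numbers (the values below are invented for illustration, not data from the study):

```python
import numpy as np

def variability(values):
    """Mean and sample variance of a sequence of measurements
    (F0 in Hz, or rate in syllables per second)."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.var(ddof=1)

# Placeholder per-syllable F0 measurements (Hz) for the same text spoken and
# sung; in the study these come from the Brazilian Portuguese recordings.
spoken_f0 = [182.0, 175.5, 190.2, 168.4, 201.3]
sung_f0 = [220.0, 219.1, 246.9, 218.5, 247.2]

for label, f0 in (("spoken", spoken_f0), ("sung", sung_f0)):
    mean, var = variability(f0)
    print(f"{label}: mean F0 = {mean:.1f} Hz, variance = {var:.1f}")
```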
9th ISCA Speech Synthesis Workshop | 2016
Eva Vanmassenhove; João P. Cabral; Fasih Haider
The generation of expressive speech is a great challenge for text-to-speech synthesis in audiobooks. One of the most important factors is the variation in speech emotion or voice style. In this work, we developed a method to predict the emotion of a sentence so that it can be conveyed through the synthetic voice. It consists of combining a standard emotion-lexicon based technique with the polarity scores (positive/negative polarity) provided by a less fine-grained sentiment analysis tool, in order to obtain more accurate emotion labels. The primary goal of this emotion prediction tool was to select the type of voice (one of the emotions or neutral) given the input sentence to a state-of-the-art HMM-based Text-to-Speech (TTS) system. In addition, we also combined the emotion prediction from text with a speech clustering method to select the utterances with emotion during the process of building the emotional corpus for the speech synthesizer. Speech clustering is a popular approach to dividing the speech data into subsets associated with different voice styles. The challenge here is to determine the clusters that map onto the basic emotions in an audiobook corpus that contains a high variety of speaking styles, in a way that minimizes the need for human annotation. The evaluation of emotion classification from text showed that, in general, our system can obtain accuracy close to that of human annotators. Results also indicate that this technique is useful in the selection of utterances with emotion for building expressive synthetic voices.
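The combination of an emotion lexicon with a coarser polarity score can be illustrated with a few lines of Python. Everything below is an assumption for demonstration: the tiny lexicon, the stub polarity function and the rule that a conflicting polarity falls back to a neutral voice are not the resources or the exact logic used in the paper.

```python
# Illustrative emotion lexicon (a real system would use a full lexicon).
EMOTION_LEXICON = {
    "joy": {"happy", "delight", "smile", "wonderful"},
    "sadness": {"tears", "lonely", "grief", "sorrow"},
    "anger": {"furious", "rage", "shout", "hate"},
}

def polarity_score(sentence: str) -> float:
    """Stub polarity in [-1, 1]; a real system would call a sentiment tool."""
    positive = {"wonderful", "happy", "good"}
    negative = {"hate", "grief", "terrible"}
    words = sentence.lower().split()
    return (sum(w in positive for w in words)
            - sum(w in negative for w in words)) / max(len(words), 1)

def predict_emotion(sentence: str) -> str:
    """Lexicon vote refined by polarity: a conflicting polarity falls back to
    'neutral', as does the absence of any lexicon match."""
    words = set(sentence.lower().split())
    counts = {emo: len(words & vocab) for emo, vocab in EMOTION_LEXICON.items()}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        return "neutral"
    polarity = polarity_score(sentence)
    if polarity > 0 and best in {"sadness", "anger"}:
        return "neutral"
    if polarity < 0 and best == "joy":
        return "neutral"
    return best

print(predict_emotion("what a wonderful happy day"))   # -> joy
```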