
Publications


Featured research published by Manuel Sam Ribeiro.


International Conference on Acoustics, Speech, and Signal Processing | 2015

A multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform

Manuel Sam Ribeiro; Robert A. J. Clark

We propose a representation of f0 using the Continuous Wavelet Transform (CWT) and the Discrete Cosine Transform (DCT). The CWT decomposes the signal into various scales of selected frequencies, while the DCT compactly represents complex contours as a weighted sum of cosine functions. The proposed approach has the advantage of combining signal decomposition and higher-level representations, thus modeling low frequencies at higher levels and high frequencies at lower levels. Objective results indicate that this representation improves f0 prediction over traditional short-term approaches. Subjective results show improvements over the typical MSD-HMM and performance comparable to the recently proposed CWT-HMM, while using fewer parameters. These results are discussed and future lines of research are proposed.
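As a rough illustration of the idea described in this abstract (not the authors' implementation), the sketch below decomposes an interpolated log-f0 contour with a continuous wavelet transform and summarises each scale with a truncated DCT. The wavelet ('mexh'), the dyadic scales, and the number of DCT coefficients kept are assumptions chosen for illustration.

```python
# Illustrative sketch only: CWT decomposition of a log-f0 contour plus a
# truncated DCT per scale. Wavelet, scales, and DCT order are assumptions.
import numpy as np
import pywt
from scipy.fft import dct, idct

def cwt_dct_features(log_f0, num_scales=5, num_dct=10):
    """Return a (num_scales, num_dct) matrix: one row of DCT coefficients
    per wavelet scale of the f0 contour."""
    scales = 2.0 ** np.arange(1, num_scales + 1)      # dyadic scales (assumed)
    coeffs, _ = pywt.cwt(log_f0, scales, 'mexh')       # shape (num_scales, T)
    # Keep only the first num_dct DCT coefficients of each scale: a compact,
    # smooth parameterisation of the contour at that temporal resolution.
    return np.stack([dct(c, norm='ortho')[:num_dct] for c in coeffs])

def reconstruct_scales(features, length):
    """Invert the truncated DCTs to recover smoothed per-scale contours."""
    padded = np.zeros((features.shape[0], length))
    padded[:, :features.shape[1]] = features
    return idct(padded, norm='ortho', axis=1)

if __name__ == "__main__":
    t = np.linspace(0, 2, 400)
    log_f0 = np.log(120 + 20 * np.sin(2 * np.pi * 2 * t))  # toy voiced contour
    feats = cwt_dct_features(log_f0)
    smooth = reconstruct_scales(feats, len(log_f0))
    print(feats.shape, smooth.shape)                        # (5, 10) (5, 400)
```

Truncating the DCT keeps only the slowly varying shape of each scale, which is what makes the combined representation compact.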


Conference of the International Speech Communication Association | 2016

The SIWIS database: a multilingual speech database with acted emphasis

Jean-Philippe Goldman; Pierre-Edouard Honnet; Robert A. J. Clark; Philip N. Garner; Maria Ivanova; Alexandros Lazaridis; Hui Liang; Tiago Macedo; Beat Pfister; Manuel Sam Ribeiro; Eric Wehrli; Junichi Yamagishi

We describe here a collection of speech data from bilingual and trilingual speakers of English, French, German and Italian. In the context of speech-to-speech translation (S2ST), this database is designed for several purposes and studies: training CLSA (cross-language speaker adaptation) systems, conveying emphasis through S2ST systems, and evaluating TTS systems. More precisely, 36 speakers judged as accentless (22 bilingual and 14 trilingual speakers) were recorded for a set of 171 prompts in two or three languages, amounting to a total of 24 hours of speech. These sets of prompts include 100 sentences from news, 25 sentences from Europarl, the same 25 sentences with one acted emphasised word, 20 semantically unpredictable sentences, and finally a 240-word long text. All in all, this yielded 64 bilingual session pairs covering the six possible combinations of the four languages. The database is freely available for non-commercial use and scientific research purposes. Index Terms: speech-to-speech translation, speech corpus, bilingual speakers, emphasis


International Conference on Acoustics, Speech, and Signal Processing | 2016

Wavelet-based decomposition of F0 as a secondary task for DNN-based speech synthesis with multi-task learning

Manuel Sam Ribeiro; Oliver Watts; Junichi Yamagishi; Robert A. J. Clark

We investigate two wavelet-based decomposition strategies for the f0 signal and their usefulness as a secondary task for speech synthesis using multi-task deep neural networks (MTL-DNNs). The first strategy uses a static set of scales for all utterances in the training data. We propose a second strategy, in which the scale of the mother wavelet is dynamically adjusted to the rate of each utterance. This approach is able to capture f0 variations related to the syllable, word, clitic-group, and phrase units, and it constrains the wavelet components to be within the frequency range that previous experiments have shown to be more natural. Both strategies are evaluated as secondary tasks in MTL-DNNs. Results on an expressive dataset indicate a strong preference for the systems using multi-task learning over the baseline system.
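A minimal sketch of the multi-task setup described above, under assumed layer sizes and an assumed loss weighting (neither is taken from the paper): a shared trunk feeds a primary head for the acoustic targets and a secondary head for the wavelet-decomposed f0 components.

```python
# Assumed architecture, for illustration only: shared trunk, primary acoustic
# head, secondary head predicting wavelet-decomposed f0 components.
import torch
import torch.nn as nn

class MTLSynthesisDNN(nn.Module):
    def __init__(self, n_linguistic=600, n_acoustic=187, n_wavelet=5,
                 hidden=1024, layers=4):
        super().__init__()
        trunk, dim = [], n_linguistic
        for _ in range(layers):                            # shared hidden layers
            trunk += [nn.Linear(dim, hidden), nn.Tanh()]
            dim = hidden
        self.trunk = nn.Sequential(*trunk)
        self.acoustic_head = nn.Linear(hidden, n_acoustic)  # primary task
        self.wavelet_head = nn.Linear(hidden, n_wavelet)    # secondary task

    def forward(self, x):
        h = self.trunk(x)
        return self.acoustic_head(h), self.wavelet_head(h)

def mtl_loss(acoustic_pred, wavelet_pred, acoustic_tgt, wavelet_tgt, alpha=0.5):
    # The secondary task only shapes the shared representation; the weight
    # alpha is an assumption, not a value reported in the paper.
    mse = nn.functional.mse_loss
    return mse(acoustic_pred, acoustic_tgt) + alpha * mse(wavelet_pred, wavelet_tgt)
```

At synthesis time only the primary head would be used; the secondary head exists solely to regularise the shared layers during training.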


Conference of the International Speech Communication Association | 2016

Syllable-Level Representations of Suprasegmental Features for DNN-Based Text-to-Speech Synthesis

Manuel Sam Ribeiro; Oliver Watts; Junichi Yamagishi

A top-down hierarchical system based on deep neural networks is investigated for the modeling of prosody in speech synthesis. Suprasegmental features are processed separately from segmental features, and a compact distributed representation of high-level units is learned at the syllable level. The suprasegmental representation is then integrated into a frame-level network. Objective measures show that balancing segmental and suprasegmental features can be useful for the frame-level network. Additional features incorporated into the hierarchical system are then tested. At the syllable level, a bag-of-phones representation is proposed and, at the word level, embeddings learned from text sources are used. It is shown that the hierarchical system is able to leverage new features at higher levels more efficiently than a system which exploits them directly at the frame level. A perceptual evaluation of the proposed systems is conducted and followed by a discussion of the results.
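The general wiring can be sketched as follows; the feature dimensions, layer sizes, and class names are hypothetical, not taken from the paper. A syllable-level network compresses suprasegmental features into a small embedding, which is broadcast to every frame of that syllable and concatenated with the segmental frame-level inputs.

```python
# Hypothetical sketch of the hierarchical idea: syllable-level encoder plus a
# frame-level network that consumes its embedding alongside segmental features.
import torch
import torch.nn as nn

class SyllableEncoder(nn.Module):
    def __init__(self, n_supra=100, emb=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_supra, 256), nn.Tanh(),
                                 nn.Linear(256, emb), nn.Tanh())

    def forward(self, supra):                  # (n_syllables, n_supra)
        return self.net(supra)                 # (n_syllables, emb)

class FrameLevelNet(nn.Module):
    def __init__(self, n_seg=300, emb=32, n_acoustic=187):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_seg + emb, 1024), nn.Tanh(),
                                 nn.Linear(1024, n_acoustic))

    def forward(self, seg, syll_emb, frame_to_syll):
        # frame_to_syll maps each frame index to its syllable index, so the
        # syllable embedding is simply repeated over that syllable's frames.
        return self.net(torch.cat([seg, syll_emb[frame_to_syll]], dim=-1))
```

Learning the embedding at the syllable level, rather than feeding raw suprasegmental features to every frame, is what keeps the representation compact and free of segmental influence.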


9th ISCA Speech Synthesis Workshop | 2016

Parallel and cascaded deep neural networks for text-to-speech synthesis

Manuel Sam Ribeiro; Oliver Watts; Junichi Yamagishi

An investigation of cascaded and parallel deep neural networks for speech synthesis is conducted. In these systems, suprasegmental linguistic features (syllable-level and above) are processed separately from segmental features (phone-level and below). The suprasegmental component of the networks learns compact distributed representations of high-level linguistic units without any segmental influence. These representations are then integrated into a frame-level system using a cascaded or a parallel approach. In the cascaded network, suprasegmental representations are used as input to the frame-level network. In the parallel network, segmental and suprasegmental features are processed separately and concatenated at a later stage. These experiments are conducted with a standard set of high-dimensional linguistic features as well as a hand-pruned one. It is observed that hierarchical systems are consistently preferred over the baseline feedforward systems. Similarly, parallel networks are preferred over cascaded networks.
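A compact sketch of the two wirings under assumed dimensions: in the cascaded variant the suprasegmental embedding is appended to the frame-level input, while in the parallel variant segmental and suprasegmental streams are processed by separate branches and concatenated at a later hidden layer.

```python
# Assumed layer sizes; only the cascaded-vs-parallel wiring follows the text.
import torch
import torch.nn as nn

class CascadedOrParallel(nn.Module):
    def __init__(self, n_seg=300, n_supra_emb=32, n_acoustic=187,
                 hidden=1024, parallel=True):
        super().__init__()
        self.parallel = parallel
        if parallel:
            # Separate branches, merged at a later hidden layer.
            self.seg_branch = nn.Sequential(nn.Linear(n_seg, hidden), nn.Tanh())
            self.supra_branch = nn.Sequential(nn.Linear(n_supra_emb, 64), nn.Tanh())
            self.merge = nn.Sequential(nn.Linear(hidden + 64, hidden), nn.Tanh(),
                                       nn.Linear(hidden, n_acoustic))
        else:
            # Cascaded: suprasegmental embedding joins the frame-level input.
            self.merge = nn.Sequential(nn.Linear(n_seg + n_supra_emb, hidden),
                                       nn.Tanh(),
                                       nn.Linear(hidden, hidden), nn.Tanh(),
                                       nn.Linear(hidden, n_acoustic))

    def forward(self, seg, supra_emb):
        if self.parallel:
            h = torch.cat([self.seg_branch(seg), self.supra_branch(supra_emb)],
                          dim=-1)
            return self.merge(h)
        return self.merge(torch.cat([seg, supra_emb], dim=-1))
```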


Archive | 2014

Nouveaux cahiers de linguistique française

Philip N. Garner; Robert A. J. Clark; Jean-Philippe Goldman; Pierre-Edouard Honnet; Maria Ivanova; Alexandros Lazaridis; Hui Liang; Beat Pfister; Manuel Sam Ribeiro; Eric Wehrli; Junichi Yamagishi


Nouveaux cahiers de linguistique française | 2014

Translation and Prosody in Swiss Languages

Philip N. Garner; Robert A. J. Clark; Jean-Philippe Goldman; Pierre-Edouard Honnet; Maria Ivanova; Alexandros Lazaridis; Hui Liang; Beat Pfister; Manuel Sam Ribeiro; Eric Wehrli; Junichi Yamagishi


Conference of the International Speech Communication Association | 2015

A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis.

Manuel Sam Ribeiro; Junichi Yamagishi; Robert A. J. Clark


Conference of the International Speech Communication Association | 2018

UltraSuite: A Repository of Ultrasound and Acoustic Data from Child Speech Therapy Sessions.

Aciel Eshky; Manuel Sam Ribeiro; Joanne Cleland; Korin Richmond; Zoe Roxburgh; James M. Scobbie; Alan Wrench


Archive | 2018

Manual and automatic labels for version 1.0 of UXTD, UXSSD, and UPX core data

Aciel Eshky; Zoe Roxburgh; Korin Richmond; Steve Renals; Alan Wrench; Manuel Sam Ribeiro; James M. Scobbie; Joanne Cleland

Collaboration


Dive into Manuel Sam Ribeiro's collaborations.

Top Co-Authors

Junichi Yamagishi (National Institute of Informatics)

Oliver Watts (University of Edinburgh)