Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Nicolas Obin is active.

Publication


Featured research published by Nicolas Obin.


International Conference on Acoustics, Speech, and Signal Processing | 2013

Syll-O-Matic: An adaptive time-frequency representation for the automatic segmentation of speech into syllables

Nicolas Obin; François Lamare; Axel Roebel

This paper introduces novel paradigms for the segmentation of speech into syllables. The main idea of the proposed method is the use of a time-frequency representation of the speech signal, and the fusion of intensity and voicing measures across various frequency regions for the automatic selection of the information pertinent to the segmentation. The time-frequency representation is used to exploit the speech characteristics that depend on the frequency region: intensity profiles provide information in the various frequency regions, and voicing profiles determine the frequency regions that are pertinent for the segmentation. The proposed method outperforms conventional methods for the detection of syllable landmarks and boundaries on the TIMIT database of American English, and provides a promising paradigm for the segmentation of speech into syllables.
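
A minimal sketch of the band-wise fusion idea, assuming an STFT front end: per-band intensity profiles are computed, fused, and peak-picked to yield syllable-nucleus candidates. The band edges, smoothing window, and prominence threshold are illustrative choices, not the settings of Syll-O-Matic.

    import numpy as np
    from scipy.signal import stft, find_peaks

    def syllable_landmarks(x, sr, bands=((300, 900), (900, 2200), (2200, 4000))):
        """Return candidate syllable-nucleus times (seconds) for signal x."""
        f, t, Z = stft(x, fs=sr, nperseg=int(0.025 * sr), noverlap=int(0.015 * sr))
        power = np.abs(Z) ** 2
        # One intensity profile per frequency band.
        profiles = [power[(f >= lo) & (f < hi)].sum(axis=0) for lo, hi in bands]
        # Fuse the bands; a plain sum stands in for the paper's
        # voicing-based selection of pertinent bands.
        fused = np.sum(profiles, axis=0)
        win = np.hanning(9)
        fused = np.convolve(fused, win / win.sum(), mode="same")
        # Peaks of the fused profile are nucleus candidates; minima
        # between peaks would give boundary candidates.
        peaks, _ = find_peaks(fused, prominence=0.1 * fused.max())
        return t[peaks]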


International Conference on Acoustics, Speech, and Signal Processing | 2008

French prominence: A probabilistic framework

Nicolas Obin; Xavier Rodet; Anne Lacheret-Dujour

Identification of prosodic phenomena is of primary importance in prosodic analysis and modeling. In this paper, we introduce a new method for the automatic labelling of prosodic phenomena, set within the framework of prominence. The proposed method for automatic prominence labelling is based on well-known machine-learning techniques in a three-step procedure: (i) a feature-extraction step, in which we propose a framework for systematic, multi-level extraction of acoustic speech features; (ii) a feature-selection step, to identify the most relevant acoustic correlates of prominence; and (iii) a modelling step, in which a Gaussian mixture model is used to predict prominence. The model shows robust performance on read speech (84%).
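
A minimal sketch of step (iii), assuming syllable-level acoustic feature vectors have already been extracted and selected: one Gaussian mixture per class and a maximum-likelihood decision. The feature choice and mixture size are illustrative.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_prominence_models(X_prominent, X_other, n_components=4):
        # One GMM per class, fitted on syllable-level feature vectors
        # (e.g., f0, duration, and energy statistics).
        g1 = GaussianMixture(n_components=n_components).fit(X_prominent)
        g0 = GaussianMixture(n_components=n_components).fit(X_other)
        return g0, g1

    def label_prominence(g0, g1, X):
        # score_samples returns per-sample log-likelihoods; a syllable is
        # labelled prominent when the prominent-class model wins.
        return (g1.score_samples(X) > g0.score_samples(X)).astype(int)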


Spoken Language Technology Workshop | 2012

On the generalization of Shannon entropy for speech recognition

Nicolas Obin; Marco Liuni

This paper introduces an entropy-based spectral representation as a measure of the degree of noisiness in audio signals, complementary to the standard MFCCs for audio and speech recognition. The proposed representation is based on the Rényi entropy, a generalization of the Shannon entropy. For audio signal representation, the Rényi entropy presents the advantage of focusing either on the harmonic content (prominent amplitudes within a distribution) or on the noise content (equal distribution of amplitudes). The proposed representation outperforms all other noisiness measures, including the Shannon and Wiener entropies, in a large-scale classification of vocal effort (whispered-soft/normal/loud-shouted) in the real scenario of multi-language massive role-playing video games. The improvement is around 10% in relative error reduction, and is particularly significant for the recognition of noisy speech, i.e., whispery/breathy speech. This confirms the role of noisiness in speech recognition; the approach will further be extended to the classification of voice quality for the design of an automatic voice-casting system for video games.
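
The Rényi entropy of order alpha of a distribution p is H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha), recovering the Shannon entropy as alpha -> 1. A minimal sketch of a per-frame noisiness feature computed on the normalized power spectrum follows; the default alpha is an illustrative choice.

    import numpy as np

    def renyi_entropy(frame_power, alpha=3.0, eps=1e-12):
        # Treat the power spectrum of one frame as a probability distribution.
        p = frame_power / (frame_power.sum() + eps)
        if abs(alpha - 1.0) < 1e-6:
            return -np.sum(p * np.log(p + eps))      # Shannon limit
        # alpha > 1 emphasizes prominent (harmonic) amplitudes; a flat,
        # noise-like spectrum maximizes the entropy for any alpha.
        return np.log(np.sum(p ** alpha) + eps) / (1.0 - alpha)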


International Conference on Acoustics, Speech, and Signal Processing | 2014

On automatic voice casting for expressive speech: Speaker recognition vs. speech classification

Nicolas Obin; Axel Roebel; Grégoire Bachman

This paper presents the first large-scale automatic voice casting system, and explores the adaptation of speaker recognition techniques to measure voice similarities. The proposed system is based on the representation of a voice by classes (e.g., age/gender, voice quality, emotion). First, a multi-label system is used to classify speech into classes. Then, the output probabilities for each class are concatenated to form a vector that represents the vocal signature of a speech recording. Finally, a similarity search is performed on the vocal signatures to determine the set of target actors that are the most similar to a speech recording of a source actor. In a subjective experiment conducted in the real context of voice casting for video games, the multi-label system clearly outperforms standard speaker recognition systems. This provides evidence that speech classes successfully capture the principal directions used in the perception of voice similarity.
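
A minimal sketch of the signature-and-search idea, assuming a set of sklearn-style binary classifiers (age/gender, voice quality, emotion, ...) trained elsewhere: their output probabilities are concatenated into a vocal signature, and target actors are ranked by cosine similarity.

    import numpy as np

    def vocal_signature(classifiers, features):
        # Concatenate P(class | recording) over all classifiers into the
        # vocal signature of one recording.
        return np.array([clf.predict_proba(features[None, :])[0, 1]
                         for clf in classifiers])

    def rank_targets(source_sig, target_sigs):
        # Rank target actors by cosine similarity to the source signature.
        sims = [np.dot(source_sig, t) /
                (np.linalg.norm(source_sig) * np.linalg.norm(t) + 1e-12)
                for t in target_sigs]
        return np.argsort(sims)[::-1]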


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Similarity search of acted voices for automatic voice casting

Nicolas Obin; Axel Roebel

This paper presents a large-scale similarity search of professionally acted voices for computer-aided voice casting. The proposed voice casting system explores Gaussian mixture model-based acoustic models and multi-label recognition of perceived paralinguistic content (speaker states and speaker traits, e.g., age/gender, voice quality, emotion) for the voice casting of professionally acted voices. First, acoustic models (universal background model, super-vector, i-vector) are constructed to model the acoustic space of voices, from which the similarity between voices can be measured directly in the acoustic space. Second, multiple binary classification of speaker traits and states is added to the acoustic models in order to represent the vocal signature of a voice, which is then used to measure the similarity between voices in the paralinguistic space. Finally, a similarity search is performed to determine the set of target actors that are the most similar to the voice of a source actor. In a subjective experiment conducted in the real context of cross-language voice casting, the multi-label scoring system significantly outperforms the acoustic scoring system. This constitutes a proof of concept for the role of perceived paralinguistic categories in the perception of voice similarity.
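
A minimal sketch of one of the acoustic models named above, the GMM super-vector: the means of a universal background model (UBM) are MAP-adapted to one recording and stacked into a single vector that can then be compared across voices (e.g., by cosine distance). The relevance factor r = 16 is a conventional choice, not necessarily the paper's setting.

    import numpy as np

    def supervector(ubm, X, r=16.0):
        # ubm: a fitted sklearn GaussianMixture; X: (n_frames, dim) features.
        post = ubm.predict_proba(X)                    # responsibilities
        n_k = post.sum(axis=0)                         # soft counts per component
        ex = post.T @ X / np.maximum(n_k[:, None], 1e-9)   # data mean per component
        alpha = (n_k / (n_k + r))[:, None]             # MAP adaptation weights
        means = alpha * ex + (1 - alpha) * ubm.means_  # adapted component means
        return means.ravel()                           # stacked super-vector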


International Conference on Acoustics, Speech, and Signal Processing | 2015

The role of glottal source parameters for high-quality transformation of perceptual age

Xavier Favory; Nicolas Obin; Gilles Degottex; Axel Roebel

The intuitive control of voice transformation (e.g., age/sex, emotions) is useful to extend the expressive repertoire of a voice. This paper explores the role of glottal source parameters in the control of voice transformation. First, the SVLN speech synthesizer (Separation of the Vocal tract with the Liljencrants-Fant model plus Noise) is used to represent the glottal source parameters (and thus, voice quality) during speech analysis and synthesis. Then, a simple statistical method is presented to control speech parameters during voice transformation: a GMM is used to model the speech parameters of a voice, and regressions are then used to adapt the GMM statistics (means and variances) to a control parameter (e.g., age/sex, emotions). A subjective experiment conducted on the control of perceptual age demonstrates the importance of the glottal source parameters for the control of voice transformation, and shows the efficiency of the statistical model in controlling voice parameters while preserving a high quality of voice transformation.
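
A minimal sketch of the statistical control idea, assuming per-speaker GMM component means and speaker ages are available: each mean coordinate is regressed against age, so that adapted means can be generated for a target age. The linear form of the regression is an illustrative assumption.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def fit_mean_regressions(ages, means):
        # means: (n_speakers, n_components, dim) GMM component means.
        n_comp, dim = means.shape[1:]
        A = np.asarray(ages)[:, None]
        return [[LinearRegression().fit(A, means[:, k, d])
                 for d in range(dim)] for k in range(n_comp)]

    def means_for_age(regressions, age):
        # Predict the GMM component means of a voice at the target age.
        return np.array([[r.predict([[age]])[0] for r in comp]
                         for comp in regressions])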


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Symbolic modeling of prosody: from linguistics to statistics

Nicolas Obin; Pierre Lanchantin

The assignment of prosodic events (accent and phrasing) from text is crucial in text-to-speech synthesis systems. This paper addresses the combination of linguistic and metric constraints for the assignment of prosodic events in text-to-speech synthesis. First, a linguistic processing chain is used to provide a rich linguistic description of a text. Then, a novel statistical representation based on a hierarchical HMM (HHMM) is used to model the prosodic structure of a text: the root layer represents the text, each intermediate layer a sequence of intermediate phrases, the pre-terminal layer the sequence of accents, and the terminal layer the sequence of linguistic contexts. For each intermediate layer, a segmental HMM and information fusion are used to fuse the linguistic and metric constraints for the segmentation of a text into phrases. A set of experiments conducted on multi-speaker databases with various speaking styles shows that the rich linguistic representation drastically improves the assignment of prosodic events, and that the fusion of linguistic and metric constraints significantly improves over standard methods for the segmentation of a text into phrases. These results constitute substantial advances that can further be used to model the speech prosody of a speaker, a speaking style, and emotions for text-to-speech synthesis.
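
A minimal sketch of the constraint-fusion idea for one layer, assuming a per-word linguistic break score is available: a log-linear combination of the linguistic score and a Gaussian phrase-length (metric) score, maximized by dynamic programming. The Gaussian length model and the weight lam are illustrative stand-ins for the paper's segmental HMM and information fusion.

    import numpy as np

    def segment_into_phrases(log_p_break, mu_len=7.0, sigma_len=3.0, lam=0.7):
        # log_p_break[j]: linguistic log-probability of a break after word j+1.
        n = len(log_p_break)
        def metric(length):                  # metric constraint on phrase length
            return -0.5 * ((length - mu_len) / sigma_len) ** 2
        best = np.full(n + 1, -np.inf)
        best[0] = 0.0
        back = np.zeros(n + 1, dtype=int)
        for j in range(1, n + 1):            # phrase ends after word j
            for i in range(max(0, j - 20), j):   # phrase starts after word i
                s = best[i] + lam * log_p_break[j - 1] + (1 - lam) * metric(j - i)
                if s > best[j]:
                    best[j], back[j] = s, i
        cuts, j = [], n
        while j > 0:
            cuts.append(j)
            j = back[j]
        return sorted(cuts)                  # phrase-final word positions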


IEEE Transactions on Audio, Speech, and Language Processing | 2018

Binaural Localization of Multiple Sound Sources by Non-Negative Tensor Factorization

Elie Laurent Benaroya; Nicolas Obin; Marco Liuni; Axel Roebel; Wilson Raumel; Sylvain Argentieri

This paper presents non-negative factorization of audio signals for the binaural localization of multiple sound sources within realistic and unknown sound environments. Non-negative tensor factorization (NTF) provides a sparse representation of multichannel audio signals in time, frequency, and space that can be exploited in computational audio scene analysis and robot audition for the separation and localization of sound sources. In the proposed formulation, each sound source is represented by means of spectral dictionaries, temporal activations, and its distribution within each channel (here, the left and right ears). This distribution, being frequency-dependent, can be interpreted as an explicit estimation of the Head-Related Transfer Function (HRTF) of a binaural head, which can then be converted into the estimated sound source position. Moreover, the semi-supervised formulation of the non-negative factorization allows us to integrate prior knowledge about some sound sources of interest, whose dictionaries can be learned in advance, whereas the remaining sources are considered as background sound, which remains unknown and is estimated on the fly. The proposed NTF-based method is applied here to the binaural localization of multiple speakers within realistic sound environments.
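
A minimal sketch of a PARAFAC-style NTF with multiplicative updates under a Euclidean cost, for a channel x frequency x time magnitude tensor. For brevity, the channel factor Q here is frequency-independent, whereas the paper's spatial distribution depends on frequency (which is what yields the HRTF estimate); the semi-supervised part (fixing some pre-learned dictionaries) is also omitted.

    import numpy as np

    def ntf(V, n_sources, n_iter=200, eps=1e-9, seed=0):
        # V: non-negative tensor of shape (channels, freqs, frames).
        rng = np.random.default_rng(seed)
        C, F, T = V.shape
        Q = rng.random((C, n_sources))   # per-source channel distribution
        W = rng.random((F, n_sources))   # spectral dictionaries
        H = rng.random((T, n_sources))   # temporal activations
        for _ in range(n_iter):
            Vh = np.einsum('ck,fk,tk->cft', Q, W, H) + eps
            Q *= (np.einsum('cft,fk,tk->ck', V, W, H) /
                  np.einsum('cft,fk,tk->ck', Vh, W, H))
            Vh = np.einsum('ck,fk,tk->cft', Q, W, H) + eps
            W *= (np.einsum('cft,ck,tk->fk', V, Q, H) /
                  np.einsum('cft,ck,tk->fk', Vh, Q, H))
            Vh = np.einsum('ck,fk,tk->cft', Q, W, H) + eps
            H *= (np.einsum('cft,ck,fk->tk', V, Q, W) /
                  np.einsum('cft,ck,fk->tk', Vh, Q, W))
        return Q, W, H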


9th International Conference on Speech Prosody 2018 | 2018

At the Interface of Speech and Music: A Study of Prosody and Musical Prosody in Rap Music

Olivier Migliore; Nicolas Obin

This paper presents a pioneering study of speech prosody and musical prosody in modern popular music, with specific attention to music in which the voice is closer to speech than to singing. The voice in music is a complex system in which linguistic and musical systems are coupled and interact dynamically. This paper establishes a new definition of musical prosody in order to model the specific relations between the voice and the music in this kind of music. Additionally, it presents a methodology to measure musical prosody from the speech and music signals. An illustration is presented to assess whether speech prosody and musical prosody can characterize the phonostyle of a speaker, by comparing three American-English rappers from the early 2000s. The main finding is that rappers can be characterized and distinguished not only by their speech prosody, but also by their musical prosody, i.e., by the degree of synchronization between their lyrics and the musical system.
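
A minimal sketch of one way such a synchronization degree could be quantified, assuming syllable onsets and beat times come from external tools: the mean absolute offset between each syllable onset and its nearest beat, normalized by half the beat period. This metric is a hypothetical stand-in, not the paper's measure.

    import numpy as np

    def sync_degree(syllable_onsets, beat_times):
        beats = np.asarray(beat_times)
        period = np.median(np.diff(beats))              # typical beat period
        offsets = [np.min(np.abs(beats - s)) for s in syllable_onsets]
        # 1.0 = every syllable lands on a beat; 0.0 = maximally off-beat.
        return 1.0 - np.mean(offsets) / (period / 2.0)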


Archive | 2015

Exploiting Alternatives for Text-To-Speech Synthesis: From Machine to Human

Nicolas Obin; Christophe Veaux; Pierre Lanchantin

The absence of alternatives/variants is a dramatic limitation of text-to-speech (TTS) synthesis compared to the variety of human speech. This chapter introduces the use of speech alternatives/variants in order to improve TTS synthesis systems. Speech alternatives denote the variety of possibilities that a speaker has to pronounce a sentence, depending on linguistic constraints, specific strategies of the speaker, speaking style, and pragmatic constraints. During training, the symbolic and acoustic characteristics of a unit-selection speech synthesis system are statistically modelled with context-dependent parametric models (Gaussian mixture models (GMMs) / hidden Markov models (HMMs)). During synthesis, symbolic and acoustic alternatives are exploited using a Generalized Viterbi Algorithm (GVA) to determine the sequence of speech units used for the synthesis. Objective and subjective evaluations provide evidence that the use of speech alternatives significantly improves speech synthesis over conventional speech synthesis systems. Moreover, speech alternatives can also be used to vary the speech synthesis for a given text. The proposed method can easily be extended to HMM-based speech synthesis.
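
A minimal sketch of keeping alternatives during unit selection: a k-best beam over a lattice of candidate units, so that several near-optimal unit sequences survive instead of a single Viterbi path. The cost functions are placeholders; the Generalized Viterbi Algorithm used in the chapter is more general.

    import heapq

    def k_best_unit_sequences(lattice, target_cost, join_cost, k=5):
        # lattice[t]: candidate units for position t in the utterance.
        beam = [(target_cost(u, 0), [u]) for u in lattice[0]]
        beam = heapq.nsmallest(k, beam, key=lambda p: p[0])
        for t in range(1, len(lattice)):
            cand = [(c + join_cost(path[-1], u) + target_cost(u, t), path + [u])
                    for c, path in beam for u in lattice[t]]
            beam = heapq.nsmallest(k, cand, key=lambda p: p[0])
        return beam        # k best (cost, unit-sequence) pairs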

Collaboration


Dive into Nicolas Obin's collaborations.

Top Co-Authors


Alice Bardiaux

Université catholique de Louvain
