Sandrine Brognaux
Université catholique de Louvain
Publications
Featured research published by Sandrine Brognaux.
Spoken Language Technology Workshop | 2012
Sandrine Brognaux; Sophie Roekhaut; Thomas Drugman; Richard Beaufort
Several automatic phonetic alignment tools have been proposed in the literature. They usually rely on pre-trained speaker-independent models to align new corpora. Their drawback is that they cover a very limited number of languages and might not perform properly for different speaking styles. This paper presents a new tool for automatic phonetic alignment, available online. Its specificity is that it trains the model directly on the corpus to align, which makes it applicable to any language and speaking style. Experiments on three corpora show that it provides results comparable to those of other existing tools. It also allows the tuning of some training parameters: the use of tied-state triphones, for example, yields a further improvement of about 1.5% at a 20 ms threshold. A manually aligned part of the corpus can also be used as a bootstrap to improve model quality. Alignment rates were found to increase significantly, by up to 20%, using only 30 seconds of bootstrapping data.
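The mechanism shared by these aligners, HMM forced alignment, reduces to a constrained Viterbi search once per-frame acoustic scores are available. The following is a toy sketch, not the tool described above: it assumes a matrix of per-frame log-likelihoods for a known left-to-right phone sequence is already given, and recovers the boundary placement by dynamic programming.

```python
import numpy as np

def forced_align(log_likes):
    """Viterbi forced alignment for a left-to-right phone sequence.

    log_likes: (num_frames, num_phones) array of per-frame log-likelihoods,
    column p scoring frame t under phone p's acoustic model.
    Returns the frame index at which each phone starts (phone 0 starts at 0).
    """
    T, P = log_likes.shape
    NEG = -np.inf
    delta = np.full((T, P), NEG)
    back = np.zeros((T, P), dtype=int)  # 0 = stayed in phone, 1 = entered from previous
    delta[0, 0] = log_likes[0, 0]       # the first frame must belong to the first phone
    for t in range(1, T):
        for p in range(P):
            stay = delta[t - 1, p]
            enter = delta[t - 1, p - 1] if p > 0 else NEG
            if enter > stay:
                delta[t, p] = enter + log_likes[t, p]
                back[t, p] = 1
            else:
                delta[t, p] = stay + log_likes[t, p]
                back[t, p] = 0
    # Backtrace from the final phone at the last frame to read off boundaries.
    starts = [0] * P
    p = P - 1
    for t in range(T - 1, 0, -1):
        if back[t, p] == 1:
            starts[p] = t
            p -= 1
    return starts
```

In a real system the log-likelihoods would come from GMM or neural acoustic models trained (here, on the corpus itself), and the per-phone states would be 3-state HMMs rather than single columns.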
IEEE Transactions on Audio, Speech, and Language Processing | 2016
Sandrine Brognaux; Thomas Drugman
Speech segmentation refers to the problem of determining the phoneme boundaries from an acoustic recording of an utterance together with its orthographic transcription. This paper focuses on a particular case of hidden Markov model (HMM)-based forced alignment in which the models are directly trained on the corpus to align. The obvious advantage of this technique is that it is applicable to any language or speaking style and does not require manually aligned data. Through a systematic step-by-step study, the role played by various training parameters (e.g., model configuration, number of training iterations) on the alignment accuracy is assessed, with corpora varying in speaking style and language. Based on a detailed analysis of the errors commonly made by this technique, we also investigate the use of additional fully automatic strategies to improve the alignment. Besides the use of supplementary acoustic features, we explore two novel approaches: an initialization of the silence models based on a voice activity detection (VAD) algorithm and the consideration of the forced alignment of the time-reversed sound. The evaluation is carried out on 12 corpora of different sizes, languages (some being under-resourced) and speaking styles. It aims at providing a comprehensive study of the alignment accuracy achieved by the different versions of the speech segmentation algorithm depending on corpus-related specificities. While the baseline method is shown to reach good alignment rates with corpora as small as 2 minutes, we also emphasize the benefit of using a few seconds of bootstrapping data. Regarding improvement methods, our results show that the insertion of additional features outperforms the two other strategies. The performance of VAD, however, is particularly striking on very small corpora, correcting more than 60% of the errors greater than 40 ms.
Finally, the combination of the three improvement methods is shown to provide the highest alignment rates, with very low variability across the corpora regardless of their size. This combined technique outperforms available speaker-independent models, improving the alignment rate by 8 to 10% absolute.
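The alignment rates quoted in these abstracts are typically the fraction of predicted phoneme boundaries falling within a tolerance (e.g., 20 or 40 ms) of manual reference boundaries. A minimal sketch of that metric, assuming the two boundary lists are already paired one-to-one over the same phone sequence:

```python
def alignment_rate(ref_ms, hyp_ms, tol_ms=20.0):
    """Fraction of hypothesis boundaries within tol_ms of the reference.

    ref_ms, hyp_ms: equal-length sequences of boundary times in milliseconds,
    paired one-to-one (same phone sequence on both sides).
    """
    assert len(ref_ms) == len(hyp_ms)
    hits = sum(abs(r - h) <= tol_ms for r, h in zip(ref_ms, hyp_ms))
    return hits / len(ref_ms)
```

For example, with reference boundaries [100, 250, 400] ms and predictions [105, 290, 402] ms, two of the three boundaries fall within a 20 ms tolerance.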
International Conference on NLP | 2012
Sandrine Brognaux; Sophie Roekhaut; Thomas Drugman; Richard Beaufort
Several automatic phonetic alignment tools have been proposed in the literature. They generally use speaker-independent acoustic models of the language to align new corpora. The problem is that the range of available models is limited: it does not cover all languages and speaking styles (spontaneous, expressive, etc.). This study investigates the possibility of directly training the statistical model on the corpus to align. The main advantage is that this approach is applicable to any language and speaking style. Moreover, comparisons indicate that it provides results as good as or better than speaker-independent models of the language: about 2% is gained, at a 20 ms threshold, by using our method. Experiments were carried out on neutral and expressive corpora in French and English. The study also points out that even a small neutral corpus of a few minutes can be exploited to train a model that will provide high-quality alignment.
International Conference on Acoustics, Speech, and Signal Processing | 2016
Rasmus Dali; Sandrine Brognaux; Korin Richmond; Cassia Valentini-Botinhao; Gustav Eje Henter; Julia Hirschberg; Junichi Yamagishi; Simon King
Forced alignment for speech synthesis traditionally aligns a phoneme sequence predetermined by the front-end text processing system. This sequence is not altered during alignment, i.e., it is forced, despite possibly being faulty. The consistency assumption is the assumption that these mistakes do not degrade models, as long as the mistakes are consistent across training and synthesis. We present evidence that in the alignment of both standard read prompts and spontaneous speech this phoneme sequence is often wrong, and that this is likely to have a negative impact on acoustic models. A lattice-based forced alignment system allowing for pronunciation variation is implemented, resulting in improved phoneme identity accuracy for both types of speech. A perceptual evaluation of HMM-based voices showed that spontaneous models trained on this improved alignment also improved standard synthesis, despite breaking the consistency assumption.
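A simplified illustration of the underlying idea (not the paper's lattice implementation): if an aligner can score each candidate pronunciation against the audio, the variant with the best alignment likelihood can replace the front-end's fixed phoneme sequence. The `score_fn` below is a hypothetical stand-in for a real aligner's total log-likelihood.

```python
def best_pronunciation(variants, score_fn):
    """Pick the pronunciation variant whose forced alignment scores best.

    variants: list of phone sequences (lists of phone symbols).
    score_fn: callable returning the total alignment log-likelihood of a
    variant against the audio (here a stand-in for a real aligner).
    """
    return max(variants, key=score_fn)
```

A lattice-based system generalizes this by encoding all variants (and optional reductions) in one graph, so the search picks the best path in a single pass instead of realigning each variant separately.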
International Conference on Acoustics, Speech, and Signal Processing | 2015
Benjamin Picart; Sandrine Brognaux; Stéphane Dupont
“Human BeatBox” (HBB) is a newly expanding contemporary singing style in which the vocalist imitates percussive drum sounds as well as pitched musical instrument sounds. Drum sounds typically use a notation based on plosives and fricatives, and instrument sounds cover vocalisations that go beyond spoken-language vowels. HBB hence constitutes an interesting use case for extending techniques initially developed for speech processing, with the goal of automatically annotating performances as well as developing new sound effects dedicated to HBB performers. In this paper, we investigate three complementary aspects of HBB analysis: pitch tracking, onset detection, and automatic recognition of sounds and instruments. As a first step, a new high-quality HBB audio database has been recorded, then carefully segmented and annotated manually to obtain a ground-truth reference. Various pitch tracking and onset detection methods are then compared and assessed against this reference. Finally, Hidden Markov Models are evaluated, together with an exploration of their parameter space, for the automatic recognition of different types of sounds. The study yields very encouraging experimental results.
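Onset detection for percussive material like beatboxing often starts from a short-time energy or spectral-flux curve, flagging frames where it jumps sharply. A deliberately crude sketch of the energy-based variant (the frame size, hop, and jump ratio are illustrative choices, not the paper's settings):

```python
import numpy as np

def energy_onsets(signal, frame=256, hop=128, ratio=2.0):
    """Flag an onset wherever short-time energy jumps by `ratio` over the
    previous frame -- a crude stand-in for the onset detectors compared
    in the paper. Returns the sample index where each onset frame starts."""
    energies = []
    for start in range(0, len(signal) - frame + 1, hop):
        seg = signal[start:start + frame]
        energies.append(float(np.dot(seg, seg)))
    onsets = []
    for i in range(1, len(energies)):
        if energies[i] > ratio * max(energies[i - 1], 1e-12):
            onsets.append(i * hop)
    return onsets
```

Practical detectors add spectral weighting, adaptive thresholds, and peak picking, but the jump-over-baseline principle is the same.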
Spoken Language Technology Workshop | 2012
Sandrine Brognaux; Thomas Drugman; Richard Beaufort
Both unit-selection and HMM-based speech synthesis require large annotated speech corpora. To generate more natural speech, considering the prosodic nature of each phoneme of the corpus is crucial. Generally, phonemes are assigned labels which should reflect their suprasegmental characteristics. These labels often result from an automatic syntactic analysis, without checking the acoustic realization of the phoneme in the corpus. This leads to numerous errors because syntax and prosody do not always coincide. This paper proposes a method to reduce the number of labeling errors using acoustic information. It is applicable as a post-process to any syntax-driven prosody labeling. Acoustic features are used to check the syntax-based labels and suggest potential modifications. The proposed technique has the advantage of not requiring a manually prosody-labelled corpus. The evaluation on a French corpus shows that more than 75% of the errors detected by the method are actual errors that must be corrected.
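One way such an acoustic post-check can work, sketched here as a toy and not as the paper's method: a syntax-based prominence label becomes suspect when the phoneme's acoustic correlates (F0, duration) contradict it relative to the rest of the corpus. The feature set, thresholds, and label names below are illustrative assumptions.

```python
from statistics import median

def flag_suspect_labels(phones):
    """Flag syntax-based prominence labels that the acoustics contradict.

    phones: list of dicts with keys 'label' ('accented' or 'unaccented'),
    'f0' (mean F0 in Hz) and 'dur' (duration in s). A label is suspect
    when it claims accent but both F0 and duration fall below the corpus
    medians -- a deliberately crude stand-in for the paper's checks.
    Returns the indices of suspect phonemes.
    """
    f0_med = median(p['f0'] for p in phones)
    dur_med = median(p['dur'] for p in phones)
    return [i for i, p in enumerate(phones)
            if p['label'] == 'accented'
            and p['f0'] < f0_med and p['dur'] < dur_med]
```

The flagged items would then be presented for correction rather than silently relabelled, which matches the "suggest potential modifications" framing above.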
Conference of the International Speech Communication Association | 2013
Sandrine Brognaux; Benjamin Picart; Thomas Drugman
Conference of the International Speech Communication Association | 2014
Sophie Roekhaut; Sandrine Brognaux; Richard Beaufort; Thierry Dutoit
8th ISCA Speech Synthesis Workshop | 2013
Benjamin Picart; Sandrine Brognaux; Thomas Drugman
Speech Prosody | 2014
Marie-Catherine Michaux; Sandrine Brognaux; George Christodoulides