Yannis Stylianou
Toshiba
Publications
Featured research published by Yannis Stylianou.
IEEE Transactions on Speech and Audio Processing | 1998
Yannis Stylianou; Olivier Cappé; Eric Moulines
Voice conversion, as considered in this paper, is defined as modifying the speech signal of one speaker (source speaker) so that it sounds as if it had been pronounced by a different speaker (target speaker). Our contribution includes the design of a new methodology for representing the relationship between two sets of spectral envelopes. The proposed method is based on the use of a Gaussian mixture model of the source speaker spectral envelopes. The conversion itself is represented by a continuous parametric function which takes into account the probabilistic classification provided by the mixture model. The parameters of the conversion function are estimated by least squares optimization on the training data. This conversion method is implemented in the context of the HNM (harmonic+noise model) system, which allows high-quality modifications of speech signals. Compared to earlier methods based on vector quantization, the proposed conversion scheme results in a much better match between the converted envelopes and the target envelopes. Evaluation by objective tests and formal listening tests shows that the proposed transform greatly improves the quality and naturalness of the converted speech signals compared with previously proposed conversion methods.
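A minimal sketch of the idea, assuming paired source/target spectral-envelope features and using only the simpler, posterior-weighted form of the conversion function (the paper's full function also includes a covariance-based correction term); function and variable names are illustrative:

```python
# Minimal sketch of GMM-weighted voice conversion, assuming paired
# source/target envelope features X (N x d) and Y (N x d).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_conversion(X, Y, n_components=8):
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(X)                              # model the source envelopes
    P = gmm.predict_proba(X)                # posteriors P(C_i | x_n), shape (N, m)
    # Least-squares fit of target vectors v_i: minimize ||P @ V - Y||^2
    V, *_ = np.linalg.lstsq(P, Y, rcond=None)
    return gmm, V                           # V has shape (m, d)

def convert(gmm, V, x):
    p = gmm.predict_proba(x.reshape(1, -1))  # posteriors for one source frame
    return (p @ V).ravel()                   # converted spectral envelope
```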
IEEE Transactions on Speech and Audio Processing | 2001
Yannis Stylianou
This paper describes the application of the harmonic plus noise model (HNM) for concatenative text-to-speech (TTS) synthesis. In the context of HNM, speech signals are represented as a time-varying harmonic component plus a modulated noise component. The decomposition of a speech signal into these two components allows for more natural-sounding modifications of the signal (e.g., by using different and better adapted schemes to modify each component). The parametric representation of speech using HNM provides a straightforward way of smoothing discontinuities of acoustic units around concatenation points. Formal listening tests have shown that HNM provides high-quality speech synthesis while outperforming other models for synthesis (e.g., TD-PSOLA) in intelligibility, naturalness, and pleasantness.
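As an illustration of the harmonic component of such a representation, the toy sketch below synthesizes one voiced frame from a constant f0 and per-harmonic amplitudes and phases; the noise component and the actual HNM analysis/synthesis machinery are not shown, and all names are hypothetical:

```python
# Toy sketch of the harmonic part of an HNM frame: sum of harmonics of f0
# with per-harmonic amplitudes and phases (assumed constant within the frame).
import numpy as np

def harmonic_frame(f0, amps, phases, fs=16000, n=320):
    t = np.arange(n) / fs
    k = np.arange(1, len(amps) + 1)[:, None]        # harmonic numbers
    return (amps[:, None] *
            np.cos(2 * np.pi * k * f0 * t + phases[:, None])).sum(axis=0)

# The noise component (above the maximum voiced frequency) would be modelled
# separately, e.g. as time-modulated filtered noise, and added to this frame.
```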
Journal of the Acoustical Society of America | 1999
Mark C. Beutnagel; Alistair Conkie; Juergen Schroeter; Yannis Stylianou; Ann K. Syrdal
The new AT&T TTS system for general U.S. English text is based on best‐choice components picked from the AT&T Flextalk TTS, the Festival System from the University of Edinburgh, and ATR’s CHATR system. From Flextalk, it employs text normalization, letter‐to‐sound, and (optionally) baseline prosody generation. Festival provides general software‐engineering infrastructure (modularity) for easy experimentation and competitive evaluation of different algorithms or modules. Finally, CHATR’s unit selection was modified to guarantee the intelligibility of a good n‐phone (n=2 would be diphone) synthesizer while improving significantly on perceived naturalness relative to Flextalk. Each decision made during the research and development phase of this system was based on formal subjective evaluations. For example, the best voice found in a test that compared TTS systems built from several speakers gave a 0.3‐point head start (on a 5‐point rating scale) in quality over the mean of all speakers. Similarly, using our H...
international conference on acoustics, speech, and signal processing | 2001
Yannis Stylianou; Ann K. Syrdal
Concatenative speech synthesis systems attempt to minimize audible signal discontinuities between two successive concatenated units. An objective distance measure which is able to predict audible discontinuities is therefore very important, particularly in unit selection synthesis, for which units are selected from among a large inventory at run time. In this paper, we describe a perceptual test to measure the detection rate of concatenation discontinuities by humans, and then we evaluate 13 different objective distance measures based on their ability to predict the human results. Criteria used to classify these distances include the detection rate, the Bhattacharyya measure of separability of two distributions, and receiver operating characteristic (ROC) curves. Results show that the Kullback-Leibler distance on power spectra has the highest detection rate, followed by the Euclidean distance on Mel-frequency cepstral coefficients (MFCC).
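The two best-performing distances mentioned above could be computed at a concatenation point roughly as in the sketch below; the symmetrised form of the Kullback-Leibler distance and the variable names are assumptions, not the paper's exact implementation:

```python
# Sketch of two of the evaluated distances, computed between the frames on
# either side of a concatenation point.
import numpy as np

def kl_power_spectra(p, q, eps=1e-12):
    # Symmetrised Kullback-Leibler distance between power spectra treated
    # as probability distributions over frequency bins (an assumed variant).
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return np.sum((p - q) * np.log((p + eps) / (q + eps)))

def euclidean_mfcc(c_left, c_right):
    # Euclidean distance between MFCC vectors of the adjoining frames.
    return np.linalg.norm(c_left - c_right)
```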
IEEE Transactions on Audio, Speech, and Language Processing | 2008
Andre Holzapfel; Yannis Stylianou
Nonnegative matrix factorization (NMF) is used to derive a novel description for the timbre of musical sounds. Using NMF, a spectrogram is factorized, providing a characteristic spectral basis. Assuming a set of spectrograms for a given musical genre, the space spanned by the vectors of the obtained spectral bases is modeled statistically using mixtures of Gaussians, resulting in a description of the spectral basis for this musical genre. This description is shown to improve classification results by up to 23.3% compared to MFCC-based models, while the compression performed by the factorization decreases training time significantly. Using a distance-based stability measure, this compression is shown to reduce the noise present in the data set, resulting in more stable classification models. In addition, we compare the mean squared errors of the approximation to a spectrogram using independent component analysis and nonnegative matrix factorization, showing the superiority of the latter approach.
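A minimal sketch of the pipeline described here, using scikit-learn's NMF and GaussianMixture; the number of basis vectors and mixture components are illustrative choices, not the values used in the paper:

```python
# Sketch: derive a spectral basis from a magnitude spectrogram with NMF,
# then model the pooled basis vectors of a genre with a Gaussian mixture.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.mixture import GaussianMixture

def spectral_basis(spectrogram, n_basis=20):
    # spectrogram: nonnegative array, shape (n_freq_bins, n_frames)
    model = NMF(n_components=n_basis, init="nndsvda", max_iter=500)
    W = model.fit_transform(spectrogram)   # basis, shape (n_freq_bins, n_basis)
    return W.T                             # one spectral basis vector per row

def genre_model(bases, n_mix=4):
    # bases: basis vectors pooled from all spectrograms of one genre
    return GaussianMixture(n_components=n_mix).fit(bases)
```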
international conference on acoustics, speech, and signal processing | 2009
Yannis Stylianou
Voice transformation refers to the various modifications one may apply to the sound produced by a person, speaking or singing. Voice transformation is usually seen as an add-on or an external system in speech synthesis systems, since it can create virtual voices in a simple and flexible way. In this paper we review the state-of-the-art voice transformation methodology, showing its limitations in producing good speech quality and its current challenges. By addressing quality issues of current voice transformation algorithms in conjunction with properties of the speech production and speech perception systems, we try to pave the way for more natural voice transformation algorithms in the future. Facing these challenges will allow voice transformation systems to be applied in important and versatile areas of speech technology, in applications that reach far beyond speech synthesis.
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Yannis Pantazis; Olivier Rosec; Yannis Stylianou
In this paper, we present an iterative method for the accurate estimation of amplitude and frequency modulations (AM-FM) in time-varying multi-component quasi-periodic signals such as voiced speech. Based on a deterministic plus noise representation of speech initially suggested by Laroche (“HNM: A simple, efficient harmonic plus noise model for speech,” Proc. WASPAA, Oct., 1993, pp. 169-172), and focusing on the deterministic representation, we reveal the properties of the model showing that such a representation is equivalent to a time-varying quasi-harmonic representation of voiced speech. Next, we show how this representation can be used for the estimation of amplitude and frequency modulations and provide the conditions under which such an estimation is valid. Finally, we suggest an adaptive algorithm for nonparametric estimation of AM-FM components in voiced speech. Based on the estimated amplitude and frequency components, a high-resolution time-frequency representation is obtained. The suggested approach was evaluated on synthetic AM-FM signals, while using the estimated AM-FM information, speech signal reconstruction was performed, resulting in a high signal-to-reconstruction error ratio (around 30 dB).
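A rough sketch of a single-frame least-squares fit of the quasi-harmonic model s(t) ≈ Σ_k (a_k + t·b_k) e^{j2πkf0t}; the frame is assumed to be in analytic (complex) form, and the per-harmonic frequency-mismatch expression below is one commonly cited form rather than a verbatim reproduction of the paper's estimator:

```python
# Sketch of a least-squares quasi-harmonic fit of one voiced frame.
# 'frame' is assumed to be the analytic (complex) signal of the frame,
# e.g. obtained with scipy.signal.hilbert on the real samples.
import numpy as np

def qhm_fit(frame, f0, fs, n_harm):
    n = len(frame)
    t = (np.arange(n) - n // 2) / fs                  # time centred on the frame
    k = np.arange(1, n_harm + 1)
    E = np.exp(2j * np.pi * np.outer(t, k) * f0)      # harmonic exponentials
    B = np.hstack([E, t[:, None] * E])                # columns for a_k and b_k
    coef, *_ = np.linalg.lstsq(B, frame.astype(complex), rcond=None)
    a, b = coef[:n_harm], coef[n_harm:]
    # Per-harmonic frequency-mismatch estimate (assumed form, in Hz)
    eta = (a.real * b.imag - a.imag * b.real) / (2 * np.pi * np.abs(a) ** 2)
    return a, b, eta
```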
IEEE Transactions on Audio, Speech, and Language Processing | 2010
Andre Holzapfel; Yannis Stylianou; Ali Cenk Gedik; Baris Bozkurt
In this paper, we suggest a novel group delay based method for the onset detection of pitched instruments. It is proposed to approach the problem of onset detection by examining three dimensions separately: phase (i.e., group delay), magnitude and pitch. The evaluation of the suggested onset detectors for phase, pitch and magnitude is performed using a new publicly available and fully onset annotated database of monophonic recordings which is balanced in terms of included instruments and onset samples per instrument, while it contains different performance styles. Results show that the accuracy of onset detection depends on the type of instruments as well as on the style of performance. Combining the information contained in the three dimensions by means of a fusion at decision level leads to an improvement of onset detection by about 8% in terms of F-measure, compared to the best single dimension.
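The phase dimension could be built on the standard group-delay computation sketched below (here simply averaged over frequency per frame); the paper's actual phase-slope detection function and the decision-level fusion with the magnitude and pitch dimensions are not reproduced:

```python
# Sketch of a per-frame group-delay value usable as a phase-based feature
# for onset detection; detection logic and fusion are not shown.
import numpy as np

def mean_group_delay(frame):
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)
    Xn = np.fft.rfft(frame * n)                        # DFT of n * x[n]
    gd = np.real(Xn * np.conj(X)) / (np.abs(X) ** 2 + 1e-12)
    return gd.mean()                                   # average over frequency bins
```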
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Maria Markaki; Yannis Stylianou
In this paper, we explore the information provided by a joint acoustic and modulation frequency representation, referred to as modulation spectrum, for detection and discrimination of voice disorders. The initial representation is first transformed to a lower dimensional domain using higher order singular value decomposition (HOSVD). From this dimension-reduced representation a feature selection process is suggested using an information-theoretic criterion based on the mutual information between voice classes (i.e., normophonic/dysphonic) and features. To evaluate the suggested approach and representation, we conducted cross-validation experiments on a database of sustained vowel recordings from healthy and pathological voices, using support vector machines (SVMs) for classification. For voice pathology detection, the suggested approach achieved a classification accuracy of 94.1±0.28% (95% confidence interval), which is comparable to the accuracy achieved using cepstral-based features. However, for voice pathology classification the suggested approach significantly outperformed the performance of cepstral-based features.
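A minimal sketch of the feature-selection and classification stage, assuming the modulation-spectrum features have already been reduced with HOSVD; scikit-learn's mutual-information ranking and SVM are used as stand-ins, and the number of retained features is an arbitrary illustrative value:

```python
# Sketch: rank dimension-reduced features by mutual information with the
# class labels, keep the top ones, and cross-validate an SVM on them.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def select_and_classify(features, labels, n_keep=50):
    mi = mutual_info_classif(features, labels)         # MI of each feature with labels
    keep = np.argsort(mi)[::-1][:n_keep]               # indices of the top features
    scores = cross_val_score(SVC(kernel="rbf"), features[:, keep], labels, cv=5)
    return keep, scores.mean()
```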
IEEE Transactions on Speech and Audio Processing | 2001
Yannis Stylianou
Many current text-to-speech (TTS) systems are based on the concatenation of acoustic units of recorded speech. While this approach is believed to lead to higher intelligibility and naturalness than synthesis-by-rule, it has to cope with the issues of concatenating acoustic units that have been recorded at different times and in a different order. One important issue related to the concatenation of these acoustic units is their synchronization. In terms of signal processing this means removing linear phase mismatches between concatenated speech frames. This paper presents two novel approaches to the problem of synchronization of speech frames with an application to concatenative speech synthesis. Both methods are based on the processing of phase spectra without, however, decreasing the quality of the output speech, in contrast to previously proposed methods. The first method is based on the notion of center of gravity and the second on differentiated phase data. They are applied off-line, during the preparation of the speech database without, therefore, any computational burden on synthesis. The proposed methods have been tested with the harmonic plus noise model, HNM, and the TTS system of AT&T Labs. The resulting synthetic speech is free of linear phase mismatches.
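As a rough illustration of the first idea, the sketch below shifts a frame so that its energy centre of gravity lands at a fixed reference position, which amounts to removing a linear phase term; this is a simplification under assumed conventions, since the paper operates on phase spectra directly:

```python
# Toy sketch: align a frame's energy centre of gravity to a reference
# position via a circular shift (equivalent to a linear phase change).
import numpy as np

def centre_of_gravity(frame):
    e = frame ** 2
    return np.sum(np.arange(len(frame)) * e) / (np.sum(e) + 1e-12)

def align_frame(frame, reference=None):
    if reference is None:
        reference = len(frame) // 2
    shift = int(round(reference - centre_of_gravity(frame)))
    return np.roll(frame, shift)
```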