Parham Zolfaghari
University of Cambridge
Publications
Featured research published by Parham Zolfaghari.
International Conference on Spoken Language Processing | 1996
Parham Zolfaghari; Tony Robinson
The paper describes a new formant analysis technique in which the formant parameters are represented as Gaussian mixture distributions estimated from the discrete Fourier transform (DFT) magnitude spectrum of the speech signal. The parameters obtained are the means, variances and masses of the density functions, which are used to calculate the centre frequencies, bandwidths and amplitudes of formants within the spectrum. In order to better fit the mixture distributions, various modifications to the DFT magnitude spectrum based on simple models of perception were investigated, including reduction of dynamic range, cepstral smoothing, use of the Mel scale and pre-emphasis of the speech. Results are presented for these modifications, along with formant tracks obtained by analysing speech with the final formant analysis system.
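The core idea above — treating the normalised DFT magnitude spectrum as a probability density and fitting a Gaussian mixture to it — can be sketched with a weighted EM loop. This is a minimal illustration under that interpretation, not the authors' implementation; the function name and initialisation are hypothetical.

```python
import numpy as np

def fit_spectral_mog(mag, n_comp=3, n_iter=100):
    """Fit a 1-D Gaussian mixture to a magnitude spectrum by weighted EM,
    treating each frequency bin as a data point weighted by its magnitude.
    Illustrative sketch only."""
    f = np.arange(len(mag), dtype=float)             # bin index as frequency axis
    w = mag / mag.sum()                              # normalise spectrum to a density
    mu = np.linspace(0, len(mag), n_comp + 2)[1:-1]  # evenly spaced initial means
    var = np.full(n_comp, (len(mag) / n_comp) ** 2)
    pi = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each bin
        d2 = (f[None, :] - mu[:, None]) ** 2
        p = pi[:, None] * np.exp(-0.5 * d2 / var[:, None]) \
            / np.sqrt(2 * np.pi * var[:, None])
        r = p / p.sum(axis=0, keepdims=True)
        # M-step: updates weighted by the spectral mass in each bin
        nk = (r * w).sum(axis=1)
        mu = (r * w * f).sum(axis=1) / nk
        var = (r * w * (f[None, :] - mu[:, None]) ** 2).sum(axis=1) / nk
        var = np.maximum(var, 1e-3)                  # floor to avoid collapse
        pi = nk
    return mu, var, pi                               # centres, bandwidths^2, masses
```

The fitted means, variances and masses then play the role of formant centre frequencies, bandwidths and amplitudes described in the abstract.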
International Conference on Acoustics, Speech, and Signal Processing | 2004
Parham Zolfaghari; Shinji Watanabe; Atsushi Nakamura; Shigeru Katagiri
This paper presents a method for modelling the speech spectral envelope using a mixture of Gaussians (MoG). A novel variational Bayesian (VB) framework for Gaussian mixture modelling of a histogram enables the derivation of an objective function that can be used to simultaneously optimise both the model parameter distributions and the model structure. A histogram representation of the STRAIGHT spectral envelope, which is free of glottal excitation information, is used for parametrisation with this MoG model. The result is a parametrisation scheme that purely models the vocal tract resonance characteristics. Maximum likelihood (ML) and VB solutions of the mixture model on histogram data are found using an iterative algorithm. A comparison between ML-MoG and VB-MoG spectral modelling is carried out using spectral distortion measures and mean opinion scores (MOS). The main advantages of VB-MoG highlighted in this paper are better modelling with fewer Gaussians in the mixture, giving a closer correspondence between Gaussians and formant-like peaks, and an objective measure of the number of Gaussians required to best fit the spectral envelope.
International Conference on Acoustics, Speech, and Signal Processing | 1997
Parham Zolfaghari; Tony Robinson
This paper describes a new low bit-rate formant vocoder. The formant parameters are represented by Gaussian mixture distributions, which are estimated from the discrete Fourier transform (DFT) magnitude spectrum of the speech signal. A voiced/unvoiced classification mechanism has been developed based on the harmonic nature of each formant in the DFT spectrum modulated by the Gaussian mixture distribution. Using a magnitude-only sinusoidal synthesiser, intelligible synthetic speech has been obtained. Vector quantisation of the vocal tract parameters enables this formant vocoder to operate at a bit-rate of 1248 bps.
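The vector-quantisation step can be illustrated with a plain Lloyd's-algorithm codebook: each frame then costs log2(codebook size) bits for the vocal-tract parameters. This is a hedged sketch — the codebook size, initialisation and training loop here are generic, not the paper's.

```python
import numpy as np

def train_codebook(X, bits, n_iter=20):
    """Lloyd's algorithm: train a 2**bits-entry VQ codebook on parameter
    vectors X (frames x dims). Illustrative sketch only."""
    K = 2 ** bits
    # deterministic init: K frames evenly strided through the training data
    code = X[:: max(1, len(X) // K)][:K].astype(float).copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - code[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(1)                  # nearest codeword per frame
        for k in range(K):
            if np.any(idx == k):
                code[k] = X[idx == k].mean(0)
    return code

def quantise(X, code):
    d = ((X[:, None, :] - code[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)                     # transmit log2(len(code)) bits per frame
```

In a vocoder, only the codeword indices (plus pitch and voicing information) are transmitted, which is what brings the overall rate down to the order of 1 kbps.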
International Conference on Acoustics, Speech, and Signal Processing | 2004
Hideki Kawahara; Hideki Banno; Toshio Irino; Parham Zolfaghari
A tool for investigating a fundamental question in speech processing is proposed, with the aim of promoting research on voice quality and the para- and non-linguistic aspects of speech. The proposed method effectively emulates waveform-based methods, sinusoidal models and the high-quality source-filter model STRAIGHT. The key idea that enables blending these seemingly disjoint algorithms is a group-delay-based representation of the signal excitation. A unified source representation is implemented on top of a STRAIGHT-based smoothed time-frequency representation shared by these three types of speech processing method. Informal listening tests using the proposed system indicated that phase manipulation introduces a different timbre, but that the exact waveform need not be reproduced to preserve the same timbre. This suggests that further information reduction may be possible when synthesising close-to-natural-quality speech.
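Group delay itself — the quantity the excitation representation above is built on — can be computed for a short frame without explicit phase unwrapping, via the standard DFT identity tau(w) = Re[X_n(w) X*(w)] / |X(w)|^2, where X_n is the DFT of n*x[n]. This is a generic textbook computation, not the paper's representation.

```python
import numpy as np

def group_delay(x, n_fft=512):
    """Group delay (in samples) of a short frame, using the identity
    tau(w) = Re[X_n(w) X*(w)] / |X(w)|^2 with X_n = DFT of n*x[n]."""
    X = np.fft.rfft(x, n_fft)
    Xn = np.fft.rfft(np.arange(len(x)) * x, n_fft)
    mag2 = np.abs(X) ** 2
    return np.real(Xn * np.conj(X)) / np.maximum(mag2, 1e-12)
```

A pure delayed impulse is a useful sanity check: an impulse at sample d has a flat group delay of exactly d samples at every frequency.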
Signal Processing Systems | 2006
Parham Zolfaghari; Hiroko Kato; Yasuhiro Minami; Atsushi Nakamura; Shigeru Katagiri; Roy D. Patterson
In this paper, we describe a parametric mixture model for the resonant characteristics of the vocal tract, in which Gaussian distributions model spectral frequency regions. A mixture-of-Gaussians (MoG) parametrisation scheme is used to model a smoothed representation of the spectra. This smoothing procedure removes all signal periodicity from the spectra, allowing highly natural analysis, manipulation and synthesis of speech. The goals of this parametrisation scheme are to ease the correspondence between the resonant characteristics of the vocal tract and the parametric distributions, and to model the spectrum with an appropriate number of parameters. Previously, a maximum likelihood (ML) approach to this parametrisation scheme was introduced; however, that approach suffers from inherent local-optima problems. Noting that a relatively small class of Gaussian densities can approximate a large class of distributions, we propose a new scheme whereby, starting with a large number of distributions in the mixture, we systematically reduce their number and re-approximate the densities in the mixture based on a distance criterion. The Kullback-Leibler (KL) distance was found to yield optimal MoG solutions to the spectra. Furthermore, a fitness measure based on KL information provides a figure for estimating the model order needed to represent formant-like features. The proposed model is subjectively evaluated and is shown to reduce the number of Gaussians without appreciable loss in the quality of the re-synthesised speech.
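The reduction step — start with many Gaussians and repeatedly merge the pair that are closest under a symmetrised KL distance, moment-matching the merged component — can be sketched as follows. This is an illustrative reconstruction; the paper's exact criterion and merge rule may differ.

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    """KL divergence KL(N(m1,v1) || N(m2,v2)) between two 1-D Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def reduce_mog(pi, mu, var, target):
    """Greedily merge the pair of mixture components with the smallest
    symmetrised KL distance until `target` Gaussians remain."""
    pi, mu, var = list(pi), list(mu), list(var)
    while len(pi) > target:
        best, bi, bj = np.inf, 0, 1
        for i in range(len(pi)):
            for j in range(i + 1, len(pi)):
                d = kl_gauss(mu[i], var[i], mu[j], var[j]) + \
                    kl_gauss(mu[j], var[j], mu[i], var[i])
                if d < best:
                    best, bi, bj = d, i, j
        # moment-preserving merge of components bi and bj
        w = pi[bi] + pi[bj]
        m = (pi[bi] * mu[bi] + pi[bj] * mu[bj]) / w
        v = (pi[bi] * (var[bi] + (mu[bi] - m) ** 2) +
             pi[bj] * (var[bj] + (mu[bj] - m) ** 2)) / w
        pi[bj], mu[bj], var[bj] = w, m, v
        del pi[bi], mu[bi], var[bi]
    return np.array(pi), np.array(mu), np.array(var)
```

Starting from an over-complete mixture, nearby components (e.g. two Gaussians straddling one formant peak) are absorbed into one, while well-separated formant-like peaks keep their own component.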
Proceedings of the 2004 14th IEEE Signal Processing Society Workshop Machine Learning for Signal Processing, 2004. | 2004
Parham Zolfaghari; Hiroko Kato; Yasuhiro Minami; Atsushi Nakamura; Shigeru Katagiri
In this paper, we describe a parametric mixture model for the resonant characteristics of the vocal tract. We propose a mixture-of-Gaussians (MoG) spectral modelling scheme that enables model selection, with the goals of easing the correspondence between the resonant characteristics of the vocal tract and the parametric Gaussians, and of representing a spectrum with an appropriate number of parameters. Noting that a relatively small class of Gaussian densities can approximate a large class of distributions, we systematically reduce the number of Gaussians and re-approximate the densities in the MoG spectral model. The Kullback-Leibler (KL) distance between the densities in the mixture was found to yield optimal ML-MoG solutions to the spectra. A fitness measure based on KL information provides a figure for estimating the model order needed to represent formant-like features. The mixture model was fitted to a normalised smooth spectrum obtained by filtering the short-time Fourier transform in time and frequency with a pitch-adaptive Gaussian filter, which removes all source information from the spectra. By subjectively evaluating the quality of speech analysed and synthesised with this parametrisation scheme, we show considerable improvement over ML when using this Gaussian reduction scheme, particularly with a lower number of Gaussians in the mixture.
Journal of the Acoustical Society of America | 2002
Parham Zolfaghari; Hideki Banno; Fumitada Itakura; Hideki Kawahara
We describe a glottal-event-synchronous sinusoidal model for speech analysis and synthesis. The sinusoidal components are estimated event-synchronously using a mapping from linearly spaced filter centre frequencies to the instantaneous frequencies of the filter outputs. Frequency-domain fixed points of this mapping correspond to the constituent sinusoidal components of the input signal. A robust technique based on a wavelet representation of this fixed-point model is used for fundamental frequency extraction, as in STRAIGHT [Kawahara et al., IEICE (1999)]. The method for event detection and characterisation is based on group delay and a similar fixed-point analysis, and enables the detection of the precise timing and spread of speech events such as vocal fold closure. A trajectory continuation scheme is also applied to the extracted sinusoidal components. The proposed model is capable of high-quality speech synthesis using the overlap-add synthesis method and is also applicable to other sound sourc...
Journal of the Acoustical Society of America | 1999
Hideki Kawahara; Parham Zolfaghari
The performance of a new F0 extraction algorithm based on fixed-point analysis of the mapping from filter centre frequency to output instantaneous frequency [Kawahara et al., Eurospeech'99] was compared with that of a number of F0 extraction algorithms based on different definitions of fundamental frequency. The proposed method uses partial derivatives of the mapping at the fixed points to estimate the carrier-to-noise ratio of the F0 information, and enables the integration of F0 cues distributed among harmonic components to provide a reliable F0 estimate. Objective evaluations were conducted using simulations and a speech database with simultaneous EGG (electroglottograph) recordings. Subjective evaluations were based on assessment of the quality of speech reproduced by the high-quality speech analysis/modification/synthesis method STRAIGHT [Kawahara et al., Speech Commun. 27, 187-207]. Discussions of the relevance of various F0 definitions for high-quality speech synthesis will be presented based on these test results. It was also indicated that the pro...
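The fixed-point idea — sweep a bank of filters and look for centre frequencies that map to themselves under the filter-output instantaneous frequency — can be sketched as below. This is a simplified illustration with hypothetical filter parameters (Gabor filters, bandwidth constant `q`), not the published algorithm.

```python
import numpy as np

def inst_freq_map(x, fs, centers, q=1.5):
    """Median instantaneous frequency of each complex Gabor filter output.
    Fixed points of the centre -> output-frequency map indicate F0 and
    harmonics. Sketch only; `q` is a hypothetical bandwidth parameter."""
    out = []
    for fc in centers:
        sigma = q / fc                               # bandwidth scales with fc
        tt = np.arange(-3 * sigma, 3 * sigma, 1 / fs)
        g = np.exp(-0.5 * (tt / sigma) ** 2) * np.exp(2j * np.pi * fc * tt)
        y = np.convolve(x, g, mode="same")
        phase = np.unwrap(np.angle(y))
        fi = np.diff(phase) * fs / (2 * np.pi)       # instantaneous frequency
        out.append(np.median(fi))
    return np.array(out)

def f0_fixed_point(x, fs, centers):
    """Pick the centre frequency closest to being a fixed point of the map."""
    m = inst_freq_map(x, fs, centers)
    return centers[np.argmin(np.abs(m - centers))]
```

For a periodic signal, filters centred anywhere near the fundamental all report an output instantaneous frequency close to F0, so the map crosses the identity line at F0 — the fixed point the algorithm exploits.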
Conference of the International Speech Communication Association | 2003
Tomohiro Nakatani; Toshio Irino; Parham Zolfaghari
Conference of the International Speech Communication Association | 2000
Hideki Kawahara; Yoshinori Atake; Parham Zolfaghari