Saeed Vaseghi
Brunel University London
Publications
Featured research published by Saeed Vaseghi.
IEEE Transactions on Speech and Audio Processing | 1997
Saeed Vaseghi; Ben Milner
Several noise compensation schemes for speech recognition in impulsive and nonimpulsive noise are considered. The noise compensation schemes are spectral subtraction, HMM-based Wiener filters, noise-adaptive HMMs, and a front-end impulsive noise removal. The use of the cepstral-time matrix as an improved speech feature set is explored, and the noise compensation methods are extended for use with cepstral-time features. Experimental evaluations, on a spoken digit database, in the presence of car noise, helicopter noise, and impulsive noise, demonstrate that the noise compensation methods achieve substantial improvement in recognition across a wide range of signal-to-noise ratios. The results also show that the cepstral-time matrix is more robust than a vector of identical size composed of a combination of cepstral and differential cepstral features.
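Of the compensation schemes this abstract lists, spectral subtraction is the simplest to illustrate: subtract an estimated noise magnitude spectrum from the noisy magnitude spectrum and keep the noisy phase. The sketch below is a generic textbook version of the idea, not the authors' implementation; the function name, per-bin noise estimate, and the spectral-floor value are illustrative assumptions.

```python
import cmath

def spectral_subtract(noisy_frame, noise_mag, floor=0.01):
    """Magnitude spectral subtraction on one complex STFT frame.

    noisy_frame: list of complex spectral bins of the noisy speech
    noise_mag:   estimated noise magnitude per bin (e.g. averaged
                 over speech-free frames)
    floor:       spectral floor that prevents negative magnitudes
    """
    out = []
    for x, n in zip(noisy_frame, noise_mag):
        mag = abs(x)
        # Subtract the noise estimate; clamp to a fraction of the
        # noisy magnitude instead of letting the result go negative.
        clean = max(mag - n, floor * mag)
        # Re-attach the (unmodified) noisy phase.
        out.append(cmath.rect(clean, cmath.phase(x)))
    return out
```

In a full system this would run per overlapping analysis frame, followed by an inverse STFT with overlap-add.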
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Esfandiar Zavarehei; Saeed Vaseghi; Qin Yan
This paper presents a post-processing speech restoration module for enhancing the performance of conventional speech enhancement methods. The restoration module aims to retrieve parts of the speech spectrum that may be lost to noise or suppressed when using conventional speech enhancement methods. The proposed restoration method utilizes a harmonic plus noise model (HNM) of speech to retrieve damaged speech structure. A modified HNM of speech is proposed where, instead of the conventional binary labeling of the signal in each subband as voiced or unvoiced, the concept of harmonicity is introduced, which is more adaptable to the codebook mapping method used in the later stage of enhancement. To restore the lost or suppressed information, an HNM codebook mapping technique is proposed. The HNM codebook is trained on speaker-independent speech data. To reduce the sensitivity of the HNM codebook to speaker variability, a spectral energy normalization process is introduced. The proposed post-processing method is tested as an add-on module with several popular noise reduction methods. Evaluations of the performance gain obtained from the proposed post-processing are presented and compared to standard speech enhancement systems; the results show substantial gains in perceptual quality.
international conference on acoustics, speech, and signal processing | 1997
Saeed Vaseghi; Naomi Harte; Ben P. Milner
This paper explores the modelling of phonetic segments of speech with multi-resolution spectral/time correlates. For spectral representation, a set of multi-resolution cepstral features is proposed. Cepstral features obtained from a DCT of the log energy-spectrum over the full voice-bandwidth (100-4000 Hz) are combined with higher resolution features obtained from the DCT of the lower subband (100-2100 Hz) and upper subband (2100-4000 Hz) halves. This approach can be extended to several levels of different resolutions. For representation of the temporal structure of speech segments or phonetic units, the conventional cepstral and dynamic cepstral features representing speech at the sub-phonetic level are supplemented by a set of phonetic features that describe the trajectory of speech over the duration of a phonetic unit. A conditional probability model for phonetic and sub-phonetic features is considered. Experiments demonstrate that the inclusion of the segmental features results in about a 10% decrease in error rates.
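The multi-resolution cepstral idea can be sketched directly: take a DCT-II of the log energy spectrum over the full band, take it again over each half-band (which doubles the spectral resolution of the resulting coefficients), and concatenate. The function names and coefficient counts below are illustrative assumptions, not the paper's exact feature extractor.

```python
import math

def dct2(x, n_coeffs):
    """DCT-II of a sequence, returning the first n_coeffs coefficients
    (unnormalized)."""
    N = len(x)
    return [sum(x[k] * math.cos(math.pi * i * (k + 0.5) / N)
                for k in range(N))
            for i in range(n_coeffs)]

def multires_cepstra(log_spectrum, n_coeffs=8):
    """Full-band cepstra concatenated with higher-resolution cepstra
    of the lower and upper half-bands of a log energy spectrum."""
    mid = len(log_spectrum) // 2
    full  = dct2(log_spectrum, n_coeffs)        # full-band resolution
    lower = dct2(log_spectrum[:mid], n_coeffs)  # lower half, 2x resolution
    upper = dct2(log_spectrum[mid:], n_coeffs)  # upper half, 2x resolution
    return full + lower + upper
```

As the abstract notes, the same split can be applied recursively for further resolution levels.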
international conference on acoustics, speech, and signal processing | 2002
Qin Yan; Saeed Vaseghi
In this paper, we present a comparative study of the acoustic speech features of two major English accents: British English and American English. Experiments examined the deterioration in speech recognition resulting from the mismatch between the English accents of the input speech and the speech models. Mismatch in accents can increase the error rates by more than 100%. Hence a detailed study of the acoustic correlates of accent using intonation pattern and pitch characteristics was performed. Accent differences are acoustic manifestations of differences in duration, pitch and intonation pattern, as well as differences in phonetic transcription. In particular, British speakers exhibit much steeper pitch rise and fall patterns and a lower average pitch in most vowels. Finally, a possible means of converting English accents, based on the above analysis, is suggested.
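The "steeper pitch rise and fall" finding points to a simple acoustic correlate: the slope of the F0 contour over a voiced region. One hedged way to quantify it is a least-squares slope over the frame-rate F0 track, sketched below under an assumed 10 ms frame hop; this is an illustration, not the paper's analysis method.

```python
def f0_slope(f0, hop_s=0.01):
    """Least-squares slope (Hz per second) of a voiced F0 contour,
    sampled once per analysis frame of duration hop_s seconds."""
    n = len(f0)
    t = [i * hop_s for i in range(n)]
    t_mean = sum(t) / n
    f_mean = sum(f0) / n
    num = sum((ti - t_mean) * (fi - f_mean) for ti, fi in zip(t, f0))
    den = sum((ti - t_mean) ** 2 for ti in t)
    return num / den
```

A steeper accent-specific rise or fall would show up as a larger magnitude of this slope over the corresponding contour segment.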
international conference on acoustics, speech, and signal processing | 2006
Saeed Vaseghi; Esfandiar Zavarehei; Qin Yan
This paper presents a method for restoration of the missing bandwidth of narrowband speech signals. Speech is decomposed into a linear prediction (LP) model of the spectral envelope and a harmonic plus noise model (HNM) of speech excitation. The LP spectral envelope and HNM excitation parameters of the narrowband speech are extrapolated using codebooks trained on narrowband and wideband speech. A novel contribution of this paper is the introduction of a parametric measure of the harmonicity of excitation in harmonically-spaced sub-bands. The wideband LSF parameters and the degree of harmonicity of the missing excitation are estimated from those of the narrowband speech via codebook mapping. The method is successful in restoring the harmonicity of speech and converts telephone quality speech to perceptually high quality wideband speech.
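A harmonicity measure over harmonically-spaced sub-bands can be approximated as the fraction of each sub-band's energy that lies near the harmonic at its centre. The binning, the window width around each harmonic, and the names below are assumptions for illustration; the paper's parametric definition may differ.

```python
def subband_harmonicity(mag, f0_bin, width=1):
    """Per-sub-band harmonicity of a magnitude spectrum.

    mag:    magnitude spectrum (list of floats)
    f0_bin: fundamental frequency expressed in spectral bins
    width:  half-width (in bins) of the window counted as 'harmonic'

    Each sub-band spans half an F0 either side of harmonic h; its
    harmonicity is the energy within +/- width bins of the harmonic
    divided by the total sub-band energy.
    """
    N = len(mag)
    out = []
    h = 1
    while (h + 0.5) * f0_bin < N:
        lo = int(round((h - 0.5) * f0_bin))
        hi = int(round((h + 0.5) * f0_bin))
        centre = int(round(h * f0_bin))
        total = sum(m * m for m in mag[lo:hi])
        harm = sum(m * m for m in mag[max(lo, centre - width):
                                      min(hi, centre + width + 1)])
        out.append(harm / total if total > 0 else 0.0)
        h += 1
    return out
```

A purely harmonic band scores near 1; a noise-like band scores near the ratio of window width to sub-band width.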
Computer Speech & Language | 2007
Qin Yan; Saeed Vaseghi; Esfandiar Zavarehei; Ben Milner; Jonathan Darch; P.R. White; Ioannis Andrianakis
This paper presents a formant tracking linear prediction (LP) model for speech processing in noise. The main focus of this work is on the utilization of the correlation of the energy contours of speech, along the formant tracks, for improved formant and LP model estimation in noise. The approach proposed in this paper provides a systematic framework for modelling and utilization of the inter-frame correlation of speech parameters across successive speech frames; the within-frame correlations are modelled by the LP parameters. The formant tracking LP model estimation is composed of three stages: (1) a pre-cleaning spectral amplitude estimation stage where an initial estimate of the LP model of speech for each frame is obtained, (2) a formant classification and estimation stage using probability models of formants and Viterbi decoders, and (3) an inter-frame formant de-noising and smoothing stage where Kalman filters are used to model the formant trajectories and reduce the effect of residual noise on formants. The adverse effects of car and train noise on estimates of formant tracks and LP models are investigated. The evaluation results for the estimation of the formant tracking LP model demonstrate that the proposed combination of the initial noise reduction stage with the formant tracking and Kalman smoothing stages results in a significant reduction in errors and distortions.
international conference on acoustics, speech, and signal processing | 2004
Dimitrios Rentzos; Saeed Vaseghi; Qin Yan; Ching-Hsiang Ho
This paper presents a voice conversion method based on transformation of the characteristic features of a source speaker towards a target. Voice characteristic features are grouped into two main categories: (a) the spectral features at formants and (b) the pitch and intonation patterns. Signal modelling and transformation methods for each group of voice features are outlined. The spectral features at formants are modelled using a set of two-dimensional phoneme-dependent HMMs. Subband frequency warping is used for spectrum transformation, with the subbands centred on the estimates of the formant trajectories. The F0 contour is used for modelling the pitch and intonation patterns of speech. A PSOLA based method is employed for transformation of pitch, intonation patterns and speaking rate. The experiments present illustrations and perceptual evaluations of the results of transformations of the various voice features.
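Formant-centred frequency warping can be illustrated with a piecewise-linear warp that maps each source formant frequency onto its target counterpart and interpolates linearly in between. Everything below (names, the Nyquist anchor, the assumption that formants are strictly increasing and below Nyquist) is an illustrative sketch, not the paper's exact transformation.

```python
def warp_frequency(f, src_formants, tgt_formants, nyquist=8000.0):
    """Piecewise-linear frequency warp mapping each source formant
    onto the corresponding target formant.

    Anchors the warp at 0 Hz and at the Nyquist frequency, so the
    band edges are fixed points. Assumes both formant lists are
    strictly increasing, equal in length, and below nyquist.
    """
    src = [0.0] + list(src_formants) + [nyquist]
    tgt = [0.0] + list(tgt_formants) + [nyquist]
    for i in range(len(src) - 1):
        if src[i] <= f <= src[i + 1]:
            # Linear interpolation within the matching subband.
            a = (f - src[i]) / (src[i + 1] - src[i])
            return tgt[i] + a * (tgt[i + 1] - tgt[i])
    return f  # outside the modelled band: leave unchanged
```

Applying this warp to the spectral envelope shifts formant peaks toward the target speaker while leaving the band edges intact.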
international conference on acoustics, speech, and signal processing | 2003
Qin Yan; Saeed Vaseghi
The formant spaces of three major English accents, namely British, American and Australian, are modelled and used for accent conversion. Accent synthesis, through modification of the acoustic parameters of speech, provides a means for assessing the perceptual contribution of each parameter to conveying an accent. An improved method based on linear prediction (LP) feature analysis and a 2D hidden Markov model (HMM) is employed for estimation of the formant trajectories of vowels and diphthongs. Comparative analysis of the formant spaces of the three accents indicates that these accents are partly conveyed by the fronting and backing of vowels. It is found that the first formants of the vowels of the British and American English accents are higher than those of the Australian accent, while Australian speakers have higher second formants in vowels than American and British speakers. The estimates of the distributions of formants for each accent are used in a speech synthesis system for accent conversion. Perceptual evaluations of accent conversion results illustrate that formants, in particular the second formant, play an important role in conveying accents.
international conference on acoustics, speech, and signal processing | 1990
Saeed Vaseghi
A finite-state code excited linear prediction (CELP) system is proposed for variable-rate speech coding. The encoding system consists of a number of CELP coders with different linear predictive coding parameter quantization patterns, codebook sizes, and population densities. The selection of the encoding state for each input vector depends on the input signal characteristics, the desired bit rate/signal to quantized noise ratio, and the current state of the encoder. In CELP coders the greater part of the bit resources, about 70%, is used for encoding of the excitation signal. However, the excitation accuracy needed to encode a speech segment with a desired level of fidelity strongly depends on its short-term spectral characteristics. The use of a finite-state system involves some implicit clustering of speech and allows variable-rate coding of the excitation. The gain in compression that can be obtained from variable-rate coding of the excitation signal is investigated. Experiments with a four-state variable-rate CELP coder produce good-quality encoded speech at 5 kbit/s.
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Qin Yan; Saeed Vaseghi; Dimitrios Rentzos; Ching-Hsiang Ho
In this paper, the probability distribution functions (pdfs) of the formant spaces of three major accents of the English language, namely, British Received Pronunciation (RP), General American, and Broad Australian, are modeled and compared. The statistical differences across the formant spaces of these accents are employed for accent conversion. An improved formant tracking method, based on linear prediction (LP) feature analysis and a two-dimensional hidden Markov model (2-D-HMM) of formant trajectories, is used for estimation of the formant trajectories of vowels and diphthongs of each accent. Comparative analysis of the formant spaces of the three accents indicates that these accents are partly conveyed by the differences of the formants of vowels. The estimates of the probability distributions of the formants for each accent are used in a speech synthesis system for accent conversion. Accent synthesis, through modification of the acoustic parameters of speech, provides a means of assessing the perceptual contribution of each formant parameter to conveying an accent. The results of perceptual evaluations of accent conversion illustrate that formants play an important role in conveying accents.