
Publication


Featured research published by Michael W. Macon.


International Conference on Acoustics, Speech, and Signal Processing | 1998

Spectral voice conversion for text-to-speech synthesis

Alexander Kain; Michael W. Macon

A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study the effects of the amount of training data on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones with a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved even with a small amount of training data. However, speech quality improved with increases in the training set size.
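
As a rough sketch of this style of mapping (an illustration under stated assumptions, not the authors' implementation), the Python snippet below fits a Gaussian mixture model to joint source-target spectral parameter vectors and converts a source frame with the posterior-weighted locally linear regression implied by the joint density; the function names and the use of scikit-learn and SciPy are assumptions.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(src, tgt, n_components=8):
    # src, tgt: time-aligned spectral parameter matrices, shape (frames, dim)
    joint = np.hstack([src, tgt])
    return GaussianMixture(n_components=n_components, covariance_type="full",
                           random_state=0).fit(joint)

def convert_frame(gmm, x):
    # x: one source spectral parameter vector, shape (dim,)
    dim = x.shape[0]
    w = gmm.weights_
    mu_x, mu_y = gmm.means_[:, :dim], gmm.means_[:, dim:]
    S_xx = gmm.covariances_[:, :dim, :dim]
    S_yx = gmm.covariances_[:, dim:, :dim]
    # posterior p(k | x) under the marginal source-side mixture
    lik = np.array([w[k] * multivariate_normal.pdf(x, mu_x[k], S_xx[k])
                    for k in range(gmm.n_components)])
    post = lik / lik.sum()
    # posterior-weighted locally linear regression toward the target space
    y = np.zeros(dim)
    for k in range(gmm.n_components):
        y += post[k] * (mu_y[k] + S_yx[k] @ np.linalg.solve(S_xx[k], x - mu_x[k]))
    return y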


International Conference on Acoustics, Speech, and Signal Processing | 2001

Design and evaluation of a voice conversion algorithm based on spectral envelope mapping and residual prediction

Alexander Kain; Michael W. Macon

The purpose of a voice conversion (VC) system is to change the perceived speaker identity of a speech signal. We propose an algorithm based on converting the LPC spectrum and predicting the residual as a function of the target envelope parameters. We conduct listening tests based on speaker discrimination of same/different pairs to measure the accuracy with which the converted voices match the desired target voices. To establish a baseline level of human performance, we first measure the ability of listeners to discriminate between original speech utterances under three conditions: normal, fundamental-frequency and duration normalized, and LPC coded. Additionally, the spectral parameter conversion function is tested in isolation by listening to source, target, and converted speakers as LPC-coded speech. The results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker at the same level of performance as discrimination between LPC-coded speech. However, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.
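
The residual prediction step is only described at a high level in the abstract; as a loose illustration of one way such a predictor could look (an assumption, not the paper's method), the sketch below blends stored target residual magnitude spectra using soft weights derived from distances in envelope parameter space.

import numpy as np

def predict_residual(env, env_codebook, res_codebook, temperature=1.0):
    # env:          converted envelope parameter vector, shape (dim,)
    # env_codebook: stored target envelope parameter vectors, shape (N, dim)
    # res_codebook: residual magnitude spectra paired with them, shape (N, n_bins)
    d2 = np.sum((env_codebook - env) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * temperature))   # soft weights from envelope distance
    w /= w.sum()
    return w @ res_codebook                 # weighted blend of stored residuals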


IEEE Transactions on Speech and Audio Processing | 1997

Sinusoidal modeling and modification of unvoiced speech

Michael W. Macon; Mark A. Clements

Although sinusoidal models have been shown to be useful for time-scale and pitch modification of voiced speech, objectionable artifacts often arise when such models are applied to unvoiced speech. This article presents a sinusoidal model-based speech modification algorithm that preserves the natural character of unvoiced speech sounds after pitch and time-scale modification, eliminating commonly encountered artifacts. This advance is accomplished via a perceptually motivated modulation of the sinusoidal component phases that mitigates artifacts in the reconstructed signal after time-scale and pitch modification.
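
As a toy illustration of the general idea of perceptually motivated phase treatment for noise-like components (not the paper's specific modulation scheme; all names here are assumptions), the sketch below dithers the phases of components marked unvoiced before each synthesis frame, which breaks the artificial periodicity that otherwise appears after time-scale modification.

import numpy as np

def synthesize_frame(amps, freqs_hz, phases, fs, frame_len, unvoiced, rng=None):
    # amps, freqs_hz, phases: sinusoidal parameters for one analysis frame
    # unvoiced: boolean mask marking components treated as noise-like
    rng = np.random.default_rng() if rng is None else rng
    phases = phases.copy()
    phases[unvoiced] += rng.uniform(-np.pi, np.pi, size=int(unvoiced.sum()))
    n = np.arange(frame_len)
    frame = np.zeros(frame_len)
    for a, f, p in zip(amps, freqs_hz, phases):
        frame += a * np.cos(2 * np.pi * f * n / fs + p)
    return frame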


International Conference on Acoustics, Speech, and Signal Processing | 1996

Speech concatenation and synthesis using an overlap-add sinusoidal model

Michael W. Macon; Mark A. Clements

In this paper, an algorithm for the concatenation of speech signal segments taken from disjoint utterances is presented. The algorithm is based on the analysis-by-synthesis/overlap-add (ABS/OLA) sinusoidal model, which is capable of performing high quality pitch- and time-scale modification of both speech and music signals. With the incorporation of concatenation and smoothing techniques, the model is capable of smoothing the transitions between separately-analyzed speech segments by matching the time- and frequency-domain characteristics of the signals at their boundaries. The application of these techniques in a text-to-speech system based on concatenation of diphone sinusoidal models is also presented.
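
A minimal sketch of boundary smoothing by overlap-add (assuming two already-synthesized segments; the ABS/OLA model additionally matches time- and frequency-domain characteristics at the joint, which is not reproduced here):

import numpy as np

def concatenate_ola(seg_a, seg_b, overlap):
    # Cross-fade two separately analyzed/synthesized segments over `overlap`
    # samples using complementary raised-cosine windows.
    fade = 0.5 * (1 - np.cos(np.pi * np.arange(overlap) / overlap))  # 0 -> 1
    joint = seg_a[-overlap:] * (1 - fade) + seg_b[:overlap] * fade
    return np.concatenate([seg_a[:-overlap], joint, seg_b[overlap:]])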


Journal of the Acoustical Society of America | 2002

Effects of prosodic factors on spectral dynamics. II. Synthesis

Johan Wouters; Michael W. Macon

The effects of prosodic factors on the spectral rate of change of vowel transitions are investigated. Thirty two-syllable English words are placed in carrier phrases and read by a single speaker. Liquid-vowel, diphthong, and vowel-liquid transitions are extracted from different prosodic contexts, corresponding to different levels of stress, pitch accent, word position, and speaking style, following a balanced experimental design. The spectral rate of change in these transitions is measured by fitting linear regression lines to the first three formants and computing the root-mean-square of the slopes. Analysis shows that the spectral rate of change increases with linguistic prominence, i.e., in stressed syllables, in accented words, in sentence-medial words, and in hyperarticulated speech. The results are consistent with a contextual view of vowel reduction, where the extent of reduction depends both on the spectral rate of change and on vowel duration. A numerical model of spectral rate of change is proposed, which can be integrated in a system for concatenative speech synthesis, as discussed in Paper II [J. Wouters and M. Macon, J. Acoust. Soc. Am. 111, 428-438 (2002)].
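
The spectral rate-of-change measure described above translates almost directly into code; the small sketch below (function and variable names are assumptions) fits a straight line to each of the first three formant tracks over a transition and returns the root-mean-square of the slopes.

import numpy as np

def spectral_rate_of_change(f1, f2, f3, times):
    # f1, f2, f3: formant tracks in Hz over a transition; times: frame times in s
    slopes = [np.polyfit(times, f, 1)[0] for f in (f1, f2, f3)]
    return np.sqrt(np.mean(np.square(slopes)))   # Hz per second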


International Conference on Acoustics, Speech, and Signal Processing | 1998

Efficient analysis/synthesis of percussion musical instrument sounds using an all-pole model

Michael W. Macon; Alan V. McCree; Wai-Ming Lai; Vishu R. Viswanathan

It is well-known that an impulse-excited, all-pole filter is capable of representing many physical phenomena, including the oscillatory modes of percussion musical instruments like woodblocks, xylophones, or chimes. In contrast to the more common application of all-pole models to speech, however, practical problems arise in music synthesis due to the location of poles very close to the unit circle. The objective of this work was to develop algorithms to find excitation and filter parameters for synthesis of percussion instrument sounds using only an inexpensive all-pole filter chip (TI TSP50C1x). The paper describes analysis methods for dealing with pole locations near the unit circle, as well as a general method for modeling the transient attack characteristics of a particular sound while independently controlling the amplitudes of each oscillatory mode.
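
As a rough illustration of impulse-excited all-pole resynthesis (a generic autocorrelation-method LPC fit; the paper's methods for handling poles very near the unit circle and the TSP50C1x-specific constraints are not reproduced), consider:

import numpy as np
from scipy.signal import lfilter

def lpc_autocorr(x, order):
    # Autocorrelation-method LPC: solve the normal equations directly.
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate([[1.0], -a])        # denominator [1, -a1, ..., -ap]

def resynthesize(sample, order=40, length=None):
    # Fit an all-pole filter to a recorded percussion sound and re-excite it
    # with a single scaled impulse.
    a = lpc_autocorr(sample.astype(float), order)
    excitation = np.zeros(length or len(sample))
    excitation[0] = np.sqrt(np.mean(sample ** 2))
    return lfilter([1.0], a, excitation)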


International Conference on Acoustics, Speech, and Signal Processing | 2000

Spectral modification for concatenative speech synthesis

Johan Wouters; Michael W. Macon

Concatenative synthesis can produce high-quality speech but is limited to the allophonic variations and voice types that were captured in the database. It would be desirable to modify speech units to remove formant discontinuities and to create new speaking styles, such as hypo- or hyper-articulated speech. Unfortunately, manipulating the spectral structure often leads to degraded speech quality. We investigate two speech modification strategies, one based on inverse filtering and the other on sinusoidal modeling, and we explain their merits and shortcomings for changing the spectral envelope in speech. We then propose a method which uses sinusoidal modeling and represents the complex sinusoidal amplitudes by an all-pole model. The all-pole model approximates the sinusoidal spectrum well, both in the amplitude and in the phase domain. We use the sinusoidal+all-pole model to control the spectral envelope in recorded speech. High-quality modified speech is generated from the model using sinusoidal synthesis. A perceptual test was conducted, which shows that the model was effective at changing vowel identities and was preferred over residual-excited LPC.
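
One crude way to fit an all-pole envelope to a set of sinusoidal amplitudes (a frequency-domain LPC fit offered purely as an assumption-laden sketch, not the estimation procedure used in the paper) is to place the amplitudes on a dense symmetric power spectrum, derive an autocorrelation sequence from it, and solve the usual normal equations:

import numpy as np

def allpole_from_sinusoids(freqs_hz, amps, fs, order=18, n_fft=1024):
    bins = np.clip(np.round(freqs_hz / fs * n_fft).astype(int), 0, n_fft // 2)
    power = np.full(n_fft // 2 + 1, 1e-8)            # small floor keeps the fit stable
    power[bins] = np.square(amps)
    full = np.concatenate([power, power[-2:0:-1]])   # mirror to a symmetric spectrum
    r = np.real(np.fft.ifft(full))[:order + 1]       # autocorrelation sequence
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate([[1.0], -a])               # all-pole denominator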


IEEE Transactions on Speech and Audio Processing | 1998

Audio coding using variable-depth multistage quantization

Faouzi Kossentini; Michael W. Macon; Mark J. T. Smith

An algorithm for high-quality coding of 48 kHz sampled audio signals is presented. The algorithm employs a perceptual transform and a variable-depth multistage quantizer. The resulting audio reproduction quality is better than that of the Moving Picture Experts Group (MPEG) layer I coder and roughly equivalent to that of the MPEG layer II coder.


Data Compression Conference | 1996

Audio coding using variable-depth multistage quantizers

Faouzi Kossentini; Michael W. Macon; Mark J. T. Smith

Digital coding of high-fidelity audio signals has become a key technology in the development of cost-effective multimedia systems. Most audio coding algorithms rely on (i) removal of statistical redundancies in the signal and (ii) exploitation of masking properties of the human auditory system to hide distortions in the coded signal. Transform and subband coders provide a convenient framework for time-frequency domain signal analysis and coding based on these two principles. We present a subband coding algorithm for high-quality coding of 48 kHz sampled audio signals that is based on perceptual transformation, variable-depth multistage uniform quantization, and complexity-constrained higher-order entropy coding. The construction does not require any training and completely avoids the need to send side information.
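
As a loose sketch of the variable-depth idea (the actual codec's perceptual transform, bit allocation, and entropy coding are not reproduced; thresholds and step sizes here are placeholders), each subband vector can be quantized in successive uniform stages, with later stages refining the residual until it falls below a masking threshold:

import numpy as np

def multistage_quantize(x, step, mask_threshold, max_stages=8):
    # Quantize a subband vector in stages; depth varies with the masking threshold.
    residual = np.asarray(x, dtype=float)
    indices = []
    for _ in range(max_stages):
        if np.mean(residual ** 2) <= mask_threshold:
            break
        q = np.round(residual / step).astype(int)
        indices.append(q)
        residual = residual - q * step
        step /= 2.0                      # finer steps at deeper stages
    return indices

def multistage_dequantize(indices, step, length):
    x = np.zeros(length)
    for q in indices:
        x += q * step
        step /= 2.0
    return x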


Journal of the Acoustical Society of America | 1999

Waveform models for data‐driven speech synthesis

Michael W. Macon

Many "data-driven" models for synthesizing speech rely on concatenating waveforms extracted from a database. However, the number of perceptually important degrees of freedom in speech makes it unlikely that enough data could be collected to cover all combinations of phonetic variables. By utilizing models that can transform the waveform in perceptually relevant ways, the space of acoustic features covered by the data can be expanded. The minimal requirement for such a model is parametric control of the fundamental frequency and duration. In addition, dimensions such as voice quality characteristics (breathiness, creak, etc.), phonetic reduction, and voice identity can be altered to expand the range of effects realizable from a given database. A few classes of models have been proposed to allow varying degrees of control over these dimensions. Trade-offs between flexibility, fidelity, and computational cost exist with each. This paper will describe common threads running through the best-known app...

Collaboration


Dive into Michael W. Macon's collaboration.

Top Co-Authors

Mark A. Clements

Georgia Institute of Technology

Mark J. T. Smith

Georgia Institute of Technology

Faouzi Kossentini

University of British Columbia