Michael A. Berger
University of Edinburgh
Publications
Featured research published by Michael A. Berger.
Journal of the Acoustical Society of America | 2009
Richard S. McGowan; Michael A. Berger
A method for mapping between simultaneously measured articulatory and acoustic data is proposed. The method applies principal components analysis to the articulatory and acoustic variables and maps between the domains by locally weighted linear regression, or loess [Cleveland, W. S. (1979). J. Am. Stat. Assoc. 74, 829-836]. The latter method permits local variation in the slopes of the linear regression, assuming that the function being approximated is smooth. The methodology is applied to vowels of four speakers in the Wisconsin X-ray Microbeam Speech Production Database, with formant analysis. Results are examined in terms of (1) examples of forward (articulation-to-acoustics) mappings and inverse mappings, (2) distributions of local slopes and constants, (3) examples of correlations among slopes and constants, (4) root-mean-square error, and (5) sensitivity of formant frequencies to articulatory change. It is shown that the results are qualitatively correct and that loess performs better than global regression. The forward mappings show different root-mean-square error properties from the inverse mappings, indicating that the method is better suited to forward mappings than to inverse mappings, at least for the data chosen for the current study. Some preliminary results on the sensitivity of the first two formant frequencies to the two most important articulatory principal components are presented.
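The mapping technique named here, loess, fits a separate weighted linear regression around each query point. The following is a minimal sketch of that idea, assuming standardized principal-component scores in each domain; the span value, tricube weighting, and synthetic data are illustrative defaults, not the settings used in the paper.

```python
import numpy as np

def loess_predict(X, y, x_query, span=0.3):
    """Locally weighted linear regression (loess-style) at one query point.

    X : (n, d) predictors, e.g. articulatory principal components
    y : (n,)   response,   e.g. one acoustic principal component
    """
    n, d = X.shape
    k = max(int(np.ceil(span * n)), d + 1)        # size of the local neighborhood

    # Keep the k observations nearest to the query point.
    dist = np.linalg.norm(X - x_query, axis=1)
    idx = np.argsort(dist)[:k]

    # Tricube weights fall smoothly to zero at the neighborhood edge.
    w = (1.0 - (dist[idx] / dist[idx].max()) ** 3) ** 3

    # Weighted least squares for a local linear model with intercept;
    # beta[0] is the local constant, beta[1:] are the local slopes.
    A = np.hstack([np.ones((k, 1)), X[idx]])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y[idx] * sw, rcond=None)
    return beta[0] + beta[1:] @ x_query, beta

# Forward (articulation-to-acoustics) direction on synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                     # two articulatory components
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=500)
y_hat, local_fit = loess_predict(X, y, np.array([0.2, -0.1]))
```

Evaluating loess_predict over many query points and collecting the returned coefficients is the kind of computation behind the distributions of local slopes and constants examined in the paper.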
Journal of the Acoustical Society of America | 2012
Richard S. McGowan; Michel T‐T. Jackson; Michael A. Berger
Traditional models of mappings from midsagittal cross-distances to cross-sectional areas use only local cross-distance information. These are not the optimal models on which to base the construction of a mapping between the two domains, because phonemic identity can affect the relation between local cross-distance and cross-sectional area. However, phonemic identity is not an appropriate independent variable for the control of an articulatory synthesizer. Two alternative approaches to constructing cross-distance to area mappings that can be used for articulatory synthesis are presented. One is a vowel height-sensitive model and the other is a non-parametric model called loess. Both depend on global cross-distance information and generally perform better than the traditional models.
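The abstract does not give the models' exact form; purely as an illustration, the sketch below contrasts a local power-law mapping, in which each area depends only on the cross-distance at the same section, with a hypothetical height-sensitive variant whose scale factor is shifted by a global property of the whole cross-distance profile. All constants and the height proxy are invented for the example.

```python
import numpy as np

def area_local(d, alpha=1.5, beta=1.4):
    """Local model: area at each section depends only on the cross-distance
    at that section, A = alpha * d**beta (illustrative constants)."""
    return alpha * np.asarray(d, dtype=float) ** beta

def area_height_sensitive(d_profile, palatal_slice=slice(2, 5),
                          alpha_low=1.8, alpha_high=1.2, beta=1.4):
    """Hypothetical global variant: a height proxy (minimum cross-distance in
    a palatal region of the profile) interpolates the scale factor used at
    every section, so each area depends on the whole profile."""
    d_profile = np.asarray(d_profile, dtype=float)
    height_proxy = d_profile[palatal_slice].min()          # small value ~ high vowel
    t = np.clip(height_proxy / d_profile.max(), 0.0, 1.0)
    alpha = alpha_high + t * (alpha_low - alpha_high)
    return alpha * d_profile ** beta

# A made-up midsagittal cross-distance profile (cm), glottis to lips.
d_profile = np.array([0.4, 0.6, 1.1, 1.5, 1.2, 0.8, 0.5])
print(area_local(d_profile))
print(area_height_sensitive(d_profile))
```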
International Conference on Computer Graphics and Interactive Techniques | 2010
Michael A. Berger; Gregor Hofer; Hiroshi Shimodaira
Facial animation is difficult to do convincingly. The movements of the face are complex and subtle, and we are innately attuned to faces. It is particularly difficult and labor-intensive to accurately synchronize faces with speech. A technology-based solution to this problem is automated facial animation. There are various ways to automate facial animation, each of which drives a face from some input sequence. In performance-driven animation, the input sequence may be either facial motion capture or video of a face. In automatic lip-syncing, the input is audio (and possibly a text transcript), resulting in facial animation synchronized with that audio. In audio-visual text-to-speech synthesis (AVTTS), only text is input, and synchronous auditory and visual speech are synthesized.
International Conference on Computer Graphics and Interactive Techniques | 2010
Gregor Hofer; Korin Richmond; Michael A. Berger
Talking computer-animated characters are a common sight in video games and movies. Although animating the mouth by hand gives the best results, it is not always feasible because of cost or time constraints, so producing lip animation automatically is highly desirable. The problem can be phrased as a mapping from speech to lip animation, or in other words as an acoustic inversion. In our work we propose a solution that takes a sequence of input speech frames and maps it directly to an output sequence of animation frames. The key point is that there is no need for phonemes or visemes, which removes one step from the usual lip-synchronization process.
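The abstract does not specify the model, so the sketch below only shows the general shape of such a direct mapping: acoustic feature frames stacked with a short context window and regressed onto animation parameter frames with a ridge-regularized linear map. The feature dimensions, the window size, and the choice of regression are placeholder assumptions.

```python
import numpy as np

def stack_context(frames, context=2):
    """Stack each frame with +/- `context` neighbours so the mapping can use
    short-term acoustic dynamics (edges are padded by repetition)."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

def fit_direct_mapping(acoustic, animation, context=2, ridge=1e-3):
    """Least-squares map from stacked acoustic frames to animation frames.
    Returns W such that animation ~= [stacked features, 1] @ W."""
    F = np.hstack([stack_context(acoustic, context),
                   np.ones((len(acoustic), 1))])           # bias column
    return np.linalg.solve(F.T @ F + ridge * np.eye(F.shape[1]), F.T @ animation)

def predict_animation(acoustic, W, context=2):
    F = np.hstack([stack_context(acoustic, context),
                   np.ones((len(acoustic), 1))])
    return F @ W

# Synthetic stand-ins: 300 frames of 13-dim acoustic features mapped to
# 6 animation parameters per frame.
rng = np.random.default_rng(1)
acoustic = rng.normal(size=(300, 13))
animation = 0.5 * acoustic[:, :6] + 0.1 * rng.normal(size=(300, 6))
W = fit_direct_mapping(acoustic, animation)
predicted = predict_animation(acoustic, W)
```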
Journal of the Acoustical Society of America | 2008
Richard S. McGowan; Michael A. Berger
Vowel tokens were extracted from four talkers in the Wisconsin X‐ray Microbeam Speech Production Database. The neighboring phonemes of these vowels were restricted to be non‐nasal and non‐liquid. The first three formant frequencies were measured using LPC analysis with manual corrections at a rate corresponding to the pellet trajectory sampling rate, thus yielding large amounts of simultaneous formant frequency and pellet position data (between 11,000 and 20,000 points for each talker). Principal components analysis was performed on both the formant frequencies and the pellet position data to produce three orthogonal acoustic components and four orthogonal articulatory components. A local linear regression technique, known as loess [Cleveland, W. S. and Devlin, S. J. (1988), J. Amer. Stat. Assoc., 83, 596-610], was applied to the orthogonal components to map between the acoustic and articulatory domains. This technique permits regression slopes to vary within the domain of the independent variables. The ...
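As a minimal sketch of the dimensionality-reduction step described here, the code below standardizes each domain and extracts orthogonal components with an SVD-based principal components analysis; the array shapes are placeholders, not the database's actual pellet layout.

```python
import numpy as np

def pca_scores(data, n_components):
    """Standardize the columns, then return the first n_components
    orthogonal component scores and their loading vectors (via SVD)."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    _, _, Vt = np.linalg.svd(z, full_matrices=False)
    return z @ Vt[:n_components].T, Vt[:n_components]

# Placeholder shapes: ~15,000 simultaneous tokens, 3 formant frequencies,
# and 8 pellet coordinates standing in for the articulatory measurements.
rng = np.random.default_rng(2)
formants = rng.normal(size=(15000, 3))
pellets = rng.normal(size=(15000, 8))

acoustic_pc, _ = pca_scores(formants, 3)   # three orthogonal acoustic components
artic_pc, _ = pca_scores(pellets, 4)       # four orthogonal articulatory components
# A loess regression like the sketch given earlier on this page can then map
# artic_pc -> acoustic_pc (forward) or acoustic_pc -> artic_pc (inverse).
```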
Journal of the Acoustical Society of America | 2006
Michael A. Berger; Meghan Clayards; Neil P. Bardhan; Joyce McDonough
Nasalized vowels have acoustic characteristics that allow the degree of nasality to be quantified from the acoustic signal. These characteristics include decreased first‐formant amplitude (A1) and increased amplitudes of two extra peaks (P0 and P1). Chen (1995, 1997, 2000) used the parameters A1−P1 and A1−P0 (dB) to quantify nasality in vowels. These two measures are complementary in that different vowel types may be more accessible to one or the other. This research attempts to extend Chen’s measures in several ways: (1) normalize the two parameters so that they are comparable; (2) integrate them into a single composite measure by a weighted average; (3) measure nasality over the time domain of each vowel token, resulting in nasality contours with high temporal resolution; and (4) automate the process of measuring nasality in vowels, so that large amounts of speech data can be processed rapidly by computer. Nasality measurement in this framework, whether manual or automated, requires a solution to the pr...
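As a rough sketch only, the code below reads A1−P0 and A1−P1 style differences off a short-time spectrum and averages them; the band limits, the peak-picking by band maximum, and the equal-weight composite are illustrative assumptions, not Chen's published procedure or the normalization developed in this work.

```python
import numpy as np

def band_peak_db(spectrum_db, freqs, lo, hi):
    """Largest spectral amplitude (dB) in the band [lo, hi] Hz."""
    band = (freqs >= lo) & (freqs <= hi)
    return spectrum_db[band].max()

def nasality_frame(frame, fs, f1_hz, p0_band=(200, 450), p1_band_width=250):
    """One-frame sketch of A1-P0 and A1-P1 style measures.

    frame : windowed speech samples for one analysis frame
    f1_hz : an externally supplied first-formant estimate for this frame
    """
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    spec_db = 20 * np.log10(spec + 1e-12)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    a1 = band_peak_db(spec_db, freqs, f1_hz - 100, f1_hz + 100)   # peak near F1
    p0 = band_peak_db(spec_db, freqs, *p0_band)                   # low-frequency peak
    p1 = band_peak_db(spec_db, freqs, f1_hz + 100, f1_hz + 100 + p1_band_width)

    a1_p0, a1_p1 = a1 - p0, a1 - p1
    composite = 0.5 * a1_p0 + 0.5 * a1_p1       # equal-weight composite (placeholder)
    return a1_p0, a1_p1, composite

# Stand-in for a 64 ms frame at 16 kHz.
rng = np.random.default_rng(3)
print(nasality_frame(rng.normal(size=1024), fs=16000, f1_hz=550))
```

Applying nasality_frame to successive windowed frames of a single vowel token would produce the kind of time-resolved nasality contour described above.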
Journal of the Acoustical Society of America | 1996
Dominic W. Massaro; Christopher S. Campbell; Michael A. Berger; Michael M. Cohen
It is now well established that visible speech is an important source of information in face‐to‐face communication. Given the valuable role of audible speech synthesis in experimental, theoretical, and applied arenas, visible speech synthesis has been developed. Research has shown that the talking head closely resembles real heads in the quality of its speech and its realism (when texture mapping is used). The talking head can be heard, communicates paralinguistic as well as linguistic information, and is controlled by a text‐to‐speech system. Several sources of evidence are presented which show that visible speech perception (speechreading) is fairly robust across various forms of degradation. Speechreading remains fairly accurate even when the mouth is viewed in noncentral vision; eliminating and distorting high‐spatial frequency information does not completely disrupt speechreading; and speechreading is possible when additional visual information is simultaneously being used to recognize the speech inp...
Archive | 2003
Michael A. Berger
Archive | 2007
Michael A. Berger
IEEE Computer Graphics and Applications | 2011
Michael A. Berger; Gregor Hofer; Hiroshi Shimodaira