Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where David Talkin is active.

Publication


Featured research published by David Talkin.


Archive | 1997

The Aligner: Text-to-Speech Alignment Using Markov Models

Colin W. Wightman; David Talkin

Development of high-quality synthesizers is typically dependent on having a large corpus of speech that has an accurate, time-aligned, phonetic transcription. Producing such transcriptions has been difficult, slow, and expensive. Here we describe the operation and performance of a new software tool that automates much of the transcription process and requires far less training and expertise to be used successfully.
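The alignment step described above can be sketched as a Viterbi search through a left-to-right phone sequence: each frame is assigned to a phone so that the path through the phones is monotone and the summed frame log-likelihoods are maximal. This is an illustrative sketch only; the per-frame scores and the two-phone example below are hypothetical, and the paper's actual Markov-model topology and training are not reproduced.

```python
# Sketch of forced alignment via Viterbi decoding over a strictly
# left-to-right phone sequence (illustrative, not The Aligner itself).

def force_align(log_likes, n_phones):
    """log_likes[t][p]: log-likelihood of frame t under phone p.
    Returns, per frame, the index of the aligned phone, constrained
    to a monotone path from the first phone to the last."""
    T = len(log_likes)
    NEG = float("-inf")
    # best[t][p]: score of the best path ending in phone p at frame t
    best = [[NEG] * n_phones for _ in range(T)]
    back = [[0] * n_phones for _ in range(T)]
    best[0][0] = log_likes[0][0]              # must start in phone 0
    for t in range(1, T):
        for p in range(n_phones):
            stay = best[t - 1][p]             # remain in the same phone
            enter = best[t - 1][p - 1] if p > 0 else NEG
            if stay >= enter:
                best[t][p] = stay + log_likes[t][p]
                back[t][p] = p
            else:
                best[t][p] = enter + log_likes[t][p]
                back[t][p] = p - 1
    # the path must end in the last phone; trace it back
    path = [n_phones - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

The phone boundaries (and hence the time-aligned transcription) fall wherever the returned index changes from one frame to the next.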


Journal of the Acoustical Society of America | 1989

Voicing epoch determination with dynamic programming

David Talkin

During voiced speech, the point of maximum flow change in each glottal cycle corresponds to the point of maximum excitation of the vocal tract. Accurate, reliable detection of this “epoch” beginning (or end) is useful for pitch synchronous analysis/synthesis in a variety of contexts. Dynamic programming has been applied to correlation function peak selection [Secrest and Doddington, ICASSP‐83] and lagged waveform matching [Ney, IEEE Trans. SMC‐12 (1982)] for F0 determination with excellent results, but these techniques do not yield the epoch locations. The method described in this paper applies dynamic programming to select waveform maxima directly from a short‐time‐energy‐normalized LPC residual. The cumulative path costs are normalized by the path length. Local costs are based on peak amplitude and quality, transition costs on period and pulse similarity. The output is the set of pulse locations that globally satisfy the cost constraints over all voiced regions. Data to be presented indicate that these ...
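The dynamic-programming pulse selection described above can be illustrated with a much-simplified cost structure: one candidate peak per glottal cycle is chosen so that a local cost (negated peak amplitude) plus a transition cost (deviation of the implied period from a nominal period) is globally minimal. The costs here are stand-ins for the paper's amplitude/quality and period/pulse-similarity terms, and the candidate data are hypothetical.

```python
# Simplified sketch of DP epoch selection: pick one candidate pulse
# per cycle minimizing -amplitude + w*|period - t0| along the path.

def select_epochs(cycles, t0, w=1.0):
    """cycles: per-cycle candidate peaks as (time, amplitude) pairs;
    t0: nominal pitch period. Returns the chosen pulse times."""
    cost = [[-a for (_, a) in cycles[0]]]     # local cost of first cycle
    back = [[0] * len(cycles[0])]
    for i in range(1, len(cycles)):
        row, brow = [], []
        for (t, a) in cycles[i]:
            best, arg = float("inf"), 0
            for j, (tp, _) in enumerate(cycles[i - 1]):
                # transition cost: period irregularity w.r.t. t0
                c = cost[i - 1][j] + w * abs((t - tp) - t0)
                if c < best:
                    best, arg = c, j
            row.append(best - a)              # add local (amplitude) cost
            brow.append(arg)
        cost.append(row)
        back.append(brow)
    # trace back from the cheapest final candidate
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    picks = [j]
    for i in range(len(cycles) - 1, 0, -1):
        picks.append(back[i][picks[-1]])
    picks.reverse()
    return [cycles[i][j][0] for i, j in enumerate(picks)]
```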


Journal of the Acoustical Society of America | 1982

Acoustic cues to final stop voicing for impaired‐ and normal‐hearing listeners

S. Revoile; J. M. Pickett; Lisa D. Holden; David Talkin

Voicing perception for final stops was studied for impaired- and for normal-hearing listeners when cues in naturally spoken syllables were progressively neutralized. The syllables were ten different utterances of /daep, daek, daet, daeb, daeg, daed/ spoken in random order by a male. The cue modifications consisted progressively of neutralized vowel duration, equalized occlusion duration, burst deletion, murmur deletion, vowel-transition interchange, and transition deletion. The impaired subjects had moderate-to-severe losses and showed at least 70% correct voicing for the unmodified syllables. For the voiced stops, vowel-duration adjustment and murmur deletion each resulted in significant reductions in voicing perception for more than one-third of the impaired listeners; all normals showed good performance following neutralization of these cues. For the voiceless stops, large percentages of both listener groups showed decreased voicing perception due to the burst deletion, though a majority of both groups performed well above chance even after the vowel-duration adjustment and the burst deletion. When the vowel off-going transitions were exchanged between cognate syllables in given pairs, the effect on voicing perception exhibited by many impaired- and all normal-hearing listeners implicated the vowel transitions as an important additional source of cues to final-stop voicing perception.


International Conference on Acoustics, Speech, and Signal Processing | 1998

Speaker transformation using sentence HMM based alignments and detailed prosody modification

Levent M. Arslan; David Talkin

This paper presents several improvements to our voice conversion system which we refer to as speaker transformation algorithm using segmental codebooks (STASC). First, a new concept, sentence HMM, is introduced for the alignment of speech waveforms sharing the same text. This alignment technique allows reliable and high resolution mapping between two speech waveforms. In addition, it is observed that energy and speaking rate differences between two speakers are not constant across all phonemes. Therefore a codebook based duration and energy scaling algorithm is proposed. Finally, a more detailed pitch modification is introduced that takes into account pitch range differences between source and target speakers in addition to mean pitch level differences. The proposed changes made a significant impact on the quality of transformed speech. Subjective listening tests showed that intelligibility is maintained at the same level as natural speech after the speaker transformation.
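The mean- and range-aware pitch modification mentioned above can be illustrated with a standard mean/variance mapping of the F0 contour: the source contour is centered on the source speaker's mean, rescaled by the ratio of the speakers' pitch spreads, and re-centered on the target mean. This is a common-practice sketch under assumed statistics, not the paper's detailed algorithm.

```python
# Sketch of pitch conversion matching both mean level and pitch range
# (log-domain F0 is common in practice; linear values used here for
# simplicity, and all statistics below are hypothetical).

def convert_pitch(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Map a source F0 contour so its mean and spread match the
    target speaker's statistics."""
    scale = tgt_std / src_std
    return [tgt_mean + (f - src_mean) * scale for f in f0_src]
```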


Speech Communication | 1999

Codebook based face point trajectory synthesis algorithm using speech input

Levent M. Arslan; David Talkin

Abstract This paper presents a novel algorithm which generates three-dimensional face point trajectories for a given speech file with or without its text. The proposed algorithm first employs an off-line training phase. In this phase, recorded face point trajectories along with their speech data and phonetic labels are used to generate phonetic codebooks. These codebooks consist of both acoustic and visual features. Acoustics are represented by line spectral frequencies (LSF), and face points are represented with their principal components (PC). During the synthesis stage, speech input is rated in terms of its similarity to the codebook entries. Based on the similarity, each codebook entry is assigned a weighting coefficient. If the phonetic information about the test speech is available, this is utilized in restricting the codebook search to only several codebook entries which are visually closest to the current phoneme (a visual phoneme similarity matrix is generated for this purpose). Then these weights are used to synthesize the principal components of the face point trajectory. The performance of the algorithm is tested on held-out data, and the synthesized face point trajectories showed a correlation of 0.73 with true face point trajectories.
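The codebook weighting step described above can be sketched as follows: the input frame's acoustic features are compared to each codebook entry's acoustic features, each entry receives a similarity weight, and the visual features (principal components) are synthesized as the weighted sum of the entries' visual vectors. The Euclidean distance and exponential weight mapping below are assumptions for illustration, not the paper's exact formulation.

```python
import math

# Sketch of codebook-weighted visual synthesis: acoustic distance ->
# similarity weights -> weighted sum of visual (PC) codebook vectors.

def synthesize_visual(frame_lsf, codebook_lsf, codebook_pc, gamma=1.0):
    # distance from the input frame to each codebook acoustic entry
    dists = [math.sqrt(sum((x - y) ** 2 for x, y in zip(frame_lsf, e)))
             for e in codebook_lsf]
    # softmax-style weights: closer entries get larger weights
    raw = [math.exp(-gamma * d) for d in dists]
    total = sum(raw)
    weights = [r / total for r in raw]
    # weighted sum of the principal-component vectors
    n = len(codebook_pc[0])
    return [sum(w * pc[k] for w, pc in zip(weights, codebook_pc))
            for k in range(n)]
```

Restricting the search to the visually closest entries for a known phoneme, as the paper describes, would simply shrink `codebook_lsf`/`codebook_pc` to that subset before weighting.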


Journal of the Acoustical Society of America | 1994

Computational aids for the study of prosody

Colin W. Wightman; David Talkin

In the past few years, corpus‐based methods of inquiry have yielded significant insights into the structure and role of prosody in human speech. Efforts are currently underway to discover the relationship between prosody and syntax and discourse, and to develop automated speech processing systems (both for synthesis and recognition) that take advantage of the information contained in prosody. These efforts, however, are critically dependent upon the availability of large speech corpora in which the relevant prosodic phenomena have been consistently transcribed. If the development of such a corpus is to be cost effective or, indeed, if prosodic cues are to be detected in automated systems, computational tools that facilitate and, where possible, automate the transcription process must be made available. In this paper, some of the tools currently available will be presented and their performance and utility reviewed. In particular, a new tool for generating accurate, time‐aligned phonetic transcriptions of spo...


Journal of the Acoustical Society of America | 1978

Estimation of effective mass and stiffness of the vocal folds from distributed models

Ingo R. Titze; David Talkin

The effective mass and stiffness of the vocal folds are the primary factors in the control of fundamental frequency. Since the soft tissue medium of the folds constitutes a distributed system, the equivalent lumped constants depend heavily upon the viscoelastic properties and the boundary conditions. We show an evolution of continuum models from a semi‐infinite medium to the more realistic case where three boundaries are fixed and three boundaries are exposed to airflow. In order to compare lumped‐constant parameters to distributed properties, a set of known surface stresses is applied at the exposed surfaces of a continuum model, mode impedances are defined, and the equivalent lumped constants are determined by resonance and asymptotic conditions of impedance curves. We show the dependence of the equivalent mass and stiffness upon the vocal fold length, depth, and thickness, as well as tissue properties such as orthotropy and incompressibility.
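The idea of fixing lumped constants from impedance curves can be illustrated with the simplest one-mass, one-spring oscillator (a sketch, not the paper's distributed model): the resonance condition pins down the ratio of stiffness to mass, while the high-frequency asymptote of the impedance is governed by the mass alone, so the two together determine both constants.

```latex
% One-mass lumped approximation (illustrative only)
Z(\omega) = r + j\!\left(\omega m - \frac{k}{\omega}\right),
\qquad
\operatorname{Im} Z(\omega_0) = 0 \;\Rightarrow\; \omega_0 = \sqrt{\tfrac{k}{m}},
\qquad
Z(\omega) \to j\omega m \quad (\omega \to \infty).
```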


Journal of the Acoustical Society of America | 1982

Emergence of vocants in infant utterances

Rachel E. Stark; David Talkin; John M. Heinz; Jennifer L. Bond

This preliminary study was designed to determine if the point vowels /ɑ/, /i/ and /u/ emerge in a particular order and within specific age ranges in the productions of infants. Vowel‐like sounds (vocants) were selected randomly within certain general criteria from the cooing, expansion, and babbling periods of vocal development in two normal female infants. The formant frequencies of these vocants were estimated, at several points, by inspecting the wideband spectrogram, a 32‐ms spectral section, an inverse LPC spectrum, and the frequencies and bandwidths obtained by finding the roots of the denominator polynomial of the LPC model. The order of the LPC model was chosen adaptively to provide frequency resolution consistent with the fundamental frequency of voicing for each frame in order to minimize the interaction between harmonic and resonance locations. The results indicated (1) only modest changes in formant structure from the cooing to the expansion period and (2) a frequent lowering of the first form...
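The root-finding step described above (formant frequencies and bandwidths from the roots of the LPC denominator) can be sketched for the order-2 case, where the quadratic is solvable in closed form. Each complex root z = |z| e^{jθ} maps to a formant frequency fs·θ/(2π) and bandwidth −fs·ln|z|/π. The coefficients and sample rate below are hypothetical; a real analysis would factor a higher-order polynomial numerically.

```python
import cmath
import math

# Sketch: map the roots of an order-2 LPC denominator to a formant
# frequency and bandwidth (closed-form quadratic case).

def formant_from_lpc2(a1, a2, fs):
    """Roots of z^2 + a1*z + a2 (A(z) = 1 + a1 z^-1 + a2 z^-2 times z^2),
    mapped to (frequency_Hz, bandwidth_Hz) for the upper-half-plane root."""
    disc = cmath.sqrt(a1 * a1 - 4 * a2)
    root = (-a1 + disc) / 2
    if root.imag < 0:                      # take the root with positive angle
        root = root.conjugate()
    freq = fs * cmath.phase(root) / (2 * math.pi)
    bw = -fs * math.log(abs(root)) / math.pi
    return freq, bw
```

A root close to the unit circle (|z| near 1) gives a narrow bandwidth, which is why resonance peaks correspond to near-unit-circle roots.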


Journal of the Acoustical Society of America | 1982

Developmental changes in perceived stimulus structure in consonant‐vowel and vowel stimuli

Lynne E. Bernstein; David Talkin; Rosemary Condino; Rachel E. Stark

It was hypothesized that developmental changes take place in the ability to extract acoustic‐phonetic structure from the speech signal. In order to demonstrate developmental change, Garner's two‐choice, speeded‐classification paradigm was adapted for use with children [Am. Psych. 25, 350–358 (1970)]. In this paradigm, two stimulus dimensions are presented under control, correlated, and orthogonal experimental conditions. Patterns of results across conditions are interpreted as indicative of certain perceptual structures, viz., integral versus separable processing. Stimulus dimensions being tested with adults and eight‐year‐olds are (1) consonant identity (/bɑ/ versus /dɑ/) and pitch (125 vs 165 Hz F0) [cf. Wood, Percept. Psychophys. 15, 501–508 (1974)]; and (2) vowel identity (/ɑ/ versus /i/) and pitch (125 vs 165 Hz F0). It was hypothesized that adults process these two stimulus sets in terms of, respectively, integral and separable structures, but that children process both sets in terms of similarity str...


Journal of the Acoustical Society of America | 1981

Burst, murmur, vowel duration, and transition cues in the identification of final stop voicing by hearing‐impaired and normal‐hearing listeners

S. Revoile; J. M. Pickett; Lisa D. Holden; David Talkin

The effects on voicing identification of progressive neutralization of the voicing cues were examined further (99th ASA meeting, Paper GG6). Cue modifications were digital deletions or iterations performed on ten voicing‐cognate pairs of token syllables, spoken in a randomized list of 10× the set: dap, dak, dat, dab, dag, dad. The cue modifications consisted progressively of neutralized vowel durations, equalized closure duration, burst deletion, murmur deletion, and transition deletion. For voiced‐consonant syllables, about half of the 25 hearing‐impaired listeners were sensitive to vowel duration and the presence of transitions; some of these listeners and others were sensitive to the presence of the burst and/or murmur. For syllables with unvoiced consonants, vowel duration and the presence of the release burst affected identification for about half of the hearing impaired. Among the remaining impaired listeners, sensitivity varied unsystematically as a function of burst presence and/or transition deletion. Normal listeners appeared generally to make use of the transition cues more than did the hearing impaired. [Work supported in part by the U. S. Public Health Service.]

Collaboration


Dive into David Talkin's collaborations.

Top Co-Authors

S. Revoile
University of Washington

John M. Heinz
Kennedy Krieger Institute

Lynne E. Bernstein
George Washington University

Sally G. Revoile
United States Department of Veterans Affairs