Publications


Featured research published by Jiahong Yuan.


Journal of the Acoustical Society of America | 2008

Speaker identification on the SCOTUS corpus

Jiahong Yuan; Mark Liberman

This paper reports the results of our experiments on speaker identification in the SCOTUS corpus, which includes oral arguments from the Supreme Court of the United States. Our main findings are as follows: (1) a combination of Gaussian mixture models and monophone HMM models attains near-100% text-independent identification accuracy on utterances that are longer than one second; (2) a sampling rate of 11025 Hz achieves the best performance (higher sampling rates are harmful), and a sampling rate as low as 2000 Hz still achieves more than 90% accuracy; (3) a distance score based on likelihoods was used to measure the variability of phones among speakers; we found that the most variable phone is UH (as in "good"), and that the velar nasal NG is more variable than the other two nasals, M and N; (4) our models achieved "perfect" forced alignment on very long speech segments (one hour). These findings and their significance are discussed.
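As a rough illustration of the GMM side of this approach (not the authors' system), the sketch below fits one Gaussian mixture model per speaker on pre-extracted MFCC frames and identifies a test utterance by the highest average log-likelihood; the data, function names, and model sizes here are invented for the example.

```python
# Minimal sketch of GMM-based text-independent speaker identification.
# Assumes MFCC feature matrices (n_frames x n_coeffs) are already extracted;
# `train_speaker_models` and `identify` are illustrative names only.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(train_feats, n_components=32):
    """Fit one diagonal-covariance GMM per speaker on pooled MFCC frames."""
    models = {}
    for speaker, feats in train_feats.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=200)
        gmm.fit(feats)
        models[speaker] = gmm
    return models

def identify(models, test_feats):
    """Return the speaker whose GMM gives the highest average log-likelihood."""
    scores = {spk: gmm.score(test_feats) for spk, gmm in models.items()}
    return max(scores, key=scores.get)

# Example with random stand-in "features" (three speakers, 13-dim frames):
rng = np.random.default_rng(0)
train_feats = {f"speaker_{i}": rng.normal(i, 1.0, size=(500, 13)) for i in range(3)}
models = train_speaker_models(train_feats)
print(identify(models, rng.normal(1, 1.0, size=(120, 13))))  # typically speaker_1
```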


International Symposium on Chinese Spoken Language Processing | 2006

Mechanisms of question intonation in Mandarin

Jiahong Yuan

This study investigates the mechanisms of question intonation in Mandarin Chinese. Three mechanisms of question intonation have been proposed: an overall higher phrase curve, higher strengths of sentence-final tones, and a tone-dependent mechanism that flattens the falling slope of a final falling tone and steepens the rising slope of a final rising tone. The phrase-curve and strength mechanisms were revealed by a computational modeling study and verified by acoustic analyses as well as perception experiments. The tone-dependent mechanism was suggested by a result from the perceptual study, namely that question intonation is easier to identify when the sentence-final tone is falling and harder to identify when it is rising, and was further revealed by acoustic analyses of the final Tone2 and Tone4.


Speech Communication | 2014

F0 declination in English and Mandarin Broadcast News Speech

Jiahong Yuan; Mark Liberman

This study investigates F0 declination in broadcast news speech in English and Mandarin Chinese. The results demonstrate a strong relationship between utterance length and declination slope: shorter utterances have steeper declination, even after excluding the initial rising and final lowering effects. Initial F0 tends to be higher when the utterance is longer, whereas the lower bound of final F0 is independent of utterance length. Both the top line and the baseline show declination. The top line and baseline have different patterns in Mandarin Chinese, whereas in English their patterns are similar. Mandarin Chinese has more and steeper declination than English, as well as a wider pitch range and more F0 fluctuations. Our results suggest that F0 declination is linguistically controlled, not just a by-product of the physics and physiology of talking.
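As a simple illustration of how a declination slope can be measured (not the paper's analysis pipeline), the sketch below fits a straight line to a synthetic F0 track with numpy, trimming the utterance edges to reduce initial-rise and final-lowering effects; the trim value and frame rate are arbitrary.

```python
# Illustrative F0 declination slope: least-squares line fit (Hz per second)
# to an F0 track, trimming the edges to reduce initial-rise and
# final-lowering effects. Synthetic data; not the paper's pipeline.
import numpy as np

def declination_slope(times, f0, trim=0.2):
    """Slope of F0 over time (Hz/s), ignoring `trim` seconds at each end."""
    keep = (times >= times[0] + trim) & (times <= times[-1] - trim)
    slope, _intercept = np.polyfit(times[keep], f0[keep], deg=1)
    return slope  # negative values indicate declination

times = np.linspace(0.0, 2.5, 250)                          # 10-ms frames, 2.5-s utterance
f0 = 220 - 12 * times + np.random.normal(0, 3, times.size)  # gently falling contour
print(f"declination slope: {declination_slope(times, f0):.1f} Hz/s")
```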


IEEE Automatic Speech Recognition and Understanding Workshop | 2005

Detection of questions in Chinese conversational speech

Jiahong Yuan; Daniel Jurafsky

What features are helpful for Chinese question detection? Which of them are more important? What are the differences between Chinese and English regarding feature importance? We study these questions by building question detectors for Chinese and English conversational speech and performing analytic studies and feature selection experiments. As in English, we find that both textual and prosodic features are helpful for Chinese question detection. Among textual features, word identities, especially the utterance-final word, are more useful than the global (N-gram) sentence likelihood. Unlike in English, where a final pitch rise is a good cue for questions, we find that in Chinese utterance-final pitch behavior is not a good feature. Instead, the most useful prosodic feature is the spectral balance of the final syllable, i.e., the distribution of its energy over the frequency spectrum. We also find effects of tone, e.g., that treating interjection words as having a special tone is helpful. Our final classifier achieves an error rate of 14.9%, against a 50% chance-level rate.
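One plausible way to compute a spectral-balance feature of this kind is shown below: the fraction of the final syllable's spectral energy above a cutoff frequency. This is an illustrative approximation only; the cutoff and the exact feature definition used in the paper may differ.

```python
# A rough spectral-balance feature for a final syllable: the fraction of
# spectral energy above a cutoff frequency. Illustrative only; the exact
# feature definition in the paper may differ.
import numpy as np

def spectral_balance(samples, sample_rate, cutoff_hz=1000.0):
    """Fraction of windowed spectral energy at or above `cutoff_hz`."""
    spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    energy = spectrum ** 2
    return energy[freqs >= cutoff_hz].sum() / energy.sum()

# Example with a synthetic 200-ms "final syllable" at 16 kHz:
sr = 16000
t = np.arange(int(0.2 * sr)) / sr
syllable = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 2500 * t)
print(f"spectral balance: {spectral_balance(syllable, sr):.3f}")
```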


International Conference on Acoustics, Speech, and Signal Processing | 2010

Robust speaking rate estimation using broad phonetic class recognition

Jiahong Yuan; Mark Liberman

Robust speaking rate estimation can be useful in automatic speech recognition and speaker identification, and accurate, automatic measures of speaking rate are also relevant for research in linguistics, psychology, and the social sciences. In this study we built a broad phonetic class recognizer for speaking rate estimation. We tested the recognizer on a variety of data sets, including laboratory speech, telephone conversations, foreign-accented speech, and speech in different languages, and we found that the recognizer's estimates are robust under these sources of variation. We also found that the acoustic models of the broad phonetic classes are more robust than those of the monophones for syllable detection.
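Given a broad-class segmentation, a simple speaking-rate estimate is the number of vowel segments (treated as syllable nuclei) per second of speech. The sketch below shows only that final counting step, with an invented (label, start, end) segment format; it is not the paper's estimator.

```python
# Speaking-rate estimate from a broad-class segmentation: count vowel
# segments (as syllable nuclei) and divide by the spanned speech duration.
# The (label, start_sec, end_sec) segment format is invented for illustration.

def speaking_rate(segments):
    """Approximate syllables per second from labeled broad-class segments."""
    vowels = [seg for seg in segments if seg[0] == "vowel"]
    duration = segments[-1][2] - segments[0][1]
    return len(vowels) / duration if duration > 0 else 0.0

segments = [("stop", 0.00, 0.08), ("vowel", 0.08, 0.20), ("nasal", 0.20, 0.28),
            ("vowel", 0.28, 0.45), ("fricative", 0.45, 0.55), ("vowel", 0.55, 0.70)]
print(f"{speaking_rate(segments):.2f} syllables/s")
```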


International Conference on Acoustics, Speech, and Signal Processing | 2014

Highly accurate phonetic segmentation using boundary correction models and system fusion

Andreas Stolcke; Neville Ryant; Vikramjit Mitra; Jiahong Yuan; Wen Wang; Mark Liberman

Accurate phone-level segmentation of speech remains an important task for many subfields of speech research. We investigate techniques for boosting the accuracy of automatic phonetic segmentation based on HMM acoustic-phonetic models. In prior work [25] we were able to improve on state-of-the-art alignment accuracy by employing special phone boundary HMM models, trained on phonetically segmented training data, in conjunction with a simple boundary-time correction model. Here we present further improved results by using more powerful statistical models for boundary correction that are conditioned on phonetic context and duration features. Furthermore, we find that combining multiple acoustic front-ends gives additional gains in accuracy, and that conditioning the combiner on phonetic context and side information helps. Overall, we reduce segmentation errors on the TIMIT corpus by almost one half, from 93.9% to 96.8% boundary accuracy with a 20-ms tolerance.
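The boundary-correction idea can be sketched as a regression that predicts, for each hypothesized boundary, an offset to the true boundary from phonetic-context and duration features. The snippet below uses scikit-learn's GradientBoostingRegressor on synthetic data; the authors' statistical models, features, and multi-front-end fusion are not reproduced here.

```python
# Sketch of a boundary-time correction model: predict the offset (in ms)
# between a hypothesized phone boundary and the true boundary from
# phonetic-context and duration features. Synthetic data; the authors'
# models, features, and system fusion are not reproduced here.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.integers(0, 40, n),      # left-phone class id (stand-in for context)
    rng.integers(0, 40, n),      # right-phone class id
    rng.uniform(0.02, 0.3, n),   # left-phone duration (s)
    rng.uniform(0.02, 0.3, n),   # right-phone duration (s)
])
y = 500.0 * (X[:, 2] - X[:, 3]) + rng.normal(0, 4, n)   # fake offsets in ms

model = GradientBoostingRegressor().fit(X[:1500], y[:1500])
residual = model.predict(X[1500:]) - y[1500:]
print(f"corrected boundaries within 20 ms: {np.mean(np.abs(residual) <= 20):.1%}")
```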


International Symposium on Chinese Spoken Language Processing | 2004

Perception of Mandarin intonation

Jiahong Yuan

This study investigates how tone and intonation, and how focus and intonation, interact in the identification of intonation type (statement versus question). A perception experiment was conducted on a speech corpus of 1040 utterances, with sixteen listeners participating. The results reveal three asymmetries: an asymmetry between statement and question intonation identification; an asymmetry between the effects of sentence-final Tone2 and Tone4 on question intonation identification; and an asymmetry in the effects of final focus on statement versus question intonation identification. These asymmetries suggest that: (1) statement intonation is a default or unmarked intonation type whereas question intonation is a marked intonation type; (2) question intonation has a higher prosodic strength at the sentence-final position; (3) there is a tone-dependent mechanism of question intonation at the sentence-final position.


Journal of the Acoustical Society of America | 2011

Perception of intonation in Mandarin Chinese

Jiahong Yuan

There is a tendency across languages to use a rising pitch contour to convey question intonation and a falling pitch contour to convey a statement. In a lexical tone language such as Mandarin Chinese, rising and falling pitch contours are also used to differentiate lexical meaning. How, then, does the multiplexing of the F0 channel affect the perception of question and statement intonation in a lexical tone language? This study investigated the effects of lexical tones and focus on the perception of intonation in Mandarin Chinese. The results show that lexical tones and focus impact the perception of sentence intonation. Question intonation was easier for native speakers to identify on a sentence with a final falling tone and more difficult to identify on a sentence with a final rising tone, suggesting that tone identification intervenes in the mapping of F0 contours to intonational categories and that tone and intonation interact at the phonological level. In contrast, there is no evidence that the interaction between focus and intonation goes beyond the psychoacoustic level. The results provide insights that will be useful for further research on tone and intonation interactions in both acoustic modeling studies and neurobiological studies.


International Conference on Acoustics, Speech, and Signal Processing | 2014

Automatic phonetic segmentation in Mandarin Chinese: Boundary models, glottal features and tone

Jiahong Yuan; Neville Ryant; Mark Liberman

We conducted experiments on forced alignment in Mandarin Chinese. A corpus of 7,849 utterances was created for the purpose of the study. Systems differing in their use of explicit phone boundary models, glottal features, and tone information were trained and evaluated on the corpus. Results showed that employing special one-state phone boundary HMMs significantly improved forced alignment accuracy, even when no manual phonetic segmentation was available for training. Spectral features extracted from glottal waveforms (obtained by glottal inverse filtering of the speech waveforms) also improved forced alignment accuracy. Tone-dependent models only slightly outperformed tone-independent models. The best system achieved 93.1% agreement of phone boundaries within 20 ms of the manual segmentation, without boundary correction.
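The evaluation metric reported here, the percentage of phone boundaries falling within 20 ms of the manual segmentation, is straightforward to compute once automatic and manual boundaries are paired; a minimal version with invented boundary times is sketched below.

```python
# Percentage of automatic phone boundaries within a 20-ms tolerance of the
# corresponding manual boundaries, assuming the two lists are already
# paired one-to-one. Boundary times (in seconds) are invented examples.

def boundary_agreement(auto, manual, tol=0.020):
    hits = sum(abs(a - m) <= tol for a, m in zip(auto, manual))
    return hits / len(manual)

auto_bounds   = [0.112, 0.250, 0.391, 0.552, 0.707]
manual_bounds = [0.100, 0.255, 0.380, 0.590, 0.710]
print(f"{boundary_agreement(auto_bounds, manual_bounds):.0%} within 20 ms")
```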


International Conference on Acoustics, Speech, and Signal Processing | 2014

Mandarin tone classification without pitch tracking

Neville Ryant; Jiahong Yuan; Mark Liberman

A deep neural network (DNN) based classifier achieved a 27.38% frame error rate (FER) and a 15.62% segment error rate (SER) in recognizing five tonal categories in Mandarin Chinese broadcast news, based on 40 mel-frequency cepstral coefficients (MFCCs). The same architecture performed substantially worse when trained and tested on F0 and amplitude parameters alone: 40.05% FER and 22.66% SER. These results are substantially better than the best previously reported results on broadcast-news tone classification [1], and are also better than what a human listener achieved in categorizing test stimuli created by amplitude- and frequency-modulating complex tones to match the extracted F0 and amplitude parameters.
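As a toy illustration of frame-level tone classification (far smaller than the paper's DNN and trained on random stand-in data rather than broadcast news), the sketch below fits a small feed-forward network on per-frame feature vectors with five tone classes and reports a frame error rate; on random data it stays near the 80% chance level for five classes.

```python
# Toy frame-level tone classifier: a small feed-forward network over
# per-frame feature vectors with five tone classes, reporting frame error
# rate (FER). Random stand-in data, so the FER stays near the 80% chance
# level; the paper's DNN and broadcast-news training data are not used here.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_frames, n_feats, n_tones = 5000, 40, 5
X = rng.normal(size=(n_frames, n_feats))        # stand-in for 40 MFCCs per frame
y = rng.integers(0, n_tones, n_frames)          # tone labels 0-4 per frame

clf = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=50)
clf.fit(X[:4000], y[:4000])
fer = 1.0 - clf.score(X[4000:], y[4000:])
print(f"frame error rate: {fer:.1%}")
```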

Collaboration


Jiahong Yuan's top co-authors and their affiliations:

Mark Liberman, University of Pennsylvania
Neville Ryant, University of Pennsylvania
Xiaoying Xu, Beijing Normal University
Christopher Cieri, University of Pennsylvania
Wei Lai, University of Pennsylvania
Hilary Prichard, University of Pennsylvania