Publication


Featured research published by Yong Qin.


International Conference on Acoustics, Speech, and Signal Processing | 2008

Voice conversion by combining frequency warping with unit selection

Zhiwei Shuang; Fanping Meng; Yong Qin

In this paper, we propose a novel voice conversion method that combines frequency warping and unit selection to improve similarity to the target speaker. We use frequency warping to obtain the warped source spectrum, which serves as an estimated target for the subsequent unit selection over the target speaker's spectra. Such an estimated target can preserve the natural transitions of human speech. Part of the warped source spectrum is then replaced by the selected real spectrum of the target speaker before the converted speech is reconstructed, reducing the difference in spectral detail. TC-STAR 2007 voice conversion evaluation results show that the proposed method achieves about a 20% improvement in similarity score compared to frequency warping alone.
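The two-stage idea in this abstract — warp the source spectrum toward the target, then swap in real target-speaker frames — can be sketched as follows. This is a minimal illustration with invented helper names (`warp_spectrum`, `select_target_frame`) and a simple Euclidean selection cost, not the paper's actual implementation:

```python
import numpy as np

def warp_spectrum(source_spec, warp_points):
    """Piecewise-linear frequency warping of one spectral frame.

    source_spec : magnitude spectrum (length-N array)
    warp_points : list of (source_bin, target_bin) anchor pairs -- an
                  illustrative stand-in for a learned warping function.
    """
    n = len(source_spec)
    src, tgt = zip(*warp_points)
    # Map each target bin back to a (fractional) source bin, then
    # interpolate the source spectrum at those positions.
    target_bins = np.arange(n)
    source_bins = np.interp(target_bins, tgt, src)
    return np.interp(source_bins, np.arange(n), source_spec)

def select_target_frame(warped_frame, target_frames):
    """Unit selection step: pick the target-speaker frame closest to the
    warped source frame (Euclidean distance as a toy selection cost)."""
    dists = np.linalg.norm(target_frames - warped_frame, axis=1)
    return target_frames[np.argmin(dists)]
```

In this sketch the warped frame plays the role the abstract describes: an estimate of the target used only to guide selection, after which the selected real target frame replaces it.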


Human Factors in Computing Systems | 2010

Effects of automated transcription quality on non-native speakers' comprehension in real-time computer-mediated communication

Yingxin Pan; Danning Jiang; Lin Yao; Michael Picheny; Yong Qin

Real-time transcription has been shown to be valuable in facilitating non-native speakers' comprehension in real-time communication. Automatic speech recognition (ASR) technology is a critical ingredient for its practical deployment. This paper presents a series of studies investigating how the quality of transcripts generated by an ASR system impacts user comprehension and subjective evaluation. Experiments are first presented comparing performance across three transcription conditions: no transcript, a perfect transcript, and a transcript with a word error rate (WER) of 20%. We found that 20% WER was the most likely critical point at which transcripts become just acceptable and useful. We then examined a lower WER of 10% (a lower bound for today's state-of-the-art systems) using the same experimental design. The results indicated that at 10% WER, comprehension performance was significantly improved compared to the no-transcript condition. Finally, implications for further system development and design are discussed.
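Word error rate, the quality measure used throughout these studies, is the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, divided by the reference length. A small self-contained sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a five-word reference gives the 20% WER condition studied in the paper: `word_error_rate("a b c d e", "a b x d e")` is 0.2.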


Human Factors in Computing Systems | 2009

Effects of real-time transcription on non-native speaker's comprehension in computer-mediated communications

Yingxin Pan; Danning Jiang; Michael Picheny; Yong Qin

We performed an empirical study to understand the relative contributions of real-time transcription to a non-native speaker's comprehension in audio/video meetings. Forty-eight participants were assigned to 2 presentation modes (audio, audio+video) and 3 transcription modes (no transcript, real-time transcripts in streaming mode, transcripts with all past records) in a 3x2 factorial experimental design. The results suggest that comprehension can be significantly improved in both the audio and the audio+video conditions when real-time transcription is provided. Participants also reported positive subjective responses to the presence of real-time transcription in terms of usefulness, preference, and willingness to use such a feature if provided, and reported no cognitive-load issues in synthesizing across modalities. Implications for system development and design, as well as future work using automatic speech recognition to provide the transcripts, are discussed.


International Conference on Acoustics, Speech, and Signal Processing | 2008

Recent advances in the IBM GALE Mandarin transcription system

Stephen M. Chu; Hong-Kwang Kuo; Lidia Mangu; Yi Liu; Yong Qin; Qin Shi; Shi Lei Zhang; Hagai Aronowitz

This paper describes the system and algorithmic developments in the automatic transcription of Mandarin broadcast speech made at IBM in the second year of the DARPA GALE program. Technical advances over our previous system include improved acoustic models using embedded tone modeling, and a new topic-adaptive language model (LM) rescoring technique based on dynamically generated LMs. We present results on three community-defined test sets designed to cover both the broadcast news and the broadcast conversation domains. It is shown that our new baseline system attains a 15.4% relative reduction in character error rate compared with our previous GALE evaluation system, and that the two described techniques yield a further 13.6% improvement over this baseline.
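For readers unfamiliar with the convention, a relative (as opposed to absolute) error-rate reduction is computed as (old − new) / old. The starting CER in the example below is an invented illustration, not a number from the paper:

```python
def relative_reduction(old_cer, new_cer):
    """Relative error-rate reduction, the metric quoted in the GALE
    results: fraction of the old error rate that was eliminated."""
    return (old_cer - new_cer) / old_cer

# A hypothetical system going from 20.0% to 16.92% CER shows a
# 15.4% relative (but only ~3.1% absolute) reduction.
```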


International Conference on Acoustics, Speech, and Signal Processing | 2010

The 2009 IBM GALE Mandarin broadcast transcription system

Stephen M. Chu; Daniel Povey; Hong-Kwang Kuo; Lidia Mangu; Shilei Zhang; Qin Shi; Yong Qin

This paper gives an up-to-date description of the IBM Mandarin broadcast transcription system developed under the DARPA GALE program. Technical advances over our previous system include a novel acoustic modeling approach using subspace Gaussian mixture models, a speaking rate adaptation method using frame rate normalization, and an effective recipe for lattice combination. We present results on three consortium-defined test sets. It is shown that with these advances, the new system attains a 9% relative reduction in character error rate compared to our previous GALE evaluation system. The reported 9.1% error rate on the phase three evaluation set represents the state of the art in Mandarin broadcast speech transcription.


International Conference on Acoustics, Speech, and Signal Processing | 2012

Model dimensionality selection in bilinear transformation for feature space MLLR rapid speaker adaptation

Shilei Zhang; Yong Qin

Feature-space Maximum Likelihood Linear Regression (FMLLR) speaker adaptation based on bilinear models has shown good performance, especially when the amount of adaptation data is limited. However, model dimensionality selection is critical to the performance of bilinear models, and more work is needed to find an optimal selection method. In this paper, we present an empirical study of this issue and suggest using a piecewise log-linear function to describe the relationship between the empirically optimal dimensionality parameter and the amount of available data. This relationship can be used to efficiently select the bilinear-model dimensionality in FMLLR speaker adaptation for each test speaker, whatever the amount of data, improving recognition performance on the English voice control dataset.
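The suggested rule — dimensionality as a piecewise log-linear function of the amount of adaptation data — might be sketched as follows. The breakpoints and dimensionalities here are invented placeholders, not values from the paper:

```python
import numpy as np

def select_dimensionality(n_frames, breakpoints, dims):
    """Pick a bilinear-model dimensionality from the amount of
    adaptation data via a piecewise log-linear rule.

    breakpoints : increasing adaptation-data sizes (in frames) that
                  delimit the pieces (illustrative values only)
    dims        : empirically optimal dimensionality at each breakpoint
    Interpolates linearly in log(n_frames) between breakpoints.
    """
    x = np.log(np.asarray(breakpoints, dtype=float))
    y = np.asarray(dims, dtype=float)
    d = np.interp(np.log(n_frames), x, y)
    return int(round(d))
```

Usage under these assumed breakpoints: with 100, 1000, and 10000 frames available, `select_dimensionality(n, [100, 1000, 10000], [5, 20, 40])` returns 5, 20, and 40 respectively, with log-linear interpolation in between.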


International Conference on Acoustics, Speech, and Signal Processing | 2007

The IBM Mandarin Broadcast Speech Transcription System

Stephen M. Chu; Hong-Kwang Jeff Kuo; Yi Y. Liu; Yong Qin; Qin Shi; Geoffrey Zweig

This paper describes the technical and system building advances in the automatic transcription of Mandarin broadcast speech made at IBM in the first year of the DARPA GALE program. In particular, we discuss the application of minimum phone error (MPE) discriminative training and a new topic-adaptive language modeling technique. We present results on both the RT04 evaluation data and two larger community-defined test sets designed to cover both the broadcast news and the broadcast conversation domain. It is shown that with the described advances, the new transcription system achieves a 26.3% relative reduction in character error rate over our previous best-performing system, and is competitive with published numbers on these datasets.


Meeting of the Association for Computational Linguistics | 2014

Encoding Relation Requirements for Relation Extraction via Joint Inference

Liwei Chen; Yansong Feng; Songfang Huang; Yong Qin; Dongyan Zhao

Most existing relation extraction models make predictions for each entity pair locally and individually, ignoring implicit global clues available in the knowledge base and sometimes producing conflicts among local predictions for different entity pairs. In this paper, we propose a joint inference framework that utilizes these global clues to resolve disagreements among local predictions. We exploit two kinds of clues to generate constraints that capture the implicit type and cardinality requirements of a relation. Experimental results on three datasets, in both English and Chinese, show that our framework outperforms state-of-the-art relation extraction models when such clues are applicable to the datasets. Moreover, we find that clues learnt automatically from existing knowledge bases perform comparably to those refined by humans.
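One of the two clue types, the implicit argument-type requirements of a relation, can be illustrated with a simple local filter. The paper performs global joint inference rather than per-pair filtering, so this is only a hypothetical stand-in showing what a type constraint rules out:

```python
def apply_type_constraints(local_predictions, relation_arg_types, entity_types):
    """Drop locally predicted relations whose argument entities do not
    match the entity types the relation implicitly requires.

    local_predictions  : {(subject, object): relation}
    relation_arg_types : {relation: (subject_type, object_type)}
    entity_types       : {entity: type}
    """
    kept = {}
    for (e1, e2), rel in local_predictions.items():
        subj_t, obj_t = relation_arg_types[rel]
        if entity_types[e1] == subj_t and entity_types[e2] == obj_t:
            kept[(e1, e2)] = rel
    return kept
```

For instance, a local model that labels both (Obama, USA) and (Obama, Michelle) as `president_of` would have the second prediction rejected, since `president_of` requires a COUNTRY as its object.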


International Conference on Acoustics, Speech, and Signal Processing | 2011

Generating compound words with high order n-gram information in large vocabulary speech recognition systems

Jie Zhou; Qin Shi; Yong Qin

In this work we concentrate on generating compound words with high-order n-gram information for speech recognition. Most existing compound-word generation methods consider only bi-gram information; they succeed in improving the performance of bi-gram models but do not work well for higher-order n-grams. Since 3-gram and 4-gram language models are now commonly used, we present a high-order n-gram based computation, called the gradient criterion, that generates compound words automatically in an exact way. We tested this method on the Mandarin Open Voice Search (OVS) task and obtained a 0.62% absolute improvement over the 16.44% baseline, a result that also outperforms traditional mutual-information-based methods. We further tested the history effect and the prediction effect of this criterion and found that the history effect plays the more important role in the decoding task.
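The traditional mutual-information baseline that the abstract compares against scores adjacent word pairs by pointwise mutual information (PMI) and merges high-scoring pairs into compound words. A minimal sketch (the threshold is an invented placeholder; the paper's own gradient criterion is not shown here):

```python
import math
from collections import Counter

def compound_candidates(corpus_sentences, threshold=1.0):
    """Rank adjacent word pairs by pointwise mutual information,
    PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) );
    pairs scoring above the threshold are proposed as compound words."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for sent in corpus_sentences:
        words = sent.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        total += len(words)
    scored = {}
    for (w1, w2), c in bigrams.items():
        pmi = math.log((c * total) / (unigrams[w1] * unigrams[w2]))
        if pmi >= threshold:
            scored[(w1, w2)] = pmi
    return scored
```

A pair like ("new", "york") that almost always co-occurs scores high and would be merged into a single lexicon entry; the bi-gram-only nature of this criterion is exactly the limitation the paper's high-order method addresses.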


International Conference on Multimedia and Expo | 2009

Effectiveness of n-gram fast match for query-by-humming systems

Jue Hou; Danning Jiang; Wenxiao Cao; Yong Qin; Thomas Fang Zheng; Yi Liu

Achieving a good balance between matching accuracy and computational efficiency is a key challenge for query-by-humming (QBH) systems. In this paper, we propose an n-gram-based fast-match approach. Our n-gram method uses robust statistical note transcription as well as an error-compensation method based on an analysis of frequent transcription errors. The effectiveness of our approach has been evaluated on a relatively large melody database of 5223 melodies. The experimental results show that when the search space was reduced to only 10% of its full size, 90% of the target melodies were preserved among the candidates and 88% of the system's match accuracy was retained, with no significant additional computation.
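A generic n-gram fast match of the kind described — index melodies by short pitch-interval n-grams, then keep for full matching only the melodies that share an n-gram with the hummed query — might look like this. The interval representation and the set-union candidate rule are illustrative assumptions, not the paper's exact method:

```python
from collections import defaultdict

def build_ngram_index(melodies, n=3):
    """Index each melody's pitch-interval sequence by its n-grams.
    melodies: {melody_id: [MIDI note numbers]}."""
    index = defaultdict(set)
    for mid, notes in melodies.items():
        intervals = tuple(b - a for a, b in zip(notes, notes[1:]))
        for i in range(len(intervals) - n + 1):
            index[intervals[i:i + n]].add(mid)
    return index

def fast_match(query_notes, index, n=3):
    """Return candidate melody ids sharing at least one interval n-gram
    with the query; the expensive full matcher then rescores only these."""
    intervals = tuple(b - a for a, b in zip(query_notes, query_notes[1:]))
    candidates = set()
    for i in range(len(intervals) - n + 1):
        candidates |= index.get(intervals[i:i + n], set())
    return candidates
```

Using intervals rather than absolute pitches makes the index invariant to the key the user hums in, which is the usual reason QBH front ends transcribe queries this way.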
