Publication


Featured research published by Xiaojun Qian.


IEEE Signal Processing Magazine | 2015

Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends

Zhen-Hua Ling; Shiyin Kang; Heiga Zen; Andrew W. Senior; Mike Schuster; Xiaojun Qian; Helen M. Meng; Li Deng

Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most common types of acoustic models used in statistical parametric approaches for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences. However, these models have their limitations in representing complex, nonlinear relationships between the speech generation inputs and the acoustic features. Inspired by the intrinsically hierarchical process of human speech production and by the successful application of deep neural networks (DNNs) to automatic speech recognition (ASR), deep learning techniques have also been applied successfully to speech generation, as reported in recent literature. This article systematically reviews these emerging speech generation approaches, with the dual goal of helping readers gain a better understanding of the existing techniques as well as stimulating new work in the burgeoning area of deep learning for parametric speech generation.
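As a rough illustration of the mapping this review covers, the sketch below shows a feed-forward DNN regressing frame-level linguistic features to acoustic features, the role played by the HMM/GMM acoustic model in conventional statistical parametric synthesis. Layer sizes are hypothetical and randomly initialised weights stand in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 300 linguistic features per frame in, 127 acoustic
# features out (e.g. mel-cepstrum + log-F0 + aperiodicity with deltas).
D_IN, D_HID, D_OUT = 300, 512, 127

# Randomly initialised weights stand in for trained parameters.
W1, b1 = rng.normal(0, 0.01, (D_IN, D_HID)), np.zeros(D_HID)
W2, b2 = rng.normal(0, 0.01, (D_HID, D_HID)), np.zeros(D_HID)
W3, b3 = rng.normal(0, 0.01, (D_HID, D_OUT)), np.zeros(D_OUT)

def dnn_acoustic_model(linguistic_frames):
    """Map frame-level linguistic features to acoustic features with a
    feed-forward DNN, in place of the HMM/GMM mapping described above."""
    h = np.tanh(linguistic_frames @ W1 + b1)
    h = np.tanh(h @ W2 + b2)
    return h @ W3 + b3        # linear output layer for regression

utterance = rng.random((200, D_IN))          # 200 frames of linguistic input
acoustic = dnn_acoustic_model(utterance)     # 200 x 127 acoustic trajectory
print(acoustic.shape)
```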


International Conference on Acoustics, Speech, and Signal Processing | 2013

Multi-distribution deep belief network for speech synthesis

Shiyin Kang; Xiaojun Qian; Helen M. Meng

Deep belief network (DBN) has been shown to be a good generative model in tasks such as hand-written digit image generation. Previous work on DBN in the speech community mainly focuses on using the generatively pre-trained DBN to initialize a discriminative model for better acoustic modeling in speech recognition (SR). To fully utilize its generative nature, we propose to model the speech parameters including spectrum and F0 simultaneously and generate these parameters from DBN for speech synthesis. Compared with the predominant HMM-based approach, objective evaluation shows that the spectrum generated from DBN has less distortion. Subjective results also confirm the advantage of the spectrum from DBN, and the overall quality is comparable to that of context-independent HMM.
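A minimal sketch of the multi-distribution idea, assuming mel-cepstral coefficients as the spectral parameters: real-valued spectrum and log-F0 are treated as Gaussian visible units and the voiced/unvoiced flag as a Bernoulli unit, so that spectrum and F0 can be modelled and generated jointly. The DBN training itself is omitted; dimensions and helper names are illustrative, not from the paper.

```python
import numpy as np

def compose_visible(mcep, f0, voiced):
    """Pack one frame's speech parameters into a single visible vector for a
    multi-distribution DBN: real-valued spectral coefficients and log-F0 as
    Gaussian units, the voiced/unvoiced flag as a Bernoulli unit.
    (Data layout only; the RBM/DBN training is omitted.)"""
    lf0 = np.log(f0) if voiced else 0.0   # zeroed in unvoiced frames
    return np.concatenate([mcep, [lf0, float(voiced)]])

def decompose_visible(v, n_mcep):
    """Split a generated visible vector back into spectrum, F0 and voicing."""
    mcep = v[:n_mcep]
    lf0, vuv = v[n_mcep], v[n_mcep + 1]
    voiced = vuv > 0.5
    f0 = float(np.exp(lf0)) if voiced else 0.0
    return mcep, f0, voiced

# Example: a 40-dim mel-cepstrum frame at 120 Hz, voiced
frame = compose_visible(np.zeros(40), 120.0, True)
print(decompose_visible(frame, 40)[1])   # approximately 120.0
```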


IEEE Transactions on Audio, Speech, and Language Processing | 2017

Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks

Kun Li; Xiaojun Qian; Helen M. Meng

This paper investigates the use of multidistribution deep neural networks (DNNs) for mispronunciation detection and diagnosis (MDD), to circumvent the difficulties encountered in an existing approach based on extended recognition networks (ERNs). The ERNs leverage existing automatic speech recognition technology by constraining the search space via including the likely phonetic error patterns of the target words in addition to the canonical transcriptions. MDDs are achieved by comparing the recognized transcriptions with the canonical ones. Although this approach performs reasonably well, it has the following issues: 1) Learning the error patterns of the target words to generate the ERNs remains a challenging task. Phones or phone errors missing from the ERNs cannot be recognized even if we have well-trained acoustic models; and 2) acoustic models and phonological rules are trained independently, and hence, contextual information is lost. To address these issues, we propose an acoustic-graphemic-phonemic model (AGPM) using a multidistribution DNN, whose input features include acoustic features, as well as corresponding graphemes and canonical transcriptions (encoded as binary vectors). The AGPM can implicitly model both grapheme-to-likely-pronunciation and phoneme-to-likely-pronunciation conversions, which are integrated into acoustic modeling. With the AGPM, we develop a unified MDD framework, which works much like free-phone recognition. Experiments show that our method achieves a phone error rate (PER) of 11.1%. The false rejection rate (FRR), false acceptance rate (FAR), and diagnostic error rate (DER) for MDD are 4.6%, 30.5%, and 13.5%, respectively. It outperforms the ERN approach using DNNs as acoustic models, whose PER, FRR, FAR, and DER are 16.8%, 11.0%, 43.6%, and 32.3%, respectively.
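A minimal sketch of how an input frame for the acoustic-graphemic-phonemic model might be assembled, following the description above: acoustic features concatenated with binary (one-hot) encodings of the aligned grapheme and canonical phoneme. The symbol inventories, feature dimension, and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

GRAPHEMES = list("abcdefghijklmnopqrstuvwxyz'")
PHONEMES = ["aa", "ae", "ah", "ao", "eh", "er", "ih", "iy", "uw",
            "b", "d", "f", "k", "l", "m", "n", "p", "r", "s", "t", "th", "z", "sil"]

def one_hot(symbol, inventory):
    v = np.zeros(len(inventory))
    v[inventory.index(symbol)] = 1.0
    return v

def agpm_input(acoustic_frame, grapheme, canonical_phone):
    """Build one input frame for the acoustic-graphemic-phonemic model:
    continuous acoustic features concatenated with binary encodings of the
    aligned grapheme and canonical phoneme (feature layout only; the
    inventories and dimensions here are illustrative)."""
    return np.concatenate([acoustic_frame,
                           one_hot(grapheme, GRAPHEMES),
                           one_hot(canonical_phone, PHONEMES)])

frame = agpm_input(np.zeros(39), "n", "n")   # 39-dim MFCC-like features assumed
print(frame.shape)                           # 39 + 27 + 23 dims
```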


International Symposium on Chinese Spoken Language Processing | 2010

Capturing L2 segmental mispronunciations with joint-sequence models in Computer-Aided Pronunciation Training (CAPT)

Xiaojun Qian; Helen M. Meng; Frank K. Soong

In this study, we present an extension to our previous efforts on automatically detecting text-dependent segmental mispronunciations by Cantonese (L1) learners of American English (L2), through modeling the L2 production. The problem of segmental mispronunciation modeling is addressed by joint-sequence models. Specifically, a grapheme-to-phoneme model is built to convert the prompted words to their corresponding possible mispronunciations, instead of the previous characterization of phonological processes based on a transfer from the canonical phonetic transcription. Experiments show that the approach captures mispronunciations better than knowledge-based and data-driven phonological rules.
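A toy sketch of the joint-sequence idea: pronunciations are generated directly from the orthography by searching over grapheme-phoneme pairs (graphones) under a probabilistic model, so likely L2 mispronunciations fall out as lower-ranked hypotheses. The graphone inventory and probabilities below are hypothetical placeholders (a unigram model over single-letter graphones), far simpler than the trained joint-sequence models used in the paper.

```python
import math

# Toy graphone inventory: each graphone pairs one letter with a phone output
# (possibly empty, i.e. deletion). Probabilities are hypothetical placeholders;
# in practice they would be estimated from annotated L2 pronunciation data.
GRAPHONES = {
    "n": [("n", 0.9), ("", 0.1)],
    "o": [("ow", 0.7), ("ao", 0.3)],
    "r": [("r", 0.6), ("w", 0.3), ("", 0.1)],
    "t": [("t", 0.8), ("d", 0.2)],
    "h": [("", 1.0)],
}

def nbest_pronunciations(word, n=5):
    """Enumerate the n most likely phone sequences for `word` under a unigram
    joint-sequence (graphone) model: a sketch of generating likely L2
    mispronunciations directly from the orthography."""
    beams = [([], 0.0)]                      # (phones so far, log-prob)
    for letter in word:
        options = GRAPHONES.get(letter, [(letter, 1.0)])
        new_beams = []
        for phones, lp in beams:
            for out, p in options:
                new_phones = phones + ([out] if out else [])
                new_beams.append((new_phones, lp + math.log(p)))
        new_beams.sort(key=lambda b: -b[1])  # prune to keep the search tractable
        beams = new_beams[: max(n * 4, 20)]
    beams.sort(key=lambda b: -b[1])
    return [(" ".join(ph), math.exp(lp)) for ph, lp in beams[:n]]

for pron, prob in nbest_pronunciations("north"):
    print(f"{prob:.3f}  {pron}")
```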


International Conference on Acoustics, Speech, and Signal Processing | 2014

Phonological modeling of mispronunciation gradations in L2 English speech of L1 Chinese learners

Hao Wang; Xiaojun Qian; Helen M. Meng

Generation of corrective feedback carries significant pedagogical importance in the design of computer-aided pronunciation training systems. Such feedback generation should take into account the severity of detected mispronunciations, in order to prioritize different kinds of corrections to be conveyed to the learner. However, mispronunciation gradation is highly dependent on the phonetic context and acoustic context of the word pronunciation, as well as human perception. We have defined several categories of mispronunciation gradation, ranging from subtle to salient, and collected crowdsourced ratings from a large number of listeners. This work aims to capture the phonetic context of word mispronunciation by phonological rules, which are then augmented with statistical scoring to quantitatively model mispronunciation gradations. The model can thus be used to generate gradation ratings of word mispronunciations, especially those that are previously unseen in the training set. We will report the results of automatic gradation classification, as well as its correlation(s) with human perception.
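A sketch of what augmenting phonological rules with statistical scoring might look like: a detected word-level mispronunciation is described by rule-based features, and a weighted score maps those features to a gradation. The features and weights below are hypothetical placeholders; in the paper they would be derived from the phonological rules and fitted to the crowdsourced ratings.

```python
def rule_features(canonical, observed):
    """Describe a word-level mispronunciation by simple rule-based features:
    counts of substitutions/deletions/insertions and whether a vowel changed.
    Assumes the two phone sequences are already aligned one-to-one, with ''
    marking a deleted or inserted phone."""
    vowels = {"aa", "ae", "ah", "ao", "eh", "er", "ih", "iy", "ow", "uw"}
    subs = sum(1 for c, o in zip(canonical, observed) if c and o and c != o)
    dels = sum(1 for c, o in zip(canonical, observed) if c and not o)
    ins = sum(1 for c, o in zip(canonical, observed) if o and not c)
    vowel_change = any(c != o and c in vowels for c, o in zip(canonical, observed))
    return [subs, dels, ins, float(vowel_change)]

def gradation_score(features, weights=(0.5, 0.8, 0.4, 1.0)):
    """Map rule features to a scalar severity score; thresholding this score
    would give discrete gradation categories (subtle to salient).
    Weights here are illustrative, not learned."""
    return sum(w * f for w, f in zip(weights, features))

# "north" /n ao r th/ realised as /n ao w t/ (r->w, th->t): two substitutions
print(gradation_score(rule_features(["n", "ao", "r", "th"], ["n", "ao", "w", "t"])))
```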


International Conference on Speech Database and Assessments (Oriental COCOSDA) | 2011

Enunciate: An internet-accessible computer-aided pronunciation training system and related user evaluations

Ka-Wa Yuen; Wai-Kim Leung; Pengfei Liu; Ka-Ho Wong; Xiaojun Qian; Wai Kit Lo; Helen M. Meng

This paper presents our group's latest progress in developing Enunciate, an online computer-aided pronunciation training (CAPT) system for Chinese learners of English. Presently, the system targets segmental pronunciation errors. It consists of an audio-enabled web interface, a speech recognizer for mispronunciation detection and diagnosis, a speech synthesizer and a viseme animator. We present a summary of the system's architecture and major interactive features. We also present statistics from evaluations by English teachers and university students who participated in pilot trials. We are also extending the system to cover suprasegmental training and mobile access.


IEEE Transactions on Audio, Speech, and Language Processing | 2016

A two-pass framework of mispronunciation detection and diagnosis for computer-aided pronunciation training

Xiaojun Qian; Helen M. Meng; Frank K. Soong

This paper presents a two-pass framework with discriminative acoustic modeling for mispronunciation detection and diagnosis (MD&D). The first pass of mispronunciation detection does not require explicit phonetic error pattern modeling. The framework instantiates a set of antiphones and a filler model to augment the original phone model for each canonical phone. This guarantees full coverage of all possible error patterns while maximally exploiting the phonetic information derived from the text prompt. The antiphones can be used to detect substitutions. The filler model can detect insertions, and phone skips are allowed to detect deletions. As such, there is no prior assumption on the possible error patterns that can occur. The second pass of mispronunciation diagnosis expands the detected insertions and substitutions into phone networks, and another recognition pass attempts to reveal the phonetic identities of the detected mispronunciation errors. Discriminative training (DT) is applied respectively to the acoustic models of the mispronunciation detection pass and the mispronunciation diagnosis pass. DT effectively separates the acoustic models of the canonical phones and the antiphones. Overall, with DT in both passes of MD&D, the error rate is reduced by 40.4% relative, compared with the maximum likelihood baseline. After DT, the error rates of the respective passes are also lower than those of a strong single-pass baseline with DT, by 1.3% and 5.1% relative; both reductions are statistically significant.
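A structural sketch of the first-pass detection network described above, assuming a simple per-phone representation rather than any particular decoder or lattice format: each canonical phone slot admits the phone itself, its anti-phone (flagging a substitution), an optional filler before it (flagging an insertion), and a skip arc (flagging a deletion).

```python
def detection_network(canonical_phones):
    """Build the first-pass detection network: for each canonical phone,
    allow the phone itself, its anti-phone (substitution), an optional filler
    before it (insertion), and a skip (deletion). Returned as a list of arc
    sets per position; a structural sketch only, not tied to any decoder."""
    network = []
    for ph in canonical_phones:
        arcs = [
            ("PHONE", ph),            # canonical realisation
            ("ANTI", f"anti_{ph}"),   # anti-phone model -> substitution detected
            ("SKIP", None),           # epsilon arc -> deletion detected
        ]
        network.append({"optional_filler": "FILLER",  # insertion before this phone
                        "arcs": arcs})
    return network

# Canonical transcription of "north" (ARPAbet-style), for illustration
for slot in detection_network(["n", "ao", "r", "th"]):
    print(slot)
```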


International Symposium on Chinese Spoken Language Processing | 2010

Rendering a personalized photo-real talking head from short video footage

Lijuan Wang; Wei Han; Xiaojun Qian; Frank K. Soong

In this paper, we propose an HMM trajectory-guided, real image sample concatenation approach to photo-real talking head synthesis. An audio-visual database of a person is recorded first for training a statistical Hidden Markov Model (HMM) of lips movement. The HMM is then used to generate the dynamic trajectory of lips movement for given speech signals in the maximum probability sense. The generated trajectory is then used as a guide to select, from the original training database, an optimal sequence of lips images, which are then stitched back onto a background head video. The whole procedure is fully automatic and data driven. With as little as 20 minutes of recorded audio/video footage, the proposed system can synthesize a highly photo-real talking head in sync with the given speech signals (natural or TTS-synthesized). This system won first place in the A/V consistency contest of the LIPS Challenge (2009), as perceptually evaluated by recruited human subjects.
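A simplified sketch of the trajectory-guided selection step, assuming the HMM has already produced a lips-feature trajectory and each library image is represented by a feature vector: dynamic programming picks one image per frame, trading off closeness to the generated trajectory (target cost) against smoothness between consecutive images (concatenation cost). Feature dimensions, costs, and weights are illustrative.

```python
import numpy as np

def select_lip_images(target_traj, candidates, w_concat=1.0):
    """Trajectory-guided sample selection: pick one lip image per frame so the
    selected sequence follows the HMM-generated trajectory (target cost) while
    staying smooth (concatenation cost), via dynamic programming.
    target_traj : (T, D) generated lips-feature trajectory
    candidates  : (N, D) lips features of library images"""
    T, N = len(target_traj), len(candidates)
    target_cost = np.linalg.norm(target_traj[:, None, :] - candidates[None, :, :], axis=-1)
    concat_cost = np.linalg.norm(candidates[:, None, :] - candidates[None, :, :], axis=-1)

    cost = target_cost[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + w_concat * concat_cost   # indexed [prev, cur]
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(N)] + target_cost[t]
    # trace back the optimal image index sequence
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(1)
traj = rng.random((50, 8))         # 50 frames of generated lips features
library = rng.random((200, 8))     # 200 candidate lip images (features only)
print(select_lip_images(traj, library)[:10])
```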


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2015

A two-pass framework of mispronunciation detection & diagnosis for computer-aided pronunciation training

Xiaojun Qian; Helen M. Meng; Frank K. Soong

This paper presents a two-pass framework with discriminative acoustic modeling for mispronunciation detection and diagnosis (MD&D). The first pass of mispronunciation detection does not require explicit phonetic error pattern modeling. The framework instantiates a set of antiphones and a filler model to augment the original phone model for each canonical phone. This guarantees full coverage of all possible error patterns while maximally exploiting the phonetic information derived from the text prompt. The antiphones can be used to detect substitutions. The filler model can detect insertions, and phone skips are allowed to detect deletions. As such, there is no prior assumption on the possible error patterns that can occur. The second pass of mispronunciation diagnosis expands the detected insertions and substitutions into phone networks, and another recognition pass attempts to reveal the phonetic identities of the detected mispronunciation errors. Discriminative training (DT) is applied respectively to the acoustic models of the mispronunciation detection pass and the mispronunciation diagnosis pass. DT effectively separates the acoustic models of the canonical phones and the antiphones. Overall, with DT in both passes of MD&D, the error rate is reduced by 40.4% relative, compared with the maximum likelihood baseline. After DT, the error rates of the respective passes are also lower than those of a strong single-pass baseline with DT, by 1.3% and 5.1% relative; both reductions are statistically significant.


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2013

Predicting gradation of L2 English mispronunciations using ASR with extended recognition network

Hao Wang; Helen M. Meng; Xiaojun Qian

A CAPT system can be pedagogically improved by giving effective feedback according to the severity of mispronunciations. We obtained perceptual gradations of L2 English mispronunciations through crowdsourcing, conducted quality control to filter for reliable ratings, and proposed approaches to predict the gradation of word-level mispronunciations. This paper presents our work on improving the previous prediction approach using ASR with an extended recognition network, to address its limitations: 1) it does not work for mispronounced words whose transcriptions are not immediately available; and 2) words articulated in perceptually different ways but sharing the same transcription receive the same predicted gradation.

Collaboration


Dive into Xiaojun Qian's collaboration.

Top Co-Authors

Helen M. Meng (The Chinese University of Hong Kong)
Shiyin Kang (The Chinese University of Hong Kong)
Hao Wang (The Chinese University of Hong Kong)
Kun Li (The Chinese University of Hong Kong)
Pengfei Liu (The Chinese University of Hong Kong)