Xingyu Na
Beijing Institute of Technology
Publication
Featured research published by Xingyu Na.
Conference of the International Speech Communication Association | 2016
Daniel Povey; Vijayaditya Peddinti; Daniel Galvez; Pegah Ghahremani; Vimal Manohar; Xingyu Na; Yiming Wang; Sanjeev Khudanpur
In this paper we describe a method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training. We use the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI. To make its computation feasible, we use a phone n-gram language model in place of the word language model. To further reduce its space and time complexity, we compute the objective function using neural network outputs at one third of the standard frame rate. These changes enable us to perform the forward-backward computation on GPUs. Further, the reduced output frame rate also provides a significant speed-up during decoding. We present results on 5 different LVCSR tasks with training data ranging from 100 to 2100 hours. Models trained with LF-MMI provide a relative word error rate reduction of ∼11.5% over those trained with the cross-entropy objective function, and ∼8% over those trained with cross-entropy and sMBR objective functions. A further relative reduction of ∼2.5% can be obtained by fine-tuning these models with the word-lattice-based sMBR objective function.
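For context, the MMI criterion referred to above has the standard form below (generic notation, not taken from the paper: O_u is the observation sequence of utterance u, w_u its transcript, and κ an acoustic scale). In the lattice-free variant, the denominator sum is evaluated by the forward-backward algorithm over a phone n-gram graph rather than over word lattices:

\[
\mathcal{F}_{\mathrm{MMI}} = \sum_{u} \log \frac{p(\mathbf{O}_u \mid \mathcal{M}_{w_u})^{\kappa}\, P(w_u)}{\sum_{w} p(\mathbf{O}_u \mid \mathcal{M}_{w})^{\kappa}\, P(w)}
\]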
International Conference on Acoustics, Speech, and Signal Processing | 2016
Yajie Miao; Mohammad Gowayyed; Xingyu Na; Tom Ko; Florian Metze; Alex Waibel
The connectionist temporal classification (CTC) loss function has several interesting properties relevant for automatic speech recognition (ASR): applied on top of deep recurrent neural networks (RNNs), CTC learns the alignments between speech frames and label sequences automatically, which removes the need for pre-generated frame-level labels. CTC systems also do not require context decision trees for good performance, using context-independent (CI) phonemes or characters as targets. This paper presents an extensive exploration of CTC-based acoustic models applied to a variety of ASR tasks, including an empirical study of the optimal configuration and architectural variants for CTC. We observe that on large amounts of training data, CTC models tend to outperform the state-of-the-art hybrid approach. Further experiments reveal that CTC can be readily ported to syllable-based languages, and can be enhanced by employing improved feature front-ends.
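A minimal sketch of the kind of CTC acoustic model discussed here; the layer sizes, label inventory, and tensor shapes are illustrative assumptions, and the snippet uses PyTorch's built-in CTC loss rather than the toolkit used in the paper.

```python
# Sketch: bidirectional-RNN acoustic model trained with CTC over
# context-independent phone targets (all sizes are illustrative).
import torch
import torch.nn as nn

num_phones = 46          # hypothetical CI phone inventory
blank_id = 0             # CTC blank symbol

class CTCAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=320, num_labels=num_phones + 1):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=3,
                           bidirectional=True, batch_first=False)
        self.proj = nn.Linear(2 * hidden, num_labels)

    def forward(self, feats):                     # feats: (T, N, feat_dim)
        out, _ = self.rnn(feats)
        return self.proj(out).log_softmax(dim=-1) # (T, N, num_labels)

model = CTCAcousticModel()
ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)

feats = torch.randn(200, 4, 40)                        # 200 frames, batch of 4
log_probs = model(feats)
targets = torch.randint(1, num_phones + 1, (4, 30))    # phone label sequences
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

# CTC marginalizes over all frame-level alignments, so no pre-generated
# frame labels are needed.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```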
IEEE Transactions on Audio, Speech, and Language Processing | 2015
Milos Cernak; Philip N. Garner; Alexandros Lazaridis; Petr Motlicek; Xingyu Na
Current very low bit rate speech coders are, due to complexity limitations, designed to work off-line. This paper investigates incremental speech coding that operates in real time and incrementally (i.e., encoded speech depends only on already-uttered speech, without the need for future speech information). Since human speech communication is asynchronous (i.e., different information flows are processed simultaneously), we hypothesized that such an incremental speech coder should also operate asynchronously. To accomplish this task, we describe speech coding that reflects human cortical temporal sampling, which packages information into units of different temporal granularity, such as phonemes and syllables, in parallel. More specifically, a phonetic vocoder (cascaded speech recognition and synthesis systems) extended with syllable-based information transmission mechanisms is investigated. Two main aspects are evaluated in this work: synchronous and asynchronous coding. Synchronous coding refers to the case where the phonetic vocoder and the speech generation process depend on the syllable boundaries during encoding and decoding, respectively. Asynchronous coding, on the other hand, refers to the case where the phonetic encoding and speech generation processes are carried out independently of the syllable boundaries. Our experiments confirmed that asynchronous incremental speech coding performs better in terms of intelligibility and overall speech quality, mainly due to better alignment of the segmental and prosodic information. The proposed vocoder operates at an uncompressed bit rate of 213 bits/sec and achieves an average communication delay of 243 ms.
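A conceptual sketch of the synchronous versus asynchronous distinction described above, using hypothetical event structures rather than the paper's actual recognizer/synthesizer interface.

```python
# Sketch: synchronous decoding waits for syllable boundaries before
# synthesizing; asynchronous decoding synthesizes each phone as it arrives
# and applies prosodic (syllable) information independently.
def decode_synchronous(events, synthesize):
    """Buffer phones until a syllable boundary arrives, then synthesize the
    whole syllable; latency grows with syllable length."""
    buffer = []
    for ev in events:
        if ev["type"] == "phone":
            buffer.append(ev)
        elif ev["type"] == "syllable_boundary":
            synthesize(buffer, prosody=ev)
            buffer = []

def decode_asynchronous(events, synthesize, update_prosody):
    """Synthesize each phone immediately; syllable-level prosody is applied
    whenever it becomes available, independently of the segmental stream."""
    for ev in events:
        if ev["type"] == "phone":
            synthesize([ev], prosody=None)
        elif ev["type"] == "syllable_boundary":
            update_prosody(ev)
```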
International Conference on Pattern Recognition | 2014
Ming Tu; Xiang Xie; Xingyu Na
Voice activity detection (VAD) is important in many speech applications. In this paper, two VAD methods using novel features based on computational auditory scene analysis (CASA) are proposed. The first is a statistical-model-based VAD in which the cochleagram, instead of discrete Fourier transform coefficients, is used as the time-frequency representation. The second is a supervised method based on Gaussian mixture models (GMMs): gammatone frequency cepstral coefficients (GFCC) are extracted from the cochleagram and used to discriminate speech from noise in noisy signals, with GMMs modelling the GFCC of speech and of noise. Both methods are evaluated within the framework of the multiple-observation likelihood ratio test, and their performance is compared with several existing algorithms. The results demonstrate that CASA-based features outperform several traditional features in the task of VAD, and the reasons for the superiority of the two proposed features are also investigated.
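A minimal sketch of the second (supervised) method, assuming GFCC feature matrices have already been extracted; the arrays gfcc_speech, gfcc_noise, and gfcc_test are hypothetical, and the simple sliding-window smoothing is only a stand-in for the paper's multiple-observation likelihood ratio test.

```python
# Sketch: GMM-based speech/noise discrimination on GFCC features, with a
# frame-level log-likelihood ratio smoothed over a short window.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_vad_gmms(gfcc_speech, gfcc_noise, n_components=32):
    """Fit one GMM to speech GFCC frames and one to noise GFCC frames."""
    gmm_speech = GaussianMixture(n_components, covariance_type="diag").fit(gfcc_speech)
    gmm_noise = GaussianMixture(n_components, covariance_type="diag").fit(gfcc_noise)
    return gmm_speech, gmm_noise

def vad_decisions(gfcc_test, gmm_speech, gmm_noise, threshold=0.0, window=5):
    """Return a boolean speech/non-speech decision per frame."""
    # Frame-level log-likelihood ratio between the speech and noise models.
    llr = gmm_speech.score_samples(gfcc_test) - gmm_noise.score_samples(gfcc_test)
    # Average the LLR over neighbouring frames before thresholding.
    kernel = np.ones(window) / window
    smoothed = np.convolve(llr, kernel, mode="same")
    return smoothed > threshold
```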
International Conference on Multimedia and Expo | 2014
Xingyu Na; Xiang Xie; Jingming Kuang
Speech synthesizers are commonly used in human-computer interaction. In many applications, computing resources are limited while real-time synthesis is demanded. HMM-based speech synthesis can create natural voice quality with a small footprint, but current synthesizers require the concatenation of sentence-level acoustic units, which is not applicable in real-time operation. In this paper, we propose a blocked parameter generation algorithm for low-latency speech synthesis which can run in real time in resource-limited applications. Phonetic units at various time spans are used as blocks. The objective and subjective evaluations suggest that the proposed system produces promising voice quality with a low demand on computing resources.
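A conceptual sketch of block-wise generation: instead of generating parameters for the whole sentence at once, units are grouped into phonetic blocks and generation runs per block. The helpers generate_params, vocode, and play are hypothetical stand-ins for the HMM parameter generation and waveform synthesis steps, not the paper's implementation.

```python
# Sketch: streaming synthesis where latency is bounded by one block rather
# than by the full sentence.
def synthesize_blocked(units, block_size, generate_params, vocode, play):
    """Generate and play speech block by block."""
    for start in range(0, len(units), block_size):
        block = units[start:start + block_size]
        params = generate_params(block)   # e.g. parameter generation restricted to the block
        play(vocode(params))              # emit audio as soon as the block is ready
```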
International Conference on Acoustics, Speech, and Signal Processing | 2014
Yishan Jiao; Xiang Xie; Xingyu Na; Ming Tu
The HMM-based speech synthesis system (HTS) often generates buzzy and muffled speech. Such degradation of voice quality makes synthetic speech sound robotic rather than natural. From this viewpoint, we assume that synthetic speech lies in a speaker space different from that of the original. We propose to use a voice conversion method to transform synthetic speech toward the original so as to improve its quality. Local linear transformation (LLT) combined with temporal decomposition (TD) is proposed as the conversion method; it not only ensures smooth spectral conversion but also avoids the over-smoothing problem. Moreover, we design a robust spectral selection and modification strategy to keep the modified spectra stable. A preference test shows that the proposed method can improve the quality of HMM-based speech synthesis.
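In generic form, a local linear transformation maps each source spectral vector with a region-specific affine transform (the notation below is illustrative, not the paper's exact parameterization):

\[
\hat{\mathbf{y}}_t = \mathbf{A}_k \mathbf{x}_t + \mathbf{b}_k ,
\]

where k indexes the local region containing frame t, here delimited by temporal-decomposition events, so that transforms vary smoothly across the utterance rather than being tied to a single global mapping.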
IEEE Automatic Speech Recognition and Understanding Workshop | 2015
Zhichao Wang; Xingyu Na; Xin Li; Jielin Pan; Yonghong Yan
Deep neural networks have shown significant improvements in acoustic modelling, pushing state-of-the-art performance in large vocabulary continuous speech recognition (LVCSR) tasks. However, training DNNs is very time-consuming on large-scale data. In this paper, a data-parallel method, namely two-stage ASGD, is proposed. Two-stage ASGD is based on the asynchronous stochastic gradient descent (ASGD) paradigm and is tuned for a GPU-equipped computing cluster connected by 10 Gbit/s Ethernet rather than InfiniBand. Several techniques, such as hierarchical learning rate control, double buffering and order locking, are applied to optimise the communication-to-transmission ratio. The proposed framework is evaluated by training a DNN with 29.5M parameters on a 500-hour Chinese continuous telephone speech data set. Using 4 computer nodes and 8 GPU devices (2 devices per node), a 5.9-fold acceleration is obtained over a single GPU with an acceptable loss of accuracy (0.5% on average). A comparative experiment compares the proposed two-stage ASGD with the parallel DNN training systems reported in prior work.
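A minimal sketch of the asynchronous SGD pattern underlying this kind of data-parallel training: workers pull (possibly stale) parameters, compute gradients on their own data shard, and push updates that the server applies as they arrive. The thread-based parameter server and the compute_grad callable are illustrative assumptions; the paper's hierarchical, double-buffered implementation over Ethernet is not reproduced here.

```python
# Sketch: asynchronous SGD between workers and a shared parameter server.
import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.01):
        self.params = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        """Return a copy of the current global parameters."""
        with self.lock:
            return self.params.copy()

    def push(self, grad):
        """Apply an update immediately, without waiting for other workers."""
        with self.lock:
            self.params -= self.lr * grad

def worker(server, data_shard, compute_grad, steps=100):
    """Each worker loops independently: pull, compute local gradient, push."""
    for _ in range(steps):
        local = server.pull()                  # possibly stale parameters
        grad = compute_grad(local, data_shard) # gradient on this worker's shard
        server.push(grad)                      # asynchronous update
```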
International Symposium on Chinese Spoken Language Processing | 2012
Xingyu Na; Xiang Xie; Jingming Kuang; Yaling He
This paper proposes a tone labeling technique for tonal-language speech synthesis. Non-uniform segmentation using Viterbi alignment is introduced to determine the boundaries from which F0 symbols are obtained; these symbols are used as tonal labels to eliminate the mismatch between tone patterns and the F0 contours of the training data. During context clustering, the tendency of adjacent F0 state distributions is captured by state-based phonetic trees. In the synthesis stage, the means of the tone model states are directly quantized to obtain the full tonal labels. Both objective and subjective experimental results show that the proposed technique can improve the perceptual prosody of synthetic speech from non-professional speakers.
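A toy sketch of the quantization step described above, turning per-state F0 means into discrete tone symbols; the use of log-F0 and uniform bin edges are illustrative assumptions, not the paper's quantizer.

```python
# Sketch: map each tone-model state's mean F0 onto a small symbol inventory.
import numpy as np

def f0_symbols(state_means_hz, n_levels=5):
    """Quantize per-state mean F0 values (Hz) into n_levels discrete symbols."""
    log_f0 = np.log(np.asarray(state_means_hz, dtype=float))
    # Uniform interior bin edges over the observed log-F0 range.
    edges = np.linspace(log_f0.min(), log_f0.max(), n_levels + 1)[1:-1]
    return np.digitize(log_f0, edges)   # integers in [0, n_levels)
```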
Conference of the International Speech Communication Association | 2013
Milos Cernak; Xingyu Na; Philip N. Garner
Archive | 2012
Xingyu Na; Chaomin Wang; Xiang Xie; Yaling He