
Publication


Featured research published by Yujing Si.


fuzzy systems and knowledge discovery | 2012

Optimized large vocabulary WFST speech recognition system

Yuhong Guo; Ta Li; Yujing Si; Jielin Pan; Yonghong Yan

The speech recognition decoder is an important part of large-vocabulary speech recognition applications, and its speed and accuracy are the main concerns. Recently, weighted finite-state transducers (WFSTs) have become the dominant representation of the decoding network. However, the large memory and time cost of constructing the final WFST decoding network is the bottleneck of this technique. The goal of this article is to construct a tight, flexible WFST decoding network as well as a fast, scalable decoder. A tight representation of silence in speech is proposed, and a decoding algorithm with improved pruning strategies is also suggested. The experimental results show that the proposed network representation cuts the memory cost of constructing the final decoding network by 37% and the time cost by 19%. With WFST-feature-specific decoding beams, the proposed decoder's efficiency and accuracy are also significantly improved.
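The pruning idea the abstract refers to can be sketched roughly as follows (a minimal, hypothetical illustration of frame-synchronous beam pruning, not the paper's decoder; the function name and threshold are made up):

```python
# Minimal sketch of beam pruning in a frame-synchronous decoder:
# at each frame, keep only hypotheses whose log score lies within
# `beam` of the best hypothesis, and discard the rest.

def prune(hyps, beam):
    """hyps: dict mapping state -> log score; beam: pruning width."""
    best = max(hyps.values())
    return {s: score for s, score in hyps.items() if score >= best - beam}

hyps = {"s1": -10.0, "s2": -12.5, "s3": -25.0}
survivors = prune(hyps, beam=5.0)
# "s3" falls more than 5.0 below the best hypothesis and is pruned
```

A tighter beam makes decoding faster but risks pruning the correct hypothesis; beam width is the main speed/accuracy trade-off such decoders tune.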


computational intelligence and security | 2013

Automatic Allophone Deriving for Korean Speech Recognition

Ji Xu; Yujing Si; Jielin Pan; Yonghong Yan

In Korean, the pronunciations of phonemes are severely affected by their contexts. Thus, using phonemes directly translated from their written forms as basic units for acoustic modeling is problematic, as these units cannot capture the complex pronunciation variations that occur in continuous speech. An allophone, a sub-phone unit in phonetics that serves as an independent phoneme in speech recognition, is considered able to describe such complex pronunciation variations. This paper presents a novel approach called Automatic Allophone Deriving (AAD). In this approach, statistics from Gaussian mixture models are used to create measurements for allophone candidates, and decision trees are used to derive allophones. The question set used by the decision trees is also generated automatically, since we assume no linguistic knowledge is available to the approach. This paper also adopts long-time features over conventional cepstral features to capture acoustic information spanning several hundred milliseconds for AAD, as co-articulation effects are unlikely to be limited to a single phoneme. Experiments show that AAD outperforms previous approaches that derive allophones from linguistic knowledge. Additional experiments use long-time features directly in acoustic modeling; the results show that the performance achieved with the same allophones can be significantly improved by using long-time features, compared with the corresponding baselines.
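Likelihood-driven decision-tree splitting of the kind used here can be sketched as follows (a deliberately simplified, hypothetical illustration: one-dimensional, single-Gaussian sufficient statistics per context, which is far coarser than the paper's GMM-based measurements):

```python
import math

# Each context is summarized by Gaussian sufficient statistics
# (count, sum of x, sum of x^2).  A split is kept if modeling the
# "yes" and "no" context groups with separate Gaussians raises the
# total log-likelihood over a single pooled Gaussian.

def pooled_loglik(stats):
    """Log-likelihood of all data under one pooled Gaussian."""
    n = sum(c for c, _, _ in stats)
    sx = sum(s for _, s, _ in stats)
    sx2 = sum(s2 for _, _, s2 in stats)
    mean = sx / n
    var = max(sx2 / n - mean * mean, 1e-8)  # floor to avoid log(0)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def split_gain(stats, in_yes):
    """Log-likelihood gain from splitting contexts by a question;
    in_yes[i] is True if context i answers the question with yes."""
    yes = [s for s, y in zip(stats, in_yes) if y]
    no = [s for s, y in zip(stats, in_yes) if not y]
    return pooled_loglik(yes) + pooled_loglik(no) - pooled_loglik(stats)

# Two well-separated context clusters: splitting them apart yields a
# large positive gain, so the tree would apply this question.
stats = [(100, 100.0, 150.0), (100, 500.0, 2550.0)]
gain = split_gain(stats, [True, False])
```

The automatic question generation in AAD amounts to searching over candidate `in_yes` partitions rather than taking them from a linguist-written question set.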


international congress on image and signal processing | 2012

Lattice generation with accurate word boundary in WFST framework

Yuhong Guo; Yujing Si; Yong Liu; Jielin Pan; Yonghong Yan

This paper presents an algorithm that generates speech recognition lattices with accurate word boundaries in the weighted finite-state transducer (WFST) decoding framework. In traditional WFST lattice generation algorithms, the transformation from the context-dependent phone lattice to the word lattice does not yield accurate time boundaries between words. Moreover, the resulting lattice is not in the Standard Lattice Format and is not compatible with existing toolkits, so a lattice without word boundaries can only be used where word boundaries are not needed. In this paper, we propose a lexicon-matching algorithm based on token passing to transform the phone lattice into the word lattice. This algorithm generates standard lattices with accurate word boundaries. Experiments show that the proposed lattice generation algorithm achieves good lattice quality and good efficiency.
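The lexicon-matching idea can be sketched with a pronunciation trie (a simplified, hypothetical illustration of matching phone sequences to words with known boundaries, not the paper's token-passing algorithm and without lattice arcs or scores):

```python
# Words are stored in a prefix tree keyed by phones; a match that
# reaches a word-end marker yields the word together with the start
# and end positions of its phone span, i.e. its time boundary.

def build_trie(lexicon):
    """lexicon: dict mapping word -> tuple of phones."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})
        node.setdefault("#", []).append(word)  # "#" marks word ends
    return root

def match_words(phones, trie):
    """Return (word, start, end) for every lexicon word that spans a
    contiguous subsequence of the phone string."""
    hits = []
    for start in range(len(phones)):
        node = trie
        for end in range(start, len(phones)):
            node = node.get(phones[end])
            if node is None:
                break
            for word in node.get("#", []):
                hits.append((word, start, end + 1))
    return hits

lexicon = {"cat": ("k", "ae", "t"), "at": ("ae", "t")}
hits = match_words(("k", "ae", "t"), build_trie(lexicon))
# finds ("cat", 0, 3) and the embedded ("at", 1, 3)
```

In the real algorithm the input is a phone lattice rather than a single phone string, and tokens carry acoustic and language-model scores along with the boundary positions.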


international conference on natural computation | 2012

Recurrent neural network language model in mandarin voice input system

Yujing Si; Ta Li; Shang Cai; Jielin Pan; Yonghong Yan

Over more than three decades, the development of automatic speech recognition (ASR) technology has made it possible for some intelligent query systems to use a voice interface. In particular, voice input is a practical and interesting application of ASR. In this paper, we present our recent work on using a Recurrent Neural Network Language Model (RNNLM) to improve the performance of our Mandarin voice input system. The system employs a two-pass strategy. In the first pass, a memory-efficient state network and a tri-gram language model are used to generate the word lattice from which the n-best list is extracted. In the second pass, a large 4-gram language model and the RNNLM re-rank the n-best list, and the new best hypothesis is output. Experiments showed that the RNNLM was very effective in n-best list rescoring: a 10.2% relative reduction in word error rate (from 13.7% to 12.3%) was achieved on a voice search task, compared to the result of the first pass.
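The second pass can be sketched as follows (a toy illustration with made-up scores; `lam` and `lm_weight` are hypothetical interpolation parameters, not values from the paper):

```python
# Second-pass n-best rescoring: combine each hypothesis's first-pass
# acoustic score with an interpolation of 4-gram and RNNLM log
# probabilities, then return the highest-scoring hypothesis.

def rescore(nbest, lm_scores, rnn_scores, lam=0.5, lm_weight=10.0):
    """nbest: list of (hypothesis, acoustic log score).
    lm_scores / rnn_scores: per-hypothesis LM log probabilities."""
    best_hyp, best_score = None, float("-inf")
    for (hyp, am), lm, rnn in zip(nbest, lm_scores, rnn_scores):
        # linear interpolation of the two LMs in the log domain
        total = am + lm_weight * (lam * lm + (1 - lam) * rnn)
        if total > best_score:
            best_hyp, best_score = hyp, total
    return best_hyp

nbest = [("we go today", -100.0), ("we know today", -101.0)]
best = rescore(nbest, lm_scores=[-8.0, -6.0], rnn_scores=[-9.0, -5.0])
# the second hypothesis wins despite its worse first-pass score
```

The point of the two-pass design is that the expensive RNNLM only ever scores the short n-best list, never the full search space.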


computational intelligence and security | 2012

An Improved Mandarin Voice Input System Using Recurrent Neural Network Language Model

Yujing Si; Ji Xu; Zhen Zhang; Jielin Pan; Yonghong Yan

In this paper, we present our recent work on using a Recurrent Neural Network Language Model (RNNLM) in a Mandarin voice input system. Specifically, the RNNLM is used in conjunction with a large high-order n-gram language model (LM) to rescore the n-best list. However, we observed that repeated computations can make the rescoring procedure inefficient. Therefore, we propose a new n-best-list rescoring framework called Prefix Tree based N-best list Rescore (PTNR) to eliminate the repeated computations entirely and speed up the rescoring procedure. Experiments show that the RNNLM yields about a 4.5% relative reduction in word error rate (WER), and that, compared to the conventional n-best list rescoring method, PTNR achieves a speed-up factor of 3-4. Compared to the cache-based method, the design of PTNR is simpler and more explicit, and PTNR requires a smaller memory footprint.
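The source of the saving can be sketched by counting RNNLM forward steps (a simplified illustration of the prefix-sharing idea, not the paper's implementation):

```python
# Hypotheses in an n-best list share long prefixes.  If the RNNLM
# state is cached per prefix, the number of forward steps equals the
# number of distinct prefixes in the trie, not the total word count.

def trie_nodes(nbest):
    """Count distinct word prefixes across all hypotheses."""
    seen = set()
    for hyp in nbest:
        words = tuple(hyp.split())
        for i in range(1, len(words) + 1):
            seen.add(words[:i])
    return len(seen)

nbest = ["the cat sat", "the cat sang", "the dog sat"]
naive_steps = sum(len(h.split()) for h in nbest)  # 9 RNNLM steps
shared_steps = trie_nodes(nbest)                  # 6 distinct prefixes
```

Because real n-best lists differ from one another only near the ends of hypotheses, the ratio of total words to distinct prefixes, and hence the speed-up, grows with the list size.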


global congress on intelligent systems | 2012

Impact of Word Classing on Recurrent Neural Network Language Model

Yujing Si; Yuhong Guo; Yong Liu; Jielin Pan; Yonghong Yan

This paper investigates the impact of word classing on the recurrent neural network language model (RNNLM), which has recently been shown to outperform many competitive language modeling techniques. In particular, the class-based RNNLM (CRNNLM) was previously proposed to speed up both the training and testing phases of the RNNLM. However, in past work, word classes for the CRNNLM were obtained simply from word frequencies, which is not accurate. Hence, we take a closer look at classing to find out whether improved classes translate into improved performance. More specifically, we explore the Brown algorithm, a classical method of word classing. In experiments on a standard test set, we find that a 5%-7% relative reduction in perplexity (PPL) can be obtained with the Brown algorithm, compared to the frequency-based word-classing method.
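The factorization that makes the CRNNLM fast can be sketched as follows (a toy illustration with hypothetical distributions; the real model conditions both factors on the recurrent hidden state):

```python
# Class-based output factorization:
#   P(w | h) = P(class(w) | h) * P(w | class(w), h)
# With |V| words in |C| classes, normalizing the output costs
# O(|C| + |V|/|C|) instead of O(|V|), which is the speed-up.

def word_prob(word, p_class, p_word_in_class, word2class):
    c = word2class[word]
    return p_class[c] * p_word_in_class[c][word]

word2class = {"cat": "NOUN", "dog": "NOUN", "runs": "VERB"}
p_class = {"NOUN": 0.6, "VERB": 0.4}  # P(class | history)
p_word_in_class = {                   # P(word | class, history)
    "NOUN": {"cat": 0.7, "dog": 0.3},
    "VERB": {"runs": 1.0},
}
p_cat = word_prob("cat", p_class, p_word_in_class, word2class)  # 0.42
```

Since the factorization holds for any partition of the vocabulary, the quality of the partition (frequency binning versus Brown clustering) affects perplexity but not the speed-up itself, which is exactly the distinction the paper studies.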


Journal of the Acoustical Society of America | 2012

Voice search optimization using weighted finite-state transducers

Yuhong Guo; Ta Li; Yujing Si; Jielin Pan; Yonghong Yan

A voice search system provides users with information in response to their spoken queries. However, the high word error rate of the automatic speech recognition (ASR) module, the most important part of such a system, degrades the whole system's performance. Moreover, the runtime efficiency of the ASR is the bottleneck in large-scale deployment of voice search. In this paper, an optimized weighted finite-state transducer (WFST) based voice search system is proposed. A weighted parallel silence/short-pause model is introduced to reduce both the final transducer size and the word error rate, and the WFST network is optimized as well. The experimental results show that the proposed system's recognition speed outperforms the other recognition systems at an equal word error rate, and that the oracle error rate is also significantly reduced. This work is partially supported by the National Natural Science Foundation of China (Nos. 10925419, 90920302, 10874203, 60875014, 61072124, 11074275, 11161140319).


conference of the international speech communication association | 2013

Prefix tree based n-best list re-scoring for recurrent neural network language model used in speech recognition system.

Yujing Si; Qingqing Zhang; Ta Li; Jielin Pan; Yonghong Yan


Journal of Software | 2014

Speeding up deep neural network based speech recognition systems

Yeming Xiao; Yujing Si; Ji Xu; Jielin Pan; Yonghong Yan


The Journal of Information and Computational Science | 2013

Enhanced Word Classing for Recurrent Neural Network Language Model

Yujing Si; Zhen Zhang; Ta Li; Jielin Pan; Yonghong Yan

Collaboration


Dive into Yujing Si's collaboration.

Top Co-Authors

Yonghong Yan, Chinese Academy of Sciences
Jielin Pan, Chinese Academy of Sciences
Zhen Zhang, Chinese Academy of Sciences
Ta Li, Chinese Academy of Sciences
Qingwei Zhao, Chinese Academy of Sciences
Yong Liu, Chinese Academy of Sciences
Yuhong Guo, Chinese Academy of Sciences
Qingqing Zhang, Chinese Academy of Sciences
Shang Cai, Chinese Academy of Sciences
Yeming Xiao, Chinese Academy of Sciences