Guanglai Gao | Researchain

Publication

Featured researches published by Guanglai Gao.

IEEE Transactions on Audio, Speech, and Language Processing | 2016

A pairwise algorithm using the deep stacking network for speech separation and pitch estimation

Xueliang Zhang; Hui Zhang; Shuai Nie; Guanglai Gao; Wenju Liu

Speech separation and pitch estimation in noisy conditions are considered to be a “chicken-and-egg” problem. On one hand, pitch information is an important cue for speech separation. On the other hand, speech separation makes pitch estimation easier when background noise is removed. In this paper, we propose a supervised learning architecture to solve these two problems iteratively. The proposed algorithm is based on the deep stacking network (DSN), which provides a method for stacking simple processing modules to build deep architectures. Each module is a classifier whose target is the ideal binary mask (IBM), and the input vector includes spectral features, pitch-based features and the output from the previous module. During the testing stage, we estimate the pitch using the separation results and update the pitch-based features to the next module. When embedded into the DSN, pitch estimation and speech separation each run several times. We obtain the final results from the last module. Systematic evaluations show that the proposed system results in both a high quality estimated binary mask and accurate pitch estimation and outperforms recent systems in its generalization ability.
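
The ideal binary mask used as the training target above is commonly defined per time-frequency unit: a unit is labeled 1 when its local signal-to-noise ratio exceeds a local criterion, and 0 otherwise. The sketch below illustrates that standard definition only. The function name, the toy energy values and the 0 dB criterion are assumptions for illustration and are not taken from the paper.

```python
import math

def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Label each time-frequency unit 1 if its local SNR (dB) exceeds lc_db, else 0.

    speech_energy and noise_energy are same-shaped lists of rows of
    non-negative energies. lc_db is the local criterion (assumed 0 dB here).
    """
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            if s <= 0.0:
                row.append(0)          # no speech energy in this unit
            elif n <= 0.0:
                row.append(1)          # speech present, no noise
            else:
                snr_db = 10.0 * math.log10(s / n)
                row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask

# Toy 2x2 example (hypothetical energies).
speech = [[4.0, 0.5], [1.0, 9.0]]
noise = [[1.0, 2.0], [1.0, 3.0]]
print(ideal_binary_mask(speech, noise))
```

In the system described above, a mask of this kind is what each module is trained to predict; the premixed speech and noise needed to compute it are only available at training time.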

International Conference on Document Analysis and Recognition | 2015

A multiple instances approach to improving keyword spotting on historical Mongolian document images

Hongxi Wei; Guanglai Gao; Xiangdong Su

In keyword spotting on historical Mongolian document images, retrieval performance varies considerably when a user provides different instance images for the same query keyword. This paper proposes an approach to solving this problem. Specifically, the keyword spotting procedure is divided into two stages. The first stage generates multiple ranking lists for a query keyword, and the second stage merges these lists into a final ranking. In the first stage, a ranking list for the query keyword is first returned by traditional image matching, and a number of instances of the query keyword are then obtained using pseudo-relevance feedback. Each instance of the query keyword then returns its own ranking list. In the second stage, the ranking lists from the multiple instances are combined using a data fusion technique. The final ranking is taken as the retrieval result for the query keyword. Experimental results show that the proposed approach significantly improves keyword spotting performance on historical Mongolian document images.
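
The second stage can be pictured with a small rank-fusion sketch. The abstract does not say which data fusion technique is used, so the example below substitutes reciprocal rank fusion purely as an illustration. The function name, the constant `k=60` and the image identifiers are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranking_lists, k=60):
    """Merge several ranked lists into one.

    Each item receives 1 / (k + rank) from every list it appears in;
    items are then sorted by their summed score (ties broken by name).
    This is a stand-in for the unspecified fusion method in the abstract.
    """
    scores = defaultdict(float)
    for ranking in ranking_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item: (-scores[item], item))

# One ranking list per query instance (hypothetical word-image ids).
lists = [
    ["img3", "img1", "img2"],
    ["img1", "img3", "img4"],
    ["img1", "img2", "img3"],
]
print(reciprocal_rank_fusion(lists))
```

An item that ranks consistently high across instances, such as `img1` here, ends up ahead of an item that tops only a single list.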

International Conference on Neural Information Processing | 2016

LDA-Based Word Image Representation for Keyword Spotting on Historical Mongolian Documents

Hongxi Wei; Guanglai Gao; Xiangdong Su

The original Bag-of-Visual-Words (BoVW) approach discards the spatial relations of the visual words. In this paper, an LDA-based topic model is adopted to obtain the semantic relations of visual words for each word image. Because the LDA-based topic model usually hurts retrieval performance when used on its own, it is linearly combined with a visual language model for each word image in this study. The basic query likelihood model is then used for retrieval. Experimental results on our dataset show that the proposed LDA-based representation approach performs keyword spotting efficiently and accurately on a collection of historical Mongolian documents, and that it significantly outperforms the original BoVW approach.
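
A linear combination of a topic model with a language model, scored by query likelihood, can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's implementation: the interpolation weight `lam`, the smoothing constant `eps`, all function names and the toy probabilities are hypothetical.

```python
import math

def lda_word_prob(word, doc_topics, topic_words):
    """P_lda(w | d) = sum over topics z of P(z | d) * P(w | z)."""
    return sum(p_z * topic_words[z].get(word, 0.0)
               for z, p_z in enumerate(doc_topics))

def combined_prob(word, lm_probs, doc_topics, topic_words, lam=0.7):
    """Linear interpolation of the language model and the LDA estimate."""
    p_lm = lm_probs.get(word, 0.0)
    p_lda = lda_word_prob(word, doc_topics, topic_words)
    return lam * p_lm + (1.0 - lam) * p_lda

def query_log_likelihood(query, lm_probs, doc_topics, topic_words,
                         lam=0.7, eps=1e-12):
    """Sum of log P(w | d) over the query's visual words."""
    return sum(math.log(combined_prob(w, lm_probs, doc_topics,
                                      topic_words, lam) + eps)
               for w in query)

# Toy model: two topics over three visual words (hypothetical values).
topic_words = [{"v1": 0.8, "v2": 0.2}, {"v2": 0.5, "v3": 0.5}]
image_a = ({"v1": 0.5, "v2": 0.5}, [0.9, 0.1])   # (language model, topic mix)
image_b = ({"v2": 0.5, "v3": 0.5}, [0.1, 0.9])
query = ["v1", "v2"]

score_a = query_log_likelihood(query, image_a[0], image_a[1], topic_words)
score_b = query_log_likelihood(query, image_b[0], image_b[1], topic_words)
print(score_a > score_b)
```

Word images are ranked by this score, so an image whose language model and topic mixture both support the query's visual words is retrieved first.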

Chinese Conference on Pattern Recognition | 2009

Improving of Acoustic Model for the Mongolian Speech Recognition System

Feilong Bao; Guanglai Gao

Research on Mongolian speech recognition technology started comparatively late and is still at an early stage. In this paper, we optimize the basic resources of a Mongolian speech recognition system and, most importantly, improve its acoustic model. We implement a continuous HMM Gaussian mixture model and a multiple-data-stream SCHMM model, both built on a context-dependent phonetic model and the decision tree method, and compare the performance of the two models. Finally, we run a large number of experiments on the test set with HTK as the experimental platform, applying a trigram language model and an acoustic model composed of the context-dependent phonetic model, the decision tree method and the multiple-data-stream SCHMM model. We find that system performance is improved and that word and sentence recognition accuracy increase substantially.

International Conference on Acoustics, Speech, and Signal Processing | 2015

A pairwise algorithm for pitch estimation and speech separation using deep stacking network

Hui Zhang; Xueliang Zhang; Shuai Nie; Guanglai Gao; Wenju Liu

Pitch information is an important cue for speech separation. However, pitch estimation in noisy conditions is as challenging as speech separation itself. In this paper, we propose a supervised learning architecture that combines these two problems concisely. The proposed algorithm is based on the deep stacking network (DSN), which provides a method of stacking simple processing modules to build a deep architecture. In the training stage, the ideal binary mask is used as the target. The input vector includes the output of the lower module and frame-level features, which consist of spectral and pitch-based features. In the testing stage, each module provides an estimated binary mask, which is used to re-estimate the pitch. We then update the pitch-based features and pass them to the next module. This procedure is embedded iteratively in the DSN, and we obtain the final separation results from the last module. Systematic evaluations show that the proposed approach produces a high-quality estimated binary mask and outperforms recent systems in generalization.

CCL | 2015

Mongolian Speech Recognition Based on Deep Neural Networks

Hui Zhang; Feilong Bao; Guanglai Gao

Mongolian is an influential language, and better Mongolian Large Vocabulary Continuous Speech Recognition (LVCSR) systems are needed. Recently, speech recognition research has improved considerably with the introduction of Deep Neural Networks (DNNs). In this study, a DNN-based Mongolian LVCSR system is built. Experimental results show that the DNN-based models outperform conventional models based on Gaussian Mixture Models (GMMs) for Mongolian speech recognition by a large margin. Compared with the best GMM-based model, the DNN-based one obtains a relative improvement of over 50%, making it a new state-of-the-art system in this field.

International Conference on Acoustics, Speech, and Signal Processing | 2016

Convolutional neural network for robust pitch determination

Hong Su; Hui Zhang; Xueliang Zhang; Guanglai Gao

Pitch is an important characteristic of speech and is useful for many applications. However, pitch determination in noisy conditions is difficult. In this paper, we propose a supervised learning algorithm to estimate pitch using a convolutional neural network (CNN). Specifically, we use a CNN for pitch candidate selection and dynamic programming for pitch tracking. Our experimental results show that the proposed method obtains accurate pitch estimates and generalizes well to new speakers and noisy conditions. We credit the success to the use of the CNN, which is well suited to modeling the shift-invariant spectral features for pitch detection.
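
The dynamic programming step can be illustrated with a small Viterbi-style tracker over per-frame pitch candidates. This is a generic sketch, not the authors' tracker: the function name, the linear jump penalty, its weight and the toy candidate scores (standing in for CNN outputs) are all assumptions.

```python
def track_pitch(candidates, scores, jump_penalty=0.01):
    """Pick one pitch per frame, trading candidate score against pitch jumps.

    candidates[t] lists candidate pitches (Hz) for frame t and scores[t]
    holds the matching scores (higher is better). The transition cost
    jump_penalty * |f_t - f_{t-1}| is an assumed smoothness term.
    """
    if not candidates:
        return []
    best = [list(scores[0])]                 # best[t][j]: best path score ending at j
    back = [[-1] * len(candidates[0])]       # back[t][j]: predecessor index
    for t in range(1, len(candidates)):
        row_best, row_back = [], []
        for j, f in enumerate(candidates[t]):
            options = [best[t - 1][i] - jump_penalty * abs(f - prev_f)
                       for i, prev_f in enumerate(candidates[t - 1])]
            i_star = max(range(len(options)), key=options.__getitem__)
            row_best.append(options[i_star] + scores[t][j])
            row_back.append(i_star)
        best.append(row_best)
        back.append(row_back)
    # Trace back from the best final candidate.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][j])
        j = back[t][j]
    path.reverse()
    return path

# Three frames, two candidates each (hypothetical scores).
cands = [[100, 200], [105, 210], [110, 220]]
probs = [[0.9, 0.5], [0.4, 0.6], [0.8, 0.3]]
print(track_pitch(cands, probs))
```

In the middle frame the higher-scoring candidate is 210 Hz, yet the tracker selects 105 Hz because it keeps the contour continuous with its neighbors, which is the point of tracking rather than choosing frame by frame.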

International Conference on Multimedia and Expo | 2017

Representing word image using visual word embeddings and RNN for keyword spotting on historical document images

Hongxi Wei; Hui Zhang; Guanglai Gao

Visual words in the Bag-of-Visual-Words (BoVW) framework are independent of each other, which not only discards the spatial order of visual words but also loses semantic information. Inspired by word embeddings, this study applies a similar embedding procedure to a large number of visual words, yielding an embedding vector for each visual word. For a word image, the average of the embedding vectors of all visual words within the image is taken as its embedding vector. Moreover, a Recurrent Neural Network (RNN) is used to encode each word image into an embedding, in the manner of an auto-encoder. The RNN embeddings and the visual word embeddings are complementary. In this study, all word images are represented by combining visual word embeddings and RNN embeddings. Experimental results show that the proposed representation approach is superior to the traditional BoVW, spatial pyramid matching and latent Dirichlet allocation.
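
The averaging step described above is simple enough to sketch directly. The embedding table, the visual word ids and the helper names below are hypothetical. The abstract does not state how the two embeddings are combined, so the concatenation shown is an assumption.

```python
def average_embedding(visual_word_ids, embedding_table):
    """Average the embedding vectors of all visual words in one word image."""
    vectors = [embedding_table[w] for w in visual_word_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def combine(visual_embedding, rnn_embedding):
    """Join the two representations by concatenation (an assumed choice)."""
    return list(visual_embedding) + list(rnn_embedding)

# Hypothetical 2-dimensional embeddings for three visual words.
table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
word_image = [0, 1, 2, 2]            # visual word ids found in one word image
avg = average_embedding(word_image, table)
print(avg)
print(combine(avg, [0.1]))           # [0.1] stands in for an RNN embedding
```

Averaging makes the representation independent of how many visual words an image contains, so word images of different widths remain comparable.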

National Conference on Man-Machine Speech Communication | 2017

Mongolian Text-to-Speech System Based on Deep Neural Network

Rui Liu; Feilong Bao; Guanglai Gao; Yonghe Wang

Recently, the Deep Neural Network (DNN), a feed-forward artificial neural network with many hidden layers, has opened a new research direction for speech synthesis. It can represent high-dimensional and correlated features efficiently and model highly complex mapping functions compactly. However, DNN-based Mongolian speech synthesis remains unexplored. This paper applies a DNN-based acoustic model to Mongolian speech synthesis for the first time and builds a Mongolian speech synthesis system according to the Mongolian character and acoustic features. On the same corpus, the DNN-based system synthesizes better Mongolian speech than the conventional HMM-based system. The Mean Opinion Score (MOS) of the synthesized Mongolian speech is 3.83, making it a new state-of-the-art system in this field.

International Conference on Asian Language Processing | 2016

Mongolian prosodic phrase prediction using suffix segmentation

Rui Liu; Feilong Bao; Guanglai Gao; Weihua Wang

Accurate prosodic phrase prediction can improve the naturalness of speech synthesis. Predicting prosodic phrases can be regarded as a sequence labeling problem, and the Conditional Random Field (CRF) is typically used to solve it. Mongolian is an agglutinative language in which a very large number of words can be formed by concatenating stems and suffixes. This characteristic makes it difficult to build a high-performance CRF-based Mongolian prosodic phrase prediction system. We introduce a new method that segments each Mongolian word into stem and suffix and treats them as individual tokens. The proposed method integrates multiple features according to the characteristics of Mongolian word formation. We conduct comparative experiments with the following features: word, multi-level Part-of-Speech (POS), multi-level lexical for suffix and the existence for suffix. Experimental results show that our method significantly improves the performance of the Mongolian prosodic phrase prediction system compared with the conventional method, which treats each Mongolian word directly as a token. The word feature, the level-one lexical for suffix feature and the existence for suffix feature are effective. The best result, measured by F1-measure, is 82.49%.
