Publication


Featured research published by Xinhui Hu.


Mobile Data Management | 2013

Multilingual Speech-to-Speech Translation System: VoiceTra

Shigeki Matsuda; Xinhui Hu; Yoshinori Shiga; Hideki Kashioka; Chiori Hori; Keiji Yasuda; Hideo Okuma; Masao Uchiyama; Eiichiro Sumita; Hisashi Kawai; Satoshi Nakamura

This study presents an overview of VoiceTra, developed by NICT and released as the world's first network-based multilingual speech-to-speech translation system for smartphones, and describes in detail its multilingual speech recognition, multilingual translation, and multilingual speech synthesis in the context of field experiments. We show the effects of system updates that use the data collected from field experiments to improve our acoustic and language models.
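The abstract describes a server-side pipeline that chains speech recognition, machine translation, and speech synthesis. The sketch below illustrates only that chaining; all class and method names are hypothetical stand-ins, not the actual VoiceTra API.

```python
# Minimal sketch of a network-based speech-to-speech translation pipeline
# (ASR -> MT -> TTS). All names here are hypothetical illustrations.

class SpeechToSpeechTranslator:
    def __init__(self, recognizer, translator, synthesizer):
        self.recognizer = recognizer    # multilingual speech recognition
        self.translator = translator    # multilingual machine translation
        self.synthesizer = synthesizer  # multilingual speech synthesis

    def translate_utterance(self, audio, src_lang, tgt_lang):
        # 1. Recognize the source-language speech into text.
        text = self.recognizer.recognize(audio, lang=src_lang)
        # 2. Translate the recognized text into the target language.
        translated = self.translator.translate(text, src=src_lang, tgt=tgt_lang)
        # 3. Synthesize target-language speech from the translation.
        return self.synthesizer.synthesize(translated, lang=tgt_lang)
```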


IEICE Transactions on Information and Systems | 2008

Using Mutual Information Criterion to Design an Efficient Phoneme Set for Chinese Speech Recognition

Jinsong Zhang; Xinhui Hu; Satoshi Nakamura

Chinese is a representative tonal language, and how to process tone information in a state-of-the-art large vocabulary speech recognition system has been an attractive research topic. This paper presents a novel way to derive an efficient phoneme set of tone-dependent units for building a recognition system, by iteratively merging pairs of tone-dependent units according to the principle of minimal loss of mutual information (MI). The mutual information is measured between the word tokens and their phoneme transcriptions in a training text corpus, based on the system lexicon and language model. The approach has the capability to keep the discriminative tonal (and phoneme) contrasts that are most helpful for disambiguating words that would otherwise be homophones, and to merge those tonal (and phoneme) contrasts that are not important for word disambiguation in the recognition task. This enables a flexible selection of the phoneme set according to a balance between the amount of MI and the number of phonemes. We applied the method to the traditional phoneme set of Initials/Finals and derived several phoneme sets with different numbers of units. Speech recognition experiments using the derived sets showed their effectiveness.
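The merging procedure can be pictured as a greedy loop: at each step, evaluate every candidate pair of tone-dependent units and apply the merge that loses the least mutual information. The sketch below is a minimal illustration of that loop, assuming a pronunciation lexicon and unigram word frequencies; it uses the simplification that when the transcription is a deterministic function of the word, I(W; P) equals the transcription entropy H(P), whereas the paper measures MI under the full system lexicon and language model.

```python
import math
from itertools import combinations

def transcription_entropy(lexicon, word_freq, merge_map):
    """H(P) after applying unit merges. When the transcription is a
    deterministic function of the word, I(W; P) = H(P); merging units
    can only collapse homophones and hence lower H(P)."""
    total = sum(word_freq.values())
    trans_prob = {}
    for word, units in lexicon.items():
        trans = tuple(merge_map.get(u, u) for u in units)
        trans_prob[trans] = trans_prob.get(trans, 0.0) + word_freq[word] / total
    return -sum(p * math.log2(p) for p in trans_prob.values())

def greedy_merge(lexicon, word_freq, n_target):
    """Iteratively merge the pair of units whose merge loses the least MI."""
    units = {u for seq in lexicon.values() for u in seq}
    merge_map = {}                       # merged unit -> surviving unit
    while len(units) > n_target:
        base = transcription_entropy(lexicon, word_freq, merge_map)
        best = None
        for a, b in combinations(sorted(units), 2):
            trial = dict(merge_map)
            trial[b] = a                 # tentatively fold b into a
            for k, v in trial.items():   # redirect earlier merges into b
                if v == b:
                    trial[k] = a
            loss = base - transcription_entropy(lexicon, word_freq, trial)
            if best is None or loss < best[0]:
                best = (loss, b, trial)
        _, b, merge_map = best
        units.discard(b)
    return merge_map, units

lexicon = {"ma1": ("m", "a1"), "ma3": ("m", "a3"), "da4": ("d", "a4")}
freq = {"ma1": 5, "ma3": 5, "da4": 10}
# One zero-loss merge is chosen; merging a1/a3, which would collapse
# ma1/ma3 into homophones, is avoided.
print(greedy_merge(lexicon, freq, n_target=4))
```

On a real task the candidate pairs would typically be restricted to tonal variants of the same base unit rather than all pairs, which also keeps the search tractable.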


International Symposium on Chinese Spoken Language Processing | 2014

Mandarin speech recognition using convolution neural network with augmented tone features

Xinhui Hu; Xugang Lu; Chiori Hori

Due to its ability to reduce spectral variations and model the spectral correlations that exist in speech signals, the convolutional neural network (CNN) has been shown to be more effective at modeling speech than the deep neural network (DNN). In this study, we explore applying CNNs to Mandarin speech recognition. Besides exploring an appropriate CNN architecture for recognition performance, we focus on investigating effective acoustic features and the effectiveness of adding tonal information, which has been verified to be helpful in other types of acoustic models, to the acoustic features of the CNN. We conduct experiments on Mandarin broadcast speech recognition to test the effectiveness of the proposed approaches. The CNN shows clear superiority over the DNN, with relative reductions in character error rate (CER) of 7.7-13.1% for broadcast news speech (BN) and 5.4-9.9% for broadcast conversation speech (BC). As in Gaussian Mixture Model (GMM) and DNN systems, the tonal information characterized by the fundamental frequency (F0) and fundamental frequency variation (FFV) is still found to be helpful in CNN models, achieving relative CER reductions of over 6.7% for BN and 4.3% for BC when compared with the baseline Mel-filterbank features.
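To make the setup concrete, the sketch below is a minimal PyTorch model in the spirit of the abstract, with the filterbank map and the tone features (F0 and FFV, tiled across the frequency axis) entering as separate input channels. The layer sizes, filter shapes, and pooling are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ToneAugmentedCNN(nn.Module):
    """Minimal CNN acoustic model sketch: log mel filterbanks plus tone
    features (F0, FFV), tiled over frequency, enter as separate input
    channels. Layer sizes are illustrative assumptions only."""

    def __init__(self, n_states, freq_bins=40, context=11, in_channels=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 3)),   # pool along frequency only
            nn.Conv2d(128, 256, kernel_size=3),
            nn.ReLU(),
        )
        with torch.no_grad():                   # infer flattened size once
            flat = self.conv(torch.zeros(1, in_channels, context, freq_bins)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_states),          # tied-state (senone) posteriors
        )

    def forward(self, x):
        # x: (batch, channels = [fbank, F0, FFV], context frames, freq bins)
        return self.classifier(self.conv(x))

model = ToneAugmentedCNN(n_states=3000)
logits = model(torch.randn(8, 3, 11, 40))       # batch of 8 context windows
```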


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2015

A Myanmar large vocabulary continuous speech recognition system

Hay Mar Soe Naing; Aye Mya Hlaing; Win Pa Pa; Xinhui Hu; Ye Kyaw Thu; Chiori Hori; Hisashi Kawai

This paper presents a large vocabulary automatic speech recognition (ASR) system for Myanmar. To the best of our knowledge, this is the first such system for the Myanmar language. We report the main processes of developing the system, including data collection, pronunciation lexicon construction, selection of effective acoustic features, acoustic and language modeling, and evaluation criteria. Since Myanmar is a tonal language, tonal features were incorporated into acoustic modeling and their effectiveness was verified. Differences between the word-based language model (LM) and the syllable-based LM were investigated; the word-based LM was found superior to the syllable-based model. To avoid the ambiguity of word definitions in Myanmar and achieve high reliability in the recognition results, we explored the characteristics of the Myanmar language and proposed the Syllable Error Rate (SER) as a suitable evaluation criterion for a Myanmar ASR system. Three kinds of acoustic models, one Gaussian Mixture Model (GMM) and two Deep Neural Networks (DNNs), were explored, utilizing only the developed phonemically balanced corpus consisting of 4K sentences and 40 hours of speech. An open evaluation set containing 100 utterances spoken by 25 speakers was used in the experiments. With the sequence-discriminatively trained DNN, the results reached 15.63% word error rate (WER), or 10.87% SER.
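The proposed SER is the standard edit-distance error rate computed over syllable tokens instead of words. Below is a minimal sketch of that computation; the Myanmar-specific syllable segmentation that produces the tokens is not shown here.

```python
def syllable_error_rate(ref_syllables, hyp_syllables):
    """Levenshtein edit distance over syllable tokens, normalized by
    the reference length: SER = (S + D + I) / N."""
    n, m = len(ref_syllables), len(hyp_syllables)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref_syllables[i - 1] == hyp_syllables[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[n][m] / n if n else 0.0

print(syllable_error_rate(list("abcd"), list("abd")))  # 0.25: one deletion
```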


International Conference on Acoustics, Speech, and Signal Processing | 2006

Automatic Derivation of a Phoneme Set with Tone Information for Chinese Speech Recognition Based on Mutual Information Criterion

Jin-Song Zhang; Xinhui Hu; Satoshi Nakamura

An appropriate approach to modeling tone information is helpful for building a Chinese large vocabulary continuous speech recognition system. We propose deriving an efficient phoneme set of tone-dependent sub-word units for building a recognition system, by iteratively merging pairs of tone-dependent units according to the principle of minimal loss of mutual information. The mutual information is measured between the word tokens and their phoneme transcriptions in a training text corpus, based on the system lexicon and language model. The approach has the capability to keep the discriminative tonal (and phoneme) contrasts that are most helpful for disambiguating words that would otherwise be homophones, and to merge those tonal (and phoneme) contrasts that are not important for word disambiguation in the recognition task. This enables a flexible selection of the phoneme set according to a balance between the amount of MI and the number of phonemes. We applied the method to the traditional phoneme set of Initials/Finals and derived several phoneme sets with different numbers of units. Speech recognition experiments using the derived sets showed their effectiveness.
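The quantity driving the merges can be estimated directly from (word, transcription) token pairs. The sketch below computes I(W; P) from such pairs; it is a unigram approximation, whereas the paper measures MI under the system lexicon and language model. The toy example then shows how merging two tones costs MI exactly when it creates homophones; all words and units in it are made up for illustration.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(W; P) = sum over (w, p) of P(w, p) * log2(P(w, p) / (P(w) P(p))),
    estimated from observed (word, transcription) token pairs."""
    joint = Counter(pairs)
    n = sum(joint.values())
    pw, pp = Counter(), Counter()
    for (w, p), c in joint.items():
        pw[w] += c
        pp[p] += c
    return sum((c / n) * math.log2(c * n / (pw[w] * pp[p]))
               for (w, p), c in joint.items())

# Toy corpus: two words differing only in tone, plus one distinct word.
tokens = ([("ma1", ("m", "a1"))] * 5 + [("ma3", ("m", "a3"))] * 5
          + [("da4", ("d", "a4"))] * 10)
print(mutual_information(tokens))   # 1.5 bits: all words distinguishable

# Merging tones a1/a3 into a toneless "a" makes ma1/ma3 homophones:
merge = {"a1": "a", "a3": "a"}
merged = [(w, tuple(merge.get(u, u) for u in p)) for w, p in tokens]
print(mutual_information(merged))   # 1.0 bit: this merge costs 0.5 bits of MI
```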


Speech Communication | 2016

Combination of multiple acoustic models with unsupervised adaptation for lecture speech transcription

Peng Shen; Xugang Lu; Xinhui Hu; Naoyuki Kanda; Masahiro Saiko; Chiori Hori; Hisashi Kawai

Automatic speech recognition (ASR) systems have achieved considerable progress in real applications thanks to skilled design of architectures with advanced techniques and algorithms. However, how to design a system that efficiently integrates these various techniques to obtain advanced performance is still a challenging task. In this paper, we introduce an ASR system based on ensemble model combination and adaptation with two characteristics: (1) large-scale combination of multiple ASR systems based on Recognizer Output Voting Error Reduction (ROVER), and (2) multi-pass unsupervised speaker adaptation of deep neural network acoustic models together with topic adaptation of the language model. The multiple acoustic models were trained with different acoustic features and model architectures, which helped provide complementary and discriminative information in the ROVER process. With these multiple acoustic models, a better estimation of word confidence could be obtained from the ROVER process, which helped in selecting data for unsupervised adaptation of the previously trained acoustic models. The final recognition result was obtained using multi-pass decoding, ROVER, and adaptation processes. We tested the system on lecture speeches with topics related to Technology, Entertainment and Design (TED) that were used in the International Workshop on Spoken Language Translation (IWSLT) evaluation campaign, and obtained 6.5%, 7.0%, 10.6%, and 8.4% word error rates for the 2011, 2012, 2013, and 2014 test sets, which to our knowledge are the best results for these evaluation sets.
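ROVER combines systems by aligning their outputs into a word transition network and voting slot by slot. The sketch below shows only the voting stage, assuming the alignment is already given; real ROVER builds that alignment incrementally with dynamic programming, and can mix word frequency with confidence scores.

```python
from collections import defaultdict

def rover_vote(aligned_hyps, confidences=None, null="<eps>"):
    """Simplified ROVER voting over a pre-built word transition network:
    one slot per column, '<eps>' marks a gap in a hypothesis."""
    if confidences is None:
        confidences = [1.0] * len(aligned_hyps)
    output = []
    for s in range(len(aligned_hyps[0])):
        score = defaultdict(float)
        for hyp, conf in zip(aligned_hyps, confidences):
            score[hyp[s]] += conf        # weighted frequency of occurrence
        best = max(score, key=score.get)
        if best != null:
            output.append(best)
    return output

# Three system outputs aligned into five slots:
hyps = [["the", "cat", "sat", "<eps>", "down"],
        ["the", "cat", "sat", "right", "down"],
        ["a",   "cat", "sad", "right", "down"]]
print(rover_vote(hyps))  # ['the', 'cat', 'sat', 'right', 'down']
```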


International Universal Communication Symposium | 2009

Spoken document retrieval using topic models

Xinhui Hu; Ryosuke Isotani; Satoshi Nakamura

In this paper, we propose a document topic model (DTM) based on the non-negative matrix factorization (NMF) approach to explore spontaneous spoken document retrieval. The model uses latent semantic indexing to detect underlying semantic relationships within documents. Each document is interpreted as a generative topic model covering many topics, and the relevance of a document to a query is expressed by the probability of the query being generated by that model. The term-document matrix used for NMF is built stochastically from the speech recognition N-best results, so that multiple recognition hypotheses can be utilized to compensate for word recognition errors. Using this approach, experiments are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ), with 39 queries against over 600 hours of spontaneous Japanese speech. The retrieval performance of this model proves superior to the conventional vector space model (VSM) when the dimension, or topic number, exceeds a certain threshold. Moreover, in terms of both retrieval performance and topic expressiveness, the NMF-based topic model is verified to surpass another latent indexing method based on singular value decomposition (SVD). The extent to which this topic model can resist speech recognition errors, a problem specific to spoken document retrieval, is also investigated.
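As a concrete illustration, the sketch below factorizes a term-document matrix with off-the-shelf NMF and ranks documents by the query's generative log-probability. The scoring is a simplification of the paper's DTM, the topic count is arbitrary, and the matrix is assumed given rather than built stochastically from N-best lists.

```python
import numpy as np
from sklearn.decomposition import NMF

def build_topic_model(V, n_topics=64):
    """Factorize a term-document matrix V (terms x documents) as V ~ W @ H."""
    model = NMF(n_components=n_topics, init="nndsvd", max_iter=500)
    W = model.fit_transform(V)       # term-topic weights   (terms x topics)
    H = model.components_            # topic-document weights (topics x docs)
    return W, H

def rank_documents(query_term_ids, W, H, eps=1e-12):
    """Rank documents by the log-probability of generating the query terms."""
    P = W @ H                        # reconstructed term-document affinities
    P /= P.sum(axis=0, keepdims=True) + eps   # each column: P(term | doc)
    # Query likelihood: product over query terms, computed in log space.
    log_lik = np.log(P[query_term_ids] + eps).sum(axis=0)
    return np.argsort(-log_lik)      # document indices, best first
```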


Proceedings of the 7th Workshop on Asian Language Resources | 2009

Construction of Chinese Segmented and POS-tagged Conversational Corpora and Their Evaluations on Spontaneous Speech Recognitions

Xinhui Hu; Ryosuke Isotani; Satoshi Nakamura

The performance of a corpus-based language and speech processing system depends heavily on the quantity and quality of the training corpora. Although several well-known Chinese corpora have been developed, most consist mainly of written text. Even for the existing corpora that contain spoken data, the quantity is insufficient and the domain is limited. In this paper, we describe the development of the Chinese conversational annotated textual corpora currently being used in the NICT/ATR speech-to-speech translation system. A total of 510K manually checked utterances provide 3.5M words of Chinese corpora. As far as we know, this is the largest conversational textual corpus in the travel domain. A set of three parallel corpora is obtained with the corresponding pairs of Japanese and English words from which the Chinese words are translated. Evaluation experiments on these corpora were conducted by comparing the parameters of the language models, the perplexities of test sets, and speech recognition performance with those of Japanese and English. The characteristics of the Chinese corpora, their limitations, and solutions to these limitations are analyzed and discussed.
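For reference, test-set perplexity, used above to compare the corpora across languages, is computed as in the sketch below; the add-one-smoothed bigram model is a stand-in assumption, not the LM actually used in the paper.

```python
import math
from collections import Counter

def train_bigram(tokens, vocab):
    """Add-one-smoothed bigram LM; returns a log2 P(w | prev) function."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(vocab)
    def log_prob(prev, w):
        return math.log2((bigrams[(prev, w)] + 1) / (unigrams[prev] + V))
    return log_prob

def perplexity(test_tokens, log_prob):
    """PPL = 2^{-(1/N) * sum_i log2 P(w_i | w_{i-1})}; lower is better."""
    lp = sum(log_prob(p, w) for p, w in zip(test_tokens, test_tokens[1:]))
    return 2 ** (-lp / (len(test_tokens) - 1))

train = "please book a ticket to tokyo please book a room".split()
lm = train_bigram(train, vocab=set(train))
print(perplexity("please book a ticket".split(), lm))
```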


Oriental COCOSDA International Conference on Speech Database and Assessments | 2009

Construction of Chinese conversational corpora for spontaneous speech recognition and comparative study on the trilingual parallel corpora

Xinhui Hu; Ryosuke Isotani; Satoshi Nakamura

In this paper, we describe the development of the Chinese conversational segmented and POS-tagged corpora currently used in the NICT/ATR speech-to-speech translation system. Over 500K manually checked utterances provide 3.5M words of Chinese corpora. As far as we know, these are the largest conversational textual corpora in the domain of travel. A set of three parallel corpora is obtained with the corresponding pairs of Japanese and English words from which the Chinese words are translated. Based on these parallel corpora, we investigate the statistics of each language as well as the language model and speech recognition performance, and identify the differences among these languages. The problems of the present Chinese corpora and their solutions are also analyzed and discussed.


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2014

Incorporating tone features to convolutional neural network to improve Mandarin/Thai speech recognition

Xinhui Hu; Masahiro Saiko; Chiori Hori

Tone plays an important role in distinguishing lexical meaning in tonal languages such as Mandarin and Thai, and it has been shown that tone information helps improve automatic speech recognition (ASR) for these languages. In this study, we incorporate tone features derived from the fundamental frequency (F0) and fundamental frequency variation (FFV) into the convolutional neural network (CNN), a state-of-the-art approach for acoustic modeling in ASR systems. Due to its ability to reduce spectral variations and model the spectral correlations existing in speech signals, the CNN is expected to model well the tone patterns that manifest mainly in the frequency domain through the F0 contour. We conduct ASR experiments on Mandarin and Thai to evaluate the effectiveness of the proposed approaches. With the help of tone features, the character error rates (CERs) for Mandarin achieve relative reductions of 4.3-7.1%, and the word error rates (WERs) for Thai achieve relative reductions of 0.41-6.26%. The CNN shows clear superiority over the deep neural network (DNN), with relative CER reductions of 5.4-13.1% for Mandarin and relative WER reductions of 0.5-5.6% for Thai.
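One simple way to feed per-frame tone features into a CNN front end is to append them to the frequency axis of each frame before cutting context windows. The NumPy sketch below illustrates that arrangement; the layout and feature dimensions are assumptions for illustration, not necessarily the paper's exact configuration.

```python
import numpy as np

def assemble_cnn_input(fbank, f0, ffv, context=5):
    """Stack per-frame tone features alongside filterbank frames for a CNN.
    fbank: (T, B) log mel filterbanks; f0: (T,) pitch values;
    ffv: (T, K) fundamental frequency variation vectors.
    Returns (T, 2*context+1, B+1+K): one 2-D input map per center frame."""
    T = fbank.shape[0]
    # Append tone features to each frame's frequency axis (one arrangement;
    # treating them as separate input channels is another option).
    frames = np.concatenate([fbank, f0[:, None], ffv], axis=1)
    # Repeat edge frames so every center frame has a full context window.
    pad = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([pad[t:t + 2 * context + 1] for t in range(T)])

maps = assemble_cnn_input(np.random.rand(100, 40),   # 100 frames, 40 bins
                          np.random.rand(100),        # F0 per frame
                          np.random.rand(100, 7))     # 7-dim FFV per frame
print(maps.shape)  # (100, 11, 48)
```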

Collaboration


Xinhui Hu's top co-authors and their affiliations.

Top Co-Authors

Satoshi Nakamura
Nara Institute of Science and Technology

Chiori Hori
National Institute of Information and Communications Technology

Hideki Kashioka
National Institute of Information and Communications Technology

Hisashi Kawai
National Institute of Information and Communications Technology

Ryosuke Isotani
National Institute of Information and Communications Technology

Shigeki Matsuda
National Institute of Information and Communications Technology

Youzheng Wu
Chinese Academy of Sciences

Eiichiro Sumita
National Institute of Information and Communications Technology

Hirofumi Yamamoto
National Institute of Information and Communications Technology

Keiji Yasuda
National Institute of Information and Communications Technology