Yuya Akita
Kyoto University
Publications
Featured research published by Yuya Akita.
international conference on acoustics, speech, and signal processing | 2008
Tatsuya Kawahara; Yusuke Nemoto; Yuya Akita
The paper addresses language model adaptation for automatic lecture transcription by fully exploiting the presentation slides used in the lecture. As the text in presentation slides is small in size and fragmentary in content, a robust adaptation scheme is designed, focusing on keyword and topic information. Several methods are investigated and combined: first, global topic adaptation is conducted based on PLSA (probabilistic latent semantic analysis) using keywords appearing in all slides, and Web text is also retrieved to enhance the relevant text. Then, local preferences for the keywords are reflected with a cache model by referring to the slide shown during each utterance. Experimental evaluations on real lectures show that the proposed method, combining the global and local slide information, achieves a significant improvement in recognition accuracy, especially in the detection rate of content keywords.
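As a rough illustration of the local (cache) component, the sketch below interpolates a base unigram model with a cache built from the keywords on the currently shown slide. The word lists and the interpolation weight are invented for illustration, and the global PLSA adaptation step is not shown.

```python
from collections import Counter

def cache_adapted_unigram(base_probs, slide_keywords, cache_weight=0.2):
    """Boost words that appear on the slide shown during the current
    utterance by interpolating a cache model with the base unigram LM.
    cache_weight is a made-up value, not one from the paper."""
    cache = Counter(slide_keywords)
    total = sum(cache.values())
    return {w: (1.0 - cache_weight) * p
               + cache_weight * (cache[w] / total if total else 0.0)
            for w, p in base_probs.items()}

base = {"model": 0.30, "adaptation": 0.25, "lecture": 0.25, "slide": 0.20}
print(cache_adapted_unigram(base, ["adaptation", "adaptation", "slide"]))
```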
international conference on acoustics, speech, and signal processing | 2006
Yuya Akita; Tatsuya Kawahara
One of the most significant problems in language modeling of spontaneous speech, such as meetings and lectures, is that only a limited amount of matched training data, i.e., faithful transcripts for the relevant task domain, is available. In this paper, we propose a novel transformation approach that estimates language model statistics of spontaneous speech from a document-style text database, which is often available on a large scale. The proposed statistical transformation model is designed to model characteristic linguistic phenomena of spontaneous speech and to estimate their occurrence probabilities. These contextual patterns and probabilities are derived from a small parallel corpus aligning faithful transcripts with their document-style texts. For wide coverage and reliable estimation, a model based on part-of-speech (POS) is also prepared to provide a back-off scheme from the word-based model. The approach has been successfully applied to estimating the language model for National Congress meetings from their minute archives, achieving a significant reduction in test-set perplexity.
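A minimal sketch of the back-off idea, assuming toy aligned data and hypothetical word and POS tables: a word-level transformation probability is used when the document-style word was seen in the parallel corpus; otherwise the model backs off to statistics pooled by POS.

```python
from collections import defaultdict

# Toy parallel data: (document-style word, spoken-style word, POS).
# The empty string marks a deletion in the spoken style.
aligned = [("budget", "budget", "NOUN"),
           ("shall", "will", "AUX"),
           ("shall", "shall", "AUX"),
           ("hereby", "", "ADV")]

word_tab = defaultdict(lambda: defaultdict(int))
pos_tab = defaultdict(lambda: defaultdict(int))
for doc_w, spk_w, pos in aligned:
    word_tab[doc_w][spk_w] += 1
    pos_tab[pos][spk_w == doc_w] += 1   # kept vs. rewritten, per POS

def p_transform(doc_w, spk_w, pos):
    """P(spoken form | document form), backing off from word to POS."""
    if doc_w in word_tab:
        hist = word_tab[doc_w]
        return hist[spk_w] / sum(hist.values())
    hist = pos_tab[pos]                 # unseen word: POS-level estimate
    total = sum(hist.values())
    return hist[spk_w == doc_w] / total if total else 0.0

print(p_transform("shall", "will", "AUX"))   # word-level estimate: 0.5
print(p_transform("must", "must", "AUX"))    # POS-level back-off
```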
IEEE Transactions on Audio, Speech, and Language Processing | 2010
Yuya Akita; Tatsuya Kawahara
We propose a novel approach based on a statistical transformation framework for language and pronunciation modeling of spontaneous speech. Since it is not practical to train a spoken-style model on large volumes of spoken transcripts, the proposed approach generates a spoken-style model by transforming an orthographic model trained on document archives such as the minutes of meetings and the proceedings of lectures. The transformation is based on a statistical model estimated from a small parallel corpus, which consists of faithful transcripts aligned with their orthographic documents. Patterns of transformation, such as substitution, deletion, and insertion of words, are extracted together with their word and part-of-speech (POS) contexts, and transformation probabilities are estimated from occurrence statistics in the parallel aligned corpus. For pronunciation modeling, subword-based mappings between baseforms and surface forms are extracted with their occurrence counts, and a set of rewrite rules with their probabilities is derived as a transformation model. Spoken-style language and pronunciation (surface-form) models can then be predicted by applying these transformation patterns to a document-style language model and to the baseforms in a lexicon, respectively. The transformed models significantly reduced perplexity and word error rates (WERs) in a task of transcribing congressional meetings, even though the domains and topics differed from those of the parallel corpus. This result demonstrates the generality and portability of the proposed framework.
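The pronunciation side of the framework can be pictured as probabilistic rewrite rules over phone subsequences. The sketch below enumerates surface forms of a baseform by applying such rules left to right; the rule, phones, and probabilities are invented for illustration.

```python
# Hypothetical rules: a phone subsequence maps to alternative surface
# realizations with probabilities (alternatives per key sum to 1).
RULES = {("t", "a"): [(("t", "a"), 0.8), (("d", "a"), 0.2)]}

def surface_forms(phones, i=0, prob=1.0, out=()):
    """Yield (surface phone sequence, probability) pairs by applying
    the rewrite rules in RULES to the baseform, left to right."""
    if i >= len(phones):
        yield out, prob
        return
    for pattern, rewrites in RULES.items():
        if tuple(phones[i:i + len(pattern)]) == pattern:
            for repl, p in rewrites:
                yield from surface_forms(phones, i + len(pattern),
                                         prob * p, out + repl)
            return                      # rule fired; span consumed
    yield from surface_forms(phones, i + 1, prob, out + (phones[i],))

for form, p in surface_forms(("w", "a", "t", "a", "sh", "i")):
    print(" ".join(form), p)
```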
international conference on acoustics, speech, and signal processing | 2005
Yuya Akita; Tatsuya Kawahara
Pronunciation variation modeling is one of the major issues in automatic transcription of spontaneous speech. We present statistical modeling of subword-based mappings between baseforms and surface forms using a large-scale spontaneous speech corpus (CSJ). Variation patterns of phone sequences are automatically extracted together with contexts of up to two preceding and two following phones, with the context length decided by occurrence statistics. We then derive a set of rewrite rules with their probabilities and variable-length phone contexts. The model effectively predicts pronunciation variations depending on the phone context using a back-off scheme. Since it is based on phone sequences, the model is applicable to any lexicon to generate appropriate surface forms. The proposed method was evaluated on two transcription tasks whose domains differ from the training corpus (CSJ), and a significant reduction in word error rates was achieved.
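A sketch of the back-off lookup, with a hypothetical variation table: the model first tries two phones of context on each side, then one, then none; if nothing matches, the baseform phone is kept unchanged.

```python
# Hypothetical table: (left context, phone, right context)
# -> {surface phone: probability}. Entries are made up.
TABLE = {
    (("a",), "t", ("a",)): {"t": 0.8, "d": 0.2},   # 1-phone context
    ((), "t", ()): {"t": 0.95, "d": 0.05},         # context-free
}

def variation_probs(left, phone, right):
    """Back off from 2 to 1 to 0 phones of context on each side."""
    for n in (2, 1, 0):
        key = (tuple(left[-n:]) if n else (),
               phone,
               tuple(right[:n]) if n else ())
        if key in TABLE:
            return TABLE[key]
    return {phone: 1.0}                 # unseen phone: no variation

print(variation_probs(["w", "a"], "t", ["a", "sh"]))  # 1-phone hit
```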
international conference on acoustics, speech, and signal processing | 2010
Graham Neubig; Yuya Akita; Shinsuke Mori; Tatsuya Kawahara
Automatic speech recognition (ASR) results contain not only ASR errors but also disfluencies and colloquial expressions that must be corrected to create readable transcripts. We take a statistical machine translation (SMT) approach to “translate” ASR results into transcript-style text. We introduce two novel modeling techniques in this framework: a context-dependent translation model, which uses context to model translation probabilities accurately, and log-linear interpolation of conditional and joint probabilities, which gives higher priority to frequently observed translation patterns. The system is implemented using weighted finite-state transducers (WFSTs). In an evaluation on ASR results and manual transcripts of meetings of the Japanese Diet (national congress), the proposed methods showed a significant increase in accuracy over traditional modeling techniques.
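The log-linear interpolation can be sketched as below. The weight and the toy probabilities are assumptions; in the full system these scores are composed with context-dependent models and a language model inside the WFST decoder.

```python
import math

def loglinear_score(p_cond, p_joint, lam=0.6):
    """score = lam * log P(target|source) + (1-lam) * log P(source, target).
    The joint term favors frequently observed translation patterns.
    lam and the probabilities below are illustrative values."""
    return lam * math.log(p_cond) + (1.0 - lam) * math.log(p_joint)

# Two candidate rewrites of the same source token (toy numbers):
candidates = {"going to": (0.7, 0.010), "gonna": (0.3, 0.002)}
best = max(candidates, key=lambda t: loglinear_score(*candidates[t]))
print(best)  # the frequent pattern "going to" wins
```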
international conference on acoustics, speech, and signal processing | 2009
Tatsuya Kawahara; Masato Mimura; Yuya Akita
For effective training of acoustic and language models for spontaneous speech such as meetings, it is important to exploit texts that are available on a large scale, even though they may not be faithful transcripts of the utterances. We have proposed a language model transformation scheme to cope with the differences between verbatim transcripts of spontaneous utterances and human-made transcripts such as those in proceedings. In this paper, we investigate its application to lightly supervised training of the acoustic model. By transforming the corresponding text in the proceedings, we can generate a heavily constrained model to predict the actual utterances. Experimental evaluation with the transcription system for Japanese Congress meetings demonstrated that the proposed scheme generates accurate labels for acoustic model training, and thus achieves ASR (automatic speech recognition) performance comparable to that obtained with manual transcripts.
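A much-simplified stand-in for the label-selection step: transform the proceedings text into candidate verbatim strings, then keep an utterance for acoustic-model training only if the ASR hypothesis is close enough to some candidate. The difflib similarity and the threshold are illustrative assumptions, not the paper's constrained-decoding method.

```python
import difflib

def usable_for_am_training(asr_hyp, verbatim_candidates, min_ratio=0.9):
    """Accept the utterance as training data if the ASR hypothesis
    closely matches a candidate generated from the proceedings text.
    min_ratio is a guessed threshold, not a value from the paper."""
    best = max(difflib.SequenceMatcher(None, asr_hyp, c).ratio()
               for c in verbatim_candidates)
    return best >= min_ratio

cands = ["we will now begin the session",
         "uh we will now begin the session"]
print(usable_for_am_training("uh we will now begin the session", cands))
```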
international conference on acoustics, speech, and signal processing | 2007
Yuya Akita; Tatsuya Kawahara
For language modeling of spontaneous speech, we propose a novel approach, based on the statistical machine translation framework, that transforms a document-style model into the spoken style. For better coverage and more reliable estimation, the incorporation of POS (part-of-speech) information is explored in addition to lexical information. In this paper, we investigate several methods that combine a POS-based model or integrate POS information in the ME (maximum entropy) scheme. They achieve significant reductions in perplexity and WER in a meeting transcription task. Moreover, the model is applied to different domains, i.e., committee meetings on different topics. As a result, an even larger perplexity reduction is achieved than when testing in the same domain, demonstrating the generality and portability of the model.
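One way to picture the ME combination is a maximum-entropy (logistic-regression) classifier over lexical and POS features that predicts whether a document-style word gets rewritten. The features, toy data, and use of scikit-learn here are all assumptions for illustration, not the paper's setup.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training events: features of a document-style token -> rewritten?
X = [{"word=shall": 1, "pos=AUX": 1},
     {"word=hereby": 1, "pos=ADV": 1},
     {"word=budget": 1, "pos=NOUN": 1},
     {"word=committee": 1, "pos=NOUN": 1}]
y = [1, 1, 0, 0]

vec = DictVectorizer()
me_model = LogisticRegression().fit(vec.fit_transform(X), y)

# An unseen word still gets a sensible score through its POS feature,
# which is the coverage benefit the abstract describes.
probe = vec.transform([{"word=must": 1, "pos=AUX": 1}])
print(me_model.predict_proba(probe)[0, 1])
```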
international symposium on chinese spoken language processing | 2014
Sheng Li; Yuya Akita; Tatsuya Kawahara
The paper introduces our project on automatic speech recognition (ASR) of Chinese lectures. For a comprehensive study of spontaneous Chinese, we compile the Corpus of Chinese Lecture Room (CCLR), which includes both faithful transcripts and caption texts. Based on the annotated alignment of these texts, we analyze linguistic phenomena of spontaneous Chinese speech. We also develop a baseline ASR system with this corpus and refine it within the DNN-HMM framework. By exploiting lecture data without faithful transcripts and conducting unsupervised speaker adaptation, a significant improvement in ASR accuracy is achieved.
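In a hybrid DNN-HMM system like the one described, decoding typically uses scaled likelihoods obtained by dividing the DNN's state posteriors by the state priors. A minimal sketch, with a made-up posterior matrix and priors:

```python
import numpy as np

def scaled_log_likelihoods(posteriors, state_priors, floor=1e-10):
    """Hybrid DNN-HMM conversion, frame by frame:
    log p(x|s) + const = log p(s|x) - log p(s)."""
    post = np.maximum(posteriors, floor)   # floor avoids log(0)
    return np.log(post) - np.log(state_priors)

posteriors = np.array([[0.7, 0.2, 0.1],    # one row per frame (toy)
                       [0.1, 0.8, 0.1]])
priors = np.array([0.5, 0.3, 0.2])         # toy state priors
print(scaled_log_likelihoods(posteriors, priors))
```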
Computer Speech & Language | 2012
Graham Neubig; Yuya Akita; Shinsuke Mori; Tatsuya Kawahara
This paper presents a method for automatically transforming faithful transcripts or ASR results into clean transcripts for human consumption, using a framework we call speaking style transformation (SST). We perform a detailed analysis of the types of corrections performed by human stenographers when creating clean transcripts, and propose a model that handles the majority of the most common corrections. In particular, the proposed model uses a framework of monotonic statistical machine translation to perform not only the deletion of disfluencies and the insertion of punctuation, but also the correction of colloquial expressions, the insertion of omitted words, and other transformations. We provide a detailed description of the model's implementation in the weighted finite-state transducer (WFST) framework. An evaluation of the proposed model on both faithful transcripts and speech recognition results of parliamentary and lecture speech demonstrates its effectiveness in performing the wide variety of corrections necessary for creating clean transcripts.
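A toy, greedy stand-in for the monotonic transformation (the paper's implementation composes WFSTs and searches globally with language model scores): each source token has weighted rewrite options, including deletion for disfluencies. All options and weights below are hypothetical.

```python
import math

# Hypothetical per-token options: (rewrite, log-prob); "" = delete.
OPTIONS = {"uh": [("", math.log(0.9)), ("uh", math.log(0.1))],
           "gonna": [("going to", math.log(0.7)),
                     ("gonna", math.log(0.3))]}

def transform(tokens):
    """Greedy monotonic speaking-style transformation: per token,
    pick the highest-scoring rewrite, dropping deleted tokens."""
    out = []
    for tok in tokens:
        rewrite, _ = max(OPTIONS.get(tok, [(tok, 0.0)]),
                         key=lambda c: c[1])
        if rewrite:
            out.append(rewrite)
    return " ".join(out)

print(transform("i am uh gonna start".split()))  # "i am going to start"
```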
IEEE Transactions on Audio, Speech, and Language Processing | 2016
Sheng Li; Yuya Akita; Tatsuya Kawahara
While the performance of ASR systems depends on the size of the training data, it is very costly to prepare accurate and faithful transcripts. In this paper, we investigate a semi-supervised training scheme that takes advantage of huge quantities of unlabeled video lecture archives, particularly for the deep neural network (DNN) acoustic model. In the proposed method, we obtain ASR hypotheses from complementary GMM- and DNN-based ASR systems. Then, a set of CRF-based classifiers is trained to select the correct hypotheses and verify the selected data. The proposed hypothesis combination yields higher-quality hypotheses than the conventional system combination method (ROVER). Moreover, compared with conventional data selection based on confidence measure scores, our method is shown to be more effective at filtering usable data. A significant improvement in ASR accuracy is achieved over the baseline system, as well as over models trained with the conventional system combination and data selection methods.
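A crude stand-in for the selection step, replacing the CRF-based classifiers with a simple agreement check between the two complementary systems; the similarity measure and threshold are guesses for illustration.

```python
import difflib

def select_utterance(gmm_hyp, dnn_hyp, min_agreement=0.9):
    """Keep an unlabeled utterance (with the DNN hypothesis as its
    training label) only if the GMM and DNN hypotheses largely agree.
    min_agreement is an assumed threshold, not the paper's criterion."""
    ratio = difflib.SequenceMatcher(None, gmm_hyp.split(),
                                    dnn_hyp.split()).ratio()
    return dnn_hyp if ratio >= min_agreement else None

print(select_utterance("the budget committee will meet",
                       "the budget committee will meet"))
```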