Hiroaki Nanjo
Kyoto University
Publication
Featured research published by Hiroaki Nanjo.
IEEE Transactions on Speech and Audio Processing | 2004
Hiroaki Nanjo; Tatsuya Kawahara
The paper addresses adaptation methods for the language model and the speaking rate (SR) of individual speakers, which are two major problems in automatic transcription of spontaneous presentation speech. To cope with a large variation in expression and pronunciation of words depending on the speaker, firstly, we investigate the effect of statistical and context-dependent pronunciation modeling. Secondly, we present unsupervised methods of language model adaptation to a specific speaker and a topic by 1) selecting similar texts based on the word perplexity and TF-IDF measure and 2) making direct use of the initial recognition result for generating an enhanced model. We confirm that all proposed adaptation methods and their combinations reduce the perplexity and word error rate. We also present a decoding strategy adapted to the SR. In spontaneous speech, SR is generally fast and may vary a lot. We also observe different error tendencies for portions of presentations where speech is fast or slow. Therefore, we propose an SR-dependent decoding strategy that applies the most appropriate acoustic analysis, phone models, and decoding parameters according to the SR. Several methods are investigated, and their selective application leads to improved accuracy. The combined effect of the two proposed adaptation methods is also confirmed in transcription of real academic presentations.
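As a rough illustration of the first adaptation step above (selecting similar texts for language model adaptation), a minimal TF-IDF-based selection might look like the following sketch. The function names and the cosine-similarity ranking are illustrative assumptions, not the paper's exact procedure, which also uses word perplexity:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, idf):
    """Raw term frequency weighted by inverse document frequency."""
    tf = Counter(doc_tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_adaptation_texts(seed_tokens, corpus, top_k=2):
    """Rank candidate documents by TF-IDF cosine similarity to a seed
    (e.g. an initial recognition result) and keep the top_k matches."""
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    idf = {w: math.log(n / df[w]) for w in df}
    seed_vec = tfidf_vector(seed_tokens, idf)
    scored = [(cosine(seed_vec, tfidf_vector(doc, idf)), doc) for doc in corpus]
    scored.sort(key=lambda p: p[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

The selected texts would then feed an adapted n-gram model; that estimation step is omitted here.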
International Conference on Acoustics, Speech, and Signal Processing | 2002
Hiroaki Nanjo; Tatsuya Kawahara
This paper addresses the problem of speaking rate in large vocabulary spontaneous speech recognition. In spontaneous lecture speech, the speaking rate is generally fast and may vary a lot within a talk. We also observed different error tendencies for fast and slow speech segments. Therefore, we first present a speaking-rate dependent decoding strategy that applies the most adequate acoustic analysis, phone models and decoding parameters according to the speaking rate. Several methods are investigated and their selective application leads to accuracy improvement. We also propose to make use of speaking-rate information in speaker adaptation, in which the different adapted models are set up for fast and slow utterances. It is confirmed that the method is more effective than normal adaptation.
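The speaking-rate dependent strategy described here can be pictured as a simple dispatcher that measures the rate of an utterance and picks a matching decoding configuration. The thresholds and parameter values below are placeholders for illustration, not the paper's tuned settings:

```python
def speaking_rate(n_phones, duration_sec):
    """Speaking rate of an utterance in phones per second."""
    return n_phones / duration_sec

def select_decoding_config(rate, slow_thresh=6.0, fast_thresh=9.0):
    """Pick decoding settings per speaking-rate class.
    Thresholds, frame shifts, and beam widths are illustrative only."""
    if rate < slow_thresh:
        return {"class": "slow", "frame_shift_ms": 12, "beam": 220}
    if rate > fast_thresh:
        return {"class": "fast", "frame_shift_ms": 8, "beam": 180}
    return {"class": "normal", "frame_shift_ms": 10, "beam": 200}
```

A real decoder would also swap in speaking-rate dependent phone models at this point; here only the parameter selection is sketched.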
IEEE Automatic Speech Recognition and Understanding Workshop | 2001
Tatsuya Kawahara; Hiroaki Nanjo; Sadaoki Furui
We introduce our extensive projects on spontaneous speech processing and current trials of lecture speech recognition. A large corpus of lecture presentations and talks is being collected in the project. We have trained initial baseline models and confirmed a significant difference between real lectures and written notes. In spontaneous lecture speech, the speaking rate is generally faster and changes a lot, which makes it harder to apply fixed segmentation and decoding settings. Therefore, we propose sequential decoding and speaking-rate dependent decoding strategies. The sequential decoder simultaneously performs automatic segmentation and decoding of input utterances. Then, the most adequate acoustic analysis, phone models and decoding parameters are applied according to the current speaking rate. These strategies achieve improvements in automatic transcription of real lecture speech.
International Conference on Acoustics, Speech, and Signal Processing | 2005
Hiroaki Nanjo; Tatsuya Kawahara
A new evaluation measure of speech recognition and a decoding strategy for keyword-based open-domain speech understanding are presented. Conventionally, WER (word error rate) has been widely used as an evaluation measure of speech recognition, which treats all words in a uniform manner. We define a weighted keyword error rate (WKER) which gives a weight on errors from a viewpoint of information retrieval. We first demonstrate that this measure is more appropriate for predicting the performance of key sentence indexing of oral presentations. Then, we formulate a decoding method to minimize WKER based on a minimum Bayes-risk (MBR) framework, and show that the decoding method works reasonably for improving WKER and key sentence indexing.
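A weighted error rate of this kind can be sketched as a weighted edit distance normalized by the total reference weight. The exact costing in the paper may differ; charging a substitution the larger of the two word weights, and the default weight for non-keywords, are assumptions made here for illustration:

```python
def weighted_keyword_error_rate(ref, hyp, weights, default_w=0.1):
    """Weighted keyword error rate over an edit-distance alignment.
    Each substitution/deletion/insertion is charged the weight of the
    word involved; the result is normalized by the summed reference
    weights. Costing choices are illustrative, not the paper's exact ones."""
    w = lambda t: weights.get(t, default_w)
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum weighted edit cost aligning ref[:i] to hyp[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + w(ref[i - 1])      # deletions
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + w(hyp[j - 1])      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if ref[i - 1] == hyp[j - 1] else max(w(ref[i - 1]), w(hyp[j - 1]))
            d[i][j] = min(d[i - 1][j - 1] + sub,
                          d[i - 1][j] + w(ref[i - 1]),
                          d[i][j - 1] + w(hyp[j - 1]))
    total = sum(w(t) for t in ref)
    return d[n][m] / total if total else 0.0
```

With uniform weights this reduces to ordinary WER; with large keyword weights, missing a keyword dominates the score, which is the behavior the measure is after.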
Journal of Information Processing | 2009
Tomoyosi Akiba; Kiyoaki Aikawa; Yoshiaki Itoh; Tatsuya Kawahara; Hiroaki Nanjo; Hiromitsu Nishizaki; Norihito Yasuda; Yoichi Yamashita; Katunobu Itou
Lectures are one of the most valuable genres of audiovisual data. Though spoken document processing is a promising technology for utilizing lectures in various ways, it is difficult to evaluate because the evaluation requires subjective judgment and/or the verification of large quantities of evaluation data. In this paper, a test collection for the evaluation of spoken lecture retrieval is reported. The test collection consists of the target spoken documents of about 2,700 lectures (604 hours) taken from the Corpus of Spontaneous Japanese (CSJ), 39 retrieval queries, the relevant passages in the target documents for each query, and the automatic transcription of the target speech data. This paper also reports the retrieval performance on the constructed test collection obtained by applying a standard spoken document retrieval (SDR) method, which serves as a baseline for forthcoming SDR studies using the test collection.
International Symposium on Consumer Electronics | 2009
Hiroaki Nanjo; Hiroki Mikami; Suguru Kunimatsu; Hiroshi Kawano; Takanobu Nishiura
A novel speech interface for computer game systems is addressed. In typical computer game systems, users input game commands with touch devices such as gamepads, keyboards, touch panels and foot boards. Acceleration sensors that can detect players' motion have also become common. Some systems have a speech interface for controlling games, namely, a voice controller. A simple speech interface just detects loud sound signals, and some game systems can recognize what users say with an automatic speech recognition (ASR) module. Although ASR works well for polite speech, it does not work well for excited speech such as shouts by excited game players. In this paper, we focus on a speech interface that can deal with excited speech, and describe ASR that can discriminate shouts from natural speech based on the understanding of speech events.
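The shout/normal-speech discrimination itself is not specified in the abstract; as a deliberately crude stand-in, one could flag an utterance as a shout when most of its frames exceed an energy threshold. All thresholds and the frame representation below are assumptions, and a real discriminator would use richer acoustic features:

```python
import math

def rms(frame):
    """Root-mean-square energy of one frame of samples in [-1, 1]."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def is_shout(frames, energy_thresh=0.3, min_ratio=0.6):
    """Flag an utterance as a shout when at least min_ratio of its
    frames exceed energy_thresh. A toy heuristic, not the paper's
    shout-discrimination method; threshold values are illustrative."""
    loud = sum(1 for f in frames if rms(f) > energy_thresh)
    return loud / len(frames) >= min_ratio
```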
Information Assurance and Security | 2009
Hiroaki Nanjo; Takanobu Nishiura; Hiroshi Kawano
We have been investigating a speech processing system for ensuring safety and security, namely, an acoustic-based security system. Focusing on indoor security, we have been studying an advanced security system that can discriminate emergency shouts from other acoustic events based on automatic understanding of speech events. In this paper, we present our investigations and describe fundamental results.
International Conference on Acoustics, Speech, and Signal Processing | 2008
Takashi Shichiri; Hiroaki Nanjo; Takehiko Yoshimi
This paper addresses automatic speech recognition (ASR) oriented to speech-based information retrieval (IR). Since the significance of words differs in IR, ASR for IR should be evaluated based on a weighted word error rate (WWER), which gives a different weight to each word recognition error from the viewpoint of IR, instead of the word error rate (WER), which treats all words uniformly. In this paper, we first discuss an automatic estimation method for word significance (weights), and then we perform ASR based on the Minimum Bayes-Risk framework using the estimated word significance, and show that the ASR approach that minimizes the WWER calculated from the estimated word weights is effective for speech-based IR.
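The Minimum Bayes-Risk step can be illustrated over an N-best list: instead of taking the top hypothesis, pick the one with the lowest expected loss under the posterior distribution. This is a generic MBR rescoring sketch under the assumption of N-best (rather than lattice) decoding; the loss function would be the weighted word error of interest:

```python
def mbr_select(nbest, loss):
    """Pick from an N-best list the hypothesis with minimum expected loss.
    nbest: list of (hypothesis, posterior) pairs, posteriors summing to 1.
    loss: loss(hyp, ref), e.g. a weighted word error between hypotheses."""
    best, best_risk = None, float("inf")
    for hyp, _ in nbest:
        # Expected loss of hyp, treating each list entry as a possible reference
        risk = sum(p * loss(hyp, ref) for ref, p in nbest)
        if risk < best_risk:
            best, best_risk = hyp, risk
    return best
```

With a WWER-style loss, errors on high-weight words are penalized more, steering the selection toward hypotheses that preserve the words that matter for retrieval.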
Journal of the Acoustical Society of America | 2006
Hiroaki Nanjo; Tatsuya Kawahara
A computer-assisted speech transcription (CAST) system for making archives such as meeting minutes and lecture notes is addressed. For such a system, automatic speech recognition (ASR) of spontaneous speech is promising, but ASR results of spontaneous speech contain many errors. Moreover, the ASR errors are essentially inevitable. Therefore, it is important to design a good interface with which users can correct errors easily in order to take advantage of ASR for making speech archives. It is from these points of view that our CAST system is designed. Specifically, the system has three correction interfaces: (1) a pointing device for selection from competitive candidates, (2) a microphone for respeaking, and (3) a keyboard. One of the most significant correction methods is selection from competitive candidates, thus more accurate competitive candidates are required. Therefore, generation methods for competitive candidates are discussed. Then, a speech recognition strategy (decoding method) based on the mini...
International Conference on Acoustics, Speech, and Signal Processing | 2004
Hiroaki Nanjo; Tasuku Kitade; Tatsuya Kawahara
Automatic extraction of key sentences from lecture audio archives is addressed. The method makes use of the characteristic expressions used in the initial utterances of sections, which are defined as discourse markers and derived in a totally unsupervised manner based on word statistics. The statistics of the presumed discourse markers are then used to define the importance of sentences. This is also combined with the conventional tf-idf measure of content words. Experimental results using a large corpus of lectures confirm the effectiveness of the method based on the discourse markers and its combination with the keyword-based method. It is also shown that the method is robust against ASR errors, while sentence segmentation accuracy is more critical. Thus, we also enhance segmentation by incorporating prosodic information.
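The combination of a discourse-marker cue with a tf-idf content score can be sketched as a simple linear interpolation per sentence. The marker score (fraction of marker tokens), the averaging, and the interpolation weight are all illustrative assumptions, not the paper's actual scoring:

```python
import math
from collections import Counter

def sentence_importance(sentences, discourse_markers, alpha=0.5):
    """Score each sentence (a token list) by interpolating a
    discourse-marker cue score with an average tf-idf content score.
    alpha is an illustrative interpolation weight, not a tuned value."""
    n = len(sentences)
    df = Counter(w for s in sentences for w in set(s))
    idf = {w: math.log(n / df[w]) for w in df}
    scores = []
    for s in sentences:
        if not s:
            scores.append(0.0)
            continue
        marker = sum(1 for w in s if w in discourse_markers) / len(s)
        content = sum(idf[w] for w in s) / len(s)
        scores.append(alpha * marker + (1 - alpha) * content)
    return scores
```

Sentences opening a section would then rank high through the marker term even when their content words are common, which is the intended effect of the combination.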