Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Naoyuki Kanda is active.

Publication


Featured research published by Naoyuki Kanda.


Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) | 2006

Multi-Domain Spoken Dialogue System with Extensibility and Robustness against Speech Recognition Errors

Kazunori Komatani; Naoyuki Kanda; Mikio Nakano; Kazuhiro Nakadai; Hiroshi Tsujino; Tetsuya Ogata; Hiroshi G. Okuno

We developed a multi-domain spoken dialogue system that can handle user requests across multiple domains. Such systems need to satisfy two requirements: extensibility and robustness against speech recognition errors. Extensibility is required to allow domains to be modified or added independently of other domains. Robustness against speech recognition errors is required because such errors are inevitable, yet the system should still behave appropriately even when its inputs are erroneous. Our system was constructed on an extensible architecture and is equipped with a robust and extensible domain selection method. Domain selection was based on three choices: (I) the previous domain, (II) the domain in which the speech recognition result can be accepted with the highest recognition score, and (III) other domains. With the newly introduced third choice, our system can prevent dialogues from remaining stuck in an erroneous domain. Our experimental results, obtained with 10 subjects, showed that our method reduced domain selection errors by 18.3% compared to a conventional method.
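
A minimal sketch of the three-way domain selection described above (the function name, inputs, and the stuck-turn heuristic are illustrative assumptions; the paper's actual selection is more sophisticated):

# Illustrative three-way domain selection; not the authors' implementation.
# asr_scores: recognition score of the best hypothesis under each domain.
def select_domain(prev_domain, asr_scores, stuck_turns, stuck_limit=2):
    best = max(asr_scores, key=asr_scores.get)
    if stuck_turns >= stuck_limit and best == prev_domain:
        # Choice (III): the dialogue appears stuck in prev_domain even though
        # it keeps winning on score, so deliberately consider other domains.
        others = {d: s for d, s in asr_scores.items() if d != prev_domain}
        return max(others, key=others.get)
    if best == prev_domain:
        return prev_domain  # choice (I): keep the previous domain
    return best             # choice (II): domain with the highest score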


IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) | 2005

A two-layer model for behavior and dialogue planning in conversational service robots

Mikio Nakano; Yuji Hasegawa; Kazuhiro Nakadai; Takahiro Nakamura; Johane Takeuchi; Toyotaka Torii; Hiroshi Tsujino; Naoyuki Kanda; Hiroshi G. Okuno

This paper presents a model for the behavior and dialogue planning module of conversational service robots. Most previously built conversational robots cannot perform the dialogue management necessary for accurately recognizing human intentions and providing information to humans. This model integrates robot behavior planning models with spoken dialogue management that is robust enough to engage in mixed-initiative dialogues in specific domains. It has two layers: the upper layer is responsible for global task planning using hierarchical planning, and the lower layer performs local planning using modules called experts, each specialized for certain kinds of tasks involving physical actions and dialogues. This model enables switching and canceling tasks based on recognized human intentions. A preliminary implementation of the model, integrated with Honda ASIMO, has shown its effectiveness.
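
A toy rendering of the two-layer structure (class and method names are assumptions for illustration, not the paper's API):

# Upper layer: global task planning. Lower layer: "experts" for local planning.
class Expert:
    def __init__(self, name, intentions):
        self.name = name
        self.intentions = set(intentions)  # intentions this expert can serve

    def step(self, observation):
        # Local planning: decide the next utterance or physical action.
        return f"{self.name} handles {observation}"

class GlobalPlanner:
    def __init__(self, experts):
        self.experts = experts
        self.active = None

    def on_intention(self, intention):
        # Hierarchical planning reduced to dispatch: switching to a new expert
        # based on a recognized human intention implicitly cancels the old task.
        for expert in self.experts:
            if intention in expert.intentions:
                self.active = expert
                return

planner = GlobalPlanner([Expert("guide", {"guide_me"}), Expert("fetch", {"bring_item"})])
planner.on_intention("bring_item")
print(planner.active.step("user at reception"))  # -> fetch handles user at reception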


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2013

Elastic spectral distortion for low resource speech recognition with deep neural networks

Naoyuki Kanda; Ryu Takeda; Yasunari Obuchi

An acoustic model based on hidden Markov models with deep neural networks (DNN-HMM) has recently been proposed and has achieved high recognition accuracy. In this paper, we investigated an elastic spectral distortion method that artificially augments training samples to help DNN-HMMs acquire sufficient robustness even when only a limited number of training samples is available. We investigated three distortion methods (vocal tract length distortion, speech rate distortion, and frequency-axis random distortion) and evaluated them on Japanese lecture recordings. In a large-vocabulary continuous speech recognition task with only 10 hours of training samples, a DNN-HMM trained with the elastic spectral distortion method achieved a 10.1% relative word error reduction compared with a normally trained DNN-HMM.
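
A rough sketch of one of the three distortions, frequency-axis warping of a spectrogram (the warping function and parameter range are assumptions; the paper's exact distortions may differ):

import numpy as np

def frequency_warp(spectrogram, max_warp=0.1, rng=None):
    """Resample each frame of a (n_frames, n_bins) spectrogram along a
    smoothly warped frequency axis to create an augmented training sample."""
    rng = rng or np.random.default_rng()
    n_frames, n_bins = spectrogram.shape
    base = np.linspace(0.0, 1.0, n_bins)
    # One random warping exponent per utterance keeps the distortion smooth;
    # base ** exponent maps [0, 1] onto [0, 1] and stays strictly increasing.
    exponent = 1.0 + rng.uniform(-max_warp, max_warp)
    warped = base ** exponent
    out = np.empty_like(spectrogram)
    for t in range(n_frames):
        out[t] = np.interp(base, warped, spectrogram[t])
    return out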


Knowledge-Based Systems | 2011

A multi-expert model for dialogue and behavior control of conversational robots and agents

Mikio Nakano; Yuji Hasegawa; Kotaro Funakoshi; Johane Takeuchi; Toyotaka Torii; Kazuhiro Nakadai; Naoyuki Kanda; Kazunori Komatani; Hiroshi G. Okuno; Hiroshi Tsujino

This paper presents an intelligence model for conversational service robots. It employs modules called experts, each of which is specialized to execute certain kinds of tasks, such as performing physical behaviors or engaging in dialogues. Some of the experts take charge of understanding human utterances and deciding robot utterances or actions. The model enables switching and canceling tasks based on recognized human intentions, as well as parallel execution of several tasks. The model specifies the interface that an expert must have, and any kind of expert can be employed if it conforms to the interface. This feature makes the model extensible.
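
One way to picture the interface contract that makes the model extensible (method names are hypothetical; the paper defines its own interface):

from abc import ABC, abstractmethod

class Expert(ABC):
    """Any module conforming to this interface can be plugged into the model."""

    @abstractmethod
    def understand(self, utterance):
        """Interpret a human utterance within this expert's task."""

    @abstractmethod
    def act(self):
        """Produce the next robot utterance or physical action."""

    @abstractmethod
    def cancel(self):
        """Abort the current task, e.g., when the human's intention changes."""

class GreetingExpert(Expert):
    def understand(self, utterance):
        return "hello" in utterance.lower()

    def act(self):
        return "Hello! How can I help you?"

    def cancel(self):
        pass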


IEEE Workshop on Multimedia Signal Processing (MMSP) | 2008

Open-vocabulary keyword detection from super-large scale speech database

Naoyuki Kanda; Hirohiko Sagawa; Takashi Sumiyoshi; Yasunari Obuchi

This paper presents our recent attempt to build a super-large-scale spoken-term detection system, which can detect any keyword uttered in a 2,000-hour speech database within a few seconds. Three problems must be solved to achieve such a system: the system must be able to detect out-of-vocabulary (OOV) terms (the OOV problem); it has to respond to the user quickly without sacrificing search accuracy (the search speed and accuracy problem); and the pre-stored index database should be sufficiently small (the index size problem). We introduced a phoneme-based search method to detect OOV terms and combined it with an LVCSR-based method. To search for a keyword in large-scale speech databases accurately and quickly, we introduced a multistage rescoring strategy, which uses several search methods to reduce the search space in a stepwise fashion. Furthermore, we constructed an out-of-vocabulary/in-vocabulary region classifier, which allows us to reduce the size of the index database for OOVs. We describe the prototype system and present evaluation results.
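
The multistage rescoring strategy reduces to a generic funnel: cheap scorers prune a large candidate set before expensive ones run. A sketch with toy scorers standing in for the phoneme- and LVCSR-based methods (all names are assumptions):

def multistage_rescore(candidates, stages):
    """stages: (score_fn, keep) pairs ordered cheap-to-expensive; each stage
    re-ranks the surviving candidates and keeps only the top `keep`."""
    for score_fn, keep in stages:
        candidates = sorted(candidates, key=score_fn, reverse=True)[:keep]
    return candidates

hits = multistage_rescore(
    candidates=list(range(1000)),
    stages=[(lambda c: -abs(c - 42), 50),       # coarse, cheap matching
            (lambda c: -abs(c - 42) ** 2, 5)],  # fine, expensive rescoring
)
print(hits)  # the five candidates nearest 42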


Conference of the International Speech Communication Association (INTERSPEECH) | 2016

Maximum a Posteriori Based Decoding for CTC Acoustic Models

Naoyuki Kanda; Xugang Lu; Hisashi Kawai

This paper presents a novel decoding framework for connectionist temporal classification (CTC)-based acoustic models (AMs). Although a CTC-based AM inherently has the property of a language model (LM) in itself, an external LM trained on a large text corpus is still essential to obtain the best results. In the previous literature, a naive interpolation of the CTC-based AM score and the external LM score was used, although there is no theoretical justification for it. In this paper, we propose a theoretically more sound decoding framework derived from a maximization of the posterior probability of a word sequence given an observation. In our decoding framework, a subword LM (SLM) is newly introduced to coordinate the CTC-based AM score and the word-level LM score. In experiments with the Wall Street Journal (WSJ) corpus and the Corpus of Spontaneous Japanese (CSJ), our proposed framework consistently achieved improvements of 7.4-15.3% over the conventional interpolation-based framework. In the CSJ experiment, given 586 hours of training data, the CTC-based AM finally achieved a 6.7% better word error rate than the baseline method with deep neural networks and hidden Markov models.
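
Read from the abstract, the proposed objective plausibly takes the following form (a reconstruction, not quoted from the paper): with observation $O$, word sequence $W$, and its subword sequence $s(W)$,

\hat{W} = \arg\max_W P(W \mid O) \approx \arg\max_W \frac{P_{\mathrm{CTC}}(s(W) \mid O)}{P_{\mathrm{SLM}}(s(W))} \, P_{\mathrm{LM}}(W),

where the subword LM divides out the language-model-like prior implicit in the CTC posterior before the external word-level LM is applied; the conventional interpolation simply omits the denominator.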


IEEE Spoken Language Technology Workshop (SLT) | 2012

Using rhythmic features for Japanese spoken term detection

Naoyuki Kanda; Ryu Takeda; Yasunari Obuchi

A new rescoring method for spoken term detection (STD) is proposed. Phoneme-based close-matching techniques have been used because of their ability to detect out-of-vocabulary (OOV) queries. To improve the accuracy of phoneme-based techniques, rescoring has been used to re-rank the results of phoneme-based close matching; however, conventional rescoring techniques based on an utterance verification model still produce many false detections. To further improve accuracy, this study proposes several features representing the “naturalness” (or “abnormality”) of the durations of phonemes/syllables in detected keyword candidates. These features are incorporated into a conventional rescoring technique using logistic regression. Experimental results with a 604-hour Japanese speech corpus indicated that incorporating the rhythmic features achieved a further relative error reduction of 8.9% compared to a conventional rescoring technique.
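
A sketch of how such duration features might be built and combined (the feature definitions and the use of scikit-learn are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression

def rhythm_features(durations, mean, std):
    """Duration 'abnormality' of a detected candidate: z-scores of observed
    phoneme durations against per-phoneme duration statistics."""
    z = (np.asarray(durations) - np.asarray(mean)) / np.asarray(std)
    return [float(np.mean(np.abs(z))), float(np.max(np.abs(z)))]

# Rhythmic features are appended to the baseline verification score, and a
# logistic-regression reranker separates true hits from false detections.
X = np.array([[0.9, 0.4, 1.1],    # [baseline score, mean |z|, max |z|]
              [0.8, 2.3, 5.7]])
y = np.array([1, 0])              # 1 = true detection, 0 = false detection
reranker = LogisticRegression().fit(X, y)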


IEEE Transactions on Audio, Speech, and Language Processing | 2017

Maximum-a-Posteriori-Based Decoding for End-to-End Acoustic Models

Naoyuki Kanda; Xugang Lu; Hisashi Kawai

This paper presents a novel decoding framework for acoustic models (AMs) based on end-to-end neural networks (e.g., connectionist temporal classification). The end-to-end training of AMs has recently demonstrated high accuracy and efficiency in automatic speech recognition (ASR). When the trained AM is used in decoding, although a language model (LM) is implicitly involved in such an end-to-end AM, it is still essential to integrate an external LM trained on a large text corpus to achieve the best results. Most studies use a naive interpolation of the end-to-end AM score and the external LM score, although there is no theoretical justification for it. In this paper, we propose a more theoretically sound decoding framework derived from a maximization of the posterior probability of a word sequence given an observation. As a consequence of the theory, a subword LM is newly introduced to seamlessly integrate the external LM score with the end-to-end AM score. Our proposed method can be achieved by a small modification of the conventional weighted finite-state transducer-based implementation, without heavily increasing the graph size. We tested the proposed decoding framework in ASR experiments with the Wall Street Journal corpus and the Corpus of Spontaneous Japanese. The results showed that the proposed framework achieved significant and consistent improvements over the conventional interpolation-based decoding framework.
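
In the log domain the proposed integration becomes additive; a hypothesis in the WFST search could then be scored roughly as follows (weights and names are illustrative assumptions, not the paper's implementation):

def hypothesis_score(logp_am, logp_subword_lm, logp_word_lm,
                     lm_weight=1.0, slm_weight=1.0):
    """End-to-end AM score corrected by subtracting a subword-LM score
    before the external word-level LM score is added."""
    return logp_am - slm_weight * logp_subword_lm + lm_weight * logp_word_lm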


ACM Multimedia | 2018

Face-Voice Matching using Cross-modal Embeddings

Shota Horiguchi; Naoyuki Kanda; Kenji Nagamatsu

Face-voice matching is the task of finding correspondences between faces and voices. Many studies in cognitive science have confirmed the human ability to perform face-voice matching; such an ability is useful for creating natural human-machine interaction systems and in many other applications. In this paper, we propose a face-voice matching model that learns cross-modal embeddings between face images and voice characteristics. We constructed a novel FVCeleb dataset consisting of face images and utterances from 1,078 persons, selected from the MS-Celeb-1M face image dataset and the VoxCeleb audio dataset. In a two-alternative forced-choice matching task with an audio input and two same-gender face-image candidates, our model achieved 62.2% and 56.5% accuracy on FVCeleb and a subset of the GRID corpus, respectively. These results are very similar to human performance reported in cognitive science studies.
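
The two-alternative forced choice reduces to a similarity comparison in the shared embedding space. A sketch (the embedding networks are omitted, and cosine similarity is an assumption, since the abstract does not name the metric):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_2afc(voice_emb, face_emb_a, face_emb_b):
    """Pick whichever face embedding lies closer to the voice embedding."""
    return "A" if cosine(voice_emb, face_emb_a) >= cosine(voice_emb, face_emb_b) else "B"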


Conference of the International Speech Communication Association (INTERSPEECH) | 2016

Investigation of Semi-Supervised Acoustic Model Training Based on the Committee of Heterogeneous Neural Networks

Naoyuki Kanda; Shoji Harada; Xugang Lu; Hisashi Kawai

This paper investigates semi-supervised training for deep neural network-based acoustic models (AMs). In the conventional self-learning approach, a “seed-AM” is first trained on a small transcribed data set. A large untranscribed data set is then decoded using the seed-AM to create transcriptions, which are finally used to train a new AM on the entire data. Our investigation focuses on a different approach that uses additional complementary AMs to form a committee that creates labels for the untranscribed data. In particular, we investigate the use of heterogeneous neural networks as complementary AMs and the intentional exclusion of the primary seed-AM from the committee, both of which increase the chance of finding more informative training samples for the seed-AM. We evaluated these approaches in Japanese lecture recognition experiments with 50 hours of transcribed data and 190 hours of untranscribed data. In our experiments, the committee-based approach showed significant improvements in word error rate; the best method recovered 75.2% of the oracle improvement obtained with full manual transcription, whereas the conventional self-learning approach recovered only 32.7% of the oracle gain.
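
A sketch of committee-based label selection (the unanimous-agreement criterion and all names are assumptions; the paper studies several committee configurations, including excluding the seed-AM):

def committee_labels(utterances, committee):
    """Keep an untranscribed utterance for retraining only when every
    acoustic model in the committee decodes it identically."""
    selected = []
    for utt in utterances:
        hypotheses = [am.decode(utt) for am in committee]
        if all(h == hypotheses[0] for h in hypotheses):
            selected.append((utt, hypotheses[0]))
    return selected

class ToyAM:
    def __init__(self, lexicon):
        self.lexicon = lexicon
    def decode(self, utt):
        return self.lexicon.get(utt, "<unk>")

agreed = committee_labels(
    ["utt1", "utt2"],
    [ToyAM({"utt1": "hello"}), ToyAM({"utt1": "hello", "utt2": "world"})],
)
print(agreed)  # [('utt1', 'hello')]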

Collaboration


Dive into Naoyuki Kanda's collaboration.

Top Co-Authors

Hisashi Kawai

National Institute of Information and Communications Technology


Xugang Lu

National Institute of Information and Communications Technology
