Toru Nakashika
Kobe University
Publications
Featured research published by Toru Nakashika.
IEEE Transactions on Audio, Speech, and Language Processing | 2015
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
This paper presents a voice conversion (VC) method that utilizes the recently proposed probabilistic models called recurrent temporal restricted Boltzmann machines (RTRBMs). One RTRBM is used for each speaker, with the goal of capturing high-order temporal dependencies in an acoustic sequence. Our algorithm starts from the separate training of one RTRBM for a source speaker and another for a target speaker using speaker-dependent training data. Because each RTRBM attempts to discover abstractions to maximally express the training data at each time step, as well as the temporal dependencies in the training data, we expect that the models represent the linguistic-related latent features in high-order spaces. In our approach, we convert (match) features of emphasis for the source speaker to those of the target speaker using a neural network (NN), so that the entire network (consisting of the two RTRBMs and the NN) acts as a deep recurrent NN and can be fine-tuned. Using VC experiments, we confirm the high performance of our method, especially in terms of objective criteria, relative to conventional VC methods such as approaches based on Gaussian mixture models and on NNs.
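As a rough illustration of the conversion pipeline described above, the following minimal sketch (not the authors' code) encodes each source frame into high-order features with the source speaker's RTRBM recurrence, maps them to the target latent space with the connecting NN, and decodes with the target model. All weights here are random stand-ins for trained parameters, and the decoding step is a simple mean-field reconstruction with Gaussian visible units.

```python
# Illustrative sketch of the RTRBM-based conversion pipeline; parameters are
# random stand-ins for trained source/target RTRBMs and the connecting NN.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

D, H = 24, 64                                  # acoustic feature dim, latent dim

# Source / target model parameters (would be learned per speaker).
W_src, U_src, b_src = rng.normal(0, .1, (H, D)), rng.normal(0, .1, (H, H)), np.zeros(H)
W_tgt, c_tgt = rng.normal(0, .1, (H, D)), np.zeros(D)
# Connecting NN mapping source latent features to target latent features.
W_nn, b_nn = rng.normal(0, .1, (H, H)), np.zeros(H)

def convert(frames):
    """Convert a (T, D) sequence of source acoustic frames frame by frame."""
    h_prev = np.zeros(H)
    out = []
    for v in frames:
        # 1) high-order source features, with the RTRBM's temporal recurrence
        h_src = sigmoid(W_src @ v + U_src @ h_prev + b_src)
        # 2) latent-space mapping by the connecting NN
        h_tgt = sigmoid(W_nn @ h_src + b_nn)
        # 3) mean-field decoding with the target model (Gaussian visible units)
        out.append(W_tgt.T @ h_tgt + c_tgt)
        h_prev = h_src
    return np.stack(out)

converted = convert(rng.normal(size=(100, D)))   # dummy MFCC-like input
print(converted.shape)                           # (100, 24)
```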
International Conference on Acoustics, Speech, and Signal Processing | 2014
Ryo Aihara; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
We present in this paper an exemplar-based voice conversion (VC) method using a phoneme-categorized dictionary. Sparse-representation-based VC using non-negative matrix factorization (NMF) is employed for spectral conversion between different speakers. In our previous NMF-based VC method, source and target exemplars are extracted from parallel training data, i.e., the same texts uttered by the source and target speakers. The input source signal is represented using the source exemplars and their weights. The converted speech is then constructed from the target exemplars and the weights associated with the source exemplars. However, this exemplar-based approach needs to hold all the training exemplars (frames), and it may cause phoneme mismatches between input signals and selected exemplars. In this paper, in order to reduce mismatches in phoneme alignment, we propose a phoneme-categorized sub-dictionary and a dictionary selection method using NMF. By using the sub-dictionary, VC performance is improved compared to conventional NMF-based VC. The effectiveness of this method was confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method and a conventional NMF-based method.
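The core exemplar-based NMF step can be sketched as follows. This is not the authors' implementation: the dictionaries below are random stand-ins for the (phoneme-categorized) source and target exemplars taken from parallel data, and a Euclidean cost is used for brevity where the paper's formulation may differ.

```python
# Sketch of exemplar-based NMF conversion: estimate non-negative activations
# of the source spectrogram over the source exemplar dictionary, then reuse
# the same activations with the paired target exemplars.
import numpy as np

rng = np.random.default_rng(0)
F, J, T = 257, 500, 80                     # spectral bins, exemplars, frames
A_src = np.abs(rng.normal(size=(F, J)))    # source exemplar dictionary (stand-in)
A_tgt = np.abs(rng.normal(size=(F, J)))    # paired target exemplars (stand-in)
X_src = np.abs(rng.normal(size=(F, T)))    # source magnitude spectrogram

def nmf_activations(X, A, n_iter=200, eps=1e-10):
    """Multiplicative updates for H in X ~ A @ H with A held fixed."""
    H = np.abs(rng.normal(size=(A.shape[1], X.shape[1])))
    for _ in range(n_iter):
        H *= (A.T @ X) / (A.T @ A @ H + eps)
    return H

H = nmf_activations(X_src, A_src)
X_converted = A_tgt @ H                    # converted spectrogram (same weights)
print(X_converted.shape)                   # (257, 80)
```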
IEEE Transactions on Audio, Speech, and Language Processing | 2016
Toru Nakashika; Tetsuya Takiguchi; Yasuhiro Minami
In this paper, we present a voice conversion (VC) method that does not use any parallel data while training the model. VC is a technique where only speaker-specific information in source speech is converted while keeping the phonological information unchanged. Most of the existing VC methods rely on parallel data, i.e., pairs of speech data from the source and target speakers uttering the same sentences. However, the use of parallel data in training causes several problems: 1) the data used for the training are limited to the predefined sentences, 2) the trained model is only applicable to the speaker pair used in the training, and 3) mismatches in alignment may occur. Although it is, thus, fairly preferable in VC not to use parallel data, a nonparallel approach is considered difficult to learn. In our approach, we achieve nonparallel training based on a speaker adaptation technique and capturing latent phonological information. This approach assumes that speech signals are produced from a restricted Boltzmann machine-based probabilistic model, where phonological information and speaker-related information are defined explicitly. Speaker-independent and speaker-dependent parameters are simultaneously trained under speaker adaptive training. In the conversion stage, a given speech signal is decomposed into phonological and speaker-related information, the speaker-related information is replaced with that of the desired speaker, and then voice-converted speech is obtained by mixing the two. Our experimental results showed that our approach outperformed another nonparallel approach, and produced results similar, in both subjective and objective criteria, to those of the popular conventional Gaussian mixture model-based method that uses parallel data.
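The decompose-and-swap idea of the conversion stage can be illustrated with a deliberately simplified linear stand-in (not the adaptive RBM itself): each frame is treated as speaker-independent phonological features passed through speaker-dependent parameters, which are estimated from the source and then replaced by the target speaker's.

```python
# Minimal sketch of decompose-and-swap conversion under a linear stand-in
# model v = A_s z + b_s; all matrices are random placeholders for trained
# speaker-dependent parameters.
import numpy as np

rng = np.random.default_rng(0)
D, K = 40, 40                               # acoustic dim, latent phonological dim
A_src, b_src = rng.normal(size=(D, K)), rng.normal(size=D)   # source speaker
A_tgt, b_tgt = rng.normal(size=(D, K)), rng.normal(size=D)   # target speaker

def convert_frame(v):
    # decompose: least-squares estimate of the phonological features
    z, *_ = np.linalg.lstsq(A_src, v - b_src, rcond=None)
    # swap speaker-related information and regenerate the frame
    return A_tgt @ z + b_tgt

v_src = rng.normal(size=D)                  # one source acoustic frame
print(convert_frame(v_src).shape)           # (40,)
```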
International Conference on Acoustics, Speech, and Signal Processing | 2014
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
In this paper, we present a voice conversion (VC) method that utilizes conditional restricted Boltzmann machines (CRBMs) for each speaker to obtain time-invariant speaker-independent spaces where voice features are converted more easily than those in an original acoustic feature space. First, we train two CRBMs for a source and target speaker independently using speaker-dependent training data (without the need to parallelize the training data). Then, a small number of parallel data are fed into each CRBM and the high-order features produced by the CRBMs are used to train a concatenating neural network (NN) between the two CRBMs. Finally, the entire network (the two CRBMs and the NN) is fine-tuned using the acoustic parallel data. Through voice-conversion experiments, we confirmed the high performance of our method in terms of objective and subjective evaluations, comparing it with conventional GMM, NN, and speaker-dependent DBN approaches.
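The inference step that produces the high-order, time-invariant features can be sketched as below, assuming a conditional RBM whose hidden activation depends on the current frame and a short history of past frames. The parameters are random stand-ins for a trained speaker-dependent CRBM.

```python
# Hedged sketch of a conditional RBM's hidden-feature inference.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
D, H, P = 24, 64, 2                       # feature dim, hidden dim, history length

W = rng.normal(0, .1, (H, D))             # static (current-frame) weights
B = rng.normal(0, .1, (H, P * D))         # autoregressive weights on the history
b = np.zeros(H)                           # hidden biases

def crbm_hidden(v_t, v_hist):
    """High-order features for frame v_t given the concatenated history v_hist."""
    return sigmoid(W @ v_t + B @ v_hist + b)

h = crbm_hidden(rng.normal(size=D), rng.normal(size=P * D))
print(h.shape)                            # (64,)
```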
International Conference on Multimedia Retrieval | 2015
Jinhui Chen; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Our research focuses on feature descriptors for robust and effective visual computing, proposing a novel feature representation method, namely rotation-invariant histograms of oriented gradients (Ri-HOG), for image retrieval. Most existing HOG techniques are computed on a dense grid of uniformly spaced cells and use overlapping local contrast of rectangular blocks for normalization. In contrast, we adopt annular spatial bins as cells and apply radial gradients to attain gradient binning invariance for feature extraction. In this way, the method significantly enhances HOG in terms of rotation invariance and feature description accuracy. In experiments, the proposed method is evaluated on the Corel-5k and Corel-10k datasets. The experimental results demonstrate that the proposed method is much more effective than many existing image feature descriptors for content-based image retrieval.
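The two ingredients named above, radial gradient binning and annular spatial bins, can be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: gradients are projected onto radial and tangential directions with respect to the patch centre so that the orientation histogram does not change when the patch rotates, and votes are accumulated per annular ring.

```python
# Illustrative rotation-invariant HOG over annular bins.
import numpy as np

def ri_hog(patch, n_rings=4, n_orient=9):
    gy, gx = np.gradient(patch.astype(float))
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    theta = np.arctan2(ys - cy, xs - cx)           # pixel angle w.r.t. the centre
    # project gradients onto radial / tangential directions
    g_rad = gx * np.cos(theta) + gy * np.sin(theta)
    g_tan = -gx * np.sin(theta) + gy * np.cos(theta)
    mag = np.hypot(g_rad, g_tan)
    orient = np.arctan2(g_tan, g_rad)              # rotation-invariant orientation
    radius = np.hypot(ys - cy, xs - cx)
    ring = np.minimum((radius / (radius.max() + 1e-9) * n_rings).astype(int), n_rings - 1)
    obin = np.minimum(((orient + np.pi) / (2 * np.pi) * n_orient).astype(int), n_orient - 1)
    hist = np.zeros((n_rings, n_orient))
    np.add.at(hist, (ring, obin), mag)             # magnitude-weighted voting
    hist /= np.linalg.norm(hist) + 1e-9            # normalisation
    return hist.ravel()

print(ri_hog(np.random.rand(32, 32)).shape)        # (36,)
```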
European Signal Processing Conference | 2015
Yuki Takashima; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
In this paper, we investigate the recognition of speech uttered by a person with an articulation disorder resulting from athetoid cerebral palsy, based on a robust feature extraction method using pre-trained convolutive bottleneck networks (CBNs). Generally speaking, the amount of speech data obtainable from a person with an articulation disorder is limited, because speaking imposes a large burden due to strain on the speech muscles. Therefore, a trained CBN tends to overfit such a small corpus of training data. In our previous work, experimental results showed that speech recognition using features extracted from CBNs outperformed recognition using conventional features. However, the recognition accuracy strongly depends on the initial values of the convolution kernels. To prevent overfitting in the networks, we introduce in this paper a pre-training technique using a convolutional restricted Boltzmann machine (CRBM). Through word-recognition experiments, we confirmed its superiority over convolutional networks without pre-training.
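A hedged sketch of the pre-training idea follows: one step of contrastive divergence (CD-1) for a small convolutional RBM whose learned filters would later initialise the convolution kernels of the bottleneck network. The filter count, sizes, and the scalar visible bias are illustrative choices, not the paper's configuration.

```python
# One CD-1 update for a convolutional RBM on a single spectrogram patch.
import numpy as np
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

K, f = 8, 5                                    # number of filters, filter size
W = rng.normal(0, .01, (K, f, f))              # convolution kernels
b = np.zeros(K)                                # hidden biases (one per filter)
c = 0.0                                        # visible bias (scalar for brevity)
lr = 0.01

def cd1_step(v):
    """One CD-1 update on a real-valued (H, W) patch v."""
    global W, b, c
    # positive phase: hidden feature maps
    h_prob = np.array([sigmoid(correlate2d(v, W[k], mode='valid') + b[k]) for k in range(K)])
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
    # negative phase: reconstruct the visible layer, then re-infer the hiddens
    v_rec = sum(convolve2d(h_samp[k], W[k], mode='full') for k in range(K)) + c
    h_rec = np.array([sigmoid(correlate2d(v_rec, W[k], mode='valid') + b[k]) for k in range(K)])
    # gradient estimates and parameter updates
    for k in range(K):
        grad = correlate2d(v, h_prob[k], mode='valid') - correlate2d(v_rec, h_rec[k], mode='valid')
        W[k] += lr * grad
        b[k] += lr * (h_prob[k].mean() - h_rec[k].mean())
    c += lr * (v.mean() - v_rec.mean())

cd1_step(rng.normal(size=(24, 24)))            # one mel-spectrogram patch
print(W.shape)                                 # (8, 5, 5): kernels to copy into the CBN
```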
International Conference on Signal Processing | 2014
Toru Nakashika; Toshiya Yoshioka; Tetsuya Takiguchi; Yasuo Ariki; Stefan Duffner; Christophe Garcia
In this paper, we investigate the recognition of speech produced by a person with an articulation disorder resulting from athetoid cerebral palsy. The articulation of the first spoken words tends to become unstable due to strain on the speech muscles, which degrades the performance of traditional speech recognition systems. Therefore, we propose a robust feature extraction method using a convolutive bottleneck network (CBN) instead of the well-known MFCC. The CBN stacks multiple layers of various types, such as a convolution layer, a subsampling layer, and a bottleneck layer, forming a deep network. Applying the CBN to feature extraction for dysarthric speech, we expect that the CBN will reduce the influence of the unstable speaking style caused by the athetoid symptoms. We confirmed its effectiveness through word-recognition experiments, in which the CBN-based feature extraction method outperformed the conventional feature extraction method.
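As a rough architectural sketch (with assumed input shape, filter counts, and layer sizes rather than the paper's exact configuration), a convolutive bottleneck network stacks convolution and subsampling layers followed by a narrow bottleneck layer, whose activations replace MFCCs as the recognition features.

```python
# Sketch of a convolutive bottleneck network and of extracting its
# bottleneck activations as features.
import torch
import torch.nn as nn

class CBN(nn.Module):
    def __init__(self, n_classes=54, bottleneck=30):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.bottleneck = nn.Linear(32 * 4 * 4, bottleneck)   # narrow layer
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(bottleneck, n_classes))

    def forward(self, x):
        z = self.bottleneck(self.features(x))   # bottleneck features
        return self.classifier(z), z

model = CBN()
spec = torch.randn(8, 1, 28, 28)                # batch of spectrogram patches
logits, feats = model(spec)
print(feats.shape)                              # torch.Size([8, 30]): features fed to the recogniser
```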
Signal-Image Technology and Internet-Based Systems | 2013
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Super-resolution technology, which restores high-frequency information given a low-resolution image, has attracted much attention in recent years. Various super-resolution algorithms have been proposed so far: example-based approaches, sparse-coding-based approaches, GMM (Gaussian Mixture Model), BPLP (Back Projection for Lost Pixels), and so on. Most of these statistical approaches rely on training (or simply preparing) the correspondence relationships between low-resolution and high-resolution images. In this paper, we propose a novel super-resolution method that is based on a statistical model but does not require any pairs of low- and high-resolution images in the database. In our approach, Deep Belief Nets are used to restore high-frequency information from a low-resolution image. The idea is that, using only high-resolution images, the trained networks seek the high-order dependencies among the observed nodes (each spatial frequency: e.g., high and low frequencies). Experimental results show the high performance of our proposed method.
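The restoration idea can be sketched with a single RBM standing in for the deep belief net: the visible units hold low- and high-frequency coefficients of a patch, and at test time the low-frequency part is clamped to the input while the high-frequency part is inferred by a few mean-field Gibbs steps. Weights below are random stand-ins for a model trained on high-resolution patches only, and the split of dimensions is assumed.

```python
# Clamped inference of missing high-frequency coefficients in an RBM.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

D_lo, D_hi, H = 64, 192, 256                  # low-/high-frequency dims, hidden dim
D = D_lo + D_hi
W, b, c = rng.normal(0, .05, (H, D)), np.zeros(D), np.zeros(H)

def restore(v_lo, n_steps=20):
    """Infer high-frequency coefficients given the observed low-frequency ones."""
    v = np.concatenate([v_lo, np.zeros(D_hi)])
    for _ in range(n_steps):
        h = sigmoid(W @ v + c)                # hidden mean-field update
        v = W.T @ h + b                       # visible mean-field update (Gaussian units)
        v[:D_lo] = v_lo                       # clamp the observed low-frequency part
    return v[D_lo:]                           # estimated high-frequency coefficients

print(restore(rng.normal(size=D_lo)).shape)   # (192,)
```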
International Conference on Acoustics, Speech, and Signal Processing | 2016
Toru Nakashika; Yasuhiro Minami
In this paper, we present a voice conversion (VC) method that does not use any parallel data while training the model. VC is a technique where only speaker-specific information in source speech is converted while keeping the phonological information unchanged. Most of the existing VC methods rely on parallel data, i.e., pairs of speech data from the source and target speakers uttering the same sentences. However, the use of parallel data in training causes several problems: 1) the data used for the training are limited to the predefined sentences, 2) the trained model is only applicable to the speaker pair used in the training, and 3) mismatches in alignment may occur. Although it is, thus, fairly preferable in VC not to use parallel data, a non-parallel approach is considered difficult to learn. In our approach, we realize non-parallel training based on speaker-adaptive training (SAT). Speech signals are represented using a probabilistic model based on the Boltzmann machine that defines phonological information and speaker-related information explicitly. Speaker-independent (SI) and speaker-dependent (SD) parameters are simultaneously trained using SAT. In the conversion stage, a given speech signal is decomposed into phonological and speaker-related information, the speaker-related information is replaced with that of the desired speaker, and then voice-converted speech is obtained by mixing the two. Our experimental results showed that our approach unfortunately fell short of the popular conventional GMM-based method that used parallel data, but outperformed the conventional non-parallel approach.
International Conference on Pattern Recognition | 2014
Toru Nakashika; Takafumi Hori; Tetsuya Takiguchi; Yasuo Ariki
Recently introduced high-accuracy RGB-D cameras are capable of easily providing high-quality three-dimensional information (color and depth). The overall shape of an object can be understood by acquiring depth information. However, conventional methods that adopt this camera use depth information only to extract local features. To improve object recognition accuracy, our approach expresses the overall object shape by a depth spatial pyramid based on depth information. In more detail, multiple features within each sub-region of the depth spatial pyramid are pooled; as a result, a feature representation that includes the depth topological information is constructed. We use histograms of oriented normal vectors (HONV), designed to capture local geometric characteristics, as 3D local features, and locality-constrained linear coding (LLC) to project each descriptor into its local coordinate system. In image recognition experiments, the proposed method improved the recognition rate compared with conventional methods.
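The pooling step can be sketched as below, under assumed layout choices that are not taken from the paper: each keypoint's coded descriptor (LLC codes are random stand-ins here) is max-pooled inside sub-regions obtained by quantising its depth at several pyramid levels, and the per-region vectors are concatenated into one representation.

```python
# Illustrative depth spatial pyramid pooling of coded local descriptors.
import numpy as np

rng = np.random.default_rng(0)
N, C = 500, 1024                              # keypoints, codebook size
codes = np.abs(rng.normal(size=(N, C)))       # LLC codes of HONV descriptors (stand-in)
depths = rng.uniform(0.5, 4.0, size=N)        # depth of each keypoint, in metres

def depth_pyramid_pool(codes, depths, levels=(1, 2, 4)):
    pooled = []
    d_min, d_max = depths.min(), depths.max() + 1e-9
    for n_bins in levels:
        bins = np.minimum(((depths - d_min) / (d_max - d_min) * n_bins).astype(int), n_bins - 1)
        for k in range(n_bins):
            sel = codes[bins == k]
            pooled.append(sel.max(axis=0) if len(sel) else np.zeros(codes.shape[1]))
    return np.concatenate(pooled)             # (1 + 2 + 4) * C dimensions

feat = depth_pyramid_pool(codes, depths)
print(feat.shape)                             # (7168,)
```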