
Publication


Featured research published by Koichi Shinoda.


Spoken Language Technology Workshop | 2014

Speaker adaptation of deep neural networks using a hierarchy of output layers

Ryan Price; Ken-ichi Iso; Koichi Shinoda

Deep neural networks (DNNs) used for acoustic modeling in speech recognition often have a very large number of output units corresponding to context-dependent (CD) triphone HMM states. The amount of data available for speaker adaptation is often limited, so a large majority of these CD states may not be observed during adaptation. In this case, the posterior probabilities of unseen CD states are only pushed towards zero during DNN speaker adaptation, and the ability to predict these states can be degraded relative to the speaker-independent network. We address this problem by appending an additional output layer which maps the original set of DNN output classes to a smaller set of phonetic classes (e.g. monophones), thereby reducing the occurrences of unseen states in the adaptation data. Adaptation proceeds by backpropagation of errors from the new output layer, which is disregarded at recognition time when posterior probabilities over the original set of CD states are used. We demonstrate the benefits of this approach over adapting the network with the original set of CD states using experiments on a Japanese voice search task and obtain a 5.03% relative reduction in character error rate with approximately 60 seconds of adaptation data.
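As an illustration of the adaptation scheme described above, the following sketch collapses CD-state posteriors onto monophone classes through a fixed mapping layer and backpropagates a cross-entropy loss through it into the original output weights. All sizes, the random CD-to-monophone mapping, and the learning rate are hypothetical, and only the final weight matrix is updated here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2,000 CD-state outputs collapsed onto 40 monophone classes.
n_cd, n_mono, n_hidden = 2000, 40, 256

# Fixed (non-trainable) mapping layer: M[i, j] = 1 if CD state i belongs to monophone j.
cd_to_mono = rng.integers(0, n_mono, size=n_cd)
M = np.zeros((n_cd, n_mono))
M[np.arange(n_cd), cd_to_mono] = 1.0

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def adapt_step(W, h, mono_labels, lr=0.1):
    """One adaptation step: cross-entropy on the appended monophone layer,
    backpropagated into the CD-state output weights W (hidden -> CD states)."""
    batch = len(mono_labels)
    p_cd = softmax(h @ W)                 # posteriors over CD states
    p_mono = p_cd @ M                     # appended layer: pooled monophone posteriors
    one_hot = np.zeros_like(p_mono)
    one_hot[np.arange(batch), mono_labels] = 1.0
    g_pmono = -one_hot / (p_mono + 1e-12) / batch   # dLoss/dp_mono
    g_pcd = g_pmono @ M.T                           # route gradient through the mapping
    g_logits = p_cd * (g_pcd - (p_cd * g_pcd).sum(axis=1, keepdims=True))
    return W - lr * h.T @ g_logits        # at recognition time only p_cd is used

W = rng.normal(scale=0.01, size=(n_hidden, n_cd))
h = rng.normal(size=(8, n_hidden))        # toy batch of last-hidden activations
labels = rng.integers(0, n_mono, size=8)  # monophone targets from adaptation data
W = adapt_step(W, h, labels)
```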


Asian Conference on Computer Vision | 2014

Spectral Graph Skeletons for 3D Action Recognition

Tommi Kerola; Nakamasa Inoue; Koichi Shinoda

We present spectral graph skeletons (SGS), a novel graph-based method for action recognition from depth cameras. The contribution of this paper is to leverage a spectral graph wavelet transform (SGWT) for creating an overcomplete representation of an action signal lying on a 3D skeleton graph. The resulting SGS descriptor is efficiently computable in time linear in the action sequence length. We investigate the suitability of our method by experiments on three publicly available datasets, resulting in performance comparable to state-of-the-art action recognition approaches. Namely, our method achieves 91.4% accuracy on the challenging MSRAction3D dataset in the cross-subject setting. SGS also achieves 96.0% and 98.8% accuracy on the MSRActionPairs3D and UCF-Kinect datasets, respectively. While this study focuses on action recognition, the proposed framework can in general be applied to any time series of graphs.
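A minimal sketch of the SGWT idea on a toy skeleton graph is shown below: the graph Laplacian is eigendecomposed, a band-pass kernel is applied at several scales, and the filtered signals are stacked into an overcomplete descriptor. The chain graph, kernel, and scales are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def graph_laplacian(adj):
    d = adj.sum(axis=1)
    return np.diag(d) - adj

def sgwt(signal, adj, scales, kernel=lambda x: x * np.exp(-x)):
    """Wavelet coefficients W(s) = U g(s*Lambda) U^T f for each scale s."""
    lam, U = np.linalg.eigh(graph_laplacian(adj))
    coeffs = []
    for s in scales:
        filt = np.diag(kernel(s * lam))          # spectral band-pass filter
        coeffs.append(U @ filt @ U.T @ signal)   # filtered signal on the graph
    return np.concatenate(coeffs, axis=0)        # overcomplete representation

# Toy 5-joint "skeleton": a simple chain (hand-elbow-shoulder-hip-knee).
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

# Signal on the graph: 3D coordinates of each joint at one frame.
coords = np.random.default_rng(0).normal(size=(5, 3))

descriptor = sgwt(coords, adj, scales=[0.5, 1.0, 2.0])
print(descriptor.shape)   # (15, 3): 3 scales x 5 joints, per coordinate
```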


ACM Multimedia | 2014

n-gram Models for Video Semantic Indexing

Nakamasa Inoue; Koichi Shinoda

We propose n-gram modeling of shot sequences for video semantic indexing, in which semantic concepts are extracted from a video shot. Most previous studies for this task have assumed that video shots in a video clip are independent of each other. We model the time dependency between them, assuming that n consecutive video shots are dependent. Our models improve the robustness against occlusion and camera-angle changes by effectively using information from the previous video shots. In our experiments on the TRECVID 2012 Semantic Indexing Benchmark, we applied the proposed models to a system using Gaussian mixture models and support vector machines. Mean average precision was improved from 30.62% to 32.14%, which, to the best of our knowledge, is the best performance on the TRECVID 2012 Semantic Indexing Task.
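The following toy sketch conveys the intuition of using the previous shots' evidence: each shot's concept scores are combined with those of the preceding n-1 shots. This is only a simple weighted combination for illustration; the paper's actual n-gram models over GMM/SVM outputs are more elaborate.

```python
import numpy as np

def ngram_shot_scores(shot_scores, n=2, weights=None):
    """Combine the current shot's concept scores with the previous n-1 shots'.

    shot_scores : (num_shots, num_concepts) per-shot detector outputs
    """
    num_shots, _ = shot_scores.shape
    if weights is None:
        weights = np.ones(n) / n          # uniform weighting over the n shots
    out = np.copy(shot_scores)
    for t in range(num_shots):
        window = shot_scores[max(0, t - n + 1): t + 1]
        w = weights[-len(window):]
        out[t] = (w[:, None] * window).sum(axis=0) / w.sum()
    return out

# Toy example: 6 shots, 3 concepts; an occluded shot (t=3) gets help from t=2.
scores = np.array([[0.9, 0.1, 0.2],
                   [0.8, 0.2, 0.1],
                   [0.7, 0.1, 0.3],
                   [0.2, 0.1, 0.2],   # occlusion drops the first concept's score
                   [0.8, 0.2, 0.1],
                   [0.9, 0.1, 0.2]])
print(ngram_shot_scores(scores, n=2))
```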


ACM Multimedia | 2015

Vocabulary Expansion Using Word Vectors for Video Semantic Indexing

Nakamasa Inoue; Koichi Shinoda

We propose vocabulary expansion for video semantic indexing. From the many semantic concept detectors obtained by using training data, we make detectors for concepts not included in the training data. First, we introduce Mikolov's word vectors to represent a word by a low-dimensional vector. Second, we represent a new concept by a weighted sum of concepts in the training data in the word vector space. Finally, we use the same weighting coefficients for combining detectors to make a new detector. In our experiments, we evaluate our methods on the TRECVID Video Semantic Indexing (SIN) Task. We train our models with Google News text documents and ImageNet images to generate new semantic detectors for the SIN task. We show that our method performs as well as SVMs trained with 100 TRECVID example videos.
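A rough sketch of the expansion step, under assumed details: the weights expressing a new concept in the word-vector space are taken here from normalized cosine similarities to the nearest trained concepts, and the same weights then combine the trained detectors' scores. The weighting rule, dimensions, and top-k choice are illustrative, not the exact scheme in the paper.

```python
import numpy as np

def expansion_weights(new_vec, known_vecs, top_k=5):
    """Weights expressing a new concept as a combination of known concepts:
    cosine similarity to the top_k nearest known concepts, normalized to one."""
    sims = known_vecs @ new_vec / (
        np.linalg.norm(known_vecs, axis=1) * np.linalg.norm(new_vec) + 1e-12)
    idx = np.argsort(sims)[-top_k:]
    w = np.zeros(len(known_vecs))
    w[idx] = np.clip(sims[idx], 0.0, None)
    return w / (w.sum() + 1e-12)

def expanded_detector_scores(detector_scores, weights):
    """Combine the existing detectors' scores with the same weights."""
    return detector_scores @ weights

rng = np.random.default_rng(0)
known_vecs = rng.normal(size=(50, 300))      # word vectors of trained concepts
new_vec = rng.normal(size=300)               # word vector of an unseen concept
w = expansion_weights(new_vec, known_vecs)

shot_scores = rng.uniform(size=(10, 50))     # 10 shots x 50 trained detectors
print(expanded_detector_scores(shot_scores, w))  # scores for the new concept
```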


Spoken Language Technology Workshop | 2014

An efficient error correction interface for speech recognition on mobile touchscreen devices

Yuan Liang; Koji Iwano; Koichi Shinoda

Correcting speech recognition errors on a mobile touchscreen device is an unavoidable but time-consuming task that requires a lot of user effort. To reduce this user effort, we previously proposed an error correction method using long context match with Web N-gram, which we combined with a simple gesture-based user interface. This method automatically replaces an error word with its corresponding correct word. However, it was evaluated only on substitution errors in sentences, each of which involved only one error. In this paper, we extend this method to more general cases in which a sentence has more than one error. It recovers not only substitution errors but also deletion errors and insertion errors. For recovering deletion errors, it predicts a deleted word based on the phonemes and the part-of-speech tags of its surrounding words. Our experimental results show that the proposed method recovered the errors more accurately and with less user effort than the conventional Word Confusion Network based error correction interface.
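A hypothetical miniature of the long-context-match idea: given an error position marked by the user, candidate words are ranked by the frequency of the surrounding context in an n-gram table. The table, candidate list, and trigram scoring below are stand-ins, not the Web N-gram resource or scoring used in the paper.

```python
from collections import Counter

# Toy n-gram table; a real system would query a large Web N-gram resource.
ngram_counts = Counter({
    ("send", "the", "report"): 120,
    ("send", "a", "report"): 95,
    ("send", "sand", "report"): 1,
})

def correct_word(words, err_idx, candidates):
    """Replace words[err_idx] with the candidate whose trigram context
    (previous word, candidate, next word) is most frequent."""
    prev_w = words[err_idx - 1] if err_idx > 0 else "<s>"
    next_w = words[err_idx + 1] if err_idx + 1 < len(words) else "</s>"
    best = max(candidates, key=lambda c: ngram_counts[(prev_w, c, next_w)])
    return words[:err_idx] + [best] + words[err_idx + 1:]

# "sand" is a recognition error the user has marked with a gesture.
print(correct_word(["send", "sand", "report"], 1, ["the", "a", "sand"]))
```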


International Conference on Acoustics, Speech, and Signal Processing | 2014

Constrained discriminative PLDA training for speaker verification

Johan Rohdin; Sangeeta Biswas; Koichi Shinoda

Many studies have proven the effectiveness of discriminative training for speaker verification based on probabilistic linear discriminant analysis (PLDA) with i-vectors as features. Most of them directly optimize the log-likelihood ratio score function of the PLDA model instead of explicitly training the PLDA model. However, this optimization process removes some of the constraints that are normally imposed on the PLDA log-likelihood ratio score function. This may deteriorate the verification performance when the amount of training data is limited. In this paper, we first show two constraints which the score function should follow, and then we propose a new constrained discriminative training algorithm which keeps these constraints. Our experiments show that our method obtained significant improvements in verification performance in the male trials of the telephone speaker verification tasks of NIST SRE08 and SRE10.
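For reference, one common parameterization of the i-vector PLDA log-likelihood ratio is the quadratic form sketched below; generative training derives its parameters from the PLDA covariances, while discriminative training optimizes them (or, as proposed here, a constrained subset) directly. The toy dimension and random parameters are for illustration only.

```python
import numpy as np

def plda_llr_score(x1, x2, Lam, Gam, c, k):
    """Quadratic form of the i-vector PLDA verification score:
    s(x1, x2) = x1' Lam x2 + x2' Lam x1 + x1' Gam x1 + x2' Gam x2 + (x1 + x2)' c + k.
    In generative training, Lam, Gam, c, k follow from the PLDA covariances;
    discriminative training treats them (or a constrained subset) as free
    parameters and optimizes the score function directly."""
    return (x1 @ Lam @ x2 + x2 @ Lam @ x1
            + x1 @ Gam @ x1 + x2 @ Gam @ x2
            + (x1 + x2) @ c + k)

rng = np.random.default_rng(0)
d = 10                                                  # toy i-vector dimension
Lam = rng.normal(scale=0.1, size=(d, d)); Lam = (Lam + Lam.T) / 2
Gam = rng.normal(scale=0.1, size=(d, d)); Gam = (Gam + Gam.T) / 2
c, k = rng.normal(size=d), 0.0

x1, x2 = rng.normal(size=d), rng.normal(size=d)          # trial: two i-vectors
print(plda_llr_score(x1, x2, Lam, Gam, c, k))
```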


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2016

Fast Coding of Feature Vectors Using Neighbor-to-Neighbor Search

Nakamasa Inoue; Koichi Shinoda

Searching for matches to high-dimensional vectors using hard/soft vector quantization is the most computationally expensive part of various computer vision algorithms, including the bag of visual words (BoW) model. This paper proposes a fast computation method, Neighbor-to-Neighbor (NTN) search [1], which skips some calculations based on the similarity of input vectors. For example, in image classification using dense SIFT descriptors, the NTN search seeks similar descriptors from a point on a grid to an adjacent point. Applications of the NTN search to vector quantization, a Gaussian mixture model, sparse coding, and a kernel codebook for extracting image or video representations are presented in this paper. We evaluated the proposed method on image and video benchmarks: the PASCAL VOC 2007 Classification Challenge and the TRECVID 2010 Semantic Indexing Task. NTN-VQ reduced the coding cost by 77.4 percent, and NTN-GMM reduced it by 89.3 percent, without any significant degradation in classification performance.
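A simplified sketch of the neighbor-to-neighbor idea for vector quantization, assuming descriptors arrive in grid order: when the current descriptor is close to the previous one, the previous assignment is reused and the full codebook search is skipped. The distance test and threshold are illustrative; the actual NTN search skips calculations with more careful criteria.

```python
import numpy as np

def ntn_vq(descriptors, codebook, threshold=0.1):
    """Neighbor-to-Neighbor-style vector quantization sketch.

    Descriptors are assumed to come in grid order (e.g., dense SIFT), so
    consecutive descriptors are often similar; when the current descriptor is
    close enough to the previous one, the previous codeword is reused."""
    assignments = np.empty(len(descriptors), dtype=int)
    for i, x in enumerate(descriptors):
        if i > 0 and np.linalg.norm(x - descriptors[i - 1]) < threshold:
            assignments[i] = assignments[i - 1]   # skip the full codebook search
        else:
            dists = np.linalg.norm(codebook - x, axis=1)
            assignments[i] = np.argmin(dists)
    return assignments

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1000, 128))                  # visual codebook
base = rng.normal(size=128)
descriptors = base + 0.01 * rng.normal(size=(50, 128))   # smooth grid of descriptors
print(ntn_vq(descriptors, codebook)[:10])
```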


Computer Speech & Language | 2016

Robust discriminative training against data insufficiency in PLDA-based speaker verification

Johan Rohdin; Sangeeta Biswas; Koichi Shinoda

Highlights: We address data insufficiency in discriminative PLDA training. First, we compensate for statistical dependencies in the training data. Second, we propose three constrained discriminative training schemes.

Probabilistic linear discriminant analysis (PLDA) with i-vectors as features has become one of the state-of-the-art methods in speaker verification. Discriminative training (DT) has proven to be effective for improving PLDA's performance but suffers more from data insufficiency than generative training (GT). In this paper, we achieve robustness against data insufficiency in DT in two ways. First, we compensate for statistical dependencies in the training data by adjusting the weights of the training trials so that the training loss is an accurate estimate of the expected loss. Second, we propose three constrained DT schemes, among which the best was a discriminatively trained transformation of the PLDA score function having four parameters. Experiments on the male telephone part of the NIST SRE 2010 confirmed the effectiveness of our proposed techniques. For various numbers of training speakers, the combination of weight adjustment and the constrained DT scheme gave between 7% and 19% relative improvements in Cllr over GT followed by score calibration. Compared to another baseline, DT of all the parameters of the PLDA score function, the improvements were larger.
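The trial-weighting idea can be sketched with a generic weighted cross-entropy over verification trials, as below; the particular loss, weighting rule, and constrained parameterization used in the paper differ from this stand-in.

```python
import numpy as np

def weighted_logistic_loss(scores, labels, trial_weights):
    """Weighted binary cross-entropy over verification trials.

    scores        : LLR-like scores s(x1, x2), one per trial
    labels        : 1 for target (same-speaker) trials, 0 for non-target trials
    trial_weights : per-trial weights; down-weighting trials that share speakers
                    or segments is one way to compensate for statistical
                    dependencies in the training set (illustrative choice)."""
    p = 1.0 / (1.0 + np.exp(-scores))
    ce = -(labels * np.log(p + 1e-12) + (1 - labels) * np.log(1 - p + 1e-12))
    return np.sum(trial_weights * ce) / np.sum(trial_weights)

rng = np.random.default_rng(0)
scores = rng.normal(size=100)
labels = rng.integers(0, 2, size=100)
weights = rng.uniform(0.5, 1.5, size=100)   # e.g., inverse of a speaker's trial count
print(weighted_logistic_loss(scores, labels, weights))
```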


Speech Communication | 2015

Autonomous selection of i-vectors for PLDA modelling in speaker verification

Sangeeta Biswas; Johan Rohdin; Koichi Shinoda

Recently, systems combining i-vectors and probabilistic linear discriminant analysis (PLDA) have become one of the state-of-the-art methods in text-independent speaker verification. The training data of a PLDA model is often collected from a large, diverse population. However, including irrelevant or noisy training data may deteriorate the verification performance. In this paper, we first show that data selection using k-NN improves the speaker verification performance. We then present a robust way of selecting k based on the local distance-based outlier factor (LDOF). We call this method flexible k-NN (fk-NN). We conduct experiments on male and female trials of several telephone conditions of the NIST 2006, 2008, 2010 and 2012 Speaker Recognition Evaluations (SRE). By using fk-NN, we discard a substantial amount of irrelevant or noisy training data without having to tune k, and achieve significant performance improvements on the NIST SRE sets.
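A small sketch of LDOF-based selection, under assumed details: each training i-vector's LDOF is computed with respect to its k nearest reference i-vectors, and vectors with a high LDOF are discarded. The reference set, threshold, and selection rule here are illustrative simplifications of fk-NN.

```python
import numpy as np

def ldof(point, neighbors):
    """Local distance-based outlier factor of one point w.r.t. its k neighbors:
    (mean distance from the point to its neighbors) /
    (mean pairwise distance among the neighbors)."""
    d_to = np.linalg.norm(neighbors - point, axis=1).mean()
    pairwise = np.linalg.norm(neighbors[:, None, :] - neighbors[None, :, :], axis=-1)
    k = len(neighbors)
    d_among = pairwise.sum() / (k * (k - 1))     # mean over off-diagonal pairs
    return d_to / d_among

def select_training_vectors(train, reference, k=10, threshold=2.0):
    """Keep training i-vectors whose LDOF w.r.t. their k nearest reference
    i-vectors stays below a threshold (a simplified selection rule)."""
    kept = []
    for x in train:
        dists = np.linalg.norm(reference - x, axis=1)
        nn = reference[np.argsort(dists)[:k]]
        if ldof(x, nn) <= threshold:
            kept.append(x)
    return np.array(kept)

rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 50))                    # e.g., evaluation-domain i-vectors
train = np.vstack([rng.normal(size=(80, 50)),             # relevant training data
                   rng.normal(loc=6.0, size=(20, 50))])   # irrelevant/noisy data
print(select_training_vectors(train, reference).shape)
```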


2015 International Conference on Computer Application Technologies | 2015

New Materials for Spoofing Touch-Based Fingerprint Scanners

Jan Spurny; Michal Doležel; Ondrej Kanich; Martin Drahansky; Koichi Shinoda

The goal of this article is to describe the creation of several fingerprint spoofs using a PCB mold but various spoof materials. The focus lies on using new materials and modifying them with additives to get better results with some sensor technologies. The creation process and the materials used are thoroughly described, including guidance for getting better results. For every spoof, the resulting fingerprint and a description of its usage are shown. We found that the best materials for spoof production are aquarium silicone and HobbyKing epoxy resin, which are available in nearly every hobby market.

Collaboration


Dive into Koichi Shinoda's collaboration.

Top Co-Authors

Nakamasa Inoue (Tokyo Institute of Technology)
Johan Rohdin (Tokyo Institute of Technology)
Sangeeta Biswas (Tokyo Institute of Technology)
Yuan Liang (Tokyo Institute of Technology)
Tommi Kerola (Tokyo Institute of Technology)
Martin Drahansky (Brno University of Technology)
Ondrej Kanich (Brno University of Technology)
Fumito Nishi (Tokyo Institute of Technology)
Ryan Price (Tokyo Institute of Technology)