Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Nakamasa Inoue is active.

Publication


Featured research published by Nakamasa Inoue.


IEEE Transactions on Multimedia | 2012

A Fast and Accurate Video Semantic-Indexing System Using Fast MAP Adaptation and GMM Supervectors

Nakamasa Inoue; Koichi Shinoda

We propose a fast maximum a posteriori (MAP) adaptation method for video semantic indexing that uses Gaussian mixture model (GMM) supervectors. In this method, a tree-structured GMM is utilized to decrease the computational cost, where only the output probabilities of mixture components close to an input sample are precisely calculated. Experimental evaluation on the TRECVID 2010 dataset demonstrates the effectiveness of the proposed method. The calculation time of the MAP adaptation step is reduced by 76.2% compared with that of a conventional method. The total calculation time is reduced by 56.6% while maintaining the same level of accuracy.
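The pruning idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses a single level of grouping (one surrogate Gaussian per group, built here simply from the member means and variances), scores the surrogates first, and computes exact component posteriors only inside the best-scoring groups. All function and parameter names are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, var):
    # Diagonal-covariance Gaussian log-density.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def tree_posteriors(x, means, variances, weights, groups, top_g=1):
    """Approximate GMM posteriors: score one surrogate Gaussian per group,
    then evaluate exact posteriors only inside the top-scoring groups."""
    group_ids = sorted(set(groups))
    scores = {}
    for g in group_ids:
        members = [i for i, gi in enumerate(groups) if gi == g]
        # Surrogate Gaussian: crude average of the member components.
        gm = np.mean([means[i] for i in members], axis=0)
        gv = np.mean([variances[i] for i in members], axis=0)
        scores[g] = log_gauss(x, gm, gv)
    best = sorted(group_ids, key=lambda g: scores[g], reverse=True)[:top_g]
    post = np.zeros(len(means))
    for i, g in enumerate(groups):
        if g in best:  # components in pruned groups keep posterior 0
            post[i] = weights[i] * np.exp(log_gauss(x, means[i], variances[i]))
    s = post.sum()
    return post / s if s > 0 else post
```

Components in pruned groups get posterior zero, which is what saves the exact-likelihood evaluations the abstract refers to.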


acm multimedia | 2011

A fast MAP adaptation technique for gmm-supervector-based video semantic indexing systems

Nakamasa Inoue; Koichi Shinoda

We propose a fast maximum a posteriori (MAP) adaptation technique for a GMM-supervector-based video semantic indexing system. The use of GMM supervectors is one of the state-of-the-art methods, in which MAP adaptation is needed for estimating the distribution of local features extracted from video data. The proposed method cuts the calculation time of the MAP adaptation step. With the proposed method, a tree-structured GMM is constructed to quickly calculate posterior probabilities for each mixture component of a GMM. The basic idea of the tree-structured GMM is to cluster Gaussian components and approximate them with a single Gaussian. Leaf nodes of the tree correspond to the mixture components, and each non-leaf node has a single Gaussian that approximates its descendant Gaussian distributions. Experimental evaluation on the TRECVID 2010 dataset demonstrates the effectiveness of the proposed method. The calculation time of the MAP adaptation step is reduced by 76.2% compared to that of a conventional method, and the resulting accuracy (in terms of mean average precision) was 10.2%.


asian conference on computer vision | 2014

Spectral Graph Skeletons for 3D Action Recognition

Tommi Kerola; Nakamasa Inoue; Koichi Shinoda

We present spectral graph skeletons (SGS), a novel graph-based method for action recognition from depth cameras. The contribution of this paper is to leverage a spectral graph wavelet transform (SGWT) for creating an overcomplete representation of an action signal lying on a 3D skeleton graph. The resulting SGS descriptor is efficiently computable in time linear in the action sequence length. We investigate the suitability of our method by experiments on three publicly available datasets, resulting in performance comparable to state-of-the-art action recognition approaches. Namely, our method achieves 91.4% accuracy on the challenging MSRAction3D dataset in the cross-subject setting. SGS also achieves 96.0% and 98.8% accuracy on the MSRActionPairs3D and UCF-Kinect datasets, respectively. While this study focuses on action recognition, the proposed framework can in general be applied to any time series of graphs.


international conference on pattern recognition | 2010

High-Level Feature Extraction Using SIFT GMMs and Audio Models

Nakamasa Inoue; Tatsuhiko Saito; Koichi Shinoda; Sadaoki Furui

We propose a statistical framework for high-level feature extraction that uses SIFT Gaussian mixture models (GMMs) and audio models. SIFT features were extracted from all the image frames and modeled by a GMM. In addition, we used mel-frequency cepstral coefficients and ergodic hidden Markov models to detect high-level features in audio streams. The best result obtained by using SIFT GMMs in terms of mean average precision on the TRECVID 2009 corpus was 0.150 and was improved to 0.164 by using audio information.


IEEE Signal Processing Magazine | 2013

Reusing Speech Techniques for Video Semantic Indexing [Applications Corner]

Koichi Shinoda; Nakamasa Inoue

Many techniques developed in speech research have been successfully employed in other fields, such as automatic video semantic indexing. In this application, a user submits a textual query for a desired object or scene to a search system, which returns video shots that include the object or scene. Recently, a new method using Gaussian-mixture-model (GMM) supervectors and support vector machines (SVMs) was proven to be very effective. In this method, speech technologies such as speaker verification and adaptation techniques play very important roles.


Computer Vision and Image Understanding | 2017

Cross-view human action recognition from depth maps using spectral graph sequences

Tommi Kerola; Nakamasa Inoue; Koichi Shinoda

A graph-based method for 3D view-invariant human action recognition is proposed. An action is represented as a sequence of graphs. The vertices can be either tracked skeleton joints or spatio-temporal keypoints. A spectral graph wavelet transform is leveraged to extract features from the graphs. The method is useful for both single- and multi-view action recognition tasks. We present a method for view-invariant action recognition from depth cameras based on graph signal processing techniques. Our framework leverages a novel graph representation of an action as a temporal sequence of graphs, onto which we apply a spectral graph wavelet transform for creating our feature descriptor. We evaluate two view-invariant graph types: skeleton-based and keypoint-based. The skeleton-based descriptor captures the spatial pose of the subject, whereas the keypoint-based descriptor is able to capture complementary information about human-object interaction and the shape of the point cloud. We investigate the effectiveness of our method by experiments on five publicly available datasets. Through the graph structure, our method captures the temporal interaction between depth map interest points and achieves a 19.8% increase in performance compared to state-of-the-art results for cross-view action recognition, and competing results for frontal-view action recognition and human-object interaction. Namely, our method results in 90.8% accuracy on the cross-view N-UCLA Multiview Action3D dataset and 91.4% accuracy on the challenging MSRAction3D dataset in the cross-subject setting. For human-object interaction, our method achieves 72.3% accuracy on the Online RGBD Action dataset. We also achieve 96.0% and 98.8% accuracy on the MSRActionPairs3D and UCF-Kinect datasets, respectively.


acm multimedia | 2016

Adaptation of Word Vectors using Tree Structure for Visual Semantics

Nakamasa Inoue; Koichi Shinoda

We propose a framework of word-vector adaptation, which makes the vectors of visually similar concepts close to each other. Here, word vectors are real-valued vector representations of words, e.g., word2vec representations. Our basic idea is to assume that each concept has some hypernyms that are important for determining its visual features. For example, for the concept Swallow with hypernyms Bird, Animal, and Entity, we believe Bird is the most important, since birds share common visual features such as their feathers. Adapted word vectors are obtained for each word by taking a weighted sum of the given original word vector and its hypernym word vectors. Our weight optimization makes the vectors of visually similar concepts close to each other by giving a large weight to such important hypernyms. We apply the adapted word vectors to zero-shot learning on the TRECVID 2014 semantic indexing dataset. We achieved a mean average precision of 0.083, which, to the best of our knowledge, is the best performance without using TRECVID training data.
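The weighted-sum step itself is easy to sketch. In the sketch below the weights are supplied by the caller, whereas the paper learns them by optimization; the function and parameter names are hypothetical.

```python
import numpy as np

def adapt_word_vector(vec, hypernym_vecs, hypernym_weights, self_weight=0.5):
    """Adapted vector: convex combination of the original word vector
    and its hypernym word vectors (weights normalized to sum to 1)."""
    w = np.array([self_weight] + list(hypernym_weights), dtype=float)
    w = w / w.sum()
    stacked = np.vstack([vec] + list(hypernym_vecs))
    return w @ stacked
```

With a large weight on an important hypernym such as Bird, concepts sharing that hypernym are pulled toward the same region of the vector space, which is the effect the abstract describes.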


acm multimedia | 2014

n-gram Models for Video Semantic Indexing

Nakamasa Inoue; Koichi Shinoda

We propose n-gram modeling of shot sequences for video semantic indexing, in which semantic concepts are extracted from video shots. Most previous studies for this task have assumed that the video shots in a video clip are independent of each other. We model the time dependency between them by assuming that n consecutive video shots are dependent. Our models improve robustness against occlusion and camera-angle changes by effectively using information from the previous video shots. In our experiments on the TRECVID 2012 Semantic Indexing Benchmark, we applied the proposed models to a system using Gaussian mixture models and support vector machines. Mean average precision was improved from 30.62% to 32.14%, which, to the best of our knowledge, is the best performance on the TRECVID 2012 Semantic Indexing task.
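One much-simplified reading of shot-level time dependency is to interpolate each shot's concept score with the scores of the previous n-1 shots. The geometric weighting below is an assumption made purely for illustration, not the paper's n-gram model, and all names are hypothetical.

```python
def ngram_smooth_scores(shot_scores, n=2, decay=0.5):
    """Blend each shot's concept score with the previous n-1 shots,
    weighting shot t-k by decay**k (illustrative geometric weights)."""
    smoothed = []
    for t in range(len(shot_scores)):
        weights, total = [], 0.0
        for k in range(n):
            if t - k < 0:  # not enough history at the start of the clip
                break
            w = decay ** k
            total += w * shot_scores[t - k]
            weights.append(w)
        smoothed.append(total / sum(weights))
    return smoothed
```

A shot whose concept is momentarily occluded keeps a nonzero score if the preceding shots scored high, which mirrors the robustness argument in the abstract.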


international conference on computer vision | 2013

Neighbor-to-Neighbor Search for Fast Coding of Feature Vectors

Nakamasa Inoue; Koichi Shinoda

Assigning a visual code to a low-level image descriptor, which we call code assignment, is the most computationally expensive part of image classification algorithms based on the bag-of-visual-words (BoW) framework. This paper proposes a fast computation method, Neighbor-to-Neighbor (NTN) search, for this code assignment. Based on the fact that image features from adjacent regions are usually similar to each other, this algorithm effectively reduces the cost of calculating the distance between a codeword and a feature vector. The method can be applied not only to a hard codebook constructed by vector quantization (NTN-VQ), but also to a soft codebook, i.e., a Gaussian mixture model (NTN-GMM). We evaluated this method on the PASCAL VOC 2007 classification challenge task. NTN-VQ reduced the assignment cost by 77.4% in super-vector coding, and NTN-GMM reduced it by 89.3% in Fisher-vector coding, without any significant degradation in classification performance.
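The core observation, that a descriptor from an adjacent region usually lands near its neighbor's codeword, can be illustrated with a toy assignment loop. The sketch below shows only two ingredients, seeding each search with the previous descriptor's assignment and pruning with partial-distance elimination; it is not the actual NTN-VQ/NTN-GMM algorithm, and all names are hypothetical.

```python
import numpy as np

def ntn_assign(descriptors, codebook):
    """Assign each descriptor to its nearest codeword, seeding the search
    with the previous descriptor's codeword and pruning candidates whose
    partial squared distance already exceeds the current best."""
    assignments = []
    prev = 0  # seed for the very first descriptor
    for x in descriptors:
        best = prev
        best_d = np.sum((x - codebook[best]) ** 2)
        for j, c in enumerate(codebook):
            if j == best:
                continue
            # Partial-distance elimination: stop summing dimensions
            # as soon as the running distance exceeds the bound.
            d, k = 0.0, 0
            while k < len(x) and d < best_d:
                d += (x[k] - c[k]) ** 2
                k += 1
            if d < best_d:
                best, best_d = j, d
        assignments.append(best)
        prev = best
    return assignments
```

Because consecutive descriptors tend to share a codeword, the seed is usually already near-optimal, so the pruning bound is tight from the first candidate onward.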


Eurasip Journal on Image and Video Processing | 2013

Event detection in consumer videos using GMM supervectors and SVMs

Yusuke Kamishima; Nakamasa Inoue; Koichi Shinoda

In large-scale multimedia event detection, complex target events are extracted from a large set of consumer-generated web videos taken in unconstrained environments. We devised a multimedia event detection method based on Gaussian mixture model (GMM) supervectors and support vector machines. A GMM supervector consists of the parameters of a GMM for the distribution of low-level features extracted from a video clip. A GMM is regarded as an extension of the bag-of-words framework to a probabilistic framework, and thus it can be expected to be robust against the data insufficiency problem. We also propose a camera-motion-cancelled feature, a spatio-temporal feature robust against the camera motions found in consumer-generated web videos. By combining these methods with existing features, we aim to construct a high-performance event detection system. The effectiveness of our method is evaluated using the TRECVID MED task benchmark.
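A GMM supervector, as used in several of the entries above, is commonly formed by concatenating the adapted component means after normalizing each by its mixture weight and covariance. The sketch below follows that common formulation as an assumption; the names are illustrative, and the papers' exact normalization may differ.

```python
import numpy as np

def gmm_supervector(adapted_means, weights, variances):
    """Stack normalized adapted means into one long vector:
    sqrt(w_k) * diag(var_k)^(-1/2) * mean_k for each component,
    concatenated in component order."""
    parts = [np.sqrt(w) * m / np.sqrt(v)
             for m, w, v in zip(adapted_means, weights, variances)]
    return np.concatenate(parts)
```

The resulting fixed-length vector is what gets fed to an SVM, which is why the combination appears throughout these abstracts.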

Collaboration


Dive into Nakamasa Inoue's collaborations.

Top Co-Authors

Koichi Shinoda
Tokyo Institute of Technology

Tommi Kerola
Tokyo Institute of Technology

Mengxi Lin
Tokyo Institute of Technology

Yusuke Kamishima
Tokyo Institute of Technology

Fumito Nishi
Tokyo Institute of Technology

Tatsuhiko Saito
Georgia Institute of Technology

Conggui Liu
Tokyo Institute of Technology

Sadaoki Furui
Tokyo Institute of Technology