Publications


Featured research published by Shohei Hido.


Knowledge and Information Systems | 2011

Statistical outlier detection using direct density ratio estimation

Shohei Hido; Yuta Tsuboi; Hisashi Kashima; Masashi Sugiyama; Takafumi Kanamori

We propose a new statistical approach to the problem of inlier-based outlier detection, i.e., finding outliers in the test set based on the training set consisting only of inliers. Our key idea is to use the ratio of training and test data densities as an outlier score. This approach is expected to have better performance even in high-dimensional problems since methods for directly estimating the density ratio without going through density estimation are available. Among various density ratio estimation methods, we employ the method called unconstrained least-squares importance fitting (uLSIF) since it is equipped with natural cross-validation procedures, allowing us to objectively optimize the value of tuning parameters such as the regularization parameter and the kernel width. Furthermore, uLSIF offers a closed-form solution as well as a closed-form formula for the leave-one-out error, so it is computationally very efficient and is scalable to massive datasets. Simulations with benchmark and real-world datasets illustrate the usefulness of the proposed approach.
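The recipe in this abstract is compact enough to sketch in code. Below is a minimal Python version (the function names, the choice of kernel centers among the test points, and the fixed values of sigma and lam are our assumptions; the paper tunes the kernel width and regularization parameter by cross-validation using the closed-form leave-one-out error):

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    # Pairwise Gaussian kernel values between rows of X and centers C.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def ulsif_outlier_scores(X_train, X_test, sigma=1.0, lam=0.1, n_centers=100, seed=0):
    # Fit w(x) = sum_l alpha_l * K(x, c_l) ~ p_train(x) / p_test(x) by
    # unconstrained least squares. alpha has the closed form
    # (H + lam*I)^{-1} h, where H averages kernel products over the test
    # (denominator) sample and h averages kernel values over the training
    # (numerator) sample.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_test), size=min(n_centers, len(X_test)), replace=False)
    C = X_test[idx]                                 # centers on test points
    Phi_tr = gaussian_kernel(X_train, C, sigma)
    Phi_te = gaussian_kernel(X_test, C, sigma)
    H = Phi_te.T @ Phi_te / len(X_test)
    h = Phi_tr.mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(len(C)), h)
    # Estimated density ratio at each test point; negatives truncated.
    return np.maximum(Phi_te @ alpha, 0.0)
```

Low scores flag likely outliers, since a small estimated ratio means the point has little support under the inlier (training) density.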


Journal of Information Processing | 2009

Direct Density Ratio Estimation for Large-scale Covariate Shift Adaptation

Yuta Tsuboi; Hisashi Kashima; Shohei Hido; Steffen Bickel; Masashi Sugiyama

Covariate shift is a situation in supervised learning where training and test inputs follow different distributions even though the functional relation remains unchanged. A common approach to compensating for the bias caused by covariate shift is to reweight the loss function according to the importance, which is the ratio of test and training densities. We propose a novel method that allows us to directly estimate the importance from samples without going through the hard task of density estimation. An advantage of the proposed method is that the computation time is nearly independent of the number of test input samples, which is highly beneficial in recent applications with large numbers of unlabeled samples. We demonstrate through experiments that the proposed method is computationally more efficient than existing approaches with comparable accuracy. We also describe a promising result for large-scale covariate shift adaptation in a natural language processing task.
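To make the reweighting step concrete, here is a minimal sketch (the logistic-regression estimator is our stand-in, not the paper's code): once importance weights w_i ≈ p_test(x_i)/p_train(x_i) have been estimated, compensating for covariate shift amounts to scaling each training example's loss by its weight.

```python
from sklearn.linear_model import LogisticRegression

def fit_under_covariate_shift(X_train, y_train, importance_weights):
    # Importance-weighted empirical risk: each training example's loss is
    # scaled by w_i ~ p_test(x_i) / p_train(x_i), down-weighting regions
    # over-represented in the training distribution.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train, sample_weight=importance_weights)
    return clf
```

The paper's contribution is estimating those weights directly from samples, with computation nearly independent of the number of test inputs; any direct density-ratio estimator (such as the uLSIF sketch above) can supply them.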


International Conference on Data Mining | 2009

A Linear-Time Graph Kernel

Shohei Hido; Hisashi Kashima

The design of a good kernel is fundamental for knowledge discovery from graph-structured data. Existing graph kernels exploit only limited information about the graph structures but are still computationally expensive. We propose a novel graph kernel based on the structural characteristics of graphs. The key is to represent node labels as binary arrays and characterize each node using logical operations on the label set of the connected nodes. Our kernel has a linear time complexity with respect to the number of nodes times the average number of neighboring nodes in the given graphs. Experimental results show that the proposed kernel performs comparably to, and runs much faster than, a state-of-the-art graph kernel on benchmark data sets, and that it scales well to new applications with large graphs.
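As we read it, the core trick can be sketched as follows (the exact label-update rule, the number of iterations, and the normalization are our assumptions, not the paper's specification): encode each node label as a fixed-width bit array, replace it by a bit rotation of itself XORed with the neighbors' labels so that equal updated labels indicate matching local neighborhoods, and let the kernel count matching labels between two graphs.

```python
from collections import Counter

BITS = 32
MASK = (1 << BITS) - 1

def rot1(x):
    # One-bit cyclic left rotation of a BITS-wide bit-array label.
    return ((x << 1) | (x >> (BITS - 1))) & MASK

def relabel(labels, adj, iterations=2):
    # labels: list of int bit-array node labels; adj: list of neighbor lists.
    # Each pass folds the neighbor labels into a node's own label with XOR,
    # which is order-independent and costs O(sum of node degrees) per pass.
    h = list(labels)
    for _ in range(iterations):
        nxt = []
        for v, nbrs in enumerate(adj):
            x = rot1(h[v])
            for u in nbrs:
                x ^= h[u]
            nxt.append(x)
        h = nxt
    return h

def label_match_kernel(labels1, adj1, labels2, adj2):
    # Similarity = number of common updated labels, normalized by size.
    c1 = Counter(relabel(labels1, adj1))
    c2 = Counter(relabel(labels2, adj2))
    common = sum(min(c1[k], c2[k]) for k in c1)
    return common / max(len(labels1), len(labels2))
```

Because the update touches each edge a constant number of times, the whole computation stays linear in the number of nodes times the average degree, matching the complexity claimed in the abstract.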


International Conference on Data Mining | 2008

Inlier-Based Outlier Detection via Direct Density Ratio Estimation

Shohei Hido; Yuta Tsuboi; Hisashi Kashima; Masashi Sugiyama; Takafumi Kanamori

We propose a new statistical approach to the problem of inlier-based outlier detection, i.e., finding outliers in the test set based on the training set consisting only of inliers. Our key idea is to use the ratio of training and test data densities as an outlier score; we estimate the ratio directly in a semi-parametric fashion without going through density estimation. Thus our approach is expected to have better performance in high-dimensional problems. Furthermore, the applied algorithm for density ratio estimation is equipped with a natural cross-validation procedure, allowing us to objectively optimize the value of tuning parameters such as the regularization parameter and the kernel width. The algorithm offers a closed-form solution as well as a closed-form formula for the leave-one-out error. Thanks to this, the proposed outlier detection method is computationally very efficient and is scalable to massive datasets. Simulations with benchmark and real-world datasets illustrate the usefulness of the proposed approach.
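For reference, the closed form the abstract alludes to can be written out explicitly (the notation is ours; Gaussian basis functions K centered at points c_l are the standard choice in this line of work):

```latex
% Density-ratio model fitted by unconstrained least squares (uLSIF).
% x^{tr}_j: training (inlier) samples, x^{te}_i: test samples.
\widehat{w}(x) = \sum_{l=1}^{b} \alpha_l\, K(x, c_l), \qquad
\widehat{\boldsymbol{\alpha}} = \bigl(\widehat{H} + \lambda I_b\bigr)^{-1}\,\widehat{h},
\quad \text{where} \quad
\widehat{H}_{l,l'} = \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} K(x^{te}_i, c_l)\, K(x^{te}_i, c_{l'}),
\qquad
\widehat{h}_l = \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} K(x^{tr}_j, c_l).
```

Each test point is then scored by the estimated ratio at that point; a small score means the point is unlikely under the inlier density and is flagged as an outlier.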


IPSJ Transactions on Computer Vision and Applications | 2009

A Density-ratio Framework for Statistical Data Processing

Masashi Sugiyama; Takafumi Kanamori; Taiji Suzuki; Shohei Hido; Jun Sese; Ichiro Takeuchi; Liwei Wang

In statistical pattern recognition, it is important to avoid density estimation since density estimation is often more difficult than pattern recognition itself. Following this idea, known as Vapnik's principle, a statistical data processing framework that employs the ratio of two probability density functions has been developed recently and is gathering a lot of attention in the machine learning and data mining communities. The purpose of this paper is to introduce to the computer vision community recent advances in density ratio estimation methods and their usage in various statistical data processing tasks such as nonstationarity adaptation, outlier detection, feature selection, and independent component analysis.
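In one line (notation ours), the framework estimates the ratio

```latex
w(x) = \frac{p_{\mathrm{nu}}(x)}{p_{\mathrm{de}}(x)}
```

directly from samples of the numerator and denominator distributions, instead of estimating the two densities separately and dividing, which is exactly the harder intermediate problem Vapnik's principle says to avoid.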


International Conference on Data Mining | 2005

AMIOT: induced ordered tree mining in tree-structured databases

Shohei Hido; Hiroyuki Kawano

Frequent subtree mining has become increasingly important in recent years. In this paper, we present the AMIOT algorithm to discover all frequent ordered subtrees in a tree-structured database. In order to avoid the generation of infrequent candidate trees, we propose techniques such as the right-and-left tree join and serial tree extension. The proposed methods enumerate, without any duplication, only the candidate trees that have a high probability of being frequent. Experiments on a synthetic dataset and an XML database show that AMIOT reduces redundant candidate trees and outperforms the FREQT algorithm by up to five times in execution time.
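AMIOT's right-and-left tree join is involved, but the baseline it is compared against is easy to show. Below is a minimal FREQT-style miner (our toy encoding: an ordered tree is a `(label, (children...))` tuple; this sketch implements the rightmost-extension baseline, not AMIOT's join) that grows candidates by attaching a new last child at each node on the rightmost path, so every ordered tree is enumerated exactly once:

```python
def embeds_at(p, t):
    # Pattern p matches at data node t: labels agree and p's children match,
    # in order, an order-preserving subsequence of t's children.
    if p[0] != t[0]:
        return False
    i = 0
    for child in t[1]:
        if i < len(p[1]) and embeds_at(p[1][i], child):
            i += 1
    return i == len(p[1])

def occurs(p, t):
    return embeds_at(p, t) or any(occurs(p, c) for c in t[1])

def support(p, db):
    return sum(occurs(p, t) for t in db)

def rightmost_extensions(p, labels):
    # Attach a new last child at every node on the rightmost path of p.
    label, children = p
    exts = [(label, children + ((l, ()),)) for l in labels]
    if children:
        exts += [(label, children[:-1] + (sub,))
                 for sub in rightmost_extensions(children[-1], labels)]
    return exts

def _all_labels(t):
    yield t[0]
    for c in t[1]:
        yield from _all_labels(c)

def mine_frequent_subtrees(db, min_sup):
    labels = sorted({l for t in db for l in _all_labels(t)})
    frontier = [(l, ()) for l in labels if support((l, ()), db) >= min_sup]
    frequent = []
    while frontier:
        frequent += frontier
        frontier = [q for p in frontier for q in rightmost_extensions(p, labels)
                    if support(q, db) >= min_sup]
    return frequent

# e.g. mine_frequent_subtrees([("a", (("b", ()), ("c", ()))),
#                              ("a", (("b", ()),))], min_sup=2)
# -> [('a', ()), ('b', ()), ('a', (('b', ()),))]
```

AMIOT's improvement is to generate candidates by joining two trees that are already known to be frequent, so candidates with little chance of being frequent are never enumerated at all.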


Knowledge Discovery and Data Mining | 2008

Unsupervised change analysis using supervised learning

Shohei Hido; Tsuyoshi Idé; Hisashi Kashima; Harunobu Kubo; Hirofumi Matsuzawa

We propose a formulation of a new problem, which we call change analysis, and a novel method for solving it. In contrast to existing methods of change (or outlier) detection, the goal of change analysis goes beyond detecting whether or not any changes exist; its ultimate goal is to find an explanation of the changes. While change analysis is by nature an unsupervised learning problem, we propose a novel approach based on supervised learning to achieve this goal. The key idea is to use a supervised classifier for interpreting the changes: a classifier should be able to discriminate between two data sets if they actually come from two different data sources. In other words, we use a hypothetical label to train the supervised learner and exploit the learner for interpreting the change. Experimental results using real data show that the proposed approach is promising for change analysis as well as concept drift analysis.
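The key idea is short enough to sketch (the random-forest classifier and the use of impurity-based feature importances for interpretation are our choices for illustration, not the paper's specific setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def change_analysis(X_old, X_new, seed=0):
    # Give the two samples a hypothetical 0/1 label and train a classifier
    # on the labeled union. Cross-validated accuracy near 0.5 means the
    # classifier cannot tell the datasets apart (no detectable change);
    # above-chance accuracy signals a change, and the feature importances
    # indicate where it lies.
    X = np.vstack([X_old, X_new])
    y = np.r_[np.zeros(len(X_old)), np.ones(len(X_new))]
    clf = RandomForestClassifier(random_state=seed)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    clf.fit(X, y)
    return acc, clf.feature_importances_
```

The attraction of this reduction is that any off-the-shelf supervised learner with an interpretable structure can serve as the change explainer.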


Journal of Information Processing | 2012

Modeling Patent Quality: A System for Large-scale Patentability Analysis using Text Mining

Shohei Hido; Shoko Suzuki; Risa Nishiyama; Takashi Imamichi; Rikiya Takahashi; Tetsuya Nasukawa; Tsuyoshi Idé; Yusuke Kanehira; Rinju Yohda; Takeshi Ueno; Akira Tajima; Toshiya Watanabe

Current patent systems face a serious problem of declining patent quality, as the growing number of applications makes it difficult for patent officers to spend enough time evaluating each application. To build a better patent system, it is necessary to define a public consensus on the quality of patent applications in a quantitative way. In this article, we tackle the problem of assessing the quality of patent applications based on machine learning and text mining techniques. For each patent application, our tool automatically computes a score called patentability, which indicates how likely it is that the application will be approved by the patent office. We employ a new statistical prediction model to estimate examination results (approval or rejection) based on a large data set including 0.3 million patent applications. The model computes the patentability score from a set of feature variables including the text contents of the specification documents. Experimental results showed that our model outperforms a conventional method which uses only the structural properties of the documents. Since users can access the estimated results through a Web-browser-based GUI, this system allows both patent examiners and applicants to quickly detect weak applications and to find their specific flaws.
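As a rough illustration of the pipeline shape only (TF-IDF features plus logistic regression are our stand-ins; the paper uses its own statistical model over both textual and structural features, trained on roughly 0.3 million applications):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def patentability_model():
    # Text -> sparse features -> probability of approval. The predicted
    # probability of the "granted" class plays the role of the
    # patentability score described in the abstract.
    return make_pipeline(TfidfVectorizer(min_df=5),
                         LogisticRegression(max_iter=1000))

# Hypothetical usage, assuming past examination outcomes are available:
# model = patentability_model()
# model.fit(spec_texts, approved)          # approved: 1 = granted, 0 = rejected
# scores = model.predict_proba(new_texts)[:, 1]   # patentability scores
```

Even this crude stand-in conveys the system's interface: every application maps to a single calibrated score that examiners and applicants can rank and inspect.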


International Conference on Data Mining | 2013

Data Marketplace for Efficient Data Placement

Hiroshi Maruyama; Daisuke Okanohara; Shohei Hido

Data values are uneven: some data have higher (financial) value than others. Data with low value density should be reduced in size or removed in order to make room for new data with higher value. Okanohara et al. [9] argued that data values will determine the placement of data in the network so as to maximize the utilization of the storage capacity (and the processing power) of the entire network, and proposed an architecture called Krill. Determining data values, however, is not an easy task because they are speculative, meaning that future values are usually unknown. This paper discusses an attempt to adopt the marketplace concept for determining data values; the expectation is that market efficiency guarantees that the best possible value is assigned to each data item. We consider two models of different complexity and show that the overall utilization of the network is maximized.
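The value-density premise in the opening sentences can be illustrated with a toy placement rule (the dictionary layout and the greedy policy are ours; the paper's two market models and the Krill architecture are not reproduced here):

```python
def place_by_value_density(items, capacity):
    # Greedy placement by value per unit size: keep the densest data until
    # storage is full, so low value-density items never displace higher
    # value ones. items: list of {"value": float, "size": int} dicts.
    ranked = sorted(items, key=lambda d: d["value"] / d["size"], reverse=True)
    kept, used = [], 0
    for d in ranked:
        if used + d["size"] <= capacity:
            kept.append(d)
            used += d["size"]
    return kept
```

The marketplace in the paper replaces the hard-coded "value" field above with a price discovered by trading, which is its answer to the values being speculative.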


Knowledge Discovery and Data Mining | 2009

Trace Mining from Distributed Assembly Databases for Causal Analysis

Shohei Hido; Hirofumi Matsuzawa; Fumihiko Kitayama; Masayuki Numao

Hierarchical structures of components often appear in industry, such as the components of cars. We focus on association mining from hierarchically assembled data items that are characterized by identity labels such as lot numbers. Massive and physically distributed product databases make it difficult to directly find the associations of deep-level items. We propose a top-down algorithm using virtual lot numbers to mine association rules from the hierarchical databases. Virtual lot numbers delegate the identity information of the subcomponents to upper-level lot numbers without modifications to the databases. Our pruning method reduces the number of enumerated items and avoids redundant access to the databases. Experiments show that the algorithm runs an order of magnitude faster than a naive approach.
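A toy, single-database rendering of the top-down pruning idea (the `assemblies` layout below is hypothetical, and the real algorithm works over distributed databases, with virtual lot numbers standing in for the explicit sub-lot lists used here):

```python
from collections import Counter
from itertools import combinations

def frequent_lot_pairs(assemblies, min_sup):
    # Level-wise sketch: first find frequent pairs of top-level lot numbers,
    # then expand only those pairs into their subcomponent lots, pruning
    # everything under an infrequent parent pair so deep levels are rarely
    # touched. assemblies: list of products, each a list of
    # (lot_number, sub_lot_list) pairs.
    top = Counter()
    for a in assemblies:
        lots = sorted(lot for lot, _ in a)
        top.update(combinations(lots, 2))
    frequent_top = {p for p, c in top.items() if c >= min_sup}

    deep = Counter()
    for a in assemblies:
        for (l1, s1), (l2, s2) in combinations(a, 2):
            if tuple(sorted((l1, l2))) in frequent_top:
                deep.update(tuple(sorted((x, y))) for x in s1 for y in s2)
    frequent_deep = {p for p, c in deep.items() if c >= min_sup}
    return frequent_top, frequent_deep

# e.g. frequent_lot_pairs([[("L1", ["s1", "s2"]), ("L2", ["s3"])],
#                          [("L1", ["s1"]), ("L2", ["s3"])]], min_sup=2)
```

The pruning pays off because a subcomponent pair can only be frequent if its parent-level pair is, so infrequent branches are never queried.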
