Mahito Sugiyama | Researchain

Publication

Featured research published by Mahito Sugiyama.

Bioinformatics | 2013

Efficient network-guided multi-locus association mapping with graph cuts

Chloé-Agathe Azencott; Dominik Grimm; Mahito Sugiyama; Yoshinobu Kawahara; Karsten M. Borgwardt

Motivation: As an increasing number of genome-wide association studies reveal the limitations of the attempt to explain phenotypic heritability by single genetic loci, there is a recent focus on associating complex phenotypes with sets of genetic loci. Although several methods for multi-locus mapping have been proposed, it is often unclear how to relate the detected loci to the growing knowledge about gene pathways and networks. The few methods that take biological pathways or networks into account are either restricted to investigating a limited number of predetermined sets of loci or do not scale to genome-wide settings. Results: We present SConES, a new efficient method to discover sets of genetic loci that are maximally associated with a phenotype while being connected in an underlying network. Our approach is based on a minimum cut reformulation of the problem of selecting features under sparsity and connectivity constraints, which can be solved exactly and rapidly. SConES outperforms state-of-the-art competitors in terms of runtime, scales to hundreds of thousands of genetic loci and exhibits higher power in detecting causal SNPs in simulation studies than other methods. On flowering time phenotypes and genotypes from Arabidopsis thaliana, SConES detects loci that enable accurate phenotype prediction and that are supported by the literature. Availability: Code is available at http://webdav.tuebingen.mpg.de/u/karsten/Forschung/scones/. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
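The minimum cut reformulation mentioned in the abstract can be illustrated with a small sketch. It assumes (this is a reading of the abstract, not the authors' released code) that the goal is to maximize the summed association scores of the selected loci, minus a sparsity penalty `eta` per selected locus, minus a connectivity penalty `lam` for every network edge that links a selected locus to an unselected one. The function names, the parameter names `eta` and `lam`, and the plain Edmonds-Karp solver are all illustrative choices.

```python
from collections import defaultdict, deque


def min_cut_source_side(capacity, source, sink):
    """Edmonds-Karp max-flow; returns the nodes on the source side of a minimum cut."""
    residual = defaultdict(lambda: defaultdict(float))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] += c
            residual[v][u] += 0.0  # make sure the reverse arc exists
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            # No augmenting path left: the reached nodes form the source side.
            return set(parent)
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u


def select_features(scores, edges, eta, lam):
    """Select feature indices by an s-t minimum cut (assumed objective, see lead-in).

    scores: association score per feature; edges: (i, j, weight) network edges.
    """
    capacity = defaultdict(dict)
    for i, c in enumerate(scores):
        if c > eta:
            capacity["s"][i] = c - eta   # cost of NOT selecting a strong feature
        else:
            capacity[i]["t"] = eta - c   # cost of selecting a weak feature
    for i, j, w in edges:
        capacity[i][j] = lam * w         # cost of separating network neighbors
        capacity[j][i] = lam * w
    reachable = min_cut_source_side(capacity, "s", "t")
    return sorted(n for n in reachable if n != "s")


chain = [(0, 1, 1.0), (1, 2, 1.0)]
print(select_features([5, 1, 5], chain, eta=2, lam=0))  # no network penalty
print(select_features([5, 1, 5], chain, eta=2, lam=2))  # network penalty active
```

With `lam=0` only the two strongly associated features are kept; with `lam=2` the weak middle feature is also selected because cutting both of its edges would cost more than including it, which is the connectivity effect the abstract describes.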

Knowledge Discovery and Data Mining | 2015

Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing

Felipe Llinares-López; Mahito Sugiyama; Laetitia Papaxanthos; Karsten M. Borgwardt

We present a novel algorithm for significant pattern mining, Westfall-Young light. The target patterns are statistically significantly enriched in one of two classes of objects. Our method corrects for multiple hypothesis testing and correlations between patterns via the Westfall-Young permutation procedure, which empirically estimates the null distribution of pattern frequencies in each class via permutations. In our experiments, Westfall-Young light dramatically outperforms the current state-of-the-art approach, both in terms of runtime and memory efficiency on popular real-world benchmark datasets for pattern mining. The key to this efficiency is that, unlike all existing methods, our algorithm does not need to solve the underlying frequent pattern mining problem anew for each permutation and does not need to store the occurrence list of all frequent patterns. Westfall-Young light opens the door to significant pattern mining on large datasets that previously involved prohibitive runtime or memory costs. Our code is available from http://www.bsse.ethz.ch/mlcb/research/machine-learning/wylight.html
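For orientation, the plain Westfall-Young permutation procedure that the abstract builds on can be sketched as follows. This is the naive baseline, which re-tests every pattern for each permutation; it is not Westfall-Young light, whose contribution is precisely to avoid that cost. The one-sided Fisher exact test, the function names, and the simple empirical quantile (which ignores ties) are assumptions made for illustration.

```python
import random
from math import comb


def fisher_p(a, n_pattern, n_pos, n_total):
    """One-sided Fisher exact p-value: P(X >= a) under the hypergeometric null.

    a: positive-class objects containing the pattern; n_pattern: objects
    containing the pattern; n_pos: positive-class objects; n_total: all objects.
    """
    upper = min(n_pattern, n_pos)
    tail = sum(comb(n_pos, k) * comb(n_total - n_pos, n_pattern - k)
               for k in range(a, upper + 1))
    return tail / comb(n_total, n_pattern)


def westfall_young_threshold(occurrences, labels, alpha=0.05, n_perm=1000, seed=0):
    """Estimate a corrected significance threshold by permuting class labels.

    occurrences: one set of object indices per pattern; labels: 0/1 per object.
    """
    rng = random.Random(seed)
    n_total, n_pos = len(labels), sum(labels)
    perm = list(labels)
    min_pvalues = []
    for _ in range(n_perm):
        rng.shuffle(perm)
        # Smallest p-value over all patterns under this permuted labeling.
        min_pvalues.append(min(
            fisher_p(sum(perm[i] for i in occ), len(occ), n_pos, n_total)
            for occ in occurrences))
    min_pvalues.sort()
    # Empirical alpha-quantile of the permutation minimum p-values.
    return min_pvalues[max(int(alpha * n_perm) - 1, 0)]


occurrences = [{0, 1, 2}, {2, 3}, {0, 4, 5}]
labels = [1, 1, 1, 0, 0, 0]
threshold = westfall_young_threshold(occurrences, labels, alpha=0.2, n_perm=200)
print(threshold)
```

A pattern would then be reported when its p-value on the original labels does not exceed the returned threshold. Because the minimum is taken jointly over all patterns, correlations between patterns are reflected in the threshold.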

Bioinformatics | 2015

Genome-wide detection of intervals of genetic heterogeneity associated with complex traits

Felipe Llinares-López; Dominik Grimm; Dean A. Bodenham; Udo Gieraths; Mahito Sugiyama; Beth A. Rowan; Karsten M. Borgwardt

Motivation: Genetic heterogeneity, the fact that several sequence variants give rise to the same phenotype, is a phenomenon that is of the utmost interest in the analysis of complex phenotypes. Current approaches for finding regions in the genome that exhibit genetic heterogeneity suffer from at least one of two shortcomings: (i) they require the definition of an exact interval in the genome that is to be tested for genetic heterogeneity, potentially missing intervals of high relevance, or (ii) they suffer from an enormous multiple hypothesis testing problem due to the large number of potential candidate intervals being tested, which results in either many false positives or a lack of power to detect true intervals. Results: Here, we present an approach that overcomes both problems: it allows one to automatically find all contiguous sequences of single nucleotide polymorphisms in the genome that are jointly associated with the phenotype. It also solves both the inherent computational efficiency problem and the statistical problem of multiple hypothesis testing, which are both caused by the huge number of candidate intervals. We demonstrate on Arabidopsis thaliana genome-wide association study data that our approach can discover regions that exhibit genetic heterogeneity and would be missed by single-locus mapping. Conclusions: Our novel approach can contribute to the genome-wide discovery of intervals that are involved in the genetic heterogeneity underlying complex phenotypes. Availability and implementation: The code can be obtained at: http://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/sis.html. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

International Conference on Data Mining | 2011

A Fast and Flexible Clustering Algorithm Using Binary Discretization

Mahito Sugiyama; Akihiro Yamamoto

In this paper, we present a new clustering algorithm for multivariate data. This algorithm, called BOOL (Binary coding Oriented clustering), can detect arbitrarily shaped clusters and is noise tolerant. BOOL handles data using a two-step procedure: data points are first discretized and represented as binary words; clusters are then iteratively constructed by agglomerating smaller clusters using this representation. This latter step is carried out with linear complexity by sorting such binary representations, which results in dramatic speedups when compared with other techniques. Experiments show that BOOL is faster than K-means, and about two to three orders of magnitude faster than two state-of-the-art algorithms that can detect non-convex clusters of arbitrary shapes. We also show that BOOL's results are robust to changes in parameters, whereas most algorithms for arbitrarily shaped clusters are known to be overly sensitive to such changes. The key to the robustness of BOOL is the hierarchical structure of clusters that is introduced automatically by increasing the accuracy of the discretization.
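The two-step idea (discretize, then agglomerate using the discrete codes) can be illustrated with a simplified grid-based sketch. It is not the BOOL algorithm itself: BOOL merges clusters by sorting binary representations and refines the discretization hierarchically, whereas this sketch assumes data scaled to the unit cube, uses a single discretization level, and merges occupied cells that touch by looking up neighbors in a hash set (which costs 3^d lookups per cell in d dimensions). All names are illustrative.

```python
from itertools import product


def discretize(point, level):
    """Map a point in [0, 1]^d to integer cell coordinates (`level` bits per axis)."""
    cells = 2 ** level
    return tuple(min(int(x * cells), cells - 1) for x in point)


def grid_clusters(points, level):
    """Label points so that points in touching grid cells share a cluster."""
    cell_of = [discretize(p, level) for p in points]
    occupied = set(cell_of)
    parent = {c: c for c in occupied}  # union-find forest over occupied cells

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    dim = len(points[0])
    for c in occupied:
        for delta in product((-1, 0, 1), repeat=dim):
            nb = tuple(a + b for a, b in zip(c, delta))
            if nb in occupied:
                parent[find(c)] = find(nb)

    # Number clusters in order of first appearance.
    roots = {}
    return [roots.setdefault(find(c), len(roots)) for c in cell_of]


points = [(0.05, 0.05), (0.15, 0.10), (0.90, 0.90), (0.95, 0.85)]
print(grid_clusters(points, level=3))  # → [0, 0, 1, 1]
```

Raising `level` makes the cells smaller, so clusters can only split, never merge; this mirrors the hierarchical structure that the abstract credits for BOOL's robustness.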

SIAM International Conference on Data Mining | 2014

Multi-Task Feature Selection on Multiple Networks via Maximum Flows

Mahito Sugiyama; Chloé-Agathe Azencott; Dominik Grimm; Yoshinobu Kawahara; Karsten M. Borgwardt

We propose a new formulation of multi-task feature selection coupled with multiple network regularizers, and show that the problem can be exactly and efficiently solved by maximum flow algorithms. This method contributes to one of the central topics in data mining: How to exploit structural information in multivariate data analysis, which has numerous applications, such as gene regulatory and social network analysis. On simulated data, we show that the proposed method leads to higher accuracy in discovering causal features by solving multiple tasks simultaneously using networks over features. Moreover, we apply the method to multi-locus association mapping with Arabidopsis thaliana genotypes and flowering time phenotypes, and demonstrate its ability to recover more known phenotype-related genes than other state-of-the-art methods.

Bioinformatics | 2018

graphkernels: R and Python packages for graph comparison

Mahito Sugiyama; M. Elisabetta Ghisu; Felipe Llinares-López; Karsten M. Borgwardt

Summary: Measuring the similarity of graphs is a fundamental step in the analysis of graph-structured data, which is omnipresent in computational biology. Graph kernels have been proposed as a powerful and efficient approach to this problem of graph comparison. Here we provide graphkernels, the first R and Python graph kernel libraries including baseline kernels such as label histogram based kernels, classic graph kernels such as random walk based kernels, and the state-of-the-art Weisfeiler-Lehman graph kernel. The core of all graph kernels is implemented in C++ for efficiency. Using the kernel matrices computed by the package, we can easily perform tasks such as classification, regression and clustering on graph-structured samples. Availability and implementation: The R and Python packages including source code are available at https://CRAN.R-project.org/package=graphkernels and https://pypi.python.org/pypi/graphkernels. Supplementary information: Supplementary data are available online at Bioinformatics.
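To make the notion of a baseline kernel concrete, here is a minimal sketch of a vertex label histogram kernel written from scratch. It does not use or reproduce the graphkernels package API; representing a graph only by its list of vertex labels and taking a linear kernel on the label counts are assumptions chosen for illustration.

```python
from collections import Counter


def vertex_histogram_kernel(labels_a, labels_b):
    """Linear kernel between the vertex label histograms of two graphs."""
    hist_a, hist_b = Counter(labels_a), Counter(labels_b)
    return sum(hist_a[label] * hist_b[label] for label in hist_a)


def kernel_matrix(graphs):
    """Pairwise kernel values for a list of graphs given as vertex label lists."""
    return [[vertex_histogram_kernel(g, h) for h in graphs] for g in graphs]


graphs = [["C", "C", "O"], ["C", "O", "O"], ["N"]]
for row in kernel_matrix(graphs):
    print(row)
# → [5, 4, 0]
# → [4, 5, 0]
# → [0, 0, 1]
```

A matrix of this form is what a kernel method such as a support vector machine would consume for the classification, regression, and clustering tasks mentioned in the abstract.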

Algorithmic Learning Theory | 2010

Learning figures with the Hausdorff metric by fractals

Mahito Sugiyama; Eiju Hirowatari; Hideki Tsuiki; Akihiro Yamamoto

Discretization is a fundamental process for machine learning from analog data such as continuous signals. For example, discrete Fourier analysis is one of the most essential signal processing methods for learning or recognition from continuous signals. However, only the direction of the time axis is discretized in the method, meaning that each datum is not purely discretized. To give a completely computational theoretical basis for machine learning from analog data, we construct a learning framework based on the Gold-style learning model. Using a modern mathematical computability theory in the field of Computable Analysis, we show that scalable sampling of analog data can be formulated as effective Gold-style learning. On the other hand, recursive algorithms are a key expression for models or rules explaining analog data. For example, the FFT (Fast Fourier Transform) is a fundamental recursive algorithm for discrete Fourier analysis. In this paper we adopt fractals, since they are general geometric concepts of recursive algorithms, and set learning objects as nonempty compact sets in the Euclidean space, called figures, in order to introduce fractals into the Gold-style learning model, where the Hausdorff metric can be used to measure generalization errors. We analyze learnable classes of figures from informants (positive and negative examples) and from texts (positive examples), and reveal the hierarchy of learnabilities under various learning criteria. Furthermore, we measure the number of positive examples, one of the complexities of learning, by using the Hausdorff dimension, which is the central concept of Fractal Geometry, and the VC dimension, which is used to measure the complexity of classes of hypotheses in the Valiant-style learning model. This work provides theoretical support for machine learning from analog data.
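The Hausdorff metric used above to measure generalization error can be computed directly for finite point sets. The sketch below covers only that finite case, as an illustration; the figures in the paper are general nonempty compact sets, and the function name is illustrative.

```python
from math import dist


def hausdorff(a, b):
    """Hausdorff distance between two nonempty finite point sets."""
    def directed(x, y):
        # Largest distance from a point of x to its nearest point in y.
        return max(min(dist(p, q) for q in y) for p in x)

    return max(directed(a, b), directed(b, a))


square_corner = [(0, 0), (1, 0)]
with_outlier = [(0, 0), (1, 0), (1, 2)]
print(hausdorff(square_corner, with_outlier))  # → 2.0
```

The distance is driven by the single point `(1, 2)` that has no close counterpart in the other set, which shows why the metric suits error measurement: a hypothesis is close to the target only if each set is covered by a small neighborhood of the other.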

Intelligent Data Analysis | 2013

Semi-supervised learning on closed set lattices

Mahito Sugiyama; Akihiro Yamamoto

We propose a new approach for semi-supervised learning using closed set lattices, which have been recently used for frequent pattern mining within the framework of the data analysis technique of Formal Concept Analysis (FCA). We present a learning algorithm, called SELF (SEmi-supervised Learning via FCA), which performs as a multiclass classifier and a label ranker for mixed-type data containing both discrete and continuous variables, whereas only a few learning algorithms, such as the decision tree-based classifier, can directly handle mixed-type data. From both labeled and unlabeled data, SELF constructs a closed set lattice, which is a partially ordered set of data clusters with respect to subset inclusion, via FCA together with discretizing continuous variables, followed by learning classification rules through finding maximal clusters on the lattice. Moreover, it can weight each classification rule using the lattice, which gives a partial order of preference over class labels. We illustrate experimentally the competitive performance of SELF in classification and ranking compared to other learning algorithms using UCI datasets.
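The closed sets that make up such a lattice can be enumerated for a tiny binary context with a brute-force sketch. This only illustrates the FCA closure operator; it is not the SELF algorithm, it skips the discretization of continuous variables, and it is exponential in the number of attributes. The context layout (a dict from object name to attribute set) and the function names are assumptions.

```python
from itertools import combinations


def closure(attrs, context):
    """Attributes shared by every object that has all attributes in `attrs`."""
    extent = [obj_attrs for obj_attrs in context.values() if attrs <= obj_attrs]
    if not extent:
        # No object matches: by convention the closure is the full attribute set.
        return frozenset(set().union(*context.values()))
    return frozenset(set.intersection(*map(set, extent)))


def closed_sets(context):
    """All closed attribute sets of the context, found by trying every subset."""
    all_attrs = sorted(set().union(*context.values()))
    found = set()
    for size in range(len(all_attrs) + 1):
        for subset in combinations(all_attrs, size):
            found.add(closure(frozenset(subset), context))
    return found


context = {"a": {"x", "y"}, "b": {"y", "z"}, "c": {"y"}}
for closed in sorted(closed_sets(context), key=lambda s: (len(s), sorted(s))):
    print(sorted(closed))
```

Ordered by subset inclusion, the four closed sets found here (`{y}`, `{x, y}`, `{y, z}`, `{x, y, z}`) form the lattice; each one corresponds to the cluster of objects that share those attributes.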

International Conference on Conceptual Structures | 2011

Semi-supervised learning for mixed-type data via formal concept analysis

Mahito Sugiyama; Akihiro Yamamoto

Only a few machine learning methods, e.g., the decision tree-based classification method, can handle mixed-type data sets containing both discrete (binary and nominal) and continuous (real-valued) variables and, moreover, no semi-supervised learning method can treat such data sets directly. Here we propose a novel semi-supervised learning method, called SELF (SEmi-supervised Learning via FCA), for mixed-type data sets using Formal Concept Analysis (FCA). SELF extracts a lattice structure via FCA together with discretizing continuous variables and learns classification rules using the structure effectively. Incomplete data sets including missing values can be handled directly in our method. We experimentally demonstrate competitive performance of SELF compared to other supervised and semi-supervised learning methods. Our contribution is not only giving a novel semi-supervised learning method, but also bridging the two fields of conceptual analysis and knowledge discovery.

International Symposium on Artificial Intelligence | 2014

Detecting Anomalous Subgraphs on Attributed Graphs via Parametric Flow

Mahito Sugiyama; Keisuke Otaki

Detecting anomalies from structured graph data is becoming a critical task for many applications such as an analysis of disease infection in communities. To date, however, there exists no efficient method that works on massive attributed graphs with millions of vertices for detecting anomalous subgraphs with an abnormal distribution of vertex attributes. Here we report that this task is efficiently solved using the recent graph cut-based formulation. In particular, the full hierarchy of anomalous subgraphs can be simultaneously obtained via the parametric flow algorithm, which allows us to introduce the size constraint on anomalous subgraphs. We thoroughly examine the method using various sizes of synthetic and real-world datasets and show that our method is more than five orders of magnitude faster than the state-of-the-art method and is more effective in detection of anomalous subgraphs.
