Samuel Kaski | Researchain

Publication

Featured research published by Samuel Kaski.

IEEE Transactions on Neural Networks | 2000

Self organization of a massive document collection

Teuvo Kohonen; Samuel Kaski; Krista Lagus; Jarkko Salojärvi; Jukka Honkela; Antti Saarela

This article describes the implementation of a system that is able to organize vast document collections according to textual similarities. It is based on the self-organizing map (SOM) algorithm. Statistical representations of the documents' vocabularies are used as the feature vectors. The main goal of our work has been to scale up the SOM algorithm to be able to deal with large amounts of high-dimensional data. In a practical experiment, we mapped 6,840,568 patent abstracts onto a 1,002,240-node SOM. As the feature vectors, we used 500-dimensional vectors of stochastic figures obtained as random projections of weighted word histograms.

Neurocomputing | 1998

WEBSOM – Self-organizing maps of document collections

Samuel Kaski; Timo Honkela; Krista Lagus; Teuvo Kohonen

With the WEBSOM method, a textual document collection may be organized onto a graphical map display that provides an overview of the collection and facilitates interactive browsing. Interesting documents can be located on the map using a content-directed search. Each document is encoded as a histogram of word categories, which are formed by the self-organizing map (SOM) algorithm based on the similarities in the contexts of the words. The encoded documents are organized on another self-organizing map, a document map, on which nearby locations contain similar documents. Special consideration is given to the computation of very large document maps, which is possible with general-purpose computers if the dimensionality of the word category histograms is first reduced with a random mapping method and if computationally efficient algorithms are used in computing the SOMs.

International Symposium on Neural Networks | 1998

Dimensionality reduction by random mapping: fast similarity computation for clustering

Samuel Kaski

When the data vectors are high-dimensional, it is computationally infeasible to use data analysis or pattern recognition algorithms that repeatedly compute similarities or distances in the original data space. It is therefore necessary to reduce the dimensionality before, for example, clustering the data. If the dimensionality is very high, as in the WEBSOM method, which organizes textual document collections on a self-organizing map, then even commonly used dimensionality reduction methods such as principal component analysis may be too costly. It is demonstrated that the document classification accuracy obtained after the dimensionality has been reduced using a random mapping method will be almost as good as the original accuracy if the final dimensionality is sufficiently large (about 100 out of 6000). In fact, it can be shown that the inner product (similarity) between the mapped vectors closely follows the inner product of the original vectors.
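The claim that similarities survive a random mapping can be illustrated with a minimal sketch. This is not the paper's code: the function names, the Gaussian entries scaled by 1/sqrt(d_out), and the dimensions used are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import random


def random_mapping_matrix(d_in, d_out, rng):
    """Build a d_out x d_in matrix of Gaussian entries.

    Assumption: entries are N(0, 1/d_out), so inner products are
    preserved in expectation. The paper may normalize differently.
    """
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)]
            for _ in range(d_out)]


def project(matrix, x):
    """Map vector x into the lower-dimensional space (matrix times x)."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cosine(a, b):
    """Cosine similarity, i.e. the inner product of the normalized vectors."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


if __name__ == "__main__":
    rng = random.Random(0)
    # Two similar hypothetical "documents" in a 600-dimensional space.
    x = [rng.gauss(0.0, 1.0) for _ in range(600)]
    y = [xi + 0.3 * rng.gauss(0.0, 1.0) for xi in x]
    R = random_mapping_matrix(600, 100, rng)
    print("original cosine:", cosine(x, y))
    print("mapped cosine:  ", cosine(project(R, x), project(R, y)))
```

Running the script prints the two similarities side by side; with a target dimensionality around 100 they should be close, in line with the abstract's observation.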

Nature Biotechnology | 2014

A community effort to assess and improve drug sensitivity prediction algorithms

James C. Costello; Laura M. Heiser; Elisabeth Georgii; Michael P. Menden; Nicholas Wang; Mukesh Bansal; Muhammad Ammad-ud-din; Petteri Hintsanen; Suleiman A. Khan; John-Patrick Mpindi; Olli Kallioniemi; Antti Honkela; Tero Aittokallio; Krister Wennerberg; NCI DREAM Community; James J. Collins; Dan Gallahan; Dinah S. Singer; Julio Saez-Rodriguez; Samuel Kaski; Joe W. Gray; Gustavo Stolovitzky

Predicting the best treatment strategy from genomic information is a core goal of precision medicine. Here we focus on predicting drug response based on a cohort of genomic, epigenomic and proteomic profiling data sets measured in human breast cancer cell lines. Through a collaborative effort between the National Cancer Institute (NCI) and the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project, we analyzed a total of 44 drug sensitivity prediction algorithms. The top-performing approaches modeled nonlinear relationships and incorporated biological pathway information. We found that gene expression microarrays consistently provided the best predictive power of the individual profiling data sets; however, performance was increased by including multiple, independent data sets. We discuss the innovations underlying the top-performing methodology, Bayesian multitask MKL, and we provide detailed descriptions of all methods. This study establishes benchmarks for drug sensitivity prediction and identifies approaches that can be leveraged for the development of new methods.

Workshop on Self-Organizing Maps | 2006

Local multidimensional scaling

Jarkko Venna; Samuel Kaski

In a visualization task, every nonlinear projection method needs to make a compromise between trustworthiness and continuity. In a trustworthy projection the visualized proximities hold in the original data as well, whereas a continuous projection visualizes all proximities of the original data. We show experimentally that one of the multidimensional scaling methods, curvilinear components analysis, is good at maximizing trustworthiness. We then extend it to focus on local proximities both in the input and output space, and to explicitly make a user-tunable parameterized compromise between trustworthiness and continuity. The new method compares favorably to alternative nonlinear projection methods.

Information Sciences | 2004

Mining massive document collections by the WEBSOM method

Krista Lagus; Samuel Kaski; Teuvo Kohonen

A viable alternative to the traditional text-mining methods is the WEBSOM, a software system based on the Self-Organizing Map (SOM) principle. Prior to the searching or browsing operations, this method orders a collection of textual items, say, documents, according to their contents, and maps them onto a regular two-dimensional array of map units. Documents that are similar on the basis of their whole contents will be mapped to the same or neighboring map units, and at each unit there exist links to the document database. Thus, while the searching can be started by locating those documents that best match the search expression, further relevant search results can be found on the basis of the pointers stored at the same or neighboring map units, even if they did not match the search criterion exactly. This work contains an overview of the WEBSOM method and its performance, and as a special application, the WEBSOM map of the texts of Encyclopaedia Britannica is described.

International Conference on Artificial Neural Networks | 2001

Neighborhood Preservation in Nonlinear Projection Methods: An Experimental Study

Jarkko Venna; Samuel Kaski

Several measures have been proposed for comparing nonlinear projection methods, but so far no comparisons have taken into account one of their most important properties, the trustworthiness of the resulting neighborhood or proximity relationships. One of the main uses of nonlinear mapping methods is to visualize multivariate data, and in such visualizations it is crucial that the visualized proximities can be trusted: if two data samples are close to each other on the display, they should be close-by in the original space as well. A local measure of trustworthiness is proposed, and it is shown for three data sets that neighborhood relationships visualized by the Self-Organizing Map and its variant, the Generative Topographic Mapping, are more trustworthy than visualizations produced by traditional multidimensional scaling-based nonlinear projection methods.
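The abstract does not spell out the measure, so the following is only a sketch of one widely used rank-based formulation of trustworthiness: points that enter a sample's k-neighborhood on the display without belonging to it in the original space are penalized by how far down the original ranking they sit. The function names and the normalization (valid for k < N/2) are assumptions here, not quotations from the paper.

```python
import math


def _ranked_neighbors(points, i):
    """Indices of all other points, ordered by Euclidean distance from point i."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))


def trustworthiness(original, projected, k):
    """Rank-based trustworthiness of a projection, in [0, 1]; 1 is best.

    Assumed formulation: 1 - 2 / (N k (2N - 3k - 1)) * sum of (rank - k)
    over points that are among the k nearest on the display but not in
    the original space.
    """
    n = len(original)
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < N/2")
    penalty = 0
    for i in range(n):
        order_in = _ranked_neighbors(original, i)
        rank_in = {j: r for r, j in enumerate(order_in, start=1)}
        for j in _ranked_neighbors(projected, i)[:k]:
            if rank_in[j] > k:  # j is a "false" neighbor on the display
                penalty += rank_in[j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


if __name__ == "__main__":
    line = [(float(i),) for i in range(10)]
    shuffled = [(float(v),) for v in [0, 5, 1, 6, 2, 7, 3, 8, 4, 9]]
    print(trustworthiness(line, line, 2))      # identical layout
    print(trustworthiness(line, shuffled, 2))  # scrambled layout scores lower
```

A projection that keeps every neighborhood intact scores 1; scrambling the layout introduces false neighbors and lowers the score.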

Neural Networks | 2002

Analysis and visualization of gene expression data using self-organizing maps

Janne Nikkilä; Petri Törönen; Samuel Kaski; Jarkko Venna; Eero Castrén; Garry Wong

Cluster structure of gene expression data obtained from DNA microarrays is analyzed and visualized with the Self-Organizing Map (SOM) algorithm. The SOM forms a non-linear mapping of the data to a two-dimensional map grid that can be used as an exploratory data analysis tool for generating hypotheses on the relationships, and ultimately of the function of the genes. Similarity relationships within the data and cluster structures can be visualized and interpreted. The methods are demonstrated by computing a SOM of yeast genes. The relationships of known functional classes of genes are investigated by analyzing their distribution on the SOM, the cluster structure is visualized by the U-matrix method, and the clusters are characterized in terms of the properties of the expression profiles of the genes. Finally, it is shown that the SOM visualizes the similarity of genes in a more trustworthy way than two alternative methods, multidimensional scaling and hierarchical clustering.
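For readers unfamiliar with how a SOM maps data onto a two-dimensional grid, here is a minimal textbook-style sketch of online SOM training. It is not the implementation used in the paper: the function names, the linear decay schedules, the Gaussian neighborhood, and all default parameters are assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose model vector is closest to x."""
    return min(weights, key=lambda unit: math.dist(weights[unit], x))


def train_som(data, rows, cols, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM with the basic online update rule.

    Assumptions: model vectors start at randomly chosen data points, and
    the learning rate and neighborhood width decay linearly over time.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # one shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.5)
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Units near the winner on the grid move more strongly toward x.
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = lr * math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for d in range(dim):
                    w[d] += h * (x[d] - w[d])
            step += 1
    return weights


if __name__ == "__main__":
    # Hypothetical toy "expression profiles" forming two groups.
    profiles = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (4.8, 5.1)]
    som = train_som(profiles, rows=2, cols=3)
    for p in profiles:
        print(p, "->", best_matching_unit(som, p))
```

Each data item is then summarized by the grid location of its best-matching unit, which is what makes the map usable as an exploratory display.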

International Conference on Artificial Neural Networks | 1996

Comparing Self-Organizing Maps

Samuel Kaski; Krista Lagus

In exploratory analysis of high-dimensional data, the self-organizing map can be used to illustrate relations between the data items. We have developed two measures for comparing how different maps represent these relations. The first combines an index of discontinuities in the mapping from the input data set to the map grid with an index of the accuracy with which the map represents the data set. This measure can be used for determining the goodness of single maps. The other measure has been used to directly compare how similarly two maps represent relations between data items. Such a measure of the dissimilarity of maps is useful, e.g., for analyzing the sensitivity of maps to variations in their inputs or in the learning process. Also, the similarity of two data sets can be compared indirectly by comparing the maps that represent them.

IEEE Transactions on Neural Networks | 2001

Bankruptcy analysis with self-organizing maps in learning metrics

Samuel Kaski; Janne Sinkkonen; Jaakko Peltonen

We introduce a method for deriving a metric, locally based on the Fisher information matrix, into the data space. A self-organizing map (SOM) is computed in the new metric to explore financial statements of enterprises. The metric measures local distances in terms of changes in the distribution of an auxiliary random variable that reflects what is important in the data. In this paper the variable indicates bankruptcy within the next few years. The conditional density of the auxiliary variable is first estimated, and the change in the estimate resulting from local displacements in the primary data space is measured using the Fisher information matrix. When a self-organizing map is computed in the new metric it still visualizes the data space in a topology-preserving fashion, but represents the (local) directions in which the probability of bankruptcy changes the most.
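The abstract gives no formulas, so the following is only one standard way to write a local metric based on the Fisher information matrix. The symbols are assumed notation, not the paper's: x is a point in the primary data space, c is the auxiliary variable (here, bankruptcy or not), and p(c | x) is its estimated conditional density.

```latex
d_L^2(\mathbf{x}, \mathbf{x} + d\mathbf{x})
  = d\mathbf{x}^{\mathsf{T}} \, \mathbf{J}(\mathbf{x}) \, d\mathbf{x},
\qquad
\mathbf{J}(\mathbf{x})
  = \mathbb{E}_{p(c \mid \mathbf{x})}\!\left[
      \nabla_{\mathbf{x}} \log p(c \mid \mathbf{x}) \,
      \bigl( \nabla_{\mathbf{x}} \log p(c \mid \mathbf{x}) \bigr)^{\mathsf{T}}
    \right]
```

Under this reading, a displacement dx counts as long only if it changes the distribution of c, which matches the abstract's statement that the map emphasizes directions in which the probability of bankruptcy changes the most.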
