Jarkko Venna
Helsinki University of Technology
Publications
Featured research published by Jarkko Venna.
Workshop on Self-Organizing Maps | 2006
Jarkko Venna; Samuel Kaski
In a visualization task, every nonlinear projection method needs to make a compromise between trustworthiness and continuity. In a trustworthy projection the visualized proximities hold in the original data as well, whereas a continuous projection visualizes all proximities of the original data. We show experimentally that one of the multidimensional scaling methods, curvilinear component analysis, is good at maximizing trustworthiness. We then extend it to focus on local proximities both in the input and output space, and to explicitly make a user-tunable parameterized compromise between trustworthiness and continuity. The new method compares favorably to alternative nonlinear projection methods.
International Conference on Artificial Neural Networks | 2001
Jarkko Venna; Samuel Kaski
Several measures have been proposed for comparing nonlinear projection methods, but so far no comparisons have taken into account one of their most important properties, the trustworthiness of the resulting neighborhood or proximity relationships. One of the main uses of nonlinear mapping methods is to visualize multivariate data, and in such visualizations it is crucial that the visualized proximities can be trusted: if two data samples are close to each other on the display, they should be close by in the original space as well. A local measure of trustworthiness is proposed, and it is shown for three data sets that neighborhood relationships visualized by the Self-Organizing Map and its variant, the Generative Topographic Mapping, are more trustworthy than visualizations produced by traditional multidimensional scaling-based nonlinear projection methods.
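The local trustworthiness measure described above penalizes points that are among a sample's k nearest neighbors on the display but not in the original space. The following is an illustrative NumPy sketch of the standard T(k) formula (with the usual normalization, valid for k < N/2), not the authors' own code; the function name `trustworthiness` is our choice.

```python
import numpy as np

def trustworthiness(X, Y, k=5):
    """T(k) of an embedding Y of high-dimensional data X.

    Penalizes points that intrude into a display neighborhood:
    they are among the k nearest neighbors on the display (Y)
    but not among the k nearest in the original space (X).
    """
    n = X.shape[0]
    # Pairwise Euclidean distances in both spaces; exclude self-distances.
    dX = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    dY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    np.fill_diagonal(dX, np.inf)
    np.fill_diagonal(dY, np.inf)
    # rank_X[i, j] = neighbor rank of j around i in the original space (1-based).
    rank_X = dX.argsort(axis=1).argsort(axis=1) + 1
    knn_Y = dY.argsort(axis=1)[:, :k]  # k nearest neighbors on the display
    penalty = 0.0
    for i in range(n):
        for j in knn_Y[i]:
            r = rank_X[i, j]
            if r > k:  # j is a display neighbor but not an original-space neighbor
                penalty += r - k
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    return 1.0 - norm * penalty
```

A perfect (distance-order-preserving) embedding gives T(k) = 1; intrusions into display neighborhoods lower the score toward 0.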
Neural Networks | 2002
Janne Nikkilä; Petri Törönen; Samuel Kaski; Jarkko Venna; Eero Castrén; Garry Wong
Cluster structure of gene expression data obtained from DNA microarrays is analyzed and visualized with the Self-Organizing Map (SOM) algorithm. The SOM forms a non-linear mapping of the data to a two-dimensional map grid that can be used as an exploratory data analysis tool for generating hypotheses on the relationships, and ultimately of the function of the genes. Similarity relationships within the data and cluster structures can be visualized and interpreted. The methods are demonstrated by computing a SOM of yeast genes. The relationships of known functional classes of genes are investigated by analyzing their distribution on the SOM, the cluster structure is visualized by the U-matrix method, and the clusters are characterized in terms of the properties of the expression profiles of the genes. Finally, it is shown that the SOM visualizes the similarity of genes in a more trustworthy way than two alternative methods, multidimensional scaling and hierarchical clustering.
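The SOM-plus-U-matrix workflow described above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation of a plain online SOM and the U-matrix (average codebook distance to grid neighbors), under our own simple choices of learning-rate and neighborhood-width decay, not the pipeline used in the paper.

```python
import numpy as np

def train_som(data, grid=(8, 8), epochs=20, seed=0):
    """Online SOM: each sample pulls its best-matching unit and, via a
    Gaussian neighborhood on the 2-D grid, nearby units toward it."""
    rng = np.random.default_rng(seed)
    h, w = grid
    codebook = rng.normal(size=(h * w, data.shape[1]))
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                       # decaying learning rate
        sigma = max(1.0, (max(h, w) / 2) * (1 - epoch / epochs))  # shrinking neighborhood
        for idx in rng.permutation(data.shape[0]):
            v = data[idx]
            bmu = np.argmin(((codebook - v) ** 2).sum(axis=1))    # best-matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            neigh = np.exp(-d2 / (2 * sigma ** 2))
            codebook += lr * neigh[:, None] * (v - codebook)
    return codebook.reshape(h, w, -1)

def u_matrix(codebook):
    """Average distance from each map unit to its 4-neighborhood on the grid;
    high values mark cluster borders, low values cluster interiors."""
    h, w, _ = codebook.shape
    U = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            ds = []
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    ds.append(np.linalg.norm(codebook[i, j] - codebook[ni, nj]))
            U[i, j] = np.mean(ds)
    return U
```

Plotting `u_matrix(train_som(expression_profiles))` as a heat map gives the kind of cluster-border display the abstract refers to.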
BMC Bioinformatics | 2003
Samuel Kaski; Janne Nikkilä; Merja Oja; Jarkko Venna; Petri Törönen; Eero Castrén
Background: Conventionally, the first step in analyzing the large and high-dimensional data sets measured by microarrays is visual exploration. Dendrograms of hierarchical clustering, self-organizing maps (SOMs), and multidimensional scaling have been used to visualize similarity relationships of data samples. We address two central properties of the methods: (i) Are the visualizations trustworthy, i.e., if two samples are visualized to be similar, are they really similar? (ii) The metric. The measure of similarity determines the result; we propose using a new learning metrics principle to derive a metric from interrelationships among data sets. Results: The trustworthiness of hierarchical clustering, multidimensional scaling, and the self-organizing map were compared in visualizing similarity relationships among gene expression profiles. The self-organizing map was the best, except that hierarchical clustering was the most trustworthy for the most similar profiles. Trustworthiness can be further increased by treating separately those genes for which the visualization is least trustworthy. We then proceed to improve the metric. The distance measure between the expression profiles is adjusted to measure differences relevant to functional classes of the genes. The genes for which the new metric is the most different from the usual correlation metric are listed and visualized with one of the visualization methods, the self-organizing map, computed in the new metric. Conclusions: The conjecture from the methodological results is that the self-organizing map can be recommended to complement the usual hierarchical clustering for visualizing and exploring gene expression data. Discarding the least trustworthy samples and improving the metric improve the visualization further.
International Conference on Neural Information Processing | 1999
Samuel Kaski; Jarkko Venna; Teuvo Kohonen
We introduce a method for assigning colors to displays of cluster structures of high-dimensional data, such that the perceptual differences of the colors reflect the distances in the original data space as faithfully as possible. The cluster structure is first discovered with a self-organizing map (SOM), and then a new nonlinear projection method is applied to map the cluster structure into the CIELab color space. The projection method best preserves the local data distances, which are the most important ones, while the global order is still discernible from the colors, too. This allows the method to conform flexibly to the available color space. The output space of the projection need not necessarily be the color space, however; projections onto, say, two dimensions can be visualized as well.
Information Visualization | 2007
Jarkko Venna; Samuel Kaski
This paper has two intertwined goals: (i) to study the feasibility of an atlas of gene expression data sets as a visual interface to expression databanks, and (ii) to study which dimensionality reduction methods would be suitable for visualizing very high-dimensional data sets. Several new methods have been recently proposed for the estimation of data manifolds or embeddings, but they have so far not been compared in the task of visualization. In visualizations the dimensionality is constrained, in addition to the data itself, by the presentation medium. It turns out that an older method, curvilinear component analysis, outperforms the new ones in terms of trustworthiness of the projections. In a sample databank on gene expression, the main sources of variation were the differences between data sets, different labs, and different measurement methods. This hints at a need for better methods for making the data sets commensurable, in accordance with earlier studies. The good news is that the visualized overview, expression atlas, reveals many of these subsets. Hence, we conclude that dimensionality reduction even from 1339 to 2 can produce a useful interface to gene expression databanks.
Computational Statistics & Data Analysis | 2009
Jaakko Peltonen; Jarkko Venna; Samuel Kaski
Bayesian inference often requires approximating the posterior distribution by Markov chain Monte Carlo sampling. The samples come from the true distribution only after the simulation has converged, which makes detecting convergence a central problem. Commonly, several simulation chains are started from different points, and their overlap is used as a measure of convergence. Convergence measures cannot tell the analyst the cause of convergence problems; it is suggested that complementing them with proper visualization will help. A novel connection is pointed out: linear discriminant analysis (LDA) minimizes the overlap of the simulation chains measured by a common multivariate convergence measure. LDA is thus justified for visualizing convergence. However, LDA makes restrictive assumptions about the chains, which can be relaxed by a recent extension called discriminative component analysis (DCA). Lastly, methods are introduced for unidentifiable models and model families with variable number of parameters, where straightforward visualization in the parameter space is not feasible.
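The LDA-based convergence view described above can be sketched as follows. This is an illustrative NumPy implementation of the classical Fisher discriminant (eigenvectors of Sw⁻¹Sb) applied to labeled MCMC chains, not the authors' code; the function name `lda_projection` is our choice, and the DCA extension is not included.

```python
import numpy as np

def lda_projection(chains, n_components=2):
    """Project MCMC samples onto the Fisher discriminant directions
    that best separate the simulation chains.

    chains: list of (n_samples_i, dim) arrays, one array per chain.
    Returns (projected samples, chain labels). If the chains have not
    converged, their lack of overlap is most visible in this view.
    """
    X = np.vstack(chains)
    labels = np.concatenate([np.full(len(c), i) for i, c in enumerate(chains)])
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-chain scatter
    Sb = np.zeros((d, d))  # between-chain scatter
    for c in chains:
        mi = c.mean(axis=0)
        Sw += (c - mi).T @ (c - mi)
        Sb += len(c) * np.outer(mi - mean, mi - mean)
    # Discriminant directions: leading eigenvectors of Sw^{-1} Sb.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    W = evecs.real[:, order[:n_components]]
    return X @ W, labels

```

Scatter-plotting the two projected components colored by chain label gives the convergence display: overlapping clouds suggest convergence, separated clouds point at the offending chains.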
European Conference on Machine Learning | 2003
Jarkko Venna; Samuel Kaski; Jaakko Peltonen
Bayesian inference often requires approximating the posterior distribution with Markov Chain Monte Carlo (MCMC) sampling. A central problem with MCMC is how to detect whether the simulation has converged. The samples come from the true posterior distribution only after convergence. A common solution is to start several simulations from different starting points, and measure overlap of the different chains. We point out that Linear Discriminant Analysis (LDA) minimizes the overlap measured by the usual multivariate overlap measure. Hence, LDA is a justified method for visualizing convergence. However, LDA makes restrictive assumptions about the distributions of the chains and their relationships. These restrictions can be relaxed by a recently introduced extension.
Journal of Machine Learning Research | 2010
Jarkko Venna; Jaakko Peltonen; Kristian Nybo; Helena Aidos; Samuel Kaski
Archive | 2007
Jarkko Venna