Jarkko Tikka

Helsinki University of Technology

Publication

Featured research published by Jarkko Tikka.

International Conference on Artificial Neural Networks | 2005

Multiresponse sparse regression with application to multidimensional scaling

Timo Similä; Jarkko Tikka

Sparse regression is the problem of selecting a parsimonious subset of all available regressors for an efficient prediction of a target variable. We consider a general setting in which both the target and regressors may be multivariate. The regressors are selected by a forward selection procedure that extends the Least Angle Regression algorithm. Instead of the common practice of estimating each target variable individually, our proposed method chooses sequentially those regressors that allow, on average, the best predictions of all the target variables. We illustrate the procedure by an experiment with artificial data. The method is also applied to the task of selecting relevant pixels from images in multidimensional scaling of handwritten digits.

Computational Statistics & Data Analysis | 2007

Input selection and shrinkage in multiresponse linear regression

Timo Similä; Jarkko Tikka

The regression problem of modeling several response variables using the same set of input variables is considered. The model is linearly parameterized and the parameters are estimated by minimizing the error sum of squares subject to a sparsity constraint. The constraint has the effect of eliminating useless inputs and constraining the parameters of the remaining inputs in the model. Two algorithms for solving the resulting convex cone programming problem are proposed. The first algorithm gives a pointwise solution, while the second one computes the entire path of solutions as a function of the constraint parameter. Based on experiments with real data sets, the proposed method has a similar performance to existing methods. In simulation experiments, the proposed method is competitive both in terms of prediction accuracy and correctness of input selection. The advantages become more apparent when many correlated inputs are available for model construction.

BMC Medical Genomics | 2008

Classification of human cancers based on DNA copy number amplification modeling

Samuel Myllykangas; Jarkko Tikka; Tom Böhling; Sakari Knuutila; Jaakko Hollmén

Background: DNA amplifications alter gene dosage in cancer genomes by multiplying the gene copy number. Amplifications are quintessential in a considerable number of advanced cancers of various anatomical locations. The aims of this study were to classify human cancers based on their amplification patterns, explore the biological and clinical fundamentals behind their amplification-pattern based classification, and understand the characteristics in human genomic architecture that associate with amplification mechanisms. Methods: We applied a machine learning approach to model DNA copy number amplifications using a data set of binary amplification records at chromosome sub-band resolution from 4400 cases that represent 82 cancer types. Amplification data was fused with background data: clinical, histological and biological classifications, and cytogenetic annotations. Statistical hypothesis testing was used to mine associations between the data sets. Results: Probabilistic clustering of each chromosome identified 111 amplification models and divided the cancer cases into clusters. The distribution of classification terms in the amplification-model based clustering of cancer cases revealed cancer classes that were associated with specific DNA copy number amplification models. Amplification patterns – finite or bounded descriptions of the ranges of the amplifications in the chromosome – were extracted from the clustered data and expressed according to the original cytogenetic nomenclature. This was achieved by maximal frequent itemset mining using the cluster-specific data sets. The boundaries of amplification patterns were shown to be enriched with fragile sites, telomeres, centromeres, and light chromosome bands. Conclusions: Our results demonstrate that amplifications are non-random chromosomal changes and specifically selected in tumor tissue microenvironment. Furthermore, statistical evidence showed that specific chromosomal features co-localize with amplification breakpoints and link them in the amplification process.

Neurocomputing | 2009

Simultaneous input variable and basis function selection for RBF networks

Jarkko Tikka

Input selection is advantageous in regression problems. It may, for example, decrease the training time of models, reduce measurement costs, and assist in circumventing problems of high dimensionality. Also, the inclusion of useless inputs into the model increases the likelihood of overfitting. Neural networks provide good generalization in many cases, but their interpretability is usually limited. However, selecting a subset of variables and estimating their relative importances would be valuable in many real world applications. In the present work, a simultaneous input and basis function selection method for a radial basis function (RBF) network is proposed. The selection is performed by solving a constrained optimization problem, in which sparsity of the network is controlled by two continuous valued shrinkage parameters. Each input dimension is weighted and the constraints are imposed on these weights and the output layer coefficients. Direct and alternating optimization (AO) procedures are presented to solve the problem. The proposed method is applied to simulated and benchmark data. In comparison with existing methods, the resulting RBF networks have similar prediction accuracies with smaller numbers of inputs and basis functions.

Neurocomputing | 2008

Sequential input selection algorithm for long-term prediction of time series

Jarkko Tikka; Jaakko Hollmén

In time series prediction, making accurate predictions is often the primary goal. At the same time, interpretability of the models would be desirable. For the latter goal, we have devised a sequential input selection algorithm (SISAL) to choose a parsimonious, or sparse, set of input variables. Our proposed algorithm is a sequential backward selection type algorithm based on a cross-validation resampling procedure. Our strategy is to use a filter approach in the prediction: first we select a sparse set of inputs using linear models and then the selected inputs are used in the nonlinear prediction conducted with multilayer-perceptron networks. Furthermore, we perform a sensitivity analysis by quantifying the importance of the individual input variables in the nonlinear models using a method based on partial derivatives. Experiments are done with the Santa Fe laser data set that exhibits very nonlinear behavior and a data set in a problem of electricity load prediction. The results in the prediction problems of varying difficulty highlight the range of applicability of our proposed algorithm. In summary, our SISAL yields accurate and parsimonious prediction models giving insight to the original problem.

International Work-Conference on Artificial and Natural Neural Networks | 2007

Mixture modeling of DNA copy number amplification patterns in cancer

Jarkko Tikka; Jaakko Hollmén; Samuel Myllykangas

DNA copy number amplifications are hallmarks of many cancers. In this work we analyzed data of genome-wide DNA copy number amplifications collected from more than 4500 neoplasm cases. Based on the 0-1 representation of the data, we trained finite mixtures of multivariate Bernoulli distributions using the EM algorithm to describe the inherent structure in the data. The resulting component distributions of the mixtures of Bernoulli distributions yielded plausible and localized amplification patterns. Individual amplification patterns were tested for their role in cancer groups formed with known risk associations. Our detailed analysis of chromosome 1 showed that asbestos-exposure related and hormonal imbalance-associated cancers were clustered and specific chromosome bands, 1p34 and 1q42, were identified. These sites contain cancer genes, which might explain the condition-specific selection of these loci for amplification.
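The training step described above can be illustrated with a minimal sketch of EM for a finite mixture of multivariate Bernoulli distributions. This is not the authors' implementation: the function name `fit_bernoulli_mixture`, the random initialization, the fixed iteration count, and the clipping constant `eps` are all assumptions made for illustration.

```python
import numpy as np


def fit_bernoulli_mixture(X, n_components, n_iter=100, seed=0, eps=1e-9):
    """Fit a mixture of multivariate Bernoulli distributions with EM.

    X is an (n_samples, n_features) array of 0/1 values. Returns the mixing
    weights, the per-component Bernoulli means, the responsibilities, and
    the log-likelihood recorded at each iteration.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    weights = np.full(n_components, 1.0 / n_components)
    # Assumed initialization: random means kept away from 0 and 1.
    means = rng.uniform(0.25, 0.75, size=(n_components, d))
    log_liks = []
    for _ in range(n_iter):
        # E-step: log of weight_k * p(x_i | component k), in the log domain.
        log_joint = (
            X @ np.log(means).T
            + (1.0 - X) @ np.log(1.0 - means).T
            + np.log(weights)
        )
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        resp = np.exp(log_joint - log_norm)
        log_liks.append(float(log_norm.sum()))
        # M-step: re-estimate weights and means from the responsibilities.
        nk = resp.sum(axis=0)
        weights = np.clip(nk / n, eps, None)
        weights = weights / weights.sum()
        means = np.clip((resp.T @ X) / (nk[:, None] + eps), eps, 1.0 - eps)
    return weights, means, resp, log_liks
```

Each row of `means` can be read as one amplification pattern: entries close to 1 mark the positions that are typically amplified in cases assigned to that component. Choosing the number of components is left open here.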

Intelligent Data Analysis | 2007

Compact and understandable descriptions of mixtures of Bernoulli distributions

Jaakko Hollmén; Jarkko Tikka

Finite mixture models can be used in estimating complex, unknown probability distributions and also in clustering data. The parameters of the models form a complex representation and are not suitable for interpretation purposes as such. In this paper, we present a methodology to describe the finite mixture of multivariate Bernoulli distributions with a compact and understandable description. First, we cluster the data with the mixture model and subsequently extract the maximal frequent itemsets from the cluster-specific data sets. The mixture model is used to model the data set globally and the frequent itemsets model the marginal distributions of the partitioned data locally. We present the results in understandable terms that reflect the domain properties of the data. In our application of analyzing DNA copy number amplifications, the descriptions of amplification patterns are represented in nomenclature used in literature to report amplification patterns and generally used by domain experts in biology and medicine.
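The itemset extraction step can be sketched as follows. This is a brute-force, level-wise search intended only for small examples, not the mining algorithm used in the paper; the function name `maximal_frequent_itemsets` and the chromosome band labels in the usage note are hypothetical.

```python
from itertools import combinations


def maximal_frequent_itemsets(transactions, min_support):
    """Return itemsets that occur in at least `min_support` transactions
    and have no frequent proper superset."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    frequent = []
    current = [frozenset([item]) for item in items]
    while current:
        # Keep only the candidates that meet the support threshold.
        level = [
            s for s in current
            if sum(s <= t for t in transactions) >= min_support
        ]
        frequent.extend(level)
        # Join frequent k-itemsets that differ in exactly one item.
        current = list({
            a | b for a, b in combinations(level, 2)
            if len(a | b) == len(a) + 1
        })
    # An itemset is maximal when no other frequent itemset strictly contains it.
    return [s for s in frequent if not any(s < other for other in frequent)]
```

For a cluster with the made-up records `{"1q21", "1q22", "1q23"}`, `{"1q21", "1q22", "1q23"}`, `{"1q21", "1q22"}` and `{"1p34"}`, a support threshold of 2 leaves a single maximal itemset covering 1q21 to 1q23, which could then be reported as one contiguous amplification range.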

International Joint Conference on Neural Networks | 2006

Common Subset Selection of Inputs in Multiresponse Regression

Timo Similä; Jarkko Tikka

We propose the multiresponse sparse regression algorithm, an input selection method for the purpose of estimating several response variables. It is a forward selection procedure for linearly parameterized models, which updates with carefully chosen step lengths. The step length rule extends the correlation criterion of the least angle regression algorithm for many responses. We present a general concept and explicit formulas for three different variants of the algorithm. Based on experiments with simulated data, the proposed method competes favorably with other methods when many correlated inputs are available for model construction. We also study the performance with several real data sets.

International Conference on Artificial Neural Networks | 2005

Input selection for long-term prediction of time series

Jarkko Tikka; Jaakko Hollmén; Amaury Lendasse

Prediction of time series is an important problem in many areas of science and engineering. Extending the horizon of predictions further to the future is the challenging and difficult task of long-term prediction. In this paper, we investigate the problem of selecting non-contiguous input variables for an autoregressive prediction model in order to improve the prediction ability. We present an algorithm in the spirit of backward selection which removes variables sequentially from the prediction models based on the significance of the individual regressors. We successfully test the algorithm with a non-linear system by selecting inputs with a linear model and finally training a non-linear predictor with the selected variables on the Santa Fe laser data set.
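A rough sketch of backward input selection for a linear autoregressive model is given below. The abstract does not say how significance is measured, so the ordinary least-squares t-statistic used here is an assumed stand-in, and the sketch covers only one-step-ahead prediction. The names `lagged_matrix` and `backward_select` are hypothetical.

```python
import numpy as np


def lagged_matrix(series, n_lags):
    """Build inputs [y(t-1), ..., y(t-n_lags)] and targets y(t)."""
    y = np.asarray(series, dtype=float)
    X = np.column_stack(
        [y[n_lags - k: len(y) - k] for k in range(1, n_lags + 1)]
    )
    return X, y[n_lags:]


def backward_select(X, y, n_keep):
    """Repeatedly drop the input with the smallest |t-statistic| until
    n_keep inputs remain. Returns the surviving column indices."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        A = X[:, active]
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        # Unbiased noise variance estimate for the current linear model.
        sigma2 = resid @ resid / (len(y) - len(active))
        cov = sigma2 * np.linalg.inv(A.T @ A)
        t_stat = np.abs(coef) / np.sqrt(np.diag(cov))
        # Assumed criterion: remove the least significant regressor.
        active.pop(int(np.argmin(t_stat)))
    return active
```

Column index `i` corresponds to lag `i + 1`, so the selected lags need not be contiguous. The selected columns could then be passed to a non-linear predictor, as the abstract describes.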

Metabolomics | 2010

Functional prediction of unidentified lipids using supervised classifiers

Laxman Yetukuri; Jarkko Tikka; Jaakko Hollmén; Matej Orešič

Mass spectrometry (MS)-based metabolomics studies often require handling of both identified and unidentified metabolite data. In order to avoid bias in data interpretation, it would be advantageous for the data analysis to include all available data. A practical challenge in exploratory metabolomics analysis is therefore how to interpret the changes related to unidentified peaks. In this paper, we address the challenge by predicting the class membership of unknown peaks by applying and comparing multiple supervised classifiers to selected lipidomics datasets. The employed classifiers include k-nearest neighbours (k-NN), support vector machines (SVM), partial least squares discriminant analysis (PLS-DA) and Naive Bayes methods, which are known to be effective and efficient in predicting the labels for unseen data. Here, the class label predictions are sought for unidentified lipid profiles coming from high throughput global screening in an Ultra Performance Liquid Chromatography Mass Spectrometry (UPLC™/MS) experimental setup. Our investigation reveals that k-NN and SVM classifiers outperform both PLS-DA and Naive Bayes classifiers. The Naive Bayes classifier performs worst among all models, and this observation seems logical as lipids are highly co-regulated and do not respect the Naive Bayes assumption of features being conditionally independent given the class. Common label predictions from k-NN and SVM can serve as a good starting point to explore the full data and thereby facilitate exploratory studies where label information is critical for the data interpretation.
