Tapio Salakoski | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Tapio Salakoski is active.

Explore More

Publication

Featured researches published by Tapio Salakoski.

PLOS ONE | 2013

Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization

Sofie Van Landeghem; Jari Björne; Chih Hsuan Wei; Kai Hakala; Sampo Pyysalo; Sophia Ananiadou; Hung Yu Kao; Zhiyong Lu; Tapio Salakoski; Yves Van de Peer; Filip Ginter

Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique gene and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons – Attribution – Share Alike (CC BY-SA) license.

JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications | 2004

Analysis of link grammar on biomedical dependency corpus targeted at protein-protein interactions

Sampo Pyysalo; Filip Ginter; Tapio Pahikkala; Jorma Boberg; Jouni Järvinen; Tapio Salakoski; Jeppe Koivula

In this paper, we present an evaluation of the Link Grammar parser on a corpus consisting of sentences describing protein-protein interactions. We introduce the notion of an interaction subgraph, which is the subgraph of a dependency graph expressing a protein-protein interaction. We measure the performance of the parser for recovery of dependencies, fully correct linkages and interaction subgraphs. We analyze the causes of parser failure and report specific causes of error, and identify potential modifications to the grammar to address the identified issues. We also report and discuss the effect of an extension to the dictionary of the parser.

PLOS Genetics | 2014

Regularized Machine Learning in the Genetic Prediction of Complex Traits

Sebastian Okser; Tapio Pahikkala; Antti Airola; Tapio Salakoski; Samuli Ripatti; Tero Aittokallio

Supervised machine learning aims at constructing a genotype–phenotype model by learning such genetic patterns from a labeled set of training examples that will also provide accurate phenotypic predictions in new cases with similar genetic background. Such predictive models are increasingly being applied to the mining of panels of genetic variants, environmental, or other nongenetic factors in the prediction of various complex traits and disease phenotypes [1]–[8]. These studies are providing increasing evidence in support of the idea that machine learning provides a complementary view into the analysis of high-dimensional genetic datasets as compared to standard statistical association testing approaches. In contrast to identifying variants explaining most of the phenotypic variation at the population level, supervised machine learning models aim to maximize the predictive (or generalization) power at the level of individuals, hence providing exciting opportunities for e.g., individualized risk prediction based on personal genetic profiles [9]–[11]. Machine learning models can also deal with genetic interactions, which are known to play an important role in the development and treatment of many complex diseases [12]–[16], but are often missed by single-locus association tests [17]. Even in the absence of significant single-loci marginal effects, multilocus panels from distinct molecular pathways may provide synergistic contribution to the prediction power, thereby revealing part of such hidden heritability component that has remained missing because of too small marginal effects to pass the stringent genome-wide significance filters [18]. Multivariate modeling approaches have already been shown to provide improved insights into genetic mechanisms and the interaction networks behind many complex traits, including atherosclerosis, coronary heart disease, and lipid levels, which would have gone undetected using the standard univariate modeling [2], [19]–[22]. However, machine learning models also come with inherent pitfalls, such as increased computational complexity and the risk for model overfitting, which must be understood in order to avoid reporting unrealistic prediction models or over-optimistic prediction results. We argue here that many medical applications of machine learning models in genetic disease risk prediction rely essentially on two factors: effective model regularization and rigorous model validation. We demonstrate the effects of these factors using representative examples from the literature as well as illustrative case examples. This review is not meant to be a comprehensive survey of all predictive modeling approaches, but we focus on regularized machine learning models, which enforces constraints on the complexity of the learned models so that they would ignore irrelevant patterns in the training examples. Simple risk allele counting or other multilocus risk models that do not incorporate any model parameters to be learned are outside the scope of this review; in fact, such simplistic models that assume independent variants may lead to suboptimal prediction performance in the presence of either direct or indirect interactions through epistasis effects or linkage disequilibrium, respectively [23], [24]. Perhaps the simplest models considered here as learning approaches are those based on weighted risk allele summaries [23], [25]. However, even with such basic risk models intended for predictive purposes, it is important to learn the model parameters (e.g., select the variants and determine their weights) based on training data only; otherwise there is a severe risk of model overfitting, i.e., models not being capable of generalizing to new samples [5]. Representative examples of how model learning and regularization approaches address the overfitting problem are briefly summarized in Box 1, while those readers interested in their implementation details are referred to the accompanying Text S1. We specifically promote here the use of such regularized machine learning models that are scalable to the entire genome-wide scale, often based on linear models, which are easy to interpret and also enable straightforward variable selection. Genome-scale approaches avoid the need of relying on two-stage approaches [26], which apply standard statistical procedures to reduce the number of variants, since such prefiltering may miss predictive interactions across loci and therefore lead to reduced predictive performance [8], [24], [25], [27], [28]. Box 1. Synthesis of Learning Models for Genetic Risk Prediction The aim of risk models is to capture in a mathematical form the patterns in the genetic and non-genetic data most important for the prediction of disease susceptibility. The first step in model building involves choosing the functional form of the model (e.g., linear or nonlinear), and then making use of a given training data to determine the adjustable parameters of the model (e.g., a subset of variants, their weights, and other model parameters). While it is often sufficient for a statistical model to enable high enough explanatory power in the discovery material, without being overly complicated, a predictive model is also required to generalize to unseen cases. One consideration in the model construction is how to encode the genotypic measurements using genotype models, such as the dominant, recessive, multiplicative, or additive model, each implying different assumptions about the genetic effects in the data [79]. Categorical variables 0, 1, and 2 are typically used for treating genetic predictor variables (e.g., minor allele dosage), while numeric values are required for continuous risk factors (e.g., blood pressure). Expected posterior probabilities of the genotypes can also be used, especially for imputed genotypes. Transforming the genotype categories into three binary features is an alternative way to deal with missing values without imputation (used in the T1D example; see Text S1 for details). Statistical or machine learning models identify statistical or predictive interactions, respectively, rather than biological interactions between or within variants [12], [80]. While nonlinear models may better capture complex genetic interactions [7], [81], linear models are easier to interpret and provide a scalable option for performing supervised selection of multilocus variant panels at the genome-wide scale [3]. In linear models, genetic interactions are modeled implicitly by selecting such variant combinations that together are predictive of the phenotype, rather than considering pairwise gene–gene relationships explicitly. Formally, trait yi to be predicted for an individual i is modeled as a linear combination of the individuals predictor variables xij: (1) Here, the weights wj are assumed constant across the n individuals, w 0 is the bias offset term and p indicates the number of predictors discovered in the training data. In its basic form, Eq. 1 can be used for modeling continuous traits y (linear regression). For case-control classification, the binary dependent variable y is often transformed using a logistic loss function, which models the probability of a case class given a genotype profile and other risk factor covariates x (logistic regression). It has been shown that the logistic regression and naive Bayes risk models are mathematically very closely related in the context of genetic risk prediction [81].

Computer Science Education | 2006

What about a Simple Language? Analyzing the Difficulties in Learning to Program.

Linda Mannila; Mia Peltomäki; Tapio Salakoski

In this paper, we present the results from a two-part study. We analyze 60 programs written by novice programmers aged 16 – 19 after their first programming course, in either Java or Python. The aim is to find difficulties independent of the language used, and such originating from the language. Second, we analyze the transition from a “simple” language to a more “advanced” one, by following up on eight students, who learned programming in Python before moving on to Java. Our results suggest that a simple language gives rise to fewer syntax errors as well as logic errors. The qualitative part of our study did not reveal any disadvantages from having learned to program in a simple language when moving on to a more complex one. This suggests that not only can a simple language be used when introducing programming as a general skill, but also when providing basic skills to future professionals in the field.

meeting of the association for computational linguistics | 2007

On the unification of syntactic annotations under the Stanford dependency scheme: A case study on BioInfer and GENIA

Sampo Pyysalo; Filip Ginter; Veronika Laippala; Katri Haverinen; Juho Heimonen; Tapio Salakoski

Several incompatible syntactic annotation schemes are currently used by parsers and corpora in biomedical information extraction. The recently introduced Stanford dependency scheme has been suggested to be a suitable unifying syntax formalism. In this paper, we present a step towards such unification by creating a conversion from the Link Grammar to the Stanford scheme. Further, we create a version of the BioInfer corpus with syntactic annotation in this scheme. We present an application-oriented evaluation of the transformation and assess the suitability of the scheme and our conversion to the unification of the syntactic annotations of BioInfer and the GENIA Treebank. We find that a highly reliable conversion is both feasible to create and practical, increasing the applicability of both the parser and the corpus to information extraction.

Advances in Bioinformatics | 2012

Exploring Biomolecular Literature with EVEX: Connecting Genes through Events, Homology, and Indirect Associations

Sofie Van Landeghem; Kai Hakala; Samuel Rönnqvist; Tapio Salakoski; Yves Van de Peer; Filip Ginter

Technological advancements in the field of genetics have led not only to an abundance of experimental data, but also caused an exponential increase of the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators.

Pattern Recognition | 1993

General formulation and evaluation of agglomerative clustering methods with metric and non-metric distances

Jorma Boberg; Tapio Salakoski

Abstract Agglomerative clustering methods with stopping criteria are generalized. Clustering-related concepts are rigorously formulated with special consideration on metricity of object space. A new definition of combinatoriality is given, and a stronger proposition of monotonicity is proven. Specializations of the general method are applied to non-attributive non-metric and attributive pseudometric representations of biosequences. The furthest neighbor method is shown suitable for non-metric use. In metric object space, four inter-clusteral distance functions, including a new truly context sensitive method, are compared using a method-independent goodness criterion. For biosequence clustering, the new method overcomes the UPGMA, UPGMC, and furthest neighbor methods.

international conference on machine learning and applications | 2010

Speeding Up Greedy Forward Selection for Regularized Least-Squares

Tapio Pahikkala; Antti Airola; Tapio Salakoski

We propose a novel algorithm for greedy forward feature selection for regularized least-squares (RLS) regression and classification, also known as the least-squares support vector machine or ridge regression. The algorithm, which we call greedy RLS, starts from the empty feature set, and on each iteration adds the feature whose addition provides the best leave-one-out cross-validation performance. Our method is considerably faster than the previously proposed ones, since its time complexity is linear in the number of training examples, the number of features in the original data set, and the desired size of the set of selected features. Therefore, as a side effect we obtain a new training algorithm for learning sparse linear RLS predictors which can be used for large scale learning. This speed is possible due to matrix calculus based short-cuts for leave-one-out and feature addition. We experimentally demonstrate the scalability of our algorithm compared to previously proposed implementations.

BMC Bioinformatics | 2005

Contextual weighting for Support Vector Machines in literature mining: an application to gene versus protein name disambiguation

Tapio Pahikkala; Filip Ginter; Jorma Boberg; Jouni Järvinen; Tapio Salakoski

BackgroundThe ability to distinguish between genes and proteins is essential for understanding biological text. Support Vector Machines (SVMs) have been proven to be very efficient in general data mining tasks. We explore their capability for the gene versus protein name disambiguation task.ResultsWe incorporated into the conventional SVM a weighting scheme based on distances of context words from the word to be disambiguated. This weighting scheme increased the performance of SVMs by five percentage points giving performance better than 85% as measured by the area under ROC curve and outperformed the Weighted Additive Classifier, which also incorporates the weighting, and the Naive Bayes classifier.ConclusionWe show that the performance of SVMs can be improved by the proposed weighting scheme. Furthermore, our results suggest that in this study the increase of the classification performance due to the weighting is greater than that obtained by selecting the underlying classifier or the kernel part of the SVM.

intelligent data analysis | 2005

Regularized least-squares for parse ranking

Evgeni Tsivtsivadze; Tapio Pahikkala; Sampo Pyysalo; Jorma Boberg; Aleksandr Mylläri; Tapio Salakoski

We present an adaptation of the Regularized Least-Squares algorithm for the rank learning problem and an application of the method to reranking of the parses produced by the Link Grammar (LG) dependency parser. We study the use of several grammatically motivated features extracted from parses and evaluate the ranker with individual features and the combination of all features on a set of biomedical sentences annotated for syntactic dependencies. Using a parse goodness function based on the F-score, we demonstrate that our method produces a statistically significant increase in rank correlation from 0.18 to 0.42 compared to the built-in ranking heuristics of the LG parser. Further, we analyze the performance of our ranker with respect to the number of sentences and parses per sentence used for training and illustrate that the method is applicable to sparse datasets, showing improved performance with as few as 100 training sentences.

Explore More