Tapio Pahikkala | Researchain

Publication

Featured research published by Tapio Pahikkala.

IEEE Transactions on Services Computing | 2015

Using Ant Colony System to Consolidate VMs for Green Cloud Computing

Fahimeh Farahnakian; Adnan Ashraf; Tapio Pahikkala; Pasi Liljeberg; Juha Plosila; Ivan Porres; Hannu Tenhunen

High energy consumption of cloud data centers is a matter of great concern. Dynamic consolidation of Virtual Machines (VMs) presents a significant opportunity to save energy in data centers. A VM consolidation approach uses live migration of VMs so that some of the under-loaded Physical Machines (PMs) can be switched off or put into a low-power mode. On the other hand, achieving the desired level of Quality of Service (QoS) between cloud providers and their users is critical. Therefore, the main challenge is to reduce energy consumption of data centers while satisfying QoS requirements. In this paper, we present a distributed system architecture to perform dynamic VM consolidation to reduce energy consumption of cloud data centers while maintaining the desired QoS. Since the VM consolidation problem is strictly NP-hard, we use an online optimization metaheuristic algorithm called Ant Colony System (ACS). The proposed ACS-based VM Consolidation (ACS-VMC) approach finds a near-optimal solution based on a specified objective function. Experimental results on real workload traces show that ACS-VMC reduces energy consumption while maintaining the required performance levels in a cloud data center. It outperforms existing VM consolidation approaches in terms of energy consumption, number of VM migrations, and QoS requirements concerning performance.

Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing | 2008

A Graph Kernel for Protein-Protein Interaction Extraction

Antti Airola; Sampo Pyysalo; Jari Björne; Tapio Pahikkala; Filip Ginter; Tapio Salakoski

In this paper, we propose a graph kernel based approach for the automated extraction of protein-protein interactions (PPI) from scientific literature. In contrast to earlier approaches to PPI extraction, the introduced all-dependency-paths kernel has the capability to consider full, general dependency graphs. We evaluate the proposed method across five publicly available PPI corpora providing the most comprehensive evaluation done for a machine learning based PPI-extraction system. Our method is shown to achieve state-of-the-art performance with respect to comparable evaluations, achieving 56.4 F-score and 84.8 AUC on the AImed corpus. Further, we identify several pitfalls that can make evaluations of PPI-extraction systems incomparable, or even invalid. These include incorrect cross-validation strategies and problems related to comparing F-score results achieved on different evaluation resources.

Briefings in Bioinformatics | 2015

Toward more realistic drug–target interaction predictions

Tapio Pahikkala; Antti Airola; Sami Pietilä; Sushil Shakyawar; Agnieszka Szwajda; Jing Tang; Tero Aittokallio

A number of supervised machine learning models have recently been introduced for the prediction of drug–target interactions based on chemical structure and genomic sequence information. Although these models could offer improved means for many network pharmacology applications, such as repositioning of drugs for new therapeutic uses, the prediction models are often being constructed and evaluated under overly simplified settings that do not reflect the real-life problem in practical applications. Using quantitative drug–target bioactivity assays for kinase inhibitors, as well as a popular benchmarking data set of binary drug–target interactions for enzyme, ion channel, nuclear receptor and G protein-coupled receptor targets, we illustrate here the effects of four factors that may lead to dramatic differences in the prediction results: (i) problem formulation (standard binary classification or more realistic regression formulation), (ii) evaluation data set (drug and target families in the application use case), (iii) evaluation procedure (simple or nested cross-validation) and (iv) experimental setting (whether training and test sets share common drugs and targets, only drugs or targets or neither). Each of these factors should be taken into consideration to avoid reporting overoptimistic drug–target interaction prediction results. We also suggest guidelines on how to make the supervised drug–target interaction prediction studies more realistic in terms of such model formulations and evaluation setups that better address the inherent complexity of the prediction task in the practical applications, as well as novel benchmarking data sets that capture the continuous nature of the drug–target interactions for kinase inhibitors.
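The fourth factor above, the experimental setting, can be made concrete with a small sketch. It labels a test pair by whether its drug and target also occur in the training pairs, and builds a split in which neither is shared. The function names, setting labels, and toy identifiers are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (drug, target); a hypothetical representation


def experimental_setting(test_pair: Pair, train_pairs: Iterable[Pair]) -> str:
    """Label a test pair by what it shares with the training set."""
    train_pairs = list(train_pairs)
    train_drugs = {d for d, _ in train_pairs}
    train_targets = {t for _, t in train_pairs}
    drug_seen = test_pair[0] in train_drugs
    target_seen = test_pair[1] in train_targets
    if drug_seen and target_seen:
        return "both shared"
    if drug_seen:
        return "drug shared only"      # the target is new
    if target_seen:
        return "target shared only"    # the drug is new
    return "neither shared"


def split_neither_shared(
    pairs: Iterable[Pair],
    held_out_drugs: Set[str],
    held_out_targets: Set[str],
) -> Tuple[List[Pair], List[Pair]]:
    """Build a split where test pairs share no drug and no target with training."""
    pairs = list(pairs)
    train = [(d, t) for d, t in pairs
             if d not in held_out_drugs and t not in held_out_targets]
    test = [(d, t) for d, t in pairs
            if d in held_out_drugs and t in held_out_targets]
    return train, test
```

Pairs that mix a held-out drug with a training target (or the reverse) are discarded by this split, which is the price of keeping the test set free of any shared drug or target.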

Computational Statistics & Data Analysis | 2011

An experimental comparison of cross-validation techniques for estimating the area under the ROC curve

Antti Airola; Tapio Pahikkala; Willem Waegeman; Bernard De Baets; Tapio Salakoski

Reliable estimation of the classification performance of inferred predictive models is difficult when working with small data sets. Cross-validation is in this case a typical strategy for estimating the performance. However, many standard approaches to cross-validation suffer from extensive bias or variance when the area under the ROC curve (AUC) is used as the performance measure. This issue is explored through an extensive simulation study. Leave-pair-out cross-validation is proposed for conditional AUC-estimation, as it is almost unbiased, and its deviation variance is as low as that of the best alternative approaches. When using regularized least-squares based learners, efficient algorithms exist for calculating the leave-pair-out cross-validation estimate.
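As a rough illustration of leave-pair-out cross-validation, the sketch below holds out every positive-negative pair in turn, trains on the remaining examples, and counts how often the held-out positive is scored above the held-out negative. The nearest-centroid learner is a hypothetical stand-in chosen only to keep the example self-contained; it is not the regularized least-squares learner from the paper, and this naive loop does not use the efficient algorithms the abstract mentions.

```python
from itertools import product
from typing import Callable, List, Sequence

Scorer = Callable[[Sequence[float]], float]


def train_centroid_scorer(X: List[Sequence[float]], y: List[int]) -> Scorer:
    """Stand-in learner (assumption): higher score means closer to the positive centroid."""
    dim = len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    mu_pos = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mu_neg = [sum(x[d] for x in neg) / len(neg) for d in range(dim)]

    def score(x: Sequence[float]) -> float:
        d_pos = sum((a - b) ** 2 for a, b in zip(x, mu_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, mu_neg))
        return d_neg - d_pos

    return score


def leave_pair_out_auc(X, y, train=train_centroid_scorer) -> float:
    """Estimate AUC by holding out each positive-negative pair once.

    Requires at least two examples of each class so that every
    training set still contains both classes.
    """
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    total = 0.0
    for i, j in product(pos_idx, neg_idx):
        keep = [k for k in range(len(y)) if k != i and k != j]
        scorer = train([X[k] for k in keep], [y[k] for k in keep])
        s_pos, s_neg = scorer(X[i]), scorer(X[j])
        # Correct ordering counts 1, a tie counts 0.5, a wrong ordering counts 0.
        total += 1.0 if s_pos > s_neg else (0.5 if s_pos == s_neg else 0.0)
    return total / (len(pos_idx) * len(neg_idx))
```

The loop retrains once per pair, so its cost grows with the product of the class sizes; that is acceptable for the small data sets the abstract has in mind.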

Machine Learning | 2009

An efficient algorithm for learning to rank from preference graphs

Tapio Pahikkala; Evgeni Tsivtsivadze; Antti Airola; Jouni Järvinen; Jorma Boberg

In this paper, we introduce a framework for regularized least-squares (RLS) type of ranking cost functions and we propose three such cost functions. Further, we propose a kernel-based preference learning algorithm, which we call RankRLS, for minimizing these functions. It is shown that RankRLS has many computational advantages compared to the ranking algorithms that are based on minimizing other types of costs, such as the hinge cost. In particular, we present efficient algorithms for training, parameter selection, multiple output learning, cross-validation, and large-scale learning. Circumstances under which these computational benefits make RankRLS preferable to RankSVM are considered. We evaluate RankRLS on four different types of ranking tasks using RankSVM and the standard RLS regression as the baselines. RankRLS outperforms the standard RLS regression and its performance is very similar to that of RankSVM, while RankRLS has several computational benefits over RankSVM.
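The abstract does not spell out the three cost functions. As a hedged illustration, the sketch below implements one common pairwise squared-error form for a linear scorer: for each compared pair, the difference in predicted scores is penalized for deviating from the difference in labels, and an L2 penalty is added. The exact form, the function names, and the parameter `lam` are assumptions and may differ from the costs proposed in the paper.

```python
from typing import List, Sequence, Tuple


def pairwise_squared_cost(
    scores: Sequence[float],
    labels: Sequence[float],
    pairs: Sequence[Tuple[int, int]],
) -> float:
    """Sum of squared errors between label differences and score differences."""
    return sum(
        ((labels[i] - labels[j]) - (scores[i] - scores[j])) ** 2
        for i, j in pairs
    )


def regularized_ranking_cost(
    w: Sequence[float],
    X: List[Sequence[float]],
    labels: Sequence[float],
    pairs: Sequence[Tuple[int, int]],
    lam: float,
) -> float:
    """Pairwise squared cost of a linear scorer plus an L2 penalty on the weights."""
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for x in X]
    penalty = lam * sum(wk * wk for wk in w)
    return pairwise_squared_cost(scores, labels, pairs) + penalty
```

Because only score differences enter the cost, adding a constant to every score leaves it unchanged, which matches the idea that a ranking depends on relative order rather than absolute values.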

JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications | 2004

Analysis of link grammar on biomedical dependency corpus targeted at protein-protein interactions

Sampo Pyysalo; Filip Ginter; Tapio Pahikkala; Jorma Boberg; Jouni Järvinen; Tapio Salakoski; Jeppe Koivula

In this paper, we present an evaluation of the Link Grammar parser on a corpus consisting of sentences describing protein-protein interactions. We introduce the notion of an interaction subgraph, which is the subgraph of a dependency graph expressing a protein-protein interaction. We measure the performance of the parser for recovery of dependencies, fully correct linkages and interaction subgraphs. We analyze the causes of parser failure and report specific causes of error, and identify potential modifications to the grammar to address the identified issues. We also report and discuss the effect of an extension to the dictionary of the parser.

PLOS Genetics | 2014

Regularized Machine Learning in the Genetic Prediction of Complex Traits

Sebastian Okser; Tapio Pahikkala; Antti Airola; Tapio Salakoski; Samuli Ripatti; Tero Aittokallio

Supervised machine learning aims at constructing a genotype–phenotype model by learning such genetic patterns from a labeled set of training examples that will also provide accurate phenotypic predictions in new cases with similar genetic background. Such predictive models are increasingly being applied to the mining of panels of genetic variants, environmental, or other nongenetic factors in the prediction of various complex traits and disease phenotypes [1]–[8]. These studies are providing increasing evidence in support of the idea that machine learning provides a complementary view into the analysis of high-dimensional genetic datasets as compared to standard statistical association testing approaches. In contrast to identifying variants explaining most of the phenotypic variation at the population level, supervised machine learning models aim to maximize the predictive (or generalization) power at the level of individuals, hence providing exciting opportunities for e.g., individualized risk prediction based on personal genetic profiles [9]–[11]. Machine learning models can also deal with genetic interactions, which are known to play an important role in the development and treatment of many complex diseases [12]–[16], but are often missed by single-locus association tests [17]. Even in the absence of significant single-loci marginal effects, multilocus panels from distinct molecular pathways may provide synergistic contribution to the prediction power, thereby revealing part of such hidden heritability component that has remained missing because of too small marginal effects to pass the stringent genome-wide significance filters [18]. Multivariate modeling approaches have already been shown to provide improved insights into genetic mechanisms and the interaction networks behind many complex traits, including atherosclerosis, coronary heart disease, and lipid levels, which would have gone undetected using the standard univariate modeling [2], [19]–[22]. 
However, machine learning models also come with inherent pitfalls, such as increased computational complexity and the risk for model overfitting, which must be understood in order to avoid reporting unrealistic prediction models or over-optimistic prediction results. We argue here that many medical applications of machine learning models in genetic disease risk prediction rely essentially on two factors: effective model regularization and rigorous model validation. We demonstrate the effects of these factors using representative examples from the literature as well as illustrative case examples. This review is not meant to be a comprehensive survey of all predictive modeling approaches; instead, we focus on regularized machine learning models, which enforce constraints on the complexity of the learned models so that they would ignore irrelevant patterns in the training examples. Simple risk allele counting or other multilocus risk models that do not incorporate any model parameters to be learned are outside the scope of this review; in fact, such simplistic models that assume independent variants may lead to suboptimal prediction performance in the presence of either direct or indirect interactions through epistasis effects or linkage disequilibrium, respectively [23], [24]. Perhaps the simplest models considered here as learning approaches are those based on weighted risk allele summaries [23], [25]. However, even with such basic risk models intended for predictive purposes, it is important to learn the model parameters (e.g., select the variants and determine their weights) based on training data only; otherwise there is a severe risk of model overfitting, i.e., models not being capable of generalizing to new samples [5]. Representative examples of how model learning and regularization approaches address the overfitting problem are briefly summarized in Box 1, while those readers interested in their implementation details are referred to the accompanying Text S1.
We specifically promote here the use of such regularized machine learning models that are scalable to the entire genome-wide scale, often based on linear models, which are easy to interpret and also enable straightforward variable selection. Genome-scale approaches avoid the need of relying on two-stage approaches [26], which apply standard statistical procedures to reduce the number of variants, since such prefiltering may miss predictive interactions across loci and therefore lead to reduced predictive performance [8], [24], [25], [27], [28].
Box 1. Synthesis of Learning Models for Genetic Risk Prediction
The aim of risk models is to capture in a mathematical form the patterns in the genetic and non-genetic data most important for the prediction of disease susceptibility. The first step in model building involves choosing the functional form of the model (e.g., linear or nonlinear), and then making use of a given training data to determine the adjustable parameters of the model (e.g., a subset of variants, their weights, and other model parameters). While it is often sufficient for a statistical model to enable high enough explanatory power in the discovery material, without being overly complicated, a predictive model is also required to generalize to unseen cases. One consideration in the model construction is how to encode the genotypic measurements using genotype models, such as the dominant, recessive, multiplicative, or additive model, each implying different assumptions about the genetic effects in the data [79]. Categorical variables 0, 1, and 2 are typically used for treating genetic predictor variables (e.g., minor allele dosage), while numeric values are required for continuous risk factors (e.g., blood pressure). Expected posterior probabilities of the genotypes can also be used, especially for imputed genotypes.
Transforming the genotype categories into three binary features is an alternative way to deal with missing values without imputation (used in the T1D example; see Text S1 for details). Statistical or machine learning models identify statistical or predictive interactions, respectively, rather than biological interactions between or within variants [12], [80]. While nonlinear models may better capture complex genetic interactions [7], [81], linear models are easier to interpret and provide a scalable option for performing supervised selection of multilocus variant panels at the genome-wide scale [3]. In linear models, genetic interactions are modeled implicitly by selecting such variant combinations that together are predictive of the phenotype, rather than considering pairwise gene–gene relationships explicitly. Formally, the trait $y_i$ to be predicted for an individual $i$ is modeled as a linear combination of the individual's predictor variables $x_{ij}$:
$$y_i = w_0 + \sum_{j=1}^{p} w_j x_{ij} \qquad (1)$$
Here, the weights $w_j$ are assumed constant across the $n$ individuals, $w_0$ is the bias offset term, and $p$ indicates the number of predictors discovered in the training data. In its basic form, Eq. 1 can be used for modeling continuous traits $y$ (linear regression). For case-control classification, the binary dependent variable $y$ is often transformed using a logistic loss function, which models the probability of a case class given a genotype profile and other risk factor covariates $x$ (logistic regression). It has been shown that the logistic regression and naive Bayes risk models are mathematically very closely related in the context of genetic risk prediction [81].
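
The linear risk model described above can be sketched in a few lines: a trait is predicted as a bias term plus a weighted sum of the predictors, and for case-control data the same linear score is passed through the logistic function. The ridge-style objective at the end is only one example of a complexity penalty and is an assumption here, as are all function names and the parameter `lam`; the excerpt does not specify which penalties the review uses.

```python
import math
from typing import List, Sequence


def predict_trait(x: Sequence[float], w: Sequence[float], w0: float) -> float:
    """Linear model: y = w0 + sum_j w_j * x_j (continuous trait)."""
    return w0 + sum(wj * xj for wj, xj in zip(w, x))


def case_probability(x: Sequence[float], w: Sequence[float], w0: float) -> float:
    """Logistic regression: probability of the case class given predictors x."""
    return 1.0 / (1.0 + math.exp(-predict_trait(x, w, w0)))


def ridge_objective(
    X: List[Sequence[float]],
    y: Sequence[float],
    w: Sequence[float],
    w0: float,
    lam: float,
) -> float:
    """Squared error plus an L2 penalty on the weights (assumed penalty form)."""
    squared_error = sum(
        (yi - predict_trait(xi, w, w0)) ** 2 for xi, yi in zip(X, y)
    )
    return squared_error + lam * sum(wj * wj for wj in w)
```

For instance, with minor allele dosages `x = [0, 1, 2]`, weights `w = [0.5, -0.2, 0.1]`, and `w0 = 1.0`, `predict_trait` returns the bias plus the dosage-weighted sum. A larger `lam` pushes the weights toward zero, trading training fit for lower model complexity.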

Computational Intelligence | 2011

Extracting Contextualized Complex Biological Events with Rich Graph-Based Feature Sets

Jari Björne; Juho Heimonen; Filip Ginter; Antti Airola; Tapio Pahikkala; Tapio Salakoski

We describe a system for extracting complex events among genes and proteins from biomedical literature, developed in the context of the BioNLP’09 Shared Task on Event Extraction. For each event, the system extracts its text trigger, class, and arguments. In contrast to the approaches prevailing prior to the shared task, events can be arguments of other events, resulting in a nested structure that better captures the underlying biological statements. We divide the task into independent steps which we approach as machine learning problems. We define a wide array of features and in particular make extensive use of dependency parse graphs. A rule‐based postprocessing step is used to refine the output in accordance with the restrictions of the extraction task. In the shared task evaluation, the system achieved an F‐score of 51.95% on the primary task, the best performance among the participants. Currently, with modifications and improvements described in this article, the system achieves 52.86% F‐score on Task 1, the primary task, improving on its original performance. In addition, we also extend the system to Tasks 2 and 3, gaining F‐scores of 51.28% and 50.18%, respectively. The system thus addresses the BioNLP’09 Shared Task in its entirety and achieves the best performance on all three subtasks.

Magnetic Resonance in Medicine | 2015

Mathematical models for diffusion-weighted imaging of prostate cancer using b values up to 2000 s/mm²: Correlation with Gleason score and repeatability of region of interest analysis

Jussi Toivonen; Harri Merisaari; Marko Pesola; Pekka Taimen; Peter J. Boström; Tapio Pahikkala; Hannu J. Aronen; Ivan Jambor

To evaluate four mathematical models for diffusion weighted imaging (DWI) of prostate cancer (PCa) in terms of PCa detection and characterization.

International Conference on Cloud Computing | 2014

Energy-Aware Dynamic VM Consolidation in Cloud Data Centers Using Ant Colony System

Fahimeh Farahnakian; Adnan Ashraf; Pasi Liljeberg; Tapio Pahikkala; Juha Plosila; Ivan Porres; Hannu Tenhunen

As the scale of a cloud data center grows, its energy consumption also grows rapidly. Dynamic consolidation of Virtual Machines (VMs) presents a significant opportunity to save energy by turning off unused Physical Machines (PMs) in data centers. In this paper, we present a distributed controller to perform dynamic VM consolidation to improve the resource utilization of PMs and to reduce their energy consumption. Moreover, we use the ant colony system to find a near-optimal VM placement solution based on the specified objective function. Experimental results on real workload traces from more than a thousand PlanetLab VMs show that the proposed approach reduces energy consumption and maintains required performance levels in a large-scale data center.
