Jorma Boberg
Turku Centre for Computer Science
Publication
Featured research published by Jorma Boberg.
Machine Learning | 2009
Tapio Pahikkala; Evgeni Tsivtsivadze; Antti Airola; Jouni Järvinen; Jorma Boberg
In this paper, we introduce a framework for regularized least-squares (RLS) type of ranking cost functions and we propose three such cost functions. Further, we propose a kernel-based preference learning algorithm, which we call RankRLS, for minimizing these functions. It is shown that RankRLS has many computational advantages compared to the ranking algorithms that are based on minimizing other types of costs, such as the hinge cost. In particular, we present efficient algorithms for training, parameter selection, multiple output learning, cross-validation, and large-scale learning. Circumstances under which these computational benefits make RankRLS preferable to RankSVM are considered. We evaluate RankRLS on four different types of ranking tasks using RankSVM and the standard RLS regression as the baselines. RankRLS outperforms the standard RLS regression and its performance is very similar to that of RankSVM, while RankRLS has several computational benefits over RankSVM.
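As a rough illustration of the least-squares ranking idea in the abstract above, the sketch below fits a linear ranker by minimizing squared errors on pairwise score differences with a ridge penalty. It is not the RankRLS implementation from the paper: it assumes a linear model, a cost over all pairs, and a direct solve, and the names `fit_pairwise_rls` and `pairwise_agreement` are hypothetical. The kernel version and the three cost functions proposed in the paper are not reproduced here.

```python
import numpy as np

def fit_pairwise_rls(X, y, lam=1.0):
    """Linear ranker minimizing
    sum_{i<j} ((y_i - y_j) - (f_i - f_j))^2 + lam * ||w||^2, with f = X @ w."""
    n, d = X.shape
    # Laplacian of the complete graph: r^T L r equals the sum of squared
    # pairwise differences of the residual vector r over all pairs i < j.
    L = n * np.eye(n) - np.ones((n, n))
    A = X.T @ L @ X + lam * np.eye(d)
    b = X.T @ L @ y
    return np.linalg.solve(A, b)

def pairwise_agreement(scores, y):
    """Fraction of pairs with different labels that the scores order correctly."""
    correct = total = 0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue
            total += 1
            if (scores[i] - scores[j]) * (y[i] - y[j]) > 0:
                correct += 1
    return correct / total

# Synthetic data (assumption: noise-free linear utility plus a constant offset).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 5.0  # the offset cancels in every pairwise difference
w = fit_pairwise_rls(X, y, lam=1e-3)
agreement = pairwise_agreement(X @ w, y)
```

Because the cost depends only on score differences, the constant offset in `y` has no effect on the fitted weights, which is one way a ranking cost differs from plain RLS regression.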
JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications | 2004
Sampo Pyysalo; Filip Ginter; Tapio Pahikkala; Jorma Boberg; Jouni Järvinen; Tapio Salakoski; Jeppe Koivula
In this paper, we present an evaluation of the Link Grammar parser on a corpus consisting of sentences describing protein-protein interactions. We introduce the notion of an interaction subgraph, which is the subgraph of a dependency graph expressing a protein-protein interaction. We measure the performance of the parser for recovery of dependencies, fully correct linkages and interaction subgraphs. We analyze the causes of parser failure and report specific causes of error, and identify potential modifications to the grammar to address the identified issues. We also report and discuss the effect of an extension to the dictionary of the parser.
Pattern Recognition | 1993
Jorma Boberg; Tapio Salakoski
Agglomerative clustering methods with stopping criteria are generalized. Clustering-related concepts are rigorously formulated, with special attention to the metricity of the object space. A new definition of combinatoriality is given, and a stronger proposition of monotonicity is proven. Specializations of the general method are applied to non-attributive non-metric and attributive pseudometric representations of biosequences. The furthest neighbor method is shown to be suitable for non-metric use. In metric object space, four inter-cluster distance functions, including a new truly context-sensitive method, are compared using a method-independent goodness criterion. For biosequence clustering, the new method outperforms the UPGMA, UPGMC, and furthest neighbor methods.
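The general scheme in the abstract above, agglomerative merging with a stopping criterion, can be sketched for the furthest neighbor case as follows. This is only a minimal sketch: the fixed distance threshold used as the stopping criterion, the function name `furthest_neighbor_clustering`, and the toy dissimilarity matrix are assumptions, not details taken from the paper.

```python
def furthest_neighbor_clustering(dist, threshold):
    """Merge clusters by furthest neighbor (complete) linkage until the
    closest pair of clusters is farther apart than `threshold`."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Furthest neighbor: the inter-cluster distance is the largest
        # object-to-object dissimilarity across the two clusters.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > threshold:  # stopping criterion (assumed: fixed threshold)
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Toy symmetric dissimilarity matrix for four objects (illustrative values).
dist = [
    [0, 1, 9, 8],
    [1, 0, 7, 9],
    [9, 7, 0, 2],
    [8, 9, 2, 0],
]
result = furthest_neighbor_clustering(dist, threshold=3)
```

The procedure reads only pairwise dissimilarities and never relies on the triangle inequality, so it runs unchanged on non-metric input.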
International Journal of Medical Informatics | 2006
Sampo Pyysalo; Filip Ginter; Tapio Pahikkala; Jorma Boberg; Jouni Järvinen; Tapio Salakoski
We present an evaluation of Link Grammar and Connexor Machinese Syntax, two major broad-coverage dependency parsers, on a custom hand-annotated corpus consisting of sentences regarding protein-protein interactions. In the evaluation, we apply the notion of an interaction subgraph, which is the subgraph of a dependency graph expressing a protein-protein interaction. We measure the performance of the parsers for recovery of individual dependencies, fully correct parses, and interaction subgraphs. For Link Grammar, an open system that can be inspected in detail, we further perform a comprehensive failure analysis, report specific causes of error, and suggest potential modifications to the grammar. We find that both parsers perform worse on biomedical English than previously reported on general English. While Connexor Machinese Syntax significantly outperforms Link Grammar, the failure analysis suggests specific ways in which the latter could be modified for better performance in the domain.
BMC Bioinformatics | 2005
Tapio Pahikkala; Filip Ginter; Jorma Boberg; Jouni Järvinen; Tapio Salakoski
Background: The ability to distinguish between genes and proteins is essential for understanding biological text. Support Vector Machines (SVMs) have been proven to be very efficient in general data mining tasks. We explore their capability for the gene versus protein name disambiguation task. Results: We incorporated into the conventional SVM a weighting scheme based on distances of context words from the word to be disambiguated. This weighting scheme increased the performance of SVMs by five percentage points, giving performance better than 85% as measured by the area under the ROC curve, and outperformed the Weighted Additive Classifier, which also incorporates the weighting, and the Naive Bayes classifier. Conclusion: We show that the performance of SVMs can be improved by the proposed weighting scheme. Furthermore, our results suggest that in this study the increase of the classification performance due to the weighting is greater than that obtained by selecting the underlying classifier or the kernel part of the SVM.
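To make the idea of distance-based context weighting concrete, the sketch below builds a bag-of-words feature vector in which each context word is weighted by its distance from the word to be disambiguated. The abstract does not state the weighting function, so the 1/distance decay, the window size, the function name `weighted_context_features`, and the example sentence are all assumptions made for illustration.

```python
from collections import defaultdict

def weighted_context_features(tokens, target_index, window=3):
    """Bag-of-words over a context window around the target word,
    with each context word weighted by 1 / distance (assumed decay)."""
    features = defaultdict(float)
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for pos in range(lo, hi):
        if pos == target_index:
            continue  # the ambiguous word itself is not a context feature
        distance = abs(pos - target_index)
        features[tokens[pos].lower()] += 1.0 / distance
    return dict(features)

# Hypothetical example sentence; "SOX9" is the name to be disambiguated.
tokens = "expression of the SOX9 gene was reduced".split()
features = weighted_context_features(tokens, tokens.index("SOX9"))
```

Feature dictionaries of this kind could then be vectorized and passed to any standard SVM implementation; words adjacent to the target contribute most, and distant words contribute progressively less.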
Intelligent Data Analysis | 2005
Evgeni Tsivtsivadze; Tapio Pahikkala; Sampo Pyysalo; Jorma Boberg; Aleksandr Mylläri; Tapio Salakoski
We present an adaptation of the Regularized Least-Squares algorithm for the rank learning problem and an application of the method to reranking of the parses produced by the Link Grammar (LG) dependency parser. We study the use of several grammatically motivated features extracted from parses and evaluate the ranker with individual features and the combination of all features on a set of biomedical sentences annotated for syntactic dependencies. Using a parse goodness function based on the F-score, we demonstrate that our method produces a statistically significant increase in rank correlation from 0.18 to 0.42 compared to the built-in ranking heuristics of the LG parser. Further, we analyze the performance of our ranker with respect to the number of sentences and parses per sentence used for training and illustrate that the method is applicable to sparse datasets, showing improved performance with as few as 100 training sentences.
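The abstract above reports results as a rank correlation between predicted and reference parse orderings. It does not say which coefficient is used, so the sketch below shows Kendall's tau (the tau-a variant, without tie correction) purely as one common choice. The function name `kendall_tau` and the example scores are hypothetical.

```python
def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
            # pairs tied in either list count as neither
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical F-score-based goodness values for four parses of one sentence,
# and the scores a ranker assigned to the same parses.
f_scores = [0.9, 0.7, 0.5, 0.3]
predicted = [0.8, 0.9, 0.4, 0.1]
tau = kendall_tau(f_scores, predicted)
```

Here five of the six parse pairs are ordered consistently and one is swapped, giving a coefficient of (5 - 1) / 6.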
Machine Learning | 2009
Tapio Pahikkala; Sampo Pyysalo; Jorma Boberg; Jouni Järvinen; Tapio Salakoski
In the application of machine learning methods with natural language inputs, the words and their positions in the input text are some of the most important features. In this article, we introduce a framework based on a word-position matrix representation of text, linear feature transformations of the word-position matrices, and kernel functions constructed from the transformations. We consider two categories of transformations, one based on word similarities and the second on their positions, which can be applied simultaneously in the framework in an elegant way. We show how word and positional similarities obtained by applying previously proposed techniques, such as latent semantic analysis, can be incorporated as transformations in the framework. We also introduce novel ways to determine word and positional similarities. We further present efficient algorithms for computing kernel functions incorporating the transformations on the word-position matrices, and, more importantly, introduce a highly efficient method for prediction. The framework is particularly suitable for natural language disambiguation tasks where the aim is to select for a single word a particular property from a set of candidates based on the context of the word. We demonstrate the applicability of the framework to this type of task using context-sensitive spelling error correction on the Reuters News corpus as a model problem.
Industrial and Engineering Applications of Artificial Intelligence and Expert Systems | 2006
Evgeni Tsivtsivadze; Tapio Pahikkala; Jorma Boberg; Tapio Salakoski
We propose a Locality-Convolution (LC) kernel and apply it to dependency parse ranking. The LC kernel measures parse similarities locally, within a small window constructed around each matching feature. Inside the window it makes use of a position-sensitive function to take into account the order in which the features appear. The similarity between two windows is calculated by computing the product of their common attributes, and the kernel value is the sum of the window similarities. We applied the introduced kernel together with the Regularized Least-Squares (RLS) algorithm to a dataset containing dependency parses obtained from a manually annotated biomedical corpus of 1100 sentences. Our experiments show that RLS with the LC kernel performs better than the baseline method. The results highlight the importance of local correlations and the order of feature appearance within the parse. Final validation demonstrates a statistically significant increase in parse ranking performance.
Lecture Notes in Computer Science | 2004
Filip Ginter; Tapio Pahikkala; Sampo Pyysalo; Jorma Boberg; Jouni Järvinen; Tapio Salakoski
In this paper, we introduce a way to apply rough set data analysis to the problem of extracting protein-protein interaction sentences in biomedical literature. Our approach builds on decision rules of protein names, interaction words, and their mutual positions in sentences. In order to broaden the set of potential interaction words, we develop a morphological model which generates spelling and inflection variants of the interaction words. We evaluate the performance of the proposed method on a hand-tagged dataset of 1894 sentences and show a precision-recall break-even performance of 79.8% using leave-one-out cross-validation.
IEEE Transactions on Biomedical Engineering | 2002
Pentti Riikonen; Jorma Boberg; Tapio Salakoski; Mauno Vihinen
We have developed a new way of accessing biological databases and bioinformatics applications on the Internet. This new service, the bioinformatics wireless application protocol (BioWAP) service, is accessible from mobile devices and makes it possible to use bioinformatics services where normal PC or personal digital assistant (PDA) connections are not feasible. The BioWAP service includes major biological databases and applications, demonstrating a simple method of implementing WAP interfaces to noncompliant applications, i.e., applications that are not WAP or Internet based. The BioWAP service can be browsed with any WAP terminal.