Publication


Featured research published by Jorma Laurikkala.


Artificial Intelligence in Medicine in Europe | 2001

Improving Identification of Difficult Small Classes by Balancing Class Distribution

Jorma Laurikkala

We studied three methods to improve identification of difficult small classes by balancing imbalanced class distribution with data reduction. The new method, neighborhood cleaning rule (NCL), outperformed simple random and one-sided selection methods in experiments with ten data sets. All reduction methods improved identification of small classes (20-30%), but the differences were insignificant. However, significant differences in accuracies, true-positive rates and true-negative rates obtained with the 3-nearest neighbor method and C4.5 from the reduced data favored NCL. The results suggest that NCL is a useful method for improving the modeling of difficult small classes, and for building classifiers to identify these classes from the real-world data.
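The abstract does not spell out the steps of the neighborhood cleaning rule, so the following is only a minimal sketch of the general idea as it is commonly described: majority-class examples that their nearest neighbors misclassify are treated as noise and removed, and majority-class neighbors that cause a minority example to be misclassified are removed as well. The function names, the single `minority` label, and the omission of any class-size threshold are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from math import dist


def k_nearest(points, i, k=3):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: dist(points[i], points[j]))[:k]


def neighborhood_cleaning(points, labels, minority, k=3):
    """Return the indices kept after one NCL-style cleaning pass (a sketch)."""
    to_remove = set()
    for i, label in enumerate(labels):
        neighbors = k_nearest(points, i, k)
        vote = Counter(labels[j] for j in neighbors).most_common(1)[0][0]
        if vote == label:
            continue  # the neighbors agree with the label: nothing to clean
        if label != minority:
            # A majority-class example misclassified by its neighbors
            # is treated as noise and dropped.
            to_remove.add(i)
        else:
            # A misclassified minority example is kept; the majority-class
            # neighbors that outvote it are dropped instead.
            to_remove.update(j for j in neighbors if labels[j] != minority)
    return [i for i in range(len(points)) if i not in to_remove]


# Hypothetical toy data: four "neg" points near 0, three "pos" points near 5,
# and one stray "neg" point sitting inside the "pos" cluster.
points = [(0.0,), (0.1,), (0.2,), (0.3,), (5.0,), (5.1,), (5.2,), (5.05,)]
labels = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "neg"]
kept = neighborhood_cleaning(points, labels, minority="pos")
```

In this toy run only the stray "neg" point is removed, so the small class is left intact while the data around it is cleaned.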


Information Sciences | 2007

On principal component analysis, cosine and Euclidean measures in information retrieval

Tuomo Korenius; Jorma Laurikkala; Martti Juhola

Clustering groups document objects represented as vectors. An extensive vector space may cause obstacles to applying these methods. Therefore, the vector space was reduced with principal component analysis (PCA). The conventional cosine measure is not the only choice with PCA, which involves the mean-correction of data. Since mean-correction changes the location of the origin, the angles between the document vectors also change. To avoid this, we used a connection between the cosine measure and the Euclidean distance in association with PCA, and grounded searching on the latter. We applied the single and complete linkage and Ward clustering to Finnish documents utilizing their relevance assessment as a new feature. After the normalization of the data, PCA was run and relevant documents were clustered.
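The abstract does not state which connection between the two measures was used. One well-known relation, shown here as an assumption about what is meant, holds for vectors scaled to unit length: the squared Euclidean distance equals 2(1 - cos), so ranking by distance and ranking by cosine agree. The helper names below are hypothetical.

```python
from math import dist, isclose, sqrt


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def squared_unit_distance(a, b):
    """Squared Euclidean distance between the unit-length versions of a and b."""
    return dist(normalize(a), normalize(b)) ** 2


# Hypothetical document vectors, used only to check the identity numerically.
a, b = [3.0, 4.0], [4.0, 3.0]
identity_holds = isclose(squared_unit_distance(a, b), 2 * (1 - cosine(a, b)), abs_tol=1e-12)
```

Because the identity depends on unit length and not on where the origin lies after normalization, a distance-based search can be run on normalized data without computing angles directly.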


Conference on Information and Knowledge Management | 2004

Stemming and lemmatization in the clustering of Finnish text documents

Tuomo Korenius; Jorma Laurikkala; Kalervo Järvelin; Martti Juhola

Stemming and lemmatization were compared in the clustering of Finnish text documents. Since Finnish is a highly inflectional and agglutinative language, we hypothesized that lemmatization, involving splitting of the compound words, would be a more appropriate normalization approach than the straightforward stemming. The relevance of the documents was evaluated with a four-point relevance assessment scale, which was collapsed into a binary one by considering all the relevant and only the highly relevant documents relevant, respectively. Experiments with four hierarchical clustering methods supported the hypothesis. The stringent relevance scale showed that lemmatization allowed the single and complete linkage methods to recover especially the highly relevant documents better than stemming. In comparison with stemming, lemmatization together with the average linkage and Ward's methods produced higher precision. We conclude that lemmatization is a better word normalization method than stemming, when Finnish text documents are clustered for information retrieval.
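The two ways of collapsing the four-point scale can be illustrated with a short sketch. The numeric coding (0 for irrelevant up to 3 for highly relevant) and the function names are assumptions; the abstract only says that one binary scale counts all relevant documents and the other counts only the highly relevant ones.

```python
def collapse_liberal(grade):
    """Liberal scale: any degree of relevance (assumed grades 1-3) counts."""
    return grade >= 1


def collapse_stringent(grade):
    """Stringent scale: only the highest grade (assumed to be 3) counts."""
    return grade == 3


# Hypothetical relevance grades for five documents.
grades = [0, 1, 2, 3, 3]
liberal = [collapse_liberal(g) for g in grades]
stringent = [collapse_stringent(g) for g in grades]
```

Under these assumptions the liberal scale marks four of the five documents relevant, while the stringent scale marks only the last two.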


ACM Transactions on Information Systems | 2007

Creating and exploiting a comparable corpus in cross-language information retrieval

Tuomas Talvensaari; Jorma Laurikkala; Kalervo Järvelin; Martti Juhola; Heikki Keskustalo

We present a method for creating a comparable text corpus from two document collections in different languages. The collections can be very different in origin. In this study, we build a comparable corpus from articles by a Swedish news agency and a U.S. newspaper. The keys with the best resolution power were extracted from the documents of one collection, the source collection, by using the relative average term frequency (RATF) value. The keys were translated into the language of the other collection, the target collection, with a dictionary-based query translation program. The translated queries were run against the target collection and an alignment pair was made if the retrieved documents matched given date and similarity score criteria. The resulting comparable collection was used as a similarity thesaurus to translate queries along with a dictionary-based translator. The combined approaches outperformed translation schemes where dictionary-based translation or corpus translation was used alone.


Information Retrieval | 2008

Focused web crawling in the acquisition of comparable corpora

Tuomas Talvensaari; Ari Pirkola; Kalervo Järvelin; Martti Juhola; Jorma Laurikkala

Cross-Language Information Retrieval (CLIR) resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora in the genomics domain. The acquired corpora were used to statistically translate domain-specific words. The same words were also translated using a high-quality, but non-genomics-related parallel corpus, which fared considerably worse. We also evaluated our system with standard information retrieval (IR) experiments, combining statistical translation using the Web corpora with dictionary-based translation. The results showed improvement over pure dictionary-based translation. Therefore, mining the Web for comparable corpora seems promising.


Computer Methods and Programs in Biomedicine | 1998

A genetic-based machine learning system to discover the diagnostic rules for female urinary incontinence

Jorma Laurikkala; Martti Juhola

A machine learning system named Galactica has been developed which uses a genetic algorithm to discover the rules for an expert system from databases. Galactica devised accurate diagnostic rules for female urinary incontinence from difficult heterogeneous data. The percentages of correctly classified stress, mixed and sensory urge incontinence testing cases were 89, 86 and 87%, respectively. However, these rules were rather general, consisting of 4-6 out of 13 conditions available in the data. Diagnostic rules for stress and mixed incontinence extracted from straightforward homogeneous data were highly accurate, classifying 100% of testing cases correctly as well as being specific, having from 10 to 11 conditions. More specific, but less accurate, rules were found from heterogeneous data with a biased fitness function. All of the rules were correct, i.e. every condition in the rules had the expected value specified by the expert. Although Galactica achieved a slightly better classification than the discriminant analysis, it is argued that the genetic approach is better than the statistical one, due to symbolic rules being comprehensible, whereas understanding a complex mathematical model requires statistical expertise.


Computers in Biology and Medicine | 2001

Analysis of the imputed female urinary incontinence data for the evaluation of expert system parameters

Jorma Laurikkala; Martti Juhola; Seppo Lammi; Jorma Penttinen; Pauliina Aukee

We evaluated parameters for an expert system which will be designed to aid the differential diagnosis of female urinary incontinence by using knowledge discovered from data. To allow the statistical analysis, we applied means, regression and Expectation-Maximization (EM) imputation methods to fill in missing values. In addition, complete-case analysis was performed. Logistic regression results from the imputed data were reasonable. The significant parameters were mostly those that are important in the diagnostic work-up. Moreover, directions of relations between the parameters and the stress, mixed and sensory urge diagnoses were as expected. Analysis with the complete reduced data set gave clearly insufficient results. Imputed values had a moderate agreement, but odds ratios and classification accuracies of logistic regression equations were similar. Results suggest that with these data, simpler methods may be used to allow multivariate analysis and knowledge discovery, when better methods, such as EM imputation, are unavailable. Cluster analysis detected clusters corresponding to the small normal class, but was unable to clearly separate the larger incontinence classes.
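Of the imputation methods named above, means imputation is the simplest to illustrate. The sketch below assumes numeric data stored as rows of values with `None` marking a missing entry; the function name and data layout are hypothetical, and regression and EM imputation are not shown.

```python
def impute_means(rows):
    """Replace each None with the mean of the observed values in its column.

    Assumes every column has at least one observed value; a column that is
    entirely missing would raise ZeroDivisionError.
    """
    means = []
    for column in zip(*rows):
        observed = [value for value in column if value is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[c] if value is None else value for c, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical data: three cases, two numeric parameters, two missing values.
rows = [[1.0, None], [3.0, 4.0], [None, 6.0]]
completed = impute_means(rows)
```

Filling in values this way keeps every case available for multivariate analysis, which is the practical motivation the abstract gives for using simpler methods when EM imputation is unavailable.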


Annals of Otology, Rhinology, and Laryngology | 1999

Discovering Diagnostic Rules from a Neurotologic Database with Genetic Algorithms

Erna Kentala; Ilmari Pyykkö; Jorma Laurikkala; Martti Juhola

Data on patients with Ménière's disease, vestibular schwannoma, traumatic vertigo, sudden deafness, benign paroxysmal positional vertigo, or vestibular neuritis were retrieved from the database of the otoneurologic expert system ONE for the development and testing of a genetic algorithm (GA). The accuracy of the diagnostic rules in solving the test cases was 81%, 91%, 92%, 95%, 96%, and 98% for the respective diseases. The best rules retrieved from the GA were described by a set of questions with the most likely answers. The most important questions concerned the duration of hearing loss and the occurrence of head injury. The validity and structure of the rules created with a GA can be analyzed in detail. For rare diseases, some other reasoning process can be used, for example, case-based reasoning.


International Journal of Medical Informatics | 2000

Usefulness of imputation for the analysis of incomplete otoneurologic data

Jorma Laurikkala; Erna Kentala; Martti Juhola; Ilmari Pyykkö; Seppo Lammi

The usefulness of imputation in the treatment of missing values of an otoneurologic database for the discriminant analysis was evaluated on the basis of the agreement of imputed values and the analysis results. The data consisted of six patient groups with vertigo (N=564). There were 38 variables and 11% of the data was missing. Missing values were filled in with the means, regression and Expectation-Maximisation (EM) imputation methods, and a random imputation method provided the baseline results. Means, regression and EM methods agreed on 41-42% of the imputed missing values. The level of agreement between these and the random method was 20-22%. Despite the moderate agreement between the means, regression and EM methods, the discriminant functions were similar and accurate (prediction accuracy 83-99%). The discriminant functions obtained from the randomly imputed data were also accurate, having prediction accuracy 88-97%. Imputation seems to be a useful method for treating the missing data in this database. However, a lot of data was missing in otoneurologic tests, which are likely to be of less importance in the diagnosis of vertiginous patients. Consequently, the disagreement of the methods did not clearly affect the discriminant analysis, and, therefore, future research requires more complete data and advanced imputation methods.


Artificial Intelligence Review | 2013

Missing values: how many can they be to preserve classification reliability?

Martti Juhola; Jorma Laurikkala

Using five medical datasets, we examined the influence of missing values on true positive rates and classification accuracy. We randomly marked more and more values as missing and tested their effects on classification accuracy. The classifications were performed with nearest neighbour searching when none, 10, 20, 30% or more values were missing. We also used discriminant analysis and the naïve Bayesian method for the classification. We discovered that for a two-class dataset, despite as many as 20–30% missing values, results almost as good as with no missing values could still be produced. If there are more than two classes, over 10–20% missing values are probably too many, at least for small classes with relatively few cases. The more classes there are, and the more their sizes differ, the more sensitive a classification task is to missing values. On the other hand, when values are missing not fully at random but according to actual distributions affected by some selection or other non-random cause, classification can tolerate even high numbers of missing values for some datasets.
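The experimental setup described above, marking a growing share of values as missing and then classifying with nearest neighbour search, might be sketched as follows. How the study computed distances in the presence of missing values is not stated in the abstract; averaging squared differences over the attributes present in both cases is an assumption, as are all function names and the toy data.

```python
import random


def mask_at_random(rows, fraction, seed=0):
    """Return a copy of rows with the given fraction of cells set to None."""
    rng = random.Random(seed)
    cells = [(r, c) for r, row in enumerate(rows) for c in range(len(row))]
    chosen = rng.sample(cells, round(fraction * len(cells)))
    masked = [list(row) for row in rows]
    for r, c in chosen:
        masked[r][c] = None
    return masked


def partial_distance(a, b):
    """Root mean squared difference over attributes observed in both cases."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")  # nothing to compare: treat as maximally distant
    return (sum((x - y) ** 2 for x, y in shared) / len(shared)) ** 0.5


def nearest_neighbour_label(query, rows, labels):
    """Label of the stored case closest to the query under partial_distance."""
    best = min(range(len(rows)), key=lambda i: partial_distance(query, rows[i]))
    return labels[best]


# Hypothetical data: 4 cases x 3 attributes, with 25% of the cells masked.
data = [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5], [9.0, 8.0, 7.0], [9.5, 8.5, 7.5]]
masked = mask_at_random(data, 0.25)
missing_count = sum(value is None for row in masked for value in row)
predicted = nearest_neighbour_label([1.0, None], [[0.0, 0.0], [10.0, 10.0]], ["a", "b"])
```

Repeating the masking step with larger fractions and comparing accuracy against the unmasked data would reproduce the shape of the experiment, though not its datasets or results.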

Collaboration


Top Co-Authors

Xingan Li

University of Tampere
