
Publications

Featured research published by Mari-Sanna Paukkeri.


Applied Soft Computing | 2012

Learning a taxonomy from a set of text documents

Mari-Sanna Paukkeri; Alberto Pérez García-Plaza; Víctor Fresno; Raquel Martínez Unanue; Timo Honkela

We present a methodology for learning a taxonomy from a set of text documents that each describe one concept. The taxonomy is obtained by clustering the concept definition documents with a hierarchical approach to the Self-Organizing Map. In this study, we compare three feature extraction approaches with varying degrees of language independence. The feature extraction schemes include fuzzy logic-based feature weighting and selection, statistical keyphrase extraction, and the traditional tf-idf weighting scheme. The experiments are conducted for English, Finnish, and Spanish. The results show that while the rule-based fuzzy logic systems have an advantage in automatic taxonomy learning, taxonomies can also be constructed with tolerable results using statistical methods without domain- or style-specific knowledge.
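The statistical tf-idf baseline used in the comparison can be sketched in a few lines. This is a minimal illustration of the weighting scheme itself, with invented toy documents; it is not the paper's preprocessing pipeline.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term in each tokenized document by
    tf (count / document length) times idf = log(N / df)."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["taxonomy", "learning", "documents"],
    ["clustering", "documents", "map"],
    ["fuzzy", "logic", "feature", "weighting"],
]
w = tf_idf(docs)
```

Since "documents" occurs in two of the three toy documents, its idf, and hence its weight, is lower than that of terms unique to a single document.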


international conference on neural information processing | 2011

Effect of Dimensionality Reduction on Different Distance Measures in Document Clustering

Mari-Sanna Paukkeri; Ilkka Kivimäki; Santosh Tirunagari; Erkki Oja; Timo Honkela

In document clustering, semantically similar documents are grouped together. The dimensionality of document collections is often very large, thousands or tens of thousands of terms, so it is common to reduce the original dimensionality before clustering for computational reasons. Cosine distance is widely seen as the best choice for measuring distances between documents in k-means clustering. In this paper, we experiment with three dimensionality reduction methods and a selection of distance measures, and show that after reduction to small target dimensionalities, such as 10 or below, the superiority of the cosine measure no longer holds. For small dimensionalities, the PCA dimensionality reduction method also performs better than SVD. We further show how ℓ2 normalization affects different distance measures. The experiments are run for three document sets in English and one in Hindi.
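The interaction between ℓ2 normalization and the distance measures can be made concrete: for unit vectors, the squared Euclidean distance equals 2 * (1 - cosine similarity), so after normalization the two measures rank document pairs identically. A small sketch with toy vectors (not the paper's data):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = l2_normalize([3.0, 1.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b)).
lhs = euclidean_distance(a, b) ** 2
rhs = 2.0 * cosine_distance(a, b)
```

Without the normalization step this identity fails, which is one reason the choice of distance measure starts to matter.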


international multiconference on computer science and information technology | 2010

Learning taxonomic relations from a set of text documents

Mari-Sanna Paukkeri; Alberto Pérez García-Plaza; Sini Pessala; Timo Honkela

This paper presents a methodology for learning taxonomic relations from a set of documents that each explain one of the concepts. Three feature extraction approaches with varying degrees of language independence are compared in this study. The first feature extraction scheme is a language-independent approach based on statistical keyphrase extraction, and the second is based on a combination of rule-based stemming and fuzzy logic-based feature weighting and selection. The third approach is the traditional tf-idf weighting scheme with commonly used rule-based stemming. The concept hierarchy is obtained by combining Self-Organizing Map clustering with agglomerative hierarchical clustering. Experiments are conducted for both English and Finnish. The results show that concept hierarchies can also be constructed automatically using statistical methods, without heavy language-specific preprocessing.


Information Processing and Management | 2013

Assessing user-specific difficulty of documents

Mari-Sanna Paukkeri; Marja Ollikainen; Timo Honkela

On the web, a huge variety of text collections contain knowledge in different expertise domains, such as technology or medicine. The texts are written for different uses and thus for people with different levels of expertise in the domain. Texts intended for professionals may not be understandable at all by a lay person, and texts for lay people may not contain all the detailed information a professional needs. Many information retrieval applications, such as search engines, would offer a better user experience if they were able to select the text sources that best fit the expertise level of the user. In this article, we propose a novel approach for assessing the difficulty level of a document: our method assesses difficulty for each user separately. The method enables, for instance, offering information in a personalised manner based on the user's knowledge of different domains. It is based on comparing the terms appearing in a document with the terms known by the user. We present two ways to collect information about the terminology the user knows: directly, by asking users how difficult they find individual terms, or, as a novel automatic approach, indirectly, by analysing texts written by the users. We examine the applicability of the methodology with text documents in the medical domain. The results show that the method is able to distinguish between documents written for lay people and documents written for experts.
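At its core, the term-comparison idea reduces to a set operation: score a document by the fraction of its terms the user does not know. The vocabulary and example terms below are invented, and this is a minimal sketch of the idea, not the paper's exact scoring.

```python
def difficulty(document_terms, known_terms):
    """Fraction of a document's distinct terms the user does not know:
    0.0 means every term is familiar, 1.0 means none are."""
    terms = set(document_terms)
    unknown = terms - set(known_terms)
    return len(unknown) / len(terms)

# Hypothetical medical-domain example: the same content written
# for lay readers and for experts, scored against a lay vocabulary.
lay_doc = ["headache", "fever", "rest", "fluids"]
expert_doc = ["cephalalgia", "pyrexia", "analgesic", "fluids"]
lay_vocab = {"headache", "fever", "rest", "fluids", "cough"}

easy = difficulty(lay_doc, lay_vocab)      # every term known
hard = difficulty(expert_doc, lay_vocab)   # 3 of 4 terms unknown
```

The user's `known_terms` set could be filled either from direct term-difficulty questions or, automatically, from the terms appearing in texts the user has written.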


international conference on computational linguistics | 2012

Exploring extensive linguistic feature sets in near-synonym lexical choice

Mari-Sanna Paukkeri; Jaakko J. Väyrynen; Antti Arppe

In the near-synonym lexical choice task, the best alternative from a set of near-synonyms is selected to fill a lexical gap in a text. We experiment with an extensive set of over 650 linguistic features for representing the context of a word, and with a range of machine learning approaches to the lexical choice task. We extend previous work by experimenting with unsupervised and semi-supervised methods, and use automatic feature selection to cope with the problems arising from the rich feature set. It is natural to expect that linguistic analysis of the word context would yield almost perfect performance in the task, but we show that too many features, even linguistic ones, introduce noise and make the task difficult for unsupervised and semi-supervised methods. We also show that purely syntactic features play the biggest role in the performance, but certain semantic and morphological features are needed as well.
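A toy version of the supervised setting: represent each training context by its linguistic features, then pick the near-synonym whose training contexts best match the features around a new gap. The feature names and the additive scoring below are illustrative assumptions, not the paper's feature set or classifiers.

```python
from collections import defaultdict

def train_counts(labeled_contexts):
    """Count how often each context feature co-occurs with each
    near-synonym in the training data."""
    counts = defaultdict(lambda: defaultdict(int))
    for synonym, features in labeled_contexts:
        for f in features:
            counts[synonym][f] += 1
    return counts

def choose(counts, features):
    """Pick the near-synonym whose training contexts best match
    the gap's context, by a simple additive co-occurrence score."""
    def score(syn):
        return sum(counts[syn][f] for f in features)
    return max(counts, key=score)

# Hypothetical contexts for the near-synonyms "error" / "mistake".
train = [
    ("error", ["subject:program", "verb:raise"]),
    ("error", ["subject:compiler", "verb:report"]),
    ("mistake", ["subject:person", "verb:make"]),
]
model = train_counts(train)
pick = choose(model, ["subject:compiler", "verb:report"])
```

With hundreds of such features, many contribute only noise, which is where the automatic feature selection discussed above comes in.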


Natural Language Engineering | 2012

Evaluating vector space models with canonical correlation analysis

Sami Virpioja; Mari-Sanna Paukkeri; Abhishek Tripathi; Tiina Lindh-Knuutila; Krista Lagus

Vector space models are used in language processing applications for calculating semantic similarities of words or documents. The vector spaces are generated with feature extraction methods for text data. However, evaluating the feature extraction methods may be difficult. Indirect evaluation in an application is often time-consuming and the results may not generalize to other applications, whereas direct evaluations that measure the amount of captured semantic information usually require human evaluators or annotated data sets. We propose a novel direct evaluation method based on canonical correlation analysis (CCA), a classical method for finding linear relationships between two data sets. In our setting, the two sets are parallel text documents in two languages. A good feature extraction method should provide representations that reflect the semantic content of the documents. Assuming that the underlying semantic content is independent of the language, we can identify the feature extraction methods that capture the content best by measuring the dependence between the representations of a document and its translation. In the case of CCA, the applied measure of dependence is correlation. The evaluation method is based on unsupervised learning; it is language- and domain-independent, and it requires no additional resources besides a parallel corpus. In this paper, we demonstrate the evaluation method on a sentence-aligned parallel corpus. The method is validated by showing that the results obtained with bag-of-words representations are intuitive and agree well with previous findings.
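The CCA machinery itself fits in a short function: whiten each view and take the singular values of the whitened cross-covariance. The synthetic two-view data below stands in for a document representation and its translation's representation; this is a numerical sketch, not the paper's implementation.

```python
import numpy as np

def cca_correlations(X, Y, k):
    """First k canonical correlations between two row-aligned data
    sets, via whitening and an SVD of the cross-covariance."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    eps = 1e-8  # small ridge keeps the covariances invertible
    Cx = X.T @ X / len(X) + eps * np.eye(X.shape[1])
    Cy = Y.T @ Y / len(Y) + eps * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    # Cholesky-based whiteners: Wx @ Cx @ Wx.T = I (and likewise for Y).
    Wx = np.linalg.inv(np.linalg.cholesky(Cx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cy))
    return np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)[:k]

# Two "views" that share all their content: Y is a linear map of X.
# Correlations near 1 indicate the two representations capture the
# same underlying (language-independent) content.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
Y = X @ rng.standard_normal((3, 2))
corr = cca_correlations(X, Y, 2)
```

In the evaluation setting, higher canonical correlations between the two languages' representations indicate a feature extraction method that better captures the shared semantic content.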


international conference on computational linguistics | 2008

A Language-Independent Approach to Keyphrase Extraction and Evaluation

Mari-Sanna Paukkeri; Ilari T. Nieminen; Matti Pöllä; Timo Honkela


meeting of the association for computational linguistics | 2010

Likey: Unsupervised Language-Independent Keyphrase Extraction

Mari-Sanna Paukkeri; Timo Honkela


Archive | 2009

Framework for Analyzing and Clustering Short Message Database of Ideas

Mari-Sanna Paukkeri; Tanja Kotro


Archive | 2008

Analyzing Authors and Articles Using Keyword Extraction, Self-Organizing Map and Graph Algorithms

Tommi Vatanen; Mari-Sanna Paukkeri; Ilari T. Nieminen; Timo Honkela

Collaboration

Top co-authors of Mari-Sanna Paukkeri.

Matti Pöllä
Helsinki University of Technology

Alberto Pérez García-Plaza
National University of Distance Education

Antti Arppe
University of Helsinki