Publication


Featured research published by Jarmo Toivonen.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2003

Fuzzy translation of cross-lingual spelling variants

Ari Pirkola; Jarmo Toivonen; Heikki Keskustalo; Kari Visala; Kalervo Järvelin

We present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first step, transformation rules are applied to source words to render them more similar to their target language equivalents. The rules are generated automatically using translation dictionaries as source data. In the second step, the intermediate forms obtained in the first step are translated into a target language using fuzzy matching. The effectiveness of the technique was evaluated empirically using five source languages and English as a target language. The target word list contained 189,000 English words, with the correct equivalents for the source words among them. The source words were translated using the two-step fuzzy translation technique, and the results were compared with those of plain fuzzy-matching-based translation. The two-step technique performed better, sometimes considerably better, than fuzzy matching alone.


International Journal of Accounting Information Systems | 2001

Comparing numerical data and text information from annual reports using self-organizing maps

Barbro Back; Jarmo Toivonen; Hannu Vanharanta; Ari Visa

More and more companies provide their accounting information in electronic form, in large commercial databases or on the web. This information is of great interest to different stakeholders, i.e., stockholders, creditors, auditors, financial analysts, and management. For the stakeholders it is important to be able to extract both quantitative and qualitative information concerning the companies they are interested in. Annual reports contain information in both numerical and symbolic form. So far, only the numerical information has been analyzed with the help of computers. However, technology has evolved, and in particular neural networks in the form of self-organizing maps (SOMs) provide a new tool for analyzing text information as well. In this paper, we compare results on quantitative data with results on qualitative data from annual reports. We use smart encoding, SOMs, and document histograms to compare the performance of forest companies worldwide. First, we cluster the companies according to quantitative information on the one hand and qualitative information on the other. Second, we compare the results produced by the two clusterings. The comparison shows that the results differ.


Information Processing and Management | 2005

Translating cross-lingual spelling variants using transformation rules

Jarmo Toivonen; Ari Pirkola; Heikki Keskustalo; Kari Visala; Kalervo Järvelin

Technical terms and proper names constitute a major problem in dictionary-based cross-language information retrieval (CLIR). However, technical terms and proper names in different languages often share the same Latin or Greek origin, being thus spelling variants of each other. In this paper we present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first step, transformation rules are applied to source words to render them more similar to their target language equivalents. The rules are generated automatically using translation dictionaries as source data. In the second step, the intermediate forms obtained in the first step are translated into a target language using fuzzy matching. The effectiveness of the technique was evaluated empirically using five source languages and English as a target language. The two-step technique performed better, in some cases considerably better, than fuzzy matching alone. Even using the first step as such showed promising results.
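
The two steps described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rule list, the target word list, and the choice of Dice similarity over character bigrams as the fuzzy matcher are all assumptions made for illustration, and the rules here are written by hand rather than generated from translation dictionaries.

```python
from typing import List, Set, Tuple

# Hypothetical hand-written rules (source substring -> target substring).
# In the paper the rules are generated automatically from dictionaries.
RULES: List[Tuple[str, str]] = [("k", "c"), ("tio", "tion")]

# Hypothetical target-language word list.
TARGET_WORDS: List[str] = ["construction", "constriction", "instruction"]


def apply_rules(word: str, rules: List[Tuple[str, str]]) -> str:
    """Step 1: rewrite source spellings so the word resembles the target language."""
    for source, target in rules:
        word = word.replace(source, target)
    return word


def bigrams(word: str) -> Set[str]:
    """Set of character bigrams of the word, padded with one space on each side."""
    padded = f" {word} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over bigram sets (an assumed fuzzy-matching measure)."""
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))


def fuzzy_translate(word: str, rules: List[Tuple[str, str]], targets: List[str]) -> str:
    """Step 2: pick the target word most similar to the intermediate form."""
    intermediate = apply_rules(word, rules)
    return max(targets, key=lambda t: dice(intermediate, t))


print(fuzzy_translate("konstruktio", RULES, TARGET_WORDS))  # construction
```

Applying the rules first turns the Finnish "konstruktio" into the intermediate form "construction", so the fuzzy matcher has an exact match to find; plain fuzzy matching would have to work from the unmodified source spelling.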


ACM Symposium on Applied Computing | 2006

FITE-TRT: a high quality translation technique for OOV words

Ari Pirkola; Jarmo Toivonen; Heikki Keskustalo; Kalervo Järvelin

We devised a novel statistical technique for the identification of the translation equivalents of source words obtained by transformation rule based translation (TRT). The effectiveness of the devised FITE (frequency-based identification of translation equivalents) technique was tested using biological and medical cross-lingual spelling variants and OOV words in Spanish-English and Finnish-English TRT. For Spanish-English, translation recall was 89.2%-91.0% and for Finnish-English 71.9%-72.9%. For both language pairs FITE-TRT achieved high translation precision, i.e., 97.0%-98.8%. The technique also reliably identified native source language words, i.e., source words that cannot be correctly translated by TRT. Dictionary-based CLIR augmented with FITE-TRT performed substantially better than dictionary-based CLIR where OOV keys were kept intact.


Machine Learning and Data Mining in Pattern Recognition | 2001

Validation of Text Clustering Based on Document Contents

Jarmo Toivonen; Ari Visa; Tomi Vesanen; Barbro Back; Hannu Vanharanta

In this paper some results of a new text clustering methodology are presented. A prototype is an interesting document or a part of an extracted, interesting text. The given prototype is matched with the existing document database or the monitored document flow. Our claim is that the new methodology is capable of automatic content-based clustering using the information of the document. To verify this hypothesis, an experiment was designed with the Bible. Four different translations (one Greek, one Latin, and two Finnish translations from the years 1933/38 and 1992) were selected as test text material. Validation experiments were performed with a designed prototype version of the software application.


Hawaii International Conference on System Sciences | 2000

Knowledge discovery from text documents based on paragraph maps

Ari Visa; Jarmo Toivonen; Piia Ruokonen; Hannu Vanharanta; Barbro Back

In law, physics, business, and so on, there are lots of documents, and their organisation is essential. The right way to organise the documents reveals quite a lot about their information content. It is common that text documents are characterised and classified by keywords, which the authors usually define. Nowadays there exists a tremendous number of uncharacterised text documents, due to the Internet and also to old paper-based archives. It is important that the information can be managed and the knowledge can be retrieved. It would be desirable to retrieve the information without reading the document. We propose a new technology based on multilevel hierarchies. Here we concentrate only on the highest level. The technology is based on a hierarchy of self-organizing maps (SOM) and on smart encoding of words. Our experiment with a text document (an annual report) shows that it is possible to distinguish between different types of paragraphs. It is possible to distinguish between the original paragraph and one containing the same words in random order. It is also possible to categorise the paragraphs or, for instance, to find all unauthorised citations of paragraphs within a long text document. The only requirement is that there be a considerable number of text documents for the training process. Finally, the text documents can be classified based on the trained types of paragraphs. This means that unknown documents can be categorised without reading them. This facility can be called knowledge discovery.


Journal of Management Information Systems | 2002

Contents Matching Defined by Prototypes: Methodology Verification with Books of the Bible

Ari Visa; Jarmo Toivonen; Hannu Vanharanta; Barbro Back

It is common that text documents are characterized and classified by keywords, index terms, or headings. We have developed a new methodology based on prototype matching. The prototype is an interesting document or a part of an extracted, interesting text. This prototype is matched with the existing document database or with the monitored document flow. The claim is that the new methodology is capable of extracting the contents of the document. To verify this hypothesis, a test with the Bible was designed. Different translations in English, Latin, Greek, and Finnish were selected as test materials. Verification tests that included the search of the ten nearest books to every book of the Bible were performed with a designed prototype version of the software application. The test results are reported in this paper.


Hawaii International Conference on System Sciences | 2001

Prototype matching finding meaning in the books of the Bible

Ari Visa; Jarmo Toivonen; Hannu Vanharanta; Barbro Back

It is common that text documents are characterised and classified by keywords that the authors use to name these text characteristics. Visa et al. (1999; 2000) have, however, developed a new methodology based on prototype matching. The prototype is an interesting document or a part of an extracted, interesting text. This prototype is matched with the existing document database or the monitored document flow. Our claim is that the new methodology is capable of extracting meaning automatically from the contents of the document. To verify this hypothesis, a test was designed with the Bible. Two different translations, one in English and another in Finnish, were selected as test text material. Verification tests that included the search of the ten nearest books to every book of the Bible were performed with a designed prototype version of the software application. The interesting test results are reported in this paper. The new methodology is based on a hierarchy of self-organizing maps (SOM) and on a smart encoding of words. The words of a text document are encoded. The encoded words are represented as word vectors. The word vectors are clustered by the SOM and this process creates a word map.


Proceedings of SPIE | 2001

Data mining of text as a tool in authorship attribution

Ari Visa; Jarmo Toivonen; Sami Autio; Jarno Mäkinen; Barbro Back; Hannu Vanharanta

It is common that text documents are characterized and classified by keywords that the authors give them. Visa et al. have developed a new methodology based on prototype matching. The prototype is an interesting document or a part of an extracted, interesting text. This prototype is matched with the document database or the monitored document flow. The new methodology is capable of extracting the meaning of the document to a certain degree. Our claim is that the new methodology is also capable of authenticating the authorship. To verify this claim, two tests were designed. The test hypothesis was that the words and the word order in the sentences could authenticate the author. In the first test three authors were selected: William Shakespeare, Edgar Allan Poe, and George Bernard Shaw. Three texts from each author were examined. Each text was used in turn as a prototype, and the two nearest matches with the prototype were noted. The second test uses the Reuters-21578 financial news database. A group of 25 short financial news reports from five different authors is examined. Our new methodology and the interesting results from the two tests are reported in this paper. In the first test, all cases were successful for Shakespeare and for Poe. For Shaw, one text was confused with Poe. In the second test the Reuters-21578 financial news reports were attributed to their authors relatively well. The conclusion is that our text mining methodology seems to be capable of authorship attribution.


ACM Transactions on Information Systems | 2007

Frequency-based identification of correct translation equivalents (FITE) obtained through transformation rules

Ari Pirkola; Jarmo Toivonen; Heikki Keskustalo; Kalervo Järvelin

We devised a novel statistical technique for the identification of the translation equivalents of source words obtained by transformation rule based translation (TRT). The effectiveness of the technique called frequency-based identification of translation equivalents (FITE) was tested using biological and medical cross-lingual spelling variants and out-of-vocabulary (OOV) words in Spanish-English and Finnish-English TRT. The results showed that, depending on the source language and frequency corpus, FITE-TRT (the identification of translation equivalents from TRT's translation set by means of the FITE technique) may achieve high translation recall. In the case of the Web as the frequency corpus, translation recall was 89.2%-91.0% for Spanish-English FITE-TRT. For both language pairs FITE-TRT achieved high translation precision: 95.0%-98.8%. The technique also reliably identified native source language words: source words that cannot be correctly translated by TRT. Dictionary-based CLIR augmented with FITE-TRT performed substantially better than basic dictionary-based CLIR where OOV keys were kept intact. FITE-TRT with Web document frequencies was the best technique among several fuzzy translation/matching approaches tested in cross-language retrieval experiments. We also discuss the application of FITE-TRT in the automatic construction of multilingual dictionaries.
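
The general idea of choosing among TRT candidates by corpus frequency might be sketched in Python as follows. The abstract does not state FITE's actual decision rule, so everything here is an assumption: the function name, the "most frequent candidate wins" rule, the `min_freq` threshold used to flag native words, and the frequency figures are all hypothetical.

```python
from typing import Dict, List, Optional


def pick_equivalent(
    candidates: List[str],
    doc_freq: Dict[str, int],
    min_freq: int = 1,
) -> Optional[str]:
    """Return the candidate with the highest document frequency.

    Returns None when no candidate reaches min_freq, which this sketch
    treats as a sign that the source word is a native word that TRT
    cannot translate (an assumed simplification).
    """
    best = max(candidates, key=lambda c: doc_freq.get(c, 0), default=None)
    if best is None or doc_freq.get(best, 0) < min_freq:
        return None
    return best


# Hypothetical document frequencies from some target-language corpus.
FREQ = {"construction": 1000, "constriction": 40}

print(pick_equivalent(["conztruction", "construction", "constructhion"], FREQ))
print(pick_equivalent(["zzyx", "qwrt"], FREQ))
```

In this sketch a candidate set containing one real word resolves to that word, while a set of spellings that never occur in the corpus yields `None`.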

Collaboration


Dive into Jarmo Toivonen's collaboration.

Top Co-Authors

Hannu Vanharanta

Tampere University of Technology

Ari Visa

Tampere University of Technology

Barbro Back

Åbo Akademi University

Jarno Mäkinen

Tampere University of Technology

Tomi Vesanen

Tampere University of Technology
