Publication


Featured research published by Tuomas Talvensaari.


ACM Transactions on Information Systems | 2007

Creating and exploiting a comparable corpus in cross-language information retrieval

Tuomas Talvensaari; Jorma Laurikkala; Kalervo Järvelin; Martti Juhola; Heikki Keskustalo

We present a method for creating a comparable text corpus from two document collections in different languages. The collections can be very different in origin. In this study, we build a comparable corpus from articles by a Swedish news agency and a U.S. newspaper. The keys with the best resolution power were extracted from the documents of one collection, the source collection, by using the relative average term frequency (RATF) value. The keys were translated into the language of the other collection, the target collection, with a dictionary-based query translation program. The translated queries were run against the target collection, and an alignment pair was made if the retrieved documents matched given date and similarity score criteria. The resulting comparable collection was used as a similarity thesaurus to translate queries along with a dictionary-based translator. The combined approach outperformed translation schemes where dictionary-based translation or corpus translation was used alone.
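The alignment step in the abstract above (keep a retrieved document only if it meets date and similarity score criteria) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the `Hit` record, the `align` function, and the threshold values `max_days` and `min_score` do not come from the paper, which does not state its criteria in this abstract.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hit:
    """A target-collection document retrieved by a translated query (hypothetical record)."""
    doc_id: str
    published: date
    score: float  # retrieval similarity score, assumed to be higher-is-better


def align(source_date, hits, max_days=1, min_score=0.5):
    """Return ids of retrieved documents that satisfy both criteria.

    max_days and min_score are illustrative thresholds, not values from the paper.
    """
    return [
        h.doc_id
        for h in hits
        if abs((h.published - source_date).days) <= max_days and h.score >= min_score
    ]


hits = [
    Hit("a", date(1994, 1, 2), 0.8),   # close in time, high score
    Hit("b", date(1994, 1, 10), 0.9),  # high score, but too far in time
    Hit("c", date(1994, 1, 1), 0.2),   # same day, but low score
]
aligned = align(date(1994, 1, 1), hits)
```

Under these assumed thresholds only document `a` is paired with the source document, because it is the only hit that passes both the date window and the score cutoff.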


Information Retrieval | 2008

Focused web crawling in the acquisition of comparable corpora

Tuomas Talvensaari; Ari Pirkola; Kalervo Järvelin; Martti Juhola; Jorma Laurikkala

Cross-Language Information Retrieval (CLIR) resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora in the genomics domain. The acquired corpora were used to statistically translate domain-specific words. The same words were also translated using a high-quality, but non-genomics-related parallel corpus, which fared considerably worse. We also evaluated our system with standard information retrieval (IR) experiments, combining statistical translation using the Web corpora with dictionary-based translation. The results showed improvement over pure dictionary-based translation. Therefore, mining the Web for comparable corpora seems promising.


European Conference on Information Retrieval | 2008

Effects of aligned corpus quality and size in corpus-based CLIR

Tuomas Talvensaari

Aligned corpora are often-used resources in CLIR systems. The three qualities of translation corpora that most dramatically affect the performance of a corpus-based CLIR system are: (1) topical nearness to the translated queries, (2) the quality of the alignments, and (3) the size of the corpus. In this paper, the effects of these factors are studied and evaluated. Topics of two different domains (news and genomics) are translated with corpora of varying alignment quality, ranging from a clean parallel corpus to noisier comparable corpora. Also, the sizes of the corpora are varied. The results show that of the three qualities, topical nearness is the most crucial factor, outweighing the other two. This indicates that noisy comparable corpora should be used as complementary resources when parallel corpora are not available for the domain in question.


ACM Symposium on Applied Computing | 2010

Addressing the limited scope problem of focused crawling using a result merging approach

Ari Pirkola; Tuomas Talvensaari

Focused crawling refers to a process of fetching domain-specific pages from the Web. It is an important method to build domain-specific document collections, but it suffers from low recall due to the local nature of crawling algorithms associated with the Web's community structure. In this study, we address the problem of limited crawling scope of focused crawling using a result merging approach. The results of crawling processes based on different start URL sets and focused crawling methods were merged. We found that merging considerably improves the effectiveness of focused crawling. The results reported here are based on 10 test topics and 140 crawls in the domains of genomics and genetics.
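As a rough illustration of result merging, the sketch below combines the outputs of several crawls into one ranked list. The abstract does not say how the merging was done, so the rule used here (keep each URL once, with its highest score across crawls) is an assumption, as are the function name `merge_crawls` and the representation of a crawl as a URL-to-score mapping.

```python
def merge_crawls(crawls):
    """Merge several crawl results into one ranked list of URLs.

    Each crawl is assumed to be a dict mapping URL -> relevance score.
    A URL found by more than one crawl keeps its highest score
    (an illustrative merging rule, not the one from the paper).
    """
    merged = {}
    for crawl in crawls:
        for url, score in crawl.items():
            if url not in merged or score > merged[url]:
                merged[url] = score
    # Rank by descending score; break ties alphabetically for a stable order.
    return sorted(merged, key=lambda u: (-merged[u], u))


crawl_a = {"u1": 0.9, "u2": 0.4}  # e.g. one start URL set
crawl_b = {"u2": 0.7, "u3": 0.5}  # e.g. a different start URL set or crawling method
ranking = merge_crawls([crawl_a, crawl_b])
```

The merged list covers pages that no single crawl reached on its own, which is the intuition behind using merging to widen the scope of a focused crawl.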


Journal of Documentation | 2006

A study on automatic creation of a comparable document collection in cross‐language information retrieval

Tuomas Talvensaari; Jorma Laurikkala; Kalervo Järvelin; Martti Juhola

Purpose – To present a method for creating a comparable document collection from two document collections in different languages.

Design/methodology/approach – The best query keys were extracted from a Finnish source collection (articles of the newspaper Aamulehti) with the relative average term frequency formula. The keys were translated into English with a dictionary‐based query translation program. The resulting lists of words were used as queries that were run against the target collection (Los Angeles Times articles) with the nearest neighbor method. The documents were aligned with unrestricted and date‐restricted alignment schemes, which were also combined.

Findings – The combined alignment scheme was found the best, when the relatedness of the document pairs was assessed with a five‐degree relevance scale. Of the 400 document pairs, roughly 40 percent were highly or fairly related and 75 percent included at least lexical similarity.

Research limitations/implications – The number of alignment pairs was...


Analytics for Noisy Unstructured Text Data | 2008

Data driven methods for improving mono- and cross-lingual IR performance in noisy environments

Antti Järvelin; Tuomas Talvensaari; Anni Järvelin

In cross-language information retrieval (CLIR), novel or non-standard expressions, technical terminology, or rare proper nouns can be seen as noise when they appear in queries or in the target collection. This kind of vocabulary is often out-of-vocabulary (OOV) for dictionaries that are used to translate queries. In historic document retrieval (HDR), OCR errors and historical spelling variants cause similar problems. In this paper, three data-driven approaches to these problems are presented. The first two methods, the transformation rule based translation (TRT) method and the classified s-gram method, operate at the string level. With them, approximate matches of a query word can be recognized from the target document collection and included in the target query. In the third method, the corpus-based approach, parallel or comparable corpora are employed to derive translation knowledge that can be used to translate OOV words. Besides the overview of the methods, three case studies highlighting their practical applications in CLIR are also presented. The methods are shown to be effective in query translation without dictionaries between closely related languages (TRT and s-grams), OOV word translation (s-grams), and boosting dictionary-based CLIR performance by way of OOV word translation (corpus-based methods).
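The string-level matching idea can be sketched with character skip-grams. This is a simplified illustration and not the classified s-gram method as published: it forms character pairs that skip a given number of characters, tags each pair with the class its skip length belongs to, and scores two words by Jaccard overlap. The skip classes `((0,), (1, 2))`, the Jaccard measure, and all function names are assumptions made for this example.

```python
def s_grams(word, skip_classes):
    """Return character pairs of `word`, tagged with the index of their skip class.

    A pair with skip length k joins the characters at positions i and i + k + 1,
    so k = 0 yields ordinary adjacent bigrams.
    """
    grams = set()
    for class_index, skips in enumerate(skip_classes):
        for k in skips:
            for i in range(len(word) - k - 1):
                grams.add((class_index, word[i], word[i + k + 1]))
    return grams


def similarity(a, b, skip_classes=((0,), (1, 2))):
    """Jaccard overlap of the tagged pair sets (an assumed similarity measure)."""
    ga, gb = s_grams(a, skip_classes), s_grams(b, skip_classes)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def best_matches(query_word, index_words, top_n=3):
    """Rank index words by similarity to the query word, most similar first."""
    ranked = sorted(index_words, key=lambda w: (-similarity(query_word, w), w))
    return ranked[:top_n]


candidates = best_matches("colour", ["color", "cat"], top_n=1)
```

Because pairs with skipped characters tolerate an inserted or deleted letter, a spelling variant such as `color` still shares many pairs with `colour`, whereas an unrelated word such as `cat` shares none.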


European Conference on Research and Advanced Technology for Digital Libraries | 2010

A topic-specific web search system focusing on quality pages

Ari Pirkola; Tuomas Talvensaari

We describe a topic-specific Web search system focused on quality pages and argue that there is a need for such quality-based topic-specific search tools. The first implementation of the search system is available on the Web and it deals with climate change. The key idea is to crawl (using a focused crawling technique) in known trusted sites and in sites that are connected to them. We also discuss the further development of the system and our future research. Our project plan involves building a larger quality-based Web search system dealing with many globally significant topics (in addition to climate change).


Journal of the Association for Information Science and Technology | 2007

Corpus-based cross-language information retrieval in retrieval of highly relevant documents

Tuomas Talvensaari; Martti Juhola; Jorma Laurikkala; Kalervo Järvelin


International Conference on Web Information Systems and Technologies | 2009

Effects of Crawling Strategies on the Performance of Focused Web Crawling

Ari Pirkola; Tuomas Talvensaari
