Publication


Featured research published by Miro Lehtonen.


document engineering | 2002

A dynamic user interface for document assembly

Miro Lehtonen; Renaud Petit; Oskari Heinonen; Greger Lindén

Document assembly has turned out to be a convenient approach to corporate publishing and reuse of large collections of documents. Automated assembly of a document reduces the amount of human effort when creating customized documents consisting of document fragments from a collection. However, most methods used require a number of parameters to be defined prior to the assembly process, and providing these parameters in the correct format is seen to be too demanding for an average user. We have designed and implemented a graphical user interface that provides the user with a simple way to specify the parameters of the assembly process. The interface, which is dynamically generated based on a given document configuration, lets the user create and customize documents such as technical manuals. In our example assembly case, the user can select the product, the manual type, the language of the manual, as well as the optional components to be included in the manual.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2012

Report on INEX 2008

T. Beckers; Patrice Bellot; Gianluca Demartini; Ludovic Denoyer; C.M. de Vries; Antoine Doucet; Khairun Nisa Fachry; Norbert Fuhr; Patrick Gallinari; Shlomo Geva; Wei-Che Huang; Tereza Iofciu; Jaap Kamps; Gabriella Kazai; Marijn Koolen; Sangeetha Kutty; Monica Landoni; Miro Lehtonen; Véronique Moriceau; Richi Nayak; Ragnar Nordlie; Nils Pharo; Eric SanJuan; Ralf Schenkel; Xavier Tannier; Martin Theobald; James A. Thom; Andrew Trotman; A.P. de Vries

INEX investigates focused retrieval from structured documents by providing large test collections of structured documents, uniform evaluation measures, and a forum for organizations to compare their results. This paper reports on the INEX 2008 evaluation campaign, which consisted of a wide range of tracks: Ad hoc, Book, Efficiency, Entity Ranking, Interactive, QA, Link the Wiki, and XML Mining.


International Workshop of the Initiative for the Evaluation of XML Retrieval | 2006

Unsupervised Classification of Text-Centric XML Document Collections

Antoine Doucet; Miro Lehtonen

This paper addresses the problem of the unsupervised classification of text-centric XML documents. In the context of the INEX mining track 2006, we present methods to exploit the inherent structural information of XML documents in the document clustering process. Using the k-means algorithm, we experimented with a couple of feature sets and discovered that a promising direction is to use structural information as a preliminary means to detect and put aside structural outliers. The improvement in the semantic quality of the clustering is significantly higher with this approach than with a combination of the structural and textual feature sets.
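
The abstract does not say how structural outliers are detected, so the following is only a minimal sketch under stated assumptions: each document is described by its element-name counts, and a document is set aside when the cosine similarity between its counts and the collection-wide counts falls below a threshold. The function names, the similarity measure, and the threshold value are all hypothetical and are not taken from the paper. The remaining documents would then be clustered on textual features, a step not shown here.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from math import sqrt


def tag_profile(xml_text):
    """Count how often each element name occurs in one document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[tag] * b[tag] for tag in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_outliers(docs, threshold=0.5):
    """Separate documents whose structure differs from the collection as a whole.

    Assumption: 'structural outlier' means low similarity to the summed
    element-name counts of the collection. The threshold is arbitrary.
    """
    profiles = [tag_profile(doc) for doc in docs]
    centroid = Counter()
    for profile in profiles:
        centroid.update(profile)
    regular, outliers = [], []
    for doc, profile in zip(docs, profiles):
        target = regular if cosine(profile, centroid) >= threshold else outliers
        target.append(doc)
    return regular, outliers


# Three text-centric documents and one data-oriented document (made-up data).
docs = [
    "<article><title>a</title><p>x</p><p>y</p></article>",
    "<article><title>b</title><p>x</p><p>y</p></article>",
    "<article><title>c</title><p>x</p><p>y</p></article>",
    "<table><row><cell>1</cell><cell>2</cell></row></table>",
]
regular, outliers = split_outliers(docs)
```

With this toy input, the three article documents stay in `regular` and the table document is put aside in `outliers`.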


International Workshop of the Initiative for the Evaluation of XML Retrieval | 2006

XML-IR Users and Use Cases

Andrew Trotman; Nils Pharo; Miro Lehtonen

We examine the INEX ad hoc search tasks and ask whether it is possible to identify any existing commercial use of each task. For each of the tasks (thorough, focused, relevant in context, and best in context), such uses are found. Commercial uses of CO and CAS queries are also found. Finally, we present abstract use cases for each ad hoc task. Our finding is that XML-IR, or at least parallels in other semi-structured formats, is in use and has been for many years.


ACM Transactions on Information Systems | 2006

Preparing heterogeneous XML for full-text search

Miro Lehtonen

XML retrieval is facing new challenges when applied to heterogeneous XML documents, where next to nothing about the document structure can be taken for granted. We have developed solutions where some of the heterogeneity issues are addressed. Our fragment selection algorithm selectively divides a heterogeneous document collection into equi-sized fragments with full-text content. If the content is considered too data-oriented, it is not accepted. The algorithm needs no information about element names. In addition, three techniques for fragment expansion are presented, all of which yield a 13-17% average improvement in average precision. These techniques and algorithms are among the first steps in developing document-type-independent indexing methods for the full text in heterogeneous XML collections.


Focused Access to XML Documents | 2008

Phrase Detection in the Wikipedia

Miro Lehtonen; Antoine Doucet

The Wikipedia XML collection turned out to be rich in marked-up phrases as we carried out our INEX 2007 experiments. Assuming that a phrase occurs at the inline level of the markup, we were able to identify over 18 million phrase occurrences, most of which were either the anchor text of a hyperlink or a passage of text with added emphasis. As our IR system, EXTIRP, indexed the documents, the detected inline-level elements were duplicated in the markup with two direct consequences: 1) the frequency of the phrase terms increased, and 2) the word sequences changed. Because the markup was manipulated before computing word sequences for a phrase index, the actual multi-word phrases became easier to detect. The effect of duplicating the inline-level elements was tested by producing two run submissions in ways that were similar except for the duplication. According to the official INEX 2007 metric, the positive effect of duplicated phrases was clear.
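
The duplication step described in this abstract can be illustrated with a short sketch. It is not EXTIRP's implementation: the set of inline element names (`link`, `emph`), the function names, and the sample sentence are hypothetical, and the sketch only shows the two stated consequences, a higher term frequency and a changed word sequence.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical inline-level element names; the real collection's names may differ.
INLINE_TAGS = {"link", "emph"}


def duplicate_inline(root, inline_tags=INLINE_TAGS):
    """Insert a copy of each inline-level element directly after the original."""
    for parent in list(root.iter()):
        # Go through the children right to left so that inserting a copy
        # does not shift the positions still to be visited.
        for index in range(len(parent) - 1, -1, -1):
            child = parent[index]
            if child.tag in inline_tags:
                clone = copy.deepcopy(child)  # the copy keeps the original tail text
                child.tail = " "              # separate the original from its copy
                parent.insert(index + 1, clone)
    return root


def word_sequence(root):
    """Return the words of the document in reading order."""
    return "".join(root.itertext()).split()


root = ET.fromstring("<p>the <link>information retrieval</link> system</p>")
before = word_sequence(root)
after = word_sequence(duplicate_inline(root))
```

Here `before` is `['the', 'information', 'retrieval', 'system']`, while `after` is `['the', 'information', 'retrieval', 'information', 'retrieval', 'system']`: the phrase terms occur twice, and the word sequence around the phrase has changed.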


Advances in Focused Retrieval | 2009

Enhancing Keyword Search with a Keyphrase Index

Miro Lehtonen; Antoine Doucet

Combining evidence of relevance coming from two sources (a keyword index and a keyphrase index) has been a fundamental part of our INEX-related experiments on XML Retrieval over the past years. In 2008, we focused on improving the quality of the keyphrase index and finding better ways to use it together with the keyword index even when processing non-phrase queries. We also updated our implementation of the word index, which now uses a state-of-the-art scoring function for estimating the relevance of XML elements. Compared to the results from previous years, the improvements turned out to be successful in the INEX 2008 ad hoc track evaluation of the focused retrieval task.


International Workshop of the Initiative for the Evaluation of XML Retrieval | 2006

A Taxonomy for XML Retrieval Use Cases

Miro Lehtonen; Nils Pharo; Andrew Trotman

Despite the active research on XML retrieval, it is a great challenge to determine the contexts where the methods can be applied and where the proven results hold. Therefore, having a common taxonomy for the use cases of XML retrieval is useful when presenting the scope of the research. The taxonomy also helps us design more focused user studies that have an increased validity. In the current state, the taxonomy covers most common uses of applying Information Retrieval to XML documents. We are delighted to see how some of the use cases match the tasks of the INEX participants.


INEX'04 Proceedings of the Third international conference on Initiative for the Evaluation of XML Retrieval | 2004

EXTIRP 2004: towards heterogeneity

Miro Lehtonen

The effort around EXTIRP 2004 focused on the heterogeneity of XML document collections. The subcollections of the heterogeneous track (het-track) did not offer us a suitable testbed, but we successfully applied methods independent of any document type to the original INEX test collection. By closing our eyes to the element names defined in the DTD, we created comparable runs and discovered improvement in the results. This was anticipated evidence for our hypothesis that we do not need to know the element names when indexing the collection or when returning full-text answers to the Content-Only type queries. Some problematic areas were also identified. One of them is score combination, which enables us to combine elements of any size into one ranked list of results, given that we have the relevance scores of the leaf-level elements. However, finding a suitable score combination method remains part of our future work.


International Workshop of the Initiative for the Evaluation of XML Retrieval | 2006

EXTIRP: Baseline Retrieval from Wikipedia

Miro Lehtonen; Antoine Doucet

The Wikipedia XML documents are considered an interesting challenge to any XML retrieval system that is capable of indexing and retrieving XML without prior knowledge of the structure. Although the structure of the Wikipedia XML documents is highly irregular and thus unpredictable, EXTIRP manages to handle all the well-formed XML documents without problems. Whether the high flexibility of EXTIRP also implies high performance concerning the quality of IR has so far been a question without definite answers. The initial results do not confirm any positive answers, but instead, they tempt us to define some requirements for the XML documents that EXTIRP is expected to index. The most interesting question stemming from our results is about the line between high-quality XML markup which aids accurate IR and noisy “XML spam” that misleads flexible XML search engines.

Collaboration


Dive into Miro Lehtonen's collaborations.

Top Co-Authors

Antoine Doucet

University of La Rochelle

Nils Pharo

Oslo and Akershus University College of Applied Sciences

Shlomo Geva

Queensland University of Technology

Jaap Kamps

University of Amsterdam

Lili Aunimo

University of Helsinki
